How to use for "other" data with replicates

Hi! This package sounds great, but I have a problem understanding how to properly use it.

**What is your input?**
A pandas DataFrame with two categorical variables, "var1" and "var2", and an associated label. It contains repeated measurements and can be thought of as analogous to a protein-ligand affinity dataset.
[interactions_with_reps.csv](https://github.com/user-attachments/files/22656482/interactions_with_reps.csv)
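
To make the replicate structure concrete, here is a minimal sketch of how repeated measurements collapse into unique (var1, var2) pairs with a replicate count. The rows are invented toy values, not taken from the attached file, and `collapse_replicates` is a hypothetical helper name of my own.

```python
from collections import Counter

# Toy rows of (var1, var2, label); invented for illustration only.
rows = [
    ("1", "10", 0.5),
    ("1", "10", 0.7),
    ("1", "11", 0.2),
    ("2", "10", 0.9),
    ("2", "12", 0.4),
    ("2", "12", 0.6),
    ("2", "12", 0.5),
]


def collapse_replicates(rows):
    """Count how many rows share each (var1, var2) pair."""
    return Counter((v1, v2) for v1, v2, _ in rows)


pair_counts = collapse_replicates(rows)
for pair, n in sorted(pair_counts.items()):
    print(pair, n)
```

Pairs with a higher count carry more rows, so losing them in a split costs more data than losing a pair measured once.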

**Question**
I simply want to use datasail to split this dataset for k-fold cross-validation (k=5 folds) using a 2D split that retains as much data as possible. Without experimental replicates this would be simple, but since some experiments are represented more often than others, it becomes an optimization problem that I suppose datasail could help me solve. I do not need any clustering or similarity; I only want an identity split (which I understand to be the I2 split).
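
To spell out what I mean by a 2D split with replicates, here is a small sketch. It reflects only my own understanding of the goal and is not datasail's actual formulation: every var1 value and every var2 value is assigned to one fold, a row is kept only if its var1 and var2 land in the same fold, and replicates count once per row. The data, the assignments, and the name `rows_per_fold` are all hypothetical.

```python
from collections import Counter

# Toy replicate counts per (var1, var2) pair; invented for illustration.
pair_counts = Counter({("a", "x"): 3, ("a", "y"): 1, ("b", "y"): 2, ("b", "x"): 1})

# Hypothetical assignment of each var1 value and each var2 value to a fold.
fold_of_var1 = {"a": 0, "b": 1}
fold_of_var2 = {"x": 0, "y": 1}


def rows_per_fold(pair_counts, fold_of_var1, fold_of_var2):
    """Return (rows kept per fold, rows dropped) for a given 2D assignment."""
    kept = Counter()
    dropped = 0
    for (v1, v2), n in pair_counts.items():
        if fold_of_var1[v1] == fold_of_var2[v2]:
            # Both dimensions agree on the fold, so all replicates are kept.
            kept[fold_of_var1[v1]] += n
        else:
            # The pair straddles two folds and must be discarded.
            dropped += n
    return kept, dropped


kept, dropped = rows_per_fold(pair_counts, fold_of_var1, fold_of_var2)
print(dict(kept), dropped)
```

The optimization I have in mind is choosing the two assignments so that `dropped` is as small as possible while the folds stay roughly the requested size.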

Although it seems quite simple, I don't understand how to do this from the documentation or the provided PDB example.

The closest I've gotten is the code below, although I don't understand why I would format the data/input/arguments this way. So I wonder:
1. Is this correct?
2. If so, why do I get the assertion error? Doesn't this imply that the data has not been separated along both dimensions?

```
import os
from datasail.sail import datasail
import pandas as pd
import numpy as np

df = pd.read_csv('/content/interactions_with_reps.csv')
df['var1'] = [str(int(x)) for x in df['var1']]
df['var2'] = [str(int(x)) for x in df['var2']]
df.reset_index(drop=True, inplace=True)
# Create id column (not sure why?)
df['ID'] = [str(x) for x in range(len(df))]

# Take first 1000 rows as example
df = df.head(1000)
id2var1 = dict(df[["ID", "var1"]].values.tolist())
id2var2 = dict(df[["ID", "var2"]].values.tolist())
inter = [(x[0], x[0]) for x in df[["ID"]].values.tolist()] # Not sure why we need this as it doesn't really describe the actual interactions...

_, _, inter_splits = datasail(
    techniques=["I2"],
    splits=[1/3]*3, # Split into 3 folds each making up 0.33 (desired)
    names=[f'fold_{i}' for i in range(3)], # Name each fold
    runs=1,
    solver="SCIP",
    inter=inter, 
    e_type="O",
    e_data=id2var1,
    f_type="O",
    f_data=id2var2,
)

# Split dataframe
split_dict = inter_splits['I2'][0]
folds = []
fold_names = list(set(split_dict.values())-{'not selected'})
for fold in fold_names:
  folds.append(df[df['ID'].isin([k[0] for k, v in split_dict.items() if v == fold])])

# Verify there is no overlap
assert len(set(folds[0]['var1']).intersection(set(folds[1]['var1']))) == 0
assert len(set(folds[0]['var2']).intersection(set(folds[1]['var2']))) == 0
```

**Output**
```
===============================================================================
                                     CVXPY                                     
                                     v1.5.3                                    
===============================================================================
(CVXPY) Oct 02 11:30:16 AM: Your problem has 6000 variables, 2006 constraints, and 0 parameters.
(CVXPY) Oct 02 11:30:16 AM: It is compliant with the following grammars: DCP, DQCP
(CVXPY) Oct 02 11:30:16 AM: (If you need to solve this problem multiple times, but with different data, consider using parameters.)
(CVXPY) Oct 02 11:30:16 AM: CVXPY will first compile your problem; then, it will invoke a numerical solver to obtain a solution.
(CVXPY) Oct 02 11:30:16 AM: Your problem is compiled with the CPP canonicalization backend.
-------------------------------------------------------------------------------
                                  Compilation                                  
-------------------------------------------------------------------------------
(CVXPY) Oct 02 11:30:16 AM: Compiling problem (target solver=SCIP).
(CVXPY) Oct 02 11:30:16 AM: Reduction chain: Dcp2Cone -> CvxAttr2Constr -> ConeMatrixStuffing -> SCIP
(CVXPY) Oct 02 11:30:16 AM: Applying reduction Dcp2Cone
(CVXPY) Oct 02 11:30:16 AM: Applying reduction CvxAttr2Constr
(CVXPY) Oct 02 11:30:16 AM: Applying reduction ConeMatrixStuffing
(CVXPY) Oct 02 11:30:16 AM: Applying reduction SCIP
(CVXPY) Oct 02 11:30:16 AM: Finished problem compilation (took 3.474e-02 seconds).
-------------------------------------------------------------------------------
                                Numerical solver                               
-------------------------------------------------------------------------------
(CVXPY) Oct 02 11:30:16 AM: Invoking solver SCIP  to obtain a solution.
-------------------------------------------------------------------------------
                                    Summary                                    
-------------------------------------------------------------------------------
(CVXPY) Oct 02 11:31:04 AM: Problem status: optimal
(CVXPY) Oct 02 11:31:04 AM: Optimal value: 1.000e+00
(CVXPY) Oct 02 11:31:04 AM: Compilation took 3.474e-02 seconds
(CVXPY) Oct 02 11:31:04 AM: Solver (including time spent in interface) took 4.776e+01 seconds
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
[/tmp/ipython-input-2906173274.py](https://localhost:8080/#) in <cell line: 0>()
     37 
     38 # Verify there is no overlap
---> 39 assert len(set(folds[0]['var1']).intersection(set(folds[1]['var1']))) == 0
     40 assert len(set(folds[0]['var2']).intersection(set(folds[1]['var2']))) == 0

AssertionError:
```
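
The assertion in my snippet compares only the first two folds and does not say which values collide. A slightly more informative check could look like the sketch below. It works on plain lists of (var1, var2) tuples, the toy folds are invented, and `shared_values` is a hypothetical helper; with the DataFrames from my code, each fold could be passed as `list(zip(fold['var1'], fold['var2']))`.

```python
from itertools import combinations


def shared_values(folds):
    """Map (fold_a, fold_b, column) to the values both folds contain."""
    report = {}
    for (name_a, rows_a), (name_b, rows_b) in combinations(folds.items(), 2):
        for col, label in enumerate(("var1", "var2")):
            common = {r[col] for r in rows_a} & {r[col] for r in rows_b}
            if common:
                report[(name_a, name_b, label)] = common
    return report


# Toy folds; invented for illustration only.
toy_folds = {
    "fold_0": [("1", "10"), ("1", "11")],
    "fold_1": [("2", "12"), ("1", "13")],
    "fold_2": [("3", "14")],
}

report = shared_values(toy_folds)
for key, values in report.items():
    print(key, sorted(values))
```

An empty report would mean that no var1 or var2 value appears in more than one fold.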


Any help will be appreciated!
