How to use for "other" data with replicates #31

@StyrbjornKall

Hi! This package sounds great, but I'm having trouble understanding how to use it properly.

What is your input?
A pandas DataFrame with two categorical variables, "var1" and "var2", and an associated label. It contains repeated measurements (replicates) and can be thought of as analogous to a protein-ligand affinity dataset.
interactions_with_reps.csv

Question
I simply want to use DataSAIL to split this dataset for k-fold cross-validation (k=5 folds) using a 2D split that retains as much data as possible. Without experimental replicates this would be simple, but since some experiments are represented more often than others, it becomes an optimization problem that I suppose DataSAIL could help me solve. I do not need any clustering or similarity measure; I only want an identity split (which I understand to be the I2 split).
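To make the optimization problem concrete, here is a minimal sketch of what a 2D identity split implies, using made-up toy data and a hand-picked fold assignment (neither comes from the attached CSV, and `retained_rows` is a hypothetical helper, not part of any library). Each var1 value and each var2 value is assigned to exactly one fold, and a row survives only when both of its values land in the same fold. Because replicates repeat the same pair, which pairs end up together decides how many rows are kept.

```python
from collections import Counter


def retained_rows(rows, fold_of_var1, fold_of_var2):
    """Return (var1, var2, fold) for rows whose two values share a fold."""
    kept = []
    for var1, var2 in rows:
        if fold_of_var1[var1] == fold_of_var2[var2]:
            kept.append((var1, var2, fold_of_var1[var1]))
    return kept


# Toy data (made up): the pair ("a", "x") was measured three times.
rows = [("a", "x"), ("a", "x"), ("a", "x"), ("a", "y"), ("b", "y"), ("b", "x")]

# Two hypothetical ways of assigning values to two folds.
fold_of_var1 = {"a": "fold_0", "b": "fold_1"}
good_var2 = {"x": "fold_0", "y": "fold_1"}  # keeps the replicated pair
poor_var2 = {"x": "fold_1", "y": "fold_0"}  # drops the replicated pair

kept_good = retained_rows(rows, fold_of_var1, good_var2)
kept_poor = retained_rows(rows, fold_of_var1, poor_var2)
fold_sizes = Counter(fold for _, _, fold in kept_good)

print(len(kept_good), "rows kept with the first assignment:", dict(fold_sizes))
print(len(kept_poor), "rows kept with the second assignment")
```

In this toy case the two assignments are equally valid as 2D splits, yet they keep a different number of rows, which is the trade-off the solver would be asked to optimize.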

Although it seems quite simple, I don't understand how to do this from the documentation or the provided pbd example.

The closest I've gotten is the code below, although I don't understand why the data, inputs, and arguments need to be formatted this way. So I wonder:

  1. Is this correct?
  2. If so, why do I get the assertion error? Does it imply that the data has not been separated along both dimensions?
import os
from datasail.sail import datasail
import pandas as pd
import numpy as np

df = pd.read_csv('/content/interactions_with_reps.csv')
df['var1'] = [str(int(x)) for x in df['var1']]
df['var2'] = [str(int(x)) for x in df['var2']]
df.reset_index(drop=True, inplace=True)
# Create id column (not sure why?)
df['ID'] = [str(x) for x in range(len(df))]

# Take first 1000 rows as example
df = df.head(1000)
id2var1 = dict(df[["ID", "var1"]].values.tolist())
id2var2 = dict(df[["ID", "var2"]].values.tolist())
inter = [(x[0], x[0]) for x in df[["ID"]].values.tolist()] # Not sure why we need this as it doesn't really describe the actual interactions...

_, _, inter_splits = datasail(
    techniques=["I2"],
    splits=[1/3]*3, # Split into 3 folds each making up 0.33 (desired)
    names=[f'fold_{i}' for i in range(3)], # Name each fold
    runs=1,
    solver="SCIP",
    inter=inter, 
    e_type="O",
    e_data=id2var1,
    f_type="O",
    f_data=id2var2,
)

# Split dataframe
split_dict = inter_splits['I2'][0]
folds = []
fold_names = list(set(split_dict.values())-{'not selected'})
for fold in fold_names:
  folds.append(df[df['ID'].isin([k[0] for k, v in split_dict.items() if v == fold])])

# Verify there is no overlap
assert len(set(folds[0]['var1']).intersection(set(folds[1]['var1']))) == 0
assert len(set(folds[0]['var2']).intersection(set(folds[1]['var2']))) == 0
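
The two assertions above only compare the first two folds. A minimal sketch of a check that covers every pair of folds and reports which values are shared, instead of stopping at the first failure, could look like this. `overlapping_values` is a hypothetical helper written for illustration, and the example dictionary is made up.

```python
from itertools import combinations


def overlapping_values(fold_values):
    """Map each pair of fold names to the set of values they share."""
    overlaps = {}
    for (name_a, values_a), (name_b, values_b) in combinations(fold_values.items(), 2):
        shared = set(values_a) & set(values_b)
        if shared:  # record only pairs that actually overlap
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Toy example (made up): value "2" appears in both fold_0 and fold_2.
example = {"fold_0": ["1", "2"], "fold_1": ["3"], "fold_2": ["2", "4"]}
report = overlapping_values(example)
print(report)
```

With the variables from the script above, the same helper could be called once per dimension, for instance on `{name: fold['var1'] for name, fold in zip(fold_names, folds)}`, and an empty result would mean that no var1 value is shared between any two folds.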

Output

===============================================================================
                                     CVXPY                                     
                                     v1.5.3                                    
===============================================================================
(CVXPY) Oct 02 11:30:16 AM: Your problem has 6000 variables, 2006 constraints, and 0 parameters.
(CVXPY) Oct 02 11:30:16 AM: It is compliant with the following grammars: DCP, DQCP
(CVXPY) Oct 02 11:30:16 AM: (If you need to solve this problem multiple times, but with different data, consider using parameters.)
(CVXPY) Oct 02 11:30:16 AM: CVXPY will first compile your problem; then, it will invoke a numerical solver to obtain a solution.
(CVXPY) Oct 02 11:30:16 AM: Your problem is compiled with the CPP canonicalization backend.
-------------------------------------------------------------------------------
                                  Compilation                                  
-------------------------------------------------------------------------------
(CVXPY) Oct 02 11:30:16 AM: Compiling problem (target solver=SCIP).
(CVXPY) Oct 02 11:30:16 AM: Reduction chain: Dcp2Cone -> CvxAttr2Constr -> ConeMatrixStuffing -> SCIP
(CVXPY) Oct 02 11:30:16 AM: Applying reduction Dcp2Cone
(CVXPY) Oct 02 11:30:16 AM: Applying reduction CvxAttr2Constr
(CVXPY) Oct 02 11:30:16 AM: Applying reduction ConeMatrixStuffing
(CVXPY) Oct 02 11:30:16 AM: Applying reduction SCIP
(CVXPY) Oct 02 11:30:16 AM: Finished problem compilation (took 3.474e-02 seconds).
-------------------------------------------------------------------------------
                                Numerical solver                               
-------------------------------------------------------------------------------
(CVXPY) Oct 02 11:30:16 AM: Invoking solver SCIP  to obtain a solution.
-------------------------------------------------------------------------------
                                    Summary                                    
-------------------------------------------------------------------------------
(CVXPY) Oct 02 11:31:04 AM: Problem status: optimal
(CVXPY) Oct 02 11:31:04 AM: Optimal value: 1.000e+00
(CVXPY) Oct 02 11:31:04 AM: Compilation took 3.474e-02 seconds
(CVXPY) Oct 02 11:31:04 AM: Solver (including time spent in interface) took 4.776e+01 seconds
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/tmp/ipython-input-2906173274.py in <cell line: 0>()
     37 
     38 # Verify there is no overlap
---> 39 assert len(set(folds[0]['var1']).intersection(set(folds[1]['var1']))) == 0
     40 assert len(set(folds[0]['var2']).intersection(set(folds[1]['var2']))) == 0

AssertionError:
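
One thing that can be checked directly from the script, independent of the solver: `e_data` and `f_data` are keyed by the row `ID`, so every row is handed over as its own entity, and two rows sharing a var1 value carry different names. The sketch below, on made-up toy rows, compares that row-keyed view with a value-keyed one. Whether DataSAIL expects the value-keyed form, and whether it accepts repeated pairs in `inter`, is an assumption that this sketch does not settle; `build_value_keyed_inputs` is a hypothetical helper.

```python
def build_value_keyed_inputs(rows):
    """Build entity names and interaction pairs keyed by value, not by row."""
    e_names = sorted({var1 for var1, _ in rows})
    f_names = sorted({var2 for _, var2 in rows})
    # Replicates stay as repeated (var1, var2) pairs.
    inter = [(var1, var2) for var1, var2 in rows]
    return e_names, f_names, inter


# Toy rows (made up): the pair ("1", "7") is a replicate.
rows = [("1", "7"), ("1", "7"), ("1", "8"), ("2", "8")]

# Row-keyed view, mirroring id2var1 in the script: one entity per row.
row_keyed = {str(i): var1 for i, (var1, _) in enumerate(rows)}

e_names, f_names, inter = build_value_keyed_inputs(rows)
print(len(row_keyed), "row-keyed entities vs", len(e_names), "distinct var1 values")
```

If the split treats entity names as identities, a row-keyed input would only guarantee that no row ID is shared between folds, which is a weaker property than the one the assertions test.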

Any help will be appreciated!