Description
Hi! This package sounds great, but I have a problem understanding how to properly use it.
What is your input?
A pandas DataFrame with two categorical variables, "var1" and "var2", and an associated label. It contains repeated measurements, so it can be thought of as analogous to a protein-ligand affinity dataset.
interactions_with_reps.csv
Question
I simply want to use datasail to split this dataset for k-fold cross-validation (k=5 folds) using a 2D split that retains as much data as possible. Without experimental replicates this would be simple, but because some experiments are represented more often than others, it becomes an optimization problem that I suppose datasail could help me solve. I do not need any clustering or similarity; I only want an identity split (which I understand to be the I2 split).
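To make the goal concrete, here is a minimal pure-Python sketch of what I understand a 2D identity split to mean. It does not use datasail; the toy rows and the fold assignments (`var1_fold`, `var2_fold`) are made up for illustration. A row survives only if its var1 and its var2 land in the same fold, so replicated pairs weigh more heavily in how much data is retained.

```python
from collections import Counter

# Toy interaction rows as (var1, var2); repeated pairs stand in for replicates.
rows = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "x"), ("b", "y"), ("c", "z")]

# Hypothetical fold assignment for each var1 value and each var2 value.
var1_fold = {"a": 0, "b": 1, "c": 1}
var2_fold = {"x": 0, "y": 1, "z": 1}


def retained_rows(rows, var1_fold, var2_fold):
    """Count rows kept per fold; a row is kept only if both entities share a fold."""
    kept = Counter()
    dropped = 0
    for v1, v2 in rows:
        if var1_fold[v1] == var2_fold[v2]:
            kept[var1_fold[v1]] += 1
        else:
            dropped += 1
    return kept, dropped


kept, dropped = retained_rows(rows, var1_fold, var2_fold)
print(dict(kept), dropped)
```

In this toy case the replicated pair ("a", "x") counts twice towards fold 0, which is why I expect the best assignment to depend on how often each experiment is repeated.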
Although it seems quite simple, I don't understand how to do this from the documentation or the provided pbd example.
The closest I've gotten is the code below, although I don't understand why the data and arguments would need to be formatted this way. So I wonder:
- Is this correct?
- If so, why do I get the assertion error? Doesn't it imply that the data has not been separated along both dimensions?
import os
from datasail.sail import datasail
import pandas as pd
import numpy as np
df = pd.read_csv('/content/interactions_with_reps.csv')
df['var1'] = [str(int(x)) for x in df['var1']]
df['var2'] = [str(int(x)) for x in df['var2']]
df.reset_index(drop=True, inplace=True)
# Create id column (not sure why?)
df['ID'] = [str(x) for x in range(len(df))]
# Take first 1000 rows as example
df = df.head(1000)
id2var1 = dict(df[["ID", "var1"]].values.tolist())
id2var2 = dict(df[["ID", "var2"]].values.tolist())
inter = [(x[0], x[0]) for x in df[["ID"]].values.tolist()] # Not sure why we need this as it doesn't really describe the actual interactions...
_, _, inter_splits = datasail(
    techniques=["I2"],
    splits=[1/3]*3,  # Split into 3 folds each making up 0.33 (desired)
    names=[f'fold_{i}' for i in range(3)],  # Name each fold
    runs=1,
    solver="SCIP",
    inter=inter,
    e_type="O",
    e_data=id2var1,
    f_type="O",
    f_data=id2var2,
)
# Split dataframe
split_dict = inter_splits['I2'][0]
folds = []
fold_names = list(set(split_dict.values())-{'not selected'})
for fold in fold_names:
    folds.append(df[df['ID'].isin([k[0] for k, v in split_dict.items() if v == fold])])
# Verify there is no overlap
assert len(set(folds[0]['var1']).intersection(set(folds[1]['var1']))) == 0
assert len(set(folds[0]['var2']).intersection(set(folds[1]['var2']))) == 0
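The two assertions above only compare the first two folds. As a minimal sketch of a fuller check, the helper below (`overlapping_values`, a hypothetical name) reports the shared values for every pair of folds. It is written in plain Python over lists of row dicts, so it assumes each fold DataFrame is first converted with something like `fold.to_dict("records")`.

```python
from itertools import combinations


def overlapping_values(folds, key):
    """Return {(i, j): shared values} for each pair of folds sharing a value of `key`."""
    overlaps = {}
    for (i, a), (j, b) in combinations(enumerate(folds), 2):
        shared = {row[key] for row in a} & {row[key] for row in b}
        if shared:
            overlaps[(i, j)] = shared
    return overlaps


# Hypothetical folds, each a list of row dicts (toy data, not my real dataset).
toy_folds = [
    [{"var1": "1", "var2": "7"}],
    [{"var1": "2", "var2": "8"}],
    [{"var1": "1", "var2": "9"}],
]

print(overlapping_values(toy_folds, "var1"))  # folds 0 and 2 share var1 "1"
print(overlapping_values(toy_folds, "var2"))  # no shared var2 values
```

An empty dict for both "var1" and "var2" would mean the folds are disjoint along both dimensions.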
Output
===============================================================================
CVXPY
v1.5.3
===============================================================================
(CVXPY) Oct 02 11:30:16 AM: Your problem has 6000 variables, 2006 constraints, and 0 parameters.
(CVXPY) Oct 02 11:30:16 AM: It is compliant with the following grammars: DCP, DQCP
(CVXPY) Oct 02 11:30:16 AM: (If you need to solve this problem multiple times, but with different data, consider using parameters.)
(CVXPY) Oct 02 11:30:16 AM: CVXPY will first compile your problem; then, it will invoke a numerical solver to obtain a solution.
(CVXPY) Oct 02 11:30:16 AM: Your problem is compiled with the CPP canonicalization backend.
-------------------------------------------------------------------------------
Compilation
-------------------------------------------------------------------------------
(CVXPY) Oct 02 11:30:16 AM: Compiling problem (target solver=SCIP).
(CVXPY) Oct 02 11:30:16 AM: Reduction chain: Dcp2Cone -> CvxAttr2Constr -> ConeMatrixStuffing -> SCIP
(CVXPY) Oct 02 11:30:16 AM: Applying reduction Dcp2Cone
(CVXPY) Oct 02 11:30:16 AM: Applying reduction CvxAttr2Constr
(CVXPY) Oct 02 11:30:16 AM: Applying reduction ConeMatrixStuffing
(CVXPY) Oct 02 11:30:16 AM: Applying reduction SCIP
(CVXPY) Oct 02 11:30:16 AM: Finished problem compilation (took 3.474e-02 seconds).
-------------------------------------------------------------------------------
Numerical solver
-------------------------------------------------------------------------------
(CVXPY) Oct 02 11:30:16 AM: Invoking solver SCIP to obtain a solution.
-------------------------------------------------------------------------------
Summary
-------------------------------------------------------------------------------
(CVXPY) Oct 02 11:31:04 AM: Problem status: optimal
(CVXPY) Oct 02 11:31:04 AM: Optimal value: 1.000e+00
(CVXPY) Oct 02 11:31:04 AM: Compilation took 3.474e-02 seconds
(CVXPY) Oct 02 11:31:04 AM: Solver (including time spent in interface) took 4.776e+01 seconds
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
/tmp/ipython-input-2906173274.py in <cell line: 0>()
37
38 # Verify there is no overlap
---> 39 assert len(set(folds[0]['var1']).intersection(set(folds[1]['var1']))) == 0
40 assert len(set(folds[0]['var2']).intersection(set(folds[1]['var2']))) == 0
AssertionError:
Any help will be appreciated!