Commit 750c462

Documentation and version update
1 parent 856f7fa commit 750c462

File tree

9 files changed

+203
-19
lines changed

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,10 @@
1212
- [ ] Include MASH for amino acid sequences
1313
- [ ] Custom clustering methods ([Issue #25](https://github.com/kalininalab/DataSAIL/issues/25))
1414

15+
## v1.2.2 (2025-10-14)
16+
17+
- Bug fixed in the evaluation module.
18+
1519
## v1.2.1 (2025-08-19)
1620

1721
- Improved stratification and testing thereof to better handle multiclass and multilabel-multiclass stratification

base_recipe.yaml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
package:
2-
version: '1.2.1'
2+
version: '1.2.2'
33

44
source:
55
path: ..

datasail/version.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1 @@
1-
__version__ = "1.2.1"
1+
__version__ = "1.2.2"

docs/other.rst

Lines changed: 29 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -24,7 +24,10 @@ MoleculeNet
2424
This benchmark suite provides multiple datasets for molecular property prediction with different properties to predict. Each dataset
2525
contains a predefined split, some of which are scaffold-based or time-based, but most are random. Here, we compare these default splits to similarity-based DataSAIL splits.
2626

27-
*comparison coming soon*
27+
.. raw:: html
28+
:file: tables/moleculenet.html
29+
30+
|
2831
2932
Leak Proof PDBBind (LP-PDBBind)
3033
----------------------------------
@@ -42,18 +45,6 @@ while ligand similarity was measured as the Dice similarity between Morgan finge
4245

4346
|
4447
45-
Gold Standard Human Proteome Dataset for sequence-based PPI prediction
46-
----------------------------------------------------------------------
47-
| Bernett et al. (2023)
48-
| DOI: `10.1093/bib/bbae076 <https://doi.org/10.1093/bib/bbae076>`_
49-
50-
The authors first show that all sequence-based protein-protein interaction (PPI) predictors they evaluated perform no better than random when sequence similarity
51-
between splits is removed. They further develop a PPI dataset based on the human proteome where they separate the proteins into three blocks
52-
using KaHIP over SIMAP2 bitscores. Then, the PPIs are assigned to the blocks if and only if the interacting proteins are both in the corresponding block. In
53-
a last step, CDHIT is used to remove redundancy (max 40% sequence similarity) within each block.
54-
55-
*comparison coming soon*
56-
5748
Protein Ligand INteraction Dataset and Evaluation Resource (PLINDER)
5849
--------------------------------------------------------------------
5950
| Durairaj et al. (2024)
@@ -76,4 +67,28 @@ Protein INteraction Dataset and Evaluation Resource (PINDER)
7667
| Kovtun et al. (2024)
7768
| DOI: `10.1101/2024.07.17.603980 <https://doi.org/10.1101/2024.07.17.603980>`_
7869
79-
*coming soon*
70+
The Protein INteraction Dataset and Evaluation Resource (PINDER) contains curated and highly annotated protein-protein interactions obtained from the
71+
RCSB NextGen database. After data cleaning and preprocessing, PINDER provides a split from which data leakage has been removed. To measure the leakage between two systems
72+
(interacting protein-protein pairs), the authors employed FoldSeek and MMseqs. Here, we compare DataSAIL to version 1 of PINDER, released in November 2023.
73+
74+
Unlike for the LP-PDBBind dataset, we can define a similarity metric between the two interacting dimensions of this two-dimensional dataset, since both are proteins.
75+
Therefore, we did not directly use DataSAIL's S2 splitting module but rather the S1 module with all protein sequences from both dimensions, weighted with the number
76+
of interactions each protein participates in. From the resulting assignment, we assigned an interaction to a split if and only if both proteins are assigned
77+
to that same split.
78+
79+
.. raw:: html
80+
:file: tables/pinder.html
81+
82+
|
83+
84+
Gold Standard Human Proteome Dataset for sequence-based PPI prediction
85+
----------------------------------------------------------------------
86+
| Bernett et al. (2023)
87+
| DOI: `10.1093/bib/bbae076 <https://doi.org/10.1093/bib/bbae076>`_
88+
89+
The authors first show that all sequence-based protein-protein interaction (PPI) predictors they evaluated perform no better than random when sequence similarity
90+
between splits is removed. They further develop a PPI dataset based on the human proteome where they separate the proteins into three blocks
91+
using KaHIP over SIMAP2 bitscores. Then, the PPIs are assigned to the blocks if and only if the interacting proteins are both in the corresponding block. In
92+
a last step, CDHIT is used to remove redundancy (max 40% sequence similarity) within each block.
93+
94+
*comparison coming soon*
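
Note on the PINDER paragraph in docs/other.rst: the weighting and the "both proteins in the same split" rule it describes (the Bernett et al. blocks use the same rule) can be sketched roughly as follows. This is only an illustration under stated assumptions, not DataSAIL's implementation. The function names `interaction_weights` and `assign_interactions`, the protein IDs, and the hard-coded `protein_split` mapping are hypothetical; in practice the mapping would come from the S1 split of the pooled protein sequences.

```python
from collections import Counter


def interaction_weights(interactions):
    """Count how many interactions each protein participates in.

    These counts play the role of the per-protein weights mentioned in the
    text (hypothetical helper, not part of DataSAIL).
    """
    weights = Counter()
    for protein_a, protein_b in interactions:
        weights[protein_a] += 1
        weights[protein_b] += 1
    return dict(weights)


def assign_interactions(interactions, protein_split):
    """Assign an interaction to a split if and only if both proteins are in it.

    Interactions whose proteins fall into different splits (or are missing
    from the mapping) are mapped to None, i.e. they are dropped.
    """
    assigned = {}
    for protein_a, protein_b in interactions:
        split_a = protein_split.get(protein_a)
        split_b = protein_split.get(protein_b)
        same_split = split_a is not None and split_a == split_b
        assigned[(protein_a, protein_b)] = split_a if same_split else None
    return assigned


# Toy data: three interactions over four made-up protein IDs.
interactions = [("P1", "P2"), ("P1", "P3"), ("P3", "P4")]
weights = interaction_weights(interactions)

# Assumed output of a one-dimensional split over all proteins.
protein_split = {"P1": "train", "P2": "train", "P3": "test", "P4": "test"}
result = assign_interactions(interactions, protein_split)
```

In this toy case the interaction between `P1` and `P3` crosses the train/test boundary and is therefore dropped, which is the price of removing leakage this way.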

docs/tables/lppdbbind.html

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3,7 +3,7 @@
33
<head>
44
<meta charset="UTF-8">
55
<meta name="viewport" content="width=device-width, initial-scale=1.0">
6-
<title>PLINDER Results Comparison</title>
6+
<title>LP-PDBBind Results Comparison</title>
77
</head>
88
<body>
99
<div class="table-container">

docs/tables/moleculenet.html

Lines changed: 112 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,112 @@
1+
<!DOCTYPE html>
2+
<html lang="en">
3+
<head>
4+
<meta charset="UTF-8">
5+
<meta name="viewport" content="width=device-width, initial-scale=1.0">
6+
<title>MoleculeNet Results Comparison</title>
7+
</head>
8+
<body>
9+
<div class="table-container">
10+
<div class="table-header">
11+
<h2>MoleculeNet Dataset Comparison</h2>
12+
<p>Scaled L(π) values for different splitting methods</p>
13+
</div>
14+
<table>
15+
<thead>
16+
<tr>
17+
<th>Dataset</th>
18+
<th>MoleculeNet Technique</th>
19+
<th>MoleculeNet Split</th>
20+
<th>DataSAIL Split</th>
21+
</tr>
22+
</thead>
23+
<tbody>
24+
<tr>
25+
<td>QM7</td>
26+
<td>stratified</td>
27+
<td>0.3425</td>
28+
<td>0.2680</td>
29+
</tr>
30+
<tr>
31+
<td>QM8</td>
32+
<td>random</td>
33+
<td>0.3300</td>
34+
<td>0.2918</td>
35+
</tr>
36+
<tr>
37+
<td>QM9</td>
38+
<td>random</td>
39+
<td>0.3306</td>
40+
<td>0.2727</td>
41+
</tr>
42+
<tr>
43+
<td>ESOL</td>
44+
<td>random</td>
45+
<td>0.3069</td>
46+
<td>0.1808</td>
47+
</tr>
48+
<tr>
49+
<td>FreeSolv</td>
50+
<td>random</td>
51+
<td>0.3213</td>
52+
<td>0.1410</td>
53+
</tr>
54+
<tr>
55+
<td>Lipophilicity</td>
56+
<td>random</td>
57+
<td>0.3343</td>
58+
<td>0.3027</td>
59+
</tr>
60+
<tr>
61+
<td>MUV</td>
62+
<td>random</td>
63+
<td>0.3349</td>
64+
<td>0.3143</td>
65+
</tr>
66+
<tr>
67+
<td>HIV</td>
68+
<td>scaffold</td>
69+
<td>0.3306</td>
70+
<td>0.3071</td>
71+
</tr>
72+
<tr>
73+
<td>BACE</td>
74+
<td>scaffold</td>
75+
<td>0.3309</td>
76+
<td>0.3036</td>
77+
</tr>
78+
<tr>
79+
<td>BBBP</td>
80+
<td>scaffold</td>
81+
<td>0.3366</td>
82+
<td>0.2866</td>
83+
</tr>
84+
<tr>
85+
<td>Tox21</td>
86+
<td>random</td>
87+
<td>0.3333</td>
88+
<td>0.2224</td>
89+
</tr>
90+
<tr>
91+
<td>ToxCast</td>
92+
<td>random</td>
93+
<td>0.3355</td>
94+
<td>0.2220</td>
95+
</tr>
96+
<tr>
97+
<td>SIDER</td>
98+
<td>random</td>
99+
<td>0.3513</td>
100+
<td>0.2345</td>
101+
</tr>
102+
<tr>
103+
<td>ClinTox</td>
104+
<td>random</td>
105+
<td>0.3317</td>
106+
<td>0.2303</td>
107+
</tr>
108+
</tbody>
109+
</table>
110+
</div>
111+
</body>
112+
</html>

docs/tables/pinder.html

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,34 @@
1+
<!DOCTYPE html>
2+
<html lang="en">
3+
<head>
4+
<meta charset="UTF-8">
5+
<meta name="viewport" content="width=device-width, initial-scale=1.0">
6+
<title>PINDER Results Comparison</title>
7+
</head>
8+
<body>
9+
<div class="table-container">
10+
<div class="table-header">
11+
<h2>PINDER Dataset Comparison</h2>
12+
<p>Scaled L(π) values for different splitting methods</p>
13+
</div>
14+
<table>
15+
<thead>
16+
<tr>
17+
<th>Split Method</th>
18+
<th>Scaled L(π)</th>
19+
</tr>
20+
</thead>
21+
<tbody>
22+
<tr>
23+
<td>PINDER</td>
24+
<td>0.0068</td>
25+
</tr>
26+
<tr class="datasail-row">
27+
<td>DataSAIL</td>
28+
<td>0.0140</td>
29+
</tr>
30+
</tbody>
31+
</table>
32+
</div>
33+
</body>
34+
</html>

docs/tables/plinder.html

Lines changed: 20 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -62,6 +62,11 @@
6262
border-top-left-radius: 0;
6363
}
6464

65+
th:nth-child(3) {
66+
border-top-right-radius: 0;
67+
text-align: center;
68+
}
69+
6570
th:last-child {
6671
border-top-right-radius: 0;
6772
text-align: center;
@@ -73,6 +78,13 @@
7378
transition: background-color 0.2s ease;
7479
}
7580

81+
td:nth-child(3) {
82+
text-align: center;
83+
font-weight: 600;
84+
font-family: 'Courier New', monospace;
85+
color: #2d3748;
86+
}
87+
7688
td:last-child {
7789
text-align: center;
7890
font-weight: 600;
@@ -88,7 +100,7 @@
88100
border-bottom: none;
89101
}
90102

91-
/* Style for DataSAIL rows */
103+
/* Style for DataSAIL rows
92104
tr:nth-child(4) {
93105
background-color: #edf2f7;
94106
}
@@ -97,6 +109,13 @@
97109
}
98110
tr:nth-child(6) {
99111
background-color: #edf2f7;
112+
}*/
113+
.datasail-row {
114+
background-color: #edf2f7;
115+
}
116+
117+
.datasail-cell {
118+
background-color: #edf2f7;
100119
}
101120

102121
/* Add indicator for best/worst performances */

pyproject.toml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
[tool.poetry]
22
name = "datasail"
3-
version = "1.2.1"
3+
version = "1.2.2"
44
repository = "https://github.com/kalininalab/DataSAIL"
55
readme = "README.md"
66
description = "A package to compute hard out-of-distribution data splits for machine learning, challenging generalization of models."
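
Note on the version bump: this commit updates the same version string in three places (base_recipe.yaml, datasail/version.py, pyproject.toml). A small check along the following lines could catch a missed file. It is a sketch under assumptions: the regular expressions only match the line shapes visible in this diff, and `extract_versions` and `versions_consistent` are hypothetical helpers, not part of the repository.

```python
import re

# One pattern per file, matching the version lines as they appear in the diff.
VERSION_PATTERNS = {
    "base_recipe.yaml": r"^\s*version:\s*'([^']+)'",
    "datasail/version.py": r'^__version__\s*=\s*"([^"]+)"',
    "pyproject.toml": r'^version\s*=\s*"([^"]+)"',
}


def extract_versions(contents):
    """Return {filename: version or None} for the given file contents."""
    found = {}
    for name, text in contents.items():
        match = re.search(VERSION_PATTERNS[name], text, flags=re.MULTILINE)
        found[name] = match.group(1) if match else None
    return found


def versions_consistent(found):
    """True if every file yielded a version and all versions are equal."""
    values = set(found.values())
    return None not in values and len(values) == 1


# File snippets copied from the post-commit side of the diff.
contents = {
    "base_recipe.yaml": "package:\n  version: '1.2.2'\n\nsource:\n  path: ..\n",
    "datasail/version.py": '__version__ = "1.2.2"\n',
    "pyproject.toml": '[tool.poetry]\nname = "datasail"\nversion = "1.2.2"\n',
}
found = extract_versions(contents)
consistent = versions_consistent(found)
```

To run it against a checkout, the `contents` dictionary would be filled by reading the three files from disk instead of using the inline snippets.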
