docs/other.rst: 29 additions & 14 deletions
@@ -24,7 +24,10 @@ MoleculeNet
 This benchmark suite provides multiple datasets for molecular property prediction with different properties to predict. Each dataset
 contains a predefined split, some of which are scaffold-based or time-based, but most are random. Here, we compare these default splits to similarity-based DataSAIL splits.
 
-*comparison coming soon*
+.. raw:: html
+    :file: tables/moleculenet.html
+
+|
 
 Leak Proof PDBBind (LP-PDBBind)
 ----------------------------------
@@ -42,18 +45,6 @@ while ligand similarity was measured as the Dice similarity between Morgan finge
 
 |
 
-Gold Standard Human Proteome Dataset for sequence-based PPI prediction
[...]
-The authors first show that all sequence-based protein-protein interaction (PPI) predictors they evaluated perform no better than random when sequence similarity
-between splits is removed. They further develop a PPI dataset based on the human proteome where they separate the proteins into three blocks
-using KaHIP over SIMAP2 bitscores. Then, the PPIs are assigned to the blocks if and only if the interacting proteins are both in the corresponding block. In
-a last step, CDHIT is used to remove redundancy (max 40% sequence similarity) within each block.
-
-*comparison coming soon*
-
 Protein Ligand INteraction Dataset and Evaluation Resource (PLINDER)
[...]
+The Protein INteraction Dataset and Evaluation Resource (PINDER) contains curated and highly annotated protein-protein interactions obtained from the
+RCSB NextGen database. After data cleaning and preprocessing, PINDER provides a split from which data leakage has been removed. To measure the leakage between two systems
+(interacting protein-protein pairs), the authors employed FoldSeek and MMseqs. Here, we compare DataSAIL to version 1 of PINDER, released in November 2023.
+
+Unlike in the LP-PDBBind dataset, we can define a similarity metric between the two dimensions interacting in this two-dimensional dataset.
+Therefore, we did not directly use DataSAIL's S2 splitting module but rather the S1 module with all protein sequences from both dimensions, weighted by the number
+of interactions each protein participates in. From the resulting assignment, we assigned an interaction to a split if and only if both proteins are assigned
+to that same split.
+
+.. raw:: html
+    :file: tables/pinder.html
+
+|
+
+Gold Standard Human Proteome Dataset for sequence-based PPI prediction
[...]
+The authors first show that all sequence-based protein-protein interaction (PPI) predictors they evaluated perform no better than random when sequence similarity
+between splits is removed. They further develop a PPI dataset based on the human proteome where they separate the proteins into three blocks
+using KaHIP over SIMAP2 bitscores. Then, the PPIs are assigned to the blocks if and only if the interacting proteins are both in the corresponding block. In
+a last step, CDHIT is used to remove redundancy (max 40% sequence similarity) within each block.
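
The PINDER paragraph added in this change describes two steps that may be easier to follow as code: weighting each protein by the number of interactions it participates in, and keeping an interaction only if both of its proteins land in the same split. The sketch below is a minimal illustration under stated assumptions, not DataSAIL's implementation. The function names, the toy protein IDs, and the `protein_split` mapping (standing in for the output of a one-dimensional split) are all hypothetical.

```python
from collections import Counter


def interaction_weights(interactions):
    """Count how many interactions each protein participates in.

    Hypothetical helper: these counts play the role of the per-protein
    weights mentioned in the text.
    """
    counts = Counter()
    for protein_a, protein_b in interactions:
        counts[protein_a] += 1
        counts[protein_b] += 1
    return dict(counts)


def assign_interactions(interactions, protein_split):
    """Assign an interaction to a split only if both proteins share that split.

    `protein_split` maps protein ID -> split name and is assumed to come
    from a one-dimensional split over all proteins. Interactions whose
    proteins fall into different splits (or are unassigned) are dropped.
    """
    assigned = {}
    dropped = []
    for protein_a, protein_b in interactions:
        split_a = protein_split.get(protein_a)
        split_b = protein_split.get(protein_b)
        if split_a is not None and split_a == split_b:
            assigned.setdefault(split_a, []).append((protein_a, protein_b))
        else:
            dropped.append((protein_a, protein_b))
    return assigned, dropped


# Toy data, purely illustrative.
interactions = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5")]
protein_split = {
    "P1": "train", "P2": "train", "P3": "train",
    "P4": "test", "P5": "test",
}

weights = interaction_weights(interactions)
assigned, dropped = assign_interactions(interactions, protein_split)
```

In this toy example, the interaction between `P3` and `P4` crosses the train/test boundary and is therefore discarded. The same "both partners in one block" rule is what the Gold Standard PPI dataset description applies to its KaHIP blocks.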