
Glycan property prediction is an increasingly popular area of machine learning research. Supervised learning approaches have shown promise in glycan modeling; however, the current literature is fragmented regarding datasets and standardized evaluation techniques, hampering progress in understanding these complex, branched carbohydrates that play crucial roles in biological processes. To facilitate progress, we introduce GlycoGym, a comprehensive benchmark suite containing six biologically relevant supervised learning tasks spanning different domains of glycobiology: glycosylation linkage identification, tissue expression prediction, taxonomy classification, tandem mass spectrometry fragmentation prediction, lectin-glycan interaction modeling, and structural property estimation. We curate tasks into specific training, validation, and test splits using multi-class stratification to ensure that each task tests biologically relevant generalization that transfers to real-life glycan property prediction scenarios. GlycoGym will help the machine learning community to focus their efforts on scientifically relevant glycan prediction problems.

Installation

You can install GlycoGym via pip:

pip install glycogym

Usage

This package is primarily intended to build the benchmark for upload to Zenodo whenever the underlying datasets in glycowork or GlyContact are significantly updated.

It can also be used to build local versions of the benchmark between updates of the Zenodo repository.

from glycogym import build_glycosylation, build_taxonomy, build_tissue, build_lgi

df, mapping = build_glycosylation()
df_taxonomy = build_taxonomy("Kingdom")
df_tissue = build_tissue()
df_r, df_cl, df_cg = build_lgi()

Tandem Mass Spectrometry Fragmentation Prediction

One special dataset is the MS fragmentation prediction dataset, which can be built as follows:

from glycogym import build_spectrum

df_ms = build_spectrum(root="path/to/folder/with/pkl/files")

Here, the root argument specifies the path to the folder containing the .pkl files that make up the CandyCrunch MS fragmentation prediction dataset, which can be downloaded from here.
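Before calling build_spectrum, it can help to confirm that the folder actually contains .pkl files. The following is a minimal sketch using only the standard library. The helper name find_pkl_files is hypothetical and not part of GlycoGym, and the sketch assumes the files sit directly in the root folder rather than in subfolders.

```python
from pathlib import Path


def find_pkl_files(root):
    """Return the sorted .pkl files directly inside `root`.

    Hypothetical helper for illustration; not part of GlycoGym.
    """
    folder = Path(root)
    if not folder.is_dir():
        raise NotADirectoryError(f"{folder} is not a directory")
    # Assumption: the files are stored at the top level of the folder.
    return sorted(p for p in folder.iterdir() if p.is_file() and p.suffix == ".pkl")
```

If the returned list is empty, the download is probably incomplete or root points at the wrong folder.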

Structural Property Estimation

The second dataset that requires special handling is the structural property estimation dataset. Currently, it needs to be built with the GlyContact package, which can be installed with the following command:

pip install -e git+https://github.com/lthomes/glycontact.git#egg=glycontact[ml]

Then, the dataset can be built as follows:

from glycontact.learning import create_dataset

train, val, test = create_dataset(splits=[0.7, 0.2, 0.1])
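The splits argument above reads as train, validation, and test fractions. As a rough illustration of what a 70/20/10 partition means for a dataset of a given size, the sketch below computes the split sizes and assigns shuffled indices to each part. It is not GlyContact's implementation: the function split_indices, its seed parameter, and the rounding rule are assumptions made for this example.

```python
import random


def split_indices(n_items, fractions=(0.7, 0.2, 0.1), seed=0):
    """Partition range(n_items) into shuffled train/val/test index lists.

    Hypothetical helper for illustration; not part of GlyContact or GlycoGym.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # local RNG, reproducible for a fixed seed
    # Round to the nearest integer to avoid floating-point truncation surprises.
    n_train = round(n_items * fractions[0])
    n_val = round(n_items * fractions[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    # The test split takes the remainder so that no item is dropped.
    test = indices[n_train + n_val:]
    return train, val, test
```

For 100 items this yields 70, 20, and 10 indices, and every index appears in exactly one split.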

Zenodo

The latest version of the GlycoGym benchmark can be found on Zenodo: https://doi.org/10.5281/zenodo.17313055

Citation

tbd

