This repository contains all the code accompanying the paper BarcodeBERT: Transformers for Biodiversity Analysis (Millan Arias et al., 2025).
BarcodeBERT is a BERT-style transformer model trained exclusively on a dataset of DNA barcode sequences extracted from a reference library of Canadian invertebrates. In addition to the full pretraining pipeline, you’ll find scripts and notebooks for evaluating BarcodeBERT (and several off-the-shelf DNA foundation models) on the following downstream tasks:
- Fine-tuning for supervised species-level classification.
- Similarity retrieval for labelling rare or unseen species via nearest neighbour search in the embedding space.
- BIN reconstruction, where BarcodeBERT embeddings are used to group sequences into putative Barcode Index Numbers.
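To make the similarity retrieval task concrete, here is a minimal sketch of nearest neighbour labelling in an embedding space. It is not the repository's implementation: the function names, the toy three-dimensional vectors, the placeholder labels, and the choice of cosine similarity are all assumptions made for illustration. Real embeddings would come from the model and have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbour_label(query, ref_embeddings, ref_labels):
    """Return the label and similarity of the reference closest to the query."""
    sims = [cosine_similarity(query, ref) for ref in ref_embeddings]
    best = max(range(len(sims)), key=sims.__getitem__)
    return ref_labels[best], sims[best]


# Hypothetical stand-ins for sequence embeddings of labelled reference barcodes.
ref_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
ref_labels = ["taxon_A", "taxon_B", "taxon_C"]

# Hypothetical embeddings of unlabelled query sequences.
queries = [
    [0.9, 0.1, 0.0],
    [0.1, 0.2, 0.95],
]

predicted = [
    nearest_neighbour_label(q, ref_embeddings, ref_labels)[0] for q in queries
]
print(predicted)  # ['taxon_A', 'taxon_C']
```

The pure-Python loop keeps the idea visible. For a realistic reference library, a vectorised or indexed nearest neighbour search would be the more practical choice.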
What's Changed
- BZSL implementation for BarcodeBERT: Transformers for Biodiversity Analysis by @atwang16 in #1
- DOC: Fix paths to scripts shown in README by @scottclowe in #3
- RF: Standardize requirements.txt into a single file by @scottclowe in #5
- Bump transformers from 4.29.2 to 4.36.0 by @dependabot in #7
- Bump black from 23.11.0 to 24.3.0 by @dependabot in #8
- Bump transformers from 4.36.0 to 4.38.0 by @dependabot in #9
- Bump scikit-learn from 1.3.0 to 1.5.0 by @dependabot in #10
- MNT: Fix issues flagged by pre-commit by @scottclowe in #11
- DOC: Update citation by @scottclowe in #12
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in #13
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in #15
- Added conditional check for use_cuda before using torch.cuda by @NotMyLyfe in #14
- Camera Ready Version for Bioinformatics by @millanp95 in #19
New Contributors
- @atwang16 made their first contribution in #1
- @dependabot made their first contribution in #7
- @pre-commit-ci made their first contribution in #13
- @NotMyLyfe made their first contribution in #14
- @millanp95 made their first contribution in #19
Full Changelog: https://github.com/bioscan-ml/BarcodeBERT/commits/v1.0.0