Commit 10d6905

Updated README.md
1 parent dff2dde commit 10d6905

1 file changed: +32 -10 lines changed

README.md

Lines changed: 32 additions & 10 deletions
@@ -40,7 +40,13 @@ features = output.mean(1)
 pip install -e .
 ```

-1. Download the [data](https://vault.cs.uwaterloo.ca/s/x7gXQKnmRX3GAZm)
+1. Download the data from our Hugging Face Dataset [repository](https://huggingface.co/datasets/bioscan-ml/CanadianInvertebrates-ML)
+```shell
+cd data/
+python download_HF_CanInv.py
+```
+
+**Optional**: You can also download the first version of the [data](https://vault.cs.uwaterloo.ca/s/x7gXQKnmRX3GAZm)
 ```shell
 wget https://vault.cs.uwaterloo.ca/s/x7gXQKnmRX3GAZm/download -O data.zip
 unzip data.zip
@@ -49,20 +55,26 @@ rm -r new_data
 rm data.zip
 ```

-3. Pretrain BarcodeBERT
-
-```bash
-python barcodebert/pretraining.py --dataset=CANADA-1.5M --k_mer=4 --n_layers=4 --n_heads=4 --data_dir=data/ --checkpoint=model_checkpoints/CANADA-1.5M/4_4_4/checkpoint_pretraining.pt
-```
-
-4. Baseline model pipelines: The desired backbone can be selected using one of the following keywords:
+4. DNA foundation model baselines: The desired backbone can be selected using one of the following keywords:
 `BarcodeBERT, NT, Hyena_DNA, DNABERT, DNABERT-2, DNABERT-S`
 ```bash
 python baselines/knn_probing.py --backbone=<DESIRED-BACKBONE> --data-dir=data/
 python baselines/linear_probing.py --backbone=<DESIRED-BACKBONE> --data-dir=data/
 python baselines/finetuning.py --backbone=<DESIRED-BACKBONE> --data-dir=data/ --batch_size=32
+python baselines/zsc.py --backbone=<DESIRED-BACKBONE> --data-dir=data/
 ```
-**Note**: HyenaDNA has to be downloaded using `git-lfs`. If that is not available to you, you may download the `/hyenadna-tiny-1k-seqlen/` checkpoint directly from [Hugging face](https://huggingface.co/LongSafari/hyenadna-tiny-1k-seqlen/tree/main). The keyword `BarcodeBERT` is also available as a baseline but this will download the publicly available model as presented in our workshop paper.
+**Note**: The DNABERT model has to be downloaded manually following the instructions in the paper's [repo](https://github.com/jerryji1993/DNABERT) and placed in the `pretrained-models` folder.
+
+4. Supervised CNN
+
+```bash
+python baselines/cnn/1D_CNN_supervised.py
+python baselines/cnn/1D_CNN_KNN.py
+python baselines/cnn/1D_CNN_Linear_probing.py
+python baselines/cnn/1D_CNN_ZSC.py
+
+```
+**Note**: Train the CNN backbone with `1D_CNN_supervised.py` before evaluating it on any downtream task.

 5. BLAST
 ```shell
@@ -75,7 +87,17 @@ makeblastdb -in supervised_train.fas -title train -dbtype nucl -out train.fas
 blastn -query supervised_test.fas -db train.fas -out results_supervised_test.tsv -outfmt 6 -num_threads 16
 blastn -query unseen.fas -db train.fas -out results_unseen.tsv -outfmt 6 -num_threads 16
 ```
-
+### Pretrain BarcodeBERT
+To pretrain the model, run the following command:
+```bash
+python barcodebert/pretraining.py \
+--dataset=CANADA-1.5M \
+--k_mer=4 \
+--n_layers=4 \
+--n_heads=4 \
+--data_dir=data/ \
+--checkpoint=model_checkpoints/CANADA-1.5M/4_4_4/checkpoint_pretraining.pt
+```

 ## Citation

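The README lines in this diff list four baseline scripts, six backbone keywords, and one pretraining command. The following is a minimal sketch of how those commands could be generated as argument lists, assuming a dry run that only prints them. The helper names `baseline_commands` and `pretraining_command` are hypothetical and not part of the repository; the script paths, flags, and default values are copied from the README text, and nothing here is confirmed against the scripts themselves.

```python
import shlex
from itertools import product

# Backbone keywords and script paths as listed in the README diff.
BACKBONES = ["BarcodeBERT", "NT", "Hyena_DNA", "DNABERT", "DNABERT-2", "DNABERT-S"]
SCRIPTS = [
    "baselines/knn_probing.py",
    "baselines/linear_probing.py",
    "baselines/finetuning.py",
    "baselines/zsc.py",
]


def baseline_commands(backbones=BACKBONES, scripts=SCRIPTS, data_dir="data/"):
    """Return one argv list per (backbone, script) pair without running anything."""
    commands = []
    for backbone, script in product(backbones, scripts):
        argv = ["python", script, f"--backbone={backbone}", f"--data-dir={data_dir}"]
        if script.endswith("finetuning.py"):
            # The README passes an explicit batch size only to the finetuning script.
            argv.append("--batch_size=32")
        commands.append(argv)
    return commands


def pretraining_command(
    checkpoint="model_checkpoints/CANADA-1.5M/4_4_4/checkpoint_pretraining.pt",
):
    """Return the README's pretraining invocation as a single argv list."""
    return [
        "python",
        "barcodebert/pretraining.py",
        "--dataset=CANADA-1.5M",
        "--k_mer=4",
        "--n_layers=4",
        "--n_heads=4",
        "--data_dir=data/",
        f"--checkpoint={checkpoint}",
    ]


if __name__ == "__main__":
    # Dry run: print each command as a shell-quoted line instead of executing it.
    for argv in baseline_commands() + [pretraining_command()]:
        print(shlex.join(argv))
```

Keeping each command as a list means no shell line continuations are needed, so a missing trailing backslash cannot split the invocation. Each list could be passed to `subprocess.run(argv, check=True)` once the data and any manually downloaded checkpoints are in place.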