Commit 45721b3

Updated README and add column aliases
- Updated README with details about the Canadian Invertebrates 1.5M dataset.
- Added column aliases to ensure compatibility with the previous metadata structure.
1 parent 971a0f1 commit 45721b3

2 files changed: +39 -5 lines changed

README.rst

Lines changed: 37 additions & 4 deletions
@@ -12,12 +12,13 @@
 BIOSCAN Datasets for PyTorch
 ============================

-In this package, we provide PyTorch/torchvision style dataset classes to load the `BIOSCAN-1M <BIOSCAN-1M paper_>`_ and `BIOSCAN-5M <BIOSCAN-5M paper_>`_ datasets.
+In this package, we provide PyTorch/torchvision style dataset classes to load the `BIOSCAN-1M <BIOSCAN-1M paper_>`_, `BIOSCAN-5M <BIOSCAN-5M paper_>`_, and `Canadian Invertebrates 1.5M <CanadianInvertebrates paper_>`_ datasets.

 BIOSCAN-1M and 5M are large multimodal datasets for insect biodiversity monitoring, containing over 1 million and 5 million specimens, respectively.
 The datasets are comprised of RGB microscopy images, `DNA barcodes <what-is-DNA-barcoding_>`_, and fine-grained, hierarchical taxonomic labels.
+The Canadian Invertebrates 1.5M dataset provides DNA barcodes for over 1.5 million invertebrates collected across 23 ecozones in Canada. It is a major reference library for biodiversity research and consists of DNA barcodes collected from platforms such as BOLD, GenBank, and GBIF.
 Every sample has both an image and a DNA barcode, but the taxonomic labels are incomplete and only extend all the way to the species level for around 9% of the specimens.
-For more details about the datasets, please see the `BIOSCAN-1M paper`_ and `BIOSCAN-5M paper`_, respectively.
+For more details about the datasets, please see the `BIOSCAN-1M paper`_, `BIOSCAN-5M paper`_, and `Canadian Invertebrates 1.5M <CanadianInvertebrates paper_>`_ paper, respectively.

 Documentation about this package, including the full API details, is available online at readthedocs_.

@@ -69,6 +70,18 @@ To load the BIOSCAN-1M dataset:
         # Do something with the image, dna_barcode, and label
         pass

+To load the Canadian Invertebrates 1.5M dataset:
+
+.. code-block:: python
+
+    from bioscan_dataset import CanadianInvertebrates
+
+    dataset = CanadianInvertebrates(root="~/Datasets/bioscan/")
+
+    for dna_barcode, label in dataset:
+        # Do something with the dna_barcode and label
+        pass
+
 Note that although BIOSCAN-5M is a superset of BIOSCAN-1M, the repeated data samples are not identical between the two due to data cleaning and processing differences.
 For details, please see Appendix Q of the `BIOSCAN-5M paper`_.
 Additionally, note that the splits are incompatible between the two datasets.
@@ -341,7 +354,7 @@ The transform indicates the name of a taxonomic rank and its value for every ran
 Other resources
 ---------------

-- Read the `BIOSCAN-1M paper`_ and `BIOSCAN-5M paper`_.
+- Read the `BIOSCAN-1M paper`_, `BIOSCAN-5M paper`_, and `Canadian Invertebrates 1.5M <CanadianInvertebrates paper_>`_ paper.
 - The dataset can be explored through a web interface using our `BIOSCAN Browser`_.
 - Read more about the `International Barcode of Life (iBOL) <https://ibol.org/>`__ and `BIOSCAN <https://ibol.org/bioscan/>`__ initiatives.
 - See the code for the `cropping tool <https://github.com/bioscan-ml/BIOSCAN-5M/tree/main/BIOSCAN_crop_resize>`__ that was applied to the images to create the cropped image package.
@@ -352,7 +365,7 @@ Other resources
 Citation
 --------

-If you make use of the BIOSCAN-1M or BIOSCAN-5M datasets in your research, please cite the following papers as appropriate.
+If you make use of the BIOSCAN-1M, BIOSCAN-5M, or Canadian Invertebrates 1.5M datasets in your research, please cite the following papers as appropriate.

 `BIOSCAN-5M <BIOSCAN-5M paper_>`_:

@@ -394,6 +407,25 @@ If you make use of the BIOSCAN-1M or BIOSCAN-5M datasets in your research, pleas
         url={https://proceedings.neurips.cc/paper_files/paper/2023/file/87dbbdc3a685a97ad28489a1d57c45c1-Paper-Datasets_and_Benchmarks.pdf},
     }

+`Canadian Invertebrates 1.5M <CanadianInvertebrates paper_>`_:
+
+.. code-block:: bibtex
+
+    @article{dewaard2019reference,
+        title={A reference library for Canadian invertebrates with 1.5 million barcodes, voucher specimens, and DNA samples},
+        author={DeWaard, J. R. and Ratnasingham, S. and Zakharov, E. V. and Borisenko, A. V.
+            and Steinke, D. and Telfer, A. C. and Perez, K. H. J. and Sones, J. E.
+            and Young, M. R. and Levesque-Beaudin, V. and others
+        },
+        journal={Scientific Data},
+        volume={6},
+        number={1},
+        pages={308},
+        year={2019},
+        publisher={Nature Publishing Group UK London},
+        url={https://www.nature.com/articles/s41597-019-0320-2.pdf},
+    }
+
 If you use the CLIBD partitioning scheme for BIOSCAN-1M, please also consider citing the `CLIBD paper`_.

 .. code-block:: bibtex
@@ -411,6 +443,7 @@ If you use the CLIBD partitioning scheme for BIOSCAN-1M, please also consider ci
 .. _BIOSCAN Browser: https://bioscan-browser.netlify.app/
 .. _BIOSCAN-1M paper: https://papers.nips.cc/paper_files/paper/2023/hash/87dbbdc3a685a97ad28489a1d57c45c1-Abstract-Datasets_and_Benchmarks.html
 .. _BIOSCAN-5M paper: https://arxiv.org/abs/2406.12723
+.. _CanadianInvertebrates paper: https://www.nature.com/articles/s41597-019-0320-2
 .. _BS1M-class: https://bioscan-dataset.readthedocs.io/en/stable/api.html#bioscan_dataset.BIOSCAN1M
 .. _BS5M-class: https://bioscan-dataset.readthedocs.io/en/stable/api.html#bioscan_dataset.BIOSCAN5M
 .. _CLIBD paper: https://arxiv.org/abs/2405.17537

bioscan_dataset/CanadianInvertebrates.py

Lines changed: 2 additions & 1 deletion
@@ -50,6 +50,7 @@
     "split",
 ]

+COLUMN_ALIASES = {"bin_uri": "dna_bin", "nucleotides": "dna_barcode"}
 VALID_SPLITS = ["pretrain", "train", "val", "test", "key_unseen", "val_unseen", "test_unseen", "other_heldout"]
 SPLIT_ALIASES = {"validation": "val"}
 VALID_METASPLITS = ["all", "seen", "unseen"]
@@ -385,7 +386,7 @@ def __init__(
             self.target_type = [target_type]
         else:
             self.target_type = list(target_type)
-        self.target_type = ["dna_bin" if t == "uri" else t for t in self.target_type]
+        self.target_type = [COLUMN_ALIASES.get(t, t) for t in self.target_type]

         if not self.target_type and self.target_transform is not None:
             raise RuntimeError("target_transform is specified but target_type is empty")
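
The effect of the ``COLUMN_ALIASES`` lookup in this hunk can be sketched in isolation. The snippet below is a minimal, hypothetical reproduction rather than code from the package: ``COLUMN_ALIASES`` is copied from the diff, ``normalize_target_type`` is an illustrative helper name, and the ``isinstance(target_type, str)`` check is an assumption about the branch condition, which the hunk does not show. Going by this hunk alone, the legacy names ``bin_uri`` and ``nucleotides`` are mapped to the current column names, while the previously handled ``uri`` appears to pass through unchanged unless it is aliased elsewhere.

```python
# Alias table copied from the diff above.
COLUMN_ALIASES = {"bin_uri": "dna_bin", "nucleotides": "dna_barcode"}


def normalize_target_type(target_type):
    """Hypothetical helper: return target_type as a list with aliases resolved.

    Mirrors the list comprehension added in this commit; the str check is an
    assumed stand-in for the condition not shown in the hunk.
    """
    if isinstance(target_type, str):
        target_type = [target_type]
    else:
        target_type = list(target_type)
    # dict.get(t, t) returns the alias target if t is a known alias,
    # otherwise it returns t itself.
    return [COLUMN_ALIASES.get(t, t) for t in target_type]


print(normalize_target_type("bin_uri"))                   # ['dna_bin']
print(normalize_target_type(["nucleotides", "species"]))  # ['dna_barcode', 'species']
print(normalize_target_type(["uri"]))                     # ['uri'] (no longer aliased here)
```

If callers still pass ``uri``, one option would be to add ``"uri": "dna_bin"`` to the alias table. Whether that is needed depends on code outside this diff.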
