Commit a5a1456

Merge pull request #11 from bioscan-ml/doc_improvements
DOC: Documentation improvements
2 parents e444c55 + da917cd commit a5a1456

5 files changed

+108
-47
lines changed

README.rst

Lines changed: 21 additions & 20 deletions
@@ -19,8 +19,9 @@ In this package, we provide PyTorch/torchvision style dataset classes to load th
1919
BIOSCAN-1M and 5M are large multimodal datasets for insect biodiversity monitoring, containing over 1 million and 5 million specimens, respectively.
2020
The datasets comprise RGB microscopy images, DNA barcodes, and fine-grained, hierarchical taxonomic labels.
2121
Every sample has both an image and a DNA barcode, but the taxonomic labels are incomplete and only extend all the way to the species level for around 9% of the specimens.
22+
For more details about the datasets, please see the `BIOSCAN-1M paper <BS1M-paper_>`_ and `BIOSCAN-5M paper <BS5M-paper_>`_, respectively.
2223

23-
Documentation, including the full API details, is available online at readthedocs_.
24+
Documentation about this package, including the full API details, is available online at readthedocs_.
2425

2526

2627
Installation
@@ -38,14 +39,14 @@ To install the package, run:
3839
Usage
3940
-----
4041

41-
The datasets can be used in the same way as PyTorch's torchvision datasets.
42+
The datasets can be used in the same way as PyTorch's `torchvision datasets <https://pytorch.org/vision/main/datasets.html#built-in-datasets>`_.
4243
For example, to load the BIOSCAN-1M dataset:
4344

4445
.. code-block:: python
4546
4647
from bioscan_dataset import BIOSCAN1M
4748
48-
dataset = BIOSCAN1M(root="~/Datasets/bioscan/bioscan-1m/")
49+
dataset = BIOSCAN1M(root="~/Datasets/bioscan/")
4950
5051
for image, dna_barcode, label in dataset:
5152
# Do something with the image, dna_barcode, and label
@@ -57,7 +58,7 @@ To load the BIOSCAN-5M dataset:
5758
5859
from bioscan_dataset import BIOSCAN5M
5960
60-
dataset = BIOSCAN5M(root="~/Datasets/bioscan/bioscan-5m/")
61+
dataset = BIOSCAN5M(root="~/Datasets/bioscan/")
6162
6263
for image, dna_barcode, label in dataset:
6364
# Do something with the image, dna_barcode, and label
@@ -79,21 +80,19 @@ This can be performed by setting the argument ``download=True``:
7980

8081
.. code-block:: python
8182
82-
dataset = BIOSCAN5M(root="~/Datasets/bioscan/bioscan-5m/", download=True)
83+
dataset = BIOSCAN5M(root="~/Datasets/bioscan/", download=True)
8384
8485
To use a different image package, follow the download instructions given in the `BIOSCAN-5M repository <https://github.com/bioscan-ml/BIOSCAN-5M?tab=readme-ov-file#dataset-access>`_, then set the argument ``image_package`` to the desired package name, e.g.
8586

8687
.. code-block:: python
8788
8889
# Manually download original_full from
8990
# https://drive.google.com/drive/u/1/folders/1Jc57eKkeiYrnUBc9WlIp-ZS_L1bVlT-0
90-
# and unzip the 5 zip files into ~/Datasets/bioscan/bioscan-5m/bioscan5m/images/original_full/
91+
# and unzip the 5 zip files into ~/Datasets/bioscan/bioscan5m/images/original_full/
9192
# Then load the dataset as follows:
92-
dataset = BIOSCAN5M(
93-
root="~/Datasets/bioscan/bioscan-5m/", image_package="original_full"
94-
)
93+
dataset = BIOSCAN5M(root="~/Datasets/bioscan/", image_package="original_full")
9594
96-
For BIOSCAN-1M, automatic dataset download is not supported and so the dataset must be manually downloaded.
95+
For `BIOSCAN1M <BS1M-class_>`_, automatic dataset download is not supported and so the dataset must be manually downloaded.
9796
See the `BIOSCAN-1M repository <https://github.com/bioscan-ml/BIOSCAN-1M?tab=readme-ov-file#-dataset-access>`_ for download instructions.
9897

9998

@@ -107,7 +106,7 @@ For example, to load the validation split:
107106

108107
.. code-block:: python
109108
110-
dataset = BIOSCAN5M(root="~/Datasets/bioscan/bioscan-5m/", split="val")
109+
dataset = BIOSCAN5M(root="~/Datasets/bioscan/", split="val")
111110
112111
The BIOSCAN-5M dataset is partitioned into ``train``, ``val``, and ``test`` splits for closed-world tasks (seen species), and ``key_unseen``, ``val_unseen``, and ``test_unseen`` splits for open-world tasks (unseen species).
113112
These partitions only use samples labelled to species-level.
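
As a rough illustration of the split grouping described above, the following sketch sorts the split names quoted in this README into closed-world and open-world groups. The dictionary ``SPLITS_BY_TASK`` and the helper ``split_world`` are hypothetical and are not part of ``bioscan_dataset``; only the split names themselves come from the text.

```python
# Hypothetical helper, not part of bioscan_dataset: groups the split names
# listed in the README by the kind of task they are intended for.
SPLITS_BY_TASK = {
    "closed_world": ("train", "val", "test"),  # seen species
    "open_world": ("key_unseen", "val_unseen", "test_unseen"),  # unseen species
}


def split_world(split: str) -> str:
    """Return the task group ("closed_world" or "open_world") for a split name."""
    for task, names in SPLITS_BY_TASK.items():
        if split in names:
            return task
    raise ValueError(f"Unknown split: {split!r}")


print(split_world("val"))          # closed_world
print(split_world("test_unseen"))  # open_world
```

A lookup like this could be used to decide, for instance, whether an evaluation loop should expect every test species to have appeared during training.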
@@ -150,34 +149,34 @@ This can be changed by setting the argument ``input_modality`` to either ``"imag
150149

151150
.. code-block:: python
152151
153-
dataset = BIOSCAN5M(root="~/Datasets/bioscan/bioscan-5m/", modality="image")
152+
dataset = BIOSCAN5M(root="~/Datasets/bioscan/", modality="image")
154153
155154
or ``"dna"``:
156155

157156
.. code-block:: python
158157
159-
dataset = BIOSCAN5M(root="~/Datasets/bioscan/bioscan-5m/", modality="dna")
158+
dataset = BIOSCAN5M(root="~/Datasets/bioscan/", modality="dna")
160159
161160
162161
Target selection
163162
~~~~~~~~~~~~~~~~
164163

165164
The target label can be selected by setting the argument ``target_type`` to be either a taxonomic label or ``dna_bin``.
166165
The DNA BIN is similar in granularity to subspecies, but was generated by clustering the DNA barcodes instead of morphology.
167-
The default target is ``"family"`` for BIOSCAN1M and ``"species"`` for BIOSCAN5M.
166+
The default target is ``"family"`` for `BIOSCAN1M <BS1M-class_>`_ and ``"species"`` for `BIOSCAN5M <BS5M-class_>`_.
168167

169168
The target can be a single label, e.g.
170169

171170
.. code-block:: python
172171
173-
dataset = BIOSCAN5M(root="~/Datasets/bioscan/bioscan-5m/", target_type="genus")
172+
dataset = BIOSCAN5M(root="~/Datasets/bioscan/", target_type="genus")
174173
175174
or a list of labels, e.g.
176175

177176
.. code-block:: python
178177
179178
dataset = BIOSCAN5M(
180-
root="~/Datasets/bioscan/bioscan-5m/", target_type=["genus", "species", "dna_bin"]
179+
root="~/Datasets/bioscan/", target_type=["genus", "species", "dna_bin"]
181180
)
182181
183182
By default, the target values will be provided as integer indices that map to the labels for that taxonomic rank (with value ``-1`` used for missing labels), appropriate for training a classification model with cross-entropy.
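
To make the index convention above concrete, here is a minimal sketch of one way text labels could be mapped to integer indices with ``-1`` standing in for missing labels. The function ``encode_labels`` and the sample genus names are illustrative assumptions; the package's own index assignment may well differ.

```python
from typing import List, Optional


def encode_labels(labels: List[Optional[str]]) -> List[int]:
    """Map text labels to integer indices, using -1 for missing labels.

    Illustrative only: this is not the encoding used by bioscan_dataset.
    """
    # Sort the distinct known labels so the index assignment is deterministic.
    classes = sorted({lab for lab in labels if lab is not None})
    class_to_index = {name: i for i, name in enumerate(classes)}
    return [-1 if lab is None else class_to_index[lab] for lab in labels]


genus_labels = ["Apis", None, "Bombus", "Apis"]
print(encode_labels(genus_labels))  # [0, -1, 1, 0]
```

When training with cross-entropy, samples carrying the ``-1`` placeholder typically need to be excluded from the loss, for example through the ``ignore_index`` argument that PyTorch's cross-entropy loss accepts.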
@@ -188,13 +187,13 @@ If this is set to ``target_format="text"``, the output will instead be the raw l
188187
189188
# Default target format is "index"
190189
dataset = BIOSCAN5M(
191-
root="~/Datasets/bioscan/bioscan-5m/", target_type="species", target_format="index"
190+
root="~/Datasets/bioscan/", target_type="species", target_format="index"
192191
)
193192
assert dataset[0][-1] == 240
194193
195194
# Using target format "text"
196195
dataset = BIOSCAN5M(
197-
root="~/Datasets/bioscan/bioscan-5m/", target_type="species", target_format="text"
196+
root="~/Datasets/bioscan/", target_type="species", target_format="text"
198197
)
199198
assert dataset[0][-1] == "Gnamptogenys sulcata"
200199
@@ -230,7 +229,7 @@ The dataset class supports the use of data transforms for the image and DNA barc
230229
)
231230
# Load the dataset with the transforms applied for each sample
232231
ds_train = BIOSCAN5M(
233-
root="~/Datasets/bioscan/bioscan-5m/",
232+
root="~/Datasets/bioscan/",
234233
split="train",
235234
transform=image_transform,
236235
dna_transform=dna_transform,
@@ -241,7 +240,7 @@ Size and geolocation metadata
241240
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
242241

243242
The BIOSCAN-5M dataset also contains insect size and geolocation metadata.
244-
Loading this metadata is not yet supported by the BIOSCAN5M pytorch dataset class.
243+
Loading this metadata is not yet supported by the `BIOSCAN5M <BS5M-class_>`_ PyTorch dataset class.
245244
In the meantime, users of the dataset are welcome to explore this metadata themselves.
246245

247246

@@ -305,6 +304,8 @@ If you make use of the BIOSCAN-1M or BIOSCAN-5M datasets in your research, pleas
305304
.. _PyPI: https://pypi.org/project/bioscan-dataset/
306305
.. _readthedocs: https://bioscan-dataset.readthedocs.io
307306
.. _pip: https://pip.pypa.io/
307+
.. _BS1M-class: https://bioscan-dataset.readthedocs.io/en/latest/api.html#bioscan_dataset.BIOSCAN1M
308+
.. _BS5M-class: https://bioscan-dataset.readthedocs.io/en/latest/api.html#bioscan_dataset.BIOSCAN5M
308309

309310
.. |PyPI badge| image:: https://img.shields.io/pypi/v/bioscan-dataset.svg
310311
:target: PyPI_

bioscan_dataset/bioscan1m.py

Lines changed: 38 additions & 9 deletions
@@ -10,8 +10,9 @@
1010

1111
import os
1212
from enum import Enum
13+
from typing import Any, Tuple
1314

14-
import pandas as pd
15+
import pandas
1516
import PIL
1617
import torch
1718
from torchvision.datasets.vision import VisionDataset
@@ -88,7 +89,7 @@ def load_bioscan1m_metadata(
8889
partitioning_version="large_diptera_family",
8990
dtype=MetadataDtype.DEFAULT,
9091
**kwargs,
91-
) -> pd.DataFrame:
92+
) -> pandas.DataFrame:
9293
r"""
9394
Load BIOSCAN-1M metadata from its TSV file, and prepare it for training.
9495
@@ -140,13 +141,13 @@ def load_bioscan1m_metadata(
140141
141142
Returns
142143
-------
143-
df : pd.DataFrame
144+
df : pandas.DataFrame
144145
The metadata DataFrame.
145146
"""
146147
if dtype == MetadataDtype.DEFAULT:
147148
# Use our default column data types
148149
dtype = COLUMN_DTYPES
149-
df = pd.read_csv(metadata_path, sep="\t", dtype=dtype, **kwargs)
150+
df = pandas.read_csv(metadata_path, sep="\t", dtype=dtype, **kwargs)
150151
# Taxonomic label column names
151152
label_cols = [
152153
"phylum",
@@ -175,7 +176,7 @@ def load_bioscan1m_metadata(
175176
df = df.sort_index()
176177
# Convert missing values to NaN
177178
for c in label_cols:
178-
df.loc[df[c] == "not_classified", c] = pd.NA
179+
df.loc[df[c] == "not_classified", c] = pandas.NA
179180
# Fix some tribe labels which were only partially applied
180181
df.loc[df["genus"].notna() & (df["genus"] == "Asteia"), "tribe"] = "Asteiini"
181182
df.loc[df["genus"].notna() & (df["genus"] == "Nemorilla"), "tribe"] = "Winthemiini"
@@ -245,8 +246,8 @@ class BIOSCAN1M(VisionDataset):
245246
Note that the barcode should only be 660 base pairs long.
246247
Characters beyond this length are unlikely to be accurate.
247248
248-
target_type : str, default="family"
249-
Type of target to use. One of:
249+
target_type : str or Iterable[str], default="family"
250+
Type of target to use. One of, or a list of:
250251
251252
- ``"phylum"``
252253
- ``"class"``
@@ -271,6 +272,8 @@ class BIOSCAN1M(VisionDataset):
271272
If this is set to ``"text"``, the target(s) will each be returned as a string,
272273
appropriate for processing with language models.
273274
275+
.. versionadded:: 1.1.0
276+
274277
transform : Callable, default=None
275278
Image transformation pipeline.
276279
@@ -339,7 +342,33 @@ def __init__(
339342
def __len__(self):
340343
return len(self.metadata)
341344

342-
def __getitem__(self, index: int):
345+
def __getitem__(self, index: int) -> Tuple[Any, ...]:
346+
"""
347+
Get a sample from the dataset.
348+
349+
Parameters
350+
----------
351+
index : int
352+
Index of the sample to retrieve.
353+
354+
Returns
355+
-------
356+
image : PIL.Image.Image
357+
The image, if the ``"image"`` modality is requested, optionally transformed
358+
by the ``transform`` pipeline.
359+
360+
dna : str
361+
The DNA barcode, if the ``"dna"`` modality is requested, optionally
362+
transformed by the ``dna_transform`` pipeline.
363+
364+
target : int or Tuple[int, ...] or str or Tuple[str, ...] or None
365+
The target(s), optionally transformed by the ``target_transform`` pipeline.
366+
If ``target_format="index"``, the target(s) will be returned as integer
367+
indices, with missing values filled with ``-1``.
368+
If ``target_format="text"``, the target(s) will be returned as a string.
369+
If there are multiple targets, they will be returned as a tuple.
370+
If ``target_type`` is an empty list, the output ``target`` will be ``None``.
371+
"""
343372
sample = self.metadata.iloc[index]
344373
img_path = os.path.join(self.image_dir, f"part{sample['chunk_number']}", sample["image_file"])
345374
values = []
@@ -403,7 +432,7 @@ def _check_exists(self, verbose=0) -> bool:
403432
check_all &= check
404433
return check_all
405434

406-
def _load_metadata(self) -> pd.DataFrame:
435+
def _load_metadata(self) -> pandas.DataFrame:
407436
r"""
408437
Load metadata from TSV file and prepare it for training.
409438
"""

bioscan_dataset/bioscan5m.py

Lines changed: 35 additions & 6 deletions
@@ -10,8 +10,9 @@
1010

1111
import os
1212
from enum import Enum
13+
from typing import Any, Tuple
1314

14-
import pandas as pd
15+
import pandas
1516
import PIL
1617
import torch
1718
from torchvision.datasets.utils import check_integrity, download_and_extract_archive
@@ -80,7 +81,7 @@ def get_image_path(row):
8081
The path to the image file.
8182
"""
8283
image_path = row["split"] + os.path.sep
83-
if pd.notna(row["chunk"]) and row["chunk"]:
84+
if pandas.notna(row["chunk"]) and row["chunk"]:
8485
image_path += str(row["chunk"]) + os.path.sep
8586
image_path += row["processid"] + ".jpg"
8687
return image_path
@@ -97,7 +98,7 @@ def load_bioscan5m_metadata(
9798
split=None,
9899
dtype=MetadataDtype.DEFAULT,
99100
**kwargs,
100-
) -> pd.DataFrame:
101+
) -> pandas.DataFrame:
101102
r"""
102103
Load BIOSCAN-5M metadata from its CSV file and prepare it for training.
103104
@@ -148,7 +149,7 @@ def load_bioscan5m_metadata(
148149
if dtype == MetadataDtype.DEFAULT:
149150
# Use our default column data types
150151
dtype = COLUMN_DTYPES
151-
df = pd.read_csv(metadata_path, dtype=dtype, **kwargs)
152+
df = pandas.read_csv(metadata_path, dtype=dtype, **kwargs)
152153
# Truncate the DNA barcodes to the specified length
153154
if max_nucleotides is not None:
154155
df["dna_barcode"] = df["dna_barcode"].str[:max_nucleotides]
@@ -260,6 +261,8 @@ class BIOSCAN5M(VisionDataset):
260261
If this is set to ``"text"``, the target(s) will each be returned as a string,
261262
appropriate for processing with language models.
262263
264+
.. versionadded:: 1.1.0
265+
263266
transform : Callable, default=None
264267
Image transformation pipeline.
265268
@@ -406,7 +409,33 @@ def __init__(
406409
def __len__(self):
407410
return len(self.metadata)
408411

409-
def __getitem__(self, index: int):
412+
def __getitem__(self, index: int) -> Tuple[Any, ...]:
413+
"""
414+
Get a sample from the dataset.
415+
416+
Parameters
417+
----------
418+
index : int
419+
Index of the sample to retrieve.
420+
421+
Returns
422+
-------
423+
image : PIL.Image.Image
424+
The image, if the ``"image"`` modality is requested, optionally transformed
425+
by the ``transform`` pipeline.
426+
427+
dna : str
428+
The DNA barcode, if the ``"dna"`` modality is requested, optionally
429+
transformed by the ``dna_transform`` pipeline.
430+
431+
target : int or Tuple[int, ...] or str or Tuple[str, ...] or None
432+
The target(s), optionally transformed by the ``target_transform`` pipeline.
433+
If ``target_format="index"``, the target(s) will be returned as integer
434+
indices, with missing values filled with ``-1``.
435+
If ``target_format="text"``, the target(s) will be returned as a string.
436+
If there are multiple targets, they will be returned as a tuple.
437+
If ``target_type`` is an empty list, the output ``target`` will be ``None``.
438+
"""
410439
sample = self.metadata.iloc[index]
411440
img_path = os.path.join(self.image_dir, sample["image_path"])
412441
values = []
@@ -550,7 +579,7 @@ def download(self) -> None:
550579
if "image" in self.modality:
551580
self._download_images()
552581

553-
def _load_metadata(self) -> pd.DataFrame:
582+
def _load_metadata(self) -> pandas.DataFrame:
554583
r"""
555584
Load metadata from CSV file and prepare it for training.
556585
"""

docs/source/api.rst

Lines changed: 2 additions & 0 deletions
@@ -6,6 +6,7 @@ BIOSCAN-1M Dataset
66

77
.. autoclass:: bioscan_dataset.BIOSCAN1M
88
:members:
9+
:special-members: __getitem__
910
:show-inheritance:
1011

1112
.. autofunction:: bioscan_dataset.load_bioscan1m_metadata
@@ -15,6 +16,7 @@ BIOSCAN-5M Dataset
1516

1617
.. autoclass:: bioscan_dataset.BIOSCAN5M
1718
:members:
19+
:special-members: __getitem__
1820
:show-inheritance:
1921

2022
.. autofunction:: bioscan_dataset.load_bioscan5m_metadata
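
The ``__getitem__`` docstrings added in this commit describe a tuple whose contents depend on the requested modalities and targets. The sketch below is a hypothetical stand-in that follows that description; ``assemble_sample`` and its arguments are assumptions made for illustration and do not reproduce the package's implementation.

```python
from typing import Any, Sequence, Tuple


def assemble_sample(
    image: Any,
    dna: str,
    targets: Sequence[Any],
    modality: Sequence[str] = ("image", "dna"),
) -> Tuple[Any, ...]:
    """Hypothetical stand-in for how a sample tuple could be put together."""
    values = []
    # Only include the inputs whose modality was requested.
    if "image" in modality:
        values.append(image)
    if "dna" in modality:
        values.append(dna)
    # No targets -> None; one target -> the bare value; several -> a tuple.
    if len(targets) == 0:
        target = None
    elif len(targets) == 1:
        target = targets[0]
    else:
        target = tuple(targets)
    values.append(target)
    return tuple(values)


print(assemble_sample("<image>", "ACGT", [240]))
print(assemble_sample("<image>", "ACGT", [12, 240], modality=("dna",)))
```

Under these assumptions, the target always sits in the last position of the tuple, which matches the ``dataset[0][-1]`` indexing used in the README examples.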
