Merge pull request #11 from bioscan-ml/doc_improvements

scottclowe · web-flow · commit a5a14565ccfd · 2025-03-20T01:10:48.000-04:00
DOC: Documentation improvements
diff --git a/README.rst b/README.rst
@@ -19,8 +19,9 @@ In this package, we provide PyTorch/torchvision style dataset classes to load th
 BIOSCAN-1M and 5M are large multimodal datasets for insect biodiversity monitoring, containing over 1 million and 5 million specimens, respectively.
 The datasets are comprised of RGB microscopy images, DNA barcodes, and fine-grained, hierarchical taxonomic labels.
 Every sample has both an image and a DNA barcode, but the taxonomic labels are incomplete and only extend all the way to the species level for around 9% of the specimens.
+For more details about the datasets, please see the `BIOSCAN-1M paper <BS1M-paper_>`_ and `BIOSCAN-5M paper <BS5M-paper_>`_, respectively.
 
-Documentation, including the full API details, is available online at readthedocs_.
+Documentation about this package, including the full API details, is available online at readthedocs_.
 
 
 Installation
@@ -38,14 +39,14 @@ To install the package, run:
 Usage
 -----
 
-The datasets can be used in the same way as PyTorch's torchvision datasets.
+The datasets can be used in the same way as PyTorch's `torchvision datasets <https://pytorch.org/vision/main/datasets.html#built-in-datasets>`_.
 For example, to load the BIOSCAN-1M dataset:
 
 .. code-block:: python
 
    from bioscan_dataset import BIOSCAN1M
 
-   dataset = BIOSCAN1M(root="~/Datasets/bioscan/bioscan-1m/")
+   dataset = BIOSCAN1M(root="~/Datasets/bioscan/")
 
    for image, dna_barcode, label in dataset:
        # Do something with the image, dna_barcode, and label
@@ -57,7 +58,7 @@ To load the BIOSCAN-5M dataset:
 
    from bioscan_dataset import BIOSCAN5M
 
-   dataset = BIOSCAN5M(root="~/Datasets/bioscan/bioscan-5m/")
+   dataset = BIOSCAN5M(root="~/Datasets/bioscan/")
 
    for image, dna_barcode, label in dataset:
        # Do something with the image, dna_barcode, and label
@@ -79,21 +80,19 @@ This can be performed by setting the argument ``download=True``:
 
 .. code-block:: python
 
-   dataset = BIOSCAN5M(root="~/Datasets/bioscan/bioscan-5m/", download=True)
+   dataset = BIOSCAN5M(root="~/Datasets/bioscan/", download=True)
 
 To use a different image package, follow the download instructions given in the `BIOSCAN-5M repository <https://github.com/bioscan-ml/BIOSCAN-5M?tab=readme-ov-file#dataset-access>`_, then set the argument ``image_package`` to the desired package name, e.g.
 
 .. code-block:: python
 
    # Manually download original_full from
    # https://drive.google.com/drive/u/1/folders/1Jc57eKkeiYrnUBc9WlIp-ZS_L1bVlT-0
-   # and unzip the 5 zip files into ~/Datasets/bioscan/bioscan-5m/bioscan5m/images/original_full/
+   # and unzip the 5 zip files into ~/Datasets/bioscan/bioscan5m/images/original_full/
    # Then load the dataset as follows:
-   dataset = BIOSCAN5M(
-       root="~/Datasets/bioscan/bioscan-5m/", image_package="original_full"
-   )
+   dataset = BIOSCAN5M(root="~/Datasets/bioscan/", image_package="original_full")
 
-For BIOSCAN-1M, automatic dataset download is not supported and so the dataset must be manually downloaded.
+For `BIOSCAN1M <BS1M-class_>`_, automatic dataset download is not supported and so the dataset must be manually downloaded.
 See the `BIOSCAN-1M repository <https://github.com/bioscan-ml/BIOSCAN-1M?tab=readme-ov-file#-dataset-access>`_ for download instructions.
 
 
@@ -107,7 +106,7 @@ For example, to load the validation split:
 
 .. code-block:: python
 
-   dataset = BIOSCAN5M(root="~/Datasets/bioscan/bioscan-5m/", split="val")
+   dataset = BIOSCAN5M(root="~/Datasets/bioscan/", split="val")
 
 In the BIOSCAN-5M dataset, the dataset is partitioned so there are ``train``, ``val``, and ``test`` splits to use for closed-world tasks (seen species), and ``key_unseen``, ``val_unseen``, and ``test_unseen`` splits to use for open-world tasks (unseen species).
 These partitions only use samples labelled to species-level.
@@ -150,34 +149,34 @@ This can be changed by setting the argument ``input_modality`` to either ``"imag
 
 .. code-block:: python
 
-   dataset = BIOSCAN5M(root="~/Datasets/bioscan/bioscan-5m/", modality="image")
+   dataset = BIOSCAN5M(root="~/Datasets/bioscan/", modality="image")
 
 or ``"dna"``:
 
 .. code-block:: python
 
-   dataset = BIOSCAN5M(root="~/Datasets/bioscan/bioscan-5m/", modality="dna")
+   dataset = BIOSCAN5M(root="~/Datasets/bioscan/", modality="dna")
 
 
 Target selection
 ~~~~~~~~~~~~~~~~
 
 The target label can be selected by setting the argument ``target_type`` to be either a taxonomic label or ``dna_bin``.
 The DNA BIN is similar in granularity to subspecies, but was generated by clustering the DNA barcodes instead of morphology.
-The default target is ``"family"`` for BIOSCAN1M and ``"species"`` for BIOSCAN5M.
+The default target is ``"family"`` for `BIOSCAN1M <BS1M-class_>`_ and ``"species"`` for `BIOSCAN5M <BS5M-class_>`_.
 
 The target can be a single label, e.g.
 
 .. code-block:: python
 
-   dataset = BIOSCAN5M(root="~/Datasets/bioscan/bioscan-5m/", target_type="genus")
+   dataset = BIOSCAN5M(root="~/Datasets/bioscan/", target_type="genus")
 
 or a list of labels, e.g.
 
 .. code-block:: python
 
    dataset = BIOSCAN5M(
-       root="~/Datasets/bioscan/bioscan-5m/", target_type=["genus", "species", "dna_bin"]
+       root="~/Datasets/bioscan/", target_type=["genus", "species", "dna_bin"]
    )
 
 By default, the target values will be provided as integer indices that map to the labels for that taxonomic rank (with value ``-1`` used for missing labels), appropriate for training a classification model with cross-entropy.
@@ -188,13 +187,13 @@ If this is set to ``target_format="text"``, the output will instead be the raw l
 
    # Default target format is "index"
    dataset = BIOSCAN5M(
-       root="~/Datasets/bioscan/bioscan-5m/", target_type="species", target_format="index"
+       root="~/Datasets/bioscan/", target_type="species", target_format="index"
    )
    assert dataset[0][-1] == 240
 
    # Using target format "text"
    dataset = BIOSCAN5M(
-       root="~/Datasets/bioscan/bioscan-5m/", target_type="species", target_format="text"
+       root="~/Datasets/bioscan/", target_type="species", target_format="text"
    )
    assert dataset[0][-1] == "Gnamptogenys sulcata"
 
@@ -230,7 +229,7 @@ The dataset class supports the use of data transforms for the image and DNA barc
    )
    # Load the dataset with the transforms applied for each sample
    ds_train = BIOSCAN5M(
-       root="~/Datasets/bioscan/bioscan-5m/",
+       root="~/Datasets/bioscan/",
        split="train",
        transform=image_transform,
        dna_transform=dna_transform,
@@ -241,7 +240,7 @@ Size and geolocation metadata
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 The BIOSCAN-5M dataset also contains insect size and geolocation metadata.
-Loading this metadata is not yet supported by the BIOSCAN5M pytorch dataset class.
+Loading this metadata is not yet supported by the `BIOSCAN5M <BS5M-class_>`_ PyTorch dataset class.
 In the meantime, users of the dataset are welcome to explore this metadata themselves.
 
 
@@ -305,6 +304,8 @@ If you make use of the BIOSCAN-1M or BIOSCAN-5M datasets in your research, pleas
 .. _PyPI: https://pypi.org/project/bioscan-dataset/
 .. _readthedocs: https://bioscan-dataset.readthedocs.io
 .. _pip: https://pip.pypa.io/
+.. _BS1M-class: https://bioscan-dataset.readthedocs.io/en/latest/api.html#bioscan_dataset.BIOSCAN1M
+.. _BS5M-class: https://bioscan-dataset.readthedocs.io/en/latest/api.html#bioscan_dataset.BIOSCAN5M
 
 .. |PyPI badge| image:: https://img.shields.io/pypi/v/bioscan-dataset.svg
    :target: PyPI_
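Reviewer note on the ``root`` change in the README hunks above: the examples now pass the shared parent directory (``~/Datasets/bioscan/``) rather than a per-dataset directory, and the download comment places images under ``bioscan5m/images/original_full/`` inside it. The snippet below is only a minimal sketch of that layout, not the package's implementation. ``expected_image_dir`` is a hypothetical helper, and the ``bioscan5m/images/<image_package>`` structure is inferred solely from the comment in the diff.

```python
import os


def expected_image_dir(root, image_package, dataset_dirname="bioscan5m"):
    """Hypothetical helper: build the directory where images are assumed to live.

    The layout <root>/<dataset_dirname>/images/<image_package> is an assumption
    taken from the README comment, not from the package source.
    """
    # Expand a leading "~" so the path can be used with os.path.isdir and friends
    root = os.path.expanduser(root)
    return os.path.join(root, dataset_dirname, "images", image_package)


# With the new-style root, both datasets can share one parent directory
path = expected_image_dir("~/Datasets/bioscan/", "original_full")
print(path)
```

A user who previously passed ``root="~/Datasets/bioscan/bioscan-5m/"`` would, under this assumed layout, need to either keep that root or move the ``bioscan5m`` directory up one level.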
diff --git a/bioscan_dataset/bioscan1m.py b/bioscan_dataset/bioscan1m.py
@@ -10,8 +10,9 @@
 
 import os
 from enum import Enum
+from typing import Any, Tuple
 
-import pandas as pd
+import pandas
 import PIL
 import torch
 from torchvision.datasets.vision import VisionDataset
@@ -88,7 +89,7 @@ def load_bioscan1m_metadata(
     partitioning_version="large_diptera_family",
     dtype=MetadataDtype.DEFAULT,
     **kwargs,
-) -> pd.DataFrame:
+) -> pandas.DataFrame:
     r"""
     Load BIOSCAN-1M metadata from its TSV file, and prepare it for training.
 
@@ -140,13 +141,13 @@ def load_bioscan1m_metadata(
 
     Returns
     -------
-    df : pd.DataFrame
+    df : pandas.DataFrame
         The metadata DataFrame.
     """
     if dtype == MetadataDtype.DEFAULT:
         # Use our default column data types
         dtype = COLUMN_DTYPES
-    df = pd.read_csv(metadata_path, sep="\t", dtype=dtype, **kwargs)
+    df = pandas.read_csv(metadata_path, sep="\t", dtype=dtype, **kwargs)
     # Taxonomic label column names
     label_cols = [
         "phylum",
@@ -175,7 +176,7 @@ def load_bioscan1m_metadata(
         df = df.sort_index()
     # Convert missing values to NaN
     for c in label_cols:
-        df.loc[df[c] == "not_classified", c] = pd.NA
+        df.loc[df[c] == "not_classified", c] = pandas.NA
     # Fix some tribe labels which were only partially applied
     df.loc[df["genus"].notna() & (df["genus"] == "Asteia"), "tribe"] = "Asteiini"
     df.loc[df["genus"].notna() & (df["genus"] == "Nemorilla"), "tribe"] = "Winthemiini"
@@ -245,8 +246,8 @@ class BIOSCAN1M(VisionDataset):
         Note that the barcode should only be 660 base pairs long.
         Characters beyond this length are unlikely to be accurate.
 
-    target_type : str, default="family"
-        Type of target to use. One of:
+    target_type : str or Iterable[str], default="family"
+        Type of target to use. One of, or a list of:
 
         - ``"phylum"``
         - ``"class"``
@@ -271,6 +272,8 @@ class BIOSCAN1M(VisionDataset):
         If this is set to ``"text"``, the target(s) will each be returned as a string,
         appropriate for processing with language models.
 
+        .. versionadded:: 1.1.0
+
     transform : Callable, default=None
         Image transformation pipeline.
 
@@ -339,7 +342,33 @@ def __init__(
     def __len__(self):
         return len(self.metadata)
 
-    def __getitem__(self, index: int):
+    def __getitem__(self, index: int) -> Tuple[Any, ...]:
+        """
+        Get a sample from the dataset.
+
+        Parameters
+        ----------
+        index : int
+            Index of the sample to retrieve.
+
+        Returns
+        -------
+        image : PIL.Image.Image
+            The image, if the ``"image"`` modality is requested, optionally transformed
+            by the ``transform`` pipeline.
+
+        dna : str
+            The DNA barcode, if the ``"dna"`` modality is requested, optionally
+            transformed by the ``dna_transform`` pipeline.
+
+        target : int or Tuple[int, ...] or str or Tuple[str, ...] or None
+            The target(s), optionally transformed by the ``target_transform`` pipeline.
+            If ``target_format="index"``, the target(s) will be returned as integer
+            indices, with missing values filled with ``-1``.
+            If ``target_format="text"``, the target(s) will be returned as a string.
+            If there are multiple targets, they will be returned as a tuple.
+            If ``target_type`` is an empty list, the output ``target`` will be ``None``.
+        """
         sample = self.metadata.iloc[index]
         img_path = os.path.join(self.image_dir, f"part{sample['chunk_number']}", sample["image_file"])
         values = []
@@ -403,7 +432,7 @@ def _check_exists(self, verbose=0) -> bool:
             check_all &= check
         return check_all
 
-    def _load_metadata(self) -> pd.DataFrame:
+    def _load_metadata(self) -> pandas.DataFrame:
         r"""
         Load metadata from TSV file and prepare it for training.
         """
diff --git a/bioscan_dataset/bioscan5m.py b/bioscan_dataset/bioscan5m.py
@@ -10,8 +10,9 @@
 
 import os
 from enum import Enum
+from typing import Any, Tuple
 
-import pandas as pd
+import pandas
 import PIL
 import torch
 from torchvision.datasets.utils import check_integrity, download_and_extract_archive
@@ -80,7 +81,7 @@ def get_image_path(row):
         The path to the image file.
     """
     image_path = row["split"] + os.path.sep
-    if pd.notna(row["chunk"]) and row["chunk"]:
+    if pandas.notna(row["chunk"]) and row["chunk"]:
         image_path += str(row["chunk"]) + os.path.sep
     image_path += row["processid"] + ".jpg"
     return image_path
@@ -97,7 +98,7 @@ def load_bioscan5m_metadata(
     split=None,
     dtype=MetadataDtype.DEFAULT,
     **kwargs,
-) -> pd.DataFrame:
+) -> pandas.DataFrame:
     r"""
     Load BIOSCAN-5M metadata from its CSV file and prepare it for training.
 
@@ -148,7 +149,7 @@ def load_bioscan5m_metadata(
     if dtype == MetadataDtype.DEFAULT:
         # Use our default column data types
         dtype = COLUMN_DTYPES
-    df = pd.read_csv(metadata_path, dtype=dtype, **kwargs)
+    df = pandas.read_csv(metadata_path, dtype=dtype, **kwargs)
     # Truncate the DNA barcodes to the specified length
     if max_nucleotides is not None:
         df["dna_barcode"] = df["dna_barcode"].str[:max_nucleotides]
@@ -260,6 +261,8 @@ class BIOSCAN5M(VisionDataset):
         If this is set to ``"text"``, the target(s) will each be returned as a string,
         appropriate for processing with language models.
 
+        .. versionadded:: 1.1.0
+
     transform : Callable, default=None
         Image transformation pipeline.
 
@@ -406,7 +409,33 @@ def __init__(
     def __len__(self):
         return len(self.metadata)
 
-    def __getitem__(self, index: int):
+    def __getitem__(self, index: int) -> Tuple[Any, ...]:
+        """
+        Get a sample from the dataset.
+
+        Parameters
+        ----------
+        index : int
+            Index of the sample to retrieve.
+
+        Returns
+        -------
+        image : PIL.Image.Image
+            The image, if the ``"image"`` modality is requested, optionally transformed
+            by the ``transform`` pipeline.
+
+        dna : str
+            The DNA barcode, if the ``"dna"`` modality is requested, optionally
+            transformed by the ``dna_transform`` pipeline.
+
+        target : int or Tuple[int, ...] or str or Tuple[str, ...] or None
+            The target(s), optionally transformed by the ``target_transform`` pipeline.
+            If ``target_format="index"``, the target(s) will be returned as integer
+            indices, with missing values filled with ``-1``.
+            If ``target_format="text"``, the target(s) will be returned as a string.
+            If there are multiple targets, they will be returned as a tuple.
+            If ``target_type`` is an empty list, the output ``target`` will be ``None``.
+        """
         sample = self.metadata.iloc[index]
         img_path = os.path.join(self.image_dir, sample["image_path"])
         values = []
@@ -550,7 +579,7 @@ def download(self) -> None:
         if "image" in self.modality:
             self._download_images()
 
-    def _load_metadata(self) -> pd.DataFrame:
+    def _load_metadata(self) -> pandas.DataFrame:
         r"""
         Load metadata from CSV file and prepare it for training.
         """
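Reviewer note on the new ``__getitem__`` docstrings: the ``target`` return contract (integer indices with ``-1`` for missing labels, strings for ``target_format="text"``, a tuple for multiple targets, ``None`` for an empty ``target_type``) can be illustrated with a small stand-alone sketch. This is not the code in ``bioscan1m.py`` or ``bioscan5m.py``. ``format_targets`` and its ``vocabularies`` argument are hypothetical names introduced here, and how missing labels are represented in ``"text"`` format is not stated in the docstring, so the sketch simply passes them through.

```python
def format_targets(labels, vocabularies, target_format="index"):
    """Hypothetical illustration of the documented ``target`` return contract.

    labels : list of raw label strings (None marks a missing label)
    vocabularies : one list of known labels per entry in ``labels``
    """
    if target_format not in ("index", "text"):
        raise ValueError(f"Unknown target_format: {target_format!r}")
    # An empty target_type list yields no target at all
    if not labels:
        return None
    out = []
    for label, vocab in zip(labels, vocabularies):
        if target_format == "text":
            # Assumption: missing labels are passed through unchanged
            out.append(label)
        else:
            # Missing or unknown labels map to -1, as the docstring describes
            out.append(vocab.index(label) if label in vocab else -1)
    # A single target is returned bare, multiple targets as a tuple
    return out[0] if len(out) == 1 else tuple(out)


genus_vocab = ["Apis", "Gnamptogenys"]
species_vocab = ["Apis mellifera", "Gnamptogenys sulcata"]

single = format_targets(["Gnamptogenys"], [genus_vocab])
multi = format_targets(["Gnamptogenys", None], [genus_vocab, species_vocab])
text = format_targets(["Gnamptogenys sulcata"], [species_vocab], target_format="text")
empty = format_targets([], [])
```

Here ``single`` is an ``int``, ``multi`` is a tuple whose second entry is ``-1`` because the species label is missing, ``text`` is the raw string, and ``empty`` is ``None``.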
diff --git a/docs/source/api.rst b/docs/source/api.rst
@@ -6,6 +6,7 @@ BIOSCAN-1M Dataset
 
 .. autoclass:: bioscan_dataset.BIOSCAN1M
    :members:
+   :special-members: __getitem__
    :show-inheritance:
 
 .. autofunction:: bioscan_dataset.load_bioscan1m_metadata
@@ -15,6 +16,7 @@ BIOSCAN-5M Dataset
 
 .. autoclass:: bioscan_dataset.BIOSCAN5M
    :members:
+   :special-members: __getitem__
    :show-inheritance:
 
 .. autofunction:: bioscan_dataset.load_bioscan5m_metadata
diff --git a/docs/source/conf.py b/docs/source/conf.py