Species category indices are randomly ordered

Category indices are currently assigned in order of first appearance in the dataset. This has two consequences:
- The indices are ordered differently for BIOSCAN1M if the DNA barcodes are deduplicated (`reduce_repeated_barcodes=True`). N.B. This isn't an issue for BIOSCAN5M, because the categories are declared before the deduplication step.
- A closed-world supervised classification model needs 22618 output logits to train for species prediction (essentially all of the 22622 species in the dataset), even though many of these indices belong to species in the open-world partition and will never occur in the closed-world setting.

It would be better if the category indices were ordered by partition: first all labels that appear in the test partition, then labels in val but not in test, then labels in train but not in val or test, then labels in unseen, then in other, then in pretrain. A model that only needs to produce valid test labels could then use just the first 3483 indices, and a model that produces all possible train labels could use the first 11846 indices.
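A minimal sketch of the proposed ordering, in plain Python, might look like the following. The function name `order_categories`, the constant `PARTITION_PRIORITY`, and the toy labels are hypothetical and not part of the existing codebase. The partition names are taken from the description above, and the alphabetical tie-break within each group is an assumption, added so that the result does not depend on row order.

```python
# Hypothetical priority list: partition names as described in this issue.
PARTITION_PRIORITY = ["test", "val", "train", "unseen", "other", "pretrain"]


def order_categories(labels, partitions):
    """Order labels by the highest-priority partition in which each one appears."""
    rank = {name: i for i, name in enumerate(PARTITION_PRIORITY)}
    best = {}
    for label, partition in zip(labels, partitions):
        r = rank[partition]
        # Keep the smallest rank (highest priority) seen for this label.
        if label not in best or r < best[label]:
            best[label] = r
    # Assumption: ties within a group are broken alphabetically, so the
    # result is independent of the order of rows in the dataset.
    return sorted(best, key=lambda lab: (best[lab], lab))


# Toy data: one (label, partition) pair per sample.
labels = ["moth", "ant", "bee", "ant", "wasp", "moth", "fly"]
partitions = ["train", "val", "test", "train", "unseen", "val", "pretrain"]

categories = order_categories(labels, partitions)
label_to_index = {lab: i for i, lab in enumerate(categories)}
```

With indices assigned this way, the labels seen in test occupy a contiguous prefix of the index range, followed by those first seen in val, then train, so a closed-world model could size its output layer to the length of the relevant prefix.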

Species category indices are randomly ordered #21

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Species category indices are randomly ordered #21

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions