generated from scottclowe/python-template-repo
Open
Labels
bug: Something isn't working
Description
Category indices are currently assigned in the order in which each category first appears in the dataset.
- This means the indices are ordered differently for BIOSCAN1M if the DNA barcodes are deduplicated (`reduce_repeated_barcodes=True`). N.B. This isn't an issue for BIOSCAN5M because the categories are declared before the deduplication step.
- It also means that a closed-world supervised classification model needs 22618 output logits to train at species prediction (essentially all of the 22622 species in the dataset), even though many of these indices belong to species in the open-world partition and will never occur in the closed-world setting.
It would be better if the category indices were ordered by partition: first all labels which appear in the test partition, then labels in val but not in test, then labels in train but not in val or test, then labels in unseen, then in other, then in pretrain. A model that only needs to produce valid test labels could then use just the first 3483 indices, and a model that produces all possible train labels could use the first 11846 indices.
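A minimal sketch of the proposed ordering, not the package's actual implementation. It assumes the samples are available as `(label, partition)` pairs and that the partition names match those listed above; the helper names `order_categories` and `build_index` are hypothetical, and the alphabetical tie-break within a partition is an assumption added only to make the result deterministic.

```python
# Partition priority proposed above: labels seen in earlier partitions get lower indices.
PARTITION_PRIORITY = ["test", "val", "train", "unseen", "other", "pretrain"]


def order_categories(samples):
    """Order labels by the highest-priority partition in which each appears.

    `samples` is an iterable of (label, partition) pairs. This input format is
    an assumption for illustration, not the dataset's actual schema.
    """
    rank = {name: i for i, name in enumerate(PARTITION_PRIORITY)}
    best = {}
    for label, partition in samples:
        r = rank[partition]
        # Keep the best (lowest) partition rank seen so far for this label
        if label not in best or r < best[label]:
            best[label] = r
    # Sort by partition rank, then alphabetically (assumed tie-break)
    return sorted(best, key=lambda label: (best[label], label))


def build_index(samples):
    """Map each label to its integer category index."""
    return {label: i for i, label in enumerate(order_categories(samples))}


# Toy data: the order of rows no longer affects the resulting indices.
samples = [
    ("sp_c", "train"),
    ("sp_a", "test"),
    ("sp_b", "val"),
    ("sp_a", "train"),
    ("sp_d", "unseen"),
    ("sp_b", "train"),
]
index = build_index(samples)
print(index)
```

With an ordering like this, the index no longer depends on row order, so removing duplicated rows can only change it if a label disappears from a partition altogether. A closed-world model could then size its output layer to the number of labels with rank at or below `train`.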