🔨 Dataset Additions: CSV data, custom validation set, dataset filtering and splitting support #2239
Conversation
…mes to test filter and split classes
Thanks for the huge effort.
Here's an initial (partial) round of review.
In general, the logic might still be a bit hard to follow for newcomers. This is inevitable because of the many scenarios that we need to cover, but at least we should make sure that it is thoroughly covered in the documentation.
src/anomalib/data/base/datamodule.py
Outdated
@property
def category(self) -> str:
    """Get the category of the datamodule."""
    return self._category

@category.setter
def category(self, category: str) -> None:
    """Set the category of the datamodule."""
    self._category = category
Are these used anywhere? It might be a bit confusing because not all datasets consist of multiple categories.
I think this part is used for saving the images to the filesystem. Maybe we could address it in another PR, as the scope would otherwise expand.
mapping = {
    "none": SplitMode.AUTO,
    "from_dir": SplitMode.PREDEFINED,
    "synthetic": SplitMode.SYNTHETIC,
    "same_as_test": SplitMode.AUTO,
    "from_train": SplitMode.AUTO,
    "from_test": SplitMode.AUTO,
}
This mapping will not lead to the exact same behaviour between the new version and legacy versions. Not sure how big of an issue this is, but it's something to be aware of. Maybe we could include it in the warning message.
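One way to surface that in the warning, as a sketch only (the helper name and the exact wording are assumptions, and `SplitMode` is assumed to be in scope):

```python
import warnings

# Hypothetical helper built around the mapping shown above.
LEGACY_TO_SPLIT_MODE = {
    "none": SplitMode.AUTO,
    "from_dir": SplitMode.PREDEFINED,
    "synthetic": SplitMode.SYNTHETIC,
    "same_as_test": SplitMode.AUTO,
    "from_train": SplitMode.AUTO,
    "from_test": SplitMode.AUTO,
}


def resolve_legacy_split_mode(legacy_value: str) -> SplitMode:
    """Map a legacy split-mode string to the new SplitMode, warning about behavioural drift."""
    new_mode = LEGACY_TO_SPLIT_MODE[legacy_value]
    warnings.warn(
        f"'{legacy_value}' is deprecated and is mapped to SplitMode.{new_mode.name}. "
        "Note that this mapping does not reproduce the legacy behaviour exactly, e.g. "
        "'same_as_test' previously cloned the test set, whereas AUTO lets the datamodule decide.",
        DeprecationWarning,
        stacklevel=2,
    )
    return new_mode
```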
src/anomalib/data/base/datamodule.py
Outdated
# Check validation set
if hasattr(self, "val_data") and not (self.val_data.has_normal and self.val_data.has_anomalous):
    msg = "Validation set should contain both normal and abnormal images."
This may be too strict. Some users may not have access to abnormal images at training time, but may still benefit from running a validation sequence on normal images for adaptive thresholding. (The adaptive threshold value in this case will default to the highest anomaly score predicted over the normal validation images, which turns out to be a not-too-bad estimate in absence of anomalous samples).
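A sketch of how the check could be relaxed along these lines (attribute names follow the snippet above; the exact messages are assumptions):

```python
# Require normal validation images, but only warn when anomalous ones are missing,
# since adaptive thresholding can fall back to the highest anomaly score over normal samples.
if hasattr(self, "val_data"):
    if not self.val_data.has_normal:
        msg = "Validation set must contain normal images."
        raise ValueError(msg)
    if not self.val_data.has_anomalous:
        logger.warning(
            "Validation set contains no anomalous images. The adaptive threshold will default "
            "to the highest anomaly score predicted over the normal validation images.",
        )
```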
src/anomalib/data/base/datamodule.py
Outdated
# Check test set
if hasattr(self, "test_data") and not (self.test_data.has_normal and self.test_data.has_anomalous):
    msg = "Test set should contain both normal and abnormal images."
This may also be too strict. In some papers the pixel-level performance is reported over only the anomalous images of the test set. While this may not be the best practice, I think we should support it for those users that want to use this approach.
src/anomalib/data/base/datamodule.py
Outdated
)
elif self.val_split_mode == SplitMode.SYNTHETIC:
    logger.info("Generating synthetic val set.")
    self.val_data = SyntheticAnomalyDataset.from_dataset(self.train_data)
I think we need to split the dataset first. Otherwise the training set and the validation set will consist of the same images.
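A sketch of what that could look like (assuming a `random_split` utility and a `val_split_ratio` attribute are available; both names are assumptions here):

```python
elif self.val_split_mode == SplitMode.SYNTHETIC:
    logger.info("Generating synthetic val set.")
    # Carve a portion out of the training set first so that train and val do not
    # share images, then apply synthetic anomaly generation to the held-out part only.
    self.train_data, normal_val_data = random_split(self.train_data, self.val_split_ratio)
    self.val_data = SyntheticAnomalyDataset.from_dataset(normal_val_data)
```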
src/anomalib/data/base/datamodule.py
Outdated
)
elif self.test_split_mode == SplitMode.SYNTHETIC:
    logger.info("Generating synthetic test set.")
    self.test_data = SyntheticAnomalyDataset.from_dataset(self.train_data)
Same as above. We need to split the train set first, to ensure that the train and test sets are mutually exclusive.
Codecov Report

Additional details and impacted files:

@@ Coverage Diff @@
##       feature/dataset-improvements    #2239   +/-   ##
===============================================================
  Coverage                          ?   78.26%
===============================================================
  Files                             ?      310
  Lines                             ?    13502
  Branches                          ?        0
===============================================================
  Hits                              ?    10568
  Misses                            ?     2934
  Partials                          ?        0
===============================================================
Thanks for the massive effort. At the risk of adding more work, I do have a few comments.
src/anomalib/data/video/avenue.py
Outdated
# Avenue dataset does not provide a validation set
# Auto behaviour is to clone the test set as validation set.
if self.val_split_mode == SplitMode.AUTO:
    self.val_data = self.test_data.clone()
We should probably inform users about the selection. Maybe logger.info("Using testing data for validation")
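For example, something like this for the snippet above (the exact message is only a suggestion):

```python
if self.val_split_mode == SplitMode.AUTO:
    # Avenue does not ship a validation split; fall back to the test set and tell the user.
    logger.info("Avenue dataset does not provide a validation set. Using test data for validation.")
    self.val_data = self.test_data.clone()
```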
assert all(filtered_samples.iloc[i]["image_path"] == f"image_{indices[i]}.jpg" for i in range(len(indices)))


def test_filter_by_ratio(sample_classification_dataframe: pd.DataFrame) -> None:
Should we also test edge cases like ratio = 0 and ratio = 1?
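For instance, a sketch along these lines (whether the filter accepts the exact boundary values 0 and 1 is an assumption):

```python
import pytest


@pytest.mark.parametrize("ratio", [0.0, 1.0])
def test_filter_by_ratio_edge_cases(sample_classification_dataframe: pd.DataFrame, ratio: float) -> None:
    """Filtering by a boundary ratio should return an empty or a full set of samples."""
    dataset_filter = DatasetFilter(sample_classification_dataframe)
    filtered_samples = dataset_filter.apply(by=ratio, seed=42)
    assert len(filtered_samples) == int(ratio * len(sample_classification_dataframe))
```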
"""Test filtering by count.""" | ||
dataset_filter = DatasetFilter(sample_segmentation_dataframe) | ||
count = 50 | ||
filtered_samples = dataset_filter.apply(by=count, seed=42) |
Also, should we test the label_aware filter by count?
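Something like the following could cover it (the `label_aware` keyword and the `label_index` column are assumptions based on the surrounding tests, not confirmed API):

```python
def test_label_aware_filter_by_count(sample_segmentation_dataframe: pd.DataFrame) -> None:
    """Label-aware filtering by count should roughly preserve the label distribution."""
    dataset_filter = DatasetFilter(sample_segmentation_dataframe)
    filtered_samples = dataset_filter.apply(by=50, seed=42, label_aware=True)
    assert len(filtered_samples) == 50
    # Each label should keep approximately its original share of the samples.
    original_share = sample_segmentation_dataframe["label_index"].value_counts(normalize=True)
    filtered_share = filtered_samples["label_index"].value_counts(normalize=True)
    assert (original_share - filtered_share).abs().max() < 0.1
```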
src/anomalib/data/base/dataset.py
Outdated
return copy.deepcopy(self)


# Alias for copy method
clone = copy
What's the advantage of defining this here?
- Updated validation split modes to 'auto' and set ratios to 'null' in avenue.yaml, btech.yaml, datumaro.yaml, shanghaitech.yaml, and ucsd_ped.yaml.
- Changed test split modes to 'predefined' and set test split ratios to 'null' in btech.yaml, kolektor.yaml, mvtec.yaml, and visa.yaml.
- Adjusted the val_split_ratio in folder.yaml to 0.5 for consistency.

These changes standardize the configuration settings for validation and test splits across datasets, enhancing maintainability and clarity.
Thanks!
SYNTHETIC = "synthetic"
PREDEFINED = "predefined"
AUTO = "auto"
Should we have `auto`? Currently it will perform differently based on the dataset: in `auto` mode it clones the test data for validation. Since Kolektor does not have a val set, cloning could also just be `predefined`. But maybe I am confused about the intent of `predefined`.
Warning:
    Usage with legacy split modes is deprecated and will be removed in
    version 1.3. Please update your code to use ``SplitMode`` directly when
`2.3`? or `3.0`?
I'm closing this PR due to the following reasons:
For those who would like to access these changes, they will still be here.
## 📝 Description

This PR introduces the following changes:

### CSV Data Support

### New Splitting Mechanism via `SplitMode`

`TestSplitMode` and `ValSplitMode` have some duplication and are overall a bit confusing to use. Instead, we introduce `SplitMode` to standardise the splitting mechanism across each subset, as sketched below.

### Dataset Filtering

### Dataset Splitting
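As a rough illustration of how splitting is configured under the new mechanism (a sketch only; the import path and the `Folder` arguments shown are assumptions and may differ from the final API in this PR):

```python
from anomalib.data import Folder
from anomalib.data.utils import SplitMode  # assumed import path

datamodule = Folder(
    name="bottle",
    root="datasets/bottle",
    normal_dir="good",
    abnormal_dir="broken",
    val_split_mode=SplitMode.AUTO,         # let anomalib derive a validation set automatically
    test_split_mode=SplitMode.PREDEFINED,  # use the existing directory structure as the test set
)
datamodule.setup()
```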
## ✨ Changes

Select what type of change your PR is:

## ✅ Checklist

Before you submit your pull request, please make sure you have completed the following steps:

For more information about code review checklists, see the Code Review Checklist.