🔨 Dataset Additions: CSV data, custom validation set, dataset filtering and splitting support #2239
Conversation
…mes to test filter and split classes
Thanks for the huge effort.
Here's an initial (partial) round of review.
In general, the logic might still be a bit hard to follow for newcomers. This is inevitable because of the many scenarios that we need to cover, but at least we should make sure that it is thoroughly covered in the documentation.
src/anomalib/data/base/datamodule.py
Outdated
@property
def category(self) -> str:
    """Get the category of the datamodule."""
    return self._category

@category.setter
def category(self, category: str) -> None:
    """Set the category of the datamodule."""
    self._category = category
Are these used anywhere? It might be a bit confusing because not all datasets consist of multiple categories.
I think this part is used for saving the images to the filesystem. Maybe we could address it in another PR, as the scope would otherwise expand.
mapping = {
    "none": SplitMode.AUTO,
    "from_dir": SplitMode.PREDEFINED,
    "synthetic": SplitMode.SYNTHETIC,
    "same_as_test": SplitMode.AUTO,
    "from_train": SplitMode.AUTO,
    "from_test": SplitMode.AUTO,
}
This mapping will not lead to the exact same behaviour between the new version and legacy versions. Not sure how big of an issue this is, but it's something to be aware of. Maybe we could include it in the warning message.
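One way to surface that in the warning, as a sketch only (the helper name and the exact wording are assumptions, and `SplitMode` is assumed to be in scope):

```python
import warnings

# Hypothetical helper built around the mapping shown above.
LEGACY_TO_SPLIT_MODE = {
    "none": SplitMode.AUTO,
    "from_dir": SplitMode.PREDEFINED,
    "synthetic": SplitMode.SYNTHETIC,
    "same_as_test": SplitMode.AUTO,
    "from_train": SplitMode.AUTO,
    "from_test": SplitMode.AUTO,
}


def resolve_legacy_split_mode(legacy_value: str) -> SplitMode:
    """Map a legacy split-mode string to the new SplitMode, warning about behavioural drift."""
    new_mode = LEGACY_TO_SPLIT_MODE[legacy_value]
    warnings.warn(
        f"'{legacy_value}' is deprecated and is mapped to SplitMode.{new_mode.name}. "
        "Note that this mapping does not reproduce the legacy behaviour exactly, e.g. "
        "'same_as_test' previously cloned the test set, whereas AUTO lets the datamodule decide.",
        DeprecationWarning,
        stacklevel=2,
    )
    return new_mode
```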
src/anomalib/data/base/datamodule.py
Outdated
# Check validation set
if hasattr(self, "val_data") and not (self.val_data.has_normal and self.val_data.has_anomalous):
    msg = "Validation set should contain both normal and abnormal images."
This may be too strict. Some users may not have access to abnormal images at training time, but may still benefit from running a validation sequence on normal images for adaptive thresholding. (The adaptive threshold value in this case will default to the highest anomaly score predicted over the normal validation images, which turns out to be a not-too-bad estimate in absence of anomalous samples).
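A sketch of how the check could be relaxed along these lines (attribute names follow the snippet above; the exact messages are assumptions):

```python
# Require normal validation images, but only warn when anomalous ones are missing,
# since adaptive thresholding can fall back to the highest anomaly score over normal samples.
if hasattr(self, "val_data"):
    if not self.val_data.has_normal:
        msg = "Validation set must contain normal images."
        raise ValueError(msg)
    if not self.val_data.has_anomalous:
        logger.warning(
            "Validation set contains no anomalous images. The adaptive threshold will default "
            "to the highest anomaly score predicted over the normal validation images.",
        )
```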
src/anomalib/data/base/datamodule.py
Outdated
# Check test set
if hasattr(self, "test_data") and not (self.test_data.has_normal and self.test_data.has_anomalous):
    msg = "Test set should contain both normal and abnormal images."
This may also be too strict. In some papers the pixel-level performance is reported over only the anomalous images of the test set. While this may not be the best practice, I think we should support it for those users that want to use this approach.
src/anomalib/data/base/datamodule.py
Outdated
)
elif self.val_split_mode == SplitMode.SYNTHETIC:
    logger.info("Generating synthetic val set.")
    self.val_data = SyntheticAnomalyDataset.from_dataset(self.train_data)
I think we need to split the dataset first. Otherwise the training set and the validation set will consist of the same images.
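A sketch of what that could look like (assuming a `random_split` utility and a `val_split_ratio` attribute are available; both names are assumptions here):

```python
elif self.val_split_mode == SplitMode.SYNTHETIC:
    logger.info("Generating synthetic val set.")
    # Carve a portion out of the training set first so that train and val do not
    # share images, then apply synthetic anomaly generation to the held-out part only.
    self.train_data, normal_val_data = random_split(self.train_data, self.val_split_ratio)
    self.val_data = SyntheticAnomalyDataset.from_dataset(normal_val_data)
```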
src/anomalib/data/base/datamodule.py
Outdated
)
elif self.test_split_mode == SplitMode.SYNTHETIC:
    logger.info("Generating synthetic test set.")
    self.test_data = SyntheticAnomalyDataset.from_dataset(self.train_data)
Same as above. We need to split the train set first, to ensure that the train and test sets are mutually exclusive.
Codecov Report

Additional details and impacted files:

@@ Coverage Diff @@
##       feature/dataset-improvements    #2239   +/-   ##
===============================================================
  Coverage                          ?   78.26%
===============================================================
  Files                             ?      310
  Lines                             ?    13502
  Branches                          ?        0
===============================================================
  Hits                              ?    10568
  Misses                            ?     2934
  Partials                          ?        0
===============================================================
Thanks for the massive effort. At the risk of adding more work, I do have a few comments.
src/anomalib/data/video/avenue.py
Outdated
# Avenue dataset does not provide a validation set
# Auto behaviour is to clone the test set as validation set.
if self.val_split_mode == SplitMode.AUTO:
    self.val_data = self.test_data.clone()
We should probably inform users about the selection. Maybe logger.info("Using testing data for validation")
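For example, something like this for the snippet above (the exact message is only a suggestion):

```python
if self.val_split_mode == SplitMode.AUTO:
    # Avenue does not ship a validation split; fall back to the test set and tell the user.
    logger.info("Avenue dataset does not provide a validation set. Using test data for validation.")
    self.val_data = self.test_data.clone()
```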
assert all(filtered_samples.iloc[i]["image_path"] == f"image_{indices[i]}.jpg" for i in range(len(indices)))


def test_filter_by_ratio(sample_classification_dataframe: pd.DataFrame) -> None:
Should we also test edge cases like ratio = 0 and ratio = 1?
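For instance, a sketch along these lines (whether the filter accepts the exact boundary values 0 and 1 is an assumption):

```python
import pytest


@pytest.mark.parametrize("ratio", [0.0, 1.0])
def test_filter_by_ratio_edge_cases(sample_classification_dataframe: pd.DataFrame, ratio: float) -> None:
    """Filtering by a boundary ratio should return an empty or a full set of samples."""
    dataset_filter = DatasetFilter(sample_classification_dataframe)
    filtered_samples = dataset_filter.apply(by=ratio, seed=42)
    assert len(filtered_samples) == int(ratio * len(sample_classification_dataframe))
```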
"""Test filtering by count.""" | ||
dataset_filter = DatasetFilter(sample_segmentation_dataframe) | ||
count = 50 | ||
filtered_samples = dataset_filter.apply(by=count, seed=42) |
Also, should we test the label_aware filter by count?
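Something like the following could cover it (the `label_aware` keyword and the `label_index` column are assumptions based on the surrounding tests, not confirmed API):

```python
def test_label_aware_filter_by_count(sample_segmentation_dataframe: pd.DataFrame) -> None:
    """Label-aware filtering by count should roughly preserve the label distribution."""
    dataset_filter = DatasetFilter(sample_segmentation_dataframe)
    filtered_samples = dataset_filter.apply(by=50, seed=42, label_aware=True)
    assert len(filtered_samples) == 50
    # Each label should keep approximately its original share of the samples.
    original_share = sample_segmentation_dataframe["label_index"].value_counts(normalize=True)
    filtered_share = filtered_samples["label_index"].value_counts(normalize=True)
    assert (original_share - filtered_share).abs().max() < 0.1
```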
src/anomalib/data/base/dataset.py
Outdated
return copy.deepcopy(self)


# Alias for copy method
clone = copy
What's the advantage of defining this here?
- Updated validation split modes to 'auto' and set ratios to 'null' in avenue.yaml, btech.yaml, datumaro.yaml, shanghaitech.yaml, and ucsd_ped.yaml.
- Changed test split modes to 'predefined' and set test split ratios to 'null' in btech.yaml, kolektor.yaml, mvtec.yaml, and visa.yaml.
- Adjusted the val_split_ratio in folder.yaml to 0.5 for consistency.

These changes standardize the configuration settings for validation and test splits across datasets, enhancing maintainability and clarity.
Thanks!
SYNTHETIC = "synthetic"
PREDEFINED = "predefined"
AUTO = "auto"
Should we have `auto`? Currently it will perform differently based on the dataset: in `auto` mode it clones the test data for validation. Since Kolektor does not have a val set, cloning could also just be `predefined`. But maybe I am confused about the intent of `predefined`.
Warning:
    Usage with legacy split modes is deprecated and will be removed in
    version 1.3. Please update your code to use ``SplitMode`` directly when
`2.3`? or `3.0`?
I'm closing this PR due to the following reasons:
For those who would like to access these changes, they will still be here.
## 📝 Description

This PR introduces the following changes:

### CSV Data Support

### New Splitting Mechanism via `SplitMode`

`TestSplitMode` and `ValSplitMode` have some duplication and are overall a bit confusing to use. Instead, we introduce `SplitMode` to standardise the splitting mechanism across each subset, as sketched below.

### Dataset Filtering

### Dataset Splitting
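As a rough illustration of how splitting is configured under the new mechanism (a sketch only; the import path and the `Folder` arguments shown are assumptions and may differ from the final API in this PR):

```python
from anomalib.data import Folder
from anomalib.data.utils import SplitMode  # assumed import path

datamodule = Folder(
    name="bottle",
    root="datasets/bottle",
    normal_dir="good",
    abnormal_dir="broken",
    val_split_mode=SplitMode.AUTO,         # let anomalib derive a validation set automatically
    test_split_mode=SplitMode.PREDEFINED,  # use the existing directory structure as the test set
)
datamodule.setup()
```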
## ✨ Changes

Select what type of change your PR is:

## ✅ Checklist

Before you submit your pull request, please make sure you have completed the following steps:

For more information about code review checklists, see the Code Review Checklist.