Fix: drop spurious label column in audiofolder/imagefolder datasets #3268

Open
neha222222 wants to merge 1 commit into huggingface:main from neha222222:fix/drop-labels-audiofolder-imagefolder

@neha222222

Related issue: #3014

@neha222222
Author

Hi @severo, could you please review this pull request?

@severo severo requested a review from lhoestq November 26, 2025 08:34
@lhoestq
Member

lhoestq commented Nov 26, 2025

Hi! The viewer shows the same data as the datasets library, so if there is a wrong column, it should probably be handled in datasets. The logic that adds the labels column is defined here: https://github.com/huggingface/datasets/blob/c97e757836d16d4083ae057b03e22747c2ffe477/src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py#L144-L147

I'm not sure why it would add a label column full of None values, though.

@neha222222
Author

Hi @lhoestq, thank you for the review!
I investigated the root cause and you're right: the issue is in the datasets library. Here's what I found.
In folder_based_builder.py, the labels set is accumulated across all splits (line 77). When a dataset uses directories for splits (train/test) rather than for classes:

  1. The analyze() function collects directory names from all splits
  2. labels = {"train", "test"} → len(labels) > 1 → add_labels = True
  3. A spurious label column is created with split names as class labels
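
The steps above can be reproduced with a small standalone sketch. It does not call the datasets library; `infer_add_labels` and the file paths are hypothetical stand-ins that only mirror the behavior described here (parent directory names collected into one set shared by all splits, then `len(labels) > 1` deciding whether to add a label column).

```python
import posixpath


def infer_add_labels(files_by_split):
    """Hypothetical stand-in for the label inference described above.

    Collects the parent directory name of every file into ONE set that is
    shared by all splits, then decides whether to add a label column.
    """
    labels = set()  # shared across all splits, which is the source of the bug
    for files in files_by_split.values():
        for path in files:
            # The parent directory name is treated as a candidate class label.
            labels.add(posixpath.basename(posixpath.dirname(path)))
    add_labels = len(labels) > 1
    return labels, add_labels


# Directories encode splits here, not classes (illustrative paths).
files_by_split = {
    "train": ["train/a.wav", "train/b.wav"],
    "test": ["test/c.wav"],
}

labels, add_labels = infer_add_labels(files_by_split)
print(sorted(labels), add_labels)  # prints: ['test', 'train'] True
```

With a single split the set holds one name, so `add_labels` stays False; the spurious column in this sketch only appears once two or more split directories are pooled together.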

I will submit a fix PR in the datasets library.

@neha222222
Author

Hi @lhoestq, @severo,
I have raised the issue and a fix in the datasets repo: #7881
