Description
Describe the bug
I have multiple datasets (train + validation) saved as 50MB shards. For one dataset, the validation split is small enough to fit into a single shard, and this apparently causes problems when loading the dataset. I created the datasets as a DatasetDict, saved them as 50MB arrow files for streaming, and then loaded each dataset. Loading any of the other datasets, which all have more than one arrow file/shard, works fine.
The error indicates that the training set was loaded in arrow format (correct) and the validation set in json (incorrect). This seems to be because some of the metadata files are treated as dataset files.
```
Error loading /nfs/dataset_pt-uk: Couldn't infer the same data file format for all splits. Got {NamedSplit('train'): ('arrow', {}), NamedSplit('validation'): ('json', {})}
```
Concretely, there is a mismatch between the metadata created by DatasetDict.save_to_disk and the builder used by datasets.load_dataset (see src/datasets/data_files.py, line 107 at commit e71b0b1). The folder_based_builder lists all files, and with only one arrow file the json files (which are actually metadata) are in the majority, so the split's format is inferred as json; see the sketch below.
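A minimal sketch of that majority vote; the file names below mirror a typical save_to_disk split directory and are assumptions for illustration, not copied from my run:

```python
from collections import Counter
from pathlib import PurePath

# Hypothetical listing of the single-shard validation directory
# (names follow the usual save_to_disk layout).
validation_files = [
    "validation/data-00000-of-00001.arrow",
    "validation/dataset_info.json",  # metadata, not data
    "validation/state.json",         # metadata, not data
]

# Majority vote over file extensions, analogous to what the
# folder_based_builder ends up doing when inferring the split format.
counts = Counter(PurePath(name).suffix.lstrip(".") for name in validation_files)
print(counts.most_common(1))  # [('json', 2)] -> json outvotes arrow 2 to 1
```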
Steps to reproduce the bug
Create a dataset with metadata, a single arrow file in the validation set, and multiple arrow files in the training set, following the description above. In my case, I saved the files via:
```python
dataset = DatasetDict({
    'train': train_dataset,
    'validation': val_dataset,
})
dataset.save_to_disk(output_path, max_shard_size="50MB")
```
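The error then appears when the saved directory is loaded with load_dataset rather than load_from_disk; a sketch of the failing call, assuming output_path is the directory from the snippet above:

```python
from datasets import load_dataset

# Goes through data-file format inference and fails, because the
# validation split is inferred as json (the metadata files outvote
# the single arrow shard).
dataset = load_dataset(output_path)
# raises: Couldn't infer the same data file format for all splits. ...
```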
Expected behavior
The dataset loads successfully, with both splits inferred as arrow.
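A possible workaround, sketched under the assumption that pointing the packaged "arrow" builder directly at the shards sidesteps the inference (not verified on the exact dataset above):

```python
from datasets import load_dataset

# Restrict data_files to the arrow shards so the json metadata
# files are never considered data files.
dataset = load_dataset(
    "arrow",
    data_files={
        "train": f"{output_path}/train/*.arrow",
        "validation": f"{output_path}/validation/*.arrow",
    },
)
```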
Environment info
- datasets version: 3.6.0
- Platform: Linux-6.14.0-22-generic-x86_64-with-glibc2.41
- Python version: 3.12.7
- huggingface_hub version: 0.31.1
- PyArrow version: 18.1.0
- Pandas version: 2.2.3
- fsspec version: 2024.6.1