load_dataset defaults to json file format for datasets with 1 shard #7650

Open
@iPieter

Describe the bug

I currently have multiple datasets (train+validation) saved as 50MB shards. For one dataset, the validation split is small enough to fit into a single shard, and this apparently causes problems when loading it. I created each dataset as a DatasetDict, saved it as 50MB arrow files for streaming, and then loaded it with load_dataset. I have no problem loading any of the other datasets, which have more than 1 arrow file/shard per split.

The error indicates that the training set was loaded in arrow format (correct) and the validation set in json format (incorrect). This seems to happen because some of the metadata files are treated as data files.

Error loading /nfs/dataset_pt-uk: Couldn't infer the same data file format for all splits. Got {NamedSplit('train'): ('arrow', {}), NamedSplit('validation'): ('json', {})} 

Concretely, there is a mismatch between the metadata files written by DatasetDict.save_to_disk and the metadata files the builder behind datasets.load_dataset expects:

"dataset_info.json",

The folder_based_builder lists all files, and with only 1 arrow file the json files (which are actually metadata) are in the majority. Its own list of metadata filenames does not include them:

METADATA_FILENAMES: list[str] = ["metadata.csv", "metadata.jsonl", "metadata.parquet"]
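To illustrate the majority effect described above, here is a minimal sketch of extension-based format inference. It is a simplified model, not the library's actual code: `infer_format` and `EXTENSION_TO_FORMAT` are hypothetical names, and the file names are my assumption of what a split directory written by save_to_disk contains.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical mapping of extensions to format names, for illustration only.
EXTENSION_TO_FORMAT = {".arrow": "arrow", ".json": "json"}


def infer_format(filenames):
    """Return the format whose extension occurs most often, or None."""
    counts = Counter(
        EXTENSION_TO_FORMAT[PurePosixPath(name).suffix]
        for name in filenames
        if PurePosixPath(name).suffix in EXTENSION_TO_FORMAT
    )
    return counts.most_common(1)[0][0] if counts else None


# Assumed contents of one split directory written by save_to_disk.
metadata = ["dataset_info.json", "state.json"]
train_files = [f"data-{i:05d}-of-00003.arrow" for i in range(3)] + metadata
validation_files = ["data-00000-of-00001.arrow"] + metadata

print(infer_format(train_files))       # arrow (3 arrow files vs 2 json files)
print(infer_format(validation_files))  # json  (1 arrow file vs 2 json files)
```

Under these assumptions, any split with a single shard is outvoted by its own metadata files, which would match the mixed ('arrow', 'json') result in the error message.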

Steps to reproduce the bug

Create a dataset with metadata and 1 arrow file in validation and multiple arrow files in the training set, following the above description. In my case, I saved the files via:

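    from datasets import DatasetDict

    # train_dataset and val_dataset are existing datasets.Dataset objects;
    # val_dataset is small enough to fit into a single 50MB shard.
    dataset = DatasetDict({
        'train': train_dataset,
        'validation': val_dataset,
    })

    dataset.save_to_disk(output_path, max_shard_size="50MB")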

Expected behavior

The dataset loads, with both the train and validation splits read in arrow format.

Environment info

  • datasets version: 3.6.0
  • Platform: Linux-6.14.0-22-generic-x86_64-with-glibc2.41
  • Python version: 3.12.7
  • huggingface_hub version: 0.31.1
  • PyArrow version: 18.1.0
  • Pandas version: 2.2.3
  • fsspec version: 2024.6.1
