One subset per file in repo ? #7066

Open
@lhoestq

Right now we consider all the files of a dataset to be the same data, e.g.

single_subset_dataset/
├── train0.jsonl
├── train1.jsonl
└── train2.jsonl

But in cases like the one below, each file is actually a different subset of the dataset and should be loaded separately:

many_subsets_dataset/
├── animals.jsonl
├── trees.jsonl
└── metadata.jsonl

It would be nice to detect those subsets automatically using a simple heuristic. For example, we could group files together if their path names are identical except for some digits?
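
To make the idea concrete, here is a minimal sketch of such a heuristic in plain Python. It is not existing `datasets` code: the function name `group_files_into_subsets` is hypothetical, and replacing every run of digits in the whole path with `*` is just one possible reading of "the same except some digits".

```python
import re
from collections import defaultdict


def group_files_into_subsets(paths):
    """Group file paths that are identical once runs of digits are ignored.

    Hypothetical helper for illustration only, not part of the `datasets` API.
    """
    groups = defaultdict(list)
    for path in paths:
        # Replace each run of digits with "*" so that "train0.jsonl" and
        # "train12.jsonl" end up with the same key "train*.jsonl".
        key = re.sub(r"\d+", "*", path)
        groups[key].append(path)
    # Sort the files inside each group for a deterministic result.
    return {key: sorted(files) for key, files in groups.items()}


single = [
    "single_subset_dataset/train0.jsonl",
    "single_subset_dataset/train1.jsonl",
    "single_subset_dataset/train2.jsonl",
]
many = [
    "many_subsets_dataset/animals.jsonl",
    "many_subsets_dataset/trees.jsonl",
    "many_subsets_dataset/metadata.jsonl",
]

single_groups = group_files_into_subsets(single)
many_groups = group_files_into_subsets(many)
```

With the two layouts above, `single_groups` contains one group holding all three `train*.jsonl` files, while `many_groups` contains three groups of one file each. Note that this sketch also treats digits in directory names and extensions as wildcards, which may or may not be desirable.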
