num_proc parallelization works only for first ~10s. #7518

Open
pshishodiaa opened this issue Apr 15, 2025 · 2 comments

Comments

@pshishodiaa
Describe the bug

When I load an already-downloaded dataset with num_proc=64, throughput is very high for the first 10-20 seconds, reaching 30-40K samples/s with 100% utilization on all cores, but it soon drops to <= 1000 samples/s with almost 0% utilization on most cores.

Steps to reproduce the bug

# Download the dataset with the CLI (notebook cell)
!huggingface-cli download --repo-type dataset timm/imagenet-1k-wds --max-workers 32

from datasets import load_dataset
ds = load_dataset("timm/imagenet-1k-wds", num_proc=64)

Expected behavior

100% core utilization throughout.

Environment info

Azure A100-80GB VM, 16 cores

@lhoestq
Member

lhoestq commented Apr 15, 2025

Hi, can you check whether the processes are still alive? It's a bit weird, because datasets does check whether processes crash and returns an error in that case.
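
One way to run that check from the same Python session is sketched below. It is a minimal, Linux-only sketch that reads /proc directly. The helper name list_child_processes is hypothetical and is not part of datasets. It assumes the workers are direct children of the process that called load_dataset. A state of D (uninterruptible sleep) usually means the process is waiting on I/O, and Z means it has exited but has not been reaped.

```python
import os


def list_child_processes(parent_pid):
    """Return (pid, state, name) for each direct child of parent_pid (Linux only)."""
    children = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(f"/proc/{entry}/stat", errors="replace") as f:
                stat = f.read()
        except OSError:
            continue  # process exited while we were scanning
        # Format: "pid (comm) state ppid ..."; comm may itself contain spaces or ')'.
        name = stat[stat.index("(") + 1 : stat.rindex(")")]
        fields = stat[stat.rindex(")") + 2 :].split()
        state, ppid = fields[0], int(fields[1])
        if ppid == parent_pid:
            children.append((int(entry), state, name))
    return children


if __name__ == "__main__":
    # Run this in the session that called load_dataset while loading is slow.
    for pid, state, name in list_child_processes(os.getpid()):
        print(pid, state, name)
```

If most workers are listed in state D or S while CPU utilization is near zero, they are alive but waiting rather than crashed.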

@pshishodiaa
Author

pshishodiaa commented Apr 15, 2025

Thank you for the quick reply. I dug a bit and realized that my disk's IOPS is also limited, which is causing this. I will check further and report whether the issue is on the HF datasets side or mine.
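
A rough way to get a feel for the disk's random-read rate is sketched below, using only the standard library. The function name measure_random_reads and its parameters are hypothetical. The number it returns is only indicative: reads served from the OS page cache will inflate it, so point it at a large file that has not been read recently, or use a dedicated tool such as fio for a reliable figure.

```python
import os
import random
import time


def measure_random_reads(path, block_size=4096, duration_s=2.0):
    """Return the approximate number of random block reads per second on path."""
    size = os.path.getsize(path)
    if size < block_size:
        raise ValueError("file is smaller than one block")
    fd = os.open(path, os.O_RDONLY)
    reads = 0
    start = time.perf_counter()
    try:
        while time.perf_counter() - start < duration_s:
            # Pick a random offset so that a full block fits inside the file.
            offset = random.randrange(0, size - block_size + 1)
            os.pread(fd, block_size, offset)
            reads += 1
    finally:
        os.close(fd)
    elapsed = time.perf_counter() - start
    return reads / elapsed
```

Calling it on one of the downloaded shard files while load_dataset is running, and again while it is idle, gives a crude indication of whether the disk is the limiting factor.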


2 participants