Do not scan the same files multiple times in ScalableShardDataset #137

@rualark

Description

It seems that ScalableShardDataset can run os.walk() over the same folders, and then stat the length of each file, potentially hundreds of times, depending on the number of logical shards per rank, because each StreamingDocDataset runs a full os.walk() in its setup:

```python
[d.setup() for d in self.data]
```

```python
for root, dirs, files in os.walk(datapath, topdown=False)
```

This can be especially inefficient when there are many files. Should os.walk() run only once, with its result reused across all the datasets?
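
One possible shape for such a fix, sketched below, is to memoize the directory scan so that however many StreamingDocDataset instances share a datapath, the walk and the per-file size lookups happen only once. This is a minimal sketch, not the actual fms dataloader API: `scan_dataset` and its `(path, size)` return format are hypothetical names introduced here for illustration.

```python
import os
from functools import lru_cache


@lru_cache(maxsize=None)
def scan_dataset(datapath: str) -> tuple[tuple[str, int], ...]:
    """Walk datapath once and return (file_path, file_size) pairs.

    The result is cached per datapath, so any number of per-shard
    datasets sharing the same directory trigger a single os.walk()
    and a single round of file size lookups.
    """
    files = []
    for root, _dirs, names in os.walk(datapath, topdown=False):
        for name in names:
            path = os.path.join(root, name)
            files.append((path, os.path.getsize(path)))
    return tuple(files)
```

Each StreamingDocDataset's setup() would then call scan_dataset(datapath) instead of walking the tree itself. An alternative with the same effect would be for ScalableShardDataset to perform the walk once in its own setup and pass the resulting file list down to its logical shards.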
