It seems that ScalableShardDataset can run os.walk() over the same folders, and then read the length of each file, potentially hundreds of times depending on the number of logical shards on a rank, because each StreamingDocDataset runs a full os.walk() in its setup():
```python
[d.setup() for d in self.data]
```

```python
for root, dirs, files in os.walk(datapath, topdown=False):
```
This can be especially inefficient when there are many files. Should os.walk() run only once, with its result reused across the logical shards? A minimal sketch of that idea follows.
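For illustration, one way this could work is to memoize the directory listing keyed by datapath, so the first setup() pays for the walk and every subsequent StreamingDocDataset reuses it. This is only a sketch, not the library's API: the helper name `_cached_file_listing` is hypothetical, and it assumes setup() ultimately needs just the file paths and sizes.

```python
import os
from functools import lru_cache

# Hypothetical helper: walk datapath once per process and cache the result,
# so every StreamingDocDataset sharing the same datapath reuses the listing.
@lru_cache(maxsize=None)
def _cached_file_listing(datapath: str) -> tuple[tuple[str, int], ...]:
    """Return (filepath, size) pairs for every file under datapath."""
    listing = []
    for root, dirs, files in os.walk(datapath, topdown=False):
        for name in files:
            path = os.path.join(root, name)
            listing.append((path, os.path.getsize(path)))
    # Return an immutable tuple so the cached value is safe to share.
    return tuple(listing)

# Inside StreamingDocDataset.setup(), the per-dataset walk would then become:
#     for path, size in _cached_file_listing(self.datapath):
#         ...
```

One caveat: a process-level cache like this assumes the files under datapath do not change between setup() calls, which should hold if all datasets are constructed once at startup.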