[Multimodal] Timeline for Creating Image-Text Dataloader

At the end of the PyTorch stateful distributed dataloader talk, it was mentioned that further multi-process development was needed to support multimodal datasets. What is the current stage of development for a dataloader that pairs images with their corresponding text captions?