During the LoG poster session, I collected these relevant feature requests:
Explicit Data Versioning
Keeping the dataset version stable throughout a research project is important. Currently, dataset versioning relies implicitly on the git history of urls.json. An explicit dataset version would be more useful: we could append each new version of a dataset to urls.json and let users specify the version at data-loading time.
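To make the request concrete, here is a minimal sketch of what explicit versioning could look like. The JSON layout, the dataset name, the example.com URLs, and the `resolve_url` helper are all hypothetical and only illustrate the idea; the actual urls.json schema may differ.

```python
import json

# Hypothetical urls.json layout: each dataset maps version strings to URLs.
# New versions are appended; existing entries are never rewritten.
URLS_JSON = """
{
  "example-dataset": {
    "1.0": "https://example.com/example-dataset-1.0.zip",
    "1.1": "https://example.com/example-dataset-1.1.zip"
  }
}
"""


def _version_key(version):
    # Compare "1.10" > "1.9" numerically rather than as strings.
    return tuple(int(part) for part in version.split("."))


def resolve_url(urls, name, version=None):
    """Return (version, url) for a dataset; default to the newest version."""
    versions = urls[name]
    if version is None:
        version = max(versions, key=_version_key)
    if version not in versions:
        available = ", ".join(sorted(versions, key=_version_key))
        raise KeyError(f"{name} has no version {version!r}; available: {available}")
    return version, versions[version]


urls = json.loads(URLS_JSON)
pinned = resolve_url(urls, "example-dataset", version="1.0")
latest = resolve_url(urls, "example-dataset")
```

Pinning `version="1.0"` in a project keeps results reproducible even after newer versions are appended, while omitting it follows the latest release.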
Dataset-specific Preprocessing (Feature Extractor)
Some datasets (e.g. molecular ones) are small before preprocessing (~50 MB) but can expand massively after featurization (~2 GB). We might want to add a user-defined step (an abstraction layer) between downloading and data loading that preprocesses the dataset locally. Perhaps we should declare this step in metadata.json and cache the processed dataset after the first load.
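One possible shape for this step is sketched below, assuming the user supplies a `preprocess` callable and that the processed result can be pickled. The `load_with_cache` name, the cache file naming, and the use of pickle are illustrative assumptions, not part of the current codebase.

```python
import pickle
from pathlib import Path


def load_with_cache(raw_path, cache_dir, preprocess, version="1.0"):
    """Run a user-defined preprocess step once and reuse the cached result.

    raw_path:   path to the small downloaded file
    cache_dir:  local directory for the (possibly much larger) processed data
    preprocess: hypothetical user-defined callable, Path -> picklable object
    version:    included in the cache name so versions never share a cache
    """
    raw_path = Path(raw_path)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{raw_path.stem}-{version}.processed.pkl"

    # Later loads skip featurization and read the cached result directly.
    if cache_file.exists():
        with cache_file.open("rb") as f:
            return pickle.load(f)

    # First load: featurize locally, then store the result for next time.
    processed = preprocess(raw_path)
    with cache_file.open("wb") as f:
        pickle.dump(processed, f)
    return processed
```

Only the small raw file needs to be hosted and downloaded; the expensive featurization happens once per machine and per dataset version.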
Support of PyG
Many Graph ML projects are developed with PyG (possibly because PyG has been around longer). As our next step, we plan to implement a PyG data-loading pipeline.
We thank Ladislav, Remy, Song, Semih, and everyone else who gave us valuable feedback.