Open
Labels
good first issue, help wanted, high priority
Description
Fast access to data stored in experiment file formats is essential. Alternatives such as LMDB or Hugging Face Datasets may offer better performance in some scenarios. The current Parquet dataset is broken and unlikely to work satisfactorily. Benchmarking is needed to determine the tradeoffs.
Potential candidates: LMDB, Hugging Face Datasets, memory-mapped .npy arrays (PolarBERT)
Some of these formats provide fast random access (like SQLite), while others are read sequentially and therefore require randomization at write time. As a result, the user experience differs between them. We should consider whether and how we can support both regimes.
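To make the two regimes concrete, here is a minimal sketch using memory-mapped .npy arrays only. The function names (`write_shuffled`, `read_random`, `read_sequential`) and the toy array are hypothetical and not part of any existing code base. The sketch only illustrates the difference between shuffling at read time through random indexing and shuffling once at write time followed by a sequential read.

```python
import os
import tempfile

import numpy as np


def write_shuffled(events: np.ndarray, path: str, seed: int = 0) -> None:
    """Shuffle rows once at write time so a later sequential read is already randomized."""
    rng = np.random.default_rng(seed)
    np.save(path, events[rng.permutation(len(events))])


def read_random(path: str, indices: np.ndarray) -> np.ndarray:
    """Random-access regime: memory-map the file and fetch arbitrary rows."""
    mm = np.load(path, mmap_mode="r")
    return np.asarray(mm[indices])


def read_sequential(path: str, batch_size: int):
    """Sequential regime: yield contiguous batches in on-disk order."""
    mm = np.load(path, mmap_mode="r")
    for start in range(0, len(mm), batch_size):
        yield np.asarray(mm[start:start + batch_size])


if __name__ == "__main__":
    # Toy data standing in for real events (assumption: fixed-size rows).
    events = np.arange(20, dtype=np.float32).reshape(10, 2)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "events.npy")
        write_shuffled(events, path, seed=1)
        print(read_random(path, np.array([3, 0, 7])).shape)
        print([len(batch) for batch in read_sequential(path, batch_size=4)])
```

In the random-access regime the reader decides the order on every epoch. In the sequential regime the order is fixed when the file is written, so reshuffling means rewriting the file or shuffling within a buffer.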
Acceptance Criteria
- Benchmark storage footprint and query speeds
- Assess feasibility of storing data representations, not just raw data
- Document benchmarking results and recommendations
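As a starting point for the first criterion, the sketch below compares on-disk size and random-read time for two of the options mentioned above: a SQLite table of binary blobs and a memory-mapped .npy array. It is only an assumption about how such a benchmark could be structured. The schema, the synthetic float32 data, and all function names are hypothetical, and a real benchmark would use actual experiment data, larger files, and cold-cache runs.

```python
import os
import sqlite3
import tempfile
import time

import numpy as np


def build_sqlite(path: str, data: np.ndarray) -> None:
    """Store each row as a BLOB keyed by an integer id (hypothetical schema)."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload BLOB)")
    con.executemany(
        "INSERT INTO events VALUES (?, ?)",
        ((i, row.tobytes()) for i, row in enumerate(data)),
    )
    con.commit()
    con.close()


def time_npy(path: str, indices: np.ndarray) -> float:
    """Time single-row reads from a memory-mapped .npy file."""
    mm = np.load(path, mmap_mode="r")
    start = time.perf_counter()
    for i in indices:
        _ = np.asarray(mm[i])
    return time.perf_counter() - start


def time_sqlite(path: str, indices: np.ndarray, dtype) -> float:
    """Time single-row reads by primary key, including decoding the BLOB."""
    con = sqlite3.connect(path)
    start = time.perf_counter()
    for i in indices:
        (blob,) = con.execute(
            "SELECT payload FROM events WHERE id = ?", (int(i),)
        ).fetchone()
        _ = np.frombuffer(blob, dtype=dtype)
    elapsed = time.perf_counter() - start
    con.close()
    return elapsed


def run_benchmark(n_rows=2000, n_features=16, n_queries=500, seed=0) -> dict:
    """Return on-disk bytes and random-read seconds for each format."""
    rng = np.random.default_rng(seed)
    data = rng.normal(size=(n_rows, n_features)).astype(np.float32)
    indices = rng.integers(0, n_rows, size=n_queries)
    with tempfile.TemporaryDirectory() as tmp:
        npy_path = os.path.join(tmp, "events.npy")
        db_path = os.path.join(tmp, "events.db")
        np.save(npy_path, data)
        build_sqlite(db_path, data)
        return {
            "npy_mmap": {
                "bytes": os.path.getsize(npy_path),
                "seconds": time_npy(npy_path, indices),
            },
            "sqlite": {
                "bytes": os.path.getsize(db_path),
                "seconds": time_sqlite(db_path, indices, data.dtype),
            },
        }


if __name__ == "__main__":
    for name, result in run_benchmark().items():
        print(name, result)
```

The same harness could be extended with LMDB and Hugging Face Datasets readers behind the same two measurements, so that results for all candidates land in one table.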