Explore alternatives to SQLite/Parquet for fast data access #834

@RasmusOrsoe

Description

Fast access to experiment file formats is essential. Alternatives such as LMDB or Hugging Face datasets may offer better performance in some scenarios. The current Parquet dataset is broken and unlikely to work satisfactorily. Benchmarking is needed to determine tradeoffs.

Potential candidates: LMDB, Hugging Face Datasets, memory-mapped .npy arrays (PolarBERT)

Some of these formats provide fast random access (like SQLite), while others are read sequentially and therefore require the data to be shuffled at write time. As a result, the user experience differs between the two. We should consider whether and how we can support both regimes.
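To make the two regimes concrete, here is a minimal standard-library sketch. It is only an illustration: the `events` table, the `(event_id, value)` record layout, and all function names are hypothetical and not taken from the existing codebase. A flat binary file stands in for any sequential-only format.

```python
import random
import sqlite3
import struct

# Hypothetical fixed-size record: (event_id: int64, value: float64).
RECORD = struct.Struct("<qd")


def write_random_access(path, events):
    """Regime 1: store events in SQLite so any event can be fetched by id."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE events (event_id INTEGER PRIMARY KEY, value REAL)")
    con.executemany("INSERT INTO events VALUES (?, ?)", events)
    con.commit()
    con.close()


def read_random_access(path, event_ids):
    """Fetch events in whatever order the caller (e.g. a sampler) requests."""
    con = sqlite3.connect(path)
    try:
        query = "SELECT event_id, value FROM events WHERE event_id = ?"
        return [con.execute(query, (i,)).fetchone() for i in event_ids]
    finally:
        con.close()


def write_shuffled_sequential(path, events, seed=0):
    """Regime 2: shuffle once at write time, since readers cannot reorder."""
    shuffled = list(events)
    random.Random(seed).shuffle(shuffled)
    with open(path, "wb") as f:
        for event_id, value in shuffled:
            f.write(RECORD.pack(event_id, value))


def read_sequential(path):
    """Stream records front to back; the order was fixed when writing."""
    with open(path, "rb") as f:
        while chunk := f.read(RECORD.size):
            yield RECORD.unpack(chunk)
```

In the first regime the training loop chooses the order on every epoch. In the second the order is baked into the file, so changing it means rewriting the data (or shuffling within a read buffer).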

Acceptance Criteria

  • Benchmark storage footprint and query speeds
  • Assess feasibility of storing data representations, not just raw data
  • Document benchmarking results and recommendations
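As a starting point for the first criterion, a benchmark harness could look roughly like the sketch below. It compares SQLite against a memory-mapped flat file using only the standard library. The record layout, table name, and function names are assumptions for illustration. A real benchmark would use actual event data and the candidate libraries (LMDB, Hugging Face Datasets, .npy memmaps), and should repeat the timings.

```python
import mmap
import os
import random
import sqlite3
import struct
import tempfile
import time

# Hypothetical fixed-size record: (event_id: int64, value: float64).
RECORD = struct.Struct("<qd")


def build_sqlite(path, n):
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE events (event_id INTEGER PRIMARY KEY, value REAL)")
    con.executemany(
        "INSERT INTO events VALUES (?, ?)", ((i, i * 0.5) for i in range(n))
    )
    con.commit()
    con.close()


def build_flat(path, n):
    with open(path, "wb") as f:
        for i in range(n):
            f.write(RECORD.pack(i, i * 0.5))


def time_sqlite(path, ids):
    """Time point lookups by primary key."""
    con = sqlite3.connect(path)
    query = "SELECT event_id, value FROM events WHERE event_id = ?"
    start = time.perf_counter()
    rows = [con.execute(query, (i,)).fetchone() for i in ids]
    elapsed = time.perf_counter() - start
    con.close()
    return rows, elapsed


def time_flat(path, ids):
    """Time lookups by byte offset into a memory-mapped file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = time.perf_counter()
            rows = [RECORD.unpack_from(mm, i * RECORD.size) for i in ids]
            elapsed = time.perf_counter() - start
    return rows, elapsed


def run_benchmark(n=10_000, n_queries=1_000, seed=0):
    """Return storage footprint and mean random-read latency per backend."""
    rng = random.Random(seed)
    ids = [rng.randrange(n) for _ in range(n_queries)]
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "events.db")
        flat_path = os.path.join(tmp, "events.bin")
        build_sqlite(db_path, n)
        build_flat(flat_path, n)
        sql_rows, sql_time = time_sqlite(db_path, ids)
        flat_rows, flat_time = time_flat(flat_path, ids)
        # Both backends must return the same data for the comparison to be fair.
        assert sql_rows == flat_rows
        return {
            "sqlite": {
                "bytes": os.path.getsize(db_path),
                "seconds_per_query": sql_time / n_queries,
            },
            "flat_mmap": {
                "bytes": os.path.getsize(flat_path),
                "seconds_per_query": flat_time / n_queries,
            },
        }
```

Calling `run_benchmark()` returns a dictionary that can go straight into the results write-up. The numbers from such a toy setup say nothing about real event data, so they should only be used to check that the harness works.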

Labels: good first issue (good for newcomers), help wanted (extra attention is needed), high priority (this issue or pull request needs immediate resolution)
