
Add Dataset and Writer modules for the lmdb format #820

@RasmusOrsoe

Is your feature request related to a problem? Please describe.
lmdb is a memory-mapped database format that gives faster random access to individual training events than SQLite.

The main benefits of lmdb are faster query speeds, a lower storage footprint, and serialized data, which allows for an elegant implementation of storing arbitrary DataRepresentations (see #781).

For larger datasets, and particularly for datasets with large events, SQLite can become prohibitively slow and storage-heavy as a dataset format.

Describe the solution you'd like

  • An LMDBWriter that outputs events in lmdb format. It should accept a serialization method (msgpack, for example) and optionally a DataRepresentation; if one is given, representations are pre-computed and serialized using dill or a similar pickle-based method (see Graph construction before training #781). A field in the database should record which serializer was used, so that the file is self-contained and users can read it without prior knowledge of the serializer.
  • An LMDBDataset that is compatible with the lmdb database format. It should automatically detect which serializer was used, so the user doesn't have to guess, and should be able to retrieve pre-computed data representations.
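
To make the "self-contained file" idea concrete, here is a minimal sketch of how the serializer could be recorded alongside the events and detected again on read. It is not a proposed implementation: the names `SERIALIZERS`, `META_KEY`, `write_events` and `read_event` are hypothetical, a plain `dict` of bytes stands in for the lmdb key-value store, and the standard-library `json` and `pickle` stand in for msgpack and dill so that the example runs without extra dependencies.

```python
import json
import pickle
from typing import Any, Callable, Dict, List, MutableMapping, Tuple

# Hypothetical registry mapping a serializer name to (dumps, loads).
# A real writer might register msgpack or dill here instead.
SERIALIZERS: Dict[str, Tuple[Callable[[Any], bytes], Callable[[bytes], Any]]] = {
    "json": (
        lambda obj: json.dumps(obj).encode("utf-8"),
        lambda raw: json.loads(raw.decode("utf-8")),
    ),
    "pickle": (pickle.dumps, pickle.loads),
}

# Hypothetical reserved key under which the metadata record is stored.
META_KEY = b"__meta__"


def write_events(
    store: MutableMapping[bytes, bytes], events: List[Any], serializer: str
) -> None:
    """Serialize each event under its index and record the serializer name."""
    dumps, _ = SERIALIZERS[serializer]
    for index, event in enumerate(events):
        store[str(index).encode("ascii")] = dumps(event)
    # The metadata record is always JSON, so a reader can decode it
    # before it knows which serializer was used for the events.
    meta = {"serializer": serializer, "n_events": len(events)}
    store[META_KEY] = json.dumps(meta).encode("utf-8")


def read_event(store: MutableMapping[bytes, bytes], index: int) -> Any:
    """Look up the serializer from the metadata, then decode one event."""
    meta = json.loads(store[META_KEY].decode("utf-8"))
    _, loads = SERIALIZERS[meta["serializer"]]
    return loads(store[str(index).encode("ascii")])


# Usage: the reader never has to be told which serializer was used.
store: Dict[bytes, bytes] = {}
write_events(store, [{"charge": [0.5, 1.2]}, {"charge": [2.0]}], "pickle")
second_event = read_event(store, 1)
```

With an actual lmdb environment, the dictionary assignments and lookups would presumably be replaced by puts and gets inside a transaction; the metadata convention itself would stay the same.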


Labels: feature (New feature or request)
