-
Notifications
You must be signed in to change notification settings - Fork 1
MDC API
Beyond the possibility of decompressing mdc-compressed files with CLI, it's also possible to access the data from the programming language via an appropriate API. Before introducing details of specific APIs the general idea is presented. The first thing to do is always open the MDC file via the appropriate module (class in the case of C++, function + struct in the case of C). The handle (object/struct instance) to this file cannot be used directly to perform queries, but instead, it offers a mechanism to obtain another module (query engine) that may be used to query. Multiple query engines may be obtained from a single MDC file handle. To get a query engine, the programmer must provide a list of segment IDs (strings) and a list of atom IDs (ints). Both can be empty (if both are empty, the query engine will select all the atoms from the MDC file). Having query engine programmer may query for a list of frame IDs (resulting frames are in the order of this list). The distinction between a file handler and a query engine is for multithreaded purposes. There should be (usually) a single file handler for a given MDC file, and to perform queries from multiple threads, each thread should have its own query engine.
The C++ API is enclosed in mdc namespace.
There are two main classes:
-
reader- realizes MDC file handler -
query_engine- used to perform queries
-
reader(const std::string& path)- constructor that opens MDC file. In the case of failure, it sets an internal error message that may be checked with theget_current_errormethod -
const std::string& get_current_error() const- returns a current error message; if it's empty, it means there were no errors; in the opposite case, it returns the last error message; errors are never cleared and are considered unrecoverable (using areaderobject that is in erroneous state for anything other than destruction is undefined behavior) -
const std::vector<segment_desc_t>& get_segments() const- returns a description of segments in the MDC file; each segment is described by size (in number of atoms), and type (water, ios, etc.); details onsegment_desc_ttype are below -
uint32_t get_no_frames() const- returns the number of frames in MDC file -
const std::vector<uint32_t>& get_anchor_ids() const- returns the ids of anchors frames. Details: some of the frames in MDC file are so-called anchor frames, each anchor frame can be decompressed without the knowledge of previous frames, the knowledge of which are anchor frames may be utilized to build efficient parallel decompressor utilizing API (each thread should start on anchor frame) -
std::unique_ptr<query_engine> get_query_engine(const std::vector<std::string>& segment_ids, const std::vector<uint32_t>& atom_ids) const- returns a newquery_enginethat may be used to perform queries for specified segment IDs and atom IDs (1-base indexing). Ifsegment_idsandatom_idsare empty, all atoms are returned from queries.
-
const std::vector<uint32_t>& get_original_atom_ids() const- allows for remapping the atoms as ordered in the query result to the original numbering -
bool query(std::span<const uint32_t> frame_ids, query_result& result)- performs a query for atoms specified withget_query_enginefor a given frame IDs; the result of a query is represented asquery_resulttype, and its details are below, in the case of failure returnsfalseand details should be checked withget_current_errormethod -
bool query(std::initializer_list<uint32_t> frame_ids, query_result& result)- same as above, but allows to pass frame IDs asstd::initializer_list -
const std::string& get_current_error() const- returns a current error message; if it's empty, it means there were no errors; in the opposite case, it returns the last error message; errors are never cleared and are considered unrecoverable (using aquery_engineobject that is in erroneous state for anything other than destruction is undefined behavior)
It is a structure with the following fields
-
segment_type_t type- is an enum with segment type (one ofunknown,molecule,ion,water,none -
std::string name- the name of a segment -
uint32_t size- number of atoms in segment
query_result represents the result of a single query. It's a struct (with some private fields related to storage), that contains the following field:
-
std::vector<frame> frames- frames being the result of a query, typeframeis described below. The order of frames is the same as the order of frame IDs in the first parameter of the performedquerymethod.
frame is a struct with the following fields:
-
std::span<atom_coords> coords- coordinates of atoms (atom_coordsis as struct with three floats:x,y,z) -
int step- frame step -
float time- frame time -
float box[3][3]- frame box -
float prec- frame precision
Let's assume we have file example.cpp:
#include <iostream>
#include <string>
#include "mdc_reader.h"
std::string segment_type_to_str(mdc::segment_type type) {
switch (type) {
case mdc::segment_type::unknown: return "unknown";
case mdc::segment_type::molecule: return "molecule";
case mdc::segment_type::ion: return "ion";
case mdc::segment_type::water: return "water";
case mdc::segment_type::none: return "none";
}
return ""; // this should never happen
}
int main(int argc, char* argv[]) {
mdc::reader reader("example.mdc");
uint32_t no_frames = reader.get_no_frames();
std::cerr << "Segments: " << reader.get_segments().size() << ", Frames: " << no_frames << "\n";
for (const auto& segment : reader.get_segments())
std::cerr << "Segment: " << segment.name << ", size: " << segment.size << ", type: " << segment_type_to_str(segment.type) << "\n";
//query_engine to get all atoms
auto query_engine = reader.get_query_engine({}, {});
mdc::query_result result;
for (uint32_t frame_id = 0; frame_id < no_frames; frame_id++) {
if (!query_engine->query({ frame_id }, result)) {
std::cerr << "Error: " << query_engine->get_current_error();
return 1;
}
auto coords = result.frames[0].coords;
for (const auto & coord : coords)
std::cout << "[" << coord.x << ", " << coord.y << ", " << coord.z << "]\n";
std::cerr << "Frame time: " << result.frames[0].time << "\tNo. atoms: " << coords.size() << "\n";
}
return 0;
}
Compilation
To use an API, one needs to include the mdc_reader.h header file and link against libmdc.a.
mdc_reader.h is in src/mdc_lib/
libmdc.a is in bin as a result of running make.
Assuming the current working directory is the main root of the cloned repo and simple.cpp is also there, to compile and link the code above, one could use the following command:
g++ -Wall -O3 -std=c++20 -I src/mdc_lib/ -L bin simple.cpp -o simple -lmdc
A more advanced example of C++ API is in src/cpp_mdc_reader_example/main.cpp
MDC provides bindings for its API in a couple of languages.
There are two main typedef's:
-
mdc_reader- realizes MDC file handler -
mdc_query_engine- used to perform queries
-
mdc_reader* mdc_reader_open(const char* path)- opens MDC file and returns handle to it. In the case of failure, it sets an internal error message that may be checked with themdc_reader_get_errorfunction -
const char* mdc_reader_get_error(mdc_reader* reader)- returns a current error message; if it'sNULL, it means there were no errors; in the opposite case, it returns the last error message; errors are never cleared and are considered unrecoverable (using amdc_readerthat is in the erroneous state for anything other than freeing is undefined behavior) -
uint32_t mdc_get_no_segments(mdc_reader* reader)- returns the number of segments in the MDC file, -
mdc_segment_desc mdc_get_segment_desc(mdc_reader* reader, uint32_t index)- returns the description of theindex-th segment; each segment is described by size (in number of atoms), and type (water, ios, etc.); details onmdc_segment_desctype are below -
uint32_t mdc_get_no_frames(mdc_reader* reader)- returns the number of frames in MDC file -
uint32_t mdc_get_no_anchors(mdc_reader* reader)- returns the number of anchors in MDC file -
const uint32_t* mdc_get_anchor_ids(mdc_reader* reader)- returns the ids of anchors frames. Details: some of the frames in MDC file are so-called anchor frames, each anchor frame can be decompressed without the knowledge of previous frames, the knowledge of which are anchor frames may be utilized to build efficient parallel decompressor utilizing API (each thread should start on anchor frame), the size of the returned array is as returned bymdc_get_no_anchors -
void mdc_reader_close(mdc_reader* reader)- frees all resources associated with existingmdc_reader -
mdc_query_engine* mdc_get_query_engine(mdc_reader* reader, char** segment_ids, size_t num_segment_ids, uint32_t* atom_ids, size_t num_atom_ids)- returns a newmdc_query_enginethat may be used to perform queries for specified segment IDs and atom IDs (1-base indexing). Ifsegment_idsandatom_idsare empty, all atoms are returned from queries.
-
const uint32_t* mdc_query_engine_get_original_atom_ids(mdc_query_engine* engine)- allows for remapping the atoms as ordered in the query result to the original numbering -
int mdc_query(mdc_query_engine* engine, uint32_t* frame_ids, uint32_t n_frame_ids, mdc_query_result* result)- performs a query for atoms specified withmdc_get_query_enginefor a given frame IDs; the result of a query is represented asmdc_query_resulttype, and its details are below, in the case of failure returns1and details should be checked withmdc_query_engine_get_errorfunction, in the case of success returns0 -
const char* mdc_query_engine_get_error(mdc_query_engine* engine)- returns a current error message; if it'sNULL, it means there were no errors; in the opposite case, it returns the last error message; errors are never cleared and are considered unrecoverable (using amdc_query_enginethat is in the erroneous state for anything other than destruction is undefined behavior) -
void mdc_free_query_engine(mdc_query_engine* engine)- frees all resources associated with existingmdc_query_engine
It is a structure with the following fields
-
mdc_segment_type type- is an enum with segment type (one ofMD_COMPRESS_SEGMENT_TYPE_UNKNOWN,MD_COMPRESS_SEGMENT_TYPE_MOLECULE,MD_COMPRESS_SEGMENT_TYPE_ION,MD_COMPRESS_SEGMENT_TYPE_WATER,MD_COMPRESS_SEGMENT_TYPE_NONE -
const char* name- the name of a segment -
uint32_t size- number of atoms in segment
mdc_query_result represents the result of a single query. It's a struct, that contains the following field (any other should not be used):
-
mdc_frame* frames- frames being the result of a query, typemdc_frameis described below. The order of frames is the same as the order of frame IDs in the first parameter of the performedquerymethod. -
uint32_t n_frames- number of frames (size offramesfield)
To manage resources related to mdc_query_result the following functions should be used:
-
mdc_query_result* mdc_create_query_result(), void mdc_free_query_result(mdc_query_result* result)
frame is a struct with the following fields:
-
mdc_atom_coords* coords- coordinates of atoms (mdc_atom_coordsis as struct with three floats:x,y,z) -
uint32_t n_coords- number of coords (size ofcoordsfield) -
int step- frame step -
float time- frame time -
float box[3][3]- frame box -
float prec- frame precision
Let's assume we have file example.c:
#include <stdio.h>
#include "mdc_reader.h"
const char* segment_type_to_str(mdc_segment_type type) {
switch (type) {
case MD_COMPRESS_SEGMENT_TYPE_UNKNOWN: return "unknown";
case MD_COMPRESS_SEGMENT_TYPE_MOLECULE: return "molecule";
case MD_COMPRESS_SEGMENT_TYPE_ION: return "ion";
case MD_COMPRESS_SEGMENT_TYPE_WATER: return "water";
case MD_COMPRESS_SEGMENT_TYPE_NONE: return "none";
}
return ""; // this should never happen
}
int main(int argc, char* argv[]) {
int err_code = 0;
mdc_reader* reader = mdc_reader_open("example.mdc");
uint32_t no_frames = mdc_get_no_frames(reader);
fprintf(stderr, "Segments: %d, Frames: %d\n", mdc_get_no_segments(reader), no_frames);
for (uint32_t index = 0 ; index < mdc_get_no_segments(reader) ; ++index)
fprintf(stderr, "Segment: %s, size: %d, type: %s\n", mdc_get_segment_desc(reader, index).name, mdc_get_segment_desc(reader, index).size, segment_type_to_str(mdc_get_segment_desc(reader, index).type));
//query_engine to get all atoms
mdc_query_engine* query_engine = mdc_get_query_engine(reader, NULL, 0, NULL, 0);
mdc_query_result *result = mdc_create_query_result();
for (uint32_t frame_id = 0; frame_id < no_frames; frame_id++) {
if (mdc_query(query_engine, &frame_id, 1, result)) {
fprintf(stderr, "Error: %s\n", mdc_query_engine_get_error(query_engine));
err_code = 1;
goto cleanup;
}
mdc_atom_coords* coords = result->frames[0].coords;
for (uint32_t coord_id = 0 ; coord_id < result->frames[0].n_coords ; ++coord_id)
printf("[%f, %f, %f]\n", coords[coord_id].x, coords[coord_id].y, coords[coord_id].z);
fprintf(stderr, "Frame time: %f\tNo. atoms: %d\n", result->frames[0].time, result->frames[0].n_coords);
}
cleanup:
mdc_free_query_result(result);
mdc_free_query_engine(query_engine);
mdc_reader_close(reader);
return err_code;
}
Compilation
To use an API, one needs to include the mdc_reader.h header file and link against libmdc.a.
mdc_reader.h is in src/mdc_lib/
libmdc.a is in bin as a result of running make.
Assuming the current working directory is the main root of the cloned repo and simple.c is also there, to compile and link the code above, one could use the following command:
gcc -Wall -O3 -I src/mdc_lib/ -L bin simple.c -o simple -lmdc -lstdc++
A more advanced example of C API is in src/c_mdc_reader_example/main.c
Rust API is enclosed in rust_mdc module. There are two main structs:
-
Reader- realizes MDC file handler -
QueryEngine- used to perform queries
-
pub fn new(path: &str) -> Result<Self, String>- opens MDC file -
pub fn get_no_frames(&self) -> u32- returns the number of frames in MDC file -
pub fn get_segments(&self) -> Result<Vec<SegmentDesc>, String>- returns a description of segments in the MDC file; each segment is described by size (in number of atoms), and type (water, ios, etc.); details onSegmentDesctype are below -
pub fn get_anchor_ids(&self) -> Result<&[u32], String>- returns the ids of anchors frames. Details: some of the frames in MDC file are so-called anchor frames, each anchor frame can be decompressed without the knowledge of previous frames, the knowledge of which are anchor frames may be utilized to build an efficient parallel decompressor utilizing API (each thread should start on anchor frame) -
pub fn get_query_engine(&self, segment_ids: &[String], atom_ids: &[u32]) -> Result<QueryEngine, String>- returns a newQueryEnginethat may be used to perform queries for specified segment IDs and atom IDs (1-base indexing). If segment_ids and atom_ids are empty, all atoms are returned from queries.
-
pub fn get_original_atom_ids(&self) -> Result<&[u32], String>- allows for remapping the atoms as ordered in the query result to the original numbering -
pub fn query(&self, frame_ids: &[u32], result: &mut QueryResult) -> Result<(), String>- performs a query for atoms specified with get_query_engine for a given frame IDs; the result of a query is represented asQueryResulttype, and its details are below
It is a structure with the following fields
-
pub type_: SegmentTypeis an enum with segment type (one of Unknown, Molecule, Ion, Water, None_ -
pub name: String- the name of a segment -
pub size: u32- number of atoms in segment
QueryResult represents the result of a single query. It's a struct (with some private fields related to storage), that contains the following public field:
-
pub frames: Vec<Frame<'a>>- frames being the result of a query, typeFrameis described below. The order of frames is the same as the order of frame IDs in the first parameter of the performed query method.
It is a structure with the following fields
-
pub coords: &'a [MdcAtomCoords]- coordinates of atoms (MdcAtomCoordsis as struct with three floats: x, y, z) -
pub step: i32- frame step -
pub time: f32- frame time -
pub box_: [[f32; 3]; 3]- frame box -
pub prec: f32- frame precision
Let's assume we have the following file structure:
simple/
├── Cargo.toml
└── src
└── main.rs
Where simple/Cargo.toml:
[package]
name = "simple"
version = "0.1.0"
edition = "2021"
[dependencies]
rust_mdc = { git = "https://github.com/refresh-bio/mdcompress.git", subdir = "rust_mdc" }
simple/src/main.rs:
use rust_mdc::{Reader, QueryResult};
fn main() -> Result<(), Box<dyn std::error::Error>> {
let reader = Reader::new("example.mdc").map_err(|e| {
format!("Failed to open reader: {}", e)
})?;
let segments = reader.get_segments().map_err(|e| {
format!("Failed to get segments: {}", e)
})?;
let no_frames = reader.get_no_frames();
eprintln!("Segments: {}, Frames: {}", segments.len(), no_frames);
for segment in segments {
eprintln!("Segment: {}, size: {}, type: {}", segment.name, segment.size, segment.type_);
}
let engine = reader.get_query_engine(&Vec::<String>::new(), &Vec::<u32>::new()).map_err(|e| {
format!("Failed to get query engine: {}", e)
})?;
let mut result = QueryResult::new().map_err(|e| {
format!("Failed to create query result: {}", e)
})?;
let mut frame_ids: Vec<u32> = vec![0];
for frame_id in 0..no_frames {
frame_ids[0] = frame_id;
engine.query(&frame_ids, &mut result).map_err(|e| {
format!("Failed to query: {}", e)
})?;
let frame = &result.frames[0];
let total_atoms = frame.coords.len();
let coords = frame.coords;
for coord in coords {
println!("[{}, {}, {}]", coord.x, coord.y, coord.z)
}
eprintln!("Frame time: {}, No. atoms: {}", frame.time, total_atoms);
}
Ok(())
}
To run the example from inside simple directory (assure there is example.mdc file there!) it is enough to just run:
cargo run
A more advanced example of Rust API is in src/rust_mdc_reader_example/src/main.rs
Python binding was created using nanobind.
Python API requires numpy (may be installed with pip).
Python API is enclosed in mdc module. There are two main classes:
-
Reader- realizes MDC file handler -
QueryEngine- used to perform queries
-
mdc.Reader(path)- created new instance ofReader, also opens MDC file. In the case of failure, it sets an internal error message that may be checked with theget_current_error method -
get_current_error- returns a current error message; if it's empty, it means there were no errors; in the opposite case, it returns the last error message; errors are never cleared and are considered unrecoverable (using aReaderobject that is in erroneous state for anything other than destruction is undefined behavior) -
get_segments()- returns a description of segments in the MDC file; each segment is described by size (in number of atoms), name and type (water, ios, etc.); details on returnedSegmentDesctype below -
get_no_frames()- get_no_frames -
get_anchor_ids()- returns the ids of anchors frames. Details: some of the frames in MDC file are so-called anchor frames, each anchor frame can be decompressed without the knowledge of previous frames, the knowledge of which are anchor frames may be utilized to build efficient parallel decompressor utilizing API (each thread should start on anchor frame) -
get_query_engine(segment_ids, atom_ids)- returns a newQueryEnginethat may be used to perform queries for specified segment IDs and atom IDs (1-base indexing). Ifsegment_idsandatom_idsare emptylists, all atoms are returned from queries.
-
get_original_atom_ids()- allows for remapping the atoms as ordered in the query result to the original numbering -
query(frame_ids, result)- performs a query for atoms specified withget_query_enginefor a given frame IDs; the result of a query is represented asQueryResulttype, and its details are below, in the case of failure returns false and details should be checked withget_current_errormethod,frame_idsshould be specified vianp.array -
get_current_error()- returns a current error message; if it's empty, it means there were no errors; in the opposite case, it returns the last error message; errors are never cleared and are considered unrecoverable (using aQueryEngineobject that is in erroneous state for anything other than destruction is undefined behavior)
Describes single segment, contains:
-
type- enumeration (SegmentType), one of:unknown,molecule,ion,water,none -
name- the name of a segment -
size- number of atoms in segment
QueryResult represents the result of a single query. It contains:
-
frames- frames being the result of aquery, type of a single frame (Frame) is described below. The order of frames is the same as the order of frame IDs in the first parameter of the performed query method.
Frame contains:
-
coords- coordinates of atoms represented asnumpy.ndarraywith shape(NUM_ATOMS, 3) -
step- frame step -
time- frame time -
box- frame box (typenumpy.ndarraywith shape:(3, 3) -
prec- frame precision
Let's assume we have file example.py:
#!/bin/env python3
import mdc
import sys
import numpy as np
reader = mdc.Reader("example.mdc")
if reader.get_current_error():
print("Error: {}".format(reader.get_current_error()))
sys.exit(1)
no_frames = reader.get_no_frames()
print(f"Segments: {reader.get_no_frames()}, Frames: {no_frames}")
for segment in reader.get_segments():
print(f"Segment: {segment.name}, size: {segment.size}, type: {segment.type}")
#query_engine to get all atoms
query_engine = reader.get_query_engine([], [])
query_result = mdc.QueryResult()
# queries require np.array
frame_ids = np.array([0])
for frame_id in range(no_frames):
frame_ids[0] = frame_id
if not query_engine.query(frame_ids, query_result):
print("Error: {}", query_engine.get_current_error())
sys.exit(1)
print(query_result.frames[0].coords)
print(f"Frame time: {query_result.frames[0].time}\tNo. atoms: {query_result.frames[0].coords.shape[0]}")
Compilation
To use the mdc Python module, it must be first compiled into a mdc*.so file for the appropriate Python version.
The easiest way to produce this file is to run:
git clone --recursive https://github.com/refresh-bio/mdcompress/
make python_mdc
For the above to succeed, the Python development package must be installed. To install it for Debina/Ubuntu, one may run:
sudo apt install python3-dev
And the appropriate *.so file will be in the bin directory.
It should be placed where Python interpreter can find it or specified with PYTHONPATH, for example (assuming current working directory is root of this repo):
PYTHONPATH=bin src/python_mdc_reader_example/main.py -i example.mdc -o out.bin
A more advanced example of Python API is in src/python_mdc_reader_example/main.py
Wasm binding was created using Embind. Currently, API usage is described as an example instead of a more formal API with a description of types/functions. In general, the API tries to mimic the C++ API. Quick code to run mdc decompression in the browser:
git clone --recursive https://github.com/refresh-bio/mdcompress/
cd mdcompress/wasm
./build.sh # will try to use em++, if not found will download emscripten and use it
cd test_project && python3 -m http.server 8080
If everything went OK, one may open http://localhost:8080/ in the browser to see the API in action.
This website allows users to select an MDC file that will be loaded into the browser's storage.
Currently, the API communicates via file paths (raw buffers are not supported), so the whole file must be loaded. This may be a limitation for larger files. (This may be changed in the future if needed).
This website uses Three.js for visualisation, which is probably not the right choice, but it's there only to demonstrate how MDC API may be used.
Below is the explanation of the most important part of accessing the MDC file from JavaScript.
The MDC file is assumed to be in local storage (FS) at /uploaded.mdc.
//helper function
function seg_type_to_str(type) {
if (type === Module.segment_type.molecule)
return "molecule";
else if (type === Module.segment_type.ion)
return "ion";
else if (type === Module.segment_type.water)
return "water";
else if (type === Module.segment_type.unknown)
return "unknown";
else if (type === Module.segment_type.none)
return "none";
}
//open MDC file and get basic info
let reader = new Module.reader("/uploaded.mdc");
let noFrames = reader.get_no_frames();
console.log("# frames: ", noFrames);
let segments = reader.get_segments();
let numSegments = segments.size();
for (let i = 0; i < segments.size(); i++) {
let seg = segments.get(i);
console.log("Segment", i, ":", "name =", seg.name, "size =", seg.size, "type =", seg_type_to_str(seg.type));
}
//which segments and/or atoms we want to query?
let segmentIds = new Module.vec_str(); //empty means all
let atomIds = new Module.vec_u32();
//get appropriate query_engine
let engine = reader.get_query_engine(segmentIds, atomIds);
//this will store the result of a single successful query
let queryResult = new Module.query_result();
//ids of frames we want to query (in this example, we query a single frame at a time)
let frameIds = new Module.vec_u32();
frameIds.push_back(0);
for (let frameId = 0; frameId < noFrames ; ++frameId) {
console.log("Frame id:", frameId);
frameIds.set(0, frameId);
if (!engine.query(frameIds, queryResult))
{
console.log("Error querying frame", frameId);
return;
}
let coords = queryResult.frames.get(0).coords();
let nAtoms = queryResult.frames.get(0).n_coords;
let time = queryResult.frames.get(0).time;
let box = queryResult.frames.get(0).box;
console.log(`Frame ${frameId}: Atoms ${nAtoms}, Time ${time}, Box ${box}`);
//lets print at most 10 first atoms
for (let i = 0 ; i < nAtoms && i < 10; i++)
{
//This is in the raw memory, so we need to calculate memory location based on coords ptr
let baseIndex = (coords / 4) + i * 3;
let x = Module.HEAPF32[baseIndex];
let y = Module.HEAPF32[baseIndex + 1];
let z = Module.HEAPF32[baseIndex + 2];
console.log(`Atom ${i}: x=${x}, y=${y}, z=${z}`);
}
}