This organization hosts the GitHub repositories for the Medical Event Data Standard (MEDS), a simple dataset schema for machine learning over electronic health record (EHR) data. Unlike existing tools, pipelines, or common data models, MEDS is a minimal standard designed for maximum interoperability across datasets, existing tools, and model architectures. By providing a simple standardization layer between datasets and model-specific code, MEDS can help make machine learning research on EHR data dramatically more reproducible, robust, computationally performant, and collaborative.

Alongside this report, we also release several existing integrations with models, datasets, and tools, and we will work actively with the community on further adoption and use. See our draft proposal for more details, and please leave comments or questions via GitHub issues to help us improve this effort! The contribution guidelines are available here.
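To make "a simple dataset schema" more concrete, here is a minimal sketch of what a MEDS-style event table could look like in plain Python. The field names (`subject_id`, `time`, `code`, `numeric_value`) reflect our understanding of the core MEDS schema and should be treated as an assumption; check the Core MEDS repository for the authoritative definition. The `Event` class, the helper function, and the example codes are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One row of a hypothetical MEDS-style event table (field names are assumed)."""
    subject_id: int                        # which subject the observation belongs to
    time: Optional[datetime]               # when it was observed (None for static facts)
    code: str                              # what was observed (example codes are made up)
    numeric_value: Optional[float] = None  # optional measured value


def is_sorted_by_subject_and_time(events):
    """Return True if events are ordered by subject_id, then by time.

    Static rows (time=None) are expected before timed rows of the same subject.
    This ordering is an illustrative convention, not a statement of the standard.
    """
    def key(e):
        return (e.subject_id, e.time is not None, e.time or datetime.min)

    keys = [key(e) for e in events]
    return keys == sorted(keys)


events = [
    Event(1, None, "GENDER//F"),
    Event(1, datetime(2020, 1, 1, 8, 0), "LAB//creatinine", 1.1),
    Event(2, datetime(2019, 5, 3, 12, 30), "DIAGNOSIS//I10"),
]
print(is_sorted_by_subject_and_time(events))
```

The point of the sketch is that every observation, whatever its source table, reduces to the same small record shape, which is what would let dataset-specific extraction code and model-specific code meet at one interface.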
Project | Type | Documentation URL | Repository URL | Paper URL | Description |
---|---|---|---|---|---|
Core MEDS | Core | GitHub | GitHub | OpenReview | A data standard and community for building and sharing EHR machine learning tools |
MEDS-Reader | Package | Docs | GitHub | arXiv | An optimized Python package for efficient EHR data processing achieving 10-100x improvements in memory, speed, and disk usage |
MEDS-Transforms | Package | | GitHub | | A set of functions and scripts for extraction to and transformation/pre-processing of MEDS-formatted data. |
MEDS-Tab | Package | Docs | GitHub | | A library designed for automated tabularization, data preparation with aggregations and time windowing. |
ACES | Package | Docs | GitHub | arXiv | A package and configuration language for reproducible extraction of task cohorts for machine learning over event-stream datasets |
MEDS-Torch | Package | Docs | GitHub | | Advancing healthcare machine learning through flexible, robust, and scalable sequence modeling tools. |
MEDS-Evaluation | Package | | GitHub | | Evaluation pipeline for MEDS. |
MEDS-ETL | Package | | GitHub | | Efficient ETL that supports OMOP, MIMIC, eICU, PyHealth. |
FEMR | Package | | GitHub | | A Python package for manipulating longitudinal EHR data for machine learning, with a focus on supporting the creation of foundation models and verifying their presumed benefits in healthcare. |
MEDS-DEV | Benchmark | | GitHub | | A benchmark for evaluating the performance of machine learning models on MEDS-formatted data. |
MEDS-Inspect | Package | | GitHub | | A package to interactively inspect your MEDS data. |
- CLMBR-T-base: https://huggingface.co/StanfordShahLab/clmbr-t-base
- Context Clues (a collection of Mamba, Llama, Hyena, and GPT models with context lengths from 512 to 16,384 tokens): https://huggingface.co/collections/StanfordShahLab/context-clues-6757f893f6a2918c7ab809f1
Dataset | Stays | Version | Frequency | Origin | Originally Published | License | Repository Link | MEDS ETL | Full Dataset Name |
---|---|---|---|---|---|---|---|---|---|
AUMCdb | 23,000 | v1.0.2 | up to 1 minute | Netherlands | 2019 | Not specified | DANS | GitHub | Amsterdam University Medical Center Database |
eICU | 201,000 | v2.0 | 5 minutes | USA | 2017 | PhysioNet | PhysioNet | GitHub | eICU Collaborative Research Database |
HiRID | 34,000 | v1.1.1 | 2 / 5 minutes | Switzerland | 2020 | PhysioNet | PhysioNet | GitHub | High-Resolution ICU Dataset |
INSPIRE | 130,000 | v1.2 | Not specified | South Korea | 2024 | Korea Credentialed Health Data License | PhysioNet | GitHub | INformative Surgical Patient dataset for Innovative Research Environment |
MIMIC-IV | 73,000 | v3.1 | ~1 hour | USA | 2020 | PhysioNet | PhysioNet | GitHub | Medical Information Mart for Intensive Care IV |
NWICU | 25,000 | v0.1.0 | Not specified | USA | 2023 | PhysioNet | PhysioNet | GitHub | Northwestern ICU Database |
SICdb | 27,350 | v1.0.8 | 1 minute | Austria | 2024 | PhysioNet | PhysioNet | GitHub | Salzburg Intensive Care Database |
- EHRSHOT: https://ehrshot.stanford.edu
Tools that are planned to be compatible with MEDS: