A template repository for a MEDS-Transforms powered extraction pipeline for a custom dataset. Once you have customized the repository to your dataset (see instructions below), you will be able to run your extraction pipeline with a few simple command-line commands, such as:
```bash
pip install PACKAGE_NAME # you can do this locally or via PyPI
# Download your data or set download credentials
COMMAND_NAME root_output_dir=$ROOT_OUTPUT_DIR
```
See the MIMIC-IV MEDS Extraction ETL for an end-to-end example!
- Initialize a new repository using this template repository.
- Rename the directory inside `src/` to the name of your package in a Python-friendly format (e.g., `MIMIC_IV_MEDS`).
- Customize the following code points:
- Customize the following external services:
- CodeCov
- PyPI
In the `pyproject.toml` file, you will need to update the following fields:

- Under `[project]`:
    - `name = "ETL-MEDS"`: Update `ETL-MEDS` to the name of your package (e.g., `MIMIC-IV-MEDS`).
    - `authors = [...]`: Update the author information to your name and email.
    - `description = "..."`: Update the description to a brief description of your dataset.
    - `dependencies = [...]`: Update the dependencies to include the necessary packages for your ETL pipeline (if any additional packages are needed).
- Under `[project.scripts]`:
    - `MEDS_extract-sample_dataset = "ETL_MEDS.__main__:main"`: Update `MEDS_extract-sample_dataset` to the name of your command-line pipeline (e.g., `MIMIC-IV_extract`) and update `ETL_MEDS` to the name of your package as you would import it in Python (e.g., `MIMIC_IV_MEDS`). This will be the same as the directory name between `src` and your actual code.
- Under `[project.urls]`:
    - `Homepage = "..."`: Update the homepage to the URL of your GitHub repository.
    - `Issues = "..."`: Update the issues URL to the URL of your GitHub repository's issues page.
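Putting the updates above together into a consolidated sketch (the MIMIC-IV names, the author details, and the `YOUR_ORG` GitHub organization are placeholders; keep the rest of the file, including the template's default dependencies, as shipped):

```toml
[project]
name = "MIMIC-IV-MEDS"
authors = [{ name = "Your Name", email = "you@example.com" }]
description = "A MEDS extraction ETL for a hypothetical MIMIC-IV download."
# dependencies: keep the template defaults and append any extra packages your ETL needs.

[project.scripts]
MIMIC-IV_extract = "MIMIC_IV_MEDS.__main__:main"

[project.urls]
Homepage = "https://github.com/YOUR_ORG/MIMIC-IV-MEDS"
Issues = "https://github.com/YOUR_ORG/MIMIC-IV-MEDS/issues"
```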
In this file, you simply need to update the `__package_name__ = "ETL_MEDS"` line to refer not to `ETL_MEDS` but to your new package import name (e.g., `MIMIC_IV_MEDS`).
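For example, with the hypothetical `MIMIC_IV_MEDS` import name used above, the line would become:

```python
__package_name__ = "MIMIC_IV_MEDS"
```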
In this file, you can add details about the dataset you are working with. This will be used to record metadata about the dataset and to provide links from which the dataset can be downloaded. You'll need to modify:
- `dataset_name`: The name of the dataset.
- `raw_dataset_version`: The version of the raw dataset that this version of your pipeline is designed to work with.
- `urls`: This block contains the URLs from which the dataset can be downloaded. This field requires additional commentary, explored below.
This field is an object and contains three sub-keys:
- `dataset`: The URLs for the full dataset.
- `demo`: The URLs for a smaller, open, demo version of the dataset.
- `common`: The URLs for shared metadata files or other shared resources.

Each of these sub-keys should be a list of either strings (plain URLs) or dictionaries containing the URL (in the key `url`) and username and password authentication information (in the keys `username` and `password`).
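As an illustrative sketch (all names and URLs below are placeholders; follow the exact layout of the `dataset.yaml` shipped with this template), the file might look like:

```yaml
dataset_name: SAMPLE_DATASET
raw_dataset_version: "1.0"

urls:
  dataset:
    - https://example.com/data/full.zip       # plain URL string
  demo:
    - https://example.com/data/demo.zip
  common:
    - https://example.com/data/metadata.csv   # shared metadata file
```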
Note that we strongly recommend that you do not include your username and password in the raw file.
Instead, leverage the OmegaConf resolvers to reference external environment variables or other secure methods
of storing this information. In the example in this repository, we include one URL with the following
configuration:
```yaml
- url: EXAMPLE_CONTROLLED_URL
  username: ${oc.env:DATASET_DOWNLOAD_USERNAME}
  password: ${oc.env:DATASET_DOWNLOAD_PASSWORD}
```
which would resolve to fill in the `username` and `password` from the environment variables `DATASET_DOWNLOAD_USERNAME` and `DATASET_DOWNLOAD_PASSWORD`, respectively.
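For instance, you might export those variables in your shell before running the pipeline (the values here are placeholders):

```bash
export DATASET_DOWNLOAD_USERNAME="my-username"
export DATASET_DOWNLOAD_PASSWORD="my-password"
```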
That's no problem! You can simply turn off downloading entirely by setting `do_download=False` in the `configs/main.yaml` or on the command line when you run the pipeline, and ensure that your data files are manually downloaded and placed in the appropriate directory (the `raw_input_dir` in the `configs/main.yaml`).
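For example, reusing the placeholder command and output directory from the quick-start commands above:

```bash
COMMAND_NAME root_output_dir=$ROOT_OUTPUT_DIR do_download=False
```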
If there is a technical issue with downloading the data through the currently supported formats, you can also file a GitHub Issue outlining your problem and we can attempt to expand the supported libraries to cover your use case!
This script should generally be modified to include any "pre-MEDS" steps that are necessary to prepare the dataset for MEDS-Transforms based extraction. Critically, these steps often include:
- De-compressing files or otherwise preparing the raw data for extraction at a technical level.
- Joining tables together so that all relevant rows include the unifying `subject_id`.
- Converting any offsets into timestamps.
- Any other modifications of interest.
See MEDS-Transforms for more documentation on the appropriate construction of the pre-MEDS script.
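As a rough sketch of the kinds of transformations involved (the table names, column names, and offset semantics below are hypothetical and not part of this template), a pre-MEDS step that attaches the unifying `subject_id` to an events table and converts minute offsets into timestamps might look like:

```python
import polars as pl

# Hypothetical raw tables: "events" lacks subject_id and stores event times as
# minute offsets from an admission time recorded in "admissions".
events = pl.scan_csv("raw/events.csv")
admissions = pl.scan_csv("raw/admissions.csv")

prepared = (
    events
    # Join so that every event row carries the unifying subject_id.
    .join(admissions.select("hadm_id", "subject_id", "admit_time"), on="hadm_id")
    # Convert the minute offset into an absolute timestamp.
    .with_columns(
        (
            pl.col("admit_time").str.to_datetime()
            + pl.duration(minutes=pl.col("offset_minutes"))
        ).alias("time")
    )
    .drop("offset_minutes")
)

prepared.sink_parquet("pre_meds/events.parquet")
```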
This file is the configuration file for mapping the rows in your various raw data tables to MEDS events via the MEDS-Transforms pipeline. See MEDS-Transforms for more documentation on the format and usage of this file.
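Purely as an illustration of the kind of mapping this file expresses (the table, column, and code names below are hypothetical, and the authoritative schema is defined by MEDS-Transforms, so consult its documentation before copying this), an entry for a hypothetical admissions table might look roughly like:

```yaml
admissions:
  subject_id_col: subject_id
  admission:
    code: HOSPITAL_ADMISSION        # a fixed code for this event type
    time: col(admit_time)           # read the event time from a column
  discharge:
    code: HOSPITAL_DISCHARGE
    time: col(discharge_time)
```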
Insert badges like below:
[](https://pypi.org/project/PACKAGE_NAME/)
[](https://REPO_NAME.readthedocs.io/en/stable/?badge=stable)
[](https://codecov.io/gh/Medical-Event-Data-Standard/REPO_NAME)
[](https://github.com/Medical-Event-Data-Standard/REPO_NAME/actions/workflows/tests.yml)
[](https://github.com/Medical-Event-Data-Standard/REPO_NAME/actions/workflows/code-quality-main.yaml)

[](https://github.com/Medical-Event-Data-Standard/REPO_NAME#license)
[](https://github.com/Medical-Event-Data-Standard/REPO_NAME/pulls)
[](https://github.com/Medical-Event-Data-Standard/REPO_NAME/graphs/contributors)
If your dataset does not have an open demo version, you can remove this file, as there is no way to set up automated testing of the end-to-end pipeline in a safe manner without a demo dataset.
If you do have a demo dataset, ensure that it is included in your `dataset.yaml` file and update the `e2e_demo_test.py` file as follows:

- Update the command in the `command_parts` variable to match the command you set in your `pyproject.toml` file for your pipeline's executable (e.g., `MIMIC-IV_extract`).
- Remove the `pytest.mark.skip` decorator from the test function so that it runs successfully!
The test file (and the internal doctests, which can help unit-test your pre-MEDS file) can then be run via `pytest --doctest-modules -s` from the root directory to ensure correctness of your pipeline. These tests will also be run on pull requests or pushes to the `main` branch of your repository via GitHub Actions, and test code coverage will be tracked via CodeCov.
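For example, from the repository root:

```bash
pytest --doctest-modules -s
```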
This will run pre-commit hooks on GitHub when you're creating your ETL. To run the hooks locally, first install the optional development dependencies:
pip install ".[dev]"
Then, run:
```bash
pre-commit run --all-files
```
This will automatically reformat your files (if possible) to conform with the linting requirements.
Note: Currently we need `pre-commit<4` to run the `docformatter` hook.
- Go to CodeCov and make an account or log in as needed.
- Follow the instructions to configure your new repository with CodeCov.
- Copy the badge markdown from CodeCov and paste it into the `README.md` file. To find the badge markdown link, go to your repository in CodeCov, click on the "Configuration" tab, click on the "Badges and Graphs" option, then copy the markdown link from the top section and paste it into the corresponding line of the README, in place of the default link included above.
- It will now track the test coverage of your ETL, including running the full pipeline against the linked demo data you provide in `dataset.yaml`.
- Go to PyPI and make an account or log in as needed.
- Go to your account settings and open the "Publishing" settings.
- Set up a new "Trusted Publisher" for your GitHub repository (e.g., see the image below). Ensure that the package name matches between the trusted publisher and your `pyproject.toml` file!
- Now, if, on the local command line, you run `git tag 0.0.1` and then `git push origin 0.0.1`, a new, tagged version of your code (as of the local commit when you ran the command) will be pushed both to a new GitHub Release and to PyPI, as sketched below. This will allow you to install your package via `pip install PACKAGE_NAME` and to manage versions effectively!
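For example (the version number here is a placeholder, and `PACKAGE_NAME` is whatever name you set in `pyproject.toml`):

```bash
git tag 0.0.1
git push origin 0.0.1
# Once the tagged version has been published to PyPI, it is installable:
pip install PACKAGE_NAME
```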