MedLingo dataset, clinical abbreviation task, and lookup baseline model #1043
Open
tbirch5 wants to merge 2 commits into sunlabuiuc:master from
Collaborator
Jathurshan0330
left a comment
It's a great PR. A couple of comments:
- Can you include the sample JSON/dict test resources in the test folder? Could they be defined inside the test .py files instead?
- Are the medlingo_samples synthetically generated or real data? If they are real data, exclude them from the PR.
- Use standard names for the files. For the MedLingo examples, avoid names such as my_replication, which would confuse users.
Contributor
Pull request overview
Adds a MedLingo clinical abbreviation expansion pipeline to PyHealth, including a dataset loader, task definitions, a rule-based lookup baseline, documentation pages, runnable examples, and synthetic tests/resources.
Changes:
- Introduces MedLingoDataset plus demo samples under test-resources/ for abbreviation expansion.
- Adds two task utilities (ClinicalAbbreviationTask, MedLingoTask) and a baseline AbbreviationLookupModel.
- Adds docs entries and example scripts, plus unit tests for dataset/task/model behavior.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| pyhealth/datasets/medlingo.py | Adds the MedLingo dataset class with JSON-based process() loader. |
| pyhealth/datasets/configs/medlingo.yaml | Introduces a dataset config file intended for MedLingo. |
| pyhealth/datasets/__init__.py | Exposes MedLingoDataset at the package level. |
| pyhealth/tasks/clinical_abbreviation.py | Adds a task helper to produce abbreviation-only inputs (optionally extracting from context). |
| pyhealth/tasks/medlingo_task.py | Adds a dataset-to-(input,target) wrapper task for MedLingo samples. |
| pyhealth/models/abbreviation_lookup.py | Adds a simple rule-based lookup baseline model with optional normalization. |
| test-resources/medlingo_samples.json | Adds synthetic/demo MedLingo-style samples for testing/examples. |
| tests/test_medlingo.py | Adds a unit test for dataset structure. |
| tests/test_clinical_abbreviation.py | Adds unit tests for task behavior with/without context. |
| tests/test_abbreviation_lookup.py | Adds unit tests for lookup baseline normalization/prediction. |
| examples/my_replication.py | Adds a full pipeline example (dataset → task → lookup baseline → accuracy). |
| examples/medlingo_clinical_abbreviation_abbreviation_lookup.py | Adds an ablation-style example for task input variants. |
| examples/medlingo_gpt_vs_lookup.py | Adds an optional GPT vs lookup comparison script across input conditions. |
| examples/medlingo_demo.py | Adds a minimal “load dataset” demo script. |
| docs/api/datasets/pyhealth.datasets.medlingo.rst | Adds dataset API documentation page. |
| docs/api/datasets.rst | Links the MedLingo dataset into the datasets API index. |
| docs/api/tasks/pyhealth.tasks.clinical_abbreviation.rst | Adds task module documentation page. |
| docs/api/tasks/pyhealth.tasks.medlingo_task.rst | Adds task class documentation page. |
| docs/api/tasks.rst | Links the new tasks into the tasks API index. |
| docs/api/models/pyhealth.models.abbreviation_lookup.rst | Adds model module documentation page. |
Comment on lines +36 to +47

    def __init__(
        self,
        root: str = "",
        config_path: str | None = None,
    ) -> None:
        tables = ["medlingo"]  # single table dataset
        super().__init__(
            root=root,
            tables=tables,
            dataset_name="medlingo",
            config_path=config_path,
        )
Comment on lines +49 to +52

    @classmethod
    def from_json(cls, filepath: str | Path) -> "MedLingoDataset":
        dataset = cls(root=str(Path(filepath).parent))
        return dataset
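As quoted, from_json passes only the parent directory of filepath to the constructor; the file name itself is not used in this hunk. For comparison, here is a minimal standalone sketch of reading such a file with the standard library. It assumes the file holds a JSON list of objects with abbr, context, and label keys. load_medlingo_samples is a hypothetical helper, not part of the PR or of PyHealth.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Union

# Assumed from the fields listed in the PR's config and docstring.
REQUIRED_KEYS = ("abbr", "context", "label")


def load_medlingo_samples(filepath: Union[str, Path]) -> List[Dict[str, Any]]:
    """Hypothetical helper: read a JSON list of sample dicts and check their keys."""
    path = Path(filepath)
    with path.open(encoding="utf-8") as handle:
        records = json.load(handle)
    if not isinstance(records, list):
        raise ValueError(f"{path.name}: expected a JSON list of samples")
    for index, record in enumerate(records):
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"{path.name}: sample {index} is missing {missing}")
    return records
```

A loader along these lines would fail early on a malformed file instead of at the first task call.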
Comment on lines +24 to +40

    def __init__(self) -> None:
        super().__init__()

    def __call__(self, sample: dict[str, Any]) -> dict[str, str]:
        """
        Convert a single MedLingo sample into task-ready format.

        Args:
            sample: A dictionary containing the fields 'context' and 'label'.

        Returns:
            A dictionary with the processed input and target fields.
        """
        return {
            "input": sample["context"],
            "target": sample["label"],
        }
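The mapping in __call__ can be exercised without PyHealth. This minimal sketch mirrors the quoted return statement as a plain function. The function name and the sample record are made up for illustration.

```python
from typing import Any, Dict


def to_task_sample(sample: Dict[str, Any]) -> Dict[str, str]:
    """Mirror of the quoted __call__: context becomes input, label becomes target."""
    return {"input": sample["context"], "target": sample["label"]}


# Illustrative record, not taken from the dataset.
example = {
    "abbr": "NPO",
    "context": "Keep patient NPO after midnight.",
    "label": "nothing by mouth",
}
result = to_task_sample(example)
print(result)
```

The abbr field is not carried into the output, so a downstream model sees only the full context string.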
Comment on lines +50 to +53

    # Then, try to find mixed-case shorthand (2+ letters)
    mixed_match = re.search(r"\b([A-Z][a-z]{1,})\b", text)
    if mixed_match:
        return mixed_match.group(0)
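To see what this pattern accepts, it can be run in isolation. The sketch below uses only the regular expression from the quoted hunk and leaves out any earlier extraction step. The wrapper function and the test sentences are illustrative. Because the pattern matches any capitalized word of two or more letters, a sentence-initial word is returned before a later shorthand token.

```python
import re
from typing import Optional

# Pattern copied from the quoted hunk.
MIXED_CASE = re.compile(r"\b([A-Z][a-z]{1,})\b")


def first_mixed_case(text: str) -> Optional[str]:
    """Return the first capitalized word of 2+ letters, or None if there is none."""
    match = MIXED_CASE.search(text)
    return match.group(0) if match else None


print(first_mixed_case("Patient started on Abx today"))  # Patient
print(first_mixed_case("started on Abx today"))          # Abx
print(first_mixed_case("no capitals here"))              # None
```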
Comment on lines +25 to +26

    input_schema = {"input": "str"}
    output_schema = {"label": "str"}
Comment on lines +29 to +35

    from dotenv import load_dotenv
    from openai import OpenAI

    from pyhealth.datasets.medlingo import MedLingoDataset
    from pyhealth.models.abbreviation_lookup import AbbreviationLookupModel
    from pyhealth.tasks.clinical_abbreviation import ClinicalAbbreviationTask
Comment on lines +21 to +25

    Each sample contains:
    - abbr: clinical abbreviation string
    - context: short clinical text snippet
    - label: ground truth expanded meaning
    - source: source of the sample (e.g. "mimic_iv", "synthetic_demo")
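The four documented fields can be written down as a typed record, which makes the expected shape of one sample explicit. This is a sketch under assumptions: MedLingoSample and is_complete are hypothetical names, and the example values are invented for illustration, not drawn from the dataset.

```python
from typing import Any, Mapping, TypedDict


class MedLingoSample(TypedDict):
    """Hypothetical record type built from the four documented fields."""

    abbr: str
    context: str
    label: str
    source: str


def is_complete(record: Mapping[str, Any]) -> bool:
    """True when every documented field is present and holds a string."""
    return all(isinstance(record.get(key), str) for key in MedLingoSample.__annotations__)


# Invented example record.
example: MedLingoSample = {
    "abbr": "SOB",
    "context": "Pt reports SOB on exertion.",
    "label": "shortness of breath",
    "source": "synthetic_demo",
}
print(is_complete(example))
```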
Comment on lines +1 to +13

    dataset_name: medlingo
    task: abbreviation_expansion
    modality: text

    tables:
      - medlingo

    fields:
      - abbr
      - context
      - label

    label_field: label

No newline at end of file
Comment on lines +21 to +22

    input_schema = {"input": "str"}
    output_schema = {"target": "str"}
Comment on lines +1 to +15

    from pyhealth.datasets.medlingo import MedLingoDataset
    from pyhealth.tasks.medlingo_task import MedLingoTask
    from pyhealth.models.abbreviation_lookup import AbbreviationLookupModel

    """
    This script demonstrates a replication of the MedLingo clinical abbreviation expansion task.
    It loads the MedLingo dataset, processes it into task-ready format, and evaluates a simple rule-based abbreviation lookup model.
    Contributors:
        Tedra Birch (tbirch2@illinois.edu)

    Paper:
        Diagnosing Our Datasets: How Does My Language Model Learn Clinical Information?
        https://arxiv.org/abs/2505.15024

    """
…ency, standardize examples
Author
@Jathurshan0330 Thanks for the helpful feedback. I’ve addressed all three points:
1. Test resources / JSON dependency
2. Data source clarification
3. File naming conventions
Additional updates
Please let me know if there’s anything else I can refine - happy to iterate further.
Jathurshan0330
approved these changes
May 7, 2026
Contributor
Tedra Birch (tbirch2@illinois.edu)
Type of Contribution
Full pipeline: Dataset + Task + Model
Paper
Diagnosing Our Datasets: How Does My Language Model Learn Clinical Information?
https://arxiv.org/abs/2505.15024
Overview
This PR introduces the MedLingo dataset for clinical abbreviation interpretation,
along with a corresponding task and a rule-based baseline model.
It provides a full pipeline from dataset loading to model evaluation,
including ablation studies to analyze how input variations affect performance.
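To make the evaluation step concrete, here is a minimal sketch of a dictionary-based lookup baseline scored under two input conditions. It is not the PR's AbbreviationLookupModel: the class name, the normalization rule (strip whitespace and a trailing period, then lowercase), the lookup table, and the test pairs are all assumptions chosen for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple


class LookupBaseline:
    """Hypothetical stand-in for a rule-based lookup model (not the PR's class)."""

    def __init__(self, table: Dict[str, str], normalize: bool = True) -> None:
        self.normalize = normalize
        self.table = {self._key(k): v for k, v in table.items()}

    def _key(self, text: str) -> str:
        # Assumed normalization: strip whitespace and trailing periods, then lowercase.
        return text.strip().rstrip(".").lower() if self.normalize else text

    def predict(self, text: str) -> Optional[str]:
        # Exact match on the (optionally normalized) input; None when unknown.
        return self.table.get(self._key(text))


def accuracy(model: LookupBaseline, pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (input, target) pairs the model predicts exactly."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(model.predict(text) == target for text, target in pairs)
    return hits / len(pairs)


# Invented lookup table and inputs.
table = {"BP": "blood pressure", "HR": "heart rate"}
abbr_only = [("bp", "blood pressure"), ("HR.", "heart rate")]
with_context = [("BP was elevated", "blood pressure"), ("HR is 80", "heart rate")]

model = LookupBaseline(table)
print(accuracy(model, abbr_only))     # 1.0
print(accuracy(model, with_context))  # 0.0
```

In this toy setup an exact-match lookup succeeds on abbreviation-only inputs and fails once the abbreviation is embedded in a sentence, which is the kind of input sensitivity an ablation over input variants is meant to expose.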
Components
- pyhealth/datasets/medlingo.py: dataset implementation
- pyhealth/datasets/configs/medlingo.yaml: dataset config
- pyhealth/tasks/clinical_abbreviation.py: task definition
- pyhealth/tasks/medlingo_task.py: dataset-to-task wrapper
- pyhealth/models/abbreviation_lookup.py: rule-based lookup baseline
Examples
- examples/my_replication.py: full pipeline example
- examples/medlingo_clinical_abbreviation_abbreviation_lookup.py: ablation study script
Ablation / Example Usage
The example script includes multiple input conditions:
This is intended to study how input variation affects abbreviation interpretation performance.
Files to Review
- pyhealth/datasets/medlingo.py
- pyhealth/datasets/configs/medlingo.yaml
- pyhealth/tasks/clinical_abbreviation.py
- pyhealth/tasks/medlingo_task.py
- pyhealth/models/abbreviation_lookup.py
- docs/api/datasets/pyhealth.datasets.medlingo.rst
- docs/api/tasks/pyhealth.tasks.clinical_abbreviation.rst
- docs/api/tasks/pyhealth.tasks.medlingo_task.rst
- docs/api/models/pyhealth.models.abbreviation_lookup.rst
- examples/my_replication.py
- examples/medlingo_clinical_abbreviation_abbreviation_lookup.py
- tests/test_medlingo.py
- tests/test_clinical_abbreviation.py
- tests/test_abbreviation_lookup.py
Testing
Notes
test-resources/