MedLingo dataset, clinical abbreviation task, and lookup baseline model by tbirch5 · Pull Request #1043 · sunlabuiuc/PyHealth

tbirch5 · 2026-04-20T20:19:12Z

Contributor

Type of Contribution

Full pipeline: Dataset + Task + Model

Paper

Diagnosing Our Datasets: How Does My Language Model Learn Clinical Information?
https://arxiv.org/abs/2505.15024

Overview

This PR introduces the MedLingo dataset for clinical abbreviation interpretation,
along with a corresponding task and a rule-based baseline model.

It provides a full pipeline from dataset loading to model evaluation,
including ablation studies to analyze how input variations affect performance.

Components

pyhealth/datasets/medlingo.py: dataset implementation
pyhealth/datasets/configs/medlingo.yaml: dataset config
pyhealth/tasks/clinical_abbreviation.py: task definition
pyhealth/tasks/medlingo_task.py: dataset-to-task wrapper
pyhealth/models/abbreviation_lookup.py: rule-based lookup baseline
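
For orientation, a rule-based lookup baseline with optional normalization could look roughly like the sketch below. This is not the code in pyhealth/models/abbreviation_lookup.py: the class name `LookupBaseline`, the `use_normalization` flag, and the `normalize` helper are hypothetical names chosen for illustration, and the two table entries are made-up synthetic examples.

```python
from __future__ import annotations

import string


def normalize(text: str) -> str:
    """Trim whitespace and surrounding punctuation, then lowercase."""
    return text.strip().strip(string.punctuation).lower()


class LookupBaseline:
    """Hypothetical stand-in for a dictionary-based abbreviation expander."""

    def __init__(self, table: dict[str, str], use_normalization: bool = True) -> None:
        self.use_normalization = use_normalization
        # Store keys in the same form that queries will be converted to.
        self.table = {self._key(abbr): meaning for abbr, meaning in table.items()}

    def _key(self, abbr: str) -> str:
        return normalize(abbr) if self.use_normalization else abbr

    def predict(self, abbr: str) -> str | None:
        """Return the expansion, or None when the abbreviation is unknown."""
        return self.table.get(self._key(abbr))


# Synthetic entries for illustration only.
model = LookupBaseline({"HTN": "hypertension", "SOB": "shortness of breath"})
print(model.predict("htn."))  # hypertension
print(model.predict("XYZ"))   # None
```

Returning `None` for unknown abbreviations keeps misses explicit, so an evaluation loop can count them as errors rather than silently substituting a guess.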

Examples

examples/my_replication.py: full pipeline example
examples/medlingo_clinical_abbreviation_abbreviation_lookup.py: ablation study script

Ablation / Example Usage

The example script includes multiple input conditions:

abbreviation-only input
lowercase abbreviation
short clinical context
noisy punctuation

This is intended to study how input variation affects abbreviation interpretation performance.
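
The four conditions can be sketched as simple string transformations scored with exact-match accuracy. The helper names (`make_variants`, `accuracy_by_condition`), the noise pattern, and the two samples below are assumptions made for illustration; the PR's example script may build its variants differently.

```python
from __future__ import annotations

# Synthetic (abbreviation, context, expected expansion) triples.
SAMPLES = [
    ("HTN", "Pt with HTN on lisinopril.", "hypertension"),
    ("SOB", "Reports SOB on exertion.", "shortness of breath"),
]

# A strict, case-sensitive lookup table with no normalization.
TABLE = {"HTN": "hypertension", "SOB": "shortness of breath"}


def make_variants(abbr: str, context: str) -> dict[str, str]:
    """Build one model input per ablation condition."""
    return {
        "abbreviation_only": abbr,
        "lowercase": abbr.lower(),
        "short_context": context,
        "noisy_punctuation": f"...{abbr}!!",
    }


def accuracy_by_condition(samples, table) -> dict[str, float]:
    """Exact-match accuracy of a plain dictionary lookup under each condition."""
    hits: dict[str, list[bool]] = {}
    for abbr, context, label in samples:
        for name, text in make_variants(abbr, context).items():
            hits.setdefault(name, []).append(table.get(text) == label)
    return {name: sum(flags) / len(flags) for name, flags in hits.items()}


for condition, score in accuracy_by_condition(SAMPLES, TABLE).items():
    print(f"{condition}: {score:.2f}")
```

With a strict table like this one, only the abbreviation-only condition is answered correctly, which is the kind of gap that normalization and context extraction are meant to close.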

Files to Review

pyhealth/datasets/medlingo.py
pyhealth/datasets/configs/medlingo.yaml
pyhealth/tasks/clinical_abbreviation.py
pyhealth/tasks/medlingo_task.py
pyhealth/models/abbreviation_lookup.py
docs/api/datasets/pyhealth.datasets.medlingo.rst
docs/api/tasks/pyhealth.tasks.clinical_abbreviation.rst
docs/api/tasks/pyhealth.tasks.medlingo_task.rst
docs/api/models/pyhealth.models.abbreviation_lookup.rst
examples/my_replication.py
examples/medlingo_clinical_abbreviation_abbreviation_lookup.py
tests/test_medlingo.py
tests/test_clinical_abbreviation.py
tests/test_abbreviation_lookup.py

Testing

All tests use synthetic/demo data
Covers dataset, task, and model functionality

Notes

Dataset is cleaned and curated (not raw MIMIC data)
Demo data is stored in test-resources/
Lookup model serves as a reproducible baseline
GPT comparison is optional and not part of the core pipeline

Jathurshan0330

It's a great PR. A couple of comments:

Can you include the test resources sample JSON/DICT in the test folder? Can it be defined inside the test.py files?
Are medlingo_samples synthetically generated or real data? If it's real data, exclude them from the PR.
Use standard naming for the files. For the MedLingo examples, avoid names such as my_replication, which would confuse users.

Copilot

Pull request overview

Adds a MedLingo clinical abbreviation expansion pipeline to PyHealth, including a dataset loader, task definitions, a rule-based lookup baseline, documentation pages, runnable examples, and synthetic tests/resources.

Changes:

Introduces MedLingoDataset plus demo samples under test-resources/ for abbreviation expansion.
Adds two task utilities (ClinicalAbbreviationTask, MedLingoTask) and a baseline AbbreviationLookupModel.
Adds docs entries and example scripts, plus unit tests for dataset/task/model behavior.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 12 comments.


| File | Description |
| --- | --- |
| `pyhealth/datasets/medlingo.py` | Adds the MedLingo dataset class with JSON-based `process()` loader. |
| `pyhealth/datasets/configs/medlingo.yaml` | Introduces a dataset config file intended for MedLingo. |
| `pyhealth/datasets/__init__.py` | Exposes `MedLingoDataset` at the package level. |
| `pyhealth/tasks/clinical_abbreviation.py` | Adds a task helper to produce abbreviation-only inputs (optionally extracting from context). |
| `pyhealth/tasks/medlingo_task.py` | Adds a dataset-to-(input,target) wrapper task for MedLingo samples. |
| `pyhealth/models/abbreviation_lookup.py` | Adds a simple rule-based lookup baseline model with optional normalization. |
| `test-resources/medlingo_samples.json` | Adds synthetic/demo MedLingo-style samples for testing/examples. |
| `tests/test_medlingo.py` | Adds a unit test for dataset structure. |
| `tests/test_clinical_abbreviation.py` | Adds unit tests for task behavior with/without context. |
| `tests/test_abbreviation_lookup.py` | Adds unit tests for lookup baseline normalization/prediction. |
| `examples/my_replication.py` | Adds a full pipeline example (dataset → task → lookup baseline → accuracy). |
| `examples/medlingo_clinical_abbreviation_abbreviation_lookup.py` | Adds an ablation-style example for task input variants. |
| `examples/medlingo_gpt_vs_lookup.py` | Adds an optional GPT vs lookup comparison script across input conditions. |
| `examples/medlingo_demo.py` | Adds a minimal “load dataset” demo script. |
| `docs/api/datasets/pyhealth.datasets.medlingo.rst` | Adds dataset API documentation page. |
| `docs/api/datasets.rst` | Links the MedLingo dataset into the datasets API index. |
| `docs/api/tasks/pyhealth.tasks.clinical_abbreviation.rst` | Adds task module documentation page. |
| `docs/api/tasks/pyhealth.tasks.medlingo_task.rst` | Adds task class documentation page. |
| `docs/api/tasks.rst` | Links the new tasks into the tasks API index. |
| `docs/api/models/pyhealth.models.abbreviation_lookup.rst` | Adds model module documentation page. |


+    def __init__(
+        self,
+        root: str = "",
+        config_path: str | None = None,
+    ) -> None:
+        tables = ["medlingo"]  # single table dataset
+        super().__init__(
+            root=root,
+            tables=tables,
+            dataset_name="medlingo",
+            config_path=config_path,
+        )


+    @classmethod
+    def from_json(cls, filepath: str | Path) -> "MedLingoDataset":
+        dataset = cls(root=str(Path(filepath).parent))
+        return dataset


+    def __init__(self) -> None:
+        super().__init__()
+
+    def __call__(self, sample: dict[str, Any]) -> dict[str, str]:
+        """
+        Convert a single MedLingo sample into task-ready format.
+
+        Args:
+            sample: A dictionary containing the fields 'context' and 'label'.
+
+        Returns:
+            A dictionary with the processed input and target fields.
+        """
+        return {
+            "input": sample["context"],
+            "target": sample["label"],
+        }


+        # Then, try to find mixed-case shorthand (2+ letters)
+        mixed_match = re.search(r"\b([A-Z][a-z]{1,})\b", text)
+        if mixed_match:
+            return mixed_match.group(0)
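
To see what the pattern in this excerpt matches on its own, the sketch below applies it to three made-up strings. Only the regular expression is taken from the excerpt; the wrapper function `first_mixed_case` is a hypothetical name, and the rest of the extraction function (including the earlier step implied by the word "Then") is not shown here and is not reproduced.

```python
import re

# Same pattern as in the excerpt: one uppercase letter followed by 1+ lowercase letters.
MIXED_CASE = re.compile(r"\b([A-Z][a-z]{1,})\b")


def first_mixed_case(text):
    """Return the first match of the pattern, or None if nothing matches."""
    match = MIXED_CASE.search(text)
    return match.group(0) if match else None


for text in ["Hx of diabetes", "Patient has Hx of diabetes", "hx of diabetes"]:
    print(repr(text), "->", first_mixed_case(text))
```

Because the pattern accepts any capitalized word, the second string yields `Patient` rather than `Hx`: a sentence-initial word is found before a shorthand that appears later in the text.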


+    input_schema = {"input": "str"}
+    output_schema = {"label": "str"}


+from dotenv import load_dotenv
+from openai import OpenAI
+
+from pyhealth.datasets.medlingo import MedLingoDataset
+from pyhealth.models.abbreviation_lookup import AbbreviationLookupModel
+from pyhealth.tasks.clinical_abbreviation import ClinicalAbbreviationTask
+


+    Each sample contains:
+        - abbr: clinical abbreviation string
+        - context: short clinical text snippet
+        - label: ground truth expanded meaning
+        - source: source of the sample (e.g. "mimic_iv", "synthetic_demo")
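
A sample with these four fields, plus a small check that each one is a non-empty string, might look like the sketch below. The sample content is synthetic and hand-written, and `validate_sample` is a hypothetical helper, not a function from the PR.

```python
from __future__ import annotations

REQUIRED_FIELDS = ("abbr", "context", "label", "source")

# Synthetic example; contains no real patient or MIMIC data.
sample = {
    "abbr": "HTN",
    "context": "Pt with HTN on lisinopril.",
    "label": "hypertension",
    "source": "synthetic_demo",
}


def validate_sample(record: dict) -> list[str]:
    """Return the names of required fields that are missing or not non-empty strings."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(field)
    return problems


print(validate_sample(sample))           # []
print(validate_sample({"abbr": "SOB"}))  # ['context', 'label', 'source']
```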


+dataset_name: medlingo
+task: abbreviation_expansion
+modality: text
+
+tables:
+  - medlingo
+
+fields:
+  - abbr
+  - context
+  - label
+
+label_field: label


+    input_schema = {"input": "str"}
+    output_schema = {"target": "str"}
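
The two schema excerpts differ in the output key (`label` versus `target`), and the page does not show which file each belongs to. A generic consistency check between the keys a task declares and the keys it returns could be sketched as follows, assuming a task exposes `input_schema` and `output_schema` dictionaries and returns a dictionary when called. `DemoTask` and `schema_mismatches` are hypothetical names used only for this illustration.

```python
from __future__ import annotations


class DemoTask:
    """Hypothetical task whose declared output key differs from what it returns."""

    input_schema = {"input": "str"}
    output_schema = {"label": "str"}

    def __call__(self, sample: dict) -> dict[str, str]:
        return {"input": sample["context"], "target": sample["label"]}


def schema_mismatches(task, sample: dict) -> dict[str, list[str]]:
    """Compare the keys a task declares with the keys it actually produces."""
    produced = set(task(sample))
    declared = set(task.input_schema) | set(task.output_schema)
    return {
        "missing": sorted(declared - produced),     # declared but never returned
        "undeclared": sorted(produced - declared),  # returned but never declared
    }


report = schema_mismatches(
    DemoTask(), {"context": "Pt with HTN.", "label": "hypertension"}
)
print(report)  # {'missing': ['label'], 'undeclared': ['target']}
```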


+from pyhealth.datasets.medlingo import MedLingoDataset
+from pyhealth.tasks.medlingo_task import MedLingoTask
+from pyhealth.models.abbreviation_lookup import AbbreviationLookupModel
+
+"""
+This script demonstrates a replication of the MedLingo clinical abbreviation expansion task.
+It loads the MedLingo dataset, processes it into task-ready format, and evaluates a simple rule-based abbreviation lookup model.
+Contributors:
+    Tedra Birch (tbirch2@illinois.edu)
+
+Paper:
+    Diagnosing Our Datasets: How Does My Language Model Learn Clinical Information?
+    https://arxiv.org/abs/2505.15024
+
+"""



tbirch5 · 2026-05-04T20:45:39Z

@Jathurshan0330 Thanks for the helpful feedback. I’ve addressed all three points:

1. Test resources / JSON dependency

Removed all reliance on external JSON files
All dataset samples are now defined inline as synthetic data within tests and example scripts
Tests now directly validate dataset behavior without external file dependencies
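
A self-contained test in this style, with the samples defined inline rather than loaded from a JSON file, could be sketched like this. The sample content is synthetic, `to_task_format` mirrors the context-to-input and label-to-target mapping shown in the diff excerpt, and the test class name is hypothetical; the PR's actual tests are not reproduced here.

```python
import unittest

# Synthetic, hand-written samples; no real patient or MIMIC data.
SYNTHETIC_SAMPLES = [
    {"abbr": "HTN", "context": "Pt with HTN on lisinopril.",
     "label": "hypertension", "source": "synthetic_demo"},
    {"abbr": "SOB", "context": "Reports SOB on exertion.",
     "label": "shortness of breath", "source": "synthetic_demo"},
]


def to_task_format(sample):
    """Map a sample to task-ready form: context -> input, label -> target."""
    return {"input": sample["context"], "target": sample["label"]}


class TestInlineSamples(unittest.TestCase):
    def test_required_fields_present(self):
        for sample in SYNTHETIC_SAMPLES:
            self.assertTrue({"abbr", "context", "label", "source"} <= set(sample))

    def test_task_mapping(self):
        out = to_task_format(SYNTHETIC_SAMPLES[0])
        self.assertEqual(
            out, {"input": "Pt with HTN on lisinopril.", "target": "hypertension"}
        )


# Run the suite programmatically so the result object can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestInlineSamples)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```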

2. Data source clarification

All MedLingo samples are synthetic and curated for demonstration purposes
Added explicit documentation in the dataset and examples confirming that no real patient or MIMIC data is included

3. File naming conventions

Updated example scripts to follow the required naming convention
Renamed my_replication.py to medlingo_full_pipeline.py to improve clarity and consistency

Additional updates

All examples now use MedLingoDataset(samples=...) for fully self-contained execution
Ensured all scripts are reproducible, lightweight, and consistent with PyHealth contribution guidelines

Please let me know if there’s anything else I can refine - happy to iterate further.

Add MedLingo dataset, task, model, tests, docs, and examples

50e7831

tbirch5 changed the title from "Add MedLingo dataset, clinical abbreviation task, and lookup baseline model" to "MedLingo dataset, clinical abbreviation task, and lookup baseline model" on Apr 20, 2026

Jathurshan0330 requested review from Jathurshan0330 and Copilot May 3, 2026 20:14

Copilot started reviewing on behalf of Jathurshan0330 May 3, 2026 20:15

Jathurshan0330 reviewed May 3, 2026

Copilot AI reviewed May 3, 2026

Address PR feedback: use synthetic inline samples, remove file dependency, standardize examples

18c7ae9

Jathurshan0330 approved these changes May 7, 2026

tbirch5 wants to merge 2 commits into sunlabuiuc:master from tbirch5:medlingo-contribution


3 participants
