MedLingo dataset, clinical abbreviation task, and lookup baseline model #1043

Open
tbirch5 wants to merge 2 commits into sunlabuiuc:master from tbirch5:medlingo-contribution

tbirch5 commented Apr 20, 2026

Contributor

Tedra Birch (tbirch2@illinois.edu)

Type of Contribution

Full pipeline: Dataset + Task + Model

Paper

Diagnosing Our Datasets: How Does My Language Model Learn Clinical Information?
https://arxiv.org/abs/2505.15024

Overview

This PR introduces the MedLingo dataset for clinical abbreviation interpretation,
along with a corresponding task and a rule-based baseline model.

It provides a full pipeline from dataset loading to model evaluation,
including ablation studies to analyze how input variations affect performance.


Components

  • pyhealth/datasets/medlingo.py: dataset implementation
  • pyhealth/datasets/configs/medlingo.yaml: dataset config
  • pyhealth/tasks/clinical_abbreviation.py: task definition
  • pyhealth/tasks/medlingo_task.py: dataset-to-task wrapper
  • pyhealth/models/abbreviation_lookup.py: rule-based lookup baseline
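
To make the role of the rule-based baseline concrete, here is a minimal sketch of a dictionary lookup with optional normalization. The class name `LookupSketch`, its `normalize` flag, and the two-entry vocabulary are assumptions for illustration; they are not the API of `pyhealth/models/abbreviation_lookup.py`.

```python
import string


class LookupSketch:
    """Hypothetical stand-in for a rule-based abbreviation lookup baseline."""

    def __init__(self, vocab, normalize=True):
        self.normalize = normalize
        # Store keys in normalized form so queries and keys are compared alike.
        self.vocab = {self._norm(k): v for k, v in vocab.items()}

    def _norm(self, text):
        if not self.normalize:
            return text
        # Strip surrounding whitespace and punctuation, then uppercase.
        return text.strip().strip(string.punctuation).upper()

    def predict(self, abbr):
        # Unknown abbreviations map to None rather than raising.
        return self.vocab.get(self._norm(abbr))


model = LookupSketch({"SOB": "shortness of breath", "Hx": "history"})
print(model.predict("sob."))  # normalized to "SOB" before lookup
print(model.predict("xyz"))   # not in the vocabulary
```

With `normalize=False`, the same query `"sob."` would miss, which is the kind of difference the ablation conditions are meant to expose.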

Examples

  • examples/my_replication.py: full pipeline example
  • examples/medlingo_clinical_abbreviation_abbreviation_lookup.py: ablation study script

Ablation / Example Usage

The example script includes multiple input conditions:

  • abbreviation-only input
  • lowercase abbreviation
  • short clinical context
  • noisy punctuation

This is intended to study how input variation affects abbreviation interpretation performance.
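
As a rough illustration of the four conditions listed above, the following sketch derives each input variant from a single abbreviation. The function name `make_variants`, the context template, and the specific noise characters are assumptions; the example script in this PR may construct its variants differently.

```python
def make_variants(abbr, context_template="Patient presents with {abbr} on exertion."):
    """Build four hypothetical input conditions for one abbreviation."""
    return {
        "abbr_only": abbr,                                     # abbreviation-only input
        "lowercase": abbr.lower(),                             # lowercase abbreviation
        "short_context": context_template.format(abbr=abbr),  # short clinical context
        "noisy_punct": f"..{abbr}!!",                          # noisy punctuation
    }


variants = make_variants("SOB")
for name, text in variants.items():
    print(f"{name}: {text}")
```

Evaluating one model on each variant separately then gives a per-condition accuracy that can be compared across conditions.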


Files to Review

  • pyhealth/datasets/medlingo.py
  • pyhealth/datasets/configs/medlingo.yaml
  • pyhealth/tasks/clinical_abbreviation.py
  • pyhealth/tasks/medlingo_task.py
  • pyhealth/models/abbreviation_lookup.py
  • docs/api/datasets/pyhealth.datasets.medlingo.rst
  • docs/api/tasks/pyhealth.tasks.clinical_abbreviation.rst
  • docs/api/tasks/pyhealth.tasks.medlingo_task.rst
  • docs/api/models/pyhealth.models.abbreviation_lookup.rst
  • examples/my_replication.py
  • examples/medlingo_clinical_abbreviation_abbreviation_lookup.py
  • tests/test_medlingo.py
  • tests/test_clinical_abbreviation.py
  • tests/test_abbreviation_lookup.py

Testing

  • All tests use synthetic/demo data
  • Covers dataset, task, and model functionality

Notes

  • Dataset is cleaned and curated (not raw MIMIC data)
  • Demo data is stored in test-resources/
  • Lookup model serves as a reproducible baseline
  • GPT comparison is optional and not part of the core pipeline

tbirch5 changed the title from "Add MedLingo dataset, clinical abbreviation task, and lookup baseline model" to "MedLingo dataset, clinical abbreviation task, and lookup baseline model" on Apr 20, 2026

Jathurshan0330 (Collaborator) left a comment

It's a great PR. A couple of comments:

  1. Can you include the test resources sample JSON/DICT in the test folder? Can it be defined inside the test.py files?
  2. Are medlingo_samples synthetically generated or real data? If it's real data, exclude them from the PR.
  3. Use standard naming for the files. For the MedLingo examples, for instance, avoid names such as my_replication, which would confuse users.


Copilot AI (Contributor) left a comment

Pull request overview

Adds a MedLingo clinical abbreviation expansion pipeline to PyHealth, including a dataset loader, task definitions, a rule-based lookup baseline, documentation pages, runnable examples, and synthetic tests/resources.

Changes:

  • Introduces MedLingoDataset plus demo samples under test-resources/ for abbreviation expansion.
  • Adds two task utilities (ClinicalAbbreviationTask, MedLingoTask) and a baseline AbbreviationLookupModel.
  • Adds docs entries and example scripts, plus unit tests for dataset/task/model behavior.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| pyhealth/datasets/medlingo.py | Adds the MedLingo dataset class with JSON-based process() loader. |
| pyhealth/datasets/configs/medlingo.yaml | Introduces a dataset config file intended for MedLingo. |
| pyhealth/datasets/__init__.py | Exposes MedLingoDataset at the package level. |
| pyhealth/tasks/clinical_abbreviation.py | Adds a task helper to produce abbreviation-only inputs (optionally extracting from context). |
| pyhealth/tasks/medlingo_task.py | Adds a dataset-to-(input, target) wrapper task for MedLingo samples. |
| pyhealth/models/abbreviation_lookup.py | Adds a simple rule-based lookup baseline model with optional normalization. |
| test-resources/medlingo_samples.json | Adds synthetic/demo MedLingo-style samples for testing/examples. |
| tests/test_medlingo.py | Adds a unit test for dataset structure. |
| tests/test_clinical_abbreviation.py | Adds unit tests for task behavior with/without context. |
| tests/test_abbreviation_lookup.py | Adds unit tests for lookup baseline normalization/prediction. |
| examples/my_replication.py | Adds a full pipeline example (dataset → task → lookup baseline → accuracy). |
| examples/medlingo_clinical_abbreviation_abbreviation_lookup.py | Adds an ablation-style example for task input variants. |
| examples/medlingo_gpt_vs_lookup.py | Adds an optional GPT vs lookup comparison script across input conditions. |
| examples/medlingo_demo.py | Adds a minimal "load dataset" demo script. |
| docs/api/datasets/pyhealth.datasets.medlingo.rst | Adds dataset API documentation page. |
| docs/api/datasets.rst | Links the MedLingo dataset into the datasets API index. |
| docs/api/tasks/pyhealth.tasks.clinical_abbreviation.rst | Adds task module documentation page. |
| docs/api/tasks/pyhealth.tasks.medlingo_task.rst | Adds task class documentation page. |
| docs/api/tasks.rst | Links the new tasks into the tasks API index. |
| docs/api/models/pyhealth.models.abbreviation_lookup.rst | Adds model module documentation page. |


Comment on lines +36 to +47
    def __init__(
        self,
        root: str = "",
        config_path: str | None = None,
    ) -> None:
        tables = ["medlingo"]  # single table dataset
        super().__init__(
            root=root,
            tables=tables,
            dataset_name="medlingo",
            config_path=config_path,
        )
Comment thread pyhealth/datasets/medlingo.py Outdated
Comment on lines +49 to +52
    @classmethod
    def from_json(cls, filepath: str | Path) -> "MedLingoDataset":
        dataset = cls(root=str(Path(filepath).parent))
        return dataset
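
As quoted, `from_json` passes only the parent directory of `filepath` to the constructor and does not open the file itself. The sketch below shows one way a JSON-backed constructor could read and validate the records it is given. `SampleStore`, its `samples` attribute, and the required-field set are hypothetical and do not reflect the PyHealth base dataset interface or the final code in this PR.

```python
import json
import tempfile
from pathlib import Path

# Assumed from the field list documented in this PR.
REQUIRED_FIELDS = {"abbr", "context", "label"}


class SampleStore:
    """Hypothetical container, not the real MedLingoDataset."""

    def __init__(self, samples):
        self.samples = list(samples)

    @classmethod
    def from_json(cls, filepath):
        # Read the file that was actually requested.
        with Path(filepath).open(encoding="utf-8") as f:
            records = json.load(f)
        for i, rec in enumerate(records):
            missing = REQUIRED_FIELDS - set(rec)
            if missing:
                raise ValueError(f"record {i} is missing fields: {sorted(missing)}")
        return cls(records)


# Usage with one synthetic record written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "samples.json"
    record = {"abbr": "SOB", "context": "Pt c/o SOB.", "label": "shortness of breath"}
    path.write_text(json.dumps([record]), encoding="utf-8")
    store = SampleStore.from_json(path)

print(len(store.samples))
```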
Comment on lines +24 to +40
    def __init__(self) -> None:
        super().__init__()

    def __call__(self, sample: dict[str, Any]) -> dict[str, str]:
        """
        Convert a single MedLingo sample into task-ready format.

        Args:
            sample: A dictionary containing the fields 'context' and 'label'.

        Returns:
            A dictionary with the processed input and target fields.
        """
        return {
            "input": sample["context"],
            "target": sample["label"],
        }
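
For reference, the mapping performed by the quoted `__call__` can be exercised on a synthetic sample. `TaskSketch` below copies only the quoted return statement and omits the PyHealth base class, so it is an illustration rather than the class from this PR; the sample values are made up.

```python
from typing import Any


class TaskSketch:
    """Hypothetical stand-in that mirrors only the quoted field mapping."""

    def __call__(self, sample: dict[str, Any]) -> dict[str, str]:
        # Same mapping as the quoted snippet: context -> input, label -> target.
        return {"input": sample["context"], "target": sample["label"]}


sample = {
    "abbr": "SOB",
    "context": "Pt c/o SOB on exertion.",
    "label": "shortness of breath",
    "source": "synthetic_demo",
}
out = TaskSketch()(sample)
print(out)

# A sample that lacks 'context' or 'label' raises KeyError with this mapping.
try:
    TaskSketch()({"abbr": "SOB"})
    missing_key_raised = False
except KeyError:
    missing_key_raised = True
```

The returned keys are `input` and `target`, so any schema declared for a task like this would need to use the same key names.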
Comment on lines +50 to +53
# Then, try to find mixed-case shorthand (2+ letters)
mixed_match = re.search(r"\b([A-Z][a-z]{1,})\b", text)
if mixed_match:
    return mixed_match.group(0)
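
The quoted pattern matches any word that starts with one uppercase letter followed by one or more lowercase letters, so it also matches ordinary capitalized words. The short check below runs the same regular expression on two made-up sentences; the helper name `first_mixed_case` is hypothetical.

```python
import re

# Same pattern as in the quoted snippet.
MIXED_CASE = re.compile(r"\b([A-Z][a-z]{1,})\b")


def first_mixed_case(text):
    """Return the first match of the quoted pattern, or None."""
    match = MIXED_CASE.search(text)
    return match.group(0) if match else None


print(first_mixed_case("Patient has Hx of MI"))  # Patient
print(first_mixed_case("pt with Hx of MI"))      # Hx
```

In the first sentence the capitalized word "Patient" is returned before the shorthand "Hx" is reached, which suggests the fallback may need a length limit or a vocabulary check when the context starts with a capitalized word.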
Comment on lines +25 to +26
    input_schema = {"input": "str"}
    output_schema = {"label": "str"}
Comment on lines +29 to +35
from dotenv import load_dotenv
from openai import OpenAI

from pyhealth.datasets.medlingo import MedLingoDataset
from pyhealth.models.abbreviation_lookup import AbbreviationLookupModel
from pyhealth.tasks.clinical_abbreviation import ClinicalAbbreviationTask

Comment thread pyhealth/datasets/medlingo.py Outdated
Comment on lines +21 to +25
Each sample contains:
- abbr: clinical abbreviation string
- context: short clinical text snippet
- label: ground truth expanded meaning
- source: source of the sample (e.g. "mimic_iv", "synthetic_demo")
Comment on lines +1 to +13
dataset_name: medlingo
task: abbreviation_expansion
modality: text

tables:
- medlingo

fields:
- abbr
- context
- label

label_field: label
(No newline at end of file)
Comment on lines +21 to +22
    input_schema = {"input": "str"}
    output_schema = {"target": "str"}
Comment thread examples/my_replication.py Outdated
Comment on lines +1 to +15
from pyhealth.datasets.medlingo import MedLingoDataset
from pyhealth.tasks.medlingo_task import MedLingoTask
from pyhealth.models.abbreviation_lookup import AbbreviationLookupModel

"""
This script demonstrates a replication of the MedLingo clinical abbreviation expansion task.
It loads the MedLingo dataset, processes it into task-ready format, and evaluates a simple rule-based abbreviation lookup model.
Contributors:
Tedra Birch (tbirch2@illinois.edu)

Paper:
Diagnosing Our Datasets: How Does My Language Model Learn Clinical Information?
https://arxiv.org/abs/2505.15024

"""

tbirch5 (Author) commented May 4, 2026

@Jathurshan0330 Thanks for the helpful feedback. I’ve addressed all three points:

1. Test resources / JSON dependency

  • Removed all reliance on external JSON files
  • All dataset samples are now defined inline as synthetic data within tests and example scripts
  • Tests now directly validate dataset behavior without external file dependencies
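
A test with inline synthetic samples might look roughly like the sketch below. The sample values, the test class name, and the checks are assumptions for illustration and are not copied from the tests in this PR; the field names follow the sample description quoted earlier in this thread.

```python
import unittest

# Synthetic records only; no patient or MIMIC data.
SAMPLES = [
    {"abbr": "SOB", "context": "Pt c/o SOB on exertion.",
     "label": "shortness of breath", "source": "synthetic_demo"},
    {"abbr": "Hx", "context": "Hx of hypertension.",
     "label": "history", "source": "synthetic_demo"},
]


class TestInlineSamples(unittest.TestCase):
    def test_fields_present(self):
        # Every record carries exactly the four documented fields.
        for sample in SAMPLES:
            self.assertEqual(set(sample), {"abbr", "context", "label", "source"})

    def test_abbr_appears_in_context(self):
        # The abbreviation should occur in its own context snippet.
        for sample in SAMPLES:
            self.assertIn(sample["abbr"], sample["context"])


# Run the suite programmatically so the script does not call sys.exit().
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestInlineSamples)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping the records in the test module itself means the suite has no dependency on files under test-resources/.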

2. Data source clarification

  • All MedLingo samples are synthetic and curated for demonstration purposes
  • Added explicit documentation in the dataset and examples confirming that no real patient or MIMIC data is included

3. File naming conventions

  • Updated example scripts to follow the required naming convention
  • Renamed my_replication.py to medlingo_full_pipeline.py to improve clarity and consistency

Additional updates

  • All examples now use MedLingoDataset(samples=...) for fully self-contained execution
  • Ensured all scripts are reproducible, lightweight, and consistent with PyHealth contribution guidelines

Please let me know if there’s anything else I can refine - happy to iterate further.



3 participants