Jubel9/label-qa-starter

label-qa-starter

Minimal tools to QA text classification labels exported from Label Studio.

Structure

label-qa-starter/
├─ data/
│  ├─ raw/
│  │  └─ SMSSpamCollection        # original UCI TSV (label<TAB>text)
│  ├─ ls_sms_unlabeled.csv        # derived id,text for Label Studio import
│  ├─ ls_export_v1_r1.csv         # 1st pass labels (ham/spam/unclear)
│  ├─ ls_export_v1_r2.csv         # 2nd pass labels (independent pass)
│  ├─ ls_export_v1_r3.csv         # 3rd pass labels (independent pass)
│  ├─ ls_export_sample.csv        # tiny demo file for validator
│  └─ ls_export_sample_v2.csv     # tiny demo + label2 for κ demo
├─ tools/
│  └─ sms_to_csv.py               # converts UCI TSV → id,text CSV (no labels)
├─ docs/
│  └─ quality_notes.md            # Annotation quality notes
├─ examples/
│  └─ README.md                   # Worked examples (with rationales)
├─ validate_labels.py             # CSV/JSON/JSONL validator (schema/dupes/empties)
├─ agreement_labels.py            # Cohen’s κ + confusion matrix
├─ requirements.txt               # scikit-learn
├─ RUBRIC.md                      # Label definition, decision rules, tie-breakers, and edge-cases
└─ README.md
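
The conversion done by tools/sms_to_csv.py (UCI TSV → id,text CSV with labels dropped) could look roughly like the sketch below. This is an illustration only, not the script from the repository: the function names, the sequential ids starting at 1, and the sample lines are all assumptions made for the example.

```python
import csv
import io


def tsv_to_unlabeled_rows(tsv_lines):
    """Yield (id, text) pairs from 'label<TAB>text' lines, dropping the label.

    Assumption: ids are sequential integers starting at 1.
    """
    next_id = 1
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the message survive.
        _label, _, text = line.partition("\t")
        yield next_id, text
        next_id += 1


def write_unlabeled_csv(rows, out):
    """Write an id,text header followed by the given rows."""
    writer = csv.writer(out)
    writer.writerow(["id", "text"])
    writer.writerows(rows)


# Made-up sample lines in the label<TAB>text layout.
sample = [
    "ham\tSee you at lunch, ok?\n",
    "spam\tYou have won a prize, call now\n",
]
buf = io.StringIO()
write_unlabeled_csv(tsv_to_unlabeled_rows(sample), buf)
print(buf.getvalue())
```

Using the csv module for output means any commas or quotes inside a message are quoted correctly, which matters for free-text SMS content.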

Setup

python -m venv .venv && source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt

Validate a Label Studio CSV export

python validate_labels.py --path data/ls_export_sample.csv --format csv --id-col id --text-col text --label-col label
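
For orientation, the kinds of checks the validator is described as performing (schema, duplicate ids, empty fields) can be sketched as follows. This is a minimal stand-in, not the code in validate_labels.py: the function name, the error messages, and the check against a fixed label set are assumptions for illustration.

```python
import csv
import io
from collections import Counter

# Label set reported in this README; treating it as a whitelist is an assumption.
ALLOWED_LABELS = {"ham", "spam", "unclear"}


def validate_rows(rows, id_col="id", text_col="text", label_col="label"):
    """Return a list of human-readable error strings for a list of dict rows."""
    if not rows:
        return ["no rows found"]
    missing = [c for c in (id_col, text_col, label_col) if c not in rows[0]]
    if missing:
        return [f"missing required column(s): {', '.join(missing)}"]

    errors = []
    # Duplicate ids.
    id_counts = Counter(row[id_col] for row in rows)
    for row_id, count in id_counts.items():
        if count > 1:
            errors.append(f"duplicate id {row_id!r} appears {count} times")
    # Empty text, empty label, or a label outside the expected set.
    for n, row in enumerate(rows, start=1):
        if not (row[text_col] or "").strip():
            errors.append(f"row {n}: empty text")
        label = (row[label_col] or "").strip()
        if not label:
            errors.append(f"row {n}: empty label")
        elif label not in ALLOWED_LABELS:
            errors.append(f"row {n}: unexpected label {label!r}")
    return errors


# Made-up demo data with one duplicate id, one empty text, one bad label.
demo_csv = (
    "id,text,label\n"
    "1,hello there,ham\n"
    "1,free prize,spam\n"
    "2,,ham\n"
    "3,call me,maybe\n"
)
demo_rows = list(csv.DictReader(io.StringIO(demo_csv)))
demo_errors = validate_rows(demo_rows)
for error in demo_errors:
    print(error)
```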

Compute Cohen's κ (same file, two columns)

python agreement_labels.py --file-a data/ls_export_sample_v2.csv --col-a label --col-b label2

Compute Cohen's κ (two files)

python agreement_labels.py --file-a data/ls_export_sample.csv --file-b data/ls_export_sample_v2.csv --col-a label --col-b label
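
When the two passes live in separate files, the labels have to be paired item by item before κ can be computed. The sketch below shows one plausible way to do that by joining on the id column and then applying the standard Cohen's κ formula. It is not taken from agreement_labels.py; the helper names, the join on shared ids, and the demo data are assumptions.

```python
import csv
import io
from collections import Counter


def labels_by_id(csv_text, id_col="id", label_col="label"):
    """Map each id to its label for one CSV export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_col]: row[label_col] for row in reader}


def cohen_kappa(pairs):
    """Cohen's kappa for a list of (label_a, label_b) pairs."""
    n = len(pairs)
    observed = sum(a == b for a, b in pairs) / n
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    # Chance agreement: product of the two raters' marginal proportions.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up demo exports; id 5 exists only in the second file and is ignored.
file_a = "id,text,label\n1,a,ham\n2,b,spam\n3,c,ham\n4,d,spam\n"
file_b = "id,text,label\n1,a,ham\n2,b,spam\n3,c,spam\n4,d,spam\n5,e,ham\n"

a = labels_by_id(file_a)
b = labels_by_id(file_b)
shared = sorted(a.keys() & b.keys())
pairs = [(a[i], b[i]) for i in shared]
kappa = cohen_kappa(pairs)
print(f"items compared: {len(pairs)}, kappa = {kappa:.3f}")
```

Pairing by id rather than by row position keeps the comparison valid even if the two exports are sorted differently or cover slightly different item sets.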

CSV Schema

Required columns: id,text,label (optional: label2 for second pass).

Notes

  • Change --label-col if your project uses a different from_name (e.g., sentiment).
  • Exported file can be CSV, JSON, or JSONL (API). This tool supports all three.
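
To illustrate how from_name relates to the label in a JSON or JSONL export, here is a small sketch that pulls a single-choice label out of a task record. The nested layout it assumes (annotations → result → from_name / value.choices) reflects a typical Label Studio JSON export but is an assumption here, and validate_labels.py may expect a different shape. The function names and the demo task are hypothetical.

```python
import json


def load_tasks(text, fmt):
    """Parse a JSON array or JSONL text into a list of task dicts."""
    if fmt == "json":
        return json.loads(text)
    if fmt == "jsonl":
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    raise ValueError(f"unsupported format: {fmt}")


def extract_label(task, from_name="label"):
    """Return the first choice recorded under `from_name`, or None if absent."""
    for annotation in task.get("annotations", []):
        for result in annotation.get("result", []):
            if result.get("from_name") == from_name:
                choices = result.get("value", {}).get("choices", [])
                if choices:
                    return choices[0]
    return None


# Made-up task whose labeling control is named "sentiment" instead of "label".
demo_jsonl = json.dumps({
    "id": 1,
    "data": {"text": "free prize, call now"},
    "annotations": [
        {"result": [
            {"from_name": "sentiment", "to_name": "text",
             "type": "choices", "value": {"choices": ["spam"]}},
        ]},
    ],
})
tasks = load_tasks(demo_jsonl, "jsonl")
print(extract_label(tasks[0], from_name="sentiment"))  # spam
print(extract_label(tasks[0]))                          # None
```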

Dataset

SMSSpamCollection (https://archive.ics.uci.edu/dataset/228/sms%2Bspam%2Bcollection)

Result

Aug 29, 2025 (JST)

  • items: 800/5574 (SMSSpamCollection)

  • label set: ["ham","spam","unclear"]

  • κ = 0.967

  • confusion matrix:

    [[670   0   0]
     [  2 121   0]
     [  5   0   2]]

    rater A class counts: ham=670, spam=123, unclear=7 (sums of matrix rows).

  • Edge cases:

    1. Some messages have “…” in the middle, which causes confusion; this might be due to truncation or because the original message actually contains ellipses.
    2. Some advertising messages are convincing enough to pass as ham and may be mislabeled if not read carefully.
    3. Some messages are not recognizable as standard English, probably because they are old and use slang, which makes their meaning hard to interpret.
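
The κ reported for Aug 29 can be checked directly from the confusion matrix above. The sketch below applies the standard formula κ = (p_o − p_e) / (1 − p_e), where p_o is the share of items on the diagonal and p_e is the agreement expected from the row and column totals. The function name is an illustrative choice; the matrix values are copied from this README.

```python
def kappa_from_confusion(matrix):
    """Cohen's kappa from a square confusion matrix (rows: rater A)."""
    n = sum(sum(row) for row in matrix)
    # Observed agreement: items on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Expected agreement by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (observed - expected) / (1 - expected)


# Label order: ham, spam, unclear.
matrix = [
    [670, 0, 0],
    [2, 121, 0],
    [5, 0, 2],
]
kappa = kappa_from_confusion(matrix)
print(round(kappa, 3))  # 0.967
```

Here p_o = 793/800 ≈ 0.991 and p_e ≈ 0.732, so most of the raw agreement is already expected by chance because ham dominates; κ corrects for that.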

Aug 30, 2025 (JST)

  • Items: 1151/5574 (SMSSpamCollection)

  • Notes: Added 200 new annotations to ls_export_v1_r3.csv and 150 new annotations to ls_export_v1_r4.csv

  • Edge cases:

    1. Some messages contain phone numbers or website links; if not read carefully, they might be mistaken for spam.
    2. Some personal texts are written entirely in uppercase, which is typical of spam ads, so they can be misclassified if not read carefully.
    3. Some numbers are censored as <DECIMAL> or <#>, which is confusing without context.
  • Validator (ls_export_r3.csv):

    Format: csv

    Total rows: 1151

    Unique ids: 1151

    Duplicate ids: 0

    Errors found: 0
