Minimal tools to QA text classification labels exported from Label Studio.
label-qa-starter/
├─ data/
│ ├─ raw/
│ │ └─ SMSSpamCollection # original UCI TSV (label<TAB>text)
│ ├─ ls_sms_unlabeled.csv # derived id,text for Label Studio import
│ ├─ ls_export_v1_r1.csv # 1st pass labels (ham/spam/unclear)
│ ├─ ls_export_v1_r2.csv # 2nd pass labels (independent pass)
│ ├─ ls_export_v1_r3.csv # 3rd pass labels (independent pass)
│ ├─ ls_export_sample.csv # tiny demo file for validator
│ └─ ls_export_sample_v2.csv # tiny demo + label2 for κ demo
├─ tools/
│ └─ sms_to_csv.py # converts UCI TSV → id,text CSV (no labels)
├─ docs/
│ └─ quality_notes.md # Annotation quality notes
├─ examples/
│ └─ README.md # Worked examples (with rationales)
├─ validate_labels.py # CSV/JSON/JSONL validator (schema/dupes/empties)
├─ agreement_labels.py # Cohen’s κ + confusion matrix
├─ requirements.txt # scikit-learn
├─ RUBRIC.md # Label definitions, decision rules, tie-breakers, and edge cases
└─ README.md
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
python validate_labels.py --path data/ls_export_sample.csv --format csv --id-col id --text-col text --label-col label
python agreement_labels.py --file-a data/ls_export_sample_v2.csv --col-a label --col-b label2
python agreement_labels.py --file-a data/ls_export_sample.csv --file-b data/ls_export_sample_v2.csv --col-a label --col-b label
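The two-file comparison above needs the two passes lined up item by item before agreement can be measured. The following is a minimal sketch of one way to do that, assuming both files share the id column and that only ids present in both passes are compared. It is not the code in agreement_labels.py; the helper names read_labels and confusion_matrix, and the tiny inline inputs, are made up for illustration.

```python
import csv
import io

LABELS = ["ham", "spam", "unclear"]  # label set used in this project


def read_labels(handle, id_col="id", label_col="label"):
    """Return {id: label} from a CSV file-like object (hypothetical helper)."""
    return {row[id_col]: row[label_col] for row in csv.DictReader(handle)}


def confusion_matrix(labels_a, labels_b, label_order=LABELS):
    """Count (label in A, label in B) pairs over ids present in both passes."""
    index = {name: i for i, name in enumerate(label_order)}
    matrix = [[0] * len(label_order) for _ in label_order]
    shared_ids = sorted(set(labels_a) & set(labels_b))
    for item_id in shared_ids:
        matrix[index[labels_a[item_id]]][index[labels_b[item_id]]] += 1
    return matrix, shared_ids


# Tiny made-up inputs standing in for two export files.
pass_a = io.StringIO("id,text,label\n1,hi,ham\n2,win cash,spam\n3,ok,ham\n")
pass_b = io.StringIO("id,text,label\n1,hi,ham\n2,win cash,spam\n3,ok,unclear\n4,x,ham\n")

matrix, shared = confusion_matrix(read_labels(pass_a), read_labels(pass_b))
print(shared)  # ids found in both passes
print(matrix)  # rows: pass A, columns: pass B
```

Ids that appear in only one pass (id 4 here) are silently skipped in this sketch; a real run would probably want to report them.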
Required columns: id,text,label (optional: label2 for second pass).
- Change --label-col if your project uses a different from_name (e.g., sentiment).
- The exported file can be CSV, JSON, or JSONL (API). This tool supports all three.
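The checks named for the validator (schema, duplicates, empties) can be illustrated on CSV input with the standard library alone. This is a hedged sketch under the column names listed above, not the implementation in validate_labels.py; the function name validate_rows and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def validate_rows(rows, id_col="id", text_col="text", label_col="label"):
    """Return a list of human-readable problems found in the rows."""
    rows = list(rows)
    errors = []
    required = (id_col, text_col, label_col)
    # Line numbers assume one record per line (no multi-line quoted fields).
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        for col in required:
            if not (row.get(col) or "").strip():
                errors.append(f"line {line_no}: missing or empty '{col}'")
    id_counts = Counter(row.get(id_col) for row in rows if row.get(id_col))
    for item_id, count in id_counts.items():
        if count > 1:
            errors.append(f"duplicate id '{item_id}' appears {count} times")
    return errors


# Made-up sample: one empty text field and one repeated id.
sample = io.StringIO("id,text,label\n1,hello,ham\n2,,spam\n2,free prize,spam\n")
problems = validate_rows(csv.DictReader(sample))
for problem in problems:
    print(problem)
```

A missing column and an empty cell are reported the same way here, which keeps the sketch short at the cost of less specific messages.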
SMSSpamCollection (https://archive.ics.uci.edu/dataset/228/sms%2Bspam%2Bcollection)
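The raw file is a TSV of label<TAB>text, and the import file for Label Studio carries only id,text. A minimal sketch of that conversion follows. It assumes sequential integer ids starting at 1, which may differ from what tools/sms_to_csv.py actually assigns; the function name tsv_to_id_text and the two input lines are invented for the example.

```python
import csv
import io


def tsv_to_id_text(tsv_lines, out_handle):
    """Write id,text rows from label<TAB>text lines, dropping the label."""
    writer = csv.writer(out_handle, lineterminator="\n")
    writer.writerow(["id", "text"])
    count = 0
    for line in tsv_lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        _label, _, text = line.partition("\t")  # split on the first tab only
        count += 1
        writer.writerow([count, text])  # assumed id scheme: 1, 2, 3, ...
    return count


# Made-up lines in the label<TAB>text layout (not taken from the dataset).
raw = ["ham\tsee you at 5, ok?\n", "spam\tWIN a prize, call now\n"]
out = io.StringIO()
n = tsv_to_id_text(raw, out)
print(out.getvalue())
```

Splitting on the first tab only keeps any later tabs inside the message text, and the csv writer quotes texts that contain commas.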
Aug 29, 2025 (JST)
- Items: 800/5574 (SMSSpamCollection)
- Label set: ["ham","spam","unclear"]
- κ = 0.967
- Confusion matrix:
  [[670   0   0]
   [  2 121   0]
   [  5   0   2]]
  Rater A class counts: ham=670, spam=123, unclear=7 (sums of matrix rows).
- Edge cases:
  - Some messages contain “…” mid-text, and it is unclear whether this marks truncation or an ellipsis in the original message.
  - Some advertising messages are convincing enough to pass as ham and may be mislabeled as ham if not read carefully.
  - Some messages are not recognizable as standard English, probably because they are old and use slang, which makes their meaning hard to interpret.
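The κ value in the Aug 29 entry can be reproduced from its confusion matrix by hand. The sketch below applies the standard Cohen's κ formula, (observed agreement − chance agreement) / (1 − chance agreement), in plain Python; the function name cohen_kappa is chosen for this example, and agreement_labels.py may compute the value differently (for instance through scikit-learn).

```python
def cohen_kappa(matrix):
    """Cohen's kappa from a square confusion matrix of counts."""
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of items on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Chance agreement: product of the two raters' marginal shares per class.
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    return (observed - expected) / (1 - expected)


# Confusion matrix from the Aug 29 entry (rows: rater A; order ham, spam, unclear).
matrix = [
    [670, 0, 0],
    [2, 121, 0],
    [5, 0, 2],
]
kappa = cohen_kappa(matrix)
print(f"{kappa:.3f}")  # 0.967
```

Here 793 of 800 items sit on the diagonal (observed agreement 0.99125), while the marginals give a chance agreement of about 0.732, which is why κ lands below the raw agreement rate.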
Aug 30, 2025 (JST)
- Items: 1151/5574 (SMSSpamCollection)
- Notes: Added 200 new annotations to ls_export_v1_r3.csv and 150 new annotations to ls_export_v1_r4.csv.
- Edge cases:
  - Some messages contain phone numbers or website links and can be mistaken for spam if not read carefully.
  - Some personal texts are written entirely in uppercase, which is typical of spam ads, and can be misclassified if not read carefully.
  - Some numbers are masked as <DECIMAL> or <#>, which is confusing without context.
- Validator (ls_export_r3.csv):
Format: csv
Total rows: 1151
Unique ids: 1151
Duplicate ids: 0
Errors found: 0