This repository contains the data related to the paper:
De-identification of Dutch Medical Text, by Erik Tjong Kim Sang, Ben de Vries, Wouter Smink, Bernard Veldkamp, Gerben Westerhof and Anneke Sools. In: 2nd Healthcare Text Analytics Conference (HealTAC2019), Cardiff, Wales, UK, 2019. (pdf, bibtex)
The data consist of text versions of two Dutch Wikipedia pages (Mata Hari and Rembrandt van Rijn), which were egofied (converted to diary-style text). The task of the evaluated de-identification systems (deduce and tks) was to identify and remove all names and numbers in the texts, so that the subjects of the texts can no longer be identified.
There are six versions of each text (the last item in the list covers two files, one per system):
- original Wikipedia text from January 2019 (matahari.txt and rembrandt.txt)
- egofied texts (matahari-egofied.txt and rembrandt-egofied.txt)
- part-of-speech tagged text (matahari-pos.txt and rembrandt-pos.txt, one token per line)
- gold annotation of the text (matahari-gold.txt and rembrandt-gold.txt, one token per line)
- output of the two systems, deduce and our system tks (matahari-SYSTEM.txt and rembrandt-SYSTEM.txt, one token per line; replace SYSTEM with the system name)
The system output files contain five columns: (1) gold label; (2) token; (3) system-assigned label; (4) gold label in IOB format; and (5) system-assigned label in IOB format.
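As a quick way to inspect the five-column layout, the sketch below lists the tokens where the gold and system IOB labels disagree. It assumes the columns are separated by whitespace, and the sample rows and label names (PER, O) are hypothetical placeholders rather than lines taken from the data files.

```shell
# Hypothetical sample rows in the five-column layout:
# gold label, token, system label, gold IOB label, system IOB label.
sample='PER Rembrandt PER B-PER B-PER
O schilderde O O O
PER Saskia O B-PER O'

# Print token, gold IOB label and system IOB label
# for every row where columns 4 and 5 differ.
mismatches=$(printf '%s\n' "$sample" | awk '$4 != $5 { print $2, $4, $5 }')
printf '%s\n' "$mismatches"
```

To run the same check on a real file, the awk command could be applied directly, for example to matahari-tks.txt, provided the whitespace assumption holds.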
The egofy software can be found in the data-processing repository.
To reproduce the scores in Table 2 of the paper, run this Linux command:
cat matahari-deduce.txt rembrandt-deduce.txt | ./conlleval
Replace deduce with tks to get the results for our system.
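Both systems can also be evaluated in one go. The loop below is a minimal sketch that assumes the output files and an executable conlleval script are in the current directory; it skips a system when any of them is missing.

```shell
# Evaluate each system on the concatenation of both texts.
for system in deduce tks; do
  a="matahari-$system.txt"
  b="rembrandt-$system.txt"
  if [ -x ./conlleval ] && [ -f "$a" ] && [ -f "$b" ]; then
    echo "== $system =="
    cat "$a" "$b" | ./conlleval
  else
    # Assumption: inputs and conlleval live in the current directory.
    echo "skipping $system: ./conlleval, $a or $b not found"
  fi
done
```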
To reproduce the unlabeled column (UNL) in Table 2 of the paper, run this Linux command:
cat matahari-tks.txt rembrandt-tks.txt | sed 's/-[A-Z][A-Z]*/-UNL/g' | ./conlleval
Replace tks with deduce to get the results for the deduce system.
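To see what the sed step in the unlabeled evaluation does, it can be applied to a single row. The row below is a hypothetical example with a placeholder label (PER); the substitution rewrites the type part of every IOB label to UNL, so only the entity boundaries are compared.

```shell
# Hypothetical five-column row; PER is a placeholder label name.
line='PER Rembrandt PER B-PER B-PER'

# Replace the entity type after the IOB prefix with the single type UNL.
unlabeled=$(printf '%s\n' "$line" | sed 's/-[A-Z][A-Z]*/-UNL/g')
printf '%s\n' "$unlabeled"
```

The pattern matches any hyphen followed by capital letters, so a token such as Noord-Holland would be rewritten as well. That should not affect the scores as long as conlleval only compares the last two columns.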
Erik Tjong Kim Sang, e.tjongkimsang@esciencecenter.nl