Dutch de-identification data

This repository contains the data related to the paper:

De-identification of Dutch Medical Text, by Erik Tjong Kim Sang, Ben de Vries, Wouter Smink, Bernard Veldkamp, Gerben Westerhof and Anneke Sools. In: 2nd Healthcare Text Analytics Conference (HealTAC2019), Cardiff, Wales, UK, 2019. (pdf, bibtex)

Data

The data consist of plain-text versions of two Dutch Wikipedia pages (Mata Hari and Rembrandt van Rijn), which were egofied, that is, converted to diary-style text. The task of the evaluated de-identification systems (deduce and tks) was to find and remove all names and numbers in the texts, so that the subjects of the texts can no longer be identified.

There are six versions of each text (the last item below covers two files, one per system):

  • original Wikipedia text from January 2019 (matahari.txt and rembrandt.txt)
  • egofied texts (matahari-egofied.txt and rembrandt-egofied.txt)
  • part-of-speech tagged text (matahari-pos.txt and rembrandt-pos.txt, one token per line)
  • gold annotation of the text (matahari-gold.txt and rembrandt-gold.txt, one token per line)
  • output of systems (deduce and our system: tks) for the text (matahari-SYSTEM.txt and rembrandt-SYSTEM.txt, one token per line, replace SYSTEM by system name)

The system output files contain five columns: (1) gold label; (2) token; (3) system-assigned label; (4) gold label in IOB format; and (5) system-assigned label in IOB format.
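As a quick sanity check on the five-column layout, the sketch below counts the tokens whose gold IOB label (column 4) differs from the system IOB label (column 5). The function name `count_label_mismatches` is hypothetical, and the sketch assumes the columns are separated by whitespace, which the description above does not state; lines that do not have exactly five fields are skipped.

```shell
# Hypothetical helper, not part of the repository.
# Assumption: columns are whitespace-separated; lines without five fields
# (for example empty separator lines) are ignored.
# Prints "<mismatching tokens> <total tokens>".
count_label_mismatches() {
    awk 'NF == 5 { total++; if ($4 != $5) diff++ }
         END { printf "%d %d\n", diff + 0, total + 0 }' "$@"
}
```

It could be called as `count_label_mismatches matahari-deduce.txt rembrandt-deduce.txt`. This is a token-level count only; it is not a substitute for the chunk-level scores produced by conlleval.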

The egofy software can be found in the data-processing repository.

Evaluation

To reproduce the scores in Table 2 of the paper, run this Linux command:

cat matahari-deduce.txt rembrandt-deduce.txt | ./conlleval

Replace deduce by tks to get the results for our system.
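Both systems can also be scored in one go. The sketch below wraps the command above in a loop; the function name `evaluate_all` is hypothetical, and it assumes that the data files and an executable `conlleval` are in the current directory, as the command above implies.

```shell
# Hypothetical wrapper, not part of the repository.
# Assumption: run from the directory that holds the data files and ./conlleval.
evaluate_all() {
    for system in deduce tks; do
        echo "== $system =="
        cat "matahari-$system.txt" "rembrandt-$system.txt" | ./conlleval
    done
}
```

Calling `evaluate_all` prints one conlleval report per system, each preceded by a header line naming the system.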

To reproduce the unlabeled column (UNL) in Table 2 of the paper, run this Linux command:

cat matahari-tks.txt rembrandt-tks.txt | sed 's/-[A-Z][A-Z]*/-UNL/g' | ./conlleval

Replace tks by deduce to get the results for the deduce system.
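To see what the sed step in the unlabeled evaluation does, the sketch below applies the same substitution to a single sample line. The function name `unlabel` and the sample line, including its label names, are made up for illustration and are not taken from the data files.

```shell
# Same substitution as in the command above, wrapped in a hypothetical helper.
unlabel() {
    sed 's/-[A-Z][A-Z]*/-UNL/g' "$@"
}

# Made-up sample line in the five-column layout; the label names are assumptions.
printf 'PER Mata PER B-PER B-PER\n' | unlabel
# prints: PER Mata PER B-UNL B-UNL
```

Every IOB label keeps its B- or I- prefix but loses its category, so conlleval scores only whether an entity was found, not which type it was given. The pattern is applied to the whole line, so a token containing a hyphen followed by a capital letter is rewritten as well; this should not change the scores, since conlleval reads the labels from the last two columns.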

Contact

Erik Tjong Kim Sang, e.tjongkimsang@esciencecenter.nl
