Dutch de-identification data

This repository contains the data related to the paper:

De-identification of Dutch Medical Text, by Erik Tjong Kim Sang, Ben de Vries, Wouter Smink, Bernard Veldkamp, Gerben Westerhof and Anneke Sools. In: 2nd Healthcare Text Analytics Conference (HealTAC2019), Cardiff, Wales, UK, 2019. (pdf, bibtex)

Data

The data consist of plain-text versions of two Dutch Wikipedia pages (Mata Hari and Rembrandt van Rijn) that were egofied, that is, converted to diary-style text. The task of the evaluated de-identification systems (deduce and tks) was to identify and remove all names and numbers in the texts, so that the subjects of the texts can no longer be identified.

There are six versions of each text (the last item below covers two files, one per system):

original Wikipedia text from January 2019 (matahari.txt and rembrandt.txt)
egofied texts (matahari-egofied.txt and rembrandt-egofied.txt)
part-of-speech tagged text (matahari-pos.txt and rembrandt-pos.txt, one token per line)
gold annotation of the text (matahari-gold.txt and rembrandt-gold.txt, one token per line)
output of the systems (deduce and our system, tks) for the text (matahari-SYSTEM.txt and rembrandt-SYSTEM.txt, one token per line; replace SYSTEM with the system name)

The system output files contain five columns: (1) gold label, (2) token, (3) system-assigned label, (4) gold label in IOB format, and (5) system-assigned label in IOB format.
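The five-column layout can be read with a few lines of Python. This is a minimal sketch, not part of the repository: it assumes the columns are separated by whitespace and that lines without exactly five fields (for example blank lines) can be skipped. The names `TokenRow` and `parse_rows` are hypothetical, and the label `PER` in the usage note is made up for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class TokenRow(NamedTuple):
    """One line of a system output file, in column order."""
    gold: str        # column 1: gold label
    token: str       # column 2: token
    system: str      # column 3: system-assigned label
    gold_iob: str    # column 4: gold label in IOB format
    system_iob: str  # column 5: system-assigned label in IOB format


def parse_rows(lines: Iterable[str]) -> Iterator[TokenRow]:
    """Yield a TokenRow for every line that has exactly five fields."""
    for line in lines:
        fields = line.split()  # assumption: whitespace-separated columns
        if len(fields) != 5:
            continue  # skip blank or otherwise unexpected lines
        yield TokenRow(*fields)
```

For example, `list(parse_rows(["PER Rembrandt PER B-PER B-PER"]))` gives a single row whose `token` is `Rembrandt`. To read a real file, pass an open file handle, such as `parse_rows(open("rembrandt-tks.txt", encoding="utf-8"))`.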

The egofy software can be found in the data-processing repository.

Evaluation

To reproduce the scores in Table 2 of the paper, run this Linux command:

cat matahari-deduce.txt rembrandt-deduce.txt | ./conlleval

Replace deduce with tks to get the results for our system.

To reproduce the unlabeled column (UNL) in Table 2 of the paper, run this Linux command:

cat matahari-tks.txt rembrandt-tks.txt | sed 's/-[A-Z][A-Z]*/-UNL/g' | ./conlleval

Replace tks with deduce to get the results for the deduce system.
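The sed expression in the unlabeled evaluation replaces every hyphen followed by one or more capital letters with -UNL, so that all IOB labels share one entity type and only the entity boundaries are scored. The Python sketch below mirrors that substitution for readers without sed. The function names are hypothetical and the label `PER` in the usage note is made up for illustration.

```python
import re

# Same pattern as the sed expression: a hyphen followed by one or more capitals.
_LABEL_SUFFIX = re.compile(r"-[A-Z]+")


def unlabel_line(line: str) -> str:
    """Replace every '-XYZ' suffix in the line with '-UNL'."""
    return _LABEL_SUFFIX.sub("-UNL", line)


def unlabel_lines(lines):
    """Apply unlabel_line to each line and return the results as a list."""
    return [unlabel_line(line) for line in lines]
```

For example, `unlabel_line("PER Rembrandt PER B-PER B-PER")` returns `"PER Rembrandt PER B-UNL B-UNL"`. Like the sed command, the substitution runs over the whole line, so a hyphen followed by a capital inside a token is rewritten as well. This should not change the scores, because conlleval compares only the last two columns.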

Contact

Erik Tjong Kim Sang, e.tjongkimsang@esciencecenter.nl
