Analysis of Corpora

General Information

To develop good machine learning methods for text mining, a researcher needs a good gold-standard corpus and needs to understand it in order to expose the right attributes and create good features. This section therefore describes the corpora that were used in this thesis and the project. We extended the corpus selection beyond our own corpus, IDP4, so that we have comparable data for evaluating performance against other methods. The selection is a mix of state-of-the-art corpora, full-text and abstract corpora, and corpora that have only ST mentions annotated.

Corpora are saved in resources/corpora

A list of available corpora:

Corpus name	Paper	Publication year	Number of documents	Notes	Source
IDP4	on rostlab	2014	163 Full + Abstracts	Our own project; initial project	Will be publicly available soon on tagtog.net
tmVar	Wei C-H, Harris BR, Kao H-Y, Lu Z (2013)	2013	500 Abstracts	state of the art method based on this corpus; searching for standard mutations	Current package, which is not the one mentioned in the paper
MutationFinder	MutationFinder: A high-performance system for extracting point mutation mentions from text	2007	813 Abstracts	annotations not available for our own validation	MF v1.1 supplemental
SETH	Thomas, P., Rocktäschel, T., Mayer, Y., and Leser, U. (2014)	2014	630 abstracts	aims at standard mutation mentions	github
Variome	Annotating the biomedical literature for the human variome	2013	10 Full	Only full text	tar.gz
Variome_120	subset 118 + 2 extra NL	2014	10 Full	Only full text	here
OSIRIS	Link	2008	105 Abstracts	As first observation, standard mentions only	link
Open Mutation Miner (OMM)
SNP Corpus				from Fraunhofer Inst. and paper
LEAP-FS / Protein Residue Corpus	Ravikumar et al	2012	50 Full	texts not given (only PMIDs)	tar.gz

Name, Year	Docs	Abs	Fulls	Tokens	NormAbs
IDP4, 2015	163	85	78	338093	1601
IDP4+ (It7), 2016	232	154	78	353431	1674
tmVar, 2013	500	500	0	118753	562
Variome, 2012	10	0	10	41736	197
MutationFinder, 2007	813	588	0	168906	800
SETH, 2014	630	630	0	110562	523
OSIRIS, 2008	?	?	?	?	?

Caption of table:

"Docs" stands for the number of documents
"Abs" for the number of abstract-only documents
"Fulls" for the number of full-text documents
"Tokens" for the total number of tokens, counted with the NLTKTokenizer
"NormAbs" for the hypothetical normalized number of abstracts, i.e. the total number of tokens divided by the average number of tokens per abstract (TODO @carstenuhlig: fix the calculation and explain how it is calculated. See #4)

Average number of tokens per abstract over all corpora: 211.100511 (OSIRIS not included)
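The NormAbs column can be reproduced from the Tokens column and the average stated above. The following is a minimal sketch, not the project's actual script: the function name `norm_abs` is hypothetical, and rounding down is an assumption that happens to match the values in the table.

```python
import math

# Average number of tokens per abstract, as stated above
# (the decimal comma is written as a decimal point here).
AVG_TOKENS_PER_ABSTRACT = 211.100511

# Total token counts copied from the "Tokens" column of the table.
TOKENS = {
    "IDP4": 338093,
    "IDP4+ (It7)": 353431,
    "tmVar": 118753,
    "Variome": 41736,
    "MutationFinder": 168906,
    "SETH": 110562,
}


def norm_abs(total_tokens, avg=AVG_TOKENS_PER_ABSTRACT):
    """Hypothetical number of abstracts for a corpus.

    Rounding down (floor) is an assumption, not confirmed by the source.
    """
    return math.floor(total_tokens / avg)


norm_abs_by_corpus = {name: norm_abs(n) for name, n in TOKENS.items()}

for name, value in norm_abs_by_corpus.items():
    print(f"{name}: {value}")
```

OSIRIS is left out because its token count is not known yet.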

Corpora

IDP4

Examiner: Professor Burkhard Rost, Ph.D.
Supervisor: Juan-Miguel Cejuela
Participants: Aleksandar Bojchevski, Rustem Bekmukhametov, Sanjeev Karn, Shpend Mahmuti

The result of our work is a gold standard corpus that, compared to related corpora, additionally contains annotations of natural language mentions of mutations and of relations between entities. The quality of the corpus is reflected by the high inter-annotator agreement.

IDP4 is our main dataset, which we extend via bootstrapping. We use it for training our model and for cross-validation. The manual annotation for the analysis of NL mentions was also performed on this dataset. The authors of this corpus are Aleksandar Bojchevski, Rustem Bekmukhametov, Sanjeev Karn and Shpend Mahmuti. Juan-Miguel Cejuela was the advisor for this project and supports us with tagtog.net. The guidelines that were used to annotate the corpus are available on tagtog.net (URL: "https://www.tagtog.net/jmcejuela/IDP4/-settings"; until the IDP4 corpus is published, access has to be requested from J. M. Cejuela). The aim of this corpus was to create robust annotation guidelines that are based not only on abstracts but also on full-text documents.

The corpus was created using keyword filters in PubMed, and tagtog.net was used for annotation.

Corpus on Github.com resources/corpora/idp4

Statistics

Property	Stat
Full documents	78
Abstract documents	163
Full doc tokens	298510
Abstract doc tokens	39583
All tokens	338093
Hypothetical abstract nr	1601

[Figure: NL vs Total] [Figure: Abstract vs Full]

Definitions:

tmVarRegex: exclusive method based on the RegEx.NL file
tmVarComplete: full pipeline using the web service on abstracts
Inclusive: inclusive method that filters only on a minimum number of spaces and a minimum number of letters, with the number of letters as parameter
Carsten: custom exclusive method that partly uses RegEx.NL together with custom regexes; it follows the rule that a standard mention is anything that contains no natural language parts or phrases (e.g. of, on, at, deletion of)
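To make the difference between the inclusive and the exclusive idea concrete, here is a minimal sketch. It is not the project's implementation: the function names, the default `min_spaces=1` and the word list `NL_WORDS` are assumptions for illustration (the source only names "of", "on", "at" and "deletion of" as examples), and the RegEx.NL file is not used.

```python
import re

# Hypothetical word list built from the examples given in the definitions.
NL_WORDS = {"of", "on", "at", "deletion"}


def is_nl_inclusive(mention, min_letters, min_spaces=1):
    """Inclusive idea: a mention counts as NL if it has at least
    min_spaces spaces and at least min_letters letters."""
    spaces = mention.count(" ")
    letters = sum(ch.isalpha() for ch in mention)
    return spaces >= min_spaces and letters >= min_letters


def is_nl_exclusive(mention):
    """Exclusive idea: a mention is standard only if it contains no
    natural language word; otherwise it counts as NL."""
    words = re.findall(r"[A-Za-z]+", mention.lower())
    return any(word in NL_WORDS for word in words)


examples = ["p.V600E", "c.35delG", "deletion of exon 4"]
for mention in examples:
    print(mention, is_nl_inclusive(mention, min_letters=12), is_nl_exclusive(mention))
```

The inclusive variant needs no word list but depends on the chosen thresholds; the exclusive variant depends on how complete the word list and regexes are.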

tmVar

Chih-Hsuan Wei, Bethany R. Harris, Hung-Yu Kao and Zhiyong Lu

The tmVar corpus is the dataset behind the state-of-the-art method tmVar, and we used it for the analysis of NL mentions by manual annotation. Since we reproduced tmVar as the baseline for our framework, we tested extensively on this corpus to evaluate performance.

Corpus on Github.com resources/corpora/tmvar

Abstract

Motivation: Text-mining mutation information from the literature becomes a critical part of the bioinformatics approach for the analysis and interpretation of sequence variations in complex diseases in the post-genomic era. It has also been used for assisting the creation of disease-related mutation databases. Most of existing approaches are rule-based and focus on limited types of sequence variations, such as protein point mutations. Thus, extending their extraction scope requires significant manual efforts in examining new instances and developing corresponding rules. As such, new automatic approaches are greatly needed for extracting different kinds of mutations with high accuracy.

Results: Here, we report tmVar, a text-mining approach based on conditional random field (CRF) for extracting a wide range of sequence variants described at protein, DNA and RNA levels according to a standard nomenclature developed by the Human Genome Variation Society. By doing so, we cover several important types of mutations that were not considered in past studies. Using a novel CRF label model and feature set, our method achieves higher performance than a state-of-the-art method on both our corpus (91.4 versus 78.1% in F-measure) and their own gold standard (93.9 versus 89.4% in F-measure). These results suggest that tmVar is a high-performance method for mutation extraction from biomedical literature.

Availability: tmVar software and its corpus of 500 manually curated abstracts are available for download at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/pub/tmVar

Statistics

Property	Stat
Full documents	0
Abstract documents	500
Full doc tokens	0
Abstract doc tokens	118753
All tokens	118753
Hypothetical abstract nr	562

[Figure: NL vs Total]

Note: The Abstract vs Full graph could not be created, since the corpus contains no full documents. A minimum length parameter of 28 letters for the inclusive method marks the border between NL mentions recognized by the tmVar regex and those found by the inclusive method: with a minimum length of 29 or more, the mentions found by the inclusive method have no intersection with those found by the tmVar regex.
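Such a border can be located by sweeping the minimum length parameter and checking when the two sets of mentions stop overlapping. The sketch below only illustrates the idea with toy data; the function names, the default `min_spaces=1`, the upper bound `max_letters=60` and the example mentions are all assumptions, and the real analysis ran on the annotated corpus.

```python
def letter_count(mention):
    """Number of alphabetic characters in a mention."""
    return sum(ch.isalpha() for ch in mention)


def inclusive_hits(mentions, min_letters, min_spaces=1):
    """Mentions accepted by the inclusive idea for the given thresholds."""
    return {
        m for m in mentions
        if letter_count(m) >= min_letters and m.count(" ") >= min_spaces
    }


def first_disjoint_threshold(regex_hits, mentions, max_letters=60):
    """Smallest min_letters for which the inclusive hits share no mention
    with regex_hits, or None if there is no such value up to max_letters."""
    for min_letters in range(1, max_letters + 1):
        if not (inclusive_hits(mentions, min_letters) & regex_hits):
            return min_letters
    return None


# Toy data (not taken from any corpus).
mentions = [
    "p.V600E",
    "deletion of exon 4",
    "substitution of alanine for glycine at position 12",
]
regex_hits = {"deletion of exon 4"}  # pretend the regex method found this one

threshold = first_disjoint_threshold(regex_hits, mentions)
print(threshold)
```

In the toy data the only shared mention has 14 letters, so the two sets become disjoint from a minimum length of 15 onward.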

Definitions: see the Definitions list in the IDP4 section.

SETH

Thomas, P., Rocktäschel, T., Mayer, Y., and Leser, U. (2014).

SETH: SNP Extraction Tool for Human Variations.

SETH contains 630 abstracts and provides the main corpus for the SETH method.

Source: rockt/SETH

SETH is a software tool that performs named entity recognition (NER) of single nucleotide polymorphisms (SNPs) and copy number variations (CNVs) in natural language texts. It combines several other tools to improve its results.

Corpus on Github.com resources/corpora/seth

Statistics

Property	Stat
Full documents	0
Abstract documents	630
Full doc tokens	0
Abstract doc tokens	110562
All tokens	110562
Hypothetical abstract nr	523

[Figure: NL vs Total]

Note: The Abstract vs Full graph could not be created, since the corpus contains no full documents. A minimum length parameter of 27 letters for the inclusive method marks the border between NL mentions recognized by the tmVar regex and those found by the inclusive method: with a minimum length of 28 or more, the mentions found by the inclusive method have no intersection with those found by the tmVar regex.

Definitions: see the Definitions list in the IDP4 section.

Variome

Annotating the biomedical literature for the human variome -- Karin Verspoor, Antonio Jimeno Yepes, Lawrence Cavedon, Tara McIntosh, Asha Herten-Crabb, Zoë Thomas, John-Paul Plazzer (2013).

Human Variome Project (HVP)

The Variome Corpus (also known as the Human Variome Project (HVP) Corpus, and previously called the Verspoor Corpus by us) contains 10 full-text documents. It has many NL mentions, many of which would not be considered mutation mentions according to the guidelines of the IDP4 corpus.

Corpus on Github.com resources/corpora/verspoor

Description

Source: opennicta.com.au

A corpus of 10 full text publications, known as the Variome Corpus or alternatively the Human Variome Project (HVP) Corpus

Statistics

Property	Stat
Full documents	10
Abstract documents	10
Full doc tokens	38792
Abstract doc tokens	2944
All tokens	41736
Hypothetical abstract nr	197

[Figure: NL vs Total] [Figure: Abstract vs Full]

Definitions: see the Definitions list in the IDP4 section.

MutationFinder

J. Gregory Caporaso, William A. Baumgartner Jr., David A. Randolph, K. Bretonnel Cohen, and Lawrence Hunter; Bioinformatics, 2007

MutationFinder: A high-performance system for extracting point mutation mentions from text

A gold standard corpus for mutation extraction systems consisting of 1515 human-annotated mutation mentions in 813 MEDLINE abstracts.

Corpus on Github.com resources/corpora/mutationfinder

Statistics

Property	Stat
Full documents	0
Abstract documents	813
Full doc tokens	0
Abstract doc tokens	168906
All tokens	168906
Hypothetical abstract nr	800

Link to definitions

Hypothetical abstract nr = (total number of tokens in the corpus, including full documents) / (average number of tokens per abstract). This value gives a feeling for the size of the corpus.

Guidelines for adding new corpora

Each section should follow the same structure:
1. Participants
2. Link to the corpus on GitHub, including the relative path as `resources/corpora/corpus`
3. Short description or abstract of the original paper
4. General statistics, e.g. number of documents
5. Statistics for the NL definitions

Other Notes

SETH:

No rsids (however, it has some annotated)
No genetic markers (however, it has some annotated)
Mostly no unnumbered

tmVar_test:

Yes rsids
No genetic markers
Mostly yes unnumbered

Variome120:

No rsids at all
No genetic markers
No unnumbered

nala_test:

Yes rsids
Yes genetic markers
Mostly no unnumbered
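Whether a corpus contains rsids can be checked quickly over its annotated mention texts. This is a minimal sketch under assumptions: rsids are taken to be dbSNP-style identifiers ("rs" followed by digits), the function names are hypothetical, and the input is a plain list of mention strings rather than the project's corpus format.

```python
import re

# dbSNP-style reference SNP identifier: "rs" followed by digits, e.g. rs334.
RSID = re.compile(r"\brs\d+\b", re.IGNORECASE)


def has_rsids(annotated_mentions):
    """True if at least one annotated mention contains an rsid."""
    return any(RSID.search(m) for m in annotated_mentions)


def rsid_share(annotated_mentions):
    """Fraction of annotated mentions that contain an rsid."""
    if not annotated_mentions:
        return 0.0
    hits = sum(1 for m in annotated_mentions if RSID.search(m))
    return hits / len(annotated_mentions)


# Toy input (not taken from any corpus).
sample = ["rs334", "p.V600E", "c.35delG", "deletion of exon 4"]
print(has_rsids(sample), rsid_share(sample))
```

The share makes it easier to tell "no rsids at all" apart from "a few annotated" when comparing corpora.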
