Analysis of Corpora

General Information

Developing good machine learning methods for text mining requires a good gold-standard corpus, which the researcher needs to understand in order to expose the right attributes and create good features. This section therefore describes the corpora used in this thesis and the project. We extended the selection beyond our own corpus, IDP4, to include comparable data that we can use to evaluate performance against other methods. The selection mixes state-of-the-art corpora, full-text and abstract corpora, and corpora that have only ST mentions annotated.

Corpora are saved in resources/corpora.

List of available corpora:

Corpus name	Paper	Publication year	Number of documents	Notes	Source
IDP4	on rostlab	2014	163 (full text + abstracts)	Our own project; initial project	Will be publicly available soon on tagtog.net
tmVar	Wei C-H, Harris BR, Kao H-Y, Lu Z (2013)	2013	500 abstracts	State-of-the-art method based on this corpus; searches for standard mutations	Current package, which is not the one mentioned in the paper
MutationFinder	MutationFinder: A high-performance system for extracting point mutation mentions from text	2007	813 abstracts	Annotations not available for our own validation	tar.gz
SETH	Thomas, P., Rocktäschel, T., Mayer, Y., and Leser, U. (2014)	2014	630 abstracts	Aims at standard mutation mentions	GitHub
Verspoor	Annotating the biomedical literature for the human variome	2013	10 full text	Full text only	tar.gz

Name, Year	Docs	Abs	Full	Tokens	HypAbs
IDP4, 2015	163	85	78	338098	1411.05
tmVar, 2013	500	500	0	118753	500
Verspoor, 2013	10	0	10	41240	144.4
MutationFinder, 2007	588	588	0	953841	588
SETH, 2014	630	630	0	110562	630

Table of the corpora used in the project. "Docs" is the number of documents, "Abs" the number of abstract-only documents, "Full" the number of full-text documents, "Tokens" the total number of tokens as produced by the NLTKTokenizer, and "HypAbs" the hypothetical number of abstracts.
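The token counts in the table could be obtained along the following lines. This is a minimal sketch, not the project's code: `count_tokens` and its parameters are hypothetical names, and whitespace splitting is only a stand-in for the NLTK-based tokenizer, so counts produced this way would differ from those in the table.

```python
from typing import Callable, Iterable, List


def count_tokens(texts: Iterable[str],
                 tokenize: Callable[[str], List[str]] = str.split) -> int:
    """Return the total number of tokens over all texts.

    `tokenize` is any callable that maps a string to a list of tokens.
    The default (whitespace splitting) is an assumption for illustration.
    """
    return sum(len(tokenize(text)) for text in texts)


# Made-up example texts, not taken from any of the corpora.
abstracts = [
    "The p.V600E mutation was found in 12 patients.",
    "Deletion of exon 4 abolished binding.",
]
total = count_tokens(abstracts)
print(total)
```

A tokenizer from NLTK, for example the `tokenize` method of `nltk.tokenize.TreebankWordTokenizer`, could be passed as the `tokenize` argument instead of the default.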

Corpora

IDP4

Examiner: Professor Burkhard Rost, Ph.D.
Supervisor: Juan-Miguel Cejuela
Participants: Aleksandar Bojchevski, Rustem Bekmukhametov, Sanjeev Karn, Shpend Mahmuti

The result of our work is a gold-standard corpus that, compared with related corpora, additionally contains annotations of natural language mentions of mutations and of relations between entities. The quality of the corpus is reflected by the high inter-annotator agreement.

IDP4 is the main dataset, which we extend via bootstrapping. We use it to train our model and for cross-validation. The manual annotation for the analysis of NL mentions was also performed on this dataset. The authors of this corpus are Aleksandar Bojchevski, Rustem Bekmukhametov, Sanjeev Karn and Shpend Mahmuti. Juan-Miguel Cejuela was the advisor for this project and supports us with tagtog.net. The guidelines used to annotate the corpus are available on tagtog.net (URL: https://www.tagtog.net/jmcejuela/IDP4/-settings; until the IDP4 corpus is published, access has to be requested from J. M. Cejuela). The aim of this corpus was to create robust annotation guidelines that are based not only on abstracts but also on full-text documents.

The corpus was created using keyword filters in PubMed, together with tagtog.net.

Corpus on Github.com resources/corpora/idp4

Statistics

Property	Stat
Full documents	78
Abstract documents	163
Full doc tokens	299042
Abstract doc tokens	39056
All tokens	338098
Average tokens per abstract	239.61
Hypothetical abstract nr	1411.05

Figures: NL vs Total; Abstract vs Full

Definitions:

tmVarRegex: exclusive method based on the RegEx.NL file
tmVarComplete: full tmVar pipeline, run on abstracts via the web service
Inclusive: inclusive method that filters only by a minimum number of spaces and a minimum number of letters, with the letter count as the parameter
Carsten: custom exclusive method that combines parts of RegEx.NL with custom regexes; it follows the rule that a standard mention is any mention without natural-language parts or phrases (e.g. of, on, at, deletion of)
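The inclusive and the custom exclusive definitions can be illustrated as follows. This is a rough sketch under assumptions: the function names, the default thresholds and the word list are hypothetical and only mirror the examples given in the definitions; the project's actual rules and the RegEx.NL patterns are not reproduced here.

```python
import re

# Hypothetical word list, taken from the examples in the definition above.
NL_WORDS = {"of", "on", "at"}


def is_nl_inclusive(mention: str, min_spaces: int = 2, min_letters: int = 12) -> bool:
    """Inclusive idea: treat a mention as NL if it has enough spaces and letters.

    The default thresholds are placeholders, not the project's values.
    """
    spaces = mention.count(" ")
    letters = sum(ch.isalpha() for ch in mention)
    return spaces >= min_spaces and letters >= min_letters


def is_standard_exclusive(mention: str) -> bool:
    """Exclusive idea: a mention is standard if it contains no NL word."""
    words = re.findall(r"[A-Za-z]+", mention.lower())
    return not any(word in NL_WORDS for word in words)


for mention in ["p.V600E", "deletion of exon 4"]:
    print(mention, is_nl_inclusive(mention), is_standard_exclusive(mention))
```

The two functions answer opposite questions: the inclusive check decides what counts as an NL mention, while the exclusive check decides what counts as a standard mention and treats everything else as NL.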

tmVar

Chih-Hsuan Wei, Bethany R. Harris, Hung-Yu Kao and Zhiyong Lu

The tmVar corpus is the dataset behind the state-of-the-art method tmVar and was used for the manual annotation in the analysis of NL mentions. Since we reproduced tmVar as the baseline for our framework, we tested extensively on this corpus to evaluate performance.

Corpus on Github.com resources/corpora/tmvar

Abstract

Motivation: Text-mining mutation information from the literature becomes a critical part of the bioinformatics approach for the analysis and interpretation of sequence variations in complex diseases in the post-genomic era. It has also been used for assisting the creation of disease-related mutation databases. Most of existing approaches are rule-based and focus on limited types of sequence variations, such as protein point mutations. Thus, extending their extraction scope requires significant manual efforts in examining new instances and developing corresponding rules. As such, new automatic approaches are greatly needed for extracting different kinds of mutations with high accuracy.

Results: Here, we report tmVar, a text-mining approach based on conditional random field (CRF) for extracting a wide range of sequence variants described at protein, DNA and RNA levels according to a standard nomenclature developed by the Human Genome Variation Society. By doing so, we cover several important types of mutations that were not considered in past studies. Using a novel CRF label model and feature set, our method achieves higher performance than a state-of-the-art method on both our corpus (91.4 versus 78.1% in F-measure) and their own gold standard (93.9 versus 89.4% in F-measure). These results suggest that tmVar is a high-performance method for mutation extraction from biomedical literature.

Availability: tmVar software and its corpus of 500 manually curated abstracts are available for download at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/pub/tmVar

Statistics

Property	Stat
Full documents	0
Abstract documents	500
Full doc tokens	0
Abstract doc tokens	118753
All tokens	118753
Average tokens per abstract	237.51
Hypothetical abstract nr	500.00

Figure: NL vs Total

Note: The Abstract vs Full graph could not be created because the corpus contains no full documents. A minimum-length parameter of 28 letters for the inclusive method marks the boundary between the NL mentions recognized by the tmVar regex and those found by the inclusive method: from a minimum length of 29 letters upwards, the mentions found by the two methods no longer intersect.
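A boundary of this kind could be located with a scan like the one below. This is a minimal sketch under assumptions: mentions are plain strings, the inclusive method is reduced to "keep mentions with at least `min_letters` letters", and `first_disjoint_threshold` as well as the example data are hypothetical.

```python
from typing import Iterable, Optional, Set


def letter_count(mention: str) -> int:
    """Number of alphabetic characters in a mention."""
    return sum(ch.isalpha() for ch in mention)


def first_disjoint_threshold(regex_mentions: Set[str],
                             all_mentions: Iterable[str],
                             max_letters: int = 100) -> Optional[int]:
    """Smallest min_letters at which the inclusive result shares no mention
    with the regex result, or None if no such value exists up to max_letters."""
    mentions = list(all_mentions)
    for min_letters in range(1, max_letters + 1):
        inclusive = {m for m in mentions if letter_count(m) >= min_letters}
        if not inclusive & regex_mentions:
            return min_letters
    return None


# Made-up mentions: the regex finds only the shorter one (14 letters).
found_by_regex = {"deletion of exon 4"}
candidates = ["deletion of exon 4",
              "substitution of glycine by arginine at position 12"]
threshold = first_disjoint_threshold(found_by_regex, candidates)
print(threshold)
```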

Definitions:

tmVarRegex: exclusive method based on the RegEx.NL file
tmVarComplete: full tmVar pipeline, run on abstracts via the web service
Inclusive: inclusive method that filters only by a minimum number of spaces and a minimum number of letters, with the letter count as the parameter
Carsten: custom exclusive method that combines parts of RegEx.NL with custom regexes; it follows the rule that a standard mention is any mention without natural-language parts or phrases (e.g. of, on, at, deletion of)

SETH

Thomas, P., Rocktäschel, T., Mayer, Y., and Leser, U. (2014).

SETH: SNP Extraction Tool for Human Variations.

SETH contains 630 abstracts and provides the main corpus for the SETH method.

Source: rockt/SETH

SETH is a software tool that performs named entity recognition (NER) of single nucleotide polymorphisms (SNPs) and copy number variations (CNVs) in natural language texts. It combines several other tools to improve the method.

Corpus on Github.com resources/corpora/seth

Statistics

Property	Stat
Full documents	0
Abstract documents	630
Full doc tokens	0
Abstract doc tokens	110562
All tokens	110562
Average tokens per abstract	175.50
Hypothetical abstract nr	630.00

Figure: NL vs Total

Note: The Abstract vs Full graph could not be created because the corpus contains no full documents. A minimum-length parameter of 27 letters for the inclusive method marks the boundary between the NL mentions recognized by the tmVar regex and those found by the inclusive method: from a minimum length of 28 letters upwards, the mentions found by the two methods no longer intersect.

Definitions:

tmVarRegex: exclusive method based on the RegEx.NL file
tmVarComplete: full tmVar pipeline, run on abstracts via the web service
Inclusive: inclusive method that filters only by a minimum number of spaces and a minimum number of letters, with the letter count as the parameter
Carsten: custom exclusive method that combines parts of RegEx.NL with custom regexes; it follows the rule that a standard mention is any mention without natural-language parts or phrases (e.g. of, on, at, deletion of)

Verspoor

Karin Verspoor, Antonio Jimeno Yepes, Lawrence Cavedon, Tara McIntosh, Asha Herten-Crabb, Zoë Thomas, John-Paul Plazzer (2013).

Human Variome Project (HVP)

Verspoor contains 10 full-text documents. The corpus has many NL mentions, many of which would not be considered mutation mentions according to the guidelines of the IDP4 corpus. It is also known as the Human Variome Project (HVP) corpus.

Corpus on Github.com resources/corpora/verspoor

Description

Source: opennicta.com.au

A corpus of 10 full text publications, known as the Variome Corpus or alternatively the Human Variome Project (HVP) Corpus

Statistics

Property	Stat
Full documents	10
Abstract documents	10
Full doc tokens	38384
Abstract doc tokens	2856
All tokens	41240
Average tokens per abstract	285.60
Hypothetical abstract nr	144.40

Figures: NL vs Total; Abstract vs Full

Definitions:

tmVarRegex: exclusive method based on the RegEx.NL file
tmVarComplete: full tmVar pipeline, run on abstracts via the web service
Inclusive: inclusive method that filters only by a minimum number of spaces and a minimum number of letters, with the letter count as the parameter
Carsten: custom exclusive method that combines parts of RegEx.NL with custom regexes; it follows the rule that a standard mention is any mention without natural-language parts or phrases (e.g. of, on, at, deletion of)

MutationFinder

J. Gregory Caporaso, William A. Baumgartner Jr., David A. Randolph, K. Bretonnel Cohen, and Lawrence Hunter; Bioinformatics, 2007

MutationFinder: A high-performance system for extracting point mutation mentions from text

A gold standard corpus for mutation extraction systems consisting of 1515 human-annotated mutation mentions in 813 MEDLINE abstracts.

Corpus on Github.com resources/corpora/mutationfinder

Statistics

Property	Stat
Full documents	0
Abstract documents	588
Full doc tokens	0
Abstract doc tokens	953841
All tokens	953841
Average tokens per abstract	1622.18
Hypothetical abstract nr	588.00

Link to definitions

Hypothetical abstract nr = (total number of tokens in the corpus, including full documents) / (average number of tokens per abstract). This value gives a feeling for the size of a corpus, expressed in abstract-sized units.
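As a worked example, the definition can be checked against the IDP4 statistics. This is a minimal sketch; the function name `hypothetical_abstract_count` and its parameters are hypothetical, and only the input numbers come from the statistics tables.

```python
def hypothetical_abstract_count(total_tokens: int,
                                abstract_tokens: int,
                                abstract_docs: int) -> float:
    """Total tokens divided by the average number of tokens per abstract."""
    if abstract_docs <= 0 or abstract_tokens <= 0:
        raise ValueError("at least one abstract with tokens is needed")
    avg_tokens_per_abstract = abstract_tokens / abstract_docs
    return total_tokens / avg_tokens_per_abstract


# Values from the IDP4 statistics table:
# all tokens, abstract doc tokens, abstract documents.
idp4 = hypothetical_abstract_count(338098, 39056, 163)
print(round(idp4, 2))  # 1411.05
```

For a corpus that consists of abstracts only, such as tmVar, the total and the abstract token counts are identical, so the value equals the number of abstracts.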

Guidelines for adding new corpora

- Each section should follow the same structure:
1. Participants
2. Link to the corpus on GitHub, including the relative path as `resources/corpora/corpus`
3. Short description or abstract of the original paper
4. General stats, e.g. number of documents
5. NL definitions stats
