Analysis of Corpora

Juan Miguel Cejuela edited this page Dec 9, 2015 · 18 revisions

General Information

To develop good machine learning methods for text mining, a researcher needs a good gold-standard corpus and must understand it well enough to expose the right attributes and create good features. This section therefore describes the corpora that were used in this thesis and project. We extended the selection beyond our own corpus, IDP4, so that we have comparable data for evaluating performance against other methods. The selection mixes state-of-the-art corpora, full-text and abstract corpora, and corpora that only have standard (ST) mentions annotated.

Corpora are saved in `resources/corpora`.

A list of available corpora:

| Corpus name | Paper | Publication year | Number of documents | Notes | Source |
|---|---|---|---|---|---|
| IDP4 | on rostlab | 2014 | 163 Full + Abstracts | Our own project; initial project | Will be publicly available soon on tagtog.net |
| tmVar | Wei C-H, Harris BR, Kao H-Y, Lu Z (2013) | 2013 | 500 Abstracts | State-of-the-art method based on this corpus; searching for standard mutations | Current package, which is not the one mentioned in the paper |
| MutationFinder | MutationFinder: A high-performance system for extracting point mutation mentions from text | 2007 | 813 Abstracts | Annotations not available for our own validation | tar.gz |
| SETH | Thomas, P., Rocktäschel, T., Mayer, Y., and Leser, U. (2014) | 2014 | 630 Abstracts | Aims at standard mutation mentions | github |
| Verspoor | Annotating the biomedical literature for the human variome | 2013 | 10 Full | Only full text | tar.gz |

| Name, Year | Docs | Abs | Full | Tokens | HypAbs |
|---|---|---|---|---|---|
| IDP4, 2015 | 163 | 85 | 78 | 338098 | 1411.05 |
| tmVar, 2013 | 500 | 500 | 0 | 118753 | 500 |
| Verspoor, 2012 | 10 | 0 | 10 | 38384 | 144.4 |
| MutationFinder, 2007 | 588 | 588 | 0 | 953841 | 588 |
| SETH, 2014 | 630 | 630 | 0 | 110562 | 630 |

Corpora used in this project. "Docs" is the number of documents, "Abs" the number of abstract-only documents, "Full" the number of full-text documents, "Tokens" the total number of tokens produced by the NLTKTokenizer, and "HypAbs" the hypothetical number of abstracts.
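As a rough illustration of how the "Docs", "Abs", "Full" and "Tokens" columns could be derived, here is a minimal sketch. The `Document` container, the `summarize` function and the regex tokenizer are assumptions made for this example; the numbers in the table come from the NLTKTokenizer, so this stand-in tokenizer will give different counts. The sketch also assumes that abstract and full-text tokens are both counted, as in the IDP4 statistics.

```python
import re
from dataclasses import dataclass

# Stand-in tokenizer (an assumption, not the NLTKTokenizer used for the table):
# runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


@dataclass
class Document:
    """Hypothetical container: every document has an abstract, some have full text."""
    abstract: str
    full_text: str = ""  # empty for abstract-only documents


def summarize(docs, tokenizer=tokenize):
    """Compute the summary columns for a list of documents."""
    full = sum(1 for d in docs if d.full_text)
    abstract_tokens = sum(len(tokenizer(d.abstract)) for d in docs)
    full_tokens = sum(len(tokenizer(d.full_text)) for d in docs)
    return {
        "Docs": len(docs),
        "Abs": len(docs) - full,  # abstract-only documents
        "Full": full,
        "Tokens": abstract_tokens + full_tokens,
    }


docs = [
    Document("A p.V600E mutation."),
    Document("Short abstract.", "Full text body here."),
]
stats = summarize(docs)
```

Any other tokenizer can be plugged in through the `tokenizer` argument, as long as it maps a string to a list of tokens.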

Corpora

IDP4

Examiner: Professor Burkhard Rost, Ph.D.
Supervisor: Juan-Miguel Cejuela
Participants: Aleksandar Bojchevski, Rustem Bekmukhametov, Sanjeev Karn, Shpend Mahmuti

The result of our work is a gold standard corpus that, compared with related corpora, additionally contains annotations of natural language mentions of mutations and of relations between entities. The quality of the corpus is reflected in its high inter-annotator agreement.

IDP4 is our main dataset and the basis on which we extend our data via bootstrapping. We use it to train our model and for cross-validation. The manual annotation for the NL mention analysis was also performed on this dataset. The authors of this corpus are Aleksandar Bojchevski, Rustem Bekmukhametov, Sanjeev Karn and Shpend Mahmuti. Juan-Miguel Cejuela was the advisor for this project and supports us with tagtog.net. The guidelines that were used to annotate the corpus are available on tagtog.net (URL: https://www.tagtog.net/jmcejuela/IDP4/-settings; until the IDP4 corpus is published, access has to be requested from J. M. Cejuela). The aim of this corpus was to create robust annotation guidelines that are based not only on abstracts but also on full-text documents.

The corpus was created using keyword filters in PubMed, and tagtog.net was used for annotation.

Corpus on GitHub: `resources/corpora/idp4`

Statistics

| Property | Stat |
|---|---|
| Full documents | 78 |
| Abstract documents | 163 |
| Full doc tokens | 299042 |
| Abstract doc tokens | 39056 |
| All tokens | 338098 |
| Average tokens per abstract | 239.61 |
| Hypothetical abstract nr | 1411.05 |

[Figure: NL vs Total] [Figure: Abstract vs Full]

Definitions:

  • tmVarRegex: exclusive method based on the RegEx.NL file
  • tmVarComplete: full tmVar pipeline, run through the web service on abstracts
  • Inclusive: inclusive method that filters mentions using only a minimum number of spaces and a minimum number of letters, with the number of letters as the parameter
  • Carsten: custom exclusive method that uses parts of RegEx.NL together with custom regexes; it follows the rule that a standard mention is anything that contains no natural language parts or phrases (e.g. of, on, at, deletion of)
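The Inclusive and Carsten definitions above could be sketched roughly as follows. This is only an illustration of the two ideas: the function names, the default thresholds and the short cue list are assumptions made for this example and are not taken from the project code or from RegEx.NL.

```python
import re

# Hypothetical cue words, taken from the examples in the definition above.
# The real method also uses parts of RegEx.NL, which is not reproduced here.
NL_CUES = re.compile(r"\b(of|on|at)\b", re.IGNORECASE)


def is_nl_inclusive(mention, min_spaces=3, min_letters=12):
    """Inclusive idea: a mention counts as NL if it has enough spaces and letters.

    The default thresholds are placeholders, not the values used in the project.
    """
    letters = sum(ch.isalpha() for ch in mention)
    return mention.count(" ") >= min_spaces and letters >= min_letters


def is_standard_exclusive(mention):
    """Exclusive idea: a mention is standard if it contains no NL cue word."""
    return NL_CUES.search(mention) is None


examples = ["c.123A>G", "p.V600E", "deletion of three base pairs in exon 4"]
labels = {m: (is_nl_inclusive(m), is_standard_exclusive(m)) for m in examples}
```

The inclusive check accepts anything long enough, so it favours recall; the exclusive check only labels a mention as NL when a cue word is present.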

tmVar

Chih-Hsuan Wei, Bethany R. Harris, Hung-Yu Kao and Zhiyong Lu

The tmVar corpus is the dataset behind the state-of-the-art method tmVar, and we used it for the manual-annotation analysis of NL mentions. Since we reproduced tmVar as the baseline for our framework, we tested extensively on this corpus to evaluate performance.

Corpus on GitHub: `resources/corpora/tmvar`

Abstract

Motivation: Text-mining mutation information from the literature becomes a critical part of the bioinformatics approach for the analysis and interpretation of sequence variations in complex diseases in the post-genomic era. It has also been used for assisting the creation of disease-related mutation databases. Most of existing approaches are rule-based and focus on limited types of sequence variations, such as protein point mutations. Thus, extending their extraction scope requires significant manual efforts in examining new instances and developing corresponding rules. As such, new automatic approaches are greatly needed for extracting different kinds of mutations with high accuracy.

Results: Here, we report tmVar, a text-mining approach based on conditional random field (CRF) for extracting a wide range of sequence variants described at protein, DNA and RNA levels according to a standard nomenclature developed by the Human Genome Variation Society. By doing so, we cover several important types of mutations that were not considered in past studies. Using a novel CRF label model and feature set, our method achieves higher performance than a state-of-the-art method on both our corpus (91.4 versus 78.1% in F-measure) and their own gold standard (93.9 versus 89.4% in F-measure). These results suggest that tmVar is a high-performance method for mutation extraction from biomedical literature.

Availability: tmVar software and its corpus of 500 manually curated abstracts are available for download at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/pub/tmVar

Statistics

| Property | Stat |
|---|---|
| Full documents | 0 |
| Abstract documents | 500 |
| Full doc tokens | 0 |
| Abstract doc tokens | 118753 |
| All tokens | 118753 |
| Average tokens per abstract | 237.51 |
| Hypothetical abstract nr | 500.00 |

[Figure: NL vs Total]

Note: The Abstract vs Full graph could not be created because the corpus contains no full-text documents. A minimum length of 28 letters for the inclusive method marks the border between the NL mentions recognized by the tmVar regex and those found by the inclusive method: from 29 letters upwards, the mentions found by the two methods no longer intersect.
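One way to read this border is as the largest letter count among the mentions found by the tmVar regex: any inclusive threshold above it yields an empty intersection. The sketch below illustrates that reading with made-up mentions. The function names are hypothetical, counting only alphabetic characters as "letters" is an assumption, and the minimum-space attribute of the inclusive method is ignored for simplicity.

```python
def letter_count(mention):
    """Number of alphabetic characters in a mention (assumed meaning of 'letters')."""
    return sum(ch.isalpha() for ch in mention)


def border(regex_mentions):
    """Largest letter count among the mentions found by the regex method."""
    return max(letter_count(m) for m in regex_mentions)


def intersection(regex_mentions, all_mentions, min_letters):
    """Mentions found by both the regex method and the simplified inclusive filter."""
    inclusive = {m for m in all_mentions if letter_count(m) >= min_letters}
    return inclusive & set(regex_mentions)


# Made-up mentions for illustration only; not taken from any corpus.
regex_found = {"deletion of 3 bp", "G to A substitution"}
candidates = regex_found | {"a homozygous insertion of two nucleotides in the promoter"}

b = border(regex_found)
at_border = intersection(regex_found, candidates, b)
above_border = intersection(regex_found, candidates, b + 1)
```

With these example mentions the border is 16 letters, so a threshold of 17 or more leaves no overlap between the two methods.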

Definitions:

  • tmVarRegex: exclusive method based on the RegEx.NL file
  • tmVarComplete: full tmVar pipeline, run through the web service on abstracts
  • Inclusive: inclusive method that filters mentions using only a minimum number of spaces and a minimum number of letters, with the number of letters as the parameter
  • Carsten: custom exclusive method that uses parts of RegEx.NL together with custom regexes; it follows the rule that a standard mention is anything that contains no natural language parts or phrases (e.g. of, on, at, deletion of)

SETH

Thomas, P., Rocktäschel, T., Mayer, Y., and Leser, U. (2014).

SETH: SNP Extraction Tool for Human Variations.

The SETH corpus contains 630 abstracts and is the main corpus for the SETH method.

Source: rockt/SETH

SETH is a software tool that performs named entity recognition (NER) of single nucleotide polymorphisms (SNPs) and copy number variations (CNVs) in natural language texts. It combines several other tools to improve its results.

Corpus on GitHub: `resources/corpora/seth`

Statistics

| Property | Stat |
|---|---|
| Full documents | 0 |
| Abstract documents | 630 |
| Full doc tokens | 0 |
| Abstract doc tokens | 110562 |
| All tokens | 110562 |
| Average tokens per abstract | 175.50 |
| Hypothetical abstract nr | 630.00 |

[Figure: NL vs Total]

Note: The Abstract vs Full graph could not be created because the corpus contains no full-text documents. A minimum length of 27 letters for the inclusive method marks the border between the NL mentions recognized by the tmVar regex and those found by the inclusive method: from 28 letters upwards, the mentions found by the two methods no longer intersect.

Definitions:

  • tmVarRegex: exclusive method based on the RegEx.NL file
  • tmVarComplete: full tmVar pipeline, run through the web service on abstracts
  • Inclusive: inclusive method that filters mentions using only a minimum number of spaces and a minimum number of letters, with the number of letters as the parameter
  • Carsten: custom exclusive method that uses parts of RegEx.NL together with custom regexes; it follows the rule that a standard mention is anything that contains no natural language parts or phrases (e.g. of, on, at, deletion of)

Verspoor

Karin Verspoor, Antonio Jimeno Yepes, Lawrence Cavedon, Tara McIntosh, Asha Herten-Crabb, Zoë Thomas, John-Paul Plazzer (2013).

Human Variome Project (HVP)

The Verspoor corpus contains 10 full-text documents. It has many NL mentions, many of which would not be considered mutation mentions under the IDP4 annotation guidelines. The corpus is also known as the Human Variome Project (HVP) corpus.

Corpus on GitHub: `resources/corpora/verspoor`

Description

Source: opennicta.com.au

A corpus of 10 full text publications, known as the Variome Corpus or alternatively the Human Variome Project (HVP) Corpus

Statistics

| Property | Stat |
|---|---|
| Full documents | 10 |
| Abstract documents | 10 |
| Full doc tokens | 38384 |
| Abstract doc tokens | 2856 |
| All tokens | 41240 |
| Average tokens per abstract | 285.60 |
| Hypothetical abstract nr | 144.40 |

[Figure: NL vs Total] [Figure: Abstract vs Full]

Definitions:

  • tmVarRegex: exclusive method based on the RegEx.NL file
  • tmVarComplete: full tmVar pipeline, run through the web service on abstracts
  • Inclusive: inclusive method that filters mentions using only a minimum number of spaces and a minimum number of letters, with the number of letters as the parameter
  • Carsten: custom exclusive method that uses parts of RegEx.NL together with custom regexes; it follows the rule that a standard mention is anything that contains no natural language parts or phrases (e.g. of, on, at, deletion of)

MutationFinder

J. Gregory Caporaso, William A. Baumgartner Jr., David A. Randolph, K. Bretonnel Cohen, and Lawrence Hunter; Bioinformatics, 2007

MutationFinder: A high-performance system for extracting point mutation mentions from text

A gold standard corpus for mutation extraction systems consisting of 1515 human-annotated mutation mentions in 813 MEDLINE abstracts.

Corpus on GitHub: `resources/corpora/mutationfinder`

Statistics

| Property | Stat |
|---|---|
| Full documents | 0 |
| Abstract documents | 588 |
| Full doc tokens | 0 |
| Abstract doc tokens | 953841 |
| All tokens | 953841 |
| Average tokens per abstract | 1622.18 |
| Hypothetical abstract nr | 588.00 |

Link to definitions

Hypothetical abstract nr = (total number of tokens in the corpus, including full-text documents) / (average number of tokens per abstract). It gives a feeling for the size of the corpus, expressed as the number of abstracts it would correspond to.
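As a worked example of this definition, the following sketch recomputes the value for IDP4 and Verspoor from the numbers in their Statistics tables. The function name `hypothetical_abstracts` is made up for this illustration.

```python
def hypothetical_abstracts(all_tokens, abstract_tokens, abstract_docs):
    """Total tokens divided by the average number of tokens per abstract."""
    avg_tokens_per_abstract = abstract_tokens / abstract_docs
    return all_tokens / avg_tokens_per_abstract


# IDP4: 338098 tokens in total, 39056 abstract tokens in 163 abstracts
idp4 = hypothetical_abstracts(338098, 39056, 163)

# Verspoor: 41240 tokens in total, 2856 abstract tokens in 10 abstracts
verspoor = hypothetical_abstracts(41240, 2856, 10)

print(round(idp4, 2))      # 1411.05
print(round(verspoor, 2))  # 144.4
```

For abstract-only corpora such as tmVar or SETH, the value equals the number of abstracts, because all tokens are abstract tokens.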

Guidelines for adding new corpora

Each section should follow the same structure:
1. Participants
2. Link to the corpus on GitHub, including the relative path as `resources/corpora/corpus`
3. Short description or abstract of the original paper
4. General stats, e.g. number of documents
5. NL definitions stats