
Analysis of Corpora

Juan Miguel Cejuela edited this page Nov 23, 2016 · 18 revisions

General Information

Developing good machine learning methods for text mining requires a good gold-standard corpus, and the researcher has to understand that corpus in order to expose the right attributes and create good features. This section therefore describes the corpora used in this thesis and project. We extended the selection beyond our own corpus, IDP4, so that we have comparable data for evaluating performance against other methods. The selection mixes state-of-the-art corpora, full-text and abstract corpora, and corpora that annotate only ST (standard) mentions.

Corpora are saved in resources/corpora

A list of available corpora:

| Corpus name | Paper | Publication year | Number of documents | Notes | Source |
| --- | --- | --- | --- | --- | --- |
| IDP4 | on rostlab | 2014 | 163 Full + Abstracts | Our own project; initial project | Will be publicly available soon on tagtog.net |
| tmVar | Wei C-H, Harris BR, Kao H-Y, Lu Z (2013) | 2013 | 500 Abstracts | state of the art method based on this corpus; searching for standard mutations | Current package, which is not the one mentioned in the paper |
| MutationFinder | MutationFinder: A high-performance system for extracting point mutation mentions from text | 2007 | 813 Abstracts | annotations not available for our own validation | MF v1.1 supplemental |
| SETH | Thomas, P., Rocktäschel, T., Mayer, Y., and Leser, U. (2014) | 2014 | 630 Abstracts | aims at standard mutation mentions | github |
| Variome | Annotating the biomedical literature for the human variome | 2013 | 10 Full | Only full text | tar.gz |
| Variome_120 | subset 118 + 2 extra NL | 2014 | 10 Full | Only full text | here |
| OSIRIS | Link | 2008 | 105 Abstracts | As first observation, standard mentions only | link |
| Open Mutation Miner (OMM) | | | | | |
| SNP Corpus from Fraunhofer Inst. and paper | | | | | |
| LEAP-FS / Protein Residue Corpus | Ravikumar et al | 2012 | 50 Full | texts not given (only PMIDs) | tar.gz |

| Name, Year | Docs | Abs | Fulls | Tokens | NormAbs |
| --- | --- | --- | --- | --- | --- |
| IDP4, 2015 | 163 | 85 | 78 | 338093 | 1601 |
| IDP4+ (It7), 2016 | 232 | 154 | 78 | 353431 | 1674 |
| tmVar, 2013 | 500 | 500 | 0 | 118753 | 562 |
| Variome, 2012 | 10 | 0 | 10 | 41736 | 197 |
| MutationFinder, 2007 | 813 | 588 | 0 | 168906 | 800 |
| SETH, 2014 | 630 | 630 | 0 | 110562 | 523 |
| OSIRIS, 2008 | ? | ? | ? | ? | ? |

Caption of table:

  • "Docs" stands for the number of documents
  • "Abs" for the number of abstract-only documents
  • "Fulls" for the number of full-text documents
  • "Tokens" for the total number of tokens, counted with the NLTKTokenizer
  • "NormAbs" for the hypothetical normalized number of abstracts (TODO @carstenuhlig: fix the calculation and explain how it is calculated. See #4)

Average number of tokens per abstract over all corpora: 211.100511 (OSIRIS not included)

Corpora

IDP4

Examiner: Professor Burkhard Rost, Ph.D.
Supervisor: Juan-Miguel Cejuela
Participants: Aleksandar Bojchevski, Rustem Bekmukhametov, Sanjeev Karn, Shpend Mahmuti

The result of our work is a gold standard corpus that, compared to related corpora, additionally contains annotations of natural language mentions of mutations and of relations between entities. The high inter-annotator agreement reflects the quality of the corpus.

IDP4 is the main dataset that we extend via bootstrapping. We use it to train our model and for cross-validation, and the manual annotation for the NL mentions analysis was also performed on it. The authors of the corpus are Aleksandar Bojchevski, Rustem Bekmukhametov, Sanjeev Karn and Shpend Mahmuti. Juan-Miguel Cejuela was the advisor for the project and supports us with tagtog.net. The annotation guidelines are available on tagtog.net (https://www.tagtog.net/jmcejuela/IDP4/-settings; until the IDP4 corpus is published, access has to be requested from J. M. Cejuela). The aim of the corpus was to create robust annotation guidelines that are based not only on abstracts but also on full-text documents.

The documents were selected with keyword filters in PubMed, and tagtog.net was used to annotate them.

Corpus on GitHub: `resources/corpora/idp4`

Statistics

| Property | Stat |
| --- | --- |
| Full documents | 78 |
| Abstract documents | 163 |
| Full doc tokens | 298510 |
| Abstract doc tokens | 39583 |
| All tokens | 338093 |
| Hypothetical abstract nr | 1601 |

[Figure: NL vs Total] [Figure: Abstract vs Full]

Definitions:

  • tmVarRegex: exclusive method based on tmVar's RegEx.NL file
  • tmVarComplete: the full tmVar pipeline, run on abstracts through the web service
  • Inclusive: inclusive method that filters only by a minimum number of spaces and a minimum number of letters, with the letter minimum as the parameter
  • Carsten: custom exclusive method that partly uses RegEx.NL together with custom regular expressions; it follows the rule that a standard mention is anything that contains no natural language parts or phrases (e.g. of, on, at, deletion of)
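
The difference between the inclusive and the exclusive idea can be sketched in a few lines of Python. This is only one possible reading of the definitions above, not the project's code: the function names, the default thresholds and the short phrase list are assumptions made for this example.

```python
import re

# Hypothetical phrase list, taken from the examples in the definitions above.
NL_PHRASES = ("deletion of", "of", "on", "at")

# One case-insensitive pattern that matches any of the phrases as whole words.
_NL_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in NL_PHRASES) + r")\b",
    re.IGNORECASE,
)


def is_nl_inclusive(mention, min_spaces=2, min_letters=10):
    """Inclusive reading: long, multi-word mentions count as NL.

    Both thresholds are illustrative defaults, not the project's values.
    """
    spaces = mention.count(" ")
    letters = sum(ch.isalpha() for ch in mention)
    return spaces >= min_spaces and letters >= min_letters


def is_nl_exclusive(mention):
    """Exclusive reading: a mention is standard unless it contains an NL phrase."""
    return _NL_PATTERN.search(mention) is not None


for text in ("c.123A>G", "p.V600E", "deletion of exon 5"):
    print(text, is_nl_inclusive(text), is_nl_exclusive(text))
```

With these assumed settings, only "deletion of exon 5" is flagged as NL by both functions. Raising `min_letters` makes the inclusive filter stricter, which is the kind of parameter sweep the notes in the corpus sections refer to.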

tmVar

Chih-Hsuan Wei, Bethany R. Harris, Hung-Yu Kao and Zhiyong Lu

The tmVar corpus is the dataset behind the state-of-the-art method tmVar and was used for the analysis of the NL mentions by manual annotation. Since we reproduced tmVar as the baseline for our framework, we tested extensively on this corpus to evaluate performance.

Corpus on GitHub: `resources/corpora/tmvar`

Abstract

Motivation: Text-mining mutation information from the literature becomes a critical part of the bioinformatics approach for the analysis and interpretation of sequence variations in complex diseases in the post-genomic era. It has also been used for assisting the creation of disease-related mutation databases. Most of existing approaches are rule-based and focus on limited types of sequence variations, such as protein point mutations. Thus, extending their extraction scope requires significant manual efforts in examining new instances and developing corresponding rules. As such, new automatic approaches are greatly needed for extracting different kinds of mutations with high accuracy.

Results: Here, we report tmVar, a text-mining approach based on conditional random field (CRF) for extracting a wide range of sequence variants described at protein, DNA and RNA levels according to a standard nomenclature developed by the Human Genome Variation Society. By doing so, we cover several important types of mutations that were not considered in past studies. Using a novel CRF label model and feature set, our method achieves higher performance than a state-of-the-art method on both our corpus (91.4 versus 78.1% in F-measure) and their own gold standard (93.9 versus 89.4% in F-measure). These results suggest that tmVar is a high-performance method for mutation extraction from biomedical literature.

Availability: tmVar software and its corpus of 500 manually curated abstracts are available for download at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/pub/tmVar

Statistics

| Property | Stat |
| --- | --- |
| Full documents | 0 |
| Abstract documents | 500 |
| Full doc tokens | 0 |
| Abstract doc tokens | 118753 |
| All tokens | 118753 |
| Hypothetical abstract nr | 562 |

[Figure: NL vs Total]

Note: The Abstract vs Full graph could not be created because the corpus contains no full-text documents. A minimum-length parameter of 28 letters for the inclusive method marks the boundary between the NL mentions recognized by the tmVar regex and those found by the inclusive method: from 29 letters upward, the mentions found by the two methods no longer intersect.

Definitions: see the list in the IDP4 section.

SETH

Thomas, P., Rocktäschel, T., Mayer, Y., and Leser, U. (2014).

SETH: SNP Extraction Tool for Human Variations.

The SETH corpus contains 630 abstracts and is the main corpus of the SETH method.

Source: rockt/SETH

SETH is a software tool that performs named entity recognition (NER) of single nucleotide polymorphisms (SNPs) and copy number variations (CNVs) in natural language texts. It combines several other tools to improve its results.

Corpus on GitHub: `resources/corpora/seth`

Statistics

| Property | Stat |
| --- | --- |
| Full documents | 0 |
| Abstract documents | 630 |
| Full doc tokens | 0 |
| Abstract doc tokens | 110562 |
| All tokens | 110562 |
| Hypothetical abstract nr | 523 |

[Figure: NL vs Total]

Note: The Abstract vs Full graph could not be created because the corpus contains no full-text documents. A minimum-length parameter of 27 letters for the inclusive method marks the boundary between the NL mentions recognized by the tmVar regex and those found by the inclusive method: from 28 letters upward, the mentions found by the two methods no longer intersect.

Definitions: see the list in the IDP4 section.

Variome

Annotating the biomedical literature for the human variome -- Karin Verspoor, Antonio Jimeno Yepes, Lawrence Cavedon, Tara McIntosh, Asha Herten-Crabb, Zoë Thomas, John-Paul Plazzer (2013).

Human Variome Project (HVP)

The Variome Corpus, also known as the Human Variome Project (HVP) Corpus and previously referred to by us as the Verspoor Corpus, contains 10 full-text documents. It has many NL mentions, many of which would not count as mutation mentions under the guidelines of the IDP4 corpus.

Corpus on GitHub: `resources/corpora/verspoor`

Description

Source: opennicta.com.au

A corpus of 10 full text publications, known as the Variome Corpus or alternatively the Human Variome Project (HVP) Corpus

Statistics

| Property | Stat |
| --- | --- |
| Full documents | 10 |
| Abstract documents | 10 |
| Full doc tokens | 38792 |
| Abstract doc tokens | 2944 |
| All tokens | 41736 |
| Hypothetical abstract nr | 197 |

[Figure: NL vs Total] [Figure: Abstract vs Full]

Definitions: see the list in the IDP4 section.

MutationFinder

J. Gregory Caporaso, William A. Baumgartner Jr., David A. Randolph, K. Bretonnel Cohen, and Lawrence Hunter; Bioinformatics, 2007

MutationFinder: A high-performance system for extracting point mutation mentions from text

A gold standard corpus for mutation extraction systems consisting of 1515 human-annotated mutation mentions in 813 MEDLINE abstracts.

Corpus on GitHub: `resources/corpora/mutationfinder`

Statistics

| Property | Stat |
| --- | --- |
| Full documents | 0 |
| Abstract documents | 813 |
| Full doc tokens | 0 |
| Abstract doc tokens | 168906 |
| All tokens | 168906 |
| Hypothetical abstract nr | 800 |

Link to definitions

Hypothetical abstract nr = (total number of tokens in the corpus, including full-text documents) / (average number of tokens per abstract). It expresses the size of a corpus in abstract-equivalents, which gives a feeling for how large the corpus is.
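
As a worked example, the formula can be applied to the token counts from the statistics tables, using the average of about 211.1 tokens per abstract given in General Information. The function name is an assumption, and so is the choice to truncate the quotient, although truncation does reproduce the listed values for these corpora.

```python
import math

# Average number of tokens per abstract over all corpora, as given in
# General Information (OSIRIS not included).
AVG_TOKENS_PER_ABSTRACT = 211.100511


def hypothetical_abstract_nr(total_tokens,
                             avg_tokens_per_abstract=AVG_TOKENS_PER_ABSTRACT):
    """Size of a corpus in abstract-equivalents, truncated to an integer.

    Truncation is an assumption of this sketch, not a documented rule.
    """
    return math.floor(total_tokens / avg_tokens_per_abstract)


# Total token counts ("All tokens") from the statistics tables.
corpus_tokens = {
    "IDP4": 338093,
    "tmVar": 118753,
    "Variome": 41736,
    "MutationFinder": 168906,
    "SETH": 110562,
}

for name, tokens in corpus_tokens.items():
    print(name, hypothetical_abstract_nr(tokens))
```

For IDP4, for instance, 338093 / 211.100511 is roughly 1601.6, which truncates to the 1601 shown in the table.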

Guidelines for adding new corpora

Each corpus section should have the same structure:
1. participants
2. link to the corpus on GitHub, including the relative path as `resources/corpora/corpus`
3. short description or abstract of the original paper
4. general stats, e.g. number of documents
5. NL mention stats with their definitions

Other Notes

SETH:

  • no rsids (however, it has some annotated)
  • No genetic markers (however, it has some annotated)
  • Mostly no unnumbered

tmVar_test:

  • Yes rsids
  • No genetic markers
  • Mostly yes unnumbered

Variome120:

  • No rsids at all
  • No genetic markers
  • No unnumbered

nala_test:

  • Yes rsids
  • Yes genetic markers
  • Mostly no unnumbered