Analysis of Corpora
To develop good machine learning methods for text mining, a researcher needs a good gold-standard corpus and needs to understand it in order to expose the right attributes and create good features. This section therefore describes the corpora that were used in this thesis and the project. We extended the corpus selection beyond our own corpus, IDP4, to include comparable data that we can use to evaluate performance against other methods. The selection mixes state-of-the-art corpora, full-text and abstract corpora, and corpora that have only ST mentions annotated.
Corpora are saved in resources/corpora
A List of available corpora:
Corpus name | Paper | Publication year | Number of documents | Notes | Source |
---|---|---|---|---|---|
IDP4 | on rostlab | 2014 | 163 Full + Abstracts | Our own project; initial project | Will be publicly available soon on tagtog.net |
tmVar | Wei C-H, Harris BR, Kao H-Y, Lu Z (2013) | 2013 | 500 Abstracts | state of the art method based on this corpus; searching for standard mutations | Current package, which is not the one mentioned in the paper |
MutationFinder | MutationFinder: A high-performance system for extracting point mutation mentions from text | 2007 | 813 Abstracts | annotations not available for our own validation | MF v1.1 supplemental |
SETH | Thomas, P., Rocktäschel, T., Mayer, Y., and Leser, U. (2014) | 2014 | 630 abstracts | aims at standard mutation mentions | github |
Variome | Annotating the biomedical literature for the human variome | 2013 | 10 Full | Only full text | tar.gz |
Variome_120 | subset 118 + 2 extra NL | 2014 | 10 Full | Only full text | here |
OSIRIS | Link | 2008 | 105 Abstracts | As first observation, standard mentions only | link |
Open Mutation Miner (OMM) | |||||
SNP Corpus | from Fraunhofer Inst. and paper | ||||
LEAP-FS / Protein Residue Corpus | Ravikumar et al | 2012 | 50 Full | texts not given (only PMIDs) | tar.gz |
Name, Year | Docs | Abs | Fulls | Tokens | NormAbs |
---|---|---|---|---|---|
IDP4, 2015 | 163 | 85 | 78 | 338093 | 1601 |
IDP4+ (It7), 2016 | 232 | 154 | 78 | 353431 | 1674 |
tmVar, 2013 | 500 | 500 | 0 | 118753 | 562 |
Variome, 2012 | 10 | 0 | 10 | 41736 | 197 |
MutationFinder, 2007 | 813 | 813 | 0 | 168906 | 800 |
SETH, 2014 | 630 | 630 | 0 | 110562 | 523 |
OSIRIS, 2008 | ? | ? | ? | ? | ? |
Caption of table:
- "Docs": number of documents
- "Abs": number of abstract-only documents
- "Fulls": number of full-text documents
- "Tokens": total number of tokens, counted with the NLTKTokenizer
- "NormAbs": hypothetical normalized number of abstracts (TODO @carstenuhlig: fix the calculation and explain how it is calculated. See #4)
Examiner: Professor Burkhard Rost, Ph.D.
Supervisor: Juan-Miguel Cejuela
Participants: Aleksandar Bojchevski, Rustem Bekmukhametov, Sanjeev Karn, Shpend Mahmuti
The result of our work is a gold standard corpus that, compared to related corpora, additionally contains annotations of natural language mentions of mutations and of relations between entities. The quality of the corpus is reflected by the high inter-annotator agreement.
IDP4 is the main dataset that we extend via bootstrapping. We use it to train our model and for cross-validation, and the manual annotation for the NL mentions analysis was also performed on it. The authors of this corpus are Aleksandar Bojchevski, Rustem Bekmukhametov, Sanjeev Karn and Shpend Mahmuti. Juan-Miguel Cejuela was the advisor for this project and supports us with tagtog.net. The guidelines used to annotate the corpus are available on tagtog.net (URL: https://www.tagtog.net/jmcejuela/IDP4/-settings; until the IDP4 corpus is published, access has to be requested from J. M. Cejuela). The aim of this corpus was to create robust annotation guidelines that are based not only on abstracts but also on full-text documents.
The corpus was created using keyword filters in PubMed together with tagtog.net.
Corpus on Github.com resources/corpora/idp4
Property | Stat |
---|---|
Full documents | 78 |
Abstract documents | 163 |
Full doc tokens | 298510 |
Abstract doc tokens | 39583 |
All tokens | 338093 |
Hypothetical abstract nr | 1601 |
Definitions:
- tmVarRegex: exclusive method based on the RegEx.NL file
- tmVarComplete: full tmVar pipeline, run via the web service on abstracts
- Inclusive: inclusive method that filters only by a minimum number of spaces and a minimum number of letters, with the letter count as the parameter
- Carsten: custom exclusive method that combines parts of RegEx.NL with custom regexes; it follows the rule that a standard mention is anything that contains no natural language parts or phrases (e.g. of, on, at, deletion of)
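The Inclusive definition above can be read as a simple threshold filter. The following is a minimal sketch under that reading, not the project's actual implementation: the function name `is_nl_mention_inclusive`, its parameters, and the choice to count alphabetic characters and space characters are all assumptions made for illustration.

```python
def is_nl_mention_inclusive(mention: str, min_letters: int, min_spaces: int) -> bool:
    """Hypothetical inclusive filter: flag a mention as natural language (NL)
    when it has at least `min_letters` alphabetic characters and at least
    `min_spaces` space characters. Both thresholds are assumptions."""
    letters = sum(ch.isalpha() for ch in mention)  # digits and punctuation are ignored
    spaces = mention.count(" ")
    return letters >= min_letters and spaces >= min_spaces


if __name__ == "__main__":
    # Example threshold values only; they are not recommended settings.
    for text in ["p.A123T", "deletion of three nucleotides in exon 4"]:
        print(text, is_nl_mention_inclusive(text, min_letters=28, min_spaces=1))
```

Raising `min_letters` makes the filter stricter, so fewer and longer mentions are classified as NL.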
Chih-Hsuan Wei, Bethany R. Harris, Hung-Yu Kao and Zhiyong Lu
The tmVar corpus is the dataset behind the state-of-the-art method tmVar, and it was used for the analysis of NL mentions by manual annotation. Since we reproduced tmVar as the baseline for our framework, we tested extensively on this corpus to evaluate performance.
Corpus on Github.com resources/corpora/tmvar
Motivation: Text-mining mutation information from the literature becomes a critical part of the bioinformatics approach for the analysis and interpretation of sequence variations in complex diseases in the post-genomic era. It has also been used for assisting the creation of disease-related mutation databases. Most of existing approaches are rule-based and focus on limited types of sequence variations, such as protein point mutations. Thus, extending their extraction scope requires significant manual efforts in examining new instances and developing corresponding rules. As such, new automatic approaches are greatly needed for extracting different kinds of mutations with high accuracy.
Results: Here, we report tmVar, a text-mining approach based on conditional random field (CRF) for extracting a wide range of sequence variants described at protein, DNA and RNA levels according to a standard nomenclature developed by the Human Genome Variation Society. By doing so, we cover several important types of mutations that were not considered in past studies. Using a novel CRF label model and feature set, our method achieves higher performance than a state-of-the-art method on both our corpus (91.4 versus 78.1% in F-measure) and their own gold standard (93.9 versus 89.4% in F-measure). These results suggest that tmVar is a high-performance method for mutation extraction from biomedical literature.
Availability: tmVar software and its corpus of 500 manually curated abstracts are available for download at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/pub/tmVar
Property | Stat |
---|---|
Full documents | 0 |
Abstract documents | 500 |
Full doc tokens | 0 |
Abstract doc tokens | 118753 |
All tokens | 118753 |
Hypothetical abstract nr | 562 |
Note: The abstract vs. full-text graph could not be created because this corpus contains no full-text documents. A minimum length of 28 letters for the inclusive method marks the boundary between the NL mentions recognized by the tmVar regex and those found by the inclusive method: from a minimum length of 29 letters upwards, the two sets of found mentions no longer intersect.
Definitions:
- tmVarRegex: exclusive method based on the RegEx.NL file
- tmVarComplete: full tmVar pipeline, run via the web service on abstracts
- Inclusive: inclusive method that filters only by a minimum number of spaces and a minimum number of letters, with the letter count as the parameter
- Carsten: custom exclusive method that combines parts of RegEx.NL with custom regexes; it follows the rule that a standard mention is anything that contains no natural language parts or phrases (e.g. of, on, at, deletion of)
Thomas, P., Rocktäschel, T., Mayer, Y., and Leser, U. (2014).
SETH: SNP Extraction Tool for Human Variations.
The SETH corpus contains 630 abstracts and is the main corpus for the SETH method.
Source: rockt/SETH
SETH is a software tool that performs named entity recognition (NER) of single nucleotide polymorphisms (SNPs) and copy number variations (CNVs) from natural language texts. It combines several other tools to improve its results.
Corpus on Github.com resources/corpora/seth
Property | Stat |
---|---|
Full documents | 0 |
Abstract documents | 630 |
Full doc tokens | 0 |
Abstract doc tokens | 110562 |
All tokens | 110562 |
Hypothetical abstract nr | 523 |
Note: The abstract vs. full-text graph could not be created because this corpus contains no full-text documents. A minimum length of 27 letters for the inclusive method marks the boundary between the NL mentions recognized by the tmVar regex and those found by the inclusive method: from a minimum length of 28 letters upwards, the two sets of found mentions no longer intersect.
Definitions:
- tmVarRegex: exclusive method based on the RegEx.NL file
- tmVarComplete: full tmVar pipeline, run via the web service on abstracts
- Inclusive: inclusive method that filters only by a minimum number of spaces and a minimum number of letters, with the letter count as the parameter
- Carsten: custom exclusive method that combines parts of RegEx.NL with custom regexes; it follows the rule that a standard mention is anything that contains no natural language parts or phrases (e.g. of, on, at, deletion of)
Annotating the biomedical literature for the human variome -- Karin Verspoor, Antonio Jimeno Yepes, Lawrence Cavedon, Tara McIntosh, Asha Herten-Crabb, Zoë Thomas, John-Paul Plazzer (2013).
Human Variome Project (HVP)
The Variome corpus (also known as the Human Variome Project (HVP) corpus, and previously referred to by us as the Verspoor corpus) contains 10 full-text documents. It has many NL mentions, many of which would not be considered mutation mentions according to the guidelines of the IDP4 corpus.
Corpus on Github.com resources/corpora/verspoor
Source: opennicta.com.au
A corpus of 10 full text publications, known as the Variome Corpus or alternatively the Human Variome Project (HVP) Corpus
Property | Stat |
---|---|
Full documents | 10 |
Abstract documents | 10 |
Full doc tokens | 38792 |
Abstract doc tokens | 2944 |
All tokens | 41736 |
Hypothetical abstract nr | 197 |
Definitions:
- tmVarRegex: exclusive method based on the RegEx.NL file
- tmVarComplete: full tmVar pipeline, run via the web service on abstracts
- Inclusive: inclusive method that filters only by a minimum number of spaces and a minimum number of letters, with the letter count as the parameter
- Carsten: custom exclusive method that combines parts of RegEx.NL with custom regexes; it follows the rule that a standard mention is anything that contains no natural language parts or phrases (e.g. of, on, at, deletion of)
J. Gregory Caporaso, William A. Baumgartner Jr., David A. Randolph, K. Bretonnel Cohen, and Lawrence Hunter; Bioinformatics, 2007
MutationFinder: A high-performance system for extracting point mutation mentions from text
A gold standard corpus for mutation extraction systems consisting of 1515 human-annotated mutation mentions in 813 MEDLINE abstracts.
Corpus on Github.com resources/corpora/mutationfinder
Property | Stat |
---|---|
Full documents | 0 |
Abstract documents | 813 |
Full doc tokens | 0 |
Abstract doc tokens | 168906 |
All tokens | 168906 |
Hypothetical abstract nr | 800 |
Hypothetical abstract nr = (total number of tokens in the corpus, including full documents) / (average number of tokens per abstract). This number gives a feeling for the size of the corpus.
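The formula above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the text does not state which set of abstracts the average is taken over, and truncating the result to an integer is an assumption about how the table values were rounded.

```python
def average_tokens_per_abstract(abstract_tokens: int, abstract_docs: int) -> float:
    """Average number of tokens per abstract for a given set of abstracts."""
    if abstract_docs <= 0:
        raise ValueError("abstract_docs must be positive")
    return abstract_tokens / abstract_docs


def hypothetical_abstract_count(total_tokens: int, avg_tokens_per_abstract: float) -> int:
    """Total tokens (including full documents) divided by the average
    abstract length. Truncation to int is an assumption, not a documented rule."""
    if avg_tokens_per_abstract <= 0:
        raise ValueError("avg_tokens_per_abstract must be positive")
    return int(total_tokens / avg_tokens_per_abstract)


if __name__ == "__main__":
    # Token and document counts taken from the tmVar table; using the
    # corpus's own abstracts as the reference set is an assumption.
    avg = average_tokens_per_abstract(abstract_tokens=118753, abstract_docs=500)
    print(avg, hypothetical_abstract_count(total_tokens=118753, avg_tokens_per_abstract=avg))
```

Because the reference average is passed in explicitly, the same function can be reused once the open calculation question (see #4) is settled.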
- each section should be equally structured:
1. participants
2. link to the corpus on GitHub, with the relative path included as `resources/corpora/corpus`
3. short description or abstract of original paper
4. general stats e.g. amount of documents
5. nl definitions stats
SETH:
- no rsids (however, it has some annotated)
- No genetic markers (however, it has some annotated)
- Mostly no unnumbered
tmVar_test:
- Yes rsids
- No genetic markers
- Mostly yes unnumbered
Variome120:
- No rsids at all
- No genetic markers
- No unnumbered
nala_test:
- Yes rsids
- Yes genetic markers
- Mostly no unnumbered