
Background, References, and Resources

Juan Miguel Cejuela edited this page Mar 4, 2016 · 6 revisions

Introduction to Text Mining

Biomedical research has shifted from the study of single genes and proteins to the study of systems comprising many genes and proteins on multiple levels. Current trends in pharmacological research show a growing recognition of the multifactorial aetiopathology of diseases, which requires looking at more than one gene. Significant efforts are currently underway to explain the variability observed in drug responses for complex diseases. This means an increasing number of publications must be read before choosing where to start; in the end, researchers often have to select hundreds of papers to read. Such analyses, called meta-analyses of a specific topic, are very common. To meet this challenge, the project group has taken on the task of collecting and organizing published experimental data into a format suitable for large-scale querying, comparison and computational analysis, by converting some of the free-text data into controlled-vocabulary-based statements. Biomedical researchers today have become increasingly dependent on such computable datasets provided by biological databases for data access, analysis and discovery. Text mining thus plays an important role in bioinformatics downstream analyses, such as the systematic study of the biological effects of protein mutations [1], [2], the curation of mutation-related databases, and personalized medicine [3]–[5].

Text mining has improved dramatically over the last decade, but many challenges remain. Established nomenclature helps in identifying relevant entities in text. For example, dbSNP RS numbers [6] can easily be detected with regular expressions, but they are not the only way of describing mutation mentions. Mentions following the Human Genome Variation Society nomenclature (HGVS [7]) appear increasingly often, but are frequently used improperly: authors routinely write more variants of the notation than the official nomenclature defines, and fixed regular expressions cannot anticipate all of them.
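As an illustration, here is a minimal regex-based sketch: the rsID pattern follows dbSNP's simple identifier format, while the HGVS pattern is a deliberately simplified stand-in covering only one protein-substitution form (real HGVS covers far more cases, and free-text variants escape both):

```python
import re

# dbSNP RS identifiers follow a simple, stable pattern.
RS_PATTERN = re.compile(r'\brs\d+\b', re.IGNORECASE)

# A deliberately loose sketch of one HGVS-style protein substitution form
# (e.g. "p.Arg117His" or "p.R117H"); real HGVS covers many more cases.
HGVS_PROTEIN_SUB = re.compile(
    r'\bp\.(?:[A-Z][a-z]{2}|[A-Z])\d+(?:[A-Z][a-z]{2}|[A-Z])\b'
)

text = "The variant rs334 (HbS, p.Glu6Val) is also written as E6V by some authors."
print(RS_PATTERN.findall(text))        # ['rs334']
print(HGVS_PROTEIN_SUB.findall(text))  # ['p.Glu6Val']
```

Note that the informal "E6V" in the example is missed by the HGVS pattern, which is exactly the kind of gap described above.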

Related pages in this wiki:

  • Conditional Random Fields
  • Literature on unsupervised / word representation features (papers mostly sorted by relevance / interestingness)
  • Mixed List
  • Python (some good links)

Methods of Text Mining

MutationFinder [8], one of the available methods for automatic curation of protein mutations, focuses on standard mentions that differ from each other only in minor details. The method uses over 700 regular expressions and is one of the current state-of-the-art tools.
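The regexes below are hypothetical, simplified examples of the kind of pattern such a tool chains together, not MutationFinder's actual expressions: a one-letter "wNm" form (A54T) and the equivalent three-letter form (Ala54Thr).

```python
import re

# Illustrative patterns only: one-letter and three-letter point-mutation forms.
AA3 = ('Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|'
       'Trp|Tyr|Val')
ONE_LETTER = re.compile(r'\b([ACDEFGHIKLMNPQRSTVWY])(\d+)([ACDEFGHIKLMNPQRSTVWY])\b')
THREE_LETTER = re.compile(r'\b(%s)(\d+)(%s)\b' % (AA3, AA3))

sentence = "The Ala54Thr substitution (also reported as A54T) reduced activity."
print(THREE_LETTER.findall(sentence))  # [('Ala', '54', 'Thr')]
print(ONE_LETTER.findall(sentence))    # [('A', '54', 'T')]
```

A real system needs hundreds of such variants (separators, arrows, prefixes), which is why pattern-only approaches become hard to maintain.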

As already stated, such systems usually begin by recognizing entities in text (so-called named entities); some, for example, just extract mutation mentions from biomedical literature [8], [9]. The next step is to extract relations from the text in order to link mutation mentions with the corresponding genes [10], [11]; other systems focus on the association with diseases [12]–[14].

There are different ways to approach this challenge, but the first and main remaining task - named entity recognition - is still not perfect. Methods based on regular expressions, like the already mentioned MutationFinder, are very static and need a lot of effort to reach respectable performance (MutationFinder achieves around 90% F-measure on its own corpus [8]). Rule-based methods like VTag [9] also perform well (around 82% F-measure [9]), but likewise require much work.

State of the Art: tmVar

tmVar [15] uses Conditional Random Fields (CRFs), as VTag does. At the time of writing it represents the state-of-the-art method for extracting protein mutation mentions from biomedical literature. As explained in detail later, our project group reproduced tmVar in our own CRF-based method for protein mutation extraction (and normalisation), in order to have a good seed method to start off with.

Conditional Random Fields (CRFs) are undirected graphical models; for sequence recognition they are usually used in their linear-chain form (Linear-Chain CRFs). A CRF involves two types of random variables: one type represents the labels, which depend on each other along the chain; the other type represents the observed data outside the Markov model, supplied e.g. as a sequence of tokens.

Feature functions are defined that can depend on combinations of labels and properties of tokens - a token itself or its neighbours. Each feature carries a weight, and the features are scored together. Through machine learning these weights are adjusted so that the best-possible label sequence is favoured. In the end the features - of which there are usually a few thousand to millions - define the model fitted to the data it was trained on.

When an unknown (unlabelled) sequence is given, the label sequence with the highest score is used as the prediction. For further information please refer to the tmVar paper [15].
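As a toy illustration of this scoring (labels, features and weights below are invented; real CRFs learn the weights from training data and use Viterbi decoding instead of brute force):

```python
import itertools

# Toy linear-chain CRF: score(y|x) = sum_t emission[t][y_t] + sum_t transition[y_{t-1}][y_t]
LABELS = ['O', 'Mut']

def sequence_score(emissions, transitions, labels):
    score = sum(e[y] for e, y in zip(emissions, labels))
    score += sum(transitions[a][b] for a, b in zip(labels, labels[1:]))
    return score

def best_labeling(emissions, transitions):
    # Brute force over all label sequences (fine for toy sizes;
    # Viterbi computes the same argmax in O(T * |labels|^2)).
    return max(
        itertools.product(LABELS, repeat=len(emissions)),
        key=lambda ys: sequence_score(emissions, transitions, ys),
    )

emissions = [  # per-token scores, e.g. from a feature "token matches [A-Z]\d+[A-Z]"
    {'O': 2.0, 'Mut': -1.0},   # "the"
    {'O': -0.5, 'Mut': 1.5},   # "A54T"
    {'O': 1.0, 'Mut': -2.0},   # "mutation"
]
transitions = {'O': {'O': 0.5, 'Mut': 0.0}, 'Mut': {'O': 0.0, 'Mut': 0.5}}
print(best_labeling(emissions, transitions))  # ('O', 'Mut', 'O')
```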

tmVar Pipeline

The tmVar pipeline consists of three main steps apart from Import and Output:

  • Pre-processing
  • CRFs
  • Post-processing

Pre-processing splits the text into sentences and then tokenizes them, separating by words and, when a word is a code, splitting it further at certain boundaries such as numbers following letters [15]. The text is then prepared to be labelled by tmVar into 11 different categories. Feature generation is applied afterwards and feeds the main part of tmVar, the CRF machine learning with CRF++ [16]. Post-processing is the last big step of the pipeline and involves many regular expressions. The following figure shows the main steps of the pipeline.

tmVar Pipeline

Pipeline overview of tmVar. The system includes three major components: pre-processing (tokenization, splitting, feature generation), mutation identification (CRFs) and post-processing. [15]
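A hedged sketch of the kind of sub-token splitting done in pre-processing - the actual tmVar rules are richer than this single regex, but the idea is that inside a "code" token the boundaries between letters, digits and symbols become splits, so the CRF can label each piece separately:

```python
import re

# Split a code-like token at letter/digit/symbol boundaries
# (simplified stand-in for tmVar's tokenization rules).
def split_code_token(token):
    return re.findall(r'[A-Za-z]+|\d+|[^A-Za-z\d\s]', token)

print(split_code_token('c.123A>T'))   # ['c', '.', '123', 'A', '>', 'T']
print(split_code_token('p.Glu6Val'))  # ['p', '.', 'Glu', '6', 'Val']
```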

Natural Language Mentions (NL)

All the previously mentioned methods [8], [15] mostly concentrate on high performance for mentions written in some sort of nomenclature or code, which can be captured by pattern recognition methods such as regular expressions; their limitation is that they only capture such standard mutation mentions. But humans also use natural language (NL) for protein mutation mentions. Since NL mentions have never really been treated, this thesis studies NL mentions as the major factor to improve the performance of literature curation. The first goal is to show the significance of NL mentions through analyses focused on them. The IDP4 corpus [17], which contains many natural language mutation mentions, is used to understand how NL mutation mentions behave. Within the framework of the group, we decided to split this task into two subtasks.

The first subtask is the automatic evaluation of a corpus and the second subtask is the manual annotation of two corpora.

In order to quickly extract information from a given corpus or text, I decided to implement several Definers that take already annotated text and subclassify the annotations into so-called standard or natural language mentions. A few characteristic values are extracted to provide a quick understanding of the text or corpus, which could help with developing features for natural language mentions.

Another idea is to separate the method into two models, so that we can work on a dedicated natural language CRF model that uses only the information relevant to natural language. We expected the result to be more promising, since the mutation mentions would not be polluted by nomenclatures and coding conventions that otherwise add unnecessary noise.

Measurement Values

The ratio of NL mutation mentions (RatioNL) is defined as the absolute number of NL mutation mentions (NLmentions) divided by all mutation mentions (ALLmentions) over the whole corpus.

RatioNL = NLmentions / ALLmentions

The ratio of NL mutation mentions in abstracts to NL mutation mentions in full-text documents (RatioAbstractFull) is more involved, since I wanted to normalise the NL mutation mentions by the amount of text a person would have to read on average to find one NL mutation mention. This indicates which part of a document has more value when annotating, and in consequence whether focusing on abstracts is suitable for our purpose.

The normalisation is done by simply dividing NLmentions by the absolute number of tokens (Tokens) in the respective part; for the abstract this means the NLmentions of the abstract (NLmentionsAbstract) divided by the Tokens of the abstract (TokensAbstract).

NormAbstract = NLmentionsAbstract / TokensAbstract

Likewise, for the body of the full-text document: the NLmentions of the full-text document (NLmentionsFull) divided by the Tokens of the full-text document (TokensFull).

NormFull = NLmentionsFull / TokensFull

The ratio of NL mutation mentions between abstract and full-text documents would be:

RatioAbstractFull = NormAbstract / NormFull

The calculation was implemented in the class SimpleExclusiveNLDefiner defined in nala.preprocessing.definers.
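The measurement values above can be sketched directly in code (the counts in the example are invented for illustration, not taken from the real corpora):

```python
# Ratio of NL mentions over all mentions in a corpus.
def ratio_nl(nl_mentions, all_mentions):
    return nl_mentions / all_mentions

# NL-mention density: mentions per token of text.
def norm(nl_mentions, tokens):
    return nl_mentions / tokens

# Ratio of abstract density to full-text density.
def ratio_abstract_full(nl_abs, tok_abs, nl_full, tok_full):
    return norm(nl_abs, tok_abs) / norm(nl_full, tok_full)

print(ratio_nl(3, 10))  # 0.3
# e.g. 12 NL mentions in 6,000 abstract tokens vs 30 in 90,000 full-text tokens
print(round(ratio_abstract_full(12, 6000, 30, 90000), 6))  # 6.0
```

A ratio above 1 means a reader finds NL mentions faster in abstracts than in full text.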

Natural Language Definitions as Algorithms

For the automatic analysis we defined an interface, nala.preprocessing.definers.NLDefiner, to create multiple variants of natural language definers. In the process of developing the Definer I created 3 subclasses, which are shown in the table of subclasses used.

Sub-class  Type  Full Name                          2-class separation  Meaning                                            Example Method
0          ST    standard mutation mention          ST                  pure nomenclature                                  MutationFinder [8]
2          SS    semi-standard mutation mention     NL                  mutation mention that contains some NL attributes  tmVar [15]
1          NL    natural language mutation mention  NL                  pure natural language mutation mention             nala Framework

Table of subclasses used. ST stands for standard mention, SS for semi-standard and NL for natural language mention. Each of them is specific to protein mutations, but can be adapted by simply changing the dictionaries and regular expressions used for classification.

Now to the actual algorithm, or rather the attributes used to classify the mentions. There are two main approaches that I was able to implement:

  • Inclusive: directly defines which mentions count as NL mentions.
  • Exclusive: defines ST mentions and classifies anything not recognised as an NL mention.

The attributes of the mentions that were exposed to the NLDefiners are as follows:

  • Length of mention in characters
  • Length of mention in tokens (that is dependent on the tokenizer used in the pipeline)
  • Various regular expressions
    • Regular expressions from the post processing step of tmVar [15]
    • Custom regular expressions defined by me to capture additional unofficial coding conventions like '12 A --> C'
  • Dictionary of NL words to distinguish between ST mentions and SS mentions.

As explained before, there are two main approaches, for which I developed 3 NLDefiners. One of them is the InclusiveNLDefiner, which takes the minimum length in characters and the minimum length in tokens as parameters. The other two are exclusive definers: the SimpleExclusiveNLDefiner is a simple version that only classifies into the subclasses ST and NL, while the ExclusiveNLDefiner is the final version and classifies into all 3 subclasses using a dictionary in nala.data.dict_nl_words.json.

Two further definers, TmVarRegexNLDefiner and TmVarNLDefiner, were developed in the framework of the group for comparison with tmVar. They were used later for some minor analyses, which are available in this repository.
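A minimal sketch of the exclusive approach (the patterns and the dictionary below are illustrative stand-ins, not the actual contents of nala.data.dict_nl_words.json): a mention matching a nomenclature pattern is ST; otherwise, if it contains words from an NL dictionary it is NL, else SS.

```python
import re

# Illustrative ST patterns: rsID, one-letter wNm form, three-letter HGVS form.
ST_PATTERNS = [
    re.compile(r'^rs\d+$', re.IGNORECASE),
    re.compile(r'^[A-Z]\d+[A-Z]$'),                     # A54T
    re.compile(r'^p\.[A-Z][a-z]{2}\d+[A-Z][a-z]{2}$'),  # p.Glu6Val
]
# Illustrative NL dictionary.
NL_WORDS = {'substitution', 'replaced', 'deletion', 'inserted', 'mutation', 'residue'}

def classify(mention):
    if any(p.match(mention) for p in ST_PATTERNS):
        return 'ST'
    words = set(re.findall(r'[a-z]+', mention.lower()))
    return 'NL' if words & NL_WORDS else 'SS'

print(classify('A54T'))                                  # ST
print(classify('glutamate at position 6 was replaced'))  # NL
print(classify('Glu6 -> Val'))                           # SS
```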

Manual Definitions

The second analysis focused on finding the percentage of mutation mentions in natural language that do not also appear as a standard mention. In other words, if a mention appears as both an NL and a standard mention within the same text, its significance is much lower than that of mutation mentions that only appear as an NL mention and are never translated into a standard mention. To do this, two annotators manually inspected the mutation mentions of every article from the IDP4 corpus and from the Verspoor corpus. This analysis was performed in the framework of the project group.

Results

The evaluation of the first analysis was performed on all corpora except the MutationFinder corpus, since for that corpus only the already normalised annotations are provided [8]. The figure Fraction of NL mentions shows the fraction of NL mentions per corpus, whereas the figure Ratio of NL mentions shows the ratio between abstracts and full-text documents.

Fraction of NL mentions

Fraction of NL mentions. The fraction was calculated as explained above (measurement values). Each corpus was evaluated with the ExclusiveNLDefiner and the InclusiveNLDefiner (using minimum length parameters of 18 and 28 characters).

Ratio of NL mentions

Ratio of NL mentions normalised between abstracts and full-text documents. The ratio was calculated as explained in measurement values. Each corpus was evaluated with the ExclusiveNLDefiner and the InclusiveNLDefiner (using minimum length parameters of 18 and 28 characters). Verspoor [20] and IDP4 [17] were the only provided corpora containing full-text documents.

Corpus    NL       Non-recurring NL  At least 1 non-recurring NL
IDP4      24.31 %  10.19 %           31.52 %
Verspoor  73.87 %   4.95 %           40.00 %

Table of manual annotation by A. Bojchevski. Non-recurring NL = natural language mutation mentions that do not reoccur as a standard mention; NL = natural language mutation mentions. At least 1 non-recurring NL = fraction of all documents that have at least one non-recurring NL mention. These values were acquired by manually annotating each mutation mention according to the guidelines of the IDP4 corpus [17].

Discussion

As introduced above, the fraction of NL mentions varies between corpora. For instance, in the tmVar corpus [15] there are only around 3 % NL mentions, whereas in the Verspoor corpus there are 10 to 30 % NL mentions depending on the NLDefiner used. The difference is that tmVar [15] focuses on ST mentions, while the Verspoor corpus was created to provide core information relevant to genetic variation through the Variome Annotation Schema [28]. SETH [27], on the other hand, focuses on ST mentions like tmVar; consequently, less than 5 % of the mentions in its corpus are NL.

With the IDP4 corpus we provide a corpus with robust annotations, defined through annotation guidelines [17], that help produce solid annotations without dissecting mentions into multiple parts. In it, almost 10 % of mentions are recognised as NL. Supporting this hypothesis, we additionally analysed, through manual annotation, the NL mentions that do not reoccur as ST mentions. This task was carried out by A. Bojchevski on the IDP4 [17] and Verspoor [20] corpora. Between 5 and 10 % of all NL mentions were recognised as unique, without a reoccurring ST mention. It therefore makes sense to analyse NL mentions, since in these cases they are the only way of recognising the protein mutation.

Notable is the high number of NL mentions found in the Verspoor corpus [28] through manual annotation. We concluded that extracting every NL mention would force us to include dissected mentions, unclear mentions and the like, which is not feasible for machine learning, at least not yet. We need solid mutation mentions that can be related to a gene or protein using e.g. GNormPlus [18].

In conclusion, NL mentions are too prominent to ignore.

The question of whether to use abstracts or rather full-text documents could be answered by checking the ratio of NL mentions between abstracts and full-text documents. This measurement is described in measurement values, with results presented in the figure Ratio of NL mentions normalised between abstracts and full-text documents. Because the ratio is higher than 1 - IDP4 [17] has 5 times more NL mentions in the abstracts and Verspoor [20] around 1.8 times more - I concluded that we should use abstracts.

This also makes our workflow much easier, since we do not have to find ways to download full-text documents. In consequence, extending our dataset with new mentions would be easy enough to encourage people to use our tool.


Further Analysis

One of the goals of this project was to determine the significance of natural language mentions, in order to decide whether it is worth putting effort into improving prediction performance for such mentions. Other existing methods, such as tmVar, MutationFinder and SETH, only focus on predicting mutation mentions in standard (or semi-standard) format, which is naturally an easier task. The motivation behind exploring natural language mentions is to provide the community with a tool that can handle them well, should they prove to be significant.


In order to determine the significance of NL mentions we performed several analyses. The first was finding the ratio of standard vs. NL mentions in abstracts and full texts. Those statistics for different corpora can be found here.

The second analysis focused on finding the percentage of mutation mentions in natural language that do not appear as a standard mention. In other words, if a mention appears as both an NL and a standard mention within the same text, its significance is much lower than that of mutation mentions that only appear as an NL mention and are not translated into a standard mention. To do that, two annotators manually inspected the mutation mentions of every article from the IDP4 corpus and from the Verspoor corpus. For each article we recorded:

  • the total number of mentions
  • the number of NL mentions
  • the number of NL mentions that do not exist as a standard mention within the same text

With those 3 numbers we calculated some interesting statistics, which can be found in detail here. From the analysis we deemed the following statistics important:

  • Between 32% and 45% (depending on the corpus) of documents contain at least one NL mention that is not translated into a standard mention within the same text
  • Between 5% and 12% (depending on the corpus) of all mentions are NL mentions that are not translated into a standard mention within the same text.
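The aggregation from the three per-article counts to such corpus-level statistics can be sketched as follows (the article counts below are invented for illustration):

```python
# Each tuple: (total mentions, NL mentions, non-recurring NL mentions)
articles = [(10, 3, 1), (8, 0, 0), (5, 2, 2), (12, 4, 0)]

total = sum(t for t, _, _ in articles)
non_recurring = sum(n for _, _, n in articles)
docs_with_non_recurring = sum(1 for _, _, n in articles if n > 0)

# Fraction of all mentions that are non-recurring NL mentions.
print(non_recurring / total)
# Fraction of documents with at least one non-recurring NL mention.
print(docs_with_non_recurring / len(articles))
```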

These 2 statistics, along with the first analysis, clearly demonstrate that NL mentions are significant enough.

Furthermore, when comparing abstracts vs. full text, the difference in terms of the first statistic is not that big; in terms of the second statistic, however, we have:

  • Around 8% for full texts
  • Between 14% and 19% for abstracts

Filtering of PubMed Papers Through DocSelector

Using High Recall Regex On Raw Text

based on Version Commit: carstenuhlig/thesis-alex-carsten/developer/9e5e326c460ae7e24649214471cefa9df9a213ac

Results are saved in: Results in shuffled file

  • Verification
    • Manual
      • by Carsten: ~70-80 % NL mentions among the found annotations
      • by Aleksandar: TODO
    • Automatic evaluation using the IDP4 corpus: ~20 %
    • Note: the evaluation for tmVar and IDP4 counted overlapping annotations; a match could also lie in the same sentence without overlapping and would then not be seen.

Bibliography

[1] J. M. Izarzugaza, M. Krallinger, and A. Valencia, “Interpretation of the consequences of mutations in protein kinases: Combined use of bioinformatics and text mining,” Front Physiol, vol. 3, p. 323, 2012, ISSN: 1664-042X. DOI: 10.3389/fphys.2012.00323.

[2] R. Winnenburg, C. Plake, and M. Schroeder, “Improved mutation tagging with gene identifiers applied to membrane protein stability prediction,” BMC Bioinformatics, vol. 10 Suppl 8, S3, 2009, ISSN: 1471-2105. DOI: 10.1186/1471-2105-10-S8-S3.

[3] G. Gyimesi, D. Borsodi, H. Saranko, H. Tordai, B. Sarkadi, and T. Hegedus, “Abcmdb: A database for the comparative analysis of protein mutations in abc transporters, and a potential framework for a general application,” Hum Mutat, vol. 33, no. 11, pp. 1547–1556, 2012, ISSN: 1098-1004. DOI: 10.1002/humu.22138.

[4] R. Kuipers, T. van den Bergh, H.-J. Joosten, R. H. Lekanne dit Deprez, M. M. Mannens, and P. J. Schaap, “Novel tools for extraction and validation of disease-related mutations applied to Fabry disease,” Human Mutation, vol. 31, no. 9, pp. 1026–32, 2010, ISSN: 1098-1004. DOI: 10.1002/humu.21317.

[5] E. Capriotti, N. L. Nehrt, M. G. Kann, and Y. Bromberg, “Bioinformatics for personal genome interpretation,” Briefings in Bioinformatics, vol. 13, no. 4, pp. 495–512, 2012, ISSN: 1477-4054. DOI: 10.1093/bib/bbr070.

[6] W. Yu, R. R. Ned, A. Wulf, T. Liu, M. J. Khoury, and M. Gwinn, “The need for genetic variant naming standards in published abstracts of human genetic association studies,” BMC Research Notes, vol. 2, p. 56, 2009, ISSN: 1756-0500. DOI: 10.1186/1756-0500-2-56.

[7] P. E. M. Taschner and J. T. den Dunnen, “Describing structural changes by extending HGVS sequence variation nomenclature,” Human Mutation, vol. 32, no. 5, pp. 507–11, May 2011, ISSN: 1098-1004. DOI: 10.1002/humu.21427.

[8] J. G. Caporaso, W. A. Baumgartner, D. A. Randolph, K. B. Cohen, and L. Hunter, “MutationFinder: A high-performance system for extracting point mutation mentions from text,” Bioinformatics, vol. 23, no. 14, pp. 1862–5, 2007, ISSN: 1367-4811. DOI: 10.1093/bioinformatics/btm235.

[9] R. T. McDonald, R. S. Winters, M. Mandel, Y. Jin, P. S. White, and F. Pereira, “An entity tagger for recognizing acquired genomic variations in cancer literature,” Bioinformatics, vol. 20, no. 17, pp. 3249–3251, 2004, ISSN: 1367-4803. DOI: 10.1093/bioinformatics/bth350.

[10] F. Horn, A. Lau, and F. Cohen, “Automated extraction of mutation data from the literature: Application of MuteXt to G protein-coupled receptors and nuclear hormone receptors,” Bioinformatics, vol. 20, no. 4, pp. 557–568, 2004, ISSN: 1367-4803. DOI: 10.1093/bioinformatics/btg449.

[11] D. Rebholz-Schuhmann, S. Marcel, S. Albert, R. Tolle, G. Casari, and H. Kirsch, “Automatic extraction of mutations from Medline and cross-validation with OMIM,” Nucleic Acids Research, vol. 32, no. 1, pp. 135–142, 2004, ISSN: 1362-4962. DOI: 10.1093/nar/gkh162.

[12] E. Doughty, A. Kertesz-Farkas, O. Bodenreider, G. Thompson, A. Adadey, T. Peterson, and M. G. Kann, “Toward an automatic method for extracting cancer- and other disease-related point mutations from the biomedical literature,” Bioinformatics, vol. 27, no. 3, pp. 408–15, 2011, ISSN: 1367-4811. DOI: 10.1093/bioinformatics/btq667.

[13] L. I. Furlong, H. Dach, M. Hofmann-Apitius, and F. Sanz, “OSIRISv1.2: A named entity recognition system for sequence variants of genes in biomedical literature,” BMC Bioinformatics, vol. 9, p. 84, 2008, ISSN: 1471-2105. DOI: 10.1186/1471-2105-9-84.

[14] S. Yeniterzi and U. Sezerman, “EnzyMiner: Automatic identification of protein level mutations and their impact on target enzymes from PubMed abstracts,” BMC Bioinformatics, vol. 10 Suppl 8, S2, 2009, ISSN: 1471-2105. DOI: 10.1186/1471-2105-10-S8-S2.

[15] C.-H. Wei, B. R. Harris, H.-Y. Kao, and Z. Lu, “tmVar: A text mining approach for extracting sequence variants in biomedical literature,” Bioinformatics, vol. 29, no. 11, pp. 1433–1439, 2013.

[16] T. Kudo, CRF++: Yet another CRF toolkit, 2005.

[17] J.-M. Cejuela, A. Bojchevski, R. Bekmukhametov, S. Karn, and S. Mahmuti, IDP4 corpus, 2015.

[18] C.-H. Wei, H.-Y. Kao, and Z. Lu, “GNormPlus: An integrative approach for tagging genes, gene families, and protein domains,” BioMed Research International, vol. 2015, 2015.

[19] J. M. Cejuela, P. McQuilton, L. Ponting, S. J. Marygold, K. Matthews, M. Werner-Washburne, R. Cripps, K. Broll, G. Santos, D. Emmert, L. S. Gramates, K. Falls, B. Beverley, S. Russo, A. Schroeder, S. E. S. Pierre, P. Zhou, M. Zytkovicz, B. Adryan, H. Attrill, M. Costa, S. Marygold, G. Millburn, R. Stefancsik, S. Tweedie, J. Goodman, G. Grumbling, J. Thurmond, and H. Platero, “tagtog: Interactive human and machine annotation of gene mentions in PLOS full-text articles,” Proceedings of the Fourth BioCreative Challenge Evaluation Workshop, vol. 1, pp. 260–269, 2013.

[20] K. Verspoor, K. B. Cohen, A. Lanfranchi, C. Warner, H. L. Johnson, C. Roeder, J. D. Choi, C. Funk, Y. Malenkiy, M. Eckert, N. Xue, W. A. Baumgartner, M. Bada, M. Palmer, and L. E. Hunter, “A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools,” BMC Bioinformatics, vol. 13, no. 1, p. 207, Jan. 2012, ISSN: 1471-2105. DOI: 10.1186/1471-2105-13-207.

[21] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, “The Stanford CoreNLP natural language processing toolkit,” Proceedings of the 52nd Annual Meeting of the ACL: System Demonstrations, pp. 55–60, 2014.

[22] D. Maglott, J. Ostell, K. D. Pruitt, and T. Tatusova, “Entrez Gene: Gene-centered information at NCBI,” Nucleic Acids Research, vol. 33, no. Database issue, pp. D54–8, Jan. 2005, ISSN: 1362-4962. DOI: 10.1093/nar/gki031.

[23] UniProt, “UniProt: A hub for protein information,” Nucleic Acids Research, vol. 43, no. Database issue, pp. D204–12, Oct. 2014, ISSN: 0305-1048. DOI: 10.1093/nar/gku989.

[24] S. Bird, “NLTK,” in Proceedings of the COLING/ACL on Interactive Presentation Sessions, Morristown, NJ, USA: Association for Computational Linguistics, Jul. 2006, pp. 69–72. DOI: 10.3115/1225403.1225421.

[25] J. J. Webster and C. Kit, “Tokenization as the initial phase in NLP,” in Proc. of COLING-92, 1992, pp. 1106–1110. DOI: 10.3115/992424.992434.

[26] C.-H. Wei, H.-Y. Kao, and Z. Lu, “PubTator: A web-based text mining tool for assisting biocuration,” Nucleic Acids Research, vol. 41, 2013. DOI: 10.1093/nar/gkt441.

[27] P. Thomas, T. Rocktäschel, Y. Mayer, and U. Leser, SETH: SNP extraction tool for human variations, 2014.

[28] K. Verspoor, A. Jimeno Yepes, L. Cavedon, T. McIntosh, A. Herten-Crabb, Z. Thomas, and J.-P. Plazzer, “Annotating the biomedical literature for the human variome,” Database, vol. 2013, bat019, Apr. 2013, ISSN: 1758-0463. DOI: 10.1093/database/bat019.
