CheNER: a tool for the identification of chemical entities and their classes in biomedical literature
© Usié et al.; licensee Springer. 2015
Published: 19 January 2015
Small chemical molecules regulate biological processes at the molecular level. Those molecules are often involved in causing or treating pathological states. Automatically identifying such molecules in biomedical text is difficult because of both the diverse morphology of chemical names and the alternative types of nomenclature that are used simultaneously to describe them. To address these issues, the last BioCreAtIvE challenge proposed a CHEMDNER task, which is a Named Entity Recognition (NER) challenge that aims at labelling different types of chemical names in biomedical text.
To address this challenge we tested various approaches to recognizing chemical entities in biomedical documents. These approaches range from linear Conditional Random Fields (CRFs) to a combination of CRFs with regular expression and dictionary matching, followed by a post-processing step to tag those chemical names in a corpus of Medline abstracts. We named our best performing system CheNER.
We evaluate the performance of the various approaches using the F-score statistic. Higher F-scores indicate better performance. The highest F-score we obtain in identifying unique chemical entities is 72.88%. The highest F-score we obtain in identifying all chemical entities is 73.07%. We also evaluate the F-score of combining our system with ChemSpot, and find an increase from 72.88% to 73.83%.
CheNER presents a valid alternative for automated annotation of chemical entities in biomedical documents. In addition, CheNER may be used to derive new features to train newer methods for tagging chemical entities. CheNER can be downloaded from http://metres.udl.cat and included in text annotation pipelines.
Scientific literature accumulates at a rate that makes it impossible for any biologist to extract all the relevant information from the multitude of available sources. For this reason, there is a keen interest in the development of systems that can automatically mine information from the text and provide that information to researchers.
Mining biologically important information from text is a two-step process, requiring that one identifies the relevant entities in the documents and, subsequently, the relationships between those entities. Methods that fully automate both steps of the process in a combined way with highly accurate results have yet to be developed. So far the focus has been mostly on creating and testing methods that perform one of the steps of the text-mining process (see for example [1–8]). This focus has been further promoted by initiatives such as the BioCreAtIvE challenge (BioCreAtIvE Workshops I, II, II.5, III, and IV held in 2004, 2007, 2009, 2010, and 2013 respectively) [1–5].
The BioCreAtIvE challenge provides participating research teams with annotated literature corpora that enable a controlled comparison of the performance between the various competing methods for automated recognition of specific types of entities in biomedical documents. There are various BioCreAtIvE challenge tracks that focus on identifying various types of biologically relevant entities, such as genes and their functions, diseases, phenotypes, or chemical compounds. The importance of these chemical compounds arises from their involvement in regulating biological activity of proteins and genes, and from their potential use to treat pathological states.
Examples of chemical entity recognition applications.
Some of the most accurate approaches for the automated identification of chemical entities use Conditional Random Fields (CRFs) [15, 16, 21, 22], Maximum Entropy Markov Models (MEMM) [13, 14], or Support Vector Machines (SVM) . These approaches employ statistical methods to identify chemical entities. Often, the performance of statistical methods can be improved by combining them with linguistic analysis techniques [24–27]. A detailed review about this subject can be found in .
The statistical methods used to identify chemical entities must be trained through the use of appropriate and encompassing gold standard collections of documents (corpora), containing precisely annotated chemical entities. Although quite useful, existing corpora [15, 16, 28, 29] that can be used for training those methods are often limiting in developing automatic annotation systems, because they are small in size and have incomplete annotation. The DDI corpora contain a larger number of documents (766) and chemical entities (13029). However, they are only adequate to train methods that perform NER of pharmacological substances. Because of this, only the SCAI corpora could be considered as a general gold standard that covered a large class of chemical entities, containing a total number of ~1550 abstracts with ~6600 entities annotated. However, the Medline corpus within the SCAI corpora only contains 100 Medline abstracts with 151 annotated IUPAC (International Union of Pure and Applied Chemistry) chemical names.
The latest round of the BioCreAtIvE challenge emphasized how important automated annotation of chemical entities in biomedical documents is by setting up a track (CHEMDNER) to promote the development of more accurate methods to perform that annotation. In order to lift one of the main limitations in developing annotation methods, two new biological literature corpora with annotated chemical entities were provided for the community to use in training their methods. Each corpus contains 3500 documents, with approximately 29500 annotated chemical entities, divided into several classes: SYSTEMATIC, TRIVIAL, FAMILY, FORMULA, ABBREVIATIONS, IDENTIFIERS, MULTIPLE, and NO CLASS. The corpora developed by BioCreAtIvE IV are significantly larger than the SCAI corpora [15, 16] and the DDI corpora [28, 29] that were freely available for the training and testing of applications that perform chemical NER. Our team had previously developed CheNER, a tool that automatically and specifically tags IUPAC chemical names in documents. CheNER uses CRFs based on Mallet to identify the IUPAC names and achieves F-scores higher than 70% in the SCAI corpora [15, 16]. Given that the IUPAC nomenclature is only one of the many that are used, we took the opportunity provided by the BioCreAtIvE IV organizers to further develop CheNER in order for it to specifically identify and tag the different classes of chemical names.
In this paper we report the development of this improved version of CheNER and analyse its performance. We implemented and tested a set of approaches that combine dictionary matching, linear CRFs and regular expressions in different ways to tag chemical entities according to their nomenclature classes in the biomedical literature. We find that the approach with the highest performance implements a CRF that is trained to simultaneously identify the individual classes of chemical entities. Our system is freely available at http://metres.udl.cat and can be easily integrated in pipelines to annotate large bodies of literature. To our knowledge, CheNER is unique with respect to other chemical entity annotation programs that were presented during the challenge because CheNER groups the chemical terms it annotates into the various classes of chemical names.
Materials & methods
Table 2. Sets of approaches combining CRFs, dictionary matching, and regular expression matching in five different ways.
Run 1: Combines a CRF to identify SYSTEMATIC entities with dictionary matching to identify TRIVIAL, FAMILY, and ABBREVIATION entities, and regular expression matching to identify FORMULA and IDENTIFIER entities.
Run 2: Combines individual CRFs to identify SYSTEMATIC and TRIVIAL entities with dictionary matching to identify FAMILY and ABBREVIATION entities, and regular expression matching to identify FORMULA and IDENTIFIER entities.
Run 3: Uses a single CRF to identify SYSTEMATIC, TRIVIAL, FAMILY, ABBREVIATION, FORMULA and IDENTIFIER entities.
Run 4: Combines individual CRFs to identify SYSTEMATIC, TRIVIAL, FAMILY, ABBREVIATION, and FORMULA entities with an individual regular expression matching to identify IDENTIFIER entities.
Run 5: Uses a single CRF to identify SYSTEMATIC, TRIVIAL, FAMILY, ABBREVIATION, FORMULA and IDENTIFIER entities and specifically labels each class of entity.
In the original development of CheNER we systematically tested how order, offset conjunction, and tokenization affected the performance of the CRF. Based on those tests we decided to use linear chain, 2nd order CRFs, with an offset conjunction value of 1 and tokenization by spaces in the development of the current CheNER version. We note that the punctuation marks at the end of the tokens are not taken into account when extracting their features. All CRFs for the current work were implemented using Mallet, and trained using the training corpus provided by the BioCreAtIvE organizers, containing 3500 abstracts, with ~29500 annotated entities.
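The tokenization and context settings described above can be illustrated with a short Python sketch. This is not the Mallet pipeline used by CheNER: the function names, the set of trailing punctuation marks, and the `@-1`/`@+1` naming of neighbour features are assumptions chosen only to show the idea of splitting on spaces, ignoring trailing punctuation, and copying features from tokens at offsets -1 and +1.

```python
# Assumed set of trailing punctuation marks; the paper does not list them.
TRAILING_PUNCTUATION = ".,;:!?"


def tokenize(text):
    """Split on whitespace; return (surface token, token used for features)."""
    pairs = []
    for token in text.split():
        stripped = token.rstrip(TRAILING_PUNCTUATION)
        # Keep the original token if it consisted only of punctuation.
        pairs.append((token, stripped or token))
    return pairs


def add_offset_conjunctions(feature_sets, offsets=(-1, 1)):
    """Add to each token the features of its neighbours at the given offsets."""
    result = []
    for i, features in enumerate(feature_sets):
        combined = set(features)
        for offset in offsets:
            j = i + offset
            if 0 <= j < len(feature_sets):
                combined.update(f"{f}@{offset:+d}" for f in feature_sets[j])
        result.append(combined)
    return result


if __name__ == "__main__":
    tokens = tokenize("Cells were treated with resveratrol, then washed.")
    base = [{"word=" + feature_token.lower()} for _, feature_token in tokens]
    for (surface, _), feats in zip(tokens, add_offset_conjunctions(base)):
        print(surface, sorted(feats))
```

With an offset of 1, each token sees its own features plus those of its immediate left and right neighbours, which is one plausible reading of the "offset conjunction value of 1" setting.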
Word features, regular expressions, and dictionaries
Examples of features and regular expressions used during the training of the chemical entities identification systems.
Name of feature
Classifies tokens by length. If the length is less than 5, the token is Short. If length is between 5 and 15, the token is Medium, otherwise, the token is Large.
Automatic generation of features in terms of frequency of upper and lower case characters, digits and other types of characters.
Automatic generation of suffix and prefix (length 2, 3 and 4)
Automatic generation for every token that matches an element within the list. We used lists of basic name segments (~3300), and stop words (~550).
A dictionary matching for trivial, family and abbreviations names classes (~6400, ~1300 and ~1400 elements, respectively).
Regular expressions that identify specific features, such as "contains dashes?", "is all cap?", or "contains numbers?".
Regular expressions that identify specific types of characters that are more common in chemical entities than in other words, such as Greek letters, Roman numerals, etc.
Regular expressions that match with specific morphological chemical formulas features, identifiers, and systematic features in chemical names.
Regular expressions used in the post-processing step that filter out common names that are incorrectly tagged by the systems in a systematic way.
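As a rough illustration of the simpler word features listed above, the following Python sketch computes the length class, the prefixes and suffixes of length 2 to 4, and three of the binary regular-expression features. The feature names and the inclusive boundaries of the Medium class are assumptions; the actual CheNER feature set is larger and is implemented in Mallet.

```python
import re


def length_class(token):
    """Short (< 5 characters), Medium (5 to 15, assumed inclusive), else Large."""
    n = len(token)
    if n < 5:
        return "Short"
    if n <= 15:
        return "Medium"
    return "Large"


def token_features(token):
    """Return a set of illustrative string features for one token."""
    features = {"length=" + length_class(token)}
    # Prefixes and suffixes of length 2, 3 and 4.
    for n in (2, 3, 4):
        if len(token) >= n:
            features.add(f"prefix{n}={token[:n]}")
            features.add(f"suffix{n}={token[-n:]}")
    # Binary features: "contains dashes?", "is all caps?", "contains numbers?".
    if "-" in token:
        features.add("has_dash")
    if token.isupper():
        features.add("all_caps")
    if re.search(r"\d", token):
        features.add("has_digit")
    return features


if __name__ == "__main__":
    for word in ("NaCl", "3-methylbutanol", "DMSO"):
        print(word, sorted(token_features(word)))
```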
Given that several classes of chemical names present either a very regular structure or a finite set of names, we wanted to see if using regular expressions and/or dictionaries to identify the entities for those classes would perform as well as using CRFs. The classes for which we wanted to test this were TRIVIAL, FAMILY, ABBREVIATION, FORMULA, and IDENTIFIER chemical names. The regular expressions that were defined to train our system in the runs that combine CRFs and Regular Expression taggers are also summarized in Table 3. FORMULA chemical names were identified in these runs by using regular expressions describing patterns containing atomic elements, SMILES, etc. The dictionaries used to identify TRIVIAL, FAMILY, and ABBREVIATIONS in the relevant runs were built from a non-redundant list of the entities from each class annotated in the corpora provided by the BioCreAtIvE organizers, the SCAI corpora, and also by extracting the names of chemical entities from http://www.drugs.com/. In total, these dictionaries have ~9100 terms, with ~6400 for the TRIVIAL dictionary, ~1300 for the ABBREVIATION dictionary and ~1400 for the FAMILY dictionary. To identify SYSTEMATIC names using a CRF, we used regular expressions to define patterns that identify morphological structures such as isomers (e.g., 3,5,4'-trihydroxy-trans-stilbene), as well as the expressions used in . We note that regular expressions or dictionary words used to identify any type of chemical entity by the Regular Expression tagger were also used as a feature to identify the same type of entities by the CRFs tagger in the relevant runs.
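A minimal sketch of how a combined dictionary and regular expression tagger of this kind could look is given below. The dictionaries are tiny toy examples, the element list is a small subset, and both patterns (a formula made of element symbols with optional counts, and an identifier shaped like a CAS Registry Number) are assumptions made for illustration; they are not the dictionaries or expressions used by CheNER.

```python
import re

# Toy dictionaries (hypothetical); the real ones hold thousands of terms.
TRIVIAL = {"aspirin", "caffeine", "resveratrol"}
FAMILY = {"flavonoids", "steroids"}
ABBREVIATION = {"DMSO", "ATP"}

# Small, assumed subset of element symbols.
ELEMENTS = "Cl|Ca|Na|Mg|Fe|Cu|Zn|Br|Si|H|C|N|O|P|S|K|F|I"
FORMULA_RE = re.compile(rf"(?:(?:{ELEMENTS})\d*){{2,}}")
# Identifier pattern shaped like a CAS Registry Number, e.g. 50-78-2.
IDENTIFIER_RE = re.compile(r"\d{2,7}-\d{2}-\d")


def tag_token(token):
    """Return a class label for a single token, or "O" if nothing matches."""
    lowered = token.lower()
    if lowered in TRIVIAL:
        return "TRIVIAL"
    if lowered in FAMILY:
        return "FAMILY"
    if token in ABBREVIATION:  # abbreviations are matched case-sensitively
        return "ABBREVIATION"
    if FORMULA_RE.fullmatch(token):
        return "FORMULA"
    if IDENTIFIER_RE.fullmatch(token):
        return "IDENTIFIER"
    return "O"


if __name__ == "__main__":
    for tok in "Aspirin ( 50-78-2 ) was dissolved in DMSO and H2O".split():
        print(tok, tag_token(tok))
```

A pattern this simple also matches ordinary capitalised words such as "NO", which is the kind of systematic false positive that a post-processing filter would need to remove.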
It is likely that overall performance of our system would improve by including additional dictionaries such as ChEBI [31, 32], Jochem  and PubChem . However, the deadlines of the BioCreAtIvE challenge made it impossible to develop a reasonable way to correctly attribute class type to each entity in these dictionaries, and class attribution was a differential feature that we wanted CheNER to have.
We tested five different approaches (Runs) to Chemical NER, in order to see which approach works better in the global identification of the chemical names. Each of these Runs is described in Table 2.
The output of the CRFs, dictionary, and Regular Expression taggers in each run is marked according to the IOB (Inside-Outside-Beginning) labelling scheme. This output is reformatted to the required specifications of the CDI (Chemical Document Indexing) and/or CEM (Chemical Entity Mention) output format.
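The conversion from IOB tags to entity mentions with character offsets can be sketched as follows. The tag format (`B-CLASS`, `I-CLASS`, `O`), the function name, and the returned tuple layout are assumptions for illustration; the exact CDI and CEM file layouts are defined by the challenge organizers and are not reproduced here.

```python
import re


def iob_to_mentions(text, tags):
    """Turn one IOB tag per whitespace token into (start, end, text, class)."""
    tokens = list(re.finditer(r"\S+", text))
    if len(tokens) != len(tags):
        raise ValueError("expected exactly one tag per whitespace token")

    spans = []
    current = None  # [start, end, label] of the mention being built
    for token, tag in zip(tokens, tags):
        label = tag[2:] if tag != "O" else None
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and (current is None or current[2] != label)
        )
        if starts_new:
            if current:
                spans.append(current)
            current = [token.start(), token.end(), label]
        elif tag.startswith("I-"):
            current[1] = token.end()  # extend the open mention
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(s, e, text[s:e], label) for s, e, label in spans]


if __name__ == "__main__":
    sentence = "Treatment with acetylsalicylic acid and NaCl"
    labels = ["O", "O", "B-SYSTEMATIC", "I-SYSTEMATIC", "O", "B-FORMULA"]
    for mention in iob_to_mentions(sentence, labels):
        print(mention)
```

For a document-indexing style output, the mention strings of one document would simply be collected into a set, so that repeated mentions count once. The class labels in the usage example are illustrative only.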
Evaluation of the results
The F-score is a standard way to evaluate performance of NER methods. It is given by the harmonic mean of precision and recall. We calculate the micro-averaged F-score of the individual Runs over the development and test corpora, which is the evaluation measure used by the BioCreAtIvE IV organizers. The micro-averaged performance is calculated by weighing equally every annotated entity in the corpus. To get the macro-averaged scores, each document should be evaluated, and then the resulting evaluation should be averaged on the whole corpus. The calculations of precision, recall, and F-score are done using the evaluation library provided by the BioCreAtIvE IV organizers, downloaded from http://www.biocreative.org/resources/biocreative-ii5/evaluation-library/.
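The difference between the two averaging schemes can be made concrete with a small Python sketch. It is not the BioCreAtIvE evaluation library: the function names and the toy annotations are hypothetical, and the macro-average is taken here as the mean of per-document F-scores, which is one common convention.

```python
def f_score(tp, fp, fn):
    """Balanced F-score: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_macro_f(gold, predicted):
    """gold, predicted: dicts mapping a document id to a set of entities."""
    total_tp = total_fp = total_fn = 0
    per_document = []
    for doc_id, truth in gold.items():
        found = predicted.get(doc_id, set())
        tp = len(truth & found)
        fp = len(found - truth)
        fn = len(truth - found)
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
        per_document.append(f_score(tp, fp, fn))
    micro = f_score(total_tp, total_fp, total_fn)  # every entity weighs the same
    macro = sum(per_document) / len(per_document)  # every document weighs the same
    return micro, macro


if __name__ == "__main__":
    gold = {"doc1": {"a", "b", "c", "d"}, "doc2": {"e"}}
    predicted = {"doc1": {"a", "b", "x"}, "doc2": {"e"}}
    micro, macro = micro_macro_f(gold, predicted)
    print(f"micro F = {micro:.4f}, macro F = {macro:.4f}")
```

In this toy case the short, perfectly tagged document pulls the macro-average up, while the micro-average is dominated by the document with more entities.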
Results & discussion
The evaluation of the systems presented to the IV BioCreAtIvE workshop was done by the organizers using a subset of 3000 abstracts within a test data set composed of 20000 abstracts, and calculating micro-averaged precision, recall, and balanced F-score. The performance of the systems was calculated with the BioCreAtIvE evaluation library.
Performance of the five runs
Micro-average CDI subtask results.
Micro-average CEM subtask results.
What causes the differences in performance between the various approaches we use to identify chemical entities? For example, the approach in Run 3 has the lowest F-score in both subtasks, CDI and CEM. This run implements an individual CRF for each entity class. The CRF that identifies FORMULA chemical names tags a large number of false positives, leading to a very low recall. This is seen by comparing the results from Run 3 and Run 4. These two runs differ only in how the system identifies the FORMULA chemical names. We see that the identification of FORMULA chemical names using a single CRF decreases the recall by ~15% when compared to FORMULA identification using regular expressions. This suggests that the context where FORMULA names are often found in the text is not sufficiently informative to allow the CRF to appropriately rule out many false positives.
We see a similar effect in Run 2. This Run has an F-score closer to Run 3 in the CDI subtask, while its F-score in the CEM task is closer to that of the best system. This difference is due to the fact that the system missed more unique entities than systems using CRFs to identify FAMILY, ABBREVIATION, FORMULA and IDENTIFIER chemical names. However, the entities of these types identified by Run 2 are the most frequently repeated in the texts that are analyzed, which raises the F-score of this Run in the CEM task.
To summarize, the use of a single CRF for each entity class leads to many false positives for each class, due to the similarity between the entity types. Replacing some CRFs with the direct use of Regular Expression taggers leads to a smaller number of entities being identified but improves the identification of the class for those entities, decreasing false positives. When a single CRF is used to tag all classes of entities (Run 5), this CRF can create a more accurate model for each class, thus improving the ability of the method to clearly identify the difference between the entity classes.
In the evaluation done for the BioCreAtIvE Challenge, the best system presented by CheNER achieves an F-score of 67.78% in the CDI task and an F-score of 63.74% in the CEM task. These scores are higher in the development corpus (72.08% F-score in the CDI task and 72.61% F-score in the CEM task). The version of CheNER we present in this work improves the original F-scores from the BioCreAtIvE workshop to 72.68% in the CDI task and 73.07% in the CEM task. This increase in F-score indicates that the new version of CheNER has an improved performance. Nevertheless, it would be important to calculate the performances for both tasks once the annotated test corpus becomes available to make sure that performance has also improved in that corpus.
Merging the tagging results from different chemical NER tools
The systems with the highest F-score performance in the BioCreAtIvE challenge were trained by combining features that are derived from a human analysis of patterns in chemical names with features that are derived from the automated tagging of chemical entities by tools such as OSCAR or ChemSpot [35–44]. All these systems have F-scores that are 10%-15% higher than those of CheNER, which uses only human-derived features.
We wanted to see whether adding features derived from the automated tagging by CheNER to those combined systems could improve their performance. These features would, for example, be the annotated chemical names themselves. To test this directly we would have to include the output of CheNER ourselves into the tools described in [35–44] and measure the resulting F-Score. However, the relevant tools were not publicly available and this conclusive experiment could not be performed.
As an alternative test to see whether adding features derived from the automated tagging by CheNER to those combined systems might improve their performance, we merged the individual results of CheNER, OSCAR [13, 14], and ChemSpot in tagging the CHEMDNER development corpus. This allowed us to investigate whether the three programs identified largely overlapping sets of entities or not. We did this for the CDI subtask.
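One simple way to merge document-level results and to check how much two tools overlap is sketched below in Python. It assumes that merging means taking the per-document union of the entity sets, which the text does not state explicitly; the function names and the toy annotations are hypothetical.

```python
def merge_annotations(*systems):
    """Per-document union of the entity sets produced by several tools."""
    merged = {}
    for system in systems:
        for doc_id, entities in system.items():
            merged.setdefault(doc_id, set()).update(entities)
    return merged


def unique_hits(gold, system_a, system_b):
    """Count true and false positives that system_a finds and system_b does not."""
    unique_tp = unique_fp = 0
    for doc_id, found in system_a.items():
        only_a = found - system_b.get(doc_id, set())
        truth = gold.get(doc_id, set())
        unique_tp += len(only_a & truth)
        unique_fp += len(only_a - truth)
    return unique_tp, unique_fp


if __name__ == "__main__":
    gold = {"doc1": {"aspirin", "NaCl", "ethanol"}}
    tool_a = {"doc1": {"aspirin", "NaCl", "water bath"}}
    tool_b = {"doc1": {"aspirin", "ethanol"}}
    print(merge_annotations(tool_a, tool_b))
    print(unique_hits(gold, tool_a, tool_b))
    print(unique_hits(gold, tool_b, tool_a))
```

A union raises recall whenever the tools find different true positives, but it also accumulates the false positives of every tool, so the merged F-score can move in either direction.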
Comparative micro-average performance evaluation of "out of the box" versions of ChemSpot and OSCAR.
NO processing of results
Processing of results
Comparative F-Score performance combining "out of the box" versions of ChemSpot, OSCAR, and CheNER.
Comparative analysis of true and false positive tagging between the best run of CheNER and ChemSpot.
Unique True Positives
Unique False Positives
Notes on the IV BioCreAtIvE Challenge
Here we presented CheNER, the latest version of our system for chemical entity tagging in biological literature. While the original version of CheNER only tagged IUPAC names, the current version tags and identifies various classes of chemical entities (see Figure 1 for an example), with a performance that is better than that of other comparable tools that can be downloaded from the internet and used "out of the box" (see Tables 4, 6, and 7 and references  and ). This version is a development over the one we presented at the IV BioCreAtIvE Challenge workshop, where we only presented early results from Runs 1, 2, 4 in the CDI subtask and Run 1 in the CEM subtask . In addition to testing additional systems, we further refined the post-processing of the results, significantly improving our F-Score.
CheNER presents a valid alternative for automated annotation of chemical entities in biomedical documents that can be downloaded from http://metres.udl.cat and easily integrated in annotation workflows. Examples of how to perform this integration are provided on the website. The individual performance of CheNER could be further improved by expanding the dictionaries of chemical entities used in its training. In addition, CheNER may provide a valuable resource to automatically derive new features that could be used for training and improving the performance of newer methods for tagging chemical entities.
We thank the anonymous reviewers for their valuable suggestions, which significantly improved the clarity of this paper. FS, RA, and AU were partially supported by grants BFU2010-17704 and TIN2011-28689-C02-02 from the Spanish Ministry of Economy and Competitiveness. The authors are members of the research groups 2009SGR809 and 2009SGR145, funded by the "Generalitat de Catalunya". AU was funded by a Generalitat de Catalunya (AGAUR) PhD fellowship. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Funding for publication of this article comes from grants BFU2010-17704 and TIN2011-28689-C02-02 from the Spanish Ministry of Economy and Competitiveness.
This article has been published as part of Journal of Cheminformatics Volume 7 Supplement 1, 2015: Text mining for chemistry and the CHEMDNER track. The full contents of the supplement are available online at http://www.jcheminf.com/supplements/7/S1.
- Hirschman L, Yeh A, Blaschke C, Valencia A: Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics. 2005, 6: S1.
- Krallinger M, Morgan A, Smith L, Leitner F, Tanabe L, Wilbur J, Hirschman L, Valencia A: Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge. Genome Biol. 2008, 9: S1.
- Leitner F, Mardis SA, Krallinger M, Cesareni G, Hirschman LA, Valencia A: An Overview of BioCreative II.5. IEEE/ACM Trans Comput Biol Bioinform. 2010, 7: 385-399.
- Arighi C, Lu Z, Krallinger M, Cohen K, Wilbur W, Valencia A, Hirschman L, Wu C: Overview of the BioCreative III Workshop. BMC Bioinformatics. 2011, 12: S1.
- Krallinger M, Leitner F, Rabal O, Vazquez M, Oyarzabal J, Valencia A: CHEMDNER: The drugs and chemical names extraction challenge. J Cheminform. 2015, 7 (Suppl 1): S1.
- Kim J-D, Ohta T, Pyysalo S, Kano Y, Tsujii J: Overview of BioNLP'09 shared task on event extraction. Proc Work Curr Trends Biomed Nat Lang Process Shar Task. 1-9.
- Kim J-D, Pyysalo S, Ohta T, Bossy R, Nguyen N, Tsujii J: Overview of BioNLP Shared Task 2011. Proc BioNLP Shar Task 2011 Work. 2011, Portland, Oregon, USA: Association for Computational Linguistics, 1-6.
- Nédellec C, Bossy R, Kim J-D, Kim J, Ohta T, Pyysalo S, Zweigenbaum P: Overview of BioNLP Shared Task 2013. Proc BioNLP Shar Task 2013 Work. 2013, Sofia, Bulgaria: Association for Computational Linguistics, 1-7.
- Vazquez M, Krallinger M, Leitner F, Valencia A: Text Mining for Drugs and Chemical Compounds: Methods, Tools and Applications. Mol Informatics. 2011, 30: 506-519. 10.1002/minf.201100005.
- Hanisch D, Fundel K, Mevissen H-T, Zimmer R, Fluck J: ProMiner: rule-based protein and gene entity recognition. BMC Bioinformatics. 2005, 6: S14.
- Rebholz-Schuhmann D, Arregui M, Gaudan S, Kirsch H, Jimeno A: Text processing through Web services: calling Whatizit. Bioinformatics. 2008, 24: 296-298. 10.1093/bioinformatics/btm557.
- Cooke-Fox DI, Kirby GH, Lord MR, Rayner JD: Computer translation of IUPAC systematic organic chemical nomenclature. 4. Concise connection tables to structure diagrams. J Chem Inf Comput Sci. 1990, 30: 122-127. 10.1021/ci00066a004.
- Corbett P, Murray-Rust P: High-Throughput Identification of Chemistry in Life Science Texts. Comput Life Sci II. Edited by: Berthold MR, Glen RC, Fischer I. 2006, Berlin, Heidelberg: Springer Berlin Heidelberg, 4216: 107-118.
- Jessop D, Adams S, Willighagen E, Hawizy L, Murray-Rust P: OSCAR4: a flexible architecture for chemical text-mining. J Cheminformatics. 2011, 3: 41. 10.1186/1758-2946-3-41.
- Klinger R, Kolářik C, Fluck J, Hofmann-Apitius M, Friedrich CM: Detection of IUPAC and IUPAC-like chemical names. Bioinformatics. 2008, 24: i268-i276. 10.1093/bioinformatics/btn181.
- Kolářik C, Klinger R, Friedrich CM, Hofmann-Apitius M, Fluck J: Chemical Names: Terminological Resources and Corpora Annotation. 2008.
- Hawizy L, Jessop D, Adams N, Murray-Rust P: ChemicalTagger: A tool for semantic text-mining in chemistry. J Cheminformatics. 2011, 3: 17. 10.1186/1758-2946-3-17.
- SureChem - Chemical Patent Search. [http://surechem.com/]
- Cooke-Fox DI, Kirby GH, Rayner JD: Computer translation of IUPAC systematic organic chemical nomenclature. 1. Introduction and background to a grammar-based approach. J Chem Inf Comput Sci. 1989, 29: 101-105. 10.1021/ci00062a009.
- Cooke-Fox DI, Kirby GH, Rayner JD: Computer translation of IUPAC systematic organic chemical nomenclature. 2. Development of a formal grammar. J Chem Inf Comput Sci. 1989, 29: 106-112. 10.1021/ci00062a010.
- Rocktäschel T, Weidlich M, Leser U: ChemSpot: A Hybrid System for Chemical Named Entity Recognition. Bioinformatics. 2012.
- Usie A, Alves R, Solsona F, Vazquez M, Valencia A: CheNER: chemical named entity recognizer. Bioinformatics. 2013.
- Tang B, Feng Y, Wang X, Wu Y, Zhang Y, Jiang M, Wang J, Xu H: A comparison of conditional random fields and structured support vector machines for chemical entity recognition in biomedical literature. J Cheminform. 2015, 7 (Suppl 1): S8.
- Blaschke C, Valencia A: The frame-based module of the SUISEKI information extraction system. IEEE Intell Syst. 2002, 17: 14-20.
- Aronson AR: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Annu Symp AMIA Symp. 2001, 17-21.
- Segura-Bedmar I, Martínez P, Segura-Bedmar M: Drug name recognition and classification in biomedical texts. Drug Discov Today. 2008, 13: 816-823. 10.1016/j.drudis.2008.06.001.
- Segura-Bedmar I, Crespo M, de Pablo-Sánchez C, Martínez P: Resolving anaphoras for the extraction of drug-drug interactions in pharmacological documents. BMC Bioinformatics. 2010, 11: S1.
- Segura-Bedmar I, Martínez P, de Pablo-Sánchez C: Extracting drug-drug interactions from biomedical text. BMC Bioinformatics. 2010, 11: S5.
- Herrero-Zazo M, Segura-Bedmar I, Martínez P, Declerck T: The DDI corpus: an annotated corpus with pharmacological substance and drug-drug interactions. Journal of Biomedical Informatics. 2013, 46 (I5): 914-920.
- Mallet: A machine learning for language toolkit. [http://mallet.cs.umass.edu/about.php]
- Degtyarenko K, de Matos P, Ennis M, Hastings J, Zbinden M, McNaught A, Alcantara R, Darsow M, Guedj M, Ashburner M: ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Res. 2007, 36: D344-D350. 10.1093/nar/gkm791.
- Hastings J, de Matos P, Dekker A, Ennis M, Harsha B, Kale N, Muthukrishnan V, Owen G, Turner S, Williams M, Steinbeck C: The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res. 2013, 41: D456-D463. 10.1093/nar/gks1146.
- Hettne KM, Stierum RH, Schuemie MJ, Hendriksen PJM, Schijvenaars BJA, Mulligen EM, Kleinjans J, Kors JA: A dictionary to identify small molecules and drugs in free text. Bioinformatics. 2009, 25: 2983-2991. 10.1093/bioinformatics/btp535.
- Li Q, Cheng T, Wang Y, Bryant SH: PubChem as a public resource for drug discovery. Drug Discov Today. 2010, 15: 1052-1057. 10.1016/j.drudis.2010.10.003.
- Choi M, Yepes AJ, Zobel J, Verspoor K: NEROC: Named Entity Recognizer of Chemicals. Proc Fourth BioCreative Chall Eval Work. Bethesda, Maryland. 2013, 2: 97-104.
- Leaman R, Wei C-H, Lu Z: tmChem: a high performance approach for chemical named entity recognition and normalization. J Cheminform. 2015, 7 (Suppl 1): S3.
- Lowe DM, Sayle RA: LeadMine: A grammar and dictionary driven approach to chemical entity recognition. J Cheminform. 2015, 7 (Suppl 1): S5.
- Batista-Navarro RT, Rak R, Ananiadou S: Chemistry-specific Features and Heuristics for Developing a CRF-based Chemical Named Entity Recogniser. Proc Fourth BioCreative Chall Eval Work. 2013, Bethesda, Maryland: Association for Computational Linguistics, 2: 55-59.
- Huber T, Rocktäschel T, Weidlich M, Thomas P, Leser U: Extended Feature Set for Chemical Named Entity Recognition and Indexing. Proc Fourth BioCreative Chall Eval Work. 2013, Bethesda, Maryland: Association for Computational Linguistics, 2: 88-91.
- Khabsa M, Giles CL: An Ensemble Information Extraction Approach to the BioCreative CHEMDNER Task. Proc Fourth BioCreative Chall Eval Work. 2013, Bethesda, Maryland: Association for Computational Linguistics, 2: 105-112.
- Akhondi SA, Hettne M, van der Host E, van Mulligen E, Kors JA: Recognition of chemical entities: combining dictionary-based and grammar-based approaches. J Cheminform. 2015, 7 (Suppl 1): S10.
- Lana-Serrano S, Sanchez-Cisneros D, Campillos L, Segura-Bedmar I: Recognizing Chemical Compounds and Drugs: a Rule-Based Approach Using Semantic Information. Proc Fourth BioCreative Chall Eval Work. 2013, Bethesda, Maryland: Association for Computational Linguistics, 2: 121-128.
- Yoshioka M, Dieb TM: Ensemble Approach to Extract Chemical Named Entity by Using Results of Multiple CNER Systems with Different Characteristic. Proc Fourth BioCreative Chall Eval Work. 2013, Bethesda, Maryland: Association for Computational Linguistics, 2: 162-167.
- Li L, Guo R, Liu S, Zhang P, Zheng T, Huang D, Zhou H: Combining Machine Learning with Dictionary Lookup for Chemical Compound and Drug Name Recognition Task. Proc Fourth BioCreative Chall Eval Work. 2013, Bethesda, Maryland: Association for Computational Linguistics, 2: 171-177.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.