Volume 7 Supplement 1
Enhancing of chemical compound and drug name recognition using representative tag scheme and fine-grained tokenization
© Dai et al.; licensee Springer. 2015
Published: 19 January 2015
The functions of chemical compounds and drugs that affect biological processes and their particular effect on the onset and treatment of diseases have attracted increasing interest with the advancement of research in the life sciences. To extract knowledge from the extensive literature on such compounds and drugs, the organizers of BioCreative IV administered the CHEMical Compound and Drug Named Entity Recognition (CHEMDNER) task to establish a standard dataset for evaluating state-of-the-art chemical entity recognition methods.
This study introduces the approach of our CHEMDNER system. Instead of emphasizing the development of novel feature sets for machine learning, this study investigates the effect of various tag schemes on the recognition of the names of chemicals and drugs by using conditional random fields. Experiments were conducted using combinations of different tokenization strategies and tag schemes to investigate the effects of tag set selection and tokenization method on the CHEMDNER task.
This study presents the CHEMDNER performance of three more representative tag schemes (IOBE, IOBES, and IOB12E), which extend the widely utilized IOB tag set, in combination with coarse- and fine-grained tokenization methods. The experimental results reveal that the fine-grained tokenization strategy performs best in terms of precision, recall and F-score when the IOBES tag set is utilized. The IOBES model with fine-grained tokenization yielded the best F-scores in six chemical entity categories, the exception being the "Multiple" entity category. Nonetheless, no significant improvement was observed when a more representative tag scheme was used with either the coarse- or fine-grained tokenization rules. The best F-scores achieved by the developed system on the test dataset of the CHEMDNER task were 0.833 and 0.815 for the chemical document indexing and the chemical entity mention recognition tasks, respectively.
The results herein highlight the importance of tag set selection and the use of different tokenization strategies. Fine-grained tokenization combined with the IOBES tag set most effectively recognizes chemical and drug names. To the best of the authors' knowledge, this is the first comprehensive investigation of the use of various tag schemes combined with different tokenization strategies for the recognition of chemical entities.
Studies on the effects of chemicals and drugs on organismal growth and development under various conditions are very valuable. As a result, both academia and industry are interested in finding new ways to retrieve and access chemical compound and drug-related information from narrative texts in a manner that minimizes the required effort. RI Dogan, GC Murray, A Névéol and Z Lu established that, apart from bibliographic queries (such as author name and article title), chemical entities are among the terms frequently used to browse and search the PubMed database. As research within the biomedical field has evolved, advances in experimental techniques, the accumulation of experience and the ease of access to publications around the world have all contributed to the acceleration of biomedical studies, generating enormous repositories of scientific journals and papers. Hence, traditional manual methods of identifying chemical entities in articles and associating them with databases no longer suffice to meet the needs of researchers, motivating the development of several chemical entity recognition approaches based on natural language processing [2, 3]. In contrast to the previously proposed gene mention recognition and normalization tasks [4, 5], the recognition of chemical entities has yet to improve much, owing to the limited availability of standard corpora and evaluation tools. For example, P Corbett and A Copestake evaluated OSCAR3 using a corpus consisting of 500 PubMed abstracts. Unfortunately, that corpus remains unavailable to the public. To accelerate research into CHEMical Compound and Drug Name Entity Recognition (CHEMDNER), a CHEMDNER task was set by BioCreative IV to improve the efficiency and accuracy of chemical and drug recognition, to the benefit of both academia and industry.
Identifying chemical entities in text is hindered by the highly varied ways of naming them. Such names include trivial or brand names (such as Tylenol), systematic International Union of Pure and Applied Chemistry (IUPAC) names (such as 6-keto prostaglandin F(1α)), generic or family names (such as alcohols), company codes (such as ICI204636), molecular formulas (such as H2SO4) and identifiers associated with various databases (such as CHEBI:28262). Additionally, many of these names are abbreviated (such as DMS for dimethyl sulfate). Although nomenclature organizations such as IUPAC have been striving for systematic naming in the biochemical field, most of their rules are treated only as suggestions rather than regulations, leaving ample room for variation in their use.
As indicated in the overview paper of the BioCreative CHEMDNER task, the majority of the approaches used by participating teams to detect chemical entities were machine learning methods based on conditional random fields (CRFs), used with a variety of feature sets, along with chemistry-related lexical resources and several pre-/post-processing rules. Despite the promising results of the BioCreative CHEMDNER task, most effort has been applied to the development of various feature sets. The tag set has received much less attention. Accordingly, this study focuses on various tag sets and their effect on the performance of chemical entity recognition with CRFs. Experiments were performed using combinations of different tokenization strategies and tag schemes to elucidate the effects of tag set selection and tokenization strategy on the identification of chemical and drug entities. The results demonstrate that tag set selection is as important as feature selection.
Chemical entities can be classified into various categories. For instance, based on the annotation guideline for the CHEMDNER task, the sentence,
"Different samples will be collected and analyzed for five PCAHs including pyrene, benzo(a)anthracene, benzo(e)pyrene, benzoflouroanthene, and benzo(a)pyrene."
includes two types of chemical entity, and should be annotated as follows.
"Different samples will be collected and analyzed for five [PCAHs ABBREVIATION] including [pyrene SYSTEMATIC], [benzo(a)anthracene SYSTEMATIC], [benzo(e)pyrene SYSTEMATIC], [benzoflouroanthene SYSTEMATIC], and [benzo(a)pyrene SYSTEMATIC]."
ABBREVIATION indicates that "PCAHs" is an acronym for a chemical compound. SYSTEMATIC indicates that "pyrene", "benzo(a)anthracene", "benzo(e)pyrene", "benzoflouroanthene", and "benzo(a)pyrene" are IUPAC names. Recognizing chemicals by category can help a downstream chemical entity normalization system link the mentions to their corresponding database records. For example, the abbreviated name "PCAHs" is linked to "polycyclic aromatic hydrocarbons", and the systematic name "pyrene" is linked to the ChEBI ID: 39106. Therefore, this study not only presents the unified results for the combinations of various tag schemes and tokenization strategies, obtained using the official CHEMDNER evaluation script, but also presents results for each of the seven categories of chemical names that were defined in the CHEMDNER task. The influence of the proposed tag set on the recognition performance of each individual category is also examined.
In this study, the CHEMDNER task is formulated as a sequence labelling problem. Models are developed using the same machine learning feature sets with various tag sets, and the effect of the tag sets on the recognition of chemicals is studied. The subsequent sections first detail the proposed tag sets and the employed machine learning model. Then, the workflow of the proposed system and the feature sets that are used in it are elucidated.
Conditional random fields
Tag set selection
The IOB scheme is the most widely used tag scheme for establishing the tag set in biomedical named entity recognition tasks. For example, the state-of-the-art system for recognizing mentions of genes adopts the IOB tag set in its bi-directional parsing algorithm. Even in the CHEMDNER task, the top-ranked systems [10, 11] used the IOB scheme. Figure 1 presents the graphical representation in CRF of "polycyclic aromatic hydrocarbons (PAHs)" tagged using the IOB tag set (B-FAMILY, I-FAMILY, I-FAMILY, O, B-ABBREVIATION, O). The scheme has the model learn and identify the Beginning, the Inside and the Outside of a particular category of chemical entities.
Various tag schemes have been proposed for the task of Chinese word segmentation and have shown promising improvements. For instance, N Xue proposed the use of a new tag to represent a Chinese "word" if it forms a word by itself. H Zhao, C-N Huang, M Li and B-L Lu concentrated on the subdivision of the beginning of Chinese words into tags like B1 and B2 to better capture longer words. Unlike in the Chinese word segmentation task, in the CHEMDNER task the category information of chemicals is associated with the tag set, leading to a high computational cost and long training time. Accordingly, the IOB scheme is carefully extended into four different schemes, whose relative performances when applied to the CHEMDNER task were compared. In particular, motivated by the works on Chinese word segmentation, the tags E and S, which stand for "End of the entity" and "Single-word entity", are added to the IOB tag set to form a four-tag set (IOBE) and a five-tag set (IOBES). Accordingly, the labelling sequence in Figure 1 becomes B-FAMILY, I-FAMILY, E-FAMILY, O, S-ABBREVIATION, and O when the five-tag set is used. In the experiments, the B tag is also split into B1 and B2 tags to form another five-tag scheme, IOB12E. These extended schemes encode the position of each token within an entity more precisely, giving the machine learning model more informative training labels.
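The relationship between the schemes can be illustrated with a small conversion routine. The following is a minimal sketch rather than the system's actual implementation: the function name `convert_iob` is hypothetical, and two details are assumptions that the text does not specify, namely that a single-token entity keeps the B (or B1) tag under IOBE and IOB12E, and that E takes precedence over B2 for the last token of a two-token entity.

```python
def convert_iob(tags, scheme):
    """Convert IOB tags such as "B-FAMILY" to IOBE, IOBES or IOB12E."""
    if scheme not in ("IOBE", "IOBES", "IOB12E"):
        raise ValueError("unsupported scheme: " + scheme)
    out = []
    pos = 0  # position of the current token inside its entity
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append("O")
            continue
        prefix, label = tag.split("-", 1)
        pos = 0 if prefix == "B" else pos + 1
        # The entity ends here unless the next tag continues it with I-.
        is_last = i + 1 == len(tags) or not tags[i + 1].startswith("I-")
        if scheme == "IOBES" and pos == 0 and is_last:
            new = "S"                      # single-token entity
        elif pos == 0:
            new = "B1" if scheme == "IOB12E" else "B"
        elif is_last:
            new = "E"                      # assumption: E wins over B2
        elif scheme == "IOB12E" and pos == 1:
            new = "B2"
        else:
            new = "I"
        out.append(new + "-" + label)
    return out


# Tags for the tokens: polycyclic aromatic hydrocarbons ( PAHs )
iob = ["B-FAMILY", "I-FAMILY", "I-FAMILY", "O", "B-ABBREVIATION", "O"]
print(convert_iob(iob, "IOBES"))
```

With the IOBES scheme, this reproduces the labelling sequence given above for Figure 1.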
In this study, two tokenization strategies are employed to generate different tokenization results. The performances of the generated CRF models based on these two strategies are then compared.
In the coarse-grained tokenization method, the standard Penn Treebank tokenization rules are utilized to tokenize the given document. The rules are summarized as follows.
Most punctuation marks, including commas, periods, and quotation marks, are separated from adjoining words.
Contractions of verbs and Saxon genitives of nouns are split into their component morphemes. For example, "won't" becomes "wo n't".
In the fine-grained tokenization method, the coarse-grained tokenization rules are applied first. The generated tokens are then rigorously tokenized again through the following two steps:
Add separations before and after symbols, such as hyphens and dashes.
Add separations at the boundaries between letters and digits, as well as where a lower-case letter is followed by an upper-case letter.
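The two fine-grained steps can be sketched with a single regular expression applied to each coarse-grained token. This is an illustrative sketch under stated assumptions, not the authors' tokenizer: the name `fine_grained` is hypothetical, every non-alphanumeric character is treated as a separate symbol (the text names only hyphens and dashes as examples), and only ASCII letters are grouped into words, so a character such as "α" becomes its own token.

```python
import re

# One piece is either: upper-case letters optionally followed by lower-case
# letters, a run of lower-case letters, a run of digits, or a single symbol.
# This splits between letters and digits and at lower-to-upper transitions.
_PIECE = re.compile(r"[A-Z]+[a-z]*|[a-z]+|[0-9]+|[^A-Za-z0-9\s]")


def fine_grained(coarse_tokens):
    """Re-tokenize coarse-grained tokens into finer pieces."""
    fine = []
    for token in coarse_tokens:
        fine.extend(_PIECE.findall(token))
    return fine


print(fine_grained(["octadecanol-covered", "Au(111)", "surface"]))
print(fine_grained(["H2SO4"]))
```

Under these assumptions, "octadecanol-covered" is split into "octadecanol", "-" and "covered", so a model can tag "octadecanol" alone as the chemical mention.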
The feature sets examined in this study are based on orthographic, morphological and shallow syntactic features, which were selected mainly from our previous work on the biomedical NER task, with particular modifications for the CHEMDNER task. These feature sets were selected because they do not rely on any specific resources and they allow researchers to reproduce comparable results. Furthermore, some of them have become standard NER feature sets and are implemented in open-source NER systems, such as BANNER.
Words that precede or follow the target word may be useful in its categorization. Consider, for example, the sentence, "Mercury induces the expression of cyclooxygenase-2 and inducible nitric oxide synthase". If the target word is "oxide", then the following word "synthase" will help the CRF model distinguish "oxide synthase" from "oxide layer", enabling the model to classify it correctly as a systematic-type chemical entity. In the developed model, the number of preceding/following words is set to two, and bigram and trigram word features are used as parts of the conjunction features. All of the above features were normalized to maximize the performance and to reduce the use of memory resources, as described in the authors' earlier work. For example, the term "cyclooxygenase-2" was normalized to "cyclooxygenase-1" in our training set.
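A window of two words on each side, together with simple conjunction features, could be extracted as in the sketch below. The function names, the feature keys and the "<PAD>" placeholder are hypothetical, only bigram conjunctions are shown, and the normalization rule (each run of digits becomes "1") is an assumption chosen to reproduce the "cyclooxygenase-2" example; the original normalization rules may differ.

```python
import re


def normalize(word):
    # Assumption: every run of digits collapses to "1".
    return re.sub(r"\d+", "1", word)


def context_features(tokens, i, window=2):
    """Word features for tokens[i] with `window` words on each side."""
    words = [normalize(t) for t in tokens]
    feats = {}
    for off in range(-window, window + 1):
        j = i + off
        feats["w[%d]" % off] = words[j] if 0 <= j < len(words) else "<PAD>"
    # Conjunction (bigram) features around the target word.
    feats["w[-1]|w[0]"] = feats["w[-1]"] + "|" + feats["w[0]"]
    feats["w[0]|w[1]"] = feats["w[0]"] + "|" + feats["w[1]"]
    return feats


sentence = ("Mercury induces the expression of cyclooxygenase-2 "
            "and inducible nitric oxide synthase").split()
print(context_features(sentence, sentence.index("oxide")))
```

For the target word "oxide", the conjunction feature "oxide|synthase" carries the following-word clue discussed above.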
The affix of a word is a morpheme that is attached to a base morpheme to form that word. Prefixes (which precede another morpheme) and suffixes (which follow another morpheme) are two types of affix. Some prefixes and suffixes provide useful clues for classifying named entities. For example, words that start with "hydro" are usually chemical entities with related component information. The prefixes and suffixes are defined to have between three and five characters, and they were also normalized before they were encoded into features.
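As a minimal illustration of the affix features, the sketch below extracts prefixes and suffixes of three to five characters from a word. The function name and feature keys are hypothetical, and the normalization step mentioned above is omitted.

```python
def affix_features(word, min_len=3, max_len=5):
    """Prefix and suffix features of min_len..max_len characters."""
    feats = {}
    for n in range(min_len, max_len + 1):
        if len(word) >= n:  # skip affix lengths longer than the word
            feats["prefix%d" % n] = word[:n]
            feats["suffix%d" % n] = word[-n:]
    return feats


print(affix_features("hydroxide"))
```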
Regular expression pattern
Word shape features
At times, chemical entities within the same category exhibit similar patterns, such as As(V) and DMA(V), and the word shape feature is developed accordingly. The following process is used to generate the shape of a given word: 1) all capitalized characters are replaced by "A"; 2) all non-capitalized characters are replaced by "a"; 3) all digits are replaced by "0"; and 4) all non-English characters are replaced by "_". To form the second word shape feature, consecutive strings of identical characters are reduced to a single character. For instance, the term "Aaaaa_A" is contracted to "Aa_A". Consider the two chemical entities "Na(2)CO(3)" and "As(2)O(3)". The generated word shape features are "Aa_0_AA_0_" and "Aa_0_A_0_", respectively. The second word shape feature maps both chemical entities to the same pattern, "Aa_0_A_0_".
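The two word shape features can be sketched as follows, applying the four replacement rules as stated (digits map to "0"). The function names `word_shape` and `brief_shape` are hypothetical, and "English characters" is interpreted here as ASCII letters, which is an assumption.

```python
import re


def word_shape(word):
    """First word shape feature: map every character to A, a, 0 or _."""
    shape = []
    for ch in word:
        if "A" <= ch <= "Z":
            shape.append("A")
        elif "a" <= ch <= "z":
            shape.append("a")
        elif ch.isdigit():
            shape.append("0")
        else:
            shape.append("_")
    return "".join(shape)


def brief_shape(word):
    """Second word shape feature: collapse runs of identical characters."""
    return re.sub(r"(.)\1+", r"\1", word_shape(word))


for name in ("Na(2)CO(3)", "As(2)O(3)"):
    print(name, word_shape(name), brief_shape(name))
```

Both example entities receive the same second word shape, which is what lets the feature generalize across them.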
Named entities are usually found in noun phrases, and the left or right boundaries of most chemical entities are aligned with the edges of noun phrases. For instance, in the noun phrase "the polyhedral oligomeric silsesquioxane", the chemical name "polyhedral oligomeric silsesquioxane" aligns with the right boundary of the noun phrase. Consequently, the chunk information is encoded as a feature in our model. Moreover, parts of speech (POS) such as verbs and prepositions normally indicate an entity's boundary. A context window length of five is set for POS features herein.
Dataset and evaluation metrics
The CHEMDNER text corpus was utilized to examine the performance of various tag schemes. The dataset consists of 10,000 abstracts and a total of 84,355 mentions of chemical compounds and drugs that had been manually labelled by domain experts. During the BioCreative IV evaluation period, the dataset was further divided into three subsets: the training set (3,500 abstracts), the development set (3,500 abstracts) and the test set (3,000 abstracts). Seven categories of chemical entities adapted from the work of R Klinger, C Kolarik, J Fluck, M Hofmann-Apitius and CM Friedrich were annotated in the corpus: (1) SYSTEMATIC: systematic names, such as IUPAC names; (2) IDENTIFIERS: database IDs, including CAS numbers, PubChem IDs, company registry numbers, ChEBI and CHEMBL IDs; (3) FORMULA: molecular formula, SMILES, InChI, or InChIKey; (4) TRIVIAL: trivial, brand, common or generic names of compounds; (5) FAMILY: chemical families that can be associated with chemical structures; (6) MULTIPLE: mentions that correspond to chemicals that are not described by a continuous string of characters; (7) ABBREVIATION: abbreviations and acronyms.
True positive (TP) refers to the number of correctly recognized chemical mentions. False negative (FN) is the number of human-annotated chemical mentions that were omitted by the presented system. False positive (FP) is the number of recognized chemical mentions that were not annotated by human annotators. The result shows the overall system performance independent of the categories of the chemical entities.
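Given these three counts, precision, recall and the F-score can be computed as in the sketch below. It assumes the standard definitions with a balanced F-score (the harmonic mean of precision and recall); the function name `prf` and the example counts are hypothetical.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, F-score) from mention-level counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical counts, for illustration only.
print(prf(tp=80, fp=20, fn=10))
```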
The four tag schemes that were described in the Methods section determine the four configurations in our experiments (IOB, IOBE, IOBES and IOB12E), all of which use the feature sets that were described in the Methods section. Two tokenization strategies were adopted to investigate the impact of the tokenization method on each tag scheme. The first directly exploits the tokenization results that are generated by following the coarse-grained tokenization rules, and the second uses the rules that were described in the Fine-grained tokenization sub-section to produce tokens with a finer granularity. The subscript "f" is used to distinguish the configurations with fine-grained tokenization (IOBf, IOBEf, IOBESf and IOB12Ef) from the first configurations. For the CEM task, a fixed confidence threshold of 0.5 was empirically set for all configurations in selecting the recognized chemical entities. The CDI results were converted from the CEM results by removing all duplicate entities and sorting them in order of descending confidence.
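The conversion from CEM output to CDI output can be sketched as follows. The tuple layout, the function name `cem_to_cdi`, keeping the highest confidence among duplicates and treating the 0.5 threshold as inclusive are all assumptions made for illustration; the text states only that duplicates are removed and the remainder is sorted by descending confidence.

```python
def cem_to_cdi(mentions, threshold=0.5):
    """Collapse mention-level (CEM) predictions into a ranked list of
    unique chemical names per document (CDI).

    mentions: iterable of (doc_id, start, end, text, confidence).
    """
    best = {}
    for doc_id, _start, _end, text, conf in mentions:
        if conf < threshold:
            continue  # mention rejected by the confidence threshold
        key = (doc_id, text)
        # Assumption: a duplicate keeps its highest confidence.
        best[key] = max(conf, best.get(key, 0.0))
    ranked = [(doc, text, conf) for (doc, text), conf in best.items()]
    # Group by document, then sort by descending confidence.
    ranked.sort(key=lambda item: (item[0], -item[2]))
    return ranked


# Hypothetical predictions for one abstract.
cem = [
    ("doc1", 0, 6, "pyrene", 0.90),
    ("doc1", 20, 26, "pyrene", 0.70),
    ("doc1", 30, 34, "PAHs", 0.95),
    ("doc1", 40, 47, "samples", 0.30),
]
print(cem_to_cdi(cem))
```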
Results for CDI and CEM
The CDI and CEM results on the CHEMDNER development set.
The CDI and CEM results on the CHEMDNER development set with the finer tokenization.
The category results of CEM
The CEM category results on the CHEMDNER test set.
In summary, the widely adopted IOB tag scheme generally provides a good formulation of the word sequences used in chemical mentions, because it achieves recall, precision and F-scores (RPF scores) comparable to those of more representative tag schemes while using less time and memory during the training phase. For instance, both the memory demand and the training time are around three times lower with the IOB tag set than with IOBES. When coarse-grained tokenization rules are used, the more representative tag schemes exhibited no significant advantage over the IOB tag scheme. Nonetheless, as pointed out by M Krallinger, F Leitner, O Rabal, M Vazquez, J Oyarzabal and A Valencia, the development of a set of specialized tokenization rules for the recognition of chemical terms is required. The experimental results herein support this assertion, revealing a significant improvement of approximately 6.23% in the F-score between IOB and IOBf, at the cost of increased memory usage (6.5%) and training time (23.3%). Our findings indicate that under the finer tokenization strategy, the more representative IOBES tag set should be preferred over the others for performing the CHEMDNER task. At a training cost of roughly three times the time and memory, it can be expected to yield a further boost in the RPF scores and a stable improvement in the recognition of all chemical entity categories in both the CDI and CEM tasks.
The experiments herein were carried out to investigate the influence of the tokenization strategy and the tag scheme on the CHEMDNER task. Comparing the predicted boundaries of the chemical mentions obtained using the best configuration, IOBESf, with those obtained using its counterpart, IOBES, and the most widely adopted tag set, IOBf, revealed that fine-grained tokenization precisely identifies modifications of chemical names by arbitrary symbols and clearly defines their boundaries. For instance, with fine-grained tokenization, the system can retrieve "octadecanol" rather than "octadecanol-covered" from the sentence that includes "...with an octadecanol-covered Au(111) surface investigated...". Likewise, from the sentence that includes "Among the artemisinin-based combination therapy (ACT) regimens...", "artemisinin" rather than "artemisinin-based" is recognized. The tokenization strategy also enables the CRF model to recognize entities that are mentioned next to forward slashes, such as those in "alcohols/esters", "Plu/PAA/Epi" and "His/Tyr", which are often used in descriptions of a group of chemicals with similar attributes.
Furthermore, the generation of fine-grained tokens along with the more representative tag set allows the CRF model to better capture mentions of the longer entities that are produced by the fine-grained tokenization method. For example, in coarse-grained tokenization, two and four tokens were generated for "N-cinnamoylated chloroquine" and "10, 12-pentacosadiynoic acid", respectively. These entities were either overlooked under coarse-grained tokenization or given incomplete boundaries by tag schemes such as IOB. The IOBES tag scheme with fine-grained tokenization can successfully recognize entities that comprise many tokens. For instance, the IOBESf model can correctly recognize the boundary of the chemical entity "α-phenyl-N-tert-butyl nitrone", which comprises ten tokens after fine-grained tokenization.
The distribution of chemical entities with different lengths in the CHEMDNER corpus.
According to our hypothesis, tag schemes that use different tags to mark the relative positions of the words within the name of a chemical entity should make entity recognition more exact. Indeed, the utilization of a more explicit tag set, such as IOBE, increased the accuracy of identification of chemical entities with longer names. In the sentence "Novel N-indolylmethyl substituted spiroindoline-3,2'-quinazolines were designed as potential inhibitors of SIRT1", the tag set IOBE can retrieve the correct chemical name, "N-indolylmethyl substituted spiroindoline-3,2'-quinazolines", whereas IOB recognized "N-indolylmethyl" and "spiroindoline-3,2'-quinazolines" as two separate names. Similarly, IOBE recognized the name "N-cinnamoylated chloroquine" in the expression "N-cinnamoylated chloroquine analogues as dual-stage antimalarial leads", whereas IOB identified two names, "N-cinnamoylated" and "chloroquine". Specifying the end of a chemical name within the IOBE set, rather than simply regarding it as a part of the name, seems to improve the recognition of chemical entities with long names.
Since names that consist of four or fewer tokens constitute around 90.7% of the CHEMDNER corpus, the five-tag scheme, IOB12E, should outperform the four-tag scheme, IOBE. However, increasing the rigidity of the tag set does not provide any improvement, as revealed by the fact that IOB12E performed most poorly. Close scrutiny reveals that the use of the IOB12E tag set makes it difficult to recognize chemical names that consist of two words, such as "ammonium sulphate", "allyl alcohol" and "acetate esters", which account for 9.2% of the names in the CHEMDNER corpus. Since IOB12E captures not only the first word (B1), but also the word that follows it (B2), it may have trouble with the recognition of two-token entities, in which the second word serves both as B2 and the end, or is an independent entity itself. For names with a single word, which account for 70.9% of the entities in the CHEMDNER corpus, the addition of the S tag to form the IOBES tag scheme provides an improvement over IOBE and IOB. Therefore, we believe that the IOBES tag scheme with fine-grained tokenization is the best alternative for capturing sufficient discriminative information for the CHEMDNER task.
Comparison with the other CHEMDNER systems in BioCreative IV
Comparison of the results with those of the top three CEM systems on the CHEMDNER test set.
System1: Leaman et al.
System2: Lu et al.
System3: Batista-Navarro et al.
Although a comparison of the advanced features and pre-/post-processing steps of these systems is beyond the scope of this work, the observations herein may be useful for further improving them. As demonstrated experimentally, the five-tag scheme IOBES outperformed all others. Therefore, we believe that the performance of Systems 1 and 3 can be further improved, since both adopted the simplest tag scheme, IOB. System 2 is the only system that utilized the IOBES tag scheme in the CHEMDNER task. However, System 2 did not consider the effect of tokenization in combination with the tag scheme. As established by P Corbett, C Batchelor and S Teufel and by S Eltyeb and N Salim, tokenization is an important issue in CHEMDNER systems, and a customized tokenizer can provide clear advantages in the handling of multi-token chemical entities. Therefore, a good CHEMDNER system must have a specialized tokenizer or be effective in handling multi-token names. This study demonstrated that properly choosing the representative tag scheme to be used with the fine-grained tokenization strategy can better capture the multiple tokens in a chemical name. We therefore believe that the aforementioned systems can be improved by adopting the proposed fine-grained tokenization strategy.
This study describes a system developed for performing the CHEMDNER task, and specifically examines the effect of tokenization and different representative tag sets on chemical and drug name recognition. The use of finer tokenization was generally associated with better performance for all tag sets. Moreover, of all the tag sets used, a more elaborate tag scheme such as the five-tag scheme IOBES provided better performance than the others. However, the complexity of the tag set is not entirely correlated with CHEMDNER performance, as the results herein revealed that the IOB12E tag set performed the worst overall. In summary, finer tokenization combined with the elaborate tag set IOBES achieved the best performance in recognizing chemical and drug names.
The authors would like to thank Johnny Chi-Yang Wu and Ted Knoy for their editorial assistance.
The research was financially supported by the Ministry of Science and Technology, Republic of China, Taiwan, under Contract No. MOST-103-2221-E-038-019.
This article has been published as part of Journal of Cheminformatics Volume 7 Supplement 1, 2015: Text mining for chemistry and the CHEMDNER track. The full contents of the supplement are available online at http://www.jcheminf.com/supplements/7/S1.
- Dogan RI, Murray GC, Névéol A, Lu Z: Understanding PubMed user search behavior through log analysis. Database: the journal of biological databases and curation. 2009, 2009.
- Corbett P, Murray-Rust P: High-Throughput Identification of Chemistry in Life Science Texts. Computational Life Sciences II. Edited by: Berthold MR, Glen R, Fischer I. 2006, Springer Berlin Heidelberg, 4216: 107-118. 10.1007/11875741_11.
- Klinger R, Kolarik C, Fluck J, Hofmann-Apitius M, Friedrich CM: Detection of IUPAC and IUPAC-like chemical names. Bioinformatics. 2008, 24 (13): i268-276. 10.1093/bioinformatics/btn181.
- Smith L, Tanabe LK, Ando RJn, Kuo C-J, Chung I-F, Hsu C-N, Lin Y-S, Klinger R, Friedrich CM, Ganchev K, et al: Overview of BioCreative II gene mention recognition. Genome Biology. 2008, 9 (Suppl 2): S2. 10.1186/gb-2008-9-s2-s2.
- Morgan AA, Lu Z, Wang X, Cohen AM, Fluck J, Ruch P, Divoli A, Fundel K, Leaman R, Hakenberg J, et al: Overview of BioCreative II gene normalization. Genome Biology. 2008, 9 (Suppl 2): S3. 10.1186/gb-2008-9-s2-s3.
- Corbett P, Copestake A: Cascaded classifiers for confidence-based chemical named entity recognition. BMC Bioinformatics. 2008, 9 (Suppl 11): S4. 10.1186/1471-2105-9-S11-S4.
- Krallinger M, Leitner F, Rabal O, Vazquez M, Oyarzabal J, Valencia A: CHEMDNER: The drugs and chemical names extraction challenge. J Cheminform. 2015, 7 (Suppl 1): S1.
- Lafferty J, McCallum A, Pereira F: Conditional random fields: probabilistic models for segmenting and labeling sequence data. ICML '01. 2001.
- Hsu C-N, Chang Y-M, Kuo C-J, Lin Y-S, Huang H-S, Chung I-F: Integrating high dimensional bi-directional parsing models for gene mention tagging. Bioinformatics. 2008, 24 (13): i286-294. 10.1093/bioinformatics/btn183.
- Batista-Navarro RT, Rak R, Ananiadou S: Chemistry-specific Features and Heuristics for Developing a CRF-based Chemical Named Entity Recogniser. Proceedings of the Fourth BioCreative Challenge Evaluation Workshop 2013; Bethesda, MD USA. 2013, 55-59.
- Leaman R, Wei C-H, Lu Z: NCBI at the BioCreative IV CHEMDNER Task: Recognizing chemical names in PubMed articles with tmChem. Proceedings of the Fourth BioCreative Challenge Evaluation Workshop; Bethesda, MD USA. 2013, 34-41.
- Xue N: Chinese Word Segmentation as Character Tagging. International Journal of Computational Linguistics and Chinese Language Processing. 2003, 8 (1): 29-48.
- Zhao H, Huang C-N, Li M, Lu B-L: A Unified Character-Based Tagging Framework for Chinese Word Segmentation. 2010, 9 (2): 1-32.
- LingPipe 4.1.0. (accessed October 1, 2008), [http://alias-i.com/lingpipe]
- Tsai RT-H, Sung C-L, Dai H-J, Hung H-C, Sung T-Y, Hsu W-L: NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition. BMC Bioinformatics. 2006, 7 (Suppl 5): S11. 10.1186/1471-2105-7-S5-S11.
- Leaman R, Gonzalez G: BANNER: an executable survey of advances in biomedical named entity recognition. Pac Symp Biocomput. 2008, 652-663.
- Rocktaschel T, Weidlich M, Leser U: ChemSpot: a hybrid system for chemical named entity recognition. Bioinformatics. 2012, 28 (12): 1633-1640. 10.1093/bioinformatics/bts183.
- Lu Y, Yao X, Wei X, Ji D: WHU-BioNLP CHEMDNER System with Mixed Conditional Random Fields and Word Clustering. Proceedings of the Fourth BioCreative Challenge Evaluation Workshop. 2013, 2: 129-134.
- Eltyeb S, Salim N: Chemical named entities recognition: a review on approaches and applications. Journal of Cheminformatics. 2014, 6 (1): 17. 10.1186/1758-2946-6-17.
- Corbett P, Batchelor C, Teufel S: Annotation of chemical named entities. Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing; Prague, Czech Republic. 1572403: Association for Computational Linguistics. 2007, 57-64.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.