The CHEMDNER corpus of chemicals and drugs and its annotation principles
- Martin Krallinger1,
- Obdulia Rabal2,
- Florian Leitner3,
- Miguel Vazquez1,
- David Salgado4,
- Zhiyong Lu5,
- Robert Leaman5,
- Yanan Lu6,
- Donghong Ji6,
- Daniel M Lowe7,
- Roger A Sayle7,
- Riza Theresa Batista-Navarro8,
- Rafal Rak8,
- Torsten Huber9,
- Tim Rocktäschel10,
- Sérgio Matos11,
- David Campos11,
- Buzhou Tang12,
- Hua Xu13,
- Tsendsuren Munkhdalai14,
- Keun Ho Ryu14,
- SV Ramanan15,
- Senthil Nathan15,
- Slavko Žitnik16,
- Marko Bajec16,
- Lutz Weber17,
- Matthias Irmer17,
- Saber A Akhondi18,
- Jan A Kors18,
- Shuo Xu19,
- Xin An20,
- Utpal Kumar Sikdar21,
- Asif Ekbal21,
- Masaharu Yoshioka22,
- Thaer M Dieb22,
- Miji Choi23,
- Karin Verspoor23, 24,
- Madian Khabsa25,
- C Lee Giles25, 26,
- Hongfang Liu27,
- Komandur Elayavilli Ravikumar27,
- Andre Lamurias28,
- Francisco M Couto28,
- Hong-Jie Dai29,
- Richard Tzong-Han Tsai30,
- Caglar Ata31,
- Tolga Can31,
- Anabel Usié32, 33,
- Rui Alves32,
- Isabel Segura-Bedmar34,
- Paloma Martínez34,
- Julen Oyarzabal2 and
- Alfonso Valencia1
© Krallinger et al.; licensee Springer. 2015
Published: 19 January 2015
The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative for all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text was measured using an agreement study between annotators, obtaining a percentage agreement of 91. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/
There is a pressing need to extract information on chemical compounds and drugs from the rapidly growing scientific literature. Text mining and information extraction techniques are showing promising results in the biomedical domain: a range of applications have been implemented to detect bio-entities [3, 4] and their relations (e.g. protein-protein interactions, gene-disease relations, and protein-mutation associations), or to select relevant documents for a particular topic. One of the first steps required for more complex relation extraction tasks is to find mentions of the entities of interest. In the life sciences domain the entities that have attracted most attention are genes and proteins, while in the case of more generic texts and newswire, efforts have been made to detect information units including names of persons, organizations or locations.
Automated techniques that aim to detect (tag) mentions of named entities in text are commonly called named entity recognition (NER) systems. Although early NER taggers typically relied on hand-crafted rules, the current trend increasingly points towards the use of supervised machine learning techniques for entity recognition. Such systems learn a statistical model to identify entity mentions by inferring which characteristics (features) distinguish them from the surrounding text. Exploited features can be the presence of certain combinations of orthographic features, like consecutive characters or words (n-grams), their letter case, or the presence of digits, special characters (e.g. hyphens, brackets, primes, etc.), and symbols (Greek letters, @, $, etc.). The endings or beginnings of words (affixes) and the presence of particular terms found in a list (gazetteer) of precompiled names are also often exploited by NER systems [10, 11] and can help identify a word's morphology (inflections, gerund, pronouns, etc.). For instance, when looking at the chemical literature, it becomes clear that systematic chemical names look quite different from common English words, mainly due to the nomenclature rules that define chemical naming standards.
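The kinds of features listed above can be illustrated with a minimal Python sketch. It is not the feature set of any particular system: the function name token_features, the tiny GAZETTEER set and the affix length of three characters are assumptions chosen only for this example.

```python
import re

# Hypothetical gazetteer for illustration; a real system would load a large
# precompiled list of chemical names.
GAZETTEER = {"aspirin", "resveratrol", "troglitazone"}


def token_features(token, gazetteer=GAZETTEER, affix_len=3):
    """Return simple orthographic, affix and gazetteer features for one token."""
    lower = token.lower()
    return {
        "lower": lower,
        "is_upper": token.isupper(),                       # letter case
        "is_title": token.istitle(),
        "has_digit": any(ch.isdigit() for ch in token),    # digits
        "has_hyphen": "-" in token,                        # special characters
        "has_bracket": bool(re.search(r"[()\[\]{}]", token)),
        "has_greek": bool(re.search(r"[\u0370-\u03FF]", token)),  # Greek letters
        "prefix": lower[:affix_len],                       # affixes
        "suffix": lower[-affix_len:],
        # character n-grams (here: trigrams) of the lowercased token
        "char_trigrams": [lower[i:i + 3] for i in range(len(lower) - 2)],
        "in_gazetteer": lower in gazetteer,                # gazetteer lookup
    }
```

A supervised tagger would typically compute such a feature dictionary for every token (and often for its neighbours) before training a sequence model on the labeled corpus.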
Supervised methods classify word (token) sequences by assigning them to one of a set of predefined entity classes. For this task, they require labeled example data that is commonly split into two collections. The first collection is called the training set, from which the model infers its parameters. The trained model is then used to detect entity mentions in the second collection, the test set; this set is used to evaluate the quality of the learned model. If the quality is satisfactory, the parameterized model can then be applied to detect entities in new, unlabeled text. Labeled text is therefore important not only to build machine learning-based entity taggers: it can also be used to evaluate the performance of any kind of NER system, regardless of the underlying method used. Producing labeled data for this purpose refers to the construction of properly annotated text, a so-called corpus. This process requires adding metadata (the annotations) to the original text according to specific annotation guidelines.
Over 36 corpora have already been generated in the biomedical field. Corpora that contain documents with manual annotations made by domain experts are known as Gold Standard Corpora (GSC). Because the manual annotation process is very laborious, lower quality corpora can be constructed by using automated techniques. A few such Silver Standard Corpora (SSC) have been published, too, such as the CALBC corpus. Chemical (named) entities are important for chemistry, but also for other research areas such as life sciences, pharmacology, medicine, material sciences or physics. Yet, despite their widespread use, only a few corpora with manually labeled chemical entities exist to date.
Biology corpora with chemical entities
There are several corpora developed in the life sciences domain that include text annotations of chemical substances. A widely used and valuable resource for biomedical language processing is the GENIA corpus. It contains a collection of PubMed abstracts annotated semantically with a variety of different entity types defined in the GENIA Chemicals ontology. Most of the underlying concept classes were derived from categories found in Medical Subject Headings (MeSH), a hierarchical terminological resource used to index PubMed abstracts. The GENIA chemical concepts correspond to a rather broad interpretation of chemicals, many of which cannot be linked to any concrete chemical entity with an associated structure. In this corpus, qualifier terms and chemical role/application terms are also annotated as chemical entities. There are no exhaustive annotation guidelines for chemical compounds underlying the GENIA corpus annotation, which is essentially tailored towards biologically relevant annotations. Moreover, in GENIA, chemical entity annotations were not prepared by a chemist and chemical annotations relied mainly on human interpretation of the text and background knowledge. The CRAFT corpus is a corpus of 97 full text biomedical articles that contains several different concept annotation types including a type consisting of chemical concepts from the ChEBI ontology. This type includes chemicals, chemical groups, atoms, subatomic particles, biochemical roles and applications. Annotations of the CRAFT corpus were done by biologists based on annotation guidelines that also included a set of linguistic aspects for text span markup. Chemical annotations in the CRAFT corpus were not exhaustive, being restricted mainly to the concepts covered by the ChEBI ontology. The coverage of this ontology for the chemical space published in the literature is unclear. Another hand-annotated life sciences corpus that contains chemistry-related annotations is the PennBioIE CYP 1.0.
This corpus of 1,100 abstracts requires payment of a license fee and is focused on a rather narrow scope, the inhibition of cytochrome P450 enzymes. It includes chemicals under a semantic class called substance. This substance class is rather vaguely defined and includes proteins and other substances as well as role and functional terms. There are a few corpora that are primarily concerned with the annotation of relationships that involve chemicals, and more particularly drugs. The EU-ADR corpus has 300 abstracts including drug-target and drug-disease relations; it was pre-annotated automatically and missed or incorrect annotations were manually corrected. With a similar scope, the ADE corpus contains annotations of drug-related adverse effects, covering chemicals/drugs in a therapeutic context for 3,000 abstracts. In the case of the DDI corpus, 700 documents (both PubMed abstracts and DrugBank records) were annotated for drugs and relations between them, while the EDGAR corpus (103 PubMed abstracts about cancer) also contains annotations of drugs in addition to genes and cells. The Metabolites and Enzymes corpus has annotations of metabolites, carried out on 296 abstracts on yeast metabolism. The annotation in this corpus was restricted to those names that appeared in the context of metabolic pathways. There was also one chemistry-disease relation corpus generated from 21 US patents that contained claimed structure-activity relationships. These patents were automatically tagged with chemistry and disease terms. The annotation process was restricted to the manual classification of the relation type existing between co-occurring terms.
Chemical text corpora
As opposed to the previously introduced corpora, a number of corpora have also been described that are more focused on chemistry and chemical entities than on biological aspects of chemical substances. They provided important lessons for the construction of the CHEMDNER corpus. Nevertheless, they also showed crucial differences in scope, document collections used, availability (of both the annotation guidelines and the resulting corpus), format and size. Early attempts to build chemical NER systems, due to the lack of a chemical entity text corpus, explored the use of lexical resources related to chemistry derived from the UMLS Metathesaurus, which was used for training and testing various methods. Wren published a machine learning method trained on the chemical ChemID database and used it to find chemical entity mentions in PubMed abstracts. Due to the lack of an evaluation text corpus he could only assess the precision on a small sample of putative chemical names extracted automatically. Another publication by Zhang described the use of chemical annotations done by the indexers of the National Library of Medicine (NLM) as a proxy for evaluating a chemical entity recognition system. These annotations are only done at the document level without specifying the exact entity mention offsets within the abstract. The NLM indexers annotate topic-related chemical concepts and therefore the indexing is not exhaustive. This type of annotation only reflects the understanding of the topic by the individual indexer. The document indexing was based on terms of the MeSH tree associated with chemicals (Chemicals and Drugs branch and supplementary concept records called MeSH substances). Narayanaswamy and colleagues described a small corpus of 55 abstracts selected by a keyword search (using as query acetylates, acetylated and acetylation) that also contained a small number of chemical names.
The text corpus introduced in the article describing the ChemicalTagger system consisted of 50 paragraphs from the experimental sections of full text articles selected using a keyword search related to polymer synthesis. It is concerned with the annotation of chemical phrases rather than chemical entity mentions, and the associated link to the annotation guidelines was no longer functional (broken link). The ChEBI Patent Gold Standard corpus was created as a joint effort between curators of the ChEBI database and the European Patent Office. It involved the annotation of chemical entities in 40 patent documents (18,061 chemical entities, 47% of which were initially linked to ChEBI records). This corpus is publicly available but more details on the annotation criteria and process were not released together with the corpus. This corpus was generated manually without using any software to create pre-annotations. An updated version of this corpus was also published to increase the initial mapping of mentions by using an updated version of the ChEBI database (53.7% of ChEBI mapped chemical entities). A recent effort carried out by both academic and commercial teams resulted in a larger corpus of 200 patents annotated with chemical information. These patents were automatically pre-annotated with chemical names and human curators revised and corrected mis-identified pre-annotations and added missing chemical mentions manually. The annotation guidelines used for constructing this corpus were partially based on the annotation guidelines that we have released for the CHEMDNER corpus, as detailed later in this manuscript. A relevant contribution to the development of chemical corpora was provided by the authors of the Sciborg corpus [32, 33] and the Chemistry PubMed corpus by Corbett et al. [33, 34]. Unfortunately, neither of these two corpora is publicly available, but the underlying annotation criteria shared by both datasets had a deep impact on the annotation guidelines prepared for the CHEMDNER corpus. The Sciborg corpus consisted of 42 full text chemistry research papers annotated manually with chemical compounds, while the Chemistry PubMed corpus by Corbett et al. consisted of a hand-annotated corpus of 500 PubMed abstracts selected using the query 'metabolism[Mesh] AND drug AND hasabstract'. Both corpora consisted of chemical texts exhaustively annotated by chemists according to very detailed annotation rules (a 31-page guideline containing 93 rules, together with example cases). Different annotation classes were defined to deal not only with chemical compounds but also with chemical reactions, chemical adjectives, enzymes and chemical prefixes.
A more granular annotation specifically of the chemical compound mentions was proposed for the construction of the open access Chem EVAL corpus (a.k.a. SCAI corpus), a small corpus of 100 abstracts (with 1206 chemical mentions) annotated with chemical entities . Details on the actual definition and selection of chemical compound mentions were not provided together with this corpus, and the original authors stated that additional evaluation and refinement of the corpus and its guidelines is work in progress. Nevertheless this corpus proposes several types of chemical mention classes of practical relevance, which were modified and adapted for the annotation of chemical mention classes of the CHEMDNER corpus. The chemical classes proposed by them included IUPAC (systematic and semi-systematic chemical names), PART (partial IUPAC names), TRIVIAL (trivial names), ABB (abbreviations and acronyms), SUM (sum formula, atoms, molecules, SMILES and InChI) and FAMILY (chemical family names). The distinction between TRIVIAL and IUPAC was an arbitrary decision according to the name length: names with one word were considered as TRIVIAL, while multi-word systematic and semi-systematic names were labeled as IUPAC.
Chemical names and challenges for NER
To be able to implement and compare the performance of chemical NER systems, the availability of sufficiently large manually tagged text corpora is a key requisite. It is thus not surprising that a comparative evaluation effort for this topic had not been carried out prior to the release of the CHEMDNER corpus. The main difficulty in constructing the CHEMDNER corpus was the intrinsic difficulty of defining, in the annotation guidelines, what actually constitutes a chemical compound that can be linked to structural information. Although the International Union of Pure and Applied Chemistry (IUPAC) has defined a set of rules for chemical nomenclature, those naming standards are not sufficiently followed in practice in the scientific literature. Chemistry is a research discipline with a considerable degree of specialization, which can explain the variability of language use encountered between its sub-disciplines. Moreover, chemical entities are also studied in publications from other disciplines such as medicine, biology and pharmacology. Thus a virtually arbitrary number of language expressions may be found in the literature to refer to chemical compounds. This variability can be explained by the use of aliases, e.g. different synonyms used for the same entity. For instance, the antidiabetic and anti-inflammatory drug 'troglitazone' also has the brand name 'Rezulin', while its systematic (IUPAC) name would be '(RS)−5−(4−[(6−hydroxy−2,5,7,8−tetramethylchroman−2−yl)methoxy]benzyl)thiazolidine−2,4−dione'. Variability can also be simply due to alternative typographical expressions referring to the same chemical. The problem of variability has a negative impact on i) the resulting recall of NER systems (fraction of the total entities mentioned in text that are recognized by a system) and ii) the feasibility of mapping all the various alternative compound mentions to their corresponding unique canonical chemical structure.
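As a rough illustration of the recall measure defined above, the following Python sketch computes it under an exact-match assumption. The tuple layout (document identifier, document part, start offset, end offset), the function name and the idea of comparing mentions as sets are assumptions made for this example; they are not taken from any official evaluation software.

```python
def recall(gold_mentions, predicted_mentions):
    """Fraction of gold-standard mentions that a system recognized exactly.

    Each mention is assumed to be a hashable tuple such as
    (document_id, part, start_offset, end_offset).
    """
    gold = set(gold_mentions)
    if not gold:
        return 0.0  # convention chosen for this sketch when nothing is annotated
    found = gold & set(predicted_mentions)
    return len(found) / len(gold)
```

Under this strict matching, a prediction that misses a mention boundary by a single character counts as a missed mention, which is one reason why naming and typographical variability lowers recall.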
Ambiguity, the fact that a given word can correspond to a chemical entity or to some other concept depending on the context of the mention, also poses difficulties for labeling text with chemical entities. A source of ambiguity for chemical entities is the heavy use of acronyms, abbreviations, short chemical formulas and certain trivial names in the literature. Additionally, a few common English words such as gold, lead and iron are also a source of ambiguity for NER systems. The following list summarizes some of the challenges related to chemical entity mention annotation and automatic recognition.
Difficulties in defining what a chemical entity is.
The official IUPAC nomenclature guidelines are only partially followed in practice in the literature.
Chemical compounds/drugs often have many synonyms or aliases (e.g. systematic names, trivial names and abbreviations referring to the same entity).
Existence of hybrid chemical mentions (e.g. mentions that are partially systematic and trivial).
Chemical compounds are ambiguous with respect to other entities or terms (in particular abbreviations and short formulas).
Existence of naming variation: typographical variants (alternating uses of hyphens, brackets, spacing, etc.) and alternative word order.
New chemical compounds are discovered and described in papers every day (novel chemical names).
Definition of both chemical entity mention boundaries and word tokenization is complicated.
For the successful detection of chemical entity mentions, tools need to be able to cope as much as possible with these difficulties.
BioCreative task on chemical entity recognition
Chemical entities of practical importance are those that can ultimately be linked to chemical structure information, rather than general vague chemical concepts. Being able to associate a given chemical compound name with a chemical structure was the central annotation criterion followed for the construction of the CHEMDNER corpus. The details on the construction of the CHEMDNER corpus will be provided in the following sections. To demonstrate its utility, the CHEMDNER corpus was used as the dataset to train and evaluate chemical NER systems that participated in a task posed at the fourth BioCreative community challenge. The BioCreative challenges are an ongoing effort to promote the evaluation and development of text mining and natural language processing software for the life sciences community. Carrying out this task within the organization of BioCreative was especially useful due to the previous experiences of this community with related biomedical NER tasks (the Gene Mention recognition tasks of BioCreative I and II [38, 39], as well as the Gene Normalization tasks).
The construction of the CHEMDNER corpus started with the definition of the overall annotation goal together with an exhaustive review of previous work on the annotation of chemical entities as well as named entities in the biomedical and other domains. The aim while defining the chemical entities annotated for the CHEMDNER corpus was to capture only those types of mentions that are practically relevant. The common characteristic among all the chemical mention types used for the CHEMDNER corpus was that they could be associated with chemical structure information with at least a certain degree of reliability. We consider this aspect of crucial practical relevance. The annotation carried out for the CHEMDNER corpus was only exhaustive for this particular type of chemical mention, which we named Structure Associated Chemical Entity Mentions (SACEMs). For example, 'nitric oxide', 'resveratrol' or 'malondialdehyde' would constitute example cases of SACEMs, while general chemical concepts like 'inactivator' or 'pigment', biological roles like 'hormone', 'antibiotic' or 'metabolite' and reactivity roles like 'nucleophile' or 'chelator' do not qualify as SACEMs. This implies that other types of mentions of chemicals and substances were not annotated. In order to construct the CHEMDNER corpus we examined several critical aspects that we thought would influence the corpus quality:
Corpus selection and sampling.
Annotation guidelines and their corpus-driven refinements.
Entity annotation granularity.
Human annotator expertise and training.
Annotation tools and interface.
Annotation consistency and definition of upper and lower performance boundaries to be expected by automated systems.
Corpus format and availability.
Document selection and sampling
An often-underestimated aspect when constructing text corpora is the initial selection of the documents that should be annotated. Keyword-based article selection risks generating a rather narrow or biased dataset, especially when the aim is named entity recognition. In order to make sure that the NER tools developed on the CHEMDNER corpus will generalize well to any chemistry-related document we used a careful selection strategy. The CHEMDNER document set had to be representative and balanced in order to reflect the kind of documents that might mention the entities of interest. In the case of chemical entities it is essential to cover articles that show sufficient diversity in the kinds of mentions expected to emerge across various chemical disciplines. The articles should have enough cases of systematic names, common or generic names of compounds and drugs, trade names, identifiers, acronyms, reference numbers of compounds and even formulas. In the case of the CHEMDNER corpus the document selection criteria took into account primarily the scientific discipline of the journals and publication dates. The following steps were used to select abstracts for the CHEMDNER corpus.
Step 1: Selection based on subject categories from the ISI Web of Knowledge relevant to various chemistry-related disciplines: BIOCHEMISTRY & MOLECULAR BIOLOGY; APPLIED CHEMISTRY; MEDICINAL CHEMISTRY; MULTIDISCIPLINARY CHEMISTRY; ORGANIC CHEMISTRY; PHYSICAL CHEMISTRY; ENDOCRINOLOGY & METABOLISM; CHEMICAL ENGINEERING; POLYMER SCIENCE; PHARMACOLOGY & PHARMACY and TOXICOLOGY.
Step 2: Selection of the top 100 journals for each category based on the journal impact factor.
Step 3: Selection of journals that had at least 100 articles.
Step 4: Selection of articles that were published in 2013 in English, with abstracts and links to full text articles in the PubMed database.
Step 5: Selection of articles that belonged to the various subject categories.
Step 6: Randomization of the abstracts and selection of 10,000 records.
Step 7: Splitting into three datasets: 3,500 (training set), 3,500 (development set) and 3,000 (test set) abstracts.
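Steps 6 and 7 can be sketched in Python as follows. This is only an illustration of random sampling and splitting under stated assumptions: the function name split_corpus, the fixed seed and the use of Python's random module are choices made for this example, and the actual randomization procedure used for the corpus is not described beyond the steps above.

```python
import random


def split_corpus(pmids, sizes=(3500, 3500, 3000), seed=0):
    """Shuffle candidate identifiers and cut them into train/dev/test sets.

    `pmids` is the pool of candidate abstracts left after steps 1 to 5;
    `seed` is a hypothetical fixed seed that makes the sketch reproducible.
    """
    total = sum(sizes)
    if len(pmids) < total:
        raise ValueError("candidate pool is smaller than the requested corpus")
    shuffled = list(pmids)                  # copy, so the input stays untouched
    random.Random(seed).shuffle(shuffled)   # step 6: randomization
    selected = shuffled[:total]             # step 6: keep 10,000 records
    train_end = sizes[0]
    dev_end = train_end + sizes[1]
    # step 7: three disjoint datasets
    return selected[:train_end], selected[train_end:dev_end], selected[dev_end:]
```

Because the three sets are slices of one shuffled list, no abstract can end up in more than one of them.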
The CHEMDNER corpus therefore contains representative articles for a range of chemistry-related fields. It is sufficiently large to cover the most relevant mention types and naming variability that are encountered in the scientific literature, making it possible both to train an NER system on a subset of abstracts and to evaluate its performance on a distinct test collection. We selected recent publications to make sure that the corpus would be useful for the detection of chemical entities in new abstracts as soon as they are published. It also covers journals with a high impact in the field, based on their impact factor and the number of articles they published.
Surprisingly, many manually annotated text corpora are not distributed together with detailed guidelines describing how the annotations were generated. Such black box corpora have the disadvantages that they cannot be extended, that it is impossible to compare them in a meaningful way to other corpora, and that it is unclear how to deal with potential causes of inconsistencies and annotation errors. Annotation guidelines should specify the necessary instructions to identify the text elements that should be tagged (and those that should not be tagged) and how to assign them to their corresponding entity class. At a general level they represent the instructions on how the annotation schema should be applied to the actual text data that will be labeled.
Three important things had to be addressed in the annotation guidelines: (a) what to label, (b) the mention boundaries of those labels, and (c) how to classify those mentions into chemical mention categories.
Creating high quality guidelines that fit the annotation task required a multi-step iterative process, starting from an initial guideline draft and continuing until clear and refined guidelines were obtained. In the case of the CHEMDNER corpus, defining the text-bound annotations of chemical mentions was not trivial. It required a deep knowledge of chemistry, supported by consultation of external knowledge sources in case of doubt. The guidelines were prepared by chemists with feedback from trained literature curators who also held a Ph.D. in chemistry. In order to label SACEM mentions, a set of annotation rules was defined. These rules were initially adapted by reviewing the annotation guidelines for chemicals from the manual prepared by Corbett et al. (version 6.0, 2007). The CHEMDNER annotation rules had several important modifications: (1) only chemical nouns (and specific adjectives, treated as nouns) were considered (not reactions, prefixes or enzymes); (2) the number of original rules was reduced; (3) rules were grouped as positive, negative, orthography and multi-word rules. In the case of the multi-word rules some simplifications were made, making them less prone to errors of human interpretation.
Very general chemical concepts (non-structural or non-specific chemical nouns), adjectives, verbs and other terms (reactions, enzymes) that cannot be associated directly with a chemical structure were excluded from the annotation process. SACEMs for this task had to refer to names of specific chemicals, specific classes of chemicals or fragments of specific chemicals. General chemical concepts, proteins, lipids and macromolecular biochemicals were excluded from the annotation. Therefore genes, proteins and protein-like molecules (above 15 amino acids in length) were not annotated. Chemical concepts were labeled solely if they provided concrete structural information. Relevant and intuitive example cases (rule instantiation examples) were provided in the guidelines when necessary to illustrate a specific annotation rule and make it easier to understand and apply. Although the chemical intuition of the annotators was important for defining the annotation guidelines, we did not require any specific linguistic background knowledge.
Stage 1 -- Pre-annotation guideline discussion round
At the very beginning, before a sample set was annotated, the annotators revised the guidelines and posed questions to improve the guidelines in a first refinement round. At this stage, the annotation specifications were reformulated if ambiguities or inconsistencies were detected.
Stage 2 -- Pilot annotation guideline testing and refinement
The initial set of rules was then tested in practice by using them to annotate a small sample of abstracts (the seed corpus). The seed corpus was annotated by curators to examine the suitability of the stage 1 guidelines. During this pilot annotation experiment we estimated the required annotation time and effort, iteratively refined the guidelines (to make them more precise and easier to follow, resolving cases of under-specification), and learned how to use the annotation interface and how well it fitted the needs of annotating the mentions according to the guidelines.
Stage 3 -- Corpus annotation
The last step consisted of the annotation of the training, development and test sets. During the corpus annotation stage, the guidelines were refined when novel, previously unspecified ambiguities were encountered. These ambiguities were resolved through direct feedback from the experts that constructed the guidelines. Moreover, new example cases were added to the guidelines.
The CHEMDNER annotation guidelines are publicly available together with the corpus at http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/. To make them easier to read, the guidelines are structured according to six different types of rules, while trying to keep them as comprehensive as possible:
General rules: rules that clarify the use of external knowledge sources and how to deal with unclear mentions.
Positive rules: rules that specify which chemical entity mentions should be labeled.
Negative rules: rules that specify which kind of mentions should not be tagged.
Class rules: specifications for the manual assignment to the corresponding CEM classes, including hybrid names.
Orthography and grammar rules: rules for defining consistently the entity mention boundaries, dealing for instance with whitespaces, mis-spellings, flanking characters, commas, brackets, etc.
Multi-word entity rules: rules defining labeling criteria for multi-word chemical entities.
The CHEMDNER annotation guidelines, including the example cases, are 21 pages long in total. One of the most important and difficult issues when defining the guidelines was to establish what constitutes a chemical mention and what does not. A single, particular chemical compound assignable to a chemical structure can be easily recognized by a chemist. The problem arises for general terms that comprise several structurally diverse chemical compounds and for which the mention intrinsically provides a general notion of structural class. For example, the term 'Alkaloid' refers to a group of naturally occurring chemical compounds that mostly contain basic nitrogen atoms. From a practical viewpoint, it would be worthwhile to tag this SACEM as a FAMILY because an end-user could be interested in recognizing this family of compounds in a given biomedical context. However, strictly speaking, a single simple Markush formula cannot be assigned to this class. In an attempt to homogenize the criteria, an exemplary list (probably expandable in future releases) was provided in the guidelines. As the number of potential mentions of this kind is small compared to the rest of the mentions, this should not strongly affect the final conclusions of the task. An additional problem with these mentions is that most of them are natural products commonly found in living organisms, so the boundary between chemistry and biology is not easy to draw. As mentioned, a limit on the size of the peptides, saccharides, nucleotides and lipids was imposed as a solution for these small biochemicals. A second problematic issue was establishing how to deal with adjectives. Adjectives preceding valid SACEMs that add more precise information on the chemical entity were annotated. Finally, the exact assignment of mentions to the FAMILY class was controversial in some cases and exemplary cases were added during the iterative refinement.
For example, synthetic polymers consisting of an undefined number of monomers were detected during the annotation and incorporated to this class.
Annotation process and interface
In principle, the information in annotated textual data can be represented in various alternative formats. Several criteria were important when choosing the annotation format of the CHEMDNER corpus. First of all, the format should be easy to use for building NER systems, so it should be simple and easy to modify. We clearly separated the entity annotation format from the exchange (dump) format of the released CHEMDNER corpus. This means that we kept the annotations separate from the actual text (the information on the location of the entity mentions is stored in a different file from the raw text). We used a standoff annotation format that specifies the character locations in a separate file. Using character offsets instead of token locations was particularly important for the CHEMDNER corpus because it makes it easier for corpus consumers to use their own text tokenization strategy. We avoided using a complicated XML schema for the initial baseline release. We examined some basic recommendations provided by the Linguistic Annotation Framework (LAF) for data distribution. All records used for the CHEMDNER corpus were distributed as plain text, UTF-8-encoded PubMed abstracts in a tab-separated format with the following three columns: article identifier (PMID, PubMed identifier), title of the article, and abstract of the article. The baseline entity annotation file had a tab-separated format with columns corresponding to the article identifier, the part of the document processed (T: title, A: abstract), the start and end character offsets of the chemical, the text string of the chemical entity mention and the corresponding chemical entity mention class. Example cases of the entity annotation file can be seen in Figure 2. The task annotation files were derived from the entity annotation file, one for the CEM task and one for the CDI task.
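The two tab-separated files described above can be read with a few lines of Python. The following is a minimal sketch, not the official loader: the sample records are invented, the names Mention, read_abstracts, read_mentions and offsets_match are hypothetical, and it assumes zero-based start offsets with an exclusive end offset, which should be checked against the released files.

```python
import csv
import io
from typing import NamedTuple


class Mention(NamedTuple):
    pmid: str
    section: str      # "T" for title, "A" for abstract
    start: int        # assumed zero-based
    end: int          # assumed exclusive
    text: str
    sacem_class: str


def read_abstracts(handle):
    """Map PMID -> {"T": title, "A": abstract} from the three-column file."""
    docs = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for pmid, title, abstract in reader:
        docs[pmid] = {"T": title, "A": abstract}
    return docs


def read_mentions(handle):
    """Yield one Mention per row of the six-column annotation file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for pmid, section, start, end, text, sacem_class in reader:
        yield Mention(pmid, section, int(start), int(end), text, sacem_class)


def offsets_match(mention, docs):
    """Check that the standoff offsets really point at the mention string."""
    passage = docs[mention.pmid][mention.section]
    return passage[mention.start:mention.end] == mention.text


# Invented sample data in the documented column layout.
abstracts_tsv = "12345678\tEffects of aspirin on platelets\tWe studied aspirin and NaCl.\n"
annotations_tsv = (
    "12345678\tT\t11\t18\taspirin\tTRIVIAL\n"
    "12345678\tA\t11\t18\taspirin\tTRIVIAL\n"
    "12345678\tA\t23\t27\tNaCl\tFORMULA\n"
)

docs = read_abstracts(io.StringIO(abstracts_tsv))
mentions = list(read_mentions(io.StringIO(annotations_tsv)))
all_consistent = all(offsets_match(m, docs) for m in mentions)
```

Because the annotations are stored as character offsets, a consumer can apply any tokenizer to the text and then project the mentions onto its own tokens.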
In addition to this simple annotation format, we have recently generated a version of the CHEMDNER corpus in an alternative, widely used format, the BioC format. The BioC XML version of the CHEMDNER corpus was checked to make sure that the XML was valid, both with respect to XML itself and to the BioC DTD. The Python script that converts the flat files of the CHEMDNER tab-separated format into the BioC format was released together with the corpus.
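To give an idea of what such a conversion involves, the sketch below builds a BioC-style document element for one passage with the Python standard library. It is not the released conversion script. The element names follow the BioC DTD, but the function name passage_to_bioc, the infon key "type" and the sample values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def passage_to_bioc(pmid, passage_text, passage_offset, mentions):
    """Build a BioC-style <document> holding one passage.

    mentions: iterable of (start, end, text, sacem_class) with offsets
    relative to the passage (zero-based start, exclusive end assumed).
    """
    document = ET.Element("document")
    ET.SubElement(document, "id").text = pmid
    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "offset").text = str(passage_offset)
    ET.SubElement(passage, "text").text = passage_text
    for index, (start, end, text, sacem_class) in enumerate(mentions):
        annotation = ET.SubElement(passage, "annotation", id=str(index))
        # The key "type" is an assumed name for the SACEM class infon.
        ET.SubElement(annotation, "infon", key="type").text = sacem_class
        ET.SubElement(
            annotation,
            "location",
            offset=str(passage_offset + start),
            length=str(end - start),
        )
        ET.SubElement(annotation, "text").text = text
    return document


# Invented example: one FORMULA mention in an abstract passage.
document = passage_to_bioc(
    "12345678",
    "We studied aspirin and NaCl.",
    0,
    [(23, 27, "NaCl", "FORMULA")],
)
xml_string = ET.tostring(document, encoding="unicode")
```

A complete file would wrap such documents in a collection element with source, date and key children and would then be validated against the BioC DTD, as described above.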
CHEMDNER corpus overview
Table. CHEMDNER corpus overview (includes the number of abstracts with SACEM).
Corpus inter-annotator agreement and harmonization
The comparison of independent manual labels constructed for the same documents by different individuals can provide important insights into the quality of the corpus and guidelines, and it is an essential element of the construction of Gold Standard corpora. It helps to assess how well the annotation task was defined, it shows how curators compare to each other, and it determines whether the instructions were interpreted and followed consistently. This means that the inter-annotator agreement (IAA) score allows assessing how accurately the annotations can be made by several annotators and scoring the reproducibility of the task. Future extensions of a corpus using the same guidelines should result in comparable inter-annotator agreement results. If the score is high, the task is well defined and the annotations are consistent. The simplest IAA score is the percentage agreement between experts. The IAA analysis of the CHEMDNER corpus was conducted using a random sample of 100 abstracts chosen from the entire dataset, asking the curators to annotate the data set independently. The result of the IAA study constitutes a sort of upper boundary for the expected automated prediction performance. An inter-annotator agreement of 91% was obtained when exact matching of the chemical mentions was used without considering the label of the SACEM classes. When the SACEM class annotation of the mentions was also considered, the IAA was 85.26%. Manual inspection of the conflicting annotations showed that the main source of discrepancies was missed annotations by one or the other annotator, and not true annotation errors or differences in the mention boundary definition. This is in line with previously published studies, which describe as one common source of disagreement between manual entity annotations that some mentions were missed by the curators while scanning over the document.
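The two agreement figures differ only in whether the SACEM class is part of the comparison. The sketch below illustrates this with invented annotations. The exact formula behind the reported percentages is not given here, so the function percentage_agreement (a hypothetical name) assumes one common definition: the share of all distinct mentions that both annotators marked identically.

```python
def percentage_agreement(annotations_a, annotations_b):
    """Percentage of distinct annotations made identically by both annotators.

    Assumed definition: 100 * |A intersect B| / |A union B|.
    """
    set_a, set_b = set(annotations_a), set(annotations_b)
    union = set_a | set_b
    if not union:
        return 100.0
    return 100.0 * len(set_a & set_b) / len(union)


def spans_only(annotations):
    """Drop the SACEM class so that only document and boundaries are compared."""
    return {(pmid, section, start, end) for pmid, section, start, end, _ in annotations}


# Invented annotations: (pmid, section, start, end, sacem_class)
annotator_1 = {
    ("1", "A", 0, 7, "TRIVIAL"),
    ("1", "A", 10, 14, "FORMULA"),
    ("1", "A", 20, 29, "SYSTEMATIC"),
    ("1", "A", 40, 45, "FAMILY"),      # missed by annotator 2
}
annotator_2 = {
    ("1", "A", 0, 7, "TRIVIAL"),
    ("1", "A", 10, 14, "FORMULA"),
    ("1", "A", 20, 29, "TRIVIAL"),     # same span, different class
    ("1", "A", 50, 55, "ABBREVIATION"),  # missed by annotator 1
}

span_agreement = percentage_agreement(spans_only(annotator_1), spans_only(annotator_2))
class_agreement = percentage_agreement(annotator_1, annotator_2)
```

In this toy example the two missed mentions lower both scores, while the class disagreement lowers only the second one, which mirrors the kind of discrepancies described above.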
To make sure that the number of chemical mentions missed during the annotation process was marginal, a second group of additional curators annotated the test set abstracts, in addition to the main annotation team that prepared the CHEMDNER corpus. These abstracts were used to score the automated mention predictions during the CHEMDNER task, and it was therefore particularly important that these annotations were complete and correct. We collected all the conflicting annotations between the two curator teams, consisting of those mentions that were annotated by only one of the teams. To harmonize those conflicting annotations, they were presented to the main curation group for a second round of manual revision. The entire abstract of those conflicting cases was revised to resolve the annotation discrepancies within their context. The curators provided written decisions of inclusion, exclusion or changes related to the conflicting chemical mentions, together with comments explaining their decision for more complicated cases. The annotation guideline developers inspected the list of entity revisions for final approval. Unclear cases that required further refinements (or additional example cases) to be included in the annotation guidelines were discussed in writing. We relied primarily on the annotations of the main annotator team because these curators had more experience in this task and provided active feedback for the refinement of the annotation guidelines. The result of the corpus harmonization process was that 1,185 annotations were added to the original 24,671 test set annotations (4.08%) while 505 (2.05%) were removed, yielding the final harmonized test set of 25,351 annotations. We performed a mention class label revision (SACEM class label harmonization) on the entire CHEMDNER corpus. For potentially inconsistent cases where a given chemical name was annotated in some cases as one SACEM class and in other cases as another SACEM class, the chemical entities and their SACEM class labels were manually inspected and corrected. Finally, an automatic revision of the annotations was done to cross-check the mention boundaries, trim whitespace characters and ensure their technical coherence with the annotation rules.
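The candidates for manual revision in both harmonization steps can be collected mechanically. The sketch below is an illustration with invented data rather than the procedure actually used: conflicting_mentions and inconsistent_class_labels are hypothetical names, and mentions are compared by exact document, section and offsets.

```python
from collections import defaultdict


def conflicting_mentions(main_team, second_team):
    """Return the mentions annotated by only one of the two teams."""
    main_team, second_team = set(main_team), set(second_team)
    return main_team - second_team, second_team - main_team


def inconsistent_class_labels(labeled_strings):
    """Return the mention strings that carry more than one SACEM class."""
    classes_by_string = defaultdict(set)
    for text, sacem_class in labeled_strings:
        classes_by_string[text].add(sacem_class)
    return {text: classes for text, classes in classes_by_string.items() if len(classes) > 1}


# Invented mentions: (pmid, section, start, end)
main_team = {("1", "A", 0, 7), ("1", "A", 10, 14)}
second_team = {("1", "A", 0, 7), ("1", "A", 30, 36)}
only_main, only_second = conflicting_mentions(main_team, second_team)

# Invented (mention string, SACEM class) pairs from across a corpus.
labeled_strings = [("NaCl", "FORMULA"), ("DMSO", "ABBREVIATION"), ("DMSO", "TRIVIAL")]
to_review = inconsistent_class_labels(labeled_strings)

# Consistency check of the test set counts reported in the text.
assert 24_671 + 1_185 - 505 == 25_351
```

Both functions only produce lists for review; the decision to include, exclude or relabel a mention remains a manual one, as described above.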
A common mismatch between annotators was related to how to deal with non-essential parts of the chemical name, especially general modifiers (e.g. 'substituted') inside the chemical name. These modifiers should be retained, whereas in some wrong cases (e.g. 'Fluorophenyl substituted 3,3'-diindolylmethane') the mention was incorrectly split. Closely related to this, many mismatches between annotators were due to a strong tendency to over-split the chemical mentions into different SACEMs, especially in the case of the FORMULA and MULTIPLE classes. The main variability between annotators in the SACEM class assignment was found for hybrid mentions comprising a combination of different sub-parts (typically systematic nomenclature, formula and abbreviations). A hierarchical assignment scheme was defined in the guidelines, so that the curator should label the mention according to the ranking provided for the SACEM: SYSTEMATIC has preference over the rest of SACEMs, FORMULA over TRIVIAL and so on. Some examples for the different combinations were initially provided in the guidelines and a few more were incorporated during the iterative guidelines refinement process. We think that the hierarchical SACEM class assignment guidelines require further improvements. Dealing with the FAMILY class could also be improved. For example, general formulas involving more than a single compound were wrongly assigned to the FORMULA class instead of the FAMILY class.
Chemical disciplines CHEMDNER subsets
Table. CHEMDNER abstracts, split into chemical disciplines (chemical subject categories, first column; MULTIDISCIPL. CHEM.: Multidisciplinary Chemistry).
CHEMDNER corpus test set predictions
Most of the existing biomedical corpora are not distributed together with the predictions of automated systems trained or tested on these datasets. This makes it impossible to do a more exhaustive and detailed analysis of the differences between various methods at the level of concrete annotations. When a corpus was used to generate multiple predictions, for instance by different teams of a community challenge, it is interesting to check various run combinations or to construct ensemble systems with improved performance over the best single run. The competitive performance of ensemble systems has been demonstrated, for instance, for the recognition of gene mentions or the detection of protein interactions, showing in some cases that even low scoring runs can contribute positively to the ensemble system performance. Moreover, we think that the release of corpus predictions is useful to examine more difficult or easier cases and to detect potential annotation errors when examining consensus predictions generated by multiple systems. We have included with the CHEMDNER release the predictions generated by participating systems for the BioCreative CHEMDNER task, with the aim of keeping the research on this topic alive and facilitating the improvement of chemical taggers and of the corpus annotations. A general characterization of methods, resources, features and performance of the various systems can be found in the CHEMDNER overview paper published in this same special issue. Extra details on each of the methods can be found, for a subset of competitive approaches, in the systems description papers of this special issue, the CHEMDNER evaluation workshop proceedings and in Additional file 1. The best F-score obtained for the chemical mention recognition by a single run was 87.39%. For the 3,000 test set abstracts, 26 teams returned 105 different runs, containing a total of 2,565,430 chemical mention predictions.
Additional file 2 shows the clustering of all runs in terms of how similar the predictions of the runs are. The mean number of predictions per run for the test set was 24,432.67 (standard deviation of 12,429.69), corresponding to an average of 8.14 predicted mentions per abstract. The fraction of abstracts that had manually annotated mentions (82.6%) and the average fraction of abstracts predicted by the systems to have at least a single mention (83.34%) are very close. The average number of unique chemical name strings per abstract annotated manually for the test set was slightly higher (2.52) than the number of unique compound names predicted by the automated taggers (2.10).
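The per-run and per-abstract averages follow directly from the totals reported above, as this short check shows. The variable names are illustrative only. The standard deviation cannot be recomputed here because it requires the individual run sizes.

```python
TOTAL_PREDICTIONS = 2_565_430   # all mentions predicted for the test set
RUNS = 105                      # runs returned by the 26 teams
TEST_SET_ABSTRACTS = 3_000

mean_predictions_per_run = TOTAL_PREDICTIONS / RUNS
mean_predictions_per_abstract = mean_predictions_per_run / TEST_SET_ABSTRACTS

print(f"{mean_predictions_per_run:,.2f} predictions per run")
print(f"{mean_predictions_per_abstract:.2f} predicted mentions per abstract")
```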
CHEMDNER silver standard corpus
Due to the considerable workload required for the construction of manually annotated corpora, some efforts have been made to construct automatically tagged text collections generated by different systems. Despite obvious limitations when relying on automated tagging, one advantage of this strategy is that it can generate very large datasets. Assuming that the automated tools have an acceptable performance, the combination of multiple systems can generate labels of acceptable quality.
The BioCreative metaserver constituted a pioneering work in the integration, alignment and visualization of multiple automated predictions, including the annotation of gene/protein mentions and the handling of their character overlaps. The use of silver standard corpora as training data was explored for the implementation of chunkers of biomedical text and of NER systems. Usually the creation of silver standard corpora required a corpus harmonization in order to merge multiple predictions, in the simplest case by applying a voting scheme together with various mention boundary reconciliation strategies (e.g. exact, nested, continuous similarity measure for mention alignments). To help in the exploration of silver standard corpora for chemical entity recognition and to explore alternative corpus construction strategies, we have included the CHEMDNER silver standard raw corpus in the release. The distribution of this corpus might allow the study of generalization strategies to a broader abstract collection. This corpus contains automatically generated chemical mention annotations produced by teams that participated in the BioCreative CHEMDNER task for a background collection of 17,000 PubMed abstracts. These abstracts corresponded to a random sample retrieved by a PubMed search carried out on 27 August 2013, selecting records published during 2013 in English, with abstracts and links to full text papers, without any prior keyword or topic filtering. These articles were published in over 3,000 different journals. Originally this background set was added to the test set abstracts during the prediction phase of the CHEMDNER task to ensure that teams did not have enough time to do any manual correction of their submissions, making sure that everything was done automatically. This set was also added to obtain predictions for abstracts that were not specifically pre-selected for chemistry.
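The simplest harmonization strategy mentioned above, a voting scheme with exact boundary matching, can be sketched as follows. This is an illustration on invented predictions and not a method prescribed for the silver standard corpus: the function name vote_exact_match and the threshold parameter min_votes are hypothetical, and nested or overlapping mentions are treated as different mentions.

```python
from collections import Counter


def vote_exact_match(runs, min_votes):
    """Keep the mentions predicted by at least min_votes runs.

    runs: list of iterables of (pmid, section, start, end) tuples.
    Mentions are merged only if document, section and both offsets agree.
    """
    votes = Counter()
    for run in runs:
        votes.update(set(run))  # each run votes at most once per mention
    return {mention for mention, count in votes.items() if count >= min_votes}


# Invented predictions of three runs for one abstract.
run_1 = [("1", "A", 0, 7), ("1", "A", 10, 14)]
run_2 = [("1", "A", 0, 7), ("1", "A", 10, 14), ("1", "A", 20, 25)]
run_3 = [("1", "A", 0, 7), ("1", "A", 9, 14)]  # boundary differs by one character

consensus = vote_exact_match([run_1, run_2, run_3], min_votes=2)
```

The third run shows the limitation of exact matching: its prediction at offsets 9 to 14 does not support the mention at 10 to 14, which is why boundary reconciliation strategies such as nested matching are used alongside voting.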
All automatic annotations distributed in the CHEMDNER silver corpus were in a common format, enabling direct comparison and alignment of predictions. This corpus contains only the crude annotations. By doing this we intend to encourage researchers to explore their own cross comparison, mention alignment and consensus annotation strategies. A total of 8,359,524 automatic annotations by 105 runs were generated for these 17,000 abstracts. On average, the number of chemical mentions per abstract was 4.39, almost half of the value obtained for the chemistry-related test set abstracts. The number of predicted unique compound names per abstract was 0.83 (compared to 2.10 for the test set). These numbers also partially reflect the fact that, in the case of this random background set, on average only 52.80% of the abstracts contained chemical mentions. When extrapolating these numbers to the entire PubMed database, which currently holds over 14.8 million records with abstracts, we would obtain over 12 million unique chemical names with more than 65 million mentions. However, these numbers have to be taken with care, because the background set corresponded to recent articles, while the PubMed database hosts a considerable number of older publications.
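The extrapolation to PubMed is a simple multiplication of the per-abstract averages by the number of records, which the following lines reproduce. The constant names are illustrative, and the record count is the rounded lower bound given in the text, so the results are rough estimates.

```python
PUBMED_RECORDS_WITH_ABSTRACTS = 14_800_000  # "over 14.8 million": a lower bound
UNIQUE_NAMES_PER_ABSTRACT = 0.83            # background set average
MENTIONS_PER_ABSTRACT = 4.39                # background set average

estimated_unique_names = PUBMED_RECORDS_WITH_ABSTRACTS * UNIQUE_NAMES_PER_ABSTRACT
estimated_mentions = PUBMED_RECORDS_WITH_ABSTRACTS * MENTIONS_PER_ABSTRACT

print(f"{estimated_unique_names / 1e6:.1f} million unique chemical names")
print(f"{estimated_mentions / 1e6:.1f} million chemical mentions")
```

Note that the first product counts names that are unique within each abstract. A name that occurs in many abstracts is counted once per abstract, which is a further reason to read this figure with care.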
Discussion and conclusions
This work is supported by the Innovative Medicines Initiative Joint Undertaking (IMI-eTOX) and the MICROME grant 222886-2. We would like to thank Peter Corbett, Colin Batchelor, Corinna Kolarik and their colleagues for their pioneering work on chemical entity annotation.
We would like to acknowledge funding for this publication by the Innovative Medicines Initiative Joint Undertaking (IMI-eTOX, under Grant Agreement nr. 115002).
This article has been published as part of Journal of Cheminformatics Volume 7 Supplement 1, 2015: Text mining for chemistry and the CHEMDNER track. The full contents of the supplement are available online at http://www.jcheminf.com/supplements/7/S1.
- Vazquez M, Krallinger M, Leitner F, Valencia A: Text mining for drugs and chemical compounds: methods, tools and applications. Molecular Informatics. 2011, 30 (6-7): 506-519. 10.1002/minf.201100005.
- Krallinger M, Valencia A, Hirschman L: Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol. 2008, 9 (Suppl 2): 8. 10.1186/gb-2008-9-s2-s8.
- Leaman R, Gonzalez G, et al: Banner: an executable survey of advances in biomedical named entity recognition. Pacific Symposium on Biocomputing. 2008, 13: 652-663.
- Gerner M, Nenadic G, Bergman CM: Linnaeus: a species name identification system for biomedical literature. BMC bioinformatics. 2010, 11 (1): 85. 10.1186/1471-2105-11-85.
- He M, Wang Y, Li W: Ppi finder: a mining tool for human protein-protein interactions. PloS one. 2009, 4 (2): 4554. 10.1371/journal.pone.0004554.
- Krallinger M, Leitner F, Valencia A: Analysis of biological processes and diseases using text mining approaches. Bioinformatics Methods in Clinical Research. 2010, Humana Press, 341-382.
- Krallinger M, Izarzugaza JM, Rodriguez-Penagos C, Valencia A: Extraction of human kinase mutations from literature, databases and genotyping studies. BMC bioinformatics. 2009, 10 (Suppl 8): 1.
- Fontaine J-F, Barbosa-Silva A, Schaefer M, Huska MR, Muro EM, Andrade-Navarro MA: Medlineranker: flexible ranking of biomedical literature. Nucleic acids research. 2009, 37 (suppl 2): 141-146.
- Leser U, Hakenberg J: What makes a gene name? named entity recognition in the biomedical literature. Briefings in Bioinformatics. 2005, 6 (4): 357-369. 10.1093/bib/6.4.357.
- Nadeau D, Sekine S: A survey of named entity recognition and classification. Lingvisticae Investigationes. 2007, 30 (1): 3-26. 10.1075/li.30.1.03nad.
- Krallinger M, Leitner F, Rabal O, Vazquez M, Oyarzabal J, Valencia A: CHEMDNER: The drugs and chemical names extraction challenge. J Cheminform. 2015, 7 (Suppl 1): S1.
- Neves M, Leser U: A survey on annotation tools for the biomedical literature. Briefings in bioinformatics. 2012, 084.
- Rebholz-Schuhmann D, Jimeno-Yepes AJ, van Mulligen EM, Kang N, Kors JA, Milward D, Corbett PT, Buyko E, Tomanek K, Beisswanger E, et al: The calbc silver standard corpus for biomedical named entities-a study in harmonizing the contributions from four independent named entity taggers. LREC. 2010.
- Kim J-D, Ohta T, Tateisi Y, Tsujii J: Genia corpus-semantically annotated corpus for bio-textmining. Bioinformatics. 2003, 19 (suppl 1): 180-182. 10.1093/bioinformatics/btg1023.
- Lipscomb CE: Medical subject headings (mesh). Bulletin of the Medical Library Association. 2000, 88 (3): 265.
- Bada M, Eckert M, Evans D, Garcia K, Shipley K, Sitnikov D, Baumgartner WA, Cohen KB, Verspoor K, Blake JA, et al: Concept annotation in the craft corpus. BMC bioinformatics. 2012, 13 (1): 161. 10.1186/1471-2105-13-161.
- Degtyarenko K, de Matos P, Ennis M, Hastings J, Zbinden M, McNaught A, Alcántara R, Darsow M, Guedj M, Ashburner M: Chebi: a database and ontology for chemical entities of biological interest. Nucleic acids research. 2008, 36 (suppl 1): 344-350.
- Van Mulligen EM, Fourrier-Reglat A, Gurwitz D, Molokhia M, Nieto A, Trifiro G, Kors JA, Furlong LI: The eu-adr corpus: Annotated drugs, diseases, targets, and their relationships. Journal of biomedical informatics. 2012, 45 (5): 879-884. 10.1016/j.jbi.2012.04.004.
- Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J: Drugbank: a comprehensive resource for in silico drug discovery and exploration. Nucleic acids research. 2006, 34 (suppl 1): 668-672.
- Herrero-Zazo M, Segura-Bedmar I, Martínez P, Declerck T: The ddi corpus: An annotated corpus with pharmacological substances and drug-drug interactions. Journal of biomedical informatics. 2013, 46 (5): 914-920. 10.1016/j.jbi.2013.07.011.
- Rindflesch TC, Tanabe L, Weinstein JN, Hunter L: Edgar: extraction of drugs, genes and relations from the biomedical literature. Pacific Symposium on Biocomputing. NIH Public Access. 2000, 517.
- Nobata C, Dobson PD, Iqbal SA, Mendes P, Tsujii J, Kell DB, Ananiadou S: Mining metabolites: extracting the yeast metabolome from the literature. Metabolomics. 2011, 7 (1): 94-101. 10.1007/s11306-010-0251-6.
- Schlaf A, Bobach C, Irmer M: Creating a gold standard corpus for the extraction of chemistry-disease relations from patent texts. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). Edited by: Chair, NCC, Choukri K, Declerck T, Loftsson H, Maegaard B, Mariani J, Moreno A, Odijk J, Piperidis S. 2014, European Language Resources Association (ELRA), Reykjavik, Iceland.
- Wilbur WJ, Hazard GF, Divita G, Mork JG, Aronson AR, Browne AC: Analysis of biomedical text for chemical names: a comparison of three methods. Proceedings of the AMIA Symposium, American Medical Informatics Association. 1999, 176.
- Wren JD: A scalable machine-learning approach to recognize chemical names within large text databases. BMC bioinformatics. 2006, 7 (Suppl 2): 3. 10.1186/1471-2105-7-S2-S3.
- Zhang JD, Geer LY, Bolton E, Bryant SH: Automated annotation of chemical names in the literature with tunable accuracy. J Cheminformatics. 2011, 3: 52. 10.1186/1758-2946-3-52.
- Narayanaswamy M, Ravikumar K, Vijay-Shanker K, Ay-shanker KV: A biological named entity recognizer. Pac Symp Biocomput. 2003, 427.
- Hawizy L, Jessop DM, Adams N, Murray-Rust P: Chemicaltagger: A tool for semantic text-mining in chemistry. Journal of cheminformatics. 2011, 3 (1): 17. 10.1186/1758-2946-3-17.
- Standard, C.P.G. [http://chebi.cvs.sourceforge.net/viewvc/chebi/chapati/patentsGoldStandard]
- Tiago G, Catia P, Bastos Hugo P, et al: Chemical entity recognition and resolution to chebi. ISRN Bioinformatics. 2012, 2012.
- Akhondi SA, Klenner AG, Tyrchan C, Manchala AK, Boppana K, Lowe D, Zimmermann M, Jagarlapudi SA, Sayle R, Kors JA, et al: Annotated chemical patent corpus: A gold standard for text mining. PloS one. 2014, 9 (9): 107477. 10.1371/journal.pone.0107477.
- Rupp C, Copestake A, Teufel S, Waldron B: Flexible interfaces in the application of language technology to an escience corpus. Proceedings of the UK e-Science Programme All Hands Meeting. Citeseer. 2006.
- Corbett P, Batchelor C, Teufel S: Annotation of chemical named entities. Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing. Association for Computational Linguistics. 2007, 57-64.
- Corbett P, Copestake A: Cascaded classifiers for confidence-based chemical named entity recognition. BMC bioinformatics. 2008, 9 (Suppl 11): 4. 10.1186/1471-2105-9-S11-S4.
- Kolárik C, Klinger R, Friedrich CM, Hofmann-Apitius M, Fluck J: Chemical names: terminological resources and corpora annotation. Workshop on Building and Evaluating Resources for Biomedical Text Mining (6th Edition of the Language Resources and Evaluation Conference). 2008.
- Tamames J, Valencia A: The success (or not) of hugo nomenclature. Genome biology. 2006, 7 (5): 402. 10.1186/gb-2006-7-5-402.
- Arighi CN, Wu CH, Cohen KB, Hirschman L, Krallinger M, Valencia A, Lu Z, Wilbur JW, Wiegers TC: Biocreative-iv virtual issue. Database. 2014, 2014: 039.
- Yeh A, Morgan A, Colosimo M, Hirschman L: Biocreative task 1a: gene mention finding evaluation. BMC bioinformatics. 2005, 6 (Suppl 1): 2. 10.1186/1471-2105-6-S1-S2.
- Smith L, Tanabe LK, Ando RJ, Kuo CJ, Chung IF, Hsu CN, Lin YS, Klinger R, Friedrich CM, Ganchev K, Torii M, Liu H, Haddow B, Struble CA, Povinelli RJ, Vlachos A, Baumgartner WA, Hunter L, Carpenter B, Tsai RT, Dai HJ, Liu F, Chen Y, Sun C, Katrenko S, Adriaans P, Blaschke C, Torres R, Neves M, Nakov P, Divoli A, Mana-Lopez M, Mata J, Wilbur WJ: Overview of BioCreative II gene mention recognition. Genome Biol. 2008, 9 (Suppl 2): 2. 10.1186/gb-2008-9-s2-s2.
- Hirschman L, Colosimo M, Morgan A, Yeh A: Overview of biocreative task 1b: normalized gene lists. BMC bioinformatics. 2005, 6 (Suppl 1): 11. 10.1186/1471-2105-6-S1-S11.
- corpus, C. [http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus]
- Salgado D, Krallinger M, Depaule M, Drula E, Tendulkar AV, Leitner F, Valencia A, Marcelle C: Myminer: a web application for computer-assisted biocuration and text annotation. Bioinformatics. 2012, 28 (17): 2285-2287. 10.1093/bioinformatics/bts435.
- tool, A. [http://annotateit.org]
- Ide N, Romary L: Representing linguistic corpora and their annotations. Proceedings of the Fifth Language Resources and Evaluation Conference (LREC), Genoa, Italy. 2006.
- Comeau DC, Doğan RI, Ciccarese P, Cohen KB, Krallinger M, Leitner F, Lu Z, Peng Y, Rinaldi F, Torii M, et al: Bioc: a minimalist approach to interoperability for biomedical text processing. Database. 2013, 2013: 064.
- Smith L, Tanabe LK, Ando RJ, Kuo C-J, Chung I-F, Hsu C-N, Lin Y-S, Klinger R, Friedrich CM, Ganchev K, et al: Overview of biocreative ii gene mention recognition. Genome biology. 2008, 9 (Suppl 2): 2. 10.1186/gb-2008-9-s2-s2.
- Leitner F, Mardis SA, Krallinger M, Cesareni G, Hirschman LA, Valencia A: An overview of biocreative ii. 5. Computational Biology and Bioinformatics, IEEE/ACM Transactions on. 2010, 7 (3): 385-399.
- 2, B.I.-C.P.V. [http://www.biocreative.org/resources/publications/chemdner-proceed-publications]
- Leitner F, Krallinger M, Rodriguez-Penagos C, Hakenberg J, Plake C, Kuo C-J, Hsu C-N, Tsai R, Hung H-C, Lau WW, et al: Introducing meta-services for biomedical information extraction. Genome Biol. 2008, 9 (Suppl 2): 6. 10.1186/gb-2008-9-s2-s6.
- Kang N, van Mulligen EM, Kors JA: Training text chunkers on a silver standard corpus: can silver replace gold?. BMC bioinformatics. 2012, 13 (1): 17. 10.1186/1471-2105-13-17.
- Rebholz-Schuhmann D, Yepes AJ, Li C, Kafkas S, Lewin I, Kang N, Corbett P, Milward D, Buyko E, Beisswanger E, et al: Assessment of ner solutions against the first and second calbc silver standard corpus. Journal of biomedical semantics. 2011, 2 (5): 1-12.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.