A document classifier for medicinal chemistry publications trained on the ChEMBL corpus
© Papadatos et al.; licensee Chemistry Central Ltd. 2014
Received: 7 April 2014
Accepted: 17 July 2014
Published: 12 August 2014
The large increase in the number of scientific publications has fuelled a need for semi- and fully automated text mining approaches to assist in the triage process, both for individual scientists and for larger-scale data extraction and curation into public databases. Here, we introduce a document classifier that is able to successfully distinguish between publications that are ‘ChEMBL-like’ (i.e. related to small molecule drug discovery and likely to contain quantitative bioactivity data) and those that are not. The unprecedented size of the medicinal chemistry literature collection, coupled with the advantage of manual curation and mapping to chemistry and biology, makes the ChEMBL corpus a unique resource for text mining.
The method has been implemented as a data protocol/workflow for Pipeline Pilot (version 8.5) and KNIME (version 2.9), respectively. Both workflows and models are freely available at: ftp://ftp.ebi.ac.uk/pub/databases/chembl/text-mining. These can be readily modified to include additional keyword constraints to further focus searches.
Large-scale machine learning document classification was shown to be very robust and flexible for this particular application, as illustrated in four distinct text-mining-based use cases. The models are readily available on two data workflow platforms, which we believe will allow the majority of the scientific community to apply them to their own data.
Keywords: Machine learning, Triage, Curation, Document classification
The ChEMBL database stores a large quantity of 2D compound structures, biological targets, bioactivity data and calculated molecular properties of drugs and drug-like molecules; the coverage of ChEMBL is primarily focused on the medicinal chemistry, chemical biology and drug discovery fields. Data in ChEMBL is manually extracted from experimental results reported in the primary scientific literature and then curated and integrated to ensure consistency and improve data quality.
Manual document data entry and curation is expensive and time-consuming. Furthermore, it has become increasingly difficult for curators to keep up with the growing scientific output, and this is likely to become more of an issue as pressure mounts to release more data from funded research programs. Therefore, biomedical researchers, text miners and curators are in need of automated expert systems that can help with the initial steps of the curation process. This phase is known as triage, namely the selection of likely relevant scientific articles from large repositories, such as Europe PMC and PubMed.
Extracting chemistry-related information from text has been performed in the past, in particular using named entity recognition systems such as Whatizit, OSCAR4 or ChemSpot. These tools can help, for instance, to identify drugs and molecular structures to be further curated or analysed in combination with other data types. However, the main goal of our project diverges from the goal of the tools mentioned. We aim to meet the following criteria: ranking and prioritising the relevant literature using a fast and high performance algorithm, with a generic methodology applicable to other domains and not necessarily related to chemistry and drug discovery. In this regard, we present a method that builds upon the manually collated and curated ChEMBL document corpus in order to train a Bag-of-Words (BoW) document classifier. The classifier is based on the titles and abstracts of the corpus. The strategy has already proven to be successful in other fields such as toxicogenomics, and thus our main aim here has been extension and validation. We demonstrate the use of the methodology and make it available to the community.
In more detail, we have employed two established classification methods, namely Naïve Bayesian (NB) and Random Forest (RF) approaches. The resulting classification score, henceforth referred to as ‘ChEMBL-likeness’, is used to prioritise relevant documents for data extraction and curation during the triage process. The data pre-processing workflows and validated models are freely available online to the community under permissive licenses, as a Pipeline Pilot protocol and a KNIME workflow, respectively. Both the protocol and workflow provide the same functionality and have been validated on the same data set.
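To make the idea of a classification score concrete, the following is a minimal sketch of a Bernoulli naive Bayes scorer over the words present in a title or abstract. It is not the published Pipeline Pilot or KNIME model: the function names (`tokens`, `train_nb`, `chembl_likeness`), the add-one smoothing and the two invented training titles are assumptions made purely for illustration, and no stemming or stop-word removal is applied.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Set of lowercase words in a title/abstract (no stemming here)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def train_nb(docs, labels):
    """Fit a Bernoulli naive Bayes model; labels are True for ChEMBL-like."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos, neg = Counter(), Counter()
    for text, label in zip(docs, labels):
        (pos if label else neg).update(tokens(text))
    # Per-word presence probability in each class, with add-one smoothing
    # (an assumption of this sketch, not taken from the paper).
    probs = {w: ((pos[w] + 1) / (n_pos + 2), (neg[w] + 1) / (n_neg + 2))
             for w in set(pos) | set(neg)}
    return probs, math.log(n_pos / n_neg)


def chembl_likeness(text, model):
    """Log-odds that a document is ChEMBL-like; higher ranks first."""
    probs, prior = model
    present = tokens(text)
    score = prior
    for word, (p, q) in probs.items():
        if word in present:
            score += math.log(p / q)
        else:
            score += math.log((1 - p) / (1 - q))
    return score


# Toy training set: titles 1 and 3 are the two titles of Table 1,
# titles 2 and 4 are invented for this example.
docs = [
    "Discovery of biaryl anthranilides as full agonists for the high "
    "affinity niacin receptor.",
    "Synthesis and SAR of potent kinase inhibitors",
    "Automatic prediction of protein interactions with large scale motion.",
    "A survey of hospital staffing levels",
]
labels = [True, True, False, False]
model = train_nb(docs, labels)
```

Sorting unseen documents by `chembl_likeness(text, model)` in descending order then gives a triage queue; a real model would of course be trained on many thousands of documents rather than four.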
Table 1. BoW and n-grams example for two document titles
Titles
1. Discovery of biaryl anthranilides as full agonists for the high affinity niacin receptor.
2. Automatic prediction of protein interactions with large scale motion.
Bag of words
1. Discover, biaryl, anthranilid, full, agonist, high, affin, niacin, receptor
2. Automat, predict, protein, interact, large, scale, motion
2-grams
1. Discovery_of, full_agonists, high_affinity, niacin_receptor, …
2. Automatic_prediction, protein_interaction, large_scale, …
3-grams
1. Discovery_of_biaryl, high_affinity_niacin, affinity_niacin_receptor, …
2. Automatic_prediction_of, protein_interaction_with, large_scale_motion, …
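The token lists above can be approximated with a few lines of code. The sketch below is an illustration only: the stop-word list is a tiny hypothetical one chosen to match these two titles, and stemming is deliberately omitted, so it yields "discovery" where the bag of words above lists the stem "Discover".

```python
import re

# Hypothetical, minimal stop-word list; the list used in the published
# workflows is not given in the text.
STOP_WORDS = {"of", "as", "for", "the", "with"}


def words(title):
    """Split a title into alphabetic word tokens, keeping their case."""
    return re.findall(r"[A-Za-z]+", title)


def bag_of_words(title):
    """Lowercased tokens with stop words removed (no stemming applied)."""
    return [w.lower() for w in words(title) if w.lower() not in STOP_WORDS]


def ngrams(title, n):
    """All runs of n consecutive words, joined with underscores."""
    toks = words(title)
    return ["_".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]


title = ("Discovery of biaryl anthranilides as full agonists for the "
         "high affinity niacin receptor.")
bow = bag_of_words(title)
two_grams = ngrams(title, 2)
three_grams = ngrams(title, 3)
```

Note that the n-grams are built from the raw word sequence, stop words included, which is why "Discovery_of" appears among the 2-grams.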
A document vector example from the titles of the documents in Table 1
Summary of classification validation statistics across different methods and validation sets (table values not recovered; surviving row labels: NB n-grams EV, RF CV Out-of-Bag)
Notably, the ROC curves in Figures 2 and 3 show that the classifiers have a very high true positive rate at the start of the curve. To quantify this, we ranked the predictions in the external validation by model score rather than by predicted class. In the top 5%, only 4 out of 954 documents are false positives; the remaining 950 are true positives. Likewise, in the top 10%, 16 out of 1908 are false positives and 1892 are true positives. This suggests that the classifier is able to rank the documents accurately, i.e. highly ranked documents are the more desirable papers. We are currently validating this observation (see section “Filtering allosteric ligand-related publications”).
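Expressed as precision, the counts above correspond to 950/954 ≈ 99.6% in the top 5% and 1892/1908 ≈ 99.2% in the top 10%. A minimal sketch of how such a figure can be computed from ranked predictions is shown below; the function name `precision_at_fraction` and the toy score list are assumptions for illustration, not part of the published workflows.

```python
def precision_at_fraction(scored, fraction):
    """Precision among the top `fraction` of documents ranked by score.

    `scored` is a list of (score, is_relevant) pairs.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    hits = sum(1 for _, relevant in ranked[:k] if relevant)
    return hits / k


# Figures reported in the text, recomputed as precision.
top5_precision = 950 / 954
top10_precision = 1892 / 1908

# Toy ranked list (invented scores) to exercise the function.
toy = [(0.9, True), (0.8, True), (0.7, False), (0.1, False)]
toy_top_half = precision_at_fraction(toy, 0.5)
```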
Using the n-gram based document vector slightly improved performance during the stratified partition validation, at the expense of increased training time and resource usage: 3 minutes to completion for BoW versus 9 minutes for n-grams on the same machine with the same data (a threefold increase), while performance increased by only 2.5% on average. Given this minimal gain in predictive performance, we chose not to pursue n-grams with the other validation strategies. However, because we did observe an improvement, the approach may be worth trying on sets where the BoW method performs inadequately. Overall, the positive retrospective and prospective validation statistics indicate that these models are suitable to identify highly relevant articles for subsequent information extraction.
Classification validation parameters
Results and discussion
Four applications and use cases that leverage the classifier functionality are presented below. Two applications rely on the quantification of the ChEMBL-likeness score, one application is focused on a specific disease area, and finally a fourth application aims at identifying papers that are relevant to a less-defined, more complex concept (age-related differential drug response).
Prioritizing new publications
Filtering allosteric ligand-related publications
While a detailed analysis of this set will be reported elsewhere, we would like to outline our approach for validating the ability of our models to rank papers. Initially, we looked at several samples from the set; higher scoring documents do appear to be more relevant, whereas low scoring documents that are still ChEMBL-like contain relatively more false positives. Some examples are PMID:11142631 (highest scoring ChEMBL-like), PMID:9891064 (lowest scoring ChEMBL-like) and PMID:17008604 (non-ChEMBL-like). To further validate the ranking ability, we have selected, per journal, the top 10 documents (by classifier score) that are predicted to be ChEMBL-like. After these documents have been curated, we will compare their scores with their relevance for ChEMBL. As these documents were gathered from diverse journals, this is likely to tell us more about the models’ ability to rank documents.
Generating antimalarial paper alerts
Identifying age-related differential drug responses
The method can be easily adapted to a more complex task, namely the retrieval and prioritization of articles in which age-related differential drug responses are reported. After articles not containing age- and drug-related words were filtered out using a dictionary, the NB classifier was trained and validated on manually checked publication abstracts. The articles selected to train and validate the NB classifier contained at least 5 age- and/or drug-related words (Additional file 4). A “relevant” flag was assigned if the abstract contained pertinent information about drugs with reported age-related differential drug responses. Similarly, a “non-relevant” flag was assigned if the information was deemed irrelevant. For a fair representation, an equal number of articles from both sets were used to train (in total 125 articles) and validate the model. The resulting model scored the likelihood that an article contains information about drugs that are not as effective or safe in paediatric or geriatric populations as in adult populations. This approach identified and prioritized approximately 1,400 articles out of a pool of 19,200. Articles freely available in PubMed Central were selected for further evaluation. Of the 168 selected, 19 contained relevant information, resulting in the identification of 46 new drugs with reported age-related differential drug responses. Despite its apparently modest performance, the classifier highlighted articles that had not been previously identified by conventional literature search methods, hence contributing considerably to expanding the current list of drugs with known age-related response differences.
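The dictionary pre-filter described above can be sketched as follows. The term list here is hypothetical (the actual dictionary is in Additional file 4), and the sketch assumes that the threshold of 5 counts word occurrences rather than distinct words, which the text does not specify.

```python
import re

# Hypothetical dictionary entries, for illustration only; the real
# age- and drug-related word lists are given in Additional file 4.
AGE_DRUG_TERMS = {
    "age", "elderly", "geriatric", "paediatric", "pediatric", "children",
    "drug", "dose", "pharmacokinetics", "clearance",
}


def count_dictionary_hits(abstract, terms=AGE_DRUG_TERMS):
    """Number of word occurrences in the abstract that are in the dictionary."""
    return sum(1 for tok in re.findall(r"[a-z]+", abstract.lower())
               if tok in terms)


def passes_prefilter(abstract, min_hits=5):
    """Keep an abstract only if it has at least `min_hits` dictionary words."""
    return count_dictionary_hits(abstract) >= min_hits


kept = passes_prefilter(
    "Drug clearance in elderly patients: the dose of the drug was "
    "reduced in geriatric subjects."
)
dropped = passes_prefilter("Protein folding simulations at large scale.")
```

Only abstracts that pass this filter would then be labelled and passed to the classifier. For orientation, the figures reported above correspond to roughly 7% of the pool being prioritized (1,400 of 19,200) and about 11% of the evaluated articles containing relevant information (19 of 168).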
In conclusion, this method provides a fast and robust way to automatically identify and score articles relevant to the medicinal chemistry, chemical biology and drug discovery fields. The versatility of the method is highlighted here with four distinct applications, although many more could be foreseen. Both the Pipeline Pilot (NB) and KNIME (RF) workflows and models, along with the PubMed identifiers of the documents used in the training and test sets, are available on the ChEMBL ftp server. This will ensure the reproducibility and reuse of our methodology and the straightforward dissemination of the models via two popular and user-friendly workflow platforms.
While the use of full text or named entity recognition could possibly increase performance over the use of abstracts and titles alone, there is in reality little room for improvement, as shown by the models trained on n-grams as opposed to BoW data. This is equally true for the inclusion of other sources of information, such as author names or journal name, and for potential data fusion methods relying on both RF and NB. Moreover, including such data could actually limit the broad applicability of the classifier. These and other potential improvements are the subject of further on-going studies. We propose that titles and abstracts alone, as opposed to full text or annotated documents, provide sufficient information content for a reliable initial classification on a large scale, as required in our use cases, while avoiding unnecessary complexity.
Notably, the way in which the contents of documents are abstracted here bears similarities to established chemoinformatics techniques. The document vector (presence or absence of words drawn from a dictionary) is analogous to a dictionary-based fingerprint, whereby the dictionary is not predefined but constructed from the underlying data. In the same sense, word tokens are analogous to a compound’s substructural features, while the word n-grams are linear combinations of features (word tokens), which are in turn similar to the substructural features extracted from path-based fingerprints. As a result, this allows for the introduction of additional approaches from the chemoinformatics domain to text mining, including, but not limited to, document clustering, applicability domain determination for classification models, and feature importance determination (although this was touched upon already above). Finally, we aim to expand the scope of this model by applying it to chemical patent document mining in the near future. Here, we could score and prioritise relevant patent documents based on the title and abstract content.
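The fingerprint analogy can be illustrated with a short sketch that builds binary document vectors over a data-derived dictionary and compares them with the Tanimoto coefficient, a similarity measure commonly used for chemical fingerprints. The choice of Tanimoto similarity and the third, invented token list are assumptions for illustration; the text above names document clustering only in general terms.

```python
def document_vector(doc_tokens, dictionary):
    """Binary presence/absence vector of a document over the dictionary."""
    present = set(doc_tokens)
    return [1 if word in present else 0 for word in dictionary]


def tanimoto(a, b):
    """Tanimoto coefficient of two binary vectors of equal length."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 0.0


# Bag-of-words tokens of the two titles in Table 1, plus one invented document.
doc1 = ["discover", "biaryl", "anthranilid", "full", "agonist",
        "high", "affin", "niacin", "receptor"]
doc2 = ["automat", "predict", "protein", "interact", "large", "scale", "motion"]
doc3 = ["discover", "agonist", "receptor"]  # hypothetical third document

# The dictionary is built from the data rather than predefined.
dictionary = sorted(set(doc1) | set(doc2) | set(doc3))
v1, v2, v3 = (document_vector(d, dictionary) for d in (doc1, doc2, doc3))

sim_12 = tanimoto(v1, v2)  # the two Table 1 titles share no tokens
sim_13 = tanimoto(v1, v3)  # 3 shared tokens out of 9 in the union
```

A pairwise similarity matrix computed this way could serve as input to any standard clustering method, in the same way that fingerprint similarities are used to cluster compounds.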
Availability and requirements
Project name: ChEMBL literature classifier - Pipeline Pilot and KNIME workflows
Operating system(s): OS X and Windows
Programming language: Java/Pilot Script
Other requirements: KNIME (version 2.9) or Pipeline Pilot (version 8.5) installed
License: Apache 2 License
Any restrictions to use by non-academics: None
GP, GvW, and JPO conceived the study/method. GvW, GP, ST, and RS participated in its design and application. GP, GvW, SC, RS, ST, and JPO helped to draft the manuscript. All authors read and approved the final manuscript.
AUC: Area under the curve
MCC: Matthews correlation coefficient
We acknowledge the members of the ChEMBL group for valuable discussions and feedback on this work. JPO and GP thank the Wellcome Trust for funding under a Strategic Award (WT086151/Z/08/Z). RS, ST and SC acknowledge EMBL member states for funding. GvW thanks EMBL (EIPOD) and Marie Curie Actions (COFUND) for funding.
- Bento AP, Gaulton A, Hersey A, Bellis LJ, Chambers J, Davies M, Krüger FA, Light Y, Mak L, McGlinchey S, Nowotka M, Papadatos G, Santos R, Overington JP: The ChEMBL bioactivity database: an update. Nucleic Acids Res. 2014, 42: D1083-D1090. 10.1093/nar/gkt1031.
- Rebholz-Schuhmann D, Kirsch H, Couto F: Facts from text–is text mining ready to deliver?. PLoS Biol. 2005, 3: e65. 10.1371/journal.pbio.0030065.
- Burge S, Attwood TK, Bateman A, Berardini TZ, Cherry M, O’Donovan C, Xenarios L, Gaudet P: Biocurators and biocuration: surveying the 21st century challenges. Database (Oxford). 2012, 2012: bar059.
- Europe PubMed Central. [http://europepmc.org/]
- PubMed/MEDLINE. [http://www.pubmed.org]
- Rebholz-Schuhmann D, Arregui M, Gaudan S, Kirsch H, Jimeno A: Text processing through web services: calling Whatizit. Bioinformatics. 2008, 24: 296-298. 10.1093/bioinformatics/btm557.
- Jessop DM, Adams SE, Willighagen EL, Hawizy L, Murray-Rust P: OSCAR4: a flexible architecture for chemical text-mining. J Cheminform. 2011, 3: 41. 10.1186/1758-2946-3-41.
- Rocktäschel T, Weidlich M, Leser U: ChemSpot: a hybrid system for chemical named entity recognition. Bioinformatics. 2012, 28: 1633-1640. 10.1093/bioinformatics/bts183.
- Arighi CN, Cohen KB, Hirschman L, Lu Z, Tudor CO, Wiegers T, Wilbur WJ, Wu CH: Proceedings of the fourth BioCreative challenge evaluation workshop. 2013, Bethesda, Maryland, USA.
- Davis AP, Wiegers TC, Johnson RJ, Lay JM, Lennon-Hopkins K, Saraceni-Richards C, Sciaky D, Murphy CG, Mattingly CJ: Text mining effectively scores and ranks the literature for improving chemical-gene-disease curation at the comparative toxicogenomics database. PLoS One. 2013, 8: e58201. 10.1371/journal.pone.0058201.
- Vishnyakova D, Pasche E, Ruch P: Using binary classification to prioritize and curate articles for the comparative toxicogenomics database. Database (Oxford). 2012, 2012: bas050. 10.1093/database/bas050.
- Mitchell TM: Machine learning. 1997, McGraw-Hill, Inc., New York, NY, USA.
- Domingos P, Pazzani M: On the optimality of the simple bayesian classifier under zero–one loss. Mach Learn. 1997, 29: 103-130. 10.1023/A:1007413511361.
- Breiman L: Random forests. Mach Learn. 2001, 45: 5-32. 10.1023/A:1010933404324.
- Pipeline pilot. 2012.
- Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B: KNIME: the konstanz information miner. 2007, Springer, In Stud. Classif. Data Anal. Knowl. Organ.
- Liu T, Lin Y, Wen X, Jorissen RN, Gilson MK: BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic Acids Res. 2007, 35: D198-D201. 10.1093/nar/gkl999.
- Van Westen GJP, Gaulton A, Overington JP: Chemical, target, and bioactive properties of allosteric modulation. PLoS Comput Biol. 2014, 10: e1003559. 10.1371/journal.pcbi.1003559.
- Brown HL: Pay-per-view in interlibrary loan: a case study. J Med Libr Assoc. 2012, 100: 98-103. 10.3163/1536-5050.100.2.007.
- Malaria-data resource. [https://www.ebi.ac.uk/chembl/malaria/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.