Extracting and connecting chemical structures from text sources using chemicalize.org
© Southan and Stracz; licensee Chemistry Central Ltd. 2013
Received: 14 February 2013
Accepted: 18 April 2013
Published: 23 April 2013
Exploring bioactive chemistry requires navigating between structures and data from a variety of text-based sources. While PubChem currently includes approximately 16 million document-extracted structures (15 million from patents), the extent of public inter-document and document-to-database links is still well below any estimated total, especially for journal articles. A major expansion in access to text-entombed chemistry is enabled by chemicalize.org. This on-line resource can process IUPAC names, SMILES, InChI strings, CAS numbers and drug names from pasted text, PDFs or URLs to generate structures, calculate properties and launch searches. Here, we explore its utility for answering questions related to chemical structures in documents and where these overlap with database records. These aspects are illustrated using a common theme of Dipeptidyl Peptidase 4 (DPPIV) inhibitors.
Full-text open URL sources facilitated the download of over 1400 structures from a DPPIV patent and the alignment of specific examples with IC50 data. Uploading the SMILES to PubChem revealed extensive linking to patents and papers, including prior submissions from chemicalize.org as submitting source. A DPPIV medicinal chemistry paper was completely extracted and structures were aligned to the activity results table, as well as linked to other documents via PubChem. In both cases, key structures with data were partitioned from common chemistry by dividing them into individual new PDFs for conversion. Over 500 structures were also extracted from a batch of PubMed abstracts related to DPPIV inhibition. The drug structures could be stepped through each text occurrence and included some converted MeSH-only IUPAC names not linked in PubChem. Performing set intersections proved effective for detecting compounds-in-common between documents and merged extractions.
This work demonstrates the utility of chemicalize.org for the exploration of chemical structure connectivity between documents and databases, including structure searches in PubChem, InChIKey searches in Google and the chemicalize.org archive. It has the flexibility to extract text from any internal, external or Web source. It synergizes with other open tools and the application is undergoing continued development. It should thus facilitate progress in medicinal chemistry, chemical biology and other bioactive chemistry domains.
The majority of chemical information and related data generated by biomedical research is specified in text form. A proportion of these primary reports have been captured in public and commercial databases that include a document cross-reference linked to standard chemical representations [2, 3]. Two basic methods are used to populate chemical databases via text. The first is expert manual curation (EMC), typically using a chemical sketcher for input. The second is automated name-to-structure conversion, also termed chemical named entity recognition (CNER). A third option, automated conversion of images to structures, has only just begun to contribute to public database entries via SureChemOpen.
A number of questions arise in regard to the global corpus of bioactive chemistry represented in text. These include (a) the total “out there”, (b) the number represented in major public databases and (c) the ratio between source types. The upper limit for (a) could be the 70 million substances collated in the CAS commercial database but there are factors suggesting this exceeds the text-based corpus. At 47 million, PubChem is not only the largest open repository but also provides content counts by submission types that can be used to answer (b) and (c). Patent-extracted structures have four major sources in PubChem. Three of these use CNER: SureChem (9.3 million), SCRIPDB (4.0 million) and IBM (2.4 million). The fourth, Thomson Pharma, is an EMC source (3.8 million). The union between these is 15 million. The largest journal extraction source is ChEMBL, with 0.8 million structures, and PubMed abstracts have 0.2 million linked structures. The chemistry capture ratio for patents:papers:abstracts is therefore approximately 70:4:1, with the union being 16 million. Even if the 70 million CAS-substances exceeded the text-specified total, the implication is that explicit document links for anywhere between 20 and 40 million unique structures are missing from public databases. Paradoxically, because of access constraints, this shortfall is largest for journal content, since the availability of full-text from the major patent offices is now largely complete.
Researchers exploring bioactive chemistry thus need ways of extracting structures from document “tombs”. In this work we explore the utility of chemicalize.org for this task. Developed by ChemAxon, this web application uses a CNER algorithm and dictionaries to identify chemical structures in text from names and identifiers. The value of this lies in addressing practical questions, the answers to which are important to support decisions in both academic and commercial R&D settings.
The chemicalize.org interface
The application has four principal inputs: text strings, a sketcher interface, URLs and PDF uploads. For all of these, chemical entities in the text will be converted if they are recognized as semantic names, CAS numbers, IUPAC names, SMILES or InChI strings. The conversions automatically generate the quartet of IUPAC name, SMILES, InChI string and InChIKey, together with ancillary identifiers. For a web page or PDF, the structures found are displayed as a ribbon of images at the top of the page. These present the conversions in order of first occurrence in the document, accompanied in brackets by the count of occurrences of that structure within the document and, if a second bracket is included, the number of different names that structure had in the document.
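The bookkeeping behind such a ribbon can be illustrated with a minimal Python sketch. This is not ChemAxon's implementation; the function name `build_ribbon`, the structure identifiers `S1` and `S2`, and the sample mentions are hypothetical and serve only to show how first-occurrence order, occurrence counts and distinct-name counts relate to each other.

```python
def build_ribbon(mentions):
    """Summarise recognised names in document order.

    mentions: iterable of (name_as_written, structure_id) pairs, where
    structure_id is any identifier shared by names that resolve to the
    same structure (hypothetical labels here; an InChIKey would also work).
    Returns a list of (structure_id, occurrence_count, distinct_name_count)
    ordered by first occurrence in the document.
    """
    ribbon = {}  # dicts preserve insertion order in Python 3.7+
    for name, structure_id in mentions:
        entry = ribbon.setdefault(structure_id, {"count": 0, "names": set()})
        entry["count"] += 1
        entry["names"].add(name.lower())
    return [(sid, e["count"], len(e["names"])) for sid, e in ribbon.items()]


# Hypothetical mentions: two names resolving to S1, one name resolving to S2.
mentions = [
    ("sitagliptin", "S1"),
    ("vildagliptin", "S2"),
    ("MK-0431", "S1"),
    ("sitagliptin", "S1"),
]
ribbon = build_ribbon(mentions)
for structure_id, occurrences, names in ribbon:
    print(f"{structure_id} ({occurrences}) ({names})")
```

In this toy case the first structure would be shown with an occurrence count of 3 and a name count of 2, mirroring the two brackets described above.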
Sources and downloads
Document retrieval is outside the scope of this report but we can indicate compatible sources that were successfully used for the examples. For patents, the two most consistent in terms of URL performance, text quality and structure extraction numbers were Free Patents Online (FPO) and EPO Espacenet [7, 10]. For journal articles we used the Open Access subset of European PubMed Central, and PubMed for abstracts. Performance between operating system and browser alternatives can be configuration-dependent but this evaluation was done on a standard Windows 7 machine using Firefox and Chrome (but note that an extension to the Safari browser can chemicalize web pages on-the-fly). Features of the standard Windows and Microsoft Office software that proved useful included (a) Notepad for format-stripping and editing text, (b) the ability to transfer either complete URL content or specific sections to Word and save a PDF for upload to chemicalize.org, and (c) working across multiple windows (e.g. a converted URL open in one, cross-pasting to the chemicalize.org text conversion box in a second and having a Google interface in a third).
Conversion success rates in CNER are dependent on text quality. For this reason, results from the direct processing of URLs can often be improved by removing confounding formatting. This can be done by converting text sections into a fresh PDF or by selecting individual IUPAC names for iterative editing via the front page text box. For example, from a 50,000-word patent URL, just the 5,000-word section that encompasses relevant IUPAC names for data-linked examples can be saved as a PDF that will convert rapidly and cleanly on upload. The download options have different utilities. The SDF file can be used as an archive to generate other formats. Alternatively, SMILES produce the smallest file size for batch uploads to databases and are convenient for merging and intersecting result sets from multiple extractions.
The first questions to answer for extracted structures are their identity or similarity to other sources. For any individual compound the most efficient first-pass is a Google search with the inner skeleton layer of the InChIKey. This will instantly record which major databases include a matching record. For similarity searches, the logical order is chemicalize.org itself, followed by internal, public or commercial databases. For bulk checking, an identity search against PubChem will be exemplified here. To enhance result interpretation a series of MyNCBI custom filters were set up for this work. Two of these were unions for the patent and literature-derived compound records from sources described in the introduction. Two others record the total and unique matches to chemicalize.org (see below). A further filter was adapted from the constitutive Rule-of-five parameters as pre-set in PubChem, by the addition of a 250 to 800 Mw window. This provided a useful separation of reagents and intermediates from lead-like compounds exemplified in patents.
One of the reasons for choosing PubChem for triage is that structures from the chemicalize.org result archive have recently been deposited (Source name = chemicalize.org by ChemAxon). This provides not only the pre-computed relationships of each structure to the neighbor space in PubChem but also the connectivity to all other PubChem sources and “back out” (via chemicalize.org) to user-submitted URLs or documents. As of April 2013 the chemicalize.org source was linked to 297,083 compound identifiers (CIDs). Of these, just over 20% were unique (i.e. the exact structures were not present in other PubChem sources). The fact that 80% of the structures are independently supported by other submissions (according to the CID merging rules) indicates the quality of the chemicalize.org archive. While users need to be aware that the presence in PubChem introduces circularity in terms of structure searches per se, any URLs for the chemicalize.org source linked via the substance identifiers (SIDs) may have been updated since the deposition date. Thus, not only can the original links in the chemicalize.org entry (accessed via the SID) be different to those chosen by a new user but clicking on any link automatically re-extracts the structures from that URL.
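The InChIKey first-pass search can be sketched in a few lines of Python. A standard InChIKey has three hyphen-separated blocks of 14, 10 and 1 upper-case letters, and the first block encodes the molecular skeleton (connectivity). The helper names below are hypothetical, and the aspirin key is used purely as a well-known illustration; it is not one of the compounds discussed in this work.

```python
import re
from urllib.parse import quote_plus

# Standard InChIKey layout: 14-letter skeleton block, 10-letter block, 1 letter.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def skeleton_block(inchikey):
    """Return the first (connectivity) block of a standard InChIKey."""
    key = inchikey.strip().upper()
    if not INCHIKEY_PATTERN.match(key):
        raise ValueError(f"not a standard InChIKey: {inchikey!r}")
    return key.split("-")[0]


def google_query_url(inchikey):
    """Build a Google search URL for the skeleton block only."""
    return "https://www.google.com/search?q=" + quote_plus(skeleton_block(inchikey))


# Illustration only: the InChIKey of aspirin.
ASPIRIN_KEY = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
print(skeleton_block(ASPIRIN_KEY))
print(google_query_url(ASPIRIN_KEY))
```

Searching on the skeleton block alone retrieves records that share connectivity but may differ in stereochemistry, isotopes or charge, so the hits still need to be checked against the full key.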
Batch search in PubChem
For batch searches the PubChem query upload interface is shown below.
Can chemical structures be identified in this document?
How many can be extracted?
Which ones have database entries?
Which database entries have links to this document?
Where in the document are the structures specified?
Can SAR data be linked to structures in the document?
What other documents include this structure?
Which database records for this structure have links to other documents?
What additional connections can be made using similarity searches?
The question “do any of these have links to this document?” was answered for individual records by inspecting sources, for example, CID 57499553 (at the top of Figure 5) has two. The chemicalize.org entry (SID 137228062) links to the most recent URLs extracted by users for US20120040982. The SureChemOpen entry (SID 152667195) links to the same document but with 17 additional members of the patent family. The question as to “where in any document an individual database structure is located?” is easiest to answer in reverse by establishing the PubChem match for a chemicalize.org conversion at a certain position. For a patent there are two alternatives, following the chemicalize.org ribbon display in sequence in the document or searching the structure in SureChemOpen (directly on the SureChem website, or via the PubChem link). In this case, SID 152667195 was located to example 496, but note that structures can be specified multiple times in a patent.
The InChIKey was Google-negative but matched CID 57498937 by a direct search in PubChem. This had been submitted by both SureChem (SID 152666516) and chemicalize.org (SID 137227422). The result thus provided an answer to “what other documents include this structure?” as negative. We then addressed the related question “which (other) database records have links to documents?” However, the difficulty associated with this is the substantial amount of common chemistry typically extracted along with the examples. The solution in this case was to prepare a PDF containing only the 38 IUPACs specifically claimed on page 63 of the original document. This was extracted and the results uploaded to PubChem, thereby excluding reagents and prior-art descriptions. This gave 34 CID matches, 32 of which had both SureChem and chemicalize.org as sources. In addition, eight of these had ChEMBL as a third source, thus providing links to journal articles and assay data. For example, CID 24750280, the eighth structure in the claim list, had a published IC50 displayed in the PubChem record for the inhibition of DPPIV in Caco-2 cells of 88 nM derived from PMID 18052023.
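The direct PubChem lookups in this work were done through the web interface. As an alternative for readers who prefer scripting, the same InChIKey-to-CID question can be posed to the PubChem PUG REST service. The sketch below only constructs the request URLs and does not contact PubChem; the function names are hypothetical and the aspirin key is an illustration rather than a compound from the patent.

```python
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def inchikey_to_cid_url(inchikey):
    """PUG REST URL that returns the CID(s) matching a full InChIKey as plain text."""
    key = inchikey.strip().upper()
    return f"{PUG_REST_BASE}/compound/inchikey/{quote(key)}/cids/TXT"


def batch_urls(inchikeys):
    """De-duplicate keys (case-insensitively, keeping input order) and build one URL each."""
    unique_keys = dict.fromkeys(k.strip().upper() for k in inchikeys if k.strip())
    return [inchikey_to_cid_url(k) for k in unique_keys]


# Illustration only: the same key supplied twice in different case.
example_keys = ["BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "bsynrymutxbxsq-uhfffaoysa-n"]
urls = batch_urls(example_keys)
for url in urls:
    print(url)
```

Each URL can then be fetched with any HTTP client; when looping over a large extraction it is advisable to pace the requests in line with PubChem's usage policy.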
Table: Extraction, upload and PubChem match statistics for US20120040982 (table body not recovered; surviving labels: “CZ in PC”, “Main examples (PDF)”)
In a typical medicinal chemistry patent it would be difficult to determine the potentially extractable total (row 1, column 1) because of extensive redundancy in the form of repeated exemplifications, reagents, intermediates and Markush components. Notwithstanding, across the first row, columns 1 and 2 show 96% of the uploaded SMILES were verified by the PubChem structure checker. From columns 3 and 5, we can establish that 96% of the extracted structures were already in PubChem. The high SID:CID ratio indicates that these included a substantial amount of common chemistry or known drug structures (i.e. each structure having on average 63 submitters), most of which were already represented in the 1252 chemicalize.org entries. Inspection of the 52 structures unique to this source (column 8) shows lead-like structures that were absent from PubChem. Row 2 provides a direct in-versus-out assessment from the IUPAC names (497) and the conversions by PubChem (468), with a 94% yield. In this row the SID:CID ratio drops to 2.1 because most of these lead-like examples were novel structures from (on average) only two sources, SureChemOpen and chemicalize.org. The claimed compounds (row 3) have an 89% conversion. The 30 matching structures in PubChem have more sources because some of them have been extracted from published papers (i.e. the SIDs include five ChEMBL entries).
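The in-versus-out yield quoted for row 2 is a simple ratio and can be checked with a short calculation. The sketch below uses only the two counts stated in the text (497 IUPAC names and 468 conversions); the function name `percent` is a hypothetical helper.

```python
def percent(part, whole):
    """Return part/whole as a percentage; whole must be positive."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


# Counts taken from the text for row 2 (main examples).
names_submitted = 497
structures_converted = 468

yield_pct = percent(structures_converted, names_submitted)
print(f"conversion yield: {yield_pct:.1f}%")
```

The same helper can be applied to any other name-count and match-count pair, for example when comparing extraction passes before and after cleaning the input text.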
The exploration of chemistry in a patent document can have different domain-specific utilities but the basic outcomes can be reviewed. Chemicalize.org has converted the majority of example structures that circumscribe the subject matter and enable further analysis. The observation that most of these were already in PubChem is likely to be the default case for patents from the major authorities, unless they have been published since the latest SureChemOpen deposition. The extensive matches to chemicalize.org confirmed, via the links, that a user had already converted this document (i.e. one of the authors C.S.). The partitioning of novel structures from common chemistry was achieved by isolating the example sections. The SAR-to-structure mapping was discerned by two routes. The first was intra-document via the data in the patent (which could easily be extended to complete the whole table). The second was inter-document connections to published results for some of the same compounds in medicinal chemistry journals (via ChEMBL). The option of sectioning the document provides flexibility, for example in being able to separate structures extractable from the introduction, description, synthetic schema or claims.
Extraction from papers
The answer to the question “which extractions have database entries?” was all of them (i.e. 52 matches in PubChem). The common chemistry was partitioned by making a PDF of just the 12 key IUPAC names. The downloaded structures could then be aligned with the IC50, clearance and P450 metabolism results (Table 1 in PMC3305890), thus answering the question “which data can be connected to which structures in the document?”. The structure search in PubChem included 12 CID matches, all of which had chemicalize.org as a source (here again, this was because the paper had been extracted during this work). Unusually, 11 were unique to chemicalize.org, since only one, CID 10383508 (4a from Figure 7), had other source links. These thus answered the question “which database records for this structure have links to other documents?”. The SureChemOpen entry, via a same-connectivity link (SID 157613372), established a link to example 61 in a Kenkyusho patent, WO2004067509. The ChEMBL entry (SID 103476839) was linked to a publication (PMID 16392822) where the IC50 inhibitory activity of this structure against DPPIV was 49 nM. We also used CID 10383508 to answer the question “what additional connections can be made for a structure using similarity searches?” in two ways. The first was to launch “ChemSearch” from the chemicalize.org entry, which records 1934 structures out to a Tanimoto similarity score of 0.5. The second was launching the equivalent search within PubChem, which recorded 632 matches with a similarity score of 0.85.
Outcomes for this paper extraction included extensive PubChem matches. In addition, all the key compounds were aligned against the SAR table and links to additional relevant documents established. The triaging of similarity results and their connectivity cannot be expanded on here but note that the 2D and 3D neighbor spaces in PubChem can be both explored and graphically clustered for any CID. This extraction also provides a remediation example. The initial conversion pass for the synthetic intermediate in section 4.1.2 from Figure 7 in the paper failed. Copying this out of the manuscript web page as a text string and pasting it into chemicalize.org corrected this to (2S)-1-(2-chloroacetyl) pyrrolidine-2-carbonitrile. This was then InChIKey matched via Google to PubChem CID 11073883, which had 24 sources.
The answer to the question “how many have database matches” is 463 but, because the literature-linked filter only recorded 374, the chemicalize.org results provide 89 new abstract-to-database links and 154 structures that are absent in PubChem. Inspection of the ribbon display shows additional utility where, as expected from the query, DPPIV drugs are frequently mentioned. Not only can these be explicitly counted but they can be “stepped through” to locate each occurrence in a specific abstract (e.g. each of the 42 mentions of sitagliptin). However, it should also be noted that Figure 8 includes a false-positive structure. DPP was recognized as a synonym for di-n-pentylphthalate because of the spacing in the abstract text (i.e. DPP IV vs. DPPIV).
This analysis indicates useful complementarity with PubMed where chemicalize.org can recognize (and count) structures in abstract sets that are either missing from PubChem or have no direct PubMed-PubChem links. An example is the MeSH supplementary concept IUPAC from PMID 20128619 in the abstract set. This novel DPPIV inhibitor has now become CID 60206521 via the chemicalize.org-only submission (i.e. added to PubChem as a consequence of user extraction and includes a link to the abstract). The simple triage used above (PubMed > result file > chemicalize.org) can be applied to any slice of the 22 million MEDLINE abstracts.
Details of these intersects and differences need not be expanded here, but the combination of chemicalize.org and Venny allows these to be followed up. Each subset can be isolated and checked by PubChem searching and/or re-extraction. Not only can any combination of extractions or database lists be used (e.g. SMILES, InChI or even standardized IUPAC names), but Venny can also merge and de-duplicate concatenated sets of chemicalize.org outputs (e.g. in this case the four sets added up to 2,049 unique SMILES).
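The set operations performed here with Venny can also be reproduced with built-in Python sets, which may be convenient for scripted workflows. The sketch below is an illustration under stated assumptions: the three small SMILES lists are invented toy data, the function names are hypothetical, and the comparison is by exact string identity, so it is only meaningful when all lists were written by the same tool with the same canonicalization.

```python
def load_smiles(lines):
    """Take the first whitespace-separated token of each non-empty line as a SMILES."""
    return {line.split()[0] for line in lines if line.strip()}


def compounds_in_common(*sets):
    """Structures present in every extraction."""
    return set.intersection(*sets) if sets else set()


def merged_unique(*sets):
    """Concatenate and de-duplicate all extractions."""
    return set().union(*sets)


# Toy data standing in for three downloaded SMILES files (not from this work).
patent = load_smiles(["CCO ethanol", "c1ccccc1 benzene", "CC(=O)O acetic acid"])
paper = load_smiles(["CCO", "CC(=O)O", "CN"])
abstracts = load_smiles(["CCO", "O=C=O", ""])

common = compounds_in_common(patent, paper, abstracts)
merged = merged_unique(patent, paper, abstracts)
patent_only = patent - paper - abstracts

print(sorted(common), len(merged), sorted(patent_only))
```

Using InChIKeys instead of SMILES as the set members would make the comparison independent of how each SMILES string happened to be written.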
The ability of chemicalize.org to recognize all structures specified in a section of text, as well as the potential addition of false-positives, is subject to the constitutive limitations of CNER. While some document-specific failure rates are shown in Table 1, it should also be noted that IUPAC names not correctly extracted on a first-pass can potentially be remediated by simple fixes. One of these, pasting minimally formatted text blocks into “clean” PDFs, has already been mentioned. Another is that common IUPAC name conversion errors (e.g. 1 vs. L, spaces, line breaks, missing brackets or author errors in the primary text) can, in many cases, be iteratively corrected using the front page text input box. False-positives are mainly derived from split IUPAC names, homonym clashes (e.g. some gene symbols being identical to chemical acronyms), unresolvable synonyms (e.g. the same names being used in the literature to refer to different structures) and contextual ambiguity (e.g. “x derivatives” being translated to a structure for “x”). While the chemicalize.org dictionary is subject to regular updates and expansions to reduce these limitations, it should be noted that, in practice, the low level of false-positives is easily spotted and would not usually confound further analysis.
The results above demonstrate the value of chemicalize.org to answer questions related to the chemical content of the key document types for biomedical research. It should also be pointed out that the strong growth in open-access journal content (as PDF and/or URLs) will expand chemicalize.org options. The approaches outlined are not only technically straightforward for those unfamiliar with cheminformatics but they can also be extended to any text source, including internal documents and chemical information on the web. The ability to make reciprocal document-to-document, document-to-database or document-to-web connectivity is of crucial importance. In addition, the indexing of PubChem and ChemSpider by Google has become complementary and transformative because over 50 million aggregated database entries (including the chemicalize.org archive) can now be checked for an InChIKey match. Similarity searches in databases extend this connectivity even further.
In regard to connectivity, the absolute correctness and completeness of any document extraction per se is less important than the ability to make joins using just a sample of the important extracted structures. For example, there is high value in establishing that patent A, journal article B, PubMed abstract C and database record D are, from the bioactivity standpoint, probably referring to the same canonical chemical entity. This means that the associated data can be merged between documents, regardless of salt form or isomer differences. In the case of boutique sources that either cross-reference thinly or not at all, chemicalize.org may be the only way to make such joins. There are also synergies with other open tools (in addition to Venny). These include OPSIN for IUPAC names, OSRA for chemical images and Utopia for bio-entity recognition. The performance and scope of chemicalize.org is being continually developed, including implementation for mobile phones. It is therefore destined to make an increasing impact in medicinal chemistry, chemical biology and other bioactive chemistry domains.
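Some of the formatting fixes described above (line breaks and stray spaces) can be applied automatically before a text block is pasted into the conversion box. The following is a minimal sketch, not part of chemicalize.org; the function name `clean_text_block` is hypothetical and the rules are deliberately conservative. It keeps the hyphen when a line breaks after one, on the assumption that hyphens inside IUPAC names are meaningful.

```python
import re


def clean_text_block(text):
    """Remove line breaks and redundant whitespace from a copied text block.

    Rules (assumptions of this sketch):
    - a line break directly after a hyphen is dropped and the hyphen kept,
      so a name split after a locant hyphen is rejoined;
    - every other line break becomes a single space;
    - runs of spaces or tabs collapse to one space.
    """
    text = text.replace("\r\n", "\n")
    text = re.sub(r"-\n\s*", "-", text)
    text = re.sub(r"\s*\n\s*", " ", text)
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


raw = "(2S)-1-(2-chloroacetyl)pyrrolidine-2-\n  carbonitrile  was\nprepared"
cleaned = clean_text_block(raw)
print(cleaned)
```

Because the hyphen is always kept, ordinary prose words hyphenated across a line break will retain a spurious hyphen; for name conversion this is usually the safer error, but the output should still be inspected before drawing conclusions from a failed conversion.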
Availability and requirements
The application is publicly available for on-line use at chemicalize.org and can be accessed with any modern web browser. There is also a free Android mobile app. The core underlying functionality is available as ChemAxon's “Naming” commercial product suite that includes the chemical name conversion and mining engine. Use of these products on chemicalize.org is covered by the Creative Commons BY-NC-SA 3.0 license.
We would like to acknowledge Daniel Bonniot de Ruisselet and Ferenc Csizmadia for their contributions to the project and thank additional ChemAxon colleagues for reviewing the draft manuscript. We also appreciated the referees' comments that enhanced the final submission.
- Vazquez M, Krallinger M, Leitner F, Valencia A: Text mining for drugs and chemical compounds: methods tools and applications. Mol. Inf. 2011, 30: 506-519. 10.1002/minf.201100005.
- Southan C, Boppana K, Jagarlapudi SAA, Muresan S: Analysis of in vitro bioactivity data extracted from drug discovery literature and patents: ranking 1654 human protein targets by assayed compounds and molecular scaffolds. Journal of cheminformatics. 2011, 3: 14.
- Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B, Overington JP: ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012, 40: D1100-D1107. 10.1093/nar/gkr777.
- SureChemOpen. https://open.surechem.com/login
- CAS registry. http://www.cas.org/content/chemical-substances
- Li Q, Cheng T, Wang Y, Bryant SH: PubChem as a public resource for drug discovery. Drug Discov Today. 2010, 15: 1052-1057. 10.1016/j.drudis.2010.10.003.
- FPO IP research & communities. http://www.freepatentsonline.com/
- Chemicalize.org by ChemAxon. http://www.chemicalize.org/
- Swain M: Chemicalize.org. J Chem Inf Model. 2012, 52: 613-615. 10.1021/ci300046g.
- European patent office - espacenet. http://worldwide.espacenet.com/advancedSearch?locale=en_EP
- Chemicalize extension. http://www.macinchem.org/extensions/chemicalize.php
- Southan C: InChI in the wild: an assessment of InChIKey searching in google. Journal of cheminformatics. 2013, 5: 10. 10.1186/1758-2946-5-10.
- Chemicalize.org from ChemAxon in PubChem. http://cdsouthan.blogspot.se/2013/01/chemicalizeorg-from-chemaxon-in-pubchem.html
- PubChem structure search. http://pubchem.ncbi.nlm.nih.gov/search/#
- Gupta R, Walunj SS, Tokala RK, Parsa KV, Singh SKK, Pal M: Emerging drug candidates of dipeptidyl peptidase IV (DPP IV) inhibitor class for the treatment of type 2 diabetes. Curr Drug Targets. 2009, 10: 71-87. 10.2174/138945009787122860.
- Himmelsbach F, Mark M, Eckhardt M, Langkopf E, Maier R, Lotz R: Xanthine derivatives, the preparation thereof and their use as pharmaceutical compositions. 2012, US20120040982 (patent)
- Eckhardt M, Langkopf E, Mark M, Tadayyon M, Thomas L, Nar H, Pfrengle W, Guth B, Lotz R, Sieger P, Fuchs H, Himmelsbach F: 8-(3-(R)-aminopiperidin-1-yl)-7-but-2-ynyl-3-methyl-1-(4-methyl-quinazolin-2-ylmethyl)-3,7-dihydropurine-2,6-dione (BI 1356), a highly potent, selective, long-acting, and orally bioavailable DPP-4 inhibitor for the treatment of type 2 diabetes. J Med Chem. 2007, 50: 6450-6453. 10.1021/jm701280z.
- Kato N, Oka M, Murase T, Yoshida M, Sakairi M, Yakufu M, Yamashita S, Yasuda Y, Yoshikawa A, Hayashi Y, Shirai M, Mizuno Y, Takeuchi M, Makino M, Takeda M, Kakigami T: Synthesis and pharmacological characterization of potent, selective, and orally bioavailable isoindoline class dipeptidyl peptidase IV inhibitors. Organic and medicinal chemistry letters. 2011, 1: 7. 10.1186/2191-2858-1-7.
- Kakigami T, Oka M, Katoh N: Compound inhibiting dipeptidyl peptidase IV. 2000, WO2004067509 (patent)
- Tsu H, Chen X, Chen C-T, Lee S-J, Chang C-N, Kao K-H, Coumar MS, Yeh Y-T, Chien C-H, Wang H-S, Lin K-T, Chang Y-Y, Wu S-H, Chen Y-S, Lu I-L, Wu S-Y, Tsai T-Y, Chen W-C, Hsieh H-P, Chao Y-S, Jiaang W-T: 2-[3-[2-[(2S)-2-Cyano-1-pyrrolidinyl]-2-oxoethylamino]-3-methyl-1-oxobutyl]-1,2,3,4-tetrahydroisoquinoline: a potent, selective, and orally bioavailable dipeptide-derived inhibitor of dipeptidyl peptidase IV. J Med Chem. 2006, 49: 373-380. 10.1021/jm0507781.
- Xu F, Corley E, Zacuto M, Conlon DA, Pipik B, Humphrey G, Murry J, Tschaen D: Asymmetric synthesis of a potent, aminopiperidine-fused imidazopyridine dipeptidyl peptidase IV inhibitor. J Org Chem. 2010, 75: 1343-1353. 10.1021/jo902573q.
- Oliveros JC: VENNY. An interactive tool for comparing lists with venn diagrams. 2007, http://bioinfogp.cnb.csic.es/tools/venny/index.html
- Lowe DM, Corbett PT, Murray-Rust P, Glen RC: Chemical name to structure: OPSIN, an open source solution. J Chem Inf Model. 2011, 51: 739-753. 10.1021/ci100384d.
- Filippov IV, Nicklaus MC: Optical structure recognition software to recover chemical information: OSRA, an open source solution. J Chem Inf Model. 2009, 49: 740-743. 10.1021/ci800067r.
- Pettifer S, Velterop J, Attwood TK, Harland L, Marsh J, Thorne D, Tunbridge A: Reuniting data and narrative in scientific articles. Insights: the UKSG journal. 2012, 25: 288-293. 10.1629/2048-7718.104.22.1688.
- Chemicalize.org android app. https://play.google.com/store/apps/developer?id=ChemAxon
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.