ChemEngine: harvesting 3D chemical structures of supplementary data from PDF files
© The Author(s) 2016
Received: 2 March 2016
Accepted: 18 October 2016
Published: 29 December 2016
Keywords: Chemoinformatics, Supplementary data, Organic reaction modeling, Density functional theory, Text mining, Data mining
Harvesting chemical data from the web is a challenging task that requires several convoluted steps. When chemical structures are stored in a truly computable format with atoms and bond matrices (vector format, Cartesian co-ordinates), they can be processed electronically for computational and informatics purposes. However, when the files are transformed into or stored as PDF (Printable/Portable Document/Data Format), which is typically used for the convenience of printing and reading, the valuable and re-usable molecular data is effectively lost: it is buried in the scientific literature as documents and seldom used for further computational studies. In earlier days, hand-drawn molecules in ORTEP diagram format were published in research articles to discuss the 3D conformation of molecules. Generating 3D structures from these raster-format molecular images was extremely difficult. Recently, some efforts have been made to transform computer-generated and hand-drawn chemical images from journal articles and patent documents into truly computable molecules for inventory and database applications. Similar endeavors include transforming textual chemical names (common names, systematic names and corporate identifiers, for example CAS Registry numbers) or computer-generated names into the corresponding molecular structures, with moderate success. Although name-to-structure conversion programs are now routinely used for harvesting chemical data from documents, they have been insufficient for generating accurate, truly computable and re-usable molecular data. Supporting information for research articles based on computational methods, such as those describing the transition states of organic reactions, is now available from journal publishers’ websites. It contains a description of the computations performed, with tables of results and images of molecules in 3D conformations along with 3D molecular co-ordinates, all in PDF format.
Combining all of this data in a single file complicates the harvesting process and the development of pattern recognition techniques that selectively exclude non-atomic co-ordinate information from the large pool of textual data presented as supporting material. Since there are no defined rules or guidelines for submitting molecular data in a supporting document associated with a research publication, authors are free to choose their preferred way of representing molecular data, such as chemical structures and the corresponding atomic co-ordinates, in the supplementary data file. This freedom in choosing data formats necessitates the development of several pattern recognition templates, in the form of regular expressions, to handle the diverse formats (co-ordinates separated by space, comma, tab etc.) and to preserve the order in which the authors present the XYZ co-ordinates and atom information. This study therefore highlights the need for standards for submitting supporting materials containing molecular data, in a consistent, truly computable and re-usable format, to journals that publish computational research. A specific set of publisher-defined guidelines for submitting molecular data, even in PDF format, would accelerate the automatic processing and recognition of chemical data for further computational studies related to reaction modeling [1–3], drug discovery [4–7] and molecular inventory management [8, 9]. Several standard molecular representations in ASCII format, which are easily readable by molecular modeling and chemoinformatics software packages, are available. Supporting materials are deposited in PDF format for the convenience of storage, easy manageability and electronic dissemination. Commercial software packages for computational chemistry employ their own legacy file formats for handling molecular data, the technical details of which are usually not published.
From the researchers’ point of view, data published in re-usable formats would save the time and effort needed to understand the molecular data and to use it for further advanced studies in different problem-solving environments that require the 3D conformation of molecules. Exchange of chemical data between multiple software packages without loss of information is a critical requirement in computational chemistry and chemoinformatics applications. Thus there is a need for tools that can translate molecular data automatically and accurately, without manual intervention, from PDF format to a truly computable, re-usable format.
In this context, it is pertinent to mention the efforts by Rzepa and Murray-Rust to develop tools for parsing chemistry theses and other published articles in order to harvest analytical data [10, 11]. Special emphasis was laid on the use of Information Technology (IT) techniques for free re-distribution of electronic chemical data, for instance storing the actual supplementary information in structured XML/CML documents for universal applicability and dissemination of valuable experimental and computed data, thus advancing “data led science” as is the case in biology. The Blue Obelisk informal group initiative encourages the use of open source data, open standards, shared algorithms and tools for performing chemoinformatics tasks. It has led to the development of valuable tools such as JChemPaint, CDK and chemical information systems. Similar efforts have been made by the Cambridge Crystallographic Data Centre (CCDC), which provides easily downloadable crystal structures of organic molecules that are compatible with a number of software solutions for drug discovery. A recent article emphasized the importance of curating large chemogenomics data sets for building better predictive models for the life sciences. During the preparation of this manuscript, a timely research article by Rzepa’s group on a granularity model for extracting molecular information appeared; it stresses the need for periodic and automatic curation of data from the supplementary information of research articles. The present work is geared towards partial fulfillment of this need for “futuristic research data management”.
Conventionally, chemical names (common, systematic) and Chemical Abstracts Service (CAS) Registry numbers are extracted from web pages and transformed into the corresponding molecular structures using name-to-structure conversion tools, name-to-structure relational database look-up methods, large-scale key-value pair lists, distributed relational database searches etc. We have previously employed distributed systems to harvest chemical data from web pages using the Google API (ChemXtreme). Transforming raster images into vector graphics and then identifying the pixel information associated with the atoms and bonds of a molecule is a cumbersome job. Tools such as ChemRobot, OSRA, ChemReader and CLiDE have also been developed to harvest molecular data from web camera images and scanned images, wherein the raster graphics data is transformed into vector graphics to eventually retrieve the atom and bond information needed to generate truly computable and re-usable chemical structures, but only limited success has been achieved. A foolproof method that completely reproduces computable molecules from images is still a distant dream, as the existing methodologies and tools do not provide accurate molecular data after processing. It is therefore essential to develop efficient tools that can extract molecules from rich sources such as the supplementary information files deposited at journal sites. Although spectral, molecular and analytical data have been harvested in the past, extracting molecules directly from author-supplied atomic coordinates provided as supplementary material in PDF format has not been reported. Accordingly, in the present work we have developed an application, ChemEngine, that reads files stored in PDF format to extract molecular coordinates and generate computable molecular structures.
To demonstrate the efficiency of the program, supporting material files with three different molecular representations, differing in the delimiters used in the co-ordinate data, were selected and successfully parsed with ChemEngine to extract the molecular data. Note that the first two files, from ACS publications, did not require permission for data harvesting, while in the third case (RSC Advances) an article published under the CC-BY license was selected. Note also that automatic bulk processing of articles or supporting materials from publishers’ sites is usually prohibited by copyright and article access policies.
ChemEngine has been updated with a default option that accepts a PDF file containing molecular coordinates. Internally, the program distinguishes textual from non-textual data and uses a default pattern recognition method to separate the 3D coordinates from the non-molecular text, thereby identifying the atomic co-ordinates and atom information. The pseudo code with a generic regular expression for harvesting atomic coordinate data from the input file is shown below.
(Co-ordinate Text).matches (“Regular Expression Pattern with Delimiter Definition”);
For Example: Delimiter: Comma
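The generic pattern above can be made concrete for one delimiter at a time. The following Python sketch is illustrative only and is not ChemEngine's Java implementation: the helper names `coordinate_pattern` and `parse_line`, the delimiter keywords, and the assumption that each line holds an element symbol (or atomic number) followed by X, Y and Z are all hypothetical choices made for this example.

```python
import re

# Assumed delimiter definitions; real supplementary files may need others.
DELIMITERS = {
    "space": r"[ \t]+",
    "tab": r"\t+",
    "comma": r"[ \t]*,[ \t]*",
}

# Assumed atom field: an element symbol (e.g. "C", "Cl") or an atomic number.
ATOM = r"(?P<atom>[A-Z][a-z]?|\d{1,3})"
# Assumed number field: a signed decimal such as -1.402720.
NUMBER = r"[-+]?\d+\.\d+"


def coordinate_pattern(delimiter):
    """Compile a pattern for one 'atom X Y Z' line using the named delimiter."""
    sep = DELIMITERS[delimiter]
    return re.compile(
        r"^\s*" + ATOM
        + sep + r"(?P<x>" + NUMBER + r")"
        + sep + r"(?P<y>" + NUMBER + r")"
        + sep + r"(?P<z>" + NUMBER + r")\s*$"
    )


def parse_line(line, delimiter):
    """Return (atom, x, y, z) for a coordinate line, or None for other text."""
    match = coordinate_pattern(delimiter).match(line)
    if match is None:
        return None
    return (
        match.group("atom"),
        float(match.group("x")),
        float(match.group("y")),
        float(match.group("z")),
    )
```

Calling `parse_line` on every line of the extracted text keeps the coordinate rows and yields `None` for captions, energies and other non-coordinate text. A different column order (for example X, Y, Z before the atom label) would need its own pattern.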
Once the coordinate file is created, the bond matrix is computed to generate the atomic connectivity information needed to reconstruct the original molecules reported in the supplementary material of the research article. Important parameters such as bond angles, bond lengths and dihedral angles are checked for consistency in the recreated molecule, which is then saved in the original file format, for instance gjf. The coordinate data and bond matrix information are used to create molecules in standard interoperable formats such as .sdf or .mol, giving the user ready-to-compute molecules. This process avoids unnecessary regeneration of molecular data and laborious recomputation of already published work. The molecules can be subjected to further simulations such as descriptor calculation, energy profiling, docking etc. The Java-based ChemEngine program is freely available for non-commercial purposes through the SourceForge site for evaluation and testing.
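The consistency check on bond lengths, bond angles and dihedral angles can be illustrated with a short Python sketch. It shows only the standard vector formulas for these quantities; how ChemEngine performs the check internally is not described in the text, so the function names and the `tolerance` value below are assumptions made for illustration.

```python
import math


def _sub(a, b):
    return tuple(ai - bi for ai, bi in zip(a, b))


def _dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def bond_length(p, q):
    """Distance between two atoms, in the units of the input coordinates."""
    d = _sub(p, q)
    return math.sqrt(_dot(d, d))


def bond_angle(p, q, r):
    """Angle p-q-r in degrees, with q as the central atom."""
    u, v = _sub(p, q), _sub(r, q)
    cosine = _dot(u, v) / math.sqrt(_dot(u, u) * _dot(v, v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


def dihedral_angle(p0, p1, p2, p3):
    """Signed torsion p0-p1-p2-p3 in degrees, between -180 and 180."""
    b0, b1, b2 = _sub(p0, p1), _sub(p2, p1), _sub(p3, p2)
    scale = math.sqrt(_dot(b1, b1))
    axis = tuple(c / scale for c in b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, axis) * c for c in axis))
    w = _sub(b2, tuple(_dot(b2, axis) * c for c in axis))
    return math.degrees(math.atan2(_dot(_cross(axis, v), w), _dot(v, w)))


def distances_agree(original, recreated, tolerance=1e-3):
    """Compare all pairwise distances of two equally ordered coordinate lists."""
    if len(original) != len(recreated):
        return False
    n = len(original)
    return all(
        abs(bond_length(original[i], original[j])
            - bond_length(recreated[i], recreated[j])) <= tolerance
        for i in range(n)
        for j in range(i + 1, n)
    )
```

Comparing all pairwise distances is one simple way to confirm that a recreated molecule has the same internal geometry as the published one regardless of how it is positioned in space; the assumed tolerance should be chosen to match the number of decimals printed in the source file.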
Results and discussion
Case study 1
The supporting material file was related to a reaction modeling research paper describing the mechanistic investigation of epoxide formation from sulfur ylides and aldehydes. The work provided guidelines on the stereoselective synthesis of epoxide ring systems. The computational data included optimized geometries, calculated single point energies, rotational profiles and potential energy surface (PES) generation using the standard B3LYP-based DFT method. The PDF file was processed to directly extract a .txt file, from which patterns were discerned to generate the bond matrix data. For a complete list of the coordinate data of molecules generated by ChemEngine, please refer to Additional file 1. This file can be considered a standard template for submitting coordinate data of molecules for fast processing of PDF files in future.
An important constraint in generating ready-to-compute molecules was the absence of bond order information in the published coordinate data. Accordingly, functionality has been built into ChemEngine for creating a bond matrix, i.e. the inter-atomic connectivities of a given cluster of atoms, so that the program can recognize the cluster as a molecule. This enables construction of the connection tables of molecules and thus the direct conversion of a PDF file to SDF on the fly. The method accurately retained the original conformations of all the optimized molecules when the extracted atomic coordinates were supplied back to the original program (Additional file 2).
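A common way to derive connectivity from bare coordinates is a distance criterion based on covalent radii, and the result can be written out as a V2000 connection table. The Python sketch below uses that criterion as a stand-in; the text does not state which rule ChemEngine applies, so the radii, the `tolerance` of 0.4 Å and the function names are assumptions. Because bond orders are not available, every bond is written as a single bond, and partially formed bonds in transition states may fall on either side of the cutoff.

```python
import math

# Approximate covalent radii in angstrom (assumed values; extend as needed).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}


def bond_matrix(atoms, tolerance=0.4):
    """Return a symmetric 0/1 connectivity matrix for (symbol, x, y, z) atoms."""
    n = len(atoms)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            distance = math.dist(atoms[i][1:], atoms[j][1:])
            limit = (COVALENT_RADII[atoms[i][0]]
                     + COVALENT_RADII[atoms[j][0]] + tolerance)
            if distance <= limit:
                matrix[i][j] = matrix[j][i] = 1
    return matrix


def to_molfile(name, atoms, matrix):
    """Build a V2000 molfile block; every bond is written with order 1."""
    n = len(atoms)
    bonds = [(i + 1, j + 1)
             for i in range(n) for j in range(i + 1, n) if matrix[i][j]]
    lines = [name, "", ""]  # three header lines: title, program, comment
    lines.append(f"{n:3d}{len(bonds):3d}  0  0  0  0  0  0  0  0999 V2000")
    for symbol, x, y, z in atoms:
        lines.append(
            f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3}"
            " 0  0  0  0  0  0  0  0  0  0  0  0"
        )
    for i, j in bonds:
        lines.append(f"{i:3d}{j:3d}  1  0  0  0  0")
    lines.append("M  END")
    return "\n".join(lines) + "\n"
```

Appending a line containing `$$$$` after each molfile block turns a sequence of such blocks into a multi-record SDF file.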
Case study 2
The work pertains to a well-cited paper in which computational studies were performed on a range of alkenes to gain insight into the mechanistic processes involved in thiol-ene reactions, typically classified under click chemistry. In the previous case study the approach was straightforward: an open source PDF reader could be employed to convert the PDF supporting information to text. In the present case, by contrast, the PDF file was first saved in plain text format externally and then submitted to ChemEngine for extracting the coordinates. The inadvertent errors in file conversion could be related to compatibility issues associated with the various PDF maker programs available on the web.
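The text does not say which conversion errors were encountered. One plausible class, assumed here purely for illustration, is typographic characters that some PDF producers substitute for plain ASCII, such as a Unicode minus sign in front of a negative coordinate or a non-breaking space between columns. A Python sketch of a clean-up step that could run on the saved plain text before pattern matching (the function name and the replacement table are hypothetical):

```python
import unicodedata

# Assumed substitutions: dash-like characters that can stand in for "-".
DASH_LIKE = {
    "\u2212": "-",  # minus sign
    "\u2013": "-",  # en dash
    "\u2010": "-",  # hyphen
}


def normalize_extracted_text(text):
    """Map typographic characters to ASCII and trim trailing whitespace."""
    # NFKC folds compatibility characters, e.g. a non-breaking space
    # becomes an ordinary space and ligatures become separate letters.
    text = unicodedata.normalize("NFKC", text)
    for original, replacement in DASH_LIKE.items():
        text = text.replace(original, replacement)
    return "\n".join(line.rstrip() for line in text.splitlines())
```

After this step a coordinate line that was typeset with a Unicode minus sign can be matched by the same regular expression as a line typed with an ASCII hyphen.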
Case study 3
Table: Details of the three case studies representing the diversity of coordinate molecular data in supplementary material handled by ChemEngine
Columns: Case study; N (number of molecules); Regular expression pattern; Format and delimiter
Case study 1: Epoxide formation from sulfur ylides and aldehydes
Case study 2: Thiol ene click chemistry
Case study 3: Design of tetra(arenediyl)bis(allyl) derivatives for Cope rearrangement transition states
Case study 4
To increase the scope of this work to handle several hundred PDF files and harvest the truly computable molecular data buried in them, we have implemented a default option in ChemEngine for harvesting atomic co-ordinate data mixed with images and other content (spectral data, barcode images, experimental data, molecular descriptions and other computed data). The option was successfully tested with several PDF files, regenerating the molecular files without any errors.
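Separating coordinate data from the surrounding material can be pictured as grouping consecutive coordinate lines into one molecule and discarding everything else. The Python sketch below is an illustration under stated assumptions, not the ChemEngine algorithm: it handles only whitespace-delimited lines that start with an element symbol, and the `min_atoms` threshold used to drop stray matches is a hypothetical parameter.

```python
import re

# Assumed line layout: element symbol followed by three decimal numbers.
COORD_LINE = re.compile(
    r"^\s*([A-Z][a-z]?)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s*$"
)


def split_into_molecules(text, min_atoms=2):
    """Group runs of consecutive coordinate lines into separate molecules."""
    molecules, current = [], []
    for line in text.splitlines():
        match = COORD_LINE.match(line)
        if match:
            symbol, x, y, z = match.groups()
            current.append((symbol, float(x), float(y), float(z)))
            continue
        # Any non-coordinate line (caption, energy, page footer) ends a block.
        if len(current) >= min_atoms:
            molecules.append(current)
        current = []
    if len(current) >= min_atoms:
        molecules.append(current)
    return molecules
```

Each returned block is a list of `(symbol, x, y, z)` tuples that can be passed on to connectivity perception and file writing. A page break that falls in the middle of a molecule would split it in two with this simple rule, so a production tool would need extra handling for that case.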
In the present work we process the molecules and transform them into SDF format, which is compatible with most commercial packages, thus saving time and computational effort. The compute-once, use-many-times approach will help readers to access the original input files even after the passage of time. It is pertinent to mention here that the biological sciences and bioinformatics community follows a standard representation of molecular coordinates, the PDB file format, which is a database-compliant format rather than a PDF format, thus securing easy access and exchange of information. Extracting the coordinates of a protein molecule from a PDF file, assuming an average protein size of over 200,000 atoms, would indeed have been a truly challenging task. However, with ChemEngine customized with additional atomic co-ordinate pattern recognition modules, it is now possible to harvest any molecular data from PDF format. With the advent of 3D structure repositories and several free academic sites, data storage is no longer a major issue, and ready-to-compute molecules can be deposited and maintained to avoid duplication of computational effort. Until such a global archival norm is achieved, it is suggested that the chemical community maintain a standard and consistent representation of chemical structure data in electronic supplementary files, in native or standard data formats, to facilitate re-use within the scientific community.
Supplementary information of the primary literature deposited with journals is a rich reservoir of peer-reviewed molecular data, which will be more valuable if it is available for reuse. The application presented here, ChemEngine, selectively extracts 3D structures from coordinate information that is mixed with inadvertently introduced noisy data in PDF files. This approach can, to some extent, prevent the loss of chemical data while conserving the memory and storage space required at the journal site. The methodology exemplified here will enable molecule mining in a semantic context and ensure maximum reuse of the valuable data by interested readers, thereby enhancing the citations of the authors. Further, the application can be seamlessly integrated into an automated workflow for high-throughput molecular computing.
PDF: printable document format
ORTEP: Oak Ridge thermal ellipsoid plot
SDF: structure data format
QMDFF: quantum mechanically derived force field
MK conceived the idea and developed the software, RV validated the methodology, tested the application and prepared the manuscript. Both authors read and approved the final manuscript.
MK thanks the Director, CSIR NCL, for providing infrastructure and support. The financial support received from GENESIS (BSC0121) and INSPIRE (CSC0107) under 12FYP projects is duly acknowledged. RV thanks DST, New Delhi for the award of a fellowship. We also thank the reviewers for valuable comments that helped improve the tool with default options, increasing the scope of ChemEngine to handle several hundred PDF files and generate molecular structures with high accuracy.
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Karthikeyan M, Vyas R (2015) Role of open source tools and resources in virtual screening for drug discovery. Comb Chem High Throughput Screen 18(6):528–543
- Blurock E (1995) Reaction: system for modeling chemical reactions. J Chem Inf Model 35(3):607–616
- Dolata D, Spina D, Stahl M (1996) Conformational searching and modeling of transition states. J Chem Inf Model 36(2):228–230
- Aziz H, Gao J, Maropoulos P, Cheung W (2005) Open standard, open source and peer-to-peer tools and methods for collaborative product development. Comput Ind 56(3):260–271
- Karthikeyan M, Vyas R (2015) Role of open source tools and resources in virtual screening for drug discovery. Comb Chem High Throughput Screen 18(6):528–543
- Gilbert I (2013) Drug discovery for neglected diseases: molecular target-based and phenotypic approaches. J Med Chem 56(20):7719–7726
- Ryall K, Tan A (2015) Systems biology approaches for advancing the discovery of effective drug combinations. J Cheminform 7(1):7
- Postma G, van Bakel B, Kateman G (1996) Automatic extraction of analytical chemical information. System description, inventory of tasks and problems, and preliminary results. J Chem Inf Model 36(4):770–785
- Karthikeyan M, Bender A (2005) Encoding and decoding graphical chemical structures as two-dimensional (PDF417) barcodes. J Chem Inf Model 45(3):572–580
- Murray-Rust P, Mitchell J, Rzepa H (2005) Chemistry in bioinformatics. BMC Bioinform 6(1):141–144
- Murray-Rust P, Mitchell J, Rzepa H (2005) Communication and re-use of chemical information in bioscience. BMC Bioinform 6(1):180–195
- Guha R, Howard M, Hutchison G, Murray-Rust P, Rzepa H, Steinbeck C et al (2006) The blue obelisk interoperability in chemical informatics. J Chem Inf Model 46(3):991–998
- https://jchempaint.github.io/. Accessed 27 Sept 2016
- http://sourceforge.net/projects/cdk/. Accessed 27 Sept 2016
- Steinbeck C, Krause S, Kuhn S (2003) NMRShiftDB—constructing a free chemical information system with open-source components. J Chem Inf Model 43(6):1733–1739
- http://www.ccdc.cam.ac.uk/. Accessed 27 Sept 2016
- Fourches D, Muratov E, Tropsha A (2015) Curation of chemogenomics data. Nat Chem Biol 11(8):535
- Harvey MJ, Mason NJ, McLean A, Murray-Rust P, Rzepa HS, Stewart JJP (2015) Standards-based curation of a decade-old digital repository dataset of molecular information. J Cheminform 7:43
- http://opsin.ch.cam.ac.uk. Accessed 27 Sept 2016
- O’Donnell T (2009) Design and use of relational databases in chemistry. CRC Press, Boca Raton
- Richard A, Williams C (2002) Distributed structure-searchable toxicity (DSSTox) public database network: a proposal. Mutat Res 499(1):27–52
- Karthikeyan M, Krishnan S, Pandey A, Bender A (2006) Harvesting chemical information from the internet using a distributed approach: ChemXtreme. J Chem Inf Model 46(2):452–461
- http://www.chemaxon.com/. Accessed 27 Sept 2016
- Karthikeyan M (2011) Automatic harvesting of molecular information raster graphics. US Patent Appl 14/241285
- http://cactus.nci.nih.gov/osra/. Accessed 27 Sept 2016
- Gkoutos G, Rzepa H, Clark R, Adjei O, Johal H (2003) Chemical machine vision: automated extraction of chemical metadata from raster images. J Chem Inf Model 43(5):1342–1355
- Ibison P, Jacquot M, Kam F, Neville A, Simpson R, Tonnelier C et al (1993) Chemical literature data extraction: the CLiDE Project. J Chem Inf Model 33(3):338–344
- Feldman R, Sanger J (2007) The text mining handbook. Cambridge University Press, Cambridge
- Karthikeyan M, Pandit Y, Pandit D, Vyas R (2015) MegaMiner: a tool for lead identification through text mining using chemoinformatics tools and cloud computing environment. Comb Chem High Throughput Screen 18(6):591–603
- http://www.gaussian.com/. Accessed 27 Sept 2016
- Aggarwal V, Harvey J, Richardson J (2002) Unraveling the mechanism of epoxide formation from sulfur ylides and aldehydes. J Am Chem Soc 124(20):5747–5756
- Grimme S (2014) A general quantum mechanically derived force field (QMDFF) for molecules and condensed phase simulations. J Chem Theory Comput 10(10):4497–4514
- Northrop B, Coffey R (2012) Thiol ene click chemistry: computational and kinetic analysis of the influence of alkene functionality. J Am Chem Soc 134(33):13804–13817
- Salvatella L (2015) Theoretical design of tetra(arenediyl)bis(allyl) derivatives as model compounds for Cope rearrangement transition states. RSC Adv 5(15):11494–11497
- https://sourceforge.net/projects/chemengine/files/?source=navbar. Accessed 27 Sept 2016