- Research article
- Open Access
PubChem atom environments
© Hähnke et al. 2015
Received: 29 January 2015
Accepted: 20 May 2015
Published: 19 August 2015
Atom environments and fragments find widespread use in chemical information and cheminformatics. They are the basis of prediction models, an integral part of similarity searching, and are employed in structure search techniques. Most of these methods were developed and evaluated on the relatively small sets of chemical structures available at the time. An analysis of fragment distributions representative of most known chemical structures was published in the 1970s using the Chemical Abstracts Service data system. More recently, advances in automated synthesis of chemicals allow millions of chemicals to be synthesized by a single organization. In addition, open chemical databases are readily available containing tens of millions of chemical structures from a multitude of data sources, including chemical vendors, patents, and the scientific literature, making it possible for scientists to readily access most known chemical structures. With this availability of information, one can now address interesting questions, such as: what chemical fragments are known today? How do these fragments compare to earlier studies? How unique are chemical fragments found in chemical structures?
For our analysis, after hydrogen suppression, atoms were characterized by atomic number, formal charge, implicit hydrogen count, explicit degree (number of neighbors), valence (bond order sum), and aromaticity. Bonds were differentiated as single, double, triple or aromatic bonds. Atom environments were created in a circular manner focused on a central atom with radii from 0 (atom types) up to 3 (representative of ECFP_6 fragments). In total, combining atom types and atom environments that include up to three spheres of nearest neighbors, our investigation identified 28,462,319 unique fragments in the 46 million structures found in the PubChem Compound database as of January 2013. We could identify several factors inflating the number of environments involving transition metals, with many seemingly due to erroneous interpretation of structures from patent data. Compared to fragmentation statistics published 40 years ago, the exponential growth in chemistry is mirrored in a nearly eightfold increase in the number of unique chemical fragments; however, this result is clearly an upper bound estimate as earlier studies employed structure sampling approaches and this study shows that a relatively high rate of atom fragments are found in only a single chemical structure (singletons). In addition, the percentage of singletons grows as the size of the chemical fragment is increased.
The observed growth of the numbers of unique fragments over time suggests that many chemically possible connections of atom types to larger fragments have yet to be explored by chemists. A dramatic drop in the relative rate of increase of atom environments from smaller to larger fragments shows that larger fragments mainly consist of diverse combinations of a limited subset of smaller fragments. This is further supported by the observed concomitant increase of singleton atom environments. Combined, these findings suggest that there is considerable opportunity for chemists to combine known fragments to novel chemical compounds. The comparison of PubChem to an older study of known chemical structures shows noticeable differences. The changes suggest advances in synthetic capabilities of chemists to combine atoms in new patterns. Log–log plots of fragment incidence show small numbers of fragments are found in many structures and that large numbers of fragments are found in very few structures, with nearly half being novel using the methods in this work. The relative decrease in the count of new fragments as a function of size further suggests considerable opportunity for more novel chemicals exists. Lastly, the differences in atom environment diversity between PubChem Substance and Compound showcase the effect of PubChem standardization protocols, but also indicate that a normalization procedure for atom types, functional groups, and tautomeric/resonance forms based on atom environments is possible. The complete sets of atom types and atom environments are supplied as supporting information.
The de facto standard for the representation of small molecules in chemical information and cheminformatics is the molecular graph, a mathematical construct providing the topological description of a chemical structure as a set of vertices (corresponding to atoms), and edges between those vertices (corresponding to bonds between atoms) [1, 2]. The molecular graph is deeply rooted in valence bond theory, where the structure diagram is (essentially) equivalent to the Lewis structure of a molecule [3, 4]. It helps provide the basis for several related chemical descriptions: systematic names [5–7], line notations [8–22], and connection table-based file formats [23–26]. The valence bond model description of a chemical structure has proven to be incredibly useful to chemists, even though it is simplistic compared to a full quantum mechanical description. Subgraphs, referred to as substructures or molecular fragments, are the key concept in a variety of standard methods for the assessment of chemical similarity [27–34], clustering [35–39], and structure searching [40–42]. For example, fragment-based approaches of atom-centered or variable topological characteristics are used to accelerate chemical structure searches in databases [43–45].
In the world of chemical information much has changed since the 1970s. For example, aided by computers and the internet, chemical information data exchange has become increasingly open. There are now chemical data repositories providing access to large quantities of aggregated chemical information without barriers or paywalls. An example of one of these repositories is PubChem.
PubChem is an open archive for chemical substances and their biological activities [51–54]. It consists of three distinct primary databases: Substance, Compound and BioAssay. Substance contains descriptions of chemical substances as provided by hundreds of contributors. BioAssay contains bioactivity information about chemical substances. Compound is derived from entries in Substance via automated protocols that generate a preferred chemical representation and identify equivalent chemicals between PubChem contributors. At the time of initially writing this manuscript (January 2013), PubChem contained more than 100 million substance and 46 million compound records. Given the size and breadth of contributing organizations (including many substance suppliers, patent databases, natural product collections, literature databases, etc.), PubChem might be considered to represent a rather large fraction of all known (small molecule, organic) chemistry.
Using PubChem chemical contents and the earlier analysis method used with CASRS, this study assesses the overall progress by chemists to access novel chemistry over the past 40 years as a function of new, unique chemical fragments. In addition, we present here detailed statistics about atom environments of different sizes in the PubChem Substance (non-standardized structures as provided by contributors) and PubChem Compound (standardized unique structures) databases.
Results and discussion
Terminology and approach
Unless stated otherwise, the following approach and definitions were used for the purpose of this study. Incidence refers to the absolute count or percentage of (substance or compound) records that contain a particular fragment. Occurrence refers to the absolute count or percentage of all fragments across all structures (substance or compound) considered. Atom environments are defined as circular atom-centered topological neighborhood fragments of varying ‘radii’ containing all bonds between included atoms and are constructed as detailed in the “Methods”, unless otherwise stated. The ‘radius’ (r) of an atom environment is the maximum allowed topological distance between the center atom and any atom in the original structure that is part of the atom environment, and is measured as the number of bonds along the shortest path. The analysis of PubChem Compound and Substance was performed with respect to atom environments from topological radii zero (i.e., atom type) up to three. As a means of comparison, atom environments of r = 3 are essentially equivalent to those generated by the popular ECFP_6 type extended connectivity fingerprints or Morgan fingerprints of r = 3. Atoms are characterized by atomic number, formal charge, implicit hydrogen atom count, explicit degree (number of explicitly connected atoms), and their valence (the sum of implicit hydrogen count and bond orders of incident bonds). Bonds are distinguished as single, double or triple covalent bonds. Both atoms and bonds were further characterized by their participation in aromatic systems. To ensure that these properties are set consistently, pre-processing is performed that converts all explicit hydrogen atoms to implicit hydrogen atom counts (ignoring isotopes and annotated stereochemistry) and that perceives aromaticity.
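The definition of a circular atom environment above can be illustrated with a minimal sketch. It assumes a hydrogen-suppressed molecular graph stored as a plain adjacency dictionary; the function name `atom_environment` and the toy molecule are hypothetical and are not taken from the study's implementation. A breadth-first search yields the shortest-path (topological) distance from the center atom, and every bond between two included atoms is retained.

```python
from collections import deque

def atom_environment(adjacency, center, radius):
    """Return (atoms, bonds) of the circular environment around `center`.

    adjacency maps atom index -> {neighbor index: bond order}.
    This layout is an assumption made for illustration only.
    """
    # Breadth-first search gives the shortest-path distance in bonds.
    distance = {center: 0}
    queue = deque([center])
    while queue:
        atom = queue.popleft()
        if distance[atom] == radius:
            continue  # do not expand beyond the requested radius
        for neighbor in adjacency[atom]:
            if neighbor not in distance:
                distance[neighbor] = distance[atom] + 1
                queue.append(neighbor)
    atoms = set(distance)
    # Keep every bond whose two end atoms both lie inside the environment.
    bonds = {
        (a, b, order)
        for a in atoms
        for b, order in adjacency[a].items()
        if b in atoms and a < b
    }
    return atoms, bonds

# Toy hydrogen-suppressed graph of propanoic acid: C0-C1-C2(=O3)-O4
propanoic_acid = {
    0: {1: 1},
    1: {0: 1, 2: 1},
    2: {1: 1, 3: 2, 4: 1},
    3: {2: 2},
    4: {2: 1},
}

carboxyl_env = atom_environment(propanoic_acid, center=2, radius=1)
```

With radius 0 the environment collapses to the center atom alone, which corresponds to the atom type.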
Atom environment frequencies are specified by incidence, being the absolute number or relative percentage of the 104,669,789 substance or 46,704,121 compound records considered in this study where a molecular fragment is present.
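The distinction between incidence and occurrence can be sketched as follows. This is an illustrative example only: the fragment identifiers are placeholder strings and the helper `incidence_and_occurrence` is a hypothetical name. Incidence counts each fragment at most once per record, whereas occurrence counts every appearance; a singleton is a fragment with an incidence of one.

```python
from collections import Counter

def incidence_and_occurrence(records):
    """records: iterable of fragment-identifier lists, one list per structure."""
    incidence = Counter()   # number of records that contain the fragment
    occurrence = Counter()  # number of times the fragment appears overall
    for fragments in records:
        occurrence.update(fragments)
        incidence.update(set(fragments))  # count once per record
    return incidence, occurrence

# Three toy records with placeholder fragment identifiers.
records = [["C", "C", "O"], ["C", "N"], ["O", "O", "O"]]
incidence, occurrence = incidence_and_occurrence(records)

# Singletons: fragments found in only a single record.
singletons = sorted(f for f, n in incidence.items() if n == 1)
```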
Known Chemistry (‘then and now’) Comparison
To contrast the current state of known chemistry (‘now’) with that from a little more than 40 years ago (‘then’), we generated atom environments of radius r = 1 for all structures in PubChem Compound. To achieve a direct comparison, we used the same atom and bond types as Adamson et al. when analyzing the Crowe et al. data. In these earlier fragmentation studies, atoms were distinguished by atomic number, and bonds were classified as single/double/triple and as chain/ring bonds, respectively, and by an aromatic-ring bond type. To distinguish this then-and-now comparison from the other results of our study, this particular size of atom environment generated with these particular atom types and bond types will be referred to as ‘augmented atoms’, as in the 1971 study. Based on this classification scheme, the 1971 study found a total of 2,331 unique augmented atoms in a collection of 28,799 molecules randomly sampled from the CASRS. To ensure comparability, we applied the same pre-filtering steps to structures as performed in the original study: entries with more than 100 atoms were omitted, and only structures in which every atom has between 1 and 4 explicit connections were allowed, yielding 46,605,207 allowed and 98,914 rejected compound records from the 46,704,121 compound records in PubChem. In addition, terminal atoms were allowed as center atoms of atom environments. Aromaticity was perceived using the OEChem C++ toolkit aromaticity model OEAroModelMDL, which allows only six-membered rings of carbon and nitrogen to be aromatic, provided they satisfy the ‘Hückel 4n + 2’ rule [58, 59] (i.e., atoms are sp2-hybridized). Even though no aromaticity definition was supplied in the 1971 study, due to its simplicity, it is our opinion that this model might be closest to the perception of aromaticity at that time.
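The pre-filtering criteria described above can be expressed as a small predicate. This is a sketch under stated assumptions: structures are hydrogen-suppressed adjacency dictionaries, and the name `passes_prefilter` and its parameters are hypothetical. Whether the original study counted atoms before or after hydrogen suppression is not specified here, so the atom count below is simply the number of graph nodes.

```python
def passes_prefilter(adjacency, max_atoms=100, min_degree=1, max_degree=4):
    """Return True if a structure satisfies the size and connectivity limits.

    adjacency maps atom index -> {neighbor index: bond order} (assumed layout).
    """
    if len(adjacency) > max_atoms:
        return False  # entries with more than 100 atoms are omitted
    # Every atom must have between 1 and 4 explicit connections.
    return all(
        min_degree <= len(neighbors) <= max_degree
        for neighbors in adjacency.values()
    )

ethane = {0: {1: 1}, 1: {0: 1}}
# Sulfur hexafluoride: the central atom has six explicit connections.
sf6 = {0: {i: 1 for i in range(1, 7)}, **{i: {0: 1} for i in range(1, 7)}}

# Consistency check of the record counts reported in the text.
allowed, rejected, total = 46_605_207, 98_914, 46_704_121
```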
Elemental analysis and comparison
PubChem compound (2013)
Comparing then (1970 Crowe et al. study) and now (2013 PubChem Compound), carbon is unchanged. It is the most abundant element found in chemical structures, accounting for 74% of all atoms and found in more than 99% of all structures. The story is different for oxygen. It accounts for a decreased percentage of all atoms (13.5% then and 11.3% now) but, interestingly, the fraction of structures containing oxygen has increased (82.6% then and 91.5% now). The change is even more dramatic for nitrogen. It accounts for a substantially larger percentage of all atoms (7.3% then and 10.2% now) and is present in substantially more chemical structures (64.2% then and 91.5% now). Combined, these three elements (C, N, O) account for nearly all atoms both then (94.8%) and now (95.7%). Other noteworthy changes include the increased presence in structures of the elements sulfur (19.9% then and 34.2% now) and fluorine (10.0% then and 18.1% now). The reported incidence of other elements was limited in the 1971 study, preventing a more complete analysis.
Beyond the top-10, this comparison shows that since 1970 the number of unique augmented atoms generated has increased by a factor of 7.9. (As stated later, this factor of 7.9 should be considered an upper bound estimation.) This increase in fragment diversity shows that, in the last 40 years, chemists have increased their ability considerably to generate unique combinations of elements and their binding patterns. In addition, and as reflected in the top-10, chemists have become much more adept at working with organic chemicals containing oxygen and nitrogen bonds to the point that many chemicals now contain these. (See also part of the “Atom environment rate of growth” below for discussion on over and under estimation of atom environments in this analysis.) For completeness, the full data of the elemental analysis of Compound is provided as supporting information in Additional file 1: Figure S1 and Table S1. The full list of augmented atoms used in this then and now comparison with their respective incidence is provided as supporting information in Additional file 2.
Analysis of PubChem Substance and Compound
Analysis of PubChem chemical substance descriptions (PubChem Substance: 104,669,789 records) and the unique set of chemical structures after PubChem normalization processing (PubChem Compound: 46,704,121 records) was performed. These two collections were examined according to the data preprocessing and analysis approach as described in the “Methods”. Atoms were characterized by atomic number, formal charge, implicit hydrogen count, explicit degree (number of neighbors), valence (bond order sum including implicit hydrogen atom counts), and aromaticity. Bonds were differentiated as single, double, triple, or aromatic bonds. Atom environments were generated for radii r = 0 (atom types), r = 1, r = 2, and r = 3, where the topological radius (r) is the maximum allowed topological distance between the center atom and any atom in the original structure that is part of the atom environment, as measured by the number of bonds along the shortest path.
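The six atom properties listed above can be collected into a single hashable descriptor, which is what makes counting unique atom types (r = 0) straightforward. The following is a minimal sketch; the `AtomType` tuple and `atom_type` helper are hypothetical names, and the treatment of aromatic bond orders, which depends on the toolkit and aromaticity model, is deliberately left out.

```python
from collections import namedtuple

AtomType = namedtuple(
    "AtomType",
    "atomic_number formal_charge implicit_h degree valence aromatic",
)

def atom_type(atomic_number, formal_charge, implicit_h, bond_orders, aromatic=False):
    """Build an r = 0 descriptor.

    bond_orders lists the orders of the bonds to explicitly connected atoms
    (hydrogens are assumed to be already folded into implicit_h).
    """
    degree = len(bond_orders)                # number of explicit neighbors
    valence = implicit_h + sum(bond_orders)  # bond order sum incl. implicit H
    return AtomType(atomic_number, formal_charge, implicit_h, degree, valence, aromatic)

methyl_carbon = atom_type(6, 0, 3, [1])            # CH3 attached to one heavy atom
carbonyl_oxygen = atom_type(8, 0, 0, [2])          # =O
ammonium_nitrogen = atom_type(7, 1, 0, [1, 1, 1, 1])  # quaternary N+

# Because the descriptor is hashable, unique atom types are just a set.
unique_types = {methyl_carbon, carbonyl_oxygen, ammonium_nitrogen, atom_type(6, 0, 3, [1])}
```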
Atom type (r = 0 atom environment) statistics
In the case of halogen atom types found in Substance, there are 51 for fluorine [16 (31.4%) singletons], 74 for chlorine [16 (21.6%) singletons], 123 for bromine [56 (45.5%) singletons] and 57 for iodine [17 (29.8%) singletons]. Exemplary atom types unique to Substance are shown in Fig. 8b. In Compound, the number of different halogen atom types is substantially lower, with 4 fluorine atom types [0 singletons], 32 chlorine atom types [7 (21.9%) singletons], 49 bromine atom types [12 (24.5%) singletons] and 21 iodine atom types [5 (23.8%) singletons].
Even though noble gases initially were thought to be chemically inert, today, a few noble gas compounds are known [61–67]. Nonetheless, the 206 noble gas atom types identified in Substance seem implausibly high. Exemplars of atom types per noble gas unique to Substance are shown in Fig. 8c.
Properties of organic atom type singletons
In total, 2,181 different “organic-only” atom types encountered in substances deposited by SCRIPDB do not pass the PubChem valence list. SCRIPDB provides chemical structures found in the complex work units (e.g., Figures) of USPTO (United States Patent Office) patent documents. The most frequent one, found in 21.7% of the 179,784 substances with “invalid organic-only” atom types (referred to henceforth as “invalid substances”), is uncharged, tetra-coordinated, tetra-valent nitrogen that gets an added implicit hydrogen atom during pre-processing. This configuration is not allowed in the PubChem valence list, but is salvageable by standardization protocols with a simple fix: the implicit hydrogen is removed during structure standardization, and the nitrogen atom gets assigned a positive charge. The next most frequent atom types are tri-coordinated and tri-valent oxygen (16.9% of invalid substances), tri-coordinated and penta-valent carbon (8.8% of invalid substances), and tetra-coordinated and penta-valent carbon (8.1% of invalid substances). These atom types lead to the rejection of a substance during standardization. The case of tetra-coordinated, tetra-valent and uncharged nitrogen (6.8% of invalid substances) is analogous to the most frequent atom type. Di-coordinated and di-valent hydrogen cases (4.7% of invalid substances) are not salvageable by structure standardization. The same goes for di-coordinated and tri-valent oxygen (3.9% of invalid substances) and di-coordinated and di-valent fluorine (3.7% of invalid substances). Penta-coordinated and penta-valent carbon (3.5% of invalid substances) causes the respective substances also to be rejected. Tri-coordinated, tetra-valent carbon with a positive charge (3.4% of invalid substances) does not exist in the PubChem valence list and is not salvaged by standardization. A total of 1,138 of the 2,181 (52.2%) rejected atom type cases identified in substances provided by SCRIPDB occur in only a single substance record.
There are 116 “invalid organic-only” atom types found in 131,014 substances provided by IBM. The chemical structures from IBM are pulled from patent and biomedical literature documents. Contrary to SCRIPDB, the number of affected substances is dominated by a single atom type: uncharged, tri-coordinated and penta-valent nitrogen, which is present in 79.1% of the affected substances. Inspection of the substances provided to PubChem revealed that this is due to the configuration of nitro groups as “*N(=O)(=O)”, a common representation approach but deemed invalid by PubChem which favors the “*[N+](=O)–[O–]” representation. This is remedied by PubChem standardization, where the configuration of the nitro group with penta-valent nitrogen is modified to be the PubChem-preferred representation. Five of the top-10 ranked “invalid organic-only” atom types are identical with those identified for SCRIPDB in Fig. 10. Substances containing tetra-coordinated and tetra-valent oxygen (0.5% of 131,014 substances) get rejected by standardization. Tetra-coordinated and penta-valent nitrogen is analogous to the most frequent atom type for IBM that is rejected by the PubChem valence list. During structure standardization, this can be resolved if the respective atom is double-bonded to an oxygen atom by modifying this atom type to a charge-separated representation of the double bond to form a tetra-coordinated and tetra-valent nitrogen that carries a positive charge. Tri-coordinated and tetra-valent oxygen (0.3% of 131,014 substances) is rejected by the PubChem valence list. Di-coordinated and tetra-valent nitrogen gets assigned an implicit hydrogen count of +1 during pre-processing (0.2% of 131,014 substances) that later is replaced by a positive charge during standardization. In total, 40 of the 116 (34.5%) invalid atom types identified in substances deposited by IBM occur in only a single substance.
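The nitro group case can be illustrated with a toy normalization step. This sketch is not the PubChem standardization protocol; it only mimics the described outcome on a simple, assumed data layout (a list of atom dictionaries and a bond dictionary keyed by atom-index pairs), and the function name `normalize_nitro` is hypothetical. It looks for an uncharged, tri-coordinated, penta-valent nitrogen with two terminal double-bonded oxygens and rewrites one N=O bond as a charge-separated N–O bond.

```python
def normalize_nitro(atoms, bonds):
    """Rewrite pentavalent *N(=O)(=O) as charge-separated *[N+](=O)[O-].

    atoms: list of {"element": str, "charge": int}
    bonds: dict mapping frozenset({i, j}) -> bond order
    Returns new (atoms, bonds); the inputs are left untouched.
    """
    atoms = [dict(a) for a in atoms]
    bonds = dict(bonds)
    for n_idx, atom in enumerate(atoms):
        if atom["element"] != "N" or atom["charge"] != 0:
            continue
        incident = [pair for pair in bonds if n_idx in pair]
        # Tri-coordinated and penta-valent (no implicit hydrogens assumed).
        if len(incident) != 3 or sum(bonds[p] for p in incident) != 5:
            continue
        oxo = []  # terminal, uncharged oxygens double-bonded to this nitrogen
        for pair in incident:
            (other,) = pair - {n_idx}
            is_terminal = sum(1 for p in bonds if other in p) == 1
            if (atoms[other]["element"] == "O" and atoms[other]["charge"] == 0
                    and bonds[pair] == 2 and is_terminal):
                oxo.append((other, pair))
        if len(oxo) == 2:
            # Pick the lower-indexed oxygen so the result is deterministic.
            o_idx, pair = min(oxo, key=lambda item: item[0])
            bonds[pair] = 1
            atoms[n_idx]["charge"] = 1
            atoms[o_idx]["charge"] = -1
    return atoms, bonds

# Nitromethane deposited as C-N(=O)(=O)
atoms = [{"element": "C", "charge": 0}, {"element": "N", "charge": 0},
         {"element": "O", "charge": 0}, {"element": "O", "charge": 0}]
bonds = {frozenset({0, 1}): 1, frozenset({1, 2}): 2, frozenset({1, 3}): 2}
fixed_atoms, fixed_bonds = normalize_nitro(atoms, bonds)
```

Note that the rewrite touches three atom types at once (the nitrogen and both oxygens change their r = 1 surroundings), which is why such fixes operate on the larger environment rather than on a single atom type.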
The 85,894 substances with “invalid organic-only” atom types provided by ChemSpider contain at least one of 282 offending atom types. The most frequent one, tri-coordinated tetra-valent nitrogen that is uncharged and has no implicit hydrogen atoms (50.1% of 85,894 substances), is an annotated nitrogen radical. It occurs 8,425 times in the context of nitro groups being represented as “*[N](=O)(–[O−])”, and 100 times in other contexts. This representation was previously not handled by the PubChem structure standardization protocols, and consequently, affected substances were rejected. The top-10 ranked invalid atom types identified in substances deposited by ChemSpider share three types with the top-10 ranked atom types from SCRIPDB and IBM, and three other atom types with the top-10 ranked atom types just from IBM. As a complementary invalid atom type, di-coordinated and di-valent oxygen carrying a negative charge occurs in 0.97% of 85,894 substances. These cases cannot be salvaged, as it is not clear whether connectivity information or formal charge should take precedence when modifying this atom type so that it passes the PubChem valence list. The hydrogen atom in tri-coordinated and tri-valent negatively charged sulfur is actually added during pre-processing; it is deposited as di-coordinated, di-valent and carrying a negative charge. Neither configuration is in the PubChem valence list or treated by the standardization protocols, so both are rejected during standardization. A consequence of the selection criteria for investigated atom types is that all bonds in tetra-coordinated hexa-valent carbon (0.81% of 85,894 substances) are considered to be covalent and this atom type does not pass the valence check during structure standardization. Of the invalid atom types encountered in substances deposited by ChemSpider, 81 of 282 (28.7%) occur in only one substance.
These top-10 cases of “invalid organic-only” atom types help to highlight several things. Firstly, Fig. 10 suggests that simple examination of atom types in a given molecule collection can be helpful to identify molecules that may be considered invalid as depicted. Secondly, as indicated by highlighted overlap in Fig. 10, it demonstrates that different organizations can share or differ in preference for particular “invalid” molecule representations, with each organization potentially providing previously unimagined atom configurations that may or may not be salvageable. Thirdly, some of these atom environments that are technically “invalid” from the PubChem perspective can be readily fixed/normalized to the PubChem preferred atom environment. For example, if one removes an implicit hydrogen atom and adds a positive charge, it would make the nitrogen atom in the Fig. 10a (i, v), b (ii, iii, x), and c (iv, v) cases “valid”. Indeed, PubChem standardization protocols can “correct” the nitrogen atom type as highlighted in these cases. It is worth mentioning, however, that such fixes consider the larger atom environment (r > 0) and often modify several atom types and respective bonding patterns between them, as opposed to a systemic fix considering only a single atom type (r = 0). (Fixes at the r = 0 atom environment level should be considered ill-advised, as they may complicate or prevent correction of a larger functional group representation.) Lastly, as suggested in Fig. 11, differences in opinion about preferred representation may affect a large number of structures, but the majority of those differ in only a small number of atom types (in this case, ten or fewer). The examination of configuration histograms per element between repositories can help to identify such cases and also suggest ways to improve consistency within a given chemical collection.
There are only 31 atom types in Compound that are not found in Substance. The top-10 most frequent examples are shown in Fig. 12b. The first case is a result of the phosphorus atom being part of a non-covalent interaction (complex bond), where the charge is set during structure standardization. Analogously, the configurations found in the examples ranked second, third, fourth, sixth, seventh, eighth and ninth occur when trying to find adequate representations for the complex bonds to which the respective atoms are incident. The full list of atom types with their respective incidence is provided for Substance and Compound as supporting information in Additional file 3.
Atom environment (r = 1) statistics
There are a number of reasons why, for a particular atom environment, the number of structures it is incident to varies between Substance and Compound. If Substance contains a number of duplicates of a structure, each duplicate contributes to the atom environment incidence; however, this redundancy is not present in Compound, as it contains only unique structures in terms of valence bond structure representations as normalized by PubChem. PubChem standardization can modify an atom environment, decreasing the number of times the pre-standardization environment is present in Compound, or even eliminating it in favor of an alternative, but equivalent, representation. If a Substance chemical structure contains invalid or erroneous atom environment configurations that cannot be salvaged by standardization, the corresponding substance structure is rejected and not present in Compound. Due to these effects, in general, the number of encountered atom environment variations and respective frequencies of occurrence will be reduced in Compound.
There are 12,929 atom environments with radius r = 1 in Compound that do not occur in Substance, of which 6,134 (46.6%) are singletons. Consequently, all of them are a result of the PubChem structure standardization protocols. The top-10 most frequent examples are presented in Fig. 17b. In most cases, they are due to simple atom type modifications. In the most frequent and fourth most frequent cases, tetra-valent phosphorus that was deposited uncharged gets assigned a positive charge. The charges on zirconium in the second and third most frequent cases are the result of attempts to find an adequate representation of the complex bonds in which the respective atoms are involved. The negative charges on carbon atoms in the fifth and seventh examples arise in a similar fashion. A positive charge is placed on tetra-valent nitrogen that was deposited uncharged in the cases ranked sixth, seventh and tenth. The di-valent chlorine atom showcased in the eighth most frequent environment exclusively found in Compound was deposited without charge and is erroneously modified to this atom type during standardization. In the substances that correspond to the ninth most frequent example, the tungsten atom is part of a complex with covalent single bonds that were removed during standardization by converting them to a PubChem complex bond, yielding a new atom type and consequently a new atom environment.
The full list of atom environments with radius r = 1 with their respective incidence is provided for Substance and Compound as supporting information in Additional file 4.
Atom environments radius r = 2
The full list of atom environments with radius r = 2 with their respective incidence is provided for Substance and Compound as supporting information in Additional file 4.
Atom environments radius r = 3
The top-10 most frequent environments (r = 3) in Compound are almost identical to those in Substance. The three top-ranked fragments are the same. Between Substance and Compound, the fourth and sixth ranked environments switched ranks. Instead of a para-substituted benzene ring with a terminal chlorine atom as one of the substituents, the fluorine case is more frequent in Compound and consequently ranked higher, with 3.4% incidence (fluorine, ranked seventh) compared to 3.1% incidence (chlorine, ranked ninth). The eighth ranked atom environment (r = 3) in Compound is not among the top-10 in Substance, with a tri-valent, tri-coordinated and uncharged nitrogen atom with one implicit hydrogen atom in ortho position to a second substituent in the benzene ring (3.1% incidence). Lastly, the fragment with a methyl group in para position to a second unspecified substituent (3.1% incidence) is ranked tenth in Compound compared to eighth in Substance.
The full list of atom environments with radius r = 3 with their respective incidence is provided for Substance and Compound as supporting information in Additional file 4.
Atom environment set overlap
Atom environment rate of growth
This atom environment survey of PubChem content is revealing for multiple reasons. One potential surprise is the high rate of singletons in Substance and Compound as a function of atom environment radius. The percentage of atom environment singletons for Substance is nearly constant, being 32.5, 33.5, 30, and 29.2% for radius r = 0, 1, 2, and 3, respectively. For Compound, however, the percentage of atom environment singletons is steadily increasing, being 10.6, 28.5, 39.1, and 44.1% for radius r = 0, 1, 2, and 3, respectively. Furthermore, the rate of growth of new atom environments in both Substance and Compound slows dramatically as a function of increasing radius, with an increase of 69, 42, and 6 times for Compound and 38, 18, and 5 times for Substance when considering the ratio of atom environment radius 0→1, 1→2, and 2→3, respectively. The increasing quantity of singletons and the decreasing rate of growth of atom environments as a function of atom environment radius imply that there is still an enormous chemical space to explore even at the chemical fragment level, with nearly half of all r = 3 molecular fragments in Compound appearing in only a single structure. It may also be surprising or revealing (to some) that the count of unique atom environments in Compound is relatively small, being 1,583 (r = 0), 109,306 (r = 1), 4,559,587 (r = 2), and 25,115,177 (r = 3). On the other hand, the drop in the rate of increase of the number of fragments from r = 1 to r = 2 compared to r = 2 to r = 3 could also indicate that there are constraints limiting a full combinatorial exploration of the space defined by smaller fragments (e.g., steric effects). Furthermore, an increasing number of functional groups results in more reactive structures that would be increasingly harder to synthesize.
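The fold increases quoted for Compound follow directly from the counts of unique atom environments per radius. The short calculation below reproduces them from the numbers stated in the text; the variable names are illustrative.

```python
# Unique atom environments in PubChem Compound per radius, as reported above.
unique_environments = {0: 1_583, 1: 109_306, 2: 4_559_587, 3: 25_115_177}

# Ratio of counts between consecutive radii (0→1, 1→2, 2→3).
growth = {
    (r, r + 1): unique_environments[r + 1] / unique_environments[r]
    for r in range(3)
}

# Rounded to whole numbers, as quoted in the text.
rounded = {step: round(ratio) for step, ratio in growth.items()}
```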
The high rate of singletons suggests that sampling a database for atom environments, such as in the earlier 1971 study, will miss the vast majority of atom environments present; however, sampling should be sufficient to locate common fragments. Therefore, the 7.9-fold growth in ‘augmented atoms’ determined in this study from 1971 CASRS to present-day PubChem should be considered an upper bound due to the use of structure sampling in the 1971 CASRS study. Without the full CASRS database for comparison, a more accurate determination of the rate of growth cannot be made.
The concept of circular atom environments is the basis for so-called radial/circular [32, 33], Morgan, or extended connectivity fingerprints (ECFP). As feature ensemble fingerprints, they are not based on a predefined dictionary of structural features. However, we identified in total 28,460,736 unique atom environments of radii r = 1, r = 2 and r = 3 (of which 12,470,387 are singletons, 43.8%) in addition to 1,583 atom types (r = 0, 167 singletons, 10.6%) in PubChem Compound, in total close to 2^25 fragments. For the particular example of ECFP, where a hash function is used to generate 32-bit integers as identifiers for circular fragments, more than 99% of the available 2^32 (4,294,967,296) bits in a fingerprint may remain unused. In Compound, the number of observed singletons increases with increasing radius, from 11% for atom types (r = 0), 29% at r = 1, 39% at r = 2, to 44% at r = 3. It is unclear how well these singular differences between structures are captured in fixed size structural keys (such as those used by PubChem). Conversely, as dynamically generated fingerprints like ECFP quantify structural differences including those previously unknown, this might be a possible explanation for their general advantage over other fingerprint approaches. Further analysis of molecular fragments, such as those provided in this study, may prove useful to design a better set of discriminating (i.e., not co-linear) structural patterns useful for cheminformatic purposes such as structure searching (e.g., identity, substructure, similarity, etc.), virtual screening, clustering, clique analysis, and other conceivable applications.
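The comparison between the number of observed fragments and the size of a 32-bit identifier space can be checked with a few lines of arithmetic. The sketch below uses only the counts given in the text and assumes, for simplicity, that every fragment maps to a distinct identifier (i.e., hash collisions are ignored).

```python
import math

# r = 1..3 atom environments plus r = 0 atom types in PubChem Compound.
total_fragments = 28_460_736 + 1_583

hash_space = 2 ** 32                    # identifiers available with 32 bits
used_fraction = total_fragments / hash_space
unused_fraction = 1 - used_fraction

# Number of bits that would suffice to enumerate the observed fragments.
bits_needed = math.log2(total_fragments)

singleton_fraction_r1_to_r3 = 12_470_387 / 28_460_736
```

Under this assumption less than 1% of the identifier space is occupied, and the fragment count lies between 2^24 and 2^25.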
Atom environment errors
Table: Statistics for elements potentially originating from misperceived abbreviations in PubChem Substance and Compound (columns: r = 0, r = 1, r = 2, r = 3)
These examples help to showcase that not all atom types and atom environments generated by this study resemble fragments of valid and real chemical structures. The belief is that these sorts of issues are rare and that they apply only to a relatively small number of atom types and atom environments. They also help to demonstrate that simple atom environments (e.g., r = 0, r = 1, and beyond) are more tractable for manual curation than all of PubChem. Conceivably, these environments could be used as a ‘sanity check’ of real or plausible chemistry, and therefore are worthy of further investigation for chemical structure normalization and quality assurance (QA) purposes. The importance of this cannot be overstated, as every new contribution to PubChem might exhibit structural elements previously unknown to the standardization protocols. A comparison of atom types and atom environments between Compound, Substance, and structures being contributed could automatically identify new representations and suggest structure examples for curation that can then be used for further refinement of standardization methods.
The chemical structure contents of the PubChem Compound and Substance databases were examined as a function of atom types and atom environments. The relative novelty of chemical structure fragments found in PubChem is considerable. The percentage of atom environments located in only a single PubChem Compound record is 10.6, 28.5, 39.1, and 44.1% for atom environment r = 0 (atom type), 1, 2, and 3 (ECFP_6 like), respectively. Considering many chemical structures are synthesized for novelty purposes, this may not be completely surprising. Interestingly, the relative rate of increase of new atom environments, while still substantial, slows dramatically when examined as a function of increasing atom environment radius in PubChem Compound, with 69-, 42-, and 6-fold increases for r = 0→1, 1→2, and 2→3, respectively. This suggests that there is still considerable room for chemists to pursue novel chemical structures using only new combinations of smaller (e.g., r = 2) atom environment fragments.
Further emphasizing this point, plots of the incidence of atom environment fragments at various sizes show a log/log behavior. In some ways, this may suggest that chemists lack imagination in that the majority of chemical structures contain one or more of the same basic molecular fragments. One could also easily argue the opposite point, in that chemists are constantly pushing into new and unexplored areas of chemistry and are rarely using the same atom fragments twice. In the end, it seems very clear that chemists have plenty of room to explore new and sparsely explored chemistry space and, therefore, make many new discoveries for some time to come.
The analysis of PubChem Compound was compared to similar studies performed over 40 years ago by Crowe et al. and Adamson et al. of CASRS chemical structures. A near eightfold (8×) increase in r = 1 atom environments (‘augmented atoms’, atom and its nearest neighbors) was found. While this result can only be considered an upper bound, due to the use of structure sampling by the earlier studies and the relatively high rate of singletons found in the PubChem analysis, it does imply a substantial increase in the capability of chemists to synthesize and isolate novel chemistry as a function of time, with a noted increase in the prevalence and popularity of nitrogen and oxygen containing atom environments now as opposed to then. The supporting information provided in this study should allow for future comparisons on the progress and trends of chemists.
The differences between the PubChem Substance and Compound databases were examined, in part, by using examples of atom environments of increasing size unique to each repository. This study noted the count of unique atom environments in Substance is greater than it is in Compound. This is due to the fact that structures in Substance undergo structure standardization and have to pass validity filters before becoming part of Compound. This ‘sanity’ step dramatically reduces the count of atom environments by removing implausible chemistry (e.g., five bonds to carbon) and by normalizing varying functional group representations. These differences also help to emphasize the effect of PubChem standardization protocols for preferred atom types and particular tautomeric/resonance forms such that they could be used as the basis for a fragment-based structure normalization procedure.
The analysis of the Compound database is particularly helpful to understand and characterize the diversity of molecular fragments found in known chemicals. Given the limited number of atom environments up to r = 3 (ECFP_6 like), it may be possible to do a more thorough examination of observed fragments to improve the efficiency of chemical information algorithms, such as those for chemical structure searching or virtual screening. Furthermore, the results of this study highlight that further refinement of standardization procedures in PubChem will be beneficial.
In this analysis, the OpenEye Scientific, Inc. OEChem C++ toolkit was used for the representation of atoms, bonds, and molecules.
Most standard formats for structure representation in chemical information, such as SMILES [11, 12] and connection table file formats [23–25], do not require the specification of explicit hydrogen atoms in a chemical structure or implicit hydrogen atom counts. Instead, a standard valence model is employed, where implicit hydrogen atom counts are determined from (among other things) the atomic number, explicit atom valence and formal charge. Standard valence models can vary between file formats and software implementations. In PubChem Substance, the presence of explicit hydrogen atoms is nearly always limited to chemical structures with a hydrogen atom involved in the configuration of a stereocenter or to specify a particular isotope form. Consequently, most non-hydrogen atoms in Substance have non-saturated valences, and the chemical structures do not represent valid chemistry without additional processing to assign implicit hydrogen counts. In order to account for these effects, Substance records were subjected to a standard valence model prior to atom environment analysis by invoking the OEChem C++ toolkit function OEAssignMDLHydrogens. PubChem Compound is derived from Substance through automated structure standardization protocols, including the adjustment of implicit hydrogen atom counts and subsequent assignment of explicit hydrogen atoms. For the purpose of this analysis, all explicit hydrogen atoms of substances and compounds were converted to implicit hydrogen atom counts using the OEChem C++ toolkit function OESuppressHydrogens with all Boolean parameters set to ‘false’. Please note that this explicit-to-implicit hydrogen atom conversion removes all explicit hydrogen atoms, including those with specific hydrogen isotopes, affecting 98,342 deuterium and 21,039 tritium containing substances, as well as 56,725 deuterium and 8,909 tritium containing compounds, respectively.
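The explicit-to-implicit hydrogen conversion can be sketched on a toy molecular graph. The following is not the OEChem implementation and does not reproduce its behavior in edge cases; the `Atom` class, the bond tuple layout and the function name are assumptions made for illustration. Only hydrogens bonded to a non-hydrogen atom are folded into the implicit count, and a bridging hydrogen would be credited to a single neighbor.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    atomic_number: int
    implicit_h: int = 0
    isotope: int = 0  # 0 means unspecified mass number


def suppress_hydrogens(atoms, bonds):
    """Fold explicit hydrogens into the implicit hydrogen count of their heavy
    neighbor. `bonds` is a list of (i, j, order) tuples of atom indices.
    Returns (heavy_atoms, heavy_bonds) with re-indexed atoms."""
    heavy_neighbor = {}
    for i, j, _order in bonds:
        for h, other in ((i, j), (j, i)):
            if atoms[h].atomic_number == 1 and atoms[other].atomic_number != 1:
                heavy_neighbor[h] = other
    removable = set(heavy_neighbor)

    new_index, kept = {}, []
    for idx, atom in enumerate(atoms):
        if idx in removable:
            continue  # isotope labels on removed hydrogens (D, T) are lost here
        new_index[idx] = len(kept)
        kept.append(Atom(atom.atomic_number, atom.implicit_h, atom.isotope))

    for other in heavy_neighbor.values():
        kept[new_index[other]].implicit_h += 1

    kept_bonds = [(new_index[i], new_index[j], order)
                  for i, j, order in bonds
                  if i not in removable and j not in removable]
    return kept, kept_bonds


# Monodeuterated methane: one carbon, three protium atoms and one deuterium.
atoms = [Atom(6), Atom(1), Atom(1), Atom(1), Atom(1, isotope=2)]
bonds = [(0, 1, 1), (0, 2, 1), (0, 3, 1), (0, 4, 1)]
heavy_atoms, heavy_bonds = suppress_hydrogens(atoms, bonds)
```

In this example the deuterium label disappears together with the explicit atom, which mirrors the loss of isotope information noted above.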
In this study, we employed two atom typing schemes. For an adequate comparison of fragments in PubChem Compound to the results of an ‘augmented atom’ study of the CASRS published by Adamson et al., atoms are characterized by their atomic number as sole feature. For a more detailed analysis of circular atom environments in PubChem Substance and Compound, atoms are characterized by six properties: (1) atomic number; (2) formal charge; (3) implicit hydrogen count; (4) explicit degree; (5) valence; and (6) participation in a conjugated (aromatic) system. The atom “explicit degree” is the number of explicitly connected atoms. The atom “valence” equals the sum of all incident sigma and pi bonds. The number of “incident sigma bonds” is the sum of “implicit hydrogen count” and “explicit degree”. The number of “incident pi bonds” is the sum of bond orders to explicitly connected atoms minus the “explicit degree”. This atom characterization approach allows a description of the molecular context of an atom (environment) without having to include the next layer of atoms as pseudo atoms as in other approaches. Atom aromaticity was perceived using the OEChem C++ toolkit function OEAssignAromaticFlags in combination with the aromaticity model OEAroModelOpenEye. In the specific case of the comparison with CASRS, the OEAroModelMDL was used, as it allows for a more “apples to apples” comparison to the older study by allowing only six-membered rings of carbon and nitrogen to be aromatic, provided they satisfy the ‘Hückel 4n + 2’ rule [58, 59] (i.e., atoms are sp2-hybridized).
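The relationships between explicit degree, sigma bonds, pi bonds and valence can be written out directly. The sketch below assumes integer (Kekulé) bond orders and hydrogen-suppressed atoms; the function name and the dictionary keys are illustrative and not part of any toolkit.

```python
def atom_descriptors(implicit_h: int, bond_orders: list) -> dict:
    """Derive degree, sigma, pi and valence for one atom from its implicit
    hydrogen count and the orders of its bonds to explicit neighbors."""
    explicit_degree = len(bond_orders)
    sigma_bonds = implicit_h + explicit_degree
    pi_bonds = sum(bond_orders) - explicit_degree
    return {
        "explicit_degree": explicit_degree,
        "sigma_bonds": sigma_bonds,
        "pi_bonds": pi_bonds,
        "valence": sigma_bonds + pi_bonds,
    }


# Carbonyl carbon of acetone: no hydrogens, two single bonds and one double bond.
carbonyl = atom_descriptors(implicit_h=0, bond_orders=[1, 1, 2])
# Methyl carbon of acetone: three implicit hydrogens and one single bond.
methyl = atom_descriptors(implicit_h=3, bond_orders=[1])
```

Both carbons end up with a valence of 4, but they differ in explicit degree and implicit hydrogen count, which is what lets the atom type capture part of the molecular context.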
In this study, we employed two bond typing schemes. For an adequate comparison of fragments in PubChem Compound to the results of an ‘augmented atom’ study of the CASRS published by Adamson et al., bonds are characterized by their covalent bond order (single, double, triple), and presence in ring or chain, plus an additional ‘aromatic’ ring bond type. Bond aromaticity was perceived using the OEChem C++ toolkit function OEAssignAromaticFlags in combination with the aromaticity model OEAroModelMDL. For a more detailed analysis of circular atom environments in PubChem Substance and Compound, four different bond types are distinguished: single, double, triple, and aromatic. Bond aromaticity was perceived using the OEChem C++ toolkit function OEAssignAromaticFlags in combination with the aromaticity model OEAroModelOpenEye. In addition to covalent bonds, PubChem defines and actively uses three non-standard bond types: ionic, complex and dative bonds. In this analysis, these non-standard bond types were completely ignored.
Atom environments with r > 0 were not generated with terminal atoms as center atoms, where terminal atoms are atoms that are adjacent to only one other atom. These terminal atoms are included in the environment originating from the adjacent (non-terminal) partner. However, this exclusion of terminal atoms means that mono- and di-atomic structures are excluded from any atom environment analysis when r > 0, as they consist exclusively of terminal atoms. In Substance, this leads to 1,797 mono-atomic and 3,795 di-atomic structures being excluded from the atom environment r > 0 analyses. In Compound, this leads to 448 mono-atomic and 1,306 di-atomic structures being excluded from the atom environment r > 0 analyses. Statistics for these excluded structures are provided in the supporting information (see Additional file 1: Figures S4, S5). Terminal atoms are included in the atom environment r = 0 (i.e., atom type) analysis.
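The selection of center atoms and the collection of atoms within a given radius can be sketched as a breadth-first search over a hydrogen-suppressed molecular graph. This is an illustrative sketch and not the code used in the study; the adjacency-dictionary representation and both function names are assumptions.

```python
from collections import deque


def environment_centers(adjacency: dict, radius: int) -> list:
    """Atoms allowed as environment centers: every atom at r = 0, otherwise
    only atoms with more than one neighbor (non-terminal atoms)."""
    if radius == 0:
        return list(adjacency)
    return [atom for atom, nbrs in adjacency.items() if len(nbrs) > 1]


def environment_atoms(adjacency: dict, center, radius: int) -> set:
    """Atoms within `radius` bonds of `center`, found by breadth-first search."""
    distance = {center: 0}
    queue = deque([center])
    while queue:
        atom = queue.popleft()
        if distance[atom] == radius:
            continue  # do not expand beyond the outermost sphere
        for nbr in adjacency[atom]:
            if nbr not in distance:
                distance[nbr] = distance[atom] + 1
                queue.append(nbr)
    return set(distance)


# Heavy-atom graph of n-butane (0-1-2-3) and of a di-atomic structure (0-1).
butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
diatomic = {0: [1], 1: [0]}
```

For the di-atomic graph no center qualifies at r > 0, so the structure contributes only to the atom type (r = 0) statistics, in line with the exclusion described above.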
Table 4 Atom and bond primitives for encoding of ‘augmented atoms’ in SMARTS
- Atoms: #<atomic number>
- Bonds: @ (!@) for ‘in ring’ (‘not in ring’)
Table 5 Atom and bond primitives for encoding of atom types and atom environments in SMARTS
- Lower case indicating aromaticity
- Uncharged represented as +0
- Implicit hydrogen count
Incidence and occurrence
In this study, atom environment frequency is expressed in terms of incidence and occurrence. Incidence refers to the absolute count or percentage of (substance or compound) records that contain a particular fragment. Occurrence refers to the absolute count or percentage of all fragments across all structures. Therefore, per chemical structure record, occurrence considers all fragments, while incidence considers only the unique fragments.
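The distinction between incidence and occurrence can be made concrete with a small counting example. The sketch below uses made-up fragment labels rather than real atom environments, and the function name is illustrative.

```python
from collections import Counter


def incidence_and_occurrence(records):
    """`records` is an iterable of fragment lists, one list per structure.
    Incidence counts each fragment once per record; occurrence counts every
    instance of the fragment."""
    incidence, occurrence = Counter(), Counter()
    for fragments in records:
        occurrence.update(fragments)
        incidence.update(set(fragments))
    return incidence, occurrence


# Three toy records with placeholder fragment labels.
records = [["C", "C", "O"], ["C", "N"], ["C", "C", "C"]]
incidence, occurrence = incidence_and_occurrence(records)

# Fragments found in only one record.
singletons = sorted(f for f, n in incidence.items() if n == 1)
```

Here fragment "C" has an incidence of 3 (it is present in all three records) but an occurrence of 6, while "O" and "N" are each found in a single record.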
This study uses PubChem as it existed on January 14, 2013 with maximum SID 160,655,685 and maximum CID 70,680,246. For both data sets, only PubChem records searchable (‘live’) at that point in time were processed. PubChem Substance records with ‘auto-generated’ structures were excluded. In ‘auto-generated’ cases, no actual structure is deposited, but a reference to a PubChem Compound record is derived using chemical names and may include chemical name conversion using various approaches, including the OpenEye Scientific Inc. Lexichem C++ toolkit. Lastly, the chemical structure for a given substance had to be fully specified. Therefore, substances containing arbitrarily defined atoms (pseudo-atoms) were excluded from this analysis. By these criteria, atom environments (r = 0, 1, 2, 3) were determined for 104,669,789 Substance records. All 46,704,121 ‘live’ records in Compound were also processed.
All atom environments (r = 0, 1, 2, 3) found are provided as supporting information in Additional files 3 (r = 0; atom types) and 4 (r = 1, 2, 3; atom environments) as SMARTS patterns. The usage of atom and bond primitives for encoding of augmented atoms and PubChem atom environments is detailed in Tables 4 and 5, respectively. Provided in this format, fragments can be visualized using appropriate techniques [79, 80], or readily imported into various toolkits. All SMARTS patterns supplied as supporting information have been tested for their validity by successfully parsing them with the OEChem C++ toolkit function OEParseSmarts.
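As an illustration of how the six atom properties might be expressed with standard SMARTS atom primitives, the sketch below assembles a single-atom pattern. The layout and order of primitives are an assumption made here and are not claimed to match the patterns in the Additional files; the function name is hypothetical. It relies only on the Daylight SMARTS primitives `#<n>` (atomic number), `+<n>`/`-<n>` (formal charge), `h<n>` (implicit hydrogen count), `D<n>` (explicit degree), `v<n>` (valence) and `a`/`A` (aromatic/aliphatic).

```python
def atom_type_smarts(atomic_number: int, formal_charge: int, implicit_h: int,
                     explicit_degree: int, valence: int, aromatic: bool) -> str:
    """Build a single-atom SMARTS pattern from the six atom properties
    (hypothetical layout, for illustration only)."""
    parts = [
        f"#{atomic_number}",
        f"{formal_charge:+d}",   # an uncharged atom is written as +0
        f"h{implicit_h}",
        f"D{explicit_degree}",
        f"v{valence}",
        "a" if aromatic else "A",
    ]
    return "[" + ";".join(parts) + "]"


aromatic_ch = atom_type_smarts(6, 0, 1, 2, 4, True)    # aromatic CH carbon
alkoxide_o = atom_type_smarts(8, -1, 0, 1, 1, False)   # anionic terminal oxygen
```

Patterns built this way can be passed to any SMARTS-aware toolkit for validation or substructure matching.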
Records may be referred to as SID (substance identifier) for PubChem Substance records and CID (compound identifier) for PubChem Compound records. Atom environments that occur in only a single PubChem record are referred to as singletons.
VDH devised the environment encoding, carried out the computations and analyzed the results, and drafted the manuscript. EEB facilitated the computations and edited the manuscript. SHB reviewed the final manuscript. All authors read and approved the final manuscript.
We thank the anonymous reviewers for their careful reading of our manuscript and their constructive comments, which helped us to improve the manuscript. This research was supported in part by the Intramural Research Program of the National Library of Medicine, National Institutes of Health, US Department of Health and Human Services.
Compliance with ethical guidelines
Competing interests: The authors declare that they have no competing interests.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Minkin VI (1999) Glossary of terms used in theoretical organic chemistry (IUPAC recommendations 1999). Pure Appl Chem 71:1919–1981
- Trudeau RJ (1993) Graphs. In: Introduction to Graph Theory. Dover Publications, Inc., New York, p 19
- Lewis GN (1916) The atom and the molecule. J Am Chem Soc 38:762–785
- Cayley A (1874) On the mathematical theory of isomers. Philos Mag 47:444–447
- Panico R, Powell WH, Richter JC (1993) A Guide to IUPAC Nomenclature of Organic Compounds Recommendations 1993. Blackwell Science, Oxford
- Favre HA, Hellwich KH, Moss GP, Powell WH, Traynham JG (1999) Corrections to a guide to IUPAC nomenclature of organic compounds (IUPAC recommendations 1993). Pure Appl Chem 71:1327–1330
- Leigh GJ, Favre HA, Metanomski WV (1998) Principles of organic nomenclature. Blackwell Science, Oxford
- Skolnik H, Clow A (1964) A notation system for indexing pesticides. J Chem Doc 4:221–227
- Dyson GM, Lynch MF, Morgan HL (1968) A modified IUPAC-Dyson notation system for chemical structures. Inform Storage Retr 4:27–83
- Wiswesser WJ (1982) How the WLN began in 1949 and how it might be in 1999. J Chem Inf Comput Sci 22:88–93
- Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28:31–36
- Weininger D, Weininger A, Weininger JL (1989) SMILES. 2. Algorithm for generation of unique SMILES notation. J Chem Inf Comput Sci 29:97–101
- Barnard JM, Jochum CJ, Welford SM (1989) A universal structure/substructure representation for PC-host communication. In: Warr WA (ed) Chemical Structure Information Systems, ACS Symposium Series, vol 400. American Chemical Society, Washington DC, pp 76–81
- Rohbeck HG (1991) Representation of structure description arranged linearly. In: Gmehlin J (ed) Software Development in Chemistry 5. Springer, Heidelberg, pp 49–58
- Ash S, Cline MA, Homer RW, Hurst T, Smith GB (1997) SYBYL line notation (SLN): a versatile language for chemical structure representation. J Chem Inf Comput Sci 37:71–79
- Homer RW, Swanson J, Jilek RJ, Hurst T, Clark RD (2008) SYBYL line notation (SLN): a single notation to represent chemical structures, queries, reactions, and virtual libraries. J Chem Inf Model 48:2294–2307
- Gakh AA, Burnett MN (2001) Modular Chemical Descriptor Language (MCDL): composition, connectivity, and supplementary modules. J Chem Inf Comput Sci 41:1494–1499
- Gakh AA, Burnett MN, Trepalin SV, Yarkov AV (2011) Modular Chemical Descriptor Language (MCDL): stereochemical modules. J Cheminform 3:5
- McNaught A (2006) The IUPAC international chemical identifier: InChI – a new standard for molecular informatics. Chem Int 28:12–14
- Heller SR, McNaught AD (2009) The IUPAC international chemical identifier. Chem Int 31:7–9
- Proschak E, Wegner JK, Schüller A, Schneider G, Fechner U (2007) Molecular query language (MQL) – a context-free grammar for substructure matching. J Chem Inf Model 47:295–301
- Reisen FH, Schneider G, Proschak E (2009) Reaction-MQL: line notation for functional transformation. J Chem Inf Model 49:6–12
- Dalby A, Nourse JG, Hounshell WD, Gushurst AKI, Grier DL, Leland BA, Laufer J (1992) Description of several chemical structure file formats used by computer programs developed at Molecular Design Limited. J Chem Inf Comput Sci 32:244–255
- (2011) Accelrys CTFile Formats. http://accelrys.com/products/informatics/cheminformatics/ctfile-formats/no-fee.php. Accessed 30 July 2015
- (2005) TRIPOS Mol2 File Format. http://tripos.com/data/support/mol2.pdf. Accessed 30 July 2015
- Warr WA (2011) Representation of chemical structures. Wiley Interdiscip Rev Comput Mol Sci 1:557–579
- Carhart RE, Smith DH, Venkataraghavan R (1985) Atom pairs as molecular features in structure-activity studies: definition and applications. J Chem Inf Comput Sci 25:64–73
- Sheridan RP, Miller MD, Underwood DJ, Kearsley SK (1996) Chemical similarity using geometric atom pair descriptors. J Chem Inf Comput Sci 36:128–136
- Barnard JM, Downs GM (1997) Chemical fragment generation and clustering software. J Chem Inf Comput Sci 37:141–142
- Filimonov D, Poroikov V, Borodina Y, Gloriozova T (1999) Chemical similarity assessment through multilevel neighborhoods of atoms: definition and comparison with the other descriptors. J Chem Inf Comput Sci 39:666–670
- Durant JL, Leland BA, Henry DR, Nourse JG (2002) Reoptimization of MDL keys for use in drug discovery. J Chem Inf Comput Sci 42:1273–1280
- Bender A, Mussa HY, Glen RC, Reiling S (2004) Molecular similarity searching using atom environments, information-based feature selection, and a naïve bayesian classifier. J Chem Inf Comput Sci 44:170–178
- Bender A, Mussa HY, Glen RC, Reiling S (2004) Similarity searching of chemical databases using atom environment descriptors: evaluation of performance. J Chem Inf Comput Sci 44:1708–1718
- Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50:742–754
- Barnard JM, Downs GM (1992) Clustering of chemical structures on the basis of two-dimensional similarity measures. J Chem Inf Comput Sci 32:644–649
- Willett P (2000) Chemoinformatics – similarity and diversity in chemical libraries. Curr Opin Biotechnol 11:85–88
- Brown RD, Martin YC (1996) Use of structure-activity data to compare structure-based clustering methods and descriptors for use in compound selection. J Chem Inf Comput Sci 36:572–584
- McGregor MJ, Pallai PV (1997) Clustering of large databases of compounds using MDL “keys” as structural descriptors. J Chem Inf Comput Sci 37:443–448
- MacCuish JD, MacCuish NE (2013) Chemoinformatics applications of cluster analysis. Wiley Interdiscip Rev Comput Mol Sci 4:34–48
- Willett P (1998) Chemical similarity searching. J Chem Inf Comput Sci 38:983–996
- Willett P (2006) Similarity-based virtual screening using 2D fingerprints. Drug Discov Today 11:1046–1053
- Willett P (2011) Similarity searching using 2D structural fingerprints. Methods Mol Biol 672:133–158
- Feldman A, Hodes L (1975) An efficient design for chemical structure searching. I. The screens. J Chem Inf Comput Sci 15:147–152
- Xiao Y, Qiao Y, Zhang J, Lin S, Zhang W (1997) A method for substructure search by atom-centered multilayer code. J Chem Inf Comput Sci 37:701–704
- Liu P, Agrafiotis DK, Rassokhin DN (2011) Power Keys: a novel class of topological descriptors based on exhaustive subgraph enumeration and their application in substructure searching. J Chem Inf Model 51:2843–2851
- Crowe JE, Lynch MF, Town WG (1970) Analysis of structural characteristics of chemical compounds in a large computer-based file. Part I. Non-cyclic fragments. J Chem Soc C 990–996. doi:10.1039/J39700000990
- Adamson GW, Lynch MF, Town WG (1971) Analysis of structural characteristics of chemical compounds in a large computer-based file. Part II. Atom-centred fragments. J Chem Soc C 3702–3706. doi:10.1039/J39710003702
- Larsen PO, von Ins M (2010) The rate of growth in scientific publication and the decline in coverage provided by Science Citation Index. Scientometrics 84:575–603
- Binetti R, Costamagna FM, Marcello I (2008) Exponential growth of new chemicals and evolution of information relevant to risk control. Ann Ist Super Sanita 44:13–15
- Chemical Abstracts Service (2008) CAS Statistical Summary 1907–2007. Chemical Abstracts Service, Columbus (OH)
- Bolton EE, Wang Y, Thiessen PA, Bryant SH (2008) Chapter 12 PubChem: integrated platform of small molecules and biological activities. In: Wheeler RA, Spellmeyer DC (eds) Annual reports in computational chemistry, vol 4. Elsevier, Oxford, pp 217–241
- Wang Y, Xiao J, Suzek TO, Zhang J, Wang J, Zhou Z et al (2012) PubChem’s BioAssay database. Nucleic Acids Res 40:D400–D412
- Wang YL, Bolton E, Dracheva S, Karapetyan K, Shoemaker BA, Suzek TO et al (2010) An overview of the PubChem BioAssay resource. Nucleic Acids Res 38:D255–D266
- (2004) The PubChem Project. http://pubchem.ncbi.nlm.nih.gov/. Accessed 30 July 2015
- Petitjean M (1992) Applications of the radius-diameter diagram to the classification of topological and geometrical shapes of chemical compounds. J Chem Inf Comput Sci 32:331–337
- (2015) RDKit: Open-Source Cheminformatics Software. http://www.rdkit.org. Accessed 30 July 2015
- (2014) OpenEye OEChem C++ Toolkit, version 2.0.3.b.1. OpenEye Scientific Software, Inc., Santa Fe (NM). http://www.eyesopen.com/oechem-tk. Accessed 30 July 2015
- Hückel E (1931) Quantentheoretische Beiträge zum Benzolproblem I. Die Elektronenkonfiguration des Benzols und verwandter Verbindungen. Z Phys 70:204–286
- Hückel E (1932) Quantentheoretische Beiträge zum Benzolproblem II. Quantentheorie der induzierten Polaritäten. Z Phys 72:310–337
- OpenEye Scientific Software, Inc. (2012) OEChem C++ Toolkit v1.9.2 Manual. OpenEye Scientific Software, Inc., Santa Fe, p 50
- Claassen HH, Selig H, Malm JG (1962) Xenon Tetrafluoride. J Am Chem Soc 84:3593
- MacKenzie DR (1963) Krypton Difluoride: preparation and handling. Science 141:1171
- Templeton DH, Zalkin A, Forrester JD, Williamson SM (1963) Crystal and molecular structure of xenon trioxide. J Am Chem Soc 85:817
- Selig H, Malm JG, Claassen HH, Chernick CL, Huston JL (1964) Xenon tetroxide – preparation and some properties. Science 143:1322–1323
- Graham L, Graudejus O, Jha NK, Bartlett N (2000) Concerning the nature of XePtF6. Coord Chem Rev 197:321–334
- Khriachtchev L, Pettersson M, Runeberg N, Lundell J, Räsänen M (2000) A stable argon compound. Nature 406:874–876
- Tramšek M, Žemva B (2006) Synthesis, properties and chemistry of xenon(II) fluoride. Acta Chim Slov 53:105–116
- Heifets A, Jurisica I (2012) SCRIPDB: a portal for easy access to syntheses, chemicals and reactions in patents. Nucl Acids Res 40:D428–D433
- (2011) SCRIPDB. University of Toronto. http://dcv.uhnres.utoronto.ca/SCRIPDB/. Accessed 30 July 2015
- IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120 (USA)
- (2007) ChemSpider. http://www.chemspider.com/. Accessed 30 July 2015
- (2009) PubChem Substructure Fingerprint V1.3, National Center for Biotechnology Information, Bethesda. ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt. Accessed 30 July 2015
- de Silva KM, Goodman JM (2005) What is the smallest saturated acyclic alkane that cannot be made? J Chem Inf Model 45:81–87
- Paton RS, Goodman JM (2007) Exploration of the accessible chemical space of acyclic alkanes. J Chem Inf Model 47:2124–2132
- Kolodzik A, Urbaczek S, Rarey M (2012) Unique ring families: a chemically meaningful description of molecular ring topologies. J Chem Inf Model 52:2013–2021
- Ertl P, Schuffenhauer A (2009) Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J Cheminform 1:8
- Daylight Theory Manual, Chapter 4: SMARTS – A Language for Describing Molecular Patterns. Daylight Chemical Information Systems, Inc., Laguna Niguel. http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html. Accessed Sep 2013
- OpenEye Lexichem C++ Toolkit. OpenEye Scientific Software, Inc., Santa Fe. http://www.eyesopen.com/lexichem-tk. Accessed 30 July 2015
- Schomburg K, Ehrlich HC, Stierand K, Rarey M (2010) From structure diagrams to visual chemical patterns. J Chem Inf Model 50:1529–1535
- (2010) SMARTSviewer. Center for Bioinformatics, Universität Hamburg. http://smartsview.zbh.uni-hamburg.de/. Accessed 30 July 2015