Similar compounds versus similar conformers: complementarity between PubChem 2-D and 3-D neighboring sets
© U.S. Government 2016
Received: 26 May 2016
Accepted: 5 September 2016
Published: 4 November 2016
PubChem is a public repository for biological activities of small molecules. For the efficient use of its vast amount of chemical information, PubChem performs 2-dimensional (2-D) and 3-dimensional (3-D) neighborings, which precompute “neighbor” relationships between molecules in the PubChem Compound database, using the PubChem subgraph fingerprints-based 2-D similarity and the Gaussian-shape overlay-based 3-D similarity, respectively. These neighborings allow PubChem to provide the user with immediate access to the list of 2-D and 3-D neighbors (also called “Similar Compounds” and “Similar Conformers”, respectively) for each compound in PubChem. However, because 3-D neighboring is much more time-consuming than 2-D neighboring, how different the results of the two neighboring schemes are is an important question, considering limited computational resources.
The present study analyzed the complementarity between the PubChem 2-D and 3-D neighbors. When all compounds in PubChem were considered, the overlap between 2-D and 3-D neighbors was only 2% of the total neighbors. For the data sets containing compounds with annotated information, the overlap increased as the data sets became smaller. However, it did not exceed 31% and substantial fractions of neighbors were still recognized by either PubChem 2-D or 3-D similarity, but not by both. The Neighbor Preference Index (NPI) of a molecule for a given data set was introduced, which quantified whether a molecule had more 2-D or 3-D neighbors in the data set. The NPI histogram for all PubChem compounds had a bimodal shape with two maxima at NPI = ±1 and a minimum at NPI = 0. However, the NPI histograms for the subsets containing compounds with annotated information had a greater fraction of compounds with a strong preference for one neighboring method to the other (at NPI = ±1) as well as compounds with a neutral preference (at NPI = 0).
The rapid growth of information on chemicals and their biological activities has created a demand in the scientific community for public repositories that can increase the utility of these data by collecting, integrating, and disseminating them to the community free of charge. An example of such repositories is PubChem [1–5], developed and maintained by the National Center for Biotechnology Information (NCBI), a part of the U.S. National Institutes of Health’s (NIH) National Library of Medicine (NLM). PubChem consists of three primary databases: Substance, Compound, and BioAssay. As of May 2016, 416 PubChem data contributors provide a wide range of chemical substance descriptions to the PubChem Substance database (accession: SID). The set of unique chemical structures present in the Substance database makes up the PubChem Compound database (accession: CID). Results from biological experiments performed on the substance samples are stored in the PubChem BioAssay database (accession: AID). Altogether, PubChem is a sizeable data system of more than 219 million substance descriptions, 89 million unique chemical structures, one million biological assays, and 230 million biological assay outcomes (where an outcome is the set of results from a substance being tested in an assay).
There is a large variation in the amount of information available for each molecule contained in PubChem. For example, whereas some molecules have enormous quantities of information on the biological activity and literature associated, other molecules have very little information other than the chemical structure. When there is no desired information available for a particular molecule, one may infer it from its structural analogues that have relevant information. Even when a molecule has desired information, comparison with other available information on the molecule and its structural analogues may provide additional important insight.
To help find and analyze related information using chemical structures, PubChem provides services that exploit chemical structure similarity between molecules. These include Structure Search, Structure Clustering, and Structure–Activity Analysis [1, 6]. In addition, the PubChem Compound database provides two precomputed chemical structure similarity searches of molecules, dubbed “neighbor” relationships (where only results above a given threshold are retained from the similarity search). These give users immediate access to a set of structurally similar molecules. One of these neighboring relationships, known as “Similar Compounds”, uses the notion of 2-D similarity (which considers the atoms in a molecule and how they connect to each other) and is adept at finding close structural analogues of a structure such as those with the same scaffold. Another neighboring relationship in PubChem, known as “Similar Conformers”, uses the notion of 3-D similarity (which considers the overall shape and macromolecule-binding features of the molecule) [6, 7] and is adept at finding related structures with different scaffolds. As described in more detail in the “Methods” section, the 2-D neighboring uses the PubChem substructure fingerprint and the Tanimoto equation [9–11] to evaluate structural similarity between two molecules, resulting in a list of 2-D “Similar Compound” neighbors for each compound record. The Gaussian-shape overlay method by Grant and Pickup [12–15], which is implemented in the Rapid Overlay of Chemical Structures (ROCS) [16, 17], is used to generate a list of 3-D “Similar Conformer” neighbors for each molecule covered by the PubChem3D project [6, 7, 18–23].
This project generates 3-D conformer models for about 90% of the chemicals in the PubChem Compound database, being only those structures with a single component (i.e., no mixtures or salts), comprised of organic elements, not too flexible (≤15 rotatable bonds), and not too large (≤50 non-hydrogen atoms) [6, 18, 23]. These computationally derived conformer models are used in various PubChem tools and services that exploit 3-D similarity, including the 3-D conformer search, 3-D neighboring, 3-D clustering, 3-D structure–activity relationship analysis, and so on.
In general, binary fingerprint-based 2-D similarity methods can compare on the order of one million compound pairs per second per CPU core, but many 3-D similarity methods (such as ROCS [16, 17], used by PubChem) can only compare on the order of 100 to 1000 conformer pairs per second per CPU core. The CPU-based PubChem 3-D neighboring approach, as described in a recent study, uses various filtering schemes to exclude conformer pairs that cannot be neighbors of each other from the most time-consuming step, the shape superposition optimization. As a result, the throughput of PubChem 3-D neighboring is enhanced beyond 100,000 conformer pairs per second per CPU core. It is important to note that GPU-based approaches show great promise to reduce the cost of 3-D similarity computation. GPU implementations (such as FastROCS) provide on the order of 1,000,000 conformer pairs per second per GPU. The CPU-based filtering approach that accelerates PubChem 3-D neighboring can complement a GPU approach, where the CPU handles the filtering steps and the GPU performs the superposition optimization. However, if the 3-D neighboring considers ten diverse conformers per compound, a throughput of 100,000 conformer pairs per second per CPU core corresponds to a throughput on the order of 1000 compound pairs per second per CPU core, because there can be 100 conformer pairs (10 × 10 = 100) for each compound pair. Therefore, even though vastly accelerated, PubChem 3-D neighboring is still about three orders of magnitude slower than PubChem 2-D neighboring.
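The throughput argument above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the constants are the order-of-magnitude figures quoted in the text, and the function and variable names are hypothetical.

```python
import math

# Order-of-magnitude figures quoted in the text (not measured here).
RATE_2D_COMPOUND_PAIRS = 1_000_000   # 2-D fingerprint comparisons per second per core
RATE_3D_CONFORMER_PAIRS = 100_000    # filtered 3-D comparisons per second per core
CONFORMERS_PER_COMPOUND = 10         # diverse conformers considered per compound


def compound_pair_rate(conformer_pair_rate, conformers_per_compound):
    """Convert a conformer-pair rate into a compound-pair rate.

    Every conformer of one compound is compared with every conformer of the
    other, so one compound pair costs conformers_per_compound ** 2 comparisons.
    """
    return conformer_pair_rate / conformers_per_compound ** 2


rate_3d = compound_pair_rate(RATE_3D_CONFORMER_PAIRS, CONFORMERS_PER_COMPOUND)
orders_of_magnitude = math.log10(RATE_2D_COMPOUND_PAIRS / rate_3d)

print(f"3-D compound pairs per second: {rate_3d:.0f}")
print(f"2-D is faster by about 10^{orders_of_magnitude:.0f}")
```

With five conformers per compound instead of ten, the same function gives 25 conformer pairs per compound pair, so the gap narrows but remains large.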
One may legitimately ask the question, if 3-D neighboring is so computationally demanding, is there sufficient benefit to justify the additional computational effort over use of 2-D similarity? For example, how different are the results from 2-D and 3-D neighboring approaches? Do 2-D and 3-D similarity methods for a given chemical structure give unique chemical lists or do the two approaches largely approximate each other? What does 3-D similarity yield that 2-D similarity does not and vice versa? Are key molecules missed by one approach yet found by the other? The present study explores these questions by analyzing the overlap of 2-D and 3-D neighbors that are precomputed and stored in (and readily downloadable from) PubChem.
Results and discussion
Two series of data sets
Table 1 Number of compounds (CIDs) in the data sets employed in the present study, including the ratio (B/A) (%) [table body not recovered]
To remove the 2-D bias, the five sets in Series B [designated as PubChem-(B), MeSH-(B), Protein3D-(B), PharmAct-(B), and Drug-(B)] were generated by using the unique set of parent compound representations in the respective sets in Series A and then excluding any compound without a computationally generated PubChem3D conformer description. The Series B data sets ensure that all considered chemicals have both 2-D and 3-D descriptions and also remove any redundancy due to salt/mixture form variation of the same parent chemical structure. Therefore, the Series B data sets address any potential (2-D) bias, allowing the two neighboring approaches to be compared on an even footing. The Series A data sets are also retained and analyzed for comparison purposes. Table 1 summarizes the number of compounds in each compound set. It is noteworthy that the Series B data sets consistently have fewer CIDs than their Series A counterparts, ranging from 11.8% fewer for the PubChem-(B) set to 46.4% fewer for the Drug-(B) set, further emphasizing the importance of bias removal for comparison purposes.
Overlap between 2-D and 3-D neighbors in Series A
Overlap between 2-D and 3-D neighbors in Series B
The data sets in Series B remove salts and mixture forms that cause an inherent bias in favor of 2-D neighboring. In addition, only parent compounds that have computationally generated 3-D structures were considered. As a result, all five data sets in Series B were found to have considerably fewer 2-D neighbor pairs than in the Series A data sets (see Fig. 2). Similarly to Series A, the fraction of common neighbor pairs in the Series B data sets increased as the size of the data set decreased. However, for the smallest set, Drug-(B), the overlap between the 2-D and 3-D neighbor pairs was still only 31% and structural similarity of the remaining 69% was recognized only by one of the two neighboring approaches. The proportions of the 2-D-only and 3-D-only neighbor pairs are very close to each other (35 and 34%, respectively) for the Drug-(B) set, suggesting that the two approaches used by PubChem are complementary.
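The breakdown into common, 2-D-only, and 3-D-only neighbor pairs can be illustrated with set operations. This is a minimal sketch, assuming neighbor pairs are available as pairs of CIDs and that the fractions are taken relative to the union of both neighbor lists; the function name and the toy CIDs are hypothetical.

```python
def neighbor_pair_overlap(pairs_2d, pairs_3d):
    """Return the fractions of common, 2-D-only and 3-D-only neighbor pairs.

    Each pair is stored as a frozenset so that (a, b) and (b, a) count once.
    Fractions are relative to the union of the two neighbor-pair sets
    (an assumption about what "total neighbors" means).
    """
    set_2d = {frozenset(pair) for pair in pairs_2d}
    set_3d = {frozenset(pair) for pair in pairs_3d}
    total = len(set_2d | set_3d)
    if total == 0:
        raise ValueError("no neighbor pairs supplied")
    return {
        "common": len(set_2d & set_3d) / total,
        "2d_only": len(set_2d - set_3d) / total,
        "3d_only": len(set_3d - set_2d) / total,
    }


# Toy data with made-up CIDs, for illustration only.
toy_2d = [(1, 2), (2, 3), (3, 4)]
toy_3d = [(2, 1), (4, 5)]
fractions = neighbor_pair_overlap(toy_2d, toy_3d)
print(fractions)
```

The three fractions always sum to one, which matches how the 31%, 35%, and 34% figures for the Drug-(B) set partition the total.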
Distribution of neighbor preference indices (NPIs)
To determine the extent to which chemicals in PubChem can be interrelated by one similarity method but not the other, the Neighbor Preference Index (NPI) of a compound was introduced. It was designed to measure the extent of overlap between PubChem 2-D and 3-D neighboring approaches. If 2-D neighboring results substantially overlap with 3-D neighboring results, there would be little point to compute 3-D similarity, which is computationally expensive. As defined in the “Methods” section, this NPI quantity may have any value ranging from −1 (for compounds with 3-D neighbors only) to +1 (for compounds with 2-D neighbors only). A compound that has an equal number of 2-D and 3-D neighbors has an NPI value of zero, indicating that it has no preference for any of the two neighboring methods. The NPI value of a compound is dependent on the nature of the given chemical set, because a compound can have different sets of neighbors for different data sets.
Interestingly, the shapes of the NPI histograms for the other data sets in Series B [i.e., MeSH-(B), Protein3D-(B), PharmAct-(B), and Drug-(B)] are different from that of the PubChem-(B) set. The outermost columns at each end of the histograms become more prominent, indicating that the fraction of compounds with extreme NPIs (i.e., those with NPI values of −0.95 to −1.00 and those with +0.95 to +1.00) increased in these subsets. In addition, the fraction of compounds with an NPI between −0.05 and 0.05 increases as the size of the data set decreases. This indicates that these subsets of the PubChem-(B) set do not represent well the chemical space covered by the PubChem-(B) set. Note that these four subsets were generated by checking whether compounds have a particular type of annotation (for example, whether a compound has been prominently mentioned in a biomedical journal article, whether it has been co-crystallized with a protein target, whether it has a known pharmacological action, or whether it is a known active drug ingredient). In other words, the four subsets correspond to four narrowly focused subspaces of the PubChem-(B) data set and, unlike the overall chemical set, may be dominated by closely related analogues and structurally similar scaffolds in an attempt to identify similar bioactivity.
Data set dependency of neighbor preference indices
Aspirin has 1488 2-D-only neighbors and 4532 3-D-only neighbors in the PubChem-(B) set, resulting in an NPI value of −0.49. However, it has an NPI value of 1.00 in the Drug-(B) set, with one 2-D-only neighbor (CID 5161; salicylsalicylic acid) and no 3-D-only neighbors. On the other hand, indomethacin has an NPI value of +0.58 for the PubChem-(B) set, with 899 2-D-only neighbors and 210 3-D-only neighbors, but it has only one 3-D neighbor in the Drug-(B) set, resulting in an NPI value of −1.00. Note that the signs of the NPI values of the two compounds for the PubChem-(B) set are opposite to those for the Drug-(B) set. This indicates that the nature of the chemical spaces spanned by the two sets is very different in terms of which neighboring scheme is better in recognizing structural similarity to aspirin and indomethacin.
Effects of stereochemistry upon 2-D and 3-D neighborings
The PubChem fingerprints used for 2-D similarity evaluation in PubChem do not take into account stereochemistry of molecules (such as cis–trans isomerism and chirality). Therefore, different stereo isomers that have the same molecular formula and atom connectivity are represented with the same fingerprint, regardless of whether the configurations of the stereo centers are explicitly defined or not. For example, both (E) and (Z) forms of 1,2-dichloroethene (CID 638186 and CID 643833) have the same fingerprint as 1,2-dichloroethene (CID 10900). As a result, 2-D similarity evaluation between these stereo isomers always yields a Tanimoto score of “1.0”, classifying them as neighbors of each other. In addition, 2-D neighboring of these stereo isomers results in the same set of 2-D neighbors.
In contrast to 2-D neighboring, PubChem 3-D neighboring is not blind to stereochemistry because it uses 3-D conformer models that take stereochemistry into account. PubChem generates a different 3-D conformer model for a given stereo isomer. The conformer model for a compound with unspecified stereo centers is constructed by generating conformers for each stereo isomer arising from enumeration of the undefined stereo centers, and then combining them. As a result, the use of different conformers for 3-D neighboring of stereo isomers may yield different sets of 3-D neighbors, as discussed in our previous paper.
The overlap between the PubChem 2-D and 3-D similarity neighboring approaches was analyzed as a function of annotation type, using ten data sets: five data sets in Series A [i.e., PubChem-(A), MeSH-(A), Protein3D-(A), PharmAct-(A), and Drug-(A)] and five data sets in Series B [i.e., PubChem-(B), MeSH-(B), Protein3D-(B), PharmAct-(B), and Drug-(B)]. The five data sets in Series A considered all compounds [PubChem-(A)], those prominently mentioned in a biomedical journal article [MeSH-(A)], those found in a protein–ligand complex crystal structure [Protein3D-(A)], those with a known pharmacological action [PharmAct-(A)], and those which are approved drugs [Drug-(A)]. A direct comparison between PubChem 2-D and 3-D neighbors using the Series A data sets revealed a bias towards 2-D neighbors as 3-D neighboring does not consider salts and mixtures. To remove this bias, the Series B data sets were generated by considering only parent compounds (in effect discarding salts and mixtures) with a computed 3-D description in PubChem. For both PubChem-(A) and PubChem-(B) sets, the overlap between 2-D and 3-D neighbors was only about 2% of the total neighbors. In other words, the PubChem 2-D and 3-D similarity approaches are nearly orthogonal. Considering the debate over 2-D and 3-D similarity methods [29–34], this is a surprising finding. For the subsets containing compounds with specific types of annotation, the overlap increased substantially as the data sets became smaller. However, it did not exceed 31% [for the Drug-(B) set] and substantial fractions of neighbors were still either 2-D-only or 3-D-only.
To further investigate complementarity between 2-D and 3-D neighborings, the NPI of a molecule for a given data set was introduced, which quantifies whether a molecule has more 2-D or 3-D neighbors. The NPI histogram for the PubChem-(B) set shows a bimodal shape with two maxima at NPI = ±1 and a minimum at NPI = 0. This indicates that, for the majority of the compounds in PubChem, their structural similarity to other compounds can be recognized by only one of the 2-D and 3-D neighborings, but not by both. Therefore, considering both 2-D and 3-D approaches in PubChem appears to be beneficial.
Interestingly, the shape of the NPI value histogram for the PubChem-(B) set is not similar to those for its four subsets [i.e., MeSH-(B), Protein3D-(B), PharmAct-(B), and Drug-(B)]. The NPI value histograms show a more polarized trimodal profile with a greater fraction of compounds with a strong preference for one neighboring method over the other (at NPI = ±1) as well as compounds with a neutral preference (at NPI = 0) but less so in between the extremes. As such, one would be well advised to use both 2-D and 3-D similarity when searching for chemicals that are well studied.
The results of our study show the complementarity between the 2-D and 3-D neighbors in PubChem. Each neighboring approach can identify structural similarity that the other neighboring approach cannot detect. In other words, the two approaches appear to be of equal value for interrelating chemical structures, with each yielding a similar count of neighbor pairs. Depending on the use case (such as looking for analogues of chemicals in a series or interrelating chemical series), scientists may prefer to use one approach over the other or both to retrieve information on chemicals similar to a compound of interest.
Series A data sets
The present study employed ten different data sets (five data sets each for two series: A and B). The number of compounds per data set is listed in Table 1. The PubChem-(A) set represents the entire chemical space spanned by compounds stored in PubChem whose CID is less than or equal to 60,182,254, reflecting those CIDs with both 2-D and 3-D neighboring data available at the time of analysis. Note that, while 2-D neighbors are updated on a daily basis, 3-D neighbors are updated less frequently because 3-D neighboring is more CPU-intensive. As a result, newly added compounds in PubChem may be considered in 2-D neighboring sooner than 3-D neighboring. Therefore, the use of a CID cut-off allows for a more direct comparison of 2-D and 3-D neighbor counts for the purpose of this study.
pccompound_mesh for MeSH-(A): this filter includes compounds that have a link to the MeSH database. MeSH (Medical Subject Headings) is the NLM controlled vocabulary thesaurus, and is used to index PubMed citations. The chemicals with a MeSH link have been mentioned in the biomedical literature on several occasions and are deemed (by human curators) to be of sufficient importance to be added to MeSH. The MeSH links are generated using PubChem chemical name matching approaches.
pccompound_structure for Protein3D-(A): this filter includes compounds found in the Molecular Modeling Database (MMDB). The MMDB contains experimentally resolved structures of proteins, RNA and DNA, derived from the Protein Data Bank (PDB), including information about small molecule ligands bound to macromolecule structures. Therefore, the Protein3D-(A) set contains the compounds whose macromolecule-bound 3-D experimental structure is available.
pccompound_mesh_pharm for PharmAct-(A): this filter includes compounds that have a pharmacological action link in the MeSH database, indicating that the biological role of the chemical is known. Note that PharmAct-(A) is a subset of MeSH-(A).
pccompound_drugs for Drug-(A): this filter limits compounds to those that are known drugs as defined by the PubChem integration of the NLM DailyMed resource. The information content of DailyMed is provided by the U.S. Food and Drug Administration (FDA) and includes structured product labelling (SPL) drug information submitted by the drug companies that manufacture and sell the drugs.
These filters allow users to obtain CIDs that have particular annotation types. For example, CIDs with the “drug” annotation can be retrieved via the URL: https://www.ncbi.nlm.nih.gov/pccompound/?term=pccompound_drugs[filter].
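The same filter terms can, in principle, be used programmatically. The sketch below only builds a query URL and does not contact any server. It assumes the NCBI E-utilities `esearch` endpoint with the `pccompound` database, which is not mentioned in the text; the helper name is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities esearch endpoint (not named in the article).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_filter_query_url(filter_name, retmax=20):
    """Build an esearch URL that lists CIDs matching a PubChem Compound filter."""
    params = {
        "db": "pccompound",                  # Entrez database for PubChem Compound
        "term": f"{filter_name}[filter]",    # e.g. pccompound_drugs[filter]
        "retmax": retmax,                    # number of identifiers to return
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_filter_query_url("pccompound_drugs")
print(url)
```

Swapping in `pccompound_mesh`, `pccompound_structure`, or `pccompound_mesh_pharm` would target the other Series A annotation types described above.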
Series B data sets
The present study attempts to compare the ability of PubChem 2-D and 3-D neighborings to interrelate chemicals that have a particular annotation type in common. Two primary issues are apparent in the analysis of the Series A data sets. First, not all compounds in the Series A data sets have the necessary computed 3-D conformer models required for 3-D neighboring, as the PubChem3D project covers only about 90% of the compound records in PubChem [6, 18, 23], excluding by design multi-component structures like salts. Second, it is not uncommon for annotation to be attributed to a salt form as opposed to the primary active component (see Fig. 1 for an example).
To address these two issues, Series B data sets were generated by including the “parents” of all the compounds in the Series A data sets and then by selecting only those resulting structures with an available computed 3-D conformer description. To achieve this, NCBI’s FLink was used with the PubChem Compound Entrez filters “pccompound_pccompound_parent” and “has_3d_conformer” to retrieve the parent compounds and the structures with a computed 3-D description, respectively. PubChem defines the “parent” of a mixture as the carbon-containing component whose heavy atom count is ≥70% of the sum of the heavy atom counts of all unique covalent units. The parent compound is neutralized through modification of its protonation state during the PubChem standardization process. Because PubChem3D does not compute conformer models for compound records with multiple covalent units [6, 18, 23], all of the resulting compounds in the Series B data sets are single-component compounds with computed 3-D conformer descriptions. The sizes of the data sets in both Series A and B are compared in Table 1.
PubChem 2-D and 3-D neighboring relationships
PubChem “Similar Compounds” 2-D neighboring
PubChem 2-D neighboring quantifies molecular similarity using the PubChem substructure fingerprints and Tanimoto coefficient as described above. If two chemical structures in PubChem have a Tanimoto score of 0.9 or greater, they are considered as “Similar Compound” 2-D neighbors of each other.
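The 2-D neighbor test can be sketched as follows. This is an illustration rather than PubChem's implementation: fingerprints are represented as sets of "on" bit positions, the bit patterns are invented, and the function names are hypothetical. Only the Tanimoto formula and the 0.9 threshold come from the text.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # assumed convention for two empty fingerprints
    return len(a & b) / union


def is_2d_neighbor(bits_a, bits_b, threshold=0.9):
    """Apply the 'Similar Compounds' criterion: Tanimoto >= 0.9."""
    return tanimoto(bits_a, bits_b) >= threshold


# Invented bit patterns, for illustration only.
fp_x = set(range(20))            # 20 bits set
fp_y = set(range(19)) | {25}     # shares 19 bits with fp_x, plus one extra
fp_z = set(range(10))            # shares only 10 bits with fp_x

print(tanimoto(fp_x, fp_y), is_2d_neighbor(fp_x, fp_y))
print(tanimoto(fp_x, fp_z), is_2d_neighbor(fp_x, fp_z))
```

Two stereo isomers with identical fingerprints would give a score of exactly 1.0 under this function, which is why the 2-D method always treats them as neighbors.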
PubChem “Similar Conformers” 3-D neighboring
PubChem 3-D neighboring is described in detail elsewhere [6, 7]. It quantifies molecular similarity using the Gaussian-shape overlay method by Grant and Pickup [12–15], implemented in ROCS [16, 17]. In this approach, molecular shape is described with an atom-centered Gaussian function, which allows for a rapid shape superposition, compared to hard sphere volume approaches. Recent studies [36–38] show that this method can be comparable with, and often better than, structure-based approaches in virtual screening, both in terms of overall performance and consistency.
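The defining equations for the three 3-D similarity scores are not present in this text. As an aid to the reader, the forms commonly given in the PubChem3D literature are sketched below. The notation is an assumption: V_AA and V_BB denote the self-overlap volumes of conformers A and B, V_AB their overlap volume, and the superscript f marks the corresponding overlap volumes for feature atoms of type f.

```latex
\begin{align*}
\mathrm{ST} &= \frac{V_{AB}}{V_{AA} + V_{BB} - V_{AB}} \\[4pt]
\mathrm{CT} &= \frac{\sum_{f} V^{f}_{AB}}{\sum_{f} V^{f}_{AA} + \sum_{f} V^{f}_{BB} - \sum_{f} V^{f}_{AB}} \\[4pt]
\mathrm{ComboT} &= \mathrm{ST} + \mathrm{CT}
\end{align*}
```

Here ST is the shape-Tanimoto, CT the color-Tanimoto (feature similarity), and ComboT their sum.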
Because both ST and CT scores range from 0 (for no similarity) to 1 (for identical molecules), by definition, the ComboT score can have a value from 0 to 2 (without normalization).
These three metrics can be computed at two different conformer superpositions: (1) the shape-optimized (or ST-optimized) superposition, where the shape overlap between the two conformers is maximized, and (2) the feature-optimized (or CT-optimized) superposition, where both the shape and feature are considered simultaneously to find the best superposition between the conformers. As a result, PubChem3D quantifies 3-D molecular similarity using six different scores: ST, CT, and ComboT scores for each of the superposition methods. However, because PubChem 3-D neighboring uses the ST-optimized scores only, all the ST, CT, and ComboT scores mentioned in this paper refer to the ST-optimized scores, unless otherwise indicated.
If any of the conformer pairs arising from a pair of compounds has an ST score of ≥0.8 and a CT score of ≥0.5, those compounds are considered to be neighbors of each other. This guarantees that there is at least one conformer pair with a ComboT score ≥1.3 for each pair of compounds that are neighbors.
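The compound-level 3-D criterion can be sketched as a check over conformer-pair scores. The thresholds are those stated above; the function name and the example scores are hypothetical, and the scores are assumed to be ST-optimized values computed elsewhere.

```python
ST_MIN = 0.8  # shape-Tanimoto threshold stated in the text
CT_MIN = 0.5  # color-Tanimoto threshold stated in the text


def is_3d_neighbor(conformer_pair_scores):
    """Return True if any conformer pair meets both thresholds.

    conformer_pair_scores is an iterable of (ST, CT) tuples, one per
    conformer pair arising from the two compounds.
    """
    return any(st >= ST_MIN and ct >= CT_MIN for st, ct in conformer_pair_scores)


# Invented scores, for illustration only.
pair_a = [(0.85, 0.40), (0.79, 0.90), (0.82, 0.55)]  # third pair passes both tests
pair_b = [(0.90, 0.45), (0.75, 0.70)]                # ComboT >= 1.3, yet no pair passes

print(is_3d_neighbor(pair_a))
print(is_3d_neighbor(pair_b))
```

The second example shows that a ComboT of at least 1.3 is implied by the criterion but is not sufficient on its own: both thresholds must be met by the same conformer pair.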
Conformer models for 3-D neighboring
PubChem 3-D neighboring requires a computed 3-D conformer model for each compound considered. These conformer models were generated using the OMEGA software from OpenEye Scientific Software, Inc., as described in more detail elsewhere [6, 18, 23]. While these conformer models contain up to 500 sampled conformers for each compound, many of the PubChem3D services support only up to ten conformers per compound. To ensure that the conformers employed represent the overall diversity of shape and feature of a given molecule, PubChem3D computes a diverse conformer ordering. This conformer ordering provides guidance on what conformers to choose when only a subset of the conformers available in a conformer model are used for 3-D similarity comparison.
Despite the use of various filtering schemes to improve its speed [7, 20], 3-D neighboring is not fast enough to consider all possible conformers for each compound. The initial PubChem3D neighboring started a few years ago using a single conformer per compound and it has been gradually extended to more diverse conformers per compound. Up to ten conformers per compound will be considered in the future. The neighboring results used in the present study were from five diverse conformers per compound, as available in PubChem as of January 2013.
Definition of Neighbor Preference Index
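The defining equation for the NPI is not present in this text. The sketch below uses the form NPI = (N2D − N3D)/(N2D + N3D), which is an assumption chosen to match the properties stated earlier: it is −1 for compounds with 3-D neighbors only, +1 for compounds with 2-D neighbors only, and 0 when the two counts are equal. Whether the counts include neighbors found by both methods is also an assumption; the function name is hypothetical.

```python
def neighbor_preference_index(n_2d, n_3d):
    """Assumed NPI form: (N2D - N3D) / (N2D + N3D).

    n_2d and n_3d are the numbers of 2-D and 3-D neighbors of a compound
    within a given data set. Returns None when the compound has no neighbors,
    since the ratio is undefined in that case.
    """
    total = n_2d + n_3d
    if total == 0:
        return None
    return (n_2d - n_3d) / total


# Limiting cases described in the text.
print(neighbor_preference_index(1, 0))    # 2-D neighbors only
print(neighbor_preference_index(0, 1))    # 3-D neighbors only
print(neighbor_preference_index(12, 12))  # equal counts
```

Because the counts depend on which compounds are in the data set, the same compound can yield different values for different sets, as the aspirin and indomethacin examples illustrate.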
Note that, because the choice of neighboring thresholds affects the neighbor counts of a given compound and hence its NPI value, the use of NPI values for comparing the two neighboring methods requires that the neighboring thresholds employed be comparable. Therefore, given that the two PubChem neighboring approaches are established with thresholds that are unlikely to change, it is worthwhile to consider the statistical basis of the two PubChem similarity methods.
A recent study shows that the average and standard deviation of the 2-D and 3-D similarity scores are 0.42 ± 0.13 and 0.77 ± 0.13, respectively, for randomly selected biologically tested compounds in PubChem. The 3-D similarity statistics are for the ST-optimized ComboT score [Eq. (4)] of a compound–compound pair, which is the highest ComboT score among those of all conformer pairs arising from the compound pair (computed using ten diverse conformers per compound). While not exactly the same as PubChem 3-D neighboring, these 3-D statistics should be considered as a lower bound, with PubChem 3-D neighboring further restricted (and being more exclusive) by statistically more significant thresholds of ST ≥0.8 and CT ≥0.5 (and a minimum ComboT ≥1.3). These statistics suggest that the 2-D and 3-D neighboring thresholds are 3.7 and 4.1 standard deviations away from the random average values, respectively [i.e., 3.7 = (0.90 − 0.42)/0.13 for 2-D similarity and 4.1 = (1.3 − 0.77)/0.13 for 3-D similarity]. This translates into a probability of two random structures in PubChem being 2-D neighbors and 3-D neighbors as 0.0111% (1 in 9000) and 0.00228% (1 in 43,825), respectively. In addition, it suggests that the two neighboring thresholds are comparable (within a factor of five of each other, i.e., 4.86 = 0.0111%/0.00228%), with a small bias towards 2-D neighbors. Lastly, it also suggests that the thresholds are suitably high to limit chance correlations of neighbors.
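The figures in this paragraph can be reproduced approximately with a short calculation. The sketch assumes that the similarity scores of random pairs follow a normal distribution and that the quoted probabilities are one-sided upper-tail values; neither assumption is stated explicitly in the text. Variable and function names are hypothetical.

```python
import math


def upper_tail_probability(z):
    """One-sided upper-tail probability of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Means, standard deviations and thresholds quoted in the text.
z_2d = (0.90 - 0.42) / 0.13   # 2-D Tanimoto threshold
z_3d = (1.30 - 0.77) / 0.13   # minimum ComboT implied by ST >= 0.8 and CT >= 0.5

p_2d = upper_tail_probability(z_2d)
p_3d = upper_tail_probability(z_3d)

print(f"z(2-D) = {z_2d:.2f}, p = {100 * p_2d:.4f}%")
print(f"z(3-D) = {z_3d:.2f}, p = {100 * p_3d:.5f}%")
print(f"ratio p(2-D)/p(3-D) = {p_2d / p_3d:.2f}")
```

Under these assumptions the unrounded z-scores, rather than the rounded 3.7 and 4.1, are what give probabilities close to the quoted 0.0111% and 0.00228%.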
SK generated and analyzed the data and wrote the first manuscript. EEB and SHB reviewed the manuscript. All authors read and approved the final manuscript.
We are grateful to the NCBI Systems staff, especially Ron Patterson, Charlie Cook, and Don Preuss, whose efforts helped make the PubChem3D project possible. We also thank Cindy Clark, NIH Library Editing Service, for reviewing the manuscript. This research was supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health, U.S. Department of Health and Human Services.
The authors declare that they have no competing interests.
COPYRIGHT NOTICE. The article is a work of the United States Government; Title 17 U.S.C. 105 provides that copyright protection is not available for a work of the United States government in the United States. Additionally, this is an open access article distributed under the terms of the Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0), which permits worldwide unrestricted use, distribution, and reproduction in any medium for any lawful purpose.
- Kim S, Thiessen PA, Bolton EE, Chen J, Fu G, Gindulyte A, Han L, He J, He S, Shoemaker BA, Wang J, Yu B, Zhang J, Bryant SH (2016) PubChem substance and compound databases. Nucleic Acids Res 44:D1202–D1213
- Wang YL, Xiao JW, Suzek TO, Zhang J, Wang JY, Bryant SH (2009) PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res 37:W623–W633
- Wang YL, Bolton E, Dracheva S, Karapetyan K, Shoemaker BA, Suzek TO, Wang JY, Xiao JW, Zhang J, Bryant SH (2010) An overview of the PubChem BioAssay resource. Nucleic Acids Res 38:D255–D266
- Wang YL, Suzek T, Zhang J, Wang JY, He SQ, Cheng TJ, Shoemaker BA, Gindulyte A, Bryant SH (2014) PubChem BioAssay: 2014 update. Nucleic Acids Res 42:D1075–D1082
- Acland A, Agarwala R, Barrett T, Beck J, Benson DA, Bollin C, Bolton E, Bryant SH, Canese K, Church DM, Clark K, DiCuccio M, Dondoshansky I, Federhen S, Feolo M, Geer LY, Gorelenkov V, Hoeppner M, Johnson M, Kelly C, Khotomlianski V, Kimchi A, Kimelman M, Kitts P, Krasnov S, Kuznetsov A, Landsman D, Lipman DJ, Lu ZY, Madden TL et al (2013) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 41:D8–D20
- Bolton EE, Chen J, Kim S, Han L, He S, Shi W, Simonyan V, Sun Y, Thiessen PA, Wang J, Yu B, Zhang J, Bryant SH (2011) PubChem3D: a new resource for scientists. J Cheminform 3:32
- Bolton EE, Kim S, Bryant SH (2011) PubChem3D: similar conformers. J Cheminform 3:13
- PubChem substructure fingerprint description. ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.pdf
- Holliday JD, Hu CY, Willett P (2002) Grouping of coefficients for the calculation of inter-molecular similarity and dissimilarity using 2D fragment bit-strings. Comb Chem High Throughput Screen 5:155–166
- Chen X, Reynolds CH (2002) Performance of similarity measures in 2D fragment-based similarity searching: comparison of structural descriptors and similarity coefficients. J Chem Inf Comput Sci 42:1407–1414
- Holliday JD, Salim N, Whittle M, Willett P (2003) Analysis and display of the size dependence of chemical similarity coefficients. J Chem Inf Comput Sci 43:819–828
- Grant JA, Pickup BT (1995) A Gaussian description of molecular shape. J Phys Chem 99:3503–3510
- Grant JA, Pickup BT (1996) A Gaussian description of molecular shape (vol 99, pg 3505, 1995). J Phys Chem 100:2456–2456
- Grant JA, Gallardo MA, Pickup BT (1996) A fast method of molecular shape comparison: a simple application of a Gaussian description of molecular shape. J Comput Chem 17:1653–1666
- Grant JA, Pickup BT (1997) Gaussian shape methods. In: van Gunsteren WF, Weiner PK, Wilkinson AJ (eds) Computer simulation of biomolecular systems. Kluwer Academic Publishers, Dordrecht, pp 150–176
- Rush TS, Grant JA, Mosyak L, Nicholls A (2005) A shape-based 3-D scaffold hopping method and its application to a bacterial protein-protein interaction. J Med Chem 48:1489–1495
- ROCS: Rapid Overlay of Chemical Structures, Version 3.1.0, OpenEye Scientific Software, Inc.: Santa Fe, NM, 2010
- Bolton EE, Kim S, Bryant SH (2011) PubChem3D: conformer generation. J Cheminform 3:4
- Bolton EE, Kim S, Bryant SH (2011) PubChem3D: diversity of shape. J Cheminform 3:9
- Kim S, Bolton EE, Bryant SH (2011) PubChem3D: shape compatibility filtering using molecular shape quadrupoles. J Cheminform 3:25
- Kim S, Bolton EE, Bryant SH (2011) PubChem3D: biologically relevant 3-D similarity. J Cheminform 3:26
- Kim S, Bolton E, Bryant S (2012) Effects of multiple conformers per compound upon 3-D similarity search and bioassay data analysis. J Cheminform 4:28
- Kim S, Bolton EE, Bryant SH (2013) PubChem3D: conformer ensemble accuracy. J Cheminform 5:1View ArticleGoogle Scholar
- Haque IS, Pande VS (2010) PAPER-Accelerating Parallel Evaluations of ROCS. J Comput Chem 31:117–132View ArticleGoogle Scholar
- Medical Subject Headings. https://www.ncbi.nlm.nih.gov/mesh. Accessed 14 Sept 2016
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The Protein Data Bank. Nucleic Acids Res 28:235–242
- Madej T, Addess KJ, Fong JH, Geer LY, Geer RC, Lanczycki CJ, Liu CL, Lu SN, Marchler-Bauer A, Panchenko AR, Chen J, Thiessen PA, Wang YL, Zhang DC, Bryant SH (2012) MMDB: 3D structures and macromolecular interactions. Nucleic Acids Res 40:D461–D464
- DailyMed. https://dailymed.nlm.nih.gov. Accessed 14 Sept 2016
- Gohlke BO, Overkamp T, Richter A, Richter A, Daniel PT, Gillissen B, Preissner R (2015) 2D and 3D similarity landscape analysis identifies PARP as a novel off-target for the drug Vatalanib. BMC Bioinform 16:9
- Hu GP, Kuang GL, Xiao W, Li WH, Liu GX, Tang Y (2012) Performance evaluation of 2D fingerprint and 3D Shape similarity methods in virtual screening. J Chem Inf Model 52:1103–1113
- Thimm M, Goede A, Hougardy S, Preissner R (2004) Comparison of 2D similarity and 3D superposition: application to searching a conformational drug database. J Chem Inf Comput Sci 44:1816–1822
- Medina-Franco JL, Martinez-Mayorga K, Bender A, Marin RM, Giulianotti MA, Pinilla C, Houghten RA (2009) Characterization of activity landscapes using 2D and 3D similarity methods: consensus activity cliffs. J Chem Inf Model 49:477–491
- Koutsoukas A, Paricharak S, Galloway W, Spring DR, Ijzerman AP, Glen RC, Marcus D, Bender A (2014) How diverse are diversity assessment methods? A comparative analysis and benchmarking of molecular descriptor space. J Chem Inf Model 54:230–242
- Bender A, Jenkins JL, Scheiber J, Sukuru SCK, Glick M, Davies JW (2009) How similar are similarity searching methods? A principal component analysis of molecular descriptor space. J Chem Inf Model 49:108–119
- NCBI FLink. https://www.ncbi.nlm.nih.gov/Structure/flink/flink.cgi. Accessed 14 Sept 2016
- Hawkins PCD, Skillman AG, Nicholls A (2007) Comparison of shape-matching and docking as virtual screening tools. J Med Chem 50:74–82
- Venhorst J, Nunez S, Terpstra JW, Kruse CG (2008) Assessment of scaffold hopping efficiency by use of molecular interaction fingerprints. J Med Chem 51:3222–3229
- Sheridan RP, McGaughey GB, Cornell WD (2008) Multiple protein structures and multiple ligands: effects on the apparent goodness of virtual screening results. J Comput Aided Mol Des 22:257–265
- ShapeTK-C++, Version 1.8.0, OpenEye Scientific Software, Inc.: Santa Fe, NM, 2010