- Research article
- Open Access
PubChem3D: Diversity of shape
© Bolton et al; licensee Chemistry Central Ltd. 2011
Received: 10 January 2011
Accepted: 21 March 2011
Published: 21 March 2011
The shape diversity of 16.4 million biologically relevant molecules from the PubChem Compound database and their 1.46 billion diverse conformers was explored as a function of molecular volume.
The diversity of shape space was investigated by determining, for each unit of conformer volume, the shape similarity threshold that yields a maximum count of reference shapes. The rate of growth in shape space, as represented by a decreasing shape similarity threshold, was found to be remarkably smooth as a function of volume. There was no apparent correlation between the count of conformers per unit volume and their diversity, meaning that a single reference shape can describe the shape space of many chemical structures. The ability of a volume to describe the shape space of lesser volumes was also examined. A given volume was able to describe 40-70% of the shape diversity of lesser volumes for the majority of the volume range considered in this study.
The relative growth of shape diversity as a function of volume and shape similarity is surprisingly uniform. Given the distribution of chemicals in PubChem versus what is theoretically synthetically possible, the results from this analysis should be considered a conservative estimate of the true diversity of shape space.
Virtual screening of large chemical databases is now a routine practice in modern drug discovery [1–8]. One successful virtual screening approach is to compare the 3-D shape similarity of chemical structures using atom-centered Gaussian functions [9–11], e.g., as implemented in ROCS. While this Gaussian-based approach to shape can perform hundreds or even thousands of chemical structure 3-D shape superposition computations per second per Central Processing Unit (CPU) core, even faster approaches with similar efficacy would be welcomed when searching a database of millions of chemical structures and (potentially) billions of conformers.
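Shape similarity is quantified by the shape Tanimoto (ST):
$$ST = \frac{V_{AB}}{V_{AA} + V_{BB} - V_{AB}}$$
where $V_{AA}$ and $V_{BB}$ are the self-overlap volumes of molecules A and B, and $V_{AB}$ is the overlap volume between them. The ST score ranges from 0 (for no shape similarity) to 1 (for identical shapes).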
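As an illustration of how an ST score follows from overlap volumes, the sketch below evaluates the standard Tanimoto form, ST = V_AB / (V_AA + V_BB - V_AB), for two sets of atom-centered spherical Gaussians in a fixed orientation. It is a minimal sketch under stated assumptions, not the ROCS implementation: only pairwise (first-order) Gaussian overlaps are summed, no overlap optimization is performed, and the amplitude `P`, the helper names, and the toy coordinates are all hypothetical.

```python
import math

P = 2.7  # assumed Gaussian amplitude, chosen only for illustration


def alpha_for_radius(radius, p=P):
    """Gaussian width such that p * exp(-alpha * r^2) integrates to the
    volume of a hard sphere with the given radius."""
    return math.pi * (3.0 * p / (4.0 * math.pi)) ** (2.0 / 3.0) / (radius * radius)


def overlap_volume(atoms_a, atoms_b, p=P):
    """First-order overlap: sum of pairwise Gaussian product integrals.
    Each atom is a tuple (x, y, z, radius)."""
    total = 0.0
    for xa, ya, za, ra in atoms_a:
        a = alpha_for_radius(ra, p)
        for xb, yb, zb, rb in atoms_b:
            b = alpha_for_radius(rb, p)
            d2 = (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
            total += p * p * (math.pi / (a + b)) ** 1.5 * math.exp(-a * b * d2 / (a + b))
    return total


def shape_tanimoto(atoms_a, atoms_b):
    """ST = V_AB / (V_AA + V_BB - V_AB) for the current relative orientation."""
    v_aa = overlap_volume(atoms_a, atoms_a)
    v_bb = overlap_volume(atoms_b, atoms_b)
    v_ab = overlap_volume(atoms_a, atoms_b)
    return v_ab / (v_aa + v_bb - v_ab)


# Toy "molecules": two atoms each, the second shifted by 0.5 along y.
mol_a = [(0.0, 0.0, 0.0, 1.7), (1.5, 0.0, 0.0, 1.7)]
mol_b = [(0.0, 0.5, 0.0, 1.7), (1.5, 0.5, 0.0, 1.7)]
print(shape_tanimoto(mol_a, mol_a), shape_tanimoto(mol_a, mol_b))
```

In practice the relative orientation of the two molecules would be optimized to maximize the overlap before the score is reported; the sketch skips that step.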
A second 3-D similarity approach using reference shapes [14] attempted to improve upon the first method by giving both a shape superposition and some assurance that the shape similarity ST is similar to that provided by ROCS. This was achieved by recognizing that two chemical structure conformers with similar 3-D shape align to a common reference shape in a similar fashion. By utilizing the 3 × 3 rotational matrix and XYZ translational vector that align a 3-D chemical structure conformer to a common reference shape (retained after shape fingerprint generation), one could generate a superposition between conformers for each common reference shape. Given that two similar conformers may have multiple common reference shapes, one may "replay" all the alignments to common reference shapes and pick the one that yields the best shape superposition. This approach achieved a 100-fold performance improvement by avoiding any shape similarity computation when shapes were too dissimilar (i.e., there were no common reference shapes) and by avoiding any volume overlap optimization. However, this methodology has its downsides: it only considered relatively small (<28 non-hydrogen atoms) and inflexible (<6 rotatable bonds) chemical structures, and it would not compute any shape similarity value when there was no common reference shape. Yet both studies [13, 14] showed that the use of reference shapes holds promise for dramatically improving the throughput of shape-based alignment methodologies.
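The idea of replaying stored alignments can be sketched in a few lines. The example assumes the convention that each stored transform maps a conformer point x into the reference frame as R·x + t; that convention, and the function names, are assumptions made for illustration and are not taken from the cited work.

```python
def mat_vec(m, v):
    """Multiply a 3 x 3 matrix (list of rows) by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def transpose(m):
    return [[m[j][i] for j in range(3)] for i in range(3)]


def to_reference(rot, trans, point):
    """Apply a stored alignment: rotate, then translate into the reference frame."""
    p = mat_vec(rot, point)
    return [p[i] + trans[i] for i in range(3)]


def replay_alignment(rot_a, trans_a, rot_b, trans_b, point_a):
    """Map a point of conformer A into conformer B's frame through the shared
    reference shape, without any new overlap optimization."""
    in_ref = to_reference(rot_a, trans_a, point_a)
    shifted = [in_ref[i] - trans_b[i] for i in range(3)]
    # The inverse of a rotation matrix is its transpose.
    return mat_vec(transpose(rot_b), shifted)


# Toy transforms: A is rotated 90 degrees about z, B is only translated.
rot_a = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
trans_a = [1.0, 0.0, 0.0]
rot_b = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
trans_b = [0.0, 2.0, 0.0]
print(replay_alignment(rot_a, trans_a, rot_b, trans_b, [1.0, 0.0, 0.0]))
```

When two conformers share several reference shapes, each shared reference gives one candidate superposition of this kind, and the candidate with the highest ST would be kept.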
The first work described above [13] considered data sets of "drug-like" molecules with 12-32 non-hydrogen atoms and conformer counts between 50,000 and 500,000 to examine the growth of shape space as a function of ST value. This growth was linear when considering the logarithm of the count of reference shapes and chemical structures, whether using a single conformer or multiple conformers per structure. The second work [14] also considered reference shapes of "drug-like" molecules of similar size, using a much larger dataset of one million chemical structures and fifteen million conformers, but only at a single ST value, as opposed to a range of ST values. Still, both studies gave valuable insight into how shape space grows with "drug-like" molecules.
In this work, we seek to expand upon these earlier two efforts by exploring in more depth the rate of growth of shape space as a function of reference shape count, conformer volume, and ST value with a much larger data set of 16.4 million biologically relevant small molecules and their 1.46 billion diverse conformers. By improving upon the understanding of the relative growth of shape space of biologically relevant molecules, new or improved "shape fingerprint"-based methodologies might be developed.
Results and Discussion
1. Conformer generation
2. Generation of reference shapes per volume
The shape diversity of a particular conformer volume may be ascertained by clustering conformers of that volume with a certain shape diversity threshold (ST thresh), which controls the "minimum" distance between any two clusters, and then counting the reference shapes, each of which represents a cluster centroid and all conformers within ST thresh of the reference shape. [Note that the ST thresh is the "maximum" ST value between clusters since the ST score is a similarity measure, not a dissimilarity measure.] If the clustering is performed using the same ST thresh value for a volume range, the shape diversity as a function of molecular volume may be evaluated by the growth of the number of reference shapes. However, when a constant ST thresh value is used across a range of volumes, each increase in the molecular volume may result in a very rapid growth of the shape space and, hence, of the number of reference shapes per volume. This is undesirable because the computational cost of clustering grows roughly as the square (or worse) of the total count of reference shapes, especially when this count is large: N reference shapes must be compared against K conformers, with N < K. The count of reference shapes must therefore be kept to a manageable size for tractability.
where V is the conformer volume and ST thresh is the shape Tanimoto threshold for the given volume that yields 200 or fewer reference conformers. The slope of the ST thresh curve shows that the increase in the cluster distances becomes slower as the conformer volume increases; however, this reduction may be an artifact of the input. This study only considered chemical structures found in PubChem and was restricted to 50 or fewer non-hydrogen atoms. Furthermore, the distribution of the non-hydrogen atom count peaked at 26. Conceivably, ST thresh may decrease at a more rapid rate if the count of chemical structures in PubChem continued to increase as a function of non-hydrogen atom count across the entire range, rather than peaking at 26. The net effect of this input artifact is that the true ST thresh curve may be more linear than the one shown in Figure 3. We expect that the entire curve may shift and appear more linear as more theoretically possible and diverse chemical structures are considered; however, we believe the trends detailed in this work should still hold true, unless noted otherwise. Irrespective of the explanation provided, one should consider the curve shown in Figure 3 a conservative estimate of the absolute growth of shape space.
The reference shape count per volume was found to range from 83 (for V = 92 Å³) to the maximum allowed of 200 (for V = 380 Å³), and its average was 147.9. Interestingly, the ST thresh curve does not reflect the maximum found in Figure 1 for conformer volume. In fact, the decrease in ST thresh as a function of volume is very smooth, suggesting that the actual conformer count per volume, as shown in Figure 1, has little bearing on shape diversity, as shown in Figure 3. Put another way, the shape space of known chemicals is not nearly as diverse as chemical space, with a relatively small number of reference shapes able to represent a large number of chemical structure conformers.
Another interesting observation is that a small change in ST thresh has a large effect on the reference count, as reflected in the somewhat periodic growth in reference shapes until the maximum value of 200 is reached, at which point the reference shape count is cut nearly in half. This can be roughly seen in the volume range 75-210 Å³ and then again between 275-375 Å³. This reflects the use of 0.01 decrements in ST thresh but also reflects anecdotal evidence seen when exploring the reference shapes, where each change in ST thresh by 0.01 appeared to change the reference count by about a factor of two, much as observed by Haigh et al. [13]. This is only roughly seen in the reference shape counts because two things are changing, the volume and the ST thresh value, and a volume change involves a potentially variable change in shape space.
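The thresholding scheme discussed here (lowering ST thresh in 0.01 steps until no more than 200 reference shapes remain for a volume) can be sketched as follows. This is a minimal illustration rather than the procedure actually used: a simple greedy selection stands in for the clustering, the "shapes" are points on a line, and `pick_references`, `lower_threshold_until`, and `toy_similarity` are hypothetical names.

```python
def pick_references(items, similarity, st_thresh):
    """Greedy selection: keep an item only if it is less similar than
    st_thresh to every reference kept so far."""
    refs = []
    for item in items:
        if all(similarity(item, ref) < st_thresh for ref in refs):
            refs.append(item)
    return refs


def lower_threshold_until(items, similarity, start=0.99, step=0.01, max_refs=200):
    """Lower st_thresh in fixed steps until at most max_refs references remain."""
    steps = 0
    while True:
        st_thresh = round(start - steps * step, 10)
        refs = pick_references(items, similarity, st_thresh)
        # Stop when the cap is met, or when the threshold cannot go lower.
        if len(refs) <= max_refs or st_thresh <= step:
            return st_thresh, refs
        steps += 1


def toy_similarity(a, b):
    """Toy stand-in for ST: 1 for identical points, decaying with distance."""
    return 1.0 / (1.0 + abs(a - b))


toy_items = [i * 0.01 for i in range(1000)]
thresh, refs = lower_threshold_until(toy_items, toy_similarity, max_refs=200)
print(thresh, len(refs))
```

Because the threshold moves in discrete steps, the reference count drops sharply each time the threshold is lowered, which is the kind of saw-tooth behavior the text attributes to the 0.01 decrements.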
3. Generation of unique shapes for each volume
There are a number of interesting observations one can make from these graphs. In Figure 7 and Figure 8 there is a banded behavior, indicated previously in Figure 3, which looks like a series of lines spaced further apart as the volume increases. This is due to the steady growth in shape space as volume increases and the use of 0.01 decrements of ST thresh. Whenever the ST thresh decreases by 0.01, a corresponding significant decrease in counts occurs. When the ST thresh value changes less, or does not change at all, the lines appear to be wider apart, reflecting just the growth in shape space due to volume.
Another interesting observation, in Figure 8(a), is that the absolute count of large unique shapes stays relatively constant across the volume range, with an average count and standard deviation of 22.2 ± 7.8 and a mode of 24. There is a shallow maximum at volume 145 Å³ followed by a relatively slow overall decline over the rest of the volume range. This decline appears most evident beyond volume 305 Å³, perhaps due to the truncation of the shape space considered, as represented by the rapid reduction in conformer count at larger volumes and the fact that the non-hydrogen atom count distribution peaks at 26.
Similar to the large unique shapes in Figure 8(a), the large and shared unique shapes in Figure 8(b) show a banded behavior across most of the volume range, with a reference count mean and standard deviation of 144.4 ± 23.7 and a mode of 140. There is a barely evident maximum at volume 228 Å³ and a slightly noticeable dip at volume 261 Å³, prior to resuming the similar narrow band of large and shared unique shapes. This may suggest that the growth of large and shared shape space is relatively constant as a function of PubChem contents.
The small and shared unique shapes completely dominate in Figure 8(a), being nearly the same as the total count of unique shapes across the entire volume range; however, the small unique shapes in Figure 8(b) show a very shallow minimum at about volume 200 Å³ prior to significantly increasing as a function of volume. This may suggest that the growth of PubChem shape space slows (as a function of the rate of changing ST) after a point, with large unique shapes contributing less and less to the overall shape diversity across the full volume range as the total shape space that can be represented by larger shapes diminishes. One can see this to some extent in Figure 9, where the percentage of shared shape space is "Λ"-shaped, reaching a maximum of 73% at volume 217 Å³ and then steadily diminishing as a function of volume as the percentage of shape space of smaller shapes dominates. Again, it is reasonable to suggest that this observation is an artifact of the PubChem contents and not representative of what one might find if significantly more large chemical structures were considered in the range of 30-50 non-hydrogen atoms (i.e., if the non-hydrogen atom count distribution did not peak at 26, but continued to grow until the maximum considered of 50).
The shape diversity of biologically relevant molecules was investigated using 16.4 million molecules in the PubChem Compound database (as of January 2008), covering non-hydrogen atom counts up to 50 and effective rotors up to 15, as represented by 1.46 billion diverse conformers. After binning the conformers according to their volume, cluster analysis was performed to obtain a maximum count of non-redundant reference shapes, representing the shape space spanned by the conformers for a particular unit volume. The ST thresh value, which defines the maximum shape similarity between any two reference shapes for that volume, gradually decreased as the conformer volume increased. There was no apparent correlation between the count of conformers clustered and the shape diversity found. Furthermore, an analysis was performed to examine the rate of increase of new reference shapes as a function of volume and the percentage of shape space unique to a particular volume. Generally speaking, the rate of addition of new reference shapes as a function of increasing volume was relatively constant across the range of volumes considered; however, the ability of a particular volume to explain the shape diversity spanned by lesser volumes increased up to a point and then decreased, ranging between 40-70% of all unique shapes for most of the considered volume range (Figure 9).
Some of the results from this analysis should be considered an artifact of the contents of PubChem in that the population as a function of molecular size peaks at 26 non-hydrogen atoms and then rapidly declines. An exhaustive analysis of all "reasonable" theoretically possible molecules resulting from larger molecules may provide a different trend. As such, the results of this analysis should be considered a conservative estimate.
While it is unfortunate that the PubChem shape space is truncated relative to what is possible (due to the diminishing count of chemical structures with non-hydrogen atom counts greater than 26), one does see substantial evidence that shape space grows uniformly with a smoothly decreasing ST thresh and increasing molecular volume. One also sees that keeping the count of reference shapes at a maximum for a given volume gives an understanding of how diverse shape space is as a function of shape similarity. The apparent lack of dependence of the reference shape count on the count of conformers represented by a given volume demonstrates how redundant shape space is across the volume range; however, we believe that the ST thresh curve in Figure 3 may actually be linear or approach it, if provided an exhaustive set of theoretically possible but reasonable chemical structures, as the chemical structure shape possibilities surely are more diverse than the limited population of chemical structures available in PubChem for non-hydrogen atom counts greater than 30.
Materials and methods
1. Biologically relevant molecules in the PubChem Compound database
(1) Molecules only with a single covalent component were considered, since each component of mixtures and complexes has a unique Compound ID.
(2) Salts were not included because their parent molecules are also in the PubChem database.
(3) Molecules with a non-organic element were not included because they are not compliant with the 94s variant of the Merck molecular force-field (MMFF94s), which was used for conformer generation (without Coulomb interaction terms). For the same reason, molecules with an MMFF94s unparameterized element type (e.g., hyper-valent species) were removed.
where nr effective is the number of effective rotors, nr is the number of rotatable bonds, and nnara is the number of "non-aromatic" sp³-hybridized ring atoms.
(5) Molecules with more than 6 undefined stereocenters were also removed because they need substantial computational resources to consider.
2. Conformer generation
where nr effective is the number of effective rotors and nnha is the number of non-hydrogen atoms in a molecule. The maximum number of conformers in a conformer model for each molecule was limited to 500. If clustering resulted in more than 500 conformers, the clustering RMSD was incremented by 0.2 and the conformers were re-clustered, repeating until 500 or fewer conformers were achieved. The conformer models were then post-processed. This included full energy minimization of all hydrogen atom locations (all non-hydrogen atoms were kept frozen). Subsequent analysis removed any conformers with atom-atom "bumps", i.e., cases where the steric van der Waals interaction energy was greater than 25 kcal/mol.
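The conformer cap described above (re-clustering with the RMSD raised by 0.2 until 500 or fewer conformers remain) amounts to a simple loop. The sketch below is illustrative only: `cluster_by_rmsd` is a hypothetical greedy stand-in for the actual conformer clustering, conformers are represented by single numbers, and the starting RMSD is passed in rather than derived from the formula referred to in the text.

```python
def cluster_by_rmsd(conformers, rmsd, distance):
    """Greedy stand-in for RMSD clustering: keep a conformer only if it lies
    at least `rmsd` away from every representative kept so far."""
    reps = []
    for conf in conformers:
        if all(distance(conf, rep) >= rmsd for rep in reps):
            reps.append(conf)
    return reps


def cap_conformer_model(conformers, start_rmsd, distance,
                        max_conformers=500, increment=0.2):
    """Raise the clustering RMSD in fixed increments until the model fits the cap."""
    rmsd = start_rmsd
    reps = cluster_by_rmsd(conformers, rmsd, distance)
    while len(reps) > max_conformers:
        rmsd += increment
        reps = cluster_by_rmsd(conformers, rmsd, distance)
    return rmsd, reps


# Toy data: 2000 one-dimensional "conformers" spaced 0.5 apart.
toy_conformers = [i * 0.5 for i in range(2000)]
final_rmsd, model = cap_conformer_model(
    toy_conformers, start_rmsd=0.6, distance=lambda a, b: abs(a - b))
print(final_rmsd, len(model))
```

The loop always terminates for a non-empty input, because a sufficiently large RMSD leaves a single representative.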
3. General descriptions of the partition-clustering algorithm
Due to the rather large number of conformers involved, a "divide and conquer" approach with a multistage partition-based clustering algorithm (as shown in Figure 2) was employed. In the first phase of the partition-clustering algorithm, conformers were split into manageable sets (or partitions), each containing a certain number of conformers (N setsize = 50,000). Conformers in each set were randomly sampled such that no two selected conformers had an ST similarity greater than the shape diversity threshold (ST thresh). The selected conformers were retained for future analysis as cluster representatives, while the others were considered redundant and discarded. If the count of selected conformers in a given partition was greater than a preset limit, the partition-clustering procedure with these conformers was repeated at a decreased ST thresh value. After all conformer sets were sampled using ST thresh, the selected conformers from each set were combined and re-sampled as described above (i.e., divided up into partitions and sampled). When the total number of clusters became smaller than a preset limit, a "non-partition" clustering was performed to eliminate the redundancy among cluster representatives from the different partitions. If the number of clusters from the non-partition clustering procedure was greater than a preset limit, the clustering was repeated at a decreased ST thresh value. In the end, the ST score between any two conformers in the final reference set cannot exceed the ST thresh value. A final step involved comparing the reference set with all conformers represented by the reference set.
This procedure achieves several things. Firstly, it breaks up many millions of conformers into manageable sets. Secondly, it allows the shape diversity threshold to be dynamically decreased for individual conformer sets. Thirdly, it reduces a very large number of conformers to a manageable set of conformers that represent all possible shapes present.
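The overall flow of the procedure can be sketched as below. This is a simplified illustration under stated assumptions, not the implementation used in this study: a greedy pass stands in for the random sampling within each partition, the threshold is lowered only when a full pass fails to discard anything, and the names (`sample_diverse`, `partition_cluster`, `max_clusters`, `toy_similarity`) and toy sizes are hypothetical.

```python
import random


def sample_diverse(items, similarity, st_thresh):
    """Keep items so that no two kept items reach st_thresh similarity."""
    kept = []
    for item in items:
        if all(similarity(item, k) < st_thresh for k in kept):
            kept.append(item)
    return kept


def partition_cluster(items, similarity, st_thresh, set_size, max_clusters,
                      step=0.01, seed=0):
    """Multistage partition clustering: sample within partitions, pool the
    survivors, repeat, then finish with one non-partition pass."""
    if set_size < 2:
        raise ValueError("set_size must be at least 2")
    rng = random.Random(seed)
    pool = list(items)
    # Partition phase: repeat while the pool is too large for a single pass.
    while len(pool) > max_clusters:
        rng.shuffle(pool)
        survivors = []
        for start in range(0, len(pool), set_size):
            part = pool[start:start + set_size]
            survivors.extend(sample_diverse(part, similarity, st_thresh))
        if len(survivors) == len(pool):
            # Nothing was discarded at this threshold, so lower it.
            st_thresh -= step
        pool = survivors
    # Non-partition phase: remove redundancy between different partitions.
    return st_thresh, sample_diverse(pool, similarity, st_thresh)


def toy_similarity(a, b):
    """Toy stand-in for ST on one-dimensional "shapes"."""
    return 1.0 / (1.0 + abs(a - b))


data_rng = random.Random(1)
toy_items = [data_rng.uniform(0.0, 10.0) for _ in range(5000)]
final_thresh, refs = partition_cluster(
    toy_items, toy_similarity, st_thresh=0.95, set_size=500, max_clusters=1000)
print(final_thresh, len(refs))
```

The design point the sketch tries to capture is that no pass ever compares more than one partition's worth of conformers against its own representatives, which keeps each pass tractable, while the final pass guarantees that no two surviving reference shapes are more similar than the final threshold.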
3.1. Partition-clustering of conformers of a given volume
To study the shape diversity for a given volume, the conformers of the same volumes were partition-clustered, based on the procedures outlined in the previous section.
(1) The 1.46 billion conformers were grouped according to their volumes rounded to the nearest integers.
(2) The conformers for a given volume were partition-clustered until the total number of clusters fell below the preset limit of 6,000. The set of seed conformers representing these clusters was considered the "basis shapes" for that volume.
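Step (1) above is a plain grouping operation, sketched here for concreteness. The input layout (pairs of conformer identifier and volume) and the function name are assumptions, as is the choice to round exact halves upward; Python's built-in `round()` would instead round halves to the nearest even integer.

```python
import math
from collections import defaultdict


def bin_by_volume(conformers):
    """Group (conformer_id, volume) pairs by volume rounded to the nearest integer."""
    bins = defaultdict(list)
    for conf_id, volume in conformers:
        key = math.floor(volume + 0.5)  # round half up (assumed tie rule)
        bins[key].append(conf_id)
    return dict(bins)


print(bin_by_volume([("a", 92.4), ("b", 91.6), ("c", 380.5)]))
```

Each resulting bin would then be partition-clustered independently, as described in step (2).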
3.2. Generation of unique shapes
To investigate the shape space redundancy between different volumes, the unique shapes (Figures 4 and 5) for each volume were generated using two different clustering schemes: (1) the "small-then-large" method and (2) the "large-then-small" method (Figure 6). In the small-then-large method, the unique shapes for V = V1 were generated from clustering of the reference and basis shapes for V<V1, and re-clustering with the reference and basis shapes for V = V1, to locate those shapes unique only to the current volume. On the contrary, in the large-then-small method, the unique shapes were generated by pooling the reference shapes for V = V1, and re-clustering with the reference and basis shapes for V<V1, to locate only those shapes that are unique to lesser volumes.
1. The "small-then-large" approach
Fill cluster holes in the clusters from step (4) with the reference shapes for V = V1.
Fill cluster holes in the clusters from step (5) with the basis shapes for V = V1.
2. The "large-then-small" approach
Pool all reference shapes of V = V1.
Fill cluster holes in the clustered reference shapes for V = V1 [from step (1)] with the clusters from step (5).
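One way to read the "fill cluster holes" operation used in both approaches is as follows: start from an existing set of cluster representatives and add only those candidate shapes that are not already within ST thresh of a representative. The sketch below implements that reading; it is an interpretation offered for illustration, and the function name, the toy similarity, and the toy data are hypothetical.

```python
def fill_cluster_holes(existing, candidates, similarity, st_thresh):
    """Return the candidates that are not represented by `existing`
    (or by a candidate accepted earlier in the same pass)."""
    representatives = list(existing)
    added = []
    for cand in candidates:
        if all(similarity(cand, rep) < st_thresh for rep in representatives):
            representatives.append(cand)
            added.append(cand)
    return added


def toy_similarity(a, b):
    """Toy stand-in for ST on one-dimensional "shapes"."""
    return 1.0 / (1.0 + abs(a - b))


lesser_volume_shapes = [0.0, 1.0]        # shapes for V < V1
current_volume_shapes = [0.05, 2.0, 2.01]  # shapes for V = V1

# Small-then-large: shapes unique to the current volume.
unique_to_current = fill_cluster_holes(
    lesser_volume_shapes, current_volume_shapes, toy_similarity, 0.9)
# Large-then-small: shapes unique to the lesser volumes.
unique_to_lesser = fill_cluster_holes(
    current_volume_shapes, lesser_volume_shapes, toy_similarity, 0.9)
print(unique_to_current, unique_to_lesser)
```

The only difference between the two approaches in this reading is which set seeds the clusters and which set is tested against them.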
We are grateful to the NCBI Systems staff, especially Ron Patterson, Charlie Cook, and Don Preuss, whose efforts helped make the PubChem3D project possible. This research was supported in part by the Intramural Research Program of the National Library of Medicine, National Institutes of Health, U.S. Department of Health and Human Services. This study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD (http://biowulf.nih.gov).
- Yuriev E, Agostino M, Ramsland PA: Challenges and advances in computational docking: 2009 in review. J Mol Recognit. 2011, 24: 149-164. 10.1002/jmr.1077.
- Kirchmair J, Distinto S, Liedl KR, Markt P, Rollinger JM, Schuster D, Spitzer GM, Wolber G: Development of anti-viral agents using molecular modeling and virtual screening techniques. Infect Disord Drug Targets. 2011, 11: 64-93.
- Venkatraman V, Perez-Nueno VI, Mavridis L, Ritchie DW: Comprehensive comparison of ligand-based virtual screening tools against the DUD data set reveals limitations of current 3D methods. J Chem Inf Model. 2010, 50: 2079-2093. 10.1021/ci100263p.
- Sheridan RP, McGaughey GB, Cornell WD: Multiple protein structures and multiple ligands: effects on the apparent goodness of virtual screening results. J Comput Aided Mol Des. 2008, 22: 257-265. 10.1007/s10822-008-9168-9.
- Schneider G: Virtual screening: an endless staircase?. Nat Rev Drug Discovery. 2010, 9: 273-276. 10.1038/nrd3139.
- Nicholls A, McGaughey GB, Sheridan RP, Good AC, Warren G, Mathieu M, Muchmore SW, Brown SP, Grant JA, Haigh JA, et al: Molecular shape and medicinal chemistry: a perspective. J Med Chem. 2010, 53: 3862-3886. 10.1021/jm900818s.
- McGaughey GB, Sheridan RP, Bayly CI, Culberson JC, Kreatsoulas C, Lindsley S, Maiorov V, Truchon JF, Cornell WD: Comparison of topological, shape, and docking methods in virtual screening. J Chem Inf Model. 2007, 47: 1504-1519. 10.1021/ci700052x.
- Hawkins PCD, Skillman AG, Nicholls A: Comparison of shape-matching and docking as virtual screening tools. J Med Chem. 2007, 50: 74-82. 10.1021/jm0603365.
- Grant JA, Pickup BT: A gaussian description of molecular shape. J Phys Chem. 1995, 99: 3503-3510. 10.1021/j100011a016.
- Grant JA, Gallardo MA, Pickup BT: A fast method of molecular shape comparison: a simple application of a Gaussian description of molecular shape. J Comput Chem. 1996, 17: 1653-1666. 10.1002/(SICI)1096-987X(19961115)17:14<1653::AID-JCC7>3.0.CO;2-K.
- Grant JA, Pickup BT: Gaussian shape methods. Computer Simulation of Biomolecular Systems. Edited by: van Gunsteren WF, Weiner PK, Wilkinson AJ. 1997, Dordrecht: Kluwer Academic Publishers, 150-176.
- ShapeTK-C++, Version 1.8.0, OpenEye Scientific Software, Inc.: Santa Fe, NM.
- Haigh JA, Pickup BT, Grant JA, Nicholls A: Small molecule shape-fingerprints. J Chem Inf Model. 2005, 45: 673-684. 10.1021/ci049651v.
- Fontaine F, Bolton E, Borodina Y, Bryant SH: Fast 3D shape screening of large chemical databases through alignment-recycling. Chem Cent J. 2007, 1: 12. 10.1186/1752-153X-1-12.
- Bolton EE, Wang Y, Thiessen PA, Bryant SH: PubChem: integrated platform of small molecules and biological activities. Annual Reports in Computational Chemistry. Edited by: Wheeler RA, Spellmeyer DC. 2008, Elsevier, 4: 217-241. 10.1016/S1574-1400(08)00012-1.
- Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Federhen S, et al: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2010, 38: D5-D16. 10.1093/nar/gkp967.
- Wang YL, Xiao JW, Suzek TO, Zhang J, Wang JY, Bryant SH: PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009, 37: W623-W633. 10.1093/nar/gkp456.
- Wang YL, Bolton E, Dracheva S, Karapetyan K, Shoemaker BA, Suzek TO, Wang JY, Xiao JW, Zhang J, Bryant SH: An overview of the PubChem BioAssay resource. Nucleic Acids Res. 2010, 38: D255-D266. 10.1093/nar/gkp965.
- OMEGA, Version 2.1, OpenEye Scientific Software, Inc.: Santa Fe, NM.
- Bolton EE, Kim S, Bryant SH: PubChem3D: conformer generation. J Cheminformatics. 2011, 3: 4. 10.1186/1758-2946-3-4.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.