PubChem3D: Shape compatibility filtering using molecular shape quadrupoles
© Kim et al; licensee Chemistry Central Ltd. 2011
Received: 12 May 2011
Accepted: 20 July 2011
Published: 20 July 2011
PubChem provides a 3-D neighboring relationship, which involves finding the maximal shape overlap between two static compound 3-D conformations, a computationally intensive step. It is highly desirable to avoid this overlap computation, especially if it can be determined with certainty that a conformer pair cannot meet the criteria to be a 3-D neighbor. As such, PubChem employs a series of pre-filters, based on the concept of volume, to remove approximately 65% of all conformer neighbor pairs prior to shape overlap optimization. Given that molecular volume, a somewhat vague concept, is rather effective, it leads one to wonder: can the existing PubChem 3-D neighboring relationship, which consists of billions of shape similar conformer pairs from tens of millions of unique small molecules, be used to identify additional shape descriptor relationships? Or, put more specifically, can one place an upper bound on shape similarity using other "fuzzy" shape-like concepts like length, width, and height?
Using a basis set of 4.18 billion 3-D neighbor pairs identified from single conformer per compound neighboring of 17.1 million molecules, shape descriptors were computed for all conformers. These steric shape descriptors included several forms of molecular volume and shape quadrupoles, which essentially embody the length, width, and height of a conformer. For a given 3-D neighbor conformer pair, the volume and each quadrupole component (Qx, Qy, and Qz) were binned and their frequency of occurrence was examined. Per molecular volume type, this effectively produced three different maps, one per quadrupole component (Qx, Qy, and Qz), of allowed values for the similarity metric, shape Tanimoto (ST) ≥ 0.8.
The efficiency of these relationships (in terms of true positive, true negative, false positive and false negative) as a function of ST threshold was determined in a test run of 13.2 billion conformer pairs not previously considered by the 3-D neighbor set. At an ST ≥ 0.8, a filtering efficiency of 40.4% of true negatives was achieved with only 32 false negatives out of 24 million true positives, when applying the separate Qx, Qy, and Qz maps in a series (Qxyz). This efficiency increased linearly as a function of ST threshold in the range 0.8-0.99. The Qx filter was consistently the most efficient followed by Qy and then by Qz. Use of a monopole volume showed the best overall performance, followed by the self-overlap volume and then by the analytic volume.
Application of the monopole-based Qxyz filter in a "real world" test of 3-D neighboring of 4,218 chemicals of biomedical interest against 26.1 million molecules in PubChem reduced the total CPU cost of neighboring by 24-38% and, if used as the initial filter, removed from consideration 48.3% of all conformer pairs at almost negligible computational overhead.
Basic shape descriptors, such as those embodied by size, length, width, and height, can be highly effective in identifying shape incompatible compound conformer pairs. When performing a 3-D search using a shape similarity cut-off, computation can be avoided by identifying conformer pairs that cannot meet the result criteria. Applying this methodology as a filter for PubChem 3-D neighboring computation, an improvement of 31% was realized, increasing the average conformer pair throughput from 154,000 to 202,000 per second per CPU core.
PubChem is an open and free resource of the biological activities of small molecules [1–4]. PubChem has an integrated theoretical 3-D layer, PubChem3D [5–7], which provides a precomputed 3-D neighboring relationship called "Similar Conformers" to help users locate and relate data in the archive. "Similar Conformers" identifies chemicals with similar 3-D shape and similar 3-D orientation of functional groups typically used to define pharmacophores (defined here simply as "features"), complementing a PubChem 2-D neighboring relationship called "Similar Compounds", which identifies closely related chemical analogs using the PubChem 2-D subgraph fingerprint. Effectively, for each PubChem chemical structure, this 3-D neighboring relationship provides (at the time of writing) the results of a 3-D similarity search against 28.9 million compound records using three diverse conformers per molecule.
Shape similarity between two conformers is quantified by the shape Tanimoto (ST), given by Equation 1:

$$ST = \frac{V_{AB}}{V_{AA} + V_{BB} - V_{AB}} \qquad (1)$$

where $V_{AA}$ and $V_{BB}$ are the self-overlap volumes of conformers A and B, respectively, and $V_{AB}$ is the common overlap volume between A and B. The 3-D neighboring requires finding the maximum shape similarity between static compound 3-D conformations, as dictated by $V_{AB}$ in Equation 1, to calculate ST, a computationally intensive step. It is highly desirable to avoid this overlap computation, especially if it can be determined with certainty that a conformer pair cannot meet the criteria to be a 3-D neighbor. As such, PubChem employs a series of filters, based on the concept of volume, to effectively ignore approximately 65% of all conformer neighbor pairs during 3-D neighboring, thus dramatically accelerating processing.
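As an illustration of why volume alone can rule out a pair, the following minimal Python sketch computes ST from the three overlap volumes and derives an upper bound on ST from the two self-overlap volumes. The bound rests on a simplifying assumption that is not taken from the text, namely that the common overlap volume cannot exceed the smaller self-overlap volume. The function names are hypothetical and do not reflect the PubChem implementation.

```python
def shape_tanimoto(v_aa: float, v_bb: float, v_ab: float) -> float:
    """Shape Tanimoto from two self-overlap volumes and the common overlap volume."""
    return v_ab / (v_aa + v_bb - v_ab)


def st_upper_bound(v_aa: float, v_bb: float) -> float:
    """Upper bound on ST, ASSUMING V_AB <= min(V_AA, V_BB).

    Substituting V_AB = min(V_AA, V_BB) into the ST definition gives
    min(V_AA, V_BB) / max(V_AA, V_BB).
    """
    smaller, larger = sorted((v_aa, v_bb))
    return smaller / larger


def may_be_neighbor(v_aa: float, v_bb: float, st_threshold: float = 0.8) -> bool:
    """Return False only when the volume bound shows ST cannot reach the threshold."""
    return st_upper_bound(v_aa, v_bb) >= st_threshold
```

Under this assumption, a pair with self-overlap volumes of 300 and 400 has an ST of at most 0.75 and can be skipped at ST ≥ 0.8 without any superposition optimization.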
Volume, although a rather fuzzy concept, is quite effective as a filter for conformer pairs dissimilar in shape and features. Conceivably, there are other aspects of molecular shape beyond volume that can be used to "recognize" when two shapes are (dis)similar. Descriptors one can readily imagine are those associated with length, width, and height. Steric shape quadrupoles embody such a concept, and attempts have been made to use their differences as a shape similarity metric [11, 12]. This leads to the question: can additional simple shape descriptor relationships be identified that improve upon the volume-based filtering efficacy? Or, put another way, can one place an upper bound on shape similarity by identifying some (additional) crude shape compatibility between conformers?
In this paper, we examine the use of shape descriptors as a means to rapidly identify "dissimilar" molecule shapes. As a part of this, we attempt to answer the critical questions: are vague shape descriptors representing the concepts of length, width, and height good discriminators of molecular shape? Can 3-D similarity searching speed be further accelerated using shape descriptors more sophisticated than volume? Is it possible to create a "shape compatibility" mapping indexed to shape similarity?
As a general premise, if two molecules with the same volume also have identical values for the quadrupole components, they are likely to be shape similar to each other. In addition, as the quadrupole moment difference deviates from zero, the maximum shape similarity is expected to decrease (see Figure 1). When the quadrupole (and volume) difference becomes greater than some value or threshold, the shape dissimilarity is such that the molecule conformer pair cannot possibly meet the criteria to be a PubChem 3-D neighbor (ST ≥ 0.8). Therefore, if we know these quadrupole difference thresholds for a given volume pair, one may be able to preclude conformer pairs that are not sufficiently shape similar, using only knowledge of the volume and quadrupole moments.
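The volumes and quadrupole components were converted into integer bin indices, $V^{bin}$ and $Q^{bin}$, by dividing each value by a bin size, BinSize (Equations 4 and 5), where superscript "bin" is used to distinguish these integers from the original, non-binned values. The denominator BinSize was 5.0 Å³ for all three volume types, and 2.5 Å⁵, 0.5 Å⁵, and 0.1 Å⁵ for $Q_x$, $Q_y$, and $Q_z$, respectively. After all 4.18 billion 3-D neighbors were binned according to their $V^{bin}$ and $Q^{bin}$ values, the 3-D neighbor distribution for a given ($V_1^{bin}$, $V_2^{bin}$) pair was analyzed as a function of $\Delta Q^{bin}$.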
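The binning step can be sketched in Python as follows. The bin sizes are those stated in the text. The rounding rule (truncation toward zero via `int`) is an assumption, since the text only states that the values are converted into integers, and the function and constant names are hypothetical.

```python
from typing import Tuple

VOLUME_BIN_SIZE = 5.0                                   # Å^3, same for all volume types
QUADRUPOLE_BIN_SIZE = {"x": 2.5, "y": 0.5, "z": 0.1}    # Å^5, per quadrupole component


def to_bin(value: float, bin_size: float) -> int:
    """Convert a descriptor value to an integer bin index.

    ASSUMPTION: truncation toward zero; the source does not specify the rounding rule.
    """
    return int(value / bin_size)


def bin_descriptors(volume: float, qx: float, qy: float, qz: float) -> Tuple[int, int, int, int]:
    """Return (V_bin, Qx_bin, Qy_bin, Qz_bin) for one conformer."""
    return (
        to_bin(volume, VOLUME_BIN_SIZE),
        to_bin(qx, QUADRUPOLE_BIN_SIZE["x"]),
        to_bin(qy, QUADRUPOLE_BIN_SIZE["y"]),
        to_bin(qz, QUADRUPOLE_BIN_SIZE["z"]),
    )
```

The progressively finer bin sizes for the three components reflect that the third component spans a much narrower range of values than the first.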
These modified $\Delta Q^{bin}$ threshold maps are designated as quadrupole filters. For simplicity, we name these filters with a capital letter "F" followed by a subscript, which represents one of the quadrupole components, and a superscript, which represents the type of volume involved. For example, filter "$F_x^{an}$" indicates the $Q_x$ filter generated with the analytic volume, $V_{an}$.
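Applying one such filter amounts to a table lookup. The sketch below assumes the threshold map is stored as a dictionary keyed by the ordered pair of binned volumes, with the largest allowed absolute binned quadrupole difference as the value. The data layout, the function name, and the choice to reject volume pairs that are absent from the map are all assumptions made for illustration.

```python
from typing import Dict, Tuple

# Hypothetical layout: (smaller V_bin, larger V_bin) -> max allowed |delta Q_bin|
ThresholdMap = Dict[Tuple[int, int], int]


def passes_quadrupole_filter(
    threshold_map: ThresholdMap,
    v1_bin: int,
    v2_bin: int,
    q1_bin: int,
    q2_bin: int,
) -> bool:
    """Return True if the pair may still reach the ST threshold the map was built for."""
    key = (min(v1_bin, v2_bin), max(v1_bin, v2_bin))
    max_delta = threshold_map.get(key)
    if max_delta is None:
        # ASSUMPTION: no neighbor was ever observed for this volume pair,
        # so the pair is treated as shape incompatible.
        return False
    return abs(q1_bin - q2_bin) <= max_delta
```

Only pairs for which this check returns True would need to proceed to the expensive shape superposition optimization.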
Accuracy of filtering as a function of volume type and quadrupole component at ST ≥ 0.80 threshold
Of the three volume types utilized, the monopole-based quadrupole filters, $F^{mp}$, are arguably the best. Filter $F_x^{mp}$ removed 4.78 billion pairs (36.3%), while incurring a loss of only 30 out of 24 million "potential" neighbors. [Note that the definition of a PubChem 3-D neighbor involves feature similarity as well as shape similarity, while the quadrupole filters deal only with shape similarity. As such, the 30 pairs filtered out had an ST score sufficient to be a 3-D neighbor, making each a "potential" 3-D neighbor.] The false negative count of 30 removed by $F_x^{mp}$ is negligible, but does show that use of such a filter will preclude some potential 3-D neighbors, in this case at a rate of 1 in 800,000.
Filters $F_y^{mp}$ and $F_z^{mp}$ are not as efficient as $F_x^{mp}$, but could still filter out 3.92 billion pairs (29.8%) and 3.59 billion pairs (27.3%), respectively, when considered individually. If the three $F^{mp}$ filters are used in a series (denoted as $F_{xyz}^{mp}$, and applied one after the other), 5.33 billion pairs (40.4%) could be removed with a loss of only 32 potential neighbors. The $F^{so}$ filters showed similar performance to $F^{mp}$, but filtered out more potential neighbors (288 for $F_{xyz}^{so}$ versus 32 for $F_{xyz}^{mp}$) and removed slightly fewer non-neighbors (39.1% for $F_{xyz}^{so}$ versus 40.4% for $F_{xyz}^{mp}$). The $F^{an}$ filters showed the least loss of potential neighbors (4 for $F_{xyz}^{an}$ versus 32 for $F_{xyz}^{mp}$), but also removed the fewest non-neighbors (29.0% for $F_{xyz}^{an}$ versus 40.4% for $F_{xyz}^{mp}$).
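The accounting behind these numbers can be sketched as a simple tally over pairs with known ST scores. The mapping of outcomes to labels below is an assumption about the terminology: a pair that is filtered out counts as a true negative when its ST is below the threshold and as a false negative (a lost potential neighbor) when its ST meets the threshold. The type and function names are hypothetical.

```python
from typing import Iterable, NamedTuple, Tuple


class FilterOutcome(NamedTuple):
    true_positive: int    # kept, and ST >= threshold
    false_positive: int   # kept, but ST < threshold
    true_negative: int    # filtered out, and ST < threshold
    false_negative: int   # filtered out, although ST >= threshold


def evaluate_filter(
    pairs: Iterable[Tuple[float, bool]], st_threshold: float = 0.8
) -> FilterOutcome:
    """Tally filter outcomes for (st_score, passed_filter) records."""
    tp = fp = tn = fn = 0
    for st, passed in pairs:
        is_neighbor = st >= st_threshold
        if passed and is_neighbor:
            tp += 1
        elif passed:
            fp += 1
        elif is_neighbor:
            fn += 1
        else:
            tn += 1
    return FilterOutcome(tp, fp, tn, fn)
```

With the figures quoted above, 30 lost potential neighbors out of 24 million corresponds to 24,000,000 / 30 = 800,000, the stated rate of 1 in 800,000.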
Table 4. Acceleration of PubChem 3-D neighboring using the quadrupole filter. Reported quantities: Diverse Conformer Count; Average Query Conformers per compound; Average Search Conformers per compound; Total Compound Pairs (billions); Total Conformer Pairs (billions); Total Search Time before (days); Total Search Time after (days); Conformer pair throughput (thousands/sec); % Speed up.
Table 5. Efficiency of the quadrupole filter. Reported quantities: Diverse Conformer Count; Total Conformer Pairs (billions); CT Feature Filter; CT Volume Filter; ST Volume Filter; Alignment Recycling Fingerprint; Alignment Recycling Overlap; Alignment Recycling Time; Superposition Optimization Time; Other Overhead Time.
Considering that PubChem 3-D neighboring is a precomputed similarity search, the neighboring throughput improvements using $F_{xyz}^{mp}$ are substantial, with an average improvement of 31% across the range of conformer counts per compound. Perhaps surprisingly, the filtering removes only 7% of the conformer pairs, yet achieves a 31% neighboring throughput improvement. This emphasizes the dramatic cost/benefit difference between the inexpensive computation needed to achieve the 7% reduction and the computation expended on those pairs in its absence.
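The relationship between throughput gain and time saved can be checked with a short calculation. The sketch below uses the average throughput figures reported for this work (154,000 and 202,000 conformer pairs per second per CPU core); the function names are hypothetical, and the time-saved figure is a derived estimate rather than a reported result.

```python
def speedup_percent(throughput_before: float, throughput_after: float) -> float:
    """Percentage increase in throughput (pairs processed per second)."""
    return (throughput_after / throughput_before - 1.0) * 100.0


def time_saved_percent(throughput_before: float, throughput_after: float) -> float:
    """Percentage reduction in wall time for the same number of pairs."""
    return (1.0 - throughput_before / throughput_after) * 100.0


speedup = speedup_percent(154_000, 202_000)        # about 31%
time_saved = time_saved_percent(154_000, 202_000)  # about 24%
```

In other words, the roughly 7% of conformer pairs removed by the filter would otherwise have accounted for close to a quarter of the total processing time, which is why a small reduction in pair count yields a much larger gain in throughput.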
It is important to note that $F_{xyz}^{mp}$ is not the first filter applied in Table 5, meaning that there are three other filters utilized before $F_{xyz}^{mp}$. The filters are ordered so as to maximize the cost/benefit of each filter. To examine what happens if $F_{xyz}^{mp}$ is used as the first filter, neighboring was repeated for the case of one diverse conformer per compound. When used first, $F_{xyz}^{mp}$ removes 48.3% of all conformer pairs (44.8%, 2.1%, and 1.4% of conformer pairs for $F_x^{mp}$, $F_y^{mp}$, and $F_z^{mp}$, applied in that order, respectively) versus the 7.4% shown in Table 5. The CT Feature, CT Volume, and ST Volume filters, applied in that order, remove 27.9%, 0.1%, and 0.002% of conformer pairs, respectively, when $F_{xyz}^{mp}$ is applied first.
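The dependence of each filter's apparent efficiency on its position in the sequence can be illustrated with a generic filter cascade. This is a minimal sketch with hypothetical names, not the PubChem pipeline: each pair is charged to the first filter that rejects it, so a filter placed later only sees what earlier filters let through.

```python
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

Pair = TypeVar("Pair")


def run_cascade(
    pairs: Iterable[Pair],
    filters: List[Tuple[str, Callable[[Pair], bool]]],
) -> Tuple[List[Pair], Dict[str, int]]:
    """Apply named keep-predicates in order; count removals per filter."""
    removed = {name: 0 for name, _ in filters}
    survivors: List[Pair] = []
    for pair in pairs:
        for name, keep in filters:
            if not keep(pair):
                removed[name] += 1   # charged to the first rejecting filter
                break
        else:
            survivors.append(pair)   # passed every filter
    return survivors, removed
```

Reordering the filters leaves the set of survivors unchanged but redistributes the removal counts, which is consistent with the quadrupole filter accounting for 48.3% of pairs when applied first but only 7.4% when applied after the existing filters.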
Simple molecular shape descriptors, volume and steric quadrupole moments (embodying the length, width, and height of a shape), of 4.18 billion 3-D neighbor pairs resulting from PubChem 3-D neighboring of 17.1 million single conformer molecules were analyzed. The maximum quadrupole differences between neighbor conformers were determined. This examination demonstrated a distinct dependency of shape similarity upon quadrupole variation. With some slight modification of fringe regions, the results of this analysis were turned into a computationally inexpensive, yet highly effective, set of filters capable of removing 3-D conformer pairs that cannot meet a required shape similarity, using only knowledge of the volume and steric quadrupole moments of the conformer pair. When applied in the context of shape similarity searching, these filters can significantly improve throughput by avoiding expensive superposition optimization of conformer pairs that cannot possibly meet a pre-defined shape similarity search threshold.
The filters devised were tested using a dataset of 13.2 billion compound pairs. The quadrupole filters based on a monopole volume showed the best efficacy, while the filters using an analytic volume had the lowest efficacy. For all three volume types, the $Q_x$ filters eliminated a larger portion of the compound pairs than the $Q_y$ and $Q_z$ filters. When the three filters were used in series, they could eliminate 30-40% of non-neighbor pairs, with the removal of a negligible number of potential neighbors. For example, the $Q_{xyz}$ filter based on the monopole volume ($F_{xyz}^{mp}$) could eliminate 40.4% of the 13.2 billion compound pairs with a loss of 32 potential neighbors out of 24 million at a shape Tanimoto (ST) threshold of 0.80. It was also demonstrated that this filtering efficiency improves linearly as a function of shape similarity threshold, approaching 100% efficiency at an ST threshold of 0.99. Further testing of the filters in the context of PubChem 3-D neighboring processing resulted in conformer pair throughput improvements of 31% on average.
In summary, the quadrupole filters developed in this study can speed up PubChem 3-D neighbor processing with a negligible loss of 3-D neighbors. However, their applicability is not limited to PubChem 3-D neighboring. The results of the present study also suggest that shape multipole moments can be applied generally to enhance the speed of 3-D similarity search methods by the rapid preclusion of dissimilar molecules that cannot be a result. This approach may be able to significantly speed up 3-D similarity search, especially if the 3-D shape superposition optimization is the bottleneck of the search.
At the time of project initiation, PubChem 3-D neighboring of 17,143,181 unique molecules (ranging from CID 1 to CID 25,000,000) had been completed using a single conformer per compound, yielding 4,182,412,802 3-D neighbors. Using the Shape Toolkit from OpenEye Software, the analytic volume ($V_{an}$), monopole volume ($V_{mp}$), self-overlap volume ($V_{so}$), and steric shape quadrupole moments ($Q_x$, $Q_y$, and $Q_z$) were computed for the theoretical conformer of all 17.1 million molecules. See Figures 2 and 4 for the distributions of the computed values.
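To give a feel for how three quadrupole components can encode length, width, and height, the following Python sketch computes the principal values of a weighted second-moment tensor for a set of points. This is a crude stand-in and an assumption throughout: it is not the OpenEye Shape Toolkit calculation, the per-point weights are hypothetical placeholders for atomic volumes, and the convention of sorting the components in descending order is chosen for illustration only.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def second_moment_tensor(points: Sequence[Point], weights: Sequence[float]) -> List[List[float]]:
    """Weighted second-moment tensor about the weighted centroid."""
    total = sum(weights)
    centroid = [sum(w * p[k] for p, w in zip(points, weights)) / total for k in range(3)]
    tensor = [[0.0] * 3 for _ in range(3)]
    for p, w in zip(points, weights):
        d = [p[k] - centroid[k] for k in range(3)]
        for i in range(3):
            for j in range(3):
                tensor[i][j] += w * d[i] * d[j]
    return tensor


def symmetric_eigenvalues(a: List[List[float]]) -> Tuple[float, float, float]:
    """Eigenvalues of a symmetric 3x3 matrix, descending (closed-form trigonometric method)."""
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0.0:  # already diagonal
        e = sorted((a[0][0], a[1][1], a[2][2]), reverse=True)
        return (e[0], e[1], e[2])
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    det_b = (
        b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
        - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
        + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0])
    )
    r = max(-1.0, min(1.0, det_b / 2.0))  # clamp against rounding error
    phi = math.acos(r) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    e2 = 3.0 * q - e1 - e3
    return (e1, e2, e3)


def steric_quadrupoles(points: Sequence[Point], weights: Sequence[float]) -> Tuple[float, float, float]:
    """Return (Qx, Qy, Qz) as principal second moments, largest first (illustrative only)."""
    return symmetric_eigenvalues(second_moment_tensor(points, weights))
```

Because the principal values do not depend on how the point set is oriented, a flat, elongated arrangement yields one large, one small, and one near-zero component regardless of its rotation in space.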
The quadrupole filters developed for pre-screening conformer-pairs based on quadrupole differences as a function of shape similarity ST threshold were generated using the following steps:
1) The 4.18 billion 3-D neighbor pairs and their associated data were obtained from PubChem.
2) The volumes ($V_{mp}$, $V_{so}$, and $V_{an}$) and quadrupole components ($Q_x$, $Q_y$, and $Q_z$) of the compound pair for each 3-D neighbor were converted into integers using Equations 4 and 5 to yield $V_{mp}^{bin}$, $V_{so}^{bin}$, $V_{an}^{bin}$, $Q_x^{bin}$, $Q_y^{bin}$, and $Q_z^{bin}$, respectively. The denominator BinSize was 5.0 Å³ for all three volume types and 2.5 Å⁵, 0.5 Å⁵, and 0.1 Å⁵ for $Q_x$, $Q_y$, and $Q_z$, respectively.
3a) Of the two conformers in a 3-D neighbor, the one with the smaller $V^{bin}$ value was designated as molecule 1 and the other as molecule 2. When the $V^{bin}$ value was the same for both, the one with the smaller $Q_x^{bin}$ value was designated as molecule 1. If both the $V^{bin}$ and $Q_x^{bin}$ values were the same for both, the one with the smaller $Q_y^{bin}$ was designated as molecule 1. If $Q_y^{bin}$ was also the same for both molecules, the one with the smaller $Q_z^{bin}$ was designated as molecule 2. If all four descriptors were the same for both molecules, the one that appears first for the pair was designated as molecule 1.
3b) 3-D neighbors were binned according to three indices, $V_1^{bin}$, $V_2^{bin}$, and $\Delta Q_x^{bin}$, where subscripts 1 and 2 indicate molecules 1 and 2, respectively, determined in step 3a, and $\Delta Q_x^{bin}$ is the $Q_x^{bin}$ difference between the two molecules.
3c) The neighbor count for all ($V_1^{bin}$, $V_2^{bin}$, $\Delta Q_x^{bin}$) bins was analyzed to find the maximum possible absolute value of $\Delta Q_x^{bin}$ for a given ($V_1^{bin}$, $V_2^{bin}$) pair. This results in the $\Delta Q_x^{bin}$ difference maps as a function of binned volume pairs [see panels (a) and (b) in Figures 7, 8 and 9].
4) To obtain filters effective at an ST threshold other than ≥ 0.80, first restrict the original 4.18 billion 3-D neighbor pairs to those at or above the desired ST threshold and repeat step 3.
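The map-building procedure in the steps above can be sketched as follows for a single quadrupole component. The record layout and function name are hypothetical, the inputs are assumed to be already binned, and the pair ordering is simplified to a plain tuple comparison. Since only the absolute difference is retained here, the tie-break ordering within equal volume bins does not affect the result. The modification of fringe regions mentioned in the text is not represented.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical record: (ST score, descriptors of conformer A, descriptors of conformer B),
# where descriptors are already binned as (V_bin, Qx_bin, Qy_bin, Qz_bin).
Descriptor = Tuple[int, int, int, int]
NeighborRecord = Tuple[float, Descriptor, Descriptor]


def build_qx_threshold_map(
    neighbors: Iterable[NeighborRecord], st_threshold: float = 0.8
) -> Dict[Tuple[int, int], int]:
    """Map (V1_bin, V2_bin) -> largest |delta Qx_bin| seen among neighbors at or above st_threshold."""
    threshold_map: Dict[Tuple[int, int], int] = {}
    for st, desc_a, desc_b in neighbors:
        if st < st_threshold:
            continue  # step 4: restrict to pairs meeting the desired ST threshold
        # Step 3a (SIMPLIFIED): order the pair so molecule 1 has the smaller volume bin.
        mol1, mol2 = (desc_a, desc_b) if desc_a <= desc_b else (desc_b, desc_a)
        key = (mol1[0], mol2[0])
        # Steps 3b and 3c: track the maximum absolute Qx bin difference per volume pair.
        delta_qx = abs(mol2[1] - mol1[1])
        if delta_qx > threshold_map.get(key, -1):
            threshold_map[key] = delta_qx
    return threshold_map
```

Raising `st_threshold` can only shrink or preserve each recorded maximum, which is consistent with the filters becoming more selective at higher ST thresholds.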
To test the efficiency of the quadrupole filters devised, two sets of molecules were chosen. One set contains molecules in the PubChem CID range of 1 to 25,000,000, and the other contains those in the CID range of 25,000,001 to 25,001,000. Because a theoretical conformer was not generated for all CIDs or because compound records were not "live", the two datasets had 17,488,897 and 753 molecules, respectively. All-by-all comparison between the two sets gives 13,169,139,441 CID pairs. Using the first diverse conformer for each compound, the ST values for these 13.2 billion pairs were computed using ROCS from OpenEye software, Inc., consuming ~419 CPU days in total, and stored. These ST scores were used to estimate how many CID pairs would be filtered out when applying the quadrupole filters as a function of volume type and as a function of ST threshold, for example, as demonstrated in Table 3 and Figure 10.
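The size of this test set, and the implied cost of computing every ST score without any filtering, can be checked with a few lines of Python. The per-second rate below is a derived estimate based on the approximate CPU-day figure, not a reported measurement, and the variable names are hypothetical.

```python
first_set_size = 17_488_897   # molecules in the CID range 1 to 25,000,000
second_set_size = 753         # molecules in the CID range 25,000,001 to 25,001,000

# All-by-all comparison between the two sets.
total_pairs = first_set_size * second_set_size

cpu_days = 419                # approximate total reported in the text
seconds_per_day = 24 * 60 * 60
pairs_per_cpu_second = total_pairs / (cpu_days * seconds_per_day)  # roughly 360
```

The product reproduces the 13,169,139,441 CID pairs stated above.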
One aspect of this effort is to examine the change in real-world efficiency of PubChem3D neighboring processing when using quadrupole filters while computing the 3-D "Similar Conformers" relationship. To achieve this, the set of 4,218 biologically relevant chemical structures with known pharmacological actions from our earlier efforts was used. These small molecules with known biological action (Query set) were neighbored against 26,157,365 compound records (Search set), representing the entire "live" PubChem3D contents as of Oct. 2010, using up to 1, 3, 5, 7, and 10 diverse conformers per compound for both compound sets. Timing and efficiency differences with our earlier work are given in Tables 4 and 5.
We are grateful to the NCBI Systems staff, especially Ron Patterson, Charlie Cook, and Don Preuss, whose efforts helped make the PubChem3D project possible. This research was supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health, U.S. Department of Health and Human Services.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.