PubChem3D: Similar conformers
© Bolton et al; licensee Chemistry Central Ltd. 2011
Received: 24 February 2011
Accepted: 9 May 2011
Published: 9 May 2011
PubChem is a free and open public resource for the biological activities of small molecules. With many tens of millions of both chemical structures and biological test results, PubChem is a sizeable system with an uneven degree of available information. Some chemical structures in PubChem include a great deal of biological annotation, while others have little to none. To help users, PubChem pre-computes "neighboring" relationships to relate similar chemical structures, which may have similar biological function. In this work, we introduce a "Similar Conformers" neighboring relationship to identify compounds with similar 3-D shape and similar 3-D orientation of functional groups typically used to define pharmacophore features.
The first two diverse 3-D conformers of 26.1 million PubChem Compound records were compared to each other, using a shape Tanimoto (ST) of 0.8 or greater and a color Tanimoto (CT) of 0.5 or greater, yielding 8.16 billion conformer neighbor pairs and 6.62 billion compound neighbor pairs, with an average of 253 "Similar Conformers" compound neighbors per compound. Comparing the 3-D neighboring relationship to the corresponding 2-D neighboring relationship ("Similar Compounds") for molecules such as caffeine, aspirin, and morphine, one finds unique sets of related chemical structures, providing additional significant biological annotation. The PubChem 3-D neighboring relationship is also shown to be able to group a set of non-steroidal anti-inflammatory drugs (NSAIDs), despite limited PubChem 2-D similarity.
In a study of 4,218 chemical structures of biomedical interest, consisting of many known drugs, using more diverse conformers per compound results in more 3-D compound neighbors per compound; however, the compound neighbor lists of the individual conformers also increasingly resemble each other, being 38% identical at three conformers and 68% at ten conformers. Perhaps surprisingly, the average count of conformer neighbors per conformer increases rather slowly as a function of the number of diverse conformers considered, with only a 70% increase for a ten-fold growth in conformers per compound (a 68-fold increase in the conformer pairs considered).
Neighboring 3-D conformers on the scale performed, if implemented naively, is an intractable problem using a modestly sized compute cluster. The methodology developed in this work relies on a series of filters that avoid 3-D superposition optimization when it can be determined that two conformers cannot possibly be neighbors. Most filters are based on Tanimoto equation volume constraints, avoiding incompatible conformers; however, others consider preliminary superposition between conformers using reference shapes.
The "Similar Conformers" 3-D neighboring relationship locates similar small molecules of biological interest that may go unnoticed when using traditional 2-D chemical structure graph-based methods, making it complementary to such methodologies. The computational cost of 3-D similarity methodology on a wide scale, such as PubChem contents, is a considerable issue to overcome. Using a series of efficient filters, an effective throughput rate of more than 150,000 conformer pairs per second per processor core was achieved, more than two orders of magnitude faster than without filtering.
$$\mathrm{Tanimoto} = \frac{AB}{A + B - AB} \qquad (1)$$
where A and B are the respective counts of fingerprint set bits in the compound pair and AB is the count of bits in common.
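As a minimal illustration of the Tanimoto measure defined from A, B, and AB, the sketch below stores each fingerprint as a Python integer whose set bits are the fingerprint bits. The function name and the toy fingerprints are illustrative assumptions and do not reflect the PubChem fingerprint format.

```python
def tanimoto_2d(fp_a: int, fp_b: int) -> float:
    """Tanimoto similarity of two binary fingerprints stored as integers."""
    a = bin(fp_a).count("1")          # A: set bits in the first fingerprint
    b = bin(fp_b).count("1")          # B: set bits in the second fingerprint
    ab = bin(fp_a & fp_b).count("1")  # AB: set bits common to both
    union = a + b - ab
    # Two empty fingerprints have no bits to compare; returning 0.0 here is
    # a convention chosen for this sketch (an assumption).
    return ab / union if union else 0.0


# Toy fingerprints with 4 and 3 set bits, 2 of them in common.
print(tanimoto_2d(0b101101, 0b100110))  # 2 / (4 + 3 - 2) = 0.4
```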
The "Similar Compounds" relationship is useful to relate analogues that may have similar biological activity or function and additional biological annotation; however, "Similar Compounds" is not particularly good at finding chemical structures that can adopt similar 3-D shape and similar 3-D orientation of functional groups typically used to define pharmacophore features (henceforth, these pharmacophore feature functional groups will be referred to as "pharmacophore features" or simply as "features"), which could indicate, for example, that the molecules bind to a protein in a similar fashion. It may be useful, therefore, to provide a "Similar Conformers" relationship in PubChem to help relate relevant conformers of chemical structures.
Wanting to compute a 3-D neighboring relationship with modest computational capacity on a very large scale and actually being able to do it are two very different things. For 30 million compounds, a neighboring relationship requires a minimum of 10^14 pair-wise similarity computations. The 2-D similarity of chemical structures with binary fingerprints is relatively fast, with rates of 10^6 compound pair similarities per second per processor core achievable. Computing the analogous 3-D pair-wise similarity of conformers is much slower, with rates of 10^2 to 10^3 per second per processor core (depending on the accuracy versus performance tradeoffs one is willing to make), when using atom-centered Gaussians [9–12] for the shape description. This difference in 2-D versus 3-D pair-wise similarity computation rate is made worse by another factor of 10^1 (or more), because 3-D methods need to consider multiple diverse conformers per chemical structure, since a small molecule can typically adopt multiple distinct shapes or orientations of pharmacophore features at room temperature. This makes 3-D chemical structure pair-wise similarity computation at least 10^4 to 10^5 times slower than its 2-D counterpart. This performance gap has led some to search for alternative approaches for determining 3-D similarity between small molecules.
In one such approach, 3-D similarity is recast to use a binary fingerprint to achieve a conformer pair-wise similarity computation speed similar to that of 2-D similarity computation. This scheme determines a set of representative 3-D reference shapes, each corresponding to a binary bit in a fingerprint. When generating the fingerprint for a 3-D chemical structure conformer, a traditional 3-D shape superposition to all reference shapes is performed. If there is sufficient similarity to a reference shape, the corresponding binary bit is set. Besides the pre-computation expense to determine the reference shapes to use and to generate the 3-D fingerprint for all conformers to be searched, this method has an important drawback. Unlike 2-D binary fingerprint methods, when two 3-D chemical structure conformers are deemed to be similar by this approach, it might not be immediately obvious why: the common bit values merely indicate that the two conformers share a region of shape-space, without the additional requirement that they actually share a sufficient degree of shape similarity.
An attempt was made to improve upon the 3-D binary fingerprint approach. In effect, the method was very similar, with a predetermination of reference shapes followed by 3-D shape fingerprint computation. Yet, there were a couple of important distinctions that, in essence, allowed the method to yield conformer superposition results much like "traditional" 3-D similarity superposition methods [10–12]. First, the alignment of the conformer to the reference shape was retained during shape fingerprint generation. Second, when a fingerprint "bit" was in common between two conformers, the retained alignments to the reference shape were used to yield an (approximate) alignment between the conformers. Dubbed "alignment recycling", this approach recognized that conformers with similar shape align to a reference shape in a similar way. By "replaying" the alignment to common reference shapes, the best superposition between the conformer pair is the result of the similarity computation. This approach, while not as fast as the method that used only a binary fingerprint, was 10^2 times faster than "traditional" 3-D similarity superposition methodology. A major downside to "alignment recycling" was that it was only parameterized for relatively small and inflexible chemical structures. Additional work is therefore necessary to extend this approach to larger and more flexible structures. In all, the above two 3-D fingerprint approaches showed great promise to dramatically improve the throughput of 3-D similarity computation.
To harness a 3-D fingerprint to speed 3-D similarity throughput, one must first determine the reference shapes to use. Recent efforts to describe the shape space of biologically relevant small molecules showed exponential behavior in reference shape count resulting from changes to the minimum shape Tanimoto (ST) distance between reference shapes. However, when examining the growth of shape space per unit volume for a maximum count of reference shapes, shape space was shown to grow gradually and smoothly as a function of ST. In addition, and generally speaking, it was shown that the shape space of a given unit volume describes 40-70% of the shape space of all chemical structures with a lesser volume. This would suggest that one could group together regions of shape space and describe them with a relatively small number of reference shapes, while avoiding the problem of having too many reference shapes. Reformulating the fingerprint definition with multiple tiers of fingerprints with different minimum ST distances between reference shapes may allow "alignment recycling" to be effective for larger and more flexible chemical structures, thus providing a means to speed computation of a 3-D neighboring relationship on a very large scale.
In this work, we describe the multi-conformer PubChem "Similar Conformers" 3-D neighboring relationship and explain various strategies and approaches that made it a tractable problem, including extending the "alignment recycling" methodology to cover the full range of chemical structures considered in the PubChem3D project.
Results and discussion
1. Description of "Similar Conformers" neighboring relationship
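$$CT = \frac{\sum_f V_{AB}^{f}}{\sum_f V_{AA}^{f} + \sum_f V_{BB}^{f} - \sum_f V_{AB}^{f}} \qquad (3)$$
where the index $f$ runs over the six independent fictitious feature atom types and, for each type, $V_{AA}^{f}$ and $V_{BB}^{f}$ are the respective self-overlap volumes and $V_{AB}^{f}$ is the overlap volume of conformers A and B.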
Pair-wise shape and feature comparison of conformers takes two basic steps: (1) optimization of the shape superposition between two 3-D chemical structures, to find their maximum shape overlap in terms of ST, and (2) a single-point CT computation at that maximum shape overlay. PubChem 3-D "Similar Conformers" neighbors are identified as any pair-wise conformer superposition with ST and CT values of ≥0.8 and ≥0.5 (actually ≥0.795 and ≥0.495, after floating point number rounding is considered), respectively.
An important issue with 3-D neighboring is the number of conformers considered. Although PubChem generates a conformer ensemble for each molecule, consisting of up to 500 sampled conformations, it is not practical to consider all of these for 3-D neighboring. Therefore, a selection of diverse conformers for each compound is considered for the purposes of 3-D neighboring. A detailed description of how the diverse conformer set is derived can be found in the Materials and Methods section (See "Diverse conformer concept").
It is important to note that 3-D neighboring using a single conformer per compound has a one-to-one correspondence between compound pairs and conformer pairs. When using multiple conformers per compound, it is possible that only a subset of possible conformer pairs per compound pair may satisfy the 3-D neighboring criteria. For clarification, a 3-D conformer neighbor pair is defined as any conformer pair with ST ≥ 0.8 and CT ≥ 0.5. If there is at least one conformer neighbor pair among all possible conformer pairs from a given compound pair, a compound neighbor pair results. In this work, a 3-D neighbor implies a 3-D compound neighbor. If further clarification is necessary, the terms 3-D compound neighbors and 3-D conformer neighbors are used.
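The distinction between conformer neighbor pairs and compound neighbor pairs can be sketched as follows. The thresholds are the rounded values quoted in the text; the `similarity` callable and the toy score table are hypothetical stand-ins for a real shape-optimized superposition.

```python
from itertools import product

ST_MIN = 0.795  # "ST >= 0.8" once floating point rounding is considered
CT_MIN = 0.495  # "CT >= 0.5" once floating point rounding is considered


def is_conformer_neighbor(st: float, ct: float) -> bool:
    """A conformer pair is a neighbor pair if both thresholds are met."""
    return st >= ST_MIN and ct >= CT_MIN


def is_compound_neighbor(conformers_a, conformers_b, similarity) -> bool:
    """True if at least one conformer pair across two compounds is a neighbor.

    `similarity(a, b)` is a hypothetical callable returning (ST, CT) for the
    shape-optimized superposition of conformers a and b.
    """
    return any(
        is_conformer_neighbor(*similarity(a, b))
        for a, b in product(conformers_a, conformers_b)
    )


# Toy scores for two compounds with two diverse conformers each.
scores = {
    ("a1", "b1"): (0.70, 0.60),  # shape too dissimilar
    ("a1", "b2"): (0.82, 0.40),  # features too dissimilar
    ("a2", "b1"): (0.81, 0.55),  # conformer neighbor pair
    ("a2", "b2"): (0.60, 0.20),
}
print(is_compound_neighbor(["a1", "a2"], ["b1", "b2"],
                           lambda a, b: scores[(a, b)]))  # True
```

Only one of the four conformer pairs passes both thresholds, which is enough to make the two compounds 3-D compound neighbors.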
2. The distribution of 3-D neighbors
3. Comparison of 2-D and 3-D similarity neighbors
While not all eight selected NSAID drug molecules are 3-D neighbors of each other, examining the 3-D neighbors of the 3-D neighbors shows that each of the eight drug molecules is related to one or more of the eight drug molecules, effectively forming a cluster of related drugs that are highly similar in terms of shape and pharmacophore features but rather dissimilar in terms of 2-D graph similarity. Actually, this "cluster" of NSAID drugs presented in Figure 6 is part of a larger 3-D cluster, with only eight of thirteen members being selected for clarity and demonstrative purposes. In addition, this is only one of several NSAID drug "clusters" that one can find using 3-D similarity. For the purposes of brevity and focus, only the drug class NSAIDs is explored, but suffice it to say that there are other examples one can find with other drug target classes that are similarly demonstrative.
If a molecule has known bioactivity, there is a reasonable expectation [26, 27] that its similarity neighbors may also be similarly bioactive. As demonstrated in Figures 6 and 7, the 3-D "Similar Conformers" relationship can be useful to identify structurally similar molecules that may be completely missed when only the 2-D "Similar Compounds" relationship is exploited. Therefore, one might consider using PubChem's precomputed 2-D and 3-D neighboring relationships as complementary virtual screening tools or to help understand how chemical structures relate to each other relative to their biological efficacy.
4. Effect of using multiple conformers
Taking into account all conformers of each CID for 3-D neighboring using the current methodology is simply not practical. The PubChem "Similar Conformers" neighboring relationship described here considers (at the time of writing) only two diverse conformers per compound (with a third conformer per compound soon to be released). One may wonder, as more conformers are considered, does one locate more chemical structures and, if so, to what extent? Is there a point of "diminishing returns", where a plateau forms in the curve of unique neighbor count as a function of diverse conformer count? Indirect evidence addressing aspects of these questions can be found in the 3-D neighboring data PubChem provides.
Sensitivity of conformer choice in 3-D neighboring.
To help address this question more directly, 4,218 compounds were 3-D neighbored against all of PubChem3D. This set of 4,218 compounds was selected using a query of the PubChem Compound database ("has pharm"[Filter] AND "has 3d conformer"[Filter] AND 0[AtomChiralUndefCount] AND 0[BondChiralUndefCount]). This query means that the queried chemical structures have known pharmacological action as annotated by MeSH, have a conformer model in PubChem3D, and have zero undefined SP2/SP3 stereo centers. (The last criterion is utilized solely to limit the count of chemical structures considered and should have no bearing on the results of this test.) The PubChem CIDs for the selected chemical structures are available in Additional file 1.
Table 2. Effect of using multiple conformers per compound on 3-D neighboring. Reported quantities: Diverse Conformer Count; Average Query Conformers per compound; Average Search Conformers per compound; Total Compound Pairs (billions); Total Conformer Pairs (billions); Conformer Pairs per Compound Pair; Total 3-D Compound Neighbors (millions); Total 3-D Conformer Neighbors (millions); Ratio of Conformer/Compound Neighbors; Average Compound 3-D Neighbors per Compound; Average Conformer 3-D Neighbors per Conformer; Total Search Time (days).
It is not completely clear why this should be so, but one consideration comes to mind. It may be an artefact of the nature of the diverse conformer relationship, whereby a default conformer is chosen as the first, the conformer most dissimilar to the default conformer is selected as the second, and each subsequent diverse conformer must be furthest away from the previously selected diverse conformers. This means that the most diverse conformers for a chemical structure are always considered first. Subsequently, each additional diverse conformer will increasingly resemble the previous diverse conformers, potentially yielding compound neighbors found previously by the other conformers for the same chemical structure. This is reflected by the ratio of conformer and compound 3-D neighbors. At three, five, and seven diverse conformers, 38%, 53%, and 61%, respectively, of the conformer neighbors point to the same compound neighbors. By ten diverse conformers, 68% of the conformer neighbors point to the same compound neighbors. With this said, one thing is clear. Neighboring more diverse conformers per compound will result in more compound neighbors per compound; however, the computational effort expended to do this grows much faster than the number of conformers considered, while an increasing share of the conformer neighbors merely shows additional ways in which the same two compounds are related.
One interesting aspect of Table 2 and Figure 10 is that the average conformer neighbor count per conformer grows very slowly. A ten-fold growth in conformers, corresponding to a 68-fold increase in conformer pairs considered, results in only a 70% increase in the average conformer neighbor count. This is somewhat surprising given the argument above. It appears to suggest that each added diverse conformer of a chemical structure is also adding a significant portion of unique shape/feature space. This is seen in Table 1, whereby the conformer neighbors of each of the first three diverse conformers of aspirin (CID 2244 or CID 450661) mostly had very little overlap, typically less than 20%, of similar conformer neighbors with other diverse conformers of the same chemical structure. While the degree of unique shape/feature space being added may diminish as more diverse conformers are added, it would still appear to be rather substantial even at ten diverse conformers per compound. Eventually, one may expect, as even more diverse conformers are considered, that the average count of conformer neighbors per conformer may grow substantially, as conformers increasingly yield similar neighbor lists, but clearly this point is not yet reached at ten conformers per compound, as reflected by the continued growth in average count of compound neighbors per compound. Perhaps, for most chemical structures, this point may be reached by twenty diverse conformers. Using the computers and algorithms of today, and as reflected in the total search time in Table 2, twenty diverse conformers per compound is still a mountain too high to climb for a collection of the size of PubChem.
5. Efficiency of 3-D neighboring scheme
Although the overall speed of 3-D neighboring depends on various factors, such as atom count, use of a precomputed shape grid approach, etc., a modern computer processor core can process on the order of 10^2 to 10^3 3-D conformer pair superpositions per second, when using a Gaussian-based shape definition. In theory, 26.1 million compounds with two diverse conformers per compound would require more than a quadrillion (10^15) pair-wise conformer superposition determinations, corresponding to more than 40,000 years of processor core computation; however, PubChem 3-D neighboring processing was completed in about two months using ~2,500 computer processing cores, meaning it took ~400 years of compute server time. (This figure reflects elapsed time on a somewhat chaotic and unstable shared compute cluster rather than actual CPU time.) How was this achieved?
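The estimates in this paragraph can be checked with simple arithmetic. The sketch below uses only the figures quoted in the text and assumes the optimistic end of the naive rate (10^3 superpositions per second per core) and a "two months" run length of exactly one sixth of a year.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

compounds = 26.1e6
conformers = compounds * 2                  # two diverse conformers each
pairs = conformers * (conformers - 1) / 2   # unordered conformer pairs

# Naive approach: every pair gets a full superposition optimization.
naive_rate = 1e3                            # pairs per second per core (assumed upper end)
naive_core_years = pairs / naive_rate / SECONDS_PER_YEAR

# Reported run: about two months on roughly 2,500 cores.
cores = 2500
elapsed_years = 2 / 12
actual_core_years = cores * elapsed_years

# Effective rate implied by the reported elapsed time.
effective_rate = pairs / (actual_core_years * SECONDS_PER_YEAR)

print(f"conformer pairs:        {pairs:.3e}")
print(f"naive core-years:       {naive_core_years:,.0f}")
print(f"reported core-years:    {actual_core_years:,.0f}")
print(f"implied pairs/s/core:   {effective_rate:,.0f}")
```

Under these assumptions the naive estimate comes to roughly 43,000 core-years, against roughly 400 core-years for the reported run. The implied rate is on the order of 10^5 conformer pairs per second per core, the same order of magnitude as the throughput reported for the filtered search.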
Table 3. Performance of 3-D neighboring. Reported quantities: Diverse Conformer Count; Total Conformer Pairs (billions); CT Feature Filter; CT Volume Filter; ST Volume Filter; Alignment Recycling Fingerprint; Alignment Recycling Overlap; Insufficient ST (billions); Insufficient CT (billions); Neighbor Count (millions); Compound Pairs per second; Conformer Pairs per second; Total Search Time (days); Alignment Recycling Time; Superposition Optimization Time; Other Overhead Time; Input data size (GB).
Alignment recycling is the next stage after filtering. This methodology consists of comparing shape fingerprints, locating common reference shapes, and then reusing each conformer's alignment to a common reference shape; the shape overlap and the feature overlap are computed at that recycled alignment. This is repeated for each common reference shape and only the best superposition is kept.
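The geometric idea behind reusing alignments can be sketched with plain rigid-body algebra. Assume each conformer stores a rotation matrix R and translation t that place it onto a shared reference shape (x_ref = R x + t); composing one alignment with the inverse of the other then gives an approximate superposition of B onto A without a new optimization. The function names and the storage convention are assumptions of this sketch, not the PubChem3D implementation.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def recycle_alignment(rot_a, trans_a, rot_b, trans_b):
    """Return (R, t) mapping B's coordinates into A's frame.

    Both inputs are alignments onto the same reference shape:
        x_ref = rot_a @ x_a + trans_a  and  x_ref = rot_b @ x_b + trans_b
    so x_a = rot_a^T @ (rot_b @ x_b + trans_b - trans_a).
    """
    rot_a_inv = transpose(rot_a)  # the inverse of a rotation is its transpose
    rot = matmul(rot_a_inv, rot_b)
    trans = matvec(rot_a_inv, [tb - ta for ta, tb in zip(trans_a, trans_b)])
    return rot, trans


# Toy alignments: A is rotated 90 degrees about z and shifted; B is only shifted.
rot_a, trans_a = [[0, -1, 0], [1, 0, 0], [0, 0, 1]], [1, 0, 0]
rot_b, trans_b = [[1, 0, 0], [0, 1, 0], [0, 0, 1]], [0, 2, 0]
rot, trans = recycle_alignment(rot_a, trans_a, rot_b, trans_b)

point_b = [1, 0, 0]
point_in_a = [r + t for r, t in zip(matvec(rot, point_b), trans)]
print(point_in_a)  # [2, 0, 0]
```

In practice the shape and feature overlaps would be evaluated at this recycled superposition for every common reference shape, keeping the best one.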
Alignment recycling provides two opportunities to further remove conformer pairs from consideration. If a reference shape cannot be found in common, the conformers are considered too different to be neighbors. This alignment recycling fingerprint filter removes an additional 4% of all conformer pairs (14% of all conformer pairs not already filtered). If the pre-optimized best overlap from alignment recycling is not sufficiently large (yielding an ST of at least 0.735), the conformer pair is considered incapable of being a neighbor pair. This alignment recycling overlap filter removes 27% of all conformer pairs (96% of all conformer pairs not already filtered) but consumes 86% of CPU time. Together, all filtering steps remove 99.8% of conformer pairs prior to optimization of the conformer superposition at the recycled alignment. The final shape optimization step consumes 10% of the CPU time, retaining less than 0.6% of optimized conformer pairs as neighbor pairs. About 66% of the shape-optimized conformer pairs are rejected due to an insufficient ST value (<0.795), and the remainder are rejected due to an insufficient CT value (<0.495) at the shape-optimized superposition.
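The staged rejection described here can be sketched as a cascade in which each cheap test runs before the next, more expensive one. The thresholds (0.735, 0.795, 0.495) are taken from the text. The volume bound is an assumption of this sketch: it treats the overlap volume as an inner product of shape densities, so that V_AB cannot exceed sqrt(V_AA · V_BB), and it is not claimed to be the exact filter used by PubChem. The CT-based filters are omitted for brevity, and `optimize` is a hypothetical callable standing in for the full superposition optimization.

```python
import math

ST_NEIGHBOR = 0.795      # minimum ST for a neighbor (from the text)
CT_NEIGHBOR = 0.495      # minimum CT for a neighbor (from the text)
ST_RECYCLED_MIN = 0.735  # minimum ST at the recycled alignment (from the text)


def st_upper_bound(v_aa: float, v_bb: float) -> float:
    """Largest ST any superposition could reach given only self-overlaps.

    Assumes V_AB <= sqrt(V_AA * V_BB) (Cauchy-Schwarz for an inner-product
    overlap); ST = V_AB / (V_AA + V_BB - V_AB) increases with V_AB.
    """
    v_ab_max = math.sqrt(v_aa * v_bb)
    return v_ab_max / (v_aa + v_bb - v_ab_max)


def screen_pair(v_aa, v_bb, shared_reference, recycled_st, optimize):
    """Classify one conformer pair, running the costly step only if needed."""
    if st_upper_bound(v_aa, v_bb) < ST_NEIGHBOR:
        return "rejected: ST volume filter"
    if not shared_reference:
        return "rejected: alignment recycling fingerprint filter"
    if recycled_st < ST_RECYCLED_MIN:
        return "rejected: alignment recycling overlap filter"
    st, ct = optimize()  # expensive; reached by only a small fraction of pairs
    if st < ST_NEIGHBOR:
        return "rejected: insufficient ST"
    if ct < CT_NEIGHBOR:
        return "rejected: insufficient CT"
    return "neighbor"


print(screen_pair(400.0, 100.0, True, 0.90, lambda: (0.90, 0.60)))
print(screen_pair(400.0, 380.0, True, 0.70, lambda: (0.90, 0.60)))
print(screen_pair(400.0, 380.0, True, 0.78, lambda: (0.81, 0.52)))
```

The first pair is rejected on volumes alone (the bound is 2/3), the second at the recycled-alignment stage, and only the third reaches the optimization step.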
The overall throughput of the 3-D neighboring methodology is consistent across the range of diverse conformers considered, at a rate of ~150,000 conformer pairs per second per processor core. The other overhead reported in Table 3 involves mostly the billions and trillions of timing measurements but also involves some memory allocation aspects. In reality, with timing statistics turned off, there is very little other overhead to the method. While the total size of the input binary data files grows as a function of diverse conformer count, ranging from 19 GB to 159 GB, the computational density is more than sufficient to avoid making input of these search files a bottleneck, provided at least four conformers are being queried simultaneously. If fewer than four conformers are queried at a time, and the input binary files are not memory resident, input can be a bottleneck.
6. Alignment recycling
Table 4. Shape fingerprint design. Reported quantities: Volume range (Å³); Fingerprint ST; Unique shape counts; Conformer count (millions).
7. Superposition storage
Superposition of two conformers requires modification of the coordinates of one conformer relative to the other. Retention of the rotational matrix and translation vector is a practical approach to retain a superposition between conformers to avoid having to re-compute a superposition or store modified coordinates of a conformer.
Storage of superposition results in PubChem3D involves identification of: the two conformers involved, often with one of the two conformers implicitly identified (e.g., by storing the superposition as a subordinate property of a conformer); the 3 × 3 rotation matrix; and the 3 × 1 translation vector. The PubChem3D conformer ID is often represented either as a 64-bit unsigned integer (sometimes stored in 16-character hex form), with the high 32 bits representing the PubChem Compound identifier (CID) and the low 16 bits representing the local conformer ID (LID), or as two numbers separated by "." (e.g., CID.LID). Storage of the rotation and translation parts represents more of a challenge, given there are twelve floating point numbers to convey. To provide a more compact superposition representation, the ability to pack/unpack the rotation and translation into a 64-bit unsigned integer was developed. While described in more detail in the Materials and Methods section below, this involves transforming the rotation matrix into a quaternion and packing the four components (Qw, Qx, Qy, Qz) into 32 bits, 8 bits each. The remaining 32 bits are used to encode the translation vector.
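The conformer ID layout described above (CID in the high 32 bits, LID in the low 16 bits of a 64-bit value) can be sketched with integer shifts and masks. The helper names are hypothetical, and the sketch assumes that any bits between the two fields are zero.

```python
def pack_conformer_id(cid: int, lid: int) -> int:
    """Combine CID (high 32 bits) and LID (low 16 bits) into one integer."""
    if not 0 <= cid < 2**32:
        raise ValueError("CID must fit in 32 bits")
    if not 0 <= lid < 2**16:
        raise ValueError("LID must fit in 16 bits")
    return (cid << 32) | lid


def unpack_conformer_id(conformer_id: int) -> tuple:
    """Split a packed conformer ID back into (CID, LID)."""
    return conformer_id >> 32, conformer_id & 0xFFFF


def to_hex(conformer_id: int) -> str:
    """16-character hexadecimal form of the 64-bit conformer ID."""
    return f"{conformer_id:016X}"


def to_dotted(conformer_id: int) -> str:
    """'CID.LID' form of the conformer ID."""
    cid, lid = unpack_conformer_id(conformer_id)
    return f"{cid}.{lid}"


packed = pack_conformer_id(2244, 1)  # CID 2244 (aspirin), first local conformer
print(to_hex(packed))                # 000008C400000001
print(to_dotted(packed))             # 2244.1
```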
Effect of superposition packing on ST/CT. Reported quantities: Matrix Encoding Error Count; Matrix Encoding Enhancement Count. Values: 1 in 1,096; 1 in 1,024; 1 in 38; 1 in 30.
Perhaps most remarkably, the superposition encode/decode procedure is just as likely to enhance the ST and CT values as to detract from them. Also interesting is that the CT error curves are much broader, reflecting, in part, the much greater positional sensitivity of the CT measure. Small deviations in rotation have an increasing effect the further an atom is from the molecule center. Fictitious feature atoms are relatively sparse, have small atomic radii, and are often close to the periphery of the chemical structure. Shape similarity, on the other hand, is not as sensitive, as real atoms are relatively dense and most atoms in the molecule are typically near the steric center; thus, fewer atoms are affected by rotation encoding effects. As a whole, the use of a 64-bit integer to store a conformer pair superposition results in relatively few cases where the Tanimoto difference (after-before) is less than -0.025, that is, where more than 0.025 is lost, with the chances for this to occur for ST and CT being 1 in 14.6 million and 1 in 955, respectively. If the error from being off a small fraction of a degree from the original superposition is too much, one could simply re-optimize the conformer superposition provided by PubChem, as the benefits in terms of the ease of storage are considerable.
In the present paper, the PubChem 3-D "Similar Conformers" neighboring relationship and the methodology used in its computation are described. PubChem 3-D neighbors are defined as any two conformers with a shape-optimized superposition yielding a similar 3-D conformer shape (ST value of ≥ 0.8) and similar 3-D orientation of functional groups typically used to define pharmacophore features (CT value of ≥ 0.5). In the cases of chemical structures without features, a similar 3-D conformer shape with ST value of ≥ 0.93 is used.
To make the calculation of this 3-D neighboring relationship tractable, a series of filters was designed to avoid the time-consuming shape-superposition computation between conformer pairs that could not possibly be 3-D neighbors. This resulted in an average throughput of 150,000 conformer pairs per second per processor core, a speed sufficient to consider multiple diverse conformers per compound in the 3-D neighboring relationship.
Neighboring the first two diverse conformers of 26.1 million PubChem Compound records yielded 8.16 billion 3-D conformer neighbor pairs and 6.62 billion 3-D compound neighbor pairs, with an average of 253 "Similar Conformers" per PubChem Compound. Comparison of the PubChem 3-D "Similar Conformers" neighboring relationship with the PubChem 2-D "Similar Compounds" neighboring relationship using three well-known bioactive molecules (aspirin, caffeine, and morphine) showed a considerable degree of uniqueness between the two neighboring relationships and provided a number of related structures with significant biological annotation. This was also illustrated by the ability of the 3-D neighboring relationship to associate eight selected non-steroidal anti-inflammatory drugs (NSAIDs) to each other, despite little 2-D pair-wise similarity between most of the compound pairs. Additional study of 4,218 small molecules of biomedical interest across a range of diverse conformers shows that neighboring more conformers per compound will result in being able to associate more chemical structures to each other; however, a much faster than linear increase in the count of conformer pairs considered (68-fold for a ten-fold increase in conformers per compound) results in only a linear increase in additional compound 3-D neighbor pairs.
Materials and methods
1. Chemical structure 3-D representation
Theoretical 3-D descriptions of the 26,157,365 chemical structures covered in this work and found in the PubChem Compound database [1, 2] are generated as described in our previous studies [15, 28]. It is important to note that these conformers are not energy minima on a potential energy surface, but an ensemble of energetically-accessible (at room temperature), biologically-relevant (able to reproduce most known bioactive conformations), sampled (with a minimum atom pair-wise RMSD separation) conformations that the molecule may cover. In theory, these ensembles describe all relevant molecular shapes (including all important energy minima) within the resolution of the clustering RMSD for the conformer ensemble.
2. Molecular shape and features
An atom-centered Gaussian description [9–12] using Bondi radii is utilized to compute 3-D similarity. Fictitious "feature" atoms (also known as "color" atoms) are defined to represent the pharmacophore feature functional group types present in a chemical structure. The Mills/Deans implicit force-field, as implemented in the OEShape C++ Toolkit, is employed to identify these features. The six feature types recognized are: anion, cation, hydrogen-bond donor, hydrogen-bond acceptor, hydrophobe, and ring. Feature atom 3-D coordinates are computed relative to the steric center of real "parent" atoms that comprise each feature. Post processing of feature atom assignment identifies any features of the same type within 1.0 Å of each other and merges the unique parent atoms that comprise the two features. This post processing step is performed iteratively, until no additional features are merged. The radius used for all feature atoms is 1.08265 Å.
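The iterative merging of nearby same-type features can be sketched as follows. The sketch assumes that the "steric center" is the unweighted centroid of the parent atoms and that "within 1.0 Å" includes a distance of exactly 1.0 Å; both are assumptions, as is the data layout (features as pairs of a type name and a set of parent atom indices).

```python
import math


def centroid(atom_ids, coords):
    """Unweighted centroid of the parent atoms (assumed 'steric center')."""
    pts = [coords[i] for i in atom_ids]
    return tuple(sum(p[k] for p in pts) / len(pts) for k in range(3))


def merge_features(features, coords, cutoff=1.0):
    """Merge same-type features closer than `cutoff` until none remain."""
    feats = [(ftype, frozenset(atoms)) for ftype, atoms in features]
    merged = True
    while merged:  # repeat because a merge moves the feature position
        merged = False
        for i in range(len(feats)):
            for j in range(i + 1, len(feats)):
                (type_i, atoms_i), (type_j, atoms_j) = feats[i], feats[j]
                close = math.dist(centroid(atoms_i, coords),
                                  centroid(atoms_j, coords)) <= cutoff
                if type_i == type_j and close:
                    feats[i] = (type_i, atoms_i | atoms_j)  # union of parents
                    del feats[j]
                    merged = True
                    break
            if merged:
                break
    return feats


# Toy geometry: atoms 0 and 1 are 0.8 A apart; atom 2 is 1.3 A from atom 0
# but only 0.9 A from the centroid of atoms 0 and 1.
coords = {0: (0.0, 0.0, 0.0), 1: (0.8, 0.0, 0.0),
          2: (1.3, 0.0, 0.0), 3: (5.0, 0.0, 0.0)}
features = [("acceptor", {0}), ("acceptor", {1}),
            ("acceptor", {2}), ("donor", {3})]
result = merge_features(features, coords)
print(result)
```

The toy case shows why the step is iterative: the third acceptor only comes within range after the first two have been merged and the feature position has moved.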
Shape similarity computation utilizes the shape Tanimoto (ST) via Eq. (2) and only considers the non-hydrogen atoms in the molecule. Feature similarity, unlike shape similarity, involves summing the individual overlaps of the six component feature atom types when computing the A, B, and AB found in Eq. (2); thus, yielding Eq. (3) for the feature similarity measure, color Tanimoto (CT). Otherwise, the feature similarity computation method is identical to the shape similarity computation method.
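A minimal sketch of the two measures, given overlap volumes that are assumed to have been computed elsewhere: ST uses the standard Tanimoto form on shape overlap volumes, and CT applies the same form after summing the overlaps of the six feature types. The dictionary layout, the short feature type keys, and the toy numbers are illustrative assumptions.

```python
FEATURE_TYPES = ("anion", "cation", "donor", "acceptor", "hydrophobe", "ring")


def shape_tanimoto(v_aa: float, v_bb: float, v_ab: float) -> float:
    """ST from the two self-overlap volumes and the mutual overlap volume."""
    return v_ab / (v_aa + v_bb - v_ab)


def color_tanimoto(self_a: dict, self_b: dict, overlap_ab: dict) -> float:
    """CT from per-feature-type overlap volumes, summed over all six types."""
    aa = sum(self_a.get(f, 0.0) for f in FEATURE_TYPES)
    bb = sum(self_b.get(f, 0.0) for f in FEATURE_TYPES)
    ab = sum(overlap_ab.get(f, 0.0) for f in FEATURE_TYPES)
    denom = aa + bb - ab
    # With no features on either conformer the ratio is undefined; 0.0 is
    # returned here as a convention of this sketch (an assumption).
    return ab / denom if denom > 0 else 0.0


# Toy overlap volumes in cubic angstroms.
self_a = {"donor": 2.0, "acceptor": 4.0, "ring": 3.0}
self_b = {"donor": 2.0, "acceptor": 2.0, "ring": 3.0}
overlap_ab = {"donor": 1.5, "acceptor": 1.5, "ring": 2.0}

print(shape_tanimoto(300.0, 280.0, 260.0))         # 260 / 320 = 0.8125
print(color_tanimoto(self_a, self_b, overlap_ab))  # 5 / 11
```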
When optimizing the shape superposition between a conformer pair, the OpenEye OEShape C++ toolkit is used via the OEBestOverlay object, with the parameters OEOverlapRadii::All and OEOverlapMethod::Analytic. Any other shape or feature computation utilizes in-house implementations using the Grant and Pickup Gaussian-based shape methodology. For all in-house shape-based approaches, an exponent lookup table of size 6,001 is used in lieu of exponent calculation for the range of (-12.0 to 0.0) in 0.002 increments. Exponent values outside of this range are zero. All other terms in the Grant and Pickup shape-based methodology are computed exactly.
A grid-based approach is used by parts of the 3-D neighboring methodology to estimate the shape overlap with O(N) computational complexity. In these cases, a 3-D lattice of points separated by 0.25 Å gives the shape overlap of a carbon probe-atom at the grid point to the query conformer. A cut-off distance of 4.5 Å is used for each query conformer atom, beyond which no additional contribution to shape overlap is considered.
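The exponent lookup table can be sketched as below. The table size, range, and increment follow the text; selecting the nearest table entry (rather than truncating or interpolating) is an assumption of this sketch, as are the names.

```python
import math

X_MIN = -12.0
STEP = 0.002
TABLE_SIZE = 6001  # entries for -12.0, -11.998, ..., 0.0

EXP_TABLE = [math.exp(X_MIN + i * STEP) for i in range(TABLE_SIZE)]


def fast_exp(x: float) -> float:
    """Approximate exp(x) for -12.0 <= x <= 0.0 by nearest-entry lookup."""
    if x < X_MIN or x > 0.0:
        return 0.0  # values outside the tabulated range are treated as zero
    return EXP_TABLE[round((x - X_MIN) / STEP)]


for x in (0.0, -0.5, -3.1415, -12.5):
    print(x, fast_exp(x), math.exp(x))
```

With a 0.002 spacing, the nearest entry is at most 0.001 away from the requested exponent, so the relative error of the looked-up value stays near 0.1% within the tabulated range.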
3. Diverse conformer concept
Although the theoretical conformer ensemble for each molecule may have up to 500 conformers (averaging ~110), it is not practical to consider all conformers for PubChem 3-D neighboring. Therefore, a diverse conformer concept is introduced that orders the conformers in the ensemble for a chemical structure by their combined shape and feature dissimilarity, with the most dissimilar conformers first. The lowest-energy conformer in the conformer ensemble is selected as the default, first diverse conformer to seed the process. The conformer with the least combo Tanimoto (being the sum of the ST and CT similarity values for the ST-optimized superposition) to the first conformer is selected as the second most diverse conformer. The conformer with the least sum of combo Tanimoto to the first two conformers is selected as the third, and so on until all conformers are assigned a diverse conformer ordering. In the case of a tie, the conformer with the largest sum of combo Tanimoto to all unassigned conformers is selected. If a tie persists, the conformer with the least LID (local conformer identifier) is selected.
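The ordering procedure just described can be sketched as a greedy selection. Here `combo(a, b)` is a hypothetical symmetric function returning the combo Tanimoto (ST + CT) of two conformers, LIDs are plain integers, and the toy similarity values are made up for illustration.

```python
def diverse_order(lids, combo, default_lid):
    """Order conformers from most to least diverse, seeded by the default."""
    ordered = [default_lid]
    remaining = set(lids) - {default_lid}
    while remaining:
        def key(c):
            to_selected = sum(combo(c, s) for s in ordered)
            to_unassigned = sum(combo(c, u) for u in remaining if u != c)
            # Least similarity to selected conformers wins; ties go to the
            # largest similarity to unassigned conformers, then the least LID.
            return (to_selected, -to_unassigned, c)

        best = min(remaining, key=key)
        ordered.append(best)
        remaining.remove(best)
    return ordered


# Toy combo Tanimoto values for four conformers of one compound.
table = {
    frozenset((1, 2)): 1.8, frozenset((1, 3)): 0.9, frozenset((1, 4)): 1.2,
    frozenset((2, 3)): 1.0, frozenset((2, 4)): 1.5, frozenset((3, 4)): 1.1,
}
order = diverse_order([1, 2, 3, 4], lambda a, b: table[frozenset((a, b))], 1)
print(order)  # [1, 3, 4, 2]
```

Conformer 3 is least similar to the default conformer 1 and is chosen second; conformer 4 then has the smaller summed similarity to conformers 1 and 3, leaving conformer 2 last.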
4. Shape fingerprint definition
Haigh et al. applied a clustering technique to select a diverse set of reference shapes that cover the overall shape space of possible 3-D shapes, and generated 3-D molecular shape fingerprints using these reference shapes. Comparison of molecular shape fingerprints was shown to be orders of magnitude faster than shape-overlap-based approaches such as ROCS, illustrating its potential in screening a large 3-D chemical database. Therefore, we applied the 3-D shape fingerprint technique, in conjunction with "alignment recycling", for use in computing a "Similar Conformer" relationship.
In our recent study, a dynamic shape similarity threshold (ST_thresh) was employed in clustering conformers of a particular volume such that the resulting reference shape count became less than or equal to a certain number (200). In this manner, the number of reference shapes per volume can be kept relatively constant while the growth of the shape space as a function of volume is manifest by a decrease in ST_thresh (the Unique-Shape Tanimoto in Figure 11). The plot of the Unique-Shape Tanimoto versus the conformer volume was used to choose appropriate ST_thresh values for the 3-D shape fingerprint generation (the Fingerprint Tanimoto in Figure 11). The ST_thresh value was chosen to gradually decrease from 0.75 to 0.45 (with a decrement of 0.05) as the conformer volume increases, resulting in seven different regions of conformer volume according to their ST_thresh values. The reference shapes of each region obtained from the previous study were then pooled and clustered at the ST_thresh value of that region. Following this step, all conformers in PubChem3D within the given volume range region were examined to locate any additional reference shapes. Also, as new chemical structures are added to PubChem, they are examined to identify new candidate reference shapes.
The resulting "unique shape" count is listed in Table 4, with the conformer count that belongs to the shape space represented by the corresponding unique shapes. It indicates that the shape space spanned by 5.24 billion conformers (the entire contents of the PubChem3D archive, live and non-live, from more than 45.9 million unique chemical structures) can be represented using only 3,311 unique reference shapes (a number that may grow over time). Figure 14 shows the frequency of use of the various 3-D fingerprint reference shapes, with some being heavily utilized while others are rarely used. The volume range 433-999 is the largest in both volume spanned and count of reference shapes. We anticipate that this volume range may need to be split into separate regions in the future.
5. PubChem 3-D neighbor processing
PubChem Compound (CID) records are partitioned into two sets, "live" and "non-live". A "live" CID is one that has at least one current version PubChem Substance record pointing to it. The "non-live" partition contains all CIDs not considered to be "live". For each record that is contained in the PubChem3D system and considered to be "live", PubChem computes a "Similar Conformer" relationship that considers both shape similarity (ST ≥ 0.8) and feature similarity (CT ≥ 0.5). [Chemical structures without features, while rare, can have a similar conformer relationship with other featureless chemical structures provided the ST ≥ 0.93.] Essentially, this amounts to a 3-D similarity search of a given conformer across the first N-diverse conformers of "live" CIDs, where at the time of writing "N" is two, with a third in the process of being added. This processing we call "neighboring".
This PubChem3D similarity search is a multistep process designed to filter out conformer pairs that cannot possibly be neighbors, to generate a near-optimal superposition between conformer pairs, and to perform a final optimization of the superposition to maximize the shape volume overlap between conformer pairs. These distinct stages are described below.
Stage 1: Filtering
An enormous advantage to this filter is that it operates on compound pairs. This means it applies to all conformers being neighbored for the respective compound pair, unlike the previous CT filter which is per conformer pair. So, as the count of conformers per compound is increased, the utility of this filter is magnified.
Stage 2: Shape fingerprint comparison and alignment recycling
where AB shape is the common shape volume between the conformer pair (using a precomputed shape-grid at 0.25 Å resolution) and AB feature is the sum of six component feature overlaps (using a feature atom radius of 1.25 Å, to account for proximate color atoms).
This approach is repeated for each common reference shape. The reference-shape-derived conformer alignment yielding the largest AB combo value is used in a final AB shape overlap computation, this time not using a grid-based approach. The final AB shape overlap is used along with the pre-computed self shape overlap values per conformer to compute the ST at this overlap geometry. If the computed ST is not greater than 0.735, the conformer pair is considered to not possibly be a neighbor. [The "grid" AB shape can be sufficiently different than the "exact" AB shape value, resulting in the loss of many neighbor relationships.]
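The ST computation at a fixed overlap geometry can be sketched in Python using the standard Tanimoto form, in which the common shape overlap is divided by the sum of the two self-overlaps minus the common overlap. The function names and the example overlap values are illustrative assumptions; only the 0.735 threshold is taken from the text.

```python
ST_PREFILTER = 0.735  # threshold from the text for this stage


def shape_tanimoto(ab: float, aa: float, bb: float) -> float:
    """Shape Tanimoto from the A-B overlap and the two self-overlaps."""
    return ab / (aa + bb - ab)


def may_be_neighbor(ab: float, aa: float, bb: float) -> bool:
    """True if the pair survives this stage (ST greater than 0.735)."""
    return shape_tanimoto(ab, aa, bb) > ST_PREFILTER


# Hypothetical overlap volumes for two conformer pairs.
kept = may_be_neighbor(ab=90.0, aa=100.0, bb=100.0)     # ST = 90/110
dropped = may_be_neighbor(ab=80.0, aa=100.0, bb=100.0)  # ST = 80/120
```

The self-overlap terms depend only on a single conformer, which is why they can be precomputed once and reused for every pair in which that conformer takes part.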
Stage 3: Shape overlay optimization and ST/CT score computation
Using the "alignment recycling" conformer pair superposition from the previous stage, a final superposition optimization to maximize the shape volume overlap between the conformer pair is performed using the OEShape C++ toolkit. If the final conformer superposition yields an ST ≥ 0.8 (in practice 0.795, as values are rounded to the nearest 0.01), the CT is also computed at the same conformer alignment. If CT ≥ 0.5 (in practice 0.495, after rounding to the nearest 0.01), the conformer pair is considered to be a neighbor. As mentioned previously, if both conformers are devoid of features, an ST ≥ 0.93 (in practice 0.925, after rounding to the nearest 0.01) is instead sufficient for the pair to be considered a neighbor. The 3 × 3 rotation matrix, the 3 × 1 translation vector, and the ST/CT similarity values are retained for the conformer pair.
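The acceptance criteria of this stage can be summarized in a short Python sketch. The thresholds are the pre-rounding values quoted above; the function name, the boolean feature flags, and the assumption that the cut-offs are inclusive are illustrative and not confirmed by the text.

```python
# Effective cut-offs quoted in the text (values before rounding to 0.01).
ST_MIN = 0.795
CT_MIN = 0.495
ST_MIN_FEATURELESS = 0.925


def is_neighbor(st: float, ct: float,
                a_has_features: bool, b_has_features: bool) -> bool:
    """Decide whether a conformer pair is a Similar Conformers neighbor.

    Assumption: cut-offs are inclusive, i.e. a value exactly at the
    threshold is accepted.
    """
    if not a_has_features and not b_has_features:
        # Featureless pairs are judged on shape alone, more strictly.
        return st >= ST_MIN_FEATURELESS
    return st >= ST_MIN and ct >= CT_MIN
```

Under these rules, a pair in which only one conformer lacks features falls under the ordinary ST and CT criteria, so it is rejected whenever its CT is below the cut-off.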
6. PubChem 3-D neighbor processing addendum
There are additional aspects to neighbor processing that are germane to its accuracy and use. To minimize input overhead and memory utilization, all input data are encoded in a highly compact, 64-bit aligned binary format (gzip or bzip2 compression reduces file size by only 4%) that contains all information necessary to perform the neighboring computation. One side effect of this encoding is that conformer coordinates are transformed into integer values with a resolution of +/- 0.0015 Å and are restricted to the range (-50 Å, +50 Å), which is more than sufficient for all conformers in the PubChem3D data system. Another side effect is that, to obtain the superposition alignment between neighbored conformers, one must first translate the conformers to their steric center (i.e., subtract the coordinate average per axis) prior to applying the stored 3 × 3 rotation matrix and then the 3 × 1 translation vector (in that order). Conformers stored in the PubChem3D system are already at the steric center (and rotated into the non-mass-weighted inertial frame of reference).
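The order of operations for recovering a stored superposition (center, rotate, then translate) can be sketched in Python. The function names are hypothetical, and the convention that the rotation matrix multiplies each coordinate as a column vector is an assumption; the text does not state the matrix convention or which conformer of the pair the transform is applied to.

```python
from typing import List, Sequence

Point = Sequence[float]


def steric_center(coords: Sequence[Point]) -> List[float]:
    """Unweighted coordinate average per axis."""
    n = len(coords)
    return [sum(p[k] for p in coords) / n for k in range(3)]


def apply_superposition(coords: Sequence[Point],
                        rotation: Sequence[Sequence[float]],
                        translation: Point) -> List[List[float]]:
    """Center at the steric center, rotate, then translate (in that order)."""
    center = steric_center(coords)
    moved = []
    for p in coords:
        c = [p[k] - center[k] for k in range(3)]
        # Assumed convention: rotated = R * c, with c as a column vector.
        r = [sum(rotation[i][k] * c[k] for k in range(3)) for i in range(3)]
        moved.append([r[i] + translation[i] for i in range(3)])
    return moved


# Hypothetical two-atom conformer, rotated 90 degrees about z and
# shifted by 2 along z.
rot_z90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
result = apply_superposition([[3.0, 5.0, 0.0], [1.0, 5.0, 0.0]],
                             rot_z90, [0.0, 0.0, 2.0])
```

Because the conformer is centered before the rotation is applied, the same stored matrix and vector give the same result wherever the input coordinates originally sit in space.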
Using a log value provides a translation encoding that gives the best precision at 0 Å (+/- 0.0028 Å) and the worst at 100 Å (+/- 0.29 Å). Considering the requirement that a conformer pair must be at the steric center prior to encoding, the encoding error due to translation is minimal.
Validation of the encoding/decoding accuracy as a function of ST and CT values across more than 1.85 billion unique neighbor pairs shows (Table 5) that encoding is nearly as likely to increase the ST or CT value as to decrease it. However, the CT is much more sensitive to encoding/decoding than the ST, with a 1 in 38 chance of yielding a value 0.01 less than that reported. This is to be expected, as the radius used for feature atoms in CT computations is nearly 60% smaller than for carbon atoms, meaning small changes in the alignment can have a big effect on similarity, especially for features that are furthest from the steric center (a "torque arm" effect). Additionally, the shape overlap optimization does not consider feature atoms, so a small change in rotation or translation may either increase or decrease the CT value, as no attempt is made to optimize feature alignment.
Many thanks to Roger Sayle for providing key insights on ways to improve the rotational/translational matrix pack/unpack scheme. We are grateful to the NCBI Systems staff, especially Ron Patterson, Charlie Cook, and Don Preuss, whose efforts helped make the PubChem3D project possible. This research was supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health, U.S. Department of Health and Human Services. This study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, Md. (http://biowulf.nih.gov).
- Bolton EE, Wang Y, Thiessen PA, Bryant SH: PubChem: integrated platform of small molecules and biological activities. Annual Reports in Computational Chemistry. Edited by: Ralph AW, David CS. 2008, Elsevier, 4: 217-241.
- Wang YL, Xiao JW, Suzek TO, Zhang J, Wang JY, Bryant SH: PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009, 37: W623-W633. 10.1093/nar/gkp456.
- Wang YL, Bolton E, Dracheva S, Karapetyan K, Shoemaker BA, Suzek TO, Wang JY, Xiao JW, Zhang J, Bryant SH: An overview of the PubChem BioAssay resource. Nucleic Acids Res. 2010, 38: D255-D266. 10.1093/nar/gkp965.
- Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Federhen S, et al: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2010, 38: D5-D16. 10.1093/nar/gkp967.
- Holliday JD, Hu CY, Willett P: Grouping of coefficients for the calculation of inter-molecular similarity and dissimilarity using 2D fragment bit-strings. Comb Chem High Throughput Screen. 2002, 5: 155-166.
- Chen X, Reynolds CH: Performance of similarity measures in 2D fragment-based similarity searching: comparison of structural descriptors and similarity coefficients. J Chem Inf Comput Sci. 2002, 42: 1407-1414.
- Holliday JD, Salim N, Whittle M, Willett P: Analysis and display of the size dependence of chemical similarity coefficients. J Chem Inf Comput Sci. 2003, 43: 819-828.
- PubChem substructure fingerprint description. [ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.pdf]
- Grant JA, Pickup BT: A gaussian description of molecular shape. J Phys Chem. 1995, 99: 3503-3510. 10.1021/j100011a016.
- ROCS - Rapid Overlay of Chemical Structures. Version 2.2, OpenEye Scientific Software, Inc.: Santa Fe, NM. 2006
- Haque IS, Pande VS: PAPER—accelerating parallel evaluations of ROCS. J Comput Chem. 2010, 31: 117-132. 10.1002/jcc.21307.
- ShapeTK-C++. Version 1.8.0, OpenEye Scientific Software, Inc.: Santa Fe, NM. 2010
- Haigh JA, Pickup BT, Grant JA, Nicholls A: Small molecule shape-fingerprints. J Chem Inf Model. 2005, 45: 673-684. 10.1021/ci049651v.
- Fontaine F, Bolton E, Borodina Y, Bryant SH: Fast 3D shape screening of large chemical databases through alignment-recycling. Chem Cent J. 2007, 1: 12. 10.1186/1752-153X-1-12.
- Bolton EE, Kim S, Bryant SH: PubChem3D: diversity of shape. J Cheminformatics. 2011, 3: 9. 10.1186/1758-2946-3-9.
- Rush TS, Grant JA, Mosyak L, Nicholls A: A shape-based 3-D scaffold hopping method and its application to a bacterial protein-protein interaction. J Med Chem. 2005, 48: 1489-1495. 10.1021/jm040163o.
- Grant JA, Gallardo MA, Pickup BT: A fast method of molecular shape comparison: a simple application of a Gaussian description of molecular shape. J Comput Chem. 1996, 17: 1653-1666. 10.1002/(SICI)1096-987X(19961115)17:14<1653::AID-JCC7>3.0.CO;2-K.
- Medical Subject Headings. [http://www.ncbi.nlm.nih.gov/mesh]
- PubMed. [http://www.pubmed.gov]
- Wang Y, Addess KJ, Chen J, Geer LY, He J, He S, Lu S, Madej T, Marchler-Bauer A, Thiessen PA, et al: MMDB: annotating protein sequences with Entrez's 3D-structure database. Nucleic Acids Res. 2007, 35: D298-D300. 10.1093/nar/gkl952.
- Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, Pon A, Banco K, Mak C, Neveu V, et al: DrugBank 3.0: a comprehensive resource for 'Omics' research on drugs. Nucleic Acids Res. 2011, 39: D1035-D1041. 10.1093/nar/gkq1126.
- McLeod DC: Zomepirac. Drug Intelligence & Clinical Pharmacy. 1981, 15: 522-530.
- Kerwar SS: Pharmacologic properties of fenbufen. Am J Med. 1983, 75: 62-69.
- Tolman EL, Birnbaum JE, Chiccarelli FS, Panagides J, Sloboda AE: Inhibition of prostaglandin activity and synthesis by fenbufen (a new nonsteroidal antiinflammatory agent) and one of its metabolites. Advances in Prostaglandin and Thromboxane Research. 1976, 1: 133-138.
- Cashin CH, Dawson W, Kitchen EA: Pharmacology of benoxaprofen (2-[4-chlorophenyl]-α-methyl-5-benzoxazole acetic acid), LRCL 3794, a new compound with anti-inflammatory activity apparently unrelated to inhibition of prostaglandin synthesis. J Pharm Pharmacol. 1977, 29: 330-336.
- Nicholls A, McGaughey GB, Sheridan RP, Good AC, Warren G, Mathieu M, Muchmore SW, Brown SP, Grant JA, Haigh JA, et al: Molecular shape and medicinal chemistry: a perspective. J Med Chem. 2010, 53: 3862-3886. 10.1021/jm900818s.
- Muchmore SW, Debe DA, Metz JT, Brown SP, Martin YC, Hajduk PJ: Application of belief theory to similarity data fusion for use in analog searching and lead hopping. J Chem Inf Model. 2008, 48: 941-948. 10.1021/ci7004498.
- Bolton EE, Kim S, Bryant SH: PubChem3D: conformer generation. J Cheminformatics. 2011, 3: 4. 10.1186/1758-2946-3-4.
- Bondi A: van der Waals volumes and radii. J Phys Chem. 1964, 68: 441-451. 10.1021/j100785a001.
- Mills JEJ, Dean PM: Three-dimensional hydrogen-bond geometry and probability information from a crystal survey. J Comput Aid Mol Des. 1996, 10: 607-622. 10.1007/BF00134183.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.