PubChem3D: conformer ensemble accuracy
© Kim et al.; licensee Chemistry Central Ltd. 2013
Received: 31 August 2012
Accepted: 18 December 2012
Published: 7 January 2013
PubChem is a free and publicly available resource containing substance descriptions and their associated biological activity information. PubChem3D is an extension to PubChem containing computationally-derived three-dimensional (3-D) structures of small molecules. All the tools and services that are a part of PubChem3D rely upon the quality of the 3-D conformer models. Construction of the conformer models currently available in PubChem3D involves a clustering stage to sample the conformational space spanned by the molecule. While this stage allows one to downsize the conformer models to a more manageable size, it may result in a loss of the ability to reproduce experimentally determined “bioactive” conformations, such as those found for PDB ligands. This study examines the extent of this accuracy loss and considers its effect on the 3-D similarity analysis of molecules.
Conformer models consisting of up to 100,000 conformers per compound were generated for 47,123 small molecules whose structures were experimentally determined, and the conformers in each model were clustered to reduce its size to a maximum of 500 conformers per molecule. The accuracy of the conformer models before and after clustering was evaluated using five different measures: root-mean-square distance (RMSD), shape-optimized shape-Tanimoto (ST ST-opt ) and combo-Tanimoto (ComboT ST-opt ), and color-optimized color-Tanimoto (CT CT-opt ) and combo-Tanimoto (ComboT CT-opt ). On average, clustering decreased the conformer model accuracy, increasing the conformer ensemble’s RMSD to the bioactive conformer (by 0.18 ± 0.12 Å) and decreasing the ST ST-opt , ComboT ST-opt , CT CT-opt , and ComboT CT-opt scores (by 0.04 ± 0.03, 0.16 ± 0.09, 0.09 ± 0.05, and 0.15 ± 0.09, respectively).
This study shows the RMSD accuracy performance of the PubChem3D conformer models is operating as designed. In addition, the effect of PubChem3D sampling on 3-D similarity measures shows that there is a linear degradation of average accuracy with respect to molecular size and flexibility. Generally speaking, one can likely expect the worst-case minimum accuracy of 90% or more of the PubChem3D ensembles to be 0.75, 1.09, 0.43, and 1.13, in terms of ST ST-opt , ComboT ST-opt , CT CT-opt , and ComboT CT-opt , respectively. This expected accuracy improves linearly as the molecule becomes smaller or less flexible.
The advent of combinatorial chemistry and high-throughput screening technology has made it possible to perform a rapid test of biological activity on a vast number of small molecules, generating a massive amount of biological activity data. While this explosion of information presents scientists with great opportunities to facilitate the identification of potential drug candidates and chemical probes, its benefit is enhanced when this data is combined with that of the others and made available to all. Dissemination of such information requires a public repository that collects and stores the heterogeneous data from various contributors. An example of such a repository is PubChem [1–4] (http://pubchem.ncbi.nlm.nih.gov), launched in 2004 as a component of the Molecular Libraries Roadmap Initiatives of the U.S. National Institutes of Health. PubChem archives biological activity screening data and other information from diverse data sources and offers its contents free of charge to the biomedical research community, facilitating research that benefits human health.
PubChem 2-D similarity is scored with the Tanimoto equation [6, 7]: Tanimoto = AB / (A + B − AB), where A and B are the respective counts of the set binary fingerprint bits for the two molecules and AB is the count of set bits in common to both molecules. Because the 2-D molecular similarity computation is very fast (typically at a rate of one million pair-wise comparisons per second per CPU core), it is appropriate for searching a large database like PubChem. However, there are many diverse chemical structures with similar biological efficacies against targets available in PubChem that can be difficult to interrelate using traditional 2-D similarity methods [8–11]. To assist in biological activity analysis of these molecules, a new layer called PubChem3D [8–15] was added to PubChem.
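As a hedged illustration of the 2-D similarity score described above, the following minimal Python sketch computes a Tanimoto coefficient from two fingerprints represented as sets of set-bit positions. It uses the standard Tanimoto form, AB / (A + B − AB), with the quantities A, B, and AB as named in the text. The function name `tanimoto_2d` and the toy fingerprints are illustrative assumptions and do not represent the PubChem implementation or its fingerprint format.

```python
from typing import AbstractSet


def tanimoto_2d(bits_a: AbstractSet[int], bits_b: AbstractSet[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of set-bit positions."""
    a = len(bits_a)              # A: bits set in the first molecule
    b = len(bits_b)              # B: bits set in the second molecule
    ab = len(bits_a & bits_b)    # AB: bits set in both molecules
    union = a + b - ab
    # Two empty fingerprints have no bits to compare; 0.0 is returned by convention here.
    return ab / union if union else 0.0


# Toy fingerprints (hypothetical bit positions, for illustration only).
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8}
score = tanimoto_2d(fp_a, fp_b)
```

Because the score depends only on bit counts, it can be evaluated very quickly, which is consistent with the throughput quoted in the text.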
PubChem3D generates a 3-D conformer model description for each record in the PubChem Compound database that satisfies the following conditions: (1) not too large (with 50 or fewer non-hydrogen atoms); (2) not too flexible (with no more than 15 rotatable bonds); (3) has only a single covalent unit (i.e., not a salt or mixture); (4) consists of only supported elements (H, C, N, O, F, Si, P, S, Cl, Br, and I); (5) contains only atom-types recognized by the Merck Molecular Force Field (MMFF94s) [16, 17]; and (6) has five or fewer undefined atom (R,S) and bond (E,Z) stereo centers. This 3-D description can be employed to enhance existing PubChem search and analysis methodologies by means of 3-D similarity, helping the user identify useful structure-activity relationships that might go unrecognized by the PubChem 2-D similarity method. A diverse conformer ordering gives a maximal description of the conformational space of a molecule when only a subset of available conformers is used. A pre-computed search per compound record gives immediate access to a set of 3-D similar compounds (called “Similar Conformers”) in PubChem and their respective superpositions, augmenting the complementary “Similar Compounds” relationship, computed using the PubChem 2-D similarity method. Systematic augmentation of PubChem resources to include a 3-D layer provides users with new capabilities to search, subset, visualize, analyze, and download data.
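The six eligibility conditions listed above can be expressed as a simple predicate. The sketch below assumes that the relevant properties have already been computed elsewhere; the `CompoundSummary` container and its field names are hypothetical and are not part of any PubChem data format or API.

```python
from dataclasses import dataclass

# Elements supported for 3-D conformer model generation, as listed in the text.
SUPPORTED_ELEMENTS = {"H", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Br", "I"}


@dataclass
class CompoundSummary:
    """Hypothetical container of precomputed properties for one compound record."""
    heavy_atoms: int                # non-hydrogen atom count
    rotatable_bonds: int
    covalent_units: int             # 1 for a single molecule; more for salts or mixtures
    elements: frozenset             # element symbols present in the record
    mmff94s_typed: bool             # True if every atom has an MMFF94s atom type
    undefined_stereo_centers: int   # undefined atom (R,S) plus bond (E,Z) stereo centers


def eligible_for_3d(c: CompoundSummary) -> bool:
    """Return True if the record satisfies all six conditions from the text."""
    return (
        c.heavy_atoms <= 50
        and c.rotatable_bonds <= 15
        and c.covalent_units == 1
        and c.elements <= SUPPORTED_ELEMENTS   # subset test
        and c.mmff94s_typed
        and c.undefined_stereo_centers <= 5
    )


# Toy records (illustrative values only).
small_organic = CompoundSummary(13, 3, 1, frozenset({"C", "H", "O"}), True, 0)
sodium_salt = CompoundSummary(14, 3, 2, frozenset({"C", "H", "O", "Na"}), False, 0)
```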
where RMSD pred is the predicted upper limit of the RMSD accuracy, chosen so that at least 90% of the conformer models generated by OMEGA using the selected PubChem parameter set for the 25,972 PDB ligands contained at least one conformer whose RMSD from the experimentally determined “bioactive” conformation was smaller than the value predicted using Equation (3).
where “int( )” gives the whole number, irrespective of any remaining fraction, and where this RMSD threshold is referred to as RMSD cluster to emphasize its use for clustering purposes (rather than accuracy prediction). Each sampled conformer represents a cluster containing all conformers within the designated RMSD threshold, thus reducing the count of conformers per conformer model. If the conformer model after cluster sampling has more than the maximum of 500 conformers, it is re-clustered using an RMSD cluster value incremented by a further 0.2. This process is repeated as many times as necessary to reduce the overall conformer count to 500 or fewer. Although this clustering process makes the conformer models more manageable in size and better suited for a large database such as PubChem, it may be accompanied by an undesirable loss of overall accuracy of the conformer model. Therefore, as a follow-up to our previous study, the present study investigates the effect of conformer model clustering upon the accuracy of the conformer models, in order to address key questions about the ability of PubChem3D sampled conformers to reproduce “bioactive” ligand geometries: as a function of molecular size and flexibility, with respect to the established PubChem3D similarity measures, and with an eye towards their expected performance in biological activity data analysis.
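The rounding step described above can be sketched in a few lines of Python. Equation (4) is not reproduced here, so the form used below, int(RMSD pred / 0.2 + 0.5) × 0.2, is an assumed reading of the description (truncation by int( ) and rounding to the nearest 0.2 increment), and the function names are illustrative only.

```python
def rmsd_cluster_from_pred(rmsd_pred: float, step: float = 0.2) -> float:
    """Round a predicted RMSD to the nearest multiple of `step` using int() truncation."""
    # Adding 0.5 before truncating makes int() round to the nearest whole step
    # for positive values (assumed form; see the lead-in above).
    n_steps = int(rmsd_pred / step + 0.5)
    # round(..., 1) removes floating-point noise such as 0.6000000000000001.
    return round(n_steps * step, 1)


def next_threshold(rmsd_cluster: float, step: float = 0.2) -> float:
    """Threshold for re-clustering a model that still exceeds the conformer cap."""
    return round(rmsd_cluster + step, 1)
```

For example, a predicted value of 0.57 Å maps to a clustering threshold of 0.6 Å, and a model that still has more than 500 conformers at that threshold would be re-clustered at 0.8 Å.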
Results and discussion
Molecular size and flexibility of the MMDB ligands
This study considers 47,123 small molecules with experimental 3-D coordinates available from the Molecular Modeling Database (MMDB) deposition in PubChem (Additional file 1). The molecular connectivity of these MMDB ligands is derived from the 3-D coordinates of the protein-bound small molecules taken from PDB records. Note that the “experimental” structures of small molecules in PDB are known to, at times, have non-trivial issues or uncertainty concerning their precise chemical identity, protein binding geometry, or crystal structure location [13, 32–36]. The present study largely ignores such potential issues and considers all the 3-D ligand structures as experimental facts.
RMSD clustering threshold
Summary statistics of overall conformer model accuracy
As expected, the PubChem3D conformer sampling procedure results in a loss of the conformer ensemble RMSD accuracy relative to experiment. On average, this overall loss is 0.18 Å (from 0.39 Å to 0.57 Å). The standard deviation of this average also increases by 0.12 Å (from 0.24 Å to 0.36 Å) and may reflect the rounding of RMSD cluster to the nearest 0.2 increment, potentially suggesting the ± 0.1 nature of such a change. In the aggregate, 90% of all conformer models in this study reflect RMSD accuracies better than 1.1 Å after clustering.
In the study by Hawkins et al., an RMSD value of 1.25 Å or less was employed as the definition of a “close” reproduction of the experimental conformation. They also pointed out that an RMSD of 2.0 Å could have been used as a cut-off because it is a common upper bound for successful reproduction of an experimental structure in molecular docking. With these criteria in mind, the after-clustering conformer models in the present study may be considered to be of high quality, although the choice of the cut-off for “close” reproduction of the experimental structure is still arbitrary and subjective.
As shown in Figure 5, after clustering, the fraction of the conformer models with accuracy better than RMSD cluster decreased by no more than 12% in general, except for RMSD cluster = 1.6 Å (34%). When the conformer model accuracy was predicted in a more conservative way using the limit RMSD cluster + 0.1 Å (rather than RMSD cluster ), the difference between the conformer models with accuracy better than this limit before and after clustering was no more than 6%, except for RMSD cluster = 1.6 Å (23%), showing that the realized sampling effects are local in nature. It appears that most of the decreased conformer model accuracy at RMSD cluster = 1.6 Å is simply due to an unfortunate culmination of pronounced partition-based clustering edge-effects for a set of flexible di- and tri-phosphate containing structures. In other words, for these particular computationally-derived conformer models, the conformation most similar to the experimental structure happened to lie near the boundaries of the clusters generated with RMSD cluster = 1.6 Å and was therefore not included in the conformer model after the clustering procedure. As a result, clustering a given conformer model using a different conformer ordering with the same clustering procedure could have yielded results closer to the pre-clustering result.
One readily notices for each RMSD cluster value in Figure 6 that conformer model clustering shifts the cumulative % distribution curves toward the right-hand side, indicating a decrease in the conformer model accuracy as a result of the PubChem sampling procedure. Looking at the 90% level of conformer models before and after clustering, there are some variances in the change of the conformer model accuracy, depending on the RMSD cluster value. For example, the difference between the RMSD accuracies at the 90% level before and after clustering with RMSD cluster of 0.4 Å and 0.6 Å is 0.1 Å (0.25 Å vs. 0.35 Å for RMSD cluster = 0.4 Å, and 0.5 Å vs. 0.6 Å for RMSD cluster = 0.6 Å). However, for the RMSD cluster values between 0.6 Å and 1.6 Å, the corresponding differences range between 0.2 Å and 0.3 Å. In general, it is very reassuring to see that most of these conformer models at the 90% level are within the expected range for most RMSD cluster values. Although sampling by its very nature will increase the distance between conformers, this increase does not appear to severely impact the accuracy of the conformer models in PubChem.
Comparison of ensemble accuracy measures
The CT score is given by CT = Σ_f V_AB^f / (Σ_f V_AA^f + Σ_f V_BB^f − Σ_f V_AB^f), where the index “f” is one of the six functional-group types, V_AA^f and V_BB^f are the self-overlap volumes for the functional group type “f” of the two molecules, respectively, and V_AB^f is the overlap volume for the functional group type “f” between the molecules. The ComboT [37, 38] similarity measure, which is defined as the arithmetic sum of the ST and CT scores, allows one to consider the two different similarities simultaneously. Because both the ST and CT scores range from 0 (for no similarity) to 1 (for identical molecules), the ComboT score ranges from 0 to 2 (without normalization, due to pre-existing convention).
The present study used two different approaches to compute these three 3-D similarity scores: the shape-optimized (or ST-optimized) approach and the feature-optimized (or CT-optimized) approach. In the shape-optimized approach, the superposition of two molecules is optimized to give a maximum ST score and then the CT score is computed in that shape-optimized alignment. In the feature-optimized approach, the color and shape of the two conformers are considered simultaneously to find the best superposition between them, as in the current version of ROCS. In the present paper, the shape-optimized and feature-optimized methods are denoted using the superscripts “ST-opt” and “CT-opt”, respectively. As a result, there are six different 3-D similarity scores (i.e., ST ST-opt , CT ST-opt , ComboT ST-opt , ST CT-opt , CT CT-opt , and ComboT CT-opt ). Along with RMSD, four of these six scores are used to analyze the accuracy of the clustered conformer models relative to the experimentally determined 3-D geometries: ST ST-opt , ComboT ST-opt , CT CT-opt , and ComboT CT-opt .
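To make the relationship between the three scores concrete, the following Python sketch computes ST, CT, and ComboT from precomputed overlap volumes. It assumes the usual Tanimoto-style overlap expressions (overlap divided by the sum of self-overlaps minus the overlap), with the color volumes summed over the six functional-group types mentioned in the text. The function names and the toy volumes are illustrative assumptions; in practice the volumes come from a shape-overlay program such as ROCS.

```python
from typing import Sequence


def shape_tanimoto(v_aa: float, v_bb: float, v_ab: float) -> float:
    """ST from the two self-overlap volumes and the mutual overlap volume."""
    return v_ab / (v_aa + v_bb - v_ab)


def color_tanimoto(v_aa_f: Sequence[float],
                   v_bb_f: Sequence[float],
                   v_ab_f: Sequence[float]) -> float:
    """CT from per-functional-group-type overlap volumes (one entry per type f)."""
    s_aa, s_bb, s_ab = sum(v_aa_f), sum(v_bb_f), sum(v_ab_f)
    # Note: undefined if neither molecule has any color features (denominator 0).
    return s_ab / (s_aa + s_bb - s_ab)


def combo_tanimoto(st: float, ct: float) -> float:
    """ComboT is the plain sum of ST and CT, so it ranges from 0 to 2."""
    return st + ct


# Toy volumes (illustrative values only), six functional-group types.
st = shape_tanimoto(v_aa=100.0, v_bb=80.0, v_ab=60.0)
ct = color_tanimoto([4, 2, 0, 0, 0, 0], [3, 2, 1, 0, 0, 0], [2, 1, 0, 0, 0, 0])
combo = combo_tanimoto(st, ct)
```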
Linear behavior of average conformer model accuracy as a function of RMSD cluster value
Taking this all into account, in general, conformer model clustering increases the conformer RMSD value and decreases the four 3-D Tanimoto values, indicating reduced accuracy of the conformer ensemble due to conformer clustering. The difference between the ensemble accuracies before and after clustering increases with the values of N NHA , N R , and N ER , implying that the effects of conformer clustering become more noticeable in bigger and more flexible molecules, which is expected considering that the RMSD cluster value gets larger [Equation (3)]. As compared in Figures 9 and 11, the average conformer CT CT-opt accuracy values show a larger decrease upon clustering than the average conformer ST ST-opt values, meaning that the conformer CT CT-opt values are more sensitive to clustering than the conformer ST ST-opt values. However, conformer clustering decreases the average ComboT ST-opt and average ComboT CT-opt values (Figures 10 and 12) by a similar amount, again showing the insensitivity of the ComboT value to the optimization type. A similar insensitivity of the ComboT value to the optimization type was also observed in our previous studies [9, 11], in which the distribution of the ComboT ST-opt scores between randomly selected conformers was found to be very similar to that of the ComboT CT-opt scores.
What does this all mean? The average loss of accuracy of PubChem3D conformer ensembles behaves in a predictable fashion, even after sampling, as a function of molecular size and flexibility across PubChem3D similarity measures. The accuracy with which bioactive conformers are reproduced degrades linearly both before and after the sampling procedure, and the sampling procedure itself adds only a modest amount of degradation. Generally speaking, one expects the worst-case minimum accuracy of 90% of the PubChem3D ensembles to be (as stated previously from Figure 7) 0.75, 1.09, 0.43, and 1.13, in terms of ST ST-opt , ComboT ST-opt , CT CT-opt , and ComboT CT-opt , respectively. This expected minimum accuracy improves linearly as the molecule becomes smaller or less flexible.
One may ask “how good or how bad are these worst-case minimum accuracies?” To answer this question, it is necessary to determine an appropriate cut-off value for a “close” reproduction of the experimental structure, and our recent study, which examined the statistical significance of the ROCS-based similarity scores, provides some clues on an appropriate choice of the cut-off values. In that study, the ROCS-based 3-D similarity scores between randomly-selected biologically-tested compounds were computed, and from the distribution of these scores, conversion tables were generated which convert a ROCS-based similarity score to the p-value of getting that particular score by randomly selecting two biologically-tested conformers. According to these conversion tables, the p-value of getting a similarity score equal to the worst-case minimum accuracy by selecting two random conformers is 0.019, 0.002, 0.003, and 0.002 for ST ST-opt , ComboT ST-opt , CT CT-opt , and ComboT CT-opt , respectively. If a significance level (α) of 0.05 is employed, these p-values are small enough to reject the null hypothesis of getting a particular 3-D similarity score by chance. Although this interpretation depends on the significance level one chooses, it is still true that these worst-case minimum accuracies of the conformer models (0.75, 1.09, 0.43, and 1.13, for ST ST-opt , ComboT ST-opt , CT CT-opt , and ComboT CT-opt , respectively) are much greater than one may expect from randomly selected conformer pairs (0.54 ± 0.10, 0.62 ± 0.13, 0.18 ± 0.06, and 0.59 ± 0.14, for ST ST-opt , ComboT ST-opt , CT CT-opt , and ComboT CT-opt , respectively) [9, 11], implying structural similarity between the conformer model and the experimental structure.
Also note that this interpretation is consistent with the fact that the 90% of the conformer models considered in this study have RMSD accuracies better than 1.1 Å, which is much tighter than the common upper bound (RMSD 2.0 Å) for successful reproduction of an experimental conformation in molecular docking, as mentioned above.
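The comparison above between the worst-case minimum accuracies and the scores of randomly selected conformer pairs can be tabulated with a short script. All numbers below are taken from the text. Expressing the gap in units of the random-pair standard deviation is an illustrative framing added here, not an analysis performed in the study, and it does not assume that the random-pair scores are normally distributed.

```python
ALPHA = 0.05  # significance level used in the text

measures = ["ST ST-opt", "ComboT ST-opt", "CT CT-opt", "ComboT CT-opt"]
worst_case = [0.75, 1.09, 0.43, 1.13]        # worst-case minimum accuracy (90% of models)
random_mean = [0.54, 0.62, 0.18, 0.59]       # mean score of random conformer pairs
random_sd = [0.10, 0.13, 0.06, 0.14]         # standard deviation of random-pair scores
p_values = [0.019, 0.002, 0.003, 0.002]      # p-values from the conversion tables

summary = {}
for name, worst, mean, sd, p in zip(measures, worst_case, random_mean, random_sd, p_values):
    summary[name] = {
        # Gap between worst-case accuracy and the random-pair mean, in SD units.
        "sd_above_random": (worst - mean) / sd,
        # True if the p-value is below the chosen significance level.
        "significant": p < ALPHA,
    }

for name, row in summary.items():
    print(f"{name}: {row['sd_above_random']:.2f} SD above random, "
          f"significant at alpha={ALPHA}: {row['significant']}")
```

In every case the worst-case accuracy lies more than two random-pair standard deviations above the random-pair mean, which is in line with the small p-values quoted in the text.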
When it comes to biological activity data analysis, the present study shows that there will be a definitive upper limit to the PubChem3D conformer ensemble accuracy based on the molecular size and flexibility. While the results of the present study consider all sampled conformers, PubChem3D search and analysis tools use a diverse subset of sampled conformers, where the diverse subset selection criterion is the ComboT ST-opt dissimilarity. [The reason for using the ComboT dissimilarity is that it considers both the ST and CT dissimilarity simultaneously. While the choice of the optimization type is somewhat arbitrary, our previous studies [9, 11] have shown that the ComboT score is not very sensitive to the optimization type in the aggregate.] The effects of using a diverse set of sampled conformers will likely further decrease performance beyond that reported in this study. In addition, one can expect that, as the desired 3-D Tanimoto threshold increases in a given biological activity analysis, the ability to interrelate larger and more flexible molecules will decrease, not because they necessarily lack common biologically accessible conformer space, but because of the inherent similarity distance between the stored sampled conformers. This analysis also suggests that the use of a single “one-size-fits-all” similarity Tanimoto threshold for PubChem3D molecules may not be an ideal choice for conformer models sampled at different RMSD values. The results from this study suggest that conformer sampling may exacerbate the molecular size/flexibility dependency already present in conformer generation software. Smaller and less flexible molecules in PubChem3D will have a tighter conformational sampling (with a smaller spacing between conformers) than larger and more flexible molecules, and therefore, can interrelate more molecules at a given Tanimoto value.
In addition, smaller and less flexible molecules will have fewer sampled conformers in their respective conformer ensemble and will likely have less of a reduction in accuracy due to the use of a diverse subset. As a result, a search using a smaller and less flexible molecule as a query is likely to return more 3-D similar molecules than a search using a larger and more flexible query molecule. Furthermore, even if a large or flexible molecule is used as a 3-D similarity query, an increasing proportion of returned results are likely to be smaller or less flexible as the Tanimoto value is increased. This potential bias towards conformer models with smaller sampling distances may be worth further consideration and study to develop a more reliable 3-D similarity-based biological activity analysis method.
Effects of experimental uncertainties upon conformer model accuracies
Like any experimentally-derived measurements, the crystal structures stored in PDB have uncertainties in their atomic coordinates, and the interpretation of the accuracy of a computationally-derived conformer model should take into account the positional uncertainty of the corresponding experimental ligand structure. For example, if the positional uncertainty in the experimental structure is greater than the RMSD value of the conformer model, comparison between the experimental and theoretical ligand structures is not particularly meaningful. The average positional error of atoms in a crystal structure can be evaluated with the diffraction-component precision index (DPI) [41, 42], which can be approximated as proposed by Blow, using information commonly contained in the header of a PDB file. In the study by Hawkins et al., crystal structures with a DPI of < 0.42 Å were considered precise enough for use as a standard dataset for validation of conformer generators, and in this way, conformer models with an RMSD value of > 0.6 Å (= √2 × DPI) were guaranteed to be meaningful predictions.
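The √2 × DPI relationship quoted above is easy to check numerically. The sketch below is illustrative only: the function names are assumptions, and the DPI value itself would have to be obtained separately (for example, from an approximation such as the one attributed to Blow in the text).

```python
import math


def min_meaningful_rmsd(dpi: float) -> float:
    """RMSD below which a comparison is dominated by coordinate uncertainty (sqrt(2) x DPI)."""
    return math.sqrt(2.0) * dpi


def is_meaningful(rmsd: float, dpi: float) -> bool:
    """True if the conformer-to-crystal RMSD exceeds the positional uncertainty limit."""
    return rmsd > min_meaningful_rmsd(dpi)


# DPI cut-off of 0.42 Angstrom from the text; the result is about 0.59, quoted as 0.6.
threshold = min_meaningful_rmsd(0.42)
```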
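Comparison of the average and median RMSD and ComboT CT-opt values between different PDB ligand sets. Entries are given as average (± standard deviation) / median.
Measure | 47,123 ligands, before clustering | 47,123 ligands, after clustering | 157 ligands, before clustering | 157 ligands, after clustering | 197 ligands (Ref.)
RMSD (Å) | 0.39 (±0.24) / 0.30 | 0.57 (±0.36) / 0.50 | 0.40 (±0.26) / 0.33 | 0.65 (±0.31) / 0.61 | 0.67 / 0.51
ComboT CT-opt | 1.77 (±0.20) / 1.85 | 1.61 (±0.27) / 1.70 | 1.75 (±0.22) / 1.83 | 1.55 (±0.25) / 1.58 | 1.56 / 1.64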
Note that the average RMSD value of the 157-ligand set differed by only 0.02 Å from that of the 197-ligand set from the study of Hawkins et al. (0.65 Å vs. 0.67 Å). The difference in the ComboT CT-opt accuracy between the two sets was 0.01 (1.55 and 1.56 for the 157- and 197-ligand sets, respectively). Considering that our study used OMEGA parameters different from those used in their study, the conformer model accuracies from the two studies do not seem very different.
In the present study, conformer ensembles for 47,123 PDB ligand molecules from MMDB were computationally generated using the PubChem3D approach. The accuracy with which the conformer models reproduce the experimentally-derived structures was investigated in terms of the RMSD and the PubChem3D similarity scores (i.e., ST ST-opt , ComboT ST-opt , CT CT-opt , and ComboT CT-opt ). The PubChem3D conformer sampling procedure increased the RMSD value of the conformer ensemble by 0.18 ± 0.12 Å, and decreased the ST ST-opt , ComboT ST-opt , CT CT-opt , and ComboT CT-opt accuracies by 0.04 ± 0.03, 0.16 ± 0.09, 0.09 ± 0.05, and 0.15 ± 0.09, respectively (see Table 1), indicating a decrease in the conformer ensemble accuracy in general. For all five accuracy measures (RMSD, ST ST-opt , ComboT ST-opt , CT CT-opt , and ComboT CT-opt ), the conformer model accuracies before and after clustering decreased linearly with the increase in the RMSD cluster value (as well as N NHA , N R and N ER ), with R² values for these fits greater than 0.91 (see Figures 8, 9, 10, 11 and 12 and Table 2).
Whereas the change in the CT CT-opt accuracy (0.09 ± 0.05) upon clustering was much greater than the ST ST-opt average difference (0.04 ± 0.03), the ComboT ST-opt and ComboT CT-opt changes had similar averages and standard deviations (0.16 ± 0.09 vs. 0.15 ± 0.09). This implies that, in general, while the CT CT-opt accuracy is more sensitive to the clustering than the ST ST-opt accuracy, the effect of the clustering upon the ComboT accuracy is not sensitive to the optimization type. Similarly, while the rate of the decrease of the ST ST-opt accuracy with the increase in molecular size and flexibility was much slower than that of the CT CT-opt accuracy (Figure 9 vs. Figure 11), the ComboT ST-opt and ComboT CT-opt accuracies decreased at a similar rate (Figure 10 vs. Figure 12).
This study shows that there is a definitive limit in the ability of the PubChem3D sampled conformer models to reproduce the bioactive conformations found in PDB ligands. This study also suggests that larger and more flexible molecules may be less able to interrelate with other larger and more flexible molecules at a given Tanimoto value than smaller and less flexible molecules are. [This is also supported by our recent study on the PubChem 3-D neighbors. The PubChem 3-D neighbors (also known as “similar conformers”) are defined as any two compounds that are structurally similar (with ST ST-opt ≥ 0.8 and CT ST-opt ≥ 0.5), and it was found that compounds without 3-D neighbors occur more frequently among larger compounds than among smaller compounds. In addition, smaller molecules tend to have more 3-D neighbors than larger molecules.] As a result, one may want to consider such effects when performing a 3-D similarity search or 3-D biological activity data analysis. Specifically, for 90% of the PubChem3D conformer models, one can in general expect the worst-case minimum accuracy to be 0.75, 1.09, 0.43, and 1.13, in terms of ST ST-opt , ComboT ST-opt , CT CT-opt , and ComboT CT-opt , respectively (see Figure 7). These values are expected to improve linearly as the molecules considered become smaller and less flexible. In addition, these values may become worse if a diverse subset of sampled conformers is used.
The experimental 3-D structures of small molecules were downloaded from the Molecular Modeling Database (MMDB) ligand dataset [45, 46] as available from the PubChem Substance database at the National Center for Biotechnology Information (NCBI) (as of July 1, 2010). Ligands that were too small or too big were discarded by limiting the non-hydrogen atom count to 6 – 50. Ligands that were too flexible (with an effective rotor count greater than 15) were also eliminated. This filtering stage resulted in a set of 47,123 non-unique, organic (i.e., carbon-containing) 3-D experimental reference structures for which a 3-D conformer model could be generated. The distributions of molecular size and flexibility of the dataset are depicted in Figure 1.
Select the PubChem Substance records associated with the MMDB records for the 197 PDB structures selected in the study by Hawkins et al. These PDB structures were chosen by considering the local quality of fit of the ligand to its density, as well as global-level metrics of the protein structure. Because some of the 197 PDB structures had multiple ligands, there were 265 PubChem Substance records associated with these protein structures. Note that, because their study provided a list of PDB identifiers (without unique ligand identifiers), it was difficult to determine which ligands were actually included in the 197-ligand set. Therefore, filtering steps similar to those used in their study were subsequently applied.
Select the PubChem Substance records that are neither too rigid nor too flexible (3 ≤ N R ≤ 16), and that are neither too small nor too large (8 ≤ N NHA ≤ 50). This filtering stage resulted in 200 PubChem Substance records.
Select the PubChem Substance records with good “local” quality of fit to the density. Hawkins et al. used three metrics for this purpose: the real-space correlation coefficient (RSCC), the real-space R-value (RSR), and the occupancy-weighted B-factor (OWAB). In the present study, the same criteria as used in their study (RSCC > 0.9, RSR < 0.2, and 1 < OWAB < 50) were applied, after downloading these data from the electron density server (EDS) [49, 50]. After this step, 176 structures remained.
Some of the remaining 176 PubChem Substance records were associated with identical PubChem Compound records, or had the same three-letter PDB ligand codes, implying that they were the same ligand molecule. In these cases, the one with the largest RSCC value was retained, and the others were removed. After removing this redundancy, 164 structures remained.
When any pair of the remaining 164 structures had a PubChem 2-D similarity score of > 0.9 (computed using the PubChem 2-D subgraph fingerprints and the Tanimoto equation [6, 7]), the one with the largest RSCC value was retained and the other was removed. [In the study of Hawkins et al., the LINGOS method was used instead of the PubChem fingerprint to remove molecules that were too similar.] After this final filtering stage, 157 ligands remained.
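The filtering steps above can be summarized as a small pipeline. The sketch below is a simplified illustration under several assumptions: the `LigandRecord` fields are hypothetical, the quality metrics are assumed to have been retrieved beforehand, and both de-duplication steps are implemented greedily in order of decreasing RSCC, which may differ in detail from the procedure actually used. The `similarity` argument stands in for a 2-D Tanimoto function supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LigandRecord:
    """Hypothetical record for one PubChem Substance entry (field names are illustrative)."""
    sid: int           # PubChem Substance identifier
    cid: int           # associated PubChem Compound identifier
    ligand_code: str   # three-letter PDB ligand code
    n_r: int           # rotatable bond count
    n_nha: int         # non-hydrogen atom count
    rscc: float        # real-space correlation coefficient
    rsr: float         # real-space R-value
    owab: float        # occupancy-weighted B-factor


def passes_size_and_flexibility(r: LigandRecord) -> bool:
    return 3 <= r.n_r <= 16 and 8 <= r.n_nha <= 50


def passes_local_fit(r: LigandRecord) -> bool:
    return r.rscc > 0.9 and r.rsr < 0.2 and 1 < r.owab < 50


def drop_redundant(records: List[LigandRecord]) -> List[LigandRecord]:
    """Keep the highest-RSCC record among those sharing a CID or a ligand code."""
    kept, seen_cids, seen_codes = [], set(), set()
    for r in sorted(records, key=lambda x: x.rscc, reverse=True):
        if r.cid in seen_cids or r.ligand_code in seen_codes:
            continue
        kept.append(r)
        seen_cids.add(r.cid)
        seen_codes.add(r.ligand_code)
    return kept


def drop_near_duplicates(records: List[LigandRecord],
                         similarity: Callable[[LigandRecord, LigandRecord], float],
                         cutoff: float = 0.9) -> List[LigandRecord]:
    """Greedily keep a record only if it is not too similar to any already-kept record."""
    kept: List[LigandRecord] = []
    for r in sorted(records, key=lambda x: x.rscc, reverse=True):
        if all(similarity(r, k) <= cutoff for k in kept):
            kept.append(r)
    return kept


def filter_ligands(records, similarity):
    step1 = [r for r in records if passes_size_and_flexibility(r)]
    step2 = [r for r in step1 if passes_local_fit(r)]
    return drop_near_duplicates(drop_redundant(step2), similarity)
```

A usage example with toy records (all values invented for illustration):

```python
toy = [
    LigandRecord(1, 10, "ATP", 8, 31, 0.95, 0.10, 20.0),
    LigandRecord(2, 10, "ATP", 8, 31, 0.97, 0.10, 20.0),   # same ligand, better RSCC
    LigandRecord(3, 11, "HEM", 2, 43, 0.96, 0.10, 20.0),   # too rigid
    LigandRecord(4, 12, "NAG", 5, 15, 0.85, 0.10, 20.0),   # poor local fit
    LigandRecord(5, 13, "FAD", 13, 50, 0.93, 0.15, 30.0),
]
final = filter_ligands(toy, similarity=lambda a, b: 0.5)
```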
Conformer generation using OMEGA
The conformer ensemble for each molecule in the dataset was generated using the OMEGA software from OpenEye Scientific Software, Inc. The OMEGA application performs conformer generation in two primary stages: fragment generation and torsion driving. The fragment generation stage splits the input structure into smaller pieces that are energy minimized and conformationally sampled to get diverse 3-D representations for each molecule fragment. The torsion driving stage reassembles and iterates over the fragments from the first stage using particular rule-based torsion angles that depend on the molecular environment between connecting fragments. A more detailed description of the OMEGA software is given elsewhere [18, 52].
OMEGA has many adjustable parameters to generate conformations with particular attributes, and the optimal set of parameter values used for the present study was based on our recent study. The Merck Molecular Force Field (MMFF94s) without coulombic terms (MMFF94s_NoEstat) was used with the "startfact" value of 20. An energy window value of 25 kcal/mol was employed for both the model building and torsion driving stages. The values used for other parameters were identical to those used in the previous study.
As pointed out in a recent review by Scior et al., because adequate conformational space coverage is an important requirement for reliable 3-D similarity computations, it would be desirable to consider as many conformations per molecule as possible. However, because this would require tremendous computational resources, a compromise between computational cost and conformational coverage is unavoidable. The PubChem3D conformer generation procedure generates a maximum of 100,000 conformers for each chemical structure. As demonstrated in our previous study, this limit may not be enough for very flexible and large molecules, resulting in truncation of the conformational search. However, the same study showed that the 100-K limit does not cause a noticeable decrease in the “average” conformer model accuracy for smaller and less flexible molecules (i.e., with N NHA ≤ 35 and N ER ≤ 15). Therefore, this 100-K limit seems adequate for these molecules in general.
Clustering of conformer ensembles
After the conformer models were produced, a data reduction was performed whereby conformers were sampled to identify a random set of conformers separated from each other by a minimum RMSD distance. This minimum RMSD distance was determined by rounding the RMSD pred value [in Equation (3)] to the nearest 0.2 increment [i.e., Equation (4)]. The conformers in each conformer ensemble were down-sampled using a partition-based clustering scheme, as described in our previous study, with the RMSD as a distance threshold between conformers (that is, RMSD cluster ) and the lowest-energy conformer in each partition as an initial “seed” structure for clustering of that partition. The centroid of each cluster was selected as the representative conformer of that cluster to construct a smaller conformer model with 500 conformers or fewer. If the conformer model had more than 500 conformers after sampling, it was re-clustered with the RMSD cluster value incremented by a further 0.2. This re-clustering process was repeated as many times as necessary to reduce the overall conformer count to 500 or fewer. Note that, because the lowest-energy conformer in each partition was used as an initial seed, low-energy conformers are more likely to be included than high-energy conformers when all partitions are combined for the next round of clustering. As a result, the final conformer model sampled through clustering is more likely to include low-energy conformers than high-energy conformers. All RMSD values computed in this study used the OEChem OERMSD function with: “overlay” turned on to allow rotation/translation to yield the lowest possible RMSD value; and “automorph” detection turned on to allow proper treatment of symmetrically equivalent atoms, except when its use resulted in excessive run-time [an extremely rare event (at a rate of about 1 in 10,000) generally caused by large, nearly symmetric molecules].
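The re-clustering loop described above can be illustrated with a deliberately simplified sketch. It is not the partition-based scheme used in the study: here a single greedy pass keeps a conformer only if it lies at least the threshold distance from every conformer already kept, the input is assumed to be sorted by energy (lowest first), and each round re-samples the full input rather than the previously sampled set. The `distance` argument stands in for an RMSD function supplied by the caller; all names are illustrative assumptions.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

C = TypeVar("C")


def greedy_sample(conformers: Sequence[C],
                  distance: Callable[[C, C], float],
                  threshold: float) -> List[C]:
    """Keep a conformer only if it is at least `threshold` away from all kept conformers."""
    kept: List[C] = []
    for conf in conformers:  # assumed pre-sorted by energy, lowest first
        if all(distance(conf, k) >= threshold for k in kept):
            kept.append(conf)
    return kept


def downsample(conformers: Sequence[C],
               distance: Callable[[C, C], float],
               rmsd_cluster: float,
               max_conformers: int = 500,
               step: float = 0.2) -> Tuple[List[C], float]:
    """Raise the threshold in `step` increments until the model fits the conformer cap."""
    if max_conformers < 1:
        raise ValueError("max_conformers must be at least 1")
    threshold = rmsd_cluster
    sampled = greedy_sample(conformers, distance, threshold)
    while len(sampled) > max_conformers:
        # round(..., 1) keeps thresholds on clean 0.2 increments despite float noise.
        threshold = round(threshold + step, 1)
        sampled = greedy_sample(conformers, distance, threshold)
    return sampled, threshold


# Toy example: "conformers" are points on a line and the distance is their separation.
toy_conformers = list(range(100))
sampled, final_threshold = downsample(
    toy_conformers, lambda a, b: float(abs(a - b)), rmsd_cluster=0.4, max_conformers=50
)
```

Because conformers are visited in order of increasing energy in this sketch, low-energy conformers are favored in the sampled set, which mirrors the bias toward low-energy conformers noted in the text.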
Evaluation of ensemble accuracies
The accuracy of the clustered ensembles was estimated using five different accuracy measures: RMSD, ST ST-opt , ComboT ST-opt , CT CT-opt , and ComboT CT-opt . The latter four accuracy measures were computed using ROCS [37, 38, 52]. Note that each generated conformer model had up to 500 conformers, and its accuracy for a given measure was taken from the conformer closest to the experimental structure (that is, the one with the smallest RMSD value or the largest ROCS-based similarity value).
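The best-conformer selection can be sketched as follows, assuming the per-conformer scores against the experimental structure have already been computed by an external tool. The function name and the numbers are hypothetical and purely illustrative; they are not results from this study.

```python
def best_conformer(scores, higher_is_better):
    """Return (index, score) of the conformer closest to the reference.

    scores: one value per conformer in the ensemble.
    higher_is_better: False for RMSD, True for Tanimoto-type similarity.
    """
    if not scores:
        raise ValueError("ensemble has no conformers")
    pick = max if higher_is_better else min
    index = pick(range(len(scores)), key=lambda i: scores[i])
    return index, scores[index]


# Made-up scores for a three-conformer ensemble.
rmsd_values = [1.8, 0.9, 1.2]      # smaller is better
shape_values = [0.62, 0.71, 0.80]  # larger is better

print(best_conformer(rmsd_values, higher_is_better=False))   # (1, 0.9)
print(best_conformer(shape_values, higher_is_better=True))   # (2, 0.8)
```

As the toy numbers show, the conformer that is best by one measure need not be the best by another, so each accuracy measure is evaluated independently.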
This research was supported in part by the Intramural Research Program of the National Library of Medicine, National Institutes of Health, U.S. Department of Health and Human Services.
- Bolton EE, Wang Y, Thiessen PA, Bryant SH: PubChem: integrated platform of small molecules and biological activities. Annual Reports in Computational Chemistry. Volume 4. Edited by: Ralph AW, David CS. 2008, Elsevier, Amsterdam, the Netherlands, 217-241.
- Wang YL, Xiao JW, Suzek TO, Zhang J, Wang JY, Bryant SH: PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009, 37: W623-W633. 10.1093/nar/gkp456.
- Wang YL, Bolton E, Dracheva S, Karapetyan K, Shoemaker BA, Suzek TO, Wang JY, Xiao JW, Zhang J, Bryant SH: An overview of the PubChem BioAssay resource. Nucleic Acids Res. 2010, 38: D255-D266. 10.1093/nar/gkp965.
- Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Federhen S, et al: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2011, 39: D38-D51. 10.1093/nar/gkq1172.
- PubChem substructure fingerprint description. [ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.pdf]
- Chen X, Reynolds CH: Performance of similarity measures in 2D fragment-based similarity searching: comparison of structural descriptors and similarity coefficients. J Chem Inf Comput Sci. 2002, 42: 1407-1414. 10.1021/ci025531g.
- Holliday JD, Salim N, Whittle M, Willett P: Analysis and display of the size dependence of chemical similarity coefficients. J Chem Inf Comput Sci. 2003, 43: 819-828. 10.1021/ci034001x.
- Bolton EE, Kim S, Bryant SH: PubChem3D: similar conformers. J Cheminform. 2011, 3: 13. 10.1186/1758-2946-3-13.
- Kim S, Bolton EE, Bryant SH: PubChem3D: biologically relevant 3-D similarity. J Cheminform. 2011, 3: 26. 10.1186/1758-2946-3-26.
- Bolton EE, Chen J, Kim S, Han L, He S, Shi W, Simonyan V, Sun Y, Thiessen PA, Wang J, et al: PubChem3D: a new resource for scientists. J Cheminform. 2011, 3: 32. 10.1186/1758-2946-3-32.
- Kim S, Bolton E, Bryant S: Effects of multiple conformers per compound upon 3-D similarity search and bioassay data analysis. J Cheminform. 2012, 4: 28. 10.1186/1758-2946-4-28.
- PubChem3D Thematic Series. [http://www.jcheminf.com/series/pubchem3d]
- Bolton EE, Kim S, Bryant SH: PubChem3D: conformer generation. J Cheminform. 2011, 3: 4. 10.1186/1758-2946-3-4.
- Kim S, Bolton EE, Bryant SH: PubChem3D: shape compatibility filtering using molecular shape quadrupoles. J Cheminform. 2011, 3: 25. 10.1186/1758-2946-3-25.
- Bolton EE, Kim S, Bryant SH: PubChem3D: diversity of shape. J Cheminform. 2011, 3: 9. 10.1186/1758-2946-3-9.
- Halgren TA: Merck molecular force field. 1. Basis, form, scope, parameterization, and performance of MMFF94. J Comput Chem. 1996, 17: 490-519. 10.1002/(SICI)1096-987X(199604)17:5/6<490::AID-JCC1>3.0.CO;2-P.
- Halgren TA: MMFF VI. MMFF94s option for energy minimization studies. J Comput Chem. 1999, 20: 720-729. 10.1002/(SICI)1096-987X(199905)20:7<720::AID-JCC7>3.0.CO;2-X.
- Hawkins PCD, Skillman AG, Warren GL, Ellingson BA, Stahl MT: Conformer generation with OMEGA: algorithm and validation using high quality structures from the Protein Databank and Cambridge Structural Database. J Chem Inf Model. 2010, 50: 572-584. 10.1021/ci100031x.
- Hawkins PCD, Nicholls A: Conformer Generation with OMEGA: learning from the data set and the analysis of failures. J Chem Inf Model. 2012, 52: 2919-2936. 10.1021/ci300314k.
- Murrall NW, Davies EK: Conformational freedom in 3-D databases. 1. Techniques. J Chem Inf Comput Sci. 1990, 30: 312-316. 10.1021/ci00067a016.
- Hurst T: Flexible 3D searching: the directed tweak technique. J Chem Inf Comput Sci. 1994, 34: 190-196. 10.1021/ci00017a025.
- Klebe G, Mietzner T: A fast and efficient method to generate biologically relevant conformations. J Comput Aided Mol Des. 1994, 8: 583-606. 10.1007/BF00123667.
- Renner S, Schwab CH, Gasteiger J, Schneider G: Impact of conformational flexibility on three-dimensional similarity searching using correlation vectors. J Chem Inf Model. 2006, 46: 2324-2332. 10.1021/ci050075s.
- Greene J, Kahn S, Savoj H, Sprague P, Teig S: Chemical function queries for 3D database search. J Chem Inf Comput Sci. 1994, 34: 1297-1308. 10.1021/ci00022a012.
- OMEGA, Version 2.1. 2006, OpenEye Scientific Software, Inc, Santa Fe, NM.
- OMEGA, Version 2.2. 2007, OpenEye Scientific Software, Inc, Santa Fe, NM.
- OMEGA, Version 2.3. 2008, OpenEye Scientific Software, Inc, Santa Fe, NM.
- OMEGA, Version 2.4. 2009, OpenEye Scientific Software, Inc, Santa Fe, NM.
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The protein data bank. Nucleic Acids Res. 2000, 28: 235-242. 10.1093/nar/28.1.235.
- Borodina YV, Bolton E, Fontaine F, Bryant SH: Assessment of conformational ensemble sizes necessary for specific resolutions of coverage of conformational space. J Chem Inf Model. 2007, 47: 1428-1437. 10.1021/ci7000956.
- Madej T, Addess KJ, Fong JH, Geer LY, Geer RC, Lanczycki CJ, Liu CL, Lu SN, Marchler-Bauer A, Panchenko AR, et al: MMDB: 3D structures and macromolecular interactions. Nucleic Acids Res. 2012, 40: D461-D464. 10.1093/nar/gkr1162.
- Bostrom J: Reproducing the conformations of protein-bound ligands: a critical evaluation of several popular conformational searching tools. J Comput Aided Mol Des. 2001, 15: 1137-1152. 10.1023/A:1015930826903.
- Nicklaus MC, Wang SM, Driscoll JS, Milne GWA: Conformational changes of small molecules binding to proteins. Bioorg Med Chem. 1995, 3: 411-428. 10.1016/0968-0896(95)00031-B.
- Nissink JWM, Murray C, Hartshorn M, Verdonk ML, Cole JC, Taylor R: A new test set for validating predictions of protein-ligand interaction. Proteins-Structure Function and Genetics. 2002, 49: 457-471. 10.1002/prot.10232.
- Westbrook J, Feng ZK, Burkhardt K, Berman HM: Validation of protein structures for Protein Data Bank. Macromolecular Crystallography, Pt D. 2003, Academic Press Inc, San Diego, 370-385.
- Acharya KR, Lloyd MD: The advantages and limitations of protein crystal structures. Trends Pharmacol Sci. 2005, 26: 10-14. 10.1016/j.tips.2004.10.011.
- ROCS - Rapid Overlay of Chemical Structures, Version 3.1.0. 2010, OpenEye Scientific Software, Inc, Santa Fe, NM.
- ShapeTK - C++, Version 1.8.0. 2010, OpenEye Scientific Software, Inc, Santa Fe, NM.
- Grant JA, Gallardo MA, Pickup BT: A fast method of molecular shape comparison: a simple application of a Gaussian description of molecular shape. J Comput Chem. 1996, 17: 1653-1666. 10.1002/(SICI)1096-987X(19961115)17:14<1653::AID-JCC7>3.0.CO;2-K.
- Rush TS, Grant JA, Mosyak L, Nicholls A: A shape-based 3-D scaffold hopping method and its application to a bacterial protein-protein interaction. J Med Chem. 2005, 48: 1489-1495. 10.1021/jm040163o.
- Cruickshank DWJ: Remarks about protein structure precision. Acta Crystallogr D. 1999, 55: 583-601. 10.1107/S0907444998012645.
- Cruickshank DWJ: Remarks about protein structure precision (vol 55, pg 583, 1999). Acta Crystallogr D. 1999, 55: 1108-1108.
- Blow DM: Rearrangement of Cruickshank’s formulae for the diffraction-component precision index. Acta Crystallogr D. 2002, 58: 792-797. 10.1107/S0907444902003931.
- Goto J, Kataoka R, Hirayama N: Ph4Dock: pharmacophore-based protein-ligand docking. J Med Chem. 2004, 47: 6804-6811. 10.1021/jm0493818.
- Chen J, Anderson JB, DeWeese-Scott C, Fedorova ND, Geer LY, He S, Hurwitz DI, Jackson JD, Jacobs AR, Lanczycki CJ, et al: MMDB: Entrez’s 3D-structure database. Nucleic Acids Res. 2003, 31: 474-477. 10.1093/nar/gkg086.
- Wang Y, Addess KJ, Chen J, Geer LY, He J, He S, Lu S, Madej T, Marchler-Bauer A, Thiessen PA, et al: MMDB: annotating protein sequences with Entrez’s 3D-structure database. Nucleic Acids Res. 2007, 35: D298-D300. 10.1093/nar/gkl952.
- Murshudov GN, Vagin AA, Dodson EJ: Refinement of macromolecular structures by the maximum-likelihood method. Acta Crystallogr D. 1997, 53: 240-255. 10.1107/S0907444996012255.
- Jones TA, Zou JY, Cowan SW, Kjeldgaard M: Improved methods for building protein models in electron-density maps and the location of errors in these models. Acta Crystallogr A. 1991, 47: 110-119. 10.1107/S0108767390010224.
- Kleywegt GJ, Harris MR, Zou JY, Taylor TC, Wahlby A, Jones TA: The Uppsala electron-density server. Acta Crystallogr D. 2004, 60: 2240-2249. 10.1107/S0907444904013253.
- Electron Density Server (EDS). [http://www.jcheminf.com/series/pubchem3d]
- Grant JA, Haigh JA, Pickup BT, Nicholls A, Sayle RA: Lingos, finite state machines, and fast similarity searching. J Chem Inf Model. 2006, 46: 1912-1918. 10.1021/ci6002152.
- Bostrom J, Greenwood JR, Gottfries J: Assessing the performance of OMEGA with respect to retrieving bioactive conformations. J Mol Graph Model. 2003, 21: 449-462. 10.1016/S1093-3263(02)00204-8.
- Scior T, Bender A, Tresadern G, Medina-Franco JL, Martinez-Mayorga K, Langer T, Cuanalo-Contreras K, Agrafiotis DK: Recognizing pitfalls in virtual screening: a critical review. J Chem Inf Model. 2012, 52: 867-881. 10.1021/ci200528d.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.