PubChem3D: Conformer generation
- Evan E Bolton1Email author,
- Sunghwan Kim1 and
- Stephen H Bryant1
https://doi.org/10.1186/1758-2946-3-4
© NIH copyright; licensee Chemistry Central Ltd 2011
Received: 12 November 2010
Accepted: 27 January 2011
Published: 27 January 2011
Abstract
Background
PubChem, an open archive for the biological activities of small molecules, provides search and analysis tools to assist users in locating desired information. Many of these tools focus on the notion of chemical structure similarity at some level. PubChem3D enables similarity of chemical structure 3-D conformers to augment the existing similarity of 2-D chemical structure graphs. It is also desirable to relate theoretical 3-D descriptions of chemical structures to experimental biological activity. As such, it is important to be assured that the theoretical conformer models can reproduce experimentally determined bioactive conformations. In the present study, we investigate the effects of three primary conformer generation parameters (the fragment sampling rate, the energy window size, and force field variant) upon the accuracy of theoretical conformer models, and determined optimal settings for PubChem3D conformer model generation and conformer sampling.
Results
Using the software package OMEGA from OpenEye Scientific Software, Inc., theoretical 3-D conformer models were generated for 25,972 small-molecule ligands, whose 3-D structures were experimentally determined. Different values for primary conformer generation parameters were systematically tested to find optimal settings. Employing a greater fragment sampling rate than the default did not improve the accuracy of the theoretical conformer model ensembles. An ever increasing energy window did increase the overall average accuracy, with rapid convergence observed at 10 kcal/mol and 15 kcal/mol for model building and torsion search, respectively; however, subsequent study showed that an energy threshold of 25 kcal/mol for torsion search resulted in slightly improved results for larger and more flexible structures. Exclusion of coulomb terms from the 94s variant of the Merck molecular force field (MMFF94s) in the torsion search stage gave more accurate conformer models at lower energy windows. Overall average accuracy of reproduction of bioactive conformations was remarkably linear with respect to both non-hydrogen atom count ("size") and effective rotor count ("flexibility"). Using these as independent variables, a regression equation was developed to predict the RMSD accuracy of a theoretical ensemble to reproduce bioactive conformations. The equation was modified to give a minimum RMSD conformer sampling value to help ensure that 90% of the sampled theoretical models should contain at least one conformer within the RMSD sampling value to a "bioactive" conformation.
Conclusion
Optimal parameters for conformer generation using OMEGA were explored and determined. An equation was developed that provides an RMSD sampling value to use that is based on the relative accuracy to reproduce bioactive conformations. The optimal conformer generation parameters and RMSD sampling values determined are used by the PubChem3D project to generate theoretical conformer models.
Keywords
Background
PubChem [1–4] is an open archive for the biological activities of small molecules. It consists of three primary databases: Substance, Compound, and BioAssay. The PubChem Compound database contains the unique chemical structure content found in the PubChem Substance database. When possible, a theoretical 3-D conformer model description is generated for each and every record in the PubChem Compound database. This 3-D layer is the basis of the PubChem3D project.
PubChem provides search and analysis tools to assist users in locating desired information in the archive. The importance of this cannot be understated with more than 70 million substance descriptions, 28 million unique small molecules, 480,000 biological assays, and 110 million biological assay outcomes (results from a substance tested in an assay is considered an outcome). Nearly all of these tools focus on the notion of chemical structure similarity at some level. PubChem3D enables similarity of chemical structure 3-D conformers to augment the existing similarity of 2-D chemical structure graphs.
With the goal in mind to use theoretical 3-D descriptions of chemical structures to relate experimental biological activity, there must be an appropriate determination whether these constructs have any relevance to reality. Presumably, if the 3-D conformer model can readily reproduce a reputed "experimental bioactive ligand" conformation with sufficient regularity, one tends to feel (more) confident that the theoretical methodology may be producing biologically meaningful results. There is currently no way to prove with any absolute degree of certainty that all theoretical conformers produced will be biologically relevant; however, one can check if all known experimental "bioactive" conformers of a chemical structure can be found in its theoretical model.
The largest publicly available source of "experimental" 3-D coordinates of chemical structures is the Protein Data Bank (PDB) [5]. This data is not without its considerable issues [6–10]. Most "experimental" 3-D coordinates for small molecules provided by the PDB are, in essence, theoretical models derived from fitting electron density produced by X-ray diffraction experiments to the presumed location of atoms that are part of a protein, a ligand (typically bound to the protein), and other moieties (ions, water molecules, etc.). At times, electron density is lacking or there is some degree of uncertainty as to the precise location of the small molecule atoms. In this context, the ligand location or protein binding geometry cannot be considered to be well understood, with many possible conformations of the same ligand plausible [6, 11]. These concerns will be largely ignored here and all the PDB ligands will be treated as experimental fact for the purposes of this study.
There are a number of established theoretical conformer generator packages available [11–16]. Many of these perform reasonably well [6, 17, 18], being both fast and accurate. The specific requirements of the PubChem3D project (such as a multiplatform programmatic interface) made the choice of one of these (OMEGA [19]) very easy. Considering the size of PubChem, the need to relate similar conformers, and the desire to allow users to analyze biological activity patterns in real-time, it requires all 3-D information to be pre-computed and stored. This requirement is a primary limiting factor.
Conformer generator packages are very capable to produce many conformers per chemical structure. This is important as a small molecule of reasonable flexibility at room temperature can access many potential conformational shapes; however, it is impractical to store all produced theoretical conformations per chemical structure, especially when you need to consider the storage requirements for millions of compounds. A common practice is to limit the conformer count using some mix of energy-based filtering, minimum conformer root-mean-squared distance (RMSD) of pair-wise atoms, and random sampling. This leaves one to determine how best to minimize the count of conformations stored while not sacrificing coverage or resolution of biologically meaningful conformer space.
In this work, one of a series covering the PubChem3D project, we attempt to answer questions regarding conformer model construction relative to the ability to reproduce PDB ligand 3-D coordinates. For example, what is the base-line conformer generation software accuracy as a function of molecular size and flexibility? Given that conformer models are produced in vacuum, is it beneficial to remove bias towards conformers with intra-molecular interaction to improve accuracy? Is energy filtering useful? What are some practical limitations when generating conformers of flexible molecules? Can one predict average theoretical conformer model accuracy? How do you minimize the count of conformers without significantly impacting accuracy? Using PDB ligand 3-D coordinates, key parameters of conformer model creation are explored to answer these questions. In the process of doing so, a useful relationship is developed relating the size and flexibility of a molecule to the accuracy of reproduction. Further examination is given to accuracy as it relates to limiting the total count of conformers considered in such a model.
Results and Discussion
1. Molecular size and flexibility of the PDB ligands
Distribution of (a) the non-hydrogen atom count and (b) the effective rotor count, binned by whole numbers, for the 25,972 experimentally determined structures in the MMDB ligand data set.
2. Parameter validation for conformer generation
OMEGA [19] was used in the present study. It is known to be among the fastest and most accurate conformer generation programs [17] available. In addition to high quality, it was the only commercially available program that had a non-windows-only C++ application programming interface (API) at the time of project initiation, a critical consideration given the computing environments at the National Center for Biotechnology Information (NCBI). A brief overview of the conformer generation algorithm of OMEGA is given in the Materials and Methods section. A more detailed explanation can be found elsewhere [11, 23].
OMEGA parameters modified or non-default during conformer generation
Flag | Options | Description |
---|---|---|
-startfact | 20, 50 | Determines the fragment sampling rate for determination of fragment conformation [default = 20]. |
-buildff a | MMFF94s MMFF94s_NoEstat MMFF94s_Trunc | Specifies the type of a molecular force field used in the model building step [default = MMFF94s_NoEstat] |
-searchff a | MMFF94s MMFF94s_NoEstat MMFF94s_Trunc | Specifies the type of a molecular force field used in the torsion search step [default = MMFF94s_NoEstat]. |
-ewindow | integers between 1 and 30 | Determines the size of the energy window [default = 10.0]. (1) In the model building step: determines the maximum energy range of a ring system. (2) In the torsion search step: determines the maximum energy range of a conformer relative to that of a global minimum conformer. |
-MaxConfsGen | 100,000 | Controls the maximum number of fully constructed conformers that OMEGA will attempt to build [default = 30,000]. |
-MaxConfs | 100,000 | Controls the maximum number of conformations to be retained [default = 20,000]. |
-MaxRotors | 15 | Controls the maximum number of rotatable bonds in a molecule. If a molecule has more rotatable bonds than this cutoff, OMEGA will not search for conformers [default = 12]. |
-MaxSearchTime | 180 | Limits the maximum amount of time (in seconds) spent generating conformers for each molecule [default = 30]. |
-DefaultRMSD | 0.0 | Controls the threshold of the RMS distance to determine duplicate conformations. The value of 0.0 indicates that the duplicate detection is skipped [default = 0.8]. |
where V AA and V BB are the self-overlap volume of A and B, respectively, and V AB is the overlap volume between A and B [24–26]. Note that the ST score is a molecular similarity measure ranging from 0 (for no similarity) to 1 (for identical molecules), whereas the RMSD value is a molecular dissimilarity measure ranging from 0 (for identical molecules) to infinity. Among all theoretical 3-D conformers generated for a given molecule, the one with the least RMSD and the one with the greatest ST to the experimental 3-D coordinates were considered to be the most similar to the "bioactive" conformation. Therefore, the accuracy of a conformer model generated by OMEGA was evaluated using the minimum RMSD and the maximum ST values between the experimental conformation and a single theoretical 3-D conformation in the conformer model. In further discussion, the RMSD and ST values for a conformer model indicates the RMSD and ST values between the corresponding experimental structure and the most similar theoretical conformer in the conformer model, respectively.
2.1. Effects of fragment sampling rate
The average root-mean-square distance (RMSD), the average Shape-Tanimoto (ST), the average number of conformers per chemical structure, and the count of the 100-k limit cases for conformer models for all 25,972 3-D reference structures
Energy Window a | 5/5 | 5/15 | 10/10 | 10/15 | 15/15 | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Conformer FF b | Full | NoEstat | Trunc | Trunc | NoEstat | NoEstat | NoEstat | |||||||
Fragment FF b | Full | NoEstat | Trunc | Full | NoEstat | Trunc | Full | NoEstat | Trunc | Full | Trunc | Trunc | Trunc | Trunc |
Sampling rate = 20 | ||||||||||||||
RMSD | 0.69 | 0.69 | 0.71 | 0.48 | 0.48 | 0.48 | 0.467 | 0.467 | 0.462 | 0.41 | 0.41 | 0.42 | 0.40 | 0.42 |
ST | 0.89 | 0.89 | 0.89 | 0.93 | 0.93 | 0.93 | 0.93 | 0.93 | 0.93 | 0.95 | 0.95 | 0.94 | 0.943 | 0.94 |
# Conformers | 698 | 698 | 645 | 2827 | 2,814 | 2,870 | 5,369 | 5,352 | 5,852 | 12,957 | 13,012 | 10,794 | 13,118 | 10,794 |
# 100-k Limit cases | 23 | 18 | 36 | 94 | 93 | 118 | 216 | 211 | 239 | 2,203 | 2,209 | 1,842 | 2,198 | 1,842 |
Sampling rate = 50 | ||||||||||||||
RMSD | 0.69 | 0.69 | 0.71 | 0.48 | 0.48 | 0.48 | 0.466 | 0.467 | 0.462 | 0.41 | 0.41 | 0.42 | 0.40 | 0.42 |
ST | 0.89 | 0.89 | 0.89 | 0.93 | 0.93 | 0.93 | 0.93 | 0.93 | 0.93 | 0.95 | 0.95 | 0.94 | 0.946 | 0.94 |
# Conformers | 696 | 697 | 646 | 2,834 | 2,818 | 2,875 | 5,377 | 5,358 | 5,857 | 12,981 | 13,025 | 10,796 | 13,124 | 10,796 |
# 100-k Limit cases | 21 | 18 | 36 | 96 | 96 | 119 | 217 | 211 | 239 | 2,203 | 2,210 | 1,841 | 2,196 | 1,841 |
Effects of the fragment sampling rate upon the average RMSD as a function of (a) non-hydrogen atom count and (b) effective rotor count. The MMFF94s_NoEstat and MMFF94s_Trunc force fields were used for the model building and torsion search stages, respectively, and the energy window of 15 kcal/mol was used for both stages.
2.2. Effects of force field choice
OMEGA has several pre-defined molecular force fields and it is possible to choose different force fields for model building and torsion search, using the "-buildff" and "-searchff" options, respectively. Three force fields were tested: (1) the 94s variant of the Merck molecular force field (MMFF94s_Full) [27–33], (2) MMFF94s without coulombic interaction terms (MMFF94s_NoEstat), and (3) MMFF94s without coulombic interaction terms and without the attractive part of the van der Waals interaction terms (MMFF94s_Trunc). The latter two force-field variants attempt to remove a perceived bias towards conformers that lower their energy by intra-molecular interactions, which are assumed to not be significant when making inter-molecular interactions. This consideration is critical as conformer generation is performed in vacuum, which is a very different environment than a protein binding pocket. Varying the force-field terms allows this hypothesis to be tested as improved accuracy should be found if intra-molecular interactions are removed.
Overall effects of the choice of the force field and energy window upon the average RMSD and ST values between the computationally generated conformations and the experimental conformations for the 25,972 3-D reference structures
Torsion Search Force Field a | |||||||
---|---|---|---|---|---|---|---|
Energy window (kcal/mol) | Fragment Force Field a | Full | NoEstat | Trunc | |||
RMSD | ST | RMSD | ST | RMSD | ST | ||
5 | Full | 0.69 | 0.89 | 0.48 | 0.93 | 0.47 | 0.93 |
NoEstat | 0.69 | 0.89 | 0.48 | 0.93 | 0.47 | 0.93 | |
Trunc | 0.71 | 0.89 | 0.48 | 0.93 | 0.46 | 0.93 | |
10 | Full | 0.52 | 0.92 | 0.42 | 0.94 | 0.42 | 0.94 |
NoEstat | 0.52 | 0.92 | 0.42 | 0.94 | 0.42 | 0.94 | |
Trunc | 0.52 | 0.92 | 0.41 | 0.94 | 0.42 | 0.94 | |
15 | Full | 0.46 | 0.93 | 0.40 | 0.94 | 0.40 | 0.95 |
NoEstat | 0.46 | 0.93 | 0.40 | 0.94 | 0.40 | 0.95 | |
Trunc | 0.46 | 0.93 | 0.40 | 0.94 | 0.40 | 0.95 |
The overall average conformer count per molecule and 100-k limit count for different force field types and energy window sizes for the 25,972 3-D reference structures
Energy Window (kcal/mol) | Force Field a | ||
---|---|---|---|
Full | NoEstat | Trunc | |
Conformer count | |||
5 | 698 | 2,818 | 5,857 |
10 | 2,817 | 10,852 | 10,889 |
15 | 5,599 | 13,219 | 13,178 |
100-k limit count | |||
5 | 21 | 96 | 239 |
10 | 250 | 1,788 | 1,844 |
15 | 491 | 2,189 | 2,220 |
Another potential explanation for the superiority of MMFF94s_NoEstat and MMFF94_Trunc over the MMFF94s_Full force field is that the lack of intra-molecular interaction is an important characteristic of a biologically relevant conformation of a small-molecule ligand found in its complex with the protein. When a small-molecule ligand binds to its target protein, the intra-molecular interaction in the ligand molecule that exists in its unbound state is likely to disappear because of a conformational change which enhances inter-molecular interaction between the molecule and the target protein. Regardless of the exact reason why, employing the MMFF94s_NoEstat for conformer model generation appears to be a more sensible choice than the MMFF94s_Full due to the accuracy improvement.
Overall, it appears that removal of electrostatic terms has a rather favorable effect on improving overall accuracy of reproduction of experimental bound ligand geometries. Removal of attractive van der Waals terms does not appear to have any significant effect. As such, the remainder of this study will only consider the MMFF94s_NoEstat force field variant.
2.3. Effects of energy window
The overall average RMSD and Shape-Tanimoto (ST) values of the conformer models for all 25,972 structures as a function of the model building energy window and the torsion search energy window (in kcal/mol).
3. Accuracy of conformer models and 100-k limit cases
Average RMSD values for all 25,972 3D reference structures as a function of non-hydrogen atom count [(a) and (b)] and effective rotor count [(c) and (d)]. In Panels (b) and (d), the 100-k limit cases, in which the number of conformers reached the maximum number allowed, were removed. The numbers in the legend boxes indicates the energy-window values used in the model building (first) and torsion search (second) stages.
Average Shape-Tanimoto (ST) values for all 25,972 3D reference structures as a function of non-hydrogen atom count [(a) and (b)] and effective rotor count [(c) and (d)]. In Panels (b) and (d), the 100-k limit cases, in which the number of conformers reached the maximum number allowed, were removed. The numbers in the legend boxes indicates energy-window values used in the model building (first) and torsion search (second) stages.
The effect of the 100-k limit cases upon the distributions of (1) the non-hydrogen atom count and (2) the effective rotor counts. The MMFF94s_NoEstat force field and the energy window of 5 kcal/mol were used for both the model building and torsion search stages.
The average number of conformers for a molecule as a function of the effective rotor count. The first two numbers in the legend box correspond to the energy-window sizes used in the model building (first) and torsion search (second). While the solid data symbols represent all 25,972 molecules, the open data symbols denote cases in which the 100-k limit cases were removed.
4. Prediction of RMSD accuracy of the conformer ensemble
So far, we have found ways to get the best average RMSD and ST accuracy using OMEGA as the conformer generator. Now we are faced with the problem of data reduction, as it is not practical to store or use all conformers produced when considering millions of chemical structures. Naturally, one would like to maximally reduce the conformer count without significantly sacrificing accuracy, e.g., by a minimum RMSD separation between conformers. The lower the separation RMSD used, the greater the count of conformers that must be kept. Conversely, too large of an RMSD separation may reduce the ensemble accuracy. In an ideal world, one would know a priori precisely which conformers are needed and discard the rest. In the real world, this is not possible to know; however, if one can reliably predict the RMSD accuracy of a conformer model, then one can devise an RMSD separation value that could be used as a sampling threshold with some statistical assurances that the sampled conformer model should have at least one conformer within the sampling RMSD some significantly large percentage of the time.
Comparison of the predicted and actual RMSD of the theoretical conformer models for the 25,972 experimentally determined structures. While the predicted RMSDs in panel (a) were computed using a single equation [Eq. (4)] for all 25,972 structures, those in panel (b) were computed using two different equations: Eq. (6) for the structures with NER≤10 and NNHA≤35 (in blue), and Eq. (7) for otherwise (in red).
The R2 values of the regression formula Eqs. (6) and (7) were 0.52 and 0.33, respectively, and the standard deviation of RMSD pred was 0.17 and 0.29 for Eqs. (6) and (7), respectively. (Attempts to partition N ER and N NHA values using different partitioning schemes had similar R2 results.) In the same manner as used to derive Eq. (5), one can add these first standard deviations to Eqs. (6) and (7) to get RMSD sampling formulas Eqs. (8) and (9), respectively, for the two individual groups. As shown in Figure 8, the RMSD pred values from Eqs. (6) and (7) are comparable with those from Eq. (4), despite the poor R2 values for Eqs. (6) and (7). As shown in Figure 9, where the RMSD sampling values from Eq. (5) are compared with those from Eqs. (8) and (9), the frequency distributions of the difference between RMSD pred and RMSD actual are similar to each other. The RMSD pred value is greater than or equal to RMSD actual for ~91.0% of the time when Eq. (5) is used, and for ~91.6% of the time when Eqs. (8) and (9) are used.
In the recent study of Hawkins et al. [11], the quality of theoretical conformations from OMEGA was evaluated by comparing a set of 197 high-quality PDB ligand structures with corresponding OMEGA-generated theoretical conformers. They used the MMFF94_Trunc force field to generate conformers and then reduced their count to a maximum of 200 by sampling. The mean RMSD between the 197 ligand set and their corresponding theoretical conformers was 0.67 Å. According to their bootstrapping tests, the OMEGA-generated conformers are expected 90% of the time to have an RMSD value between 0.647-0.688 Å for chemical structures with similar properties to those tested. The average rotatable bond count and non-hydrogen atom count of this 197 ligand set were 6.3 and 24.4, respectively, indicating that they are, on average, slightly bigger and flexible than the 25,972 set used in our study (4.9 rotatable bonds and 17.1 non-hydrogen atoms on average). With these average counts as the N NHA and N ER values in Eq. (5), RMSDpred is predicted to be 0.71 Å, which is comparable to 0.67 Å, the mean RMSD between the 197 ligand set and the corresponding theoretical conformers. Considering that the OMEGA parameters used in both studies were not exactly identical and that we used their reported rotatable bond count rather than effective rotor count, this may suggest that Eq (5) may have applicability beyond the parameter sets used in this study.
Conclusion
In the present study, theoretical conformer ensembles for 25,972 experimental MMDB ligand molecules were generated using the OMEGA software and the accuracy of the conformer models were analyzed in terms of the non-hydrogen atom pair-wise RMSD and shape similarity ST values between the theoretical conformer models and the corresponding experimental structures.
Effects of different settings for three important parameters (the fragment sampling rate, the type of the molecular force field, and the size of the energy window) that OMEGA uses for conformer generation were investigated to find their optimal settings. The use of a fragment sampling rate greater than the default (= 20.0) did not make a statistically significant change, indicating the default fragment sampling rate is already sufficient. Variation in the fragment force field type was found to provide little benefit. However, the accuracy of the theoretical conformer models was sensitive to the force field type used in the torsion search stage. When used as a torsion search force field, the MMFF94s_NoEstat and MMFF94s_Trunc force fields were found to generate more conformers for a given molecule and also improve the overall accuracy of the theoretical conformer models, compared to the MMFF94s force field but less so as the energy window was increased. In general, using a greater energy window in the model building and torsion search stages resulted in more accurate conformer models. However, a rapid convergence in the overall average RMSD and ST of the conformer models was observed when the energy windows were bigger than 10 kcal/mol for model building and 15 kcal/mol for torsion search. However, for larger and more flexible structures, an energy window of 25 kcal/mol for torsion search gave some noticeable improvement to the overall accuracy for larger and more flexible structures and may be better threshold for general purpose use. The average accuracy as a function of non-hydrogen atom count (size) and effective rotor count (flexibility) was very linear in the ranges considered. A regression equation was developed using these two variables to predict the accuracy of a theoretical ensemble to reproduce the experimental geometry. This equation was subsequently used to provide a RMSD sampling rate to filter conformer models such that 90% of conformer ensembles should have the same or lesser RMSD accuracy value, thus, allowing one to maximize the accuracy of a conformer model while minimizing the count of retained conformers.
Materials and Methods
1. Datasets
The "experimental" 3-D coordinate data set of small molecules used in the present study was downloaded from the Molecular Modeling DataBase (MMDB) [21, 22] ligand dataset as available from the PubChem Substance database at NCBI on October 20, 2006. The data set was used to calibrate the parameters used when operating the software generation package OMEGA [19]. Ligands that were too small or too big were discarded by limiting the non-hydrogen atom count to 2 - 50. Ligands too flexible (with a rotatable bond count greater than 15) were also eliminated. This filtering stage resulted in an initial dataset that contained 25,972 non-unique organic 3-D experimental reference structures where a 3-D conformer model could be generated.
2. Conformer generation using OMEGA
Essentially, the OMEGA application performs conformer generation in two primary stages: model building and torsion search. In the model building stage, initial molecular structures are constructed by assembling fragment templates, which are generated from fragmentation of the input molecular graph along sigma bonds. In the torsion search stage, OMEGA generates a conformer ensemble using particular rule-based torsion angles that depend on the molecular environment between connecting fragments.
There are a number of adjustable parameters available when performing conformer generation. The effects of three primary parameters upon the accuracy of conformer models generated were evaluated:
(1) -startfact (20 or 50): the argument of this option determines the fragment sampling rate for determination of fragment conformations. Effects of this parameter were studied by trying two values, 20 (default) and 50.
(2) -buildff and -searchff (MMFF94s, MMFF94s_NoEstat, or MMFF94s_Trunc): these specify the type of molecular force field used in the model building and torsion search stages, respectively. The MMFF94s argument tells OMEGA to use the 94s variant of Merck molecular force field. The MMFF94s_NoStat excludes the coulombic terms from the MMFF94s force field and the MMFF94s_Trunc means the MMFF94s force field was truncated by removal of both the coulombic and van der Waals attractive terms.
(3) -ewindow (1-10, 15, 20, 25, or 30): the size of energy window in the model building and torsion search stages in units of kcal/mol. Different values may be specified for the two stages. Various combinations of integer values ranging from one to thirty (default is 10) were used. The energy window indicates the maximum allowed energetic separation from the lowest conformer energy. Higher energy conformers are discarded.
See Table 1 for all non-default parameters used. It is important to note that a maximum of 100,000 (100-k) conformers were generated per chemical structure. This limit is more than adequate for small or inflexible structures but for flexible compounds this limitation is of concern. For example, imagine a chemical structure with nine rotatable bonds where one systematically samples each rotatable bond four times; one would generate 49 (= 262,144) conformations. Therefore, the effects of these 100-k limit cases upon the overall accuracy of the conformer models is analyzed as a function of the non-hydrogen atom counts and the effective rotor counts. While increasing this 100-k threshold to a larger value was not possible with earlier versions of OMEGA, one can see that increasing the total count of conformers considered by five or ten times would still not be sufficient for many flexible molecules.
To assess the accuracy of reproduction of experimental coordinates as a function of conformer generation parameters modification, two metrics were used: the RMSD of non-hydrogen atoms using the OEChem OERMSD function (with "automorph" detection turned on to allow proper treatment of symmetrically equivalent atoms and "overlay" turned on to allow rotation/translation to yield the lowest possible RMSD value) and the shape-optimized ST using the value reported by ROCS. For each conformer produced for a structure, an RMSD and ST determination were made. The lowest RMSD and greatest ST values per conformer model were used to assess "accuracy" of reproduction for the parameters used.
Declarations
Acknowledgements
We are grateful to the NCBI Systems staff, especially Ron Patterson, Charlie Cook, and Don Preuss, whose efforts helped make the PubChem3D project possible. This research was supported in part by the Intramural Research Program of the National Library of Medicine, National Institutes of Health, U.S. Department of Health and Human Services. This study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD. (http://biowulf.nih.gov).
Authors’ Affiliations
References
- Wang YL, Xiao JW, Suzek TO, Zhang J, Wang JY, Bryant SH: PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009, 37: W623-W633. 10.1093/nar/gkp456.View ArticleGoogle Scholar
- Wang YL, Bolton E, Dracheva S, Karapetyan K, Shoemaker BA, Suzek TO, Wang JY, Xiao JW, Zhang J, Bryant SH: An overview of the PubChem BioAssay resource. Nucleic Acids Res. 2010, 38: D255-D266. 10.1093/nar/gkp965.View ArticleGoogle Scholar
- Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Federhen S, et al: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2010, 38: D5-D16. 10.1093/nar/gkp967.View ArticleGoogle Scholar
- Bolton EE, Wang Y, Thiessen PA, Bryant SH: PubChem: integrated platform of small molecules and biological activities. Annual Reports in Computational Chemistry. Edited by: Ralph AW. 2008, David CS: Elsevier, 4: 217-241. 10.1016/S1574-1400(08)00012-1.Google Scholar
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res. 2000, 28: 235-242. 10.1093/nar/28.1.235.View ArticleGoogle Scholar
- Bostrom J: Reproducing the conformations of protein-bound ligands: a critical evaluation of several popular conformational searching tools. Journal of Computer-Aided Molecular Design. 2001, 15: 1137-1152. 10.1023/A:1015930826903.View ArticleGoogle Scholar
- Nicklaus MC, Wang SM, Driscoll JS, Milne GWA: Conformational changes of small molecules binding to proteins. Bioorganic & Medicinal Chemistry. 1995, 3: 411-428. 10.1016/0968-0896(95)00031-B.View ArticleGoogle Scholar
- Nissink JWM, Murray C, Hartshorn M, Verdonk ML, Cole JC, Taylor R: A new test set for validating predictions of protein-ligand interaction. Proteins-Structure Function and Genetics. 2002, 49: 457-471. 10.1002/prot.10232.View ArticleGoogle Scholar
- Westbrook J, Feng ZK, Burkhardt K, Berman HM: Validation of protein structures for Protein Data Bank. Macromolecular Crystallography, Pt D. 2003, San Diego: Academic Press Inc, 374: 370-385. full_text. Methods in Enzymology,View ArticleGoogle Scholar
- Acharya KR, Lloyd MD: The advantages and limitations of protein crystal structures. Trends Pharmacol Sci. 2005, 26: 10-14. 10.1016/j.tips.2004.10.011.View ArticleGoogle Scholar
- Hawkins PCD, Skillman AG, Warren GL, Ellingson BA, Stahl MT: Conformer generation with OMEGA: algorithm and validation using high quality structures from the Protein Databank and Cambridge Structural Database. Journal of Chemical Information and Modeling. 2010, 50: 572-584. 10.1021/ci100031x.View ArticleGoogle Scholar
- Murrall NW, Davies EK: Conformational freedom in 3-D databases. 1. Techniques. J Chem Inf Comput Sci. 1990, 30: 312-316.View ArticleGoogle Scholar
- Hurst T: Flexible 3D searching: the directed tweak technique. J Chem Inf Comput Sci. 1994, 34: 190-196.View ArticleGoogle Scholar
- Klebe G, Mietzner T: A fast and efficient method to generate biologically relevant conformations. Journal of Computer-Aided Molecular Design. 1994, 8: 583-606. 10.1007/BF00123667.View ArticleGoogle Scholar
- Renner S, Schwab CH, Gasteiger J, Schneider G: Impact of conformational flexibility on three-dimensional similarity searching using correlation vectors. J Chem Inf Model. 2006, 46: 2324-2332. 10.1021/ci050075s.View ArticleGoogle Scholar
- Greene J, Kahn S, Savoj H, Sprague P, Teig S: Chemical function queries for 3D database search. J Chem Inf Comput Sci. 1994, 34: 1297-1308.View ArticleGoogle Scholar
- Kirchmair J, Wolber G, Laggner C, Langer T: Comparative performance assessment of the conformational model generators omega and catalyst: a large-scale survey on the retrieval of protein-bound ligand conformations. J Chem Inf Model. 2006, 46: 1848-1861. 10.1021/ci060084g.View ArticleGoogle Scholar
- Bostrom J, Greenwood JR, Gottfries J: Assessing the performance of OMEGA with respect to retrieving bioactive conformations. Journal of Molecular Graphics & Modelling. 2003, 21: 449-462. 10.1016/S1093-3263(02)00204-8.View ArticleGoogle Scholar
- OMEGA. 2006, Version 2.1, OpenEye Scientific Software, Inc.: Santa Fe, NMGoogle Scholar
- Borodina YV, Bolton E, Fontaine F, Bryant SH: Assessment of conformational ensemble sizes necessary for specific resolutions of coverage of conformational space. J Chem Inf Model. 2007, 47: 1428-1437. 10.1021/ci7000956.View ArticleGoogle Scholar
- Chen J, Anderson JB, DeWeese-Scott C, Fedorova ND, Geer LY, He S, Hurwitz DI, Jackson JD, Jacobs AR, Lanczycki CJ, et al: MMDB: Entrez's 3D-structure database. Nucleic Acids Res. 2003, 31: 474-477. 10.1093/nar/gkg086.View ArticleGoogle Scholar
- Wang Y, Addess KJ, Chen J, Geer LY, He J, He S, Lu S, Madej T, Marchler-Bauer A, Thiessen PA, et al: MMDB: annotating protein sequences with Entrez's 3D-structure database. Nucleic Acids Res. 2007, 35: D298-D300. 10.1093/nar/gkl952.View ArticleGoogle Scholar
- OpenEye Omega Toolkit - C++. [http://www.eyesopen.com]
- ROCS - Rapid Overlay of Chemical Structures. 2006, Version 2.2, OpenEye Scientific Software, Inc.: Santa Fe, NMGoogle Scholar
- Grant JA, Gallardo MA, Pickup BT: A fast method of molecular shape comparison: a simple application of a Gaussian description of molecular shape. J Comput Chem. 1996, 17: 1653-1666. 10.1002/(SICI)1096-987X(19961115)17:14<1653::AID-JCC7>3.0.CO;2-K.View ArticleGoogle Scholar
- Rush TS, Grant JA, Mosyak L, Nicholls A: A shape-based 3-D scaffold hopping method and its application to a bacterial protein-protein interaction. J Med Chem. 2005, 48: 1489-1495. 10.1021/jm040163o.View ArticleGoogle Scholar
- Halgren TA: Merck molecular force field. 1. Basis, form, scope, parameterization, and performance of MMFF94. J Comput Chem. 1996, 17: 490-519. 10.1002/(SICI)1096-987X(199604)17:5/6<490::AID-JCC1>3.0.CO;2-P.View ArticleGoogle Scholar
- Halgren TA: Merck molecular force field. 2. MMFF94 van der Waals and electrostatic parameters for intermolecular interactions. J Comput Chem. 1996, 17: 520-552. 10.1002/(SICI)1096-987X(199604)17:5/6<520::AID-JCC2>3.0.CO;2-W.View ArticleGoogle Scholar
- Halgren TA: Merck molecular force field. 3. Molecular geometries and vibrational frequencies for MMFF94. J Comput Chem. 1996, 17: 553-586. 10.1002/(SICI)1096-987X(199604)17:5/6<553::AID-JCC3>3.0.CO;2-T.View ArticleGoogle Scholar
- Halgren TA, Nachbar RB: Merck molecular force field. 4. Conformational energies and geometries for MMFF94. J Comput Chem. 1996, 17: 587-615.Google Scholar
- Halgren TA: Merck molecular force field. 5. Extension of MMFF94 using experimental data, additional computational data, and empirical rules. J Comput Chem. 1996, 17: 616-641. 10.1002/(SICI)1096-987X(199604)17:5/6<616::AID-JCC5>3.0.CO;2-X.View ArticleGoogle Scholar
- Halgren TA: MMFF VI. MMFF94s option for energy minimization studies. J Comput Chem. 1999, 20: 720-729. 10.1002/(SICI)1096-987X(199905)20:7<720::AID-JCC7>3.0.CO;2-X.View ArticleGoogle Scholar
- Halgren TA: MMFF VII. Characterization of MMFF94, MMFF94s, and other widely available force fields for conformational energies and for intermolecular-interaction energies and geometries. J Comput Chem. 1999, 20: 730-748. 10.1002/(SICI)1096-987X(199905)20:7<730::AID-JCC8>3.0.CO;2-T.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.