DPRESS: Localizing estimates of predictive uncertainty
© Clark; licensee BioMed Central Ltd. 2009
Received: 04 March 2009
Accepted: 14 July 2009
Published: 14 July 2009
The need to have a quantitative estimate of the uncertainty of prediction for QSAR models is steadily increasing, in part because such predictions are being widely distributed as tabulated values disconnected from the models used to generate them. Classical statistical theory assumes that the error in the population being modeled is independent and identically distributed (IID), but this is often not actually the case. Such inhomogeneous error (heteroskedasticity) can be addressed by providing an individualized estimate of predictive uncertainty for each particular new object u: the standard error of prediction su can be estimated as the non-cross-validated error st* for the closest object t* in the training set adjusted for its separation d from u in the descriptor space relative to the size of the training set.
The predictive uncertainty factor γt* is obtained by distributing the internal predictive error sum of squares across objects in the training set based on the distances between them, hence the acronym: Distributed PRedictive Error Sum of Squares (DPRESS). Note that st* and γt* are characteristic of each training set compound contributing to the model of interest.
The method was applied to partial least-squares models built using 2D (molecular hologram) or 3D (molecular field) descriptors applied to mid-sized training sets (N = 75) drawn from a large (N = 304), well-characterized pool of cyclooxygenase inhibitors. The observed variation in predictive error for the external 229 compound test sets was compared with the uncertainty estimates from DPRESS. Good qualitative and quantitative agreement was seen between the distributions of predictive error observed and those predicted using DPRESS. Inclusion of the distance-dependent term was essential to getting good agreement between the estimated uncertainties and the observed distributions of predictive error. The uncertainty estimates derived by DPRESS were conservative even when the training set was biased, but not excessively so.
DPRESS is a straightforward and powerful way to reliably estimate individual predictive uncertainties for compounds outside the training set based on their distance to the training set and the internal predictive uncertainty associated with its nearest neighbor in that set. It represents a sample-based, a posteriori approach to defining applicability domains in terms of localized uncertainty.
Early work on quantitative structure-activity relationships (QSAR) was primarily concerned with relating select physical properties to in vivo biological activity [1, 2]. Ordinary least squares regression (multiple linear regression) was the analytical tool of choice, and the statistical questions addressed focused on whether a particular descriptor was significant or not. QSAR methods soon evolved, however, into being ways of identifying optimal physical properties rather than simply trends, a shift accomplished by fitting to quadratic and bilinear equations. This development was spurred in no small part by the desire to identify optimal octanol/water partition coefficients (logP), generally in pursuit of optimal in vivo activity.
The focus for pharmaceutical drug discovery subsequently shifted from in vivo testing to in vitro evaluation of interactions between candidate ligands and isolated enzymes or receptors. This change brought with it a shift of descriptors from measurable properties of compounds to computationally estimated properties of molecules, with the calculations in question often being based on (sub)structural descriptors. The next step was to take into account descriptors that were based on molecular structure but were not themselves measurable physical properties. Often these were more or less local in nature, and the purpose of the analysis shifted from identifying significant underlying relationships between activity and the descriptors to identifying optimal substituents or substitution patterns. Interest in artificial neural networks (ANNs) and partial least squares with projection onto latent structures (PLS) as analytical tools increased at the same time. Questions related to validity of the model as a whole took center stage as the number of descriptors available proliferated [5, 6], followed closely by a strong interest in predictivity and how best to establish applicability domains [7–15].
Today, however, the overall statistical properties of a particular QSAR are less relevant to medicinal chemists or environmental regulatory agencies. Recent pressure to simultaneously reduce clinical failures, ensure the safety of bulk chemicals [16–18] and reduce testing on animals has led to an increasing reliance on models for predicting off-target biological effects and toxicity. This use of QSAR models entails applications to more structurally diverse compounds, but it also changes the relative importance of different kinds of mistakes. If a structure is predicted to have a much higher affinity for the target than it actually does, the cost to a lead optimization program is limited to the synthetic resources wasted on that particular structure. Even that cost is mitigated if something useful was learned about the underlying structure-activity relationship (SAR) in the process. Such a false positive error in predictive toxicology, however, may mean that a life-saving (and profitable) drug never gets commercialized. Compounds mistakenly predicted to be inactive – false negatives – represent a missed opportunity in the context of lead optimization, but they have the potential to be downright catastrophic (and ruinous) in the context of predictive toxicology.
Such considerations put a premium on being able to make a quantitative estimate of how reliable an individual prediction obtained from a given model is. What is more, answers to the question, "How reliable are the predictions about this particular molecule that I am considering for synthesis, clinical evaluation or registration?" are often most relevant for extrapolations to structures near the "outside" edges of the descriptor space defined by the training set. Hence, to be of practical use, constraints on applicability domains need to be "soft" – i.e., increase with distance from the descriptor space covered by the training set – but "hard" enough to indicate just how far outside the training set one can safely expect to go. They also need to provide a robust quantitative estimate of predictive reliability that is sensitive to local variations in the descriptor space. This paper presents a novel methodology for doing exactly that based on how close a new compound is to those in the training set and the distribution of internal predictive error across compounds in that set.
Classical statistical theory
As indicated in Eq. 1, the error ε is assumed to be normally distributed with mean 0 and a standard deviation σX. Best linear unbiased estimators (BLUEs) for the various parameters in Eq. 1 can be calculated from a sample T0 of n observations (in QSAR, compounds) drawn from the full population, provided several preconditions are met:
1. the strict linear dependence of Y on X set out in Eq. 1 applies across the population;
2. the sample is random and unbiased;
3. the descriptors contributing to X are mutually independent in a statistical sense; and
4. the error distribution ε is homoskedastic and independent of X and Y – i.e., its standard deviation is the same everywhere in the descriptor space, so σX = σ for all X.
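Under these conditions, the linear model referenced as Eq. 1 is presumably the standard form; written out (as a reconstruction consistent with the surrounding text):

```latex
Y = \beta_{0} + \mathbf{X}\boldsymbol{\beta} + \varepsilon,
\qquad \varepsilon \sim N\!\left(0,\, \sigma_{X}^{2}\right)
```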
where Ȳ is the mean value of Y for the sample; xi = Xi - X0, with X0 being the sample centroid for X; and Ŷi is the predicted value of Y at Xi. Note that sFIT is greater than the root mean square error (RMSE); this is because the means Ȳ and X0 and the calculated coefficient vector b are themselves estimates that are subject to sampling error, with 1 and p degrees of freedom, respectively.
where su is the expected standard error of prediction (uncertainty) for the new observation u and n is the number of training set observations t used to build the model. The Mahalanobis distances d0, u and d0, t are measured in the model space defined by b, i.e., they are weighted Euclidean distances between the centroid X0 of the descriptor matrix for the training set and the vectors Xu and Xt, respectively.
The rationale behind the "extra" terms in Eq. 4 is straightforward. For any random sample, the error involved in using as an estimate of μY is inversely proportional to n – hence the 1/n term in Eq. 4. In addition, the accuracy with which β is estimated by b is inversely proportional to how thoroughly X is sampled by the training set, but how much difference that makes to the error is directly proportional to the distance d0, u between Xu and X0 in the model space. Together these countervailing effects of variation in X account for the second term within the outer brackets.
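The terms enumerated above combine in the classical prediction-error formula; a plausible reconstruction of Eq. 4 in this notation is:

```latex
s_{u} = s_{\mathrm{FIT}}
\sqrt{\,1 + \frac{1}{n}
 + \frac{d_{0,u}^{\,2}}{\sum_{t=1}^{n} d_{0,t}^{\,2}}\,}
```

Here the leading 1 carries the irreducible error, the 1/n term the sampling error in Ȳ, and the final term the growth in coefficient uncertainty with distance from the centroid X0.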
Dealing with violated assumptions
The value of su produced by Eq. 4 is a best linear unbiased estimator of σu – provided the assumptions underlying its derivation hold. Unfortunately, one or more of those assumptions are violated in most QSAR applications. In particular:
1. the dependence of Y on X rarely fits the prescribed function perfectly, linear or otherwise;
2. the training set used is usually a non-random sample, its selection biased by matters of historical accident and convenience that reflect the historical trajectory of the synthesis program that motivated the analysis;
3. the descriptors contributing to X are often correlated to a greater or lesser degree and hence are not independent variables in the statistical sense (correlation implies lack of independence, but the inverse is not true: lack of correlation does not imply statistical independence); and
4. ε is usually heteroskedastic – its standard deviation σX is often different in different regions of the descriptor space.
Most or all of the assumptions are, in fact, explicitly violated when ANNs, PLS, variable selection, quadratic regression, or bilinear regression techniques are applied, with the result that sFIT and the estimator given by Eq. 4 underestimate the actual uncertainty of prediction, often drastically.
Several groups have derived theoretical variations of Eq. 4 for use with PLS and principal component analysis (PCA) that seek to address departures from ideality [20–22]. Unfortunately, subsequent work has demonstrated that these methods are often not robust when applied in realistic situations.
where Ŷu is the value of Y predicted by applying the reduced model built from the n - 1 compounds in training subset Tu to Xu, and eu = Yu - Ŷu is the corresponding predictive error. The summation is indexed across u to emphasize that prediction is external to the training subset used in each case. Here p represents the number of PLS components included in the model rather than the number of descriptors.
Cross-validation statistics were originally employed in PLS solely as a way to determine an optimal model complexity, a role for which the classical goodness-of-fit measure r2 used in ordinary least squares is unsuited. It has since come to be widely used to assess predictivity, however. This use is unfortunate, in that a poorly predictive model will have a high sCV and a low q2, but the converse may or may not be true: good cross-validation statistics may be due to redundancies in the training set rather than truly robust predictive performance [25–28]. Some workers prefer to use "leave-some-out" (LSO) cross-validation – in which several compounds are held back together – to address this problem. Nonetheless, the LOO standard error is the best estimate of the full model's predictivity for each individual compound in the training set, which makes it a reasonable starting point for estimating a model's predictive reliability for structures occupying nearby points in the descriptor space.
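The LOO bookkeeping described above is easy to script. In this sketch ordinary least squares stands in for PLS purely to keep the example self-contained; the function name, the simulated data, and the convention of dividing PRESS by n are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def loo_stats(X, y):
    """Leave-one-out cross-validation: each compound is predicted by a
    model fit to the other n - 1 compounds (OLS stands in for PLS)."""
    n = len(y)
    press = 0.0
    for u in range(n):
        mask = np.arange(n) != u
        A = np.c_[np.ones(n - 1), X[mask]]           # design matrix with intercept
        b, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        y_hat = np.concatenate(([1.0], X[u])) @ b    # external prediction for u
        press += (y[u] - y_hat) ** 2                 # predictive error sum of squares
    s_cv = np.sqrt(press / n)  # one common convention; others use n - p - 1
    q2 = 1.0 - press / np.sum((y - y.mean()) ** 2)
    return s_cv, q2

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.3, size=40)
s_cv, q2 = loo_stats(X, y)
```

A high q2 here reflects the strong simulated signal; as the text notes, good cross-validation statistics alone do not guarantee external predictivity.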
Violation 4 – that error is not identically and independently distributed across compounds – is especially problematic for QSAR analyses. In one recently described case in point, the variation in predictive error was clearly correlated with one of the two descriptors being used. If that is true when many descriptors are involved (as is the case for PLS), the overall variability in predictive error should be similar across the full range of Y. Such a distribution of error is, in fact, often seen in place of the quadratically increasing spread implied by Eq. 4. This makes it all too easy to leap to the unjustified conclusion that the aggregate predictive uncertainty – typically sCV or the root mean square error of prediction for an external test set (RMSEP or sPRED) – is a reliable indicator of the level of uncertainty associated with individual predictions: independence from Y does not imply independence from X.
Partitioning the PRESS
where ct, i = [xt1xi1 xt2xi2 ... xtpxip] is the covariance between xt and xi and vt is a weight vector that is specific to compound t. Basically, Eq. 6 says that activity can be expressed as a linear function of the similarities of each compound to each of the other compounds in the training set. This suggests that the observed predictive error eu can be cast as a sum of contributions from each compound in the training set that increases with similarity to those compounds, which is consistent with the observation that predictive error tends to increase with distance from – i.e., tends to decrease with increasing similarity to – compounds in the training set [11, 12, 15]. If t* is the closest (i.e., the most similar) such compound, its standard error (st*) is a reasonable first approximation to the predictive error su for a new compound. In most QSAR applications, a single response value Yt is assigned to each compound in the training set, so the best estimate of st is simply |et|, where et is the deviation seen for t in the full, non-cross-validated model, i.e., the residual error of fitting.
where d00 is the length of the vector x00 defined by the standard deviations of the descriptors; d00 = 1 when descriptors have been centered and autoscaled, as was the case here.
A small constant (1/n) needs to be included to prevent the reciprocal from "exploding" at small distances. Basically, it dictates the distance at which error is expected to distribute evenly. The choice of this particular value is somewhat arbitrary, but 1/n works well and nicely accommodates the tendency of data points to get closer together as the training set gets larger. Taken together, Eqs. 7–9 define the Distributed PRedictive Error Sum of Squares (DPRESS) approach to estimating predictive uncertainty.
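As a rough illustration, the following sketch shows one plausible reading of the DPRESS bookkeeping: each LOO predictive error is distributed over the other training compounds with weights proportional to 1/(d/d00 + 1/n), and the resulting per-compound factor γt scales the distance term when estimating the uncertainty for a new object. The function names and the exact combination rule are assumptions, not the published Eqs. 7–9.

```python
import numpy as np

def dpress_gamma(D, e_cv, d00=1.0):
    """Distribute each LOO predictive error |e_cv[u]| over the other
    training compounds with inverse-distance weights, damped by the
    1/n constant so weights at small distances cannot explode."""
    n = len(e_cv)
    gamma = np.zeros(n)
    for u in range(n):
        w = 1.0 / (D[u] / d00 + 1.0 / n)  # inverse-distance weights
        w[u] = 0.0                        # no self-contribution
        w /= w.sum()
        gamma += w * abs(e_cv[u]) / d00   # gamma as error per unit distance (assumed)
    return gamma

def s_u(x_new, X, e_fit, gamma):
    """Uncertainty for a new object: the fitted residual of its nearest
    training neighbour t*, adjusted outward by gamma_t* times the separation."""
    d = np.linalg.norm(X - x_new, axis=1)
    t = int(np.argmin(d))
    return abs(e_fit[t]) + gamma[t] * d[t]

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
e_fit = rng.normal(scale=0.2, size=20)          # non-cross-validated residuals
e_cv = e_fit + rng.normal(scale=0.3, size=20)   # LOO predictive errors
D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
gamma = dpress_gamma(D, e_cv)
unc = s_u(rng.normal(size=4), X, e_fit, gamma)
```

By construction the estimate can never fall below the nearest neighbor's fitted residual, and it grows linearly with distance from the training set.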
The suitability of DPRESS or any other quantitative model of predictive uncertainty is best evaluated by applying it to experimental QSAR data sets. Here, DPRESS is tested against PLS models obtained using a 3D descriptor (comparative molecular field analysis, or CoMFA [32–34]) and a 2D descriptor (hologram QSAR, or HQSAR [35–37]). A large data set (N = 304) was used to ensure that the number of compounds held back to evaluate external predictivity was much greater than the numbers needed to train a reasonably robust model.
The data set
The main goal of the work reported here was to see how well local estimates of predictive error obtained by DPRESS reflect the actual distribution of predictive error across the descriptor space. Simple random sampling produces a biased training set because, as in most such data sets, the major structural classes are not evenly represented (Fig. 2). Therefore diverse but representative ("boosted") training sets were generated by independently drawing five training (sub)sets of 75 compounds from the full set using optimizable k-dissimilarity (OptiSim) selection [39, 42, 43]. Models based on those training sets were then used to predict the activities of the 229 inhibitors not used to construct them. Three additional training sets were drawn at random, only one of which gave acceptable internal cross-validation statistics. Representation in the full data set is biased, so such simple random subsets are biased as well. The results obtained using that training set (set R) are included here to illustrate the effect of sampling bias due to structural redundancy [39, 44, 45].
The optimal number of components p* for the CoMFA models obtained for the boosted training sets ranged from three to seven. It is not appropriate to compare models that differ in complexity directly, however, so a consensus complexity of p = 6 was used in all cases. The corresponding leave-one-out (LOO) cross-validated standard errors (sCV) ranged from 0.681 to 0.762, corresponding to internal predictivities (q2) of 0.537 to 0.337. The non-cross-validated models exhibited standard errors of regression (sFIT) ranging from 0.279 to 0.398, corresponding to r2 values between 0.901 and 0.827. Calculating the root mean square error for external predictions yielded sPRED = 0.633 to 0.655 – i.e., the internal cross-validated error underestimated the overall accuracy of external prediction somewhat.
In contrast, the biased training set R yielded a cross-validated standard error (sCV) of 0.489, corresponding to a q2 of 0.696. The overall goodness-of-fit statistics for the non-cross-validated model were sFIT = 0.279 and r2 = 0.901. As expected, however, the predictive performance on those compounds not in R was substantially worse than that of the boosted training sets, with sPRED = 0.744.
Several conclusions can be drawn by comparing the distribution of errors to each other and to the distribution of activities. Firstly, though the distributions of observed predictive errors for the three models differ from one another (Fig. 4A vs 4B vs 4C), they resemble each other more than they resemble the distribution of activity itself (Fig. 3A). Secondly, the larger observed errors are not particularly concentrated among the singletons or at the edges of the descriptor space, as would be expected for the ordinary least squares error distribution implied by Eq. 4 and assumed in most published approaches to establishing applicability domains. Thirdly, the distributions of predictive uncertainty seen for the boosted training sets are in good overall agreement with the observed errors with respect to the regions of descriptor space where the observed error is relatively high or low (Fig. 4D vs 4A and 4E vs 4B). Though somewhat less evident, the same is true for the model constructed using the biased training set R (Fig. 4F vs 4C). Finally, the smaller errors predicted by the boosted training set with the better internal predictivity (Fig. 4E vs 4D) do seem to be realized in the localized errors actually observed (Fig. 4B vs 4A), even though this was not obvious in the aggregate statistics (sPRED = 0.637 and sPRED = 0.633, respectively).
Eq. 7 implies that ŝt = |et| for each member t in the training set. The corresponding points are represented by filled stars in each panel in Fig. 5, which therefore define the lines ŝu = |eu|. Unbiased and normally distributed error should only fall outside these lines about 32% of the time and should fall outside the dotted lines corresponding to ŝu = 2|eu| less than 5% of the time. This is clearly not the case when the cross-validated error for the most similar compound t* in the training set is taken as a direct estimate of ŝu, i.e., when γt is set equal to 0 for all t (Fig. 5A). There are fewer unduly low predicted uncertainties for the biased training set R, but still more than would be expected by chance (Fig. 5B). Note that the bias evident in the model constructed from R comes mostly in the form of negative residuals, i.e., predicted activities that are larger than the observed activities. Such false positives account for most of the "extra" out-of-bounds errors seen in Fig. 5B. The distributions of errors for the boosted training sets are much better behaved; indeed, the predicted uncertainties are slightly more conservative than necessary for large errors in prediction (Fig. 5C and 5D).
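The calibration check applied here – roughly 32% of unbiased, normally distributed errors should exceed one predicted standard error, and fewer than 5% should exceed two – is easy to verify numerically. This sketch uses hypothetical simulated data and simply counts exceedances:

```python
import numpy as np

def coverage(errors, s_pred):
    """Fraction of observed predictive errors exceeding one and two
    predicted standard errors; for well-calibrated, normally
    distributed error these should be near 0.32 and 0.05."""
    r = np.abs(errors) / s_pred
    return float(np.mean(r > 1.0)), float(np.mean(r > 2.0))

rng = np.random.default_rng(2)
s = rng.uniform(0.3, 0.9, size=5000)   # hypothetical predicted uncertainties
e = rng.normal(scale=s)                # heteroskedastic but well-calibrated errors
out1, out2 = coverage(e, s)
```

Observed exceedance rates well above these benchmarks indicate over-optimistic uncertainty estimates; rates well below them indicate overly conservative ones.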
HQSAR analyses were carried out as a complement to the results obtained in the CoMFA studies described above. The 2D molecular holograms used were built up from the number of each kind of substructure comprised of between four and seven heavy atoms, the counts being mapped down into count vectors of various lengths by hashing. HQSAR models were then constructed by applying PLS analysis to holograms of length 97, 151, 199, 257, 307 and 353. The optimal complexity for the full model (N = 304) was six components for all hash lengths. The sCV values obtained ranged from 0.609 to 0.640; the median and average were both 0.622. The value of q2 ranged from 0.547 to 0.582, with a median of 0.564 and an average of 0.563. Based on these results, a hash length of 353 (sCV = 0.609 and q2 = 0.582) was chosen for evaluating the behavior of the various training sets. The corresponding non-cross-validated analysis gave sFIT = 0.527 and r2 = 0.687.
The consensus optimal complexity across the boosted training subsets was five components, in keeping with the full data set's having nearly four times as many compounds and, therefore, containing substantially more information. The sCV values obtained ranged from 0.691 to 0.776 versus a value of 0.540 for the biased training set R; the respective q2 values were 0.386 to 0.514 and 0.623. The sPRED for the boosted subsets ranged from 0.619 to 0.669 and the corresponding value for the biased subset was 0.735. Hence HQSAR performance followed the trend seen for CoMFA: cross-validation under-estimated the predictive error substantially for the biased subset (i.e., was overly optimistic about the extensibility of the model) and over-estimated the predictive error slightly for the boosted training sets. It differed in that it was the boosted training set B which gave the better external predictive performance.
The degree to which any QSAR can be extended to compounds outside of the training set used to construct it is necessarily limited by the structural diversity of that training set. Some extensibility is necessary, however, if the QSAR is to be of use for something beyond mere rationalization of known activities. When only a few descriptors are being considered, it may be possible to restrict the applicability domain to "internal" regions in the descriptor space, but as the number of descriptors increases distinguishing compounds that lie "outside" the space defined by the training set from those that are "inside" becomes progressively less meaningful. Regardless of the complexity of the system, it is clear that one will often need to extend the applicability domain beyond the training set somehow. It is equally clear that this must be done cautiously, however, and that it would be desirable for the degree of caution to reflect the idiosyncrasies of the QSAR being examined. It would be particularly desirable to take local variations in the uncertainty of predictions into account, rather than trying to find a single acceptable distance to the model that is applicable across the entire descriptor space [12, 14, 15].
DPRESS was formulated to address these needs. It is based on two simple assumptions: that the uncertainty in prediction for new objects (e.g., molecular structures) is likely to be dominated by the error in prediction for objects near them in the descriptor space; and that this influence is, to a first approximation, inversely related to the distance between them. The "true" dependence may well be more complex in some cases, but the size of the training set needed to characterize that dependence will almost always be impractically large. In any event, such dependence is likely to reduce to a linear relationship over the relatively short ranges of QSAR extrapolations that have any chance of being relevant.
The fact that the predictive uncertainties derived from DPRESS analysis are sometimes more conservative than necessary for large errors is of some concern, though that is certainly preferable to the alternative of their being overly optimistic; further work in this area is a matter of ongoing investigation. Nonetheless, the method is intrinsically less constraining than the classical quadratic relationship based on distance from the mean (Eq. 4). Given how much predictive error varies across the model space (e.g., Fig. 4), any approach based on the overall sPRED seems bound to be overly optimistic regarding the reliability of predicted potencies for some compounds.
The underlying QSARs examined here – CoMFA and HQSAR – both rely on (nominally) linear PLS, but there is no intrinsic reason that the method cannot be more broadly applied. The key point is that the error being distributed must be predictive – i.e., it needs to reflect predictions made for objects not included in the training set. LOO cross-validation yields the most information for any given dataset, but a leave-some-out approach should be a viable alternative. The predictive errors obtained from the validation sets often used in ANN analysis could be used as well, since there is no intrinsic reason that a linear model for local error distribution should be incompatible with a QSAR that is non-linear on a global scale.
The usual reasons for preferring LSO over LOO cross-validation are unlikely to be relevant to DPRESS calculations, however. LOO can indeed be distorted when the training set is biased due to redundancy, but DPRESS based on LOO turns out to be conservative in such a situation (see above). The reduction in sCV that occurs when the sampling density in one particular area of descriptor space is high is reflected in a reduction in the error that each individual prediction contributes to the PRESS. But γt is not a root mean square, so the effect on its value is offset by the fact that biased sampling necessarily: increases the total number of errors; decreases their spread (d00); increases the distance between the training set and most new observations; or effects some combination thereof.
Diverse training sets representative of the full structural space produce more reliable local uncertainty estimates than do biased training sets, indicating that taking care to avoid undue sampling bias (redundancy) in the training set is worth the effort. Even the biased training set R, however, did better than setting the uncertainty of prediction for a new object equal to the observed error for the closest object in the training set (Fig. 5 and 7). Moreover, the errors falling outside the range expected based on the uncertainties calculated for R were false positives, the least serious type of error to make when trying to predict toxicity.
There are two fundamental differences between the estimate of predictive uncertainty derived from classical theory (Eq. 4) and the DPRESS model represented by Eqs. 7–9. The first difference is that Eq. 4 is a sum of squares, whereas Eq. 7 is a sum of linear terms. Using a sum of squares formulation was considered for DPRESS, but was found to consistently overestimate the uncertainty of prediction (details not shown). The second difference is that Mahalanobis distances d measured in the model space are used in the classical model, whereas Euclidean distances measured in the descriptor space are used in DPRESS. The less parametric approach is followed for DPRESS because the variation in one or more variables in a particular training set may not be large enough to reveal the influence that variable might exert if examined across a greater range. The small coefficient assigned to such a variable in that event means that substantial deviations in its value will have a negligible effect on distances in the model space. Sticking with distances in the "raw" descriptor space rather than using the descriptor weights from b to calculate a Mahalanobis distance is more conservative – it assumes that variation in things that have yet to be explored are likely to make predictions less reliable.
Examination of the distribution of predictive errors across the descriptor space makes it clear that errors are consistently larger in some regions than in others – i.e., the predictive error is heteroskedastic (Fig. 4). Given that a major use of QSAR predictions is in chemoinformatic tabulations used by medicinal chemists and other third parties, it would be good practice to routinely attach some estimate of uncertainty to each prediction. Doing so based on some analytical estimator would be preferable, but is impractical in most real-world situations because it requires detailed a priori knowledge of the global dependence of error on the descriptors. In the absence of such knowledge, a locally linear estimator of predictive reliability that is embedded in the sample space represents a reasonable alternative. Distribution of the predictive error sum of squares (DPRESS) provides just such an estimator in a form – that of a standard error – that is widely understood by those likely to use it. The calculations involved are straightforward and the estimator produced is a qualitatively (Fig. 4 and 6) and quantitatively (Fig. 5 and 7) reliable estimate of how much confidence one should place in the associated prediction. Moreover, though the particular applications studied here involved PLS models built using 2D and 3D descriptors, the technique is likely applicable to any regression method that can be reformulated in kernel-based terms [12, 47].
It is also important when constructing the model in the first place to examine the distribution of predictive error in the descriptor space. If uncertainty is homoskedastic, a classical or uniform distribution model may provide a somewhat more precise estimate of predictive uncertainty. Should (e)NLM or principal components analysis (PCA) indicate heteroskedasticity, however, a DPRESS calculation should be carried out before applying the model – e.g., for prioritizing compounds for synthesis, acquisition or detailed testing. DPRESS may also serve to highlight regions of structural space from which more data needs to be gathered.
Ordinary multiple linear regression is not suitable when the number of descriptors in a data set exceeds the number of observations. PLS was used instead, with the appropriate number of latent variables (components) to include (i.e., the model complexity) being the number corresponding to the first minimum in the "leave-one-out" cross-validated standard error (sCV). This measure of internal consistency is obtained by setting aside each of the n compounds in the training subset in turn and trying to predict its activity using the other n - 1 compounds in the training set. The external error of prediction (sPRED) was calculated as the root mean square error for the N - n compounds left out of the model calculation altogether.
Training set selection
Boosted training sets were obtained by applying OptiSim selection to the full data set. OptiSim selection entails drawing a series of random subsamples of size k from the data set of interest. For each subsample in the series, the individual that is most different from those selected from previous subsamples is extracted and added to the selection set S. This procedure results in a representative but diverse selection set that samples the full data set space both efficiently and effectively. Here the structural space was defined in terms of the Tanimoto similarity T(a, b) between the corresponding UNITY substructural fingerprints. The individual a in the ith subsample for which max(T(a, b): b ∈ S) is smallest was added to S. Candidates with a Tanimoto similarity greater than 0.8 to any compound already in the selection set were deemed redundant and were excluded from subsamples.
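A minimal sketch of the selection loop described above, using bit-vector fingerprints and the Tanimoto measure; the subsample handling, candidate pool size, and parameter defaults are simplifications of the published OptiSim procedure:

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto similarity between two binary fingerprint vectors."""
    either = np.sum(a | b)
    return np.sum(a & b) / either if either else 0.0

def optisim(fps, n_select, k=4, redundancy=0.8, seed=0):
    """Draw random subsamples of size k; from each, add to S the
    candidate least similar to anything already selected. Candidates
    more similar than `redundancy` to S are treated as redundant."""
    rng = np.random.default_rng(seed)
    pool = list(range(len(fps)))
    S = [pool.pop(rng.integers(len(pool)))]   # random starting compound
    while len(S) < n_select and pool:
        # subsample of non-redundant candidates
        sub = [i for i in rng.permutation(pool)[: 4 * k]
               if max(tanimoto(fps[i], fps[j]) for j in S) <= redundancy][:k]
        if not sub:
            break
        # pick the candidate whose best match in S is weakest
        best = min(sub, key=lambda i: max(tanimoto(fps[i], fps[j]) for j in S))
        S.append(best)
        pool.remove(best)
    return S

rng = np.random.default_rng(3)
fps = rng.integers(0, 2, size=(60, 128)).astype(bool)  # hypothetical fingerprints
sel = optisim(fps, n_select=10)
```

Because every addition is checked against the whole selection set, no selected pair can exceed the redundancy threshold.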
The selection process was repeated five times with n = 75 and k = 4, using a different random number seed each time. Five inhibitors appeared in every boosted training set, including the thiophene, cyclopentadiene and isoxazole analogs that fall outside the five major classes. A total of 113 inhibitors were not selected for any of the boosted training sets, whereas 191 were selected for at least one of them.
CoMFA uses PLS to identify correlations of biological activity with variations in steric and electrostatic molecular fields; a key part of the process is putting the molecules under consideration into similar conformations and a common frame of reference. Here, conformations were set and molecular structures aligned based on the homologous atoms in their central and peripheral rings, as has been described in detail elsewhere. Charges were calculated using the method of Marsili and Gasteiger, as extended in SYBYL to take the distribution of π electrons into account ("Gasteiger-Hückel charges"). Coulombic and Lennard-Jones interaction energies were calculated on a 2 Å rectilinear grid extending 4 Å beyond the edge of any molecule in the full data set. The probe atom used to calculate the fields was an sp3-hybridized carbon monocation. Interaction energies were truncated at nominal values above 30 kcal/mol, and electrostatics were ignored within the steric envelope of each inhibitor.
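The field evaluation at each grid point can be sketched as below. The probe well depth and radius and the Lorentz/Berthelot-style mixing rules are illustrative assumptions (SYBYL's actual force-field parameters differ), and the exclusion of electrostatics inside the steric envelope is omitted for brevity.

```python
import numpy as np

COULOMB_K = 332.06  # kcal·Å/(mol·e²), conventional conversion factor

def probe_energies(atom_xyz, atom_q, atom_eps, atom_rmin, grid_xyz,
                   probe_q=1.0, probe_eps=0.107, probe_rmin=1.908,
                   cutoff=30.0):
    # Steric (Lennard-Jones) and electrostatic (Coulomb) energies of a
    # probe at each grid point, truncated at `cutoff` as in the text.
    d = np.linalg.norm(grid_xyz[:, None, :] - atom_xyz[None, :, :], axis=2)
    d = np.maximum(d, 1e-6)                 # avoid divide-by-zero on atoms
    rmin = probe_rmin + atom_rmin           # additive radius combination
    eps = np.sqrt(probe_eps * atom_eps)     # geometric-mean well depth
    lj = eps * ((rmin / d) ** 12 - 2 * (rmin / d) ** 6)
    coul = COULOMB_K * probe_q * atom_q[None, :] / d
    steric = np.minimum(lj.sum(axis=1), cutoff)          # truncate above cutoff
    electro = np.clip(coul.sum(axis=1), -cutoff, cutoff)  # truncate magnitude
    return steric, electro
```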
The first step in constructing a molecular hologram is to identify all substructures in a molecule that fall within a specified size range – here, all fragments made up of four to seven atoms, with hydrogens ignored and bond types taken into account. Each fragment is then mapped into a compressed count vector of specified length using a hashing function, so that the elements of that count vector can be used as descriptors in subsequent PLS analyses. The hashing means that different fragments may map to the same position in the final count vector. The fragments overlap, however, so each substructure contributes to many fragment counts. The result is that the noise introduced by "collisions" for a few subfragments constitutes a relatively minor perturbation that is, on average, self-limiting. Overfit PLS models are characteristically unstable to such perturbations, however, so surveying a range of hash lengths and picking one with good but representative statistical properties is a good way to avoid choosing a length whose cross-validation statistics are overly optimistic. This is a non-parametric perturbation analysis analogous to examining the effect of small perturbations in the response to assess model stability.
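The hashing step can be sketched as follows. Representing fragments as canonical strings and using MD5 as the hash are illustrative assumptions; the actual HQSAR hash function is not specified here.

```python
import hashlib

def hologram(fragments, length):
    # Map each fragment (here, a canonical fragment string) into one of
    # `length` bins; distinct fragments may collide in the same bin, which
    # is the "collision" noise discussed in the text.
    counts = [0] * length
    for frag in fragments:
        bin_ = int(hashlib.md5(frag.encode()).hexdigest(), 16) % length
        counts[bin_] += 1
    return counts
```

Surveying several values of `length` and comparing cross-validation statistics across them is the perturbation analysis described above: a model whose statistics swing wildly with the hash length is suspect.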
2D depictions of the relationship between different compounds were obtained using the embedded non-linear mapping (eNLM) facility in Benchware DataMiner. "Ordinary" NLM can be thought of as placing springs between all pairs of points in the original descriptor space, then compressing the ensemble into two dimensions in such a way that the residual tension in those springs is minimized. Embedded NLM differs in that parts of springs longer than some specified threshold length (horizon) are treated as elastic to extension, i.e., they do not contribute to the overall stress in the system. Here, spring "tensions" were based on the block-wise autoscaled Euclidean distances ("CoMFA Standard scaling") between the molecular fields or between the molecular holograms of different compounds.
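The spring analogy can be made concrete with a stress function. The Sammon-style form below and the exact way the horizon removes long-spring extensions from the stress are a simplified reading, not the Benchware DataMiner implementation.

```python
import numpy as np

def enlm_stress(d_high, d_low, horizon):
    # Sammon-style mapping stress with the eNLM modification sketched:
    # pairs whose original (high-dimensional) distance exceeds `horizon`
    # contribute nothing when their 2D distance is even larger, i.e.,
    # those springs are elastic to extension.
    relaxed = (d_high > horizon) & (d_low > d_high)
    active = ~relaxed
    err = (d_high[active] - d_low[active]) ** 2 / d_high[active]
    return float(err.sum() / d_high[active].sum())
```

An NLM optimizer would move the 2D points to minimize this quantity; with `horizon = np.inf` it reduces to ordinary NLM stress.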
CoMFA and HQSAR analyses were carried out in SYBYL. The distances dt,i used to partition the PRESS were taken from the SAMPLS.dist file generated by the SYBYL interface as input to the SAMPLS program and represent inter-observation distances in the descriptor space after autoscaling has been applied. The descriptors used here are already either fully commensurate (HQSAR) or are piecewise commensurate (within steric and electrostatic fields but not between them, for CoMFA), so "CoMFA standard" (block) autoscaling was used. Observed and predicted responses were taken from the SAMPLS.out file generated by the SAMPLS program.
Localized predictivity estimates were calculated by combining scripts written in SYBYL programming language (SPL) with spreadsheet manipulations carried out in Excel. For each compound t in the training set, the scaling factor γt was calculated based on the observed predictive variance (squared cross-validation error of prediction, δi²) for every other compound in the training set (i ≠ t), weighted inversely by the square of the Euclidean distance between the two (dt,i) in the descriptor space (Eq. 8). A normalization factor αi for each compound i was calculated as the sum of squared distances to all other compounds in the training set (Eq. 9). A limiting proximity term of 1/n was included to ensure reasonable behavior for closely spaced compounds where dt,i approaches 0; this works well when the descriptors have been autoscaled in some way before use.
The individual scaling factors γt obtained from the n LOO cross-validation errors for the training set were used to calculate an estimate for the predictive uncertainty associated with each new structure u based on the observed cross-validation error of the training set compound (t*) lying closest to it in the descriptor space, its distance from t*, and the scaling factor γt* derived from the model cross-validation analyses (Eqs. 7–9).
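The bookkeeping can be sketched as follows. This is a paraphrase, not a transcription, of Eqs. 7–9: the inverse-square weighting with the 1/n proximity term follows the text, but the normalization used here (the sum of the weights themselves, chosen so that the γt sum back exactly to the PRESS, matching the "distributed PRESS" idea) and the rule combining et*, γt*, and the separation for a new compound are assumptions.

```python
import numpy as np

def dpress_gamma(D, delta):
    # Distribute each squared LOO error delta_i**2 over the other training
    # compounds with inverse-square distance weights; the 1/n term keeps
    # the weights finite as d -> 0. By construction, gamma.sum() equals the
    # PRESS (sum of delta_i**2), hence "distributed PRESS".
    n = len(delta)
    w = 1.0 / (D ** 2 + 1.0 / n)
    np.fill_diagonal(w, 0.0)
    alpha = w.sum(axis=0)                        # normalizer for compound i
    return (w * (delta ** 2 / alpha)[None, :]).sum(axis=1)

def s_u(d_new, D, delta, e):
    # Uncertainty for a new compound u: the non-cross-validated residual of
    # the nearest training compound t*, inflated by gamma_t* scaled with the
    # squared separation relative to the training set size (assumed form).
    gamma = dpress_gamma(D, delta)
    n = len(delta)
    t = int(np.argmin(d_new))                    # closest training compound t*
    return float(np.sqrt(e[t] ** 2 + gamma[t] * (d_new[t] ** 2 + 1.0 / n) * n))
```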
- ANN: artificial neural network
- BLUE: best linear unbiased estimator
- CoMFA: comparative molecular field analysis
- d: distance between two compounds in the descriptor space
- δ: predictive error for a compound outside the training set
- e: residual error for a compound in the training set
- eNLM: embedded non-linear mapping
- PCA: principal components analysis
- DPRESS: distributed predictive error sum of squares
- PLS: partial least squares with projection to latent structures
- PRESS: predictive error sum of squares
- QSAR: quantitative structure/activity relationship
- s: standard error for a sample
- SPL: SYBYL programming language.
Chris Williams of the Chemical Computing Group provided helpful input on the organization of the manuscript, as did the three anonymous reviewers recruited by the journal. The paper is substantially clearer as a result of their input, and the author appreciates it.
- Hansch C, Maloney PP, Fujita T, Muir RM: Correlation of biological activity of phenoxyacetic acids with Hammett substituent constants and partition coefficients. Nature. 1962, 194: 178-180.
- Hansch C, Fujita T: p-σ-π Analysis. A method for the correlation of biological activity and chemical structure. J Am Chem Soc. 1964, 86: 1616-1626.
- Jurs PC, Chou JT, Yuan M: Computer-assisted structure-activity studies of chemical carcinogens. A heterogeneous data set. J Med Chem. 1979, 22: 476-483.
- Wold S, Ruhe A, Wold H, Dunn WJ: The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM J Sci Stat Comput. 1984, 5: 735-743.
- Baumann K, Albert H, von Korff M: A systematic evaluation of the benefits and hazards of variable selection in latent variable regression. Part I. Search algorithm, theory and simulations. J Chemometrics. 2002, 16: 339-350.
- Baumann K, von Korff M, Albert H: A systematic evaluation of the benefits and hazards of variable selection in latent variable regression. Part II. Practical applications. J Chemometrics. 2002, 16: 351-360.
- Schultz T, Hewitt M, Netzeva T, Cronin M: Assessing applicability domains of toxicological QSARs: definition, confidence in predicted values, and the role of mechanisms of action. QSAR Comb Sci. 2007, 26: 238-254.
- Giuliani A, Benigni R: Modeling without boundary conditions: an issue in QSAR validation. Computer-Assisted Lead Finding and Optimization. Edited by: van de Waterbeemd H, Testa B, Folkers G. 1997, Weinheim: Wiley-VCH, 51-63.
- Sheridan RP, Feuston BP, Maiorov VN, Kearsley SK: Similarity to molecules in the training set is a good discriminator for prediction accuracy in QSAR. J Chem Inf Comput Sci. 2004, 44: 1912-1928.
- Guha R, Jurs PC: Determining the validity of a QSAR model – a classification approach. J Chem Inf Model. 2005, 45: 65-73.
- He L, Jurs PC: Assessing the reliability of a QSAR model's predictions. J Mol Graph Model. 2005, 23: 503-523.
- Schroeter TS, Schwaighofer A, Mika S, ter Laak A, Suelzle D, Ganzer U, Heinrich N, Müller K-R: Estimating the domain of applicability for machine learning QSAR models: a study on aqueous solubility of drug discovery molecules. J Comput-Aided Mol Des. 2007, 21: 651-664.
- Benigni R, Bossa C: Predictivity of QSAR. J Chem Inf Model. 2008, 48: 971-980.
- Tetko IV, Sushko I, Pandey AK, Zhu H, Tropsha A, Papa E, Öberg T, Todeschini R, Fourches D, Varnek A: Critical assessment of QSAR models of environmental toxicity against Tetrahymena pyriformis: focusing on applicability domain and overfitting by variable selection. J Chem Inf Model. 2008, 48: 1733-1746.
- Weaver S, Gleeson MP: The importance of the domain of applicability in QSAR modeling. J Mol Graph Model. 2008, 26: 1315-1326.
- Johnson DE, Wolfgang GI: Predicting human safety: screening and computational approaches. Drug Discov Today. 2000, 5: 445-454.
- Bassan A, Worth AP: The integrated use of models for the properties and effects of chemicals by means of a structured workflow. QSAR Comb Sci. 2008, 27: 6-20.
- Walker JD, Carlsen L, Jaworska J: Improving opportunities for regulatory acceptance of QSARs: the importance of model domain, uncertainty, validity and predictability. Quant Struct-Act Rel. 2003, 22: 6-20.
- Snedecor GW, Cochran WG: Statistical Methods. 8th edition. 1989, Ames, IA: Iowa State Press.
- Kleinknecht RE: Error estimation in PLS latent variable structure. J Chemometrics. 1996, 10: 687-695.
- Denham MC: Prediction intervals in partial least squares. J Chemometrics. 1997, 11: 39-52.
- Faber K, Kowalski BR: Propagation of measurement errors for the validation of predictions obtained by principal component regression and partial least squares. J Chemometrics. 1997, 11: 181-238.
- Morsing T, Ekman C: Comments on construction of confidence intervals in connection with partial least squares. J Chemometrics. 1998, 12: 295-299.
- Wold S: Validation of QSARs. Quant Struct-Act Rel. 1991, 10: 191-193.
- Clark RD, Sprous DG, Leonard JM: Validating models based on large data sets. Rational Approaches to Drug Design. Edited by: Höltje H-D, Sippl W. 2001, Barcelona: Prous Science, 475-485.
- Golbraikh A, Tropsha A: Beware of q2!. J Mol Graph Model. 2002, 20: 269-276.
- Golbraikh A, Tropsha A: Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection. J Comput-Aided Mol Des. 2002, 16: 357-369.
- Clark RD, Fox PC: Statistical variation in progressive scrambling. J Comput-Aided Mol Des. 2004, 18 (7-9): 563-576.
- Hawkins DM, Basak SC, Mills D: Assessing model fit by cross-validation. J Chem Inf Comput Sci. 2003, 43: 579-586.
- Sutherland JJ, O'Brien LA, Weaver DF: A comparison of methods for modeling quantitative structure-activity relationships. J Med Chem. 2004, 47: 3777-3787.
- Bush BL, Nachbar RB: Sample-distance partial least squares: PLS optimized for many variables, with application to CoMFA. J Comput-Aided Mol Des. 1993, 7: 587-619.
- Cramer RD, Patterson DE, Bunce JD: Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. J Am Chem Soc. 1988, 110: 5959-5967.
- Cramer RD, DePriest SA, Patterson DE, Hecht P: The developing practice of comparative molecular field analysis. 3D QSAR in Drug Design: Theory, Methods and Applications. Edited by: Kubinyi H. 1993, Leiden: ESCOM, 443-485.
- Kroemer RT, Hecht P, Guessregen S, Liedl KR: Improving the predictive quality of models. 3D QSAR in Drug Design. Edited by: Kubinyi H, Folkers G, Martin YC. 1998, Dordrecht: Kluwer/ESCOM, 3: 41-56.
- Heritage TW, Lowis DR: Molecular Hologram QSAR. Rational Drug Design: Novel Methodology and Practical Applications, ACS Symposium Series 719. Edited by: Parrill AL, Reddy MR. 1999, Washington DC: American Chemical Society, 212-225.
- Tong W, Lowis DR, Perkins R, Chen Y, Welsh WJ, Goddette DW, Heritage TW, Sheehan DM: Evaluation of quantitative structure-activity relationship methods for large-scale prediction of chemicals binding to the estrogen receptor. J Chem Inf Comput Sci. 1998, 38: 669-677.
- Seel M, Turner DB, Willett P: Effect of parameter variations on the effectiveness of HQSAR analyses. Quant Struct-Act Rel. 1999, 18: 245-252.
- Chavatte P, Yous S, Marot C, Baurin N, Lesieur D: Three-dimensional quantitative structure-activity relationships of cyclooxygenase-2 (COX-2) inhibitors: a comparative molecular field analysis. J Med Chem. 2001, 44: 3223-3230.
- Clark RD: Boosted leave-many-out cross-validation: the effect of training set and test set diversity on PLS statistics. J Comput-Aided Mol Des. 2003, 17: 265-275.
- Clark RD, Patterson DE, Soltanshahi F, Blake JF, Matthew JB: Visualizing substructural fingerprints. J Mol Graph Model. 2000, 18: 404-411.
- Agrafiotis DK: Stochastic proximity embedding. J Comput Chem. 2003, 24: 1215-1221.
- Clark RD: OptiSim: an extended dissimilarity selection method for finding diverse representative subsets. J Chem Inf Comput Sci. 1997, 37: 1181-1188.
- Clark RD, Langton WJ: Balancing representativeness against diversity using optimizable K-dissimilarity and hierarchical clustering. J Chem Inf Comput Sci. 1998, 38: 1079-1086.
- Clark RD: Getting past diversity in assessing virtual library designs. J Brazil Chem Soc. 2002, 13: 788-794.
- Clark RD, Shepphird JK, Holliday J: The effect of structural redundancy in validation sets on virtual screening performance. J Chemometrics.
- Kim KK: Nonlinear dependence in comparative molecular field analysis. 3D QSAR in Drug Design: Theory, Methods and Applications. Edited by: Kubinyi H. 1993, Leiden: ESCOM, 71-82.
- Embrechts MJ, Szymanski B, Sternickel K: Introduction to scientific data mining: direct kernel methods and applications. Computationally Intelligent Hybrid Systems: The Fusion of Soft Computing and Hard Computing. Edited by: Ovaska SJ. 2005, New York: Wiley, 317-362.
- Haranczyk M, Holliday J: Comparison of similarity coefficients for clustering and compound selection. J Chem Inf Model. 2008, 48: 498-509.
- Gasteiger J, Marsili M: Iterative partial equalization of orbital electronegativity – a rapid access to atomic charges. Tetrahedron. 1980, 36: 3219-3228.
- SYBYL, v. 8.0. 2008, Tripos International: St. Louis, MO
- Benchware DataMiner, v. 1.6. 2007, Tripos International: St. Louis, MO
- Bush B: SAMPLS: SAMple-driven Partial Least Squares. 1993, Merck & Co., Inc.: Rahway, NJ
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.