Complementary PLS and KNN algorithms for improved 3D-QSDAR consensus modeling of AhR binding
 Svetoslav H Slavov^{1},
 Bruce A Pearce^{1},
 Dan A Buzatu^{1},
 Jon G Wilkes^{1} and
 Richard D Beger^{1}
DOI: 10.1186/1758-2946-5-47
© Slavov et al.; licensee Chemistry Central Ltd. 2013
Received: 21 August 2013
Accepted: 15 November 2013
Published: 21 November 2013
Abstract
Multiple validation techniques (Y-scrambling, complete training/test set randomization, determination of the dependence of R^{2}_{test} on the number of randomization cycles, etc.) aimed at improving the reliability of the modeling process were utilized, and their effect on the statistical parameters of the models was evaluated. A consensus partial least squares (PLS)-similarity-based k-nearest neighbors (KNN) model utilizing 3D-SDAR (three-dimensional spectral data-activity relationship) fingerprint descriptors for prediction of the log(1/EC_{50}) values of a dataset of 94 aryl hydrocarbon receptor binders was developed. This consensus model was constructed from a PLS model utilizing 10 ppm x 10 ppm x 0.5 Å bins and 7 latent variables (R^{2}_{test} of 0.617), and a KNN model using 2 ppm x 2 ppm x 0.5 Å bins and 6 neighbors (R^{2}_{test} of 0.622). Compared to the individual models, an improvement in predictive performance of approximately 10.5% (R^{2}_{test} of 0.685) was observed. Further experiments indicated that this improvement is likely an outcome of the complementarity of the information contained in 3D-SDAR matrices of different granularity. For similarly sized data sets of aryl hydrocarbon receptor (AhR) binders, the consensus KNN and PLS models compare favorably to earlier reports. The ability of 3D-QSDAR (three-dimensional quantitative spectral data-activity relationship) to provide structural interpretation was illustrated by a projection of the most frequently occurring bins on the standard coordinate space, thus allowing identification of structural features related to toxicity.
Keywords
QSAR; Molecular descriptors; Quantitative spectral data-activity relationship (3D-QSDAR); Estrogen receptor binding; Molecular modeling
Background
During the past decade, the application of consensus modeling to various QSAR-related problems has been explored [1–3]. Early QSARs often relied on single models, which under certain circumstances were prone to arbitrary overestimation of the contribution of given structural features at the expense of others that were suppressed or ignored. To mitigate such risks, consensus models based on a multitude of individual models can be used to advantage. Reports of improved performance of consensus models [4–6], or the lack thereof [7], have been published.
Recently, our group introduced the concept of a robust 3D-QSDAR approach [8]. 3D-QSDAR utilizes unique fingerprints constructed from pairs of ^{13}C chemical shifts augmented with their corresponding inter-atomic distances. The proposed 3D-QSDAR methodology was designed in accordance with the Organization for Economic Co-operation and Development (OECD) principles [9]: it provided several levels of validation, thus assuring that models would be both reliable and interpretable. In our earlier work [8] an automated partial least squares (PLS) algorithm was used to process data from regularly tessellated 3D-SDAR fingerprints and to derive averaged (composite model) predictions from 100 randomized training/holdout test set pairs. A technique [10] based on the standard deviation of the experimental data was employed to determine a “realistic” upper bound for the coefficient of determination. A Y-scrambling procedure [11, 12] assessed the probability of generating seemingly “good” models by chance.
However, the above-described modeling procedure employed a single data processing algorithm, namely PLS. As a step forward, experiments were conceived to explore the likelihood of building improved consensus models by combining predictions generated by conceptually unrelated algorithms operating on 3D-SDAR matrices of different granularity. A KNN algorithm intended to supplement PLS by capturing complementary aspects of the structure-activity relationship was devised. It was hypothesized that the improvement of performance in consensus modeling should depend on the degree of orthogonality of the predictions produced by the individual models. Beyond the accuracy of the biological data, the inherent information content of a given descriptor pool was thought of as a factor limiting the improvement of R^{2}_{test} in consensus modeling. In other words, regardless of the data processing algorithm, the maximum achievable R^{2} for a holdout test set would be limited by the descriptors’ ability to depict specific aspects of the molecular structure directly related to the observed effect.
Table 1 Summary of QSARs published since year 2000
Chemical class  Endpoint  Dataset size  Data processing algorithm*  Descriptor type  Statistical parameters  Reference

PCBs, PCDDs and PCDFs  logEC_{50}  52  MLR  ^{13}C NMR  R^{2} = 0.85; q^{2} = 0.71  [13]
PCBs, PCDDs and PCDFs  logEC_{50}  52  MLR  ^{13}C NMR, atom-to-atom distances  R^{2} = 0.85; q^{2} = 0.52  [14]
PCDFs  log(1/EC_{50})  33  MLR  Quantum mechanical; logP  R^{2} = 0.720; s = 0.723  [15]
PCDFs  log(1/EC_{50})  34  MLR  Quantum mechanical  R^{2} = 0.747; R^{2}_{adj} = 0.669; q^{2} = 0.572  [16]
PCDDs and PCDFs  log(1/EC_{50})  90  PLS  CoMFA  10 latent variables; R^{2} = 0.838; q^{2} = 0.624; SEP = 0.903  [17]
PCDFs  log(1/EC_{50})  34  MLR  Quantum mechanical  R^{2} = 0.863; R^{2}_{adj} = 0.839; q^{2} = 0.807; SE = 0.558; F = 35.389  [18]
PCDDs  log(1/EC_{50})  47  MLR  Quantum mechanical  R^{2} = 0.729; R^{2}_{adj} = 0.703; SE = 0.797; F = 28.269  [19]
PHDDs  log(1/EC_{50})  25  MLR  Quantum mechanical  R^{2} = 0.768; R^{2}_{adj} = 0.721; q^{2} = 0.635; S.E. = 0.762; F = 16.529  [20]
PHDDs  log(1/EC_{50})  25  MLR  WHIM  R^{2} = 0.915; R^{2}_{adj} = 0.902; q^{2} = 0.880; S.E. = 0.451; F = 75.032  [20]
PCDDs and PCDFs  log(1/EC_{50})  60  MLR  Quantum mechanical  R^{2} = 0.687; R^{2}_{adj} = 0.686; q^{2} = 0.603; S.E. = 0.870  [21]
Dataset
Table 2 AhR binders and their experimental and predicted log(1/EC_{50})
Chemical name  Experimental log(1/EC_{50})  Predicted: I-PLS (2 ppm × 2 ppm × 0.5 Å)  Predicted: II-KNN (2 ppm × 2 ppm × 0.5 Å)  Predicted: III-PLS (10 ppm × 10 ppm × 0.5 Å)  Predicted: IV-KNN (10 ppm × 10 ppm × 0.5 Å)  Predicted: PLS-KNN consensus from II and III

3,3',4,4'-Tetrachlorobiphenyl  6.15  6.02  5.50  6.49  5.75  6.00
2,3,4,4'-Tetrachlorobiphenyl  4.55  5.27  5.35  5.15  5.28  5.25
3,3',4,4',5-Pentachlorobiphenyl  6.89  5.06  5.11  5.96  5.63  5.54
2',3,4,4',5-Pentachlorobiphenyl  4.85  4.77  5.26  4.23  5.11  4.75
2,3,3',4,4'-Pentachlorobiphenyl  5.37  5.59  5.64  5.07  5.38  5.36
2,3',4,4',5-Pentachlorobiphenyl  5.04  5.47  5.47  4.74  5.29  5.11
2,3,4,4',5-Pentachlorobiphenyl  5.39  4.81  4.78  5.53  5.14  5.16
2,3,3',4,4',5-Hexachlorobiphenyl  5.15  5.33  5.22  5.61  5.19  5.42
2,3',4,4',5,5'-Hexachlorobiphenyl  4.80  5.16  5.41  4.80  5.36  5.11
2,3,3',4,4',5'-Hexachlorobiphenyl  5.33  5.23  5.12  5.07  5.37  5.10
2,2',4,4'-Tetrachlorobiphenyl  3.89  5.22  4.83  4.49  4.94  4.66
2,2',4,4',5,5'-Hexachlorobiphenyl  4.10  4.41  5.05  3.50  4.85  4.28
2,3,4,5-Tetrachlorobiphenyl  3.85  5.55  5.20  5.35  5.29  5.28
2,3',4,4',5',6-Hexachlorobiphenyl  4.00  5.44  5.23  4.37  4.90  4.80
4'-Hydroxy-2,3,4,5-tetrachlorobiphenyl  4.05  5.88  5.05  5.07  4.86  5.07
4'-Methyl-2,3,4,5-tetrachlorobiphenyl  4.51  5.21  5.27  5.13  4.86  5.20
4'-Fluoro-2,3,4,5-tetrachlorobiphenyl  4.60  5.13  4.92  4.37  4.67  4.65
4'-Methoxy-2,3,4,5-tetrachlorobiphenyl  4.80  5.35  5.15  4.32  4.74  4.74
4'-Acetyl-2,3,4,5-tetrachlorobiphenyl  5.17  5.00  4.87  4.14  4.98  4.51
4'-Cyano-2,3,4,5-tetrachlorobiphenyl  5.27  5.48  5.05  4.29  4.78  4.67
4'-Ethyl-2,3,4,5-tetrachlorobiphenyl  5.46  5.13  5.06  4.50  4.82  4.78
4'-Bromo-2,3,4,5-tetrachlorobiphenyl  5.60  5.42  5.34  5.27  5.51  5.31
4'-Iodo-2,3,4,5-tetrachlorobiphenyl  5.82  5.53  5.16  5.88  5.84  5.52
4'-Isopropyl-2,3,4,5-tetrachlorobiphenyl  5.89  5.77  5.45  5.07  4.75  5.26
4'-Trifluoromethyl-2,3,4,5-tetrachlorobiphenyl  6.43  5.42  5.25  4.46  4.80  4.86
3'-Nitro-2,3,4,5-tetrachlorobiphenyl  4.85  5.51  5.27  5.07  4.75  5.17
4'-N-Acetylamino-2,3,4,5-tetrachlorobiphenyl  5.09  5.26  4.87  5.09  4.96  4.98
4'-Phenyl-2,3,4,5-tetrachlorobiphenyl  5.18  4.74  5.03  4.69  5.01  4.86
4'-t-Butyl-2,3,4,5-tetrachlorobiphenyl  5.17  5.12  5.34  4.71  4.89  5.03
4'-n-Butyl-2,3,4,5-tetrachlorobiphenyl  5.13  5.12  5.13  5.44  4.93  5.29
2,3,7,8-Tetrachlorodibenzo-p-dioxin  8.00  8.27  7.66  7.10  7.28  7.38
1,2,3,7,8-Pentachlorodibenzo-p-dioxin  7.10  6.10  6.73  6.43  5.99  6.58
2,3,6,7-Tetrachlorodibenzo-p-dioxin  6.80  6.56  6.76  5.92  5.96  6.34
2,3,6-Trichlorodibenzo-p-dioxin  6.66  6.31  6.67  5.85  5.90  6.26
1,2,3,4,7,8-Hexachlorodibenzo-p-dioxin  6.55  5.83  6.10  5.84  5.69  5.97
1,3,7,8-Tetrachlorodibenzo-p-dioxin  6.10  6.22  6.68  6.03  6.12  6.36
1,2,4,7,8-Pentachlorodibenzo-p-dioxin  5.96  5.99  6.46  5.41  5.86  5.94
1,2,3,4-Tetrachlorodibenzo-p-dioxin  5.89  4.39  5.44  5.96  5.87  5.70
2,3,7-Trichlorodibenzo-p-dioxin  7.15  6.72  6.84  6.69  7.37  6.77
2,8-Dichlorodibenzo-p-dioxin  5.50  5.73  6.04  7.83  7.94  6.94
1,2,3,4,7-Pentachlorodibenzo-p-dioxin  5.19  5.68  6.02  5.69  5.95  5.86
1,2,4-Trichlorodibenzo-p-dioxin  4.89  5.46  5.90  6.12  5.99  6.01
1,2,3,4,6,7,8,9-Octachlorodibenzo-p-dioxin  5.00  6.78  7.76  4.77  5.74  6.27
1-Chlorodibenzo-p-dioxin  4.00  5.97  6.09  6.44  6.54  6.28
2,3,7,8-Tetrabromodibenzo-p-dioxin  8.82  9.29  8.61  9.86  8.43  9.24
2,3-Dibromo-7,8-dichlorodibenzo-p-dioxin  8.83  8.56  8.43  8.55  8.15  8.49
2,8-Dibromo-3,7-dichlorodibenzo-p-dioxin  9.35  7.54  7.86  6.87  7.06  7.37
2-Bromo-3,7,8-trichlorodibenzo-p-dioxin  7.94  8.31  8.05  7.26  7.40  7.66
1,3,7,8,9-Pentabromodibenzo-p-dioxin  7.03  7.25  7.99  7.53  8.29  7.76
1,3,7,8-Tetrabromodibenzo-p-dioxin  8.70  7.38  8.51  8.22  8.48  8.37
1,2,4,7,8-Pentabromodibenzo-p-dioxin  7.77  7.31  8.06  9.20  8.24  8.63
1,2,3,7,8-Pentabromodibenzo-p-dioxin  8.18  8.31  8.65  8.40  8.57  8.53
2,3,7-Tribromodibenzo-p-dioxin  8.93  8.10  8.40  8.23  8.42  8.32
2,7-Dibromodibenzo-p-dioxin  7.81  7.48  7.36  7.07  8.06  7.22
2-Bromodibenzo-p-dioxin  6.53  6.67  7.03  8.22  7.73  7.63
2-Chlorodibenzofuran  3.55  3.94  4.48  3.76  3.78  4.12
3-Chlorodibenzofuran  4.38  5.13  5.01  5.75  5.89  5.38
4-Chlorodibenzofuran  3.00  5.20  4.54  4.80  5.37  4.67
2,3-Dichlorodibenzofuran  5.33  5.29  4.77  5.68  5.71  5.23
2,6-Dichlorodibenzofuran  3.61  5.03  4.85  3.50  4.14  4.18
2,8-Dichlorodibenzofuran  3.59  4.21  4.77  3.76  3.88  4.27
1,3,6-Trichlorodibenzofuran  5.36  6.28  6.21  5.70  5.57  5.96
1,3,8-Trichlorodibenzofuran  4.07  5.80  5.82  5.28  5.40  5.55
2,3,4-Trichlorodibenzofuran  4.72  6.78  5.80  5.73  5.83  5.77
2,3,8-Trichlorodibenzofuran  6.00  5.58  5.07  5.63  5.59  5.35
2,6,7-Trichlorodibenzofuran  6.35  5.64  5.29  5.38  4.98  5.34
2,3,4,6-Tetrachlorodibenzofuran  6.46  5.95  5.86  6.68  5.56  6.27
2,3,4,8-Tetrachlorodibenzofuran  6.70  6.19  5.84  5.55  5.38  5.70
1,3,6,8-Tetrachlorodibenzofuran  6.66  5.63  5.52  6.36  5.92  5.94
2,3,7,8-Tetrachlorodibenzofuran  7.39  6.96  6.54  7.18  6.84  6.86
1,2,4,8-Tetrachlorodibenzofuran  5.00  5.16  5.32  4.19  4.90  4.76
1,2,4,6,7-Pentachlorodibenzofuran  7.17  5.65  5.50  5.82  5.54  5.66
1,2,4,7,9-Pentachlorodibenzofuran  4.70  6.82  6.34  5.22  5.40  5.78
1,2,3,4,8-Pentachlorodibenzofuran  6.92  6.42  5.74  5.49  5.21  5.62
1,2,3,7,8-Pentachlorodibenzofuran  7.13  7.03  6.56  6.96  7.19  6.76
1,2,4,7,8-Pentachlorodibenzofuran  5.89  5.94  5.57  6.32  5.94  5.95
2,3,4,7,8-Pentachlorodibenzofuran  7.82  6.42  6.42  7.08  6.80  6.75
1,2,3,4,7,8-Hexachlorodibenzofuran  6.64  6.61  6.06  7.22  6.95  6.64
1,2,3,6,7,8-Hexachlorodibenzofuran  6.57  7.22  6.78  6.67  6.47  6.73
1,2,4,6,7,8-Hexachlorodibenzofuran  5.08  6.58  5.83  6.53  5.70  6.18
2,3,4,6,7,8-Hexachlorodibenzofuran  7.33  7.93  6.85  7.73  6.60  7.29
2,3,6,8-Tetrachlorodibenzofuran  6.66  5.39  5.23  5.58  5.42  5.41
1,2,3,6-Tetrachlorodibenzofuran  6.46  4.93  5.36  6.17  5.85  5.77
1,2,3,7-Tetrachlorodibenzofuran  6.96  6.93  6.57  7.00  7.22  6.79
1,3,4,7,8-Pentachlorodibenzofuran  6.70  6.82  6.59  6.60  6.53  6.60
2,3,4,7,9-Pentachlorodibenzofuran  6.70  6.54  6.34  7.29  6.99  6.82
1,2,3,7,9-Pentachlorodibenzofuran  6.40  6.32  6.40  6.69  6.94  6.55
H  3.00  3.53  4.46  3.98  3.95  4.22
2,3,4,7-Tetrachlorodibenzofuran  7.60  6.08  6.44  6.37  6.29  6.41
1,2,3,7-Tetrachlorodibenzofuran  6.96  6.97  6.59  7.00  7.17  6.80
1,3,4,7,8-Pentachlorodibenzofuran  6.70  6.84  6.58  6.62  6.52  6.60
2,3,4,7,9-Pentachlorodibenzofuran  6.70  6.52  6.36  7.23  6.96  6.80
1,2,3,7,9-Pentachlorodibenzofuran  6.40  6.38  6.41  6.68  6.94  6.55
1,2,4,6,8-Pentachlorodibenzofuran  5.51  5.81  5.61  3.30  4.80  4.46
Methods
Conventions
Several layers of complexity related to the utilized modeling procedures were introduced in this manuscript and these require clarification. To avoid ambiguity, models utilizing the same algorithm (either PLS or KNN) operating on an individual 3D-SDAR data matrix by generating multiple randomized training/test subset pairs, later combined to form a single model, will be referred to as “composite models”. Models averaging the predictions from two (or eventually more) composite models will be referred to as “consensus models”. The term “individual models” is used to denote either the individual PLS or KNN models forming the “consensus model” or the individual randomized training/test subset models resulting in a “composite model”; its specific meaning is determined by the context in which it is used. The term “matching training/test subset pairs” indicates complementary training and test subset pairs processed by different algorithms, but composed of the same subsets of compounds.
Molecular conformation
In its current implementation, 3D-QSDAR does not employ docking or alignment algorithms, nor does it use X-ray structures to achieve more consistent geometries of the molecules constituting the dataset. This choice widens its applicability to datasets of compounds with unknown, multiple, or no specific targets and in the absence of knowledge about the binding site and its conformational requirements. For the purpose of reproducibility, however, the conformation at the global minimum of the potential energy surface was used. It has to be acknowledged that, while this conformation is the most energetically stable, it may not be the one assumed during solvent interaction or upon binding with a macromolecule [24].
To find the lowest-energy conformers of all PCBs (the PHDDs and PCDFs have no rotatable bonds) a conformational search analysis was performed in HyperChem 8.0 [25]. An AMBER force field [26] and a random walks search method with an acceptance energy criterion of 6 kcal/mol were used. At the final stage of optimization all PCBs, PHDDs and PCDFs were optimized by employing a semi-empirical Austin Model 1 (AM1) Hamiltonian with a root-mean-square gradient of 0.01 kcal/(Å × mol).
3D-QSDAR descriptor calculations
Because the units of length on each axis are not identical, X, Y and Z do not form a Cartesian coordinate system. Since the number of carbon atoms in a molecule (N_{C}) uniquely determines the number of elements in a fingerprint (N_{C}(N_{C} − 1)/2), each of the 94 AhR binders will be represented by at least 66 such fingerprint elements in the 3D-SDAR space. This 3D-SDAR space was further tessellated using regular grids to form bins ranging in size from 2 ppm x 2 ppm x 0.5 Å to 20 ppm x 20 ppm x 2.5 Å (i.e., incremental steps of 0.5 Å on the Z-axis and 2 ppm on the chemical shifts plane XY were used). As a result, 50 regular grids of different granularity were generated. A procedure performed separately on each of the 50 grids counted the number of fingerprint elements of a molecule belonging to a given bin (i.e., bin occupancy) and stored these values as row vectors in m x n matrices. Here m represents the number of compounds in the dataset, whereas n represents the number of occupied bins.
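The bin-occupancy step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' Matlab implementation: the function names, the toy chemical shifts and coordinates, and the convention of placing the larger shift of each pair on the X axis are all hypothetical choices made for the example.

```python
from collections import Counter
from itertools import combinations
import math


def n_pairs(n_carbons):
    """Number of fingerprint elements, N_C (N_C - 1) / 2."""
    return n_carbons * (n_carbons - 1) // 2


def fingerprint(shifts, coords):
    """Return one (shift_x, shift_y, distance) triple per pair of carbons."""
    elements = []
    for i, j in combinations(range(len(shifts)), 2):
        distance = math.dist(coords[i], coords[j])
        # Assumption: the larger shift goes on X so each pair has one canonical order.
        hi, lo = max(shifts[i], shifts[j]), min(shifts[i], shifts[j])
        elements.append((hi, lo, distance))
    return elements


def bin_occupancy(elements, ppm_step, dist_step):
    """Count the fingerprint elements falling into each bin of a regular grid."""
    counts = Counter()
    for x, y, z in elements:
        key = (int(x // ppm_step), int(y // ppm_step), int(z // dist_step))
        counts[key] += 1
    return counts


def occupancy_matrix(all_counts):
    """Stack per-molecule counts into an m x n matrix over the occupied bins."""
    bins = sorted(set().union(*all_counts))
    return bins, [[c.get(b, 0) for b in bins] for c in all_counts]


# Toy input: made-up shifts (ppm) and coordinates (Å) for a 4-carbon fragment.
shifts = [128.0, 129.5, 131.0, 152.0]
coords = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (2.1, 1.2, 0.0), (1.4, 2.4, 0.0)]
elements = fingerprint(shifts, coords)

# The 50 grids: 2..20 ppm in 2 ppm steps, 0.5..2.5 Å in 0.5 Å steps.
grids = [(2 * a, 0.5 * b) for a in range(1, 11) for b in range(1, 6)]
counts = bin_occupancy(elements, ppm_step=10, dist_step=0.5)
```

Only occupied bins become columns of the matrix, which keeps n far smaller than the total number of cells in a fine grid.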
Determination of the optimal number of randomization cycles
Model building
i) A SIMPLS-based [28] PLS algorithm written in Matlab [29] was employed to process each of the 50 3D-QSDAR data matrices. All descriptors were standardized using the “zscore” Matlab function. As described above, 100 random training/test set pairs were generated and composite (ensemble) PLS models for the training sets, including between 1 and 10 LVs, were built. These models were then used to predict the log(1/EC_{50}) values for the complementary 20% “holdout” test subsets. At the end, each of the 100 individual R^{2}_{training}, R^{2}_{test} and R^{2}_{scrambling} values was recorded and their averages for the composite models were reported. For each of the 50 average models utilizing grids of different granularity, the random number generator was re-initialized in order to recreate the same training/holdout test sequence (Additional file 1). Due to the specifics of the chosen model-building procedure, the reader should bear in mind that these average reported parameters include contributions from “good” as well as “bad” models (see the results and discussion section).
ii) Alternatively, a KNN algorithm written in Matlab and based on Tanimoto similarity [30] in its generalized vector form, $T\left(A,B\right)=\frac{A\cdot B}{{\Vert A\Vert}^{2}+{\Vert B\Vert}^{2}-A\cdot B}$, was employed. In this equation, A and B are data objects represented by vectors (originally bit vectors). Thus, the Tanimoto similarity is the dot product of two vectors A and B (bin occupancy row vectors for a pair of compounds) divided by the sum of the squared magnitudes of A and B minus their dot product. In other words, for compounds sharing common structural features T will be closer to 1; otherwise T will be closer to 0.
Because T is not invariant to standardization, preserving its universal nature required use of the original, non-transformed 3D-SDAR descriptor pool. At a constant granularity of the grid this specific choice made T unique: there is one and only one T for a given pair of compounds. For a standardized descriptor pool, T loses its universal nature by being dependent on the mean and the standard deviation of the descriptors within the training set, and multiple Ts between a pair of compounds would exist (i.e., T would become a local characteristic of similarity).
These invariant T values (calculated for all pairs of compounds) were later used to predict the holdout test set activities by ranking the compounds from the training set in descending order of their similarity to each compound from the holdout test set and using T of the first K neighbors (1 ≤ K ≤ 10) to weight their contributions to activity. Under these experimental settings, both odd and even numbers of neighbors can be used. As with PLS, the KNN validation procedure involved 100 randomized training/holdout test set pairs recreated by the use of the same random seed.
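A minimal Python sketch of the similarity-based KNN prediction described in item ii) follows. The Tanimoto function implements the generalized vector form quoted in the text; the prediction rule (a T-weighted mean of the activities of the K most similar training compounds) is an assumption, since the text only states that T is used to weight the neighbors' contributions. All names and the toy vectors are hypothetical.

```python
def tanimoto(a, b):
    """Generalized Tanimoto similarity of two bin-occupancy vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    # denom is zero only when both vectors are all zeros
    return dot / denom if denom else 0.0


def knn_predict(query, train_vectors, train_activities, k):
    """Predict activity as a T-weighted mean over the k most similar neighbors.

    Assumption: weights are the raw T values, normalized by their sum.
    """
    ranked = sorted(
        ((tanimoto(query, v), y) for v, y in zip(train_vectors, train_activities)),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(t for t, _ in ranked)
    if total == 0:
        # No similarity information: fall back to the plain mean of the neighbors.
        return sum(y for _, y in ranked) / len(ranked)
    return sum(t * y for t, y in ranked) / total


# Toy training set: raw (non-standardized) bin occupancies and activities.
train_vectors = [[1, 2, 0], [0, 0, 3], [1, 1, 0]]
train_activities = [6.0, 4.0, 5.0]
prediction = knn_predict([1, 2, 0], train_vectors, train_activities, k=2)
```

Because the weights are continuous, ties between neighbors do not need to be broken by majority vote, which is consistent with the remark that both odd and even K can be used.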
Fit and prediction
The majority of QSARs are built for prediction. Hence, parameters such as the coefficient of determination for the training set (R^{2}_{training}) that measure the fitting ability of a model play only a minor role, typically unrelated to predictive power. Since we are more interested in the behavior of models intended for prediction, our attention was primarily focused on R^{2}_{test} and R^{2}_{scrambling}. More specifically, the behavior of R^{2}_{test} was closely followed, whereas R^{2}_{scrambling} was monitored only as an indicator of potential chance correlations.
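To make the roles of R^{2}_{test} and R^{2}_{scrambling} concrete, the following Python sketch fits a toy model once on the true training activities and once on shuffled ones, then scores both on the same holdout set. It rests on several assumptions that the text does not confirm: R^{2} is computed as 1 − SS_res/SS_tot (one common definition), a single-descriptor least-squares line stands in for PLS or KNN, and all names and data are invented.

```python
import random


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot if ss_tot else 0.0


def fit_line(x, y):
    """Least-squares line for one descriptor (toy stand-in for PLS or KNN)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx


def holdout_r2(x_train, y_train, x_test, y_test, scramble=False, rng=None):
    """R2 on the holdout set; with scramble=True the training y is shuffled first."""
    y_fit = list(y_train)
    if scramble:
        (rng or random).shuffle(y_fit)
    slope, intercept = fit_line(x_train, y_fit)
    return r_squared(y_test, [slope * x + intercept for x in x_test])


x_train = [1, 2, 3, 4, 5, 6, 7, 8]
y_train = [2 * x + 1 for x in x_train]
x_test, y_test = [9, 10, 11], [19, 21, 23]

r2_test = holdout_r2(x_train, y_train, x_test, y_test)
r2_scrambling = holdout_r2(x_train, y_train, x_test, y_test,
                           scramble=True, rng=random.Random(1))
```

A model that still scores well after scrambling would point to chance correlation, which is the reason R^{2}_{scrambling} is monitored alongside R^{2}_{test}.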
Results and discussion
Similarity as a discrimination function
Optimal bin size
Table 3 Average statistical parameters of the best PLS and KNN models at a given number of LVs and neighbors as a function of the granularity of the 3D-SDAR space
Bin size  Optimal number of LVs  Avg. R^{2}_{test} (PLS)  Std. R^{2}_{test} (PLS)  Avg. R^{2}_{scrambling} (PLS)  Std. R^{2}_{scrambling} (PLS)  Optimal number of neighbors  Avg. R^{2}_{test} (KNN)  Std. R^{2}_{test} (KNN)

2 ppm x 2 ppm x 0.5 Å  3  0.591  0.143  0.085  0.103  6  0.618*  0.170 
4 ppm x 4 ppm x 0.5 Å  3  0.604  0.142  0.088  0.109  5  0.606  0.146 
6 ppm x 6 ppm x 0.5 Å  5  0.532  0.167  0.074  0.097  7  0.453  0.178 
8 ppm x 8 ppm x 0.5 Å  5  0.593  0.142  0.097  0.113  6  0.520  0.162 
10 ppm x 10 ppm x 0.5 Å  7  0.633*  0.147  0.085  0.113  4  0.612  0.162 
12 ppm x 12 ppm x 0.5 Å  3  0.474  0.178  0.105  0.115  9  0.432  0.181 
14 ppm x 14 ppm x 0.5 Å  2  0.321  0.193  0.096  0.121  10  0.312  0.179 
16 ppm x 16 ppm x 0.5 Å  3  0.383  0.154  0.073  0.090  10  0.353  0.166 
18 ppm x 18 ppm x 0.5 Å  2  0.307  0.189  0.077  0.100  10  0.307  0.186 
20 ppm x 20 ppm x 0.5 Å  2  0.410  0.178  0.122  0.137  9  0.356  0.180 
2 ppm x 2 ppm x 1.0 Å  3  0.567  0.149  0.082  0.095  6  0.599  0.181 
4 ppm x 4 ppm x 1.0 Å  3  0.562  0.149  0.081  0.099  3  0.558  0.179 
6 ppm x 6 ppm x 1.0 Å  5  0.526  0.164  0.076  0.099  7  0.466  0.178 
8 ppm x 8 ppm x 1.0 Å  4  0.542  0.161  0.095  0.116  6  0.504  0.164 
10 ppm x 10 ppm x 1.0 Å  6  0.597  0.153  0.086  0.100  4  0.593  0.162 
12 ppm x 12 ppm x 1.0 Å  2  0.440  0.176  0.101  0.128  10  0.429  0.182 
14 ppm x 14 ppm x 1.0 Å  2  0.315  0.195  0.100  0.125  10  0.327  0.179 
16 ppm x 16 ppm x 1.0 Å  5  0.251  0.147  0.069  0.090  10  0.357  0.168 
18 ppm x 18 ppm x 1.0 Å  2  0.296  0.189  0.077  0.106  10  0.292  0.185 
20 ppm x 20 ppm x 1.0 Å  2  0.405  0.176  0.128  0.137  10  0.358  0.180 
2 ppm x 2 ppm x 1.5 Å  3  0.537  0.163  0.074  0.087  5  0.603  0.178 
4 ppm x 4 ppm x 1.5 Å  3  0.542  0.151  0.077  0.101  6  0.574  0.160 
6 ppm x 6 ppm x 1.5 Å  5  0.536  0.164  0.073  0.112  5  0.481  0.169 
8 ppm x 8 ppm x 1.5 Å  8  0.500  0.196  0.090  0.106  9  0.498  0.164 
10 ppm x 10 ppm x 1.5 Å  8  0.531  0.180  0.092  0.106  5  0.585  0.166 
12 ppm x 12 ppm x 1.5 Å  2  0.440  0.174  0.104  0.132  10  0.421  0.180 
14 ppm x 14 ppm x 1.5 Å  8  0.267  0.155  0.073  0.082  10  0.316  0.181 
16 ppm x 16 ppm x 1.5 Å  6  0.286  0.147  0.063  0.081  10  0.359  0.169 
18 ppm x 18 ppm x 1.5 Å  2  0.302  0.188  0.079  0.111  7  0.291  0.180 
20 ppm x 20 ppm x 1.5 Å  2  0.406  0.176  0.121  0.138  10  0.365  0.182 
2 ppm x 2 ppm x 2.0 Å  2  0.495  0.177  0.071  0.086  6  0.576  0.180 
4 ppm x 4 ppm x 2.0 Å  3  0.504  0.158  0.080  0.102  7  0.535  0.172 
6 ppm x 6 ppm x 2.0 Å  5  0.500  0.170  0.071  0.095  6  0.467  0.173 
8 ppm x 8 ppm x 2.0 Å  4  0.508  0.159  0.095  0.121  10  0.481  0.169 
10 ppm x 10 ppm x 2.0 Å  4  0.498  0.174  0.088  0.105  10  0.557  0.174 
12 ppm x 12 ppm x 2.0 Å  3  0.450  0.171  0.102  0.116  10  0.430  0.181 
14 ppm x 14 ppm x 2.0 Å  9  0.297  0.156  0.078  0.093  10  0.329  0.186 
16 ppm x 16 ppm x 2.0 Å  7  0.207  0.142  0.057  0.075  10  0.359  0.166 
18 ppm x 18 ppm x 2.0 Å  2  0.273  0.179  0.070  0.112  10  0.308  0.188 
20 ppm x 20 ppm x 2.0 Å  2  0.410  0.174  0.131  0.137  10  0.383  0.179 
2 ppm x 2 ppm x 2.5 Å  2  0.481  0.180  0.076  0.087  8  0.555  0.185
4 ppm x 4 ppm x 2.5 Å  3  0.485  0.163  0.079  0.101  7  0.522  0.182 
6 ppm x 6 ppm x 2.5 Å  5  0.492  0.165  0.071  0.101  7  0.465  0.175 
8 ppm x 8 ppm x 2.5 Å  3  0.422  0.173  0.097  0.122  6  0.485  0.175 
10 ppm x 10 ppm x 2.5 Å  10  0.471  0.222  0.072  0.082  3  0.568  0.172 
12 ppm x 12 ppm x 2.5 Å  2  0.404  0.174  0.097  0.135  10  0.429  0.180 
14 ppm x 14 ppm x 2.5 Å  8  0.286  0.158  0.073  0.094  10  0.315  0.186 
16 ppm x 16 ppm x 2.5 Å  7  0.244  0.133  0.057  0.076  10  0.339  0.167 
18 ppm x 18 ppm x 2.5 Å  3  0.282  0.173  0.081  0.092  10  0.293  0.184 
20 ppm x 20 ppm x 2.5 Å  1  0.397  0.176  0.137  0.152  10  0.358  0.176 
i) As described in the methodology section, the PLS algorithm utilizes data standardization, which adjusts for the size disparity of variables. Unlike PLS, KNN uses the original bin occupancies. Thus, in the case of PLS, the optimal bin size would mainly reflect the inherent estimation error in the chemical shifts of carbon atoms and their associated inter-atomic distances. As demonstrated in [8], bins with a high resolution on the Z-axis (C_{i}-C_{j} distance) and a granularity of at least twice the estimation error of ^{13}C chemical shifts in the XY plane would result in PLS models of optimal performance. For the current dataset, the ^{13}C chemical shifts estimation error was 3.98 ppm, which would require bins with a granularity of at least 8 ppm in the XY plane. Hence, it is not surprising that the best performing PLS model utilizes bins of 10 ppm x 10 ppm in the XY plane and 0.5 Å on the Z-axis. Besides the ^{13}C chemical shifts estimation error, the optimal grid granularity also depends on the bin occupancy. Bins that are too narrow will result in a large but sparsely populated 3D-SDAR matrix and PLS models unable to generalize (poor predictive performance). On the other hand, models using bins that are too wide (e.g., > 14 ppm in the chemical shifts plane XY) may assign fingerprint elements encoding divergent structural features to the same bin, thus producing models lacking in their ability to decode the underlying relationship between structure and activity.
ii) The use of T as a factor for activity determination in KNN results in smaller optimal bin sizes, in part due to the cancellation of the error in the chemical shifts plane (XY) for similar compounds. Note that the highest contribution to the determination of activity in KNN comes only from the first K nearest neighbors, which by definition are most similar to the compound whose activity is being predicted. Because for similar structures the error of prediction propagates in parallel, it is not surprising that similarity-based KNN algorithms achieve maximum performance at smaller bin sizes.
iii) Unlike PLS, which assigns a different contribution of each bin to the final model, KNN treats all bins as independent coordinates of a vector compared against other such vectors (i.e., assigns equal contribution). Thus, depending on the model-building technique being employed, grids of different granularity may be identified as performing better.
Composite and consensus models
Table 4 Improvement of R^{2}_{test} of consensus models over the average R^{2}_{test} of the individual models (in %)
ID  Model 1  Model 2  R^{2}_{test} for the consensus model  Average R^{2}_{test} of the individual models  % improvement

1  PLS 10 ppm x 10 ppm x 0.5 Å  KNN 2 ppm x 2 ppm x 0.5 Å  0.685  0.620  10.5 
2  PLS 10 ppm x 10 ppm x 0.5 Å  PLS 2 ppm x 2 ppm x 0.5 Å  0.673  0.609  10.5 
3  PLS 2 ppm x 2 ppm x 0.5 Å  KNN 10 ppm x 10 ppm x 0.5 Å  0.658  0.603  9.1 
4  KNN 2 ppm x 2 ppm x 0.5 Å  KNN 10 ppm x 10 ppm x 0.5 Å  0.654  0.614  6.5 
5  PLS 2 ppm x 2 ppm x 0.5 Å  KNN 2 ppm x 2 ppm x 0.5 Å  0.640  0.612  4.6 
6  PLS 10 ppm x 10 ppm x 0.5 Å  KNN 10 ppm x 10 ppm x 0.5 Å  0.633  0.611  3.6 
As can be seen from Table 4, both differences in the data processing algorithms and the granularity of the 3D-SDAR space contribute to the improvement in consensus modeling. A comparison of the performance improvement of consensus models indicates that generally the models of type iii perform best. A possible explanation for this observation is that these models benefit from: i) the complementary information extracted from 3D-SDAR matrices of different granularity and ii) the utilization of different data processing algorithms. Among these 6 consensus models, the one averaging the predictions from the best performing PLS (10 ppm x 10 ppm x 0.5 Å bins, 7 LVs) and KNN (2 ppm x 2 ppm x 0.5 Å bins, 6 neighbors) individual models was characterized by the highest coefficient of determination (shown in Figure 5c and the last column of Table 2).
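The arithmetic behind Tables 2 and 4 can be checked with a short Python sketch. It assumes, consistently with the tabulated values, that a consensus prediction is the unweighted mean of two composite predictions and that the percentage improvement is taken relative to the average R^{2}_{test} of the individual models; the function names are hypothetical.

```python
def consensus(pred_a, pred_b):
    """Unweighted average of two sets of composite predictions."""
    return [(a + b) / 2 for a, b in zip(pred_a, pred_b)]


def percent_improvement(r2_consensus, r2_individual_avg):
    """Relative gain of the consensus R2 over the average individual R2, in %."""
    return 100.0 * (r2_consensus - r2_individual_avg) / r2_individual_avg


# First two rows of Table 2: II-KNN (2 ppm) and III-PLS (10 ppm) predictions.
knn_2ppm = [5.50, 5.35]
pls_10ppm = [6.49, 5.15]
averaged = consensus(knn_2ppm, pls_10ppm)  # approximately [5.995, 5.25]

# (consensus R2_test, average individual R2_test) pairs taken from Table 4.
table4 = [(0.685, 0.620), (0.673, 0.609), (0.658, 0.603),
          (0.654, 0.614), (0.640, 0.612), (0.633, 0.611)]
gains = [round(percent_improvement(c, avg), 1) for c, avg in table4]
```

Rounded to one decimal, the computed gains reproduce the last column of Table 4, and the averaged predictions agree with the consensus column of Table 2 to within rounding.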
To further understand the factors playing a role in consensus modeling and to explain the observed improvement over the composite PLS and KNN models, analysis based on training/test set pairs of individual models was carried out.
According to our initial hypothesis, an improvement in consensus modeling would be observed only if the individual composite models account for complementary information (i.e., explain complementary portions of the variance in the biological data). To test this, the behavior of the 100 individual submodels resulting in the best composite PLS and KNN models was investigated. If for each of the 100 training/test set pairs both algorithms capture almost identical structural information encoded in the 3D-SDAR descriptor pool, the corresponding R^{2}_{test} values generated on each cycle should be highly correlated and therefore no improvement in consensus modeling would be observed. In other words, the two algorithms would be largely redundant and the consensus R^{2}_{test} would be an average of R^{2}_{test} for the 100 individual submodels. It has to be emphasized that such an experiment is valid only in the case of matching training/test subset pairs. This condition is satisfied by the use of the same random seed for both PLS and KNN and a random number generator that was re-initialized after 100 runs.
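The matching-pairs condition can be illustrated with a small Python sketch: two independent runs seeded identically produce the same sequence of training/test splits, so a PLS submodel and a KNN submodel at the same cycle see the same compounds. The function name, the seed value, and the rounding of the 20% holdout to a whole number of compounds are assumptions made for the example; the original work used Matlab's random number generator.

```python
import random


def split_pairs(n_compounds, n_cycles=100, test_fraction=0.2, seed=0):
    """Generate reproducible (training, test) index pairs from a seeded generator."""
    rng = random.Random(seed)  # re-initialized on every call
    # Assumption: the 20% holdout is rounded to the nearest whole compound.
    n_test = round(n_compounds * test_fraction)
    pairs = []
    for _ in range(n_cycles):
        indices = list(range(n_compounds))
        rng.shuffle(indices)
        pairs.append((sorted(indices[n_test:]), sorted(indices[:n_test])))
    return pairs


# The same seed yields matching pairs for the two algorithms.
pairs_for_pls = split_pairs(94, seed=42)
pairs_for_knn = split_pairs(94, seed=42)
```

With matching pairs, any difference in R^{2}_{test} at a given cycle can be attributed to the algorithm and the grid rather than to the composition of the subsets.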
Figure 6b shows a plot of R^{2}_{test} of matching training/test subset pairs processed alternatively by PLS or KNN. Although there were PLS and KNN submodels performing equally well (forming a cluster in the upper right corner of the plot), a significant portion of submodels predicted well by PLS were paired with inferior KNN models and vice versa. This observation and the relatively low R^{2} of 0.367 suggest that the two individual models reflect different structural patterns in the data and are partially “orthogonal”. The distribution of ΔR^{2}_{test} (PLS-KNN) shown in Figure 6c indicated that a total of 28 models deviate by at least 1σ from the mean. PLS outperformed KNN for 13 models while KNN performed better for the remaining 15 models. These 28 models, for which one of the algorithms succeeded in establishing a structure-activity relationship undetected by the other, were identified as a major contributing factor affecting the performance of consensus models. Thus, a consensus PLS-KNN model would benefit from the partial orthogonality of the PLS and KNN approaches on differently sized bins and would outperform the individual composite models.
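The 1σ analysis of the R^{2}_{test} differences can be sketched as follows. This is a hedged illustration only: the toy R^{2}_{test} values are invented, the function name is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def split_by_sigma(r2_pls, r2_knn):
    """Flag matching submodels whose R2_test difference is at least 1 sigma from the mean."""
    deltas = [p - k for p, k in zip(r2_pls, r2_knn)]
    mu, sigma = mean(deltas), pstdev(deltas)  # population std is an assumption
    if sigma == 0:
        return mu, sigma, [], []
    pls_better = [i for i, d in enumerate(deltas) if d - mu >= sigma]
    knn_better = [i for i, d in enumerate(deltas) if mu - d >= sigma]
    return mu, sigma, pls_better, knn_better


# Invented R2_test values for five matching training/test subset pairs.
r2_pls = [0.6, 0.7, 0.5, 0.9, 0.3]
r2_knn = [0.6, 0.7, 0.5, 0.3, 0.9]
mu, sigma, pls_better, knn_better = split_by_sigma(r2_pls, r2_knn)
```

In the toy data, cycle 3 is flagged as favoring PLS and cycle 4 as favoring KNN; these are the kinds of submodels that the text identifies as driving the consensus gain.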
Interpretation
A detailed examination of the 3D-SDAR maps shown in Figure 7 reveals that none of the bins with positive weights overlaps with any of the bins with negative weights: i.e., the structural features affecting binding (increasing or decreasing log(1/EC_{50})) are well separated. Therefore, compounds with 3D-SDAR fingerprints predominantly occupying bins with positive PLS weights will be stronger binders (highly toxic). Conversely, chemicals with fingerprint elements falling into regions of the 3D-SDAR space occupied by bins with negative weights will be weaker binders (less toxic). This hypothesis was tested using an in-house program projecting some of the most frequently occurring positively and negatively weighted bins on the standard coordinate space. This projection allowed identification of subsets of structures in which these bins can often be found together.
In contrast, most of the negatively weighted bins were found to be present in the structures of PCBs. As can be seen from Figure 9, positions 2 and 2′ and (due to symmetry) positions 6 and 6′ are particularly affected, and chlorine substitution at these positions will lower the toxicity of PCBs compared to that of other chlorine-substituted homologues.
As an intermediate chemical class with an average activity higher than that of PCBs and lower than that of PHDDs, the activity of dibenzofurans is affected by the presence/absence of structural patterns similar to those observed in the structures of both PCBs and PHDDs. For example, the presence of an oxygen atom resulting in a chemical shift range of the neighboring carbon atoms between 150 and 160 ppm will lower the EC_{50} of PCDFs (higher toxicity). Analogously to the 2 and 2′ positions in biphenyls, chlorine substitution at positions 1 and 9 will result in PCDFs with toxicity lower than that of PCDF homologues substituted elsewhere.
Comparison to earlier models
Due to variability in the datasets and the multitude of available data processing algorithms and validation techniques, a direct quantitative comparison with the QSARs summarized in Table 1 is impossible. However, if one takes into account the much more stringent validation criteria imposed in our work (vs. the cross-validation procedures employed in [13–21]), it is clear that the 3D-QSDAR methodology performs at least on par with these earlier models. Similarly to CoMFA [17], on a qualitative level 3D-QSDAR was able to correctly recognize the positions that affect the strength of binding to AhR. Since our work is based on a dataset originally compiled by Mekenyan et al., a more direct comparison with the QSARs reported in [22] was possible. In that work, multiple separate QSARs for the three classes of PCBs, PHDDs and PCDFs with R^{2} ranging from 0.640 (n = 30) to 0.899 (n = 14) were derived. The statistical parameters of a model combining the most planar PCBs, PHDDs and PCDFs (n = 80) were as follows: R^{2} = 0.73; s^{2} = 0.59; R^{2}_{cv} = 0.73 and F = 69.2. In comparison, for the complete set of 94 compounds our best consensus model produced an R^{2}_{test} of 0.685 and a q^{2}_{LOO} of 0.79, which are both close to the R^{2}_{cv} of 0.73 reported by Mekenyan et al.
Conclusions
We have introduced several validation techniques intended to improve the quality and reliability of individual and consensus QSAR models. Their use was illustrated on a dataset of 94 AhR binders modeled by 3D-QSDAR. The functional dependence between R^{2}_{test} and the number of training/test subset randomization cycles was used to determine the minimum number of cycles necessary to achieve convergence of R^{2}_{test} to its asymptotic “true” value. In this specific case, which uses 20% of the compounds as a hold-out test set, 100 randomization cycles proved sufficient for achieving convergence for both PLS and KNN models. The use of a distance measure (Tanimoto similarity) as a discriminant function in KNN was shown to produce models with performance similar to that of PLS when applied to the same dataset. A plot of R^{2}_{test} for matching test set pairs was used to demonstrate the partial orthogonality of the PLS and similarity-based KNN approaches at different bin granularities. However, further investigations may shed additional light on the character of the multiple factors playing a role in the improvements observed in consensus modeling.
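Two of the ingredients summarized above, similarity-based KNN prediction and the convergence of R^{2}_{test} with the number of randomization cycles, can be sketched as follows. This is a minimal illustration and not the implementation used in this work. It assumes the continuous form of the Tanimoto coefficient, an unweighted mean over the k nearest neighbors, and a running-mean convergence criterion; the function names and the `tol` and `window` parameters are hypothetical.

```python
import numpy as np


def tanimoto(a, b):
    """Tanimoto similarity for non-negative real-valued vectors
    (continuous form, assumed here): a.b / (a.a + b.b - a.b)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    ab = float(np.dot(a, b))
    denom = float(np.dot(a, a) + np.dot(b, b) - ab)
    return ab / denom if denom > 0 else 0.0


def knn_predict(x_train, y_train, x_query, k=6):
    """Predict activity as the unweighted mean over the k training
    compounds most similar to the query (an assumed averaging scheme)."""
    sims = np.array([tanimoto(x, x_query) for x in x_train])
    nearest = np.argsort(sims)[::-1][:k]  # indices of the k highest similarities
    return float(np.mean(np.asarray(y_train, dtype=float)[nearest]))


def cycles_to_converge(r2_per_cycle, tol=0.005, window=10):
    """Return the smallest number of cycles after which the running mean
    of R2_test varies by less than `tol` over the next `window` cycles,
    or None if that never happens."""
    r2 = np.asarray(r2_per_cycle, dtype=float)
    running = np.cumsum(r2) / np.arange(1, len(r2) + 1)
    for i in range(len(running) - window):
        segment = running[i:i + window + 1]
        if segment.max() - segment.min() < tol:
            return i + 1
    return None


# Tiny hypothetical example: three training fingerprints and one query.
x_train = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1]])
y_train = np.array([1.0, 2.0, 9.0])
prediction = knn_predict(x_train, y_train, np.array([1, 0, 0]), k=2)
print(prediction)

# Synthetic R2_test values for 200 randomization cycles (illustration only).
rng = np.random.default_rng(1)
synthetic_r2 = rng.normal(0.62, 0.1, size=200)
print(cycles_to_converge(synthetic_r2))
```

The stopping rule in `cycles_to_converge` is only one possible way to judge when R^{2}_{test} has reached its asymptotic value; other tolerances or criteria may be equally reasonable.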
In the last stage of the modeling process, the most frequently occurring positively and negatively weighted bins were projected back onto the standard coordinate space to identify structural features related to toxicity. It was found that most of the highly ranked bins with positive PLS weights were specific to a class of polybrominated dioxins. The oxygen atoms of PHDDs and PCDFs participating in the formation of donor-acceptor bonds with the receptor were associated with the high toxic effect of these two chemical classes. In the absence of other substituents, PCBs with chlorine atoms at positions 2 and 2′ (and, due to symmetry, positions 6 and 6′) were accurately predicted to be relatively weaker binders (less toxic).
Abbreviations
PLS: Partial Least Squares
KNN: k Nearest Neighbors
3D-SDAR: Three-Dimensional Spectral Data-Activity Relationship
AhR: Aryl Hydrocarbon Receptor
PCBs: Polychlorinated Biphenyls
PHDDs: Polyhalogenated Dibenzo-p-Dioxins
PCDFs: Polychlorinated Dibenzofurans
QSAR: Quantitative Structure-Activity Relationship
LOO: Leave-One-Out
CV: Cross-Validation
MLR: Multiple Linear Regression
Declarations
Acknowledgements
The authors thank the FDA for financial support.
Authors’ Affiliations
References
 Ganguly M, Brown N, Schuffenhauer A, Ertl P, Gillet VJ, Greenidge PA: Introducing the Consensus Modeling Concept in Genetic Algorithms: Application to Interpretable Discriminant Analysis. J Chem Inf Model. 2006, 46: 2110-2124. 10.1021/ci050529l.
 Gramatica P, Giani E, Papa E: Statistical External Validation and Consensus Modeling: A QSPR Case Study for Koc Prediction. J Mol Graphics Modell. 2007, 25: 755-766. 10.1016/j.jmgm.2006.06.005.
 Kuzmin VE, Muratov EN, Artemenko AG, Varlamova E, Gorb L, Wang J, Leszczynski J: Consensus QSAR Modeling of Phosphor-Containing Chiral AChE Inhibitors. QSAR Comb Sci. 2009, 28: 664-677. 10.1002/qsar.200860117.
 Gramatica P, Pilutti P, Papa E: Validated QSAR Prediction of OH Tropospheric Degradation of VOCs: Splitting into Training-Test Sets and Consensus Modelling. J Chem Inf Comput Sci. 2004, 44: 1794-1802. 10.1021/ci049923u.
 Mario L, Vinothini S: In Silico Prediction of Aqueous Solubility, Human Plasma Protein Binding and Volume of Distribution of Compounds from Calculated pKa and AlogP98 Values. Mol Divers. 2003, 7: 69-87.
 Sussman NB, Arena VC, Yu S, Mazumdar S, Thampatty BP: Decision Tree SAR Models for Developmental Toxicity Based on an FDA/TERIS Database. SAR QSAR Environ Res. 2003, 14: 83-96. 10.1080/1062936031000073126.
 Hewitt M, Cronin MT, Madden JC, Rowe PH, Johnson C, Obi A, Enoch SJ: Consensus QSAR models: do the benefits outweigh the complexity?. J Chem Inf Model. 2007, 47: 1460-1468. 10.1021/ci700016d.
 Slavov S, Geesaman E, Pearce B, Schnackenberg L, Buzatu D, Wilkes J, Beger R: ^{13}C NMR-Distance Matrix Descriptors: Optimal Abstract 3D Space Granularity for Predicting Estrogen Binding. J Chem Inf Model. 2012, 52: 1854-1864. 10.1021/ci3001698.
 Report from the Expert Group on (Quantitative) Structure-Activity Relationship ([Q]SARs) on the Principles for the Validation of (Q)SARs. 2004, Paris, France: Organisation for Economic Co-operation and Development
 Doweyko AM, Bell AR, Minatelli JA, Relyea DI: Quantitative Structure-Activity Relationships for 2-[(Phenylmethyl)Sulfonyl]Pyridine 1-Oxide Herbicides. J Med Chem. 1983, 26: 475-478. 10.1021/jm00358a004.
 Klopman G, Kalos AN: Causality in Structure-Activity Studies. J Comput Chem. 1985, 6: 492-506. 10.1002/jcc.540060520.
 Wold S, Eriksson L: Statistical Validation of QSAR Results. Chemometric Methods in Molecular Design. Edited by: van de Waterbeemd H. 1995, Weinheim, Germany: Wiley-VCH Verlag GmbH, 309-318.
 Beger RD, Wilkes JG: Models of Polychlorinated Dibenzodioxins, Dibenzofurans, and Biphenyls Binding Affinity to the Aryl Hydrocarbon Receptor Developed Using ^{13}C NMR Data. J Chem Inf Comput Sci. 2001, 41: 1322-1329. 10.1021/ci000312l.
 Beger RD, Buzatu DA, Wilkes JG: Combining NMR spectral and structural data to form models of polychlorinated dibenzodioxins, dibenzofurans, and biphenyls binding to the AhR. J Comput Aided Mol Des. 2002, 16: 727-740. 10.1023/A:1022479510524.
 Arulmozhiraja S, Morita M: Structure-activity relationships for the toxicity of polychlorinated dibenzofurans: approach through density functional theory-based descriptors. Chem Res Toxicol. 2004, 17: 348-356. 10.1021/tx0300380.
 Hirokawa S, Imasaka T, Imasaka T: Chlorine substitution pattern, molecular electronic properties, and the nature of the ligand-receptor interaction: quantitative property-activity relationships of polychlorinated dibenzofurans. Chem Res Toxicol. 2005, 18: 232-238. 10.1021/tx049874f.
 Ashek A, Lee C, Park H, Cho SJ: 3D QSAR studies of dioxins and dioxin-like compounds using CoMFA and CoMSIA. Chemosphere. 2006, 65: 521-529. 10.1016/j.chemosphere.2006.01.010.
 Gu C, Jiang X, Ju X, Yu G, Bian Y: QSARs for the toxicity of polychlorinated dibenzofurans through DFT-calculated descriptors of polarizabilities, hyperpolarizabilities and hyper-order electric moments. Chemosphere. 2007, 67: 1325-1334. 10.1016/j.chemosphere.2006.10.057.
 Zhao YY, Tao FM, Zeng EY: Theoretical study of the quantitative structure-activity relationships for the toxicity of dibenzo-p-dioxins. Chemosphere. 2008, 73: 86-91. 10.1016/j.chemosphere.2008.05.018.
 Gu C, Jiang X, Ju X, Gong X, Wang F, Bian Y, Sun C: QSARs for congener-specific toxicity of polyhalogenated dibenzo-p-dioxins with DFT and WHIM theory. Ecotoxicol Environ Saf. 2009, 72: 60-70. 10.1016/j.ecoenv.2008.04.003.
 Diao J, Li Y, Shi S, Sun Y, Sun Y: QSAR Models for Predicting Toxicity of Polychlorinated Dibenzo-p-dioxins and Dibenzofurans Using Quantum Chemical Descriptors. Bull Environ Contam Toxicol. 2010, 85: 109-115. 10.1007/s00128-010-0065-2.
 Mekenyan OG, Veith GD, Call DJ, Ankley GT: A QSAR evaluation of Ah receptor binding of halogenated aromatic xenobiotics. Environ Health Perspect. 1996, 104: 1302-1310.
 Long G, McKinney J, Pedersen L: Polychlorinated dibenzofuran (PCDF) binding to the Ah receptor(s) and associated enzyme induction. Theoretical model based on molecular parameters. Quant Struct-Act Relat. 1987, 6: 1-7. 10.1002/qsar.19870060102.
 Eliel EL: Chemistry in Three Dimensions. Chemical Structures. Edited by: Warr WA. 1993, Berlin, Germany: Springer, 1
 HyperChem 8 Professional, version 8.0. 2007, Gainesville, FL: HyperCube Inc
 Cornell WD, Cieplak P, Bayly CI, Gould IR, Merz KM, Ferguson DM, Spellmeyer DC, Fox T, Caldwell JW, Kollman PA: A Second Generation Force Field for the Simulation of Proteins, Nucleic Acids, and Organic Molecules. J Am Chem Soc. 1995, 117: 5179-5197. 10.1021/ja00124a002.
 ACD/NMR Predictor Release 12.00, version 12.5; Advanced Chemistry Development, Inc. 2011, Toronto, ON, Canada, http://www.acdlabs.com
 De Jong S: SIMPLS: an alternative approach to partial least squares regression. Chemom Intell Lab Systems. 1993, 18: 251-263. 10.1016/0169-7439(93)85002-X.
 MATLAB, version 8.0 (R2012b), The MathWorks Inc. 2012, Cambridge, MA, USA, http://www.mathworks.com
 Tanimoto TT: IBM Internal Report: 17th Nov. Technical report. 1957, Armonk, NY, USA: IBM
 Kobayashi S, Saito A, Ishii Y, Tanaka A, Tobinaga S: Relationship between the biological potency of polychlorinated dibenzo-p-dioxins and their electronic states. Chem Pharm Bull. 1991, 39: 2100-2105. 10.1248/cpb.39.2100.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.