Examining the predictive accuracy of the novel 3D N-linear algebraic molecular codifications on benchmark datasets
César R. García-Jacas^{1, 2, 3, 4},
Ernesto Contreras-Torres^{8},
Yovani Marrero-Ponce^{3, 4, 5},
Mario Pupo-Meriño^{2},
Stephen J. Barigye^{3, 7} and
Lisset Cabrera-Leyva^{1, 6}
https://doi.org/10.1186/s13321-016-0122-x
© García-Jacas et al. 2016
Received: 14 October 2015
Accepted: 9 February 2016
Published: 25 February 2016
Abstract
Background
Recently, novel 3D alignment-free molecular descriptors (also known as QuBiLS-MIDAS) based on two-linear, three-linear and four-linear algebraic forms have been introduced. These descriptors codify chemical information for relations between two, three and four atoms by using several (dis)similarity metrics and multimetrics. Several studies aimed at assessing the quality of these novel descriptors have been performed. However, a deeper analysis of their performance is necessary. Therefore, in the present manuscript an assessment and statistical validation of the performance of these novel descriptors in QSAR studies is performed.
Results
To this end, eight molecular datasets (angiotensin converting enzyme, acetylcholinesterase inhibitors, benzodiazepine receptor, cyclooxygenase-2 inhibitors, dihydrofolate reductase inhibitors, glycogen phosphorylase b, thermolysin inhibitors, thrombin inhibitors) widely used as benchmarks in the evaluation of several procedures are utilized. Three- to nine-variable QSAR models based on Multiple Linear Regression are built for each chemical dataset according to the original division into training/test sets. Comparisons with respect to leave-one-out cross-validation correlation coefficients \(\left( {Q_{loo}^{2} } \right)\) reveal that the models based on QuBiLS-MIDAS indices possess superior predictive ability in 7 of the 8 datasets analyzed, outperforming methodologies based on similar or more complex techniques such as Partial Least Squares, Neural Networks, Support Vector Machines and others. In addition, superior external correlation coefficients \(\left( {Q_{ext}^{2} } \right)\) are attained in 6 of the 8 test sets considered, confirming the good predictive power of the obtained models. Non-parametric statistical tests were performed on the \(Q_{ext}^{2}\) values; these demonstrated that the models based on QuBiLS-MIDAS indices have the best global performance and yield significantly better predictions than 11 of the 12 QSAR procedures used in the comparison. Lastly, the performance of the indices under several conformer generation methods was studied. This showed that the quality of the predictions of the QSAR models based on QuBiLS-MIDAS indices depends on the 3D structure generation method considered, although in this preliminary study the differences among the results are not statistically significant.
Background
Computational methods that employ statistical and/or artificial intelligence procedures are widely used in the drug discovery process, where the Quantitative Structure–Activity Relationship (QSAR) studies have an important role [1–4]. These studies are based on the principle that the biological activity (or property) of compounds depends on their structural and physicochemical features and thus, are primarily aimed at finding good correlations among molecular features and specific biological activities [5]. In this way, models with high external predictive ability in novel compounds could be built.
Since the works of Hansch and Fujita in the 1960s [6, 7], considered the origin of modern QSAR studies [8], several approaches have been reported in the literature. Most of these are 2D-QSAR methods, that is, they only consider the topological structural features of molecules, often using matrix representations such as the connectivity and distance matrices [8]. However, with the introduction of the CoMFA [9] methodology in 1988, the 3D-QSAR approaches became popular. These take into account the geometric (3D) features of molecules, which can be computed either from the information represented in a grid through an alignment process with respect to a reference compound or a pharmacophore [2, 10, 11], or using procedures based on Cartesian coordinates [8, 12, 13], molecular spectra [14, 15] and molecular transforms [16], or by the adaptation of 2D methods to take into account three-dimensional (3D) aspects [17–21].
However, despite the number and variety of procedures defined to date, there is continued interest in creating new approaches or extending the current ones to more generalized forms, in order to codify more relevant chemical information and thereby yield QSAR models with better predictive ability. This assertion is in accordance with the No Free Lunch Theorem [22], which could be interpreted as stating that no single QSAR procedure yields better predictions than all the others when its performance is averaged over all possible compound datasets. This can be confirmed in a report by Sutherland et al. [23], where well-established procedures, assessed on eight diverse chemical datasets, yield moderate predictions without significant differences among them (see Additional file 1: Table S1 for a statistical analysis). The justification for this observation is that one family of molecular descriptors (MDs) may not suffice to codify all chemical information and/or molecular properties for different chemical datasets. In other words, the relevance of MDs depends on the nature of the compounds under study. It is therefore necessary to search for alternative methods/approaches to codify novel and orthogonal chemical information.
Inspired by the previous idea, the 3D N-linear algebraic molecular descriptors have recently been introduced as a novel mathematical procedure for computing the structural features of chemical compounds [24–26]. These MDs employ the bilinear, quadratic and linear algebraic maps [27] to codify information between atom pairs by using several (dis)similarity metrics [25]. Also, the N-linear algebraic forms [28] were used as generalized expressions of the bilinear, quadratic and linear algebraic maps, when relations among three and four atoms are studied [26]. In this way, the geometric matrix [8] was extended to consider for the first time relations for more than two atoms.
Several studies aimed at assessing the quality of this novel descriptor family, also called QuBiLS-MIDAS [acronym of Quadratic, Bilinear and N-Linear Maps based on N-tuple Spatial Metric [(Dis)Similarity] Matrices and Atomic Weightings], were performed. These included an evaluation of the information content (variability) and linear independence using Shannon’s entropy based variability analysis [29] (using IMMAN software [30]) and the principal component analysis (PCA) technique [31], respectively. Also, comparisons with other MDs reported in the literature were performed [25, 26]. In general, the results demonstrated that the novel MDs have greater variability than the 3D DRAGON indices and other approaches implemented in several software packages [32–35]. Furthermore, the results revealed that the novel 3D N-linear indices not only codify all the information contained in the 3D DRAGON MDs, but also capture information orthogonal to the latter. Lastly, the QuBiLS-MIDAS MDs were used for modeling the binding affinity to the corticosteroid-binding globulin (CBG), achieving superior results with respect to other QSAR methodologies (see Tables 8–9 in Ref. [25] and Tables 9–10 in Ref. [26]).
However, although the initial results with QuBiLS-MIDAS MDs are promising, it cannot be stated that these are the most suitable for building QSAR models for all chemical datasets. It is thus necessary to evaluate the performance of the 3D N-linear algebraic MDs in QSAR modeling with different molecular sets. Therefore, this paper is dedicated to assessing the utility of the QuBiLS-MIDAS approach in the prediction of the biological activity in several compound datasets and to comparing the obtained results with those of other QSAR procedures reported in the literature.
Mathematical overview of the 3D N-linear algebraic molecular descriptors
The molecular vectors (or property vectors) \(\bar{x}\), \(\bar{y}\), \(\bar{z}\) and \(\bar{w}\) are calculated by using the Chemistry Development Kit (CDK) library [36] considering the following fragment and atom-based properties: atomic mass (m), the van der Waals volume (v), the atomic polarizability (p), atomic electronegativity in Pauling scale (e), atomic Ghose-Crippen LogP (a), Gasteiger-Marsili atomic charge (c), atomic polar surface area (psa), atomic refractivity (r), atomic hardness (h) and atomic softness (s).
Metrics used to compute the “distance” between two atoms of a molecule
Metric  Formula^{a}  Range^{b}  Average  Range of average

Minkowski (M1–M7), p = 0.25, 0.5, 1, 1.5, 2, 2.5, 3 and ∞ [where p = 1 gives the Manhattan, city-block or taxi distance (also known as the Hamming distance between binary vectors) and p = 2 gives the Euclidean distance]  \(d_{XY} = \left( \sum_{j = 1}^{h} \left| x_{j} - y_{j} \right|^{p} \right)^{\frac{1}{p}}\)  [0, ∞)  \(\bar{d} = \frac{d_{XY}}{n^{1/p}}\)  [0, ∞)
Chebyshev/Lagrange (M8) (Minkowski formula when p = ∞)  \(d_{XY} = \max \left\{ \left| x_{j} - y_{j} \right| \right\}\)
Canberra (M10)  \(d_{XY} = \sum_{j = 1}^{h} \frac{\left| x_{j} - y_{j} \right|}{\left| x_{j} \right| + \left| y_{j} \right|}\)  [0, n]  \(\bar{d} = \frac{d_{XY}}{n}\)  [0, 1]
Lance–Williams/Bray–Curtis (M11)  \(d_{XY} = \frac{\sum_{j = 1}^{h} \left| x_{j} - y_{j} \right|}{\sum_{j = 1}^{h} \left( \left| x_{j} \right| + \left| y_{j} \right| \right)}\)  [0, 1]  \(\bar{d} = \frac{d_{XY}}{n}\)  \(\left[ 0, \frac{1}{n} \right]\)
Clark/coefficient of divergence (M12)  \(d_{XY} = \sqrt{\sum_{j = 1}^{h} \left( \frac{x_{j} - y_{j}}{\left| x_{j} \right| + \left| y_{j} \right|} \right)^{2}}\)  [0, n]  \(\bar{d} = \frac{d_{XY}}{\sqrt{n}}\)  \(\left[ 0, \sqrt{n} \right]\)
Soergel (M13)  \(d_{XY} = \frac{1}{n} \sum_{j = 1}^{h} \frac{\left| x_{j} - y_{j} \right|}{\max \left\{ x_{j}, y_{j} \right\}}\)  [0, 1]  \(\bar{d} = \frac{d_{XY}}{n}\)  \(\left[ 0, \frac{1}{n} \right]\)
Bhattacharyya (M14)  \(d_{XY} = \sqrt{\sum_{j = 1}^{h} \left( \sqrt{x_{j}} - \sqrt{y_{j}} \right)^{2}}\)  [0, ∞)  \(\bar{d} = \frac{d_{XY}}{\sqrt{n}}\)  [0, ∞)
Wave–Edges (M15)  \(d_{XY} = \sum_{j = 1}^{h} \left( 1 - \frac{\min \left\{ x_{j}, y_{j} \right\}}{\max \left\{ x_{j}, y_{j} \right\}} \right)\)  [0, n]  \(\bar{d} = \frac{d_{XY}}{n}\)  [0, 1]
Angular separation/[1 − Cosine (Ochiai)] (M16)  \(d_{XY} = 1 - Cos_{XY}\), where \(Cos_{XY} = \frac{\varvec{X} \cdot \varvec{Y}}{\left| \varvec{X} \right| \left| \varvec{Y} \right|} = \frac{\sum_{j = 1}^{h} x_{j} y_{j}}{\sqrt{\sum_{j = 1}^{h} x_{j}^{2} \sum_{j = 1}^{h} y_{j}^{2}}}\)  [0, 2]
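As an illustration of how some of the pairwise metrics in the table above behave, the following minimal sketch evaluates a few of them on two coordinate vectors. It is not the QuBiLS-MIDAS implementation: the function names, the example coordinates and the convention of skipping 0/0 terms in the Canberra sum are assumptions made only for this example.

```python
import math


def minkowski(x, y, p):
    """Minkowski distance of order p (p = 1: Manhattan, p = 2: Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def chebyshev(x, y):
    """Limit of the Minkowski distance as p goes to infinity."""
    return max(abs(a - b) for a, b in zip(x, y))


def canberra(x, y):
    """Sum of |x_j - y_j| / (|x_j| + |y_j|); 0/0 terms are skipped (assumption)."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0.0:
            total += abs(a - b) / denom
    return total


def lance_williams(x, y):
    """Ratio of summed absolute differences to summed absolute values."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(abs(a) + abs(b) for a, b in zip(x, y))
    return num / den if den > 0.0 else 0.0


def angular_separation(x, y):
    """1 - cosine of the angle between x and y (undefined for zero vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return 1.0 - dot / norm


# Hypothetical coordinates, chosen only so the arithmetic is easy to check.
x = (1.0, 2.0, 2.0)
y = (4.0, 6.0, 2.0)
print(minkowski(x, y, 2), minkowski(x, y, 1), chebyshev(x, y))
print(canberra(x, y), lance_williams(x, y), angular_separation(x, y))
```

Unlike the Minkowski family, the Canberra, Lance–Williams and angular-separation values depend on where the coordinate origin is placed, so the same pair of atoms can give different values under a translated frame.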
Measures used to compute the ternary (A) and quaternary (B) relations (multimetrics) among atoms of a molecule
Measure  Formula

(A) Ternary measures (\(T_{XYZ}\))
Perimeter (M19–M20)  \(T_{XYZ} = d_{XY} + d_{YZ} + d_{ZX}\)
Triangle area (M21–M22)  \(\begin{aligned} T_{XYZ} & = \sqrt{s\left( s - d_{XY} \right)\left( s - d_{YZ} \right)\left( s - d_{ZX} \right)} \\ s & = \frac{d_{XY} + d_{YZ} + d_{ZX}}{2} \\ \end{aligned}\)
Sides summation (M25–M26)  \(T_{XYZ} = d_{XY} + d_{YZ}\)
Bond angle (angle between sides) (M27–M28)  \(\begin{aligned} & A_{X}, A_{Y}, A_{Z}\;coordinates\;of\;three\;atoms\;of\;a\;molecule \\ & U = A_{X} - A_{Y},\;\;V = A_{Z} - A_{Y} \\ & T_{XYZ} = \alpha = \arccos \left( \frac{U \cdot V}{\left| U \right| \left| V \right|} \right) \\ \end{aligned}\)
(B) Quaternary measures (\(Q_{XYZW}\))
Perimeter (M19–M20)  \(Q_{XYZW} = d_{XY} + d_{YZ} + d_{ZW} + d_{WX}\)
Volume (M23–M24)  \(\begin{aligned} & A_{X}, A_{Y}, A_{Z}, A_{W}\;coordinates\;of\;four\;atoms\;of\;a\;molecule \\ & Q_{XYZW} = \frac{1}{6}\left( \begin{array}{ccc} A_{Y1} - A_{X1} & A_{Z1} - A_{X1} & A_{W1} - A_{X1} \\ A_{Y2} - A_{X2} & A_{Z2} - A_{X2} & A_{W2} - A_{X2} \\ A_{Y3} - A_{X3} & A_{Z3} - A_{X3} & A_{W3} - A_{X3} \\ \end{array} \right) \\ \end{aligned}\)
Sides summation (M25–M26)  \(Q_{XYZW} = d_{XY} + d_{YZ} + d_{ZW}\)
Dihedral angle (M29–M30)  \(\begin{aligned} & A_{X}, A_{Y}, A_{Z}\;coordinates\;of\;three\;atoms\;of\;a\;molecule\;in\;the\;plane\;A \\ & B_{W}, B_{Y}, B_{Z}\;coordinates\;of\;three\;atoms\;of\;a\;molecule\;in\;the\;plane\;B \\ & U_{A} = \left( A_{X} - A_{Y} \right) \times \left( A_{Z} - A_{Y} \right) \\ & U_{B} = \left( B_{W} - A_{Y} \right) \times \left( B_{Z} - A_{Y} \right) \\ & Q_{XYZW} = \alpha = \arccos \left( \frac{U_{A} \cdot U_{B}}{\left| U_{A} \right| \left| U_{B} \right|} \right) \\ \end{aligned}\)
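The ternary and quaternary measures in the table above can be sketched in a few lines. The example below is only illustrative: it uses the Euclidean distance for every side (the table allows any pairwise metric), reads the volume entry as one sixth of the absolute value of the determinant, and treats atoms Y and Z as shared by both planes in the dihedral angle. All function names and coordinates are hypothetical.

```python
import math


def sub(a, b):
    return tuple(p - q for p, q in zip(a, b))


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def norm(u):
    return math.sqrt(dot(u, u))


def angle(u, v):
    # Clamp the cosine to [-1, 1] to guard against rounding error.
    c = dot(u, v) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, c)))


def perimeter(ax, ay, az):
    return norm(sub(ax, ay)) + norm(sub(ay, az)) + norm(sub(az, ax))


def triangle_area(ax, ay, az):
    # Heron's formula on the three side lengths.
    dxy, dyz, dzx = norm(sub(ax, ay)), norm(sub(ay, az)), norm(sub(az, ax))
    s = (dxy + dyz + dzx) / 2.0
    return math.sqrt(max(0.0, s * (s - dxy) * (s - dyz) * (s - dzx)))


def bond_angle(ax, ay, az):
    # Angle at atom Y between the sides Y-X and Y-Z.
    return angle(sub(ax, ay), sub(az, ay))


def volume(ax, ay, az, aw):
    # |det| / 6 via the scalar triple product (absolute value is an assumption).
    return abs(dot(sub(ay, ax), cross(sub(az, ax), sub(aw, ax)))) / 6.0


def dihedral_angle(ax, ay, az, aw):
    # Angle between the normals of planes (X, Y, Z) and (W, Y, Z).
    ua = cross(sub(ax, ay), sub(az, ay))
    ub = cross(sub(aw, ay), sub(az, ay))
    return angle(ua, ub)


# Hypothetical atoms placed on the coordinate axes.
X, Y, Z, W = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
print(perimeter(X, Y, Z), triangle_area(X, Y, Z), bond_angle(X, Y, Z))
print(volume(X, Y, Z, W), dihedral_angle(X, Y, Z, W))
```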
(A) Chemical structure of Chloro(methoxy)methane and its labeled molecular scaffold, (B) examples of two-tuple total spatial-(dis)similarity matrices for k = 1 (order) calculated from different (dis)similarity metrics, (C) example of three-tuple total spatial-(dis)similarity matrix for k = 1 (order) calculated from the bond angle ternary measure
(A) 3D molecular structure  
(B) Two-tuple total spatial-(dis)similarity matrices, \( {\mathbb{G}}^{1} \)  
\( {\mathbb{G}}^{1} \) based on Euclidean metric  \( {\mathbb{G}}^{1} \) based on Lance-Williams metric  
C1  C2  O3  Cl4  C1  C2  O3  Cl4  
C1  0.000  2.408  1.439  3.939  0.000  1.000  0.973  1.000 
C2  2.408  0.000  1.438  1.757  1.000  0.000  0.954  0.293 
O3  1.439  1.438  0.000  2.598  0.973  0.954  0.000  0.973 
Cl4  3.939  1.757  2.598  0.000  1.000  0.293  0.973  0.000 
\( {\mathbb{G}}^{1} \) based on Soergel metric  \( {\mathbb{G}}^{1} \) based on Angular Separation metric  
C1  C2  O3  Cl4  C1  C2  O3  Cl4  
C1  0.000  1.158  1.003  1.709  0.000  1.354  0.558  1.875 
C2  1.158  0.000  1.234  1.359  1.354  0.000  0.318  0.237 
O3  1.003  1.234  0.000  2.235  0.558  0.318  0.000  0.952 
Cl4  1.709  1.359  2.235  0.000  1.875  0.237  0.952  0.000 
(C) Three-tuple total spatial-(dis)similarity matrix, \( {{\mathbb{G}}{\mathbb{T}}}^{1} \)  
\( {{\mathbb{G}}{\mathbb{T}}}^{1} \) slide 1ij  \( {{\mathbb{G}}{\mathbb{T}}}^{1} \) slide 2ij  
C1  C2  O3  Cl4  C1  C2  O3  Cl4  
C1  0.000  0.000  0.000  0.000  0.000  0.000  0.578  0.281 
C2  0.000  0.000  0.578  2.470  0.000  0.000  0.000  0.000 
O3  0.000  1.985  0.000  2.682  1.985  0.000  0.000  0.697 
Cl4  0.000  0.390  0.163  0.000  0.390  0.000  0.553  0.000 
\( {{\mathbb{G}}{\mathbb{T}}}^{1} \) slide 3ij  \( {{\mathbb{G}}{\mathbb{T}}}^{1} \) slide 4ij  
C1  0.000  0.578  0.000  0.297  0.000  0.281  0.297  0.000 
C2  0.578  0.000  0.000  1.892  2.470  0.000  1.892  0.000 
O3  0.000  0.000  0.000  0.000  2.682  0.697  0.000  0.000 
Cl4  0.163  0.553  0.000  0.000  0.000  0.000  0.000  0.000 
(A) Two-tuple total spatial-(dis)similarity matrix for k = 1, \({\mathbb{G}}^{1}\), computed from 3D coordinates of the molecule Chloro(methoxy)methane (see Table 1A), (B) examples of two-tuple local-fragment spatial-(dis)similarity matrices, \({\mathbb{G}}_{\varvec{F}}^{1}\), obtained with different chemical fragments
C1  C2  O3  Cl4  

(A) Two-tuple total spatial-(dis)similarity matrix, \({\mathbb{G}}^{1}\)  
C1  0.000  2.408  1.439  3.939 
C2  2.408  0.000  1.438  1.757 
O3  1.439  1.438  0.000  2.598 
Cl4  3.939  1.757  2.598  0.000 
(B) Two-tuple local-fragment spatial-(dis)similarity matrices, \({\mathbb{G}}_{F}^{1}\)  
\({\mathbb{G}}_{F}^{1}\) based on halogens fragment  
C1  0.000  0.000  0.000  1.969 
C2  0.000  0.000  0.000  0.878 
O3  0.000  0.000  0.000  1.299 
Cl4  1.969  0.878  1.299  0.000 
\({\mathbb{G}}_{F}^{1}\) based on methyl groups fragment  
C1  0.000  1.204  0.719  1.969 
C2  1.204  0.000  0.000  0.000 
O3  0.719  0.000  0.000  0.000 
Cl4  1.969  0.000  0.000  0.000 
\({\mathbb{G}}_{F}^{1}\) based on heteroatoms fragment  
C1  0.000  0.000  0.719  1.969 
C2  0.000  0.000  0.719  0.878 
O3  0.719  0.719  0.000  2.598 
Cl4  1.969  0.878  2.598  0.000 
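The values in the table above are consistent with a simple weighting rule: an entry of the total matrix is kept whole when both atoms belong to the fragment, halved when only one does, and set to zero when neither does. The sketch below applies that rule. The rule is inferred from the tabulated numbers rather than stated in the text, and the function name and the boolean membership lists are hypothetical.

```python
def local_fragment_matrix(total, in_fragment):
    """Weight each entry by (number of its two atoms in the fragment) / 2."""
    n = len(total)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            members = int(in_fragment[i]) + int(in_fragment[j])  # 0, 1 or 2
            out[i][j] = total[i][j] * members / 2.0
    return out


# Euclidean total matrix for atoms C1, C2, O3, Cl4, taken from part (A).
G1 = [
    [0.000, 2.408, 1.439, 3.939],
    [2.408, 0.000, 1.438, 1.757],
    [1.439, 1.438, 0.000, 2.598],
    [3.939, 1.757, 2.598, 0.000],
]

halogens = local_fragment_matrix(G1, [False, False, False, True])
methyl = local_fragment_matrix(G1, [True, False, False, False])
heteroatoms = local_fragment_matrix(G1, [False, False, True, True])

for row in heteroatoms:
    print(["%.4f" % value for value in row])
```

Small differences in the last digit with respect to the table are expected, because the tabulated total matrix is itself rounded to three decimals.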
Example of probabilistic transformations on the non-stochastic two-tuple total spatial-(dis)similarity matrix for k = 1, \(_{{\varvec{ns}}} {\mathbb{G}}^{1}\), computed from 3D coordinates of the Chloro(methoxy)methane compound (see Table 1A) by using the Euclidean metric
C1  C2  O3  Cl4  C1  C2  O3  Cl4  

Non-stochastic matrix, \(_{ns} {\mathbb{G}}^{1}\)  Simple-stochastic matrix, \(_{ss} {\mathbb{G}}^{1}\)  
C1  0.000  2.408  1.439  3.939  0.000  0.309  0.185  0.506 
C2  2.408  0.000  1.438  1.757  0.430  0.000  0.257  0.314 
O3  1.439  1.438  0.000  2.598  0.263  0.263  0.000  0.475 
Cl4  3.939  1.757  2.598  0.000  0.475  0.212  0.313  0.000 
Double-stochastic matrix, \(_{ds} {\mathbb{G}}^{1}\)  Mutual probability matrix, \(_{mp} {\mathbb{G}}^{1}\)  
C1  0.000  0.387  0.246  0.368  0.000  0.089  0.053  0.145 
C2  0.387  0.000  0.368  0.246  0.089  0.000  0.053  0.065 
O3  0.246  0.368  0.000  0.387  0.053  0.053  0.000  0.096 
Cl4  0.368  0.246  0.387  0.000  0.145  0.065  0.096  0.000 
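Two of the transformations in the table above can be reproduced directly from the non-stochastic matrix: dividing each entry by its row sum gives the tabulated simple-stochastic values, and dividing each entry by the sum of all entries gives the tabulated mutual probability values. The sketch below shows both. The function names are hypothetical, and the double-stochastic matrix is left out because the scaling procedure behind it is not described in this section.

```python
def simple_stochastic(matrix):
    """Divide every entry by the sum of its row, so each row adds up to 1."""
    result = []
    for row in matrix:
        row_sum = sum(row)
        result.append([value / row_sum if row_sum else 0.0 for value in row])
    return result


def mutual_probability(matrix):
    """Divide every entry by the sum of all entries of the matrix."""
    total = sum(sum(row) for row in matrix)
    return [[value / total if total else 0.0 for value in row] for row in matrix]


# Non-stochastic Euclidean matrix for atoms C1, C2, O3, Cl4 from the table.
G1 = [
    [0.000, 2.408, 1.439, 3.939],
    [2.408, 0.000, 1.438, 1.757],
    [1.439, 1.438, 0.000, 2.598],
    [3.939, 1.757, 2.598, 0.000],
]

ss = simple_stochastic(G1)
mp = mutual_probability(G1)
print([round(value, 3) for value in ss[0]])
print([round(value, 3) for value in mp[0]])
```

The row normalization makes the simple-stochastic matrix asymmetric, whereas the mutual probability matrix stays symmetric, as the table shows.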
In order to automate the calculation of the 3D N-linear algebraic indices used in the present manuscript, the QuBiLS-MIDAS software has been developed [41]. One of the main features of this software is the multi-core processing of the MDs, together with the option to carry out the distributed calculation of the indices by using the Multi-Server Distributed Computing Platform known as T-arenal [42]. The latter is particularly useful for high-throughput calculation tasks. Both programs are freely available via the internet at: http://tomocomd.com/.
Methods
In order to assess the correlation ability of the QuBiLS-MIDAS MDs for different biological activities, eight well-known chemical datasets were used. These were previously employed by Sutherland et al. in a comparative study of QSAR methods commonly used in chemoinformatics analysis [23] and have since been utilized as “benchmarks” for comparing results obtained with other approaches [43–47]. These datasets comprise angiotensin converting enzyme (ACE) inhibitors, acetylcholinesterase (AchE) inhibitors, ligands for the benzodiazepine receptor (BZR), cyclooxygenase-2 (COX2) inhibitors, dihydrofolate reductase inhibitors (DHFR), inhibitors of glycogen phosphorylase b (GPB), thermolysin inhibitors (THER) and thrombin inhibitors (THR). In this study the 3D coordinates were generated using CORINA software, and the same partitioning into training and test sets used in the initial study was kept in order to guarantee comparability of results.

The 1000 MDs with best variability behavior according to their Shannon’s Entropy values [29] were retained by using the IMMAN software [30].

The MDs whose values, expressed in scientific notation (powers of 10), have exponents greater than 5 or less than −5 were removed.

Filters were applied to remove the MDs with correlation equal to or greater than 0.95 and those with standardized entropy less than 0.3.

The statistical method Multiple Linear Regression (MLR) implemented in the STATISTICA software was employed in order to select the MDs included in the model by using Forward Stepwise and Backward Stepwise selection procedures.

The MDs retained after applying the previous steps and computed for the same compounds were merged into a single dataset.
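The entropy and correlation filters listed above could be sketched as follows. This is a minimal illustration and not the IMMAN or STATISTICA implementation: the equal-width binning with as many bins as compounds, the normalization of the entropy by log2(n), the greedy order of the correlation filter and all function names are assumptions.

```python
import math


def shannon_entropy(values, n_bins):
    """Shannon entropy (bits) of a descriptor after equal-width binning."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant descriptor carries no information
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * n_bins), n_bins - 1)
        counts[idx] += 1
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts if c)


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def filter_descriptors(columns, max_corr=0.95, min_entropy=0.3):
    """Keep descriptors with enough entropy that are not highly correlated
    with an already kept descriptor (visited from highest entropy down)."""
    n = len(next(iter(columns.values())))
    h_max = math.log2(n)
    std_h = {k: shannon_entropy(v, n) / h_max for k, v in columns.items()}
    kept = []
    for name in sorted(std_h, key=std_h.get, reverse=True):
        if std_h[name] < min_entropy:
            continue
        if all(abs(pearson(columns[name], columns[k])) < max_corr for k in kept):
            kept.append(name)
    return kept


# Hypothetical descriptor columns for eight compounds.
columns = {
    "d1": [1, 2, 3, 4, 5, 6, 7, 8],
    "d2": [2, 4, 6, 8, 10, 12, 14, 16],  # perfectly correlated with d1
    "d3": [1, 1, 1, 1, 1, 1, 1, 1],      # constant, zero entropy
    "d4": [3, 1, 4, 1, 5, 9, 2, 6],
}
print(filter_descriptors(columns))
```

Visiting descriptors from the highest entropy downwards means that, within a correlated pair, the more variable descriptor is the one retained; other tie-breaking rules are equally plausible.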
With the reduced data matrices for each chemical dataset, QSAR models were built with the MLR technique to determine the relationship between the response (activity) and the predictor variables (MDs). The MLR technique is coupled with the Genetic Algorithm (GA) metaheuristic as the variable selection method [48]. This strategy (MLR + GA) is implemented in the MobyDigs software (version 1.0), which was utilized to carry out this study [49]. To perform the search process, several populations with 100 3D N-linear MDs each were created, and the following configuration was used for the GA procedure: number of iterations equal to 500,000; population size equal to 100; reproduction/mutation trade-off equal to 0.5; selection bias initially set to 0 (indicating random selection) until 80 % of the maximum number of iterations was reached, and then set to 1 (indicating tournament selection) in order to increase the selection pressure. The values of these parameters were selected according to the study performed by Todeschini et al. in Ref. [49].

The “best” 50 models according to the \(Q_{loo}^{2}\) parameter were retained.

The validation methods “bootstrapping” [50] \(\left( {Q_{boot}^{2} } \right)\) and “Y-scrambling” [51] \(\left( {a\left( {Q^{2} } \right)} \right)\) were applied to each retained model in order to assess its predictive power and the possible chance correlation with respect to the modeled biological activity, respectively. The former randomly creates training sets (with repeated objects) of the same size as the original, and the objects left out constitute the test set, while the latter randomly permutes the true response values to determine the quality of the model. The two procedures were repeated 5000 and 300 times, respectively. These methods were applied because the \(Q_{loo}^{2}\) procedure alone does not suffice to validate the stability of a predictive model [52].

For each model the function \(f(x) = \left( {1 - Q_{boot}^{2} } \right) + \left| {a\left( {Q^{2} } \right)} \right|\) was computed, which takes into account the results obtained with the two validation procedures employed; the model with the smallest f(x) value constitutes the “best” regression model.

The “best” regression model was assessed by using “external validation” \(\left( {Q_{ext}^{2} } \right)\) procedure in the corresponding test set in order to measure its generalization ability.
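The selection step among the retained models can be illustrated with the scoring function f(x) = (1 − Q²_boot) + |a(Q²)|, where a smaller value means better bootstrap predictivity and less chance correlation. The sketch below is only an illustration: the function names, the dictionary layout and the second candidate model are hypothetical, while the first candidate reuses the Q²_boot and a(Q²) values reported for the ACE model.

```python
def stability_score(q2_boot, a_q2):
    """f(x) = (1 - Q2_boot) + |a(Q2)|; smaller values are better."""
    return (1.0 - q2_boot) + abs(a_q2)


def pick_best(models):
    """Return the candidate model with the smallest f(x)."""
    return min(models, key=lambda m: stability_score(m["q2_boot"], m["a_q2"]))


candidates = [
    {"name": "ACE model", "q2_boot": 0.765, "a_q2": -0.169},
    # Hypothetical competitor: higher bootstrap Q2 but stronger chance correlation.
    {"name": "hypothetical model", "q2_boot": 0.780, "a_q2": -0.250},
]

best = pick_best(candidates)
print(best["name"], round(stability_score(best["q2_boot"], best["a_q2"]), 3))
```

In this example the model with the lower bootstrap Q² is still preferred, because its Y-scrambling intercept is closer to zero.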
Results and discussion
Assessment of the QuBiLS-MIDAS models versus other approaches
Statistical parameters and equations of the best models developed for each chemical dataset analyzed
Size  R ^{2}  \(\left( {Q_{\text{loo}}^{2} } \right)\)  \(\left( {Q_{\text{boot}}^{2} } \right)\)  a(Q ^{2})  \(\left( {Q_{\text{ext}}^{2} } \right)\)  SDEP_{ext}  Models^{a} 

ACE dataset  
6  0.814  0.7756  0.765  −0.169  0.7422  1.078  Act = 1.576 (±1.283) + 0.132 (±0.018) \({}_{{\varvec{NS}2}}^{{\varvec{SD}}} \varvec{TrC}_{\varvec{e}}^{{\varvec{M}20\left( {\varvec{M}4} \right)}}\) − 17.977 (±3.649) \({}_{{\varvec{SS}2}}^{{\varvec{RA}}} \varvec{B}_{{\varvec{a}  \varvec{c}}}^{{\varvec{M}1}}\) + 2.135 (±0.398) \({}_{{\varvec{SS}0}}^{{\varvec{RA}}} \varvec{B}_{{\varvec{a}  \varvec{e}}}^{{}}\) − 3.900 (±0.772) \({}_{{\varvec{SS}1}}^{{\varvec{RA}}} \varvec{F}_{\varvec{a}}^{{\varvec{M}1}}\) + 0.034 (±0.013) \(\left[ {{}_{{\varvec{NS}3}}^{{\varvec{AC}\left[ 3 \right]\_\varvec{K}}} \varvec{TrC}_{\varvec{c}}^{{\varvec{M}20\left( {\varvec{M}16} \right)}} } \right]^{D}\) − 0.114 (±0.071) \(\left[ {{}_{{\varvec{MP}1}}^{{\varvec{RA}}} \varvec{QuQd}_{\varvec{e}}^{{\varvec{M}29}} } \right]^{\varvec{X}}\) 
ACHE dataset  
8  0.738  0.6574  0.626  −0.213  0.6309  0.784  Act = 7.622 (±0.564) − 0.010 (±0.004) \({}_{{\varvec{SS}4}}^{{\varvec{i}50}} \varvec{TrQB}_{{\varvec{e}  \varvec{v}}}^{{\varvec{M}21\left( {\varvec{M}3} \right)}}\) − 0.204 (±0.046) \({}_{{\varvec{NS}4}}^{\varvec{K}} \varvec{Tr}_{{\varvec{a}  \varvec{e}  \varvec{h}}}^{{\varvec{M}21\left( {\varvec{M}1} \right)}}\) + 3.311 (±0.673) \({}_{{\varvec{SS}1}}^{{\varvec{i}50}} \varvec{B}_{{\varvec{a}  \varvec{h}}}^{{\varvec{M}1}}\) − 111.324 (±30.793) \({}_{{\varvec{MP}2}}^{{\varvec{i}50}} \varvec{F}_{\varvec{a}}^{{\varvec{M}1}}\) − 0.413 (±0.156) \({}_{{\varvec{SS}7}}^{{\varvec{ES}\_\varvec{SD}}} \varvec{TrB}_{{\varvec{a}  \varvec{e}}}^{{\varvec{M}21\left( {\varvec{M}13} \right)}}\) − 0.647 (±0.201) \({}_{{\varvec{NS}4}}^{{\varvec{TS}\left[ 2 \right]\_\varvec{K}}} \varvec{B}_{{\varvec{a}  \varvec{v}}}^{{\varvec{M}4}}\) + 0.022 (±0.011) \(\left[ {{}_{{\varvec{NS}4}}^{\varvec{K}} \varvec{Tr}_{{\varvec{a}  \varvec{e}  \varvec{h}}}^{{\varvec{M}21\left( {\varvec{M}1} \right)}} } \right]^{\varvec{A}}\) − 1.747 (±0.699) \(\left[ {{}_{{\varvec{SS}1}}^{{\varvec{i}50}} \varvec{B}_{{\varvec{a}  \varvec{h}}}^{{\varvec{M}1}} } \right]^{\varvec{P}}\) 
BZR dataset  
9  0.754  0.6931  0.669  −0.170  0.5692  0.631  Act = 8.589 (±0.592) + 0.160 (±0.024) \({}_{{\varvec{SS}7}}^{{\varvec{TS}\left[ 4 \right]\_\varvec{K}}} \varvec{Tr}_{{\varvec{a}  \varvec{e}  \varvec{h}}}^{{\varvec{M}19\left( {\varvec{M}11} \right)}}\) + 0.416 (±0.076) \({}_{{\varvec{SS}1}}^{{\varvec{RA}}} \varvec{B}_{{\varvec{c}  \varvec{v}}}^{{\varvec{M}2}}\) + 0.018 (±0.006) \({}_{{\varvec{SS}2}}^{{\varvec{i}50}} \varvec{TrB}_{{\varvec{e}  \varvec{v}}}^{{\varvec{M}19\left( {\varvec{M}16} \right)}}\) + 0.092 (±0.034) \({}_{{\varvec{NS}2}}^{{\varvec{TS}\left[ 7 \right]\_\varvec{K}}} \varvec{Tr}_{{\varvec{a}  \varvec{h}  \varvec{c}}}^{{\varvec{M}27}}\) + 0.030 (±0.010) \({}_{{\varvec{NS}2}}^{{\varvec{AC}\left[ 1 \right]\_\varvec{K}}} \varvec{B}_{{\varvec{c}  \varvec{e}}}^{{\varvec{M}2}}\) − 7.940 (±2.981) \({}_{{\varvec{SS}0}}^{{\varvec{TS}\left[ 4 \right]\_\varvec{i}50}} \varvec{B}_{{\varvec{a}  \varvec{c}}}^{{}}\) − 0.009 (±0.005) \(\left[ {{}_{{\varvec{SS}4}}^{{\varvec{AC}\left[ 4 \right]\_\varvec{K}}} \varvec{TrB}_{{\varvec{e}  \varvec{v}}}^{{\varvec{M}20\left( {\varvec{M}13} \right)}} } \right]^{D}\) + 0. (±0.) \(\left[ {{}_{{\varvec{NS}4}}^{{\varvec{AM}}} \varvec{QuQd}_{\varvec{v}}^{{\varvec{M}26\left( {\varvec{M}8} \right)}} } \right]^{C}\) + 0. (±0.) \(\left[ {{}_{{\varvec{NS}4}}^{{\varvec{AM}}} \varvec{QuQd}_{\varvec{v}}^{{\varvec{M}26\left( {\varvec{M}8} \right)}} } \right]^{P}\) 
COX2 dataset  
9  0.670  0.6313  0.615  −0.091  0.4932  1.038  Act = –94.390 (±8.607) + 1.759 (±0.150) \({}_{{\varvec{MP}3}}^{{\varvec{ES}\_\varvec{N}1}} \varvec{B}_{{\varvec{v}  \varvec{e}}}^{{\varvec{M}3}}\) − 0.032 (±0.007) \({}_{{\varvec{NS}4}}^{{\varvec{AC}\left[ 1 \right]\_\varvec{K}}} \varvec{B}_{{\varvec{a}  \varvec{e}}}^{{\varvec{M}13}}\) + 0.317 (±0.070) \({}_{{\varvec{SS}0}}^{{\varvec{ES}\_\varvec{i}50}} \varvec{B}_{{\varvec{h}  \varvec{e}}}\) + 0.005 (±0.002) \({}_{{\varvec{SS}2}}^{{\varvec{SD}}} \varvec{TrQB}_{{\varvec{v}  \varvec{h}}}^{{\varvec{M}20\left( {\varvec{M}16} \right)}}\) + 0.021 (±0.005) \({}_{{\varvec{NS}4}}^{{\varvec{TS}\left[ 5 \right]\_\varvec{K}}} \varvec{B}_{{\varvec{a}  \varvec{c}}}^{{\varvec{M}11}}\) + 0.081 (±0.017) \({}_{{\varvec{NS}2}}^{{\varvec{AC}\left[ 1 \right]\_\varvec{K}}} \varvec{B}_{{\varvec{c}  \varvec{e}}}^{{\varvec{M}8}}\) − 17.442 (±3.695) \(\left[ {{}_{{\varvec{SS}4}}^{{\varvec{SD}}} \varvec{QuCB}_{{\varvec{h}  \varvec{c}}}^{{\varvec{M}26\left( {\varvec{M}8} \right)}} } \right]^{\varvec{D}}\) − 14.761 (±2.510) \(\left[ {{}_{{\varvec{SS}4}}^{{\varvec{SD}}} \varvec{QuCB}_{{\varvec{h}  \varvec{c}}}^{{\varvec{M}26\left( {\varvec{M}8} \right)}} } \right]^{\varvec{M}}\) + 122.311 (±50.893) \(\left[ {{}_{{\varvec{MP}1}}^{{\varvec{SD}}} \varvec{Tr}_{{\varvec{a}  \varvec{h}  \varvec{c}}}^{{\varvec{M}20\left( {\varvec{M}16} \right)}} } \right]^{X}\) 
DHFR dataset  
9  0.732  0.7055  0.697  −0.077  0.6405  0.826  Act = 3.127 (±0.519) + 0.019 (±0.005) \({}_{{\varvec{SS}1}}^{{\varvec{RA}}} \varvec{TrB}_{{\varvec{e}  \varvec{v}}}^{{\varvec{M}21\left( {\varvec{M}2} \right)}}\) + 0.050 (±0.007) \({}_{{\varvec{NS}6}}^{{\varvec{GV}\left[ 4 \right]\_\varvec{K}}} \varvec{B}_{{\varvec{c}  \varvec{e}}}^{{\varvec{M}4}}\) − 15.592 (±3.530) \({}_{{\varvec{MP}4}}^{{\varvec{TS}\left[ 2 \right]\_\varvec{i}50}} \varvec{QuQd}_{\varvec{m}}^{{\varvec{M}25\left( {\varvec{M}3} \right)}}\) − 0.067 (±0.007) \({}_{{\varvec{NS}2}}^{{\varvec{GV}\left[ 3 \right]\_\varvec{K}}} \varvec{B}_{{\varvec{a}  \varvec{c}}}^{{\varvec{M}1}}\) + 0.471 (±0.034) \({}_{{\varvec{NS}3}}^{{\varvec{GV}\left[ 1 \right]\_\varvec{K}}} \varvec{B}_{{\varvec{h}  \varvec{c}}}^{{\varvec{M}3}}\) − 0.325 (±0.037) \({}_{{\varvec{NS}1}}^{{\varvec{TS}\left[ 4 \right]\_\varvec{N}1}} \varvec{B}_{{\varvec{c}  \varvec{e}}}^{{\varvec{M}1}}\) + 55.107 (±10.603) \({}_{{\varvec{NS}1}}^{{\varvec{GV}\left[ 5 \right]\_\varvec{SD}}} \varvec{B}_{{\varvec{c}  \varvec{e}}}^{{\varvec{M}3}}\) + 0.044 (±0.008) \({}_{{\varvec{NS}2}}^{{\varvec{TS}\left[ 3 \right]\_\varvec{SD}}} \varvec{B}_{{\varvec{v}  \varvec{e}}}^{{\varvec{M}4}}\) − 0.933 (±0.331) \({}_{{\varvec{MP}4}}^{{\varvec{N}1}} \varvec{Qu}_{{\varvec{e}  \varvec{v}  \varvec{h}  \varvec{c}}}^{{\varvec{M}26\left( {\varvec{M}3} \right)}}\) 
GPB dataset  
8  0.893  0.8124  0.774  −0.394  0.8283  0.499  Act = 2.073 (±0.351) + 0.334 (±0.078) \({}_{{\varvec{SS}3}}^{{\varvec{TS}\left[ 4 \right]\_\varvec{K}}} \varvec{TrB}_{{\varvec{e}  \varvec{h}}}^{{\varvec{M}20\left( {\varvec{M}8} \right)}}\) + 0.147 (±0.051) \({}_{{\varvec{NS}2}}^{{\varvec{AC}\left[ 3 \right]\_\varvec{K}}} \varvec{F}_{\varvec{e}}^{{\varvec{M}8}}\) + 0.046 (±0.009) \({}_{{\varvec{SS}3}}^{{\varvec{AC}\left[ 4 \right]\_\varvec{N}1}} \varvec{B}_{{\varvec{c}  \varvec{v}}}^{{\varvec{M}12}}\) + 55.958 (±10.078) \({}_{{\varvec{SS}2}}^{{\varvec{AC}\left[ 2 \right]\_\varvec{N}1}} \varvec{B}_{{\varvec{a}  \varvec{c}}}^{{\varvec{M}8}}\) + 0.050 (±0.039) \({}_{{\varvec{SS}4}}^{{\varvec{N}1}} \varvec{Tr}_{{\varvec{e}  \varvec{v}  \varvec{c}}}^{{\varvec{M}19\left( {\varvec{M}12} \right)}}\) + 0.078 (±0.055) \({}_{{\varvec{NS}3}}^{{\varvec{GV}\left[ 2 \right]\_\varvec{K}}} \varvec{F}_{\varvec{a}}^{{\varvec{M}11}}\) + 1.322 (±0.427) \({}_{{\varvec{MP}0}}^{{\varvec{SD}}} \varvec{QuQTr}_{{\varvec{e}  \varvec{v}  \varvec{h}}}^{{}}\) − 0.309 (±0.108) \({}_{{\varvec{MP}4}}^{{\varvec{SD}}} \varvec{QuQTr}_{{\varvec{e}  \varvec{v}  \varvec{h}}}^{{\varvec{M}26\left( {\varvec{M}3} \right)}}\) 
THER dataset  
7  0.815  0.7530  0.723  −0.260  0.7248  1.197  Act = –11.296 (±3.486) + 126.508 (±41.628) \({}_{{\varvec{NS}1}}^{{\varvec{GV}\left[ 5 \right]\_\varvec{N}1}} \varvec{B}_{{\varvec{a}  \varvec{c}}}^{{\varvec{M}8}}\) + 0.016 (±0.003) \({}_{{\varvec{NS}1}}^{{\varvec{GV}\left[ 7 \right]\_\varvec{i}50}} \varvec{Q}_{\varvec{e}}^{{\varvec{M}8}}\) − 4.265 (±0.851) \({}_{{\varvec{SS}1}}^{{\varvec{N}1}} \varvec{Tr}_{{\varvec{v}  \varvec{h}  \varvec{c}}}^{{\varvec{M}20\left( {\varvec{M}3} \right)}}\) + 0.718 (±0.171) \({}_{{\varvec{SS}3}}^{{\varvec{RA}}} \varvec{TrC}_{\varvec{e}}^{{\varvec{M}20\left( {\varvec{M}3} \right)}}\) + 0.016 (±0.009) \({}_{{\varvec{SS}4}}^{{\varvec{RA}}} \varvec{TrB}_{{\varvec{e}  \varvec{v}}}^{{\varvec{M}27}}\) − 0.027 (±0.029) \(\left[ {{}_{{\varvec{SS}4}}^{{\varvec{RA}}} \varvec{TrB}_{{\varvec{e}  \varvec{v}}}^{{\varvec{M}27}} } \right]^{A}\) + 0.042 (±0.027) \(\left[ {{}_{{\varvec{SS}4}}^{{\varvec{RA}}} \varvec{TrB}_{{\varvec{e}  \varvec{v}}}^{{\varvec{M}27}} } \right]^{X}\) 
THR dataset  
9  0.866  0.8149  0.789  −0.286  0.7674  0.540  Act = 5.251 (±0.605) − 2120.900 (±253.086) \({}_{{\varvec{MP}2}}^{{\varvec{TS}\left[ 1 \right]\_\varvec{i}50}} \varvec{Tr}_{{\varvec{a}  \varvec{h}  \varvec{c}}}^{{\varvec{M}19\left( {\varvec{M}2} \right)}}\) − 0.0001 (±0.) \({}_{{\varvec{NS}0}}^{{\varvec{TS}\left[ 5 \right]\_\varvec{i}50}} \varvec{Tr}_{{\varvec{e}  \varvec{v}  \varvec{h}}}^{{}}\) + 0.060 (±0.013) \({}_{{\varvec{SS}1}}^{{\varvec{AC}\left[ 2 \right]\_\varvec{K}}} \varvec{TrQB}_{{\varvec{a}  \varvec{c}}}^{{\varvec{M}27}}\) + 0.022 (±0.004) \({}_{{\varvec{NS}3}}^{{\varvec{RA}}} \varvec{Tr}_{{\varvec{e}  \varvec{v}  \varvec{h}}}^{{\varvec{M}20\left( {\varvec{M}2} \right)}}\) + 1.415 (±0.222) \({}_{{\varvec{NS}2}}^{{\varvec{RA}}} \varvec{TrQB}_{{\varvec{a}  \varvec{c}}}^{{\varvec{M}20\left( {\varvec{M}8} \right)}}\) + 0.958 (±0.293) \({}_{{\varvec{NS}2}}^{{\varvec{GV}\left[ 4 \right]\_\varvec{PN}}} \varvec{B}_{{\varvec{c}  \varvec{v}}}^{{\varvec{M}8}}\) + 0.107 (±0.041) \({}_{{\varvec{SS}4}}^{\varvec{K}} \varvec{Tr}_{{\varvec{e}  \varvec{v}  \varvec{h}}}^{{\varvec{M}21\left( {\varvec{M}8} \right)}}\) + 0.029 (±0.012) \({}_{{\varvec{MP}4}}^{{\varvec{AC}\left[ 7 \right]\_\varvec{K}}} \varvec{Tr}_{{\varvec{a}  \varvec{e}  \varvec{c}}}^{{\varvec{M}19\left( {\varvec{M}13} \right)}}\) − 0.058 (±0.022) \(\left[ {{}_{{\varvec{SS}1}}^{{\varvec{AC}\left[ 2 \right]\_\varvec{K}}} \varvec{TrQB}_{{\varvec{a}  \varvec{c}}}^{{\varvec{M}27}} } \right]^{\varvec{C}}\) 
Table 7 Comparison of the cross-validation statistic parameter \(\left( {Q_{loo}^{2} } \right)\) obtained from the QuBiLS-MIDAS models with respect to the performance achieved by 15 QSAR procedures
Method  ACE  ACHE  BZR  COX2  DHFR  GPB  THER  THR

QuBiLS-MIDAS^{a}  0.7756  0.6574  0.6931  0.6313  0.7055  0.8124  0.7530  0.8149
QuBiLS-MIDAS^{b}  0.7713  0.6521  0.6886  0.6064  0.7055  0.8124  0.7495  0.8047
CoMFA [23]  0.68  0.52  0.32  0.49  0.65  0.42  0.52  0.59 
COMSIA basic [23]  0.65  0.48  0.41  0.43  0.63  0.43  0.54  0.62 
COMSIA extra [23]  0.66  0.49  0.45  0.57  0.65  0.61  0.51  0.72 
EVA [23]  0.70  0.42  0.40  0.45  0.64  0.58  0.48  0.47 
HQSAR [23]  0.72  0.34  0.42  0.50  0.69  0.66  0.49  0.50 
2D [23]  0.68  0.32  0.36  0.49  0.51  0.31  0.62  0.62 
2.5D [23]  0.72  0.31  0.35  0.55  0.53  0.46  0.66  0.52 
SAMFA-RF [43]  0.69  0.58  0.43  0.38  0.70  0.66  0.52  0.53
SAMFA-SVM [43]  0.52  0.29  0.38  0.39  0.57  0.53  0.18  0.39
SAMFA-PLS [43]  0.65  0.54  0.49  0.40  0.68  0.61  0.60  0.56
Fingerprints Library [44]  0.69  0.57  0.56  0.55  0.76  0.53  0.53  0.58 
O3Q [45]  0.69  0.52  0.42  0.48  0.70  0.55  0.48  0.59 
O3QMFA [46]  0.65  0.41  0.41  0.43  0.69  0.30  0.47  0.65 
O3A/O3Q [45]  0.71  0.55  0.46  0.46  0.66  0.50  0.67  0.68 
COSMOsar3D [46]  0.71  0.53  0.45  0.54  0.69  0.61  0.58  0.74 
Table 8 Comparison of the external predictive accuracy \(\left( {Q_{ext}^{2} } \right)\) attained by the QuBiLS-MIDAS models with respect to the generalization ability achieved with 12 QSAR procedures
Method  ACE  ACHE  BZR  COX2  DHFR  GPB  THER  THR

QuBiLS-MIDAS^{a}  0.7422  0.6309  0.5692  0.4932  0.6405  0.8283  0.7248  0.7674
QuBiLS-MIDAS^{b}  0.7255  0.5989  0.5459  0.4660  0.6405  0.8283  0.7061  0.7498
CoMFA [23]  0.49  0.47  0.00  0.29  0.59  0.42  0.54  0.63 
COMSIA basic [23]  0.52  0.44  0.08  0.03  0.52  0.46  0.36  0.55 
COMSIA extra [23]  0.49  0.44  0.12  0.37  0.53  0.59  0.53  0.63 
EVA [23]  0.36  0.28  0.16  0.17  0.57  0.49  0.36  0.11 
HQSAR [23]  0.30  0.37  0.17  0.27  0.63  0.58  0.53  −0.25 
2D [23]  0.47  0.16  0.14  0.25  0.47  −0.06  0.14  0.04 
2.5D [23]  0.51  0.16  0.20  0.27  0.49  0.04  0.07  0.28 
O3Q [45]  0.69  0.67  0.17  0.32  0.60  0.50  0.51  0.67 
O3QMFA [46]  0.45  0.61  0.13  0.37  0.59  0.29  0.49  0.60 
O3A/O3Q [45]  0.54  0.65  0.24  0.28  0.53  0.41  −0.18  0.30 
COSMOsar3D [46]  0.62  0.61  0.13  0.43  0.58  0.63  0.59  0.66 
2D-FPT [47]  0.713^{L}  0.714^{N}  0.378^{L}  0.329^{N}  0.683^{N}  0.667^{L}  0.649^{L}  0.737^{N}
Also, it can be observed from Table 7 that the cross-validation performances achieved by the QuBiLS-MIDAS models are comparable-to-superior with respect to the approaches reported in the literature. Until now, the best \(Q_{loo}^{2}\) values for the datasets ACE, ACHE, BZR, COX2, GPB, THER and THR had been attained by the procedures HQSAR and 2.5D (\(Q_{loo}^{2}\) = 0.72), SAMFA-RF (\(Q_{loo}^{2}\) = 0.58), All-Shortest Path [ASP] fingerprint (\(Q_{loo}^{2}\) = 0.56), COMSIA extra (\(Q_{loo}^{2}\) = 0.57), HQSAR and SAMFA-RF (\(Q_{loo}^{2}\) = 0.66), O3A/O3Q (\(Q_{loo}^{2}\) = 0.67) and COSMOsar3D (\(Q_{loo}^{2}\) = 0.74), respectively, by using PLS, Random Forest (RF) or Support Vector Machine (SVM) techniques. However, all these previous results are clearly outperformed by the QuBiLS-MIDAS models [(ACE, \(Q_{loo}^{2}\) = 0.7756), (ACHE, \(Q_{loo}^{2}\) = 0.6574), (BZR, \(Q_{loo}^{2}\) = 0.6931), (COX2, \(Q_{loo}^{2}\) = 0.6313), (GPB, \(Q_{loo}^{2}\) = 0.8124), (THER, \(Q_{loo}^{2}\) = 0.7530) and (THR, \(Q_{loo}^{2}\) = 0.8149)], which were built with MLR, a simpler method than those employed in the reported results. In the specific case of the DHFR dataset, although the value attained with the QuBiLS-MIDAS approach (\(Q_{loo}^{2}\) = 0.7055) is not better than the current best result (ASP fingerprint, \(Q_{loo}^{2}\) = 0.76), it is superior to those of the remaining QSAR procedures. However, it is important to remark that no external prediction value (\(Q_{ext}^{2}\)) is reported for the best DHFR model (ASP fingerprint + SVM), and thus the corresponding \(Q_{loo}^{2}\) could be overoptimistic.
Regarding the external predictions, it can be observed in Table 8 that the models based on QuBiLS-MIDAS indices yield comparable-to-superior performances with respect to the results reported in the literature. Specifically, the models for the ACE (\(Q_{ext}^{2}\) = 0.7422), BZR (\(Q_{ext}^{2}\) = 0.5692), COX2 (\(Q_{ext}^{2}\) = 0.4932), GPB (\(Q_{ext}^{2}\) = 0.8283), THER (\(Q_{ext}^{2}\) = 0.7248) and THR (\(Q_{ext}^{2}\) = 0.7674) test sets outperform the best results reported to date for each of these datasets, which correspond to COSMOsar3D (\(Q_{ext}^{2}\) = 0.43) for COX2 and to the 2D-FPT methodology for the other datasets [(ACE, \(Q_{ext}^{2}\) = 0.713), (BZR, \(Q_{ext}^{2}\) = 0.378), (GPB, \(Q_{ext}^{2}\) = 0.667), (THER, \(Q_{ext}^{2}\) = 0.649) and (THR, \(Q_{ext}^{2}\) = 0.737)]. The 2D-FPT models were developed using the SQS framework, which determines linear and nonlinear models (see Table 8), while the COSMOsar3D model is based on the PLS technique. Even so, the MLR models obtained here have better predictive accuracy than these similar or more complex procedures.
As for the ACHE and DHFR datasets, the predictive power of the models built with the QuBiLS-MIDAS approach is inferior to the best results reported so far in the literature. For the ACHE dataset, the methods 2D-FPT (\(Q_{ext}^{2}\) = 0.714), O3Q (\(Q_{ext}^{2}\) = 0.67) and O3A/O3Q (\(Q_{ext}^{2}\) = 0.65) offer better predictions than the proposed model (\(Q_{ext}^{2}\) = 0.6309), although the latter can still be considered suitable (it explains 63 % of the total variance). For the DHFR test set, the 2D-FPT approach (\(Q_{ext}^{2}\) = 0.683) has more predictive ability than the corresponding QuBiLS-MIDAS model (\(Q_{ext}^{2}\) = 0.6405), but the latter is superior to the remaining methodologies. Nonetheless, it is important to highlight that the O3Q and O3A/O3Q procedures are alignment-dependent and thus their use is generally restricted to congeneric datasets [45]. In the specific case of the 2D-FPT methodology, the results for the ACHE and DHFR datasets are based on nonlinear models, whereas the proposed models are linear.
The obtained results indicate that the QuBiLS-MIDAS MDs properly codify structural information of the molecules by considering interactions among N (N = 2, 3, 4) atoms, and thus are suitable for developing QSAR models that contribute to the prediction of biological activity in novel structures. However, notwithstanding the comparable-to-superior predictions achieved by the proposed models, it is important to statistically validate these results.
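For reference, the two statistics compared above are commonly written as shown below. This is a sketch under usual QSAR conventions rather than a definition taken from this section: the symbols \(\hat{y}_{i/i}\) (prediction for compound i from the model fitted without it), \(\bar{y}\) (training-set mean), \(n\) (training-set size) and \(n_{ext}\) (test-set size) are notation introduced here, and the external statistic is shown in one common variant, which is assumed and may differ from the one used by the authors.

\[
Q_{loo}^{2} = 1 - \frac{\sum_{i=1}^{n} \left( y_{i} - \hat{y}_{i/i} \right)^{2}}{\sum_{i=1}^{n} \left( y_{i} - \bar{y} \right)^{2}},
\qquad
Q_{ext}^{2} = 1 - \frac{\sum_{j=1}^{n_{ext}} \left( y_{j} - \hat{y}_{j} \right)^{2}}{\sum_{j=1}^{n_{ext}} \left( y_{j} - \bar{y} \right)^{2}}
\]

Under either form, a value of 0.63 means that the squared prediction error is 37 % of the variance around the reference mean, which is the sense in which a model "explains 63 %" of the total variance.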
Statistical analysis of the external predictive accuracy
First, an exploratory study was performed to analyze the normality of the data by using the Kolmogorov–Smirnov (K–S) test with the Lilliefors correction [53] and the Shapiro–Wilk test [54]. This was done to verify that the variable \(Q_{ext}^{2}\) is not normally distributed for at least one model, and thus to ensure that nonparametric tests are the proper choice. As can be observed in Additional file 1: Table S5, the null hypothesis of normality can only be rejected with high certainty for the \(Q_{ext}^{2}\) values of the 2D-FPT and COSMOsar3D models, although with the Shapiro–Wilk test it is rejected for COMSIA basic as well. Therefore, nonparametric tests may be considered suitable for this statistical analysis.
Wilcoxon signed-rank test for pairwise multiple hypothesis tests, using BH as the adjustment method for controlling the FDR. One-tailed p-values for the "greater" alternative are shown
Method  2D  2.5D  EVA  COMSIA basic  HQSAR  O3QMFA  CoMFA  O3A/O3Q  COMSIA extra  COSMOsar3D  O3Q  2D-FPT

2.5D  0.115  –  –  –  –  –  –  –  –  –  –  – 
EVA  0.138  0.402  –  –  –  –  –  –  –  –  –  – 
COMSIA basic  0.137  0.115  0.323  –  –  –  –  –  –  –  –  – 
HQSAR  0.203  0.380  0.197  0.402  –  –  –  –  –  –  –  – 
O3QMFA  0.046  0.046  0.138  0.241  0.312  –  –  –  –  –  –  – 
CoMFA  0.051  0.089  0.115  0.241  0.367  0.703  –  –  –  –  –  – 
O3A/O3Q  0.089  0.089  0.277  0.556  0.402  0.654  0.727  –  –  –  –  – 
COMSIA extra  0.031  0.051  0.045  0.051  0.164  0.427  0.249  0.272  –  –  –  – 
COSMOsar3D  0.027  0.022  0.036  0.022  0.051  0.054  0.027  0.068  0.015  –  –  – 
O3Q  0.015  0.022  0.022  0.015  0.186  0.051  0.042  0.051  0.203  0.698  –  – 
2D-FPT  0.015  0.015  0.015  0.015  0.015  0.022  0.015  0.015  0.022  0.068  0.015  –
QuBiLS-MIDAS  0.015  0.015  0.015  0.015  0.015  0.015  0.015  0.022  0.015  0.015  0.022  0.138
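The kind of test summarized in the table above can be sketched in a few lines of Python. The sketch below is illustrative only: it computes an exact one-tailed Wilcoxon signed-rank p-value by enumerating sign assignments and applies the Benjamini–Hochberg (BH) adjustment, using the \(Q_{ext}^{2}\) values of four methods from the external-prediction comparison. The function names (`midranks`, `wilcoxon_greater`, `bh_adjust`) are hypothetical, the software actually used by the authors is not stated in this section, and the adjusted values will not match the table because the adjustment here covers only three comparisons rather than all pairs.

```python
from itertools import product

# Q2_ext per dataset (ACE, ACHE, BZR, COX2, DHFR, GPB, THER, THR),
# copied from the external-prediction comparison reported in the text.
Q2_EXT = {
    "QuBiLS-MIDAS": [0.7422, 0.6309, 0.5692, 0.4932, 0.6405, 0.8283, 0.7248, 0.7674],
    "CoMFA": [0.49, 0.47, 0.00, 0.29, 0.59, 0.42, 0.54, 0.63],
    "COSMOsar3D": [0.62, 0.61, 0.13, 0.43, 0.58, 0.63, 0.59, 0.66],
    "2D-FPT": [0.713, 0.714, 0.378, 0.329, 0.683, 0.667, 0.649, 0.737],
}


def midranks(values):
    """Rank values in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def wilcoxon_greater(x, y):
    """Exact one-tailed p-value for the alternative 'x tends to exceed y' (paired)."""
    # Round to suppress floating-point noise, then drop zero differences.
    diffs = [round(a - b, 6) for a, b in zip(x, y)]
    diffs = [d for d in diffs if d != 0]
    ranks = midranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Enumerate every assignment of signs to the ranks (2**n cases, n = 8 here).
    hits = sum(
        1
        for signs in product((0, 1), repeat=len(ranks))
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus
    )
    return hits / 2 ** len(ranks)


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)  # largest first
    adjusted = [0.0] * m
    running_min = 1.0
    for pos, i in enumerate(order):
        rank = m - pos  # 1-based rank of pvalues[i] in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


reference = Q2_EXT["QuBiLS-MIDAS"]
others = [name for name in Q2_EXT if name != "QuBiLS-MIDAS"]
raw = [wilcoxon_greater(reference, Q2_EXT[name]) for name in others]
for name, p, q in zip(others, raw, bh_adjust(raw)):
    print(f"QuBiLS-MIDAS > {name}: p = {p:.4f}, BH-adjusted p = {q:.4f}")
```

With only eight datasets, the smallest attainable one-tailed p-value is 1/256, reached when one method is better on every dataset; this is why exact enumeration is practical here and why many entries in a table of this kind share the same value after adjustment.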
Analysis of the predictive ability according to conformer generation methods
Conformer generation constitutes an important step in chemoinformatics tasks, particularly in computer-aided drug design, where the outcomes of a virtual screening process may depend on the 3D structures employed to build the procedure to be used, e.g. a QSAR model [59]. Therefore, in this section the sensitivity of the QuBiLS-MIDAS MDs to different conformer generation methods is evaluated in order to understand how these methods could affect the performance of the indices. To this end, the software FROG2 [60], RDKit [61], BALLOON [62], OpenBabel [63] and Standardizer ChemAxon [64] were employed to generate the 3D structures, taking as starting point the SMILES representations of the compounds in the eight datasets considered in this report.

8640 two-linear algebraic indices (Additional file 1: Table S9) were computed.

The CfsSubsetEval feature selection procedure, implemented in the WEKA software, was applied in order to retain those MDs that are highly correlated with the dependent variable and weakly intercorrelated with each other.

The MLR-GA procedure implemented in the MobyDigs software was employed to build 9-variable models, performing 100,000 iterations and using the tabu list options to remove MDs with a correlation equal to or greater than 0.95, a fourth-order moment greater than 8 and a standardized entropy lower than 0.3. The fitness function used was the statistical parameter \(Q_{loo}^{2}\).

The model with the highest \(Q_{loo}^{2}\) value was selected as the best model, for which the external predictive ability was determined.
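The fitness function named in the steps above, \(Q_{loo}^{2}\) of an MLR model, can be sketched as follows. This is a minimal pure-Python illustration, assuming ordinary least squares with an intercept and the usual definition \(Q_{loo}^{2} = 1 - PRESS/TSS\); it does not reproduce the genetic algorithm or the tabu list of MobyDigs, and the function names and the toy data are hypothetical.

```python
def solve(a, b):
    """Solve the square system a.x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_mlr(X, y):
    """Ordinary least squares with intercept, solved through the normal equations."""
    rows = [[1.0] + list(x) for x in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(coef, x):
    """Evaluate intercept + sum(coefficient * descriptor) for one compound."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))


def q2_loo(X, y):
    """Leave-one-out Q2: refit without each compound, then predict it."""
    press = 0.0
    for i in range(len(y)):
        coef = fit_mlr(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coef, X[i])) ** 2
    mean_y = sum(y) / len(y)
    tss = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - press / tss


# Hypothetical toy data: two descriptors, six compounds (not taken from the paper).
X_toy = [[0.1, 1.0], [0.4, 0.8], [0.5, 0.1], [0.9, 0.3], [1.2, 0.9], [1.5, 0.2]]
y_toy = [1.1, 1.6, 1.2, 2.1, 3.0, 2.9]
print(f"Q2_loo = {q2_loo(X_toy, y_toy):.3f}")
```

A search procedure such as a genetic algorithm would call a function like `q2_loo` on each candidate subset of descriptors and keep the subsets with the highest values. The normal equations are used here only for brevity; a numerically more robust solver would be preferable for nearly collinear descriptors.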
External predictive accuracy achieved by QSAR models developed from 3D molecular structures generated with six different programs
Program  ACE  ACHE  BZR  COX2  DHFR  GPB  THER  THR  Rank average

BALLOON  0.3296  0.1943  0.3949  0.2451  0.3758  0.0000  0.0000  0.0000  4.5 
CHEMAXON  0.5504  0.1343  0.4163  0.3361  0.2978  0.1687  0.0000  0.1386  3.375 
CORINA  0.4133  0.0556  0.3628  0.2865  0.4288  0.2767  0.1915  0.2334  3.25 
FROG2  0.4832  0.3535  0.3635  0.3393  0.3786  0.2712  0.3264  0.1457  2.125 
OPENBABEL  0.3993  0.1306  0.1715  0.2775  0.3460  0.4742  0.2806  0.0803  4 
RDKIT  0.4181  0.1770  0.3024  0.2189  0.5008  0.4511  0.0000  0.0710  3.75 
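The "Rank average" column of the table above can be recomputed from the \(Q_{ext}^{2}\) values alone. The sketch below assumes that, within each dataset, the programs are ranked from 1 (highest \(Q_{ext}^{2}\)) to 6 (lowest), that tied values share the mean of their ranks, and that the ranks are then averaged over the eight datasets; the function names are hypothetical. Under these assumptions the computed averages coincide with the tabulated ones.

```python
# Q2_ext per dataset (ACE, ACHE, BZR, COX2, DHFR, GPB, THER, THR),
# copied from the table of conformer generation programs.
Q2_EXT_BY_PROGRAM = {
    "BALLOON": [0.3296, 0.1943, 0.3949, 0.2451, 0.3758, 0.0000, 0.0000, 0.0000],
    "CHEMAXON": [0.5504, 0.1343, 0.4163, 0.3361, 0.2978, 0.1687, 0.0000, 0.1386],
    "CORINA": [0.4133, 0.0556, 0.3628, 0.2865, 0.4288, 0.2767, 0.1915, 0.2334],
    "FROG2": [0.4832, 0.3535, 0.3635, 0.3393, 0.3786, 0.2712, 0.3264, 0.1457],
    "OPENBABEL": [0.3993, 0.1306, 0.1715, 0.2775, 0.3460, 0.4742, 0.2806, 0.0803],
    "RDKIT": [0.4181, 0.1770, 0.3024, 0.2189, 0.5008, 0.4511, 0.0000, 0.0710],
}


def descending_midranks(values):
    """Rank 1 is the largest value; tied values share the mean of their ranks."""
    ranks = []
    for v in values:
        higher = sum(1 for other in values if other > v)
        tied = sum(1 for other in values if other == v)
        ranks.append(higher + (tied + 1) / 2)
    return ranks


def rank_averages(table):
    """Average, over all datasets, the per-dataset rank of each program."""
    programs = list(table)
    n_datasets = len(next(iter(table.values())))
    totals = {p: 0.0 for p in programs}
    for j in range(n_datasets):
        column = [table[p][j] for p in programs]
        for p, r in zip(programs, descending_midranks(column)):
            totals[p] += r
    return {p: totals[p] / n_datasets for p in programs}


for program, avg in sorted(rank_averages(Q2_EXT_BY_PROGRAM).items(), key=lambda kv: kv[1]):
    print(f"{program}: {avg}")
```

The three-way tie at 0.0000 in the THER column is what produces the non-integer rank totals: BALLOON, CHEMAXON and RDKIT each receive rank 5 there, the mean of ranks 4, 5 and 6.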
Note that the RDKit program will be incorporated in the forthcoming version of the QuBiLS-MIDAS software as a built-in option for conformer generation. This is because the FROG2 procedure can only be accessed through a web browser, while the CORINA and ChemAxon software are not freely available. In addition, according to a study that assessed the quality of the conformations generated by several free methods [65], RDKit tends to generate the conformations most similar to the experimental structures, and it is the second fastest among the toolkits analyzed.
Conclusions
In this report the predictive accuracy of the novel alignment-free geometric molecular descriptors based on N-linear algebraic maps (so-called QuBiLS-MIDAS) has been examined. To this end, QSAR models for predicting the biological activity in eight molecular datasets were developed by using MLR as the statistical technique. The results obtained with the QuBiLS-MIDAS models were compared with several QSAR procedures reported in the literature according to the correlation coefficients achieved with the leave-one-out cross-validation \(\left( {Q_{loo}^{2} } \right)\) and external prediction \(\left( {Q_{ext}^{2} } \right)\) methods, and generally superior performances were observed with the QuBiLS-MIDAS framework.
A few exceptions were observed: for the \(Q_{loo}^{2}\) parameter, the QuBiLS-MIDAS approach is outperformed only by the ASP-based (fingerprint) method in the DHFR dataset, while for the \(Q_{ext}^{2}\) parameter, it yields inferior results with respect to the 2D-FPT methodology in the DHFR and ACHE test sets. Also, inferior \(Q_{ext}^{2}\) values are yielded by the QuBiLS-MIDAS approach with respect to the O3Q and O3A/O3Q procedures in the ACHE test set. However, these methodologies are based on techniques more complex than MLR and/or cannot be used in non-congeneric datasets because they are alignment-dependent. Thus, considering the maximum parsimony principle (Ockham’s razor), the QuBiLS-MIDAS approach seems to be more suitable than the other QSAR methods.
Additionally, several steps for statistically validating the obtained results are detailed. In this sense, the external predictive ability of the developed models was compared with that of other methodologies by means of multiple comparison tests. It was demonstrated that the QuBiLS-MIDAS models yield the best predictions, and that these are significantly superior to those of 11 of the 12 methodologies compared. Therefore, it can be suggested that the 3D algebraic N-linear molecular descriptors (also known as QuBiLS-MIDAS) are suitable for extracting structural information of the molecules and thus constitute a promising alternative for building models that contribute to the prediction of pharmacokinetic, pharmacodynamic and toxicological properties of novel compounds.
Declarations
Authors’ contributions
CRGJ proposed the theory of the QuBiLS-MIDAS indices, supervised the QSAR modeling on the eight chemical datasets, performed the study of the performance of the indices according to several structure generation methods and prepared the manuscript. ECT worked on the QSAR modeling on the eight chemical datasets. YMP led the research related to this manuscript. MPM performed the statistical analysis. SJB worked on the definition of the QuBiLS-MIDAS indices and prepared the manuscript. LCL worked on the QSAR modeling on the eight datasets. All authors read and approved the final manuscript.
Acknowledgements
García-Jacas, C.R. thanks the program “International Professor” for a fellowship to work at “Pontificia Universidad Católica del Ecuador Sede Esmeraldas (PUCESE)” in 2015–2016. Barigye, S.J. acknowledges support from Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq). Marrero-Ponce, Y. also acknowledges the partial financial support from Colegio de Medicina, USFQ. Last but not least, this work was supported in part by ISCUSFQ.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
1. Norinder U (1996) Single and domain mode variable selection in 3D QSAR applications. J Chemom 10(2):95–105
2. SungSau S, Karplus M (1997) Three-dimensional quantitative structure–activity relationships from molecular similarity matrices and genetic neural networks. 2. Applications. J Med Chem 40(26):4360–4371
3. Aires-de-Sousa J, Gasteiger J (2002) Prediction of enantiomeric selectivity in chromatography: application of conformation-dependent and conformation-independent descriptors of molecular chirality. J Mol Graph Model 20(5):373–388
4. Chen H, Zhou J, Xie G (1998) PARM: a genetic evolved algorithm to predict bioactivity. J Chem Inf Comput Sci 38(2):243–250
5. Kubinyi H (1997) QSAR and 3D QSAR in drug design: 1. Methodology. Drug Discov Today 2(11):457–467
6. Fujita T, Iwasa J, Hansch C (1964) A new substituent constant, π, derived from partition coefficients. J Am Chem Soc 86(23):5175–5180
7. Hansch C et al (1962) Correlation of biological activity of phenoxyacetic acids with Hammett substituent constants and partition coefficients. Nature 194(4824):178–180
8. Todeschini R, Consonni V (2009) Molecular descriptors for chemoinformatics. In: Mannhold R, Kubinyi H, Folkers G (eds) Methods and principles in medicinal chemistry, 2nd edn. Wiley-VCH, Weinheim
9. Cramer RD, Patterson DE, Bunce JD (1988) Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. J Am Chem Soc 110(18):5959–5967
10. Parretti MF et al (1997) Alignment of molecules by the Monte Carlo optimization of molecular similarity indices. J Comput Chem 18(11):1344–1353
11. Tominaga Y, Fujiwara I (1997) Novel 3D descriptors using excluded volume: application to 3D quantitative structure–activity relationships. J Chem Inf Comput Sci 37(6):1158–1161
12. Todeschini R, Lasagni M, Marengo E (1994) New molecular descriptors for 2D and 3D structures. Theory. J Chemom 8(4):263–272
13. Consonni V, Todeschini R, Pavan M (2002) Structure/response correlations and similarity/diversity analysis by GETAWAY descriptors. Part 1. Theory of the novel 3D molecular descriptors. J Chem Inf Comput Sci 42(3):682–692
14. Bursi R et al (1999) Comparative spectra analysis (CoSA): spectra as three-dimensional molecular descriptors for the prediction of biological activities. J Chem Inf Comput Sci 39(5):861–867
15. Turner DB et al (1999) Evaluation of a novel molecular vibration-based descriptor (EVA) for QSAR studies: 2. Model validation using a benchmark steroid dataset. J Comput Aided Mol Des 13(3):271–296
16. Gasteiger G et al (1996) Chemical information in 3D space. J Chem Inf Comput Sci 36(5):1030–1037
17. Balaban AT (1997) From chemical topology to three-dimensional geometry. Springer, New York
18. Bogdanov B, Nikolic S, Trinajstic N (1990) On the three-dimensional Wiener number: a comment. J Math Chem 5(3):305–306
19. Mekenyan O et al (1986) Modelling the interaction of small organic molecules with biomacromolecules. I. Interaction of substituted pyridines with anti-3-azopyridine antibody. Arzneim Forsch 36(2):176–183
20. Randić M (1995) Molecular profiles novel geometry-dependent molecular descriptors. New J Chem 19:781–791
21. Pearlman RS, Smith KM (1998) Novel software tools for chemical diversity. In: Kubinyi H, Folkers G, Martin YC (eds) 3D QSAR in drug design. Kluwer/ESCOM, Dordrecht, pp 339–353
22. Wolpert DH, Macready WG (1997) No free lunch theorems for optimization. IEEE Trans Evolut Comput 1(1):67–82
23. Sutherland JJ, O’Brien LA, Weaver DF (2004) A comparison of methods for modeling quantitative structure–activity relationships. J Med Chem 47(22):5541–5554
24. Cubillán N et al (2015) Novel global and local 3D atom-based linear descriptors of the Minkowski distance matrix: theory, diversity–variability analysis and QSPR applications. J Math Chem 53(9):2028–2064
25. Marrero-Ponce Y et al (2015) Optimum search strategies or novel 3D molecular descriptors: is there a stalemate? Curr Bioinf 10(5):533–564
26. García-Jacas CR et al (2014) N-linear algebraic maps to codify chemical structures: is a suitable generalization to the atom-pairs approaches? Curr Drug Metab 15(4):441–469
27. Edwards CH, Penney DE (1988) Elementary linear algebra. Prentice Hall, Englewood Cliffs
28. Johnson RW, Huang CH, Johnson JR (1991) Multilinear algebra and parallel programming. J Supercomput 5(2–3):189–217
29. Godden JW, Stahura FL, Bajorath J (2000) Variability of molecular descriptors in compound databases revealed by Shannon entropy calculations. J Chem Inf Comput Sci 40(3):796–800
30. Urias RWP et al (2015) IMMAN: free software for information theory-based chemometric analysis. Mol Divers 19(2):305–319
31. Somorjai RL (2010) Multivariate statistical methods. In: John L (ed) Encyclopedia of spectroscopy and spectrometry. Academic Press, Oxford, pp 1704–1709
32. Yap CW (2011) PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints. J Comput Chem 32(7):1466–1474
33. Georg H (2008) BlueDesc-molecular descriptor calculator. University of Tübingen, Tübingen
34. Hong H et al (2008) Mold2, molecular descriptors from 2D structures for chemoinformatics and toxicoinformatics. J Chem Inf Comput Sci 48(7):1337–1344
35. Mauri A et al (2006) DRAGON software: an easy approach to molecular descriptor calculations. Match 56(2):237–248
36. Steinbeck C et al (2003) The Chemistry Development Kit (CDK): an open-source Java library for chemo- and bioinformatics. J Chem Inf Comput Sci 43(2):493–500
37. Sinkhorn R, Knopp P (1967) Concerning nonnegative matrices and doubly stochastic matrices. Pacific J Math 21(2):343–348
38. Barigye SJ et al (2013) Shannon’s, mutual, conditional and joint entropy-based information indices. Generalization of global indices defined from local vertex invariants. Curr Comput Aided Drug Des 9(2):164–183
39. Barigye SJ et al (2013) Relations frequency hypermatrices in mutual, conditional and joint entropy-based information indices. J Comput Chem 34(4):259–274
40. Marrero-Ponce Y et al (2012) Derivatives in discrete mathematics: a novel graph-theoretical invariant for generating new 2/3D molecular descriptors. I. Theory and QSPR application. J Comput Aided Mol Des 26(11):1229–1246
41. García-Jacas CR et al (2014) QuBiLS-MIDAS: a parallel free-software for molecular descriptors computation based on multilinear algebraic maps. J Comput Chem 35(18):1395–1409
42. García-Jacas CR et al (2015) Multi-server approach for high-throughput molecular descriptors calculation based on multilinear algebraic maps. Mol Inform 34(1):60–69
43. Manchester J, Czerminski R (2008) SAMFA: simplifying molecular description for 3D-QSAR. J Chem Inf Model 48(6):1167–1173
44. Hinselmann G et al (2011) jCompoundMapper: An open source Java library and command-line tool for chemical fingerprints. J Cheminform 3(1):3
45. Tosco P, Balle T (2011) A 3D-QSAR-driven approach to binding mode and affinity prediction. J Chem Inf Model 52(2):302–307
46. Klamt A et al (2012) COSMOsar3D: molecular field analysis based on local COSMO σ-profiles. J Chem Inf Model 52(8):2157–2164
47. Bonachéra F, Horvath D (2008) Fuzzy tricentric pharmacophore fingerprints. 2. Application of topological fuzzy pharmacophore triplets in quantitative structure–activity relationships. J Chem Inf Model 48(2):409–425
48. Leardi R, Boggia R, Terrile M (1992) Genetic algorithms as a strategy for feature selection. J Chemom 6(5):267–281
49. Todeschini R et al (2003) MobyDigs: software for regression and classification models by genetic algorithms. In: Leardi R (ed) Nature-inspired methods in chemometrics: genetic algorithms and artificial neural networks. Elsevier, Amsterdam, pp 141–167
50. Wu CFJ (1986) Jackknife, bootstrap and other resampling methods in regression analysis. Ann Stat 14(4):1261–1295
51. Lindgren F et al (1996) Model validation by permutation tests: applications to variable selection. J Chemom 10(5–6):521–532
52. Elisseeff A, Pontil M (2003) Leave-one-out error and stability of learning algorithms with applications. NATO science series sub series III computer and systems sciences, vol 190, pp 111–130
53. Lilliefors HW (1967) On the Kolmogorov–Smirnov test for normality with mean and variance unknown. J Am Stat Assoc 62(318):399–402
54. Shapiro SS, Wilk MB (1965) An analysis of variance test for normality (complete samples). Biometrika 52(3/4):591–611
55. Friedman M (1940) A comparison of alternative tests of significance for the problem of m rankings. Ann Math Stat 11(1):86–92
56. Siegel S (1957) Nonparametric statistics. Am Stat 11(3):13–19
57. Wilcoxon F (1945) Individual comparisons by ranking methods. Biometrics 1(6):80–83
58. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc A 57(1):289–300
59. Hechinger M, Leonhard K, Marquardt W (2012) What is wrong with quantitative structure-property relations models based on three-dimensional descriptors? J Chem Inf Model 52(8):1984–1993
60. Miteva MA, Guyon F, Tufféry P (2010) Frog2: efficient 3D conformation ensemble generator for small compounds. Nucleic Acids Res 38(suppl 2):W622–W627
61. RDKit: cheminformatics and machine learning software. February 2, 2016; http://www.rdkit.org/
62. Vainio MJ, Johnson MS (2007) Generating conformer ensembles using a multi-objective genetic algorithm. J Chem Inf Model 47(6):2462–2474
63. O’Boyle N et al (2011) Open Babel: an open chemical toolbox. J Cheminform 3(1):33
64. Standardizer ChemAxon 5.9.0. February 2, 2016. https://www.chemaxon.com/products/standardizer/
65. Ebejer JP, Morris GM, Deane CM (2012) Freely available conformer generation methods: how good are they? J Chem Inf Model 52(5):1146–1158