QuBiLS-MAS, open source multi-platform software for atom- and bond-based topological (2D) and chiral (2.5D) algebraic molecular descriptors computations
- José R. Valdés-Martiní^{1},
- Yovani Marrero-Ponce^{2, 3, 4, 5, 6},
- César R. García-Jacas^{7, 8, 9},
- Karina Martinez-Mayorga^{7},
- Stephen J. Barigye^{10},
- Yasser Silveira Vaz d‘Almeida^{11},
- Hai Pham-The^{12},
- Facundo Pérez-Giménez^{6} and
- Carlos A. Morell^{13}
Received: 8 November 2016
Accepted: 7 April 2017
Published: 7 June 2017
Abstract
Background
In previous reports, Marrero-Ponce et al. proposed algebraic formalisms for characterizing topological (2D) and chiral (2.5D) molecular features through atom- and bond-based ToMoCoMD-CARDD (acronym for Topological Molecular Computational Design-Computer Aided Rational Drug Design) molecular descriptors (MDs). These MDs codify molecular information based on the bilinear, quadratic and linear algebraic forms and the graph-theoretical electronic-density and edge-adjacency matrices in order to consider atom- and bond-based relations, respectively. These MDs have been successfully applied in the screening of chemical compounds for different therapeutic applications, including antimalarials, antibacterials and tyrosinase inhibitors, among others. To compute these MDs, a computational program with the same name was initially developed. However, this in-house software offered few of the functionalities required in contemporary molecular modeling tasks and had inherent limitations that made it impractical to use. Therefore, the present manuscript introduces the QuBiLS-MAS (acronym for Quadratic, Bilinear and N-Linear mapS based on graph-theoretic electronic-density Matrices and Atomic weightingS) software designed to compute topological (0–2.5D) molecular descriptors based on bilinear, quadratic and linear algebraic forms for atom- and bond-based relations.
Results
The QuBiLS-MAS module was designed as standalone software, in which extensions and generalizations of the former ToMoCoMD-CARDD 2D-algebraic indices are implemented, considering the following aspects: (a) two new matrix normalization approaches based on double-stochastic and mutual probability formalisms; (b) topological constraints (cut-offs) to take into account particular inter-atomic relations; (c) six additional atomic properties to be used as weighting schemes in the calculation of the molecular vectors; (d) four new local-fragments to consider molecular regions of interest; (e) number of lone-pair electrons in chemical structure defined by diagonal coefficients in matrix representations; and (f) several aggregation operators (invariants) applied over atom/bond-level descriptors in order to compute global indices. This software supports the parallel computation of the indices and includes a batch processing module and data curation functionalities. This program was developed in Java v1.7 using the Chemistry Development Kit library (version 1.4.19). The QuBiLS-MAS software consists of two components: a desktop interface (GUI) and an API library, the latter of which can be easily integrated into chemoinformatics applications. The relevance of the novel extensions and generalizations implemented in this software is demonstrated through three studies. Firstly, a comparative Shannon’s entropy based variability study for the proposed QuBiLS-MAS and the DRAGON indices demonstrates superior performance for the former. Secondly, a principal component analysis reveals that the QuBiLS-MAS approach captures chemical information orthogonal to that codified by the DRAGON descriptors. Lastly, a QSAR study for the binding affinity to the corticosteroid-binding globulin using Cramer’s steroid dataset is carried out.
Conclusions
From these analyses, it is revealed that the QuBiLS-MAS approach for atom-pair relations yields similar-to-superior performance with regard to other QSAR methodologies reported in the literature. Therefore, the QuBiLS-MAS approach constitutes a useful tool for the diversity analysis of chemical compound datasets and high-throughput screening of structure–activity data.
Background
The codification of chemical information using mathematical–computational methods to accelerate small-molecule drug discovery constitutes one of the fundamental tasks of mathematical chemistry [1, 2]. In recent years, the number and diversity of molecular features, also known as molecular descriptors (MDs), have significantly increased, and corresponding educational and commercial computational implementations have been developed [3–9]. The absence of an ultimate universal chemical descriptor emphasizes the need to define alternative methods to codify relevant and orthogonal chemical information.
In previous reports, Marrero-Ponce et al. proposed algebraic formalisms for characterizing topological (2D) and chiral (2.5D) molecular features through atom- and bond-based ToMoCoMD-CARDD (acronym for Topological Molecular Computational Design-Computer Aided Rational Drug Design) molecular descriptors [10–13]. These MDs codify molecular information based on the bilinear, quadratic and linear algebraic forms and the graph-theoretical electronic-density and edge-adjacency matrices in order to consider atom- and bond-based relations, respectively. The ToMoCoMD-CARDD MDs have been successfully applied in the screening of chemical compounds for different therapeutic applications, including antimalarials [14], trichomonacidals [15, 16], antitrypanosomals [17], paramphistomicides [18], antibacterials [19], tyrosinase inhibitors [20, 21] and others [22, 23]. To compute these descriptors, a computational program with the same name was developed. However, this software offered few of the functionalities required in contemporary molecular modeling tasks and had inherent limitations that made it impractical to use. For instance: (a) it did not support standard input formats (e.g. MDL MOL/SDF files), and the only input method for chemical structures was the sketching of molecular pseudographs in a built-in manual drawing mode; (b) parameter configurations could not be exported or saved for later experiments; (c) it offered no option for batch processing of descriptors; (d) it lacked parallel computing functionality, which is needed to exploit current multi-core architectures; (e) it could not be used as a standalone library, thus preventing its integration in other applications; and (f) it labeled the descriptors ambiguously in the output file.
In addition, in several mathematical procedures employed to compute MDs (e.g. GT-STAF [24, 25], DIVATI [26] and QuBiLS-MIDAS [27–30]), the molecules are not analyzed as a whole, that is, these are partitioned in order to univocally characterize each atom independently. In this way, several mathematical operators (also known as aggregation operators) may be applied over the atom-level indices to compute different global/local MDs. The use of several aggregation operators is based on the idea that the most suitable global definition of a system may not necessarily be additive. In fact, it is reported in the literature that operators other than the sum could yield better correlations with determined chemical properties [24–28]. In this sense, in the present report strategies are defined to generalize the procedure of obtaining global or local QuBiLS-MAS (acronym for Quadratic, Bilinear and N-Linear mapS based on graph-theoretic electronic-density Matrices and Atomic weightingS) indices using the so-called aggregation operators. Moreover, several new atom-based properties, chemical local-fragments (e.g. terminal methyl groups), distance-based cut-offs (for the analysis of the most important non-covalent or covalent interactions) and probabilistic transformations of the matrix representations are introduced. Furthermore, initiatives to deal with the computational and practical limitations inherent to the original ToMoCoMD-CARDD program were carried out, with the ultimate goal of improving its applicability in present-day cheminformatics tasks.
Theoretical scaffold: past and present
Brief history of algebraic maps-based indices
In addition, local-fragment (group or atom-type) quadratic, bilinear and linear atom/bond-based indices can be defined to characterize a predetermined molecular fragment (F) instead of the whole molecule (total indices). These are computed using the kth local-fragment matrix \({}_{F}{\boldsymbol{\mathcal{M}}}^{k}\) \(({}_{F}{\boldsymbol{\mathcal{E}}}^{k} )\), which is computed from the corresponding kth total matrix \({\boldsymbol{\mathcal{M}}}^{k}\) (\({\boldsymbol{\mathcal{E}}}^{\varvec{k}}\)) considering only those vertices (or edges) belonging to the selected molecular fragment. These fragments F may be heteroatoms (X), halogens (G) and H-bond donors (N or O atoms sharing a bond with an H-atom, labeled as D) [10, 34, 36]. Thus, NS and SS local-fragment atom/bond-based bilinear, quadratic and linear indices can be computed using the \({}_{F}{\boldsymbol{\mathcal{M}}}^{k}\) and \(_{F} {\boldsymbol{\mathcal{E}}}^{k}\) local-fragment matrices instead of the corresponding total matrices in the Eqs. 3 and 4.
It is important to remark that for each partitioning of a molecule into Z exclusive molecular fragments, there will be Z local-fragment matrices. In this case, the original kth power of matrix \({\boldsymbol{\mathcal{M}}}_{{\varvec{ns},\varvec{ss}}}^{k}\) (or \({\boldsymbol{\mathcal{E}}}_{{\varvec{ns},\varvec{ss}}}^{k}\)) is exactly the sum of the kth power of the local-fragment matrices. Consequently, the total algebraic form-based indices are the sum of the exclusive contributions of the respective local-fragment algebraic form-based indices, as long as there is no overlap among the fragments. Taking the previous elements into consideration, the next sections address in detail the improvements to the mathematical definition of the 2D algebraic indices introduced by Marrero-Ponce et al. [10, 31, 32, 43, 44].
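The additivity over exclusive fragments can be checked numerically. The sketch below is illustrative only: the text above does not spell out how a local-fragment matrix is derived from the total matrix, so the code assumes a rule in which an entry is kept whole when both atoms belong to the fragment, halved when exactly one does, and set to zero otherwise. The function names, the toy matrix and the property vector are all hypothetical.

```python
def fragment_matrix(total, fragment):
    """Local-fragment matrix under the ASSUMED rule: keep m_ij when both
    atoms are in the fragment, halve it when exactly one is, zero otherwise."""
    n = len(total)
    weight = {0: 0.0, 1: 0.5, 2: 1.0}
    return [[weight[(i in fragment) + (j in fragment)] * total[i][j]
             for j in range(n)] for i in range(n)]


def quadratic_index(matrix, x):
    """Quadratic form x^T M x for a property vector x."""
    n = len(x)
    return sum(x[i] * matrix[i][j] * x[j] for i in range(n) for j in range(n))


# Toy 4-atom total matrix and property vector (made-up values).
total = [[1.0, 1.0, 0.0, 0.0],
         [1.0, 0.0, 1.0, 1.0],
         [0.0, 1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 2.0]]
x = [2.0, 1.0, 3.0, 0.5]

# Z = 3 exclusive (non-overlapping) fragments covering every atom.
fragments = [{0, 1}, {2}, {3}]
local = [fragment_matrix(total, f) for f in fragments]

# Entry-wise sum of the local-fragment matrices.
summed = [[sum(m[i][j] for m in local) for j in range(4)] for i in range(4)]

total_index = quadratic_index(total, x)
local_indices = [quadratic_index(m, x) for m in local]
```

Under this assumption, `summed` reproduces `total` and the local quadratic indices add up to `total_index`. With overlapping fragments the same entry would be counted more than once, which is why the non-overlap condition matters.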
The QuBiLS-MAS MDs: new definitions, generalization and extension of algebraic indices
The coefficients \(\fancyscript{m}_{ij}^{a,k}\) (see Eq. 5) are the elements of the kth NS (or SS) total atom-level pseudograph-theoretic electronic-density matrix [NS(SS)-GEDM] \({\boldsymbol{\mathcal{M}}}^{a,k}\) for atom “a”, while the entries \(e_{ij}^{e,k}\) (see Eq. 6) belong to the kth NS (or SS) total bond-level edge-adjacency matrix [NS(SS)-EAM] \({\boldsymbol{\mathcal{E}}}^{e,k}\) for bond “e”. These atom/bond-level coefficients are obtained from the entries \(\fancyscript{m}_{ij}^{k}\) of the \({\boldsymbol{\mathcal{M}}}^{k}\) total matrix and \(e_{ij}^{k}\) of the \({\boldsymbol{\mathcal{E}}}^{k}\) total matrix, respectively, using the described procedure to compute local-fragment matrices but considering the fragment F as an atom “a” or bond “e” of the molecule. Moreover, the diagonal coefficients \(\fancyscript{m}_{ii}^{1}\) could take two distinct kinds of values in order to achieve greater discrimination of molecular structures: (1) aromatic ring sensitivity, for setting up aromatic atoms hooked on full aromatic rings instead of mapping individual atom loops as shown in the molecular pseudograph of Table 1, and/or (2) the number of lone-pairs for each atom. The \(e_{ii}^{1}\) entries are always zero.
It is important to highlight that, as an extension of the former ToMoCoMD 2D-MDs, several local-fragments have been added, so that the following are now available: H-bond acceptors (A), carbon atoms in aliphatic chains (C), H-bond donors (D), halogens (G), terminal methyl groups (M), carbon atoms in an aromatic portion (P) and heteroatoms (X). Thus, from these local-fragments the kth NS (or SS) local-fragment atom-level pseudograph-theoretic electronic-density matrices \(_{F} {\boldsymbol{\mathcal{M}}}^{a,k}\) for atom “a” and the kth NS (or SS) local-fragment bond-level edge-adjacency matrices \(_{F} {\boldsymbol{\mathcal{E}}}^{e,k}\) for bond “e” may be computed. Consequently, local-fragment atom- and bond-level bilinear, quadratic and linear indices are determined from Eqs. 5 and 6 using \(_{F} {\boldsymbol{\mathcal{M}}}^{a,k}\) and \(_{F} {\boldsymbol{\mathcal{E}}}^{e,k}\) as matrix forms, respectively. Note that the coefficients \(_{F} \fancyscript{m}_{ij}^{a,k} \in\,{_{F} {\boldsymbol{\mathcal{M}}}^{a,k}}\) and \(_{F} e_{ij}^{e,k} \in\,{_{F} {\boldsymbol{\mathcal{E}}}^{e,k}}\) are calculated from the elements \(_{F} \fancyscript{m}_{ij}^{k} \in\,{_{F} {\boldsymbol{\mathcal{M}}}^{k}}\) and \(_{F} e_{ij}^{k} \in\,{_{F} {\boldsymbol{\mathcal{E}}}^{k}} ,\) respectively.
Lastly, in order to obtain the global kth total (or local-fragment) bilinear, quadratic and linear indices from the corresponding atom-level (\({\boldsymbol{\mathcal{L}}}_{a}\)) or bond-level (\({\boldsymbol{\mathcal{L}}}_{e}\)) definitions, the summation operator is used. The global indices obtained using this operator over components of vector \({\boldsymbol{\mathcal{L}}}\) coincide with those indices calculated through the original procedure vector–matrix–vector detailed in Eqs. 3 and 4. Note that the summation operator is equivalent to the Manhattan norm applied to elements of the vector \({\boldsymbol{\mathcal{L}}}\) relative to the origin, which is in turn a specific case of Minkowski norm when p = 1. Motivated by this understanding, a generalization in which different p values are used, i.e. p = 2 and 3, where the former (p = 2) is the Euclidean norm (see Additional file 1: Figure SI1 for geometrical interpretation) was introduced. Additionally, other operators (see Additional file 1: Table SI2) applicable to the vector of LOVEIs were applied with the aim of generalizing the use of the linear combination to obtain global indices. It has been demonstrated in several reports [24–28] that better correlations for bioactivities may be attained when operators other than the sum are employed.
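As a minimal sketch of the Minkowski-norm generalization, the snippet below aggregates a vector of atom-level contributions with p = 1, 2 and 3. The function name and the example values are hypothetical, and the other operators listed in Additional file 1: Table SI2 are not reproduced here.

```python
def minkowski_norm(contributions, p):
    """Minkowski norm of order p of the atom-level (or bond-level) vector."""
    return sum(abs(v) ** p for v in contributions) ** (1.0 / p)


# Made-up atom-level contributions for a 4-atom molecule.
atom_level = [0.5, 1.5, 2.0, 1.0]

n1 = minkowski_norm(atom_level, 1)  # Manhattan norm
n2 = minkowski_norm(atom_level, 2)  # Euclidean norm
n3 = minkowski_norm(atom_level, 3)

plain_sum = sum(atom_level)
```

For non-negative contributions, as in this example, `n1` coincides with `plain_sum`. If some contributions were negative, the Manhattan norm (which uses absolute values) and the plain summation would differ, so the equivalence should be read with that condition in mind.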
Neighborhood topological constraints in the graph-theoretical electronic-density and edge-adjacency matrix
The \(_{(F)} {\boldsymbol{\mathcal{M}}}^{k}\) and \(_{(F)} {\boldsymbol{\mathcal{E}}}^{k}\) matrices contain information on the connectivity for all atoms and bonds that constitute a molecule, respectively. However, some biological properties do not depend on the chemical structure as a whole but rather on interactions at particular topological distances, for example, short-, middle- and long-range contacts. Thus, with the aim of considering interactions that satisfy specific topological criteria, three graph-theoretical constraints (cut-offs) are introduced: (1) keeping only the diagonal elements of the matrix, denoted as “Self-Returning Walks” (SRW), (2) keeping only the off-diagonal elements of the matrix, denoted as “Non-Self-Returning Walks” (NSRW), and (3) keeping only the elements within a given interval, based on the topological distance for a path cut-off, denoted as Lag p.
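The three cut-offs can be pictured as masks applied to a matrix. The following sketch assumes that entries failing the criterion are simply set to zero and that the Lag p interval is given as inclusive bounds on the topological distance (number of bonds on the shortest path). All function names and the toy data are hypothetical and are not taken from the QuBiLS-MAS library.

```python
from collections import deque


def topological_distances(adjacency):
    """Shortest-path lengths (in bonds) between all atom pairs, via BFS."""
    n = len(adjacency)
    dist = [[None] * n for _ in range(n)]
    for start in range(n):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adjacency[u][v] and dist[start][v] is None:
                    dist[start][v] = dist[start][u] + 1
                    queue.append(v)
    return dist


def apply_cutoff(matrix, keep):
    """Zero every entry (i, j) for which keep(i, j) is false."""
    n = len(matrix)
    return [[matrix[i][j] if keep(i, j) else 0.0 for j in range(n)]
            for i in range(n)]


def srw(matrix):
    """Self-Returning Walks: diagonal entries only."""
    return apply_cutoff(matrix, lambda i, j: i == j)


def nsrw(matrix):
    """Non-Self-Returning Walks: off-diagonal entries only."""
    return apply_cutoff(matrix, lambda i, j: i != j)


def lag(matrix, dist, low, high):
    """Lag p: entries whose topological distance lies in [low, high]."""
    return apply_cutoff(
        matrix,
        lambda i, j: dist[i][j] is not None and low <= dist[i][j] <= high)


# Toy example: a 4-atom chain 0-1-2-3 and a dense made-up matrix.
adjacency = [[0, 1, 0, 0],
             [1, 0, 1, 0],
             [0, 1, 0, 1],
             [0, 0, 1, 0]]
matrix = [[float(i + j + 1) for j in range(4)] for i in range(4)]
dist = topological_distances(adjacency)

diagonal_only = srw(matrix)
off_diagonal_only = nsrw(matrix)
long_range = lag(matrix, dist, 2, 3)  # keeps pairs two or three bonds apart
```

In this sketch the SRW and NSRW masks are complementary, so their results add back up to the unconstrained ("keep all") matrix.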
The QuBiLS-MAS module
The QuBiLS-MAS module was designed as standalone software, with the extensions and generalizations discussed in “The QuBiLS-MAS MDs: new definitions, generalization and extension of algebraic indices” section. This software was developed in Java v1.7 and the Chemistry Development Kit (CDK) library (version 1.4.19) [9] was used in the manipulation of the chemical structures, as well as in determining the atom- and fragment-based chemical properties involved in the calculation process. The QuBiLS-MAS software comprises a front-end and a back-end. The front-end is composed of a desktop and a command-line user interface, while the back-end is developed as an Application Programming Interface (API) to enable its use as an independent Java library in the development of other cheminformatics applications or in the implementation of other user-friendly interfaces, either graphical or command-line based. With these two components, independence between the software presentation layer and the processing logic implemented in the back-end is achieved; thus, a modification in the latter does not force changes in the front-end (GUI), and vice versa.
Back-end: the QuBiLS-MAS molecular descriptors library-computational complexity of algorithms
All the requests performed by the users through the GUI are processed by the QuBiLS-MAS library. This component is structured in packages according to the goals of the functionalities (see Additional file 1: Figure SI3 for UML diagram). The main package is tomocomd.cardd.qubils, which contains the packages descriptors, matrices, tools and workers that encapsulate the main concepts utilized in the definition of the QuBiLS-MAS MDs. The descriptors package includes the classes related to the calculation of the total and local-fragment bilinear, quadratic and linear algebraic maps. The matrices package contains the objects responsible for building the pseudograph-theoretic electronic-density matrix and the edge-adjacency matrix corresponding to atom- and bond-based representations, respectively. Additionally, the simple-stochastic, double-stochastic and mutual probability normalization strategies, as well as the topological constraints (cut-offs), are defined in this package. The tools package includes classes for the identification of the local-fragments, as well as the considered aggregation operators. Lastly, the workers package comprises the classes for the configuration and control of the algebraic MDs calculation process.
The algorithms responsible for performing the multiplication based on bilinear, quadratic and linear algebraic forms constitute the principal procedures to compute the QuBiLS-MAS indices. In its direct form, this procedure consists of a loop that iterates over each atom (or bond) of the molecule to determine the corresponding atom- or bond-level matrix. Next, the atom/bond-level matrices are multiplied by the corresponding property vectors in order to obtain the atom/bond-level indices. The corresponding sequential implementations have a computational complexity of \(O(n^{3} ).\) Nonetheless, when the atom/bond-level matrices are computed according to this procedure, the only entries with values different from zero are those that involve the atom (or bond) with respect to which the matrix is built. Therefore, instead of iterating over each atom to build the atom/bond-level matrix that is subsequently used to determine the corresponding index, it is more suitable to compute the atom/bond-level indices at the same time as the original matrix is traversed. Taking this into account, the algorithms have been optimized to a lower polynomial order, achieving a complexity of \(O(n^{2} )\) in the computation of the atom/bond-based contributions for the QuBiLS-MAS indices.
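The difference between the two strategies can be sketched for the quadratic form. The code is not taken from the QuBiLS-MAS library. It assumes that the atom-level matrix for atom a keeps the diagonal entry m_aa whole, halves the remaining entries of row a and column a, and zeroes everything else (that is, the local-fragment rule with the fragment reduced to a single atom, which is an assumption here). Function names and data are hypothetical.

```python
def atom_level_matrix(total, a):
    """ASSUMED atom-level matrix: only row a and column a are non-zero."""
    n = len(total)
    m = [[0.0] * n for _ in range(n)]
    for j in range(n):
        m[a][j] = 0.5 * total[a][j]
        m[j][a] = 0.5 * total[j][a]
    m[a][a] = total[a][a]
    return m


def naive_atom_indices(total, x):
    """O(n^3): build one n x n matrix per atom, then evaluate x^T M_a x."""
    n = len(x)
    indices = []
    for a in range(n):
        m = atom_level_matrix(total, a)
        indices.append(sum(x[i] * m[i][j] * x[j]
                           for i in range(n) for j in range(n)))
    return indices


def fast_atom_indices(total, x):
    """O(n^2): a single pass over the total matrix, sharing each
    off-diagonal term equally between the two atoms involved."""
    n = len(x)
    indices = [0.0] * n
    for i in range(n):
        for j in range(n):
            term = x[i] * total[i][j] * x[j]
            if i == j:
                indices[i] += term
            else:
                indices[i] += 0.5 * term
                indices[j] += 0.5 * term
    return indices


# Toy 4-atom total matrix and property vector (made-up values).
total = [[1.0, 1.0, 0.0, 0.0],
         [1.0, 0.0, 1.0, 1.0],
         [0.0, 1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 2.0]]
x = [2.0, 1.0, 3.0, 0.5]

slow = naive_atom_indices(total, x)
fast = fast_atom_indices(total, x)
global_index = sum(x[i] * total[i][j] * x[j]
                   for i in range(4) for j in range(4))
```

Both routines return the same atom-level values, and their sum recovers the global quadratic index obtained from the vector–matrix–vector product, which is the property the summation operator relies on.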
Graphic user interface of the QuBiLS-MAS software
In the “Algebraic Form” panel, the specific algebraic maps to be used in the computation of the MDs are chosen according to the option selected in the “Constraints” panel, which may be atom-based or bond-based. Chirality detection may also be configured in the “Constraints” panel. The matrix normalization formalisms (MP, NS, SS, and DS) used in the algebraic forms are configured in the “Matrix Form” panel, as well as the maximum order (k value) to which the coefficients of the matrices are raised. In the “Cut-Off” panel the option to “keep all” (KA) atomic interactions is selected by default, but other options [i.e. “Self-Returning Walks” (SRW), “Non-Self-Returning Walks” (NSRW) and/or the value-rank(s) of threshold p] may be selected to take into account only the interactions that satisfy the established criterion. The “Local-Fragments” panel contains the options to configure the seven chemical groups (or atom-types) that may be employed to compute either the total or local-fragment indices. Likewise, in the “Properties” panel the atomic properties used to set up different weighting schemes are chosen. Finally, the mathematical operators used to compute the global total or local indices from the atomic contributions are selected in the “Invariants” panel.
It is important to highlight that the selected options to compute the descriptors can be exported into an XML configuration file, called the project file, which can be used to calculate the same QuBiLS-MAS indices for other datasets when the software is run again. Another important feature is that the software can be executed on computer clusters using a command-line interface, which uses the project files to obtain the configuration of the indices to be computed. Also, the QuBiLS-MAS software has incorporated the “On/Off H-Atoms” option to consider (or not) the H-atoms during the calculation, the “On/Off Lone-Pair Electron” option to consider (or not) the number of lone-pairs for heteroatoms and the “Show Debug Report” option to track the algebraic processes that take place during the calculation (see Additional file 1: SI4).
The supported input file format for the chemical structures to be analyzed is the MDL MOL/SDF format and these are sequentially read in order to employ suitable memory allocation according to the size of the molecule. Moreover, the path of the output file may be specified where the values of the computed MDs are saved. To this end, the QuBiLS-MAS software supports the following output file formats: CSV, ARFF, and TXT (space- and tab-separated ASCII format) which are easily interpretable in popular statistical and/or machine learning software.
Comparison between the old software (TOMOCOMD) and the new one proposed in this report (QuBiLS-MAS)
Features | TOMOCOMD | QuBiLS-MAS |
---|---|---|
Description level | ||
Theoretical | ||
Algebraic form maps | 3 (quadratic, bilinear and linear) | 3 (quadratic, bilinear and linear) |
Atom and Bond level | Yes | Yes |
Matrices | 2 (NS, SS) | 4 (NS, SS, DS, MP) |
Atom Weightings | 4 (M, V, P, E) | 10 (M, V, P, E, A, C, PSA, R, H, S) |
Local-fragments | 3 (D, G, X) | 7 (A, C, D, G, M, P, X) |
Chirality | YES, \({\mathfrak{c}}\) = ±1 | YES, extended to \({\mathfrak{c}}\) = ±0.25 to ±3 with a 0.25 step |
Lone-pair electrons | – | Yes |
Topological constraints | – | Yes, three cut-off types (SRW, NSRW, Lag P) |
H-atoms consideration | – | Yes, permits inclusion or removal |
Invariants or aggregation operators | – | Yes, 21 aggregation operators classified in four major groups |
Computational | ||
Open source | – | Yes, under LGPL |
Availability | Shareware | Freeware |
Programming language | Borland Delphi | Java |
Clear Object-oriented source code design | – | Yes |
Canonical namespace packages structure | – | Yes, under com.tomocomd.qublis. |
Target operating system(OS) | Microsoft Windows | Platform-independent |
Graphical user interface | Yes | Yes |
Command line | – | Yes |
Portable MDs library | – | Yes, as pre-compiled Java JAR file |
Supported input format | In-house file format | mol/sdf MDL |
Output format | Text File (TSV) | Text File (TSV, SSV, CSV), Weka (ARFF) |
Structure curation and cleaning | – | Yes, available under Structure menu item (with 10 check/cleaning tasks, H-atoms handling, and function for chemical formats conversion) |
Built-in example data | – | Yes, six chemical datasets |
Unique MD header | – | Yes, identifying the codification scheme |
Batch Processing mode | – | Yes |
Parallelized computing | – | Yes, using the Fork/Join framework |
Configurable projects | – | Yes |
Import/export configuration | – | Yes, using a XML file format |
Calculation progress | – | Yes, for descriptors and molecules |
Real-time memory monitor | – | Yes, with garbage collection option when desired |
Events logging | – | Yes, accessible through the History Tab |
Calculation report | – | Yes |
Runtime help accessibility | – | Yes |
User owner’s manual | – | Yes |
Online webpage | – |
Main features of commonly used tools for molecular descriptors (MDs) calculations
Software | Number of types of MDs | Configuration of MDs parameters | Advantages | Disadvantages | Additional remarks and online reference |
---|---|---|---|---|---|
QuBiLS-MAS v1.0 | 2080 (linear, quadratic and bilinear) | 1. Atom- or Bond-Based; 2. Atomic properties; 3. Local-fragments; 4. Matrix approaches; 5. Aggregation operators; 6. Add (or remove) hydrogen atoms; 7. Consider lone-pair electrons | 1. Computes MDs based on algebraic maps; 2. 10 atom weighting schemes; 3. Graphic user-friendly interface and command-line interface; 4. Platform-independency; 5. Supports any organic molecules; 6. Free download and support; 7. Batch mode processing; 8. Data cleaning module; 9. Parallel processing | 1. Only accepts MDL files (MOL or SDF) as input formats | 1. Uses CDK to read molecular files and calculate atomic properties; 2. Requires Java JRE 1.7 or above; http://www.tomocomd.com |
PaDEL-Descriptor v2.0 | 43 | None | 1. Graphic user interface; 2. Fully cross-platform; 3. Command line interface; 4. Free and Open Source; 5. Accepts multiple file formats (>90 formats); 6. Parallel processing | 1. One functionality for data cleaning tasks (remove salts); 2. No MDs batch processing | 1. Uses CDK to read molecular files and calculate most of the descriptors and fingerprints; 2. Employs Java Web Start technology |
DRAGON v6.0 | 29 | 1. Predefined atom weighting schemes; 2. Selection of single molecular descriptors included in the different blocks | 1. Graphic user-friendly interface; 2. Command line interface; 3. Batch mode processing; 4. Supports any organic molecules; 5. Accepts the formats: MDL, Sybyl, HyperChem, Macromodel, Smiles, CML and HyperChem | 1. Only Windows and Linux platforms; 2. No parallel processing; 3. No data cleaning functionalities; 4. Does not allow selection of local-fragments; 5. Commercial cost | Academic permanent license: 900 euros (to be installed on 3 PCs) |
CDK Descriptor Calculator v1.3.9 | 48 | 1. Add (or remove) hydrogen atom | 1. Graphic user interface; 2. Command line execution; 3. Fully cross-platform; 4. Free software; 5. Batch mode processing | 1. Only accepts MDL files (MOL or SDF) as input formats; 2. No data cleaning functionalities; 3. Does not allow selection of local-fragments; 4. Does not allow selection of atom weighting schemes | Uses CDK library and requires JRE 1.6 |
BlueDesc | 36 | None | 1. Free and Open Source; 2. Fully cross-platform | 1. No graphic user interface; 2. Only accepts MDL files (MOL or SDF) as input formats; 3. No parallel processing; 4. No data cleaning functionalities; 5. Does not allow selection of local-fragments; 6. Does not allow selection of atom weighting schemes | Uses CDK and JOELib2 library and requires Java JRE 1.6; http://www.ra.cs.uni-tuebingen.de/software/bluedesc/welcome_e.html |
MODEL | 98 | None | 1. Web-based graphic user interface; 2. Accepts the formats: PDB, MDL, MOL2, COR | 1. No parallel processing; 2. No data cleaning tasks; 3. Does not allow selection of local-fragments; 4. Does not allow selection of atom weighting schemes; 5. For academic purposes only | Use of MODEL for commercial purposes is not allowed |
Mold2 | 20 | None | 1. Command line interface; 2. Free of charge download request | 1. No graphic user interface; 2. Only Windows platform; 3. Only accepts SDfile format; 4. No parallel processing; 5. No data cleaning functionalities; 6. Does not allow selection of local-fragments; 7. Does not allow selection of atom weighting schemes | http://www.fda.gov/ScienceResearch/BioinformaticsTools/Mold2/ucm144528.htm |
MOE | – | None | 1. Graphic user interface; 2. Command line interface; 3. Data cleaning tasks; 4. Fully cross-platform | 1. Only accepts SDfile format; 2. No parallel processing; 3. Does not allow selection of local-fragments; 4. Does not allow selection of atom weighting schemes | |
VolSurf | 22 | None | 1. Graphic user interface; 2. Command line interface; 3. Accepts several formats: MDL SDF, Sybyl, Mol2, Multi Mol2, GRID kout | 1. Commercial; 2. Only Linux platform; 3. Only computes 2D MDs; 4. No parallel processing; 5. Does not allow selection of local-fragments; 6. Does not allow selection of atom weighting schemes | |
Adriana.Code | 5 | None | 1. Graphic user interface; 2. Command line interface; 3. Batch mode processing; 4. Accepts any organic molecule; 5. Several input and output formats | 1. Commercial; 2. Only Windows and Linux platforms; 3. No parallel processing; 4. No data cleaning functionalities; 5. Does not allow selection of local-fragments; 6. Does not allow selection of atom weighting schemes | A demo version is available on request free of charge |
CODESSA PRO | 8 | None | 1. Graphic user interface | 1. Commercial; 2. Only for Windows platform; 3. No parallel processing; 4. No batch mode processing; 5. Does not allow selection of local-fragments; 6. Does not allow selection of atom weighting schemes | |
PowerMV | – | None | 1. Graphic user interface | 1. Only for Windows platform; 2. No parallel processing; 3. No batch mode processing; 4. Does not allow selection of local-fragments; 5. Does not allow selection of atom weighting schemes | Requires Microsoft .NET 1.1 or above |
Molconn-Z v4.10 | 79 | | Multi-platform (SGI Irix, Linux, Solaris, Mac OS-X and Windows); 12 months free support | No GUI; Commercial | Minimum price US$750 for a Single Educational Node/User license |
Pre ADMET Descriptor | 34 | | GUI; free web-based limited application and commercial PC version; maintenance and upgrade free of charge | Commercial; runs on Windows; only accepts MDL files (MOL or SDF) as input formats | Requires Microsoft .NET Framework 2.0; minimum price is US$1 000 for a 1 year Academic license |
Toxicity Estimation Software Tool (T.E.S.T.) v4.1 | 13 (628) | | GUI; Open source and multi-platform | Platform specific distributions; only accepts MOL or SMILES as input formats | Based on CDK library; requires Java JRE 1.6 |
ADAPT | 27 | | Non-Commercial | Runs on Unix; heavy-atom limitation of up to 255 atoms; only accepts MOL as input format | Written in Fortran and installed on a DEC alpha workstation |
ChemAxon Calculator Plugins v5.11 | 12 | 27 | Free for non-commercial, freely accessible web pages; GUI, batch execution from command line; Multi-platform (Windows, HP, MacOS X, Solaris and Linux) | | http://www.chemaxon.com/marvin/help/calculations/calculator-plugins.html |
JOELib2 | 40 | | Free, Open Source, Redistributable; Multi-platform | | http://www.ra.cs.uni-tuebingen.de/software/joelib/introduction.html |
TOPS-MODE & MODes Lab | Several (mainly edge-based) topological indices | | GUI; Non-Commercial | Runs on Windows; no batch execution | |
Assessment of the performance of the QuBiLS-MAS descriptors
Information content analysis based on Shannon’s entropy
Shannon’s entropy (SE) quantifies the information content codified by molecular indices, according to the principle that variables that effectively discriminate all molecules in a dataset possess high entropy values, while redundant variables have low entropy values. To perform this study, the Spectrum dataset (http://www.msdiscovery.com/spectrum.html), comprising 1963 structures, was used. The highest SE for this dataset is equal to 10.94 bits (\(\log_{2} N\), where N is the number of compounds). In the following subsections the novel QuBiLS-MAS 2D-MDs are analyzed taking into account the proposed internal generalizations, as well as with respect to well-known MDs computed by other software. For this study, the IMMAN software was used [49].
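To make the entropy bound concrete, the sketch below computes the SE of a single descriptor after discretizing its values into equal-width bins. The binning scheme (as many bins as compounds) is an assumption made for illustration and is not necessarily the procedure used by IMMAN. The function name and the toy values are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values, bins):
    """Shannon's entropy (in bits) of a descriptor after equal-width binning."""
    low, high = min(values), max(values)
    if high == low:
        return 0.0  # a constant descriptor discriminates nothing
    width = (high - low) / bins
    # The maximum value falls on the upper edge, so clamp it into the last bin.
    counts = Counter(min(int((v - low) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# A descriptor that separates all 8 toy molecules reaches the bound log2(8).
discriminating = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
# A constant descriptor carries no information.
redundant = [4.2] * 8

se_high = shannon_entropy(discriminating, bins=8)
se_low = shannon_entropy(redundant, bins=8)

# Upper bound for a dataset of 1963 compounds.
max_entropy_spectrum = math.log2(1963)
```

Descriptors whose values pile up in a few bins fall between these two extremes, which is what the variability comparisons in the following subsections measure.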
Comparative variability analysis according to the matrix formalisms
Analysis of variability according to the aggregation operators
Variability analysis of QuBiLS-MAS 2D-indices versus DRAGON descriptor families
The purpose of this analysis is to compare the entropy of the QuBiLS-MAS 2D-MDs with that of the DRAGON descriptor families. To perform this study, some DRAGON descriptor blocks were clustered into bigger families: (1) 0D_others for molecular properties, constitutional and charge descriptors, (2) 1D-fragment for functional group counts and atom-centered fragments, (3) 2D-conn_autocorr_inf for 2D autocorrelations, connectivity and information indices, (4) 2D-edge_walk for edge adjacency indices, walk and path counts, (5) 2D-eigenvalues for Burden eigenvalues, topological charge and eigenvalue-based indices, and (6) 3D-Randic_geometrical for Randic molecular profiles and geometrical descriptors. The remaining DRAGON families were kept with the same denominations. The maximum number of descriptors considered for each family is 91, which corresponds to the size of the 0D_others family, the one with the fewest MDs.
Variability comparison for QuBiLS-MAS 2D-indices with respect to other descriptor computing software
Linear independence of the QuBiLS-MAS algebraic descriptors
In this section, the possible orthogonality of the QuBiLS-MAS 2D-indices with respect to the DRAGON 0D-2D MDs is examined using Principal Component Analysis (PCA) [53, 54]. PCA is a mathematical technique that converts several correlated variables into a reduced number of non-correlated variables, called principal components. The extracted components have the following features: (1) the first component explains the highest possible variance of all determined components, (2) each successive component explains variance that the previous components did not explain, and (3) variables loaded in each component are linearly independent of the ones loaded in the remaining components. For all the studies performed in this section, the curated Spectrum Collection dataset (1963 molecules) was employed.
To perform this analysis, two sets of descriptors were calculated using the QuBiLS-MAS and the DRAGON (824 MDs) software, respectively, with the latter comprising the following families: 0D-others (B01 Constitutional, B19 Charge and B20 Molecular Properties) with 91 indices, 1D-fragment (B17 Functional Groups Counts and B18 Atom-centered Fragments) with 274 indices, 2D-conn_autocorr_inf (B04 Connectivity, B05 Information and B06 2D-AutoCorrelations) with 176 indices, 2D-edge_walk (B03 Walk-Path Counts and B07 Edge Adjacency) with 154 indices, 2D-eigenvalues (B08 Burden, B10 Eigenvalue-based and B09 Topological Charge) with 129 indices, and finally the B02 2D Topological with 119 indices.
In this analysis, 12 principal components were selected, which explain approximately 74.60% of the cumulative variance (see Additional file 1: SI6 and Additional file 1: SI7). As can be observed, Factors 1 (27.83%), 2 (13.06%), 8 (2.47%) and 9 (1.99%) exhibit strong loadings for some QuBiLS-MAS indices and some 0D–2D descriptors of the DRAGON software. On the other hand, exclusive loadings are obtained for the QuBiLS-MAS descriptors in Factors 3 (8.6%), 4 (6.26%), 5 (3.86%), 6 (3.51%), 7 (2.71%), 11 (1.42%) and 12 (1.20%), which together explain about 27% of the total variance. Factor 10 (1.62%) is important for some 0D–2D DRAGON MDs, as these are exclusively loaded in this factor; these indices include TI2 (B02 2D Topological), PW2 (B02 2D Topological), RBF (0D–others) and EEig01r (2D-edge_walk) [for details on these descriptors, see Additional file 1: SI8]. On the whole, much of the information codified by the 0D–2D DRAGON MDs is equally captured by the QuBiLS-MAS indices, considering that negligible variance (1.62%) is explained by the only factor exclusive to the former (F10). Moreover, the numerous factors exclusive to the QuBiLS-MAS MDs (i.e. F3, F4, F5, F6, F7, F11 and F12) suggest that orthogonal information is codified, thus demonstrating the theoretical contribution of the generalization schemes adopted in this framework.
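The reasoning applied to the factors above can be sketched in Python as follows. The helper `classify_factors` and the loading threshold of 0.7 are assumptions introduced here for illustration (the threshold used in the study is not stated in this section); the explained-variance percentages are those reported in the text.

```python
def classify_factors(loadings, sources, threshold=0.7):
    """Report, for each factor, which software has strongly loaded descriptors.

    loadings: one row per descriptor, one column per factor.
    sources:  software label of each descriptor row.
    threshold: assumed cut-off on the absolute loading (hypothetical value).
    """
    n_factors = len(loadings[0])
    result = []
    for j in range(n_factors):
        strong = {src for row, src in zip(loadings, sources)
                  if abs(row[j]) >= threshold}
        result.append(tuple(sorted(strong)))
    return result


# Explained variance (%) per factor, as reported in the text.
explained = {1: 27.83, 2: 13.06, 3: 8.6, 4: 6.26, 5: 3.86, 6: 3.51,
             7: 2.71, 8: 2.47, 9: 1.99, 10: 1.62, 11: 1.42, 12: 1.20}

# Factors with loadings exclusive to QuBiLS-MAS descriptors (from the text).
qubils_only = [3, 4, 5, 6, 7, 11, 12]

exclusive_variance = sum(explained[f] for f in qubils_only)
cumulative_variance = sum(explained.values())
```

A factor whose tuple contains a single label is exclusive to that software, while a factor listing both labels is shared. Summing the rounded percentages gives about 27.6% for the QuBiLS-MAS-only factors and about 74.5% overall, in line with the rounded figures quoted in the text.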
QSAR modeling of the binding affinity to corticosteroid binding globulin (CBG) of Cramer’s steroid dataset
In what follows, the predictive ability of the QuBiLS-MAS approach is assessed. To accomplish this objective, QSAR models were built to predict the binding affinity to the corticosteroid-binding globulin (CBG) of the popular Cramer’s steroid database (see Additional file 1: SI9 for names and CBG values of compounds). This dataset has been used as a “benchmark” to evaluate the quality of novel procedures. A total of 1455 variables were computed for each algebraic form (quadratic, bilinear and linear maps). The prediction models were built using Multiple Linear Regression (MLR) as the fitting method, coupled with the Genetic Algorithm (GA) as variable subset selection strategy and the statistical parameter Q _{loo} ^{2} (“leave-one-out” cross validation) as the fitness function. Throughout the study, regression models of 2–6 variables were developed and the best model in each case was retained for subsequent validation. The GA was set up with the following configuration: population size of 100, crossover/mutation rate of 0.7, selection operator fixed at 60 and 500,000 iterations. In addition, the tabu list option was configured to remove those MDs with a correlation equal to or greater than 0.95. The MLR-GA based model building was performed using the MobyDigs [55] computer program. The best models built were also assessed with the bootstrapping [56] \((Q_{boot}^{2} )\) and Y-scrambling [57] \((a (Q^{2} ))\) validation methods in order to assess the predictive power and the possible chance correlation with respect to the activity modeled.
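The fitness function used above can be illustrated with a minimal Python sketch of an MLR fit and its leave-one-out statistic, computed as 1 − PRESS/SS_tot. This is not the MobyDigs implementation: the function names are hypothetical, the fit uses plain normal equations, and the total sum of squares is taken about the mean of the full response vector, which is one common convention and is assumed here.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_mlr(X, y):
    """Ordinary least squares with intercept; returns [b0, b1, ..., bp]."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(coef, x):
    """Prediction of a fitted model for one descriptor vector x."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))


def q2_loo(X, y):
    """Leave-one-out Q^2: refit without each compound, then predict it."""
    n = len(y)
    mean_y = sum(y) / n
    press = 0.0
    for i in range(n):
        coef = fit_mlr(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coef, X[i])) ** 2
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - press / ss_tot
```

Here `X` is a list of descriptor rows (one per compound) and `y` the list of responses. In a GA-driven search, `q2_loo` would be evaluated on each candidate subset of 2–6 descriptor columns and used to rank the candidates.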
Examination of matrix formalisms
Analysis of the aggregation operators
The following study evaluates the predictive power of the aggregation operators proposed as a generalization of the linear combination of LOVEIs as the method for obtaining global (or local) indices. As can be observed in Fig. 6b, all Q _{loo} ^{2} values are higher than 50%, with the best performances corresponding to the statistical operators, followed by the mean operators and lastly by the norms. Regarding the operators classified as “classical algorithms” (Fig. 6c), the Kier–Hall (KH), Total Sum (TS), Gravitational (GV) and Autocorrelation (AC) algorithms yield comparable-to-superior performance with respect to the remaining operators. It may therefore be concluded that the incorporation of this generalization scheme improves the performance of the QuBiLS-MAS indices in modeling tasks, thus demonstrating its practical contribution.
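To make the idea of aggregation operators concrete, the short Python sketch below derives several global indices from one vector of atom-level values. The operators shown (a total sum, two norms, a mean and two statistical invariants) are generic examples of the operator groups named above; the function and key names are hypothetical and are not the identifiers used by the QuBiLS-MAS software.

```python
import math
import statistics


def aggregate(loveis):
    """Compute several global indices from a vector of atom-level values.

    Each entry is one aggregation operator applied to the same vector;
    the selection is illustrative, not the full QuBiLS-MAS operator set.
    """
    return {
        "total_sum": sum(loveis),                                # classical sum
        "manhattan_norm": sum(abs(v) for v in loveis),           # norm
        "euclidean_norm": math.sqrt(sum(v * v for v in loveis)), # norm
        "arithmetic_mean": statistics.mean(loveis),              # mean
        "variance": statistics.pvariance(loveis),                # statistic
        "range": max(loveis) - min(loveis),                      # statistic
    }
```

Because each operator summarizes the atom-level distribution differently, two molecules with the same total sum can still be distinguished by, for example, their variance or range.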
The QuBiLS-MAS MDs versus literature reports
Statistical parameters for the best models for 2–6 variables for the physicochemical property log K, considering the 31 structures as the training set
Size | R^{2} | Q_{loo}^{2} | Q_{boot}^{2} | a(Q^{2}) | F | Model | Equation |
---|---|---|---|---|---|---|---|
2 | 0.778 | 0.734 | 0.738 | −0.208 | 49.16 | log K = 1.596 (±0.885) + 3.809 (±0.582) TS[1]_MX_B_AB_nCi_2_SS12_T_KA_a-h − 0.118 (±0.011) KH[1]_MX_F_AB_nCi_2_MP2_T_KA_h | (19) |
3 | 0.863 | 0.826 | 0.820 | −0.259 | 57.14 | log K = −32.132 (±3.841) − 75.624 (±9.789) TS[1]_RA_F_AB_nCi_2_MP2_T_KA_h + 135.484 (±13.179) TS[4]_PN_Q_AB_nCi_2_MP0_T_KA_h + 1782.101 (±257.835) KH[2]_PN_B_AB_nCi_2_SS8_T_KA_v-h | (20) |
4 | 0.915 | 0.887 | 0.879 | −0.324 | 70.59 | log K = −66.472 (±6.939) − 0.223 (±0.021) AC[2]_MX_B_AB_nCi_2_SS7_T_KA_r-h + 0.407 (±0.089) TS[5]_HM_B_AB_nCi_2_SS8_T_KA_v-h + 131.848 (±10.928) TS[4]_PN_Q_AB_nCi_2_MP0_T_KA_h + 3323.451 (±355.509) KH[2]_PN_B_AB_nCi_2_SS8_T_KA_v-h | (21) |
5 | 0.932 | 0.902 | 0.890 | −0.376 | 68.53 | log K = −70.522 (±6.342) − 0.246 (±0.020) AC[2]_MX_B_AB_nCi_2_SS7_T_KA_r-h + 0.422 (±0.081) TS[5]_HM_B_AB_nCi_2_SS8_T_KA_v-h + 144.507 (±9.991) TS[4]_PN_Q_AB_nCi_2_MP0_T_KA_h + 4616.536 (±15.439) GV[2]_MX_Q_AB_nCi_2_MP3_X_KA_h + 3536.215 (±324.863) KH[2]_PN_B_AB_nCi_2_SS8_T_KA_v-h | (22) |
6 | 0.942 (0.960)^{a} | 0.914 (0.937)^{a} | 0.898 (0.925)^{a} | −0.414 (−0.465)^{a} | 65.26 (91.74)^{a} | log K = −81.005 (±6.216) − 0.233 (±0.020) AC[2]_MX_B_AB_nCi_2_SS7_T_KA_r-h − 39,144.250 (±4.757) AC[2]_MN_B_AB_nCi_2_MP2_A_KA_c-h + 0.572 (±17.485) TS[5]_HM_B_AB_nCi_2_SS8_T_KA_v-h + 120.683 (±1.681) TS[4]_PN_Q_AB_nCi_2_MP0_T_KA_h + 0.804 (±0.354) TS[6]_HM_Q_AB_nCi_2_SS0_A_KA_h + 3979.089 (±310.376) KH[2]_PN_B_AB_nCi_2_SS8_T_KA_v-h | (23) |
Comparison of Q _{loo} ^{2} statistics of nD-QSAR methods for the property log K (CBG)^{†} for 31 (or 30) steroids
nD-QSAR method | PCs/var. | Statistical method | Q_{loo}^{2} | Equations/references |
---|---|---|---|---|
31/30 Steroids (all dataset) | ||||
Combined electrostatic and shape similarity matrix | 6 | Genetic NN | 0.941 | [59] |
QuBiLS-MAS^{c} | 6 | MLR and GA | 0.937 | Equation 23 |
QuBiLS-MAS | 6 | MLR and GA | 0.914 | Equation 23 |
Hodgkin SM | 6 | Genetic NN | 0.903 | [59] |
QuBiLS-MAS | 5 | MLR and GA | 0.902 | Equation 22 |
QuBiLS-MAS | 4 | MLR and GA | 0.887 | Equation 21 |
Fragment QS-SM | 4 | PLS | 0.886 | [60] |
MEDV-13 | 5 | MLR and GA | 0.882 | [61] |
MiDSASA—“template” | 2 “compounds” | – | 0.88 | [62] |
SOM^{a} | 3 | – | R^{2} 0.85 | [63] |
Tuned-QSAR | 6 | MLR and PCA | 0.842 | [64] |
Autocorrelation vector 30 | – | – | 0.84 | [65] |
CoMMA | 3 | PLS | 0.828 | [66] |
QuBiLS-MAS | 3 | MLR and GA | 0.826 | Equation 20 |
Similarity Indices (ESP MC matrix 30) | 1 | PLS | 0.820 | [65] |
SOMFA/esp + ALPHA | – | SOR | 0.82 | [67] |
Combined electrostatic and shape similarity matrix | 6 | MLR and GA | 0.819 | [59] |
EEVA | 4 | PLS | 0.81 | [68] |
SOM-4D-QSAR | 4 | SOM neural network | 0.80 | [69] |
Charges and Properties from MEPS-AM1 | 5 | MLR | 0.80 | [70] |
HE State/E-State^{a,b} | 3 | – | 0.80 | [71] |
E-State^{a,b} | 3 | – | 0.79 | [71] |
CoSA | 3 “Bins” | PLS | 0.78 | [72] |
QSAR/E-State | 3 “atoms” | – | 0.78 | [73] |
TQSI | 4 | MLR | 0.775 | [64] |
EVA | 5 | PLS | 0.77 | [74] |
CoMSA | 1 | PLS | 0.76 | [75] |
MQSM | 5 | MLR and PCA | 0.759 | [64] |
EVA + ALPHA | – | SOR | 0.75 | [67] |
GRIND | – | PLS | 0.75 | [76] |
SEAL | 3 | PLS | 0.748 | [77] |
SOMFA/esp | 6 | PLS | 0.74 | [67] |
CoSCoSA^{a} | 3 | – | 0.74 | [78] |
CoSASA | 3 “atoms” | PLS | 0.73 | [72] |
E-State and kappa shape index | 4 | MLR | 0.72 | [79] |
TARIS | 2 | – | 0.71 | [80] |
MQSM | 3 | MLR | 0.705 | [64] |
Combined electrostatic and shape similarity matrix | 5 | PLS | 0.70 | [59] |
SAMFA-RF | – | RF | 0.69 | [81] |
SAMFA-PLS | 4–5 | PLS | 0.69 | [81] |
4D-QSAR | 2 | PLS | 0.69 | [69] |
CoMMA (ab initio) | 6 | PLS | 0.689 | [82] |
QSAR^{a} | 3 | – | 0.68 | [83] |
SOM-4D-QSAR | 4 | SOM Neural Network | 0.68 | [69] |
Wagener’s (AMSP Method) | – | k-NN and FNN | 0.630 | [84] |
SAMFA-SVM | – | SVM | 0.60 | [81] |
ALPHA | 2 | PLS | 0.57 | [67] |
In general, when the 31 steroids are taken into account as the training set, the models based on QuBiLS-MAS indices yield comparable-to-superior performance relative to other methods reported in the literature according to the Q _{loo} ^{2} statistic. Up to now, the best model reported has been the one based on the “Combined Electrostatic and Shape Similarity Matrix” (Q _{loo} ^{2} = 0.941, var = 6), which is an alignment- and grid-based method known to be computationally expensive. Additionally, this model employs the Genetic Neural Network (GNN) as the fitting method, which generally yields more robust and better optimized models than linear methods. Even so, comparable performance is obtained with the QuBiLS-MAS models (Q _{loo} ^{2} = 0.937 with compound 31 excluded and Q _{loo} ^{2} = 0.914 with compound 31 included, both with var = 6) based on MLR-GA, which is a much simpler technique than GNN. Therefore, based on the results obtained in this study, it can be claimed that the proposed QuBiLS-MAS MDs offer a considerable advantage over well-known traditional methodologies.
Conclusions
The QuBiLS-MAS approach for atom-pair relations, in its diverse generalizations and extensions, seems to renew the prospect of achieving 2D-QSAR models with good predictive power. Inspired by the “No Free Lunch” theorem [58], which postulates that there is no unique best alternative for tackling optimization problems, the different extensions constitute an innovative undertaking to suitably characterize the different phenomena that affect molecular configuration and intermolecular interactions, and thus biological activity. Variability and Principal Component analyses of the QuBiLS-MAS indices demonstrated that the proposed generalizations yield indices with superior variability compared to other indices defined in the literature and capture chemical information not codified by the DRAGON MD families. It was also demonstrated that suitable gains in the predictive ability of QSAR models are obtained with the QuBiLS-MAS approach. Therefore, the QuBiLS-MAS 2D-indices constitute a relevant tool for the diversity analysis of compound datasets and high-throughput screening of structure–activity data.
Future outlook
Future tasks include the development of a version of the QuBiLS-MAS module to compute molecular indices on a distributed computing system for high-throughput calculation, as well as a version that uses the Graphical Processing Units (GPUs) now present in many personal computers. Moreover, various (dis-)similarity multi-metrics to consider relations for more than two atoms (multi-linear forms) are to be introduced, in addition to a new set of multi-metrics based cut-offs.
Notes
Declarations
Authors’ contributions
YMP proposed the theory of the QuBiLS-MAS indices, supervised the chemical applications and the design of the GUI, and prepared the manuscript. JRVM worked on the definition of the QuBiLS-MAS indices and on the computational implementation of the API and GUI interfaces, performed the QSAR and other statistical analyses, and prepared the manuscript. YSVA worked on the computational implementation of the QuBiLS-MAS software. KMM, SJB, HLT and FPG worked on the QSAR modeling and performed the statistical analysis. CRGJ and CAM led the informatics (program design) research related to this manuscript. All authors read and approved the final manuscript.
Acknowledgements
YM-P thanks USFQ for partial financial support through Project ID5400 “Chancellor Grant 2016”. CRGJ acknowledges the support from “Dirección General de Asuntos del Personal Académico” (DGAPA) for the postdoctoral fellowship at “Instituto de Química, Universidad Nacional Autónoma de México (UNAM)” in 2016–2017. Work supported by “Programa de Apoyo a la Investigación y el Posgrado (PAIP) 5000-9163” and “Instituto de Química, UNAM” (KMM).
Authors’ information
Professor Yovani Marrero-Ponce received the B.S. degree in Pharmaceutical Sciences (summa cum laude) from the Central University of Las Villas (UCLV), Santa Clara, Cuba, in 2001, the M.S. degree in Biochemistry from Medical University “Dr. Serafin Ruiz-de Zarate Ruiz”, Santa Clara, Cuba, in 2004, and the Ph.D. degree in Chemistry from Havana University, Havana City, Cuba, in 2005. After post-doctoral fellowships at the University of Valencia, Spain, he founded the Unit of Computer-Aided Molecular “Biosilico” Discovery and Bioinformatic Research (CAMD-BIR Unit, today known as the CAMD-BIR International Network) as a spin-off of the Department of Pharmacy at UCLV. At present, he is a Full Professor/Researcher of Molecular Pharmacology and Pharmacotherapy at the Universidad San Francisco de Quito (USFQ), and Head of “Grupo de Medicina Molecular y Traslacional (MeM&T)”, Colegio de Ciencias de la Salud (COCSA), Escuela de Medicina, Edificio de Especialidades Médicas 170157, Pichincha, Ecuador. His research interests include molecular modelling and drug discovery, chem-bio-med-informatics, chemometrics, molecular descriptors, chemogenomics, and mathematical, theoretical and computational chemistry. Scopus Author ID: 55665599200. ResearcherID: H-5724-2011. ResearchGate: http://www.researchgate.net/profile/Yovani_Marrero-Ponce/, Google scholar: http://scholar.google.com/citations?user=rsbUYyEAAAAJ&hl=en, Facebook: http://www.facebook.com/ymarreroponce.
Availability of data and materials
The QuBiLS-MAS software and the respective user manual are freely available online at www.tomocomd.com.
Availability and requirements
Project name: QuBiLs Suite project. Project home page: www.tomocomd.com. Operating system(s): Platform independent. Programming language: Java. Other requirements: Java 1.8. License: Open source.
Competing interests
The authors declare that they have no competing interests.
Funding
This work was partially supported by the USFQ (Project ID5400 “Chancellor Grant 2016”). Dr. CRGJ was further supported by a DGAPA postdoctoral fellowship to work at “Instituto de Química”, UNAM.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
- Todeschini R, Consonni V (2009) Molecular descriptors for chemoinformatics. In: Mannhold R, Kubinyi H, Folkers G (2009) Methods and principles in medicinal chemistry, Second, Revised and Enlarged ed. vol 1. Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, p 2125
- Brown FK (1998) Chapter 35. Chemoinformatics: what is it and how does it impact drug discovery. In: James AB (ed) Annual reports in medicinal chemistry. Academic Press, New York, pp 375–384
- Todeschini R et al (2006) DRAGON software: an easy approach to molecular descriptor calculations. MATCH Commun Math Comput Chem 56(2):237–248
- Hong H et al (2008) Mold2, molecular descriptors from 2D structures for chemoinformatics and toxicoinformatics. J Chem Inf Comput Sci 48(7):1337–1344
- García-Jacas CR et al (2014) QuBiLS-MIDAS: a parallel free-software for molecular descriptors computation based on multilinear algebraic maps. J Comput Chem 35(18):1395–1409
- García-Jacas CR et al (2015) Multi-server approach for high-throughput molecular descriptors calculation based on multi-linear algebraic maps. Mol Inform 34(1):60–69
- Yap CW (2011) PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints. J Comput Chem 32(7):1466–1474
- Cao D-S et al (2013) ChemoPy: freely available python package for computational biology and chemoinformatics. Bioinformatics 29(8):1092–1094
- Steinbeck C et al (2006) Recent developments of the chemistry development kit (CDK) - an open-source java library for chemo- and bioinformatics. Curr Pharm Des 12(17):2111–2120
- Marrero-Ponce Y et al (2006) Bond-Based global and local (bond and bond-type) quadratic indices and their applications to computer-aided molecular design. 1. QSPR studies of octane isomers. J Comput Aided Mol Des 20(10–11):685–701
- Castillo-Garit JA, Marrero-Ponce Y, Torrens F (2006) Atom-based 3D-chiral quadratic indices. Part 2: prediction of the corticosteroid-binding globulin binding affinity of the 31 benchmark steroids data set. Bioorg Med Chem 14(7):2398–2408
- Marrero-Ponce Y et al (2008) Novel 2D TOMOCOMD-CARDD molecular descriptors: atom-based stochastic and non-stochastic bilinear indices and their QSPR applications. J Math Chem 44(3):650–673
- Marrero-Ponce Y et al (2010) Bond-based linear indices of the non-stochastic and stochastic edge-adjacency matrix. 1. Theory and modeling of ChemPhysical properties of organic molecules. Mol Divers 14(4):731–753
- Marrero-Ponce Y et al (2005) Ligand-based virtual screening and in silico design of new antimalarial compounds using nonstochastic and stochastic total and atom-type quadratic maps. J Chem Inf Model 45(4):1082–1100
- Marrero-Ponce Y et al (2006) Predicting antitrichomonal activity: a computational screening using atom-based bilinear indices and experimental proofs. Bioorg Med Chem 14(19):6502–6524
- Meneses-Marcel A et al (2005) A linear discrimination analysis based virtual screening of trichomonacidal lead-like compounds: outcomes of in silico studies supported by experimental results. Bioorg Med Chem Lett 15(17):3838–3843
- Montero-Torres A et al (2005) A novel non-stochastic quadratic fingerprints-based approach for the ‘in silico’ discovery of new antitrypanosomal compounds. Bioorg Med Chem 13(22):6264–6275
- Marrero-Ponce Y, Huesca-Guillén A, Ibarra-Velarde F (2005) Quadratic indices of the molecular pseudograph’s atom adjacency matrix and their stochastic forms: a novel approach for virtual screening and in silico discovery of new lead paramphistomicide drugs-like compounds. J Mol Struct 717(1–3):67–79
- Marrero-Ponce Y et al (2005) Atom, atom-type, and total nonstochastic and stochastic quadratic fingerprints: a promising approach for modeling of antibacterial activity. Bioorg Med Chem 13(8):2881–2899
- Casañola-Martín GM et al (2007) TOMOCOMD-CARDD descriptors-based virtual screening of tyrosinase inhibitors: evaluation of different classification model combinations using bond-based linear indices. Bioorg Med Chem 15(3):1483–1503
- Casañola-Martín GM et al (2006) New tyrosinase inhibitors selected by atomic linear indices-based classification models. Bioorg Med Chem 16(2):324–330
- Castillo-Garit JA et al (2008) Estimation of ADME properties in drug discovery: predicting Caco-2 cell permeability using atom-based stochastic and non-stochastic linear indices. J Pharm Sci 97(5):1946–1976
- Marrero-Ponce Y et al (2003) Total and local quadratic indices of the “molecular pseudograph’s atom adjacency matrix”. Application to prediction of Caco-2 permeability of drugs. Int J Mol Sci 4(8):512–536
- Barigye SJ et al (2013) Shannon’s mutual, conditional and joint entropy information indices: generalization of global indices defined from local vertex invariants. Curr Comput Aided Drug Des 9(2):164–183
- Barigye SJ et al (2013) Relations frequency hypermatrices in mutual, conditional and joint entropy-based information indices. J Comput Chem 34:259–274
- Marrero-Ponce Y et al (2012) Derivatives in discrete mathematics: a novel graph-theoretical invariant for generating new 2/3D molecular descriptors. I. Theory and QSPR application. J Comput Aided Mol Des 26(11):1229–1246
- Marrero-Ponce Y et al (2015) Optimum search strategies or novel 3D molecular descriptors: is there a stalemate? Curr Bioinform 10(5):533–564
- Garcia-Jacas CR et al (2014) N-linear algebraic maps for chemical structure codification: a suitable generalization for atom-pair approaches? Curr Drug Metab 15(4):441–469
- García-Jacas CR et al (2016) Examining the predictive accuracy of the novel 3D N-linear algebraic molecular codifications on benchmark datasets. J Cheminform 8(10):1–16
- García-Jacas CR et al (2016) N-tuple topological/geometric cutoffs for 3D N-linear algebraic molecular codifications: variability, linear independence and QSAR analysis. SAR QSAR Environ Res 27(12):949–975
- Marrero-Ponce Y (2003) Total and local quadratic indices of the molecular pseudograph’s atom adjacency matrix: applications to the prediction of physical properties of organic compounds. Molecules 8(9):687–726
- Marrero-Ponce Y (2004) Linear Indices of the “molecular pseudograph’s atom adjacency matrix”: definition, significance-interpretation, and application to QSAR analysis of flavone derivatives as HIV-1 integrase inhibitors. J Chem Inf Comput Sci 44(6):2010–2026
- Marrero-Ponce Y et al (2004) Atom, atom-type, and total linear indices of the “molecular pseudograph’s atom adjacency matrix”: application to QSPR/QSAR studies of organic compounds. Molecules 9(12):1100–1123
- Marrero Ponce Y (2004) Total and local (atom and atom type) molecular quadratic indices: significance interpretation, comparison to other molecular descriptors, and QSPR/QSAR applications. Bioorg Med Chem 12(24):6351–6369
- Marrero-Ponce Y et al (2004) Tomocomd-Cardd, a novel approach for computer-aided ‘rational’ drug design: I. Theoretical and experimental assessment of a promising method for computational screening and in silico design of new anthelmintic compounds. J Comput Aided Mol Des 18(10):615–634
- Marrero-Ponce Y et al (2005) Atom, atom-type and total molecular linear indices as a promising approach for bioorganic and medicinal chemistry: theoretical and experimental assessment of a novel method for virtual screening and rational design of new lead anthelmintic. Bioorg Med Chem 13(4):1005–1020
- Todeschini R, Consonni V (2000) Handbook of molecular descriptors. In: Mannhold R, Kubinyi H, Timmerman H (eds) Methods and principles in medicinal chemistry, vol 11, 1st edn. WILEY-VCH Verlag GmbH, Weinheim, p 667
- Estrada E, Molina E (2001) Novel local (fragment-based) topological molecular descriptors for QSPR/QSAR and molecular design. J Mol Graph Model 20(1):54–64
- Marrero-Ponce Y et al (2005) Non-stochastic and stochastic linear indices of the molecular pseudographs atom adjacency matrix: application to in silico studies for the rational discovery of new antimalarial compounds. Bioorg Med Chem 13(4):1293–1304
- Castillo-Garit JA et al (2008) Bond-based 3D-chiral linear indices: theory and QSAR applications to central chirality codification. J Comput Chem 29(15):2500–2512
- Marrero-Ponce Y et al (2008) 3D-chiral (2.5) atom-based TOMOCOMD-CARDD descriptors: theory and QSAR applications to central chirality codification. J Math Chem 44(3):755–786
- Marrero-Ponce Y et al (2006) Non-stochastic and stochastic linear indices of the molecular pseudograph’s atom-adjacency matrix: a novel approach for computational in silico screening and “rational” selection of new lead antibacterial agents. J Mol Model 12(3):255–271
- Castillo-Garit JA et al (2007) Atom-based stochastic and non-stochastic 3D-chiral bilinear indices and their applications to central chirality codification. J Mol Graph Model 26(1):32–47
- Castillo-Garit JA et al (2008) Atom-based non-stochastic and stochastic bilinear indices: application to QSPR/QSAR studies of organic compounds. Chem Phys Lett 464(1–3):107–112
- Axler SJ (2015) Linear algebra done right. In: Axler S, Ribet K (eds) Undergraduate texts in mathematics, vol 2, 3rd edn. Springer, New York
- Sinkhorn R (1964) A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann Math Stat 35(2):876–879
- Fourches D, Muratov E, Tropsha A (2010) Trust, but verify: on the importance of chemical structure curation in cheminformatics and QSAR modeling research. J Chem Inf Comput Sci 50(7):1189–1204
- Marrero-Ponce Y, Romero V (2002) TOMO-COMD (TOpological MOlecular COMputer Design) for Windows version 1.0. In: Preliminary version, may be obtained by email request to Marrero-Ponce (ymarrero77@yahoo.es). Central University of Las Villas, Santa Clara
- Urias RP et al (2015) IMMAN: free software for information theory-based chemometric analysis. Mol Divers 19(2):305–319
- Gutiérrez Y, Estrada E (2002–2004) MODESLAB, v1.5 (MOlecular DEScriptors LABoratory) for windows. Universidad de Santiago de Compostela, España
- Georg H (2008) BlueDesc-molecular descriptor calculator. University of Tübingen, Tübingen
- Liu J et al (2005) PowerMV: a software environment for molecular viewing, descriptor generation, data analysis and hit evaluation. J Chem Inf Model 45:515–522
- Massey WF (1965) Principal components regression in exploratory statistical research. J Am Stat Assoc 60(309):234–256
- Mardia KV, Kent JT, Bibby JM (1979) Multivariate analysis. Academic Press, London
- Todeschini R et al (2003) MobyDigs: software for regression and classification models by genetic algorithms. In: Leardi R (ed) Data handling in science and technology. Elsevier, Amsterdam, pp 141–167
- Wu CFJ (1986) Jackknife, bootstrap and other resampling methods in regression analysis. Ann Stat 14(4):1261–1295
- Lindgren F et al (1996) Model validation by permutation tests: applications to variable selection. J Chemom 10(5–6):521–532
- Wolpert DH, Macready WG (1997) No free lunch theorems for optimization. IEEE Trans Evol Comput 1(1):67–82
- So SS, Karplus M (1997) Three-dimensional quantitative structure-activity relationships from molecular similarity matrices and genetic neural networks. 1. Method and validations. J Med Chem 40(26):4347–4359
- Amat L, Besalu E, Carbo-Dorca R (2001) Identification of active molecular sites using quantum-self-similarity measures. J Chem Inf Comput Sci 41(4):978–991
- Shu-Shen L, Chun-Sheng L, Lian-Sheng W (2002) Combined MEDV-GA-MLR method for QSAR of three panels of steroids, dipeptides, and COX-2 inhibitors. J Chem Inf Comput Sci 42(3):749–756View ArticleGoogle Scholar
- Beger RD, Harris SH, Xie Q (2004) Models of steroid binding based on the minimum deviation of structurally assigned 13C NMR spectra analysis (MiDSASA). J Chem Inf Comput Sci 44(4):1489–1496View ArticleGoogle Scholar
- Polanski J (1997) The receptor-like neural network for modeling corticosteroid and testosterone binding globulins. J Chem Inf Comput Sci 37(3):553–561View ArticleGoogle Scholar
- Robert D, Amat L, Carbo-Dorca R (1999) Three-dimensional quantitative–activity relationships from tuned molecular quantum similarity measures: prediction of the corticosteroid-binding globulin binding affinity for a steroid family. J Chem Inf Comput Sci 39(2):333–344View ArticleGoogle Scholar
- Parretti MF et al (1997) Alignment of molecules by the Monte Carlo optimization of molecular similarity indices. J Comput Chem 18(11):1344–1353View ArticleGoogle Scholar
- Silverman BD, Platt DE (1996) Comparative molecular moment analysis (CoMMA): 3D-QSAR without molecular superposition. J Med Chem 39(11):2129–2140View ArticleGoogle Scholar
- Tuppurainen K et al (2004) Ligand intramolecular motions in ligand-protein interaction: ALPHA, a novel dynamic descriptor and a QSAR study with extended steroid benchmark dataset. J. Comput Aided Mol Des 18(3):175–187View ArticleGoogle Scholar
- Tuppurainen K et al (2002) Evaluation of a novel electronic eigenvalue (EEVA) molecular descriptor for QSAR/QSPR studies: validation using a benchmark steroid data set. J Chem Inf Comput Sci 42(3):607–613
- Polanski J, Bak A (2003) Modeling steric and electronic effects in 3D- and 4D-QSAR schemes: predicting benzoic pKa values and steroid CBG binding affinities. J Chem Inf Comput Sci 43(6):2081–2092
- De K, Sengupta C, Roy K (2004) QSAR modeling of globulin binding affinity of corticosteroids using AM1 calculations. Bioorg Med Chem 12(12):3323–3332
- Kellogg GE et al (1996) E-state fields: applications to 3D QSAR. J Comput Aided Mol Des 10(6):513–520
- Beger RD, Wilkes JE (2001) Developing 13C NMR quantitative spectrometric data-activity relationship (QSDAR) models of steroid binding to the corticosteroid binding globulin. J Comput Aided Mol Des 15(7):659–669
- Gregorio CD, Kier LB, Hall LH (1998) QSAR modeling with electrotopological state indices: corticosteroids. J Comput Aided Mol Des 12(6):557–561
- Turner DB et al (1999) Evaluation of a novel molecular vibration-based descriptor (EVA) for QSAR studies: 2. Model validation using a benchmark steroid dataset. J Comput Aided Mol Des 13(3):271–296
- Polanski J, Walczak B (2000) The comparative molecular surface analysis (COMSA): a novel tool for molecular design. Comput Chem 24(5):615–625
- Pastor M et al (2000) GRid-INdependent descriptors (GRIND): a novel class of alignment-independent three-dimensional molecular descriptors. J Med Chem 43(17):3233–3243
- Kubinyi H, Hamprecht FA, Mietzner T (1998) Three-dimensional quantitative similarity–activity relationships (3D QSiAR) from SEAL similarity matrices. J Med Chem 41(14):2553–2564
- Beger RD et al (2002) Comparative structural connectivity spectra analysis (CoSCoSA) models of steroid binding to the corticosteroid binding globulin. J Chem Inf Comput Sci 42(5):1123–1131
- Maw HH, Hall LH (2001) E-state modeling of corticosteroids binding affinity validation of model for small data set. J Chem Inf Comput Sci 41(5):1248–1254
- Marín RM, Aguirre NF, Daza EE (2008) Graph theoretical similarity approach to compare molecular electrostatic potentials. J Chem Inf Model 48(1):109–118
- Manchester J, Czerminski R (2008) SAMFA: simplifying molecular description for 3D-QSAR. J Chem Inf Model 48(6):1167–1173
- Silverman BD et al (1998) Comparative molecular moment analysis (COMMA). In: Kubinyi H, Folkers G, Martin YC (eds) 3D QSAR in drug design, vol 3. Kluwer, Dordrecht, pp 183–196
- Good AC, So SS, Richards WG (1993) Structure-activity relationships from molecular similarity matrices. J Med Chem 36(4):433–438
- Wagener M, Sadowski J, Gasteiger J (1995) Autocorrelation of molecular surface properties for modeling corticosteroid binding globulin and cytosolic Ah receptor. J Am Chem Soc 117(29):7769–7775