Supervised extensions of chemography approaches: case studies of chemical liabilities assessment
© Ovchinnikova et al.; licensee Chemistry Central Ltd. 2014
Received: 25 November 2013
Accepted: 28 April 2014
Published: 7 May 2014
Chemical liabilities, such as adverse effects and toxicity, play a significant role in the modern drug discovery process. In silico assessment of chemical liabilities is an important step that aims to reduce costs and animal testing by complementing or replacing in vitro and in vivo experiments. Herein, we propose an approach that combines several classification and chemography methods to predict chemical liabilities and to interpret the results in terms of how structural changes in compounds affect their pharmacological profile. To our knowledge, the supervised extension of Generative Topographic Mapping is proposed here for the first time as an effective new chemography method. A new approach for mapping new data with supervised Isomap without rebuilding models from scratch is also proposed. Two approaches for estimating a model’s applicability domain are used, to our knowledge, for the first time in chemoinformatics. Structural alerts responsible for the negative characteristics of the pharmacological profile of chemical compounds have been found as a result of model interpretation.
During the past decade, computational technologies and predictive tools have become deeply integrated into the modern drug discovery process. They have changed this process by extracting the useful knowledge embedded in complex arrays of chemical and biological information, which helps to select the most promising compounds as early as possible and to reveal chemical liabilities in order to reduce the risk of late-stage attrition [1, 2]. Chemical liabilities, such as adverse effects and toxicity, play a significant role in the modern drug discovery process. Methods to avoid or reduce chemical liabilities are an important target for drug discovery and development. Herein, we propose an approach that combines several classification and chemography methods to assess chemical liabilities in silico and to interpret the results in terms of how structural changes in compounds affect their pharmacological profile. Models have been developed in six different descriptor spaces for mutagenicity, carcinogenicity, acute toxicity and phospholipidosis data sets. A set of machine learning methods, combining well-known approaches with new ones, has been used in model development. The combination of classification and data visualization is a key point for mechanistic model interpretation: it allows one to understand which changes to existing structures are required to improve target properties, to generate new hypotheses and, finally, to optimize the chemical structures. Over the years, a number of dimensionality reduction approaches [4–11] have been proposed and used in cheminformatics. The best known and most widely used among these methods are Principal Component Analysis, Multidimensional Scaling (MDS) [13, 14], Self-Organizing Maps (SOM), Stochastic Proximity Embedding [16–18], Stochastic Neighbor Embedding [19, 20], Sammon Mapping and Generative Topographic Mapping (GTM) [22–24].
In this study, Generative Topographic Mapping and Isomap, as well as their supervised extensions, have been used. Recently, the unsupervised implementations of these approaches have been used in a number of studies in chemoinformatics [25–32]. These two nonlinear dimensionality reduction methods represent two different families: distance-based approaches and topology-based approaches. Isomap reduces the dimensionality of data using distance preservation as the criterion, which is intuitively understandable and easy to compute. GTM belongs to the topology-based techniques. This group of methods tries to preserve topology, that is, relative proximities: compounds that are close in the data space remain close in the data visualization model. Topology preservation is usually considered more powerful and, at the same time, more complex than distance preservation. The techniques are compared on the considered data in this study. Support vector machines (SVM), GTM and probabilistic neural networks (PNN) have been used for the development of classification models. Two applicability domain (AD) approaches are used in our study to assess the limits of a model when it predicts new data, so that reliable predictions are made only for compounds that are structurally similar to the training set compounds used for model development. Recently, several different AD approaches have been proposed [35–49]. Here, we use representatives of two families of AD methods: distance-based (Ball) and probability-based (Local Outlier Factor, LOF).
Here, to our knowledge for the first time, we propose a supervised extension of Generative Topographic Mapping that can be used as a universal tool to visualize the chemical space and to develop classification models. A new approach for projecting new data using supervised Isomap without rebuilding models from scratch has been developed. The ability of the dimensionality reduction techniques and the introduced descriptor spaces to separate different activity classes has been monitored with three parameters, two of which are used in cheminformatics for the first time.
Materials and methods
Data preparation has been carried out using recommendations published in . Chemaxon Standardizer  and Instant JChem  software have been used for the data preparation. Using Standardizer, the explicit hydrogen atoms have been removed, the structures have been aromatized and neutralized. Four data sets have been used in our study.
Ames mutagenicity data from a study by Kazius et al. . The data set contained 2367 active and 1888 inactive compounds. External test set consists of 1164 active and 2167 inactive compounds.
Data were collected from the distributed ISSCAN Database (part of the structure-searchable toxicity DSSTox public database network). The database has been specifically designed as an expert decision support tool and includes the carcinogenicity classification “calls” to guide the application of SAR approaches. The collected data set encompasses 1088 chemical structures, of which 648 compounds are annotated as active and 440 as inactive. The external test set contains 359 actives and 141 inactives.
A set of 100 phospholipidosis-inducing compounds and 82 negative drug-like compounds were taken from , where the active compounds have been observed to act on a range of species (humans, rats, mice, dogs, rabbits, hamsters and monkeys) and on a variety of tissue types (lungs, kidney and liver). External test set from  contains 141 active and 359 inactive compounds.
Data were taken from the EPA Fathead Minnow Acute Toxicity Database; after the data preparation stage, the set contained 612 compounds (578 actives and 34 inactives). This database was generated by the U.S. EPA Mid-Continental Ecology Division (MED) for the purpose of developing an expert system to predict acute toxicity from chemical structures based on mode of action considerations.
In this study, six descriptor types have been used in model development. The ISIDA package is represented by two different descriptor types: (i) ISIDA Property-Labeled Fragment Descriptors (IPLF), which are atom-centered fragments (augmented atoms) of radius 1 to 3 colored by pH-dependent pharmacophores, and (ii) a subclass of ISIDA Substructural Molecular Fragments (SMF) consisting of the shortest topological paths with explicit representation of only terminal atoms and bonds, where the minimal (n_min) and maximal (n_max) numbers of atoms varied from 2 to 15. 2D descriptors of the Molecular Operating Environment (MOE 2D), containing different physical properties, subdivided surface areas, atom and bond counts, Kier & Hall connectivity and Kappa shape indices, adjacency and distance matrix descriptors, pharmacophore feature descriptors and partial charge descriptors, were also used in model development. The CDK (Chemistry Development Kit) MACCS keys and extended fingerprints (EF) were computed using the RCDK package of the R software. Finally, Dragon software has been used for molecular descriptor calculations. Constant and nearly constant descriptors were removed. A detailed table with the final number of descriptors for each data set and descriptor type is given in the supporting information.
Support Vector Machines (SVM)
SVM [68, 69] is a supervised learning method commonly used for classification and regression and based on the statistical learning theory of Vapnik and Chervonenkis [70, 71]. By projecting the original data, described by descriptor vectors, into a higher-dimensional feature space, SVM separates the considered classes of compounds by finding the optimal position of the separating hyperplane between the instances of the classes.
Generative Topographic Mapping (GTM)
where W denotes the output weights of the RBF network.
It relates the real data in the chemical space to the manifold points. Thus, any point of the latent space ℝ^L has its own projection in the data space ℝ^D, obtained by the non-linear parameterized mapping y(x, W).
The mapping function y(x, W) is continuous, which leads to the topographic ordering of the projected points, i.e. two points that are close in the latent space are also close in the data space. Defining a probability distribution over the latent space induces the corresponding distribution over the manifold in the data space and, thus, imposes the probabilistic relationships between two spaces.
where ℒ is the log-likelihood function, β is the inverse variance, W denotes the output weights of the RBF network, K is the number of nodes, N is the number of compounds, and p(t_n | x_i, W, β) is the probability density at a point t_n in the data space generated by the Gaussian centered at y(x_i, W).
The latter ensures that the estimated posterior probabilities are normalized. By applying Function 3 to each class C_k, one can assess the posterior probability of class membership for each compound. According to statistical decision theory, the optimal class assignment is determined by the maximal value of the posterior class probabilities P(C_k | t).
Probabilistic Neural Networks (PNN)
PNN belongs to the group of feed-forward neural network algorithms. It was derived from Bayesian Networks and Kernel Discriminant Analysis. A PNN consists of four layers: an input layer, a pattern layer, a summation layer and an output layer.
The input layer represents the input vector, e.g. a compound from a test set. Each training compound is assigned to a single neuron of the pattern layer, for which its descriptors serve as the weight vector. Therefore, all pattern neurons can be marked with the class labels of the corresponding compounds. The input layer is connected to the pattern layer, so each pattern unit forms a dot product Z of the input vector and its weight vector. Z is passed to the network activation function, and the result is output to the summation layer. Each neuron in the summation layer is connected to the pattern units of the corresponding class. This layer performs a simple summation of the inputs from the pattern layer. The output layer is a two-input layer that produces a binary output. It takes into account the contribution of each class of inputs. The output is 1 (positive identification) for the selected class and 0 (negative identification) for non-targeted classes. In fact, no training is required, since the compounds of the training set serve as the weights of the hidden layer of the network. As no training is required, classifying an input vector is fast, with a cost that depends on the number of classes and compounds in use. PNNs have some advantages over multilayer perceptron networks: they are faster, relatively insensitive to outliers and generate probability scores.
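The layer structure described above can be illustrated with a minimal sketch. It assumes a Gaussian activation on the squared Euclidean distance (equivalent to the dot-product form for unit-length vectors) and averages activations per class; the function name `pnn_predict` and the parameter `sigma` (the Gaussian width) are illustrative choices, not taken from the toolbox used in this study.

```python
import math

def pnn_predict(train_x, train_y, x, sigma=0.5):
    """Classify vector x with a minimal probabilistic-neural-network sketch."""
    sums, counts = {}, {}
    for w, label in zip(train_x, train_y):
        # Pattern layer: one unit per training compound with a Gaussian activation.
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, w))
        activation = math.exp(-sq_dist / (2.0 * sigma ** 2))
        # Summation layer: accumulate the activations of each class separately.
        sums[label] = sums.get(label, 0.0) + activation
        counts[label] = counts.get(label, 0) + 1
    # Average per class (assumption) and normalise to probability-like scores.
    dens = {c: sums[c] / counts[c] for c in sums}
    total = sum(dens.values())
    scores = {c: (d / total if total > 0 else 1.0 / len(dens)) for c, d in dens.items()}
    # Output layer: the class with the largest score wins.
    return max(scores, key=scores.get), scores
```

For example, `pnn_predict([[0, 0], [0, 1], [5, 5], [5, 6]], [0, 0, 1, 1], [0.2, 0.1])` assigns the query to class 0. No weights are fitted; the training compounds themselves act as the pattern units.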
Dimensionality reduction methods
Supervised Generative Topographic Mapping (s-GTM)
GTM performs visualization by inverting the mapping from the data space to the latent space (unbending this flexible sheet into the rectangular 2D map). Bayes’ theorem is used for this purpose. Thus, for each molecule GTM calculates the probability that it is located at a given point of the map represented by the latent space, and visualizes the molecule according to this probability.
To make the location of the manifold in the data space depend on the distribution not only of the whole data set but also of each class, a new supervised training procedure is used. Each iteration consists of two major steps.
where the index j refers to one of the classes and N_j is the number of compounds in this class.
The latent point is associated with the class with the largest sum of responsibilities only if the difference between the sums is greater than the threshold value thr, which is an external parameter of the method. Otherwise, the latent point remains unlabeled. To promote the formation of clusters of similarly labeled latent points, the influence of neighboring latent points is taken into account: the threshold is decreased if, on the previous iteration, the latent point had neighbors associated with the class whose responsibility sum is larger on the current iteration, and increased if the neighbors belong to the opposite class.
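The Bayesian inversion can be sketched as follows, assuming equal prior weights for all latent nodes and isotropic Gaussians with inverse variance β centered at the manifold points y(x_k, W). Placing a molecule at the responsibility-weighted mean of the node coordinates is one common convention and is an assumption here; the function names are illustrative.

```python
import math

def responsibilities(manifold_points, t, beta):
    """Posterior probability of each latent node for data point t (equal priors)."""
    # Log of the unnormalised Gaussian likelihoods; the common factor cancels out.
    log_p = [-0.5 * beta * sum((a - b) ** 2 for a, b in zip(y, t))
             for y in manifold_points]
    top = max(log_p)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in log_p]
    total = sum(weights)
    return [w / total for w in weights]

def map_position(latent_points, resp):
    """Responsibility-weighted mean of the latent node coordinates."""
    dim = len(latent_points[0])
    return [sum(r * x[d] for r, x in zip(resp, latent_points)) for d in range(dim)]
```

Here `manifold_points` holds the images of the latent nodes in the data space and `latent_points` their 2D map coordinates; both are assumed to come from an already trained model.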
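The basic labeling rule can be sketched as below. This is a simplified illustration under stated assumptions: the class-wise responsibility sums are divided by the class size N_j, the comparison is made between the two best-scoring classes, and the neighbor-dependent adjustment of the threshold is omitted. The function name and data layout are hypothetical.

```python
def label_latent_points(resp, labels, thr):
    """Label each latent point by its dominant class, or None if undecided.

    resp[n][k] is the responsibility of latent point k for compound n,
    labels[n] is the class of compound n, thr is the labeling threshold.
    """
    classes = sorted(set(labels))
    n_per_class = {c: labels.count(c) for c in classes}
    result = []
    for k in range(len(resp[0])):
        # Class-wise responsibility sums, scaled by class size (assumption).
        score = {c: sum(r[k] for r, y in zip(resp, labels) if y == c) / n_per_class[c]
                 for c in classes}
        ranked = sorted(classes, key=lambda c: score[c], reverse=True)
        best = ranked[0]
        runner_up = score[ranked[1]] if len(ranked) > 1 else 0.0
        # The point is labeled only if the margin exceeds the threshold.
        result.append(best if score[best] - runner_up > thr else None)
    return result
```

A latent point that is equally responsible for both classes stays unlabeled, which mirrors the role of thr described above.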
Then RBF network is trained using the coordinates of x k in the latent space as input and as a target.
Supervised GTM has a number of external parameters that have a great influence on model development. The main parameter for coloring the latent points is the threshold value. It should be low enough to allow a considerable number of latent points to be labeled. Its maximum value can be found by analyzing the responsibility matrix and strongly depends on the number of latent points: the larger their number, the lower the threshold should be.
The parameter ρ is required to bring both terms of Formula 9 to a similar scale. Surprisingly, over quite a wide range it has only a small impact on the model, but it can be very useful for imbalanced data to prevent all the latent points from being marked with the same class label. It should be altered for fine optimization or if no clusters of similarly labeled latent points are formed during the training process.
Isomap is a low-dimensional embedding method. It assumes that the data lie along a manifold with a dimensionality d lower than the dimensionality d_o of the original data space. The aim is to “unroll” the manifold into a d-dimensional space so that data points that are close to each other on the manifold remain close and remote points stay remote. To this end, the Euclidean distance is replaced with the geodesic one, i.e. the length of the shortest curve between two points that lies on the manifold.
The Isomap algorithm consists of three steps. In the first step, we define the k nearest neighbors of each compound and assume that the Euclidean distances between them are small and, thus, nearly equal to the corresponding geodesic distances. This assumption allows us to create a weighted graph in which only vertices that are nearest neighbors are connected and the length of each edge equals the corresponding distance. This graph is not always connected; in that case, the largest connected component is taken for the next step. In the second step, after the graph has been constructed, we compute the shortest-path distances between its vertices. In the third step, the obtained distance matrix is used for multidimensional scaling (MDS) [13, 14] from the original to the d-dimensional space. To minimize the cost function in MDS, the coordinates of compounds in the new space should be set to the top d eigenvectors of the matrix τ(D_G), where D_G is the matrix of pairwise distances between training points and τ is an operator that converts distances to inner products. For visualization purposes we set d = 2.
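The first two steps (neighborhood graph and shortest-path distances) can be sketched in plain Python. This is a minimal illustration, not the implementation used in the study: the graph is made symmetric, shortest paths are found with the Floyd-Warshall algorithm, pairs in different connected components keep an infinite distance instead of being filtered out, and the final MDS step is not shown.

```python
import math

def geodesic_distances(points, k):
    """Approximate geodesic distances via a k-nearest-neighbour graph."""
    n = len(points)
    inf = float("inf")
    eucl = [[math.dist(p, q) for q in points] for p in points]
    g = [[0.0 if i == j else inf for j in range(n)] for i in range(n)]
    for i in range(n):
        # Step 1: connect each point to its k nearest neighbours.
        nearest = sorted((j for j in range(n) if j != i), key=lambda j: eucl[i][j])[:k]
        for j in nearest:
            g[i][j] = g[j][i] = eucl[i][j]
    # Step 2: all-pairs shortest paths (Floyd-Warshall).
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if g[i][m] + g[m][j] < g[i][j]:
                    g[i][j] = g[i][m] + g[m][j]
    return g
```

For five points placed along an L-shaped path, `[(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]` with k = 2, the geodesic distance between the two ends follows the path and is longer than the straight-line Euclidean distance.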
Here y_i denotes the class label of compound x_i, and β is a parameter that prevents D(x_i, x_j) from increasing too fast. β should depend on the data density; the average Euclidean distance between all pairs of data points is usually used. The parameter α gives points from different classes some chance to be closer to each other.
After the new distances have been calculated, the k nearest neighbors are defined and the weighted graph is constructed in the same way as in the unsupervised algorithm.
where λ_k are the eigenvalues and υ_ki are the coordinates of the corresponding eigenvectors of the matrix , and the operator denotes the average over the data set. To make this work for S-Isomap, we take Eq. 11 into consideration while computing the geodesic distance from an external point x to the training point x_i. We assume that the distances from x to its k nearest neighbors are small enough that the two parts of Eq. 11 differ little, so their average can be used as the geodesic distance from x to its k nearest neighbors. The other geodesic distances are found from the matrix by computing the shortest paths, as is done when training the model. If the value is too large (which happens when the average distances between compounds in the original data space greatly exceed one), an additional coefficient β1 can be used both for training the model and for extending it to new points. In this case the parameter β in Eq. 11 is replaced with β1β.
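A minimal sketch of such a label-dependent dissimilarity is given below. It assumes the form commonly given for S-Isomap in the literature, with a bounded expression for pairs from the same class and an exponentially growing expression, shifted down by α, for pairs from different classes; whether this matches Eq. 11 term by term is an assumption, and the function name is illustrative.

```python
import math

def s_isomap_dissimilarity(d, same_class, beta, alpha):
    """Label-dependent dissimilarity built from a Euclidean distance d."""
    if same_class:
        # Same class: bounded by 1, so within-class pairs stay close.
        return math.sqrt(1.0 - math.exp(-d * d / beta))
    # Different classes: grows with d; alpha lets nearby points from
    # different classes remain relatively close.
    return math.sqrt(math.exp(d * d / beta)) - alpha
```

With β = 1 and α = 0.5, a pair at Euclidean distance 1 has a dissimilarity of about 0.80 if both compounds share a class and about 1.15 otherwise, so differently labeled pairs are pushed apart.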
Applicability domain approaches
where α is the centroid of the data points and x_ij denotes coordinate j of compound x_i.
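The distance-based (Ball) idea can be sketched as follows, assuming the centroid is the coordinate-wise mean of the training compounds and that "outlierness" is the Euclidean distance to it. The function name is illustrative.

```python
import math

def rank_by_distance_to_centroid(data):
    """Return the centroid, each distance to it, and indices sorted from most distant."""
    n, dim = len(data), len(data[0])
    # Centroid: mean of every coordinate j over all compounds i.
    centroid = [sum(row[j] for row in data) / n for j in range(dim)]
    dists = [math.dist(row, centroid) for row in data]
    # Most distant compounds first; these are the first candidates for exclusion.
    order = sorted(range(n), key=lambda i: dists[i], reverse=True)
    return centroid, dists, order
```

Compounds at the head of `order` would be treated as lying outside the applicability domain once a distance cut-off is chosen.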
Local Outlier Factor (LOF)
LOF is a probability-based method for outlier detection in a multidimensional dataset. It operates on the local densities of objects in the dataset, using the definition of local reachability density, and calculates the value of the “local outlier factor”, which indicates the degree to which an object differs from the other compounds in the data set.
To define the local reachability density, we first introduce some other concepts. The k-distance of an object p, dist_k(p), is the smallest value for which there are at least k objects besides p whose distance from p is smaller than or equal to dist_k(p). The k-distance neighborhood of an object p, N_k(p), is the set of objects, not including p, whose distance from p does not exceed dist_k(p). Note that the cardinality of N_k(p), denoted |N_k(p)|, can be greater than k when N_k(p) contains two or more objects whose distances from p are equal to dist_k(p). The reachability distance of object p with respect to object o, rdist_k(p, o), is the maximum of the k-distance of o and the distance from o to p. The idea of reachability distance is illustrated in Figure 1b.
It has been shown that the LOF of objects that lie ‘deep’ inside a cluster is approximately equal to 1. It has also been shown that in the majority of cases k can be chosen so that the LOF is approximately equal to one for all objects that belong to some cluster and differs significantly from one for any other object. This allows us to detect compounds that do not belong to any cluster and can therefore be called outliers.
where Sens is the sensitivity, Spec is the specificity, tp is the number of true positives (correctly predicted active compounds), tn is the number of true negatives (correctly predicted inactive compounds), fp is the number of false positives (inactive compounds that have been predicted to be active) and fn is the number of false negatives (active compounds that have been predicted to be inactive).
LibSVM was used for developing SVM models; its two external parameters, ν and γ, were varied from 0.01 to 0.91 and from 2^-11 to 2^3, respectively.
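The quantities defined above can be combined into a compact sketch. It follows the standard LOF formulation (local reachability density as the inverse of the mean reachability distance, LOF as the mean ratio of neighbor densities to the object's own density) and assumes there are no duplicate points, which would lead to a division by zero. The function name is illustrative.

```python
import math

def lof_scores(points, k):
    """Local outlier factor of every point, using k-distance neighbourhoods."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    k_dist, neigh = [], []
    for i in range(n):
        others = sorted(d[i][j] for j in range(n) if j != i)
        kd = others[k - 1]
        k_dist.append(kd)
        # All objects within the k-distance; ties can make this larger than k.
        neigh.append([j for j in range(n) if j != i and d[i][j] <= kd])

    def reach(p, o):
        # Reachability distance of p with respect to o.
        return max(k_dist[o], d[p][o])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [len(neigh[i]) / sum(reach(i, o) for o in neigh[i]) for i in range(n)]
    # LOF: mean ratio of the neighbours' densities to the point's own density.
    return [sum(lrd[o] for o in neigh[i]) / (len(neigh[i]) * lrd[i]) for i in range(n)]
```

On a regular 3 × 3 grid with one distant extra point, the grid points obtain values near one, while the isolated point obtains a much larger value.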
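These quantities can be computed as in the sketch below, which assumes the usual definitions Sens = tp / (tp + fn), Spec = tn / (tn + fp) and Balanced Accuracy = (Sens + Spec) / 2. The function name and the label encoding (1 for active, 0 for inactive) are illustrative, and both classes are assumed to be present.

```python
def balanced_accuracy(y_true, y_pred, positive=1):
    """Return (balanced accuracy, sensitivity, specificity) for binary labels."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # active predicted as active
            else:
                fn += 1  # active predicted as inactive
        else:
            if pred == positive:
                fp += 1  # inactive predicted as active
            else:
                tn += 1  # inactive predicted as inactive
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return (sens + spec) / 2.0, sens, spec
```

Unlike plain accuracy, this measure weights both classes equally, which matters for strongly imbalanced sets such as the acute toxicity data.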
GTM models were built with the help of the Netlab package. This implementation cannot handle a large number of descriptors, so Principal Component Analysis was applied beforehand. The following external parameters were scanned: the number of retained first principal components was varied from 20 to 60, the number of latent points from 5^2 to 50^2, and the number of radial basis network centers from 2^2 to 7^2.
PNN was implemented in the Classification Toolbox for use with MATLAB. Its only external parameter, the Gaussian width, was chosen from the range [0; 1].
Among all the models developed for each combination of dataset, descriptor type and method, the one with the highest Balanced Accuracy was selected for further analysis.
For s-GTM, threshold values from 0 to 0.2 were tried; in most cases we used ρ = 40 or ρ = 30. The only external parameter of the movement step (the responsibility radius, rr) has a great influence on the model. Values that are too small lead to only small changes in the model compared to unsupervised GTM, and values that are too large lead to all compounds being mapped into a single point. This parameter was scanned over a wide range.
The performance of data visualization has been monitored with three quantitative measures. Each of them is normalized to vary from 0 to 1 and can be computed for a data set where the information about the classes is available.
where N is the number of classes.
Distance Consistency (DSC)
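A minimal sketch of this measure is given below. It assumes that DSC is defined, as is common for this measure, as the fraction of mapped points whose nearest class centroid on the map is the centroid of their own class; the function name is illustrative.

```python
import math

def distance_consistency(points, labels):
    """Fraction of mapped points that lie closest to their own class centroid."""
    classes = sorted(set(labels))
    centroids = {}
    for c in classes:
        members = [p for p, y in zip(points, labels) if y == c]
        # Class centroid: coordinate-wise mean of the class members.
        centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    hits = 0
    for p, y in zip(points, labels):
        nearest = min(classes, key=lambda c: math.dist(p, centroids[c]))
        hits += nearest == y
    return hits / len(points)
```

A value of 1 means that every point is closer to its own class centroid than to any other.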
Distribution Consistency (DC)
p_R is the total number of molecules in the region R, and the coefficient Z = n log_2 N is used to scale DC to the range from 0 to 1. In this work, to obtain the required regions we divided the visualization map into 15 × 15 equal-sized rectangles.
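The grid-based computation can be sketched as follows. The sketch assumes the entropy-based form DC = 1 − (1/Z) Σ_R p_R H(R), where H(R) is the Shannon entropy (base 2) of the class distribution in region R and Z = n log_2 N, with n the total number of molecules and N the number of classes (at least two). The function name and the `bins` parameter are illustrative.

```python
import math

def distribution_consistency(points, labels, bins=15):
    """Entropy-based class purity of a 2D map divided into bins x bins rectangles."""
    n_classes = len(set(labels))
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, y0 = min(xs), min(ys)
    wx = (max(xs) - x0) or 1.0  # avoid division by zero for degenerate maps
    wy = (max(ys) - y0) or 1.0
    cells = {}
    for (x, y), lab in zip(points, labels):
        i = min(int((x - x0) / wx * bins), bins - 1)
        j = min(int((y - y0) / wy * bins), bins - 1)
        cell = cells.setdefault((i, j), {})
        cell[lab] = cell.get(lab, 0) + 1
    weighted_entropy = 0.0
    for counts in cells.values():
        p_r = sum(counts.values())  # number of molecules in the region
        h = -sum(c / p_r * math.log2(c / p_r) for c in counts.values())
        weighted_entropy += p_r * h
    z = len(points) * math.log2(n_classes)
    return 1.0 - weighted_entropy / z
```

Regions that contain a single class contribute zero entropy, so a map on which the classes never share a rectangle reaches DC = 1.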
Results and discussion
Classification models performance
PNN may be considered a compromise between the lack of internal information about the method in SVM and the lower accuracy of GTM. It is not as universal a tool as GTM, but it slightly outperforms it (by up to 6% for mutagenicity). At the same time, PNN makes less accurate predictions than SVM but allows one to look into the background of each decision by analyzing the pattern and decision layers. SVM and PNN behave similarly.
The dependence of Balanced Accuracy on the dataset and descriptor type obtained with PNN turned out to be similar to that of SVM, but not to that of GTM, although both PNN and GTM are neural networks.
The considered data sets were previously studied by other teams. Classification of the acute toxicity data set has been performed in . The compounds were divided into classes differently than in our study and in the original database. A set of machine learning approaches, including several types of neural networks as well as SVM, Decision Trees and Gene Expression Programming, was applied for classification. The Balanced Accuracy values of the developed models ranged from 0.85 to 0.93. A number of regression studies [85, 86] have been published, including the original publication of this data set. The carcinogenicity data used in this study have been used in QSAR studies mostly as a source for further data retrieval; for example, they formed part of the data considered in . A thorough analysis of the mutagenicity data set, including applicability domain estimation, has been performed in . Direct comparison of the results is hampered by the difference in the statistical parameters used. Comparable results (obtained by combining ECFP descriptors with Random Forest and Nearest Neighbor classifiers) have recently been reported in . In , SVM and Random Forest were applied for phospholipidosis prediction. There, the Matthews Correlation Coefficient was used to assess performance, and its values reached 0.72, which exceeds the maximum value of this parameter in our study.
MACCS descriptors were not effective in detecting structural alerts for any data set except mutagenicity, where eight descriptors detected mostly nitro groups. Only a limited number of descriptors were considered structural alerts by all three methods. PNN tends to attribute descriptors to structural alerts, which may be one of the reasons for its inferior efficiency compared to SVM. The described approach did not allow structural alerts to be detected for phospholipidosis. Although more than 30 descriptors were unanimously marked by the methods, all these descriptors refer to several groups of active compounds with similar structures (an example is shown in Figure 3).
Performance of data visualization models
In this study, supervised extensions of Isomap and GTM were used for data visualization.
One can see that while s-Isomap achieved almost perfect separation of the training set (none of the applied assessment parameters fell below 0.91), the quality of mapping an external set with these models is highly dependent on the dataset under consideration. The external set for mutagenicity was mapped quite accurately (Г-score = 0.80, DSC = 0.86, DC = 1.00), while the mapping of the external set for carcinogenicity is moderate: the corresponding parameters varied in the range 0.54-0.62. One of the main factors that determine the quality of mapping is the distance from each point of the external set to its nearest neighbors in the training set. The closer they are, the better the results. If the distances are much greater than one but are of the same scale, the additional parameter β1 can be used to bring them into the desired range (see  for the specific values). In the case of carcinogenicity, in particular, the distances from the points of the external set differ by several orders of magnitude.
The supervised extension of GTM is proposed in this paper for the first time. It demonstrates a significant improvement in visualization performance. An example for the acute toxicity dataset and MOE descriptors is given in Figure 5. Besides a noticeable increase in all three visualization quality measures (Г-score rose from 0.62 for the unsupervised model to 0.77 for the supervised one, DSC from 0.57 to 0.87 and DC from 0.85 to 0.95), one can see how structurally similar compounds that belong to different classes and lie close to each other on the map obtained by unsupervised GTM are separated by the supervised extension of GTM. Here, two groups were selected, each containing structurally similar active and inactive compounds. The first one contains toxic 1-Decanol and non-toxic 1-Tridecanol, which differ from each other only in the length of the carbon chain (the Tanimoto Similarity Coefficient (TSC) is equal to 1.00). The second group consists of toxic 2-Undecanone and 2-Dodecanone and the similar (TSC = 0.82) non-toxic 3-Tetradecanal. All these compounds were mapped into a small area by unsupervised GTM, while they are well distinguished by its supervised extension.
Mapping of the external test set for s-GTM is performed using the same procedure as for GTM, and the corresponding results are shown in Figure 6. One can see that the presented visualization maps are inferior to those of s-Isomap. At the same time, s-GTM maps the external test set more accurately than s-Isomap, since after the model has been trained, the training set is mapped using the same algorithm that is used for mapping an external test set. In s-GTM, if one includes a compound from the training set in the test set, it will be projected to exactly the same point of the map. This is not so for s-Isomap. Without label information, each mapping is an approximation and can be performed in different ways. The one we have proposed is based on the assumption that label information does not have much influence on the relative location of points that are close to each other. During training, s-Isomap changes the distances between compounds differently depending on whether the compounds belong to the same class or not, but in proportion to their relative position. Thus, the new distances for compounds from different classes do not change significantly if the compounds are close to each other, and if a compound from the test set has close neighbors in the training set, they will be mapped close together even if they belong to different classes. In Figure 6, as in Figure 4, acute toxicity maps are not presented since we had no corresponding external set at our disposal. Nevertheless, s-GTM demonstrated reasonably good results in visualizing this data set. The quantitative measures for the best maps varied in the following ranges as a function of the descriptor type: Г-score, 0.76-0.77; DSC, 0.72-0.87; DC, 0.93-0.96.
The given examples allow one to assume that s-GTM tends to form clusters of identically labeled projections, which is reflected in the increase of the DSC value compared with the results of the original GTM. For instance, for the examples presented in Figure 6 the improvement in DSC is 0.12 for carcinogenicity, 0.18 for mutagenicity and 0.21 for phospholipidosis. At the same time, while s-GTM generally provides at least a slight increase in all the considered visualization quality parameters, it does not separate overlapping areas as successfully as s-Isomap does. The reason is that s-GTM works with the given relative location of compounds in the data space, while s-Isomap changes the distances between the compounds according to the label information (and thus performs a kind of metric learning). For example, if the choice of descriptors leads to overlapping of differently labeled compounds in the original data space, s-GTM may not be able to separate them completely but will project an external set following the pattern of the training set, while s-Isomap can achieve almost perfect separation even for the most difficult visualization tasks, but one may then face problems with the mapping of the external set.
For each presented map (Figures 4, 5 and 6) the values of the three quantitative measures of visualization performance are given. None of the parameters is perfect, and none can be applied on its own to identify adequate data visualization models and compare different maps. Г-score, for example, is high for maps with randomly mixed compounds that are still grouped in small clusters. Distance Consistency can be low for well-separated classes that form non-convex shapes. Distribution Consistency is usually high for visualizations of imbalanced datasets and strongly depends on its external parameter. The effectiveness of each parameter is determined by the nature of the obtained map. For example, maps may have similar DC values but differ in DSC, which can be interpreted as similar class overlap with different levels of clustering. In this study, the combination of DC and DSC proved effective. Another advantage of DC and DSC is that they are less time- and memory-consuming than Г-score.
Applicability domain of models
Two methods of applicability domain estimation were applied in this study, their performance was compared. One of them is a distance-based Ball, the other – a distribution-based LOF. The Principal Component Analysis was used as a pre-processing step. Each method was used to generate a sorted list of compounds according to their “outlierness” (the value of LOF function for LOF and distance to the centroid for ball). The impact of outliers’ exclusion on the Balanced Accuracy of the models was analyzed.
For acute toxicity LOF proved to be more efficient than Ball. This can be explained by the presence of several clusters with high density of compounds in the dataset containing compounds of different classes. The compounds in these clusters may have been correctly classified, while a number of false predictions were made for the compounds lying in the areas of classes overlapping in the midst of the clusters. In this case LOF was able to detect these mispredicted compounds as outliers and Ball just excluded the most distant from the centroid compounds in spite of the density distribution.
For phospholipidosis Ball and LOF demonstrated similar performance, though LOF is a bit more efficient. It may indicate that the data are slightly clusterized with an area of clusters’ overlapping and most incorrectly predicted compounds are located far from the main aggregation of the chemical structures.
For carcinogenicity, both methods produced only a small increase in Balanced Accuracy, with Ball performing better (in Figure 7 the blue lines lie above the corresponding black lines). This could happen if the projection of the dataset into the data space formed a single cluster with an irregular density distribution and a large area of class overlap.
A similar pattern can be found for mutagenicity. Here, the maximum increase in BA is only about 2%, and Ball only slightly outperforms LOF for IPLF descriptors. Given the reasonable performance of both the visualization and the classification methods on the mutagenicity dataset, one may assume that this dataset does not contain many outliers and that applicability domain analysis therefore has little effect on the predictive performance of the models.
The SMF descriptors we used represent only terminal groups (see the section devoted to the descriptor types). The presented compounds were considered outliers not because of the presence of some unique fragment, but because of a unique or rare combination of atoms and bonds and their relative location. For example, Ceftazidime is the only compound in the dataset that contains sulfur with an aromatic bond together with distant heteroatoms (from 9 to 15 atoms in a fragment), and only in Rifampin are there carbon atoms with double bonds separated by 4 to 10 atoms. Not all of the given compounds are characterized by a number of unique descriptors, but all of them contain many rare ones, as does, for example, Colchicine.
This work concerns an approach that combines several classification and chemography methods for the in silico assessment of chemical liabilities and for the interpretation of the results in terms of the impact of structural changes on the pharmacological profile of compounds. Support Vector Machines, Generative Topographic Mapping and Probabilistic Neural Networks were used for classification. The classification performance was improved by combining the models with two applicability domain assessment approaches (Ball and Local Outlier Factor), and their contribution was analyzed. The supervised extension of Generative Topographic Mapping was proposed as a new, efficient chemography method. A new approach for mapping new data using supervised Isomap without rebuilding the models from scratch was also proposed. The ability of the dimensionality reduction techniques and descriptor spaces to separate different activity classes was monitored by three measures (Г-score, Distance Consistency and Distribution Consistency), and their efficiency was compared. The results obtained, which are comparable with or exceed those published by other teams for the given biological activities, allow the proposed approach to be used as an efficient filter for excluding compounds with undesirable activities at early stages of the drug design process.
The authors thank the Russian Foundation for Basic Research (projects no. 11-03-00161 and 12-03-33086). The authors gratefully acknowledge Prof. Alexandre Varnek: the modification of the Generative Topographic Mapping code was carried out in the Laboratory of Chemoinformatics at the University of Strasbourg under his supervision.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.