In-silico predictive mutagenicity model generation using supervised learning approaches
- Abhik Seal†1,
- Anurag Passi†2,
- UC Abdul Jaleel3,
- Open Source Drug Discovery Consortium2 and
- David J Wild1
© Seal et al.; licensee Chemistry Central Ltd. 2012
Received: 28 December 2011
Accepted: 3 April 2012
Published: 15 May 2012
Experimental screening of chemical compounds for biological activity is a time-consuming and expensive practice. In silico predictive models permit inexpensive, rapid “virtual screening” to prioritize the selection of compounds for experimental testing. Both experimental and in silico screening can be used to test compounds for desirable or undesirable properties. Prior work on the prediction of mutagenicity has primarily involved the identification of toxicophores rather than whole-molecule predictive models. In this work, we examined a range of in silico predictive classification models for the prediction of mutagenic properties of compounds, including methods such as J48 and SMO, which have not previously been widely applied in cheminformatics.
The Bursi mutagenicity data set containing 4337 compounds (Set 1) and a Benchmark data set of 6512 compounds (Set 2) were used as input data sets in this work. A third data set (Set 3) was prepared by merging the previous two sets. Classification algorithms including Naïve Bayes, Random Forest, J48 and SMO, with 10-fold cross-validation and default parameters, were used for model generation on these data sets. Models built using the combined data set performed better than those developed from the Benchmark data set. Significantly, Random Forest outperformed the other classifiers on all the data sets, especially on Set 3, with 89.27% accuracy, 89% precision and an ROC area of 95.3%. To validate the developed models, two external data sets with mutagenicity data, AID1189 and AID1194, were tested, giving 62% accuracy, 67% precision and 65% ROC area for AID1189, and 91% accuracy, 91% precision and 96.3% ROC area for AID1194. A Random Forest model was applied to approved drugs from DrugBank and metabolites from the ZINC database, with a true positive rate of almost 85%, showing the robustness of the model.
We have created a new mutagenicity benchmark data set with around 8,000 compounds. Our work shows that highly accurate predictive mutagenicity models can be built using machine learning methods based on chemical descriptors and trained on this set, and that these models complement toxicophore-based methods. Further, our work supports other recent literature in showing that Random Forest models generally outperform other comparable machine learning methods for this kind of application.
Keywords: Molecular descriptors, Machine learning, Mutagenicity, Random forest, Screening, Toxicophores
In the past two decades, high throughput screening (HTS) has provided a large amount of experimental data on the biological activities of compounds. Data mining and machine learning methods provide an in silico counterpart, building predictive models from chemical structure features and other properties, using training sets of known bioactivities. Despite these capabilities, quantitative methods do not tend to model biochemical and physiological processes well. Recent developments in machine learning have focused on the exploration of large data sets of non-congeneric molecules. The applicability of Quantitative Structure Activity Relationship (QSAR) studies to the prediction of toxicity is very limited. The rationale behind the use of machine learning is to discover patterns and signatures in data sets from high throughput in vitro assays. Nonetheless, the development of in silico models as alternatives to animal testing for the mutagenicity assessment of chemicals is steadily increasing, and has attracted researchers in the field of Quantitative Biological Activity Relationship (QBAR) and even toxicology.
Mutagenicity is the ability of a substance to cause genotoxicity. Experimentally, mutagenicity is assessed by the Ames test, performed on Salmonella typhimurium bacterial strains, each of which is sensitive to a specific chemical mutagen. It has been found that the predictive power of a positive Ames test for rodent carcinogenicity is high, ranging from 77% to 90%. Kazius et al. assembled a data set of 4337 compounds and derived 29 toxicophores with an error rate of 18% in the training set and 15% in a validation test set. Helma et al. reported the MOLFEA algorithm for the generation of descriptors based on molecular fragments for non-congeneric compounds, and compared various machine learning algorithms on a data set of 684 compounds derived from the Carcinogenic Potency Database (CPDB: http://potency.berkeley.edu/). The data set gave an accuracy of 78% with 10-fold cross-validation. Hansen et al. reported a new public Ames mutagenicity data set with 6500 compounds and compared results with commercial and non-commercial tools. Zhang and Sousa also reported the use of MOLMAP descriptors of bond properties, which were used to train a Random Forest classifier. Error percentages as low as 15% to 16% were achieved with an external validation set of 472 compounds against a training set of 4083 structures. Up to 91% sensitivity and 93% specificity were obtained from the test sets. Feng et al. used four data sets (NCI, Mut, Yeast and Tox) and generated four different types of descriptors. Using statistical methods, models were built to link chemical descriptors to biological activity. King et al. reported different methods for establishing structure activity relationships (SARs). They represented chemical structures by atoms and bond connectivities in combination with the inductive logic programming algorithm Progol. They tested 230 compounds, which were divided into two sets of 188 compounds and 42 compounds.
For the 42 compounds, Progol formed a better SAR than linear regression and back propagation. Judson et al. compared the accuracy of different classifiers on complex chemical toxicology data sets. Neural networks and Support Vector Machines (SVM) were at the top of the list of classifiers, predicting with 96% and 99% specificity, respectively. They also noted that irrelevant features decreased the accuracy, with linear discriminant analysis suffering the greatest degradation. Ferrari and Gini proposed the idea of a trained QSAR classifier supervised by a SAR layer that incorporates coded human knowledge. The model is implemented in the CAESAR project (http://www.caesar-project.eu), where a classifier is first trained on more than four thousand molecules from the Bursi data set using molecular descriptors, and complementary knowledge is then extracted from a collection of well-known structural alerts. Votano et al. reported the application of three QSAR methods, using artificial neural networks, k-nearest neighbors and decision forest, to a data set of 3363 diverse compounds. They used molecular connectivity indices, electrotopological state indices and binary indicators to obtain an accuracy of 82%.
Unlike many bioactivities, mutagenicity can be linked to very specific chemical structure fragments and functional groups, usually referred to as toxicophores, which interfere with DNA [14–16]. These include aromatic amines, hydroxyl amines, nitroso compounds, epoxides, thiols, nitrogen mustards, aziridines, aromatic azo compounds, propiolactones, aliphatic halides, thiophenes, heteroatom derivatives, polycyclic planar compounds, hydrazine, hydrazide and hydroxylamine. It has also been found that detoxifying structures such as CF3, SO2NH, SO2OH and aryl sulphonyl derivatives render mutagenic compounds non-mutagenic.
In this paper, we first applied four classification algorithms, Naïve Bayes, J48, Random Forest and Sequential Minimal Optimizer (SMO), to model the mutagenicity data of compounds. In particular, we were interested in discovering whether such “whole molecule” algorithms are appropriate for mutagenicity prediction, or whether this is better done using simple alerts based on toxicophores. We were also interested in whether we would replicate previous work indicating that Random Forest is a better classifier than other base and ensemble classifiers. We tested the models with validation sets (PubChem data sets AID1189 and AID1194, DrugBank approved and withdrawn drugs, and ZINC metabolites data (zinc.docking.org/browse/subsets/special.php)), all of which indicate that the Random Forest model performs well.
Distribution of the different data sets and their compounds (mutagens and non-mutagens) in test and training sets
For validation of the generated models, external test sets were used. The external data sets AID1189 and AID1194 were taken from the EPA DSSTOX data set in the CPDB. AID1189 contained 1477 compounds, with 788 mutagens and 689 non-mutagens, and AID1194 contained 832 compounds, with 396 mutagens and 436 non-mutagens. The toxicity models were tested against the 1410 approved drugs and 66 withdrawn drugs from the DrugBank database, as well as the 22080 metabolites taken from the recently published ZINC data sets. The metabolites may be toxic or non-toxic; the idea here is to check, using our predictive models, whether the compounds formed after metabolism show any mutagenicity.
For each data set, descriptors were calculated with PowerMV. PowerMV calculates a total of 6122 descriptors, classified as 546 atom pair descriptors, 4662 Carhart descriptors, 735 fragment pair descriptors, 147 pharmacophore fingerprints, 24 weighted Burden number descriptors and 8 property descriptors. Among those we used:
Property descriptors including XlogP (a measure of the propensity of a molecule to partition into water or oil), polar surface area (PSA), number of rotatable bonds, H-bond donors, H-bond acceptors, molecular weight, blood–brain indicator (0 indicating a compound does not pass the BBB, and 1 indicating that a compound passes the BBB) and bad group indicator (the molecule contains a chemically reactive or toxic group).
Pharmacophore fingerprint descriptors based on bioisosteric principles. They are divided into six classes totaling 147 descriptors.
Weighted Burden number descriptors, a set of continuous descriptors that are a variation of the Burden number. One of three properties, namely electronegativity, Gasteiger partial charge or atomic lipophilicity (XLogP), is placed on the diagonal of the Burden connectivity matrix. The off-diagonal elements are weighted by one of the following values: 2.5, 5.0, 7.5 or 10.0. The largest and the smallest eigenvalues are then used as descriptors.
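As a quick consistency check on the descriptor counts quoted above, the following Python sketch tallies the six PowerMV descriptor families and the three families retained for modelling. The dictionary keys are illustrative names, not PowerMV identifiers, and the 3 × 4 × 2 decomposition of the 24 weighted Burden numbers is inferred from the description above (three diagonal properties, four off-diagonal weights, largest and smallest eigenvalue) rather than stated explicitly in the text.

```python
# Descriptor counts per family as quoted in the text (keys are illustrative names).
powermv_families = {
    "atom_pair": 546,
    "carhart": 4662,
    "fragment_pair": 735,
    "pharmacophore_fingerprint": 147,
    "weighted_burden_number": 24,
    "property": 8,
}

# Families listed in the text as the ones used for model building.
selected = ("property", "pharmacophore_fingerprint", "weighted_burden_number")

total_descriptors = sum(powermv_families.values())
selected_descriptors = sum(powermv_families[name] for name in selected)

# Assumed decomposition of the weighted Burden numbers: 3 diagonal properties
# x 4 off-diagonal weights x 2 eigenvalues (largest and smallest) per matrix.
diagonal_properties = ("electronegativity", "gasteiger_charge", "xlogp")
off_diagonal_weights = (2.5, 5.0, 7.5, 10.0)
burden_count = len(diagonal_properties) * len(off_diagonal_weights) * 2

print(total_descriptors)     # 6122
print(selected_descriptors)  # 179
print(burden_count)          # 24
```

Under these counts, the three selected families give 179 descriptors per compound, a small fraction of the 6122 that PowerMV can generate.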
Machine learning classifiers
Machine learning has been widely used to classify molecules as active or inactive against a protein target, or as mutagens or non-mutagens. In this work we used Weka, an open source collection of classifiers for data mining and machine learning, licensed under the GNU GPL. It includes tools for data pre-processing, classification, regression, clustering, association rules and visualization. Of the many data mining approaches that have been explored, four have come to dominate other classification methods at present. These are a) Bayesian methods, b) Support Vector Machines, c) Decision trees and d) Random Forest [30, 31].
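The models in this work were evaluated with 10-fold cross-validation in Weka. As a rough illustration of what such a split looks like, the sketch below partitions compound indices into ten folds while keeping the mutagen/non-mutagen ratio similar in each fold. It is a plain-Python approximation, not Weka's implementation; the function names, the use of stratification and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Return k lists of indices, each with roughly the class balance of `labels`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin, continuing where the previous class stopped
        # so that fold sizes stay balanced overall.
        for position, index in enumerate(indices):
            folds[(offset + position) % k].append(index)
        offset += len(indices)
    return folds


def train_test_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per held-out fold."""
    folds = stratified_k_fold(labels, k, seed)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


# Toy labels (hypothetical): 1 = mutagen, 0 = non-mutagen.
labels = [1] * 55 + [0] * 45
splits = list(train_test_splits(labels, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each compound appears in exactly one test fold, so every prediction used to compute accuracy, precision and ROC area comes from a model that did not see that compound during training.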
The results are discussed for each of the data sets for which models were developed using the four classifiers. The Random Forest was parameterized with 100 trees because the out-of-bag error rate with 500 trees differed very little (by less than 0.5%).
Result table for Set 1 with four classifier algorithms Naïve Bayes, Random Forest, J48 and SMO
Result table for AID1189 taken as test set for the models prepared by different sets i.e. Set 1, Set 2 and Set 3
Result table for AID1194 taken as validation set for the models generated on different sets i.e. Set 1, Set 2 and Set 3
Result table for Set 2 with four classifier algorithms Naïve Bayes, Random Forest, J48, and SMO
Result table for Set 3 with four classifier algorithms Naïve Bayes, Random Forest, J48, and SMO
The drug and metabolite data tested with the Random Forest models built on Set 1, Set 2 and Set 3. Rows: Set 1 (drug data); Set 1 (cost-sensitive classification of drug data); Set 2 (drug data); Set 2 (cost-sensitive classification of drug data); Set 3 (cost-sensitive classification of drug data).
Each model was tested with the drug data and the metabolite data. Every model predicted the drug data with almost the same specificity, i.e. the proportion of compounds labeled as non-mutagens that were predicted as non-mutagens (true negatives), and in every case the specificity was above 84%. To improve the prediction of true negatives, we also implemented classification with a cost matrix in Weka and tested our data sets. We set the cost of a false positive, i.e. misclassifying a non-mutagenic compound as mutagenic, to 2.5. With this setting, more than 90% of every data set was classified as true negative. The models predicted the withdrawn drugs data with low sensitivity, predicting most of the compounds as non-mutagens. The compounds from the ZINC metabolites database show very low mutagenic effects on living systems, and after testing with each model it was observed that Set 3 gave the best classification of the compounds. Of the 9523 compounds arbitrarily labeled as mutagens, it predicted 8037 as non-mutagens (false negatives), and of the 12557 compounds labeled as non-mutagens, it predicted 10774 as non-mutagens (true negatives). This indicates that about 85% of the compounds in the ZINC metabolite data set are non-mutagenic.
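To make the effect of the cost matrix concrete, the sketch below applies a minimum-expected-cost decision rule to a predicted mutagen probability. Only the false-positive cost of 2.5 comes from the text; the false-negative cost of 1.0, the function names, and the choice of a minimum-expected-cost rule (rather than reweighting the training instances) are assumptions made for illustration.

```python
def min_expected_cost_label(p_mutagen, cost_fp=2.5, cost_fn=1.0):
    """Pick the label with the lower expected misclassification cost."""
    # Predicting "mutagen" is wrong when the compound is really a non-mutagen.
    cost_if_predict_mutagen = (1.0 - p_mutagen) * cost_fp
    # Predicting "non-mutagen" is wrong when the compound is really a mutagen.
    cost_if_predict_non_mutagen = p_mutagen * cost_fn
    if cost_if_predict_mutagen < cost_if_predict_non_mutagen:
        return "mutagen"
    return "non-mutagen"


def decision_threshold(cost_fp=2.5, cost_fn=1.0):
    """Probability above which 'mutagen' becomes the cheaper prediction."""
    return cost_fp / (cost_fp + cost_fn)


threshold = decision_threshold()           # 2.5 / 3.5, about 0.714
label_at_06 = min_expected_cost_label(0.6)  # below the threshold
label_at_08 = min_expected_cost_label(0.8)  # above the threshold
print(round(threshold, 3), label_at_06, label_at_08)
```

With equal costs the threshold would be 0.5; raising the false-positive cost moves it up, so fewer compounds are called mutagenic and more non-mutagens are recovered as true negatives, at the expense of sensitivity.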
Analysis of false positives and false negatives results
Misclassified compounds, i.e. the false positives and false negatives, were examined for the test set of Set 3, the drug data sets and the metabolites. Each data set is described below.
Metabolites data set: This data set contained 22080 compounds, of which 3269 were predicted as mutagens. Additional file 6 contains the ZINC IDs and SMILES along with the predictions of the Random Forest Set 3 classifier.
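The metabolite figures reported in the text can be cross-checked with a few lines of arithmetic. The sketch below uses only counts quoted in this paper; the variable names are illustrative.

```python
# Counts reported for the Set 3 Random Forest model on the ZINC metabolites.
labeled_mutagen = 9523       # arbitrarily labeled as mutagens
labeled_non_mutagen = 12557  # arbitrarily labeled as non-mutagens
false_negatives = 8037       # labeled mutagen, predicted non-mutagen
true_negatives = 10774       # labeled non-mutagen, predicted non-mutagen

total = labeled_mutagen + labeled_non_mutagen
predicted_non_mutagen = false_negatives + true_negatives
predicted_mutagen = total - predicted_non_mutagen
fraction_non_mutagen = predicted_non_mutagen / total

print(total)                                 # 22080
print(predicted_mutagen)                     # 3269
print(round(100 * fraction_non_mutagen, 1))  # 85.2
```

The totals agree with the 22080 metabolites and 3269 predicted mutagens quoted above, and the predicted non-mutagenic fraction matches the figure of about 85%.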
Comparison of the random forest with CAESAR
Comparison of CAESAR with Random Forest (RF) on the validation sets, showing True Positives (TP), False Negatives (FN), True Negatives (TN), False Positives (FP) and Accuracy
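For reference, the quantities named in this comparison relate to one another as in the sketch below, which derives accuracy, sensitivity, specificity and precision from the four confusion-matrix counts. The function name is illustrative and the counts passed to it are hypothetical values chosen for the example, not results from this study.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,     # fraction of all compounds classified correctly
        "sensitivity": tp / (tp + fn),     # fraction of mutagens recovered
        "specificity": tn / (tn + fp),     # fraction of non-mutagens recovered
        "precision": tp / (tp + fp),       # fraction of predicted mutagens that are mutagens
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=80, fn=20, tn=90, fp=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```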
Previously the Benchmark data set was the largest mutagenicity data set containing more than 6000 molecules classified as mutagens and non-mutagens. In this work we were able to create a new mutagenicity data set (Set 3) containing more than 8000 compounds.
The models generated using the Random Forest classifier were observed to perform well, as shown by the higher sensitivity and specificity obtained for the validation sets AID1189 and AID1194. Descriptor optimization is an important criterion for model generation, and Gini importance could play an important role in optimizing the descriptor space. In addition, the comparison of the descriptor-based Random Forest with CAESAR (which is based on structural alerts) clearly shows that Random Forest has the better ability to distinguish mutagenic from non-mutagenic compounds. Classification of the drug data and the metabolite data sets gave us a clear view of the impact of predictive models in drug design and discovery. Predictive mutagenicity models could make a great impact in classifying compounds in large repositories such as PubChem and ZINC, which could help to accelerate the drug discovery pipeline.
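As a brief illustration of the Gini importance mentioned above, the sketch below computes the decrease in Gini impurity produced by splitting a set of compounds on a single descriptor threshold. In a Random Forest, a descriptor's Gini importance is commonly obtained by accumulating such decreases over every split that uses the descriptor; that aggregation is not shown here. The function names and toy values are assumptions for the example, and no descriptor ranking was reported in this work.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels (0.0 for a pure or empty node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(values, labels, threshold):
    """Impurity decrease from splitting on `values <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(labels) - weighted_children


# Toy data (hypothetical): one descriptor value per compound, 1 = mutagen.
descriptor = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1]

separating_split = gini_decrease(descriptor, labels, threshold=0.5)
uninformative_split = gini_decrease(descriptor, labels, threshold=1.0)
print(separating_split, uninformative_split)
```

A split that separates the classes removes all of the parent node's impurity, while a split that sends every compound to one side removes none; descriptors that rarely yield useful splits would be candidates for removal when reducing the descriptor space.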
We would like to acknowledge the work of Ms. Geetha Sugumaran, Project Fellow, OSDD, CSIR, for helping us in formatting and proofreading the paper. Her key inputs aided in representing the data in a comprehensive manner. We would also like to thank Accelrys for providing the student edition of Pipeline Pilot. We would also like to acknowledge Dr. Rajarshi Guha for discussion of the paper. We thank the reviewers for their time and valuable suggestions on the paper.
- van Ravenzwaay B, Herold M, Kamp H, Kapp MD, Fabian E, Looser R, Krennrich G, Mellert W, Prokoudine A, Strauss V, Walk T, Wiemer J: Metabolomics: A tool for early detection of toxicological effects and an opportunity for biology based grouping of chemicals-From QSAR to QBAR. Mutat Res. 2012, [In Press]
- Ames B: The detection of environmental mutagens and potential. Cancer. 1984, 53: 2030-2040.
- Mortelmans K, Zeiger E: The Ames Salmonella/microsome mutagenicity assay. Mutat Res. 2000, 455 (1–2): 29-60.
- Kazius J, McGuire J, Bursi R: Derivation and validation of toxicophores for mutagenicity prediction. J Med Chem. 2005, 48 (1): 312-320. 10.1021/jm040835a.
- Helma C, Cramer T, Kramer S, Raedt L: Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. J Chem Inf Comput Sci. 2004, 44: 1402-1411. 10.1021/ci034254q.
- Hansen K, Mika S, Schroeter T, Sutter A, Laak A, Hartmann ST, Heinrich N, Muller KR: Benchmark data set for in-silico prediction of Ames mutagenicity. J Chem Inf Model. 2009, 49: 2077-2081. 10.1021/ci900161g.
- Zhang QZ, Aires-de-Sousa J: Random forest prediction of mutagenicity from empirical physicochemical descriptors. J Chem Inf Model. 2007, 47: 1-8. 10.1021/ci050520j.
- Feng J, Lurati L, Ouyang H, Robinson T, Wang Y, Yuan S, Young SS: Predictive toxicology: benchmarking molecular descriptors and statistical methods. J Chem Inf Comput Sci. 2003, 43: 1463-1470. 10.1021/ci034032s.
- King RD, Muggleton SH, Srinivasan A, Sternberg MJE: Structure-activity relationships derived by machine learning: the use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming. Proc Natl Acad Sci. 1996, 93: 438-442. 10.1073/pnas.93.1.438.
- Judson R, Elloumi F, Setzer RW, Li Z, Shah I: A comparison of machine learning algorithms for chemical toxicity classification using a simulated multi-scale data model. BMC Bioinf. 2008, 9: 241.
- Ferrari T, Gini G: An open source multistep model to predict mutagenicity from statistical analysis and relevant structural alerts. Chem Cent J. 2010, 4 (Suppl 1): S2. 10.1186/1752-153X-4-S1-S2.
- Benfenati E: The CAESAR project for in silico models for the REACH legislation. Chem Cent J. 2010, 4 (Suppl 1): I1. 10.1186/1752-153X-4-S1-I1.
- Votano JR, Parham M, Hall LH, Kier LB, Oloff S, Tropsha A, Xie QA, Tong W: Three new consensus QSAR models for the prediction of Ames genotoxicity. Mutagenesis. 2004, 19: 365-377. 10.1093/mutage/geh043.
- Ashby J, Tennant RW: Chemical structure, Salmonella mutagenicity and extent of carcinogenicity as indicators of genotoxic carcinogenesis among 222 chemicals tested in rodents by the U.S. NCI/NTP. Mutat Res. 1988, 204 (1): 17-115. 10.1016/0165-1218(88)90114-0.
- Hakimelahi GH, Khodarahmi GA: The Identification of Toxicophores for the Prediction of Mutagenicity Hepatotoxicity and Cardiotoxicity. J Iran Chem Soc. 2005, 2: 244-267. 10.1007/BF03245929.
- Blagg J: Structure activity relationships for in vitro and in vivo toxicity. Annu R Med Chem. 2006, 41: 353-358.
- Bongsup PC, Beland FA, Marques FM: NMR structural studies of a 15-mer DNA sequence from a ras protooncogene modified at the first base of codon 61 with the carcinogen 4-aminobiphenyl. Biochemistry. 1992, 31 (40): 9587-9602. 10.1021/bi00155a011.
- Li J, Dierkes P, Gutsell S, Stott I: Assessing different classifiers for in-silico prediction of Ames test mutagenicity. In a poster in the 4th Joint Sheffield Conference on Chemoinformatics: 2007.
- Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, Pon A, Banco K, Mak C, Neveu V, Djoumbou Y, Eisner R, Guo AC, Wishart DS: DrugBank 3.0: a comprehensive resource for 'omics' research on drugs. Nucleic Acids Res. 2011, 39: D1035-D1041. 10.1093/nar/gkq1126.
- Irwin J, Shoichet B: Zinc – a free database of commercially available compounds for virtual screening. J Chem Inf Model. 2005, 45 (1): 177-182. 10.1021/ci049714+.
- Accelrys, Inc., 10188 Telesis Court, Suite 100, San Diego, CA. URL: [http://accelrys.com/products/pipeline-pilot/]
- Gold LS, Slone TH, Ames BN, Manley NB, Garfinkel GB, Rohrbach L: Carcinogenic Potency Database. In Handbook of Carcinogenic Potency and Genotoxicity Databases. 1997, Boca Raton: CRC Press, 1-605.
- Liu K, Feng J, Young SS: PowerMV: A Software Environment for Molecular Viewing, Descriptor Generation, Data Analysis and Hit Evaluation. J Chem Inf Model. 2005, 45 (2): 515-522. 10.1021/ci049847v.
- Burden FR: Molecular identification number for substructure searches. J Chem Inf Comput Sci. 1989, 29: 225-227. 10.1021/ci00063a011.
- Schierz AC: Virtual screening of bioassay data. J Cheminformatics. 2009, 1: 21. 10.1186/1758-2946-1-21.
- Friedman N, Geiger D, Goldszmidt M: Bayesian network classifiers. Mach Learn. 1997, 29: 131-163. 10.1023/A:1007465528199.
- Keerthi S, Gilbert E: Convergence of a generalized SMO algorithm for SVM classifier design. Mach Learn. 2002, 46: 351-360. 10.1023/A:1012431217818.
- Murthy A: Automatic construction of decision trees from data: a multi-disciplinary survey. Data Min Knowledge Discovery. 1998, 2: 345-389. 10.1023/A:1009744630224.
- Dietterich TG: An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Mach Learn. 2000, 40: 139-157. 10.1023/A:1007607513941.
- Ehrman TM, Barlow DJ, Hylands PJ: Virtual Screening of Chinese Herbs with Random Forest. J Chem Inf Model. 2007, 47: 264-278. 10.1021/ci600289v.
- Singla D, Anurag M, Dash D, Raghava G: A web server for predicting inhibitors against bacterial target GlmU protein. BMC Pharmacol. 2011, 11: 5.
- Menze BH, Kelm BM, Masuch R, Himmelreich U, Bachert P, Petrich W: A comparison of Random Forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinforma. 2009, 10: 213. 10.1186/1471-2105-10-213.
- Lagadic D, Rissel M, Le Bot MA, Guillouzo A: Toxic effects of tacrine on primary hepatocytes and liver epithelial cells in culture. Cell Biol Toxicol. 1998, 14: 5361-5373.
- Fuchs S, Simon Z, Brezis M: Fatal hepatic failure associated with ciprofloxacin. Lancet. 1994, 242: 738-739.
- Ashby J: Fundamental structural alerts to potential carcinogenicity or non carcinogenicity. Environ Mutagen. 1985, 7: 919-921. 10.1002/em.2860070613.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.