The influence of negative training set size on machine learning-based virtual screening
© Kurczab et al.; licensee Chemistry Central Ltd. 2014
Received: 12 October 2013
Accepted: 2 June 2014
Published: 11 June 2014
The paper presents a thorough analysis of the influence of the number of negative training examples on the performance of machine learning methods.
The impact of this rather neglected aspect of applying machine learning methods was examined for sets containing a fixed number of positive and a varying number of negative examples randomly selected from the ZINC database. Increasing the number of negative training instances relative to positives was found to greatly influence most of the investigated evaluation parameters of ML methods in simulated virtual screening experiments. In the majority of cases, substantial increases in precision and MCC were observed in conjunction with some decreases in hit recall. The analysis of the dynamics of those variations allowed us to recommend an optimal composition of training data. The study was performed on several protein targets, 5 machine learning algorithms (SMO, Naïve Bayes, Ibk, J48 and Random Forest) and 2 types of molecular fingerprints (MACCS and CDK FP). The most effective classification was provided by the combination of CDK FP with the SMO or Random Forest algorithms. The Naïve Bayes models appeared to be hardly sensitive to changes in the number of negative instances in the training set.
In conclusion, the ratio of positive to negative training instances should be taken into account during the preparation of machine learning experiments, as it might significantly influence the performance of a particular classifier. Moreover, the optimization of negative training set size can be applied as a boosting-like approach in machine learning-based virtual screening.
Machine learning (ML) methods are widely used in drug discovery to classify molecules as potentially active or inactive against a particular protein target. The vast majority of these methods are supervised: they require a training set of compounds that is used to develop a decision rule, which can then be applied to assign new molecules (the test set) to particular activity classes. A number of studies have aimed to determine optimal learning parameters and examine their impact on classification effectiveness [2–5]. Interestingly, no extensive research has considered the influence of the ratio of active to inactive training examples on the efficiency of recognizing new active compounds. The possible impact of negative training examples on the performance of ML models has only recently become a subject of research in cheminformatics. It should be emphasized that the construction of the training set is also related to the well-known problem of learning from imbalanced data. However, because of the large number of existing approaches in that field, their examination was beyond the scope of this paper.
Recently, we showed that the way the inactive set is designed significantly influences classification effectiveness, with the best results obtained for training sets with inactives randomly selected from the ZINC database. Tests were conducted with six of the most frequently used approaches for selecting assumed inactive compounds: random and diverse selection from the ZINC database, the MDDR database and libraries generated according to the DUD methodology. All experiments were performed using 5 different protein targets, 3 different fingerprints for molecular representation and seven ML algorithms. Concurrently, Heikamp et al. analyzed the effects of alternative sets of negative training data and different background databases on support vector machine (SVM) modeling and virtual screening (VS). The results showed a clear influence of negative training examples on SVM search efficiency, with the best performance achieved when SVM models were trained and screened on a dataset randomly chosen from ZINC (experimentally confirmed active and inactive compounds were selected from PubChem Confirmatory Bioassays). The authors also touched on the problem of the ratio of positive to negative training examples and noted that increasing the number of reference compounds can generally improve the predictive ability of SVM. The models were derived from differently composed training sets containing either confirmed inactive molecules or compounds randomly selected from the ZINC database as negatives.
In this study, we examine how increasing the number of negative instances used for training (with a fixed set of actives) influences the performance of ML methods. First, ligands of 3 proteins (from the MDDR database) were studied in detail; the analysis was then extended to 12 other targets (with active compounds fetched from ChEMBL) to confirm the applicability of the conclusions. This work extends our previous study, which focused on determining the optimal formula for obtaining the maximum possible yield from machine learning-based virtual screening, by taking into account another very important aspect of designing VS experiments.
Results and discussion
The question raised in this study was how the performance of machine learning-based virtual screening depends on the size of the set of negative training examples. To address this issue, we first performed initial calculations for 3 different protein targets (5-HT1A, HIV-1 protease and matrix metalloproteinase) with actives fetched from the MDDR database. Next, to broaden the scope of the study and to verify the obtained results, confirmatory tests were performed in which the set of targets was extended by 12 proteins belonging to different classes (enzymes, membrane proteins, transcription factors, transporters), with compounds taken from the freely available ChEMBL database. Five machine learning algorithms (SMO, Naïve Bayes, Ibk, J48 and Random Forest) and 2 types of molecular fingerprints (MACCS and CDK FP) were applied to datasets with a fixed number of positive training instances and a varied number of negative examples, giving different active to inactive ratios (from 0.5 to 20 with a step size of 0.5). To show the relative diversity of actives towards inactives randomly selected from the ZINC database, matrices of Tanimoto coefficients were calculated (see Additional file 1: Figure S1). They revealed that, for the great majority of active ligands, highly similar inactive compounds could be found in ZINC; the classification task (discrimination between actives and inactives) was therefore not trivial.
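The similarity analysis described above can be illustrated with a minimal sketch. It assumes that each fingerprint is stored as a set of on-bit indices; the function names and the toy fingerprints are hypothetical and are not taken from the study, which used MACCS and CDK fingerprints.

```python
from typing import List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def similarity_matrix(actives: List[Set[int]],
                      inactives: List[Set[int]]) -> List[List[float]]:
    """Rows correspond to actives, columns to inactives."""
    return [[tanimoto(a, i) for i in inactives] for a in actives]


def max_similarity_per_active(matrix: List[List[float]]) -> List[float]:
    """Similarity of each active to its most similar inactive."""
    return [max(row) for row in matrix]


# Toy fingerprints (hypothetical bit indices, for illustration only).
actives = [{1, 2, 3, 4}, {2, 3, 5}]
inactives = [{1, 2, 3}, {6, 7}, {2, 3, 5, 8}]

matrix = similarity_matrix(actives, inactives)
nearest = max_similarity_per_active(matrix)
print(nearest)
```

A high maximum similarity for most actives would indicate, as in the text, that the randomly selected inactives are not trivially separable from the actives.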
The selection of particular classification algorithms was dictated by their popularity in virtual screening experiments. The performance of the machine learning methods was assessed using the following evaluation metrics: recall, precision, Matthews Correlation Coefficient (MCC), ROC curves and AUC.
Recall, precision and MCC are usually used to provide comprehensive assessments of imbalanced learning problems. ROC curves, on the other hand, are very useful for providing a visual representation of the relative trade-offs between the true positives and false positives of classification with regard to data distributions. However, in the case of highly skewed data sets, ROC curves may provide an overly optimistic view of an algorithm’s performance. In such situations, precision-recall (PR) curves can provide a more informative representation of performance. Because recall, precision and MCC give slightly different information and should be interpreted differently, they are described independently from ROC and AUC. Recall, precision and MCC values are generated only on the basis of the confusion matrix, whereas ROC and AUC take into account the value of the predictive function; they therefore provide information about the expected proportion of positives ranked before a uniformly drawn random negative.
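The ranking interpretation of AUC mentioned above can be made concrete with a small sketch. It computes AUC directly as the fraction of (positive, negative) pairs in which the positive receives the higher score, counting ties as one half. The function name and the scores are hypothetical; this is an illustration of the definition, not the WEKA routine used in the study.

```python
from typing import Sequence


def auc_by_ranking(pos_scores: Sequence[float],
                   neg_scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative."""
    if not pos_scores or not neg_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0   # positive ranked before the negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical classifier scores for 3 actives and 4 inactives.
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2, 0.1]
auc = auc_by_ranking(pos, neg)
print(round(auc, 4))
```

Because this quantity depends only on the relative ordering of scores, it cannot be computed for a model that returns class labels without an underlying score.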
General search performance
Influence of negative training set size on the performance of ML methods
For almost all ML algorithms studied, increasing the number of negative training instances was found to decrease recall and increase both precision and MCC values. However, completely different tendencies were observed for the Naïve Bayes algorithm, which in all cases showed only slight sensitivity to enlargement of the negative training set (Figure 2). According to its methodological assumptions, it labels instances from the test set according to the class distribution of the training data. One would therefore expect that increasing the number of inactive compounds in the training set should improve Naïve Bayes performance in virtual screening-like experiments. However, the attempts to reproduce the class distribution of the training set led to errors in class assignments for sets with a higher number of inactives, which in turn resulted in lower values of the evaluation parameters instead of the expected uplift.
Considering the dynamics of changes in ML performance with a growing number of negative training examples, the SMO and Random Forest algorithms quickly led to models with very good classification effectiveness (Figure 2). In comparison, the improvement of the Ibk and J48 methods was less significant; their corresponding curves on the precision-recall plots responded very slowly to increases in the number of negative instances.
In general, the preferable ratio of active to inactive compounds in the training sets was found to be approximately 1:9–1:10 – further increase in the size of the negative training set led to only slight improvements in global ML methods performance (MCC) that were not profitable due to increases in computational expenses (see Additional file 2: Figure S2, Additional file 3: Table S1).
The improvement of evaluating parameters calculated for targets from the initial tests
With respect to the fingerprints, the overall search tendency did not change considerably between MACCS and CDK FP. However, in almost all cases, the total improvement of the predictive models was much lower for MACCS fingerprints (Table 1). This is also shown in the precision-recall plots (Figure 2), where in almost all cases studied the performance of a particular ML algorithm changed more dynamically when molecules were encoded by CDK FP than by MACCS fingerprints. Moreover, MACCS produced effective classification models (quarter I) in only one case (in combination with SMO for metalloproteinase, Figure 2). Interestingly, CDK FP required approximately 8-fold more time than MACCS for training the prediction models and screening the database. This is likely caused by the difference in the length of the bit string used to represent a molecule (166 and 1024 bit positions in MACCS and CDK FP, respectively).
The improvement of MCC values calculated for targets from the confirmatory tests
Compound data sets
The MDDR database was used as a source of the structures of active compounds for 3 different protein targets: 5-HT1AR agonists, HIV-1 protease inhibitors and metalloproteinase inhibitors used in the initial study. Targets, belonging to diverse families of proteins and possessing large numbers of ligands, were carefully chosen after a survey of the literature concerning different aspects of ML methods tests [14–16].
Table: Composition of training and test sets used in the initial study (columns: MDDR activity index; number of actives/inactives)
The changes in recall, precision and MCC values between particular iterations were statistically insignificant; repeating the study with other randomly selected ZINC sets therefore leads to very similar results, and the dependencies connected with the number of inactives in the training set are preserved.
The ChEMBL Target Classification Hierarchy directed the selection of the 12 targets used in the confirmatory tests, which ensured the diversity of both the proteins and the structures of the active compounds. Because ChEMBL (unlike MDDR) contains numerical values of the parameters that determine compound activity, active compounds were selected manually by setting an appropriate threshold: only compounds whose annotated activity satisfied one of the conditions Ki < 100 nM or IC50 < 200 nM were included in the active class. Because different numbers of active ligands were obtained, the number of inactives was rescaled to ensure the same active to inactive ratios as in the initial study. Details concerning the number of actives in both training and test sets are included in Table 2.
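The selection rule and the rescaling of the inactive set can be sketched as follows. The thresholds come from the text; everything else is an assumption made for illustration: the record layout, the function names, the rounding rule, and the reading of the ratio grid (0.5 to 20 in steps of 0.5) as the number of inactives per active.

```python
import math
from typing import List, Optional

KI_THRESHOLD_NM = 100.0    # Ki < 100 nM (threshold from the text)
IC50_THRESHOLD_NM = 200.0  # IC50 < 200 nM (threshold from the text)


def is_active(ki_nm: Optional[float] = None,
              ic50_nm: Optional[float] = None) -> bool:
    """True if at least one annotated activity satisfies its threshold."""
    if ki_nm is not None and ki_nm < KI_THRESHOLD_NM:
        return True
    if ic50_nm is not None and ic50_nm < IC50_THRESHOLD_NM:
        return True
    return False


def inactives_needed(n_actives: int, inactives_per_active: float) -> int:
    """Inactive count for a given composition (rounding half up: an assumption)."""
    return math.floor(n_actives * inactives_per_active + 0.5)


def ratio_grid(start: float = 0.5, stop: float = 20.0,
               step: float = 0.5) -> List[float]:
    """Grid of ratios, assumed here to mean inactives per active."""
    n_steps = int(round((stop - start) / step))
    return [start + k * step for k in range(n_steps + 1)]


# Hypothetical activity records, for illustration only.
records = [
    {"id": "cmpd_1", "ki_nm": 12.0, "ic50_nm": None},
    {"id": "cmpd_2", "ki_nm": None, "ic50_nm": 150.0},
    {"id": "cmpd_3", "ki_nm": 450.0, "ic50_nm": 900.0},
]
actives = [r["id"] for r in records if is_active(r["ki_nm"], r["ic50_nm"])]
sizes = {ratio: inactives_needed(len(actives), ratio) for ratio in ratio_grid()}
print(actives, sizes[9.0])
```

Scaling the inactive count with the number of actives keeps the training set composition comparable across targets that have different numbers of known ligands.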
Machine learning algorithms
Five ML algorithms commonly used in cheminformatics were selected: Sequential Minimal Optimization (SMO), Naïve Bayes classifier (NB), Instance-Based Learning (Ibk) [19, 20], J48 and Random Forest (RF) [22, 23]. All machine learning calculations were carried out using the WEKA package (version 3.6).
Calculations and performance measures
Recall measures the fraction of positive instances that are correctly identified, precision describes the correctness of positive predictions, and the MCC is a balanced measure of binary classification effectiveness, ranging from −1 to 1, with 1 referring to perfect prediction. Receiver Operating Characteristic (ROC) curves present how the number of correctly classified positive examples varies with the number of incorrectly classified negative examples. Unfortunately, it was not possible to obtain the ROC curve and AUC for the SMO algorithm, because the WEKA implementation used provides only a binary classification output.
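As a minimal sketch, the three confusion-matrix measures defined above can be computed as shown below. The formulas are the standard definitions; the function names, the zero-denominator conventions and the example counts are assumptions chosen for illustration and are not results from the study.

```python
import math


def recall(tp: int, fn: int) -> float:
    """Fraction of true actives that were predicted active."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp: int, fp: int) -> float:
    """Fraction of predicted actives that are truly active."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient, in the range [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


# Hypothetical screening outcome: 50 actives and 950 inactives in the test set.
tp, fn, fp, tn = 40, 10, 20, 930
r = recall(tp, fn)
p = precision(tp, fp)
m = mcc(tp, tn, fp, fn)
print(round(r, 3), round(p, 3), round(m, 3))
```

Unlike recall and precision, MCC uses all four cells of the confusion matrix, which is why it serves as the global effectiveness measure in imbalanced settings such as this one.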
These parameters were selected to enable the assessment of classification effectiveness from various perspectives. All calculations were performed on an Intel Core i7 CPU 3.00 GHz computer system with 24 GB RAM running a 64-bit Linux operating system.
We have investigated a seldom-explored question in machine learning-based virtual screening methodology: how the performance of machine learning depends on the size of the set of negative training examples. We compared a variety of combinations of machine learning algorithms in classification experiments using compounds represented by 2 types of molecular fingerprints, for sets generated on the basis of confirmed active (from 2 activity databases) and varied numbers of assumed inactive compounds randomly selected from ZINC.
We found that the ratio of positive to negative training instances should be taken into account during the preparation of machine learning experiments, as it might significantly influence the performance of a particular classifier. In general, increasing the size of the negative training set (with a constant quantity of positives) led to a decrease in recall and an improvement in precision, whereas no significant effect on AUC values was observed. However, the changes in precision were much larger than the changes in recall, and hence the global classification effectiveness expressed by MCC values was enhanced by the addition of more inactives to the training data. These findings are generally in line with the results obtained by Heikamp et al., who found better performance of SVM models with an increased number of negative training examples. An exception was the Naïve Bayes algorithm, for which no significant changes in model performance were observed. This provides some evidence of its independence of training set perturbation and variation. Overall, the best models (in terms of MCC values) were obtained for the combination of SMO with CDK FP. The results were consistent for all protein targets and fingerprints; however, the fingerprint with the shorter bit string (MACCS) demonstrated less ability to improve ML models. For all the analyzed scenarios (target/ML method/fingerprint) and test set sizes, the preferable ratio of positive to negative training instances was found to be approximately 1:9 to 1:10. These findings indicate that the optimization of negative training set size can be applied as a boosting-like approach in machine learning-based virtual screening.
However, it should be noted that the preferable active:inactive ratio indicated in the study might change under different experimental conditions (the size of the screening database, a different number of actives, the relative diversity of actives towards the inactives used to train the ML algorithms, application of methods for imbalanced learning, optimization of ML method parameters, etc.), but this requires further research that goes far beyond the scope of this article.
Machine learning algorithms used and a short description of their training parameters
Sequential Minimal Optimization (SMO)
The complexity parameter was set at 1, the epsilon for a round-off error was 1.0 E-12, and the option of normalizing training data was chosen. The normalized polynomial kernel was used.
Naïve Bayes (NB)
Instance-Based Learning (Ibk)
The nearest neighbor search algorithm using the Euclidean distance function and 1 neighbor.
Random Forest (RF)
Trees with unlimited depth, seed number: 1. Number of generated trees: 10.
The study was supported by the grant PRELUDIUM 2011/03/N/NZ2/02478 financed by the National Science Centre. SS would also like to acknowledge the Marian Smoluchowski Kraków Research Consortium: “Matter-Energy-Future” for the scholarship provided within the KNOW funding.
- Melville JL, Burke EK, Hirst JD: Machine learning in virtual screening. Comb Chem High Throughput Screen. 2009, 12: 332-343. 10.2174/138620709788167980.
- Ma XH, Wang R, Yang SY, Li ZR, Xue Y, Wei YC, Low BC, Chen YZ: Evaluation of virtual screening performance of support vector machines trained by sparsely distributed active compounds. J Chem Inf Model. 2008, 48: 1227-1237. 10.1021/ci800022e.
- Plewczynski D, Spieser SH, Koch U: Assessing different classification methods for virtual screening. J Chem Inf Model. 2006, 46: 1098-1106. 10.1021/ci050519k.
- Bruce CL, Melville JL, Pickett SD, Hirst JD: Contemporary QSAR classifiers compared. J Chem Inf Model. 2007, 47: 219-227. 10.1021/ci600332j.
- Smusz S, Kurczab R, Bojarski AJ: A multidimensional analysis of machine learning methods performance in the classification of bioactive compounds. Chemom Intell Lab Syst. 2013, 128: 89-100.
- Smusz S, Kurczab R, Bojarski AJ: The influence of the inactives subset generation on the performance of machine learning methods. J Cheminf. 2013, 5: 17-25. 10.1186/1758-2946-5-17.
- Irwin JJ, Sterling T, Mysinger MM, Bolstad ES, Coleman RG: ZINC: A free tool to discover chemistry for biology. J Chem Inf Model. 2012, 52: 1757-1768. 10.1021/ci3001277.
- MDDR licensed by Accelrys, Inc., USA, [http://www.accelrys.com]
- Huang N, Shoichet BK, Irwin JJ: Benchmarking sets for molecular docking. J Med Chem. 2006, 49: 6789-6801. 10.1021/jm0608356.
- Heikamp K, Bajorath J: Comparison of confirmed inactive and randomly selected compounds as negative training examples in support vector machine-based virtual screening. J Chem Inf Model. 2013, 53: 1595-1601. 10.1021/ci4002712.
- Wang Y, Xiao J, Suzek TO, Zhang J, Wang J, Zhou Z, Han L, Karapetyan K, Dracheva S, Shoemaker BA, Bolton E, Gindulyte A, Bryant SH: PubChem’s BioAssay Database. Nucleic Acids Res. 2012, 40: D400-D412. 10.1093/nar/gkr1132.
- Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Light Y, Mcglinchey S, Michalovich D, Al-Lazikani B, Overington JP: ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2011, 40: D1100-D1107.
- Davis J, Goadrich M: The relationship between precision-recall and ROC curves. Proceedings of the 23rd international conference on Machine Learning. 2006, 233-240.
- Chen B, Harrison RF, Papadatos G, Willett P, Wood DJ, Lewell XQ, Greenidge P, Stiefl N: Evaluation of machine-learning methods for ligand-based virtual screening. J Comput Aid Mol Des. 2007, 21: 53-62. 10.1007/s10822-006-9096-5.
- Ma XH, Jia J, Zhu F, Xue Y, Li ZR, Chen YZ: Comparative analysis of machine learning methods in ligand-based virtual screening of large compound libraries. Comb Chem High Throughput Screen. 2009, 12: 344-357. 10.2174/138620709788167944.
- Cannon EO, Amini A, Bender A, Sternberg MJE, Muggleton SH, Glen RC, Mitchell JBO: Support vector inductive logic programming outperforms the naive Bayes classifier and inductive logic programming for the classification of bioactive chemical compounds. J Comput Aid Mol Des. 2007, 21: 269-280. 10.1007/s10822-007-9113-3.
- Platt JC: Sequential Minimal Optimization: A fast algorithm for training Support Vector Machines. Technical Report MSR-TR-98-14. 1998, 1-21.
- Mitchell TM: Machine Learning. 1997, New York: McGraw-Hill
- Aha DW, Kibler D, Albert MK: Instance-based learning algorithms. Mach Learn. 1991, 6: 37-66.
- Brighton H, Mellish C: Advances in instance selection for instance-based learning algorithms. Data Min Knowl Disc. 2002, 6: 153-172. 10.1023/A:1014043630878.
- Quinlan JR: Induction of decision trees. Mach Learn. 1986, 1: 81-106.
- Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP: Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci. 2003, 43: 1947-1958. 10.1021/ci034160g.
- Breiman L: Random forests. Mach Learn. 2001, 45: 5-32. 10.1023/A:1010933404324.
- MACCS Structural keys, Accelrys, San Diego, CA, USA, [http://www.accelrys.com]
- Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E: The Chemistry Development Kit (CDK): an open-source Java library for chemo- and bioinformatics. J Chem Inf Comput Sci. 2003, 43: 493-500. 10.1021/ci025584y.
- Yap CW: PaDEL-Descriptor: an open source software to calculate molecular descriptors and fingerprints. J Comput Chem. 2011, 32: 1466-1474. 10.1002/jcc.21707.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.