Fast rule-based bioactivity prediction using associative classification mining
© Yu and Wild; licensee Chemistry Central Ltd. 2012
Received: 15 August 2012
Accepted: 23 October 2012
Published: 23 November 2012
Relating chemical features to bioactivities is critical in molecular design and is used extensively in the lead discovery and optimization process. A variety of techniques from statistics, data mining and machine learning have been applied to this process. In this study, we utilize a collection of methods, called associative classification mining (ACM), which are popular in the data mining community, but so far have not been applied widely in cheminformatics. More specifically, classification based on predictive association rules (CPAR), classification based on multiple association rules (CMAR) and classification based on association rules (CBA) are employed on three datasets using various descriptor sets. Experimental evaluations on anti-tuberculosis (antiTB), mutagenicity and hERG (the human Ether-a-go-go-Related Gene) blocker datasets show that these three methods are computationally scalable and appropriate for high speed mining. Additionally, they provide comparable accuracy and efficiency to the commonly used Bayesian and support vector machines (SVM) methods, and produce highly interpretable models.
Classification is an essential part of data mining; it involves predicting a categorical (discrete, unordered) label from a set of attributes or variables. In cheminformatics, the attributes are usually molecular descriptors such as structural fingerprints or physicochemical properties, while the label represents bioactivity (for example, an active/inactive class). Classification methods such as decision forest, Bayesian classification [2–5], artificial neural networks (ANN), support vector machines (SVM) [6–8], the k-nearest neighbor approach and random forest, inter alia, have been used extensively in cheminformatics, especially in drug discovery, to predict the activity of a compound from its structural features. Several studies in the data mining community have shown that classification based on association rule mining, so-called associative classification mining (ACM), is able to build accurate classifiers [11–13] and is comparable to traditional methods such as decision trees, rule induction and probabilistic approaches. ACM is a data mining framework that employs association rule mining (ARM) methods to build classification systems, also known as associative classifiers. Recently, ARM or ACM has been applied in the biological domain to genotype-phenotype mapping, gene expression data mining [15–17], protein-protein interaction (PPI) and protein-DNA binding. Genes found to be associated with each other by ARM or ACM can be helpful in building gene networks. Furthermore, the effect of the cellular environment, drugs or other physiological conditions on gene expression can be uncovered by ACM as well. In the cheminformatics field, there have been a few methods and typical applications using frequent itemset mining [20–23]. These methods enumerate fragments or sub-graphs of the structure by applying sub-graph discovery algorithms to the topological structure of a molecule.
Some studies used an existing algorithm, frequent sub-graphs (FSG), while others [21, 24] developed their own methods. Besides being used directly in associative classification, the mined frequent sub-graphs can be used as features for other methods such as an SVM classifier. However, to the best of our knowledge, ACM has not been well explored in cheminformatics compared with other fields.
A sample dataset with fingerprint as features
Prior to rule generation, all frequent ruleitems are discovered. A ruleitem is strong if and only if it satisfies both a minimum support threshold θ (named minsup) and a minimum confidence threshold δ (named minconf). The support of a ruleitem is the percentage of transactions in T that contain X ∪ C (i.e., both X and C); the confidence of a ruleitem is the percentage of transactions in T containing X that also contain C. In probability terms, support (X → C) = P (X ∪ C) and confidence (X → C) = P (C|X). For the above example, the support = 3/5 = 60% and confidence = 60%/60% = 100%. If θ = 10% and δ = 75%, then the example ruleitem is frequent and strong. Each ruleitem passing the minconf threshold is identified and a corresponding rule is generated. The rule derived from the example, “if a compound’s fingerprint has Bit1 set and Bit7 not set then it tends to be active”, provides an intuitive interpretation of the relationship between biological activity and chemical features.
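As a minimal sketch of these definitions, the following Python snippet computes support and confidence for the example ruleitem. The five transactions are hypothetical (the sample table is not reproduced here) and were chosen only so that the numbers agree with the 60% support and 100% confidence quoted in the text; the function names are illustrative as well.

```python
# Hypothetical transactions: (set of items, class label).
# An item is a (bit name, value) pair, e.g. ("Bit1", 1) means Bit1 is set.
TRANSACTIONS = [
    ({("Bit1", 1), ("Bit7", 0)}, "active"),
    ({("Bit1", 1), ("Bit7", 0)}, "active"),
    ({("Bit1", 1), ("Bit7", 0)}, "active"),
    ({("Bit1", 0), ("Bit7", 1)}, "inactive"),
    ({("Bit1", 1), ("Bit7", 1)}, "inactive"),
]

def support(transactions, condset, label):
    """Fraction of transactions containing both the condition set X and class C."""
    hits = sum(1 for items, cls in transactions
               if condset <= items and cls == label)
    return hits / len(transactions)

def confidence(transactions, condset, label):
    """Fraction of the transactions containing X that also carry class C."""
    covered = [cls for items, cls in transactions if condset <= items]
    if not covered:
        return 0.0
    return sum(1 for cls in covered if cls == label) / len(covered)

def is_strong(transactions, condset, label, minsup, minconf):
    """True if the ruleitem X -> C passes both the minsup and minconf thresholds."""
    return (support(transactions, condset, label) >= minsup
            and confidence(transactions, condset, label) >= minconf)

# The example ruleitem: Bit1 set AND Bit7 not set -> active.
X = frozenset({("Bit1", 1), ("Bit7", 0)})
print(support(TRANSACTIONS, X, "active"), confidence(TRANSACTIONS, X, "active"))
```

With minsup = 0.10 and minconf = 0.75 the example ruleitem passes both thresholds; raising minsup to 0.75 would reject it, because its support is only 0.6.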
Apriori, frequent pattern growth (FP-growth) and Eclat are the three most widely used basic algorithms for frequent itemset mining, which is the first and most time-consuming step. For example, CBA employs a traditional breadth-first method, Apriori, and CMAR utilizes the FP-growth approach. Other algorithms are also applied; for instance, a modified first order inductive learner (FOIL) is adopted by CPAR. Once all frequent ruleitems are discovered, they can be used for classifier generation and prediction. The rule set is reduced by pruning and evaluation, which remove redundant and non-predictive rules to improve efficiency and accuracy. Popular pruning techniques include chi-square testing, database coverage, rule redundancy, conflicting rules and pessimistic error estimation. Pruning can be applied when extracting frequent ruleitems, generating rules (chi-square testing), or building classifiers (database coverage). Some pruning techniques, such as database coverage and rule redundancy, tend to produce small rule sets, while others tend to generate relatively larger classifiers. In practice, there is a trade-off between classifier size and accuracy. After a classifier is built, it is used in the next two steps: rule ranking and prediction.
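To illustrate the breadth-first idea behind Apriori, here is a minimal, unoptimized sketch of level-wise frequent itemset mining. It is not the implementation used by CBA or by the software package in this study; items are plain strings for brevity (in ACM they would be feature-value pairs plus a class label), and the `max_len` parameter is an assumption that mirrors a cap on rule length.

```python
from itertools import combinations

def apriori(transactions, minsup, max_len=4):
    """Return {frozenset(itemset): support} for every itemset with support >= minsup."""
    tx = [frozenset(t) for t in transactions]
    n = len(tx)

    def sup(itemset):
        return sum(1 for t in tx if itemset <= t) / n

    # Level 1: frequent single items.
    level = {}
    for item in sorted({i for t in tx for i in t}):
        s = sup(frozenset([item]))
        if s >= minsup:
            level[frozenset([item])] = s
    frequent = dict(level)

    k = 2
    while level and k <= max_len:
        prev = list(level)
        # Join step: merge pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        level = {}
        for cand in candidates:
            # Prune step: every (k-1)-subset of a candidate must be frequent.
            if all(frozenset(sub) in frequent for sub in combinations(cand, k - 1)):
                s = sup(cand)
                if s >= minsup:
                    level[cand] = s
        frequent.update(level)
        k += 1
    return frequent
```

The prune step is what keeps the search tractable: a candidate is counted against the data only if all of its subsets have already been found frequent.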
First, rules in the classifier are ranked by support, confidence and cardinality. In the event of a tie, most methods assign the order randomly, but Thabtah et al. argued that the class distribution frequency of the rule should be considered in this situation. The prediction is based either on the single rule that matches the new data and has the highest precedence, or on multiple rules that are all applicable to the new data. Prediction methods are categorized as maximum likelihood-based [11, 27], score-based and Laplace-based.
In some cases, the resulting classifier is more appealing than a “black box” such as ANN, SVM or Bayesian. Although most ACM algorithms have been tested against standard datasets from the UCI data collection, the application of these methods to chemical data and the interpretation of the generated ACM classifiers in terms of chemical features and bioactivity have not been reported.
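The ranking and single-rule prediction steps can be sketched as follows. The precedence used here (higher confidence first, then higher support, then fewer items) is an assumption for illustration, as are the `Rule` type and the default class; ties that remain keep their input order rather than being resolved randomly.

```python
from collections import namedtuple

# Illustrative rule record: condition set X, predicted class, and rule statistics.
Rule = namedtuple("Rule", ["condset", "label", "support", "confidence"])

def rank_rules(rules):
    """Sort by confidence (descending), then support (descending), then rule length."""
    return sorted(rules, key=lambda r: (-r.confidence, -r.support, len(r.condset)))

def predict(ranked_rules, items, default_label):
    """Single-rule prediction: the first (highest-precedence) matching rule decides."""
    for rule in ranked_rules:
        if rule.condset <= items:
            return rule.label
    return default_label  # no rule matches the new compound
```

Multi-rule schemes differ only in the last step: instead of returning at the first match, they collect all matching rules and combine them, for example by a score per class.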
In this paper we present data supporting the viewpoint that ACM can be used for modeling chemical datasets while preserving some appealing features from other methods.
The hERG dataset is downloaded from the PharmacoKinetics Knowledge Base (PKKB). The dataset contains 806 molecules with hERG activities: 495 compounds are from Li’s dataset and 66 from the WOMBAT-PK database; the other 245 compounds are collected by PKKB from publications. Compounds are classified into blockers (IC50 less than or equal to 40 μM) and non-blockers (IC50 greater than 40 μM).
The antituberculosis (antiTB) dataset is obtained from Prathipati’s paper. According to this paper, the dataset contains a large number of curated and diverse chemical compounds which are appropriate for modeling. In this study, all 3,779 compounds are used. The compounds are classified into active and inactive groups using the same criterion as in that paper: a minimum inhibitory concentration (MIC) of less than 5 μM.
The mutagenicity dataset contains 4,337 compounds with Ames test data and 2-D structures. The dataset is constructed from the available Ames test data using the following criteria: a) standard Ames test data of Salmonella typhimurium strains required for regulatory evaluation of drug approval; b) Ames test performed with the standard plate method or the preincubation method, either with or without a metabolic activation mixture. Compounds with at least one positive Ames test result are classified as mutagens, otherwise as non-mutagens.
Property descriptors used in the modeling
Both Naïve Bayesian (Bayesian) and ACM prefer categorical attributes, since the conditional probabilities for Bayesian can be described using a smaller table and the number of itemsets for ACM can be significantly reduced. Converting continuous attributes into categorical attributes also allows all attributes and the class to be treated identically. Quantitative (numeric) attributes such as AlogP, molecular weight, and the numbers of H-bond acceptors, H-bond donors and rotatable bonds are discretized into levels, and the levels are mapped to categorical values. For example, AlogP is set to 1 for 0≤AlogP≤3.5, 2 for 3.5<AlogP≤7 and 3 for 7<AlogP. For every dataset, entropy-based discretization is applied to all attributes using the “Discretize by Entropy” operator in RapidMiner 5.1 with default settings. Previous studies have shown that the performance of the Bayesian algorithm can be significantly improved if entropy-based discretization is adopted [42, 43]. As a result, all continuous attributes are converted into categorical attributes for mining.
Both MDL public keys and PubChem fingerprints are fixed-length bit strings, with sizes of 166 and 881 bits respectively. There is a one-to-one mapping between bits and molecular features, which is ideal for mining and model interpretation. Each bit is set to 1 or 0, representing the presence or absence of a predefined chemical feature. The bit string can be mined directly by the software package used in this research.
All computations are carried out on a PC with a Q6600 2.4 GHz processor and 6 GB of memory running the 64-bit Windows 7 operating system. Results of Bayesian and SVM are used as references. The computation and modeling for Bayesian and SVM are performed using RapidMiner with default settings. In terms of speed, Bayesian is the fastest, and ACM is faster than SVM in most cases. For example, the computation times for the mutagenicity dataset are 5 seconds, 20.5 minutes, 1.5 minutes, 5.7 minutes and 6.3 minutes for Bayesian, SVM, CPAR (another implementation takes only 12 seconds), CMAR and CBA respectively.
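The two ideas in this paragraph can be sketched in a few lines of Python. `alogp_level` applies the three example AlogP levels given in the text, and `best_entropy_cut` finds a single cut point that minimizes the weighted class entropy of the two resulting bins. This is not RapidMiner's implementation, only an illustration of one entropy-minimizing split; the function names and the handling of negative AlogP values are assumptions.

```python
import math
from collections import Counter

def alogp_level(alogp):
    """Map AlogP to the example levels: 1 for [0, 3.5], 2 for (3.5, 7], 3 for > 7."""
    if alogp > 7:
        return 3
    if alogp > 3.5:
        return 2
    # Assumption: values below 0 are not covered by the text; they fall into level 1 here.
    return 1

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_entropy_cut(values, labels):
    """Return the cut point that minimizes the weighted entropy of the two bins."""
    pairs = sorted(zip(values, labels))
    best_cut, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary can be placed between identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_cut
```

A full discretizer would apply such splits recursively within each bin until a stopping criterion is met; that part is omitted here.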
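One way to expose a fingerprint to an itemset miner is to turn every bit into a feature-value item, so that both set and unset bits can appear in a rule (as in “Bit1 set and Bit7 not set”). The sketch below assumes a fingerprint given as a string of '0' and '1' characters and 1-based bit names; the input format actually expected by the software package is not described here.

```python
def fingerprint_to_items(bits, prefix="Bit"):
    """Turn a bit string such as '1001' into a set of (bit name, value) items."""
    if set(bits) - {"0", "1"}:
        raise ValueError("bit string may only contain '0' and '1'")
    # Bits are numbered from 1 (an assumption made for this sketch).
    return {(f"{prefix}{pos}", int(b)) for pos, b in enumerate(bits, start=1)}
```

Because the mapping between bits and chemical features is one-to-one, each item in a mined rule can be translated directly back to a substructure.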
Summary of used ACM methods
Classification based on predictive association rules (CPAR) uses a greedy approach, a weighted version of FOIL-gain, to identify features and discover rules. A PNArray data structure is utilized to reduce storage space and computation time.
Classification based on multiple association rules (CMAR) employs the FP-growth method to discover rules. FP-growth builds an FP-tree from the dataset, which uses less storage space and improves the efficiency of retrieving rules.
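For orientation, the standard FOIL gain of adding a literal to a rule can be written as a small function. Here `pos` and `neg` are the numbers of positive and negative examples covered by the current rule, and `pos_new` and `neg_new` are the numbers still covered after the literal is added. In the weighted variant these would be sums of example weights rather than counts. The base-2 logarithm and the convention of returning 0 when no positives remain are assumptions of this sketch.

```python
import math

def foil_gain(pos, neg, pos_new, neg_new):
    """FOIL gain: pos_new * (log2(precision after) - log2(precision before))."""
    if pos_new == 0:
        return 0.0  # convention for this sketch: a literal covering no positives gains nothing
    before = math.log2(pos / (pos + neg))
    after = math.log2(pos_new / (pos_new + neg_new))
    return pos_new * (after - before)
```

A greedy rule builder would evaluate this gain for every candidate literal and extend the rule with the best one, stopping when no literal yields a sufficient gain.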
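The storage saving of an FP-tree comes from sharing prefixes between transactions. The following sketch builds only the tree itself; the header table and node links that FP-growth uses for mining are omitted, and the class and function names are illustrative rather than taken from any CMAR implementation.

```python
from collections import Counter

class FPNode:
    """One node of the prefix tree: an item, its count, and its children."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    """Build an FP-tree; returns (root, {frequent item: global count})."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    root = FPNode(None)
    for t in transactions:
        # Keep frequent items only, ordered by descending global count
        # (ties broken by item name) so that common prefixes are shared.
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root, frequent

def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())
```

For example, four transactions holding eight item occurrences in total can collapse into five tree nodes when they share prefixes.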
To further study the robustness of the generated models, Y-randomization is applied to the antiTB dataset as an alternative validation method. Paola recommended that Y-randomization and cross-validation (CV) be carried out in parallel to test the significance of the derived models. In this method, the bioactivity vector is randomly shuffled and a new model is built on the original feature matrix. The process is repeated five times and the resulting models are compared with the original one.
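The procedure can be sketched as a short loop. `fit_and_score` is a hypothetical callback that stands in for training a model on the given features and labels and returning its accuracy; the seed and the function names are assumptions of this sketch.

```python
import random

def y_randomization(features, labels, fit_and_score, n_rounds=5, seed=0):
    """Train on shuffled labels n_rounds times and return the resulting scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = list(labels)   # copy, so the original labels stay untouched
        rng.shuffle(shuffled)     # break the link between features and activity
        scores.append(fit_and_score(features, shuffled))
    return scores
```

If the original model reflects a real structure-activity relationship, its score should be clearly higher than the scores obtained on the shuffled labels.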
F-score of all the data sets using different descriptors or fingerprints
Five models are built for each combination of dataset and feature type (e.g. for the antiTB dataset with MDL keys, antiTB_MDL is used to represent the combination). In total, there are nine combinations of datasets and feature types, which generate forty-five models. Table 5 shows that the overall F-score of ACM is comparable to or better than that of Bayesian and SVM. The highest F-score in each combination is shown in bold. Among the nine combinations, the highest F-score is achieved by Bayesian or SVM in only two: 70.48±1.80 for antiTB_MDL (Bayesian) and 80.80±7.22 for hERG_properties (SVM). A simple ranking method can be used to compare CPAR, CBA and CMAR without considering the complexity of the classifier. For each combination, the three approaches are assigned ranks 1, 2 and 3 according to their accuracy, with 1 for the most accurate. For example, for antiTB_MDL, CPAR is 2, CBA is 1 and CMAR is 3. The final rank is the average over all combinations. The result is 2.11, 1.78 and 2.11 for CPAR, CBA and CMAR respectively, which shows that the order of accuracy in this study is CBA > CPAR = CMAR.
Accuracy of Y-randomization on antiTB_MDL
In total, 27 ACM models are built in our study. For the classifier and significance analysis, the CBA models for the antiTB dataset are chosen to demonstrate the analysis strategies and their chemical significance. The same strategies can be applied to any of the other models, and similar results can be obtained.
Some classifiers have around twenty rules and others may have several hundred. The number of generated rules depends on several factors: the size of the dataset, the features, the algorithm, etc. The results show that CMAR produces the biggest classifiers in most cases. Parameters can be tuned to reduce the size of the classifier, but the accuracy may be lowered correspondingly. Another important characteristic of the classifiers is the length of the rules, namely, the size of the ruleitem. In our study, the item size of CPAR ranges from one to seven. Although the maximum rule length of CBA and CMAR is set to four to reduce the total number of itemsets, the generated rules mostly have a length of two or three. Longer rules can sometimes provide more information about the compounds, since they contain more structural fragments.
If a feature is present, then sign = 1; otherwise sign = −1. The rank of this feature is the sum of R. With the antiTB dataset as an example, Additional file 1: Table S1 shows the rank of each feature for active and inactive compounds respectively. Of particular interest are the features that exist only in active compounds (yellow features in Additional file 1: Table S1) and those found only in inactive compounds (red). For green features, the contribution to bioactivity depends on the other features associated with them. The MDL feature space is reduced from 166 to 101 for the antiTB dataset. The same analysis can be carried out for the PubChem fingerprint. Notably, the feature space for PubChem is reduced from 881 to 146. Although MDL and PubChem use substantially different encoding schemes, the mined features are related, such as MDL 110 with PubChem 366, 117, 123 and 95, MDL 75 with PubChem 392, and MDL 22 with PubChem 116 (Additional file 1: Table S2 and S3). Among the top ten features, multiple features (Additional file 1: Table S4) are linked to each other.
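The ranking scheme described above can be sketched as follows. The function takes one score per method for each dataset/feature combination and returns the average rank of each method; it does not handle ties specially (tied methods keep their input order), and it is an illustration rather than the procedure used to produce the reported averages.

```python
def average_ranks(scores_by_case):
    """scores_by_case: {case: {method: score}}; rank 1 is the highest score per case."""
    totals = {}
    for scores in scores_by_case.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, method in enumerate(ordered, start=1):
            totals[method] = totals.get(method, 0) + rank
    n_cases = len(scores_by_case)
    return {method: total / n_cases for method, total in totals.items()}
```

A lower average rank means the method was more often among the most accurate across the combinations.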
Selected association rules for the antiTB dataset
Rule 1: [#7]~[#6]~[#8] AND *!@[#8]!@* → class = active
Rule 2: Not [#7]~*~*~[#8] AND not [#7]!:*:* → class = inactive
21.38% 75.50%
Rule 3: [#7]~[#6]~[#8] AND *~*(~*)(~*)~* → class = active
Rule 4: [#7]~*~[CH2]~* AND [#8]~[#6]~[#8] → class = active
Rule 5: ALogP[0.985 - 4.446] AND Num_RingBonds[>19] AND ADMET_CYP2D6[=0] → class = active
Rule 6: Num_Hydrogens[18–50] AND Molecular_Solubility[−12.036 - −7.198] AND Molecular_SASA[690.864 - 1058.920] → class = inactive
Rule 7: Molecular_FractionalPolarSASA[0.140 - 0.312] AND Molecular_Solubility[−12.036 - −7.198] AND ChemAxon_HBD[>3] → class = inactive
Rule 8: Num_Bonds[<30] AND ChemAxon_TPSA[<46.170] AND Molecular_FractionalPolarSASA[<0.140] → class = active
The property rules utilize a set of property levels to achieve relatively high classification accuracy. Rule 5 combines ALogP, Num_RingBonds and non-inhibition of CYP2D6 to identify active compounds. Our previous single feature analysis found that an optimum ALogP is important for activity. The specific mechanism behind the association between CYP2D6 level and antiTB activity is not clear. Several popular antiTB drugs, such as isoniazid and rifampicin, induce certain CYP activity. A possible explanation for the link between non-inhibition of CYP2D6 and antiTB activity is that some drugs are administered as prodrugs whose active ingredients are metabolites that depend on CYP activity, such as the drug candidate SQ109. Finally, the ring bond level can help researchers limit the number and size of rings at the same time.
ACM is a powerful tool for modeling, as it offers not only comparable accuracy but also interpretability. In particular, the measures of descriptor importance can provide guidance for molecule design. It does not need prior feature selection or parameter tuning, yet preserves the most appealing feature of Bayesian and decision trees: the ability to handle a large number of descriptors simultaneously. Compared with some tree-based methods, models generated by ACM are relatively stable and their accuracies are higher. Therefore, the interpretation of the model is more reliable, an obvious advantage over “black-box” methods. The mined association rules represent possible relationships between structure and bioactivity. More functional rules can be found by using different features or criteria. Among the three methods studied, CBA has relatively higher accuracy than CPAR and CMAR, and CMAR generates the biggest classifiers. Additionally, the classifier of CPAR has the longest rules.
Single feature analysis provides a fast way to assess the “good” or “bad” features for antiTB compounds. The list of fingerprint bits preferentially present in active or inactive compounds can be used as a guide for screening and optimization. Depending on the attributes and the discretization method, both general and specific interpretations can be drawn from the ACM classifiers by combining them with chemical or biological knowledge. In each case, the generated model indicates that a very strong relationship between structural features and bioactivities exists in the studied datasets.
All ACM methods used here are called traditional ACM methods because they do not distinguish between features of different significance. In some cases, features are not equally important. For example, in our study, even though we know that AlogP, ADMET_BBB_Level or Molecular_SASA are more important than other features, traditional ACM is not able to incorporate this information during mining. Our next step will be to incorporate feature weights into ACM (weighted ACM), which can generate more correlated and important patterns [52–54]. Recently, knowledge from semantic ontologies has been used to understand or interpret the meaning of the patterns produced by ACM. Additionally, it has been integrated into an existing rule reduction process to build concise, high-quality and easily interpretable rule sets. At present, most ontology-driven mining in the biomedical domain uses the UMLS or GO ontology, but several chemical information ontologies such as ChEBI and CHEMINF are now available too. Our future work will try to improve the current models by incorporating constraints from those ontologies during the rule generation process. We envision that there will be more applications of ACM in the chemical domain.
Thanks to Professor Bin Liu from the University of Illinois at Chicago and Professor Frans Coenen from the University of Liverpool for providing the software package and helping with the usage.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.