Computational analysis and predictive modeling of small molecule modulators of microRNA
© Jamal et al.; licensee Chemistry Central Ltd. 2012
Received: 24 March 2012
Accepted: 30 July 2012
Published: 13 August 2012
MicroRNAs (miRNAs) are small, endogenously transcribed regulatory RNAs that modulate gene expression at the post-transcriptional level. These small RNAs have been shown to be critical regulators of a number of biological processes in the cell, and they contribute to the pathophysiology of diseases such as cancer. The increasingly evident roles of microRNAs in disease processes have motivated attempts to target them therapeutically. Recently there has been considerable interest in understanding small-molecule-mediated regulation of RNA, including microRNA.
We have used publicly available datasets of high throughput screens on small molecules with potential to inhibit microRNA. We employed computational methods based on chemical descriptors and machine learning to create predictive computational models for biological activity of small molecules. We further used a substructure based approach to understand common substructures potentially contributing to the activity.
We generated computational models based on Naïve Bayes and Random Forest for mining small RNA-binding molecules from large molecular datasets. We complemented these with a substructure-based approach to identify and understand potentially enriched substructures in the active dataset. We used this approach to assess the miRNA-binding potential of a set of approved drugs, suggesting a probable novel mechanism of off-target activity for these drugs. To the best of our knowledge, this is the first and most comprehensive computational analysis towards understanding RNA-binding activities of small molecules and predictive modeling of these activities.
Keywords: microRNA; Machine learning; Maximum common substructure (MCS)
MicroRNAs are a well-characterized class of small non-coding RNAs now known to be encoded in the genomes of a wide variety of eukaryotes spanning the plant and animal kingdoms [1, 2]. Recent advances in computational and experimental tools have triggered increasing interest in predicting and experimentally validating microRNAs and their biological targets, and in understanding their regulatory roles in a wide variety of organisms [3–5]. MicroRNAs typically mediate post-transcriptional regulation of protein-coding genes by binding to the 3′ untranslated regions of the transcripts [6, 7]. A number of microRNAs are known to regulate crucial oncogenes, acting either to promote or to suppress oncogenesis, and form a distinct class popularly termed ‘oncomiRs’. Because of their widespread role in pathological processes, it has been suggested that microRNAs could act as potential drug targets [9–12].
RNA-binding molecules offer an attractive strategy for modulating microRNA function. The current literature points to a large number of classes of small molecules, including many therapeutically active classes, that have RNA-binding potential [13, 14]. In addition, a large number of studies have reported small molecules that can bind non-coding RNAs and modulate their function [15, 16]. Some of the reported molecules, such as aurintricarboxylic acid, suramin and oxidopamine, modulate microRNA processing by inhibiting microRNA loading onto the RNA-induced silencing complex, while molecules such as enoxacin, a fluoroquinolone antibacterial agent, could potentially modulate microRNA biogenesis in a cancer-specific manner.
Techniques and assays for screening small molecules with the potential to modulate microRNA function or action, apart from phenotypic or specific expression-based screens, are increasingly being adapted for high-throughput screening strategies. Recent advances in compound synthesis and the large number of new compound libraries now available for biological screening create a high demand for predictive computational methods that can prioritize molecules for screening. Previous studies [17, 18] have shown the application of machine learning to predictive modeling of molecules from high-throughput datasets available in the public domain. We have previously used similar strategies, based on 2D descriptors and activities reported in high-throughput screen data from public databases such as PubChem, to prioritize small molecules with anti-tubercular action [19, 20]. Apart from machine learning, chemical similarity searching by means of common substructures has been widely used to predict potential biological activities of compounds and to identify frequently occurring molecular scaffolds in large molecular libraries [21, 22].
In this manuscript, we describe a computational strategy for predictive modeling of small molecules with the potential to inhibit specific microRNAs, based on machine learning from a high-throughput screen dataset for modulators of the microRNA miR-21, a well-studied oncomiR. We show that the methodology is highly accurate with low false positivity. This methodology could potentially be used for computational prioritization of small molecules before performing a high-throughput biological assay. We extend our study to analyze common chemical substructures shared between biologically active molecules using a Maximum Common Substructure (MCS) approach. To the best of our knowledge, this is the first comprehensive analysis of predictive modeling of small-molecule modulators of microRNA.
Results and discussion
Model construction using machine learning algorithms
CSC Naïve Bayes
CSC Random Forest
Evaluation of models
Evaluation of enriched substructures
Significantly enriched scaffolds in the active dataset
DrugBank and Protein Data Bank (PDB) database screening
We used the predictive models to screen approved drugs from the DrugBank database. Out of the 1410 approved drugs, the NB model predicted 205 drugs and the RF model predicted 74 drugs to be active against miR-21 (Additional file 2). A consensus of both models resulted in 43 drugs. A clustering analysis of the 43 drugs (Additional file 3) revealed the presence of mostly heterocyclic compounds comprising benzenes, quinolines, furans, pyridines and their derivatives. The 14 significantly enriched scaffolds were searched in the Protein Data Bank to identify any similarity with known RNA-binding ligands. One positive hit was obtained (Additional file 4) for Scaffold 3, which matched the ligand ‘triazole-acridine’ (PDB-id: R14), which is known to bind to a telomeric RNA quadruplex (PDB-id: 3MIJ).
Virtual screening of experimentally identified novel miRNA inhibitors
We also used the predictive models to screen a set of novel molecules identified as miRNA inhibitors in different literature sources [14–16, 26, 27]. Out of the 37 molecules reported as active in these studies, NB predicted 12 molecules as active and RF predicted 11 molecules as active (Additional file 5). Consensus predictions made by both models suggested 11 molecules to have probable activity against miR-21.
Understanding small molecules that bind to RNA could have implications both for modulating RNA levels in research and for therapeutic applications. In this study, we have created predictive computational models for small molecules with the potential to bind microRNA and inhibit its action, using machine learning algorithms and chemical descriptors. We show that the methodology is highly accurate with low false positivity. This methodology could potentially be used to screen datasets computationally before performing a high-throughput screen, as well as to pick potential hits from large chemical structure datasets. In addition, we have evaluated the maximally enriched substructures in the active dataset of small molecules with activity against miR-21. Apart from being involved in the pathogenesis of neoplasia, miR-21 is also known to be involved in the pathogenesis of Mycobacterium leprae infection and is suggested to be involved in the modulation of immune responses to intracellular pathogens including Mycobacterium tuberculosis. Recent evidence has also suggested that this microRNA, among others, is differentially expressed in individuals with latent tuberculosis. This work would also serve as a starting point to understand and design molecule libraries, both virtual and experimental, with specific activities for research and therapeutic applications. To the best of our knowledge, this is the first comprehensive analysis of predictive modeling of small-molecule modulators of microRNA.
The dataset (AID: 2289), consisting of modulators of the human microRNA miR-21, was downloaded from PubChem. The high-throughput screen consisted of a total of 333,521 tested compounds. Compounds were characterized using a compound ranking system called the ‘PubChem Activity Score’. Compounds with an activity score between 40 and 100 were considered active (3282), all compounds with a score of 0 were considered inactive (301,747), and those with a score between 1 and 39 were labeled inconclusive (28,713). The active and inactive sets were downloaded in Structure Data Format (SDF).
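The labeling rule above can be written down as a small function. This is only an illustrative sketch: the function name and the example compound identifiers and scores are hypothetical, and the thresholds are taken directly from the text.

```python
def label_by_activity_score(score: int) -> str:
    """Map a PubChem Activity Score (0-100) to the class used in the text."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score == 0:
        return "inactive"
    if score >= 40:
        return "active"
    return "inconclusive"  # scores 1-39


# Hypothetical (compound id, score) pairs, for illustration only.
records = [(101, 0), (102, 17), (103, 40), (104, 95)]
labels = {cid: label_by_activity_score(score) for cid, score in records}

# Only active and inactive compounds are carried forward for modeling.
modeling_set = {cid: lab for cid, lab in labels.items() if lab != "inconclusive"}
```

Inconclusive compounds are dropped rather than assigned to either class, which mirrors the download of only the active and inactive sets.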
The bioactivity of compounds in the high-throughput screen PubChem AID 2289 was measured in a cell-based Firefly Luciferase (FLuc) reporter gene assay. However, it has been reported earlier [32, 33] that compounds resembling FLuc substrates can function as competitive inhibitors of the enzyme, resulting in the counterintuitive phenomenon of signal activation. The apparent increase in luminescence could thus be mistakenly interpreted as activity. Therefore, we also used the counter-screen of the miR-21 project (AID: 588342), which uses a library of ~350,000 MLSMR compounds, to separate true positives from potential false positives. The overlap revealed that 2399 compounds in the active set of AID 2289 are inhibitors of FLuc rather than of our target miR-21. All overlapping compounds were filtered out, and only the remaining 883 true positives were considered active for the modeling experiments (Additional file 6).
The chemical structures downloaded from PubChem were imported and 2D descriptors were generated using PowerMV. The large dataset was split into smaller files using SplitSDFiles from MayaChemTools. A total of 179 descriptors were calculated, comprising 147 pharmacophore fingerprints, 24 weighted Burden numbers and 8 property descriptors (Additional file 1). For the bit-string descriptors, each bit was set to ‘1’ when a certain feature was present and ‘0’ when it was not. Attributes whose bit-string descriptor took only one value throughout the dataset (all 0’s or all 1’s) were removed. The dataset was split into a 20% test set and an 80% training-cum-validation set used to build the model.
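The counter-screen filter amounts to a set difference between the primary-screen actives and the compounds flagged as luciferase inhibitors. The sketch below uses hypothetical compound identifiers and a hypothetical function name; only the counts in the final check (3282, 2399 and 883) come from the text.

```python
def remove_reporter_inhibitors(primary_actives, counter_screen_hits):
    """Keep primary-screen actives that were not flagged in the counter-screen."""
    return set(primary_actives) - set(counter_screen_hits)


# Hypothetical compound identifiers, for illustration only.
primary_actives = {11, 12, 13, 14, 15}
fluc_inhibitors = {12, 14, 99}  # 99 was never a primary-screen active

true_positives = remove_reporter_inhibitors(primary_actives, fluc_inhibitors)

# Consistency check on the counts reported in the text.
remaining_actives = 3282 - 2399
```

With the reported numbers, removing the 2399 overlapping compounds from the 3282 primary actives leaves the 883 compounds used as the active class.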
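The two preprocessing steps, removing uninformative attributes and holding out a test set, can be sketched as follows. The text does not say how compounds were assigned to the 20% test set, so the seeded random shuffle here is an assumption, as are the function names and the toy descriptor rows.

```python
import random


def drop_constant_columns(rows):
    """Remove descriptor columns that take a single value across all rows.

    Returns the filtered rows and the indices of the columns that were kept.
    """
    if not rows:
        return [], []
    n_cols = len(rows[0])
    keep = [j for j in range(n_cols) if len({row[j] for row in rows}) > 1]
    return [[row[j] for j in keep] for row in rows], keep


def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle with a fixed seed (an assumption) and hold out a test fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy bit-string descriptors: columns 0 and 2 are constant and get dropped.
rows = [[0, 1, 1], [0, 0, 1], [0, 1, 1]]
filtered, kept = drop_constant_columns(rows)

train, test = split_train_test(range(10))
```

In practice the split would usually be stratified so that the rare active class appears in both partitions; that refinement is omitted here for brevity.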
Cost sensitive classification
One of the caveats of virtual screening of bioassay data is the imbalance between active and inactive compounds. A dataset is considered imbalanced when one class is represented by a much larger number of instances than the other. Cost-sensitive classification has been used previously to overcome this problem. In cost-sensitive learning, misclassification of the minority class is assigned a high cost, which the algorithm then attempts to minimize. We used Weka (Waikato Environment for Knowledge Analysis), a popular suite of machine learning software, to perform the modeling tasks. In Weka, cost sensitivity is introduced by means of a cost matrix whose cells correspond to those of the confusion matrix. In the present binary classification scheme, a 2x2 matrix was deployed to predict the class with the minimum expected misclassification cost. A 2x2 confusion matrix consists of four sections: true positives (TP), active compounds correctly classified as active; false positives (FP), inactive compounds incorrectly classified as active; true negatives (TN), inactive compounds correctly classified as inactive; and false negatives (FN), active compounds incorrectly classified as inactive. Because false negatives were deemed the more costly error, a misclassification cost was set for false negatives and incremented serially to optimize the predictions, while the maximum false positive rate was constrained to approximately 20%. The optimal misclassification cost setting in the Weka cost matrix depends on the base classifier used. The model was first built with the training dataset, and 5-fold cross validation was used during training. Cross validation is a technique in which the data are partitioned into subsets; the analysis is performed on one part (the training set) and validated on the other (the validation or testing set), and the procedure is repeated so that each fold serves once as the validation set. The base classifiers used were Naïve Bayes and Random Forest.
For both Naive Bayes and Random forest, cost sensitivity was employed.
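The minimum-expected-cost decision and the serial increase of the false negative cost can be illustrated with a short sketch. It is a simplification under stated assumptions: the predicted probabilities are held fixed, whereas in the study each cost setting was evaluated through Weka; the false positive cost is fixed at 1; and all function names, probabilities and candidate costs are hypothetical.

```python
def predict_min_expected_cost(p_active, fn_cost, fp_cost=1.0):
    """Choose the class with the lower expected misclassification cost."""
    cost_if_predict_inactive = p_active * fn_cost        # risk of a false negative
    cost_if_predict_active = (1.0 - p_active) * fp_cost  # risk of a false positive
    if cost_if_predict_active < cost_if_predict_inactive:
        return "active"
    return "inactive"


def false_positive_rate(probs, labels, fn_cost):
    """FPR = FP / (FP + TN) over the truly inactive compounds."""
    fp = tn = 0
    for p, y in zip(probs, labels):
        if y == "inactive":
            if predict_min_expected_cost(p, fn_cost) == "active":
                fp += 1
            else:
                tn += 1
    return fp / (fp + tn) if (fp + tn) else 0.0


def largest_cost_within_fpr(probs, labels, costs, max_fpr=0.2):
    """Step through ascending FN costs; keep the last one within the FPR cap."""
    best = None
    for cost in costs:
        if false_positive_rate(probs, labels, cost) <= max_fpr:
            best = cost
    return best


# Hypothetical predicted probabilities of activity and true labels.
probs = [0.90, 0.40, 0.30, 0.15, 0.10, 0.05, 0.02]
labels = ["active", "active"] + ["inactive"] * 5
chosen_cost = largest_cost_within_fpr(probs, labels, costs=[1, 2, 3, 5, 8])
```

Raising the false negative cost lowers the probability threshold at which a compound is called active, so sensitivity improves at the price of more false positives; the cap on the false positive rate bounds that trade-off.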
Machine learning is a field of artificial intelligence concerned with predicting outcomes from properties learned from a dataset of known outcomes, termed the training data. The following machine learning algorithms were used in our experiments.
Naïve Bayes is one of the simplest probabilistic classifiers. The technique is based on Bayes' theorem in statistics. A Bayesian classifier considers each structural feature or descriptor to be independent of the other descriptors, and the probability of activity is considered to be proportional to the ratio of actives to inactives that share the descriptor value. The final probability that a compound is active is a product of all descriptor-based probabilities.
Random Forest was first described by Leo Breiman [40]. It is an ensemble classification method based on decision trees. The algorithm tries to find as good a distinction as possible between active compounds and others on the basis of a set of molecular descriptors. It identifies features shared by different subsets of active compounds and accordingly filters out compounds in the target dataset that lack these combinations. It is among the most accurate classifiers available.
We used various statistical measures, namely Accuracy, Sensitivity, Specificity, Balanced Classification Rate (BCR) and the Receiver Operating Characteristic (ROC), to evaluate the models. Sensitivity, Specificity and Accuracy are expressed in terms of the numbers of true positives (TP), false negatives (FN), true negatives (TN) and false positives (FP). The True Positive Rate (TPR) is the proportion of actual positives that are correctly predicted as active, TP/(TP + FN). The False Positive Rate (FPR) is the ratio of falsely predicted actives to the actual number of inactives, FP/(FP + TN). Accuracy indicates the overall effectiveness of the classifier and is calculated as (TP + TN)/(TP + TN + FP + FN). Sensitivity is the proportion of actual positives that are predicted positive, TP/(TP + FN), and is therefore identical to the TPR. Specificity is the proportion of actual negatives that are predicted negative, TN/(TN + FP). The Balanced Classification Rate (BCR) is the average of sensitivity and specificity, and measures the classifier's ability to avoid false classification in both classes.
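To make the product-of-probabilities idea concrete, the following is a minimal Bernoulli Naïve Bayes sketch for bit-string descriptors. It is not the Weka implementation used in the study: the Laplace smoothing, the log-space arithmetic, the function names and the four toy compounds are all assumptions made for illustration.

```python
import math


def train_bernoulli_nb(X, y):
    """Fit per-class bit frequencies. X: lists of 0/1 bits; y: 1=active, 0=inactive.

    Both classes must be present in y, otherwise a class prior would be zero.
    """
    n_bits = len(X[0])
    counts = {0: [0] * n_bits, 1: [0] * n_bits}
    totals = {0: 0, 1: 0}
    for bits, label in zip(X, y):
        totals[label] += 1
        for j, b in enumerate(bits):
            counts[label][j] += b
    # Laplace smoothing (+1 / +2) avoids zero probabilities; an assumption here.
    p_bit = {
        c: [(counts[c][j] + 1) / (totals[c] + 2) for j in range(n_bits)]
        for c in (0, 1)
    }
    prior = {c: totals[c] / len(y) for c in (0, 1)}
    return prior, p_bit


def prob_active(model, bits):
    """Posterior P(active | bits), treating the bits as independent."""
    prior, p_bit = model
    log_score = {}
    for c in (0, 1):
        s = math.log(prior[c])
        for j, b in enumerate(bits):
            p = p_bit[c][j]
            s += math.log(p if b else 1.0 - p)  # sum of logs = log of product
        log_score[c] = s
    m = max(log_score.values())  # subtract the max for numerical stability
    e0 = math.exp(log_score[0] - m)
    e1 = math.exp(log_score[1] - m)
    return e1 / (e0 + e1)


# Toy data: bit 0 separates the classes, bit 1 carries no information.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
model = train_bernoulli_nb(X, y)
```

Working in log space avoids numerical underflow when the product runs over many descriptors, which matters with 147 fingerprint bits per compound.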
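The evaluation measures follow directly from the four confusion-matrix counts. The sketch below implements the formulas given above; the function name and the example counts are hypothetical and do not come from the study.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute the evaluation measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fpr": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "bcr": (sensitivity + specificity) / 2,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=80, fn=20, tn=900, fp=100)
```

On an imbalanced dataset, accuracy is dominated by the inactive class, which is why the BCR is reported alongside it.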
Maximum common substructure search
A maximum common substructure (MCS) based approach was used to identify potentially enriched substructures among the bioactive molecules. We used the hierarchical clustering algorithm ‘LibMCS’, available from ChemAxon [41], to recognize the substructure common to a pair of molecules. This MCS-based classification of molecules creates disjoint subsets, in which each molecule belongs to one cluster only. The minimum size of the MCS, measured as the number of constituent atoms, was empirically set to a threshold of 10 atoms in this study, owing to the complexity of the structures involved and the computation required to generate the clusters.
The molecular scaffolds generated by clustering were then used as SMILES queries to search for substructures in both the active and inactive datasets. This was accomplished using the ‘jcsearch’ program available from ChemAxon [42]. The substructures were subsequently evaluated for enrichment using the chi-square test, and the p-values were used to evaluate the significance of enrichment. We retained substructures that matched more than 1% of the entries in the active dataset. We also calculated an enrichment factor and used an empirical threshold of 2 to prioritize molecules for further analysis. A molecular alignment of the selected scaffolds with molecules of the active dataset was performed using vROCS (release 3.1.2) and visualized in VIDA (4.1.1), both available from OpenEye Scientific Software, Inc.
The authors thank Dr Chetana Sachidanandan and Dr Souvik Maiti for reviewing the manuscript and for scientific suggestions. The authors also thank the Open Source Drug Discovery (OSDD) community for support and discussions. The computation was supported by CDAC India through the Garuda grid, and the authors acknowledge help and support from the CDAC Garuda grid team members. This work was funded by the Council of Scientific and Industrial Research (CSIR), India, through the Open Source Drug Discovery Project (HCP001).
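The enrichment test for a single scaffold can be sketched from a 2x2 table of match counts. Several choices here are assumptions, since the text does not give them: the enrichment factor is taken as the match rate in actives divided by the match rate in inactives, the chi-square statistic is computed without continuity correction, and the significance cut-off `alpha` is a hypothetical parameter. The function names and example counts are also hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 table (no continuity correction; assumed).

    a, b: actives with / without the substructure
    c, d: inactives with / without the substructure
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def enrichment_factor(a, b, c, d):
    """Match rate in actives divided by match rate in inactives (assumed form)."""
    active_rate = a / (a + b)
    inactive_rate = c / (c + d)
    return math.inf if inactive_rate == 0 else active_rate / inactive_rate


def is_enriched(a, b, c, d, min_active_fraction=0.01, min_ef=2.0, alpha=0.05):
    """Apply the >1% coverage and EF thresholds plus a hypothetical alpha."""
    _, p_value = chi_square_2x2(a, b, c, d)
    return (
        a / (a + b) > min_active_fraction
        and enrichment_factor(a, b, c, d) >= min_ef
        and p_value < alpha
    )


# Hypothetical counts: 30 of 100 actives and 50 of 1000 inactives match.
chi2, p_value = chi_square_2x2(30, 70, 50, 950)
```

The coverage filter guards against scaffolds that look strongly enriched only because they match a handful of actives.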
1. Ambros V: microRNAs: tiny regulators with great potential. Cell. 2001, 107: 823-826. 10.1016/S0092-8674(01)00616-X.
2. Bartel DP: MicroRNAs: genomics, biogenesis, mechanism, and function. Cell. 2004, 116: 281-297. 10.1016/S0092-8674(04)00045-5.
3. Yoon S, De MG: Computational identification of microRNAs and their targets. Birth Defects Res C Embryo Today. 2006, 78: 118-128. 10.1002/bdrc.20067.
4. Chaudhuri K, Chatterjee R: MicroRNA detection and target prediction: integration of computational and experimental approaches. DNA Cell Biol. 2007, 26: 321-337. 10.1089/dna.2006.0549.
5. Mendes ND, Freitas AT, Sagot MF: Current tools for the identification of miRNA genes and their targets. Nucleic Acids Res. 2009, 37: 2419-2433. 10.1093/nar/gkp145.
6. Filipowicz W, Bhattacharyya SN, Sonenberg N: Mechanisms of post-transcriptional regulation by microRNAs: are the answers in sight?. Nat Rev Genet. 2008, 9: 102-114.
7. Chekulaeva M, Filipowicz W: Mechanisms of miRNA-mediated post-transcriptional regulation in animal cells. Curr Opin Cell Biol. 2009, 21: 452-460. 10.1016/j.ceb.2009.04.009.
8. Cho WC: OncomiRs: the discovery and progress of microRNAs in cancers. Mol Cancer. 2007, 6: 60. 10.1186/1476-4598-6-60.
9. Scaria V, Hariharan M, Brahmachari SK, Maiti S, Pillai B: microRNA: an emerging therapeutic. ChemMedChem. 2007, 2: 789-792. 10.1002/cmdc.200600278.
10. Liu Z, Sall A, Yang D: MicroRNA: An emerging therapeutic target and intervention tool. Int J Mol Sci. 2008, 9: 978-999. 10.3390/ijms9060978.
11. Roshan R, Ghosh T, Scaria V, Pillai B: MicroRNAs: novel therapeutic targets in neurodegenerative diseases. Drug Discov Today. 2009, 14: 1123-1129. 10.1016/j.drudis.2009.09.009.
12. Mishra PK, Tyagi N, Kumar M, Tyagi SC: MicroRNAs as a therapeutic target for cardiovascular diseases. J Cell Mol Med. 2009, 13: 778-789. 10.1111/j.1582-4934.2009.00744.x.
13. Gumireddy K, Young DD, Xiong X, Hogenesch JB, Huang Q, Deiters A: Small-molecule inhibitors of microrna miR-21 function. Angew Chem Int Ed Engl. 2008, 47: 7482-7484. 10.1002/anie.200801555.
14. Melo S, Villanueva A, Moutinho C, Davalos V, Spizzo R, Ivan C, et al: Small molecule enoxacin is a cancer-specific growth inhibitor that acts by enhancing TAR RNA-binding protein 2-mediated microRNA processing. Proc Natl Acad Sci U S A. 2011, 108: 4394-4399. 10.1073/pnas.1014720108.
15. Shan G, Li Y, Zhang J, Li W, Szulwach KE, Duan R, et al: A small molecule enhances RNA interference and promotes microRNA processing. Nat Biotechnol. 2008, 26: 933-940. 10.1038/nbt.1481.
16. Tan GS, Chiu CH, Garchow BG, Metzler D, Diamond SL, Kiriakidou M: Small molecule inhibition of RISC loading. ACS Chem Biol. 2012, 7: 403-410. 10.1021/cb200253h.
17. Schierz AC: Virtual screening of bioassay data. J Cheminform. 2009, 1: 21. 10.1186/1758-2946-1-21.
18. Melville JL, Burke EK, Hirst JD: Machine Learning in Virtual Screening. Comb Chem High Throughput Screen. 2009, 12: 332-343. 10.2174/138620709788167980.
19. Periwal V, Rajappan JK, Jaleel AU, Scaria V: Predictive models for anti-tubercular molecules using machine learning on high-throughput biological screening datasets. BMC Res Notes. 2011, 4: 504. 10.1186/1756-0500-4-504.
20. Periwal V, Kishtapuram S, Scaria V: Computational models for in-vitro anti-tubercular activity of molecules based on high-throughput chemical biology screening datasets. BMC Pharmacol. 2012, 12: 1.
21. Cao Y, Jiang T, Girke T: A maximum common substructure-based algorithm for searching and predicting drug-like compounds. Bioinformatics. 2008, 24: i366-i374. 10.1093/bioinformatics/btn186.
22. Stahl M, Mauser H: Database clustering with a combination of fingerprint and maximum common substructure methods. J Chem Inf Model. 2005, 45: 542-548. 10.1021/ci050011h.
23. Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, et al: DrugBank 3.0: a comprehensive resource for 'omics' research on drugs. Nucleic Acids Res. 2011, 39: D1035-D1041. 10.1093/nar/gkq1126.
24. Bernstein FC, Koetzle TF, Williams GJ, Meyer EF, Brice MD, Rodgers JR, et al: The Protein Data Bank: a computer-based archival file for macromolecular structures. J Mol Biol. 1977, 112: 535-542. 10.1016/S0022-2836(77)80200-3.
25. Collie GW, Sparapani S, Parkinson GN, Neidle S: Structural basis of telomeric RNA quadruplex–acridine ligand recognition. J Am Chem Soc. 2011, 133: 2721-2728. 10.1021/ja109767y.
26. Young DD, Connelly CM, Grohmann C, Deiters A: Small molecule modifiers of microRNA miR-122 function for the treatment of hepatitis C virus infection and hepatocellular carcinoma. J Am Chem Soc. 2010, 132: 7976-7981. 10.1021/ja910275u.
27. Watashi K, Yeung ML, Starost MF, Hosmane RS, Jeang KT: Identification of small molecules that suppress microRNA function and reverse tumorigenesis. J Biol Chem. 2010, 285: 24707-24716. 10.1074/jbc.M109.062976.
28. Liu PT, Wheelwright M, Teles R, Komisopoulou E, Edfeldt K, Ferguson B, et al: MicroRNA-21 targets the vitamin D-dependent antimicrobial pathway in leprosy. Nat Med. 2012, 18: 267-273. 10.1038/nm.2584.
29. Xu G, Zhang Y, Jia H, Li J, Liu X, Engelhardt JF, et al: Cloning and identification of microRNAs in bovine alveolar macrophages. Mol Cell Biochem. 2009, 332: 9-16. 10.1007/s11010-009-0168-4.
30. Wang C, Yang S, Sun G, Tang X, Lu S, Neyrolles O, et al: Comparative miRNA expression profiles in individuals with latent and active tuberculosis. PLoS One. 2011, 6: e25832. 10.1371/journal.pone.0025832.
31. Wang Y, Xiao J, Suzek TO, Zhang J, Wang J, Bryant SH: PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009, 37: W623-W633. 10.1093/nar/gkp456.
32. Thompson JF, Hayes LS, Lloyd DB: Modulation of firefly luciferase stability and impact on studies of gene regulation. Gene. 1991, 103: 171-177. 10.1016/0378-1119(91)90270-L.
33. Auld DS, Thorne N, Nguyen DT, Inglese J: A specific mechanism for nonspecific activation in reporter-gene assays. ACS Chem Biol. 2008, 3: 463-470. 10.1021/cb8000793.
34. Liu K, Feng J, Young SS: PowerMV: a software environment for molecular viewing, descriptor generation, data analysis and hit evaluation. J Chem Inf Model. 2005, 45: 515-522. 10.1021/ci049847v.
35. Sud M: MayaChemTools. 2010, http://www.mayachemtools.org/
36. Blagus R, Lusa L: Class prediction for high-dimensional class-imbalanced data. BMC Bioinforma. 2010, 11: 523. 10.1186/1471-2105-11-523.
37. Elkan C: The Foundations of Cost-Sensitive Learning. 973-978.
38. Bouckaert RR, Frank E, Hall MA, Holmes G, Pfahringer B, Reutemann P, et al: Weka - Experiences with a Java Open-Source Project. J Mach Learn Res. 2010, 2533-2541.
39. Friedman N, Geiger D, GoldSzmidt M: Bayesian Network Classifiers. Mach Learn. 1997, 29: 131-163. 10.1023/A:1007465528199.
40. Breiman L: Random Forests. Mach Learn. 2001, 45: 5-32. 10.1023/A:1010933404324.
41. Chemaxon: Budapest H. Library MCS, version 0.7. 2008.
42. Chemaxon: Budapest H. Jcsearch version 5.8.2.
43. vROCS: release 3.1.2. 2010, OpenEye Scientific Software, Inc, Santa Fe, NM, USA, www.eyesopen.com
44. VIDA: version 4.1.1. 2010, OpenEye Scientific Software, Inc, Santa Fe, NM, USA, www.eyesopen.com
45. OpenEye Scientific Software, Inc: Santa Fe, NM, USA. 2010, www.eyesopen.com
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.