Enhanced ranking of PknB Inhibitors using data fusion methods
© Seal et al; licensee Chemistry Central Ltd. 2013
Received: 27 August 2012
Accepted: 11 December 2012
Published: 14 January 2013
Mycobacterium tuberculosis encodes 11 putative serine-threonine protein kinases (STPKs), which regulate transcription, cell development and interaction with host cells. Of the 11 STPKs, three kinases, namely PknA, PknB and PknG, have been related to mycobacterial growth. Previous studies have shown that PknB is essential for mycobacterial growth, is expressed during the log phase of growth, and phosphorylates substrates involved in peptidoglycan biosynthesis. In recent years many high-affinity inhibitors of PknB have been reported. Data fusion has previously been shown to enrich active compounds effectively in both structure-based and ligand-based approaches. In this study we applied three data fusion ranking algorithms to the PknB dataset, namely sum rank, sum score and reciprocal rank. We found that the reciprocal rank algorithm is able to rank active compounds early in a virtual screening process. We also screened the Asinex database with the reciprocal rank algorithm to identify possible inhibitors of PknB.
In our work we used both structure-based and ligand-based approaches for virtual screening, and combined their results using a variety of data fusion methods. We found that data fusion increases the chance of actives being ranked highly. Specifically, we found that the rankings from pharmacophore search, ROCS and Glide XP, fused with a reciprocal rank algorithm, not only outperform the individual structure-based and ligand-based approaches but also rank actives better than the other two data fusion methods according to the BEDROC, robust initial enhancement (RIE) and AUC metrics. These fused results were used to identify 45 candidate compounds for further experimental validation.
We show that very different structure and ligand based methods for predicting drug-target interactions can be combined effectively using data fusion, outperforming any single method in ranking of actives. Such fused results show promise for a coherent selection of candidates for biological screening.
Most people affected by tuberculosis carry both active and latent infection. The mechanism by which Mycobacterium tuberculosis shifts between the latent and active states is not clearly understood, but one essential component of such systems is the regulation of cell wall synthesis and cell division in response to stimuli from the host via signal transduction. One of the chief mechanisms by which extracellular signals are translated into intracellular responses is protein phosphorylation. In bacteria, protein phosphorylation is carried out by specific protein kinases in a two-component system. In eukaryotes, protein phosphatases and protein kinases play a key role in host and pathogen signal transduction pathways. Mycobacterium tuberculosis has 11 serine-threonine protein kinases (STPKs), and Ser/Thr kinases have been found to be attractive targets for drug discovery [1, 2]. PknB, PknE and PknG are required for mycobacterial growth, and their 3D structures have been deposited in the PDB (http://htt://www.rcsb.org); they resemble the human kinases, with a conserved motif and a striking similarity of the ATP-bound kinase domain to the activated eukaryotic Ser/Thr kinases. PknB is a receptor-like transmembrane protein with an extracellular signal sensor domain (PASTA) and an intracellular kinase domain, and it shares high sequence similarity with eukaryotic STPKs [4, 5]. PknB is a functionally important protein kinase that can autophosphorylate. Its expression has been shown to be constitutive, and it is present under both in vitro and in vivo conditions. MtrA was previously identified as a response regulator in Mycobacterium tuberculosis and found to be essential for growth. Fernandez et al. further observed that knock-out and overexpression of PknB affect cell morphology, which supports its involvement in cell division and cell shape and suggests that PknB is a suitable molecular target for inhibitor design.
There is a need to design new inhibitors because current drugs, whilst effective in vitro, have not shown good in vivo activity. Our hypothesis is that this is because the compounds are not targeting PknB in cells. Although the sequence identity with eukaryotic kinases is less than 27%, the PknB structure shows very low RMSDs of 1.36 Å and 1.72 Å with them [7–9], and the overall catalytic domain is similar to that of eukaryotic protein kinases: the N-terminal subdomain includes a β-sheet and a long α-helix, and the C-terminal lobe consists of α-helices. In this work we used ligand-based and structure-based approaches to screen a large set of inhibitors. Many high-affinity inhibitors have previously been reported for PknB [11–15]; we used the 62 inhibitors listed in Additional file 1 for our work.
Virtual screening (VS) using structure-based and ligand-based approaches is widely used in drug discovery. Structure-based screening uses information about a protein target, usually through molecular docking. It requires a known protein structure, but not known active ligands. Ligand-based screening uses only information from active ligands and does not require a protein target structure. Both approaches can be applied in parallel in VS, but often they are applied as stepwise filters. The most commonly applied VS methods are molecular docking, pharmacophore identification and ligand similarity (including shape-based similarity), along with a variety of machine learning methods that "learn" to differentiate actives from inactives based on known data [18, 19]. Simple similarity searching with known ligands can also be effective [20, 21].
The most important challenge in VS is to create an accurate scoring function that can distinguish novel bioactive molecules from inactive ones. In docking, three classes of scoring function are distinguished: force-field-based, empirical and knowledge-based scoring. Each class includes various scoring algorithms used for molecular docking. Historically, docking scores have not correlated well with binding activity, although consensus scoring, which takes a weighted average of several methods, can result in improvements [23, 24]. However, these consensus scores are only concerned with variations of a structure-based approach, and their limitations have been documented. Data fusion has been shown to be effective in integrating data from different sources [26–28]; for example, Willett et al. combined 2D similarity searches based on different similarity measures using the SUM function. However, few results have been reported that use data fusion with both structure-based and ligand-based approaches [29, 30].
In this study we applied multiple ligand-based and structure-based methods to the PknB problem and then combined their results using data fusion. Performance was evaluated with a widely used benchmark decoy set from Schrodinger (http://www.schrodinger.com/glidedecoyset), which has been used in other VS studies [31, 32]. The decoys in this set have properties similar to those of the active compounds but are topologically dissimilar. We also evaluated the ranking of actives using a variety of well-established metrics, including enrichment factor (EF), RIE and BEDROC [34, 35]. We calculated EF, RIE and BEDROC for each of the VS protocols (pharmacophore search, shape screening and docking) and for each of the data fusion algorithms (sum score, sum rank and reciprocal rank), and found that reciprocal rank outperformed all of the individual VS protocols as well as the other fusion algorithms. The next best data fusion algorithm was sum score rank, which also outperformed the structure-based and ligand-based approaches.
The aim of the study was to apply several VS protocols, fuse their results, identify the best performer among the 3D structure-based methods, ligand-based methods and fusion algorithms, and show how these results could be used to select compounds for follow-up testing.
Conformational sampling was performed on all database molecules using the ConfGen search algorithm. ConfGen with the OPLS 2005 force field was applied to generate conformers, with duplicate poses eliminated if the RMSD was less than 1.0 Å. A distance-dependent dielectric constant of 4 and a maximum relative energy difference of 10 kcal/mol were applied, as suggested by Salam et al. For validation of the structure-based and ligand-based approaches we used a database of 1035 compounds, consisting of 35 active compounds randomly selected from the dataset of 62 actives and 1000 decoy compounds; the database is given in Additional file 2.
For protein preparation, the crystal structure of PknB bound to the inhibitor mitoxantrone (Mtz) (PDB ID: 2FUM) was prepared using the Protein Preparation Wizard. Bond orders and formal charges were added for hetero groups, and hydrogens were added to all atoms in the system. Water molecules were removed. A brief relaxation was performed using an all-atom constrained minimization carried out with the Impact Refinement module (Impref) (Impact v5.0, Schrodinger, LLC, New York, NY) using the OPLS-2005 force field to alleviate steric clashes that may exist in the original PDB structures. The minimization was terminated when the energy converged or the RMSD reached a maximum cutoff of 0.30 Å.
Three VS methods were applied in the current study: docking using Glide [32, 37], e-pharmacophore search using Phase and 3D shape similarity search using vROCS. Our aim was to investigate whether data fusion methods can search for and rank actives from a database.
For docking, the extra precision (XP) mode was used for both the active and decoy sets, and all settings were left at their defaults except that Epik state penalties were added to the docking score. The Glide energy grid was generated using the 2FUM structure. The Glide grid for the ligand was found to be too small with the default settings, so it was extended to 12 Å to accommodate all the ligands, and the grid covered the whole of the active site.
Using a Python script we extracted the remaining hits and ranked them in descending order of fitness value. For vROCS screening, conformers of the compounds were generated using OMEGA, with a maximum of 1000 conformers per molecule and the parameters described by Bostrom et al. The query for ROCS was the Glide XP docked pose of the active compound VIII. We selected this compound because its e-pharmacophore was used in the Phase find matches search. The explanation can be found in the Results and Discussion section.
where C_i is the rank of compound i and j is the index of the system, i.e. the VS protocol used.
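The three fusion rules named in this study (sum rank, sum score and reciprocal rank) can be illustrated with a minimal Python sketch. It is not the authors' script: the function names, the toy scores and the min-max normalisation used for sum score are assumptions made for illustration, and reciprocal rank is written in its usual information-retrieval form, the sum over protocols j of 1 / rank_j(i).

```python
from typing import Dict, List

Ranks = Dict[str, int]


def to_ranks(scores: Dict[str, float], higher_is_better: bool = True) -> Ranks:
    """Convert raw scores of one VS protocol into ranks (1 = best)."""
    ordered = sorted(scores, key=lambda c: scores[c], reverse=higher_is_better)
    return {c: pos for pos, c in enumerate(ordered, start=1)}


def sum_rank(rank_lists: List[Ranks]) -> Dict[str, float]:
    """Sum of the ranks over all protocols; a smaller value is better."""
    return {c: float(sum(r[c] for r in rank_lists)) for c in rank_lists[0]}


def reciprocal_rank(rank_lists: List[Ranks]) -> Dict[str, float]:
    """Sum of 1/rank over all protocols; a larger value is better."""
    return {c: sum(1.0 / r[c] for r in rank_lists) for c in rank_lists[0]}


def sum_score(score_lists: List[Dict[str, float]]) -> Dict[str, float]:
    """Sum of min-max normalised scores; a larger value is better.

    Assumption: every score list is oriented so that higher is better
    (docking scores must be negated by the caller) and the normalisation
    scheme is min-max, which the text does not specify.
    """
    fused = {c: 0.0 for c in score_lists[0]}
    for scores in score_lists:
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo
        for c in fused:
            fused[c] += (scores[c] - lo) / span if span else 0.0
    return fused


# Toy data (hypothetical values, not taken from the study).
pharm_fitness = {"a": 2.5, "b": 1.0, "c": 1.8}     # higher is better
rocs_combo = {"a": 1.2, "b": 1.4, "c": 0.9}        # higher is better
glide_score = {"a": -9.0, "b": -7.5, "c": -8.0}    # lower is better

rank_lists = [
    to_ranks(pharm_fitness),
    to_ranks(rocs_combo),
    to_ranks(glide_score, higher_is_better=False),
]
fused = reciprocal_rank(rank_lists)
for compound in sorted(fused, key=fused.get, reverse=True):
    print(compound, round(fused[compound], 3))
```

Because 1/rank falls off quickly, a compound ranked first by any single protocol keeps a high fused value, which is one plausible reason why reciprocal rank favours early recognition.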
TP is the number of true positives returned after screening the database, TN is the number of true negatives, FP is the number of false positives, FN is the number of false negatives and A is the total number of actives in the database.
where D is the total number of compounds in the database.
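The quantities defined above (TP, TN, FP, FN, A and D) are sufficient to compute the hit-list metrics reported in this work. The sketch below uses the standard textbook definitions of sensitivity, specificity, % yield of actives, % actives, enrichment factor and the Güner-Henry goodness of hit (GH) score; these are assumed to correspond to the equations used here, and the function name and the example counts are hypothetical.

```python
from typing import Dict


def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Hit-list metrics from a confusion matrix (standard definitions)."""
    actives = tp + fn            # A: total actives in the database
    database = tp + fp + tn + fn  # D: total compounds in the database
    hits = tp + fp               # Ht: size of the retrieved hit list
    if hits == 0 or actives == 0 or database == actives:
        raise ValueError("need at least one hit, one active and one decoy")

    yield_of_actives = tp / hits  # Ha / Ht
    return {
        "sensitivity": tp / actives,
        "specificity": tn / (tn + fp),
        "pct_yield_of_actives": 100.0 * yield_of_actives,
        "pct_actives": 100.0 * tp / actives,
        # Enrichment over random selection: (Ha / Ht) / (A / D)
        "enrichment_factor": yield_of_actives / (actives / database),
        # Guner-Henry score: rewards yield and recall, penalises decoys retrieved
        "gh_score": (tp * (3 * actives + hits)) / (4 * hits * actives)
        * (1 - (hits - tp) / (database - actives)),
    }


# Hypothetical hit list of 100 compounds drawn from a database with
# 35 actives and 1000 decoys (the database size used for validation).
metrics = screening_metrics(tp=28, fp=72, tn=928, fn=7)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```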
where x_i is the relative rank of the i-th active compound (its rank divided by the total number of compounds) and α is the tuning parameter. By changing α, one can control how strongly the early ranking of hits is weighted. BEDROC values lie in the range [0, 1], and BEDROC can be interpreted as the probability that an active is ranked before a compound drawn at random from an exponential distribution with parameter α. BEDROC and RIE have a linear relationship.
We calculated the BEDROC value for the three VS methods at α = 20. Setting α = 20 means that 80% of the final BEDROC score is based on the first 8% of the ranked dataset.
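As an illustration of how RIE and BEDROC can be computed from the ranks of the actives, the following sketch follows the commonly used formulation of Truchon and Bayly, with BEDROC written as RIE rescaled between its minimum and maximum attainable values (which reflects the linear relationship between the two). The function names and the example ranks are assumptions for illustration; the last helper checks the statement that about 80% of the weight falls on the first 8% of the list when α = 20.

```python
import math
from typing import Sequence


def rie(active_ranks: Sequence[int], n_total: int, alpha: float = 20.0) -> float:
    """Robust initial enhancement; ranks are 1-based positions of the actives."""
    n = len(active_ranks)
    observed = sum(math.exp(-alpha * r / n_total) for r in active_ranks) / n
    # Expected value of the same average for a uniformly random ranking
    expected = (1 - math.exp(-alpha)) / (n_total * (math.exp(alpha / n_total) - 1))
    return observed / expected


def bedroc(active_ranks: Sequence[int], n_total: int, alpha: float = 20.0) -> float:
    """BEDROC = (RIE - RIE_min) / (RIE_max - RIE_min), bounded by 0 and 1."""
    ra = len(active_ranks) / n_total  # fraction of actives in the database
    rie_max = (1 - math.exp(-alpha * ra)) / (ra * (1 - math.exp(-alpha)))
    rie_min = (1 - math.exp(alpha * ra)) / (ra * (1 - math.exp(alpha)))
    return (rie(active_ranks, n_total, alpha) - rie_min) / (rie_max - rie_min)


def weight_of_top_fraction(z: float, alpha: float = 20.0) -> float:
    """Share of the exponential weight carried by the top fraction z of the list."""
    return (1 - math.exp(-alpha * z)) / (1 - math.exp(-alpha))


# Hypothetical example: 35 actives in a ranked list of 1035 compounds,
# with every active placed at an odd rank near the top of the list.
example_ranks = list(range(1, 70, 2))
print(round(bedroc(example_ranks, 1035), 3))
print(round(weight_of_top_fraction(0.08), 3))
```

Writing BEDROC as a rescaled RIE keeps the two bounds explicit: a ranking with all actives at the top scores 1 and a ranking with all actives at the bottom scores 0.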
E-pharmacophore search, vROCS and Glide SP docking were performed on the Asinex dataset in a step-by-step process. To select the best-scoring molecules from the Asinex database, we first screened it using the e-pharmacophore, requiring a match to at least four of the five features present in the pharmacophore model. A distance matching tolerance of 2 Å was applied during pharmacophore mapping of the database. Excluded volumes generated from the e-pharmacophore were also included in the search. Phase find matches retrieved 5000 molecules from the database screen. For vROCS screening we used these 5000 molecules as the primary dataset and screened them using the Glide XP docked pose of compound VIII as the query. We also carried out Glide SP docking on the 5000 hits retrieved by the Phase find matches search. Docking was carried out using the prepared protein 2FUM; the ligands were docked into the ATP binding site of the protein.
All 5000 hits were ranked accordingly. For the e-pharmacophore search, the shortlisted hits were ranked in descending order of fitness, i.e. the compound with the highest fitness was given the best rank. ROCS ranked compounds automatically, based on the Tanimoto combo score. Flexible Glide SP docking was carried out and the ranking was based on the docking energy. The sum score method was then applied to these scores and rankings to rank the hits. The sum score was selected since it produced better results than the other fusion methods. After ranking and screening, the top 500 compounds (~10%) of the dataset were used for further evaluation. These 500 compounds were further docked using flexible Glide XP docking, and the binding poses of all 500 molecules were visually inspected. The poses were compared with the pharmacophore alignment, and the molecules that aligned in the same pattern were considered. Molecules forming at least 2 H-bonds with the hinge region at the ATP binding site, along with any one of the rings, were considered.
Table: Enrichment factors, BEDROC values and RIE of the different methods applied in virtual screening. Rows: E-pharmacophore I (5 sites), E-pharmacophore II (7 sites), E-pharmacophore III (5 sites).
Table: % yield of actives (Ya), % actives, sensitivity, specificity and GH score of the pharmacophores. Columns: % Yield of Actives, Goodness of Hit (GH score). Rows: E-pharmacophore I, E-pharmacophore II, E-pharmacophore III.
Table: Donor D8 and ring R13 along with any other pharmacophoric point from the 5 sites. Columns: % Yield of Actives, Goodness of Hit (GH score).
For shape-based screening we used the same dataset of 1035 compounds used for pharmacophore validation, i.e. 1000 decoys and 35 active molecules. The docked pose of compound VIII was taken as the query. After screening, the maximum Tanimoto combo score attained was 1.418. The ROC area obtained with the vROCS program was 0.89, which was higher than that of the pharmacophore search and docking.
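For readers who want to reproduce a ROC area outside vROCS, the area under the ROC curve equals the probability that a randomly chosen active scores higher than a randomly chosen decoy, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name and the toy scores are hypothetical, and it assumes that a higher score is better (as for the Tanimoto combo score), so docking scores would need to be negated first.

```python
from typing import Sequence


def roc_auc(active_scores: Sequence[float], decoy_scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of active/decoy pairs ranked correctly."""
    if not active_scores or not decoy_scores:
        raise ValueError("need at least one active and one decoy")
    wins = 0.0
    for a in active_scores:
        for d in decoy_scores:
            if a > d:
                wins += 1.0   # active ranked ahead of decoy
            elif a == d:
                wins += 0.5   # tie counts as half a win
    return wins / (len(active_scores) * len(decoy_scores))


# Toy similarity scores (hypothetical values).
print(roc_auc([0.9, 0.3], [0.5, 0.1]))
```

The pairwise loop costs one comparison per active/decoy pair, which is 35,000 comparisons for a validation set of 35 actives and 1000 decoys.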
ROC AUCs of the VS methods and data fusion.
Because e-pharmacophore III gave the best BEDROC and RIE scores among the pharmacophore models, we selected it for screening our prepared virtual database of Asinex compounds. We selected the top 5000 hits for our work, of which only 222 scored above 2.0. Interestingly, among these 222 hits only 14 compounds matched all five sites of the pharmacophore. Most of the compounds in the top 222 lacked the D6 donor site of pharmacophore III. However, in the active set we found that some compounds lacking this site showed a good docking score with one acceptor and one donor. For shape-based screening of the 5000 compounds, a maximum of 1000 conformers per molecule was generated using the parameters set by Bostrom et al. We then ran vROCS on the generated dataset using the Glide XP docked query. The maximum Tanimoto combo score attained after screening with the vROCS query was 1.19. Ranking of the vROCS results was done by the program itself. Owing to time and computing power limitations we used Glide SP for docking. All 5000 molecules were ranked in ascending order of docking score. For data fusion we used the reciprocal rank algorithm to rank the compounds because it performed best among the fusion algorithms. After ranking, the top 10% (500 compounds) of the database hits ranked by reciprocal rank were docked into the 2FUM ATP binding site with the Glide XP docking algorithm. Each of the 500 poses was then visually inspected and mapped to the pharmacophoric sites. We selected a list of 45 candidate inhibitors that met the above criteria and also resembled the pose of compound VIII. Additional file 3 contains the structures of the 45 compounds.
In this work we have created a Mycobacterium tuberculosis PknB pharmacophore model that can be used for further development of high-affinity compounds. We developed three pharmacophore models using e-pharmacophore and found that most of the active compounds in the dataset of 62 compounds resemble the kinase type I pharmacophore, which is represented by e-pharmacophore III. Data fusion methods have previously been implemented in 2D screening protocols and are now also being widely accepted in 3D screening methods. In our work we used the sum score, sum rank and reciprocal rank algorithms. The reciprocal rank algorithm is used in information retrieval, in meta search engines. To our knowledge, the reciprocal rank algorithm has not previously been implemented for data fusion using 3D methods. We found that the reciprocal rank algorithm performed better than the sum score and sum rank fusion methods, which indicates that it can rank molecules better in a VS run. After a virtual screening run we identified compounds based on the reciprocal rank algorithm, followed by further docking with Glide XP and pharmacophore mapping. We found 45 compounds that have one acceptor, one donor and one ring. We also mapped the compounds to the physicochemical space of the PknB inhibitors and found that many of them fall in the same physicochemical region as the PknB inhibitors. The set of 45 compounds in Additional file 3 could be taken forward for experimental validation against PknB.
The following are the datasets used for these experiments.
Additional file 1: The PknB dataset of 62 inhibitors, with the PubMed IDs and IC50 values of the inhibitors.
Additional file 2: The validation dataset of 1035 compounds in which 35 are active compounds and 1000 are decoys.
Additional file 3: The 45 compounds that were visually mapped to the pharmacophore and Glide XP docking poses. It also contains the reciprocal rank scores of the compounds along with the Glide XP docking scores, Tanimoto combo scores and pharmacophore fitness values.
We would like to acknowledge the Indo-US Science and Technology Forum for providing the research fellowship. We would like to thank OpenEye for providing an academic license of vROCS to Indiana University. The authors also thank the Open Source Drug Discovery (OSDD) community for support and discussions. We would also like to thank Jae Hong Shin, PhD student at Indiana University, for discussions of kinase pharmacophores, and Anurag Passi (project assistant, CSIR OSDD), Pushpdeep Mishra (Assistant Professor, Bioinformatics, Patkar College, Mumbai, India) and Rajdeep Poddar (Applications Specialist, Bio Analytical Technologies India Pvt Ltd.) for collecting reference articles and drawing structures.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.