When drug discovery meets web search: Learning to Rank for ligand-based virtual screening
© Zhang et al.; licensee Springer. 2015
Received: 24 September 2014
Accepted: 7 January 2015
Published: 13 February 2015
The rapid emergence of novel chemical substances creates a substantial demand for more sophisticated computational methodologies for drug discovery. In this study, the idea of Learning to Rank from web search was applied to virtual drug screening. The approach has two unique capabilities: (1) it can identify compounds for novel targets when not enough training data are available for those targets, and (2) it can integrate heterogeneous data when compound affinities are measured on different platforms.
A standard pipeline was designed to carry out Learning to Rank in virtual screening. Six Learning to Rank algorithms were investigated on two public datasets, one collected from the Binding Database and the other from the newly published Community Structure-Activity Resource benchmark dataset. The results demonstrate that Learning to Rank is an efficient computational strategy for virtual drug screening, particularly because of its novel use in cross-target virtual screening and heterogeneous data integration.
To the best of our knowledge, this is the first application of Learning to Rank in virtual screening. The experimental workflow and algorithm assessment designed in this study will provide a standard protocol for other similar studies. All the datasets as well as the implementations of the Learning to Rank algorithms are available at http://www.tongji.edu.cn/~qiliu/lor_vs.html.
Keywords: Learning to Rank; Virtual screening; Drug discovery; Data integration
Generally, the task of ligand-based virtual screening (VS) is to output a ranked list of a set of molecules according to their binding affinities for a given drug target, so that the top-k molecules can be examined further through in vivo or in vitro tests. The most basic technique used in VS is similarity search, which first fixes a reference compound and then calculates the similarity between each candidate compound and the reference. Many similarity measures have been developed for this step, including the cosine coefficient, Euclidean distance, Soergel distance, Dice coefficient and Tanimoto coefficient. Based on the similarity scores, the candidate compounds are ranked and the top-k compounds can be selected for further investigation. Alternatively, VS can be formulated as learning a function f : Structure → Activity (R^d → R) from a set of training compounds with known affinities for the target. The learned function can then predict the label (compound affinity) of any given molecule from its structural features. Traditionally, this function is learned in a regression or classification form, similar to the procedure of a Quantitative Structure-Activity Relationship (QSAR) study.
Recently, a computational strategy called Learning to Rank (LOR) [6,7], first used in information retrieval and especially in web search, has gained much attention. Web search and VS can be treated as similar problems: both seek a result in which higher-ranked candidates (web pages or compounds) have higher relevance to the underlying target (query or protein). Given this fundamental similarity, LOR should be a promising technique for VS; however, very few studies have been performed in this area.
Compared with traditional statistical-learning-based VS methods, Learning to Rank has two unique capabilities: (1) it can be extended to screen compounds for novel targets when not enough data are available for those targets, and (2) it can integrate heterogeneous data when compound affinities are measured on different platforms. Here, we have developed an integrated framework, which includes (1) a standard pipeline for LOR analysis in virtual drug screening, (2) a comprehensive performance assessment of different LOR algorithms, and (3) publicly available benchmark data for testing. In particular, the experimental workflow and algorithm assessments designed in this study will provide a standard protocol for other similar studies in drug discovery.
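The similarity-search baseline described above can be illustrated with a minimal sketch. It assumes that each compound is represented as the set of bit positions switched on in a binary fingerprint; the compound names, the fingerprints and the helper `rank_by_similarity` are hypothetical and serve only as an illustration of Tanimoto-based ranking.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def rank_by_similarity(query, library, k):
    """Return the names of the k library compounds most similar to the query."""
    scored = sorted(
        library.items(),
        key=lambda item: tanimoto(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Hypothetical fingerprints: each compound is the set of bit positions that are set.
query = {1, 4, 7, 9}
library = {
    "cpd_A": {1, 4, 7, 9, 12},
    "cpd_B": {2, 3, 5},
    "cpd_C": {1, 4, 8},
}
top2 = rank_by_similarity(query, library, k=2)
print(top2)
```

Any of the other listed measures (cosine, Dice, Soergel) could be substituted for `tanimoto` without changing the ranking step.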
Results and discussion
Results of different testing strategies
Curated Binding Database dataset
Curated CSAR dataset
Note that in the following testing strategies, Normalized Discounted Cumulative Gain (NDCG) was applied for the quantitative comparison of different VS methods. NDCG was originally introduced in the information retrieval community to quantitatively measure a ranking result based on the positions of the instances in the ranking list. For ranking performance evaluation, we keep a ground-truth ranking list, which is the ranking of the molecules for a given target based on their known efficacy. Different VS methods then yield different predicted ranking lists from their prediction models. Each predicted ranking list is compared with the ground-truth ranking list to evaluate the VS performance, as measured by the NDCG value. Details of the NDCG calculation are given in Methods.
NDCG@10 of strategy I
In summary, SVMRank was the most efficient of the algorithms tested. Its superiority is probably because this ranking method inherits the maximum-margin characteristics of SVM. It transforms the ranking problem into a classification problem over partially ordered pairs and uses the maximum-margin optimization of SVM to derive the optimal ranking order. SVMRank therefore achieves robust and satisfactory performance in LOR [6,7]. This result indicates that, given proper optimization, a pair-wise LOR model may be a suitable option for VS. Compared with traditional SVR-based VS, LOR can serve as an alternative that achieves acceptable performance.
Taking accuracy and efficiency into consideration, SVMRank was selected for comparison in the following tests. Note that in the following strategies the traditional SVR-based method is not applicable, since either no training data exist for the specific target or the training data are combined from different measurements.
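The reduction from ranking to pair classification mentioned above can be sketched as follows. This is only an illustration of the general pair-wise idea, not the SVMRank implementation: the function name `pairwise_instances`, the toy features and the grades are all hypothetical, and a maximum-margin linear classifier would still have to be trained on the resulting instances.

```python
from itertools import combinations


def pairwise_instances(features, labels):
    """Turn graded compounds of ONE target into (difference vector, +1/-1) instances."""
    instances = []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            continue  # tied compounds carry no order information
        diff = [a - b for a, b in zip(features[i], features[j])]
        # +1 means "compound i should rank above compound j", -1 the opposite.
        instances.append((diff, 1 if labels[i] > labels[j] else -1))
    return instances


# Hypothetical 2-dimensional features and affinity grades for three compounds.
features = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
grades = [2, 2, 0]
pairs = pairwise_instances(features, grades)
print(pairs)
```

A linear scoring function w·x that classifies these difference vectors correctly also orders the original compounds correctly, which is why a binary max-margin learner can be reused for ranking.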
This strategy was designed to investigate the performance of LOR in screening compounds for novel targets when little or no ligand affinity data are available for these targets. Traditional learning-based VS techniques are not suitable in this case, since few or no training data are available for the specific target. Specifically, for the 24 proteins curated from BDB, the data of every 23 protein targets and their associated ligands were combined to form the training dataset, and the model was then tested on the remaining target. The testing procedure was performed 5 times, once on each of the 5 randomly divided parts of the compounds associated with the held-out target. In this way, the testing datasets in strategies I and II were made identical for a fair comparison. The NDCG value averaged over the 5 runs was calculated for each of the 24 targets for quantitative performance evaluation.
In summary, SVMRank can serve as an efficient method for cross-target VS, and its performance can be improved when more biological and pharmaceutical information is taken into consideration, as shown in the following.
In summary, the results of this strategy support that, at least in our dataset, selecting phylogenetically related targets and their associated compound affinity data for training may benefit cross-target prediction to a certain extent. As an efficient cross-target VS method, LOR still has the potential to improve its performance when additional useful information is considered.
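The leave-one-target-out protocol described above can be sketched as below. The data layout (a dictionary from target name to compound identifiers), the function name `leave_one_target_out` and the fixed random seed are assumptions made for illustration; the toy data use 3 targets rather than the 24 curated from BDB.

```python
import random


def leave_one_target_out(data, n_parts=5, seed=0):
    """Yield (held_out_target, training_targets, test_parts) for every target.

    `data` maps a target name to the list of its compound identifiers.
    """
    rng = random.Random(seed)
    for held_out in data:
        # All remaining targets (23 of 24 in the study) form the training set.
        training_targets = [t for t in data if t != held_out]
        compounds = list(data[held_out])
        rng.shuffle(compounds)
        # Deal the shuffled compounds into n_parts roughly equal test parts.
        test_parts = [compounds[i::n_parts] for i in range(n_parts)]
        yield held_out, training_targets, test_parts


# Hypothetical toy data: 3 targets with 10, 7 and 13 compounds.
toy = {"T1": list(range(10)), "T2": list(range(10, 17)), "T3": list(range(17, 30))}
splits = list(leave_one_target_out(toy))
for held_out, training_targets, test_parts in splits:
    print(held_out, training_targets, [len(p) for p in test_parts])
```

Averaging the evaluation score over the five test parts of each held-out target then gives one value per target, as in the protocol above.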
NDCG@10 in strategy IV
NDCG@10 of normal feature mapping
NDCG@10 of cross-term feature mapping
In summary, the test results indicate that LOR may be a good choice for integrating heterogeneous compound affinity data in VS, and that the design of a proper feature mapping in LOR also influences the final ranking result. The design of an efficient feature mapping method remains an open question in this field.
Discussion on various VS methods based on multiple target information
Basically, all traditional regression- or classification-based models require the training and testing data to be i.i.d., and they cannot handle cross-target or cross-platform data integration. Although these methods can be applied directly, their results are not comparable, since the methods are theoretically unsuitable for cross-target or cross-platform scenarios in VS. LOR, in contrast, is theoretically applicable to cross-target screening for two reasons. (1) The LOR model treats a target-compound pair as a whole instance. It does not require the distributions of the training and testing compound data to be identical, so it is inherently suitable for cross-target situations. (2) It considers only the ranking order of the instances for a specific target rather than their exact affinity values. For a specific target, and especially in pair-wise LOR, the compound affinity data are transformed into partially ordered pairs, and these order pairs are treated as the instances. Therefore, although the compound affinities associated with a target may be measured on different platforms, this has no influence on the transformed order pairs. A traditional regression- or classification-based model, by contrast, commonly treats all the compound data associated with different targets as one mixed dataset, so cross-platform effects have to be taken into consideration.
Another important issue for LOR is the proper design of the feature function ∅( ) (see Methods). In the current study we either concatenate the feature vectors of the protein and the compound directly to form the new feature vector, or use the cross-term feature mapping. Compared with direct concatenation, the cross-term feature mapping is more efficient. Both representations have the advantage of simplicity, but their biological meaning remains to be elucidated. Another possible way to generate features is to define a target-compound interaction fingerprint, as applied in our previous work. Such a fingerprint is biologically much more meaningful, but it is often not applicable to large-scale data since generating the fingerprint is time-consuming. We hope that more efficient and meaningful feature functions will be investigated in the near future.
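The second point can be made concrete with a small sketch. It assumes, purely for illustration, that a second measurement platform reports a strictly increasing transformation of the values reported by the first; under that assumption the order pairs derived for one target are identical. The affinity values and the function name `order_pairs` are hypothetical.

```python
def order_pairs(affinities):
    """Return the set of (i, j) index pairs with compound i ranked above compound j."""
    n = len(affinities)
    return {
        (i, j)
        for i in range(n)
        for j in range(n)
        if affinities[i] > affinities[j]
    }


# Hypothetical affinities for the same four compounds of one target.
platform_a = [5.2, 7.9, 6.4, 8.8]
# Assumed relation between platforms: a strictly increasing rescaling.
platform_b = [2.0 * v + 1.5 for v in platform_a]

same_pairs = order_pairs(platform_a) == order_pairs(platform_b)
print(same_pairs)
```

A regression model fitted to the raw values would see two very different target scales here, whereas a pair-wise learner receives the same training instances from either platform.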
Benchmark datasets generation
The testing datasets were collected from two public data sources, the Binding Database (BDB) and the 2012 benchmark dataset published by CSAR. To make a relatively objective and balanced dataset, protein targets and their associated compound affinity data were selected from BDB based on the following criteria: (1) only human protein targets are considered; (2) redundant protein targets are eliminated; (3) the protein targets are selected to cover as many protein families as possible, and once a member of a family has been selected, further proteins from the same family are avoided as far as possible; (4) to keep the data balanced, only targets with between 500 and 1,500 non-redundant ligand records are considered; and (5) the affinity distribution of the compounds associated with a given target should be even. Taking the pIC50 value as the affinity measurement, a compound is normally considered active if its pIC50 value is at least 6 (pIC50 ≥ 6), and inactive otherwise. The affinity was roughly graded into 5 categories, 0 (pIC50 < 6), 1 (6 ≤ pIC50 < 7), 2 (7 ≤ pIC50 < 8), 3 (8 ≤ pIC50 < 9) and 4 (9 ≤ pIC50), according to the literature, and we required that the associated compound affinity values cover these 5 grades evenly. Targets whose associated compound affinities have only grade 0 and grade 1, or whose highest-grade data make up less than 5% of the total, were also removed. Based on these criteria, 24 proteins associated with 9,330 compounds were finally curated (Table 1). These data are used in the first three testing strategies of the pipeline.
The second dataset was curated from the published 2012 CSAR benchmark dataset, which includes six protein targets. Several of them have associated compound affinity information measured in different standards, including pIC50 and pKi values. From this dataset, only the targets Chk1, Erk2 and Urokinase, with their associated compound affinity data, were tested in the fourth strategy of the pipeline (Table 2).
In this work, a comprehensive investigation of LOR was performed on benchmark datasets, and the experimental workflow and algorithm assessment were presented. The results indicate that LOR, especially pair-wise methods such as SVMRank, can serve as an alternative to traditional methods for VS. Furthermore, LOR has inherent advantages: it can be extended to screen molecules for novel targets, and it is useful for data integration. For a novel protein target, whether or not known ligand affinity information exists, LOR can return a satisfactory ranking result. It is also theoretically suitable for ranking compounds based on training data measured on different platforms. Several directions for future work on LOR are: (1) The integration of multiple feature representations of the target and the compound using other descriptors or profiles. The high-dimensional pharmacogenomics information from CMAP [31,32] and PubChem BioAssay data [33,34] can be investigated extensively. Multi-view learning methodology can be investigated to integrate different representations into a comprehensive description of targets and compounds and for similarity calculation; (2) Transfer learning methodology is needed in VS to study “cross-target knowledge transfer” and leverage the information in large-scale target and compound data.
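The five-level grading of pIC50 values stated above amounts to a simple threshold function. The sketch below follows the thresholds given in the text; the function name `affinity_grade` and the example values are hypothetical.

```python
def affinity_grade(pic50: float) -> int:
    """Map a pIC50 value to the five rank grades 0-4 used as ranking labels."""
    if pic50 < 6:
        return 0  # inactive
    if pic50 < 7:
        return 1
    if pic50 < 8:
        return 2
    if pic50 < 9:
        return 3
    return 4


# Hypothetical pIC50 values, including the boundary cases 6.0 and 9.0.
example_values = [5.3, 6.0, 7.5, 8.99, 9.0]
example_grades = [affinity_grade(v) for v in example_values]
print(example_grades)
```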
LOR model in VS
LOR in VS aims to create a ranking function that returns the input compounds in descending order of affinity for the target. Traditionally, the ranking model in VS is purely similarity-based or regression/classification-based. In the LOR framework, we learn a ranking function f(T, C), which is trained by minimizing a ranking loss function on a set of compounds C_ij (i = 1, 2, …, m) for a given set of targets (T_1, T_2, …, T_m). Unlike a traditional machine learning model for a single target, the learned function can generalize to novel data. This means that for a novel target T_(m+1) that was not seen in the training dataset, as long as it can be explicitly represented in the corresponding feature space, the system can also rank the compounds for this target.
Compared with traditional QSAR modeling, LOR differs in that it focuses on multiple targets rather than a single target. LOR uses a set of targets with their associated compounds to train a generalized prediction model and makes predictions on other targets (Figure 8). LOR is therefore suitable for cross-target screening. Such an extended ranking ability for new targets cannot be achieved with the traditional classification or regression models in VS.
Feature representations of targets and compounds
As mentioned above, in the LOR framework a feature vector ∅(T_i, C_ij) should be defined for a given target-compound pair (T_i, C_ij), where ∅( ) denotes the feature function. In this study, the widely used General Descriptor (GD, 32 bit) is employed to represent each ligand as a 32-dimensional feature vector. GD characterizes a compound through four aspects: van der Waals surface area, log P (octanol/water), molar refractivity and partial charge. Protein targets are described by the CTD (Composition, Transition, Distribution) feature, which represents the amino acid distribution patterns of a specific structural or physicochemical property along a protein or peptide sequence. The CTD feature represents each protein target as a 147-dimensional vector. In this study, GD was calculated with the software Molecular Operating Environment (MOE, C.C.G., Inc. Molecular Operation Environment, 2008.10; Montreal, Quebec, Canada, 2008) and the protein CTD feature was calculated with PROFEAT.
After the target and the compound have been represented separately, the choice of ∅( ) is important for the performance of LOR. In strategies I, II and III, the protein features and compound features were concatenated directly to form the new feature vector (179 dimensions in total). In strategy IV, the cross-term feature mapping function was also used to generate the new feature vector for the target-compound pair. The possibility of defining other forms of ∅( ) is discussed in Results and discussion.
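The two feature mappings can be sketched as follows. Direct concatenation follows the text (147 + 32 = 179 dimensions). The text does not spell out the cross-term mapping, so the sketch assumes it consists of all pairwise products of a target feature and a compound feature; the function names and the zero-filled placeholder vectors are hypothetical.

```python
def concat_features(target_vec, compound_vec):
    """Direct combination: target features followed by compound features."""
    return list(target_vec) + list(compound_vec)


def cross_term_features(target_vec, compound_vec):
    """Assumed cross-term mapping: every target feature times every compound feature."""
    return [t * c for t in target_vec for c in compound_vec]


# Placeholder vectors with the dimensions used in the study (values are dummies).
target_ctd = [0.0] * 147   # CTD descriptor of a protein target
compound_gd = [0.0] * 32   # General Descriptor of a ligand

concatenated = concat_features(target_ctd, compound_gd)
crossed = cross_term_features(target_ctd, compound_gd)
print(len(concatenated), len(crossed))
```

Under this assumption the cross-term vector is much longer than the concatenated one, but it lets a linear ranking function weight each target-compound feature interaction separately.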
In the NDCG calculation, y(r) is the rank label of the compound at the r-th position in the ranking list.
Note that if the predicted ranking is exactly the same as the ground truth, the NDCG value will be 1.0. This measure can be used to evaluate LOR results, in contrast to traditional regression- or classification-based performance measures such as RMSE and accuracy. There are also other ranking performance measures, such as ERR and MAP, but they are not as intuitive as NDCG.
It should also be noted that in this study only the top-10 ranking results were evaluated with NDCG, denoted as NDCG@10. This is a very strict evaluation criterion, since the ideal ranking list can only be achieved when the top-10 known candidates are successfully predicted.
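A minimal sketch of NDCG@k is given below. It assumes the commonly used form with gain 2^y(r) − 1 and discount log2(1 + r), normalized by the score of the ideal ordering; the exact variant used in the study may differ, and the function names and example labels are hypothetical.

```python
import math


def dcg_at_k(labels, k):
    """Discounted cumulative gain of the first k rank labels (positions start at 1)."""
    return sum(
        (2 ** y - 1) / math.log2(r + 1)
        for r, y in enumerate(labels[:k], start=1)
    )


def ndcg_at_k(labels_in_predicted_order, k=10):
    """NDCG@k: DCG of the predicted order divided by the DCG of the ideal order."""
    ideal = dcg_at_k(sorted(labels_in_predicted_order, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant compound at all
    return dcg_at_k(labels_in_predicted_order, k) / ideal


# Hypothetical rank labels (grades 0-4) listed in the order predicted by a model.
perfect = ndcg_at_k([4, 3, 2, 1, 0])
reversed_order = ndcg_at_k([0, 1, 2, 3, 4])
print(perfect, reversed_order)
```

The perfectly ordered list scores 1.0, and any misplacement of a high-grade compound within the top k positions lowers the score.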
All the datasets as well as the LOR algorithm packages are available at http://www.tongji.edu.cn/~qiliu/lor_vs.html. This work was supported by the Young Teachers for the Doctoral Program of Ministry of Education, China (Grant No. 20110072120048), Innovation Program of Shanghai Municipal Education Commission (Grant No. 20002360059), the Fundamental Research Funds for the Central Universities (Grant No. 2000219084), National Natural Science Foundation of China (Grant No.31100956 and Grant No. 61173117), National 863 Funding (Grant No. 2012AA020405) and Zhejiang Open Foundation of the Most Important Subjects.
- Agarwal S, Dugar D, Sengupta S. Ranking Chemical Structures for Drug Discovery: A New Machine Learning Approach. J Chem Inf Model. 2010;50(5):716–31.
- Shoichet BK. Virtual screening of chemical libraries. Nature. 2004;432(7019):862–5.
- Walters WP, Stahl MT, Murcko MA. Virtual screening–an overview. Drug Discov Today. 1998;3(4):160–78.
- Fechner U, Schneider G. Evaluation of Distance Metrics for Ligand‐Based Similarity Searching. ChemBioChem. 2004;5(4):538–40.
- Nantasenamat C, Isarankura-Na-Ayudhya C, Naenna T, Prachayasittikul V. A practical overview of quantitative structure-activity relationship. EXCLI J. 2009;8:74–88.
- Trotman A. Learning to rank. Inf Retr. 2005;8(3):359–81.
- Liu T-Y. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval. 2009;3(3):225–331.
- Wassermann AM, Geppert H, Bajorath JR. Searching for target-selective compounds using different combinations of multiclass support vector machine ranking methods, kernel functions, and fingerprint descriptors. J Chem Inf Model. 2009;49(3):582–92.
- Rathke F, Hansen K, Brefeld U, Muller KR. StructRank: A New Approach for Ligand-Based Virtual Screening. J Chem Inf Model. 2011;51(1):83–92.
- Wale N, Karypis G. Target Fishing for Chemical Compounds Using Target-Ligand Activity Data and Ranking Based Methods. J Chem Inf Model. 2009;49(10):2190–201.
- Li S, Leihong W, Xiaohui F, Yiyu C. Consensus Ranking Approach to Understanding the Underlying Mechanism With QSAR. J Chem Inf Model. 2010;50(11):1941–8.
- Al-Sharrah G. Ranking Using the Copeland Score: A Comparison with the Hasse Diagram. J Chem Inf Model. 2010;50(5):785–91.
- Lerche D, Sørensen PB, Brüggemann R. Improved Estimation of the Ranking Probabilities in Partial Orders Using Random Linear Extensions by Approximation of the Mutual Ranking Probability. J Chem Inf Model. 2003;43(5):1471–80.
- Crammer K, Singer Y. Pranking with ranking. Adv Neur In. 2002;14:641–7.
- Van Dang: RankLib [http://people.cs.umass.edu/~vdang/ranklib.html]
- Burges CJ. From ranknet to lambdarank to lambdamart: An overview. Learning. 2010;11:23–581.
- Freund Y, Iyer R, Schapire RE, Singer Y. An efficient boosting algorithm for combining preferences. J Mach Learn Res. 2004;4(6):933–69.
- Joachims T. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM; 2002: 133–142.
- Joachims T. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM; 2006: 217–226.
- Xu J, Li H. Adarank: a boosting algorithm for information retrieval. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, ACM; 2007: 391–398.
- Cao Z, Qin T, Liu T-Y, Tsai M-F, Li H. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning, ACM; 2007: 129–136.
- Chang C-C, Lin C-J. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST). 2011;2(3):27.
- Jacob L, Vert J-P. Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics. 2008;24(19):2149–56.
- Liu Q, Che D, Huang Q, Cao Z, Zhu R. Multi‐target QSAR Study in the Analysis and Design of HIV‐1 Inhibitors. Chin J Chem. 2010;28(9):1587–92.
- Liu Q, Zhou H, Liu L, Chen X, Zhu R, Cao Z. Multi-target QSAR modelling in the analysis and design of HIV-HCV co-inhibitors: an in-silico study. BMC Bioinformatics. 2011;12(1):294.
- Liu Q, Xu Q, Zheng VW, Xue H, Cao Z, Yang Q. Multi-task learning for cross-platform siRNA efficacy prediction: an in-silico study. BMC Bioinformatics. 2010;11(1):181.
- Gao J, Che D, Zheng VW, Zhu R, Liu Q. Integrated QSAR study for inhibitors of hedgehog signal pathway against multiple cell lines: a collaborative filtering method. BMC Bioinformatics. 2012;13(1):186.
- Gao J, Huang Q, Wu D, Zhang Q, Zhang Y, Chen T, et al. Study on human GPCR–inhibitor interactions by proteochemometric modeling. Gene. 2013;518(1):124–31.
- Wu D, Huang Q, Zhang Y, Zhang Q, Liu Q, Gao J, et al. Screening of selective histone deacetylase inhibitors by proteochemometric modeling. BMC Bioinformatics. 2012;13(1):212.
- Shen Z, Huang Q, Kang H, Liu Q, Cao Z, Zhu R. A new fingerprint of chemical compounds and its application for virtual drug screens. Acta Chimica Sinica. 2011;69(1):1845–50.
- Huang S. Genomics, complexity and drug discovery: insights from Boolean network models of cellular regulation. Pharmacogenomics. 2001;2(3):203–22.
- Adkins DE, Åberg K, McClay JL, Bukszár J, Zhao Z, Jia P, et al. Genomewide pharmacogenomic study of metabolic side effects to antipsychotic drugs. Mol Psychiatry. 2011;16(3):321–32.
- Wang Y, Bolton E, Dracheva S, Karapetyan K, Shoemaker BA, Suzek TO, et al. An overview of the PubChem BioAssay resource. Nucleic Acids Res. 2010;38 suppl 1:255–66.
- Wang Y, Xiao J, Suzek TO, Zhang J, Wang J, Zhou Z, et al. PubChem's BioAssay database. Nucleic Acids Res. 2012;40(D1):D400–12.
- Muslea I, Minton S, Knoblock CA. Active + semi-supervised learning = robust multi-view learning. ICML. 2002;2:435–42.
- Pan SJ, Yang Q. A survey on transfer learning. Knowledge and Data Engineering, IEEE Transactions on. 2010;22(10):1345–59.
- Li H. Learning to rank for information retrieval and natural language processing. Synthesis Lectures Human Language Technol. 2011;4(1):1–113.
- Chang K.-Y. A Survey on Learning to Rank. 2010.
- Labute P. A widely applicable set of descriptors. J Mol Graph Model. 2000;18(4):464–77.
- Li Z-R, Lin HH, Han L, Jiang L, Chen X, Chen YZ. PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res. 2006;34 suppl 2:32–7.
- Chapelle O, Metzler D, Zhang Y, Grinspan P. Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM conference on Information and knowledge management, ACM; 2009: 621–630.
- Yue Y, Finley T, Radlinski F, Joachims T. A support vector method for optimizing average precision. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, ACM; 2007: 271–278.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.