Target enhanced 2D similarity search by using explicit biological activity annotations and profiles
© Yu et al. 2015
Received: 16 July 2015
Accepted: 3 November 2015
Published: 17 November 2015
The enriched biological activity information of compounds in large and freely-accessible chemical databases like the PubChem Bioassay Database has become a powerful research resource for the scientific research community. Currently, 2D fingerprint based conventional similarity search (CSS) is the most widely used approach for database screening, but it does not typically incorporate the relative importance of fingerprint bits to biological activity.
In this study, a large-scale similarity search investigation has been carried out on 208 well-defined compound activity classes extracted from the PubChem Bioassay Database. An analysis was performed to compare the search performance of three types of 2D similarity search approaches: 2D fingerprint based conventional similarity search (CSS), iterative similarity search with multiple active compounds as references (ISS), and fingerprint based iterative similarity search with classification (ISC), which can be regarded as the combination of an iterative similarity search with active references and a reversed iterative similarity search with inactive references. Compared to the search results returned by CSS, ISS improves recall but not precision. Although ISC causes the false rejection of some active hits, it improves precision with statistical significance and outperforms both ISS and CSS. In a second part of this study, we introduce the profile concept into the three types of searches. We find that the profile based non-iterative search can significantly improve search performance by increasing the recall rate. We also find that profile based ISS (PBISS) and profile based ISC (PBISC) significantly decrease ISS search time without sacrificing search performance.
On the basis of our large-scale investigation directed against a wide spectrum of pharmaceutical targets, we conclude that ISC and ISS searches perform better than 2D fingerprint similarity searching and that profile based versions of these algorithms do nearly as well in less time. We also suggest that the profile versions of the iterative similarity searches are both better performing and potentially quicker than the standard algorithm.
Keywords: 2D similarity search; Iterative similarity search; Nearest neighbor; Iterative similarity search with classification; Profile
Large scale virtual screening methods have been an attractive approach for prescreening millions of compounds in commercial or public chemical databases to find compounds active against a specific target, especially in the early stages of modern drug development pipelines. Among the search methods available, 2D fingerprint based conventional similarity search (CSS) is a well-established virtual screening tool [1, 2], in which the similarities between database compounds and the query compound are measured and ranked, and hits are selected from the top of the ranked list. The central principle underlying virtual screening methods is the molecular similarity principle, which states that structurally similar small molecules tend to express similar biological activities [2–4]. A molecular 2D fingerprint is usually defined as a fixed-length bit string where each bit represents a specific molecular substructure feature or structure property. Because this is a ligand based virtual screening method, generating a molecular 2D fingerprint requires only the molecular graph as input. The similarity between the input and the compound being searched is usually measured by the Tanimoto coefficient, which is one of the most common measures for database searching owing to its simplicity [6–8], speed, ease of implementation and results in drug discovery [8–10].
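For binary fingerprints, the Tanimoto coefficient is the number of bits set in both fingerprints divided by the number of bits set in either. The following minimal Python sketch illustrates a CSS-style ranking under simplifying assumptions: fingerprints are represented as sets of on-bit indices, and the names `tanimoto`, `conventional_search` and the toy compounds are hypothetical, not taken from this study.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    shared = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - shared
    return shared / union if union else 0.0


def conventional_search(query, database, top_n):
    """Rank database compounds by similarity to a single query (CSS-style)."""
    scored = [(tanimoto(query, fp), cid) for cid, fp in database.items()]
    # Highest similarity first; ties broken by compound identifier.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(cid, score) for score, cid in scored[:top_n]]


# Toy data (illustrative only).
query = {1, 4, 7, 9}
database = {
    "cmpd_a": {1, 4, 7, 9, 12},  # 4 shared bits, 5 in the union
    "cmpd_b": {2, 3},            # no shared bits
    "cmpd_c": {1, 4},            # 2 shared bits, 4 in the union
}
hits = conventional_search(query, database, top_n=2)
```

Here `hits` holds the two top-ranked compounds with their similarity scores; in a real screen the fingerprints would come from a cheminformatics toolkit rather than hand-written sets.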
Despite the development of more sophisticated 3D similarity approaches [11, 12] and machine learning methods such as random forests, naïve Bayesian classifiers, and support vector machines, 2D similarity search continues to be the focus of virtual screening research to better retrieve compounds of desired bioactivities or physical properties [13–17]. In part, this is due to its relative computational efficiency, which is important for large online chemical databases such as PubChem that must answer user queries in a reasonable amount of time. These advanced 2D similarity search strategies can generally be summarized in three categories. The first category is data fusion of similarity coefficients, in which several types of similarity coefficients that take into account different characteristics of compounds are combined to optimize the measure of compound similarity [16, 17]. The second category is non-iterative single reference searches, which are often based on one-against-one similarity measures, i.e., bit-weighting [18, 19] and bit-truncation approaches. The third category is the iterative similarity search with multiple references (ISS), which is also known as nearest neighbor (NN) search or turbo search [10, 14, 21–24]. ISS is an iterative similarity search approach in which the similarity of a database compound is determined by comparing it to multiple references with the same biological activity. The basic theory behind ISS is that the neighbor list of references maps out a hypervolume in the multidimensional sampling space for the bioactivity of interest, and consequently the top-ranked structures in the search result are more likely to be compounds with similar biological activity. Peter Willett et al. compared ISS with CSS and bit-weighting approaches, and they found an overwhelming advantage of ISS in retrieving active hits.
Furthermore, accumulated simulations have also demonstrated that ISS with the MAX fusion rule (the maximum over all similarity pairs) usually gives better search results than ISS with the SUM fusion rule [10, 22, 25]. Overall, by using multiple compounds as “baits” to fish out more active compounds against a given target from a database of decoys, this simple but efficient approach to target enhanced similarity search is promising for chemical database screening.
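The difference between the two fusion rules can be illustrated with a small Python sketch. It assumes fingerprints are sets of on-bit indices; the function names and toy compounds are hypothetical and chosen only to show a case where the two rules rank compounds differently.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    shared = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - shared
    return shared / union if union else 0.0


def fused_score(compound, references, rule="max"):
    """Fuse the similarities of one compound to several references."""
    sims = [tanimoto(compound, ref) for ref in references]
    if not sims:
        return 0.0
    if rule == "max":
        return max(sims)   # MAX rule: the best single match decides
    if rule == "sum":
        return sum(sims)   # SUM rule: all matches contribute
    raise ValueError(f"unknown fusion rule: {rule}")


# Two structurally unrelated active references (toy data).
references = [{1, 2, 3, 4, 5, 6}, {7, 8, 9, 10, 11, 12}]

# Close analogue of the first reference only: similarities 0.75 and 0.0.
analogue = {1, 2, 3, 4, 5, 6, 20, 21}
# Moderately similar to both references: similarities 0.4 and 0.4.
hybrid = {1, 2, 3, 4, 7, 8, 9, 10}

max_prefers_analogue = (
    fused_score(analogue, references, "max") > fused_score(hybrid, references, "max")
)
sum_prefers_hybrid = (
    fused_score(hybrid, references, "sum") > fused_score(analogue, references, "sum")
)
```

In this toy case the MAX rule favors the close analogue of a single reference, whereas the SUM rule favors the compound that is moderately similar to every reference.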
One of the objectives in 2D similarity searches is to improve the recall performance. This is based on a general assumption that if more active hits are included in the hit list, then there is a higher probability that the remaining hits in the hit list share the same biological activity. Nevertheless, constrained by the quality of the data, the number and nature of compounds in the data set, and more importantly the underlying limitation of molecular representations [27, 28], it is unavoidable to include inactive compounds in database screening based solely on the chemical similarity principle. Mounting evidence suggests that the previous assumption does not always hold, especially if “activity cliffs” widely exist in a given chemical space [29, 30]. Currently many chemical databases like PubChem Bioassay and ChEMBL preserve both active and inactive target-ligand information in each deposited assay. Enriched active and inactive end-points enable us not only to re-evaluate the search performance of ISS and CSS by counting the numbers of annotated active and inactive hits in the hit lists, but also to utilize the structure information of these inactive compounds to reshape the chemical sampling space of the similarity search. If ISS has high specificity in retrieving active compounds, the reverse version of ISS, obtained by replacing active references in the neighbor list with inactive references, should also retain the ability to identify inactive compounds. Ideally, the combination of ISS and the reversed ISS, which we call iterative search with classification (ISC) in this study, may help both to retrieve active compounds and to purify the results from database screening.
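The idea of combining a search against active references with a reversed search against inactive references can be sketched in Python as follows. This is only one possible reading and not necessarily the decision rule used in this study: it assumes that a hit is kept when its MAX-fused similarity to the active references exceeds its MAX-fused similarity to the inactive references. The hit cutoff of 0.5, the function names and the toy data are all hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    shared = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - shared
    return shared / union if union else 0.0


def max_similarity(fp, references):
    """MAX-fused similarity of one compound to a list of references."""
    return max((tanimoto(fp, ref) for ref in references), default=0.0)


def classify_hits(database, active_refs, inactive_refs, cutoff=0.5):
    """Split hits into kept and rejected (assumed rule, see lead-in)."""
    kept, rejected = [], []
    for cid, fp in database.items():
        s_active = max_similarity(fp, active_refs)
        if s_active < cutoff:
            continue  # not a hit of the active-reference search at all
        s_inactive = max_similarity(fp, inactive_refs)
        # Assumed rule: keep the hit only if it sits closer to an active
        # reference than to any inactive reference.
        (kept if s_active > s_inactive else rejected).append(cid)
    return kept, rejected


active_refs = [{1, 2, 3, 4}]
inactive_refs = [{1, 2, 3, 9}]
database = {
    "x": {1, 2, 3, 4, 5},   # 0.8 to the active, 0.5 to the inactive reference
    "y": {1, 2, 3, 9, 10},  # 0.5 to the active, 0.8 to the inactive reference
    "z": {20, 21},          # unrelated to every reference
}
kept, rejected = classify_hits(database, active_refs, inactive_refs)
```

Compound `y` would be returned by a search that uses active references alone, but it is removed once the inactive reference is taken into account, which is the purification effect described above.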
The purpose of this study is to develop and compare target enhanced similarity search approaches. ChEMBL bioassay data and PubChem confirmatory bioassay data with explicit EC50, IC50 or Ki values were retrieved from the PubChem database, and the data were combined into 208 activity classes for our tests. Each activity class corresponded to a protein target. In an effort to expand the sampling space and alleviate the computational burden of iterative searches, we also introduced the profile concept into target enhanced similarity search. In this case, the binary 2D fingerprints in CSS, ISS and ISC were replaced by representative average profiles (AVEs). In total, 6 search approaches were systematically tested on 208 activity classes: 2 non-iterative approaches (2D fingerprint based conventional similarity search or CSS, and average profile search or PBSS), 2 iterative ISS approaches with multiple active references (fingerprint based ISS, and average profile based ISS or PBISS), and 2 iterative searches with classification (fingerprint based ISC, and average profile based ISC or PBISC). The arithmetic mean of recall rates on the selected activity class (ARR), the arithmetic mean of precision rates (APR), and the area under the ROC curve (AUC) of each of the 208 activity classes were compared to comprehensively evaluate the search performance of all 6 search approaches. The detailed data set preparation, description of search approaches and results of the search simulations are reported herein.
Results and discussion
Our study attempts to address three questions: Can chemical similarity searches be improved by (1) using iterative searches, (2) classifying search results by using bioactivity data, and (3) using fingerprint profiles? Furthermore, what is a reasonable metric for answering these questions: should we measure only recall, as has typically been done in other studies, or measure both recall and precision at the same time?
For these purposes, the recall, precision and comprehensive search performance (AUC) determined by calculating ARRs and APRs on 208 activity classes using 6 search approaches are compared and described below. The specific AUC, ARR and APR values of each activity class returned by the six search approaches can be found in three heatmaps in Additional file 1: Figure S4. It should be noted that since explicitly annotated inactives were added to each activity class, the precision rate calculation of each similarity search follows a new definition described in the Methods section below.
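As a rough illustration of how per-query recall and precision could be averaged into ARR and APR, consider the following Python sketch. The precision definition used here is an assumption made for illustration (active hits divided by the sum of annotated active and annotated inactive hits, with unannotated hits ignored) and may differ from the definition adopted in this study; all function and compound names are hypothetical.

```python
def recall_precision(hit_list, actives, inactives):
    """Recall and precision of one query's hit list (assumed definitions)."""
    hits = set(hit_list)
    n_active = len(hits & actives)
    n_inactive = len(hits & inactives)
    recall = n_active / len(actives) if actives else 0.0
    # Assumption: only hits with an explicit annotation enter the denominator.
    annotated = n_active + n_inactive
    precision = n_active / annotated if annotated else 0.0
    return recall, precision


def class_averages(hit_lists, actives, inactives):
    """Arithmetic means of recall (ARR) and precision (APR) over all queries."""
    pairs = [recall_precision(hits, actives, inactives) for hits in hit_lists]
    arr = sum(r for r, _ in pairs) / len(pairs)
    apr = sum(p for _, p in pairs) / len(pairs)
    return arr, apr


# Toy test set of one activity class: 4 annotated actives, 2 annotated inactives.
actives = {"a1", "a2", "a3", "a4"}
inactives = {"i1", "i2"}
hit_lists = [
    ["a1", "a2", "i1", "u1"],  # "u1" carries no annotation for this class
    ["a1", "i1", "i2"],
]
arr, apr = class_averages(hit_lists, actives, inactives)
```

Reporting both averages makes it visible when a method gains recall only by pulling in annotated inactives.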
Profiling of conventional similarity search on 208 activity classes
2D fingerprint based similarity search is popular in many applications and is often used as a standard search algorithm for benchmarking new algorithms. Therefore, we first characterized the search performance of CSS on 208 well-curated activity classes.
Summary of average enrichments (AEFs), ARRs, APRs and AUCs of 208 activity classes
Compare iterative similarity search and iterative similarity search with classification to conventional similarity search
Because no obvious relationship between recall rate and precision rate was observed in our analysis, and because a high proportion of annotated inactive hits in the hit list is not a desirable result, we regard recall and precision as equally important in evaluating similarity search performance.
Benefit of profiling in 2D similarity searches
By screening the compound structures in the bioassays, we observed that many active compounds in the same bioassay have the same scaffold. Using intermediate queries with high self-identity is one bottleneck in improving the search efficiency of iterative ISS or ISC searches. Inspired by the idea of profile searches found in sequence searches, the introduction of profiling into compound 2D similarity comparison may benefit chemical similarity searching. We chose the simple average profile (AVE) to replace the fingerprints in CSS, ISS and ISC search approaches.
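A minimal Python sketch of the profile idea follows. It assumes that an average profile stores, for each bit position, the fraction of member fingerprints in which that bit is set, and that a profile is compared with a binary fingerprint using the continuous form of the Tanimoto coefficient. Both choices are assumptions made for illustration, and the function names are hypothetical.

```python
def average_profile(fingerprints, n_bits):
    """Per-bit frequency of the on-bits across a group of fingerprints."""
    if not fingerprints:
        raise ValueError("at least one fingerprint is required")
    counts = [0.0] * n_bits
    for fp in fingerprints:
        for bit in fp:
            counts[bit] += 1.0
    return [c / len(fingerprints) for c in counts]


def to_vector(fp, n_bits):
    """Expand a set of on-bit indices into a 0/1 vector."""
    return [1.0 if i in fp else 0.0 for i in range(n_bits)]


def continuous_tanimoto(u, v):
    """Tanimoto coefficient generalized to real-valued vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    denom = sum(x * x for x in u) + sum(y * y for y in v) - dot
    return dot / denom if denom else 0.0


# Two actives sharing a scaffold (bits 0 and 1) with different substituents.
n_bits = 6
actives = [{0, 1, 2}, {0, 1, 3}]
profile = average_profile(actives, n_bits)

# A candidate that carries only the shared scaffold bits.
candidate = to_vector({0, 1}, n_bits)
profile_similarity = continuous_tanimoto(profile, candidate)
```

In this toy case the scaffold-only candidate scores 0.8 against the profile but only 2/3 against either individual active, because bits that vary within the group are down-weighted in the profile.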
Finally, it is worth mentioning that a presupposition of this study is that each query compound has at least one known binding target. However, in the real world, this presupposition may not be necessary. In other words, even if the specific bioactivity of the query compound has not been confirmed, we can still use the PBSS, ISS, ISC, PBISS, and PBISC search approaches to retrieve compound hits of a desired bioactivity, since the query compound can be regarded as the bait used to fish out the real compounds of desired bioactivity, which then form the neighbor lists for further database screening. Furthermore, according to the curves averaging 208 APRs at varied similarity cutoffs shown in Additional file 1: Figure S3, PBSS can return better precision rates at high similarity cutoffs (i.e., similarity ≥0.9). This means that in the extreme situation where we have no knowledge of the bioactivity of the query compound, instead of using CSS to retrieve compounds based simply on molecular structure similarity, we can use PBSS to create the biological target profile of the query compound with high confidence, and then perform our iterative methods or use biological profile based methods like HTS-FP similarity search, bioturbo similarity search, or connectivity map for more thorough virtual screening.
In this paper, we introduce profiles and neighbor classification into target enhanced 2D molecular similarity searching. We have systematically compared the recall, precision and general search performance of a total of 6 search approaches applied to 208 activity classes: two non-iterative search approaches, fingerprint based conventional similarity search (CSS) and average profile based similarity search (PBSS); two iterative search approaches with multiple active references, fingerprint based iterative search (ISS) and average profile based nearest neighbor search (PBISS); and two iterative search approaches with classification, fingerprint based iterative search with classification (ISC) and average profile based iterative search with classification (PBISC).
Although the recall performance of 2D similarity search has typically been used to measure search performance, our study suggests that both recall and precision should be measured in order to evaluate search performance comprehensively. Both ISS and ISC significantly improve the recall performance, but only the ISC search approach improves the precision. In addition, the introduction of profiles into 2D similarity search has two benefits. Compared to CSS, average profiles enhance search performance. Profiles also simplify the iterative ISS and ISC search approaches without losing search performance. In balancing recall and precision, ISC and, similarly, profile based ISC are promising and efficient target enhanced similarity search approaches that can be implemented in chemical databases containing bioactivity information.
Preparation of data sets
Summary of the sizes of data sets of 208 activity classes, including known actives and inactives
In order to compare the search performance of our 6 search approaches, the data set of each activity class was split into three subsets: a query set composed of annotated actives for initiating the query procedure, a reference set for providing both active and inactive references, and a test set for evaluating the search ability of the algorithm. To ensure the structural representativeness of active compounds in the query set, we directly extracted the center compounds of the Taylor-Butina clustering results to form the query set of every activity class. Then we randomly assigned the remaining active compounds to the reference set and the test set. Similarly, we separated the inactive compounds in the same activity class randomly into two groups, and added them to the reference set and test set of that activity class. For the original activity classes with more than 20,000 inactive compounds, the number of inactives in the reference set was limited to one-fourth of the total inactive compounds (Additional file 1: Table S2). The average sizes of the query set, reference set and test set of the 208 activity classes are summarized in Table 2. For each query from a selected activity class, all compounds in the query set and the reference set of the selected activity class were excluded from the database, and similarities were measured between the query and all remaining compounds in the database to create the hit list for the query. All six algorithms in this study were tested with this set to ensure the validity of the comparison.
By selecting well characterized bioassay results, using a large number of activity classes and compounds, ensuring structural diversity, balancing the relative weight of activity classes, and using a single test set, we attempt to ensure that our test results and conclusions are less likely to be affected by the varied composition of the data sets.
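The selection of cluster centers can be illustrated with a compact Python sketch of Taylor-Butina clustering. It is a simplified illustration rather than the implementation used in this study: fingerprints are sets of on-bit indices, the all-pairs comparison is quadratic, and the similarity cutoff of 0.4 in the example is an illustrative assumption for this step. All names are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    shared = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - shared
    return shared / union if union else 0.0


def taylor_butina(fps, cutoff):
    """Cluster fingerprints; each cluster lists its center first."""
    n = len(fps)
    # Step 1: neighbor lists at the given similarity cutoff.
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto(fps[i], fps[j]) >= cutoff:
                neighbors[i].add(j)
                neighbors[j].add(i)
    # Step 2: visit compounds in order of decreasing neighbor count.
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for center in order:
        if center in assigned:
            continue
        # The center claims all of its neighbors that are still unassigned.
        members = [center] + sorted(neighbors[center] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


fps = [
    {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6},  # one scaffold family
    {10, 11, 12}, {10, 11, 13},                # a second family
]
clusters = taylor_butina(fps, cutoff=0.4)
query_set = [members[0] for members in clusters]  # indices of cluster centers
```

Taking one center per cluster gives a query set that covers each structural family once instead of sampling the most populated scaffold repeatedly.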
Formation of average profile
Non-iterative similarity searches
Iterative similarity search
Unlike CSS and PBSS, the fingerprint based nearest neighbor search (ISS), the fingerprint based neighbor classification (ISC) and the corresponding profile versions (PBISS and PBISC) are called iterative search approaches because at least two fingerprints or profiles participate in the similarity calculation. A brief description of the four iterative search approaches is shown in Fig. 9. Before the iterative search, all iterative search approaches first search the reference set and create the same neighbor list as the one used in the PBSS search. In the iterative searches, the MAX fusion rule (max of [Tc1, Tc2, Tc3, …, Tcn_ref]) was applied to assign the similarity score of database compounds. As in the analysis of non-iterative search results, the top 4941 hits of each query were collected for further analysis.
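The two-stage flow described above (first build a neighbor list from the reference set, then score database compounds with the MAX fusion rule) might be sketched in Python as follows. Several details are assumptions made for illustration and are not stated in the text: the neighbor list is selected with a similarity cutoff, only active references are used (as in ISS), and the query itself is kept among the baits. Function names and toy data are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    shared = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - shared
    return shared / union if union else 0.0


def build_neighbor_list(query, active_refs, cutoff):
    """Stage 1: active references similar enough to the query (assumed rule)."""
    return [ref for ref in active_refs if tanimoto(query, ref) >= cutoff]


def iterative_search(query, active_refs, database, cutoff, top_n):
    """Stage 2: rank database compounds by MAX-fused similarity to the baits."""
    baits = [query] + build_neighbor_list(query, active_refs, cutoff)
    scored = [
        (max(tanimoto(fp, bait) for bait in baits), cid)
        for cid, fp in database.items()
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [cid for _, cid in scored[:top_n]]


query = {1, 2, 3, 4}
active_refs = [{1, 2, 3, 5}, {20, 21, 22}]  # only the first is near the query
database = {
    "d1": {1, 2, 3, 5, 6},     # 0.5 to the query, 0.8 to the first reference
    "d2": {1, 2, 3, 4, 7, 8},  # about 0.67 to the query
    "d3": {20, 21, 22},        # matches a reference outside the neighbor list
}
ranked = iterative_search(query, active_refs, database, cutoff=0.5, top_n=2)
```

With the query alone, `d2` would outrank `d1`; adding the neighboring active reference as a second bait moves `d1` to the top of the list.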
ISS and ISC search approaches
PBISS and PBISC search approaches
In our preliminary study of the ISS and ISC search approaches, we observed that many reference hits to a query are highly similar to one another. Including a large number of structurally similar references decreases the search efficiency of iterative database screening. For this reason we introduce profiles into the ISS and ISC search approaches. For the PBISS search approach, we first applied the Taylor-Butina algorithm with a similarity cutoff of 0.4 to cluster all of the active references in the neighbor list and then created one average profile for each of the clusters. For the PBISC search approach, we clustered all of the references in the neighbor list of a query. If a cluster was composed entirely of active references or entirely of inactive references, we created a single profile to represent the structural features of that set of compounds. Otherwise we separated active references from inactive references and created two profiles. With this clustering and profiling strategy, the compression ratio from fingerprints to profiles is 6.58 on average over 33,199 queries.
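The profiling rule for PBISC can be sketched in Python as follows. The sketch takes the cluster memberships as given (for example, the output of a Taylor-Butina run) and assumes that an average profile is the per-bit frequency of on-bits within a group; the function names, labels and toy data are hypothetical.

```python
def average_profile(fingerprints, n_bits):
    """Per-bit frequency of the on-bits across a group of fingerprints."""
    counts = [0.0] * n_bits
    for fp in fingerprints:
        for bit in fp:
            counts[bit] += 1.0
    return [c / len(fingerprints) for c in counts]


def build_pbisc_profiles(clusters, fps, is_active, n_bits):
    """One profile per pure cluster, two (active and inactive) per mixed cluster."""
    profiles = []
    for members in clusters:
        active = [fps[i] for i in members if is_active[i]]
        inactive = [fps[i] for i in members if not is_active[i]]
        if active:
            profiles.append(("active", average_profile(active, n_bits)))
        if inactive:
            profiles.append(("inactive", average_profile(inactive, n_bits)))
    return profiles


# Five references in two clusters; the second cluster mixes both labels.
fps = [{0, 1}, {0, 2}, {3, 4}, {3, 5}, {3, 4}]
is_active = [True, True, True, False, False]
clusters = [[0, 1], [2, 3, 4]]

profiles = build_pbisc_profiles(clusters, fps, is_active, n_bits=6)
compression_ratio = len(fps) / len(profiles)
```

In this toy case five reference fingerprints are reduced to three profiles, so each database compound needs three comparisons instead of five.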
Fingerprint and similarity measurement
Evaluation of similarity search performance
CSS: 2D fingerprint based conventional similarity search
ISS: 2D fingerprint based iterative similarity search
ISC: 2D fingerprint based iterative similarity search with classification
PBSS: average profile based similarity search
PBISS: average profile based iterative similarity search
PBISC: average profile based iterative similarity search with classification
AUC: area under ROC curve of the selected activity class
ARR: the arithmetic mean of recall rates of multiple queries on the selected activity class
APR: the arithmetic mean of precision rates of multiple queries on the selected activity class
The research was conceived by XY and LG. The computational work was performed by XY. LG supervised the project. All other authors participated in project discussion. All authors read and approved the final manuscript.
This research was supported by the Intramural Research Program of the NIH, National Library of Medicine. We also thank Dr. Yanli Wang for her guidance in selecting high-quality bioassay data from the PubChem Bioassay Database.
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Edgar SJ, Holliday JD, Willett P (2000) Effectiveness of retrieval in similarity searches of chemical databases: a review of performance measures. J Mol Graph Model 18(4–5):343–357
- Bender A, Glen RC (2004) Molecular similarity: a key technique in molecular informatics. Org Biomol Chem 2(22):3204–3218
- Nikolova N, Jaworska J (2004) Approaches to measure chemical similarity – a review. QSAR Comb Sci 22(9–10):1006–1026
- Willett P, Barnard JM, Downs GM (1998) Chemical similarity searching. J Chem Inf Comp Sci 38(6):983–996
- Rogers DJ, Tanimoto TT (1960) A computer program for classifying plants. Science 132(3434):1115–1118
- Willett P (2011) Similarity searching using 2D structural fingerprints. Methods Mol Biol 672:133–158
- Xu J, Hagler A (2002) Chemoinformatics and drug discovery. Molecules 7(8):566–600
- Geppert H, Bajorath J (2010) Advances in 2D fingerprint similarity searching. Expert Opin Drug Dis 5(6):529–542
- Bajorath J (2002) Integration of virtual and high-throughput screening. Nat Rev Drug Discov 1(11):882–894
- Hert J, Willett P, Wilton DJ (2004) Comparison of fingerprint-based methods for virtual screening using multiple bioactive reference structures. J Chem Inf Comp Sci 44(3):1177–1185
- Kim S, Bolton EE, Bryant SH (2012) Effects of multiple conformers per compound upon 3-D similarity search and bioassay data analysis. J Cheminform 4:28
- Fontaine F, Bolton E, Borodina Y, Bryant SH (2007) Fast 3D shape screening of large chemical databases through alignment-recycling. Chem Cent J 1:12
- Schuffenhauer A, Floersheim P, Acklin P, Jacoby E (2003) Similarity metrics for ligands reflecting the similarity of the target proteins. J Chem Inf Comput Sci 43(2):391–405
- Hert J, Willett P, Wilton DJ, Acklin P, Azzaoui K, Jacoby E, Schuffenhauer A (2006) New methods for ligand-based virtual screening: use of data fusion and machine learning to enhance the effectiveness of similarity searching. J Chem Inf Model 46(2):462–470
- Tovar A, Eckert H, Bajorath J (2007) Comparison of 2D fingerprint methods for multiple-template similarity searching on compound activity classes of increasing structural diversity. ChemMedChem 2(2):208–217
- Salim N, Holliday J, Willett P (2003) Combination of fingerprint-based similarity coefficients using data fusion. J Chem Inf Comp Sci 43(2):435–442
- Chen J, Holliday J, Bradshaw J (2009) A machine learning approach to weighting schemes in the data fusion of similarity coefficients. J Chem Inf Model 49(2):185–194
- Wang Y, Bajorath J (2009) Development of a compound class-directed similarity coefficient that accounts for molecular complexity effects in fingerprint searching. J Chem Inf Model 49(6):1369–1376
- Wang Y, Bajorath J (2008) Bit silencing in fingerprints enables the derivation of compound class-directed similarity metrics. J Chem Inf Model 48(9):1754–1759
- Nisius B, Bajorath J (2010) Reduction and recombination of fingerprints of different design increase compound recall and the structural diversity of hits. Chem Biol Drug Des 75(2):152–160
- Whittle M, Gillet VJ, Willett P, Alex A, Loesel J (2004) Enhancing the effectiveness of virtual screening by fusing nearest neighbor lists: a comparison of similarity coefficients. J Chem Inf Comp Sci 44(5):1840–1848
- Heikamp K, Bajorath J (2011) Large-scale similarity search profiling of ChEMBL compound data sets. J Chem Inf Model 51(8):1831–1839
- Whittle M, Gillet VJ, Willett P, Alex A, Loesel J (2004) Enhancing the effectiveness of virtual screening by fusing nearest neighbor lists: a comparison of similarity coefficients. J Chem Inf Comput Sci 44(5):1840–1848
- Williams C (2006) Reverse fingerprinting, similarity searching by group fusion and fingerprint bit importance. Mol Diversity 10(3):311–332
- Gardiner EJ, Gillet VJ, Haranczyk M, Hert J, Holliday JD, Malim N, Patel Y, Willett P (2009) Turbo similarity searching: Effect of fingerprint and dataset on virtual-screening performance. Stat Anal Data Mining 2(2):103–114
- Xie XQS (2010) Exploiting PubChem for virtual screening. Expert Opin Drug Dis 5(12):1205–1220
- Bender A, Jenkins JL, Scheiber J, Sukuru SC, Glick M, Davies JW (2009) How similar are similarity searching methods? A principal component analysis of molecular descriptor space. J Chem Inf Model 49(1):108–119
- Heikamp K, Bajorath J (2011) How do 2D fingerprints detect structurally diverse active compounds? Revealing compound subset-specific fingerprint features through systematic selection. J Chem Inf Model 51(9):2254–2265
- Hu Y, Maggiora GM, Bajorath J (2013) Activity cliffs in PubChem confirmatory bioassays taking inactive compounds into account. J Comput Aided Mol Des 27(2):115–124
- Cruz-Monteagudo M, Medina-Franco JL, Perez-Castillo Y, Nicolotti O, Cordeiro MN, Borges F (2014) Activity cliffs in drug discovery: Dr Jekyll or Mr Hyde? Drug Discovery Today 19(8):1069–1080
- Wang Y, Xiao J, Suzek TO, Zhang J, Wang J, Zhou Z, Han L, Karapetyan K, Dracheva S, Shoemaker BA et al (2012) PubChem’s BioAssay Database. Nucleic Acids Res 40(Database issue):D400–D412
- Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B et al (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 40(Database issue):D1100–D1107
- Petrone PM, Simms B, Nigsch F, Lounkine E, Kutchukian P, Cornett A, Deng Z, Davies JW, Jenkins JL, Glick M (2012) Rethinking molecular similarity: comparing compounds on the basis of biological activity. ACS Chem Biol 7(8):1399–1409
- Wassermann AM, Lounkine E, Glick M (2013) Bioturbo similarity searching: combining chemical and biological similarity to discover structurally diverse bioactive molecules. J Chem Inf Model 53(3):692–703
- Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, Lerner J, Brunet JP, Subramanian A, Ross KN et al (2006) The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313(5795):1929–1935
- Taylor R (1995) Simulation analysis of experimental-design strategies for screening random compounds as potential new drugs and agrochemicals. J Chem Inf Comp Sci 35(1):59–67
- Butina D (1999) Unsupervised data base clustering based on Daylight’s fingerprint and Tanimoto similarity: a fast and automated way to cluster small and large data sets. J Chem Inf Comp Sci 39(4):747–750
- Shannon CE (1948) A mathematical theory of communication. At&T Tech J 27(3):379–423
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389–3402
- Schaffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, Koonin EV, Altschul SF (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 29(14):2994–3005
- Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn 30(7):1145–1159