Mapping the 3D structures of small molecule binding sites
© The Author(s) 2016
Received: 15 August 2016
Accepted: 16 November 2016
Published: 6 December 2016
Analysis of the 3D structures of protein–ligand binding sites can provide valuable insights for drug discovery. Binding site comparison (BSC) studies can be employed to elucidate the function of orphan proteins or to predict the potential for polypharmacology. Many previous binding site analyses only consider binding sites surrounding an experimentally observed bound ligand.
To encompass potential protein–ligand binding sites that do not have ligands known to bind, we have incorporated fpocket cavity detection software and assessed the impact of this inclusion on BSC performance. Using fpocket, we generated a database of ligand-independent potential binding sites and applied the BSC tool, SiteHopper, to analyze similarity relationships between protein binding sites. We developed a method for clustering potential binding sites using a curated dataset of structures for six therapeutically relevant proteins from diverse protein classes in the protein data bank. Two clustering methods were explored; hierarchical clustering and a density-based method adept at excluding noise and outliers from a dataset. We introduce circular plots to visualize binding site structure space. From the datasets analyzed in this study, we highlight a structural relationship between binding sites of cationic trypsin and prothrombin, protein targets known to bind structurally similar small molecules, exemplifying the potential utility of objectively and holistically mapping binding site space from the structural proteome.
KeywordsBinding site prediction Binding site comparison Mapping binding site space Protein structure
Analysis of the three-dimensional structures of proteins is integral to our understanding of the molecular machinery involved in their biological function and is increasingly enabled by the wealth of structural data available in the Protein Data Bank (PDB) . In particular, the examination of functional binding sites is of importance in biological chemistry and drug discovery by rational design . Here, we present a method for generating a structural map of potential small molecule binding sites derived from the currently available structural proteome [3, 4].
Evidence for the existence and location of a binding site can be built through experimental observation of protein–ligand binding events—often facilitated by protein X-ray crystallography and/or Nuclear Magnetic Resonance (NMR) spectroscopy. However, prospective computational analysis to discover novel potential binding sites requires an objective and systematic cavity detection method, for which many tools exist [5–7]. For example, fpocket is a widely used and freely available software that employs geometric alpha shape theory to detect cavities in protein structure coordinates .
BSC tools commonly define binding sites as the protein environment surrounding an experimentally observed bound ligand. Importantly, this definition excludes potential binding sites that have not been demonstrated to bind ligands (so called unliganded binding sites), thereby creating a bias towards currently exemplified protein–ligand complexes. To address this, tools such as CavBase , RAPMAD , IsoMIF  and TrixP , have integrated binding site detection algorithms with BSC. However, to the best of our knowledge, there has been no systematic analysis of the implications for BSC performance with unliganded cavities in the dataset. To mitigate this concern, we applied a modular approach and independently validated both the cavity detection and BSC components when applied to datasets comprising both liganded and unliganded protein binding sites.
A structural mapping of protein binding sites can provide a useful tool for probing the three-dimensional structural relationships between biological macromolecules [3, 10]. Tools that aim to provide an assessment of similarity between protein–ligand binding sites include Relibase, a database of known protein–ligand binding sites ; the sc-PDB, an annotated database of druggable binding sites from the PDB ; and the Pocketome, an encyclopedia of conformational ensembles of druggable binding sites . While each of these tools provides an assessment of similarity between binding sites, the potential for identifying novel three-dimensional relationships involving currently unliganded binding sites is limited without the incorporation of objective methods for cavity detection.
The workflow presented here enables a structural mapping of potentially ligandable binding sites of the currently available structural proteome. We apply fpocket to objectively detect protein cavities and SiteHopper BSC to systematically generate pairwise structural similarities between all detected cavities. We also assess the performance of BSC incorporating all fpocket-detected cavities versus datasets only containing cavities surrounding an experimentally observed ligand. We then describe a number of clustering methods and visualization techniques for the mapping of potential binding site space. Altogether we present a validated workflow and describe challenges associated with the methodologies employed therein. In this work, we have adopted the following definitions throughout: a cavity is a surface depression identified by fpocket in static protein structure data. A potential protein–ligand binding site is one predicted by fpocket to bind small molecules, whereas a known protein–ligand binding site is one that has been experimentally shown to bind small molecules.
Protein structure datasets
Four datasets of protein structures were studied.
An ensemble of five cAMP-dependent protein kinase structures, all bound to the endogenous co-factor adenosine 5′-triphosphate (ATP) (Additional file), was retrieved from the PDB . Crystallographic structure data was selected to satisfy ligand-centric quality criteria: resolution ≥2.7 ångströms (Å) , Real-Space Correlation Coefficient (RSCC) of ligand instance ≥0.9, Real-Space R-factor (RSR) of ligand instance ≤0.15 and Occupancy-Weighted Average B-factor (OWAB) of ligand instance between 5 and 50 Å2 . All structures were aligned using the PDB entry 1ATP as the reference coordinates and Schrödinger’s Protein Preparation Wizard  was applied to ensure consistent protonation, removal of waters and assignment of tautomers. All structures exhibit very similar conformations of the protein with a mean all-atom Root-Mean-Square Deviation (RMSD) for pairwise alignment of 1.0 Å, calculated in PyMOL using the align command with the cycles flag set to zero .
The PDBBind-refined set (2014)  is a curated set of 3446 high-quality, binary protein-small molecule complexes associated with measured binding affinity (K i or K d ). This dataset was used to evaluate models for ranking detected cavities and to determine a threshold above which detected cavities constitute potential binding sites.
A further dataset was manually curated from the PDB to guide the generation of a map of the structures of diverse and therapeutically relevant potential small molecule binding sites. This dataset contains 1085 crystallographically determined protein structures of the following protein targets: bromodomain-containing protein 4 (BRD4) (93), cyclin-dependent kinase 2 (CDK2) (148), estrogen receptor (52), human immunodeficiency virus-1 (HIV-1) protease (335), prothrombin (142) and cationic trypsin (315). Structures were retrieved by their respective UniProt  identifiers, except HIV-1 Protease for which structures were retrieved with 90% sequence identity (Protein BLAST , E = 10−20) to a reference sequence . Retrieved crystal structures were selected to satisfy protein-centric crystallographic quality criteria: resolution ≥2.5 Å, Free R-factor (Rfree) ≤0.3 and Diffraction Precision Index (DPI)  ≤0.5 Å —calculated using DPI calculator . This dataset is referred to as the Pilot dataset (Additional file).
All crystallographic quality descriptors were retrieved from either the PDB or Electron Density Server (EDS)  unless otherwise stated.
Binding site detection
fpocket (version 2.0)  was implemented for ligand-independent cavity detection using default settings with two parameter alterations; the −r flag was set to 3.0 (default 4.5) and the −n flag was set to 3 (default 2). fpocket ranks cavities according to a Partial Least Squares (PLS) model Score trained on five descriptors relating to hydrophobicity, polarity and the size of a detected binding site . Cavities were detected for protein structures in the PDBBind-refined set (2014)  and an fpocket Score ≥16.8 was determined, above which cavities were considered as potential ligand-binding sites. This threshold corresponds to the Score above which 95% of known ligand binding sites from the PDBBind-refined set were identified.
Binding site comparison (BSC)
A dataset of ligand-dependent binding sites was generated using the SiteHopper create tool  where default parameters create a binding site patch within 4 Å of a specified bound ligand. This approach was followed to generate 9275 binding site patches for the sc-PDB (2013) database; this is referred to as the ligCav binding site dataset. Eight protein structures failed to yield binding site patches.
To generate ligand-independent binding site patches, surface protein atoms associated with fpocket cavities were utilized as a pseudo-ligand for input to the SiteHopper create tool. Binding site patches were defined as surface protein atoms lying within 0.3 Å of the fpocket surface atoms. This site size value was determined empirically through a number of retrieval experiments with a range of site size values (0.1–0.6 Å, increments of 0.1 Å). The ability of SiteHopper to identify similarity between a query estrogen receptor binding site patch and other members of the estrogen receptor in the sc-PDB (2013) was assessed using binding site patches created with varying site sizes (Additional file 1: Figure S1). Larger binding site patches incur a penalty in calculation time during BSC, and therefore the chosen site size represents a balance between computational expense and retrieval success.
The SiteHopper tool was utilized to generate binding site patches and to assess pairwise structure similarity between reference and query patches. The default SiteHopper PatchScore represents a summation of Tanimoto similarity coefficients  weighted 3:1 in favor of color similarity over shape similarity, yielding a continuous value between zero and four, conveying complete dissimilarity and perfect similarity respectively . Utilization of the symmetric Tanimoto similarity coefficient causes an inherent size matching to exist between pairs of binding site patches that show high levels of structural similarity.
The sc-PDB (2013) database  was utilized to assess BSC performance through a series of retrieval experiments evaluating the ability of SiteHopper to identify similar binding sites belonging to the same protein target. True positives were defined as binding site patches with the same UniProt identifier as the query patch, except for those belonging to HIV-1 protease, which were defined by sequence searching as previously described (“Methods” section). Due to the presence of multiple binding sites per protein structure, only the binding site with the highest SiteHopper PatchScore derived from a matching protein structure was considered a true positive. Reference binding site patches used as queries for retrieval experiments are shown in Additional file 1: Table S1.
Mapping binding sites
To guide mapping of the potentially ligandable binding sites of the structural proteome, an exhaustive all-against-all BSC was performed on the Pilot dataset containing 2708 binding sites generated by fpocket. A breakdown of the Pilot dataset, including the number of binding sites detected for each protein target, is shown in Additional file 1: Table S2. The resulting matrix of (2708 × 2708) SiteHopper PatchScores was exploited to produce a clustered heat map of potential binding site space. To remove non-conserved binding sites from the dataset, patches with fewer than five pairwise SiteHopper PatchScores ≥2.0 were filtered out. Binding sites were first clustered within the protein targets from which they were derived using average-linkage agglomerative hierarchical clustering and the Euclidean distance measure. Subsequent clustering was performed in the same way across the global Pilot dataset. Plots were generated using matplotlib  and clustering was implemented in the Python programming language using the SciPy package .
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)  was applied to cluster binding sites for each protein using n × n matrices of SiteHopper PatchScores. DBSCAN was implemented in the Python programming language using the scikit-learn machine learning toolkit  with range ε = 7 and a minimum number of points per core cluster being ten. Circular plots were generated as an alternative visual tool for mapping potential binding site space using the Circos software package .
Receiver Operating Characteristic (ROC) curves are a widely used tool employed to quantify the ability of a method to identify instances with similar characteristics to a reference (true positives). The Area Under a Receiver Operating Characteristic (AUROC) curve provides a measure of how well a method distinguishes between true positives and false positives in a dataset . A perfect separation of all true positives from the data would result in an AUROC of 1, whilst a random classifier would be expected to distribute true positives throughout the whole dataset resulting in an AUROC of 0.5.
Often, it is the early recognition of true positives that is important , especially in cases where n is large and AUROC results are indistinguishable between methods. To this end, Enrichment Factors (EF) at 5% and the Boltzmann-Enhanced Discrimination Receiver Operating Characteristic (BEDROC)  were also calculated. An EF is the ratio of the percentage of true positives in an initial portion of a dataset, to the overall percentage of true positives in the entire dataset. Thus an EF = 1 implies no enrichment in the initial portion of the data (no early enrichment); EF < 1 implies the classifier performs worse than random at identifying true positives, and EF > 1 implies there is some quantifiable enrichment of true positives among the highest ranked data points . BEDROC applies Boltzmann weighting to the AUROC calculation thereby emphasizing the initial portion of the ROC curve—calculated using the CROC package  at α = 20 [46, 47].
Results and discussion
Ligand-independent binding site detection
Typically, BSC studies make use of known binding sites characterized by surface protein atoms surrounding an experimentally observed bound ligand. To objectively consider currently unliganded binding sites, ligand-independent binding site detection tools were evaluated. Incorporation of binding site detection tools removes bias associated with utilizing currently known liganded binding sites; however, it may also introduce noise to the data through inclusion of cavities that are incapable of ligand binding. Therefore, an assessment of the noise introduced to the data by binding site detection was conducted and also of the subsequent implications for BSC performance.
fpocket is a well-established and freely available binding site detection tool capable of operating in high-throughput and therefore applicable to large datasets of protein structure data [e.g. the sc-PDB (2013) contains 9283 structures]. fpocket was evaluated according to three criteria: Its ability to (1) detect cavities corresponding to functionally relevant binding sites starting from a global search of a protein structure; (2) detect similar cavities from an ensemble of structurally similar experimental structures of the same protein bound to the same ligand; and (3) rank and prioritize detected cavities according to their likelihood of binding small molecule ligands. Two datasets were utilized to assess these criteria: an ensemble of five ATP-bound cAMP-dependent protein kinases, and the PDBBind-refined set (2014) .
fpocket binding site detection often identifies multiple cavities per protein structure. To reduce the complexity associated with carrying forward multiple cavities per protein for BSC, the detected binding sites were ranked and subsequently filtered according to their potential to bind a small molecule ligand. Many other binding site detection tools rank cavities according to binding site volume as often the largest cavity corresponds to the observed ligand binding site . The fpocket Score model aims to predict whether a cavity may contain a bound small molecule ligand and is distinct from a drug ability model since ligands are not necessarily drug-like .
A comparison of binding site detection tools has previously found that 95% of ligand-bound sites are identifiable using geometric algorithms . Accordingly, a sensitivity threshold was determined above which 95% of observed ligand binding sites from the PDBBind-refined set were identified. This sensitivity threshold precedes a sharp increase in false positive rate and therefore excluding sites below the sensitivity threshold ensures that the number of cavities without ligand-binding potential introduced to the dataset is limited. Thus, cavities were taken forward to BSC if the fpocket Score is ≥16.8, corresponding to the Score above which 95% of the ligand-bound cavities from the PDBBind-refined set are identified.
The average number of cavities identified per protein structure before and after applying the 95% recall filter is shown for both default and modified fpocket parameters (Fig. 4c). Although fpocket with modified parameters (yielding smaller, more concise cavities) performs slightly worse than default according to ROC analysis, the number of cavities detected per protein is comparable and therefore both parameter sets introduce similar levels of noise to the dataset. However, smaller and more consistent cavities are beneficial for BSC in terms of studying binding site similarity. Therefore, we elected to study BSC using cavities detected by fpocket with modified parameters; it can be assumed that further mentions of fpocket refer to this non-standard modified model.
Binding site comparison (BSC)
Evaluation of SiteHopper retrieval of binding site patches from the sc-PDB (2013) belonging to the same protein as a query patch
BRD4 (n = 2)
1.00 ± 0.00
20.03 ± 0.00
1.00 ± 0.00
Carbonic anhydrase 2 (n = 3)
1.00 ± 0.00
19.77 ± 0.00
0.99 ± 0.00
CDK2 (n = 3)
1.00 ± 0.00
19.66 ± 0.07
0.97 ± 0.01
Estrogen receptor (n = 5)
1.00 ± 0.00
20.03 ± 0.00
0.96 ± 0.01
HIV-1 protease (n = 3)
0.99 ± 0.00
19.94 ± 0.00
0.99 ± 0.00
Prothrombin (n = 3)
1.00 ± 0.00
20.03 ± 0.00
1.00 ± 0.00
BRD4 (n = 2)
0.97 ± 0.03
18.67 ± 1.33
0.94 ± 0.06
Carbonic anhydrase 2 (n = 3)
0.99 ± 0.00
18.69 ± 0.00
0.93 ± 0.00
CDK2 (n = 3)
0.76 ± 0.03
8.83 ± 1.74
0.43 ± 0.10
Estrogen receptor (n = 5)
0.94 ± 0.01
16.07 ± 0.17
0.80 ± 0.01
HIV-1 protease (n = 3)
0.98 ± 0.00
19.52 ± 0.00
0.97 ± 0.00
Prothrombin (n = 3)
0.94 ± 0.01
17.76 ± 0.32
0.88 ± 0.01
As described above, the incorporation of fpocket cavity detection into BSC introduces the potential for noise in the binding site dataset compared to only defining binding sites surrounding observed bound ligands, and this may result in poorer retrieval performance metrics. However, in our retrieval analysis, we only observed a slight impact on BSC performance using early enrichment metrics; the AUROC enrichment remains high when compared to retrieval analyses performed using the ligCav dataset. Thus, the incorporation of fpocket objective cavity detection into BSC workflows is not associated with an unreasonable decrease in retrieval capability. In summary, we show that SiteHopper is able to identify structural similarity between potential binding sites that have been detected objectively from protein structure coordinates.
Interestingly, we observed a variation in retrieval rates across protein targets. Retrieval scores for the acetyl-lysine binding site of BRD4 are high, likely due to the rigidity of the protein structure surrounding this site. On the contrary, EF at 5% and BEDROC for the protein kinase CDK2 are relatively poor, likely due to the flexibility and range of protein conformations exemplified by crystal structures of this protein. Upon inspection of instances where structural similarity was expected, but not assigned a high SiteHopper score, we found that, in many cases, analogous fpocket-detected potential binding sites showed structural variability. This observation highlights the importance of consistency in the binding site detection tool; for example, upon inspection of cavities detected for prothrombin, we found that overlapping but dissimilar fpocket-detected sites were extracted from very similar protein conformations. This exemplifies how the objective implementation of fpocket binding site detection can still introduce noise into the BSC workflow despite the modifications we describe.
Clustering and mapping of potential binding sites
We applied the Pilot dataset to develop and validate a method for clustering and mapping objectively detected potentially ligandable binding sites. An exhaustive all-against-all SiteHopper BSC was performed to generate a 2708 × 2708 matrix of SiteHopper PatchScores. Possible PatchScores range from zero to four, where zero indicates total pairwise dissimilarity and four indicates perfect similarity. To remove non-conserved, information-poor potential binding sites, those with fewer than five PatchScores ≥2.0 were removed. Starting from an initial pool of 2708 potential binding sites, this criterion reduced the data to a 1706 × 1706 matrix of SiteHopper PatchScores.
The number of dominant clusters in the Pilot dataset represents the homogeneity of potential binding sites across the six proteins (Fig. 6). The scarcity of clusters of conserved binding sites among CDK2 structures is consistent with the flexibility of this kinase observed in protein crystal structures—notably the presence of diverse active and inactive protein conformations. Notably, SiteHopper identifies dominant clusters of substantially conserved binding sites for each of the five other protein targets. The most highly conserved binding site is that of the HIV-1 protease, likely due to the large volume and enclosed shape of the catalytic binding site that enables consistent identification by fpocket and robust detection of similarity by SiteHopper, respectively. Other factors that will likely affect the presence or absence of conserved binding sites within available structures of a particular protein include the presence of apo and holo bound structures, particularly for proteins containing multiple domains .
The global cluster analysis highlights a region of overlap between two clusters of binding sites belonging to prothrombin and cationic trypsin (Fig. 6). SiteHopper identifies structural similarity between the catalytic protease binding sites of these proteins (highlighted in red, Fig. 6). These two proteins are known to bind similar compounds and are annotated with a selectivity group of 254 compounds in the ChEMBL database (version 21) [53, 54] of bioactive molecules. Compounds in the selectivity group represent literature examples where the ratio of binding (selectivity coefficient) between prothrombin and trypsin has been measured; 99 of the 254 selectivity group examples exhibit a selectivity ratio of less than ten indicating that ligands commonly bind to both protein targets. The identification of similarity between these binding sites exemplifies the potential of BSC tools to rationalize and predict polypharmacology independent of ligand data.
Despite efforts to minimize the noise introduced by cavity detection, non-conserved potential binding sites inevitably affect the interpretability of clustered heat maps because much of the heat map conveys regions of structural dissimilarity—which is less informative than similarity. Furthermore, non-conserved binding site patches that do not show SiteHopper similarity to other patches are grouped together by clustering methods, generating a group of information-poor binding sites. One method to reduce the presence of these information-poor binding sites is to apply stricter binding site conservation criteria. However, these would penalize potentially interesting novel proteins for which there are fewer instances exemplified in the PDB versus more extensively studied proteins.
In constructing a workflow to map the binding sites of the currently characterized structural proteome, we adopted a modular approach that comprises objective binding site detection, binding site comparison (BSC), mapping of detected binding sites using unsupervised learning methods, and visualization of binding site maps. Although we outline a workflow for mapping potential small molecule binding sites in proteins, each of the components can be altered according to the tools available and specific hypothesis under test.
We applied fpocket as a geometric cavity detection tool to identify potentially novel unliganded binding sites, and modified fpocket parameters to yield concise cavities that are better suited to subsequent BSC studies. To filter out fpocket cavities that are unlikely to be ligandable, we determined a threshold fpocket Score by analyzing retrieval rates from the PDBbind-refined set; cavities were taken forward to BSC if the fpocket Score is ≥16.8, corresponding to the Score above which 95% of the ligand-bound cavities from the PDBBind-refined set are identified.
Applying fpocket cavity detection to the sc-PDB dataset (2013) to assess the impact of incorporating objective and unbiased cavity detection to BSC compared with only defining binding sites that surround exemplified bound ligands. Using SiteHopper for BSC, we show that the penalty associated with replacing ligand-dependent binding sites with objectively detected cavities is minimal and importantly also allows consideration of currently unliganded sites in BSC studies.
The workflow we describe applies the fpocket geometric detection algorithm to detect cavities in a protein structure. A limitation is that local chemical interaction hotspots and flat binding sites that are particularly relevant for the study of Protein–Protein Interactions (PPIs), will not be identified. To map such binding sites, it may be possible to introduce an interaction hotspot prediction tool such as FTMap , GRID  or SuperStar  into the modular workflow; this will be the subject of future studies.
The Pilot dataset was processed by fpocket and an all-against-all SiteHopper BSC was performed to create a matrix of binding site similarities. Hierarchical clustering within protein structures derived from the same protein target reveals a large proportion of cavities that are not conserved across multiple structures of the same protein; we therefore introduced a conservation filter (removal of cavities with fewer than five PatchScores ≥2.0) to minimize the number of information-poor cavities in the dataset. A combination of clustering both locally within protein targets and globally across the entire dataset, generates a map of potential binding site space. Furthermore, we show that density-based clustering by DBSCAN is an appropriate method for generating clusters of binding sites and mitigating the noise introduced to the dataset by objective fpocket cavity detection.
Although a powerful visualization, heat maps can be challenging to interpret, and therefore we introduce circular plots as an intuitive tool for visualizing and mapping structural binding site space. We show that such plots can highlight the similarity between binding sites derived from different proteins. Here, we exemplify an objectively identified similarity between binding sites of the serine proteases prothrombin and cationic trypsin that is consistent with literature reports that their catalytic sites bind similar ligands. We suggest that such protein binding site maps will be useful for building further understanding of the relationship between small molecules and complex biological systems; this approach is potentially applicable to the discovery of hit matter for novel biological targets, for predicting and rationalizing ligand polypharmacology and for predicting protein function [3, 4]. In addition, we suggest that such an objective binding site map, which encompasses unliganded cavities, will also be useful for optimizing compound screening collections towards a more complete chemical coverage of binding site space. We will present examples of such applications in due course.
area under receiver operating characteristic
binding site comparison
Boltzmann-enhanced discrimination receiver operating characteristic
density-based spatial clustering of applications with noise
protein data bank
bromodomain-containing protein 4
cyclin-dependent kinase 2
human immunodeficiency virus-1
JB, NB and JM conceived and designed the project. JM performed the experiments; JM analyzed the data with input from NB and JB. JM drafted the manuscript; JB and NB revised the manuscript. All authors read and approved the final manuscript.
We thank Dr. Yi Mok for his helpful guidance and comments on the manuscript.
The authors declare that they have no competing interests.
JM is supported by Wellcome Trust Grant 102361/Z/13/Z. NB and JB are supported by Cancer Research UK Grant C309/A11566.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Berman HM, Westbrook J, Feng Z et al (2000) The protein data bank. Nucleic Acids Res 28:235–242View ArticleGoogle Scholar
- Liang J, Edelsbrunner H, Woodward C (1998) Anatomy of protein pockets and cavities: measurement of binding site geometry and implications for ligand design. Protein Sci 7:1884–1897. doi:10.1002/pro.5560070905 View ArticleGoogle Scholar
- Holm L, Sander C (1996) Mapping the protein universe. Science 273:595–602View ArticleGoogle Scholar
- Ehrt C, Brinkjost T, Koch O (2016) Impact of binding site comparisons on medicinal chemistry and rational molecular design. J Med Chem 59:4121–4151. doi:10.1021/acs.jmedchem.6b00078 View ArticleGoogle Scholar
- Henrich S, Salo-Ahen OMH, Huang B et al (2010) Computational approaches to identifying and characterizing protein binding sites for ligand design. J Mol Recognit 23:209–219. doi:10.1002/jmr.984 Google Scholar
- Dukka BK (2013) Structure-based methods for computational protein functional site prediction. Comput Struct Biotechnol J 8:e201308005–e201308008. doi:10.5936/csbj.201308005 View ArticleGoogle Scholar
- Schmidtke P, Souaille C, Estienne F et al (2010) Large-scale comparison of four binding site detection algorithms. J Chem Inf Model 50:2191–2200. doi:10.1021/ci1000289 View ArticleGoogle Scholar
- Le Guilloux V, Schmidtke P, Tuffery P (2009) Fpocket: an open source platform for ligand pocket detection. BMC Bioinform 10:168–179. doi:10.1186/1471-2105-10-168 View ArticleGoogle Scholar
- Kellenberger E, Schalon C, Rognan D (2008) How to measure the similarity between protein ligand-binding sites? Curr Comput Aided Drug Des 4:209–220. doi:10.2174/157340908785747401 View ArticleGoogle Scholar
- Pérot S, Sperandio O, Miteva MA et al (2010) Druggable pockets and binding site centric chemical space: a paradigm shift in drug discovery. Drug Discov Today 15:656–667. doi:10.1016/j.drudis.2010.05.015 View ArticleGoogle Scholar
- Nisius B, Sha F, Gohlke H (2012) Structure-based computational analysis of protein binding sites for function and druggability prediction. J Biotechnol 159:123–134. doi:10.1016/j.jbiotec.2011.12.005 View ArticleGoogle Scholar
- Jalencas X, Mestres J (2013) Identification of similar binding sites to detect distant polypharmacology. Mol Inf 32:976–990. doi:10.1002/minf.201300082 View ArticleGoogle Scholar
- Batista J, Hawkins PC, Tolbert R, Geballe MT (2014) SiteHopper: a unique tool for binding site comparison. J Cheminform 6:P57. doi:10.1186/1758-2946-6-S1-P57 View ArticleGoogle Scholar
- Kryštof V, McNae IW, Walkinshaw MD et al (2005) Antiproliferative activity of olomoucine II, a novel 2,6,9-trisubstituted purine cyclin-dependent kinase inhibitor. Cell Mol Life Sci 62:1763–1771. doi:10.1007/s00018-005-5185-1 View ArticleGoogle Scholar
- Grant JA, Pickup BT (1995) A gaussian description of molecular shape. J Phys Chem 99:3503–3510. doi:10.1021/j100011a016 View ArticleGoogle Scholar
- Chen Y-C, Tolbert R, Aronov AM et al (2016) Prediction of protein pairs sharing common active ligands using protein sequence, structure, and ligand similarity. J Chem Inf Model 56:1734–1745. doi:10.1021/acs.jcim.6b00118 View ArticleGoogle Scholar
- VIDA 18.104.22.168 (2015) OpenEye Scientific Software, Santa Fe, NM, USA
- Kuhn D, Weskamp N, Schmitt S et al (2006) From the similarity analysis of protein cavities to the functional classification of protein families using cavbase. J Mol Biol 359:1023–1044. doi:10.1016/j.jmb.2006.04.024 View ArticleGoogle Scholar
- Krotzky T, Grunwald C, Egerland U, Klebe G (2015) Large-scale mining for similar protein binding pockets: with RAPMAD retrieval on the fly becomes real. J Chem Inf Model 55:165–179. doi:10.1021/ci5005898 View ArticleGoogle Scholar
- Chartier M, Najmanovich R (2015) Detection of binding site molecular interaction field similarities. J Chem Inf Model 55:1600–1615. doi:10.1021/acs.jcim.5b00333 View ArticleGoogle Scholar
- von Behren MM, Volkamer A, Henzler AM et al (2013) Fast protein binding site comparison via an index-based screening technology. J Chem Inf Model 53:411–422. doi:10.1021/ci300469h View ArticleGoogle Scholar
- Hendlich M, Bergner A, Gunther J, Klebe G (2003) Relibase: design and development of a database for comprehensive analysis of protein-ligand interactions. J Mol Biol 326:607–620. doi:10.1016/S0022-2836(02)01408-0 View ArticleGoogle Scholar
- Kellenberger E, Muller P, Schalon C et al (2006) sc-PDB: an annotated database of druggable binding sites from the protein data bank. J Chem Inf Model 46:717–727. doi:10.1021/ci050372x View ArticleGoogle Scholar
- Kufareva I, Ilatovskiy AV, Abagyan R (2012) Pocketome: an encyclopedia of small-molecule binding sites in 4D. Nucl Acids Res 40:D535–D540. doi:10.1093/nar/gkr825 View ArticleGoogle Scholar
- Hawkins PCD, Skillman AG, Warren GL et al (2010) Conformer generation with OMEGA: algorithm and validation using high quality structures from the protein databank and Cambridge structural database. J Chem Inf Model 50:572–584. doi:10.1021/ci100031x View ArticleGoogle Scholar
- Sitzmann M, Weidlich IE, Filippov IV et al (2012) PDB ligand conformational energies calculated quantum-mechanically. J Chem Inf Model 52:739–756. doi:10.1021/ci200595n View ArticleGoogle Scholar
- Sastry GM, Adzhigirey M, Day T et al (2013) Protein and ligand preparation: parameters, protocols, and influence on virtual screening enrichments. J Comput Aided Mol Des 27:221–234. doi:10.1007/s10822-013-9644-8 View ArticleGoogle Scholar
- The PyMOL molecular graphics system (2016) Version 22.214.171.124. Schrödinger, LLC. New York
- Liu Z, Li Y, Han L et al (2015) PDB-wide collection of binding data: current status of the PDBbind database. Bioinformatics 31:405–412. doi:10.1093/bioinformatics/btu626 View ArticleGoogle Scholar
- The UniProt Consortium (2014) UniProt: a hub for protein information. Nucl Acids Res 43:D204–D212. doi:10.1093/nar/gku989 View ArticleGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA et al (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl Acids Res 25:3389–3402. doi:10.1093/nar/25.17.3389 View ArticleGoogle Scholar
- HIV-1 Protease Consensus B Reference Sequence, HIV drug resistance database. http://hivdb.stanford.edu/pages/3DStructures/pr.fasta. Accessed 24 May 2016
- Blow D (2002) Outline of crystallography for biologists. Oxford University Press, OxfordGoogle Scholar
- Wirth M, Volkamer A, Zoete V et al (2013) Protein pocket and ligand shape comparison and its application in virtual screening. J Comput Aided Mol Des 27:511–524. doi:10.1007/s10822-013-9659-1 View ArticleGoogle Scholar
- Vainio MJ, Puranen JS, Johnson MS (2009) ShaEP: molecular overlay based on shape and electrostatic potential. J Chem Inf Model 49:492–502. doi:10.1021/ci800315d View ArticleGoogle Scholar
- Kleywegt GJ, Harris MR, Zou JY et al (2004) The uppsala electron-density server. Acta Crystallogr D 60:2240–2249. doi:10.1107/S0907444904013253 View ArticleGoogle Scholar
- Tanimoto TT (1957) IBM internal report 17th (November 1957)
- Hunter JD (2007) Matplotlib: a 2D graphics environment. Comput Sci Eng 9:90–95. doi:10.1109/MCSE.2007.55 View ArticleGoogle Scholar
- Jones E, Oliphant T, Peterson P (2014) SciPy: open source scientific tools for Python 2001–2012. Mach Learn Optim Big Data 9432:145–157. doi:10.1007/978-3-319-27926-8_13 Google Scholar
- Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd 96(34):226–231Google Scholar
- Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830Google Scholar
- Krzywinski M, Schein J, Birol I et al (2009) Circos: an information aesthetic for comparative genomics. Genome Res 19:1639–1645. doi:10.1101/gr.092759.109 View ArticleGoogle Scholar
- Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143:29–36View ArticleGoogle Scholar
- Ballester PJ, Finn PW, Richards WG (2009) Ultrafast shape recognition: evaluating a new ligand-based virtual screening technology. J Mol Graph Model 27:836–845. doi:10.1016/j.jmgm.2009.01.001 View ArticleGoogle Scholar
- Swamidass SJ, Azencott C-A, Daily K, Baldi P (2010) A CROC stronger than ROC: measuring, visualizing and optimizing early retrieval. Bioinformatics 26:1348–1356. doi:10.1093/bioinformatics/btq140 View ArticleGoogle Scholar
- Truchon J-F, Bayly CI (2007) Evaluating virtual screening methods: good and bad metrics for the “early recognition” problem. J Chem Inf Model 47:488–508. doi:10.1021/ci600426e View ArticleGoogle Scholar
- Yabuuchi H, Niijima S, Takematsu H et al (2011) Analysis of multiple compound-protein interactions reveals novel bioactive molecules. Mol Syst Biol 7:472. doi:10.1038/msb.2011.5 View ArticleGoogle Scholar
- OEChem Toolkit (2015) version 2015.02.09; OpenEye Scientific Software: Santa Fe, NM, USA
- McKinney W (2010) Data structures for statistical computing in python. In: Proceedings of the 9th Python in Science Conference, pp 51–56
- Fauman EB, Rai BK, Huang ES (2011) Structure-based druggability assessment—identifying suitable targets for small molecule therapeutics. Curr Opin Chem Biol 15:463–468. doi:10.1016/j.cbpa.2011.05.020 View ArticleGoogle Scholar
- Schmidtke P, Le Guilloux V, Maupetit J, Tuffery P (2010) fpocket: online tools for protein ensemble pocket detection and tracking. Nucl Acids Res 38:W582–W589. doi:10.1093/nar/gkq383 View ArticleGoogle Scholar
- Brylinski M, Skolnick J (2007) What is the relationship between the global structures of apo and holo proteins? Proteins 70:363–377. doi:10.1002/prot.21510 View ArticleGoogle Scholar
- ChEMBL v21. https://www.ebi.ac.uk/chembl/downloads. Accessed 8 March 2016
- Gaulton A, Bellis LJ, Bento AP et al (2011) ChEMBL: a large-scale bioactivity database for drug discovery. Nucl Acids Res 40:D1100–D1107. doi:10.1093/nar/gkr777 View ArticleGoogle Scholar
- Ivan G, Szabadka Z, Grolmusz V (2010) A hybrid clustering of protein binding sites. FEBS J 277:1494–1502. doi:10.1111/j.1742-4658.2010.07578.x View ArticleGoogle Scholar
- Kozakov D, Grove LE, Hall DR et al (2015) The FTMap family of web servers for determining and characterizing ligand-binding hot spots of proteins. Nat Protoc 10:733–755. doi:10.1038/nprot.2015.043 View ArticleGoogle Scholar
- Goodford PJ (1985) A computational procedure for determining energetically favorable binding sites on biologically important macromolecules. J Med Chem 28:849–857View ArticleGoogle Scholar
- Verdonk ML, Cole JC, Watson P et al (2001) Superstar: improved knowledge-based interaction fields for protein binding sites. J Mol Biol 307:841–859. doi:10.1006/jmbi.2001.4452 View ArticleGoogle Scholar