bSiteFinder, an improved protein-binding sites prediction server based on structural alignment: more accurate and less time-consuming
- Jun Gao†1, 2,
- Qingchen Zhang†1,
- Min Liu2,
- Lixin Zhu3, 4, 5,
- Dingfeng Wu1,
- Zhiwei Cao1 and
- Ruixin Zhu1
© The Author(s) 2016
Received: 3 February 2016
Accepted: 30 June 2016
Published: 11 July 2016
Protein-binding site prediction lays a foundation for the functional annotation of proteins and for structure-based drug design. As the number of available protein structures increases, structural alignment-based algorithms have become the dominant approach to protein-binding site prediction. However, present algorithms underutilize the ever-increasing number of three-dimensional protein–ligand complex structures (bound proteins), and they could be improved in the alignment process, the selection of templates and the clustering of templates. Herein, we built the largest database of bound templates to date, with stringent quality control. On this basis, we developed bSiteFinder, a protein-binding site prediction server.
Introducing Homology Indexing, Chain Length Indexing, Stability of Complex and Optimized Multiple-Templates Clustering into our algorithm significantly improved the efficiency of our server. Furthermore, its accuracy was approximately 2–10 % higher than that of other algorithms when tested on either the bound or the unbound dataset. On the 210-protein bound dataset, bSiteFinder achieved accuracies of up to 94.8 % (MCC 0.95). On the 48-protein bound/unbound dataset, it achieved accuracies of up to 93.8 % for bound proteins (MCC 0.95) and 85.4 % for unbound proteins (MCC 0.72). The bSiteFinder server is freely available at http://binfo.shmtu.edu.cn/bsitefinder/, and the source code is provided on the methods page.
Most biological processes involve the interaction of ligands with proteins. Functional characterization of the ligand-binding sites of proteins is a key issue in understanding those biological processes [1–4]. In addition, identifying the location of protein-binding sites is a vital first step in structure-based drug design [5–8]. However, functional characterization of proteins through experimental methods is labor-intensive and time-consuming. A computational tool that predicts the functional binding sites in a protein is therefore of practical importance.
To date, a variety of computational methods have been developed for protein-binding site prediction. They can be divided into four categories: geometry-based methods [9–14], energy-based methods [15, 16], alignment-based methods [17–20] and other miscellaneous methods [21–23]. Alignment-based methods can be further divided into sequence alignment-based and structural alignment-based methods. Recently, the growing number of structural genomics projects has led to exponential growth in the number of available protein structures. As a consequence, structural alignment-based methods have surpassed other methods owing to their greater efficiency and accuracy.
In 1996, Lichtarge et al. developed the first structural alignment-based algorithm for protein-binding site prediction, the evolutionary trace (ET) method. It is based on extracting functionally important residues from sequence conservation patterns in homologous proteins and mapping them onto the protein surface to generate clusters that identify functional interfaces. In 2007, Brylinski and Skolnick developed a popular structural alignment method called FINDSITE. For a given target sequence, FINDSITE identifies ligand-bound template structures from a set of distantly homologous proteins recognized by the PROSPECTOR_3 threading approach and superposes them onto the target’s structure using the TM-align structural alignment algorithm. Binding pockets are identified by spatial clustering of the centers of mass of template-bound ligands and are subsequently ranked by the number of binding ligands. In 2009, Oh et al. developed LEE, a two-stage template-based ligand binding site prediction method, in which templates are used first for protein 3D modeling and then for binding site prediction by structural clustering of ligand-containing templates onto the predicted 3D model. Later, in 2010, Wass et al. described a new method called 3DLigandSite. Structures similar to the query are identified by using MAMMOTH against a library of protein structures with bound ligands. The structure-based alignment of the similar structures with the query superposes ligands onto the query structure. After filtering, the top 25 ligands are retained for analysis and further clustering. In 2012, another comparative approach, COFACTOR, was proposed by the Zhang group. COFACTOR recognizes functional sites of protein–ligand interactions using low-resolution protein structural models, based on a global-to-local sequence and structural comparison algorithm.
The major advantage of COFACTOR over the existing methods is its optimal combination of global and local structural comparisons for identifying protein-binding sites. However, the global comparison can be distracted by structural variations in regions far away from the binding pockets, while the local comparison has a high false positive rate because the number of residues involved is too small. Later, in 2013, the Zhang group published another structural alignment-based algorithm, TM-SITE. Unlike COFACTOR, TM-SITE compares the structures of a subsequence running from the first binding residue to the last binding residue (called SSFL) on the query and template proteins, which solves the problems of the global-to-local structural comparison algorithm. These methods provide valuable choices for predicting binding sites. However, they still lack accuracy, time-efficiency or both, because the structural information of protein–ligand complexes (bound proteins) is underutilized.
Herein, we built the largest database of bound templates to date, with stringent quality control. On this basis, Stability of Complex as a new criterion and an Optimized Multiple-Templates Clustering algorithm are introduced to improve accuracy. Meanwhile, Homology Indexing and Chain Length Indexing are used to accelerate the structural alignment. Finally, we present a user-friendly protein-binding site prediction web server, bSiteFinder, at http://binfo.shmtu.edu.cn/bsitefinder/.
Definitions of operations
Rules of five
The macromolecule type is protein, with no DNA or RNA.
The experimental method is X-ray.
The X-ray resolution is between 0 and 3.0 Å.
Has free ligands = yes.
The sequence length is over 20.
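As a minimal sketch, the five rules above could be written as a filter over per-entry metadata. The `EntryMeta` fields and the function name are hypothetical and are not taken from the bSiteFinder source code; treating the resolution bound as 0 < r ≤ 3.0 Å is likewise an assumption, since the text does not say whether the limits are inclusive.

```python
from dataclasses import dataclass


@dataclass
class EntryMeta:
    """Hypothetical per-entry metadata; field names are illustrative only."""
    macromolecule_type: str   # e.g. "protein" or "protein/DNA"
    method: str               # e.g. "X-ray"
    resolution: float         # in angstroms
    has_free_ligands: bool
    sequence_length: int


def passes_rules_of_five(e: EntryMeta) -> bool:
    """Return True if the entry satisfies all five selection rules."""
    return (
        e.macromolecule_type == "protein"          # rule 1: protein only
        and e.method.lower().startswith("x-ray")   # rule 2: X-ray structures
        and 0 < e.resolution <= 3.0                # rule 3: assumed inclusive at 3.0
        and e.has_free_ligands                     # rule 4: has free ligands
        and e.sequence_length > 20                 # rule 5: length over 20
    )


good = EntryMeta("protein", "X-ray", 2.1, True, 150)
print(passes_rules_of_five(good))
```

Keeping the rules in one predicate makes it easy to see which criterion rejects a given entry when the thresholds are varied.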
Number of ligand atoms
In the process of building the databases, the database into which a protein finally falls depends on whether it contains ligands and whether those ligands have enough atoms. Ligand identification, judged by the rules below, therefore plays a key role. Every HETATM residue is recognized through the HET records in the header of the PDB file. Notably, some residues are modified residues on normal chains; these are not counted as true ligands because of their presence in the MODRES records. Hence, the selected ligands come only from HET records, excluding those listed in MODRES records. Water molecules are included in HETATM records but are not regarded as ligands. Based on our analysis of the data, we define as a basic rule that a ligand must possess 6 or more atoms.
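The ligand identification rule can be illustrated with a short sketch. It assumes that the HET groups of one entry have already been parsed into a mapping from residue name to atom count and that the MODRES residue names are available as a set; the function name, the input layout and the set of water residue names are assumptions made for illustration.

```python
# Residue names treated as water; this particular set is an assumption.
WATER_NAMES = {"HOH", "WAT", "DOD"}


def select_ligands(het_atom_counts, modres_names, min_atoms=6):
    """Return the HET residue names that qualify as ligands.

    het_atom_counts: dict mapping HET residue name -> number of atoms
    modres_names:    set of residue names listed in MODRES records
    """
    return sorted(
        name
        for name, n_atoms in het_atom_counts.items()
        if name not in modres_names      # modified residues are not ligands
        and name not in WATER_NAMES      # water is never a ligand
        and n_atoms >= min_atoms         # basic rule: 6 or more atoms
    )


# Hypothetical entry: ATP and glycerol qualify; selenomethionine (MODRES),
# water and a 5-atom sulfate ion do not.
hets = {"ATP": 31, "MSE": 8, "HOH": 1, "SO4": 5, "GOL": 6}
print(select_ligands(hets, {"MSE"}))
```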
Stability of Complex
The binding site check criterion is used as the standard for judging the stability of a bound structure. The complex is considered stable only if at least one atom of the ligand lies within 4 Å of the geometric center of the calculated binding site.
Homology Indexing is implemented using SCOPe, version 2.03. First, a four-digit classification number is looked up from the PDB ID and chain ID of the query chain. Then, all protein chains with the same classification number are retrieved and used to constitute the template database for subsequent structural alignment.
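A minimal sketch of this check follows. It assumes that the geometric center is the unweighted mean of the binding site atom coordinates and that "within 4 Å" includes the boundary; both are assumptions, as are the function names.

```python
import math


def geometric_center(points):
    """Unweighted mean of a non-empty list of (x, y, z) coordinates."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def is_stable_complex(ligand_atoms, site_atoms, cutoff=4.0):
    """True if any ligand atom lies within `cutoff` angstroms of the
    geometric center of the binding site atoms."""
    center = geometric_center(site_atoms)
    return any(math.dist(atom, center) <= cutoff for atom in ligand_atoms)


# Hypothetical coordinates: the site center is (1, 1, 0).
site = [(0, 0, 0), (2, 0, 0), (0, 2, 0), (2, 2, 0)]
print(is_stable_complex([(1, 1, 3)], site))  # ligand atom 3 angstroms away
print(is_stable_complex([(1, 1, 5)], site))  # ligand atom 5 angstroms away
```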
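The lookup described above can be sketched as an inverted index from classification number to chains. The chain identifiers and classification strings below are made up for illustration, and excluding the query chain from its own candidate list is an assumption.

```python
from collections import defaultdict


def build_homology_index(chain_to_class):
    """Invert a chain -> classification mapping into classification -> chains."""
    index = defaultdict(list)
    for chain_id, cls in chain_to_class.items():
        index[cls].append(chain_id)
    return index


def homologous_templates(query_chain, chain_to_class, index):
    """Return template chains sharing the query's classification number."""
    cls = chain_to_class.get(query_chain)
    if cls is None:          # query is not classified: no homologous templates
        return []
    return [c for c in index[cls] if c != query_chain]


# Hypothetical chains and classification numbers.
classes = {"1abc_A": "c.1.2.3", "2def_B": "c.1.2.3", "3ghi_A": "a.4.5.6"}
index = build_homology_index(classes)
print(homologous_templates("1abc_A", classes, index))
```

Because the index is built once, each query costs a single dictionary lookup before any structural alignment is run.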
Chain Length Indexing
Only chains whose length differs from that of the query chain by less than 30 % are used as candidates for subsequent structural alignment.
The structural alignment between the query and the templates in bSiteFinder is implemented using the Combinatorial Extension (CE) algorithm as provided by BioJava. Unlike traditional dynamic programming and Monte Carlo algorithms, the CE algorithm defines runs of continuous residues in the sequence as aligned fragment pairs (AFPs), which are used for local alignment between query and template. The optimized alignment is then obtained by extending or discarding the local AFPs.
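As a small illustration of this filter, the sketch below measures the length difference relative to the query length. The text does not state which length serves as the denominator, so that choice is an assumption, as are the function name and the example lengths.

```python
def passes_length_index(query_len, template_len, max_rel_diff=0.30):
    """True if the template length differs from the query length by less
    than `max_rel_diff`, measured relative to the query length (assumed)."""
    return abs(template_len - query_len) / query_len < max_rel_diff


# Hypothetical query of 200 residues against templates of varying length.
for template_len in (141, 200, 250, 260):
    print(template_len, passes_length_index(200, template_len))
```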
Optimized Multiple-Templates Clustering
Detection of binding sites
When a protein chain has ligands, we define all residues within 8 Å of the ligands as the components of the binding site. When the binding site is detected by structural alignment with templates, all residues within 10 Å of the mapped ligands are defined as the components of the binding site. Note that if a bound protein does not pass the Stability of Complex evaluation, it is treated as an unbound protein with its original ligands removed.
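The distance rule can be sketched as follows. The sketch assumes that a residue belongs to the binding site when any of its atoms lies within the cutoff of any ligand atom; the text does not specify whether atoms or residue centers are compared, so this reading, the function name and the coordinates are illustrative assumptions.

```python
import math


def binding_site_residues(residue_atoms, ligand_atoms, cutoff=8.0):
    """Return the IDs of residues with at least one atom within `cutoff`
    angstroms of at least one ligand atom.

    residue_atoms: dict mapping residue ID -> list of (x, y, z) coordinates
    ligand_atoms:  list of (x, y, z) coordinates
    """
    return {
        res_id
        for res_id, atoms in residue_atoms.items()
        if any(math.dist(a, l) <= cutoff for a in atoms for l in ligand_atoms)
    }


# Hypothetical residues; the ligand sits at the origin.
residues = {"ALA10": [(3, 0, 0)], "GLY55": [(9, 0, 0)], "SER90": [(20, 0, 0)]}
ligand = [(0, 0, 0)]
print(sorted(binding_site_residues(residues, ligand)))               # 8 angstrom rule
print(sorted(binding_site_residues(residues, ligand, cutoff=10.0)))  # 10 angstrom rule
```

The same function covers both cases in the text by passing 8 Å for original ligands and 10 Å for ligands mapped from templates.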
Test and evaluation methods
For comparison with other binding site prediction algorithms, two widely adopted datasets from LIGSITEcsc were used to test our algorithm under the same criteria for evaluating the accuracy of binding site prediction. The first test set contained 210 proteins with ligands (bound dataset). At the suggestion of RCSB, protein 1B6N was replaced by 1Z1H. The second test set contained 48 proteins with and without ligands (bound/unbound dataset).
Here, both the accuracy and the Matthews Correlation Coefficient (MCC) were used to evaluate our algorithm.
A widely accepted verification method was used. For a bound protein, if the protein–ligand complex passes the Stability of Complex evaluation, the accuracy is 100 %. If it does not, the original ligands of the bound protein are removed; the protein is then treated as an unbound protein and may have a lower accuracy.
For unbound proteins, if the geometric center of a binding site lies within 4 Å of any atom of the predicted ligands, the binding site is regarded as correctly predicted. Otherwise, it is regarded as incorrectly predicted.
For unbound proteins, structural alignment between query and template is used to map the ligands in bound proteins onto the unbound proteins. The mapped pseudo-ligands are then used to detect the binding site as described in “Detection of Binding Sites”. To evaluate our methods, we divided the residues of the query chains into residues of the predicted binding site (Res-BS-Pre) and residues of the predicted non-binding site (Res-NBS-Pre). Likewise, according to the original ligands of the query chains, we define the residues of the experimental binding site as Res-BS-Exp and the residues of the experimental non-binding site as Res-NBS-Exp. Therefore, in formula (1), TP is the intersection of Res-BS-Pre and Res-BS-Exp, TN is the intersection of Res-NBS-Pre and Res-NBS-Exp, FP is the intersection of Res-BS-Pre and Res-NBS-Exp, and FN is the intersection of Res-NBS-Pre and Res-BS-Exp.
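The set definitions above translate directly into code. Formula (1) is not reproduced in this text, so the sketch below assumes the standard Matthews definition, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); the function names and the toy residue numbering are illustrative, and returning 0 when the denominator vanishes is a convention chosen here.

```python
import math


def confusion_counts(all_residues, predicted_site, experimental_site):
    """Count TP, TN, FP, FN from sets of residue IDs."""
    tp = len(predicted_site & experimental_site)               # Res-BS-Pre and Res-BS-Exp
    fp = len(predicted_site - experimental_site)               # Res-BS-Pre and Res-NBS-Exp
    fn = len(experimental_site - predicted_site)               # Res-NBS-Pre and Res-BS-Exp
    tn = len(all_residues - predicted_site - experimental_site)  # both non-binding
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Standard Matthews Correlation Coefficient; 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy chain of 10 residues with overlapping predicted and experimental sites.
counts = confusion_counts(set(range(10)), {0, 1, 2, 3}, {1, 2, 3, 4})
print(counts, mcc(*counts))
```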
Create template database
Workflow of binding sites detection
Binding sites prediction of high quality bound protein (Part 1)
Binding sites prediction of unbound protein with bound templates of same Homology Indexing (Part 2)
Binding sites prediction of unbound protein with bound templates of Chain Length Indexing (Part 3)
If the query chain has no satisfactory homologous bound template, its binding site is detected by the following procedure. Chain Length Indexing is employed to search the template database for bound templates whose length differs from that of the query chain by less than 30 %. The query then enters the process described above (Part 2 of “Workflow of binding sites detection”) with the top 20 most similar bound templates. In this way, any protein chain submitted to our system can receive binding site results through efficient computation.
Results and discussion
Performance of our algorithm and its comparison with others
Comparison of the top1 and top3 success rates for various methods using 210 bound structures
Comparison of the top1 and top3 success rates for various methods using 48 bound/unbound structures
Since many protein chains still have no satisfactory bound structures, bound templates are borrowed to detect the binding sites in this situation. Our template database contains 101,315 bound templates. Predicting a binding site would consume a large amount of computation if structural alignments were run against every chain in the database. Thus, to improve the efficiency of our algorithm, Homology Indexing is introduced so that the time-consuming structural alignment is limited to homologous proteins. After building the Homology Indexing for all 101,315 chains in the template database using SCOPe, 4254 protein classes are obtained. This means that, on average, only about 24 (101,315/4254) bound templates need to be structurally aligned with the query per prediction, which significantly reduces the computation time.
Frequency of structural alignment with 48 unbound chains using Homology Indexing
Frequency of structural alignment in 20 no homologous template chains with Chain Length Indexing involved
Percentage of sequences passed Chain Length Indexing (%)
Top1 template for 20 no homologous template chains and their length obtained without Chain Length Indexing
Template chain (length constrained)
Stability of Complex
Similarly, Stability of Complex is applied when building the template database (see details in Fig. 2), which reduced the number of bound structures from 117,823 to 101,315, removing 14 % of the structures. This operation not only improved the quality of the template database but also reduced the number of time-consuming structural alignments.
An Optimized Multiple-Templates Clustering method
Similar to FINDSITE, 3DLigandSite and COFACTOR, our algorithm improves its prediction accuracy through Optimized Multiple-Templates Clustering. However, previous algorithms require the number of clusters as input, which cannot actually be known before computing. In addition, the distances between ligands in each cluster have no reasonable physical meaning. Our algorithm overcomes this deficiency by defining a new constraint: the distances between the geometric centers of all ligands (for one binding site) in the same cluster must be less than a certain threshold (the cluster radius). Ligands from multiple templates can thus be clustered automatically under a constraint with reasonable physical meaning, and there is no need to estimate the number of clusters before clustering.
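One simple way to satisfy such a constraint is a greedy pass over the ligand centers, shown below as a sketch only. It is not taken from the bSiteFinder source code: the greedy assignment depends on input order, and ranking clusters by the number of ligands they contain is an assumption borrowed from the FINDSITE description in the introduction.

```python
import math


def cluster_ligand_centers(centers, radius=3.0):
    """Greedily group ligand geometric centers so that every pair of
    centers in a cluster is closer than `radius` angstroms.

    No cluster count is needed in advance; clusters are returned
    largest first.
    """
    clusters = []
    for center in centers:
        for members in clusters:
            # Join the first cluster whose members are all within the radius.
            if all(math.dist(center, m) < radius for m in members):
                members.append(center)
                break
        else:
            clusters.append([center])   # no compatible cluster: start a new one
    clusters.sort(key=len, reverse=True)
    return clusters


# Hypothetical ligand centers mapped from six templates.
centers = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (10, 0, 0), (11, 0, 0), (4, 0, 0)]
for members in cluster_ligand_centers(centers):
    print(len(members), members)
```

With a radius of 3.0 Å the six centers fall into three clusters, and the point at (4, 0, 0) stays separate because it is 4 Å from the first member of the nearest cluster.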
Comparison of prediction accuracies using Optimized Multiple-Templates Clustering with different cluster radius with 48 unbound dataset
The results in Table 6 indicate that the Top1 and Top3 prediction accuracies on the 48-protein unbound dataset are highest when the cluster radius is set to 3.0 Å. Thus, 3.0 Å is the default cluster radius used by bSiteFinder in Optimized Multiple-Templates Clustering.
bSiteFinder, a protein-binding site prediction server, was developed on the basis of the largest database of bound templates to date, built with stringent quality control. Each submitted protein chain is processed through the following steps: (1) binding site prediction of high quality bound protein; (2) binding site prediction of unbound protein with bound templates of the same Homology Indexing; (3) binding site prediction of unbound protein with bound templates selected by Chain Length Indexing. Any submitted protein chain can therefore receive binding site results through efficient computation. Introducing Homology Indexing, Chain Length Indexing, Stability of Complex and Optimized Multiple-Templates Clustering into our algorithm significantly improved the efficiency of our server. Moreover, its accuracy was approximately 2–10 % higher than that of other algorithms when tested on either the bound or the unbound dataset. On the 210-protein bound dataset, bSiteFinder achieved accuracies of up to 94.8 % (MCC 0.95). On the 48-protein bound/unbound dataset, it achieved accuracies of up to 93.8 % for bound proteins (MCC 0.95) and 85.4 % for unbound proteins (MCC 0.72). An online bSiteFinder server is freely available at http://binfo.shmtu.edu.cn/bsitefinder/, and the source code is provided on the methods page. Our work lays a foundation for the functional annotation of proteins and for structure-based drug design. As the number of three-dimensional protein–ligand complex structures keeps increasing, our server should become more accurate and less time-consuming.
Each author has contributed significantly to the submitted work. RZ conceived and designed the project. JG, QZ, ML, LZ, DW and ZC performed the experiments. JG, QZ, ML, LZ, DW and ZC analyzed the data. JG and QZ drafted the manuscript. LZ and RZ revised the manuscript. All authors read and approved the final manuscript.
This work was supported by the National Natural Science Foundation of China 61303099 (to JG), 31200986 (to RZ), 41530105 (to RZ), and the Fundamental Research Funds for the Central Universities 10247201546 (to RZ) and 2000219083 (to RZ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Greer J, Erickson JW, Baldwin JJ, Varney MD (1994) Application of the three-dimensional structures of protein target molecules in structure-based drug design. J Med Chem 37(8):1035–1054
- Fuller JC, Burgoyne NJ, Jackson RM (2009) Predicting druggable binding sites at the protein-protein interface. Drug Discov Today 14(3–4):155–161
- Mandal S, Moudgil M, Mandal SK (2009) Rational drug design. Eur J Pharmacol 625(1–3):90–100
- Rausell A, Juan D, Pazos F, Valencia A (2010) Protein interactions and ligand binding: from protein subfamilies to functional specificity. Proc Natl Acad Sci USA 107(5):1995–2000
- Laurie ATR, Jackson RM (2006) Methods for the prediction of protein-ligand binding sites for structure-based drug design and virtual ligand screening. Curr Protein Pept Sci 7(5):395–406
- Honma T (2003) Recent advances in De novo design strategy for practical lead identification. Med Res Rev 23(5):606–632
- Pradeep H, Rajanikant GK (2014) Computational prediction of a putative binding site on Drp 1: implications for antiparkinsonian therapy. J Chem Inf Model 54(7):2042–2050
- Xiao X, Min JL, Lin WZ, Liu Z, Cheng X, Chou KC (2015) iDrug-Target: predicting the interactions between drug compounds and target proteins in cellular networking via benchmark dataset optimization approach. J Biomol Struct Dyn 33(10):2221–2233
- Levitt DG, Banaszak LJ (1992) POCKET: a computer graphics method for identifying and displaying protein cavities and their surrounding amino acids. J Mol Graph 10(4):229–234
- Hendlich M, Rippmann F, Barnickel G (1997) LIGSITE: automatic and efficient detection of potential small molecule-binding sites in proteins. J Mol Graph Model 15(6):359
- Brady GP, Stouten PFW (2000) Fast prediction and visualization of protein binding pockets with PASS. J Comput Aid Mol Des 14(4):383–401
- Laskowski RA (1995) Surfnet: a program for visualizing molecular-surfaces, cavities, and intermolecular interactions. J Mol Graph 13(5):323
- Weisel M, Proschak E, Schneider G (2007) PocketPicker: analysis of ligand binding-sites with shape descriptors. Chem Cent J 1:7
- Dai TL, Liu Q, Gao J, Cao ZW, Zhu RX (2011) A new protein-ligand binding sites prediction method based on the integration of protein sequence conservation information. BMC Bioinform 12(Suppl 14):S9
- Laurie ATR, Jackson RM (2005) Q-SiteFinder: an energy-based method for the prediction of protein-ligand binding sites. Bioinformatics 21(9):1908–1916
- Ngan CH, Hall DR, Zerbe B, Grove LE, Kozakov D, Vajda S (2012) FTSite: high accuracy detection of ligand binding sites on unbound protein structures. Bioinformatics 28(2):286–287
- Lichtarge O, Bourne HR, Cohen FE (1996) An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 257(2):342–358
- Brylinski M, Skolnick J (2008) A threading-based method (FINDSITE) for ligand-binding site prediction and functional annotation. Proc Natl Acad Sci USA 105(1):129–134
- Roy A, Yang JY, Zhang Y (2012) COFACTOR: an accurate comparative algorithm for structure-based protein function annotation. Nucleic Acids Res 40(W1):W471–W477
- Yang JY, Roy A, Zhang Y (2013) Protein-ligand binding site recognition using complementary binding-specific substructure comparison and sequence profile alignment. Bioinformatics 29(20):2588–2595
- Liang SD, Zhang C, Liu S, Zhou YQ (2006) Protein binding site prediction using an empirical scoring function. Nucleic Acids Res 34(13):3698–3707
- Sonavane S, Chakrabarti P (2010) Prediction of active site cleft using support vector machines. J Chem Inf Model 50(12):2266–2273
- Xie ZR, Liu CK, Hsiao FC, Yao A, Hwang MJ (2013) LISE: a server using ligand-interacting and site-enriched protein triangles for prediction of ligand-binding sites. Nucleic Acids Res 41(W1):W292–W296
- Oh M, Joo K, Lee J (2009) Protein-binding site prediction based on three-dimensional protein modeling. Proteins 77:152–156
- Wass MN, Kelley LA, Sternberg MJE (2010) 3DLigandSite: predicting ligand-binding sites using similar structures. Nucleic Acids Res 38:W469–W473
- Ortiz AR, Strauss CEM, Olmea O (2002) MAMMOTH (Matching molecular models obtained from theory): an automated method for model comparison. Protein Sci 11(11):2606–2621
- Fox NK, Brenner SE, Chandonia JM (2014) SCOPe: structural classification of proteins-extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res 42(D1):D304–D309
- Prlić A, Yates A, Bliven SE, Rose PW, Jacobsen J, Troshin PV, Chapman M, Gao JJ, Koh CH, Foisy S et al (2012) BioJava: an open-source framework for bioinformatics in 2012. Bioinformatics 28(20):2693–2695
- Huang BD, Schroeder M (2006) LIGSITEcsc: predicting ligand binding sites using the Connolly surface and degree of conservation. BMC Struct Biol 6:19
- Matthews BW (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure 405(2):442–451
- Skolnick J, Brylinski M (2009) FINDSITE: a combined evolution/structure-based approach to protein function prediction. Brief Bioinform 10(4):378–391
- Xie ZR, Hwang MJ (2012) Ligand-binding site prediction using ligand-interacting and binding site-enriched protein triangles. Bioinformatics 28(12):1579–1585