The good, the bad and the dubious: VHELIBS, a validation helper for ligands and binding sites
© Cereto-Massagué et al.; licensee Chemistry Central Ltd. 2013
Received: 15 May 2013
Accepted: 18 July 2013
Published: 29 July 2013
Many Protein Data Bank (PDB) users assume that the deposited structural models are of high quality but forget that these models are derived from the interpretation of experimental data. The accuracy of atom coordinates is not homogeneous between models or throughout the same model. To avoid basing a research project on a flawed model, we present a tool for assessing the quality of ligands and binding sites in crystallographic models from the PDB.
The Validation HElper for LIgands and Binding Sites (VHELIBS) is software that aims to ease the validation of binding site and ligand coordinates for non-crystallographers (i.e., users with little or no crystallography knowledge). Using a convenient graphical user interface, it allows one to check how ligand and binding site coordinates fit to the electron density map. VHELIBS can use models from either the PDB or the PDB_REDO databank of re-refined and re-built crystallographic models. The user can specify threshold values for a series of properties related to the fit of coordinates to electron density (Real Space R, Real Space Correlation Coefficient and average occupancy are used by default). VHELIBS will automatically classify residues and ligands as Good, Dubious or Bad based on the specified limits. The user is also able to visually check the quality of the fit of residues and ligands to the electron density map and reclassify them if needed.
VHELIBS allows inexperienced users to examine the binding site and the ligand coordinates in relation to the experimental data. This is an important step to evaluate models for their fitness for drug discovery purposes such as structure-based pharmacophore development and protein-ligand docking experiments.
Keywords: Electron density map; Binding site structure validation; Ligand structure validation; Protein structure validation; PDB; PDB_REDO
The 3D structure of proteins depends on their amino acid sequence but cannot be predicted based solely on that sequence, except for relatively small proteins. As the structure of a molecule cannot be observed directly, a model of the structure must be constructed using experimental data. These data can be obtained through different methods, such as X-ray crystallography, NMR spectroscopy or electron microscopy. However, none of these methods allows for the direct calculation of the structure from the data. In X-ray crystallography, the most widely applied method, the crystallographic diffraction data are used to construct a three-dimensional grid that represents the probability for electrons to be present in specific positions in space, the so-called electron density (ED) map. The ED shows the average over many (typically between 10^13 and 10^15) molecules arranged in a periodic fashion in crystals and is the average over the time of the X-ray experiment. This ED is then interpreted to construct an atomic model of the structure. The model is just a representation of the crystallographic data and other known information about the structure, such as the sequence, bond lengths and angles. Different models, such as the thousands of models in the Protein Data Bank (PDB), represent the experimental data with varying degrees of reliability, and the quality of experimental data (for example, the resolution limit of the diffracted X-rays) varies significantly.
Due to the interpretation step during modeling, which is inevitably subjective [5, 6], it is very important to check whether a model fits reasonably to the ED that was used to construct it, to ensure its reliability. For drug discovery and design purposes, the model quality of the protein binding sites and of the ligands bound to them is of particular interest, while the overall model quality or the quality of the model outside the binding site is not directly relevant.
A good way to assess how well a subset of atomic coordinates fits the experimental electron density is the Real Space R-value (RSR), which has been recommended by the X-ray Validation Task Force of the Worldwide PDB [8, 9]. The RSR quantifies the disagreement between the 2mFo-DFc and the DFc maps, so lower values indicate a better fit. The real-space correlation coefficient (RSCC) is another well-established measure of model fit to the experimental data. The use of the ED to validate the model will not catch all possible problems in the model, but it can show whether the model fits the data from which it was created.
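As a rough illustration of the two measures, the sketch below computes both from density values sampled at the same grid points around a residue or ligand. It assumes the commonly cited form of the RSR, the sum of absolute differences divided by the sum of absolute sums of the observed (2mFo-DFc) and calculated (DFc) densities, and treats the RSCC as a plain Pearson correlation. The function names and sample values are invented for this example; map scaling and the choice of grid points, which differ between programs, are left out.

```python
import math


def real_space_r(rho_obs, rho_calc):
    """RSR = sum|obs - calc| / sum|obs + calc| over the sampled grid points."""
    num = sum(abs(o - c) for o, c in zip(rho_obs, rho_calc))
    den = sum(abs(o + c) for o, c in zip(rho_obs, rho_calc))
    return num / den


def real_space_cc(rho_obs, rho_calc):
    """Pearson correlation between observed and calculated density values."""
    n = len(rho_obs)
    mean_o = sum(rho_obs) / n
    mean_c = sum(rho_calc) / n
    cov = sum((o - mean_o) * (c - mean_c) for o, c in zip(rho_obs, rho_calc))
    var_o = sum((o - mean_o) ** 2 for o in rho_obs)
    var_c = sum((c - mean_c) ** 2 for c in rho_calc)
    return cov / math.sqrt(var_o * var_c)


# Made-up density values at five grid points (illustration only).
obs = [0.9, 1.2, 0.4, 0.8, 1.0]
calc = [1.0, 1.1, 0.5, 0.7, 1.0]
print(round(real_space_r(obs, calc), 3), round(real_space_cc(obs, calc), 3))
```

A perfect fit gives an RSR of 0 and an RSCC of 1, which is why thresholds are expressed as a maximum for the RSR and a minimum for the RSCC.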
VHELIBS aims to enable non-crystallographers and users with little or no crystallographic knowledge to easily validate protein structures before using them in drug discovery and development. To that end, VHELIBS features a Graphical User Interface (GUI) with carefully chosen default values that are valid for most situations, while allowing more advanced users to tune the parameters easily. A tool named Twilight [11, 12] has recently been published to evaluate ligand density. However, while VHELIBS focuses on assessing both the ligands and binding sites to aid model evaluation for drug discovery purposes, Twilight is ligand-centric and focuses on highlighting poorly modeled ligands. VHELIBS also enables the user to choose between the models from either the PDB [4, 13] or the PDB_REDO databanks. Using PDB_REDO as the data source can have substantial benefits over using the PDB. PDB_REDO changes models both by re-refinement, incorporating advances in crystallographic methods since the original structure model (the PDB entry) was constructed, and by limited rebuilding, mainly of residue side chains, improving the fit of models to the ED.
VHELIBS validates the binding site and ligand against the ED in a semi-automatic way, classifying them as Good, Bad or Dubious based on a score. This score is calculated by taking several parameters into account (RSR, RSCC, and average occupancy by default, but more can be used). After performing the automatic analysis and classification of a target’s binding site and ligand, it enables the user to graphically review and compare them with their ED, making it easier to properly classify any structure labeled ‘Dubious’ or to reclassify any other structure based on visual inspection and comparison of the ED with the model.
VHELIBS is mainly implemented using Python under Jython, with some critical parts implemented in Java. It uses Jmol for the 3D visualization of models and EDs. Electron density maps are retrieved from the EDS [19, 20] or from the PDB_REDO databank, which are updated weekly with new data from the PDB. Models are downloaded from either the PDB or PDB_REDO according to the user settings.
Description of the algorithm
VHELIBS takes as input a user-provided list of either PDB or UniProtKB codes (which are mapped to their corresponding PDB codes). The codes in these lists can be entered directly from the GUI or provided in a text file.
For each of these PDB codes, statistical data are retrieved from the EDS or from the PDB_REDO, depending on the source of the models being analyzed (i.e., EDS data for models downloaded from the PDB and PDB_REDO data for models downloaded from the PDB_REDO). Ligands bound to residues or molecules included in the ‘blacklist’ exclusion list (see below) with a bond length < 2.1 Å are rejected. Ligands bound to molecules in the ‘non-propagating’ exclusion list (which can be modified by the user and by default contains mainly metal ions) are not rejected. The exclusion lists are composed of the most common solvent molecules and other non-ligand hetero compounds often found in PDB files, as well as some less common solvents and molecules that were found to have very simple binding sites (e.g., a binding site consisting of just 1–2 residues). We also incorporated the buffer molecules from Twilight’s list [11, 12]. The exclusion list from BioLiP was also considered, but deemed too restrictive.
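The rejection rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the VHELIBS implementation: the list contents shown here are placeholder examples (the real lists ship with the program), and the `is_rejected` function and its `contacts` input format are hypothetical.

```python
MAX_BOND_LENGTH = 2.1  # angstroms, the cut-off quoted in the text

# Placeholder entries for illustration; not the actual VHELIBS lists.
BLACKLIST = {"SO4", "GOL", "EDO"}      # e.g. solvent and buffer molecules
NON_PROPAGATING = {"ZN", "MG", "CA"}   # e.g. metal ions


def is_rejected(contacts, blacklist=BLACKLIST, non_propagating=NON_PROPAGATING):
    """Decide whether a ligand is rejected.

    contacts: list of (neighbor residue name, shortest distance in angstroms)
    describing what the ligand is bound to.
    """
    for name, distance in contacts:
        if name in non_propagating:
            # Being bound to a non-propagating entry never rejects the ligand.
            continue
        if name in blacklist and distance < MAX_BOND_LENGTH:
            return True
    return False


print(is_rejected([("SO4", 1.8)]))  # blacklisted neighbor within 2.1 angstroms
print(is_rejected([("ZN", 1.9)]))   # metal ion from the non-propagating list
```

Keeping the two lists separate lets a metal-coordinated ligand stay in the analysis while a ligand bonded to a solvent or buffer molecule is dropped.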
For each residue and component of each ligand and each binding site, the initial score is defined to be 0.
For each unmet user-specified condition, the score is increased by 1. The user-specified conditions are the value thresholds for several different properties of the model and the data (i.e., RSR, RSCC, occupancy-weighted B factor, R-free, resolution and residue average occupancy; the user may also use a subset of these properties).
If the score remains 0, the ligand/residue is labeled as Good.
If the score is greater than the user-defined tolerance value, the ligand/residue is labeled as Bad.
If the score is greater than 0 but does not exceed the user-defined tolerance value, the ligand/residue is labeled as Dubious.
At the end of all evaluations, the binding site and the ligand (for ligands with more than 1 ‘residue’, i.e., those composed of more than one hetero compound in the PDB file) are labeled according to the worst score of their components (i.e., a binding site with a Bad residue will be labeled as Bad regardless of how the rest of the residues are labeled, and a binding site can only be labeled as Good when all its residues are Good).
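The scoring steps above can be sketched in a few lines of Python. This is a minimal illustration, not the VHELIBS source: the function names, the representation of conditions as predicates and the tolerance value in the example are all assumptions. The three thresholds reuse the Default (PDB) values quoted in the text.

```python
SEVERITY = {"Good": 0, "Dubious": 1, "Bad": 2}


def count_unmet(values, conditions):
    """Score = number of user-specified conditions the residue fails."""
    return sum(1 for name, check in conditions.items() if not check(values[name]))


def classify(score, tolerance):
    """Map a score to a label following the three rules in the text."""
    if score == 0:
        return "Good"
    if score > tolerance:
        return "Bad"
    return "Dubious"


def classify_group(labels):
    """A ligand or binding site takes the worst label of its components."""
    return max(labels, key=SEVERITY.__getitem__)


# Hypothetical conditions built from the Default (PDB) thresholds.
conditions = {
    "rsr": lambda v: v <= 0.24,
    "rscc": lambda v: v >= 0.9,
    "occupancy": lambda v: v >= 1.0,
}
residue = {"rsr": 0.30, "rscc": 0.93, "occupancy": 1.0}  # made-up values
label = classify(count_unmet(residue, conditions), tolerance=1)  # assumed tolerance
print(label, classify_group(["Good", label, "Good"]))
```

Taking the maximum severity reproduces the rule that one Bad residue makes the whole binding site Bad, and that a site is Good only when every residue is Good.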
The results from this classification are saved to a CSV file (the results file), which can be opened by any major spreadsheet software and can then be filtered as desired (for Good ligands, for Good binding sites or for both). A file with a list of all the rejected PDB structures and ligands and the reason for the rejection is also generated with the results file.
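Besides a spreadsheet, the results file can be filtered with a short script. The sketch below keeps only the rows where both the ligand and the binding site are Good. The column names (`entry`, `ligand`, `ligand_class`, `binding_site_class`) and the sample rows are hypothetical; the header of an actual results file should be checked and the names adjusted accordingly.

```python
import csv
import io


def good_rows(csv_file, ligand_col="ligand_class", site_col="binding_site_class"):
    """Return the rows whose ligand and binding site are both labeled Good."""
    reader = csv.DictReader(csv_file)
    return [
        row for row in reader
        if row[ligand_col] == "Good" and row[site_col] == "Good"
    ]


# Stand-in for a results file, with invented column names and rows.
sample = io.StringIO(
    "entry,ligand,ligand_class,binding_site_class\n"
    "entry1,LIG,Good,Good\n"
    "entry2,LIG,Good,Dubious\n"
    "entry3,LIG,Bad,Good\n"
)
selected = good_rows(sample)
print([row["entry"] for row in selected])
```

For a file on disk, the same function can be called with `open(path, newline="")` in place of the `io.StringIO` object.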
Binding site residues are shown by default in white and with a wireframe style in order to show the context where the possible reclassification is evaluated.
Coordinates to examine for veracity are shown in ball and stick style and colored according to their B-factor.
Ligand coordinates are shown in ball and stick style and colored in magenta (but can be colored according to their B-factor if they need to be examined).
The ED for coordinates to examine is shown in yellow.
The ED for the complete binding site can be added to the visualization (in cyan) if necessary.
The ED for the ligand can be shown separately (in red).
Hence, with this visualization frame, the user has all the information needed to decide, for instance, whether (a) dubious binding site coordinates could be relevant for protein-ligand docking results (if the dubious coordinates face away from the ligand, it is reasonable to think that their accuracy does not affect protein-ligand docking results); and (b) ligand coordinates that were classified as Bad or Dubious by the automatic analysis can be changed to Good if the experimental pose is the only possibility for its corresponding ED (this can occur with non-flexible rings that have only partial ED for their atoms). The online documentation (https://github.com/URVnutrigenomica-CTNS/VHELIBS/wiki) contains more information on this and some practical rules for guiding such an evaluation. The visualization of the binding site, the ligand and coordinates to examine (Dubious or Bad residues and ligands) and their respective EDs can be customized in several ways through the GUI, e.g., by changing atom colors and styles or the contour level and radius of the EDs.
VHELIBS can be used with different running conditions (i.e., with different profiles). The values of the default profiles [i.e., Default (PDB) and Default (PDB_REDO)] were chosen after careful visualization and comparison of models with their EDs, giving a default minimum RSCC of 0.9, a minimum average occupancy of 1.0, a maximum RSR of 0.4 and a maximum good RSR of 0.24 for PDB and 0.165 for PDB_REDO. The different RSR cut-offs for the PDB and PDB_REDO are the result of RSR being calculated using different software in the EDS (which uses MAPMAN) and in PDB_REDO (which uses EDSTATS). The third provided profile, Iridium, is based on the values used in the construction of the Iridium set. This profile is only provided as an example of how easy it is to adapt VHELIBS to use other values found in the literature. Note, however, that VHELIBS will yield slightly different results from those in the Iridium set, because VHELIBS uses the EDs and statistical data from EDS or PDB_REDO, while the authors of the Iridium set calculate all the data using different software and different EDs.
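One way to picture a profile is as a small record of threshold values. The sketch below encodes the two default profiles with the numbers quoted above and derives a custom one. The `Profile` class, its field names and the "Strict (custom)" profile are assumptions made for this illustration and do not reflect how VHELIBS stores profiles internally. The code targets Python 3.7 or newer (for `dataclasses`) rather than Jython.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Profile:
    name: str
    min_rscc: float
    min_average_occupancy: float
    max_rsr: float
    max_good_rsr: float


# Threshold values taken from the text; field names are hypothetical.
DEFAULT_PDB = Profile("Default (PDB)", 0.9, 1.0, 0.4, 0.24)
# Only the "good" RSR cut-off differs, because the EDS and PDB_REDO
# compute the RSR with different software.
DEFAULT_PDB_REDO = replace(DEFAULT_PDB, name="Default (PDB_REDO)", max_good_rsr=0.165)


def default_profile_for(source):
    """Pick the default profile matching the model source."""
    return {"PDB": DEFAULT_PDB, "PDB_REDO": DEFAULT_PDB_REDO}[source]


# A custom profile: start from a default and tighten one threshold.
strict = replace(default_profile_for("PDB_REDO"), name="Strict (custom)", min_rscc=0.95)
print(strict)
```

Deriving custom profiles from a default keeps the untouched thresholds consistent with the data source they were tuned for.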
Key features of VHELIBS
Many different parameters can be used to filter good models, and their threshold values can be adjusted by the user. Contextual help informs the user about the meaning of the different parameters.
VHELIBS comes with three profiles, and the user can create custom profiles and export them for further use or sharing.
VHELIBS has the ability to work with an unlimited number of PDB or UniProtKB codes (all the PDB codes in each UniProtKB entry are analyzed).
VHELIBS has the ability to choose between models from PDB_REDO or from the PDB.
VHELIBS runs in the Java Virtual Machine, which makes it operating-system independent.
VHELIBS consists of a single jar file, needing no installation. There are no dependencies other than Java.
The user can load a results file from a previous analysis; one can let a huge analysis run during lunch or overnight and then review the results at any later time.
A user does not need to be familiar with any other software (although familiarity with Jmol will help the user to make sophisticated custom views).
PDB_REDO changes to support VHELIBS
The PDB_REDO databank was upgraded to have per-residue RSR and RSCC values and downloadable EDs in the CCP4 format for each entry. These ready-made maps make electron density visualization possible not only in VHELIBS but also in PyMOL (for which a plugin is available via the PDB_REDO website).
To assess how much of the previously observed model improvement in PDB_REDO is applicable to ligands and their binding pocket, we implemented two new ligand validation routines in the PDB_REDO pipeline: (1) EDSTATS calculates the fit of the ligand with the ED; and (2) YASARA calculates the heat of formation of the ligand (which is used as a measure of geometric quality) and the interactions of the ligand with its binding pocket. The interactions measured in YASARA include the number of atomic clashes (bumps), the number and total energy of hydrogen bonds, and the number and strength of hydrophobic contacts, π-π interactions, and cation-π interactions. The strengths of hydrophobic contacts, π-π interactions, and cation-π interactions are based on knowledge-based potentials in which each individual interaction has a score between 0 and 1.
Results and discussion
Table: Average validation scores for ligands in PDB and PDB_REDO. Columns: Validation score; PDB average; PDB_REDO average. Rows: Heat of formation (kJ/mol); Hydrogen bonding energy (kJ/mol); Hydrophobic contact strength; π-π interaction strength; Cation-π interaction strength; Number of atomic clashes.
Analysis of all binding sites present in both PDB and PDB_REDO
Analysis of all ligands present in both PDB and PDB_REDO
Number of complexes classified as Good, Bad or Dubious after applying VHELIBS to 75 ligand/DPP-IV binding site complexes using the Default (PDB_REDO) profile
VHELIBS can be used to choose structures to use for a protein-ligand docking: with VHELIBS, the user can choose the structures with the best-modeled binding sites.
VHELIBS can be used to choose structures where both the binding site and the ligand are well modeled, in order to validate the performance of different protein-ligand docking programs. This could make it possible to obtain a new gold standard for protein/ligand complexes that could be used for the validation of docking software and that could be significantly larger and more diverse than those currently being used (i.e., the Astex Diverse Set and the Iridium set).
VHELIBS can be used to choose structures where both the binding site and the ligand are well modeled to obtain reliable structure-based pharmacophores that select the relevant target bioactivity-modulating intermolecular interactions. This is important in drug-discovery workflows for finding new molecules with similar activity to the co-crystallized ligand.
VHELIBS can be used to obtain well-modeled ligand coordinates in order to evaluate the performance of 3D conformation-generator software that claims to be able to generate bioactive conformations.
VHELIBS allows the user to easily check the fit of models to the ED for binding sites and ligands without additional scripting or console commands for each structure. Moreover, our study allows us to conclude that in general, binding site and ligand coordinates derived from PDB_REDO structures are more reliable than those obtained directly from the PDB and therefore highlights the contribution of the PDB_REDO database to the drug-discovery and development community.
Availability and requirements
Project name: VHELIBS (Validation Helper for Ligands and Binding Sites).
Project home page: http://urvnutrigenomica-ctns.github.com/VHELIBS/
Operating System(s): Platform independent.
Programming language: Python, Java.
Other requirements: Java 6.0 or newer, internet connection.
License: GNU AGPL v3.
Any restrictions to use by non-academics: None other than those specified by the license (same as for academics).
Abbreviations
PDB: Protein data bank
GUI: Graphical user interface
RSR: Real space residual
RSCC: Real space correlation coefficient
DPP-IV: Dipeptidyl peptidase 4.
This manuscript has been edited by American Journal Experts.
We acknowledge support from the Generalitat de Catalunya through grant XRQTC.
We also acknowledge Professor Robert Hanson from the St. Olaf College for his support with questions regarding Jmol, Ed Pozharski for writing the initial PyMOL plugin and Anastassis Perrakis for helpful discussion during the manuscript preparation.
This work was supported by the Ministerio de Educación y Ciencia of the Spanish Government [AGL2008-00387 and AGL2011-25831], the ACC1Ó program from the Generalitat de Catalunya [TECRD12-1-0005], and Veni grant 722.011.011 from the Netherlands Organization for Scientific Research (NWO).
- Anfinsen CB: Principles that govern the folding of protein chains. Science (New York, NY). 1973, 181: 223-230. 10.1126/science.181.4096.223.
- Bradley P, Misura KMS, Baker D: Toward high-resolution de novo structure prediction for small proteins. Science (New York, NY). 2005, 309: 1868-1871. 10.1126/science.1113801.
- Rhodes G, Cooper J: Model and molecule. Crystallography Made Crystal Clear: A Guide for Users of Macromolecular Models. 2006, Academic, 1-5.
- Berman H, Henrick K, Nakamura H: Announcing the worldwide protein data bank. Nat Struct Biol. 2003, 10: 980. 10.1038/nsb1203-980.
- Dauter Z, Weiss MS, Einspahr H, Baker EN: Expectation bias and information content. Acta Crystallogr Sect D Struct Biol Cryst. 2013, 69: 141-141.
- Brändén C-I, Alwyn Jones T: Between objectivity and subjectivity. Nature. 1990, 343: 687-689. 10.1038/343687a0.
- Jones TA, Zou JY, Cowan SW, Kjeldgaard M: Improved methods for building protein models in electron density maps and the location of errors in these models. Acta Crystallogr Sect A Found Cryst. 1991, 47: 110-119. 10.1107/S0108767390010224.
- Read RJ, Adams PD, Arendall WB, Brunger AT, Emsley P, Joosten RP, Kleywegt GJ, Krissinel EB, Lütteke T, Otwinowski Z, Perrakis A, Richardson JS, Sheffler WH, Smith JL, Tickle IJ, Vriend G, Zwart PH: A new generation of crystallographic validation tools for the protein data bank. Structure (London, England: 1993). 2011, 19: 1395-1412. 10.1016/j.str.2011.08.006.
- Gore S, Velankar S, Kleywegt GJ: Implementing an x-ray validation pipeline for the protein data bank. Acta Crystallogr Sect D Biol Cryst. 2012, 68: 478-483. 10.1107/S0907444911050359.
- Richardson JS, Richardson DC: Studying and polishing the PDB’s macromolecules. Biopolymers. 2013, 99: 170-182. 10.1002/bip.22108.
- Pozharski E, Weichenberger CX, Rupp B: Techniques, tools and best practices for ligand electron-density analysis and results from their application to deposited crystal structures. Acta Crystallogr Sect D Biol Cryst. 2013, 69: 150-67.
- Weichenberger CX, Pozharski E, Rupp B: Visualizing ligand molecules in twilight electron density. Acta Crystallogr Sect F Struct Biol Cryst Commun. 2013, 69: 195-200.
- Berman HM: The protein data bank. Nucleic Acids Res. 2000, 28: 235-242. 10.1093/nar/28.1.235.
- Joosten RP, Vriend G: PDB improvement starts with data deposition. Science (New York, NY). 2007, 317: 195-196. 10.1126/science.317.5835.195.
- Joosten RP, Joosten K, Cohen SX, Vriend G, Perrakis A: Automatic rebuilding and optimization of crystallographic structures in the protein data bank. Bioinformatics (Oxford, England). 2011, 27: 3392-3398. 10.1093/bioinformatics/btr590.
- Joosten RP, Joosten K, Murshudov GN, Perrakis A: PDB_REDO: constructive validation, more than just looking for errors. Acta Crystallographica Section D. 2012, 68: 484-496. 10.1107/S0108767312019034.
- The Jython Project. http://www.jython.org/
- Hanson RM: Jmol – a paradigm shift in crystallographic visualization. J Appl Cryst. 2010, 43: 1250-1260. 10.1107/S0021889810030256.
- Kleywegt GJ, Harris MR, Zou JY, Taylor TC, Wählby A, Jones TA: The Uppsala electron-density server. Acta Crystallogr Sect D Biol Cryst. 2004, 60: 2240-2249. 10.1107/S0907444904013253.
- EDS - Uppsala Electron Density Server. http://eds.bmc.uu.se/eds/
- Magrane M: UniProt Knowledgebase: a hub of integrated protein data. Database J Biol Databases Curat. 2011, 2011: bar009.
- Yang J, Roy A, Zhang Y: BioLiP: a semi-manually curated database for biologically relevant ligand-protein interactions. Nucleic Acids Res. 2013, 41: D1096-103. 10.1093/nar/gks966.
- Edmondson SD, Mastracchio A, Mathvink RJ, He J, Harper B, Park Y-J, Beconi M, Di Salvo J, Eiermann GJ, He H, Leiting B, Leone JF, Levorse DA, Lyons K, Patel RA, Patel SB, Petrov A, Scapin G, Shang J, Roy RS, Smith A, Wu JK, Xu S, Zhu B, Thornberry NA, Weber AE: (2S,3S)-3-Amino-4-(3,3-difluoropyrrolidin-1-yl)-N, N-dimethyl-4-oxo-2-(4-[1,2,4]triazolo[1,5-a]-pyridin-6-ylphenyl)butanamide: a selective alpha-amino amide dipeptidyl peptidase IV inhibitor for the treatment of type 2 diabetes. J Med Chem. 2006, 49: 3614-27. 10.1021/jm060015t.
- RCSB Protein Data Bank - RCSB PDB - 3Q8W Structure Summary. http://www.rcsb.org/pdb/explore/explore.do?structureId=3Q8W
- VHELIBS Online Documentation. https://github.com/URVnutrigenomica-CTNS/VHELIBS/wiki
- Kleywegt GJ, Jones TA: xdlMAPMAN and xdlDATAMAN - programs for reformatting, analysis and manipulation of biomacromolecular electron-density maps and reflection data sets. Acta Crystallogr Sect D Biol Cryst. 1996, 52: 826-828. 10.1107/S0907444995014983.
- Tickle IJ: Statistical quality indicators for electron-density maps. Acta Crystallogr Sect D Biol Cryst. 2012, 68: 454-467. 10.1107/S0907444911035918.
- Warren GL, Do TD, Kelley BP, Nicholls A, Warren SD: Essential considerations for using protein-ligand structures in drug discovery. Drug Discov Today. 2012, 17: 1270-1281. 10.1016/j.drudis.2012.06.011.
- UniProtKB. http://www.uniprot.org/help/uniprotkb
- Winn MD, Ballard CC, Cowtan KD, Dodson EJ, Emsley P, Evans PR, Keegan RM, Krissinel EB, Leslie AGW, McCoy A, McNicholas SJ, Murshudov GN, Pannu NS, Potterton EA, Powell HR, Read RJ, Vagin A, Wilson KS: Overview of the CCP4 suite and current developments. Acta Crystallogr Sect D Biol Cryst. 2011, 67: 235-42. 10.1107/S0907444910045749.
- Schrödinger L: The PyMOL Molecular Graphics System. 2010.
- Krieger E, Koraimann G, Vriend G: Increasing the precision of comparative models with YASARA NOVA–a self-parameterizing force field. Proteins. 2002, 47: 393-402. 10.1002/prot.10104.
- Krieger E, Joo K, Lee J, Lee J, Raman S, Thompson J, Tyka M, Baker D, Karplus K: Improving physical realism, stereochemistry, and side-chain accuracy in homology modeling: Four approaches that performed well in CASP8. Proteins. 2009, 77 (Suppl 9): 114-22.
- Vos S, Parry RJ, Burns MR, De Jersey J, Martin JL: Structures of free and complexed forms of Escherichia coli xanthine-guanine phosphoribosyltransferase. J Mol Biol. 1998, 282: 875-89. 10.1006/jmbi.1998.2051.
- Hartshorn MJ, Verdonk ML, Chessari G, Brewerton SC, Mooij WTM, Mortenson PN, Murray CW: Diverse, high-quality test set for the validation of protein-ligand docking performance. J Med Chem. 2007, 50: 726-41. 10.1021/jm061277y.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.