Similarity maps - a visualization strategy for molecular fingerprints and machine-learning methods
© Riniker and Landrum; licensee Chemistry Central Ltd. 2013
Received: 23 May 2013
Accepted: 23 July 2013
Published: 24 September 2013
Fingerprint similarity is a common method for comparing chemical structures. Similarity is an appealing approach because, with many fingerprint types, it provides intuitive results: a chemist looking at two molecules can understand why they have been determined to be similar. This transparency is partially lost with the fuzzier similarity methods that are often used for scaffold hopping and tends to vanish completely when molecular fingerprints are used as inputs to machine-learning (ML) models. Here we present similarity maps, a straightforward and general strategy to visualize the atomic contributions to the similarity between two molecules or the predicted probability of an ML model. We show the application of similarity maps to a set of dopamine D3 receptor ligands using atom-pair and circular fingerprints as well as two popular ML methods: random forests and naïve Bayes. An open-source implementation of the method is provided.
Keywords: Visualization, Machine-learning, Similarity, Fingerprints
Chemical structures are often represented by molecular fingerprints where structural features are converted to either bits in a bit vector or counts in a count vector. This abstract representation allows the computationally efficient handling and comparison of chemical structures. Using such fingerprints, the similarity between two molecules can be calculated in a straightforward manner with simple similarity metrics such as Tanimoto [1] and Dice [2]. However, depending on the descriptors used to generate the fingerprints, the interpretation of the resulting similarity may not be trivial. This problem worsens when machine-learning (ML) models are trained to predict the activity (or other properties) of new compounds: ML models often appear as complete "black boxes" that just output numeric predictions to their users. Though these predictions can be quite accurate, it has been shown that supplementing numeric predictions with additional information from the model can improve the ability of both expert and non-expert users to work with the results [3]. This provides substantial motivation for the development of strategies to visualize the parts of a molecule contributing to a similarity value or model prediction.
Few visualization approaches for such models are described in the literature. An early example is the visualization of a modal fingerprint [4, 5], which contains all bits that are present in 50-100% of the molecules of a training set. The atoms are colored based on the similarity to the modal fingerprint, i.e. how many of the bits set by the atom are present in the modal fingerprint. Franke et al. [6] visualized the importance of three-point pharmacophores (3PP) obtained from a trained support vector machine (SVM) model by placing differently sized spheres at the centre of the substructure leading to a 3PP. The importance of each 3PP was calculated based on the difference in the SVM prediction for a molecule when this 3PP is removed. The interpretation of linear SVM models was also the goal of the heat map coloring scheme developed by Rosenbaum et al. [7]. The SVM model was trained using ECFP fingerprints and the authors focussed solely on the coloring of bonds. The coloring was based on the weights obtained from the SVM model, where the final weight of a bond is the normalized sum of the weights of the fingerprint features containing this bond. The color scheme was chosen such that red corresponds to the negative class and green to the positive class with orange as zero. Another approach is the Glowing Molecule visualization, which has been used to show the regions of a molecule that may have the most influence on ADME and physicochemical properties [8, 9]. A red glow indicates that this region has a positive influence on the property (i.e. the property value increases) while a blue glow indicates a negative influence, with green representing no significant overall effect. Unfortunately, a detailed description of the algorithm used for the Glowing Molecule method was not provided and, since it is implemented as part of a commercial product, the method is not generally available.
Here, we present similarity maps, a general approach for the visualization of both fingerprint similarities between two molecules and machine-learning (ML) model predictions. In our scheme, the "weight" of an atom is the similarity or predicted-probability difference obtained when the bits in the fingerprint corresponding to the atom are removed, similar to the approach of Franke et al. [6]. The normalized weights are then used to color the atoms in a topography-like map, with green indicating a positive difference (i.e. the similarity or probability decreases when the bits are removed), pink indicating a negative difference, and gray representing no change. The visualization is demonstrated for atom pairs and several types of circular fingerprints and subsequently used to explain the factors leading to the predicted probability of a random forest and a naïve Bayes model. All source code and data required to reproduce the examples are provided in Additional file 1.
A "weight" is determined for each atom of the test molecule by removing the bits which are set by the atom in the fingerprint of the test molecule, recalculating the similarity between the modified fingerprint and the fingerprint of the reference compound, $s_{\mathrm{mod}}$, and calculating the difference to the original similarity, $\Delta s = s_{\mathrm{orig}} - s_{\mathrm{mod}}$. The fingerprints are calculated using the open-source cheminformatics toolkit RDKit [10]. Dice [2] similarity is used in the current implementation but any other similarity metric could be employed. For atom pairs (AP, a count vector), the bits of an atom i are straightforward to determine: the count for each pair involving atom i is decreased by one. In circular fingerprints, on the other hand, bits are set for different atomic environments, starting at radius 0 up to the maximum radius. In RDKit, the environment (i.e. centre atom and radius) associated with each bit in a fingerprint can be obtained when generating the fingerprint. This information is used to determine all the bits where the atom is part of the environment.
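The weight calculation for a count-vector fingerprint can be sketched as follows. This is a minimal, pure-Python illustration under stated assumptions, not the RDKit-based implementation: fingerprints are plain `Counter` objects, the function names are hypothetical, and the mapping `features_by_atom` (which features each atom contributes to) stands in for the information a toolkit would supply.

```python
from collections import Counter


def dice_similarity(fp_a, fp_b):
    """Dice similarity for count vectors: 2 * shared counts / total counts."""
    total = sum(fp_a.values()) + sum(fp_b.values())
    if total == 0:
        return 0.0
    common = sum(min(count, fp_b[key]) for key, count in fp_a.items())
    return 2.0 * common / total


def atom_weights(ref_fp, test_fp, features_by_atom):
    """Return delta_s = s_orig - s_mod for every atom of the test molecule.

    features_by_atom (hypothetical input) maps an atom index to the list of
    features, one entry per occurrence, that the atom contributes to test_fp.
    """
    s_orig = dice_similarity(ref_fp, test_fp)
    weights = {}
    for atom, features in features_by_atom.items():
        modified = Counter(test_fp)
        modified.subtract(Counter(features))  # decrease counts set by this atom
        modified = +modified                  # drop zero or negative counts
        weights[atom] = s_orig - dice_similarity(ref_fp, modified)
    return weights


# Toy data: feature "A" is shared with the reference, feature "C" is not.
ref_fp = Counter({"A": 2, "B": 1})
test_fp = Counter({"A": 2, "C": 1})
features_by_atom = {0: ["A"], 1: ["C"]}
weights = atom_weights(ref_fp, test_fp, features_by_atom)
```

In this toy case, atom 0 receives a positive weight (removing its shared feature lowers the similarity) and atom 1 a negative weight (removing its unshared feature raises it).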
In the case of naïve Bayes (NB), the difference between the logarithmic probabilities is used. The ML models were built using the open-source toolkit scikit-learn [11].
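The same removal idea can be applied to a model's predicted probability. The sketch below is generic and does not call scikit-learn: `predict_active_proba` is a hypothetical callable returning the probability of the active class for a set of on bits, `bits_by_atom` is an assumed input, and the `use_log` switch corresponds to using logarithmic probabilities as stated for NB.

```python
import math


def model_atom_weights(predict_active_proba, on_bits, bits_by_atom, use_log=False):
    """Weight of each atom = score(original bits) - score(bits without the atom)."""

    def score(bits):
        p = predict_active_proba(bits)
        return math.log(p) if use_log else p

    original = score(on_bits)
    return {
        atom: original - score(on_bits - set(atom_bits))
        for atom, atom_bits in bits_by_atom.items()
    }


def toy_model(bits):
    """Hypothetical stand-in for a trained classifier (not a real model)."""
    p = 0.5
    if 7 in bits:
        p += 0.3  # bit 7 favours the active class
    if 3 in bits:
        p -= 0.2  # bit 3 favours the inactive class
    return p


on_bits = frozenset({3, 7, 9})
bits_by_atom = {0: [7], 1: [3], 2: [9]}
weights = model_atom_weights(toy_model, on_bits, bits_by_atom)
log_weights = model_atom_weights(toy_model, on_bits, bits_by_atom, use_log=True)
```

Atoms whose bits raise the predicted probability obtain positive weights, atoms whose bits lower it obtain negative weights, and atoms whose bits the model ignores obtain a weight of zero.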
To construct a similarity map, the atom weights are normalized by dividing by the maximum absolute weight value and then used to calculate bivariate Gaussian distributions centered at the corresponding atom positions. The atom weights influence only the peak and not the variance of the Gaussian distribution. The RDKit function for this makes use of the Python library matplotlib [12]. The similarity map is then generated by superimposing the atom coordinates with the Gaussian distributions and the contours using a matplotlib figure.
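The normalization and superposition steps can be illustrated without matplotlib. The following sketch is a simplification under stated assumptions rather than the RDKit function: it uses isotropic Gaussians with a fixed, arbitrarily chosen `sigma`, unit-height peaks scaled by the normalized weight, and hypothetical function names. It returns grid values that a plotting library could then contour.

```python
import math


def normalize_weights(weights):
    """Divide all weights by the maximum absolute weight."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return [0.0 for _ in weights]
    return [w / max_abs for w in weights]


def similarity_map_grid(positions, weights, sigma=0.3, size=50, bounds=(-1.0, 1.0)):
    """Sum one Gaussian per atom on a square grid.

    The weight scales only the peak height; sigma (the width) is the same
    for every atom. sigma, size and bounds are illustrative defaults.
    """
    lo, hi = bounds
    step = (hi - lo) / (size - 1)
    norm = normalize_weights(weights)
    grid = []
    for iy in range(size):
        y = lo + iy * step
        row = []
        for ix in range(size):
            x = lo + ix * step
            value = 0.0
            for (ax, ay), w in zip(positions, norm):
                d2 = (x - ax) ** 2 + (y - ay) ** 2
                value += w * math.exp(-d2 / (2.0 * sigma ** 2))
            row.append(value)
        grid.append(row)
    return grid


# Two atoms on the x axis with opposite-sign weights, on a coarse 5 x 5 grid.
positions = [(-0.5, 0.0), (0.5, 0.0)]
grid = similarity_map_grid(positions, [0.2, -0.1], size=5)
```

The grid is positive around the first atom and negative around the second, which corresponds to the green and pink regions of a similarity map.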
Results and discussion
Dice similarities and maximum weights
The features in the reference compound are aromatic rings, two acceptors and two basic acceptors. These features are marked green in the right panels in Figure 3 for both molecules. Removing the aromatic acceptors or the donor in the molecules, on the other hand, increased the similarity to the reference compound. Interestingly, one carbon of the piperazine moiety in molecule 3 is highlighted pink using CountMorgan2 (and to a lesser extent using Morgan2) whereas it is green using FeatMorgan2. For (Count)Morgan2, the atom type of this carbon differs from the atom types of the other carbons because its number of heavy-atom neighbours and its number of hydrogens are different. Using features (donor, acceptor, aromatic, basic, acidic, no-feature), however, the numbers of neighbours and hydrogens are not considered; thus the feature type (i.e. no-feature) is the same for all carbons in the piperazine.
Two kinds of machine-learning (ML) methods, random forest (RF) and naïve Bayes (NB), were trained and used to predict the probability that new molecules are active. The reference compound and the other active molecules (activity smaller than 10 μM) from Ref. (Figure S1 in Additional file 2) were used together with a randomly selected 10% of the 10,000 ChEMBL decoys used in a recent benchmarking study [22] to train the ML models. Morgan2 was used as the standard fingerprint. The following optimal parameters of the random forests were determined through a grid search: number of trees ($N_T$) = 100, maximum depth = 2, minimum samples to split = 2 and minimum samples per leaf = 1. To avoid the problems caused by imbalance in the training set (i.e. many more inactives than actives) for RFs, the balanced random forest algorithm [26] was applied: for each decision tree the majority class is down-sampled to yield an equal number of instances as the minority class. The naïve Bayes classifier was trained using an additive Laplace smoothing parameter of 1.0 and learned class prior probabilities.
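The per-tree down-sampling can be sketched independently of any ML library. In this illustrative snippet the function name and the seeding are assumptions, and drawing the majority class without replacement is also an assumption: the text states only that the majority class is down-sampled to the size of the minority class.

```python
import random


def balanced_samples(labels, n_trees, seed=0):
    """Return, for each tree, training indices with equal class sizes."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    minority_size = min(len(indices) for indices in by_class.values())

    samples = []
    for _ in range(n_trees):
        tree_sample = []
        for indices in by_class.values():
            if len(indices) > minority_size:
                # Majority class: draw a fresh random subset for every tree.
                tree_sample.extend(rng.sample(indices, minority_size))
            else:
                # Minority class: keep every instance.
                tree_sample.extend(indices)
        samples.append(sorted(tree_sample))
    return samples


# Toy labels: 5 actives (1) and 100 inactives (0).
labels = [1] * 5 + [0] * 100
samples = balanced_samples(labels, n_trees=3)
```

Each index list could then be used to fit one decision tree. In scikit-learn's `RandomForestClassifier`, the grid-search values quoted above would correspond to the `n_estimators`, `max_depth`, `min_samples_split` and `min_samples_leaf` parameters.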
Similar findings were obtained for the NB model (right panels in Figure 5). Again, the piperazine moiety was found to be most important.
Similarity maps are an easy and general strategy for the visualization of the atomic origins of fingerprint similarity between molecules. The "atomic weights" are generated by removing the bits belonging to the corresponding atom and comparing the resulting similarity with the similarity of the unmodified fingerprint. Similarity maps can be generated for every fingerprint that allows a backtracking of the bits to a corresponding atom or substructure. The methodology can be extended to machine-learning (ML) models to visualize the atomic contributions to the predicted probability of the ML model. This is especially useful as ML models often appear as black boxes. In future work, we will investigate the application of the visualization strategy to descriptor-based models for physicochemical-property prediction.
Availability and requirements
S. R. thanks the Novartis Institutes for BioMedical Research education office for a Presidential Postdoctoral Fellowship. The authors thank Nikolas Fechner for the helpful discussions.
1. Rogers D, Tanimoto TT: A computer program for classifying plants. Science. 1960, 132: 1115-1118. 10.1126/science.132.3434.1115.
2. Dice LR: Measures of the amount of ecological association between species. Ecology. 1945, 26: 297-302. 10.2307/1932409.
3. Hansen K, Baehrens D, Schroeter T, Rupp M, Müller KR: Visual interpretation of kernel-based prediction models. Mol Inf. 2011, 30: 817-826. 10.1002/minf.201100059.
4. Shemetulskis NE, Weininger D, Blankley CJ, Yang JJ, Humblet C: Stigmata: an algorithm to determine structural commonalities in diverse datasets. J Chem Inf Comput Sci. 1996, 36: 862-871. 10.1021/ci950169+.
5. Wild DJ, Blankley CJ: VisualiSAR: a web-based application for clustering, structure browsing, and structure-activity relationship study. J Mol Graph Model. 1999, 17: 85-89. 10.1016/S1093-3263(99)00026-1.
6. Franke L, Byvatov E, Werz O, Steinhilber D, Schneider P, Schneider G: Extraction and visualization of potential pharmacophore points using support vector machines: application to ligand-based virtual screening for COX-2 inhibitors. J Med Chem. 2005, 48: 6997-7004. 10.1021/jm050619h.
7. Rosenbaum L, Hinselmann G, Jahn A, Zell A: Interpreting linear support vector machine models with heat map molecule coloring. J Cheminf. 2011, 3: 11-22. 10.1186/1758-2946-3-11.
8. Segall M, Champness E, Obrezanova O, Leeding C: Beyond profiling: using ADMET models to guide decisions. Chem Biodivers. 2009, 6: 2144-2151. 10.1002/cbdv.200900148.
9. Glowing Molecule visualization tool by Optibrium. [http://www.optibrium.com/community/faq/glowing-molecule].
10. RDKit: Cheminformatics and Machine Learning Software 2013. [http://www.rdkit.org].
11. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E: Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011, 12: 2825-2830.
12. Hunter JD: Matplotlib: a 2D graphics environment. Comput Sci Eng. 2007, 9: 90-95.
13. Shi L, Javitch JA: The binding site of aminergic G protein-coupled receptors: the transmembrane segments and second extracellular loop. Annu Rev Pharmacol Toxicol. 2002, 42: 437-467. 10.1146/annurev.pharmtox.42.091101.144224.
14. Chien EY, Liu W, Zhao Q, Katritch V, Han GW, Hanson MA, Shi L, Newman AH, Javitch JA, Cherezov V, Stevens RC: Structure of the human dopamine D3 receptor in complex with a D2/D3 selective antagonist. Science. 2010, 330: 1091-1095. 10.1126/science.1197410.
15. Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B, Overington JP: ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012, 40: D1100-D1107.
16. ChEMBL: European Bioinformatics Institute (EBI), version 14. Cambridge, UK. 2012, [http://www.ebi.ac.uk/chembl/].
17. Banala AK, Levy BA, Khatri SS, Furman CA, Roof RA, Mishra Y, Griffin SA, Sibley DR, Luedtke RR, Newman AH: N-(3-Fluoro-4-(4-(2-methoxy or 2,3-dichlorophenyl)piperazine-1-yl)butyl)arylcarboxamides as selective dopamine D3 receptor ligands: critical role of the carboxamide linker for D3 receptor selectivity. J Med Chem. 2011, 54: 3581-3594. 10.1021/jm200288r.
18. Leopoldo M, Lacivita E, Giorgio PD, Colabufo NA, Niso M, Berardi F, Perrone R: Design, synthesis, and binding affinities of potential positron emission tomography (PET) ligands for visualization of brain dopamine D3 receptors. J Med Chem. 2006, 49: 358-365. 10.1021/jm050734s.
19. Sasse BC, Mach UR, Leppaenen J, Calmels T, Stark H: Hybrid approach for the design of highly affine and selective dopamine D3 receptor ligands using privileged scaffolds of biogenic amine GPCR ligands. Bioorg Med Chem. 2007, 15: 7258-7273. 10.1016/j.bmc.2007.08.034.
20. Carhart RE, Smith DH, Venkataraghavan R: Atom pairs as molecular features in structure-activity studies: definition and applications. J Chem Inf Comput Sci. 1985, 25: 64-73. 10.1021/ci00046a002.
21. Rogers D, Hahn M: Extended-connectivity fingerprints. J Chem Inf Model. 2010, 50: 742-754. 10.1021/ci100050t.
22. Riniker S, Landrum G: Open source platform to benchmark fingerprints for ligand-based virtual screening. J Cheminf. 2013, 5: 26. 10.1186/1758-2946-5-26.
23. Landrum G, Lewis R, Palmer A, Stiefl N, Vulpetti A: Making sure there’s a "give" associated with the "take": producing and using open-source software in big pharma. J Cheminf. 2011, 3 (Suppl 1): O3. 10.1186/1758-2946-3-S1-O3.
24. Gobbi A, Poppinger D: Genetic optimization of combinatorial libraries. Biotech Bioeng. 1998, 61: 47-54. 10.1002/(SICI)1097-0290(199824)61:1<47::AID-BIT9>3.0.CO;2-Z.
25. Sastry M, Lowrie JF, Dixon SL, Sherman W: Large-scale systematic analysis of 2D fingerprint methods and parameters to improve virtual screening enrichments. J Chem Inf Model. 2010, 50: 771-784. 10.1021/ci100062n.
26. Chen C, Liaw A, Breiman L: Using Random Forest to Learn Imbalanced Data. 2004, Berkeley: University of California.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.