The Molecule Cloud - compact visualization of large collections of molecules
© Ertl and Rohde; licensee Chemistry Central Ltd. 2012
Received: 14 May 2012
Accepted: 6 July 2012
Published: 6 July 2012
Analysis and visualization of large collections of molecules is one of the most frequent challenges facing cheminformatics experts in the pharmaceutical industry. Various sophisticated methods are available to perform this task, including clustering, dimensionality reduction and scaffold frequency analysis. In any case, however, viewing and analyzing large tables of molecular structures remains necessary. We present a new visualization technique that provides basic information about the composition of molecular data sets at a single glance.
We present a method for visually representing the most common structural features of chemical databases in the form of a cloud diagram. The frequency of molecules containing a particular substructure is indicated by the size of the respective structural image. The method is useful for quickly perceiving the most prominent structural features present in the data set. The approach was inspired by the popular word cloud diagrams that are used to visualize textual information in a compact form, and we therefore call it “Molecule Cloud”. The method also supports visualization of additional information by color coding, for example the biological activity of molecules containing a given scaffold or the protein target class typical for particular scaffolds. A detailed description of the algorithm is provided, allowing easy implementation of the method with any cheminformatics toolkit. The layout algorithm is available as open source Java code.
Visualization of large molecular data sets using the Molecule Cloud approach allows scientists to easily obtain information about the composition of molecular databases and their most frequent structural features. The method may be used in areas where analysis of large molecular collections is needed, for example the processing of high-throughput screening results, virtual screening or compound purchasing. Several example Molecule Cloud visualizations of large data sets, including the PubChem, ChEMBL and ZINC databases, are provided.
One of the most typical tasks that a cheminformatics expert in the pharmaceutical or agrochemical industry performs practically daily is analyzing and visualizing large collections of molecules. Typical areas where this is needed are the analysis of the company compound archive and its enhancement by purchasing additional molecules from commercial compound providers, analysis of high-throughput screening results, design of combinatorial libraries, chemogenomics analysis of bioactivity data and many others. Researchers in academia face similar challenges when they need to process and visualize large molecular databases that have become freely available in the last few years or were even generated in silico. A number of methods to analyze and visualize large collections of molecules exist. Among the most commonly used are various clustering techniques, dimensionality reduction approaches and methods based on substructure analysis, particularly the analysis of molecule scaffolds. But even when such sophisticated methods are applied, in the end it is still necessary to visually inspect large tables of molecular structure diagrams.
The word cloud diagrams inspired us to develop a method for compact visualization of large molecular data sets. The method is based on the same principle, but molecular structures are displayed in the graph instead of words. The size of each structure corresponds to its frequency in the particular data set. The largest structures catch the eye of the observer first, and therefore a single look at the diagram provides basic, condensed information about the content of the parent data set. By analogy with the “word cloud”, we call this method “Molecule Cloud”.
Results and discussion
Molecule Cloud methodology
The principle of the Molecule Cloud method is very simple. The most common substructure features present in the analyzed data set are identified and then displayed in such a way that their size corresponds to their frequency. The most common substructures are rendered the largest and therefore immediately catch the eye of an observer. Molecules are represented by their scaffolds, i.e. the cores remaining when all non-ring substituents are removed. The concept of the scaffold as the central part of a molecule is one of the basic concepts of medicinal chemistry, and scaffolds play an important role in several drug discovery techniques such as combinatorial chemistry and scaffold hopping. Molecules without any rings are represented by their major chain, i.e. the longest chain containing the largest number of heteroatoms. Reduction of molecules to scaffolds simplifies the analysis considerably. For example, 35 million molecules from the PubChem database are represented by about 3.9 million scaffolds (50% of which are singletons, present only once in the database) and about 150 thousand chains.
Once the most common scaffolds and chains are identified, they need to be visualized with their size scaled according to their frequency in the parent database. It is well known that the frequency of various substructure features such as scaffolds, substituents or linkers in molecular databases follows a power law (the so-called “long tail” distribution). This means that only a few scaffolds in a database are very common, while there are many rare scaffolds, including a large number of singletons. Before scaffold frequencies are used as a scaling factor, they therefore need to be transformed to a logarithmic scale. Benzene is a special case: in practically all large data sets it is clearly the most frequent scaffold. Even after logarithmic transformation of the frequencies, benzene would be disproportionately large and would not contribute any useful information, so in many cases it is advisable not to display it. Removal of benzene is similar to the removal of common stop words in classical text clouds.
In our experience the optimal number of substructures to display in a Molecule Cloud is between 100 and 250. Such a set usually contains 30 to 50 large, easily recognizable structures; the rest are smaller structures that visually fill the image. Of course, when the graph is displayed in a larger area (for example as a poster), the number of structures that can be displayed is proportionally larger.
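The logarithmic scaling described above can be sketched in Java as follows. This is a minimal illustration and not the authors' code: the class name LogFrequencyScaling, the method scaleFactors, and the linear interpolation between the smallest and the largest log-count are assumptions made here, since the text states only that frequencies are log-transformed before being used as a scaling factor.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only: maps scaffold counts to image scale factors on a log scale. */
final class LogFrequencyScaling {

    /**
     * Returns a scale factor in [minScale, maxScale] for each scaffold.
     * Interpolating linearly between log(min count) and log(max count)
     * is an assumption of this sketch, not a formula taken from the paper.
     */
    static Map<String, Double> scaleFactors(Map<String, Long> counts,
                                            double minScale, double maxScale) {
        Map<String, Double> result = new LinkedHashMap<>();
        if (counts.isEmpty()) {
            return result;
        }
        double logMin = Double.POSITIVE_INFINITY;
        double logMax = Double.NEGATIVE_INFINITY;
        for (long c : counts.values()) {
            if (c < 1) {
                throw new IllegalArgumentException("counts must be >= 1");
            }
            double l = Math.log(c);
            logMin = Math.min(logMin, l);
            logMax = Math.max(logMax, l);
        }
        double range = logMax - logMin;
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            // If all counts are equal, every structure gets the maximum scale.
            double t = (range == 0.0) ? 1.0 : (Math.log(e.getValue()) - logMin) / range;
            result.put(e.getKey(), minScale + t * (maxScale - minScale));
        }
        return result;
    }
}
```

A caller wishing to omit benzene, as recommended above, could remove its entry (SMILES c1ccccc1) from the map before scaling, so that the largest remaining scaffold defines the upper end of the scale.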
In most cases this initial layout already provides quite a good placement of molecules. To improve it further, a second layout step is performed, namely an iterative optimization loop. In this step the molecules are moved slightly, one by one, in the direction that improves the total “overlap score”. Convergence is fast, and the final layout is reached after a few seconds. During the optimization, slight repulsive forces are also placed in the corners of the drawing area to produce an aesthetically more pleasing “oval” display instead of completely filling the available image rectangle.
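The optimization step can be illustrated by the following Java sketch. It rests on several assumptions that are not taken from the text: the “overlap score” is taken to be the sum of pairwise intersection areas of the bounding rectangles, each rectangle is tried in four axis-aligned directions with a fixed step, and the corner repulsion is omitted for brevity. The names OverlapRelaxation, Box and relax are hypothetical.

```java
import java.util.List;

/** Illustrative sketch only: greedy reduction of rectangle overlap by small moves. */
final class OverlapRelaxation {

    /** Axis-aligned box standing in for one rendered structure. */
    static final class Box {
        double x, y;          // top-left corner, changed during relaxation
        final double w, h;    // fixed size
        Box(double x, double y, double w, double h) {
            this.x = x; this.y = y; this.w = w; this.h = h;
        }
    }

    /** Intersection area of two boxes (0 if they do not overlap). */
    static double overlapArea(Box a, Box b) {
        double dx = Math.min(a.x + a.w, b.x + b.w) - Math.max(a.x, b.x);
        double dy = Math.min(a.y + a.h, b.y + b.h) - Math.max(a.y, b.y);
        return (dx > 0 && dy > 0) ? dx * dy : 0.0;
    }

    /** Overlap of one box with all other boxes. */
    static double overlapOf(Box b, List<Box> boxes) {
        double sum = 0.0;
        for (Box other : boxes) {
            if (other != b) sum += overlapArea(b, other);
        }
        return sum;
    }

    /** Assumed overlap score: sum of intersection areas over all pairs. */
    static double totalOverlap(List<Box> boxes) {
        double sum = 0.0;
        for (int i = 0; i < boxes.size(); i++) {
            for (int j = i + 1; j < boxes.size(); j++) {
                sum += overlapArea(boxes.get(i), boxes.get(j));
            }
        }
        return sum;
    }

    /**
     * Moves boxes one by one by at most `step` per sweep, keeping each inside
     * the width x height drawing area, until no move lowers the overlap or
     * maxSweeps is reached. Returns the number of sweeps performed.
     */
    static int relax(List<Box> boxes, double width, double height,
                     double step, int maxSweeps) {
        final double[][] dirs = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};
        int sweeps = 0;
        boolean moved = true;
        while (moved && sweeps < maxSweeps) {
            moved = false;
            sweeps++;
            for (Box b : boxes) {
                double best = overlapOf(b, boxes);
                if (best == 0.0) continue;            // already free of overlap
                final double x0 = b.x, y0 = b.y;
                double bestX = x0, bestY = y0;
                for (double[] d : dirs) {
                    double nx = x0 + d[0] * step, ny = y0 + d[1] * step;
                    if (nx < 0 || ny < 0 || nx + b.w > width || ny + b.h > height) continue;
                    b.x = nx; b.y = ny;               // trial position
                    double s = overlapOf(b, boxes);
                    if (s < best) { best = s; bestX = nx; bestY = ny; }
                }
                b.x = bestX; b.y = bestY;             // keep the best position found
                if (bestX != x0 || bestY != y0) moved = true;
            }
        }
        return sweeps;
    }
}
```

Because a move is accepted only when it strictly lowers the overlap of the moved box, the total score never increases. Like any greedy scheme, the sketch can stop in a local minimum, which is one reason a good initial placement matters.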
The Molecule Cloud layout algorithm has been implemented in Java. The layout itself does not require any “chemical intelligence”; it operates simply on the rectangles representing molecules. The required molecular processing capabilities, particularly the parsing of SMILES and molecule depiction, are defined by a Java interface and may be implemented using any cheminformatics toolkit. As input the program requires only a list of SMILES codes of the structures to display, together with their frequencies, and the desired size of the final image. We tested the algorithm using two cheminformatics engines, the depiction engine from Molinspiration and the recently released Novartis open source Avalon Cheminformatics Toolkit. The Java source code of the Molecule Cloud layout algorithm is available to interested parties from the corresponding author under the terms of the BSD license. The distribution also provides instructions on how to interface the program with the Avalon Cheminformatics Toolkit.
Molecule Cloud, a method for compact visualization of the typical substructures present in large collections of molecules, was described here. Molecule Cloud graphs make it possible to recognize, at a single look, the scaffolds and other substructure features that are typical for a particular data set. Additional information about the molecules may be included by using colors. Further enhancement of the generated images, for example by adding textual information or by making it possible to click on a particular scaffold to see the molecules containing it, should be easy with the help of HTML5 technology.
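The separation between layout and chemistry described above might look roughly as follows. The interface StructureDepictor, its two methods, the class CloudInput and the whitespace-separated "SMILES frequency" line format are all hypothetical; the actual interface and input format of the distributed code are not given in the text, so this is only a sketch of the general idea.

```java
/** Hypothetical toolkit-facing contract; names and signatures are assumptions of this sketch. */
interface StructureDepictor {
    /** Returns {width, height} of the depiction at unit scale, in arbitrary units. */
    double[] naturalSize(String smiles);

    /** Renders the structure into an image of the given pixel size, encoded as PNG bytes. */
    byte[] renderPng(String smiles, int widthPx, int heightPx);
}

/** One input record: a structure and how often it occurs in the data set. */
final class CloudInput {
    final String smiles;
    final long frequency;

    CloudInput(String smiles, long frequency) {
        this.smiles = smiles;
        this.frequency = frequency;
    }

    /** Parses one "SMILES<whitespace>frequency" line; this line format is assumed. */
    static CloudInput parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected 'SMILES frequency': " + line);
        }
        return new CloudInput(parts[0], Long.parseLong(parts[1]));
    }

    /**
     * Pixel size of the rectangle that the layout code would place for a
     * structure. The layout needs only this rectangle, not the chemistry.
     */
    static int[] boxSize(StructureDepictor depictor, String smiles, double scale) {
        double[] s = depictor.naturalSize(smiles);
        return new int[] {(int) Math.ceil(s[0] * scale), (int) Math.ceil(s[1] * scale)};
    }
}
```

With such a contract, switching from one depiction engine to another would mean supplying a different StructureDepictor implementation, while the rectangle-based layout code stays unchanged.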
- Martin E, Ertl P, Hunt P, Duca J, Lewis R: Gazing into the crystal ball; the future of computer-aided drug design. J Comp-Aided Mol Des. 2011, 26: 77-79.
- Langdon SR, Brown N, Blagg J: Scaffold diversity of exemplified medicinal chemistry space. J Chem Inf Model. 2011, 26: 2174-2185.
- Blum LC, Reymond J-C: 970 Million druglike small molecules for virtual screening in the chemical universe database GDB-13. J Am Chem Soc. 2009, 131: 8732-8733. 10.1021/ja902302h.
- Dubois J, Bourg S, Vrain C, Morin-Allory L: Collections of compounds - how to deal with them?. Cur Comp-Aided Drug Des. 2008, 4: 156-168. 10.2174/157340908785747410.
- Medina-Franco JL, Martinez-Mayorga K, Giulianotti MA, Houghten RA, Pinilla C: Visualization of the chemical space in drug discovery. Cur Comp-Aided Drug Des. 2008, 4: 322-333. 10.2174/157340908786786010.
- Schuffenhauer A, Ertl P, Roggo S, Wetzel S, Koch MA, Waldmann H: The Scaffold Tree - visualization of the scaffold universe by hierarchical scaffold classification. J Chem Inf Model. 2007, 47: 47-58. 10.1021/ci600338x.
- Langdon S, Ertl P, Brown N: Bioisosteric replacement and scaffold hopping in lead generation and optimization. Mol Inf. 2010, 29: 366-385. 10.1002/minf.201000019.
- Lipkus AH, Yuan Q, Lucas KA, Funk SA, Bartelt WF, Schenck RJ, Trippe AJ: Structural diversity of organic chemistry. A scaffold analysis of the CAS Registry. J Org Chem. 2008, 73: 4443-4451. 10.1021/jo8001276.
- mib 2010.10, Molinspiration Cheminformatics: http://www.molinspiration.com
- Bernhard R: Avalon Cheminformatics Toolkit. http://sourceforge.net/projects/avalontoolkit/
- Wang Y, Bolton E, Dracheva S, Karapetyan K, Shoemaker BA, Suzek TO, Wang J, Xiao J, Zhang J, Bryant SH: An overview of the PubChem BioAssay resource. Nucleic Acids Res. 2009, 38: D255-D266.
- Irwin JJ, Shoichet BK: ZINC − a free database of commercially available compounds for virtual screening. J Chem Inf Model. 2004, 45: 177-182.
- Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B, Overington JP: ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012, 40: D1100-D1107. 10.1093/nar/gkr777.
- Welsch ME, Snyder SA, Stockwell BR: Privileged scaffolds for library design and drug discovery. Curr Opin Chem Biol. 2010, 14: 347-361. 10.1016/j.cbpa.2010.02.018.
- Ertl P: Cheminformatics analysis of organic substituents: Identification of the most common substituents, calculation of substituent properties, and automatic identification of drug-like bioisosteric groups. J Chem Inf Comp Sci. 2003, 43: 374-380. 10.1021/ci0255782.
- TagCrowd: http://tagcrowd.com
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.