The BioDICE Taverna plugin for clustering and visualization of biological data: a workflow for molecular compounds exploration
- Antonino Fiannaca†1 (corresponding author),
- Massimo La Rosa†2,
- Giuseppe Di Fatta†3,
- Salvatore Gaglio†1,
- Riccardo Rizzo†1 and
- Alfonso Urso†1
© Fiannaca et al.; licensee Chemistry Central Ltd. 2014
Received: 30 January 2014
Accepted: 7 May 2014
Published: 13 May 2014
In many experimental pipelines, clustering of multidimensional biological datasets is used to detect hidden structures in unlabelled input data. Taverna is a popular workflow management system that is used to design and execute scientific workflows and aid in silico experimentation. The availability of fast unsupervised methods for clustering and visualization in the Taverna platform is important to support a data-driven scientific discovery in complex and explorative bioinformatics applications.
This work presents a Taverna plugin, the Biological Data Interactive Clustering Explorer (BioDICE), that performs clustering of high-dimensional biological data and provides a nonlinear, topology preserving projection for the visualization of the input data and their similarities. The core algorithm in the BioDICE plugin is Fast Learning Self Organizing Map (FLSOM), which is an improved variant of the Self Organizing Map (SOM) algorithm. The plugin generates an interactive 2D map that allows the visual exploration of multidimensional data and the identification of groups of similar objects. The effectiveness of the plugin is demonstrated on a case study related to chemical compounds.
The number and variety of available tools and its extensibility have made Taverna a popular choice for the development of scientific data workflows. This work presents a novel plugin, BioDICE, which adds a data-driven knowledge discovery component to Taverna. BioDICE provides an effective and powerful clustering tool, which can be adopted for the explorative analysis of biological datasets.
Keywords: Molecular compounds; Self organizing map; Clustering; Visualization; Taverna
The increasingly large amount of data related to DNA, proteins, molecular compounds, gene expression and other areas of the biological sciences raises the need for advanced analytical tools to support data-driven scientific discovery. Classification and clustering of high-dimensional data, for example, are very popular techniques for the analysis of large multidimensional biological datasets. Classification methods are based on a supervised learning approach, where the patterns of the training set belong to pre-defined classes. In contrast, clustering algorithms use an unsupervised learning approach to find groups of similar data objects with no pre-defined classes. Clustering is typically used as an explorative tool. However, the interpretation of the clustering outcomes is often difficult and not intuitive, especially for large datasets with complex topological structures. For this reason, a suitable visualization method is a desirable tool to complement clustering analysis. Several methods (e.g., [2, 3]) have been proposed to generate a visualization of the outcomes of classification and clustering algorithms. The Self-Organizing Map (SOM) is one of the best known unsupervised methods for data visualization and clustering [5, 6]. SOM is an artificial neural network that generates a lattice of neurons, typically organized in a 2D grid, where high-dimensional data objects are projected to a lower-dimensional space while preserving their topological relations. SOMs are often used to generate clusters of similar data objects, but they can also be used to create 2D maps to facilitate the visual inspection of the relations induced by the adopted similarity function. Moreover, new data objects can be projected onto a previously trained map in order to unveil similarities with other input data objects and to perform a classification over the learned clusters.
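Projecting a new data object onto a trained map amounts to finding its best matching unit (BMU), the neuron whose weight vector is closest to the object. The following is a minimal sketch, assuming Euclidean distance and a map stored as a nested list; the function name and the toy weights are illustrative and are not taken from the BioDICE code.

```python
import math

def best_matching_unit(weights, x):
    """Return the (row, col) of the neuron whose weight vector is closest to x.

    `weights` is a 2D grid (list of rows) of weight vectors.
    """
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = math.dist(w, x)  # Euclidean distance (assumed similarity function)
            if d < best_dist:
                best, best_dist = (r, c), d
    return best

# Toy 2x2 map with 2-dimensional weights (illustrative values only).
trained_map = [
    [[0.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 1.0]],
]
print(best_matching_unit(trained_map, [0.9, 0.2]))  # (1, 0)
```

A new object that lands on a neuron inside a known cluster can then be assigned to that cluster.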
In many experimental pipelines and workflows related to genomics, proteomics and other "omics" disciplines, the clustering and visualization of large multidimensional datasets is often used to identify and investigate similarities among data elements before further tests and analysis are carried out. The Taverna workbench is one of the most popular tools to manage scientific workflows, especially in the bioinformatics domain. Taverna provides a broad range of components, which can be executed by local processors or Web services. It includes components to integrate remote resources, online databases and external analysis tools into user-defined workflows and can be extended with additional components by means of a service plug-in architecture.
This work presents a novel Taverna plug-in, the Biological Data Interactive Clustering Explorer (BioDICE), that performs clustering and provides visualization of biological datasets. The core algorithm in BioDICE is Fast Learning SOM (FLSOM), an improved version of the classic SOM algorithm. FLSOM belongs to the category of the so-called Emergent Self Organizing Maps (ESOM). ESOMs have a number of neurons much larger than the number of input patterns to facilitate the discovery of emergent structures in the data. These structures are used to visualize multidimensional data objects and to identify clusters of similar objects by means of the U-Matrix visualization technique.
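To illustrate the idea behind the U-Matrix, the sketch below assigns each neuron a "height" equal to the average distance between its weight vector and those of its grid neighbours; high values mark boundaries between clusters and low values mark cluster interiors. This is a simplified variant under stated assumptions (rectangular grid, 4-connected neighbourhood, Euclidean distance) and is not the BioDICE implementation.

```python
import math

def u_matrix(weights):
    """Average distance from each neuron's weights to its 4-connected neighbours."""
    rows, cols = len(weights), len(weights[0])
    heights = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(weights[r][c], weights[nr][nc]))
            # A single-neuron grid has no neighbours; report height 0.
            heights[r][c] = sum(dists) / len(dists) if dists else 0.0
    return heights

# Toy 1x3 map: two similar neurons and one distant neuron (illustrative values).
toy_map = [[[0.0], [0.1], [5.0]]]
print(u_matrix(toy_map))
```

In the toy map, the height rises sharply towards the third neuron, which is the kind of ridge that separates two clusters on the rendered map.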
While there are a few SOM implementations available in Taverna, BioDICE fills a gap, as it is the first Taverna component performing SOM clustering with U-Matrix visualization.
RapidMiner is a data mining workflow execution engine and the RapidMiner plug-in integrates RapidMiner operators into the Taverna environment. Although one of these operators provides an implementation of the classic SOM algorithm for dimensionality reduction, it does not generate a clustering model. There is another SOM implementation in RapidMiner that generates projections with U-Matrix and other visualization techniques. However, these visualization techniques generate static maps, do not allow an interactive exploration and do not detect clusters in the map automatically. Moreover, the RapidMiner plug-in for Taverna was based on RapidAnalytics, a server-based data mining workflow execution engine, which is no longer supported.
Another Taverna plugin that provides some machine learning services is the Chemistry Development Kit (CDK) plugin. It integrates five clustering algorithms from the Weka machine learning library, which do not include SOM clustering.
This work introduces the BioDICE plug-in for Taverna: BioDICE is a pipeline of algorithms that performs a fast SOM clustering and generates a visualization of high-dimensional datasets. BioDICE supports an interactive generation of data partitions that represent emerging clusters.
In the following sections, the initialization and learning phases of FLSOM are discussed in more detail.
The output generated by a SOM is influenced by the initialization of the neuron weights. In BioDICE, a linear initialization technique is adopted to improve the clustering results and to reduce the execution time. In general, linear initialization procedures are based on the analogy between SOMs and the principal curve analysis algorithm, which is a non-linear generalization of Principal Component Analysis (PCA). In BioDICE, the singular vectors obtained by SVD are used in place of the principal components to imprint the initial SOM lattice with fingerprints of the input objects. Compared with a random initialization, this technique helps the learning process converge to a better clustering in less time.
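The general idea of a linear initialization can be sketched as follows: the initial weights are spread over the plane spanned by the two leading directions of the data, so that the lattice already reflects the main structure of the input. The sketch makes several assumptions that are not taken from the BioDICE code: the data are centred, the leading right singular vectors are approximated by power iteration with deflation on X^T X (a stand-in for a full SVD routine), and the lattice spans one standard deviation on each side of the mean. All function names are hypothetical.

```python
import math
import random

def top_directions(data, k=2, iters=200, seed=0):
    """Approximate the k leading right singular vectors of the centred data
    by power iteration with deflation on X^T X (stand-in for a full SVD)."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    x = [[row[j] - mean[j] for j in range(d)] for row in data]
    g = [[sum(r[i] * r[j] for r in x) for j in range(d)] for i in range(d)]
    vectors, scales = [], []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
        for _ in range(iters):
            w = [sum(g[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(c * c for c in w))
            if norm == 0.0:
                break  # no variance left in any direction
            v = [c / norm for c in w]
        eig = sum(v[i] * g[i][j] * v[j] for i in range(d) for j in range(d))
        vectors.append(v)
        scales.append(math.sqrt(max(eig, 0.0) / n))  # std. dev. along v
        # Deflate so the next pass finds the next direction.
        g = [[g[i][j] - eig * v[i] * v[j] for j in range(d)] for i in range(d)]
    return mean, vectors, scales

def linear_init(data, rows, cols):
    """Place a rows x cols lattice on the plane of the two leading directions."""
    mean, (v1, v2), (s1, s2) = top_directions(data)
    grid = []
    for r in range(rows):
        a = 2.0 * r / (rows - 1) - 1.0 if rows > 1 else 0.0
        row = []
        for c in range(cols):
            b = 2.0 * c / (cols - 1) - 1.0 if cols > 1 else 0.0
            row.append([m + a * s1 * p + b * s2 * q
                        for m, p, q in zip(mean, v1, v2)])
        grid.append(row)
    return grid

# Toy data lying on a diagonal line (illustrative values only).
grid = linear_init([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]], 3, 3)
print(grid[1][1])  # centre neuron sits at the data mean
```

With an odd-sized lattice the centre neuron coincides with the data mean, and the remaining neurons fan out along the directions of largest variance.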
FLSOM provides an advanced SOM learning phase. A simulated annealing heuristic is combined with a standard batch SOM learning algorithm to obtain an adaptive learning rate. This optimization technique improves both the quality and the convergence rate of the learning process. In FLSOM, the simulated annealing "temperature" is the Quantization Error (QE), which is defined as the average Euclidean distance between data vectors and their best matching units at the end of each learning epoch. The variation of QE (ΔQE) between two consecutive epochs is used to adapt the learning factor and, consequently, the convergence rate of the algorithm. The learning process stops when the value of ΔQE is less than a user-defined threshold value. FLSOM was compared with other SOM-based algorithms, using both artificial and real biological datasets [8, 14]. FLSOM provided a good convergence time and, most importantly, better results with respect to local distortion, topology preservation and clustering quality.
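The QE definition and the ΔQE stopping rule described above can be sketched as follows. The sketch does not reproduce the FLSOM update or the way ΔQE adapts the learning factor, since those details are not given here: `run_epoch` is a hypothetical callback, and `toy_epoch` is a placeholder update used only to make the example runnable.

```python
import math

def quantization_error(data, neurons):
    """Average Euclidean distance between each data vector and its BMU."""
    return sum(min(math.dist(x, w) for w in neurons) for x in data) / len(data)

def train_until_converged(data, neurons, run_epoch, threshold, max_epochs=1000):
    """Run epochs until |delta QE| between consecutive epochs drops below threshold."""
    previous = quantization_error(data, neurons)
    for epoch in range(1, max_epochs + 1):
        neurons = run_epoch(data, neurons)
        current = quantization_error(data, neurons)
        if abs(previous - current) < threshold:
            return neurons, current, epoch
        previous = current
    return neurons, previous, max_epochs

def toy_epoch(data, neurons):
    """Placeholder update (NOT FLSOM): move each neuron halfway towards
    the mean of the data vectors for which it is the BMU."""
    def winner(x):
        return min(range(len(neurons)), key=lambda j: math.dist(x, neurons[j]))
    updated = []
    for i, w in enumerate(neurons):
        won = [x for x in data if winner(x) == i]
        if not won:
            updated.append(w)
            continue
        centre = [sum(col) / len(won) for col in zip(*won)]
        updated.append([wi + 0.5 * (ci - wi) for wi, ci in zip(w, centre)])
    return updated

# Toy 1-dimensional data with two obvious groups (illustrative values only).
data = [[0.0], [1.0], [10.0], [11.0]]
neurons, qe, epochs = train_until_converged(data, [[3.0], [8.0]], toy_epoch, 0.01)
print(qe, epochs)
```

In this toy run QE falls from 2.5 to 0.5 and then stops changing, so the loop ends as soon as two consecutive epochs differ by less than the threshold.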
The BioDICE user interface is composed of two panels: the configuration panel and the interactive map.
Results and discussion
In this Section, an application of the BioDICE plugin for the analysis of molecular compounds is presented. The analysis of similarities among chemical structures is an important challenge in cheminformatics [15, 16]: the main goal is to gain understanding about the relationship of the compounds with respect to their chemical or functional activity. In previous works [6, 8], the FLSOM algorithm was shown to be very effective in the cluster analysis of molecular compounds.
The BioDICE Taverna plugin requires an input file containing a features × patterns data table, that is, a matrix with the feature identifiers as rows and the chemical compound identifiers as columns. The plugin also accepts two optional inputs, which are used for enriching the graphical representation of the chemical compounds: an ordered list of input compounds and a list of their corresponding representations in SMILES notation.
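The layout of such a features × patterns table can be illustrated with a short reader that turns each compound column into one feature vector. The concrete file format is an assumption made for this sketch (tab-separated text, a header row with compound identifiers, numeric cells); the format actually expected by the plugin is described in its documentation, and the function name, compound identifiers and feature labels below are hypothetical.

```python
import csv
import io

def read_feature_table(handle, delimiter="\t"):
    """Read a features x patterns table and return (feature_ids, vectors).

    Assumed layout: the first row holds compound identifiers (after a corner
    cell); every following row holds a feature identifier and its values.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    compound_ids = header[1:]
    feature_ids, rows = [], []
    for record in reader:
        if not record:
            continue  # skip blank lines
        feature_ids.append(record[0])
        rows.append([float(v) for v in record[1:]])
    # Transpose: one vector per compound (column), one entry per feature (row).
    vectors = {cid: [row[k] for row in rows] for k, cid in enumerate(compound_ids)}
    return feature_ids, vectors

# Toy table with three features and two compounds (illustrative values only).
sample = "feature\tcmpd_A\tcmpd_B\nC-C\t1\t0\nC=O\t0\t1\nC-N\t1\t1\n"
feature_ids, vectors = read_feature_table(io.StringIO(sample))
print(feature_ids)
print(vectors["cmpd_A"])
```

Each resulting vector is one input pattern for the map, with one component per feature.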
The workflow of Figure 4 contains two nested Taverna workflows, which have been made publicly available and can be retrieved from the repository myExperiment at http://www.myexperiment.org/workflows/1412.html and http://www.myexperiment.org/workflows/1427.html. The complete workflow of Figure 4 can be retrieved at http://www.myexperiment.org/workflows/3611.html.
BioDICE is a new plugin for the Taverna workbench that can be adopted to perform fast clustering of multidimensional biological datasets and to generate their interactive visualization. BioDICE is based on the FLSOM algorithm, an improved version of the SOM learning algorithm. An application scenario in cheminformatics has been discussed to demonstrate the use of the plugin. A dataset of molecular compounds in SMILES format was first processed with a frequent subgraph mining algorithm (feature generation). BioDICE was then applied to provide a cluster analysis of the compounds with respect to the extracted features. BioDICE generated an interactive map of the input compounds and a list of the compounds in each detected cluster.
The BioDICE plugin, the documentation, a tutorial (covering installation, configuration and use), the workflow and the dataset used in this work are available at http://biolab.pa.icar.cnr.it/biodice.html.
Availability and requirements
Project name: BioDICE
Project home page: http://biolab.pa.icar.cnr.it/biodice.html
Operating system(s): Platform independent
Programming language: Java
Other requirements: Java 1.6 or higher, Taverna 2.3.0. For compatibility issues between the Java runtime version and the MoSS tool, please refer to our project home page.
License: GNU GPL v3
Any restrictions to use by non-academics: Only those imposed already by the license
The publication costs for this article were funded by the CNR Interomics Flagship Project "- Development of an integrated platform for the application of "omic" sciences to biomarker definition and theranostic, predictive and diagnostic profiles".
- Belacel N, Wang C, Cupelovic-Culf M: Clustering: unsupervised learning in large biological data. Statistical Bioinformatics: A Guide for Life and Biomedical Science Researchers. Edited by: Lee JK. 2010, Hoboken: Wiley, 89-127. Chap. 5. doi:10.1002/9780470567647
- Ultsch A: Self-organizing neural networks for visualisation and classification. Information and Classification. Studies in Classification, Data Analysis and Knowledge Organization. Edited by: Opitz O, Lausen B, Klar R. 1993, Berlin, Heidelberg: Springer, 307-313. doi:10.1007/978-3-642-50974-2_31
- Ertl P, Rohde B: The molecule cloud - compact visualization of large collections of molecules. J Cheminformatics. 2012, 4 (1): 12. doi:10.1186/1758-2946-4-12
- Kohonen T: Self Organizing Maps. 1995:521, Berlin: Springer
- Digles D, Ecker GF: Self-organizing maps for in silico screening and data visualization. Mol Inform. 2011, 30 (10): 838-846. doi:10.1002/minf.201100082
- Di Fatta G, Fiannaca A, Rizzo R, Urso A, Berthold M, Gaglio S: Context-aware visual exploration of molecular databases. Sixth IEEE International Conference on Data Mining - Workshops (ICDMW’06). 2006, 136-141. doi:10.1109/ICDMW.2006.51
- Wolstencroft K, Haines R, Fellows D, Williams A, Withers D, Owen S, Soiland-Reyes S, Dunlop I, Nenadic A, Fisher P, Bhagat J, Belhajjame K, Bacall F, Hardisty A, Nieva de la Hidalga A, Balcazar Vargas MP, Sufi S, Goble C: The Taverna workflow suite: designing and executing workflows of web services on the desktop, web or in the cloud. Nucleic Acids Res. 2013, doi:10.1093/nar/gkt328
- Fiannaca A, Di Fatta G, Rizzo R, Urso A, Gaglio S: Simulated annealing technique for fast learning of SOM networks. Neural Comput Appl. 2011, 22 (5): 889-899. doi:10.1007/s00521-011-0780-6
- Ultsch A: Emergence in self organizing feature maps. The 6th International Workshop on Self-Organizing Maps (WSOM). 2007, doi:10.2390/biecoll-wsom2007-114
- Jupp S, Eales J, Fischer S, Land S, Ramgolam R, Williams A, Stevens R: Combining RapidMiner operators with bioinformatics services. A powerful combination. RapidMiner Community Meeting and Conference. 2011, Aachen: Shaker
- Truszkowski A, Jayaseelan KV, Neumann S, Willighagen EL, Zielesny A, Steinbeck C: New developments on the cheminformatics open workflow environment CDK-Taverna. J Cheminformatics. 2011, 3: 54. doi:10.1186/1758-2946-3-54
- Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA data mining software. ACM SIGKDD Explorations Newsl. 2009, 11 (1): 10-18. doi:10.1145/1656274.1656278
- Fiannaca A, Di Fatta G, Rizzo R, Urso A, Gaglio S: A new linear initialization in SOM for biomolecular data. Computational Intelligence Methods for Bioinformatics and Biostatistics. Lecture Notes in Computer Science, vol. LNCS 5488. Edited by: Masulli F, Tagliaferri R, Verkhivker GM. 2009, Berlin, Heidelberg: Springer, 177-187. doi:10.1007/978-3-642-02504-4_16
- Fiannaca A, Di Fatta G, Rizzo R, Urso A, Gaglio S: Clustering quality and topology preservation in fast learning SOMs. Neural Netw World. 2009, 19 (5): 625-639.
- Riniker S, Landrum GA: Similarity maps - a visualization strategy for molecular fingerprints and machine-learning methods. J Cheminformatics. 2013, 5 (1): 43. doi:10.1186/1758-2946-5-43
- Hastings J, Magka D, Batchelor C, Duan L, Stevens R, Ennis M, Steinbeck C: Structure-based classification and ontology in chemistry. J Cheminformatics. 2012, 4: 8. doi:10.1186/1758-2946-4-8
- Pence HE, Williams A: ChemSpider: an online chemical information resource. J Chem Educ. 2010, 87 (11): 1123-1124. doi:10.1021/ed100697w
- Borgelt C, Meinl T, Berthold M: MoSS: a program for molecular substructure mining. Proceedings of the 1st International Workshop on Open Source Data Mining Frequent Pattern Mining Implementations - OSDM ‘05. 2005, New York: ACM Press, 6-15. doi:10.1145/1133905.1133908
- Goble CA, Bhagat J, Aleksejevs S, Cruickshank D, Michaelides D, Newman D, Borkum M, Bechhofer S, Roos M, Li P, De Roure D: myExperiment: a repository and social network for the sharing of bioinformatics workflows. Nucleic Acids Res. 2010, 38 (Web Server issue): 677-682. doi:10.1093/nar/gkq429
- NCI/DTP: A set of FDA-approved anticancer drugs to enable cancer research. [http://dtp.nci.nih.gov/branches/dscb/oncology_drugset_explanation.html]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.