New developments on the cheminformatics open workflow environment CDK-Taverna
© Truszkowski et al; licensee Chemistry Central Ltd. 2011
Received: 17 August 2011
Accepted: 13 December 2011
Published: 13 December 2011
The computational processing and analysis of small molecules is at the heart of cheminformatics and structural bioinformatics and their application in e.g. metabolomics or drug discovery. Pipelining or workflow tools allow for the Lego™-like, graphical assembly of I/O modules and algorithms into a complex workflow which can be easily deployed, modified and tested without the hassle of implementing it into a monolithic application. The CDK-Taverna project aims at building a free open-source cheminformatics pipelining solution through the combination of different open-source projects such as Taverna, the Chemistry Development Kit (CDK) or the Waikato Environment for Knowledge Analysis (WEKA). A first integrated version 1.0 of CDK-Taverna was recently released to the public.
The CDK-Taverna project was migrated to the most up-to-date versions of its foundational software libraries with a complete re-engineering of its worker architecture (version 2.0). 64-bit computing and multi-core usage by parallel threads are now supported to allow for fast in-memory processing and analysis of large sets of molecules. Earlier deficiencies like workarounds for iterative data reading are removed. The combinatorial chemistry related reaction enumeration features are considerably enhanced. Additional functionality for calculating a natural product likeness score for small molecules is implemented to identify possible drug candidates. Finally, the data analysis capabilities are extended with new workers that provide access to the open-source WEKA library for clustering and machine learning as well as training and test set partitioning. The new features are outlined with usage scenarios.
CDK-Taverna 2.0 as an open-source cheminformatics workflow solution matured to become a freely available and increasingly powerful tool for the biosciences. The combination of the new CDK-Taverna worker family with the already available workflows developed by a lively Taverna community and published on myexperiment.org enables molecular scientists to quickly calculate, process and analyse molecular data as typically found in e.g. today's systems biology scenarios.
Current problems in the biosciences typically involve several domains of research. They require a scientist to work with different and diverse sets of data. The reconstruction of a metabolic network from sequencing data, for example, employs many of the data types found along the axis of the central dogma, including reconstruction of genome sequences, gene prediction, determination of encoded protein families, and from there to the substrates of enzymes, which then form the metabolic network. In order to work with such a processing pipeline, a scientist has to copy/paste and often transform the data between several bioinformatics web portals by hand. The manual approach involves repetitive tasks and cannot be considered effective or scalable.
In particular, the processing and analysis of small molecules comprises tasks like filtering, transformation, curation or migration of chemical data, information retrieval with substructures, reactions, or pharmacophores as well as the analysis of molecular data with statistics, clustering or machine learning to support chemical diversity requirements or to generate quantitative structure activity/property relationships (QSAR/QSPR models). These processing and analysis procedures themselves are of increasing importance for research areas like metabolomics or drug discovery. The power and flexibility of the corresponding computational tools become essential success factors for the whole research process.
The workflow paradigm addresses the above issues by supplying sets of elementary workers (activities) that can be flexibly assembled in a graphical manner to allow complex procedures to be performed effectively, without the need for specific code development or software programming skills. Scientific workflows allow the combination of a wide spectrum of algorithms and resources in a single workspace [1–3]. Earlier problems with iterations over large data sets are completely resolved in version 2.0 due to new implementations in Taverna. Taverna 2 allows control structures such as "while" loops or "if-then-else" constructs. Termination criteria for loops may now be evaluated by listening to a state port. In addition, the user interface of the Taverna 2 workbench has clearly improved: the design and manipulation of workflows in a graphical workflow editor is now supported. Features like copy/paste and undo/redo simplify workflow creation and maintenance.
The CDK-Taverna project aims at building a free open-source cheminformatics pipelining solution through the combination of different open-source projects such as Taverna, the Chemistry Development Kit (CDK) [8, 9], or the Waikato Environment for Knowledge Analysis (WEKA). A first integrated version 1.0 of CDK-Taverna was recently released to the public. The development of version 2.0 was motivated by the aim of extending the usability and power of CDK-Taverna for different molecular research purposes.
The CDK-Taverna 2.0 plug-in makes use of the Taverna plug-in manager for its installation. The manager fetches all necessary information about the plug-in from an XML file which is located at http://www.ts-concepts.de/cdk-taverna2/plugin/. The information provided therein contains the name of the plug-in, its version, the repository location and the required Taverna version. Upon submitting the URL to the plug-in manager, it downloads all necessary dependencies automatically from the web. After a subsequent restart the plug-in is enabled and the workers are visible in the services. The plug-in uses Taverna version 2.2.1, CDK version 1.3.8 and WEKA version 3.6.4. Like its predecessor it uses the Maven 2 build system as well as the Taverna workbench for automated dependency management.
CDK-Taverna 2.0 worker implementation
The CDK-Taverna 2.0 plug-in is designed to be easily extensible: new workers are created by simply inheriting from the single abstract class org.openscience.cdk.applications.taverna.AbstractCDKActivity (which is the analogue of the CDKLocalWorker interface of CDK-Taverna version 1.0). The class is located in the cdk-taverna-2-activity module. It provides all necessary data for the underlying worker registration mechanism, which frees the software developer from handling these tasks manually. The methods which need to be overridden in order to implement a worker are:
public void addInputPorts(), public void addOutputPorts(): Specify the ports for passing data between workers.
public String getActivityName(), public String getFolderName(): Return name and folder of a worker.
public void work(): Entry point for the worker's central algorithm that performs its core function.
public String getDescription(): Provides descriptive text that explains a worker's function.
public HashMap<String, Object> getAdditionalProperties(): Specifies additional properties like file extensions, the number of concurrent threads to use, etc.
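The method set listed above can be illustrated with a minimal sketch. Since the real AbstractCDKActivity class and the Taverna runtime are not reproduced here, the sketch defines a hypothetical stand-in base class, SketchActivity, that only mirrors the method names from the list; its port lists, its input/output maps, its run() driver, the toy worker and the property key are all assumptions made for illustration and are not part of the plug-in.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for AbstractCDKActivity (NOT the real class):
// it mirrors only the method names listed in the text.
abstract class SketchActivity {
    protected final List<String> inputPorts = new ArrayList<>();
    protected final List<String> outputPorts = new ArrayList<>();
    protected final Map<String, Object> inputs = new HashMap<>();
    protected final Map<String, Object> outputs = new HashMap<>();

    public abstract void addInputPorts();
    public abstract void addOutputPorts();
    public abstract String getActivityName();
    public abstract String getFolderName();
    public abstract String getDescription();
    public abstract HashMap<String, Object> getAdditionalProperties();
    public abstract void work();

    // Simplified driver (assumption): declare ports, copy inputs, call work().
    public Map<String, Object> run(Map<String, Object> data) {
        inputPorts.clear();
        outputPorts.clear();
        addInputPorts();
        addOutputPorts();
        inputs.clear();
        inputs.putAll(data);
        outputs.clear();
        work();
        return new HashMap<>(outputs);
    }
}

// Toy worker: removes blank records from a list of strings.
final class BlankRecordFilterWorker extends SketchActivity {
    @Override public void addInputPorts() { inputPorts.add("Records"); }
    @Override public void addOutputPorts() { outputPorts.add("Filtered"); }
    @Override public String getActivityName() { return "Blank Record Filter"; }
    @Override public String getFolderName() { return "Sketches"; }
    @Override public String getDescription() { return "Removes blank records from a list."; }

    @Override public HashMap<String, Object> getAdditionalProperties() {
        HashMap<String, Object> properties = new HashMap<>();
        properties.put("numberOfThreads", 1); // hypothetical property key
        return properties;
    }

    @Override public void work() {
        List<String> kept = new ArrayList<>();
        Object records = inputs.get("Records");
        if (records instanceof List<?>) {
            for (Object record : (List<?>) records) {
                if (record instanceof String && !((String) record).trim().isEmpty()) {
                    kept.add((String) record);
                }
            }
        }
        outputs.put("Filtered", kept);
    }
}
```

In this sketch the worker only declares its ports and fills the output map inside work(); everything else is left to the base class, which is the division of labour the text describes for the real abstract class.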
Finally, a new worker has to be registered to be available in the Taverna workbench. For this purpose Taverna offers the class net.sf.taverna.t2.spi.SPIRegistry.SPIRegistry to register Service Provider Interfaces (SPI). It is necessary to add the new worker's full name including its package declaration to the file org.openscience.cdk.applications.taverna.AbstractCDKActivity which contains the names and packages of all available workers. This file is located at cdk-taverna-2-activity-ui/src/main/resources/META-INF/services.
Besides the basic implementation, it is possible to define a configuration panel for a worker which allows the specification of parameters. A configuration panel has to inherit from the abstract class org.openscience.cdk.applications.taverna.ActivityConfigurationPanel. The GUI element itself has to be defined in the constructor of the class and may contain any Java Swing element. The following methods are the backbone of a configuration panel:
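The registration file lives under META-INF/services, which is the location used by the standard Java service-provider convention: a text file named after the service type that lists one fully qualified implementation class per line, with '#' starting a comment. The following sketch assumes that the worker file follows this convention; it does not use Taverna's SPIRegistry, and the class ServiceFileSketch and its methods are hypothetical helpers that only show how such a file can be read and how the listed classes can be instantiated by reflection.

```java
import java.util.ArrayList;
import java.util.List;

final class ServiceFileSketch {
    // Extracts class names from provider-configuration text:
    // one name per line, '#' starts a comment, blank lines are skipped.
    static List<String> parseClassNames(String fileContent) {
        List<String> names = new ArrayList<>();
        for (String line : fileContent.split("\\r?\\n")) {
            int hash = line.indexOf('#');
            String name = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    // Instantiates each listed class through its public no-argument constructor.
    static List<Object> instantiate(List<String> classNames) {
        List<Object> instances = new ArrayList<>();
        for (String name : classNames) {
            try {
                instances.add(Class.forName(name).getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot instantiate " + name, e);
            }
        }
        return instances;
    }
}
```

A new worker would then correspond to one additional line in the file, holding the worker's package and class name.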
public boolean checkValues(): Validates all GUI values.
public boolean isConfigurationChanged(): After the validity check this method is used to compare the current worker settings with the GUI settings to detect changes.
public void noteConfiguration(): The properties of the worker are saved in a bean structure. The changes of the configuration bean object are updated by this method.
public void refreshConfiguration(): Updates the GUI values.
public CDKActivityConfigurationBean getConfiguration(): Access to the configuration bean.
The configuration panel has to be registered in the CDKConfigurationPanelFactory class of the org.openscience.cdk.applications.taverna.ui.view package. More details on how to write workers and their configuration panels are provided at the project's wiki page http://cdk-taverna-2.ts-concepts.de/wiki/index.php?title=Main_Page.
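The interplay of the five configuration panel methods can be sketched without the Taverna or Swing classes. In the following sketch the bean ThreadCountBean, the panel class ThreadCountPanelSketch and the single "number of threads" setting are hypothetical, and a plain string field stands in for the content of a Swing text field; only the five method names are taken from the list above.

```java
// Hypothetical configuration bean holding a single worker property.
final class ThreadCountBean {
    private int numberOfThreads = 1;
    int getNumberOfThreads() { return numberOfThreads; }
    void setNumberOfThreads(int value) { numberOfThreads = value; }
}

// Hypothetical panel logic; threadField stands in for a Swing text field.
final class ThreadCountPanelSketch {
    private final ThreadCountBean bean;
    private String threadField;

    ThreadCountPanelSketch(ThreadCountBean bean) {
        this.bean = bean;
        refreshConfiguration(); // show the current bean value initially
    }

    // Simulates the user typing into the GUI field.
    void typeIntoThreadField(String text) { threadField = text; }
    String threadFieldText() { return threadField; }

    // Validates the GUI value: it must be an integer of at least 1.
    public boolean checkValues() {
        try {
            return Integer.parseInt(threadField.trim()) >= 1;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Compares GUI and bean; intended to be called after checkValues() succeeded.
    public boolean isConfigurationChanged() {
        return Integer.parseInt(threadField.trim()) != bean.getNumberOfThreads();
    }

    // Writes the validated GUI value into the configuration bean.
    public void noteConfiguration() {
        bean.setNumberOfThreads(Integer.parseInt(threadField.trim()));
    }

    // Resets the GUI value to what the bean currently holds.
    public void refreshConfiguration() {
        threadField = String.valueOf(bean.getNumberOfThreads());
    }

    public ThreadCountBean getConfiguration() { return bean; }
}
```

The order of calls assumed here is validation first, then change detection, then writing to the bean, which follows the description of isConfigurationChanged() in the list.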
CDK-Taverna 2.0 supports 64-bit computing when run on a 64-bit Java virtual machine. The CDK-Taverna 2.0 plug-in is written in Java and requires Java 6 or higher. The latest Java version is available at http://www.java.com/de/download/. The CDK-Taverna 2.0 plug-in was developed and tested on Microsoft Windows 7 as well as Linux and Mac OS X (32 and 64-bit).
Results and Discussion
CDK-Taverna 2.0 workers
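[Table: overview of CDK-Taverna 2.0 worker categories with example workers; recoverable entries: Iterative File I/O; Molecular descriptor calculation; AtomCount, LargestChain, WienerIndex; kMeans, Perceptron, SVM]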
Overview of multi-threading in CDK-Taverna 2 (task: worker)
Calculation of molecular descriptors: QSAR Descriptor Threaded
Significance of input components evaluation using a genetic algorithm: GA Attribute Selection
Significance of input components evaluation using a 'Leave-One-Out' strategy: Leave-One-Out Attribute Selection
Partitioning datasets into training and test sets: Split Dataset Into Train-/Testset
Construction of clustering models
Construction of regression models
Construction of classification models
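One common way to realise multi-threaded calculation of this kind, not necessarily the one used by the plug-in, is to submit one independent task per molecule to a fixed-size thread pool and to collect the results in input order. The sketch below uses only the standard java.util.concurrent classes; the class ThreadedDescriptorSketch is hypothetical, and its toyDescriptor method (the length of the input string) is a placeholder for a real molecular descriptor, not a CDK call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ThreadedDescriptorSketch {
    // Toy stand-in for a molecular descriptor: the length of the input string.
    static int toyDescriptor(String molecule) {
        return molecule.length();
    }

    // Computes the toy descriptor for all inputs on a fixed-size thread pool.
    // Results are returned in the same order as the inputs.
    static List<Integer> calculate(List<String> molecules, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String molecule : molecules) {
                futures.add(pool.submit(() -> toyDescriptor(molecule)));
            }
            List<Integer> values = new ArrayList<>();
            for (Future<Integer> future : futures) {
                values.add(future.get()); // blocks until this task has finished
            }
            return values;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because each molecule is processed independently, the number of threads can be chosen to match the number of available cores without changing the result.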
CDK-Taverna 1.0 was confined to a 32-bit Java virtual machine and thus was restricted in practice to in-memory processing of data volumes of at most 2 gigabytes. Version 2.0 also supports 64-bit computing by use of a 64-bit Java virtual machine so that the processable data volume is only limited by hardware constraints (memory, speed): 64-bit in-memory workflows were successfully performed with data sets of about 1 million small molecules. Since the memory restrictions of version 1.0 were a main reason to use Pgchem::tigress as a molecular database backend, the corresponding version 1.0 workers have not yet been migrated to version 2.0.
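Whether a given Java installation can actually offer a large heap to in-memory workflows can be checked from within Java itself. The following sketch is an illustration only and not part of the plug-in: the class HeapInfoSketch is hypothetical, and it relies on the standard Runtime.maxMemory() call and the os.arch system property.

```java
final class HeapInfoSketch {
    // Maximum amount of memory the virtual machine will attempt to use, in bytes.
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    // Human-readable summary of architecture and heap limit.
    static String describe() {
        String arch = System.getProperty("os.arch");
        long mebibytes = maxHeapBytes() / (1024L * 1024L);
        return "os.arch=" + arch + ", max heap=" + mebibytes + " MiB";
    }
}
```

On common virtual machines the heap limit is raised with a start-up option such as -Xmx; how this option is passed to the Taverna workbench depends on the installation and is not covered here.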
Advanced reaction enumeration
Evaluation of small molecules for natural product likeness
In recent years, computer-assisted drug design studies have used natural product (NP) likeness as a criterion to screen compound libraries for potential drug candidates [14, 15]. The reason to estimate NP likeness during candidate screening is to facilitate the selection of those compounds that mimic structural features that have naturally evolved to best interact with biological targets.
Clustering and machine learning applications
Unsupervised clustering tries to partition input data into a number of groups smaller than the number of data points, whereas supervised machine learning tries to construct model functions that map the input data onto their corresponding output data. If the output encodes continuous quantities, a regression task is defined. Alternatively, the output may encode classes so that a classification task is addressed. Molecular data sets for clustering consist of input vectors where each vector represents a molecular entity and itself consists of a set of molecular descriptors. Molecular data sets for machine learning add to each input vector a corresponding output vector with features to be learned - thus they consist of I/O pairs of input and output vectors.
The clustering and machine learning workers of CDK-Taverna 2.0 allow the use of distinct WEKA functionality. As far as clustering is concerned the ART-2a worker of version 1.0 is supplemented with five additional WEKA-based workers which offer
Expectation Maximisation (EM): Expectation maximisation algorithm for iterative maximum likelihood estimation of cluster memberships.
Farthest First: Heuristic 2-approximation algorithm for solving the k-center problem.
Hierarchical Clusterer: Hierarchical clustering methods: The distance function and the linkage type are freely selectable.
Simple KMeans: Simple k-means clustering algorithm.
XMeans: Extended k-means clustering with an efficient estimation of the number of clusters.
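Of the clustering methods listed above, the farthest-first heuristic is the shortest to write down: starting from one point, it repeatedly selects as the next center the point that is farthest from all centers chosen so far. The following is a minimal, self-contained sketch of that idea and not the WEKA implementation; the class FarthestFirstSketch, the Euclidean distance and the caller-supplied first index are assumptions made here for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FarthestFirstSketch {
    // Euclidean distance between two vectors of equal length.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Returns the indices of up to k centers chosen by farthest-first traversal.
    // Assumes the points are pairwise distinct.
    static List<Integer> chooseCenters(double[][] points, int k, int firstIndex) {
        List<Integer> centers = new ArrayList<>();
        double[] nearest = new double[points.length]; // distance to nearest chosen center
        Arrays.fill(nearest, Double.POSITIVE_INFINITY);
        int next = firstIndex;
        while (centers.size() < k && centers.size() < points.length) {
            centers.add(next);
            int farthest = -1;
            for (int i = 0; i < points.length; i++) {
                // Update each point's distance to its nearest center so far.
                nearest[i] = Math.min(nearest[i], distance(points[i], points[next]));
                if (farthest < 0 || nearest[i] > nearest[farthest]) {
                    farthest = i;
                }
            }
            next = farthest; // the point worst covered by the current centers
        }
        return centers;
    }
}
```

Each remaining point would then be assigned to its nearest center to obtain the clusters; in a molecular setting the vectors would hold descriptor values.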
Machine learning workers support the significance analysis of single components (i.e. features) of an input vector to obtain smaller inputs with a reduced set of components/features, the partitioning of machine learning data into training and test sets, the construction of input/output mapping model functions and model based predictions as well as result visualization. There is a total of six WEKA-based machine learning methods available: Two workers allow regression as well as classification procedures...
Three-Layer Perceptron-Type Neural Networks: Neural network implementation using the backpropagation algorithm for weight optimisation.
Support Vector Machines: Support Vector Machine implementation using the LibSVM library.
... two workers support only regression...
Multiple Linear Regression: Multiple linear regression algorithm.
... and two workers are restricted to classification tasks:
Naive Bayes: Bayesian classifier for the estimation of continuous variables.
J48 C4.5 decision tree: Decision tree implementation based on the C4.5 classification algorithm.
For training and test set partitioning the Split Dataset Into Train-/Testset worker is available which offers three strategies:
Random: Data are split randomly into a training and test set of defined sizes.
Cluster Representatives: First the input data of the I/O pairs are clustered with the number of clusters to be equal to the number of training data by application of the Simple KMeans algorithm. Then a single input point of each cluster is chosen randomly as a representative and the corresponding I/O pair is inserted into the training set. The remaining I/O pairs are transferred to the test set.
Single Global Max: Cluster representatives are evaluated in a first step. These representatives are then refined by an iterative procedure that exchanges data between training and test set that belong to the same cluster. The latter constraint assures that the input data of training and test set have a similar spatial diversity. A single iteration determines the test set I/O pair with the largest deviation between data and model. This I/O pair is then transferred to the training set while the best predicted I/O pair of the same cluster in the training set is transferred to the test set in exchange. Oscillations during the refinement steps may be suppressed by blacklisting exchanged I/O pairs.
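A single exchange step of the Single Global Max strategy can be sketched as follows. The sketch is an interpretation of the description above and not the worker's code: the class SingleGlobalMaxStepSketch is hypothetical, the cluster assignment and the per-pair absolute deviations between data and model are assumed to be given as arrays indexed by I/O pair, and the model refit that would follow each exchange in practice is left out.

```java
import java.util.Set;

final class SingleGlobalMaxStepSketch {
    // Performs one exchange between training and test set.
    // cluster[i]  : cluster index of I/O pair i
    // absError[i] : absolute deviation between data and model for pair i
    // Returns true if an exchange took place.
    static boolean refineOnce(int[] cluster, double[] absError,
                              Set<Integer> train, Set<Integer> test,
                              Set<Integer> blacklist) {
        // 1. Test pair with the largest deviation that is not blacklisted.
        int worst = -1;
        for (int i : test) {
            if (blacklist.contains(i)) continue;
            if (worst < 0 || absError[i] > absError[worst]) worst = i;
        }
        if (worst < 0) return false;

        // 2. Best predicted training pair of the same cluster.
        int best = -1;
        for (int j : train) {
            if (blacklist.contains(j) || cluster[j] != cluster[worst]) continue;
            if (best < 0 || absError[j] < absError[best]) best = j;
        }
        if (best < 0) return false;

        // 3. Exchange the two pairs and blacklist them to suppress oscillations.
        train.remove(best);
        train.add(worst);
        test.remove(worst);
        test.add(best);
        blacklist.add(worst);
        blacklist.add(best);
        return true;
    }
}
```

Restricting the exchange to pairs of the same cluster keeps the sizes of both sets constant and preserves the spatial diversity argument given in the text.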
CDK-Taverna 2.0 Wiki
Based on the free MediaWiki framework a Wiki was developed for the CDK-Taverna 2.0 project. The web page provides general information about the project, documentation about available workers/workflows and on how to create them as well as about installation procedures. The Wiki can be found at http://cdk-taverna-2.ts-concepts.de/wiki/index.php?title=Main_Page.
CDK-Taverna 2.0 provides an enhanced and matured free open cheminformatics workflow solution for the biosciences. It was successfully applied and tested in academic and industrial environments with data volumes of hundreds of thousands of small molecules. Combined with available workers and workflows from bioinformatics, image analysis or statistics CDK-Taverna supports the construction of complex systems biology oriented workflows for processing diverse sets of biological data.
The authors express their gratitude to the teams and communities of Taverna, CDK and WEKA for creating and developing these open tools.
- Hassan M, Brown R, Varma-O'Brien S, Rogers D: Cheminformatics analysis and learning in a data pipelining environment. Molecular diversity. 2006, 10 (3): 283-299. 10.1007/s11030-006-9041-5.
- Shon J, Ohkawa H, Hammer J: Scientific workflows as productivity tools for drug discovery. Current opinion in drug discovery and development. 2008, 11 (3): 381-388.
- Oinn T, Li P, Kell D, Goble C, Goderis A, Greenwood M, Hull D, Stevens R, Turi D, Zhao J: Taverna/myGrid: Aligning a Workflow System with the Life Sciences Community. Workflows for e-Science. 2007, 300-319. [http://www.springerlink.com/index/l9425v576v544vv3.pdf]
- Kuhn T, Willighagen E, Zielesny A, Steinbeck C: CDK-Taverna: an open workflow environment for cheminformatics. BMC Bioinformatics. 2010, 11: 159. 10.1186/1471-2105-11-159.
- Missier P, Soiland-Reyes S, Owen S, Tan W, Nenadic A, Dunlop I, Williams A, Oinn T, Goble C: Taverna, Reloaded. Lecture Notes in Computer Science. 2010, 6187: 471-481. 10.1007/978-3-642-13818-8_33.
- Taverna 2. [http://www.taverna.org.uk/]
- Oinn T, Addis M, Ferris J, Marvin D, Senger M, Greenwood M, Carver T, Glover K, Pocock M, Wipat A, Li P: Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics. 2004, 20 (17): 3045-3054. 10.1093/bioinformatics/bth361.
- Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E: The Chemistry Development Kit (CDK): An open-source Java library for chemo- and bioinformatics. Journal of Chemical Information and Computer Sciences. 2003, 43 (2): 493-500. 10.1021/ci025584y.
- Steinbeck C, Hoppe C, Kuhn S, Guha R, Willighagen E: Recent Developments of The Chemistry Development Kit (CDK) - An Open-Source Java Library for Chemo- and Bioinformatics. Current Pharmaceutical Design. 2006, 12 (17): 2111-2120. 10.2174/138161206777585274.
- Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten I: The WEKA Data Mining Software: An Update. SIGKDD Explorations. 2009, 11 (1): 10-18. 10.1145/1656274.1656278.
- The Chemistry Development Kit(CDK). [http://sourceforge.net/projects/cdk/]
- Waikato Environment for Knowledge Analysis (WEKA). [http://www.cs.waikato.ac.nz/ml/weka/]
- Apache Maven. [http://maven.apache.org/]
- Ertl P, Roggo S, Schuffenhauer A: Natural Product-likeness Score and Its Application for Prioritization of Compound Libraries. J Chem Inf Model. 2008, 48: 68-74. 10.1021/ci700286x.
- Dobson PD, Patel Y, Kell DB: Metabolite-likeness as a criterion in the design and selection of pharmaceutical drug libraries. Drug Discovery Today. 2009, 14 (1-2): 31-40. 10.1016/j.drudis.2008.10.011.
- Faulon JL, Collins MJ, Carr RD: The Signature Molecular Descriptor. 4. Canonizing Molecules Using Extended Valence Sequences. J Chem Inf Comput Sci. 2004, 44: 427-436. 10.1021/ci0341823.
- Dempster A, Laird N, Rubin D: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological). 1977, 39: 1-38. [http://www.jstor.org/stable/2984875]
- Hochbaum D, Shmoys D: A best possible heuristic for the k-center problem. Mathematics of operations research. 1985, 10: 180-184. 10.1287/moor.10.2.180. [http://www.jstor.org/stable/3689371]
- WEKA API Documentation. [http://weka.sourceforge.net/doc.stable/]
- MacQueen J: Some methods for classification and analysis of multivariate observations. Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. 1967, Berkeley, CA: University of California Press, 1:
- Pelleg D, Moore A: X-means: Extending K-means with an efficient Estimation of the Number of Clusters. Proceedings of 17th International Conference on Machine Learning. 2000, San Francisco, CA: Morgan Kaufmann, 727-734.
- Mitchell TM: Machine Learning. 1997, New York, NY: McGraw-Hill, international edition
- Chang C, Lin C: LIBSVM: a library for support vector machines. 2001, [http://www.csie.ntu.edu.tw/~cjlin/libsvm/]
- Quinlan J: Learning with continuous classes. 5th Australian joint conference on artificial intelligence. 1992, Singapore: World Scientific, 92: 343-348.
- Wang Y, Witten I: Induction of model trees for predicting continuous classes. Proceedings of the 9th European Conference on Machine Learning. 1997, London: Springer Verlag
- John G, Langley P: Estimating continuous distributions in Bayesian classifiers. Proceedings of the eleventh conference on uncertainty in artificial intelligence. 1995, San Francisco, CA: Morgan Kaufmann, 1: 338-345.
- Quinlan R: C4.5: Programs for Machine Learning. 1993, San Mateo, CA: Morgan Kaufmann Publishers
- Zielesny A: From Curve Fitting to Machine Learning: An illustrative Guide to scientific Data Analysis and Computational Intelligence. 2011, Berlin: Springer: Intelligent Systems Reference Library, 18:
- MediaWiki. [http://www.mediawiki.org/wiki/MediaWiki]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.