LICSS - a chemical spreadsheet in microsoft excel
© Lawson and Lawson; licensee Chemistry Central Ltd. 2012
Received: 14 December 2011
Accepted: 2 February 2012
Published: 2 February 2012
Representations of chemical datasets in spreadsheet format are important for ready data assimilation and manipulation. In addition to the normal spreadsheet facilities, chemical spreadsheets need to have visualisable chemical structures and data searchable by chemical as well as textual queries. Many such chemical spreadsheet tools are available, some operating in the familiar Microsoft Excel environment. However, within this group, the performance of Excel is often compromised, particularly in terms of the number of compounds which can usefully be stored on a sheet.
LICSS is a lightweight chemical spreadsheet within Microsoft Excel for Windows. LICSS stores structures solely as Smiles strings. Chemical operations are carried out by calling Java code modules which use the CDK, JChemPaint and OPSIN libraries to provide cheminformatics functionality. Compounds in sheets or charts may be visualised (individually or en masse), and sheets may be searched by substructure or similarity. All the molecular descriptors available in CDK may be calculated for compounds (in batch or on-the-fly), and various cheminformatic operations such as fingerprint calculation, Sammon mapping, clustering and R group table creation may be carried out.
We detail here the features of LICSS and how they are implemented. We also explain the design criteria, particularly in terms of potential corporate use, which led to this particular implementation.
LICSS is an Excel-based chemical spreadsheet with a difference:
• It can usefully be used on sheets containing hundreds of thousands of compounds; it doesn't compromise the normal performance of Microsoft Excel
• It is designed to be installed and run in environments in which users do not have admin privileges; installation involves merely file copying, and sharing of LICSS sheets invokes automatic installation
• It is free and extensible
LICSS is open source software and we hope sufficient detail is provided here to enable developers to add their own features and share with the community.
The familiar Chemical Spreadsheet paradigm is an extremely useful way of presenting structural information together with calculated or measured structural properties. Indeed, most software which handles or stores chemical data will make available a tabular view implementing at least some of the more common spreadsheet functionality such as sorting by columns. Many excellent chemical spreadsheet tools are commercially available and there are also notable freeware/open source examples . Most such software is self-contained which, of course, gives the developers maximum freedom of implementation. This approach has certain potential disadvantages however, particularly considered in the context of a corporate environment:
An interested user needs to buy/download and install the software. This of course is trivial in the case of a 'home' or independent user but may pose almost insurmountable challenges in a 'locked-down' corporate environment
The user must get to grips with an entirely new piece of software overcoming a potentially steep learning curve
It is extremely difficult to provide spreadsheet features (powerful calculated columns, visualisation, macro language, etc) which begin to rival those of the industry standard, Microsoft Excel - a program already very familiar to target users.
The last point suggests a different approach in which the chemistry engine is build on top of Excel. This tactic appears extremely attractive partly because the potential developer can concentrate on implementing chemical functionality but also because of the ubiquity and power of Excel. Two well-known realisations of this approach are Isis for Excel  and Accord for Excel .
Solutions of this type are typically implemented as Excel AddIns, using Visual Basic for Applications (VBA) to interface with chemistry engines. Structures are usually stored on the spreadsheets as some kind of object (including structure-layout or image data) which may be interpreted by the chemistry engine for visualisation and calculation purposes. To ensure that structure objects display and sort properly, it is usually necessary to intercept several of Excel's fundamental calls (such as the main calculation routine). This necessity, together with the size of the stored objects, can lead to rapid degradation of performance for spreadsheets containing large numbers of structures.
Bearing the foregoing in mind, LICSS was designed to appeal particularly to corporate users of Excel for Windows. Because of one of the authors' experience of corporate locked-down environments and because LICSS was to be a 'hobby' project, initially with just one spare-time developer, some rather specific design criteria were developed:
LICSS should require no installation beyond file copying. Users should be able to share spreadsheets with fully automatic installation (if necessary)
Structures should be stored purely as Smiles strings in cells; structure rendering would be on-the-fly
LICSS spreadsheets would not intercept Excel's calculation calls
An Excel add-in would not be used (they normally need user installation and can require admin privileges). Any necessary VBA would exist on each chemically-enabled spreadsheet.
User Implementation and Features
Timings for common cheminformatics tasks using LICSS.
SSS (Sub Structure Search) with n1cnccc1 (Smarts matching)
SSS with pyrimidine (sketcher)
SSS with n1cnccc1 (Smarts matching/fingerprint pre-search)
SSS with pyrimidine (sketcher/fingerprint pre-search)
RGroupTable generation with Pyrimidine as core (sketcher)
Jarvis Patrick clustering (generating 737 clusters)
Sammon Map coordinate calculation
Descriptor calculation (XLogP)
SSS with Cc1cncnc1 (Smarts matching)
SSS with 5-MePyrimidine (sketcher)
486 (includes cc1cncnc1 as well as Cc1cncnc1)
SSS with Cc1cncnc1 (Smarts matching/fingerprint pre-search)
SSS with 5-MePyrimidine (sketcher/fingerprint pre-search)
RGroupTable generation (on Pyrimidine subset with Pyrimidine as core; sketcher)
Descriptor calculation (XLogP)
The main enabling program is contained in an Excel for Windows workbook (Excel 97-2003 format), EnableChemicalSpreadsheetV2.1.xls. It is written in VBA using the VBA Extensibility library which allows the program to copy code to and create code in the workbook being enabled. Most code is simply copied from EnableChemicalspreadsheetV2.1.xls but some event handling routines are created specifically for the workbook being enabled; this makes possible features such as structure pop-up upon mouse hover over chart data points for example.
The CDK and OPSIN Java libraries are accessed in one of two ways. For batch processes (such as Substructure and Similarity searching) the relevant compounds are first written to file in Smiles (SMI) file format (after an in-sheet fingerprint search if necessary). Then an executable JAR file, CDKSSWin.jar is synchronously executed. This contains a number of routines corresponding to each of the available LICSS programs and taking appropriate input/output file and other control parameters. Each of these routines creates an output file and terminates, whereupon the calling VBA processes the output file appropriately. The synchronous Jar file execution is done without a command line window through Javaw.exe and CDKSSWin.jar starts by creating a pop-up Swing progress window. In this way, the routines appear to run as part of Excel.
CDK classes are widely used within CDKSSWin.jar to provide cheminformatics methods (fingerprint generation, substructure searching etc). Where available, existing open source code was adapted to use the CDK minimising the need to rewrite algorithms (eg for Jarvis Patrick clustering and Sammon projection; see acknowledgments). Algorithms for R-Group table generation, similarity searching and diverse compound picking were written in-house.
Calls to JChemPaint, to display structure editing or structure display windows, are handled quite differently. Originally (version 1 x), the JChemPaint applet was used inside a WebBrowser control within VBA. However, this approach was not suitable for the rapid display of several structures (eg for displaying all worksheet structures). From version 2.0 onwards, a JVM is run within the Excel process space so calls to Java can be made directly, without per action initialisation or context switching overheads. Calls to Java of this type are made possible by creating C++ proxies for each Java method (contained within a single CDecl dll file, CDKInterfaceDll.dll) using JNI via the open-source Jace project technology . The C++ proxy functions may then be declared and called directly from VBA.
In practice, after one-off Java initialisation, this approach enables extremely rapid access to Java routines directly from VBA in Excel. Thus, for example, a user can render a screen's worth of structures from Smiles in < 1 second. The same method has been used for all the new Excel formulas - for example, on a 2.13 MHz laptop with 4 GB of memory running Vista, a formula entry such as: ' = GetCDKDescriptor(C2,"XLogP",1)' will calculate the XLogP descriptor for > 100 compounds per second when copied down for a column of Smiles strings (see also Table 1).
LICSS is an open source chemical spreadsheet implemented in Microsoft Excel for Windows. It uses the CDK, JChemPaint and OPSIN open source libraries to provide cheminformatics functionality. LICSS-enabled worksheets and charts are self-installing, requiring no Add-Ins or anything that requires admin privileges. Enabled sheets contain only Smiles strings (with optional compact fingerprints) to represent chemistry and do not slow down Excel's calculation routines. Structures are visualised by clicking on cells containing Smiles strings or by hovering over enabled chart sheet data points. Structures for all currently visible compounds on a sheet may be simultaneously visualised 'on-the-fly'. These features mean that LICSS is suitable for worksheets containing very large (100s of 1000s) of compounds. In addition to basic visualisation and substructure/similarity searching functionality, routines for some more advanced analysis such as Sammon projection, R-Group table creation and Jarvis Patrick clustering are provided.
Availability and Requirements
Project name: excel-cdk
Project home page: http://code.google.com/p/excel-cdk/
Operating system: Windows (XP, Vista or Windows 7); Microsoft Excel for Windows (97 - 2010)
Programming languages: VBA, Java, C++
Other requirements (if compiling): Jace tools
License: GNU GPL v2
Any restrictions to use by non-academics: none
The authors would like to acknowledge the authors/developers of the excellent Java libraries which LICSS makes extensive use of. In particular, the CDK [4, 5], JChemPaint  and OPSIN . The latest version of LICSS also uses rendering/copying code originally written by Rajarshi Guha. The new rendering engine uses C++ proxying of Java classes and, for this, the JACE engine is used . Sammon mapping code was adapted from Java classes originally written by Jarkko Miettinen (for MZMine) . Clustering code was adapted from Java classes provided by NIH . R-Group Generation code was written together with Tom Sheldon (ex Syngenta). The VBA Code Module 'Chart Series' was written by John Walkenbach .
- Rich Apodaca: Your Favorite Chemical Spreadsheet. [http://depth-first.com/articles/2008/09/12/your-favorite-chemical-spreadsheet/]
- Links to Chemistry Spreadsheets. [http://cds.dl.ac.uk/cds/other_info_and_links/info6.html]
- Desktop Cheminformatics Software. [http://accelrys.com/products/informatics/desktop-software.html]
- The Chemistry Development Kit. [http://sourceforge.net/projects/cdk]
- Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E: The Chemistry Development Kit (CDK): an open-source Java library for chemo-and bioinformatics. J Chem Inf Comput Sci. 2003, 43: 493-500. 10.1021/ci025584y.View ArticleGoogle Scholar
- JChemPaint. [http://sourceforge.net/apps/mediawiki/cdk/index.php?title=JChemPaint]
- LICSS system video. [http://www.screencast.com/t/Oy1cwtO0ht]
- excel-cdk. [http://code.google.com/p/excel-cdk/]
- OPSIN: Open Parser for Systematic IUPAC nomenclature. [http://OPSIN.ch.cam.ac.uk/]
- Jace. [http://code.google.com/p/jace/]
- Katajamaa M, Miettinen J, Oresic M: MZmine: toolbox for processing and visualization of mass spectrometry based molecular profile data. Bioinformatics. 2006, 22: 634-636. 10.1093/bioinformatics/btk039.View ArticleGoogle Scholar
- Informatic Tools. [http://nctt.nih.gov/page.cfm?pageID=27543665]
- Walkenbach J: A Class Module to Manipulate a Chart Series. [http://spreadsheetpage.com/index.php/tip/a_class_module_to_manipulate_a_chart_series/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.