The PubChem chemical structure sketcher
© Ihlenfeldt et al; licensee BioMed Central Ltd. 2009
Received: 8 October 2009
Accepted: 17 December 2009
Published: 17 December 2009
Skip to main content
© Ihlenfeldt et al; licensee BioMed Central Ltd. 2009
Received: 8 October 2009
Accepted: 17 December 2009
Published: 17 December 2009
This approach has not been a severe limitation for the operation of the classic collection of databases served by NCBI. However, with the addition of the PubChem suite of databases, an obvious problem arose. Chemical structure databases cannot be readily queried by structure in a reasonable fashion using purely textual input and standard HTML form elements. Users of chemical structure databases demand the capability to search by full-structure or substructure, with a graphical rendition of the query structure as input. Fragment name search, externally generated and pasted SMILES/SMARTSstrings, or upload of query files drawn with stand-alone chemical structure drawing programs are awkward procedures and do not yield a satisfactory user experience. It became very clear that the PubChem structure search system needed a method to allow users to draw their query structure interactively, in an integrated fashion, and without the need of external software.
This is, of course, not a new problem. Most of the structure databases on the Web already have graphical structure input tools. These traditionally come in two styles: Java applets and browser plug-ins. A Java applet requires that the client browser has a functional Java virtual machine installed and correctly configured. While this may be a condition easily satisfied by most browsers, there will always be some fraction where an issue may crop up. For example, if only one in ten thousand of NCBI's two million world-wide users per day encounters and reports such an issue, two hundred user requests will ensue, distracting and diluting support capabilities. So, while there are quite a number of capable Java-based structure editors such as JME, JChemPainthttp://sourceforge.net/projects/jchempaint/, JmolDraw, MCDL, SDA, or Marvin, the use of any of these for PubChem was not possible. Relying on plug-ins such as the one from ChemDraw, is even more problematic because plug-ins are platform-dependent, and there are (to our knowledge) no free chemical structure editor plug-ins which can connect to arbitrary sites.
The bandwidth requirements for this model are not excessive. Essentially, the client needs to capture and send mouse events on an image area (not more than a couple of dozen bytes per second), and to receive a sequence of images. Since typical chemical structure drawings consist mostly of a monochrome background, with sparse monochrome lines and letters, these images were expected to compress very well. They do not exceed a couple of kilobytes for a reasonable-sized drawing area. Even with four or five image updates per second, the required receiver bandwidth would thus not exceed some ten kilobytes per second. This poses certainly no problem for Internet access via broadband and could be, after some throttling of the data stream, acceptable even for dial-up via traditional phone line or ISDN connections. It is comparable to the bandwidth requirements of Internet radio or telephony, and certainly far less than streaming video. A segmentation of the drawing area into panels in order to further reduce the amount of data sent to the client was considered, but not deemed necessary after these initial calculations.
The server responsible for the update of the images must be able to process multiple events per second in order to guarantee a satisfactory user experience. Users expect dynamic feedback as it is provided in standard stand-alone structure editors, such as continuously updated bond lines following the mouse cursor. Because of the rapid sequence of events, the server CGI application has to be of the FastCGI (FCGI) variant, as even the 100 ms start-up overhead of a classical CGI to process only an individual event would be prohibitive.
The general rules for the deployment of Web applications at NCBI also require robustness and redundancy with fail-over support. The sketcher system as it is currently deployed at PubChem uses two independent multi-processor server hosts, and redundant database servers for storing state. To the user, this is invisible - servers are transparently switched depending on the load during a drawing session, and multiple server processes are running in parallel on each host. Nevertheless, these requirements for redundancy and parallelism needed to be taken into account early in the design and coding of the application because they are difficult to retrofit at a later stage.
The PubChem sketcher system is at its core a comparatively simple CACTVShttp://www.xemistry.com cheminformatics toolkit application script. All sketcher functionality on the server is implemented in ~2100 lines of script code. Such efficiency is achieved considering CACTVS provides most features required to code a structure editor. This includes: basic handling of chemical structure objects; creation, deletion and modification of atoms and bonds; a structure layout/cleanup algorithm; I/O modules for the input and output of many structure exchange formats, such as MDL MOL/SDF, ChemDraw CDX/CDXML, ISIS drawing formats, ACD/Labs ChemSketch files, Daylight SMILES/SMARTS, and InChI; basic image output functionality, although some extra functionality was added for this purpose (detailed in the next section); FCGI and CGI data input and decoding functions; and HTTP header output function, in its default configuration. With such capabilities, writing a server for this task is not difficult. Because the CACTVS script commands are high-level, their execution as interpreted statements is not a bottleneck. More than 90% of the execution time spent in response to a client request is spent in C-coded library routines and not the script interpreter.
As part of the general integration of CACTVS for the use within the PubChem project, we added a couple of general-purpose interface functions to the toolkit. These include encoding and decoding of the PubChem ASN.1 binary and ASCII chemical structure representations and functions to access the PubChem structure database and queuing system. These are used in the specific version of the sketcher deployed at NCBI in order to integrate the application into the PubChem environment. However, if no site-specific functions, such as state storage in the queuing system or non-public direct retrieval of PubChem structures via compound (CID) or substance (SID) identifiers, are needed, the PubChem sketcher server process can be run with a standard interpreter from the CACTVS toolkit.
Image sizes for PubChem Compound CID 2244 as a function of image format.
24-bit true color PNG
8-bit colormap GIF, dithered
8-bit colormap GIF, color reduced
8-bit colormap PNG, color reduced
8-bit colormap GIF, original colormap
8-bit colormap PNG, original colormap
By default, we draw both element symbols and lines in anti-aliased fashion to improve visual appearance. Internally, this requires rendering on a 24-bit image object because anti-aliasing requires many (often hundreds) intermediate color shades. However, original true-color 24-bit images are big - too big to be readily sent directly to the client. Therefore, we reduce the color space in a postprocessing step and transmit only 8-bit colormap-indexed images. The GD library natively provides a colorspace reduction function based on dithering. However, this class of algorithm is primarily suited to photographs and other images with large colored regions with comparatively subtle color changes from one pixel to the next. Line drawings deteriorate notably and become fuzzy when processed in this manner. In order to address this problem, we wrote a custom colorspace reduction function which simply determines the most common 24-bit colors, with additional weight for colors far from any color already represented as a colormap entry. All 24-bit colors in the original image are then rounded to the closest colormap entry. This resulting reduced 8-bit colormap image is used for display. Each processed image varies in size but is typically about 4 Kbytes. This size range is unfortunately prone to encounter a size-related IE browser PNG rendering bug http://support.microsoft.com/?kbid=822071. We pad images in invisible comment fields that have no effect on the display, as required to circumvent this problem
With an update rate of four images per second, and the HTTP header and TCP protocol overhead not exceeding a few percent, the result is a bandwidth requirement of roughly 128 Kbit/s. This is about the same as a quality MP3 radio stream. Nevertheless, this may still be too much for dialup connections or users experiencing significant network congestion to the PubChem website. In order to accommodate users lacking sufficient network bandwidth, a network speed control element to the user interface is available. If the low-bandwidth option is chosen, anti-aliasing is disabled (resulting in images with a somewhat less pleasing visual appearance but that are less than half the size of the smoothed version) and thus the required sketcher bandwidth is within the capacity of a single-channel ISDN connection (64 Kbit/s), even without the reduction of the processed event rate (see below) which is simultaneously enforced. These "low-speed" images are directly rendered as 8-bit colormap images. Because the sketcher uses only a few primary drawing colors (black, red, blue, green, orange), there is no danger of colormap overflow when not using anti-aliasing.
Under special circumstances, we send an additional type of image to the client. In order to provide easily recognizable visual feedback for error conditions, parts of the structure (e.g., the offending atom, bond, or fragment) or the whole image area are flashed orange. In order to deliver a clear signal with reliable timing, and in order to avoid the complication of having to send error condition flags back to the client which then would need to request additional images in the highlight sequence, we assemble a non-cycling animated GIF in and transmit it as a single image. If the flash area is small, this animated GIF is not significantly different in size from a standard static image, because the animated GIF specification allows the time-controlled replacement of arbitrary sections of the base image by small bitmaps. The last operation on the flash sequence is to restore the original drawing, so that the image can be retained as drawing area backdrop for normal operation. An animated variant of the PNG format has been defined, but at this time it is not sufficiently supported on Web browsers. The GD library (at the time of writing) does not support the assembly of animated GIF images. We had to add this functionality.
Mouse events are not the only events associated with drawing area coordinates. Some keyboard shortcuts also use coordinates, such as ctrl-v for pasting a structure encoding from the clipboard onto the drawing area. Other keyboard shortcuts simply change the sketcher mode. A description of all keyboard shortcuts is available in the PubChem Sketcher help page http://pubchem.ncbi.nlm.nih.gov/sketch/sketchhelp.html.
The actual structure manipulations in response to events sent by the client are performed exclusively on the server side. In the beginning of a sketcher session, the main page of the sketcher is loaded by the client, where a session ID is transmitted to the server and either an empty structure object is created, a structure object is generated by decoding a linear notation string (e.g., a SMILES, SMARTS, or InChI), or a structure object is pulled directly from a PubChem database subsystem (e.g., by CID). This structure object, and a second backup object for undo/redo operations, remains associated with the session key.
Every server process can perform editing services for multiple parallel editing sessions. On a single-processor single-core server, a couple of dozen parallel sessions do not generate any load, since in most cases the computation time is only a few milliseconds, and, for a comparatively large fraction of the lifetime of a session, nothing really happens on the server other than interpretation of comparatively rare mouse events requiring action.
In case of a single server-side sketcher process, one easy approach to remembering the structure data is to simply keep it in memory. This is a mode well suited for sites without a lot of traffic (and indeed one of the state storage modes supported by the sketcher application). However, in the PubChem environment, with two load balanced multi-core, multi-processor servers each running four sketcher server processes, and where each server-side sketcher process can pick up an operation for any session at any time, a non-local structure state store is required. In that type of configuration, the structure data is retrieved from the store at the beginning of every task, via an unique key bound to the session ID, and, if anything is changed, written back into the same storage slot when the task has been completed. This is not as inefficient as it may sound. The content of the structure store is a simple serialized molecular data object, or more precisely a pair of these objects in order to support undo/redo functionality. Functions to encode and decode these objects efficiently are already part of the Cactvs toolkit. The data size of such an object, which stores basic connectivity plus atom and bond attributes, including query specifications, is usually a couple of kilobytes. The image data associated with a structure is not stored in these blobs since it needs to be regenerated after each structure change. The generic database blob storage mechanism is part of the PubChem queuing system. This system is implemented as monitored database tables on a redundant set of MS SQL Server hosts.
After considering the two sketcher extremes (single sketcher process on single server or many sketcher processes on many servers), there are two additional structure store mechanisms for an intermediate load configuration suitable for single-host multiprocessor systems. The structure data may be saved into shared memory segments or managed by a memory cache daemon (e.g., memcached), where it can be picked up from multiple processes on the same server. The other option is to run the application script in a multi-threaded style, with a pool of threads picking up individual CGI tasks and threads operating concurrently on different processors. However, these two additional configurations are not currently deployed.
The function called in the secondary application window that opened the sketcher can fill the passed data into its application form fields or perform any other operation with it. For example, the sketcher is also used as a tool in a verification Web service for multi-record file uploads to be used in parallel multi-structure queries in PubChem. In this context, individual records of uploaded SD-files can be sent to the sketcher and edited. A thumbnail of that record is updated by the verification service with data received from the sketcher, by passing it to a structure rendering CGI. Since the transfer functions are called whenever the edited structure changes, the dependent form contents or other data representations are continuously updated during the editing process.
The classes of transfer functions, for which an attempt at calling is made, are configurable in the sketcher set-up. The underlying toolkit supports a large number of structure file formats, and any of these exchange formats can be used, as well as custom functions to transfer the molecular formula or various line notations such as SMILES, SMARTS, SLN, InChI, JME strings, PubChem Minimols (a compact file format designed specifically for PubChem structure searching) or CACTVS serialized objects. The transfer format used for PubChem structure queries is an extended, backwards-compatible SMARTS version.
User feedback is generally very positive. In over two years, there is only a single reported instance of a user who could not get the sketcher to work because of severely restricted Internet zone control settings in their Web browser. Other observed failures were due to general server or network problems, not due to any specific unresolved issues with the sketcher implementation methodology. This may be surprising to some considering that approximately 15% of all PubChem Structure Search interactive users launch or otherwise consistently use the PubChem Sketcher on a daily basis.
We have set the various timers and timeout periods which control event processing to yield a maximum refresh rate of three to four images per second in broadband connection mode. The maximum updates rates are reached while performing drag operations with the mouse, where there is continuous feedback by trailing bond lines or the current location of moved or rotated structure parts. With these settings, reports from users in Europe and Japan indicated satisfactory usability even at these non-US locations. In regions with less developed infrastructure, it helps that the sketcher can be operated without resorting to any actions which provoke generation of a stream of feedback images. Almost all functions are accessible by means of isolated mouse clicks that change atoms, sprout bonds at default angles, change bond orders, add template fragments, and so on. If used in this style, the image is only updated when the structure has really changed, and there is no risk of an accumulating backlog of delayed image updates which are prone to confuse a user when they arrive late.
The sketcher has been used in unexpected and creative fashions by an enterprising user community. It has been praised as convenient interactive rendering tool for InChI and SMILES strings. We have also observed numerous attempts to access its rendering functions by batch scripts. Some of these access attempts had to be suppressed because they were not compatible with the NCBI access rules. In the sketcher configuration deployed on the PubChem site, the data transfer functions to fill linked forms are intentionally only usable only by pages served from within the NCBI domain. Linking the sketcher to other Web services with automatic data transfer is not supported.
We have successfully implemented and deployed a new class of Web-based chemical structure editor. It has been in stable production use for more than two years. It is freely accessible at http://pubchem.ncbi.nlm.nih.gov/edit/.
By completely avoiding the need for a Java runtime, the installation of plug-ins and any advanced features in client Web browsers, we were able to provide an extremely portable chemical structure editing system which works on all major Web browsers, even those outdated by nearly a decade, and is entirely operating-system independent. The system has proven to be robust and reliable. The interactive responsiveness of the sketcher, while undoubtedly less than applications based on locally executed code in Plug-ins or Java virtual machines, was found to be sufficient for its purpose and we have received scant complaints. We therefore are convinced that our approach has demonstrated clear benefits over competing approaches for the input of chemical structures, especially in the context of public databases that have a diverse user base with a broad and unpredictable spectrum of browser software, operating systems, and Internet access methods.
Installation instructions for the NCBI version of the sketcher are part of the manual, which can be downloaded as PDF or viewed as HTML pages. A working installation needs to be assembled from the Open Source portable application script and interface files downloadable from NCBI, which can be freely modified and redistributed, plus a proprietary generic closed-source script interpreter suitable for the target server platform, which is for example included in a feature-sufficient version in the standard academic distribution of the Cactvs toolkit and can be obtained from the Xemistry website.
The software is also available in an enhanced commercial version from Xemistry GmbH.
This research was supported [in part] by the Intramural Research Program of the NIH, National Library of Medicine.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.