Open Access

MONA – Interactive manipulation of molecule collections

  • Matthias Hilbig (1),
  • Sascha Urbaczek (1),
  • Inken Groth (2),
  • Stefan Heuser (3) and
  • Matthias Rarey (1, corresponding author)
Journal of Cheminformatics 2013, 5:38

DOI: 10.1186/1758-2946-5-38

Received: 12 June 2013

Accepted: 31 July 2013

Published: 28 August 2013


Working with small‐molecule datasets is a routine task for cheminformaticians and chemists. The analysis and comparison of vendor catalogues and the compilation of promising candidates as starting points for screening campaigns are but a few very common applications. The workflows applied for this purpose usually consist of multiple basic cheminformatics tasks such as checking for duplicates or filtering by physico‐chemical properties. Pipelining tools make it easy to create and change such workflows, but usually do not support interventions once the pipeline has been started. In many contexts, however, the best suited workflow is not known in advance, which makes it necessary to take the results of the previous steps into consideration before proceeding.

To support intuition‐driven processing of compound collections, we developed MONA, an interactive tool that has been designed to prepare and visualize large small‐molecule datasets. Using an SQL database, common cheminformatics tasks such as analysis and filtering can be performed interactively with various methods for visual support. Great care was taken in creating a simple, intuitive user interface which can be used instantly without any setup steps. MONA combines the interactivity of molecule database systems with the simplicity of pipelining tools, thus enabling the case‐to‐case application of chemistry expert knowledge. The current version is available for download, free of charge for academic use.


The compilation and preparation of small‐molecule datasets forms the core of virtually all cheminformatics applications. The careful selection of relevant compounds and the thorough processing of the associated data are essential in order to obtain meaningful results. Although the necessary steps for this process strongly depend on the respective context, there are nevertheless a number of common and recurring tasks. These include, among others, the removal of duplicates, filtering by physico‐chemical properties or substructure matching and the visual inspection of the respective compounds.

Workflow or pipelining tools support this recurrence by providing components or nodes corresponding to such common tasks. These nodes can be individually parameterized and combined in a pipeline, thus enabling the generation of a variety of customized workflows. The specification of these workflows is usually facilitated by a graphical interface. The most commonly used programs in the context of cheminformatics are Pipeline Pilot [1] and the open‐source alternative Knime [2], which have been compared in a recent review [3]. There are numerous further examples of scientific workflow systems described in the literature [4]. All these programs contain a certain number of predefined components and are extensible by allowing users to program their own modules. In addition to the flexibility concerning the specification of workflows, pipelining tools have the advantage that the processes are completely automated. This makes workflow processing the method of choice when all steps are known in advance and no intervention is necessary. Furthermore, there are usually only short setup times compared to the laborious installation and initialization of a server‐based molecular database system.

Molecular databases, on the other hand, make it possible to compile datasets in a more interactive manner. Data needed for common cheminformatics tasks can be calculated in advance and stored in the database, resulting in noticeably reduced run times for data access. For most common database systems chemical cartridges exist which provide the functionality to import chemical data. Molecules are typically written to SQL tables in the form of line notations such as (U)SMILES [5] or InChI [6]. These unique topological identifiers are used to ensure the uniqueness of molecules or to rapidly find particular molecules in the database. It is possible to reduce run times for substructure searches by annotating common substructures in molecules and for similarity searches by using pre‐calculated fingerprints. Physico‐chemical properties can be stored in databases using indices to reduce the run times of filter operations. Depending on the number and kind of pre‐calculated molecular descriptors, run times for setting up the databases can be quite long. Additionally, database systems often need to be installed on the respective operating system.

Here, we present MONA, a software tool aiming at combining the advantages of both approaches. In this way, the software enables a more interactive and intuitive approach to deal with large compound collections. In different validation procedures we show the internal consistency of all provided operations. Additionally, we provide benchmarks showing that all provided operations are sufficiently fast for interactive use.


Based upon the NAOMI framework [7], MONA allows users to interactively prepare, inspect and convert small‐molecule datasets. The most important aspect of MONA is that the primary objects handled are molecules, not their occurrences in a particular dataset. During the import procedure, molecules are converted into a unique topological description, and duplicates are automatically detected and stored as so‐called instances. A typical MONA workflow scheme is shown in Figure 1. To ensure high efficiency, MONA employs a relational SQL database for all operations on datasets. Furthermore, MONA’s architecture allows an efficient handling of molecule sets including their instant creation as well as classical set operations like union, intersection and difference.
Figure 1

Schematic of a typical workflow using MONA.

The following sections describe the concepts behind MONA. This includes molecular representation and management by a relational database, performing operations on molecule sets, and rapid visualization of large compound collections.

Molecules and instances

In the context of MONA the terms molecule and instance are used to distinguish between the actual compound and its occurrence in a dataset (see Figure 2). There can be multiple instances of the same molecule originating from different entries of input files. Depending on the context these instances can be interpreted as either conformations or duplicate entries. In order to reliably assign instances to their corresponding molecules, a canonical topological description is needed. MONA uses an internal string representation called MolString which serves two purposes. First and foremost, it is used to efficiently rebuild the molecule, as this is needed for particular operations as explained in the following sections. Furthermore, it is used as a unique topological descriptor for the assignment of instances to molecules during registration. Molecules are serialized to and from the database, where each molecule and each instance is identified internally by a unique id called Molecule Key and Instance Key, respectively.
Figure 2

Handling of instances in MONA. Input structures with different coordinates but identical topology are assigned to the same molecule.
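The assignment of instances to molecules can be illustrated with a minimal sketch using Python's built‐in sqlite3 module. It is not MONA's actual schema: the table and column names are hypothetical, and a plain SMILES‐like string stands in for the canonical MolString, whose format is not described here. The sketch only shows how a uniqueness constraint on the canonical string lets duplicate entries collapse onto one Molecule Key while every input entry keeps its own Instance Key.

```python
import sqlite3

def open_registry():
    """Create an in-memory database with a molecule and an instance table.

    Table and column names are hypothetical, chosen for illustration only.
    """
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE molecule (
            molecule_key INTEGER PRIMARY KEY,
            molstring    TEXT NOT NULL UNIQUE  -- canonical topological description
        );
        CREATE TABLE instance (
            instance_key INTEGER PRIMARY KEY,
            molecule_key INTEGER NOT NULL REFERENCES molecule(molecule_key),
            source       TEXT NOT NULL,        -- input file the entry came from
            name         TEXT
        );
    """)
    return db

def register(db, canonical, source, name):
    """Store one input entry and return (molecule_key, instance_key)."""
    # INSERT OR IGNORE leaves an already registered molecule untouched,
    # so entries with the same canonical string share one Molecule Key.
    db.execute("INSERT OR IGNORE INTO molecule (molstring) VALUES (?)", (canonical,))
    (molecule_key,) = db.execute(
        "SELECT molecule_key FROM molecule WHERE molstring = ?", (canonical,)
    ).fetchone()
    cursor = db.execute(
        "INSERT INTO instance (molecule_key, source, name) VALUES (?, ?, ?)",
        (molecule_key, source, name),
    )
    return molecule_key, cursor.lastrowid

# Usage with stand-in canonical strings and made-up file names.
db = open_registry()
first = register(db, "CCO", "vendor_a.sdf", "ethanol")
second = register(db, "CCO", "vendor_b.sdf", "EtOH")
third = register(db, "c1ccccc1", "vendor_a.sdf", "benzene")
molecule_count = db.execute("SELECT COUNT(*) FROM molecule").fetchone()[0]
instance_count = db.execute("SELECT COUNT(*) FROM instance").fetchone()[0]
```

In this toy run the two ethanol entries share a Molecule Key but receive different Instance Keys, which mirrors the distinction drawn in Figure 2.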

Instances can be imported from common chemical file formats (SMILES, SDF, MOL2) using the NAOMI framework. The procedures for the consistent handling of these formats have been described in detail in [7]. If an entry consists of multiple disconnected components, currently only the largest component is kept. Furthermore, it is possible to import small molecules from PDB files using the method described in [8]. In this case all components of the entry are imported. Additional data from SDF files is stored for each entry and can be recreated during export. Since the identification of molecules is based on a topological description, different tautomeric forms and protonation states are generally handled as separate entities. The same also applies to molecules with and without explicit specification of stereo descriptors. In order to customize the way instances are assigned to molecules, MONA offers different rules for the import of molecules. Depending on the context, molecules can be imported in a neutralized form, as canonized tautomer and without stereochemistry.

Molecule sets

MONA organizes compounds in molecule sets. Molecule sets are collections of pair‐wise different molecules (not instances) which are used for all operations in MONA. As has been mentioned above, molecules are considered equal if and only if their canonical MolString representation is identical. We believe that this concept of molecular identity follows the basic understanding of chemists. Additionally, there are various technical reasons why sets of molecules are used rather than sets of instances. All available operations, such as filtering, manual selection and visualization, are based on molecular topology, so that there would not be any benefit from using sets of instances. Furthermore, some operations are based on the equality of the sets’ elements. Due to the additional data from the input format, equality of instances is ambiguous at best, whereas it is well defined for molecules on the topological level. In the end, working with molecule sets is more efficient and the results from set operations can be intuitively understood.

Molecule sets are stored internally as lists of Molecule Keys. MONA is able to handle an arbitrary number by keeping these lists in a relational database. When exporting molecule sets to chemical file formats, molecules must be converted back to instances. As instances for a given molecule may come from different input files, it is necessary to choose which source should be used for output generation. For that purpose, a list of original molecule sources is kept in the database. Data associated with a molecule, such as names and coordinates, are then either taken from the first found instance or from all instances in the chosen data sources and eventually exported to the output file.

Visualization of molecule sets

The analysis of the distribution of different physico‐chemical properties is a simple way to get a first impression of a molecule set. For that purpose MONA offers customizable histograms for a number of common physico‐chemical properties. It is also possible to include multiple sets in one histogram, which allows their properties to be compared at a quick glance.

For further analysis, MONA offers a fast visualization of molecule sets using two‐dimensional structure diagrams. This provides a means to visually inspect large molecule collections and manually select molecules for the creation of smaller sets. MONA does not offer any type of three‐dimensional visualization, which would only be needed to show differences between instances such as conformational variability. The necessary two‐dimensional coordinates are generated on the fly by a built‐in layout algorithm. In order to browse large molecule sets, the results of such calculations for the molecules must be available instantly. Even with a fast layout algorithm the pre‐calculation of coordinates for all molecules in a set would take a prohibitively long time. Fortunately, coordinates for all molecules are never needed at the same time. By using a model‐view architecture and lazily calculating coordinates only when they are needed, browsing of molecule sets with hundreds of thousands of molecules becomes instantaneous. On modern hardware, depictions of the few molecules a user can capture simultaneously on the computer screen appear without much latency. By intelligent multi‐threading, including the cancellation of coordinate calculations for molecules that are no longer visible, fast scrolling of large sets does not lead to congested threads.
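The idea of lazy coordinate calculation with cancellation can be sketched in a few lines of Python. This is an illustration of the general technique and not MONA's implementation: the class `LazyDepictions`, its methods and the stand‐in layout function are all hypothetical. Layouts are only requested for the keys that are currently visible, queued work for rows scrolled out of view is cancelled, and finished results are cached.

```python
from concurrent.futures import ThreadPoolExecutor

class LazyDepictions:
    """Compute 2D layouts only for visible molecules (hypothetical sketch)."""

    def __init__(self, layout, workers=2):
        self._layout = layout                 # function: molecule key -> coordinates
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._done = {}                       # molecule key -> cached coordinates
        self._pending = {}                    # molecule key -> Future

    def show(self, visible_keys):
        """Tell the cache which molecule keys are currently on screen."""
        visible = set(visible_keys)
        # Cancel queued calculations for molecules that are no longer visible.
        # cancel() returns False for running or finished work, which is kept.
        for key in list(self._pending):
            if key not in visible and self._pending[key].cancel():
                del self._pending[key]
        for key in visible:
            if key not in self._done and key not in self._pending:
                self._pending[key] = self._pool.submit(self._layout, key)

    def coordinates(self, key):
        """Return the layout for one key, waiting for it if necessary."""
        if key not in self._done:
            future = self._pending.pop(key, None)
            self._done[key] = future.result() if future else self._layout(key)
        return self._done[key]

    def close(self):
        self._pool.shutdown(wait=True)

# Usage: a stand-in layout function that records which keys were computed.
calls = []

def fake_layout(key):
    calls.append(key)
    return [(float(key), 0.0)]

view = LazyDepictions(fake_layout)
view.show(range(0, 5))                 # first screen of a large set
first = view.coordinates(3)
again = view.coordinates(3)            # served from the cache
view.show(range(50_000, 50_005))       # user scrolls far down
far = view.coordinates(50_002)
view.close()
```

However large the set is, the layout function in this sketch runs at most once per key that was ever on screen, which is why the cost of browsing does not grow with the size of the set.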

Operations on molecule sets

In general, MONA operates on molecule sets and creates new sets as results (see Figure 3). All sets can be used in further operations, resulting in a high degree of flexibility. The intention of the set concept is to enable the typical workflow of interactive processing, namely to browse, select, and store data iteratively. The common mathematical set operations (union, intersection and difference) work on multiple input sets and produce a single set as result. Since these operations are solely based on the evaluation of identities of the contained molecules, they can be realized directly by the database using SQL statements. Because molecule sets are internally handled as lists of Molecule Keys, the respective operations can be carried out efficiently. Mathematical set operations produce results instantaneously even for large datasets, which makes them suitable for interactive use. For the same reasons, the splitting of molecule sets by various criteria is interactively possible.
Figure 3

Supported operations in MONA.
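How set operations on lists of Molecule Keys map onto SQL can be shown with a small sketch based on Python's sqlite3 module. The table layout (`set_member`) and the helper functions are assumptions made for illustration and are not taken from MONA; the SQL compound operators UNION, INTERSECT and EXCEPT are standard and supported by SQLite.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical layout: one row per (set, molecule key) pair.
db.execute(
    "CREATE TABLE set_member ("
    " set_id INTEGER NOT NULL,"
    " molecule_key INTEGER NOT NULL,"
    " PRIMARY KEY (set_id, molecule_key))"
)

def store_set(set_id, keys):
    """Store a molecule set as a list of Molecule Keys."""
    db.executemany(
        "INSERT INTO set_member (set_id, molecule_key) VALUES (?, ?)",
        [(set_id, key) for key in keys],
    )

def combine(operator, left, right, result):
    """Create set `result` from sets `left` and `right` inside the database."""
    if operator not in ("UNION", "INTERSECT", "EXCEPT"):
        raise ValueError(f"unsupported set operator: {operator}")
    db.execute(
        "INSERT INTO set_member (set_id, molecule_key) "
        "SELECT ?, molecule_key FROM ("
        "  SELECT molecule_key FROM set_member WHERE set_id = ? "
        f" {operator} "
        "  SELECT molecule_key FROM set_member WHERE set_id = ?)",
        (result, left, right),
    )

def members(set_id):
    rows = db.execute(
        "SELECT molecule_key FROM set_member WHERE set_id = ? ORDER BY molecule_key",
        (set_id,),
    )
    return [key for (key,) in rows]

# Usage with two small sets of made-up Molecule Keys.
store_set(1, [1, 2, 3, 4])
store_set(2, [3, 4, 5])
combine("UNION", 1, 2, 3)
combine("INTERSECT", 1, 2, 4)
combine("EXCEPT", 1, 2, 5)
union, intersection, difference = members(3), members(4), members(5)
```

Because only integer keys are compared, no molecule has to be rebuilt for these operations, which is consistent with the observation that they are fast enough for interactive use.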

Filtering and visual selection

Both filtering and visual selection are operations on a single molecule set which generate a subset by excluding particular elements. The criterion for the exclusion is either a combination of molecular properties or manual selection. Filter chains for molecular properties are specified as a logical conjunction of elementary filters. Four elementary filter types are currently supported: (a) physico‐chemical properties, (b) chemical elements, (c) functional groups, and (d) SMARTS patterns.

The physico‐chemical properties comprise mostly topological descriptors such as the number of rings, molecular weight, and the topological surface area. This is extended by properties which can be derived from the chemical structure such as LogP [9]. Property filters always include or exclude a range of values the molecules must conform to. In contrast to that, substructure filters only ensure the presence or absence of a specific substructure in the molecules of the set. Chemical element filters are the most basic type of substructure filters. They are typically used to remove large classes of molecules such as halogenated compounds. Functional group filters allow the exclusion or inclusion of a set of common functional groups including both aromatic rings and acyclic structures. The number of groups and their types are currently predefined in MONA. If these should not be sufficient, SMARTS expressions can be used to handle any type of chemical pattern. Additionally, MONA allows users to upload collections of SMARTS patterns and use them in a single query.

The efficiency of the filtering operation strongly depends on the selected filter types. Property filters are fast since the values for molecules are pre‐calculated and stored in the database. These filters can therefore be realized by directly using database functionality. Element and functional group filters also rely on pre‐calculated data: both resort to bitfields saved in the database. These are slower than the property filter as SQL databases do not support bitfield matches. SMARTS filters are the computationally most demanding type, since all molecules have to be rebuilt from their MolString and tested against the SMARTS expression.

Elementary filters can be combined into complex queries which can be applied to any molecule set. In order to make filtering with criteria such as the Rule‐of‐Five for orally bioavailable molecules [10] possible, a tolerance can optionally be specified for a filter chain. This means that not all elementary filters need to match but only m of n filters, where m ≤ n can be arbitrarily chosen. Using tolerances has an impact on the speed of filtering operations. If m < n the filter process becomes slower, since the filter chain needs to be transformed into multiple database queries instead of one.
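The bitfield approach to element filters can be sketched as follows. The bit layout in `ELEMENT_BIT`, the function names and the three example molecules are hypothetical and only illustrate the general technique of testing presence or absence with bit masks on the application side; how MONA encodes and evaluates its bitfields is not specified here.

```python
# Hypothetical bit layout: one bit per chemical element of interest.
ELEMENT_BIT = {"C": 0, "N": 1, "O": 2, "F": 3, "Cl": 4, "Br": 5, "S": 6}

def element_mask(elements):
    """Pack a collection of element symbols into one integer bitfield."""
    mask = 0
    for symbol in elements:
        mask |= 1 << ELEMENT_BIT[symbol]
    return mask

def passes(bitfield, required=0, forbidden=0):
    """True if all required bits are set and no forbidden bit is set."""
    return (bitfield & required) == required and (bitfield & forbidden) == 0

# Pre-calculated bitfields as they might be stored per Molecule Key.
stored = {
    1: element_mask(["C", "O"]),
    2: element_mask(["C", "N", "Cl"]),
    3: element_mask(["C", "N", "O", "F"]),
}

halogens = element_mask(["F", "Cl", "Br"])
no_halogens = [key for key, bits in stored.items() if passes(bits, forbidden=halogens)]
with_nitrogen = [key for key, bits in stored.items()
                 if passes(bits, required=element_mask(["N"]))]
```

A filter of this kind never needs the molecular structure itself, only the stored integer, which explains why it is cheaper than a SMARTS match even if it cannot use a database index.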
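The m of n tolerance can be made concrete with a short Python sketch. The class `RangeFilter`, the function `passes_chain` and the property names are hypothetical; the thresholds are the commonly cited Rule‐of‐Five limits, and allowing one violation (m = 3 of n = 4) is a frequent convention rather than a setting taken from MONA. The sketch evaluates the chain in memory and does not show the translation into multiple database queries mentioned above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class RangeFilter:
    """Elementary property filter: the value must lie within [low, high]."""
    name: str
    low: float
    high: float

    def matches(self, properties: Dict[str, float]) -> bool:
        return self.low <= properties[self.name] <= self.high

def passes_chain(properties: Dict[str, float],
                 chain: List[RangeFilter],
                 min_matches: Optional[int] = None) -> bool:
    """Accept a molecule if at least m of the n elementary filters match.

    Without a tolerance (min_matches=None) all n filters must match.
    """
    needed = len(chain) if min_matches is None else min_matches
    hits = sum(1 for elementary in chain if elementary.matches(properties))
    return hits >= needed

# Commonly cited Rule-of-Five limits; property names are made up.
rule_of_five = [
    RangeFilter("mol_weight", 0.0, 500.0),
    RangeFilter("logp", float("-inf"), 5.0),
    RangeFilter("h_donors", 0, 5),
    RangeFilter("h_acceptors", 0, 10),
]

slightly_heavy = {"mol_weight": 520.0, "logp": 4.0, "h_donors": 2, "h_acceptors": 8}
heavy_and_greasy = {"mol_weight": 650.0, "logp": 6.2, "h_donors": 2, "h_acceptors": 8}

strict = passes_chain(slightly_heavy, rule_of_five)                    # 4 of 4 required
tolerant = passes_chain(slightly_heavy, rule_of_five, min_matches=3)   # 3 of 4 required
rejected = passes_chain(heavy_and_greasy, rule_of_five, min_matches=3)
```

The first example molecule violates only the weight limit, so it fails the strict chain but passes with a tolerance of one; the second violates two limits and is rejected in both cases.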

MONA as application

MONA is a cross‐platform application, which can be started without prior installation as no setup of an external database system is required. Currently SQLite is used as underlying database backend for its simplicity in setup and administration. SQLite is connected via a regular SQL API such that any other relational database system could be used instead.

The user interface consists of three different areas reflecting the functionality described in the previous sections. Imported molecule files are contained in the molecule sources view, from where molecule sets can be created at any time. The current molecule sets are shown in the list on the left side. They can be visualized in the respective views either as histograms or as a sortable table of structure diagrams. Operations for sets as described above are available in the toolbar or via the context menu. Filter chains can easily be built in the filter view (see Figure 4) using particular GUI elements for each type of elementary filter. Physico‐chemical property filters are created with the help of a histogram that shows the distribution of the selected property in the currently chosen set. Chemical elements in the element filter can be selected in a periodic table, and functional groups are specified using structure diagrams. SMARTS expressions are entered in text form; the syntax is checked while typing and wrong expressions are highlighted.
Figure 4

MONA running on Linux. Molecules are added from files via the file menu or the Molecule Sources tab shown on the left. 2D structure diagrams can be browsed in the Visual Selection tab shown in the middle, and filter chains are created using the Filters tab on the right.

All operations run in separate threads, which is the basis of this responsive user interface. It maintains its performance even if more demanding tasks are running in the background. Created molecule sets can be saved persistently in the database and restored when opening the database again. Molecule sets can eventually be exported to one of the supported chemical file formats from the context menu.

Results and discussion

The main focus of MONA is on interactive scenarios where large molecule files need to be handled. To illustrate this further, three different workflows are described:

Scenario 1: Preparing a molecule dataset for screening

The compilation of a set of molecules for a virtual or experimental screening is a very common task in cheminformatics. Starting with a large collection of compounds, the preparation mainly consists of selecting a subset of molecules with suitable properties for the target to be addressed (see Figure 5). For this purpose various filters can be iteratively created and tested. A few common filters, e.g., the Rule‐of‐Five, are already predefined in MONA and can be used directly. In addition to the use of filters, molecules can also be selected manually using visual selection. The manual selection can often be facilitated by sorting the molecules according to a specific property. If the results of different filter runs are kept as sets, they can be compared to each other using set operations. Set operations can also be used to eliminate particular molecules (rather than substructures) from molecule sets. One can simply load a file containing unwanted compounds and subtract them from the current set. All steps can be iteratively applied after visual inspection of the remaining and the rejected molecules. For example, bounds related to physico‐chemical properties can be adapted on a case‐to‐case basis depending on the size of the remaining library. After finding the right combination of filters the final candidate set can be exported into an appropriate file format and used by another program. All data including 3D coordinates from instances previously read into the database are retained in this step.
Figure 5

Preparing a molecule dataset for virtual screening. MONA allows filtering steps to be applied iteratively and interactively to create suitable candidate sets.

Scenario 2: Handling catalogs of molecules

The second scenario is taken from the field of compound management. Many vendors offer their compound catalogs in the form of chemical data files. These files can be used to compare the compound portfolio of the different vendors with each other or with an in‐house library (see Figure 6). This task is usually complicated by the fact that each vendor uses different standards for the representation of the respective compounds. When loading vendor catalogs as sets within MONA, different file formats and molecules across different vendors are automatically unified. Optionally, the user can decide to unify additional properties like the tautomeric state or the protonation. The resulting individual sets can be intersected with each other for comparison and evaluation. In this way either compounds offered by various vendors or substances that are uniquely supplied by one vendor can be easily identified. Furthermore, the sets can also be intersected with a current in‐house collection, so that potential additions may be identified. Vendor catalogs usually contain price information and order numbers for each compound. Exporting all instances for molecule sets preserves this information and makes it possible to compare prices for all molecules in the exported set.
Figure 6

Handling catalogs of molecules. Set operations can be used to compare different compound collections by identifying molecules present in both.

Scenario 3: Verifying existing molecular databases

Databases like DUD‐E [11, 12] are widely used to test and evaluate the performance of docking algorithms. The functionality provided by MONA can be used to simplify verification tasks that are tedious to do manually. In order to validate the new DUD‐E database, we tried to answer the following three questions (see Figure 7):
Figure 7

Verifying existing molecular databases. Set operations between different DUD‐E subsets for different targets can be used to identify potentially problematic molecules.

  • Are any of the actives decoys for other targets?

  • Are any of the decoy molecules ligands found in structurally resolved protein-ligand complexes?

  • Are any of the decoy molecules already known drugs?

In order to investigate the first question, one molecule set with actives and one set with decoys were created from the respective files for each individual target. Then, all active sets were united into one set A and all decoy sets were united into one set D. The intersection of both sets directly provides the answer to the first question. The resulting set contains 123 molecules (provided in Additional file 1).
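The procedure for the first question can be mirrored with plain Python sets. The target names and molecule keys below are made‐up toy data, not DUD‐E content; the sketch only shows the sequence of operations (unite all actives into A, unite all decoys into D, intersect) and additionally records for which targets each conflicting molecule is listed.

```python
# Toy data: per-target sets of molecule keys (hypothetical names and keys).
actives = {
    "target_a": {"m1", "m2"},
    "target_b": {"m3"},
}
decoys = {
    "target_a": {"m3", "m7"},
    "target_b": {"m2", "m8"},
}

# Unite all active sets into A and all decoy sets into D.
A = set().union(*actives.values())
D = set().union(*decoys.values())

# Molecules that are active for one target and a decoy for another.
conflicts = A & D

# For each conflicting molecule: (targets where active, targets where decoy).
report = {
    molecule: (
        sorted(t for t, members in actives.items() if molecule in members),
        sorted(t for t, members in decoys.items() if molecule in members),
    )
    for molecule in sorted(conflicts)
}
```

In the toy data, `m2` and `m3` end up in the intersection, each active for one target and listed as a decoy for the other.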

To answer the second question, the decoy set D has to be intersected with a set containing known ligands from protein‐ligand complexes. The necessary data is provided by LigandExpo [13, 14], which offers a SMILES file containing all small molecules from crystal structures in the Protein Data Bank (PDB) [15]. The resulting intersection contains 141 decoys which are ligands of at least one protein in the PDB (provided in Additional file 2).

The third question can be answered in the same way. This time, a substance set of approved drugs from Drugbank [16, 17] was used as reference. Drugbank currently lists 1395 molecules registered as drugs. The intersection of these molecules with D contains 26 molecules (provided in Additional file 3), each of which is approved as a drug. Most interestingly, the resulting set contains the compound cladribine (see Figure 8), which is known to interact with deoxycytidine kinase and is considered a decoy molecule for mitogen‐activated protein kinase 1. The compound nandrolone phenpropionate is a known substrate of cytochrome P450 19A1 and is considered a decoy for cytochrome P450 3A4. Although these two molecules might in fact be inactive against their decoy targets, this analysis at least points to critical cases where the decoy status should be further clarified.
Figure 8

Cladribine and nandrolone phenpropionate are two examples from the 26 molecules that are contained in both DUD‐E decoys and Drugbank.

Furthermore, it is possible to quickly exploit data sources like the PDB for seeking alternative targets for all the actives in the DUD‐E dataset. Let A_i be the set of active compounds for each target i. Intersecting each A_i with the LigandExpo set results in one set per target containing all compounds for which complex structures are deposited in the PDB. Exporting these sets with all instances taken from LigandExpo results in one file for each target containing other proteins in the PDB with the same ligand. As an example, the active flavopiridol for cdk2 was found, which also inhibits glycogen phosphorylase (PDB code 1e1y). Note that searching for flavopiridol in the PDB easily gives the same result, but with MONA this search process was performed with all 20289 active molecules of DUD‐E simultaneously without the need for scripting.

It took seven minutes to import all 1.2 million molecules necessary for this scenario into the database and one minute to create all sets in the GUI on an Intel Core i7‐2600 CPU with 3.4 GHz and 8 GB of memory. All individual set operations ran in less than 10 seconds.


All operations provided by MONA depend on the consistent internal representation of molecules and their respective properties. This applies to both the internal chemical model and the operations performed by the underlying database. The consistency of the chemical model concerning the handling of different chemical file formats has already been validated in [7]. Therefore, the validation of MONA was focused on the correctness of the database functionality. This was done by ensuring the following invariants:

  1. Molecules stored in the database are restored exactly as before.

  2. Molecule sets can be created and combined with set operations.

  3. Different types of filters can be correctly applied to molecule sets.


Storage of molecules in the database is tested by comparing a molecule restored from the database with the original molecule. The order of atoms and bonds may change, but if any valence states or atom coordinates differ the test fails. All molecules passing NAOMI initialization from PubChem Substance (100 M molecules) [18, 19] and from emolecules (5 M molecules) [20] can be correctly restored from the database.

Operations on sets of molecules were tested against each other by verifying that the general equation in Figure 9 holds. Sets S1, S2 and S3 are created by randomly distributing molecules of a test set to one, two or all three sets. Then the union of S1, S2 and S3 must be the same as the union of the symmetric difference (S1 Δ S2 Δ S3), the intersection of all three sets and all pair‐wise intersections of two sets.
Figure 9

Testing set operations against each other. The shown equation was evaluated with three randomly created sets S1, S2 and S3, where Δ is the symmetric difference of two sets.
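The identity described above, S1 ∪ S2 ∪ S3 = (S1 Δ S2 Δ S3) ∪ (S1 ∩ S2 ∩ S3) ∪ (S1 ∩ S2) ∪ (S1 ∩ S3) ∪ (S2 ∩ S3), can be checked with Python's built‐in sets. The following sketch reproduces the test idea in memory rather than against a database; the function names, the fixed random seed and the set size are arbitrary choices made for illustration.

```python
import random

def random_partition(keys, rng):
    """Distribute every key to one, two or all three of three sets."""
    sets = (set(), set(), set())
    for key in keys:
        how_many = rng.randint(1, 3)
        for index in rng.sample(range(3), how_many):
            sets[index].add(key)
    return sets

def identity_holds(s1, s2, s3):
    """Compare the plain union with the combination of derived sets."""
    left = s1 | s2 | s3
    right = (s1 ^ s2 ^ s3) | (s1 & s2 & s3) | (s1 & s2) | (s1 & s3) | (s2 & s3)
    return left == right

rng = random.Random(2013)          # fixed seed so the run is repeatable
s1, s2, s3 = random_partition(range(10_000), rng)
ok = identity_holds(s1, s2, s3)

# Leaving out the pair-wise intersections loses the molecules that are
# members of exactly two sets, so the reduced right-hand side is too small.
reduced = (s1 ^ s2 ^ s3) | (s1 & s2 & s3)
```

The symmetric difference of three sets contains the elements that occur in an odd number of them, so the pair‐wise intersections are exactly what is needed to recover the elements occurring in two sets. A test of this form exercises union, intersection and symmetric difference at the same time.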

Confirming filter operations was done by comparing results returned by the database against the results retrieved by linearly applying each filter against every molecule in turn.
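A cross‐check of this kind can be sketched with sqlite3 and randomly generated property values. The table `property`, its columns and the chosen bounds are hypothetical and the data are synthetic; the sketch only illustrates comparing an indexed range query in the database with a linear scan over the same values.

```python
import random
import sqlite3

rng = random.Random(5)
# Synthetic (molecule key, molecular weight, ring count) rows.
rows = [(key, round(rng.uniform(100.0, 800.0), 2), rng.randint(0, 6))
        for key in range(1, 2001)]

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE property ("
    " molecule_key INTEGER PRIMARY KEY,"
    " mol_weight REAL NOT NULL,"
    " ring_count INTEGER NOT NULL)"
)
db.execute("CREATE INDEX idx_mol_weight ON property (mol_weight)")
db.executemany("INSERT INTO property VALUES (?, ?, ?)", rows)

def filter_in_database(weight_low, weight_high, max_rings):
    """Property filter expressed as a single SQL query."""
    cursor = db.execute(
        "SELECT molecule_key FROM property "
        "WHERE mol_weight BETWEEN ? AND ? AND ring_count <= ? "
        "ORDER BY molecule_key",
        (weight_low, weight_high, max_rings),
    )
    return [key for (key,) in cursor]

def filter_linearly(weight_low, weight_high, max_rings):
    """Reference result: apply the same criteria to every row in turn."""
    return [key for key, weight, rings in rows
            if weight_low <= weight <= weight_high and rings <= max_rings]

from_database = filter_in_database(250.0, 500.0, 3)
reference = filter_linearly(250.0, 500.0, 3)
consistent = from_database == reference
```

The linear scan is slow but easy to trust, which makes it a suitable reference for the faster database query.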

Computing time

In order to assess the computing time requirements of MONA, scaling tests for important operations on the database were performed. As most of the operations only consist of database queries, the results are highly dependent upon the database backend used. Here, SQLite was used with a page cache of 1 GB. This value was chosen as the best compromise for modern workstations.

All benchmarks were done on a workstation with an Intel Xeon E5630 CPU running at 2.53 GHz and 64 GB of available main memory. A subset of molecules from the PubChem Substance database was used as benchmark set. The molecules in this set were randomly chosen with uniform probability from the whole PubChem Substance database.

Naturally, the size of the database depends linearly on the size of the input. In our case the size of the database corresponds roughly to the size of a compressed SD file of the same compound set. All in all it takes approximately 1000 seconds to read 1 million molecules from SDF (see Figure 10), resulting in a database of size 1 GB, which is much smaller than the respective uncompressed MOL2 or SDF files.
Figure 10

Requirements for reading molecules from SDF including insertion and duplicate detection. The red curve shows overall loading time for data files of a particular size (approximately one millisecond per molecule is needed) and the green curve shows the time needed to create a molecule set of this size once the molecules are stored in the database.

The relative order of run times for different types of filters (see Figure 11) has been discussed in Section “Filtering and visual selection”. Additionally, the run times of all filters and set operations depend linearly not only upon the size of the input set but also on the size of the resulting set. This can be seen when comparing the picky property filter to the simple property filter in Figure 11, as the picky filter has to write considerably fewer results into a new subset in the database.
Figure 11

Computing times of filter and set operations. All operations clearly show a linear dependence on the number of molecules (for filters, left diagram) or the number of molecules in the resulting set (for set operations, right diagram).

In summary, we conclude that MONA is efficient enough to handle sets with up to one million molecules interactively on a current workstation with at least 2 GB of main memory. Therefore, it can be used as a desktop application for most cheminformatics tasks.


MONA is an intuitive, interactive tool for processing large small‐molecule datasets. It offers functionality to perform many common cheminformatics tasks such as combining datasets, filtering by molecular properties, and visualization using a built‐in 2D engine. Since MONA is based on a robust cheminformatics framework, molecules from common file formats (SMILES, SDF, MOL2) can be handled consistently. The low setup time despite the use of a database makes MONA a reasonable compromise between pipelining tools and molecule database systems. More importantly, MONA offers a different way of working with molecule datasets. Compared to pipelining tools, it supports an interactive and case‐driven process. While chemical databases and pipelining tools are mostly in the hands of cheminformaticians, MONA’s lightweight interface offers chemists an easy way to deal with large compound collections.

We have provided three prototypical scenarios from different fields of application which emphasize the versatility of MONA. Various validation procedures show that MONA is internally consistent concerning both the representation of molecules and the database operations. Furthermore, the run times for dataset operations from the benchmarks are sufficient for interactive use in most situations with up to one million molecules.

Since working with datasets is such a central task in cheminformatics, there are a lot of potential additional features which could be included in future versions of MONA. We are confident that MONA’s functionality will be substantially extended over the next year. The main focus will be on the introduction of new types of visualizations for molecular sets with respect to molecular similarity and molecular scaffolds. The current version is available for download, free of charge for academic use.



Thanks to Mathias v. Behren, Andreas Heumeier and Thomas Otto for the first version of MONA and demonstrating that sets of molecules are a worthwhile idea. We further thank Thomas Lemcke (University of Hamburg) for his pharmaceutical advice and Marcus Gastreich and Christian Lemmen (BioSolveIT GmbH) for critically reviewing the usability of MONA.

Authors’ Affiliations

(1) Center for Bioinformatics (ZBH), University of Hamburg
(2) Beiersdorf AG, Research Active Ingredients
(3) Nuremberg Institute of Technology Georg Simon Ohm


  1. Accelrys Software Inc: Pipeline Pilot 8.5. 2012.
  2. Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B: KNIME: The Konstanz information miner, Studies in Classification, Data Analysis, and Knowledge Organization. 2008, Berlin Heidelberg: Springer.
  3. Warr W: Scientific workflow systems: Pipeline pilot and KNIME. J Comput‐Aided Mol Des. 2012, 26 (7): 801-804. 10.1007/s10822-012-9577-7.
  4. Kappler M: Software for rapid prototyping in the pharmaceutical and biotechnology industries. Curr Opin Drug Discov Dev. 2008, 11 (3): 389-392.
  5. Weininger D, Weininger A, Weininger J: SMILES. 2. Algorithm for generation of unique SMILES notation. J Chem Inf Comput Sci. 1989, 29 (2): 97-101. 10.1021/ci00062a008.
  6. Heller S, McNaught A, Stein S, Tchekhovskoi D, Pletnev I: InChI ‐ the worldwide chemical structure identifier standard. J Cheminform. 2013, 5: 7. 10.1186/1758-2946-5-7.
  7. Urbaczek S, Kolodzik A, Fischer R, Lippert T, Heuser S, Groth I, Schulz‐Gasch T, Rarey M: NAOMI ‐ On the almost trivial task of reading molecules from different file formats. J Chem Inf Model. 2011, 51 (12): 3199-3207. 10.1021/ci200324e.
  8. Urbaczek S, Kolodzik A, Groth I, Heuser S, Rarey M: Reading PDB: Perception of molecules from 3D atomic coordinates. J Chem Inf Model. 2013, 53 (1): 76-87. 10.1021/ci300358c.
  9. Wildman SA, Crippen GM: Prediction of physicochemical parameters by atomic contributions. J Chem Inf Comput Sci. 1999, 39 (5): 868-873. 10.1021/ci990307l.
  10. Lipinski CA, Lombardo F, Dominy BW, Feeney PJ: Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev. 2001, 46 (1–3): 3-26.
  11. Mysinger M, Carchia M, Irwin J, Shoichet B: Directory of useful decoys, enhanced (DUD‐E): better ligands and decoys for better Benchmarking. J Med Chem. 2012, 55 (14): 6582-6594. 10.1021/jm300687e.
  12. DUD‐E. [] [Data set as SDF downloaded on 2013‐02‐01].
  13. Feng Z, Chen L, Maddula H, Akcan O, Oughtred R, Berman H, Westbrook J: Ligand Depot: a data warehouse for ligands bound to macromolecules. Bioinformatics. 2004, 20 (13): 2153-2155. 10.1093/bioinformatics/bth214.
  14. Ligand Expo. [] [Data set as SMILES (CACTVS with stereo) last accessed on 2013‐02‐01].
  15. Berman H, Westbrook J, Feng Z, Gilliland G, Bhat T, Weissig H, Shindyalov I, Bourne P: The protein data bank. Nucleic Acids Res. 2000, 28 (1): 235-242. 10.1093/nar/28.1.235.
  16. Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, Pon A, Banco K, Mak C, Neveu V, Djoumbou Y, Eisner R, Guo A, Wishart D: DrugBank 3.0: a comprehensive resource for ’omics’ research on drugs. Nucleic Acids Res. 2011, 39 (suppl 1): D1035-D1041.
  17. DrugBank. [] [Data set with approved drugs as SDF downloaded on 2013‐02‐01].
  18. Bolton E, Wang Y, Thiessen P, Bryant S, Elsevier: Chapter 12 PubChem: Integrated platform of small molecules and biological activities. Annu Rep Comput Chem. 2008, 4: 217-241.
  19. PubChem Substance. [] [Data set as SDF downloaded on 2012‐20‐09].
  20. eMolecules. [] [Data set as SDF downloaded on 2012‐20‐09].


© Hilbig et al.; licensee Chemistry Central Ltd. 2013

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.