MONA – Interactive manipulation of molecule collections
© Hilbig et al.; licensee Chemistry Central Ltd. 2013
Received: 12 June 2013
Accepted: 31 July 2013
Published: 28 August 2013
Working with small‐molecule datasets is a routine task for cheminformaticians and chemists. The analysis and comparison of vendor catalogues and the compilation of promising candidates as starting points for screening campaigns are but a few very common applications. The workflows applied for this purpose usually consist of multiple basic cheminformatics tasks such as checking for duplicates or filtering by physico‐chemical properties. Pipelining tools allow such workflows to be created and changed without much effort, but usually do not support interventions once the pipeline has been started. In many contexts, however, the best suited workflow is not known in advance, thus making it necessary to take the results of the previous steps into consideration before proceeding.
To support intuition‐driven processing of compound collections, we developed MONA, an interactive tool that has been designed to prepare and visualize large small‐molecule datasets. Using an SQL database, common cheminformatics tasks such as analysis and filtering can be performed interactively with various methods for visual support. Great care was taken in creating a simple, intuitive user interface which can be used instantly without any setup steps. MONA combines the interactivity of molecule database systems with the simplicity of pipelining tools, thus enabling the case‐to‐case application of chemistry expert knowledge. The current version is available free of charge for academic use and can be downloaded at http://www.zbh.uni-hamburg.de/mona.
The compilation and preparation of small‐molecule datasets forms the core of virtually all cheminformatics applications. The careful selection of relevant compounds and the thorough processing of the associated data are essential in order to obtain meaningful results. Although the necessary steps for this process strongly depend on the respective context, there are nevertheless a number of common and recurring tasks. These include, among others, the removal of duplicates, filtering by physico‐chemical properties or substructure matching and the visual inspection of the respective compounds.
Workflow or pipelining tools support this recurrence by providing components or nodes corresponding to such common tasks. These nodes can be individually parameterized and combined in a pipeline, thus enabling the generation of a variety of customized workflows. The specification of these workflows is usually facilitated by a graphical interface. The most commonly used programs in the context of cheminformatics are Pipeline Pilot [1] and the open‐source alternative KNIME [2], which have been compared in a recent review [3]. There are numerous further examples of scientific workflow systems described in the literature [4]. All these programs contain a certain number of predefined components and are extensible by allowing users to program their own modules. In addition to the flexibility concerning the specification of workflows, pipelining tools have the advantage that the processes are completely automated. This makes workflow processing the method of choice when all steps are known in advance and no intervention is necessary. Furthermore, there are usually only short setup times compared to the laborious installation and initialization of a server‐based molecular database system.
Molecular databases, on the other hand, make it possible to compile datasets in a more interactive manner. Data needed for common cheminformatics tasks can be calculated in advance and stored in the database, resulting in noticeably reduced run times for data access. For most common database systems, chemical cartridges exist which provide the functionality to import chemical data. Molecules are typically written to SQL tables in the form of line notations such as (U)SMILES [5] or InChI [6]. These unique topological identifiers are used to ensure the uniqueness of molecules or to rapidly find particular molecules in the database. It is possible to reduce run times for substructure searches by annotating common substructures in molecules and for similarity searches by using pre‐calculated fingerprints. Physico‐chemical properties can be stored in databases using indices to reduce the run times of filter operations. Depending on the number and kind of pre‐calculated molecular descriptors, run times for setting up the databases can be quite long. Additionally, database systems often need to be installed on the respective operating system.
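The database approach described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions, not MONA's actual schema: the table name `molecule`, its columns, and the example records are hypothetical, and the line notations are assumed to be canonical already. A UNIQUE constraint on the identifier rejects duplicates, and an index on a pre‐calculated property supports range filters.

```python
import sqlite3

# Hypothetical schema for illustration only; not the schema used by MONA.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE molecule (
        mol_key    INTEGER PRIMARY KEY,
        canonical  TEXT NOT NULL UNIQUE,  -- unique line notation (assumed canonical)
        mol_weight REAL NOT NULL          -- pre-calculated property
    )
    """
)
# Index on the pre-calculated property to support range filters.
conn.execute("CREATE INDEX idx_molecule_weight ON molecule (mol_weight)")

records = [
    ("CCO", 46.07),
    ("c1ccccc1", 78.11),
    ("CCO", 46.07),  # duplicate identifier, ignored on insert
]
conn.executemany(
    "INSERT OR IGNORE INTO molecule (canonical, mol_weight) VALUES (?, ?)",
    records,
)

molecule_count = conn.execute("SELECT COUNT(*) FROM molecule").fetchone()[0]
light_molecules = [
    row[0]
    for row in conn.execute(
        "SELECT canonical FROM molecule "
        "WHERE mol_weight BETWEEN ? AND ? ORDER BY canonical",
        (40.0, 50.0),
    )
]
```

Here the duplicate entry is dropped by the uniqueness constraint, and the property filter is answered by the database alone, without rebuilding any molecule.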
Here, we present MONA, a software tool aiming at combining the advantages of both approaches. In this way, the software enables a more interactive and intuitive approach to dealing with large compound collections. In different validation procedures we show the internal consistency of all provided operations. Additionally, we provide benchmarks showing that all provided operations are sufficiently fast for interactive use.
The following sections describe the concepts behind MONA. This includes molecular representation and management by a relational database, performing operations on molecule sets, and rapid visualization of large compound collections.
Molecules and instances
Instances can be imported from common chemical file formats (SMILES, SDF, MOL2) using the NAOMI framework. The procedures for the consistent handling of these formats have been described in detail in [7]. If an entry consists of multiple disconnected components, currently only the largest component is kept. Furthermore, it is possible to import small molecules from PDB files using the method described in [8]. In this case all components of the entry are imported. Additional data from SDF files is stored for each entry and can be recreated during export. Since the identification of molecules is based on a topological description, different tautomeric forms and protonation states are generally handled as separate entities. The same also applies to molecules with and without explicit specification of stereo descriptors. In order to customize the way molecules are assigned to instances, MONA offers different rules for the import of molecules. Depending on the context, molecules can be imported in a neutralized form, as a canonical tautomer, and without stereochemistry.
MONA allows compounds to be organized in molecule sets. Molecule sets are collections of pair‐wise different molecules (not instances) which are used for all operations in MONA. As has been mentioned above, molecules are considered equal if and only if their canonical MolString representation is identical. We believe that this concept of molecular identity follows the basic understanding of chemists. Additionally, there are various technical reasons why sets of molecules are used rather than sets of instances. All available operations, such as filtering, manual selection and visualization, are based on molecular topology, so that there would not be any benefit from using sets of instances. Furthermore, some operations are based on the equality of the sets’ elements. Due to the additional data from the input format, equality of instances is ambiguous at best, whereas it is well defined for molecules on the topological level. In the end, working with molecule sets is more efficient and the results from set operations can be intuitively understood.
Molecule sets are stored internally as lists of Molecule Keys. MONA is able to handle an arbitrary number of them by keeping these lists in a relational database. When exporting molecule sets to chemical file formats, molecules must be converted back to instances. As instances for a given molecule may come from different input files, it is necessary to choose which source should be used for output generation. For that purpose, a list of original molecule sources is kept in the database. Data associated with a molecule, such as names and coordinates, are then taken either from the first instance found or from all instances in the chosen data sources and eventually exported to the output file.
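One way to picture molecule sets kept as key lists in a relational database is a membership table that maps set identifiers to molecule keys. The following is a minimal sketch under assumptions: the table `set_member` and the helper functions `store_set` and `intersect_sets` are hypothetical names chosen for illustration and do not describe MONA's internal layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical membership table: one row per (set, molecule key) pair.
conn.execute(
    """
    CREATE TABLE set_member (
        set_id  INTEGER NOT NULL,
        mol_key INTEGER NOT NULL,
        PRIMARY KEY (set_id, mol_key)  -- a molecule occurs at most once per set
    )
    """
)

def store_set(set_id, mol_keys):
    """Store a molecule set as a list of molecule keys."""
    conn.executemany(
        "INSERT OR IGNORE INTO set_member (set_id, mol_key) VALUES (?, ?)",
        [(set_id, key) for key in mol_keys],
    )

def intersect_sets(first_id, second_id):
    """Return the sorted keys of molecules contained in both sets."""
    rows = conn.execute(
        "SELECT mol_key FROM set_member WHERE set_id = ? "
        "INTERSECT "
        "SELECT mol_key FROM set_member WHERE set_id = ? "
        "ORDER BY mol_key",
        (first_id, second_id),
    )
    return [row[0] for row in rows]

store_set(1, [10, 11, 12])
store_set(2, [11, 12, 13])
common_keys = intersect_sets(1, 2)
```

Because equality is decided on the molecule key alone, the intersection is well defined regardless of which input files the instances came from.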
Visualization of molecule sets
The analysis of the distribution of different physico‐chemical properties is a simple way to get a first impression of a molecule set. For that purpose MONA offers customizable histograms for a number of common physico‐chemical properties. It is also possible to include multiple sets in one histogram, which allows their properties to be compared at a glance.
For further analysis, MONA offers a fast visualization of molecule sets using two‐dimensional structure diagrams. This provides a means to visually inspect large molecule collections and manually select molecules for the creation of smaller sets. MONA does not offer any type of three‐dimensional visualization, which would only be needed to show differences between instances such as conformational variability. The necessary two‐dimensional coordinates are generated on the fly by a built‐in layout algorithm. In order to browse large molecule sets, the results of such calculations for the molecules must be available instantly. Even with a fast layout algorithm, the pre‐calculation of coordinates for all molecules in a set would take a prohibitively long time. Fortunately, coordinates for all molecules are never needed at once. By using a model‐view architecture and lazily calculating coordinates only when they are needed, browsing of molecule sets with hundreds of thousands of molecules becomes instantaneous. On modern hardware, depictions of the few molecules a user can view simultaneously on the computer screen appear without much latency. By intelligent multi‐threading, including the cancellation of coordinate calculations for molecules that are no longer visible, fast scrolling of large sets does not lead to congested threads.
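The lazy calculation of coordinates can be illustrated with a small sketch. It assumes a hypothetical function `layout_coordinates` standing in for the 2D layout algorithm and a hypothetical `visible_depictions` helper standing in for the view; multi‐threading and cancellation are left out. Coordinates are computed only for the rows currently shown and cached for later reuse.

```python
from functools import lru_cache

layout_calls = []  # records which molecules were actually laid out

@lru_cache(maxsize=4096)
def layout_coordinates(mol_key):
    """Stand-in for an expensive 2D layout; returns dummy coordinates."""
    layout_calls.append(mol_key)
    return ((float(mol_key), 0.0),)

def visible_depictions(mol_keys, first_row, row_count):
    """Compute coordinates only for the rows currently shown in the view."""
    window = mol_keys[first_row:first_row + row_count]
    return {key: layout_coordinates(key) for key in window}

all_keys = list(range(100_000))  # a large set; nothing is laid out yet
page = visible_depictions(all_keys, first_row=500, row_count=20)
same_page = visible_depictions(all_keys, first_row=500, row_count=20)
```

Requesting the same page twice triggers the layout only once per molecule, so the cost depends on what is displayed rather than on the size of the set.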
Operations on molecule sets
Filtering and visual selection
Both filtering and visual selection are operations on a single molecule set which generate a subset by excluding particular elements. The criterion for the exclusion is either a combination of molecular properties or manual selection. Filter chains for molecular properties are specified as a logical conjunction of elementary filters. Four elementary filter types are currently supported: (a) physico‐chemical properties, (b) chemical elements, (c) functional groups, and (d) SMARTS patterns.
The physico‐chemical properties comprise mostly topological descriptors such as the number of rings, molecular weight, and the topological surface area. This is extended by properties which can be derived from the chemical structure such as LogP [9]. Property filters always include or exclude a range of values the molecules must conform to. In contrast to that, substructure filters only ensure the presence or absence of a specific substructure in the molecules of the set. Chemical element filters are the most basic type of substructure filters. They are typically used to remove large classes of molecules such as halogenated compounds. Functional group filters allow the exclusion or inclusion of a set of common functional groups including both aromatic rings and acyclic structures. The number of groups and their types are currently predefined in MONA. If these should not be sufficient, SMARTS expressions can be used to handle any type of chemical pattern. Additionally, MONA allows collections of SMARTS patterns to be uploaded and used in a single query. The efficiency of the filtering operation strongly depends on the selected filter types. Property filters are fast since the values for molecules are pre‐calculated and stored in the database. These filters can therefore be realized by directly using database functionality. The same holds true for element and functional group filters. Both resort to pre‐calculated bitfields saved in the database. These are slower than the property filter as SQL databases do not support bitfield matches. SMARTS filters are the computationally most demanding type, since all molecules have to be rebuilt from their MolString and tested against the SMARTS expression.
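The idea behind a bitfield‐based element filter can be sketched as follows. The bit assignment in `ELEMENT_BITS`, the helper names, and the example molecules are assumptions made for illustration; MONA's actual bitfield layout is not described in the text. Each molecule carries a pre‐calculated integer in which one bit per element marks its presence, so that a filter reduces to a bitwise test.

```python
# Hypothetical bit assignment: one bit per chemical element.
ELEMENT_BITS = {"C": 1 << 0, "N": 1 << 1, "O": 1 << 2,
                "F": 1 << 3, "Cl": 1 << 4, "Br": 1 << 5}

def element_bitfield(elements):
    """Pre-calculate the bitfield for the elements present in a molecule."""
    mask = 0
    for element in elements:
        mask |= ELEMENT_BITS[element]
    return mask

def contains_any(bitfield, mask):
    """True if at least one element of the mask is present."""
    return (bitfield & mask) != 0

HALOGENS = element_bitfield(["F", "Cl", "Br"])

# Example molecules given only by their element composition.
molecules = {
    "ethanol": element_bitfield(["C", "O"]),
    "chlorobenzene": element_bitfield(["C", "Cl"]),
    "acetamide": element_bitfield(["C", "N", "O"]),
}

# Exclude halogenated compounds with a single bitwise test per molecule.
non_halogenated = sorted(
    name for name, bits in molecules.items()
    if not contains_any(bits, HALOGENS)
)
```

The test itself is cheap, but it has to be evaluated per molecule, which is in line with the observation above that bitfield filters are slower than plain property range queries.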
Elementary filters can be combined into complex queries which can be applied to any molecule set. In order to make filtering with criteria such as the Rule‐of‐Five for orally bioavailable molecules [10] possible, a tolerance can optionally be specified for a filter chain. This means that not all elementary filters need to match but only m of n filters, where m ≤ n can be arbitrarily chosen. Using tolerances has an impact on the speed of filtering operations. If m < n, the filter process becomes slower, since the filter chain needs to be transformed into multiple database queries instead of one.
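The m of n tolerance can be sketched in a few lines. The function `passes_with_tolerance`, the property names, and the candidate values below are hypothetical and evaluated in memory; MONA instead translates the filter chain into database queries. The four elementary filters use the commonly quoted Rule‐of‐Five thresholds.

```python
def passes_with_tolerance(properties, filters, min_matches):
    """True if at least min_matches of the elementary filters match."""
    matches = sum(1 for elementary_filter in filters
                  if elementary_filter(properties))
    return matches >= min_matches

# Four elementary filters with the commonly quoted Rule-of-Five thresholds.
rule_of_five = [
    lambda p: p["mol_weight"] <= 500,
    lambda p: p["logp"] <= 5,
    lambda p: p["h_donors"] <= 5,
    lambda p: p["h_acceptors"] <= 10,
]

# Hypothetical molecule that violates only the molecular weight criterion.
candidate = {"mol_weight": 520.0, "logp": 3.1, "h_donors": 2, "h_acceptors": 7}

strict_result = passes_with_tolerance(candidate, rule_of_five, min_matches=4)
tolerant_result = passes_with_tolerance(candidate, rule_of_five, min_matches=3)
```

With m = n = 4 the candidate is rejected, whereas m = 3 tolerates the single violation.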
MONA as application
MONA is a cross‐platform application which can be started without prior installation, as no setup of an external database system is required. Currently SQLite is used as the underlying database backend for its simplicity in setup and administration. SQLite is connected via a regular SQL API such that any other relational database system could be used instead.
All operations run in separate threads, which is the basis of the responsive user interface. It maintains its performance even if more demanding tasks are running in the background. Created molecule sets can be saved persistently in the database and restored when opening the database again. Molecule sets can eventually be exported to one of the supported chemical file formats from the context menu.
Results and discussion
The main focus of MONA is on interactive scenarios in which large molecule files need to be handled. To illustrate this further, three different workflows are described:
Scenario 1: Preparing a molecule dataset for screening
Scenario 2: Handling catalogs of molecules
Scenario 3: Verifying existing molecular databases
Are any of the actives decoys for other targets?
Are any of the decoy molecules ligands found in structurally resolved protein-ligand complexes?
Are any of the decoy molecules already known drugs?
In order to investigate the first question, one molecule set with actives and one set with decoys were created from the respective files for each individual target. Then, all active sets were united into one set A and all decoys were united into one set D. The intersection of both sets directly provides the answer to the first question. The resulting set contains 123 molecules (provided in Additional file 1).
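The set logic of this step can be illustrated with Python's built‐in sets. The target names and molecule keys below are invented toy data, not taken from the DUD‐E dataset, and the sketch assumes that each molecule is represented by its unique key.

```python
# Toy data: molecule keys per target (hypothetical, not from DUD-E).
actives_by_target = {
    "target_a": {"m1", "m2"},
    "target_b": {"m3"},
}
decoys_by_target = {
    "target_a": {"m3", "m4"},
    "target_b": {"m5", "m1"},
}

# Unite all per-target sets into one set of actives (A) and one of decoys (D).
all_actives = set().union(*actives_by_target.values())
all_decoys = set().union(*decoys_by_target.values())

# Actives of one target that are used as decoys for another target.
actives_also_decoys = all_actives & all_decoys
```

In this toy example two molecules appear both as an active for one target and as a decoy for another.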
To answer the second question, the decoy set D has to be intersected with a set containing known ligands from protein‐ligand complexes. The necessary data is provided by LigandExpo [13, 14], which offers a SMILES file containing all small molecules from crystal structures in the Protein Data Bank (PDB) [15]. The resulting intersection contains 141 decoys which are ligands of at least one protein in the PDB (provided in Additional file 2).
Furthermore, it is possible to quickly exploit data sources like the PDB to seek alternative targets for all the actives in the DUD‐E dataset. Let A_i be the set of active compounds for each target i. The intersections between each A_i and the LigandExpo set result in one set per target containing all compounds for which complex structures are deposited in the PDB. Exporting these sets with all instances taken from LigandExpo results in one file for each target containing other proteins in the PDB with the same ligand. As an example, the active flavopiridol for cdk2 was found, which also inhibits glycogen phosphorylase (PDB code 1e1y). Note that searching for flavopiridol in the PDB easily gives the same result, but with MONA this search process was performed with all 20289 active molecules of DUD‐E simultaneously without the need for scripting.
It took seven minutes to import all 1.2 million molecules necessary for this scenario into the database and one minute to create all sets in the GUI on an Intel Core i7‐2600 CPU with 3.4 GHz and 8 GB of memory. All individual set operations ran in less than 10 seconds.
Molecules stored in the database are restored exactly as before.
Molecule sets can be created and combined with set operations.
Different types of filters can be correctly applied to molecule sets.
Storage of molecules in the database is tested by comparing a molecule restored from the database with the original molecule. The order of atoms and bonds may change, but if any valence states or atom coordinates differ the test fails. All molecules passing NAOMI initialization from PubChem Substance (100 M molecules) [18, 19] and from eMolecules (5 M molecules) [20] can be correctly restored from the database.
Filter operations were validated by comparing the results returned by the database against the results obtained by linearly applying each filter to every molecule in turn.
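This kind of consistency check can be sketched as follows, assuming a hypothetical `molecule` table with a pre‐calculated molecular weight and four invented records. The same range filter is evaluated once as a database query and once by a linear scan over all molecules, and the two result sets are compared.

```python
import sqlite3

# Hypothetical molecule keys with pre-calculated molecular weights.
molecules = [(1, 46.07), (2, 78.11), (3, 180.16), (4, 342.30)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE molecule (mol_key INTEGER PRIMARY KEY, mol_weight REAL)"
)
conn.executemany("INSERT INTO molecule VALUES (?, ?)", molecules)

def filter_in_database(low, high):
    """Evaluate the range filter as a single database query."""
    rows = conn.execute(
        "SELECT mol_key FROM molecule WHERE mol_weight BETWEEN ? AND ?",
        (low, high),
    )
    return {row[0] for row in rows}

def filter_linearly(low, high):
    """Reference result: apply the filter to every molecule in turn."""
    return {key for key, weight in molecules if low <= weight <= high}

results_agree = filter_in_database(50.0, 200.0) == filter_linearly(50.0, 200.0)
```

Any disagreement between the two result sets would point to an error in the translation of the filter into a database query.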
In order to assess the computing time requirements of MONA, scaling tests for important operations on the database were performed. As most of the operations only consist of database queries, the results are highly dependent upon the database backend used. Here, SQLite was used with a page cache of 1 GB. This value was chosen as the best compromise for modern workstations.
All benchmarks were done on a workstation with an Intel Xeon E5630 CPU running at 2.53 GHz and 64 GB of available main memory. A subset of molecules from the PubChem Substance database was used as benchmark set. The molecules in this set were randomly chosen with uniform probability from the whole PubChem Substance database.
In summary, we conclude that MONA is efficient enough to handle sets with up to one million molecules interactively on a current workstation with at least 2 GB of main memory. Therefore, it can be used as a desktop application for most cheminformatics tasks.
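For readers who want to reproduce a comparable setting, the page cache of an SQLite connection can be adjusted with the `cache_size` pragma. The sketch below assumes that the stated 1 GB corresponds to 1 GiB; how MONA itself configures the cache is not described in the text. A negative pragma value is interpreted by SQLite as a size in KiB rather than as a number of pages.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A negative value sets the cache size in KiB: 1,048,576 KiB = 1 GiB (assumed).
conn.execute("PRAGMA cache_size = -1048576")

# Read the setting back to confirm it was applied to this connection.
cache_size = conn.execute("PRAGMA cache_size").fetchone()[0]
```

The setting applies per connection and is not stored in the database file, so it has to be issued each time the database is opened.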
MONA is an intuitive, interactive tool for processing large small‐molecule datasets. It offers functionality to perform many common cheminformatics tasks such as combining datasets, filtering by molecular properties, and visualization using a built‐in 2D engine. Since MONA is based on a robust cheminformatics framework, molecules from common file formats (SMILES, SDF, MOL2) can be handled consistently. The low setup time despite the use of a database makes MONA a reasonable compromise between pipelining tools and molecule database systems. More importantly, MONA offers a different way of working with molecule datasets. Compared to pipelining tools, it supports an interactive and case‐driven process. While chemical databases and pipelining tools are mostly in the hands of cheminformaticians, MONA’s lightweight interface offers chemists an easy way to deal with large compound collections.
We have provided three prototypical scenarios from different fields of application which emphasize the versatility of MONA. Various validation procedures show that MONA is internally consistent concerning both the representation of molecules and the database operations. Furthermore, the run times for dataset operations from the benchmarks are sufficient for interactive use in most situations with up to one million molecules.
Since working with datasets is such a central task in cheminformatics, there are many potential additional features which could be included in future versions of MONA. We are confident that MONA’s functionality will be substantially extended over the next year. The main focus will be on the introduction of new types of visualizations for molecular sets with respect to molecular similarity and molecular scaffolds. The current version can be downloaded at http://www.zbh.uni-hamburg.de/mona. It is available free of charge for academic use.
Thanks to Mathias v. Behren, Andreas Heumeier and Thomas Otto for the first version of MONA and for demonstrating that sets of molecules are a worthwhile idea. We further thank Thomas Lemcke (University of Hamburg) for his pharmaceutical advice and Marcus Gastreich and Christian Lemmen (BioSolveIT GmbH) for critically reviewing the usability of MONA.
1. Accelrys Software Inc: Pipeline Pilot 8.5. 2012.
2. Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B: KNIME: The Konstanz information miner. Studies in Classification, Data Analysis, and Knowledge Organization. 2008, Berlin Heidelberg: Springer.
3. Warr W: Scientific workflow systems: Pipeline Pilot and KNIME. J Comput‐Aided Mol Des. 2012, 26 (7): 801-804. 10.1007/s10822-012-9577-7.
4. Kappler M: Software for rapid prototyping in the pharmaceutical and biotechnology industries. Curr Opin Drug Discov Dev. 2008, 11 (3): 389-392.
5. Weininger D, Weininger A, Weininger J: SMILES. 2. Algorithm for generation of unique SMILES notation. J Chem Inf Comput Sci. 1989, 29 (2): 97-101. 10.1021/ci00062a008.
6. Heller S, McNaught A, Stein S, Tchekhovskoi D, Pletnev I: InChI ‐ the worldwide chemical structure identifier standard. J Cheminform. 2013, 5: 7. 10.1186/1758-2946-5-7.
7. Urbaczek S, Kolodzik A, Fischer R, Lippert T, Heuser S, Groth I, Schulz‐Gasch T, Rarey M: NAOMI ‐ On the almost trivial task of reading molecules from different file formats. J Chem Inf Model. 2011, 51 (12): 3199-3207. 10.1021/ci200324e.
8. Urbaczek S, Kolodzik A, Groth I, Heuser S, Rarey M: Reading PDB: Perception of molecules from 3D atomic coordinates. J Chem Inf Model. 2013, 53 (1): 76-87. 10.1021/ci300358c.
9. Wildman SA, Crippen GM: Prediction of physicochemical parameters by atomic contributions. J Chem Inf Comput Sci. 1999, 39 (5): 868-873. 10.1021/ci990307l.
10. Lipinski CA, Lombardo F, Dominy BW, Feeney PJ: Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev. 2001, 46 (1–3): 3-26.
11. Mysinger M, Carchia M, Irwin J, Shoichet B: Directory of useful decoys, enhanced (DUD‐E): better ligands and decoys for better benchmarking. J Med Chem. 2012, 55 (14): 6582-6594. 10.1021/jm300687e.
12. DUD‐E. [http://dude.docking.org/] [Data set as SDF downloaded on 2013‐02‐01].
13. Feng Z, Chen L, Maddula H, Akcan O, Oughtred R, Berman H, Westbrook J: Ligand Depot: a data warehouse for ligands bound to macromolecules. Bioinformatics. 2004, 20 (13): 2153-2155. 10.1093/bioinformatics/bth214.
14. Ligand Expo. [http://ligand-expo.rcsb.org/] [Data set as SMILES (CACTVS with stereo) last accessed on 2013‐02‐01].
15. Berman H, Westbrook J, Feng Z, Gilliland G, Bhat T, Weissig H, Shindyalov I, Bourne P: The protein data bank. Nucleic Acids Res. 2000, 28 (1): 235-242. 10.1093/nar/28.1.235.
16. Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, Pon A, Banco K, Mak C, Neveu V, Djoumbou Y, Eisner R, Guo A, Wishart D: DrugBank 3.0: a comprehensive resource for ’omics’ research on drugs. Nucleic Acids Res. 2011, 39 (suppl 1): D1035-D1041.
17. DrugBank. [http://www.drugbank.ca/] [Data set with approved drugs as SDF downloaded on 2013‐02‐01].
18. Bolton E, Wang Y, Thiessen P, Bryant S: Chapter 12 PubChem: Integrated platform of small molecules and biological activities. Annu Rep Comput Chem. 2008, 4: 217-241. Elsevier.
19. PubChem Substance. [http://www.ncbi.nlm.nih.gov/pcsubstance] [Data set as SDF downloaded on 2012‐20‐09].
20. eMolecules. [http://www.emolecules.com/] [Data set as SDF downloaded on 2012‐20‐09].
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.