MINEs: open access databases of computationally predicted enzyme promiscuity products for untargeted metabolomics
© Jeffryes et al. 2015
Received: 30 March 2015
Accepted: 6 July 2015
Published: 28 August 2015
In spite of its great promise, metabolomics has proven difficult to execute in an untargeted and generalizable manner. Liquid chromatography–mass spectrometry (LC–MS) has made it possible to gather data on thousands of cellular metabolites. However, matching metabolites to their spectral features continues to be a bottleneck, meaning that much of the collected information remains uninterpreted and that new metabolites are seldom discovered in untargeted studies. These challenges require new approaches that consider compounds beyond those available in curated biochemistry databases.
Here we present Metabolic In silico Network Expansions (MINEs), an extension of known metabolite databases to include molecules that have not been observed, but are likely to occur based on known metabolites and common biochemical reactions. We utilize an algorithm called the Biochemical Network Integrated Computational Explorer (BNICE) and expert-curated reaction rules based on the Enzyme Commission classification system to propose the novel chemical structures and reactions that comprise MINE databases. Starting from the Kyoto Encyclopedia of Genes and Genomes (KEGG) COMPOUND database, the MINE contains over 571,000 compounds, of which 93% are not present in the PubChem database. However, these MINE compounds have on average higher structural similarity to natural products than compounds from KEGG or PubChem. MINE databases were able to propose annotations for 98.6% of a set of 667 MassBank spectra, 14% more than KEGG alone and equivalent to PubChem, while returning far fewer candidates per spectrum than PubChem (46 vs. 1715 median candidates). Application of MINEs to LC–MS accurate mass data enabled the identity of an unknown peak to be confidently predicted.
MINE databases are freely accessible for non-commercial use via user-friendly web-tools at http://minedatabase.mcs.anl.gov and developer-friendly APIs. MINEs improve metabolomics peak identification as compared to general chemical databases whose results include irrelevant synthetic compounds. Furthermore, MINEs complement and expand on previous in silico generated compound databases that focus on human metabolism. We are actively developing the database; future versions of this resource will incorporate transformation rules for spontaneous chemical reactions and more advanced filtering and prioritization of candidate structures.
Keywords: Enzyme promiscuity; Untargeted metabolomics; Liquid chromatography–mass spectrometry; Metabolite identification
Metabolomics, the study of the population of small molecules in a cell, has drawn intense interest in fields from medicine to synthetic biology because it can provide a fine-grained representation of cellular state and activity [1–4]. Of particular interest is untargeted metabolomics, which seeks to measure as much of the metabolome as possible by limiting methodological detection bias. The dominant analysis technique for untargeted metabolomics is chromatography coupled with mass spectrometry (MS), but this method is hindered by a large number of unknown peaks and the limited number of reference spectra available to identify the peaks. A number of tools have been developed to propose structural matches for unannotated peaks [7–11], but in practice these tools either return too many candidates when drawing from large chemical databases such as PubChem or miss compounds not yet present in curated biochemical databases [13, 14]. This has the effect of locking untargeted metabolomics in an unfortunate paradox: compounds that are not present in biochemical databases are not identified and, in the absence of experimental identification, new compounds cannot be added to databases.
There is a growing consensus that many enzymes mediate undocumented side-reactions (known as promiscuous activities) as a result of exposure to diverse cellular metabolites [16, 17]. These activities may explain unannotated peaks in metabolomics datasets [18, 19] but are difficult to detect, as they may be overshadowed by a known function or be dependent on intracellular conditions. Predicting novel chemical reactions based on broad enzyme specificity has been utilized by a number of tools for the prediction of new biochemical pathways [22–24]. Recently, this technique has also been used to expand structure databases for metabolomics by the MyCompoundID tool, the In Vivo/In Silico Metabolites Database (IIMDB), LipidHome and others [27, 28].
Here we present Metabolic In silico Network Expansions (MINEs) that utilize the Biochemical Network Integrated Computational Explorer (BNICE) [29, 30] to expand on general biochemical databases as well as organism-specific databases for Escherichia coli and yeast. The focus on endogenously present and organism-specific metabolites has been cited as critical to improving the confidence of compound matches, and thus we complement existing resources that focus on human metabolism. In principle, these predictions could also be made using Reaction Difference Matching (RDM), machine learning methods [31, 32], or other rule-based methods such as ChemAxon’s Metabolizer. Each of these approaches has its benefits; the quality of the output depends largely on the quality and coverage of the reaction rules used in the analysis. We selected BNICE because we have a set of BNICE reaction rules that have been demonstrated to reproduce a large fraction of known biochemical reactions, as well as to predict enzyme reactions that were subsequently verified experimentally. Importantly, we also have the right to re-distribute BNICE output. No license is required for academic users to access the website or APIs, and all BNICE predicted compounds are available for download in SDF format from the website.
Construction and content
Compound information was obtained from the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Release 68.0), the Yeast Metabolome Database (YMDB) (Version 1.0) and EcoCyc (Version 17.0). Generalized compounds (those containing R groups), inorganic compounds, and disconnected fragments were removed using the Pybel toolkit. Generalized structures are of very limited utility, as they cannot be assigned an accurate mass or represented in a canonical form. Where possible, we encourage developers to avoid ambiguity by enumerating all possible structures in their databases. Additionally, biochemical databases often contain numerous duplicate compounds; these were identified by Standard InChIKey comparison and removed for computational efficiency.
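The duplicate-removal step can be illustrated with a minimal Python sketch. It assumes each compound record already carries a precomputed Standard InChIKey; the record layout (the `source`, `name` and `inchikey` fields) and the keep-first policy are assumptions made for illustration, not a description of the authors' pipeline.

```python
def deduplicate_by_inchikey(compounds):
    """Keep the first record seen for each Standard InChIKey."""
    seen = set()
    unique = []
    for compound in compounds:
        key = compound["inchikey"]
        if key in seen:
            continue  # same structure already kept from an earlier record
        seen.add(key)
        unique.append(compound)
    return unique


# Illustrative records only (hypothetical field names).
compounds = [
    {"source": "KEGG", "name": "Ethanol", "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
    {"source": "EcoCyc", "name": "ethanol", "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
    {"source": "KEGG", "name": "Acetate", "inchikey": "QTBSBXVTEAMEQO-UHFFFAOYSA-N"},
]
unique_compounds = deduplicate_by_inchikey(compounds)
```

Comparing full InChIKeys treats stereoisomers as distinct compounds, which is usually the desired behavior when merging source databases.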
BNICE products may take a variety of tautomeric forms depending on the source structure and the nature of the operator applied. Therefore, products were processed with ChemAxon’s Standardizer & Structure Checker (JChem 6.0.4, 2013) to ensure canonical valences and placement of charge. Natural Product Likeness scores and estimated logP values were calculated with a standalone Java ARchive (JAR) package and ChemAxon’s Calculator Plugins (JChem 6.0.4, 2013), respectively. Estimated Kováts Retention Indices were calculated using the NIST RI algorithm.
Compounds were matched to the PubChem and KEGG COMPOUND databases with the connectivity block of InChIKeys for annotation. Generated compounds are assigned identifiers based on a hash of the canonical SMILES for internal use and a numeric MINE ID for human readability. Finally, the exact mass and chemical fingerprints of structures were calculated with Pybel.
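Matching on the connectivity block can be sketched as follows. The first block of a Standard InChIKey (the 14 characters before the first hyphen) encodes the molecular skeleton, so comparing only that block ignores stereochemistry and protonation. The function names and the reference identifiers (`REF_0001`, `REF_0002`) below are hypothetical placeholders, not part of any MINE API.

```python
from collections import defaultdict


def connectivity_block(inchikey):
    """Return the first 14-character block of an InChIKey."""
    block = inchikey.split("-")[0]
    if len(block) != 14:
        raise ValueError(f"not a valid InChIKey: {inchikey!r}")
    return block


def build_connectivity_index(reference):
    """Map each connectivity block to the reference IDs that share it."""
    index = defaultdict(list)
    for ref_id, inchikey in reference.items():
        index[connectivity_block(inchikey)].append(ref_id)
    return index


# Hypothetical reference entries: L-alanine and alanine without stereochemistry.
reference = {
    "REF_0001": "QNAYBMKLOCPYGJ-REOHZZBQSA-N",
    "REF_0002": "QNAYBMKLOCPYGJ-UHFFFAOYSA-N",
}
index = build_connectivity_index(reference)

# A generated compound with no defined stereochemistry matches both entries.
query = "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"
matches = index.get(connectivity_block(query), [])
```

Because the match is on the skeleton alone, a single generated compound may be annotated with several reference entries, as in this example.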
Compound and reaction data are stored as collections in a MongoDB database (v2.6.2). A compound entry contains the chemical formula, exact mass, InChIKey, canonical SMILES, FP2 and FP4 fingerprints and lists of reactions in which the compound is predicted to participate as a reactant or product. A compound may also be annotated with additional information such as common names or database links if it matches a KEGG or PubChem entry. Reactions are uniquely identified by an ‘R’ followed by the SHA1 hash of the sorted chemical reaction. Reaction entries contain arrays of reactants and products as tuples of the stoichiometric coefficient and the compound ID, as well as a list of the operators that predicted the reaction.
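The reaction identifier scheme can be sketched in Python with the standard library. The text specifies only an ‘R’ prefix followed by the SHA1 hash of the sorted reaction; the exact string serialization used below (compounds sorted by ID, joined with " + " and " => ") and the compound IDs and operator name are assumptions for illustration, so the resulting hashes will not necessarily match those in the MINE databases.

```python
import hashlib


def reaction_id(reactants, products):
    """Build an order-independent ID from (coefficient, compound_id) tuples."""

    def serialize(side):
        # Sort by compound ID so input order does not change the hash.
        ordered = sorted(side, key=lambda pair: pair[1])
        return " + ".join(f"({coeff}) {cid}" for coeff, cid in ordered)

    text = serialize(reactants) + " => " + serialize(products)
    return "R" + hashlib.sha1(text.encode("utf-8")).hexdigest()


# Hypothetical compound IDs and operator name.
reactants = [(1, "C_b"), (1, "C_a")]
products = [(1, "C_c")]

reaction_entry = {
    "_id": reaction_id(reactants, products),
    "Reactants": reactants,
    "Products": products,
    "Operators": ["example_operator"],
}
```

Sorting before hashing means the same reaction predicted by two different operators, or listed with its reactants in a different order, receives the same identifier and can be merged into a single entry.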
Utility and discussion
Table: Comparison of MINEs generated from various source databases and other databases containing computationally predicted metabolites. Columns: original database compounds; final database compounds; compounds found in PubChem (%). Rows include Green Tea metabolites.
Web interface description
Use case: annotation of accurate mass datasets
Table: Annotation of MassBank data. Columns: correct annotation present; median number of candidates.
Finally, to demonstrate the practical utility of MINE databases, we utilized the EcoCyc MINE to annotate untargeted metabolomics data from an E. coli knockout study analyzed by LC–MS. The protocols for sample extraction, data acquisition and post-processing are available in the supplementary information. In total, 493 distinct exact MS features were extracted, 30 of which were identified following a traditional annotation workflow using NIST MSPepsearch (see Additional file 2); in contrast, the EcoCyc MINE database proposed candidates for 132 of the accurate masses when searching with 5 mDa precision and with [M+]+, [M+H]+, [M+Na]+ adducts. The resulting MINE candidates were consistent with 93% of the NIST MSPepsearch results.
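An accurate-mass search of this kind can be sketched as follows: for each adduct, the observed m/z is converted back to a neutral monoisotopic mass, and compounds within the tolerance window are returned. This is a minimal illustration rather than the MINE search implementation. The function name, the three-compound list and the restriction to the [M+H]+ and [M+Na]+ adducts are assumptions; the 0.005 Da default mirrors the 5 mDa precision mentioned above.

```python
from bisect import bisect_left, bisect_right

PROTON_MASS = 1.007276      # Da
ELECTRON_MASS = 0.000549    # Da
SODIUM_ATOM_MASS = 22.989770  # Da, monoisotopic

# Mass added to the neutral molecule to give the singly charged ion.
ADDUCT_SHIFTS = {
    "[M+H]+": PROTON_MASS,
    "[M+Na]+": SODIUM_ATOM_MASS - ELECTRON_MASS,
}


def candidates_for_mz(mz, compounds, tolerance=0.005):
    """Return (adduct, name) pairs whose neutral mass matches the observed m/z."""
    ordered = sorted(compounds)              # sort (mass, name) tuples by mass
    masses = [mass for mass, _ in ordered]
    hits = []
    for adduct, shift in ADDUCT_SHIFTS.items():
        neutral = mz - shift                 # singly charged ions only
        lo = bisect_left(masses, neutral - tolerance)
        hi = bisect_right(masses, neutral + tolerance)
        hits.extend((adduct, name) for _, name in ordered[lo:hi])
    return hits


# Illustrative compound list (monoisotopic masses in Da).
compounds = [
    (180.06339, "glucose"),
    (89.04768, "alanine"),
    (192.02700, "citric acid"),
]
sodium_hits = candidates_for_mz(203.0526, compounds)
proton_hits = candidates_for_mz(181.0707, compounds)
```

Keeping the masses sorted lets each adduct be resolved with two binary searches, which matters when the candidate list holds hundreds of thousands of predicted structures.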
In addition to expanding the scope of the metabolome, the MINE framework also offers a pipeline for illuminating the synthesis and degradation of poorly annotated secondary metabolites. While applied very broadly to nearly all of metabolism in this study, BNICE expansions may be focused on a region of interest in the metabolic network by adjusting the starting compounds and permissible transformations in a manner similar to that recently demonstrated by Ridder et al. These targeted MINEs will integrate the generation of plausible pathways by BNICE with the tools to detect the presence of predicted pathway intermediates with accurate mass spectrometry, thereby accelerating the process of proposing and evaluating hypothetical enzymatic synthesis routes for a number of compounds of interest.
Here we have presented Metabolic In silico Network Expansions (MINEs) that utilize generalized biochemical transformations to propose structures for use in untargeted metabolomics. The resulting compounds are rarely found in PubChem but are structurally similar to natural products. We have demonstrated the utility of these databases for proposing correct metabolite structures that stymied a standard annotation workflow. MINE data are accessible without licensing restrictions for non-commercial users through a user-friendly web interface and through an API for developers in several common scripting languages.
Availability and requirements
MINE databases are freely accessible at http://minedatabase.mcs.anl.gov and API clients are available at https://github.com/JamesJeffryes/MINE-API. There are no restrictions for academic use. Commercial users must obtain a license from Pathway Solutions Inc. (www.pathway.jp) and explicit permission from the authors.
Abbreviations
MINEs: Metabolic In Silico Network Expansions
LC–MS: liquid chromatography–mass spectrometry
BNICE: Biochemical Network Integrated Computational Explorer
KEGG: Kyoto Encyclopedia of Genes and Genomes
IUPAC: International Union of Pure and Applied Chemistry
InChI: IUPAC International Chemical Identifier
API: Application Programming Interface
JGJ and CSH conceived the databases. JGJ constructed the databases & implemented the API. JGJ and RLC built the web application. ME collected and analyzed the LC–MS/MS data. ME and TK validated the database. TDN and ADH provided the biological samples tested. ADH, OF, TK, LJB, KEJT and CSH advised on database construction. JGJ, ME, KEJT, and CSH wrote the paper. All authors read and approved the final manuscript.
Additional data files
List of MassBank compounds used for validation.
List of experimental compounds used for validation.
The authors would like to thank Dr. John Meissen for helpful input on mass spectrometry. The authors also thank Dante Pertusi, Trang Vu, and Jennifer Greene for helpful discussions and Matthew Moura for the use of his operator creation figure.
This work was funded by the US National Science Foundation [MCB-1153357 (to C. H.), MCB-1153413 (to A. H.), and MCB-1153491 (to O. F.)], the US Department of Energy as part of the DOE Systems Biology Knowledgebase (P/ANL2013-194 to C. H.) and the National Institutes of Health (U24 DK097154 to O.F.).
Compliance with ethical guidelines
Competing interests The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Patti GJ, Yanes O, Siuzdak G (2012) Innovation: metabolomics: the apogee of the omics trilogy. Nat Rev Mol Cell Biol 13:263–269
- Dromms R, Styczynski M (2012) Systematic applications of metabolomics in metabolic engineering. Metabolites 2:1090–1122
- Roux A, Lison D, Junot C, Heilier J-F (2011) Applications of liquid chromatography coupled to mass spectrometry-based metabolomics in clinical chemistry and toxicology: a review. Clin Biochem 44:119–135
- Guertin KA, Moore SC, Sampson JN, Huang W-Y, Xiao Q, Stolzenberg-Solomon RZ (2014) Metabolomics in nutritional epidemiology: identifying metabolites associated with diet and quantifying their potential to uncover diet-disease relations in populations. Am J Clin Nutr ajcn.113.078758
- Scalbert A, Brennan L, Fiehn O, Hankemeier T, Kristal BS, van Ommen B et al (2009) Mass-spectrometry-based metabolomics: limitations and recommendations for future progress with particular focus on nutrition research. Metabolomics 5:435–458
- Stein S (2012) Mass spectral reference libraries: an ever-expanding resource for chemical identification. Anal Chem 84:7274–7282
- Heinonen M, Shen H, Zamboni N, Rousu J (2012) Metabolite identification and molecular fingerprint prediction through machine learning. Bioinformatics 28:2333–2341
- Menikarachchi LC, Cawley S, Hill DW, Hall LM, Hall L, Lai S et al (2012) MolFind: a software package enabling HPLC/MS-based identification of unknown chemical structures. Anal Chem 84:9388–9394
- Wang Y, Kora G, Bowen B, Pan C (2014) MIDAS: a database-searching algorithm for metabolite identification in metabolomics. Anal Chem 86:9496–9503
- Wolf S, Schmidt S, Müller-Hannemann M, Neumann S (2010) In silico fragmentation for computer assisted identification of metabolite mass spectra. BMC Bioinform 11:148
- Kind T, Liu K-H, Lee DY, DeFelice B, Meissen JK, Fiehn O (2013) LipidBlast in silico tandem mass spectrometry database for lipid identification. Nat Methods 10:755–758
- Schymanski E, Neumann S (2013) CASMI: and the winner is… Metabolites 3:412–439
- Shen H, Zamboni N, Heinonen M, Rousu J (2013) Metabolite identification through machine learning: tackling CASMI challenge using FingerID. Metabolites 3:484–505
- Matsuda F (2014) Rethinking mass spectrometry-based small molecule identification strategies in metabolomics. Mass Spectrom 3:S0038
- Menikarachchi LC, Hill DW, Hamdalla MA, Mandoiu II, Grant DF (2013) In silico enzymatic synthesis of a 400,000 compound biochemical database for nontargeted metabolomics. J Chem Inf Model 53:2483–2492
- Nam H, Lewis NE, Lerman JA, Lee D-H, Chang RL, Kim D et al (2012) Network context and selection in the evolution to enzyme specificity. Science 337:1101–1104
- Bar-Even A, Noor E, Savir Y, Liebermeister W, Davidi D, Tawfik DS et al (2011) The moderately efficient enzyme: evolutionary and physicochemical trends shaping enzyme parameters. Biochemistry 50:4402–4410
- Weng J-K, Philippe RN, Noel JP (2012) The rise of chemodiversity in plants. Science 336:1667–1670
- Fiehn O, Barupal DK, Kind T (2011) Extending biochemical databases by metabolomic surveys. J Biol Chem 286:23637–23643
- O’Brien P, Herschlag D (1999) Catalytic promiscuity and the evolution of new enzymatic activities. Chem Biol 6:R91–R105
- Sánchez-Moreno I, Iturrate L, Martín-Hoyos R, Jimeno ML, Mena M, Bastida A et al (2009) From kinase to cyclase: an unusual example of catalytic promiscuity modulated by metal switching. ChemBioChem 10:225–229
- Gao J, Ellis LBM, Wackett LP (2011) The University of Minnesota Pathway Prediction System: multi-level prediction and visualization. Nucleic Acids Res 39(Web Server issue):W406–W411
- Moriya Y, Shigemizu D, Hattori M, Tokimatsu T, Kotera M, Goto S (2010) PathPred: an enzyme-catalyzed metabolic pathway prediction server. Nucleic Acids Res 38(Web Server issue):W138–W143
- Henry CS, Broadbelt LJ, Hatzimanikatis V (2010) Discovery and analysis of novel metabolic pathways for the biosynthesis of industrial chemicals: 3-hydroxypropanoate. Biotechnol Bioeng 106:462–473
- Li L, Li R, Zhou J, Zuniga A, Stanislaus AE, Wu Y et al (2013) MyCompoundID: using an evidence-based metabolome library for metabolite identification. Anal Chem 85:3401–3408
- Foster JM, Moreno P, Fabregat A, Hermjakob H, Steinbeck C, Apweiler R et al (2013) LipidHome: a database of theoretical lipids optimized for high throughput mass spectrometry lipidomics. PLoS One 8:1–8
- Ridder L, van der Hooft JJJ, Verhoeven S, De Vos RCH, Vervoort J, Bino RJ (2014) In silico prediction and automatic LC–MSn annotation of green tea metabolites in urine. Anal Chem 140411210700006
- Morreel K, Saeys Y, Dima O, Lu F, Van de Peer Y, Vanholme R et al (2014) Systematic structural characterization of metabolites in arabidopsis via candidate substrate-product pair networks. Plant Cell 26:tpc.113.122242
- González-Lergier J, Broadbelt LJ, Hatzimanikatis V (2005) Theoretical considerations and computational analysis of the complexity in polyketide synthesis pathways. J Am Chem Soc 127:9930–9938
- Henry CS, Jankowski MD, Broadbelt LJ, Hatzimanikatis V (2006) Genome-scale thermodynamic analysis of Escherichia coli metabolism. Biophys J 90:1453–1461
- Mu F, Unkefer CJ, Unkefer PJ, Hlavacek WS (2011) Prediction of metabolic reactions based on atomic and molecular properties of small-molecule compounds. Bioinformatics 27:1537–1545
- De Groot MJL, Van Berlo RJP, Van Winden WA, Verheijen PJT, Reinders MJT, De Ridder D (2009) Metabolite and reaction inference based on enzyme specificities. Bioinformatics 25:2975–2982
- Frelin O, Huang L, Hasnain G, Jeffryes JG, Ziemak MJ, Rocca JR et al (2015) A directed-overflow and damage-control N-glycosidase in riboflavin biosynthesis. Biochem J 466:137–145
- Kumar A, Suthers PF, Maranas CD (2012) MetRxn: a knowledgebase of metabolites and reactions spanning metabolic models and databases. BMC Bioinform 13:6
- Lang M, Stelzer M, Schomburg D (2011) BKM-react, an integrated biochemical reaction database. BMC Biochem 12:42
- Kanehisa M, Goto S, Sato Y, Kawashima M, Furumichi M, Tanabe M (2014) Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res 42:D199–D205
- Jewison T, Knox C, Neveu V, Djoumbou Y, Guo AC, Lee J et al (2012) YMDB: the yeast metabolome database. Nucleic Acids Res 40(Database issue):D815–D820
- Keseler IM, Mackie A, Peralta-Gil M, Santos-Zavaleta A, Gama-Castro S, Bonavides-Martínez C et al (2013) EcoCyc: fusing model organism databases with systems biology. Nucleic Acids Res 41(Database issue):D605–D612
- O’Boyle NM, Morley C, Hutchison GR (2008) Pybel: a python wrapper for the OpenBabel cheminformatics toolkit. Chem Cent J 2:5
- Altman T, Travers M, Kothari A, Caspi R, Karp PD (2013) A systematic comparison of the MetaCyc and KEGG pathway databases. BMC Bioinform 14:112
- Heller S, McNaught A, Stein S, Tchekhovskoi D, Pletnev I (2013) InChI: the worldwide chemical structure identifier standard. J Cheminform 5:7
- Jayaseelan KV, Moreno P, Truszkowski A, Ertl P, Steinbeck C (2012) Natural product-likeness score revisited: an open-source, open-data implementation. BMC Bioinform 13:106
- Stein SE, Babushok VI, Brown RL, Linstrom PJ (2007) Estimation of kovats retention indices using group contributions. J Chem Inf Model 47:975–980
- Bolton E, Wang Y, Thiessen P, Bryant S (2008) PubChem: integrated platform of small molecules and biological activities. Annu Rep 4:217–241
- Weininger D, Weininger A, Weininger JL (1989) SMILES. 2. Algorithm for generation of unique SMILES notation. J Chem Inf Model 29:97–101
- Fenner K, Gao J, Kramer S, Ellis L, Wackett L (2008) Data-driven extraction of relative reasoning rules to limit combinatorial explosion in biodegradation pathway prediction. Bioinformatics 24:2079–2085
- Horai H, Arita M, Kanaya S, Nihei Y, Ikeda T, Suwa K et al (2010) MassBank: a public repository for sharing mass spectral data for life sciences. J Mass Spectrom 45:703–714