An efficient computer-aided structural elucidation strategy for mixtures using an iterative dynamic programming algorithm
- Bo-Han Su^{1},
- Meng-Yu Shen^{1},
- Yeu-Chern Harn^{2},
- San-Yuan Wang^{1},
- Alioune Schurz^{3},
- Chieh Lin^{1},
- Olivia A. Lin^{3} and
- Yufeng J. Tseng^{1, 3}
Received: 21 August 2017
Accepted: 1 November 2017
Published: 15 November 2017
Abstract
The identification of chemical structures in natural product mixtures is an important task in drug discovery but remains challenging, as structural elucidation is time-consuming and is limited by the available mass spectra of known natural products. Computer-aided structure elucidation (CASE) strategies seek to automatically propose a list of possible chemical structures in mixtures by utilizing chromatographic and spectroscopic methods. However, current CASE tools still cannot automatically solve structures for experienced natural product chemists. Here, we formulated the structural elucidation of natural products in a mixture as a computational problem by extending a list of scaffolds using a weighted side chain list after analyzing a collection of 243,130 natural products, and we designed an efficient algorithm to precisely identify the chemical structures. This problem is NP-complete. A dynamic programming (DP) algorithm can solve it in pseudo-polynomial time after converting floating point molecular weights into integers. However, the running time of the DP algorithm grows exponentially with the number of decimal places of precision in the mass spectrometry experiment. To solve the problem in polynomial time for the average case, we propose a novel iterative DP algorithm that can quickly recognize the chemical structures of natural products. When this algorithm was used to elucidate the structures of four natural products that had been experimentally and structurally determined, it found the exact solutions, and its running time was polynomial for average cases. The proposed method improves the speed of the structural elucidation of natural products and helps broaden the spectrum of available compounds that could be applied as new drug candidates. A web service built for structural elucidation studies is freely accessible via the following link (http://csccp.cmdm.tw/).
Background
Examining natural and therapeutic products is crucial for drug development because many chemically synthesized compounds have potentially serious toxicity and adverse effects, while less toxic compounds extracted from natural products could possibly be developed into new drug candidates [1]. In addition, natural products often open new chemical spaces not explored by synthetic compounds produced by combinatorial chemistry and can further expand the diversity and novelty of molecules by extracting different natural sources, such as the deep and cold seas [2, 3]. A review by Newman and Cragg [2] indicated that 47% of new anti-cancer drugs from 1950 to 2006 were originally from or derived from natural products. Recently, Butler et al. [3] reviewed 100 natural products and natural products-derived compounds that were either evaluated in clinical trials or in registration at the end of 2013. They concluded that 50% of the compounds were natural products or semi-synthetic natural products, while the remaining compounds were classified as natural products-derived compounds. The exploration of new lead compounds from natural products and their successful development into clinical trials will continue to be a significant trend in drug discovery over the next few years.
However, natural products-based drug discovery faces many challenges [4], and the exploration of natural products for new drug development was actually disfavored by the pharmaceutical industry in the 2000s [5]. One of the major hurdles is the extremely time-consuming processes involved in the isolation and structural elucidation of bioactive compounds from natural products composed of complicated mixtures. Because the magnitude of the natural products database is limited, high-throughput screening methods cannot be used to effectively identify potential natural products drugs. Many advances in mass spectrometry (MS) and nuclear magnetic resonance (NMR) automation techniques over the last two decades have accelerated structural elucidation processes for complex natural products mixtures. MS is a common tool used to identify elemental constituents of a molecule. MS data can provide the molecular weight, fragmentation pattern, and molecular formula, which can then be matched to structures. Current advances in MS instruments can provide high-resolution molecular weight measurements [6] and reduce the total number of overlapping m/z values. However, MS data itself is insufficient to determine the structure of a partially or completely unknown compound [7–9]. On the other hand, NMR methods can give a spectroscopic overview of compounds. Although high-resolution and high-dimensional NMR methods have undergone continual advancement [10–12], NMR still cannot independently elucidate novel chemical structures unless co-eluting compounds can be completely separated [8]. Even though LC–NMR–MS [13] and LC–UV–solid-phase extraction–NMR–MS [9] have proven to be effective methods to elucidate compound structures in natural products extracts, the successful structural elucidation of unknown compounds still greatly depends on the development of computational systems to help evaluate the mass spectral data [14].
Computer-aided structure elucidation (CASE), developed 40 years ago [15–18], is a well-known computational approach that can accelerate the process of identifying possible chemical constituents based on expert systems. To fully automate structural elucidation via MS and NMR techniques, different advanced algorithms have been applied in CASE [19–23]. However, the many limitations [24] of CASE expert systems still hinder the creation of fully automatic processes for structural elucidation [25]. One of the major restrictions is the requirement of 2D NMR data as input in these CASE systems [26]. 2D NMR experiments are intrinsically insensitive [27] and extremely time-consuming, especially for the extraction of ^{13}C nucleus peak shifts [25]. Moreover, inaccurate structural elucidation may result from the co-elution of compounds that cannot be completely separated. Therefore, current traditional CASE tools based on 2D NMR spectra still cannot meet the structure-solving needs of experienced natural product chemists [4]. Using MS to develop CASE systems can provide more significant benefits than NMR-based CASE systems, as MS is more sensitive, and the amount of unknown structures in the mixtures needed for analysis is smaller. Furthermore, MS is not usually affected by impurities in the input mixture [28].
Harn et al. [29] proposed a novel CASE algorithm, known as NP-StructurePredictor, to predict individual components in a natural product mixture strictly using information obtained from LC–MS experiments. The purpose of NP-StructurePredictor is to generate possible chemical structures to aid in the identification or structural elucidation of unknown natural products. This can be achieved by matching a list of known scaffolds with a list of weighted side chains to produce a list of possible compounds under specified structural constraints. In this study, we convert this structural elucidation problem into a formal mathematical formulation and refer to it as the Chemical Substituents-Core Combinatorial Problem (CSCCP). Since the CSCCP is NP-complete, no deterministic algorithm is known that finds the optimal solutions (valid structures) in polynomial time, and an exhaustive search requires exponential time in the worst case. Thus, brute force (BF) algorithms, which search all possible answers, cannot solve the CSCCP in a reasonable timeframe. In NP-StructurePredictor, a branch and bound (BB) strategy was applied to search for and generate a set of optimal solutions. Nevertheless, the BB strategy has its limitations as well: (1) although the execution time is shorter, in many cases the algorithm still cannot finish in a reasonable timeframe; (2) it is difficult to analyze the stability and accuracy of the BB method; and (3) the experimental execution time in real cases is unstable and poor for complex mixtures, sometimes even as slow as that of the BF algorithm. To finish in a reasonable execution time, the BB algorithm used in NP-StructurePredictor also limits the number of combinatorial candidates of the side chains for each scaffold. In such cases, NP-StructurePredictor cannot find the optimal solutions.
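To make the cost of exhaustive search concrete, the following is a minimal brute-force sketch in Python, not the implementation used in NP-StructurePredictor. It assumes each substituted position is given as a list of hypothetical `(weight, probability)` side chains and that the selected weights must sum to a target value; the names `brute_force`, `search_space_size`, and the toy data are illustrative only.

```python
from itertools import product
from math import prod

def search_space_size(positions):
    """Number of side-chain combinations a brute-force search must visit."""
    return prod(len(chains) for chains in positions)

def brute_force(positions, target, r_max, tol=1e-6):
    """Enumerate every combination and keep the r_max most probable ones
    whose side-chain weights sum to the target (within tol)."""
    hits = []
    for picks in product(*(range(len(chains)) for chains in positions)):
        weight = sum(positions[i][x][0] for i, x in enumerate(picks))
        if abs(weight - target) <= tol:
            prob = prod(positions[i][x][1] for i, x in enumerate(picks))
            hits.append((prob, picks))
    hits.sort(key=lambda h: h[0], reverse=True)
    return hits[:r_max]

# Hypothetical toy input: three substituted positions with (weight, probability) side chains.
toy_positions = [
    [(1, 0.6), (15, 0.3), (17, 0.1)],
    [(1, 0.5), (15, 0.5)],
    [(1, 0.7), (17, 0.3)],
]
toy_size = search_space_size(toy_positions)
toy_hits = brute_force(toy_positions, target=33, r_max=2)
```

The search space is the product of the number of side chains at each position, so it grows exponentially with the number of substituted positions, which is why this approach does not scale to real scaffolds.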
In this study, to further improve the performance of searching for all optimal solutions of the CSCCP, we first present a pseudo-polynomial time algorithm based on classical dynamic programming (DP) strategies that can effectively and accurately search for all correct structures in a natural product mixture. The DP strategy is based on the method used by Ibarra and Kim [30]. However, because this is a pseudo-polynomial time algorithm, the time performance is limited by the required precision of the molecular weight, which is between four and six decimal places. We then propose an iterative DP algorithm that can be executed in polynomial time for the average case. Four complex herbs with verified structures, previously used in the study of NP-StructurePredictor, were tested, and all were successfully predicted by this method. We compared the time performance of our algorithm with that of NP-StructurePredictor. For all cases, our iterative DP algorithm outperformed the BF algorithm, the classical DP algorithm, and the BB program in NP-StructurePredictor; it also ran to completion in a reasonable time and provided optimal solutions. We thus developed a new, efficient CASE strategy that can accurately predict the possible structures of compounds in mixtures based only on information obtained from LC–MS experiments.
Results and discussion
The identification or prediction of the main chemical components present in a natural product mixture with traditional chromatographic methods is time-consuming. The limited number of reference compounds also makes the identification or prediction of each constituent in a mixture more difficult. Hence, an efficient algorithm to solve the CSCCP is needed.
This section reports the simulation results of the DP algorithms developed in this paper. Two traditional methods (BF and BB) were also implemented and compared with the DP algorithms in terms of quality and time performance. All four algorithms were implemented in Java (JDK version 7) and tested on a Linux PC with a 2.40 GHz Intel Xeon(R) CPU and 32 GB of memory. Four types of natural products were used in our simulation: Cuscuta chinensis (C. chinensis), Ophiopogon japonicus (O. japonicus), Polygonum multiflorum (P. multiflorum), and Angelica sp. According to the experimental identification procedures from the Natural Product Laboratory of the Taiwan Medical and Pharmaceutical Industry Technology and Development Center as well as investigations from earlier publications, the main structures of each herb have been established and were used as a validation set for the evaluation of our simulation results. The time complexity analysis of our new CASE algorithms, the quality of the search results, and the running time performance compared to those of the traditional methods for the four cases are discussed in the following subsections.
Targeted MWs and seed scaffolds are necessary inputs to our algorithm. To obtain possible targeted MWs, LC–MS experiments for the tested natural products were performed beforehand. Targeted MWs can then be retrieved from the peak tables generated in the LC–MS experiments. In our cases, we used MAVEN [31] or XCMS [32] to extract peak tables and then retrieved all possible molecular weights as input targeted MWs. To identify the main structures of each tested natural product, the obtained data, including parent and daughter ion patterns, were compared with the compound spectra of similar medicinal herbs in earlier publications or databases. This step resulted in the preliminary identification of the five highest-intensity peaks in our cases. These peaks can be validated against known standard compounds analyzed under the same LC conditions by comparing and matching the retention times and MS/MS spectra. These identified main structures were used as our validation sets. The core structures that remain after all terminal side chains of the identified structures are removed can be used directly as the seed scaffolds in our tested cases. The input targeted MWs and seed scaffolds of the four natural products are listed in the following sections.
The choice of the seed scaffolds is likely to influence the outcome of the structural elucidation. In real cases, users can roughly identify natural products whose compound spectra are similar to the high-intensity peaks measured in the preliminary LC–MS experiments. Because those identified structures can be regarded as potential candidates for the individual components in the test materials, the core structures of these candidates can be directly used as seed scaffolds in our algorithm. Furthermore, when users cannot provide the seed scaffolds of the test material, our algorithm is capable of performing a full search on all 83,242 possible scaffolds that we collected. The full search strategy can automatically generate suitable candidates and run to completion in a reasonable timeframe.
Structural elucidation of the four natural products using the iterative DP algorithm
List of the target molecular weights for the four tested natural products
Case name | Target weight lists |
---|---|
C. chinensis | 286.24, 302.24, 354.31, 478.41 |
O. japonicus | 328.32, 342.35, 356.33, 370.36 |
P. multiflorum | 270.24, 284.27, 290.27, 406.39, 432.38, 578.53 |
Angelica sp. | 162.03, 186.03, 192.04, 202.03, 216.04, 230.09, 244.11, 246.05, 246.09, 246.09, 270.09, 286.08, 288.10, 300.10, 316.09, 328.13, 334.11, 354.15, 360.08, 360.16, 366.22, 374.14, 376.15, 378.17, 386.14, 388.15, 402.13, 414.17, 414.20, 426.17, 426.17, 428.18, 546.26, 574.29 |
Number of different types of possible sets of substituted positions (\( {\mathbf{N}}_{{\mathbf{k}}} \)) for each tested natural product
Case name | (Scaffold number, \( {\text{N}}_{\text{k}} \)) |
---|---|
C. chinensis | (1, 13), (2, 231) |
O. japonicus | (1, 4), (2, 6), (3, 31) |
P. multiflorum | (1, 68), (2, 28), (3, 68), (4, 81) |
Angelica sp. | (1, 58), (2, 28), (3, 8), (4, 15), (5, 11) |
Analysis of time complexity in the DP and iterative DP algorithms
In real cases, R is a constant. If \( K^{\prime} = \max_{i = 1..n} K_{i} \), the total time complexity can be bounded by \( O\left( nW_{0}^{\prime} K^{\prime} R\log K^{\prime} R \right) \) by setting every \( K_{i} \) equal to \( K^{\prime} \). However, DPforCSCCP is not a true polynomial time algorithm. Because each \( W_{0} \) is converted into an integer by multiplying by 10^{6} in the DP algorithm, the complexity becomes \( O\left( 10^{6} \times nW_{0} K^{\prime} R\log K^{\prime} R \right) \). Although \( W_{0}^{\prime} \) is a single integer, its value \( 10^{6} W_{0} \) may be exponentially larger than \( nK^{\prime} \), because the running time grows with the magnitude of this number rather than with the number of digits needed to write it. Therefore, the dominant cost of DPforCSCCP is the cost associated with the input \( 10^{6} W_{0} \), and DPforCSCCP is a pseudo-polynomial time algorithm.
We then analyzed the time complexity of the IDPforCSCCP algorithm. As shown in Fig. 5, the body of the while loop in the program is the same as that in the DPforCSCCP algorithm, and it executes in \( O\left( nW^{\prime\prime} R\bar{K}\log R\bar{K} \right) \) time, where \( \bar{K} \) is the average of \( K_{i} \). We first estimate the average value of the target weight of the natural products. In our collected natural products database, a total of 82,847 natural products contain ring structures, and only these compounds were considered in this study. The average total molecular weight of all possible maximal substituents in each compound is \( 89.42 \), where for a given scaffold, the maximal substituent is the side chain with the maximal molecular weight. On average, we can therefore assume that the variable \( W^{\prime\prime} \) is a constant. In an average case, the time complexity of the DP step in the while loop of IDPforCSCCP thus reduces to \( O(89nR\bar{K}\log R\bar{K}) \). The rest of the while loop in IDPforCSCCP requires only \( O(R) \) to check the size and the weight, and thus, we can ignore this step. Finally, we considered the number of executed while loops. For each while loop, R is multiplied by 10. In the first while loop, the program executes in \( O(89n10R^{\prime} \bar{K}\log 10R^{\prime} \bar{K}) \); in the second while loop, the program executes in \( O(89n10^{2} R^{\prime} \bar{K}\log 10^{2} R^{\prime} \bar{K}) \); and so on. Assuming that the while loop is executed L times, the total time complexity of IDPforCSCCP then becomes \( \sum\nolimits_{i = 1}^{L} \left( 89n10^{i} R^{\prime}\bar{K}\log 10^{i} R^{\prime}\bar{K} \right) \) for the average case. When L is sufficiently small, the CSCCP can be solved in polynomial time, on average, using our iterative DP algorithm, IDPforCSCCP. We discuss the value of L in the subsequent section.
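As a worked step, the sum over the L while loops can be bounded in closed form. The following is a sketch that assumes \( R^{\prime} \) denotes the initial value of R and uses only the fact that each logarithmic factor is at most the one from the last loop:

\[
\sum_{i=1}^{L} 89\,n\,10^{i} R^{\prime}\bar{K}\,\log\left(10^{i} R^{\prime}\bar{K}\right)
\;\le\; 89\,n R^{\prime}\bar{K}\,\log\left(10^{L} R^{\prime}\bar{K}\right)\sum_{i=1}^{L} 10^{i}
\;=\; 89\,n R^{\prime}\bar{K}\,\log\left(10^{L} R^{\prime}\bar{K}\right)\cdot\frac{10^{L+1}-10}{9}.
\]

For a fixed small L, the factor \( (10^{L+1}-10)/9 \) is a constant (10 for L = 1, 110 for L = 2, 1110 for L = 3), so the bound stays polynomial in \( n \) and \( \bar{K} \).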
We noted that in the previous DPforCSCCP algorithm, even if \( W^{{\prime }} \) is set to \( 89 \times 10^{D} \), the time complexity is still near exponential. Because \( W^{{\prime }} \) has to be converted into an integer number and the precision number D is typically set to at least 6, the time complexity of the DPforCSCCP algorithm can only be reduced to \( O\left( {10^{6} \times 89nR\bar{K}\log R\bar{K}} \right) \) for the average case.
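The effect of the precision D on the size of the DP table can be illustrated with a short Python sketch. It assumes, as described above, that a floating point weight is converted to an integer by multiplying by \( 10^{D} \); the function name `dp_weight_columns` is hypothetical and the value 89.42 is the average substituent weight quoted in the text.

```python
def dp_weight_columns(weight: float, decimals: int) -> int:
    """Number of integer weight states after scaling the weight by 10**decimals."""
    return round(weight * 10 ** decimals)

AVERAGE_SUBSTITUENT_WEIGHT = 89.42  # average reported in the text

# Each extra decimal place multiplies the number of weight states by ten.
columns_by_precision = {
    d: dp_weight_columns(AVERAGE_SUBSTITUENT_WEIGHT, d) for d in range(0, 7)
}
for d, cols in columns_by_precision.items():
    print(f"D={d}: {cols} weight columns")
```

Because the DP visits every weight state for every substituted position, the running time scales with this column count, which is why a fixed D of 6 dominates the cost of the classical DP.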
Running iterations of the while loop executed using the IDPforCSCCP algorithm for our four tested herbs
Tested herbs | # of running iterations executed within one loop | # of running iterations executed in exactly 2 loops | # of running iterations executed in exactly 3 loops |
---|---|---|---|
C. chinensis | 975 (99.9%) | 1 (0.1%) | 0 (0%) |
O. japonicus | 164 (100%) | 0 (0%) | 0 (0%) |
P. multiflorum | 1338 (98.5%) | 16 (1.2%) | 4 (0.3%) |
Angelica sp. | 4084 (100%) | 0 (0%) | 0 (0%) |
Assuming that the number of while loop iterations is only 1, the time complexity of the entire IDPforCSCCP algorithm only requires \( O\left( {89n\bar{K}R^{{\prime }} \log \bar{K}R^{{\prime }} } \right) \) time for the average case, where \( \bar{K} = (\mathop \sum \nolimits_{i = 1}^{n} K_{i} )/n \) and \( R^{{\prime }} \) is 100. Thus, the time complexity of IDPforCSCCP can be further reduced to \( O\left( {8900n\bar{K}\log 100\bar{K}} \right) \) for the average case. Therefore, our novel CASE procedure developed using the iterative DP algorithm can automatically elucidate the unknown structures in complex mixtures within reasonable polynomial time for the average case.
Time performance comparison with traditional algorithms
Time performance comparison between the BF, BB, DPforCSCCP and IDPforCSCCP algorithms (execution time in seconds, with minutes in parentheses)
Cases | BF | BB | DPforCSCCP | IDPforCSCCP |
---|---|---|---|---|
C. chinensis | > 3 years | 914 (15.2) | 3757 (62.6) | 17 (0.3) |
O. japonicus | 177 (3.0) | 141 (2.4) | 12 (0.2) | 0.75 (0.01) |
P. multiflorum | > 6 years | > 1 day | 1079 (18.0) | 67 (1.1) |
Angelica sp. | > 80 days | ~ 1 day | 3919 (65.3) | 188 (3.1) |
With our DPforCSCCP algorithm, the program finished within 1.5 h for all cases. In the case of P. multiflorum, which is the most complex example, our DP algorithm reduced the execution time from more than 6 years with BF to 1079 s (18 min). The DP algorithm is also much faster than the BB algorithm for all tested cases except C. chinensis. For C. chinensis, the BB algorithm is 4 times faster than the DPforCSCCP algorithm, whereas for O. japonicus, P. multiflorum, and Angelica sp., the DPforCSCCP algorithm is 12–80 times faster than the BB algorithm. For the extreme case of P. multiflorum, the BB algorithm requires over 1 day, whereas the DPforCSCCP algorithm needs only 1079 s (18 min). This result is consistent with DPforCSCCP running in pseudo-polynomial time and shows that it is faster than the BB algorithm in most of the tested cases. However, as \( D \) increases, the execution time for DPforCSCCP may become longer than that for the BB algorithm or even the BF algorithm (Lemma 5). The iterative DP algorithm, IDPforCSCCP, can perform the structural elucidation without setting the parameter D. To confirm that the IDPforCSCCP algorithm outperforms the DPforCSCCP algorithm in the four tested cases, we present the execution time of the IDPforCSCCP algorithm for the four natural products in Table 6. The execution times range from 12.18 to 3919.70 s (0.2–65.3 min) for DPforCSCCP and from 0.75 to 188.44 s (0.01–3.1 min) for IDPforCSCCP. IDPforCSCCP finished within 4 min in all cases. On average, IDPforCSCCP is 67 times faster than DPforCSCCP. For the extreme case of P. multiflorum, the IDPforCSCCP algorithm reduced the execution time from more than 6 years with BF to close to 1 min. Our iterative DP algorithm is much more efficient than the BF, BB, and our original DP algorithms.
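The per-case speedups behind these figures can be recomputed from the execution times in the comparison table. The short Python sketch below uses the rounded values from the table (in seconds), so its mean is only approximately equal to the figure quoted in the text, which was presumably computed from unrounded timings; the variable names are illustrative.

```python
# (DPforCSCCP seconds, IDPforCSCCP seconds), rounded values from the comparison table
times = {
    "C. chinensis": (3757, 17),
    "O. japonicus": (12, 0.75),
    "P. multiflorum": (1079, 67),
    "Angelica sp.": (3919, 188),
}

# Speedup of the iterative DP over the classical DP for each herb
speedups = {case: dp / idp for case, (dp, idp) in times.items()}
mean_speedup = sum(speedups.values()) / len(speedups)

for case, factor in speedups.items():
    print(f"{case}: {factor:.1f}x")
print(f"mean: {mean_speedup:.1f}x")
```

The mean is dominated by C. chinensis, where the speedup is roughly 220-fold; for the other three herbs, the iterative algorithm is about 16 to 21 times faster.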
Several approaches, including the BF, BB, DPforCSCCP and IDPforCSCCP algorithms, were applied to solve the CASE problem in this study. The BF and BB algorithms blindly check all combinatorial candidates for the substituted positions in the scaffold, while the two DP algorithms, DPforCSCCP and IDPforCSCCP, formulate the CSCCP in terms of a cost equation and save the solution of each sub-problem for the effective generation of optimal solutions. The IDPforCSCCP algorithm can prune a large number of side chain combinations when identifying the main structures in natural products, without being limited by the number of decimal places of the mass, thus further accelerating the identification procedures. IDPforCSCCP thereby brings the spectroscopist's goal of fully automated structural elucidation closer.
Improvement in the structural elucidation results using the iterative DP algorithm
Number of structures identified by the branch and bound and IDPforCSCCP algorithm for the four different targeted MWs of C. chinensis with the given seed scaffolds
Targeted MWs | Scaffold 1: number of identified structures (branch and bound with \( N_{k} = 5 \), IDPforCSCCP) | Scaffold 2: number of identified structures (branch and bound with \( N_{k} = 5 \), IDPforCSCCP) |
---|---|---|
286.24 | (3, 4) | (149, 1936) |
302.24 | (12, 24) | (196, 1802) |
354.31 | (3, 29) | (300, 2006) |
478.41 | (18, 116) | (300, 1995) |
To demonstrate the actual improvement in the structural elucidation results using the iterative DP algorithm, we compared the identification results of the structural elucidation of Angelica sp. Harn's CASE tool could not correctly predict twelve structures out of a total of forty-five main compounds in this case. With our IDPforCSCCP algorithm, the overall prediction rate increased to 82%. Four additional structures, Imperatorin, Isoimperatorin, Umbelliprenin, and Ostruthol, were correctly identified in our study (Additional file 1: Table S3). IDPforCSCCP thus improved both the time performance and the prediction accuracy for the structural elucidation of natural products. However, IDPforCSCCP still failed to predict some main structures because our system can only use the collected side chains to construct possible structures on the scaffolds. If additional common or structurally related side chains were manually added to our database, IDPforCSCCP would be able to correctly predict these structures.
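The prediction rates implied by these counts can be checked directly, assuming the four additional structures are among the twelve that the earlier tool missed:

\[
\frac{45 - 12}{45} = \frac{33}{45} \approx 73.3\%, \qquad \frac{33 + 4}{45} = \frac{37}{45} \approx 82.2\%.
\]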
Conclusions
For the structural elucidation of complex natural products, we defined a new problem, the CSCCP, based only on the information obtained from MS spectra. Theoretically, solving the CSCCP requires exponential time in the worst-case scenario. We designed a novel CASE algorithm by applying a classical DP algorithm to search for the optimal solutions, with pseudo-polynomial time complexity. However, the higher the required precision of the molecular weight, the higher the time complexity of our DP algorithm, which can even reach exponential time. We thus developed an iterative DP algorithm that can be executed in polynomial time for the average case. Four real natural products, C. chinensis, O. japonicus, P. multiflorum, and Angelica sp., were used in our study to verify the results and time performance of our algorithm. The execution time was compared with that of the BF and BB algorithms, and our iterative DP algorithm outperformed both. In addition, we correctly elucidated the structures in the herbs taken from a previous study. Because our algorithm is more efficient than the other algorithms, we could search for the true optimal solutions in an acceptable time in the four tested cases without the limitation, described in previous studies, on the number of combinatorial candidates of the substituted positions in each scaffold. The proposed efficient algorithm provides a new tool for spectroscopists to aid in the structural elucidation of unknown complex mixtures when only MS spectral information is known. A web service built for structural elucidation is freely accessible (http://csccp.cmdm.tw/).
Methods
Data Set
Four types of herbs were used to validate the accuracy and time efficiency of our prediction method: Cuscuta chinensis (C. chinensis), Ophiopogon japonicus (O. japonicus), Polygonum multiflorum (P. multiflorum), and Angelica sp. A list of possible scaffolds (seed scaffolds) for the tested herbs obtained from the preliminary LC–MS identification procedures from the Natural Product Laboratory of the Taiwan Medical and Pharmaceutical Industry Technology and Development Center (PITDC) is shown in Fig. 1. A preliminary identification of the high-intensity mass peaks was also performed, and the possible molecular weights (targeted MWs) of the tested herbs derived from the peak tables are listed in Table 1. The possible seed scaffolds and the list of targeted MWs were used as input in our new CASE program. The procedure is used to elucidate the possible main chemical structures in herbs containing the list of possible seed scaffolds to match the peaks corresponding to the given targeted MWs. To combine different side chains on the seed scaffolds for identification of most of the possible main structures in the herbs, a database containing a list of possible side chains that can attach on the seed scaffolds from natural products was generated. The collected natural product database included the Dictionary of Natural Products (DNP) [33], “ZINC natural products” subset of ZINC [34], and the Traditional Chinese Medicine Database (TCMD, updated on 2010-07-14) [35]. The number of natural products that contain ring structures is 82,847, and only these compounds were considered. Furthermore, the Natural Product Laboratory of PITDC has identified a list of main structures of each herb with their LC–MS/MS procedures. The verified main structures of each herb were provided in the results and discussion section and were used as a baseline for the evaluation of our prediction results.
Definitions of the structural elucidation problem
Our new CASE system identifies main structures that match the targeted MWs of herbs by combining the known seed scaffolds with a list of possible side chains. This procedure is defined as the Chemical Substituents-Core Combinatorial Problem (CSCCP) in this study. The scaffold (core) of a compound represents a common substructure of molecules that may have similar biological activities, and a side chain (substituent) denotes a chemical group that is attached to the scaffold. The position of the atom at which a side chain joins the scaffold is called a substituted position. A compound may have many different substituted positions, and each substituted position on a scaffold can be linked to many different side chains. For each seed scaffold, according to the analysis of the relationship between the scaffolds and side chains from our collected natural product database, we can compute the attaching probabilities of the side chains that can link to each substituted position of that scaffold. The main procedure of our new CASE method uses this information to elucidate chemical structures, and the formal definition of the CSCCP (Definition 1) is as follows.
Definition 1
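Read together with Definition 2, the optimization problem of the CSCCP can be summarized as follows. This is a reconstruction from the surrounding definitions rather than a verbatim statement, and the equation numbering is assumed. With \( n \) substituted positions, \( K_{i} \) candidate side chains at position \( i \), side chain weights \( m_{i,x_{i}} \), attaching probabilities \( p_{i,x_{i}} \), and targeted MW \( W_{0} \), the problem is to find the R selections \( (x_{1}, \ldots, x_{n}) \) with the highest values of

\[
\prod_{i=1}^{n} p_{i,x_{i}} \qquad (1)
\]

subject to

\[
\sum_{i=1}^{n} m_{i,x_{i}} = W_{0}, \qquad x_{i} \in \{1, 2, \ldots, K_{i}\} \;\; \forall i \in \{1, 2, \ldots, n\}. \qquad (2)
\]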
DP algorithm
In this study, we first proposed a DP algorithm as our new CASE strategy that can be executed in pseudo-polynomial time to find the optimal solutions of the CSCCP. The optimization problem of the CSCCP has been defined in Definition 1 by Eqs. 1 and 2. The CSCCP can also be represented by a cost function defined as follows:
Definition 2
(CSCCP cost function, C) In the CSCCP, \( n \) denotes the number of substituted positions on a given seed scaffold, and \( W_{0} \) is the targeted MW, as defined in Definition 1. A CSCCP cost function \( C:\left\{ s \mid 0 \le s \le n, s \in N \right\} \times \left\{ w \mid 0 \le w \le W_{0}, w \in N \right\} \times \{ r \mid 1 \le r \le R, r \in N\} \to \left[ 0, 1 \right] \) is defined such that \( C(s,w,r) \) represents the \( r \)th highest value of \( \prod\nolimits_{i = 1}^{s} p_{i,x_{i}} \) when only the first s out of n positions are substituted by side chains and the total weight of the selected side chains is equal to w \( (\sum\nolimits_{i = 1}^{s} m_{i,x_{i}} = w) \), where w is an integer. If the original molecular weights \( W_{0} \) and \( m_{i,x_{i}} \) are floating point numbers, they are transformed to integers for the following analysis. Therefore, \( C(s,w,r) \) corresponds to a sub-problem of the CSCCP since only s substituted positions are considered. The r highest values of \( \prod\nolimits_{i = 1}^{s} p_{i,x_{i}} ,\; x_{i} \in \left\{ 1,2, \ldots ,K_{i} \right\},\; \forall i \in \left\{ 1,2, \ldots ,s \right\} \) are denoted as \( C(s,w,1:r) \), where “1:r” denotes “from 1 to r”. The goal of the CSCCP is to find the R highest values \( C(n,W_{0} ,1:R) \) of \( \prod\nolimits_{i = 1}^{n} p_{i,x_{i}} \) satisfying \( \sum\nolimits_{i = 1}^{n} m_{i,x_{i}} = W_{0} \), where “1:R” denotes “from 1 to R”.
Therefore, the problem of finding the optimal solutions in the CSCCP can be regarded as solving a mathematical procedure of the cost function \( C\left( {n,W_{0} ,1:R} \right) \). To compute the cost function, we first need to set the initial configurations of the cost function. The initial condition of \( C(s,w,r) \) obeys the following Lemma 1.
Lemma 1
In \( C(0,w,r) \) , for any values of w and r, \( C(0,w,r) \) is 0, except for the case of \( C\left( {0,0,1} \right), \) in which \( C(0,0,1) \) is equal to 1.
Next, we utilized the dynamic programming strategy to iteratively compute the cost function based on the initial condition defined in Lemma 1. The main concept of the DP method is based on the principle of optimality: for any initial conditions and decisions, the decisions selected over the remaining period must be optimal for the remaining problem, with the states resulting from the previous decisions acting as the initial condition [30]. Therefore, to solve the entire CSCCP problem, \( C(n,W_{0} ,1:R) \), we must compute the sequence of the sub-problems, \( C(s,w,1:R) \) for \( s = 1, 2, \ldots , n \), and \( w = 1,2, \ldots ,W_{0} \). We used Lemma 2 to represent the relationship between the sub-problems.
Lemma 2
According to the principle of DP, the optimal solution of \( C(s, w, 1) \) can be decided by the optimal solutions in the previous step, \( C\left( {s - 1, 1:w, 1} \right). \) In other words, the highest values of \( \prod\nolimits_{i = 1}^{s - 1} {p_{{i,x_{i} }} } \) matching the molecular weights from 1 to \( w \) were obtained when the positions from 1 to \( s - 1 \) on the scaffold were linked by the appropriate side chains derived from the computation of the DP algorithm. Thus, we can directly evaluate the optimal solutions of \( C\left( {s, w, 1} \right) \) based on the known \( C\left( {s - 1, 1:w, 1} \right) \) in Lemma 2. Next, we extend Lemma 2 into Lemma 3 to calculate the highest R optimal solutions.
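For reference, the relationship described in this paragraph can be written as a recurrence. The form below is a reconstruction, obtained by reading Lemma 2 as the \( R = 1 \) special case of the formula in Lemma 3, with \( K_{s} \), \( m_{s,x_{s}} \), and \( p_{s,x_{s}} \) as defined there:

\[
C(s, w, 1) = \max_{x_{s} \in \{1, 2, \ldots, K_{s}\}} \left\{ p_{s,x_{s}} \times C\left(s - 1,\; w - m_{s,x_{s}},\; 1\right) \right\}.
\]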
Lemma 3
If \( C(s,w,1:R) \) are the highest R optimal solutions, with \( s \in \{ 1,2, \ldots ,n\} \) and \( w \in \{ 0,1, \ldots ,W_{0} \} \), then: \( C\left( {s, w, 1:R} \right) = \mathop {\max }\limits_{{\text{top } R,\; x_{s} \in \left\{ {1,2, \ldots ,K_{s} } \right\},\; r \in \{ 1,2, \ldots ,R\} }} \{ p_{{s,x_{s} }} \times C\left( {s - 1, w - m_{{s,x_{s} }} ,r} \right)\} \), where \( K_{s} \) is the number of possible side chains at the \( s \)th substituted position on the given seed scaffold, and \( m_{{s,x_{s} }} \) and \( p_{{s,x_{s} }} \) are the molecular weight and probability of the \( x_{s} \)th side chain that can be linked at the \( s \)th substituted position.
Note that the order in which the positions are used to iteratively calculate the sub-problem \( C\left( {s, w, 1:R} \right) \) does not affect the optimal solutions, as stated in Lemma 4.
Lemma 4
If we select any arbitrary order of s substituted positions to calculate \( C(s,w,1:R) \) , the solutions of \( C(s,w,1:R) \) are unchanged.
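The order independence stated in Lemma 4 can be checked on a small case by exhaustive enumeration. The following is a minimal sketch, not the authors' implementation: the function names `brute_force_top_r` and `same_scores`, the integer masses, and the probabilities are hypothetical values chosen only for illustration.

```python
from itertools import permutations, product
from math import isclose, prod


def brute_force_top_r(side_chains, target, R):
    """Enumerate every side-chain combination and return the R highest
    probability products among those whose integer masses sum to target."""
    scores = []
    for combo in product(*side_chains):
        if sum(mass for mass, _ in combo) == target:
            scores.append(prod(prob for _, prob in combo))
    return sorted(scores, reverse=True)[:R]


def same_scores(a, b):
    """Compare two score lists up to floating point rounding."""
    return len(a) == len(b) and all(isclose(x, y) for x, y in zip(a, b))


# Hypothetical toy data: one list of (integer mass, probability) per position.
positions = [
    [(1, 0.6), (15, 0.4)],
    [(1, 0.5), (17, 0.3), (15, 0.2)],
    [(1, 0.7), (31, 0.3)],
]

reference = brute_force_top_r(positions, 33, 3)

# Reordering the positions permutes the factors of each product and the
# terms of each sum, so the ranked scores should not change.
order_independent = all(
    same_scores(reference, brute_force_top_r(list(order), 33, 3))
    for order in permutations(positions)
)
print(reference, order_independent)
```

Exhaustive enumeration visits every one of the \( \prod\nolimits_{i = 1}^{n} {K_{i} } \) combinations, so it is practical only for very small inputs.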
Considering Lemmas 1–4, we can reasonably conclude Theorem 1.
Theorem 1
The highest R optimal solutions in the CSCCP, \( C(n,W_{0} ,1:R) \) , can be solved by iteratively finding the optimal solutions of \( C(s,w,1:R) \) for the position s from 1 to n and the molecular weight w from 0 to \( W_{0} \) based on the initial condition of Lemma 1.
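The iteration described by Lemma 1, Lemma 3, and Theorem 1 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the DPforCSCCP code itself: the function name `dp_top_r` and the toy data are hypothetical, masses are assumed to be positive integers already, and entries of the table that would be 0 are simply stored as missing.

```python
import heapq


def dp_top_r(side_chains, target, R):
    """side_chains[i] lists the (integer mass, probability) pairs allowed at
    position i + 1. Returns the R highest probability products over all
    selections whose masses sum to target, in descending order."""
    # Lemma 1: with no positions used, only weight 0 is reachable (value 1).
    prev = [[] for _ in range(target + 1)]
    prev[0] = [1.0]
    for chains in side_chains:                    # s = 1 .. n
        cur = [[] for _ in range(target + 1)]
        for w in range(target + 1):               # w = 0 .. W0
            # Lemma 3: extend every stored solution of the previous step
            # by each side chain that fits into the remaining weight.
            candidates = [
                prob * score
                for mass, prob in chains
                if mass <= w
                for score in prev[w - mass]
            ]
            cur[w] = heapq.nlargest(R, candidates)  # keep only the top R
        prev = cur
    return prev[target]


# Hypothetical toy data: one list of (integer mass, probability) per position.
positions = [
    [(1, 0.6), (15, 0.4)],
    [(1, 0.5), (17, 0.3), (15, 0.2)],
    [(1, 0.7), (31, 0.3)],
]
best = dp_top_r(positions, 33, 3)
print(best)
```

Only the scores are tracked here. Recovering the side chains behind each score would require storing a back-pointer alongside every table entry, which is omitted to keep the sketch short.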
In the DPforCSCCP algorithm, the running time becomes exponential when D, the number of decimal places retained in the masses, is too large. The following lemma gives the threshold on D beyond which this occurs.
Lemma 5
When the number of mass decimal places D is greater than \( \log_{10} \left( {\prod\nolimits_{i = 1}^{n} {K_{i} } } \right)/W_{0} \), the time complexity of DPforCSCCP is greater than that of the BF algorithm for the CSCCP.
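To give a feel for why a large D hurts, the sketch below compares the number of cells in the DP look-up table, which grows with \( 10^{D} \), against the number of combinations a brute force search visits. It is only a rough proxy and does not reproduce the bound in Lemma 5: it ignores the per-cell work, which also depends on \( K_{i} \) and R. The function names and the example values of \( K_{i} \) and \( W_{0} \) are hypothetical.

```python
from math import prod


def dp_table_cells(n, W0, D):
    """Cells in a table with n positions and integer weights 0 .. 10**D * W0."""
    return n * (round(W0 * 10 ** D) + 1)


def bf_combinations(K):
    """Side-chain combinations visited by exhaustive enumeration."""
    return prod(K)


K = [20, 20, 20, 20]   # hypothetical: 20 candidate side chains at 4 positions
W0 = 300.1234          # hypothetical targeted MW

for D in range(5):
    print(D, dp_table_cells(len(K), W0, D), bf_combinations(K))
```

With these example values the table already has more cells than there are combinations at D = 3, which illustrates how quickly the advantage of the DP table can disappear as the mass precision increases.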
Iterative DP algorithm
According to Lemma 5, a larger D worsens the time performance of DPforCSCCP, which can become even slower than the brute force (BF) algorithm. To improve the DPforCSCCP algorithm, we introduce in this section a modified algorithm whose running time is not limited by D. First, we derived Theorem 2:
Theorem 2
Assume that all molecular weights \( m_{{i,x_{i} }} \in {\mathbb{R}} \) are converted into integers \( m_{{i,x_{i} }}^{{\prime \prime }} = \lfloor m_{{i,x_{i} }} + 0.5 \rfloor \in {\mathbb{N}} \), that \( W_{0} \in {\mathbb{R}} \) is converted to \( W_{0}^{{\prime \prime }} = \lfloor W_{0} + 0.5 \rfloor \in {\mathbb{N}} \), and that \( R \) is changed to \( R^{{\prime }} > R \). Let \( C \) be the look-up table used by DPforCSCCP when the targeted MW is \( W_{0}^{{\prime }} = 10^{D} \times W_{0} \), and let \( C^{{\prime }} \) be the look-up table used by IDPforCSCCP when the targeted MW is \( W_{0}^{{\prime \prime }} \). Then, if \( R^{{\prime }} \) is sufficiently large, the set \( C^{{\prime }} \left[ {n, W_{0}^{{\prime \prime }} - 0.5n : W_{0}^{{\prime \prime }} + 0.5(n + 1),R^{{\prime }} } \right] \) contains all of the values in \( C[n,W_{0}^{{\prime }} ,R] \) calculated by DPforCSCCP.
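The role of the weight window in Theorem 2 can be illustrated with a short sketch. It assumes that the conversion to integers is rounding to the nearest integer, computed as the floor of the value plus 0.5, so each rounded mass is off by at most 0.5. The function names and the example masses are hypothetical and are not taken from the IDPforCSCCP source.

```python
from itertools import product
from math import floor


def round_half_up(x):
    """Assumed integer conversion: floor(x + 0.5)."""
    return floor(x + 0.5)


def window(W0, n):
    """Bounds W0'' - 0.5n and W0'' + 0.5(n + 1) from Theorem 2."""
    W0_int = round_half_up(W0)
    return W0_int - 0.5 * n, W0_int + 0.5 * (n + 1)


def rounded_sum_in_window(masses):
    """Take the exact sum of the masses as the targeted MW and check that
    the sum of the rounded masses falls inside the window."""
    low, high = window(sum(masses), len(masses))
    return low <= sum(round_half_up(m) for m in masses) <= high


# Three masses of 1.4 sum to about 4.2 (rounded target 4), but the rounded
# masses sum to 3, so searching only at the rounded target would miss them.
print(window(4.2, 3), rounded_sum_in_window([1.4, 1.4, 1.4]))

# Every combination of a few hypothetical masses stays inside its window.
options = [1.4, 1.6, 2.5, 15.023, 31.018]
all_inside = all(rounded_sum_in_window(list(c)) for c in product(options, repeat=3))
print(all_inside)
```

Because rounding can move the integer sum away from the rounded target, the look-up table has to be read over the whole window, and the candidates found there still need to be checked against the exact masses.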
Lemma 6
In the IDPforCSCCP algorithm, if the number of optimal solutions searched in the first iteration is less than R, the set of the searched optimal solutions will not be updated in subsequent iterations.
According to Lemma 6, we designed the boundary condition in lines 22–23: if the number of optimal solutions found is less than R, the while loop is also terminated. The proofs of Lemmas 1–6 and Theorems 1–2 are given in Additional file 1. The Java source code of the IDPforCSCCP algorithm is provided in Additional file 2.
Declarations
Authors’ contributions
BHS conceived and designed the algorithms, built the models, analyzed the results, and drafted the manuscript. YCH gathered the dataset and prepared the scaffold and MS spectral information. MYS, SYW, AS, and CL provided scientific comments. MYS, YCH, SYW, AS, CL, and OAL revised the manuscript. YJT conceived and supervised the study, assisted in experiment analysis and manuscript writing. All authors read and approved the final manuscript.
Acknowledgements
Resources of the Laboratory of Computational Molecular Design and Metabolomics and the Department of Computer Science and Information Engineering of National Taiwan University were used in performing these studies.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
The four testing datasets including the scaffold structures and targeted molecular weights are available for download at http://csccp.cmdm.tw/testingMaterial.rar.
Ethics approval and consent to participate
Not applicable.
Funding
This work was funded by the Ministry of Science and Technology, Taiwan, Grant Numbers 105-3011-F-002-010-, 105-2812-8-002-001-MY2, and 106-2622-B-002-008 -, and National Taiwan University, Grant Number NTU-ERP-106R880803.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
- Cragg GM, Newman DJ (2013) Natural products: a continuing source of novel drug leads. Biochim Biophys Acta 1830(6):3670–3695
- Newman DJ, Cragg GM (2007) Natural products as sources of new drugs over the last 25 years. J Nat Prod 70(3):461–477
- Butler MS, Robertson AA, Cooper MA (2014) Natural product and natural product derived drugs in clinical trials. Nat Prod Rep 31(11):1612–1661
- Butler MS (2004) The role of natural product chemistry in drug discovery. J Nat Prod 67(12):2141–2153
- Kingston DGI (2011) Modern natural products drug discovery and its relevance to biodiversity conservation. J Nat Prod 74(3):496–511
- Moco S, Bino RJ, De Vos RCH, Vervoort J (2007) Metabolomics technologies and metabolite identification. Trends Analyt Chem 26(9):855–866
- Corcoran O, Mortensen RW, Hansen SH, Troke J, Nicholson JK (2001) HPLC/1H NMR spectroscopic studies of the reactive alpha-1-O-acyl isomer formed during acyl migration of S-naproxen beta-1-O-acyl glucuronide. Chem Res Toxicol 14(10):1363–1370
- Corcoran O, Spraul M (2003) LC–NMR–MS in drug discovery. Drug Discov Today 8(14):624–631
- van der Hooft JJJ, Mihaleva V, de Vos RCH, Bino RJ, Vervoort J (2011) A strategy for fast structural elucidation of metabolites in small volume plant extracts using automated MS-guided LC–MS–SPE–NMR. Magn Reson Chem 49:S55–S60
- Freeman R, Morris GA (1979) Two-dimensional Fourier transformation in NMR. Bull Magn Res 1:1–26
- Bax A, Aszalos A, Dinya Z, Sudo K (1986) Structure elucidation of the antibiotic desertomycin through the use of new two-dimensional NMR techniques. J Am Chem Soc 108(25):8056–8063
- Schwalbe H, Kessler H (2003) The 900 MHZ NMR spectrometer in Munich and Frankfurt. Nachr Chem 51(4):412–417
- Koehn FE, Carter GT (2005) The evolving role of natural products in drug discovery. Nat Rev Drug Discov 4(3):206–220
- Kind T, Fiehn O (2010) Advances in structure elucidation of small molecules using mass spectrometry. Bioanal Rev 2(1–4):23–60
- Nelson DB, Munk ME, Gash KB, Herald DL (1969) Alanylactinobicyclone. Application of computer techniques to structure elucidation. J Org Chem 34(12):3800–3805
- Lederberg J, Sutherland GL, Buchanan BG, Feigenbaum EA, Robertson AV, Duffield AM et al (1969) Applications of artificial intelligence for chemical inference. I. The number of possible organic compounds. Acyclic structures containing C, H, O, and N. J Am Chem Soc 91:2973–2976
- Sasaki S, Abe H, Ouki T, Sakamoto M, Ochiai S (1968) Automated structure elucidation of several kinds of aliphatic and alicyclic compounds. Anal Chem 40(14):2220–2223
- Elyashberg ME, Gribov LA (1968) Formal–logical method for interpreting infrared spectra from characteristic frequencies. Appl Spectrosc 8(2):189–191
- Christie BD, Munk ME (1991) The role of 2-dimensional nuclear-magnetic-resonance spectroscopy in computer-enhanced structure elucidation. J Am Chem Soc 113(10):3750–3757
- Peng C, Yuan SG, Zheng CZ, Hui YZ (1994) Efficient application of 2d NMR correlation information in computer-assisted structure elucidation of complex natural-products. J Chem Inf Comput Sci 34(4):805–813
- Lindel T, Junker J, Kock M (1999) 2D-NMR-guided constitutional analysis of organic compounds employing the computer program COCON. Eur J Org Chem 3:573–577
- Blinov KA, Carlson D, Elyashberg ME, Martin GE, Martirosian ER, Molodtsov S et al (2003) Computer-assisted structure elucidation of natural products with limited 2D NMR data: application of the StrucEluc system. Magn Reson Chem 41(5):359–372
- Elyashberg ME, Blinov KA, Williams AJ, Molodtsov SG, Martin GE, Martirosian ER (2004) Structure elucidator: a versatile expert system for molecular structure elucidation from 1D and 2D NMR data and molecular fragments. J Chem Inf Comput Sci 44(3):771–792
- Elyashberg ME, Williams A, Martin GE (2008) Computer-assisted structure verification and elucidation tools in NMR-based structure elucidation. Prog Nucl Magn Reson Spectrosc 53(1–2):1–104
- Elyashberg M, Blinov K, Molodtsov S, Williams A (2012) Elucidating ‘undecipherable’ chemical structures using computer-assisted structure elucidation approaches. Magn Reson Chem 50(1):22–27
- Elyashberg M, Blinov K, Molodtsov S, Smurnyy Y, Williams A, Churanova T (2009) Computer-assisted methods for molecular structure elucidation: realizing a spectroscopist’s dream. J Cheminform 1(1):3
- Codina A, Ryan RW, Joyce R, Richards DS (2010) Identification of multiple impurities in a pharmaceutical matrix using preparative gas chromatography and computer-assisted structure elucidation. Anal Chem 82(21):9127–9133
- von Bargen C, Hubner F, Cramer B, Rzeppa S, Humpf HU (2012) Systematic approach for structure elucidation of polyphenolic compounds using a bottom-up approach combining ion trap experiments and accurate mass measurements. J Agric Food Chem 60(45):11274–11282
- Harn Y-C. Structure hunter: prediction of novel chemical structures in a mixture [Master dissertation]. Taipei, Taiwan: National Taiwan University; 2011
- Ibarra OH, Kim CE (1975) Fast approximation algorithms for the knapsack and sum of subset problems. J ACM 22:463–468
- Clasquin MF, Melamud E, Rabinowitz JD. LC-MS data processing with MAVEN: a metabolomic analysis and visualization engine. In: Baxevanis AD et al, editors. Current protocols in bioinformatics/editorial board, Chapter 14. 2012; Unit14 1
- Smith CA, Want EJ, O’Maille G, Abagyan R, Siuzdak G (2006) XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal Chem 78(3):779–787
- The Dictionary of Natural Products database is available from Chapman & Hall/CRC. http://dnp.chemnetbase.com. 2010 cited 2010-07-10
- Irwin JJ, Shoichet BK (2004) ZINC—a free database of commercially available compounds for virtual screening. J Chem Inf Model 45(1):177–182
- Chen CYC (2011) TCM Database@Taiwan: the world’s largest traditional Chinese medicine database for drug screening in silico. PLoS ONE 6(1):15939