HSQC spectral based similarity matching of compounds using nearest neighbours and a fast discrete genetic algorithm
 Gregory K Pierens^{1}Email author,
 Steven Brossi^{1},
 Zhengyi Yang^{1},
 David C Reutens^{1} and
 Viktor Vegh^{1}
DOI: 10.1186/17582946425
© Pierens et al.; licensee Chemistry Central Ltd. 2012
Received: 9 July 2012
Accepted: 24 September 2012
Published: 3 October 2012
Abstract
HSQC spectra are routinely acquired for chemical structure analysis based on hydrogen and carbon chemical environments. Two fast HSQC peak matching algorithms have been developed; a nearest neighbour approach and a probabilistic method based on an existing discrete genetic algorithm. Both of these techniques are intended to find HSQC spectra matches that supplement information generated by established molecular fingerprint methods. Our results are compared to those calculated using a specific implementation of molecular fingerprints. The nearest neighbour and genetic algorithmbased methods ranked highly particular structures missed by molecular fingerprints. Our analysis shows that by complementing molecular fingerprint matches with our findings, a comprehensive list of matches can be identified. The refined list of compounds could be used to improve the quality of compounds used in screening libraries in the pharmaceutical industry.
Keywords
HSQC Spectral matching Similarity Nearest neighbours Discrete genetic algorithmBackground
Database driven chemical structure identification is common practice in drug discovery. Classification of similar compounds is based on the premise that physicochemical properties are comparable [1–3]. The mapping of specific compound properties to “fingerprints” has provided a robust method of searching large databases. Currently, database searching efficiency is constrained by the size of the database, the method used to determine similarity and the function defining match quality. To improve database searching, we have concentrated on the latter two constraints and propose a new approach of identifying similar compounds using heteronuclear single quantum coherence (HSQC) spectra.
Carbon HSQC spectra are collected routinely to confirm or elucidate molecular structure in synthetic and natural product chemistry. Experimental results are presented as 2D plots with axes defined by proton (^{1} H) and carbon (^{13}C) chemical shifts. The highintensity plot features, referred to as “peaks”, delineate directly bonded hydrogen and carbon atoms of a compound. Generally, the 2D Cartesian coordinates of the peaks are reported without any reference to intensity or peak size. The intensity of the peaks could also be included in the analysis. However, care must be taken to ensure that all data was acquired using the same acquisition parameters. Since we validate our findings using published data, in this work, peak intensities are not included as part of the spectra matching. The location of peaks provides valuable information about the chemical environment of hydrogen and carbon atoms allowing molecular structure to be inferred from the number and location of peaks which have specific distributions for each compound.
A number of metrics have been used to quantify the similarity between a compound of interest and a database of compounds allowing the best database results to be selected as possible replacements for the candidate structure. For example, compound fragments and related properties have been mapped to molecular fingerprints defined using bit strings [4]. The fingerprints capture specific information about molecular structure and specific properties of a molecule [4, 5]. In bit stringbased fingerprinting, the Tanimoto (T_{c}) [6] and Tversky [7] coefficients have been used widely to quantify the level of similarity. Above an acceptable threshold, compounds are deemed similar and therefore have similar chemical or biological properties.
We previously outlined a method of matching HSQC spectra of small compounds motivated by evolutionary optimization [8, 9]. The use of selfadaptive differential evolution allowed matching of a candidate compound HSQC peaks to individual entries of a database. However, as the number of peaks increased (i.e. larger than 20), the search space became very large, to the extent that the quality of match was not computable in a reasonable amount of time. Our new approach is aimed at increasing computational efficiency by considering three factors limiting the rate of convergence of any algorithm, the choice of the metric and method to obtain an optimal solution and the size of the search space. The outcome is a robust algorithm capable of matching spectra containing a large number of peaks rapidly on a standard desktop computer.
We improved the efficiency of our previously reported HSQC spectra matching algorithm by using a discrete genetic algorithm (DGA) implementation instead of differential evolution. We tested our new method on a compound database containing 51 HSQC spectra. The results were compared to bit string based molecular fingerprints incorporating a suitable threshold for the Tanimoto coefficient [10] and to nearest neighbour search, also known as proximity search or closest point search which is the simplest implementation of all peak matching methods.
Results and discussion
The database of 51 HSQC spectra from our previous work [8] was used to test the efficacy of our newly developed algorithm. The actual structures of the 51 compounds are listed in Additional file 1.
Treatment of outliers in DGA
To remedy a single long distance peaktopeak match, we identify outliers and exclude their matches in the metric used to classify match quality. We thus modified the mean distance per peak, excluding outliers by statistically examining each set of matched peak distances and applying a rejection criterion. The mean and standard deviation (σ) was calculated for all pairs of matched peaks in the HSQC spectrum to spectrum comparison. If an individual distance was greater than S times σ from the mean, it was considered an outlier. We then rematched any outliers to their nearest neighbour in the other spectrum. We examined several values of S (1.75 – 3.0) and settled on 2.5σ as the threshold value for outlier rejection. We arrived at this result by qualitatively evaluating characteristics of many spectral matches. The value of S is a user defined variable and can changed if unsuitable for the HSQC matching under consideration.
Effect of population size and number of iterations in the DGA
The number of function evaluations (NFE) and corresponding error rate for the DGA and SADE methods
NFE  DGA (%)  SADE (%) 

1.60E + 09  2.5  5.0 
7.98E + 09  0.5  2.7 
1.60E + 11  0.1  2.1 
In the integer optimization problem mutations and crossovers were chosen to improve performance with respect to our application, and hence, we were able to set G_{ max } relatively small. The calculation of the 2601 HSQC spectral matches using the medium settings (K = 5 and G_{ max } = 10 N) took approximately 85 minutes, which was an average of ~ 2 seconds per match which includes overheads from the GUI and reading and writing data files. The largest peak matching was between compounds 17 (17 peaks) and 18 (22 peaks), taking approximately 4 seconds. If 20,000 HSQC spectral matches were required on similar sized spectra using the medium settings, then it would take ~ 11 hours.
Ranking of matches against molecular fingerprint (MFP)
The results of NN and DGA approaches of matching HSQC spectra were compared to those obtained using the MFP method within Open Babel: The Open Source Chemistry Toolbox [11, 12]. The FP2 pathbased fingerprint, which indexes small molecule fragments, was used to generate the similarity results. All compound similarities were calculated using the Tanimoto coefficient (T_{c}) which ranges from 0, no similarity, to 1, maximum similarity. The selection criterion for T_{c} of equal to or greater than a value of 0.7 was used and resulted in 44 compound matches. Of these, 38 were between compounds 1 13. This was not unexpected, as these compounds are part of a combinatorial library based around 2(3chloro4hydroxyphenyl) acetamide.
The categories and cutoffs for MFP, NN and DGA
Category  MFP  NN  DGA 

1  [0.9, 1.0]  [0, 0.0076)  [0, 0.0098) 
2  [0.8, 0.9)  [0.0076, 0.0152)  [0.0098, 0.0196) 
3  [0.7, 0.8)  [0.0152, 0.0228)  [0.0196, 0.0294) 
4  [0.6, 0.7)  [0.0228, 0.0304)  [0.0294, 0.0392) 
5  [0.5, 0.6)  [0.0304, 0.0380)  [0.0392, 0.0490) 
6  < 0.5  ≥ 0.0380  ≥ 0.0490 
NN versus DGA based HSQC spectra matching
Among the top 44 similarity matches, NN and DGA HSQC similarity methods identified only seven different matches. The matches unique to NN were all within compounds 1–13, while for DGA, only one was from this group of compounds. All NN matches were all in category 4 for DGA, just outside the threshold to be classed as similar. Five out of the seven DGA matches were in Category 4 and two were in Category 6 from NN. The two HSQC matches for compounds 7 and 11 are provided in Figure 2. In this case the spectra classified in Category 3 for NN and Category 6 for DGA.
Comparison of MFP, NN and DGA results
The NN and DGA histograms are similar with the highest frequency of scores appearing in the “most similar” region. The primary difference between MFP and the other two matching methods is that in MFP, a feature is either present or not within a fingerprint, whereas a distance between matched peaks is computed in both NN and DGA. This means that a feature is always included in NN and DGA, irrespective of whether a peak match is identified as an outlier in the latter approach. The histogram distribution is narrower for NN than for DGA. Thus is likely to be due to DGA identifying a unique peaktopeak match, which results in an overemphasis of the peak distances. On the other hand, NN matches peaks nonuniquely, essentially providing information about the peaks’ neighbourhoods with respect to the other HSQC spectrum. NN and DGA can both suffer from false positives.
In view of our findings, we recommend the following protocol for matching of HSQC spectra. First, calculate MFP, NN and DGA based similarities. Determine the MFP cutoff to be used; this is usually set to 0.7. Calculate the number of structures identified by the MFP method and set a suitable threshold to obtain the same number of structures using NN and DGA in accordance with their ranking. The highly significant compound structures would be matches identified by at least two of the methods. In our case, this would be 43 [18(common to all), 3(MFP and NN only), 3(MFP and DGA only) and 19(NN and DGA)]. The compounds that were identified only by one method should be reviewed on a casebycase basis.
Conclusions
The research aimed to investigate whether new approaches can improve a molecular fingerprintbased method of identifying structurally similar compounds from databases of HSQC spectra. Two fast peaktopeak spectral matching methods were developed, the nearest neighbour and discrete genetic algorithm methods. We found that complementary information from both methods improved the classification of compound structures. We compared our new approaches to a method based on molecular fingerprints, and investigated differences between matches. We conclude that our approaches are not a replacement for existing established methods; instead they should be used to refine the assessment of similarity. The use of our algorithms can help counter missed similarity matches arising when molecular fingerprint is used solely for matching of HSQC spectra.
Methods
where p_{ n } is the coordinate of the n^{th} peak and N is the number of peaks. Given another spectrum $\mathbf{q}=\left\{{q}_{m}:m=1,2,\dots ,M\right\}\text{,}$ the goal is then to match peaks by minimizing a metric quantifying the quality of match between p and q.
where j is a vector of N elements and ${j}_{n}\in \left[1,M\right]$ is a perturbation on m given n, such that $\epsilon $ is minimized when j is the optimal indexing of q. The term ${\epsilon}_{S}$ measures the quality of match when all peaks are matched. In the case when one spectrum contains more or less peaks than the other, all peaks from the smaller spectrum are matched, leaving some peaks in the larger spectrum unmatched. We will use the “matched” and “unmatched” terminology throughout this paper. If N < M, j contains N unique integers in [1,M], and hence, the unmatched peaks of q do not appear in j. If N > M, then j contains N unique integers from [1,N]. As such, the entries where j_{n} > M are left unmatched. The modified metric, $\widehat{d}$, accounts for this case.
Nearest Neighbour (NN) matching
The NN approach does not take into account different numbers of peaks in different areas of the spectrum. This can result errors in ascribing compound similarity on the basis of HSQC spectra. For this reason, we also propose to uniquely match spectra peaks, enabling improved differentiation of compound structures through the introduction of long distance peak matching in the metric. This type of matching implemented in our previous work using differential evolution had the drawback that establishing matches to database entries with more than 20 HSQC spectra peaks was time consuming. Our improved method based on a discrete genetic algorithm is still probabilistic and obtains good approximations for large numbers of peaks in a practical amount of time.
Discrete genetic algorithm (DGA) matching
We use a discrete genetic algorithm to optimize the optimal indexing in (4). Our implementation was inspired by the algorithm applied to solve traveling salesman problems. In this work we closely followed the implementation outlined by Schneider [13]. We defined K to be the population size (i.e. the number of solutions) and G_{max} as the maximum number of generations.
Our DGA implementation did not involve forcing of match directions. That is, given a spectrum p to be matched to q, we did not require the denotation of spectrum to be such that q always had a larger number of peaks than p. Furthermore, we used injection of sort solutions through progressive iterations of the algorithm, and when the number of peaks in p and q were differed, we left N – M peaks unmatched. The following mutations were used in DGA:

EXC(L2O): Select two peak matches and exchange them;

L3O: Select three peak matches and shuffle them such that none of them are the same as the starting point;

L4O: Same as L3O but using four peak matches;

EXON: Used only when N < M. Exchanges a peak match of set s with a number from [1,M] that is not an element of s, hence named EXchange with Outside Node.
We updated the population using five mutation sweeps using: RX, BURTRAND and SINGLEBURST crossovers [13]:

RX (random): r is a string of independent random bits of length N, with equal probabilities for zero and one.

BURSTRAND: Same as above but with dependence between the bits such that P(r(i + 1) ~ = r(i)) = 2/N, where P denotes probability. This way of generating “perturbation” or noise is often used for simulating bursty channels (also known as Gilbert–Elliott channel).

SINGLEBURST: r is a continuous block of ones. The length is chosen randomly in [3, N] and the start position i is chosen randomly in [1, N]. The block rolls over when i + l > N, such that r(1 to (l + iN)) = 1.
We provide the functions of DGA, description of terms and detailed explanation of our specific metric implementation can be found in Additional file 1.
Declarations
Acknowledgements
The research was supported with assistance from the State Government of Queensland through the Queensland NMR Network and the National Health and Medical Research Council of Australia.
Authors’ Affiliations
References
 Johnson MA, Maggiora GM: Concepts and applications of molecular similarity. 1990, John Wiley & Sons, New YorkGoogle Scholar
 Martin YC, Kofron JL, Traphagen LM: Do structurally similar molecules have similar biological activity. J Med Chem. 2002, 45: 43504358. 10.1021/jm020155c.View ArticleGoogle Scholar
 Patterson DE, Cramer RD, Ferguson AM, Clark RD, Weinberger LE: Neighborhood behavior: a useful concept for validation of "molecular diversity" descriptors. J Med Chem. 1996, 39: 30493059. 10.1021/jm960290n.View ArticleGoogle Scholar
 Eckert H, Bajorath J: Molecular similarity analysis in virtual screening: foundations, limitations and novel approaches. Drug Discov Today. 2007, 12: 225233. 10.1016/j.drudis.2007.01.011.View ArticleGoogle Scholar
 Willett P: Similaritybased virtual screening using 2D fingerprints. Drug Discov Today. 2006, 11: 10461053. 10.1016/j.drudis.2006.10.005.View ArticleGoogle Scholar
 Willett P, Barnard JM, Downs GM: Chemical similarity searching. J Chem Inf Comput Sci. 1998, 38: 983996. 10.1021/ci9800211.View ArticleGoogle Scholar
 Tversky A: Features of similarity. Psychol Rev. 1977, 84: 327352.View ArticleGoogle Scholar
 Pierens G, Mobli M, Vegh V: Effective protocol for database similarity searching of heteronuclear single quantum coherence spectra. Anal Chem. 2009, 81: 93299335. 10.1021/ac901616t.View ArticleGoogle Scholar
 Vegh V, Pierens GK, Tieng QM: A variant of differential evolution for discrete optimization problems requiring mutually distinct parameters. International Journal of Innovative Computing, Information and Control. 2011, 7: 897914.Google Scholar
 Bajorath J: Integration of virtual and highthroughput screening. Nat Rev Drug Discovery. 2002, 1: 882894. 10.1038/nrd941.View ArticleGoogle Scholar
 Guha R, Howard MT, Hutchison GR, MurrayRust P, Rzepa H, Steinbeck C, Wegner J, Willighagen EL: The blue obelisk interoperability in chemical informatics. J Chem Inf Model. 2006, 46: 991998. 10.1021/ci050400b.View ArticleGoogle Scholar
 The open babel package, version 2.3.0. http://openbabel.sourceforge.net/,
 Schneider JJ, Kirkpatrick S: Stochastic optimization. Application of genetic algorithms to TSP. 2006, Springer, Berlin Heidelberg, 415422. Scientific ComputationGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.