Comparative analyses of structural features and scaffold diversity for purchasable compound libraries
© The Author(s) 2017
Received: 29 November 2016
Accepted: 9 April 2017
Published: 21 April 2017
Virtual screening (VS) based on ligand-based or structure-based drug design approaches, such as property-based or drug-likeness rules, quantitative structure–activity relationship (QSAR) models, pharmacophore hypotheses and molecular docking, has become a powerful way to find hits in drug discovery. Screening libraries of small molecules with 2-D or 3-D structures are indispensable sources for VS campaigns. For example, the number of purchasable molecules collected in the ZINC database increased from ~0.73 million in 2005 to over 100 million in 2015. Of the 176 vendors deposited in ZINC15, 37 offer more than 100,000 compounds and 9 offer more than 1 million compounds. This highlights the progress in synthetic organic chemistry and the tremendous demand in this market. In most VS applications, it is more practical and time-effective to screen a compound library provided by a specific vendor than to screen all compound libraries collected in ZINC. The distributions of physicochemical properties, structural features and scaffold diversity of purchasable compound libraries from different vendors are expected to differ. Therefore, an important question arises: which library should be used for VS? To answer this question, we need a deep understanding of the intrinsic features of each purchasable compound library and the differences among them.
Based on the Murcko framework, Schuffenhauer et al. proposed a more elaborate and systematic methodology, called Scaffold Tree (ST), which arranges ring systems in a hierarchical tree by iteratively pruning rings one by one according to a set of prioritization rules until only one ring remains. The structural hierarchies of each molecule in a Scaffold Tree are numbered from Level 0 (usually the single remaining ring) to Level n (the original molecule) (Fig. 1i), and Level n − 1 is the Murcko framework. Owing to its systematic partition of molecular structures, the Scaffold Tree methodology has been employed in many scaffold diversity studies of compound libraries [10–12].
A number of studies analyzing and comparing the chemical space and diversity of commercially available compound libraries have been reported in the last decade. Krier et al. evaluated the scaffold diversity of 17 commercially available screening collections with 2.4 million compounds by analyzing the maximum common substructures (MCS), and they grouped the commercial collections into categories with low, medium and high scaffold diversity. However, the definition of MCS is arbitrary and data set dependent, and MCS may not be a general way to represent a large number of scaffolds. Langdon et al. analyzed the structural diversity of 7 commercial compound libraries by using the Murcko frameworks and Scaffold Trees, and then visualized the scaffold space by using the Tree Maps software. They found that there were some characteristic scaffolds in each library. Nevertheless, the libraries analyzed by Langdon et al. are rarely used in practical VS, and three of them contain fewer than 10,000 molecules, so the results may not be informative for drug design and discovery. With the rapid increase in the number of commercially available small molecules, an analysis of the structural features and scaffold diversity of representative screening libraries is much needed.
In this study, the structural features and scaffold diversity of eleven commercially available screening libraries and the Traditional Chinese Medicine Compound Database (TCMCD) were explored by analyzing seven fragment representations. All the selected commercial libraries have more than 50,000 compounds and have been widely used in VS. We aimed to find the differences in structural features and scaffold diversity among these libraries. Tree Maps and SAR Maps were used to visualize the distribution of the scaffolds based on the similarity of molecular fingerprints. Moreover, the underlying pharmacological characteristics, that is, the potential targets of the molecules with the representative scaffolds, were also examined. We believe that our study will help the decision-making process when selecting commercially available compound libraries for VS.
Preparation and standardization of libraries
Basic information of the 12 studied libraries
Generation of fragment representations
A total of 7 fragment representations were used to characterize the structural features and scaffolds of molecules: ring assemblies, bridge assemblies, rings, chain assemblies, Murcko frameworks, RECAP fragments and Scaffold Tree.
The first five types of fragment representations were generated by using the Generate Fragments component in Pipeline Pilot 8.5 (PP 8.5). The RECAP fragments and Scaffold Tree for each molecule were generated by using the sdfrag command in MOE. Because the sdfrag command does not include the original molecules in the Scaffold Tree output, the missing original molecules were added to the SDF files of the Scaffold Tree using PP 8.5 (Additional file 1: File S1). The generation of the Scaffold Tree (from Level 1 to Level n) was accomplished in PP 8.5 by defining the fragments at different levels for each molecule. Eventually, the SDF files of these fragment representations were obtained (Additional file 1: File S1).
Analyses of scaffold diversity
The scaffold diversity of each standardized dataset was characterized by the fragment counts and the cumulative scaffold frequency plots (CSFPs), also called cyclic system retrieval (CSR) curves [23, 24]. The duplicated fragments were removed first, and the numbers of unique fragments for each dataset were counted for ring assemblies, bridge assemblies, rings, chain assemblies, Murcko frameworks, RECAP fragments and Levels 0–11 of Scaffold Tree, along with the numbers of molecules they represent (referred to as the scaffold frequency).
Then, the scaffolds were sorted by their scaffold frequency from the highest to the lowest, and the cumulative percentage of molecules was computed as the cumulative scaffold frequency divided by the total number of molecules. The percentages of unique fragments were calculated in the same way. CSFPs were then generated with the number or the percentage of Murcko frameworks and Level 1 scaffolds, which may represent the whole molecules better than the other types of fragments. In each CSFP, PC50C was determined for each scaffold representation to quantify the distribution of molecules over scaffolds. PC50C was defined as the percentage of scaffolds that represent 50% of the molecules in a library.
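The counting and PC50C steps described above can be illustrated with a minimal Python sketch. It is not the Pipeline Pilot protocol used in this study: the function names, the toy scaffold identifiers and the choice to report the first scaffold rank at which the cumulative coverage reaches 50% (rather than interpolating between ranks) are all assumptions made for illustration.

```python
from collections import Counter


def scaffold_frequencies(scaffold_per_molecule):
    """Count how many molecules each unique scaffold represents."""
    return Counter(scaffold_per_molecule)


def csfp_points(frequencies):
    """Return (percent of scaffolds, cumulative percent of molecules) pairs.

    Scaffolds are sorted from the most to the least frequent.
    """
    counts = sorted(frequencies.values(), reverse=True)
    total_molecules = sum(counts)
    total_scaffolds = len(counts)
    points = []
    cumulative = 0
    for rank, count in enumerate(counts, start=1):
        cumulative += count
        points.append((100.0 * rank / total_scaffolds,
                       100.0 * cumulative / total_molecules))
    return points


def pc50c(frequencies):
    """Percentage of scaffolds needed to cover at least 50% of the molecules.

    Assumption: no interpolation, the first rank reaching 50% is reported.
    """
    for pct_scaffolds, pct_molecules in csfp_points(frequencies):
        if pct_molecules >= 50.0:
            return pct_scaffolds
    raise ValueError("no scaffolds supplied")


# Toy library (hypothetical): 10 molecules spread over 5 scaffold labels.
toy = ["A"] * 4 + ["B"] * 3 + ["C", "D", "E"]
freqs = scaffold_frequencies(toy)
print(pc50c(freqs))  # 2 of 5 scaffolds cover 7 of 10 molecules -> 40.0
```

A low PC50C means that a small fraction of scaffolds accounts for half of the library, which corresponds to a steep CSFP and low scaffold diversity.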
Generation of Tree Maps
The Tree Maps methodology was employed to analyze the structural similarity of the Level 1 scaffolds by using the TreeMap software, which can highlight both the structural diversity of scaffolds and the distribution of compounds over scaffolds. Tree Maps has been used as a powerful tool to depict structure–activity relationships (SARs) and analyze scaffold diversity. Unlike a traditional tree structure, which is drawn as a graph with the root node at the top and the child nodes below it, Tree Maps, proposed by Shneiderman, uses circles or rectangles in a 2D space-filling layout to represent a property of a clustered dataset in a clear and intuitive way. Thus, one can visualize a hierarchical clustering map by organizing those clustered properties along with other features of a dataset, such as MW.
First, the unique Level 1 scaffolds were clustered by using the cluster molecules component in PP 8.5 based on the ECFP_4 (extended-connectivity fingerprint 4) fingerprints [26–28]. According to Tian’s study and our testing, although the clustering method is order dependent, the order dependency of the cluster molecules component did not have an obvious effect on the clustering results. Therefore, recentering the cluster centers twice in a clustering protocol is enough. Then, the SDF file of the clustered scaffolds for each standardized dataset was converted into a text file, which was used as the input of the TreeMap software (Additional file 1: File S1). In each Tree Map, scaffolds are represented by circles with gray perimeters. The area of each circle is proportional to the scaffold frequency, and the color of each small circle is related to the DTC (DistanceToClosest, i.e., the distance between the fragment and the cluster center) of the fragments in each cluster. The lowest value of DTC for the Level 1 scaffolds of ChemBridge (DTC = 0) was colored in red, the highest value (DTC = 0.778) in deep green and the middle value in white. The highest values of DTC for the other databases were also around 0.8. The yellow labels in each Tree Map are the order numbers of the clusters.
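The color scale described above (red at the lowest DTC, white in the middle, deep green at the highest DTC) can be sketched as a two-segment linear interpolation. This is only an illustration: the TreeMap software's actual color interpolation is not documented here, and the RGB triplets, the function names and the clamping of out-of-range values are assumptions.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def dtc_to_rgb(dtc, dtc_min=0.0, dtc_max=0.778,
               low=(255, 0, 0), mid=(255, 255, 255), high=(0, 100, 0)):
    """Map a DTC value to an RGB color on a red-white-green scale.

    The default range 0 to 0.778 is the ChemBridge Level 1 range quoted
    in the text; the RGB values for red, white and deep green are assumed.
    """
    if dtc_max <= dtc_min:
        raise ValueError("dtc_max must be larger than dtc_min")
    # Clamp the value into the range, then normalise it to [0, 1].
    clamped = min(max(dtc, dtc_min), dtc_max)
    t = (clamped - dtc_min) / (dtc_max - dtc_min)
    if t <= 0.5:
        start, end, local = low, mid, t / 0.5
    else:
        start, end, local = mid, high, (t - 0.5) / 0.5
    return tuple(round(lerp(s, e, local)) for s, e in zip(start, end))


print(dtc_to_rgb(0.0))    # cluster center itself: red
print(dtc_to_rgb(0.778))  # most distant member: deep green
```

With this scale, circles close to red sit near their cluster center, while greener circles are the more distant members of a cluster.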
Generation of SAR Maps
SAR Maps generated by the DataMiner 1.6 software are usually used to organize high throughput screening (HTS) data into clusters of chemically similar molecules, which provides a good way for interactive analysis. This structural clustering allows identification of possible false negatives and false positives in the data when the colors in the map represent experimental activity values. The map can not only display the results effectively, but also provide a convenient way to access the chemical series represented by the maximum common substructure (MCS) scaffolds. Together with the SAR (structure–activity relationship) rules and the substructure- and property-based tools provided in DataMiner, the SAR Map is a powerful aid for deciding which molecules should be studied further.
First, the cluster centers of the 10 most frequently occurring clusters of the Level 1 scaffolds observed in the Tree Maps for each standardized subset were defined as the queries to search the dataset by using the Substructure Filter from File component in PP 8.5. The 4816 identified records (i.e., original molecules) were saved into an SDF file (Additional file 1: File S1).
Then, the Generate SAR Map function in DataMiner 1.6 was used to generate the structure similarity maps, i.e. SAR Maps. The K-dissimilarity Selection or OptiSim method [31–33] was used to select a diverse and representative sample from the original dataset based on the Tanimoto similarity distances calculated from the 2D UNITY structural fingerprints. Because the SAR Map is not a simple plot of two variables, it does not have axes. For N compounds, the SAR Map is an optimal projection of the N-squared similarities among the points onto a two-dimensional plot using the nonlinear mapping (NLM) projection method. Singleton Radius and SAR Map Horizon are two critical parameters that control the map. The Singleton Radius represents a dissimilarity radius, which was set to 0.3. A singleton is a compound that does not have any nearest neighbor within a predefined radius, and it is regarded as a point at the edge of the map. The SAR Map Horizon was also set to 0.3, which means that two points will be placed far apart if the dissimilarity between them is higher than the parameter value, but the distance between them is not drawn to scale relative to the other distances on the map. Accordingly, molecules that gather on the map represent much more similar compounds and are more meaningful than the separated ones. Therefore, 40 denser areas, the so-called representative molecules, were selected and shown with black dotted circles on the SAR Map. The similarity between the molecules in each area and its central molecule was at least 0.8, and the representative molecules in each area were saved as an SDF file (Additional file 1: File S1). Then, selected molecules from each circle were used as the queries to identify similar molecules in the BindingDB database. In the similarity search, the structural similarity threshold for each query was adjusted to make sure that at least one similar compound could be found for each query, and the minimum similarity threshold was set to 0.6. Finally, the potential targets of 39 queries were assigned to those of the similar molecules found in BindingDB.
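The similarity concepts used in this section can be sketched in plain Python. The sketch is not the DataMiner or BindingDB implementation: sets of on-bit indices stand in for the proprietary UNITY fingerprints, the compound names and bit sets are invented, and the descending threshold schedule (1.0, 0.9, 0.8, 0.7, 0.6) is an assumption, since the text states only that the threshold was lowered for each query down to a minimum of 0.6.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def singletons(fingerprints, radius=0.3):
    """Names of compounds with no neighbor within the dissimilarity radius."""
    result = []
    for name, fp in fingerprints.items():
        has_neighbor = any(
            1.0 - tanimoto(fp, other) <= radius
            for other_name, other in fingerprints.items()
            if other_name != name
        )
        if not has_neighbor:
            result.append(name)
    return result


def search_with_relaxation(query, database,
                           thresholds=(1.0, 0.9, 0.8, 0.7, 0.6)):
    """Lower the similarity threshold until at least one hit is found.

    Returns (threshold, hits); hits is empty if nothing reaches the
    lowest threshold (0.6 by default, the minimum quoted in the text).
    """
    for threshold in thresholds:
        hits = [name for name, fp in database.items()
                if tanimoto(query, fp) >= threshold]
        if hits:
            return threshold, hits
    return thresholds[-1], []


# Hypothetical fingerprints: "a" and "b" are close, "c" is unrelated.
fps = {"a": {1, 2, 3, 4}, "b": {1, 2, 3, 4, 5}, "c": {10, 11, 12}}
print(singletons(fps))  # only "c" has no neighbor within radius 0.3

# Hypothetical reference set standing in for BindingDB entries.
reference = {"x": {1, 2, 3, 4, 5, 6}, "y": {20, 21}}
print(search_with_relaxation({1, 2, 3, 4}, reference))
```

A query that finds no hit even at the lowest threshold returns an empty list, which mirrors the one representative molecule out of 40 for which no similar compound was found.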
Results and discussion
Counts of fragments
Numbers of the duplicated and non-duplicated ring assemblies (ra), bridge assemblies (b), rings (r), chains (c), Murcko frameworks (m) and RECAP fragments (RECAP) for the 12 standardized datasets
Numbers of the duplicated and non-duplicated scaffolds at different levels of Scaffold Tree for the 12 standardized datasets
Two kinds of fragments contain side chains: chain assemblies (chains) and RECAP fragments. The percentages of molecules that do not have any ring in the standardized subsets were also calculated, and they are 0.12, 0.34, 0.51, 0.58, 0.24, 0.56, 0.48, 0.08, 4.71, 0.96, 0.49 and 0.36% for ChemBridge, ChemDiv, ChemicalBlock, Enamine, LifeChemicals, Maybridge, Mcule, Specs, TCMCD, UORSY, VitasM and ZelinskyInstitute, respectively. Among the studied libraries, TCMCD has the highest percentage of acyclic molecules (4.71%, close to 2000 molecules), which is consistent with the results reported by Tian et al. However, the total number of chains in TCMCD is the second lowest (466,842). More interestingly, TCMCD has 5962 unique chains, almost twice as many as ChemBridge (3450). Considering that the standardized subset of TCMCD has more acyclic compounds and fewer chains but more unique chains, it appears that the chains in TCMCD are bigger or more complicated and diverse. Although Maybridge has the fewest chains (461,415), a number similar to that of TCMCD, its number of unique chains (3543) is at the average level and is still higher than those of ChemBridge (3450) and ChemDiv (3493), even though ChemBridge and ChemDiv have the two highest total numbers of chains (>510,000). Thus, the structures in Maybridge may be more diverse, which needs to be explored by other types of fragment representations. Among the studied libraries, UORSY and Enamine have more non-duplicated chain assemblies (6120 and 6002) than the others, more than twice the number in LifeChemicals (2603), suggesting that they have more diverse chains. Moreover, Mcule has a relatively high number of unique chains (5368).
Another fragment representation containing side chains is RECAP fragments, which are the building blocks for synthesizing molecules. As shown in Table 2, TCMCD has an extremely high number of RECAP fragments (702,520), indicating that, on average, synthesizing a compound in TCMCD needs more RECAP fragments than synthesizing a molecule in any other standardized subset. That is to say, synthesizing the compounds in TCMCD may be quite difficult. ChemBridge, Enamine and UORSY have relatively high numbers of RECAP fragments (~500,000), almost twice those of ChemicalBlock (250,765) and Maybridge (264,327). Therefore, it may be easier to synthesize the molecules in ChemicalBlock and Maybridge.
Of the other five types of fragment representations, three belong to ring systems: rings, ring assemblies and bridge assemblies. The total numbers of rings for all libraries are quite close, and the biggest difference is found between Maybridge (110,054) and ChemDiv (129,997). Similarly, the total numbers of ring assemblies of these libraries do not differ much, with TCMCD as the only exception. The number of all ring assemblies in TCMCD (58,111) is significantly lower than those in the other libraries, but, interestingly, the number of unique ring assemblies in TCMCD (1351) is considerably higher than those in the other libraries. Different from rings and ring assemblies, bridge assemblies characterize contiguous ring systems sharing two or more bonds, and therefore they are also ring assemblies but more complicated. As shown in Table 2, the total number of bridge assemblies in TCMCD (5793) is significantly higher than those in the other libraries. Although the total number of simple ring systems in TCMCD is not very high, its numbers of unique rings and ring assemblies are much higher than those of the other libraries. In short, TCMCD has more complicated and diverse ring systems, whereas commercial libraries generally contain more simple rings than multiple ring systems such as bridge assemblies. As a whole, ChemicalBlock and Specs have more unique ring systems, as shown in Table 2. Mcule and VitasM have relatively diverse ring systems, and Mcule also has relatively diverse chains. Enamine and UORSY have relatively high numbers of unique chains, but their numbers of distinctive ring systems are low. For LifeChemicals, the numbers of both unique chains and unique ring systems are quite low, suggesting that it has relatively low structural diversity.
The other two types of fragment representations, Murcko frameworks and Scaffold Tree, characterize molecular scaffolds, and they can represent the overall structural features of the compounds in a library. Murcko frameworks, which are the union of ring systems (Fig. 1a) and linkers (Fig. 1b), are usually used as the structural signatures of molecules. As shown in Table 2, the total numbers of Murcko frameworks do not differ much among the standardized subsets except for TCMCD, whose deviation may result from the much larger number of acyclic molecules it contains, but the numbers of unique Murcko frameworks are quite different. The number of unique Murcko frameworks for Mcule (27,247) is the highest, while that for TCMCD (12,941) is the lowest, highlighting the fact that the structures of natural compounds may be more conserved than those of the synthesized molecules in commercially available libraries. Other databases, such as ChemBridge and Enamine, also possess relatively high numbers of unique Murcko frameworks (25,788 and 26,870, respectively). However, as mentioned above, the diversity of the ring systems for Enamine is rather low.
In summary, TCMCD contains much more complicated structures, and its whole-molecule scaffolds are more conserved than those of the commercial libraries. Generally speaking, at Levels 2 and 3, ChemBridge and Mcule show high structural diversity. At Level 5 or higher, ChemicalBlock, Specs and VitasM possess relatively high structural diversity, suggesting that these libraries contain more complicated structures. LifeChemicals has relatively high diversity for the scaffolds at Levels 3 and 4, but relatively low diversity for rings, ring assemblies and bridge assemblies (Table 2). To characterize the structural diversity of the 12 studied libraries more clearly, further quantitative analyses are necessary.
Cumulative scaffold frequency plots (CSFPs)
PC50C values of the Murcko frameworks (Murcko) and Level 1 scaffolds for the 12 standardized datasets
The scaffold diversity evaluated from the Level 1 scaffolds and from the Murcko frameworks shows similar overall trends. Three libraries, ChemDiv, Mcule and LifeChemicals, are more structurally diverse for both the Level 1 scaffolds and the Murcko frameworks, and two libraries, TCMCD and Specs, are less structurally diverse. However, count statistics cannot reveal the similarities among these scaffolds, and the scaffolds of TCMCD may turn out to be more diverse in terms of similarity. Besides, the exact trends of the CSFPs for the Murcko frameworks and the Level 1 scaffolds also differ. The CSFPs for the Murcko frameworks are more discriminatory. It is possible that the more granular Murcko frameworks enhance the apparent scaffold diversity. Moreover, PC50C is just a simple index at a single point of a CSFP. Therefore, a more comprehensive comparison of the distributions of the Level 1 scaffolds is necessary to evaluate the structural features of these libraries.
In the previous section, we analyzed the scaffold diversity of the 12 libraries using the distributions of molecules over scaffolds. Our analyses show that the studied libraries are not evenly distributed over scaffolds, but we know little about the structural similarity and distribution of representative scaffolds. Thus, Tree Maps was used to visualize the structural similarity and distribution of the Level 1 scaffolds.
In Additional file 2: Fig. S3, the scaffolds acting as the cluster centers in the Tree Maps are clearly more dissimilar to each other. As shown in Fig. 7, there are only 2 scaffolds (26 and 27) with frequencies ≥2, which can be found in ChemBridge and LifeChemicals, and in ChemDiv and Maybridge, respectively. It seems that the scaffolds of these cluster centers, which serve as the representatives of the clusters, are more unique than the most frequent ones.
In the previous two sections, the structural features, distributions and scaffold diversity of the 12 libraries were analyzed, but the relationships among the scaffolds present in the clusters of different libraries were not explored. Therefore, the chemical space of the molecules identified by the substructure search of the representative scaffolds, which are the cluster centers from the Tree Maps for the 12 subsets, was characterized by the SAR Maps methodology. In addition, the strong interest in diverse scaffolds that preferentially interact with important target families was taken into consideration. The underlying pharmacological characteristics of some representative scaffolds, which are important components of drug candidates against different drug targets, were also predicted.
Therefore, to focus on the gathered molecules, the original SAR Map is magnified and shown in Fig. 8b. Compounds in the same library are represented by points with the same color, size and shape. As shown in Fig. 8b, most of the biggest blue circles of TCMCD lie on the left of the map, and most of the pink circles of ChemicalBlock lie on the upper right. Similarly, most light blue circles of Maybridge are at the bottom. Other libraries, such as Mcule, represented by the smallest blue circles, are distributed more sparsely with few dense parts. However, Mcule has 518 representative molecules on the map, roughly equal to the number for Maybridge (513). The more dispersed distribution of Mcule suggests that Mcule also contains a large number of diverse molecules. The gray points of LifeChemicals also spread over a wide range, but some accumulate in certain separated areas. Thus, each library appears to contain some distinct molecules, as shown by the denser areas on the map. Then, 40 selected areas of representative molecules, highlighted by the black dotted circles on the SAR Map, were identified.
To grasp the potential functions and structural properties of these selected representative molecules, similarity searching and MCS searching were carried out. By searching BindingDB based on similarity, inhibitors similar to the representative molecules and the corresponding targets were obtained. Similar molecules in BindingDB could be found for 39 out of the 40 representative molecules; the 39 corresponding MCSs are shown in Additional file 2: Fig. S4 and the potential targets are listed in Additional file 2: Table S1. We found that many identified potential targets were kinases and GPCRs with high similarity thresholds, such as Pyruvate kinase for ChemDiv, streptokinase A precursor for ChemicalBlock, Cyclin-Dependent kinase for LifeChemicals, Serine/threonine-protein kinase for Maybridge, hexokinase and Serine-protein kinase for TCMCD and Glycogen synthase kinase for LifeChemicals, Maybridge, Mcule, TCMCD, VitasM and ZelinskyInstitute. Moreover, GPCRs were also identified as the potential targets for the representative molecules found in ChemBridge, ChemicalBlock, Maybridge, TCMCD and VitasM. In particular, three groups of molecules in TCMCD have high similarity (up to 1) to the inhibitors of GPCRs, but the MCSs of the representative structures from these groups are not that similar. Besides, some ion channels, transporters, etc. can also be found as potential targets. Our results suggest that the typical structures found by the SAR Maps can reveal some important structural and potential functional features of each dataset. Specifically, TCMCD, ChemicalBlock and Maybridge, which occupy unique areas in chemical space, have great potential for finding drug candidates against vital druggable targets such as kinases and GPCRs.
In this study, based on seven different fragment representations, the structural features, scaffold diversity and chemical distributions of 12 libraries, including 11 commercially available compound libraries and TCMCD, were explored and compared. The analyses indicate that ChemBridge, ChemicalBlock, Mcule, TCMCD and VitasM are more structurally diverse than the other databases. TCMCD is actually not very structurally diverse for simple molecules, but its most frequently occurring Level 1 scaffolds differ greatly from those of the other libraries. Although ChemBridge, Mcule and VitasM are rich in different kinds of fragments, their representative molecules largely overlap with those of the other databases, suggesting that the number of unique compounds in these libraries may in fact not be very high. The structures in ChemicalBlock are diverse and complicated enough for VS. LifeChemicals does not have a wide variety of fragments but has many dissimilar molecular structures. Some libraries, such as Enamine and UORSY, are not good choices for practical VS considering the structural complexity and diversity of their molecules. Besides, 40 groups of representative scaffolds were identified in these 12 databases through Tree Maps and SAR Maps, and some molecules with these representative scaffolds found in certain libraries may be potential inhibitors of kinases and GPCRs. We believe that our study may provide valuable information for selecting proper commercial libraries in practical VS.
JS, DK and TH conceived and designed the experiments. JS, HS and HL performed the simulations. JS, HS, HL, FC, ST, PP and DL analyzed the data. JS, DK and TH wrote the manuscript. All authors read and approved the final manuscript.
We would like to thank Zhengkun Kuang and Wenlei Peng from the College of Informatics of Huazhong Agricultural University for providing guidance on employing Pipeline Pilot and programming in shell.
The authors declare that they have no competing interests.
This study was funded by the National Natural Science Foundation of China (21275061; 81302679; 21575128) and the National Major Basic Research Program of China (2016YFA0501701; 2016YFB0201700).
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Sterling T, Irwin JJ (2015) ZINC 15-ligand discovery for everyone. J Chem Inf Model 55:2324–2337
- Tiikkainen P, Franke L (2012) Analysis of commercial and public bioactivity databases. J Chem Inf Model 52:319–326
- Johnson M, Maggiora G (1990) Concepts and applications of molecular similarity. Wiley, New York
- Garcia-Castro M, Zimmermann S, Sankar MG, Kumar K (2016) Scaffold diversity synthesis and its application in probe and drug discovery. Angew Chem Int Edit 55:7586–7605
- Markush EA (1924) Pyrazolone dye and process of making the same. US Patent 1506316
- Leach AR, Gillet VJ (2007) An introduction to chemoinformatics. Springer, Dordrecht
- Bemis GW, Murcko MA (1996) The properties of known drugs. 1. Molecular frameworks. J Med Chem 39:2887–2893
- Lewell XQ, Judd DB, Watson SP, Hann MM (1998) RECAP—retrosynthetic combinatorial analysis procedure: a powerful new technique for identifying privileged molecular fragments with useful applications in combinatorial chemistry. J Chem Inf Comput Sci 38:511–522
- Schuffenhauer A, Ertl P, Roggo S, Wetzel S, Koch MA et al (2007) The scaffold tree-visualization of the scaffold universe by hierarchical scaffold classification. J Chem Inf Model 47:47–58
- Agrafiotis DK, Wiener JJM (2010) Scaffold Explorer: an interactive tool for organizing and mining structure-activity data spanning multiple chemotypes. J Med Chem 53:5002–5011
- Wetzel S, Klein K, Renner S, Rauh D, Oprea TI et al (2009) Interactive exploration of chemical space with Scaffold Hunter. Nat Chem Biol 5:581–583
- Langdon SR, Brown N, Blagg J (2011) Scaffold diversity of exemplified medicinal chemistry space. J Chem Inf Model 51:2174–2185
- Tian S, Wang J, Li Y, Li D, Xu L et al (2015) The application of in silico drug-likeness predictions in pharmaceutical research. Adv Drug Deliv Rev 86:2–10
- Krier M, Bret G, Rognan D (2006) Assessing the scaffold diversity of screening libraries. J Chem Inf Model 46:512–524
- Shneiderman B (1992) Tree visualization with tree-maps—2-D space-filling approach. ACM Trans Graph 11:92–99
- DataMiner 1.6. http://www.tripos.com/. Accessed April 2016
- Hou TJ, Qiao XB, Xu XJ (2001) Research and development of 3D molecular structure database of traditional Chinese drugs. Acta Chim Sin 59:1788–1792
- Qiao XB, Hou TJ, Zhang W, Guo SL, Xu SJ (2002) A 3D structure database of components from Chinese traditional medicinal herbs. J Chem Inf Comput Sci 42:481–489
- Shen M, Tian S, Li Y, Li Q, Xu X et al (2012) Drug-likeness analysis of traditional Chinese medicines: 1. Property distributions of drug-like compounds, non-drug-like compounds and natural compounds from traditional Chinese medicines. J Cheminform 4:31
- Pipeline Pilot 8.5. http://accelrys.com/. Accessed April 2016
- Muresan S, Sadowski J (2005) “In-House likeness”: comparison of large compound collections using artificial neural networks. J Chem Inf Model 45:888–893
- MOE version 2014. http://www.chemcomp.com/. Accessed April 2016
- Lipkus AH, Yuan Q, Lucas KA, Funk SA, Bartelt WF III et al (2008) Structural diversity of organic chemistry. A scaffold analysis of the CAS Registry. J Org Chem 73:4443–4451
- Gonzalez-Medina M, Prieto-Martinez FD, Owen JR, Medina-Franco JL (2016) Consensus Diversity Plots: a global diversity analysis of chemical libraries. J Cheminform 8:63
- Clark AM (2010) 2D depiction of fragment hierarchies. J Chem Inf Model 50:37–46
- Gorse D, Lahana R (2000) Functional diversity of compound libraries. Curr Opin Chem Biol 4:287–294
- Khanna V, Ranganathan S (2011) Structural diversity of biologically interesting datasets: a scaffold analysis approach. J Cheminform 3:30
- Le GT, Abbenante G, Becker B, Grathwohl M, Halliday J et al (2003) Molecular diversity through sugar scaffolds. Drug Discov Today 8:701–709
- Tian S, Li Y, Wang J, Xu X, Xu L et al (2013) Drug-likeness analysis of traditional Chinese medicines: 2. Characterization of scaffold architectures for drug-like compounds, non-drug-like compounds, and natural compounds from traditional Chinese medicines. J Cheminform 5:5
- TreeMap v. 3.8.3. http://www.treemap.com/. Accessed April 2016
- Brown RD, Martin YC (1996) Use of structure-activity data to compare structure-based clustering methods and descriptors for use in compound selection. J Chem Inf Comput Sci 36:572–584
- Clark RD (1997) OptiSim: an extended dissimilarity selection method for finding diverse representative subsets. J Chem Inf Comput Sci 37:1181–1188
- Clark RD, Patterson DE, Soltanshahi F, Blake JF, Matthew JB (2000) Visualizing substructural fingerprints. J Mol Graph Model 18:404–411
- SYBYL X1.0. https://www.certara.com. Accessed 1 Jan 2016
- Sammon JW (1969) A nonlinear mapping for data structure analysis. IEEE Trans Comput C 18:401–409
- Gilson MK, Liu TQ, Baitaluk M, Nicola G, Hwang L et al (2016) BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Res 44:D1045–D1053
- Yongye AB, Waddell J, Medina-Franco JL (2012) Molecular scaffold analysis of natural products databases in the public domain. Chem Biol Drug Des 80:717–724
- Hu Y, Stumpfe D, Bajorath J (2016) Computational exploration of molecular scaffolds in medicinal chemistry. J Med Chem 59:4062–4076