Cheminformatics analysis of the AR agonist and antagonist datasets in PubChem
© The Author(s) 2016
Received: 18 March 2016
Accepted: 1 July 2016
Published: 8 July 2016
As one of the largest publicly accessible databases for hosting chemical structures and biological activities, PubChem has been processing bioassay submissions from the community since 2004. As the volume of data deposited in PubChem increases, the diversity and wealth of its information content also grow. Recently, the Tox21 program deposited a series of pairwise datasets in PubChem covering different mechanisms of action (MOA), such as the androgen receptor (AR) agonist and antagonist datasets, to study cell toxicity. To the best of our knowledge, little cheminformatics work has been reported on these pairwise datasets, which may provide insight into the mechanisms of action of the compounds and the relationship between chemical structure and function, as well as guidance for lead compound selection and optimization. Thus, to fill this gap, we performed a comprehensive cheminformatics analysis, including scaffold analysis, matched molecular pair (MMP) analysis and activity cliff analysis, to investigate the structural characteristics and discontinuous structure–activity relationships of the individual datasets (i.e., the AR agonist dataset or the AR antagonist dataset) and the combined dataset (i.e., the compounds common to the AR agonist and antagonist datasets).
Scaffolds associated only with potential agonists or antagonists were identified. MMP-based activity cliffs, as well as a small group of compounds reported with dual MOA, were recognized and analyzed. Moreover, the MOA-cliff, a novel concept, was proposed to denote a pair of structurally similar molecules that exhibit opposite MOA.
Cheminformatics methods were successfully applied to the pairwise AR datasets, and the identified molecular scaffold characteristics, MMPs and activity cliffs might provide useful information for designing new lead compounds for the androgen receptor.
As one of the largest publicly accessible databases for chemical structures and their bioactivities, PubChem, hosted by the National Center for Biotechnology Information (NCBI), National Institutes of Health (NIH), has become an increasingly important platform for data sharing in the scientific community. With three interconnected databases, PubChem Substance (identifier SID), PubChem BioAssay (identifier AID) and PubChem Compound (identifier CID), PubChem offers open access to over 50,000 users daily via the NCBI Entrez system, as well as through web-based and programmatic tools. In addition, PubChem is closely integrated with literature and other biomedical databases such as PubMed, Protein, Gene, Structure, Biosystems and Taxonomy. According to a recent review, PubChem has been successfully applied in various fields, such as developing secondary resources and tools, studying compound-target networks and drug polypharmacology, generating and validating machine learning models, and identifying lead compounds.
Despite a number of previous data mining efforts [3–7], there is a growing demand for researchers to collectively analyze bioactivity data to solve or provide insights into scientific questions, especially in the medicinal chemistry field, where one of the main tasks is to identify and optimize lead compounds towards desired biological activities. Thus, many researchers have attempted different computational approaches to accomplish such tasks, including virtual screening based on PubChem bioactivity data using the maximum unbiased validation datasets, predicting adverse drug reactions using PubChem bioassay data, and many others [10–13]. However, most of these studies focused on datasets with a single endpoint. As the volume of data deposited in PubChem increases, the diversity and wealth of its information content also grow. PubChem contains hundreds of large-scale high-throughput screening (HTS) projects, which often tested a common compound library, providing great opportunities for bioactivity profiling research. Recently, the Tox21 program compiled a library of 10,000 compounds and systematically carried out HTS projects against a group of targets and pathways, such as androgen receptor (AR), estrogen receptor (ER), retinoic acid receptor (RAR) and other receptors, searching simultaneously for agonists and antagonists in a pairwise manner. Data generated by these projects were deposited in PubChem. Analysis of such pairwise bioactivity data regarding different mechanisms of action (MOA) for the same target may result in interesting discoveries, particularly when combined with prior data in PubChem. However, to the best of our knowledge, little cheminformatics work has been reported on these datasets.
Thus, to fill the gap, we performed a comprehensive study focusing on this data collection using several cheminformatics methods, including scaffold analysis, matched molecular pair (MMP) analysis and activity cliff analysis.
In fact, previous studies have successfully applied such cheminformatics methods to the analysis of bioactivity data in public databases. For example, Hu and Bajorath performed scaffold analysis for the DrugBank database and the ChEMBL database. They concluded that many drugs contain unique scaffolds with varying structural relationships to scaffolds of currently available bioactive compounds. The same authors also explored the scaffold universe of kinase inhibitors with respect to different activities. Kramer et al. performed matched molecular pair analysis by comparing ChEMBL data and Novartis data, suggesting that MMP analysis is a very robust tool for lead optimization and will have growing importance in daily medicinal chemistry practice. Using the ChEMBL database, Dimova et al. presented a systematic evaluation of activity cliff progression in evolving compound datasets. They found that activity cliffs currently are not a major focal point of practical medicinal chemistry efforts and anticipated that chemically unexplored activity cliffs should provide significant opportunities for further study in medicinal chemistry. All these findings indicate that cheminformatics studies play important roles in medicinal chemistry. However, it can be noted that most such studies focus on the ChEMBL database.
In this work, we performed a comprehensive cheminformatics study of the Tox21 assay data deposited in the PubChem database to investigate the molecular scaffold characteristics, matched molecular pairs and activity cliffs in the individual target-based datasets (i.e., either the AR agonist dataset or the AR antagonist dataset). Moreover, we also performed a computational analysis of the combined dataset (i.e., commonly tested compounds) from the AR agonist and antagonist datasets in Tox21. Several interesting observations are reported and discussed.
Material and experimental methods
Bioactivity data for the agonist and antagonist screens for the androgen receptor (AR, GenBank: AAI32976.1) were retrieved from the PubChem BioAssay database. For the agonist screen (AID 743053), there were 372 substances reported as active outcomes and 9070 substances as inactive outcomes from a total of 10,486 substances, while for the antagonist screen (AID 743063), 670 substances were reported as active and 7770 substances as inactive from the same compound library. These original compounds were subject to further filtering as described below.
Preprocessing of the original data
To obtain the final dataset for analysis, the following steps were applied: (1) compounds with missing readouts were removed (the original 8111 unique CIDs were reduced to 8110 for both the AR agonist and antagonist datasets); (2) redundant compounds (same CIDs and same readouts but different SIDs) were removed (the number of CIDs remained the same for both the AR agonist and antagonist datasets); (3) compounds with discrepant bioactivity, meaning the same chemical structure (CID) with contradictory bioactivity reports (same CIDs but different readouts and different SIDs), were removed (CIDs were reduced to 7866 for the AR agonist dataset and 7678 for the AR antagonist dataset); (4) compounds without an outcome annotation of “Active” or “Inactive” were removed (CIDs were reduced to 7174 for the AR agonist dataset and 6321 for the AR antagonist dataset); (5) mixtures were removed (CIDs were reduced to 5649 for the AR agonist dataset and 4956 for the AR antagonist dataset); and (6) compounds containing no ring-like structures were removed (CIDs were reduced to 4162 for the AR agonist dataset and 3563 for the AR antagonist dataset). Finally, the PubChem CID (representing a unique chemical) rather than the SID (representing a sample) was used as the compound identifier to keep the data consistent. The final AR agonist dataset consisted of 172 “Active” molecules and 3990 “Inactive” ones, and the AR antagonist dataset consisted of 322 “Active” and 3241 “Inactive” compounds. The R software was used to perform the analysis.
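The six filtering steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the R workflow used in this work: the record layout (`sid`, `cid`, `outcome`, `smiles`), the use of a `.` in the SMILES string to flag mixtures, and the ring test based on ring-closure digits are all assumptions made for the example.

```python
import re
from collections import defaultdict


def has_ring(smiles):
    """Outside bracket atoms, digits (and '%') in SMILES only mark ring closures."""
    stripped = re.sub(r"\[[^\]]*\]", "", smiles)
    return any(ch.isdigit() or ch == "%" for ch in stripped)


def preprocess(records):
    """Return {cid: record} after the six filtering steps (hypothetical layout)."""
    # (1) remove records with a missing readout
    records = [r for r in records if r["outcome"] is not None]
    by_cid = defaultdict(list)
    for r in records:
        by_cid[r["cid"]].append(r)
    kept = {}
    for cid, group in by_cid.items():
        # (3) same CID with contradictory outcomes: drop the CID entirely
        if len({r["outcome"] for r in group}) > 1:
            continue
        # (2) same CID, same outcome, different SIDs: keep a single record
        kept[cid] = group[0]
    # (4) keep only "Active" / "Inactive" annotations
    kept = {c: r for c, r in kept.items() if r["outcome"] in ("Active", "Inactive")}
    # (5) assumption: a multi-component SMILES ('.') marks a mixture
    kept = {c: r for c, r in kept.items() if "." not in r["smiles"]}
    # (6) remove compounds without any ring
    return {c: r for c, r in kept.items() if has_ring(r["smiles"])}


demo = [
    {"sid": 1, "cid": 10, "outcome": "Active", "smiles": "c1ccccc1O"},
    {"sid": 2, "cid": 10, "outcome": "Active", "smiles": "c1ccccc1O"},
    {"sid": 3, "cid": 11, "outcome": "Active", "smiles": "C1CCCCC1"},
    {"sid": 4, "cid": 11, "outcome": "Inactive", "smiles": "C1CCCCC1"},
    {"sid": 5, "cid": 12, "outcome": "Inconclusive", "smiles": "c1ccncc1"},
    {"sid": 6, "cid": 13, "outcome": "Inactive", "smiles": "CC(=O)[O-].[Na+]"},
    {"sid": 7, "cid": 14, "outcome": "Inactive", "smiles": "CCO"},
    {"sid": 8, "cid": 15, "outcome": None, "smiles": "c1ccccc1"},
    {"sid": 9, "cid": 16, "outcome": "Inactive", "smiles": "C1CC1[NH3+]"},
]
final = preprocess(demo)
print(sorted(final))
```

In the toy input, only CIDs 10 and 16 survive: the others are dropped for a discrepant, unannotated, mixture, ring-free or missing record, respectively.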
A molecular scaffold, according to the definition introduced by Bemis and Murcko, is often called a BM scaffold; it is extracted from the molecule by removing all substituents while retaining the aliphatic linkers between ring systems. In this work, the scaffolds of the AR agonist and antagonist datasets were constructed using the method proposed by Matlock et al. Specifically, the scaffold network generator (sng) tool, which takes molecules in SDF format as input, was used to generate the molecular scaffolds. In addition, each scaffold was further reduced to a more abstract molecular framework (also called a cyclic skeleton (CSK)) by converting all heteroatoms to carbon and setting all bond orders (double or triple bonds) to one. Therefore, each CSK represents a series of topologically equivalent scaffolds. The RDKit software was used to obtain the CSKs from the corresponding scaffolds.
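To make the two reduction steps concrete, the sketch below applies them to a toy molecular graph in plain Python. It is not the sng or RDKit implementation used in this work; the graph encoding (`atoms` as an index-to-element mapping, `bonds` as a mapping from atom-index pairs to bond orders) and the function names are assumptions for illustration. The scaffold is obtained by repeatedly pruning terminal atoms, which leaves only rings and the linkers between them, and the CSK by setting every atom to carbon and every bond order to one.

```python
def bm_scaffold(atoms, bonds):
    """Prune terminal atoms until only rings and ring linkers remain."""
    atoms = dict(atoms)
    bonds = dict(bonds)
    while True:
        degree = {i: 0 for i in atoms}
        for i, j in bonds:
            degree[i] += 1
            degree[j] += 1
        # atoms with at most one neighbour belong to substituents
        leaves = {i for i, d in degree.items() if d <= 1}
        if not leaves:
            break
        for i in leaves:
            del atoms[i]
        bonds = {
            (i, j): order
            for (i, j), order in bonds.items()
            if i not in leaves and j not in leaves
        }
    return atoms, bonds


def cyclic_skeleton(atoms, bonds):
    """Convert every atom to carbon and every bond order to one."""
    return {i: "C" for i in atoms}, {pair: 1 for pair in bonds}


# Toy example: a pyridine ring (atoms 0-5) carrying a methoxy group (atoms 6, 7)
atoms = {0: "N", 1: "C", 2: "C", 3: "C", 4: "C", 5: "C", 6: "O", 7: "C"}
bonds = {
    (0, 1): 2, (1, 2): 1, (2, 3): 2, (3, 4): 1, (4, 5): 2, (0, 5): 1,
    (3, 6): 1, (6, 7): 1,
}
scaffold_atoms, scaffold_bonds = bm_scaffold(atoms, bonds)
csk_atoms, csk_bonds = cyclic_skeleton(scaffold_atoms, scaffold_bonds)
print(sorted(scaffold_atoms), set(csk_atoms.values()), set(csk_bonds.values()))
```

Here the methoxy substituent is pruned to give the pyridine scaffold, and the CSK is the corresponding six-membered all-carbon, all-single-bond ring, which pyridine shares with benzene and cyclohexane.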
Matched molecular pair
As described by Hussain and Rea, an MMP is a pair of molecules that differ only by a structural change at a single site; MMP analysis has become a major tool for analyzing large chemistry datasets for promising chemical transformations. In this work, size-restricted MMPs were constructed to limit structural differences between compounds to small replacements, as reported previously, using the following criteria: (1) the invariant core fragment was required to be at least twice the size of each exchanged fragment; (2) the maximal size of an exchanged fragment was limited to 13 non-hydrogen atoms; and (3) the size difference between the two exchanged fragments was limited to at most eight atoms. Thus, the generated MMPs provided a conservative measure of structural similarity. All MMP calculations were performed using the algorithm proposed by Hussain and Rea. Specifically, the mmpa module implemented in the RDKit software was used to generate the MMPs. The module was run with the default settings except for the maximal size change in heavy atoms allowed in the identified MMPs (13 in this work). The other steps were performed using the R software, which took molecules in SMILES format as input.
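The three size restrictions can be expressed as a single predicate. The sketch below is illustrative only; the function name and its parameters are hypothetical, and all sizes are assumed to be counts of non-hydrogen atoms obtained from an upstream fragmentation step that is not shown.

```python
def passes_size_restrictions(core_size, frag_a_size, frag_b_size,
                             max_fragment=13, max_difference=8):
    """Check the three size criteria for a candidate MMP (heavy-atom counts)."""
    largest = max(frag_a_size, frag_b_size)
    return (
        core_size >= 2 * largest  # (1) core at least twice each exchanged fragment
        and largest <= max_fragment  # (2) exchanged fragments of at most 13 atoms
        and abs(frag_a_size - frag_b_size) <= max_difference  # (3) difference <= 8
    )


print(passes_size_restrictions(20, 1, 3))    # small replacement on a large core
print(passes_size_restrictions(10, 6, 2))    # core too small relative to fragment
print(passes_size_restrictions(40, 14, 10))  # exchanged fragment too large
print(passes_size_restrictions(40, 12, 3))   # size difference too large
```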
An activity cliff is commonly defined as a pair of structurally similar molecules that exhibit a large difference in potency. Different methods have been successfully applied to measure the similarity between molecules, among which Tanimoto similarity based on various fingerprint descriptors (e.g., PubChem fingerprints, MACCS fingerprints, ECFP4 fingerprints and many others) and MMP-based similarity are among the most popular. In this work, the latter was adopted. In addition, the PubChem bioactivity outcome annotations (i.e., active or inactive) provided by depositors were used directly to determine the potency differences. Thus, the activity cliffs generated herein were MMP-based cliffs.
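With binary outcome annotations, one straightforward reading of this definition is that an MMP forms a cliff when its two members carry different annotations. The sketch below follows that assumption; the input layout (a list of CID pairs and a CID-to-outcome mapping) and the function name are hypothetical.

```python
def activity_cliffs(mmps, outcome):
    """Return the MMPs whose two members have different outcome annotations."""
    return [(a, b) for a, b in mmps if outcome[a] != outcome[b]]


# Toy data: three MMPs over four compounds
outcome = {101: "Active", 102: "Inactive", 103: "Inactive", 104: "Active"}
mmps = [(101, 102), (102, 103), (101, 104)]
cliffs = activity_cliffs(mmps, outcome)
print(cliffs)
```

Only the pair (101, 102) qualifies in this toy example, because the other two pairs share the same annotation.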
Results and discussion
As one of the nuclear hormone receptors, AR (GenBank: AAI32976.1) plays a critical role in AR-dependent prostate cancer and other androgen-related diseases. Interactions of endocrine disrupting chemicals with AR may disrupt normal endocrine function and interfere with metabolic homeostasis, reproduction, and developmental and behavioral functions. Thus, to identify agonists and antagonists of AR signaling, the GeneBLAzer AR-UAS-bla-GripTite cell line, which contains a beta-lactamase reporter gene under control of an upstream activator sequence stably integrated into HEK293 cells, was used to screen the Tox21 10K compound library. In this work, we investigated the screened compounds by applying several cheminformatics methods in order to mine useful information for the design of lead compounds.
Scaffolds and CSKs of the AR agonist and antagonist datasets
Table 1 Summary of the studied AR agonist and antagonist datasets (rows: number of unique compounds, number of unique scaffolds, number of unique CSKs)
It is well known that HTS datasets are imbalanced: the majority of screened compounds are inactive, while only a minority are active. In our study, the inactive compounds in the AR agonist dataset outnumber the active ones by more than 23-fold, and the scaffolds of the inactive compounds outnumber those of the active compounds by more than 21-fold (Table 1). For the AR antagonist dataset, however, the imbalance ratios between inactive and active counts are lower, about 10 for compounds and 6 for scaffolds, which indicates that the identified agonists are more structurally specific whereas the antagonists are rather structurally diverse in the studied datasets. We also calculated the diversity index (DI) of the active and inactive molecules using PubChem fingerprints. For the AR agonist dataset, the DI of the active compounds is 0.50, lower than the DI of 0.66 for the inactive compounds, although the active set is far smaller than the inactive set. For the AR antagonist dataset, the DIs are 0.61 and 0.67 for the active and inactive compounds, respectively. These comparable DIs indicate that the investigated datasets are diverse.
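The imbalance ratios quoted above follow directly from the counts given in the text, as the short calculation below shows. Only counts stated in this article are used; the scaffold counts for the antagonist dataset are not restated in the text and are therefore left out. The helper name is hypothetical.

```python
def imbalance_ratio(n_active, n_inactive):
    """Ratio of inactive to active counts."""
    return n_inactive / n_active


# Counts taken from the text (active, inactive)
agonist_compounds = (172, 3990)
agonist_scaffolds = (72, 1521)
antagonist_compounds = (322, 3241)

ratio_agonist_compounds = imbalance_ratio(*agonist_compounds)
ratio_agonist_scaffolds = imbalance_ratio(*agonist_scaffolds)
ratio_antagonist_compounds = imbalance_ratio(*antagonist_compounds)

print(f"agonist compounds:    {ratio_agonist_compounds:.2f}")
print(f"agonist scaffolds:    {ratio_agonist_scaffolds:.2f}")
print(f"antagonist compounds: {ratio_antagonist_compounds:.2f}")
```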
We further reduced the scaffolds to CSKs, which capture more general skeletons of the scaffolds. According to the criteria described above, a total of 1571 scaffolds are reduced to 895 CSKs for the AR agonist dataset, where the 72 active scaffolds correspond to 53 CSKs and the 1521 inactive ones correspond to 865 CSKs (Table 1). Likewise, the AR antagonist dataset consists of 814 unique CSKs, of which the active and inactive compounds account for 160 and 717 CSKs, respectively (Table 1). Figure 1e, f shows the distribution of scaffolds among CSKs for the AR agonist and antagonist datasets, respectively. About 77 % of all CSKs in the AR agonist dataset exhibit a one-CSK-to-one-scaffold relationship, and this ratio is 78 % for the AR antagonist dataset, again indicating that the screened compound library is structurally diverse. The whole list can be found in Additional file 1: Table S1.
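The share of CSKs that map to exactly one scaffold can be computed from a scaffold-to-CSK mapping, as in the following sketch. The mapping shown is toy data with made-up labels, and the function name is hypothetical.

```python
from collections import Counter


def singleton_csk_fraction(scaffold_to_csk):
    """Fraction of CSKs that are represented by exactly one scaffold."""
    scaffolds_per_csk = Counter(scaffold_to_csk.values())
    singletons = sum(1 for n in scaffolds_per_csk.values() if n == 1)
    return singletons / len(scaffolds_per_csk)


# Toy mapping: five scaffolds reduced to four CSKs
scaffold_to_csk = {"s1": "A", "s2": "A", "s3": "B", "s4": "C", "s5": "D"}
fraction = singleton_csk_fraction(scaffold_to_csk)
print(fraction)
```

In the toy mapping, three of the four CSKs (B, C and D) have a single scaffold, giving a fraction of 0.75.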
MMPs and activity cliffs of the AR agonist and antagonist datasets
Table: MMPs for the AR agonist and antagonist datasets (rows: number of MMPs, activity cliff MMPs)
Mechanism of actions analysis
In addition to the activity cliff analysis within the individual AR agonist and antagonist datasets, we also carried out an MMP-based analysis by combining the agonist and antagonist datasets, taking advantage of the fact that both screens tested the same compound library. We compiled a total of 3293 compounds common to both datasets. We first removed the 3008 compounds with an inactive outcome in both the AR agonist and antagonist datasets, as we wished to focus on the compounds with potential agonist or antagonist function as identified in the two screens. The remaining 285 compounds with pairwise mechanisms of action (i.e., agonist vs. antagonist) were studied further with two aims in mind: (1) to check structure-based bioactivity overlap; and (2) to explore MMP-based MOA cliffs.
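The construction of the combined dataset and the search for MOA cliffs can be sketched as follows. This is a minimal illustration on toy data: the CID-to-outcome mappings, the function names, and the decision to treat compounds active in both screens as a separate "dual" class (so that they never form an MOA cliff) are assumptions of the example.

```python
def combine(agonist, antagonist):
    """Keep common compounds that are active in at least one of the two screens."""
    common = agonist.keys() & antagonist.keys()
    return {
        cid: (agonist[cid], antagonist[cid])
        for cid in common
        if "Active" in (agonist[cid], antagonist[cid])
    }


def moa_label(outcomes):
    """Label a compound from its (agonist, antagonist) outcome pair."""
    agonist_outcome, antagonist_outcome = outcomes
    if agonist_outcome == "Active" and antagonist_outcome == "Active":
        return "dual"
    return "agonist" if agonist_outcome == "Active" else "antagonist"


def moa_cliffs(mmps, combined):
    """Return MMPs pairing one agonist-only with one antagonist-only compound."""
    cliffs = []
    for a, b in mmps:
        if a in combined and b in combined:
            labels = {moa_label(combined[a]), moa_label(combined[b])}
            if labels == {"agonist", "antagonist"}:
                cliffs.append((a, b))
    return cliffs


agonist = {1: "Active", 2: "Inactive", 3: "Inactive", 4: "Active", 5: "Active"}
antagonist = {1: "Inactive", 2: "Active", 3: "Inactive", 4: "Active", 6: "Active"}
combined = combine(agonist, antagonist)
found = moa_cliffs([(1, 2), (1, 4), (2, 3)], combined)
print(sorted(combined), found)

# The counts reported in the text are consistent with this filtering step
print(3293 - 3008)
```

In the toy data, compound 3 is dropped for being inactive in both screens, compounds 5 and 6 are dropped for being tested in only one, and the single MOA cliff is the pair (1, 2).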
Table: Summary of MMPs and cliffs for the combined AR dataset (rows: number of MMPs)
In this work, we analyzed the pairwise agonist and antagonist AR data using scaffold analysis, matched molecular pair analysis and activity cliff analysis. Scaffolds with distinct agonist or antagonist bioactivity, as well as those showing activity cliffs, were identified. In addition to the activity cliffs for a single MOA, we also carried out activity cliff analysis by combining the AR agonist and antagonist datasets. We proposed a novel MOA-based cliff concept to denote a pair of structurally similar molecules that exhibit opposite MOA. In summary, through a thorough investigation of the Tox21 AR datasets, a series of scaffolds, MMPs, activity cliffs and MOA-cliffs have been identified or proposed. We hope this analysis will be helpful for optimizing or designing novel AR agonists and antagonists, and for finding the key structural elements that determine the mechanisms of action of small-molecule compounds.
MH and YW conceptualized the project. MH was responsible for the solution development. YW supervised the project. All authors participated in the project discussion. All authors read and approved the final manuscript.
This research was supported by the Intramural Research Program of the NIH, National Library of Medicine.
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Wang Y, Suzek T, Zhang J, Wang J, He S, Cheng T, Shoemaker BA, Gindulyte A, Bryant SH (2014) PubChem BioAssay: 2014 update. Nucleic Acids Res 42:D1075–D1082
- Cheng T, Pan Y, Hao M, Wang Y, Bryant SH (2014) PubChem applications in drug discovery: a bibliometric analysis. Drug Discov Today 19:1751–1756
- Rupp M, Schroeter T, Steri R, Proschak E, Hansen K, Zettl H, Rau O, Schubert-Zsilavecz M, Müller K-R, Schneider G (2010) Kernel learning for ligand-based virtual screening: discovery of a new PPARγ agonist. J Cheminform 2:P27
- Reynolds CR, Sternberg MJ (2012) Integrating logic-based machine learning and virtual screening to discover new drugs. J Cheminform 4:O10
- Kurczab R, Smusz S, Bojarski AJ (2014) The influence of negative training set size on machine learning-based virtual screening. J Cheminform 6:32
- Ahmed A, Saeed F, Salim N, Abdo A (2014) Condorcet and borda count fusion method for ligand-based virtual screening. J Cheminform 6:19
- Xie XQ, Chen JZ (2008) Data mining a small molecule drug screening representative subset from NIH PubChem. J Chem Inf Model 48:465–475
- Rohrer SG, Baumann K (2009) Maximum unbiased validation (MUV) data sets for virtual screening based on PubChem bioactivity data. J Chem Inf Model 49:169–184
- Pouliot Y, Chiang AP, Butte AJ (2011) Predicting adverse drug reactions using publicly available PubChem bioassay data. Clin Pharmacol Ther 90:90–99
- Chen B, Wild D, Guha R (2009) PubChem as a source of polypharmacology. J Chem Inf Model 49:2044–2055
- van Deursen R, Blum LC, Reymond JL (2010) A searchable map of PubChem. J Chem Inf Model 50:1924–1934
- Wendt B, Mulbaier M, Wawro S, Schultes C, Alonso J, Janssen B, Lewis J (2011) Toluidinesulfonamide hypoxia-induced factor 1 inhibitors: alleviating drug-drug interactions through use of PubChem data and comparative molecular field analysis guided synthesis. J Med Chem 54:3982–3986
- Hao M, Wang Y, Bryant SH (2014) An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data. Anal Chim Acta 806:117–127
- Hu Y, Bajorath J (2014) Many drugs contain unique scaffolds with varying structural relationships to scaffolds of currently available bioactive compounds. Eur J Med Chem 76:427–434
- Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Tzur D, Gautam B, Hassanali M (2008) DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res 36:D901–D906
- Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B, Overington JP (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 40:D1100–D1107
- Hu Y, Bajorath J (2015) Exploring the scaffold universe of kinase inhibitors. J Med Chem 58:315–332
- Kramer C, Fuchs JE, Whitebread S, Gedeck P, Liedl KR (2014) Matched molecular pair analysis: significance and the impact of experimental uncertainty. J Med Chem 57:3786–3802
- Dimova D, Heikamp K, Stumpfe D, Bajorath J (2013) Do medicinal chemists learn from activity cliffs? A systematic evaluation of cliff progression in evolving compound data sets. J Med Chem 56:3339–3345
- R Core Team (2015) R: a language and environment for statistical computing. R foundation for statistical computing, Vienna, Austria. http://www.R-project.org/
- Bemis GW, Murcko MA (1996) The properties of known drugs. 1. Molecular frameworks. J Med Chem 39:2887–2893
- Matlock M, Zaretzki J, Swamidass SJ (2013) Scaffold network generator: a tool for mining molecular structures. Bioinformatics 29:2655–2656
- Hu Y, Bajorath J (2015) Structural and activity profile relationships between drug scaffolds. AAPS J 17:609–619
- RDKit: open-source cheminformatics software, version 2015.03. http://www.rdkit.org/
- Hussain J, Rea C (2010) Computationally efficient algorithm to identify matched molecular pairs (MMPs) in large data sets. J Chem Inf Model 50:339–348
- Dimova D, Hu Y, Bajorath J (2012) Matched molecular pair analysis of small molecule microarray data identifies promiscuity cliffs and reveals molecular origins of extreme compound promiscuity. J Med Chem 55:10220–10228
- Perez-Villanueva J, Mendez-Lucio O, Soria-Arteche O, Medina-Franco JL (2015) Activity cliffs and activity cliff generators based on chemotype-related activity landscapes. Mol Divers 19:1021–1035
- Hu Y, Maggiora G, Bajorath J (2013) Activity cliffs in PubChem confirmatory bioassays taking inactive compounds into account. J Comput Aided Mol Des 27:115–124
- Perez JJ (2005) Managing molecular diversity. Chem Soc Rev 34:143–152
- Birch AM, Kenny PW, Simpson I, Whittamore PR (2009) Matched molecular pair analysis of activity and properties of glycogen phosphorylase inhibitors. Bioorg Med Chem Lett 19:850–853
- Stumpfe D, Hu Y, Dimova D, Bajorath J (2014) Recent progress in understanding activity cliffs and their utility in medicinal chemistry. J Med Chem 57:18–28
- Hu Y, Furtmann N, Gutschow M, Bajorath J (2012) Systematic identification and classification of three-dimensional activity cliffs. J Chem Inf Model 52:1490–1498
- Dimova D, Stumpfe D, Hu Y, Bajorath J (2015) Activity cliff clusters as a source of structure-activity relationship information. Expert Opin Drug Discov 10:441–447
- Hu Y, Bajorath J (2012) Extending the activity cliff concept: structural categorization of activity cliffs and systematic identification of different types of cliffs in the ChEMBL database. J Chem Inf Model 52:1806–1811