Mining basic active structures from a large-scale database
© Takada et al.; licensee Chemistry Central Ltd. 2013
Received: 14 December 2012
Accepted: 8 March 2013
Published: 16 March 2013
The PubChem database is a large-scale resource for chemical information, containing millions of chemical compound activities derived by high-throughput screening (HTS). The ability to extract characteristic substructures from such enormous amounts of data is steadily growing in importance. Compounds with shared basic active structures (BASs) exhibiting G-protein coupled receptor (GPCR) activity and repeated dose toxicity have been mined from small datasets. However, the mining process employed was not applicable to large datasets owing to a large imbalance between the numbers of active and inactive compounds. In most datasets, roughly one active compound appears for every 1000 inactive compounds, whereas most mining techniques work well only when these numbers are similar.
This difficulty was overcome by sampling an equal number of active and inactive compounds. The sampling process was repeated to maintain the structural diversity of the inactive compounds. An interactive KNIME workflow that enabled effective sampling and data cleaning processes was created. The application of the cascade model and subsequent structural refinement yielded the BAS candidates. Repeated sampling increased the ratio of active compounds containing these substructures. Three samplings were deemed adequate to identify all of the meaningful BASs. BASs expressing similar structures were grouped to give the final set of BASs. This method was applied to HIV integrase and protease inhibitor activities in the MDL Drug Data Report (MDDR) database and to procaspase-3 activators in the PubChem BioAssay database, yielding 14, 12, and 18 BASs, respectively.
The proposed mining scheme successfully extracted meaningful substructures from large datasets of chemical structures. The resulting BASs were deemed reasonable by an experienced medicinal chemist. The mining itself requires about 3 days to extract BASs with a given physiological activity. Thus, the method described herein is an effective way to analyze large HTS databases.
The extraction of compounds with characteristic substructures and a certain physiological activity from large chemical databases is an important step in determining structure-activity relationships. The concept of basic active structures (BASs) has been discussed previously [1]. A BAS is a substructure that is generally indicative of a certain biological activity. A set of BASs is expected to cover most of the active compounds in a given assay dataset. BASs have already been extracted for G-protein coupled receptor (GPCR)-related activity and repeated dose toxicity, and the results have been disclosed on the BASiC website [2].
Pharmaceutical companies produce in-house datasets via high-throughput screening (HTS), and these datasets can contain hundreds of thousands of compounds. The PubChem BioAssay Project releases large-scale screening databases for public use [3]. While some research has focused on predicting biological activity based on these data, the results have not provided insight into characteristic structures [4, 5]. Rough set and activity landscape methods have provided useful suggestions as to the active substructure, but the number of molecules in the datasets was limited [6, 7].
The extraction of BASs from these datasets provides a means of recognizing a pharmacophore with a target activity. However, the previous mining technique employed by the authors, which was based on a cascade model, was not applicable to large HTS datasets. The number of inactive compounds in such databases is usually 1000 times that of active compounds. The magnitude of this imbalance prohibits the extraction of characteristic substructures of active compounds. This difficulty is not limited to the cascade model but is also commonly encountered in most data-mining methods. The current report introduces a sampling technique that can be used to overcome the problems associated with unbalanced data. The technique uses all of the active compounds and an equal number of randomly sampled inactive compounds. Repeating the sampling process yields several sets of similar BASs while avoiding sampling biases.
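The sampling idea described above can be sketched in a few lines. This is a minimal illustration, not the KNIME workflow itself: the function name `balanced_samples`, the seed, and the toy identifiers are assumptions introduced here. Each sampling keeps every active compound and draws an equal number of inactive compounds at random.

```python
import random

def balanced_samples(actives, inactives, n_samplings=3, seed=0):
    """Build one balanced dataset per sampling: all actives plus an
    equally sized random draw of inactives (hypothetical helper)."""
    rng = random.Random(seed)
    k = min(len(actives), len(inactives))
    datasets = []
    for _ in range(n_samplings):
        # Draw without replacement within a single sampling; different
        # samplings are independent, so they see different inactives.
        sampled_inactives = rng.sample(inactives, k)
        datasets.append(list(actives) + sampled_inactives)
    return datasets

# Toy data with a 1:1000 imbalance (identifiers are made up).
actives = [f"A{i}" for i in range(5)]
inactives = [f"I{i}" for i in range(5000)]
datasets = balanced_samples(actives, inactives)
```

Repeating the draw is what restores some of the structural diversity of the inactive class that a single small sample would lose.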
The overall mining process was demonstrated by extracting BASs exhibiting HIV integrase inhibitor activity from the MDL Drug Data Report (MDDR) database. All compounds without a reference to this activity were assumed to be inactive. The tedious task of data preprocessing was reduced by the development of a KNIME workflow. The strategy was also applied to extract compounds with HIV protease activity from the MDDR database and compounds showing procaspase-3 activator activity from the PubChem BioAssay database. All of the developed software environments have been disclosed free of charge on the Internet.
Workflow for pre-processing
Simple handling processes are necessary to eliminate or minimize the most tedious tasks involved in repeated sampling, data cleaning, and mining. The following section describes a KNIME (version 2.4.0) workflow that was developed to pre-process compound data [8]. The MDDR database (version 2003.1) was used as the data source targeting HIV integrase inhibitors [9]. The MDDR database contains more than 130,000 records, of which only 153 compounds show the desired activity. All other compounds were assumed to be inactive.
An input molecule is converted to an RDKit molecule object in node II-A-1 and then transformed to canonical SMILES in node II-A-2. A Java program developed to recognize and remove salts is used in node II-A-3, and the results are converted to SMILES.
The desalt program failed when the salt was larger than the drug itself. For these cases, the meta node II-E in Figure 3 allows the results to be checked manually.
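The failure mode mentioned above is easy to reproduce with the common "keep the largest fragment" heuristic. The sketch below is an assumption for illustration only: it is written in Python with a rough regular-expression atom counter, whereas the desalt step in the workflow is a Java program whose rules are not described here. The names `heavy_atom_count` and `desalt` are hypothetical.

```python
import re

# Rough heavy-atom tokenizer for SMILES: bracket atoms first, then the
# two-letter halogens, then single-letter organic-subset atoms.
_ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")

def heavy_atom_count(fragment):
    """Approximate number of non-hydrogen atoms in a SMILES fragment."""
    return len(_ATOM.findall(fragment))

def desalt(smiles):
    """Return (kept_fragment, discarded_fragments) using the
    largest-fragment heuristic (an assumed rule, not the authors')."""
    fragments = smiles.split(".")
    ranked = sorted(fragments, key=heavy_atom_count, reverse=True)
    return ranked[0], ranked[1:]

# Typical case: a hydrochloride salt, where the drug is the larger part.
parent, discarded = desalt("CN(C)CCc1c[nH]c2ccccc12.Cl")

# Problem case: a small amine paired with a larger acid counter-ion.
# The heuristic keeps the acid, so the result needs a manual check.
wrong_parent, _ = desalt("CCN.OC(=O)c1ccccc1C(=O)O")
```

Returning the discarded fragments alongside the kept one makes it straightforward to present both to a reviewer, which is the role the manual check plays in the workflow.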
Normalizing charge and tautomers
These processes use the same workflow as that shown at the top of Figure 3. Charge and tautomer normalization use the new meta nodes III-A and IV-A, respectively, depicted at the bottom of Figure 3. A Python script was developed to normalize charges in node III-A-1 of Figure 3. The subsequent steps change the molecular format from normalized SMILES to RDKit Canon Smiles.
OpenEye’s QuacPac [10] was used to normalize tautomers in the External Tool (Labs) Node IV-A-1 in Figure 3. The following nodes serve to change the molecular format and allow interactive validation of the transformed structures. The R node is the same as that used in the desalt step.
Removing duplicate structures
The mining process was demonstrated using HIV integrase inhibitor as an example BAS activity. Three sample datasets were selected using the pre-processing technique described in the previous section. The details of the method can be found in Reference .
Fragments, rules, and structural refinement
This rule denotes that there are 135 compounds that meet the condition N3H=y (the existence of an amino group with at least one hydrogen atom) with an inactive/active compound ratio of 0.615/0.385. Further applying the main condition C4H-c3:::c3H:c3H=y (the existence of an aromatic ring as shown at the bottom right in Figure 6 – this includes parts of a ring or a set of fused rings) yields 32 compounds, and the ratio changes to 0.156/0.844. The last BSS denotes the strength of the rule, and its definition is given below.
n_A: number of compounds covered after the application of the main condition
n_B: number of compounds covered before the application of the main condition
The subscripts P and N denote the positive and negative classes, respectively.
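As a worked illustration of the rule above, the following sketch computes a BSS value from the quoted numbers. The formula used, BSS = (n_A / 2) × Σ (p_A,i − p_B,i)² summed over the classes i = P, N, is an assumption taken from the general form used in the cascade model literature; it is not reproduced in the text here, so the result should be read as indicative only. The function name `bss` is hypothetical.

```python
def bss(n_after, p_after, p_before):
    """Assumed BSS form: (n_A / 2) * sum over classes of
    (class ratio after - class ratio before) squared."""
    return n_after / 2.0 * sum(
        (a - b) ** 2 for a, b in zip(p_after, p_before)
    )

# Numbers from the example rule: 135 compounds with ratios 0.615/0.385
# before the main condition, 32 compounds with 0.156/0.844 after it.
# Ratios are given in (inactive, active) order.
value = bss(32, (0.156, 0.844), (0.615, 0.385))
```

Under this assumed form, the value grows both with the number of compounds that survive the main condition and with the size of the shift in class ratios, which matches the qualitative description of BSS as the strength of a rule.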
Step 1: Calculate the priority of all rules.
Step 2: Select the rule with the highest priority.
Step 3: Repeat step 1 and step 2 among the unselected rules until priority is 0.
Here, the priority is defined as the product of Novelty and BSS. Novelty values are higher when the covered compounds of a rule are not covered by any previously selected rules. BSS (Between groups sum of squares) is the strength of a rule, and its value is higher when the number of covered compounds selected by the main condition is large and when the changes in the positive/negative ratio are large. Repetition of this rule selection scheme resulted in 55 rules being refined in the next step.
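The three steps above amount to a greedy selection loop, sketched below. The exact Novelty formula is not given in the text, so this sketch assumes it is the fraction of a rule's covered compounds that no previously selected rule covers; the function name `select_rules` and the toy rules are likewise hypothetical.

```python
def select_rules(rules):
    """Greedy rule selection.

    rules maps a rule name to (bss, set of covered compound ids).
    Returns rule names in the order they were selected.
    """
    covered = set()          # compounds covered by rules selected so far
    remaining = dict(rules)  # unselected rules
    selected = []

    def priority(name):
        # Step 1: priority = Novelty * BSS (Novelty form is an assumption).
        rule_bss, compounds = remaining[name]
        if not compounds:
            return 0.0
        novelty = len(compounds - covered) / len(compounds)
        return novelty * rule_bss

    while remaining:
        # Step 2: take the unselected rule with the highest priority.
        best = max(remaining, key=priority)
        # Step 3: stop once no remaining rule adds anything new.
        if priority(best) <= 0:
            break
        selected.append(best)
        covered |= remaining.pop(best)[1]
    return selected

toy_rules = {
    "r1": (6.7, {1, 2, 3, 4}),
    "r2": (5.0, {3, 4}),       # fully contained in r1's coverage
    "r3": (2.0, {5, 6}),
}
order = select_rules(toy_rules)
```

In the toy data, r2 has a higher BSS than r3 but is never selected, because every compound it covers is already covered by r1 and its priority drops to zero.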
Unification of BAS candidates with similar structures
Validation of the sampling scheme
The number of samplings is critical to the success of the proposed method. Too few samplings may affect the diversity of inactive compounds, resulting in the wrong BAS structure. Too many samplings are time consuming. The cover ratio of active compounds is used to estimate the necessary number of samplings, since identified BASs should cover the majority of active compounds.
The light and dark blue bars in Figure 10 show the numbers of BASs and their cover ratios obtained from each sample dataset, respectively. Between 4 and 7 BASs were obtained, covering 65% to 88% of the active compounds in the case of HIV integrase. Thus, as expected, the BASs mined from a single sampling often fail to cover most of the active compounds. Conversely, the cumulative cover ratio, shown by the red line, reached its maximum values after three samplings in almost all cases. The saturation of the cumulative cover ratios after three samplings does not change, even after changing the sampling order of the three activities. Therefore, the estimated required number of samplings was three.
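The per-sampling and cumulative cover ratios discussed above can be computed as follows. This is a minimal sketch with made-up compound identifiers; the function name `cover_ratios` and the toy coverage sets are assumptions, and the values do not correspond to Figure 10.

```python
def cover_ratios(active_ids, covered_per_sampling):
    """For each sampling, return (cover ratio of that sampling alone,
    cumulative cover ratio over all samplings so far)."""
    actives = set(active_ids)
    cumulative = set()
    rows = []
    for covered in covered_per_sampling:
        hit = set(covered) & actives  # actives matched by this sampling's BASs
        cumulative |= hit
        rows.append((len(hit) / len(actives),
                     len(cumulative) / len(actives)))
    return rows

# Ten toy actives; each set lists the actives covered by the BASs
# mined from one sampling.
rows = cover_ratios(
    range(10),
    [{0, 1, 2, 3, 4, 5}, {3, 4, 5, 6, 7}, {0, 8}, {1, 2}],
)
```

A cumulative ratio that stops increasing from one sampling to the next is the saturation signal used to decide that further samplings are unnecessary.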
The first three samplings yielded 9 final BASs in the case of HIV integrase, as will be discussed in the next section. Inspection of the BAS structures obtained from the 4th and 5th samplings showed that all of these BASs are contained in the final BAS sets. This was also observed with the protease and procaspase-3 activities. Therefore, three samplings were deemed sufficient to yield a stable set of BASs covering most of the active compounds.
Results and discussion
HIV integrase inhibitor mining results
The red dashed lines in Figure 13 indicate BASs selected without a priori chemical knowledge. The blue dotted lines indicate BASs mined by an experienced chemist. This result illustrates the importance of BASs in real pharmaceuticals.
Applications to other biological activities
HIV protease inhibitor
Procaspase-3 activators in the PubChem BioAssay database
BASs of three pharmacological activities were successfully mined from large-scale databases, including real HTS data in the PubChem BioAssay database. Most BASs could be mined without a priori chemical knowledge, and the time required to perform the mining for one activity was about 3 days. All of the software developed is open to the public and is available at the BASiC website. The obtained BASs were deemed meaningful substructures from the viewpoint of a medicinal chemist. Thus, BAS mining was shown to be a useful method for understanding the characteristics of active structures. Currently, the bottleneck of this mining process is the unification of similar BASs, which requires a manual comparison of about 100 candidate BAS structures. Future developments to the current process will therefore include an automatic method to identify similar candidate BASs and to construct new, refined BASs. Software development to mine BASs with fewer covered compounds is also in progress.
BAS: Basic active structure
BSS: Between-groups sum of squares
We thank Dr. Hiroshi Horikawa for helpful conversations and contributions to analyses as an experienced chemist. We also thank OpenEye for permission to use their Python library.
- Okada T: The development of a knowledge base for basic active structures: an example case of dopamine agonists. Chemistry Central Journal. 2010, 4: 1. 10.1186/1752-153X-4-1.
- BASiC: [http://BASiC.dm-lab.info/]
- PubChem: [http://pubchem.ncbi.nlm.nih.gov/]
- Schierz AC: Virtual screening of bioassay data. Journal of Cheminformatics. 2009, 1: 21. 10.1186/1758-2946-1-21.
- Qingliang L: A novel method for mining highly imbalanced high-throughput screening data in PubChem. Bioinformatics. 2009, 25 (24): 3310-3316. 10.1093/bioinformatics/btp589.
- Koyama M, Hasegawa K, Arakawa M, Funatsu K: Application of rough set theory to high throughput screening data for rational selection of lead compounds. Chem-Bio Informatics Journal. 2008, 8 (3): 85-95. 10.1273/cbij.8.85.
- Hasegawa K, Migita K, Funatsu K: Visualization of molecular selectivity and structure generation for selective dopamine inhibitors. Molecular Informatics. 2010, 29 (11): 793-800. 10.1002/minf.201000096.
- Berthold MR: KNIME: the Konstanz information miner. Data analysis, machine learning and applications. Edited by: Preisach C, Burkhardt H. 2008, Berlin, Heidelberg: Springer-Verlag, 319-326.
- Accelrys: MDDR: [http://accelrys.com/products/databases/bioactivity/mddr.html]
- OpenEye: [http://www.eyesopen.com/quacpac]
- Pommier Y: Integrase inhibitors to treat HIV/AIDS. Nat Rev Drug Discov. 2005, 4: 236-248. 10.1038/nrd1660.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.