Bioalerts: a python library for the derivation of structural alerts from bioactivity and toxicity data sets
 Isidro CortesCiriano^{1}Email author
Received: 13 October 2015
Accepted: 22 February 2016
Published: 4 March 2016
Abstract
Background
Assessing compound toxicity at early stages of the drug discovery process is a crucial task to dismiss drug candidates likely to fail in clinical trials. Screening drug candidates against structural alerts, i.e. chemical fragments associated to a toxicological response prior or after being metabolized (bioactivation), has proved a valuable approach for this task. During the last decades, diverse algorithms have been proposed for the automatic derivation of structural alerts from categorical toxicity data sets.
Results and conclusions
Here, the python library bioalerts is presented, which comprises functionalities for the automatic derivation of structural alerts from categorical (dichotomous), e.g. toxic/nontoxic, and continuous bioactivity data sets, e.g. \(K_{i}\) or \(\hbox {pIC}_{50}\) values. The library bioalerts relies on the RDKit implementation of the circular Morgan fingerprint algorithm to compute chemical substructures, which are derived by considering radial atom neighbourhoods of increasing bond radius. In addition to the derivation of structural alerts, bioalerts provides functionalities for the calculation of unhashed (keyed) Morgan fingerprints, which can be used in predictive bioactivity modelling with the advantage of allowing for a chemically meaningful deconvolution of the chemical space. Finally, bioalerts provides functionalities for the easy visualization of the derived structural alerts.
Keywords
Background
The early assessment of compound toxicity is crucial in drug discovery to dismiss drug candidates displaying undesirable safety profiles. This reduces the risk of investing time and resources on compounds likely to fail at clinical trials. Computational toxicology aims at predicting in silico the health and environmental risks associated to compound exposure or intake. The main benefits of these approaches are the potential reduction of the usage of animal models and their low cost in comparison to in vitro and in vivo methods.
The derivation of structural alerts, i.e. chemical fragments associated to a toxicological response prior or after being metabolized (bioactivation), is an area of intense development in toxicology. A plethora of studies in the last decades have proved the suitability of screening drug candidates against structural alerts known to be implicated in compound toxicity [1–4]. In silico approaches to catalogue, rationalize and identify structural alerts are generally divided into two groups. The first category comprises knowledgebased expert systems [5–8]. These methods gather and annotate structural alerts derived through human experience or from the scientific literature. These data can be used to define sets of rules for the evaluation of compound toxicity. Nonetheless, further expanding knowledgebased expert systems, i.e. adding more structural alerts by toxicology experts or from curation of scientific literature, is an arduous and costinefficient task. In addition, the need for human subjective intervention might lead to divergent perceptions about the toxicity of particular structural alerts.
The second category is composed of datadriven systems [9]. In this case, machine learning and data mining techniques are applied to large toxicity data sets to identify structural alerts in an automatic and unbiased fashion. This second class of methods differ with respect to knowledgebased expert systems in that (1) they are fast, exhaustive, deterministic, and automatic, and (2) there is no need for human intervention for the derivation of the strucutural alerts, although it might be required for their interpretation. Diverse notions underlie the datadriven algorithms reported in the literature, such as maximum common substructure (MCS) calculation [10], emerging pattern mining [9, 11], or cliquebased techniques [12]. The functionalities of these methods have been exposed and compared on categorical toxicity data sets, such as mutagenicity or carcinogenicity. However, the notion of identifying chemical fragments that enrich for a biological endpoint, be it toxicity or in vitro enzyme inhibition, can be extended to data sets reporting continuous compound activity values.
Here, the python library bioalerts is presented, which comprises functionalities for the automatic derivation of structural alerts from categorical (dichotomous), e.g. toxic/nontoxic, and continuous bioactivity data sets, e.g. \(K_{i}\) or \(\hbox {pIC}_{50}\) values. The workflow proposed by Ahlberg et al. [13] is followed for the derivation of structural alerts from categorical data—with the exception that to generate compound substructures circular Morgan fingerprints are used instead of a signature descriptor [14, 15]—whereas the pipeline published by CortesCiriano et al. [16] is followed when using continuous bioactivity data. Ahlberg and coworkers showed that their pipeline leads to comparable results to both manual derivation of structural alerts and a cliquebased method, namely PAFI [17]. However, the computational efficiency of their method was proved to be significantly higher; in fact, training and prediction times were reduced by up to three orders of magnitude in comparison to PAFI [13]. The library bioalerts relies on the RDKit implementation of the circular Morgan fingerprint algorithm to compute chemical substructures, which are derived by considering radial atom neighbourhoods of increasing bond radius. Morgan fingerprints were chosen given the high retrieval rates obtained with them in comparative virtual screening studies [18, 19]. Thus, they appear to efficiently account for chemical aspects related to bioactivity. In addition to the derivation of structural alerts, bioalerts provides functionalities for the calculation of unhashed (keyed) Morgan fingerprints, which can be used in predictive bioactivity modelling with the advantage of allowing for a chemically meaningful deconvolution of the chemical space [16]. Finally, bioalerts provides functionalities for the easy visualization of the derived structural alerts.
Implementation
The library bioalerts is implemented in python. It is mainly based on the RDKit implementation of circular Morgan fingerprints, licensed under a 3clause BSD license [20], and depends on the following python modules: numpy, operator, os, pandas, scipy, and sys [21–23]. The library is divided in three modules, namely: (1) LoadMolecules, (2) Alerts, and (3) FPCalculator. These modules allow to read molecule files, compute hashed and unhashed Morgan fingerprints in binary and count format, and to derive structural alerts from continuous and categorical bioactivity data sets. The implementation of the three modules is serial. Their functionalities are presented in the following subsections.
Reading molecule files
Molecule files can be read with the method ReadMolecules() from the class bioalerts.LoadMolecules.LoadMolecules(). The available input file formats are: (1) SMILES, (2) SDF, and (3) mol2. Once the molecules are loaded into a python list, the method extract_substructure_information() from the class bioalerts.LoadMolecules.GetDataSetInfo() permits the generation of a dictionary of substructures, whose keys correspond to substructure unambiguous integer identifiers, and the values to the molecule indices within the molecule list. Only the substructures with a bond radius allowed by the user through the argument radii are considered to build the substructure dictionary. This dictionary, which defines in terms of substructure composition the input set of molecules (e.g. training set), serves to compute unhashed Morgan fingerprints and to derive structural alerts.
Computation of Morgan fingerprints
Morgan fingerprints encode chemical structures by considering atom neighbourhoods [24]. For their computation, each substructure in a molecule, with a maximal userdefined bond radius, is assigned an unambiguous integer identifier. These identifiers can be mapped into an unhashed or hashed array. For the hashed array, the position in the array where the substructures are mapped is given by the modulo of the division of the substructure identifier by the fingerprint size. In the case of unhashed (keyed) fingerprints, each bit in the fingerprint is associated to only one substructure. Thus, the length of the unhashed fingerprints is equal to the number of distinct substructures present in the molecule set. Morgan fingerprints can be generated as binary, recording the presence or absence of each substructure, or count format, recording the number of occurrences of each substructure in a given compound.
Derivation of structural alerts from categorial bioactivity data sets
The derivation of structural alerts from dichotomous bioactivity data follows the work by Ahlberg et al. [13]. The probability for a substructure to be a structural alert is derived from the probability density function of the binomial distribution (Eq. 1). The question to be answered here is (Fig. 1a): which is the probability of having observed by chance \(m_{S\ act}\) active compounds (or more) with a given substructure (\(S_1\)), given that the substructure is present in \(n_s\) compounds in a training set of n compounds where m are active (Fig. 1a)?
A high probability (i.e. P value) indicates that it is likely to obtain by chance \(m_{S\ act}\) active compounds with substructure \(S_1\) from the binomial distribution defined by the training data, and thus, substructure \(S_1\) is not likely to be correlated to compound activity.
By contrast, a low P value indicates that it is not likely to have observed by chance \(m_S\ act\) active compounds with substructure \(S_1\) from the underlying binomial distribution defined by the training data. Thus, the presence of substructure \(S_1\) is significantly associated to compound activity.
A training set is used to derive a substructure dictionary. This dictionary serves to calculate the P value for the substructures present in the molecules from a test or external set with the method calculate_p_values() from the class bioalerts.Alerts(). This method calculates the substructures for the external molecules and their associated P value. The method calculate_p_values() requires three parameters to be defined, namely: (1) threshold_nb_substructures: the threshold for the number of substructures (default value equal to 5) [13], (2) threshold_pvalue: P value threshold (default value equal to 0.05), and (3) threshold_frequency: substructure frequency threshold. The threshold for the number of substructures, threshold_nb_substructures, indicates the minimum number of compounds in the training set with a given substructure (\(n_s\)) required to proceed to the calculation of the P value for that substructure. For instance, if the thresehold is set to 5 [13], and only 4 compounds from the training set present a given substructure \(S_1\), the algorithm will not calculate the P value for \(S_1\). On the other hand, the P value threshold, threshold_pvalue, indicates the level of significance (\(\alpha\)) to consider a given substructure as structural alert. Finally, the substructure frequency threshold, threshold_frequency, corresponds to the ratio \(m_{S\ act}\)/\(n_s\). Therefore, if there are too few active molecules with a given substructure \(S_1\) with respect to the total number of molecules with that substructure, the P value for \(S_1\) will not be computed.

Take as input a series of molecules (i.e. training set) and derive a dictionary of substructures (reference set). Only the substructures with a bond radius within the set of bond radii specified by the user will be considered.

For each molecule from an external (or test) set, do:

For each set of substructures rooted at a given heavy atom, Set\(_{\mathrm{substr.}}\), do:
 1.
Among the substructures in this set that have not been processed yet, select the substructure S of smallest radius
 2.
If substructure S is in the reference substructure set derived from the training set, continue
 3.
If the number of compounds with substructure S in the training set is higher than the value of the argument threshold_nb_substructures, continue
 4.
If the ratio \(m_{S\ act}\)/\(n_s\) is higher than the value of the argument threshold_frequency, continue
 5.
Compute P value for substructure S
 6.
If the P value is below the P value threshold, threshold_pvalue, label substructure S as significant. In this case, substructure S will not be considered even if it appears in other molecules from the external molecule set yet not processed. In addition, substructures comprising substructure S (i.e. substructures of higher bond radius rooted at the same heavy atom present in Set\(_{\mathrm{substr.}}\) but not yet processed) will not be considered in future iterations.
 1.

To control the familywise error rate the Bonferroni correction can be applied to the computed P values by setting the argument Bonferroni of the method calculate_p_values() to True (default). The Bonferroni correction consists of multiplying the P values by the total number of substructures for which P values were computed. Alternatively, the significance level can be divided by the number of computed P values. In that case, the significance level would be compared to the computed P values without transforming them.
Derivation of structural alerts from continuous bioactivity data sets
The module calculate_p_values() from the class bioalerts.Alerts.CalculatePvaluesContinuous() permits the derivation of structural alerts from data sets reporting continuous compound activity values, e.g. \(\hbox {pIC}_{50}\). To identify which substructures from a test molecule significantly contribute to bioactivity on the biomolecular system under study, two bioactivity distributions are defined (Fig. 1b), namely: (1) distribution A, comprising the bioactivity values for those compounds from the training set presenting a given substructure, and (2) distribution B, comprising the bioactivity values for those compounds not presenting that substructure in the same training set. The normality of these distributions is assessed with the Shapiro–Wilk test (\(\alpha = 0.05\)). If both A and B are normally distributed, a twotailed t test for independent samples is applied to statistically evaluate the difference between these two distributions. By contrast, if A and B do not follow a normal distribution, the Kolmogorov Smirnov test is used instead. If the difference between A and B is significant, the considered substructure is assumed to have an influence on bioactivity. The sign of the difference between the mean value of A and B indicates whether the presence of that substructure increases or decreases compound activity on the studied biological system.
The method calculate_p_values() requires two parameters. The first one, threshold_nb_substructures, indicates the minimum number of compounds in the training set presenting a given substructure that is required to compute the significance (P value) for that substructure. The second one, threshold_ratio, refers to the ratio between the number of compounds from the training set with a substructure and the size of the training set. If this ratio is below the value set for the parameter threshold_ratio the algorithm will not further consider that substructure.

Take as input a series of molecules (i.e. training set) and derive a dictionary of substructures (reference set). Only the substructures with a bond radius within the set of bond radii specified by the user will be considered.

For each substructure, S, in the molecules from an external (or test) set, do:

If substructure S is in the reference substructure set derived from the training set and has not been previously processed, continue

If the number of compounds with substructure S in the training set is higher than the value of the argument threshold_nb_substructures, continue

If the ratio between the number of compounds with substructure S from the training set and the total number of distinct compounds from the training set is higher than the value set for the argument threshold_ratio, continue

Compute P value for substructure S using the Student’s t test or the Kolmogorov Smirnov test.

The substructures identified with the algorithm along with additional information, e.g. the associated P values and the bioactivity difference between distributions A and B (effect size), are saved to a pandas data frame [22], which can be further saved to a .xlsx file for further processing and visualization (Additional files 1 and 2). As in the previous case, the Bonferroni correction can be applied to the computed P values if the argument Bonferroni of the method calculate_p_values() is set to True (default).
Discussion and conclusion
An open source implementation of two methodologies for the automatic derivation of structural alerts from bioactivity data sets is presented. Additionally, the library bioalerts permits the computation of unhashed (keyed) Morgan fingerprints. The performance of unhashed and hashed Morgan fingerprints has been shown to be comparable on continuous bioactivity data sets [16] (Additional file 2). Nonetheless, building predictive models with unhashed fingeprints enables the deconvolution of the models in a chemically meaningful way [16, 25, 26], thus increasing model interpretability. The functionalities of bioalerts are illustrated in a tutorial (Additional file 2) using three diverse bioactivity data sets, namely: (1) Ames mutagenicity data for 1752 compounds [27], (2) \(\hbox {pIC}_{50}\) values for 2311 compounds on human cyclooxygenase (COX) 2 [16], and (3) blood–brain barrier (BBB) data for 157 organic compounds [28].
Although the method for the derivation of structural alerts from continuous bioactivity data has been recently validated on a human cyclooxygenases data set [16], it is of paramount importance to bear in mind the following limitations. This method treats the effect of indivudal substructures on compound activity as independent events. However, it was shown by Klekota et al. [29] that the contribution to biological activity of some substructures depends on the presence of others. Similarly, it is important to note a second scenario where none of the methods would be suited. We can envision a data set where all active compounds present a given substructure \(S_{1}\) not implicated in the studied biological response, and substructure \(S_{2}\) enriching for activity, wheras all inactive compounds in that data set do not present neither \(S_{1}\) nor \(S_{2}\). The presented approaches would identify both substructures \(S_{1}\) and \(S_{2}\) as implicated in bioactivity, whereas only \(S_{2}\) actually is. Given that in practice the last situation might not be common when modelling large data sets, it is advisable to use large and chemically diverse training sets.
Finally, another important consideration when deriving structural alerts from continuous bioactivity data is the effect size required to privilege or dismiss compounds presenting a given substructure. The effect size corresponds to the difference between the two distributions from which the statistical test, e.g. t test, is calculated. For instance, (1) distribution A comprising the bioactivities for all compounds in a data set harboring a given substructure, and (2) distribution B comprising the bioactivities for the remaining compounds. In some cases, depending on the sample sizes, a highly significant P value might be obtained for a small size effect, e.g. a tenth of a \(\hbox {pIC}_{50}\) unit. From a medicinal chemistry standpoint that difference might be negligible [30]. Therefore, it is paramount to pay especial attention to the effect size in addition to the P values.
Overall, bioalerts constitutes an open source python library for the derivation of structural alerts from categorical and continuous data sets using two methodologies that have been previously validated. In addition, bioalerts functionalities include the computation of unhashed Morgan fingerprints, which can be further used in e.g. predictive bioactivity modelling, clustering or similarity searching.
Availability and requirements

Project name: bioalerts

Project home page: Source code is available at http://github.com/isidroc/bioalerts. Users of bioalerts are encouraged to visit this site for future versions and improvements.

Operating system(s): Platform independent

Programming language: Python

License: GNU GPL version 3

Any restrictions to use by nonacademics: none.
Declarations
Acknowledgements
ICC thanks the ParisPasteur International PhD Programme and Institut Pasteur for funding.
Competing interests
The authors declare that they have no competing interests.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Enoch SJ, Ellison CM, Schultz TW, Cronin MTD (2011) A review of the electrophilic reaction chemistry involved in covalent protein binding relevant to toxicity. Crit Rev Toxicol 41(9):783–802View ArticleGoogle Scholar
 Enoch S, Madden J, Cronin M (2008) Identification of mechanisms of toxic action for skin sensitisation using a smarts pattern based approach. SAR QSAR Environ Res 19(5–6):555–578View ArticleGoogle Scholar
 Ashby J, Tennant RW (1988) Chemical structure, salmonella mutagenicity and extent of carcinogenicity as indicators of genotoxic carcinogenesis among 222 chemicals tested in rodents by the u.s. nci/ntp. Mutat Res Genet Toxicol 204(1):17–115View ArticleGoogle Scholar
 Bailey AB, Chanderbhan R, CollazoBraier N, Cheeseman M, Twaroski ML (2005) The use of structureactivity relationship analysis in the food contact notification program. Regul. Toxicol Pharmacol 42(2):225–235View ArticleGoogle Scholar
 Ridings J, Barratt M, Cary R, Earnshaw C, Eggington C, Ellis M, Judson P, Langowski J, Marchant C, Payne M, Watson W, Yih T (1996) Computer prediction of possible toxic action from chemical structure: an update on the derek system. Toxicology 106(1–3):267–279View ArticleGoogle Scholar
 Benigni R, Bossa C (2008) Structure alerts for carcinogenicity, and the salmonella assay system: a novel insight through the chemical relational databases technology. Mutat Res Rev Mutat 659(3):248–261View ArticleGoogle Scholar
 Benigni R, Bossa C (2011) Mechanisms of chemical carcinogenicity and mutagenicity: a review with implications for predictive toxicology. Chem Rev 111(4):2507–2536View ArticleGoogle Scholar
 Klopman G, Frierson MR, Rosenkranz HS (1990) The structural basis of the mutagenicity of chemicals in salmonella typhimurium: the genetox data base. Mutat Res Fund Mol Mech Mutagen 228(1):1–50View ArticleGoogle Scholar
 Metivier JP, Lepailleur A, Buzmakov A, Poezevara G, Cremilleux B, Kuznetsov SO, Goff JL, Napoli A, Bureau R, Cuissart B (2015) Discovering structural alerts for mutagenicity using stable emerging molecular patterns. J Chem Inf Model 55(5):925–940View ArticleGoogle Scholar
 Nicolaou CA, Tamura SY, Kelley BP, Bassett SI, Nutt RF (2002) Analysis of large screening data sets via adaptively grown phylogeneticlike trees. J Chem Inf Comput Sci 42(5):1069–1079View ArticleGoogle Scholar
 Auer J, Bajorath J (2006) Emerging chemical patterns: a new methodology for molecular classification and compound selection. J Chem Inf Model 46(6):2502–2514View ArticleGoogle Scholar
 Deshpande M, Kuramochi M, Wale N, Karypis G (2005) Frequent substructurebased approaches for classifying chemical compounds. IEEE Trans Knowl Data Eng 17(8):1036–1050View ArticleGoogle Scholar
 Ahlberg E, Carlsson L, Boyer S (2014) Computational derivation of structural alerts from large toxicology data sets. J Chem Inf Model 54(10):2945–2952View ArticleGoogle Scholar
 Faulon JL, Visco DP, Pophale RS (2003) The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies. J Chem Inf Comput Sci 43(3):707–720View ArticleGoogle Scholar
 Faulon JL, Churchwell CJ, Visco DP (2003) The signature molecular descriptor. 2. Enumerating molecules from their extended valence sequences. J Chem Inf Comput Sci 43(3):721–734View ArticleGoogle Scholar
 CortesCiriano I, Murrell DS, van Westen G, Bender A, Malliavin T (2014) Ensemble modeling of cyclooxygenase inhibitors. J Cheminf 7:1View ArticleGoogle Scholar
 Kuramochi M, Karypis G (2004) An efficient algorithm for discovering frequent subgraphs. IEEE Trans Knowl Data Eng 16(9):1038–1051View ArticleGoogle Scholar
 Bender A, Jenkins JL, Scheiber J, Sukuru SCK, Glick M, Davies JW (2009) How similar are similarity searching methods? A principal component analysis of molecular descriptor space. J Chem Inf Model 49(1):108–119View ArticleGoogle Scholar
 Koutsoukas A, Paricharak S, Galloway WRJD, Spring DR, IJzerman AP, Glen RC, Marcus D, Bender A (2013) How diverse are diversity assessment methods? A comparative analysis and benchmarking of molecular descriptor space. J Chem Inf Model 54(1):230–242View ArticleGoogle Scholar
 Tosco P, Stiefl N, Landrum G (2014) Bringing the MMFF force field to the RDKit: implementation and validation. J Cheminf 6(1):37View ArticleGoogle Scholar
 Walt Svd, Colbert SC, Varoquaux G (2011) The NumPy array: A structure for efficient numerical computation. Comput Sci Eng 13(2):22–30View ArticleGoogle Scholar
 McKinney W (2010) Data structures for statistical computing in python. In: van der Walt S, Millman J (eds) Proceedings of the 9th Python in science conference, pp 51–56Google Scholar
 Jones E, Oliphant T, Peterson P (2001) SciPy: open source scientific tools for Python (2001). http://www.scipy.org/
 Rogers D, Hahn M (2010) Extendedconnectivity fingerprints. J Chem Inf Model 50(5):742–754View ArticleGoogle Scholar
 Ain QU, Mendez Lucio O, CortesCiriano I, van Westen G, Malliavin T, Bender A (2014) Bioactivity modelling of inhibitors for serine proteases using proteochemometric approaches. Integr Biol 6:1023–1033View ArticleGoogle Scholar
 Cortes Ciriano I, Ain QU, Subramanian V, Lenselink EB, Mendez Lucio O, IJzerman AP, Wohlfahrt G, Prusis P, Malliavin T, van Westen G, Bender A (2015) Polypharmacology modelling using proteochemometrics: recent developments and future prospects. Med Chem Comm 6:24–50View ArticleGoogle Scholar
 Young S, Gombar VK, Emptage MR, Cariello NF, Lambert C (2002) Mixture deconvolution and analysis of Ames mutagenicity data. Chemometr Intell Lab 60(1–2):5–11 (Fourth International Conference on Environ metrics and Chemometrics held in Las Vegas, NV, USA, 1820 September 2000)View ArticleGoogle Scholar
 Zhang L, Zhu H, Oprea T, Golbraikh A, Tropsha A (2008) Qsar modeling of the blood–brain barrier permeability for diverse organic compounds. Pharm Res 25(8):1902–1914View ArticleGoogle Scholar
 Klekota J, Roth FP (2008) Chemical substructures that enrich for biological activity. Bioinformatics 24(21):2518–2525View ArticleGoogle Scholar
 Kramer C, Fuchs JE, Whitebread S, Gedeck P, Liedl KR (2014) Matched molecular pair analysis: significance and the impact of experimental uncertainty. J Med Chem 57(9):3786–3802View ArticleGoogle Scholar