Predicting cytotoxicity from heterogeneous data sources with Bayesian learning
© Langdon et al; licensee BioMed Central Ltd. 2010
Received: 9 September 2010
Accepted: 9 December 2010
Published: 9 December 2010
We collected data from over 80 different cytotoxicity assays from Pfizer in-house work as well as from public sources and investigated the feasibility of using these datasets, which come from a variety of assay formats (having for instance different measured endpoints, incubation times and cell types) to derive a general cytotoxicity model. Our main aim was to derive a computational model based on this data that can highlight potentially cytotoxic series early in the drug discovery process.
We developed Bayesian models for each assay using Scitegic FCFP_6 fingerprints together with the default physical property descriptors. Pairs of assays that are mutually predictive were identified by calculating the ROC score of the model derived from one predicting the experimental outcome of the other, and vice versa. The prediction pairs were visualised in a network where nodes are assays and edges are drawn for ROC scores >0.60 in both directions. We observed that, if assay pairs (A, B) and (B, C) were mutually predictive, this was often not the case for the pair (A, C). The results from 48 assays connected to each other were merged in one training set of 145590 compounds and a general cytotoxicity model was derived. The model has been cross-validated as well as being validated with a set of 89 FDA approved drug compounds.
We have generated a predictive model for general cytotoxicity which could speed up the drug discovery process in multiple ways. Firstly, this analysis has shown that the outcomes of different assay formats can be mutually predictive, thus removing the need to submit a potentially toxic compound to multiple assays. Furthermore, this analysis enables selection of (a) the easiest-to-run assay as corporate standard, or (b) the most descriptive panel of assays by including assays whose outcomes are not mutually predictive. The model is no replacement for a cytotoxicity assay but opens the opportunity to be more selective about which compounds are to be submitted to it. On a more mundane level, having data from more than 80 assays in one dataset answers, for the first time, the question - "what are the known cytotoxic compounds from the Pfizer compound collection?" Finally, having a predictive cytotoxicity model will assist the design of new compounds with a desired cytotoxicity profile, since comparison of the model output with data from an in vitro safety/toxicology assay suggests one is predictive of the other.
A 2003 study estimated the cost of the research and development of a drug up to the pre-approval point to be over 800 million US dollars. Toxicity is the reason behind the withdrawal of over 90% of drugs from the market and the failure of a third of drugs in phase I-III clinical trials. Because of the huge cost of researching and developing a new drug, pharmaceutical companies want to minimise the number of failures in clinical trials and the number of withdrawals from the market. One way to minimise the number of failures is to ensure drugs are not toxic before they reach clinical trials. This is done by screening compounds for toxicity in the early stages of drug discovery and by understanding the mechanisms of toxicity, so that toxic drugs are not designed in the first place.
The general toxicity testing pipeline in the pharmaceutical industry begins with in vitro toxicology screening followed by in vivo studies. The majority of mandatory non-clinical toxicity investigations are in vivo. Preclinical in vivo studies are used to determine potential adverse effects of drugs, estimate safety margins, understand mechanisms of toxicity and decide if compounds should be eliminated from the development process. At the moment no in vitro test for acute oral toxicity has been approved by regulatory agencies as sufficient evidence to allow commencement of clinical trials. However, there are two mandatory in vitro studies, genotoxicity and hERG assays, that must be carried out before clinical trials can commence.
In order to use in vivo and in vitro methods, compounds must already have been synthesised and be available in sufficient quantities. Moreover, the experimental methods are time consuming and costly. For the time being it is a requirement that in vitro and in vivo toxicity studies are carried out on all drug candidates before they reach clinical trials. Development of a predictive model allows in silico screening of compounds in virtual libraries, i.e. before any compounds are actually made.
In vitro cytotoxicity assays are often run in parallel to primary cell-based activity screens in order to identify hits that only appear to be active because of their cytotoxic effects [7, 8]. These cytotoxicity assays are usually run to triage compounds which appear active in a cell-based primary assay against a target of interest. The choice of cytotoxicity assay is not restrictive, with some scientists choosing to re-use an assay from a previous project, while others opt for the newest cytotoxicity assay kits on the market. Cytotoxicity assays may be run against cell lines from different species (e.g. human, mouse, rat) and/or different cell types (e.g. skin, neuronal, liver). The choice of cell line and/or species may be aligned to those used in the primary target assay or be more comparable to the in vitro toxicology assay which it precedes. Assay methodologies vary widely (e.g. measurements of mitochondrial activity, ATP concentrations, and membrane integrity) but the basic principle is to assess cell viability and/or proliferation. Endpoint detection methods are similarly diverse, e.g. luminescence, absorbance or fluorescence. Finally, the period of cell incubation with compound varies from 2 hours in acute studies to several days in some long term antiviral assays. Again the length of incubation time may be selected simply to parallel that of the primary assay.
The aim of this project was to develop a computational model which could be used to generate a general "cytotoxicity score". This could then be used as a service to alert when a new synthesis is similar to a known cytotoxic compound, and/or as a tool to give an indication of compound cytotoxicity. To make this model as generally applicable as possible we tried to maximise the coverage of chemical space in the training set by merging data from multiple assays. We see a general cytotoxicity model as crucial in early stages of drug discovery when typically chemical series are pursued for which little cytotoxicity data is available and therefore no opportunity exists to build a more accurate series-specific model. Users could then access more information to include cell line, species, compound dose and incubation time details - and use this to triage their data further. Finally, we plan to collaborate with safety colleagues to be able to identify the cytotoxicity assays which are the best predictors of in vitro and clinical toxicity. This would provide the potential to reduce compound attrition since series with cytotoxic characteristics which track with known toxicology profiles would not be pursued.
Predicting toxicity is a challenging task because of the complex biological mechanisms behind it. The results of in vivo studies can be used to validate in vitro studies. As long as the in vitro methods used to generate the data are successful at predicting in vivo outcomes, the in silico models built with that data should be able to closely mimic the results of in vivo studies. In this project, data from in vitro experiments will be used alongside Bayesian learning to predict the cytotoxicity of compounds.
There are several examples of predicting cytotoxicity from in vitro data in the literature, including the use of neural networks, random forests, decision trees and linear least squares. The last example successfully predicts general cytotoxicity using in vitro results from 59 different cell lines. In this work we will attempt to predict general cytotoxicity using in vitro data gathered using many different assay formats. We will also compare our work with Guha and Schürer's random forests, as we can reproduce their models using our own methods and the same publicly available datasets.
Bayesian learning is a popular and mature machine learning method that can be used to classify molecules into two sets, e.g. active/inactive or toxic/non-toxic. It has many applications in the pharmaceutical industry, including modelling biological activity [13–15], such as kinase inhibitors and hERG blockers [17, 18], enriching high throughput screening (HTS) data [19, 20] and docking results, predicting combinatorial library protocols and describing compound similarity. Bayesian learning is used in this paper because of its speed, its resistance to over-fitting and its ability to handle noisy data. The speed of Bayesian learning scales linearly with the number of compounds, making it a fast and efficient technique. No pre-selection of descriptors is required prior to learning, as only those descriptors that correlate with activity will have a great effect on learning, and unimportant descriptors will not lead to over-fitting. This also means that Bayesian learning performs well with noisy data, as is the case in this study, which has a large amount of primary assay data and an expected high number of false positives and negatives.
Another advantage of Bayesian learning is that it does not require the active/inactive ratio in the training set to be balanced; instead, the assumption is that the ratio present in the training set is representative of the ratio in the set where predictions are to be made. Therefore pre-processing to derive a training set with balanced active/inactive data is not required.
We have used Bayesian learning with publicly available and in-house cytotoxicity assay data to predict the cytotoxicity of compounds.
We start by discussing the use of Bayesian learning to model cytotoxicity using publicly available data and the validation of these methods. Next we describe the application of these methods to a much larger Pfizer in-house data set collected from multiple different assays. Prediction networks, based on the ability of assay data to predict the results of other assays are generated and then used to select assay data suitable as a training set for a general cytotoxicity model.
Results and Discussion
Modelling Public Data
Two publicly available cytotoxicity datasets were downloaded from PubChem: "Scripps", which contained a mixture of single point (percent inhibition) primary data and IC50 confirmation data, and "NCGC", which contained only IC50 data. These datasets have previously been used by Guha and Schürer to derive Random Forest models. For each data set, two versions of Bayesian models were built using different descriptors. The FCFP_6 models used FCFP_6 fingerprints, AlogP, number of hydrogen bond donors, number of hydrogen bond acceptors, number of rotatable bonds and molecular fractional polar surface area as descriptors. The BCI models used BCI-1052 structural keys as descriptors, as used in the published Random Forest models. We were not able to calculate the BCI fingerprints for all compounds, so some compounds were left out (11 from the NCGC data, 33 from the Scripps IC50 data and 3800 from the Scripps percent inhibition data). For each model, the data set was split into 5 equal-sized random sets. The models were built on 4 of these sets (80% of the data) and tested with the remaining 20%. This process was repeated so that 5 models were built, each tested on the set that was left out of the training data. This technique is known as 5-fold cross-validation. For each validation a receiver operating characteristic plot (ROC plot) and a truth table were generated. The models' performance can be assessed from the average ROC plot and truth table for the 5 models.
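The 5-fold procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Pipeline Pilot protocol used in the study: the function names `k_fold_indices` and `cross_validate` are introduced here, and `train_and_score` is a hypothetical callback that would build a model on the training indices and return a score (for example a ROC score) on the held-out indices.

```python
import random
import statistics


def k_fold_indices(n_items, k=5, seed=0):
    """Split the indices 0..n_items-1 into k random folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Taking every k-th index gives folds whose sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validate(n_items, train_and_score, k=5, seed=0):
    """Run k-fold cross-validation and return (mean score, standard deviation).

    train_and_score(train_idx, test_idx) is a hypothetical callback: it should
    train on train_idx and return a numeric score measured on test_idx.
    """
    folds = k_fold_indices(n_items, k, seed)
    scores = []
    for held_out, test_idx in enumerate(folds):
        # Training set: every fold except the one held out for testing.
        train_idx = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return statistics.mean(scores), statistics.stdev(scores)
```

Each compound is used for testing exactly once, and every model is trained on roughly 80% of the data, which is what allows a mean and spread to be quoted for each validation measure.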
Scripps IC50 Data
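Truth table for 5-fold cross-validation of the Scripps IC50 FCFP_6 and BCI models
Model | Toxic, predicted toxic | Toxic, predicted non-toxic | Non-toxic, predicted toxic | Non-toxic, predicted non-toxic
Scripps IC50 FCFP_6 Model | 21 ± 1.5% | 16 ± 1.6% | 18 ± 0.9% | 45 ± 0.9%
Scripps IC50 BCI Model | 20 ± 3.7% | 17 ± 3.7% | 20 ± 2.1% | 43 ± 2.1%
Specificity and sensitivity of Scripps IC50 FCFP_6 and BCI models
Model | Sensitivity (fraction of toxic compounds correctly classified) | Specificity (fraction of non-toxic compounds correctly classified)
Scripps IC50 FCFP_6 Model | 0.57 ± 0.04 | 0.71 ± 0.01
Scripps IC50 BCI Model | 0.53 ± 0.1 | 0.68 ± 0.03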
Scripps Percent Inhibition Data
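Truth table for Scripps percent inhibition FCFP_6 and BCI models.
Model | Toxic, predicted toxic | Toxic, predicted non-toxic | Non-toxic, predicted toxic | Non-toxic, predicted non-toxic
Scripps percent inhibition FCFP_6 model | 30 ± 3.4% | 7 ± 1.7% | 41 ± 2.0% | 22 ± 3.3%
Scripps percent inhibition BCI model | 27 ± 2.6% | 11 ± 0.7% | 40 ± 3.2% | 23 ± 3.1%
Specificity and sensitivity of Scripps percent inhibition FCFP_6 and BCI models
Model | Sensitivity (fraction of toxic compounds correctly classified) | Specificity (fraction of non-toxic compounds correctly classified)
Scripps percent inhibition FCFP_6 model | 0.82 ± 0.05 | 0.35 ± 0.04
Scripps percent inhibition BCI model | 0.72 ± 0.03 | 0.37 ± 0.05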
Example series of compounds which are all predicted to be toxic (Scripps percent inhibition FCFP_6 model score >0). The compound structures are not reproduced here; the one surviving entry reads 5.24 (moderate toxic).
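Truth table for the 5-fold cross-validation of the NCGC IC50 FCFP_6 and BCI models
Model | Toxic, predicted toxic | Toxic, predicted non-toxic | Non-toxic, predicted toxic | Non-toxic, predicted non-toxic
NCGC IC50 FCFP_6 model | 1.5 ± 0.3 | 3.3 ± 1.2 | 8.7 ± 2.0 | 86.6 ± 2.8
NCGC IC50 BCI model | 1.7 ± 0.6 | 2.9 ± 1.4 | 13.5 ± 1.8 | 81.9 ± 2.3
Specificity and sensitivity of the NCGC IC50 FCFP_6 and BCI models.
Model | Sensitivity (fraction of toxic compounds correctly classified) | Specificity (fraction of non-toxic compounds correctly classified)
NCGC IC50 FCFP_6 Model | 0.32 ± 0.06 | 0.91 ± 0.02
NCGC IC50 BCI Model | 0.41 ± 0.20 | 0.86 ± 0.02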
Cross Predictions Between Scripps And NCGC
Firstly, we tested the NCGC Jurkat IC50 models (FCFP_6, toxicity cut-off pIC50 > 4.64) against the Scripps IC50 dataset (FCFP_6, toxicity cut-off pIC50 > 5.5). The NCGC models do not distinguish toxic from non-toxic compounds, as indicated by the near-random ROC score of 0.52 (the BCI model was no better at 0.50). Guha and Schürer considered their NCGC model predictive, but only after altering the fingerprint descriptors (CATS2D) used to train the model and applying a different toxicity cut-off to the Scripps set, which resulted in 640 out of 775 compounds being classed as toxic (about 83%). They did not report a ROC score, but a percentage of correctly classified compounds (68%). This is shown in Figure 5 together with the value of 61% we obtained against the Scripps set with the original cut-off of 5.5. Their model had a high sensitivity (0.76) and a low specificity (0.26); in effect the model was successful by predicting most compounds to be toxic, possibly as a consequence of forcing down the cut-off. The model "all compounds are toxic" would have correctly classified 83% of the compounds. Our FCFP_6 model can be considered the reverse. With the original cut-off for toxicity (pIC50 > 5.5) the sensitivity is low (0.08) and the specificity is high (0.92); this model yielded a 61% correct classification by predicting the majority of compounds to be non-toxic. The simplistic "all compounds are non-toxic" model would have correctly classified 63% of the compounds. As Figure 5 illustrates, the two trivial models would perform better than the models reported by Guha and Schürer and by ourselves, indicating that both sets of models failed at cross-prediction. We also tried to predict the NCGC outcomes with models derived from the Scripps dataset. Again, the models derived from the Scripps IC50 data could not correctly classify the NCGC set, as shown by the ROC scores of 0.51 (FCFP_6) and 0.40 (BCI).
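The comparison with the trivial models can be checked with a few lines of arithmetic. The Python sketch below is illustrative only; the function names are introduced here and are not part of the original analysis. It uses the identity that the fraction correctly classified equals sensitivity weighted by the toxic fraction plus specificity weighted by the non-toxic fraction.

```python
def accuracy_from_rates(sensitivity, specificity, toxic_fraction):
    """Fraction of all compounds classified correctly.

    Toxic compounds are classified correctly at the sensitivity rate and
    non-toxic compounds at the specificity rate.
    """
    return sensitivity * toxic_fraction + specificity * (1.0 - toxic_fraction)


def trivial_baseline(toxic_fraction):
    """Accuracy of always predicting the majority class."""
    return max(toxic_fraction, 1.0 - toxic_fraction)


# Values quoted in the text for the FCFP_6 model tested on the Scripps IC50
# set with the original cut-off (37% of compounds toxic).
model_accuracy = accuracy_from_rates(0.08, 0.92, 0.37)
baseline_accuracy = trivial_baseline(0.37)
```

With these inputs the model accuracy comes to about 0.61 and the majority-class baseline to 0.63, which reproduces the observation that the "all compounds are non-toxic" model does slightly better than the trained model.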
The ROC scores improved to 0.60 (FCFP_6) and 0.51 (BCI) when the Scripps percent inhibition models were used, but not enough to indicate good predictive power. Figure 5 shows the percentage of correct prediction (65%) of the FCFP_6 model.
Modelling Pfizer Data
The results obtained by modelling the Scripps and NCGC sets using naïve Bayesian learning were comparable to the results obtained by Guha and Schürer using Random Forest models. Since Bayesian models do not need rebalancing of training sets with toxic/non-toxic ratios far from 1/1, we decided to use Bayesian models to analyse the Pfizer data. We consistently obtained better results using FCFP_6 fingerprints than with BCI fingerprints and therefore decided to use only FCFP_6 fingerprints from this point on. We concluded from modelling the Scripps data that Bayesian models can improve if all percent inhibition data are used to augment the data set, and that a much lower cut-off can be used than is typically applied by the experimenter. The Pfizer data set contains results from 33 assays with percent inhibition data and 52 assays with IC50 data. These data have been obtained by Pfizer and its multiple legacy companies and, not surprisingly, a variety of assay formats have been applied. We developed assay metadata collection tools for the biological assays to focus on the factors most likely to influence cytotoxicity (e.g. cell line, incubation time, dose, endpoint detection method). Extensive data profiling was applied to generate a well characterised data set (Pfizer dataset collection and profiling - Methods).
Many of the Pfizer assays were selectivity assays, aimed at removing "actives" from the primary assay where the activity was in fact due to cytotoxicity or another non-specific event. Since the compounds submitted to these assays had already shown activity in a cell-based assay, they are not true random subsets of the Pfizer file and the expected toxic hit rate is closer to the Scripps IC50 set (37%) than to the Scripps percent inhibition set (1.4%). The cytotoxicity assay collection also covered different % inhibition and IC50 dose ranges. A particular cut off may give 20% actives in one assay, but 100% actives in another. Therefore to enable cross-assay comparison, the top 20% of compounds (by activity or pIC50) were considered active so that every assay would have the same hit rate. For an assay with a normal distribution this would equal mean plus (just under) one standard deviation. Modelling the Scripps percent inhibition data has shown that including this many actives in the training set can still yield a predictive model. An important feature of Bayesian learning is that it is not sensitive to the ratio of actives in the dataset; the ROC scores in Figure 2 illustrate this point: essentially the same model is obtained from the Scripps percent inhibition data, whether the cut-off for activity is set to 10% or to 80% or to any value in between. This advantage of Bayesian learning means we can pragmatically define the top 20% of compounds as toxic without decreasing the quality of the model.
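The top-20% rule can be sketched as below. This is a minimal Python illustration rather than the protocol used in the study; `label_top_fraction` is a name introduced here, and ties at the boundary are simply broken by input order, which is an assumption. The last line checks the remark about the normal distribution using the standard library.

```python
from statistics import NormalDist


def label_top_fraction(values, fraction=0.20):
    """Mark the top `fraction` of values (activity or pIC50) as active.

    Returns a list of booleans in the same order as `values`. Ties at the
    boundary are broken by input order (an assumption of this sketch).
    """
    n_active = int(round(len(values) * fraction))
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    active = set(ranked[:n_active])
    return [i in active for i in range(len(values))]


# For a normal distribution, the 80th percentile lies this many standard
# deviations above the mean.
z_top_20 = NormalDist().inv_cdf(0.80)
```

The computed `z_top_20` is a little above 0.84, consistent with the statement that the cut-off sits just under one standard deviation above the mean.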
Our aim was to derive one generally applicable cytotoxicity model and it was therefore tempting to integrate all data into one training set, hoping for a synergy in predictive power similar to that observed when the NCGC and Scripps IC50 sets were combined. We decided to take a more systematic approach and to only include data sets leading to models that are predictive for at least one other data set.
For each assay with at least 10 toxic molecules, a Bayesian model was derived and the ROC scores were calculated for predicting the outcome of each of the other assays. To visualise connections between data sets, prediction networks were created (see Prediction Networks - Methods).
In a prediction network the nodes represent data from different assays, and the size of the node is proportional to the number of molecules in the corresponding data set. Nodes are considered predictive if the model yields a ROC score greater than or equal to 0.60. Two nodes are connected if the data at one node can be used to build a predictive model for the cytotoxicity of the molecules at the other node. The nodes are only connected if predictions are bi-directional.
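The edge rule of a prediction network can be written down compactly. The following Python sketch is an illustration under stated assumptions: `mutual_edges` and the assay names are hypothetical, the ROC values are made up for the example, and the threshold is applied as "greater than or equal to 0.60" as stated in this section.

```python
def mutual_edges(roc_scores, threshold=0.60):
    """Return undirected edges between mutually predictive assays.

    roc_scores maps (training_assay, test_assay) to the ROC score obtained
    when a model trained on the first assay predicts the second. An edge is
    kept only if the score reaches the threshold in both directions.
    """
    edges = set()
    for (train, test), score in roc_scores.items():
        if train == test:
            continue
        reverse = roc_scores.get((test, train))
        if score >= threshold and reverse is not None and reverse >= threshold:
            edges.add(frozenset((train, test)))
    return edges


# Hypothetical scores: A and B predict each other, as do B and C, but A and C
# do not, so mutual predictivity is not transitive.
example_scores = {
    ("A", "B"): 0.71, ("B", "A"): 0.65,
    ("B", "C"): 0.62, ("C", "B"): 0.68,
    ("A", "C"): 0.66, ("C", "A"): 0.52,
}
example_edges = mutual_edges(example_scores)
```

In this toy example only the pairs (A, B) and (B, C) become edges; the pair (A, C) is dropped because the prediction holds in one direction only.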
One Predictive Cytotoxicity Model
ROC scores from the 5-fold cross-validation of the models derived from the predictive assays
Model | ROC score
Percent inhibition cytotoxicity model | 0.846 ± 0.003
IC50 cytotoxicity model | 0.836 ± 0.002
Merged Percent inhibition/IC50 model | 0.842 ± 0.002
Cytotoxicity can also be related to the descriptors used to derive the model. After the FCFP_6 fingerprints, the descriptor which has the largest impact on the Bayesian score is AlogP. Compounds having AlogP between 3.7 and 34 are given a high probability of being toxic by the model, and the probability of being toxic increases at the higher end of the range. Compounds with AlogP below 3.3 are given a low probability of being toxic; generally, the lower the AlogP the lower the probability of being toxic. There is one exception, where an AlogP in the range of 34 to 63 gives a non-toxic compound, but there is only one example of such a compound, so this is an anomalous result. These are also unusually high values for logP, so AlogP is an unreliable estimate of logP for these compounds. Compounds with a high logP are lipophilic and can therefore easily cross cell membranes; their tendency to bind preferentially to proteins rather than remain in a polar solvent makes them more likely to have non-specific intracellular effects. Molecular weight is the next most important descriptor. The model gives a low probability of a molecule being toxic if its molecular weight is below 370. Higher molecular weights give a high probability of toxicity. A molecule's lipophilicity will increase as its mass increases; therefore it is not surprising that heavier compounds have a higher probability of being cytotoxic. The polar surface area and the number of hydrogen bond donors and acceptors also show how cytotoxicity is dependent on the lipophilicity of the compounds.
The number of rotatable bonds also has a positive correlation with cytotoxicity score. This is to be expected, since a flexible molecule can adopt a greater number of conformations, allowing it to bind to many different sites, possibly leading to unwanted effects. Typically molecules with a large number of rotatable bonds also have a higher molecular weight - which is again correlated with logP.
The observed correlation of lipophilicity and related properties with cytotoxicity is not surprising, as this has also been observed in studies linking in vivo toxicity and bioavailability to physicochemical properties.
There is a wealth of data from cytotoxicity assays available both publicly and within pharmaceutical companies that can be used to derive predictive models. Here, a predictive Bayesian model has been derived from public and in-house Pfizer data.
During the development of this model the need for multiple-fold cross-validations has been reinforced, as this gives the most accurate validation results. A method for cut-off optimisation has also been shown to provide an appropriate definition of cytotoxicity to build a successfully predictive model. Prediction networks have been used to make informed decisions on which data sets should be included in the training set and have identified the need for more detailed examinations of what makes two data sets predictive of each other. The prediction networks identified assay data that could be used to derive predictive models. These assays were combined into one training set that produced a successful predictive cytotoxicity model with a ROC score of 0.842 ± 0.002.
The data indicate that some assays are highly predictive of each other. We speculated that this may be because they shared common assay conditions (cell line, species, incubation time, detection method etc.). To investigate this further, more networks were created in Cytoscape to incorporate the assay conditions available. However, no clear relationship between these factors and cytotoxicity could be demonstrated. This does not necessarily rule out a relationship, as there was little overlap in assay conditions between data sets and only a few compounds have been tested in more than one assay. To study this hypothesis further, the prediction network method should be repeated with a dense matrix of assays spanning diverse experimental conditions and compounds tested against all assays. This information can be represented in the network, and any assay relationships between predictive data sets should then become apparent.
Although there are gaps in the understanding of why the combination of assay data used to derive the predictive cytotoxicity model works, the model is still an extremely useful tool and also supports previous evidence in the literature that toxicity is related to lipophilicity. This model could be used to triage hits from primary cell-based screens for cytotoxicity, rather than running parallel cytotoxicity assays. The model predictions track well with the in vitro safety/toxicology assay we examined, but the applicability of the model as a tool to help identify toxic molecules early on in the drug discovery pipeline would be increased if its output could be compared with more in vitro assays of this type. Once more is understood on what makes a data set predictive, this knowledge can be utilised to derive a more accurate predictive model. Modelling methods described in this paper are not limited to cytotoxicity; they can also be used when predicting other molecular properties, or compound activities.
Summary of the 4 data sets used to build Bayesian models to predict cytotoxicity (the No. of assays and No. of compounds for each data set are not reproduced here)
PubChem, AID 364, 463, 464: T-Cell (Jurkat) proliferation data containing a mixture of percent inhibition and IC50 measurements
PubChem, AID 426: T-Cell (Jurkat) proliferation data containing IC50 measurements
Pfizer percent inhibition: Percent inhibition data from a variety of different cytotoxicity assays
Pfizer IC50: IC50 data from a variety of different cytotoxicity assays
Pfizer dataset collection and profiling
We conducted a gap analysis on the original dataset to identify those protocols where a substantial proportion of the assay experimental conditions was missing or inconsistent, which was the case for some legacy protocols. Examination of the full assay documents and direct contact with the biologists involved allowed us to generate a list of 82 assays with comprehensive coverage of the assay experimental parameters.
Assay endpoint detection methods were classified as Fluorescence emission, Luminescence, RNA quantification and Absorbance. The assay technologies included dye binding, flow cytometry, formazan dye formation, luciferase, PCR, and Resorufin dye formation. Data was used from a variety of species - Human, Hamster, Mouse, Pig, Rat, and Monkey - and a total of 34 different cell lines across all of the assays. To standardize the data and improve confidence in the model, the cell lines were re-classified according to their tissue origin (blood, skin, colon, cervix, ovary, lung, kidney, breast, foreskin, liver, aorta, brain, connective tissue, muscle, and nerve). Incubation times were standardised to a base unit of hours - our observations indicate that a wide range of incubation periods are used in cytotoxicity screens (2 hours to 145 hours) and they can vary within the same tissue type, or assay technology.
In addition to assay profiling and classification we analysed the percent inhibition and IC50 results for each assay to determine whether these results could be included in our models. Wherever we could not identify the convention used to distinguish cytotoxic compounds we decided to remove this data from further analysis. The assays where we could not reliably differentiate between true actives, artefacts and different naming conventions were likewise excluded.
To allow the model to make appropriate comparisons, data from the remaining HTS assays were examined to ensure there was the expected normal distribution around zero % inhibition. Assays were excluded from further analysis where this was not the case. Assay results where the endpoint value violated the standard business rules (e.g. zero or null) were also excluded. Scitegic Pipeline Pilot was used to develop automated data cleaning tools to perform the tasks described in this section. In addition, using curve fit descriptors and quality parameters, we generated Spotfire plots and screen data confidence scores, which enable interactive exploration and assessment of the data quality. These tools were used to refine the IC50 data set to a list of 52 assays where the data, curve fits and endpoints were reliable and well understood.
Bayesian Learning and Bayesian score
Pipeline Pilot [31, 32] was used to perform all calculations. During the period of research versions 6.5, 7.0 and 7.5 were used, but there are no differences between these versions in the components used. Bayesian learning is based on Bayes' rule for conditional probability, which gives the probability of an event A occurring given that event B has already occurred. In a cytotoxicity context, this is the probability of a compound being toxic, given that it contains a particular descriptor. For each descriptor, D, the probability of a molecule being toxic given that it contains descriptor D is calculated as P(active|D) = A_D/(A_D + I_D), where A_D is the number of active compounds containing descriptor D and I_D is the number of inactive compounds containing descriptor D. These probabilities become unreliable as the number of molecules containing descriptor D becomes small. Therefore a Laplacian modified model is derived, which takes into account the different sampling frequencies of different features by adding samples with the same hit rate as observed in the training set.
Laplacian modified model
If we assume most features have no relationship to activity, then we would expect P(active|D) to be equal to the overall activity rate, P(active) = A/(A + I), where A and I are the total numbers of active and inactive compounds in the training set. If we sample a feature K additional times, where K = 1/P(active), we would expect P(active)*K of these samples to be active. Therefore the Laplacian-corrected probability of a compound being active given a certain descriptor D is P(active|D) = (A_D + P(active)*K)/((A_D + I_D) + K). As (A_D + I_D) approaches 0 the feature probability converges towards P(active), which is expected if it is assumed that the feature has no relationship to activity. The Bayesian score for a compound of unknown class is calculated by multiplying the probabilities for each descriptor contained in the compound; this score represents the likelihood of the compound being active.
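The Laplacian correction can be illustrated with a short Python sketch. It follows the formula as written in this section and is not the Pipeline Pilot implementation; the function names are introduced here for illustration. The second function combines the per-descriptor probabilities by summing their logarithms, which is the product described in the text taken in log space to avoid numerical underflow when a compound has many descriptors.

```python
import math


def laplacian_corrected_probability(a_d, i_d, p_active):
    """Laplacian-corrected estimate of P(active | D) as given in the text.

    a_d, i_d: numbers of active and inactive training compounds containing
    descriptor D. p_active: overall activity rate A / (A + I).
    """
    k = 1.0 / p_active  # number of virtual samples added for the feature
    return (a_d + p_active * k) / ((a_d + i_d) + k)


def log_score(feature_counts, p_active):
    """Sum of log corrected probabilities over a compound's descriptors.

    feature_counts is a list of (a_d, i_d) pairs, one per descriptor present
    in the compound. Summing logs is equivalent to multiplying probabilities.
    """
    return sum(
        math.log(laplacian_corrected_probability(a_d, i_d, p_active))
        for a_d, i_d in feature_counts
    )
```

As an example of the effect, with an overall hit rate of 0.2 a descriptor seen once, in an active compound, has an uncorrected probability of 1.0 but a corrected probability of 2/6, while a descriptor that was never seen falls back to the overall rate of 0.2.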
Percent inhibition cut-off optimisation
The following method was used to find the best percent inhibition value to use as the definition of cytotoxicity for the molecules in the Scripps data set. The best cut-off is the value that gives the highest ROC score when used to build a model. The ROC score is the area under the curve of the ROC plot for the model. This method was originally suggested by David Rogers. A set of 121 models was built, each with a different percent inhibition cut-off as the definition of toxicity. The cut-offs ranged from -20% to 100% in 1% increments. The ROC score was calculated for each of these models and plotted against the corresponding cut-off. The optimum cut-off is defined as the cut-off that yields the highest ROC score. As 5-fold cross-validation is used to test the models, the same method is also used in the cut-off optimisation. The set of 121 models is trained on 80% of the data and the ROC scores are calculated by testing on the remaining 20% of the data. This is repeated 5 times, using a different 20% to test the model each time. When the ROC score is calculated, the cytotoxic compounds are defined as those that were labelled as active in the original data extracted from PubChem. These labels were assigned based on the percent inhibition or IC50 values, if available, for the molecules. This procedure was carried out twice: once for the FCFP_6 fingerprints and once for the BCI fingerprints.
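The cut-off scan can be sketched in Python as below. This is an illustration only: `roc_auc` computes the area under the ROC curve by direct pairwise comparison (quadratic in the number of compounds, so suitable for small examples rather than for data sets of the size used here), and `roc_for_cutoff` is a hypothetical callback that would train a model with the given cut-off and return its cross-validated ROC score.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve by pairwise comparison.

    Equals the probability that a randomly chosen active compound scores
    higher than a randomly chosen inactive one (ties count as one half).
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one active and one inactive compound")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Cut-offs from -20% to 100% inhibition in 1% steps: 121 values in total.
CUTOFFS = list(range(-20, 101))


def best_cutoff(cutoffs, roc_for_cutoff):
    """Return the cut-off whose model gives the highest ROC score.

    roc_for_cutoff(cutoff) is a hypothetical callback that trains a model
    using that cut-off as the toxicity definition and returns its ROC score.
    """
    return max(cutoffs, key=roc_for_cutoff)
```

A score of 1.0 from `roc_auc` corresponds to perfect ranking of actives above inactives, and 0.5 to random ranking.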
A major challenge for machine learning methods is to understand the applicability domain of models. For example a model trained on a particular data set may perform well when cross-validated, but fail at classifying compounds from a different data set. This research aims to determine which assay data can be used to predict the outcome of other assays and to understand any relationship between such data sets. To do this we have created prediction networks.
The available Pfizer data were split into two categories: IC50 and percent inhibition data. This is because IC50 data are often obtained as confirmations of previous data and are therefore enriched in hit rate but have a lower chemical diversity of compounds (as was the case with the Scripps data). The hit rate for the Pfizer assays was artificially set to 20%, but the chemical diversity has probably been artificially lowered by routinely removing compounds with undesirable chemical functional groups and/or physical properties. The NCGC and Scripps data were also included, as separate screens. There are no distinct percent inhibition measurements available for NCGC; we therefore took the percent inhibition at 9.2 μM from the full curve data as a surrogate.
The Pfizer percent inhibition data set contains data from 33 assays. A Bayesian model was derived for each assay, giving a total of 28 models (5 of the assays contained only 1 molecule, so a model could not be trained). Each of these models was then tested in turn with data from the remaining assays not used to train the model. Each of the models was also tested on the Scripps percent inhibition and NCGC percent inhibition data sets, and the Scripps percent inhibition and NCGC models were tested with each of the Pfizer percent inhibition data sets. A delimited text file was created containing a column for the training set, a column for the test set and a column for the ROC score obtained when a model trained with the training set is tested with the test set. This file was imported into Cytoscape v.2.6.1, where the prediction networks were created.
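The three-column file described above can be produced along the following lines. This is a sketch only: the column names, the tab delimiter and the `roc_for_pair` callback (which would train a model on one assay and return its ROC score on another) are assumptions, as the original file layout is not given beyond its three columns.

```python
import csv
from itertools import permutations


def write_prediction_table(assay_names, roc_for_pair, handle):
    """Write one row per ordered (training set, test set) pair.

    assay_names:  names of the assays that have a trained model
    roc_for_pair: hypothetical callback (train_name, test_name) -> ROC score
    handle:       writable text file object
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["training_set", "test_set", "roc_score"])
    # Every model is tested on every other assay, never on its own data.
    for train_name, test_name in permutations(assay_names, 2):
        score = roc_for_pair(train_name, test_name)
        writer.writerow([train_name, test_name, f"{score:.3f}"])
```

A table of this shape can be imported into Cytoscape as a network, with the first two columns as source and target nodes and the third as an edge attribute.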
The same method was applied to the 52 assays in the Pfizer IC50 data set. A total of 45 models were produced, as 7 of the assays contained only 1 molecule. The Scripps IC50 and NCGC data sets were also included. For all models built, FCFP_6 fingerprints, AlogP, number of hydrogen bond donors, number of hydrogen bond acceptors, number of rotatable bonds and molecular fractional polar surface area were used as descriptors. Since for most of the assays it had not been recorded what constitutes a cytotoxic outcome, the top 20% of compounds (by percent inhibition or pIC50) in each assay were classed as toxic. For the Scripps and NCGC data sets the definitions of toxicity described above were used.
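The top-20% labelling rule can be sketched as below. The handling of rounding, of ties at the boundary and of very small assays is not described in the text, so the choices made here (round to the nearest compound, label at least one compound, break ties by input order) are assumptions, and `label_top_fraction` is a hypothetical name.

```python
def label_top_fraction(values, fraction=0.20):
    """Label the highest-ranking fraction of compounds in one assay as toxic.

    values: percent inhibition or pIC50 per compound (higher = more toxic)
    Returns a list of booleans in the same order as `values`.
    """
    # Assumption: round to the nearest whole compound, at least one.
    n_top = max(1, round(len(values) * fraction))
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    top = set(ranked[:n_top])
    return [i in top for i in range(len(values))]
```

For an assay with ten measured compounds, the two compounds with the highest values are labelled toxic and the remaining eight non-toxic.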
Two prediction networks were built, one for the Pfizer percent inhibition data set and one for the Pfizer IC50 data set. Assays are represented in the network as nodes, and two nodes are connected with an edge if a model trained with the screen at the source node is successful in predicting the cytotoxicity of the screen at the target node, as defined by a ROC score greater than 0.60. The networks are arranged using a spring-embedded layout, which positions nodes to give an aesthetically appealing arrangement. This is done by replacing the nodes with rings and each edge with a spring. The nodes are placed in an initial layout and then released, so that the springs force the nodes to move to a minimal energy layout.
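The edge rule can be expressed in a few lines of Python. The sketch below assumes the ROC scores are available as (training set, test set, score) rows; it returns the directed edges and, separately, the assay pairs that predict each other in both directions. The layout itself was done in Cytoscape and is not reproduced here; `build_prediction_network` is a hypothetical name.

```python
def build_prediction_network(rows, threshold=0.60):
    """Derive network edges from cross-prediction results.

    rows: iterable of (training_set, test_set, roc_score)
    Returns (directed_edges, mutual_pairs): an edge source -> target exists
    when the model trained on `source` scores above `threshold` on `target`;
    a mutual pair has edges in both directions.
    """
    directed = {(train, test) for train, test, roc in rows if roc > threshold}
    mutual = {
        tuple(sorted((src, dst)))
        for src, dst in directed
        if (dst, src) in directed
    }
    return directed, mutual
```

Note that the comparison is strict, so a ROC score of exactly 0.60 does not produce an edge.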
The authors wish to acknowledge the following Pfizer colleagues for their contributions to technical discussions during the design, development and validation of this predictive tool:
Paul Driscoll, James Dykens, Ian Johns, Philip Laflin, Jens Loesel, Richard Lyons, Charles Mowbray, Russell Naven, David Pryde, Rachel Russell.
We wish to thank David Millan (Pfizer) for making available the set of 87 FDA-approved drugs.
- DiMasi Joseph A, Hansen Ronald W, Grabowski Henry G: The price of innovation: new estimates of drug development costs. J Health Econ. 2003, 22: 151-185. 10.1016/S0167-6296(02)00126-1.
- Schuster D, Laggner C, Langer T: Why drugs fail - a study on side effects in new chemical entities. Curr Pharm Des. 2005, 11: 3545-3559. 10.2174/138161205774414510.
- Gross CJ, Kramer JA: The role of investigative molecular toxicology in early stage drug development. Expert Opin Drug Saf. 2003, 2: 147-159.
- Ukelis U, Kramer PJ, Olejniczak K, Mueller SO: Replacement of in vivo acute oral toxicity studies by in vitro cytotoxicity methods: Opportunities, limits and regulatory status. Regul Toxicol Pharmacol. 2008, 51: 108-118. 10.1016/j.yrtph.2008.02.002.
- Greaves P, Williams A, Eve M: First dose of potential new medicines to humans: how animals help. Nat Rev Drug Discovery. 2004, 3: 226-236. 10.1038/nrd1329.
- Fielden MR, Kolaja KL: The role of early in vivo toxicity testing in drug discovery toxicology. Expert Opin Drug Saf. 2008, 7: 107-110.
- Chen T, Knapp AC, Wu Y, Huang J, Lynch JS, Dickson JK, Lawrence RM, Feyen JHM, Agler ML: High Throughput Screening Identified a Substituted Imidazole as a Novel RANK Pathway-Selective Osteoclastogenesis Inhibitor. Assay Drug Dev Technol. 2006, 4: 387-396. 10.1089/adt.2006.4.387.
- Hallis TM, Kopp AL, Gibson J, Lebakken CS, Hancock M, Van Den Heuvel-Kramer K, Turek-Etienne T: An improved β-lactamase reporter assay: multiplexing with a cytotoxicity readout for enhanced accuracy of hit identification. J Biomol Screening. 2007, 12: 635-644. 10.1177/1087057107301499.
- Xu JJ: In Vitro Toxicology: Bringing The In Silico and In Vivo World Closer. Computational Toxicology: Risk Assessment for Pharmaceutical and Environmental Chemicals. Edited by: Ekins S. 2008, New Jersey: John Wiley and Sons Inc, 22-23. 1
- Molnar L, Keseru GM, Papp A, Lorincz Z, Ambrus G, Darvas F: A neural network based classification scheme for cytotoxicity predictions: Validation on 30,000 compounds. Bioorg Med Chem Lett. 2006, 16: 1037-1039. 10.1016/j.bmcl.2005.10.079.
- Guha R, Schurer SC: Utilizing high throughput screening data for predictive toxicology models: protocols and application to MLSCN assays. J Comput-Aided Mol Des. 2008, 22: 367-384. 10.1007/s10822-008-9192-9.
- Lee AC, Shedden K, Rosania GR, Crippen GM: Data mining the NCI60 to predict generalized cytotoxicity. J Chem Inf Model. 2008, 48: 1379-1388. 10.1021/ci800097k.
- Chen B, Harrison RF, Papadatos G, Willett P, Wood DJ, Lewell XQ, Greenidge P, Stiefl N: Evaluation of machine-learning methods for ligand-based virtual screening. J Comput-Aided Mol Des. 2007, 21: 53-62. 10.1007/s10822-006-9096-5.
- Vogt M, Bajorath J: Bayesian screening for active compounds in high-dimensional chemical spaces combining property descriptors and molecular fingerprints. Chem Biol Drug Des. 2008, 71: 8-14.
- Paolini GV, Shapland RHB, van Hoorn WP, Mason JS, Hopkins AL: Global mapping of pharmacological space. Nat Biotechnol. 2006, 24: 805-815. 10.1038/nbt1228.
- Xia X, Maliski EG, Gallant P, Rogers D: Classification of Kinase Inhibitors Using a Bayesian Model. J Med Chem. 2004, 47: 4463-4470. 10.1021/jm0303195.
- O'Brien SE, De Groot MJ: Greater Than the Sum of Its Parts: Combining Models for Useful ADMET Prediction. J Med Chem. 2005, 48: 1287-1291.
- Sun H: An accurate and interpretable Bayesian classification model for prediction of hERG liability. ChemMedChem. 2006, 1: 315-322. 10.1002/cmdc.200500047.
- Glick M, Jenkins JL, Nettles JH, Hitchings H, Davies JW: Enrichment of High-Throughput Screening Data with Increasing Levels of Noise Using Support Vector Machines, Recursive Partitioning, and Laplacian-Modified Naive Bayesian Classifiers. J Chem Inf Model. 2006, 46: 193-200. 10.1021/ci050374h.
- Glick M, Klon AE, Acklin P, Davies JW: Enrichment of extremely noisy high-throughput screening data using a naive Bayes classifier. J Biomol Screening. 2004, 9: 32-36. 10.1177/1087057103260590.
- Klon AE, Glick M, Thoma M, Acklin P, Davies JW: Finding More Needles in the Haystack: A Simple and Efficient Method for Improving High-Throughput Docking Results. J Med Chem. 2004, 47: 2743-2749. 10.1021/jm030363k.
- van Hoorn WP, Bell AS: Searching Chemical Space with the Bayesian Idea Generator. J Chem Inf Model. 2009, 49: 2211-2220. 10.1021/ci900072g.
- Bender A, Mussa HY, Glen RC, Reiling S: Molecular Similarity Searching Using Atom Environments, Information-Based Feature Selection, and a Naive Bayesian Classifier. J Chem Inf Comput Sci. 2004, 44: 170-178.
- Wang Y, Xiao J, Suzek TO, Zhang J, Wang J, Bryant SH: PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009, 37: W623-W633. 10.1093/nar/gkp456.
- Xia M, Huang R, Witt KL, Southall N, Fostel J, Cho MH, Jadhav A, Smith CS, Inglese J, Portier CJ, et al: Compound Cytotoxicity Profiling Using Quantitative High-Throughput Screening. Environ Health Perspect. 2007, 116: 284-291. 10.1289/ehp.10727.
- Fawcett T: An introduction to ROC analysis. Pattern Recogn Lett. 2006, 27: 861-874. 10.1016/j.patrec.2005.10.010.
- Rogers D: Does this stuff really work?. 2007 Pipeline Pilot European User Group Meeting. 2007, Pistoia, Italy
- Hughes JD, Blagg J, Price DA, Bailey S, DeCrescenzo GA, Devraj RV, Ellsworth E, Fobian YM, Gibbs ME, Gilles RW, et al: Physiochemical drug properties associated with in vivo toxicological outcomes. Bioorg Med Chem Lett. 2008, 18: 4872-4875. 10.1016/j.bmcl.2008.07.071.
- Lipinski CA, Lombardo F, Dominy BW, Feeney PJ: Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Delivery Rev. 1997, 23: 3-25. 10.1016/S0169-409X(96)00423-1.
- Paolini GV, Lyons RA, Laflin P: How Desirable Are Your IC50s? A Way to Enhance Screening-Based Decision Making. J Biomol Screen. 2010, 15: 1183-93. 10.1177/1087057110384402.
- Pipeline Pilot version 7.5.2. 2008, Accelrys, Inc.: San Diego, CA
- Hassan M, Brown RD, Varma-O'Brien S, Rogers D: Cheminformatics analysis and learning in a data pipelining environment. Mol Diversity. 2006, 10: 283-299. 10.1007/s11030-006-9041-5.
- Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res. 2003, 13: 2498-2504. 10.1101/gr.1239303.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.