# Cytochrome P450 site of metabolism prediction from 2D topological fingerprints using GPU accelerated probabilistic classifiers

- Jonathan D Tyzack
^{1}, - Hamse Y Mussa
^{1}, - Mark J Williamson
^{1}, - Johannes Kirchmair
^{2}and - Robert C Glen
^{1}Email author

**6**:29

https://doi.org/10.1186/1758-2946-6-29

© Tyzack et al.; licensee Chemistry Central Ltd. 2014

**Received: **19 February 2014

**Accepted: **12 May 2014

**Published: **27 May 2014

## Abstract

### Background

The prediction of sites and products of metabolism in xenobiotic compounds is key to the development of new chemical entities, where screening potential metabolites for toxicity or unwanted side-effects is of crucial importance. In this work 2D topological fingerprints are used to encode atomic sites and three probabilistic machine learning methods are applied: Parzen-Rosenblatt Window (PRW), Naive Bayesian (NB) and a novel approach called RASCAL (Random Attribute Subsampling Classification ALgorithm). These are implemented by randomly subsampling descriptor space to alleviate the problem often suffered by data mining methods of having to exactly match fingerprints, and in the case of PRW by measuring a distance between feature vectors rather than exact matching. The classifiers have been implemented in CUDA/C++ to exploit the parallel architecture of graphical processing units (GPUs) and is freely available in a public repository.

### Results

It is shown that for PRW a SoM (Site of Metabolism) is identified in the top two predictions for 85%, 91% and 88% of the CYP 3A4, 2D6 and 2C9 data sets respectively, with RASCAL giving similar performance of 83%, 91% and 88%, respectively. These results put PRW and RASCAL performance ahead of NB which gave a much lower classification performance of 51%, 73% and 74%, respectively.

### Conclusions

2D topological fingerprints calculated to a bond depth of 4-6 contain sufficient information to allow the identification of SoMs using classifiers based on relatively small data sets. Thus, the machine learning methods outlined in this paper are conceptually simpler and more efficient than other methods tested and the use of simple topological descriptors derived from 2D structure give results competitive with other approaches using more expensive quantum chemical descriptors. The descriptor space subsampling approach and ensemble methodology allow the methods to be applied to molecules more distant from the training data where data mining would be more likely to fail due to the lack of common fingerprints. The RASCAL algorithm is shown to give equivalent classification performance to PRW but at lower computational expense allowing it to be applied more efficiently in the ensemble scheme.

## Keywords

## Background

The prediction of sites and products of metabolism for xenobiotic and endogenous compounds is an important avenue of research, playing an influential role in the development and use of pharmaceuticals, cosmetics, nutritional supplements and agrochemicals. Toxicity of metabolites can play a major role in the withdrawal of new drugs or black-box warnings, contributing to the high attrition rates in the development of new chemical entities.

The cytochrome P450s (CYPs) are a family of heme-containing enzymes involved in the phase-I metabolism of over 90% of drugs on the market [1, 2]. The CYP family of enzymes consists of 57 isoforms with the majority of biotransformations in mammals facilitated by the CYP 3A4 isoform, followed by 2D6 and 2C9.

The most common reactions catalysed by CYPs involve the insertion of a single oxygen into an organic molecule giving rise to C=C epoxidation, aromatic C oxidation, S oxidation and C-H hydroxylation reactions, the last example often leading to N-dealkylation or O-dealkylation if oxidation occurs on a suitable leaving group in an amine or ether moiety.

A host of computational approaches to predict SoMs have been developed as an alternative or aid to the resource and time consuming nature of experimental investigation. These approaches can be either ligand-based, where the structures and properties of known substrates or non-substrates are modelled to develop structure-activity relationships, or structure-based, where the structure of the metabolising CYP enzyme and its interactions with ligands are modelled. The reader is referred to the many comprehensive review papers [3–7] for an overview of the current computational tools to predict SoMs.

Many methods consider reactivity and accessibility factors since a SoM must be sufficiently reactive and also able to come into close proximity to the reactive heme centre. One such example is SMARTCyp [8], a Java-based SoM predictor that uses a database of activation energies for various pre-defined ligand fragments to assign reactivity estimates to matching moieties in a query ligand, with an accessibility descriptor used to tune the ranking.

Other methods take the accessibility consideration further and employ docking techniques to refine the predictions from reactivity approaches. Examples of these methods include IMPACTS [9], which combines docking with a fragment based reactivity approach, and a recently published approach [10] that makes use of a tethered docking methodology using GOLD [11] combined with a reactivity approach based on hydrogen bond order descriptors and a novel implementation of the average local ionisation energy.

In contrast to these approaches the methods described in this work do not require the explicit modelling of ligand binding or reactivity, but make use of machine learning techniques applied to an appropriate, representative data set. Various machine learning methods have been applied to the problem of SoM metabolism with some of the major contributions summarised below.

An example of a data-mining approach is MetaPrint2D [12], an online metabolism prediction tool trained on the Accelrys Metabolite Database [13] that makes SoM predictions based on occurrence counts of atomic fingerprints within the database where they appear as SoM versus non-SoM. If matching sites are not found in the database, Metaprint2D informs the user and makes no predictions rather than extrapolating beyond its domain of applicability.

Many methods employing machine learning techniques generate a wide-range of descriptors for each atomic site in the data set often including quantum chemical and electronic descriptors. An example is RegioSelectivity (RS)-predictor [14, 15] which uses a Support Vector Machine (SVM) to predict SoM using 148 topological and 392 quantum chemical atomic descriptors where some of these descriptors are modified to include contributions from neighbouring atoms. A neural network approach called Xenosite [16] has also been applied to this descriptor set but combined with other molecular descriptors and fingerprint descriptors based on the Daylight [17] definition. A probability score that each atomic site is a SoM is obtained allowing the different sites in the molecule to be ranked with improved predictive performance over (RS)-predictor reported.

Another study [18] combines descriptors based on the electronic structure of the molecule with explicitly calculated activation energies [19, 20], and also incorporates Solvent Accessible Surface Area (SASA) descriptors calculated using MOE [21]. Classification of the atomic sites in the data set into SoMs and non-SoMs was performed using random forest/ensemble decision trees, with the activation energy shown to be the most important in determining SoM.

A further method [22] with relevance to small endogenous molecules makes use of the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database [23] where 4843 reactions were classified into 80 classes. SMARTS patterns were used to define chemical substructures (reaction centres and surrounding regions) with various descriptors used to encode these sites including electronic, energetic, topological, distance and steric. For each reaction centre a SVM binary classification model was trained and the score obtained for each potential SoM in a query molecule was used to rank the candidate reaction centres.

The computational expense of using quantum chemical descriptors is addressed by FAME [24], a metabolism prediction tool that applies random forest models to the Metabolite [13] database. It calculates atomic descriptors using the CDK [25] relating to charge and molecular topology and generates SoM predictions in a few seconds per molecule. FAME has other benefits since it is not just a predictor of CYP metabolism but reflects the broader enzyme reactions documented in the Metabolite database and can be filtered for Phase-I and Phase-II metabolism in human, rat and dog.

A recent publication [26] describes an approach to SoM prediction that applies the PASS algorithm to atom environment fingerprints encoded with 2D descriptors. This allows the method to be faster than those methods that must first generate 3D structures to calculate quantum chemical descriptors and gives results that are competitive with RS-Predictor and SMARTCyp.

This paper takes a similar approach using the 2D topological circular fingerprint [27, 28] of atomic sites within a molecule as the sole descriptor making the method computationally less expensive than those that use 3D descriptors. Three probabilistic classifiers have been applied to the problem of xenobiotic SoM prediction: the Naive Bayesian (NB), the Parzen-Rosenblatt Window (PRW) and a novel method called RASCAL (Random Attribute Subsampling Classification Algorithm) that will be presented in the Methods section. In all classifiers the machine learning technique was applied on an ensemble basis by randomly subsampling descriptor space and treating each sub-classifier as one vote in an overall classification.

The kernelised nature of the PRW and RASCAL algorithms allied with the ensemble approach lends itself to a parallel implementation exploiting the massively parallel processing capability of Graphical Processing Units (GPUs). The computing industry is moving towards a parallel model as the limitations and capabilities of modern semiconductor manufacturing mean that ever increasing performance from a single processor is no longer possible [29]. In recent years, GPUs have become increasingly competitive with regard to programmability, speed and price, with the release of CUDA (Compute Unified Device Architecture) providing a standard C-like interface allowing scientists to exploit the parallel power of the GPU. The CUDA model operates by launching blocks of threads that are executed on stream multiprocessors (SMs) concurrently. Threads and blocks can be referred to by identification numbers allowing each to operate on a different portion of the data, where threads in a common block can communicate via a localised shared memory.

Recent developments in GPU accelerated classification tools include implementations for support vector machines [30–32], neural networks [33], k-nearest neighbours [34] and a parallel tool for the classification of remotely sensed imagery [35]. The techniques described in these studies could lead to similar efficiency gains when applied to cheminformatics classification problems and in the remainder of this paper GPU accelerated probabilistic classifiers are applied to the problem of SoM identification. The CUDA implementation released as part of this work could be applied to other binary or multi-class data sets, where in the case of RASCAL and NB the feature vectors would need to consist of integer or binary values. It is hoped that this implementation would be of interest to members of the cheminformatics community applying classification approaches to large data sets where performance is hindered by high computational demands.

In the Methods section the data sets and descriptors are presented, along with the three probabilistic classifiers (PRW, RASCAL and NB) and a discussion of the CUDA implementation. In the Results and discussion section the classification performance of the different methods is presented in terms of the Matthews Correlation Coefficient (MCC), area under the ROC curve and the percentage of the data sets where a SoM is identified in the top k positions. The effect of varying the size of the circular fingerprints used to describe atomic sites is investigated and a benchmarking analysis comparing the speed performance of the CUDA implementation on a Tesla C2075 GPU and a GeForce GT640 GPU to reference is presented. The important inferences from this work are presented in the Conclusions section.

To emphasize the benefits and novel aspects of this work it is important to point out that the SoM prediction models are built from 2D topological circular fingerprints without the requirement for complex quantum chemical and 3D descriptors. RASCAL combines the classification performance of the PRW with greater computational speed and hence could be applied to other much larger data sets. The data sets employed here are relatively small and show that SoMs can be identified within data sets of the order of 100’s of molecules being of potential interest to pharmaceutical companies with limited data on a specific series of molecules.

## Experimental

### Data sets

The machine learning approaches used in this study have been applied to the publicly available CYP 3A4, 2D6 and 2C9 data sets that originate from those initially released in the supporting information of the RS-Predictor paper [15] but further curated [9] with reference to the primary literature to identify and eliminate conflicting information. Open Babel v2.3.1 [36] was used to generate the protonation state of each ligand at pH 7.4 and the mol2 files generated are made available in the Additional files. The data sets have been sampled on a leave-one-out-cross-validation basis to generate results that are comparable with other reported methods [16].

### SoM prediction

**Methods**section was assessed for each isoform data set using feature vectors calculated at each of the bond depths ranging from 0-8. These methods have been implemented on an ensemble subsampling methodology, hence as described previously the length of the subsample,

*q*, needs to be optimised. Therefore, for each classification model

*q*was varied from an initial value of 5 to the full number of features

*L*in increments of 5, with the Matthews Correlation Coefficient (MCC) [38] and the area under ROC curves used to compare classification performance. Selection of the

*j*parameter is a balance between sufficiently sampling feature space versus computational load, with a value of 201 found to be suitable in this case. The graphs shown in Figure 2 show classification performance against

*j*for models with

*q*=40 and bond depth ranging from 4-6 and show that by

*j*=201 the classification performance has largely reached a plateau.

In this way the probability that an atomic site is a SoM can be calculated allowing all sites in a molecule to be ranked. The percentage of the data set where a SoM is identified in the top two (top-2%) and top three (top-3%) predictions was calculated to allow comparisons to other SoM identification methods. When calculating the top-2% and top-3% values steps were taken to identify atoms in equivalent sites to ensure that such sites are only included once in predictions. This was achieved by generating full circular fingerprints for each atomic site in the same way as described previously but this time spanning the entire ligand, with equivalent atomic sites defined as those with identical circular fingerprints.

### CUDA implementation

The three probabilistic classifiers have been implemented using the Nvidia Nsight [39] integrated development environment for CUDA/C++ released as part of CUDA Tools 5.5. A benchmarking analysis has been carried out comparing the performance speed of an Nvidia C2075 Tesla GPU and an Nvidia GT640 GeForce GPU against a reference CPU implementation running on a single Intel Xeon E5506 at 2.13GHz.

The source code is available in a public repository (https://bitbucket.org/jdt42/probclassifier_cuda) and could be applied to other classification problems where the input files are formatted in the appropriate manner for this tool. The code exploits the parallel nature of GPU’s by aligning the dimensions of the classification problem, such as number of feature vectors *N* and number of subclassifiers *j*, with the parallel architecture of CUDA where computational work is kernelised and spread over blocks of threads. The code makes use of atomic operations and so requires a CUDA enabled GPU with a compute rating of ≥ 1.1.

## Methods

### Probabilistic classifiers

Cheminformatics classification approaches that compare structural similarity to known molecules tend to be more successful if they allow for some degree of uncertainty (fuzziness) [40]. This means one treats the attributes representing the data items under study, and the class they are associated with as stochastic variables. In this scenario the classification problem can be viewed as a hypothesis testing task; this requires estimates of the probability densities of the attributes for each class. Estimates of these probability densities, coupled with an appropriate decision rule, constitute what is commonly referred to as statistical (or probabilistic) classifiers [40, 41].

In the probabilistic pattern recognition framework, it is assumed that there is an unknown probability distribution *p* that underlies any relationship in the available data, *D*, where the data points belonging to *D* are drawn from a product space *X*×*Y* where *X* represents input patterns/objects and *Y* represents the class space. In this setting, the purpose of a learning algorithm is to discover the probability distribution *p* that captures the “functional” relationships that may exist between *X* and *Y*. In a typical pattern classification scenario the available data set *D* is finite and is defined as $D={\left\{\right({\mathbf{\text{x}}}_{i},{y}_{i}\left)\right\}}_{i=1}^{N}$, where **x**_{
i
}∈*X*; *y*_{
i
}∈*Y* and *N* refers to the size of the given sample data.

The process of finding an appropriate probabilistic classification model *p*(**x**_{
i
},*y*_{
i
}) involves relating **x**_{
i
} probabilistically to its associated *y*_{
i
}, where *p*(**x**_{
i
},*y*_{
i
}) refers to the probability of (**x**_{
i
}, *y*_{
i
}) occurring.

In practice, we are often interested in the posterior probability that a given pattern **x**_{
i
} is associated with *y*_{
i
}. The essence of our task is to generate an algorithm that is capable of inferring probabilistic classification rules from the given training set with the ability to generalise to new patterns. The input pattern **x**_{
i
} is an *L*–dimensional mathematical vector that inhabits an abstract *L*–dimensional space. The pattern vector **x**_{
i
} consists of *L* elements **x**_{
i
} = (*x*_{i1},*x*_{i2},…,*x*_{
i
L
}) that represent the information we have about *L* different but relevant properties of the *i*^{
t
h
} pattern. The class label *y*_{
i
} often denotes a set of predefined classes {*ω*_{0},*ω*_{1},…,*ω*_{M-1}}.

In this work we are concerned with classification problems where the elements *x*_{
i
l
} can assume integer values being the counts of SYBYL atom types at specific bond depths from the atom described by **x**_{
i
}. For the sake of clarity the index *i* in both **x**_{
i
} and *y*_{
i
} will be dropped in the rest of the paper; unless otherwise stated.

*p*(

**x**,

*ω*

_{ α }) (

*i.e.*,

*p*(

**x**,

*y*)) from a priori and conditional probabilities. This means that the class posterior probability

*p*(

*ω*

_{ α }|

**x**) of a given pattern

**x**being associated with

*ω*

_{ α }[41] can expressed in the form [41–44]:

*p*(

*ω*

_{ α }) is the a priori probability of class

*ω*

_{ α }and can be estimated from the class proportions in the training data set;

*p*(

**x**|

*ω*

_{ α }) is the class conditional probability of

**x**belonging to class

*ω*

_{ α }; and

*p*(

**x**) is given by:

**x**, the probability of

**x**belonging to each class

*ω*

_{ α }needs to be computed, with

**x**assigned to the class with the highest probability value:

The prediction efficiency of the generated classifier can be validated by comparing classification predictions made by the model with actual known classes. Once the model is deemed reliable, it can then be used to make predictions about the behaviour of real world patterns coming from the same domain, *X*, as the given training set.

Clearly the estimation of the probability function *p*(**x**|*ω*_{
α
}) is the most important step in the estimation of *p*(*ω*_{
α
}|**x**), *vide supra*. Thus, over the years a number of approaches to obtain estimates of *p*(**x**|*ω*_{
α
}) have been developed [44–46].

When the data are sparse in the descriptor/input space, it can become difficult to generate a good estimate of *p*(**x**|*ω*_{
α
}). This may result in a probabilistic classifier (a single classifier) with poor generalisation performance. This problem is also encountered when one tries to construct non–probabilistic classifiers on a data set sparse in the descriptor space.

There are different approaches that can be used to improve the generalisation performance of single classifiers. Feature selection schemes [41]; regularisation methods [42, 46]; and the so–called ensemble learning technique (which is currently popular [42, 47, 48]) are good examples of these approaches. In the ensemble technique, the scheme employed in this work, the classification of a new pattern **x** is typically made through majority voting of the classification predictions of *j*(>1) single (or base) classifiers. Empirical and theoretical results indicate that an ensemble of base classifiers gives improvements in generalisation over the individual classifiers [49], providing the base classifiers are not correlated with one another. It was suggested that an effective method of achieving uncorrelated individual classifiers in an ensemble is by training the single classifiers using randomly selected *q* distinct attributes/features of the available *L* features where *q*≤*L*[49] and the *q*–dimensional descriptor vector is denoted by **x**^{
q
}.

It is important to note that, in this ensemble approach, the number of training data points remains the same, *i.e.*, *N*. Thus, the relative (with respect to *q*, the descriptor subspace dimension) training sample size increases [50], which in turn can improve the approximation of the class–conditioned probability from the training set.

Three probabilistic classifiers based on the ideas briefly described in the preceding paragraphs have been used to create SoM prediction tools, namely the Naive Bayesian (NB) [41, 42], Parzen–Rosenblatt Window (PRW) [51] and an internally developed methodology called RASCAL. The latter two approaches are kernel based, *vide infra*.

In the following sections we describe how *p*(*ω*_{
α
}|**x**^{
q
}) is estimated in each of the classification methods. The atomic site and its class (SoM or nonSoM) are represented by **x**^{
q
}, and *ω*_{
α
}, respectively. Note that in this work *M*=2, *i.e.* we are only dealing with 2 classes *ω*_{0} and *ω*_{1}, where *ω*_{0} refers to SoM and *ω*_{1} denotes nonSoM. Finally it is assumed that all a priori class distributions are equal, i.e., *p*(*ω*_{0})=*p*(*ω*_{1}).

### Naive Bayesian (NB)

*p*(

**x**

^{ q }|

*ω*

_{ α }) can be estimated as

with *C*_{
l
} being the number of times descriptor ${x}_{l}^{q}$ assumes the same value in class *ω*_{
α
} and *N*_{
α
} is the number of training items belonging to class *ω*_{
α
}.

*p*(

*ω*

_{0})=

*p*(

*ω*

_{1}), the posterior probability prediction for membership of a particular class is computed as

The data item **x**^{
q
} is predicted to be in the class with the highest posterior probability *p*(*ω*_{
α
}|**x**^{
q
}), and in the case where equal posterior probabilities are calculated for two classes they are ranked arbitrarily. The final class membership probabilities are calculated as the ratio of the number of votes for that class divided by the total number of subclassifiers *j*.

### Kernel based probabilistic classifiers

*ω*

_{ α }is calculated by comparing ${\mathbf{\text{x}}}_{i}^{q}$ with all examples in the training set for

*ω*

_{ α }as

*N*

_{ α }represents the number of training data items belonging to class

*ω*

_{ α }and the kernel function $K\left({\mathbf{\text{x}}}_{i}^{q},{\mathbf{\text{x}}}_{k}^{q}\right)$ measures the similarity between ${\mathbf{\text{x}}}_{i}^{q}$ and ${\mathbf{\text{x}}}_{k}^{q}$. In this case the class conditional probability $p\left({\mathbf{\text{x}}}_{i}^{q}|{\omega}_{\alpha}\right)$ is equal to ${S}_{\alpha}^{i}$,

*i.e.*

Two different kernel functions have been implemented: the PRW and a Dirac kernel [52] used in the implementation of RASCAL, both described in the next subsections.

### Randomised attribute subsampling classification algorithm (RASCAL)

In the case where the descriptors are defined over the binary domain it has been shown that this kernel is equivalent to a full expansion in Radmacher Walsh polynomials [53]. However, in this application the circular fingerprints used to describe the atomic sites are integer valued although the closed form of the kernel shown above can be applied to all discrete valued feature vectors.

### Parzen-Rosenblatt window (PRW)

where ${({\mathbf{\text{x}}}_{i}^{q}-{\mathbf{\text{x}}}_{k}^{q})}^{T}({\mathbf{\text{x}}}_{i}^{q}-{\mathbf{\text{x}}}_{k}^{q})$ is a measure of the distance between ${\mathbf{\text{x}}}_{i}^{q}$ and ${\mathbf{\text{x}}}_{k}^{q}$, *q* is the number of features and *h* is a smoothing parameter, where a value of *h* = 0.1 was found to give good performance.

## Results and discussion

*q*=45,

*j*=201 and the RASCAL classifier. SoMs have been labelled with a green circle and the scores for the top three predictions are labelled against the atom: green text is used where a SoM is identified, red text is used with a corresponding red ring where a non-SoM is identified. A SoM is identified in the top 3 predictions for all molecules, and the top 2 predictions for all except Lovastatin, reflecting the strong classification performance observed across the entire data set.

### Circular fingerprint depth analysis

**Site of metabolism prediction results, MCC and AUC**

Data | Bond | MCC | AUC | ||||||
---|---|---|---|---|---|---|---|---|---|

set | depth | PRW | RASCAL | NB | PRW | RASCAL | NB | ||

4 | 0.69 | 0.64 | 0.35 | 0.94 | 0.95 | 0.85 | |||

2C9 | 5 | 0.73 | 0.68 | 0.36 | 0.96 | 0.95 | 0.87 | ||

6 | 0.73 | 0.66 | 0.34 | 0.97 | 0.96 | 0.87 | |||

4 | 0.66 | 0.66 | 0.36 | 0.95 | 0.96 | 0.85 | |||

2D6 | 5 | 0.72 | 0.69 | 0.36 | 0.97 | 0.96 | 0.84 | ||

6 | 0.73 | 0.72 | 0.38 | 0.97 | 0.97 | 0.84 | |||

4 | 0.61 | 0.60 | 0.31 | 0.94 | 0.94 | 0.80 | |||

3A4 | 5 | 0.64 | 0.62 | 0.32 | 0.95 | 0.94 | 0.80 | ||

6 | 0.68 | 0.65 | 0.32 | 0.96 | 0.94 | 0.81 |

**Site of metabolism prediction results, top-k%**

Data | Bond | Top-3% | Top-2% | Top-1% | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

set | depth | PRW | RASCAL | NB | PRW | RASCAL | NB | PRW | RASCAL | NB | |||

4 | 92 | 88 | 81 | 83 | 83 | 71 | 71 | 71 | 39 | ||||

2C9 | 5 | 87 | 88 | 81 | 86 | 85 | 70 | 76 | 75 | 43 | |||

6 | 92 | 90 | 85 | 88 | 88 | 74 | 77 | 76 | 50 | ||||

4 | 92 | 94 | 79 | 89 | 88 | 66 | 66 | 77 | 53 | ||||

2D6 | 5 | 93 | 92 | 83 | 90 | 91 | 73 | 73 | 78 | 45 | |||

6 | 95 | 93 | 78 | 91 | 90 | 69 | 77 | 78 | 46 | ||||

4 | 89 | 89 | 64 | 83 | 82 | 51 | 67 | 69 | 26 | ||||

3A4 | 5 | 89 | 85 | 61 | 84 | 82 | 47 | 69 | 69 | 28 | |||

6 | 89 | 87 | 64 | 85 | 83 | 50 | 72 | 70 | 27 |

### Analysis of subsample length on classification for bond depths of 4-6

*q*and the number of subclassifiers

*j*. The graphs in Figure 5 show the classification performance in terms of MCC against

*q*for each of the models based on circular fingerprint bond depths ranging from 4-6. It can be seen that PRW and NB are less sensitive to

*q*than RASCAL, which is to be expected since RASCAL is in effect a subsample matching algorithm and if the subsample becomes too long then no matching instances will be found in the training set for either class. Therefore when classifying a data set using RASCAL it is necessary to parametrise carefully to find a suitable value for

*q*. For PRW and NB classification performance tends to plateau at high

*q*when applied to these data sets and using a PRW smoothing parameter of

*h*=0.1. For these approaches the extra computational cost of an ensemble approach is not justified in terms of improved performance and a standard implementation running over all

*L*descriptors would be more suitable. It can be seen that PRW and RASCAL give similar classification performance once the parameter

*q*has been optimised, with both systematically outperforming the NB. This is reinforced by the ROC curves shown in Figure 6 where PRW and RASCAL performance tracks above that for NB.

### Significance of molecular similarity

**Impact of molecular similarity on prediction performance**

Data | Classifier | Bond | Top-3% | Top-2% | ||||
---|---|---|---|---|---|---|---|---|

set | depth | TS1 | TS2 | TS1 | TS2 | |||

2C9 | PRW | 5 | 85 | 85 | 82 | 85 | ||

6 | 83 | 85 | 81 | 85 | ||||

RASCAL | 5 | 85 | 85 | 78 | 85 | |||

6 | 81 | 85 | 77 | 85 | ||||

2D6 | PRW | 5 | 90 | 92 | 85 | 75 | ||

6 | 91 | 94 | 86 | 79 | ||||

RASCAL | 5 | 90 | 90 | 84 | 75 | |||

6 | 87 | 88 | 84 | 83 | ||||

3A4 | PRW | 5 | 83 | 82 | 80 | 78 | ||

6 | 84 | 81 | 80 | 79 | ||||

RASCAL | 5 | 80 | 76 | 76 | 67 | |||

6 | 79 | 75 | 72 | 63 |

### Benchmarking analysis

**Benchmarking analysis**

Data | Method | q | Runtime seconds | x-fold to Ref | |||||
---|---|---|---|---|---|---|---|---|---|

set | Ref | GT640 | C2075 | GT640 | C2075 | ||||

RASCAL | 40 | 42 | 11 | 2.2 | 4 | 19 | |||

NB | 40 | 8.8 | 1.4 | 0.3 | 6 | 29 | |||

2C9 | PRW | 40 | 436 | 114 | 32 | 4 | 14 | ||

NB | 175 | 8.2 | 1.3 | 0.3 | 6 | 27 | |||

PRW | 175 | 7.0 | 31 | 3.6 | 0.2 | 2 | |||

RASCAL | 40 | 57 | 15 | 3.0 | 4 | 19 | |||

NB | 40 | 12 | 1.9 | 0.4 | 6 | 30 | |||

2D6 | PRW | 40 | 727 | 157 | 44 | 5 | 17 | ||

NB | 175 | 11 | 1.8 | 0.4 | 6 | 28 | |||

PRW | 175 | 9.7 | 43 | 5.0 | 0.2 | 2 | |||

RASCAL | 40 | 298 | 66 | 13 | 5 | 23 | |||

NB | 40 | 55 | 8.3 | 1.8 | 7 | 31 | |||

3A4 | PRW | 40 | 3,654 | 714 | 201 | 5 | 18 | ||

NB | 175 | 53 | 8.1 | 1.8 | 7 | 29 | |||

PRW | 175 | 45 | 197 | 23 | 0.2 | 2 |

*N*refers to the size of the training set for a particular class;

*j*refers to the number of subsamples;

*q*refers to the subsample size; and

*L*refers to the full length of the feature vector. Using these definitions the PRW algorithm computational expense when subsampling descriptor space, PRW

_{ sub }, can be estimated as

It should be noted that the RASCAL estimate is an upper bound. Only in the case where the training and test subsamples are identical are all *q* features compared otherwise the feature by feature comparison is halted as soon a difference is noted. Therefore the 15-fold faster running time of RASCAL over PRW can be justified by the lower computational expense, the early termination of the RASCAL kernel where relevant and the simpler nature of the RASCAL kernel compared to the PRW kernel.

*L*features without subsampling, PRW

_{ all }, where the computational expense can be estimated as

Therefore RASCAL is more likely to run faster than PRW when *q*≪*L*. In the data set with bond depth of 6, where *L*=175, applying RASCAL with *q*=40 and *j*=201 gives a speed increase of 1.5-2.0 compared to PRW running over all features. However, the case for using RASCAL is likely to become much more compelling when using data sets consisting of longer feature vectors where it is more likely that RASCAL models can be used where *q*≪*L*.

NB runs faster than RASCAL since it is possible to precompute the counts of common features between test data items and each training class so that the training data items only need to be parsed once, albeit at the price of poorer classification performance.

## Conclusions

The probabilistic classifiers outlined in this paper based on 2D topological fingerprints give SoM predictive performance that is competitive with other machine learning methods that employ complex 3D descriptors, enabling conceptually simple and efficient classification models to be built. Data mining approaches often suffer when fingerprints from a test molecule are not contained in the training data, but descriptor space subsampling and the use of classifiers that measure a distance between vectors instead of exact matching help to alleviate this problem. This enables the creation of classification models that can be based on relatively small data sets but still applicable to molecules more distant from the training data where data mining would be more likely to fail due to the lack of common fingerprints. Hence the methods could be of interest to pharmaceutical companies studying a series of molecules with only of the order of 100’s of data points available.

RASCAL and PRW were found to give similar predictive performance with PRW identifying a SoM in the top 2 predictions for 85%, 91% and 88% for the CYP 3A4, 2D6 and 2C9 data sets respectively, whereas for RASCAL the figures are 83%, 91% and 88%, respectively. This performance is competitive with other published machine learning methods [14]**,**[15] but is achieved using 2D descriptors. This suggests that there are common patterns in the local structure of SoMs in the data sets that are captured by the atomic circular fingerprints and can be identified by machine learning methods.

RASCAL gives similar classification performance to PRW but at lower computational expense making it suitable for use on large data sets and particularly for the ensemble schemes outlined in this paper. The speed boost from using RASCAL is likely to become more apparent for classification problems where *q*≪*L* (where *q* is the subsample length and *L* is the full length of the feature vector) and in these situations the benefits from using RASCAL are likely to become more compelling.

The suitability of using CUDA/C++ to exploit the parallel capabilities of GPU hardware for classification problems has been demonstrated and it is hoped that the source code will be of use to other researchers in the field. Further enhancements to the implementation could include maximising the use of faster memory (shared, texture and constant), implementing task parallelism using streams and implementing code to run on multiple GPUs, all of which should further improve the speed-up in performance compared to the reference.

In summary it has been shown that probabilistic classifiers implemented using randomly selected subclassifiers on an ensemble basis using 2D topological circular fingerprints as descriptors can give strong SoM predictive performance. However, as with all machine learning methods, the models are likely to be most relevant within their domain of applicability and are likely to perform less well against novel molecules that are very different from those in the training set.

## Declarations

### Acknowledgements

The authors would like to thank Unilever for funding. We thank Dr. Guus Duchateau, Leo van Buren and Prof. Werner Klaffke for useful discussions in the development of this work.

## Authors’ Affiliations

## References

- Guengerich FP: Cytochrome P450s and other enzymes in drug metabolism and toxicity. AAPS J. 2006, 8: E101-11. 10.1208/aapsj080112. [http://link.springer.com/article/10.1208/aapsj080112]View ArticleGoogle Scholar
- Lewis DFV: 57 varieties: the human cytochromes P450. Pharmacogenomics. 2004, 5 (3): 305-18. 10.1517/phgs.5.3.305.29827. [http://www.futuremedicine.com/doi/abs/10.1517/phgs.5.3.305.29827]View ArticleGoogle Scholar
- Kirchmair J, Williamson MJ, Tyzack JD, Tan L, Bond PJ, Bender A, Glen RC: Computational prediction of metabolism: sites, products, SAR, P450 enzyme dynamics, and mechanisms. J Chem Inf Model. 2012, 52 (3): 617-48. 10.1021/ci200542m. [http://pubs.acs.org/doi/abs/10.1021/ci200542m]View ArticleGoogle Scholar
- Kulkarni SA, Zhu J, Blechinger S: In silico techniques for the study and prediction of xenobiotic metabolism: a review. Xenobiotica; Fate Foreign Compounds Biol Syst. 2005, 35 (10-11): 955-73. 10.1080/00498250500354402. [http://www.ncbi.nlm.nih.gov/pubmed/16393855]View ArticleGoogle Scholar
- Tarcsay A, Keseru GM: In silico site of metabolism prediction of cytochrome P450-mediated biotransformations. Expert Opin Drug Metab Toxicol. 2011, 7 (3): 299-312. 10.1517/17425255.2011.553599. [http://www.ncbi.nlm.nih.gov/pubmed/21291341]View ArticleGoogle Scholar
- Ekins S, Andreyev S, Ryabov A, Kirillov E, Bugrim A, Nikolskaya T, Rakhmatulin Ea: Computational prediction of human drug metabolism. Expert Opin Drug Metab Toxicol. 2005, 1 (2): 303-24. 10.1517/17425255.1.2.303. [http://www.ncbi.nlm.nih.gov/pubmed/16922645]View ArticleGoogle Scholar
- Vaz RJ, Zamora I, Li Y, Reiling S, Shen J, Cruciani G: The challenges of in silico contributions to drug metabolism in lead optimization. Expert Opin Drug Metab Toxicol. 2010, 6 (7): 851-61. 10.1517/17425255.2010.499123. [http://www.ncbi.nlm.nih.gov/pubmed/20565339]View ArticleGoogle Scholar
- Rydberg P, Gloriam DE, Zaretzki J, Breneman C, Olsen L: SMARTCyp: A 2D method for prediction of cytochrome P450-Mediated drug metabolism. ACS Med Chem Lett. 2010, 1 (3): 96-100. 10.1021/ml100016x. [http://pubs.acs.org/doi/abs/10.1021/ml100016x]View ArticleGoogle Scholar
- Campagna-Slater V, Pottel J, Therrien E, Cantin LD, Moitessier N: Development of a computational tool to rival experts in the prediction of sites of metabolism of xenobiotics by p450s. J Chem Inf Model. 2012, 52 (9): 2471-83. 10.1021/ci3003073. [http://www.ncbi.nlm.nih.gov/pubmed/22916680]View ArticleGoogle Scholar
- Tyzack JD, Williamson MJ, Torella R, Glen RC: Prediction of cytochrome P450 xenobiotic metabolism: tethered docking and reactivity derived from ligand molecular orbital analysis. J Chem Inf Model. 2013, 53 (6): 1294-305. 10.1021/ci400058s. [http://www.ncbi.nlm.nih.gov/pubmed/23701380]View ArticleGoogle Scholar
- Jones G, Willett P, Glen RC: Molecular recognition of receptor sites using a genetic algorithm with a description of desolvation. J Mol Biol. 1995, 245: 43-53. 10.1016/S0022-2836(95)80037-9. [http://www.sciencedirect.com/science/article/pii/S0022283695800379]View ArticleGoogle Scholar
- MetaPrint2D, (accessed 03-06-2013). [http://www-metaprint2d.ch.cam.ac.uk/]
- Accelrys Metabolite Database. Accelrys Inc., 10188 Telesis Court, Suite 100, San Diego, CA, 92121, USA. [http://accelrys.com/products/databases/bioactivity/metabolite.html]
- Zaretzki J, Bergeron C, Rydberg P, Huang TW, Bennett KP, Breneman CM: RS-Predictor: a new tool for predicting sites of cytochrome P450-Mediated metabolism applied to CYP 3A4. J Chem Inf Model. 2011, 51 (7): 1667-89. 10.1021/ci2000488. [http://pubs.acs.org/doi/abs/10.1021/ci2000488]View ArticleGoogle Scholar
- Zaretzki J, Rydberg P, Bergeron C, Bennett KP, Olsen L, Breneman CM: RS-Predictor models augmented with SMARTCyp reactivities: robust metabolic regioselectivity predictions for nine CYP isozymes. J Chem Inf Model. 2012, 52 (6): 1637-59. 10.1021/ci300009z. [http://www.ncbi.nlm.nih.gov/pubmed/22524152]View ArticleGoogle Scholar
- Zaretzki J, Matlock M, Swamidass SJ: XenoSite: accurately predicting CYP-mediated sites of metabolism with neural networks. J Chem Inf Model. 2013, 53 (12): 3373-83. 10.1021/ci400518g. [http://www.ncbi.nlm.nih.gov/pubmed/24224933]View ArticleGoogle Scholar
- Daylight Chemical Information Systems, Inc. Aliso Viejo, CA. [http://www.daylight.com/dayhtml/doc/theory/theory.finger.html]
- Hasegawa K, Koyama M, Funatsu K: Quantitative prediction of regioselectivity toward cytochrome P450/3A4 using machine learning approaches. Mol Inform. 2010, 29 (3): 243-249. 10.1002/minf.200900086. [http://doi.wiley.com/10.1002/minf.200900086]View ArticleGoogle Scholar
- Olsen L, Rydberg P, Rod TH, Ryde U: Prediction of activation energies for hydrogen abstraction by cytochrome P450. J Med Chem. 2006, 49 (22): 6489-6499. 10.1021/jm060551l. [http://pubs.acs.org/doi/abs/10.1021/jm060551l]View ArticleGoogle Scholar
- Rydberg P, Ryde U, Olsen L: Prediction of activation energies for aromatic oxidation by cytochrome P450. J Phys Chem A. 2008, 112 (50): 13058-65. 10.1021/jp803854v. [http://www.ncbi.nlm.nih.gov/pubmed/18986131]View ArticleGoogle Scholar
- Molecular Operating Environment (MOE), 2012.10. Chemical Computing Group Inc., 1010 Sherbooke St. West, Suite 910, Montreal, QC, Canada, H3A 2R7, 2012. [https://www.chemcomp.com/MOE-Molecular_Operating_Environment.htm]
- Mu F, Unkefer CJ, Unkefer PJ, Hlavacek WS: Prediction of metabolic reactions based on atomic and molecular properties of small-molecule compounds. Bioinformatics (Oxford, England). 2011, 27 (11): 1537-45. 10.1093/bioinformatics/btr177. [http://bioinformatics.oxfordjournals.org/content/27/11/1537.short]View ArticleGoogle Scholar
- Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M: KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 2012, 40 (Database issue): D109-14. [http://nar.oxfordjournals.org/content/40/D1/D109.short]View ArticleGoogle Scholar
- Kirchmair J, Williamson MJ, Afzal AM, Tyzack JD, Choy APK, Howlett A, Rydberg P, Glen RC: FAst MEtabolizer (FAME): A rapid and accurate predictor of sites of metabolism in multiple species by endogenous enzymes. J Chem Inf Model. 2013, 53 (11): 2896-907. 10.1021/ci400503s. [http://www.ncbi.nlm.nih.gov/pubmed/24219364]View ArticleGoogle Scholar
- Steinbeck C, Hoppe C, Kuhn S, Floris M, Guha R, Willighagen E: Recent developments of the chemistry development kit (CDK) - an open-source Java library for Chemo- and Bioinformatics. Curr Pharm Des. 2006, 12 (17): 2111-2120. 10.2174/138161206777585274. [http://www.ingentaconnect.com/content/ben/cpd/2006/00000012/00000017/art00005]View ArticleGoogle Scholar
- Rudik AV, Dmitriev A, Lagunin AA, Filimonov D, Poroikov VV: Metabolism site prediction based on xenobiotic structural formulae and PASS prediction algorithm. J Chem Inf Model. 2014, 140113114718001-[http://pubs.acs.org/doi/abs/10.1021/ci400472j]Google Scholar
- Xing L, Glen R: Novel Methods for the Prediction of logP, pKa, and logD. J Chem Inf Model. 2002, 42 (4): 796-805. 10.1021/ci010315d. [http://pubs.acs.org/cgi-bin/doilookup/?10.1021/ci010315d]Google Scholar
- Xing L, Glen RC, Clark RD: Predicting pK(a) by molecular tree structured fingerprints and PLS. J Chem Inf Comput Sci. 2003, 43 (3): 870-879. 10.1021/ci020386s. [http://www.ncbi.nlm.nih.gov/pubmed/12767145]View ArticleGoogle Scholar
- Asanovic K, Wawrzynek J, Wessel D, Yelick K, Bodik R, Demmel J, Keaveny T, Keutzer K, Kubiatowicz J, Morgan N, Patterson D, Sen K: A view of the parallel computing landscape. Commun ACM. 2009, 52 (10): 56-10.1145/1562764.1562783. [http://dl.acm.org/citation.cfm?id=1562783]View ArticleGoogle Scholar
- Catanzaro B, Sundaram N, Keutzer K: Fast support vector machine training and classification on graphics processors. Proceedings of the 25th international conference on Machine learning - ICML ’08. 2008, New York, USA: ACM Press, 104-111. [http://dl.acm.org/citation.cfm?id=1390170]View ArticleGoogle Scholar
- Li Q, Salman R, Test E, Strack R, Kecman V: GPUSVM: a comprehensive CUDA based support vector machine package. Cent Eur J Comput Sci. 2011, 1 (4): 387-405. 10.2478/s13537-011-0028-7. [http://www.springerlink.com/index/10.2478/s13537-011-0028-7]Google Scholar
- Herrero-Lopez S, Williams JR, Sanchez A: Parallel multiclass classification using SVMs on GPUs. Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units - GPGPU ’10. 2010, New York, USA: ACM Press, 2-2.View ArticleGoogle Scholar
- Oh KS, Jung K: GPU implementation of neural networks. Pattern Recognit. 2004, 37 (6): 1311-1314. 10.1016/j.patcog.2004.01.013. [http://www.sciencedirect.com/science/article/pii/S0031320304000524]View ArticleGoogle Scholar
- Jian SLCWYLL: CUKNN: A parallel implementation of K-nearest neighbor on CUDA-enabled GPU. 2009 IEEE Youth Conference on Information, Computing and Telecommunication. 2009, New York: IEEE, 415-418.Google Scholar
- Bernabé S, Plaza A, Reddy Marpu P, Atli Benediktsson J: A new parallel tool for classification of remotely sensed imagery. Comput Geosci. 2012, 46: 208-218. [http://www.sciencedirect.com/science/article/pii/S009830041100433X]View ArticleGoogle Scholar
- O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR: Open Babel: an open chemical toolbox. J Cheminformatics. 2011, 3: 33-10.1186/1758-2946-3-33. [http://www.jcheminf.com/content/3/1/33]View ArticleGoogle Scholar
- SYBYL Molecular Modeling Software:. Tripos Associates Inc., St Louis, MO, USA. [http://www.certara.com]
- Matthews B: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Struct. 1975, 405 (2): 442-451. 10.1016/0005-2795(75)90109-9. [http://www.sciencedirect.com/science/article/pii/0005279575901099]View ArticleGoogle Scholar
- NVIDIA Nsight. NVIDIA, Santa Clara, CA, USA. [https://developer.nvidia.com/cuda-toolkit]
- Mussa HY, Mitchell JBO, Glen RC: Full “Laplacianised” posterior naive Bayesian algorithm. J Cheminformatics. 2013, 5 (1): 37-43. 10.1186/1758-2946-5-37.View ArticleGoogle Scholar
- Duda RO, Hart PE: Pattern Classification and Scene Analysis. 1973, New York, NY: John Wiley and Sons LtdGoogle Scholar
- Webb AR: Statistical Pattern Recognition. 2002, New York: Wiley–BlackwellView ArticleGoogle Scholar
- Young T, Calvert TW: Classification, Estimation and Pattern Recognition. 1974, New York: ElsevierGoogle Scholar
- Ripley BD: Pattern Recognition and Neural Networks. 1996, Cambridge, UK: Cambridge University PressView ArticleGoogle Scholar
- Hand DJ: Discrimination and classification. 1981, New York: WileyGoogle Scholar
- Bishop CM: Neural Networks for Pattern Recognition. 1996, New York: Oxford University PressGoogle Scholar
- Ho TK: The random subspace method for constructing decision forests. IEEE Tran Pat Anal Mach Intel. 1998, 20 (5): 832-844.Google Scholar
- Breiman L: Random forests. Mach Learn. 2001, 45: 5-32. 10.1023/A:1010933404324.View ArticleGoogle Scholar
- Oza KNC: Tumer: Input decimation ensembles: decorrelation through dimensionality reduction. Proc Intl Workshop Multiple Classifier Syst. 2096, 2001: 238-247.Google Scholar
- Skurichina, RPW M: Duin: Bagging, boosting and the random subspace method for linear classifiers. Pattern Anal Appl. 2002, 5 (2): 121-135. 10.1007/s100440200011.View ArticleGoogle Scholar
- Parzen E: On estimation of a probability density function and mode. Annal Math Stat. 1962, 33 (3): 1065-1076. 10.1214/aoms/1177704472.View ArticleGoogle Scholar
- Jacob L, Vert JP: Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics (Oxford, England). 2008, 24 (19): 2149-56. 10.1093/bioinformatics/btn409. [http://bioinformatics.oxfordjournals.org/content/24/19/2149.short]View ArticleGoogle Scholar
- Mussa HY, Tyzack JD, Glen RC: Note on the Rademacher-Walsh polynomial basis functions. J Math Res. 2013, 5: 114-121. [http://www.ccsenet.org/journal/index.php/jmr/article/view/24995]Google Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedicationwaiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwisestated.