# Chemically Aware Model Builder (camb): an R package for property and bioactivity modelling of small molecules

- Daniel S Murrell^{1†}
- Isidro Cortes-Ciriano^{2†}
- Gerard J P van Westen^{3}
- Ian P Stott^{4}
- Andreas Bender^{1}
- Thérèse E Malliavin^{2} (corresponding author)
- Robert C Glen^{1} (corresponding author)

*Journal of Cheminformatics* **7**:45

**DOI: **10.1186/s13321-015-0086-2

© Murrell et al. 2015

**Received: **1 April 2015

**Accepted: **3 July 2015

**Published: **28 August 2015

## Abstract

### Background

In silico predictive models have proved to be valuable for the optimisation of compound potency, selectivity and safety profiles in the drug discovery process.

### Results

*camb* is an R package that provides an environment for the rapid generation of quantitative Structure-Property and Structure-Activity models for small molecules (including QSAR, QSPR, QSAM, PCM) and is aimed at both advanced and beginner R users. *camb's* capabilities include the standardisation of chemical structure representation, computation of 905 one-dimensional and 14 fingerprint-type descriptors for small molecules, 8 types of amino acid descriptors, 13 whole-protein sequence descriptors, filtering methods for feature selection, generation of predictive models (using an interface to the R package *caret*), and techniques to create model ensembles (R package *caretEnsemble*). Results can be visualised through high-quality, customisable plots (R package *ggplot2*).

### Conclusions

*camb* constitutes an open-source framework to perform the following steps: (1) compound standardisation, (2) molecular and protein descriptor calculation, (3) descriptor pre-processing and model training, visualisation and validation, and (4) bioactivity/property prediction for new molecules.

*camb* aims to speed up model generation, and to provide reproducibility and tests of robustness. QSPR and proteochemometric case studies are included which demonstrate *camb's* application.

### Keywords

R package, Ensemble learning, Workflow, QSPR, QSAR, PCM

## Background

The advent of high-throughput technologies over the last two decades has led to a vast increase in the number of compound and bioactivity databases [1–3]. This increase in the amount of chemical and biological information has been exploited by developing fields in drug discovery such as quantitative structure activity relationships (QSAR), quantitative structure property relationships (QSPR), quantitative sequence-activity modelling (QSAM), or proteochemometric modelling (PCM) [4, 5].

Currently available R packages provide the capability for only subsets of the above-mentioned steps. For instance, the R packages *chemmineR* [9] and *rcdk* [10] enable the manipulation of SDF and SMILES files, the calculation of physicochemical descriptors, the clustering of molecules, and the retrieval of compounds from PubChem [3]. On the machine learning side, the *caret* package provides a unified platform for the training of machine learning models [11].

While it is possible to use a combination of these packages to set up a desired workflow, going from start to finish requires a reasonable understanding of model building in *caret*.

Here, we present the R package *camb*: *C*hemically *A*ware *M*odel *B*uilder, which aims to address the current lack of an R framework comprising the four steps mentioned above. Specifically, once model building has been done, the *camb* package makes it extremely easy to enter new molecules (that have had no previous standardisation) through a single function and acquire new predictions. The package has been conceived such that users with minimal programming skills can generate competitive predictive models and high-quality plots showing the performance of the models under default operation. It must be noted that, to begin with, *camb* restricts practitioners to a limited but easily used workflow. Experienced users, or those who intend to practise machine learning in R extensively, are encouraged to bypass this basic wrapper on their second training attempt and learn how to use the *caret* package directly from the *caret* vignettes.

Overall, *camb* enables the generation of predictive models, such as Quantitative Structure–Activity Relationships (QSAR), Quantitative Structure–Property Relationships (QSPR), Quantitative Sequence–Activity Modelling (QSAM), or Proteochemometric Modelling (PCM), starting with: chemical structure files, protein sequences (if required), and the associated properties or bioactivities. Moreover, *camb* is the first R package that enables the manipulation of chemical structures utilising Indigo’s C API [12], and the calculation of: (1) molecular fingerprints and 1-D [13] topological descriptors calculated using the PaDEL-Descriptor Java library [14], (2) hashed and unhashed Morgan fingerprints [15], and (3) eight types of amino acid descriptors. Two case studies illustrating the application of *camb* for QSPR modelling (solubility prediction) and PCM are available in the Additional files 1, 2.

## Design and implementation

This section describes the tools provided by *camb* for (1) compound standardisation, (2) descriptor calculation, (3) pre-processing and feature selection, model training, visualisation and validation, and (4) bioactivity/property prediction for new molecules.

### Compound standardization

Chemical structure representations are highly ambiguous if SMILES are used for representation: consider, for example, the aromaticity of ring systems, protonation states, and the tautomers present in a particular environment. Hence, standardisation is a step of crucial importance both when storing structures and before descriptor calculation, as many molecular properties depend on a consistent assignment of the above criteria in the first place. An examination of large chemical databases shows how important this step is; a rather good explanation of the standardisation applied in PubChem, one of the largest public databases, can be found on the PubChem Blog [16]. Hence, we are of the opinion that standardising chemical structures is crucial in order to provide consistent data for later modelling steps, in line with the perceptions of others (such as the PubChem curators). For standardisation, *camb* provides the function *StandardiseMolecules*, which utilises Indigo's C API [12]. SDF and SMILES formats are provided as molecule input options. Any molecules that Indigo fails to parse are removed during the standardisation step. As a filter, the user can stipulate the maximum number of each halogen atom that a compound may possess in order to pass the standardisation process. This allows datasets biased towards molecules containing one type of halogen to be easily normalised before training. Additional arguments of this function allow the removal of inorganic molecules or of compounds with a molecular mass above or below a defined threshold. Most importantly, *camb* makes use of Indigo's InChI [17] plugin to represent all tautomers by the same canonical SMILES: molecules are converted to InChI, tautomeric information is discarded, and the result is converted back to SMILES.

### Descriptor calculation

Currently, *camb* supports the calculation of compound descriptors and fingerprints via PaDEL-Descriptor [14], and of Morgan circular fingerprints [15] as implemented in RDKit [18]. The function *GeneratePadelDescriptors* permits the calculation of 905 1- and 2-D descriptors and 10 PaDEL-Descriptor fingerprint types, including: CDK fingerprints [19], CDK extended fingerprints [19], Kier–Hall E-state fragments [20], CDK graph-only fingerprints [19], MACCS fingerprints [21], PubChem fingerprints [3], Substructure fingerprints [22], and Klekota–Roth fingerprints [23].

In addition to the PaDEL-Descriptor fingerprints, Morgan fingerprints can be computed with the function *MorganFPs* through the python library RDkit [18]. Hashed fingerprints can be generated as *binary*, recording the presence or absence of each substructure, or *count based*, recording the number of occurrences of each substructure. Additionally, the *MorganFPs* function also computes unhashed (keyed) fingerprints, where each substructure in the dataset is assigned a unique position in a binary fingerprint of length equal to the number of substructures existing in the dataset. Since the positions of substructures in the unhashed fingerprint depend on the dataset, the function *MorganFPs* allows calculation of unhashed fingerprints for new compounds using a basis defined by the substructures present in the training dataset. This ensures that substructures in new compounds map to the same locations on the fingerprint and allows enhanced model interpretation by noting which exact substructures are deemed important by the learning algorithm.
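The basis-mapping idea behind unhashed fingerprints can be sketched in a few lines; this is a language-agnostic illustration of the scheme described above, not camb's actual implementation, with plain strings standing in for substructure identifiers:

```python
def build_basis(train_substructures):
    """Assign every substructure seen in the training set a fixed column index."""
    basis = {}
    for mol in train_substructures:
        for sub in mol:
            if sub not in basis:
                basis[sub] = len(basis)
    return basis

def unhashed_fp(mol_substructures, basis):
    """Binary fingerprint over the training-set basis.
    Substructures never seen during training have no column and are dropped."""
    fp = [0] * len(basis)
    for sub in mol_substructures:
        idx = basis.get(sub)
        if idx is not None:
            fp[idx] = 1
    return fp

# toy training set of two molecules, each a set of substructure identifiers
train = [{"c1ccccc1", "C=O"}, {"c1ccccc1", "CN"}]
basis = build_basis(train)
new_fp = unhashed_fp({"C=O", "CCl"}, basis)  # "CCl" is outside the training basis
```

Because the column for each substructure is fixed by the training set, a bit set in a new compound's fingerprint always refers to the same substructure, which is what makes the per-bit model interpretation mentioned above possible.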

The function *SeqDescs* enables the calculation of 13 types of whole-protein sequence descriptors from UniProt identifiers or from amino acid sequences [24], including: amino acid composition (AAC), dipeptide composition (DC), tripeptide composition (TC), normalized Moreau–Broto autocorrelation (MoreauBroto), Moran autocorrelation (Moran), Geary autocorrelation (Geary), composition/transition/distribution (CTD), conjoint triad (CTriad), sequence order coupling number (SOCN), quasi-sequence order descriptors (QSO), pseudo amino acid composition (PAAC), and amphiphilic pseudo amino acid composition (APAAC) [25, 26].

In addition, *camb* permits the calculation of 8 types of amino acid descriptors, namely: 3 and 5 Z-scales (Z3 and Z5), T-Scales (TScales), ST-Scales (STScales), Principal Components Score Vectors of Hydrophobic, Steric, and Electronic properties (VHSE), BLOSUM62 Substitution Matrix (BLOSUM), FASGAI (FASGAI), MSWHIM (MSWHIM), and ProtFP PCA8 (ProtFP8). Amino acid descriptors can be used for modelling of the activity of small peptides or for the description of protein binding sites [5, 25, 27, 28]. Multiple sequence alignment gaps are supported by this *camb* functionality. Descriptor values for these gaps are encoded with zeros. Further details about these descriptors and their predictive signal for bioactivity modelling can be found in two recent publications [25, 26].
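The zero-encoding of alignment gaps can be illustrated with a minimal sketch; the `Z3` table below holds three-component Z-scale values for just two amino acids and is included purely for illustration of the mechanism:

```python
# illustrative three-component Z-scale values for two amino acids only
Z3 = {"A": (0.07, -1.73, 0.09), "G": (2.23, -5.36, 0.30)}

def encode_sequence(seq, scales=Z3, dim=3):
    """Concatenate per-residue descriptor values for an aligned sequence.
    Alignment gaps ('-') contribute a block of zeros, as in camb."""
    out = []
    for aa in seq:
        out.extend((0.0,) * dim if aa == "-" else scales[aa])
    return out
```

For an aligned stretch such as `"A-G"`, the gap position simply yields a zero block of the same width as a real residue, so all aligned sequences map to descriptor vectors of equal length.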

### Model training and validation

Prior to model training, descriptors often need to be pre-processed [29] so that they are equally weighted as inputs into the learning algorithms and to remove any that contain little relevant information content. To this end, several functions (see package documentation and tutorials) are provided. These functions include the removal of non-informative descriptors (function *RemoveNearZeroVarianceFeatures*) or highly correlated descriptors (function *RemoveHighlyCorrelatedFeatures*), the imputation of missing descriptor values (function *ImputeFeatures*), and descriptor centering and scaling to unit variance (function *PreProcess*) among others [30].
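A minimal sketch of what such filters compute, using standard-library Python; the thresholds mirror the conventions referred to in the text (a 30/1 frequency ratio for near-zero variance, Pearson correlation for redundancy), and the function names are hypothetical rather than the camb API:

```python
import statistics as st

def near_zero_var(column, freq_cut=30.0):
    """Flag a descriptor whose most common value dominates the second most
    common by more than freq_cut (constant columns are always flagged)."""
    counts = sorted((column.count(v) for v in set(column)), reverse=True)
    if len(counts) < 2:
        return True
    return counts[0] / counts[1] > freq_cut

def pearson(x, y):
    """Pearson correlation, used to detect highly correlated descriptor pairs."""
    mx, my = st.fmean(x), st.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def center_scale(column):
    """Centre to zero mean and scale to unit (sample) variance."""
    m, s = st.fmean(column), st.stdev(column)
    return [(v - m) / s for v in column]
```

In practice one would drop any column where `near_zero_var` is true, drop one member of each pair with `abs(pearson(a, b))` above the chosen cut-off (0.95 in the case studies below), and apply `center_scale` to the survivors.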

The R package *caret* provides a common interface to the most popular machine learning packages available in R, and, as such, *camb* invokes *caret* to set up cross-validation frameworks and train machine learning models. The available learning methods include bagging, Bayesian methods, boosting, boosted trees, elastic net, MARS, Gaussian processes, *k*-nearest neighbours, principal component regression, radial basis function networks, random forests, relevance vector machines, and support vector machines, among others. Additionally, two ensemble modelling approaches, namely greedy and stacking optimisation, have been integrated from the R package *caretEnsemble* [31], allowing models to be combined into ensembles, which have proven to be less error prone [28].

In greedy optimization [32], the cross-validated RMSE is optimized using a linear combination of input model predictions. The input models are all trained using an identical fold composition. Each model is assigned a weight in the following manner. Initially, all models have their weight set to zero. The weight for a given model is repeatedly incremented by 1 if the subsequent normalized weight vector results in a closer match between the weighted combination of cross-validated predictions and the observed values (i.e. lower RMSE of the linear combination). This repetition is carried out *n* times, by default *n* = 1,000. The resulting weight vector is then normalized to obtain a final weight vector.
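The procedure can be sketched as follows. This is a Caruana-style greedy ensemble selection written as a generic illustration (hypothetical function names, and a common variant that at each of the *n* iterations adds the unit increment giving the largest RMSE reduction), not the *caretEnsemble* code itself:

```python
import statistics as st

def rmse(obs, pred):
    return st.fmean((o - p) ** 2 for o, p in zip(obs, pred)) ** 0.5

def greedy_weights(cv_preds, obs, n_iter=1000):
    """cv_preds: one list of cross-validated predictions per input model.
    Repeatedly increment by 1 the weight of the model that most lowers the
    RMSE of the weighted blend, then normalise the counts into weights."""
    n_models = len(cv_preds)
    weights = [0] * n_models
    for _ in range(n_iter):
        best_m, best_err = None, float("inf")
        for m in range(n_models):
            trial = weights[:]
            trial[m] += 1
            total = sum(trial)
            blend = [sum(w * cv_preds[k][i] for k, w in enumerate(trial)) / total
                     for i in range(len(obs))]
            err = rmse(obs, blend)
            if err < best_err:
                best_m, best_err = m, err
        weights[best_m] += 1
    total = sum(weights)
    return [w / total for w in weights]
```

With one accurate and one biased base model, essentially all the weight accumulates on the accurate one; with complementary models, the counts converge towards the blend that minimises the cross-validated RMSE.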

In the case of model stacking [28], the predictions of the input models serve as training data points for a meta-model. This meta-model can have linear, e.g. Partial Least Squares [33], or non-linear, *e.g.* Random Forest [34] characteristics. If the selected algorithm allows the importance of its inputs to be determined, each input corresponds to a single model, then the relative contributions of each model to the prediction can be ascertained. These model ensembles can be applied to a test set (which was not used when building the ensembles), and the error metric (e.g. RMSE) compared to that of the single models on the test set.
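The stacking structure can be sketched generically: the cross-validated predictions of the base models become the feature matrix on which a meta-model is fitted, here a tiny gradient-descent linear regressor. The names are hypothetical (camb delegates this step to *caretEnsemble*), and the meta-model could equally be any non-linear learner:

```python
def stack_features(cv_preds):
    """Turn per-model cross-validated predictions into meta-model rows:
    one row per training point, one column per base model."""
    return [list(col) for col in zip(*cv_preds)]

def fit_linear_meta(X, y, lr=0.01, epochs=2000):
    """Minimal stochastic-gradient linear meta-model (weights + bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = pred - yi
            b -= lr * err
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
    return w, b

def predict_meta(w, b, X):
    return [sum(wj * xj for wj, xj in zip(w, xi)) + b for xi in X]
```

With a linear meta-model, the fitted weights directly show the relative contribution of each base model, which is the interpretability property mentioned above.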

In the general case, prior to model training, the dataset is divided into a training set, comprising e.g. 70% of the data, and a test set, which comprises the remaining data. The test set is used to assess the predictive power of the models on new data points not considered in the training phase. In the training phase, the values of the model hyper-parameters are optimized by grid search and *k*-fold cross-validation (CV) [35]. A grid of plausible hyper-parameter values covering an exponential range is defined (function *expGrid*). Next, the training set is split into *k* folds by, e.g. stratified or random sampling of the bioactivity/property values. For each combination of hyper-parameters, a model is trained on \(k-1\) folds, and the values for the remaining fold are then predicted. This procedure is repeated *k* times, each time holding out a different fold. The hyper-parameter values exhibiting the lowest average RMSE (or another metric, e.g. \(R^{2}\)) across the *k* folds are considered optimal. A model is then trained on the whole training set using the optimal hyper-parameter values, and the predictive power of this model is assessed on the test set. The final model, trained on the whole dataset after having optimized the hyper-parameter values by CV, can be used to make predictions on an external chemical library.
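The grid-search/cross-validation loop just described can be sketched generically; `fit` and `predict` stand for any learner, the function names are hypothetical, and in camb the grid itself would come from *expGrid* while model fitting is delegated to *caret*:

```python
import statistics as st

def kfold_indices(n, k):
    """Split indices 0..n-1 into k (near-)equal contiguous folds."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def grid_search_cv(X, y, param_grid, fit, predict, k=5):
    """Return the hyper-parameter setting with the lowest mean CV RMSE."""
    folds = kfold_indices(len(y), k)
    best_params, best_rmse = None, float("inf")
    for params in param_grid:
        errs = []
        for fold in folds:
            train_idx = [i for i in range(len(y)) if i not in fold]
            model = fit([X[i] for i in train_idx], [y[i] for i in train_idx], params)
            preds = [predict(model, X[i]) for i in fold]
            errs.append(st.fmean((p - y[i]) ** 2 for p, i in zip(preds, fold)) ** 0.5)
        mean_rmse = st.fmean(errs)
        if mean_rmse < best_rmse:
            best_params, best_rmse = params, mean_rmse
    return best_params, best_rmse
```

The winning `params` would then be used to refit on the whole training set before the final assessment on the held-out test set.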

Statistical metrics for model validation have also been included:

*During cross-validation*, the cross-validated correlation coefficient \(q^{2}_{CV}\) and \(RMSE_{CV}\) are calculated as:

\(q^{2}_{CV} = 1 - \frac{\sum_{i=1}^{N_{tr}} (y_{i} - \widetilde{y}_{i})^{2}}{\sum_{i=1}^{N_{tr}} (y_{i} - \bar{y}_{tr})^{2}}\) and \(RMSE_{CV} = \sqrt{\frac{\sum_{i=1}^{N_{tr}} (y_{i} - \widetilde{y}_{i})^{2}}{N_{tr}}}\)

where \(y_{i}\), \(\widetilde{y}_{i}\), and \(\bar{y}_{tr}\) correspond to observation *i*, prediction *i*, and the average value of observations in the training set, respectively.

*During testing*, \(RMSE_{test}\) and the three formulations of \(Q^{2}_{test}\) are calculated analogously on the test set:

\(Q_{1\ test}^{2} = 1 - \frac{\sum_{j=1}^{N_{test}} (y_{j} - \widetilde{y}_{j})^{2}}{\sum_{j=1}^{N_{test}} (y_{j} - \bar{y}_{test})^{2}}\), \(Q_{2\ test}^{2} = 1 - \frac{\sum_{j=1}^{N_{test}} (y_{j} - \widetilde{y}_{j})^{2}}{\sum_{j=1}^{N_{test}} (y_{j} - \bar{y}_{tr})^{2}}\), and \(Q_{3\ test}^{2} = 1 - \frac{\sum_{j=1}^{N_{test}} (y_{j} - \widetilde{y}_{j})^{2} / N_{test}}{\sum_{i=1}^{N_{tr}} (y_{i} - \bar{y}_{tr})^{2} / N_{tr}}\)

where \(y_{j}\), \(\widetilde{y}_{j}\), and \(\bar{y}_{test}\) correspond to observation *j*, prediction *j*, and the average value of observations in the test set, respectively. \(\bar{y}_{tr}\) represents the average value of observations in the training set.

\(R_{0\ test}^2\) is the square of the coefficient of determination through the origin, where \(\widetilde{y}_{j}^{ r0} = k \widetilde{y}_j\) defines the regression through the origin (observed versus predicted) and *k* is its slope. The reader is referred to Ref. [36] for a detailed discussion both of the evaluation of model predictive ability through the test set and of the three different formulations of \(Q^{2}_{test}\), namely \(Q_{1\ { test}}^{2}\), \(Q_{2\ { test}}^{2}\), and \(Q_{3\ { test}}^{2}\). The values of these metrics permit the assessment of model performance according to the criteria proposed by Tropsha and Golbraikh [37, 38], namely: \(q_{{ CV}}^{2} > 0.5\), \(R_{test}^2 > 0.6\), \(\frac{(R_{test}^2 - R_{0\ test}^2)}{R_{test}^2} < 0.1\), and \(0.85 \le k \le 1.15\).

These values might change depending on the dataset modelled, as well as on the application context; for example, higher errors might be tolerated in hit identification than in lead optimization. Nevertheless, these criteria can serve as general guidelines for assessing model predictive ability. The function *Validation* permits the calculation of all these metrics.
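The criteria above can be checked programmatically. The sketch below assumes the common Golbraikh–Tropsha definition of the through-origin slope, \(k = \sum_j y_j \widetilde{y}_j / \sum_j \widetilde{y}_j^{2}\); details may differ from camb's *Validation* implementation, so treat it as an illustration of the criteria rather than a drop-in replacement:

```python
import statistics as st

def r2_test(obs, pred):
    """Squared Pearson correlation between observed and predicted values."""
    mo, mp = st.fmean(obs), st.fmean(pred)
    num = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    den = (sum((o - mo) ** 2 for o in obs) * sum((p - mp) ** 2 for p in pred)) ** 0.5
    return (num / den) ** 2

def r2_0_test(obs, pred):
    """R^2 of the regression through the origin (observed vs predicted)
    and its slope k = sum(y*yhat)/sum(yhat^2)."""
    k = sum(o * p for o, p in zip(obs, pred)) / sum(p * p for p in pred)
    ss_res = sum((o - k * p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - st.fmean(obs)) ** 2 for o in obs)
    return 1 - ss_res / ss_tot, k

def passes_tropsha(q2_cv, obs, pred):
    """Check the four Tropsha-Golbraikh criteria quoted in the text."""
    r2 = r2_test(obs, pred)
    r2_0, k = r2_0_test(obs, pred)
    return (q2_cv > 0.5 and r2 > 0.6
            and (r2 - r2_0) / r2 < 0.1
            and 0.85 <= k <= 1.15)
```

A model with accurate, nearly unbiased test-set predictions and an acceptable \(q^{2}_{CV}\) passes all four checks; failing any single criterion flags the model as insufficiently predictive.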

In cases where information about the experimental error of the data is available, the values of the statistical metrics on the test set can be compared to the theoretical maximum and minimum achievable performance given (1) the uncertainty of the experimental measurements, (2) the size of the training and test sets, and (3) the distribution of the dependent variable [39]. The distributions of maximum and minimum \(R_{0\ test}^2\), \(R_{test}^2\), \(Q^{2}_{test}\), and RMSE_{test} values can be computed with the functions *MaxPerf* and *MinPerf*. The distributions of maximum model performance are calculated in the following way. A sample, *S*, of size equal to the test set is randomly drawn from the dependent variable, e.g. IC_{50} values. Next, the experimental uncertainty is added to *S*, which defines the sample \(S_{noise}\). The \(R_{0\ test}^2\), \(R_{test}^2\), \(Q^{2}_{test}\), and RMSE_{test} values for *S* against \(S_{noise}\) are then calculated. These steps are repeated *n* times, by default 1,000 times, to obtain the distributions. To calculate the distributions of minimum model performance, the same steps are followed, with the exception that *S* is randomly permuted before calculating the values of the statistical metrics.
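A minimal Monte Carlo sketch of the *MaxPerf*/*MinPerf* idea, shown for RMSE_{test} only and with hypothetical names (Gaussian noise stands in for the experimental uncertainty):

```python
import random
import statistics as st

def rmse(a, b):
    return st.fmean((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def performance_bounds(y_all, test_size, exp_sd, n_rep=1000, seed=1):
    """Distributions of the best (noise-limited) and worst (permuted)
    achievable RMSE_test, given the experimental uncertainty exp_sd."""
    rng = random.Random(seed)
    max_perf, min_perf = [], []
    for _ in range(n_rep):
        s = rng.sample(y_all, test_size)                 # sample S
        s_noise = [v + rng.gauss(0, exp_sd) for v in s]  # S plus experimental noise
        max_perf.append(rmse(s, s_noise))                # best case: S vs S_noise
        s_perm = s[:]
        rng.shuffle(s_perm)                              # worst case: permuted S
        min_perf.append(rmse(s_perm, s_noise))
    return max_perf, min_perf
```

The best-case RMSE distribution clusters around the experimental noise level, while the permuted (worst-case) distribution reflects the spread of the dependent variable; a real model's RMSE_{test} should fall between the two.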

### Visualization

Plotting functions in *camb* are built on the R package *ggplot2* [40]. Default options of the plotting functions were chosen to allow the generation of high-quality plots, and in addition, the layer-based structure of ggplot objects allows further customisation by the addition of extra layers. The visualization tools include correlation plots (*CorrelationPlot*), bar plots with error bars (*ErrorBarplot*), Principal Component Analysis (PCA) plots (*PCA* and *PCAPlot*), histograms (*DensityResponse*), and pairwise distance distribution plots (*PairwiseDistPlot*). For instance, the *camb* function *PCA* performs a Principal Component Analysis on compound and/or protein descriptors. Its output can be sent directly to the function *PCAPlot*, which depicts the first two principal components, with the shape and colour of each point determined by a user-defined class, e.g. compound class or protein isoform (Fig. 2).

Visual depiction of compounds is also possible with the function *PlotMolecules*, utilising Indigo’s C API. Visualization functions are exemplified in the tutorials provided in the Additional file 2 and with the package documentation (folder *camb/doc* of the package).

### Predictions for new molecules

One of the major benefits of having all tools available in one framework is that it is straightforward to apply exactly the same processing to new molecules as was applied to the training set, e.g. standardisation of molecules and centering and scaling of descriptors. The *camb* function *PredictExternal* allows the user to read an external set of molecules together with a trained model, and outputs predictions for this external set. This *camb* functionality ensures that the same standardisation options and descriptor types are used when a model is applied to make predictions for new molecules. An example of this is shown in the QSPR tutorial.
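The underlying principle, capturing the training-set transform so it can be replayed unchanged on external molecules, can be sketched as follows (a hypothetical class for illustration, not the *PredictExternal* implementation):

```python
import statistics as st

class Preprocessor:
    """Record the preprocessing fitted on the training descriptors so the
    identical transform is applied to external molecules at prediction time."""

    def fit(self, X):
        cols = list(zip(*X))
        # drop constant columns; remember which columns survive
        self.keep = [i for i, c in enumerate(cols) if len(set(c)) > 1]
        self.mean = [st.fmean(cols[i]) for i in self.keep]
        self.sd = [st.stdev(cols[i]) for i in self.keep]
        return self

    def transform(self, X):
        """Apply the stored column selection, centering, and scaling."""
        return [[(row[i] - m) / s
                 for i, m, s in zip(self.keep, self.mean, self.sd)]
                for row in X]
```

The key point is that `transform` on external data reuses the training-set means, standard deviations, and column selection rather than refitting them, so a descriptor value for a new molecule lands on exactly the scale the model was trained on.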

## Results and discussion

Two tutorials demonstrating property and bioactivity modelling are available as Additional files 1 and 2, and also within the package documentation. We encourage *camb* users to visit the package repository (https://github.com/cambDI/camb) for future updated versions of the tutorials. In the following subsections, we show the results obtained for the two case studies presented in the tutorials, namely: (1) QSPR: prediction of compound aqueous solubility (logS), and (2) PCM: modelling of the inhibition of 11 mammalian cyclooxygenases (COX) by small molecules. The datasets are available in the *examples/PCM* directory of the package. Further details about the PCM dataset can be found in Ref. [28].

### Case study 1: QSPR

To illustrate the functionalities of *camb* for compound property modelling, the aqueous solubility values for 1,708 small molecules were downloaded [41]. Aqueous solubility values were expressed as logS, where S corresponds to the solubility at a temperature of 20–25\(^{\circ }\)C in mol/L. A common representation for the compound structures was found using the function *StandardiseMolecules* with default parameters, meaning that all molecules were kept irrespective of their molecular mass or the number of halogens present within their structure. Molecules were represented with implicit hydrogens, dearomatized, and passed through the InChI format to ensure that tautomers were represented by the same SMILES. 905 one- and two-dimensional topological and physicochemical descriptors were then calculated using the function *GeneratePadelDescriptors* provided by the PaDEL-Descriptor [14] Java library built into the *camb* package. Missing descriptor values were imputed with the function *ImputeFeatures*. Two filtering steps were then performed: (1) highly correlated descriptors with redundant predictive signal were removed using the function *RemoveHighlyCorrelatedFeatures* with a cut-off value of 0.95, and (2) descriptors with near-zero variance, and hence limited predictive signal, were removed using the function *RemoveNearZeroVarianceFeatures* with a frequency-ratio cut-off of 30/1. Prior to model training, all descriptors were centered to have zero mean and scaled to unit variance using the function *PreProcess*. After applying these steps the dataset consisted of 1,606 molecules encoded with 211 descriptors.

The two ensemble modelling approaches implemented in *camb* were explored, namely: greedy optimization and model stacking. First, a greedy ensemble was trained using the function *caretEnsemble* with 1,000 iterations. The greedy ensemble picked a linear combination of model outputs that was a local minimum in the RMSE landscape. Secondly, linear and non-linear stacking ensembles were created. In model stacking, the cross-validated predictions of a library of models are used as descriptors, on which a meta-model (ensemble model) is trained. This meta-model can be linear, e.g. SVM with a linear kernel, or non-linear, such as Random Forest. The application of ensemble modelling led to a decrease of 10–15% in RMSE_{test} values (Table 1). The highest predictive power was obtained with the greedy and the linear stacking ensembles, both with \(R^{2}_{0\ test}\)/RMSE_{test} of 0.93/0.51. Taken together, these results indicate that higher predictive power can be obtained when modelling this dataset by combining different single QSPR models with either greedy optimisation or model stacking. This case study shows that, by utilizing the *camb* package, a model training task which might otherwise involve porting datasets between multiple external tools can be simplified to a few lines of reproducible code within the R language alone. Additionally, predictions can easily be made on new molecules with a single function call, passing in a new structures file.

Cross-validation and testing metrics for the single and ensemble QSPR models trained on the compound solubility dataset

Algorithm | \(R^{2}_{CV}\) | \(RMSE_{CV}\) | \(R^{2}_{0\ test}\) | \(RMSE_{test}\)
---|---|---|---|---
**A: single models** | | | |
GBM | 0.90 | 0.59 | 0.93 | 0.52
RF | 0.89 | 0.62 | 0.91 | 0.59
SVM radial | 0.88 | 0.63 | 0.91 | 0.60
**B: ensemble models** | | | |
Greedy | – | 0.57 | 0.93 | 0.51
Linear stacking | 0.90 | 0.57 | 0.93 | 0.51
RF stacking | 0.89 | 0.62 | 0.92 | 0.55

### Case study 2: proteochemometrics

In this case study, the functionalities of *camb* are illustrated for proteochemometric modelling. The tutorial "PCM with *camb*" (Additional file 2) reports the complete modelling pipeline for this dataset [28]. Bioactivity data for 11 mammalian COX (COX-1 and COX-2 inhibitors) were extracted from ChEMBL 16 [2, 28] (Table 2). Only the data satisfying the following criteria were kept: (1) assay score confidence higher than 8, (2) activity relationship equal to '=', (3) activity type equal to "IC50", and (4) activity unit equal to 'nM'. The mean IC_{50} value was taken for duplicated compound–COX combinations. The final dataset comprised 3,228 distinct compounds and 11 mammalian COX proteins, with a total of 4,937 datapoints (13.9% matrix completeness) [28].

Cyclooxygenase inhibition dataset ("Results and discussion" section, case study 2)

UniProt ID | Isoenzyme | Organism | Number of datapoints
---|---|---|---
P23219 | 1 | | 1,346
O62664 | 1 | | 48
P22437 | 1 | | 50
O97554 | 1 | | 11
P05979 | 1 | | 442
Q63921 | 1 | | 23
P35354 | 2 | | 2,311
O62698 | 2 | | 21
Q05769 | 2 | | 305
P79208 | 2 | | 341
P35355 | 2 | | 39

A common representation for the compound structures was found using the function *StandardiseMolecules* with default parameters. Then, two main descriptor types were calculated: (1) PaDEL descriptors [14] with the function *GeneratePadelDescriptors*, and (2) Morgan fingerprints with the function *MorganFPs*. Substructures with a maximal diameter of 4 bonds were considered, and the length of the fingerprints was set to 512. To describe the target space, the binding-site amino acid descriptors were derived from the crystallographic structure of ovine COX-1 complexed with celecoxib (PDB ID: 3KK6 [42]) by selecting those residues within a sphere of radius 10 Å centred on the ligand. Subsequently, we performed a multiple sequence alignment to determine the corresponding residues for the other 10 COX proteins, and calculated 5 *Z*-scales for these residues with the function *AADescs*.
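The residue-selection step amounts to a simple distance filter, sketched below with hypothetical residue names and coordinates; in real use, the names and 3-D coordinates would be read from the 3KK6 structure:

```python
def binding_site_residues(residues, ligand_center, radius=10.0):
    """Keep residues whose representative coordinates lie within `radius`
    angstroms of the ligand centre (Euclidean distance in 3-D)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return [name for name, coord in residues if dist(coord, ligand_center) <= radius]
```

The selected residue names, mapped across the multiple sequence alignment, define the positions for which amino acid descriptors are then calculated for every protein in the dataset.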

Missing descriptor values were imputed (function *ImputeFeatures*). Two filtering steps were then performed: (1) highly correlated descriptors with redundant predictive signal were removed using the function *RemoveHighlyCorrelatedFeatures* with a cut-off value of 0.95, and (2) descriptors with near-zero variance, and hence limited predictive signal, were removed using the function *RemoveNearZeroVarianceFeatures* with a frequency-ratio cut-off of 30/1. Prior to model training, all descriptors were centered to have zero mean and scaled to unit variance using the function *PreProcess*. These steps led to a final selection of 356 descriptors: 242 Morgan fingerprint binary descriptors, 99 physicochemical descriptors, and 15 *Z*-scales. The dataset was split into a training set, comprising 80% of the data, and a test set (20%) with the function *SplitSet*. Three single PCM models were trained using fivefold cross-validation, namely: GBM, RF, and SVM with a radial kernel (Table 3).

Cross-validation and testing metrics for the single and ensemble PCM models trained on the COX dataset

Algorithm | \(R^{2}_{CV}\) | \(RMSE_{CV}\) | \(R^{2}_{0\ test}\) | \(RMSE_{test}\)
---|---|---|---|---
**A: single models** | | | |
GBM | 0.59 | 0.77 | 0.60 | 0.76
RF | 0.60 | 0.78 | 0.61 | 0.79
SVM | 0.61 | 0.75 | 0.60 | 0.76
**B: ensemble models** | | | |
Greedy ensemble | – | 0.73 | 0.63 | 0.73
Linear stacking | 0.63 | 0.73 | 0.63 | 0.73
EN stacking | 0.63 | 0.72 | 0.62 | 0.72
SVM linear stacking | 0.63 | 0.73 | 0.62 | 0.73
SVM radial stacking | 0.63 | 0.73 | 0.63 | 0.73
RF stacking | 0.61 | 0.76 | 0.58 | 0.77

The function *Validation* served to calculate the values of the statistical metrics on the test set. The observed against the predicted values on the test set were plotted with the function *CorrelationPlot* (Fig. 3b).

All model ensembles displayed higher predictive power on the test set than the single PCM models, except for RF stacking (Table 3). The lowest RMSE value on the test set, namely 0.72, was obtained with the Elastic Net (EN) stacking model (Table 3), whereas the highest \(R^{2}_{0}\) value, namely 0.63, was obtained with the greedy, linear stacking, and SVM radial stacking ensembles. As in the previous case study, these data indicate that higher predictive power can be obtained by combining single PCM models into more predictive model ensembles, although the improvement can be marginal. This case study illustrates the versatility of *camb* for training and validating PCM models from amino acid sequences and compound structures in an integrated and seamless modelling pipeline.

## Availability and future directions

*camb* is coded in R, C++, Python and Java and is available open source at https://github.com/cambDI/camb. To install *camb* from R, type: `library(devtools); install_github("cambDI/camb/camb")`. We plan to include further functionality based on the C++ Indigo API, and to implement new error estimation methods for regression and classification models. Additionally, we plan to further integrate the Python library RDKit with *camb*. The package is fully documented and includes usage examples and details of the R functions implemented in *camb*.

## Conclusions

In silico predictive models have proved valuable for the optimisation of compound potency, selectivity and safety profiles. In this context, *camb* provides an open framework for (1) compound standardisation, (2) molecular and protein descriptor calculation, (3) pre-processing and feature selection, model training, visualisation and validation, and (4) bioactivity/property prediction for new molecules. All the above functionalities speed up model generation, and provide reproducibility and tests of robustness. *camb* functions have been designed to meet the needs of both expert and novice users. Therefore, *camb* can serve as an educational platform for undergraduate, graduate, and post-doctoral students, while providing versatile functionality for predictive bioactivity/property modelling in more advanced settings.

## Declarations

### Authors’ contributions

DM and ICC conceived and coded the package. DM and ICC wrote the tutorials. GvW provided analytical tools for amino acid descriptor calculation. DM, ICC, GvW, IS, AB, TM and RG wrote the paper. All authors read and approved the final manuscript.

### Acknowledgements

ICC thanks the Paris-Pasteur International PhD Programme and Institut Pasteur for funding. TM thanks CNRS and Institut Pasteur for funding. DSM and RCG thanks Unilever for funding. GvW thanks EMBL (EIPOD) and Marie Curie (COFUND) for funding. AB thanks Unilever and the European Research Commission (Starting Grant ERC-2013-StG 336159 MIXTURE) for funding.

### Compliance with ethical guidelines

**Competing interests** The authors declare that they have no competing interests.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

## References

- Bender A (2010) Databases: compound bioactivities go public. Nat Chem Biol 6(5):309
- Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A et al (2011) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 40(D1):1100–1107
- Wang Y, Xiao J, Suzek TO, Zhang J, Wang J, Zhou Z et al (2012) PubChem's BioAssay Database. Nucleic Acids Res 40(Database issue):400–412
- van Westen GJP, Wegner JK, IJzerman AP, van Vlijmen HWT, Bender A (2011) Proteochemometric modeling as a tool to design selective compounds and for extrapolating to novel targets. Med Chem Comm 2:16–30
- Cortes Ciriano I, Ain QU, Subramanian V, Lenselink EB, Mendez Lucio O, IJzerman AP et al (2015) Polypharmacology modelling using proteochemometrics: recent developments and future prospects. Med Chem Comm 6:24–50
- R Core Team (2013) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S et al (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5:80
- Mente S, Kuhn M (2012) The use of the R language for medicinal chemistry applications. Curr Top Med Chem 12(18):1957–1964
- Cao Y, Charisi A, Cheng LC, Jiang T, Girke T (2008) ChemmineR: a compound mining framework for R. Bioinformatics 24(15):1733–1734
- Guha R (2007) Chemical informatics functionality in R. J Stat Softw 18(5):1–16
- Kuhn M (2008) Building predictive models in R using the caret package. J Stat Softw 28(5):1–26
- Indigo (2013) Indigo Cheminformatics Library. GGA Software Services, Cambridge
- Rognan D (2007) Chemogenomic approaches to rational drug design. Br J Pharmacol 152(1):38–52
- Yap CW (2011) PaDEL-Descriptor: an open source software to calculate molecular descriptors and fingerprints (v2.16). J Comput Chem 32(7):1466–1474
- Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50(5):742–754
- PubChem (2014) PubChem Blog: what is the difference between a substance and a compound in PubChem? http://pubchemblog.ncbi.nlm.nih.gov/2014/06/19/what-is-the-difference-between-a-substance-and-a-compound-in-pubchem/
- InChI (2013) IUPAC—International Union of Pure and Applied Chemistry: The IUPAC International Chemical Identifier (InChI). http://www.iupac.org/home/publications/e-resources/inchi.html
- Landrum G (2006) RDKit: open-source cheminformatics. http://www.rdkit.org
- Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E (2003) The Chemistry Development Kit (CDK): an open-source Java library for chemo- and bioinformatics. J Chem Inf Comput Sci 43(2):493–500
- Hall LH, Kier LB (1995) Electrotopological state indices for atom types: a novel combination of electronic, topological, and valence state information. J Chem Inf Comput Sci 35(6):1039–1045
- Durant JL, Leland BA, Henry DR, Nourse JG (2002) Reoptimization of mdl keys for use in drug discovery. J Chem Inf Comput Sci 42(6):1273–1280View ArticleGoogle Scholar
- O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR (2011) Open Babel: an open chemical toolbox. J Cheminf 3(1):33View ArticleGoogle Scholar
- Klekota J, Roth FP (2008) Chemical substructures that enrich for biological activity. Bioinformatics 24(21):2518–2525View ArticleGoogle Scholar
- Xiao N, Xu Q (2014) Protr: Protein sequence descriptor calculation and similarity computation with R. R package version 0.2-1Google Scholar
- van Westen GJ, Swier RF, Cortes-Ciriano I, Wegner JK, Overington JP, Ijzerman AP et al (2013) Benchmarking of protein descriptor sets in proteochemometric modeling (part 2): Modeling performance of 13 amino acid descriptor sets. J Cheminf 5(1):42View ArticleGoogle Scholar
- van Westen G, Swier R, Wegner JK, IJzerman AP, van Vlijmen HW, Bender A (2013) Benchmarking of protein descriptor sets in proteochemometric modeling (part 1): comparative study of 13 amino acid descriptor sets. J Cheminf 5(1):41Google Scholar
- van Westen GJP, van den Hoven OO, van der Pijl R, Mulder-Krieger T, de Vries H, Wegner JK et al (2012) Identifying novel adenosine receptor ligands by simultaneous proteochemometric modeling of rat and human bioactivity data. J Med Chem 55(16):7010–7020View ArticleGoogle Scholar
- Cortes-Ciriano I, Murrell DS, van Westen GJP, Bender A, Malliavin T (2014) Prediction of the potency of mammalian cyclooxygenase inhibitors with ensemble proteochemometric modeling. J Cheminf 7:1View ArticleGoogle Scholar
- Andersson CR, Gustafsson MG, Strömbergsson H (2011) Quantitative chemogenomics: machine-learning models of protein-ligand interaction. Curr Top Med Chem 11(15):1978–1993View ArticleGoogle Scholar
- Kuhn M, Johnson K (2013) Applied predictive modeling. Springer, New YorkView ArticleGoogle Scholar
- Mayer Z (2013) CaretEnsemble: framework for combining caret models into ensembles. [R Package Version 1.0]Google Scholar
- Caruana R, Niculescu-Mizil A, Crew G, Ksikes A (2004) Ensemble selection from libraries of models. In: Proceedings of the 21st international conference on machine learning. ICML‘04, ACM, New York, p 18Google Scholar
- Wold S, Sjöström M, Eriksson L (2001) PLS-regression: a basic tool of chemometrics. Chemometr Intell Lab 58(2):109–130View ArticleGoogle Scholar
- Breiman L (2001) Random forests. Mach Learn 45(1):5–32View ArticleGoogle Scholar
- Hawkins DM, Basak SC, Mills D (2003) Assessing model fit by cross-validation. J Chem Inf Comput Sci 43(2):579–586View ArticleGoogle Scholar
- Consonni V, Ballabio D, Todeschini R (2010) Evaluation of model predictive ability by external validation techniques. J. Chemom 24(3–4):194–201View ArticleGoogle Scholar
- Golbraikh A, Tropsha A (2002) Beware of q2!. J Mol Graphics Modell 20(4):269–276View ArticleGoogle Scholar
- Tropsha A, Gramatica P, Gombar VK (2003) The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. QSAR Comb Sci 22(1):69–77View ArticleGoogle Scholar
- Cortes Ciriano I, van Westen GJ, Lenselink EB, Murrell DS, Bender A, Malliavin T (2014) Proteochemometric modeling in a Bayesian framework. J Cheminf 6(1):35Google Scholar
- Wickham H (2009) Ggplot2: elegant graphics for data analysis. http://had.co.nz/ggplot2/book
- Wang J, Krudy G, Hou T, Zhang W, Holland G, Xu X (2007) Development of reliable aqueous solubility models and their application in druglike analysis. J Chem Inf Model 47(4):1395–1404View ArticleGoogle Scholar
- Rimon G, Sidhu RS, Lauver DA, Lee JY, Sharma NP, Yuan C, Frieler RA, Trievel RC, Lucchesi BR, Smith WL (2010) Coxibs interfere with the action of aspirin by binding tightly to one monomer of cyclooxygenase-1. Proc Natl Acad Sci USA 107(1):28–33View ArticleGoogle Scholar
- Kruger FA, Overington JP (2012) Global analysis of small molecule binding to related protein targets. PLoS Comput Biol 8(1):1002333View ArticleGoogle Scholar