Using Pareto points for model identification in predictive toxicology
 Anna Palczewska^{1}Email author,
 Daniel Neagu^{1} and
 Mick Ridley^{1}
https://doi.org/10.1186/17582946516
© Palczewska et al.; licensee Chemistry Central Ltd. 2013
Received: 13 December 2012
Accepted: 27 February 2013
Published: 22 March 2013
Abstract
Predictive toxicology is concerned with the development of models that are able to predict the toxicity of chemicals. A reliable prediction of toxic effects of chemicals in living systems is highly desirable in cosmetics, drug design or food protection to speed up the process of chemical compound discovery while reducing the need for lab tests. There is an extensive literature associated with the best practice of model generation and data integration but management and automated identification of relevant models from available collections of models is still an open problem. Currently, the decision on which model should be used for a new chemical compound is left to users. This paper intends to initiate the discussion on automated model identification. We present an algorithm, based on Pareto optimality, which mines model collections and identifies a model that offers a reliable prediction for a new chemical compound. The performance of this new approach is verified for two endpoints: IGC50 and LogP. The results show a great potential for automated model identification methods in predictive toxicology.
Keywords
Predictive toxicology Model identification Pareto optimality Model combinationBackground
Predictive toxicology is concerned with the development of models that are able to predict the toxicity of chemicals [1]. These models are continuously built and validated on large collections of toxicological experimental studies to discover new biologically active compounds that are more effective, selective, less toxic, or satisfy various toxicological criteria [2, 3]. A reliable prediction of toxic effects of chemicals in living systems is highly desirable in domains such as: cosmetics, drug design or food safety. This knowledge allows an earlier rejection of those chemicals that may fail the testing phase and reduces the cost of manufacturing chemical compounds in the development stage. Additionally, the European Commission’s Legislation of Registration, Evaluation and Authorization of Chemicals (REACH) [4] allows the registration of chemicals that were developed using in silico modelling, which facilitates a reduction in the number of animal tests. These two factors have contributed to increased interests from research and business communities in development of toxicological modelling systems that are focused on data integration, model development and predictions (e.g OpenTox [5], InkSpot [6] or OCHEM [7]).
Quantitative StructureActivity Relationship (QSAR) or StructureActivity Relationship (SAR) models (both regression and classification) are the most common and widely used methods to relate chemical structure/properties with their biological, chemical or environmental activities [8]. According to the Organisation for Economic Cooperation and Development (OECD) Principles for QSAR Model Validation [9], a model should be statistically significant and robust, have its application boundaries defined and be validated by an external dataset [10, 11]. A model applicability domain [12, 13] determines the boundary of the chemical subspace where the model makes reliable prediction for a given activity. Applying models for chemicals from outside of their applicability domains increases the likelihood of inaccurate prediction.
There is an extensive literature associated with the best practice of model generation and data integration [14–19] but management and identification of relevant models from available collections of models is still an open problem. In recent years a large number of highly predictive models, having various applicability domains, has become publicly available. Some of them, tested on a wide chemical space, have become officially approved tools, e.g. KOWWIN (estimates the log octanolwater partition coefficient) or BCFBAF (estimates fish bioconcentration factor) built into Estimation Program Interface (EPI) Suite [20]. There is also a large number of quality models that are applicable only for a narrow chemical space. Some of them are annotated according to the OECD principles and publicly available in databases like JRC QSAR Models Database [21]. This database includes reports of model generation, validation and prediction according to the OECD standards. QSAR Model Reporting Format (QRMF) and QSAR Prediction Reporting Format (QPRF) have been developed at the Computational Toxicology and Modelling lab of the JRC’s Institute to standardise annotation of model metainformation. Currently, there is a lot of effort to build the ontologies for QSAR experiments and to provide an interoperable and reproducible framework for QSAR analyses [22].
Models that are stored in model databases can be reused to predict toxicity of new chemical compounds. Unfortunately, this involves a manual process of model identification. A potential user is required to make a comparison of model applicability domains and their predictivity for a given activity in order to decide if the model can make reliable predictions for a given chemical compound. Model comparison is a difficult task since models are generated using various subsets or various chemical compound descriptors. Consequently, models can be trained and validated on different datasets. For regression models, the model performance can be described by the predictive squared correlation coefficient q^{2}. Since the sizes and contents of modelling and validation datasets may differ for various models, the value of q^{2} is not sufficient for model comparison [10]. Several model performance matrices were analysed in the context of model validation and model selection [14]. They are applied in automated model development where models are validated by the same dataset. In the case where two models come from different sources, model comparison becomes challenging. This requires predictive models to be validated across the entire chemical space, which is very difficult as the list of available chemicals and assays may be limited.
Clearly, there is a need for automated techniques for mining model repositories. This includes methods for model quality control, data and model integration, model comparison and model identification. Our research aims to address this gap. In this paper, we draw attention to the importance of existing models’ usage in predictive toxicology. We also introduce methods for effective model identification for a new unseen chemical compound. The term “model identification“ covers the whole range of problems related to model selection from a collection of models (for a given endpoint) developed on various datasets. In the extreme case, datasets (and specified applicability domains) for two models can be disjoint. Model identification is a much harder problem than the well known model selection problem [23], i.e choosing a model from a set of candidate models with the same applicability domains. Therefore, various methods applied in traditional model selection [24–27] cannot be directly applied to model identification. In contrast to model selection, model identification cannot take into account model variables or parameters since some model variables cannot be easily accessed for new chemical compounds.
The interesting questions here are whether efficient model identification is possible based on molecular structures and models performances, and how good the identified model can be for a new chemical compound. In [28], authors defined the framework for automated model selection and described a simple algorithm for model selection. The method selects the most predictive model from the collection of models for a nearest neighbour to the query chemical compound. Often, the nearest neighbourhood can contain more than one element and model performances can differ slightly. In this case, it is difficult to say which model would be the most reliable for a given chemical compound.
To answer the above question, in this paper we present a new method for model identification for regression models. This method uses Pareto points [29] to define the nearest Pareto neighbourhood according to two criteria: structural similarity of chemicals and models performances. In the next section a framework for model identification, Pareto points and their properties are introduced. Having the Pareto nearest neighbourhood defined, we present two methods for model identification. The first method averages model performances for all Pareto neighbours and identifies the one with the smallest error. The second method identifies a model for which the Pareto point is the closest (based on Euclidean distance) to a centroid of all points in the Pareto neighbourhood. We also demonstrate that model identification improves the quality of the test set, or unseen chemical compound prediction. Experimental work using IGC50 for Tetrahymena pyriformis and internal Syngenta LogP datasets show that our approach provides good results and it is worth being considered for further research.
Methods
Framework for model identification in predictive toxicology
There are several chemical compound representations and thousands of available chemical descriptors [8] used for predictive model development. In this paper, a chemical space X is a set of chemicals represented by pairs x = (x^{ d },x^{ f }), where ${x}^{d}\in {\mathbb{R}}^{{K}_{1}}$ represents a vector of descriptor values, ${x}^{f}\in {\{0,1\}}^{{K}_{2}}$ is a fingerprint, and K_{1} + K_{2} is the dimension of the chemical space. Descriptors represent various topological, geometrical, physical and chemical properties of a chemical compound. A fingerprint is a binary vector whose coordinates define the presence or absence of predefined structural fragments within a molecule [30]. A fingerprint is also a one dimensional representation of a chemical compound and it is widely used for chemical similarity search in large databases [31]. It is also worth noting that a fingerprint is not a unique chemical compound representation because it encodes only a fragment of a molecule. There can be two different molecules having the same fingerprint representation.
A predictive model M is a mapping X → Y, where $Y\subset \mathbb{R}$ is the output space. The output space Y might, for example, represent a particular biological, physical or chemical activity of a chemical compound.
The input data is represented by the pairs: (x_{ i },y_{ i }) ∈ X × Y for i = 1,…,n, where x_{ i } is an element of the chemical space and y_{ i } is the measured activity of that element. There is also a set of m predictive models $\mathcal{\mathcal{M}}=\{{M}_{1},\dots ,{M}_{m}\}$ associated with the activity Y. These models were generated using various statistical or data mining techniques and they have different applicability domains and performances. To identify the most predictive model from the collection of models $\mathcal{\mathcal{M}}$ for a new chemical compound x, we define a partitioning model that splits the chemical space into disjoint groups and allows an unambiguous model identification.
where

D_{1},…,D_{ m } ⊆ X are disjoint,

$\bigcup _{i=1}^{m}{D}_{i}=X$.
The main hypothesis in predictive modeling is that similar chemical compounds have similar properties [32]. Following this hypothesis we build the partitioning model that it splits the chemical space in groups in order to maximize the similarity of their chemical compounds and to minimize the error of a model associated with this group. It is easy to notice that this is a bicriteria problem and the solutions have to represent a tradeoff between optimality of these criteria (the socalled Pareto points). Pareto optimality is a multicriteria optimisation technique widely applied in decision making problems [29]. In QSAR modelling multiobjective (criteria) was used for feature selection [33] in order to maximize predictive capacity and to reduce the number of selected descriptors. In this paper we present how Pareto optimality can be applied in QSAR model identification. In the following sections we recall the basic definition of the Pareto set and we propose an algorithm that finds Pareto points in 2D vector space.
Pareto points and their properties
Let consider a vector $v=\phantom{\rule{2.77626pt}{0ex}}[{f}_{1},{f}_{2},\dots ,{f}_{K}]$ in the K– dimensional space. Let π_{ j }(v) = f_{ j } denote a jth coordinate of vector v and V be a finite set of vectors in ${\mathbb{R}}^{K}$.
Definition 1 (Domination)
Definition 2 (Comparison)
Vectors $v,w\in {\mathbb{R}}^{K}$ are incomparable, which we denote by v∼w, if neither v ≼ w nor w ≼ v.
Definition 3 (Pareto set)
A set Γ ⊂ V of minimal vectors with respect to ≼ is called a Pareto set for V.
The set V_{ j } consists of all vectors in V with minimal value on the jth coordinate.
Lemma 1
Let Γ_{ j } be the set of all minimal vectors in V_{ j }. Then Γ_{ j } ⊂ Γ, where Γ is the Pareto set for V.
In particular, IΓ is a subset of Γ and it is called an initial Pareto set. Now we establish the dependence of the conditions for incomparability with vectors in this initial Pareto set.
Lemma 2
The proof of this Lemma 1 and Lemma 2 as well as all other results in the paper are provided in Appendix 1.
Pareto order in two dimensions
This subsection is devoted to the study of the twodimensional case, i.e. K = 2. We shall use the notation introduced above.
Lemma 3
 1.
If IΓ = 1, then IΓ is the Pareto set for V.
 2.If IΓ = 2, then a vector v ∈ V is incomparable with vectors in IΓ if and only if${\forall}_{j=1,2}\phantom{\rule{1em}{0ex}}{\pi}_{j}\left(v\right)\in ({f}_{j}^{\mathit{\text{min}}},{f}_{j}^{\mathit{\text{max}}}).$(10)
As shown in Figure 1 and Figure 2, when IΓ consists of two elements w_{1} and w_{2}, a set of vectors incomparable with IΓ is given by the rectangle $\mathcal{V}$. Let Γ be a vector incomparable with IΓ, i.e. $\gamma \in \mathcal{V}$. The introduction of v_{0} divides the rectangle $\mathcal{V}$into three areas:

A^{′} and A^{′′} is a set of vectors incomparable with IΓ ∪ {γ},

B is a set of vectors smaller then Γ,

C is a set of vectors bigger then Γ.
The above properties of IΓ and vectors incomparable with IΓ allow us to limit the search space$\mathcal{V}$to find Pareto solutions.
Finding a Pareto set in 2D vector space
In this section, we present an algorithm for finding a Pareto set in twodimensional space (see Algorithm 1). FINDPARETOSET(V) is a recursive algorithm that finds all Pareto points in the rectangle$\mathcal{V}$ defined by two points in the initial Pareto set IΓ (see Lemma 1); this rectangle contains all points from V. The algorithm starts from finding a point Γ that does not dominate any other points in V (line 4). This point splits the area$\mathcal{V}$ into four rectangles (see Figure 2). According to Lemma 2 and Lemma 3, B ∩ V = ∅, C does not contain Pareto points, whereas points in rectangles A^{′} and A^{′′} are incomparable with Γ. The above procedure is recursively repeated for V ∩ A ^{′} and V ∩ A^{′′}.
Algorithm 1 FINDPARETOSET(V)
The algorithm sketched above calls FINDPARETOPOINT($\stackrel{\u0304}{V}$) (see Algorithm 2) to find a Pareto point in the set$\stackrel{\u0304}{V}$. This procedure works in the pessimistic time O(n^{2}), where n is a number of elements in$\stackrel{\u0304}{V}$ (when all solutions are comparable, i.e., to form a chain it may take niterations to find a Pareto point). However, the expected running time is much shorter thanks to the random selection of points.
Algorithm 2 FINDPARETOPOINT($\stackrel{\u0304}{V}$)
Model identification in predictive toxicology
Following the similarity hypothesis researchers build models for groups of chemicals that have a common molecular fragment or common properties. These models are more reliable and give better predictions for chemicals that lie in the model applicability domains. Further, high quality models developed for a small subset of chemical space can be combined in a global model that covers larger chemical space using various ensemble techniques. In this section we present how to identify a reliable model from a collection of already existing models for new before unseen chemicals.
The model identification procedure (see Algorithm 3) can be described as follows: for a query chemical compound q and a given chemical space – 1) create the set V of pairs (d_{ i },e_{ i m }), 2) find the Pareto set for V, 3) select the most suitable model for q. To create a set V we start from the array T (see Figure 3) that contains a structural representation of the chemical compound, its measured activity (for a given endpoint) and predictive performance of each model from $\mathcal{\mathcal{M}}$.
Algorithm 3 MODELIDENTIFY(T q)
After executing MODELIDENTIFY(T,q), in line 1, the array T is converted into a list of vectors V using procedure INIT(T,q) (see Algorithm 4). Every vector v_{ i } ∈ V is defined as a pair of the distance between q and the ith chemical compound from T, and the error of the jth model from$\mathcal{\mathcal{M}}$for the compound i. The distance d_{ q i } = 1  S T_{ q i } is calculated using Tanimoto coefficient ST, which is the most frequently used similarity measure in chemoinformatics [35]. This coefficient works with fingerprints (binary representation of molecules) and is defined as a ratio between the number of bits set on the same position in two fingerprints and the sum of bits set on different positions. The model error e_{ i j } is defined as a distance between the true activity for compound i and the value computed by model j. We treat V as a set of all possible solutions for model identification for a given query molecule q and known chemical subspace.
Algorithm 4 INIT(T,q)
In line 2 of MODELIDENTIFY(T,q), we call FINDPARETOSET(V) to find the set of all Pareto points Γ in V. Then, we analyse points in Γ in order to choose the most predictive model for q. In the case when Γ = 1, there is only one candidate, so the choice is trivial. This case is comparable to the algorithm proposed in [28] which selects the most predictive model for the most similar chemical compound of q. In the case when Γ consists of many Pareto points, the model identification becomes a difficult task: the Tanimoto similarity coefficient (as well as other fingerprint similarity measures) between chemical compounds may not be correlated enough with their activity partially contradicting the similarity hypothesis [32] (see the end of this section for a detailed example). To identify a model using Pareto points, first we define nPareto Neighbourhood as follows:
Definition 4
nPareto Neighbourhood is a set with at most n  Pareto points from Γ which are at distance less than τ from the element q where τ > 0 and n > 0.
The threshold τ is selected by experiment and depends on the chemical similarity within a given chemical space. Having defined the Pareto neighbourhood for a given chemical compound q, we provide two methods for model identification. The first one is called nAverage Pareto (see Algorithm 5). The threshold τ provides means for removing those chemical compounds which are dissimilar to the query compound q but their activity is very well predicted by some model. Next, the model average model errors for the chemicals represented by Pareto points and then the model with the smallest average error is selected. We call this method nAverage Pareto Model Identification (nAPMI). The usage of Pareto neighbourhood in comparison with the standard nearest neighbourhood is that this method is more sensitive on model performances and allows for the rejections of the similar chemical compounds on which models perform badly.
Algorithm 5 Average Pareto
where d_{ c } is the average of distances and e_{ c } is the average of model errors for all Pareto points from the neighbourhood (n  P N). In the next step the Euclidean distance between Pareto points and the centroid is computed. The model that is associated with the Pareto point for which the Euclidean distance to the centroid is minimal, is selected. We call this method nCentroid Pareto Model Identification (nCPMI). According to the definition, both nAPMI and nCPMI are partitioning models that splits chemical space into disjoin groups and allow unambiguous model identification.
Algorithm 6 Centroid Pareto
Analysis of chemical compound similarities in order to highlight the difference of the chemical activity for the TETRATOX dataset
f_{ sim }/diff_{ activ }  0  0.1  0.2  0.3  0.4  0.5  0.6  0.7 

0  1  2  2  2  2  2  2  2 
0.1  3  13  27  44  51  62  70  79 
0.2  6  112  220  335  431  512  585  655 
0.3  16  318  617  933  1213  1474  1719  1928 
0.4  32  720  1402  2081  2701  3297  3840  4328 
0.5  66  1380  2726  4042  5227  6437  7536  8547 
f _{ sim } / diff _{ activ }  0.8  0.9  1  1.1  1.2  1.3  1.4  1.5 
0  2  2  2  2  2  2  2  2 
0.1  84  90  93  96  99  103  104  104 
0.2  700  753  782  801  827  842  849  858 
0.3  2106  2278  2412  2507  2621  2715  2784  2821 
0.4  4763  5160  5526  5837  6119  6360  6575  6724 
0.5  9481  10362  11167  11840  12488  13082  13589  14004 
The TETRATOX dataset contains over one thousand chemical compounds and the biggest difference between measured values of IGC50 is equal to 5.3. Notice that the number of pairs of chemicals that are similar, based on both the fingerprint similarity and the activity, is very small. There is only one pair of chemical compounds that have the same activity and maximal similarity (1row, 1 column). On the other hand, there are many chemicals which are similar fingerprintwise but have different activities. This makes the model identification challenging.
In the next section we present results of the experiments that were carried out in order to demonstrate how model identification works.
Experimental results
Two experiments were proposed in order to demonstrate the advantages of model identification for predictive toxicology. Each experiment has two phases. In the first phase we treated model identification as a classification problem to study the performances of proposed methods in comparison with the other classification algorithms. We defined an “oracle model” that associates each chemical compound from a given chemical space with the most predictive model from the collection of existing models and we used this model to validate our methods. In the second phase, for each chemical compound we applied an identified model to predict the growth inhibition concentration (IGC50) in the first experiment and Partition coefficient (LogP) in the second. Finally, we compared these results with the original model performances applied to the whole chemical space.
IGC50 Prediction for Tetrahymena Pyriformis
Training datasets for both models were obtained from JRC QSAR Models Database. These datasets were compared with the Tetrahymena pyriformis dataset and 204 (136 from the PN model and 68 from the NPN models) training chemicals were present in the TPT dataset. We did not perform any data curation for this dataset. The above described models were implemented for the logP value calculated using the cdk library [42] and used to predict toxicity for the TPT datasets.
First, we considered the model identification problem as a classification problem to predict which model will be the most reliable for a given chemical compound. Having a dataset of the predicted IGC50 for both models and the measured value, we used a priori information (“oracle model“) about the best selected model for each chemical compound and we applied various classification methods. To simulate the model identification for before unseen chemical compounds the leaveoneout (LOO) method was used. This methods takes out one chemical compound from the dataset and uses others chemicals to predict which model would be the most reliable for it. This procedure were repeated for all chemicals in the dataset.
Comparison of classification algorithms according to a number of correctly classified elements, false positive, false negative and the classifiers accuracies
Method  Correct class  Falsepositive  False negative  Accuracy 

SMO  899  122 (10.8%)  106 (9.4%)  0.80 
Part  904  123 (10.9%)  101 (8.9%)  0.80 
NaiveBayes  845  191 (19%)  90 (7.9%)  0.75 
J48  905  123 (10.9%)  100 (8.9%)  0.80 
IBK(1)  905  121 (10.7%)  102 (9%)  0.80 
IBK(3)  901  133 (11.7%)  94 (8.3%)  0.79 
IBK(5)  889  149 (13.2%)  93(8.2%)  0.78 
BayesNet  756  264 (23%)  108 (9.5%)  0.67 
DMS  901  115 (10.1%)  112 (9.9%)  0.79 
3CPMI  902  136 (12%)  90 (7.9%)  0.79 
5CPMI  897  137 (12%)  94 (8.3%)  0.79 
10CPMI  863  168 (14.8%)  97 (8.5%)  0.76 
3APMI  918  99 (8.7%)  111(9.8%)  0.81 
5APMI  891  115 (10%)  122 (10.8%)  0.78 
Model performances and distance comparison of the 3Pareto neighbourhood of the 3Phenyl1propanol
Name  Distance  PN  NPN 

Methylbenzene  0.33  0.37  0.28 
4Dimethylbenzene  0.36  0.54  0.08 
4Chloro3methylphenol  0.30  0.61  1.14 
Model performances and distance comparison of the 3Pareto neighbourhood of the Benzylamine
Name  Distance  PN  NPN 

2Chloroaniline  0.08  0.30  0.38 
(+/)1,2Diphenyl2propanol  0.11  0.041  0.59 
Analysis of model prediction accuracies for IGC50 for Tetrahymena pyriformis
Method Name  R2  RSE  Q2  MAE  RMSE 

NPN  0.58  0.66  0.15  0.69  0.94 
PN  0.58  0.66  0.58  0.50  0.66 
DMS  0.68  0.56  0.62  0.43  0.62 
3CPMI  0.67  0.58  0.60  0.43  0.63 
5CPMI  0.66  0.59  0.59  0.44  0.65 
10CPMI  0.65  0.60  0.57  0.44  0.66 
3APMI  0.69  0.56  0.65  0.41  0.60 
5APMI  0.68  0.57  0.62  0.42  0.62 
Oracle  0.75  0.50  0.71  0.35  0.54 
The 3APMI method provides the best prediction among “nonoracle models”. The first two rows present prediction statistics for PN and NPN models. They are lower than for all other models. Notice, however, that their R2 and RSE statistics are identical. This is due to the fact that both models are affine functions of one and the same explanatory variable. An affine function can, therefore, transform one model into another. This is what happens when regression is applied to compute R2 and RSE. Notice that other two measures of Q2 and predictive errors are different for these models.
Comparison of classification algorithms according to a number of correctly classified elements, false positive, false negative and the classifiers accuracies
Method  Correct class  Falsepositive  Falsenegative  Accuracy 

SMO  296  47(12%)  33(8.7%)  0.787 
Part  303  34(9%)  39(10.3%)  0.805 
NaiveBayes  281  67(17%)  28(7.4%)  0.747 
J48  296  44(11.7%)  36(9.5%)  0.787 
IBK(1)  307  42(11.1%)  27(7.1%)  0.816 
IBK(3)  300  42(11.1%)  34(9%)  0.797 
IBK(5)  299  46(12.2%)  31(8.2%)  0.795 
BayesNet  273  76(20.1%)  27(7.1%)  0.726 
DMS  297  48(12.7%)  31(8.2%)  0.719 
3CPMI  316  29 (7.7%)  31(8.2%)  0.844 
5CPMI  305  33(8.7%)  38(10.1%)  0.811 
10CPMI  288  41(10.9%)  47(12.5 %)  0.766 
3APMI  306  33(8.7%)  37(9.8%)  0.813 
5APMI  300  41(10.9%)  35(9.3%)  0.797 
Chemical structures wrongly associated with the PN model by 3CPMI
CAS  Smiles 

4097498  CC(C)(C)C1=CC(=C(C(=C1)[N+](=O)[O])O)[N+](=O)[O] 
6920225  CCCCC(O)CO 
928972  CCC=CCCO 
10031875  CCC(CC)COC(=O)C 
112141  C(C)(=O)OCCCCCCCC 
105668  C(CCC)(=O)OCCC 
624544  O(C(CC)=O)CCCCC 
123660  C(CCCCC)(=O)OCC 
123159  CCCC(C=O)C 
2987168  CC(C)(CC=O)C 
96480  O=C1CCCO1 
19686738  CC(CBr)O 
4620706  C(NCCO)(C)(C)C 
111864  CCCCCCCCN 
597977  C(N=C=S)(C)(C)CC 
17112822  c1c2c(CN=C=S)cccc2ccc1 
1138529  CC(C)(C)C1=CC(=CC(=C1)O)C(C)(C)C 
142303  C(#CC(C)(C)O)C(C)(C)O 
31333138  CCCCCC#CCCO 
107879  CC(CCC)=O 
2067336  OC(CCCCBr)=O 
91156  N#Cc1c(C#N)cccc1 
2065238  c1(ccccc1)OCC(OC)=O 
613978  N(CC)(C)c1ccccc1 
586787  [N+](c1ccc(cc1)Br)(=O)[O] 
91667  c1(N(CC)CC)ccccc1 
38713563  O(CCCCCCCCC)C(=O)c1ccc(O)cc1 
622468  C(Oc1ccccc1)(=O)N 
93914  C(CC(=O)C)(=O)c1ccccc1 
2216946  C(#Cc1ccccc1)C(=O)OCC 
Chemical structures wrongly associated with the NPN model by 3CPMI
CAS  Smiles 

29338496  CC(C(C1=CC=CC=C1)C2=CC=CC=C2)O 
100447  C1=CC=C(C=C1)CCl 
1823912  CC(C#N)C1=CC=CC=C1 
103695  CCNC1=CC=CC=C1 
112538  C(CCCCCCCCCCC)O 
1119864  C(CCCCCC)CC(CO)O 
628637  C(C)(=O)OCCCCC 
108225  O(C(=C)C)C(=O)C 
94042  C(C(OC=C)=O)(CCCC)CC 
1932929  C(CC)(=O)OCC#C 
1732098  O(C(CCCCCCC(OC)=O)=O)C 
110623  C(CCCC)=O 
36536466  O=C1CC(C)O1 
6261229  CCC#CCO 
4753597  O(CCCCBr)C(C)=O 
20965279  N#CCCCCCCBr 
1577180  OC(=O)CC=CCC 
111160  C(CCCCCC(=O)O)(=O)O 
535137  C(C(C)Cl)(=O)OCC 
600000  CCOC(=O)C(C)(C)Br 
23165448  c1ccc(CCCC)cc1N=C=S 
1565759  CCC(C)(C1=CC=CC=C1)O 
529191  CC1=CC=CC=C1C#N 
141286  C(CCCCC(OCC)=O)(OCC)=O 
106796  C(CCCCCCCCC(OC)=O)(OC)=O 
123728  C(CCC)=O 
22819916  N#CCCCCCCCl 
109524  C(CCCC)(=O)O 
2627272  c1ccccc1CCCN=C=S 
609938  c1(c(c([N+](=O)[O])cc(c1)C)O)[N+](=O)[O] 
3012371  C(#N)SCc1ccccc1 
Analysis of model prediction accuracies for IGC50 for the reduced TPT dataset
Method name  R2  RSE  Q2  MAE  RMSE 

NPN  0.84  0.37  0.60  0.44  0.57 
PN  0.84  0.37  0.75  0.33  0.46 
DMS  0.89  0.30  0.88  0.20  0.32 
3CPMI  0.92  0.25  0.91  0.16  0.26 
5CPMI  0.90  0.28  0.89  0.18  0.29 
10CPMI  0.88  0.32  0.86  0.21  0.33 
3APMI  0.91  0.27  0.90  0.18  0.29 
5APMI  0.90  0.28  0.89  0.19  0.30 
Oracle  0.98  0.10  0.98  0.09  0.11 
The above examples show the great potential of the model identification methods. We demonstrated that the method based on predefined rules (such as maximal similarity for chemicals and minimal error for a model assigned with them) can be compared with the standard machine learning algorithms for the classification problem. Model identification can be considered as an ensemble technique to build high predictive consensus models in predictive toxicology.
LogP prediction for inhouse Syngenta dataset
For the second experiment we considered the estimation of the LogP for an internal Syngenta dataset. The octanol/water Partition coefficient (LogP) is a measure of the lipophilicity of chemical compounds and is an important descriptive parameter in biostudies [8]. Currently, there are various methods for estimating this coefficient: fragmental methods (CLOGP, KOWWIN), atom contribution methods (TSAR, XLOGP), topological indices (MLOGP), molecular properties (BLOGP).
Analysis of model prediction accuracies for a LogP estimation
nr chemicals  Mod.Name  Q2  MAE  RMSE 

CLOGP  0.83  0.38  0.74  
1000  MLOGP  0.57  0.84  1.19 
KOWWIN  0.79  0.47  0.83  
3APMI  0.84  0.38  0.74  
CLOGP  0.76  0.41  0.78  
2000  MLOGP  0.44  0.85  1.2 
KOWWIN  0.69  0.50  0.88  
3APMI  0.78  0.39  0.72  
CLOGP  0.37  1.21  1.54  
2333  MLOGP  0.39  1.13  1.52 
KOWWIN  0.41  1.01  1.49  
3APMI  0.64  0.80  1.16 
Table 10 displays the accuracy of model predictions. The 3APMI is generally at least as good as the best model (CLOGP). In the case of randomly selected chemicals CLOGP was hard to beat, although for 2000 randomly selected chemicals one can clearly see the benefit of using 3APMI (higher Q2 and lower MAE). The biggest gain is, however, observed for those chemicals whose activity is difficult to predict (the last experiment). This shows that partitioning model (3APMI) can be a powerful knowledge extraction tool.
All methods proposed in the paper were implemented in R [44]. The logP value, fingerprints and Tanimoto similarity were calculated using the RCDK [45] library. A number of tests were run to define the threshold τ. It is important to notice that the nPareto neighbourhood defines the set of at most nPareto points. Therefore, for the 3Pareto neighbourhood we found chemicals that have 1, 2, or 3 Pareto neighbours for τ = 0.4 for the entire TPT dataset. For the 5Pareto neighbourhood τ = 0.7 and for the 10Pareto neighbourhood we considered all Pareto neighbours. This shows that a size of the Pareto neighbourhood depends on a size of the available chemical space and may vary for different endpoints. Also, looking at the results for APMI and CPMI one can notice that it is not worth considering all Pareto points, and that the size of the Pareto neighbourhood depends on chemical compound similarities.
Conclusion
In this paper, we draw attention to advantages of model reusage in predictive toxicology. Since the amount of experimental data and the number of predictive models are growing every day, it is crucial to develop automated methods for mining models in repositories. The most demanding task is to find a model for a new chemical compound from a collection of models for a given endpoint.
In this paper, we proposed two methods (APMI and CPMI) that identify the suitable model for a query chemical compound based on the model performances in its Pareto neighbourhood. These algorithms are based on our simple yet effective method for finding the Pareto set in 2D space. The experimental results demonstrate the advantage of our approach and indicate that automated model identification is a promising research direction with many practical applications. Our approach is mainly focused on regression models and in the future we plan to extend it to classification models, including the analysis of model variables in chemical space partitioning. An additional interesting direction could address the estimation of identified model reliability for a new chemical compound.
Appendix 1 Proofs
Proof
(Lemma 1). We prove this lemma by contradiction. Let’s$j\in \{1,\dots ,K\}$ and choose v ∈ Γ_{ j }. Assume that v ∉ Γ, which is equivalent to saying that there exists w ∈ V that is strictly dominated by v, i.e. w ≺ v. This means that π_{ j }(w) = π_{ j }(v) and w ∈ V_{ j }. By the definition of Γ_{ j } we know that v is a minimal vector in V_{ j }, so v ≼ w, which contradicts w ≺ v.
Proof
(Lemma 2). Let v ∈ V. First notice that${\pi}_{j}\left(v\right)\ge {f}_{j}^{\mathit{\text{min}}}$,$j=1,\dots ,K$. If${\pi}_{j}\left(v\right)\notin ({f}_{j}^{\mathit{\text{min}}},{f}_{j}^{\mathit{\text{max}}})$ for all j then${\pi}_{j}\left(v\right)\ge {f}_{j}^{\mathit{\text{max}}}$ for all j and w ≼ v for w ∈ IΓ. If there exists exactly one$j\in \{1,\dots ,K\}$ such that${\pi}_{j}\left(v\right)\in ({f}_{j}^{\mathit{\text{min}}},{f}_{j}^{\mathit{\text{max}}})$, then for each index l ≠ j we have${\pi}_{l}\left(v\right)\ge {f}_{l}^{\mathit{\text{max}}}$ and there exists a vector w ∈ Γ_{ j } such that w ≼ v. Therefore, if v is incomparable with vectors in IΓ, none of the above cases can take place, and the proof is completed.
Proof
Consequently, w is dominated by every vector of V, so it is the only minimal vector in V.
Assume now that IΓ consists of two vectors: w_{1} and w_{2}.
Due to (3) the set of vectors v ∈ V incomparable with IΓ satisfies (9).
According to the Definition 2 and formula (4) we obtain v ∼ w_{1} and v ∼ w_{2}. Since IΓ = {w_{1},w_{2}}, then v is incomparable with the vectors w_{1} and w_{2}.
Declarations
Acknowledgements
The authors would like to thank BBSRC and Syngenta Ltd for funding the Industrial CASE Studentship Grant (No. BB/H530854/1) for AP. The authors are also grateful to Kim Travis and Richard MarcheseRobinson from Syngenta Ltd for their useful comments, and to John Delaney and Nathan Kidley from Syngenta Ltd for the access to the LogP dataset. The authors are also grateful to the referees for their invaluable and insightful comments that have helped to improve the presentation of this work.
Authors’ Affiliations
References
 Helma C (Ed): Predictive Toxicology. 2005, Boca Raton: Taylor & Francis GroupView ArticleGoogle Scholar
 Kavlock R: A framework for computational toxicology research in ORD. [http://nepis.epa.gov/Exe/ZyPURL.cgi?Dockey=100046MA.txt]
 Judson R: Public databases supporting computational toxicology. J Toxicol Environ Health, Part B. 2010, 13 (2): 218231.View ArticleGoogle Scholar
 REACH. [http://ec.europa.eu/environment/chemicals/reach/reach_intro.htm]
 OpenTox. [http://www.opentox.org]
 Inkspot Cloud platform for portable, scalable and secure cloud computing. [http://www.inkspot.co]
 OCHEM. [http://ochem.eu]
 Gasteiger J (Ed): Handbook of Chemoinformatics: From Data to Knowledge. 2003, Weinheim: WileyVCH Verlag GmbHView ArticleGoogle Scholar
 OECD principles for the validation, for regulatory purposes, of QSAR models. [http://www.oecd.org/dataoecd/33/37/37849783.pdf]
 Golbraikh A, Tropsha A: Beware of q2. J Mol Graph Model. 2002, 20: 269276. 10.1016/S10933263(01)001231.View ArticleGoogle Scholar
 Gramatica P: Principles of QSAR models validation: internal and external. QSAR Comb Sci. 2007, 26: 6947012. 10.1002/qsar.200610151.View ArticleGoogle Scholar
 Jaworska J, Comber M, Auer C, Leeuwen C: Summary of a workshop on regulatory acceptance of (Q)SARs for human health and environmental endpoints. Environ Health Perspect. 2003, 111: 13581360. 10.1289/ehp.5757.View ArticleGoogle Scholar
 Jaworska J, NikolovaJelizkova N, Aldenberg T: QSAR applicability domain estimation by projection of the training set in descriptor space: a review. ATLA Alternat Lab Anim. 2005, 33: 445459.Google Scholar
 Tropsha A, Gramatica P, Gombar V: The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. QSAR Comb Sci. 2003, 22: 6977. 10.1002/qsar.200390007.View ArticleGoogle Scholar
 Cartmell J, Enoch S, Krstajic D, Leahy D: Automated QSPR through competitive workflow. J ComputAided Mol Design. 2005, 19: 821833. 10.1007/s1082200590298.View ArticleGoogle Scholar
 Zhu H, Tropsha A, Fourches D, Varnek A, Papa E, Gramatical P, Öberg T, Dao P, Cherkasov A, Tetko I: Combinatorial QSAR modeling of chemical toxicants tested against Tetrahymena pyriformis. J Chem Inf Model. 2008, 48 (4): 766784. 10.1021/ci700443v.View ArticleGoogle Scholar
 Patlewicz G, Jeliazkova N, Gallegos Saliner A, Worth A: Toxmatcha new software tool to aid in the development and evaluation of chemically similar groups. SAR QSAR Environ Res. 2008, 19 (3): 397412.View ArticleGoogle Scholar
 Tropsha A: Best practices for QSAR model development, validation, and exploitation. Mol Inf. 2010, 29 (67): 476488. 10.1002/minf.201000061.View ArticleGoogle Scholar
 Hardy B, Douglas N, Helma C, Rautenberg M, Jeliazkova N, Jeliazkov V, Nikolova I, Benigni R, Tcheremenskaia O, Kramer S, Girschick T, Buchwald F, Wicker J, Karwath A, Gutlein M, Maunz A, Sarimveis H, Melagraki G, Afantitis A, Sopasakis P, Gallagher D, Poroikov V, Filimonov D, Zakharov A, Lagunin A, Gloriozova T, Novikov S, Skvortsova N, Druzhilovsky D, Chawla S, Ghosh I, Ray S, Patel H, Escher S: Collaborative development of predictive toxicology applications. J Cheminformatics. 2010, 2: 710.1186/1758294627.View ArticleGoogle Scholar
 EPI Suite. [http://www.epa.gov/oppt/exposure/pubs/episuite.htm]
 JRC QSAR Model Reporting Format (QMRF). [http://qsardb.jrc.ec.europa.eu/qmrf/]
 Spjuth O, Willighagen E, Guha R, Eklund M, Wikberg J: Towards interoperable and reproducible QSAR analyses: Exchange of datasets. J Cheminformatics. 2010, 2: 510.1186/1758294625.View ArticleGoogle Scholar
 Kohavi R: A study of crossvalidation and bootstrap for accuracy estimation and model selection. Proceedings of the 14th international joint conference on Artificial intelligence  Volume 2, IJCAI’95. 1995, San Francisco: Morgan Kaufmann Publishers Inc., 11371143.Google Scholar
 Izrailev S, Agrafiotis DK: A method for quantifying and visualizing the diversity of QSAR models. J Mol Graph Model. 2004, 22 (4): 275284. 10.1016/j.jmgm.2003.10.001.View ArticleGoogle Scholar
 Kuncheva L: Combining Pattern Classifiers: Methods and Algorithms. Wiley. 2004, : WileyView ArticleGoogle Scholar
 Todeschini R, Consonni V, Pavan M: A distance measure between models: a tool for similarity/diversity analysis of model populations. Chemometrics Intell Lab Syst. 2004, 70: 5561. 10.1016/j.chemolab.2003.10.003.View ArticleGoogle Scholar
 Makhtar M, Neagu D, Ridley M: Predictive model representation and comparison: Towards data and predictive models governance. Comput Intell (UKCI), 2010 UK Workshop on. 2010, 16.View ArticleGoogle Scholar
 Wojak A, Neagu D, Ridley M: Double MinScore (DMS) Algorithm for automated model selection in predictive toxicology. United Kingdom Workshop in Computational Intelligence (UKCI 2011). 2011, 150156.Google Scholar
 Ehrgott M: Multicriteria Optimization. 2005, New York, Inc.: SpringerVerlagGoogle Scholar
 Todeschini R, Consonni V: Handbook of Molecular Descriptors. 2000, Weinheim: WileyVCH Verlag GmbHView ArticleGoogle Scholar
 Flower DR: On the properties of bit stringbased measures of chemical similarity. J Chem Inf Comput Sci. 1998, 38 (3): 379386. 10.1021/ci970437z.View ArticleGoogle Scholar
 Jahnson M, Maggiora G: Concept of Application of Molecular Similarity. 1990, New York: John Wiley & SonsGoogle Scholar
 Soto A, Cecchini R, Vazquez G, Ponzoni I: Multiobjective feature selection in QSAR using a machine learning approach. QSAR & Comb Sci. 2009, 28 (1112): 15091523.View ArticleGoogle Scholar
 Tappeta RV, Renaud JE: Interactive multiobjective optimization procedure. AIAA J. 1999, 37: 881889. 10.2514/2.7537.View ArticleGoogle Scholar
 Willet P, Berdnard J, Downs G: Chemical Similarity Searching. J Chem Inf Comput Sci. 1998, 38: 983996. 10.1021/ci9800211.View ArticleGoogle Scholar
 Tetratox. [http://www.vet.utk.edu/TETRATOX]
 Inchemicotox. [http://www.inchemicotox.org/results/]
 Schultz TW: TETRATOX: Tetrahymena Pyriformis population growth impairment endpointa surrogate for fish lethality. Toxicol Methods. 1997, 7 (4): 289309. 10.1080/105172397243079.View ArticleGoogle Scholar
 Xue Y, Li H, Ung CY, Yap CW, Chen YZ: Classification of a diverse set of Tetrahymena pyriformis toxicity chemical compounds from molecular descriptors by statistical learning methods. Chem Res Toxicol. 2006, 19 (8): 10301039. 10.1021/tx0600550.View ArticleGoogle Scholar
 Ellison C, Cronin M, Madden J, Schultz T: Definition of the structural domain of the baseline nonpolar narcosis model for Tetrahymena pyriformis. SAR QSAR Environ Res. 2008, 19 (78): 751783. 10.1080/10629360802550366.View ArticleGoogle Scholar
 Enoch S, Cronin M, Schultz T, Madden J: An evaluation of global QSAR models for the prediction of the toxicity of phenols to Tetrahymena pyriformis. Chemosphere. 2008, 71 (7): 12251232. 10.1016/j.chemosphere.2007.12.011.View ArticleGoogle Scholar
 Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E: The Chemistry Development Kit (CDK): An opensource java library for Chemo and Bioinformatics. J Chem Inf Comput Sci. 2003, 43: 493500. 10.1021/ci025584y.View ArticleGoogle Scholar
 Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA data mining software: an update. SIGKDD Explor Newsl. 2009, 11: 1018. 10.1145/1656274.1656278.View ArticleGoogle Scholar
 The R Project for Statistical Computing. [http://www.rproject.org/]
 RCDK. [http://cran.rproject.org/web/packages/rcdk/index.html]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.