Machine learning is applied within a vast range of disciplines such as economical forecasting, robotics, image analysis and risk assessment. Scientists using machine learning are not in general machine learning experts themselves and the algorithmic understanding for the various methods could be rather limited. Additionally, within many of these disciplines, low level programming knowledge is not abundant and scientist are often restricted to predefined machine learning protocols, wrapped in some graphical environment.
The new European chemical legislation, REACH, requires the chemical industry to provide information on, for example, ecotoxicity and human safety for chemicals used on the European market. However, the legislation does not support an increased usage of laboratory animals, but rather advocates the sharing of data and the development of alternative in vitro and in silico methods. Machine learning is routinely used in the prediction of chemical properties based on molecular structure information, so called quantitative structure activity relationships (QSARs). Hence, the ratification of the REACH legislation emphasizes the need to develop new methods for building and validation of QSAR models. The increased importance of QSAR modeling is manifested by the establishment of the OECD (Q)SAR project in 2004. The project aims to promote regulatory acceptance of QSAR approaches and it is in the process of establishing the "OECD Principals for the Validation for Regulatory Purpose of QSAR Models" .
Economical necessities and the concern for laboratory animals have driven the pharmaceutical industry in the same direction, replacing in vivo studies with in vitro experiments and in silico methods. Hence, QSAR modeling is becoming increasingly important also within drug discovery . Information related to Absorption, Distribution, Metabolism, Excretion and Toxicity (ADMET) is relevant to all pharmaceutical projects and the aim is often to build models intended to be applicable within vast ranges of chemical space. In developing such global models, as much of chemical diversity as possible is included in the training sets, often encompassing tens of thousands of compounds obtained from proprietary internal databases .
Several studies have shown that it is not possible to identify a single machine learning algorithm which will be the most accurate for all data sets, even restricted to QSAR applications . Hence, a data set specific choice of modeling algorithm and perhaps also the usage of combined model predictions, has the potential of increasing the model applicability beyond what is achievable with a single algorithm . Multivariate linear modeling algorithms, such as Partial Least Squares (PLS), are well established within the QSAR community. However, the often non-linear relationship between descriptors and biological responses is recognized and the application of non-linear machine learning algorithms for QSAR modeling is increasing [5, 6]. In general, non-linear machine learning algorithms represent a more complex optimization problem than linear methods and therefor require more training examples. Hence, the non-linear machine learning methods are of particular interest for global QSAR modeling, for which they need to be used in conjunction with thorough statistical validation and assessment of the applicability domain. In addition, non-linear methods are considered more difficult to use because of the tweaking of model hyper-parameters, such as the number of hidden neurons in an artificial neural network, often required to build accurate models.
Commercial mathematical packages, for example MATLAB, interfaces several machine learning algorithms, while Simca and TreeNet are commercial packages, well established within the QSAR community, thought developed around a single modeling algorithm. The Open Source statistical package R  has several third party modules with state-of-the-art machine learning methods. Orange  and Weka  are developed to be machine learning platforms providing multiple algorithms together with preprocessing and validation methods, while Knime  is a general pipe-lining system with several machine learning plug-ins.
Model hyper-parameter selection aims to find the parameters with the greatest generalization accuracy for a given data set by comparing the accuracy for different combinations of hyper-parameters. A few machine learning packages implement semi-automated model hyper-parameter selection. The Random Forest (RF) module of R monotonously increases the number of active variables until the out-of-bag (OOB) error no longer decreases. The libSVM  authors recommend a grid search to find the C and γ parameters with the best cross validation (CV) accuracy, while the grid search in the OpenCV  SVM implementation can optimize the C, γ, p, nu, coef f and degree parameters. Weka offers the possibility to calculate the CV accuracy varying a single parameter within a user defined interval. In addition, Weka implements a grid search restricted to two model parameters.
Despite the diversity of available machine learning packages, there is no package fulfilling all of the requirements on an Open Source state-of-the-art QSAR modeling platform. Such a system needs to include all tools necessary within a work flow encompassing database communication, data preprocessing, descriptor calculation and selection, model building and validation. It should be possible to build flexible machine learning applications in a graphical programming environment, as well as in a scripting mode. Because data sets often contain tens of thousands of compounds and the size of available data sets is expected to grow rapidly, the machine learning algorithms need to be highly numerically efficient. To exhaust the statistical aspects of model development, multiple and complementary machine learning algorithms should be made available. Complex modeling algorithms need to be customized for non-expert users and model hyper-parameters selected in an automated work flow to increase accuracy and efficiency in the model development process. Finally, the system should make it easy to develop models compliant with the OECD principals for validation of QSAR models.
AZOrange is a general Open Source platform for machine learning, however developed to meet the increasing demand for ADMET models in drug discovery in particular. AZOrange customizes several high performance state-of-the-art machine learning algorithms. The automated and generalized model hyper-parameter selection is a unique feature of AZOrange. The customization and the automated model hyper-parameter selection provide the tools necessary for automated model development, batch generation of models and the assessment of multiple model hypothesis. In addition, a graphical programming environment makes development of flexible high performance machine learning applications possible without scripting requirements.