A big data approach to the ultra-fast prediction of DFT-calculated bond energies
© Qu et al.; licensee Chemistry Central Ltd. 2013
Received: 18 April 2013
Accepted: 8 July 2013
Published: 12 July 2013
The rapid access to intrinsic physicochemical properties of molecules is highly desired for large scale chemical data mining explorations such as mass spectrum prediction in metabolomics, toxicity risk assessment and drug discovery. Large volumes of data are being produced by quantum chemistry calculations, which provide increasing accurate estimations of several properties, e.g. by Density Functional Theory (DFT), but are still too computationally expensive for those large scale uses. This work explores the possibility of using large amounts of data generated by DFT methods for thousands of molecular structures, extracting relevant molecular properties and applying machine learning (ML) algorithms to learn from the data. Once trained, these ML models can be applied to new structures to produce ultra-fast predictions. An approach is presented for homolytic bond dissociation energy (BDE).
Machine learning models were trained with a data set of >12,000 BDEs calculated by B3LYP/6-311++G(d,p)//DFTB. Descriptors were designed to encode atom types and connectivity in the 2D topological environment of the bonds. The best model, an Associative Neural Network (ASNN) based on 85 bond descriptors, was able to predict the BDE of 887 bonds in an independent test set (covering a range of 17.67–202.30 kcal/mol) with RMSD of 5.29 kcal/mol, mean absolute deviation of 3.35 kcal/mol, and R2 = 0.953. The predictions were compared with semi-empirical PM6 calculations, and were found to be superior for all types of bonds in the data set, except for O-H, N-H, and N-N bonds. The B3LYP/6-311++G(d,p)//DFTB calculations can approach the higher-level calculations B3LYP/6-311++G(3df,2p)//B3LYP/6-31G(d,p) with an RMSD of 3.04 kcal/mol, which is less than the RMSD of ASNN (against both DFT methods). An experimental web service for on-line prediction of BDEs is available at http://joao.airesdesousa.com/bde.
Knowledge could be automatically extracted by machine learning techniques from a data set of calculated BDEs, providing ultra-fast access to accurate estimations of DFT-calculated BDEs. This demonstrates how to extract value from large volumes of data currently being produced by quantum chemistry calculations at an increasing speed mostly without human intervention. In this way, high-level theoretical quantum calculations can be used in large-scale applications that otherwise would not afford the intrinsic computational cost.
KeywordsBDE Bond dissociation energy Neural network Random forest Machine learning Chemoinformatics DFT DFTB Big data
The rapid calculation of physicochemical properties of atoms, bonds and molecules is required to process thousands or millions of structures in data mining explorations, or to establish data-driven sound relationships between structure and observable properties. Quantum chemistry calculations based on ab initio and density functional theory (DFT) provide estimations of several properties with increasing accuracy, but with a high computational cost. Many different approximate methods have been developed, such as semi-empirical molecular orbital methods and the self-consistent-charge density-functional tight-binding (SCC-DFTB) method that are 2–3 orders of magnitude faster than DFT and Hartree-Fock using medium-sized basis sets, although with a lower accuracy . Of course, the possibility of applying a specific quantum chemistry method depends on the sizes and numbers of involved molecules, computational resources, as well as time/accuracy requirements.
Density functional theory (DFT) of electronic structure, using approximate exchange-correlation functionals, has enabled the successful application of quantum mechanics to a wide range of problems in chemistry at a fraction of the computational requirements of the traditional Hartree-Fock theory methods . The B3LYP hybrid functional, particularly, is arguably the most popular DFT functional, and has been widely recognized as a cost-effective method. However, DFT calculations are still too computationally expensive (and difficult to automate) for increasingly common data-mining tasks involving thousands/millions of structures, which shall be performed on single workstations or small clusters in time scales of up to a few hours. In fact, even semi-empirical methods are currently hardly affordable in such situations.
At the same time, DFT and ab initio calculations are continuously being carried out by many science professionals all over the world producing huge amounts of data. A Big Data scenario can be envisaged in which computational analytic techniques can extract innovative knowledge from the large volumes of data produced by these quantum calculations so that they can be predicted in new situations 5–6 orders of magnitude faster.
Here we present an implementation of this concept in our lab, for the estimation of bond dissociation energies (BDEs). A collection with thousands of molecular structures was retrieved from a public database, and energies were calculated for molecules and fragments by DFT, generating 93.2 GB of data. BDEs were calculated from these data for all non-ring bonds of the compounds, and incorporated into a data set of bonds represented by topological bond descriptors. The data set was used for training machine learning methods, such as Random Forests  and Associative Neural Networks,  with the goal of establishing models to predict the BDE from the bond descriptors.
BDE is a fundamental thermodynamic property which measures the strength of a chemical bond. It is one of the factors playing a decisive role in the assessment of chemical reactivity, with an impact in different fields, e.g., in determining anti-oxidant activity  or identifying the major possible metabolic sites of xenobiotics . The BDE has been used in computational mass spectrometry for the challenging task of identifying unknown compounds in the interpretation of metabolomics data . In order to identify a metabolite, its experimental mass spectrum can be compared to the spectra simulated for the structures of metabolites in large databases, to find the best match. A strategy to identify unknown compounds has consisted in the evaluation of a candidate by generating all its possible topological fragments in order to match the fragment mass with the measured peaks. Candidates are then scored from the number of fragments that explain peaks in the measured spectrum, and the likelihood of the fragmentations has been approached by crude estimations of bond dissociation energies .
Some methods have been proposed to predict the BDE, mainly restricted to small data sets of specific types of bonds. Cherkasov et al.  developed an additive empirical relationship that was able to predict the homolytic C-H bond dissociation energy within 3.75 kcal/mol for a data set of 79 molecules. The coefficient of determination R2 was 0.94. However, this relationship is only valid for molecules where resonance contributions and captodative stabilization are insignificant, and the energy range is from 76.4 kcal/mol to 106.7 kcal/mol. Xue et al.  developed a Quantitative Structure–property Relationship model of the O-H bond dissociation energy using 78 substituted phenols and Support Vector Machines. A RMSD of 0.79 kcal/mol was achieved for the test set with the dissociation energies ranging from 76.8 kcal/mol to 95.0 kcal/mol. Stanger et al.  developed a prediction scheme based on low-level quantum chemistry computation of the hybridization which yielded a correlation coefficient of 0.951 for 35 C-H BDEs. Przybylak et al.  developed a relation between the C-H bond dissociation energy of 43 ethers and spin density, which also requires quantum chemistry calculations. Feng et al.  constructed a homolytic C-H and N-H bond dissociation energy prediction model for strained hydrocarbons and amines, which is based on quantum chemical descriptors. Two separate models were constructed for hydrocarbon C-H bonds, and amine N-H bonds. Predictions for 89 C-H BDEs were achieved with a correlation coefficient of 0.927 and standard deviation of 2.7 kcal/mol for C-H bonds. The two regression statistical properties are 0.878 and 2.7 kcal/mol for N-H bonds, respectively. Santos et al.  studied the homolytic dissociation energies of O-H and S-H bonds in a set of di-substituted phenols and thiophenols by density functional theory at the B3LYP/6-311++G** level. A good agreement between B3LYP/6-311++G** energy and experimental values was observed. Correlations between the bond dissociation energy and Hammet parameters were established in the same work. The correlation coefficient was 0.842 and 0.949 for phenols and thiophenols, respectively.
High-level quantum mechanics theory can be used to directly calculate BDEs. However, as mentioned above, the computational cost is still too high to be used, e.g., in mass spectrum prediction or drug design, which require the ability of on-the-fly evaluation. High-level quantum mechanics calculations require hours to determine the dissociation energy of an individual bond in a small molecule containing 20 or 30 atoms. Unfortunately, the number of atoms of typical drug molecules is far more than 30. A large diverse data set of small molecules and a suitable quantum method are essential for our chemoinformatics data-driven approach, to generate a good quality training set with a large number of bond dissociation energies. The “Fragment-Like” subset of the ZINC database is a practical choice since it includes convenient low molecular weight molecules encompassing a diversity of structural patterns compatible with relevant applications [14, 15].
Accurate prediction of bond dissociation energies demands high-level post Hartree-Fork (HF) methods because the electron correlation effects play an important role in the bond breaking process. Coupled Cluster with Single and Double and perturbative Triple excitations (CCSD(T)) and Quadratic Configuration Interaction with Single and Double and perturbative Triple excitations (QCISD(T)) are good examples of such methods. They are able to generate very accurate energies with a sufficient large basis set. However, the choice of a computational method is a balance between accuracy and computational cost. Such methods scale as N7, where N measures the system size, limiting their application to fairly small species for routine use. Density Function Theory (DFT) offers another method to take account of electron correlation effects. DFT methods typically scale as N3 – N4 and are able to yield significantly more accurate results than Hartree-Fork (HF) theory. B3LYP is the most popular density functional in chemistry, which has been widely recognized as a cost-effective method and has been successfully applied to a lot of bond dissociation energy research work [2, 11–13]. The work here presented employs this method to calculate a database of bond dissociation energies.
Our database involves thousands of species (molecules and fragments). It is possible to calculate the energy for these species using B3LYP. Nevertheless, B3LYP is still too computationally expensive for the geometry optimization and frequency analysis of so many species. The density functional tight-binding (DFTB) is a semi-empirical approximate quantum chemical method derived from DFT by neglect, approximation and parametrization of interaction integrals . The self-consistent-charge density functional tight-binding (SCC-DFTB), which can be derived by a second order expansion of the DFT total energy, extends the DFTB methods to charge self-consistency. The third generation of DFTB methods (DFTB3) was established in 2011 by third order expansion in combination with other improvements, such as the description of coulomb interaction. DFTB3 improves transferability and overall accuracy for several properties. In particular, geometries are usually reproduced excellently . However, because DFTB3 empirical parameter is only available for elements C, H, O, N, S and P, and performance for phosphorous-containing molecules is often unsatisfactory, the current study was restricted to molecules only containing atoms of C, H, O, N, or S.
All the chemical structures in the current study were retrieved from the Fragment-Like database of the ZINC database [14, 15]. A subset with the Tanimoto cutoff level of 80% was used, and 1,000 molecules were randomly selected. This data set was randomly partitioned into a training set with 900 molecules (8,336 unique bonds) and a test set with 100 molecules (887 unique bonds). The data sets of molecules were used to build data sets of covalent bonds. In this work, bonds are the objects to be processed by machine learning methods in order to get predictions for their dissociation energies. Additional molecules were retrieved from the Fragment-Like subset to enrich the training set in underrepresented types of bonds (based on atom types of the bond, atom types of neighbors and bond orders). Globally charged molecules and molecules containing elements other than C, H, O, N, or S were discarded. After the enrichment, the training set was screened against itself for identical bonds, which were identified and included only once, finally yielding a training set with 12,834 unique bonds from molecules different from the test set. The 3D structures and BDEs used for training the models are available in the MDL SDFile format in Additional file 1.
The ChemAxon Calculator Plugins were used to generate 3D structures from SMILES strings, and to propose low energy conformers for structure property prediction and calculation, Marvin 5.8.2, 2012, ChemAxon (http://www.chemaxon.com)  was employed. For all the molecules in the current study, the lowest energy conformer was used as a starting point for geometry optimization. The geometries of all the species were optimized with the DFTB+ program  in combination with the Atomic Simulation Environment (ASE) package . The DFTB+ program was used to calculate the DFTB3 energy and gradient, while the Broyden-Flectcher-Goldfarb-Shanno (BFGS)  and the Fast Inertial Relaxation Engine (FIRE)  optimizers were used to update the coordinates. The optimized structures were confirmed to be real minima by vibrational frequency analysis (no imaginary frequency). For the radical species with unpaired electrons, spin polarization was enabled in the DFTB calculation.
Calculation of bond dissociation energies (BDEs) by quantum methods
To assess the reliability of the B3LYP/6-311++G**//DFTB3 energy, bond dissociation energies were also calculated at a higher theoretical level for the whole test set. In those experiments, the geometry optimization was performed at the B3LYP/6-31G** level, and the single point energy was evaluated at the B3LYP/6-311++G(3df, 2p) level.
For a comparison with faster quantum methods, bond dissociation energies were also calculated for the test set with the PM6 semi-empirical method . For this purpose, the MOPAC2009 program was used,  the geometry was fully relaxed and confirmed to be real minima by frequency analysis.
Both in the experiments at the B3LYP/6-31G** level and with the PM6 method, some molecules failed to converge in geometry optimization, or had an imaginary frequency, and were not used in the comparisons – 787 bonds were used.
Bond descriptors are required for the processing of bonds by machine learning techniques. Ultra-fast estimation of bond energies cannot rely on computationally expensive descriptors. Therefore, bond descriptors were designed from the outset that: 1) use a simple algorithm; 2) not rely on quantum chemistry calculations; 3) not rely on optimized 3D geometry so that they can be directly calculated from the connection table; 4) not use bond orders explicitly, thus avoiding different mesomers to end up with different descriptors. As the homolytic bond dissociation energy is independent of the bond orientation (A-B or B-A), an additional requirement was that bond descriptors be independent of the bond orientation. With all requirements considered, a set of descriptors was developed to encode a bond (the target bond) based on counts of atom types and pairs of atom types existing in the bond or in its neighborhood.
The software for the calculation of bond descriptors was written in the Java programming language, and relies on the Chemistry Development Kit (CDK) libraries . Therefore, in this implementation, definitions of aromaticity and π-systems are provided by the CDK CDKHueckelAromaticityDetector algorithm. The source code to calculate bond descriptors is available at http://sourceforge.net/projects/unlpredict.
Two kinds of atom types are used in this work. The first simply consists of classifying an atom according to its element (C, H, O, N, S). The second definition of atom type combines the element and the number of connected atoms. For example, the carbon atom in methane is C4, the carbon atom in ethylene is C3 and the carbon atom in acetylene is C2. There are 14 types defined in the current study: C2, C3, C4, H1, N1, N2, N3, N4, O1, O2, S1, S2, S3, S4. Hereafter, we will use the term “connection number atom type” to refer to this latter atom type system.
These are counts of specific atom types in a specific sphere. If the number of atom types is N, the point descriptors for a single sphere is an array of size N. This descriptor can be calculated for the different systems of atom types. Element point descriptors correspond to point descriptors based on element atom types; connection number point descriptors correspond to point descriptors based on connection number atom types. The descriptors can be restricted to subsets of the molecule. For example, restriction to aromatic atoms will result in aromatic point descriptors. Figure 1 shows several examples of bond descriptors.
These are counts of atom pairs of specific atom types in specific spheres at specific distances between them. Pair descriptors for a sphere are defined so that one atom of the pair belongs to the sphere and the other atom is in the same or in a lower sphere. Pair descriptors are specified for a distance, which is the number of bonds between the two atoms of the pair on the shortest possible path. Figure 1 illustrates element pair descriptors. For example, for bond 8–9, the pair descriptor for sphere 2, distance 2, atom types “C-C” is 4, which results from counting atom pairs 4–6, 4–8, 6–8 and 11–9. In this paper, we used pair descriptors with different combinations of spheres, distances and atom types.
Pair descriptors can also be calculated without atom types, i.e., counting all the pairs of specific spheres and distances regardless of the atom types – no-type pair descriptors. The main advantage of no-type pair descriptors is low dimensionality; when compared with a large atom type system, this reduction is particularly relevant.
Bond-breaking difference pair descriptors
To describe the bond breaking explicitly, and at the same time obtain descriptors independent of the bond orientation, descriptors were also defined as the difference between pair descriptors before and after breaking the target bond – bond-breaking difference pair descriptors. Breaking the bond generates two fragments, and pair descriptors after the cleavage derive from atom pairs residing on the same fragment.
The number of atoms or pairs can also be counted in the whole molecule without restriction to spheres of a target bond. These are molecular descriptors, which are the same for all the bonds in the same molecule. Examples of molecular point descriptors and molecular pair descriptors are shown in Figure 1.
Fragment point descriptors
These are point descriptors calculated for individual fragments after cleavage of the target bond. For each specific point descriptor (defined for a sphere and atom type) there are two values, one for each fragment. In order to be independent from the bond orientation, the two values are sorted. The main purpose of this descriptor was to provide information on the distribution of special functional structures, such as aromatic systems, or conjugated π systems, by restricting the atoms involved in the calculation of the descriptors.
Machine learning methods
The relationship between bond descriptors and DFT-calculated bond dissociation energies was explored with two different machine learning algorithms, Random Forests (RF)  and Associative Neural Networks (ASNN) .
Random Forests (RF) were employed as ensembles of unpruned regression trees created by using bootstrap samples of the training data. The best split at each node is defined among a randomly selected subset of descriptors. Prediction is made by an average of the individual regression trees in the forest. This is a high-dimensional nonparametric method that works well on large numbers of variables. The performance is internally assessed with the prediction error for the objects left out in the bootstrap procedure (out-of-bag estimation, OOB). Here, RFs were grown with the R program version 2.14.1,  using the randomForest library . The number of trees in the forest was set to 1,000, and the number of variables tested for each split was set to default (square root of the number of variables).
Associative Neural Networks (ASNNs) integrate an ensemble of Feed-Forward Neural Networks (FFNNs) with a memory of data. The ensemble consists of independently trained FFNNs, which contribute to a single prediction. The ASNN scheme is employed for composing a prediction of the bond dissociation energy from a) the outputs produced by the ensemble of NNs and b) the most similar cases in the memory (here, the training set). The ASNN program was kindly provided by Dr. Igor Tetko. It was used with the Levenberg-Marquardt algorithm to train fully connected FFNNs with an input layer (including a bias equal to 1), one hidden layer (also including a bias equal to 1), and one output neuron. The presence of a bias enables to shift the activation function upwards or downwards by an adjustable value. The number of hidden neurons was optimized based on the training set, and in the final experiments was set to 6. The logistic activation function was used and each input and output variable was linearly normalized between 0.1 and 0.9 on the basis of the training set. Prior to the training of each network, the program randomly divided the training set into a validation set and a reduced training set with approximately the same size. Full cross-validation of the training set was performed using the leave-one-out method (LOO). The training was stopped when there was no further improvement in the root mean square deviation (RMSD) for the validation set.
Results and discussion
Random forest prediction of bond dissociation energies
In a first experiment, random forests were trained with an extensive pool of 3,675 bond descriptors, to predict the bond dissociation energies calculated with B3LYP/6-311++G(d,p)//DFTB. All the descriptors were based on the connection number atom type system, and consisted in 1) point descriptors (including molecular point descriptors), 2) point descriptors of atoms in ring systems, 3) point descriptors of atoms in π systems, 4) point descriptors of atoms in aromatic systems, 5) pair descriptors, 6) bond-breaking difference pair descriptors. All the descriptors were calculated for 7 spheres (the maximum sphere radius is 6 bonds). In pair descriptors, interatomic distances were considered from 1 to 4 bonds. Bonds with identical descriptors were detected and included only once in the training set.
A random forest with 1,000 trees was able to predict the B3LYP bond dissociation energy with Mean Absolute Deviation (MAD) of 4.24 kcal/mol, and Root Mean Square Deviation (RMSD) of 6.87 kcal/mol. These results were obtained with the out-of-bag (OOB) validation on the training set. The bond dissociation energies in the training set cover a range between −9.19 kcal/mol and 226.88 kcal/mol. The minimum positive bond dissociation energy is 0.45 kcal/mol. Only 14 bonds obtained negative values, which are probably the result of numerical errors of the B3LYP energy calculation, and in this experiment they were included in the training set. All the 14 bonds are weak, and ten of them are N-N2 bonds of azides. A non-dynamical electronic correlation effect is likely to be important – but DFT is a single configuration quantum chemical method that only takes dynamical electronic correlation into account. Another possible reason is the insufficient description of the non-covalent interactions of B3LYP, such as hydrogen bonds, van der Waals interactions, or charge-transfer interactions. Four of the 14 structures contain simultaneously atoms with formal positive charge and atoms with formal negative charges, in which non-covalent interactions are thus expected to occur. On the other hand, it has been noted that B3LYP systematically underestimates BDEs  – the negative BDEs can be viewed as a particular case of underestimation.
Random Forest prediction of bond dissociation energies with different selections of descriptors obtained in the out-of-bag (OOB) validation procedure over the training set a
No. of descriptors
Connection Number (only point descriptors)
Simple Element (Selection 1)
Selection of connection number, element, and no-type (Selection 2)
Impact of individual groups of descriptors on random forest prediction of bond dissociation energies a
No. of descriptors
Selection 2 b
Selection 2 - CN point
Selection 2 - Element pair
Selection 2 - Fragment point
Selection 2 - Aromatic fragment point
Selection 2 - In-ring fragment point
Selection 2 - No-type pair
Selection 2 - No-type bond-breaking difference pair
Selection 2 - π fragment point
Selection 2 - Molecular element pair
Selection 2 - Molecular CN fragment point
Selection 3 c
Random forest prediction of BDEs with descriptors encoding different spheres around the target bond
Nr of spheres
No. of descriptors
Descriptors selected by random forest to predict bond dissociation energies
Group of descriptors
CN point desc.
Element pair desc.
Aromatic fragment point desc.
No-type pair desc.
π Total number desc.
Out-of-bag validation yielded RMSD of 7.09 kcal/mol, MAD of 4.46 kcal/mol, R2 = 0.952 and the maximum error was 56.29 kcal/mol. At the end, the model was tested with the independent test set consisting of bonds from 100 molecules. The RMSD was 6.04 kcal/mol, the MAD was 3.84 kcal/mol, R2 = 0.939 and the maximum error was 37.07 kcal/mol. The better results for the test set, comparing to the training set, may be due to the high diversity of the training set, which was designed to include reasonable numbers of underrepresented, hard-to-predict types of bonds.
Prediction of bond dissociation energies with Associative Neural Networks
Accuracy of dissociation energies predicted by ASNN, RF and calculated by PM6 (against B3LYP-calculated BDEs) for different types of bonds a,b
Comparison between BDEs predicted by ASNN and BDEs predicted as the average of BDEs for the bonds of the same type in the training set a
Correlations between BDEs calculated at different levels of theory and estimations by PM6 and ASNN a
A web service incorporating the new ultra-fast methods has been made available for online estimation of BDEs - http://joao.airesdesousa.com/bde. For operational reasons, it was more convenient to implement the web service based on an ensemble of FFNNs with the Weka package . A bootstrap aggregating ensemble (bagging) was formed with 75 neural networks containing 6 hidden neurons each. It was further improved by additive regression which consists of 5 iterations (bagging ensembles). It was confirmed that the quality of the predictions was essentially the same as with the ASNN.
Limitations of the current model
In order to build a fast data-driven method, the representation of molecular structures must be simplified. The main simplification was the definition of descriptors from the connection table regardless of any 3D structure. Trans and cis isomerism is also not accounted for. Furthermore, during the selection of descriptors, the maximum sphere around the target bond was reduced to 3. This occurred probably because most of the factors determining bond energies (up to a certain level of accuracy) are within 3 bonds of the target bond. However, such a reduction precludes the model from distinguishing bonds whose differences are only outside of the 3rd sphere. The following two examples illustrate some of these limitations.
Finally, the new data-driven models are expected to be limited by the limitations of the DFT method that was used to generated the training set. For example, Feng et al. reported that B3LYP systematically underestimates BDEs  by about 1–4 kcal/mol – the models are therefore assumed to inherit the same systematic deviation.
Knowledge could be automatically extracted from a data set of > 12,000 BDEs calculated with DFT methods by machine learning techniques such as Random Forests and Neural Networks. The models could be applied to an independent test set achieving a root mean square deviation of 5.29 kcal/mol, a mean absolute deviation of 3.35 kcal/mol, and R2 = 0.953. Similar quality of predictions was observed across different types of bonds (RMSD 1.94–7.58 kcal/mol) except for N-H, O-H, and N-N bonds. Predictions were particularly accurate for C-H bonds (RMSD 3.68 kcal/mol, MAD 2.20 kcal/mol). The deviations between the ASNN predictions and the DFT (B3LYP) values are quite close to reported deviations (for simpler molecules) between B3LYP calculations and experimental values (MAD 3–8 kcal/mol) [27, 29]. Even for experimental BDEs, uncertainties of 1–2 kcal/mol are rather common. Differently from quantum methods, the new machine learning models can provide ultra-fast estimations of BDEs. The ASNN predictions also exhibited better accuracy than PM6, except for O-H, N-H, and N-N bonds. A reason for the more problematic prediction of these bonds may reside in their frequent involvement in hydrogen bonding – determined by 3D through-space interactions not currently encoded by our descriptors.
Comparison of B3LYP/6-311++G(d,p)//DFTB calculations with B3LYP/6-311++G(3df,2p)//B3LYP/6-31G(d,p) revealed a smaller deviation than the observed error for the ASNN predictions, which confirms that the chosen method for building the database of theoretical calculations is a reasonable compromise between accuracy and computational cost.
This work combines chemoinformatics and theoretical chemistry methodologies utilizing the currently available computational power. We believe it demonstrates a way to use high-level theoretical quantum calculations in large-scale applications that otherwise would not afford the intrinsic computational cost.
Atomic simulation environment
Associative Neural Network
Becke 3-parameter Lee-Yang-Parr
Bond dissociation energy
Coupled cluster with single and double and perturbative triple excitations
Chemistry development kit
Density functional theory
Density functional based tight-binding
Feed-forward neural network
Fast inertial relaxation engine
Mean absolute deviation
Quadratic configuration interaction with single and double and perturbative triple excitations
Root mean square deviation
Self-consistent-charge density functional tight-binding
Zero point energy
This work was funded by Portuguese national funds via FCT – Fundação para a Ciência e a Tecnologia (Lisboa, Portugal) – under project PEst-C/EQB/LA0006/2011. DARSL acknowledges FCT for the postdoctoral research grant SFRH/BPD/63192/2009. We thank Dr. Igor Tetko (Helmholtz Zentrum München, Neuherberg, Germany) for providing the ASNN program.
- Gaus M, Cui Q, Elstner M: DFTB3: extension of the Self-Consistent-Charge Density-Functional Tight-Binding Method (SCC-DFTB). J Chem Theory Comput. 2011, 7: 931-948. 10.1021/ct100684s.View ArticleGoogle Scholar
- Cohen AJ, Mori-Sánchez P, Yang WT: Challenges for density functional theory. Chem Rev. 2012, 112: 289-320. 10.1021/cr200107z.View ArticleGoogle Scholar
- Breiman L: Random forests. Mach Learn. 2001, 45: 5-32. 10.1023/A:1010933404324.View ArticleGoogle Scholar
- Tetko IV: Neural network studies. 4. Introduction to associative neural networks. J Chem Inf Comput Sci. 2002, 42: 717-728. 10.1021/ci010379o.View ArticleGoogle Scholar
- Wright JS, Johnson ER, DiLabio GA: Predicting the activity of phenolic antioxidants: theoretical method, analysis of substituent effects, and application to major families of antioxidants. J Am Chem Soc. 2001, 123: 1173-1183. 10.1021/ja002455u.View ArticleGoogle Scholar
- Drew KLM, Reynisson J: The impact of carbon–hydrogen bond dissociation energies on the prediction of the cytochrome P450 mediated major metabolic site of drug-like compounds. Eur J Med Chem. 2012, 56: 48-55.View ArticleGoogle Scholar
- Hill AW, Mortishire-Smith RJ: Automated assignment of high-resolution collisionally activated dissociation mass spectra using a systematic bond disconnection approach. Rapid Commun Mass Spectrom. 2005, 19: 3111-3118. 10.1002/rcm.2177.View ArticleGoogle Scholar
- Cherkasov A, Jonsson M: A new method for estimation of homolytic C-H bond dissociation enthalpies. J Chem Inf Comput Sci. 2000, 40: 1222-1226. 10.1021/ci000387p.View ArticleGoogle Scholar
- Xue CX, Zhang RS, Liu HX, Yao XJ, Liu MC, Hu ZC, Fan BT: An accurate QSPR study of O-H bond dissociation energy in substituted phenols based on support vector machines. J Chem Inf Comput Sci. 2004, 44: 669-677. 10.1021/ci034248u.View ArticleGoogle Scholar
- Stanger A: A simple and intuitive description of C–H bond energies. Eur J Org Chem. 2007, 2007: 5717-5725. 10.1002/ejoc.200700419.View ArticleGoogle Scholar
- Przybylak KR, Cronin MTD: Correlation between bond dissociation energies and spin distribution. J Mol Struct. 2010, 955: 165-170. 10.1016/j.theochem.2010.06.012.View ArticleGoogle Scholar
- Feng Y, Liu L, Wang JT, Zhao SW, Guo QX: Homolytic C-H and N-H bond dissociation energies of strained organic compounds. J Org Chem. 2004, 69: 3129-3138. 10.1021/jo035306d.View ArticleGoogle Scholar
- dos Santos JVA, Newton AS, Bernardino R, Guedes RC: Substituent effects on O–H and S–H bond dissociation enthalpies of disubstituted phenols and thiophenols. Int J Quantum Chem. 2008, 108: 754-761.View ArticleGoogle Scholar
- Irwin JJ, Sterling T, Mysinger MM, Bolstad ES, Coleman RG: ZINC: a free tool to discover chemistry for biology. J Chem Inf Model. 2012, 52: 1757-1768. 10.1021/ci3001277.View ArticleGoogle Scholar
- Carr RAE, Congreve M, Murray CW, Rees DC: Fragment-based lead discovery: leads by design. Drug Discov Today. 2005, 10: 987-992. 10.1016/S1359-6446(05)03511-7.View ArticleGoogle Scholar
- ChemAxon. JChem. 5.8.2 [http://www.chemaxon.com] (accessed February 2012)
- Aradi B, Hourahine B, Frauenheim T: DFTB+, a sparse matrix-based implementation of the DFTB method. J Phys Chem A. 2007, 111: 5678-5684. 10.1021/jp070186p.View ArticleGoogle Scholar
- Bahn SR, Jacobsen KW: An object-oriented scripting interface to a legacy electronic structure code. Comput Sci Eng. 2002, 4: 56-66.View ArticleGoogle Scholar
- Nocedal J, Wright SJ: Numerical Optimization. 2006, New York: Springer, 2Google Scholar
- Bitzek E, Koskinen P, Gähler F, Moseler M, Gumbsch P: Structural relaxation made simple. Phys Rev Lett. 2006, 97: 170201-View ArticleGoogle Scholar
- Schmidt MW, Baldridge KK, Boatz JA, Elbert ST, Gordon MS, Jensen JJ, Koseki S, Matsunaga N, Nguyen KA, Su S, Windus TL, Dupuis M, Montgomery JA: General atomic and molecular electronic structure system. J Comput Chem. 1993, 14: 1347-1363. 10.1002/jcc.540141112. GAMESS Version 11 Aug 2011 (R1)View ArticleGoogle Scholar
- Stewart JJP: Optimization of parameters for semiempirical methods V: Modification of NDDO approximations and application to 70 elements. J Mol Model. 2007, 13: 1173-1213. 10.1007/s00894-007-0233-4.View ArticleGoogle Scholar
- Stewart JJP: MOPAC2009 Version 11.366L. [http://openmopac.net]
- Steinbeck C, Hoppe C, Kuhn S, Floris M, Guha R, Willighagen EL: Recent developments of the Chemistry Development Kit (CDK) - an open-source java library for chemo- and bioinformatics. Curr Pharm Des. 2006, 12: 2111-2120. 10.2174/138161206777585274.View ArticleGoogle Scholar
- R Development Core Team: R: A language and environment for statistical computing. 2011, Vienna, Austria: R Foundation for Statistical Computing, [http://www.R-project.org]Google Scholar
- Liaw A, Wiener A: Classification and regression by RandomForest. R News. 2002, 2: 18-22.Google Scholar
- Feng Y, Liu L, Wang JT, Huang H, Guo QX: Assessment of experimental bond dissociation energies using composite ab initio methods and evaluation of the performances of density functional methods in the calculation of bond dissociation energies. J Chem Inf Comput Sci. 2003, 43: 2005-2013. 10.1021/ci034033k.View ArticleGoogle Scholar
- Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA data mining software: an update. SIGKDD Explorations. 2009, 11: 10-18. 10.1145/1656274.1656278.View ArticleGoogle Scholar
- Zhang IY, Wu J, Luo Y, Xu X: Accurate bond dissociation enthalpies by using doubly hybrid XYG3 functional. J Comput Chem. 2011, 32: 1824-1838. 10.1002/jcc.21764.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.