- Research article
- Open Access
A machine learning correction for DFT non-covalent interactions based on the S22, S66 and X40 benchmark databases
- Ting Gao^{1},
- Hongzhi Li^{1},
- Wenze Li^{1},
- Lin Li^{1},
- Chao Fang^{1},
- Hui Li^{1},
- LiHong Hu^{1}Email author,
- Yinghua Lu^{1, 2}Email author and
- Zhong-Min Su^{2}
Received: 10 November 2015
Accepted: 11 April 2016
Published: 3 May 2016
Abstract
Background
Non-covalent interactions (NCIs) play critical roles in supramolecular chemistries; however, they are difficult to measure. Currently, reliable computational methods are being pursued to meet this challenge, but the accuracy of calculations based on low levels of theory is not satisfactory and calculations based on high levels of theory are often too costly. Accordingly, to reduce the cost and increase the accuracy of low-level theoretical calculations to describe NCIs, an efficient approach is proposed to correct NCI calculations based on the benchmark databases S22, S66 and X40 (Hobza in Acc Chem Rev 45: 663–672, 2012; Řezáč et al. in J Chem Theory Comput 8:4285, 2012).
Results
A novel type of NCI correction is presented for density functional theory (DFT) methods. In this approach, the general regression neural network machine learning method is used to perform the correction for DFT methods on the basis of DFT calculations. Various DFT methods, including M06-2X, B3LYP, B3LYP-D3, PBE, PBE-D3 and ωB97XD, with two small basis sets (i.e., 6-31G* and 6-31+G*) were investigated. Moreover, the conductor-like polarizable continuum model with two types of solvents (i.e., water and pentylamine, which mimics a protein environment with ε = 4.2) were considered in the DFT calculations. With the correction, the root mean square errors of all DFT calculations were improved by at least 70 %. Relative to CCSD(T)/CBS benchmark values (used as experimental NCI values because of its high accuracy), the mean absolute error of the best result was 0.33 kcal/mol, which is comparable to high-level ab initio methods or DFT methods with fairly large basis sets. Notably, this level of accuracy is achieved within a fraction of the time required by other methods. For all of the correction models based on various DFT approaches, the validation parameters according to OECD principles (i.e., the correlation coefficient R ^{2}, the predictive squared correlation coefficient q ^{2} and \(q_{cv}^{2}\) from cross-validation) were >0.92, which suggests that the correction model has good stability, robustness and predictive power.
Conclusions
Keywords
- Non-covalent interactions
- Density functional theory
- Machine learning correction
- Computational accuracy
- Feature selection
Background
Non-covalent interactions (NCIs) are crucial in bio-molecular structures, supramolecules and various chemical reactions [1–4]. Because of the inherent intricacy of NCIs, their measurement is challenging, especially for complex biological systems. Therefore, computational methods are important tools for exploring NCIs. However, the accurate calculation of NCIs is quite demanding because such rigor requires coupled cluster or MPn levels of theory with large basis sets [e.g., complete basis set limit CBS, aug-cc-pVDZ, 6-311+G(3df, 2p)] [5]. The CCSD(T) method with a complete basis set description (i.e., CCSD(T)/CBS), which involves taking single and double electron excitations iteratively and triple electron excitation perturbatively, can provide a highly accurate description of various types of noncovalent complexes. Although this approach is considered to be the golden standard of computational methodologies, it is impractical for molecules with more than 100 atoms [6, 7]. Therefore, it is challenging to obtain accurate NCIs for medium- or large-sized molecules with reasonable computer resources. Compared with covalent bonds, NCIs are weak, highly susceptible to the environment and diversified. Generally, NCIs are classified into four categories: electrostatic (e.g., hydrogen bonding and ion-pairing), π-effect (e.g., cation–π, π–π stacking), van der Waals forces (e.g., dispersion attractions, dipole–dipole and dipole–induced dipole interactions) and hydrophobic. Among NCIs, the magnitude of hydrogen bonding is larger than that of most other NCIs, and hydrogen bonding combines electrostatic, polarization, exchange-repulsion, charge transfer, and even dispersion. Detailed energy decomposition analyses have shown that every interaction between two molecular systems involves a combination of multiple interactions that makes the interaction strong enough to maintain the stability of the molecular structures [6]. Although the magnitude of each NCI (i.e., several kilocalories) is much smaller than that of covalent bond interactions (i.e., hundreds of kilocalories), a dramatic effect may be observed in ligand binding, transition states, and biological systems [6]. Some special types of dispersion interactions, such as C–H···π, N–H···π, and halogen bonding, usually must be investigated individually [8, 9]. The significance of certain NCIs in biological systems remains largely uninvestigated [3, 8, 10]. These reports indicate that NCIs are intrinsically complicated and difficult to calculate with high accuracy.
Quantum chemical methods have become a routine tool for studying molecular systems. Density functional theory (DFT) methods are the most often used quantum chemical methods because of their low cost and satisfactory performance. However, DFT methods are deficient with respect to the calculation of NCIs. Recently, there has been significant effort to incorporate dispersion interactions in DFT methods, and great progress has been made [11–18]. However, further improvement in accuracy for NCI calculations is desirable. Regarding the forms of the dispersion corrections, in general, there are three types of NCI-corrected DFT methods. The parameterized NCI correction methods are standard hybrid DFT functionals with parameters optimized using training sets of benchmark interaction energies. Methods of this type include M05-2X and M06-2X [14], where the adjustable parameters have been fit to a ‘training set’ of molecules. The accuracy of such parameterized methods usually depends on the benchmark databases; for this reason, the accuracy of these methods may not be reliable for molecules that are not in the benchmark database. Dispersion correction methods, such as the DFT-D series, are flexible because the dispersion term can be added to any DFT method. Thus, the addition of correction terms can improve the calculation of NCIs [17]. However, dispersion interactions comprise only a fraction of the total NCIs. The long-range corrected hybrid density functionals, such as the ωB97 series [15, 16, 18], can be included to improve the performance when calculating NCI systems. However, these methods can only partly solve the accuracy of long-distance interactions. Although the results obtained with these corrected functionals are usually improved for most applications, there is no systematic way of improving them, and high accuracy by low levels of theory or for large molecules (i.e., >100 atoms) is difficult to achieve.
Machine learning methods have been implemented to process large data sets in many fields. In the past decade, machine learning methods have been successfully applied in the field of quantum chemistry to improve the accuracy of quantum chemical calculations for large molecules. In 2003, we applied neural networks to improve the accuracy of DFT calculations for the first time. In that paper, neural networks were used to correct the errors associated with B3LYP/6-311+G(d,p) calculations for the heats of formation \((\Delta H_{f}^{\theta } )\) of 180 organic molecules. The RMSE of the calculated \(\Delta H_{f}^{\theta }\) were dramatically reduced from 21 to ~3 kcal/mol [19]. Thereafter, this strategy has been used to solve different types of accuracy problems for quantum chemical calculations, including absorption energies and Gibbs free energy [20–28]. In practical applications, the incorporation of quantum chemical methods and machine learning methods can be called a ‘GOLDEN’ combination because the advantages of both methods can be fully utilized; for example, machine learning methods can use the essential information captured by quantum chemical methods to reduce calculation errors caused by inherent approximations in the level of theory and limited basis sets. The essential feature of such a combination is to take the calculated properties of interest obtained by quantum chemical methods as the primary descriptor. Because the calculated values include all of the essential information of the property of interest, the systematic and random errors from various aspects of the calculations are easy to reduce. Thus, the accuracy of the quantum chemical calculations can be markedly improved, which enables low-level quantum chemical calculations to be performed with higher accuracy. Moreover, the use of machine learning methods is likely to uncover important factors that may affect the accuracy of the target properties. Therefore, this approach may reveal a new strategy for developing a correction term(s) for quantum chemical methods.
To improve the accuracy of DFT calculations for NCIs and investigate the factors that affect weak interactions, herein we propose a new correction for DFT NCI calculations through a combination of DFT and machine learning methods. In the following, the complex correction model is described according to the steps of model establishment. The model includes DFT calculations for the benchmark databases and the development of a stepwise machine learning correction model: data division, descriptor selection, regression and validation. Detailed discussions of the correction model and concluding remarks are presented following a description of the method.
Methods
DFT calculations
DFT calculations were first performed to obtain quantum molecular descriptors. The benchmark databases of NCIs developed by Hobza et al. offer an excellent opportunity for novel computational techniques to examine NCIs [29–31]. In our calculations, three typical benchmark databases with equilibrium structures have been used (i.e., S22, S66 and X40). The three databases include various NCI complexes with important bonding motifs, H-bonded, dispersion-dominated, mixed, and halogen bonded complexes. The databases also cover a wide range of sizes and interaction strengths of NCI complexes. The initial geometries of the database molecules were taken from published supplementary materials [29–31]. The geometry optimizations and energy calculations were performed at the same level of theory. The downloaded structures in the references were not used here because the correction is meant to make predictions for molecules that are newly discovered or studied.
Regarding reference NCI values, the NCIs obtained by the CCSD(T)/CBS level of theory are taken as the target or reference experimental values of NCIs for building the correction models. The reason is that CCSD(T)/CBS is considered the golden standard of computational methodologies and its associated NCIs are highly accurate. By this means, two obstacles for a machine learning model can be solved: experimental NCIs and expansion of the database. Therefore, the correction model can be further improved by easily adding more molecules in the databases with accurate NCIs determined by the CCSD(T)/CBS level of theory. In addition, because the highly accurate NCIs determined herein by CCSD(T)/CBS are taken as experimental values, they are not considered calculated values under certain computational conditions any longer. Accordingly, the DFT calculations in this study are not confined to calculations in vacuum. We note that adopting gas phase experimental values as the targets for solution phase DFT calculations is not appropriate. Including a solvent model is like introducing a systematic error to the DFT calculations when comparing with gas phase experimental values. Fortunately, such protocols do not affect the performance of the machine learning correction models because the systematic errors can be easily removed, which is also one of the most important advantages in combining machine learning methods with quantum chemical calculations. That is, the calculations expose trends in the properties, which are possibly more important than the accuracy of the descriptors. The advantage allows us to perform either a gas phase or liquid phase descriptor calculation for an experimental target. In our previous works, we obtained the same correction accuracy using different input accuracies [19, 28]. This study also illustrates that input descriptors with different levels of accuracy were corrected to the same level of accuracy. The DFT methods M06-2X, ωB97XD, B3LYP, B3LYP-D3, PBE and PBE-D3 were used to calculate NCIs. For the M06-2X and ωB97XD methods, solvent effects have been considered using the conductor-like polarizable continuum model (C-PCM) with two types of solvents (i.e., water and pentylamine, which, with an epsilon value of 4.2, was chosen to mimic a protein’s environment). For the M06-2X method, the diffuse basis set effect was also investigated by comparing the results using the 6-31G* and 6-31+G* basis sets, which, although relatively small, make the model practical for large complexes. The M06-2X and ωB97XD calculations were performed using the Gaussian 09 program package [32]. However, this program has not implemented the 6-31G* and 6-31+G* basis sets for Bromine (Br) or Iodine (I). Thus, the polarization ECP basis set LANL2DZDP, which can be used for most metallic elements, was used for these atoms. B3LYP, B3LYP-D3, PBE and PBE-D3 calculations with the pure basis set 6-31G* were performed using the ORCA 3.0 quantum chemical program [33].
Machine learning correction
The machine learning correction was constructed using a step-wise procedure: descriptor selections, data division, regression and validation. The detailed descriptions of each step are presented as follows. All values are normalized to [−1, 1] in the machine learning correction steps.
Data division
Partial least square (PLS) descriptor selection
Molecular descriptors represent the essential features of a molecule and can be considered its fingerprint. In a machine learning model, molecular descriptors can be the inputs of regression methods, and a quantitative structure activity/property relationship (QSA/PR) can be established between the inputs and output (targets/endpoints). Therefore, molecular descriptors markedly affect the quality of a regression model [36–38]. Usually, molecular descriptors can be obtained in various ways, including quantum chemical calculations, molecular mechanical calculations, and structure analyses. In our calculations, we sought to take full advantage of quantum chemical calculations while keeping the modeling as simple as possible. For this reason, only descriptors from quantum chemical calculations and constitutional descriptions of molecular structures were used to construct the model.
Screening of the molecular descriptors is an important step that is intended to avoid redundancy and noise of the extracted information. In this correction approach, PLS is used to select the most significant descriptors. PLS is a recently developed generalization of multiple linear regressions (MLR) and is a multivariate statistical data analysis method for modeling multiple variables. In addition to being a feature extraction method, it is also a regression model. This approach has become popular because it is capable of analyzing large amounts of data that are strongly correlated with noisy and large dimensional X-variables. It has also been found to be a very efficient data dimensionality reduction method [39]. Herein, PLS is used to screen the molecular descriptors; that is, the method selects the most significant descriptors from all of the available descriptors according to the PLS fitting coefficients.
GRNN regression modeling
To obtain a reliable and stable model, K-fold cross-validation is employed when training the network. In this approach, there are N samples in the training set, which are evenly divided into K groups. K − 1 groups are chosen as the training samples and the remaining sample is assigned as the validation sample. The network loops K times and the results of each cycle are compared. The best prediction accuracy of the input data sets is then selected to generate the GRNN model. The descriptor selection and regression modeling are fulfilled within the training set.
Model validation
To validate our models, we calculated validation parameters for our correction model according to the principles of the Organization for Economic Cooperation and Development (OECD) [41]. These parameters are the correlation coefficient R ^{2}, predictive squared correlation coefficient q ^{2} and \(q_{cv}^{2}\) obtained from cross-validation, mean absolute error (MAE) and root mean square error (RMSE), which represent the goodness-of-fit, robustness and predictive behavior of the model, respectively [42]. Generally, the fitting power in terms of R ^{2} is larger than the stability power in terms of \(q_{cv}^{2}\) (values of \(q_{cv}^{2}\) and q ^{2} larger than 0.5 are valid). If \(R^{2} - q_{cv}^{2} > 0.3,\) then this may indicate that the established model is over-fit [43]. All of these parameters have been calculated to evaluate the correction models.
Results and discussions
Databases
The mean values (kcal/mol) of CCSD(T)/CBS benchmark interactions and the number of four NCI-dominated molecular complexes
Types | Number | Mean |
---|---|---|
H-bonded complexes | 29 | −10.33 |
Dispersion complexes | 30 | −3.94 |
Mixed complexes | 26 | −3.70 |
Halogen complexes | 36 | −3.43 |
NCI calculations with DFT methods
The validation parameters of DFT and GRNN correction models (RMSE & MAE units: kcal/mol)
RMSE | MAE | DFT | GRNN | ||||||
---|---|---|---|---|---|---|---|---|---|
DFT | GRNN | DFT | GRNN | q^{2} | R^{2} | q^{2} | q _{cv} ^{2} | R^{2} | |
M062X/6-31G*^{(vac)} (DFT1) | 1.66 | 0.50 | 1.34 | 0.35 | 0.87 | 0.98 | 0.98 | 0.95 | 0.99 |
M062X/6-31G*^{a} (DFT2) | 1.79 | 0.52 | 1.13 | 0.37 | 0.85 | 0.93 | 0.98 | 0.92 | 0.99 |
M062X/6-31G*^{b} (DFT3) | 1.43 | 0.46 | 1.00 | 0.34 | 0.90 | 0.96 | 0.98 | 0.93 | 0.99 |
M062X/6-31+G*^{a} (DFT4) | 2.59 | 0.54 | 1.45 | 0.33 | 0.68 | 0.89 | 0.98 | 0.96 | 0.99 |
ωB97XD/6-31G*^{(vac)} (DFT5) | 1.77 | 0.47 | 1.55 | 0.35 | 0.85 | 0.98 | 0.98 | 0.96 | 0.99 |
ωB97XD/6-31G*^{a} (DFT6) | 1.74 | 0.54 | 1.22 | 0.38 | 0.86 | 0.93 | 0.98 | 0.93 | 0.99 |
ωB97XD/6-31G*^{b} (DFT7) | 1.46 | 0.46 | 1.17 | 0.34 | 0.90 | 0.95 | 0.98 | 0.94 | 0.99 |
B3LYP/6-31G*^{a} (DFT8) | 3.98 | 0.62 | 3.01 | 0.46 | 0.25 | 0.80 | 0.97 | 0.92 | 0.98 |
B3LYP-D3/6-31G*^{a} (DFT9) | 1.89 | 0.56 | 1.18 | 0.40 | 0.83 | 0.93 | 0.98 | 0.92 | 0.99 |
PBE/6-31G*^{a} (DFT10) | 3.15 | 0.62 | 2.33 | 0.46 | 0.53 | 0.83 | 0.97 | 0.92 | 0.98 |
PBE-D3/6-31G*^{a} (DFT11) | 1.96 | 0.59 | 1.33 | 0.45 | 0.82 | 0.91 | 0.97 | 0.92 | 0.98 |
The mean values of the NCIs in four types of complexes and the RMSEs of DFT calculations and GRNN corrections relative to the CCSD(T)/CBS benchmark NCIs (Unit: kcal/mol)
Methods | Mean (RMSE/RMSE1^{c}) | |||
---|---|---|---|---|
H-bonded | Dispersion | Mixed | Halogen | |
CCSD(T)/CBS | −10.33 (−/−) | −3.94 (−/−) | −3.70 (−/−) | −3.43 (−/−) |
M062X/6-31G^{*(vac)} (DFT1) | −12.23 (2.12/0.59) | −4.84 (1.11/0.45) | −4.86 (1.31/0.43) | −4.39 (1.85/0.53) |
M062X/6-31G^{*a} (DFT2) | −9.23 (3.13/0.54) | −3.84 (1.01/0.47) | −3.79 (0.65/0.53) | −3.81 (1.31/0.52) |
M062X/6-31G^{*b} (DFT3) | −9.96 (2.37/0.37) | −4.07 (0.78/0.47) | −4.05 (0.68/0.49) | −3.82 (1.22/0.49) |
M062X/6-31+G^{*a} (DFT4) | −7.24 (4.78/0.76) | −3.22 (1.25/0.45) | −2.80 (1.18/0.43) | −2.91 (1.33/0.47) |
ωB97XD/6-31G^{*(vac)} (DFT5) | −12.46 (2.20/0.42) | −5.37 (1.52/0.42) | −5.20 (1.58/0.51) | −3.98 (1.72/0.53) |
ωB97XD/6-31G^{*a} (DFT6) | −9.37 (2.88/0.50) | −4.30 (1.38/0.64) | −4.01 (0.74/0.41) | −3.43 (1.25/0.55) |
ωB97XD/6-31G^{*b} (DFT7) | −10.24 (2.06/0.34) | −4.61 (1.28/0.46) | −4.34 (0.86/0.40) | −3.53 (1.35/0.56) |
B3LYP/6-31G^{*a} (DFT8) | −6.76 (5.10/0.57) | 0.23 (5.06/0.64) | −0.85 (3.09/0.60) | −2.23 (2.04/0.65) |
B3LYP-D3/6-31G^{*a} (DFT9) | −8.94 (3.30/0.58) | −3.88 (1.32/0.42) | −3.74 (0.67/0.66) | −4.16 (1.24/0.57) |
PBE/6-31G^{*a} (DFT10) | −8.27 (3.94/0.59) | −0.68 (4.08/0.55) | −1.72 (2.27/0.79) | −3.22 (1.80/0.56) |
PBE-D3/6-31G^{*a} (DFT11) | −9.75 (2.99/0.59) | −3.54 (1.76/0.57) | −3.74 (0.70/0.58) | −4.45 (1.66/0.60) |
The RMSE (kcal/mol) of benchmark databases by DFT methods with respect to CCSD(T)/CBS benchmark interactions
Methods | RMSE | |||||
---|---|---|---|---|---|---|
S22 (DFT) | S22 (GRNN) | S66 (DFT) | S66 (GRNN) | X40 (DFT) | X40 (GRNN) | |
M062X/6-31G^{*(vac)} (DFT1) | 1.41 | 0.57 | 1.84 | 0.48 | 1.85 | 0.52 |
M062X/6-31G^{*a} (DFT2) | 2.55 | 0.58 | 1.67 | 0.48 | 1.31 | 0.52 |
M062X/6-31G^{*b} (DFT3) | 1.82 | 0.42 | 1.37 | 0.45 | 1.22 | 0.49 |
M062X/6-31+G^{*a} (DFT4) | 3.99 | 0.83 | 2.45 | 0.43 | 1.33 | 0.47 |
ωB97XD/6-31G^{*(vac)} (DFT5) | 1.68 | 0.30 | 1.46 | 0.43 | 1.71 | 0.53 |
ωB97XD/6-31G^{*a} (DFT6) | 2.36 | 0.60 | 1.70 | 0.50 | 1.25 | 0.55 |
ωB97XD/6-31G^{*b} (DFT7) | 1.56 | 0.28 | 1.46 | 0.43 | 1.35 | 0.56 |
B3LYP/6-31G^{*a} (DFT8) | 5.97 | 0.47 | 3.90 | 0.63 | 2.04 | 0.65 |
B3LYP-D3/6-31G^{*a} (DFT9) | 2.78 | 0.41 | 1.79 | 0.59 | 1.24 | 0.57 |
PBE/6-31G^{*a} (DFT10) | 4.66 | 0.57 | 3.07 | 0.67 | 1.80 | 0.56 |
PBE-D3/6-31G^{*a} (DFT11) | 2.68 | 0.45 | 1.79 | 0.61 | 1.66 | 0.60 |
Diffuse basis function
We note that DFT4 performs worse than DFT1–3, even though it uses a slightly larger basis set. In contrast to 6-31G*, the diffuse function present in 6-31+G* is intended to improve the calculation of NCIs. Diffuse functions play an important role in NCI calculations because they account for the distant electronic density of an atom. However, in this calculation, the diffuse function instead shows negative effects without any advantages. Indeed, the RMSE of DFT4 is 2.59 kcal/mol, which is 0.8 kcal/mol larger than that of DFT2. To clarify the role of the basis functions, ωB97XD/6-31+G* was also performed for both the S22 and S66 databases; notably, similar results were obtained, with RMSEs for M06-2X/6-31+G* and ωB97XD/6-31+G* of 2.86 and 2.91 kcal/mol, respectively. These results indicate that the use of one diffuse function does not benefit the calculation of NCIs for this basis set in solvents. Because the means of the correction for these two DFT methods are different, the reason for this basis set effect remains unclear. Further studies on basis set effects are ongoing.
Solvation effects
Because NCIs usually occur in aqueous biological systems, our calculations were mainly performed with a description for a solvent. Two solvent environments were considered: water and protein. The protein environment (with a dielectric constant ε of ~4.0) [44] is mimicked by pentylamine (ε = 4.2). In vacuum, DFT1 and DFT5 provide good results. In solvent, the results show that the NCI values are more accurate in the protein environment than in water. As shown in Table 2, the NCIs calculated by DFT3 and DFT7 possess the smallest RMSEs (<1.5 kcal/mol) among all of the DFT methods examined. This result indicates that solvent effects play a crucial role in NCI systems and the choice of solvent may impact the accuracy of the calculations. Our calculations demonstrate that an environment that accounts for the protein is recommended for NCI calculations in the corresponding level of theory, and the proper choice of a dielectric value can improve the accuracy of the DFT calculations [45].
Dominant interactions in the complexes
The mean values of the NCI references, the DFT calculations and the RMSE and MAE of the NCIs for four types of complexes calculated by various DFT methods are listed in Table 3. A comparison of the mean NCI values of the four types of dominant interaction complexes indicates that the mean NCI values of hydrogen bonding complexes are larger than the other three types of complexes because of its strong hydrogen bonding interaction, which provides greater stabilization to the complexes than the other NCIs. In the gas phase calculations (i.e., DFT1 and DFT5), all of the average interactions are larger than the reference values, which indicate that all of the interactions are overestimated. In the solvent phase, screening effects in water are very strong, and thus the results underestimate the hydrogen bonded and dispersion complexes. However, when the dielectric constant is decreased to 4.2 (i.e., the protein environment), the DFT calculations are clearly improved, with the best results obtained with DFT3 and DFT7. In the solvent phase, the NCIs of the hydrogen-bonded complexes are all underestimated by the DFT methods. The best estimate for the H-bonded complexes is obtained by DFT7 (i.e., ωB97XD/6-31G* with a protein environment). Except for the H-bonded complexes, estimates of the NCIs do not follow a consistent trend. The best results are obtained by DFT3 (i.e., M062X/6-31G* with a protein environment) for dispersion and halogen interactions and DFT2 (i.e., M062X/6-31G* with water) for mixed complexes. Comparing DFT methods with dispersion corrections, most of the RMSEs for the entire dataset fall between 20 and 50 % of the mean NCI values. Without the dispersion correction, DFT8 (i.e., B3LYP/6-31G* with water) and DFT10 (i.e., PBE/6-31G* with water) do not give reasonable results, although the latter method is better than the former for the four types of complexes.
Molecules in the benchmark databases
The RMSEs of the DFT methods and the GRNN corrections for the databases are listed in Table 4. The RMSEs of the best DFT calculations for S22, S66 and X40 are 1.56 kcal/mol by DFT7 and 1.37 and 1.22 kcal/mol by DFT3 (highlighted in italics), respectively. These results indicate the best performance of the DFT methods for the three databases are achieved by DFT3 and DFT7. Comparing these results in terms of solvents, a protein environment is more appropriate for estimating NCIs and yields better results in the corresponding level of theory. Based on the overall performance, we suggest that DFT3 and DFT7 are suitable for economic calculations for medium- and large-size systems.
GRNN correction model
Descriptor analyses
With the SPXY method, the entire database was divided into a training set (consisting of 91 molecules) and a test set (consisting of 30 molecules). Because the trends of various DFT-calculated NCIs are similar in terms of their correlation coefficients, we performed screening for only descriptors obtained by the M06-2X method. In the M06-2X calculations, the effects of both the basis set and solvents are considered. In total, there are 43 descriptors extracted from the quantum chemical calculations and constitutional descriptors, which are listed in Additional file 1: Table S1.
In Fig. 2, the red columns are the selected descriptors. For M06-2X/6-31G* either in water or the protein environment, DFT-calculated NCIs (DFT1-2), the number of valence electrons (N_{ve}), dipole moments (D), and the energy of the second lower unoccupied molecular orbital (E_{LUMO+1}) are selected. For M06-2X/6-31+G* with water, four selected descriptors for regression calculations are shown in Fig. 2c (DFT3, N_{ve}, ESE, Arrangement). However, two selected descriptors [i.e., the arrangement of monomers (Arrangement) and the electronic spatial extent (ESE)] are different from those used for the M06-2X/6-31G* methods. This difference may be ascribed to numerical differences of the descriptors from different basis set calculations. Thus, for M06-2X/6-31G* with different solvents, the selected descriptors are identical. Because the same basis set gives similar accuracy, the same selected descriptors for two basis sets with other DFT methods are adopted for the regression model.
GRNN correction
For GRNN, the network structure and connection weights between neurons are determined if the study samples are assigned. In the GRNN modeling, there is only one parameter that requires optimization: the smoothing factor, σ. The regression value of σ determines the generalizability of the network; when σ approaches zero, the output becomes very similar to the objective value, although the predictive ability might be poor. In practice, network training is the process of optimizing the smoothing parameter. It is necessary to select reasonable smoothing parameters. Smaller smooth parameters (σ) give rise to stronger network approximation processes. Similarly, higher values of the smoothing parameter result in a smoother network approximation process but will increase the validation error. In this study, we used a loop test to determine the smoothing parameter. During training, the range of σ values was set as [0.1, 2] with a step size of 0.1. The optimal GRNN model was constructed using the σ value with the smallest validation errors. Through training, the best smoothing parameters (σ) were obtained for the 6-31G* (0.2) and 6-31+G* (0.1) basis sets. Notably, σ is insensitive to the DFT method, but appears to be influenced by the basis sets that are employed. Although a GRNN model was trained for each DFT method, only two σ values (i.e., 0.2 and 0.1) were obtained for the 6-31G* and 6-31+G* basis sets.
As shown in the GRNN method section, y _{ i } denotes the benchmark NCI of the training set, X is the transposed matrices of input neurons (four selected descriptors) and X _{ i } is the input neuron corresponding to the ith pattern neuron, all of which are known inputs. Thus, in the correction model, there is an empirical parameter σ (0.2 and 0.1 for the 6-31G* and 6-31+G* basis sets, respectively) that is determined when the regression model is constructed. This correction term is easy to implement in quantum chemical programs, where it can be added as a subroutine after the quantum chemical calculations. Thus, using a low-cost DFT calculation to obtain molecular descriptors, NCIs by higher levels of theory can be achieved. Further, this approach could be applicable to a wider range of molecular systems than could be determined directly using higher levels of theory. The proposed correction model was generated in Matlab R2014b [49] with the GRNN code [50]. The Matlab code used for our correction model in Ref. [50] is presented in the Additional file 2.
The correction is obtained with this regression model. The overall corrected results presented in Table 2 show that the correction model improved all of the DFT results, irrespective of whether an inherent NCI correction was already present in the DFT functional. The GRNN correction RMSEs relative to the CCSD(T)/CBS benchmark NCIs were reduced from 1.43–3.98 to 0.46–0.62 kcal/mol, and the MAEs decreased from 1.00–3.01 to 0.33–0.46 kcal/mol. The best correction results were achieved by combinations of GRNN with both DFT3 and DFT7, which are based on the best DFT results. The correction for the methods without NCI corrections was significant; indeed, the RMSEs of DFT8 and DFT10 were reduced to 0.62 from 3.98 to 3.15 kcal/mol, respectively. All of the evaluation parameters of the GRNN model are larger than 0.92 (Table 2), indicating that the GRNN correction model is robust and possesses good predictability in terms of the principles outlined by the OECD. As shown in Table 3, for the different types of NCI-dominated complexes, the GRNN correction remarkably reduces the errors in various NCI systems. The best performance for each type of complex was 0.34 kcal/mol for H-bonded, 0.42 kcal/mol for dispersion, 0.40 kcal/mol for mixed and 0.47 kcal/mol for halogen complexes. Regarding the databases, the GRNN correction also performs well and is very stable (Table 4). The best performance for S22(0.28) and S66(0.43) on the basis of DFT7(ωB97XD/6-31G*) was similar to the results calculated with spin-component scaled MP2 SCS-MI-MP2/cc-PVTZ S22 (0.26 kcal/mol) and S66 (0.38 kcal/mol) reported in ref. 6, whereas the DFT methods used in this study required significantly less computational time than the MP2 methods.
Conclusions
A general NCI correction constructed by GRNN for DFT methods is proposed. NCI calculations have been a serious problem for DFT methods. Recent developments have improved these methods, although NCI calculations with DFT methods with small basis sets continue to possess large deviations. In this present work, a GRNN correction for DFT NCI calculations was applied to various DFT methods with or without inherent NCI corrections in the functional. The results show that with this new correction, all of the RMSEs of the tested DFT methods can be improved to chemical accuracy (i.e., 0.4–0.6 kcal/mol). Because the approach combines the strengths of both DFT and GRNN methods, it simultaneously achieves both efficiency and accuracy at a very low cost. The best accuracy obtained by GRNN based on ωB97XD/6-31G* is comparable with previous SCS-MI-MP2/cc-PVTZ calculations. In summary, the great advantages of this method are the following: (1) the efficiency and accuracy of the method is high, and it can be applied to large molecules with accuracy comparable to higher levels of theory; (2) the correction model does not strictly require accurate descriptors as inputs, provided that they correlate with properties in certain trends; and (3) in the GRNN correction, there is only one parameter that must be fit, and the obtained model is easy to implement in quantum chemical programs. Using the NCIs calculated by the CCSD(T)/CBS method as the target or reference experimental values for the correction model not only avoids the difficulty of finding experimental NCIs but also sets the stage for further improvements to the correction model by adding more molecules to the database using the results from CCSD(T)/CBS calculations. Moreover, the proposed state-of-the-art correction can be an alternative means for extending DFT methods to large systems with comparably high accuracy. Furthermore, we believe that this approach can also improve the accuracy of NCIs for other first-principle methods.
Declarations
Authors’ contributions
TG, HZL, WL and LL performed the calculations. TG, HZL and LHH carried out the data analyses. TG, LL and LHH wrote the paper. CF and HL wrote and helped with the program code. LHH, YL and ZMS conceived of the study, and helped to design computational experiments. All authors read and approved the final manuscript.
Acknowledgements
The authors gratefully acknowledge financial support from NSFC (21473025 and 21131001), the Science and Technology Development Planning of Jilin Province (20130522109JH and 20150204041GX), and the Education Department of Jilin Province (2015556 and 2014B045).
Competing interests
The authors declare that they have no competing interests.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
- Johnson ER, Keinan S, Mori-Sanchez P, Contreras-Garcia J, Cohen AJ, Yang W (2010) Revealing noncovalent interactions. J Am Chem Soc 132:6498–6506View ArticleGoogle Scholar
- Rodríguez A, Romero MJ, Fernández A, López-Torres M, Vázquez-García D, Naya L, Vila JM, Fernández JJ (2014) Dinuclear cyclometallated platinum(III) complexes. Relationship between molecular structure and crystal packing. Polyhedron 67:160–170View ArticleGoogle Scholar
- Yang L, Adam C, Nichol GS, Cockroft SL (2013) How much do van der Waals dispersion forces contribute to molecular recognition in solution? Nat Chem 5:1006–1010View ArticleGoogle Scholar
- Patil MP, Sunoj RB (2008) The role of noninnocent solvent molecules in organocatalyzed asymmetric Michael addition reactions. Chem Eur J 14:10472–10485View ArticleGoogle Scholar
- Sedlak R, Riley KE, Řezáč J, Pitoňák M, Hobza P (2013) MP2.5 and MP2.X: approaching CCSD(T) quality description of noncovalent interaction at the cost of a single CCSD iteration. Chem Phys Chem 14:698–707Google Scholar
- Hobza P, Müller-Dethlefs K (2009) Non-covalent interactions: theory and experiment. The Royal Society of Chemistry Press, CambridgeGoogle Scholar
- Hobza P (2012) Calculations on noncovalent interactions and databases of benchmark interaction energies. Acc Chem Rev 45:663–672View ArticleGoogle Scholar
- Adhikary R, Zimmermann J, Liu J, Forrest RP, Janicki TD, Dawson PE, Corcelli SA, Romesberg FE (2014) Evidence of an unusual N–H···N hydrogen bond in proteins. J Am Chem Soc 136:13474–13477View ArticleGoogle Scholar
- Doemer M, Travernelli I, Rothlisberger U (2013) Intricacies of describing weak interactions involving halogen atoms with density functional theory. J Chem Theory Comput 9:955–964View ArticleGoogle Scholar
- Chen W, Enck S, Price JL, Powers DL, Powers ET, Wong CH, Dyson HJ, Kelly JW (2013) Structural and energetic basis of carbohydrate–aromatic packing interactions in proteins. J Am Chem Soc 135:9877–9884View ArticleGoogle Scholar
- Corminboeuf C (2014) Minimizing density functional failures for non-covalent interactions beyond van der Waals complexes. Acc Chem Res 47:3217–3224View ArticleGoogle Scholar
- Riley KE, Pitonak M, Jurecka P, Hobza P (2010) Stabilization and structure calculations for noncovalent interactions in extended molecular systems based on wave function and density functional theories. Chem Rev 110:5023–5063View ArticleGoogle Scholar
- Cohen AJ, Mori-Sánchez P, Yang Q (2012) Challenge for density functional theory. Chem Rev 112:289–320View ArticleGoogle Scholar
- Zhao Y, Truhlar DG (2008) The M06 suite of density functionals for main group thermochemistry, thermochemical kinetics, noncovalent interactions, excited states, and transition elements: two new functionals and systematic testing of four M06-class functionals and 12 other functionals. Theor Chem Acc 120:215–241View ArticleGoogle Scholar
- Chai JD, Head-Gordon M (2008) Long-range corrected hybrid density functionals with damped atom–atom dispersion corrections. Phys Chem Chem Phys 10:6615–6620View ArticleGoogle Scholar
- Grimme S (2004) Accurate description of van der Waals complexes by density functional theory including empirical corrections. J Comput Chem 25:1463–1473View ArticleGoogle Scholar
- Grimme S, Ehrlich S, Goerigk L (2011) Effect of the damping function in dispersion corrected density functional theory. J Comput Chem 32:1456–1465View ArticleGoogle Scholar
- Chai JD, Head-Gordon M (2008) Systematic optimization of long-range corrected hybrid density functionals. J Chem Phys 128:084106View ArticleGoogle Scholar
- Hu LH, Wang XJ, Wong LH, Chen GH (2003) Combined first-principles calculation and neural-network correction approach for heat of formation. J Chem Phys 119:11501–11507View ArticleGoogle Scholar
- Wang XJ, Wong LH, Hu LH, Chan CY, Su ZM, Chen GH (2004) Improving the accuracy of density-functional theory calculation: the statistical correction approach. J Phys Chem A 108:8514–8525View ArticleGoogle Scholar
- Wang XJ, Hu LH, Wong LH, Chen GH (2004) A combined first-principles calculation and neural networks correction approach for evaluating Gibbs energy of formation. Mol Simul 30:9–15View ArticleGoogle Scholar
- Zheng X, Hu LH, Wang XJ, Chen GH (2004) A generalized exchange-correlation functional: the Neural-Networks approach. Chem Phys Lett 390:186–192View ArticleGoogle Scholar
- Sun J, Wu J, Song T, Hu LH, Shan K, Chen GH (2014) Alternative approach to chemical accuracy: a neural networks-based first-principles method for heat of formation of molecules made of H, C, N, O, F, S, and Cl. J Phys Chem A 118:9120–9131View ArticleGoogle Scholar
- Li HZ, Li L, Zhong ZY, Han Y, Hu LH, Lu YH (2013) An accurate and efficient method to predict Y–NO bond homolysis bond dissociation energies. Math Probl Eng 2013(7):831–842Google Scholar
- Wu JM, Xu X (2007) The X1 method for accurate and efficient prediction of heats of formation. J Chem Phys 127:214105View ArticleGoogle Scholar
- Gao T, Shi LL, Li HB, Zhao SS, Li H, Sun SL, Su ZM, Lu YH (2009) Improving the accuracy of low level quantum chemical calculation for absorption energies: the genetic algorithm and neural network approach. Phys Chem Chem Phys 11:5124–5129View ArticleGoogle Scholar
- Gao T, Sun SL, Shi LL, Li H, Li HZ, Su ZM, Lu YH (2009) An accurate density functional theory calculation for electronic excitation energies: the least-squares support vector machine. J Chem Phys 130:184104View ArticleGoogle Scholar
- Li HZ, Zhong ZY, Li L, Gao R, Cui JX, Gao T, Hu LH, Lu YH, Su ZM, Li H (2015) A cascaded QSAR model for efficient prediction of overall power conversion efficiency of all-organic dye sensitized solar cell. J Comput Chem 36:1036–1046View ArticleGoogle Scholar
- Jurecka P, Sponer J, Cerny J, Hobza P (2006) Benchmark database of accurate (MP2 and CCSD(T)) complete basis set limit) interaction energies of small model complexes, DNA base pairs, and amino acid pairs. Phys Chem Chem Phys 8:1985–1993View ArticleGoogle Scholar
- Řezáč J, Riley KE, Hobza P (2012) Benchmark calculations of noncovalent interactions of halogenated molecules. J Chem Theory Comput 8:4285–4292View ArticleGoogle Scholar
- Řezáč J, Riley KE, Hobza P (2011) S66: a well-balanced database of benchmark interaction energies relevant to biomolecular structures. J Chem Theory Comput 7:2427–2438View ArticleGoogle Scholar
- Frisch MJ, Trucks GW, Schlegel HB, Scuseria GE, Robb MA, Cheeseman JR, Scalmani G, Barone V, Mennucci B, Petersson GA, Nakatsuji H, Caricato M, Li X, Hratchian HP, Izmaylov AF, Bloino J, Zheng G, Sonnenberg JL, Hada M, Ehara M, Toyota K, Fukuda R, Hasegawa J, Ishida M, Nakajima T, Honda Y, Kitao O, Nakai H, Vreven T, Montgomery JA et al (2013) GAUSSIAN 09 (Revision D.01). Gaussian, Inc, Wallingford, CTGoogle Scholar
- Neese F (2012) The ORCA program system. WIREs Comput Mol Sci 2:73–78View ArticleGoogle Scholar
- Kennard RW, Stone LA (1969) Computer aided design of experiments. Technometrics 11:137–148View ArticleGoogle Scholar
- Galvao RKH, Araujo MCU, Jose GE, Pontes MJC, Silva EC, Saldanha TCB (2005) A method for calibration and validation subset partitioning. Talanta 67:736–740View ArticleGoogle Scholar
- Kombo DC, Tallapragada K, Jain R, Chewning J, Mazurov AA, Speake JD, Hauser TA, Toler S (2013) 3D Molecular descriptors important for clinical success. J Chem Inf Model 53:327–342View ArticleGoogle Scholar
- Wirshup AM, Contreras-Garcia J, Wipf P, Yang W, Beratan DN (2013) Stochastic voyages into uncharted chemical space produce a representative library of all possible drug-like compounds. J Am Chem Soc 135:7296–7303View ArticleGoogle Scholar
- Eklund M, Norinder U, Boyer S, Carlsson L (2014) Choosing feature selection and learning algorithms in QSAR. J Chem Inf Model 54:837–843View ArticleGoogle Scholar
- Wold S, Sjöström M, Eriksson L (2001) PLS-regression: a basic tool of chemometrics. Chemom Intell Lab Syst 58:109–130View ArticleGoogle Scholar
- Specht DF (1991) A general regression neural network. IEEE Trans Neural Netw 2:568–576View ArticleGoogle Scholar
- Gramatica P (2007) Principles of QSAR models validation: internal and external. QSAR Comb Sci 26:694–701View ArticleGoogle Scholar
- Schüürmann G, Ebert RU, Chen JW, Wang B, Kühne R (2008) External validation and prediction employing the predictive squared correlation coefficient—test set activity mean vs training set activity mean. J Chem Inf Model 48:2140–2145View ArticleGoogle Scholar
- Eriksson L, Jaworska J, Worth AP, Cronin MTD, McDowell RM, Gramatica P (2003) Methods for reliability and uncertainty assessment and for applicability evaluations of classification- and regression-based QSARs. Environ Health Perspect 111:1361–1375View ArticleGoogle Scholar
- Hu LH, Eliasson J, Heimdal J, Ryde U (2009) Do quantum mechanical energies calculated for small models of protein-active sites converge? J Phys Chem A 113:11793–11800View ArticleGoogle Scholar
- Klamt A, Moya C, Palomar J (2015) A comprehensive comparison of the IEFPCM and SS(V)PE continuum solvation methods with the COSMO approach. J Chem Theory Comput 11:4220–4225View ArticleGoogle Scholar
- Hu LH, Zhao Y, Wang F, Chen GH, Ma C, Phillips D (2007) Are adenine strands H-aggregates? J Phys Chem B 111:11812–11816View ArticleGoogle Scholar
- Wheeler SE, Bloom JWG (2014) Toward a more complete understanding of noncovalent interactions involving aromatic rings. J Phys Chem A 118:6133–6147View ArticleGoogle Scholar
- Wu P, Chaudret B, Hu X, Yang W (2013) Noncovalent interaction analysis in fluctuating environments. J Chem Theory Comput 9:2226–2234View ArticleGoogle Scholar
- Math Works (2014) Matlab R2014b Neural Network Toolbox User GuideGoogle Scholar
- Wang XC, Shi F, Yu L, Li Y (2013) Matlab Neural Network 43 case studies. Beihang University Press, Beijing(in Chinese) Google Scholar