2.1 Quantum Chemistry
The initial Cartesian coordinates of 225 carbonyl compounds, including aldehydes, ketones, and amides, were obtained using the molecular modelling software Spartan'10 [10]. The molecules used in this study vary in size, ranging from 7 to 31 atoms. A complete list of compounds included in the training set can be found in Supplementary Materials (Table 1). Full geometry optimization (of default Spartan-generated conformers) and generation of the wavefunction files were carried out using Gaussian09 [11]. All the molecules were optimized at the B3LYP/6-31+G* level, and wavefunctions were obtained at the M05-2X/6-311++G** level of theory [12-16]. The wavefunctions obtained from these calculations were used to calculate the topological properties of the molecules, such as the electron density, the Laplacian of the electron density, Hessian eigenvalues (of the corresponding Hessian matrix of ρ or ∇²ρ), bond critical points, and critical points in ∇²ρ. The topological properties of the electron density at bond critical points were determined using AIMAll [17], whereas the topological properties at critical points of the Laplacian of the charge density were determined using DenProp [18]. The topological data used as descriptors in neural network training were extracted using a Python script to create the data sets. The types of critical points that are located and used as training data in this research are illustrated in Figures 1 and 2.
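The extraction step described above can be sketched as follows. This is a minimal illustration only: the whitespace-delimited column layout assumed here (distance, λ1, λ2, λ3, ρ, ∇²ρ) is hypothetical and is not the actual AIMAll or DenProp output format.

```python
# Sketch of the descriptor-extraction step: collect critical-point
# properties from a plain-text export into per-molecule feature rows.
# The six-column layout assumed here is illustrative only.
def parse_cp_table(text):
    """Parse whitespace-delimited critical-point rows into float tuples."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) == 6:  # skip headers and blank lines
            rows.append(tuple(float(f) for f in fields))
    return rows

# Illustrative values, not data from the study:
sample = """
1.0282 -1.140 -0.614 10.800 0.141 0.0294
0.9701  4.300  5.060 22.000 0.292 -1.0100
"""
descriptors = parse_cp_table(sample)
print(len(descriptors))  # 2
```

In practice, one such script run per molecule would assemble the rows for all 225 compounds into the training files.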
2.2 Data Sets and Quantum Atomics
Quantum Atomics refers to topological and/or integrated atomic data obtained via, or within the context of, the Quantum Theory of Atoms in Molecules (QTAIM). Numerous computational and experimental groups worldwide have been collecting such data for decades, without referring to it as such. It is perhaps a useful term for distinguishing such data from other computational chemistry and crystallographic parameters that are routinely used but lack the rigorous foundation of QTAIM. A second over-arching goal of this research is to explore the information content "quality" of various types of Quantum Atomic data for specific machine learning goals. For instance, if the goal is to predict chemical reactivity, should training data include topological properties of the charge density, or of the Laplacian of the charge density? Although we have not explored such questions here, future research could address whether more efficient and accurate machine learning can be achieved with specific integrated atomic properties that are defined uniquely within QTAIM, such as atomic energies, magnetic susceptibilities, electric polarizabilities, and multipole moments.
In this research, a total of nine separate data sets were used: three types of topological data (BCP, LCP, and combined) paired with three types of physicochemical data (C-13 NMR shifts, C=O stretching frequencies, and nucleophile interaction energies). Each data set contains critical point descriptors for all 225 molecules and was used to train artificial neural networks (ANNs) to predict C-13 chemical shifts, C=O stretching frequencies, or interaction energies with a model nucleophile (the fluoride ion). The nine data sets thus comprise three each of bond critical point (BCP), Laplacian critical point (LCP), and combined (both BCP and LCP) data. Each data set also contains a class label taken from experimental C-13 shifts, experimental C=O stretching frequencies, or theoretical interaction energy values. The experimental C-13 chemical shifts (ppm) and C=O stretching frequencies (cm-1) were collected from the Spectral Database for Organic Compounds (SDBS) [19]. All topological data in the training sets can be found in Supplementary Materials (Table 2). An example of the LCP portion of such a data set for 1-butanal is shown in Table 1.
Table 1. Sample Laplacian critical point (LCP) input data for 2-chloro-4-fluorobenzaldehyde (all values in au)

Distance from C nucleus | λ1     | λ2     | λ3     | ρ     | ∇²ρ
1.0282                  | -1.140 | -0.614 | 10.800 | 0.141 | 0.0294
1.0282                  | -1.140 | -0.614 | 10.800 | 0.141 | 0.0294
0.9701                  | 4.300  | 5.060  | 22.000 | 0.292 | -1.0100
0.9790                  | 6.480  | 6.900  | 18.800 | 0.302 | -1.2300
0.9811                  | 5.230  | 8.960  | 18.000 | 0.431 | -1.0700
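Each training instance is built from rows like those in Table 1 together with a class label. A minimal sketch of that assembly is shown below; the numeric values and the C-13 shift used as the label are illustrative placeholders, not data from the study.

```python
# Sketch: flatten a molecule's critical-point rows into one feature
# vector and append the target (class label), here a hypothetical
# experimental C-13 shift in ppm.
lcp_rows = [
    (1.0282, -1.140, -0.614, 10.800, 0.141, 0.0294),
    (0.9701, 4.300, 5.060, 22.000, 0.292, -1.0100),
]
c13_shift_ppm = 202.0  # placeholder label, not a measured value

features = [value for row in lcp_rows for value in row]
instance = features + [c13_shift_ppm]
print(len(instance))  # 13
```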
The interaction energy (ΔEinteraction) between a carbonyl compound and a fluoride ion (F-) in the nucleophilic addition reaction was calculated at the B3LYP/6-31+G* level with the following approach,

ΔEinteraction = ECC+F - ECC - EF    (Eqn. 1)
where ECC+F, ECC, and EF are the total energies of the carbanion complex, the carbonyl compound, and the fluoride ion nucleophile, respectively. Basis set superposition errors (BSSE) were not calculated, as doing so would have considerably increased the computational time while adding little value: the machine learning here is based on properties of the electron density, which are completely unaffected by BSSE.
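Eqn. 1 translates directly into code; the energies below are illustrative placeholders in hartree, not results from the study.

```python
# Direct translation of Eqn. 1: interaction energy from three total
# energies (all in hartree). Values are placeholders, not study results.
def interaction_energy(e_complex, e_carbonyl, e_fluoride):
    """dE_interaction = E_CC+F - E_CC - E_F (Eqn. 1)."""
    return e_complex - e_carbonyl - e_fluoride

# Illustrative values only:
de = interaction_energy(-332.95, -232.90, -99.95)
print(round(de, 2))  # -0.1
```

A negative ΔEinteraction indicates a stabilizing interaction between the fluoride and the carbonyl compound.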
2.3 Artificial Neural Network Model
Over the past few decades, artificial neural networks (ANNs) have had huge success in machine learning and data mining applications [20]. Recently, Handley and Popelier pioneered the use of QTAIM data (atomic multipole moments) in conjunction with machine learning to model the fluctuating polarizability of water molecules in molecular dynamics simulations [21]. ANNs are powerful computing tools that analyse information quantitatively by learning from training data; their important properties are the ability of a network to learn from its environment and to improve performance through that learning. A learning algorithm is a procedure in which learning rules are used to adjust the weights. An ANN consists of an input layer, one or more hidden layers, and an output layer. The input signal propagates layer by layer in the forward direction, and such networks are commonly called multilayer perceptrons (MLPs) [22,23]. The MLP with the back-propagation (BP) learning method is one of the most successfully used approaches in chemistry and drug design because of its well-defined, explicit set of equations for weight corrections [24]. BP is a supervised learning algorithm in which the network is trained on data for which the expected outputs are provided. Learning consists of a forward pass and a backward pass. In the forward pass, the input vector is applied to the input layer; the input values are modified by fixed weights, and their effect passes through the network layer by layer. The output produced by the network is then compared with the desired output to calculate an error signal. In the backward pass, this error signal is back-propagated to adjust the weights so that the actual output values move closer to the desired output values, according to the error-correction rule [22].
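The forward and backward passes described above can be sketched in a few lines of numpy. This is a generic one-hidden-layer regression MLP on toy data, not the WEKA model used in the study; the shapes, learning rate, and data are illustrative.

```python
# Minimal sketch of the forward/backward passes: one hidden layer with
# sigmoid activations, a linear output, and gradient-descent weight
# updates driven by the back-propagated error signal.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 samples, 3 input features, 1 regression target.
X = rng.normal(size=(4, 3))
y = rng.normal(size=(4, 1))

W1 = rng.normal(scale=0.5, size=(3, 9))  # input -> hidden (9 neurons)
W2 = rng.normal(scale=0.5, size=(9, 1))  # hidden -> output
lr = 0.1

def loss():
    h = sigmoid(X @ W1)
    return float(np.mean((h @ W2 - y) ** 2))

before = loss()
for _ in range(200):
    # Forward pass: propagate inputs through the network.
    h = sigmoid(X @ W1)
    out = h @ W2
    err = out - y
    # Backward pass: propagate the error signal and adjust the weights.
    grad_W2 = h.T @ err / len(X)
    grad_h = (err @ W2.T) * h * (1 - h)
    grad_W1 = X.T @ grad_h / len(X)
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1
after = loss()
print(after < before)  # True
```

Each iteration moves the network outputs closer to the targets, exactly the error-correction behavior described in the text.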
In this study, the machine learning package WEKA 3.6.13 was used for ANN model development [25]. Our models used one hidden layer with nine neurons when trained with bond critical point data, and one hidden layer with 15 neurons when trained with Laplacian critical point data. The predictive ability of a network was determined by validation: we used the leave-one-out cross-validation technique, in which the whole data set was divided into 225 pieces, with 224 pieces used for training and one piece held out for testing, cycling through every instance. The mean absolute percent errors were calculated by comparing predicted values with actual values.
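The leave-one-out scheme and the error metric can be sketched as follows. A trivial mean-of-training-targets predictor stands in for the WEKA MLP, and the shift values are illustrative placeholders, not the study's data.

```python
# Sketch of leave-one-out cross-validation with mean absolute percent
# error (MAPE): each instance is held out once while the rest "train"
# a stand-in model (here, the mean of the training targets).
def loocv_mape(targets):
    errors = []
    for i, actual in enumerate(targets):
        train = targets[:i] + targets[i + 1:]
        predicted = sum(train) / len(train)  # stand-in for the ANN
        errors.append(abs(predicted - actual) / abs(actual) * 100)
    return sum(errors) / len(errors)

# Illustrative C-13 shifts (ppm), not from the study:
shifts = [190.0, 200.0, 205.0, 210.0]
print(round(loocv_mape(shifts), 2))  # 4.18
```

With the study's data the same loop runs 225 times, once per molecule, and the 225 held-out predictions are averaged into the reported error.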