Quantitative structure-property relationship study of the water solubility of alcohols based on a new model combining the modified autocorrelation method and electro-topological indices


In this study, the water solubility of a set of 50 aliphatic alcohols was modeled within a Quantitative Structure-Property Relationship (QSPR) framework using two types of descriptors: descriptors based on the multifunctional autocorrelation method, which gives a general description of the whole molecule, and electro-topological descriptors, which combine the topological nature of an atom with its electronic state. The Modified Autocorrelation Method was used to describe the local environment of the hydroxyl group. For the statistical studies, Multiple Linear Regression, Artificial Neural Networks and Principal Component Analysis were used. The efficiency of the approach was evaluated through the predictive ability of the models, using the leave-p-out cross-validation method. The coefficient of determination and the error of the descriptor combination, calculated by multiple linear regression and by artificial neural networks respectively, were r = 0.99, s = 0.18 and r = 0.99, s = 0.32. To simplify the computation of the components, the molecules were coded with the SMILES system and stored as input files. The results showed that the solubility of aliphatic alcohols is dominated by the shape and branching of the molecule, and that the electro-topological descriptors also had a considerable effect on the model.


Introduction
The present study aims to estimate the water solubility of aliphatic alcohols, an important property of organic compounds [1]. Alcohols are toxic and therefore represent dangerous environmental pollutants; the first step in their polluting action is their dissolution in water. Various quantitative structure-property relationship (QSPR) models have been proposed for estimating the water solubility of aliphatic alcohols. In general, QSPR methods are based on 4
The autocorrelation method considers the molecule as a graph in which the nodes are atoms (hydrogen atoms are not taken into account) linked by bonds. Properties f of the atoms (surface, volume, electronegativity…) are defined on every node of the graph. The components P_k of the autocorrelation vector corresponding to a property f are defined by the relation below:
$$P_k = \sum_{(i,j):\, d_{ij} = k} f_i \, f_j$$
where P_k is the autocorrelation component corresponding to a topological distance of k bonds, d_ij is the topological distance (the smallest number of bonds) between atoms i and j, and f_i is the value of the specific property on atom i.
To facilitate the computation for large data sets, a Matlab program was developed that generates the connection matrices, the distance matrices and the variables of the MAM. The first program was used to calculate the descriptors of the global environment of the molecule.
A new component P_ik was specified to characterize the local environment of the atoms in a molecule. Following the principle of the Modified Autocorrelation Method (MAM) proposed by Nohair et al. [16], a carbon atom i is fixed, and P_ik is the sum of the products f(i)·f(j) over all pairs of carbon atoms i (fixed atom) and j separated by a topological distance k, with x = 1:
$$P_{ik} = \sum_{j:\, d_{ij} = k} f(i) \, f(j)$$
where d_ij is the topological distance (the smallest number of bonds) between atoms i and j.
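The two kinds of components described above can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' Matlab program: the function names, the choice of 2-propanol as the test molecule, and the property f = 1 on every atom (which reduces the components to pure path counts) are assumptions made for the example.

```python
from itertools import combinations


def global_autocorrelation(dist, f, k):
    """P_k: sum of f[i] * f[j] over unordered atom pairs at topological distance k."""
    n = len(dist)
    if k == 0:
        return sum(f[i] * f[i] for i in range(n))
    return sum(f[i] * f[j] for i, j in combinations(range(n), 2) if dist[i][j] == k)


def local_autocorrelation(dist, f, i, k):
    """P_ik: sum of f[i] * f[j] over atoms j at topological distance k from the fixed atom i."""
    return sum(f[i] * f[j] for j in range(len(dist)) if j != i and dist[i][j] == k)


# Hydrogen-suppressed graph of 2-propanol, atoms ordered C0, C1, C2, O3
# (bonds C0-C1, C1-C2, C1-O3); the distance matrix is written out by hand.
dist = [
    [0, 1, 2, 2],
    [1, 0, 1, 1],
    [2, 1, 0, 2],
    [2, 1, 2, 0],
]
f = [1.0, 1.0, 1.0, 1.0]  # assumed property: 1 on every atom (pure topology)

p1 = global_autocorrelation(dist, f, 1)        # pairs one bond apart
p2 = global_autocorrelation(dist, f, 2)        # pairs two bonds apart
p_c1_1 = local_autocorrelation(dist, f, 1, 1)  # first neighbours of the carbon bearing the OH
p_c0_2 = local_autocorrelation(dist, f, 0, 2)  # atoms two bonds away from a methyl carbon
```

With another atomic property (for example a volume or an electronegativity value per atom), only the list `f` changes; the distance matrix and the two functions stay the same.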
The second program was applied to study the local environment of the molecule, with the atom concerned held fixed. Fig. 1 shows the graph form of the 2,2-dimethyl-3-pentanol molecule. Several matrices, such as the connection matrix and the distance matrix (Fig. 2), can be determined from this form.
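As an illustration of how the connection and distance matrices follow from the molecular graph, the sketch below builds both for the hydrogen-suppressed graph of 2,2-dimethyl-3-pentanol. The atom numbering is an assumption made for this example and may differ from the numbering used in Fig. 1; the breadth-first search is one common way to obtain topological distances and is not necessarily what the Matlab program does.

```python
from collections import deque

# Heavy atoms of 2,2-dimethyl-3-pentanol, CH3-C(CH3)2-CH(OH)-CH2-CH3.
# Assumed numbering: 0-4 main chain, 5 and 6 the methyl branches, 7 the oxygen.
atoms = ["C", "C", "C", "C", "C", "C", "C", "O"]
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (1, 6), (2, 7)]


def connection_matrix(n_atoms, bond_list):
    """Symmetric 0/1 matrix with 1 where two atoms share a bond."""
    conn = [[0] * n_atoms for _ in range(n_atoms)]
    for a, b in bond_list:
        conn[a][b] = conn[b][a] = 1
    return conn


def distance_matrix(conn):
    """Topological distances (fewest bonds between two atoms) by breadth-first search."""
    n = len(conn)
    dist = [[-1] * n for _ in range(n)]
    for start in range(n):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if conn[u][v] and dist[start][v] < 0:
                    dist[start][v] = dist[start][u] + 1
                    queue.append(v)
    return dist


conn = connection_matrix(len(atoms), bonds)
dist = distance_matrix(conn)

for row in dist:
    print(row)
```

Row 2 of `dist` (the carbon bearing the hydroxyl group) gives the topological distance from that carbon to every other heavy atom, which is the information a local-environment descriptor needs.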

Experimental Data set
A series of 50 aliphatic alcohols reported by Amic et al. [18] was studied. The series covers short and long carbon chains of different categories (linear and branched), with primary, secondary and tertiary carbons.
The QSPR analysis was performed using the experimental solubility of the aliphatic alcohols in water (expressed as lnSol), which depends on two factors: the influence of the hydrophobic carbon chain and that of the hydrophilic hydroxyl group.

Molecular Descriptors Determination
In the data processing part, the descriptors were obtained by the MAM method (Table 3), and the electro-topological indices were generated with the MOE (Molecular Operating Environment) software (Table 8), in order to set up a model that combines these two types of descriptors for the QSPR modeling.

Statistical analysis
To determine the structure-property relationship by the four descriptors (Table 1) calculated by the [16]. In our model, we added components related to the description of the hydroxyl group, V ik (k = 1, 2), together with the Kier descriptors; a significant improvement was observed, with n = 50, s = 0.18 and r² = 0.99.

Neural network
Artificial neural networks (ANNs) are widely used in QSPR-type mathematical models that convert structural features into properties of chemical compounds. In this study, we demonstrate that the multifunctional autocorrelation and Kier methods are effective for modeling the water solubility of aliphatic alcohols. The main benefit of neural networks in QSPR models is their ability to provide a non-linear mapping between the descriptors and the property, and thus to describe a structure-property relationship. After choosing the significant descriptors, we used the components (V 1 , V 11 ,) as input data. We varied the number of hidden layers up to 7. We examined the change in the RMS error (root mean square error, that is, the square root of the average squared residual) for the total set and for both the training and the test sets. Training was stopped when there was no further improvement in the test-set RMS error. We also obtained the correlation coefficient between the observed and predicted data (s = 0.27; r = 0.99).
We compared three training algorithms to find out which one gives the best performance. Table 4 summarizes the results obtained:
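For reference, the RMS error used to monitor training can be computed as in the short sketch below. The observed and predicted values are made-up numbers chosen so that the arithmetic is easy to follow; they are not taken from the data set of this study.

```python
import math


def rms_error(observed, predicted):
    """Square root of the mean squared residual between observed and predicted values."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical lnSol values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]

rms = rms_error(observed, predicted)  # sqrt((0 + 0 + 0 + 4) / 4)
print(rms)
```

The same function can be applied separately to the training set, the test set and the total set to follow the three curves mentioned above.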

Levenberg-Marquardt
This algorithm typically requires more memory but less time. Training automatically stops when generalization stops improving, as indicated by an increase in the mean square error of the validation samples.
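The stopping rule described here (training ends once the validation error stops improving) can be sketched independently of any neural-network library. The function name, the `patience` parameter and the validation curve below are assumptions made for this illustration; the actual threshold used by the training software in this study is not stated in the text.

```python
def early_stopping_epoch(validation_mse, patience=3):
    """Return the epoch with the lowest validation MSE seen before training is stopped.

    Training stops after `patience` consecutive epochs without improvement.
    """
    best_epoch = 0
    best_value = float("inf")
    failures = 0
    for epoch, value in enumerate(validation_mse):
        if value < best_value:
            best_value, best_epoch, failures = value, epoch, 0
        else:
            failures += 1
            if failures >= patience:
                break
    return best_epoch


# Hypothetical validation curve: the error falls, then rises for three epochs.
curve = [0.90, 0.50, 0.30, 0.35, 0.40, 0.45, 0.20]

kept = early_stopping_epoch(curve, patience=3)
print(kept)
```

With a patience of 3 the run stops at epoch 5 and keeps the weights of epoch 2, so the later drop to 0.20 is never reached; a larger patience trades longer training for a chance to escape such plateaus.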

Bayesian Regularization
This algorithm typically requires more time, but can result in good generalization for difficult, small or noisy datasets. Training stops according to adaptive weight minimization (regularization).

Scaled Conjugate Gradient
This algorithm requires less memory. Training automatically stops when generalization stops improving, as indicated by an increase in the mean square error of the validation samples.
Based on these results (Table 4), it can be concluded that Bayesian Regularization has a better capability for short-term forecasting; in the long run, however, it loses accuracy and performs similarly to Levenberg-Marquardt.
There are various cross-validation techniques. In our case, the leave-20%-out technique was applied, using p observations as the validation set and the remaining observations as the training set. This is repeated over all the ways of splitting the original sample into a validation set (20%) and a training set (80%). The result obtained with the artificial neural network model is S cv = 0.70.
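The splitting scheme can be illustrated with a small generator. This is a sketch under stated assumptions: the function name is hypothetical, and the toy example uses 10 samples with p = 2 (20%), because enumerating every split of 50 molecules with p = 10 would require about 1.03 × 10^10 model fits.

```python
from itertools import combinations
from math import comb


def leave_p_out_splits(n_samples, p):
    """Yield (training, validation) index lists for every way of holding out p samples."""
    indices = range(n_samples)
    for validation in combinations(indices, p):
        held_out = set(validation)
        training = [i for i in indices if i not in held_out]
        yield training, list(validation)


# Toy case: 10 samples, 20% held out each time.
splits = list(leave_p_out_splits(10, 2))
n_toy = len(splits)

# Size of the exhaustive problem for 50 samples and p = 10.
n_full = comb(50, 10)

print(n_toy, n_full)
```

In practice a cross-validated error such as S cv is obtained by fitting the model on each training list, predicting the held-out samples, and pooling the residuals over the splits that are actually evaluated.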

Variables Selection
In this study, the variable subset was determined by forward stepwise regression [24,25,26], a method that is simple to define. We began with no variables in the model and selected the variable that had the highest R-squared. At each subsequent step, we selected the variable that increased R-squared the most, and we stopped adding variables when none of the remaining variables was significant. We confirmed our choice of relevant variables with the Akaike Information Criterion (AIC) [27][28][29][30]. This criterion is widely used for model selection, which is one of the most important aspects of statistical inference; AIC is also the basis of a paradigm for the foundations of statistics.
For least-squares regression analyses, AIC is given by [13]:
$$\mathrm{AIC} = n \log(\mathrm{RSS}) + 2K \qquad (5)$$
where RSS = (residual sum of squares)/n, n is the sample size and K is the number of model parameters.
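The selection procedure and Eq. (5) can be combined in a compact sketch. Several points are assumptions made for this illustration rather than details of the study: the stopping rule is a minimum gain in R-squared (a simplification standing in for a significance test), K counts the intercept plus the slopes, the logarithm is natural, and the descriptor names and values are synthetic.

```python
import math


def fit_rss(columns, y):
    """Residual sum of squares of an ordinary least-squares fit with intercept."""
    n = len(y)
    design = [[1.0] + [col[r] for col in columns] for r in range(n)]
    m = len(design[0])
    # Normal equations (X^T X) b = X^T y stored as an augmented matrix.
    aug = [[sum(design[r][i] * design[r][j] for r in range(n)) for j in range(m)]
           + [sum(design[r][i] * y[r] for r in range(n))] for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(m):
        pivot = max(range(c, m), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(m):
            if r != c:
                factor = aug[r][c] / aug[c][c]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[c])]
    beta = [aug[i][m] / aug[i][i] for i in range(m)]
    return sum((y[r] - sum(b * x for b, x in zip(beta, design[r]))) ** 2 for r in range(n))


def aic_least_squares(rss, n, n_params):
    """Eq. (5) with the raw residual sum of squares divided by n inside the logarithm."""
    return n * math.log(rss / n) + 2 * n_params


def forward_stepwise(candidates, y, min_gain=0.01):
    """Add, at each step, the variable giving the highest R-squared; stop on a small gain."""
    n = len(y)
    mean_y = sum(y) / n
    tss = sum((v - mean_y) ** 2 for v in y)
    selected, steps, best_r2 = [], [], 0.0
    while len(selected) < len(candidates):
        trials = {}
        for name, column in candidates.items():
            if name not in selected:
                trials[name] = fit_rss([candidates[s] for s in selected] + [column], y)
        name = min(trials, key=trials.get)  # lowest RSS means highest R-squared
        r2 = 1.0 - trials[name] / tss
        if r2 - best_r2 < min_gain:
            break
        selected.append(name)
        best_r2 = r2
        n_params = len(selected) + 1  # slopes plus intercept (assumed convention for K)
        steps.append((name, r2, aic_least_squares(trials[name], n, n_params)))
    return steps


# Synthetic data: y is close to 2*x1 + x2, and x3 is unrelated.
candidates = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    "x3": [4.0, 1.0, 3.0, 6.0, 2.0, 5.0],
}
y = [3.1, 3.9, 7.1, 7.9, 10.9, 12.1]

steps = forward_stepwise(candidates, y)
for name, r2, aic in steps:
    print(name, round(r2, 4), round(aic, 2))
```

On these synthetic data the procedure adds x1, then x2, and leaves x3 out; the AIC value is reported alongside R-squared at each step so that the two criteria can be compared for the same sequence of models.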
The model containing the MAM descriptors performed better than the previous model, and good statistical indicators were obtained when the two components V 11 and V 12 were added. An excellent MLR regression (n = 50; s = 0.22; r² = 0.98) was obtained with the three components V 1 , V 11 and V 12 . The correlation matrix (Table 2) shows that the components V 11 and V 12 are strongly correlated (-0.97), which means that they carry the same information. This was confirmed by the forward stepwise regression (Table 5), which selects the most significant term at each step and stops when an added variable no longer improves the model. The results of the Akaike information criterion are given in Table 6 (the higher the number, the better the fit; this was obtained from the statistical output). They show that the two criteria of the model V 1 , V 2 , V 11 and

Results of Electro-topological indices by Kier and Hall
We followed the same approach for the Kier and Hall descriptors (Table 8), using the same techniques as for the first type of descriptors. The statistical indices obtained were s = 0.42; r² = 0.96 by MLR, and s = 0.44; r² = 0.98 by ANN.

Principal Component Analysis (PCA)
All 11 descriptors encoding the 50 molecules were submitted to a Principal Component Analysis (PCA), and 11 components were obtained (Fig. 3). The objective is to transform the correlated variables into new variables that are uncorrelated with each other. This reduces the number of variables and makes the information less redundant. The transformation is defined in such a way that the first component has the largest possible variance, that is, it accounts for as much of the variability in the data as possible.
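The share of the total variability carried by each principal axis follows directly from the eigenvalues of the PCA. The sketch below shows that arithmetic only; the eigenvalues are hypothetical round numbers chosen for readability and are not those of the 11-descriptor analysis.

```python
def explained_variance(eigenvalues):
    """Return per-component and cumulative percentages of the total variance."""
    total = sum(eigenvalues)
    shares = [100.0 * v / total for v in sorted(eigenvalues, reverse=True)]
    cumulative = []
    running = 0.0
    for share in shares:
        running += share
        cumulative.append(running)
    return shares, cumulative


# Hypothetical eigenvalues, for illustration only.
eigenvalues = [5.0, 3.0, 1.5, 0.5]

shares, cumulative = explained_variance(eigenvalues)
for axis, (share, total) in enumerate(zip(shares, cumulative), start=1):
    print(f"F{axis}: {share:.2f}% (cumulative {total:.2f}%)")
```

The number of axes to retain is then a matter of reading off where the cumulative percentage reaches the level of information considered sufficient.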
The first three axes, F1, F2 and F3, contributed 55.74%, 23.87% and 8.42%, respectively, to the total variability, for a total information of 88.04%.
The combination of these two types of descriptors (Table 13) gives an excellent description of the water solubility of alcohols. The standard deviation, 0.18, is better than those obtained for the other two data sets. This statistical indicator is the most important one because it relates directly to the interests of the experimental scientist. It is also the first report for this particular data series that includes different forms of aliphatic alcohols, as described. The experimental and predicted values of the water solubility of the aliphatic alcohols are illustrated in Fig. 5 for Multiple Linear Regression and in Fig. 6 for the Artificial Neural Network, which show a strong correlation between the predicted and experimental data.

Conclusion
Multiple Linear Regression and Artificial Neural Networks were used to construct a quantitative structure-property relationship model of the water solubility of 50 aliphatic alcohols; this property is of interest because the toxic action of these compounds depends primarily on it. The two modeling methods were compared and both showed good predictive capability. It is concluded that the solubility seems to be largely determined by the components V 1 , V 2 and V 11 , which represent the size and the branching of the molecule, without neglecting the influence of the hydroxyl group (C-OH); in addition, including the Kier and Hall descriptors (Kier1, Kier2, Kier3) resulted in a good, meaningful model.

Availability of data and materials
The database described in this Data note can be freely and openly accessed. Please see Table 3, Table 8 and Ref. [18] for details of the data.

Declaration of competing interests
We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.
We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us.
We confirm that we have given due consideration to the protection of intellectual property associated with this work and that there are no impediments to publication, including the timing of publication, with respect to intellectual property. In so doing we confirm that we have followed the regulations of our institutions concerning intellectual property.
We understand that the Corresponding Author is the sole contact for the Editorial process (including Editorial Manager and direct communications with the office). She is responsible for communicating with the other authors about progress, submissions of revisions and final approval of proofs.
We confirm that we have provided a current, correct email address which is accessible by the Corresponding Author and which has been configured to accept email from: chaymae.jermouni@gmail.com