Monte Carlo technique to study the adsorption affinity of azo dyes by applying new statistical criteria of the predictive potential

ABSTRACT Azo dyes are broadly used in different industries because of their chemical stability and ease of synthesis. However, these dyes are usually identified as critical environmental pollutants. Hence, a mathematical model for the adsorption affinity of azo dyes can help solve problems in medicine and ecology. Quantitative structure-property relationships for the adsorption affinity of azo dyes to a substrate (DAF, kJ/mol) were established using the Monte Carlo method by generating optimal SMILES-based descriptors. The index of ideality of correlation (IIC) and the correlation intensity index (CII) improved the model's predictive potential, especially when they were used simultaneously. The statistical quality of the best model on the validation set was characterized by n = 18, r² = 0.9468, and RMSE = 1.26 kJ/mol.


Introduction
Dyes are coloured compounds with affinity for a substrate to which they are applied. Dyes interact with substrates through several mechanisms, depending on the physical and chemical properties of the dye and the substrate. Therefore, the affinity and the nature of the interaction between a colourant and a substrate define the quality of a dye. Based on their origin, colourants can be classified as natural or synthetic. Both kinds can be applied to different substrates, but synthetic colourants dominate the market because of their availability and ease of application [1,2]. Notably, azo dyes are a large class of synthetic organic dyes that contain nitrogen as the azo group -N=N- as part of their molecular structures; more than half of the commercial dyes belong to this class. Azo dyes represent the largest production volume in dye chemistry today, and their relative importance may even increase in the future. They play a crucial role in the dye and printing market [2,3].
Azo dyes are broadly used in different industries (the food, pharmaceutical, cosmetic, textile, and leather industries) because of their chemical stability and ease of synthesis. However, these dyes are usually identified as critical environmental pollutants, and much attention has been paid to the degradation of azo dyes using biological systems. If they are systemically absorbed, azo dyes can be metabolized by azoreductases of the intestinal microflora, by liver cells, and by skin surface bacteria. This metabolism leads to aromatic amines that can be hazardous. Hence, applying azo dyes leads to significant ecological risks [1-4].
Thus, the production and application of synthetic colourants pose environmental challenges, such as pollution of water bodies and occupational health issues in humans [5,6]. The affinity of azo dyes to a substrate is a significant technological and ecological indicator [7].
Quantitative structure-property/activity relationships (QSPRs/QSARs) are a tool for developing models for different endpoints. QSPR can be applied to build models for DAF [8-13]. Such a model can be built with the Monte Carlo method [14]. A QSPR model obtained from a stochastic process should be considered a random event [15]. Nevertheless, the reliability of such a QSPR is extremely important.
The present study aims to assess the so-called index of ideality of correlation (IIC) [16,17] and the correlation intensity index (CII) [18,19] as tools to determine the reliability of QSPR for DAF of azo dyes.

Data
The experimental data on dye affinity to a substrate (DAF, kJ/mol) for 72 azo dyes were taken from the literature [8]. We randomly split these compounds into an active training set (≈25%), a passive training set (≈25%), a calibration set (≈25%), and a validation set (≈25%). Each set has its own task. The active training set is used to calculate correlation weights that make the correlation between experimental and predicted endpoints for this set as large as possible. The passive training set is used to check whether these correlation weights also give a good correlation coefficient for similar compounds that were not involved in the optimization. The calibration set serves to detect overtraining. The validation set provides the final estimate of the predictive potential of the model. Ten random splits are examined here.
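The split described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `split_four_ways`, the use of integer indices in place of the 72 dyes, and the choice of seeds are all assumptions.

```python
import random

SET_NAMES = ["active_training", "passive_training", "calibration", "validation"]

def split_four_ways(items, seed=0):
    """Shuffle the items and deal them into four subsets of roughly 25% each."""
    rng = random.Random(seed)      # a fixed seed makes a split reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    # Taking every fourth item gives four subsets of (nearly) equal size.
    return {name: shuffled[i::4] for i, name in enumerate(SET_NAMES)}

# Ten random splits of 72 compounds (indices stand in for the dyes;
# using one seed per split is an assumption made for illustration).
splits = [split_four_ways(range(72), seed=s) for s in range(10)]
```

With 72 compounds each subset receives 18 of them, which matches the n = 18 reported for the validation set.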

Optimal SMILES-based descriptor
The predictive model of the endpoint is based on the optimal SMILES-based descriptor DCW(T,N). T is an integer threshold that separates SMILES attributes into rare and non-rare ones. Only the non-rare attributes are used to build the model; the rare ones are not involved. N is the number of epochs of the optimization of the correlation weights.
S_k is a SMILES atom, i.e. one symbol of the SMILES line (e.g. '=', 'O') or a group of symbols that cannot be examined separately (e.g. 'Cu', '%11'). SS_k is a pair of SMILES atoms. CW(S_k) and CW(SS_k) are the correlation weights of these SMILES attributes.
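How such a descriptor might be evaluated is sketched below. The sketch assumes that DCW is the plain sum of the correlation weights of all SMILES atoms and of all pairs of neighbouring SMILES atoms, that attributes without a weight (rare or blocked ones) contribute zero, and that the endpoint is a linear function of DCW. The tokenizer covers only a small illustrative subset of multi-character SMILES atoms, and the names `smiles_atoms`, `dcw`, `predict_daf`, and the toy weights are hypothetical.

```python
import re

# Illustrative subset of multi-character SMILES atoms that must stay whole.
_TOKEN = re.compile(r"%\d{2}|Cl|Br|Cu|.")

def smiles_atoms(smiles):
    """Split a SMILES line into SMILES atoms (S_k)."""
    return _TOKEN.findall(smiles)

def dcw(smiles, cw):
    """Sum the correlation weights of single atoms and of neighbouring pairs.

    `cw` maps a 1-tuple (S_k) or a 2-tuple (SS_k) to its correlation weight.
    Attributes missing from `cw` are treated as rare and contribute 0.
    """
    atoms = smiles_atoms(smiles)
    singles = [(a,) for a in atoms]
    pairs = list(zip(atoms, atoms[1:]))   # assumption: SS_k = adjacent atoms
    return sum(cw.get(key, 0.0) for key in singles + pairs)

def predict_daf(smiles, cw, c0, c1):
    """Assumed linear model: endpoint = c0 + c1 * DCW."""
    return c0 + c1 * dcw(smiles, cw)

# Toy weights, invented for illustration only.
toy_cw = {("N",): 0.5, ("=",): -0.25, ("N", "="): 1.0}
print(dcw("N=Nc1ccccc1", toy_cw))  # 1.75
```

Tuple keys keep a two-character atom such as 'Cl' distinct from a pair of one-character atoms.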

The Monte Carlo optimization
Equation (2) requires numerical values of the correlation weights, and the Monte Carlo optimization is the tool used to calculate them. Four target functions for the Monte Carlo optimization are examined here. They are built from the following quantities. r_AT and r_PT are the correlation coefficients between observed and predicted endpoints for the active and passive training sets, respectively. IIC_C is the index of ideality of correlation [16,17] calculated with data on the calibration set; in its definition, Δ_k is the difference between the observed and predicted values of the endpoint for the k-th dye. The correlation intensity index (CII), like the IIC, was developed to improve the quality of the Monte Carlo optimization used to build up QSPR/QSAR models.
In the definition of the CII [18,19], r² is the coefficient of determination for a set that contains n substances, and r²_k is the coefficient of determination for the n-1 substances that remain after removing the k-th substance. Thus, if (r²_k - r²) is greater than zero, the k-th substance is an 'oppositionist' to the correlation between experimental and predicted values of the set. A small sum of 'protests' means a more 'intensive' correlation.
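The two indices can be sketched in code as follows. The sketch assumes the forms usually given for them in the cited literature: IIC is the correlation coefficient multiplied by the ratio of the smaller to the larger of the mean absolute errors computed separately over negative and non-negative Δ_k, and CII is one minus the sum of the positive differences (r²_k - r²). The handling of the edge case in which all Δ_k have the same sign is an assumption, as are the function names.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def iic(observed, predicted):
    """Index of ideality of correlation (assumed form, see lead-in)."""
    deltas = [o - p for o, p in zip(observed, predicted)]
    neg = [-d for d in deltas if d < 0]
    pos = [d for d in deltas if d >= 0]
    if not neg or not pos:
        return 0.0  # assumption: undefined ratio is reported as 0
    mae_neg = sum(neg) / len(neg)
    mae_pos = sum(pos) / len(pos)
    ratio = min(mae_neg, mae_pos) / max(mae_neg, mae_pos)
    return pearson_r(observed, predicted) * ratio

def cii(observed, predicted):
    """Correlation intensity index: 1 minus the sum of 'protests'."""
    observed, predicted = list(observed), list(predicted)
    r2 = pearson_r(observed, predicted) ** 2
    protests = 0.0
    for k in range(len(observed)):
        # r2_k: coefficient of determination without the k-th substance.
        obs_k = observed[:k] + observed[k + 1:]
        pred_k = predicted[:k] + predicted[k + 1:]
        r2_k = pearson_r(obs_k, pred_k) ** 2
        protests += max(r2_k - r2, 0.0)
    return 1.0 - protests
```

In this form, a correlation whose errors are evenly balanced between over- and under-prediction keeps IIC close to r, while a single substance whose removal sharply raises r² lowers CII.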

Comparison of target functions
A comparison of the determination coefficients on the validation set showed that the best target function is TF3, calculated with Equation (6).
A possible reason for the effectiveness of the target function that takes into account the IIC together with the CII lies in the significant difference between the estimates of correlation quality given by these two indices (Figure 2). The CII recognizes correlations even if they are non-linear or fuzzy. In contrast, the IIC reacts very strongly to asymmetric deviations of points (i.e. to the differences 'experiment - calculation').
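For orientation, the four target functions might be assembled as in the sketch below. It assumes a base function that rewards high correlation on both training sets and penalizes the gap between them, with the IIC and CII of the calibration set added as weighted terms. The penalty factor 0.1 and the weights `w_iic` and `w_cii` are placeholders and are not values taken from this study; the label 'TF0' for the base function is likewise an assumption.

```python
def target_functions(r_at, r_pt, iic_c, cii_c,
                     w_iic=0.5, w_cii=0.5, penalty=0.1):
    """Return the four target functions for one set of correlation weights.

    r_at, r_pt : correlation coefficients on the active/passive training sets
    iic_c, cii_c : IIC and CII calculated on the calibration set
    w_iic, w_cii, penalty : hypothetical coefficients (assumptions)
    """
    tf0 = r_at + r_pt - abs(r_at - r_pt) * penalty   # base function
    return {
        "TF0": tf0,
        "TF1": tf0 + w_iic * iic_c,                  # IIC only
        "TF2": tf0 + w_cii * cii_c,                  # CII only
        "TF3": tf0 + w_iic * iic_c + w_cii * cii_c,  # IIC and CII together
    }

print(target_functions(0.5, 0.5, 0.5, 0.25)["TF3"])  # 1.375
```

During the optimization, a candidate change of a correlation weight would be kept only if it increases the chosen target function.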

The best model
The best model according to the determination coefficient on the validation set is given by Equation (15). Table 1 lists the statistical characteristics of the model. Figure 3 contains the graphical representation of the best model. Table 2 lists the promoters of an increase or decrease of DAF. The presence of nitrogen atoms increases DAF; the number of rings from 1 to 4 increases DAF, but four rings with nitrogen and five with sulphur reduce it. However, to compare molecular structures, it is clearer to single out molecular features and compare their total contributions to the correlation. Figure 4 compares two molecules whose structures (and SMILES) are more than 90% identical. Such comparisons can be helpful in the molecular design of effective dyes.

Domain of applicability
The domain of applicability of the model calculated with Equation (15) is defined by the so-called statistical defects of SMILES attributes. In the definition of these defects, P(A_k), P'(A_k), and P''(A_k) are the probabilities of A_k in the active training set, passive training set, and calibration set, respectively; N(A_k), N'(A_k), and N''(A_k) are the frequencies of A_k in the active training set, passive training set, and calibration set, respectively. The statistical SMILES-defect (D_j) is calculated from the defects of the attributes of a SMILES, where NA is the number of non-blocked SMILES attributes in the SMILES.
A SMILES falls in the domain of applicability if its statistical defect D_j satisfies the accepted threshold condition. For the model calculated with Equation (15), six SMILES are outside the domain of applicability. However, none of them falls in the validation set.
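A sketch of how such a check could be coded is given below. The expressions are assumptions modelled on forms reported for this approach in the literature, not confirmed details of this study: the defect of an attribute is taken as the sum, over the three pairs of sets, of the absolute difference of its probabilities divided by the sum of its frequencies; the SMILES-defect is the sum over the attributes of the SMILES; and the threshold is twice a mean defect. The probability is taken as the frequency divided by the set size, and all function names are hypothetical.

```python
def attribute_defect(n_act, n_pas, n_cal, size_act, size_pas, size_cal):
    """Statistical defect of one SMILES attribute A_k (assumed form).

    n_*    : frequency of A_k in the active, passive, and calibration sets
    size_* : number of SMILES in each of these sets
    """
    p_act, p_pas, p_cal = n_act / size_act, n_pas / size_pas, n_cal / size_cal

    def term(p1, p2, n1, n2):
        # Assumption: a pair of sets in which A_k never occurs adds nothing.
        return abs(p1 - p2) / (n1 + n2) if (n1 + n2) > 0 else 0.0

    return (term(p_act, p_pas, n_act, n_pas)
            + term(p_act, p_cal, n_act, n_cal)
            + term(p_pas, p_cal, n_pas, n_cal))

def smiles_defect(attribute_defects):
    """D_j: sum of the defects of the non-blocked attributes of one SMILES."""
    return sum(attribute_defects)

def in_domain(d_j, mean_defect, factor=2.0):
    """Assumed criterion: D_j must be below `factor` times the mean defect."""
    return d_j < factor * mean_defect
```

An attribute that is equally frequent in all three sets has zero defect, so SMILES built only from such attributes lie well inside the domain.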

Comparison with models suggested in the literature
The average statistical characteristics for the validation set for eight random splits studied in Kumar and Kumar [8] Figure 1).
The Supplementary materials set out the technical details of the best model (the split into active training, passive training, calibration, and validation sets, the values of DCW(1,15), the experimental and calculated values of DAF, and the domain of applicability).

Conclusions
The index of ideality of correlation (TF1) and the correlation intensity index (TF2) improve the predictive potential of QSPR for DAF. The simultaneous use of these indices (TF3) is especially effective. The advantage of TF3 is demonstrated by the high average determination coefficient over the ten random splits and by the small dispersion of this value across the splits.

Data availability statement
The data used in this work and the models developed are freely available in the Supplementary materials section and at: http://www.insilico.eu/coral.