The system of self-consistent QSPR-models for refractive index of polymers

Quantitative structure–property/activity relationships (QSPRs/QSARs) are a component of modern natural science. The system of self-consistent models is a specific approach to build up QSPR/QSAR. A group of models of refractive index for different distributions in training and test sets is compared. This comparison is a basis to formulate the system of self-consistent models. The so-called index of ideality of correlation (IIC) has been used to improve the predictive potential of models of the refractive index of different polymers (n = 255). The predictive potential of the suggested models is high since the average value of the determination coefficient for the validation set is 0.885. In addition, the system of self-consistent models may be applied as a tool to assess the predictive potential of an arbitrary QSPR-approach. The statistical characteristics of the best model are the following: n = 57, R2 = 0.7764, RMSE = 0.039 (active training set) and n = 57, R2 = 0.9028, RMSE = 0.019 (validation set).

The QSPR/QSAR analysis of polymers is an intensively developing field of theoretical chemistry [8]. The search for new drugs can also be carried out with molecular simulation studies of polymers [9]. The design of relevant monomeric units can be a tool to select polymers with desired properties [10]. There are examples of the intersection of polymer science with the search for nanomaterials for medicine [11,12]. Photoelectrical and optic-electronic properties of polymers [13] are studied using molecular dynamics [14]. Random forest [15,16] and artificial neural networks [17] are also used for the QSPR/QSAR analysis of polymer systems.
The simplified molecular input-line entry system (SMILES) is a widely used format to represent the molecular structure [18]. Recently, high-refractive index polymers have captured considerable attention of the scientific community due to various applications, aimed to improve advanced optic-electronic devices [19].
The present study aims to build up and validate QSPR-models for the refractive index of polymers. The predictive potential of these models was assessed via the so-called system of self-consistent models of the refractive index of polymers. The index of ideality of correlation (IIC) can also serve as a criterion of the predictive potential. The IIC demonstrates a significant ability to improve the predictive potential of a QSPR-model when applied as an additional component of the Monte Carlo optimization aimed at modeling an arbitrary endpoint.
The QSPR analysis of the refractive index of polymers can be based on traditional physicochemical descriptors together with topological indices [19]. To this end, the so-called expert-in-the-loop approach [20], based on automatic feature selection (e.g., descriptors of aromaticity), can also serve as the basis to develop corresponding models.
The Monte Carlo technique has been used to develop models for the refractive index of polymers with polymer structures represented by SMILES of monomers [21]. Here, similar SMILES-based models are built up by applying the abovementioned IIC. In addition, a new concept of validation, the so-called system of self-consistent QSPR-models, is used for the QSPR analysis of the refractive index of different polymers. The CORAL software (http://www.insilico.eu/coral) is applied for building up the QSPR-models.

Dataset
The experimental data on the refractive index (RI) of different polymers were taken from the literature [20]. The total set of polymers examined here contains 255 structures of vinyl monomer polymers, (meth)acrylates, other acrylic polymers, vinyl esters, and polydienes, as well as heterochain polymers containing substituents of various nature.
Two duplicates were removed. The remaining set of polymers (n = 255) has been distributed randomly into four special subsets: the active training set (25%), passive training set (25%), calibration set (25%), and validation set (25%). Five such random distributions (splits) were prepared; Table 1 confirms that these five splits are not identical.
Each of the above subsets has its own task. The task of the active training set is to provide the correlation weights, i.e., those that give as large a correlation as possible between experimental and predicted endpoints for the active training set. The task of the passive training set is inspection: whether these correlation weights give a reasonable correlation coefficient for the similar compounds in the passive training set. The task of the calibration set is to detect overtraining. The task of the validation set is the final estimation of the predictive potential of the model.
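The random distribution into four subsets can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the polymers are represented by hypothetical integer identifiers, the helper name `split_four_ways` and the seeds are invented here, and the CORAL software may prepare its splits differently.

```python
import random

SUBSET_NAMES = ["active_training", "passive_training", "calibration", "validation"]

def split_four_ways(items, seed):
    """Randomly distribute items into four subsets of (almost) equal size."""
    rng = random.Random(seed)      # a fixed seed keeps each split reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    # Deal the shuffled items round-robin, so subset sizes differ by at most one.
    return {name: shuffled[i::4] for i, name in enumerate(SUBSET_NAMES)}

# Hypothetical identifiers standing in for the 255 polymers.
polymer_ids = list(range(255))

# Five random splits, one per (hypothetical) seed.
splits = [split_four_ways(polymer_ids, seed) for seed in range(1, 6)]

for number, split in enumerate(splits, start=1):
    sizes = {name: len(members) for name, members in split.items()}
    print(f"split {number}: {sizes}")
```

Comparing, for example, `set(splits[0]["validation"])` with `set(splits[1]["validation"])` gives a simple check that two splits are not identical.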

Optimal hybrid descriptor
The CORAL software can be tuned to build up models based solely on SMILES. In addition, the software can be tuned to build up models based on optimal descriptors that take into account both molecular features extracted from SMILES and molecular features represented by graph invariants. Here, the hybrid optimal descriptors [22] are used to build up models.
The optimal hybrid descriptor DCW(T, N) is applied for a predictive model of RI via the equation:

$$RI = C_0 + C_1 \times DCW(T, N) \qquad (1)$$

where $C_0$ and $C_1$ are regression coefficients. The descriptor of the correlation weights (DCW) is calculated as

$$DCW(T, N) = \sum CW(S_k) + \sum CW(APP_k) + \sum CW(EC1_k) + \sum CW(EC2_k) + \sum CW(EC3_k) + CW(C5) + CW(C6) \qquad (2)$$

where CW(x) is the correlation weight of the molecular feature x; $APP_k$ are the atom pair proportions [23]; $S_k$ are SMILES attributes; EC1, EC2, and EC3 are the Morgan extended connectivity of the first, second, and third orders, respectively [21,22]; and C5 and C6 are special codes of rings [24]. T is the threshold, i.e., an integer used to separate SMILES attributes into rare and non-rare [22]. The rare SMILES attributes have correlation weights equal to zero; i.e., these are not involved in building up a model. N is the number of epochs of the Monte Carlo optimization. Equation 2 needs the numerical data on the above correlation weights. The Monte Carlo optimization is a tool to calculate those correlation weights. Here two target functions for the Monte Carlo optimization are examined.
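How a descriptor of correlation weights turns into a predicted RI can be shown with a minimal Python sketch. It assumes that the descriptor is a plain sum of the correlation weights of the features found in a structure and that the model is linear in this descriptor. The feature names, correlation weights, and regression coefficients below are hypothetical values chosen for illustration, not values from the models discussed here.

```python
def dcw(features, correlation_weights):
    """Sum the correlation weights of all features found in one structure.

    Features missing from the table are treated as rare, so they contribute zero.
    """
    return sum(correlation_weights.get(feature, 0.0) for feature in features)

def predict_ri(features, correlation_weights, c0, c1):
    """Linear model: RI = C0 + C1 * DCW."""
    return c0 + c1 * dcw(features, correlation_weights)

# Hypothetical correlation weights (illustration only).
weights = {"C": 0.5, "=": 0.25, "Cl": 1.0}

# Hypothetical feature list of one monomer; "F" is absent from the table (rare).
features = ["C", "=", "C", "Cl", "F"]

descriptor = dcw(features, weights)                     # 0.5 + 0.25 + 0.5 + 1.0 + 0.0 = 2.25
ri = predict_ri(features, weights, c0=1.0, c1=0.25)     # 1.0 + 0.25 * 2.25 = 1.5625
print(descriptor, ri)
```

In this sketch a feature counts once per occurrence, which is why "C" contributes twice.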

The first target function (TF1)
The first target function is calculated from $R_A$ and $R_P$, the correlation coefficients between observed and predicted endpoints for the active training set and the passive training set, respectively.
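A minimal Python sketch of a target function built from the two correlation coefficients is given below. The form used, the sum of both coefficients minus a small penalty on their difference, follows what is commonly reported for CORAL models; both this form and the penalty weight of 0.1 are assumptions made here for illustration, and the function names are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def tf1(r_active, r_passive, gap_weight=0.1):
    """Assumed form: reward both correlations, penalise the gap between them.

    gap_weight=0.1 is an assumption, not a value taken from the text.
    """
    return r_active + r_passive - abs(r_active - r_passive) * gap_weight

# Hypothetical correlation coefficients for the two training sets.
print(tf1(0.9, 0.8))
```

The penalty term discourages solutions that fit the active training set well but perform poorly on the passive training set.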

The second target function (TF2)
The second target function additionally involves $IIC_C$, the index of ideality of correlation calculated with polymers of the calibration set [24]. The IIC is calculated from the observed and calculated values of the endpoint.
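The IIC can be sketched in Python as follows. The definition used is the one commonly given in the literature on the index of ideality of correlation: the correlation coefficient multiplied by the ratio of the smaller to the larger of the two mean absolute errors computed separately for negative and non-negative differences (observed minus calculated). This definition and the handling of the one-sided case are assumptions made for this sketch; the function names are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def iic(observed, calculated):
    """Assumed definition: r * min(MAE-, MAE+) / max(MAE-, MAE+)."""
    deltas = [o - c for o, c in zip(observed, calculated)]
    negative = [abs(d) for d in deltas if d < 0]
    non_negative = [abs(d) for d in deltas if d >= 0]
    if not negative or not non_negative:
        return 0.0  # assumption: errors all on one side are treated as non-ideal
    mae_negative = sum(negative) / len(negative)
    mae_non_negative = sum(non_negative) / len(non_negative)
    largest = max(mae_negative, mae_non_negative)
    smallest = min(mae_negative, mae_non_negative)
    return pearson_r(observed, calculated) * smallest / largest

# Hypothetical data: balanced errors keep IIC equal to r,
# unbalanced errors reduce it.
observed = [1.0, 2.0, 3.0, 4.0]
balanced = [1.5, 1.5, 3.5, 3.5]
unbalanced = [0.9, 2.4, 2.9, 4.4]
print(iic(observed, balanced), iic(observed, unbalanced))
```

Under this definition, two models with the same correlation coefficient can differ in IIC when one of them systematically over- or underestimates the endpoint.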

The system of self-consistent models
Each i-th model has an i-th validation set. As demonstrated in Table 1, the validation sets are far from identical. An important question is whether an arbitrary model can be used for an arbitrary validation set. If the answer is yes, these different models should be considered self-consistent.
The measure of self-consistency is the average and dispersion of the determination coefficient on different validation sets. The corresponding computational experiments are represented by the matrix:

$$
\begin{array}{c|cccc}
 & V_1 & V_2 & \cdots & V_5 \\
\hline
M_1 & Rv^2_{11} & Rv^2_{12} & \cdots & Rv^2_{15} \\
M_2 & Rv^2_{21} & Rv^2_{22} & \cdots & Rv^2_{25} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
M_5 & Rv^2_{51} & Rv^2_{52} & \cdots & Rv^2_{55}
\end{array} \qquad (10)
$$

where $M_i$ is the i-th model, $V_j$ is the list of polymers applied as the validation set in the case of the j-th split, and $Rv^2_{ij}$ is the determination coefficient observed for the j-th validation set if the i-th model is applied.
The main quality of an approach is the ability to provide good statistics for the external validation set. Consequently, different approaches should be assessed by the corresponding determination coefficient for the validation set. In the situation where five models are built up with different splits, the $Rv^2_{ij}$ estimation could be a clear basis to compare the suitability of different approaches (i.e., optimizations with target functions TF1 or TF2). Figure 1 gives histories of the Monte Carlo optimizations with different target functions.
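The matrix of cross-applied models can be sketched in Python as follows. The sketch assumes that each model is available as a function mapping a structure's descriptors to a predicted RI and that each validation set is a list of (descriptors, observed RI) pairs. The two toy models and all numerical data are hypothetical and serve only to show the bookkeeping.

```python
import math

def r_squared(observed, predicted):
    """Determination coefficient taken as the squared Pearson correlation."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return (cov / math.sqrt(var_o * var_p)) ** 2

def self_consistency_matrix(models, validation_sets):
    """Entry [i][j] is the R2 of the i-th model on the j-th validation set."""
    matrix = []
    for model in models:
        row = []
        for validation_set in validation_sets:
            observed = [ri for _, ri in validation_set]
            predicted = [model(descriptors) for descriptors, _ in validation_set]
            row.append(r_squared(observed, predicted))
        matrix.append(row)
    return matrix

# Hypothetical models, each using a different descriptor of the structure.
models = [
    lambda d: 1.0 + 0.1 * d[0],
    lambda d: 1.0 + 0.1 * d[1],
]

# Hypothetical validation sets: ((descriptor 1, descriptor 2), observed RI).
validation_sets = [
    [((1.0, 2.0), 1.40), ((2.0, 1.0), 1.50), ((3.0, 4.0), 1.60), ((4.0, 3.0), 1.70)],
    [((1.0, 1.0), 1.45), ((2.0, 3.0), 1.50), ((3.0, 2.0), 1.62), ((5.0, 5.0), 1.80)],
]

matrix = self_consistency_matrix(models, validation_sets)
for row in matrix:
    print([round(value, 3) for value in row])
```

The diagonal entries correspond to a model applied to its own validation set, and the off-diagonal entries to a model applied to the validation set of another split.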
To this end, five random splits were applied to build up models for the RI of different polymers using the abovementioned two target functions. These models are listed below. Table 2 contains the statistical characteristics of the models obtained by the Monte Carlo optimization with target functions TF1 and TF2.

Table 3. Systems of self-consistent models obtained by the TF1- and TF2-optimizations for splits 1-5. The average determination coefficient for the validation set obtained with the TF1-optimization is 0.849 ± 0.086; the value in the case of the TF2-optimization is 0.885 ± 0.061. $V_k$ is the validation set related to the k-th split; the preferable predictive potential of models (obtained using TF1 or TF2) is indicated in bold.

Results and discussion
One can see (Table 2) that the best predictive potential is observed for the TF2-optimization, since the determination coefficients for the validation sets, in this case, reach maximums in comparison with the TF1-optimization. It is to be noted that the TF1-optimization with a large number of epochs gives overtraining (Fig. 1), whereas the TF2-optimization improves the statistical quality for the calibration and validation sets, but to the detriment of the active/passive training sets.
The approaches based on different target functions can be compared using the average and dispersion of $Rv^2_{ij}$. It is clear that some of the SMILES in the situations $M_i : V_j \rightarrow Rv^2_{ij}$ (i ≠ j) are present in both the training and validation sets. However, the general conditions of building up for the groups of models must be quite different. Table 3 contains data on applying the TF1- and TF2-models to "stranger" validation sets (i.e., applying the i-th model to the j-th split, i ≠ j). One can see that there are four better TF1-models, whereas the number of better TF2-models is sixteen. Thus, a convenient measure of quality for an arbitrary QSPR-approach is demonstrated. The measure of predictive potential could be expressed by a classic scheme, i.e., the average determination coefficient for the validation set plus/minus its dispersion:

$$Quality = \overline{Rv^2} \pm \Delta Rv^2 \qquad (22)$$

Table 3 demonstrates that the above average value for the TF2-optimization is larger and the dispersion is smaller. The comparison of the models examined here with RI models described in the literature confirms that the predictive potential of the models suggested here is comparable with that of analogous approaches (Table 4).
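The comparison over "stranger" validation sets can be sketched in Python. The sketch assumes that the dispersion is taken as the sample standard deviation, which is an assumption made here, and it uses two small hypothetical 3 × 3 matrices instead of the values of Table 3. For five splits there are 20 off-diagonal pairs (i ≠ j), which agrees with the 4 + 16 better models counted above.

```python
from statistics import mean, stdev

def stranger_values(matrix):
    """Off-diagonal entries: the i-th model applied to the j-th split, i != j."""
    return [value
            for i, row in enumerate(matrix)
            for j, value in enumerate(row)
            if i != j]

def quality(matrix):
    """Average and dispersion (sample standard deviation, an assumption)."""
    values = stranger_values(matrix)
    return mean(values), stdev(values)

def count_better(matrix_a, matrix_b):
    """Count the stranger cells where approach A or approach B is better."""
    pairs = list(zip(stranger_values(matrix_a), stranger_values(matrix_b)))
    return sum(a > b for a, b in pairs), sum(b > a for a, b in pairs)

# Hypothetical determination coefficients (illustration only).
tf1_matrix = [[0.90, 0.80, 0.85],
              [0.75, 0.88, 0.82],
              [0.78, 0.84, 0.91]]
tf2_matrix = [[0.92, 0.86, 0.84],
              [0.83, 0.90, 0.88],
              [0.85, 0.87, 0.93]]

for name, matrix in (("TF1", tf1_matrix), ("TF2", tf2_matrix)):
    average, dispersion = quality(matrix)
    print(f"{name}: {average:.3f} +/- {dispersion:.3f}")

print(count_better(tf1_matrix, tf2_matrix))
```

A larger average together with a smaller dispersion indicates the preferable approach in this scheme.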
Given the results of several runs of the Monte Carlo optimization, one can identify molecular features, extracted from SMILES or from the molecular graph, that have solely positive correlation weights (promoters of increase of the refractive index) or solely negative correlation weights (promoters of decrease of the refractive index). Table 5 contains a collection of such molecular features observed for split 1. According to Table 5, promoters of increase are the presence of chlorine atoms (Cl……..) and six-member aromatic rings (C6…A…1.. and C6…A…2), whereas branching (C…(……) and the presence of fluorine atoms are promoters of decrease of the refractive index. It is to be noted that the influence of fluorine was mentioned in the work [9].
The Supplementary information section contains the technical details of the described computational experiments.

Conclusions
Applying the IIC improves the statistical characteristics of a model for the validation set, but to the detriment of the active/passive training sets. The system of self-consistent models makes it possible to assess different approaches with respect to the predictive potential of the corresponding models. In fact, the system of self-consistent models is a new tool for checking the predictive potential of QSPR-models. The statistical quality of the Monte Carlo models is comparable with that of models for the refractive index suggested in the literature (Table 4). The system of self-consistent models gives an additional measure of the predictive potential for QSPR: the average value of the determination coefficient observed for the abovementioned stranger validation sets.