## Data

Experimental dye affinity data (DAF, kJ mol⁻¹) for 72 azo dyes were taken from the literature [5]. The compounds were randomly split into an active training set (≈ 25%), a passive training set (≈ 25%), a calibration set (≈ 25%), and a validation set (≈ 25%). Each set has its own task. The active training set is used to calculate the correlation weights that make the correlation between experimental and predicted endpoints for this set as large as possible. The passive training set is used to check whether these correlation weights also give a good correlation coefficient for similar compounds that were not involved in their calculation. The calibration set serves to detect overtraining. The validation set provides the final estimate of the predictive potential of the model. Ten random splits are examined here.
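The splitting scheme can be sketched in Python as follows. This is a minimal illustration, not the procedure actually used to prepare the splits: the function name `split_four_ways`, the integer dye identifiers, and the seeds 1 to 10 are assumptions made for the example.

```python
import random


def split_four_ways(compound_ids, seed):
    """Shuffle the compounds and cut them into four near-equal subsets."""
    rng = random.Random(seed)  # seeded so that each split is reproducible
    shuffled = list(compound_ids)
    rng.shuffle(shuffled)
    n = len(shuffled)
    # Cut points at 25%, 50% and 75% of the shuffled list.
    q1, q2, q3 = n // 4, n // 2, (3 * n) // 4
    return {
        "active_training": shuffled[:q1],
        "passive_training": shuffled[q1:q2],
        "calibration": shuffled[q2:q3],
        "validation": shuffled[q3:],
    }


# Ten random splits of 72 dyes, one per seed (hypothetical identifiers 1..72).
dye_ids = list(range(1, 73))
splits = [split_four_ways(dye_ids, seed) for seed in range(1, 11)]
```

With 72 compounds, each subset of every split contains 18 dyes, and no dye appears in more than one subset of the same split.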

## Optimal SMILES-based descriptor

The optimal SMILES-based descriptor *DCW(T,N)* is used in a predictive model of the endpoint as follows:

$$DAF = C_0 + C_1 \times DCW(T,N) \tag{1}$$

$$DCW(T,N) = \sum_{k} CW(S_k) + \sum_{k} CW(SS_k) \tag{2}$$

*T* is an integer threshold that separates SMILES attributes into rare and non-rare. Non-rare attributes are used to build the model; rare attributes are not involved. *N* is the number of epochs of the optimization of the correlation weights. \(S_k\) is a SMILES atom, i.e., one symbol of the SMILES line (e.g., ‘=’, ‘O’) or a group of symbols that cannot be examined separately (e.g., ‘Cu’, ‘%11’). \(SS_k\) is a pair of SMILES atoms. \(CW(S_k)\) and \(CW(SS_k)\) are the correlation weights of these SMILES attributes.
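A minimal Python sketch of how the descriptor could be evaluated for one SMILES string is given below. Several points are assumptions made for illustration only: the tokenizer is deliberately simplified (it keeps ‘Cl’, ‘Br’, ‘Cu’ and two-digit ring closures such as ‘%11’ together and treats every other character as one SMILES atom), \(SS_k\) is taken to be a pair of consecutive SMILES atoms, rare attributes are represented simply by their absence from the weight table, and the weights in `example_cw` are made-up numbers rather than fitted values.

```python
import re

# Simplified tokenizer (an assumption): a few two-letter elements and
# two-digit ring closures are kept whole; anything else is a single symbol.
SMILES_ATOM = re.compile(r"%\d{2}|Cl|Br|Cu|.")


def smiles_atoms(smiles):
    """Split a SMILES string into SMILES atoms S_k."""
    return SMILES_ATOM.findall(smiles)


def dcw(smiles, cw):
    """Sum the correlation weights of the S_k and SS_k attributes.

    `cw` maps a SMILES atom (str) or a pair of consecutive SMILES atoms
    (tuple of two str) to its correlation weight. Attributes missing from
    `cw` are treated as rare and contribute nothing.
    """
    atoms = smiles_atoms(smiles)
    pairs = list(zip(atoms, atoms[1:]))  # consecutive pairs, assumed SS_k
    return sum(cw.get(attr, 0.0) for attr in atoms + pairs)


# Hypothetical correlation weights, chosen only to make the arithmetic easy.
example_cw = {
    "C": 0.5, "=": -0.25, "O": 1.0,
    ("C", "="): 0.25, ("=", "O"): 0.5,
}
print(dcw("C=O", example_cw))  # 2.0
```

Keying pairs by tuples rather than concatenated strings keeps a pair such as (‘C’, ‘C’) distinct from any single SMILES atom.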

## The Monte Carlo optimization

Eq. 2 requires numerical values for the correlation weights, which are calculated by Monte Carlo optimization. Four target functions for the Monte Carlo optimization are examined here:

$$TF_0 = r_{AT} + r_{PT} - \left|r_{AT} - r_{PT}\right| \times 0.1 \tag{3}$$

$$TF_1 = TF_0 + IIC_C \times 0.5 \tag{4}$$

$$TF_2 = TF_0 + CII_C \times 0.3 \tag{5}$$

$$TF_3 = TF_0 + IIC_C \times 0.3 + CII_C \times 0.5 \tag{6}$$

\({r}_{AT}\) and \({r}_{PT}\) are correlation coefficients between observed and predicted endpoints for the active and passive training sets, respectively.
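The four target functions translate directly into code. The sketch below assumes that the correlation coefficients and the two calibration-set indices have already been computed and are passed in as plain numbers; the function and argument names are illustrative.

```python
def tf0(r_at, r_pt):
    """Eq. 3: reward both correlations, penalise their disagreement."""
    return r_at + r_pt - abs(r_at - r_pt) * 0.1


def tf1(r_at, r_pt, iic_c):
    """Eq. 4: TF0 plus the index of ideality of correlation."""
    return tf0(r_at, r_pt) + iic_c * 0.5


def tf2(r_at, r_pt, cii_c):
    """Eq. 5: TF0 plus the correlation intensity index."""
    return tf0(r_at, r_pt) + cii_c * 0.3


def tf3(r_at, r_pt, iic_c, cii_c):
    """Eq. 6: TF0 plus both indices."""
    return tf0(r_at, r_pt) + iic_c * 0.3 + cii_c * 0.5
```

The penalty term means that, for the same sum \(r_{AT} + r_{PT}\), a model with equal correlations on the two training sets scores higher than one that fits the active training set better than the passive one.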

\(IIC_C\) is the index of ideality of correlation [13, 14]. \(IIC_C\) is calculated with data on the calibration set as follows:

$$IIC_C = r_C \frac{\min\left({}^{-}MAE_C,\ {}^{+}MAE_C\right)}{\max\left({}^{-}MAE_C,\ {}^{+}MAE_C\right)} \tag{7}$$

$$\min(x,y) = \begin{cases} x, & \text{if } x < y \\ y, & \text{otherwise} \end{cases} \tag{8}$$

$$\max(x,y) = \begin{cases} x, & \text{if } x > y \\ y, & \text{otherwise} \end{cases} \tag{9}$$

$${}^{-}MAE_C = \frac{1}{{}^{-}N} \sum_{\Delta_k < 0} \left|\Delta_k\right|, \quad {}^{-}N \text{ is the number of } \Delta_k < 0 \tag{10}$$

$${}^{+}MAE_C = \frac{1}{{}^{+}N} \sum_{\Delta_k \ge 0} \left|\Delta_k\right|, \quad {}^{+}N \text{ is the number of } \Delta_k \ge 0 \tag{11}$$

$$\Delta_k = \text{observed}_k - \text{calculated}_k \tag{12}$$

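Here \(\text{observed}_k\) and \(\text{calculated}_k\) are the observed and calculated values of the endpoint for the *k*-th compound of the calibration set, and \(r_C\) is the correlation coefficient between them.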
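The calculation of the index of ideality of correlation can be sketched as follows. The helper `pearson_r` and the choice to raise an error when all residuals have the same sign (so that one of the two mean absolute errors is undefined) are assumptions of this example; the text does not specify how that case is handled.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def iic(observed, calculated):
    """Index of ideality of correlation for one set of compounds."""
    deltas = [o - c for o, c in zip(observed, calculated)]
    neg = [abs(d) for d in deltas if d < 0]   # residuals with delta < 0
    pos = [abs(d) for d in deltas if d >= 0]  # residuals with delta >= 0
    if not neg or not pos:
        # Assumption: the index is treated as undefined in this case.
        raise ValueError("residuals of both signs are required")
    mae_neg = sum(neg) / len(neg)
    mae_pos = sum(pos) / len(pos)
    ratio = min(mae_neg, mae_pos) / max(mae_neg, mae_pos)
    return pearson_r(observed, calculated) * ratio
```

When the mean absolute errors of the negative and non-negative residuals are equal, the ratio is 1 and the index reduces to the correlation coefficient; any imbalance between the two groups lowers it.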

The correlation intensity index (*CII*), similarly to the above *IIC*, was developed to improve the quality of the Monte Carlo optimization used to build up QSPR/QSAR models.

*CII* is calculated as follows [15, 16]:

$$CII_C = 1 - \sum_{k} Protest_k \tag{13}$$

$$Protest_k = \begin{cases} R_k^2 - R^2, & \text{if } R_k^2 - R^2 > 0 \\ 0, & \text{otherwise} \end{cases} \tag{14}$$

\(R^2\) is the squared correlation coefficient for a set that contains *n* substances. \(R_k^2\) is the same quantity for the *n* − 1 substances that remain after the *k*-th substance is removed. Thus, if \(R_k^2 - R^2\) is greater than zero, the *k*-th substance is an “oppositionist” to the correlation between experimental and predicted values of the set. A small sum of “protests” means a more “intensive” correlation.
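The leave-one-out procedure behind the correlation intensity index can be sketched as follows. The helper `r_squared` and the function names are illustrative, and the sketch assumes that neither the observed nor the calculated values are constant in any leave-one-out subset (otherwise the correlation is undefined).

```python
def r_squared(x, y):
    """Squared Pearson correlation coefficient of two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def cii(observed, calculated):
    """Correlation intensity index for one set of compounds."""
    observed, calculated = list(observed), list(calculated)
    r2 = r_squared(observed, calculated)
    protests = 0.0
    for k in range(len(observed)):
        # Recompute R^2 without the k-th substance.
        obs_k = observed[:k] + observed[k + 1:]
        calc_k = calculated[:k] + calculated[k + 1:]
        diff = r_squared(obs_k, calc_k) - r2
        if diff > 0:  # removing substance k improves the correlation
            protests += diff
    return 1.0 - protests
```

For a perfectly correlated set no removal can improve \(R^2\), so the index stays at 1; a single strong outlier produces a large “protest” and pulls the index down.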