The self-organizing vector of atom-pairs proportions: use to develop models for melting points

Atom-pairs proportions are a transparent characteristic of a molecule: if a molecule has two oxygen atoms and three nitrogen atoms, the corresponding atom pair can be expressed as a code “atom1-atom2-n1-n2,” which indicates the two atom types and their numbers. These codes, defined for a group of SMILES symbols (nitrogen, oxygen, sulfur, phosphorus, fluorine, chlorine, bromine, as well as double and triple covalent bonds), are used to build the so-called optimal molecular descriptor, which is calculated with special coefficients named correlation weights of the corresponding pairs. The correlation weights are calculated by the Monte Carlo technique using the CORAL software (http://www.insilico.eu/coral). The one-variable model for melting points of 8653 different organic compounds is characterized by the following statistical quality: n=6483, r2=0.6452, RMSE=61.9°C; n=2170, r2=0.7941, RMSE=39.2°C.


Introduction
Knowledge of the physical and chemical properties of a compound is required for understanding and modeling its action in a number of possible applications. Many physicochemical properties of compounds are strongly interconnected. The relationships between physicochemical properties can be established by theoretical analysis or found empirically [1][2][3]. Physicochemical properties are also very important in toxicokinetic models, where they describe partitioning between different organs and processes such as skin permeation [4].
Numerical data on melting points have a diversity of applications [5,6]. For example, the melting point can be used to check the purity of an experimental sample. In addition, it can be used to establish the existence of different conformations of the molecular structure of a compound under consideration. Data on melting points may be useful for studies of both pure substances and mixtures. However, experimental determination of this endpoint is not feasible for all substances used in science and everyday life [7][8][9].
Similar to most quantitative structure-property/activity relationships (QSPRs/QSARs), the models for melting points are based on a well-known group of mathematical approaches, e.g., linear regression and neural networks [10], random forest [9], comparative molecular field analysis (CoMFA) [11], partial least squares [12], the k-nearest neighbor approach [13], and the Monte Carlo approach [14]. Thus, modeling of melting points is not a simple task, and earlier melting point models were developed for relatively simple and small sets of compounds [15].
Nonetheless, relationships between molecular structure and biological activity, or between molecular structure and physical properties, can be investigated for large databases of organic compounds using recent computer-assisted approaches that derive quantitative relationships between a property and a structure via modeling [1,2]. Sometimes, a simple Monte Carlo approach provides reliable models for large datasets of complex molecules [16].
The Monte Carlo approach based on the so-called optimal descriptors has been studied as a tool to build models for different endpoints, such as the bioactivity of anticonvulsant agents [17], the biological activity of a number of drug-like substances [18][19][20][21][22], and the biological activity of nanomaterials [23]. However, the approach has not yet been applied to build models for melting points.
The aim of the present study is to assess the approach as a tool to build models for melting points for a large set of more than eight and a half thousand different organic compounds.

Data
The numerical data on melting points, expressed in Celsius, have been taken from the literature [24]. The data have been randomly distributed into four subsets of equal percentage. The active training set (25%), passive training set (25%), and calibration set (25%) together form the group used to train the model. The external validation set (25%) is used to estimate the predictive potential of the model. The melting point range for the united active training, passive training, and calibration sets is min = −196°C and max = 492.5°C; the range for the validation set is min = −134°C and max = 376°C.
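The random distribution into four subsets can be illustrated with a short sketch. The text does not describe the splitting procedure beyond "random," so the round-robin dealing, the function name `split_four_ways`, and the seed below are assumptions made only for illustration.

```python
import random

SUBSET_NAMES = ["active_training", "passive_training", "calibration", "validation"]

def split_four_ways(items, seed=0):
    """Randomly distribute items into four subsets of (nearly) equal size.

    Assumption: a plain shuffle followed by round-robin dealing; the
    procedure actually used for the published split is not specified.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Dealing every fourth item keeps subset sizes within one of each other.
    return {name: shuffled[i::4] for i, name in enumerate(SUBSET_NAMES)}

# Hypothetical usage: compound indices stand in for the 8653 structures.
subsets = split_four_ways(range(8653))
sizes = {name: len(members) for name, members in subsets.items()}
```

Each subset then holds about 25% of the compounds; the first three would be used for building the model and the last one kept aside for external validation.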

Model
The model for the melting point is the following one-variable generalized formula:
$$T_m^0 = C_0 + C_1 \times \mathrm{DCW}(T, N)$$
Here $T_m^0$ is the melting point expressed in Celsius; $C_0$ and $C_1$ are regression coefficients; and $\mathrm{DCW}(T, N)$ is the so-called optimal descriptor (D) that is calculated with correlation weights (CW).

Descriptor of correlation weights (DCW)
The descriptor is calculated as follows:
$$\mathrm{DCW}(T, N) = \sum \mathrm{CW}(\mathrm{APP}_j) + \sum \mathrm{CW}(S_k) + \sum \mathrm{CW}(SS_k) + \sum \mathrm{CW}(SSS_k)$$
The $\mathrm{APP}_j$ are the components of the vector of atom-pairs proportions. These are special SMILES attributes encoded as follows: (atom1, atom2).n1.n2. Here atom1 and atom2 can be N, O, S, P, F, Cl, Br, and the additional SMILES-atoms "=" (double covalent bond) and "#" (triple covalent bond). Each compound (SMILES) has a self-organized $\mathrm{APP}_j$ vector that includes only the atom pairs present in the corresponding molecule; atom pairs that involve atoms absent from the molecule do not appear.
It is expected that the correlation weights of $\mathrm{APP}_j$ can improve the predictive potential of the model for melting points. Table 1 contains an example of the vector of atom-pairs proportions.
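As an illustration of how such a vector might be assembled from a SMILES string, the sketch below counts the listed symbols and emits one code per pair of symbols present in the molecule. The exact encoding used by CORAL is not reproduced here: the function names, the fixed symbol order, the output format, and the decision to ignore lowercase (aromatic) atoms are all assumptions.

```python
import re
from itertools import combinations

# Symbols named in the text, in an assumed fixed order.
APP_SYMBOLS = ["N", "O", "S", "P", "F", "Cl", "Br", "=", "#"]

# Bracket atoms are matched as a whole so that, e.g., "[Si]" is not read as "S".
_TOKEN = re.compile(r"\[[^\]]*\]|Cl|Br|[NOSPF]|=|#")
_BRACKET_ELEMENT = re.compile(r"\[\d*([A-Z][a-z]?)")

def count_symbols(smiles):
    """Count occurrences of the APP symbols in a SMILES string."""
    counts = {}
    for token in _TOKEN.findall(smiles):
        if token.startswith("["):
            match = _BRACKET_ELEMENT.match(token)
            symbol = match.group(1) if match else None
        else:
            symbol = token
        if symbol in APP_SYMBOLS:
            counts[symbol] = counts.get(symbol, 0) + 1
    return counts

def app_vector(smiles):
    """Return codes '(atom1,atom2).n1.n2' for every pair of symbols present."""
    counts = count_symbols(smiles)
    present = [s for s in APP_SYMBOLS if s in counts]
    return [f"({a},{b}).{counts[a]}.{counts[b]}"
            for a, b in combinations(present, 2)]

# Paracetamol: one N, two O, one double bond.
example = app_vector("CC(=O)Nc1ccc(O)cc1")
```

Because only symbols that actually occur are paired, the vector grows or shrinks with the composition of the molecule, which is the self-organizing behaviour described above; a pure hydrocarbon without multiple bonds yields an empty vector.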
The $S_k$ is the "SMILES-atom," i.e., one symbol (e.g., "C," "N," "O") or two symbols which cannot be examined separately (e.g., "Cl," "Si"); the $SS_k$ is a combination of two SMILES-atoms; the $SSS_k$ is a combination of three SMILES-atoms; the $\mathrm{CW}(S_k)$, $\mathrm{CW}(SS_k)$, and $\mathrm{CW}(SSS_k)$ are the so-called correlation weights of the above-mentioned attributes of SMILES. The numerical data on the $\mathrm{CW}(S_k)$, $\mathrm{CW}(SS_k)$, and $\mathrm{CW}(SSS_k)$ are calculated with the Monte Carlo method, i.e., an optimization procedure which gives a maximal value of a special target function (TF). Table 2 contains an interpretation of SMILES attributes. The correlation weights for the SMILES attributes are calculated by the Monte Carlo technique using the CORAL software (http://www.insilico.eu/coral).
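A minimal sketch of how the descriptor and the one-variable model could be evaluated once correlation weights are available is given below. The tokenizer, the tuple representation of attributes, the treatment of missing weights as zero, and all numerical values are hypothetical; CORAL's own parsing and ordering rules for SMILES attributes are not reproduced.

```python
import re

# Assumed tokenizer: a few two-letter SMILES-atoms, otherwise single characters.
_SMILES_ATOM = re.compile(r"Cl|Br|Si|.")

def smiles_attributes(smiles):
    """Return S_k, SS_k and SSS_k attributes as tuples of SMILES-atoms."""
    atoms = _SMILES_ATOM.findall(smiles)
    s1 = [tuple(atoms[i:i + 1]) for i in range(len(atoms))]
    s2 = [tuple(atoms[i:i + 2]) for i in range(len(atoms) - 1)]
    s3 = [tuple(atoms[i:i + 3]) for i in range(len(atoms) - 2)]
    return s1 + s2 + s3

def dcw(smiles, weights, app_codes=()):
    """Sum the correlation weights of all attributes of one SMILES.

    Attributes without a weight contribute zero (an assumption).
    """
    attributes = smiles_attributes(smiles) + list(app_codes)
    return sum(weights.get(attribute, 0.0) for attribute in attributes)

def predict_melting_point(smiles, weights, c0, c1, app_codes=()):
    """One-variable model: melting point = C0 + C1 * DCW."""
    return c0 + c1 * dcw(smiles, weights, app_codes)

# Made-up weights and coefficients, for illustration only.
toy_weights = {("C",): 1.0, ("Cl",): 0.5, ("C", "Cl"): -0.25}
toy_dcw = dcw("CCl", toy_weights)
toy_tm = predict_melting_point("CCl", toy_weights, c0=10.0, c1=2.0)
```

Atom-pair codes can be passed through `app_codes` so that their weights enter the same sum as the weights of the other SMILES attributes.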

Monte Carlo optimization of the correlation weights
The Monte Carlo method applied here is based on two different target functions ($TF_1$ and $TF_2$). In these functions, $r_{AT}$ and $r_{PT}$ are the correlation coefficients between the observed and predicted values of the endpoint for the active training and passive training sets, respectively.
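The general idea of a Monte Carlo search for correlation weights can be sketched as a random perturbation of one weight at a time, keeping a change only when the target function increases. This is a generic illustration, not the CORAL procedure: the target function is passed in as a hypothetical callable `target`, and the step size, number of epochs, and acceptance rule are assumptions.

```python
import random

def monte_carlo_optimize(weights, target, n_epochs=200, step=0.1, seed=0):
    """Maximize target(weights) by random single-weight perturbations.

    Assumption: greedy acceptance (a trial is kept only if it improves
    the target); this is a sketch, not the published algorithm.
    """
    rng = random.Random(seed)
    best = dict(weights)
    best_score = target(best)
    keys = list(best)
    for _ in range(n_epochs):
        for key in keys:
            trial = dict(best)
            trial[key] += rng.uniform(-step, step)
            score = target(trial)
            if score > best_score:
                best, best_score = trial, score
    return best, best_score

# Toy target with a known maximum at a = 1, b = -2 (illustration only).
def toy_target(w):
    return -((w["a"] - 1.0) ** 2 + (w["b"] + 2.0) ** 2)

start = {"a": 0.0, "b": 0.0}
start_score = toy_target(start)
final_weights, final_score = monte_carlo_optimize(start, toy_target)
```

In a real application the callable would compute the chosen target function from the correlation coefficients obtained with the current weights on the relevant subsets.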
The index of ideality of correlation (IIC) is a special characteristic able to improve the predictive potential of a model [25,26].
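The $IIC_{CLB}$ is calculated with data on the calibration set as follows:
$$IIC_{CLB} = r_{CLB} \, \frac{\min\left({}^{-}MAE_{CLB},\ {}^{+}MAE_{CLB}\right)}{\max\left({}^{-}MAE_{CLB},\ {}^{+}MAE_{CLB}\right)}$$
$${}^{-}MAE_{CLB} = \frac{1}{{}^{-}N} \sum_{\Delta_k < 0} |\Delta_k|, \quad {}^{-}N \text{ is the number of } \Delta_k < 0$$
$${}^{+}MAE_{CLB} = \frac{1}{{}^{+}N} \sum_{\Delta_k \geq 0} |\Delta_k|, \quad {}^{+}N \text{ is the number of } \Delta_k \geq 0$$
$$\Delta_k = \mathrm{observed}_k - \mathrm{calculated}_k$$
Here $r_{CLB}$ is the correlation coefficient between the observed and calculated values for the calibration set, and $\mathrm{observed}_k$ and $\mathrm{calculated}_k$ are the corresponding values of the endpoint.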
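The index of ideality of correlation combines the correlation coefficient with the ratio of the mean absolute errors computed separately for negative and non-negative residuals [25,26]. A minimal sketch of that calculation is given below; the function names and the choice to return zero when one of the two residual groups is empty are assumptions.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined; zero is an assumed fallback
    return sxy / math.sqrt(sxx * syy)

def iic(observed, calculated):
    """Index of ideality of correlation for one set of compounds."""
    deltas = [o - c for o, c in zip(observed, calculated)]
    negative = [abs(d) for d in deltas if d < 0]
    non_negative = [abs(d) for d in deltas if d >= 0]
    if not negative or not non_negative:
        return 0.0  # assumption: undefined ratio treated as zero
    mae_neg = sum(negative) / len(negative)
    mae_pos = sum(non_negative) / len(non_negative)
    ratio = min(mae_neg, mae_pos) / max(mae_neg, mae_pos)
    return pearson_r(observed, calculated) * ratio

obs = [1.0, 2.0, 3.0, 4.0]
balanced = iic(obs, [1.5, 1.5, 3.5, 3.5])    # equal error on both sides
unbalanced = iic(obs, [0.0, 1.0, 2.0, 4.5])  # errors differ by a factor of 2
```

When over- and under-predictions have the same mean absolute error, the index equals the correlation coefficient; any imbalance between the two groups lowers it.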

Domain of applicability
The domain of applicability of the CORAL model is defined according to the distribution of SMILES attributes ($A_k = S_k$, $SS_k$, $SSS_k$, and $\mathrm{APP}_j$) in the active training and calibration sets. The defect of a SMILES attribute is calculated as:
$$\mathrm{Defect}(A_k) = \frac{|P(A_k) - P'(A_k)|}{N(A_k) + N'(A_k)} \ \text{ if } N(A_k) > 0; \qquad \mathrm{Defect}(A_k) = 1 \ \text{ if } N(A_k) = 0$$
where $P(A_k)$ and $P'(A_k)$ are the probabilities of $A_k$ in the training and calibration sets, respectively; $N(A_k)$ and $N'(A_k)$ are the frequencies of $A_k$ in the training and calibration sets, respectively. The defect of a SMILES as a whole (SMILES-defect) is calculated as:
$$\mathrm{Defect}(\mathrm{SMILES}) = \sum_{k=1}^{NA} \mathrm{Defect}(A_k)$$
where $NA$ is the number of non-blocked SMILES attributes in the SMILES. A substance falls in the domain of applicability if
$$\mathrm{Defect}(\mathrm{SMILES}) < 2 \times \overline{D}$$
where $\overline{D}$ is the average of the statistical SMILES-defect for the training set. It is to be noted that the sum of SMILES-defects over the active training set is a measure of the statistical quality of the selected split.
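The domain-of-applicability check can be sketched as follows. Here the probability of an attribute is taken as its frequency divided by the number of SMILES in the set; this, the defect value of 1 for attributes absent from the training set, and the factor 2 in the final comparison follow the convention usually quoted for CORAL models and should be read as assumptions, as should all function names and numbers.

```python
def attribute_defects(train_counts, calib_counts, n_train, n_calib):
    """Defect of each attribute from its frequencies in two sets.

    train_counts / calib_counts map attribute -> number of occurrences;
    n_train / n_calib are the numbers of SMILES in each set.
    """
    defects = {}
    for attribute in set(train_counts) | set(calib_counts):
        n_a = train_counts.get(attribute, 0)
        n_c = calib_counts.get(attribute, 0)
        if n_a == 0:
            defects[attribute] = 1.0  # assumed value for unseen attributes
        else:
            p_a, p_c = n_a / n_train, n_c / n_calib
            defects[attribute] = abs(p_a - p_c) / (n_a + n_c)
    return defects

def smiles_defect(attributes, defects):
    """Sum of attribute defects over the attributes of one SMILES."""
    return sum(defects.get(attribute, 1.0) for attribute in attributes)

def in_domain(defect, mean_training_defect, factor=2.0):
    """Domain test; factor=2.0 is an assumed threshold multiplier."""
    return defect < factor * mean_training_defect

# Hypothetical frequencies for three attributes.
defects = attribute_defects({"C": 50, "N": 10}, {"C": 40, "N": 20, "S": 5},
                            n_train=100, n_calib=100)
typical = smiles_defect(["C", "N"], defects)
unusual = smiles_defect(["C", "S"], defects)
```

A SMILES built only from attributes that are about equally frequent in both sets receives a small defect, whereas a single attribute never seen in training is enough to push it outside the domain.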

Results and discussion
Two new elements of the CORAL model are studied here: (1) the self-organized vector of atom-pairs proportions (APP) and (2) the index of ideality of correlation (IIC). Table 3 contains the statistical quality of models obtained with and without these elements. A comparison of these models (Table 3) shows that the best model was obtained by applying both $\mathrm{APP}_j$ and IIC. Applying $\mathrm{APP}_j$ without IIC improves the statistical characteristics for the training group (active and passive training sets), but this is accompanied by a decrease of the determination coefficient for the validation set.
The best model for melting point, according to the statistical quality for the validation set, is given by Eq. (12). Table 4 contains a comparison of the model calculated with Eq. (12) with models for melting points suggested in the literature. It should be noted that the first pioneering works dedicated to models of melting points were oriented to limited datasets, but models for large datasets have gradually become more popular [1].
The comparison confirms that the CORAL model calculated with $\mathrm{APP}_j$ and IIC is comparable with other models for melting points. However, it is important to note that the models in the literature have a higher complexity, at least regarding the characterization of the chemical information and the associated descriptors. The model presented here uses only very simple chemical information, such as the atom types, their numbers, and information derived directly from the SMILES string, without further steps for the calculation of molecular descriptors.

Conclusions
The approach described here allows building a model for melting point that is comparable with related models described in the literature [24]. The suggested self-organizing vector of atom-pairs proportions, together with the index of ideality of correlation, can serve as a tool to improve the predictive potential of QSPR/QSAR models. The CORAL software makes it possible to define and study different hypotheses on improving the predictive potential of these models by the use of the index of ideality of correlation. This criterion improves the statistical quality of a model for the calibration set to the detriment of the training set. The index likely deserves further study, both for modeling melting points and for modeling other endpoints.