3.1 Chemical and Physical Feature Descriptor Methods
Figures 3 (A) and (B) present the performance and accuracy metrics of the descriptor-based method, in which the RF algorithm is used to predict aqueous solubility. We also trained a linear regression model to compare its accuracy with that of the RF (Figures 3 (C) and (D)).
Figure 4 illustrates the results of the SHAP analysis for the RF model trained on chemical descriptors and compares the impact of ten chemical and physical descriptors on the predicted aqueous solubility, ranked by their average SHAP values. The most important descriptor is FilterItLogS (Figure 4 (A)) [33], which reflects the octanol/water partition coefficient, that is, the ratio of the concentration of a solute in a water-saturated octanolic phase to its concentration in an octanol-saturated aqueous phase. Figure 4 (B) depicts the impact of each descriptor on the model's predictions. Points with positive SHAP values indicate a positive effect on solubility, while points with negative SHAP values indicate a negative effect. The density of the points represents the feature distribution; for example, FilterItLogS is more densely distributed on the positive side (Figure 4 (B)). Red denotes higher feature values, and blue denotes lower feature values.
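The ranking by average SHAP value described above can be sketched as follows. This is a minimal illustration, assuming importance is taken as the mean absolute SHAP value per descriptor; the SHAP values and the names `descriptor_B` and `descriptor_C` are hypothetical and are not the data behind Figure 4.

```python
from statistics import mean

def rank_by_mean_abs_shap(shap_rows, feature_names):
    """Return (name, mean |SHAP|) pairs sorted from most to least important."""
    columns = list(zip(*shap_rows))  # transpose: one tuple of values per feature
    scores = [mean(abs(v) for v in col) for col in columns]
    return sorted(zip(feature_names, scores), key=lambda p: p[1], reverse=True)

# Hypothetical SHAP values: 3 molecules x 3 descriptors (not the paper's data).
names = ["FilterItLogS", "descriptor_B", "descriptor_C"]
shap_rows = [
    [0.8, -0.1, 0.05],
    [-0.6, 0.2, 0.00],
    [0.7, -0.3, 0.10],
]

ranking = rank_by_mean_abs_shap(shap_rows, names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking the absolute value before averaging keeps descriptors with large but sign-changing contributions from cancelling to zero in the ranking.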
3.2 Fingerprinting Methods
The R² and RMSE values for the Morgan fingerprint method are presented in Figure 5. Performance on both the training and test sets was satisfactory, and the RF model performed reasonably on the regression datasets.
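For reference, the two metrics reported in Figure 5 can be computed as in the sketch below. It uses the standard definitions of R² and RMSE; the logS values are hypothetical and serve only to show the arithmetic.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical logS values (mol/L), not taken from the study.
observed = [-1.0, -2.0, -3.0, -4.0]
predicted = [-1.5, -2.0, -2.5, -4.0]

print(f"R2   = {r2_score(observed, predicted):.3f}")
print(f"RMSE = {rmse(observed, predicted):.3f}")
```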
Figure 6 displays the twelve most important chemical substructures for predicting aqueous solubility with the Morgan fingerprint. Higher values of features 807, 222, 650, and 1171 increased the predicted solubility, whereas the remaining features decreased it.
Solubility is a question of equilibria; therefore, to interpret the results, the important features should be framed in terms of the energetics of the states rather than the dynamics of the transitions between them. The energetics of a compound in water can be estimated through a statistical-thermodynamics-like approach [34][35]. A thermodynamic analysis of solubility in terms of Gibbs energy, enthalpy, or entropy can support the interpretation of the data by clarifying the possible molecular interactions [16]. In this study, the calculated Gibbs energy was adopted as the thermodynamic framework to ground the discussion in the underlying chemistry and improve its clarity. A lower Gibbs energy indicates greater solubility in water, and a higher positive Gibbs energy indicates lower solubility. Table 2 lists the Gibbs energies of the most important features, calculated following Perlovich’s study [36]. Perlovich et al. developed a correlation equation (Equation 1) that describes the Gibbs energy in terms of three variables: the molecular polarizability, α; the sum of all H-bond acceptor factors in a molecule, ΣCa; and the sum of all H-bond donor factors, ΣCd.
∆G298 = (−0.5 ± 1.6) − (1.37 ± 0.06)α + (3.84 ± 0.25)ΣCa − (2.97 ± 0.26)ΣCd    (1)
The three variables were calculated for each feature using the descriptor-based method described in Section 3.1. Features 807, 222, 650, and 1171, which have positive effects, have low Gibbs energies and are thermodynamically favorable; their Gibbs energies are lower than those of features 1380, 561, 1143, 1750, 114, and 591, which have negative effects. The thermodynamic results are intuitive and agree with the expectations arising from the SHAP analysis. The current fingerprint can also be used to develop quantitative structure-property relationship (QSPR) models [37]. The agreement between the impactful features and the thermodynamic quantities can distinguish the fingerprint method from other computational tools for predicting physicochemical properties [38].
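As a worked illustration of Equation 1, the sketch below evaluates the correlation using only its central coefficients; the stated uncertainties are ignored. The function name and the input values are hypothetical and are chosen only to show how each term contributes.

```python
def gibbs_energy_298(alpha, sum_ca, sum_cd):
    """Evaluate Equation 1 with its central coefficients (uncertainties ignored).

    alpha  : molecular polarizability
    sum_ca : sum of H-bond acceptor factors
    sum_cd : sum of H-bond donor factors
    """
    return -0.5 - 1.37 * alpha + 3.84 * sum_ca - 2.97 * sum_cd

# Hypothetical inputs, for illustration only (not rows of Table 2).
example = gibbs_energy_298(alpha=2.0, sum_ca=1.0, sum_cd=0.5)
print(f"Delta G298 = {example:.3f}")
```

With these inputs the terms are −0.5, −2.74, +3.84, and −1.485, which sum to −0.885.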
Table 2. Perlovich’s variables and Gibbs energies for the top twelve features.
Feature | Molecular Polarizability (α) | H-bond Acceptor (ΣCa) | H-bond Donor (ΣCd) | ∆G298
1380 | 12.01 | 0 | 0 | 14.46
807 | 1.47 | 0 | 1 | 0.29
561 | 10.52 | 0 | 0 | 12.42
222 | 4.47 | 1 | 1 | -2.99
650 | 0.80 | 0 | 0 | -0.90
1143 | 15.02 | 0 | 0 | 18.57
1750 | 16.69 | 0 | 0 | 20.86
1171 | 2.43 | 0 | 0 | 1.33
1873 | 9.01 | 0 | 0 | 10.34
294 | 17.10 | 4 | 0 | 7.14
114 | 16.35 | 0 | 0 | 20.40
591 | 12.68 | 0 | 0 | 15.37
In the substructure drawings in Figure 6, blue marks the central atom, yellow marks the aromatic atoms, and dark gray marks the aliphatic ring atoms. Light gray indicates atoms and bonds that influence the atom’s connectivity invariants but are not directly part of the fingerprint. Figure 7 shows schematically how features 561 and 807 are extracted from their molecular structures, illustrating how each substructure is hashed.
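The idea of hashing a substructure to a numbered feature can be sketched with a toy example. This is not the Morgan algorithm and does not reproduce the feature numbers in Figure 6; it only shows how an identifier for an atom environment can be folded onto a fixed-length bit vector. The 2048-bit length, the SHA-256 hash, and the environment strings are all assumptions made for illustration.

```python
import hashlib

N_BITS = 2048  # assumed fingerprint length; the text does not state it

def fold_to_bit(environment, n_bits=N_BITS):
    """Map a textual atom-environment label to a bit index (toy hashing)."""
    digest = hashlib.sha256(environment.encode("utf-8")).digest()
    identifier = int.from_bytes(digest[:4], "big")  # 32-bit integer identifier
    return identifier % n_bits  # fold the identifier onto the bit vector

def toy_fingerprint(environments, n_bits=N_BITS):
    """Set one bit per environment; colliding environments share a bit."""
    bits = [0] * n_bits
    for env in environments:
        bits[fold_to_bit(env, n_bits)] = 1
    return bits

# Hypothetical environment labels: central atom plus neighbours within a radius.
environments = ["C(aromatic)|C,C", "O(hydroxyl)|C", "N(amine)|C,C"]
fingerprint = toy_fingerprint(environments)
print("set bits:", [i for i, b in enumerate(fingerprint) if b])
```

Because the identifier is folded with a modulo operation, two different substructures can land on the same bit, which is one reason a feature index should be interpreted together with the substructure drawing rather than on its own.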
3.3 Blind Test
To verify performance and compare the two models, we performed a blind test on a database that was never used in training our models. The database consists of 32 small organic molecules with 1 to 12 C atoms, extracted from the Llinàs et al. [19] study. The differences between predicted and experimental values in Table 3 are acceptable, given that experimental solubility values themselves vary with the experimental conditions. The average uncertainty in the measured aqueous solubility of organic molecules is ∼0.6 log units or higher when the solubility values are gathered from various published sources [39][40][41].
Table 3. Empirical and predicted solubility for selected druglike molecules using different chemical representation methods.
Name | logS (mol/L): Intrinsic Solubility | logS (mol/L): Molecular Descriptor Method | Molecular Descriptor Method: Absolute Calculation Error | logS (mol/L): MF¹ Method | MF Method: Absolute Calculation Error
hexobarbital | -2.67 | -1.81 | 0.86 | -2.40 | 0.274
nalidixic_acid | -3.61 | -1.61 | 2.00 | -3.43 | 0.183
phenanthroline | -1.61 | -1.96 | 0.35 | -1.80 | 0.191
phenobarbital | -2.29 | -1.94 | 0.35 | -2.33 | 0.041
sulfamethazine | -2.73 | -2.02 | 0.71 | -2.38 | 0.347
bromogramine | -4.05 | -1.84 | 2.21 | -3.92 | 0.127
phenazopyridine | -4.19 | -1.81 | 2.38 | -4.02 | 0.169
amantadine | -1.85 | -1.68 | 0.17 | -2.12 | 0.271
benzylimidazole | -2.25 | -1.92 | 0.33 | -1.51 | 0.745
chlorpropamide | -3.24 | -1.91 | 1.33 | -2.89 | 0.351
cimetidine | -1.69 | -0.68 | 1.01 | -1.49 | 0.196
thymol | -2.18 | -2.14 | 0.04 | -2.26 | 0.083
tryptamine | -3.29 | -1.95 | 1.34 | -2.91 | 0.385
azathioprine | -3.2 | -2.20 | 1.00 | -2.84 | 0.357
sulfathiazole | -2.68 | -1.65 | 1.03 | -2.55 | 0.128
acetaminophen | -1.06 | -2.05 | 0.99 | -1.19 | 0.132
diazoxide | -3.36 | -2.07 | 1.29 | -3.28 | 0.085
famotidine | -2.64 | -2.09 | 0.55 | -2.58 | 0.058
hydroflumethiazide | -2.96 | -1.80 | 1.16 | -2.33 | 0.626
nitrofurantoin | -3.23 | -2.18 | 1.05 | -3.42 | 0.190
phthalic_acid_form_I | -1.49 | -1.55 | 0.06 | -0.93 | 0.562
sulfacetamide | -1.51 | -1.75 | 0.24 | -1.42 | 0.090
trichloromethiazide_form_I | -3.18 | -1.85 | 1.33 | -2.78 | 0.400
2_amino_5_bromobenzoic_acid | -3.07 | -1.94 | 1.13 | -2.80 | 0.267
5_bromo_2_4_dihydroxybenzoic_acid | -2.62 | -1.52 | 1.10 | -2.20 | 0.418
chlorzoxazone | -2.65 | -2.07 | 0.58 | -2.89 | 0.236
5_hydroxybenzoic_acid | -1.46 | -1.32 | 0.14 | -1.69 | 0.227
4_iodophenol | -1.71 | -2.10 | 0.39 | -2.00 | 0.294
metronidazole | -1.22 | -2.29 | 1.07 | -1.35 | 0.131
guanine | -4.42 | -2.05 | 2.37 | -4.08 | 0.339
acetazolamide | -2.43 | -2.13 | 0.30 | -2.34 | 0.085
1_naphthol | -1.98 | -1.91 | 0.07 | -2.27 | 0.290
Mean | | | 0.9 | | 0.25
¹ MF: Morgan Fingerprint
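The absolute calculation errors in Table 3 are the absolute differences between the experimental and predicted logS values, and the mean is taken over the molecules. The sketch below shows that calculation for three rows of the table (hexobarbital, thymol, and guanine). Because only three rows and the rounded tabulated values are used, the results are not expected to match the 32-molecule means or the three-decimal MF errors exactly; the variable and function names are illustrative.

```python
def absolute_errors(experimental, predicted):
    """Absolute difference between experimental and predicted logS, per molecule."""
    return [abs(e - p) for e, p in zip(experimental, predicted)]

def mean_absolute_error(experimental, predicted):
    errors = absolute_errors(experimental, predicted)
    return sum(errors) / len(errors)

# Three rows of Table 3: hexobarbital, thymol, guanine (logS in mol/L).
experimental = [-2.67, -2.18, -4.42]
descriptor_pred = [-1.81, -2.14, -2.05]  # molecular descriptor method
mf_pred = [-2.40, -2.26, -4.08]          # Morgan fingerprint method

mae_descriptor = mean_absolute_error(experimental, descriptor_pred)
mae_mf = mean_absolute_error(experimental, mf_pred)
print(f"descriptor method: {mae_descriptor:.2f}")
print(f"MF method:         {mae_mf:.2f}")
```

For this subset the per-molecule errors are 0.86, 0.04, and 2.37 for the descriptor method and 0.27, 0.08, and 0.34 for the MF method.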