To predict the functions of chemicals in cosmetic products from their structural and physical-chemical properties, i.e. to generate QSUR models, bagging SVM learners were trained on the training set HFunc_Str_Pc (numerical data normalized by a 0–1 rank transformation). Model training involved optimizing the C and γ parameters using 5-fold cross-validation so as to maximize the predictive performance of the learner. Using the grid-search technique, i.e. by setting appropriate ranges for the parameters, the optimal combination of these parameters was found for each of the 43 bagging SVM models, one for each of the 43 harmonized functions. Table 1 shows the optimal parameters and predictive performances of the generated models.
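The grid-search step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the parameter ranges (C from 1 to 1600 and γ from 0.01 to 0.1, each in five linear steps) are an assumption inferred from the optimal values reported in Table 1, and `score` is a hypothetical callback standing in for training the bagging SVM and returning its mean 5-fold cross-validation performance.

```python
from itertools import product


def linear_grid(lo, hi, steps):
    """Return steps + 1 evenly spaced values from lo to hi (inclusive)."""
    return [lo + k * (hi - lo) / steps for k in range(steps + 1)]


# Assumed ranges: inferred from the values in Table 1, not stated in the text.
C_GRID = [round(c, 1) for c in linear_grid(1.0, 1600.0, 5)]
GAMMA_GRID = [round(g, 3) for g in linear_grid(0.01, 0.1, 5)]


def grid_search(score, c_grid=C_GRID, gamma_grid=GAMMA_GRID):
    """Return (best_score, C, gamma) over all grid combinations.

    `score(C, gamma)` is a hypothetical callback that would train the
    bagging SVM and return its mean 5-fold cross-validation performance.
    """
    return max((score(c, g), c, g) for c, g in product(c_grid, gamma_grid))


def toy_score(c, g):
    """Toy stand-in for the cross-validated score (peaks at C=960.4, gamma=0.046)."""
    return -abs(c - 960.4) / 1600 - abs(g - 0.046)


print(C_GRID)                      # [1.0, 320.8, 640.6, 960.4, 1280.2, 1600.0]
print(GAMMA_GRID)                  # [0.01, 0.028, 0.046, 0.064, 0.082, 0.1]
print(grid_search(toy_score)[1:])  # (960.4, 0.046)
```

With these assumed ranges, every (C, γ) pair listed in Table 1 is one of the 36 grid points, which is why the same few values recur across functions.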
Table 1
Predictive performances of the bagging SVM models (5-fold cross-validation, positive class: 1)

| Function | SVM.C | SVM.gamma | Accuracy | Classification error | Precision | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|
| skin conditioner | 960.4 | 0.01 | 95.99% | 4.01% | 38.08% | 29.81% | 98.25% |
| hair dye | 1280.2 | 0.082 | 98.78% | 1.22% | 49.62% | 51.06% | 99.37% |
| buffer | 1280.2 | 0.082 | 99.04% | 0.96% | 39.19% | 47.62% | 99.42% |
| antimicrobial | 1600.0 | 0.046 | 96.63% | 3.37% | 56.23% | 46.29% | 98.60% |
| hair conditioner | 1600.0 | 0.064 | 98.48% | 1.52% | 54.46% | 50.67% | 99.28% |
| catalyst | 320.8 | 0.064 | 98.29% | 1.71% | 72.41% | 66.35% | 99.23% |
| preservative | 320.8 | 0.082 | 99.40% | 0.60% | 50.00% | 26.00% | 99.85% |
| skin protectant | 640.6 | 0.01 | 99.36% | 0.64% | 15.38% | 10.00% | 99.76% |
| flavorant | 320.8 | 0.046 | 87.40% | 12.60% | 50.07% | 44.65% | 93.57% |
| flame retardant | 640.6 | 0.01 | 99.16% | 0.84% | 73.32% | 71.90% | 99.59% |
| colorant | 320.8 | 0.064 | 95.33% | 4.67% | 71.86% | 64.16% | 97.91% |
| masking agent | 1600.0 | 0.1 | 97.02% | 2.98% | 29.34% | 22.19% | 98.73% |
| antioxidant | 960.4 | 0.082 | 97.68% | 2.32% | 37.13% | 29.61% | 98.99% |
| fragrance | 320.8 | 0.01 | 81.63% | 18.37% | 69.58% | 75.13% | 84.66% |
| additive | 960.4 | 0.064 | 99.38% | 0.62% | 13.33% | 15.00% | 99.68% |
| surfactant | 640.6 | 0.028 | 98.26% | 1.74% | 73.49% | 71.22% | 99.14% |
| heat stabilizer | 1600.0 | 0.01 | 99.38% | 0.62% | 18.75% | 18.33% | 99.72% |
| reducer | 320.8 | 0.01 | 99.83% | 0.17% | 85.33% | 68.33% | 99.94% |
| emulsion stabilizer | 1280.2 | 0.1 | 99.64% | 0.36% | 36.67% | 40.00% | 99.83% |
| ubiquitous | 960.4 | 0.046 | 96.46% | 3.54% | 22.68% | 18.66% | 98.40% |
| chelator | 320.8 | 0.064 | 99.55% | 0.45% | 47.86% | 33.33% | 99.76% |
| plasticizer | 1600.0 | 0.064 | 99.19% | 0.81% | 34.67% | 42.00% | 99.55% |
| solvent | 1280.2 | 0.046 | 98.52% | 1.48% | 27.27% | 23.11% | 99.31% |
| vinyl | 960.4 | 0.046 | 99.98% | 0.02% | 96.00% | 100.00% | 99.98% |
| monomer | 640.6 | 0.046 | 98.80% | 1.20% | 70.24% | 69.15% | 99.37% |
| viscosity controlling agent | 1600.0 | 0.046 | 99.27% | 0.73% | 20.00% | 12.00% | 99.74% |
| emollient | 1280.2 | 0.01 | 97.73% | 2.27% | 59.32% | 57.14% | 98.85% |
| foamer | 320.8 | 0.01 | 99.72% | 0.28% | 28.57% | 20.00% | 99.89% |
| crosslinker | 1280.2 | 0.01 | 97.11% | 2.89% | 51.34% | 46.06% | 98.65% |
| film forming agent | 640.6 | 0.1 | 98.91% | 1.09% | 47.59% | 45.56% | 99.44% |
| UV absorber | 640.6 | 0.01 | 98.71% | 1.29% | 59.42% | 53.63% | 99.39% |
| perfumer | 1600.0 | 0.046 | 96.40% | 3.60% | 15.17% | 12.57% | 98.29% |
| additive for liquid system | 320.8 | 0.046 | 99.72% | 0.28% | 20.00% | 10.00% | 99.91% |
| wetting agent | 960.4 | 0.028 | 99.51% | 0.49% | 50.00% | 53.00% | 99.74% |
| rheology modifier | 960.4 | 0.046 | 99.61% | 0.39% | 14.29% | 10.00% | 99.87% |
| adhesion promoter | 640.6 | 0.1 | 99.38% | 0.62% | 36.67% | 24.00% | 99.72% |
| organic pigment | 320.8 | 0.028 | 99.01% | 0.99% | 77.79% | 74.74% | 99.52% |
| soluble dye | 320.8 | 0.046 | 99.66% | 0.34% | 66.67% | 46.67% | 99.87% |
| photoinitiator | 1600.0 | 0.01 | 99.74% | 0.26% | 61.54% | 53.33% | 99.89% |
| foam boosting agent | 640.6 | 0.01 | 99.87% | 0.13% | 83.00% | 90.00% | 99.91% |
| whitener | 320.8 | 0.1 | 99.91% | 0.09% | 100.00% | 60.00% | 100.00% |
| antistatic agent | 960.4 | 0.082 | 99.72% | 0.28% | 67.33% | 65.00% | 99.87% |
| emulsifier | 1280.2 | 0.028 | 99.68% | 0.32% | 30.00% | 23.33% | 99.85% |
| Average | | | 98.07% | 1.93% | 49.34% | 43.99% | 98.94% |

As can be seen from Table 1, the accuracy of the models based on 5-fold cross-validation ranges from 81.63% to 99.98%. All models are valid, in contrast to the results obtained by Phillips et al. (2017) [6], where 8 of the 49 models had a balanced accuracy of less than 75%. For very important functions such as masking agent, solvent, viscosity controlling agent and perfumer, models with an accuracy of 96.40% or higher were obtained. However, it should be taken into consideration that this is not a balanced accuracy, so much of the accuracy is due to the major (negative) class: the true negative rate, i.e. specificity, averages 98.94%, while the true positive rate, i.e. sensitivity, averages 43.99%. The lower sensitivity of the models indicates that they are not overfitted, i.e. not too dependent on the training set for the positive class, which gives them the potential for better performance on an unknown dataset. The high specificity arises because the negative class is much larger than the positive class. The low precision for the positive class, which averages 49.34%, indicates that the SVM classifier assigns to the positive (minor) class a number of negative-class examples (chemicals that do not have the function) that lie closest to the positive class, i.e. those most similar in structure and physical-chemical properties to chemicals that do have the function.
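The gap between plain and balanced accuracy under class imbalance can be illustrated with a short sketch. The confusion-matrix counts below are hypothetical (100 positives among 5000 chemicals), chosen only so that sensitivity and specificity are of the same order as the averages in Table 1; the function assumes that none of the denominators is zero.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute positive-class metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical counts: 100 positives and 4900 negatives.
m = binary_metrics(tp=44, fp=45, tn=4855, fn=56)
for name, value in m.items():
    print(f"{name}: {value:.2%}")
```

In this toy case the accuracy is close to 98% although fewer than half of the positives are recovered, and the balanced accuracy falls below the 75% validity criterion used by Phillips et al. (2017) [6].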
The resulting bagging SVM models were applied to the training set to generate predictions for the 43 functions. Because each bagging model comprises multiple SVM models, each bagging prediction carries its own probability. A high probability means that most of the SVM models in the ensemble voted that a chemical has, or does not have, a given function.
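A minimal sketch of how such a vote-based probability could be computed is given below. It assumes hard majority voting over the base SVMs, as the description above suggests; an actual bagging implementation may instead average the confidences of the base models. The ensemble size of 50 is hypothetical.

```python
from collections import Counter


def bagging_vote(votes):
    """Return (majority_label, confidence) for one chemical.

    `votes` holds the class label (0 or 1) predicted by each base SVM;
    confidence is the fraction of base models agreeing with the majority.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


# Hypothetical ensemble of 50 base SVMs: 49 vote "no function", 1 votes "function".
print(bagging_vote([0] * 49 + [1]))  # (0, 0.98)
```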
The goal now is to increase the precision and sensitivity of the models (the precision of the positive class and the true positive rate) by retaining, from the large number of negative-class representatives, only those chemicals that are farthest from the chemicals of the positive class, i.e. those whose structure and properties differ most. The members of the negative class to be retained are those to which the bagging SVM assigned the highest probability of belonging to the negative class.
Therefore, the next step is to determine the optimal threshold Pr, i.e. the minimum probability for a prediction to be accepted. For this purpose, the DT method with the gini_index measure for partitioning [28] and 5-fold cross-validation were used. Specifically, the optimal threshold for the probability of an SVM prediction was determined for each of the 43 models as follows. Chemicals whose probability of belonging to the major (negative) class was below Pr were excluded from the training set, and the predictive performance of the DT learner was assessed by 5-fold cross-validation. The value of Pr was chosen so as to maximize the predictive performance of the DT learner. Table 2 shows the values of this parameter as well as the predictive performances of the QSUR models thus obtained for each of the 43 functions.
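The threshold selection described above can be sketched as a filter followed by a sweep over candidate values of Pr. This is an illustration under stated assumptions: the example tuples and candidate thresholds are hypothetical, and `cv_score` is a hypothetical callback standing in for the 5-fold cross-validated performance of the Gini-based DT learner (the toy scorer used here simply returns the share of positives and is not a real performance measure).

```python
def filter_training_set(examples, pr):
    """Keep all positives and only negatives with negative-class probability > pr.

    Each example is a tuple (chemical_id, true_label, p_negative), where
    p_negative is the bagging SVM probability of the negative class.
    """
    return [e for e in examples if e[1] == 1 or e[2] > pr]


def choose_threshold(examples, candidates, cv_score):
    """Return (best_score, best_pr) over the candidate thresholds.

    `cv_score(subset)` is a hypothetical callback standing in for the
    5-fold cross-validated performance of a decision tree on `subset`.
    """
    return max((cv_score(filter_training_set(examples, pr)), pr) for pr in candidates)


def positive_share(subset):
    """Toy stand-in for cv_score: fraction of positives in the subset."""
    return sum(1 for e in subset if e[1] == 1) / len(subset)


# Hypothetical training examples.
examples = [("a", 1, 0.10), ("b", 0, 0.99), ("c", 0, 0.97), ("d", 0, 0.60), ("e", 1, 0.40)]

print(len(filter_training_set(examples, 0.98)))                          # 3
print(choose_threshold(examples, [0.5, 0.95, 0.98], positive_share)[1])  # 0.98
```

Positives are always retained, so raising Pr only removes negatives that the ensemble was less sure about; in a real setting the score returned by the DT cross-validation, not the class share, would decide where the sweep stops.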
Table 2
Predictive performance of DT models on bagging SVM output (5-fold cross-validation, positive class: 1)

| Function | Accuracy | Classification error | Precision | Sensitivity | Specificity | Pr |
|---|---|---|---|---|---|---|
| skin conditioner | 92.13% | 7.87% | 93.89% | 92.30% | 91.84% | > 0.98 |
| hair dye | 99.57% | 0.43% | 92.18% | 80.27% | 99.88% | > 0.98 |
| buffer | 99.25% | 0.75% | 90.95% | 77.62% | 99.77% | > 0.98 |
| antimicrobial | 89.57% | 10.43% | 90.75% | 91.03% | 87.69% | > 0.98 |
| hair conditioner | 91.93% | 8.07% | 80.30% | 73.71% | 95.99% | > 0.94 |
| catalyst | 92.10% | 7.90% | 86.37% | 87.19% | 94.01% | > 0.96 |
| preservative | 92.75% | 7.25% | 100.00% | 78.00% | 100.00% | > 0.95 |
| skin protectant | 90.29% | 9.71% | 81.00% | 71.67% | 94.48% | > 0.99 |
| flavorant | 90.80% | 9.20% | 84.67% | 88.78% | 91.82% | > 0.95 |
| flame retardant | 99.25% | 0.75% | 88.66% | 86.92% | 99.64% | > 0.95 |
| colorant | 92.37% | 7.63% | 93.18% | 92.85% | 91.77% | > 0.95 |
| masking agent | 91.69% | 8.31% | 78.00% | 73.07% | 95.54% | > 0.97 |
| antioxidant | 96.07% | 3.93% | 88.79% | 75.59% | 98.73% | > 0.97 |
| fragrance | 94.06% | 5.94% | 91.94% | 95.41% | 92.92% | > 0.8 |
| additive | 99.72% | 0.28% | 100.00% | 73.33% | 100.00% | > 0.98 |
| surfactant | 99.02% | 0.98% | 96.42% | 92.06% | 99.67% | > 0.98 |
| heat stabilizer | 99.24% | 0.76% | 83.33% | 76.67% | 99.61% | > 0.98 |
| reducer | 99.90% | 0.10% | 90.00% | 76.67% | 99.97% | > 0.95 |
| emulsion stabilizer | 95.46% | 4.54% | 88.33% | 66.67% | 98.57% | > 0.95 |
| ubiquitous | 89.01% | 10.99% | 86.50% | 78.71% | 93.49% | > 0.99 |
| chelator | 98.19% | 1.81% | 80.00% | 70.00% | 99.25% | > 0.97 |
| plasticizer | 97.97% | 2.03% | 100.00% | 71.00% | 100.00% | > 0.98 |
| solvent | 97.04% | 2.96% | 80.17% | 84.56% | 98.08% | > 0.995 |
| vinyl | 99.98% | 0.02% | 96.00% | 100.00% | 99.98% | > 0.5 |
| monomer | 99.25% | 0.75% | 92.08% | 88.73% | 99.66% | > 0.95 |
| viscosity controlling agent | 91.77% | 8.23% | 83.50% | 72.00% | 95.49% | > 0.99 |
| emollient | 98.29% | 1.71% | 91.20% | 82.92% | 99.41% | > 0.985 |
| foamer | 97.22% | 2.78% | 83.33% | 70.00% | 99.26% | > 0.982 |
| crosslinker | 91.44% | 8.56% | 89.31% | 85.88% | 94.32% | > 0.99 |
| film forming agent | 94.27% | 5.73% | 87.14% | 76.39% | 97.72% | > 0.99 |
| UV absorber | 92.69% | 7.31% | 91.67% | 79.23% | 97.07% | > 0.99 |
| perfumer | 92.14% | 7.86% | 81.43% | 79.16% | 95.39% | > 0.99 |
| additive for liquid system | 95.29% | 4.71% | 76.67% | 90.00% | 96.25% | > 0.98 |
| wetting agent | 99.64% | 0.36% | 96.00% | 69.00% | 99.95% | > 0.98 |
| rheology modifier | 99.15% | 0.85% | 85.71% | 53.33% | 99.83% | > 0.98 |
| adhesion promoter | 97.65% | 2.35% | 93.33% | 81.67% | 98.97% | > 0.93 |
| organic pigment | 99.52% | 0.48% | 91.79% | 89.30% | 99.78% | > 0.9 |
| soluble dye | 96.77% | 3.23% | 79.33% | 70.00% | 98.15% | > 0.95 |
| photoinitiator | 99.42% | 0.58% | 78.33% | 80.00% | 99.71% | > 0.99 |
| foam boosting agent | 99.95% | 0.05% | 100.00% | 86.67% | 100.00% | > 0.85 |
| whitener | 99.18% | 0.82% | 93.33% | 90.00% | 99.57% | > 0.85 |
| antistatic agent | 94.85% | 5.15% | 89.47% | 93.33% | 95.00% | > 0.93 |
| emulsifier | 99.86% | 0.14% | 80.00% | 72.73% | 99.94% | > 0.99 |
| Average | 95.95% | 4.05% | 88.49% | 80.57% | 97.40% | |

It can be seen from Table 2 that the average accuracy of the final QSUR models is 95.95% and the average precision is 88.49%, while the average sensitivity and specificity are 80.57% and 97.40%, respectively. Thus, after removing from the training set the negative-class examples whose probability of belonging to that class is below the Pr threshold, precision increased on average by 39.15 percentage points (from 49.34% to 88.49%) and sensitivity by 36.58 percentage points (from 43.99% to 80.57%), while specificity decreased on average by 1.54 percentage points (from 98.94% to 97.40%).
In the study of Phillips et al. (2017) [6], 49 balanced random forest models were generated for 49 harmonized functions, of which 41 were valid (balanced accuracy > 75%). For 8 functions (including some of the most important ones, such as perfumer and solvent) no valid models were obtained, i.e. random balanced undersampling was not effective enough to capture significant differences in the structure and physical-chemical properties of these chemicals compared to the others. Most models recognized the chemicals of the positive class in the training set well (sensitivity averages about 85%); however, the average precision for the positive class is only about 14%, which means that the predictive power of the models is weak. This is due to the large number of false positives (chemicals that do not have a specific function but are misclassified by the model as having it). To identify chemicals that could be functional substitutes, the generated models were applied to 6,356 chemicals in the Tox21 library for which structural and physical-chemical descriptors are available. Consistent with the low positive-class precision of the models, about 88% of the predictions were invalid (with a probability of less than 50%).
Comparing the performances of the QSUR models obtained here with the results from the confusion matrix of Phillips et al. (2017) [6] (Table A3 below), it can be seen that their accuracy averaged 91.81%, sensitivity 84.62%, specificity 91.83% and precision 13.73%. The precision of our QSUR models is significantly improved over theirs, while the other three indicators are similar. This suggests that the predictive potential of our QSUR models for the positive class (chemicals having a given function) is increased.
The bagging SVM prediction served as data pre-processing for the DT learner and significantly improved its predictive performance, even under strong class imbalance. Instead of the random undersampling used by Phillips et al. (2017) [6] to balance the classes, the major class was undersampled here on the basis of the bagging SVM prediction, i.e. only examples predicted with a high probability of belonging to the major class were included in the sample. Thus, valid models with high predictive performance were obtained for all 43 functions.
From the generated DT models, it can be concluded which structural and physical-chemical properties are most responsible for distinguishing the chemicals of the positive class from the others. Chemicals that have these properties can thus be identified and tested as potential substitutes. For example, the positive-class rules derived from the DT model for the fragrance function in Table 3 show the structural and physicochemical properties that potential substitutes for this function in cosmetic products should satisfy.
Table 3
Positive class rules derived from DT models for the fragrance function

| Rule | Covered examples {class 0, class 1} | Accuracy | Conditions (all must hold) |
|---|---|---|---|
| Rule 1 | {0 = 2, 1 = 14} | 87.50% | logKoa_unitless > 0.199; logKoa_unitless > 0.229; bond.C..O.O_carboxylicEster_aromatic = true; logKoa_unitless ≤ 0.286 |
| Rule 2 | {0 = 0, 1 = 6} | 100% | logKoa_unitless > 0.199; logKoa_unitless ≤ 0.229; chain.aromaticAlkane_Ph.C1_acyclic_generic = false; bond.COH_alcohol_ter.alkyl = false; chain.alkaneCyclic_ethyl_C2_.connect_noZ. = true; ring.hetero_.5._N_pyrrole_generic = false; persistence_units(hr) ≤ 0.084 |
| Rule 3 | {0 = 0, 1 = 8} | 100% | logKoa_unitless > 0.199; logKoa_unitless ≤ 0.229; chain.aromaticAlkane_Ph.C1_acyclic_generic = false; bond.COH_alcohol_ter.alkyl = true |
| Rule 4 | {0 = 4, 1 = 45} | 91.83% | logKoa_unitless > 0.199; logKoa_unitless ≤ 0.229; chain.aromaticAlkane_Ph.C1_acyclic_generic = true; bond.C..O.O_carboxylicAcid_generic = false; water_solubility_units(mg/L) ≤ 0.129; bond.X.any._halide = false; bond.CN_amine_aliphatic_generic = false |
| Rule 5 | {0 = 60, 1 = 1347} | 95.73% | logKoa_unitless ≤ 0.199; logP_unitless > 0.375; bond.X.any._halide = false; atom.element_metal_metalloid = false; bond.OZ_oxide_peroxy = false; bond.CN_amine_aliphatic_generic = false; bond.C..O.O_carboxylicEster_alkenyl = false; bond.CS_sulfide = false; molecular_weight ≤ 0.229 |
| Rule 6 | {0 = 0, 1 = 20} | 100% | logKoa_unitless ≤ 0.199; logP_unitless > 0.375; bond.X.any._halide = false; atom.element_metal_metalloid = false; bond.OZ_oxide_peroxy = false; bond.CN_amine_aliphatic_generic = false; bond.C..O.O_carboxylicEster_alkenyl = true; chain.alkeneLinear_diene_1_2.butene = true |
| Rule 7 | {0 = 2, 1 = 10} | 83.33% | logKoa_unitless ≤ 0.199; logP_unitless ≤ 0.375; bond.C..O.O_carboxylicEster_alkyl = false; bond.CC..O.C_ketone_generic = false; ring.hetero_.5._O_oxolane = false; logP_unitless > 0.359; persistence_units(hr) ≤ 0.020; chain.alkeneLinear_mono.ene_ehtylene_terminal = false; vapor_pressure_units(Pa) > 0.000 |
| Rule 8 | {0 = 2, 1 = 25} | 92.59% | logKoa_unitless ≤ 0.199; logP_unitless ≤ 0.375; bond.C..O.O_carboxylicEster_alkyl = false; bond.CC..O.C_ketone_generic = true; molecular_weight ≤ 0.113; bond.C..O.O_carboxylicAcid_alkyl = false; chain.alkeneCyclic_ethene_C_.connect_noZ. = false; logP_unitless > 0.335 |
| Rule 9 | {0 = 0, 1 = 24} | 100% | logKoa_unitless ≤ 0.199; logP_unitless ≤ 0.375; bond.C..O.O_carboxylicEster_alkyl = true; air_half_life_units(hr) > 0.000 |

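Because each rule in Table 3 is a conjunction of threshold and presence conditions, checking whether a candidate chemical satisfies a rule is straightforward. The sketch below encodes Rule 9 from Table 3; the tuple-based rule representation and the candidate's descriptor values are hypothetical, and the thresholds are assumed to refer to the 0–1 normalized descriptors.

```python
import operator

OPS = {">": operator.gt, "<=": operator.le, "==": operator.eq}

# Rule 9 from Table 3 (thresholds assumed to apply to 0-1 normalized descriptors).
RULE_9 = [
    ("logKoa_unitless", "<=", 0.199),
    ("logP_unitless", "<=", 0.375),
    ("bond.C..O.O_carboxylicEster_alkyl", "==", True),
    ("air_half_life_units(hr)", ">", 0.000),
]


def matches(rule, chemical):
    """Return True if the chemical satisfies every condition of the rule."""
    return all(OPS[op](chemical[name], value) for name, op, value in rule)


# Hypothetical chemical described by normalized descriptor values.
candidate = {
    "logKoa_unitless": 0.15,
    "logP_unitless": 0.30,
    "bond.C..O.O_carboxylicEster_alkyl": True,
    "air_half_life_units(hr)": 0.02,
}
print(matches(RULE_9, candidate))  # True
```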
Rule 5 is the most important, as it covers the largest number (1347) of positive examples with an accuracy of 95.73%. A number of studies and reports state that 95% of the chemicals used as fragrance components in cosmetic products are of synthetic origin, derived from petroleum. Odour components, whether of synthetic or natural origin, are the second most frequent allergens, accounting for 12.5% of all reactions [30]. These chemicals can cause symptoms such as respiratory irritation, aggravated asthma, allergic reactions, mucosal irritation, migraines, headaches, skin problems, cognitive problems, gastrointestinal problems, contact dermatitis, urticaria and photosensitivity. Examples include Lyral (a synthetic lily scent), nitro and polycyclic musks, and amyl cinnamal (usually synthetic, though it may be of natural origin, with a floral, jasmine-like scent). These compounds are widely used as fragrances in personal care products such as cosmetics and perfumes. In recent years, many preparations have been marketed that are labelled as odourless but contain plant ingredients or oils. Concealed allergens include rose oil, vanilla and sweet almond oil. Lilial (butylphenyl methylpropional) causes contact dermatitis and is often found as a fragrance ingredient in perfumes, shampoos, bath preparations and lotions [31].
Lilial is most commonly obtained synthetically via cross-aldol condensation between para-tert-butylbenzaldehyde and propanal, followed by hydrogenation of the intermediate alkene. It is a clear, viscous liquid with a strong floral scent. In addition to causing contact dermatitis, it is suspected of affecting the endocrine system and estrogen activity [31]. Citronellol is a colourless oily liquid with a rose-like floral scent that is used in cleansers, hair care products, lipsticks and perfumes. It is a known skin allergen, causes eczema, and often causes complications in people with psoriasis [32]. Nitro musks and polycyclic musks are two common and important classes of synthetic musks currently in use [33]. Because of their strong photochemical toxicity [34], carcinogenicity [35], neurotoxic properties and endocrine-disrupting effects, the use of nitro musks (e.g. musk xylene) is being monitored in Japan and is under scrutiny in the EU, and further research is being conducted on their potential adverse effects on human health.
The process of high-throughput screening of a set of unknown chemicals using the generated QSUR models would consist of the following steps:
1. Prediction of the chemical function using the bagging SVM model.
2. Elimination of chemicals whose non-function is predicted with a probability less than Pr.
3. Application of the DT model to predict function on the filtered set of chemicals from step 2.
4. For each chemical, each of the 43 QSUR models generates a result.
5. The result with the highest confidence determines the function for which that chemical can be a substitute.
6. If two or more results have the same confidence, the result obtained by the model with the highest positive-class precision wins.
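The final selection steps of this procedure (highest confidence, with ties broken by positive-class precision) can be sketched as follows. The data structure and the confidence values are hypothetical; the precision values are those reported in Table 2 for the three functions shown.

```python
def select_function(predictions):
    """Pick the winning function for one chemical.

    `predictions` maps function name -> (confidence, model_precision) for
    every QSUR model that predicted the positive class. The highest
    confidence wins; ties go to the model with the higher precision.
    Returns None when no model predicted a function.
    """
    if not predictions:
        return None
    return max(predictions, key=lambda f: predictions[f])


# Hypothetical confidences; precisions taken from Table 2.
preds = {
    "fragrance": (0.95, 0.9194),
    "perfumer": (0.95, 0.8143),
    "masking agent": (0.90, 0.7800),
}
print(select_function(preds))  # fragrance
```

Comparing the (confidence, precision) tuples directly implements both rules at once, since Python orders tuples by the first element and falls back to the second only on a tie.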
The next step is to generate a model for the prediction of weight fractions on the training set Fuse_Str_Pc, in which the nominal variables are transformed into numerical dummy variables and the numerical variables are normalized by a 0–1 rank transformation. By training a bagging multi-class SVM learner, i.e. by applying the grid-search procedure in combination with 5-fold cross-validation, the optimal parameter combination SVM.C = 750.0 and SVM.gamma = 0.1 was obtained, with the following predictive performance: accuracy 88.47%, classification error 11.53%, mean recall 83.44% and mean precision 85.82%.
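The two pre-processing transformations can be illustrated with a short sketch. The exact definition of the rank transformation is not given in the text, so the version below is an assumption: values are replaced by their zero-based ranks scaled to [0, 1], with tied values sharing their average rank. The input values and category names are hypothetical.

```python
def rank_normalize(values):
    """Map values to [0, 1] by rank; tied values share their average rank."""
    n = len(values)
    if n == 1:
        return [0.0]
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the group of values tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2  # average zero-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return [r / (n - 1) for r in ranks]


def dummy_encode(values):
    """Return the sorted categories and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]


print(rank_normalize([10, 1000, 50, 50]))            # [0.0, 1.0, 0.5, 0.5]
print(dummy_encode(["shampoo", "lotion", "shampoo"]))
```

A rank-based scaling is insensitive to the magnitude of outliers, which may explain why thresholds such as logP_unitless > 0.375 in the DT rules are expressed on a 0–1 scale rather than in the original units.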
To predict the fractions of chemicals in a product category, Isaacs et al. (2016) [13] generated a classification model by generalizing fractions into three categories: low, medium and high. They used the structural and physical-chemical properties of chemicals, as well as their functional uses, as predictors. Using the random forest method, a model with a 5-fold cross-validation error of 16.7% was generated. The potential for misclassification is highest for high fractions (about 22%), while for low and medium fractions the class precision is about 84%. Accordingly, when the model is applied to an unknown dataset, less than 1% of chemicals are predicted to have high fractions (30%–100% of total weight), while 35% and 65% of chemicals are predicted to have medium and low fractions, respectively. Therefore, the predictive performance of the model for the high-fraction chemicals, which make up the majority of cosmetic product composition, is not satisfactory.
With our model, the precision error for the high class is 15%, which is better than the 22% reported by Isaacs et al. (2016) [13], and the 5-fold classification error is lower by about 5 percentage points. This indicates that the multi-class bagging SVM model has a better predictive potential for chemicals present at high fractions in cosmetics than the random forest model used in that study.
As with the QSUR models, generating a DT model on the bagging SVM output, taking only the high-Pr results (those voted for by a large number of the SVM models obtained by bagging) for each class, should further improve predictive performance. However, since most of the results on the training set had a Pr greater than 0.99 (meaning that more than 99% of the generated SVM models voted for that result), the DT on the refined dataset had almost the same predictive performance as the SVM.
Nevertheless, the DT model generated on the SVM output is useful, as it provides explicit classification rules for low, medium and high fractions, from which it can be concluded which properties characterize the chemicals that cosmetic products contain to the greatest extent. Table 4 shows some of the most significant rules generated by the DT model for all three classes (those with the highest accuracy and coverage).
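The quantities compared here (classification error, per-class precision and the mean recall and precision) all follow from a multi-class confusion matrix. The sketch below shows the computation on a hypothetical 3×3 matrix; the counts are illustrative only and do not reproduce the results of either study.

```python
def multiclass_metrics(confusion, labels):
    """Accuracy, per-class precision/recall and their unweighted means.

    confusion[i][j] counts examples of true class labels[i] predicted as labels[j].
    """
    n = len(labels)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    recall = {labels[i]: confusion[i][i] / sum(confusion[i]) for i in range(n)}
    precision = {
        labels[j]: confusion[j][j] / sum(confusion[i][j] for i in range(n))
        for j in range(n)
    }
    return {
        "accuracy": correct / total,
        "precision": precision,
        "recall": recall,
        "mean_precision": sum(precision.values()) / n,
        "mean_recall": sum(recall.values()) / n,
    }


# Hypothetical confusion matrix (rows: true class, columns: predicted class).
labels = ["low", "medium", "high"]
confusion = [
    [80, 15, 5],
    [10, 170, 20],
    [0, 15, 85],
]
result = multiclass_metrics(confusion, labels)
print(f"classification error: {1 - result['accuracy']:.2%}")
print(f"precision error (high): {1 - result['precision']['high']:.2%}")
print(f"mean recall: {result['mean_recall']:.2%}")
```

The precision error of a class is one minus its precision, i.e. the share of chemicals predicted to be in that class that actually belong to another one.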
Table 4
Rules derived from DT models for weight fractions

| Rule | Predicted class | Covered examples | Accuracy | Conditions (all must hold) |
|---|---|---|---|---|
| Rule 1 | Medium | {Low = 135, Medium = 4193, High = 60} | 95.55% | logKoa_unitless > 0.086; bond.S..O.O_sulfonate = false; bond.C.O_carbonyl_generic = false; chain.aromaticAlkane_Ar.C_meta = false; logP_unitless > 0.352; chain.alkaneBranch_t.butyl_C4 = false; henrys_law_constant_units(atm-m3/mol) ≤ 0.000; bond.X.any._halide = false |
| Rule 2 | Low | {Low = 174, Medium = 22, High = 1} | 88.32% | logKoa_unitless > 0.086; bond.S..O.O_sulfonate = false; bond.C.O_carbonyl_generic = false; chain.aromaticAlkane_Ar.C_meta = true |
| Rule 3 | Medium | {Low = 24, Medium = 1301, High = 16} | 97.01% | logKoa_unitless > 0.086; bond.S..O.O_sulfonate = false; bond.C.O_carbonyl_generic = true; logP_unitless > 0.523; bond.CC..O.C_ketone_alkene_cyclic_2.en.1.one_generic = false; logP_unitless ≤ 0.685; bond.C.O_carbonyl_1_2.di = false |
| Rule 4 | High | {Low = 0, Medium = 0, High = 114} | 100% | logKoa_unitless > 0.086; bond.S..O.O_sulfonate = false; bond.C.O_carbonyl_generic = true; logP_unitless ≤ 0.523; bond.C..O.O_carboxylicEster_acyclic = false; air_half_life_units(hr) > 0.001 |
| Rule 5 | Low | {Low = 595, Medium = 167, High = 0} | 78.08% | logKoa_unitless > 0.086; bond.S..O.O_sulfonate = true |
| Rule 6 | High | {Low = 12, Medium = 209, High = 830} | 78.97% | logKoa_unitless ≤ 0.086; logP_unitless ≤ 0.466; bond.C..O.O_carboxylicAcid_alkyl = false |

Thus, for example, it can be concluded from Rule 3 in Table 4 that cosmetic formulations can contain up to 30% of chemicals whose normalized octanol-water partition coefficient (logP) lies between 0.523 and 0.685. Because such chemicals are poorly soluble in water, they can be significant water pollutants. The same rule shows that cosmetic products can contain up to 30% of chemicals with a carbonyl group, some of which are dangerous to human health. One substance that contains a carbonyl group and is a common ingredient in various cosmetic products, including liquid soaps, shampoos and shower creams/lotions, is formaldehyde [36, 37]. According to the International Agency for Research on Cancer, formaldehyde belongs to the group of human carcinogens because there is sufficient evidence that it causes cancer in humans. This classification is based on the finding that formaldehyde can lead to nasopharyngeal cancer in humans after inhalation and to squamous cell carcinoma of the nasal passages in rats [38]. This is why formaldehyde and paraformaldehyde are used as preservatives at concentrations of up to 0.1% in oral hygiene products (not to be used in aerosol products) and up to 0.2% in other cosmetic products [39].
To predict the unknown fraction of a chemical in a cosmetic product with the generated multi-class SVM model, the model input must provide the product category, the function the chemical has in the product, the structural descriptors of the chemical and its physicochemical properties. Based on this information, the model predicts whether the chemical is present in the specified category at a low, medium or high weight fraction.