Improving Structure Based Models for Predicting Chemical Functions and Weight Fractions in Cosmetic Products using Ensemble Support Vector Machine

Through the use of a large number of cosmetic products, consumers are very often exposed to toxic chemicals. This paper proposes a model for predicting chemical functions and weight fractions in these products based on the structural and physical-chemical properties of the chemicals. Because of class imbalance, we used the Support Vector Machine (SVM) method, which can complement a smaller class with the examples that are most similar to it and identify the examples that are most different. The generality of the SVM method was additionally enhanced by combining it with ensemble Bootstrap Aggregation (Bagging). The research results show that the proposed bagging SVM method can overcome the disadvantages of previously applied methods. Further, it can help address the lack of information needed to assess exposure to risk from the use of cosmetic products containing toxic chemicals. The proposed models can be applied to predict whether a certain chemical may be a substitute for a function performed by another, possibly toxic, chemical in a cosmetic product, as well as to determine the weight fraction of a certain dangerous chemical on the basis of its chemical structure and physical-chemical properties.


Introduction
Instead of the beneficial effect of a cosmetic product, undesirable and harmful reactions often appear at the area of application, and sometimes in distant parts of the skin. This may be due to the active cosmetic substances and also to the components of the vehicle. Irritant-toxic reactions occur as a result of the use of one or several harmful chemicals and will occur in anyone who uses such a product. An allergic reaction is caused by a sensitivity to one or several substances. An irritant acts as follows: it reacts chemically with the cells of the epidermis, dissolves or removes some of the essential metabolites, disrupts the osmotic balance of the cells, reduces the membrane potential and thus changes the cell permeability, and irreversibly denatures the proteins. Allergic reactions are more common than irritant ones. Acute and chronic inflammations of the skin are acute allergic dermatitis. Allergic contact skin inflammation is most commonly caused by low molecular weight substances. These are partial antigens that, by binding to the epidermal and dermal proteins, become full antigens and cause an allergic reaction or specific hypersensitivity. Some low molecular weight substances are converted to primary irritants or allergens only after exposure to UV-A and UV-B radiation and short visible spectrum waves (400-600 nm wavelength) [1,2,3,4].
Users of cosmetic products are exposed on a daily basis to the toxic chemicals used in the manufacture of these products. In most cases, manufacturers provide complete information neither about the qualitative composition, i.e. the set of chemicals the products contain, nor about the quantitative composition, i.e. the weight fractions of the chemicals. The main goals of green chemistry are to reduce the use of toxic chemicals (carcinogenic ones as well as those that affect the reproductive and nervous systems) while preserving the functionality of the chemical ingredients and the efficiency of the product. Classification of chemicals by the chemical function they perform may make it easier to find less toxic substitutes among chemicals that have the same functionality [5,6]. The risk of exposure is highly dependent on the quantity of toxic chemical used, and that information is often unavailable or incomplete due to the lack of adequate regulations and the business policies of manufacturers [7,8,9]. A large number of chemicals appear on the market, while no high-quality in vivo toxicity data exist for most of them, and for many there are no toxicity data at all [10]. The standard method for toxicity assessment is high-throughput screening (HTS) applied by the United States Environmental Protection Agency (EPA). This method involves assessing the potential harmfulness of a chemical in vitro by quantifying its bioactivity. The EPA evaluated about 8,000 chemicals using this method. However, there are approximately 100,000 chemicals in use in the US market, which are difficult to test individually [6]. The EPA is currently working to develop methods for the substitution of chemicals that have been identified as hazardous with safe chemicals, aimed at providing formulations that are safe for humans and the environment [11,12]. Such methods involve identifying alternatives from available chemical databases.
They take an individual chemical to evaluate and return multiple possible alternatives (low-throughput approach). Such an approach is not efficient enough for many chemicals; therefore, methods are used that automatically identify groups of chemicals with the appropriate function in large databases (high-throughput, HT, approach) for further, more accurate testing. Recently, there have been a number of studies proposing quantitative structure-use relationship (QSUR) methods for classifying and predicting the functionality of chemicals based on their chemical structure and/or physical-chemical properties [6,13]. The aim of our research is to overcome the problem of class imbalance more efficiently, i.e. to generate QSUR models with a higher precision of the positive class and thus a more accurate prediction of chemicals that could be potential functional substitutes for use in cosmetic products.
It has been confirmed in the literature that the Support Vector Machine method can successfully solve the problem of class imbalance by eliminating data noise that leads to class overlaps, so this method can be used as a pre-processor that refines data [14,15,16]. Classifiers applied to the SVM output (the refined dataset) significantly improve their predictive performance [17,18,19]. Farquad & Bose (2012) tested SVM as a pre-processor for highly unbalanced data (94%:6%) and showed that SVM can balance data better than commonly used standard techniques such as undersampling (taking a subset of the larger class) or oversampling (supplementing the minor class with new examples). SVM provides more minor class examples by associating the most similar major class examples with the minor class. The authors also showed that when SVM refines the data and balances the classes, other classification methods (such as neural networks, logistic regression, and Random Forest) applied to the refined dataset significantly reduce the misclassification of minor classes. Recently, the SVM method has been increasingly used in biochemistry and drug design research to generate QSAR models that predict the activity of molecules based on their structural and other properties [20,21,22,23,24]. The Ensemble Bootstrap Aggregation (Bagging) method enhances the predictive performance of a learner by generating multiple models using random sampling (with replacement) of the training set and selecting random subsets of predictors when creating the models. Prediction is performed by an ensemble average of all models, and the number of models that "vote" for a result determines the probability of accuracy, or confidence, of the result.
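The voting mechanism of bagging described above can be illustrated with a minimal sketch. It is not the implementation used in this study: the base learner `train_stub` is a hypothetical one-feature threshold rule standing in for an SVM, the toy data are invented for illustration, and the random selection of predictor subsets is omitted for brevity.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) examples with replacement: one bagging replicate."""
    return [rng.choice(rows) for _ in rows]


def train_stub(sample):
    """Hypothetical base learner standing in for an SVM.

    It thresholds a single feature at the midpoint between the class means.
    """
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    if not pos or not neg:
        # Degenerate replicate containing one class only: always predict it.
        only = 1 if pos else 0
        return lambda x: only
    pos_mean = sum(pos) / len(pos)
    neg_mean = sum(neg) / len(neg)
    cut = (pos_mean + neg_mean) / 2
    if pos_mean >= neg_mean:
        return lambda x: 1 if x > cut else 0
    return lambda x: 1 if x < cut else 0


def bagging_predict(models, x):
    """Return the majority label and the share of models that voted for it."""
    votes = Counter(model(x) for model in models)
    label, count = votes.most_common(1)[0]
    return label, count / len(models)


rng = random.Random(0)
# Toy, imbalanced data: (feature value, class label); invented for illustration.
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0), (0.8, 1), (0.9, 1)]
models = [train_stub(bootstrap_sample(data, rng)) for _ in range(25)]
label, confidence = bagging_predict(models, 0.95)
print(label, confidence)
```

The returned share of votes plays the role of the prediction probability (confidence) referred to in the text.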
Taking into consideration the prior research, this paper proposes a bagging SVM method as a pre-processor of data to generate QSUR models, aiming at enhancing their predictive performance. Also, to solve the problem of misclassification of the high-fraction class of chemicals highlighted by Isaacs et al. (2016) [13], a multi-class classification is proposed using the LibSVM implementation of the SVM method [25,26], which applies the one-against-one approach to k-class learning problems by solving binary SVMs. Hsu & Lin (2002) [27] have shown that the one-against-one approach is better than the one-against-all approach for large datasets and that its training time is shorter.
The remainder of the paper is organized as follows. The second section describes the data and methods used, while the third section presents and discusses the results obtained. Finally, concluding considerations are given.
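As a small illustration of the one-against-one scheme, the sketch below enumerates the k(k-1)/2 binary problems for the three weight-fraction classes and combines their outputs by majority vote. The pairwise outcomes in `winners` are hypothetical placeholders for the decisions of trained binary SVMs; tie-breaking rules of a real implementation are not modelled here.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_pairs(classes):
    """List every unordered pair of classes: one binary classifier per pair."""
    return list(combinations(classes, 2))


def one_vs_one_predict(pairwise_winners):
    """Majority vote over the winners reported by the pairwise classifiers."""
    votes = Counter(pairwise_winners.values())
    return votes.most_common(1)[0][0]


classes = ["low", "medium", "high"]
pairs = one_vs_one_pairs(classes)

# Hypothetical decisions of the three binary classifiers for one chemical.
winners = {
    ("low", "medium"): "medium",
    ("low", "high"): "high",
    ("medium", "high"): "medium",
}
predicted = one_vs_one_predict(winners)
print(len(pairs), predicted)
```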

Materials And Methods
This section defines the data and their sources, as well as the predictive methods that will be applied.

Data
This paper uses two publicly available datasets that were formed in the research described in the previous section.
Using a publicly available database of the chemical ingredients of cosmetics and the functions they have in a product (CosIng), Isaacs et al. (2016) [13] formed the Functional Use (FUse) database. They reduced the number of functions by harmonizing similar functions into a common one. Each chemical in this database has its own unique CASRN code, while one chemical can have multiple functions (in different products). Data on the weight fractions of chemicals in individual cosmetic products were obtained from the Consumer Product Chemical Profile Database (CPCPdb) and from on-line product information provided by manufacturers (Material Safety Data Sheets, MSDS). The products were assigned to the appropriate category, and each chemical within a product was given the appropriate harmonized function taken from the FUse database, merged on the unique CASRN code of the chemical. Thus, they obtained a dataset in which each row contains a chemical as an ingredient in a product of a certain category, with a unique harmonized function and the weight fraction that it has in that category (the unique identifier is the CASRN of the chemical, the product category and the function of the chemical). This dataset is coupled via CASRN to the corresponding physical and chemical properties of the chemicals and to their use in manufacturing. The combined dataset includes 828 chemicals and 17,103 of their functional uses (35 harmonized functions) in 4,115 cosmetic products, i.e. 66 categories of these products. Phillips et al. (2017) [6] expanded the FUse database from the previous research with new chemicals and their functions in consumer products, using product composition information from the manufacturers' web pages. The functions are harmonized so that each chemical has a unique function. This set of chemicals is coupled, via the unique CASRN code, to sets of chemicals that have structural and physical-chemical descriptors.
Structural descriptors are publicly available and were taken from the EPA's Distributed Structure-Searchable Toxicity (DSSTox) database, while physical-chemical descriptors (molecular weight, vapour pressure, water solubility, Henry's Law constant, the log of the octanol-air partition coefficient, log(Koa), the log of the octanol-water partition coefficient, log(Kow), the half-life of a chemical in soil, sediment, water, and air, and the persistence of a chemical in the environment) were obtained using the US EPA's Estimation Program Interface (EPI) Suite. Thus, a training set of 4,791 unique chemicals was obtained with 729 structural properties and 11 physical-chemical properties. After retaining only harmonized functions that include at least 10 chemicals, 49 functions remained in the training set.
Our first dataset was taken from the FUse database created in the research of Isaacs et al. (2016) [13], covering the functional use and weight fraction of chemicals in cosmetic product categories (17,103 functional uses). The data were merged (via the unique CASRN) with the structural descriptors (729 descriptors in total) and physical-chemical properties of chemicals (11 descriptors in total) taken from the supplementary materials of the study of Phillips et al. (2017) [6]. Structural descriptors are binary (dummy) variables that have a value of 1 if a chemical has the structural property defined by that descriptor (for example, atom.element_main_group, atom.element_metal_group_I_II, bond.CC..O.C_ketone_alkane_cyclic_.C4., etc.) and a value of 0 if it does not. Physical descriptors are numerical variables such as molecular weight, vapour pressure, water solubility, Henry's Law constant, the log of the octanol-air partition coefficient, log(Koa), the log of the octanol-water partition coefficient, log(Kow), the half-life of a chemical in soil, sediment, water, and air, and the persistence of a chemical in the environment.
After merging and eliminating the functions that have fewer than 10 uses, 11,240 functional uses remain in this dataset, which represents the final training set (Fuse_Str_Pc) (Table A1 in the Appendix and in the supplementary materials). Following Isaacs et al. (2016) [13], weight fractions were generalized to three categories, low (0.0001-0.01), medium (0.01-0.3) and high (0.3-1), so as to apply a classification predictive method. In this way, the dataset was prepared for the prediction of weight fractions of chemicals in cosmetic products based on structural and physical-chemical descriptors, functional uses and categories of cosmetic products, using the multi-class SVM bagging method.
The second dataset was taken from the expanded FUse database created in the study by Phillips et al. (2017) [6], which contains data on chemicals and their harmonized functions in consumer products. As in the previous case, the data were merged (via the unique CASRN) with structural and physical-chemical descriptors taken from the same work, and functions that included fewer than 10 chemicals were removed. Thus, a final training set (HFunc_Str_Pc) was obtained, comprising 4,665 unique chemicals with 729 structural features, 11 physical-chemical properties and 43 harmonized functions (Table A2 in the Appendix and in the supplementary materials). This set was used to generate the QSUR models (for the prediction of functional uses based on structural and physical-chemical descriptors) by the SVM bagging method.
The following sections define the SVM method used and the predictive procedures.
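The generalization of weight fractions into three classes can be sketched as follows. The interval limits are those given above; how a value lying exactly on a limit (0.01 or 0.3) is assigned is not stated in the text, so assigning it to the higher class is an assumption of this sketch, as is the function name `fraction_class`.

```python
def fraction_class(weight_fraction):
    """Map a weight fraction in [0.0001, 1] to 'low', 'medium' or 'high'.

    Assumption: a value exactly on a class limit goes to the higher class.
    """
    if not 0.0001 <= weight_fraction <= 1.0:
        raise ValueError("weight fraction outside the range 0.0001-1")
    if weight_fraction < 0.01:
        return "low"
    if weight_fraction < 0.3:
        return "medium"
    return "high"


for w in (0.005, 0.05, 0.5):
    print(w, fraction_class(w))
```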

SVM
To classify linearly non-separable classes, i.e. in the case of class overlaps and noise in the data, Vapnik (2010) [28] proposed the SVM method, which maps the data (viewed as n-dimensional vectors) from the original space to a feature space where the classes can be separated using a hyperplane. Such a hyperplane is found by maximizing the distance between it and the closest points of each class (the support vectors), so that the gap between the classes, i.e. the margin, is as large as possible. Instead of an explicit mapping function to a higher-dimensional space, a kernel function is used, which allows the scalar product of the mapped vectors to be calculated in the original space (kernel trick). Maximizing the margin in the higher-dimensional space is thereby reduced to a quadratic programming optimization problem in the original space, using the kernel function. Different kernel functions can be used, but the Radial Basis Function (RBF) is often the most efficient and the most widely used [29]. SVM classifiers are trained by choosing the optimal values of the gamma parameter of the RBF kernel and of the parameter C, which controls the penalty for violating the margin, i.e. the empty space between the classes. Choosing smaller values for parameter C reduces overfitting and increases the generality of the SVM model, i.e. its predictive performance.
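For reference, the RBF kernel in its usual form, K(x, z) = exp(-gamma * ||x - z||^2), can be written in a few lines. This is a sketch of the kernel only, not of SVM training; the function name `rbf_kernel` and the example vectors are chosen here for illustration.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


# Identical vectors give the maximum similarity of 1.
print(rbf_kernel([1.0, 0.0], [1.0, 0.0], gamma=0.1))
# Similarity decays with distance; a larger gamma makes the decay faster.
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5))
```

A larger gamma makes each support vector influence a smaller neighbourhood, which is why gamma is tuned together with C.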

Grid-search and k-fold cross-validation
A grid-search technique combined with k-fold cross-validation is used to select the optimal SVM parameters (C and γ). Grid-search sets appropriate ranges and steps for the parameters (i.e. defines a grid) and then tests their combinations so that the best predictive learner performance is reached. The k-fold cross-validation process is implemented by dividing the training set into k subsets, of which k-1 are used to train the model and the remaining one is used to test the predictive performance of the model on unknown examples. The procedure is repeated iteratively so that each of the k subsets serves as a test set. The final predictive performance is the average of the model performances obtained in these k iterations.
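The interplay of grid-search and k-fold cross-validation can be sketched as below. The scoring function `toy_evaluate` is a hypothetical stand-in for "train an SVM with (C, gamma) on the training folds and score it on the test fold"; the grid values are illustrative, and the folds are contiguous without shuffling or stratification, which a real study on imbalanced data would likely add.

```python
from itertools import product


def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds of range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def grid_search(c_values, gamma_values, evaluate, n, k=5):
    """Return (mean score, C, gamma) of the best grid point under k-fold CV."""
    best = None
    for c, gamma in product(c_values, gamma_values):
        scores = [evaluate(c, gamma, train, test)
                  for train, test in k_fold_indices(n, k)]
        mean_score = sum(scores) / len(scores)
        if best is None or mean_score > best[0]:
            best = (mean_score, c, gamma)
    return best


def toy_evaluate(c, gamma, train, test):
    """Hypothetical score that peaks at C = 10 and gamma = 0.1."""
    return 1.0 - abs(c - 10) / 1000 - abs(gamma - 0.1)


best_score, best_c, best_gamma = grid_search(
    [1, 10, 100], [0.01, 0.1, 1.0], toy_evaluate, n=23)
print(best_c, best_gamma)
```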

Results And Discussion
In order to predict the functions of chemicals in cosmetic products based on their structural and physical-chemical properties, i.e. to generate QSUR models, bagging SVM learners were trained on the training set HFunc_Str_Pc (numerical data were normalized by a 0-1 range transformation). Model training involves optimizing the C and γ parameters using 5-fold cross-validation so that maximum predictive learner performance is achieved. Using the grid-search technique, i.e. setting appropriate ranges for the parameters, optimal combinations of these parameters were found for the 43 bagging SVM models that predict each of the 43 harmonized functions. Table 1 shows the optimal parameters and predictive performances of the generated models. As can be seen from Table 1, the accuracy of the models based on 5-fold cross-validation ranges from 81.63% to 99.98%. All models are valid, in contrast to the results obtained by Phillips et al. (2017) [6], where 8 of the 46 models had a balanced accuracy of less than 75%. For very important functions such as masking agent, solvent, viscosity controlling agent and perfumer, models with accuracy greater than 96.40% were obtained. However, it should be taken into consideration that this is not a balanced accuracy, so much of the accuracy falls on the major (negative) class (the true negative rate, i.e. specificity, averages 98.94%, while the true positive rate, i.e. sensitivity, averages 43.99%). The lower sensitivity of the models indicates that the models are not overfitted, i.e. too dependent on the training set for the positive class, which gives them the potential for better performance on an unknown dataset. The high specificity of the models arises because the negative class is much larger than the positive class.
The low precision of the positive class, which averages 13.73%, indicates that a number of examples of the negative class (chemicals that do not have the function) that are the least distant from the positive class (i.e. most similar in structure and physical-chemical properties to chemicals that have this function) are associated by the SVM classifier with the positive (minor) class.
Applying the resulting bagging SVM models to the training set, a prediction was generated for the 43 functions. Each bagging prediction is produced by multiple SVM models and accordingly has its own probability. A prediction with high probability means that most of the SVM models thus generated voted that a chemical has, or does not have, a certain function.
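The performance measures discussed here follow the standard confusion-matrix definitions: sensitivity is the true positive rate, specificity is the true negative rate, and precision is the share of positive predictions that are correct. A short sketch shows why plain accuracy can be high on imbalanced data while sensitivity and precision stay low. The counts used are hypothetical and are not taken from Table 1.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard measures computed from the cells of a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": tp / (tp + fp),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical counts for a strongly imbalanced problem (100 positives, 9,900 negatives).
m = binary_metrics(tp=40, fp=100, tn=9800, fn=60)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these invented counts, accuracy exceeds 98% although fewer than half of the positives are found, which mirrors the caveat about unbalanced accuracy made above.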
The goal now is to increase the precision and sensitivity of the model (precision of a positive class and true positive rate) by taking from a large number of negative class representatives only those chemicals that are farthest from the chemicals belonging to the positive class (those having the structure and properties differing to the greatest extent). Those members of the negative class to be selected were determined by bagging SVM through assigning them the highest probability of belonging to the negative class.
Therefore, the next step is to determine the optimal threshold (minimum probability) Pr for the predictions that will be accepted. For this purpose, the decision tree (DT) method with the gini_index measure for partitioning [28] and 5-fold cross-validation were used. Specifically, the optimal threshold for the probability of an SVM prediction was determined for each of the 43 models as follows. Chemicals whose major class prediction had a probability lower than the Pr value were excluded from the training set, and the 5-fold predictive performance of the DT learner was tested. The parameter Pr was chosen so as to obtain the maximum predictive performance of the DT learner. Table 2 shows the values of this parameter as well as the predictive performances of the QSUR models thus obtained for each of the 43 functions. It can be seen from Table 2 that the average accuracy of the final QSUR models is 95.95%, the precision averages 88.49%, while the average sensitivity and specificity are 80.57% and 97.40%, respectively. Thus, after removing from the training set the examples of the negative class whose probability of belonging to that class is less than the Pr threshold, precision increased on average by 74.76 percentage points and sensitivity by 36.58 percentage points, while specificity decreased on average by 1.54 percentage points.
In the study of Phillips et al. (2017) [6], 49 balanced random forest models were generated for the 49 harmonized functions, of which 41 were valid (with a balanced accuracy of >75%). For 8 functions (which include some of the most important ones, such as perfumer and solvent) no valid models were obtained, i.e. random balanced undersampling was not effective enough to capture significant differences in the structure and physical-chemical properties of these chemicals compared to the others. Most models recognized well the chemicals that make up the positive class in the training set (the sensitivity of the models averages about 85%); however, the average precision (of the positive class) is only about 14%, which means that the predictive power of the models is weak. This is due to the large number of false positives (chemicals that do not have a specific function but are misclassified by the model as having it). To identify the chemicals that could be functional substitutes, the generated models were applied to 6,356 chemicals in the Tox21 library for which structural and physical-chemical descriptors are available.
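The refinement of the training set by the threshold Pr can be sketched as a simple filter. The record layout (`id`, `label`, `p_negative`) and the example values are hypothetical; `p_negative` stands for the bagging SVM probability that a chemical belongs to the negative (major) class. Selecting the value of Pr itself would wrap this filter in a loop that cross-validates a decision tree for each candidate threshold, which is not shown here.

```python
def refine_training_set(examples, pr):
    """Keep all positives and only those negatives predicted negative with probability >= pr."""
    return [e for e in examples
            if e["label"] == 1 or e["p_negative"] >= pr]


# Hypothetical records: label 1 = has the function, 0 = does not.
examples = [
    {"id": "chem-1", "label": 1, "p_negative": 0.10},
    {"id": "chem-2", "label": 0, "p_negative": 0.99},
    {"id": "chem-3", "label": 0, "p_negative": 0.55},  # too close to the positive class
    {"id": "chem-4", "label": 0, "p_negative": 0.93},
]

refined = refine_training_set(examples, pr=0.9)
print([e["id"] for e in refined])
```

This is the sense in which the bagging SVM output acts as an informed, rather than random, undersampling of the major class.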
Consistent with the small precision of the positive class of obtained models, about 88% of the predictions were invalid (with a probability of less than 50%).
Comparing the performances of the QSUR models thus obtained with the results from the confusion matrix obtained by Phillips et al. (2017) [6] (Table A3 below), it can be seen that their accuracy averaged 91.81%, sensitivity 84.62%, specificity 91.83%, and precision 13.73%. The precision of our QSUR models is significantly improved over theirs, while the other three indicators are similar. This suggests that the predictive potential of our QSUR models for the positive class (chemicals having some function) is increased.
The bagging SVM prediction pre-processed the data for the DT learner and significantly improved its predictive performance even in the case of strong class imbalance. Instead of the random undersampling used in the study by Phillips et al. (2017) [6] to ensure class balancing, the major class was undersampled here on the basis of the bagging SVM prediction, i.e. only examples predicted with a higher probability of belonging to the major class were included in the sample. Thus, valid models with high predictive performance were obtained for all 43 functions.
Based on the generated DT models, it can be concluded which structural and physical-chemical properties are most responsible for distinguishing the chemicals belonging to the positive class from the others. Thus, chemicals that have these properties can be identified and tested as potential substitutes.
For example, the rules for the positive class derived from the DT model for the fragrance function in Table 3 show the structural and physicochemical properties that potential substitutes for this function in cosmetic products should satisfy. [30]. The chemicals can cause symptoms such as respiratory irritation, aggravated asthma, allergic reactions, mucosal irritation, migraines, headaches, skin problems, cognitive problems, gastrointestinal problems, contact dermatitis, urticaria and photosensitivity. Some of these chemicals are lyral (synthetic lily scent), nitro and polycyclic musks, amyl cinnamal (usually of synthetic origin, though it may be of natural origin, having a floral jasmine-like scent), etc. These compounds are widely used as fragrances in various personal care products such as cosmetics and perfumery. In recent years, a large number of preparations have been marketed that are labelled as odourless but contain vegetable ingredients or oils. Concealed allergens include rose oil, vanilla and sweet almond oil. Lilial (butylphenyl methylpropional) causes contact dermatitis and is often found as a fragrance ingredient in perfumes, shampoos, bath preparations and lotions [31].
It is most commonly obtained synthetically via cross-aldol condensation between para-tert-butylbenzaldehyde and propanal, followed by hydrogenation of the intermediate alkene. It is a clear, viscous liquid with a strong floral scent. In addition to causing contact dermatitis, it is suspected to have an effect on the endocrine system and estrogen activity [31]. Citronellol is a colourless oily liquid with a floral, rose-like scent, which is used in cleansers, hair care products, lipsticks and perfumes. It is a known skin allergen, causes eczema, and often causes complications in people with psoriasis [32]. Nitro and polycyclic musks are two common and important groups of synthetic musks currently in use [33]. Because of the strong photochemical toxicity [34], carcinogenicity [35], neurotoxic properties and endocrine-disrupting effects of nitro musks (e.g. musk xylene), their use is being monitored in Japan and is under scrutiny in the EU, and further research is being conducted on their potential adverse effects on human health.
The process of high-throughput screening of a set of unknown chemicals using the generated QSUR models would consist of the following steps:
1. Prediction of the chemical function using the bagging SVM model.
2. Elimination of chemicals whose non-function is determined with a probability less than Pr.
3. Application of the DT model to predict the function on the purified set of chemicals from step 2.
4. For each chemical, each of the 41 QSUR models will generate a result.
5. The result that has the highest confidence will determine the function for which that chemical can be a substitute.
6. If two or more results have the same confidence, then the result obtained by the model with the highest precision of the positive class will be the winning one.
The next step is to generate a model for the prediction of weight fractions on the training set Fuse_Str_Pc, in which the nominal variables are transformed into numerical dummy variables, while the numerical variables are normalized by a 0-1 range transformation. By training a bagging multi-class SVM learner, i.e. by applying the grid-search procedure in combination with 5-fold cross-validation, the optimal combination of parameters SVM.C = 750.0 and SVM.gamma = 0.1 was obtained, and the following predictive performance was achieved: accuracy 88.47%, classification error 11.53%, mean recall 83.44% and mean precision 85.82%.
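Steps 5 and 6 of the screening procedure listed above amount to picking the positive result with the highest confidence and breaking ties by the precision of the corresponding model. A minimal sketch follows; the function names, confidences and precisions are hypothetical values used only to exercise the tie-break.

```python
def pick_function(confidences, model_precision):
    """Choose the function with the highest confidence; ties go to the more precise model.

    confidences: function name -> confidence of the positive prediction for one chemical
    model_precision: function name -> positive-class precision of that function's model
    """
    if not confidences:
        return None  # no model predicted a function for this chemical
    return max(confidences,
               key=lambda f: (confidences[f], model_precision[f]))


# Hypothetical results for one chemical: two functions tie on confidence.
confidences = {"fragrance": 0.92, "solvent": 0.92, "preservative": 0.80}
model_precision = {"fragrance": 0.85, "solvent": 0.90, "preservative": 0.95}

winner = pick_function(confidences, model_precision)
print(winner)
```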
For the purpose of predicting the fractions of chemicals in a product category, Isaacs et al. (2016) [13] generated a classification model by generalizing fractions into three categories: low, medium, and high. They used structural and physical-chemical properties of chemicals as well as their functional uses as predictors. Using the random forest method, a model with a 5-fold cross-validation error of 16.7% was generated. The potential for misclassification of the obtained model is highest for high fractions (about 22%), while for low and medium fractions the class precision is about 84%. Accordingly, when the model is applied to an unknown dataset, less than 1% of chemicals are predicted to have high fractions (30%-100% of total weight), while 35% and 65% of chemicals are predicted to have medium and low fractions, respectively. Therefore, the predictive performance of the model for the high-fraction chemicals that make up the majority of cosmetic product composition is not satisfactory.
With our model the precision error for the high class is 15%, which is better than the 22% obtained by Isaacs et al. (2016) [13], and the 5-fold classification error of the model is lower by about 5 percentage points. This indicates that the multi-class bagging SVM model has a better predictive potential for the class of chemicals with a high share in cosmetics than the random forest model used in that study.
As with the QSUR model, generating a DT model on the bagging SVM output, taking only the high-Pr results for each class (those voted for by a large number of the SVM models obtained by bagging), should further improve the prediction performance. However, since most of the results on the training set had a Pr greater than 0.99 (meaning that 99% of the generated SVM models voted for that result), the DT on the refined dataset had almost the same predictive performance as the SVM.
Nevertheless, the DT model generated on the SVM output is useful, as it provides explicit classification rules for low, medium and high fractions, from which it can be concluded which chemical properties are represented in cosmetic products to the greatest extent. Table 4 shows some of the most significant rules generated by the DT model for all three classes (those with the highest accuracy and coverage). It can be seen from Table 4, based on Rule 3, that cosmetic formulations can contain up to 30% of chemicals whose octanol-water partitioning coefficient is between 0.523 and 0.685.
This means that such chemicals can be significant water pollutants because of their poor water solubility. This rule also shows that cosmetic products can contain up to 30% of chemicals with a carbonyl group, some of which are dangerous to human health. A substance that contains a carbonyl group and is a common ingredient in various cosmetic products, including liquid soaps, shampoos and shower creams/lotions, is formaldehyde [36,37]. According to the International Agency for Research on Cancer, formaldehyde belongs to the group of human carcinogens because there is sufficient evidence that it causes cancer in humans. This classification is based on the fact that formaldehyde can lead to nasopharyngeal cancer in humans after inhalation and to squamous cell carcinoma of the nasal passages in rats [38]. This is why formaldehyde and paraformaldehyde are used as preservatives at concentrations of up to 0.1% in cosmetic products for oral hygiene (not to be used in aerosol products) and up to 0.2% in other products [39].
The procedure for predicting the unknown fraction of a chemical in a cosmetic product using the generated multi-class SVM model implies that the model input provides information on the product category, the function the chemical has in the product, the structural descriptors of the chemical and its physicochemical properties. Based on this information, the model will predict whether the chemical in the speci ed category is represented by a low, medium or high weight fraction.
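The assembly of such a model input can be sketched as follows: nominal inputs (product category, function) become dummy variables, structural descriptors are already binary, and numerical properties are scaled to the 0-1 interval. The vocabularies, descriptor values and property ranges below are hypothetical, and min-max scaling is assumed here as the 0-1 normalization.

```python
def one_hot(value, vocabulary):
    """Dummy-encode a nominal value against a fixed, ordered vocabulary."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def min_max(value, low, high):
    """Scale a numerical value to [0, 1] given the range seen in training."""
    if high == low:
        return 0.0
    return (value - low) / (high - low)


def build_input(category, function, structural_bits, physchem,
                categories, functions, physchem_ranges):
    """Concatenate all predictors into one numerical feature vector."""
    vector = one_hot(category, categories) + one_hot(function, functions)
    vector += [float(bit) for bit in structural_bits]
    vector += [min_max(v, low, high)
               for v, (low, high) in zip(physchem, physchem_ranges)]
    return vector


# Hypothetical vocabularies, descriptors and training ranges.
x = build_input(
    category="lotion", function="fragrance",
    structural_bits=[1, 0, 1],
    physchem=[150.0, 2.5],
    categories=["shampoo", "lotion"],
    functions=["fragrance", "solvent"],
    physchem_ranges=[(50.0, 250.0), (0.0, 5.0)],
)
print(x)
```

The resulting vector is what a trained multi-class model would map to the low, medium or high class.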

Conclusions
An assessment of the toxicity exposure of chemicals in consumer products involves knowledge of the qualitative and quantitative composition of these products. Namely, on the basis of knowledge of the structural properties and amount of chemicals used in the product, the negative impact of the product on the consumer and the environment can be assessed.
This paper proposes methods that, based on available information on the functional and quantitative use of chemicals in thousands of real consumer products, generate predictive chemical classification models based on the function and weight fractions that chemicals have in cosmetic products. With these models, the composition of products with unavailable information can be estimated. These methods clearly define the approach by which large libraries of chemicals can be screened to identify potential substitutes for toxic chemicals without impairing the functions that the original chemicals have in the product. Also, a clear procedure is defined by which the weight fraction of a chemical in a cosmetic product can be estimated using the generated predictive model. Thus, one can implicitly find out the chemical composition of cosmetic products, which is inaccurate or completely inaccessible for many products.