Prediction of Micronucleus Assay Outcome Using In Vivo Activity Data and Molecular Structure Features

The in vivo micronucleus assay is a widely used genotoxicity test that determines the extent of chromosomal aberrations caused by chemicals in human beings, and it plays a significant role in the drug discovery paradigm. To reduce the uncertainty and expense of in vivo experiments, we set out to develop novel machine learning-based tools to predict the toxicity of compounds with high precision. A total of 372 compounds with known toxicity information were retrieved from the PubChem BioAssay database and the literature. Fingerprints and descriptors of the compounds were generated using PaDEL and ChemSAR, respectively. The performance of the models was assessed using a three-tier evaluation strategy: fivefold cross-validation, tenfold cross-validation, and validation on an external dataset. Further, structural alerts responsible for the genotoxicity of the compounds were identified using the SARpy method. Of note, the fingerprint-based random forest model built in our analysis achieved the highest accuracy, about 0.97, during tenfold cross-validation. In essence, our study highlights that structural alerts such as chlorocyclohexane and trimethylamine are likely to be leading causes of toxicity in humans. Indeed, we believe that the random forest model generated in this study is appropriate for reducing the number of test animals and should be considered in the future as good practice for animal welfare.


Introduction
Currently, humans are confronted with various chemicals, including cosmetics, food additives, pesticides, and drugs. These are significant causes of mutagenicity, or irreversible damage to genetic material, with multiple negative impacts on human health, including cancer [1]. Recently, regulatory authorities have implemented appropriate toxicity tests to measure the genotoxicity of compounds causing mutations at both the gene and chromosome levels. Initially, in vitro methods were implemented for examining the

Dataset Construction
The dataset used in the study was obtained from two different streams: (i) A list of 3074 compounds and their bioassay results was downloaded from the PubChem BioAssay database (https://www.ncbi.nlm.nih.gov/guide/chemicals-bioassays/) using the query "Micronucleus assay in humans" [12]. The metadata sheet contained the PubChem CID, PubChem SID, activity outcome, species/cell type, assay type, and results. Using the assay type as a filter, we retrieved 412 compounds that were tested by the micronucleus assay. These compounds were classified based on the assay results as "Positive," "Negative," and "No conclusion." Around 140 compounds labeled "No conclusion" were discarded, and the remaining 272 compounds obtained from the PubChem BioAssay database were used for model development. (ii) In addition, micronucleus assay results for a total of 100 chemicals tested in the recent decade were retrieved from published articles and used as the external validation dataset for the generated models (Table S1). The structure-data file (SDF) of the 372 compounds was downloaded using the PubChem CIDs to extract features. Moreover, the compounds' molecular weights and pairwise Tanimoto coefficients (T_c), calculated using MACCS fingerprints, were used to evaluate the distribution of the compounds [13]. The Tanimoto coefficient is the number of features shared between two compounds divided by the number of features in their union. For two compounds A and B, it is given by

T_c = N_ab / (N_a + N_b − N_ab)

where N_a is the number of features in compound A, N_b the number of features in compound B, and N_ab the number of features common to both compounds A and B.
In the current investigation, T_c was calculated using the DataStructs sub-package of rdkit, and a threshold of 0.25 was set during the analysis to demonstrate the diverse nature of the dataset [14].
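As an illustration of this definition, the calculation can be sketched in pure Python (in our pipeline the same quantity is obtained from rdkit's DataStructs utilities); the feature sets below are hypothetical bit positions, not actual MACCS keys:

```python
def tanimoto(features_a, features_b):
    """Tanimoto coefficient: shared features divided by the union of features."""
    a, b = set(features_a), set(features_b)
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

# Two hypothetical feature sets (e.g., "on" bit positions of a fingerprint)
fp_a = {1, 5, 9, 12}
fp_b = {1, 5, 20}
print(tanimoto(fp_a, fp_b))  # 2 shared / (4 + 3 - 2) = 0.4
```

A pair of identical compounds yields T_c = 1, while compounds with no features in common yield T_c = 0, so a low dataset-wide average (as reported below) indicates structural diversity.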

Molecular Fingerprint and Descriptor Generation
Molecular fingerprints are vectors encoding the structural characteristics of chemical compounds. They are widely used for virtual screening, machine learning, virtual chemical maps, and QSAR modeling [15]. In the current investigation, six different categories of molecular fingerprints were generated using the PaDEL-Descriptor software: the MACCS fingerprint (166 bits), substructure fingerprint (307 bits), PubChem fingerprint (881 bits), E-state fingerprint (79 bits), CDK fingerprint (1024 bits), and CDK extended fingerprint (1024 bits). Each bit of a fingerprint represents a chemical property of the compound.
On the other hand, the ChemSAR tool was used in this study to generate 325 features belonging to five different clusters: Basak descriptors, constitutional descriptors, Burden descriptors, MOE descriptors, and topological descriptors. These feature clusters are numerical descriptions depicting the structural and chemical properties of the compounds [16]. Altogether, 3806 features were retrieved for each compound and used for model development.

Dataset Preprocessing and Feature Selection
The biggest issues in machine learning are the curse of dimensionality and overfitting. In particular, generalizing correctly becomes exponentially harder as the dimensionality (number of features) grows. Hence, data preprocessing and feature selection strategies are incorporated into the machine learning pipeline to reduce the detrimental effects of overfitting and to produce significant predictive models [17]. Initially, the data were preprocessed by calculating the variance and correlation of all 3806 features. Features with low variance carry little information for discriminating between classes, while the correlation between features and the target variable assists in selecting the essential features. Subsequently, three different criteria reported in the recent literature were implemented to clean the dataset. The crucial features of the dataset were then selected using recursive feature elimination with a decision tree estimator in combination with a cross-validation strategy (RFECV). This strategy identifies the optimal number of features and ranks them based on their relevance [18]. Finally, the top-ranking fingerprints and descriptors were selected for model development and validation.
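A minimal sketch of the two preprocessing filters, assuming a NumPy feature matrix with compounds as rows (the correlation cutoff here is illustrative; only the 0.05 variance cutoff is taken from this study):

```python
import numpy as np

def filter_features(X, var_threshold=0.05, corr_threshold=0.95):
    """Drop low-variance columns, then drop one of each highly correlated pair.
    corr_threshold is an illustrative choice, not the study's setting."""
    keep = np.var(X, axis=0) > var_threshold
    X = X[:, keep]
    corr = np.abs(np.corrcoef(X, rowvar=False))
    drop = set()
    for i in range(corr.shape[0]):
        for j in range(i + 1, corr.shape[1]):
            if corr[i, j] > corr_threshold and j not in drop:
                drop.add(j)
    cols = [c for c in range(X.shape[1]) if c not in drop]
    return X[:, cols]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X = np.column_stack([X, X[:, 0]])           # duplicate column: perfectly correlated
X = np.column_stack([X, np.full(50, 1.0)])  # constant column: zero variance
print(filter_features(X).shape)  # (50, 4)
```

The constant column is removed by the variance filter and the duplicated column by the correlation filter, leaving the four informative features.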

Model Development Using a Machine Learning Strategy
Six classifier algorithms, namely Naïve Bayes (NB), logistic regression (LR), decision tree (DT), random forest (RF), support vector machine (SVM), and gradient boosting machines (GBM), were employed for model development with a train-test split of 80% and 20%, respectively [19][20][21][22]. These models were chosen for their simple interpretation and because they are driven by data rather than static program instructions [23]. Moreover, models were developed using three different approaches: (1) using fingerprints of the compounds, (2) using descriptors of the compounds, and (3) using both fingerprints and descriptors.
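The overall setup can be sketched with scikit-learn; this is an illustrative sketch on a randomly generated dataset, not the authors' exact pipeline:

```python
# Six classifiers trained on a synthetic binary dataset with an 80/20 split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the study's 272-compound feature matrix
X, y = make_classification(n_samples=272, n_features=48, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "NB": GaussianNB(),
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=42),
    "RF": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "GBM": GradientBoostingClassifier(random_state=42),
}
for name, model in models.items():
    score = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: {score:.2f}")
```

Swapping the synthetic matrix for the selected fingerprint or descriptor columns reproduces the three modeling approaches described above.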

a) Naïve Bayes
The Naïve Bayes algorithm is a fast and simple classifier based on the well-known Bayes theorem:

P(A|B) = P(B|A) P(A) / P(B)

where A and B are events, P(A|B) is the probability of A given that B is true, and P(B|A) is the probability of B given that A is true. Using this rule, the probability of each class is learned from the conditional and marginal probabilities of the features [24]. This classification was carried out using the GaussianNB sub-package imported from the sklearn.naive_bayes package [25].

b) Logistic regression

The sigmoid function is among the most widely used functions in neural networks due to its non-linearity and computational simplicity. Logistic regression (LR) uses the sigmoid function to estimate a class probability by mapping the independent variables to the dependent variable:

P = 1 / (1 + e^(−z))

where P is the predicted probability and z is a real number ranging between −∞ and +∞ [23]. Although it is most appropriate for binary classification, it provides satisfactory results even for categorical and continuous variables. The LR model was included in our analysis because the target variable (micronucleus assay outcome) is binary; the assay outcome was considered the dependent variable, and the selected features were regarded as the independent variables. This algorithm was carried out using the LogisticRegression sub-package imported from the sklearn.linear_model package [25].

c) Decision tree

A decision tree (DT) is a predictive model that maps the relationship between internal nodes and leaf nodes. An object (feature) is represented as a node in the tree, and a leaf node represents the object value (the class of the corresponding feature). DT is advantageous because of its wide usage in data mining, data analysis, and data prediction [26].
The Gini index is used to measure the class distribution of the data subsets in the generated tree and is calculated as

Gini = 1 − Σ_i P_i²

where P_i is the probability of each class in the respective node. The DecisionTreeClassifier sub-package of the sklearn.tree package was implemented to perform the decision tree analysis. This sub-package processed the data and produced a decision tree along with assessment values, including the accuracy and precision of the model [25].

d) Random forest

Random forest (RF) generates a large number of trees from a random seed, each split during the model's training. The algorithm grows trees randomly to overcome overfitting, and the RF classifier ensembles them into one final model with higher accuracy [27]. Features are identified and ranked using the mean decrease in accuracy and the Gini score. Here, we employed RandomForestClassifier, imported from the sklearn.ensemble package, to develop the random forest model [25].

e) Support vector machine

The support vector machine (SVM) is one of the most widely used algorithms for solving classification and regression problems. SVM minimizes overfitting by using a structural risk minimization strategy during model generation, which reduces the generalization error rather than the mean squared error of the model [27]. Moreover, it classifies the input data using either linear or non-linear hyperplanes, and noise is handled using slack variables [28]. In this study, a linear hyperplane was created to distinguish the two classes using the svm sub-package from the sklearn package [25]. The equation given below describes the linear hyperplane that divides the input space:
W · X + b = 0

where W defines the separating hyperplane and b is the bias.

f) Gradient boosting machines

The gradient boosting machine (GBM), an ensemble and non-parametric approach, combines the predictions of weak learners to yield more robust outcomes than a single learner. Unlike parametric machine learning models, GBM uses additive expansion for model development, providing more freedom for the researcher. The algorithm trains the model in a gradual, additive, and sequential manner: it first trains a decision tree on the user-defined dataset, and each new tree is then fit sequentially to the modified dataset. Since GBM is based on a function-space optimization strategy, its performance is comparatively better than that of other bagging-based techniques [29]. GBM was implemented by importing the GradientBoostingClassifier sub-package from sklearn.ensemble [25].
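Two of the quantities above admit short numeric checks: the logistic sigmoid used by LR and the Gini impurity used for DT/RF splits. A minimal sketch:

```python
import math

def sigmoid(z):
    """Logistic function used by LR: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def gini(class_probs):
    """Gini impurity used for DT/RF splits: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in class_probs)

print(sigmoid(0))        # 0.5
print(gini([0.5, 0.5]))  # 0.5  (maximally impure binary node)
print(gini([1.0, 0.0]))  # 0.0  (pure node)
```

A split that drives node impurity from 0.5 toward 0 is what the tree-based classifiers above are optimizing for.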

Performance Evaluation of the Models
The parameters accuracy, precision, recall, F1-score, and average precision-recall score (AP score) were calculated using the following equations to evaluate the predictive ability of the models:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 × (Precision × Recall) / (Precision + Recall)
AP = Σ_n (R_n − R_{n−1}) × P_n

where TP and TN denote true positives and true negatives, FP and FN denote false positives and false negatives, respectively, and R_n and P_n represent the recall and precision at the n-th threshold.
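The confusion-matrix metrics above can be computed directly; the counts below are hypothetical, chosen only to exercise the formulas:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical confusion matrix for a 55-compound test set
acc, prec, rec, f1 = classification_metrics(tp=30, tn=15, fp=5, fn=5)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
```

Note that when precision equals recall, the F1-score equals both, which is why F1 is often read as a balanced summary of the two.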
Additionally, receiver operating characteristic (ROC) curves were generated for all the developed machine learning models to visualize the trade-off between the sensitivity and specificity of the models. The curve plots the FP rate on the x-axis against the TP rate on the y-axis. The ROC AUC score and the plot metrics were computed using the roc_auc_score function and other utilities of the sklearn.metrics package, and the ROC curves were plotted using the matplotlib Python library [25].
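In sklearn this is the roc_auc_score call; the same quantity can be sketched in pure Python via its rank interpretation (the probability that a randomly chosen positive is scored above a randomly chosen negative), using hypothetical scores:

```python
def roc_auc(scores, labels):
    """Rank-based AUC: fraction of (positive, negative) pairs in which the
    positive is scored higher; ties count one half. Equivalent to the ROC AUC."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical model scores and true labels
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]
print(roc_auc(scores, labels))  # 5 of 6 pairs ranked correctly -> 0.8333...
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a classifier that ranks every positive above every negative.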

Analysis of Structural Alerts
Structural alerts (SAs) define the relationship between toxic compounds causing adverse effects on human organs and their substructures. In general, SAs can be identified using three different strategies: (i) graphical methods, (ii) fragment-based tools, and (iii) SMILES-based tools. Among them, SMILES-based prediction using SARpy was found to be more efficient in determining the SAs of a structure with high accuracy [30]. Hence, in the current investigation, SARpy was implemented to identify the SAs associated with increased toxicity. The privileged substructures were analyzed using the frequency of each SA in the dataset:

f_m = (N_fragment-class × N_total) / (N_fragment-total × N_class)

where N_fragment-class is the number of micronucleus-positive compounds containing the fragment, N_class is the total number of micronucleus-positive compounds, N_fragment-total is the number of compounds in the dataset containing the fragment, and N_total is the total number of compounds in the dataset. The following settings were used to identify the toxic substructures:

1) SA comprising a minimum of 2 atoms
2) SA comprising a maximum of 18 atoms
3) SA occurring in a minimum of 10 training set compounds
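The frequency ratio above can be sketched directly; the counts below are hypothetical, chosen only to illustrate the formula:

```python
def fragment_frequency(n_fragment_class, n_class, n_fragment_total, n_total):
    """f_m = (N_fragment-class * N_total) / (N_fragment-total * N_class)."""
    return (n_fragment_class * n_total) / (n_fragment_total * n_class)

# Hypothetical counts: a fragment seen in 12 of 150 positives
# and in 15 of 272 compounds overall
print(fragment_frequency(n_fragment_class=12, n_class=150,
                         n_fragment_total=15, n_total=272))
```

A value of f_m above 1 indicates that the fragment occurs more often among micronucleus-positive compounds than expected from its overall prevalence, flagging it as a candidate structural alert.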

Cross-Validation of the Models
Cross-validation is a statistical approach used to evaluate the performance of machine learning models. As a standard tool in the ML pipeline, cross-validation is commonly implemented to obtain statistically unbiased comparisons among the six machine learning models [31]. In the current analysis, the dataset (n = 272) was randomly partitioned into training and test sets in the ratio of 8:2 for each validation fold. The average of each scoring metric, including accuracy, precision, recall, and F1-score, was then calculated to compare and select the best models. Finally, the KFold and cross_val_score sub-packages were imported from the sklearn.model_selection package to carry out the cross-validation process [22].
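In our pipeline the folds come from sklearn's KFold, but the fold construction itself is simple to sketch; for the paper's n = 272 dataset with k = 5:

```python
def kfold_indices(n_samples, k):
    """Partition sample indices into k roughly equal, non-overlapping folds.
    Each fold serves once as the test set; the rest form the training set."""
    indices = list(range(n_samples))
    folds, start = [], 0
    for i in range(k):
        size = n_samples // k + (1 if i < n_samples % k else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds

folds = kfold_indices(272, 5)
print([len(f) for f in folds])  # [55, 55, 54, 54, 54]
```

In practice, the model is trained on the concatenation of k − 1 folds and scored on the held-out fold, and the k scores are averaged, as in the metric averaging described above.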

Dataset Analysis and Feature Selection
A total of 272 non-duplicated chemical compounds and their toxicity data were collected from the PubChem database for model development and validation. Around 3481 bits of molecular fingerprints and 325 descriptor features were generated for each compound using the PaDEL and ChemSAR tools, respectively. The molecular weights of the compounds varied from 18.9 to 1250.7 g/mol. Notably, compound 28,179 had the lowest molecular weight (18.9 g/mol), whereas compound 72,511 had the highest (1250.7 g/mol) within the dataset. Interestingly, on cross-checking the molecular weights against the PubChem database, we observed that compound 28,179 showed a similar molecular weight of 18.998 g/mol, whereas compound 72,511 showed a slightly higher value of 1326.4 g/mol than the computed value. To elucidate the diversity of the dataset, Tanimoto coefficients between all pairs of compounds were calculated using MACCS fingerprints. The average Tanimoto coefficient of 0.191, below the threshold value of 0.25, depicts the diversity of the compounds retrieved for our analysis [13].

Feature Selection of Descriptors and Fingerprints
Feature selection is the automated process of selecting highly relevant features to improve the predictive performance of the models. Hence, low-variance and high-correlation feature filtering strategies were employed in our investigation to eliminate redundant features. First, 143 descriptors and 1067 fingerprints with a variance below 0.05 were removed, yielding a total of 2591 features after the elimination process (182 descriptors and 2409 fingerprints). Subsequently, correlation-based feature selection resulted in 139 descriptors and 116 fingerprints, which were further subjected to recursive feature elimination with a fivefold cross-validation strategy. This process selected 42 crucial descriptors and 6 fingerprints for machine learning model generation. Model generation was first accomplished using the 6 selected fingerprints and the 42 descriptors separately; the fingerprints and descriptors were then combined to generate a third set of models. The selected fingerprints were FP18, FP29, FP70, FP193, FP236, and FP426. The distribution of the 42 selected descriptors over the different categories is consolidated in Table S2. The 20 constitutional features, including the count of phosphorus atoms, the number of double bonds, and the number of single bonds, describe the physicochemical and structural information of the compounds. The 10 Basak descriptors depict the polarity of the molecules, and the 9 MOE descriptors reflect the contributions of surface area and partial charges. The remaining 3 topological descriptors, bcutm4, bcutm15, and bcutv15, are related to the atomic van der Waals forces and atomic masses of the compounds [32].

Fingerprint-Based Models
The significant fingerprints identified using the RFECV method were used for model generation in the initial stage of our analysis. Six different models were built and evaluated using the 6 significant fingerprints identified during the recursive feature elimination process. The models were evaluated using different metrics, and the results are tabulated in Table 1. The accuracy of the models ranged from 0.66 to 0.85. Moreover, the precision values of all the models were between 0.73 and 0.86. On the other hand, recall ranged from 0.68 to 1.00, and the F1-score ranged from 0.75 to 0.91. Based on accuracy, the random forest model (0.86) outperformed the other models investigated in our analysis in predicting the toxicity of the chemicals. Among the other performance metrics, the F1-score served as the benchmark evaluation metric due to its balanced blend of precision and recall [33]. The results of the fingerprint-based ML models show that random forest exhibited good predictive capability, with an F1-score of 0.91, for predicting the genotoxicity of the compounds (Table 1).

Descriptor-Based Models
The machine learning models were generated in the second stage using the 42 significant descriptors identified by the RFECV strategy. The descriptor-based models were analyzed using four different performance metrics, tabulated in Table 1. The accuracies of the models were LR (0.77), RF (0.82), DT (0.67), SVM (0.76), NB (0.75), and GBM (0.76). The precision and recall values of the models varied between 0.77 and 0.87 and between 0.71 and 0.94, respectively, and the F1-score ranged from 0.78 to 0.9. It should be noted that random forest achieved better performance metrics than the other descriptor-based models: a precision of 0.85, a recall of 0.94, and an F1-score of 0.9 were observed for the descriptor-based RF model. On comparing the results of the descriptor-based models, random forest exhibited exceptional predictive capability relative to the other models.

Fingerprints and Descriptor-Based Models
To date, no literature has reported predicting the genotoxicity of compounds using combined features. Hence, in our analysis, we merged the significant fingerprints and descriptors obtained from the feature selection process to evaluate the efficiency of the combined model. Although the accuracy was comparatively lower than that of the fingerprint- and descriptor-based models, the other metrics, such as precision, recall, and F1-score, were equivalent to those of the previously built models. Even though random forest excelled with the highest accuracy (0.8) in this analysis, the performance of the combined-feature models was not satisfactory. In addition, the ROC of the models was evaluated, as it is highly correlated with the accuracy of the classifiers [34]. It is evident from Fig. 1 that the area under the curve (AUC) values of the random forest models were 0.82, 0.74, and 0.76 for fingerprints, descriptors, and the combination of features, respectively. Using fingerprints to develop the random forest model thus increased the AUC by 0.08 over the descriptor-based RF model and by 0.06 over the combination model. Ultimately, the fingerprint-based random forest model is the most suitable strategy for predicting the genotoxicity of the compounds.

Performance Evaluation of Cross-Validation
Fivefold and tenfold cross-validation techniques were applied to the same dataset to validate the generated machine learning models and identify the best ML model for predicting the toxicity of the compounds. The results of the six models under cross-validation are shown in Figs. 2 and 3, and other performance metrics such as precision and recall are presented in Supplementary Fig. 1. No significant variation in accuracy was observed for any of the developed models during fivefold cross-validation. On the other hand, the accuracy of the models generated using fingerprints rose above 0.88 during tenfold cross-validation; notably, random forest achieved the highest accuracy of about 0.97. Similarly, all the models built using descriptors achieved accuracies greater than 0.9, except the decision tree (0.81) and Naïve Bayes (0.81) algorithms; remarkably, random forest achieved a higher accuracy of 0.96. A similar pattern of variation in accuracy was noted for the models developed using both descriptors and fingerprints, with the highest accuracy of 0.95 achieved by the random forest classifier. On analyzing the F1-scores of the cross-validation process (Fig. 3), the random forest models developed using fingerprints and using the combination of descriptors and fingerprints achieved the highest score of about 0.98, which depicts the ability of the models to predict the data with high precision [35].

External Validation of the Models
Fig. 1 ROC evaluation of the generated models for the prediction of the toxicity of the compounds: (a) ROC analysis of models generated using fingerprints of the compounds; (b) ROC analysis of models generated using descriptors of the compounds; (c) ROC analysis of models generated using both fingerprints and descriptors of the compounds.

The accuracy of the ML models was validated using an external dataset of 100 compounds retrieved from the literature; literature from the last 20 years was searched to build this validation dataset. The results of the external validation are shown in Figs. 2 and 3. It is worth mentioning that the model performance on the external dataset correlates well with that on the training data. The random forest model outperformed the other models considered in our analysis in terms of accuracy, precision, recall, and F1-score; for instance, the random forest model developed with fingerprints achieved an accuracy and F1-score of 0.86 and 0.95, respectively. Notably, the prediction performance of the fingerprint-based model was superior to that of the models generated using the other feature sets, consistent with the models developed using the training set. The prediction results of our model are illustrated in Table S1. It is interesting to note that the model yielded sensitivity and specificity values of 78.72% and 92.98%, respectively. On the other hand, false positive and false negative values of 0.04% and 0.1% were observed. Indeed, the sensitivity and specificity calculated with the aid of wet lab data clearly demonstrate its effectiveness in prediction.
Thus, we postulate that random forest combined with fingerprints is the right choice for replacing animal-based toxicity testing.

Structural Fragment Analysis
To examine the presence of privileged substructures in micronucleus-positive compounds, substructure frequency analysis was performed using the SARpy software. This algorithm retrieves the relationship between the toxic compounds and identifies the crucial substructures responsible for their toxicity. A comma-separated values (CSV) file containing the SMILES and toxicity status of the compounds was provided as input for identifying the structural alerts. A substructure present in a minimum of 10 micronucleus-positive chemical entities was regarded as a structural alert. Accordingly, nine substructures, including formic acid, polyethylene, chlorocyclohexane, cyclohexane, pyrimidinone, trimethylamine, methoxy-cyclohexane, methylamine, and phosphorus monoxide, were identified with higher frequency in micronucleus-positive compounds than in the negative chemicals. The substructures and their frequencies are presented in Table 2. From this study, we hypothesize that a new molecule containing one or more of the structural alerts mentioned above has a high probability of testing positive in the micronucleus assay. Although the SARpy tool identifies various substructures, understanding the toxicity of the compounds at the DNA level is very important. Therefore, the substructures were further validated using the Toxtree software, a decision tree-based strategy [36]. Of the nine substructures identified in our study, 2 structural alerts, chlorocyclohexane and trimethylamine, were found to cause toxic effects at the DNA level. Chlorocyclohexane is a highly flammable substance that causes skin corrosion, severe eye damage, specific target organ toxicity, and respiratory tract irritation. In addition to these toxic effects, trimethylamine causes acute oral toxicity in humans [37,38].
This evidence highlights that compounds containing chlorocyclohexane and trimethylamine moieties have a higher probability of being positive in the in vivo micronucleus assay in our dataset.
Altogether, our method is distinct from the existing literature in terms of prediction performance, the resultant structural alerts, and the utilization of wet lab data for model validation. Note that the existing model showed an accuracy of 0.882 during fivefold cross-validation with a random forest model, alongside six structural alerts. In contrast, the random forest model built in our analysis demonstrated the highest accuracy, about 0.97, during tenfold cross-validation. In addition, novel structural alerts such as chlorocyclohexane and trimethylamine were identified in the present investigation compared to the existing models. The differences in the resultant accuracy and structural alerts are mainly due to the diversity of the dataset considered in our analysis.

Fig. 2 Prediction results of the machine learning algorithms during the validation process: (a) validation of models generated using fingerprints of the compounds; (b) validation of models generated using descriptors of the compounds; (c) validation of models generated using a combination of fingerprints and descriptors of the compounds.

Conclusion
In the current investigation, supervised classification models were developed using a dataset of 272 compounds to predict the genotoxicity of chemical compounds. Different types of fingerprints and descriptors were generated for model development and validation. Furthermore, a three-tier evaluation, namely fivefold and tenfold cross-validation alongside an external dataset of 100 compounds, was utilized to evaluate the robustness of the developed models. The results of our study highlight that the random forest algorithm with fingerprint features would be the right choice to replace the in vivo micronucleus assay. The sensitivity and specificity values calculated with the aid of wet lab data clearly demonstrate its effectiveness in prediction. Importantly, privileged structural alerts such as chlorocyclohexane and trimethylamine were identified using the SARpy and Toxtree algorithms. Based on our findings, we hypothesize that compounds containing these substructures have a high possibility of genotoxicity. Overall, future work should explore the integration of more data sources into the models to pave the way for the development of a more comprehensive genotoxicity prediction tool.

Acknowledgements
The authors thank the management of the Vellore Institute of Technology for providing the necessary facilities to carry out this research work. We sincerely thank all the anonymous reviewers for their valuable suggestions toward improving the manuscript.
Author Contribution PR performed data collection, machine learning model generation, and validation. SV conceived this study and is responsible for the overall design, interpretation, manuscript preparation, and communication.

Conflict of Interest
The authors declare no competing interests.