Identification of Potential Gene Expression in Hepatitis B Virus-induced Liver Cancer Using the Simulated Annealing Algorithm

Background Liver cancer is often associated with hepatitis B infection, progressing from chronic hepatitis to hepatocellular carcinoma. Viral replication affects the host cell cycle, and many oncogenes can become highly expressed. The microarray technique can be used as a high-throughput measure of gene expression in the mechanism of hepatoma progression; however, it lacks information on which genes potentially trigger the disease. The measurements capture the produced mRNA, which differs from that of the normal cell line used as a control. This study aimed to identify potential genes that could be used as predictors to detect liver cancer using a heuristic algorithm, simulated annealing optimization. The basic idea of this algorithm is to overcome the combinatorial problem by using probability values to select significant features as predictors for the classifier models of representative machine learning algorithms: SVM, KNN, Naïve Bayes, C5.0 Decision Tree, and Random Forest. The experimental results showed high performance, more than 90% on average. However, the simulated annealing algorithm requires substantial computation time to identify the genes that could be used for detecting liver cancer.


Background
Liver cancer can be caused by hepatitis B infection, progressing from chronic hepatitis and cirrhosis to hepatocellular carcinoma over a relatively long time. However, most patients do not realize that the disease has reached a severe stage. The hepatitis B virus can induce cancer by inserting its genetic load into the host, affecting the cell cycle. Many genes and proteins involved in liver cancer disrupt the regulation of gene expression. For example, in patients with liver cancer, cell-cycle regulators (Cdk4, Cyclin I, D3, CIP2) increase, causing uncontrolled division without prior repair. Likewise, disrupted regulation of tumor suppressor proteins induces the formation of malignant tumor cells (Feng et al., 2010; Lee et al., 2000; Lim et al., 2010; Wu et al., 2001).
The use of gene expression microarrays poses a substantial data analysis challenge for computational biology approaches to disease detection. The large number of genes involved in screening requires substantial computational power to construct a classifier model for identification. Feature selection preprocesses the data to obtain significant features that are used as predictors to simplify the model. Many related studies on liver cancer prediction have been conducted using gene expression data and feature selection methods. Feature selection using the Markov clustering method was applied to the support vector machine (SVM) classifier to identify HCC module biomarkers from gene expression dataset GSE20948 and achieved an AUC of 0.875 (Shen and Liu, 2017). Additionally, dynamic Bayesian network feature selection was applied to the SVM classifier to diagnose HCC using the dataset with GEO accession number GSE17856, achieving an accuracy of 100% (Akutekwe et al., 2014). Another feature selection approach, Hybrid Forward Selection based on the Least Absolute Shrinkage and Selection Operator (LASSO), has been applied to the SVM algorithm for HCC classification with an accuracy of 98.2% (Abinash and Vasudevan, 2019).
Lastly, this paper is organized as follows. Section 2 presents the materials and methods. Section 3 provides the experimental results and performance evaluation. Section 4 provides the conclusions, and Section 5 provides recommendations for further studies. This study provides a new approach that can be used to identify genes that indicate liver cancer in patients.

Method
Generally, this research process consisted of four sub-procedures, as shown in Figure 1. First, gene expression data were acquired from NCBI to obtain the total RNA measurements. All RNA data were used as inputs to the simulated annealing algorithm, and particular genes were identified by optimizing the combination of selected features. Feature importance was ranked in descending order, and features with high importance values were used to construct the classifier models of the machine learning algorithms. This significantly reduced the number of features representing gene expression for liver cancer detection. The liver tissue dataset used platform GPL570, with samples in series GSE55092 and GSE121248. In the first dataset, GSE55092, the samples were taken from 11 patients at various distances from the center of the tumor. Whole Liver Tissue (WLT) was compared to Laser Capture Microdissection (LCM) samples of malignant and non-malignant hepatocytes from the same liver. Gene expression profiling was performed on 17 WLT specimens at varying distances from the tumor center from 11 HCC patients, along with the selected LCM samples. The other dataset, GSE121248, consisted of gene expression profiles from liver tissue of either chronic hepatitis B-induced HCC or healthy adjacent liver tissue, measured on Affymetrix arrays.
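As an illustration of the data acquisition step, the expression matrices for these two series can be retrieved from GEO programmatically. The following is a minimal Python sketch using the GEOparse library; the cache directory and the "VALUE" column name are assumptions, not details from this study.

```python
# Minimal sketch: download the two GEO series used in this study and
# assemble probe-by-sample expression matrices. Assumes the GEOparse
# package is installed (pip install GEOparse); paths are illustrative.
import GEOparse

def load_series(accession: str):
    gse = GEOparse.get_GEO(geo=accession, destdir="./geo_cache")
    # Pivot the per-sample tables into one matrix: rows = probe SpotIDs,
    # columns = GSM sample accessions, values = the "VALUE" expression field.
    return gse.pivot_samples("VALUE")

expr_55092 = load_series("GSE55092")    # HCC vs. non-malignant liver
expr_121248 = load_series("GSE121248")  # chronic HBV-induced HCC vs. adjacent tissue
print(expr_55092.shape, expr_121248.shape)
```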

Simulated Annealing Feature Selection
Simulated annealing (SA) is a statistical-mechanics-based random search technique analogous to the annealing of metal, in which gradual cooling and freezing produce a strong crystal structure with minimum energy. The SA method is used as the basis of general optimization techniques for combinatorial problems. The initial step of SA randomly selects an initial solution and evaluates its cost function; this solution is also assumed to be the optimal one. Then, while the temperature T does not meet the stopping criterion, a candidate solution is generated and its cost is calculated. If the candidate is better than the current optimal solution, it replaces it. Otherwise, a random value q in the range (0, 1) is generated, and the replacement of the optimal solution is permitted only if q is less than the acceptance probability $e^{-\Delta E / T}$, where $\Delta E$ is the increase in cost. Furthermore, the temperature T is decreased until the termination condition is reached.
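A minimal Python sketch of this search over feature subsets follows. It treats a subset as a bit mask, scores it by cross-validated accuracy of a downstream classifier, and applies the acceptance rule described above; the inner classifier (KNN), cooling rate, and one-bit mutation step are illustrative assumptions, not the exact settings used in this study.

```python
import math
import random
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def sa_feature_selection(X, y, n_iter=500, t0=1.0, cooling=0.95, seed=42):
    """Simulated annealing over feature subsets represented as bit masks."""
    rng = random.Random(seed)
    n_features = X.shape[1]

    def cost(mask):
        # Cost = 1 - mean cross-validated accuracy, so lower is better.
        idx = [i for i, keep in enumerate(mask) if keep]
        if not idx:
            return 1.0  # empty subsets are worst-case
        clf = KNeighborsClassifier(n_neighbors=3)
        return 1.0 - cross_val_score(clf, X[:, idx], y, cv=5).mean()

    # Random initial solution, also assumed to be the best so far.
    current = [rng.random() < 0.5 for _ in range(n_features)]
    current_cost = cost(current)
    best, best_cost = current[:], current_cost
    t = t0

    for _ in range(n_iter):
        # Neighbor: flip one randomly chosen feature in or out.
        candidate = current[:]
        candidate[rng.randrange(n_features)] ^= True
        cand_cost = cost(candidate)
        delta = cand_cost - current_cost
        # Accept improvements always; accept worse solutions only when
        # a random q in (0, 1) is below exp(-delta / T).
        if delta < 0 or rng.random() < math.exp(-delta / t):
            current, current_cost = candidate, cand_cost
            if current_cost < best_cost:
                best, best_cost = current[:], current_cost
        t *= cooling  # geometric cooling schedule

    return [i for i, keep in enumerate(best) if keep], 1.0 - best_cost
```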

K-Nearest Neighbor Classifier
The K-Nearest Neighbor (KNN) classifier is a supervised learning algorithm that uses the similarity or difference between data points, divided into classes, to predict the label of a new data object. It is known as a lazy classification method because no classifier model is constructed from the training data. The algorithm decides the class of a new data point from the training points that are sufficiently similar: the k closest training points are found, and the class is chosen by majority vote among them (Sutton, 2012).
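For illustration, a short scikit-learn sketch; the synthetic data, the value k = 3, and the train/test split are assumptions standing in for the expression data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in for an expression matrix: 100 samples x 20 features.
X, y = make_classification(n_samples=100, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k = 3 nearest neighbors; the majority vote decides the class.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)  # "lazy": fit only stores the training data
print("accuracy:", knn.score(X_test, y_test))
```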

Support Vector Machine Classifier
The SVM classifier is a supervised learning method for classification that seeks the optimal hyperplane function.
Initially, the function splits the input space into two classes. It then develops non-linear classifiers by applying kernel tricks in a high-dimensional space: the data are transformed into a high-dimensional vector space, and kernel functions such as the Polynomial function, the Radial Basis (Gaussian) Function (RBF), and the Sigmoid function are applied. Each data point is labelled $y_i \in \{-1, +1\}$ for $i = 1, 2, \ldots, n$, where $n$ is the total number of data points. The classes are divided by the hyperplane (supported by the support vectors), as defined in Eq. (1):

$$\mathbf{w} \cdot \mathbf{x} + b = 0$$

The data point $\mathbf{x}_i$ is assigned to class $-1$ if it satisfies Eq. (2):

$$\mathbf{w} \cdot \mathbf{x}_i + b \le -1$$

and to class $+1$ if it satisfies Eq. (3):

$$\mathbf{w} \cdot \mathbf{x}_i + b \ge +1$$

The maximum margin is the maximum distance from the hyperplane to the closest data objects.
The basic concept of the SVM classifier is to transform the data $\mathbf{x}$ into the high-dimensional vector space via a function $\Phi(\mathbf{x})$, so that the new vector space data are represented in the objective function. Training is a learning process that seeks the support vectors defining the hyperplane through dot products among the transformed data, computed with a kernel function as defined in Eq. (5):

$$K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$$

The Radial Basis Function (RBF) is the kernel trick used in this study, as shown in Eq. (6):

$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right)$$

The next step applies the sequential SVM algorithm to construct predictions, including computing the Hessian matrix and iterating until the largest update falls below the error tolerance, $\max(|\delta\alpha_i|) < \varepsilon$. The bias and the similarities between the test data and the training data are then calculated, yielding the positive or negative class (Vijayakumar and Wu, 1999).
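A brief scikit-learn sketch of an RBF-kernel SVM; the C value, gamma setting, and synthetic data are assumptions. Note that scikit-learn parameterizes the RBF kernel as $\exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$, which matches Eq. (6) with $\gamma = 1/(2\sigma^2)$.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Feature scaling matters for RBF kernels: distances drive the kernel value.
scaler = StandardScaler().fit(X_train)
# RBF kernel: K(xi, xj) = exp(-gamma * ||xi - xj||^2), gamma ~ 1/(2*sigma^2)
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(scaler.transform(X_train), y_train)
print("accuracy:", svm.score(scaler.transform(X_test), y_test))
```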

Naive Bayes Classifier
The Naive Bayes classifier is a classification method based on Bayes' probability theorem. Its primary characteristic is a very strong (naive) assumption of independence between events. The model is easy to create using Eq. (7) (Data Mining, 2012), where X is the attribute vector and C is the class:

$$P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)}$$

The Naive Bayes classifier seeks to maximize the probability value of each class, expressed as the Maximum A Posteriori hypothesis, as seen in Eq. (8):

$$C_{\mathrm{MAP}} = \arg\max_{C} P(C) \prod_{i} P(x_i \mid C)$$
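A short scikit-learn sketch; Gaussian Naive Bayes is assumed here as a common choice for continuous expression values, not necessarily the exact variant used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=100, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each feature is modeled as class-conditionally Gaussian and independent;
# prediction picks the class maximizing P(C) * prod_i P(x_i | C), as in Eq. (8).
nb = GaussianNB()
nb.fit(X_train, y_train)
print("accuracy:", nb.score(X_test, y_test))
```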

C5.0 Decision Tree Classifier
A decision tree classifier is a method of building a structured tree model by recursive division. This method is known as 'divide and conquer' because it uses the feature values to divide the data into smaller subsets of similar classes. The data are divided into branches that designate the selected decision. The tree is completed by leaf nodes that terminate the decisions and define the result of a combination of decisions.
The C5.0 decision tree classifier is an extension of the C4.5 decision tree algorithm proposed by Quinlan in 1993 (Pandya et al., 2015). C5.0 trees are simpler than C4.5 trees, and boosting, winnowing, and asymmetric costs for specific errors are supported by this model (Ojha et al., 2017).
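C5.0 itself is distributed as the C50 package in R; scikit-learn does not provide it. As a rough Python stand-in under that caveat, boosting entropy-based trees approximates C5.0's boosted-trees mode. The tree depth and number of boosting rounds below are assumptions; this requires a recent scikit-learn (the `estimator` keyword).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Entropy-based splits (as in C4.5/C5.0) plus boosting; this mimics,
# but is not identical to, C5.0's boosted mode.
base = DecisionTreeClassifier(criterion="entropy", max_depth=3)
model = AdaBoostClassifier(estimator=base, n_estimators=10, random_state=0)
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
```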

Random Forest
Random Forest (RF) is an aggregation of tree predictors with a uniform distribution in one forest. Each tree is built as a Classification and Regression Trees (CART) decision tree without pruning. RF is the supervised learning method used here to evaluate the performance of feature selection and reduction in liver cancer detection.
The RF method builds several decision trees by randomly selecting F features. This means that not all features are used to build each tree. The value of F affects the performance of the RF: if F is too small (F << M, the number of attributes), the correlation between trees tends to be small, whereas if F is too large (F >> M), the correlation between trees tends to be strong (Breiman, 2001). The value of F can be determined from the total number of attributes M by

$$F = \log_2(M + 1)$$

In the RF method, the training data are selected using a bootstrapping technique: the bootstrap aggregating method is used in sampling to build each decision tree with the previously selected candidate attributes. Broadly, the stages of the Random Forest algorithm are as follows (see the sketch after this list):
- Bootstrap samples are taken from the training data.
- A decision tree is built.
- Features m < M are selected when splitting nodes.
- Each bootstrap sample creates a decision tree.
- The class is taken by the majority vote of the trees.
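A brief scikit-learn sketch of these stages; the number of trees and the synthetic data are assumptions, and scikit-learn's `max_features="log2"` uses $\log_2 M$, which is close to, but not exactly, the $F = \log_2(M + 1)$ rule above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 unpruned CART trees, each grown on a bootstrap sample; each split
# considers log2(n_features) candidate features, then classes are decided
# by majority vote across trees.
rf = RandomForestClassifier(n_estimators=100, max_features="log2",
                            bootstrap=True, random_state=0)
rf.fit(X_train, y_train)
print("accuracy:", rf.score(X_test, y_test))
```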

K-fold cross-validation
K-fold cross-validation is a method used to evaluate the performance of experimental results. The entire dataset was divided into k parts, and the procedure iterated over the k folds, using a different fold each time. In each iteration, the k folds were divided into two parts, namely training and testing data: m folds were used as testing data, and the remaining k − m folds were used as training data. Each fold was filled with class +1 and class −1 data in a 50%:50% proportion (Refaeilzadeh, 2008).

Figure 3. An illustration of K-fold cross-validation.
K-fold cross-validation is illustrated in Figure 3. In the first iteration, the first fold was used as test data, and the rest of the data were used as training data. In the second iteration, the second fold was used as test data, and the rest as training data, and so on: test and training data were substituted across the iterations. An accuracy rate was obtained in each of the ten iterations, and the average accuracy was then calculated. This study used 10-fold cross-validation to evaluate the performance of the representative machine learning algorithms on the selected gene expression features.
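A minimal scikit-learn sketch of this evaluation; the RBF-kernel SVM and synthetic data are placeholders, and stratification is assumed as the mechanism keeping the +1/−1 class proportions balanced per fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=20, random_state=0)

# Stratified folds keep the class proportions consistent in every fold,
# matching the 50%:50% fold composition described above.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=cv)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```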

Results and Discussion
The liver tissue platform, GPL570, had an unbalanced class distribution, as shown in Table 1. The box plots in Figures 4 and 5 show the distribution of the numerical gene expression (mRNA) data in terms of quartiles (percentiles) and averages. The expression values of GSE55092 and GSE121248 were in the range of 2-12. Additionally, the scatter plots in Figures 4 and 5 show the relationship between the average log-expression value and the expression log-ratio for samples GSM139321 and GSM3428716.
Figure 6. The entity-relationship diagram of the GSE55092 and GSE121248 gene expression datasets.
After applying the simulated annealing optimization method and ranking by variable importance, ten features of GSE55092 were selected, with SpotIDs "216765_at," "236985_at," "FFX.PheX.5_at," "238611_at," "205386_s_at," "216604_s_at," "209740_s_at," "209711_at," "227073_at," and "202596_at." However, some SpotIDs were not identified in the gene database; all identified genes were protein-coding, as shown in Table 2. The selected genes are related either directly or indirectly to liver cancer progression. Expression of the gene ENSA can reduce tumor propagation in liver cancer, and overexpressed ENSA is correlated with tumor suppression (Chen et al., 2013). MDM2 also plays an important role in tumor progression (MDM2-P53 Pathway in Hepatocellular Carcinoma | Cancer Research, 2014). Another selected gene, MAP3K2, has been identified as directly driving the cell cycle in HCC progression (Shi et al., 2020). For GSE121248, the selected features were SpotIDs 203358_s_at, 200882_s_at, 1568638_a_at, 227654_at, 222077_s_at, 202243_s_at, 211609_x_at, 223516_s_at, and 200783_s_at; the related gene information is shown in Table 3. Almost all the selected genes are involved in liver cancer progression. Indoleamine 2,3-dioxygenase (IDO) is upregulated in hepatocellular carcinoma and impedes local immunity while promoting metastasis (Shibata et al., 2016). The upregulated gene Stathmin 1 (STMN1) facilitates and triggers the liver cancer pathway. Upregulation of Proteasome 26S subunit, non-ATPase 4 can also trigger liver cancer (Cai et al., 2019).
MicroRNA-137 has a suppressive role in liver cancer via targeting EZH2 (Shi et al., 2020). Extracellular matrix protein 1, a novel prognostic factor, is associated with the metastatic potential of hepatocellular carcinoma (H. Chen et al., 2011). Upregulation of Rac GTPase-activating protein 1 is significantly associated with the early recurrence of human hepatocellular carcinoma. The selected features were then applied to the machine learning algorithms, and their performance was evaluated using the confusion matrix; accuracy, sensitivity, specificity, and Area Under the Curve were obtained as presented in Table 4 (Baratloo et al., 2015; Jiao and Du, 2016), using the following definitions:
- True positive (tp): cases predicted as liver cancer that are liver cancer.
- True negative (tn): cases predicted as healthy that are healthy.
- False positive (fp): cases predicted as liver cancer that are healthy.
- False negative (fn): cases predicted as healthy that are liver cancer (the corresponding formulas are given below).

Tables 5 and 6 show the performance results comparing feature selection using the simulated annealing algorithm against no feature selection for finding the potential gene expression used to construct classifier models with the machine learning algorithms, including SVM, Naive Bayes (NB), Random Forest (RF), K-Nearest Neighbor (KNN), and Decision Tree C5.0. We found that feature selection using the simulated annealing algorithm required a computation time of 6.582 minutes for GSE55092 and 4.65851 minutes for GSE121248. These findings show that the drawback of the simulated annealing algorithm is the substantial computation time required to solve the combinatorial problem (Busetti, 2003). The comparison of computation times for the representative machine learning algorithms is shown in Table 7, and the total computation time required for detecting liver cancer is shown in Figure 7. The time required to identify potential gene expression using simulated annealing feature selection is much greater than the time required for classifier modeling with the machine learning methods.
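For reference, the standard definitions of these performance measures in terms of tp, tn, fp, and fn are (a standard formulation, not quoted from the cited sources):

$$\mathrm{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn}, \qquad \mathrm{Sensitivity} = \frac{tp}{tp + fn}, \qquad \mathrm{Specificity} = \frac{tn}{tn + fp}$$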

Conclusions
Simulated annealing feature selection was implemented to identify potential genes that could be used to detect liver cancer. All of the selected genes are protein-coding, and almost all are related to liver cancer progression. The selected genes were used to construct simplified classifier models with machine learning algorithms, achieving high performance measures, above 90%, including accuracy, sensitivity, and specificity. The performance rates increased by 1.4%-3.4% on average. However, the computation time required by the feature selection method was high. Therefore, future work should modify the simulated annealing algorithm to reduce the computation time.