Prediction of chemoresistance trait of cancer cell lines using machine learning algorithms and systems biology analysis

Most of the current cancer treatment approaches are invasive along with a broad spectrum of side effects. Furthermore, cancer drug resistance known as chemoresistance is a huge obstacle during treatment. This study aims to predict the resistance of several cancer cell-lines to a drug known as Cisplatin. In this papers the NCBI GEO database was used to obtain data and then the harvested data was normalized and its batch effects were corrected by the Combat software. In order to select the appropriate features for machine learning, the feature selection/reduction was performed based on the Fisher Score method. Six different algorithms were then used as machine learning algorithms to detect Cisplatin resistant and sensitive samples in cancer cell lines. Moreover, Differentially Expressed Genes (DEGs) between all the sensitive and resistance samples were harvested. The selected genes were enriched in biological pathways by the enrichr database. Topological analysis was then performed on the constructed networks using Cytoscape software. Finally, the biological description of the output genes from the performed analyses was investigated through literature review. Among the six classifiers which were trained to distinguish between cisplatin resistance samples and the sensitive ones, the KNN and the Naïve Bayes algorithms were proposed as the most convenient machines according to some calculated measures. Furthermore, the results of the systems biology analysis determined several potential chemoresistance genes among which PTGER3, YWHAH, CTNNB1, ANKRD50, EDNRB, ACSL6, IFNG and, CTNNB1 are topologically more important than others. These predictions pave the way for further experimental researches.


Introduction
Cancer is one of the most lethal and costly diseases around the world. There exist several common therapeutic procedures such as surgery, cytotoxic chemotherapy, targeted therapy, radiation therapy, endocrine therapy and immunotherapy which are used based on the level of cancer aggressiveness. The majority of the aforementioned approaches are invasive with a broad spectrum of side effects [1][2][3][4]. Furthermore, one important challenge that clinical practice face is drug resistance, which results from tolerance of cancer cells to anti-cancer agents [5,6].
The concept of drug resistance was first distinguished by observing bacterial resistance to antibiotics, but the phenomenon was also attributed to a wider range of disorders including cancers in no time [7]. Traditional chemotherapeutic agents destroy cancer cells by directly damaging DNA strand. Therefore, not only they are non-specific but also, they result in broad side effects. Furthermore, studies show that new drugs developed for targeting cancer cells are most effective in the beginning of the therapies, but as time passes, most patients show resistance to these drugs. Resistance to new targeted, chemotherapeutic agents is a big challenge in cancer therapies as these agents are responsible for the preponderance of relapses. The drug resistance phenomena root from different mechanisms which can be cancer-specific or not, such as drug efflux [7,8].
So far, numerous researches have been done to distinguish and describe cancer drug resistance. Housman et al. categorized the mechanisms of drug resistance in cancer as drug inactivation, drug target alteration, drug efflux, DNA damage repair, cell death inhibition, and the epithelial-mesenchymal transition [7]. On the other hand, drug resistance was classified into intrinsic resistance that exists before drug treatment or acquired resistance which is induced after the therapy. The prediction of drug resistance can overcome the inevitable failure of targeted and chemical therapeutics in clinical anticancer treatment [9,10].
Chemotherapy resistance prediction methods include cell culture-based chemo-sensitivity tests, DNA, RNA, and protein-based chemo-sensitivity tests and recently developed computational methods [11,12]. Cell culture-based tests which have been used for more than 30 years have some technical weaknesses. The technique is time-consuming and the primary culture has a low potency to success [13,14].
To resolve the aforementioned issues, DNA, RNA, and protein-based chemosensitivity tests emerged [15,16]. These methods include gene-based tools such as the Oncotype DX ® assay which uses 21 genes to predict the recovery of breast cancer after treatment. The tool can also be used for other cancer types including colon and prostate cancers [11]. Another such tool is the MammaPrint ® which employs 70 genes to predict the possibility of metastasis in breast cancer [17]. The principal challenge in these tests is the recognition of participating genes in the chemoresistance process. Moreover, due to the development of interdisciplinary techniques, computational and statistical methods are used for predicting chemotherapy responses [11].
So far, numerous computational approaches are developed to study drug resistance based on biological mechanisms. These computational techniques are generally divided into mechanism-based mechanistic modeling methods and data-driven prediction methods [18][19][20]. Molecular dynamics simulation, Kinetic models of signaling networks, Ordinary differential equation (ODE) model of cellular populations, Stochastic models, Partial differential equation models (PDEs), Agent-based and Pharmacokinetic-pharmacodynamic models are examples of the first class. The second class benefits from Omics data-based node biomarker screening, Static network biomarker prediction and Dynamic network biomarker prediction models. Linear models, support vector machines (SVMs), hierarchical clustering, principal components analysis (PCA) and the formation of a scoring algorithm are other models of computational methods used for the prediction of cancer responses. These models which belong to a concept known as "machine learning algorithms (ML)" are being used to predict resistance of cancer cells to chemotherapy [20,21]. Huang et al. represent ML as a part of artificial intelligence that can find correlations in the cancer-relevant datasets [21]. Conventional analytical approaches for determining treatment are very expensive and are also limited due to innate technical issues. ML algorithms are considered as cost-benefit, time saving strategies which can evaluate multiple cells line simultaneously [22,23]. Yet another challenge which ML can overcome compared to conventional methods is the ability to determine biological information which are concealed by tumor cell heterogeneity [24].
This study aims to predict the resistance of several cancer cell-lines to Cisplatin. In this study, different algorithms including Naïve Bayes, K-Nearest Neighbors (with k = 3), Decision tree, Random Forest and Neural network are used to classify Cisplatin sensitive and resistance samples. Moreover, the results of our systems biology analysis indicated several potential chemoresistance genes among which PTGER3, YWHAH, CTNNB1, ANKRD50, EDNRB, ACSL6, IFNG and, CTNNB1 are topologically more important than others. Our results have been validated against different databases such as UniProt, Enrichr and DIANA mirPath v.3 and the papers extracted from the literature. Furthermore, the results of the systems biology analysis determined several potential chemoresistance genes among which PTGER3, YWHAH, CTNNB1, ANKRD50, EDNRB, ACSL6, IFNG and, CTNNB1 are topologically more important than others.
The remainder of this paper is formed as below: "Materilas and methods" section reviews the Materials and Methods. The results are presented in "Results" section. The discussion is reported in "Discussion" section and finally, "Conclusion" section present the conclusion of the overall work.

Data collection
NCBI GEO database was used to obtain data from datasets [25]. Five different platforms of microarray including GPL13667, GPL6947, GPL6480, GPL6104, and GPL6244 have been used. A total of 85 samples of gene expression datasets of microarray data were selected from different platforms. The selected datasets were related to various cell lines such as ovarian, pancreatic and non-small cell lung cancer (NSCLC) both resistance and sensitive to Cisplatin drug. Sample numbers as well as cancer types are indicated in Table 1.
The data has been normalized using the LIMA package in the R software [26]. Average has been taken between the expressed values of repetitive probes in each dataset to obtain a unique expression value for each probe. A total 7621 genes were harvested after the isolation of common genes between platforms. The Combat software was then used to eliminate batch effects between different platforms and experiments [27]. Also, an average has been taken between replicates of each platform (sensitive and resistance separately) which reduces the total number of samples from 85 to 14 sensitive and 14 resistances to Cisplatin samples.

Data processing
As the data was collected from various sources, it was necessary to somehow remove the discrepancies known as "Batch effects" between different samples. The batch effects of 85 different samples, each containing 7620 genes, were corrected by the SVA package. The SVA package comprises functions to remove the batch factors and other undesirable conversion in high-throughput examination. Specifically, the SVA package comprises functions to identify and build surrogate variables for high-dimensional datasets. Surrogate variables are covariates built directly from high-dimensional data (such as gene expression and RNA sequencing data) that can be utilized in subsequent analyses to adjust for unknown, unmodeled, or latent sources of noise. Moreover, the t-SNE algorithm was then used in MATLAB software to make the data presentable before and after the batch effect correction [28].

Feature selection
High dimensional DNA microarray has presented serious challenges to the existing machine learning and classification methods. In other words, in many of medical and microarray datasets, it is possible that many genes are irrelevant or redundant for machine learning algorithm [29][30][31][32]. Feature selection or gene selection is a popular and powerful approach in medical datasets to overcome this shortcoming [33][34][35]. In gene selection, to decrease the microarray data dimensions, by eliminating the irrelevant and similar genes, only a subset of relevant and dissimilar genes that are strongly related to the objective function are selected [36]. This is a powerful strategy to increase the efficiency of the machine learning algorithm, reduce time complexity, build more general classification algorithm, and reduce storage requirements [37,38]. Gene selection approaches have been successfully employed in many medical applications including; gene expression [39], cancer classification [40], medical diagnosis [32], etc. In other words, a fundamental problem in machine learning algorithms is the high dimensional datasets, in which the size of the feature subset is much higher than the size of the patterns. Therefore, the classification accuracy is significantly reduced. As a result, it is necessary to reduce the initial features using dimension reduction techniques [41,42]. One efficient way to reduce the dimension is feature selection (gene selection in DNA microarray datasets). In feature selection, an attempt is made to choose a set of initial features that satisfy two targets simultaneously: the minimum similarity between the selected genes and the maximum relevance of these genes with the target class [43,44]. The main goal of this step is to select appropriate and important feature from original features [45].
To do this purpose, Fisher Score (FS) feature selection algorithm is used to select the features that are most relevant to the target class. Fisher Score is a supervised filter method that selects a feature subset such that the samples in the specific class are most similar to each other and the samples in the different classes are less similar. As a result, this measure Scores higher value to genes with higher separation characteristic. The FS of gene f i is defined as follows: where f k is the mean value of all the samples regarding the feature f k , V is a set of all classes in a dataset, n ν is the number of pattern on the class ν, and σ (f k ) and f ν k shows the variance and average of feature f k on class ν, correspondingly.
The Fisher Score method was performed for 14 folds and in each step, 100 features that were most consistent with the normal distribution were selected so that a total of 1400 features were obtained. In the next step, in order to reduce the sample size, the PCA [46] method was performed for 14 folds on the output features of the Feature Score method and reduced set of features were selected for machine training.

Machine learning
In order to measure the flexibility of the developed method on different classifiers, in the designed experiments, the efficiency of the various approaches on three widely used six machine learning algorithms including Naïve Bayes [47], SVM [7], KNN [48], Decision tree [49], Random Forest [50] and Neural Network [51] algorithms is examined for machine training to detect Cisplatin sensitive and resistant samples in cancer patients undergoing chemotherapy. These machine learning algorithms are one of the most well-known and widely used machine learning algorithms that are used by researchers in various prediction and classification problems.
In these experiments the developed approach and other compared methods implemented using Python language programming on an Intel Core-i9 CPU with 16 GB of RAM.
To train these algorithms and due to the small number of available samples (14 pairs of samples including 14 sensitive and 14 resistant samples), the Leave one out method (LOO) was used [52]. In this method, at each step of the machine training, 13 pairs of samples were used as training data and one pair was used as test data. The training process continued until all pairs of samples were once used as the test data. The training was performed twice, once by considering 1400 features extracted from Fisher Score method and once by considering 210 features extracted by PCA method as the training samples. Finally, the performance of the trained machines was evaluated according to Accuracy [53], Specificity [54], Sensitivity [54], Precision [53], MCC [55] and F1 scores [55].
Biological system evaluation 73 genes were selected using the CountIF filter in Excel (Refer to online resource 1). These genes were repeated in 50% or more of the 14 pairs sample extracted from the GEO. The LIMA package was used to calculate the DEGs between all sensitive and resistance samples. The result indicated that 34 of 73 gens which were chosen in the feature selection phase were down-regulated and the rest were up-regulated. The genes were enriched in biological pathways using the Enrichr database [56]. To obtain the mirs related to the 73 harvested genes, miRCancer [57] and miRDB [58] databases were used and the common mirs between the harvested mirs from the two aforementioned databases were selected based on the specific cancer cell line as well as the direction of the regulation. mirs were enriched using DIANA mirPath v.3 database [59] (refer to online resource 2) and pathways related to chemoresistance were selected among the obtained pathways. Furthermore, the transcription factors which regulate the obtained DE genes were harvested using the TRRUST v2 database [60] and were enriched in pathways related to chemoresistance by Enricher database.

Networks
Using the obtained transcription factors and mirs, two types of network including a TF-gene network and a mir-gene network were constructed using Cytoscape software [61]. Topological analyses were then performed using degree and betweenness centralities to report the networks hub genes.

Biological description
The obtained DEGs were studied through literature review. All of the 73 genes were checked in articles and are indicated in a table based on the mechanism of chemo resistance. Some of these genes were not mentioned in the related articles as chemo resistance agents and therefore, they were nominated for further resistance studies which indicated that these genes have main roles in cells and are included in important biological pathways.

Data processing
To correct batch effects between different samples, t-SNE algorithm was used (Fig. 1). This algorithm reduces data dimension and makes the data easier to picture and therefore, easier to comprehend.

Features selected and reduced by Fisher Score and PCA algorithms
In order to select appropriate features for machine learning algorithms, feature selection was performed for 14 folds using the Fisher Score method. In each fold, 100 genes which were most similar to the normal distribution were selected out of 7620 genes. Furthermore, to reduce the dimension of the selected features, the PCA algorithm was performed for 14 folds on the extracted features and several genes were selected in each fold ( Table 2).

A machine learning approach to detect Cisplatin sensitive and resistant samples in cancer cell lines
Performance of the machines trained by using reduced features extracted from the PCA algorithm (Table 2) are shown in Fig. 2. Among the developed machines, the Decision Tree algorithm with the average Accuracy of 50% has the weakest performance in terms of accuracy. On the other hand, KNN shows the highest accuracy with an average of 67%. The best performance based on accuracy criteria also belong to KNN and Decision Tree algorithms according to the obtained box plots (Fig. 2c). Among the developed machines, the Naïve Bayes algorithm is the weakest machine in terms of negative sample detection with a 50% Specificity criteria. Decision Tree algorithm, on the other hand, has the highest average Specificity criterion of 69%, followed by Random Forest and KNN algorithms. The KNN algorithm has the best performance based on the specificity criterion based on the extracted box plots (Fig. 2c).
Based on the results, the weakest performance in terms of Sensitivity criteria is related to the Naïve bayes algorithm, which has not correctly detected any positive samples. On the other hand, KNN and Decision Tree algorithms have the highest Sensitivity criteria with an average of 78 and 67%, respectively (Fig. 2). The Decision Tree algorithm has the best performance in terms of sensitivity criteria based on the obtained box plots (Fig. 2C).
Among these algorithms, Decision Tree and Random Forest algorithms with an average precision of 71% have the highest average precision. The Decision Tree algorithm has the best performance in terms of precision according to the extracted box plots (Fig. 2C).
Similarly, a new set of six machines was trained only this time, 1400 features extracted from the Fisher Score algorithm were used in the training process. Performance results of these machines are shown in Fig. 3. In the new set, the KNN algorithm with an Fig. 1 a Data before batch effect correction. Data dimension has been reduced using the t-SNE algorithm so that it can be displayed properly. Each batch is illustrated in a distinct color and it is clear that each batch is grouped near each other. b Data collected from various sources after batch effect correction by the Combat software. It is clear that data related to different batches are combined and the boundaries between different batches have been removed average of 67% accuracy has the highest percentage of correct sample detection compared to other algorithms. Random Forest, Naïve Bayes and SVM algorithms come afterward with an average accuracy of 64%. The KNN algorithm is also the best machine in terms of accuracy based on extracted box plots (Fig. 3c).

Determining specific mirs for extracted DE genes
The specific mirs for the extracted 73 DEGs were harvested from the miRCancer and miRDB databases for related cancer types (Table 1). These results determine that the expression profile of mirs are down for upregulated genes and are up for down regulated ones (Table 3; Fig. 4).

mir-target network topology
The mir target network has been topologically analyzed using the degree centrality and the results revealed the hub genes which should be evaluated for their performance in the chemoresistance process (Table 4).

mir enrichment
The upregulated and down regulated mirs in the pathways related to chemoresistance were enriched using DIANA mirPath v.3 database. The related pathways were extracted from KEGG database using the standard P-value of 0.05 (The extracted related pathway: Additional file 1).

TF network topology
Three factors including the in-degree, the out-degree and the betweenness centralities have been noted in TF-gene interaction network for topological analysis. The detected hub genes are related to the in-degree centrality and are listed in Table 5.   The network illustrated in Fig. 5 has been constructed by merging upregulated and down regulated genes and their corresponding regulatory TFs. Topology analysis has been performed based on this network. According to the results obtained from the trrust database, the TFs were down for upregulated genes and were up for down regulated genes ( Table 6).

TF enrichment
The transcription factors regulating the genes harvested from the feature selection step were enriched using Enrichr database in the oncogenes and chemoresistance pathways. In the Enrichr database the pathways were harvested from KEGG and the adjusted P-value was significant. The results are shown in Table 7.

Biological description
A literature review was performed on the 73 DE genes which were harvested by the CountIF filter in the feature selection step. A group of these genes were reported in the literature as chemoresistance genes (Additional file 2). The other ones were identified to have a vital role in the oncogenesis pathway and other important cell functions. Therefore, although they were not reported as chemoresistance genes we propose that these genes might be potentially chemoresistance (Table 8). Future studies can be performed to validate these results.

Discussion
The main aim of this study was to train a machine for the detection of sensitive and resistant samples to Cisplatin in different cancer cell lines including ovarian, pancreatic and lung cancers. Six machines were developed based on different algorithms including Naïve bayes, SVM, KNN, Decision tree, Random Forest and Neural network and the results were evaluated by the accuracy, specificity, sensitivity, precision, MCC and F1 Scores. Furthermore, a series of systems biology analyses were performed using the DE genes harvested from the feature selection step to further improve our study.  Except klhdc10, another client protein for kelch is phosphatase 5 (PP5) which in response to ROS inactivates ASK1. After this interaction, PP5 phosphatase activity will be suppressed. Furthermore, kelch mediates H2O2-induced sustained activation of ASK1 and cell death in Neuro2A cells This data proposes that Slim/KLHDC10 is an activator for ASK1 and its activation through suppression of pp5 leads to oxidative stress-induced cell death [62] MCRS1 Nucleolar MCRS1 which is called MSP58. It is proposed that TOJ3, an avian homologue of MSP58, is associated with Jun-induced cell transformation as well as tumorigenesis. Other studies have identified MSP58 as an oncogene hence its transformation activity was blocked by interaction with the PTEN tumor suppressor. It has also been reported that there exist different expression levels of MSP58 in human glioma and colorectal cancer [63] MSH4 MSH4 mutation has been associated with lung cancer [64]. Furthermore, although hMSH4 and hMSH5 are not involved in DNA-mismatch correction but they participate in the Mutual recombination and appropriate separation of homologous chromosomes at meiosis [65] nucb2 It has been proposed that NUCB2 is associated with metastasis in melanoma. Several studies have demonstrated that KLF4 levels are increased in melanoma cells leading to apoptosis inhibition and metastasis [66] sh3gl2 It is proposed that SH3GL2 can have a tumor suppressor role in brain since deletion mutations in the locus of this gene can cause pilocytic astrocytomas. It has also been shown that through regulation of SH3GL2 gene, miR-330 affects proliferation, migration, invasion, cell cycle and apoptosis of human glioblastoma [67] TMED5 malignant phenotypes including increased cell proliferation, EMT progression, apoptosis inhibition, cell migration and invasion, and drug resistance can emerge as a result of higher expression of TMED5 [68]. TMED5 is also proposed to play a role in NCI/ADR-RES drug-resistance (MDR) [69] TMEM119 TMEM119 plays a role in migration and invasion of gastric cancer cells through activation of STAT3 signaling pathway which is found to be strongly correlated with the invasion, metastasis, and prognosis of gastric cancer [70]. Moreover, Zheng et al. showed that down-regulation of TMEM119 reduces Bcl-2 levels and increases Bax and caspase-3 levels in SGC-7901 cells [71] TMEM219 The over-expression of TMEM219 gene which is localized in the membrane of breast, prostate, and pancreatic tumor cells, can suppress tumor growth. Other studies have revealed that expression levels of IGFBP 3 as well as its death receptor are in close relation to inefficient prognosis and low survival rate in pancreatic ductal adenocarcinoma [72] WIPI1 WIPI1 up regulation has been detected in a variety of tumors. It has also been proposed that WIPI1 plays a role as an autophagy activator through TORC1 suppression. Furthermore, it has been stated that WIPI1 up regulation results in both lower relapses and higher survival rates in breast cancer [73] YWHAH YWHAH gene can be considered as a potential target for therapeutic agents as it is downregulated in liposarcoma cells after Doxorubicin treatment [74]. Furthermore in another study, Kibel AS et al. proposed that YWHAH expression levels is positively correlated with malignant Prostate Cancer [75] znf507 It has been proposed that over-expression of ZNF507 is associated with pancreatic, periampullary adenocarcinoma [76] as well as ovarian high-grade serous carcinomas (as an amplified 'driver' gene) [77] ACP5 (TRAP-ACP5) can be used as a marker for predicting cancer progression and aggressiveness as it plays critical roles in many biologic processes such as bone resorption, osteoclast differentiation and, cell motility promotion through the modulation of focal adhesion kinase phosphorylation. It also acts as a metalloenzyme in activated osteoclasts and macrophages and also serves as a metastasis driver in cancer. It has been recently demonstrated that TRAP, through TGFβ2/TβR and CD44 signaling pathway, results in metastatic MDA-MB-231 breast cancer cells [78] ANXA6 Several cancer types including Melanoma, CC, Epithelial Carcinoma, BC, GC, PCa, ALL, CML, large-cell lymphoma and myeloma have been proposed to be related to disregulation of AnxA6 Therefore, AnxA6 is a potential biomarker for identification, cure and prognosis of certain cancers [79]. Moreover, recent studies have proposed AnxA6 as a suggested target for inihibition of pancreatic cancer via antibodies [80] It was concluded that the machines which were trained using the features extracted from the Fisher Score algorithm performed better than the ones trained by the same set of features reduced using the PCA algorithm. The reason is due to the richer distinguishing information in the features selected by the Feature Score algorithm than the reduced features of the PCA algorithm. Exceptionally, the KNN algorithm performed similarly in both cases. The similarity of the KNN algorithm performance in both cases is due to the preservation of the data arrangement in the reduced space after the implementation of the PCA dimension reduction algorithm. Since the KNN algorithm decides the fate of a data based on its K nearest neighbors, maintenance of the data arrangement after the dimension reduction has led in the same results in both cases. Furthermore, the KNN and the Naïve bayes algorithms are proposed as the most appropriate machines for prediction. However, it should be noted that the appropriate machine must be selected based on the considered specific application.
Using the classifying features extracted, mir-target network and TF-gene network were constructed and enrichment and topology analyses were performed to

Atp6v1g3
BSND and ATP6V1G3 has been proposed as novel immunohistochemical markers for the differential diagnosis of chromophobe RCC from other RCC subtypes and also diagnosis of chromophobe RCC metastasis to distant organs [81] CWF19l2 CWF19 like cell cycle control factor 2 (PMID: 143,884). Breast cancer development has been related to ERBB2, MYC, GSTT1, PIK3CA and CWF19L2 [82] DSC1 Dsc2 level is observed to decrease in colorectal cancer. The alterations in Dsc expression pattern can cause significant changes in desmosome function [83]. DSC1 can also be considered as a biomarker for tumor differentiation, and it can be a prognostic marker for lung cancer [84] DUT DUT expression provides a discernible phenotype in a variety of cancers which can be used for prediction of patients response to chemotherapy as well as overall survival. It is significant to note that resistance to thymidylate synthase inhibitiors is related to 3-fivefold increased expression levels of dUTPase in HT29 and A549 cells [85] EDNRB EDNRB methylation is helpful in screening of oral pre-malignancy and malignancy conditions [86]. hyper-methylation of the EDNRB gene has commonly occurred in NSCLC. Since the rate of EDNRB methylation is significantly higher in squamous cell carcinoma than adenocarcinomas, it can be used to distinguish SCCs from adenocarcinoma of the lung. since the downregulation of EDNRB after hypermethylation of the EDNRB gene is necessary for lung cancer tumorigenesis and is associated to tumor-related death [87] FADS1 FADS1 rs174549 polymorphism is a useful factor for oral cancer PFS prediction, specifically in chemoradiotherapy patients. It can also be considered as a potential target for future of personalized treatment [88] FAM65b FAM65B can be considered as a suitable target for therapeutic approaches based on cancer stem cell elimination. The reason is that overexpression of FAM65B is observed in Prostate tumors such as PC3. These tumors have stem like characteristics. For example, they are proangiogenic and strongly self-renewal [89] FAM89b Fam89b is proposed to be a suitable target for chemotherapeutic strategies since it is a TGF-β pathway suppressor and signaling pathways induced by TGF-β have tumor-suppressing or tumor promoting effects based on type and stage of the cancer [90] Gnai2 Previous studies have demonstrated GNAI2 as a main regulator of oncogenesis and an upstream driver of cancer development in the Ovarian cancer [91]. In addition to ovarian cancer, up-regulation of GNAI2 has also been observed in Hepatocellular Carcinoma. This protein acts through activation of the Ras-ERK/MAPK Mitogenic pathway by membrane recruitment of Rap1 GTPase-activating protein and moderation of GTP-bound Rap1 and also through the enhancement of cell survival by activation of AKT and inhibition of apoptosis by regulating Bcl-2 levels [92] detect hub genes and hub TFs. Based on the degree centrality, PTGER3, YWHAH, CTNNB1, ANKRD50, EDNRB and ACSL6 target genes were detected as the chemoresistance hub genes according to mir-topology. PTGER3 is encoding PTGER3 protein, a member of the G-protein coupled receptor family. In this study, the degree of the PTGER3 was obtained to be seven which is higher than that of other obtained hub genes. Furthermore, this gene is also reported to be a cisplatin-resistant gene through Ras-MAPK/Erk-ETS1-ELK1/CFTR1 pathways [93]. YWHAH and CTNNB with the degree centrality of six were identified as the second robust hub genes in the row. It has been reported that after chemotherapy, YWHAH is upregulated in prostate cancer cells and is down regulated in Liposarcoma, representing the potency of this gene in chemoresistance [74,75]. Moreover, it has been shown that CTNNB1 has a vital role in cancer regulatory pathways such as Gastric cancer signaling [94]. With a degree centrality of 5, ANKRD50 and EDNRB were the next obtained hub genes. According to previous studies, it has been reported that EDNRB-methylation is a very common phenomenon in NSCLC. Due to the higher rate of EDNRB methylation in Squamous Cell Carcinoma (SCCs), it can be used to distinguish between SCCs and lung Adenocarcinoma [87]. The TF-gene network topology analysis was also performed and the results specified two hub genes including IFNG and CTNNB1. IFNG is a protein coding gene and it is involved in Folate Metabolism. Furthermore, Yaghoobi et al. have evaluated the IFNG and its antisense (IFNG-AS1) roles in breast cancer and have proposed the involvement of IFNG and IFNG-AS1 in the pathogenesis of breast cancer [95]. In another study, Gao et al. investigated the role of IFNG pathway in the anti-CTLA resistance mechanism. Anti-CTLA-4 produces IFNG to enhance T cell responses. Their data revealed that defects in the IFNG signaling pathway leads to resistance to anti-CTLA-4 therapy [96].
With a systems biology approach including machine learning methods, feature selection, topological analysis, enrichment analysis and finally literature review, we managed to obtain a set of genes which play critical roles in chemoresistance processes. We also have nominated a set of potentially chemoresistance genes which could be used in further studies.

Conclusion
In this study, machine learning approach as well as systems biology analysis was used to extract the genes which commonly separated cisplatin resistant samples from the sensitive ones in lung, pancreatic and ovarian cancers. Furthermore, six classifiers were trained to distinguish between chemoresistance samples from the sensitive ones. As a result, KNN and Naïve Bayes algorithms were selected as the most practical machines according to a set of calculated measures. Moreover, the results of our systems biology analysis indicated several potential chemoresistance genes among which PTGER3, YWHAH, CTNNB1, ANKRD50, EDNRB, ACSL6, IFNG and, CTNNB1 are topologically more important than others. Our results have been validated against different databases such as UniProt, Enrichr and DIANA mirPath v.3 and the papers extracted from the literature. Therefore, this in silico study as well as its predictions can pave the way for further experimental researches.