Investigating unique genes of five molecular subtypes of breast cancer using penalized logistic regression

Background: Breast cancer is the first cancer and fifth cause of death in women around the world. Exploring unique genes for cancers has become interesting. The aim of this study was to explore unique gens of five molecular subtypes of breast cancer in women using penalized logistic regression models. Methods: In this study, microarray data of five independent GEO datasets was combined. This combination includes genetic information of 324 women with breast cancer and 12 healthy women. Lasso logistic regression and adaptive lasso logistic regression were used to extract unique genes. Biological process of extracted gens was evaluated in open-source GOnet web-application. R software version 3.6.0 with glmnet package was used for fitting the models. Results: Totally, 119 genes were extracted among fifteen pairwise comparisons. 17 genes (%14) had overlap between comparative groups. Among 27 genes contributed in positive regulation of cell processes, one gene belonged exclusively to this biological process. Among 46 genes contributed in negative regulation of cell processes, 6 genes belonged exclusively. Among 50 genes that were significant in regulation of metabolism, 4 genes belonged exclusively. Among 32 genes that related to response of stress, 4 genes belonged exclusively. Conclusions: The most genes selected by lasso logistic regression and adaptive Lasso logistic regression, were diagnosed in negative regulation of cell processes. The present study, performed in order to combine differentially expressed genes (DEG) from different microarray studies to improve heterogeneity of DEG list among different studies and identify important and unique genes for five molecular subtypes of breast cancer using penalized logistic regression models. Penalized logistic regression models are efficient for extracting genes in high-dimensional genetic studies. The result of this study did not show specific hub genes but most genes founded by two models related to the negative regulation of cell processes.


Background
Breast cancer (BC) is the most common cancer worldwide among women [1]. Breast cancer comprises for about 18% of all types of cancers in women and is the fifth reason for death around the world [2]. Several factors are associated with this disease such as genetics, lifestyle, menstrual and reproductive history, long term treatment with estrogens. Genetic abnormalities include germ line mutations, gene amplifications, rearrangements, overexpression, deletions or point mutation [3]. Some changes in the structure and expression of genes can keep cells out of the control of normal growth and turn them into cancer.
During the last two decades, microarray data sets have been helping biological researchers to study the expression of thousands of genes simultaneously [4,5]. Recently, microarray analysis has revealed new molecular subtypes behind the classical classification of BC. Gene expression profiling in tumor tissues has been shown that breast cancers may be divided into subtypes consisting of two estrogen receptor (ER) positive types (luminal A and luminal B) and three estrogen receptor negative types [human epidermal growth factor receptor 2 (HER2 or ERBB2), triple-negative (TNBC)/basal-like, and unclassified (normal-like)] with distinctive clinical outcomes [6,7]. These subtypes are clinically meaningful, and can divide patients into groups with distinct tumor morphologies and outcomes [8]. Although the molecular classification of BC has improved treatment outcomes for patients, the heterogeneity among differentially expressed genes requires statistical modeling for selection of unique genes. From the viewpoint of biologists, selection of unique genes can help to find a specific molecular mechanism in each subgroup and improves staging accuracy by elimination irrelevant and noisy genes [9][10][11]. However, selection of genes with DNA microarray data is a challenge for researchers, because of high number of genes and small number of patients (high dimensional data) [12]. Recently, statistical methods called regularization regression methods have been introduced to solve this problem [13]. The most important regularized models are Least absolute shrinkage and selection operator (LASSO) [14] and Adaptive LASSO [15]. The use of these models in microarray datasets leads to the selection of important genes and the elimination of insignificant genes. Algmal and Lee (2015) proposed regularization regression for gene selection in a microarray analysis that contain three dataset (colon dataset with 2000 genes, prostate dataset with 5966 genes and DLBCL dataset with 7129 genes) [16]. Mostafaei et al.
(2018) used Machine-Based Learning Algorithms on 20,097 probes were generated from a small airway epithelium microarray dataset for identification of Novel Genes in Human Airway Epithelial Cells associated with Chronic Obstructive Pulmonary Disease [17].
The present study aimed to employ regularized logistic regression methods called LASSO logistic regression and adaptive LASSO logistic regression in order to introduce unique genes for fifteen comparatives between subtypes of breast cancer in women.

Data collection
The raw data of gene expression profiles was downloaded from the ncbi.nlm.nih.gov website.
Among the searches performed, the data considered include enough sample size and the five molecular subtypes in breast cancer and also data contained suitable power. Four independent GEO datasets from independent studies were combined in order to have a comprehensive file.
Each of the four studies included microarray datasets of breast cancer in different subtypes. patients with breast cancer in five molecular subtypes and 12 normal tissues. The sample size of normal group was small because the microarray data of normal breast tissue was not available compare to cancer tissue in previous studies. Table 1 shows the sample size of five molecular subtypes of breast cancer and control group.

Microarray data processing
Samples processes were analyzed with different chip platform. The raw datasets of microarray GSE were handled through the BioConductor affy package in R software [18]. A sensitive step in integration of heterogeneous data is normalization, Therefore, before merging, each dataset were normalized using Limma package in R software [19]. During combining different GSE data, non-biological batch effects were removed. Different studies were done with different procedures and platform. The adjustment was performed to remove non-biological batches and to save meaningful biological effect in combination of datasets. In order to remove nonbiological batch effects Surrogate Variable Analysis (SVA) package in R was used [20]. Batch effect removal was checked by PCA and boxplot. The outcome of these data combination was a unit expression matrix (the combination of four datasets of this study) and then extracted differentially expressed genes (DEGs) from the matrix. Data processing and integrating were performed using R. In order to identify biological process, the DEGs was entered in open-source GOnet web-application (available at http://tools.dice-database.org/GOnet/), with p-value≤0.05 in enrichment analysis options [21]. KEGG pathway analysis was performed by ToppGene Suit (http://toppgene.cchmc.org) web site.

Statistical analysis
Logistic regression is a statistical method to investigate the effect of predictor variables on dichotomous variable (outcome). Let = ( 1 , 2 , … , ) for = 1,2, … , denote ppredictors for N observation. Assume that responses for the binary logistic regression model can take values = 1,2. Then, Where 0 is the intercept and = ( 1 , 2 , … , ) is a p-vector of regression parameters. In each pairwise comparison between subtypes, patients in one subtypes coded one and the other coded zero. This dichotomous variable was used as outcome variable in each analysis. Predictor variables are genes. In the presence of several predictors (genes) with a few sample size (a few cases), Logistic regression was not applicable. This situation could be managed with Regularized logistic regression methods, which are able to select effective predictors by shrinking unimportant regression coefficients toward zero. Two famous regularized logistic regressions are: LASSO logistic regression and Adaptive LASSO logistic regression.

LASSO logistic regression
LASSO proposed by Tibshirani [14], is one of the popular regularization method. In LASSO have used 1 − regularization for gene selection and estimation simultaneously by constraining the negative log-likelihood function of gene coefficients. Finally LASSO method then finds parameter values to minimize: Where is a tuning parameter and ∑ | | =1 is the penalty function for the LASSO.

Adaptive LASSO logistic regression
The LASSO method has shown to not always provide consistent variable selection. The LASSO penalizes all coefficients equally, even when the coefficients are large. The Adaptive lasso regression is the extended of Lasso logistic regression. This method assigns small weight to unimportant coefficients and large weight to important coefficients in order to have more accurate coefficients. Adaptive LASSO method then finds parameter values to minimize: , |̂| is the maximum likelihood estimate and > 0. the weighted penalty will allow variables with larger coefficients to receive smaller penalties and thus might provide a more consistent solution.
In the end Gene comparisons in each pair of breast cancer subtype were done by both models and common genes that were extracted with both models simultaneously. Since in current study there were five subtypes of breast cancer and one control group, therefore fifteen pairs comparison were done in the R software using the Glmnet package (https://cran.rproject.org/package=glmnet).

Results
In this study after Microarray data processing and integrative meta-analysis, 3948 genes were screened. Lasso and adaptive lasso logistic regression applied on 3948 genes. After fitting two models, some coefficients can become zero and eliminated. Table 2 shows common non-zero gene coefficients that were extracted by two models simultaneously for five molecular subtypes of breast cancer. In total, 102 genes extracted with two models in 10 comparisons between 5 subtypes. 85 genes are unique between comparative groups and they were mentioned by highlights in each comparative columns of According to GOnet database analysis (Fig. 1)

Discussion
The present study was performed in order to combine differentially expressed genes (DEGs) from different microarray studies and identify important and unique genes for five molecular subtype of breast cancer using penalized logistic regression models. Unique genes have been selected with minimal overlapping between comparative groups. The extracted genes belonged to different biological processes including negative and positive regulation of cell processes, regulation of metabolism and response to stress. According to our results, no significant hub genes were identified, but most genes selected by two models related to the negative regulation of cell processes.
Among the selected genes, MYLK [22], TLE3 [23] and LHFP [24] have been previously reported to be expressed in all breast cancer subtypes. We observed that Epidermal growth factor receptor (EGFR) [25] was highlighted in luminal A group while its role as a biomarker was investigated in triple negative/basal subtype. Erb-B2 receptor tyrosine kinase (ERBB2) is a strong prognostic biomarker and overexpresses in about 15-30% of breast tumors [26].
Inhibition of this receptor is the main therapeutic strategy in ERBB2-positive subtypes and we observed it in ERBB2 group compared to Basal and Luminal B groups. Furthermore, our model, classified most DEG to specific comparative groups without overlapping. ER degradation enhancing alpha-mannosidase like protein 3 (EDEM3) is a protein involved in glycan degradation which is a sign of tumor malignancy [27]. Seifi-Alan et al. has shown differential expression of EDEM3 in Luminal A subtype [28]. Our model highlighted EDEM3 in Luminal A-Basal comparative group.
myelin protein zero like-2 (MPZL2) is one of the PACE4 targets which is a proprotein convertase and play a key role in tumor cell migration and malignancy especially in basal breast tumors [29]. We observed MPZL2 in luminal B-basal comparative group. Coronin-1C another gene with differential expression in normal-like versus basal comparative group, have been shown to be the direct target of Y-box binding protein-1 (YB-1) protein. YB-1 down regulates the expression of CORO1C and promotes invasion, migration and metastasis of triple negative/basal breast tumor cells [30].
According to the results, it seems regularized logistic regression methods called LASSO logistic regression and adaptive LASSO logistic regression can select important and unique genes which show a good consistency with experimental studies several limitations of this study should be considered. First, we can refer to the small number of samples in control group due to the nature of the data. Second, our data was extracted from different studies hence precision is not equal in different countries then we tried with normalization to resolve the heterogeneity.

Authors' Contributions
T.D. conceived and designed the study. M.H. collected data, performed Microarray analysis, and prepared the figure. S.R. performed the statistical analysis and wrote the paper. S.J.F. wrote the discussion of the paper. S.J.F. and T.D. reviewed and revised the manuscript. All authors have read and approved the manuscript for publication.