DOI: https://doi.org/10.21203/rs.3.rs-704894/v1
Background: Breast cancer is one of the malignant tumors that threaten women's health, with HER2+ breast cancer being more aggressive. In this study, bioinformatics methods were used to find potential key genes in HER2 + for diagnosis and treatment.
Methods: Datasets of HER2+ breast cancer and normal tissue samples retrieved from TCGA databases were subjected to DEGs analysis using R software. Then WGCNA is constructed for DEGs. The key gene co-expression modules were then subjected to GO and KEGG pathway enrichment analyses, as well as construction of PPI networks using the STRING database for identifying key genes. Finally, key genes were further validated by survival analysis, protein expression, and COX regression models.
Results: We identified 2063 DEGs and 4 gene co-expression modules. Functional enrichment analysis showed that these key co-expression modules were mainly associated with extracellular matrix organization, extracellular matrix structural constituent and neuroactive ligand−receptor interaction. PPI network visualization identified 100 key genes, 3 of which were not present in the other subtypes of breast cancer. UTS2 DRD4 and GLP1R are key genes specific to the HER2+ subtype. Survival analysis showed that UTS2 are prognosis-related key genes in HER2+ breast cancer. Finally, UTS2 combined with clinical data to construct Cox regression model.
Conclusions: Combined with the two screening methods, 3 key genes closely related to HER2 + breast cancer were identified. UTS2 is a new potential key gene and may become a new therapeutic target for HER2 + breast cancer.
Breast cancer is one of the important diseases leading to female death, which seriously threatens women's health. With the progress of medical treatment, there are many methods in the treatment of breast cancer, such as drug therapy, hormone therapy and so on. Among them, molecular targeted therapy greatly improves the therapeutic effect of breast cancer, so it is of great significance to find important molecular markers[1–2]. According to the differences of estrogen receptor (ER), human peripheral growth factor receptor 2 (HER2), and progesterone receptor (PR), breast cancer can be divided into four subtypes: triple-negative (TN), luminal A, luminal B, and HER2-positive[3].
HER2 + breast cancer accounts for 15% − 20% of breast cancer, which is caused by erbB2 / neu oncogene amplification or overexpression of HER2 transmembrane receptor protein [4–6]. HER2 is a member of transmembrane receptor tyrosine kinase family (EGFR / HER1, HER2, HER3, HER4), and they are involved in cell proliferation, motility, anti apoptosis, invasiveness and angiogenesis [7]. HER2 + breast cancer has higher grade, more aggressive phenotype and worse prognosis[8]. Therefore, from the perspective of bioinformatics, this study aims to find the key genes related to its development and prognosis, and provide new ideas for its treatment.
DNA microarray and sequencing have become essential and efficient methods in cancer research following the development of high-throughput sequencing (HTS) technology in recent years[9–10]. Weighted Gene Co-Expression Network Analysis (WGCNA) can represent microarray data more completely by considering the relationship between transcripts measured, which can be evaluated by pairwise correlation between gene expression profiles. WGCNA starts from the level of thousands of genes to identify the gene modules of clinical interest. Finally, it uses the connectivity and gene importance to identify the key genes in the disease pathway for further verification.
In this study, we analyzed DNA microarray data retrieved from The Cancer Genome Atlas (TCGA) databases using bioinformatics methods to provide a theoretical basis for exploring potential targets for the treatment and prognostic assessment of HER2 + breast cancer. It is hoped that this study will provide assistance in the treatment of HER2 + breast cancer.
Datasets
The TCGA database (https://portal.gdc.cancer.gov/) includes clinical sample information and sequencing data [11]. The transcriptome data on TCGA was sequenced using whole transcriptome sequencing and we used RNAseq measured expression mRNA. We downloaded 134 breast cancer samples using R's TCGAbiolinks toolkit, including 47 HER2 + breast cancer samples and 87 normal tissue samples [12].
Identification of DEGs
The gene expression data were divided into HER2 + breast cancer and normal groups, and differential analysis of the data was performed using the limma package in R language [13]. The final differentially expressed genes were filtered using | logFC |>2.0 and P < 0.01.
Co-expression Module Identification
The WGCNA analysis method can search for synergistically expressed gene modules and explore the association between gene networks and phenotypes. In this study, we analyzed the differentially expressed genes by WGCNA [14–17]. The WGCNA package in R was used to analyze the differential genes obtained in the previous step and to normalize the data. Clinical information was used for the type of information of the data, including normal and cancer groups. The soft threshold β = 24 is obtained by picksoftthreshold function to construct co expression matrix.
Functional Enrichment Analysis
In order to further explore the genes in the co-expression network obtained in the previous step, we performed Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis using the clusterProfiler package in R. The GO analysis separated the function of the genes into three components: biological process (BP), cellular component (CC), and molecular function (MF) [18–19]. P < 0.05 was used as a filter for both GO and KEGG.
PPI networks and Hub Genes
We used the STRING(https://string-db.org) online tool to construct PPI networks of key co-expression modules[20–21]. The PPI networks were then imported into Cytoscape software for visualization and further analysis and used the MCC algorithm of the CytoHubba plugin to search for 100 key genes[22–23]. In order to differentiate HER2 + breast cancer from other breast cancer subtypes, we performed the same differential analysis of mRNA data for the other three breast cancer subtypes (basal, lumA and lumB), which were also downloaded from the TCGA database, using the same methodology using the limma package in R. The 100 key genes obtained by MCC were compared with the differential genes of the other three subtypes, which yielded key genes that were present only in the HER2 + subtype, but not in the other subtypes, as key genes that distinguish the HER2 + subtype from the other subtypes[24].
Survival analysis of key genes and Validation of Protein Expressions
The online tool PROGgeneV2 (http://genomics.jefferson.edu)was used to view the relationship between UTS2, DRD4 and overall survival of HER2 + patients. Divide the data into high and low expression groups based on the median gene expression value and look at the relationship between the high and low expression groups of these two genes and survival time, respectively[25–26]. In addition, the Human Protein Atlas (HPA, https://www.proteinatlas.org/)database was used to view the expression of key gene in tissues, cells, and tumors[27–28].
Cox Proportional Hazards Regression Model
The Cox proportional hazards regression model uses survival time as a strain, which allows simultaneous analysis of the effect of multiple factors on survival. In this study, we used R language survival package combined with clinical data including age and tumor grade as well as the selected key gene to analyze the effect of these factors on the 2-year and 4-year survival of HER2 + breast cancer patients, which can further validate the role of key gene in HER2 + breast cancer patients.
Identification of DEGs
After making differential genes for 134 sets of samples, 2063 differential genes were obtained according to the screening conditions | logFC | > 2.0 and P < 0.01, of which 897 genes were up-regulated and 1166 genes were down-regulated. The differential genes were displayed in a volcano diagram as in Fig. 1.
Co-expression Module Identification
The co-expression network was constructed for the resulting 2063 differential genes, yielding a total of four modules, excluding the gray module. The different colors represent different modules, as shown in Fig. 2A. Then a correlation heat map of the module and sample traits was drawn, as shown in Fig. 2B, with rows representing modules and columns representing traits, and traits including both HER2 + and normal. According to the correlation heat map, it can be seen that the module with the highest correlation with HER2 + breast cancer is the turquoise module with a correlation of 0.94. The turquoise module is a key co-expression module with a gene count of 760.
Functional Enrichment Analysis
The 760 genes in the key co-expression module obtained in the previous step were subjected to GO and KEGG pathway analysis in order to further understand the functions of these genes, as shown in Fig. 3. In the GO analysis, the BP of the key co-expression modules were focused on extracellular matrix organization,extracellular structure organization and regulation of ion transmembrane transport, CC mainly include collagen-containing extracellular matrix, transporter complex༌apical part of celland DNA packaging complex, and MF is mainly related to extracellular matrix structural constituent, glycosaminoglycan binding, and receptor ligand activity. KEGG pathway analysis of key co-expression modules revealed that these key genes primarily function in pathways including neuroactive ligand − receptor interaction, cytokine − cytokine, receptor interaction, tyrosine metabolism and phenylalanine metabolism.
PPI networks and Hub Genes
A PPI network was constructed for 760 key genes of key co-expression modules and 100 key genes were found using the MCC algorithm of the CytoHubba plugin (Fig. 4). Differential gene analysis was also used to obtain differential genes for the other three subtypes of breast cancer, with the same screening method | logFC | > 2.0 and P < 0.01. The differential gene results were as follows, with basal 2093, lumA 1365, and lumB 2084. The 100 key genes obtained by the MCC algorithm were compared two-by-two with the differential genes of the other three subtypes to identify genes present only in the HER2 subtype and not in the other subtypes, DRD4 UTS2 and GLP1R, respectively, as shown in Fig. 5. DRD4 and UTS2 were up-regulated and GLP1R was down-regulated. To verify the accuracy of the key genes, we plotted their expression values in box plots across the four subtypes, as shown in Fig. 6, where UTS2 is expressed higher in the HER2 + subtype than in the other subtypes. DRD4 was most highly expressed in the lumA subtype, but its logFC value was less than 2, which did not meet the screening criteria for differential genes, as shown in Table 1. The logFC value for DRD4 in the HER2 subtype was 2.11, which exceeded the screening criteria by 2. GLP1R is less expressed in the HER2 subtype than in the other subtypes. It is therefore concluded that DRD4 UTS2 and GLP1R were key genes that distinguish the HER2 + subtype from other subtypes.
genesymbol |
subtype |
logFC |
t |
P.Value |
adj.P.Val |
---|---|---|---|---|---|
DRD4 |
lumA |
1.9667676 |
13.53841 |
2.60E-34 |
2.68E-33 |
DRD4 |
lumB |
1.4850377 |
8.286951 |
1.53E-14 |
5.42E-14 |
DRD4 |
basal |
1.7474764 |
9.56844 |
7.61E-18 |
3.49E-17 |
DRD4 |
HER2+ |
2.1192462 |
9.945784 |
9.10E-18 |
5.89E-17 |
UTS2 |
lumA |
1.0425104 |
6.531077 |
2.11E-10 |
5.49E-10 |
UTS2 |
lumB |
1.4346405 |
7.702268 |
5.68E-13 |
1.82E-12 |
UTS2 |
basal |
1.7886484 |
7.733012 |
6.77E-13 |
2.21E-12 |
UTS2 |
HER2+ |
2.2715777 |
8.116018 |
2.84E-13 |
1.22E-12 |
GLP1R |
lumA |
-1.56017 |
-7.32302 |
1.48E-12 |
4.40E-12 |
GLP1R |
lumB |
-1.80133 |
-7.09725 |
2.06E-11 |
5.89E-11 |
GLP1R |
basal |
-0.7034848 |
-2.191190 |
0.029700 |
0.04073 |
GLP1R |
HER2+ |
-2.09328 |
-8.33133 |
8.62E-14 |
3.89E-13 |
Survival analysis of key genes and Validation of Protein Expressions
The PROGgeneV2 database was used to view the effect of UTS2 DRD4 and GLP1R expression on the prognosis of HER2 + patients. The data used for the survival analysis was GSE6130 from the GEO database, and the data was divided into high and low expression groups according to median gene expression to see its impact on survival in HER2 + positive patients. The results showed that patients with low expression of DRD4 and GLP1R had a lower survival time than those with high expression, while UTS2 had a lower survival rate with high expression, as shown in Fig. 7. In addition, in the survival analysis of UTS2, the p-value was 0.0025, which was statistically significant. From the above analysis, it is evident that high expression of UTS2 is significantly associated with low survival in HER2 + patients. To further validate UTS2, the HPA database was used to view its expression in normal breast tissues as well as breast cancer tissues, and the results are shown in Fig. 8. The results showed that the protein expression level of UTS2 was significantly higher in breast cancer than in normal tissues, further validating the accuracy of the results.
Cox Proportional Hazards Regression Model
In the Cox regression model, clinical data were downloaded from the TCGA database for HER2 + patients, where age was the actual age of the patient and tumor grade was represented by 1–4 for T1-T4 grade, respectively. In the expression of UTS2, 0 represented low expression and 1 represented high expression. The Cox regression model was constructed using the data as above and combined with the patient's survival time, and the model results were displayed as a nomogram, as shown in Fig. 9A. By adding up the points corresponding to each index, which corresponds to the total points, we can get the survival rate of 2 years and 4 years for different patients. It can be seen that higher expression of UTS2 corresponds to higher points, which corresponds to lower survival rate. In addition, age and tumor grade are also positively correlated with lower survival rates. To assess the accuracy of the nomogram, we plotted the calibration curves, which were in good agreement for both predictions, as shown in Fig. 9B,C.
In this study, microarray datasets of HER2 + breast cancer retrieved from TCGA databases were subjected to differential gene expression analysis by bioinformatics methods. The gene co-expression networks were constructed from the DEGs of TCGA with the WGCNA package. GO and KEGG pathway enrichment analyses demonstrated that these significantly DEGs are closely related to extracellular matrix organization, collagen-containing extracellular matrix, extracellular matrix structural constituent and neuroactive ligand − receptor interaction, etc. The PPI network identified 100 key genes. Compared with the difference genes of luma, lumb and basal subtypes, 3 key genes, UTS2 DRD4 and GLP1R, were found only in HER2 subtypes. The survival analysis using the PROGgeneV2 online tool showed that the results of UTS2 showed statistical significance (P < 0.05) and the protein levels of the UTS2 gene was significantly higher in tumor tissues compared with normal tissues based on the HPA database. COX regression analysis also showed that high expression of UTS2 was negatively associated with 2-year and 4-year survival in HER2 + patients.UTS2 is a potential prognosis-related key genes in HER2 + breast cancer.
DRD4 is a G protein coupled receptor coding gene, which encodes the D4 subtype of the dopamine receptor and the receptors is activated by the neurotransmitter dopamine. Studies have shown that dopamine receptor is related to the death and migration of tumor cells, and can regulate various death modes of tumor cells[29–30]. DRD4 was highly expressed in breast cancer, neuroblastoma and small cell lung cancer, and its expression level was related to the survival of patients[31]. We found that DRD4 was highly expressed in all four subtypes of breast cancer, the most obvious of which was lumA subtype, but its logFC value did not exceed the threshold. The logFC value of HER2 + subtype was more than 2, which met the threshold requirements. Survival analysis showed that high expression of DRD4 did not show lower survival rate in HER2 + breast cancer, so we did not further analyze it.
GLP1R (Glucagon Like Peptide 1 Receptor) is a Protein Coding gene which is a member of G protein coupled receptor family. GLP-1R is an important target when treating of type 2 diabetes and obesity[32–34]. Recent studies have shown that GLP-1R activation has a cytoreductive effect on human pancreatic cancer cells[35]. We found that GLP-1R was low expressed in HER2 +, lumA and lumB subtypes, and it was most obvious in HER2 + breast cancer. Survival analysis showed that the low expression of GLP-1R showed a lower survival rate, indicating that GLP-1R is an important gene related to the survival of HER2 + patients.
UTS2 (urotensin 2) has 3 transcripts (splice variants), 251 orthologues and is a member of 1 Ensembl protein family. It is a Protein Coding gene. Diseases associated with UTS2 include Portal Hypertension and Congestive Heart Failure. Among its related pathways are RET signaling and Signaling by GPCR[36]. Gene Ontology (GO) annotations related to this gene include signaling receptor binding and hormone activity[37]. Studies have found that urotensin-II could contribute to breast carcinogenesis and Thr21Met polymorphism can be an important risk factor in developing breast tumors[38]. UTS2 might be used as potential targets to improve diagnosis and immunotherapy biomarkers for renal cell carcinoma and colon cancer [39–40].
Our study found that UTS2 was highly expressed in HER2 + breast cancer, but not in lumA, lumB and basal. Additionally, the high expression of UTS2 is also significantly associated with the low survival rate of patients with HER2 + breast cancer. Nomogram proves it too. Therefore, UTS2 may be a key target for the diagnosis and treatment of HER2 + breast cancer.
Even though our results require experimental validation, the identified genes are potential targets for the treatment and prognostic assessment of HER2 + breast cancer.
HER2, Human epidermal growth factor receptor 2;
KEGG, Kyoto Encyclopedia of Genes and Genomes;
OS, Overall survival;
TCGA, The Cancer Genome Atlas;
WGCNA, Weighted correlation network analysis;
GO, Gene Ontology;
BP, Biological process;
CC, Cellular component;
MF, Molecular function.
PPI, Protein-protein interaction
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Availability of data and materials
TCGA database (https://portal.gdc.cancer.gov/)
Competing interests
Not applicable
Funding
Not applicable
Authors' contributions
Pengfei Ning, Zhongxian Li, Wei Liang designed the study;
Yujie WENG, Yucheng JI researched literature;
Rong Jia , Ying Liang wrote the main manuscript text;
All authors reviewed the manuscript.
Acknowledgements
Not applicable