TCGA datasets
TCGA (The Cancer Genome Atlas) is a landmark cancer genomics project covering 33 cancer types and more than 20,000 samples. The transcriptional expression data and clinical information for COL5A3 were downloaded from the official TCGA website. In addition, we compared COL5A3 expression between cancer and paracancerous tissues using integrated data from the TCGA and GTEx databases. RNA-seq data in TPM (transcripts per million) format were log2-transformed before analysis and comparison. Finally, we compared COL5A3 expression between normal and tumor tissues in the GEO dataset GSE16515.
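As an illustration only, the log2 transformation of TPM values can be sketched as follows. This is a Python sketch, not the R code used in the study. The TPM values are made up, the function name `log2_tpm` is hypothetical, and the pseudocount of 1 (added so that a TPM of 0 stays finite) is an assumption, since the text does not state whether one was used.

```python
import math


def log2_tpm(tpm_values, pseudocount=1.0):
    """Return log2(TPM + pseudocount) for each value.

    The pseudocount is an assumption of this sketch; it keeps
    zero-expression samples finite after the transformation.
    """
    if any(v < 0 for v in tpm_values):
        raise ValueError("TPM values must be non-negative")
    return [math.log2(v + pseudocount) for v in tpm_values]


# Made-up TPM values, for illustration only.
example_tpm = [0.0, 1.0, 3.0, 15.0]
transformed = log2_tpm(example_tpm)
print(transformed)
```

The transformation compresses the wide dynamic range of TPM values, which makes expression levels in the two tissue groups easier to compare on one scale.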
RNA-Sequencing data of COL5A3 in pancreatic cancer
RNA-Seq expression data of COL5A3 in pancreatic cancer were downloaded from the TCGA website. Because the TCGA database contains few normal tissue samples for pancreatic cancer, we also included normal pancreatic tissue samples from the GTEx database, and finally retained data from 171 cases of adjacent normal tissue and 179 cases of tumor tissue. The selected clinical samples include COL5A3 gene expression data and related clinical information, such as patient sex, age, smoking and drinking history, tumor TNM stage, and initial treatment outcome.
The Human Protein Atlas (THPA)
THPA (https://www.proteinatlas.org/) is a human protein atlas that aims to provide information on the tissue and cellular distribution of all 24,000 human proteins. In this study, we used immunohistochemistry (IHC) images from THPA to examine the expression of COL5A3 in normal tissues and pancreatic cancer.
Analysis of prognostic indexes
To explore the prognostic value of COL5A3 in patients with pancreatic cancer, we analyzed survival endpoints including overall survival (OS) and disease-specific survival (DSS). The RNA sequencing data and corresponding clinical information of pancreatic cancer patients were downloaded from the TCGA database, and survival was visualized with Kaplan-Meier curves. Patients were divided into a low-expression group and a high-expression group according to COL5A3 expression, and P values were obtained by the log-rank test and Cox regression analysis.
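The grouping and the Kaplan-Meier estimate can be illustrated with a minimal Python sketch. It is not the R code used in the study. The median cutoff is an assumption, because the text does not state how the low and high groups were defined. The function names and the toy follow-up data are hypothetical, and the log-rank test and Cox regression are not reproduced here.

```python
from statistics import median


def split_by_median(expression):
    """Label each sample 'high' or 'low' relative to the median.

    The median cutoff is an assumption of this sketch.
    """
    cutoff = median(expression)
    return ["high" if x > cutoff else "low" for x in expression]


def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times  : follow-up time per patient
    events : 1 if death was observed, 0 if the patient was censored
    Returns a list of (time, survival probability) at each event time.
    """
    ordered = sorted(zip(times, events))
    n_at_risk = len(ordered)
    survival = 1.0
    curve = []
    i = 0
    while i < len(ordered):
        t = ordered[i][0]
        deaths = 0
        leaving = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(ordered) and ordered[i][0] == t:
            deaths += ordered[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


# Toy data, for illustration only.
groups = split_by_median([1.2, 2.4, 3.1, 4.8])
curve = kaplan_meier(times=[1, 2, 3, 4], events=[1, 0, 1, 1])
print(groups)
print(curve)
```

In practice one curve is estimated per expression group, and the two curves are then compared with the log-rank test.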
Protein-Protein Interaction (PPI) networks and functional enrichment analysis
STRING (https://www.string-db.org/) is a commonly used database of known and predicted protein-protein interactions. Studying protein-protein interaction networks helps to identify core regulatory genes. In this study, we obtained the top ten COL5A3-related genes from the STRING database and constructed a PPI network map. GO and KEGG enrichment analyses of the selected genes were performed with the clusterProfiler package (version 3.14.3), and the results were visualized mainly with the ggplot2 package (version 3.3.3).
Tumor immune estimation resource database (TIMER)
TIMER (https://cistrome.shinyapps.io/timer/) uses RNA-Seq expression profile data to estimate the infiltration of immune cells in tumor tissues. At present, it mainly provides infiltration estimates for six types of immune cells. In this study, we used the TIMER database to assess the correlation of COL5A3 expression with tumor purity and with the infiltration of six types of immune cells (B cells, CD4+ T cells, CD8+ T cells, neutrophils, macrophages, and dendritic cells).
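To make the notion of correlation concrete, the sketch below computes a rank correlation between two toy vectors in Python. The text does not state which coefficient was used, so the choice of Spearman's rank correlation here is an assumption, and no adjustment for tumor purity is attempted. All function names and values are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values, for illustration only.
expression = [1.0, 2.0, 3.0, 4.0]
infiltration = [0.10, 0.20, 0.30, 0.40]
rho = spearman(expression, infiltration)
print(rho)
```

A coefficient near 1 indicates that samples with higher expression tend to have higher estimated infiltration, and a coefficient near -1 indicates the opposite.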
Tumor immune system interaction database (TISIDB)
TISIDB (http://cis.hku.hk/TISIDB) is a web portal for studying the interaction between tumors and the immune system, and it integrates a number of tumor-related databases. In this study, we used TISIDB to examine the relationship between COL5A3 expression and the abundance of 28 types of tumor-infiltrating lymphocytes (TILs) in different human cancers.
Statistical analysis
All statistical analyses and visualizations were carried out in R (version 3.6.3), with the ggplot2 package (version 3.3.3) used for visualization. The Mann-Whitney U test was used to assess the difference in COL5A3 expression between pancreatic cancer tissue and adjacent normal tissue. To evaluate the effect of COL5A3 expression on survival, we used the Kaplan-Meier method and the log-rank test, with the survival package (version 3.2-10) for statistical analysis and the survminer package (version 0.4.9) for visualization. ROC curves were analyzed with the pROC package (version 1.17.0.1).
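The Mann-Whitney U statistic and the area under the ROC curve (AUC) are closely related: dividing U by the number of tumor-normal pairs gives the AUC. The Python sketch below illustrates this relation on made-up values. It is not the R code used in the study, the function names are hypothetical, and no P value is computed.

```python
def mann_whitney_u(group_a, group_b):
    """U statistic for group_a versus group_b.

    Counts the (a, b) pairs in which a exceeds b; a tie counts as 0.5.
    """
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def auc_from_u(group_a, group_b):
    """AUC as the fraction of pairs in which group_a ranks higher."""
    return mann_whitney_u(group_a, group_b) / (len(group_a) * len(group_b))


# Made-up expression values, for illustration only.
tumor = [5.0, 6.0, 7.0]
normal = [1.0, 2.0, 6.0]
u_stat = mann_whitney_u(tumor, normal)
auc = auc_from_u(tumor, normal)
print(u_stat, auc)
```

An AUC of 0.5 means that expression does not separate the two groups, while a value of 1 means that every tumor sample has higher expression than every normal sample.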