Data Acquisition
Data regarding fragments per kilobase million (FPKM) values and microRNA (miRNA)-expression levels for patients with colon adenocarcinoma/rectum adenocarcinoma (COAD/READ) were downloaded from The Cancer Genome Atlas (TCGA) Genomic Data Commons website (https://portal.gdc.cancer.gov/) and used as the training dataset. FPKM values were converted to transcripts per million values and divided into mRNA- and long non-coding RNA (lncRNA)-expression groups. “Masked Somatic Mutation” data were downloaded for patients with COAD/READ, pre-processed using VarScan software, and visualized using the R software package, maftools [7]. The clinicopathological features and prognoses of patients with COAD/READ, such as gender, age, and stage, were downloaded from the UCSC Xena website (http://xena.ucsc.edu/). After removing samples with missing clinical information, 597 tumor samples and 51 normal tissue samples were obtained. Table 1 shows clinical information for patients with COAD/READ. The likelihood of each response to immunotherapy was predicted using the Tracking of Indels by DEcomposition (TIDE) algorithm (http://tide.dfci.harvard.edu) [8]. Gene-expression data from different organizations and in different cell lines were downloaded from TCGA and the Cell Line Cancer Encyclopedia (CCLE) databases (https://portals.broadinstitute.org/ccle/about) to compare IRF-expression levels between tumor and normal tissues.
Gene-expression data in GSE17536 [9] and GSE39582 [10] and clinicopathological patient characteristics were downloaded as validation datasets from the Gene Expression Omnibus (GEO) database. The data were downloaded from Homo Sapiens; this platform is based on the GPL570 [HG-U133_PLus_2] Affymetrix Human Genome U133 Plus 2.0 Array. GSE17536 included 177 COAD tissue samples, and GSE39582 included 566 COAD tissue samples and 19 colon non-tumor tissue samples.
Genetic Characteristics of the IRF Family and Validation with Clinical Prediction Models
We incorporated the expression levels of IRF-family genes into our model. The random forest package of R [11] was used to develop an IRF-based risk-assessment model for patients with COAD/READ. Patients were divided into high- and low-IRF risk groups, based on the median value.
To assess patient prognosis by combining the IRF risk score with clinicopathological features, univariate and multivariate Cox proportional-hazards analysis were used to analyze the independent predictive power of risk scores for overall survival (OS) and disease-free survival (DFS). Subsequently, a survival-prediction nomogram was constructed for patients in the TCGA dataset and was validated for patients in the GEO dataset. To quantify differentiation performance, Harrell’s consistency index (C-index) was measured. A calibration curve was generated to evaluate the performance of the line map by comparing the predicted value of the line map with the observed OS rate.
Differentially Expressed Genes (DEGs) and Clinical Correlation Analysis
Data for patients with COAD/READ were downloaded from TCGA and the GEO databases, and the patients were divided into high- and low-expression groups, according to the IRF score. The DESeq2 package of R [12] was used to analyze DEGs in both groups, where a log fold-change (logFC)>1.0 and P<0.05 were set as threshold values for DEGs.
We compared the expression levels of IRF-family genes at different TNM stages. The Human Protein Atlas (HPA, https://www.proteinatlas.org) provides immunohistochemical expression data for nearly 20 different cancers [13] and enables identification of tumor type-specific, differential protein-expression patterns, where protein-expression levels of all IRF-family genes were compared between normal and CRC tissues.
Functional-Enrichment Analysis and Gene-Set Enrichment Analysis (GSEA)
Gene Ontology (GO) analysis is commonly used for large-scale functional-enrichment research of biological processes (BPs), molecular functions, and cellular components. The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a widely used database containing information regarding genomes, biological pathways, diseases, and drugs. GO and KEGG pathway-enrichment analyses were performed with signature genes using the clusterProfiler R package [14]. A false-discovery rate of <0.05 was considered statistically significant.
To investigate differences in BPs among different subgroups, GSEA was performed using the gene-expression profiles of patients with COAD/READ. GSEA can be used to identify statistical differences between two groups in a gene set and estimate changes in pathways and BP activities [15]. The gene set “C2.CP.kegg. V6.2.-symbols” [15] was downloaded from the Molecular Signatures Database for GSEA analysis. An adjusted P value of <0.05 was considered statistically significant.
Constructing a Protein–Protein-Interaction (PPI) network and Screening Hub Genes
We used the Search Tool for Retrieving Interacting Genes (STRING) database [16], which predicts PPIs, to construct PPI networks for the selected genes. Genes with scores of >0.4 were selected to construct a network model, which was visualized with Cytoscape V3.7.2 [17]. In the co-expression network, the maximum clique centrality (MCC) algorithm most effectively located the node in a set. The MCC of each node was calculated using CytoHubba plugins [18] in Cytoscape, and genes with the highest eight MCC values were selected as hub genes.
Constructing a Competing-Endogenous RNA (ceRNA) Network Based on miRNA–mRNA and miRNA–lncRNA interactions
LncRNA–miRNA-interaction data were downloaded from the miRcode database. and miRNA–mRNA-interaction data were downloaded from the miRTarBase, miRDB, and TargetScan databases. The DESeq2 packet of R [12] was used to analyze miRNA- and lncRNA-expression differences between the high-IRF and low-IRF risk groups. LogFC>1.0 and P<0.05 were set as criteria for a statistically significant difference. Cytoscape (V3.7.2) was used to construct a ceRNA network by analyzing correlations between lncRNA- and mRNA-regulated miRNAs simultaneously.
Tumor Immune Estimation Resource (TIMER)-Database Analysis and Comparing Immune-Correlation Scores Between Both Groups
The TIMER database (https://cistrome.shinyapps.io/timer/) enables users to estimate B-cell, CD4+ T-cell, CD8+ T-cell, macrophage, neutrophil, and dendritic-cell infiltration into different tumor types [19]. We used the TIMER database to analyze correlations between the expression levels of different IRF genes and immune cell infiltrations in COAD/READ samples.
The R estimate package [20] quantifies immune cell-infiltration levels in tumor samples, based on gene-expression profiles, and was used to assess the immune-activity and stromal score of each tumor sample. Immune cell-infiltration levels between both groups were compared using the Mann–Whitney U test.
Analysis of Anticancer Therapy-Sensitivity
The Genomics of Drug Sensitivity in Cancer (GDSC) database (https://www.cancerrxgene.org/) enables exploration of gene mutations and targeted cancer therapies. We downloaded gene-expression data from cell lines and IC50 values to analyze correlations between differentially expressed-IRF genes and anticancer drug sensitivities.
Calculating Tumor-Mutation Load Fractions and Analyzing Genetic Variation of IRF Family Members in CRC
The tumor burden (TMB) of each tumor sample was defined as the number of somatic cell mutations identified, excluding silent mutations. Patients with COAD/READ were divided into high-TMB and low-TMB groups according to the median TMB value. The Wilcoxon rank-sum test was used to compare risk scores of IRF-family genes between both groups.
Patients and Specimens in the Validation Cohort
Tumor specimens were obtained from 114 patients with CRC who underwent treatment at Zhongshan Hospital (Fudan University) between 2008 and 2016. The inclusion criteria were as follows: (a) a clear pathological diagnosis of CRC, (b) complete follow-up data until December 2019, (c) suitable formalin-fixed and paraffin-embedded tissues, and (d) agreeing to participate in the study and provide signed informed consent. CRC diagnosis was based on the World Health Organization criteria, and tumor stages were classified according to the 7th edition of TNM classification of International Union Contra Cancrum. Ethical approval was obtained from the research ethics committee of Zhongshan Hospital. The clinical characteristics of the 102 patients with follow-up data are presented in Supplemental Table 1.
IHC-staining Evaluation
Cancer and adjacent normal tissues were formalin-fixed, paraffin-embedded, and prepared as tissue microarrays (TMAs) after hematoxylin and eosin staining and histopathology-guided location. Five-micron-thick TMA sections were deparaffinized and rehydrated in 0.1 M citrate buffer (pH 6.0), followed by high-temperature antigen retrieval in a microwave for 15 min. The sections were incubated overnight at 4°C with primary antibodies against IRF3 and IRF7 (Abcam, Cambridge, UK), CD4 (Servicebio Technology, Wuhan, China), CD8 (Servicebio Technology), CD19 (Servicebio Technology), and CD68 (Servicebio Technology). The sections were incubated for 30 min with a secondary antibody at room temperature and immunostained based on avidin–biotin complex formation, using 3,3¢-diaminobenzidine. Hematoxylin was used as a counterstain.
Antigen–antibody complexes in whole samples were detected using a panoramic slice scanner (3DHISTECH, Hungary) and viewed with CaseViewer 2.2 (3DHISTECH). H-scores were calculated to evaluate gene-expression levels using Quant Center 2.1 (3DHISTECH): H-score=Σ (PI×I)=(% of weakly stained cells×1)+(% of moderately stained cells×2)+(% of strongly stained cells×3), where PI is the proportion of the positive area, and I is the staining intensity.
Statistical Analysis
The data were analyzed with R software (V4.0.2). The independent Student t test was used to estimate the statistical significance of normally distributed variables, and the Mann–Whitney U test was used to analyze differences in non-normally distributed variables between two groups of continuous variables. The chi-squared test or Fisher exact test was used to analyze statistical significance between two groups of categorical variables. Correlation coefficients between different genes were calculated by Pearson correlation analysis. The survival package of R was used for survival analysis, Kaplan–Meier survival curves were used to determine survival differences, and the log-rank test was used to evaluate significant differences in survival times between two groups. Univariate and multivariate Cox analyses were used to determine independent prognostic factors. The pROC package of R [21] was used to draw receiver operating-characteristic (ROC) curves, and area under the curve (AUC) values were calculated to assess the accuracy of risk scores for prognosis estimations. All statistical P values were bilateral, and P<0.05 was considered statistically significant.