Gene expression profile data
Two independent GC gene expression datasets, GSE54129 and GSE79973, together comprising 121 primary gastric tumor samples and 31 normal gastric tissue samples, were selected from the GEO database. Both datasets were generated on the same platform, GPL570 (Affymetrix Human Genome U133 Plus 2.0 Array). In addition, clinical information and gene expression profiles for GC and non-cancerous samples were obtained from the TCGA database (TCGA-STAD) and from GSE84437 (433 GC patients). The flowchart for the bioinformatics analysis is shown in Figure 1.
Data preprocessing and DEG identification
The "affy" package in R (version 3.6.3; http://r-project.org/), which supports exploratory analysis of oligonucleotide arrays, was used to read the CEL files downloaded from the GEO database. Two professional bioinformatics analysts carried out the data preprocessing, which included background correction, data normalization, batch-effect removal, merging of the normal and tumor group data, conversion of probe IDs to gene symbols, and imputation of missing probe values [11]. We then identified DEGs using the "limma" package from Bioconductor [12]. Only genes with |logFC (fold-change)| > 2 and adj. P < 0.01 were selected. The volcano plot and Venn diagram were generated using the "ggplot2" and "VennDiagram" packages, respectively.
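The selection rule above can be illustrated with a minimal Python sketch. It is not the R/limma code used in the study; the `GeneStat` record, the `select_degs` function, and the gene names and values are hypothetical and serve only to show how the two cutoffs combine, assuming a table of per-gene log fold-changes and adjusted P-values has already been computed.

```python
from typing import Iterable, List, NamedTuple, Tuple


class GeneStat(NamedTuple):
    """One row of a differential-expression result table (hypothetical layout)."""
    symbol: str
    log_fc: float  # log fold-change, tumor vs. normal
    adj_p: float   # multiple-testing adjusted P-value


def select_degs(
    stats: Iterable[GeneStat],
    lfc_cutoff: float = 2.0,
    p_cutoff: float = 0.01,
) -> Tuple[List[str], List[str]]:
    """Return (up-regulated, down-regulated) gene symbols passing both cutoffs."""
    up, down = [], []
    for gene in stats:
        # A gene is kept only if BOTH conditions hold: |logFC| > 2 and adj. P < 0.01
        if abs(gene.log_fc) > lfc_cutoff and gene.adj_p < p_cutoff:
            (up if gene.log_fc > 0 else down).append(gene.symbol)
    return up, down


# Made-up values for illustration only
example = [
    GeneStat("GENE_A", 3.1, 0.0004),   # passes, up-regulated
    GeneStat("GENE_B", -2.6, 0.002),   # passes, down-regulated
    GeneStat("GENE_C", 1.8, 0.0001),   # fails the fold-change cutoff
    GeneStat("GENE_D", -4.0, 0.03),    # fails the adjusted P-value cutoff
]
up_genes, down_genes = select_degs(example)
```

Both thresholds are strict inequalities, so a gene with |logFC| exactly equal to 2 would be excluded under this reading.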
Identification of prognosis genes
From all annotated genes (56,536 genes), the prognosis-related genes were identified using the "survival" package. The prognosis model was established using the Cox model of "Risk scores = " in the "survival" package and optimized using the AIC value. Patients in the TCGA cohort with risk scores above the median were defined as the "high-risk group", and the remaining patients were defined as the "low-risk group". Univariate and multivariate analyses were used to assess the independence and validity of the prognosis model.
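The median-based grouping and the AIC criterion can be sketched as follows. This is an illustration in Python rather than the R "survival" workflow used in the study; the function names and patient scores are hypothetical, and the risk scores are taken as given because the scoring formula is not reproduced here. The AIC helper uses the standard definition, AIC = 2k - 2 ln L, where for a Cox model L is usually taken to be the partial likelihood.

```python
from statistics import median
from typing import Dict


def assign_risk_groups(risk_scores: Dict[str, float]) -> Dict[str, str]:
    """Label each patient 'high' if the score exceeds the cohort median, else 'low'."""
    cutoff = median(risk_scores.values())
    return {
        patient: ("high" if score > cutoff else "low")
        for patient, score in risk_scores.items()
    }


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: lower values indicate a preferred model."""
    return 2 * n_params - 2 * log_likelihood


# Made-up risk scores for five patients; the median is 1.1
scores = {"P1": 0.8, "P2": 1.5, "P3": 1.1, "P4": 2.3, "P5": 0.4}
groups = assign_risk_groups(scores)
```

A patient whose score equals the median falls into the low-risk group here, which matches the wording "above the median" for the high-risk group.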
Clinical relevance and GSEA enrichment analysis
The correlations between the final filtered genes and clinical parameters were explored using TCGA data; the parameters included stage, T stage, age, grade, M stage, and N stage. Subsequently, the samples were divided into high- and low-expression groups, and GSEA was conducted to link the genes with likely pathways [13]. Gene set permutations were performed 1,000 times for each analysis. Enriched pathways of interest were selected using the criteria FDR < 0.25 and NOM P-value < 0.05.
Immune infiltration analysis
CIBERSORT, an analytical tool developed by Newman et al., uses gene expression data to estimate the abundance of member cell types in a mixed cell population [14]. We used the "CIBERSORT" package in R to analyze possible associations between the identified genes and immune cells. TIMER, a comprehensive tool for the systematic analysis of immune cell infiltration across cancer types, was then used to analyze the relationships between the identified genes and five immune checkpoint-related genes (TOX, CD274, PDCD1LG2, CTLA4, and PDCD1) [15]. Additionally, we analyzed the association between ALDH3A2 copy number alterations and immune infiltration levels in STAD. Finally, we used the cBioPortal database to analyze the correlation between copy number alterations and the mRNA levels of the gene [16].
Immunohistochemical staining
IHC staining was carried out on tissue sections obtained from 140 paraffin-embedded GC samples. Sections of GC tissue (10 μm thick) were mounted on glass microscope slides, deparaffinized in xylene, and rehydrated in a graded alcohol series. Antigen retrieval was performed at a high temperature using a water bath. The sections were cooled and rinsed, and endogenous peroxidases were quenched using 3% H2O2. After incubation in 5% BSA for 45 min at room temperature, the sections were incubated overnight at 4°C with the ALDH3A2 antibody (dilution: 1:350; Abcam, city, state). The sections were then washed and incubated with the secondary antibody for 60 min at room temperature. Antibody staining was visualized using the Dako EnVision System (Dako, Glostrup, Denmark). The IHC staining results were analyzed and scored by two pathologists who were blinded to the sources of the clinical samples. A semi-quantitative integration method was used to analyze the area and intensity of staining [17]. The proportion of cells that stained positive for ALDH3A2 was scored as 1 = 0~10%, 2 = 10%~25%, 3 = 50%~75%, and 4 = 75%~100%. The intensity of staining was scored as 0 = no staining, 1 = weak staining, 2 = moderate staining, and 3 = strong staining. The final IHC score was calculated by multiplying the proportion score by the intensity score. Scores greater than six were regarded as high, and scores of six or less were regarded as low.
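The scoring arithmetic described above can be written out as a short Python sketch. The function names are hypothetical. The sketch takes the two category scores as already assigned by the pathologists; how the two readers' scores were reconciled is not modeled here.

```python
def ihc_score(proportion_score: int, intensity_score: int) -> int:
    """Final IHC score = proportion score (1-4) x intensity score (0-3)."""
    if proportion_score not in (1, 2, 3, 4):
        raise ValueError("proportion score must be 1, 2, 3, or 4")
    if intensity_score not in (0, 1, 2, 3):
        raise ValueError("intensity score must be 0, 1, 2, or 3")
    return proportion_score * intensity_score


def ihc_group(score: int, cutoff: int = 6) -> str:
    """Scores strictly greater than the cutoff are 'high'; all others are 'low'."""
    return "high" if score > cutoff else "low"


# Example: proportion score 4 with moderate staining (2) gives 8, a high score
example_score = ihc_score(4, 2)
example_group = ihc_group(example_score)
```

With these scales the final score ranges from 0 to 12, and a score of exactly six (for example 3 x 2) is classified as low.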
Quantitative PCR (qPCR)
Patient tissues used for PCR analysis were obtained from the Shanghai Pudong Hospital of Fudan University. This study was approved by the Ethics Committee of the Shanghai Pudong Hospital of Fudan University. All patients consented to the use of their clinical tissues for research purposes. Total RNA was isolated using Trizol (Invitrogen). PrimeScript RT Master Mix (Takara, Japan) was used for first-strand cDNA synthesis. To analyze ALDH3A2 mRNA levels, qPCR was performed using SYBR Green according to the manufacturer's instructions (Applied Biosystems, USA). The following primers were used: ALDH3A2, forward: 5'-CTTGGAATTACCCCTTCGTTCTC-3'; ALDH3A2, reverse: 5'-TCCTGGTCTAAATACTGAGGGAG-3'; PDCD1, forward: 5'-ACGAGGGACAATAGGAGCCA-3'; PDCD1, reverse: 5'-GGCATACTCCGTCTGCTCAG-3'; PDCD1LG2, forward: 5'-ACCCTGGAATGCAACTTTGAC-3'; PDCD1LG2, reverse: 5'-AAGTGGCTCTTTCACGGTGTG-3'; CTLA4, forward: 5'-GCCCTGCACTCTCCTGTTTTT-3'; CTLA4, reverse: 5'-GGTTGCCGCACAGACTTCA-3'; GAPDH, forward: 5'-ACCACAGTCCATGCCATCAC-3'; GAPDH, reverse: 5'-TCCACCACCCTGTTGCTGTA-3'.
Protein extraction and Western blotting
Total proteins were extracted from human ccRCC tissues using Western and IP lysis buffer (Beyotime, P0013; Beijing, China). Protein concentrations were measured using a BCA reagent kit (Pierce, 23227). The proteins were resolved on 8%-12% SDS-PAGE gels and then blotted onto polyvinylidene fluoride (PVDF) membranes. The membranes were blocked in TBS/0.1% Tween-20 (TBST) containing 5% powdered skim milk for 1 h at room temperature (RT). Primary antibodies against ALDH3A2, PDCD1, PDCD1LG2, CTLA4, and GAPDH (AtaGenix, Wuhan, China) were diluted 1:300 or 1:2,000 and incubated with the membranes for 2 h at RT. The membranes were then incubated with secondary antibodies [anti-rabbit or anti-mouse IgG (H+L) biotinylated antibodies (CST, USA)] for 2 h at RT.
RNA interference studies
RNA interference of ALDH3A2 was carried out using small interfering RNA (siRNA). HGC-27 and MGC-803 cells were transfected with control siRNA or siRNA-ALDH3A2 using Lipofectamine 3000 (Invitrogen). The target sequence of the siRNA against ALDH3A2 was 5'-GCATTGCACCCGACTATAT-3'. Western blotting and qPCR were used to evaluate the knockdown efficiency.
Statistical analysis
All statistical analyses were performed, and Kaplan-Meier survival curves were generated, using R software 3.6.3 [18]. P < 0.05 was deemed statistically significant. The associations of ALDH3A2 with overall survival (OS) and other clinical variables were analyzed using multivariate Cox analysis. The area under the ROC curve (AUC value) was regarded as excellent for survival prediction when it was greater than 0.7, and as acceptable when it was greater than 0.6.
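The AUC interpretation rule can be stated compactly in code. The following Python sketch is illustrative only; the function name and the label used for values at or below 0.6 are assumptions, since the text defines only the two upper categories.

```python
def auc_rating(auc: float) -> str:
    """Classify an AUC value using the thresholds stated in the text."""
    if not 0.0 <= auc <= 1.0:
        raise ValueError("AUC must lie between 0 and 1")
    if auc > 0.7:
        return "excellent"
    if auc > 0.6:
        return "acceptable"
    # The text does not name this category; the label is an assumption
    return "below acceptable threshold"


ratings = [auc_rating(value) for value in (0.75, 0.65, 0.55)]
```

Because both thresholds are strict, an AUC of exactly 0.7 is rated acceptable rather than excellent under this reading.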