2.1 Data
Using the Gene Expression Omnibus database (GEO, https://www.ncbi.nlm.nih.gov/geo/), we chose two datasets linked with sarcoidosis for subsequent analyses. Notably, the GSE83456 [16] based on the GPL10558 platform, including 49 sarcoidosis patients and 61 normal individual blood samples, was used to investigate the immunologic mechanism in sarcoidosis. Additionally, the GSE42834 [17] includes 61 sarcoidosis patients and 113 blood samples from normal individuals, which are also based on the GPL10558 platform, were used for the verification test. The arrays function within the limma package was explored for normalizing gene expression profiles in GSE83456 and GSE42834 [18]. Besides, we employed the impute package (http://bioconductor.org/packages/impute/) as supplementary to the absent data.
2.2 Differentially Expressed Genes (DEGs) Screening
To demonstrate the impact of inter-sample correction, we used a PCA cluster plot with 2 dimensions. DEGs between sarcoidosis and control was also revealed via lmFit and eBayes functions using limma package . The heatmap and the volcano map of DEGs were generated via the ggplot2 and pheatmap package was used in illustrating the differential expression of DEGs. DEGs showing P <0.05 upon adjustment using the false discovery rate (FDR), and |log2fold change |>0.5 were regarded as significant.
2.3 weighted gene coexpression network analysis (WGCNA)
WGCNA explores necessary modules and key genes of overlapping genes. It adopts the topological overlapping measurements to reveal the corresponding expression modules and describe the pattern of gene correlation between different samples[19]. The expression profile of DEGs, universally down-regulated or up-regulated in the sarcoidosis and control groups, were extracted to perform WGCNA in GSE83456. Firstly, we used hclust function for hierarchical clustering analysis. Secondly, we utilized the pick soft threshold function to screen for the soft thresholding power value when constructing the module. Using candidate power (1 to 20), we determined the average connectivity degrees of various modules in addition to their independent traits. For any degree of independence above 0.8, a suitable power value was chosen. The WGCNA R package was employed in establishing the co-expression net-work (modules), with the minimum-sized module set to 30, whereas we issued a unique color label to each module. In WGCNA, we identified gene significance (GS) as the relationship between genes and phenotypes. Then, a module membership (MM): MM (i)=cor(x i, ME) was highlighted to evaluate essential gene functions in the module. Notably, the highest correlation index between Module membership and gene significance was used to screen key module.
2.4 Functional Correlation Analysis of key module
The Database for Annotation, Visualization and Integrated Discovery(DAVID, http://david.abcc.ncifcrf.gov/) which incorporates a comprehensive biological knowledge base and a series of analytic tools employed when extracting biological themes for proteins or genes [20] was used for Gene Ontology(GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses on genes of key module. The Beniamini and Hochberg(BH) method was used for adjusting the p values, and the adjusted p-value <0.05 acted as the threshold for significant results. Visualization of results was compiled using the R ggplot2 package.
2.5 Screening and verifying diagnostic markers
The least absolute shrinkage and selection operator (LASSO) logistics regression and support vector machine recursive feature elimination (SVM-RFE) both were classic methods in machine learning,which were executed to conduct feature selection for screening the diagnostic markers for sarcoidosis. Moreover, the GSE42834 was used to verify the diagnostic efficiency of the obtained diagnostic markers. We used the LASSO algorithm via the glmnet package. Additionally, SVM-RFE, which is a machine learning technique that relies on a support vector machine, was employed to reveal the optimal variations by eliminating SVM-generated eigenvectors [15]. SVM module was constructed to determine the diagnostic value of molecular markers in sarcoidosis using e1071 and kernlab package. Consequently, we choose an interaction of genes derived from SVM-RFE or LASSO algorithm for subsequent analyses. We considered a two-sided p<0.05 to represent statistical significance.
2.6 Evaluating immune cell infiltration
With the uploading of the GSE83456 data to CIBERSORT. Those samples having p>0.05, were eliminated and we got the immune cell infiltration matrix data. Furthermore, ggplot2 package was used for the analysis of PCA clustering about immune cell infiltration quantitative data to generate a PCA clustering map with 2 dimensions. Next, we used corrplot package to create a correlation heatmap in identifying any relationship between 22 infiltrating immune cell types; ggpolt2 package was employed to picture violin plots to observe any difference in the immune cell infiltration.
2.7 relationship analysis of diagnostic molecular markers with infiltrating immune cells
The corrplot package of R software was adopted for Spearman correlation analysis of the infiltrating immune cells as well as the diagnostic markers. The threshold was set as |r | >0.3 and P<0.05. The former result was visualized by the ggplot2 package.
2.8 Clinical specimen verification
2.8.1 Specimen source
The serum samples of 60 patients with pulmonary sarcoidosis were collected from the Department of Respiratory and critical Care Medicine, the first affiliated Hospital of Chengdu Medical College from January 2016 to December 2020. At the same time, the serum samples of 60 healthy subjects were collected as the control group. The peripheral blood samples of all subjects' 10ml were collected by anticoagulant tube. After 30min was placed at 4 ℃, centrifuged for 15 minutes at room temperature at 3000r, the upper serum was retained and stored in the refrigerator at-80 ℃. Admission criteria: (1) Age greater than 18 years old, EBUS-TBNA biopsy, pathological diagnosis of pulmonary sarcoidosis; (2) patients with good compliance can cooperate with the examination. Exclusion criteria: (1) having been given glucocorticoid treatment; (2) complicated with infectious diseases, tumor diseases or immunodeficiency diseases. (3) severe cardiopulmonary insufficiency or blood coagulation disturbance. This study was approved by the Ethics Committee of the first affiliated Hospital of Chengdu Medical College, and the approval number is 2020CYFYIRB-BA-101. All the controls signed the relevant informed consent.2.8.1 Detection of mRNA transcription level in clinical tissue samples of BATF2 and PDK4 by RT-qPCR.
Total RNA was isolated using Trizol (Invitrogen, Grand Island, NY). According to the instructions of cDNA synthesis kit (TaKaRa Company, Japan), the reverse transcription conditions of cDNA, were as follows: 37 ℃, 30 min, 95 ℃, 5 min, -20 ℃. RT-qPCR preparation system: 10 μ L SYBR Green (Japan TaKaRa Co., Ltd.), upstream and downstream primers 0.8 μ L, 2 μ L cDNA, 6.4 μ L without enzyme water. The RT-qPCR reaction was carried out in ABI 8000 real-time quantitative PCR, with GAPDH as the internal reference, and the reaction conditions were set as follows: pre-denaturation at 95 ℃, 30s at 95 ℃, 5s at 60 ℃, 34s at 60 ℃, 40 cycles and annealing at 60 ℃ for 30s. BATF2 upstream primer is 5’-CCTCTCCCGACAACCCTTC, downstream primer is GTGGACTTGAGCAGAGGAGA-3’; PDK4 upstream primer is 5’-TTGGCTGGTTTTGGTTACGG, downstream primer is CACCAGTCAGCCTCAGA-3’; GAPDH upstream primer is 5’-CCATCTTCCAGGAGCGAGAT, downstream primer is TGCTGATGATCTTGAGGCTG-3’. Gene expression levels were calculated relative to the housekeeping gene GAPDH. The ROC curve was drawn by ROCR package in R software, and the area (AUC) under ROC curve was used to evaluate the diagnostic value of serum BATF2 and PDK4 and their combination in pulmonary sarcoidosis.