Whole transcriptome analyis of human lung tissue to identify COPD-associated gene

Background: Chronic obstructive pulmonary disease (COPD) is a common disease characterized by persistent respiratory symptoms and airflow restriction. The mechanisms underlying pathogenesis in COPD are still poorly understood. Identification of the dysfunctional genes in human lung from patients with Chronic obstructive pulmonary disease (COPD) will help up to understand the pathology of this disease. To identify the dysfunctional genes in human lung from patients with Chronic obstructive pulmonary disease (COPD). We used transcriptomic data of lung tissue for 91 COPD cases and 182 matched healthy controls from the Genotype-Tissue Expression (GTEx) database. We employed a stringent model controlling for known covariates and hidden confounders. DESeq2 R package (v1.20.0) was used to test for differential expression. Results: We identified 1,359 significant differentially expressed genes (DEG) with 707 upregulated and 602 downregulated respectively. We evaluated the identified DEGs in an independent microarray cohort of 219 COPD and 108 controls, demonstrating the robustness of our result. Functional annotation of COPD-associated genes highlighted the activation of complement cascade, dysregulation of inflammatory response and extracellular matrix organization in the COPD patients. In addition, we identified several novel key-hub genes involved in the COPD pathogenesis using a network analysis method. Conclusion: In summary, our study represents the comprehensive analysis of gene expression on COPD with the largest sample size, providing great resource for the molecular research in the COPD community.


Introduction
Chronic obstructive pulmonary disease (COPD) is a complex and heterogeneous disease, which is characterized by progressive airflow obstruction and chronic inflammation in the airways, is the third leading cause of death worldwide 1,2 . It is estimated that in 2020, of 68 million deaths worldwide, 4.7 million will be caused by COPD 3 . No currently available drug treatments can delay COPD progression or decrease mortality 4,5 , owing to the mechanisms underlying pathogenesis in COPD are still poorly understood 6 . Therefore, an improved understanding in the molecular pathogenesis of this disease is needed to identify molecular targets, enabling innovative drug development.
Previous studies had revealed numbers of differentially expressed genes (DEGs) associated with COPD progression in lung tissue. For instance, Ning W 7 performed a serial analysis of gene expression (SAGE) and identified a total of 327 DEGs from 12 COPD and 14 controls. Michael E 8 revealed 2,667 DEGs between the 23 COPD and nine controls (fold change > 1.5 and FDR < 0.05) based on microarray results. Mostafaei S 9 identified 44 candidate genes using machine learning algorithms based on microarray of 21 COPD smokers in human airway epithelial cells. However, there was little overlap for candidate genes identified among the various studies. which could be partly due to differences in the sample size, sample quality, platform and the method of identifying COPD-associated genes. Moreover, given the fact that gene expression is likely confounded with batch effect and other covariates (like age, BMI, Gender etc.), the lack of these information or do not explicitly control it in previous studies, which may lead to biases in screening COPD-associated genes.
RNA sequencing (RNA-seq) technology have tremendously advanced our understanding of the genes, pathways and molecular processes that underlying many human diseases 10,11 .
The Genotype-Tissue Expression (GTEx) project 12,13 has collected a large number of lung tissue sample with high-quality from hundreds of human donors. High-quality lung tissues 4 were used to isolate nucleic acids, genotyping, whole genome sequencing and RNA-seq analysis were performed. Of note, there are 91 lung tissue samples from the COPD patients. All covariates were well-documented. The GTEx project thus provides a great opportunity to identify the change of gene expression in the lung tissue of COPD patients.
RNA-seq in a large cohort of COPD cases and controls can advance our knowledge of the disrupted gene expression in COPD.
In this study, we performed a transcriptomic and network-based analysis, characterizing the gene expression changes associated with COPD based on GTEx data set. Several classes of genes were identified, many of which have not been previously associated with COPD. In summary, our study represents the comprehensive analysis of gene expression on COPD with the largest sample size, revealing previously unreported candidate genes and pathways that may serve as potential molecular targets in COPD.

Identification of COPD-associated genes in the GTEx lung tissue
The overall pipeline is illustrated in Fig. 1A, and the details are provided in the Methods.
Briefly, we extracted 17,882 gene expression profile (normalized counts > 5) from the GTEx data set. The optimal matching algorithm 14 was used to balance two groups of COPD patients and controls on known covariates (Age, Gender, Race, BMI, RIN). A similar overall distribution of propensity scores in matched control (N=182) and COPD (N=91) was observed ( Figure 1B). The demographic characteristics of patients with COPD and matched control are shown in Table 1. We performed Student's t-test to compare the COPD and control groups and all gave no significant results in every covariate, showing that distribution of known covariates was almost same between two groups. Moreover, we used the R sva package 15 Table 1). We identified several novel DE genes, including XIST, GREM1, IGHV2-26, IGLV3-27, LINC00551, HAPLN1, which have not been previously associated with COPD.

Validation of gene expression changes in an independent cohort
To validate our results of identified COPD-associated genes, we compared our findings with an independent microarray-based COPD gene expression dataset from the publicly available LGRC data portal (See Methods). In a nutshell, we pulled out the 665 lung samples with microarray-based transcriptome profiles, then excluded the samples with ILD disease or unknown conditions, resulting in 219 COPD cases and 108 Non-COPD controls.
We performed a similar linear regression-based approach to model gene expression values against COPD status (see Methods). The overlapped genes (n=934) from our DEGs and 6 microarray probes were examined. As shown in Figure 2B, Log2 Fold Changes (LFCs) of our DE genes were significantly correlated with LFCs of common genes in the microarray dataset (Spearman's rho correlation = 0.80). The top 10 DEGs ranked by Log2Foldchange from RNA-seq were almost all up-regulated in the validation microarray data set ( Table 2).
All the comparison of LFCs in common DEGs from two different data set were provided in Supplemental Table 2. These results suggest a concordant of COPD-associated gene expression changes in an independent dataset, and one derived using a different platform (microarrays rather than RNA-seq), gives us a high degree of confidence in the identified COPD-associated genes.

Functional annotation of COPD-associated genes
To acquire a functional overview of the biological processes (BP), molecular functions (MF) and pathways involved in the pathogenesis of COPD, we investigated the functional annotations of biological processes (BP), molecular functions (MF) and pathways for the up-and down-regulated COPD-associated genes separately. The top enriched results are presented in Figure 3. The complete annotation list is provided in Supplementary Table 3.
The positively COPD-associated genes were significantly enriched in 161 GO BP and 19 MF terms. Among these, as shown in Figure 3A and 3C, the most significantly GO terms In the up-regulated COPD-associated genes, a total of 328 nodes and 627 interactional 8 pairs were included in the PPI network. As shown in Fig. 4A, 15 genes were identified as the key hub. Among these, PDGFRA (Platelet Derived Growth Factor Receptor Alpha), and MMP2 (Matrix Metallopeptidase 2), two genes that involved in the activation of PDGF pathway and matrix metalloproteinases (MMPs) were significantly increased, which is verified by quantitative reverse transcription-PCR and microarray in previous study 7 .
RUNX1 was reported DNA gains at a high frequency in emphysema using array comparative genomic hybridization (array CGH) 24 . Smad3 is crucial to signal TGF-β1 induction of VEGF, contributing to the pathogenesis of COPD 25 . HIF-1 plays a critical role in the process of hypoxia-induced pulmonary vascular remodeling, involved in COPD 26 .
Most of the other genes are previously unreported, further studies were needed to prove these findings.
For the down-regulated COPD-associated genes, a total of 591 nodes and 1529 interactional pairs were included in the PPI network. As shown in Fig. 4B, we identified 16 genes as the key hub genes. Interestingly, the four proteasome subunit PSMA3/4 and PSMC3/4 were all down-regulated, suggesting that the proteasome function is insufficient in lung of COPD patients. CAV1 (Caveolin 1) is a scaffolding protein within caveolar membranes involved in the costimulatory signal essential for T-cell receptor (TCR)mediated T-cell activation. The loss of CAV1 would contribute to the imbalance of Th17/Treg cells in patients with COPD was also previously reported 27 . Notably, HDAC1 is a key molecule in the repression of production of proinflammatory cytokines in alveolar macrophages. Kazuhiro 28 reported that a decrease in the HDAC could be associated with enhanced inflammation in COPD, which is consistent with our findings. Interestingly, that several key hub down-regulated genes that were not yet extensively studied in the context of COPD pathogenesis such as pre-mRNA-binding proteins, HNRNPK and very 9 poorly described MAGOH. We believe that these genes and their interaction may potentially represent novel mechanisms or therapeutic targets for COPD pathogenesis in lung.

Discussion
Despite the tremendous advancement in COPD research, there is still no drugs can control or delay the progression of this disease 4,5 . It indicates us that novel genes or pathways are needed to be identify for the development of new therapies. The aim of this study was to identify the COPD-associated genes, we used RNA-seq to compare the gene expression patterns of COPD lung tissues and normal controls. In contrast to previous related works, our model stringently controls for known covariates, such as age, race, BMI, and RIN value and potential hidden confounding factors. The DEGs were validated in an independent microarray, showing the robustness of our results.
Some caveats are worth discussion below. A major limitation of this study is its lack the smoking information in the GTEx database, while smoke do affect a lot of gene expression in the pathogenesis of COPD 29,30 . However, we used the SVA R package 15 to infer 5 surrogate variables, which could partially mimic the effect of smoke. In addition, Although the major environmental risk factor for COPD is tobacco smoking, only 20% of smokers develop COPD 31 and 25-45% of patients with COPD have never smoked 32 . The strength of our study is that identified COPD-associated genes has been based on a large transcriptomic dataset, which has high benchmarks for quality control and processing (GTEx). The large sample size confers adequate power to detect important patterns while at the same time, it overcomes possible bias introduced by outliers. Matthew N. 33 reported the complex sources of variation in tissue expression data and cellular composition of a tissue is among the largest drivers of sample variability. Here, we did not take into account cellular heterogeneity of the lung tissue in the RNA-seq data, which is another limitation. Single-cell RNA sequencing 34,35 is needed to reveal a more concise transcriptome profile in the future COPD research.
To summary, our analysis is currently the largest RNA-seq based transcriptome research in COPD. Our data revealed 1,359 DEGs and several key hub genes in the COPD patients.
This unprecedented high-resolution view of the lung transcriptome associated with COPD may ultimately provide invaluable resource for the researcher and the development of new therapies. Further mechanistic investigations on these genes are warranted to elucidate function in the pathogenesis of COPD.

GTEx Tissues and Expression Data
The GTEx data set (v7.p2, October 2017 released) was downloaded from the GTEx project through dbGaP (https://dbgap.ncbi.nlm.nih.gov). The covariates, including age, gender, body mass index (BMI), RIN Number (SMRIN) and COPD status (MHCOPD) were obtained from the GTEx Portal (GTEx_Data_V7_Annotations_Subject Phenotypes DS.txt). For detailed information regarding sample collection, RNA sequencing, and the data processing pipeline refer to the GTEx Consortium paper 36 . In this data set, we excluded cases with unknown COPD status, and races other than black or white, leaving 273 healthy controls and 91 COPD cases. To reduce effects of cofounders in our statistical model, MatchIt (v3.0.2) 37 in R was used to balance five covariates (Gender, Age, Race, BMI and SMRIN) between healthy controls and COPD patients with "optimal" matching and 2:1 optimal ratio, resulting in 91 COPD cases and 182 matched healthy controls for the further analysis.

11
All the subsequent analyses were based on the raw read count data. The read counts were processed with the R-Bioconductor software suite (R version 3.5.1; Bioconductor version 3.7). To account for differences in library size and normalized counts, we used the size factors estimated via the median-ratio method and subsequently transformed the counts by using a variance-stabilizing transformation based on the dispersion-mean relationship (implemented in DESeq2). From the 56,202 genes represented in the GTEx RNA-Seq data, we extracted those with a minimum level of average expression (mean of normalized and transformed counts > 5). This threshold was chosen since it appeared to separate background noise from signal (Supplemental Figure S1). Finally, a total of 17,882 genes passed filtration and was used for the further analysis.

Identification of significantly differentially expressed genes in COPD
DESeq2 R package (v1.20.0) 38 was used to test for differential expression. To adjust for batch effects and other hidden confounding factors, we applied the surrogate variable analysis (SVA) algorithm implemented in the sva R package 15 . In addition to the known covariates, 5 surrogate variables were then added to the formula in DESeq2. The residuals for normalized read counts, after the known covariates and surrogate variables correction, were tested against the "COPD" status using the following negative binomial (NB) generalized linear regression model (GLM):

Functional annotations and Network-based analysis of COPD-associated genes
The clusterProfiler (v3.8.1) 39 in R was applied to perform GO enrichment analysis of molecular function and biological process on significant COPD associated genes (FDR < 0.05). For the pathway-based analyses, we defined significant COPD-associated pathways at the level of q-value less than 0.05 using the Reactome database (Version 61) 20 . The visualization of pathways was carried on the cluego plugin 40 in Cytoscape (Version 3.4.0).
Network analysis of expression changes were performed using NetworkAnalyst 41 which is based on curated protein-protein interactions (PPI) from the STRING database (Version 10) 42 . The interaction with confidence score cutoff > 900 and required the experimental 13 evidence are considered. To simplify the dense network to reveal key connectivity, a minimum interaction network was constructed to generated interactions from the upregulated and down-regulated COPD-associated genes separately.

Validation in an independent microarray dataset
To validate our results, we downloaded an independent human COPD microarray-based data set. The microarray datasets were preprocessed by a log2 transformation followed by quantile normalization. Duplicate genes were replaced by their mean value. Then, we only kept the genes that overlapped with DEGs and microarray probes. We removed samples with ILD disease or unknown conditions. For the overlapped gene signature (n = 934), an empirical Bayes shrinkage method was used to obtain a moderated t-test statistic and its p-value using limma 43 R package. Multiple testing P-values were adjusted using the BH method.