Quality control of transcriptomes and proteomes
Quality control of RNA-seq reads. The libraries were constructed for the RNA sequencing, and deep sequencing was completed for the ten (six IPF vs. four control) samples that met the quality requirements. The samples had an average of 150,114,026 ± 15,174,766 sequence Reads and 22,517,103,960 ± 2,276,214,831 bases. There was no significant diff erence between the control group and the IPF group in the measured number of sequences and bases (p <0.05). The average base error rate was 0.53±0.098%, and there is no significant difference between the control group and the IPF group (p <0.05). The average GC ratio (GC%) was 47.15±1.80%, the percentage of Q20% bases (error rate < 1%) was 93.87±1.18%, and G30% bases (error rate < 0.1%) was 86.30±2.02%. There was no significant difference in these three indicators between the two groups (p <0.05). After data filtering, the average clean Q3 ratio is 89.25±1.52%, and the cleaning rate is 90.81±0.83% (Table 1).
The bases had a homogeneous distribution along with the sequences, the maximum error rate is < 1%, and the minimum base quality (-10×log10(error P)) was above 30 (Figure S1, A-C). The sequence quality was further improved after data filtering (Figure S1, D-F).
Table 1. Quality control of raw RNA-seq data
|
Control
|
IPF
|
p-value
|
Reads
|
1.56×108 ±1.77×107
|
1.46×108±1.19×107
|
0.41
|
Bases
|
2.33×1010±2.65×109
|
2.20×1010±1.79×109
|
0.41
|
Error (%)
|
0.56±0.09
|
0.53±0.10
|
0.70
|
GC%
|
46.94±1.58
|
46.75±1.59
|
0.71
|
Q20(%)
|
93.55±1.15
|
93.86±1.17
|
0.71
|
Q30(%)
|
85.79±1.94
|
86.25±2.04
|
0.76
|
Clean reads
|
1.46×108±1.71×107
|
1.38×108±1.14×107
|
0.45
|
Clean bases
|
2.11×1010±2.52×109
|
2.00×1010±1.65×109
|
0.47
|
Clean rate (%)
|
90.46±0.76
|
91.05±0.43
|
0.20
|
This table lists several indices of NGS data. The “IPF” and “Control” columns indicate the QC results for each group. The p-values were calculated from two-tailed t-tests. |
Mapping quality of RNA-seq reads. 94.7±0.005% reads were successfully mapped to the human reference genome. The detected gene number approached saturation along with the increase of mapped reads, indicating a good sequence depth of this experiment. The averaged mapped genes of all the samples were 27,000 to 33,000 (Figure S2A). The quality control and comparison results show that the sequencing results met the quality requirements for further analysis. Among the detected RNA sequences, 81.33% derived from exons, 16.1% from intronic, and 2.57% from intergenic reads (Figure S2B). The detected genes were evenly distributed across chromosomes by comparison with the human gene distribution map (Figure S2C).
Quality control and quantification of proteomes. Libraries were generated for all 11 samples that met the quality requirements and the proteins were detected and quantified. As a result, 9,119 protein groups and 9,232 proteins were detected at the QC level of 1% FDR (Spectrum, Peptide, and Protein levels). The levels of 7,823 protein groups and 7,932 proteins were quantified at the QC level of 1% FDR (precursor and protein levels). The average coefficient of variation (CV) of the precursors was 40.80% and 31.90% for the control and IPF samples, respectively. The median of the precursors’ CV was 40.6% and 32.4% for the control and IPF samples (Figure S3A). The recovery rate (the ratio of the identified proteins to the indicators in the human protein library) was 68.90% and 77% and the completeness (the ratio of the average number of identified proteins to the number of parent ions quantified in the experiment) was 51.90% and 63.50% for the control and IPF group samples, respectively. The cumulative recovery plot shows that 85% of proteins from the protein spectrum database have been detected in the 11 samples (Figure S3B). The completeness plot shows that the total completeness of all samples was 83.8%, with 4,200 proteins identified in all samples (Figure S3C). Consistency analysis of the qualitative results showed that 3,200 proteins were detected in all samples, and another 3,000 proteins were detected in more than half of the samples (Figure S3D). The heat map of all the detected proteins shows no significant differences in the identification and quantification among all samples (Figure S3E).
Transcriptome Analysis of IPF
Differential gene expression. We identified the DE genes between samples through two cut-offs: log fold change (|log2(fold change)| > 1) and significance level (adjusted p<0.05). In comparison with normal lung tissue, a total of 2531 genes were significantly differentially expressed in the lungs of patients with end-stage IPF, including 1772 up-regulated and 759 down-regulated genes (Fig. 1.A). Clustered heatmap (Fig. 1.B) shows that the control and IPF groups could be well separated by the genes, while the samples of IPF number 1, 4, and 5 had a clearer contrast with controls.
GO and KEGG enrichment analysis was performed to probe the biological processes and signaling pathways associated with the DE genes (Fig. 1.C). IPF lung tissues had a significant enrichment of biological processes and functional pathways that dominate the mechanism of IPF when compared to control tissues. These enriched pathways influence the progression of IPF at the biological process, cellular structure, and molecular function levels. Eight of the top ten pathways are related to the immune system activities and inflammatory response. Furthermore, there are also the pathways of the construction of ECM, which replaces normal Alveolar tissue and deposits abnormally in IPF [37, 38].
Differential lncRNA expression. To investigate the regulatory impact of lncRNA in the end-stage IFP patients' gene expression, we first quantified their expression and identified DE lncRNAs. The results showed that a total of 604 lncRNAs were significantly differentially expressed in IPF lung tissue, including 410 up-regulated genes and 194 down-regulated genes (Fig. 2.A). Clustered heatmap (Fig. 2.B) showed that the expression of these DE lncRNAs could separate the IPF samples from the control samples. And IPF samples 1, 4, and 5 showed a clearer contrast to the control samples than the other three IPF samples.
As the lncRNAs mainly function by regulating the protein-coding target genes, we predicted the potential target genes of cis-regulation and trans-regulation for the lncRNAs. We then performed GO enrichment analysis on the target genes and analyzed the results with the significance threshold of p< 0.05 (Fig. 2.C). Most of the enriched pathways were associated with the structure and function of lung epithelial apical junction, such as apical junction assembly and tight junction assembly. This implies that the DE lncRNAs in IPF may mainly promote the process of epithelial-mesenchymal transition (EMT), cell migration, accelerated fibrosis progression, innate immunity, as well as cellular differentiation and proliferation [39, 40]. Besides, there are also two pathways related to apoptosis, such as the cysteine-type endopeptidase activity involved in apoptotic process.
Proteomics Analysis of IPF
Principal component analysis was performed on the protein expression data using the PLS-DA method, and the top 2 components were plotted in Fig. 3.A. The results showed that the end-stage IPF tissues were more concentrated on the graph compared with normal tissues, indicating a higher homogeneity of protein expression and a more consistent within-group expression profile. We performed a Welch's ANOVA test on the protein quantifications and defined the DE proteins by a threshold of adjusted p< 0.05 and fold change > 1.5. As a result, we got 1,532 DE proteins in IPF tissues, including 1,231 up-regulated proteins and 301 down-regulated proteins (Fig. 3.B).
Fig. 3.C shows the top 10 enrichment results under the three categories of Biological Process, Molecular Function, and Cellular Component. We note that these enriched gene clusters were mainly focused on the negative regulation of TOR and TORC1 signaling, which are associated with the decreased metabolism and protein production, autophagy, and extracellular matrix (ECM) production in end-stage IPF [41-43].
The DE proteins were significantly enriched in 13 KEGG pathways (p < 0.05) (Fig. 3.D). According to previous studies, five of them are associated with the pathology of end-stage IFP. The RAS signaling pathway is associated with cell apoptosis and regeneration [44, 45], the tight junction and gap junction are associated with cell regeneration and junction construction [39, 40], the mTOR signaling pathway regulates cell growth and metabolism [41-43], and nucleotide excision repair is associated with wound repair [46, 47].
Multi-omics analysis
By interactively analyzing the expression matrices of RNAs of different types and proteins, we identified the key genes of each type that drive the discrimination between IPF and control tissues and investigated the correlations between the ncRNAs and the expression of mRNAs and proteins.
Using the DIABLO method, we identified the genes contributing most to the discrimination between IPF and control tissues. These top-contributing genes include 20 mRNAs, 20 proteins, ten lncRNAs, ten lincRNAs, ten antisense RNAs, and ten miRNAs. These top-contributing genes include 20 mRNAs, 20 proteins, 10 lncRNAs, 10 lincRNAs, 10 antisense RNAs and 10 miRNAs. Only the top-10 genes were kept from three types of ncRNAs, this is because they each have relatively small gene numbers (from 59-279).
First, we display the discrimination of the IPF samples and control samples by the PLS-DA plot (Fig. 4.A). In the PCA plots, the control samples and IPF samples 1, 2, 4, and 5 were clustered closely in all blocks, while the IPF samples 6 and 3 were at longer distances from the other IPF samples. Among the six blocks of the expression matrix, the IPF samples are more discrete in the mRNA and protein blocks, while they are more homogeneous in the ncRNA blocks. Fig. 4.B shows the correlation structure between components from each expression matrix. There are very strong associations between ncRNAs, mRNA, and proteins, and the correlation coefficients between any two datasets ≥ 0.98. The results indicate a good matrices design that favors the separation of the two groups.
Second, we created a clustered heat map representing the multi-omics profiles of all the samples (Fig. 4.C). The image shows that these top-contributing genes from six matrices well represent the separation of gene expression features of the IPF and control group. The only exception is the IPF sample 6, the expression characteristics of which are similar to neither the control nor the other IPF tissues. This result is consistent with the PLS-DA plot, in which IPF is also clearly discriminated from other IPF samples on the first component (x-axis). As the most important contributors to the expression characteristics, the top-contributing proteins include ROM01, T22D3, MIS12, ZN384, LHPL2, TANC2, DESI1, MEA1, ARID2, NFRKB, PKP2, MTG1, RIPR2, ARHGP, DPOA2, GNB1L, YETS2, IKZF1, MBOA2, and CEP57.
Third, we produced a Circos plot displaying the relationship between and within the top-contributing genes from the six matrices, the cutoff for the correlation coefficient was set as > 0.9 (Fig. 5). The strong correlations between mRNAs, proteins, and the ncRNAs indicate a universal regulatory effect of these ncRNAs on mRNA transcription and protein translation. Compared to the proteins, mRNAs had more strong links with the regulatory ncRNAs. Among the ncRNAs, the lincRNA has the most links with protein and mRNA, suggesting a significant regulatory role in IPF. In this co-expression network, the most-contribute variables are the lincRNAs ENST00000437698.1 and ENST00000442197.1, the antisense RNAs ENST00000519197.1 and ENST00000566738.1, the lncRNAs NR_110255.1 and NR_024344.1, and the miRNAs NR_030340.1 and NR_030408.1.
DE genes classification and functional enrichment analysis
Genes significantly differentially expressed at both transcriptional and translational phases were extracted and classified into four categories based on the trend they were regulated. The threshold of p value was set as < 0.05, and threshold of fold change was set as > 2 for transcriptome and > 1.2 for proteome. The results are displayed in the quadrant diagram (Fig. 6.A).
Classification of DE genes. A total of 78 genes were differentially expressed in both transcriptome and proteome, and they were divided into four categories according to their regulation. 24 genes were significantly up-regulated in both omics, such as TUBB3, IGLV1-47, and CAPS. A total of 46 genes were significantly downregulated in both omics, including AGER, BNTL9, and RETN. Eight genes had opposite regulation trends, three genes were significantly down-regulated in the transcriptome but up-regulated in the proteome: CSK, RAC2, and SEMA5B. Five other genes were significantly up-regulated during transcription but down-regulated during translation: EPS8L1, GON7, HOMER2, IGLV8-61, and PROC.
GO enrichment analysis. The 24 genes were most frequently located on chromosomes 4 and 11, which had 3 and 4 genes, respectively. Twenty-one genes had four or more isoforms, suggesting that isoforms may be more active in the lung tissue of patients with severe IPF. These genes are significantly enriched in seven biological pathways and high-level GO terms (Fig. 6.B). The enrichment network shows that the enriched functions were clustered in the biological process of regeneration and cell morphogenesis involved in differentiation (Fig. 6.D). These over-activated pathways participated in the cell regeneration, differentiation, and intercellular sequential generation, which probably due to the deteriorated tissue damage and regeneration processes in the end-stages IPF patients [1, 48].
These genes downregulated in both omics mainly locate on chromosomes 1, 9, 17, and 19. The genes were significantly enriched in 12 biological pathways and high-level GO terms (Fig. 6.C). The enrichment network shows that the most enriched terms were the biological process of myeloid leukocyte activation, regulation of IL-1 β production, cell-cell communication, cellular extravasation, and lipid localization (Fig. 6.E). These significant compromised functions and biological processes in the end-stage IPF lung tissues might be associated with reduced immune activities and the damage and obliteration of the alveolar tissue [49].
Identification of new biomarkers and verification
For the 78 genes we obtained in the previous step, we further checked their transcript expression in the 91 public transcriptome datasets. The detailed information about their expression feature is supplied in supplementary table. then, we identified 13 candidate genes that have significant differential expression in three omics (Table 2).
Table 2. Expression of candidate genes in three omics studies
|
Transcriptomes
|
Proteomes
|
Public transcriptomes
|
|
log2FC
|
p-value
|
p.adj
|
log2FC
|
p-value
|
p.adj
|
log2FC
|
p-value
|
p.adj
|
ADGRL2
|
-1.268
|
1.35E-03
|
0.021
|
-2.411
|
0.005
|
0.020
|
-0.834
|
4.88E-11
|
1.23E-09
|
BTNL9
|
-2.276
|
3.04E-08
|
4.31E-06
|
FC=0
|
1.00E-06
|
4.64E-06
|
-2.620
|
1.18E-36
|
1.22E-33
|
CA4
|
-3.256
|
5.49E-04
|
0.011
|
-3.262
|
3.67E-03
|
0.015
|
-3.022
|
4.57E-26
|
1.16E-23
|
IGLV1-47
|
3.642
|
1.21E-04
|
3.54E-03
|
1.952
|
3.08E-03
|
0.013
|
2.749
|
8.64E-20
|
7.94E-18
|
LIMCH1
|
-1.055
|
2.81E-04
|
6.79E-03
|
-0.884
|
1.93E-03
|
8.44E-03
|
-0.886
|
1.04E-17
|
6.92E-16
|
MID1IP1
|
-1.125
|
6.02E-04
|
0.012
|
-1.705
|
6.79E-03
|
0.027
|
-0.882
|
1.44E-17
|
9.38E-16
|
PLLP
|
-1.563
|
0.001
|
0.016
|
FC=0
|
0
|
4.64E-06
|
-1.052
|
6.87E-25
|
1.41E-22
|
QDPR
|
-1.041
|
8.50E-04
|
0.015
|
-0.400
|
0.008
|
0.033
|
-0.593
|
2.45E-17
|
1.55E-15
|
S100A4
|
-1.463
|
1.54E-04
|
4.27E-03
|
-1.153
|
0.007
|
0.029
|
-0.851
|
1.80E-13
|
6.63E-12
|
SELENBP1
|
-1.465
|
5.33E-04
|
0.011
|
-1.246
|
3.13E-03
|
0.013
|
-1.092
|
1.28E-35
|
1.20E-32
|
STX11
|
-1.874
|
2.19E-04
|
5.64E-03
|
-1.836
|
0.010
|
0.038
|
-1.233
|
8.15E-18
|
5.50E-16
|
THY1
|
2.777
|
3.92E-04
|
8.70E-03
|
1.808
|
2.41E-03
|
0.010
|
2.554
|
2.15E-35
|
1.97E-32
|
TUBB3
|
3.026
|
8.06E-04
|
0.015
|
3.693
|
6.93E-03
|
0.028
|
2.126
|
2.74E-16
|
1.53E-14
|
Log2FC: log2 of the fold change, lfcSE: standard error of log2 fold change, statistic: Wald statistic, p.adj: adjusted p-value, adjustment method = Benjamini-Hochberg. |
Literature review shows that many of these top-rank candidate genes have been reported involving in IPF, which strongly prove the efficiency of our research approach. Among these thirteen genes, three of them have been reported to have significant impacts on the pathology of IPF (S100A4 [50], STX11 [51], and THY1 [52]), while another three have been reported to have differential expression in IPF yet not verified (BTNL9 [53], SELENBP1 [54], and LIMD1 [55, 56]). Eleven genes had no studies reported in IPF (ADGRL2, CA4, IGLV1-47, LIMCH1, MID1IP1, PLLP, and QDPR).
BTNL9 and PLLP expression in lung tissues from human and BLM-induced mice
Based on the results of three omics and the review of literature research, we decided to validate the expression of BTNL9 and PLLP in IPF. both genes’ mRNA transcriptions were quantified using qPCR, and their protein expression in lung tissues was detected by Western blotting. qPCR showed that both PLLP’s mRNA expression significantly reduced in IPF patients (Wilcoxon test, p< 0.01) (Fig. 7. B); BTNL9’s mRNA expression had a non-significant reduction (Fig. 7, A). Western blot showed that PLLP’s protein expression decreased in IPF patients (Fig. 7, C).
On day 21 after BLM induction, the immunohistochemistry showed that, compared with the saline group mice, the lung tissues of BLM-induce mice showed progressive pulmonary fibrosis and alveolitis. The serum hydroxyproline increased, accompanied by increased expression of type I collagen (COL I) and Fibronectin in the lung tissue. These results indicated that the BLM-induced mouse lung fibrosis model has been successfully established. qPCR and Western blot were then used to measure BTNL9 and PLLP’s expression in the BLM-induced mouse model. The results showed that BTNL9 and PLLP’s mRNA expression significantly reduced in BLM-induced mice (Wilcoxon test, p< 0.05), and their protein expression also decreased in BLM-induced mice (Fig. 7, D-F).
Co-expression networks of BTNL9 and PLLP
The co-expression network demonstrates the promoted and inhibited gene sets associated with the expression of specific genes. The co-expression network of BTNL9 shows that its expression is associated with the promotion of endothelium establishment, vessel endothelium migration, and construction of cell-cell junction. BTNL9’s expression is associated with the inhibited pathways such as immune system activity, production of extracellular matrix, and cilium production (Fig. 8). The co-expression network of PLLP shows that its expression is associated with the promotion of endothelium development, cell membrane, and cell junction development. It is associated with the inhibited pathways such as abnormal respiratory function, immune system activity, and cilium production (Fig. 9).