Comprehensive and quantitative proteome profiling of the mouse brain
To characterize genetic architecture of protein expression, we generate deep brain proteome of four mouse strains, including B6 and D2, and their two reciprocal F1 hybrids (i.e., B6D2F1 and B6D2F1) (Fig. 2A). By using 11-plex TMT-based LC/LC-MS/MS with extensive fractionation, we identified a total of 273,063 peptide-spectrum matches (PSMs) and 87,892 peptides, corresponding to 9,979 proteins (9,688 genes) at protein FDR < 1% (Fig. 2A; Additional file 1: Table S1). Principal component analysis shows that two replicates of four mouse strain grouped well (Fig. 2B). Pearson correlation analysis also shows a high correlation (r2 = ~ 0.99) between the two replicates (Figure S1). The agreement of biological replicates indicates a high quality of the proteomics data.
We next ask to what extent mRNAs can be detected by our proteomic data and differences between mRNAs across four mouse strains reflect at the level of proteins. To this end, we compared our proteomic data with transcriptome from the same mouse strains generated by RNA sequencing (7) with respect to protein coverage and absolute abundance (i.e., concentration). The proteins identified in this study cover most (80.5%) of highly abundant genes (i.e., log2(tpm) > 5), indicating deep coverage of the expressed proteome (Fig. 2C). Consistent with previous reports (14, 15), comparison of absolute abundance showed a modest correlation between mRNA and proteins (correlation coefficient r2 = 0.459; p value < 2.2x10− 16) (Fig. 2D), suggesting that ~ 45% of the variance of expression levels between different proteins could be explained by mRNA levels. The discrepancy could be ascribed to protein translation rates and post-translational modifications as well as biases of RNA-sequencing and mass spectrometry technologies.
Genetic variation can lead to difference in expression level of the same protein across different mouse strains. To determine which proteins are influenced by the genetic variation, we calculated the coefficient of variation (CV) for each protein. We found that a subset of proteins (1,347/9,979) showed high variation in protein expression (Fig. 2E), defined as two standard deviation above the average of the CV. Gene Ontology (GO) enrichment analysis showed that these variable proteins were enriched in chromatin modification and protein secretion.
Genetic difference in protein expression
We next sought to identify differentially expressed proteins (DEPs) between B6 and D2 strains, as they are highly polymorphic in genotypes and phenotypes. We identified 329 DEPs at the FDR of 0.05 and log2 fold change (log2FC) cut-off of 1.5 (Fig. 3A; Additional file 2: Figure S2), including 113 and 216 proteins with higher expression in B6 and in D2, respectively (Additional file 1: Table S2). A large proportion (71.4%) of proteins show a modest level of expression alteration (log2FC between 1.5 and 2). Among these 322 out of 329 DE proteins on autosomes, we identified 25 proteins with single parent expression (SPE), defined as an extreme form of differential expression in which either B6 or D2 shows a high expression abundance (log2 expression level z-score > 25th percentile) while the other is silent (log2 expression level z-score < 1st percentile) (Fig. 3B). Enrichment analysis revealed that DEPs with higher relative expression in B6 were significantly enriched for GO terms related to prostaglandin response (Fig. 3C, Additional file 1: Table S3). In contrast, DEPs with higher relative expression in D2 were significantly enriched for terms related to mitochondrial function, electron transfer activity, and cytochrome-c oxidase activity (Fig. 3C; Additional file 1: Table S3).
We also define a correlation of the relative expression (B6 vs. D2) between mRNA and protein levels. We observed a fairly low correlation of relative expression ratio in mRNA compared to protein (Fig. 3D), indicating potential buffering at the protein level caused by genetic variation. Despite their low correlation, we confirmed consistent changes at both transcript and protein levels, such as ALAD and HDHD3 (Fig. 3E,F), for which, in our previous study, we found that ALAD and HDHD3 span with a CNV and are associated with high variation in mRNA expression in multiple brain regions between B6 and D2 strains (11).
Systematic characterization of inheritance pattern of protein expression
The differences in protein expression across strains can be further partitioned into heritable and non-heritable variation. The proportion of heritable variation (genetic) contributing to the total observed variation is known as the broad sense heritability (H2). We calculated heritability as the ratio of strain variance to the total variance for each protein. We found that 8,470 (84.9% of the total) proteins showed heritability > 50% (Fig. 4A). The median heritability of all proteins is 0.82, which is higher than that of transcripts in BXD strains (16). We evaluated the inheritance patterns of protein expression using the distribution of D/A (i.e., dominance / additivity), revealing that the dominant expression pattern is more common than the additive pattern (Fig. 4B).
We then determined inheritance patterns by comparing protein expression in the F1 hybrids and in two parent strains. The patterns include additive effect (A), dominant effect (D), over-dominance, and under-dominance. We next applied the weighted gene co-expression network analysis (WGCNA) to define the inheritance patterns of protein expression. We only selected proteins with highly variable expression across samples by removing proteins with a coefficient of variation (CV) less than that of the mean of the lower CV distribution (Fig. 4C). Considering the reference protein database is largely derived from the B6 genome, we only focused on proteins with higher expression in B6, resulting in a total of 2,238 proteins to be used as input for co-expression network analysis (Fig. 4D). With soft-threshold power (β = 18) in WGCNA, we identified a total of four major patterns: additive, dominant, over- and under-dominant expressions. The largest pattern is dominant expression (n = 1,133), followed by additive expression (n = 980) (Fig. 4E; Additional file 1: Table S4). In addition, we also observed 63 over- and 62 under-dominant expressions. To assess the biological functions of each inheritance pattern, we performed gene ontology enrichment analysis using the R package clusterProfiler (version 3.18.1). We found that proteins in the additive pattern are enriched for signal transduction pathways (Additional file 1: Table S5, Additional file 2: Figure S3A), whereas dominant expression are enriched for pathways such as mRNA processes and autophagy (Additional file 1: Table S5, Additional file 2: Figure S3B).
Identification of protein allele specific expression in cis- and trans-regulations
The highly variable genome sequences between B6 and D2 strains provide an opportunity to investigate allele-specific expression. While several studies have investigated ASE in mice using transcriptomics data, there is no research for protein allele specific expression (pASE). Recently, the proteogenomics approach that integrates genomic and proteomics data has been proven to be a valuable method in detecting variant peptides (17–19). We performed the proteogenomics analysis to detect variant peptides using JUMPg, a proteogenomics pipeline we recently developed.
Using 11,115 missense variants detected in our previous D2 sequencing project, we identified a total of 286 variant peptides, including 169 and 205 D-allele and B-allele peptides, respectively, at the peptide FDR of 1% (Fig. 5A; Additional file 1: Table S6). By comparing B-allele and D-allele peptides, we found a total of 88 pairs of variant peptides (Fig. 5B, Additional file 1: Table S7). Two examples of B-allele peptide and D-allele peptide are shown in Fig. 5C. The B-allele peptide can only be detected in B6 strain and both F1 hybrids, whereas the D allele peptide can be detected in D2 strain and both F1 hybrids. Even though the signal of the two allelic peptides cannot be directly compared due to different chemical properties that alter ionization efficiency, the ratio of the two alleles in parents and F1 hybrids can be calculated, allowing us to determine pASE. By comparing the allelic ratio in parents and F1 hybrids, the regulation of protein expression are classified into five different categories: cis-, trans-, compensatory, conserved, and unexpected biased (Fig. 5D). The ratio of the two parental alleles is contributed by a combination of cis- and trans-regulatory effects. If the allelic ratio in the F1 is similar to the parental proportions, the differential expression in the parents is likely due to variation in cis-acting elements because the common trans- effect is present in the F1 hybrids. In contrast, if the change in the allelic ratio is only observed in parental strains but not in F1s (log2 ratio < 1), it is likely to be caused by trans-acting factors. If there is no change in parental strains, but with significant changes in F1 hybrids, the cis- and trans- effects in the parental are compensatory. We define conserved regulation as there are no changes in both parental strains and F1 hybrids.
Among those 88 pair variant peptides with both B- and D-types (Fig. 5E), we found 33 cis-regulation, followed by 25 conserved regulation. In addition, there are 17 trans-regulation and 5 compensatory regulations. In addition, two proteins showed unexpected biased, which could be due to peptide false identification or quantitative measurement error in the shot-gun proteomics.
Comparison of regulations at transcript and protein expression levels
We next sought to compare regulation at transcript and protein expression levels. To identify ASE at the transcript level, we analyzed the transcriptome of the hippocampus of matched samples (i.e., B6, D2, and B6D2F1). By mapping RNA-seq data to both B6 reference and D2 customized genomes, we identified 2,630 protein-coding transcripts with ASE expression (Fig. 6A), including 215 cis-, 500 trans-, 213 compensatory, 1,666 conserved, and 36 unexpected bias regulation. By comparing regulation between transcripts and proteins, we found that there is a significant overlap between transcripts and proteins showing ASE (Fisher’s exact test p = 2.2 x 10− 9). The conserved regulation showed the highest overlapped in both levels. However, only two genes/proteins were found to show ASE in trans-regulations (Fig. 6B).