Analysis of the complexity and diversity of Super deepSAGE data across tissues
Super deepSAGE obtained ~ 5 million reads per sample with an average sequencing depth of 71X (total number of genes identified by deep sequencing / total number of aligned reads, sequencing matrix is listed in Supplemental document 1). A total of 32,213 transcripts were covered by Super deepSAGE. Rarefaction analysis of a size-fractionated library for each tissue was performed to determine the complexity and diversity of pig tissues [16]. The sequencing depth achieved using eight samples-multiplexed deep sequencing technique (added different linker and pooled eight samples together to a single deep sequencing run) reached near-saturation of transcript discovery within all size ranges. Saturation was seen very early in Super deepSAGE sequencing data due to low tag complexity (number of tags) in libraries (Fig. 1A-F showed the first six deep sequencing runs). Samples from the same sequencing run were compared using reads from different size-fractionated libraries to further investigate the diversity of the relationship between sequencing depth and transcript discovery. In all deep sequencing runs, tissues exhibited transcriptome diversity in terms of both the total number of reads and the number of transcripts discovered. For example, the muscle tissue (MS.DI_2), saturated much sooner than the conceptus (CPT.SPH_8) and fewer transcripts were discovered in the first deep sequencing run (Fig. 1A). Similar sequencing depth and diversity were obtained using size-fractionated reads numbers from the other 22 sequencing run and discovered transcript numbers as outcome measures (Supplemental Fig. SA-D).
Data quality and internal consistency control using principal component analysis (PCA)
Principal component analysis (PCA) was used to check if the samples clustered together according to their tissue source [17]. Even though the samples were collected from 32 individual animals from different families, genders, and ages (Table 1), the PCA plot confirmed that the samples from the same tissues clustered together and were distinct from other samples (Fig. 2). The transcripts in conceptus, blood, and macrophages had relatively distinct expression profiles and segregation from the rest of the samples when plotted using the first two components of the PCA analysis (Fig. 2A). The adenohypophysis, cerebral cortex, heart, and muscle were aggregate and separated from other samples when plotted using the third and fourth components (Fig. 2B). The adrenal, liver, mesenteric lymph nodes, peripheral blood mononuclear cell, and spleen deviated from other samples when plotted using the fifth and sixth components (Fig. 2C). When eliminating those samples from the datasets and re-calculating the PCAs, the remaining samples; fat, placenta, endometrium, kidney, lung, and stomach grouped differently according to the tissue/cell types (Fig. 2D-F). Tissues having similar cellular composition and biological function, like alveolar and monocyte-derived macrophages or heart and skeletal muscles, clustered closely together but were distinct from each other.
Table 1
Detailed information of the collected samples
Code | Tissue | Code | Tissue |
AC | Adrenal cortex | FT.BF | Back fat tissue |
AM | Adrenal medulla | KID | Kidney |
CPT.SPH | Conceptus spherical | ADE | Adenohypophysis |
CPT.TUB | Conceptus tubular | MP.BMD | Bone-marrow derived macrophage |
FT.AB | Abdominal fat tissue | MS.BF | Biceps femoris |
MS.DI | Diaphragm muscle | EDMT | Endometrium |
STOM | Stomach | BLD | Blood |
CPT.FIL | Conceptus filamentous | PBMC | Peripheral blood mononuclear cell |
MS.LD | Longissimus dorsi | HT | Heart |
LNG.TRA | Lung porcine trachea | CC | Cerebral cortex |
PLACT | Placenta | MP.MD | Monocyte derived macrophage |
LNG.BRO | Lung porcine bronchus | MP.ALV | Porcine alveolar macrophages |
LNG.DIS | Lung porcine distal | MLN | Mesenteric lymph nodes |
SPL | Spleen | LIV | Liver |
Comparison of the Super deepSAGE data with previously published microarray research
The expression profiles were compared with previously published microarray data [8]. The processed microarray datasets were acquired from the GEO database and normalized to the Super deepSAGE data using the quantile normalization method to make these two datasets comparable. There is a total of 8,199 common transcripts for seven tissues in both platforms, a total of 24,013 transcripts remain undetected by the Affymetrix platform, and a total of 4,478 transcripts were undetected in Super deepSAGE experiments (Fig. 3). Among the commonly detected transcripts, a high correlation (r = 0.85–0.93 and p-values less than 1.0 × e− 30) was calculated between the gene expression profiles generated by the two platforms (Fig. 3). A similar dynamic range was observed in both platforms for transcripts with a relative expression level (log2 based and quantile normalized expression value) between 4.0 and 9.0. Differences in expression profiles were apparent between the two platforms as several genes exhibiting relatively higher or lower expression values in either platform deviated from the slope (Fig. 3). All transcripts had an expression value in the microarray due to background hybridization or noise, regardless of whether it was truly expressed or not. The overall dynamics of the fitted curve tend to show that the Super deepSAGE platform is a more sensitive technique than the microarray for low expression genes that show a concaved trend at the lower ends (with relative expression level less than 4.0 in Fig. 3). For those genes with high expression levels, variability is high in both Super deepSAGE and microarray platforms. In the seven overlapped tissues between Super deepSAGE and microarray, the 50 highest expressed Super deepSAGE tags, 38 (76%) found corresponding probe sets in the 50 highly expressed genes, and only three tags showed a statistically significant difference between Super deepSAGE and microarray data.
Identification of tissue-differential expression of transcripts
A total of 4,165 transcripts showed significant up or down-regulation in at least one tissue, in comparison to the average tag count for 27 tissues. K-means clustering analysis was then performed by trying a different number of centers (K from 5 to 28) and several random sets (S from 10 to 1000). An ad hoc method comparing each tissue to the average tag count for all 27 tissues was performed, and a very stringent threshold was set (fold change > 5.0, p-value < 1.0 × 10− 6) to filter the tissues specifically expressing transcripts. We selected K = 10 and S = 400 to produce a clustered result with a clear expression pattern (by visualization), highly reproducible for each duplicated run (Fig. 4). The detailed clustering information is available in Supplemental document 2. The result indicated that Cluster 1 has the largest number of transcripts, and most of these transcripts were expressed at a low level in tissues, except macrophages, PBMCs, blood, and conceptus in which it was moderately expressed. The conceptus expressed transcripts were in Cluster 2, while the conceptus, macrophages, PBMCs, and blood down-expressed transcripts were in Cluster 4. The macrophages, PBMCs, blood, mesenteric lymph nodes, and spleen specific transcripts were in Cluster 5. The genes specifically expressed in the heart and skeletal muscles were in cluster 10. The cerebral cortex specific genes were in Cluster 6, and liver specifically expressed transcripts were in Cluster 7. The adrenal cortex, adrenal medulla, cerebral cortex, and adenohypophysis specific transcripts were in Cluster 8. Transcripts in Cluster 3 and Cluster 9 were ubiquitously expressed in multiple tissues.
Identification of over-represented motif for tissues specifically expressed transcripts
The CLOVER software [20] with JASPAR PWM database [21] was used to identify over-represented transcription factor binding motifs for each gene cluster. The promoter regions for a transcript cluster (1,000 bp upstream from the TSS) were determined using the Ensemble Biomart tool (Sus Scrofa assembly 11.1, gene 99) [22]. The promoter regions for the transcripts detected, with a similar GC content, were used as background. Motifs having a p-value of ≤ 0.05 was significant (Table 2, top 5 motifs). The most significantly enriched motif in Class 1 is MZF1. TFAP2A and TFAP2C were also significantly enriched with a raw score higher than 30. In Class 2, there was only one significantly enriched motif, RHOXF1. In Class 3 and 4, there were five and four motifs with p-value < 0.05 respectively, but the raw score was lower than ten. In Class 5, there were at least five motifs with p-value < 0.05, and three of them, RUNX1, ASCL1, and Myod1 had a raw score higher than 30. In Class 6, the significantly enriched motifs with the highest score were SNAI2 and FIGLA, whereas, in Class 7, the significantly enriched motifs with the highest score was NR4A2. In Class 8, there was only one motif ZEB1 enriched in the promoter region of these transcripts. In Class 9, all the enriched motifs had a raw score of less than ten. In Class 10, the top three motifs were Ascl2, Myog, and Tcf12.
Table 2
Significantly over-represented binding motifs in the promoter region of transcripts showing a similar expression pattern
Gene cluster | Jaspar ID | TF name | Occurrence | Gene count | Raw score | FDR |
Class 1 | MA0056.1 | MZF1 | 291 | 291 | 91.6 | 0 |
Class 1 | MA0810.1 | TFAP2A(var.2) | 165 | 165 | 33.1 | 0.002 |
Class 1 | MA0524.2 | TFAP2C | 149 | 149 | 32.3 | 0.003 |
Class 1 | MA0811.1 | TFAP2B | 143 | 143 | 29.1 | 0.004 |
Class 1 | MA0507.1 | POU2F2 | 53 | 53 | 8.35 | 0.006 |
Class 2 | MA0719.1 | RHOXF1 | 142 | 142 | 6.96 | 0.008 |
Class 3 | MA1105.1 | GRHL2 | 28 | 28 | 6.99 | 0.003 |
Class 3 | MA0842.1 | NRL | 60 | 60 | 3.67 | 0.006 |
Class 3 | MA0164.1 | Nr2e3 | 52 | 52 | 2.42 | 0.003 |
Class 3 | MA0029.1 | Mecom | 23 | 23 | 2.25 | 0.009 |
Class 3 | MA0117.2 | Mafb | 49 | 49 | 1.69 | 0.004 |
Class 4 | MA0691.1 | TFAP4 | 20 | 20 | 4.93 | 0.004 |
Class 4 | MA0091.1 | TAL1::TCF3 | 16 | 16 | 2.43 | 0.006 |
Class 4 | MA0616.1 | Hes2 | 19 | 19 | 0.8 | 0.003 |
Class 4 | MA0089.1 | MAFG::NFE2L1 | 21 | 21 | -1.09 | 0.005 |
Class 5 | MA0002.2 | RUNX1 | 135 | 135 | 55.5 | 0 |
Class 5 | MA1100.1 | ASCL1 | 79 | 79 | 46.8 | 0.008 |
Class 5 | MA0499.1 | Myod1 | 59 | 59 | 31.6 | 0.003 |
Class 5 | MA1124.1 | ZNF24 | 30 | 30 | 18.6 | 0.002 |
Class 5 | MA1109.1 | NEUROD1 | 57 | 57 | 9.76 | 0.004 |
Class 6 | MA0745.1 | SNAI2 | 75 | 75 | 18.7 | 0.001 |
Class 6 | MA0820.1 | FIGLA | 77 | 77 | 17.6 | 0.003 |
Class 6 | MA0138.2 | REST | 6 | 6 | 7.13 | 0.002 |
Class 6 | MA0665.1 | MSC | 36 | 36 | 6.63 | 0.002 |
Class 6 | MA0691.1 | TFAP4 | 30 | 30 | 5.79 | 0.003 |
Class 7 | MA0160.1 | NR4A2 | 59 | 59 | 15.5 | 0 |
Class 7 | MA0693.2 | VDR | 31 | 31 | 4.29 | 0.004 |
Class 7 | MA0017.2 | NR2F1 | 27 | 27 | 4.12 | 0.009 |
Class 7 | MA1142.1 | FOSL1::JUND | 38 | 38 | 3.25 | 0.001 |
Class 7 | MA0059.1 | MAX::MYC | 16 | 16 | 1.48 | 0.008 |
Class 8 | MA0103.3 | ZEB1 | 21 | 21 | 10.9 | 0.002 |
Class 9 | MA0084.1 | SRY | 22 | 22 | 9.59 | 0.01 |
Class 9 | MA0130.1 | ZNF354C | 35 | 35 | 9.47 | 0.007 |
Class 9 | MA0463.1 | Bcl6 | 21 | 21 | 7.12 | 0.007 |
Class 9 | MA0799.1 | RFX4 | 7 | 7 | 5.64 | 0.007 |
Class 9 | MA0798.1 | RFX3 | 7 | 7 | 5.38 | 0.009 |
Class 10 | MA0816.1 | Ascl2 | 66 | 66 | 28.6 | 0.009 |
Class 10 | MA0500.1 | Myog | 58 | 58 | 28.5 | 0.01 |
Class 10 | MA0521.1 | Tcf12 | 57 | 57 | 27.2 | 0.008 |
Class 10 | MA0665.1 | MSC | 36 | 36 | 9.17 | 0.002 |
Class 10 | MA0108.2 | TBP | 57 | 57 | 4.39 | 0.007 |
Case report: confirmation of the regulatory roles of RUNX1 in PBMCs in pig
In the cluster heatmap (Fig. 4), Class 1 and 5 tentatively show (by visualization) high expression in macrophages, PBMCs, and blood. However, the expression level of genes in Class 1 was lower than in Class 5. Further, mesenteric lymph nodes and spleen specific transcripts in Cluster 5 indicated that this class is an immunity-related gene cluster. The top over-represented motif in Class 5 is RUNX1, and literature search of its targets indicated that TLR-2 (Toll-like receptor), LCK (tyrosine kinases), and VAV1 (Rho family GTPases) play a role in T and B-cell development and activation. These three representative RUNX1 targets were selected for further experimental validation.
Confirmation of the RUNX1 binding site in the promoter region of TLR-2, LCK, and VAV1
The toll-like receptor 2 (TLR-2), lymphocyte-specific protein tyrosine kinase (LCK), and vav1 oncogene (VAV1) plasmid containing the 1Kb putative promoter sequence were used in in vivo studies (wild type). To show the regulatory effect of RUNX1, the binding site of RUNX1 in TLR-2, LCK, and VAV1 was mutated or deleted. Reporter vectors constructed by the wild type, mutated, or deleted promoter sequences were transfected into the peripheral blood mononuclear cells (PBMCs), and luciferase activity was monitored. Binding site deletion significantly attenuated the expression of the downstream reporter luciferase activity (p < 0.05), indicating that RUNX1 could interact with the target site and regulate the expression of the downstream reporter gene (Fig. 5A-C). The mutated vectors showed significant attenuation of the activity of downstream luciferase at 40, 44, and 48 hours post-transfection (p < 0.05) indicating a regulatory relationship between RUNX1 and the targets. Another experiment was performed using mouse macrophage cells (RAW 264.7) to validate the hypothesis further. Consistent with the previous results, mutation of the RUNX1 binding sites in TLR-2, LCK, and VAV1 promoter sequence significantly attenuated the activity of downstream luciferase at 40, 44, and 48 hours post-transfection (Fig. 5D-F). The luciferase reporter activity after transfection with the wild-type vector was significantly higher in macrophage cells than in the PBMC assays, suggesting that the endogenous RUNX1 expression in mouse macrophage cells was higher than in PBMCs.
RNA flow cytometry analysis of RUNX1 targets in LPS and RUNX1 inhibitor-treated PBMCs
To show the effect of RUNX1 on three targets; TLR2, LCK, and VAV1, pig PBMCs were stimulated with LPS and/or RUNX1 inhibitor, for 6 hours, during which their TLR2, LCK, VAV1, CD14 protein levels were monitored. Two subsets of cells readily emerged from CD14/TLR2 analysis in PBMCs: a CD14hi/TLR2lo (CD14high/TLR2low) and a CD14lo/TLR2lo population (Fig. 6D). The percentage of CD14hi/TLR2lo cells increased in LPS plus RUNX1 inhibitor-treated samples, but the proportion of CD14lo/TLR2lo cells remained unchanged. The percentages of TLR2hi (for both CD14hi and CD14lo) cells increased seven-fold in LPS alone treated samples compared with the non-treated controls. Four subsets of cells readily emerged from CD14/LCK analysis in PBMCs treated with LPS or RUNX1 inhibitor: a CD14hi/LCKlo, CD14hi/LCKhi, CD14lo/LCKhi, and CD14lo/LCKlo population (Fig. 6E). The percentage of CD14hi/LCKhi, and CD14lo/LCKhi cells increased in LPS plus RUNX1 inhibitor-treated samples, and the proportion of CD14hi/LCKlo cells was decreased. The percentage of CD14hi/LCKhi cells increased by 40% in LPS alone treated samples compared with the non-treated controls. Two subsets of cells readily emerged from CD14/VAV1 analysis in PBMCs: a CD14hi/VAV1lo and a CD14lo/VAV1lo population (Fig. 6F). The percentage of VAV1hi (for both CD14hi and CD14lo) cells increased four-fold in LPS plus RUNX1 inhibitor-treated samples. The percentages of VAV1hi (for both CD14hi and CD14lo) cells increased seven-fold in LPS alone treated samples compared with the non-treated controls and is two-fold higher than in LPS plus RUNX1 inhibitor-treated samples.
Real-time PCR analysis of RUNX1 targets in LPS and RUNX1 inhibitor-treated PBMCs
To investigate if the expression patterns of the 33 RUNX1 target genes that have RUNX1 binding site (from Class 5, Supplemental document 3) could be modeled by LPS and RUNX1 inhibitor treatment in vitro, we performed real-time PCR assays for 33 genes after treating PBMCs (collected from a separate set of animals) with two different doses of LPS (1 ng/mL, 10 ng/mL), and RUNX1 inhibitor (1 ng/mL, 10 ng/mL). Samples were collected six hours post-stimulation. A total of 21 genes were induced significantly in response to at least one dose of LPS stimulation, as expression levels for these genes were different when compared to non-stimulated control. A total of 10 genes were down-regulated significantly in response to the RUNX1 inhibitor treatment. Hierarchical clustering analysis was used to determine whether the response of LPS stimulation was similar to the patterns detected in RUNX1 inhibitor treatment, and if any differences were observed depending on the dosage of LPS and RUNX1 inhibitor used. As shown in Fig. 7, the genes were grouped into two large clusters. In the upper part of the cluster (from CLEC5A to SLA) the expression patterns of samples with RUNX1 inhibitor treatment, RUNX1 inhibitor plus LPS treatment, and non-simulated controls tend to be similar. Different doses of the RUNX1 inhibitor did not affect the samples, as observed by the overlap of respective samples in the heatmap. The LPS treated samples were unique and were distinct from the RUNX1 inhibitor-treated groups and control groups. Similar to the RUNX1 inhibitor, different doses of LPS also did not affect the samples. The expression patterns of RUNX1 inhibitor plus LPS treatment samples tend to be similar to controls and RUNX1 treatment only samples because of the overlap in the heatmap. In the lower part of the cluster (from ARHGDIB to RGS18), the expression patterns of some of the samples following RUNX1 inhibitor treatment, RUNX1 inhibitor plus LPS treatment, and non-simulated controls tend to be similar with the LPS treated samples. The reason for this discrepancy could be due to variation in preparing the PBMCs and the stimulation process which are difficult parameters to control.