Sequencing and alignment profile
Using four lanes of Illumina NextSeq500, we generated 388 and 403 million reads for the three replicate rRNA-depleted-RNA-Seq samples (A, B, C) and the three Poly(A)-RNA-Seq samples (D, E, F), respectively. After filtering low-quality reads with Trimmomatic (Bolger et al., 2014), more than 86% of reads (~ 336 and ~ 354 million reads for rRNA-depleted and Poly(A) samples, respectively) remained for subsequent analysis. For the additional bacterial-unenriched Poly(A)-RNA-Seq transcriptomes, 341 million reads were generated for the four samples, and 320 million reads remained after the quality filter step. All these high-quality reads were then aligned to updated versions of A. queenslandica and its proteobacterial symbionts AqS1, AqS2 and AqS3 genomes (Xiang 2021) by HISAT2 (version 2.0.5) (Pertea et al. 2016). On average, 69.3% rRNA-depleted-RNA-Seq reads, 81.2% Poly(A)-RNA-Seq reads, and 63.1% bacterial-unenriched Poly(A)-RNA-Seq reads mapped to the four genomes. Of these mapped reads, 4.8% rRNA-depleted-RNA-Seq reads, 0.1% Poly(A)-RNA-Seq and few bacterial-unenriched Poly(A)-RNA-Seq reads mapped to more than one genome, and were discarded because they could not be unambiguously assigned. 222 million mapped rRNA-depleted-RNA-Seq reads, 287 million mapped Poly(A)-RNA-Seq reads, and 51 million mapped bacterial-unenriched Poly(A)-RNA-Seq reads remained for further analysis.
For the bacterial-enriched rRNA-depleted-RNA-Seq libraries, 77% of the mapped reads were attributed to the sponge host Aq, while 19.2%, 2.54%, and 1.3% were attributed to AqS1, AqS2, and AqS3, respectively (Fig. 1A). By comparison, for the bacterial-enriched Poly(A)-RNA-Seq libraries, almost all the aligned reads (99.2%) were attributed to the sponge host; only 0.6%, 0.2%, and 0.1% reads were attributed to bacterial symbiont genomes AqS1, AqS2, and AqS3, respectively (Fig. 1B). For the bacterial-unenriched Poly(A)-RNA-Seq library, 99.9% of mapped reads were attributed to the sponge host, and only 0.03% to the three symbiotic bacteria in total.
Correlation between samples
The reads aligned to each gene were considered as a measure of gene expression. Correlation of the gene expression between the technical triplicates was estimated by gene read counts (gene depth) between individuals (biological replicates) within each experimental group. All read counts were included in calculating the Spearman correlation efficiencies. For the host sponge A. queenslandica, the gene expression correlation was strong among the triplicates from both rRNA-depleted- and Poly(A)-RNA-Seq groups, with Spearman correlation efficiencies ranging from 0.907 to 0.969 (Fig. 1C, D). For the three symbiotic bacteria, the correlation of gene expression varied between the biological replicates from each experimental group. For the rRNA-depleted-RNA-Seq data, the Spearman correlation efficiencies of AqS1 biological replicates ranged from 0.967 to 0.987, with AqS2 from 0.916 to 0.941, and AqS3 from 0.826 to 0.889. These were higher than those correlation efficiencies of the Poly(A)-RNA-Seq replicates, with AqS1 ranged from 0.915 to 0.920, AqS2 from 0.843 to 0.855, and AqS3 from 0.695 to 0.733. Overall, the strong correlations between the biological replicates from each experimental group suggested the gene expression profiles of the four species in the A. queenslandica holobiont were comparable within the rRNA-depleted- and the Poly(A)-RNA-Seq samples.
Gene coverage
Transcript quantification is one of the most common applications of RNA-Seq; it estimates the gene expression levels based on the number of reads mapped to each gene. This approach is widely used to identify differential expression (DE) of genes between different treatments and contexts (Conesa et a., 2016). Many DE software packages filter genes with < 5 reads before identifying DE genes because of both biological and statistical concerns, removing genes with low read counts prior to downstream analysis (Chen et al. 2016). To better understand the influence of RNA-Seq library preparation method for capture of both animal host and bacterial symbiont transcriptomes, we compared the percentage of genes to which RNA-Seq reads were mapped across the different methods (Fig. 2A). Specifically, we filtered the genes with two thresholds (5 and 1 read pairs) and then calculated the gene coverage.
When we used the minimal threshold of 1 read pairs mapped, the percent of A. queenslandica genes covered by the rRNA-depleted-RNA-Seq data (53.93%) was very similar to that of the Poly(A)-RNA-Seq data (57.38%) (two-tailed t-test, p-value 0.308), and both of these were lower than the bacterial unenriched-Poly(A)-RNA-Seq data (66.62%) (Fig. 2A). For the bacterial symbiont genomes, on average, 89.70% (± 0.0082) AqS1 genes, 95.11% (± 0.0029) AqS2 genes, and 80.36% (± 0.0260) AqS3 genes were captured by rRNA-depleted-RNA-Seq libraries. These gene coverages were much higher than those captured by Poly(A)-RNA-Seq libraries, which, on average, were 70.07% (± 0.0119) for AqS1, 65.99% (± 0.0181) for AqS2, and 44.46% (± 0.0259) for AqS3 (one-tailed t-test, p-values were 2.54e-05 in AqS1, 4.98e-04 in AqS2, and 3.56e-05 in AqS3). The coverages also were higher than that of the bacterial unenriched-Poly(A)-RNA-Seq samples, which were 18.14% in AqS1, 24.06% in AqS2, and 1.13% in AqS3. This gene coverage comparison indicated that rRNA-depleted-RNA-Seq and Poly(A)-RNA-Seq could capture the animal host transcriptomes at the same level, but rRNA-depleted-RNA-Seq was better at capturing the bacterial symbiont transcriptomes. The unenriched-Poly(A)-RNA-Seq was less likely to comprehensively capture bacterial symbiont transcriptomes compared to the rRNA-depleted-RNA-Seq and Poly(A)-RNA-Seq.
When the threshold was raised to at least 5 read pairs per gene, on average, 50.94% and 49.92% A. queenslandica genes were covered with in the rRNA-depleted-RNA-Seq data and Poly(A)-RNA-Seq data, respectively, with 58.66% of the genes represented in bacterial unenriched-Poly(A)-RNA-Seq data (Fig. 2A). These values are consistent with the estimated proportion of genes in the genome that are expressed in adult A. queenslandica (Fernandez-Valverde et al. 2015). For the symbiont bacterial genes, on average, 86.23% (± 0.0079) AqS1 genes, 93.65% (± 0.0087) AqS2 genes and 75.63% (± 0.0344) AqS3 genes were captured in rRNA-depleted-RNA-Seq data. BY comparison, in the Poly(A)-RNA-Seq data, on average, only 54.22% (± 0.0016) AqS1 genes, 41.07% (± 0.0123) AqS2 genes, and 19.57% (± 0.0130) AqS3 genes were captured. In the bacterial unenriched-Poly(A)-RNA-Seq samples, only 3.22% AqS1 genes, 5.43% AqS2 genes, and 0.20% AqS3 genes were captured. These gene coverages indicate that rRNA-depleted-RNA-Seq captured a similar proportion of sponge genes and a higher proportion of bacterial genes than Poly(A)-RNA-Seq. For the unenriched-Poly(A)-RNA-Seq sample, the coverage was representative only for the sponge transcriptome but not for the symbiotic bacterial transcriptomes.
Distribution of read depth in gene coding regions
Sequence depth is a crucial measure of the robustness of RNA-Seq data. We calculated the gene depth by counts of aligned reads per gene, and compared the gene depth distribution of the rRNA-depleted-, Poly(A)- and bacterial unenriched-Poly(A)-RNA-Seq samples (Fig. 2B). For the host sponge A. queenslandica, on average, the median gene depths were 110 (± 15.95) and 113 (± 1.73) reads per gene for the rRNA-depleted- and Poly(A)-RNA-Seq libraries, and 83 for the bacterial unenriched-Poly(A)-RNA-Seq adult. For each bacterial symbiont, the rRNA-depleted-RNA-Seq data showed the highest average gene depths, which were higher than these of the Poly(A)-RNA-Seq data and bacterial unenriched-Poly(A)-RNA-Seq data. On average, the rRNA-depleted-RNA-Seq median gene depths were 446 (± 89.37) in AqS1, 162 (± 84.50) in AqS2, and 47 (± 13.75) in AqS3; the Poly(A)-RNA-Seq median gene depths were 14 (± 0.58) in AqS1, 7 (± 0.58) in AqS2, and 4 (± 0) in AqS3 (Fig. 2B). The median gene depth of all the three primary symbiotic bacterial was 2 for the bacterial unenriched-Poly(A)-RNA-Seq sample. The similar read depth distributions of A. queenslandica for rRNA-depleted- and Poly(A)-RNA-Seq samples implied that there was no noteworthy gene depth difference among the host sponge transcriptome captured by both methods, but sponge transcript depths were lower than obtained by bacterial unenriched-Poly(A)-RNA-Seq (Fig. 2B). The symbiont gene depths were much higher in the rRNA-depleted-RNA-Seq data compared to both the Poly(A)-RNA-Seq and the bacterial unenriched-Poly(A)-RNA-Seq data.
Comparison of the expressed biological functions in the different RNA-Seq data sets
A first step towards understanding gene functions in host and symbionts is to annotate expressed and enriched genes in RNA-Seq data sets using Gene Ontology (GO) (Ashburner et al., 2000). GO analyses were performed and compared between rRNA-depleted-RNA-Seq, Poly(A)-RNA-Seq and bacterial unenriched Poly(A)-RNA-Seq using Blast2GO (version 5.2.4) (Conesa & Gotz, 2008) to determine if any of the RNA-Seq approaches resulted in a bias in the types of genes captured in the transcriptomes. In total, 28,570 A. queenslandica genes, 2,417 AqS1 genes, 1,193 AqS2 genes, and 1,977 AqS3 genes were assigned to 6,357, 1,317, 904, and 1,030 GO terms, respectively.
To identify any differences in GO functional annotations among the different RNA-Seq libraries, the percentage of expressed GO terms in each sample was calculated for each species. That is, for a given species, if any gene in the genome that had been assigned to a particular GO term was expressed with at least 1 read pair in a given sample, that GO term was considered as an expressed GO term. The expressed GO percentage was estimated as the percent of total GO terms for the genome that were represented in the RNA-Seq data, with replicates averaged (Fig. 3A; Supplementary Table S1). For A. queenslandica, almost all the GO terms were present in the rRNA-depleted-RNA-Seq data (95.88% ± 0.0085), Poly(A)-RNA-Seq data (97.43% ± 0.0036) and bacterial unenriched-Poly(A)-RNA-Seq data (98.47%). For the bacterial symbionts, the expressed GO term percentages were 98.96% ±0.002 in AqS1, 99.34% ± 0.001 in AqS2 and 98.32% ± 0.006 in AqS3 for the rRNA-depleted-RNA-Seq samples (Fig. 3A). These percentages of expressed GO terms in rRNA-depleted-RNA-Seq samples were higher than in Poly(A)-RNA-Seq samples, which was 92.63% ± 0.006 in AqS1, 86.32% ± 0.01 in AqS2, and 79.58% ± 0.014 in AqS3 (one-tailed t-test, p-values were 3.68e-04 in AqS1, 9.37e-04 in AqS2, and 1.91e-04 in AqS3); and also higher than that of bacterial unenriched-Poly(A)-RNA-Seq sample, which was 38.72% in AqS1, 46.02% in AqS2, and 2.91% in AqS3 (Fig. 3A). These GO term expression percentages indicate that all the different libraries captured a similar functional transcriptional profile of the host sponge A. queenslandica. In contrast, while rRNA-depleted-RNA-Seq provides a near-complete functional representation of the symbiotic bacteria, this is not true for the Poly(A)-RNA-Seq and bacterial unenriched-Poly(A)-RNA-Seq.
GO functional profiles of expressed and over-represented (enriched) genes in the three RNA-Seq data sets, with replicates averaged, were compared to determine if GO categories were equally represented in the different types of RNA-Seq libraries (Fig. 3B; Supplementary Table S1). This approach revealed differences in the representation of specific GO functions in each of the libraries. For the sponge A. queenslandica, on average, the expressed gene percentage of each GO term was higher in the bacterial unenriched-Poly(A)-RNA-Seq data (93.64%) compared to the rRNA-depleted-RNA-Seq data (86.57%) and the Poly(A)-RNA-Seq data (90.62%). For each bacterial symbiont, the rRNA-depleted-RNA-Seq data showed the highest GO term expressed gene percentages compared to the Poly(A)-RNA-Seq data and bacterial unenriched-Poly(A)-RNA-Seq data (Fig. 3B). The rRNA-depleted-RNA-Seq average GO term expressed gene percentages were similar, 95.57, 98.62% and 95.62% for AqS1, AqS2, and AqS3, respectively. In contrast, the average GO term expressed gene percentages of Poly(A)-RNA-Seq and bacterial unenriched-Poly(A)-RNA-Seq for the three symbionts varied considerably. For the Poly(A)-RNA-Seq, the average GO term expressed gene percentages decreased from 84.49% in AqS1, to 77.45% in AqS2 and 67.85% in AqS3. For the bacterial unenriched-Poly(A)-RNA-Seq, this percentage decreased from 21.88% in AqS1 and 31.27% in AqS2 to 0.43% in AqS3 (Fig. 3B). These GO term percentages suggest that for the symbionts, the Poly(A)-RNA-Seq method produces bias in the functional types of genes that are captured, as indicated by GO enrichment results.
We used pairwise comparisons of expressed gene numbers to identify which functional groups of genes, as indicated by GO terms, were most differently captured by the different library types. For each pairwise comparison, the most differentially represented 30 GO terms were identified by the biggest differences of the expressed gene numbers. The most differentially represented GO terms included four biological process GOs (oxidation-reduction process, proteolysis, transmembrane transport, and translation), four cellular component GOs (integral component of membrane, membrane, cytoplasm, and plasma membrane) and 13 molecular function GOs (i.e. catalytic activity, hydrolase activity, transferase activity, ATP binding, and DNA binding) (Fig. 3C). Overall, compared to unenriched-Poly(A)-RNA data, the rRNA-depleted-, and Poly(A)-RNA data sets under-represented sponge genes involved in integral component of membrane, membrane, ATP binding, and protein binding (Fig. 3C). For the three symbiotic bacteria, compared to the rRNA-depleted RNA data, the Poly(A)- and bacterial unenriched-Poly(A)-RNA data underrepresented bacterial genes involved in oxidation-reduction, transmembrane transport, integral component of membrane, membrane, and ATP binding (Fig. 3C).