High-throughput Transcriptome Sequencing and Read Mapping in Two Rice Subspecies
To characterize the gene expression profiles at the reproductive stage of the two rice subspecies, we extracted total RNA from 40-mm-long young panicles of 26 indica and 25 japonica plants (Table S1). For each plant, equal amounts of total RNA isolated from three panicles was mixed to establish an RNA pool, which was then further processed using a previously described method (Marioni et al. 2008) with minor modifications, and the samples were subjected to shotgun sequencing by an Illumina GAII instrument. We obtained 2.45 billion reads for the 26 indica plants and 2.41 billion reads for the 25 japonica plants at a 125 base-pair resolution (Table S2). After filtering out low-quality reads and reads containing adaptor sequences (7% of all reads), we obtained approximately 4.52 billion clean reads (345.48 and 331.99 Gb for indica and japonica plants, respectively) (Table S2). These clean reads were then mapped to the reference japonica rice genome (Nipponbare; IRGSP v7.0), with at most two mismatches allowed for each read, while ignoring the reads that were mapped to more than two locations in the reference genome (multi-mapped reads). According to these criteria, we filtered out (~15%) of the clean reads and mapped 70.36–89.88% of the clean reads to the reference genome (Fig. 1a; Table S3). Among the mapped clean reads, 62.47–78.92% were mapped uniquely to one genomic location (Table S3). In addition, the rates of non-splice genes were 41.08-56.60% for indica plants and 44.68–60.83% for japonica plants (Fig. 1b; Table S3). We then calculated the mapping rates for each sample and the average mapping rates for the two subspecies. The results showed a significant difference in the average mapping rate between the two subspecies. The average mapping rate for indica plants was 77.66%, whereas that for japonica was 83.88% (t=19.275 and p<0.001). Due to the fact that we used the japonica genome as the reference, we suspect that this difference may have been caused by mapping biases. Therefore, we decided to employ a previously constructed pseudo-transcriptome (Koenig et al 2013) of indica rice as the reference genome for the reads obtained from the indica rice samples. As a result, there was no significant difference in the average mapping rate between the two subspecies. The newly calculated average mapping rate for indica plants was 84.01%, whereas that for japonica remained 83.88% (t=0.12 and p=0.91).
RNA-Seq Read Assembly and Identification of Novel Transcriptionally Active Regions
We subsequently assembled the RNA-Seq clean reads into 97,005 transcribed fragments with a mean length of 884 bp (ranging from 50 to 20674 bp) (Fig 2a). These fragments were aligned to the sequences of known rice genes (MSU 7.0) The sequence alignment results are shown in Fig 2b. Among the assembled fragments, 26.31% overlapped with exons, 36.31% fell into annotated exons, 16.57% fell into introns, and the remaining 20.77% were non-gene sequences (Fig 2b). We also found that 42% of rice genes had alternative transcripts. This percentage was much lower than that previously estimated based on microarray results (Jung et al. 2013).
In total, we identified 4579 novel transcriptionally active regions based on the sequences of rice genes and non-coding RNAs (ncRNAs) deposited in the Ensembl ncRNA and National Center for Biotechnology Information (NCBI) EST “others” databases (E-value<1e-6, Table S4). Among these regions, 61.7% were unknown protein-coding regions, 67.2% could be transcribed into single-exon transcripts, 42.71% shared more than 90% sequence identity with at least one known EST, and 12.14% were likely to encode non-coding RNAs. In addition, our RNA-Seq data expanded the untranslated regions (UTR) of 11.23% of rice genes.
Identification of DEGs between the Two Rice Subspecies
According to the MSU rice genome annotation database, 74.71–88.49% of the mapped reads overlapped with annotated genes for each sample. The number of reads mapped to a certain annotated gene ranged from 2 to more than 300,000, with a median number of 23 and 32 for indica and japonica plants, respectively. There were 40,580 genes mapped by two reads in at least two plants. In addition, we were able to establish linear relationships between the number of reads mapped to a certain gene and the gene expression level (FPKM) in all the samples (0.77 <R2 <0.94; Fig. 3a). Given that we found significant differences in the number of reads mapped to a certain gene between the indica and japonica plants (t-test; P=0.012), we concluded that the two rice subspecies had significantly different gene expression profiles. Furthermore, principal variance component analysis also revealed significant differences in gene expression levels between japonica and indica individuals (Fig 3b).
We then employed fragments per kilobase of transcript per million mapped reads (FPKM) values to estimate the expression level of various genes. To validate the estimation results, we examined the relative expression levels of three genes with clear functional annotations (Table S5) by quantitative real-time PCR (qRT-PCR) analysis. We found statistically significant correlations between the relative expression levels determined by qRT-PCR and the FPKMs of these genes (Spearman’s correlation coefficient, r=0.8924, P<0.05; Fig 4), indicating that the gene expression levels revealed by our RNA-Seq data were reliable. Among the 41,080 expressed genes, 38911 (94.72%) were expressed in both rice subspecies, 623 genes were expressed only in indica plants and 1546 genes were expressed only in japonica plants. Thereafter, we analyzed our RNA-Seq data using a generalized linear model and identified 1857 (4.52% of all the expressed genes) DEGs between indica and japonica rice (false discovery rate [FDR] <0.05; Fig. 5a). Of these DEGs, 576 and 1281 were expressed at lower and higher levels, respectively, in indica rice than in japonica rice (Fig. 5a). We then mapped these DEGs to the rice genome and found they are distributed throughout the 12 chromosomes (P>0.05; Fig. 5b). Furthermore, using REEF software (sliding window size was set to 500 kb to 1 Mb and the step size was set to 10 kb), we found that the DEGs tended to group in close proximity along the chromosomes (FDR <0.05) (Copper et al. 2006). However, we could not conclude that the DEGs were clustered.
Functional Annotation of the DEGs
We annotated the DEGs using enrichment analysis tool. According to the annotations, 21 Gene Ontology (GO) terms were significantly enriched (P<0.05; Fig. 6). These terms could be classified into two groups. The first group consisted of reproduction-related terms, including “reproduction”, “recognition of pollen”, “multi-organism reproductive process” and “cell recognition”. The second group consisted of cell wall biosynthesis-related terms, including “cell wall organization”, “cellulose biosynthetic process”, “cellulose metabolic process”, “cellulose synthase activity” and “UDP-glucosyl transferase activity”. The GO analysis showed that the genes encoding the binding proteins were enriched in these co-expressed genes (Fig. 6). Interestingly, genes expressed at lower levels in indica rice than in japonica rice were annotated with reproduction-related terms, whereas those expressed at higher levels in indica rice than in japonica rice were annotated with cell wall biosynthesis-related terms. Notably, the DEGs annotated with reproduction-related terms included the MADS-box genes, which comprise a gene family associated with flower development and reproduction (Becker and Theissen 2003). We identified six MADS-box genes, namely OsMADS26, OsMADS15, OsMADS37, OsMADS63, OsMADS74, and OsMADS8 that were expressed at significantly different levels between the two subspecies (Table S1). Consistent with these findings, among the ten most significant morphological differences between the two subspecies, six were reproduction-related traits.
Effect of Selection Pressure on the Regulation of the DEGs
To explore whether artificial selection may have affected the expression of the DEGs, we investigated the gene regulatory regions (i.e., 5’ and 3’ flanking regions) and gene-coding regions of the DEGs (Fig 7). Specifically, we searched for genomic regions with polymorphism to divergence (π/Dxy) ratios ranked in the lowest 5% (P<0.001), because such regions may have a greater chance of experiencing positive selection. Compared with non-DEGs, DEGs from both subspecies had more 5′ flanking regions with low π/Dxy ratios (Fisher’s exact test, P<0.05). These results were then confirmed by a resampling test (P<0.05). Taken together, these results suggest that the 5′ regulatory regions of the DEGs experienced positive selection during rice domestication, and that gene regulation plays an important role in the evolution of the two subspecies.