Base and raw data quality
Following RNA extraction, two aliquots of each extract, containing identical amounts of starting material, were used to construct Illumina libraries, which were then sequenced to allow bioinformatic comparison of the data. In addition, to verify the compatibility of library preparation kits with GenoLab M, we tested kits from different manufacturers (Supplemental Table S1). The sequencing strategy was paired-end 100 bp for GenoLab M and paired-end 150 bp for NovaSeq 6000. We initially generated between 23.20 M and 62.87 M clean reads per library on the NovaSeq 6000 platform, and between 26.86 M and 139.69 M clean reads per library on the GenoLab M platform (Table 1). Each individual sample had similar base throughput on both sequencing platforms. The quality of the sequencing data was checked using FastQC. The percentage of high-quality bases (over Q20) averaged 94.86% for GenoLab M and was slightly higher for NovaSeq 6000, at 97.50% (Table 1). As shown in Fig. 2, the clean reads from GenoLab M reached an average mapping rate of 91.80% and an average unique mapping rate of 88.33%, which are comparable to the mapping rates of reads from the NovaSeq 6000 platform. The two platforms showed fairly consistent read distributions along genes across species (Fig. 3) and consistent expression density distributions (Fig. 4). Interestingly, in human and mouse, the LncRNA expression level measured using the Yeasen LncRNA library kit (YS) was higher than that measured using the other kits. Fig. 5 shows that the accuracy of quantification was consistent for both low- and high-abundance genes. It further indicates that LncRNA expression measured with YS was clearly higher in abundance than with the other kits in human and mouse (Fig. 5 A and B), which is consistent with Fig. 4 B and D. Overall, the sequence quality of the two platforms was similar across the various library kits.
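The Q20 percentages reported above can be illustrated with a minimal sketch of how such a figure is typically derived from FASTQ quality strings. This is not the FastQC implementation. It assumes Phred+33 encoding and counts a base as high quality when its score is at least 20; the function name `q20_fraction` and both parameters are hypothetical.

```python
def q20_fraction(quality_strings, threshold=20, offset=33):
    """Return the fraction of bases whose Phred score is >= threshold.

    quality_strings: iterable of FASTQ quality lines (one per read).
    offset: ASCII offset of the encoding (33 is assumed here).
    """
    total = 0
    passing = 0
    for qual in quality_strings:
        for ch in qual:
            total += 1
            # Decode the ASCII character into a Phred quality score.
            if ord(ch) - offset >= threshold:
                passing += 1
    # Avoid division by zero when no bases were supplied.
    return passing / total if total else 0.0
```

For example, the quality string `II5!` decodes to the scores 40, 40, 20 and 0, so three of its four bases pass the threshold.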
Inter-platform comparison of gene detection and quantification
In transcriptome and LncRNA analysis, gene identification is essential for the majority of research projects. Therefore, we further compared the capacity of the GenoLab M and NovaSeq 6000 platforms for gene detection and quantification. In total, over 42,000, 16,000 and 26,000 genes were identified in bean, human, and mouse, respectively, across the two sequencing platforms (Fig. 6, A-C). For the transcriptome, only a small fraction of genes differed between the GenoLab M and NovaSeq 6000 platforms: over 92% of genes were detected by both sequencing platforms. For LncRNA, however, only 71% of genes were shared between the two sequencing platforms (Fig. 6, D-E). This difference most likely stemmed from the use of StringTie to identify novel LncRNAs and from the different read lengths of the two datasets [18]. StringTie (1.3.1) was used to calculate FPKMs of LncRNAs, and the threshold for novel LncRNAs was set to at least 0.1. We calculated the Pearson correlation coefficient between the transcriptome and LncRNA data produced by the two platforms using the same methods and found that all pairs of samples showed high correlation coefficients, ranging from 0.972 to 0.992 for the transcriptome and from 0.691 to 0.793 for LncRNA (Fig. 7). A gap therefore remains in the correlation between the two platforms for LncRNA. In all, GenoLab M has remarkable inter-platform concordance with NovaSeq 6000, suggesting that GenoLab M could substitute for NovaSeq 6000 in many application fields where transcriptome and LncRNA are the primary focus.
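The two comparisons described above, the share of genes detected by both platforms and the Pearson correlation of expression values, can be sketched as follows. This is an illustrative sketch only. It assumes that each platform's result is a dictionary mapping gene identifiers to FPKM values, that the shared fraction is taken relative to the union of detected genes, and that the correlation is computed on the genes present in both dictionaries without further transformation; the text does not specify these details, and all function names are hypothetical.

```python
import math


def shared_fraction(genes_a, genes_b):
    """Fraction of all detected genes that both platforms detected (assumed: union as denominator)."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def platform_correlation(fpkm_a, fpkm_b):
    """Pearson correlation over the genes quantified on both platforms."""
    shared = sorted(set(fpkm_a) & set(fpkm_b))
    return pearson([fpkm_a[g] for g in shared], [fpkm_b[g] for g in shared])
```

Whether expression values are log-transformed before correlation changes the coefficient considerably, so any comparison with the values in Fig. 7 would need to follow the same convention as the original analysis.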
Detection of alternative splicing
As one of the major mechanisms for generating transcriptome diversity, alternative splicing (AS) has gained increasing attention in recent years. In this context, we analyzed the ability of each sequencing platform to detect splicing junctions and the corresponding alternative splicing patterns across transcriptomes. In mouse, 53,557, 59,709 and 53,014 AS events were detected by GenoLab M, and 56,741, 64,105 and 48,089 by NovaSeq 6000. The top three AS events in all libraries across the two platforms were TSS: alternative 5' first exon (transcription start site), TTS: alternative 3' last exon (transcription terminal site) and AE: alternative exon ends (5', 3', or both) (Fig. 8 A). In the mouse LncRNA data, the composition of AS events in mRNA was similar to that in the transcriptome data (Fig. 8 B). For the human sample, the composition of AS events in the transcriptome data and in the mRNA of the LncRNA data followed the same pattern, and the top three AS events were TSS, TTS and SKIP: skipped exon (SKIP_ON, SKIP_OFF pair), as shown in Fig. 8 C and D. In bean, 78,137, 82,558 and 105,038 AS events were detected by GenoLab M, and 83,072, 84,526 and 90,580 by NovaSeq 6000. The top three AS events in all libraries were TSS, TTS and AE (Fig. 8 E). In terms of both the number and the types of AS events, we found no significant difference between the two platforms in any of the three species.
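The ranking of AS event types described above amounts to tallying events by type within each library. A minimal sketch, assuming the events of one library are available as a flat list of type labels such as `TSS`, `TTS`, `AE` and `SKIP`; the function name `top_event_types` is hypothetical and the sketch does not reproduce the output format of any particular AS detection tool.

```python
from collections import Counter


def top_event_types(event_types, n=3):
    """Return the n most frequent AS event types as (type, count, fraction) tuples."""
    counts = Counter(event_types)
    total = sum(counts.values())
    # most_common() returns an empty list for empty input, so no division by zero occurs.
    return [(etype, count, count / total) for etype, count in counts.most_common(n)]
```

Running the same tally on the libraries from each platform and comparing the resulting rankings gives the kind of per-library composition shown in Fig. 8.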
Identification of SNP and InDel mutations
SNPs and InDels are crucial genomic features for revealing genetic variation. High-throughput transcriptome analysis helps show how these DNA variations are transcribed into messenger RNA and affect subsequent protein function. Therefore, we examined the ability of the GenoLab M sequencing platform to detect SNP and InDel variations at the mRNA level. Regarding SNP detection, we found that the SNPs called from the two sequencing platforms (Table 2) were highly similar in both variety and quantity. The largest difference was that, on average, the GenoLab M platform identified slightly more SNP events in mouse than NovaSeq 6000.
For InDel events, GenoLab M detected fewer than NovaSeq 6000 in bean, human and mouse (Table 3). The InDel counts were closest in the bean sample prepared with the Vazyme Biotech (VZ) transcriptome library kit, whereas a marked difference was observed in the mouse sample prepared with the Yeasen Biotechnology (YS) transcriptome library kit. These results suggest that GenoLab M is slightly inferior in InDel detection, probably because of the shorter read length used in this study.
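The platform comparison of SNP and InDel calls described above can be sketched as a set comparison. This is an illustrative sketch rather than the variant-calling pipeline used in this study. It assumes each call is represented as a `(chrom, pos, ref, alt)` tuple with a single alternate allele, classifies a call as a SNP when both alleles are one base long and as an InDel when their lengths differ, and treats two calls as shared only when all four fields match exactly. The names `classify` and `compare_calls` are hypothetical.

```python
def classify(ref, alt):
    """Classify a single-alternate-allele call as 'SNP', 'InDel' or 'other'."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) != len(alt):
        return "InDel"
    return "other"  # e.g. multi-nucleotide substitutions of equal length


def compare_calls(calls_a, calls_b):
    """Count shared and platform-specific calls per variant type.

    calls_a, calls_b: iterables of (chrom, pos, ref, alt) tuples.
    """
    a, b = set(calls_a), set(calls_b)
    summary = {}
    for label, subset in (("shared", a & b), ("only_a", a - b), ("only_b", b - a)):
        for chrom, pos, ref, alt in subset:
            kind = classify(ref, alt)
            bucket = summary.setdefault(kind, {"shared": 0, "only_a": 0, "only_b": 0})
            bucket[label] += 1
    return summary
```

Exact matching is a deliberately strict choice: the same InDel can be written at different positions in repetitive sequence, so a real comparison would normally normalize the calls before intersecting them.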