A comprehensive evaluation of single-end sequencing data analyses for environmental microbiome research

Illumina sequencing platforms have been widely used for amplicon-based environmental microbiome research. Analyses of amplicon data of environmental samples, generated from Illumina MiSeq platform illustrate the reverse (R2) reads in the PE datasets to have low quality towards the 3’ end of the reads which affect the sequencing depth of samples and ultimately impact the sample size which may possibly lead to an altered outcome. This study evaluates the usefulness of single-end (SE) sequencing data in microbiome research when the Illumina MiSeq PE dataset shows significantly high number of low-quality reverse reads. In this study, the amplicon data (V1V3, V3V4, V4V5 and V6V8) from 128 environmental (soil) samples, downloaded from SRA, demonstrate the efficiency of single-end (SE) sequencing data analyses in microbiome research. The SE datasets were found to infer the core microbiome structure as comparable to the PE dataset. Conspicuously, the forward (R1) datasets inferred a higher number of taxa as compared to PE datasets for most of the amplicon regions, except V3V4. Thus, analyses of SE sequencing data, especially R1 reads, in environmental microbiome studies could ameliorate the problems arising on sample size of the study due to low quality reverse reads in the dataset. However, care must be taken while interpreting the microbiome structure as few taxa observed in the PE datasets were absent in the SE datasets. In conclusion, this study demonstrates the availability of choices in analyzing the amplicon data without having the need to remove samples with low quality reverse reads.


Introduction
Amplicon-based microbiome approach has been widely employed to understand the community structure and role of microorganisms in environmental research. In amplicon microbiome studies, different hypervariable regions of 16S rRNA gene such as V1V3, V3V4, V4V5 or V6V8 are generally amplified and sequenced on high-throughput sequencing (HTS) platforms. Today, researchers have a wide variety of HTS platforms among which Illumina MiSeq has been adopted as a popular tool to generate amplicon data (Pollock et al. 2018;Singer et al. 2019). The Illumina MiSeq has also been used in the large scale environmental studies such as the Earth Microbiome Project, global soil analyses, climate change research, etc. (Gilbert et al. 2014;Thompson et al. 2017;Bižić et al. 2020;Oliverio et al. 2020). Although the Illumina MiSeq platform has advantages such as low cost, flexibility, fast run time and generation of 300 bp long paired-end (PE) sequencing reads (Wen et al. 2017;Bharti and Grimm 2021), the platform has some disadvantages. One such disadvantage is having low phred quality towards the 3′ end of the reads (Liu et al. 2020). Especially, the quality scores of R2 reads (reverse) are often lower than that of R1 reads (forward) (Schirmer et al. 2015;Tan et al. 2019). This study also evaluated the quality scores of PE sequencing data of 536 environmental samples, sequenced on Illumina MiSeq platform; available on SRA, and the results are shown in Fig. 1. The analyses revealed that the phred quality score of approximately 30-40% of R2 reads is reduced below 20 after ~ 225 bp. In a standard bioinformatics workflow, the sequencing reads are primarily trimmed at the 3' end to remove the poor-quality bases (Ex: phred quality < 20) so as to retain maximum number of reads for downstream analyses. Later, the R1 and R2 reads are stitched together to obtain the complete amplicon. However, a copious amount of R2 reads, generated by MiSeq platform, often have low phred score after ~ 225 bp as observed in Fig. 1 and trimming of R2 reads at 225 bp could lead to a significant reduction of overlapping region which will affect the merging process of R1 and R2 reads. Another option is trimming the reads after spanning enough overlapping region for merging but still a sizeable population of R2 reads will be removed from the dataset when stringent quality criteria is used during the filtering step. Thus, the low quality of R2 reads in PE sequencing data of samples could lead to significant reduction in sequencing depth of respective samples during the filtering and merging processes. Earlier studies have highlighted that the microbiome composition and structure could be influenced by sequencing depth (Zaheer et al. 2018; on behalf of the REHAB consortium et al. 2019; Ramakodi 2021). Thus, samples with low sequencing depth due to poor quality of R2 reads need to be removed from the dataset to avoid problematic results. In contrast, most of the R1 reads of samples have good quality (phred score > 20) bases even after 275 bp (Fig. 1). Thus, exclusively analyzing the R1 reads could help to retain most or all the samples in microbiome analyses. Earlier, Werner et al. (2012) compared the Illumina PE and SE reads using 2X150 bp PE data of V4 region and their study illustrated that the alpha and beta diversity indices based on the SE sequencing data are comparable to the results derived from PE data. However, the efficiency of SE data for taxonomy profiling was not evaluated. In addition, Werner et al. (2012) analyzed 2X150 bp data of different samples that was available at that time but today the Illumina MiSeq platform has the chemistry which generates up The figure illustrates that the quality of reverse reads is comparatively lower than that of forward reads. A R1-forward reads; B R2-reverse reads to 300 bp reads. However, the efficiency of single-end (SE) sequencing read (R1) analyses to infer microbiome structure in environmental research is not yet known. Thus, a comprehensive evaluation of SE and PE reads generated from Illumina MiSeq platform to assess microbiome structure is required at this moment. In this context, this study aims to evaluate the effectiveness of SE sequencing read analyses in environmental microbiome research.

Data
The V1V3, V3V4, V4V5 and V6V8 amplicon sequencing data, generated by Soriano-Lerma et al. (2020) for soil samples, were analyzed in this study to evaluate the usefulness of SE sequencing reads for environmental microbiome research. The location of sampling, sample collection protocol and amplicon sequencing procedure are described in detail in Soriano-Lerma et al. (2020). Briefly, a total number of 128 samples (32 samples for each amplicon region), as submitted by the authors were downloaded from the sequence reads archive (SRA) of NCBI for the analyses (BioProject Accession Number: PRJNA612815). All the data were downloaded using the SRA toolkit. The SRA accession of each sample along with their metadata as obtained from SRA is given in Supplementary Table 1. The samples were divided into four groups according to the 16S rRNA amplicon region and the dataset were analyzed, separately.

Amplicon microbiome analyses
The amplicon sequencing data was analyzed on R version 3.6.3 using DADA2 tool which generates amplicon sequencing variants (ASVs). Recent studies have highlighted that ASVs could infer the microbiome structure better than the conventional operational taxonomic units (OTUs) (Callahan et al. 2016;Caruso et al. 2019) which is based on the clustering approach. Thus, this study adopted ASVs to analyze the environmental microbiome data. The schematic diagram showing the DADA2 workflow along with the parameters used are shown in Supplementary Fig. 1. Briefly, the R1, R2 and PE (R1 and R2) data of samples were analyzed, separately. Subsequently, the non-chimeric sequences were generated based on (i) only R1 reads, (ii) only R2 reads and (iii) PE data. Thus, this study generated three ASV datasets for each amplicon sequencing region for further downstream analyses. The data generated from R1 or R2 are referred as SE dataset, whereas the data generated by merging R1 and R2 reads are referred as PE dataset.

Taxonomy assignment
The taxonomy assignments of ASVs were carried out using the tool IdTaxa (Murali et al. 2018). Briefly, each dataset was analyzed, separately and a stringent parameter 70% confidence threshold was selected for the inference of taxonomy. The SILVA SSU database 138 (www. arb-silva. de) as available on the IdTaxa web tool was used as the training dataset for the taxonomic assignment of ASVs.

Downstream analyses
The downstream analyses were performed in R version 3.6.3 using the packages phyloseq (McMurdie and Holmes 2013), microbiome, Decipher (Wright 2016), ggplot2 (Wickham 2016), tidyverse (Wickham et al. 2019), dplyr, ape (Paradis et al. 2004), vegan, hclust and venn. Primarily, the ASVs without known phylum were removed from the dataset. Also, the ASVs were removed, if the ASVs exist in only one sample and/or have less than 0.001 proportion of minimum sample depth in the dataset. The alpha and beta diversity analyses were performed using the rarefied dataset. The minimum sequencing depth of the dataset was used as the rarefaction depth. The core microbiome structure was inferred based on the following criteria that the taxa have more than 0.001 of relative abundance with the prevalence of 50%.

Results and discussion
The SE sequencing read analyses in microbiome research was evaluated using the data generated from the Illumina MiSeq platform which is widely used for amplicon microbiome research worldwide. The results of this current study support the usefulness of SE data analyses in environmental microbiome studies.

Quality of reverse reads influences the sample sequencing depth
Primarily, the impact of R2 read qualities on generation of full-length amplicon was analyzed. Briefly, the R1 and the R2 reads were truncated at different lengths, the PE data were merged and the results were compared (Fig. 2). The results show that the depth (number of sequences/ reads of a sample) of R2 and merged sequences is inversely proportional to the truncation length of R2 reads whereas the truncation length of R1 reads does not have a great impact on merging the PE dataset. The results indicate that the quality of R2 reads, obtained from Illumina MiSeq platform, decreases as the length of reads increases and the low base quality of R2 reads affects the sequencing depth of samples,

SE versus PE datasets for the inference of alpha and beta diversity
This study evaluated the R1, R2 and PE datasets of four different amplicons, V1V3, V3V4, V4V5, and V6V8, for inferring the microbiome structure in environmental microbiome research. The analyses showed that the number of non-chimeric sequences obtained for R1 datasets (Median: V1V3-43,333; V3V4-33,063; V4V5-37,313; V6V8-34,932) is higher than that of R2 (Median: V1V3-19,166; V3V4-30,892; V4V5-20,390; V6V8-21,825) and PE (Median: V1V3-9579; V3V4-16,539; V4V5-20,257; V6V8-19,652) datasets. Similarly, the observed ASVs were found to be significantly higher (P value < 2.2e−16 to 0.012) for R1 datasets as compared to the other datasets, irrespective of amplicon regions. The comparison of distribution of observed ASVs based on the R1, R2 and PE datasets are shown in Fig. 3. Earlier studies showed that the sequencing depth (number of reads per sample) could influence the number of ASVs. For instance, the dataset with the highest sequencing depth was found to infer significantly more number of observed ASVs than the datasets with lower sequencing depth (Ramakodi 2021). Thus, the higher distribution of observed ASVs in R1 datasets as compared to its counterparts could be due to the higher sequencing depth in R1 datasets. The beta diversity of samples based on the bray-curtis dissimilarity index was studied and the dendogram based on the bray-curtis distance values is shown in Supplementary Figs. 2 to 5 for V1V3, V3V4, V4V5, and V6V8, respectively. The beta diversity analyses show that the relationships between samples based on the R1, R2 and PE datasets vary, irrespective of amplicon regions. The bray-curtis distance is based on the microbial community composition of datasets. The alpha diversity results clearly illustrate R1, R2 and PE datasets to exhibit Fig. 2 Effect of truncation length of sequencing reads on merging paired-end reads. The results suggest that the number of filtered R2 reads decreases as the truncation length of R2 reads increases. In addition, truncation length of R2 reads influences the generation of non-chimeric PE sequences different proportions of ASVs within each amplicon region. Thus, the differences in the distribution of the number of ASVs/ microbial taxa within R1, R2 and PE datasets could have influenced beta diversity.

Core microbiome composition: SE versus PE datasets
The ultimate goal of any microbiome study is to find the structure of core microbiome and this study also evaluated the utility of SE sequencing data to infer the core microbiome composition. The comparison of core microbiome at genus level, based on the SE and PE datasets of different amplicon regions is shown in Fig. 4 and the relative abundance of taxa (class level) is shown in Fig. 5. The analyses showed the R1 datasets to have a higher number of unique genus as compared to R2 and PE datasets. Similarly, the R1 datasets exhibited higher numbers of unique class, order and family level taxa for all the amplicon regions, except V3V4 ( Supplementary Fig. 6). The results suggest that the SE datasets, especially the datasets comprising exclusively R1 reads, could infer more number of taxa as compared to PE datasets. The reason for observing a higher number of taxa in R1 datasets could be attributed to the sequencing depth. Earlier studies highlighted that the microbiome data is compositional in nature which means the observed microbiome structure is defined by the number of reads available in the dataset (Gloor et al. 2017;Susin et al. 2020). A higher sequencing depth could infer more taxa as compared to the dataset having low sequencing depth (Zaheer et al. 2018). In this study, the R1 datasets had a higher number of nonchimeric sequences. Thus, the R1 datasets yielded a higher number of taxa. These observations suggest that the R1 reads could provide more information on the microbiome including the rare taxa which require higher sequencing depth. However, some discrepancies in inferring the core microbiome were observed between SE and PE sequencing data which could be attributed to the shorter length of SE sequencing data which affects the phylogenetic resolution (Fuks et al. 2018;Johnson et al. 2019). Also, the microbiome composition observed for different amplicon regions vary which is not surprising as the original study from which the data was adopted herein, also illustrated the discrepancies on microbiome composition by different amplicon regions (Soriano-Lerma et al. 2020). Further, the results show that the diversity of core-microbiome inferred by V1V3 is comparatively lower than the diversity of bacteria inferred by other amplicon regions (Fig. 5). Thus, this study suggests that the hyper-variable regions such as V3V4, V4V5 and V6V8 could be preferred over V1V3 regions for environmental microbiome research. In summary, this study suggests that the SE sequencing data, especially the R1 reads, provide results comparable to PE datasets. In fact, the R1 datasets yielded more taxa as compared the PE datasets, irrespective of amplicon regions. Thus, the SE sequencing data analyses could be adopted as an alternative approach to infer microbiome composition as and when the quality of R2 Fig. 3 Distribution of observed ASVs based on SE and PE datasets for different amplicon regions. The figure shows that the observed ASVs based on R1 reads is higher than that of other datasets reads, generated by Illumina MiSeq platform, are low and significantly reduces the PE data and subsequently, affects the sample size of the study. However, the results obtained from SE sequencing datasets need careful interpretation as some of the taxa observed in PE datasets were absent in SE datasets.

Conclusions
The Illumina MiSeq platform often yields low quality R2 reads which affects the sample sequencing depth. In general, the samples with low sequencing depth need to be removed from the dataset to control the bias due to sequencing depth. Nonetheless, removing the samples could lead to spurious results and also may affect the study design. In this context, the researchers need to adopt an alternative approach to infer the microbiome structure to overcome the problems associated with low quality data. This study demonstrates that the SE sequencing datasets, especially the R1 reads, infers the core microbiome structure as comparable to the PE dataset. Conspicuously, the R1 datasets inferred a higher number of taxa as compared to PE datasets, irrespective of amplicon regions. Thus, analyses of SE sequencing data, especially R1 reads, in amplicon microbiome studies could ameliorate the problems related to low quality reverse reads in the dataset. However, care must be taken while interpreting the microbiome structure as the SE sequencing data could not resolve the taxonomy of some of the taxa. In conclusion, this study demonstrates the SE data analysis as one of the workflow that could be adopted in amplicon microbiome analyses to overcome sample elimination issues arising due to problematic reverse reads. Fig. 4 Comparison of core microbiome taxa (genus level) between SE and PE datasets. The results suggest the R1 datasets infer a higher number of taxa for most of the amplicon region, except V3V4

3
Consent for publication Not applicable.