Uncovering the Genetic Diversity of Giardia Isolates from Outbreaks in New Zealand

Background Giardia the most common causes of the and is a notiable disease in Recent advances in molecular techniques, such as PCR and Sanger sequencing, have greatly improved our understanding of the taxonomic classication and epidemiology of this parasite. However, there has been an inability to identify shared subtypes between samples from the same epidemiologically linked cases, due to samples showing multiple dominant subtypes within the same outbreak when characterised using Sanger sequencing. Methods Sanger sequencing and the diversity captured by NGS. It shows that even in genetically diverse samples, like the one from the outbreak in Hawke’s Bay in 2015, there are agreements between the Sanger sequence data and the NGS data. Analysis of the NGS data was conducted to probe the intra-sample diversity of these samples.


Introduction
Giardia is an enteric protozoan parasite with the distinction of being among the most common causes of diarrhoea in humans and farm animals worldwide (Cacciò and Sprong, 2011). Giardia infects the epithelial cells of the gastrointestinal tract causing self-limiting diarrhoea in all classes of vertebrates. This parasite transmits via the faecal-oral route, and in humans particularly, contact with contaminated water sources is the dominant mode of infection and cause of outbreaks.
Approximately 280 million people are infected with this parasite every year, and the prevalence of infections in humans ranges between 0.4-7.5% in high-income countries, and 8-30% in low-/middleincome countries (Feng and Xiao, 2011;Einarsson, Ma'ayeh and Svärd, 2016). Symptoms range in severity, however, the disease can be fatal in immunocompromised individuals and ranks among the leading causes of death in children under the age of 5 (Luján and Svärd, 2011). This is why giardiasis, the disease for which Giardia is the causative agent, was recognised by the Word Health Organisation (WHO) in its neglected diseases initiative, highlighting the public health signi cance of this parasite (Savioli, Smith and Thompson, 2006). Since then, reporting of this parasite has improved in many countries.
Exacerbating the burden of this disease is the lack of any effective vaccines against the pathogens.
At present eight species of Giardia are recognised, these are: G. agilis (associated with amphibians), G. ardeae (great blue herons), G. cricetidarum (hamsters), G. intestinalis (alternatively named G. duodenalis or G. lamblia, discussed below), G. microti (associated with voles and muskrats), G. muris (rodents), G. peramelis (southern brown bandicoots), and G. psittaci (found in budgerigars) (Ryan et al., 2019). According to current understanding, the species responsible for all human infections is G. intestinalis, which is further divided into eight assemblages (or subtypes): A-H. These assemblages can be further classi ed into sub-assemblages. Assemblages A and B are thought to be responsible for most zoonotic infections and cause the majority of infections in humans. However, as molecular techniques have advanced, evidence of infection by other assemblages has been identi ed in humans (Feng and Xiao, 2011). Assemblages A and B have a wide host range including humans, livestock, domestic and wild animals. The remaining assemblages have narrow known host ranges. Assemblages C and D are associated with dogs and other canids, assemblage E with livestock, assemblages F with cats, assemblage G with rodents, and assemblage H with seals (Feng and Xiao, 2011). Assemblage B is responsible for the majority of human cases of giardiasis in low-and high-income settings, including in New Zealand where this assemblage was identi ed in 79% of cases between 2009and 2015(Garcia-R et al., 2017a. The clinical effects of giardiasis vary among individuals, ranging from asymptomatic carriage to severe malabsorption syndrome in some acute cases (Luján and Svärd, 2011). However, the mechanisms underlying the differences in phenotypes within these diseases are poorly understood. Previous studies suggest that differences in infectivity exist between assemblages. Experimental observations found that human volunteers inoculated with assemblage B were more likely to succumb to infection and develop symptoms than those inoculated with assemblage A (Cacciò, Lalle and Svärd, 2017). Nevertheless, studies looking at the correlation between symptoms and assemblages have produced contradictory results. The ability to link phenotypic features with assemblages would greatly increase our understanding of transmission patterns. Similarly, investigation of the genetic structure at the population level is essential for the proper inference of the transmission patterns of Giardia and its epidemiology.
Outbreaks of giardiasis occur frequently each year across the world. Previous reviews found that between 2011 and 2017 over 140 waterborne outbreaks occurred globally (Efstratiou, Ongerth and Karanis, 2017). Outbreaks might be initiated through waterborne transmission but have the potential to spread further through human-human interaction (Katz et al., 2006). The true burden of this disease is potentially underestimated due to poor reporting in some countries. Giardiasis only became a noti able disease in the USA, Europe and New Zealand between the late 90s and early 2000s (Adam et al., 2016;Plutzer et al., 2018;Snel, Baker, & Venugopal, 2009). Increased surveillance of Giardia and other enteric parasites will give a better idea of the true burden of giardiasis globally. Surveillance data in New Zealand found that Giardia was responsible for 7.4% of total outbreaks in the country during 2016, with person-to-person contact being the most common mode of transmission (Institute of Environmental Science and Research Ltd (ESR), 2018). However, there has been an inability to identify the same subtypes of Giardia in epidemiologically linked cases in New Zealand. A patient with an infection may carry multiple subtypes of the same infectious agent and the outcome of the competitive interactions between them has an effect on the clinical presentation of the disease, which, in turn, affects the e cacy of treatment (Thompson & Smith, 2011). For this reason, understanding the within-host genetic diversity of a pathogen is essential for effective disease management.
Questions remain as to whether epidemiologically linked cases in New Zealand were all part of the same events or if they represent within-and between-host diversity (Garcia-R et al., 2017a). A possible reason for this could be a lack of resolution due to the standard detection methods used. Over the years the methods for the detection and classi cation of Giardia have progressed from microscopic analysis of physical characteristics to molecular tools such as PCR and Sanger sequencing of notable genes like bg, gdh, tpi and SSU rRNA genes. However, because Sanger sequencing combines the contribution of all DNA fragments present in the reaction mixture, even this may lack su cient resolution where mixed assemblages are present. PCR ampli cation of the gdh gene will amplify sequences from any Giardia assemblage that is present in the extracted DNA, which can lead to a mixed signal in the resulting Sanger sequence or failure to detect rare assemblage types. These limitations affect disease surveillance and make it di cult to capture within-host diversity. In contrast to Sanger sequencing, next-generation sequencing (NGS) techniques like amplicon-based sequencing allow millions of fragments to be sequenced in a single run allowing the researcher to separate the signal originating from each target molecule, thus allowing the e cient isolation, detection and quanti cation of rare types. In recent years, researchers have applied NGS techniques to study the epidemiology of cryptosporidiosis and giardiasis, which has led to great advances in the understanding of these infectious diseases (Ortega-Pierres et al., 2018).
In this study, NGS techniques are used to gain a better understanding of the genetic diversity of giardiasis outbreaks in New Zealand. Taking human faecal samples from three outbreaks of giardiasis that occurred between 2010 and 2017 in various regions across the country and some samples from routine surveillance, and utilising amplicon-based metabarcoding at the glutamate dehydrogenase (gdh) locus, the hypothesis that epidemiologically linked cases share subtypes undetectable with consensus sequencing technologies was tested. In addition, NGS was used to detect the degree of genetic diversity (strictly, richness) present in samples from patients diagnosed with giardiasis. Comparing these results to the results of Sanger sequencing at the same locus it was possible to detect the presence of mixed infections and gained a better understanding of the assemblages of Giardia present in New Zealand. This study shows that amplicon-based sequencing provides better tools for painting a clearer picture of the role of protozoan genetic diversity in giardiasis outbreaks in New Zealand, which could lead to a better perception of protozoan outbreak epidemiology.

Sampling
The Protozoa Research Unit (PRU) at the Hopkirk Institute, Palmerston North, New Zealand, receives human faecal samples diagnosed as positive by accredited diagnostic laboratories from routine surveillance and outbreaks of giardiasis in New Zealand. All the samples included in this study, both from outbreaks and routine surveillance, were from patients diagnosed with giardiasis. A list of the samples from routine surveillance and outbreaks of giardiasis that occurred in New Zealand between 2010 and 2018 can be found in Table 1.

PCR
A partial fragment of the glutamate dehydrogenase (gdh) gene was ampli ed by nested PCR using a previously established PCR programme and set of primers (Read, Monis and Thompson, 2004). The external primers were modi ed to contain MiSeq™ adapter sequences on the 5' end according to standard protocols (Illumina Inc., 2013). Agarose gel electrophoresis was used to verify the presence of fragments of the correct size (432 bp) from all the PCR reactions. A blank containing deionised H 2 O was used as a negative control, and DNA from a sample that had already been veri ed by PCR and Sanger sequencing as containing Giardia DNA was used as a positive control.

Next-generation sequencing (NGS)
The PCR products for all 17 samples were cleaned according to Illumina recommended protocols (Illumina Inc., 2013). The DNA concentration in each sample was measured using a NanoDrop 2000 spectrophotometer (Thermo Fisher Scienti c, Waltham, Massachusetts, United States), the samples diluted to a 5 ng/µl concentration according to the Illumina protocol referenced above then delivered to the Massey Genome Service (Massey University, Palmerston North, New Zealand) for library preparation and amplicon-based sequencing. Sequencing was carried out on an Illumina MiSeq™ using 500-cycle V2 chemistry according to the manufacturer's recommendations, producing 2 × 250 base paired-end reads. Due to the potential uneven representation of bases at each cycle with amplicon sequencing, an Illumina PhiX control library was loaded onto the Illumina MiSeq™ run at 20% volume, to even out the base composition and prevent biases in the initial few cycles that otherwise would result in base calling errors.

Construction of a gdh database
Through our collaboration with the New Zealand Ministry of Health the PRU receives anonymised faecal samples from patients diagnosed with giardiasis. The samples are analysed through PCR and Sanger sequencing at the gdh locus. The assembly of sequences and compilation of databases was done using Geneious v.10.2.6 (Kearse et al., 2012). Using the Giardia intestinalis sequences from our in-house database a separate database was compiled consisting of 858 unique gdh sequences, most had previously been submitted to GenBank by Garcia-R et al. (2017) and can be found in GenBank with accession numbers MT265681 -MT265802. To capture the greatest possible extent of known diversity of Giardia gdh sequences, a dataset of all available gdh sequences for G. intestinalis from GenBank (Benson et al., 2013) was extracted and imported into Geneious. The search strategy employed one search string (Giardia) and included the keywords glutamate dehydrogenase, and gdh. The sequences were trimmed to the length of the primers employed in this study and all sequences less than 393 bp were discarded. This left 337 unique sequences from GenBank. The 337 GenBank sequences were combined with the 858 sequences extracted from our in-house database, then duplicate sequences were extracted to create a collection of 1109 unique sequences covering most of the assemblages of G. intestinalis that have been characterised at the gdh locus.

Sequence processing
The Illumina sequence reads for the 17 samples involved in this study were analysed inside the Quantitative Insights Into Microbial Ecology 2 (QIIME 2) environment (Bolyen et al., 2019). The dada2 methodology (Callahan et al., 2016) was used to lter and trim the forward and reverse sequence reads, dereplicate them, calculate and plot error rates, merge paired reads and construct a sequence table, and remove chimeras. Then our database of 1109 known unique sequences was used as a reference to assign taxonomy to the merged sequences. To remove the impact of index hopping or PCR error, from the processed and merged sequences only the top 1971 sequences, based on the expected sequence length, were imported from dada2 into the phyloseq R package (McMurdie and Holmes, 2013) for plotting, ranking of the most expressed sequences and creation of a heatmap. The resulting table of sequences was run against the reference database to exclude any sequences that did not match known sequences of G. intestinalis then put through phyloseq again for further analysis. Only the top 50 sequences present across all the samples were used for the creation of bar plots and heatmaps to reduce the possibility of sequencing errors being included in the analysis.

Overview of sample data
Of the 17 historical faecal samples from cases of giardiasis that had occurred in New Zealand between 2010 and 2018 fragments of the gdh gene were successfully ampli ed for all of them using nested PCR. The samples for which the assemblage according to Sanger sequencing were known and the most dominant assemblage according to NGS are shown in Table 2. There were no disagreements in assigned dominant assemblage between the two sequencing methods. According to the NGS data, and focusing on the dominant assemblage in each sample, 11/17 samples were found to belong to sub-assemblage BIV, 1/17 to BIII, 2/17 to AII, 1/17 to AIII, 2/17 to E. Fig. 1 provides a comparison of the assemblage assigned by Sanger sequencing and the diversity captured by NGS. It shows that even in genetically diverse samples, like the one from the outbreak in Hawke's Bay in 2015, there are agreements between the Sanger sequence data and the NGS data. Analysis of the NGS data was conducted to probe the intrasample diversity of these samples.

Metabarcoding analysis
The diversity of assemblages found in each sample after processing and analysis of the NGS reads are shown in Fig. 2 Three samples from routine surveillance were included in this study (see Table 1) to compare the genetic diversity between samples from outbreaks and samples from sporadic cases. No signi cant differences were observed. Two samples (11359_S16 & 14201_S15) from routine surveillance represented the rst report of sub-assemblage AIII and assemblage E in human samples from the South Island in New Zealand. These samples were analysed further in another study (Garcia-R et al., 2021).

Identifying links between outbreak cases
The primary aim of this study was to utilise NGS to detect a genetic link between epidemiologically linked cases. To this end, an analysis of the outbreaks that occurred in Gisborne in 2014 and Hawke's Bay in 2015 was conducted. These outbreaks were selected based on the fact that although the samples within each outbreak were epidemiologically linked, according to Sanger sequence data the samples did not share the same dominant genotype.
Of the 5 samples from the outbreak that occurred in Gisborne in 2014, 4/5 were characterised as subassemblage BIV and 1/5 as AII according to Sanger and NGS data. Fig. 3A is a heatmap showing the genetic diversity, captured by NGS, within the samples involved in this outbreak. From this, it is evident that a single variant of sub-assemblage BIV is shared by all the samples in this outbreak. Also, a copy of sub-assemblage AIII is present in one of the samples from this outbreak (10049_S8).
Of the 5 samples received from the outbreak in Hawke's Bay in 2015, 3/5 were identi ed as subassemblage BIV, 1/5 as BIII, and 1/5 as AII according to Sanger and NGS data. From the heatmap shown in Fig. 3B it is evident that, despite the differences in dominant assemblages, sub-assemblage BIV is shared between all the samples from this outbreak. Sample 13273_S14 represented the only sample from an outbreak of cryptosporidiosis in Auckland. According to the NGS data this sample was also positive for G. intestinalis sub-assemblage BIV (Fig. 2). This represents an example of a mixed-species infection. The NGS abundance data for the rest of the outbreaks is available in Supplementary Figure 1.

Discussion
This investigation into the intra-sample diversity of G. intestinalis in patients from historical outbreaks of giardiasis in New Zealand compares the capabilities of NGS and Sanger sequencing technologies. The strength of Sanger sequencing lies in its ability to detect the dominant sequence within a sample. The results outlined here show that NGS is also capable of the same level of discernment with regards to the identi cation of dominant sequences, shown by the agreements between the data from Sanger sequence and NGS of samples from cases of giardiasis that occurred in New Zealand between 2010 and 2018. The aim of this study was to use NGS to capture the diversity within samples, and this is where the bene t of NGS over Sanger shows itself. NGS is capable of sequencing multiple reads in each sample, compared to the one consensus read per sample achieved with consensus sequencing technologies. Here, NGS was employed to uncover the genetic diversity present within cases of giardiasis in this country. As stated in the introduction, the clinical manifestation of giardiasis can differ between individuals. Further work needs to be done to ascertain if there is a link between (sub)assemblage and clinical presentation, however recent advances in the in vitro culture of Giardia (Liu et al., 2020) have the potential to address this. So, the ability to capture the genetic diversity within samples from cases of giardiasis and link them to the symptoms displayed by the patient could greatly advance our understanding of the disease mechanisms of this parasite.
The data presented here suggest that assemblage B is still the most common assemblage of Giardia in New Zealand, present in 16/17 samples. However, the ability to capture the diversity of assemblages within samples showed that, although they might not be dominant, assemblages A and E reported with increased frequency in New Zealand, as evidenced by their presence in 7/17 and 4/17 samples respectively. This is particularly signi cant since assemblage E was thought to be exclusively infectious to livestock. However, recent studies have shown that it is increasingly present in humans as well (Abdel-Moein and Saeed, 2016). The signi cance of this nding is discussed in a study published by our research group (Garcia-R et al., 2021).
The subtyping in this study was carried out at only the gdh locus. This presents a potential limitation since other studies have shown that sequencing typing at different loci can result in assignation of multiple subtypes (Feng and Xiao, 2011;Brynildsrud et al., 2018) and is why more recent studies utilise multi locus sequence typing (MLST) (Seabolt, Konstantinidis and Roellig, 2021). However, this study sought to compared data from NGS to samples that had previously been characterised by Sanger sequencing at the gdh locus. For this reason, metabarcoding at the same locus was considered appropriate for this study. No no-template controls or DNA extraction reagent blanks were included in the library prep for NGS. These are usually used as an indication of the level of lane-hopping or environmental contamination present in the sequenced samples. Here, the use of nested PCR resulted in the ampli cation of speci cally the Giardia DNA at the speci c locus analysed in this study. In addition, while index hopping might be present it is usually between 0.1 to 1% on the Illumina MiSeq platform (Sinha et al., 2017;Hornung, Zwittink and Kuijper, 2019;England and Harbison, 2020), NGS sequencing in this study produced millions of reads and the low quality and abundance reads were removed from the study. Furthermore, only the top 50 sequences were used when analysing the diversity across all samples and within each outbreak. Also, each outbreak had a different pattern of amplicons, generally with different dominant subtypes, which suggests there was little cross-contamination present. Another limitation was the low number of samples from outbreaks of giardiasis. This was due to the fact that only a subset of samples from outbreaks that occurred in New Zealand between 2010 and 2018 are sent to our laboratory for molecular characterisation.
Another aim of this study was to use NGS to uncover genetic links between epidemiologically linked samples. It was hypothesised that epidemiologically linked cases share assemblages undetectable with consensus sequencing technologies. The outbreaks that occurred in Gisborne in 2014 and Hawke's Bay in 2015 provided a perfect case study for this. In those outbreaks there were multiple dominant assemblages present in the samples within each outbreak. By applying NGS and metabarcoding at the gdh locus it was shown that sub-assemblage BIV was shared between all samples from the Gisborne and Hawke's Bay outbreaks, thereby verifying the hypothesis. This improves our understanding of the epidemiology of these outbreaks.

Conclusions
In conclusion, this study highlights the importance of utilising NGS technologies to uncover the genetic diversity of Giardia in humans to gain a better understanding of the risk factors associated with the disease. Out of 17 samples, 13 showed the presence of multiple variants of Giardia. This suggests that labelling a human sample using consensus sequencing technologies as belonging to one assemblage is insu cient and does not capture the true genetic diversity that can exist in one individual. In addition, these results suggest that Giardia frequently invades humans as part of a mixed infection. Further work needs to be carried out to ascertain the relative contribution of each assemblage to the disease phenotype. This will give us a better understanding of the disease mechanisms of the parasite and create a clearer epidemiological picture that will inform public health services in the development of better strategies to combat this persistent and prevalent parasite by allowing them to properly pinpoint all potential sources of infections and disrupt transmission pathways. The taxonomic distribution of (sub) assemblages in samples from the routine surveillance and the multiple outbreaks included in this study. The X axis shows the (sub) assemblage of each sample according to Sanger sequence data and the Y axis displays the number of samples corresponding to each assemblage; the colour codes in each bar represent the genetic diversity within each sample according to NGS.

Figure 2
Heatmap showing the relative abundance of the top 50 sequences in each sample. The multiple variants of each assemblage present in each sample are displayed on the Y axis.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download. Appendix.docx