ddRADseq-mediated detection of genetic variants in sugarcane

The article presents an optimization of the key parameters for the identification of SNPs in sugarcane using a GBS protocol based on two Illumina NextSeq and NovaSeq platforms. Sugarcane (Saccharum sp.), a world-wide known feedstock for sugar production, bioethanol, and energy, has an extremely complex genome, being highly polyploid and aneuploid. A double-digestion restriction site-associated DNA sequencing protocol (ddRADseq) was tested in four commercial sugarcane hybrids and one high-fibre biotype for the detection of single nucleotide polymorphisms (SNPs). In this work we tested two Illumina sequencing platforms, read size (70 vs. 150 bp), different sequencing coverage per individual (medium and high coverage), and single-reads versus paired-end reads. We also explored different variant calling strategies (with and without reference genome) and filtering schemes [combining two minor allele frequencies (MAFs) with three depth of coverage thresholds]. For the discovery of a large number of novel SNPs in sugarcane, we recommend longer size and paired-end reads, medium sequencing coverage per individual and Illumina platform NovaSeq6000 for a cost-effective approach, and filter parameters of lower MAF and higher depth coverages thresholds. Although the de novo analysis retrieved more SNPs, the reference-based method allows downstream characterization of variants. For the two best performing matrices, the number of SNPs per chromosome correlated positively with chromosome length, demonstrating the presence of variants throughout the genome. Multivariate comparisons, with both matrices, showed closer relationships among commercial hybrids than with the high-fibre biotype. Functional analysis of the SNPs demonstrated that more than half of them landed within regulatory regions, whereas the other half affected coding, intergenic and intronic regions. Allelic distances values were lower than 0.07 when analysing two replicated genotypes, confirming the protocol robustness.


Introduction
Sugarcane (Saccharum sp.) is a major economic crop in the world, with a global production of about 166.18 million tons in the 2019/2020 agricultural season (Walton 2020 In Argentina, it is the most important industrial crop with 381,138 ha cultivated and the main source of sugar (INDEC 2018). It is grown in the northwestern provinces, mainly in Tucumán (225,480 ha), but also in Jujuy (113,241 ha) and Salta (34,233 ha), according to the records of the 2018 National Agricultural Census (INDEC 2018). There are a few hectares in the east of the country, in the provinces of Santa Fe (2330 ha) and Misiones (2270 ha) (INDEC 2018). The production of first-generation bioethanol (1G) through the fermentation of juices or molasses is becoming increasingly important. In fact, the national production in 2018/2019 was 16.8 million tons of sugar and 4.7 million tons of 1G bioethanol (García et al. 2020;Benedetti 2018). Sugar-based bioethanol production as a second-generation (2G) biofuel has attracted interest as a renewable energy and represents a promising alternative to conventional fuels: twenty-eight percent bagasse is produced per ton of sugar, which can be used to generate heat and electricity (Bottcher et al. 2013). Although there are only 1G biorefineries in Argentina, the prospects for the introduction of 2G industries are promising.
This complexity directly affects the rearrangement of chromosomes during meiosis and generates additional variation through the gain or loss of genetic material (Glyn et al. 2004;Vieira et al. 2018). This high genetic variability can certainly be used to develop new genotypes with improved industrial traits, such as increasing sugar content and improving digestibility of sugarcane bagasse. However, conventional breeding requires years of testing, both in greenhouses and in experimental fields before a new hybrid can be released. Moreover, breeding programs usually evaluate morphological and agronomic traits, which are prone to error given the environmental influence on vegetative traits (Medeiros et al. 2020). In this context, the use of molecular assisted breeding would benefit crop improvement, not only by shortening the breeding cycle, but also by improving genetic knowledge of accessions used for crosses.
Several years ago, the initial study of plant genomes by low-throughput molecular markers, such as RFLP, RAPD, AFLP, SSR and EST, provided some insights into the organisation and content of genomes, as well as the functionality of genes and noncoding DNA and their precise location (Thirugnanasambandam et al. 2018). These markers are characterized by their accuracy, reproducibility, and in some cases neutrality. Examples in sugarcane involve projects that have identified the association of several markers such as RFLP, AFLP, SSR and PCR markers to the rustresistant gene (Bru1) to generate genetic maps around the gene or use them to evaluate the gene presence (Asnaghi et al. 2004;Costet et al. 2012;Balsalobre et al. 2017;Yang et al. 2018). However, the use of low-throughput markers in large genomic studies, results in very low coverage, which compromises the scope of studies (Fickett et al. 2018). With the advent of next generation sequencing (NGS) technologies, more efficient molecular markers have been developed, such as single nucleotide polymorphisms (SNPs), which can be easily automated and provide whole genome coverage. SNPs have a significant advantage over other types of markers in polyploids such as sugarcane because they are codominant, widespread throughout the genome, and can determine the number of copies of each allele at a given polymorphic locus (gene dosage) throughout subgenomes (more than two chromosomes per homology group) (Garcia et al. 2013). Gene dosage is in turn thought to be involved in variation in gene expression and thus phenotypic variation caused by the relative abundance of each allele (Garcia et al. 2013). A comprehensive review of SNP and other molecular markers was recently published by Aitken (2021). Implementation of SNPs and phenotypic data helps identify correlations between these variants and quantitative trait loci, associations that can be potentially useful for future sugarcane breeding programs (You et al. 2019;Yadav et al. 2021;Mahadevaiah et al. 2021).
Double-digest restriction site-associated DNA sequencing (ddRADseq), is one of the many technologies included in the so-called genotyping by sequencing (GBS; Peterson et al. 2012). This protocol involves the use of two different restriction enzymes and the selection of a specific fragment size. This method can identify and simultaneously genotype thousands of loci distributed throughout the genome, providing a genomic perspective to the genotypes analysed. It also provides high reproducibility, large loci depth coverage, and selection of effective SNPs distributed across the genome (Aguirre et al. 2019). For polyploid crops, detecting true SNPs is more challenging than for diploid crops 1 3 due to subgenomes within genotypes. However, many tools developed for diploids can also be applied to allopolyploids because they behave similarly to diploid species (Bourke et al. 2018). Nevertheless, quality control and filtering are important steps in bioinformatics analysis to detect false positive SNPs.
Herein we present an adaptation of the ddRADseq protocol, originally developed by Peterson et al. (2012) and optimized for non-model diploid species by Aguirre et al. (2019), to apply it to a highly polyploid species to identify genomic variants for breeding applications in modern and cultivated Argentine sugarcane hybrids. Our working hypothesis is that for ddRADseq protocols applied to complex genomes, SNPs recovery is affected by read size, number of reads per individual, and single-reads versus paired-end reads. In this article, we tested sequences from two Illumina sequencing platforms, with a reduced number of samples to evaluate the above conditions. In addition, we discussed various bioinformatics filtering schemes (FSs) to determine the application of parameters that balance the rate of missing data with the accuracy of detected SNP.

ddRADseq libraries
Leaves from three replicates per sugarcane accession were used for isolation and purification of high-quality DNA using an EasyPure Plant Genomic DNA Kit (TransGen Biotech). DNA quality was checked using Nanodrop (Thermo Fisher Scientific, Waltham, MA, USA) and agarose gel electrophoresis analysis. DNA was quantified using a Qubit 2.0 fluorometer (Thermo Fisher Scientific). The ddRADseq protocol 1 optimised for Eucalyptus dunnii (Aguirre et al. 2019) was modified for sugarcane by using the PstI/ MboI combination in the digestion step. DNA fragment size of the final library, ranged from 400 to 600 bp, representing 260-460 + 140 bp of barcodes, based on Fragment Analyzer assessment (Agilent Technologies, Santa Clara, USA

Efficiency of read lengths, depth sequencing coverage and single-reads vs. paired-end sequencing
To assess the efficiency of the sequencing platforms in terms of read lengths, depth of sequencing coverage and single-reads vs. paired-end sequencing: (a) reads obtained in No-150 were shortened to 70 bp (while maintaining the high sequencing coverage and paired-end, hereafter referred to as No-70-PE/HC) and compared to No-150. To evaluate the robustness of the protocol, a new library of samples NA 78-724 and LCP 85-384 was created from a new DNA extraction using the protocol described above and sequenced using the Illumina NovaSeq6000 platform [150 bp paired-end reads; ~ 4 million reads, referred to as medium sequencing coverage (MC)]. In addition, a random subsampling of No-150 reads was performed to obtain a medium sequencing coverage dataset for samples NA 78-724 and LCP 85-384. In this way, two data sets were created in independent experiments, which are replicates for the assessment of the protocol robustness.

Generation of sugarcane SNPs matrices
The bioinformatics workflow used to analyse the raw data is shown in Fig. 1. Quality was checked visually using FastQC (Andrews 2010) and summarised with MultiQC (Ewels et al. 2016). The program "process_radtags" of Stacks v 1.42 package (Catchen et al. 2011(Catchen et al. , 2013 was used for quality cleaning. Those reads whose average quality within a window fell below a Phred score of 10 (Ewing and Green 1998), or have un-called nucleotides, adapter sequences or deficient restriction enzyme cut sites were discarded (Catchen et al. 2011). Two strategies were followed for SNP calling: (1) reference guided and (2) de novo (Fig. 1).
For the first strategy, the cleaned reads were mapped with Bowtie2 under default parameters (Langmead and Salzberg 2012) against the genome of monoploid sugarcane cultivar R570 (Garsmeur et al. 2018) which was used as a reference, SNP identification was performed using the "ref_map.pl" pipeline of Stacks v 1.42 package (default parameters). In the second strategy, the "denovo_map.pl" pipeline of Stacks was used for SNP discovery. Following both strategies, the corresponding raw SNP matrices were obtained under variant call format (VCF) using the "Population" program implemented in Stacks. The raw matrices statistics were calculated to determine subsequent filtering thresholds.

Clearing SNP matrices
Filtering of markers was performed using VCFtools (Danecek et al. 2011). An initial and global filter of 0% missing data was applied to all data sets to ensure good and reliable data. MAF > 0.1, commonly applied to diploids, was used as a conservative threshold, while a minimum MAF of 0.3 was used as recommended for allopolyploids (Clevenger and Ozias-Akins 2015). Three different depth coverage ranges with minimum and maximum thresholds were investigated: (1) 10x-36x: being 10× the usual recommendation for diploids (high confidence base-call) and 36× as maximum threshold (Ravinet and Meier 2020).
Six FS resulted from the combination of different parameters of MAF and read depth (RD, Fig. 1).

Hierarchical cluster analysis and principal component analysis
The R package vcfR was used to introduce VCF files into the R environment for further analysis (Knaus and Grünwald 2017; R Core Team 2019). SNP density per chromosome for Ne-70 with FS1 and No-150 with FS5 was plotted using the R package CMplot (Yin et al. 2021). Correlation analysis of chromosome length and number of SNPs per chromosome was performed using "cor" and "cor.mtest" functions (corrplot package) (Wei and Simko 2021). Correlation coefficients were considered significant if the p-value was < 0.05 (*). Allele sharing distances between samples of Ne-70 with FS1 and No-150 with FS5 were calculated using the R package pegas (Paradis 2010) and plotted using the R Stats package (R Core Team 2019). The agglomeration method used was the "average" as it had the highest cophenetic correlation (best clustering model for both datasets) (Borcard et al. 2018). Principal component analysis (PCA) on the same matrices were calculated using TASSEL and plotted with the R package ggplot2 to analyse the full set of SNPs of each panel, to represent the relationships among samples (Bradbury et al. 2007;Wickham 2016).

SNP effect prediction
The variant effect predictor (VEP) from the Ensembl Plants (version release 51) was used to predict the functional effect of SNP of Ne-70 with FS1 and No-150 with FS5, following the instructions described in Ensembl/plant-scripts: Scripting analyses of genomes in Ensembl Plants (Yates et al. 2022), with the reference genome and its annotation, in FASTA and gff3 formats.

Evaluation of the robustness of the protocol
The protocol was evaluated by comparing two replicates of the commercial hybrids NA 78-724 and LCP 85-384. The variant calling workflow was performed as described above. The VCF file was filtered for missing data and three different depth coverage range thresholds were tested (a RD = 10x-36x; b RD = 30x-60x; and c RD = 56x-200x). Using the SNPs panels, the allele sharing distances and their cophenetic matrices between samples were calculated as described above.

Results
The ddRADseq protocol implemented in this work allowed the generation of a library with pooled samples with good quality and quantity. The restriction enzymes chosen, PstI/ MboI, were a successful combination for the digestion step and the manual size selection was able to recover the size of the DNA fragments of interest (Fig. S1). To investigate the performance of the ddRADseq protocol implemented for allopolyploid sugarcane genome we compared the results obtained using two sequencing platforms, considering the information content obtained, the best cost-benefit ratio, and different FSs. Furthermore, the library accomplished the performance expectancy of the Illumina platforms (Fig. S2).
The number of reads and their alignment to the reference genome for each sample, in Ne-70 and No-150 platforms are shown in Table 1. The number of reads with the No-150 was on average more than 7 times-fold the number detected with the Ne-70 (Table 1; Table S1), exceeding the expected 2.5:1 ratio. Because the same library was used in both assays, differential ratios could probably be due to sample nature, sequencer loading and clustering in flow cell. Samples NA 78-724 and NA 56-30 accounted for the highest and the lowest number of reads, respectively, in both data sets ( Table 1). The overall alignment rate of the sequences against the monoploid sugarcane genome was higher in Ne-70 than in No-150. These rates, which were similar across samples in the former platform, averaged 54%, and were in contrast with the rates of the latter platform (~ 42%), with the exception of sample NA 78-724, which reported 50% (Table 1).

Sugarcane SNP matrices obtained with Stacks program
Raw SNP data were obtained from the five sugarcane genotypes sequenced in each Illumina platform  and two SNP calling analyses (reference guided and de novo). In the former, 188,199 SNPs for the de novo strategy (denovo_map.pl) and 86,051 SNPs in the reference-based (ref_map.pl) were recovered, while 2,831,922 SNPs for the denovo_map.pl and 460,890 SNPs in the ref_map.pl were identified in the latter ( Table 2). The denovo_map.pl strategy allowed identifying an amount of SNPs 2.1 and 6.1 times larger than ref_map.pl in Ne-70 and No-150, respectively. Additionally, 15.0 times more SNPs with denovo_map.pl were recovered in No-150 compared with Ne-70, whereas 5.4 times more SNPs were recovered with ref_map.pl strategy. After

Clearing SNP matrices
The raw SNP matrices generated from Ne-70 and No-150 datasets, plus the three simulations, were filtered for quality, missing data, RD, and MAF, as shown in Fig. 1. Sixty SNP panels were generated from six FSs applied. Ne-70, No-70-PE/MC and No-150-PE/MC matrices showed a decreasing trend of SNPs from FS1 to FS6 in both strategies (denovo_map.pl and ref_map.pl), indicating that the SNPs retained decreased with RD as fewer SNPs fulfilled the restrictive conditions imposed by RDs 56x-200x (Table 2). No-150 showed a similar SNP recovery behaviour as Ne-70 in denovo_map.pl, but No-150 had the highest SNP recovery values in ref_map.pl with RD 56x-200x (FS5 and FS6) ( Table 2). This trend was also observed in the No-70-PE/HC simulation, but in this matrix the number of SNPs was lower due to the smaller fragment size. At this depth, single-dose alleles could be recovered at a heterozygous locus, while at a shallower depth they could be lost. SNPs retained within each RD were higher with MAF > 0.1 for both platforms and SNP calling. Since the application of FS1 in Ne-70 and FS5 in No-150 retained the highest number of SNPs, both FSs were selected for further comparisons. SNP density was plotted to visualise the distribution of SNPs along each chromosome (Fig. 2). Raw data from Ne-70 showed regions with more than 289 SNPs spread throughout the whole set of chromosomes, with distal regions showing relatively high SNP density, i.e., more than 181 SNPs (Fig. 2a). FS1 application upon the Ne-70 platform diminished dramatically SNP density in the entire set of chromosomes (Fig. 2b), being this reduction consistent with the differential number of SNPs retained in FS1 compared to raw data (Table 2). Raw data from No-150 displayed regions with more than 1621 SNPs spread throughout the whole set of chromosomes (Fig. 2c). FS5 application upon the No-150 platform revealed a lower distribution of SNPs in pericentromeric regions compared to distal ones (Fig. 2d).

Hierarchical cluster analysis and principal component analysis
Dataset from ref_map.pl filtered matrices responsible for the highest numbers of SNPs Fig. 2b,d) were analysed according to two different multivariate methods, one based on clustering whereas the other on ordination on reduced space. Cluster analysis of the ref_map.pl filtered matrices (Table S3) resulted in two dendrograms that showed distinct clustering arrangements (Fig. 3a, b). A single-member-cluster was composed of INTA 05-3116 (high fibre biotype) in both dendrograms (Fig. 3a, b). A second cluster, formed by four commercial hybrids: NA 78-724, LCP 85-384, HOCP 92-665, and NA 56-30, was also detected in both dendrograms, although different distances were determined among the hybrids in each dendrogram (Fig. 3a, b). PCA allowed the graphical representations of the relationships among the five sugarcane genotypes (Fig. 3c, d).
In the PCA plot, a summary of total variation of the SNPs presented by the first (PC1) and second (PC2) principal components explained 56% of total variance in both plots. PC1 contributed to 32% of total variance and distinguished two groups of genotypes (encircled in solid line). Group 1 was composed of INTA 05-3116, whereas group 2 was composed of the four commercial hybrids. PC2 accounted for 24% of total variance separating LCP 85-384, NA 56-30,  (Fig. 3d). Taken together, hierarchical cluster analysis and PCA revealed similar patterns of genotype clustering and ordination on reduced space.

Robustness of the protocol
The four samples, two replicates of each commercial hybrid (NA 78-724 and LCP 85-384), yielded 77,918 unfiltered SNPs. After filtering the missing data, 22,744 variants were retained. The three different depth coverage thresholds tested yielded three SNPs matrices: a RD = 10x-36x, where 14,024 SNPs were found; b RD = 30x-60x, where 5137 SNPs were found; and c RD = 56x-200x, where 2465 SNPs were found. These SNPs panels were used to calculate the allele sharing distances between samples (Fig. 5). In all cases replicates  Table S5). At mid depth coverages, the distance value was 0.153 for NA 78-724 and 0.112 for LCP 85-384 ( Fig. 5b; Table S5). At the highest depth coverages, lower distance values of 0.066 for NA 78-724 and 0.071 for LCP 85-384 were obtained ( Fig. 5c; Table S5).

Discussion
Sugarcane genomics analysis faces major hassles due to the gain and loss of genetic material tightly associated with the highly polyploid and aneuploid nature of the crop. Genotyping sugarcane is not only complex because of the multiple number of allele copies, but also because of the aneuploidy and unknown ploidy of some hybrids (Serang et al. 2012). Furthermore, in polyploids the falsepositive rate of SNPs detection rises due to ambiguous mapping of highly similar homeologous loci, becoming challenging to the precise SNPs detection (Clevenger et al. 2015). GBS has been proved to be a successful way to analyse sugarcane (Balsalobre et al. 2017;Yang et al. 2017), however, to our knowledge ddRADseq approach has never been applied to this crop before. Our simulations and comparisons showed that, regardless of the complexity of genotyping sugarcane hybrids, there is consistency in our data results obtained by the application of the ddRADseq protocol.
Current NGS platforms generate data for analysing genes and genomes, through sequencing by synthesis chemistry (López de Heredia 2016). Differences between platforms rely on the output range, reads per run, maximum read length and costs (Illumina 2021). For this reason, the comparison between sequencing platforms is crucial to decide how to invest available resources. Our results showed that the number of reads assigned to each genotype followed the same order in Ne-70 and No-150 platforms, although the final number of reads in the latter one was on average several times the number detected in the former one (Table 1). It is worth noting that NA 78-724 and NA 56-30, two Argentinean hybrids coming from the same sugarcane breeding program, stood for extreme numbers of reads in both sequencers, suggesting a consistency in sequencer performance among samples.
The overall alignment rate of both platforms averaged ~ 48%. Several non-mutually exclusive phenomena . Three different RD coverage thresholds were tested a RD = 10x-36x, b RD = 30x-60x, and c RD = 56x-200x might be responsible for this value: first, the hybrid R570 reference genome assembly originated from synteny with sorghum, using BACs based on their global alignments with this related species and representing the gene-rich part of its genome (Garsmeur et al. 2018). This indicates that fewer conservative regions are unlikely to align with other sugarcane hybrids. Second, R570 is a cultivar from Reunion Island, located in the Indian Ocean, suggesting that it might be genetically distant from the Argentine and American genomes assayed in this investigation (Acevedo et al. 2017;Garsmeur et al. 2018); then the alignment software was developed for diploid species, representing a technical drawback for a highly polyploid and aneuploid species such as sugarcane along with computational problems. These observations increase the difficulty in locating the reads in the exact position of the reference genome even though Yang et al. (2017) reported that Bowtie 2 was more effective than other aligners in sugarcane. Interestingly, the high fibre biotype and the sugarcane cultivars showed similar alignment rates, independently of the platform used (Table 1).
Stacks software provides two types of analyses for SNP recovery: de novo and reference-based (Rochette and Catchen 2017). As expected, the de novo analysis allowed a higher rate of SNPs detection than the reference-based analysis in both platforms, considering the raw data, since the latter pipeline only considered the reads that mapped against the reference genome. However, the referenced matrices give us information about the SNP position in the chromosome, so more information could be obtained. In addition, as we will further discuss later, quality filters are essential for finding reliable SNPs.
Simulations applied to further unravel the role of the read length, sequencing coverage and paired-end versus singlereads impact on the number of called SNPs have demonstrated that the SNPs are not evenly distributed along the read length. As expected, more SNPs were recovered in the longest reads  than in the shortest one (No-70-PE/ HC) for both mapping strategies, considering that the probability of finding a variant increases with read length, along with the accuracy of mapped locus. Moreover, Manimekalai et al (2020) recommend paired-end reads from at least 150 bp to distinguish homeologous SNPs from allelic SNPs in alloploids for high-quality SNP discovery. Additionally, the number of SNPs is not directly proportional to the number of reads per individual. In relative terms, the amount of SNPs recovered in the raw data in 585 SNPs in ref_map.pl and 363,919 SNPs in denovo_map. pl) exceeded half the amount recovered in 864 SNPs in ref_map.pl and 767,203 SNPs in denovo_ map.pl), suggesting that assigning a medium sequencing coverage (4 million reads/individual) is still adequate to maintain a cost-effective balance for applying ddRADseq to a larger number of samples.
The lack of information on the application of the ddRADseq protocol in sugarcane moved us to test six FSs composed of a combination of different MAF and RDs to select the most suitable filtering parameters values for improving SNP calling. The inclusion of minimum and maximum depth filters was recommended for depth coverage (Ravinet and Meier 2020). While a minimum depth cut-off removes false positive calls and ensures higher quality calls, a maximum cut-off tends to avoid paralogous or repetitive regions (Ravinet and Meier 2020) that have been recently reported in the sugarcane genome (Trujillo-Montenegro et al. 2021). There was a trend that with medium sequencing coverage, as the RD value increased, the number of SNPs recovered decreases. With high sequencing coverage, this trend was reversed. The combination of the same MAF with increasing minimum and maximum depths cut-offs diminished the SNPs recovered in both Ne-70 mapping strategies and in  pl the combination of the same MAF with increasing RDs augmented the SNP call. This could be related to paralog regions coming from the duplication events or polyploid nature where we can find several copies of some regions (or genes) within the same genome that belong to different ancestors (subgenomes). In addition, in highly polyploid genomes, such as sugarcane which can achieve a ploidy level of 12, 56× was the estimated sequence depth for single dose markers to be called, while in lower depth coverage, a single dose SNP of heterozygous genotype could be called as homozygous genotype (Yang et al. 2017).
According to our results, PE reads and longer reads (comparing No-150 with No-70-PE/MC, No-70-PE/HC, No-150-PE/HC and Ne-70) yield higher number of SNPs. Even though high sequencing coverage is ideal for polyploids, in particular for sugarcane, medium sequencing coverage (around 4 million and PE reads/individual) provides sufficient variants and, combined with the Illumina platform NovaSeq6000, is still cost-effective for applying ddRADseq protocol to a larger number of samples. In addition, the best filter combination was a MAF 0.1 and a RD coverage between 56× and 200x (for both strategies of SNPs calling).
Distal regions along each sugarcane chromosome exhibited relatively high SNP density both in raw data and, to a minor extent, in FS1 and FS5 (Fig. 2) due to the differential number of SNPs retained in these FS compared to raw data (Table 2). These observations are consistent with the pattern observed in the sorghum reference genome assembly, where distal regions of chromosomes exhibited high gene density and pericentromeric regions displayed low gene density and low rates of recombination (McCormick et al. 2018).
Findings of hierarchical clustering and PCAs were in accordance both within and between matrices, highlighting the isolation of the high-fibre biotype from the commercial hybrids (Fig. 3), likely due to their differential genetic backgrounds Kane et al. 2021). Interestingly, NA 78-724 and NA 56-30, the two Argentinean hybrids genetically developed under the same sugarcane-breeding program did not share a same cluster in any of the matrices analysed. Furthermore, NA 56-30 and HOCP 95-665 showed the nearest genetic distance in No-150 (Fig. 3b) among the samples assayed, in agreement with genetic similarities based on COP estimations (Acevedo et al. 2017).
To shed more light on the effect of the SNPs retained on genes, transcripts, protein sequences, and regulatory regions on the five sugarcane genotypes assayed, the VEP was used. Overall, similar proportions of SNPs were assigned to the same functional consequences in both matrices (Fig. 4a, c). This further shows that independently of the differential number of SNPs detected in each platform, similar variant proportions were allocated to the very same consequence terms. Among the SNPs discovered, missense variants (6.4-5.4%, Fig. 4b, d) responsible for the change of 1 base resulting in a different amino acid sequence where the protein length is preserved, emerged as interesting variants to consider for further analysis. Some SNPs identified in the coding region, such as splice region variants (0.5-0.7%), refer to a change either within 1-3 bases of the exon or 3-8 bases of the intron, representing a variation in the DNA sequences. Similar considerations may be considered for synonymous variants (6.3-4.7%) as they are supposed to have no change to the encoded amino acid. More than half of the variants detected landed within regulatory regions, suggesting that they might alter the biological functionality of the gene (Fig. 4a, c). Finally, in non-coding variants or variants affecting non-coding genes, impact is difficult to predict or there is no evidence of impact at all. Aguirre et al. (2019) evaluated the robustness of the protocol using replicates in E. dunnii. In our work, we performed a similar comparison using SNPs matrices obtained from independent ddRADseq experiments and two genotypes. The matrices were generated with three depth coverage and avoiding missing data. In all cases, the two replicates clustered with different distance values depending on the coverage analysed. The differences observed could be related to the ability to detect heterozygosity of a locus with low frequency, such as single-dose SNP, which are common in sugarcane and require high depth coverage to be detected. These results are consistent with those of Yang et al. (2017), according to which the use of a sequence depth of 56 reads per sample enables the selection of high-quality SNPs. This is because at this depth it is possible to obtain two reads for minor alleles at SNP loci with a single dose and identify more than 90% of potential SNPs even in a dodecaploid species. Several genome assemblies have been published recently, such as those of the sugarcane hybrid CC 01-1940(Trujillo-Montenegro et al. 2021, S. officinarum (GenBank assembly accession: GCA_020631735.1; Bio-Project: PRJNA744175), and S. spontaneum (Zhang et al. 2022). The last two genome assemblies belong to founding species. The sugarcane hybrid CC 01-1940 was sequenced using NGS technologies (PacBio, Illumina short reads and HI-C reads) (Trujillo-Montenegro et al. 2021). The protocol applied here, using the new references, is ongoing to complement and enrich the matrices obtained so far. At least preliminarily, we obtained higher mapping rates, although it did not significantly increase the number of SNPs detected. The pipeline considered in this study can be used independently of any newer versions of the genome sequence that may be published in the future. Moreover, we will continue improving the pipeline for the identification of novel SNPs in polyploids. Furthermore, compared to SNP array technology, it can be applied without previous knowledge on polymorphisms. Even more, the protocol was also successfully used in Stetsonia coryne (Gutiérrez et al. 2021) for the discovery of SSR, and in Prunus persica to analyse the Germplasm Collection of San Pedro Research Station (San Pedro, INTA, Argentina) (Aballay et al. 2021). All in all, this protocol can be applied specifically to the sugarcane population of the breeding program being assessed to discover molecular markers associated with agricultural traits.

Data availability
The datasets generated and analysed during the current study are not publicly available because they are part of CM Doctoral Thesis but they are available from the corresponding author on reasonable request.