Analysis of Segregation Ratio Distortion and Linkage Mapping in Two Switchgrass F1 Populations: Lowland-lowland and Lowland-upland Using Genotyping by Sequencing Data

Background: Switchgrass is an emerging bioenergy crop due to its perennial nature, high biomass yield, and ability to grow in marginal land. The high genetic diversity in switchgrass germplasm can be exploited to capture favorable traits that increase the range of adaptation and biomass yield. Genetic diversity can be explored using single nucleotide polymorphisms (SNPs) that next-generation sequencing has made possible for high-throughput genotyping. We used genotyping-by-sequencing (GBS) of genomic fragments resulting from two methylation sensitive restriction enzymes: PstI and MspI . Two bi-parental F1 populations were developed from crosses between lowland B6 and lowland AP13 (AB population), and lowland B6 with upland VS16 genotypes (BV population), with a target number of 298 progenies in each population. Pseudo-testcross strategy was adopted to perform linkage analysis in these populations that are segregating for winter dormancy using single dose markers (SDA): heterozygous in one parent and homozygous in the other parent. We compared the amount of polymorphisms between the two crosses and examined the pattern of segregation distortion based on the SNPs data generated. Results: Two genetic maps were generated for each population, with 2772 markers in AB and 3766 markers in BV. The higher number of markers in the BV population was expected for since the parents originated from different ecotypes and verified to have the highest genetic distance. More segregation distortion was observed in markers located in the telomeric regions where more genes reside. More markers from the AB population exhibited segregation distortion compared to the BV, and the proportion of heterozygous alleles were significantly higher than homozygous alleles in AB population. The linkage maps showed strong collinearity with P. virgatum V5.1 reference genome with a very minimal number of markers originating from different chromosomes.

Conclusion: Understanding the extent of segregation distortion in switchgrass crosses is important for the correct inclusion of markers based on their segregation ratio when constructing a linkage map. Switchgrass linkage maps should be a useful resource to dissect beneficial biomass traits linked to SNP markers.

Background
Switchgrass is a C4 perennial warm season grass native to most of North America, spanning southern Canada, most of the United States, and northern Mexico (1). It has traditionally been used for pasture and rangeland grazing since the 1940s (1). It was selected by the U.S. Department of Energy (DOE) Biofuel Feedstock Development Program (BFDP) in 1991 (2) as a herbaceous model species for biomass energy due to its high biomass yield, low nutrient and water requirements and suitability for planting in marginal land unsuitable for grain and forage crops (3,4).
Switchgrass is divided taxonomically into two major ecotypes, upland and lowland based on their phenotypic divergence caused by the difference in latitudinal adaptation (1).
Upland ecotypes are widely adapted to latitudes north of 34°N, extending into much of eastern Canada, while lowland ecotypes are adapted to the south, approximately up to 42°N in the western portion of the grassland range, but can be found as far north as 45°N in eastern North America due to the moderate climate resulting from ocean effects (1).
Adaptation of upland ecotypes involves phenology for a short growing season and tolerance to cold winter temperatures (reviewed in Lowry et al. (5)). On the other hand, lowland ecotypes are more adapted to a longer growing season and are less tolerant to cold temperatures. In terms of morphological characters, lowland ecotypes are taller, have fewer and larger tillers, longer and wider leaf blades, and thicker stems than the upland ecotypes (1).
The basic chromosome number of switchgrass is x = 9, with a wide range of somatic chromosome counts, from n = 18 to 108 (6,7). Lowland ecotypes are mostly tetraploid (2n = 4x = 36 chromosomes) while octoploids (2n = 8x = 72 chromosomes) are very rarely found (8). Upland ecotypes exist for both tetraploids and octoploids, with the octoploids being approximately two to three times more abundant than the tetraploid. Zhang et al. (9) estimated the earliest divergence of upland and lowland ecotypes to be around 1.3 Mybp and the divergence of two similar species in the Panicum genus: P. virgatum and P.
hallii was around 5.3 Mybp. Switchgrass is predominantly cross-pollinated with gametophytic self-incompatiblity (10), however a wide range of successful self-pollination has been reported in literature (11)(12)(13)(14). Crossing between the upland and lowland ecotypes is possible at the tetraploid level and several studies showed that the inheritance in tetraploid switchgrass is disomic (15)(16)(17). Significant flow of genes occurred between the two diverse ecotypes during the glacial maxima about a million years ago (8,9).
Prior to the use of genetic marker to group the different accessions of switchgrass, the distinct phenotypic morphologies and natural habitat were used as primary sources of information (18). With the advances in molecular markers for genotypic identification, switchgrass ecotypes were grouped based on genetic marker profiles, including chloroplast (cpDNA) sequences (9,(19)(20)(21), simple sequence repeats (SSR) (9,21), and recently single nucleotide polymorphisms (SNP) (22)(23)(24). Not only have these markers successfully grouped switchgrass accessions into lowland and upland ecotypes, SNP markers could cluster them into higher number of groupings that are linked to ecotypes, cytotypes, and geographical origins (22).
In this study we developed two F1 mapping populations derived from crosses between the winter dormant 4x lowland AP13 and a non-dormant 4x lowland B6, and B6 with the winter dormant 4x upland VS16. The objectives of the crosses are to develop two populations that segregate between the two dormancy levels, together with other traits such as cold tolerance, biofuel quality traits, and biomass yield. We used genotyping by sequencing to selectively genotype each individual in the two populations by target sequencing at the potential gene regions. This is done through DNA cleaving with a combination of two methylation-sensitive restriction enzymes (REs): "rare cutter" PstI and "common cutter" MspI, and DNA barcoding to enable sample multiplexing in one library pool (25). By using methylation-sensitive REs, the repetitive regions of genomes can be avoided and lower copy regions can be targeted more for sequencing (25). The use of a "rare cutter" enzyme has been shown to successfully reduce genome complexity for large, complex, and polyploid genomes (26,27).
For the construction of linkage maps, the two-way pseudo-testcross mapping strategy is adopted whereby recombinations that occur at each parental side; during egg and pollen formation, are used to construct two linkage maps per chromosome (28,29). This paper describes the genotyping-by-sequencing methodology and construction of two genetic maps for each population. We then investigate the amount of polymorphisms between the two crosses; lowland-lowland and lowland-upland, and the pattern of segregation distortion based on the SNPs data generated. We used SNPs data from the three parents used in the crossings to calculate Euclidean genetic distance in order to verify the level of diversity within each cross in view of two hypotheses: 1) A population derived from a wide cross such as between different ecotypes will have a higher number of SNPs, and 2) A population with higher diversity should contain more segregation distortion markers due to the existing reproductive barrier that resulted from the cross between two diverse parents (16 (30), and this is common in polyploids where homoeologous chromosomes can have duplicated genomic regions. By considering biallelic loci in SNP calling, the problem of calling paralogs can be significantly reduced (30).
The total number of variants in the raw VCF output after GATK HaplotypeCaller and GenotypeGVCFs processes was 2,539,025 (2.5 M). Removal of loci with quality score less than 20 and genotype quality scores less than 20 resulted in the same number of variants.
This means that the GATK pipeline applied a stringent filtering step and gave an output of only high quality SNPs. Removal of the loci that were 20% missing in genotypes and loci that were not biallelics resulted in 44,698 variants. Variants were filtered for minor allele frequency (MAF) < 0.05 and resulted in 19,830 sites. The same number of variants was retained after filtering for alleles with less than 8 sequence depth. Coverage of 8 reads for variants calling means only 0.4% (0.5 8 ) of the heterozygous alleles will be mistakenly called as homozygous, setting a very high confidence level for SNP calling.
The average allele counts per variant site in the final filtered file was 519 alleles, which means that the average SNP coverage for the population is 519X and for individuals is Selfed and/or low coverage individuals were identified using markers that are fixed for different alleles between parents-all marker "0" (AA) in AP13 and marker "2" (CC) in B6 were used whereby hybrid progenies were expected to have at least 80% total number of marker "1" (AC). As a result, 9 progenies were discarded from further analysis. Markers were then extracted for single dose alleles (SDA) in each side of the parents. Additional 4 progenies were discarded because of > 50% missing data in SDA for both parents, leaving 285 progenies used in linkage mapping. Final filtering for number "2" allele frequency > 20% and chi-square goodness of fit for 1:1 allele segregation (p-value > E10 -15 ) resulted in 3548 (24% of filtered variants) and 3823 (26% of filtered variants) alleles for maternal and paternal SDA, respectively. There were a total of 398 alleles (2.7% of filtered variants) that were heterozygous in both parents (single dose alleles present in both parents). Using markers that were fixed for different alleles between parents and expectation of a minimum of 80% heterozygous genotype in the progenies resulted in the identification of 66 progenies that were potentially a cross between B6 and contaminated pollen source.
Markers were then extracted for single dose alleles (SDA) in each side of parents.
Additional 5 progenies were discarded because of > 50% missing data in SDA for both parents, leaving 227 progenies used in linkage mapping. Final filtering for number "2" allele frequency > 20% and chi-square goodness of fit for 1:1 allele segregation (p-value > E10 -15 ) resulted in 5693 (21% of filtered variants) and 7883 (29% of filtered variants) alleles for maternal and paternal SDA, respectively. There were a total of 758 alleles (2.8% of filtered variants) that were heterozygous in both parents. c) Genetic maps AB population (AP13 x B6) For maternal markers, 45 loci were found to be identical and thus discarded. 871 well segregated markers were firstly used in the construction of framework LG, while the remaining segregation distorted markers were added in their respective LG based on their strongest cross-link to the well-segregated loci in the framework LG. This step maintained the 18 LGs for maternal map. There were a total of 1540 markers (43% of total maternal SDA) that were able to be ordered within linkage groups with a total map length of 2475.61 cM and 1.76 cM of average inter-marker distance (Table 1). For paternal markers, there were 65 identical loci that were discarded and a total of 1171 well segregated markers that were used in the construction of framework LG. There were a total of 1232 markers (32% of total paternal SDA) that were able to be ordered within linkage groups with a total map length of 1704.06 cM and 1.78 cM of average inter-marker distance (Table 1). Illustrative comparison of maternal and paternal linkage maps shows that maternal LGs are consistently longer than paternal LGs ( Figure 2). Haplotype map shows distinct recombination blocks with minor occurrences of only one marker genotype per recombination block ( Figure S1). Distribution of markers along genetic map points to the possible pericentromeric region where the marker density is highest within 20 cM genetic bin sliding window ( Figure S2). Distribution of allele frequency for mapped markers showed two peaks for 0.25 and 0.75 allele frequencies signifying AA x Aa cross ( Figure   S3).

BV population (B6 x VS16)
For maternal markers, 547 loci were found to be identical and thus discarded. 3867 well segregated markers were used in the construction of framework LG, while the remaining segregation distorted markers were added using strongest cross-link. There were a total of 1824 markers (47% of total maternal SDA) that were able to be ordered within linkage groups with a total map length of 1482.90 cM and 1.07 cM of average inter-marker distance ( Table 1). For paternal markers, there were 904 identical loci that were discarded and a total of 4433 well segregated markers that were used in the construction of framework LG. There were a total of 1942 markers (44% of total paternal SDA) that were able to be ordered within linkage groups with a total map length of 1606.79cM and 0.92 cM of average inter-marker distance (Table 1). Comparative illustration of maternal and paternal linkage maps shows no obvious pattern of higher centiMorgan distance between the two maps ( Figure 3). The haplotype map for this population still shows distinct recombination blocks, however with higher occurrences of one marker genotype per recombination block ( Figure S1). Possible pericentromeric region is shown in the marker frequency along genetic map histogram ( Figure S2). Distribution of allele frequency for mapped markers showed two peaks for 0.25 and 0.75 allele frequencies signifying AA x Aa cross ( Figure S3 Table S1). Wilcoxon signed rank test comparing heterozygous and homozygous allele frequencies in each linkage group for both AP13 and B6 maps proved that the frequency of heterozygous alleles are significantly higher than homozygous alleles (p-value < 0.0001) (Table S1). There is more prevalence of contiguous severe segregation distortion in the telomeric regions compared to the pericentromeric region  Table S1).
Wilcoxon signed rank test comparing heterozygous and homozygous allele frequencies across linkage groups for both B6 and VS16 maps proved that the frequency of homozygous alleles are significantly higher than heterozygous alleles (p-value < 0.0001) in majority of linkage groups (Table S1). Similar with AB population, there is more incidence of contiguous severe segregation distortion at telomeric regions compared to pericentromeric regions ( Figure 8).
LGs with severe distortion at telomeres include 2K, 3K, Collinearity of markers in the genetic position (cM) with the physical position in the reference genome (Mb) suggested high level of collinearity, apart from some local misordering such as in LG 4N of AP13 map ( Figure 9). There were only few markers originating from different chromosomes; for the concatenated K and N chromosomes in AP13 map, only 0.88% and 0.40% SNPs, respectively with a total of 0.65% for both maps.
For B6 map, a slightly higher number of SNPs originated from different chromosome, 1.62% for K chromosomes and 2.61% for N chromosomes with a total of 2.11% for both maps. For this population, only 0.76% of the markers had no hit with BLASTn with a cutoff of E-value < 1 x 10 -5 . Very few small gaps were observed in the physical map due to the scarcity of SDA alleles in these regions. These included LG 3N of AP13 map, and 4K, 5K, and 4N of B6 map. Small regions of inversion can also be observed at some LGs such as 4K of AP13 map and 3K of B6 map.

BV population (B6 x VS16)
Collinearity between physical and genetic maps was also observed for BV maps, but not as strong as AB maps ( Figure 10). There were few markers originating from different chromosomes; for the concatenated K and N chromosomes in the B6 map, 1.50% and 2.42% SNPs, respectively with a total of 1.92% for both maps. For VS16 map, 1.47% and 1.76% of the SNPs originated from different chromosomes for K and N subgenomes, respectively with a total of 1.60% for both maps. For this population, 1.27% of the markers showed no hit with BLASTn with a cutoff of E-value < 1 x 10 -5 . More gaps were observed in the physical maps including 4K, 5K, 7K, 8K, 4N, 6N, and 7N of BV map, and 1K, 2K, 3K, and 3N of VS16 map. A small region of inversion is observed at 5K of BV map, and 9K of VS16 map.

Discussion
Several genomic studies in switchgrass have contributed to the ample genomic resources for the species. These include restriction fragment length polymorphism (RFLP) (15), cDNA libraries sequencing (31)(32)(33)(34)(35), bacterial artificial chromosome (BAC) libraries sequencing (36), simple sequence repeats (SSR) (14,16,(37)(38)(39), and more recently is genotyping-bysequencing (GBS) (17,40). Genetic and genomic studies in switchgrass were greatly improved by the publicly available reference genome in the JGI Phytozome database; Panicum virgatum V4.1. The reads were sequenced from AP13 and the overlapped contigs were anchored on the AP13 x VS16 genetic map. The size of genome assembly is approximately 1,165.7 Mb where a total of 89,680 contigs were localized in the main 18 scaffolds (pseudomolecules) that were labeled 1-9 (K or N) based on subgenome specific markers. At the time of preparation of this manuscript, a draft Panicum virgatum V5.1 reference genome was just released internally and we were able to use it for estimating the collinearity between genetic and physical maps.
In this study we compare the outcome of genotyping-by-sequencing on two switchgrass F1 populations: lowland-lowland (AP13 x B6) and lowland-upland (B6 x VS16).Both populations share a common parent, B6, which is a winter non-dormant lowland genotype.
The number of progenies that were initially intended to be used for SNPs generation was 298, however due to low reads coverage for some individuals, 285 progenies for AB population and 227 progenies for BV population were used in the analysis. The size of mapping population is important to determine the correct ordering of markers that are closer than ̴ 2-5 cM (41) and to improve map resolution (26). The moderate to large number of progenies used in this study are comparable to numbers of progenies used in many linkage mapping studies that were around 80 -200 progenies ( (14,26,37,41,42)).
The number of sequences generated was 2.2 B for AB and 2.4 B for BV population. The higher number of sequences generated for BV population resulted in higher number of sequences per individual for reads that were already trimmed for enzyme cut site and poor quality; 7.2 M sequences/individual for BV compared to 6.2 M sequences/individual for AB might be due to difference in genomic DNA concentrations. By increasing genomic DNA concentration for sequencing we were able to increase the mean SNPs coverage per population from 519 alleles per site for AB population to 536 alleles per site for BV population. The higher percentage of reads alignment to V4.1 reference genome for BV (73.26%) compared to AB (65.12%) can be explained by the improved genome coverage in this population. The high SNP coverage in our study was also due to the use of two enzymes combination: PstI-MspI. In this strategy, fragment amplification occurs for DNA strands containing the ligated forward (PstI cut-site) and common Y (MspI cut-site) adapters (26). Since PstI is a six base rare cutter and MspI is a four base common cutter, this enzymes combination could lead to higher sequencing depth. Previous GBS study in switchgrass utilizing only one enzyme; ApeKI (17) and PstI (40), reported lower loci coverage. The choice of RE in GBS is crucial since it determines the tradeoff between high number of fragments and sequencing depth of fragments (17,41). Several studies showed that ApeKI can generate large numbers of SNPs because it is a frequent cutter with partial sensitivity to methylation, but PstI can give at least 8 times higher coverage and its methylation sensitive nature can target more gene-rich regions (17,27). In this study both aspects were covered, the large number of variants generated due to the high output sequencing platform, and deep loci coverage from the use of two enzymes combination.
The number of variants in the raw VCF output was also higher for the BV population, 6M compared to 2.5M for the AB population. This can be explained by the higher sequence coverage and higher number of polymorphisms in BV resulting from a cross between lowland and upland ecotypes. It was interesting to note that five peaks were observed in the histogram of allele frequency after reads were retained for biallelics in BV population.
This suggests that there were more hybridization of the same homozygous loci for B6 and Indels. Even though more markers were included in the map, the total map length was lower for both parental maps in BV compared to AB. This is due to the low inter-marker distance for BV map; 1.07 and 0.92 for B6 and VS16 maps, respectively, compared to AB, 1.76 and 1.78 for AP13 and B6 maps, respectively. A possible pericentromeric region for each linkage group is suggested based on the highest marker density found within 20 cM bin sliding window. The high frequency of markers in the likely pericentromeric region of the genetic map may have lower recombination resulting in lower amount of recombination events observed (26,27,29).
Inter-and intraspecific crosses can often lead to a distortion of segregation ratios in the hybrid progenies (43)(44)(45)(46). Segregation distortion is a deviation of segregation ratios from the expected Mendelian fractions (47,48) that may result from competition among gametes or from abortion of the gamete or zygote. Competition among gametes may occur because of gametophyte genes expressed during postmeiosis of the microspore and pollen development in angiosperms (48,49). Genetic differences among pollen may lead to gametophyte competition and selection, which result in nonrandom fertilization, while hybrid sterility genes can cause abortion of a specific gamete or zygote genotypes (49).
Several linkage mapping studies in switchgrass reported the existence of segregation distortion with level of distortion that was dependent on population. Okada et al. (16) reported 3% distorted single dose alleles (SDA) in female map and 14 % in male map, in a population that was derived from lowlands Kanlow and Alamo. Higher amount of marker distortion in male map was hypothesized to be due to the presence of rye's selfincompatibility (SI) Z locus ortholog that present in LG VIIb. SI locus can cause failure of fertilization by the affected pollen and hence more marker distortion was observed for male maps. Marker distortion in other LG not containing S-Z locus was proposed to be due to post-zygotic mechanisms resulting from two-locus interactions between parents, by which the author suggested that geographically distant parental origin may contribute to a reproductive barrier leading to segregation distortion. Other suggested cause for segregation distortion in the study was the presence of transmission distorter loci.  (16) and produced a comparative genetic map using SNP marker, and found higher amounts of distorted alleles in the mild-TRD and severe-TRD maps with 21% and 33% distorted markers, respectively. The inclusion of distorted alleles had increased the total map length by 10% for mild TRD and 20% for severe TRD, when compared to the well-segregated SDA map. The study showed a distribution of distorted markers in some instances in half of the length of LGs, at pericentromeric or telomeric regions, and on majority of the markers in the LGs.
In our study, the percentage of distorted markers in the final map of AB population is higher than BV population -70% in AB and 20% in BV. We also found longer length of genetic maps for AB population (2475.61 cM for AP13 and 1704.06 cM for B6) compared to BV population (1482.90 cM for B6 and 1606.79 cM for VS16). Longer map length could indicates greater rates of recombination (16,42), or simply a result of map inflation caused by transmission ratio distortion that could introduce spurious linkage, biased estimates of recombination fractions, or incorrect marker order (reviewed in Okada et al. (16)). Because of a significantly higher amount of distorted markers was observed for AB population and distortion was evidently seen in both AP13 and B6 parental maps, we can suggest postzygotic interactions as the cause of segregation ratio distortion. Postzygotic interaction can be caused by the presence of hybrid sterility genes favoring certain types of genotype formation which lead to abortion of zygote. In our case, it favored the formation of heterozygous loci. For polyploid outcrossers, it is known that heterozygosity can increase colonization ability via improved plant fitness through heterosis and masking of deleterious alleles by the presence of extra genome (50,51). In our study, the allele state was analyzed within subgenome and not across subgenomes; thereby the theory of fixed heterozygosity through hybridization of diverged subgenomes may not be the case.
From our first year field phenotypic data, there was a high mid-parent heterosis (MPH) for these three traits: fall regrowth height, normalized difference vegetation index, and spring emergence date, with 44%, 56%, and 15.3%, MPH respectively (unpublished data). These three traits were showed to be significantly correlated with dry biomass yield in a switchgrass diversity panel consisting of both upland and lowland ecotypes (52). Therefore, we can attribute the overall increase in plant fitness for AB population was due to high level of heterozygosity.
For BV population, since the two parents were different ecotypes with high degree of divergence (high Euclidean distance calculated between B6 and VS16 parents), crossing them could potentially lead to accumulation of genes favoring survival in both adapted regions. Minimal segregation distortion was observed, and across LGs, most markers presented a 1:1 segregation between heterozygous and homozygous genotypes, with inclination towards a Mendelian segregation model. In this population, we observed more regions of distortion in the paternal map especially on chromosomes 5K , 5N, and 7N, where 100%, 52%, and 96% of the markers on each respective LG were distorted. Since more distortion was observed in paternal map and involved chromosome 7N that contains rye's self-incompatibility (SI) Z locus ortholog, we can infer that the mechanism of distortion might be at least partly due to male self-incompatibility. We observed more instances of contiguous severe segregation distortion at telomeric regions compared to pericentromeric regions for both AB and BV populations. Daverdin et al. (29) reported a significantly higher number of genic markers located at the distal chromosome regions compared to pericentromeric regions, while the opposite was true for genomic markers.
With this in mind, we can deduce that more functional segregation distortion occurred at genic regions in the distal locations compared to non-genic regions in the pericentromeric location.
A very high percentage of SNPs were successfully aligned through BLASTn to the correct chromosomes in the switchgrass reference genome V5.1. In B6 population, 99.35% for AP13 map and 97.89% for B6 map while 98.09% for B6 map and 98.40% for VS16 map from BV population aligned correctly. Fiedler et al. (40) found approximately 60% of markers in their severe TRD map to align to the same chromosome when P. virgatum V1.1 was used for alignment. This suggested a tremendous improvement of switchgrass sequence assembly in the V5.1 since we only have 0.65% and 1.92% of markers in AB and BV population, respectively, that were not aligned to the same chromosomes. We found a stronger collinearity between physical and genetic maps for AB population compared to BV population, and the presence of more marker gaps in the physical map of BV population.
One obvious reason is due to the construction of switchgrass reference genome from the AP13 genotype, the same genotype that we used as the maternal parent in AB population.
According to Lu et al. (17) the use of reference genome for SNP discovery may not represent the whole genome of a species because some genomic regions might be technically missing and the presence and absence variation (PAV) that is unique to the referenced genotype.

Conclusions
Comparison of polymorphic markers between two crosses, lowland x lowland and lowland x upland showed that the population derived from different ecotypes contained more variants. This was supported by the higher genetic distance between B6 and VS16 parents compared to AP13 and B6 or AP13 and VS16 parents. Our hypothesis that a population derived from more divergent parents (BV population) would have more distorted markers due to possible reproductive barriers, was proven inaccurate, instead the less divergent AB population contained more distorted markers. A possible cause for the segregation distortion in AB is post-zygotic interactions favoring heterozygous genotypes which results in high level of heterosis in F1 progenies. Lower extent of segregation distortion found in the population derived from the more divergent parents B6 and VS16 can be explained by the presence of more polymorphic markers in this population suggesting accumulation of alleles for adaptation in different geographical regions, and thus the need for specific zygotic formation is not as crucial. Understanding the extent of segregation distortion in switchgrass crosses is important for the correct inclusion of markers based on their segregation ratio when constructing a linkage map. The linkage maps produced in this study will be used in future studies involving QTL mapping for extended production, biomass yield, and biomass quality traits.

a) Population development
Two F1 populations consisting of 298 progenies derived from two crosses, AP13 x B6 and B6 x VS16 were produced in the greenhouse in 2015 and 2016. A non-dormant genotype B6 was crossed to the dormant lowland genotype AP13. B6 was also crossed to the dormant upland genotype VS16. The parents of the mapping populations are tetraploid (2n = 4x = 36). Both crosses were made in isolation in separate greenhouse sections to prevent unintentional cross-pollination from unidentified source of switchgrass pollen. The plants were cross-pollinated by placing the parents close together. Seeds produced from cross pollinated plants were harvested at maturity and were dried at room temperature before undergoing pre-chilling treatment to break seed dormancy. The pre-chilling treatment consisted of placing the seed on a wet filter paper laid in a petri dish. Petri dish was closed and sealed with parafilm then placed in a 4°C refrigerator for two weeks. After that, the seeds were planted in flats for germination.
The seedlings were first genotyped using one SSR marker before transferring them into bigger pots to ensure they are hybrids. The plants were later divided into three clones for  . Genomic DNA quality was confirmed by visualizing them in 1% agarose gel.
Good quality genomic DNA gives one high molecular weight band with very minimal smears. The DNA was then cleaved using restriction enzymes per the GBS protocol developed by the Devos lab (pers. Communication) that was adapted from Poland et al. (26). In brief, genomic DNA was digested with two restriction enzymes, a "rare cutter" PstI

c) Reads filtering and variant calling
Sequencing reads were first checked for their quality using FastQC. The reads for each lane were de-multiplexed according to their unique barcode sequences using 'process_radtags' from Stacks (55) and merged as a single file for each genotype and type of read (read 1 and read 2). After that, reads were trimmed to remove enzyme cut sites and low quality bases where reads with phred score below 33 were discarded. Switchgrass reference genome version 4.1 from JGI Phytozome (56) was used to align the trimmed reads using bowtie2 version 2.2.9 (57) and reads were processed for GATK compatible format using samtools version 1.3.1 (58) and picard version 2.4.1 (59). Alignment of reads to a reference genome is theoretically the best practice for variant callings because the reference genome can separate a real variant and a paralog, thus reducing false SNP calls from paralogs (22). SNP calling were done on each genotype's output file using "HaplotypeCaller" and pooled into one file using "GenotypeGVCFs" from GATK version 3.4.0 (60).
The final output file generated from "GenotypeGVCFs" from GATK 3.4.0 contains a very large number of variants and these must be filtered to remove the false positive variants that might be caused by sequencing errors. In addition, the filtered variants must meet these criteria to be accepted as true variants: 1) Quality score > 20, 2) Genotype quality score > 20, 3) Less than 20% genotype missing value, 4) Biallelics, 5) Minor allele frequency > 5%, and 6) Sequence depth ≥ 8. SNP with a MAF < 5% were removed because they cannot be distinguished from sequencing errors (22). The raw variant files were filtered using VCFtools (61). The filtered variants in VCF file was then converted to binary format using PLINK 1.07 (62), which assuming C is the minor allele recoded genotypes as follows: 0 (AA), 1 (AC), 2 (CC), and NA (00). The binary file was further filtered for variants that are present in both parents. Besides alleles filtering, progeny sequences were analyzed to ensure that they are true hybrids. For hybrid verification only homozygous markers were used, for example marker "0" in female parent and marker "2" in male parent. For hybrid progenies, they must contain heterozygous allele (marker "1") at this locus since they are F1. A subset of markers, i.e. 4000 markers were used to screen out non-hybrids and poorly sequenced progenies-true hybrids have at least 80% heterozygous alleles for these subsets of markers.

d) Construction of linkage maps
Variants were pulled out for single dose alleles (SDA)-heterozygous in one parent and homozygous in the other parent. Chi-square goodness of fit (p-value > 0.05) was used to identify well-segregating alleles with 1:1 ratio. Only SDAs with < 10% missing data were retained in this category. The rest of the alleles with p-value > 1 x 10 -15 and < 20% missing data were considered distorted markers, following the classification used by The first step in linkage map construction was to exclude identical alleles, and then only the SDA alleles were used for initial marker grouping (28,40). The next step was to add the ungrouped distorted alleles to their strongest cross-link (SCL) well-segregated loci using the LOD threshold value of 10.0. This step is important to maintain the integrity of linkage groups (LGs) by giving the strongest link LGs in the initial step and to maximize the number of linkage groups (40). For the parents used in the mapping population, the accurate number of LGs should be 18, representing the nine "a" and "b" subgenomes of switchgrass. Markers that were not ordered by JoinMap were taken out and designated as accessory markers (28,40). These markers were forced to be included in the alternative map using JoinMap's third round marker placement function and kept for our reference.
LGs were labeled according to the pseudomolecule name in switchgrass reference genome 4.1 (K and N subgenomes) where each LG should have at least 85% alleles belonging to the same pseudomolecule. Finally, linkage maps were drawn using MapChart 2.30 (63).
Marker distribution was plotted along genetic distance to point the regions with high density markers as this might be an indication of a possible pericentromeric region.
Marker frequency was shown for 1 cM genetic distance bin for every linkage group.

e) Analysis of segregation distortion
Chi-square goodness-of-fit test for the expected 1:1 segregation ratio that resulted from the cross between heterozygous allele (H) to homozygous allele (A) was calculated for all markers in the map. A marker is significantly deviated from the expected segregation ratio at p-value < 0.05 and severe distortion is detected at p-value < 0.001. Percentage of distorted and well-segregated markers in each linkage group and number of heterozygous and homozygous genotypes for each marker were calculated and presented in graphs. Chisquare values for each marker across concatenated K and N subgenomes in genetic distance are presented in graph to illustrate region with severe segregation distortion.
Region with at least three consecutive markers with p-value < 0.001 was highlighted as severely distorted (64).

f) Estimation of genetic distance between parents
Euclidean genetic distance was calculated between the parents to estimate the degree of genetic dissimilarity in order to make inference on genetic divergence. SNP aggregation for all three parents were firstly conducted using "GenotypeGVCFs" from GATK 3.4.0 using switchgrass V4.1 reference genome for reads alignment. Raw variants were then filtered in similar steps described previously. Only the sites having genotype calls for all three parents were used for the calculation of pairwise distance. This is to avoid calculation bias due to differential amounts of missing data between pairs in comparison. Similarity between two parents was first calculated using the formula: [Due to technical limitations, this equation is only available as a download in the supplemental files section]. Euclidean genetic distance between two parents was then calculated using the formula (65) Availability of data and materials The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Competing interests
The authors declare that they have no competing interests. All authors read and approved the final manuscript.       Collinearity of marker positions in genetic maps (x-axis) and switchgrass reference genome V5.1 (y-axis) across K and N subgenomes in AP13 (top) and B6 (bottom) maps for AP13 x B6 population. Red dots (y = 0) represent markers that are not from the same chromosome.   Table S1.xlsx