Genotyping by sequencing reveals genetic relatedness and duplicates amongst local cassava (Manihot esculenta Crantz) landraces and improved genotypes in Kenya


Background: Future demand for cassava is expected to increase to mitigate climatic changes, sustain food security and provide raw materials for industry. To meet these demands, adoption of modern omics methods ensures reliability, precision and timely delivery of more productive and resilient varieties. Therefore, the purpose of this study was to contribute towards accurate identification of cassava accessions from a mix of duplicate clones, diverse local landraces (LARs) and improved genotypes (IMGs) in farmer fields. This is vital for cassava breeding. Results: A total of 112 germplasms sampled through a field survey in major cassava growing regions of Kenya were genotyped using single nucleotide polymorphism (SNP) markers generated through a genotyping-by-sequencing (GBS) approach. Of the 33672 SNPs, 88% were anchored onto chromosomes, 3% were in scaffolds and 9% could not be mapped. Linkage disequilibrium (LD) pruning and identity by state (IBS) matrix estimation revealed 5808 SNPs that were used for hierarchical clustering and ADMIXTURE analysis for ancestries. Considering between 2 and 20 candidate subpopulations, a 5-fold cross-validation procedure identified 14 subpopulations, from which the population structure was modeled. Approximately 48% of the germplasms were classified into 17 independent clusters as identical clones or duplicates. The remaining 52% formed admixtures and were hence considered unique or non-duplicated clones, reducing the total number of samples surveyed from 112 to 73. Of the duplicates, 10 clusters were formed from LARs, four from IMGs, and three from a mix of both LARs and IMGs. The largest and smallest clusters contained 8 and 2 accessions, respectively. About 71% of clusters contained accessions from the same geographical region while 29% had accessions from different regions. The results revealed genetic relationships amongst LARs and IMGs.
Duplication of LARs was attributed to historical sharing or exchange of planting materials by farmers, while duplicates of IMGs could be attributed to convergent evolution, selection, or sharing of common parentage. The high number of admixtures or unique clones implied minimal loss of genetic diversity. Geographical restriction of clusters pointed to minimal movement of planting materials across the country, perhaps linked to either an inefficient seed distribution system or disease-driven quarantine measures. Conclusions: GBS was successfully used to study the genetic relatedness of cassava genetic resources and to identify varieties in farmer fields. This omics approach and the data generated herein could be adopted by breeders and other stakeholders in designing efficient and effective cassava improvement programs, which might include the development of a core set of diagnostic markers for quality assurance, disease resistance, and targeted genomic profiling in cassava.

Despite its significance, cassava production in SSA still lags behind other parts of the world. This has largely been attributed to low investments in breeding programs and inherent genetic challenges associated with the crop. Genetic barriers such as high heterozygosity, inbreeding depression, allopolyploidy, poor seed set, irregular flowering, and the polygenic and recessive nature of many desirable traits constrain the development of new or improved varieties, especially via conventional breeding (Elegba et al. 2021; Ceballos et al. 2020; Makwarela and Rey 2006). These are further compounded by the mixture of diverse local landraces and improved varieties that most small-scale farmers cultivate on the same piece of land. Indeed, farmers often exchange stem cuttings or planting materials with their neighbors and neighboring communities, resulting in fields with a mixture of local cassava varieties (Andersson and de Vicente 2010; Nakabonge et al. 2018). Commonly, this results in the same ethnic or local name being assigned to different cassava germplasms, or the same germplasm being assigned different local names. Variety naming systems in the absence of formal seed systems can be quite temporally and spatially variable, leading to inconsistencies in the names of a particular variety (Rabbi et al. 2015). All these hamper the selection of breeding lines. To overcome these limitations, molecular approaches can assist in reliable identification, characterization, and verification of genotypes or varieties and hasten selection. Accurate identification of crop cultivars is crucial in assessing the impact of crop improvement research outputs, and the two commonly used identification approaches, elicitation of variety names from farmer interviews and morphological plant descriptors, have inherent uncertainty levels (Rabbi et al. 2015). The major aim of variety or cultivar identification is to catalog the crop's genetic diversity (Lopez-Lavalle et al. 2021).
Many cassava landraces have been reported in SSA, but few studies have examined the genetic relatedness between these landraces and elite or improved accessions (Turyagyenda et al. 2012).
Molecular marker technologies such as RFLPs, AFLPs, SSRs, DArTs, and SNPs, among others, have been used to detect polymorphisms and characterize genetic variation in cassava cultivars (Lopez-Lavalle et al. 2021). Rabbi et al. (2015) successfully used SNPs derived from GBS to track and identify released cassava varieties and local landraces in Ghana, West Africa. The present study, therefore, applied the GBS approach to generate SNPs that revealed genetic relatedness amongst local landraces and improved cassava genotypes sampled from various cassava growing regions in Kenya. This is a preliminary step toward the acceleration of the cassava breeding process in the country.

Sample collection
A field survey was carried out in April 2018 in selected areas within the major cassava growing regions of Nyanza, western, eastern, and coastal Kenya (Fig. 1). Systematic sampling was applied to identify cassava farmers or farms for cassava leaf collection. This involved stopping at regular predetermined intervals (~2-5 km) between farmer fields along the major motorable roads traversing each sampling location, which allowed wide coverage of the surveyed areas (Mware et al. 2009). The local names of the landraces and/or the names of villages, together with the GPS coordinates where the samples were collected, were recorded (Table 1). Cassava leaves were harvested and pooled from five plants per landrace or genotype. The leaves were immediately transferred to Falcon tubes half-filled with silica gel to preserve their integrity prior to DNA extraction.

Sequencing cassava using DArTSeq
Cassava leaf samples were sent to the Integrated Genotyping Service and Support (IGSS) platform located at the Biosciences eastern and central Africa (BecA-ILRI) Hub in Nairobi, Kenya, for genotyping. DNA was extracted using the TANBead Plant extraction kit. The quality and quantity of genomic DNA were determined using a NanoDrop ND-1000 (Thermo Fisher Scientific) and agarose gel electrophoresis. Libraries were constructed according to the DArTSeq complexity reduction method of Kilian et al. (2012), through digestion of genomic DNA using a combination of PstI and MseI restriction enzymes and ligation of barcoded adapters, followed by PCR amplification of adapter-ligated fragments. Libraries were sequenced using single-read sequencing runs of 77 bases. Next-generation sequencing was carried out on the Illumina HiSeq 2500. DArTseq markers were scored using DArTsoft14, an in-house marker scoring pipeline. Two types of DArTseq markers were scored, SilicoDArT markers and SNP markers, both of which were scored as binary for presence/absence (1 and 0, respectively) of the restriction fragment with the marker sequence in the genomic representation of the sample. Both SilicoDArT markers and SNP markers were aligned to the Cassava_v61 reference genome to identify chromosome positions.

Data analysis
The SNP data were quality-filtered using TASSEL, and SNPs anchored on scaffolds or lacking chromosome information were discarded. TASSEL was also used to select SNPs with a minor allele frequency (MAF) > 0.05 and no more than 20% missing genotype data. For LD pruning and IBS matrix estimation, Plink 1.9 was used to select SNPs with a pairwise LD (R²) of less than 0.5 within each 50-SNP window; that is, considering 50 SNPs at a time, the R² between retained SNPs had to be less than 0.5. Two methods were used for grouping the genotypes: hierarchical clustering using the identity by state (IBS) matrix, and model-based maximum likelihood estimation of individual ancestries from multi-locus SNP genotype datasets using ADMIXTURE (Rabbi et al. 2015). IBS examines whether two lines are identical based on the nucleotides (SNP alleles) that they share. Using the pruned SNPs from Plink, the IBS matrix was calculated with the distance function of Plink (Purcell et al. 2007). The matrix was used for hierarchical clustering using the Ward2 method for distance estimation. The critical distance threshold used to declare two genotypes identical was 0.05, based on the empirical evidence of Rabbi et al. (2015) from the distribution of distances between duplicated DNA of 64 cassava samples. A Ward's minimum variance hierarchical cluster dendrogram (Fig. 3) was then generated from the IBS matrix using the Analyses of Phylogenetics and Evolution (APE) package (Paradis et al. 2004) implemented within R software (R Core Team, 2020).
After filtering, LD pruning and IBS matrix estimation were used to determine the LD threshold and select SNPs accordingly. The same set of LD-pruned SNPs used for the hierarchical clustering was also used for ADMIXTURE to identify ancestries of the collected cassava germplasms.
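The filtering and duplicate-calling logic described above can be illustrated with a minimal Python sketch. It is not the TASSEL, Plink or R pipeline used in the study: the function names, the 0/1/2 genotype coding and the toy matrix are assumptions made for illustration, LD pruning is omitted, and duplicates are grouped by simply linking any pair whose IBS distance falls below 0.05 rather than by Ward clustering.

```python
import numpy as np

def filter_snps(G, min_maf=0.05, max_missing=0.20):
    """Keep SNP columns with MAF > min_maf and missing rate <= max_missing.

    G is a samples x SNPs float array of alternate-allele counts (0, 1, 2);
    np.nan marks a missing call (an assumed coding, not the DArTseq format).
    """
    missing_rate = np.isnan(G).mean(axis=0)
    called = (~np.isnan(G)).sum(axis=0)
    alt_freq = np.nansum(G, axis=0) / (2 * np.maximum(called, 1))
    maf = np.minimum(alt_freq, 1 - alt_freq)
    return G[:, (maf > min_maf) & (missing_rate <= max_missing)]

def ibs_distance(G):
    """Pairwise 1 - IBS: mean fraction of alleles NOT shared at jointly called SNPs."""
    n = G.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            both = ~np.isnan(G[i]) & ~np.isnan(G[j])
            d = np.abs(G[i, both] - G[j, both]).mean() / 2 if both.any() else np.nan
            D[i, j] = D[j, i] = d
    return D

def group_duplicates(D, threshold=0.05):
    """Link samples whose distance is below the threshold (union-find grouping)."""
    n = D.shape[0]
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if D[i, j] < threshold:  # NaN distances compare False and are skipped
                parent[find(j)] = find(i)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) > 1]

# Toy data (hypothetical): samples 0/1 and 2/3 are duplicate pairs.
toy = np.array([
    [0, 1, 2, 0, 2, 1],
    [0, 1, 2, 0, 2, 1],
    [2, 1, 0, 2, 0, 1],
    [2, 1, 0, 2, 0, np.nan],
    [1, 0, 1, 2, 2, 0],
])
print(group_duplicates(ibs_distance(toy)))  # [[0, 1], [2, 3]]
```

Linking pairs below the threshold is a simplification; on real data the grouping should be checked against the dendrogram, since pairwise linking can chain together samples that are not all within 0.05 of each other.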
Traits or characteristics of most landraces had not been documented, in contrast to improved genotypes, which were developed for resistance or tolerance against two major virus diseases, CMD and CBSD (Table 1). However, farmers casually interviewed during sampling attributed their preference for local landraces to sweet or bitter tubers, early maturity, and high yield (data not shown). Improved genotypes were introduced into these regions by research institutions such as the International Center for Tropical Agriculture (CIAT), the International Institute of Tropical Agriculture (IITA) and the Kenya Agricultural and Livestock Research Organization (KALRO) (Table 1).

Filtering and selection of SNPs and optimum population identification
A total of 33672 SNPs were identified. Of these, 29614 SNPs (~88%) were anchored to chromosomes, 942 (~3%) were present in scaffolds, while the remaining 3116 SNPs (~9%) could not be mapped to any chromosome or scaffold. After quality filtering, 20846 SNPs were selected. LD pruning and IBS matrix estimation revealed that 5808 SNPs met the selected LD threshold criteria (Table 2). The 5-fold cross-validation procedure revealed the optimum number of populations to be 14 (Fig. 4).
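Selecting the optimum number of populations from cross-validation can be sketched as follows. The sketch assumes that ADMIXTURE was run once per candidate K with its cross-validation option and that each log contains a line of the form `CV error (K=...): ...`; the helper names are hypothetical and the error values in the example are invented for illustration, not taken from this study.

```python
import re

# Pattern for the cross-validation line that ADMIXTURE writes to its log.
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")

def parse_cv_errors(log_text):
    """Return {K: cv_error} from concatenated ADMIXTURE log output."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}

def best_k(cv_errors):
    """Return the K with the lowest cross-validation error."""
    return min(cv_errors, key=cv_errors.get)

# Illustrative log excerpt with made-up error values.
example_log = """\
CV error (K=2): 0.61
CV error (K=3): 0.55
CV error (K=4): 0.58
"""
errors = parse_cv_errors(example_log)
print(best_k(errors))  # 3
```

In practice the logs for every K in the tested range (2 to 20 in this study) would be concatenated before parsing, and the resulting curve inspected alongside the single minimum.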

Admixture analysis
Genetic relationships among the genotyped cassava germplasms are shown on the hierarchical clustering dendrogram (Fig. 3), while the population structure depicting ancestries from ADMIXTURE is presented as a barplot (Fig. 5). The admixture clustering together with the dendrogram topology enabled identification of clusters of genetically identical germplasms containing only landraces, only improved genotypes, or both landraces and improved genotypes (Table 3). A total of 54 germplasms (~48%) were grouped into 17 independent clusters (I-XVII) as identical clones or single pure lines (Table 3). They represented duplicated clones bearing different local names. Of the 17 clusters, 10 contained only landraces, four had only improved genotypes, and the remaining three had accessions from both landraces and improved genotypes (Fig. 6). Of the 10 landrace clusters, cluster IX was the largest with eight accessions, followed by cluster XIV with five accessions, clusters I and X with four accessions each, four clusters (XVII, XVI, XII, and XI) with three accessions each, and two clusters (XV and VII) with two accessions each (Fig. 6). All four clusters that contained only improved genotypes (VI, IV, III and II) had two accessions each, while the three clusters containing both landraces and improved genotypes (V, VIII and XIII) had three accessions each (Fig. 6).
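As a quick consistency check, the cluster sizes reported in this paragraph can be tallied in a few lines of Python. The dictionary layout and variable names are illustrative only; the sizes are those stated in the text.

```python
# Cluster sizes as reported in the text, grouped by cluster composition.
cluster_sizes = {
    "landraces only": {"IX": 8, "XIV": 5, "I": 4, "X": 4, "XVII": 3,
                       "XVI": 3, "XII": 3, "XI": 3, "XV": 2, "VII": 2},
    "improved only": {"VI": 2, "IV": 2, "III": 2, "II": 2},
    "mixed": {"V": 3, "VIII": 3, "XIII": 3},
}
total_samples = 112  # germplasms genotyped in the survey

n_clusters = sum(len(sizes) for sizes in cluster_sizes.values())
n_duplicated = sum(sum(sizes.values()) for sizes in cluster_sizes.values())
pct_duplicated = round(100 * n_duplicated / total_samples)

print(n_clusters, n_duplicated, pct_duplicated)  # 17 54 48
```

The tally reproduces the 17 clusters and the 54 clustered germplasms (about 48% of 112) given above.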

Discussion
Most of the sampled materials (approximately 63%) were local landraces, while improved cassava genotypes constituted 37%. This implied wider cultivation of local cassava varieties or landraces, which has been attributed to farmer-preferred characteristics such as culinary attributes and cooking quality, sweet or bitter taste, early maturity, pest and disease resistance, high yield, root storability in the ground, and drought tolerance, among other traits. The SNP marker data generated using GBS were successfully used to determine genetic relatedness among the sampled cassava germplasms. From a total of 33672 SNPs identified, 5808 SNPs (~17%) obtained after LD pruning and IBS matrix estimation were used for hierarchical clustering and ADMIXTURE analysis to identify ancestries. This enabled the identification of germplasms that clustered together as well as unique or non-duplicated germplasms. Thus, a large number of SNPs may not be needed to achieve accurate identification of cassava varieties, whether in farmers' fields or in formal germplasm collections. Further results from the present study showed that the majority of the duplicated clones were local landraces, while geographically, most of the duplicated accessions were sampled either from the same region or from different regions in close proximity. Such redundancies have previously been attributed to the historical sharing of cassava accessions, or to the exchange of the same germplasm between farmers under different genotype names (Albuquerque et al. 2019).
Farmers often exchange planting materials with their neighbors or different neighboring communities, resulting in fields with a mixture of local cassava varieties (Andersson and de Vicente 2010; Nakabonge et al. 2018). Thus the same ethnic or local name could be assigned to different cassava germplasms, or the same germplasm assigned different local names. Variety naming systems in the absence of formal seed systems can be quite temporally and spatially variable, leading to inconsistencies in the names of a particular variety (Rabbi et al. 2015). The informal farmer-to-farmer seed distribution system is often inefficient, denying farmers in far-flung areas access to, or a share of, alternative planting materials. Ferguson et al. (2021) reported that individual cassava landraces were not widely distributed across Tanzania and that farmer-to-farmer diffusion was limited, with implications for seed systems. Indeed, smallholder farmers recycle stem cuttings of traditional landrace cultivars (Nweke et al. 2002), and there is a flow of seed within and outside the villages, with little introduction of new cultivars (Mtunguja et al. 2014). The absence of an effective seed distribution system (Kyamanywa et al. 2011) has limited farmers' access to planting materials from improved genotypes. Additionally, elicitation of cassava variety names from farmer interviews during surveys and/or use of morphological plant descriptors have inherent uncertainty levels (Rabbi et al. 2015). Morphological descriptors are also greatly influenced by the environment and show continuous variation and high plasticity, with most of them only scorable at maturity (Ndung'u et al. 2014). The restriction of clusters to the same geographical areas where accessions were sampled could also be attributed to quarantine measures that restricted the movement of planting materials in order to stop the spread of virus diseases such as CMD and CBSD.
Similarities in cassava accessions can also arise due to convergent evolution, selection, or sharing of common parentage (Ndung'u et al. 2014). This was probably the case in the Kitui region, where the majority of duplicates were improved genotypes that had shared the same parents during their breeding for resistance to cassava brown streak disease. Crops gradually lose their genetic variability through domestication and breeding, resulting in more uniform cultivars and reducing their recombination rates (Rufo et al. 2019). This could perhaps explain the clusters that included both improved genotypes and local landraces. It is, however, noted that no recent evidence has shown loss of genetic variation from genetic drift during the introduction of cassava to Africa (Ferguson et al. 2019). The relatively low levels of diversity reported in that study were only observed in IITA breeders' germplasms and may rather represent a genetic bottleneck (Ferguson et al. 2019). For future breeding programs involving hybridization or selection, de Oliveira et al. (2015) recommended the introduction of new genetic variability into commercial cultivars to avoid low genetic variation and to improve the quality of cassava roots. The unique or non-duplicated landraces and improved genotypes in the present study represented a more expanded cassava genetic pool from which variability can be derived for future breeding purposes. It might also be important to build a core collection from the 73 unique genotypes identified in this study for efficient conservation and cassava breeding. High genetic diversity drives better crop adaptation to emerging environmental cues.

Conclusion
Molecular markers have an important role to play as farmers frequently give different names to the same cultivar or landrace, making identification difficult, particularly as cassava varieties are not easy to distinguish morphologically (Mbanjo et al. 2021). Marker-based identification enables the correct assessment of adoption rates, which in turn influences breeding priorities and agricultural policies (Kretzschmar et al. 2018). Knowledge of the extent of genetic diversity among cassava landraces and improved genotypes in Kenya, obtained using GBS-derived SNP markers, may promote their conservation and/or efficient selection and utilization as parental lines in breeding for biotic and abiotic stress tolerance. Although local landraces may be low-yielding, they may have high genetic diversity that could promote gene flow through hybridization (Turyagyenda et al.).

Tables

Table 1 Cassava landraces and genotypes sampled during field surveys from different cassava growing regions of Kenya.

Table 2 The distribution of the SNPs across the cassava genome.

Table 3 Classification of cassava accessions into clusters based on underlying sub-population groups derived from Figure 5.

Figure 1 Five (5) major cassava growing regions of Kenya where leaf samples of local landraces and improved genotypes were collected. These regions represent 100% of the areas within Kenya where cassava is cultivated. GPS indicates the global positioning system for the coordinates of the regions.

Figure 5 Barplot showing population structure modeled with 14 underlying sub-population groups from ADMIXTURE. The sample order of the hierarchical clustering was maintained for the ADMIXTURE plot for easy comparison of the output from the two grouping methods. For the ADMIXTURE plot, the different colors represent the different sub-populations, while each bar represents an individual sample. Samples with just one color are pure lines from a single sub-population. Samples with more than one color are admixtures from different sub-populations.