DArT-seq based SNP analysis of diversity, population structure and linkage disequilibrium among 274 cowpea (Vigna unguiculata L. Walp.) accessions

Background: Genetic diversity in a germplasm is crucial for continuous improvement of crop varieties. A panel of 274 cowpea ( Vigna unguiculata L.) accessions of unknown genetic diversity was assembled from diverse sources. This study used 3127 SNP markers, generated with the diversity array technology (DArT), to assess genetic diversity, population structure and linkage disequilibrium (LD) in the assembled germplasm. Results: The population structure analysis inferred three subpopulations within the germplasm, which was conrmed by Neighbour-Joining (NJ) clustering and principal component analysis (PCA). Low genetic distances (0.005 to 0.44) were observed between accessions. Accessions from Africa; West and Central Africa (113 accessions), East and Central Africa (93 accessions), and Asia (53 accessions) were predominant in the germplasm; and distributed across all subpopulations. High xation indexes (0.48 ≤ F ST ≤ 0.56) were obtained for the inferred subpopulations. AMOVA revealed a very large contribution of within subpopulations variation to the observed genetic variation in the germplasm. However, the expected heterozygosity (He) was higher than the observed heterozygosity (Ho), indicating high proportion of inbred lines in the germplasm. Linkage Disequilibrium (LD) was observed in the germplasm, which showed a low decay at longer physical distance between markers in the genome. Conclusions: Signicant genetic structuration exists in the assembled cowpea germplasm which shows that there is a potential for improvement of the crop. The subgroups consisted mainly of inbred lines which, although from different geographical regions shared alleles in common reecting high movement of seeds and exchange of germplasm between regions. The presence of linkage disequilibrium in the germplasm paves a way for prospective whole genome-wide association studies in cowpea for quality attributes and important agronomic traits.

which, although from different geographical regions shared alleles in common re ecting high movement of seeds and exchange of germplasm between regions. The presence of linkage disequilibrium in the germplasm paves a way for prospective whole genome-wide association studies in cowpea for quality attributes and important agronomic traits. Background Diversity in plant genetic resources creates an avenue for plant breeders to develop improved varieties with desirable attributes to cope with the ever-changing environments [1,2]. Cowpea [Vigna unguiculata (L.) Walp; 2n = 2 x = 22] is an important grain legume crop in the world [3]. Grain, leaves and pods are the most used parts of cowpea plant, and their characteristics vary among cultivars [4,5]. Cowpea contributes to food and nutrition security and income generation of millions of households in semi-arid tropics, including Asia, Africa, and Latin America [3,6,7]. Although cowpea thrives in drought prone environments and on poor soils [8], it is highly susceptible to pest and diseases, which leads to the low yields (25-600Kg/ha) reported in the production areas [3,[9][10][11]. This is a threat to world food security, and calls for constant efforts by breeding programmes to explore, create and use diversity within the species in order to overcome the various biotic and abiotic constraints and meet consumers' preferences.
Genetic diversity of a given plant population re ects its evolutionary potential and determines its capacity to be improved [2]. It informs on the patterns and magnitude of population structure which is driven by the combined effects of evolutionary processes such as recombination, mutation, genetic drift, demographic history, and natural selection [12]. Knowledge of the genetic structure can provide valuable guidelines for the formulation of breeding strategies [13,14]. Investigating the genetic structure of a population entails a thorough analysis of the allelic patterns between individuals within the population to make use of the genetic variation for cultivar development [15].
To date, a range of molecular and quantitative methods have been developed for easy and effective assessment of genetic diversity [2]. The rapid advances in sequencing technologies provides many possibilities to decipher the organization of natural populations [16]. High throughput genotyping such as Diversity Arrays Technology (DArT) has emerged as technology of choice for genetic diversity analysis and genomic studies because of its e ciency and low cost [14]. DArT offers high-throughput markers system for genome analysis and has successfully been deployed to assess genetic diversity in a number of legume crops including cowpea [17,18], common bean [19], and pigeonpea [20].
Previous studies indicated the presence of population structure within the cultivated cowpea (V. unguiculata) [17,21,22]. Two to four subgroups with varying levels of genetic diversity were identi ed depending on the germplasm used [17,18,[21][22][23][24][25]. Meanwhile, the reproductive nature of cowpea, which is primarily a self-pollinated plant [6], increases the degree of inbreeding with individuals becoming more homozygous for many alleles, consequently narrow genetic base and genetic distance were reported in the species [23,26]. Furthermore, the high degree of inbreeding in this species also increases chances of linkage disequilibrium (LD) between loci [27], which is a determining factor in marker-trait association analysis [28], hence should be assessed in a plant germplasm collection designated for long-term breeding.
In 2018-2019, a set of 274 cowpea accessions was acquired from different origins and will serve as genetic material for developing cowpea breeding programme in Benin. The present study used DArT-seq generated SNP markers for diversity, population structure and linkage disequilibrium analyses in the germplasm for effective breeding decision-making.

Results
Pro le and diversity of SNP markers A total of 12,689 SNPs markers were generated from the DArT-seq genotyping of 274 cowpea accessions ( Table 1) from 33 countries across the globe (Fig. 1). High number of markers (9562 SNPs) were removed during ltering based on low minor allele frequency (MAF<0.05), high percentage (>20%) of missing data, and lack of information on their position in the genome. The rest of the markers, 3217 SNPs (24.65%) spanning the 11 linkage groups of cowpea matched the quality criteria and were used in the diversity analyses.
The pro les of the 3127 SNP markers are presented in Table 2. The markers showed high reproducibility (0.99) with a mean call-rate value of 0.87. Markers diversity analysis revealed that these markers had an average minor allele frequency (MAF) of 0.22, and the majority (93.51%) had a PIC value above 0.1 with a mean value of 0.24 (Fig. 2, Table 2). The mean expected heterozygosity (0.23) of the markers was higher than the average observed heterozygosity (0.07) Population structure Two different approaches (STRUCTURE and DAPC) were used to identify the tting number of clusters within the cowpea germplasm. Results of the structure analysis showed that the likelihood of DeltaK (ΔK) peaked at K =3 (Fig. 3a), indicating that three clusters contribute to the total variation in the diversity panel under study. Consequently, the 274 cowpea accessions can be grouped into 3 subpopulations/clusters. The distribution of the cowpea accessions among different clusters revealed that cluster 1 had the highest percentage of membership (53.2%), followed by cluster 2 (31.8 %) and cluster 3 which recorded the lowest percentage (14.9 %). However, as shown in the inferred ancestry bar plot (Fig. 3b), some accessions were in admixture i.e. represent a sum of variation from more than one cluster. Based on the probability value for assignment of individual accession to a speci c cluster, 31 accessions fell into the group of admixed individuals (Additional le 2).
In the DAPC approach, the curve of BIC versus number of clusters shows a rapid decline from K=1 to K=3 followed by a very slow increase (Fig. 4 a), suggesting that K=3 is the optimum number of clusters inferred through this approach. Furthermore, 2 discriminant functions (DA) were detected, which explained 59.09% and 24.92 % of the variation in the dataset respectively (Fig. 4b). The graph of probability values of assignment of each accession to a speci c cluster showed a perfect inference of the accessions with no admixed individuals (Fig. 4c). The plot of the densities of individuals on the rst discriminant function showed that cluster 1 was distant from the two other clusters ( The phylogenetic relationship showed that accessions from all regions, except the accessions from Oceania, were distributed in two or three subpopulations (Additional le 2). Accessions from West and Central Africa, and accessions from East and Southern Africa were predominant in cluster 2 and Cluster 1, respectively. In terms of within country diversity, Nigerian accessions were highly represented in cluster 1 (20 accessions) and cluster 2 (17 accessions). The majority of accessions from Benin were found in cluster 2 (27 accessions). Ugandan accessions were predominant (76 accessions) among accessions from East and Southern Africa region, with the majority (53 accessions) in cluster1. Indian accessions (49 accessions), the most predominant in the group of accessions from Asia, were distributed in cluster 1 (28 accessions), cluster 2 (15 accessions) and cluster 3 (6 accessions).
Principal Component Analysis (PCA) was performed and the scatter plot of the 274 accessions based on rst two principal component axes, which together explained 23.57% of the variation among accessions, showed that the accessions can be grouped into 3 main subpopulations with some of them in admixture ( Fig. 3 d). These patterns are consistent with the structure analysis.
Genetic diversity and population differentiation of observed groups Genetic diversity and population differentiation in the germplasm were examined by estimating diversity parameters and analysing molecular variance among and between inferred clusters/subpopulations ( Table 3, Table 4 and Table 5). From the structure analysis, the genetic variability among the three inferred subpopulations represented herein by Nei's net nucleotide distance (Table 3), varied from 0.17 (between cluster 1 and cluster 2) to 0.26 (between cluster 2 and cluster 3). This indicates a degree of relatedness between subpopulations, with cluster 1 more related to cluster 2 than cluster 3 while cluster 2 and cluster 3 were more distant. The highest within population variation (expected heterozygosity) was observed in cluster 3 (0.  Table 4). The highest number of effective allele (Ne=1.48) was recorded in cluster 3 while the highest PIC mean value was observed in cluster 2 (PIC=0.23). The observed heterozygosity (Ho) values were generally lower than the expected heterozygosity (He) across subpopulations (Table 4). Cluster 3 had the highest observed heterozygosity (Ho=0.11) while cluster 1 and cluster 2 recorded the lowest values for this parameter, 0.07 and 0.06, respectively.
The genetic diversity estimates based on the DAPC clustering method (Table 4) showed that the number of effective alleles ranged from 1.48 in cluster 3 to 1.44 in cluster 2. Cluster 2 had the highest PIC mean value (0.23) while the lowest value for this parameter (0.19) was observed in cluster 3. Overall, expected heterozygosity (0.23≤He≤0.29) was higher than observed heterozygosity (0.05≤Ho≤0.10) in all subpopulations. The highest observed heterozygosity was obtained in cluster 2 (Ho=0.10).
Analysis of molecular variance (AMOVA) revealed similar patterns in the repartition of the genetic variance irrespective of the clustering approaches used ( Table 5). The total genetic variation in the germplasm was partitioned mainly into among accessions variation and among subpopulations variation. Low contribution of between subpopulations variance was observed; 1% and 19% following the NJ clustering and DAPC, respectively. The inbreeding coe cient (F IS ) was very high implying that the inferred subpopulations are mainly composed of inbred lines.

Linkage disequilibrium
Linkage disequilibrium (LD) was analysed across the cowpea genome. Out of 3127 SNPs, 1754 SNPs spanning the whole genome were in true LD based on comparison between pairs of markers in window size of 500kb ( =0.2 threshold). These markers were almost evenly distributed across the genomes with the highest numbers of SNPs in LD with each other registered on chromosomes 3 and 7 (Fig5.a). LD estimates (R vs 2 ), corrected by structure and genetic relationship observed in the cowpea germplasm, were plotted along physical genetic distance between markers across the genome (Fig5.b). The LD decay plot showed that on average LD estimates were in general low (R vs 2 =0.02). However, few high LD values were observed over a short physical genetic distance which decayed rapidly, with very low decline at longer distances between markers across the genome. LD decayed below R vs 2 =0.2, more precisely to R vs 2 =0.1, with an increasing physical distance of 8.75 to 25.16 kb (Fig. 5b).

Markers diversity
The present study investigated genetic diversity and population structure of a panel of 274 cowpea accessions from diverse origins using a set of 3127 SNP markers. The SNP markers were polymorphic and informative suggesting they are amenable to reliable ngerprinting and good inference of genetic variation within the germplasm. The results showed that the markers were highly reproducible (0.99), scored high call rate (0.87) with average values of minor allele frequency (MAF) and polymorphism information content (PIC) of 0.22 and 0.24, respectively. These values are within the range of values reported in previous SNP-based genetic diversity studies conducted in some important food crops including cowpea [18,21], common bean [19], and maize [29]. However, low heterozigozity (Ho=0.07) was observed in these markers. Low mean of observed heterozigosity values (Ho=0.06; Ho=0.075) was also reported in worlwide germplasm collections of cowpea maintained at USDA GRIN [21] and IITA [18]. The mean heterozygosity, calculated across a number of loci, is a true indicator of the degree of genetic variation within a population [30], implying there is low genetic variation in cowpea. This is expected since cowpea is primarily a self-pollinated crop exhibiting high degree of inbreeding [6,23,31], which reduces genetic diversity.

Population structure and relationships between accessions
Population structure analysis is important in understanding genetic diversity and association mapping in a germplasm [15]. The cowpea collection was divided into three subgroups irrespective of the approaches used, DAPC and STRUCTURE, and this was further con rmed by both PCA and Neighbour-Joining clustering analyses. Previous studies in worldwide germplasm (size= 298-768 accesions) of cowpea have also reported the presence of three clusters [18,25,32], suggesting that our collection, to some extent, has captured the diversity in the crop. This can be very useful considering the use of the germplasm in breeding activities. Conducting evaluation on the whole panel for speci c trait of interest may be very informative, and could also guide selection of genotypes to use as a training population in genomic prediction studies. However, this may require more high-density genome wide markers [33].Differences were observed between the DAPC and Structure analysis in the size of the subgroups which can be attributed to the presence of admixed individuals in the population. Thirty-one (31) accessions were in admixture when the cut-off value of coe cient ancestry was set at 0.52, and this number could increase with higher thresholds, a factor that leads to discrepancies in the results of these analyses [34,35].
Genetic differentiation and allelic patterns of the subpopulations Low genetic distances were observed between accessions. The genetic distance varied from 0.005 to 0.44 between pairs of accessions in the set of the germplasm, con rming that some accessions within the germplasm shared many alleles. The result corroborates the ndings of Fatokun et al. [18] who reported a low genetic distance, ranging from 0.0096 to 0.462, between pairs of 298 cowpea lines. The low genetic distance may limit the progress in developing superior crop varieties through simple hybridization between accessions. As highlighted earlier, the movements of seeds across geographical areas which promotes genes ow between breeding germplasms can affect existing genetic boundaries, which reduces genetic distance among individuals and populations differentiation.
High F ST mean values (0.48≤ F ST ≤0.56) were observed for the subpopulations, suggesting there is a remarkable level of genetic differentiation of the subpopulations. This was further supported by the results of analysis of molecular variance, with high contribution of within subpopulations variance to the total genetic variation in the germplasm. Previous studies [18,39] also indicated that genetic variation in cowpea is mainly due to within subpopulations variation. Based on the observed heterozygosity (Ho) values, subpopulation 3 (Ho=0.11) and Subpopulation 2 (H=0.10) were the most diverse among all clusters following the Neighbour-Joining clustering and DAPC. Nevertheless, the expected heterozygosity (He) was moderately low and, in general higher than the observed heterozygosity (Ho), for all subpopulations. Fatokun et al. [18] also observed a similar trend in a mini-core subset of the world collection. Low Ho implies a high proportion of inbred lines within sub-populations [1], which is con rmed by the high inbreeding coe cient value (F IS =1).

Patterns of Linkage disequilibrium
Linkage disequilibrium (LD) is very important to investigate in population genetic and genomic studies. On average, a low linkage disequilibrium characterized the cowpea germplasm collection with slow LD decay over long distances of physical distance between markers, and this can be explained by the sel ng nature of the crop, which limits the effectiveness of recombination events and delays LD decay [40]. Deploying techniques such as mutagenesis and crossing between genetically distant individuals can help improve the recombination rate and subsequently increase genetic diversity among the germplasm.
Nonetheless, the LD analysis in line with previous studies [41] indicates a faster LD decay in a population of high number of inbred lines. The faster is the LD decay the better is the genetic mapping resolution [42]. In the present study, LD declined rapidly below 0.2, reaching 0.1 at a distance of 26. 16 kb, suggesting there is potential for genome wide association studies and candidate gene selection [43]. A number of markers distributed on different chromosomes in the genome were in true LD and could serve the purpose. There have been reports of quantitative traits loci (QTLs) and or markers associated with different traits on these chromosomes in cowpea, including pod ber contents [46], perenniality and oral scent, seed size [47] and resistance to biotic stresses (Fusarium oxysporum f.sp. tracheiphilum Race 3, Striga gesnerioides race 1) [44,45]. Besides, some accessions in our collection were reported to have good attributes for important traits such as yield and resistance to ower thrips [48] and bruchid [49]. Hence, in-depth investigation of LD patterns in the germplasm can help to map genomic regions associated with these and other preferred traits in cowpea.

Conclusions
This study used 3127 high-quality DArT-seq SNP markers to genotype and analyse genetic diversity within a large collection of 274 cowpea accessions. Important genetic structuration was observed within the germplasm. Each of the subgroups identi ed exhibits a level of genetic diversity that can be leveraged in developing cowpea varieties with desirable attributes. The subgroups consisted mainly of inbred lines which, although from different geographical regions shared alleles in common that implies signi cant exchange of germplasm between regions. The presence of structure and linkage disequilibrium within the collection provides valuable insights into the future use of the germplasm in genome-wide association studies and its exploitation in cowpea breeding programmes.

Plant materials
The cowpea germplasm comprises 274 accessions from 33 countries (Fig. 1, Additional le 1 extracted from the leaves' tissues using the Nucleomag Plant Genomic DNA extraction kit and, quality and quantity control checked on 0.8% agarose. Genotyping was done using the Diversity Arrays Technology Sequencing (DArT-seq). Genomic DNA Library construction was done using genomic complexity reduction technology [50]. The library was puri ed and quanti ed for cluster generation in an automated clonal ampli cation system (cBOT Illumina) followed by Next Generation Sequencing (NGS) in the sequencer Hiseq 2500 (Illumina). The reads were aligned to the cowpea reference genome Vunguiculata_469_v1.0, publicly accessible on Phytozome [51].
Data quality control, ltering imputation, and markers diversity analysis Data quality control and ltering were performed using R package dartR [52]. SNP markers with more than 20% missing data, having a minor allele frequency (MAF) <0.05 and of unknown position were removed. Data imputation was done using the expectation maximization (EM) algorithm, which recorded the highest simple matching coe cient (SMC= 0.76) among other imputations algorithms as implemented in KDCompute pipeline [53]. Summary statistics of the SNP markers were generated in PowerMarker V3.25 [54]. The computed statistics included allele frequencies, expected heterozygosity (He), observed heterozygosity (Ho) and polymorphic information content (PIC).

Population structure analysis
Filtered SNPs were used to infer the population structure within the germplasm. The structure analysis was performed using the Bayesian clustering approach Bayesian clustering approach in STRUCTURE V2.3.4. [55]. The structure analysis was run considering a burn-in period of 10,000 Markov Chain Monte Carlo (MCMC) iterations and 100,000-run length, with an admixture model following Hardy-Weinberg equilibrium and correlated allele frequencies. Ten independent runs were performed for each value of K (number of clusters), ranging from 1 to 11. The outputs from structure were analysed in Structure Harvester [56], which enabled the identi cation of the best K-value as the distinct peak in the change of likelihood (ΔK). The xation index (FST) of each of cluster was retrieved and interpreted according to Wright [57], considering an FST value above 0.25 as a very large genetic differentiation. The accessions were assigned to their respective cluster using the coe cient of ancestry values generated from the Structure software (Additional le 2), with the assumption that an individual is a true member of a given cluster if its coe cient of ancestry in this cluster is above 0.52 [37].
Discriminant analysis of principal components (DAPC) was performed to con rm the best tting clusters (K) among the cowpea germplasm. DAPC is a multivariate method which uses sequential Kmeans and model selection to infer and describe clusters in populations of genetic related individuals [16]. In this approach, the optimum K was identi ed as the minimum number of clusters after which the Bayesian Information Criterion (BIC) increases or decreases by a negligible amount [16]. DAPC was implemented using adegenet package [58] in R V3.5.0 [59].

Genetic relationships and diversity analysis
To examine the phylogenetic relationships between accessions and con rm the number of clusters, an identity by state (IBS) distance matrix (Additional le 1) was generated in Tassel V5.2.60 [60]. A phylogenetic tree was constructed using the Neighbour-Joining (NJ) algorithm in Darwin V6.0.2 [61], which was exported in FigTree V1.4.3 [62] for annotation. Prior the analysis, the 274 accessions were grouped according to their geographical regions of origin to describe the composition of the identi ed clusters (Table 1). A principal component analysis (PCA) was done in Tassel V5.2.60 which enabled to construct a scatter plot of the cowpea accessions using the rst two principal component axes (PC1 and PC2).
Estimates of genetic diversity parameters (He, Ho, PIC, Ne) of the identi ed subpopulations through the NJ and DAPC clustering approaches were computed using PowerMarker V3.25 and GenAlEx 6.41 [63]. Analysis of molecular variance (AMOVA) was also implemented in GenAlEx 6.41 using the SNP markers and the repartition of the 274 cowpea accessions into different subpopulations as revealed by the clustering analysis. Prior to the analysis, the markers dataset were numerically coded (A = 1, C = 2, T = 3, G = 4, see Additional le 1) as suggested in GenAlEx manual [64].

Linkage disequilibrium analysis
Linkage disequilibrium (LD) analysis was performed among markers across the genome. The LD estimates were generated in R LDcorsSv package [59,65] which provides LD measure (R vs 2 ) corrected by both structure and relatedness between accessions in a population [66]. SNPs markers were pruned to identify markers which are in true linkage disequilibrium in window size of 500kb ( =0.2 threshold) using the R package SNPRelate [67].  Tables   Table 1 Geographical distribution of the 274 cowpea accessions   Table 2 Quality and diversity of SNP markers used to investigate genetic diversity and population structure of the cowpea germplasm