Genotyping pepper varieties using Target SNP-seq reveals that population structure clusters according to fruit shape

Background The widely cultivated pepper (Capsicum spp.) is one of the most diverse vegetables; however, little research has characterized the genetic diversity and relatedness of commercial varieties grown in China. In this study, a panel of single-nucleotide polymorphisms (SNPs) was created that consisted of 97 perfect SNPs, which were identified using re-sequencing data from 35 diverse C. annuum lines. Based on this panel, a Target SNP-seq was designed that combined the multiplex amplification of the perfect SNPs with Illumina sequencing to detect polymorphisms across 271 commercial pepper varieties. Results The perfect SNPs panel had a high discriminating capacity due to the average value of polymorphism information content (PIC), observed heterozygosity (Ho), expected heterozygosity (He), and minor allele frequency (MAF), which were 0.31, 0.28, 0.4, and 0.31, respectively. Notably, the studied pepper varieties were morphologically categorized based on fruit shape; blocky, long horn, short horn, and linear-fruited. The long horn-fruited population exhibited the most genetic diversity followed by the short horn, linear, and blocky-fruited populations. A set of 35 core SNPs were then used as KASPar markers, another robust genotyping technique for variety identification. Analysis of genetic relatedness using principal component analysis (PCA) and phylogenetic tree construction indicated that the four fruit shape populations clustered separately with limited overlaps. Based on STRUCTURE clustering, it was possible to divide the varieties into five subpopulations, which correlated with fruit shape. Further, the subpopulations were statistically different according to a randomization test and Fst statistics. Notably, two SNP loci, CaSNP118 and CaSNP053, which are located on chromosome 11 and 6 were significantly associated with fruit shape (p < 1.0 × 10 -4) Conclusions Target SNP-seq developed in this study appears as an efficient power tool to detect the genetic diversity, population relatedness and molecular breeding in pepper. Moreover, this study demonstrates that the genetic structure of the pepper varieties is significantly influenced by breeding programs focused on fruit shape.


Abstract
Background The widely cultivated pepper (Capsicum spp.) is one of the most diverse vegetables; however, little research has characterized the genetic diversity and relatedness of commercial varieties grown in China. In this study, a panel of single-nucleotide polymorphisms (SNPs) was created that consisted of 97 perfect SNPs, which were identified using re-sequencing data from 35 diverse C. annuum lines. Based on this panel, a Target SNP-seq was designed that combined the multiplex amplification of the perfect SNPs with Illumina sequencing to detect polymorphisms across 271 commercial pepper varieties.
Results The perfect SNPs panel had a high discriminating capacity due to the average value of polymorphism information content (PIC), observed heterozygosity (Ho), expected heterozygosity (He), and minor allele frequency (MAF), which were 0.31, 0.28, 0.4, and 0.31, respectively. Notably, the studied pepper varieties were morphologically categorized based on fruit shape; blocky, long horn, short horn, and linear-fruited. The long horn-fruited population exhibited the most genetic diversity followed by the short horn, linear, and blocky-fruited populations. A set of 35 core SNPs were then used as KASPar markers, another robust genotyping technique for variety identification. Analysis of genetic relatedness using principal component analysis (PCA) and phylogenetic tree construction indicated that the four fruit shape populations clustered separately with limited overlaps. Based on STRUCTURE clustering, it was possible to divide the varieties into five subpopulations, which correlated with fruit shape. Further, the subpopulations were statistically different according to a randomization test and Fst statistics. Notably, two SNP loci, CaSNP118 and CaSNP053, which are located on chromosome 11 and 6 were significantly associated with fruit shape (p < 1.0 × 10 -4) Conclusions Target SNP-seq developed in this study appears as an efficient power tool to detect the genetic diversity, population relatedness and molecular breeding in pepper. Moreover, this study demonstrates that the genetic structure of the pepper varieties is significantly influenced by breeding programs focused on fruit shape.

Background
Pepper are members of the genus Capsicum, which originated in South America and represents one of the most economically important vegetable crops worldwide [1][2][3]. To date, 38 species of Capsicum have been reported (USDA-ARS, 2011). Of these, C. annuum, C. frutescens, C. chinense, C. baccatum, and C. pubescens are thought to have been domesticated [4]. Globally, the most predominant species is C. annuum, which has numerous commercial varieties varying greatly in size, shape, pungency, and color.
As the seed trade has developed and globalized, the commercial quality of seeds, which is based on authenticity and purity, has become increasingly important [5]. Traditionally, cultivar characterization was completed by field investigation of morphological traits; however, this process is time-consuming and labor-intensive and is thus not suitable for modern inspection demands [6]. A more highthroughput approach to distinguishing varieties is the used of molecular markers [5]. Indeed, genetic markers have been used for DNA fingerprinting, diversity analysis, variety identification, and markerassisted breeding of multiple commercial crops [7,8]. Moreover, several PCR-based tools have been used to detect genetic diversity in peppers, including random amplified polymorphic (RAPD), restriction fragment length polymorphism (RFLP), and amplified fragment length polymorphism (AFLP) [9][10][11][12].
Recently, the genomes of two C. annuum cultivars, Zunla-1 and CM334, were sequenced [3,13], which provided an important platform for the detection and development of genome-wide simple sequence repeats (SSR) and insertion or deletion (InDel) markers [14][15][16]. Although a large number of SSR and InDel markers have become available, these technologies are not suitable for large scale germplasm characterization. Thus, there is an unmet need for an efficient, rapid, and high-throughput system capable of characterizing thousands of germplasm.
One approach for meeting such high standards is the use of single nucleotide polymorphisms (SNPs), which are good markers for genotyping because of their whole genome coverage and primarily biallelic nature. Accordingly, multiple high-throughput SNP genotyping platforms have been developed, including the GoldenGate [17] and Infinium [18], TaqMan [19], and KASPar platform (KBiosciences, www.kbioscience.co.uk). However, SNP marker genotyping is considered expensive as it requires a comprehensive technical platform, equipment, and reagents.
Genotyping by target sequencing (GBTS) is a targeted sequence-capture strategy that is able to genotype more than thousands of SSRs or SNPs through the use of high throughput-sequencing technology. The two main types of GBTS are multiplex PCR and probe-in-solution-based target sequencing; the technology has been commercialized as AmpliSeq [20], NimbleGen [21], SureSelect [22], GenoBaits, and GenoPlexs [23]. To date, this technology has been widely used for medical applications but has rarely been used for agriculture species. However, a Target SSR-seq technique, which is a multiplex PCR-based approach, was successfully applied to the study of genetic diversity and structure in 382 cucumber varieties [24]. The results of this study demonstrated that GBTS is a customizable, flexible, high-throughput, low cost, and accurate sequencing tool.
Until now, the genetic diversity of domesticated Capsicum species has primarily been investigated using SSR markers [25][26][27], and genetic maps have been constructed with SSRs and InDels based on inter-or intraspecific populations [15,16,28]. However, discovery efforts for genome-wide SNPs have lagged significantly behind those for SSRs and InDels, and studies on genetic diversity of pepper varieties are limited. The objectives of the present work were: 1) to develop a Target SNP-seq technique suitable for genotyping pepper varieties; 2) to characterize composite core-SNP markers for use with the KASPar platform to maximize variety identification; 3) to examine the level of genetic diversity, structure, and differentiation within 271 pepper varieties. This study demonstrated that a novel Target SNP-seq can be used as a rapid and efficient tool for genotyping peppers, and the genetic structure of these cultivated varieties have been strongly impacted by breeding programs that select for fruit shapes.

Results
Genome-wide perfect SNPs used for Target SNP-seq Re-sequencing of the 31 pepper lines in this study generated a total of 872 Gb of paired-end sequence data, at an average depth of ~ 8.4. Following mapping to the Zunla-1 genome, 3,613,192 high-quality SNPs were detected across the genomic sequences of the 31 re-sequenced lines and four previously published cultivars (Dempsey, Zunla-1, Perennial, and Chiltepin). Using the cultivar progenitor Chiltepin as an out-group, the phylogenetic tree showed that pepper lines could be generally be classified according to fruit shapes, with the exception of three long horn-fruited lines that grouped with the linear-fruited lines. Based on the genetic distance, the transition in fruit shapes were from Chiltepin-like peppers followed by linear-fruited, short horn-fruited, long horn-fruited, finally to blocky-fruited peppers, which were the furthest from the Chiltepin-like peppers (Fig. 1A).
Furthermore, the 35 lines can be divided into two major groups based on the optimal number of K=2 by STRUCTURE (Fig. 1B); Group 1 consisted of the nine bell-fruited lines and 10 of the long hornfruited lines, whereas the remaining peppers, including three long horn, all the linear, and all the short horn-fruited, as well as PI640446, Perennial, and Chiltepin were assigned to Group 2.
Given that pepper genomes are highly repetitive, strict criteria were used to identify the perfect SNPs.
In total, 521 perfect SNPs were identified, and 92 that were evenly distributed across the genome ( Fig. 2; Table S2) were selected as multiplex PCR targets.
Genotyping analysis of pepper varieties using Target SNP-seq In total, 288 pepper varieties were genotyped using the Target SNP-seq. Samples that were missing more than five of the 92 loci were removed from the analyses. The final panel contained 271 varieties, including 90 blocky, 113 long horn, 25 short horn, and 43 linear-fruited varieties (Table S1).
A total of 55.9 million reads were generated from the 271 varieties, and the average Target read depth was 2064, approximately 82% of the samples were sequenced at a depth greater than 1000 × ( Fig. S2A). Among the 271 varieties, 238 varieties (87.8%) aligned to the Zunla-1 genome at a rate of more than 90% (Fig. S2B). Of these aligned reads, 221 varieties (81.5%) exhibited an align rate to the target SNP region of over 80% (Fig. S2C). Furthermore, the Target SNP-seq uniformity index was analyzed, which was used to calculate the proportion of the coverage above 10% of the mean depth value for each variety. The average uniformity index in this study was 89.5% (Fig. S2D), indicating a high level of accuracy.

Perfect SNPs in 271 pepper varieties
The genetic parameters, MAF, Ho, He, and PIC revealed by each perfect SNP are given in Table S3. MAF is a measure of the discriminating ability of the markers; as such, the closer the MAF is to 0.5 for biallelic markers, the better discriminatory properties. In this study, 28.26% of perfect SNPs showed an MAF between 0.4 and 0.5, whereas only four SNP had MAF below 0.1 (Fig. 3A). The Ho value of each SNP ranged from 0.01 (CaSNP079) to 0.59 (CaSNP009) with an average of 0. 28

Perfect SNPs across the fruit shapes
The average values of the genetic parameters across the four fruit shape populations were also compared for genetic diversity, and the results showed that the blocky-fruited population had the lowest average values for He (0.18), Ho (0.16), and PIC (0.15) ( Table 1)

Identification of a core SNP Set
The perfect SNPs panel distinguished 97.7% of the 271 pepper varieties (Fig. 3), the remaining displayed the same multilocus genotypes that were also difficult to distinguish from field phenotypes.
Given that some varieties may exist with multiple names, the varieties with the same genotypes may be redundant and were discarded to build non-redundant genotype varieties. Thus, a minimum of 27 of the perfect SNPs could distinguish between all the non-redundant varieties (Fig. 3).
To develop a core SNP set for the KASPar platform, each perfect SNP marker was tested on a set of 23 pepper varieties with two allele-specific forward primers and one common reverse primer. The results show that 35 SNP primers produced consistent and repeatable results with Target SNP-seq. Finally, 35 SNPs with a high discrimination power of up to 97% across all varieties and 100% in non-redundant varieties were proposed as a core SNPs set for use with the KASPar platform ( Fig. 3; Table S4).
Genetic structure in pepper varieties PCA was performed using the 92 perfect SNPs to investigate population clusters across the 271 varieties (Fig. 4A). Accordingly, the PCA plot indicates that the four fruit shape populations generally clustered separately. The distribution of blocky-fruited varieties was very concentrated, whereas that of the long horn-fruited varieties was relatively dispersed. Linear-fruited varieties showed more affinity to the short horn than either long horn or blocky-fruited varieties. Linear and blocky-fruited populations were the most diverse, and these clusters did not overlap, suggesting considerable genetic divergence over their breeding programs. Notably, a selection of both long horn-and shortfruited varieties showed affinities to the linear-fruited population.
The population structure of the 271 varieties was further inferred using the cluster program, STRUCTURE, through gradually increasing the number of clusters (K). The Evanno's correction [29] showed the peak of delta K at K=2, which suggests the presence of two main populations, denoted as Pop1 and Pop2. Pop1 comprised 160 varieties (59.0%), containing all blocky-fruited varieties, 60.2% of the long horn, and only two linear-fruited varieties ( Fig. 4B; Table S1). The remaining 111 varieties (41.0%) were assigned to Pop2, which included all the short horn-, and linear-fruited varieties, as well as 39.8% of long horn-fruited varieties ( Fig. 4B; Table S1). When K = 3, the Pop1 was subdivided into two clusters, blocky or long horn-fruited types. At K = 4, a mixture of 56% of short horn-fruited, 15 long horn and two liner-fruited varieties were assigned to a new cluster from Pop2, and these short horn-fruited, as well as a new long horn-fruited group, were assigned to independent clusters, respectively, when K = 5. Of note, linear-fruited types were never assigned to an independent cluster as K was increased. Considering the classification of populations appeared highly correlated with fruit types when K = 5, the two main populations were further subdivided into five subpopulations (Subpop1~Subpop5; Fig. 4B; Table S1). Subpop1, 2, 3, and 4 displayed a clearly cut structure with no or very few admixtures. Subpop1 comprised 98 varieties, 90 of which belong to blocky-fruited and the remaining eight long horn-fruited varieties. Long horn-fruited varieties were members of both Subpop2 and Subpop3, which is not surprising as long horn-fruited varieties were distributed across In summary, three independent analysis methods strongly support the division of pepper varieties into five well-differentiated genetic populations, which correlated with distinct fruit shapes, indicating that the genetic structure of these cultivated varieties may have been strongly impacted by the fruit shape selection of the breeding programs.

Genetic variation assessment of pepper populations
Comparison of the results between Pop1 and Pop2 using AMOVA revealed that 33.04% of the total genetic variation was partitioned among Pops, 8.47% within Pops, and the remaining 58.49% within varieties ( Table 2). AMOVA analysis of the five Subpops further indicated that the maximum variation (63.83%) occurred within varieties, the minimum variation (3.54%) was accounted for within Subpops, and 32.63% of the variation occurred between Subpops (Table 2), suggesting relatively moderate differentiation among Subpops.
To test for significant variation between Pops and among Subpops, a randomization test and pairwise F st estimation were performed. From the output, we can see four histograms representing the distribution of the randomized strata (Fig. 5). The observed results in the output show significant differentiation of the structure of Pops and Subpops considering all levels of the Pops and Subpops strata (Fig. 5). These results also support the separation of the varieties into two Pops and five Subpops. Furthermore, pairwise estimates of F st demonstrated that population differentiation between Pop1 and Pop2 is high (F st = 0.35). The pairwise F st between the five Subpops ranged from 0.13 between Subpop2 and Subpop3 (both consist largely of long horn-fruited) to 0.48 between Subpop1(mostly blocky-fruited) and Subpop4 (short horn-fruited) ( Table 3). Notably, a high genetic differentiation (F st = 0.43) was also shown between Subpop1 and Subpop5 (mostly consisting of linear-fruited). Subpop4 also had very low genetic differentiation from Subpop5 (F st = 0.14).
Identification of SNPs associated with fruit shape A total of 21 SNP loci did not indicate any diversity (PIC = 0) within certain fruit populations, of which 16, 1, 3, and 5 loci were for blocky, long horn, short horn, and linear-fruited population, respectively (Table S3). These fruit shape-specific loci may have been under selection during breeding or were selected due to linkage with genes that are determinative of fruit traits. Using the MLM K+Q model and Bonferroni correction for p-value, two SNP loci, CaSNP118 and CaSNP053, were identified as significantly associated with fruit shape across 271 pepper varieties (p < 1.0 × 10 -4 ). To match the associations with previously identified quantitative trait loci (QTL), the physical position of the two SNP loci in the both the reference genome of Zunla-1 and CM334 are given in Table 4. The two associated SNP loci were located on chromosomes 11 and 6, respectively; and contributed 13.6% and 9.2% of the phenotypic variation, respectively. In total, 252 and 1217 annotated genes were identified in the associated region of CaSNP118 and CaSNP053, respectively (Table S5).

Discussion
High-throughput genotyping by Target

SNP-seq
High-throughput genotyping technology has become essential for effective crop breeding programs.
Target SSR-seq, which combined the multiplexed amplification of perfect SSRs with high-throughput sequencing, was recently developed and applied to the identification of cucumber varieties, leading to the characterization of a set of core SSRs [24]. This sequencing technology can acquire thousands of data points in under 72 hours, costs less than $7/sample, and is associated with genotyping accuracy up to 100% due to the high coverage. In this study, re-sequencing tools were used to identify 92 perfect SNPs from the genomes of 35 C. annuum. The identified SNPs were then used for Target SNPseq to assess genetic diversity across 271 pepper varieties that are popular in China.The results show that the perfect SNPs panel has a high discriminating capacity for varieties as 71.74 % of the perfect SNPs had PIC values > 0.30 (Table S3). Further, a minimum of 27 of the perfect SNPs could distinguish between all the non-redundant varieties (Fig. 3). Notably, the mean PIC value was found to be 0.31, which is low in comparison to the values derived from studies using SSR markers [27,30].
These discrepancies may be explained by considering the nature of the different types of markers; SSRs are multiallelic and more polymorphic than SNP markers, which are biallelic [31].
A set of 35 core SNPs that had the same discrimination power as the 92 perfect SNPs was successfully converted into KASPar markers, representing another robust genotyping choice for pepper varieties ( Fig. 3; Table S4). Unlike SSR markers, SNP markers do not require reference cultivars be included in each experiment and will also overcome the confusion between labs with regard to SSR alleles.
Population structure among inbred C. annuum lines Since their initial domestication in Mexico, peppers have been under strong selection for fruit shape and size [32]. Consumption habits and pepper type preference vary globally. In the US alone, more than 20 market types are recognized and consumed [33]. In China, a majority of the pepper varieties commercially cultivated belong to the species C. annuum, and the market types are classified by fruit shapes, such as the popular blocky, long horn, short horn, and linear-fruits [34,35]. To date, most experiments have evaluated the genetic relationships among several Capsicum species [36,37] or the genetic diversity of C. chinense germplasm from relatively restricted regions [38]. The phylogenetic analysis based on molecular markers and pan-genome confirmed that C. chinense and C. frutescens are more closely related to each other than to C. annuum [37,39]. Only a few studies have attempted to characterize the population relatedness in cultivated C. annuum [31,40,41].
Notably, the relationships among the 35 re-sequenced C. annuum lines described in this study align with previous reports grouping C. annuum according to fruit traits [40]. Further, clustering of the blocky-fruited varieties in the most derived positions relative to small hot Chiltepin-like types can also be observed in previous studies [40,41].
Genetic structure among C. annuum varieties Although previous work has demonstrated that a small population of C. annuum lines clustered according to the fruit shapes, the relationships among the commercially important C. annuum varieties from different companies have remained unclear. In the present study, the relationships among four fruit shape populations were assessed across a broad range of pepper varieties cultivated in China. Comparison of the genetic parameters showed the lowest Ho was observed within the blocky-fruited population, while the highest was detected in the horn-fruited population (Table 1).
These findings agree with the earlier studies that found a reduction in diversity was associated with non-pungent blocky-fruited lines relative to pungent lines [41][42][43]. The narrow genetic diversity associated with the blocky-fruited varieties may be a consequence of inbreeding with a limited gene pool.
Additionally, the PCA and phylogenetic tree demonstrated that the four fruit shape populations clustered separately with a little or no overlap. This aligns with the fruit shape classification system and demonstrates that the genetic structure of pepper varieties in China has been significantly influenced by breeding programs that select for fruit shape. Similarly, STRUCTURE analysis grouped the varieties into two main populations, Pop1 and Pop2, which were further divided into five subpopulations, Subpop1 to Subpop5 (Fig. 4B). Moreover, the subpopulations correlated with fruit shape. Notably, Subpop1, Subpop4, and Subpop5 corresponded to the blocky, short horn, and linearfruited varieties, respectively. However, the majority of the long horn-fruited varieties were divided into two subpopulations, Subpop2 and Subpop3, which were statistically unique (Fig. 5) Identification of associated loci and candidate genes for fruit shape Fruit shape is a complex trait controlled by multiple genes. QTL analyses have been previously used to study fruit shape trait by predicting the regions of the genome that affect the trait and estimating the effect of each region. The first fruit-shape QTL in peppers, named fs10.1, was detected on linkage group 3 [44]. Recently, Chunthawodtiporn et al. [45] detected one fruit shape QTL, which was located at 85.3 cM on chromosome 2 and accounted for 14.9% of the phenotypic variation. In the present study, two novel SNP loci, CaSNP118 and CaSNP053, were significantly associated with fruit shape and were located on the end of chromosome 11 and 6, respectively. Combined, these two SNPs account for a total of 22.8% of the phenotypic variation associated with fruit shape (Table 4).
Within the C. annuum genome, CaSNP118 is located between SNP loci CaSNP098 and CaSNP095, which covers approximately 44 Mb and has 252 annotated genes; CaSNP053 is located between SNP loci CaSNP054 and CaSNP052, which covers approximately 30 Mb and has 1217 annotated genes (Table S5). Therefore, many predicted genes could be considered as candidate fruit-shape controlling gens based on their predicted roles in protein function. For example, CA11g12200 is calmodulindomain protein kinase gene, CA06g15400 and CA06g21580 are both Ovate family proteins, and CA06g20780 is a homeodomain-like superfamily protein. Each of these genes is a candidate for fruit shape determination in tomato. Ovate family proteins control fruit shape transformation from round to pear-shaped fruit, WUSCHEL encodes a homeodomain transcription factor that controls meristem size and locule number, and SUN encods a member of the IQD family of calmodulin-binding proteins that leads to fruit elongation [46,47].
Future directions of Target SNP-seq in pepper Of note, SSR or InDel loci that are suitable for specific primer design could be added to the perfect SNP panel used in our Target SNP-seq. For example, we demonstrated that disease resistance markers, such as resistance genes for Tobamovirus [48], Phytophthora capsici [49], bacterial spot [50], and potato virus Y [51], can be successfully added to the perfect SNP panel for Target SNP-seq (data not shown). The commercial application of this technique has the potential to increase the efficiency of marker-associated selection programs, as well as aiding in variety identification.

Conclusion
The Target SNP-seq developed in this study is a high-throughput and reliable tool for the investigation of genetic diversity, variety identification, and characterization of population structure in peppers. The use of PCA, phylogenetic tree generation, and STRUCTURE demonstrates that the genetic structure of commercially available pepper varieties in China has been significantly influenced by breeding programs that select for fruit shape. Finally, association analysis of a limited number of SNPs allowed for the identification of novel genomic regions and candidate genes that control fruit shape.

Methods
Plant materials, fruit shape categorization, and DNA extraction A collection of 288 commercial pepper varieties (Table S1), which were acquired from 60 different seed companies in China, were used for genetic identification. These pepper varieties are the most popular throughout China.

Plants were grown under greenhouse conditions at the Vegetable Varieties Exhibition center in the
Tongzhou district of Beijing. Fruit shape was categorized into one of four fruit shapes; blocky-fruited, long horn-fruited, short horn-fruited, and linear-fruited (Table S1). The classification examples for the fruit shapes are presented in Fig. S1.
First true leaves from 30 independent plants, which were identified based on the National Varieties Identification Standard, were collected and mixed to extract DNA using a CTAB-based method [52].
The DNA quality was assessed using 1.2% (w/v) agarose gel electrophoresis, and the concentration was determined using a Nanodrop 2000 Spectrophotometer (Thermo Fisher Scientific, DE, USA).

Re-sequencing and SNP identification
In total, 31 diverse pepper lines (C. annuum), including 30 inbred lines from our ongoing breeding programs and small hot PI640446 provided by the U.S. National Plant Germplasm System (NPGS), were selected for re-sequencing on the Illumina X Ten platform at Shanghai Majorbio Biopharm Technology Co. Ltd. (Shanghai, China). The inbred lines had diverse genetic backgrounds and horticultural traits, including eight blocky-fruited lines, 13 long horn-fruited lines, five short hornfruited lines and four linear-fruited lines (Fig. 1).
Perennial and C. annuum cv. Dempsey [13], were filtered into clean data using Trimmomatic [53]. The clean reads were then mapped to the reference genome of Zunla-1 [3] using the Burrows-Wheeler Alignment Tool (BWA), and SNPs were called using the Genome Analysis Toolkit (GATK, v2.4-7g5e89f01) [54]. SNPs with minor allele frequency (MAF) > 5% and missing data < 10% were imported into MEGA to build the rooted phylogenetic tree using the cultivar progenitor, Chiltepin, as an out-group with the neighbor-joining method [55]. Population structure analysis was completed using STRUCTURE v2.3. The number of populations (K) was determined following the standard procedure [56] with a burn-in period of 100,000 iterations and Markov Chain Monte Carlo of 100,000.
Twenty independent runs were performed for K varying from 1 to 15. The optimum K was defined according to the Evanno's correction method [29].
To acquire a dataset of genome-wide SNPs for subsequent Target SNP-seq analysis, perfect SNPs were identified by the following criteria: (i) minor allele frequency (MAF) > 0.4 to filter out uninformative SNPs; (ii) miss rate < 0.2; (iii) heterozygosity < 0.2; (iv) no sequence variation in the 100 bp flanking sequence of the SNP locus; and (v) the number of alleles per locus was two for the SNP.
Target SNP-seq The Target SNP-seq procedure was completed as previously described using the SNPs identified above [24]. In brief, library construction for Target

SNP genotype analysis of Target SNP-seq
The raw data from the Target SNP-seq was de-multiplexed to determine the exact genotypes for each variety based on the sample-specific barcodes using the Illumina bcl2fastq pipeline (Illumina, San Diogo, CA, USA). Clean data were filtered out using Trimmomatic, and the reads of each variety were mapped to the pepper reference genome of Zunla-1 [3] using BWA with default parameters. SNP genotypes were called using GATK. Based on the high-throughput sequencing results, the SNP alleles with the maximum numbers of reads and the second maximum numbers of reads were treated as the major and minor allele for each target SNP locus When the read frequency of the major allele was more than 0.7, the locus was described as homozygous. If the read frequencies of the major and minor allele were both more than 0.35, the locus was described as heterozygous.
Determination of genetic parameters for each perfect SNP Genetic parameter statistics of the perfect SNPs, including the observed heterozygosity (H o ), expected heterozygosity (H e ), and polymorphism information content (PIC) [57] were calculated using a Perl script with the following equation: [Due to technical limitations the formula could not be inserted here. It can be found in the supplemental files titled "Formula 1 -Heterozygosity"] where l is the allele locus, and P i and P j represent the population frequency of the i th and j th allele. The chromosomal distribution of the perfect SNPs was mapped using Circos software (http://circos.ca/) with the SNP region magnified to 2 Mb.

Genetic structure analysis
Genetic relationships among varieties were investigated using three different methods: principal component analysis (PCA), STRUCTURE, and a phylogenetic tree. PCA was carried out using the FactoMineR package in R [58]. The Bayesian-based model procedure implemented by STRUCTURE v2.3 [56,59] was also used to determine population structure. The number of populations (K) was determined as described above, and the unrooted phylogenetic tree was constructed using the Ape and Poppr packages in R based on the neighbor-joining method with the tree viewed using MEGA v5.1 [60,61].

Population diversity analysis
The different fruit shape populations, as well as the subpopulations inferred from STRUCTURE, Ho, He, PIC, and MAF analyses, were calculated using the methods mentioned above. To measure genetic differences between pairs of subpopulations, the analysis of molecular variance (AMOVA) and the pairwise F st was performed using the Poppr and Hierfstat R packages, respectively [61].

Core SNPs set for variety discrimination
To develop a set of core SNPs that discriminates between varieties using the KASPar platform, two allele-specific forward primers and one common reverse primer were designed for each perfect SNP marker. The 23 commercial varieties were then used to assess the potential utility of the SNPs markers through the KASPar platform; fluorescence was detected as previously described [62].
Detailed instructions are available at www.kbioscience.co.uk.
A Perl script was used to determine the core-SNPs set from the successfully verified SNP markers according to the following [24]: 1) the highest discernible SNP marker was chosen as an initial core dataset and each SNP marker was subsequently added to the initial core dataset to form a new dataset; 2) the second SNP marker was chosen from the new dataset with the highest discernibility, and added to the core dataset; 3) the steps were repeated until the maximum discernibility was reached. Finally, the SNP markers associated with the maximum variety discrimination were identified as the core-SNP set.

Association analysis
Fruit shape for each variety was scaled from 1 to 4, with 1 referring to blocky, 2 to long horn, 3 to short horn, and 4 to linear-fruited varieties (Table S1; Fig. S1). The software program TASSEL 5.2.25 was used for association analysis. A mixed linear model (MLM) that considered both fruit shape populations (Q matrix) and the kinship matrix (K matrix), and a general linear model (GLM) using fruit shape populations (Q matrix) as a fixed factor were used for association identification of loci conferring fruit shape. Significance of marker-trait association was indicated when the p-value was less than 10 -4 . Because it has been popularly proved that the MLM+Q+K model is a more effective approach than other models for detecting loci [63,64], only data from the MLM+Q+K model is presented in this study. The phenotypic variation explained by each perfect SNP was the R 2 -value obtained from the GLM model. Candidate genes between the nearest up-and down-stream SNP loci to the significantly associated loci were identified from the protein annotation published using the CM334 genome [13].

Availability of data and materials
The raw sequence data reported in this paper have been deposited in the Genome Sequence Archive  b The numbers in parentheses refer to the numerical ranking of diversity in descending order.   Table 4. SNP loci significantly associated with fruit shape as identified by association analysis.     Table S1. Classification and information on the pepper varieties used in this study.      Population structure inferred using STRUCTURE. All the varieties were divided into two main populations (Pop1 and Pop2) when K=2, which was the optimal K. The populations were subdivided into five subpopulations, Subpop1~Subpop5, which correlated with fruit shape.
(C) Phylogenetic tree analysis. The tree was produced using the neighbor-joining method based on the 92 perfect SNPs. The scale bar indicates the simple matching distance.

Supplementary Files
This is a list of supplementary files associated with this preprint. Click to download.