Association mapping, trait variation, interaction and population structure analysis in cucumber (Cucumis sativus L.)

In several regions of the world, low productivity in this crop is attributed to several factors including poor understanding of the genomic complexity of important traits associated with fruit quality and yield. Therefore, genome wide association analysis was performed for important traits using simple sequence repeats (SSR) markers. Significant variation was recorded for all the studied traits in 78 cucumber genotypes under two environments (open field and net house) which indicated that the constituted association panel was suitable for association mapping. Genotyping was done using 60 highly polymorphic SSRs. By performing genome scanning out of 60 SSR markers, using mixed linear model (MLM) approach 4 and 6 markers explained an average of 23.93% and 17.37% of the trait variation under net house and open field condition, respectively. Based on MLM approach two markers on 3rd chromosome (UW084942) and 4th chromosome (UW062953) found to be associated with the average fruit weight (g) under both net house and open field condition. Population structure analysis revealed four distinct sub-populations that corroborated with the geographical origin as well as fruit quality and quantitative traits. The four sub-populations (A–D) had fixation index percentage equal to 24.35 29.48, 37.17 and 8.97 respectively, supporting the existence of moderate population structures. Therefore, the extensive phenotypic and genotypic characterization, population structure, and markers associated with important traits provided in this study will facilitate marker assisted improvement programs in cucumber.


Introduction
Cucumber (Cucumis sativus L. 2n = 2x = 14) is one of the most important member of the family Cucurbitaceae. Wide variation for most of the traits including growth habit, sex expression, fruit size and colour were recorded in Indian germplasm (Sebastian et al. 2010). As a result of continuous selection on the basis of local preferences, a large number of landraces and feral forms have been accumulated in different growing regions throughout the world (Grumet et al. 2021). Most of the commercially important traits of cultivated crops are quantitatively inherited and hence, genes responsible for inheritance of these traits are difficult to detect. Cucumber has a narrow genetic base with 3-12% polymorphism (Staub et al. 2005;Yang et al. 2013). Cucumber is considered as a model plant for genetic research because of its smaller genome size (367 Mb), rich diversity of gene expression and short life cycle (Ren et al. 2009). A large number of SSR markers from whole genome sequences have been developed (Cavagnaro et al. 2010;Ren et al. 2009). Draft genomes of three cucumber lines ('9930' 'Gy14' and 'B10') are now available and they are instrumental in understanding the structure and organization of cucumber genome (Yang et al. 2013).
The genetic variance for important horticultural yield traits is directly related to the genetic distance between the parents. Genome mapping has also indicated that the frequency of polymorphism increases progressively with increased genetic distance between the parents . In order to use marker-assisted selection (MAS) effectively, the saturated linkage maps (5-10 cM) must be constructed and judiciously applied to maximize cost-benefit ratio (Staub et al. 2005). SSRs were used in genetic mapping, MAS, genetic diversity (Dar et al. 2017;Jat et al. 2018;Kumar et al. 2020) and population structure analysis in cucumber (Dar et al. 2017). The sequencing of the whole cucumber genome (Huang et al. 2009) made it possible to use more breeder friendly molecular markers such as SSRs for genetic mapping and MAS in cucumber. The advantages of these cucumber SSR markers have already been reported in several recent linkage mapping and marker-trait association studies Weng et al. 2010). But only a few important genes or QTLs of cucumber have been mapped with these markers (Yuan et al. 2008a, b;Miao et al. 2011).
Presently, association mapping approach is used in identifying major QTLs and candidate genes for array of economically important traits in different crop species. This approach offers unique opportunities to exploit genetic variation in natural populations with high-resolution mapping of complex agronomic traits (Zhu et al. 2008). Association mapping relies on the linkage disequilibrium (LD) which is maintained over generations between loci which are genetically linked to one another (Neumann et al. 2011). It has been identified as a tool to resolve complex trait variations down to the sequence level (Nordborg and Tavare 2002), and to identify novel mutations causing specific phenotypes (Palaisa et al. 2004). SSRs are used quite often in identifying complex traits in different crop species.
Apart from the use of SSR markers, several analytical advances are available for the identification of loci governing particular traits. These advances include the stratification of population structure and statistical models such as generalized linear model (GLM) and mixed linear model (MLM) approaches. A unified, mixed-model approach for association mapping combined with a population structure analysis is a dependable and robust system for identifying reliable quantitative trait loci (Yu and Buckler 2006). The present study focused on analyzing the population structure of 78 cucumber genotypes and identifying linked molecular markers for earliness and yield traits. These marker-QTL associations identified in our study would be immensely useful in cucumber improvement through MAS.

Plant materials
The experimental material for the present study consisted of 78 diverse cucumber genotypes comprising of Indian as well as exotic germplasm maintained at Division of Vegetable Science, ICAR-IARI, New Delhi (Supplementary Table 1).

Phenotyping of the germplasm
These 78 genotypes were evaluated in net house during rainy season (August to November, 2017) and for two consecutive seasons under open field condition i.e. rainy season (August to November, 2017) and summer season (February to May, 2018). Out of total 15 plants grown in each genotype, 5 plants were randomly chosen and tagged to record data of 10 important quantitative traits under open field condition (Table 1). However, under net house we were able to record data on 8 traits ( Table 2). Data of 2 seasons recorded under open field condition were pooled for analysis. Data of all the traits were recorded with three replications using a randomized block design.

Genomic DNA extraction
From fresh young leaves of the 78 diverse cucumber genotypes genomic DNA was extracted following a modified CTAB method (Doyle and Doyle 1990). DNA quality was checked through 0.8% agarose gel electrophoresis and quantity by using spectrophotometer at wavelength of 260 nm, and stored at − 20 °C for future use.

PCR amplification
Sixty SSR markers (Supplementary Table 2) distributed across the cucumber genome were used in this study (Yang et al. 2013). The PCR reactions were carried out in a volume of 20 μl reaction mixture each containing 2 μl of 10 × reaction buffer, 0.30 μl of 10 mM dNTPs, 1 μl each of forward and reverse primers, 2.0 μl of template genomic DNA (50 ng), 0.2 μl of Taq DNA polymerase (0.75 U) and 13.5 μl molecular grade water. The PCR thermal profiles were: DNA denaturation at 94 °C for 4 min followed by 35 cycles of 94 °C for 30 s, annealing temperature at 52-59 °C for 1 min and extension at 72 °C for 1 min, finally 72 °C for a final extension of 10 min and cooling at 4 0 C. The PCR products were resolved by electrophoresis in 3.0% agarose gels containing 0.1 μg/ml ethidium bromide in 1 × TBE buffer at 120 Volts for 3 h. After electrophoresis DNA fragments were visualized and documented using ALPHA IMAGER gel documentation system (Alpha Innotech, USA).

Statistical analysis
The data were analyzed using the ANOVA procedure of SAS 9.2 (SAS Institute, Cary, NC, USA) to derive their summary statistics including mean, range, standard deviation, variance, and coefficient of variation. The positive and negative correlations among the yield and related traits in 78 cucumber genotypes were measured using Pearson correlation coefficient at 1 and 5% levels of significance in SPSS for Windows. The fixed effect model was used for combined analysis of variance for 8 phenotypic traits including genotype (G) × environment (E) interaction for both the environments (net house and open field). Genetic and environmental effects as well as the genetic variation among the genotypes were measured according to the various genetic parameters developed by Burton (1951), Burton and DeVane (1953) and Johnson et al. (1955). For population structure analysis, we assumed an admixed model with a uniform prior probability and independent allele frequency of the number of populations, K. All the runs with 50,000 MCMC replicates after a burn-in of 50,000 replicates were conducted for K = 2-10. Five independent runs were performed for each value of K to generate our estimate of the true number of sub-populations as reported by Pritchard et al. (2000). The 'Structure harvester' program (http:// taylo r0. biolo gy. ucla. edu) was used to obtain the final K value (K = 4 was optimum for this analysis), based on both the LnP (D) and Evanno's DK (Evanno et al. 2005;Earl et al. 2012). Markers which were exhibiting a probability value (p value) less than 0.02 thresholds were considered significantly associated with the particular phenotypic trait. The association of each alleles of the locus with the trait of interest studied under experiment was tested using two approaches viz. GLM and MLM (Yu et al. 2006), using the TASSEL v2.0.1 software (Bradbury et al., 2007). The MLM approach exhibited the least variation in observed p values from expected p values in the quartile-quartile plot compared with the variation of the Q (population structure) or K (kinship) model in GLM approach.

Phenotypic evaluation and correlation among the important traits
The combined analysis of variance indicated significant effects of environment, genotypes and genotypes × environment (GE) interactions on 8 quantitative traits studied under net house and open field condition (Supplementary Table 3). Summary statistics including minimum, maximum and mean values of earliness and yield traits studied under open field condition (10 traits; 2 seasons mean) and net house (8 traits; 1 season mean) are given in the Tables 1 and 2 respectively. The superior performance for earliness and yield traits were recorded under net house grown crop compared to open field condition.
The correlations among the important earliness and yield traits are presented in Tables 3 and 4. Eight traits studied under net house showed node number for first female flower was positive and highly significantly correlated with days taken for first female flower to anthesis, days taken for fruit harvest and with vine length. Days taken for first female flower to anthesis was also had positive significant correlation with days taken for first fruit harvest and with vine length. Fruit length was positively correlated with the fruit length/diameter ratio, average fruit weight and fruit yield per plant. Average fruit weight had positive and highly significant correlation with vine length. Fruit length had negative and non-significant correlation with the fruit diameter. Fruit diameter showed significant and negative correlation with fruit length/ diameter ratio (Table 3).
Traits studied under open field condition also showed positive and highly significant correlation with the traits like days taken for first female flower anthesis with days taken for first fruit harvest and with vine length while with number of fruits per plant it was negatively correlated. Number of fruits per plant was positively correlated with the yield per plant. Fruit length showed significantly high and positive correlation with fruit length diameter ratio, average fruit weight, yield per plant and vine length. Average fruit weight showed high positive and significant correlation with yield per plant as well as with vine length. Fruit length was negatively correlated with fruit diameter. Fruit diameter showed negative correlation with fruit length/diameter ratio (Table 4). Under both environments correlation among traits like node number for first female flower, days taken to first female flower anthesis and vine length was positive and significantly high. Fruit length was positively correlated with fruit length/diameter, average fruit weight and with vine length while negative correlation was found with fruit diameter in both the environment. Average fruit weight was positively correlated with vine length. These wide phenotypic trait variations under both the environments among 78 cucumber genotypes indicated that the constituted association panel was suitable for association mapping for earliness and yield traits.

Variability and heritability analysis
Heritability estimate is an important parameter to the breeder for selecting the desired genotypes for further use. The genetic parameters of variability estimates such as phenotypic coefficient of variation (PCV), the genotypic coefficient of variation (GCV), broad sense heritability, and genetic advance as percent of means are presented in Table 5. The GCV values were smaller than PCV values for the traits like node number of first female flower, days taken for first female flower to anthesis, days taken for first fruit harvest and fruit length showing that these characters were more influenced by their surrounding environments. The high (> 20%) GCV values were observed for node number of first female flower (23.49), fruit length (23.33), fruit length/diameter ratio (24.95) and average fruit weight (32.38) where as low (0-10%) GCV values were observed for all other traits studied. The lowest GCV was observed for vine length (5.47). Among the studied traits, broad sense heritability ranged from 0.68 for fruit diameter to 0.95 for node number of first female flower. Heritability is grouped as low (< 0.2), moderate (< 0.2-0.4) and high (> 0.4). In our study all traits were found highly heritable traits. High heritability estimates showed a high response to selection for those traits.
Population structure analysis A total of 60 highly polymorphic SSR markers were used in determining the population structure and marker-trait association analysis. The Bayesian model-based program STRU CTU RE 2.2 was used to infer the population structure. Admixture modelbased simulations were carried out by varying K from 2 to 10 with ten runs for each K, using all 78 cucumber genotypes which revealed evident knees at K = 4. The average LnP (D) (log-likelihood) value increased continuously with the increase in K from 2 to 10. However, its most apparent inflection was obtained at one of the best replicates of K = 4. The results of population numbers (K) were further confirmed using DK estimation. A sharp peak with the maximum value of DK was obtained at K = 4, thereby confirming the classification of 78 cucumber genotypes into four distinct population groups (Fig. 1). Using this approach, 78 cucumber genotypes were assigned to the corresponding A-D subpopulations, representing 24.35% (19), 29.48% (23), 37.17% (29) and 8.97% (7) of the genotypes ( Furthermore, the genetic distances among these four sub-populations by Fst value were also measured which showed the variable level of genetic differentiation between inferred populations. The Fst value ranged from 0.129 to 0.137 with an average 0.133, revealing smallest genetic distances between sub-pop B and C the largest between sub-population C and D (Table 7). Genotypes with more than 0.80 Fst considered as pure and less than 0.80 as admixture (Singh et al. 2016). There were mixed proportions of accessions with admixture with parental varieties in subpopulations defined by the structure, which might have grouped in different defined clusters. In the sub-population A, the percentage of mixed individuals was found to be 21.05%, likewise 26.08% for B, and 17.24% for C while sub population D consisted of 100% pure genotypes. Based on DK estimation cucumber genotypes were divided into four distinct populations. The first group (A) consisted of most of the genotypes collected from different districts of West Bengal along with four genotypes from IARI, New Delhi. This group also contains two genotypes

Association analysis
Marker-trait association was performed for important earliness and yield traits in two different environments using MLM approach in the TASSEL software. By performing genome scanning through MLM approach, 4 and 6 SSR markers were associated with 8 and 10 earliness and yield traits studied under net house and open condition respectively (Tables 9 and 10). In combination, all four and six markers explained an average of 23.93% and 17.37% of the trait variation, respectively. Among all 8 traits studied under net house, UW084942 and UW062953 makers were associated with average fruit weight (g) with 29 and 15% of the phenotypic variance, respectively. While days taken to first fruit harvest and fruit diameter (cm) were associated with marker UW085408 and UW083899, respectively. Ten traits studied under open field condition revealed that a single marker UW084942 was associated with two traits namely average fruit weight (g) and fruit length (cm) while marker UW062953 separately associated with the average fruit weight (g). Other traits like number of fruits per plant, days taken to first female flower anthesis, vine length (cm) and days taken for first female flower to anthesis were associated singly with following markers UW085408, SSR06011, SSR03552 and UW084541, respectively.
The Manhattan plots of marker UW084942, common for traits studied under net house and open field condition, in which the genomic coordinates were displayed along the X-axis and the negative logarithm of the association p value for each SSR are displayed on the Y-axis, clarify that each dot on the Manhattan plot depicts a SSRs. The Manhattan plot displaying the significant association of UW084942 with average fruit weight (g) in both the approaches under two environment viz. under net house and open field condition (Fig. 2a, b).
On the other hand, the results of MLM-QQ plots clearly showed nearly perfect distribution of scores with slight deviation (Fig. 3a, b). This reveals the significance of MLM scans for the identification of candidate genes for such complex traits like average fruit weight (g). Therefore, in this analysis the evaluation of QQ plots suggests the importance of test of both general and mixed linear models in genome wide association mapping.

Discussion
In this study, 78 cucumber genotypes were collected from 17 locations across India and abroad for marker trait association. The morphological traits revealed a considerable variability in flowering behavior, growth habit and fruit characters. Node number to first female flower which determines the earliness of a genotype showed significant variation under both environments. Identification of genotypes with lower node to female flower would be useful in extending the availability of fruit. Judicious planting of early and late genotypes will help in sustainable marketing of cucumber cultivars for a longer period. Very wide range of diversity was recorded for fruit weight and fruit diameter under both the growing condition. Fruit length, fruit diameter and average fruit weight are considered as major yield contributing traits. For slicing cucumber, thin and long fruits are desirable and small fruited genotypes may be desirable for processing industry. In the present study wide range of variation was recorded for most of the other traits. Wide variations for important traits in cucumber were earlier reported by several workers (Mliki et al. 2003;Pandey et al. 2013;Kumar et al. 2020). The high heritability was recorded for all 8 traits studied under net house and open field condition. High heritability estimates for number of traits were reported in cucumber (Munshi et al. 2007;Kumar et al. 2008;Yogesh et al. 2009 andBisht et al. 2010). Selection can be exercised on the basis  Usually, cucumber has a narrow genetic base with 3-12% polymorphism with low to moderate variability (Yang et al. 2013). However, because of the inclusion of different genotypes, including native and exotic collection of genotypes, a wide variability was observed in this study.
Population stratification contributes to false positive results in association analysis (Yu and Buckler 2006). The population structure with respect to geographical origin contributes toward pseudo-associations (Maskri et al. 2012). Therefore, adequate methods need to be implemented to control the effects of population structure to avoid the high rate of Type-I error (Agrama et al. 2007) using STRU CTU RE 2.2. In this study, a Bayesian model-based population structure analysis indicated the occurrence of three sub-populations largely according to the major geographic regions of their origin or genetic makeup.
Based on DK estimation cucumber genotypes were divided into four distinct populations. We found that most of the genotypes from IARI, New Delhi grouped together (B) and another group (C) consisted of genotypes from NBPGR, New Delhi and remaining genotypes from IARI, New Delhi. Group B and C had least genetic distance among them suggesting their genetic relatedness with each other. Genotypes from Eastern part of India (Group A) were found to be genetically distant from the genotypes from IARI, New Delhi and NBPGR, New Delhi. This study indicated the narrow genetic base of the Indian germplasm in spite of its origin in Indian part of Himalayan foothills. Therefore, there is urgent need to conserve highly variable accessions in order to minimize the genetic erosion. Similar reports were documented regarding genetic erosion of cucumber germplasm in India during the joint Indo-US expedition programme . Two genotypes from USA (Gy14 and Gy-421) and five genotypes from China (CL 773, CL 746, CL758, 702-6-B76C and Russian) were scattered  (Pandey et al. 2013). Contribution of Indian genotypes in the global cucumber germplasm has been validated by Dar et al. (2017) and separation based on their geographical location was evident among the selected cucumber genotypes. However, some of the genotypes could not be differentiated according to geographical patterns and hence admixtures in the populations can be attributed primarily due to cross-pollination and gene flow from genotypes grown in vicinity. The information generated through population structure and ancestral background would facilitate germplasm conservation and management strategy and selection of suitable parents in cucumber breeding programs. The MLM approach was found to be more reliable as this approach simultaneously takes accounts for both population structure as well as kinship statistics (Neumann et al. 2011).
Current study was carried out in two environment viz. under net house and open field condition revealed that traits exhibited significant marker-trait association in the MLM approach. Studies conducted in rapeseed (Cai et al. 2014), maize (Yu and Buckler 2006), bread wheat (Neumann et al. 2011), soybean (Kumar et al. 2014), mango (Lal et al. 2017) demonstrated the effectiveness of the MLM approach. As population structure can cause spurious associations (Kang et al. 2008) through MLM approach, incorporation of kinship in the association analysis allowing an improved control of type I and type II error rates over GLM due to relatedness and population structure (Yu and Buckler 2006).
The most intriguing part of the current study is that the many of the QTL previously published using the recombinant mapping approach were also validated and identification of novel genomic regions associated with the traits. Validation of the QTL identified by the other researchers makes the current study robust and very reliable. In the current study based on MLM analysis we found that 8 traits studied under net house confirm association of trait like days to first fruit harvest with marker UW085408 on chromosome 7 at 0.6 cM. While average fruit weight (g) was associated with two markers UW084942 and UW062953 on 3rd chromosome at 40 cM and on 4th chromosome at 105.1 cM respectively. Fruit diameter was associated with UW083899 on chromosome 4 at 55.6 cM. While 10 traits which were studied under open field condition revealed association of 6 traits through MLM analysis with the markers. A trait like number of fruits per plant was associated with UW085408 on chromosome 7 at 0.6 cM. Vine length, average fruit weight (g) and fruit length (cm) were found associated with SSR03552 at 53.7 cM, UW084942 at 40 cM and UW084942 at 40 cM on chromosome 3, respectively. Traits like days to first female flower anthesis was associated with two markers namely SSR06011 on chromosome 3 at 96.7 cM and UW084542 on chromosome 1 at 43 cM.
Among all traits studied under both net house and open field condition we found that there is common association for the trait like average fruit weight (g) on chromosome 3 with marker UW084942 at 40 cM. Earlier researcher have reported that fruit related traits like fruit weight and fruit shape index which includes fruit length, diameter and length/diameter ratio are controlled by quantitative trait loci (QTLs) in cucumber (Wei et al. 2016;Wenzel et al. 1995;Dijkhuizen and Staub 2002;Yuan et al. 2008a, b). Correlation study revealed that average fruit weight had significant and positive correlation with days taken for first fruit harvest, fruit length, fruit length/ diameter ratio and yield per plant therefore associated marker UW084942 may be used to select fruit yield per plant over all.
MLM approach revealed that marker UW084942 on chromosome 3 was associated with the fruit length for both the environments. Wei et al. (2016) identified three QTLs for mature fruit length, with the largest effect displayed by mfl3.1, explaining 36.39% of the observed phenotypic variance on chromosome 3. Another QTL fl3.1 explained 39.10% of phenotypic variance and was also located on chromosome 3. The effects of other QTLs (fl1.1 and fl6.1) were comparatively smaller which were on linkage group 6. Similar finding were reported by Cheng et al. 2010 andMiao et al. 2011, they identified five QTLs (on LG1, LG4 and LG6) and three QTLs (on LG5 and LG6) that explained 7.1-14.1% of the phenotypic variation. Similarly, Yuan et al. 2008a, b detected 8 QTLs and dominant QTL on LG4 explained 23.32% of the fruit length variance. Three markers namely UW083899 on 6th chromosome, UW044712 on chromosome 1st and UW084186 on chromosome 2nd were associated with fruit diameter. UW062953 on 4th chromosome and SSR00607 on 3rd chromosome were associated with fruit length/diameter ratio. Similarly, Xu et al. 2015 identified a 0.19-Mb-long quantitative trait locus on chromosome 2 controlling fruit flesh thickness QTL fft2.1 (Fruit flesh thickness was calculated as the distance from the exocarp to the endocarp) and further they confirmed by SSR marker-based classical QTL mapping in 138 F 2 individuals.
Based on MLM approach one marker each on chromosome 3rd (UW084942) and 4th chromosome (UW062953) found to be associated with the average fruit weight (g) in the both environments. Wei et al. (2016) identified 9 QTLs for fruit weight of mature and immature fruits on three linkage groups. Three QTLs were detected on LG1; one was on LG6, and five on LG3 for fruit yield traits. Among all QTL fl3.2 explained 44.60% of the phenotypic variance and QTL fl3.1 explained 39.10% of phenotypic variance and both were located on linkage group 3. Yuan et al. 2008a, b also reported the same linkage group for immature fruit weight on LG3.
In this study, among the particular traits number of markers was found to be associated with more than one trait. This result indicates precise evaluation of traits. The different SSR loci, which exhibit overlapping association with diverse traits, signify the biological correlation among them. Therefore, it was presumed that the loci governing fruit length, fruit diameter, fruit weight, vine length, and yield per plant were probably located closest to each other on the same chromosome. But for other traits the differences in QTL number, position and phenotypic variance explained might be attributed to different mapping populations such as genotypes, population size etc. (Wei et al. 2016;Yuan et al. 2008a, b;Dijkhuizen and Staub 2002). Similarly, this is fact that crop lines are often assessed in a multi environment trial (MET), i.e. in different geographic locations, seasons, or years, in order to determine performance stability across environments (i.e. G · E). G · E is an important component of genetic variability (Crossa et al. 2011).

Conclusions
In conclusion, we reported the trait variation and genetic structure of 78 cucumber genotypes through which we could identify the potential germplasm for earliness and yield traits for open field as well as protected cultivation of cucumber. Moreover, two potential SSR markers, UW062953 and UW084942 can be used in breeding programs for yield improvement in cucumber because these markers were associated with average fruit weight (g) in MLM approach in two environments. The present study laid a solid foundation to develop more markers with close association with economically important traits in cucumber. Finally, this study shows that we should investigate more number of genotypes including parthenocarpic gynoecious lines and SSR markers to produce detailed picture on genetic structure of cucumber.