Genome-wide association study reveals novel quantitative trait loci and candidate genes of lint percentage in upland cotton based on the CottonSNP80K array

Thirty-four SNPs corresponding with 22 QTLs for lint percentage, including 13 novel QTLs, was detected via GWAS. Two candidate genes underlying this trait were also identified. Cotton (Gossypium spp.) is an important natural textile fiber and oilseed crop cultivated worldwide. Lint percentage (LP, %) is one of the important yield components, and increasing LP is a core goal of cotton breeding improvement. However, the genetic and molecular mechanisms underlying LP in upland cotton remain unclear. Here, we performed a genome-wide association study (GWAS) for LP based on 254 upland cotton accessions in four environments as well as the best linear unbiased predictors using the high-density CottonSNP80K array. In total, 41,413 high-quality single-nucleotide polymorphisms (SNPs) were screened, and 34 SNPs within 22 quantitative trait loci (QTLs) were significantly associated with LP. In total, 175 candidate genes were identified from two major genomic loci (GR1 and GR2), and 50 hub genes were identified through GO enrichment and weighted gene co-expression network analysis. Two candidate genes (Gh_D01G0162 and Gh_D07G0463), which may participate in early fiber development to affect the number of fiber protrusions and LP, were also identified. Their genetic variation and expression were verified by linkage disequilibrium blocks, haplotypes, and quantitative real-time polymerase chain reaction, respectively. The weighted gene interaction network analysis showed that the expression of Gh_D07G0463 was significantly correlated with that of Gh_D01G0162. These identified SNPs, QTLs and candidate genes provide important insights into the genetic and molecular mechanisms underlying variations in LP and serve as a foundation for LP improvement via marker-assisted breeding.


Introduction
Cotton (Gossypium spp.) is the world's most important renewable natural textile fiber crop, and it is also a major source of oilseed worldwide (Hulse-Kemp et al. 2015). The allotetraploid upland cotton (G.hirsutum L., AADD, 2n = 4x = 52) is the most widely cultivated species and accounts for approximately 95% of global cotton lint production . Although cotton fiber qualityrelated researches have dominated in cotton, increasing and improving cotton yield remains a major objective in cotton breeding for China and other major cotton production countries in the world (Constable et al. 2015;Su et al. 2016;Sun et al. 2018). Cotton yield-related traits, mainly including boll number (BN), boll weight (BW), and lint percentage (LP), are typically quantitative traits and controlled by multi-genes (Si et al. 2017;Song et al. 2019;Sun et al. 2018). LP is an important component for determining cotton lint yield (Culp Communicated and Harrell 1975), and it is calculated as the lint weight [LW, g]/seed cotton weight [g] × 100 Zhang et al. 2011). LP is positively correlated with the numbers of lint on seed surface as well as the weight of single fiber, and negatively correlated with the seed index (SI, weight of 100 seeds). However, little is known about the genetic and molecular mechanisms underlying variations for LP in upland cotton despite some related studies in this area. Therefore, dissecting genetic variation and identifying candidate genes significantly associated with LP is essential. Quantitative trait locus (QTL) mapping has been widely used to dissect the genetic variation and maker-assisted selection (MAS) for complex cotton traits, and it is generally based on two mapping populations, including genetically segregated populations constructed using bi-parents and natural populations. Over the past two decades, thousands of QTLs (http:// www. cotto nqtldb. org) for fiber quality, LP and other traits have been detected through bi-parental linkage mapping in upland cotton (Zhang et al. 2005;Shen et al. 2006;Gutierrez et al. 2011;Said et al. 2013;Wang et al. 2013;Shang et al. 2016;Fang et al. 2017b;Abdelraheem et al. 2018). However, most of these QTLs are not directly applicable to breeding due to low marker density and poor genetic diversity, which results in very large genetic and instability across populations (Su et al. 2016;Li et al. 2018b). Genome-wide association study (GWAS) based on linkage disequilibrium (LD) can effectively associate genotypes with phenotypes in natural populations (Sun et al. 2017;Jiang et al. 2018). Compared with linkage mapping, GWAS has the advantages of high resolution, cost efficiency, and non-essential pedigrees for detecting important QTLs or genes associated with complex traits. Consequently, GWAS has been widely used and applied in many crops for various traits (Atwell et al. 2010;Liu et al. 2019;Yao et al. 2020;Wang et al. 2021;Wen et al. 2018). With the completion of cotton genome sequencing (Paterson et al. 2012;Li et al. 2015;Zhang et al. 2015;Hu et al. 2019;Huang et al. 2020) and the establishment of a high-throughput genotyping platform based on the high-throughput array (Hulse-Kemp et al. 2015;Cai et al. 2017), a large number of single nucleotide polymorphism (SNP) markers have been developed, which has greatly promoted the application of genome-wide association analyses in cotton. Recently, GWAS research in cotton has mainly focused on fiber quality (Gapare et al. 2017;Sun et al. 2017;Dong et al. 2018;Li et al. 2018a;Tan et al. 2018;Yuan et al. 2019b) and cotton yield components (Su et al. 2016;Sun et al. 2018;Song et al. 2019;Xing et al. 2019;Zhu et al. 2020), and other agronomic traits in cotton (Li et al. , 2018bYuan et al. 2018Yuan et al. , 2019aFu et al. 2019). In comparison, few reports have focused on identifying loci and candidate genes of the LP traits through a combined GWAS and weighted gene co-expression network analysis (WGCNA) strategy in cotton.
To better understand the allelic variations in the cotton genome at a natural population level and identify candidate genes significantly associated with LP, a GWAS was conducted for LP using 254 upland cotton accessions and the high-density CottonSNP80K array (Cai et al. 2017) based on phenotypic tests in four environments as well as best linear unbiased predictors (BLUPs), which was the array developed by Nanjing Agricultural University. The results revealed 34 SNPs and two causal candidate genes significantly associated with LP, and the candidate genes were further verified through RNA-seq and WGCNA analysis. These results would help us to better understand the genetic mechanism of the LP variations of yield traits and enhance the foundation for genetic improvement of lint yield through marker-assisted breeding in cotton.

Plant materials and field experiments
A total of 254 upland cotton accessions were selected for GWAS in this study, comprising 214 accessions that originated from Chinese Yellow River region (YRR), eight that originated from the Chinese northwestern inland region (NIR), 17 from the Chinese Yangtze River region (YtRR), six from the northern specific early maturation region (NSEMR) in China and nine introduced from abroad (Table S1). All 254 accessions were planted at Linqing (36° 48′ N, 115° 41′ E) in Shandong Province of China, and Anyang (36° 05′ N, 114° 29′ E) in Henan Province of China for 2016 and 2017, denoted as 16LQ, 16AY, 17LQ and 17AY, respectively.
In each experimental environment, all accessions were planted in a single-row plot (5.0 m long with 0.76 m between rows). The field experiments used a randomized complete block design with three replications in each environment. The field management conformed to local practices.

Phenotypic evaluation and statistical analysis
Twenty naturally fully open bolls from the central and inner parts of each plant were randomly collected from each plot at the cotton plant maturity stage, and ginned. LP was evaluated through conventional methods as the fraction of lint weight to seed cotton weight .
To reduce environmental errors, the best linear unbiased predictors (BLUPs) for the LP trait were estimated using the R software lme4 package (Bates and Maechler 2007).
The BLUP values and single-environment value were used for the GWAS. The broad-sense heritability of LP was also calculated using the R software lme4 package. Finally, the correlation coefficients for the LP between environments 1 3 were calculated and the analysis of variance (ANOVA) was conducted using R software. Statistical analysis of phenotypic data was conducted using SPSS 22.0 software. The frequency distribution of traits in each environment was calculated using the R package (R Core Team, Vienna, Austria).

SNP genotyping
Genomic DNA of the 254 cotton accessions was extracted from young leaf tissue using a modified CTAB method (Paterson et al. 1993). The DNA quantity and quality were measured with a Nano Drop 2000 and agarose gel electrophoresis. A CottonSNP80K array containing 77,774 SNPs (Cai et al. 2017) was used to determine the genotype of the 254 accessions. All of the SNP genotype data underwent raw data normalization, clustering and genotype calling using Illumina Genome StudioGenotyping Module (Illumina) (Illumina 2010). The SNPs with a SNP call rate < 85% and a minor allele frequency (MAF) < 0.05 were excluded to avoid potential issues due to spurious LD and false positive associations. A final set of 41,413 high-quality SNPs was used for GWAS analysis.

Population structure, kinship (K), and LD analyses
The population genetic structure of the 254 cotton accessions was estimated using a Bayesian model-based method in STRU CTU RE 2.3.4 (Evanno et al. 2005). The number of population clusters was predefined as K = 1-9, using an admixture model with 20 independent runs of 100,000 burns in length and a Markov chain Monte Carlo (MCMC) replication number of 100,000. The optimal K value was determined by the logarithmic probability of LnP (K) and ΔK based on the rate of change of LnP (K) between successive values of K. A phylogenetic tree including the 254 cotton accessions was constructed with the distance matrix method in Phylip software (version 3.69) (Felsenstein 1989). Principal component analysis (PCA) was used to assess the population structure and test the principal component vector according to the Tracey-Widom method (Price et al. 2006). The correlation coefficient (r 2 ) of alleles was calculated to measure linkage disequilibrium (LD) in each group level using Haploview software (Barrett et al. 2005). The LD decay rate was also measured as the chromosomal distance at which the average pairwise correlation coefficient dropped to half of its maximum value. The LD decay map was drawn using the R program.

GWAS analysis and identification of candidate genes
The GWAS was performed using the R/rMVP package with the FarmCPU model (Liu et al. 2016b), which had the advantages of a mixed linear model and stepwise regression.
The significance threshold was selected through the number of markers (P values = − log10 (1/the number of total selected SNPs)). Python scripts were used to perform the gene extraction of significant loci through the genome GTF files (https:// github. com/ karmen-cc/ chenc hen. git).
Candidate genes were explored based on gene annotations in the G. hirsutum acc.TM-1 genome  in region within ± 500 kb regions of significant SNPs. With the exception of the genes whose normalized fragments per kilobase per million mapped read (FPKM) values were equal to 0, the candidate genes were subjected to Gene Ontology (GO) enrichment analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis.

Transcriptome sequencing, WGCNA and quantitative real-time PCR (qRT-PCR) analysis
The raw RNA-seq data of two varieties with different LP values, LMY22 with a higher LP (42.60%), and LY343 with a lower LP (31.52%) and their tissues (ovule and fiber developmental periods) were downloaded from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (accession no. PRJNA546484) . For transcriptome sequencing analysis, the expression levels of the materials were analyzed using TopHat and Cufflinks software (Trapnell et al. 2012). FPKM values were used to indicate the abundance of gene expression in each material.
The R/WGCNA package (version 1.69) was used to construct the weighted gene co-expression network (Langfelder and Horvath 2008), and the pick Soft Threshold set was used to calculate the weight value (Liu et al. 2016a). The Cytoscape/CytoHubba software (V_3.7.2) was used to visualize the gene network. The R/mufzz and R/pheatmap packages were used for gene cluster analysis and heatmap construct, respectively. Finally, gene ontology using the TBtools/GO enrichment to analyze and draw produced via the R/ggplot2 packages (https:// github. com/ karmen-cc/ chenc hen. git).
Total RNA was extracted from G. hirsutum acc. LMY22 and LY343 tissues, including ovules at 0 days post-anthesis (DPA) and fibers at 5, 10, 15, 20 and 25 DPA, and reverse transcription was performed using the PrimeScript RT reagent kit (TaKaRa, Kusatsu, Japan). The qRT-PCR was performed on a Light Cycler 480 II (Roche, Basel, Switzerland) using SYBG Premix Ex Taq II (TaKaRa, Kusatsu, Japan). The expression levels of two causal genes (Gh_D01G0162 and Gh_D07G0463) and four genes that were randomly selected from the interaction network (Gh_A13G0389, Gh_D01G0200, Gh_D07G0449 and Gh_D07G0457) were calculated according to the 2 −ΔΔCT method . GhActin was used as an internal control. The primer 1 3 sequences of these six genes and control gene are listed in Table S7.

Phenotypic variation of the LP trait in 254 upland cotton accessions
The phenotypic values for the LP of 254 upland cotton accessions, collected from four environments and BLUPs in 2016 and 2017 (Fig. S1), were used for variation analysis. Continuous and extensive phenotypic variations were observed for the LP trait among the 254 upland cotton accessions in each environment. LP values ranged from 22.90 to 48.32%, with a mean value of 40.69% across the four environments. The coefficient of variation (CV) ranged from 8.05 to 9.16%, and exhibited an approximately normal distribution pattern in the four environments based on the skewness and kurtosis values ( Fig. S1; Table 1). One-way analysis of variance (ANOVA) revealed that the genotype (G), environment (E), and the interaction of genotype and environment (G × E) all significantly affected the LP trait. In addition, the broad-sense heritability (h B 2 ) of LP was 86.65% (Table 1), indicating that the LP trait was less influenced by the environment and mainly controlled by genetic factors.

Genetic variation based on SNPs
The genotypes of 254 accessions were examined using Illumina GenomeStudio software (Illumina 2010). After removing low-quality SNP loci (minor allele frequency < 0.05 and call rate < 85%), a final set of 41,413 high-quality SNPs (41,413/77,774, 53.25%) was used for the subsequent screening of polymorphic loci, the population structure (Q), and relative kinship (K), as well as GWAS analysis ( Table 2). The 41,413 SNPs were unevenly distributed on 26 chromosomes of the allotetraploid cotton genome, with 21,962 and 19,451 SNPs in the A and D subgenomes, respectively ( Fig. 1; Table 2).

Population structure and LD decay estimation
Population structure analysis was performed using all 41,413 SNPs. The LnP(K) values continuously increased with K from 1 to 9, and Evanno's delta K value reached a sharp spike when K = 2 ( Fig. 2b), which suggested that the 254 upland cotton accessions could be divided into two major subgroups (Fig. 2a-c). Furthermore, phylogenetic tree and principal component analysis (PCA) showed that two subgroups for the 254 upland cotton accessions were similar to the STRU CTU RE analysis despite some accessions overlapping in the two subgroups (Fig. 2d, e). Each subgroup was composed of accessions from different ecological zones, including the YtRR, YRR, NIR, NSEMR and abroad, and that was unrelated to geographical distribution (Table S1). Moreover, the LD decay of our population was approximately 520 kb, when the r 2 = 0.34 at half of its maximum value (Fig. S2). Considering the LD decay distances and compared with previous studies of the different natural population in cotton (Dong et al. 2018;Li et al. 2018a;Song et al. 2019;Su et al. 2016;Sun et al. 2017). We assumed that the region of SNP-associated candidate genes for the LP trait was approximately 500 kb in size.

GWAS for the LP trait
Based on 41,413 high-quality SNPs and the phenotypic values from four environments and the BLUPs, GWAS was performed to identify the associated loci in 254 upland cotton accessions. Thirty-four SNPs were identified that were significantly associated with the LP trait (Table 3). These SNP loci were distributed on 15 chromosomes (Fig. 3a, S3; Table 3). The phenotypic variation explained by these SNPs ranged from 5.91% to 14.48%, with an average of 8.91% (Table 3). Three significant SNPs (TM47821, TM59286, and TM63365), which were distributed on Chr.15, Chr.25 and Chr.16, respectively, were simultaneously detected in four environments and BLUPs. Two significant SNPs (TM43813 and TM47822), which were distributed on Chr.13 and Chr.15, respectively, were simultaneously detected in one environment and in BLUPs. Seven significant SNPs, namely TM28552, TM43079, TM47579, TM47823, TM63366, TM63368 and TM69118, were consistently detected in two environments. Two SNPs (TM50517 and TM58487) were only detected in BLUPs, and the remaining 20 SNPs were only detected in one environment (Table 3). For instance, the SNP locus TM63365 on Chr.16, which was simultaneously detected between the four environments and BLUPs, had the highest − log 10 (P) value (12.75) and explained the largest phenotypic variation (14.48%) in 17LQ. For the SNP locus TM28552 on Chr.08, which was detected among 16AY, 16LQ, and BLUPs, had the highest − log 10 (P) value (11.66) and phenotypic contribution rate (13.56%) in BLUPs (Table 3). It is generally believed that SNPs are reliable and important sites when they are detected in more than two environments, and can be used for further analysis ). Based on previous research results and combined with ours research, the ± 500 kb regions of significant SNPs could be defined as QTLs interval, and QTLs with overlapping regions can be regarded as the same locus. In this study, a total of 22 QTLs were detected, and similar to significant SNP loci, these QTLs were scattered across different chromosomes (Table S2). Most of these QTLs contained only one significant SNP except for three QTLs such as qLP-Chr.15-1, qLP-Chr.16-1 and qLP-Chr.24, associated with three significant SNPs and the other six QTLs including qLP- Chr.11, and qLP-Chr.19-2 associated with two significant SNPs. In this study, the qLP-Chr.08, qLP-Chr.11, qLP-Chr.13-2 and qLP-Chr.13-3 were detected in two environments. The qLP-Chr.12-2 was detected in three environments. The qLP-Chr.  were detected in four environments. The other 13 QTLs were only detected in one environment (Table S2).

Identification and analysis of favorable SNP alleles and candidate genes for LP
A total of 1388 candidate genes associated with 34 significantly related SNPs were identified through GWAS (Table S4). There were three non-synonymous SNP variations associated with LP in the peak, and only two SNPs, TM47821 (Chr.15) and TM63365 (Chr.16), were simultaneously detected in four environments and BLUPs (Table S3). Consequently, this study focused on the peaks in Chr.15 and Chr.16 for the subsequent analysis. To narrow the range of candidate genes associated with LP, local LD analysis was conducted of the peak SNPs and non-synonymous SNPs variation in the GWAS was examined. Finally, two major genomic regions (GR1 and GR2) that included three loci, with 96 and 79 genes, respectively, were found to be associated with LP on chromosomes Chr.15 and Chr.16, respectively (Table S5).
The first major genome region of 1.14 Mb (GR1, between 1,200,000 and 1,260,000 bp) on chromosome Chr.15 included three significant SNP loci (TM47821, TM47822, and TM47823) (Fig. 4a, S3). The LD block analysis showed that the candidate SNP locus TM47821 did not fall into any LD block and was located between the Block 4 and Block 5 (Fig. 4b). Interestingly, a non-synonymous SNP mutation (A + 264 → G + 264) in exon 1 of GH3.5 (Gh_D01G0162: exon 1: c.A264G: p.P89V) was significantly associated with LP (− log 10 P > 4.61) (Fig. 4a, c; Table S3). There were two haplotypes with distinct phenotypes in 254 upland cotton accessions: the haplotype AA allele had significantly higher LP values than accessions with the GG allele (p < 0.01) (Fig. 4d). Further, two materials with different LP values (LMY22, with higher LP and LY343, with lower LP) were randomly selected to verify the causal gene Gh_D01G0162. The qRT-PCR results indicated that Gh_D01G0162 was significantly highly expressed in the fiber of LMY22 at 0-and 5-DPA (Fig. 4e), and the number of fiber protrusions was much greater in LMY22 (Fig. S4). The expression level of Gh_D01G0162 in LMY22 gradually decreased, while that in LY343 gradually increased with the fiber development (Fig. 4e). These results suggest that the candidate/causal gene Gh_D01G0162 may participate in early fiber development to affect the number of fiber protrusions and thereby determine the LP variation.
Another notable hotspot region was about 0.06 Mb (GR2, between 4,904,213 and 4, 967, 256 bp) on chromosome Chr.16, including three significant SNP loci (TM63365, TM63366, and TM63368), which the significance threshold was above the horizontal red lines (Fig. 5a), SNP-TM63365, the green dot in red dotted region, was significantly associated with LP (− log 10 P > 4.61) (Fig. 5a, S3; Table S3). The LD block analysis showed that the candidate SNP locus TM63365, fell into Table 3 SNPs significantly associated with lint percentage in four environments and BLUPs      (Fig. 5b), in which there were seven closely linked SNPs, namely TM63358, TM63360, TM63361, TM63362, TM63363, TM63364 and TM63365 ( Fig. 5b; Table S6) and 11 candidate genes were located according to the data of TM-1 genome sequencing (Table S6). Interestingly, a non-synonymous SNP variation (G/A) at 6360 bp in exon 10 of Gh_D07G0463 resulting in an amino acid change from glutamate to lysine, which is a homologue of the NADPH/respiratory burst oxidase protein D (RBOHD) in Arabidopsis ( Fig. 5c; Table S5). According to the haplotype analysis, the accessions carrying the AA allele in 254 cotton accessions exhibited significantly increased LP compared to the GG allele (Fig. 5d). Moreover, the expression of Gh_D07G0463 was higher during the early fiber development (0 and 5 DPA) than in other stages and gradually decreased, except for the late fiber development in LMY22 and LY343 (Fig. 5e). Altogether, the results indicate that the candidate/causal gene Gh_D07G0463 may play an important role in the early fiber development that affects LP.

WGCNA analysis and hub gene identification for significant SNP loci
The 1388 candidate genes associated with lint percentage were further validated using transcriptome sequencing data (PRJNA546484) . After filtering the genes whose FKPM values were equal to 0, a total of 1291 genes were left for subsequent WGCNA. When β = 9, the scalefree network fitting exponent R 2 > 0.8, and the average connectivity is close to 0, indicating that power processing with this value can obtain a scale-free network that meets the requirements. Here, the power of β = 9 (Soft.R.sq = 0.8) were chosen to make sure we can obtain a scale-free network (Fig. S5a-b). Six weighted gene co-expression modules were obtained according to the cluster of the gene, and they were divided into three categories by GO enrichment analyses: the biological process, cell component and molecular function categories (Fig. S5c-f). The turquoise module, with a larger number of genes, was selected for subsequent WGCNA (Fig.  S5c-d), and obtained 50 hub genes. Transcriptome profiling showed the relative expression level of these hub genes were higher in the initial fiber development stage than in the later stage ( Fig. 6a; Table S8), and GO enrichment analyses indicated there were only two types of biological processes and molecular functions (Fig. 6b). Co-expression network analysis of hub gene Gh_D07G0463 was visualized with Cytoscape software. Forty-nine genes had an interaction network relationship with the hub gene Gh_D07G0463 (Fig. 6c; Table S9). Another causal candidate gene, Gh_D01G0162, was also located in the turquoise module, but it was not a hub gene. A co-expression network analysis was also performed, and the results showed that Gh_D01G0162 interacted with 25 genes, including the causal gene Gh_D07G0463 (Fig.  S6). This suggests that the two causal candidate genes Gh_ D01G0162 and Gh_D07G0463 may play important roles in the early fiber development that affects the change of LP. Fig. 6 The results of the heat map, GO enrichment analyses, and gene co-expression networks for the hub gene. a Heat map of the 50 hub genes. b GO enrichment analyses of the 50 hub genes. c Co-expression network diagram of the hub gene Gh_D0G0463. d Expression analysis of gene Gh_A13G0389, Gh_D01G0200, Gh_D07G0449, Gh_D07G0457 by qRT-PCR. * and **indicate significantly differential expression at 0.05 and 0.01 level, respectively 1 3

Analysis of favorable SNP alleles
To further identify the cumulative effect of the favorable SNPs on LP, the allelic variation of two SNP loci on chromosomes Chr.15 (TM47821, A/G) and Chr.16 (TM63365, A/G) were investigated. These SNP loci were significantly associated with LP in this study. Based on the SNP alleles of the two loci (TM47821, A/G and TM63365, A/G), the 254 upland cotton accessions were classified into three haplotypes (AA-AA, AA-GG/GG-AA, and GG-GG). The number of haplotypes AA-AA, AA-GG/GG-AA, and GG-GG were 109, 81, and 64 accessions, respectively, with mean LP values of 42.87%, 40.35%, and 37.39%, respectively, showing that pyramiding the favorable alleles could increase LP (Fig. 7). These results suggested that increasing the frequency of elite alleles would significantly improve the LP, thus enhancing the cotton yield.

Large numbers of high-quality SNPs detected by the CottonSNP80K array ensure effective GWAS in cotton
GWAS based on large-scale resequencing and high-density SNP arrays provides a powerful platform for the rapid identification of genetic variants and candidate genes associated with variations of agronomic traits that can be directly applied to crop improvement (Huang et al. 2011;Hulse-Kemp et al. 2015;Cai et al. 2017). However, it is critical to select materials that include a high degree of genetic diversity for GWAS (Li et al. 2018b). In this study, a total of 254 upland cotton (G. hirsutum) accessions were selected and formed into a natural panel for LP loci detection. The cotton accessions originated from different ecological cottongrowing areas in China and abroad (Table S1). Although most of the accessions were from the Yellow River region (YRR) in China, they were consistent with the phenotypes accurately characterized in Anyang and Linqing, and also met the breeding aims for the production needs of the YRR in China. The range of pedigrees was very rich because of the wider range of genetic diversity among these accessions, which better met the needs of the GWAS. Furthermore, the broad-sense heritability of LP was 86.65% in this study, which was similar to previously reported Song et al. 2019;Xing et al. 2019). These findings show that the trait of LP is relatively stable and less affected by environmental factors, and thus the markers significantly associated with LP obtained by the GWAS should be useful for cotton molecular breeding. GWAS also has the advantage of high resolution. Highthroughput molecular markers and wide distribution on the whole genome were efficient markers for trait-gene association in the GWAS (Cai et al. 2017;Huang et al. 2017;Xing et al. 2019). In the present study, 41,413 polymorphous SNP markers were filtered out from the 77,774 SNPs, accounting for 53.25% of the SNPs at a molecular level ( Table 2). The average density of polymorphic SNPs was approximately one SNP per 49.44 kb (Table 2). This marker density was similar to that reported by Dong et al. (2018), Li et al. (2018a, b), and Yuan et al. (2018), in which the SNP markers were from the CottonSNP80K array as well, but significantly different from that reported by Huang et al. (2017), Sun et al. (2018) and Song et al. (2019), in which the SNP markers were from the Cotton-SNP63K array. And it may be mainly due to differences in the selection of the reference genome and accessions. The average value of PIC was 0.237, less than the values reported by Huang et al. (2017) (0.332) and Sun et al. (2018) (0.285) and close to those reported by Yuan et al. (2018) (0.267) and Song et al. (2019) (0.250). The LD decay distance of the population used in the present study was approximately 520 kb (Fig. S2). This value was higher than the distance obtained by Li et al. (2018b) (400 kb), lower than the result reported by Sun et al. (2017) (820 kb), and similar to the distance reported by Dong et al. (2018) (500 kb). These inconsistent results may be due to differences in population structure, population sizes and SNP marker filtering criteria. In this study, the 254 accessions were divided into two subpopulations based on Fig. 7 Box plot for lint percentage plotted as different numbers of favorable alleles. The X-axis represents the number of favorable alleles and the Y-axis represents the mean value of the lint percentage. The significance of differences was analyzed by a two-sided Wilcoxon test and ***indicate significantly differential expression at 0.001 level 1 3 the population structure, a neighbor-joining phylogenetic tree, and PCA (Fig. 2c-e). This result was also consistent with previous reports in cotton Sun et al. 2017;Yuan et al. 2018;Song et al. 2019).

Identification of novel stable QTL and elite-allele loci for marker-assisted breeding for LP in cotton
LP is an important factor for determining cotton lint yield (Culp and Harrell 1975) and is also a typically quantitative trait controlled by multi-genes (Si et al. 2017;Sun et al. 2018). Several QTLs for LP have been detected based on linkage and association mapping in cotton Said et al. 2013;Shang et al. 2016), some of which have also been identified by GWAS (Su et al. 2016;Huang et al. 2017;Sun et al. 2018;Song et al. 2019;Xing et al. 2019;Zhu et al. 2020). In the present study, a total of 34 SNPs were found to be significantly associated with LP (Table 3), corresponding with 22 QTLs (as defined in this study) (Table S2). Among them, one overlapped with previously reported QTLs for LP, eight were adjacent, and 13 QTLs were novel (Table S2). For instance, qLP-Chr.15-1, was co-localized with qLP-C15-1 which was previously reported by Wang et al. (2013) (Table S2). Eight QTLs, namely qLP-Chr. 07,, were adjacent to the markers/QTLs DC40182 , qLP-C12-1 , qLP-C12-2 , NAU3017 , qLP-D1-2 (Si et al. 2017), qLP-16-2 ), qLP-16-1 (Wang et al. 2011, and qLP-C21-1 , respectively (Table S2). The remaining 13 QTLs did not correspond to any reported QTLs. These QTLs may be novel QTLs identified via GWAS analysis, and need to be further verified through the analysis of different genetic populations.
Elite-allele loci are valuable resources for crop breeding programs, and the cumulative effect of favorable SNPs is an efficient way to improve target traits in crop plants (Su et al. 2016). This study detected two SNPs, TM47821 and TM63365, which were significantly associated with LP and had a positive effect on LP. Interestingly and importantly, the haplotypes of elite-allele loci of the two significantly associated SNPs are AA-AA, and the accessions carrying AA alleles at TM47821 and TM63365 had higher LP values than those harboring the GG allele (Figs. 4d, 5d). Therefore, these stably inherited QTLs, which were repeatedly identified across different genetic populations and environments, in addition to the elite-allele loci (TM47821 and TM63365) with a positive effect on LP, may have great potential in marker-assisted breeding for LP in cotton. These elite allelic loci require further relevant experiments to be identified and validated.

Potential candidate genes and the underlying genetic and molecular mechanisms controlling the LP
The numbers of lint on seed surface, which is determined at the fiber initiation stage, is one of the important factors affecting LP. It has been reported that genes controlling the LP were highly expressed during earlier stages of fiber development, especially the initiation and elongation phases Haigler et al. 2012;Su et al. 2016). According to the definition of LP, with a certain weight of seed cotton, the level of LP is positively proportional to the number of fiber protrusions per unit area and is inversely proportional to the size of the seed.
To date, some candidate genes associated with LP were identified via GWAS using different association populations in cotton (Su et al. 2016;Huang et al. 2017;Fang et al. 2017a;Sun et al. 2018;Song et al. 2019;Xing et al. 2019). One candidate gene, Gh_D08G2376, was speculated associated with the LP, and it may regulate seed size and fiber development (Huang et al. 2017). However, most of the identified genes affecting LP are related to the initiation of fiber development Haigler et al. 2012;Su et al. 2016). In this study, 175 genes were found in the two major candidate regions (GR1 and GR2). We particularly focused on two genes, Gh_D01G0162 and Gh_D07G0463, as their exon regions harbored polymorphic SNPs with non-synonymous mutations. Moreover, qRT-PCR results indicated that both genes were significantly highly expressed in fiber of LMY22 (higher LP) at 0-and 5-DPA (Fig. 4e), and the number of fiber protrusions was much greater in LMY22 (Fig. S4). The gene, Gh_D01G0162, is a homologue of the auxin-responsive GH3 family protein in Arabidopsis. The GH3 proteins are reported to be involved in various developmental processes and environmental responses in plants (Jeong et al. 2021), especially related to hormones, such as auxin (Mellor et al. 2016) and JA (Staswick et al. 2002). Auxin promotes fiber development during the early stages of fiber initiation, including at 0-and 5-DPA (Lee et al. 2007). Similar to Gh_D01G0162, the expression of Gh_D07G0463, which is a homologue of the NADPH/respiratory burst oxidase protein D (RBOHD) in Arabidopsis, was also significantly highly expressed in the fiber of LMY22 during the early fiber development (0 and 5 DPA) (Fig. 5e). And the AtR-BOHD was involved in root growth (Foreman et al. 2003), and produced a portion of the reactive oxygen species (ROS) that play roles during root-hair elongation (Gapper and Dolan 2006) and associated ROS accumulation (CM-H2DCF imaging) during root-hair elongation (Foreman et al. 2003). Cotton fibers are classified as seed trichomes, which share many similarities with leaf trichomes and root-hair in Arabidopsis (Lee et al. 2007). These candidate genes were also validated by using transcriptome sequencing data (PRJNA546484) . Fifty hub genes were identified, including Gh_D07G0463, one of the causal genes obtained via GWAS analysis. Transcriptome data showed that all of the hub genes were more highly expressed in the initial fiber development stages at 0 and 5 DPA than in the later stage (Table S8; Fig. 6a). The qRT-PCR analysis revealed that four hub genes, which were randomly selected from the interaction network, were highly expressed at the early stage of fiber development at the 0 and 5 DPA ovule and fiber development stages ( Fig. 6d; Table S8), and consistent with the co-expression network analysis (Fig. 6a). These candidate genes can be verified in future studies. More experiments, such as developing different genetic populations to fine mapping and detecting the expression of genes in target regions, can be designed to quickly confirm the candidate genes to find the connections between gene expression and phenotypic variation.
In summary, The 254 upland cotton accessions in China were genotyped using the high-density CottonSNP80K array and clustered into two subpopulations via 41,413 polymorphic SNPs screened. Thirty-four SNPs within 22 QTLs associated with LP were identified via GWAS, including 13 novel QTLs. Two key candidate genes (Gh_D01G0162 and Gh_D07G0463), which were significantly highly expressed during earlier stages of fiber development that affected the formation of the LP, were also identified. These significantly identified associated SNPs, QTLs (especially these novel QTLs) and candidate genes will help to better elucidate the genetic and molecular mechanisms underlying variations in LP, and they have potential use in LP improvement via marker-assisted breeding in cotton.