Specific-locus, amplified-fragment sequencing and SNP Calling
The restriction enzymes RsaI and HaeIII were selected based on predictions from in silico digestion. There were 205,334 predicted SLAF tags 264–394 bp in length. The digestion efficiency was 93.30%. In this experiment, 465.13 Gbp of sequences were generated from the 112 ramie accessions. There were 502,578 SLAF tags identified throughout the genome. The average depth per sample was 12.34×. There were 369,997 SLAF tags with polymorphisms. All SLAF tags aligned to the reference genome using BWA software [24]. The PLINK software [25] performed quality control on the data. A total of 2,955,337 SNP loci were identified by GATK and SAMtools. More than 85.9% of the SNP variant effects were in the intergenic, intron, downstream, and upstream areas of the coding region whereas only 13.58% were in the exons (Table 2). After filtration for integrity (> 80%) and low minor allele frequency (> 5%), 215,376 highly consistent SNPs were identified and used in the subsequent analyses.
Population genetics analysis
Genetic relationships among the 112 samples were calculated using the 215,376 highly consistent SNPs identified in the present study. Among these 12,432 pairwise combinations, 11,644 (93.662%) with genetic relationship coefficients < 0.05 (Additional files 1: Figure S1). Only four pairwise combinations (0.032%) (two pairs accessions: cd and br, ce and dr), which with higher kinship value between 0.6 to 0.65, suggesting the two pairs accessions may have closely relationship. Thus, most of accessions were very weak or no relationships in this study.
The population structure analysis indicated that the 112 association panel derived from five ancestors based on cross-validation (CV) errors (Figure 1). Of these 112 accessions, 57 were group one, 11 were group two, 15 were group three, 18 were group four, and the remaining 11 were group five (Figure 1). Group one accessions came from Hunan (15) and Jiangxi (13). The remaining 29 originated from Hubei (6), Guangxi (5), Guizhou (5), Sichuan (5), Yunnan (2), Hainan (2), Zhejiang (1), Chongqing (1), India (1), and Shanxi (1). Group two accessions were collected from Sichuan (2), Guangxi (2), Shanxi (2), Indonesia (2), Zhejiang (1), Jiangxi (1), and Hunan (1). Group three accessions were derived from Guizhou (4), Chongqing (3), Hunan (2), Jiangxi (2), Sichuan (1), Hubei (1), Shanxi (1), and Brazil (1). Group four accessions were gathered from Hunan (4), Guizhou (3), Sichuan (3), Hubei (2), Jiangxi (2), Yunnan (1), Hainan (1), Guangxi (1), and Cuba (1). Group five accessions were acquired from Hunan (3), Jiangxi (2), Sichuan (2), Guangxi (1), Shanxi (1), Guizhou (1), and Indonesia (1).
Neighbor-joining (NJ) clustering show this set of germplasm formed three major groups (Figure 2). Based on the Admixture cross-validation (CV) figure (Figure 1), the second choice consisted of three subpopulations (K) when the cross-validation error was the second lowest. Another selection comprised three subgroups for the 112 core ramie germplasms. Principal component analysis (PCA) showed genotype dispersal among the various components, and the first three PCs collectively explained only about 11.13% of the total variance, suggesting that the population structure of the 112 accessions was very low (Additional files 1: Figure S2). No significant correlation was detected between genetic relationships and geographic origins.
Linkage disequilibrium across the whole ramie genome
LD decayed was estimated between SNP pairs along each scaffolds using the 215,376 highly consistent SNPs. The genome-wide LD decayed to r2 = 0.1 was ~11.75 kb in physical distance (Figure 3), which is lower than that reported for most other crops. There are major difference among the various scaffolds. LD ranged from 1.09 kb (Scaffold11 (PHNS01006314.1)) to 20.95 kb (Scaffold2) (Additional files 2: Table S1).
GWAS of phenotypic traits and quantitative real-Time PCR analysis of candidate genes
The GWAS models tested included GLM-Q, compressed MLM-Q+K, and EMMAX. According to the quantile-quantile (Q–Q) plot results, the GLM-Q model was strongly skewed towards significance for most traits (Additional files 1:Figure S3). Therefore, the Q matrix did not adequately account for the population structure or cryptic relatedness. The MLM-Q+K used both the Q and kinship matrices overcorrected these confounders (Additional files 1:Figure S4). The EMMAX model use the identity-by-state (IBS) relationship matrix and effectively reduced the false positive rate. It had sufficient statistical power to identify significant marker-trait associations (Additional files 1:Figure S5). Therefore, the EMMAX-Q model was applied for GWAS in this study.
Variations in all 13 traits were detected among the genotypes (Table 1). We tested the associations between the 215,376 SNPs and each trait using the EMMAX model. Based on the chosen P < 4.64 × 10-7, we identified 44 trait-SNP associations for five traits, namely, FF (third season), RSD (third season), SKT (third season), FP (second season), and FP (third season). These were designated the top QTL candidates (Additional files 3: Table S2). The association pattern across the genome scaffolds visualized with Manhattan plots showed that association trait-SNPs were scattered throughout the genome (Figures 4; Additional files 1: Figure S6–S8). We arbitrarily searched 100 kb upstream and downstream of these trait-SNP association loci already known for candidate genes in the Boehmeria nivea cultivar Zhongzhu No. 1 genome.
We predicted the number of candidate genes associated with the five aforementioned traits. We found 30 candidate genes from the significant SNP regions (PHNS01005842.1-2705313) associated with FF of the third season (Additional files 4: Table S3). Whole_GLEAN_10029631 and whole_GLEAN_10029638 located at 66.7 kb and 60.4 kb from the trait-SNP, respectively, were promising candidate genes encoding a cotton fiber-expressed protein of unknown function and an Arabidopsis thaliana homeobox protein ATH, respectively.
Seven significant SNP markers were associated with RSD of the third season (Additional files 5: Table S4). The most significantly associated SNP marker was at PHNS01002615.1:385693 (-log10P = 7.47), which corresponded to the transcripts of whole_GLEAN_10015232 annotated as AT-rich interactive domain-containing protein 2 whose function is unknown. A subsequent survey identified the following promising candidate genes: a zinc finger protein CONSTANS-LIKE 3 gene (whole_GLEAN_10015241) located 63.3 kb from the significant SNP (PHNS01002615.1:385693), a cotton fiber-expressed protein of unknown function (whole_GLEAN_10025327) located 9.2 kb from the significant SNP (PHNS01003751.1:2205139), an Arabidopsis thaliana trihelix transcription factor ASIL2 (whole_GLEAN_10026480) located 97 kb from the significant SNP (PHNS01007343.1:2569037), and an Arabidopsis thaliana E3 ubiquitin-protein ligase RHB1A (whole_GLEAN_10015233) located 5.2 kb from the significant SNP (PHNS01002615.1:385693).
One SNP marker (PHNS01006314.1:1911874) was associated with SKT of the third season (Additional files 6: Table S5). The transcript of whole_GLEAN_10021907 was annotated as Arabidopsis thaliana G-box-binding factor 1 and was located 31.5 kb from the marker. The transcript of whole_GLEAN_10021909 was annotated as Arabidopsis thaliana homeobox-leucine zipper protein ATHB-40 and was located 13 kb from the marker. The transcript of whole_GLEAN_10021913 was annotated as Arabidopsis thaliana BES1/BZR1 homolog protein 2 and was located 22.4 kb from the marker. All of these were promising candidate genes.
For FP of the second season, seven SNP markers were detected above the threshold (P = 4.64 × 10-7) in the scaffolds PHNS01004650.1, PHNS01011408.1, PHNS01001065.1, PHNS01005672.1, PHNS01004146.1, PHNS01003683.1, and PHNS01007343.1 (Additional files 7: Table S6). One Arabidopsis thaliana homeobox-leucine zipper protein HDG5 homolog (whole_GLEAN_10016513) and one Arabidopsis thaliana transcription factor KAN2 homolog (whole_GLEAN_10016511) were located 22.1 kb and 81.2 kb, respectively, from the trait-SNP PHNS01003683.1:866079. One Arabidopsis thaliana WAT1-related protein At3g18200 homolog (Whole_GLEAN_10007759) was located 71.9 kb from SNP PHNS01011408.1:355775 and three Arabidopsis thaliana transcription factor ORG2 homologs (whole_GLEAN_10011656, whole_GLEAN_10011657, and whole_GLEAN_10011658) were located at 74.8 kb, 83.8 kb, and 88.8 kb, respectively, from the significant SNP PHNS01004146.1:755398. These were promising candidate genes.
Twenty-seven significant GWAS loci among 17 scaffolds were identified for FP of the third season (Additional files 8: Table S7). There were five SNP markers in scaffold PHNS01000838.1 within whole_GLEAN_10010742 whose function is unknown. The whole_GLEAN_10006587 homolog of Trema orientalis GRAS transcription factor was located 30.4 kb from the marker PHNS01001992.1:258552. The whole_GLEAN_10025406 homolog of Arabidopsis thaliana Auxin-responsive protein IAA13 was located 0.7 kb from PHNS01003751.1:2871833. The whole_GLEAN_10029892 ortholog of Trema orientalis GRAS transcription factor was located 48.7 kb from PHNS01005842.1:4650339. Whole_GLEAN_10017958 was annotated as the NAC domain-containing protein (NAC29) and was located 97 kb from PHNS01010418.1:933907.
The candidate genes of FF (third season), RSD (third season), SKT (third season), and FP (second season) were validated by qRT-PCR. Whole_GLEAN_10029631 and whole_GLEAN_10029638 were comparatively upregulated in varieties with relatively higher fiber fineness (Figure 5). Whole_GLEAN_10021913 was comparatively downregulated in thicker SKT varieties. No other candidate genes were correlated with the RSD (third season), SKT (third season), or FP (second season) for six accessions by qRT-PCR (Figures S9–S11).