We used GWAS to identify DNA polymorphisms associated with biomass production and wood chemical composition in P. trichocarpa, which determine its potential as feedstock for lignocellulosic ethanol. This approach complements our previous phenotypic characterization of the same association population [5] by identifying SNPs underlying traits of growth, ecophysiology and wood quality, the primary traits targeted for the development of genetically improved clones suitable for dedicated biomass and bioenergy plantations. An approach based on sequence capture allowed us to detect genotype-phenotype associations across the P. trichocarpa gene exome.
The association population used in this study consisted of 461 clones (from 101 provenances), comprising part of the natural distribution range of P. trichocarpa in the Pacific Northwest of the United States. In a previous study [5], we observed significant phenotypic and genetic variation for growth, spring bud phenology, water use efficiency, C and N assimilation, as well as lignocellulosic components and metabolome of wood (Table 1). Similarly, clonal repeatability, represented in terms of individual heritability estimates, also varied among the traits. We hypothesized from this information that multiple polymorphic loci across the genome should be detected in association with phenotypes, and particularly, those with high heritability should reveal a large number of significant SNP-markers.
Genotyping
The processes of exome sequencing and genotyping identified 5.1 million SNPs across the P. trichocarpa genome in the association population, and after filtering, a set of 813,280 SNPs was used for association analyses (Table 2). The number of selected SNPs was proportional to chromosome size, ranging from 29,287 to 100,299 SNPs, for chromosomes 9 and 1, respectively (Table 2, Fig. 1a). Considering the full genome length, an average of one SNP every 482 bp (Table 2) was included in the analyses. Taking advantage of the full genome assembly, genotyping methodologies such as those based on sequence capture can target entire exons or genes across the genome, avoiding bias arising by a priori selection of candidate loci [23, 25]. In comparison to similar preceding studies that used SNP array platforms [6, 18, 19, 41], the number of SNPs in our analyses represent an increase in the power of applied genomic scanning. However, this amount is lower than the utilized by approaches based on whole-genome sequencing developed recently [7, 15].
Intra-chromosomal linkage disequilibrium
The extent of linkage disequilibrium (LD) was analyzed across each chromosome. On average, the LD over physical distance decayed below r2 0.2 at 26.9 kbp. A representative example, for Chromosome 12, is depicted in Fig. 1 b. The complete set of chromosomes with its LD is included in Fig. S1. The decay varied depending on specific chromosomes, with the most rapid decay observed on chromosomes 7 and 15 (r2 0.2 at 18.9 kbp) and the slowest decay on chromosome 11 (r2 0.2 at 51.6 kbp). Genome-wide LD decay exhibited different extents among chromosomes (Table 2). LD decay to r2<0.2 was observed on average at 26.9 kbp. High variation of LD across the genome (among and within chromosomes) has been reported for this species [23]. The estimated extent of LD decay predicted in our study is higher than the observed by Wegrzyn et al. [18] (r2 0.2 at ~0.5 kbp) and Wang et al.[42] (r2 0.2 at ~8 kbp) for P. trichocarpa. Distinct methodologies, number of markers, population sizes, genetic origins and standard errors among the studies may account for the different findings. Compared with other tree species extent of LD estimated in this study is similar to species belonging to Fraxinus [43], Prunus [44] and Eucalyptus [45] genus.
Single SNP-marker associations
Significant associations (p-value < 6.1479E-8) were identified for DBH, h, leaf C and N content, and δ15N. Figure 2 (a and c) depicts the number of associations detected per chromosome for a selected set of traits. A detailed list for each trait is provided in Table S1. Similarly, Manhattan plots for each phenotype are included in Fig. S2.
In general, and consistently with chromosome length, the highest numbers of significant associations were observed for chromosomes 1 and 5. The lowest number of associations was observed for chromosome 16. The proportion of significant SNPs of the total analyzed, ranged from 0.02 ‰ to 0.50 ‰ for leaf C content on chromosome 10, along with δ15N on chromosomes 6 and 10, and leaf N content on chromosome 5 (Table S1b), respectively. In the case of growth traits, 2 and 148 associations were detected for DBH and h, respectively. Within the ecophysiological traits, the number of significant associations ranged from 12 to 220 for C content and leaf N-content, respectively. For traits related to the chemical composition of wood, associated SNP-markers were over the significance cutoff (p-value < 6.1479E-8). Similarly, in the case of wood metabolites, considering a selected subset of those with the top five highest heritability estimates, no significant associations meeting the adjusted p-value were identified for Adenosine (Ade), Hydroxybenzoic Acid (HbA), Galactinol (Gal), Galactonic Acid (GAc) and Alpha tocopherol (Toc). The proportion of phenotypic variation accounted for the cumulative effect of significantly associated SNPs was 0.2 %, 1.1 %, 0.1 %, 0.7 % and 0.7 % for DBH, h, leaf C content, leaf N content, and δ15N, respectively.
Significant single nucleotide polymorphisms associated with phenotype were identified mostly in exonic regions. SNPs are part of genes encoding proteins belonging to the functional classes: Protein Synthesis/Modification (54.5 %), DNA/RNA Metabolism (27.3 %), Energy/Metabolism (9.1 %) and Signal transduction (9.1 %) (Fig. 3a). A list with these SNPs and genes is given in Table S3. An example for the Protein Synthesis/Modification category was a gene encoding a Periodic Tryptophan Protein 1 (Potri.007G019500), which was associated with height, and leaf N and δ15N. Among genes related with proteins involved in DNA/RNA Metabolism, one for a helicase senataxin (without gene model in Phytozome) was significant for height and leaf N. For genes in the Energy/Metabolism functional class, a representative was one (Potri.015G119700) encoding a Domain of unknown function (PGG), which was associated with DBH. For the Signal transduction class, the gene encoding a Rop Guanine Nucleotide Exchange Factor 1 (Potri.009G140100) was significant for height and leaf N.
Considering the applied significance threshold with Bonferroni correction (p-value < 6.1479E-8), GWAS performed on single-SNPs was successful in identifying polymorphisms associated with growth traits (DBH and h), leaf C and N-contents, as well as stable isotope parameters (δ15N) (Fig 2, Table S1). For traits related to spring bud phenology (DBF), wood chemical components (C5 and C6 sugars, lignin) and wood metabolites (GAc, Gal and HbA) significant associations at p-value< 0.0001 were detected, but they did not reach the adjusted threshold. The presence or lack of significant SNPs for these traits appears to be independent of heritability estimates for each. For some traits with moderate to high H2i (e.g. S:G ratio or DBF), GWAS did not detect single-SNP associations. On the other hand, for traits with low to moderate H2i (e.g. leaf C-content and δ15N) a relatively higher number of SNPs were identified. Similar situations were observed for phenology traits in previous studies with P. trichocarpa [19]. On average for all traits with significant associations ~ 1 % of phenotypic variation was accounted for by the cumulative effect of significant SNPs. The influence of multiple SNPs associated with phenotypes is particularly interesting in the context of the development of models for genomic selection, where large numbers of markers are utilized to predict the genetic merit of individuals [46]. Differences among traits in terms of the number of significant SNP-markers suggest the differential effect of both the variable number of SNPs influencing each trait and the individual impact of some SNPs. In that sense, some individual SNPs could have a such low effect size that none reach statistical significance. Furthermore, the apparent lack of correspondence between estimates of H2i and the phenotypic variance collectively accounted for by SNPs, could be explained by non-additive effects (e.g epistasis, GxE effect) or epigenetic factors acting on some traits. These types of effects are usually underestimated because MLM utilized for GWAS only suppose additive interactions [19]. Finally, another factor influencing the number of significant associated SNPs (and their effect on phenotypes) deals with the complexity of analyzing thousands of single markers across the genome. Stringent thresholds for controlling type I error are required for p-value adjustment in GWAS, given the correlated nature of markers along a chromosome [47]. For example, it has been suggested that the general applicability of the traditional false discovery ratio (FDR) [48] may suffer from several problems when applied to association analysis of a single trait [49]. In that sense, we utilized the Bonferroni correction to define the significance threshold. Thus, in spite of significant associations were detected at p-value < 0.00001 (and even lesser) in traits such as Vol, DBF, lignin or GAc, they did not reach the adjusted p-value threshold and were considered non-significant.
Sliding window analyses
The multiple-marker analysis by sliding-window allowed us to identify genomic regions containing different sets of SNPs jointly associated with each trait. Figure 4a depicts a representative Manhattan plot with the significant windows identified for leaf δ15N. Manhattan plots for other traits are included in Fig. S3. A variable number of windows per chromosome were detected among the phenotypes (Fig. 2 b and d). The total number of significant windows ranged from 6 for HbA, to 192 for N content (Table S2). For most traits, the main contributions were observed by chromosomes 8 and 1. However, for traits such as DBF, C:N, δ15N, and Toc, the most relevant chromosomes in terms of the number of significant windows included to 6, 4, 5 and 10, respectively. The multiple-SNP approach applied by sliding window analysis has been proposed as a robust alternative for identifying clustered significant patterns of SNPs, that are associated with complex traits, in a chromosomal context in humans and plants [39, 50-52]. In our study, significant windows identified a series of SNP clusters which were coincident with coding regions of multiple genes (Table S4). The graphical relationship between SNPs identified by single-marker associations and the detection by sliding window analysis is depicted in Fig.4, where the highlighted window (Fig.4 a) contains 14 significant SNPs belonging to the XRN4 gene (Fig.4 b). Additionally, information coming from both detection approaches allowed us to define genome zones with high LD, significantly associated with phenotypic variation, revealing the presence of phenotypically-relevant haplotypes (Fig. 4c). Although more evidence will be necessary, haplotype blocks defined by this way could be indicative of polymorphic regions with pleiotropic effects.
Considering the top three most significant windows across all chromosomes and traits, the most represented functional classes were Protein with Unknown Function (21.2 %), Energy/Metabolism (21.1 %), Protein Synthesis/Modification (19.2 %), and Transcription (15.4 %) (Fig. 3b). A list with the windows and genes included in these classes are given in Table S4. Some of the detected genes encoding proteins with roles in Protein Synthesis/Modification were those expressing Similar to Threonyl-tRNA Synthetase (Potri.008G145600), Interleukin-1 Receptor-Associated Kinase 4 (Potri.008G145900) and Leucine Rich Repeat (LRR1) (Potri.005G015700) associated with wood C5-sugars, C6-sugars and height, respectively. An example of the genes dealing with proteins belonging to the Energy/Metabolism class were Exostosin Heparan Sulfate Glycosyltransferase-Related (Potri.010G197900) and Similar to Aldehyde Dehydrogenase 1 Precursor (Potri.012G078700), associated with δ15N and DBF, respectively. Among genes encoding enzymes involved in Transcription, Similar to Agamous-like MADS Box Protein AGL12 (Potri.019G076800) and WRKY Transcription Factor 10-related (Potri.013G086000), were associated with Gal and C5-sugars, respectively. Concerning the Protein Synthesis/Modification class, examples of identified genes are those encoding a Similar to Threonyl-tRNA synthetase and Leucine Rich Repeat (LRR_1)//Leucine rich repeat (LRR_8), associated with C5-sugars and height, respectively.
Genes detected by single-SNP association and sliding windows approaches
We also verified the consistency between SNP-markers identified by single-SNP association and sliding window analyses. To this end, we considered as an example the most significant window (#154; p-value 4.70E-25) detected in chromosome 5 for leaf δ15N (Fig. 4a). The SNPs identified within the window were part of a gene (Potri.005G048900) encoding a Similar to 5'-3' Exoribonuclease (XRN4) Gene, which is involved in disease resistance, response to ethylene, RNAi, and miRNA-mediated RNA decay [53]. The associated window comprised 64 SNPs (Fig. 4b). Fourteen of these markers, were also identified by the single-marker association, indicating the consistency between both approaches. Particularly, markers such as S05_3547832, S05_3547864, S05_3547904 and S05_3548573 were in high LD (r2 > 0.75) and produced significant variation at the level of leaf δ15N means (Fig. 4c). Alternatively, alleles for those SNPs represent intronic and non-synonymous polymorphisms, involving possible effects on transcript splicing, and protein structure and function.
Significant associations for traits underlying growth, nutrient metabolism and xylem formation, among others, define SNPs and genes, which might represent logical candidates for functional studies focused on confirming their role and impact on phenotype of Populus species. Considering their high significance and the simultaneous detection by the single-SNP association and/or sliding windows approaches, we centered part of our analysis on gene Similar to 5'-3' Exoribonuclease (XRN4), which included SNPs and windows associated with leaf δ15N. The exoribonuclease function of this gene links it to transcription, RNA metabolism and RNA interference in eukaryotes [54]. In plants, it has been related to ethylene signaling [55] and response of plants to abiotic and biotic stresses [56, 57]. Mutation of members of the XNR gene family produced sensitivity to N starvation in Saccharomyces cerevisiae [58] and morphological alterations in Arabidopsis thaliana [59, 60]. Thus, association of XRN4 with leaf δ15N, an indicator of N use efficiency [61], could be related to the N metabolism and mobilization at leaves, particularly during the last third of the growing season, when sampling was done. More studies will be required to detect possible effects of SNPs at XRN4 on photosynthesis and biomass production.
Chemical composition of wood
Association analyses of the chemical composition of P. trichocarpa wood was previously performed in the same population by our group [18] using a candidate gene approach. We carried out a comparison between the results from both studies considering the 40 candidate genes utilized by Wegrzyn et al. [18], which encodes enzymes from the cellulose and lignin biosynthesis pathways and cytoskeletal proteins. Association results in the present study were significant only under a p-value < 0.0001 threshold. Results indicated both overlap and divergence between the two studies (Fig. 5 and Table S5). SNPs within genes encoding cellulose synthase (CesA1A) were significantly associated with C6-sugars in both studies. For this trait, we also detected SNPs in TUA5 gene, which were not identified by Wegrzyn et al. [18]. In the case of lignin content, members of the cellulose synthase gene family (CesA2B and CesA1A) were differentially detected (differentially) in both studies. We also identified SNPs belonging to 4- Serine Hydroxymethyl Transferase SHMT6. Finally, for the S:G ratio, our analyses detected SNPs in Laccase LAC1A, Phenylalanine Ammonia-Lyase PAL5 genes, which were not identified by Wegrzyn et al. (2010). In spite of the genotypes and wood chemical characterization methods were mostly the same in both studies, distinct trial sites (the first in Westport, Oregon, and the second in Davis, California), sampling height or differential presence of juvenile wood, among other factors, might explain the differences in the findings reported previously [18] and those described in the present work.