We used GWAS to identify DNA polymorphisms associated with biomass production and wood chemical composition in P. trichocarpa, which determine its potential as feedstock for lignocellulosic ethanol. This approach complements our previous phenotypic characterization of the same association population [5] by identifying SNPs underlying traits of growth, ecophysiology and wood quality, the primary traits targeted for the development of genetically improved clones suitable for dedicated biomass and bioenergy plantations. An approach based on sequence capture allowed us to detect genotype-phenotype associations across most of the P. trichocarpa gene space.
Phenotyping
The association population used in this study consisted of 461 clones (from 101 provenances) comprising an important part of the natural distribution of P. trichocarpa in the US Pacific Northwest. Significant genetic variation was previously reported for growth, spring bud phenology, water use efficiency, C and N assimilation, as well as lignocellulosic components and metabolome of wood (Table 1) [5]. Expressed as coefficient of variation, this ranged from 3.6 to 78.3 %, for leaf Δ and the wood metabolite Hydroxybenzoic acid (HbA), respectively. Similarly, clonal repeatability, represented in terms of individual heritability estimates, varied among the traits, from 0.07 for C5-sugars to 0.9 for DBF. We hypothesized from this information that multiple polymorphic loci across the genome should be detected in association with phenotypes, and particularly, those with high heritability should reveal a large number of significant SNP-markers.
Genotyping
The processes of exome sequencing and genotyping identified 5.1 million SNPs across the P. trichocarpa genome in the association population. After the filtering steps described above, a set of 813,280 SNPs was used for association analyses (Table 2). The number of selected SNPs was proportional to chromosome size, ranging from 29,287 to 100,299 SNPs, for chromosomes 9 and 1, respectively (Table 2, Fig. 1a). Considering the full genome length, an average of one SNP every 482 bp (Table 2) was included in the analyses. Taking advantage of the full genome assembly, genotyping methodologies such as those based on sequence capture can target entire exons or genes across the genome, avoiding bias arising by a priori selection of candidate loci [20, 22]. In comparison to similar preceding studies that used SNP array platforms [6, 14, 16, 39], the number of SNPs in our analyses represent a significant increase in the power of applied genomic scanning.
Intra-chromosomal linkage disequilibrium
The extent of linkage disequilibrium (LD) was analyzed across each chromosome. On average, the LD over physical distance decayed below r2 0.2 at 26.9 kbp. A representative example, for Chromosome 12, is depicted in Fig. 1 b. The complete set of chromosomes with its LD is included in Fig. S1. The decay varied depending on specific chromosomes, with the most rapid decay observed on chromosomes 7 and 15 (r2 0.2 at 18.9 kbp) and the slowest decay on chromosome 11 (r2 0.2 at 51.6 kbp). Genome-wide LD decay exhibited different extents among chromosomes (Table 2). LD decay to r2<0.2 was observed on average at 26.9 kbp. High variation of LD across the genome (among and within chromosomes) has been reported for this species [20]. The estimated extent of LD decay predicted in our study is higher than the observed by Wegrzyn et al. [14] (r2 0.2 at ~0.5 kbp) and Wang et al.[40] (r2 0.2 at ~8 kbp) for P. trichocarpa. Distinct methodologies, number of markers, population sizes, genetic origins and standard errors among the studies may account for the different findings. Compared with other tree species extent of LD estimated in this study is similar to species belonging to Fraxinus [41], Prunus [42] and Eucalyptus [43] genus.
Single SNP-marker associations
Considering the set of filtered SNPs and variables relating to growth, phenology, ecophysiology and wood chemical components, we conducted a total of 12.2 million tests. In the case of wood metabolites, the total number of tests was 4.07 million. Figure 2 (a, c, e and g) depicts the number of significant associations (p-value < 0.0001, q-value < 0.1) detected per chromosome for a selected set of traits. A detailed list for each trait is provided in Table S1. Similarly, Manhattan plots for each phenotype are included in Fig. S2. In general, and consistently with chromosome length, the highest and lowest numbers of significant associations were observed for chromosomes 1 and 9, respectively. The proportion of significant SNPs of the total analyzed, ranged from 0.01 ‰ to 0.77 ‰ for C:N on chromosome 1, and height on chromosome 5 (Fig. 2a), respectively. In the case of growth traits, 32 and 270 associations were detected for DBH and h, respectively. For Vol, all associations (110) had a q-value over 0.1. The same was observed for the 95 significant markers detected for DBF. Within the ecophysiological traits, the number of significant associations ranged from 5 to 463 for SLA and leaf N-content, respectively. For traits related to the chemical composition of wood, associated SNP-markers (with p-value < 0.0001) ranged from 43 to 63 for C5-sugars and S:G ratio, however, these associations were over the q-value threshold of 0.1. In the case of wood metabolites, considering a selected subset of those with the top five highest heritability estimates, the number of associated SNPs varied from 29 to 79 for Hydroxybenzoic Acid (HbA) and Galactonic Acid (GAc) and Alpha tocopherol (Toc), respectively. No significant associations meeting the FDR criteria were identified for Adenosine (Ade) and Hydroxybenzoic Acid (HbA). The effect of SNPs on phenotypic variation depended on specific traits. The proportion of variation accounted for the cumulative effect of significantly associated SNPs (with q-value <0.1) ranged from 1.5 % for SLA to 38.8 % for GAc (Table 1).
Single nucleotide polymorphisms associated with phenotype were identified in both intergenic and genic regions. SNPs in that last category are part of genes encoding proteins belonging to a variety of functional classes. Considering the three most significant SNPs per trait, across all traits, the most represented classes were Protein Synthesis/Modification and Unknown Function Proteins (24.5 %), DNA/RNA Metabolism (20.4 %) and Energy/Metabolism (12.2 %) (Fig. 3a). A list with the SNPs and genes included in these classes is given in Table S3. An example for the Protein Synthesis/Modification category was a gene encoding a Periodic Tryptophan Protein 1 Potri.007G019500), which was associated with height, and leaf N and δ15N. Among genes related with proteins involved in DNA/RNA Metabolism, one encodes a cAMP-Regulated Phosphoprotein 21 (Potri.008G021900), significant for NArea and SLA. For genes in the Energy/Metabolism functional class, a representative was one encoding a 4-Hydroxyphenylpyruvate Dioxygenase, which was associated with leaf C variation (Potri.005G205200).
Considering the applied significance threshold (p-value 0.0001 and q-value < 0.1), GWAS performed on single-SNPs was successful in identifying polymorphisms associated with growth traits (DBH and h), leaf C and N-contents, as well as stable isotope parameters (δ15N) (Fig 2, Table S1). For traits related to phenology (DBF), wood chemical components (C5 and C6 sugars, lignin) and wood metabolites (GAc, Gal and HbA) associations with p-value< 0.0001 were detected, but they did not reach the q-value threshold. The presence or lack of significant SNPs for these traits appears to be independent of heritability estimates for each. For some traits with moderate to high H2i (e.g. S:G ratio or DBF), GWAS detected a few or no single-SNP associations. On the other hand, for traits with low to moderate H2i (e.g. leaf C-content, δ15N, GAc) a relatively higher number of SNPs were identified. Similar situations were observed for phenology traits in previous studies with P. trichocarpa [16]. On average for all traits with associations with q-value < 0.1, 22.8 % of phenotypic variation was accounted for by the cumulative effect of significant SNPs. The influence of multiple SNPs associated with phenotypes is particularly interesting in the context of the development of models for genomic selection, where large numbers of markers are utilized to predict the genetic merit of individuals [44]. Differences among traits in terms of the number of significant SNP-markers suggest the differential effect of both the variable number of SNPs influencing each trait and the individual impact of some particular SNPs. In that sense, some individual SNPs could have a such low effect size that none reach statistical significance. Furthermore, the apparent lack of correspondence between estimates of H2i and the phenotypic variance collectively accounted for by SNPs, could be explained by non-additive effects (e.g epistasis, GxE effect) or epigenetic factors acting on some traits. These types of effects are usually underestimated because MLM utilized for GWAS only suppose additive interactions [16]. Finally, another factor influencing the number of significant associated SNPs (and their effect on phenotypes) deals with the complexity of analyzing thousands of single markers across the genome. Stringent thresholds for controlling FDR may not be appropriate for p-value adjustment, given the correlated nature of markers along a chromosome [45]. In this study, significant associations (p-value < 0.0001 and even lesser) in traits such as Vol, DBF, Leaf Δ or S:G ratio were over a q-value of 0.1, 0.3 or higher, and they were considered non-significant. It has been suggested that the general applicability FDR suffers from several problems when applied to association analysis of a single trait and proposed alternative significance criterion (the genomewise k-error rate) [46]. Thus, further data analyses will be required to establish if associations excluded by p-value adjustment should be included in the set of SNPs that effectively control some specific traits.
Sliding window analyses
The multiple-marker analysis by sliding-window allowed us to identify genomic regions containing different sets of SNPs jointly associated with each trait. Figure 4a depicts a representative Manhattan plot with the significant windows identified for leaf δ15N. Manhattan plots for other traits are included in Fig. S3. A variable number of windows per chromosome were detected among the phenotypes (Fig. 3 b, d, f, h). The total number of significant windows (q-value < 0.1) ranged from 26 for S:G ratio, to 291 for N content (Table S2). For most traits, the main contributions were observed by chromosome 1. However, for traits such as DBH, DBF, C:N, δ15N, and GAc, the most relevant chromosomes in terms of the number of significant windows included to 5, 6, 4, 5 and 10, respectively. The multiple-SNP approach applied by sliding window analysis has been proposed as a robust alternative for identifying clustered significant patterns of SNPs, that are associated with complex traits, in a chromosomal context in humans and plants [37, 47-49]. In our study, significant windows identified a series of SNP clusters which were coincident with coding regions of multiple genes (Table S4). At the same time, and as it is expected, those clusters included SNPs identified by single-marker associations, as depicted in Fig.4. Significant windows were also detected in gene regions where no significant single SNP-markers were present. Additionally, information coming from both detection approaches allowed us to define genome zones with high LD, significantly associated with phenotypic variation, revealing the presence of phenotypically-relevant haplotypes (Fig 4c and 5d). SNP clusters identified by sliding window analysis were coincident with SNPs detected by single-marker tests, which were significant for two traits simultaneously (Fig 5). For example, in the case of the negatively correlated wood traits, C6-sugars and lignin content [5], both approaches detected the Leucine Rich Repeat (LRR1)//Leucine Rich Repeat (LRR8) gene (Potri.005G015700), which is important for the variation of these two wood components (Fig. 5c and d). Although more evidence will be necessary, haplotype blocks defined by this way could be indicative of polymorphic regions with pleiotropic effects.
Considering the top three most significant windows across all chromosomes and traits, the most represented functional classes were Protein with Unknown Function (26.3%), Protein Synthesis/Modification (22.8 %), Energy/Metabolism (17.5 %) and Transcription (14 %) (Fig. 3b). A list with the windows and genes included in these classes are given in Table S4. Some of the detected genes encoding proteins with roles in Protein Synthesis/Modification were those expressing Similar to Threonyl-tRNA Synthetase (Potri.008G145600), Interleukin-1 Receptor-Associated Kinase 4 (Potri.008G145900) and Leucine Rich Repeat (LRR1) (Potri.005G015700) associated with wood C5-sugars, C5-sugars and lignin, respectively. An example of the genes dealing with proteins belonging to the Energy/Metabolism class were Exostosin Heparan Sulfate Glycosyltransferase-Related (Potri.010G197900) and Similar to Aldehyde Dehydrogenase 1 Precursor (Potri.012G078700), associated with δ15N and DBF, respectively. Among genes encoding enzymes involved in Transcription, Similar to Agamous-like MADS Box Protein AGL12 (Potri.019G076800) and WRKY Transcription Factor 10-related (Potri.013G086000), were associated with Gal and C5-sugars, respectively.
Genes detected by single-SNP association and sliding windows approaches
We also verified the consistency between SNP-markers identified by single-SNP association and sliding window analyses. To this end, we considered as an example the most significant window (#154; q-value 7.97E-23) detected in chromosome 5 for leaf δ15N (Fig. 4a). The SNPs identified within the window were part of a gene (Potri.005G048900) encoding a Similar to 5'-3' Exoribonuclease (XRN4) Gene, which is involved in disease resistance, response to ethylene, RNAi, and miRNA-mediated RNA decay [50]. According to Populus trichocarpa genome site, this gene is expressed mainly in dormant buds (Fig 4d). The associated window comprised 64 SNPs (Fig. 4b). Fourteen of these markers, were also identified by the single-marker association, indicating the consistency between both approaches. Particularly, markers such as S05_3547832, S05_3547864, S05_3547904 and S05_3548573 were in high LD (r2 > 0.75) and produced significant variation at the level of leaf δ15N means (Fig. 4c). Alternatively, alleles for those SNPs represent intronic and non-synonymous polymorphisms, involving possible effects on transcript splicing, and protein structure and function.
The identification of SNP-markers using both single-SNP association and sliding window approaches was analyzed considering the high correlation between some pairs of traits and the possibility of the presence of pleiotropic effects. For example, Venn diagrams and Manhattan plots identified a set of SNP-markers associated simultaneously with C6 sugars and lignin (Fig. 5 a, b). A significant negative correlation has been observed between C6-sugars and lignin in Populus trichocarpa wood, in both phenotypical (rp =-0.81) and genetic (rg =-0.79) terms [5]. These common SNPs belong to a gene encoding a Leucine Rich Repeat (LRR 1)/Leucine Rich Repeat (LRR 8) protein (Potri.005G015700), recently linked to cell elongation and wall extension [51]. This gene included a segment with markers exhibiting moderate to high LD (r2 > 0.50), which was associated with significant variation in both traits (Fig 5 c, d). This sort of interaction was also analyzed for DBH and h, and leaf N and δ15N (Fig. S4). Results indicate the consistency between both association approaches for finding common markers underlying different traits. At the same time, significant sliding windows (q-value <0.1) support detection by single SNP-marker tests, particularly for those traits in which FDR threshold was not reached.
Significant associations for traits underlying growth, nutrient metabolism and xylem formation, among others, define SNPs and genes, which might represent logical candidates for functional studies focused on confirming their role and impact on phenotype of Populus species. Considering their high significance and the simultaneous detection by the single-SNP association and/or sliding windows approaches, we centered part of our analysis on genes Similar to 5'-3' Exoribonuclease (XRN4) and Leucine Rich Repeat (LRR 1)//Leucine Rich Repeat (LRR 8), which included SNPs and windows associated with leaf δ15N, and wood C6-sugars and lignin, respectively. Gene XRN4 is expressed mainly in dormant buds (Fig 4d), according to information available for P. trichocarpa. The exoribonuclease function of this gene links it to transcription, RNA metabolism and RNA interference in eukaryotes [52]. In plants, it has been related to ethylene signaling [53] and response of plants to abiotic and biotic stresses [54, 55]. Mutation of members of the XNR gene family produced sensitivity to N starvation in Saccharomyces cerevisiae [56] and morphological alterations in Arabidopsis thaliana [57, 58]. Thus, association of XRN4 with leaf δ15N, an indicator of N use efficiency, could be related to the N metabolism and mobilization at leaves, particularly during the last third of the growing season, when sampling was done. More studies will be required to detect possible effects of SNPs at XRN4 on photosynthesis and biomass production. On the other hand, as mentioned above, gen LRR1//LRR8 encoding a Leucine-Rich Repeat Receptor-Like Kinases is expressed mainly in stems, fully expanded buds and female flowers (Fig. S5). LRR genes play important and diverse roles in growth and development in poplars [59]. Evidence supports the role of these receptors as a signaling components in the regulation of the synthesis of cellulose and lignin that controls the secondary cell wall thickening [60, 61]. This role would explain the association that we detected simultaneously for both chemical components of wood (C6-sugars and lignin).
Chemical composition of wood
Association analyses of the chemical composition of P. trichocarpa wood was previously performed in the same population by our group [14] using a candidate gene approach. We carried out a comparison between the results from both studies considering the 40 candidate genes utilized by Wegrzyn et al. [14], which encodes enzymes from the cellulose and lignin biosynthesis pathways and cytoskeletal proteins. Association results in the present study were significant only under a p-value < 0.0001 threshold. Results indicated both overlap and divergence between the two studies (Fig. 6 and Table S5). SNPs within genes encoding cellulose synthase (CesA1A and CesA2B) were significantly associated with C6-sugars in both studies. For this trait, we also detected SNPs in genes not previously identified by Wegrzyn et al. [14]. These included SNPs within 4CL1, LAC2 and TUA5. In the case of lignin content, different members of this gene family (CesA2B and CesA1A) were differentially detected (differentially) in both studies. Moreover, we identified lignin-associated SNPs belonging to 4-Coumarate:CoA Ligase 4CL1, Laccase LAC2, Serine Hydroxymethyl Transferase SHMT6 and the cytoskeletal protein α-Tubulin TUA5. Finally, for the S:G ratio, our analyses detected 4-Coumarate:CoA Ligase 4CL1, Phenylalanine Ammonia-Lyase PAL5, Caffeoyl CoA O-Methyltransferase CoAOMT2 and Ferulate 5-Hydroxylase F5H1, and the cytoskeletal β-Tubulin TUB16, which were not identified by Wegrzyn et al. (2010). In spite of the genotypes and wood chemical characterization methods were mostly the same in both studies, distinct trial sites, sampling height or differential presence of juvenile wood, among other factors, might explain the differences in the findings reported previously [14] and those described in the present work.