E-index Method for Identifying Dynamic Growth Traits of Poplar


 To detect the mechanism of growth, volume is important to uncover the genetic basis of dynamic complex quantitative traits. Unfortunately, it is difficult to construct the unique simple growth curve to accurately describe the growth process of all trees by the conventional GWAS based on the functional mapping method, which reduces the power of statistics for the growth model. To address this issue, this work adopted a novel approach about the Earliness degree index (E-index). First, it adopted the method of spline interpolation to fit the growth data to acquire the growth curves for each tree. Second, an innovative calculation model based on E-index was used to measure the earliness degree for each growth curve and to identify the potential relationship between QTL effects and traits by a series of hypothesis tests. Besides, a permutation test could be used to estimate the threshold for p values and to screen out significant QTLs from SNPs related to the growth process. To verify the validity and practicability of our model, we applied this method on the data about the volumes of 64 poplar trees chosen randomly from the progeny of two poplar species I-69 and I-45 with 156362 single nucleotide polymorphisms (SNPs). Through the E-index method, 13 significant markers were identified for testcross and 10 for intercross related to the growth process. Overall, this study could help elucidate the underlying genetic mechanisms of complex dynamic traits and perform marker-assisted selection in poplar.


Introduction
The growth of trees is a process of dynamic change during its cycle life and is deeply influenced by genetic factors. Accordingly, detecting genetic polymorphisms that regulate economically primary traits plays an important role in identifying targets for genetic modifications that improve productivity. Genome-wide association study reveals putative regulators of bioenergy traits in Populus deltoides. Specifically, the discovery and breeding of superior tree varieties have long been a critical task of forestry management. Because poplar has several distinguishing characteristics such as modest genome size, rapid growth rate, high yield, and relative ease of experimental manipulation [1,2]. As the perennial woody plant, the poplar is the first species to complete the whole genome sequencing for the study on forestry genomics [3,4]. Therefore, among the various species of trees, the poplar was considered as the model forest species and is most suitable for genome research.
Quantitative traits are characters of the continuous range of variation and are influenced by both environmental and genetic factors. Quantitative traits are far more common than quality traits and can be hugely polygenic [5]. Studying the genetic characteristics of tree quantitative traits is very important for genetic improvement of cultivated trees. However, the quantitative traits are hard to be like quality traits to accurately distinguish different phenotypes and to analyze genetic behavior such as gene separation, recombination and linkage through dominant and recessive traits. For poplar, volume is one of the significant quantitative traits to measure the economic values and it is quite meaningful to find poplar varieties with high volume in the development process. Therefore, this article dissected the genetic regulation of grown development for poplar to breed varieties with high yields.
The traditional QTL mapping analysis adopted statistical methods to map and identify specific complex loci and was widely applied in agriculture, forestry, fishery and other fields [6,7]. However, these QTL mapping methods merely focused on the phenotypic data at one single time point individually in the process of biological growth, and it cannot accurately describe the development mechanism of organisms related with the time factor. In order to overcome this limitation, Wu et al. introduced the functional mapping method [8] to search superior genotypes related with developmental process. This method described the growth law by estimating growth parameters for different genotypes at one gene and constructed the statistical model to detect the genotypes with statistically significant difference of growth parameters [9]. Furthermore, the method of functional mapping has been well extended and there have been a wealth of literatures applying it to analyze the growth mechanism of organisms [10][11][12][13][14][15]. But the functional mapping method does not work well to find the significant gene loci related to traits when the growth curve or function of its growth process is not monotonic or even worse there does not exist uniform function with parameters to fit the successively collected measurement values. To overcome the inherent inadaptability of functional mapping, we proposed E-index method to cumulatively measure earliness degree of development in the overall process which was verified in simulation data and showed very good effect [16].
Further studies would be needed to screen out the gene loci with substantial ability of growth through statistical method of multiple hypothesis testing. Nevertheless, there are hundreds of thousands of SNP makers for each sample of poplar, and the same group of phenotypic data is calculated and tested for so many times. So it will increase the type I error rate, accordingly there will be a lot of wrongly judged to mislead a wrong significant association [17]. Therefore, as the original level of significance α needs to be adjusted. The most convenient way is to use Bonferroni method to correct. That is to say, the significant level of adjustment is α/n, where n is the number of SNP markers. This method is convenient, but it is too conservative to miss the real significant SNP markers easily. To improve the accuracy of correction and to solve the problem of multiplicity, an alternative criterion for many statistical methods was provided such as step down method [18], step up method [19], False Discovery Rates (FDR) method [20], etc. Permutation test as one of the FDR method [21,22] is based on the rearrangement of sample data for multiple hypothesis tests and is suitable for small samples with unknown overall distribution. In this paper, E-index method combined with permutation test was proposed and applied to detect QTL related with growth of poplar and opened a new way to select more comprehensive species with strong growth ability in breeding.

Plant materials
As a valuable genetic resource, artificial hybridization is commonly used to identify the significant quantitative traits of species. In this paper, the set of progeny for poplars was chosen through the following hybrid strategy as a pilot study of genetic mapping. First, a full-sib family population of 450 F1 hybrids was generated by a cross between Populus deltoides clone I-69 and Euramerican poplar (P. x euramericana) clone I-45 at Nanjing Forestry University [6]. All of the progeny was planted at Zhangji Forest Farm (34.14°N, 117.38°W) in Jiangsu Province, China, in 1987 and the detail situation of plant environment was described in the several literatures [23,24].
We randomly chose 64 of these progeny along with their two parents, I69 and I45, to be genotyped genome-wide by single nucleotide polymorphisms (SNPs) using the Applied Biosystems (Foster City, CA, USA) QuantStudio 12K Flex Real-Time PCR System. In all, a total of 156,362 SNPs were available via screening and sorting for quality control, which belongs to testcross and intercross makers respectively [25]. In order to improve the statistical effect, the SNP markers with Minimum Allele Frequency (MAF) ≤ 0.05 were eliminated. So that the genotypic data about 109244 testcross markers segregated from only one heterozygous parent I-69 or I-45 and 38904 intercross markers from both heterozygous parents were acquired ultimately for further analyses. The growth data closely related with the time factor were captured through above trait and values about volumes were measured annually for 24 years from 1987 to 2010. For the phenotypic data of poplar growth, it is almost impossible to find a uniform function be fit all of the growth data well. Thus, we introduced the novel E-index method to replace the traditional QTL mapping analysis to detect the genetic mechanism of growth development.

E-index's definition and properties
The concept of E-index was proposed by Wang et al. to measure the earliness degree of growth, which is a powerful tool to quantitatively depict the development features such as the faster or slower, earlier or later growth [16]. This method could successfully identify the specific SNP which govern the growth process for the same biomass at the final time point. In order to expend the scope of applications for any growth process without this restriction, an improved method is founded on framework of the E-index and addresses the issue on poplar. For the character of poplar volume data in experiment, the specific E-index method was implemented through the following biologically meaningful mathematical equations to characterize the interplay between QTLs and development.
Suppose that ( ) is a growth curve of poplar and it is a continuous function on a closed interval [a, b], a < b, where represents the year (time point) during growth process, and ( ) is the measure value such as volume or other phenotypic value of poplar sample at time point (year). Also assume that ( ) is differentiable within its range and ′ ( ) is its derivative function. Then the -index of the growth curve can be defined as In order to illuminate the concept of E-index intuitively, we used 4 types of growth curves (2a) ~ (2d) as followed as an example: The E-index values can be calculated according to formula (1) and illustrated in Fig. 1(a) ~ (d). For each curve of plants, the day is from 0 to 10, and the biomass is from (0) to (10).
Take in Fig. 1(a), for instance, first. The growth formula of (2a) is a linear function and the biomass of plant is 0 at the beginning point and 10 at the end point. The earliness degree of individual (2a) calculated by equation (1) is 5.00 which can be seen from Fig. 1(a), that is the proportion of the triangle area with green-lines to the whole time span.
Similarly, for the growth curve (2b) with non-constant rate, the E-index value can be calculated as 8.61 which indicated that the plants with growth curve (2b) gain their biomass earlier than plants with (2a). The difference of earliness degree between these two curves is intuitively displayed by the area with green lines in Fig. 1(a) and 1(b). Using the same method we can calculate the earliness degree of the growth curves (2c). The E-index value is 3.99 which is the smallest among curves (2a) ~ (2c), indicating the growth time of plants with (2c) is the latest than the others with (2a) and (2b).
In the last, for non-monotonically increasing growth curve (2d), the earliness degree of growth curve was illustrated in Fig. 1 (d) by the area with green lines minus the area with pink lines which is the area between the curve and the horizontal line = 0. We can compute the corresponding Eindex as function (3) This result is consistent with our observation and intuition: growth and development occur earliest than other three types of growth curves (2a) ~ (2c). Thus, it indicates that the E-index method are not only applied to measure the monotonically increasing function, but also can be used to calculate the non-monotonically increasing function. For the non-monotone increasing function, the E-index value might be greater than 10 as growth curve (2d). It indicates that the -index for growth curves could be any designated value in the range. Thus, from these four types growth curves, we can calculate the E-index value through the formula above easily to measure how early growth occurs in the whole growth process to know the growth status of every plant.

Calculation
To measure the biological growth and development properties, the most of the traditional QTL mapping methods could be applied in the condition of the known growth curves which can be represented by the monotonically increasing function with unique parameters. Nevertheless in the actual process of biological growth, it is difficult to acquire the function that satisfies the above strict requirements. Meanwhile, the growth values at some time points may be missing due to the limitation of the measurement condition or other factors, which may affect the accuracy of curve fitting. Fortunately, E-index method is not restricted by these requirements and works well if we get each function of growth curve for the corresponding samples data, either discrete or continuous. So the challenge is how to find a function fitting the successively collected measurement values well. Here, we adopt spline interpolation approach to fit the phenotypic data of poplar growth to avoid the influence of missing values on curve fitting. Even if the growth values are missing at one or several interpolation time points in the whole process of development, the growth curve also can be well fitted by the spline interpolation method. Based on the fitting growth curves of every plant, the E-index method can be further applied to differentiate complex dynamic traits with the statistical framework and to detect QTL in the developing process.

Spline interpolation method
Spline interpolation is a form of interpolation with a special type of piecewise polynomial called a spline. To calculate the E-index value about the plant, the spline interpolation can be used to fit the phenotypic data of poplar growth because the interpolation error can be made small even when using low-degree polynomials for the spline. Meanwhile, it has been verified that a typical kind of spline interpolation, the cubic spline interpolation function, fits the growth data well under the condition of unknown growth curve [16]. Here, we took one of the parents of poplar plant I-69 for example to describe the calculation process. We measured the volume values at each time point from 1 th year to 24 th year as displayed as grey point in the Fig. 2. A smooth curve can be obtained by spline interpolation to fit the data well and be drawn as the red line. the effectiveness of process. Therefore, even without having the original curves and only with several interpolation points, we can use the cubic spline interpolation to fit the collected measurement data and calculate the Eindex values according to the spline functions to measure earliness degree of each plant. Because the function obtained by the spline interpolation is globally differentiable, the formula (1) can be transformed with following equation: E-index is displayed with all grey area in Fig. 2(a) and its value is 0.248 by calculation as equation (4). Because the growth period of trees is very long, it is often difficult to obtain all biomass at each time point during the tree growth process. E-index method can be not only used to calculate the earliness degree during the whole growth process but also at any period of growth. For example in Fig. 2(b), the growth cycle is divided into two periods (1,12) and (12,24), for each segment we can calculate the E-index value using the equation (4) and the value of E-index vector is (0.077, 0.202). For the second segment ∈ (12,24) in Fig. 2(b), the two parts of the numerator in equation  (4) is illustrated by the grey area and its value is 4.880-2.4599=2.4202. So the E-index value is 2.4202/10=0.202 by standard calculation. The E-index vector's value indicates that the two part of the curve is similar trend, but the second segment is more earliness degree than the first segment.

E-index's statistical framework
For the above collected data of trail, we constructed an E-index's statistical framework to differentiate complex dynamic traits and verified that some genotypes significantly affect the phenotypes. We adopted two different statistical framework as followed for two types SNP markers, intercross and testcross.

Testcross markers
There are two cases for testcross markers. One case is the markers which is segregating heterozygous mother I-69 and homozygous father I-45 and the other case is the opposite. Because they could be applied by the similar statistical framework, we chose the first case to present the statistical steps in detail. Suppose the genotypes of the progeny are FF for homozygote and FM for heterozygote and the numbers of the corresponding samples are m and n. And suppose that the volume value vector of the ith sample for genotype FF is = ( ,1 , ,2 , ⋯ , , ), = 1,2, ⋯ , , and that of the jth samples for genotype FM is = ( ,1 , ,2 , ⋯ , , ), = 1,2, ⋯ , at each time point , = 1, 2, … , (where = 24). The statistical steps are as followed, using t-test with (m+n-2) degrees of freedom to discover the significance difference between genotype FF and FM.
(1) For each phenotype vector and of poplar sample, cubic spline interpolation method can be fit to obtain the continuous curves ( ) and ( ), where = 1, 2 ⋯ , 24 is the time point for measurement of plants.
(3) After that a statistic testing can be defined as follows where ̅̅̅̅̅ is the mean of the vector and ̅̅̅̅̅̅ is the mean of the vector (here we suppose that ̅̅̅̅̅ > ̅̅̅̅̅̅ ) and the common variance is denoted as where 2 is the sample variance of and 2 is that of .
(4) We can test the null hypothesis that the two groups of samples are not significantly different: versus the alternative hypothesis that the two groups of growth curves are significantly different: 0 will be rejected if > ( + − 2) , otherwise 0 will be accepted, where is the computation result of (5) and ( + − 2) is the -distribution value with the confidence level and ( + − 2) degrees of freedom.
(3) After that F-test can be defined as follows: = m( ̅̅̅̅̅ − ̅ ) 2 + n( ̅̅̅̅̅̅ − ̅ ) 2 + l( ̅̅̅̅̅̅ − ̅ ) 2 , where ̅̅̅̅̅ is the mean of the vector , ̅̅̅̅̅̅ is the mean of the vector , ̅̅̅̅̅̅ is the mean of the vector , and ̅ is the mean of E-index for all samples. The degrees of freedom ( 1 , 2 ) in this statistical value is (2, m+n+l-3).
(4) We can test the null hypothesis that the mean of E-indexes values about three groups samples are not significantly different: versus the alternative hypothesis that the three groups of growth curves are not all the same. 0 will be rejected if > ( 1 , 2 ) , otherwise 0 will be accepted, where F is the computation result of (9) and ( 1 , 2 ) is the F-distribution value with the confidence level and ( 1 , 2 ) degrees of freedom.

Permutation test
Based on resampling methods, permutation test was advocated by Churchill and Doerge [26], and it were frequently employed in practical data for the high dimensional testing such as GWAS [27]. To find the meaningful SNP effectively, we adopted the permutation test method to filter significant SNP from the first candidate1000 SNP markers with the smallest p values which were acquired by the above statistical framework. For each candidate SNP marker, we recorded the number of different genotypes and the respectively E-index values calculated for each poplar samples. Here, we took a SNP testcross marker Scaffold.Id 9/8587909 as an example.

Results
From 148,148 SNP markers included testcross and intercross of poplar, the E-index method was used to calculate the relative earliness degree; furthermore, the permutation test was applied to screen out the statistically significant gene fragment as QTL. In the following description, we illustrate the detecting results for significant QTL in detail under different SNP effects, both additive and dominant.

Manhattan plot
Based on the above E-index method and the statistical framework, the probability p values of testcross and intercross SNP markers can be calculated. Because these values are too small to be difficult to detect their differences between them, we have a logarithmic conversion to the p value represented as the −log (p). The smaller the p value is, the greater the −log (p) value will be [28].
By means of GWAS with the above permutation testing, we made the Manhattan plot across 19 chromosomes for 148,148 SNP makers as shown in Fig. 3 about intercross marker and testcross markers. Significant SNP markers could be screened as QTLs, of which −log (p) values are greater than 8.0 ( = −ln (3.4E − 4)) for testcross markers and are greater than 9.3 ( = −ln (8.8E − 5)) for intercross markers. The detailed information about the most statistical significance QTLs for poplar is displayed in the Table 1 included 13 testcross markers and 20 intercross markers and can be shown in Fig. 3 which are the points above the red line. Table 1. Segregating type SNP Line no. Chr Physical 45 allele to variation in growth form. It appears that intercross QTLs were distributed much more centralized throughout the poplar genome and most of the significant intercross markers for the volume growth process were located on chromosome 5 (Fig. 3b).

QTL detection by E-index method
In order to vividly display the effectiveness of E-index, we choose the most statistically significant SNP markers for additive effects and dominant effects. The typical growth curves of poplar samples about testcross marker Scaffold.Id 9/8587909 are drawn out in Fig. 4(a). Two groups of the growth curves of genotypes FF and FM are generated respectively in red and yellow lines, and the average growth curves for both corresponding genotypes are displayed by blue and green lines. Intuitively, the two groups of growth curves are obviously far apart during the development processes and the most of growth curves about genotype FF are above genotype FM and the end time point. That is to say, the poplar samples with the SNP markers FF for this QTL are early maturity plants during the growth development and have a greater volume than the plants with genotypes FM at the endpoint with high yields. So if we want to get fast-growing poplar samples with great volume, we need to select poplar varieties with genotype FF of this SNP marker. Meanwhile, for the intercross marker Scaffold.Id 5/ 5137853, the growth curves can also be divided into three groups successfully by the E-index method for different genotypes FF, FM and MM in Fig. 4(b). The E-index value can quantify the trend of the whole growth curve vividly neither at any point of time, nor just for the ending time point.

Calculation of heritability
Heritability is an important measure to characterize the effect of genes on traits [29]. Here we introduced the heritability to explain the additive effects and dominant effects for each significant SNP marker through the measured phenotypic values about the volume of poplar samples and the observed genotypes [30]. There is an additive effect for each significant testcross SNP marker and the values of heritability can be calculated through equation (13) at each time point as followed, = |̅̅̅̅̅ − ̅̅̅̅̅| or = |̅̅̅̅̅ − ̅̅̅̅̅̅|, by where 1 is the estimated allele frequency for F, and 0 is the estimated allele frequency for M, is the median estimate of the additive effect for SNP j and ̅̅̅̅̅ , ̅̅̅̅̅ or ̅̅̅̅̅̅ are the absolute mean value of the corresponding phenotypes for different genotypes FF, FM or MM. And the variance of all phenotype values about the samples is expressed by (̃). Meanwhile, there are additive and dominant effects for each significant intercross SNP marker and the value of heritability can be calculated by equation (14) with the similar meaning about the same symbol as followed [31], The value of heritability could indicate the effects of SNPs and their data about the statically significant SNPs at the beginning time point are also listed in table 1. It can be seen that the mean values of heritability is 6.56% and 6.16% for the selected testcross SNP markers and intercross SNP markers respectively. The dramatic variation of heritability for these markers at each time point of the whole process can be drawn in Fig. 5. It can be seen clearly in the diagram that the two types of heritability curves show a general trend that increases rapidly and then decreases gradually, with a peak at years 3-7. Fig. 5. The heritability curves of testcross markers (a) and intercross markers (b) with statistically significant in the temporal pattern.

Conclusion & Discussion
This paper provided an effective and novel method to apply the E-index theory to measure the earliness degree of development for poplars. Traditional QTL studies, such as functional mapping, are used to identify the growth mechanism according to growth curves which should meet some strict requirements. Nevertheless, in some cases, it is not easy to acquire the uniform function for all of the growth curves to describe the development of poplar trees. Thus, according to this actual situation, we adopted the cubic spline interpolation method to fit the phenotypic measurements about each poplar species well. Then E-index theory was applied to measure the earliness degree of each poplar sample and the hypothesis testing was used about markers to acquire the candidate QTL. Finally, we used the permutation test to verify and screen out the significant SNP markers.
Although this article only deals with the earliness degree for volume growth of poplar, the Eindex method can also be applied in other situations with a similar mode of growth. For example, the same method can apply to other traits such as weight, area and so on. Also, it can be extended to any tree species rather than only about poplars described in this experiment. In addition, this paper provided more broadly useful guidelines to get the accurate SNP markers as QTL by measuring earliness degree of development in any growth phase to reveal detailed characteristics of growth and it provided theoretical and practical support for poplar breeding.