Statistical modelling of precipitation data in Canadian Prairies with a dynamic mixture structure

The present paper aims to model the statistical structure of heavy-tailed precipitation data. This is essential for agricultural purposes in the Canadian Prairies in many respects, such as assessing and managing the risks resulting from unexpected precipitation events. Daily (or weekly) precipitation time series often contain many zeros (on dry days) and also exhibit important characteristics such as heavy-tailedness and volatility clustering. These features make it challenging to develop an effective model from both theoretical and practical viewpoints. In this paper, we propose a dynamic mixture model, built on a generalized Gaussian crack distribution with a GARCH specification, that accommodates the full range of precipitation measurements with sufficient flexibility to fit both thin- and heavy-tailed data as well as stochastic volatility. Maximum likelihood estimation with the profile log-likelihood algorithm is illustrated through simulation studies. Model fitting results on a historical dataset from twelve stations in the Canadian Prairies demonstrate the applicability of the proposed model.


Introduction
The modelling of the statistical structure and essential features of precipitation data has received a great deal of attention from researchers in the field over the past couple of decades (Stern and Coe 1984; Dunn 2004; Lennartsson et al. 2008). Precipitation, as an essential climate factor, plays a crucial role in various fields such as agriculture and hydrology. In agriculture, the main steps of crop production rely heavily on precipitation patterns. As a result of global warming, the occurrence of extreme precipitation has become more prevalent in some areas around the globe. For example, recorded data reveal that extreme precipitation occurrences, which lead to remarkable crop losses and large floods, have grown in number in the US (Changnon et al. 1997; Karl and Knight 1998). Climate projections indicate that agricultural areas in Canada should expect drier summers across the board but more precipitation during the winter and spring. Consequently, farmers will face too much water to handle in the seeding season but not enough in the growing season, both within the same year. Projections further suggest that even though regions in southern Canada are getting drier during the summer, they may also witness bursts of short-lived yet heavy rainfall. The shape of the precipitation data distribution may be diverse owing to a variety of factors, including the different spatial characteristics of the monitoring stations, the absence of downpours on dry days, and non-Gaussian behaviour on wet days. However, extreme daily precipitation is often overlooked in statistical models. Some studies suggest that daily precipitation may behave in accordance with a gamma distribution described by its shape and scale parameters (Wilks 2011). Even though the gamma distribution is capable of fitting the main body of the data well, it often fails to fit the upper tail (Papalexiou and Koutsoyiannis 2016; Nerantzaki and Papalexiou 2019).
On the other hand, the generalized extreme value (GEV) distribution and the generalized Pareto distribution (GPD) have usually been employed to describe the upper tail (Huser and Davison 2014). By focusing heavily on the extremes, the GEV and GPD families are not flexible enough to model data distributions presenting both thin- and heavy-tailedness. Moreover, the practical application of the GPD is sensitive to the threshold value, whose appropriate selection is particularly challenging in the presence of stochastic volatility. Since both the GEV and GPD models presume a static, time-invariant probability density function (pdf), they are less desirable for precipitation data owing to the dynamic nature of the time series (Benestad 2004). A recent study, however, introduces an approach that reproduces the essential characteristics of rainfall in a simple yet accurate way, and demonstrates improved accuracy when simulating wet/dry spells and correlations within wet spells (Papalexiou 2022).
One of the main challenges in modelling precipitation data is that a large portion of the measurements are zero (on dry days) while the remaining portion takes positive values (on wet days). In the literature, two methods have been proposed to deal with this issue: shifting and censoring, and defining a binary mixture mechanism (Katz and Parlange 1995; Harvey and Ito 2020). The present work adopts the latter, which allocates probabilities to zero and positive measurements and assumes that positive measurements are drawn from a generalized Gaussian crack (GGCR) distribution. The other novel aspect of the present study is the extension of the static GGCR model to a dynamic one through a GARCH-type specification for the scale parameter of the GGCR distribution. Together with the binary mixture structure, this dynamic generalization allows us to handle various time series with non-negative (possibly extreme) values and stochastic volatility.
This paper has three main objectives. First, we introduce a dynamic mixture mechanism to model precipitation measurements. The second goal is to demonstrate the applicability of the proposed model by fitting a real data set from the Canadian prairies. Last but not least, we study the dependence structure of precipitation in the prairies through Archimedean copulas applied to the residuals (the generalized Gaussian component) of the fitted models.
The rest of the paper is organized as follows. Section 2 gives a brief review of the generalized Gaussian crack distribution. In Section 3, a model for precipitation measurements is introduced. In Section 4, the application of the dynamic mixture model is given for the precipitation measurements of twelve stations in the Canadian prairies region. Finally some concluding remarks are made in Section 5.

Preliminaries
In this section, we give a brief overview of the generalized crack (GCR) distribution. Bae and Chen (2017) introduce the following generalization of the Gaussian crack (lifetime) distribution, which is a two-point mixture extension of the Birnbaum-Saunders distribution.

Generalized Crack Distribution
Definition 1 A random variable T follows a GCR distribution, represented by T ∼ GCR(α, β, p; g), if and only if its probability density function (pdf) is written as f_T(t; α, β, p) = p f^g_IS(t; α, β) + (1 − p) f^g_LB-IS(t; α, β), t > 0, with f^g_IS(t; α, β) = (1/(αβ)) (β/t)^{3/2} g((1/α)(√(t/β) − √(β/t))) and f^g_LB-IS(t; α, β) = (1/(αβ)) (β/t)^{1/2} g((1/α)(√(t/β) − √(β/t))), where α > 0, β > 0 and 0 ≤ p ≤ 1 represent the shape, the scale, and the mixture weight parameters, respectively. The density function g(·) is symmetric about zero and is referred to as the auxiliary (or baseline) density function. The functions f^g_IS and f^g_LB-IS are the pdfs of the inverse symmetric (IS) and the length-biased inverse symmetric (LB-IS) distributions with baseline density function g(·), respectively.
The cumulative distribution function (cdf) of the random variable T ∼ GCR(α, β, p; g) can be written as F_T(t; α, β, p) = p F^g_IS(t; α, β) + (1 − p) F^g_LB-IS(t; α, β), (3) where F^g_IS and F^g_LB-IS denote the cdfs of the IS and LB-IS distributions and G(·) is the cdf of the baseline density function. From Eq. (3), it can be noticed that the GCR distribution family is naturally a scale family, meaning that if T ∼ GCR(α, β, p; g) then cT ∼ GCR(α, cβ, p; g) for any positive constant c.
Setting p = 1/2 yields an important special case: the cdf of the generalized Birnbaum-Saunders distribution (Birnbaum and Saunders 1969a, b) with baseline density function g, which is given by F_T(t; α, β, 1/2) = G((1/α)(√(t/β) − √(β/t))), t > 0. Bae and Chen (2017) study the tail behaviour of the GCR family and provide three useful examples of heavy-tailed crack distributions based on the baseline densities of the Student's t, Laplace, and generalized Gaussian (GG) distributions. Here we focus on the generalized Gaussian crack distribution and use it as the main building block for modelling precipitation data. The GGCR family is a large distribution family that contains the Laplace and Gaussian crack distributions as special cases. The distribution is built on the following generalized Gaussian density function as its baseline density:

Generalized Gaussian Crack distribution
g(x; μ, λ, θ) = (θ/(2λΓ(1/θ))) exp(−(|x − μ|/λ)^θ), where μ, λ > 0 and θ > 0 are the location, the scale, and the tail index (shape) parameters, respectively. Note that the variance of the generalized Gaussian distribution with this density is given as σ² = λ² Γ(3/θ)/Γ(1/θ). To maintain symmetry, we let μ = 0. For the identification of the parameters, we further set σ² = 1, which results in λ = √(Γ(1/θ)/Γ(3/θ)). Thus, we obtain the density function of the generalized Gaussian crack distribution as the GCR density with this standardized GG baseline, where α > 0, β > 0, p ∈ [0, 1] and θ > 0. Figure 1 plots a few density curves of the GGCR distribution for parameter values α = 1, 2, β = 5, θ ∈ {0.8, 1, 25} and p ∈ {0.2, 0.5, 0.9}. It can be readily observed from the figure that the GGCR distribution is flexible enough to accommodate various distributional shapes, including both thin and heavy tails. Indeed, the GGCR distribution with shape parameter θ < 2 is heavy-tailed and, with θ ≤ 1, can be deemed a subexponential distribution (Bae and Chen 2017). Moreover, the GGCR distribution belongs to the maximum domain of attraction (MDA) of the Gumbel distribution. Note that distributions with different tail behaviours, ranging from moderately heavy (e.g., the lognormal distribution) to light (e.g., the Gaussian distribution), belong to the MDA of the Gumbel.
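As a quick numerical check of the standardization above, the following Python sketch (our own illustration, not code from the paper) evaluates the GG density with λ = √(Γ(1/θ)/Γ(3/θ)) and verifies by crude trapezoidal integration that it integrates to one with unit variance:

```python
import math

def gg_density(x, theta):
    """Standardized generalized Gaussian pdf (mean 0, unit variance):
    g(x) = theta / (2*lam*Gamma(1/theta)) * exp(-(|x|/lam)**theta),
    with lam = sqrt(Gamma(1/theta)/Gamma(3/theta)) so that the variance
    lam**2 * Gamma(3/theta)/Gamma(1/theta) equals one."""
    lam = math.sqrt(math.gamma(1.0 / theta) / math.gamma(3.0 / theta))
    c = theta / (2.0 * lam * math.gamma(1.0 / theta))
    return c * math.exp(-((abs(x) / lam) ** theta))

def check_moments(theta, lo=-30.0, hi=30.0, n=20000):
    """Trapezoidal check that the density integrates to one with unit variance."""
    h = (hi - lo) / n
    total = var = 0.0
    for i in range(n + 1):
        x = lo + i * h
        w = 0.5 if i in (0, n) else 1.0  # trapezoid endpoint weights
        p = gg_density(x, theta)
        total += w * p * h
        var += w * x * x * p * h
    return total, var
```

For θ = 2 the standardized GG density reduces to the standard normal, and for θ = 1 to a Laplace density with unit variance.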

The proposed model
In this section, we first introduce a dynamic mixture mechanism based on the GGCR distribution to describe the structure of precipitation data. Then, the parameter estimation method, along with the results of a simulation study, is presented.

The dynamic mixture model
Let y_t be a measurement at time t of a precipitation time series. The model assumes that a measurement of value zero occurs with probability π_t whereas, with probability 1 − π_t, a positive measurement is drawn from a GGCR distribution. Formally, y_t = 0 with probability π_t and y_t ∼ GGCR with probability 1 − π_t, where π_t in Eq. (7) is specified as a decreasing function of the conditional volatility, governed by the parameters δ_0 and δ_1 > 0.
We assume u_t is a GARCH(1,1) process with symmetric generalized Gaussian noise. Specifically, u_t = h_t ε_t with ε_t ∼ iid GG(1, θ) and h_t² = ω + γ_1 u_{t−1}² + γ_2 h_{t−1}², where ω > 0 and γ_1, γ_2 ≥ 0. Therefore, u_t ∼ GG(h_t, θ). Consequently, one can easily show that the positive observations are drawn from the GGCR distribution, i.e., y_t ∼ GGCR(α, β, 1/2, ψ), where ψ = (ω, γ_1, γ_2, θ) is the vector of parameters involved in the dynamic GG density. It is worth mentioning that, by the model specification of π_t in Eq. (7), the probability of a zero measurement decreases as h_t increases. In other words, the higher the variability of the climate system, the more likely we are to observe positive precipitation.
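A minimal simulation of this recursion can be sketched in Python (an illustration under our reading of the specification; the GG sampler uses the standard gamma-power transform, and all function names are ours):

```python
import math
import random

def sample_gg(theta, rng):
    """Draw from the standardized GG(1, theta) (mean 0, unit variance):
    |X| = lam * W**(1/theta) with W ~ Gamma(1/theta, 1) and a random sign,
    where lam = sqrt(Gamma(1/theta)/Gamma(3/theta))."""
    lam = math.sqrt(math.gamma(1.0 / theta) / math.gamma(3.0 / theta))
    w = rng.gammavariate(1.0 / theta, 1.0)
    x = lam * w ** (1.0 / theta)
    return x if rng.random() < 0.5 else -x

def simulate_garch_gg(n, omega, gamma1, gamma2, theta, seed=0):
    """Simulate u_t = h_t * eps_t with
    h_t^2 = omega + gamma1 * u_{t-1}^2 + gamma2 * h_{t-1}^2,
    starting h^2 at the unconditional variance omega / (1 - gamma1 - gamma2)."""
    rng = random.Random(seed)
    h2 = omega / (1.0 - gamma1 - gamma2)
    u = []
    for _ in range(n):
        u_t = math.sqrt(h2) * sample_gg(theta, rng)
        u.append(u_t)
        h2 = omega + gamma1 * u_t ** 2 + gamma2 * h2
    return u
```

Because the innovations are standardized, the simulated series has unconditional variance ω/(1 − γ_1 − γ_2) whenever γ_1 + γ_2 < 1.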
Indeed, the distribution of y_t is a discrete-continuous mixture with a point mass at zero. Let 1(y > 0) denote an indicator variable which takes the value one if y > 0 and zero if y = 0. Then, the conditional density function of y_t given y_{t−1} is written as f(y_t | y_{t−1}; Θ) = π_t^{1−1(y_t>0)} [(1 − π_t) f_GGCR(y_t; α, β, 1/2, ψ)]^{1(y_t>0)}, (8) for the vector of parameters Θ = (α, β, ω, γ_1, γ_2, θ, δ_0, δ_1).
A parameter estimation method is to be considered in the next section.

Maximum likelihood estimation
In this study, the parameter estimation of the proposed mixture model is carried out by the maximum likelihood method together with the profile log-likelihood algorithm. Assume y_1, . . . , y_n is a time series of size n taken from the mixture model with the vector of unknown parameters Θ = (α, β, ω, γ_1, γ_2, θ, δ_0, δ_1). The likelihood function, which is the product of the successive conditional densities in Eq. (8), has the form L(Θ) = ∏_{t=1}^{n} f(y_t | y_{t−1}; Θ). The t-th term of the log-likelihood function is written as ℓ_t(Θ) = (1 − 1(y_t > 0)) log π_t + 1(y_t > 0)[log(1 − π_t) + log f_GGCR(y_t; α, β, 1/2, ψ)], so that the log-likelihood function is ℓ(Θ) = Σ_{t=1}^{n} ℓ_t(Θ). With the set of non-zero observations represented by A+ = {t | 1(y_t > 0) = 1}, one can rewrite the log-likelihood function as ℓ(Θ) = Σ_{t∉A+} log π_t + Σ_{t∈A+} [log(1 − π_t) + log f_GGCR(y_t; α, β, 1/2, ψ)], where π_t is evaluated through Eq. (7). The maximum likelihood estimate (MLE) of Θ is obtained by (numerically) solving the system of equations ∂ℓ(Θ)/∂Θ = 0, composed of the partial derivatives with respect to each component of the parameter vector Θ; these are also referred to as the likelihood equations.
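The zero/positive split of the log-likelihood can be expressed compactly; the sketch below (our illustration, with π_t and the GGCR log-density supplied by the caller, since their exact parametric forms are set elsewhere in the paper) mirrors the A+ decomposition:

```python
import math

def mixture_loglik(y, pi, log_f_pos):
    """Log-likelihood of the discrete-continuous mixture, split over zero and
    positive observations:
        l = sum_{y_t = 0} log(pi_t) + sum_{y_t > 0} [log(1 - pi_t) + log f(y_t)].
    `pi` is the sequence of zero-probabilities pi_t and `log_f_pos` maps a
    positive y_t to the log of the GGCR density (both supplied by the caller)."""
    ll = 0.0
    for y_t, pi_t in zip(y, pi):
        if y_t == 0:
            ll += math.log(pi_t)
        else:
            ll += math.log(1.0 - pi_t) + log_f_pos(y_t)
    return ll
```

With π_t = 0.5 for all t and a flat log-density, one zero and one positive observation together contribute 2 log 0.5.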
Due to the non-linear and complex nature of the system of likelihood equations, finding an analytical solution for the MLE of the parameters appears infeasible. Therefore, an appropriate numerical method may be used as an effective alternative. Here we consider the method of profile log-likelihood along with numerical optimization. It takes advantage of two nested maximization procedures to alleviate the instability often faced in numerical optimization with many parameters (Davison 2003). The profile log-likelihood algorithm for the current study is as follows: 1. Set the initial value of h_t² to the sample variance (i.e., h_0² = Var(y_1, · · · , y_n)), and initialize u_t from the non-zero measurements; here p̂ is the empirical estimate of the probability of zero observations. 2. Set the starting values of the parameters δ̂^(0) = (δ̂_0^(0), δ̂_1^(0)). 3. Given δ̂^(k), maximize the first profile log-likelihood function, ℓ(α, β, ω, γ_1, γ_2, θ | δ̂^(k)). 4. Then, maximize the second profile log-likelihood function, ℓ(δ_0, δ_1 | α̂, β̂, ω̂, γ̂_1, γ̂_2, θ̂); the solution, denoted by δ̂^(k+1), is the (k + 1)-th profile likelihood estimate of (δ_0, δ_1). 5. Repeat steps 3 and 4 until convergence.
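The nested scheme in steps 3-4 can be illustrated with a toy alternating maximization in Python (a simplified stand-in for constrOptim.nl/optim using a crude grid line search; this is our sketch, not the estimation code used in the paper):

```python
def alternating_maximize(f, x0, y0, step=0.1, iters=10):
    """Toy profile-style scheme: alternately maximize f over the first argument
    with the second held fixed, then vice versa. A crude grid line search
    stands in for the numerical optimizers."""
    def line_max(g, z):
        cand = [z + step * k for k in range(-20, 21)]  # search window around z
        return max(cand, key=g)
    x, y = x0, y0
    for _ in range(iters):
        x = line_max(lambda t: f(t, y), x)  # "first profile" block
        y = line_max(lambda t: f(x, t), y)  # "second profile" block
    return x, y
```

On a separable concave objective the scheme converges in one sweep; in general it climbs block by block, which is why several restarts from different initial values are advised.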
The estimates obtained from the above algorithm are local maxima rather than the global MLE. To locate local MLEs in the vicinity of the global MLE, it is advised to run the algorithm several times using a wide range of initial parameter values.
To investigate the statistical features of the obtained estimates, a few simulation cases have been studied. We generate time series with three different sample sizes of 500, 1000 and 3000 from the mixture model with a few prescribed sets of parameters. These parameters have been deliberately varied over a wide numerical range so as to cover various distributional behaviours (for instance, thin- to heavy-tailed data and low to high volatility parameters). The R functions constrOptim.nl and optim are employed in steps 3 and 4, respectively, to maximize the profile log-likelihood functions in the algorithm.

Application
In this section we provide an application of the proposed mixture model on some real precipitation data from the Canadian prairies.

Data
The dataset considered in this study consists of daily precipitation records in millimetres (mm) from twelve monitoring stations in the Canadian prairies from 1971 to 2018 (48 years). It was obtained from the Government of Canada's official website. Figure 4 shows the location and topography of the area under consideration. Each station is expected to have 17,532 measurements, which was not the case due to missing values. Table 3 summarizes the stations' spatial information and the percentages of missing measurements. Figure 5 gives the percentages of missing values and their patterns. Even though it is impossible to verify formally, one may postulate that the missing-data mechanism is missing at random (MAR), i.e., the probability of a value being missing depends only on the observed measurements. Under the MAR assumption, the first step in analysing the dataset is the imputation of the missing measurements. Numerous imputation methods, whose applicability largely depends on the missing-data mechanism, can be found in the literature. We have applied two approaches, the k-nearest neighbour (KNN) and spatio-temporal kriging (Troyanskaya et al. 2001; Torgo 2010; Montero et al. 2015), with the mean squared error (MSE) as the basis of comparison between the two. The R functions knnImputation in the "DMwR" package and krigeST in the "gstat" package have been utilized. To select the better of the two methods, a cross-validation procedure is used and the resulting MSEs are reported in Table 4. The results suggest that the KNN performs better on the task of missing-value imputation for the dataset under study.
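The nearest-neighbour idea behind the imputation can be sketched as follows (a minimal, self-contained illustration of averaging the k nearest complete records; it is not the knnImputation routine from "DMwR"):

```python
import math

def knn_impute(rows, k=3):
    """Impute missing entries (None) in a list of equal-length numeric rows by
    averaging the corresponding entries of the k nearest complete rows
    (Euclidean distance over jointly observed positions)."""
    def dist(a, b):
        s, m = 0.0, 0
        for x, y in zip(a, b):
            if x is not None and y is not None:
                s += (x - y) ** 2
                m += 1
        return math.sqrt(s / m) if m else float("inf")

    complete = [r for r in rows if all(v is not None for v in r)]
    out = []
    for r in rows:
        if all(v is not None for v in r):
            out.append(list(r))
            continue
        neigh = sorted(complete, key=lambda c: dist(r, c))[:k]
        filled = [v if v is not None else sum(c[j] for c in neigh) / len(neigh)
                  for j, v in enumerate(r)]
        out.append(filled)
    return out
```

Distances are normalized by the number of jointly observed coordinates so that rows with more missing entries are not unfairly penalized.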

Seasonal adjustment
In this study, we focus on the weekly total precipitation at the monitoring stations, each of which has 2,506 data points. Figure 6 examines the deviations of the precipitation values from the centred averages through box plots to spot possible discrepancies for each week of the year. A seasonal pattern can be readily recognized in the plots. We employ a classical statistical decomposition approach to isolate the seasonality. Both the additive and multiplicative decomposition methods may be considered. The former is suitable when the level of the seasonal fluctuations remains the same throughout the time series, while the latter is best used when the seasonal fluctuations change with the level of the time series. For precipitation data in particular, the multiplicative model is advantageous because it preserves zeros and because the magnitude of the cycles varies with time. One may express the time series in this model as Y_t = T_t × S_t × e_t, where Y_t, T_t, S_t and e_t represent the measurement, the long-term trend, the seasonal component and the random error at time t, respectively. Note that the mixture model described in Section 3 does not assume any seasonal component and thus the seasonality needs to be removed. To accomplish this, the decompose function in R has been used. Figure 7 illustrates the three components of the precipitation time series along with the time series itself. Figure 8 shows the box plots of the deseasonalized time series, in which no obvious seasonal pattern remains. We are now in a position to fit the mixture model introduced in Section 3 to the deseasonalized precipitation measurements.
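For concreteness, the classical multiplicative decomposition (the scheme behind R's decompose with type = "multiplicative") can be sketched in Python; this is our simplified illustration:

```python
def decompose_multiplicative(y, period):
    """Classical multiplicative decomposition y_t = T_t * S_t * e_t:
    centred moving-average trend, then seasonal indices as the mean
    detrended ratio for each position in the cycle, normalized to mean 1."""
    n = len(y)
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        if period % 2 == 0:  # centred MA for an even period
            window = (0.5 * y[t - half] + sum(y[t - half + 1:t + half])
                      + 0.5 * y[t + half])
            trend[t] = window / period
        else:
            trend[t] = sum(y[t - half:t + half + 1]) / period
    ratios = [[] for _ in range(period)]
    for t in range(n):
        if trend[t]:  # skip edges (None) and zero trend
            ratios[t % period].append(y[t] / trend[t])
    idx = [sum(r) / len(r) if r else 1.0 for r in ratios]
    mean_idx = sum(idx) / period
    seasonal = [v / mean_idx for v in idx]
    return trend, seasonal
```

Dividing the series by the repeated seasonal indices yields the deseasonalized series used in the model fitting.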

Model fitting
The descriptive statistics of the precipitation series for each monitoring station are reported in Table 5. It lists the number of zero observations, the empirical extrema, and the sample L-moment ratios (Hosking 1990). The sample L-skewness (t_3) and the sample L-kurtosis (t_4) values suggest that the distribution of precipitation is right-skewed and heavy-tailed for all stations. These descriptive statistics motivate the use of the mixture model constructed in Section 3. Table 6 gives the estimated parameters of the mixture model. As can be seen, two stations in Alberta (FOR and TAB) and two stations in Manitoba (BAL and MAR) have θ̂ values less than 1, indicating a heavy-tail characteristic of precipitation.
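The sample L-moment ratios reported in Table 5 follow Hosking's probability-weighted-moment formulas; a direct Python transcription (our illustration) is:

```python
def l_moment_ratios(x):
    """Sample L-skewness t3 and L-kurtosis t4 via probability-weighted moments
    b_r (Hosking 1990): l2 = 2b1 - b0, l3 = 6b2 - 6b1 + b0,
    l4 = 20b3 - 30b2 + 12b1 - b0, and t3 = l3/l2, t4 = l4/l2."""
    xs = sorted(x)
    n = len(xs)
    b = [0.0] * 4
    for j, v in enumerate(xs, start=1):
        b[0] += v
        b[1] += v * (j - 1) / (n - 1)
        b[2] += v * (j - 1) * (j - 2) / ((n - 1) * (n - 2))
        b[3] += v * (j - 1) * (j - 2) * (j - 3) / ((n - 1) * (n - 2) * (n - 3))
    b = [bi / n for bi in b]
    l2 = 2 * b[1] - b[0]
    l3 = 6 * b[2] - 6 * b[1] + b[0]
    l4 = 20 * b[3] - 30 * b[2] + 12 * b[1] - b[0]
    return l3 / l2, l4 / l2
```

A symmetric sample gives t3 near zero, while a right-skewed sample gives t3 > 0, which is the pattern observed for all stations.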
To evaluate the appropriateness (or the goodness of fit) of the proposed model, we conduct statistical tests on the standardized residuals to check whether they follow the standardized generalized Gaussian distribution (i.e., û_t/ĥ_t ∼ GG(1, θ̂)). This assessment has been made via the Kolmogorov-Smirnov (KS) goodness-of-fit test. The outcomes of the KS test, based on the R function ks.test, are given in Table 7. With test statistics smaller than the critical value, we fail to reject the null hypothesis that the residuals follow the standardized generalized Gaussian distribution for each station under consideration at the 5% significance level. This can also be seen in Fig. 9, which presents the probability histograms and the empirical cumulative distribution functions of the residuals for every station (in red) against those of the standardized generalized Gaussian distribution with the θ̂ pertaining to that station (in black). Moreover, to ascertain the stationarity of the residuals (plotted in Fig. 10), the statistics of the Augmented Dickey-Fuller (ADF) test, whose null hypothesis is the presence of a unit root in the residual series, are summarized in Table 7. These statistics have been obtained with the R function adf.test. According to the results, the null hypothesis of the ADF test can be rejected for all series at the 1% significance level.
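A self-contained version of this residual check, the KS statistic of the standardized residuals against the standardized GG cdf, can be sketched as follows (the cdf is obtained by numerical integration; this illustrates the statistic behind ks.test rather than reproducing it):

```python
import math

def gg_cdf(x, theta, n_steps=4000):
    """cdf of the standardized GG(1, theta) (mean 0, unit variance) by
    trapezoidal integration of the pdf from a far-left truncation point."""
    lam = math.sqrt(math.gamma(1.0 / theta) / math.gamma(3.0 / theta))
    c = theta / (2.0 * lam * math.gamma(1.0 / theta))
    lo = -25.0  # density is numerically zero beyond this point
    if x <= lo:
        return 0.0
    h = (x - lo) / n_steps
    s = 0.0
    for i in range(n_steps + 1):
        xi = lo + i * h
        w = 0.5 if i in (0, n_steps) else 1.0
        s += w * c * math.exp(-((abs(xi) / lam) ** theta)) * h
    return min(s, 1.0)

def ks_statistic(sample, theta):
    """One-sample KS statistic D_n = sup_x |F_n(x) - F(x)| against GG(1, theta)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = gg_cdf(x, theta)
        d = max(d, abs(i / n - f), abs((i - 1) / n - f))
    return d
```

For θ = 2 the reference distribution is the standard normal, which provides a convenient sanity check of the integration.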

Dependence modelling
Studying the co-movement of extremes and the dependence structure of precipitation among different stations in the grassland helps identify and measure possible associations between them. Here we use Archimedean copulas (Nelsen 1999) to model the dependence structure of the residual series of the dynamic mixture model. Prior to employing the copulas, one may examine the empirical estimates of some dependence measures among the paired stations. If the association is not strong or the series are nearly independent, fitting parametric copulas reduces to modelling with the independence copula. To this end, the (pairwise) Kendall's tau and Spearman's rho rank-correlation coefficients among all stations have been computed. In addition, the correlation test with the null hypothesis of zero association has been carried out. As can be seen in Table 8, the coefficients are not significant for the majority of the station pairs. However, meaningful associations can be observed for a few station pairs, such as FOR and QUE (both in Alberta) as well as BAL and MCC (both in Manitoba). In the next step, the best-fit Archimedean copulas are determined for five pairs of stations demonstrating significant associations. In particular for applications to agriculture and vegetation, the weekly precipitation measurement is of main concern, and thus we consider the time series of weekly sums of the measurements obtained from the daily measurements imputed by the KNN. For the parameter estimation of the copulas, the maximum likelihood method has been utilized. Specifically, the log-likelihood function for a copula with density c(u, v), (u, v) ∈ [0, 1]², and parameter θ is given as L(θ) = Σ_{i=1}^{n} log c(u_i, v_i; θ), where n denotes the number of observations. To obtain the estimate of θ, denoted by θ̂, one sets the derivative of L(θ) with respect to θ equal to zero, that is, dL(θ)/dθ = 0. We consider the following four members of the Archimedean family of copulas: the Clayton, Frank, Gumbel and Joe copulas.
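The rank-correlation screening described above can be illustrated with a direct O(n²) computation of Kendall's tau (our sketch; the paper's computations use standard R routines):

```python
def kendall_tau(x, y):
    """Empirical Kendall's tau: (concordant - discordant) / (n*(n-1)/2),
    counting each pair of observations once."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            a = (x[i] - x[j]) * (y[i] - y[j])
            if a > 0:
                s += 1  # concordant pair
            elif a < 0:
                s -= 1  # discordant pair
    return 2 * s / (n * (n - 1))
```

Values near zero motivate the independence copula, while values near ±1 indicate strong monotone association worth modelling parametrically.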
To select the best-fit model among the four candidates, we consider the two most commonly used criteria, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), which assess the model fitting results while imposing a penalty on the number of estimated parameters. Formally, they are defined as AIC = −2 log L(θ̂) + 2k and BIC = −2 log L(θ̂) + k log n, where n and k are the numbers of observations and estimated parameters, respectively. The most suitable copula is deemed to be the one with the smallest values of these two information criteria. The selected copula is then used to simulate the dependence structure of the standardized residuals of the mixture model. The R package "VineCopula" has been used for computation. The corresponding results can be found in Table 9. According to the table, the Gumbel copula gives the best fit for FAB (in Alberta) and INH (in Saskatchewan); hence the dependence structure between these two stations is asymmetric with upper tail dependence. On the other hand, the Clayton copula gives the best fit for FOR and QUE (both in Alberta); the dependence structure between these two stations is asymmetric with lower tail dependence. For the other three pairs, the Frank copula offers the best fit among the candidate models. As the parameter of the Frank copula is close to zero, neither lower nor upper tail dependence can be detected. Overall, considering all θ̂ estimates, the presence of extreme co-movement cannot be established among the monitoring stations; one may infer only weak tail dependencies between the stations under study. To further illustrate the reasonable agreement between the empirical and the best-fitting theoretical copulas, Fig. 11 is provided.
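As an illustration of the fitting-and-selection loop (not the VineCopula implementation), the Clayton copula density, a crude grid-search MLE, and the AIC/BIC formulas can be sketched as:

```python
import math

def clayton_logpdf(u, v, theta):
    """Log-density of the Clayton copula, theta > 0:
    c(u, v) = (1 + theta) * (u*v)**(-theta - 1)
              * (u**-theta + v**-theta - 1)**(-2 - 1/theta)."""
    return (math.log(1 + theta)
            - (theta + 1) * (math.log(u) + math.log(v))
            - (2 + 1 / theta) * math.log(u ** -theta + v ** -theta - 1))

def fit_clayton(pairs):
    """Crude MLE of the Clayton parameter by a one-dimensional grid search
    over theta in (0, 10)."""
    grid = [0.05 * i for i in range(1, 200)]
    return max(grid, key=lambda th: sum(clayton_logpdf(u, v, th) for u, v in pairs))

def aic_bic(loglik, k, n):
    """AIC = -2 logL + 2k, BIC = -2 logL + k log n."""
    return -2 * loglik + 2 * k, -2 * loglik + k * math.log(n)
```

Repeating the fit for each candidate family and comparing the resulting AIC/BIC values reproduces the selection step summarized in Table 9.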

Conclusion
In this study a dynamic mixture model has been proposed to describe the structure of precipitation data. The model has been constructed to take into account important characteristics of precipitation time series, in which many zero observations (on dry days) coexist with positive observations (on wet days) in the presence of stochastic volatility and extreme values. In the proposed model, strictly positive observations are assumed to be drawn from a dynamic GGCR distribution with a GARCH-type specification for the scale parameter of the static GGCR distribution. This enables the model to capture volatility as well as heavy-tailedness. In our mixture model setting, the GARCH component also helps explain the serially dependent probability of zero observations. To demonstrate the applicability of the introduced model, a historical precipitation dataset of twelve stations in the Canadian prairies has been studied. The model fitting results for the shape parameter of the GGCR (θ) suggest that the distribution of precipitation at eight stations is heavy-tailed. Furthermore, the dependence structure of the residual series from the fitted mixture model has been studied in terms of some commonly used Archimedean copulas.
The present paper has focused on the construction of the model, parameter estimation and in-sample model checking.
To further ensure the applicability of the proposed model, out-of-sample forecasting of precipitation can be carried out using the block bootstrapping approach, in which the original serial dependence structure is preserved within each block. In addition, regression terms with geographical and meteorological information may be included in the prediction model as explanatory variables. We will pursue this in a sequel to this paper.