Spatial Mixture Copula Model For Multiple Correlated Responses With An Environmental Application

In environmental monitoring, multiple measurements are often collected at many locations and these measurements depend on each other in complex ways, such as nonlinear dependence. In this research, a novel copula-based geostatistical modelling approach was developed to model multivariate continuous spatial random fields using mixture copulas that captures both spatial and joint dependence of multiple responses over two-dimensional locations. In a bivariate context, the mixture copulas were used to capture the joint spatial dependence of a bivariate random field and the spatial copula of the bivariate random field was constructed as the convex combination of mixture copulas. The proposed model was applied to real forest data and simulated nonlinear data. The performance of the novel method was compared with existing spatial methods, which included a univariate spatial pair-copula model, a multivariate spatial pair-copula model that utilises nonlinear principal component analysis (NLPCA), and conventional kriging. The results show that the proposed model outperforms the existing methods in the interpolation of individual responses and reproduction of their bivariate dependence.


Introduction
In environmental science, multiple attributes are often measured at a given geographical location. For instance, mining industries collect multiple measurements of minerals, such as cadmium, zinc, and copper (Goovaerts et al., 1997). Forest inventories monitor and collect multiple measurements of biomass, such as bole, foliage, stump, branch, and root biomass (Finley et al., 2007). These multiple measurements can depend on each other and ignoring the joint dependence of multiple variables can decrease the accuracy in the interpolation process (Musafer et al., 2017). Specifically, this article focusses on the spatial modelling of multivariate continuous spatial random fields that models the joint spatial dependence of multiple spatial responses. Subsequently, the individual responses can be interpolated across locations by utilising the modelled multivariate spatial dependency.
Traditionally, kriging based geostatistical models have been widely used to capture the linear relationships of univariate continuous random fields over space, where the variogram has been used to measure the dissimilarity of the variables at different locations (Cressie and Wikle, 2011). However, kriging based methods are limited in their ability to model real-world complexity, such as nonlinear spatial dependence (Musafer and Thompson, 2016). Also, a common assumption in the variogram method is that samples are independent of each other, which means that the kriging estimation variance is independent of the sample values (Diggle et al., 2003;da Rocha and Yamamoto, 2000). This assumption is over simplistic, it is more realistic that sample points do depend on each other, for example, exhibiting high correlation between points that are close to each other and with decreasing correlation the further that points are separated. In contrast to kriging, spatial copula models can be used to describe the spatial dependence structure of variables by considering the distance dependent correlation between sample values as well as capturing complex spatial dependencies (Musafer et al., 2017).
Copulas provide a distribution free approach to model multiple random variables by capturing their nonlinear and multivariate dependence (Nelsen, 2007). The copula method has been widely applied in many fields, such as finance, economics, and civil engineering (D'Amico and Petroni, 2018;Patton, 2012;Li et al., 2018), to describe the dependence structure of two or more random variables, and subsequently used in the prediction process. In geostatistics, Bárdossy (2006) introduced an univariate spatial copula framework to predict groundwater quality parameters, such as pH, sulphate, nitrate, and chloride, at unobserved locations. Bárdossy's (2006) spatial copula method divides the distance over which spatial dependence exists into equally spaced intervals, also referred to as distance classes or spatial bins, and requires the use of the same family of copulas to be fitted for each spatial bin. Bárdossy and Li (2008) subsequently developed a method for interpolation from Bárdossy's (2006) spatial copula model. Gräler (2014) proposed a more flexible copula based spatial model that permits copulas from different families to be fitted across the distance classes. They also developed a method for spatial interpolation using the pair-copula construction of Aas et al. (2009), to predict measurements at unsampled locations. The added flexibility of Gräler's (2014) model, over Bárdossy's (2006) model, permits increased accuracy in modelling and prediction. More recently, Sohrabian (2021) modified Gräler's (2014) spatial model with mixture copulas to jointly model upper and lower tail dependence within each spatial bin. However, these spatial copula methods model only univariate continuous spatial random fields and are not capable of jointly modelling multiple dependent continuous spatial random variables.
Existing spatial models for multivariate random fields include coregionalisation kriging models (Cressie and Wikle, 2011), which ignore nonlinear spatial dependence and fail to reproduce the individual responses across locations (Leuangthong and Deutsch, 2003). Also, Gnann et al. (2018) proposed a method to model multivariate spatial dependence using Gaussian copula, which was not capable of modelling non-Gaussian spatial random fields. A solution to model non-Gaussian multivariate spatial random fields, Musafer et al. (2017) proposed a multivariate pair-copula spatial model, whereby the spatial variables are transformed into uncorrelated factors using NLPCA prior to fitting univariate spatial copula models to the independent factors. Subsequent back transformation is required to transform interpolated values to the scale of the original variables and to re-inject correlation. Musafer et al.'s (2017) method indirectly models the joint dependence between spatial variables through a black-box transformation. We propose a novel method to model multivariate spatial random fields that jointly models the spatial variables via a white-box mixture copula. In the application and simulation study presented, this multivariate spatial mixture copula model more accurately reproduced individual responses across locations and successfully reproduced the joint dependence between variables.

Methodology
The proposed approach is presented in a bivariate context. However, the method can be extended to more than two responses with a trivial generalisation of the bivariate setting. The methodology consists of two essential components: modelling each spatial variable separately using Gräler's (2014) univariate spatial copula; then joining the univariate spatial copulas using Sohrabian's (2021) mixture copula. We also use the proposed model for interpolation : the joint dependence between variables is interpolated; then Nelsen's (2007) inverse conditional approach is used to obtain the conditionally interpolated values of an individual variable. A summary of each component of the methodology, and a description of the modelling and interpolation method are provided in the following.

Copulas
Copulas are joint cumulative distribution functions (CDFs) of two or more random variables with uniformly distributed margins on the interval [0, 1]. Let Y 1 and Y 2 be continuous random variables with marginal CDFs F 1 and F 2 , then there exists a unique copula C : [0, 1] 2 → [0, 1] such that, where u and v are any selected quantiles on the interval [0, 1].
There are various types of copula families that describe a variety of dependence structures. Commonly used bivariate copulas include Gaussian, Student-t, Clayton, Frank, Gumbel, and Joe copulas (Nelsen, 2007).

Spatial copulas for univariate spatial random fields
Whilst a non-spatial copula models the dependence between two or more non-spatial continuous random variables, a spatial copula describes the joint spatial dependence of an univariate spatial variable at any two spatial locations.
Let Z(x) denote an univariate second-order stationary spatial random field Z that is observed at twodimensional location x ∈ X , and let X = (x 1 , x 2 , . . . , x n ) be the set of existing observed locations in the given spatial domain X . The spatial copula that describes the dependence of a variable at any two locations separated by distance h is, here F is the CDF of Z and is assumed to be same at each location x.
In calculating the empirical bivariate copula distribution, the distances between every pair where h K is the maximum distance at which significant dependence is observed. Given the pairs of points for each distance class, the bivariate copulas for the each distance class are calculated, and the distance dependent spatial copula is formed as a convex combination of copulas of each distance class as follows, where λ k =h k −h k−1 h k −h k−1 for k = 2, 3, . . . , K,h k is the mean separation distance corresponding to each spatial bin, and h 1 , h 2 , . . . , h K denote the upper limits for different spatial bins (Gräler, 2014).

Mixture copula
The mixture copula is a weighted linear combination of two or more copulas on the interval [0, 1]. Let C 1 and C 2 be the bivariate copulas, then there exists a bivariate mixture copula C m : where 0 ≤ w ≤ 1.

Spatial mixture copula construction for multivariate spatial random fields
] be a bivariate second-order stationary spatial random field that is observed at two-dimensional location x. Fig. 1 shows the basic steps for modelling and interpolation in a bivariate context. However, the method can be extended to spatial random field with more than two responses.

Modelling
Step 1: For each Z l , l = 1, 2, fit the univariate marginal probability distributions of Z l that gives the maximum log-likelihood among competing distributions, e.g., Gamma, Weibull, Normal, Log-normal, and let F l denote the best CDF of Z l .
Step 2: The maximum distance h K is referred as cut-off distance, which can be determined by visual inspection of the correlogram that is a plot of Kendall's tau (τ ) against the mean distance (h k ) of each bin. Theh k for bin k is calculated as the average of the distances between the pairs belonging to that bin, then τ is calculated for each bin. Fig. 2 depicts an example correlogram Fig. 2 An example correlogram. The blue dashed line indicates the cut-off distance at which pairs of points are no longer considered to be spatially dependent.
In Fig. 2, nearby point pairs are highly correlated, the correlation decreases as the distance increases between point pairs, and far points tend to independence at the cut-off distance of 800 km.
Step 3: For each bin k, calculate the empirical bivariate copula C l,k of Z l , Step 4: For each bin k, construct the bivariate mixture copula C m k , where 0 ≤ w ≤ 1.
Step 5: The spatial mixture copula for the multivariate spatial random field Z is constructed as the convex combination of mixture copulas as follows where λ k =h k −h k−1 h k −h k−1 for k = 2, 3, . . . , K,h k is the mean separation distance, and h 1 , h 2 , . . . , h K denote upper limits of the chosen distances for the spatial bins.

Interpolation
Interpolation of Z 2 at location x conditional on the known given value of Z 1 can be generated at the same location x, using the copula CDF of C m h . The interpolation of Z 1 , given Z 2 , is described by simply switching the subscripts 1 and 2 in the interpolation steps.
Step 6: Obtain the joint CDF values of Z using C m h , and let T be the vector with joint CDF values, where T = (t 1 , t 2 , . . . , t n ).
Step 7: Obtain the marginal CDF values of Z 1 using F 1 , and let S be the vector with marginal CDF values, where S = (s 1 , s 2 , . . . , s n ).
Step 8: Derive the conditional distribution of T , given S = s, using the partial derivative of C m h as follows, Step 9: Take, Z 2 = F −1 2 (r).

Application
The proposed method was applied to model real forest data that was taken from georeferenced forest inventory plots in the US Department of Agriculture Forest Service Bartlett Experimental Forest (BEF) in Bartlett, New Hampshire (Finley et al., 2007). The attributes of interest were forest-wide biomass estimations within the area of 1,053 hectares. In this study, only foliage biomass (Z 1 ) and bole biomass (Z 2 ) were used that observed at 335 two-dimensional locations. The interpolation of bole biomass can be used for carbon accounting purposes, and the interpolation of foliage biomass can be used to identify regions with high values of foliage biomass (Musafer et al., 2017). Table 1 gives the summary statistics of the data. Fig. 3a and 3b show the spatial distributions of Z 1 and Z 2 at observed locations. The marginal distributions are illustrated in Fig. 3c and 3d respectively.   The distance between each pair of sample locations was calculated and the cut-off distance was selected as 800 km using the correlograms of Z 1 and Z 2 . Ten equally spaced spatial bins were created and the pairs of points were placed into relevant spatial bins based on the distance calculated. The empirical bivariate copulas for each bin of Z 1 and Z 2 were fitted and the ML values were compared among competing bivariate copulas, Gaussian, Student's t, Clayton, Frank, Gumbel, and Joe. Table 2 shows the best fit copulas and the estimated copula parameters θ, where C 1,k and C 2,k are the fitted bivariate copulas of Z 1 and Z 2 respectively. Table 2 The best bivariate copulas for each bin. The joint dependence of Z 1 and Z 2 was modelled using the mixture copulas with the equal weight (w = 0.5). The mixture copulas were constructed as the weighted sum of the bivariate copulas in Table   2, and Table 3 shows the mixture copulas of each bin. Table 3 The mixture copulas of each bin.
The proposed spatial mixture copula method was used to interpolate Z 1 and Z 2 using the inverse conditional approach as described in the steps 6-9. The interpolated values were compared with the interpolation based on the existing pair-copula, co-kriging, and NLPCA spatial methods. Subsequently, the actual values and interpolated values at sampled locations were used to calculate the root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). Also, the accuracy in the reproduction of the bivariate relationship of Z 1 and Z 2 was evaluated based on the mean square error from the kernel density estimation (KDE MSE). The KDE MSE is calculated by taking the mean of the squared differences between the bivariate KDEs of the actual data and interpolated data. Table 4 shows the model validation results. According to Table 4 almost all the RMSE, MAE, and MAPE values are the lowest for the Z 1 and Z 2 interpolations based on the spatial mixture copula. The MAPE value of co-kriging method is the smallest for the Z 1 interpolation that is very close for the spatial mixture copula. Thus, it is clear that the proposed method outperformed in the interpolation of Z 1 and Z 2 across the observed locations. Also, the proposed method accurately reproduces the bivariate relationship in terms of the minimum value of KDE MSE. Fig. 4 shows the bivariate relationship of Z 1 and Z 2 between actual and interpolated values.

Simulation study
A simulation study was proposed to assess the performance of the novel spatial mixture copula method.
In this simulation study, an existing ordinary kriging simulation method was modified to simulate non-Gaussian and non-linear bivariate random fields. The summary of the simulation study is given in steps a-d. The semi-variance γ(h) with seperation distance h is calculated as and ordinary kriging can be used to predict an unobserved measurement at an unsampled location x 0 as followsẐ where Z(x) is a univariate spatial random field and β i are weights that determined from the exponential variogram.
Here, we used the ordinary kriging method to simulate 10000 univariate Gaussian random fields over the 100 by 100 girded locations, which was clearly explained in Bárdossy (1997).
c) Next, randomly select 100 samples from each Gaussian random field. Algorithm 1 shows the basic steps of simulating non-Gaussian bivariate spatial responses using the simulated univariate Gaussian random fields from the step b.
Algorithm 1 Algorithm for simulating non-Gaussian bivariate spatial responses. Definition and Notation: # Let S be the matrix with 10000 simulated Gaussian random fields with sample size 100 # i is the index of first response # j is the index of second response S 1,i = N U LL # vector to store the first non-Gaussian responses S 2,j = N U LL # vector to store the second non-Gaussian responses # S 1,i (ω) is the i th simulated response at location ω ∈ Ω # S 2,j (ω) is the j th simulated response at location ω ∈ Ω # S 1,i (ω) and S 2,j (ω) are vectors with sample size 100, i.e., ω = (ω 1 , ω 2 , . . . , ω 100 ) Calculation: while ( while (S 2,j with desired spatial correlation structure is obtained) S 2,j = (S1,i) 7 10 5 + noise # to obtain a strong non-linear bivariate relationship # Calculate the best marginal distribution of S 2,j (a non-Gaussian marginal distribution) # Create the spatial bins for S 2,j (seven bins) end while end for } end while d) Fit the proposed spatial mixture copula C m h for each simulated bivariate non-Gaussian random field, and simultaneously predict each response at location ω using the inverse conditional method explained in the steps 6-9. Table 5 gives the cross-validation results of the simulation study, which compares the mean RMSE, KDE MSE, and standard error (SE) in the interpolation based on 100 simulated samples with the existing methods. Fig. 6 shows a graphical summary of the bivariate relationship of the simulation study.  6 Relationship between first simulated response (S 1 ) and second simulated response (S 2 ), 100 actual simulated values (red) overlay with 100 interpolated simulated values (black).
The mixture copula method more accurately interpolates each response across locations that achieves the minimum mean RMSE. Also, all these methods are pretty good at reproducing the bivariate relationship in terms of KDE MSE. However, the pair-copula and the co-kriging methods fail to predict extremes of the actual data and predict a small number of data outside of the trend the actual simulated data.

Conclusions
This article proposed a new geostatistical method for modelling spatially correlated multiple responses using mixture copulas. The novel method models multiple spatial responses without any normalisation of the original variables. The method was applied to model bivariate nonlinear spatial responses Z 1 (foliage biomass) and Z 2 (bole biomass). Also, 100 nonlinear bivariate random fields were simulated and the method was applied to each simulation. The model performance was assessed in the cross-validation of actual versus interpolated values at sampled locations. The results showed that the proposed spatial mixture copula model outperformed the existing pair-copula, co-kriging and NLPCA methods in terms of the minimum values of RMSE, MAE, MAPE, and KDE MSE.
Further improvement to the spatial mixture model is the optimal weights selection of each response in the mixture copula modelling. For instance, one spatial response may have a strong spatial dependence across locations than the other responses, and the optimal weights selection may increase the prediction accuracy of each response across locations. Moreover, the novel method can extend to model multiple spatial responses with time, where the temporal dependence can incorporate in the spatial mixture copula modelling with the multivariate spatial random fields.