Spatiotemporal Prediction of COVID-19 Mortality and Risk Assessment

This paper presents a multivariate functional data statistical approach for the spatiotemporal prediction of COVID-19 mortality counts. Specifically, a spatially heterogeneous nonlinear parametric functional regression trend model is first fitted. Classical and Bayesian infinite-dimensional log-Gaussian linear residual correlation analyses are then applied. The nonlinear regression predictor of the mortality risk is combined with the plug-in predictor of the multiplicative error term. An empirical model ranking, based on random K-fold validation, is established for COVID-19 mortality risk forecasting and assessment, involving Machine Learning (ML) models and the adopted Classical and Bayesian semilinear estimation approaches. This empirical analysis also determines which ML models are favored by the spatial multivariate Functional Data Analysis (FDA) framework. The results could be extrapolated to other countries.


Introduction
Coronavirus disease 2019 (COVID-19) has spread rapidly to many countries since arising in China in December 2019 (see [44]; [51]). It has caused substantial damage in terms of both human lives and economic cost. There is no vaccine specifically designed for this virus. Thus, the absence of reliable data regarding the transmission of this infectious disease makes cautious responses necessary. A crucial problem is the geo-localization of outbreaks, since it allows a more effective allocation of medical resources (see, e.g., [11]; [25]; [34]; [36], just to mention a few). Some mathematical models have been introduced in the literature to describe the spatiotemporal dynamics of COVID-19. To generate and assess short-term forecasts of the cumulative number of reported cases, three models are presented in [37], which were validated with outbreaks of diseases other than COVID-19. Alternative SEIR-type models with small variations have been formulated, involving, for instance, stochastic components (see [26]). In [19], a θ-SEIHRD model is introduced, well adapted to COVID-19, based on the Be-CoDiS model (see [20]). This model is able to estimate the number of cases, deaths, and hospital beds needed. The most remarkable feature of the approach presented in [19] is the balance between complexity and identifiability of the model parameters.
Mathematics contributes in an essential way to the derivation of epidemiological models, and to the subsequent analysis of the causes, dynamics, and spread of an epidemic/pandemic (see, e.g., [17] and [23] for an overview of mathematical models in Epidemiology). In particular, ordinary differential equations have been extensively applied to characterize the time evolution of an infectious disease over uniform population distributions within a habitat. Most of those models are compartmental epidemic models, such as the SIR (susceptible-infectious-recovered) model. The SIR model was first proposed in [24] to explain plague and cholera epidemics. It was later extended to the SIR-susceptible (SIRS) model to allow imperfect immunity (see, e.g., [12]). Indeed, there exists an extensive literature on SIR models based on ordinary differential equations (see, e.g., [13]; [21]; [27]; [35]; [41]; [45]; [49]). Delay differential equations have also been applied in the SIR modeling context (see [5]; [31]; [40], among others). These models are limited, since no structure due to spatial position is reflected. Reaction-diffusion models provide a framework for their spatial extension, reflecting the spread of an infectious disease over a spatial region. Thus, they provide the simplest modeling alternative in Ecology and Epidemiology involving space and time. A diffusive predator-prey model is analyzed in [15]. A one-dimensional SIR model, including constant diffusive movement of all individuals, is derived in [47] with no-flux boundary conditions. SIR modeling based on hyperbolic partial differential equations is studied in [32], and on parabolic partial differential equation systems with no-flux boundary conditions in [28]. See also [39] and [38] on a SEIRD (susceptible, exposed, infected, recovered, deceased) model that incorporates the spatial spread of the disease with inhomogeneous diffusion terms.
The stochastic version of SIR-type models is intended to address several limitations regarding uncertainty in the observations and in the hidden dynamical epidemic process. Specifically, the compartment counts are observed with error, and several factors introduce uncertainty into the hidden epidemic process, such as the presence of heterogeneous populations. The transmission, recovery, and loss-of-immunity rate parameters, for example, are themselves uncertain. In a first stage, a probabilistic mechanism is adopted involving a Markov chain of SIR states (see [4]; [48]). Some more recent stochastic models involve complex networks (see [50]; [43]) or drug-resistant influenza (see [10]). As pointed out in [52] (see references therein), these models do not take into account the observation error in the counts, and obtain the mass balance equation from the observed counts. Nor is the uncertainty in the parameter space accounted for. The approach presented in [52] covers these gaps, adopting a Bayesian hierarchical statistical SIRS model framework. Specifically, the stochastic SIR-like model fitting is performed assuming that the counts are observed with error and that the dynamical evolution of the SIR process is stochastic. Parameter uncertainty is also modelled under this Bayesian statistical framework (see also [2]).
Many alternative models have been proposed to address the problem of uncertainty in the counts and the parameters, in the spatiotemporal statistical and Cox process context (see, e.g., [1]; [4]; [14], among others). Some of them go beyond the SIR modeling framework (see, e.g., [7] or [22], where the problem of correlation of rates of change between neighbouring sites or individuals is addressed). Survival Dynamical Systems are proposed in [46], paying attention to survival functions instead of population counts. We particularly refer to the approach presented in [33], where a spatially descriptive, temporally dynamic hierarchical model is formulated to handle inference from small-area counts of the number of infected individuals during successive, regular time intervals. The realizations of an underlying multivariate autoregressive process are involved in the infection relative risk modeling, incorporating the space-time dynamics and providing an epidemic-field-based view. Markov chain Monte Carlo (MCMC) in a Bayesian statistical framework is applied to compute the posterior estimates of all parameters of interest.
Our paper extends some of the tools applied in [33] to the nonlinear trend model fitting, and to infinite-dimensional statistical frameworks. The prediction approach presented involves nonlinear parametric functional regression, and spatial multivariate functional linear residual correlation analysis. It is well known that Functional Data Analysis (FDA) techniques allow inference to be made on the small-scale and large-scale sample properties, in our case, of the hidden stochastic mortality risk process generating the spatiotemporal counts (see, e.g., [42]). Note that the classical stochastic diffusion models offer a particle view rather than a field view (see, e.g., [30]). Thus, we incorporate large- and small-scale sample information in our statistical analysis framework. The first stage involves nonlinear parametric curve regression, from the observed realizations of the COVID-19 mortality log-risk process, at each one of the Spanish Communities. The second step performs a spatial multivariate functional linear residual correlation analysis. Thus, the resulting semilinear additive plug-in predictor approximates the behavior of the COVID-19 mortality log-risk process in space and time. Bayesian and Classical approaches are adopted in the implementation of the second step. Hence, our approach includes uncertainty in the observed counts, the parameters, and the hidden COVID-19 mortality log-risk process. The presented estimation methodology is validated from the observed number of deaths registered at the Spanish Communities during the period analyzed, from March 8, 2020 until May 13, 2020. Our results show a remarkable qualitative agreement with the reported epidemiological data.
Nowadays, Machine Learning (ML) models have established themselves as serious contenders to classical statistical models in the area of forecasting. Research started in the eighties with the development of the neural network model. Subsequently, research extended the concept to alternative models, such as support vector machines, decision trees, and others, which are collectively called machine learning models ([3]; [16]). This paper also aims to establish an empirical comparative study of these models and our approach. Applying random K-fold validation, the forecasting ability of the ML models is tested in two categories, respectively corresponding to the original and projected functional data. A model ranking is established in both cases. In the second category, the ML models are compared with our multivariate infinite-dimensional Bayesian and Classical semilinear approaches. For the COVID-19 mortality data set analyzed, the Radial Basis Function Neural Network (RBF) and Gaussian Processes (GP) are observed to outperform the remaining forecasting models tested: Support Vector Regression (SVR), Bayesian Neural Networks (BNN), Classical and Bayesian multivariate infinite-dimensional semilinear estimation, Multilayer Perceptron (MLP), and Generalized Regression Neural Network (GRNN). Note that the best K-fold runs for these superior RBF and GP methods are obtained under an FDA preprocessing of the data. Our approach presents a more stable and robust behavior against the random splitting of the functional sample into training and validation or testing samples.
The outline of the paper is the following. The adopted modeling approach is introduced in Section 2. Section 3 derives the estimation methodology. This methodology is applied to the spatiotemporal statistical analysis of COVID-19 mortality in Spain in Section 4. The empirical comparative study with ML models is addressed in Section 5. An empirical model ranking is provided in Section 6, leading to some concluding remarks. In the Appendix, additional outputs obtained in our spatiotemporal estimation of COVID-19 mortality and risk assessment in Spain are displayed. Finally, this Appendix also provides a brief introduction to our implementation of ML models in an FDA framework.

The Model
Let (Ω, A, P) be the basic probability space. Consider H to be a real separable Hilbert space, with H* denoting its dual. Let Λ = {Λ_{t,z}, z ∈ D ⊂ R^d, t ∈ R_+} be a spatiotemporal risk process on (Ω, A, P) such that

ln(Λ_t)(h) = ∫_D ln(Λ_{t,z}) h(z) dz,   t ∈ T,   (1)

where T is a bounded Borel set in R_+, and B denotes the Borel σ-algebra.
defines a temporal realization of the generalized risk process evaluated at the test function h ∈ H. Note that, the weighted integration of the realizations of the log-risk process, in terms of test function h ∈ H, keeps the spatial information on this process. For certain families of compactly supported test functions, the local mean defined by this integral allows the localization of the values of the spatiotemporal risk process over a specific spatial region.
For any bounded Borel set T, let N(T) : (Ω, A, P) → N be a measurable mapping with values in the natural numbers N. Given the observation of I_T(h), the conditional probability distribution of the number of random events N(T) that occur in the bounded interval T ∈ B is a Poisson probability distribution with functional mean I_T(h), h ∈ H. Thus, its conditional characteristic function is given by

E[exp(iωN(T)) | I_T(h)] = exp( I_T(h) (exp(iω) − 1) ),   ω ∈ R.   (2)
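The conditional Poisson specification above can be checked numerically. The sketch below uses a toy intensity surface and a toy compactly supported test function (both purely illustrative assumptions, not fitted quantities) to draw counts whose conditional mean is a localized integral of the risk, and verifies the Poisson mean-variance property:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy spatial grid and log-risk surface (illustrative assumption, not fitted data)
z = np.linspace(0.0, 1.0, 200)
log_risk = 0.5 + 0.3 * np.sin(2 * np.pi * z)

# Compactly supported test function h localizing a sub-region of D
h = np.where((z > 0.2) & (z < 0.5), 1.0, 0.0)

# Functional mean I_T(h): integral of the risk weighted by the test function
I_T = np.trapz(np.exp(log_risk) * h, z)

# Conditional on I_T(h), the count N(T) is Poisson with mean I_T(h)
counts = rng.poisson(lam=I_T, size=20000)
```

Both the sample mean and the sample variance of the simulated counts should be close to I_T(h), as the Poisson specification requires.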

Data Model
Let us consider the following observation model for the generalized log-risk process: for t = 1, …, T,

ln(Λ_t)(ψ_{p,ρ_p}) = g_t(θ(p)) + ε_t(ψ_{p,ρ_p}),   p = 1, …, P,   (3)

where {ψ_{p,ρ_p}, p = 1, …, P} ⊂ H denotes a function family in H, whose elements have compact support defining the small areas where the counts are aggregated. Here, ρ_p, p = 1, …, P, are the localization hyperparameters (in the next section, the center and bandwidth parameters) associated with those small areas. For each p ∈ {1, …, P}, θ(p) = (θ_1(p), …, θ_q(p)) ∈ Θ denotes the unknown parameter vector to be estimated at the pth region, and Θ is the open set defining the parameter space, whose closure Θ_c is compact in R^q. The parametric regression function g_t is continuous over θ(·) ∈ Θ, and t ∈ R_+, with respect to the H-norm. Furthermore, we assume that g_t is of the truncated harmonic form

g_t(θ(p)) = Σ_{k=1}^N [A_k(p) cos(k ϕ_k t) + B_k(p) sin(k ϕ_k t)],   (4)

with

θ(p) = (A_1(p), B_1(p), …, A_N(p), B_N(p)).   (5)

In equation (3), for any T ≥ 2, ε_t is assumed to be a temporally dynamical zero-mean process that satisfies the state equation

ε_t = ρ(ε_{t−1}) + ν_t,   t = 1, …, T,   (6)

where {ν_t, t = 1, …, T} are independent H-valued zero-mean Gaussian random variables, with E‖ν_t‖²_H = E‖ν_0‖²_H < ∞, and P[ν_t ∈ H] = 1, t = 1, …, T. The autocorrelation operator ρ is a bounded linear operator. Thus, the observation noise ε_t obeys an Autoregressive Hilbertian equation of order one (the ARH(1) equation), which admits a unique stationary solution under the condition ‖ρ^{k_0}‖_{L(H)} < 1, for certain k_0 ∈ N (see, e.g., [8]). Here, ‖·‖_{L(H)} denotes the bounded linear operator norm in H.
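A minimal finite-dimensional sketch can help fix ideas about the ARH(1) noise component. Here the operator ρ is represented, purely for illustration, by a diagonal matrix acting on a few basis coefficients, with spectral norm below one so that a unique stationary solution exists:

```python
import numpy as np

rng = np.random.default_rng(1)

# The operator rho is represented by a matrix acting on K basis coefficients
# (e.g. coefficients on the first K eigenfunctions); diagonal here for simplicity.
K, T = 5, 300
rho = np.diag(np.linspace(0.8, 0.2, K))

eps = np.zeros((T, K))
for t in range(1, T):
    nu_t = rng.normal(scale=0.1, size=K)   # H-valued zero-mean Gaussian innovation
    eps[t] = rho @ eps[t - 1] + nu_t       # ARH(1) state equation

# Spectral norm below one guarantees a unique stationary solution
spec_norm = np.linalg.norm(rho, 2)
```

The first coefficient, driven by the largest eigenvalue 0.8, exhibits the strongest lag-one autocorrelation, as expected from the state equation.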

Trend Model Fitting and Residual Correlation Analysis
Let L_1, …, L_P be the small areas where the counts are aggregated. Consider {ψ_{p,ρ_p}, p = 1, …, P} ⊂ H to be a family of compactly supported radial functions in H, whose centers c_p, p = 1, …, P, are allocated at the regions L_1, …, L_P, respectively. The bandwidth parameters ρ_1, …, ρ_P provide their associated windows with suitable size. From (3), we first compute the least-squares estimator (LSE) θ̂_T(p), in the Walker sense, of the parameter vector θ(p), for p = 1, …, P (see (4)-(5)). Thus, for p = 1, …, P, the LSE θ̂_T(p) is obtained as the solution to the following optimization problem:

θ̂_T(p) = arg min_{θ(p) ∈ Θ} Σ_{t=1}^T [ln(Λ_t)(ψ_{p,ρ_p}) − g_t(θ(p))]².   (7)

Hence, g_t(θ̂_T(p)) defines the parametric nonlinear regression predictor of the log-risk process at region p, for any time t ∈ R_+, and for p = 1, …, P.
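With the fundamental frequency ϕ_k = 2π/66 held fixed, as in the real-data analysis of Section 4, a Fourier-type trend of this kind is linear in the amplitude parameters, so the least-squares fit at each region reduces to a linear problem. The following toy sketch (simulated amplitudes and noise; all names are illustrative) recovers the amplitudes at one region:

```python
import numpy as np

rng = np.random.default_rng(2)

# Fixed fundamental frequency, as in the real-data analysis (phi_k = 2*pi/66)
omega, N = 2 * np.pi / 66, 6
t = np.linspace(0, 66, 265)                 # 265 temporal nodes

# Toy "true" amplitudes at one region (illustrative values)
A_true = rng.normal(size=N)
B_true = rng.normal(size=N)

harmonics = np.arange(1, N + 1)
X = np.hstack([np.cos(np.outer(t, harmonics) * omega),
               np.sin(np.outer(t, harmonics) * omega)])
y = X @ np.concatenate([A_true, B_true]) + rng.normal(scale=0.05, size=t.size)

# With omega fixed, the 2N amplitude parameters solve a least-squares problem
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
A_hat, B_hat = coef[:N], coef[N:]
```

Since the 265 nodes cover whole periods of every harmonic, the design columns are nearly orthogonal and the amplitudes are recovered accurately.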

Classical and Bayesian componentwise estimation of the mortality residual log-risk process
Let Y = {Y_t = ln(Λ_t(·)) − g_t(θ̂_T(·)), t = 1, …, T} be the functional residuals from the nonlinear parametric regression model fitting. The empirical autocovariance and cross-covariance operators, based on the residuals Y_t(·), t = 1, …, T, are respectively given by

R^Y_{0,T}(h)(g) = (1/T) Σ_{t=1}^T Y_t(h) Y_t(g),   (8)

R^Y_{1,T}(h)(g) = (1/(T−1)) Σ_{t=1}^{T−1} Y_t(h) Y_{t+1}(g),   (9)

for every h, g ∈ H. The operators R^Y_{0,T} and R^Y_{1,T} are also nuclear operators. In particular, we will consider the spectral decomposition

R^Y_{0,T} = Σ_{k=1}^T λ_{k,T}(R^Y_{0,T}) φ_{k,T} ⊗ φ_{k,T},   (10)

where {λ_{k,T}(R^Y_{0,T}), k = 1, …, T} and {φ_{k,T}, k ≥ 1} denote the empirical eigenvalues and eigenvectors of R^Y_{0,T}, respectively. The following classical componentwise estimator of the autocorrelation operator is considered (see [8]):

ρ̂_T(h)(g) = Σ_{k=1}^{K(T)} Σ_{l=1}^{K(T)} [R^Y_{1,T}(φ_{k,T})(φ_{l,T}) / λ_{k,T}(R^Y_{0,T})] ⟨h, φ_{k,T}⟩_H ⟨g, φ_{l,T}⟩_H,   (11)

for every h, g ∈ H, where K(T) denotes the truncation parameter, which depends on the functional sample size T. An optimal choice is given by K(T) = ln(T). Under the conditions derived in Corollary 4.1, Theorem 4.8, and Theorem 8.7 in [8], ρ̂_T is a strongly consistent componentwise estimator of the autocorrelation operator of the residuals, with respect to the norm of the space L(H) of bounded linear operators. Under the assumptions made in Theorem 1 in [18], the corresponding plug-in predictor

Ŷ_t = ρ̂_T(Y_{t−1}),   t = 2, …, T,   (12)

is a weakly consistent estimator of ε_t in equation (6), with respect to the H-norm. We refer to (12) as the classical plug-in predictor of the mortality residual log-risk process.
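The componentwise construction just described can be sketched in finite dimensions: the eigendecomposition of the empirical lag-zero covariance plays the role of the pairs (λ_{k,T}, φ_{k,T}), and the lag-one cross-covariance of the component scores yields the truncated estimator of ρ. Function and variable names below are illustrative, and the simulation is a toy stand-in for the fitted residual process:

```python
import numpy as np

def componentwise_arh1(Y, K):
    """Componentwise ARH(1) estimator on the empirical eigenbasis of the
    lag-0 covariance, truncated at K components (illustrative sketch)."""
    T = Y.shape[0]
    C0 = (Y.T @ Y) / T                            # empirical autocovariance
    eigvals, eigvecs = np.linalg.eigh(C0)
    order = np.argsort(eigvals)[::-1][:K]
    lam, phi = eigvals[order], eigvecs[:, order]
    scores = Y @ phi                              # component scores
    C1 = (scores[:-1].T @ scores[1:]) / (T - 1)   # lag-1 cross-covariance
    return phi @ (C1.T / lam) @ phi.T             # estimated autocorrelation

# Simulate an ARH(1) in a 3-dimensional basis and recover the operator
rng = np.random.default_rng(3)
d, T = 3, 4000
rho = np.diag([0.7, 0.4, 0.1])
Y = np.zeros((T, d))
for t in range(1, T):
    Y[t] = rho @ Y[t - 1] + rng.normal(scale=0.2, size=d)

rho_hat = componentwise_arh1(Y, K=d)
pred = Y[:-1] @ rho_hat.T                         # plug-in one-step predictor
```

The recovered operator is close to the simulated one, and the plug-in predictor reduces the one-step prediction error relative to the trivial zero predictor.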
In the Bayesian componentwise estimation of ρ in (6), we assume that ρ is a compact self-adjoint operator. A beta prior probability distribution with parameters a_k > 0 and b_k > 0 is considered to represent the uncertainty in the eigenvalue γ_k(ρ) of ρ, for every k ≥ 1. From the sample ε_t, t = 1, …, T, of size T, since γ_k(ρ) ∼ B(a_k, b_k), the likelihood function is computed as in equation (13), where ε_{tk} = ε_t(ψ_k) = ⟨ε_t, ψ_k⟩_H, and σ_k = E[ε_t(ψ_k)²], t = 1, …, T, with ψ_k being the kth eigenvector of ρ, for each k ≥ 1. Here, I is the indicator function on the interval (0, 1), and B(a_k, b_k) denotes the beta function. From (13), the Bayesian estimator γ̂_k(ρ) of γ_k(ρ) is given in equation (14). The Bayesian componentwise estimator ρ̂_T of the autocorrelation operator ρ is then computed as in equation (15), where K(T) is the truncation parameter, usually selected as K(T) = ln(T). The resulting Bayesian componentwise plug-in predictor is given in equation (16). In practice, in the computation of the Bayesian componentwise estimator ρ̂_T, we replace ε_t in (13)-(14) by ε̂_t = Y_t = ln(Λ_t(·)) − g_t(θ̂_T(·)), for each t = 1, …, T. The eigenvectors {ψ_k, k ≥ 1} can also be replaced by their empirical counterparts. We refer to (16) as the Bayesian plug-in predictor of the residual log-risk process Y_t. From equations (12) and (16), the corresponding Classical and Bayesian plug-in predictors implemented for the spatiotemporal forecasting of COVID-19 mortality risk are given, for any t and p = 1, …, P, by

Λ̂_t(ψ_{p,ρ_p}) = exp( g_t(θ̂_T(p)) + Ŷ_t(ψ_{p,ρ_p}) ),   (17)

with Ŷ_t given by the classical predictor (12) or the Bayesian predictor (16), respectively. Classical and Bayesian estimation of the autocovariance operator of the innovation process ν_t in (6) is provided in the Appendix.

Statistical Analysis of COVID-19 Mortality in Spain
The COVID-19 mortality analysis performed here is based on the daily records reported by the Spanish National Statistics Institute, from March 8 to May 13, 2020, at the 17 Spanish Communities. Interpolation at 265 temporal nodes, and cubic B-spline smoothing, are applied to obtain our functional data set, in the implementation of the spatially heterogeneous nonlinear parametric curve regression model (3)-(4) (see Figures 9 and 10 in the Appendix).
The spatiotemporal correlation structure of the COVID-19 mortality residual log-risk process is estimated from equations (11) and (15). Thus, from (17), with P = 17, Tables 9-10 in the Appendix display the parameter estimates Â_k(·) and B̂_k(·), k = 1, …, N, in the nonlinear parametric trend model fitting. Here, we have considered N = 6. Thus, 12 parameters must be estimated, since ϕ_k = 2π/66, for k = 1, …, N = 6, is assumed to be constant. In equation (14), for the computation of the Bayesian estimator (15) of the autocorrelation operator ρ, the prior γ_k(ρ) ∼ B(4, 5) has been tested, for every k ≥ 1. Note that, although the choice K(T) = ln(T) is usually considered, we have also tested the truncation parameter values N = 8 and N = 9. Table 1 displays the discrete approximation of the L²([0,66])-norm, multiplied by the scale factor S = 10², of the empirical functional quadratic errors associated with these three truncation parameter values. We find that the sample size T = 265 leads to the threshold dimension K(265) = 9, providing the allowed model complexity in our finite-dimensional implementation of the COVID-19 mortality log-risk predictor, under both the Bayesian and Classical approaches.
Figures 1 and 2 respectively show the maps reflecting the original and estimated spatiotemporal evolution of COVID-19 mortality risk, and accumulated mortality counts, under the presented Bayesian and Classical semilinear estimation approaches, over the 17 Spanish Communities analyzed.

An Empirical Comparative Study
This section provides an empirical model ranking for spatiotemporal COVID-19 mortality risk assessment, based on random K-fold validation, involving the ML models introduced in the Appendix and the forecasting approach presented in this paper. Random K-fold validation (K = 5, 10) has been applied in two categories, corresponding to the original and projected functional data. In the second category, the performance of the presented approach is compared with these ML methodologies.
A brief introduction to the ML methodologies implemented here, in both classical and FDA frameworks, is provided in the Appendix.

Results from the Empirical Comparative Study
We start from the functional data set obtained after the FDA preprocessing procedure has been applied, as described in the previous section. Specifically, the data have been interpolated and cubic-B-spline smoothed. The logarithmic transform and linear scaling are also applied. We held out the first ten points and the last three of each log-risk curve as an out-of-sample set. In the first category, the ML models are implemented from this data set, establishing a first model ranking based on the obtained validation results.
In the second category, the COVID-19 mortality log-risk curves are projected onto the empirical eigenvectors of the autocovariance operator, considering the truncation parameter value K(T) = 8. Our choice of the truncation parameter K(T) = K(265) = 8 fits the threshold providing a balance between model complexity and estimation accuracy. The empirical model ranking is not substantially affected by these two categories. However, it can be observed how each ML model's performance is favored by one of the two categories (see Section 6). For instance, linear SVR is the best option in the first category, while better forecasting results are obtained when nonlinear SVR is implemented from the projected data. As we will see in Section 6, the particular features of the data at each one of the 17 Spanish Communities analyzed also affect the performance of the implemented prediction methodologies. One-step-ahead forecasting is selected in the implementation of the ML methodologies, and of the proposed Classical and Bayesian semilinear approaches. Model fitting is evaluated in terms of the Symmetric Mean Absolute Percentage Error (SMAPE), given, for P = 17 and T = 265, by

SMAPE(p) = (100/T) Σ_{t=1}^T |y_{t,p} − ŷ_{t,p}| / ((|y_{t,p}| + |ŷ_{t,p}|)/2),   p = 1, …, P,

where y_{t,p} and ŷ_{t,p} denote the observed and predicted values at time t and region p. The most widely used methodology for model selection in ML is the K-fold validation approach. In our case, we have applied random K-fold validation, computing the mean of the SMAPEs obtained at each one of the K iterations of the implemented procedure. This validation technique consists of randomly splitting the functional sample into a training and a validation sample at each one of the K iterations. Model fitting is performed from the training sample, and the target outputs are defined from the validation or testing sample. In particular, we have selected K = 5, 10 in the K-fold validation implemented.
By running each model ten times and averaging the SMAPEs, we remove the fluctuations due to the random initial weights (for the MLP and BNN models), and the differences in parameter estimation across all methods due to the random specification of the sample splitting in the implementation of the random K-fold validation procedure.
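A compact sketch of this evaluation protocol combines the SMAPE criterion with a random 5-fold split. The toy series and the simple lag-one linear predictor below are illustrative stand-ins for the actual curves and models:

```python
import numpy as np

def smape(y_true, y_pred):
    # Symmetric Mean Absolute Percentage Error, in percent
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return 100.0 * np.mean(np.abs(y_true - y_pred) / denom)

rng = np.random.default_rng(4)
t = np.arange(265, dtype=float)
series = 5.0 + np.sin(2 * np.pi * t / 66) + rng.normal(scale=0.1, size=t.size)

x = series[:-1]          # lag-1 inputs for one-step-ahead forecasting
y = series[1:]           # targets

idx = rng.permutation(y.size)
folds = np.array_split(idx, 5)      # random K-fold, K = 5

fold_smapes = []
for k in range(5):
    test = folds[k]
    train = np.concatenate([folds[j] for j in range(5) if j != k])
    A = np.column_stack([x[train], np.ones(train.size)])
    coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)   # fit on training folds
    pred = coef[0] * x[test] + coef[1]
    fold_smapes.append(smape(y[test], pred))

mean_smape = float(np.mean(fold_smapes))
```

Averaging the fold-level SMAPEs, and then averaging again over repeated runs, yields the validation scores reported in the tables.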
The ten-running-based random K-fold validation SMAPEs are displayed in Tables 2 (K = 5) and 3 (K = 10), for the six ML techniques tested, GRNN, MLP, SVR, BNN, RBF, and GP, from data in the first category. The ML model estimation results, based on the complete functional sample, are also displayed in Table 11 in the Appendix (see also Figures 3 and 4). In Table 12 in the Appendix, one can also find the corresponding estimation results from the projected data. (Note that the same Spanish Community codes as before are used in all the tables referred to in this section, that is, C1 for Andalucía; C2 for Aragón; C3 for Asturias; C4 for Islas Baleares; C5 for Canarias; C6 for Cantabria; C7 for Castilla-La Mancha; C8 for Castilla y León; C9 for Cataluña; C10 for Comunidad Valenciana; C11 for Extremadura; C12 for Galicia; C13 for Comunidad de Madrid; C14 for Murcia; C15 for Navarra; C16 for País Vasco; and C17 for La Rioja.)
Tables 4 (K = 5) and 5 (K = 10) provide the ten-running-based random K-fold validation results from the projected data category. The random K-fold validation results obtained after implementation of the Bayesian and Classical semilinear approaches, as well as the estimation results based on the global sample, are displayed in Table 6. See also Figures 5 and 6, where the original and estimated COVID-19 mortality log-risk and cumulative cases curves are respectively shown.
In the previous selection of the ML model hyperparameters, we have also applied random K-fold validation (K = 5, 10). Our selection has been made from a suitable set of candidates. Specifically, the optimal numbers of hidden (NH) nodes in the implementation of MLP and BNN have been selected from the candidate sets [0, 1, 3, 5, 7, 9] and [1, 3, 5, 7, 9], respectively. The cross-validation results in both cases, K = 5, 10, lead to the same choice of the optimal NH value, namely, NH = 1 for MLP and NH = 5 for BNN. The latter displays slight differences with respect to the values NH = 3, 7 in the 10-fold cross-validation implementation. In the same way, we have selected the spread and bandwidth parameters in the RBF and GRNN procedures. Thus, after applying random K-fold validation, with K = 5, 10, the optimal values β = 2.5 and h = 0.05 are obtained, from the candidate sets [2.5, 5, 7.5, 10, 12.5, 15, 17.5, 20] and [0.05, 0.1, 0.2, 0.3, 0.5, 0.6, 0.7], respectively. As commented, in the first category, the best SVR performance corresponds to the linear case, finding the hyperparameters that minimize the five-fold cross-validation loss (using the automatic hyperparameter optimization of the MATLAB function fitrsvm), while, in the projected data category, the best option corresponds to Gaussian-kernel-based nonlinear SVR model fitting (applying the same automatic hyperparameter optimization option in the arguments of fitrsvm). In the implementation of GP, we follow the same tuning procedure for model selection. In this case, for both categories, we have opted for Bayesian cross-validation optimization (in the hyperparameter optimization argument of the MATLAB function fitrgp).
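The bandwidth selection step can be sketched for the GRNN case, which behaves as a Nadaraya-Watson (Gaussian-kernel-weighted) regressor: the same candidate set for h is scanned over random 5-fold splits, and the value with the smallest mean validation error is retained. The one-dimensional toy data and all names below are illustrative:

```python
import numpy as np

def grnn_predict(x_train, y_train, x_query, h):
    # GRNN-type (Nadaraya-Watson) prediction: Gaussian-kernel weighted average
    d2 = (x_query[:, None] - x_train[None, :]) ** 2
    w = np.exp(-d2 / (2 * h ** 2))
    return (w @ y_train) / w.sum(axis=1)

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.05, size=x.size)

candidates = [0.05, 0.1, 0.2, 0.3, 0.5, 0.6, 0.7]
idx = rng.permutation(x.size)
folds = np.array_split(idx, 5)

cv_error = {}
for h in candidates:
    errs = []
    for k in range(5):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        pred = grnn_predict(x[train], y[train], x[test], h)
        errs.append(np.mean((pred - y[test]) ** 2))
    cv_error[h] = float(np.mean(errs))

h_best = min(cv_error, key=cv_error.get)
```

On this toy target, small bandwidths win because larger ones oversmooth the sinusoid, mirroring the selection of h = 0.05 reported above.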
In all the results displayed, the SMAPE-MEAN (M.) and SMAPE-TOTAL (T.) have been computed as rank-based performance measures, for comparing the ML models tested, and the presented Classical and Bayesian semilinear estimation approaches.

Concluding Remarks
The empirical model selection results hold in a broad-based sense, as displayed in Section 5.1. Specifically, these results have been tested for K = 5 and K = 10 in the random K-fold validation, leading to almost the same model ranking in the two categories. See Table 7 and Figure 7 for the first category, and Table 8 and Figure 8 for the second category. (In Table 8 and Figure 8, C-ARH(1) and B-ARH(1) respectively denote the Classical and Bayesian semilinear estimation.) The random K-fold validation results displayed for the presented Classical and Bayesian semilinear approaches place these methodologies after the SVR model; thus, the fifth and sixth positions are respectively occupied in that empirical ranking. In the first category, RBF and GP are almost equal, while in the second category slightly larger differences can be observed. On the other hand, the projected data category favors the GRNN, RBF, and GP performance, while better results are obtained in the first category for the remaining ML models tested. Note that the best models in the above rankings, RBF and GP, display a better performance in the second category. Thus, we can conclude that the FDA preprocessing produces a remarkable improvement, even in the communities, such as Navarra and Galicia, where the worst results are obtained with all the ML models tested. In particular, this improvement is more significant at the Murcia Community. In that sense, one can also observe that the spatial component of the data affects the performance of the ML models tested across the two categories distinguished (e.g., the best performance over all ML models tested is obtained at Islas Baleares, Castilla y León, and La Rioja, and the worst at Navarra and Galicia). Note also the outperformance of the Classical and Bayesian semilinear approaches at the Islas Baleares and Murcia Communities.
Note also that the rank-based performance measure applied only refers to estimation accuracy, and other features of the probability distribution of the estimators have not been investigated yet. Indeed, the presented approach offers more stable and robust estimations against the random selection of the training and validation samples. Thus, an improved model ranking position could be occupied by our approach if variability in that sense were penalized. Another subject to be addressed in the near future is the optimal selection of the test functions {ψ_{p,ρ_p}, p = 1, …, P} involved in the formulation of the generalized least-squares loss function (7), and in the COVID-19 mortality risk assessment from equation (17). Our final goal would be to obtain an optimal design, from a set of candidates, of the test function family, providing the spatial sample information required for spatiotemporal ML, and Classical and Bayesian semilinear forecasting, in a multivariate functional data framework (see, e.g., [6] regarding the optimal selection of time intervals, or instants, in the implementation of SVR from a multivariate functional data set).

Classical and Bayesian Estimation of the Innovation Autocovariance Operator
The Classical and Bayesian componentwise estimation of the autocovariance operator R^ν_0 of the innovation process ν_t follows the approach presented in [9]. Specifically, for each k ≥ 1, consider the Gaussian likelihood, given in equation (20), of the projections x_{ik} = ν_i(φ_{k,ν}), i = 1, …, T, of the innovation process.
In equation (20), as conjugate prior for the parameter τ_k = 1/(2λ_k(R^ν_0)), let us consider τ_k ∼ Γ(g_k, β_k) (with β_k denoting the rate parameter), for g_k > 2, k ≥ 1. Hence, E[1/τ_k] = β_k/(g_k − 1). For k ≥ 1, the posterior density of τ_k, given x_{ik} = ν_i(φ_{k,ν}), i = 1, …, T, is given in equation (21). That is, the posterior is a Gamma probability distribution with parameters T/2 + g_k and β_k + Σ_{i=1}^T x²_{ik}. For the loss function defined in terms of the squared error, the Bayes estimator of 2λ_k(R^ν_0) = 1/τ_k is the posterior expectation of 1/τ_k, given by

E[1/τ_k | x_{1k}, …, x_{Tk}] = (β_k + Σ_{i=1}^T x²_{ik}) / (T/2 + g_k − 1)

(see [29], p. 236). Hence, from (21), the Bayesian componentwise estimator of R^ν_0 is computed as in equation (22). In the classical approach, an unbiased estimator of λ_k(R^ν_0) is obtained from the identity λ_k(R^ν_0) = E[ν_i(φ_{k,ν})²]. Thus, the classical componentwise estimator of R^ν_0 is defined as in equation (24). In practice, in the computation of equations (22) and (24), when the eigenvectors {φ_{k,ν}, k ≥ 1} are unknown, one can consider their consistent approximation in terms of the empirical eigenvectors {φ_{k,T,ν}, k ≥ 1}, under the conditions assumed in Corollary 4.1, Lemma 4.3, and Theorem 4.8 in [8]. Indeed, under the conditions in Theorem 1 of [18], and the referred setting of conditions in [8], this approximation preserves consistency. Note also that, under the conditions assumed in [9], in particular, under the eigenvalue conditions imposed there for every k ≥ 1, we obtain the same asymptotic behavior in the Hilbert-Schmidt operator norm ‖·‖_{S(H)} for both estimation approaches. In the practical implementation of the Bayesian componentwise estimator (22) in Sections 4-5, we have considered the prior probability distribution τ_k ∼ Γ(g_k, β_k) with g_k = 6 and β_k = (2g_k − 1)λ_{K(T),T}(R^ν_{0,T})/10 for each k ≥ 1, where λ_{K(T),T}(R^ν_{0,T}) denotes the K(T)th empirical eigenvalue of the autocovariance operator of the innovation process defining the residual log-risk process, and K(T) represents the truncation parameter value (see [9]).
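The Gamma posterior computations above can be checked numerically. The sketch below evaluates the closed-form posterior mean of 1/τ_k (the Bayes estimator of 2λ_k(R^ν_0) under squared-error loss) and verifies it by Monte Carlo; the values chosen for g_k, β_k, and λ_k are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy hyperparameters and true eigenvalue (illustrative values only)
g_k, beta_k = 6.0, 0.5
lam_k, T = 0.3, 265

# Projections x_ik ~ N(0, lam_k), with tau_k = 1 / (2 * lam_k)
x = rng.normal(scale=np.sqrt(lam_k), size=T)

# Posterior: Gamma with shape T/2 + g_k and rate beta_k + sum_i x_ik^2
shape = T / 2 + g_k
rate = beta_k + np.sum(x ** 2)

bayes_inv_tau = rate / (shape - 1)     # closed-form posterior mean of 1/tau_k
lam_bayes = bayes_inv_tau / 2          # Bayes estimator of lambda_k

# Monte Carlo check of the closed form (numpy's gamma uses shape and scale)
tau_samples = rng.gamma(shape, 1.0 / rate, size=200000)
mc_inv_tau = np.mean(1.0 / tau_samples)
```

The Monte Carlo average of 1/τ_k agrees with the closed-form posterior mean, and the resulting eigenvalue estimate is close to the true λ_k.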

Additional Outputs in the Spatiotemporal Log-Gaussian Analysis of COVID-19 Mortality Risk
The original COVID-19 mortality cumulative cases curves, and their interpolation and B-spline smoothing, are shown in Figure 9. The corresponding COVID-19 mortality risk curves are also displayed in Figure 10. The least-squares parameter estimates computed in the spatially heterogeneous nonlinear parametric trend model fitting are shown in Tables 9-10.
Finally, the COVID-19 mortality log-risk estimations, based on spatially heterogeneous linear and nonlinear trend model fitting, are compared in Figure 11, under a Classical (C) and Bayesian (B) approximation of the functional residuals, considering the truncation parameter value K(265) = 9.
A Brief Introduction to our FDA Implementation of ML Models
The implemented ML models are introduced here. Their reformulation in the multivariate FDA framework adopted is also provided.

Multilayer Perceptron (MLP)
MLP shares the philosophy of nonlinear regression, in terms of a link function g applied at each hidden node, defining the following approximation of the response:

y = η_0 + Σ_{j=1}^{NH} η_j g(w_j^T x + b_j),   (25)

so that the hidden node outputs are weighted by the components of (η_0, …, η_{NH}). In this context, y is usually referred to as the network output. MLP allows the approximation of any given continuous function on a compact set arbitrarily closely, from a network with a finite number of hidden nodes. Optimization algorithms are applied to obtain the weights from the least-squares loss function. In our implementation of MLP in an FDA framework, the input vector is defined by

x_t = ( ln(λ_{t−j})(φ_l); l = 1, …, K(T), j = 0, …, p ) 1_D(·),   (26)

with 1_D(·) denoting the indicator function of the domain D ⊂ R², the spatial support of the functions in H. Here, ln(λ_{t−j})(φ_l), l = 1, …, K(T), j = 0, …, p, denote the observed projected log-risk values, onto the autocovariance operator eigenvectors φ_1, …, φ_{K(T)}, at times t, …, t − p. The parameter K(T) denotes the truncation order achieved in our finite-dimensional approximation, and T the number of temporal nodes, or training data points. In practice, the theoretical eigenvectors are usually replaced by the empirical eigenvectors. The parameter p + 1 provides the number of temporal lags incorporated in the prediction. In our case, since we have considered one-step-ahead forecasting, p = 1, and the response at time t + 1 is approximated from the input vector

x_t = ( ln(λ_t)(φ_l), ln(λ_{t−1})(φ_l); l = 1, …, K(T) ) 1_D(·).   (27)

Hence, equation (25) can be rewritten in terms of this projected input vector. Usually, the logistic function g(u) = 1/(1 + exp(−u)) is chosen as the link function.

Radial Basis Function Neural Network (RBF)
RBF works with node functions depending on center and scale parameters, fitting the local smoothness of the response. Specifically, from an initial blank network, nodes are sequentially added, around the training patterns, until an acceptable error is reached. All the output-layer weights are then recomputed using the least-squares formula. Gaussian functions have been widely selected as node functions.
Equation (27) defines the input vector in our implementation of RBF in an infinite-dimensional setting, in which, as usual, the weight parameters η_j, j = 1, . . . , N_H, define the linear combination of radial basis functions. Here, the scale parameter ρ and the functional location parameters c_j ∈ H, j = 1, . . . , N_H, respectively provide the width and the functional center of the node functions. In practice, a K(T)-dimensional implementation is considered, based on the truncation parameter K(T), as given in (27).
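As an illustration of the node structure just described, the following is a minimal sketch of a Gaussian RBF network prediction on a K(T)-dimensional projected input; the centers, weights, and width below are toy values, and all names are our assumptions, not the paper's implementation:

```python
import numpy as np

def rbf_predict(x, centers, weights, rho):
    """Gaussian RBF network: y(x) = sum_j eta_j * exp(-||x - c_j||^2 / (2 rho^2)).

    x       : (K,) input vector (e.g., projected log-risk values)
    centers : (N_H, K) discretized functional centers c_j
    weights : (N_H,) output-layer weights eta_j
    rho     : scale (width) parameter shared by all nodes
    """
    sq_dist = np.sum((centers - x) ** 2, axis=1)   # ||x - c_j||^2 per node
    phi = np.exp(-sq_dist / (2.0 * rho ** 2))      # node outputs
    return float(weights @ phi)

# Toy usage: the node centered exactly at x contributes its full weight,
# since exp(0) = 1; the distant node contributes almost nothing.
x = np.array([0.5, -0.2])
centers = np.array([[0.5, -0.2], [2.0, 2.0]])
weights = np.array([3.0, 1.0])
y = rbf_predict(x, centers, weights, rho=1.0)
```

In the sequential construction described above, new centers c_j would be placed at training patterns with large residual error, after which the output-layer weights are refitted by least squares.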

Support Vector Regression (SVR)
SVR implementation involves a loss function leading to a balance between model complexity and precision (accurate prediction). A bias parameter b is also considered in its formulation, together with the ε-insensitive loss function (31). Here, x_m and y_m respectively denote the mth training input vector and the target output, for m = 1, . . . , M. Thus, errors below ε are not penalized. The solution to the optimization problem associated with the loss function (31) is given by equation (32), where α_m and α_m^* are Lagrange multipliers. The support vectors are the training vectors with nonzero Lagrange multipliers. A nonlinear version is obtained when the inner product x_m^T x in (32) is replaced by a kernel evaluation K(x_m, x), with K being a kernel. A usual choice is the Gaussian kernel. In our implementation of this methodology in an FDA framework, from equations (26)-(27), the corresponding optimization problem is formulated in terms of the projected data, where T denotes, as before, the number of temporal nodes.

Bayesian Neural Network (BNN)
The design of BNN involves Bayesian estimation and the concept of regularization. The network parameters, or weights, are considered random variables following a prior probability distribution. Priors favoring smooth fits, i.e., putting more mass on small weights, are usually selected to reduce model complexity. The posterior probability distribution of the weights is obtained after the data are observed, and the network prediction is then computed. Specifically, the optimal prediction is obtained by minimizing a regularized objective, where E_O is the sum of the squared errors in the network output, based on the posterior distribution of the parameters, E_W represents the sum of the squares of the weights or network parameters, and ν ∈ (0, 1) denotes the regularization parameter. Let L be the number of parameters.
An L-dimensional Gaussian prior probability distribution is usually assumed for the network parameters, with zero mean and variance-covariance matrix (1/(2(1 − ν))) I_{L×L}, with I_{L×L} denoting the L × L identity matrix. This prior puts more weight on small network parameter values, close to zero. The posterior probability density, given the observed data O = o and the value ν of the regularization parameter, follows from Bayes' rule. Considering that the errors are also Gaussian distributed, the conditional probability of the data O, given the parameters ν and w, is obtained as in (38), where M denotes the number of training data points. From equations (36)-(38), the posterior density (39) follows, where c denotes the normalizing constant. The conditional probability of the parameter ν, given the observed data O = o, is also computed under a Bayesian framework, as in (40). Equations (39) and (40) are maximized to obtain the optimal weights and the regularization parameter ν, respectively.
In our FDA implementation, from (26)-(27), the corresponding optimization problem is formulated by conditioning on the projected data. Note that, in the selected Gaussian prior framework, the error projections are also Gaussian, and our choice of the function basis, diagonalizing the empirical autocovariance operator of the errors, leads to a projected error vector with independent Gaussian components, which is also normalized in terms of the empirical eigenvalues, as described in the preprocessing procedure in Section 5.1. Hence, the optimization in equations (39) and (40) can be implemented in a similar way.
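As a toy illustration, assuming the regularized objective takes the form ν·E_O + (1 − ν)·E_W (our reading, consistent with the prior variance 1/(2(1 − ν)) stated above), the following sketch uses a linear "network", for which the minimizer has a ridge-type closed form; all names are our assumptions:

```python
import numpy as np

def bnn_objective(w, X, y, nu):
    """Regularized objective nu * E_O + (1 - nu) * E_W.

    E_O: sum of squared output errors; E_W: sum of squared weights;
    nu in (0, 1) trades data fit against weight shrinkage.
    """
    residuals = y - X @ w
    return nu * float(residuals @ residuals) + (1.0 - nu) * float(w @ w)

def bnn_fit(X, y, nu):
    """Minimizer for the linear case: (nu X'X + (1 - nu) I) w = nu X'y,
    obtained by setting the gradient of the objective to zero."""
    K = X.shape[1]
    A = nu * X.T @ X + (1.0 - nu) * np.eye(K)
    return np.linalg.solve(A, nu * X.T @ y)

# Toy data: the fitted weights attain a lower objective than any other choice.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w_hat = bnn_fit(X, y, nu=0.9)
```

For an actual neural network the objective is non-quadratic in the weights, so (39) and (40) are maximized numerically rather than in closed form.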

Generalized Regression Neural Network (GRNN)
GRNN is based on kernel regression. The kernel estimator is computed as a weighted sum of the observed responses, or target outputs, associated with the training data points in a neighborhood of the objective point x, where the prediction must be computed. Specifically, formula (41) is applied in the approximation of the response value at the point x, where y_j is the target output for training data point x_j, for j = 1, . . . , M, and K is the kernel function. Usually, an isotropic, rapidly decreasing kernel function is considered; e.g., the Gaussian kernel K(u) = exp(−u²/2)/√(2π) constitutes a common choice. The bandwidth parameter h defines the smoothness of the fit; that is, h controls the size of the smoothing region. Large values of h lead to stronger smoothing, while small values allow a larger degree of local variation. Our FDA formulation, from (26)-(27), follows straightforwardly by considering the H-norm ‖·‖_H in (41), instead of the Euclidean norm ‖·‖.
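A minimal sketch of the kernel-regression estimator described above, assuming a Gaussian kernel and the Euclidean norm (in the FDA version this norm would be replaced by the H-norm of the projected data; function and variable names are ours):

```python
import numpy as np

def grnn_predict(x, X_train, y_train, h):
    """Nadaraya-Watson kernel regression estimator underlying GRNN:

        y_hat(x) = sum_j K(||x - x_j|| / h) y_j / sum_j K(||x - x_j|| / h).

    A Gaussian kernel is used; its 1/sqrt(2 pi) constant cancels in the ratio.
    Bandwidth h: large h -> stronger smoothing, small h -> local variation.
    """
    dists = np.linalg.norm(X_train - x, axis=1)
    w = np.exp(-0.5 * (dists / h) ** 2)
    return float(w @ y_train / w.sum())

# Toy usage: with a tiny bandwidth the estimator follows the nearest neighbor;
# with a large bandwidth it shrinks toward the overall mean of the targets.
X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.array([0.0, 1.0, 4.0])
pred_local = grnn_predict(np.array([0.0]), X_train, y_train, h=0.01)
pred_smooth = grnn_predict(np.array([1.0]), X_train, y_train, h=10.0)
```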

Gaussian Processes (GP)
A good performance is usually observed in the implementation of GP regression, based on the assumption of a multivariate normal distribution for the observed responses at the different training data points. Specifically, we consider the observation model (42), in which the additive noise vector has independent components, and the random vector Z ∼ N(0, Σ_Z), with covariance matrix Σ_Z providing the variances and covariances between the output vector components in the training set. Applying Bayes' rule, we obtain the solution to the inverse problem (42); the estimation of Z, for a given input vector x, is obtained as

Ẑ = Σ_{Y,Z} (Σ_Z + σ² I)^{−1} Y.

Here, we consider an alternative multivariate FDA interpretation, in terms of a training test function set (instead of a training data point set based on temporal nodes). Hence, Σ_Z is replaced by the matrix autocovariance operator R_Z of the output, and Σ_{Y,Z} is replaced by the matrix cross-covariance operator R_{Y,Z}, both evaluated at the training test function set. In this case, σ² denotes the functional variance, or trace of the autocovariance operator, characterizing the Gaussian infinite-dimensional distribution of the H-valued independent components of the zero-mean observation noise vector. Hence, the identity matrix I is replaced by the identity operator I_{H^P} on H^P, with P denoting the number of training test functions. That is,

Ẑ = R_{Y,Z} (R_Z + σ² I_{H^P})^{−1} Y.
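The finite-dimensional GP posterior-mean predictor just described can be sketched numerically as follows, under a squared-exponential covariance assumption; the covariance choice, hyperparameter names, and values are ours, not the paper's:

```python
import numpy as np

def sq_exp_cov(A, B, length=1.0):
    """Squared-exponential covariance k(a, b) = exp(-||a - b||^2 / (2 length^2)),
    evaluated between every row of A (n, d) and every row of B (m, d)."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / (2.0 * length ** 2))

def gp_predict(X_train, y_train, X_new, sigma2=1e-2, length=1.0):
    """Posterior mean of Z at X_new given noisy observations Y = Z + eps:

        Z_hat = Sigma_{Y,Z} (Sigma_Z + sigma2 * I)^{-1} Y.
    """
    K = sq_exp_cov(X_train, X_train, length)     # Sigma_Z on training inputs
    K_star = sq_exp_cov(X_new, X_train, length)  # cross-covariance term
    alpha = np.linalg.solve(K + sigma2 * np.eye(len(X_train)), y_train)
    return K_star @ alpha

# Toy usage: with near-zero noise, the predictor almost interpolates
# the training responses at the training inputs.
X = np.array([[0.0], [1.0], [2.0]])
y = np.sin(X).ravel()
pred = gp_predict(X, y, np.array([[1.0]]), sigma2=1e-6)
```

In the FDA interpretation above, the two covariance matrices would be replaced by the (empirical) autocovariance and cross-covariance operators evaluated at the training test functions.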

ML Estimation Results in the Two Categories
The estimation results obtained from the implementation of the above ML models in the first category are displayed in Table 11. The SMAPEs from the projected data are shown in Table 12.