Generalized Pareto Regression Trees for extreme events analysis

In this paper, we provide finite sample results to assess the consistency of Generalized Pareto regression trees as tools to perform extreme value regression. The results that we provide are obtained from concentration inequalities and are valid for a finite sample size, taking into account a misspecification bias that arises from the use of a "Peaks over Threshold" approach. The properties that we derive also legitimize the pruning strategies (i.e. the model selection rules) used to select a proper tree that achieves a compromise between bias and variance. The methodology is illustrated through a simulation study and a real data application in insurance against natural disasters.


Introduction
Extreme value theory (EVT) is the branch of statistics which has been developed and broadly used to handle extreme events, such as extreme floods, heat wave episodes or extreme financial losses [21,16]. One of the key results behind the success of this approach was proved by Balkema and de Haan in [4]: they showed that the tail of the distribution of a series of observations can be approximated by a parametric family of distributions, namely Generalized Pareto (GP) distributions. This property allows the statistician to use the information contained in the largest observations of a random sample to extrapolate the tail. This yields the so-called Peaks over Threshold (PoT) method introduced in [27], which consists in fitting a GP distribution to the excesses above some (high) suitably chosen threshold. In a regression framework, the parameters of this GP distribution depend on covariates, reflecting the fact that different values of these covariates may result in a different tail behavior of the response variable. In this paper, we study the use of regression trees to perform GP regression on the excesses. This method, introduced by [7], determines clusters of similar tail behaviors depending on the value of the covariates, based on a recursive partition of the sample and simple model selection rules. In the present work, we provide theoretical results and empirical evidence on the consistency of such a procedure and of these selection rules. The results we provide are based on concentration inequalities, in order to hold for finite sample sizes. The main difficulty lies in the misspecification of the model and in handling the fact that the distributions are heavy tailed.
Tail regression is a challenging task. Several papers have been interested in extreme quantile regression: to name a few, in 2005, Chernozhukov [11] and, in 2012, Wang et al. [34] derived extreme quantile estimators assuming a linear form for the conditional quantile. In 2019, Gardes and Stupfler [18] and Velthoen et al. [32] used conditional intermediate-level quantiles to extrapolate above the threshold and deduce estimators for extreme conditional quantiles. Another approach is to model the parameters of the GP distribution as functions of the covariates, e.g. as local polynomials [5] or as generalized additive models [10]. Very recently, in 2021, Velthoen et al. [33] proposed a gradient boosting procedure to estimate the conditional GP distribution. Let us note that the nonparametric approaches rely on regularity assumptions on the way the tail of the distribution evolves with the covariates (which are required to be continuous through the use of kernel smoothing). A nice feature of the regression tree approach we consider in the present paper is its ability to handle several covariates whose components may be either discrete or continuous. Moreover, this method is adapted to situations where the tail behavior is supposed to be significantly different depending on the characteristics, as is for example the case in an application to cyber-insurance considered in a former paper, see [17].
Regression trees, introduced by Breiman [7] along with the CART algorithm (for Classification And Regression Trees), are flexible tools to perform a regression and clustering task simultaneously. They have been used in various fields, including industry [20], geology (see e.g. [26]), ecology (see e.g. [14]) and claim reserving in insurance [24]. Through the iterative splitting algorithm used in CART, nonlinearities are introduced in the way the distribution is modeled, while furnishing an intelligible interpretation of the final classification of response variables. The splitting criterion, used to iteratively separate observations into clusters of similar behaviors, depends on the type of problem one is considering. While the standard CART algorithm relies on a mean-squared error criterion to perform mean regression, alternative loss functions have been considered, as in [9] for quantile regression, or in [29], which used a log-likelihood based loss. Loh [22,23] provides detailed descriptions of regression tree procedures and a review of their variants. In this paper, building on the Balkema and de Haan result, we use a GP log-likelihood loss, as in [17], to perform extreme value regression.
The rest of the paper is organized as follows. In Section 2, we introduce notations and describe the GP regression tree algorithm. Section 3 lists the main results of this paper, that is, deviation bounds for the regression tree estimator for finite sample size, and consistency of the "pruning" (that is, model selection) strategy. Empirical results are gathered in Section 4, which provides a simulation study and a real data analysis in natural disaster insurance. Detailed proofs of the technical results are shown in the Appendix.

Regression trees for extreme value analysis
This section describes the estimation method (GP regression trees) that is considered in the paper. Some classical results in EVT are given in Section 2.1 to motivate the GP approximation. Regression trees adapted to this context are described in Section 2.2. A short discussion on the advantages of this technique compared to competing approaches is developed in Section 2.3.

Extreme value theory and regression
Let us consider independent and identically distributed observations Y_1, Y_2, . . . with an unknown survival function F (that is, F(y) = P(Y_1 > y)). A natural way to define extreme events is to consider the values of Y_i which have exceeded some high threshold u. The excesses above u are then defined as the variables Y_i − u given that Y_i > u. The asymptotic behavior of extreme events is characterized by the distribution of the excesses, which is given by

F_u(y) = P(Y_1 − u > y | Y_1 > u) = F(u + y)/F(u).

In 1975, Pickands [25] showed that, if F is regularly varying in its tail with index γ_0 > 0, then, for some σ_0u > 0, the distribution of the excesses is uniformly approximated by H_{σ_0u, γ_0}, which necessarily belongs to the Generalized Pareto (GP) distribution family, whose distribution function is of the form

H_{σ,γ}(y) = 1 − (1 + γy/σ)^(−1/γ), for y ≥ 0,

where σ_0u > 0 is a scale parameter and γ_0 > 0 is a shape parameter, which reflects the heaviness of the tail of the distribution. In particular, if γ_0 > 0, the distribution is heavy tailed. More details on these results can be found in e.g. [12,6].
In practice, the so-called Peaks over Threshold (PoT) method is widely used, see [13,12]. It consists in choosing a high threshold u and fitting a GP distribution to the excesses above that threshold u. The estimation of the parameters σ_0u and γ_0 may be done by maximizing the GP likelihood. The choice of the threshold u can be understood as a compromise between bias and variance: the smaller the threshold, the less valid the asymptotic approximation, leading to bias; on the other hand, a too high threshold will generate few excesses to fit the model, leading to high variance. The existing selection methods are mostly graphical; to the best of our knowledge, no automatic data-driven selection procedure is available.
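To fix ideas, the PoT method described above can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation used in the paper: the simulated tail index, the 95%-quantile threshold and the use of `scipy.stats.genpareto` are our own choices.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(0)

# Simulate heavy-tailed observations: survival function ~ y^(-2),
# so the true tail index is gamma_0 = 0.5.
y = rng.pareto(2.0, size=20000) + 1.0

# PoT: choose a high threshold (here the empirical 95% quantile, an
# illustrative choice; in practice u is a bias/variance compromise).
u = np.quantile(y, 0.95)
excesses = y[y > u] - u

# Fit a GP distribution to the excesses by maximum likelihood.
# floc=0 keeps the location fixed at zero, as the PoT model prescribes.
gamma_hat, _, sigma_hat = genpareto.fit(excesses, floc=0)

print(gamma_hat, sigma_hat)
```

With enough excesses, `gamma_hat` should be close to the true tail index 0.5.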
In the present paper, we consider a regression framework, that is, our goal is to assess the impact of some random covariates X on the tail of the distribution of a response variable Y. The previous convergence results hold, but for quantities σ_0u, γ_0 and u that may depend on X. More precisely, if we assume that γ_0(x) > 0 for all x (which is the assumption that we will make throughout this paper), then (2.1) and (2.2) hold conditionally, with F(y|x) = P(Y ≥ y | X = x), see [6] and references therein. Suppose that we observe (Y_i, X_i)_{1≤i≤n}, a sample of (Y, X), where X belongs to a compact set X. Following the PoT approach, the estimation of the functions γ_0(X) and σ_0(X) = σ_0u(X) can typically be done by fitting a regression model on the data points (Y_i, X_i) such that Y_i exceeds a proper threshold u(X_i). More precisely, let us define

θ*(x) = arg max_θ E[φ(Y − u(x), θ) | Y > u(x), X = x],

where θ = (σ, γ)^τ (where a^τ denotes the transpose of a vector a) and φ is the GP log-likelihood function, that is, for z ≥ 0,

φ(z, θ) = −log σ − (1 + 1/γ) log(1 + γz/σ).

From (2.4), θ*(x) should be close to θ_0(x) = (σ_0(x), γ_0(x))^τ for u(x) large enough. Based on this idea, Beirlant (2004) [5] proposed a nonparametric approximation of the loss function maximized by θ*. This technique, based on local polynomials, requires continuity of the covariates and some smoothness assumptions on θ_0. On the other hand, parametric methods [10,2] have also been proposed, but they rely on a stronger assumption on the shape of θ_0.
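The GP log-likelihood φ above admits a simple closed form for γ > 0 and z ≥ 0. A minimal sketch, cross-checked against scipy's Generalized Pareto log-density (the function name `gp_loglik` is ours):

```python
import numpy as np
from scipy.stats import genpareto

def gp_loglik(z, sigma, gamma):
    """GP log-likelihood phi(z, theta) of an excess z >= 0, for gamma > 0:
    -log(sigma) - (1 + 1/gamma) * log(1 + gamma * z / sigma)."""
    z = np.asarray(z, dtype=float)
    return -np.log(sigma) - (1.0 + 1.0 / gamma) * np.log1p(gamma * z / sigma)

# Cross-check against scipy's genpareto log-density.
z = np.array([0.1, 1.0, 5.0, 50.0])
ours = gp_loglik(z, sigma=2.0, gamma=0.5)
ref = genpareto.logpdf(z, c=0.5, loc=0.0, scale=2.0)
print(np.max(np.abs(ours - ref)))
```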
In the next section, we introduce a regression tree approach which is adapted to both continuous and discrete variables, and that relies on few assumptions (since the estimated regression function θ 0 does not need to be smooth).

GPD regression trees
Regression trees are a convenient tool to capture heterogeneous behaviors in the data, see [7]. These models aim at constituting classes of observations which have a relatively similar behavior in terms of the response variable Y. These classes are defined by "rules", which assign an observation to one of these classes according to the values of its covariates X. These rules are obtained from the data through the CART (Classification And Regression Trees) algorithm, and the non-linearity of the procedure allows for an adaptation to the estimation of large classes of regression functions.
Fitting regression trees relies on a so-called "growing phase", described in our context in Section 2.2.1, which corresponds to the determination of these splitting rules. Section 2.2.2 shows how an estimator of the regression function θ_0 can be deduced from such a tree. The "pruning step", which can be understood as a model selection procedure, is described in Section 2.2.3.

Growing step: construction of the maximal tree
The CART algorithm consists in iteratively determining a set of "rules" x = (x^(1), . . ., x^(d)) → R_j(x) to split the data, aiming at optimizing some objective function (also referred to as the splitting criterion). In our case, we want to approximate the criterion (2.5), that is, we are searching, within some class, for a regression function θ(X) that maximizes the empirical counterpart of this criterion. A set of rules (R_j)_{j∈J} is a set of maps such that R_j(x) = 1 or 0 depending on whether some conditions are satisfied by x, with R_j(x)R_{j'}(x) = 0 for j ≠ j' and Σ_j R_j(x) = 1. In the case of regression trees, these partitioning rules have a particular structure, since they can be written, for quantitative covariates (the case of x containing qualitative variables is described in Remark 2.1 below), as R_j(x) = 1{x_1 ≤ x < x_2} for some x_1 ∈ R^d and x_2 ∈ R^d, with comparison symbols to be understood component-wise. In other terms, if d = 1, rules can be identified with partitioning segments; if d = 2, they are rectangles (hyper-rectangles in the general case). The determination of these rules from one step to another can be represented as a binary tree, since each rule R_j at step k generates two rules R_j1 and R_j2 (with R_j1(x) + R_j2(x) = 0 if R_j(x) = 0) at step k + 1. The algorithm can be summarized as follows: Step 1: R_1(x) = 1 for all x, and n_1 = 1 (this corresponds to the root of the tree).
Step k+1: Let (R_1, . . ., R_{n_k}) denote the rules obtained at step k. For j = 1, . . ., n_k:
• if all observations such that R_j(X_i) = 1 have the same characteristics, then keep rule j, as it is no longer possible to segment the population;
• else, rule R_j is replaced by two new rules R_j1 and R_j2 determined in the following way: for each component X^(ℓ) of X = (X^(1), . . ., X^(d)), define the best threshold x_j^(ℓ) to split the data, in the sense that it maximizes the gain Φ(R_j, x^(ℓ)) in the splitting criterion. Then, select the best component index to consider, ℓ* = arg max_ℓ Φ(R_j, x_j^(ℓ)), and split R_j accordingly.
• Let n_{k+1} denote the new number of rules.
This algorithm has a binary tree structure. The list of rules (R_j)_{1≤j≤n_k} is identified with the leaves of the tree at step k, and the number of leaves of the tree increases from step k to step k + 1. The stopping rule can also be slightly modified to ensure that there is a minimal number of points of the original data in each leaf of the tree at each step.
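A single splitting step of the algorithm above can be sketched as follows. This is a simplified illustration under our own choices: one continuous covariate, excesses simulated directly from GP distributions, candidate thresholds restricted to a quantile grid, and Nelder-Mead for the leaf-wise maximum likelihood fits.

```python
import numpy as np
from scipy.optimize import minimize

def gp_nll(theta, z):
    # Negative GP log-likelihood of excesses z; theta = (log sigma, log gamma).
    sigma, gamma = np.exp(theta)
    return len(z) * np.log(sigma) + (1 + 1 / gamma) * np.sum(np.log1p(gamma * z / sigma))

def fit_leaf(z):
    # Maximum likelihood fit of (sigma, gamma) in a leaf, on the log scale.
    res = minimize(gp_nll, x0=np.log([np.mean(z), 0.5]), args=(z,), method="Nelder-Mead")
    return np.exp(res.x), -res.fun  # (sigma, gamma), maximized log-likelihood

def best_split(x, z, n_min=50):
    """One CART step: scan candidate thresholds on a single covariate and
    return the split maximizing the summed GP log-likelihood of the children."""
    _, ll_parent = fit_leaf(z)
    best_t, best_gain = None, -np.inf
    for t in np.quantile(x, np.linspace(0.1, 0.9, 17)):
        left, right = z[x <= t], z[x > t]
        if len(left) < n_min or len(right) < n_min:
            continue
        gain = fit_leaf(left)[1] + fit_leaf(right)[1] - ll_parent
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

rng = np.random.default_rng(1)
n = 4000
x = rng.uniform(size=n)
# Step-wise tail index: gamma = 0.3 for x < 0.5, gamma = 0.9 otherwise.
gamma0 = np.where(x < 0.5, 0.3, 0.9)
# Excesses with survival (1 + z)^(-1/gamma0), i.e. exactly GP(1, gamma0).
z = rng.pareto(1.0 / gamma0)

t_star, gain = best_split(x, z)
print(t_star, gain)
```

On this example, the selected threshold should lie close to the true change point 0.5, with a strictly positive log-likelihood gain.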
Remark 2.1. In this version of the CART algorithm, all covariates are continuous or {0, 1}-valued. Qualitative variables with more than two modalities must either be transformed into binary variables, or the algorithm must be slightly modified so that the splitting step of each R_j is done by finding the best partition of the modalities into two groups, that is, the one that minimizes the loss function. This can be done by ordering the modalities with respect to the average value, or the median value, of the response for the observations associated with each modality.
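The ordering trick of Remark 2.1 can be sketched as follows (hypothetical data; the modality labels and shifts are ours):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical categorical covariate with four modalities and a response y.
cats = rng.choice(["A", "B", "C", "D"], size=400)
shift = {"A": 1.0, "B": 4.0, "C": 2.0, "D": 3.0}
y = np.array([shift[c] for c in cats]) + rng.exponential(size=400)

# Order the modalities by the median response: binary splits along this
# ordering can then be scanned exactly like splits on a continuous covariate.
order = sorted(set(cats), key=lambda c: float(np.median(y[cats == c])))
print(order)
```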

From the tree to the parameter estimation
From a given set of rules R = (R_j)_{j=1,...,s}, let T_j = {x : R_j(x) = 1} denote the jth leaf of the corresponding tree. The estimator θ associated with a tree T = (T_ℓ)_{ℓ=1,...,K} (where K is the total number of leaves) is obtained by maximizing, in each leaf, the GP log-likelihood of the excesses falling into that leaf. The maximal tree is the tree T_max obtained once the previous algorithm stops. It corresponds to a trivial estimator of θ_0, since either the number of observations in a leaf is one, or all observations in this leaf have the same characteristics x.
The pruning step, presented in the next section, consists in extracting from the maximal tree a subtree that achieves a compromise between simplicity and good fit.

Selection of a subtree: pruning step
For the pruning step, a standard way to proceed is to use a penalized approach to select the appropriate subtree, see [7,19]. For a given tree T_K with K leaves (T_ℓ)_{ℓ=1,...,K}, associated with the corresponding estimator θ, the performance of this tree is measured through the penalized log-likelihood criterion (2.6), which subtracts from the log-likelihood a penalty proportional to the number of leaves. For a given level of penalty λ, the selected tree is the one that maximizes criterion (2.6), achieving a compromise between good fit and simplicity. To determine this optimal tree, it is not necessary to compute all the subtrees of the maximal tree. It suffices to determine, for all K ≥ 0, the subtree T_K which maximizes the criterion (2.6) among all subtrees with K leaves, and then to determine the final tree among a list of K_max trees (where K_max is the number of leaves of the maximal tree). The trees T_K are easy to determine, since T_K is obtained by removing one leaf from T_{K+1}, see pp. 284-290 in [7].
The penalization constant λ can be chosen using a test sample or k-fold cross-validation. In the first case, the data are split into two parts before growing the tree (a training sample of size n and a test sample which is not used in computing the tree). In the second case, the dataset is randomly split into k parts which successively act as training or test samples, see for example [3,28].
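The penalized selection rule can be illustrated numerically. The log-likelihood values below are hypothetical, and the function `select_tree` and the λ values are our own; the point is only how the penalty trades fit against the number of leaves.

```python
import numpy as np

# Hypothetical maximized log-likelihoods of the best subtree with K leaves
# (illustrative numbers: the fit improves with K, with diminishing returns).
K = np.arange(1, 9)
loglik = np.array([-500.0, -430.0, -400.0, -392.0, -388.0, -386.5, -386.0, -385.8])

def select_tree(loglik, K, lam):
    """Penalized criterion: keep the subtree maximizing loglik - lam * K."""
    crit = loglik - lam * K
    return int(K[np.argmax(crit)])

# A small penalty keeps a large tree; a large one prunes aggressively.
for lam in (1.0, 5.0, 50.0):
    print(lam, select_tree(loglik, K, lam))
```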

Comparison with competing approaches
Compared to competing approaches in extreme value regression, the advantage of the procedure is to allow discontinuities in the regression function, while parametric approaches suppose a form of linearity, e.g. [2]. The more flexible non-parametric approaches, such as [5], rely on smoothing techniques that require the covariates to be continuous. Chavez-Demoulin et al. [10] propose a semi-parametric framework to separate the continuous covariates from the discrete ones. Smoothing splines are used to estimate the continuous part non-parametrically, while the influence of discrete covariates is captured by a parametric function.

Main results
In this section, we show that the GP regression tree procedure defined in Section 2.2 is consistent. Notations and assumptions used throughout this section are listed in Section 3.1. We then state our first main results on the consistency of a fixed tree with K leaves, separating the stochastic part of the error (Section 3.2) from the misspecification part (Section 3.3) caused by the GP approximation. The consistency of the pruning methodology is studied in Section 3.4.

Notations
Let us recall that the PoT approach consists in considering observations such that Y_i ≥ u(X_i). Below, we will restrict ourselves to the case where u(x) = u. Our results easily extend to the case where u(x) = Σ_{j=1}^m u_j 1{x ∈ X_j}, where (X_j)_{1≤j≤m} are subsets of the space of covariates. Another possible extension would be to assume that u(x) = f(β, x) for some parameter β and a known function f. Nevertheless, a choice of such a particular threshold function seems hard to justify. Hence, we restrict ourselves to the simplest case.
Moreover, the result we provide holds uniformly for u ∈ [u_min, u_max], to cover adaptive choices of this parameter. Conditions on u_min and u_max are given in Assumption 3.1.
Here, k_n will denote the average number (up to some constant) of observations on which the model is fitted. It is hence related to the rate of convergence of the procedure. The following assumption introduces conditions on this rate k_n and on the space of parameters. Assumption 3.2. We assume the parameter space to be Θ = S × Γ, where S = [σ_min, σ_max] and Γ = [γ_min, γ_max], with 0 < σ_min ≤ σ_max and 0 < γ_min ≤ γ_max. Moreover, assume that k_n = O(n^(a_2)), with a_2 > 0, and that the number of leaves of the maximal tree K_max satisfies K_max ≤ κ k_n, with κ > 0.
Next, let us introduce some notations regarding the trees. Consider a tree T(u) with K leaves denoted T_ℓ, ℓ = 1, . . ., K. Introducing L_ℓ(θ, u), the (normalized) contribution of the leaf T_ℓ to the log-likelihood, let θ_ℓ(u) denote the estimated value of the parameter in the leaf T_ℓ, obtained by maximizing L_ℓ(θ, u). This estimator is expected to be close to θ*_ℓ(u), the maximizer of the expectation of L_ℓ(θ, u). We denote by T*(u|T) the tree with the same leaves as T, but with parameters θ*_ℓ(u). This quantity is not exactly our target: ideally, we would like to estimate θ_{0,ℓ}(u) = (σ_0(T_ℓ, u), γ_0(T_ℓ)), the limit of θ*_ℓ(u) as u tends to infinity. We denote by T_0(u|T) the tree with the same leaves as T but with parameters θ_{0,ℓ}(u).
If θ = (θ_ℓ)_{ℓ=1,...,K} denotes the set of parameters of a tree with K leaves (T_ℓ)_{ℓ=1,...,K}, we will denote by θ(x) the piecewise-constant function defined by θ(x) = Σ_{ℓ=1}^K θ_ℓ 1{x ∈ T_ℓ}. We will first focus on the difference between T(u) and T*(u|T) in Section 3.2, which is the stochastic part of the error. On the other hand, the difference between T*(u|T) and T_0(u|T) (and ultimately the difference between θ(x) and θ_0(x)) is studied in Section 3.3 and can be understood as a misspecification term, caused by the fact that the excesses above the threshold are not exactly GP distributed.
For ℓ = 1, . . ., K, let ∇_θ L_ℓ(θ, u) denote the gradient of L_ℓ(θ, u). To handle the stochastic part, we shall add a few assumptions. We first need a domination condition on the class of the derivatives of the functions y → φ(y − u, θ). These derivatives are uniformly bounded by a quantity involving a constant C (not depending on n) and w = γ_max/σ_min.
Assumption 3.3. Assume that, for some ρ_0 > 0, a moment condition of order ρ_0 holds on the excesses. In fact, this assumption is automatically satisfied if Assumption 3.2 holds, since γ(x) is bounded over Θ. Additionally, we need some regularity assumptions on the criterion L_ℓ.
Assumption 3.4 introduces a regularity condition on the criterion L_ℓ, involving an infimum over parameters θ_1, . . ., θ_4 ∈ Θ. The condition on the infimum can be relaxed: Assumption 3.4 arises naturally from a Taylor expansion. Hence, the infimum with respect to θ_1, . . ., θ_4 can be restricted to θ_2 and θ_3 belonging to a small neighborhood of θ_1 (and not to the whole set Θ).

Deviation bounds for our estimator
In this section, we study the consistency of a fitted tree T(u), a subtree of the maximal tree T_max(u), with K leaves (T_ℓ)_{ℓ=1,...,K}. We compare this fitted tree to T*(u|T), which is the tree based on the same subdivision, but where, in each leaf ℓ, the parameter is θ*_ℓ(u) (instead of θ_ℓ(u) in T(u)).
The first step is to define a distance between trees. Let us define ‖(a, b)‖_∞ = max(|a|, |b|), and, for two trees T and S with the same leaves, a distance based on the leaf-wise differences between their parameters.
The main result of this section is a deviation bound for ‖T(u) − T*(u|T)‖², which is Theorem 3.5 below.
The proof of Theorem 3.5 is postponed to the appendix (Section A.3). The exponential terms on the right-hand side come from concentration inequalities proved by Einmahl and Mason [15], while the polynomially decreasing term is related to the fact that the log-likelihood is an unbounded quantity, which can still be controlled when considering its expectation.
As a by-product, we obtain the following Corollary 3.6 (by integration of the bound of Theorem 3.5).

From Corollary 3.6, one can see that the L_2-norm of the stochastic part of the error increases, as expected, with the complexity of the tree. On the other hand, the error decreases almost at the rate k_n^(−1/2) (up to some logarithmic factor), which is the convergence rate of standard estimators used to estimate the tail parameter in the absence of covariates.
The proof is again postponed to the appendix (Section A.4).

Misspecification bias
For X = x, the ultimate goal is to estimate the tail parameter θ_0(x) = (σ_0u(x), γ_0(x)), introduced in (2.4), by maximization of the GP likelihood. The difference between θ_0(x) and θ*(x) can be understood as a misspecification term due to the fact that the observations above the threshold are not exactly distributed according to a GP distribution. This bias term can be controlled under second-order conditions which are standard in extreme value analysis. Indeed, recall that assuming that the underlying distribution F(·|x) satisfies Condition (2.3) guarantees that, asymptotically, the associated excesses above the threshold u are GP distributed. For finite samples, the excesses are thus not exactly GP distributed, which introduces some bias. In order to control this bias term, a second-order condition is needed, that is, a condition controlling the rate of convergence in Condition (2.3). There exist numerous ways to express this second-order condition. Here, we consider the same condition as Condition C.6 in [5]. First, Condition (2.3) can be rewritten in terms of a function η, where η is slowly varying, that is, η(ty | x)/η(t | x) → 1 as t → ∞, for all y > 0.
Assumption 3.7. Assume that, for all x, there exist a constant c and a function ψ such that, as t → ∞, the slowly varying function η admits a second-order expansion for each y > 0, with ψ(t) > 0 and ψ(t) → 0 as t → ∞, and ρ ≤ 0.
Let us note that we could also consider the case of c, ψ and ρ depending on x, and then assume some uniform bound over x of these quantities.We chose this more restrictive formulation to simplify the notations.
The next result guarantees that the bias term tends to 0 as u → ∞.
Proposition 3.8. There exist a constant c and a function ψ such that ψ(u) > 0 and ψ(u) → 0 as u → ∞, and such that, for X = x, the bias is bounded in terms of C_2(u), a constant depending on u, γ_min and γ_max.

Consistency of the pruning step
The previous results cover the case of a tree with a fixed number of leaves K. In practice, the question is to select the proper subtree of T_max(u), the maximal tree obtained once the previous step of the CART procedure has stopped, with some "optimal" number of leaves, which is the objective of the pruning step described in Section 2.2.3. As seen in Corollary 3.6, the squared stochastic part of the error increases proportionally to K. This is closely related to the natural inflation of the log-likelihood (which is locally quadratic) when the number of leaves increases, justifying a penalty proportional to K, as in [7,19]. The aim of Theorem 3.9 is to corroborate this choice.
First of all, for a partition (T_{Kℓ})_{ℓ=1,...,K} into K leaves, let us define T_K(u) as the tree with parameters θ_K(u) estimated with the CART procedure, and T*_K(u) as the tree with the corresponding parameters θ*_{Kℓ}(u). In words, T*(u) = T*_{K_0(u)}(u) is the subtree of T_max(u) that achieves the closest proximity to x → θ*(x), in the sense that it maximizes the expectation of the (pseudo) log-likelihood.
Second, as explained in (2.6), we denote by K(u) the selected number of leaves and by T(u) = T_{K(u)}(u) the corresponding selected tree. Define L_n(T_K, u) as the log-likelihood associated with a tree T_K. Finally, for two trees T and S, let ∆L_n(T, S) = L_n(T, u) − L_n(S, u) and, similarly, ∆L(T, S) = L(T, u) − L(S, u).
The following Theorem 3.9 shows that the pruning methodology selects a tree T(u) which approximately achieves the same rate as T_{K_0(u)}, even if K_0(u) is unknown, provided that the penalty constant λ belongs to some reasonable interval. Theorem 3.9. Let D = inf_u inf_{K<K_0(u)} ∆L(T*(u), T*_K(u)) and suppose that there exists a constant c_2 > 0 such that the penalization constant λ satisfies the stated bounds, where C_5 is a constant depending on T*(u).
The proof is given in Section A.6.

Simulation study and real data analysis
This section is devoted to the illustration of the GP regression procedure on simulated data (Section 4.1) and on a real dataset (Section 4.2).

Simulations
In this section, we assess the performance of the GP regression procedure on simulated data and compare it with the competing approach proposed by [10]. We first describe the simulation framework and then discuss the experimental results.
We consider the following regression framework: X is a one-dimensional variable uniformly distributed on [0, 1], and the response variable Y, conditionally on X = x, is distributed according to a Burr distribution with parameters (σ, γ_0(x)), with σ > 0 and γ_0(x) > 0 for all x, whose survival function has tail index γ_0(x). Note that F(· | x) satisfies property (2.3). We consider two cases: (i) γ_0(x) a step-wise function and (ii) γ_0(x) a smooth function. In both cases, the scale parameter σ was fixed equal to 1.
(i) step-wise function: in this case, the function γ_0 is taken as a piecewise-constant function of x. (ii) smooth function: in this case, the function γ_0 is taken as a smooth function of x ∈ [0, 1]. We simulate 1 000 replications for different sizes of the observation sample (n = 1 000, 2 500, 5 000, 10 000 and 25 000) according to the described framework for both cases (i) and (ii). For each sample, we consider the excesses above the 0.90-empirical quantile, which corresponds to k_n = 100, 250, 500, 1 000 and 2 500. For each simulated sample, we compute the regression tree procedure (CART) and the method based on generalized additive models (GAM) proposed by [10]. Next, we compute ∫_0^1 (γ(x) − γ_0(x))² dx for each estimator. The empirical mean squared error is then obtained by averaging these errors over the 1 000 replications. Results are shown in Table 1. The boxplots of the empirical quadratic errors are shown in the supplementary material (Section A).
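The simulation design can be sketched as follows. This is a sketch under our own assumptions: since the paper's displayed formulas are not reproduced here, we use a hypothetical smooth tail-index function and the Burr-type survival function F(y | x) = (1 + y^(1/γ_0(x)))^(−1), which has tail index γ_0(x).

```python
import numpy as np

rng = np.random.default_rng(2)

def gamma0(x):
    # Hypothetical smooth tail-index function (the paper's exact choice
    # is not reproduced here); values range over [0.3, 0.6].
    return 0.3 + 0.3 * np.sin(2 * np.pi * x) ** 2

n = 10000
x = rng.uniform(size=n)
# Burr-type model with sigma = 1 (assumed parametrization):
# survival(y | x) = (1 + y**(1 / gamma0(x)))**(-1), inverted below.
v = rng.uniform(size=n)
y = (1.0 / v - 1.0) ** gamma0(x)

# Peaks over threshold: keep the excesses above the empirical 0.90 quantile.
u = np.quantile(y, 0.90)
keep = y > u

# Hill-type check of the pooled tail index on the exceedances
# (a crude sanity check of the simulation, not the paper's estimator).
hill = np.mean(np.log(y[keep] / u))
print(u, hill)
```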
Let us note that the GAM approach is not designed to capture non-smooth functions as in the step-wise case. Nevertheless, we see that this technique manages to provide a relatively good fit even in this case when the sample size is large. For k_n = 1 000 and 2 500, the results of the GAM approach are similar to, or even slightly better than, those of the regression tree method. On the other hand, we observe that regression trees lead to a better fit for small sample sizes, even in the smooth case, where they are not designed to take into account the regularity of γ_0(x).

Prediction of the cost of flooding events in France
In order to improve the knowledge and the management of natural catastrophes, the French Federation of Insurance (FFA) is interested in the prediction of the cost of such events, especially of the most severe ones, shortly after their occurrence. These catastrophic events present some heterogeneity in their intensity depending on their characteristics, such as the affected meteorological region or the number of individual houses in the flood-risk area. The prediction of their cost thus becomes a challenging task. In this section, we illustrate how the GP regression tree procedure can be used to gain further insight into this heterogeneity. The ability of the procedure to design classes of events that are more homogeneous (in view of analyzing the tail of their distribution) is an appealing property for operational applications in insurance. The database we consider was obtained through a partnership with the FFA, in particular with one of its dedicated technical bodies, the association of French insurance undertakings for natural risk knowledge and reduction (Mission Risques Naturels, MRN). It consists of all 3 100 flooding events that have been granted the status of natural catastrophe in France from 1999 to 2019 (let us note that the status "natural catastrophe" is a French specificity, with some legal consequences when an event receives this label, see [8,1]). This database is fed by 13 contributors, including the major French insurance companies, allowing it to cover 70% of the French non-life insurance market. The database gathers information regarding each flooding event (its cost, the meteorological region, the season, the number of affected hydrological regions, the number of individual houses and the number of professional business premises in the flood-risk area). Note that, since the purpose of this database is the fast prediction of the cost of a flooding event (as soon as possible after its occurrence), the variables that are registered correspond to quantities that are available before the event, or soon after it.
The variable of interest, the total cost of a flooding event, is highly volatile. Indeed, it ranges between 0 and 394 376 000 euros, with an empirical variance equal to 1.77 × 10^14. Figure 1 shows the average of the costs of the 10% most onerous flooding events within each meteorological region. This highlights the heterogeneity of the severity of the most severe events. Furthermore, the top ten most onerous events represent 43% of the total cost in this database, and the top hundred represent 80%.
Figure 1: Cartography of the cost of flooding events in France from 1999 to 2019. For each meteorological region, the average of the costs of the 10% most onerous events is shown. A lighter red color indicates a small cost, while a darker color indicates a large cost.
Now, let us recall that our goal is to understand the heterogeneity of the total cost of the most severe flooding events, that is, of extreme flooding events. As explained in Section 2.1, the definition of extreme events relies on choosing a threshold u, which should be chosen as a bias-variance trade-off. We chose the value u = 100 000 euros based on practical considerations, validated by sensitivity analyses (shown in the supplementary material, Section B). This yields 1 100 extreme events, that is, events for which the cost is larger than u.
The GP regression tree was performed on the database corresponding to the flooding events extracted from the original database for which the total cost is larger than u (=100 000 euros).The variables of this database and their characteristics are summarized in Table 2. Again, it can be noticed that the cost, the variable of interest, is highly volatile.Table 2: List of quantitative and categorical variables in the database and their characteristics.For the quantitative variables, Table a) shows the minimum, the first quartile, the median, the mean, the third quartile and the maximum, and for the categorical variables, Table b) the number of observations per category.b) The tree obtained from GP regression procedure is shown in Figure 2 (the quantile-quantile plots of the GP fit in each leaf are shown in the supplementary material, Section C).The tree is composed of 6 leaves, with three splits according to only 3 covariates: the number of individual houses, the number of professional business premises in flood-risk area and the number of affected meteorological regions.This seems reasonable since the first two covariates represent the exposure to floods, but also the population density of the affected area and the third one the extent of the flood.In each leaf, are given the shape and scale parameters.The worst case scenario corresponds to the leaf on the far right, with a shape parameter equal to 1 and containing 9% of all flooding events.This leaf corresponds to events for which more than 9 meteorological regions are affected and more than 597 518 professional business premises are in flood-risk area.The least severe case corresponds to the third leaf from the left, with a shape parameter equal to 0.24 and containing only 3% of the events.Table 3 presents for each leaf the empirical median and mean of the costs and the theoretical median and mean of the corresponding GP distribution.Let us recall that for a GP distribution with a scale parameter σ and a shape 
parameter γ, the theoretical median is given by σ(2^γ − 1)/γ and the theoretical mean by σ/(1 − γ) for γ < 1 (the mean being infinite for γ ≥ 1). First of all, for every leaf, the median is much smaller than the mean, suggesting that we are indeed dealing with extreme events. Then, the empirical and theoretical medians are of the same order for each leaf, while the empirical and theoretical (when it exists) means are only comparable for leaves 3 and 5, for which the shape parameter is significantly different from 1.
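The closed-form GP median and mean quoted above are easy to cross-check numerically. The sketch below plugs in a shape parameter of the same order as the least severe leaf (γ = 0.24) with an illustrative scale value, and verifies the formulas against scipy's `genpareto` parameterization (shape argument `c` = γ); the scale value is an assumption for illustration, not a number from the paper.

```python
import numpy as np
from scipy.stats import genpareto

gamma, sigma = 0.24, 2.0e5  # shape as in the least severe leaf; scale illustrative

# Closed-form GP quantities quoted in the text (valid for gamma != 0).
median = sigma * (2**gamma - 1.0) / gamma
mean = sigma / (1.0 - gamma) if gamma < 1 else np.inf

# Cross-check against scipy's genpareto (location 0, c = gamma).
assert np.isclose(median, genpareto.median(gamma, scale=sigma))
assert np.isclose(mean, genpareto.mean(gamma, scale=sigma))

# The mean exceeds the median, reflecting the heavy right tail.
print(f"median ~ {median:,.0f}, mean ~ {mean:,.0f}")
```

For γ ≥ 1 the mean is infinite, which is why the text only compares empirical and theoretical means in the leaves whose shape parameter is well below 1.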

Conclusion
In this paper, we investigated the consistency of Generalized Pareto regression trees applied to extreme value regression. The results that we derive are non-asymptotic, and allow us to justify the consistency of the pruning methodology used to select a proper subtree. Let us note that the conditions under which our results hold are relatively weak, in the sense that they hold even if the tail index γ is arbitrarily close to zero (the special case γ = 0 is excluded) or large. Moreover, no regularity assumptions on the target parameters are required, thanks to the flexibility of the regression tree procedure.
Through the simulation study and the real data analysis, we investigated the practical performance of the methodology. The regression tree approach can be applied in various situations, while preserving the interpretability of the results. On the other hand, regression trees may be unstable, since they are quite sensitive to changes in the data used to fit them. Hence, this work is a first step toward studying related methodologies, such as random forests (see for example [7]), in the field of extreme value regression.
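To make the procedure discussed above concrete, the following sketch illustrates a likelihood-based split search of the kind underlying GP regression trees: each candidate cut is scored by the summed maximized GP log-likelihood of the two children. This is a minimal illustration, not the paper's implementation; the function names, the quantile grid of candidate cuts, and the minimum leaf size are all hypothetical choices, and the fitting relies on scipy's `genpareto` MLE.

```python
import numpy as np
from scipy.stats import genpareto

def gp_loglik(excesses):
    """Maximized GP log-likelihood of a candidate leaf (location fixed at 0)."""
    if excesses.size < 20:  # hypothetical minimum leaf size
        return -np.inf
    shape, _, scale = genpareto.fit(excesses, floc=0.0)
    return genpareto.logpdf(excesses, shape, scale=scale).sum()

def best_split(x, y):
    """Single-covariate split search: pick the cut maximizing the
    summed GP log-likelihood of the two children."""
    best_cut, best_score = None, -np.inf
    for cut in np.quantile(x, np.linspace(0.1, 0.9, 17)):
        score = gp_loglik(y[x <= cut]) + gp_loglik(y[x > cut])
        if score > best_score:
            best_cut, best_score = cut, score
    return best_cut, best_score

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 2000)
# Simulated tail behavior depends on the covariate: heavier tail when x > 0.5.
shape_true = np.where(x > 0.5, 0.8, 0.2)
y = genpareto.rvs(shape_true, scale=1.0, random_state=rng)

cut, score = best_split(x, y)
print(f"selected cut ~ {cut:.2f}")  # should land near the change point at 0.5
```

Recursing this search on each child, then pruning the resulting tree with a penalized criterion, mirrors the CART-type strategy whose consistency is studied in the paper.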

A Proofs
In this section, we present in detail the proofs of the results presented throughout the paper. The concentration inequalities required to obtain the results are presented in Section A.1. These inequalities are used to obtain deviation bounds in Section A.2, which are the key ingredients of the proofs of Theorem 3.5 (Section A.3), Corollary 3.6 (Section A.4), and Theorem 3.9 (Section A.6). Section B gathers some results on covering numbers that are required to control the complexity of some classes of functions considered in the proofs. Some technical lemmas are gathered in Section C.

A.1 Concentration inequalities
The proofs of the main results are mostly based on concentration inequalities. The following inequality was initially proved by Talagrand [30]; see also [15].

Proposition A.1. Let (V_i)_{1≤i≤n} denote i.i.d. replications of a random vector V, and let (ε_i)_{1≤i≤n} denote a vector of i.i.d. Rademacher variables (that is, P(ε_i = −1) = P(ε_i = 1) = 1/2) independent from (V_i)_{1≤i≤n}. Let F be a pointwise measurable class of functions bounded by a finite constant M_0. Then, for all t > 0,

P( max_{1≤m≤n} sup_{ϕ∈F} | Σ_{i=1}^m {ϕ(V_i) − E[ϕ(V)]} | ≥ A_1 ( E[ sup_{ϕ∈F} | Σ_{i=1}^n ε_i ϕ(V_i) | ] + t ) ) ≤ 2 ( exp( −A_2 t² / (n v_F) ) + exp( −A_2 t / M_0 ) ),

where v_F = sup_{ϕ∈F} Var(ϕ(V)), and where A_1 and A_2 are universal constants.
The difficulty in applying Proposition A.1 comes from the need to control the symmetrized quantity E[ sup_{ϕ∈F} | Σ_{i=1}^n ε_i ϕ(V_i) | ]. Proposition A.2, due to Einmahl and Mason [15], allows this control via some assumptions on the considered class of functions F.
We first need to introduce some notation regarding covering numbers of a class of functions; more details can be found for example in Chapter 2.6 of [31]. Let us consider a class of functions F with envelope Φ (which means that |ϕ(v)| ≤ Φ(v) for all ϕ ∈ F and almost all v), and let N(ε, F) denote the uniform covering number of F at scale ε, that is, the supremum over probability measures Q of the minimal number of L²(Q)-balls of radius ε‖Φ‖_{Q,2} required to cover F.

Proposition A.2. Let F be a pointwise measurable class of functions bounded by M_0, with envelope Φ, such that, for some constants A_3, α ≥ 1 and some 0 ≤ √v ≤ M_0,

N(ε, F) ≤ A_3 ε^{−α} for 0 < ε < 1, and sup_{ϕ∈F} E[ϕ(V)²] ≤ v.

Then, for some absolute constant A_5,

E[ sup_{ϕ∈F} | Σ_{i=1}^n ε_i ϕ(V_i) | ] ≤ A_5 √( α n v log( A_3 M_0 / √v ) ).

A.3 Proof of Theorem 3.5
The proof of Theorem 3.5 consists in gathering the results on the leaves obtained in Corollary A.5. Let u_min ≤ u ≤ u_max. The result follows from Corollary A.5, and from the assumption on

A.4 Proof of Corollary 3.6

Write E[sup ...]. We now use Theorem 3.5 to bound the integral on the right-hand side. Since

A.5 Proof of Proposition 3.8

Let x be fixed. Then, from a Taylor expansion, for ℓ = 1, ..., K, conditionally on X ∈ T_ℓ, for some parameters γ̃_j (resp. σ̃_j) between γ_0(x) and γ*_ℓ (resp. σ_0(x) and σ*_ℓ). Thus, under Assumption 3.4, where Z is a random variable distributed according to the distribution F_u defined in Section 2.1 with σ_0(x) = u γ_0(x). Under Assumption 3.7, we then obtain the stated bound.

A.6 Proof of Theorem 3.9

The following lemma will be needed to prove Theorem 3.9.
Lemma A.6. Let D = inf_u inf_{K<K_0(u)} ΔL(T*(u), T*_K(u)), and let u ∈ [u_min, u_max] be fixed. Suppose that there exists a constant c_2 > 0 such that the penalization constant λ satisfies the stated condition for K < K_0(u). For K > K_0(u), a bound is then obtained from Theorems A.3 and A.4, where D = inf_{K<K_0(u), u∈[u_min,u_max]} D(K_0(u), K). These two probabilities can be bounded using Theorems A.3 and A.4 provided that the condition holds for all K < K_0(u).

We are now ready to prove Theorem 3.9. Let u ∈ [u_min, u_max] be fixed. Firstly, apply Theorem 3.5. Secondly, recall that μ(T_ℓ) = P(X ∈ T_ℓ). Following the same idea as in the proof of Proposition 3.8, from Taylor's expansion, under Assumptions 3.4 and 3.7, the result follows, for some constant C_5.

B Covering numbers
Lemma B.1. Following the notation of the proof of Theorem A.3, the class of functions F satisfies the covering number bound, for some constants C_4 > 0 and α > 0 (depending neither on n nor on K).

Proof. The bound holds for some α > 0 and constants F_i.

C Technical Lemmas
Lemma C.1. With v_F defined in Proposition A.1, the stated bound holds.

Proof. We have the claimed inequality.

Lemma C.2. Define, for j = 1, 2, 3, the quantities E_{j,n}. Under the assumptions of Theorem A.3, the stated bounds hold. Next, apply the Chernoff inequality.

Figure 2: GP regression tree obtained for flooding events. For each leaf, the value of the shape parameter γ (first line) and of the scale parameter σ (× 10^−5, second line) are given. The percentage of observations assigned to each leaf is also mentioned.

Corollary A.5. Under the assumptions of Theorems A.3 and A.4 and Assumption 3.4, for t

Table 1: Empirical mean squared errors for the GP regression tree procedure (GP CART) and the GAM model, for different sample sizes, in a) the stepwise case and b) the smooth case.

Table 3: Empirical median and mean, and theoretical median and mean, for each leaf (in euros).
The bound involving 1_{Y ≥ u_min} is obtained from Lemma C.2, and nE_{1,n}/k_n ≤ e_1 k