A weighted composite log-likelihood approach to parametric estimation of the extreme quantiles of a distribution

Extreme value theory motivates estimating extreme upper quantiles of a distribution by selecting some threshold, discarding those observations below the threshold and fitting a generalized Pareto distribution to exceedances above the threshold via maximum likelihood. This sharp cutoff between observations that are used in the parameter estimation and those that are not is at odds with statistical practice for analogous problems such as nonparametric density estimation, in which observations are typically smoothly downweighted as they become more distant from the value at which the density is being estimated. By exploiting the fact that the order statistics of independent and identically distributed observations form a Markov chain, this work shows how one can obtain a natural weighted composite log-likelihood function for fitting generalized Pareto distributions to exceedances over a threshold. A method for producing confidence intervals based on inverting a test statistic calibrated via parametric bootstrapping is proposed. Some theory demonstrates the asymptotic advantages of using weights in the special case when the shape parameter of the limiting generalized Pareto distribution is known to be 0. Methods for extending this approach to observations that are not identically distributed are described and applied to an analysis of daily precipitation data in New York City. Perhaps the most important practical finding is that including weights in the composite log-likelihood function can reduce the sensitivity of estimates to small changes in the threshold.


Introduction
Estimating extreme quantiles of a distribution provides a striking case of the bias-variance tradeoff that is ubiquitous throughout statistics. Specifically, from a sample of size n from some distribution, quantiles that are not too extreme can be estimated with little bias under weak assumptions on the smoothness of the underlying distribution (see, e.g., Falk (1985)). However, without additional assumptions, it is difficult to say something useful about quantiles at levels like 1 − ε_n, where ε_n is comparable to or, worse, much smaller than n^{−1}. Alternatively, assuming that the observations come from some low-dimensional parametric family can yield good estimates of even extreme quantiles if the family includes the true distribution, but at the risk of severe bias if the truth is not in this family. As described in Section 2, it is common practice to appeal to extreme value theory, which shows that for a broad range of distributions, the distribution of exceedances beyond a high threshold is well approximated by a generalized Pareto distribution (GPD). For y_+ the positive part of y, the survival function of the GPD is {1 + ξ(x − μ)_+/σ}_+^{−1/ξ}, with location parameter μ real, shape parameter ξ real and scale parameter σ positive. For ξ = 0, the survival function is defined by continuity and equals exp{−(x − μ)_+/σ}.
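As a quick reference, here is a minimal R sketch of this survival function; the function name and the numerical tolerance used for the ξ = 0 case are choices made here, not taken from the paper.

```r
# Survival function of the GPD with location mu, scale sigma, shape xi;
# the xi = 0 case is handled by continuity as exp{-(x - mu)_+ / sigma}.
gpd_surv <- function(x, mu = 0, sigma = 1, xi = 0) {
  z <- pmax(x - mu, 0) / sigma                # (x - mu)_+ / sigma
  if (abs(xi) < 1e-8) return(exp(-z))         # exponential tail when xi = 0
  pmax(1 + xi * z, 0)^(-1 / xi)               # {1 + xi (x - mu)_+/sigma}_+^(-1/xi)
}
gpd_surv(2, mu = 0, sigma = 1, xi = 0.25)     # exceedance probability at x = 2
```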
Using this formulation does not make the bias-variance problem go away. Indeed, the situation is analogous to nonparametric density estimation, where one might only assume the density is, say, twice differentiable, which allows one to obtain rates of convergence for estimates of the density at any fixed x for which the density is positive (Sheather 2004). For density estimation, it is well-known that density estimates with modestly lower asymptotic mean squared error can be obtained by using a kernel that decreases smoothly as observations get farther from x (Epanechnikov 1969). This work develops an analogous approach to downweighting the influence of less extreme observations for fitting GPDs to the tails of a sample from some distribution in pursuit of extreme quantile estimates with reduced mean squared error.
Section 2 describes the basic methodology, which exploits the elementary fact that the order statistics from an i.i.d. (independent and identically distributed) sample form a Markov chain. This result allows one to write down a weighted composite log-likelihood (WCL) for the parameters of the GPD that includes the usual log-likelihood function when the weights are all 1; see Varin et al. (2011) for a general review of weighted composite log-likelihood methods. Section 2 also explores the relationship of this approach to the weighted Hill estimator in Csörgǒ et al. (1985). Section 2.1 shows how, under the strong assumption that the exceedances over a chosen threshold follow a GPD exactly, one can obtain confidence intervals for scalar parameters or quantiles by inverting hypothesis tests obtained from deviance statistics based on WCL, using parametric bootstrapping to calibrate the tests. Section 3 develops the asymptotic theory for the resulting estimates of extreme quantiles in a simple special case in which the shape parameter of the tail-approximating GPD is known to be 0. Under a standard condition on the second-order tail behavior of the distribution and allowing j, the number of the largest order statistics used to fit the GPD, to grow at an appropriate rate as n increases, the estimated quantiles are then asymptotically normal with asymptotic bias related to this second-order tail behavior. These results allow one to compare mean squared errors for different weighting functions. Specifically, similar to results in Csörgǒ et al. (1985), depending on an exponent appearing in the expression for the second-order behavior of the underlying distribution, one finds an asymptotically optimal nonnegative weighting function. Simulation results in Section 4 show benefits to weighting in the more realistic setting where the shape parameter is unknown. In addition to yielding somewhat better quantile estimates if j is chosen well, using weights can substantially reduce the volatility of the estimated quantiles as j varies. Simulations in Section 4.2 show the parametric bootstrap approach described in Section 2.1 can yield hypothesis tests for extreme quantiles with levels close to their nominal values.
Extreme value methods are often applied to observations that are dependent and/or not identically distributed. The Markov property of order statistics only applies when the observations are i.i.d. In a simulation study, Section 5 examines estimates of ξ for stationary time series whose marginal distributions are exactly GPD. Even for time series with strong extremal dependence, the estimates using WCLs that ignore this dependence have comparable bias to estimates based on an unweighted procedure. Section 6 considers independent but not identically distributed observations and writes down two conditional distributions based on order statistics that could be used to obtain WCLs in this setting. Because the order statistics no longer form a sufficient statistic when the observations are not i.i.d., Section 6 proposes using quantile regressions to threshold and scale the observations so that the rescaled threshold exceedances are more nearly i.i.d. before reducing them to order statistics. The method is tested in a simulation using a realistic statistical model for daily temperature in Minneapolis, Minnesota, which exhibits a large seasonal cycle. In this example, extreme quantile estimates based on the unweighted likelihood work slightly better than the WCL approach when the best threshold is used, but the weighted approach retains the advantage of having results that vary much more smoothly with a changing threshold probability. In Section 7, this approach is used to analyze over 150 years of daily rainfall data in New York City. Both weighted and unweighted procedures were fitted to roughly half of the data and then evaluated using the other half. Measured by a log-likelihood for the testing data censored at cutoffs of 2 or 3 inches, the weighted approach again has the advantage that its results vary much less erratically with the choice of threshold probability. When this probability is chosen well, both approaches yield good estimates of seasonally varying extreme quantiles for daily rainfall in the testing data. Section 8 discusses further possible applications of WCLs based on order statistics, both to extremes and more generally, and comments on choosing thresholds in practice. Section 9 proves Theorem 1 from Section 3.

Methodology
Let us first consider the general setting of fitting a parametric model to i.i.d. observations. Suppose X_1, …, X_n are i.i.d. from some density f_θ belonging to a family of densities indexed by the parameter θ, with corresponding survival function S_θ. Writing X_(1) ⩽ ⋯ ⩽ X_(n) for the corresponding order statistics, then because the order statistics form a Markov chain (David and Nagaraja 2003, Section 2.5), the log-likelihood function can be written (ignoring an additive constant) as a sum of the log conditional densities of successive order statistics. Of course, it would be pointless to use this form of the log-likelihood function if the family {f_θ} included the true density. However, if one were more interested in some parts of the distribution than others and were concerned that the parametric model could be substantially misspecified in ranges that were not of particular interest, then this form of the log-likelihood function naturally allows weighting: multiplying the successive conditional terms by constants w_*, w_1, …, w_{n−1} gives a weighted composite log-likelihood, (1). The corresponding weighted score equations, obtained by setting derivatives with respect to the components of θ to 0, are, under standard regularity conditions, unbiased estimating equations when the family {f_θ} includes the true density (Li and Babu 2019, Chapter 9). The weights will generally be nonnegative, but the weighted score equations yield unbiased estimating equations for any real weights.

Now suppose we do not want to assume a parametric model for the entire distribution of X_1, but that we can model the conditional density of X_(n−j+1), …, X_(n) given X_(n−j), for j small compared to n, with some parametric family indexed by θ, leading to the conditional log-likelihood function of these top j order statistics given X_(n−j). Similar to (1), we can downweight the influence of less extreme observations by weighting the successive conditional terms, giving the WCL (2). The idea of a local likelihood for dependent observations goes back to at least Rao (1970) in the context of regularly observed time series.

Now, if we assume that the distribution of X_1, …, X_n is in the domain of attraction of a generalized extreme value distribution, then standard extreme value theory shows that as a threshold approaches the upper bound of the support of this distribution, properly scaled exceedances over this threshold converge in distribution to a GPD(0, σ, ξ) (de Haan and Ferreira 2006, Theorem 1.1.6). If n → ∞ and j/n → 0, then X_(n−j) converges to this upper bound with probability 1, so when n is large and j/n is small, it is reasonable to assume that exceedances over X_(n−j) approximately follow a GPD with location parameter 0. Applying this approximation to (2), with Y_1, …, Y_j denoting the exceedances over X_(n−j), yields a WCL, (3), for the GPD parameters (σ, ξ). It is also possible to apply (2) to various extensions of the GPD whose greater flexibility may allow one to use lower thresholds (Gamet and Jalbert 2022).
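To make this concrete, here is a minimal R sketch of one way a GPD-based weighted composite log-likelihood along the lines of (2)-(3) can be coded: the k-th largest exceedance Y_k over the threshold X_(n−j) contributes the log conditional density of the k-th largest order statistic given the (k+1)-th largest under a GPD(0, σ, ξ) model, weighted by w_k. The indexing and parameterization are my reading of the construction above, not a transcription of the paper's displays; with all weights equal to 1 the terms telescope to the ordinary conditional log-likelihood, as the text requires.

```r
# Log density and log survival function of GPD(0, sigma, xi)
gpd_logpdf <- function(y, sigma, xi) {
  if (abs(xi) < 1e-8) return(-log(sigma) - y / sigma)
  -log(sigma) - (1 / xi + 1) * log1p(xi * y / sigma)
}
gpd_logsurv <- function(y, sigma, xi) {
  if (abs(xi) < 1e-8) return(-y / sigma)
  -(1 / xi) * log1p(xi * y / sigma)
}

# Weighted composite log-likelihood for exceedances over the threshold X_(n-j).
# y:   exceedances sorted in decreasing order (y[1] the largest), all positive
# w:   weights w_1, ..., w_j, with the small weights on the less extreme terms
# par: c(log(sigma), xi)
wcl_gpd <- function(par, y, w) {
  sigma <- exp(par[1]); xi <- par[2]
  if (xi < 0 && max(y) >= -sigma / xi) return(-1e10)   # outside GPD support
  j <- length(y)
  k <- seq_len(j)
  y_next <- c(y[-1], 0)          # next smaller exceedance; 0 below the smallest
  # log conditional density of the k-th largest given the (k+1)-th largest:
  # log f(y_k) + (k - 1) log S(y_k) - k log S(y_{k+1}), up to an additive constant
  terms <- gpd_logpdf(y, sigma, xi) +
    (k - 1) * gpd_logsurv(y, sigma, xi) -
    k * gpd_logsurv(y_next, sigma, xi)
  sum(w * terms)
}

# Example: maximize the WCL for simulated exponential exceedances
set.seed(1)
y <- sort(rexp(100), decreasing = TRUE)
w <- 2 * (1 - (seq_along(y) - 1) / length(y))          # one linear weight choice
fit <- optim(c(0, 0.1), wcl_gpd, y = y, w = w, control = list(fnscale = -1))
c(sigma = exp(fit$par[1]), xi = fit$par[2])
```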
As in Drees et al. (2004) for the unweighted log-likelihood function, if we allow ξ < −1, then for w_1 > 0, the maximizer of (3) with respect to (σ, ξ) ∈ (0, ∞) × ℝ does not exist, since, for any ξ < −1, (3) tends to ∞ as σ ↓ −ξY_j. Furthermore, standard asymptotics for the maximizer can only be expected to hold for ξ > −0.5 (Drees et al. 2004). Since values of ξ ≤ −0.5 are rare in most applications of extremes, following Drees et al. (2004), I will restrict the estimates of ξ to (−0.5, ∞) in all of the simulations.
Let us next consider the relationship between this approach and that proposed in Csörgǒ et al. (1985) for estimating ξ by adding weights to the classical Hill estimator of ξ when ξ > 0 (Hill 1975). We can obtain the weighted Hill estimator by using the ordinary Pareto distribution in (2), which, for ξ > 0 and x_0 > 0, has survival function (x/x_0)^{−1/ξ}. For X_(n−j) > x_0, we then obtain the WCL (4), plus a term not depending on (x_0, ξ). Maximizing (4) with respect to ξ gives the weighted Hill estimator, (5). Csörgǒ et al. (1985) replace log by log_+ in (5), where log_+ x = log x for x > 1 and 0 otherwise, so that this estimator is defined even when X_i can be negative. Further study of this weighted Hill estimator is provided in Caeiro et al. (2019) and the references therein. Groeneboom et al. (2003) describe a kernel estimator for ξ that works for all real ξ, although their functional forms are not nearly as simple as for the weighted Hill estimator. The estimates in all of these works are in terms of the logarithms of the largest order statistics and thus are not location invariant, whereas the method described here is location invariant for all ξ. As noted by Beirlant et al. (2012), many (but certainly not all) methods for estimating extremes lack location invariance. Whether location invariance is always desirable is not a simple question. In particular, for intrinsically positive quantities, 0 is a special value, so location invariance may sometimes be inappropriate, although scale invariance would still generally be desirable. Simulations in Section 4.1 show that this lack of location invariance can be problematic if observations are i.i.d. from a t distribution. Furthermore, it is unclear how the kernel methods in Groeneboom et al. (2003) could be extended to observations that are not identically distributed, whereas WCLs based on order statistics can be applied in this setting (Section 6), if perhaps not totally satisfactorily.
As in the case of density estimation, one would generally want the w_k's in (3) to decrease smoothly and tend to 0 as k approaches j. This can be accomplished by setting, for k > 0, w_k = φ((k − 1)/j) for some positive, continuous and decreasing function φ on [0,1] with φ(1) = 0. All of the non-constant weight functions used in this work satisfy φ(1) = 0.
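For instance, one decreasing family with φ(1) = 0 that is consistent with the linear weight function used later (and with the variance calculation in Section 5, which gives j_1 = 4j_u/3 for the linear case) is φ_α(t) ∝ 1 − t^α, normalized to integrate to 1. The exact family defined by (19) is not reproduced here, so this form is an assumption made for illustration.

```r
# An assumed decreasing weight family on [0,1] with phi(1) = 0, integrating to 1;
# alpha = 1 gives the linear weight phi(t) = 2 * (1 - t).
phi_alpha     <- function(t, alpha) (1 + 1 / alpha) * (1 - t^alpha)
weights_for_j <- function(j, alpha) phi_alpha((seq_len(j) - 1) / j, alpha)
round(weights_for_j(10, alpha = 1), 2)   # largest weight on the most extreme term
```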

Hypothesis tests and confidence intervals
When making inferences for extremes using the GPD approximation of exceedances over a high threshold, it is difficult to take proper account of possible biases induced by using this approximation. If j is chosen small enough relative to n and the observations fall in the domain of attraction of a generalized extreme value distribution, the GPD approximation will be very good and estimation bias due to using this approximation should be negligible. If, at the same time, j can be chosen large enough so that estimates of interest follow close to a normal distribution, then, for example, confidence intervals based on this approximate normality might be expected to have close to the correct coverage properties. In this section, I propose a methodology for producing hypothesis tests and confidence intervals by acting as if the exceedances over a chosen threshold follow a GPD exactly. However, I do not propose a method for choosing j to make estimation biases negligible. Furthermore, even if such a methodology were available, it would most likely require choosing j in a way that makes the mean squared error of quantile estimates substantially larger than what can be obtained using larger values for j. For example, looking ahead to Theorem 1 in Section 3, one would presumably need to take C = 0 in (11) to make the bias in the estimates of extreme quantiles asymptotically negligible (Theorem 1 is only proven for C > 0 ), which I would expect leads to mean squared errors that do not converge to 0 at the optimal rate. Despite these problems, it seems worthwhile to develop inferential methods that work well when the GPD approximation holds exactly, which are provided in the case of unweighted likelihood methods by, for example, the R package extRemes (Gilleland 2020b).
One basis for inference about GPD parameters and the associated extreme quantiles is the asymptotic normality of the point estimates when the exceedances follow a GPD exactly. However, Coles (2001, Section 4.3) notes that one generally obtains more accurate inferences about extreme quantiles using the deviance statistic based on a profile likelihood approach. The standard chi-square approximation for the deviance statistic does not apply when using a WCL. Pace et al. (2011) give valid asymptotic approximations for the deviance statistic based on composite likelihoods, including situations with nuisance parameters, which, again assuming the GPD approximation holds exactly, should apply in the present setting. Deriving analytical expressions for the required quantities would be quite involved and, fortunately, is not necessary. Specifically, again assuming the exceedances follow a GPD exactly, these calculations can be avoided by taking a simulation-based approach to obtaining the null distribution of a deviance statistic under some hypothesis on a parameter value or quantile. Numerically inverting these tests provides confidence intervals for these quantities.
When using profile likelihoods to make inferences about extreme quantiles, Coles (2001, page 83) recommends ignoring the uncertainty in using X_(n−j) to estimate the (n−j)/(n+1) quantile and acting as if exceedances of X_(n−j) follow a GPD(0, σ, ξ) distribution exactly. Following this approach, for some scalar function h(σ, ξ), consider testing the null hypothesis h(σ, ξ) = h_0 versus the alternative h(σ, ξ) ≠ h_0 using the WCL (3). Let (σ̂, ξ̂) be a maximizer of (3) and (σ̂_0, ξ̂_0) be a maximizer of (3) subject to the constraint h(σ, ξ) = h_0. Then the deviance statistic (6), D(w) = 2{ℓ(σ̂, ξ̂; w) − ℓ(σ̂_0, ξ̂_0; w)}, reduces to the usual deviance statistic when the weight function is identically 1. In this unweighted case and assuming exceedances follow a GPD exactly, as long as j → ∞ as n → ∞, the null distribution of (6) is well approximated by a χ₁² distribution. In the weighted case, I propose using a parametric bootstrap to approximate this null distribution and then numerically inverting the resulting test to obtain a confidence interval. Specifically, for b = 1, …, B, simulate j observations from a GPD(0, σ̂_0, ξ̂_0). For each simulated data set, find the maximizer of the WCL, denoted by (σ̂_b, ξ̂_b), and the maximizer of the WCL under the null constraint h(σ, ξ) = h_0, denoted by (σ̂_{b0}, ξ̂_{b0}). Then approximate the null distribution of the deviance statistic (6) by the empirical distribution of D_b(w) = 2{ℓ(σ̂_b, ξ̂_b; w) − ℓ(σ̂_{b0}, ξ̂_{b0}; w)} for b = 1, …, B. To obtain a confidence interval for h(σ, ξ) at a certain confidence level, say 0.95, we would need to find the set of values of h_0 for which the simulation-calibrated test based on the deviance statistic (6) is not rejected at the 0.05 level. Assuming this set is an interval (h_L, h_U), we would then need to find h_L and h_U for which the simulation-based approximations of the p-value of the test statistic equal 0.05. Carpenter (1999) proposed essentially this approach to obtaining confidence intervals and called it test inversion bootstrapping, although that work focuses on Studentized test statistics rather than deviances. In the present setting, this would entail obtaining some estimate of the standard deviation of h(σ̂, ξ̂), call it ŝd(h(σ̂, ξ̂)), and then using bootstrapping to approximate the distribution of {h(σ̂, ξ̂) − h_0}/ŝd(h(σ̂, ξ̂)). Schendel and Thongwichian (2017) applied test inversion bootstrapping to making inferences about extreme quantiles using unweighted log-likelihoods, assuming exceedances over a threshold follow a GPD exactly. They find that this procedure can give better results than profile likelihoods, but do not consider using parametric bootstrapping to approximate the distribution of the deviance statistic. See Gilleland (2020a) for further discussion of test inversion bootstrapping for extremes. For inverting the tests to obtain confidence intervals, Carpenter (1999) recommends an approach based on the Robbins-Monro algorithm. I do not explore this numerical issue here, as it suffices to compute just the p-values under the null to obtain the coverage properties of the intervals obtained by inversion. Section 4.2 gives results from a brief simulation study on the size and power of the testing procedures described here. I do not explicitly compare results for test inversion bootstrapping as applied to the deviance statistic as opposed to Studentized test statistics. Thus, while I find it natural to apply the test inversion bootstrap to the deviance statistic, I have no direct evidence that it gives better inferences than using a Studentized test statistic when applied to estimation of extreme quantiles.
Using the profile likelihood does avoid the problem of computing an estimated standard deviation for h(σ̂, ξ̂).
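A minimal R sketch of this parametric-bootstrap calibration, for the special case h(σ, ξ) = ξ and reusing the wcl_gpd function sketched in Section 2; the helper names, starting values and choice of B are illustrative and not taken from the paper.

```r
# Sampler for GPD(0, sigma, xi) via inversion of the survival function
rgpd0 <- function(n, sigma, xi) {
  u <- runif(n)
  if (abs(xi) < 1e-8) return(-sigma * log(u))
  sigma * (u^(-xi) - 1) / xi
}

# Deviance statistic for testing H0: xi = xi0 with the WCL, plus the constrained fit
deviance_xi <- function(y, w, xi0) {
  free <- optim(c(log(mean(y)), 0.1), wcl_gpd, y = y, w = w,
                control = list(fnscale = -1))
  cons <- optimize(function(ls) wcl_gpd(c(ls, xi0), y, w),
                   interval = log(mean(y)) + c(-5, 5), maximum = TRUE)
  list(D = 2 * (free$value - cons$objective), sigma0 = exp(cons$maximum))
}

# Parametric bootstrap p-value: simulate exceedances from GPD(0, sigma0_hat, xi0),
# recompute the deviance, and use the descending rank of the observed deviance
boot_pvalue <- function(y, w, xi0, B = 999) {
  obs <- deviance_xi(y, w, xi0)
  Db <- replicate(B, {
    yb <- sort(rgpd0(length(y), obs$sigma0, xi0), decreasing = TRUE)
    deviance_xi(yb, w, xi0)$D
  })
  (1 + sum(Db >= obs$D)) / (B + 1)
}
```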
If one wanted to test hypotheses or obtain confidence intervals for observations that are not independent and/or identically distributed, then a block bootstrapping approach might be suitable. For example, for the daily precipitation data studied in Section 7, bootstrapping blocks of length 1 year (or possibly m years for some small positive integer m) would allow one to have resampling units that respect seasonality and that reduce dependence due to the length of the sampling period, although the precipitation data shows some evidence that precipitation patterns across years are not i.i.d.

Asymptotic theory
When fitting GPDs to the largest order statistics from a sample, we implicitly assume that the true distribution is in the domain of attraction of some generalized extreme value distribution. The class of distributions that satisfies this property is so broad that it is necessary to restrict which true distributions one wishes to consider in order to obtain results that would shed any light on how to pick the weights in the WCL (3). I only consider a narrow special case here, where the true distribution behaves like an Exponential distribution in its upper tail and ξ = 0 is treated as known. Following the program in Drees (1998) or de Haan and Ferreira (2006), it should be possible to obtain results treating ξ as unknown, but even the limited setting studied here provides insights into the choice of weights and evidence for the asymptotic superiority of unequal weights.
Write j_n to make the dependence of j on n explicit. Similarly write w_kn instead of w_k. When ξ = 0 is known, the WCL function takes the form given in (7). Define w_·n = Σ_{k=1}^{j_n} w_kn, w_0n = 0, and Δ_kn = k w_kn − (k − 1)w_{k−1,n}. Maximizing (7) with respect to σ yields the estimate σ̂ in (8), and, for p ⩽ (j_n + 1)/(n + 1), the corresponding estimated (1 − p)th quantile is Q̂(1 − p), given in (9). Thus, the quantities Δ_kn/w_·n serve as the effective weights on the ordered exceedances beyond X_(n−j_n) in Q̂(1 − p). Suppose that, for some α > 0 and a ≠ 0, the second-order condition (10) holds, where, for any fixed x > 0, we have lim_{t↓0} t^{−α} R(t, x) = 0 and this convergence is uniform for 0 < x ≤ 1. While this condition is quite restrictive, it is satisfied by any finite mixture of Exponential distributions or by any logistic distribution. In particular, for a logistic distribution, whose survival function S is of the form 1/(1 + e^{(x−μ)/σ}), we have α = 1 and a = −1 in (10). Defining Ψ(x) = (1 − x^α)/α and Φ(t) = −at^α, it follows that Condition 1 in Drees (1998) is satisfied. Thus, we can use Theorem 2.1 from Drees (1998) to find the limiting asymptotic distribution of the estimated quantiles in the important case where j_n grows with n in a way that makes the bias and standard deviation of these estimated quantiles of the same order of magnitude.
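Under one reading of these definitions, with Y_kn = X_(n−k+1) − X_(n−j_n) the k-th largest exceedance, the scale estimate is σ̂ = Σ_k Δ_kn Y_kn / w_·n, and the fitted exponential tail can then be extrapolated to more extreme quantiles. The R sketch below assumes this form and treats X_(n−j_n) as roughly the 1 − (j_n + 1)/(n + 1) quantile when extrapolating; the exact displays (8) and (9) are not reproduced here, so the normalization inside the logarithm is an assumption.

```r
# Sketch of the xi = 0 (exponential-tail) WCL estimates of sigma and of an
# extreme quantile, under the reading of (8)-(9) described in the text above.
wcl_exp_quantile <- function(x, j, w, p) {
  n  <- length(x)
  xs <- sort(x)
  u  <- xs[n - j]                           # threshold X_(n-j)
  Y  <- rev(xs[(n - j + 1):n]) - u          # Y[1] >= ... >= Y[j], the exceedances
  k  <- seq_len(j)
  Delta <- k * w - (k - 1) * c(0, w[-j])    # Delta_k = k w_k - (k-1) w_{k-1}, w_0 = 0
  sigma_hat <- sum(Delta * Y) / sum(w)
  u + sigma_hat * log((j + 1) / ((n + 1) * p))   # extrapolated 1 - p quantile
}
wcl_exp_quantile(rexp(1000), j = 100, w = 2 * (1 - (0:99) / 100), p = 0.001)
```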
Theorem 1  Suppose X_1, X_2, … are i.i.d. random variables with distribution satisfying (10) and, for finite C > 0, the sequence of positive integers j_n satisfies (11). Then, for a positive sequence p_n = o(j_n/n) and Q̂ as in (9), the convergence in (12) holds. The proof of this result is given in Section 9. We can obtain a more transparent form for how this result depends on j_n by assuming that p_n is regularly varying at infinity with index −β, which implies that log p_n ∼ −β log n as n → ∞. If β > 1/(2α + 1), so that p_n = o(j_n/n), then (12) implies (13), which is a consequence of (10). We now see that the relative error in an estimated extreme quantile is just O_p(j_n^{−1/2}). The form (13) also makes for easier comparisons with much more general results in de Haan and Ferreira (2006, Section 4.3) for maximum likelihood estimates.
For working out asymptotic optimality results, it is helpful to replace j_n in (12) by its asymptotic approximation (11), so that the convergence (14) holds in distribution. The value of C that minimizes the second moment of the limiting distribution in (14) is given by (15), assuming the denominator there is not 0. In terms of a functional I_α(φ) of the weight function, the minimized second moment of the asymptotic distribution is given by (17), and it is attained when j_n satisfies (18). Of course, (18) is not useful for selecting j_n since it requires knowing a and α.
Restricting φ to be nonnegative, we can, for any given α, seek the nonnegative function φ that integrates to 1 on (0,1) and minimizes I_α(φ). This problem is just a one-sided version of Lemma 18 in Devroye and Györfi (1985, Chapter 5), so an optimal nonnegative φ, denoted φ_α, is given in (19). This minimizer is not unique since, for any c > 1, the function cφ_α(ct) also minimizes the functional I_α. Note that φ_α satisfies the smoothness condition in Theorem 1 when α ≥ 1. Csörgǒ et al. (1985) obtain similar results for the weighted Hill estimator of ξ. Writing φ_u for the weight function that is identically 1 on [0,1], we can calculate I_1(φ_u)/I_1(φ_1) = 3^{4/3}/4 ≈ 1.0817, so, even in the best of circumstances, we should not expect dramatic improvements in mean squared error from using WCL. Davison and Smith (1990, Section 10) note a similarity between threshold selection, or nearly equivalently, the choice of j, and bandwidth selection in kernel density estimation. By allowing weights in the composite log-likelihood function, the similarity between fitting GPDs to exceedances and kernel density estimation is made even stronger. In particular, when considering the mean squared error of kernel density estimates for densities with two derivatives, asymptotic results that look quite a lot like (14)-(18) with α = 2 occur (Rosenblatt 1971), although without the log n terms in (14) and (17). If we know α and allow φ to take on negative values, we can make the limiting bias in (12) equal to 0 by choosing φ so that ∫_0^1 t^α φ(t) dt = 0; an analogous result holds for kernel density estimation (Jones and Signorini 1997). One would need to be cautious about using negative weights in practice, so I do not pursue this direction here.

Simulation study
There are many issues raised by the methodology in Section 2 that could profitably be studied through simulations. The results presented here are necessarily limited and focus on settings with positive shape parameters, which are often of greater interest in studies of extremes. Section 4.1 shows results for varying weight functions on point estimation of extreme quantiles. Section 4.2 considers the size and power of tests based on the deviance statistic (6), using χ² approximations in the unweighted case and a parametric bootstrap for weighted and unweighted composite likelihoods.

Point estimation
This section studies estimation of extreme quantiles for some heavy-tailed distributions using both WCL and the Hill-based kernel estimate. To motivate the forms for the quantile function used in the first simulation, consider the following extension of (10) for ξ ≠ 0, c_1, c_2 ≠ 0 and α > 0, given in (20). Under (20) (Drees 1998), the asymptotic biases of estimates of extreme quantiles based on fitting GPDs to the larger order statistics are determined by the function c_2 t^α (x^{α−ξ} − 1)/(α − ξ). Drees (1998) considers a more general condition than (20), but this more specific form suffices here. From Theorem 1 here and an analogous result for Hill-Weissman quantile estimates in Csörgǒ et al. (1985), it is plausible to conjecture that, under (20), φ_α is the asymptotically optimal nonnegative kernel for extreme quantile estimation. If we take the quantile function to be of the form (21), then (20) holds with R(t, x) identically 0, so this quantile function may be well suited for exploring whether such a conjecture might hold.
Based on 5000 simulations for each setting, Fig. 1 shows results for WCL estimates of the 1 − 1/n quantile for a range of j values under (21) when c_1 = c_2 = 1, ξ = 0.25, α = 0.5, 1 or 2 and n = 1600 or 6400. The four weight functions considered are the constant weight function φ_u and φ_α for α = 0.5, 1 and 2. In all six settings, all three weighted kernels have a modestly lower minimum mean squared error than φ_u: the ratios of the minimum mean squared error over j for φ_u to the minimum mean squared error over j and over the three weighted kernels are, for α = 0.5, 1 and 2, respectively, 1.026, 1.016 and 1.018 for n = 1600 and 1.036, 1.034 and 1.020 for n = 6400. Furthermore, for both sample sizes and all 3 values of α in (21), using the weight function φ_α that matches the α in (21) either minimizes the minimum mean squared error over the values of j considered or is within 0.2% of this minimum, which is less than the simulation standard error. One would expect that the value of j minimizing the mean squared error is larger for weight functions that decrease more rapidly, corresponding to smaller values of α. Indeed, for all six settings, the value of j minimizing the mean squared error increases as one goes from φ_u to φ_2, φ_1 and φ_0.5. It is also worth noting that these minimizing j values can be quite large, with j/n sometimes greater than 0.5 when n = 1600. These large optimal j values occur because the true quantile functions are similar to the quantile function of a GPD even at fairly low quantiles, whereas a distribution whose density goes to 0 at the lower end of its support could deviate substantially from a GPD at these moderate quantiles. Thus, asymptotic results based on (20) may often require quite large values of n to provide accurate approximations. Perhaps more important in practice than the modest decrease in mean squared errors for extreme quantiles when using WCL is the greater stability of the resulting estimates as j varies; Figure 2 illustrates this behavior.

Finally, let us briefly consider comparing estimates of extreme quantiles using WCL versus Hill-based estimates. The standard estimate of the 1 − p quantile for p small, when estimating ξ using a Hill-based estimate and the j largest order statistics, is the Weissman-Hill estimate X_(n−j)(j/(np))^ξ̂ (Weissman 1978; Caeiro et al. 2019). Whether or not weighting is used, Weissman-Hill estimates generally have much lower variability (unless the estimate of ξ has a large positive bias) but potentially much higher bias than WCL estimates. Thus, depending on the examples chosen, either approach can be vastly superior to the other. One might expect that if quantiles of the true distribution above a certain threshold match the quantiles of an ordinary Pareto distribution well, then the Hill-based quantile estimates will do well. For example, if X − μ follows a t distribution on ν degrees of freedom, then as p ↓ 0, the 1 − p quantile of X is of the form a p^{−1/ν} + μ + b p^{1/ν} + o(p^{1/ν}) for nonzero constants a and b. Since the Pareto distribution with (x_0, ξ) = (a, 1/ν) has 1 − p quantile equal to a p^{−1/ν}, we might expect the Weissman-Hill estimate of extreme quantiles to work well when μ = 0. Figure 3 shows results for Weissman-Hill and WCL estimates of the 1 − n^{−1} quantile based on n = 1600 observations from t distributions on 8 degrees of freedom with location parameter 0 or 1, constant or linear weight functions and a range of bandwidth parameters j from 60 to 560. The differences between the Weissman-Hill and WCL estimates are much larger than those between constant and linear weighting.
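For reference, here is a short R sketch of the classical (unweighted) Hill estimator and the corresponding Weissman-Hill quantile estimate X_(n−j)(j/(np))^ξ̂ quoted above; the weighted version in Csörgǒ et al. (1985) replaces the simple average by a weighted one and is not shown here.

```r
# Hill estimator of xi (> 0) from the j largest order statistics, and the
# Weissman-Hill estimate of the 1 - p quantile, X_(n-j) * (j / (n * p))^xi_hat.
weissman_hill <- function(x, j, p) {
  n  <- length(x)
  xs <- sort(x)
  xi_hat <- mean(log(xs[(n - j + 1):n]) - log(xs[n - j]))
  c(xi = xi_hat, quantile = xs[n - j] * (j / (n * p))^xi_hat)
}
# Toy example: an exact Pareto tail with xi = 0.25
weissman_hill(runif(1600)^(-0.25), j = 100, p = 1 / 1600)
```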
For μ = 0, as expected, the Weissman-Hill estimates have moderate median bias when j = 60, the smallest value shown. However, as j increases, the biases grow, and the estimates can then be negative with nontrivial probability, so these estimates no longer make sense. This limited simulation study suggests that, in addition to only applying when ξ > 0, Weissman-Hill quantile estimates are more sensitive to the choice of bandwidth j and the choice of weight function than WCL estimates and thus need to be used with caution.

Hypothesis testing
The simulations in this section consider testing hypotheses about the shape parameter and extreme quantiles. By construction, these results provide information on the coverage properties of confidence sets that would be obtained by inverting these tests. To examine possible benefits from using simulations to calibrate a test statistic even when a χ² approximation is valid, let us first assume the observations are i.i.d. random variables from a GPD and consider making inferences about ξ, so that h(σ, ξ) = ξ. In this case, the distribution of (6) under the null ξ = ξ_0 is invariant under linear transformations of the observations, whether or not weights are used. It thus suffices to simulate the distribution of (6) by repeatedly simulating X_1, …, X_n as i.i.d. random variables from a GPD(0, 1, ξ_0) and obtaining simulated deviances D_1(w), …, D_B(w), since varying the location or scale would not change the null distribution of the deviances. If the descending rank of D(w) among D_1(w), …, D_B(w) is r, then r/(B + 1) is an exact p-value for testing the null hypothesis. Thus, one might want to use this approach for assessing the p-value of a test for ξ = ξ_0 even in the unweighted case, for which there is a simple asymptotic approximation. For X_1, …, X_800 i.i.d. random variables from a GPD(0, 1, 0.25) and j = 25 or 50, I did this simulation B = 9999 times to obtain 9999 simulated values of the null distribution of the test statistic (6). Figure 4 plots the p-values based on the χ₁² approximation versus the simulated p-values, which just equal the ranks of the deviance statistic divided by 10,000. Only the range of simulated p-values between 0 and 0.1 is shown, since that is the range of most interest for inference. We see that for j = 25, the asymptotic distribution gives p-values that are substantially too small, so that a test using this approximation will reject true nulls too often. A smaller but still noticeable miscalibration in the same direction occurs when j = 50. Results for j = 100 and 200 (not shown) are well calibrated. This example shows that even in the simple setting of the observations following a GPD exactly and using an unweighted likelihood, there can be some advantage in using simulations to calibrate the deviance test statistic rather than relying on the χ₁² approximation for moderate values of j.

In practice, the GPD model is used when the conditional distribution of exceedances beyond X_(n−j) is only approximately a GPD. To examine this situation, suppose X_1, …, X_800 are i.i.d. random variables with the t_4 distribution, the t distribution on 4 degrees of freedom, in which case the shape parameter of the limiting GPD for exceedances is 0.25. For a range of values of j and for constant and linear weight functions, I conducted B = 9999 simulations under the GPD model for exceedances and set the rejection region for the test of the null at the nominal 0.05 level to values of the deviance statistic greater than the 9500th largest value of this statistic in each setting. I then conducted 10,000 simulations under the t_4 distribution and the same values of j and weight functions. For j = 25, 50, 100 and 200, the observed fractions of test statistics falling in the rejection region were, respectively, for constant weights, 0.0646, 0.0825, 0.1556 and 0.5213, and for a linear weight function, 0.0614, 0.0736, 0.1171 and 0.3219. As j increases, the GPD model for the exceedances becomes increasingly in error, and the probability of rejecting the null increases as expected.
The rejection probabilities are, for each value of j, smaller for the linear weight function than for the constant weight function, but this comparison is arguably unfair since, for a given j, estimates with constant weights should have lower variance in addition to their higher bias. To provide a fairer comparison, when the observations follow a t_4 distribution, for j = 25 and 50, I found values j′ such that the probability of rejecting the null ξ_0 = 0.25 when using the exceedances of X_(800−j′) and the linear weight function is as close as possible to the probability of rejecting the null when using constant weights and the exceedances of X_(800−j), so 0.0646 for j = 25 and 0.0825 for j = 50. Numerical experimentation yielded j′ = 33 when j = 25 and j′ = 64 when j = 50.
I then did 10,000 simulations from a range of distributions with densities of the form (22) on the set 1 + ξx² > 0, where c is a normalizing constant (Papastathopoulos and Tawn 2013); this family includes the t density when ξ = 1/ν, the standard normal density when ξ = 0 (by continuity) and a linear transformation of symmetric beta densities when ξ < 0. For every real ξ, for a random variable with density (22), the limiting GPD for exceedances has shape parameter ξ. Figure 5 shows the corresponding power functions for these two weight functions and j values. Note that for ξ slightly larger than 0, the probability of rejecting the null is actually smaller than the probability of rejection when ξ = 0 for both weight functions. Overall, there is not much to choose between the two weight functions, with the linear weight function being modestly superior for ξ < 0 and the constant weight function for ξ > 0.
When using this approach to test hypotheses about extreme quantiles, the distribution of the deviance statistic (6) now depends on the true values of all of the parameters, so the parametric bootstrap will not give exact inferences even if the truth is exactly a GPD. To investigate this problem, I repeated the following procedure 4000 times for q = 0.99 and q = 0.999: simulate X_1, …, X_800 i.i.d. from a GPD(0, 1, 0.25) and compute the deviance statistic for testing that the q-quantile is equal to its true value Q_0(q), for constant and linear weight functions. As before, it would not be fair to use the same value of j for both weight functions, so I used j = 50 for the constant weight function and j′ = 65 for the linear weight function, where j′ = 65 gave the best match to the interquartile range of ξ̂ for constant weights with j = 50. For both weight functions and for each of the 4000 simulations, I carried out 199 additional simulations of j (or j′) exceedances beyond X_(n−j) distributed i.i.d. from a GPD(0, σ̂_0, ξ̂_0). Keeping in mind that we are acting as if X_(n−j) equals the true (n−j)/(n+1) quantile, for each of these 199 simulations, I computed the deviance statistic for testing the null hypothesis Q(q) = Q_0(q) and calculated the rank of the observed deviance relative to these 199 simulated values. If the test were perfectly calibrated, the 200 possible ranks for the observed deviances would all occur with probability 0.005, and the descending rank divided by 200 would be a valid p-value. The results of this simulation are displayed in Fig. 6 where, for ease of comparison, the p-values when using equal weights and a χ₁² approximation are put in bins of width 0.005 to match the resolution of the simulation-based p-values. For the 0.99 quantile, we see that all three procedures have the lowest ranks appearing too frequently, so that the null would be rejected too often. The χ₁² approximation with equal weights yields the worst results and the test using the linear weight function and bootstrap calibration the best. In particular, at the nominal 0.05 level, the estimated probabilities of rejecting the null are 0.07000, 0.08625 and 0.09675 for the linear weight function with bootstrap calibration, equal weights with bootstrap calibration and equal weights with a χ₁² approximation, respectively. For the 0.999 quantile, all 3 tests are better calibrated, although equal weights and a χ₁² approximation still yield p-values less than 0.005 too often. The worse performance of all 3 tests for q = 0.99 may arise from ignoring the variability of X_(n−j) as an estimate of the (n−j)/(n+1) quantile, which should matter more the closer q is to (n−j)/(n+1).

Observations from a stationary process
When observations are identically distributed but not independent and we fit a correct parametric model to their marginal distribution acting as if they are independent, then the gradient of the resulting log-likelihood function still yields unbiased estimating equations. This property does not generally carry over to the WCLs in (1) and (2). However, we might hope that if the sample size is sufficiently large and the observations are not too dependent, the resulting biases might be small. For a stationary time series, declustering methods (Ferro and Segers 2003) can reduce the impact of this dependence, but they can have their own problems with bias when the goal is estimation of marginal quantiles (Fawcett and Walshaw 2007) and there are other approaches for addressing the impact of serial dependence on uncertainty quantification for extreme quantiles (Fawcett and Walshaw 2012). Of course, if the dependence of extremes is of specific interest, then an appropriate data analysis will have to extend beyond estimating marginal quantiles.
To examine the impact of dependence on a WCL that ignores this dependence, I simulate from a stationary time series model with extremal dependence in which the marginal distributions are GPD(0, 0.25, 0.25). Since the true marginal distribution is exactly GPD, any bias in estimates of ξ when using constant weights should be a small sample bias and not attributable to any misspecification in the marginal distribution of the observations. To produce models with extremal dependence and known marginal distributions, I start with a stationary autoregressive process of order 1 with Cauchy innovations and then transform the observations to have the desired GPD marginals. For Z_1, Z_2, … i.i.d. standard Cauchy random variables, set X_1 = Z_1 and X_t = ρX_{t−1} + (1 − ρ)Z_t for t > 1, for which all marginal distributions are also standard Cauchy. Then take the appropriate monotonically increasing transformation of each X_t to obtain the stationary series with GPD(0, 0.25, 0.25) marginals. The upper tail dependence index (Davis et al. 2013) for transformed observations k steps apart is ρ^k, the same as for the Cauchy series. Figure 7 shows results for simulated estimates of ξ for time series of length 1600, values of ρ of 0.5 and 0.75, and using φ_u and the linear weight function φ_1. It is not appropriate to compare φ_u and φ_1 for the same values of j, since φ_u should have lower variability for any fixed j. If observations are i.i.d., then the weighted Hill estimator in Csörgǒ et al. (1985) has asymptotic variance proportional to K(φ, j) = j^{−1} ∫_0^1 φ²(t) dt, suggesting that one pick pairs of j values, j_u and j_1, such that K(φ_u, j_u) = K(φ_1, j_1), which yields j_1 = 4j_u/3. Figure 7 shows that the resulting variabilities of the estimates using the two weight functions are very similar, so it is reasonable to compare their biases. In all cases, the estimators' medians are less than the true value of 0.25. For both weight functions, as we might expect, biases are larger when the bandwidth is smaller and ρ is larger. The biases are always slightly larger for φ_1 than for φ_u, but the differences are quite a bit smaller than the biases using φ_u. Thus, in this setting where the marginal distribution of the observations is actually GPD, the relative increase in bias from using the linear weight function is small even with quite strong extremal dependence between neighboring observations.
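A short R sketch of this simulation design, assuming the monotone transformation is the usual probability integral transform from the standard Cauchy scale to the GPD(0, 0.25, 0.25) scale (the paper does not spell out the transform, so that detail is an assumption):

```r
# GPD(0, sigma, xi) quantile function (xi != 0)
qgpd <- function(u, sigma, xi) sigma * ((1 - u)^(-xi) - 1) / xi

# Stationary series with GPD(0, 0.25, 0.25) marginals and extremal dependence:
# a Cauchy AR(1), X_t = rho * X_{t-1} + (1 - rho) * Z_t, keeps standard Cauchy
# marginals (independent Cauchy scales add), then transform marginally to GPD.
sim_gpd_ar1 <- function(n, rho, sigma = 0.25, xi = 0.25) {
  z <- rcauchy(n)
  x <- numeric(n)
  x[1] <- z[1]
  for (t in 2:n) x[t] <- rho * x[t - 1] + (1 - rho) * z[t]
  qgpd(pcauchy(x), sigma, xi)               # probability integral transform
}
series <- sim_gpd_ar1(1600, rho = 0.75)     # then fit the WCL ignoring dependence
```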

Non-identically distributed observations
There is a substantial literature on extremes for independent but not identically distributed random variables (Einmahl et al. 2016;de Haan and Zhou 2021), generally based on assuming that the upper quantiles of a sequence of random variables can be approximated by a GPD with one or more parameters changing smoothly with the index of the series. It is not obvious how to extend these approaches to allow for a WCL since the order statistics are only a Markov chain when the observations are i.i.d. This section suggests some WCLs that are valid for independent observations with varying distributions and examines them in a simulation study based on a realistic model for daily temperature in Minneapolis.
Order statistics for random variables that are independent but not identically distributed are considered in David and Nagaraja (2003), Bon and Păltănea (2006), and Balakrishnan and Zhao (2013), but it is not apparent how to directly use the results in these works to obtain a useful composite log-likelihood function. The exact joint distribution of the order statistics for X_1, …, X_n independent but not identically distributed requires summing over all permutations of the indices 1, …, n (David and Nagaraja 2003, Section 5.2). However, even if this sum were feasible to compute, it would not generally be a good basis for inference because the order statistics are no longer a sufficient statistic when the observations are not identically distributed.
One possible basis for a WCL in this setting is the conditional likelihood function of X_(k) given the values and indices of the first k − 1 order statistics and the index for X_(k). Let us assume no ties to avoid complications. Write i(1), …, i(n) for the indices of the ordered observations, so that, for example, X_{i(1)} = X_(1), and define the set I(k) = {i(1), …, i(k)}. Additionally, denote the density and survival functions for X_i by f_{θ,i} and S_{θ,i}, respectively. One will need the dependence of θ on i to have some structure, such as changing smoothly in i as in Einmahl et al. (2016) and de Haan and Zhou (2021). Then the conditional density of X_(k) given X_(1), …, X_(k−1) and I(k) is given by (23). Alternatively, at greater computational effort, we can calculate the conditional likelihood function of X_(k) given X_(1), …, X_(k−1) and I(k − 1), given by (24). Both of these formulas make sense for k = 1 as long as we define X_(0) = −∞ and I(0) as the null set. In addition, they both reduce to the conditional density of X_(k) given X_(k−1) when the observations are i.i.d. Weighted sums over k of the logarithms of either (23) or (24) are possible WCLs. For making inferences about extreme quantiles, the idea is to apply such a WCL just to exceedances over a threshold that can vary with the index i of the series. For both the simulated temperature data in this section and the precipitation data analysis in Section 7, these thresholds and the dependence of θ on the time index i will be linear functions of a periodic cubic spline basis with a period of one year.
Unlike the case for identically distributed observations, neither (23) nor (24) gives the exact log-likelihood function when the weights are all 1, so there is the potential for loss of information when using such WCLs. Indeed, if the distributions of the observations are very different, then conditioning on I(k) could lose a lot of information about θ. For example, suppose our model for the independent observations X_1 and X_2 is that X_i has a uniform distribution on (θ_i, θ_i + 1), where (θ_1, θ_2) equals either (0, 2) or (2, 0). Then the actual observations X_1 and X_2 would tell us for sure which of the two possibilities for (θ_1, θ_2) was correct, whereas both (23) and (24) are identically 1 for k = 1 or 2 and thus provide no information about the parameters. Of course, this example is extreme, but it makes the point that if the correct marginal distributions of the observations are very different, conditioning on the ranks of the observations can lead to a large loss of information.
One might hope that, because it conditions only on I(k − 1) rather than I(k), WCLs based on (24) will sometimes yield more statistically efficient estimates than when using (23). Since both (23) and (24) reduce to the conditional density of X (k) given X (k − 1) when the observations are identically distributed, another approach to reduce this information loss is to preprocess the observations so that they are nearly identically distributed. Here is one possible way this preprocessing could be done.
For some small ε > 0, choose some threshold probability 1 − ε and fit a quantile regression at this level to the observations using suitably chosen covariates that can explain the nonstationarity. Denote the fitted threshold for observation i by t_i. Now fit a second quantile regression at level 1 − ε/2 using the same covariates as the previous regression. Write r_i for the difference between the fitted values at time index i for the 1 − ε/2 and 1 − ε quantile regressions. Then write Y_i = (X_i − t_i)/r_i for the standardized exceedances and fit a GPD with possibly varying scale and shape parameters using a weighted composite log-likelihood function.

When the number of observations above a designated (possibly varying) threshold is large, calculating WCLs using even the simpler (23) could be onerous. We can reduce the computations substantially by breaking up the data into L groups and only considering the ranks within each group. For example, in the analysis of daily precipitation data presented in the next section, I will split up the observations by months, which, in addition to reducing the computations by roughly a factor of 12, may help to reduce any information loss due to using order statistics of observations that are not identically distributed. More explicitly, define Γ_ℓ as the set of indices k that are in group ℓ and for which the normalized Y_k is positive. Write n_ℓ for the cardinality of this set. Let Y_(1,ℓ) < ⋯ < Y_(n_ℓ,ℓ) be the order statistics among the Y_k's with k ∈ Γ_ℓ and define i_ℓ(k) and I_ℓ(k) as before, except for just those observations in group ℓ. Then using, for example, (23), we obtain the WCL (25).

Computing the WCL even in (25) can be quite slow in settings that would be common with climatological data, for which one might have decades of daily data. Thus, it is a challenge to fully evaluate the efficacy of this approach through a simulation study. Here, I present results from a limited study based on simulations of length 20 years using a fitted model for average daily temperature in Minneapolis. The model is taken from Krock et al. (2022) and is based on a 7-parameter family of univariate distributions that allows flexible behavior of extreme quantiles in both tails (Stein 2021). The model includes what are effectively separate scale and location parameters for each tail, and these 4 parameters are allowed to vary seasonally and across years to account for climate change. The other 3 parameters determine the shapes of the tails and do not vary. These fitted seasonally varying distributions have normalized exceedance distributions that converge to GPDs with shape parameter -0.220 for the lower tails and -0.251 for the upper tails.
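Referring back to the standardization recipe at the start of this section, here is a minimal R sketch of that step using quantreg::rq; a simple harmonic seasonal basis stands in for the periodic cubic spline basis used in the paper, and the data objects in the usage comment are hypothetical.

```r
library(quantreg)

# Standardize exceedances so they are closer to identically distributed:
# threshold t_i from a quantile regression at level 1 - eps, scale r_i from the
# difference between the 1 - eps/2 and 1 - eps fitted quantiles.
standardize_exceedances <- function(x, doy, eps = 0.1) {
  basis <- data.frame(s1 = sin(2 * pi * doy / 365.25), c1 = cos(2 * pi * doy / 365.25),
                      s2 = sin(4 * pi * doy / 365.25), c2 = cos(4 * pi * doy / 365.25))
  dat    <- cbind(x = x, basis)
  fit_lo <- rq(x ~ ., tau = 1 - eps,     data = dat)
  fit_hi <- rq(x ~ ., tau = 1 - eps / 2, data = dat)
  t_i <- fitted(fit_lo)
  r_i <- fitted(fit_hi) - fitted(fit_lo)
  out <- data.frame(y = (x - t_i) / r_i, threshold = t_i, scale = r_i)
  out[x > t_i, ]                           # keep only the standardized exceedances
}
# Usage (hypothetical vectors): std <- standardize_exceedances(temp, doy, eps = 0.1)
```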
To simplify matters somewhat, here I leave out the effects of climate change and let all 20 years of the simulated data follow the 2020 seasonal pattern from the fitted model in Krock et al. (2022). The top plot in Figure 8 shows the resulting densities for temperature on 4 different days of the year. We see large seasonal effects in the means, spreads and shapes of the distribution, including a rather noticeable shoulder in the density for January 1, a feature that is also clearly visible in the raw data (Krock et al. 2022). The bottom plot shows some quantiles for this fitted model as a function of day of year.
Using this model, I ran 1000 simulations of length 20 years, assuming independence across days. Of course, the true temperature process has substantial dependence between neighboring days, but even assuming independence leaves a challenging statistical problem due to the complex seasonal effects. For definiteness, I will describe the models and estimation procedures just for the upper tails. In all cases in which a quantity varies seasonally, I used linear models with a periodic cubic spline basis having 8 knots and period 365.25 days, which matches the form of the seasonally varying parameters used in Krock et al. (2022). Values of ε considered were 0.1 × (4/3)^m for m = −2, …, 5, corresponding to a range of values from 0.056 to 0.421. Preliminary simulations showed that values outside this range gave inferior results in both tails for all procedures considered. The time needed to compute fits based on (25) necessitated using a small set of ε values. For all procedures, first estimate the 1 − ε quantile as a function of time of year using quantile regression. Seasonally varying GPDs were then fit to exceedances beyond the fitted 1 − ε regression quantile, where the shape parameter is assumed constant and the logarithm of the scale parameter is linear in the periodic spline basis. The unweighted estimates of these parameters were computed using the fevd program in the R package extRemes (Gilleland 2020b). For the weighted procedure, the exceedances were first rescaled using the difference in regression quantile estimates at the 1 − ε/2 and 1 − ε levels, as described earlier in this section. Then the shape and scale parameters were estimated by maximizing (25) with the standardized observations broken into L = 12 groups by month and using the linear weight function (α = 1). The top plot in Fig. 9 shows mean squared errors for estimates of the 0.001 and 0.999 quantiles as a function of day of year for the best (in overall mean squared error) unweighted procedure over the values of ε considered, which was ε = 0.133 for both quantiles. We see strong seasonal variation in the quality of estimation, with peaks around days 80 and 325, which correspond to times of year when the temperature quantiles are changing the fastest (see bottom plot of Figure 8). The bottom plot shows the ratio of mean squared errors for the two best weighted estimates (ε values of 0.237 and 0.316 in both tails) to these best unweighted estimates. No procedure uniformly dominates any other, with, in both tails, the weighted estimates performing noticeably better during the winter months but somewhat worse during much of the rest of the year. In terms of overall mean squared error across the year, the best unweighted procedure is about 2% better than the best weighted procedure in both tails. To see if the grouping of observations into months degraded the performance of the weighted procedure, I redid the simulation for ε = 0.237 without the grouping, but the bottom plot in Fig. 9 shows the results hardly change. Considering further values of ε might affect the results modestly, but would be unlikely to alter the overall slight superiority of the unweighted procedure.
Even under well-controlled circumstances, the advantage of weighted procedures after optimizing over the tuning parameters was shown to be modest in the simulations in Section 4.1. Thus, it should not be too surprising that unweighted procedures can sometimes be slightly better when ε is chosen optimally. Figure 10 shows estimates of the shape parameter in each tail for a single simulation and a range of ε values. We see that, as for i.i.d. observations (compare to Fig. 2), these estimates vary much less erratically when using the weighted procedure. Thus, one might hope that any minor loss of efficiency that can occur in some instances when weighting will be compensated for by having estimates of parameters and quantiles that are less sensitive to the choice of threshold.

Application to daily rainfall data
This section considers estimating seasonally varying upper quantiles of daily precipitation (measured to the nearest 0.01 inches) at the Central Park weather station in New York City for the period January 1, 1869 to March 1, 2021, obtained from the National Centers for Environmental Information website (Menne et al. 2012). Figure 11 shows quantiles for these data by month. The months August-October have the smallest 0.75 quantiles and the largest 0.999 quantiles, so there is a clear seasonality in the data that cannot be explained by a simple scaling factor. Figure 12 suggests that there is some non-stationarity in the process across years, with the total annual precipitation taking on consistently low values from the late 1940s to the late 1960s and the last 50 years containing the 9 highest annual precipitation values out of the 152 complete years of data available. High levels of aerosol pollution could account for the reduced precipitation in the period from the late 1940s to the late 1960s (Levin and Cotton 2008). The annual maximum of daily precipitation (right panel of Fig. 12) does not show obvious non-stationarity and, to avoid complications, I will assume no long-term trend in upper quantiles across years.
To look at temporal dependence among extreme precipitation, it is useful to first remove seasonality so that seasonal patterns do not influence measures of dependence such as autocorrelations. With ε = 0.1, I standardized the daily precipitation using the quantile regression procedure described in the previous section and write S_j for the standardized precipitation on day j. The covariates in the quantile regressions form a periodic cubic spline basis with a period of one year and 6 evenly spaced knots. Figure 13 shows autocorrelation functions for two forms of time series. The left plot of the figure gives autocorrelation functions for the indicators that S_j > c for c = 0, 2, 4 and 6. The numbers of exceedances for these four values of c are 5554, 1487, 472 and 167, respectively, out of the 55,577 total number of days. If there were extremal dependence in the precipitation amounts at some lag, then the autocorrelation at that lag should tend to a positive value as c increases. The left plot shows some evidence of quite weak extremal dependence at lag 1 and no evidence of extremal dependence at longer lags, with the relevant autocorrelations all below 0.03, so there is hardly any dependence in precipitation amounts once one adjusts for seasonality. This result suggests, and other results confirm, that much of the dependence in the deseasonalized precipitation is related to dependence in the occurrences of precipitation rather than their amounts. Overall, it appears reasonable to ignore dependence when fitting GPD models to the seasonally varying upper quantiles of the precipitation distribution.
To provide a fair comparison between maximum likelihood estimates and maximum WCL estimates using (23), I split the data roughly in half into training and testing data in alternating periods of 4 years. Thus, the training data consist of 1869-1872, 1877-1880, ..., 2013-2016, and all other years form the testing data, including the partial data for 2021. Spreading the years in the training data across the time period was done to reduce any influence of a changing climate on the results, and four-year blocks were selected to reduce statistical dependence between the training and test data. In all quantile regressions and linear models for the logarithm of the scale parameter of the GPD, I used a periodic cubic spline basis with one-year period and 6 evenly spaced knots. For a range of quantiles 1 − θ, using the quantreg package in R (Koenker 2021), I then estimated threshold functions via quantile regression, yielding a fitted threshold function t_k as a function of day k. Because of the considerable length of this data set, I used the more accurate value of 365.242 days per year rather than 365.25, although it makes almost no difference in the results. For the ordinary (unweighted) likelihood, I then fit a GPD to the exceedances with a scale function ψ̂_k whose logarithm is linear in the spline basis functions using the R package extRemes (Gilleland 2020b). For the WCLs only, to make the exceedances more nearly identically distributed, I first standardized the precipitation using the same approach as described in the previous paragraph with θ = 0.1, except now based only on the training data. Denote the difference between the estimated 1 − θ/2 and 1 − θ quantiles on day k by r_k. Using (25), I then fit GPDs to the standardized exceedances over these seasonally varying thresholds, again using a linear model for the logarithm of the scale function ψ̂_k. Optimization was carried out using the nlm command in R, with starting values given by the ordinary likelihood estimates for the given value of θ. I considered allowing the shape parameter to vary seasonally by allowing a separate value of this parameter for the months of August-October, but preliminary analyses indicated that including such a factor did little to improve the fit, so I chose to treat the shape parameter as a constant.
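For concreteness, the unweighted part of this pipeline might look roughly as follows in R. This is a sketch only: a simple harmonic basis stands in for the periodic cubic spline basis with 6 knots, and the data frame df, with columns day (day of year) and prcp (daily precipitation in inches), is a hypothetical placeholder for the training data.

library(quantreg)
library(extRemes)

theta <- 0.04                                   # threshold exceedance probability
df$s1 <- sin(2 * pi * df$day / 365.242)         # crude periodic basis, standing in for
df$c1 <- cos(2 * pi * df$day / 365.242)         # the periodic cubic spline basis

## Seasonally varying threshold via quantile regression at level 1 - theta
qr_fit <- rq(prcp ~ s1 + c1, tau = 1 - theta, data = df)
df$thresh <- predict(qr_fit)

## Unweighted fit: GPD for exceedances of the fitted threshold,
## with log(scale) linear in the basis functions
gpd_fit <- fevd(x = prcp, data = df, threshold = df$thresh,
                scale.fun = ~ s1 + c1, use.phi = TRUE, type = "GP")
gpd_fit

The WCL fit replaces the ordinary GPD log-likelihood with the weighted composite version and, as described above, is optimized with nlm starting from the unweighted estimates.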
Although the distribution of extreme precipitation clearly varies seasonally, it may be more relevant to assess the chances of precipitation events of a given size regardless of season. Thus, to evaluate these various fitted models, I considered how well they capture precipitation distributions above thresholds of either 2 or 3 inches. Precipitation above 2 inches occurs 348 times (0.63% of days) and precipitation greater than 3 inches 97 times (0.17% of days) over the entire dataset.
Following Stein (2021), the criterion function I used is the log-likelihood of the testing data censored from below at a cutoff c of either 2 or 3 inches, ignoring any possible dependence in the observations. Now using k to indicate a day in the testing set and writing p_k for the corresponding observed precipitation, this criterion function can be written as

∑_{k: p_k ≤ c} log{1 − θ(1 + ξ̂(c − t_k)/(ψ̂_k r_k))_+^{−1/ξ̂}} + ∑_{k: p_k > c} {log θ − log(ψ̂_k r_k) − (1 + 1/ξ̂) log(1 + ξ̂(p_k − t_k)/(ψ̂_k r_k))},

where r_k is identically 1 for the maximum likelihood estimate. This expression assumes that t_k < c for all k, which is true in all cases shown here. Note that ψ̂_k r_k is the estimated scale parameter of the GPD on day k for daily precipitation.
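A sketch of this criterion in R, consistent with the expression above; all names are hypothetical, and a nonzero shape estimate is assumed.

## p_k: test-day precipitation; t_k: fitted thresholds; psi_k: fitted scale on the
## standardized scale; r_k: quantile-difference scaling (identically 1 for the
## unweighted fit); xi, theta: estimated shape and threshold exceedance probability.
censored_loglik <- function(p_k, t_k, psi_k, r_k, xi, theta, cutoff) {
  stopifnot(all(t_k < cutoff))                 # the assumption noted in the text
  scale_k <- psi_k * r_k                       # GPD scale on the precipitation scale
  surv_at <- function(x)                       # fitted P(precipitation on day k > x), x > t_k
    theta * pmax(1 + xi * (x - t_k) / scale_k, 0)^(-1 / xi)
  below <- p_k <= cutoff
  ll_censored <- log(1 - surv_at(cutoff))      # contribution of days at or below the cutoff
  ll_exceed <- log(theta) - log(scale_k) -
    (1 / xi + 1) * log(1 + xi * (p_k - t_k) / scale_k)
  sum(ll_censored[below]) + sum(ll_exceed[!below])
}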
These criterion functions are plotted in Fig. 14. For the 2 inch cutoff, the best maximum likelihood estimate beats the best WCL estimate by about 1 log-likelihood unit. However, the performance of the maximum likelihood estimate varies quite irregularly in θ, so this modestly superior performance could easily be due to luck. In practice, it may be desirable to have a procedure whose results are fairly insensitive to modest changes in the difficult-to-specify quantity θ and, in this regard, the weighted procedure clearly dominates. For the more extreme and, thus, perhaps more relevant cutoff of 3 inches, the best WCL estimate is marginally better than the best ordinary likelihood estimate. Furthermore, with the 3 inch cutoff, the ordinary likelihood estimate performs substantially worse when 1 − θ is increased from its optimal value, 0.96, to the next larger value considered, 0.97.

Fig. 14 Cross-validated log-likelihoods for New York City daily precipitation censored at 2 and 3 inches, based on a seasonally varying threshold fit via quantile regression at a range of quantiles and a seasonally varying GPD for exceedances over thresholds, using maximum likelihood (•) and maximum WCL (+) with a linear weight function

Figure 15 gives more detailed results for the number, by month, of observed and expected exceedances of 2 and 3 inches of daily precipitation. The year 2021 is left out of the calculations so that the training and testing data have the same number of days, although, in fact, there were no observed exceedances of 2 inches for the partial year 2021. Based on their good performance in Fig. 14, this figure shows results for θ = 0.04 for the unweighted procedure and θ = 0.11 for the linear weighted procedure. For both cutoffs, the expected seasonal patterns are very similar for the two estimates, with differences that are much smaller than the differences between the observed outcomes in the training and testing periods. If we go to even higher cutoffs, the observed exceedances are too rare to make a monthly comparison informative. Summing over all months, the unweighted and weighted procedures estimate the expected number of exceedances of 4 inches over the training (or testing) period as 15.16 and 15.54, respectively, whereas the observed number of exceedances was 18 for the training period and 12 for the testing period. The corresponding expected numbers of exceedances of 5 inches are 5.15 and 5.27 for the unweighted and weighted procedures, respectively; there are 4 exceedances of 5 inches in both the training and testing periods. Overall, both procedures give very good agreement with the observed record in both the training and testing periods.

Fig. 15 Expected and observed number of exceedances by month of daily precipitation greater than 2 inches (black symbols) and greater than 3 inches (gray symbols). Observed exceedances given by ◻ for the training period and × for the test period. Expected exceedances for ordinary maximum likelihood with θ = 0.04 given by • and for WCL with linear weight function and θ = 0.11 given by +

This kind of cross-validation could be used to select θ, although there are many issues that would need to be addressed before such an approach could be used routinely in practice. One simple approach to choosing the threshold quantile 1 − θ when ξ is constant is to plot estimates of ξ as a function of this quantile and then select the threshold at a point where the estimated shape parameter shows no obvious systematic variation for smaller θ (Coles 2001, p. 83). Figure 16 gives such a plot for the two procedures.
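A plot like Fig. 16 can be produced by refitting the model over a grid of threshold quantiles and recording the estimated shape. Below is a rough sketch for the unweighted fit only, reusing the hypothetical df, s1 and c1 objects from the earlier sketch; the slot holding the shape estimate is an assumption and may differ across extRemes versions.

thetas <- seq(0.02, 0.12, by = 0.01)
shape_hat <- sapply(thetas, function(th) {
  thr <- predict(rq(prcp ~ s1 + c1, tau = 1 - th, data = df))   # threshold at quantile 1 - th
  fit <- fevd(x = prcp, data = df, threshold = thr,
              scale.fun = ~ s1 + c1, use.phi = TRUE, type = "GP")
  fit$results$par["shape"]                                      # assumed location of the shape MLE
})
plot(1 - thetas, shape_hat, type = "b",
     xlab = "threshold quantile 1 - theta", ylab = "estimated shape")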
For both procedures, it appears that the estimated shape is decreasing across the entire range of θ values, which would normally indicate that θ should be taken even smaller than 0.02, the minimum value considered. However, the cross-validation results indicate that such small values of θ are not the best for fitting the upper tail of the rainfall distribution. We do see that the estimates of shape using the WCL show less local variation than those using the ordinary log-likelihood. For example, while the overall slope of the estimates for the ordinary log-likelihood is steeper than for the WCL, the WCL estimates are monotonically decreasing in 1 − θ, whereas the estimates based on the ordinary log-likelihood are not. Thus, it may be easier to spot systematic trends in shape estimates as the threshold varies when using WCL, although this example indicates that one might be better off deliberately selecting a threshold at which the estimated shape is still clearly systematically changing.

Fig. 16 Estimated shape parameters as a function of the threshold quantile 1 − θ for the ordinary log-likelihood (•) and WCL with linear weight function (+). Value for WCL and 1 − θ = 0.98 not shown because of convergence issues in fitting the model

Discussion
The various WCLs described herein can be used in settings other than when assuming a GPD model for the upper tail of the distribution. There are a number of approaches to modeling distribution functions of i.i.d. observations when the main interest is in the upper tail, including mixture modeling (Scarrott and MacDonald 2012; Scarrott 2016), parametric extensions of the GPD (Papastathopoulos and Tawn 2013; Naveau et al. 2016; Stein 2020, 2021) and semiparametric approaches (Huang et al. 2018). In this setting, one would presumably consider choosing j in (2) much larger than when fitting a GPD, perhaps even taking j = n, but still taking w_k small when k is a substantial fraction of n to limit the influence of the smaller observations. For temperature, one may sometimes be interested in both the upper and lower tails of the distribution, in which case the models in Stein (2020, 2021) may be appropriate. In this case, taking n odd for convenience, one might consider a WCL that uses the marginal log-likelihood function of the median and then includes two sums like the sum in (2), one for the order statistics increasing from the median and one for those decreasing from the median. Further afield, this WCL could be used to downweight the more extreme rather than the less extreme observations, giving estimates of characteristics of distributions that are robust to outliers. It could be interesting to compare the resulting estimates to traditional robust procedures (Huber 2004).
In terms of the form of the weight function, the linear weight function works fairly well in all circumstances considered here, and results are similar for the other weight functions considered. Some specific issues needing attention include the preprocessing step when observations are not identically distributed and methods for selecting thresholds. The weighted approach can substantially reduce random fluctuations in estimated parameters and extreme quantiles as a function of the threshold. This reduction in fluctuations might be expected to yield improved extreme quantile estimation when thresholds are chosen in a data-driven manner, but since threshold selection is often done based on visual inspection of graphical diagnostics, it is not clear how one might definitively reach such a finding.

Proof of Theorem 1
The proof depends strongly on Theorem 2.1 in Drees (1998). Write ⌊·⌋ for the greatest integer function and define Q_n(t) = X_(n − ⌊j_n t⌋). Thus, writing k_h for k − 1/2, we have X_(n − k + 1) = Q_n(k_h/j_n) for 1 ⩽ k ⩽ j_n and X_(n − j_n) = Q_n(1). Next, let R_1, R_2, … be a sequence of independent standard exponential random variables, S_j = R_1 + ⋯ + R_j and Q̃_n(t) = Q(1 − S_(⌊j_n t⌋ + 1)/n). Then Q̃_n(·) on [0, 1] has the same asymptotic distributional behavior as Q_n(·), so that to prove Theorem 1, it suffices to prove the corresponding result with Q_n replaced by Q̃_n. Since U_(n + 1, n) = V_n(1) + o(1) almost surely, by (38), the quantity of interest equals ∫_0^1 w(t)V_n(t) dt − w(1)V_n(1) + o_p(1). Since V_n is a Gaussian process whose mean and covariance structure do not depend on n, Theorem 1 follows by computing the mean and variance of ∫_0^1 w(t)V_1(t) dt − w(1)V_1(1), both of which follow from straightforward calculations.