Record Tests to Detect Non-Stationarity in the Tails, with an Application to Climate Change

The analysis of non-stationary behaviour and trends in the extremes of a series is an important problem in global warming. This work develops statistical tools to analyse that behaviour, using the properties of the occurrence of records in i.i.d. series. The main difficulty of this problem is the scarcity of information in the tails, so it is important to extract all the possible evidence from the available data. To that end, first, different statistics based on upper records are proposed and the most powerful one is selected. Then, using that statistic, several approaches to join the information of four types of records (upper and lower records of the forward and backward series) are suggested. It is found that these joint tests are clearly more powerful. The suggested tests are specifically useful in the analysis of the effect of global warming on the extremes, for example of daily temperature. They have a high power to detect weak trends and, being non-parametric, they can be widely applied. The proposed statistics join the information of M independent series, which is useful given the necessary split of the series to arrange the data. This arrangement solves usual problems of climate series (seasonality and serial correlation) and provides more series from which to find evidence. These tools are used to analyse the effect of global warming on the extremes of daily temperature in Madrid.


Introduction
Clear evidence of global warming has been found in many areas of the planet. Concerning temperature, there is no question that Earth's average is increasing (Sánchez-Lugo et al. 2019), and there is a general consensus on the existence of an increasing trend in its mean evolution. Most works on climate change focus on the analysis of mean values; however, other relevant aspects are changes in variability and in the tails of the distributions. Many works show the interest of analysing whether the occurrence of extremes is affected by climate change (Benestad 2004; Xu and Wu 2019; Saddique et al. 2020). Moreover, consequences of global warming on human health, agriculture, and other fields are often related to the occurrence of more, and more intense, extremes (Coumou and Rahmstorf 2012).
Although of great interest, the analysis of non-stationary behaviour in extremes is acknowledged to be difficult due to the scarcity of data. It is also difficult to link it to the mean evolution of temperature since, given the small magnitude of this trend relative to the variability of daily temperatures, its effect on the extremes is not evident. Even assuming that global warming affects the occurrence of extremes, there may be considerable climate variability in different areas of the planet, and more research on this topic is needed. Conducting this type of study would be eased by the existence of simple statistical tools, in the same way that studies of the mean temperature have been favoured by the availability of simple non-parametric tests, such as the Mann-Kendall (MK) test (Kendall and Gibbons 1990).
Climate models that do not adequately represent non-stationary behaviour in the extremes can yield important biases in the results, especially in extreme value statistics such as return values or return periods. For example, ensembles of climate-model simulations are often useless since the frequency of extremes is too low to be well sampled by the ensemble (Durran 2020). Statistical models that represent the whole distribution of temperature also tend to fit the tails of the distributions badly. The validation of climate models must include the analysis of their capability to properly reproduce the most extreme values, but specific statistical tests for this purpose are not available.
In this context, there is a clear interest in developing statistical tools to analyse non-stationary behaviour in the extremes. The annual maxima or the exceedances over a threshold are the traditional approaches to study extremes in environmental sciences, and they are still commonly used (Prosdocimi and Kjeldsen 2020). However, in this work, we propose a different approach based on the analysis of records. This approach has some important advantages due to the probabilistic properties of records; in particular, the fact that the distribution of the occurrence of records in an independent identically distributed (i.i.d.) series (X_t) does not depend on the distribution of X_t. This property makes easier the development of distribution-free statistics, the use of Monte Carlo methods to implement inference tools, and the joining of the information of the records of different series. Another advantage of using records is that they do not require the information of the whole series; incomplete series are common, for example, in sports or in old climate records. Coumou et al. (2013) underline the interest of this type of analysis, and the importance of quantifying how the number of temperature records is changing world-wide and establishing its relationship with global warming. Different approaches have been used to study the occurrence of temperature records. Redner and Petersen (2007) compared the observed values of records with the expected values under a stationary climate in a descriptive way, using simulations with given distributions. Benestad (2004) compared the observed and the expected number of records under stationarity in a more formal way, using a χ² test and graphical tools. Another common approach is to assume a Gaussian distribution of the temperature. For example, Newman et al. (2010) used simulations to determine the influence of trends and long-term correlations on record-breaking statistics. Franke et al. (2010) investigated the asymptotic behaviour of the probability of record at a given time, and characterised it under several distributions. Coumou et al. (2013) used the probabilities of record by Franke et al. (2010), assuming a Gaussian distribution, to make descriptive comparisons with the observed records in monthly temperatures. Wergen and Krug (2010) and Wergen et al. (2014) also used those probabilities: they found them useful at a monthly scale, but had difficulties quantifying the effect of slow changes in daily temperature.
Although all these approaches are useful, there is a lack of formal tests to evaluate the effect of climate change on very extreme temperatures. There are some references on general statistical tests based on records. Foster and Stuart (1954) proposed some distribution-free tests to analyse the randomness of a series, based on the number of upper and lower records. Diersen and Trenkler (1996) analysed the asymptotic relative efficiency (ARE) of the previous tests and showed how it can be improved by weighting the records according to their position in the series.
The aim of this work is to develop statistical tests, and complementary graphical tools, to detect non-stationary behaviour in the extremes of a series. The underlying idea of the tests is to use the distribution of the occurrence of records in an i.i.d. series (X_t), and to study whether the observed records are compatible with that behaviour. To develop tests that can be useful in climate analysis, the characteristics of temperature series, which will be described in the next section, have to be taken into account.
The outline of the paper is as follows. Section 2 describes the motivating problem and the data. Section 3 presents two families of tests, the first only uses the upper records, and the second joins the information of four types of records. A simulation study of the size and power to compare the proposed tests is shown in Section 4. Section 5 describes some graphical tools, and Section 6 the analysis of the effect of global warming on the extremes of daily temperature in Madrid (Spain). Finally, Section 7 summarises the main conclusions.

Description of the problem
The motivating problem of this work is the analysis of the effect of global warming on the extremes of a daily maximum temperature series, and the series in Madrid (Spain) is used as an example. Our aim is not only to objectively establish the existence or not of non-stationary behaviour in the extreme temperatures, but also to identify the time, the periods of the year, and the features where it occurs.

Fig. 1 Daily maximum temperature and lower (blue) and upper (red) records, Madrid.

Data
The series of daily maximum temperature in Madrid, Tx, recorded in °C from 01/01/1940 to 31/12/2019, is obtained from the European Climate Assessment & Dataset (Tank et al. 2002); the observations of the 29th of February have been removed. Madrid is located in the centre of the Iberian Peninsula (40.4°N, 3.7°W) at 667 m a.s.l. and has an inland Mediterranean climate (Köppen Csa). Winters are cool and summers are hot, with average temperatures in January and July of 10 and 31 °C, respectively.
The temperature series shows seasonal behaviour and a strong, clearly significant serial correlation. Figure 1 shows the mean evolution of Tx, which has a slight trend, much lower than the variability of the series. Thus, it is not clear whether this trend affects the extremes, in particular the occurrence of records. In addition, the trend in the mean temperature differs across the days of the year. This can be observed in Figure 2 (left), which shows θ̂_i, the slope of a linear trend estimated by least squares in the subseries of each day of the year, standardised in mean and standard deviation. The mean of the slopes is 0.0075, but they range from -0.0025 to 0.025. Figure 2 (right) represents the evolution of the annual mean of Tx, and its records. This plot shows that the increase of temperature and the occurrence of records are clearer at an annual scale than in daily temperatures. This suggests that global warming manifests itself not only in global record-breaking temperatures, but also in a higher number of days with extreme temperatures.
To sum up, temperature series present the following characteristics: they have strong seasonal behaviour, serial correlation, and different trend evolution within the year. It is noteworthy that the strong seasonal behaviour not only yields different distributions of the variables, but also a high variability of the entire series. Moreover, to study the effect of global warming at a daily scale, which is essential in climate applications, the increase in the number of warmer days also has to be taken into consideration. In the next section, we suggest an approach to arrange the data that deals with all these problems.

Fig. 2 (Left) Slopes of linear trend in Tx in each day of the year; red stars mark the days whose series can be considered independent. (Right) Annual mean of Tx and upper (red) and lower (blue) records, Madrid.

Data preparation
A common approach to remove the seasonal behaviour of a daily series with annual seasonality (X_{1,1}, X_{1,2}, . . . , X_{1,365}, X_{2,1}, X_{2,2}, . . . , X_{T,365}) is to split it into 365 series, one for each day of the year, so that day j yields the series (X_{1,j}, X_{2,j}, . . . , X_{T,j}), arranged as the j-th column of a T × 365 matrix.
In this way, each column of the matrix is a series of serially uncorrelated observations (checked using the Pearson correlation test) with no seasonal behaviour. The resulting 365 series do not have the same distribution but, given that the distribution of the occurrence of records does not depend on the distribution of an i.i.d. series, the results can be aggregated. The series of consecutive days, however, are clearly correlated. To obtain uncorrelated series, which will simplify the development of inference tools, only a subset of the 365 available series is considered. To that end, the following approach is applied: given that the series of day k is in the subset, the correlation between series k and k + 1 is tested; if the correlation is not significant, series k + 1 is included in the subset, otherwise the correlation between series k and k + 2 is tested. This step is repeated until a series k + i that is not significantly correlated with series k is found. Applying this approach to the temperature series in Madrid, and starting on the 1st of January, we obtain M = 58 series of length T = 80. The selected days are marked with red stars in Figure 2 (left).
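The greedy day-selection scheme just described can be sketched as follows; this is a minimal illustration, not the authors' code, and the array `temp` and significance level are illustrative assumptions.

```python
# Sketch of the greedy selection of approximately uncorrelated day-series.
# `temp` is a hypothetical (T, 365) array: one row per year, one column per
# day of the year.  Starting from day 0, day j joins the subset only if its
# series is not significantly Pearson-correlated with the last selected day.
import numpy as np
from scipy.stats import pearsonr

def select_days(temp, alpha=0.05):
    n_days = temp.shape[1]
    selected = [0]               # start on the 1st of January
    k = 0                        # last selected (reference) day
    for j in range(1, n_days):
        r, pval = pearsonr(temp[:, k], temp[:, j])
        if pval > alpha:         # not significantly correlated: keep day j
            selected.append(j)
            k = j
    return selected

rng = np.random.default_rng(1)
temp = rng.normal(size=(80, 365))    # i.i.d. toy data, T = 80 years
days = select_days(temp)
print(len(days))                     # with no real correlation, most days kept
```

With real temperature data the strong correlation of consecutive days makes the subset much smaller (M = 58 in the Madrid series).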
The suggested approach not only provides a set of M uncorrelated series with no seasonal behaviour, but will also allow us to join the information of the different series of daily records. In this way, we will be able to analyse both the increase of the temperatures and of the number of days with higher temperatures over the year, while maintaining the daily scale.

Challenges to analyse the tails of temperature series
To develop tests that are useful to detect non-stationarity in the temperature extremes, we have to consider all the characteristics of daily temperature, and the way the data have to be arranged to be analysed. First, we need tests with a high power to detect weak deviations from the null hypothesis, such as linear or other types of trends that may be small compared to the variability of the entire series. Secondly, non-parametric tests with few assumptions are preferable, so that they can be applied in a wide range of situations. Finally, due to the arrangement of the data resulting from splitting the series, the tests must be able to join information from several series, possibly with different distributions. Tests that are able to join this type of series will also be useful in spatial analysis, to study series from different locations and to obtain global conclusions over the area of interest.
Statistical tests to study i.i.d. series

In this section we develop a set of tests to study non-stationary behaviour in the occurrence of records that satisfy all the requirements described in the previous section. First, we review some properties of the occurrence of records and the Monte Carlo method used in the development of the tests, and we state the null hypothesis to be analysed. Then, two families of tests are proposed, one based on the upper records and another that joins the information of four types of records.

Basic definitions and the classical record model
Given a series of variables (X_t), an observation X_i is called an upper record (or simply a record) if its value exceeds all the previous observations, that is, X_i > max_{t<i}(X_t). Analogously, X_i is a lower record if X_i < min_{t<i}(X_t). Since lower records can be defined in terms of the upper records of the negated series (min_{t<i}(X_t) = − max_{t<i}(−X_t)), all the properties of the upper records can also be applied to the lower records.
The basic variables to characterise the occurrence of records in a series are the binary variables I_t, with t ≥ 1, defined as I_t = 1 if X_t > max(X_1, . . . , X_{t−1}) and I_t = 0 otherwise; variable I_1 is always 1. N_t is defined as the number of records up to time t, N_t = Σ_{i=1}^t I_i. The classical record model names the situation where we have the records of a series of i.i.d. continuous variables (X_t), t ≥ 1, with common cumulative distribution function F. An important result for the classical record model states that the distribution of the record times does not depend on F (Arnold et al. 1998). The following properties, which are a consequence of this result, characterise in more detail the behaviour of the variables I_t and N_t. They are useful to develop non-parametric tests with asymptotic distributions, and to make easier the implementation of Monte Carlo approaches.
Property 1. Given a sequence of i.i.d. continuous variables (X_t), the variables (I_t) are mutually independent with a Bernoulli(1/t) distribution, p_t = P(I_t = 1) = 1/t, t = 1, 2, . . .

Property 2. Given a series of i.i.d. continuous variables (X_t), the corresponding variables (N_t) converge in distribution to a Normal distribution; in particular, (N_t − log t)/√(log t) converges to a N(0, 1) as t → ∞.
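Property 1 implies E(N_T) = Σ_{t=1}^T 1/t in an i.i.d. series, which a quick simulation can check; this sketch is illustrative, with arbitrary sample sizes.

```python
# Minimal simulation check of Property 1: in i.i.d. series the record
# indicator I_t is Bernoulli(1/t), so E(N_T) equals the harmonic sum.
import numpy as np

rng = np.random.default_rng(0)
T, B = 50, 20_000
x = rng.normal(size=(B, T))                        # B i.i.d. series of length T
# I_t = 1 iff X_t attains the running maximum (ties have probability 0)
records = x >= np.maximum.accumulate(x, axis=1)
n_T = records.sum(axis=1)                          # N_T for each series

expected = sum(1.0 / t for t in range(1, T + 1))   # harmonic number H_50
print(n_T.mean(), expected)
```

The simulated mean of N_50 should be close to H_50 ≈ 4.5, regardless of the generating distribution.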

Monte Carlo method under the classical record model
Property 1 states that, under the hypothesis of an i.i.d. series, the variables (I_t) are independent with Bernoulli(1/t) distribution, whatever the continuous distribution F of (X_t). This simplifies the calculation of pivotal statistics based on I_t. In addition, the Monte Carlo method is also easy to apply: for any F, a Bernoulli(1/t) distribution can be used to generate samples of the variables I_t in i.i.d. series. Then, the implementation of the Monte Carlo method is standard: a series of T independent Bernoulli(1/t) variables is generated, and the pivotal statistic R is obtained. Repeating this step B times, a sample of observations of R under the null, R_1, . . . , R_B, is obtained, and the p-value of an observed value R_0 is estimated as the proportion of simulated values at least as extreme as R_0. The Monte Carlo method can also be used with statistics that are functions of dependent binary variables, as we will see in Section 3.3.2. The only difference with the previous approach is that a series of T independent N(0, 1) variables (or with any other continuous distribution) has to be simulated first. Then, the corresponding series of the dependent record indicator variables I_t are obtained and used to calculate the pivotal statistic R.
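The Monte Carlo scheme above can be sketched as follows; the statistic used here (the plain record count) and the +1-corrected p-value estimator are illustrative choices, not prescribed by the text.

```python
# Sketch of the Monte Carlo p-value under the classical record model:
# simulate independent Bernoulli(1/t) record indicators and compare the
# observed pivotal statistic R_0 with its simulated null distribution.
import numpy as np

def mc_pvalue(stat, T, B=5000, rng=None):
    rng = rng or np.random.default_rng(0)
    p = 1.0 / np.arange(1, T + 1)              # p_t = 1/t
    null = np.array([stat(rng.random(T) < p) for _ in range(B)])
    def pval(r_obs):
        # Monte Carlo estimator with the usual +1 correction
        return (1 + np.sum(null >= r_obs)) / (B + 1)
    return pval

T = 50
pval = mc_pvalue(lambda i: i.sum(), T)         # R = number of records
print(pval(4))    # a typical count under the null: large p-value
print(pval(12))   # far too many records: tiny p-value
```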

Notation and null hypothesis
In all the tests, we assume that M (≥ 1) mutually independent series of length T are available, (X_t1), (X_t2), . . . , (X_tM). These series can be the result of splitting the original series, or series measured at different spatial points, for example. To keep the notation simple, it is assumed that all the series have the same length T, but this restriction is not necessary.
The null hypothesis of all the proposed tests is

H0 : p_tm = 1/t for all t = 1, . . . , T, m = 1, . . . , M, (1)

with p_tm = P(I_tm = 1), where I_tm is the record indicator at time t in series m. In the context of global warming, the most general alternative hypothesis is

H1 : p_tm > 1/t for at least one t = 1, . . . , T, m = 1, . . . , M. (2)

Hypothesis (2) includes the existence of a monotonic positive trend in the mean, a usual assumption of global warming, but it is more general. It also includes other types of non-stationarity, for example, non-monotonic trends or series with increasing variability.

Tests based on the upper records
In this section, we propose several approaches to build tests assuming that only the upper records are available. First, we consider a family of statistics based on the number of records N t , and second a family based on the likelihood function of the I t variables.

Tests based on N_T
Property 2 states the asymptotic Normal distribution of N_T, the number of records in a series. This property was used by Foster and Stuart (1954) to define the d-statistic, based on the difference between the upper and the lower records. Diersen and Trenkler (1996) found that the efficiency of the d-statistic and other similar statistics is improved by splitting a series of length T into k series of length T/k. Taking these results into account, we define the total number of records in the M series,

N = Σ_{m=1}^M N_Tm = Σ_{t=1}^T S_t, with S_t = Σ_{m=1}^M I_tm.

N is asymptotically Normal under the null, since the N_Tm are independent variables and they are asymptotically Normal when T → ∞ (Foster and Stuart 1954). On the other hand, by the central limit theorem, each variable S_t is also asymptotically Normal under the null when M → ∞. That means that the Normal approximation of N is obtained when either or both of M and T increase, and that a good approximation can be expected even with moderate values. The mean and the variance under the null are

E(N) = M Σ_{t=1}^T 1/t, V(N) = M Σ_{t=1}^T (1/t)(1 − 1/t).

Using this distribution under the null, a test based on N is built; the p-value is P(Z ≥ (N_0 − 0.5 − E(N))/√V(N)), where N_0 is the observed value of N and Z ∼ N(0, 1); since N takes integer values, a continuity correction is applied. It is noteworthy that, using Property 1 and the independence of the M series, S_t ∼ Binomial(M, 1/t) under the null. That means that the exact distribution of N is Poisson-Binomial, that is, the distribution of the sum of T independent Binomial(M, 1/t) variables with t = 1, . . . , T. It has no explicit expression but it can be computed with numerical methods. However, it is not worth using, since it has been checked that the exact and the asymptotic Normal distributions are equivalent even with M = 1 and low values of T.
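A test based on the total record count N follows directly from Property 1: under the null E(N) = M Σ 1/t and V(N) = M Σ (1/t)(1 − 1/t). The sketch below implements this normal-approximation test with a continuity correction; the data and trend size are illustrative.

```python
# Sketch of the N test: count records over M independent series of length T
# and compare with the null mean/variance via a one-sided Normal p-value.
import numpy as np
from math import erf, sqrt

def n_test(series):
    """series: (M, T) array; returns (N, one-sided p-value)."""
    M, T = series.shape
    t = np.arange(1, T + 1)
    records = series >= np.maximum.accumulate(series, axis=1)
    N = int(records.sum())
    mean = M * np.sum(1.0 / t)
    var = M * np.sum((1.0 / t) * (1 - 1.0 / t))
    z = (N - 0.5 - mean) / sqrt(var)          # continuity correction
    return N, 0.5 * (1 - erf(z / sqrt(2)))    # P(Z >= z)

rng = np.random.default_rng(2)
iid = rng.normal(size=(12, 50))
trend = iid + 0.05 * np.arange(50)            # strong positive trend
print(n_test(iid)[1], n_test(trend)[1])
```

Under the trend alternative the record count inflates and the p-value collapses towards 0.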
Weighted statistic. Diersen and Trenkler (1996) considered that more powerful statistics could be obtained by weighting the records according to their position in the series. The motivation is that the probability that an observation exceeds the current record is inversely proportional to the number of previous observations. After an empirical study with different weights, the authors recommended the use of linear weights w_t = t − 1. They found that the ARE of the weighted tests improved with respect to the unweighted ones.
In our case, we define the weighted statistic

N^w = Σ_{t=1}^T w_t S_t, with w_t = t − 1.

Under the null, N^w is asymptotically Normal when M → ∞, since under those conditions the S_t are asymptotically Normal. However, we do not have normality when T → ∞, since Σ_{t=1}^T w_t I_tm is no longer asymptotically Normal. The mean and the variance are

E(N^w) = M Σ_{t=1}^T w_t/t, V(N^w) = M Σ_{t=1}^T w_t² (1/t)(1 − 1/t).

Statistics with estimated variance. The previous tests are based on the asymptotic Normal distribution of the statistics, with the expectation and variance obtained under the null. A disadvantage of these tests is that when the null hypothesis is not true, the expectation will increase but the variance will also change. That could diminish the power of the test, since the statistic will tend to take higher values but possibly with a higher variability. An alternative is the standardized statistic

N^w_S = √M (N̄^w − Σ_{t=1}^T w_t/t) / σ̂,

where N̄^w = (1/M) Σ_{m=1}^M N^w_Tm, with N^w_Tm = Σ_{t=1}^T w_t I_tm the weighted statistic of series m. Assuming that V(I_tm) is the same in the M series, σ̂² is the sample variance of the M values N^w_Tm obtained from (X_tm). Under the null, N^w_S has a Student t distribution with M − 1 degrees of freedom, and with p_t > 1/t the statistic will tend to be far from 0. This statistic is also more robust to the existence of serial correlation, since the expected value of N^w_S is the same whether the series is correlated or not, but its variance changes. It is noteworthy that in serially correlated, identically distributed series the probability of record is not 1/t, but the deviations are negligible even with medium correlations.
Tests based on the likelihood

Our aim is to develop tests based on the likelihood to study the null hypothesis in (1). Standard likelihood tests assume that the parameter values in the null hypothesis are interior points of the maintained hypothesis. However, in the one-sided alternative in (2) the parameters lie on the boundary of the parameter space, so that the standard regularity conditions fail to hold and the usual asymptotic distributions are no longer valid (Gourieroux et al. 1982). Tests under two different assumptions are proposed: first, the probabilities of record at time t of the M series are allowed to be different, and second, they are assumed to be equal (but possibly different from 1/t). The most common example of the second situation is when the M series have the same distribution.
General test for M independent series. If the M series are not assumed to have the same probabilities of record at time t, there are M(T − 1) different probabilities p_tm, without any restriction between them. Then, the number of unknown parameters p_tm is equal to the number of observations, and they cannot be estimated. An approach based on the score function, which only requires the estimation of the parameters under the null, is suggested.
Score-sum statistics. King and Wu (1997) proposed a general method of constructing a locally most mean powerful unbiased score test for one-sided alternatives. It has a small-sample optimal power property when no nuisance parameters are present, as in this case.
The statistic is based on the sum of the score vector evaluated under the null. In this case, using the likelihood of the binary variables I_tm and the null hypothesis in (1), the elements of the score vector are

q_0,tm = t I_tm − (t/(t − 1))(1 − I_tm), t = 2, . . . , T, m = 1, . . . , M,

and the information matrix under the null, I_0, is diagonal with elements t²/(t − 1). Then, we consider the statistic

S = 1_K' q_0 / √(1_K' I_0 1_K),

where 1_K is the unity vector of length K = M(T − 1). Assuming standard regularity conditions, but without the requirement that the true parameter be an interior point of the parameter space, and using the asymptotic Normal distribution of the score vector, the distribution of S under the null is asymptotically N(0, 1) when M → ∞. Using this distribution, we can build a test, and the p-value for the alternative in (2) is P(Z ≥ S_0), where S_0 is the observed value of S. It is noteworthy that S is a linear function of the variables S_t, t = 2, . . . , T, with weights proportional to t²/(t − 1). That means that it is a particular case of the weighted statistic N^w, with the advantage that the considered weights are analytically justified.
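Since S is a standardised weighted statistic with weights proportional to t²/(t − 1), it can be sketched as follows; the centring and scaling constants are reconstructed from the Bernoulli(1/t) null, so treat them as an assumption rather than the authors' exact code.

```python
# Sketch of the score-sum statistic S: weights w_t = t^2/(t-1) on
# S_t = sum_m I_tm (t = 2..T), centred and scaled under the null so that
# S is asymptotically N(0, 1).
import numpy as np

def score_stat(series):
    M, T = series.shape
    records = series >= np.maximum.accumulate(series, axis=1)
    S_t = records.sum(axis=0)[1:]          # S_2, ..., S_T
    t = np.arange(2, T + 1)
    w = t**2 / (t - 1)
    # E(S_t) = M/t under the null, so the numerator has mean 0;
    # its variance reduces to M * sum(w_t).
    num = np.sum(w * S_t) - M * np.sum(t / (t - 1))
    return num / np.sqrt(M * np.sum(w))

rng = np.random.default_rng(3)
iid = rng.normal(size=(12, 50))
print(score_stat(iid))                     # roughly standard Normal
print(score_stat(iid + 0.05 * np.arange(50)))   # large under a trend
```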
Tests for M independent series with the same distribution. The tests proposed in this section assume that the probabilities of record in the M series are equal, but they can be different from 1/t. One advantage of this assumption is that the number of unknown parameters, T − 1, is lower, and they can be estimated. In addition, it could be expected that in cases where the assumption is true, these tests would be more powerful than S. However, the power study in Section 4 shows that this is not the case. Since the use of these tests will not be recommended, they are only briefly described here. Shapiro (1988) showed that, given a vector y ∼ N(0, V) of dimension n and a convex cone C, the statistic obtained by projecting y onto C is distributed as χ̄²(V), a mixture of χ² distributions. If V is the identity matrix and C = {b : b ≥ 0}, the weights of the mixture are w_i = (n choose i) 2^{−n} for i = 1, . . . , n. We apply this approach to y = I_0^{−1/2} q_0, which under the null has an asymptotic N(0, I) distribution. The resulting statistic, denoted T̄, follows this mixture distribution and, using the asymptotic distribution, a test can be built as previously described.
We also derive a statistic based on the likelihood ratio, using the approach by Gourieroux et al. (1982) to deal with one-sided alternatives. Under the null, it has a χ̄²(I_0^{−1}) asymptotic distribution. Since the numerical calculation of this distribution is computationally expensive, a Monte Carlo method is used, but the power performance is worse than that of T̄.

Tests joining information from different types of records
When the entire series is available, the power of a test based on records can be improved by joining the information from the binary variables of the occurrence of lower records, denoted (I_t^L), and the occurrence of records in the backward series. This idea was suggested by Foster and Stuart (1954) and Diersen and Trenkler (1996). The backward series is the series obtained when the order of the terms is reversed, so that we start by observing the last term: X_t^B = X_{T−t+1}, t = 1, . . . , T. Notice that t is the index of the position in the backward series, so that I_t and I_t^B do not correspond to the same day. Like the variables I_t, under the null, I_t^B ∼ Bernoulli(1/t). Analogously, we define the binary variables for the occurrence of its lower records, I_t^BL. Two approaches are suggested to join the information from the four types of records. The first step is to obtain the type of statistic described in the previous section for each of the four types of records. Then, we can build a joint statistic, or we can join the resulting p-values using a Fisher-type method. In both cases, it is necessary to characterise the dependence between the individual statistics. The approaches presented here can be applied to any of the statistics in Section 3.2, but the results are presented for S since, as will be seen in Section 4, it is the most powerful statistic.
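The four indicator series can be obtained from one observed series by sign and time reversal, as this small sketch illustrates (the dictionary keys are just illustrative labels):

```python
# Computing the four record-indicator series of a single series x:
# upper (I) and lower (I^L) records of the forward series, and upper (I^B)
# and lower (I^BL) records of the backward (time-reversed) series.
import numpy as np

def record_indicators(x):
    x = np.asarray(x, dtype=float)
    up = lambda s: s >= np.maximum.accumulate(s)    # upper records
    low = lambda s: s <= np.minimum.accumulate(s)   # lower records
    return {"I": up(x), "IL": low(x),
            "IB": up(x[::-1]), "IBL": low(x[::-1])}

ind = record_indicators([3.0, 1.0, 4.0, 1.5, 5.0])
print(ind["I"].astype(int))    # [1 0 1 0 1]
print(ind["IL"].astype(int))   # [1 1 0 0 0]
```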

Correlation between statistics under the null
The statistic can be expressed as S = Σ_{t=2}^T w_t S_t + K, where K is a constant and the weights are proportional to w_t = t²/(t − 1). All the calculations are expressed in terms of the weights w_t; in this way, the notation is simpler and the results are easily generalised to other N^w statistics. S^L, S^B and S^BL denote the statistics based on the corresponding lower records or backward series.
Correlation between S and S^L (S^B and S^BL). The covariances between the corresponding record indicators can be computed explicitly for t, t′ > 1. The resulting correlation does not depend on M, since each w_t² term is multiplied by a factor 1/M. These statistics are asymptotically independent and, even for quite low T, the correlation is negligible; it is −0.044 for T = 50.
Correlation between S and S^B (S^L and S^BL). Using that E(I_t I^B_{t′}) = P(I_t = 1, I^B_{t′} = 1), the covariances can be computed for t, t′ > 1, and the correlation follows. These statistics show an increasing negative dependence that is not negligible; the correlation is −0.667 for T = 50, and it approaches −2/3 as T increases.
Correlation between S and S^BL (S^L and S^B). The corresponding covariances can also be computed for t, t′ > 1. A simulation study shows that these statistics are asymptotically independent and, even for low T, the correlation is negligible, smaller than 0.03 for T = 50.

Generating a joint statistic
The idea of this approach is to join the information of S, S^L, S^B and S^BL into one statistic. To that end, it must be taken into account that, under the alternative of an increasing trend, S and S^BL tend to have high positive values (since p_t > 1/t), while S^L and S^B tend to have negative values (since in the corresponding series p_t < 1/t). To unify the behaviour of the four statistics, we consider linear combinations of S, −S^L, −S^B and S^BL. The simplest option is to join the statistics that are asymptotically independent, for example S2 = S + S^BL, whose asymptotic distribution under the null is Normal with mean 0 and standard deviation √2. We also consider S4 = S − S^L − S^B + S^BL, which joins the four statistics available. Its expectation under the null is 0, and its variance is calculated using the covariances in Section 3.3.1. S4 has an asymptotic Normal distribution when M → ∞, since it is the sum of M independent variables with the same distribution under the null. The p-value is also calculated using the Monte Carlo method in Section 3.1.2. The aim of this double calculation is to check the validity of the asymptotic Normal distribution, and to establish the values of M and T where the approximation can be used. The Monte Carlo method can be applied since S4 is a pivotal statistic.
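A sketch of S4 with its Monte Carlo p-value follows; the internal score-type statistic uses the weights t²/(t − 1) with a standardisation reconstructed from the text, and the sample sizes are illustrative, so treat the details as assumptions.

```python
# Sketch of S4 = S - S^L - S^B + S^BL.  The four components are obtained
# from one statistic by sign reversal (lower records) and time reversal
# (backward series); the null distribution is simulated from i.i.d. N(0,1)
# series, which is valid because S4 is pivotal.
import numpy as np

def score_stat(series):
    # standardised score-type statistic with weights t^2/(t-1) (reconstructed)
    M, T = series.shape
    rec = series >= np.maximum.accumulate(series, axis=1)
    t = np.arange(2, T + 1)
    w = t**2 / (t - 1)
    num = np.sum(w * rec[:, 1:].sum(axis=0)) - M * np.sum(t / (t - 1))
    return num / np.sqrt(M * np.sum(w))

def s4(series):
    fwd, bwd = series, series[:, ::-1]
    # lower records of x are upper records of -x
    return (score_stat(fwd) - score_stat(-fwd)
            - score_stat(bwd) + score_stat(-bwd))

def s4_pvalue(series, B=2000, rng=None):
    rng = rng or np.random.default_rng(0)
    M, T = series.shape
    null = np.array([s4(rng.normal(size=(M, T))) for _ in range(B)])
    return (1 + np.sum(null >= s4(series))) / (B + 1)

rng = np.random.default_rng(4)
trend = rng.normal(size=(4, 50)) + 0.05 * np.arange(50)
print(s4_pvalue(trend))        # small: the trend is detected
```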
One advantage of the suggested approach is its flexibility to define statistics, especially when the Monte Carlo method is used. Depending on the aim of the test, other statistics can be of interest; for example, S − S^B and S^BL − S^L are a better option to study only the behaviour of the upper and the lower tail of the distribution, respectively. Other definitions have been tried, such as S − S^L, but the simulation analysis showed that S4 is more powerful.

Generating a joint p-value
The idea of this approach is to join the information of P p-values, P_i, from the tests based on S, −S^L, −S^B and S^BL, using a Fisher-type method. The standard Fisher approach states that the distribution of X = −2 Σ_{i=1}^P log P_i under the null is χ²_{2P}, but it requires independent statistics. Using this approach, we propose the test F2, which joins the p-values from S and S^BL.
We also propose the test B4, which joins the p-values of the four statistics using the modification by Kost and McDermott (2002) of Brown's approach to join dependent p-values. If the statistics are normally distributed, as the score statistics are, the distribution of X under the null can be approximated by cχ²_f, where c = V(X)/(2E(X)) and f = 2E(X)²/V(X). The expected value is E(X) = 2P, and V(X) is obtained using an approximation of the covariances, Cov(−2 log P_i, −2 log P_j) ≈ 3.263ρ_ij + 0.710ρ_ij² + 0.027ρ_ij³, where ρ_ij is the correlation between the statistics i and j. Another requirement to join the p-values is that the statistics must have the same behaviour (increase or decrease) under the alternative. To achieve this, the p-values of the statistics S, −S^L, −S^B and S^BL must be joined. Then, c and f are computed from the correlations ρ1 = ρ_{S,−S^L}, ρ2 = ρ_{S,−S^B} and ρ3 = ρ_{S,S^BL}, obtained in Section 3.3.1.
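The two combination rules can be sketched as follows; the cubic covariance approximation is the one commonly quoted from Kost and McDermott (2002), and the example p-values and correlation matrix are illustrative assumptions, not values from the paper.

```python
# Sketch of Fisher's combination for independent p-values (F2-style) and
# the Brown/Kost-McDermott scaled chi-square for dependent p-values (B4-style).
import numpy as np
from scipy.stats import chi2

def fisher(pvals):
    X = -2 * np.sum(np.log(pvals))
    return chi2.sf(X, df=2 * len(pvals))

def kost_mcdermott(pvals, rho):
    # rho[i, j]: correlation between the underlying normal statistics
    P = len(pvals)
    cov = 3.263 * rho + 0.710 * rho**2 + 0.027 * rho**3
    var_X = 4 * P + 2 * np.sum(cov[np.triu_indices(P, k=1)])
    E_X = 2 * P
    c = var_X / (2 * E_X)
    f = 2 * E_X**2 / var_X
    X = -2 * np.sum(np.log(pvals))
    return chi2.sf(X / c, df=f)           # X ~ c * chi2_f under the null

p = np.array([0.02, 0.04, 0.30, 0.05])    # illustrative p-values
rho = np.full((4, 4), 0.3)                # illustrative positive dependence
np.fill_diagonal(rho, 1.0)
print(fisher(p), kost_mcdermott(p, rho))
```

With positive dependence the adjusted test is more conservative than plain Fisher, as it should be.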

Size and power analysis of the tests
This section summarises the main results from a simulation analysis of the size and power of the proposed tests. First, the performance of the tests based on the upper records is analysed, and secondly, that of the tests which require the entire series. The estimations are based on 10,000 replications and a significance level α = 0.05. In the size analysis, i.i.d. series with a N(0, 1) distribution are generated, without loss of generality since, under the null, the distribution of the statistics does not depend on the distribution of (X_t). The size of the tests is estimated for M = 1, 4 and 12 and T = 25, 50 and 100. The values of M correspond to common situations: non-split data, and quarterly and monthly split data.
The power study focuses on the comparison of the tests under 'difficult conditions', since when M and T are high or the series have strong trends, all the statistics have a similarly high power that approaches 1. Thus, only small sample sizes, M = 1, 4, 12 and T = 25 and 50, and weak trends are shown in the study. The power is estimated using series with a monotonic positive trend θ, since they imply the alternative hypothesis H_1 : p_tm > 1/t. Although other situations can also lead to this alternative, a positive monotonic trend is the most common way of modelling global warming. Three kinds of monotonic trends were initially considered, linear, concave and convex: θ_t = θt, θ_t = θ√(tT) and θ_t = θt²/T, so that all of them increase to θT when t → T. Given that very similar results are obtained in the three cases, only the results for the linear trend are shown. Concerning the magnitude, trends ranging from 0.005 to 0.05 of the standard deviation of the variables, θ = 0.005, 0.01, 0.025 and 0.05 with σ = 1, are considered. Series with different distributions are used to estimate the power. We consider N(0, 1) series and other distributions commonly used in climate and environmental sciences. Series in these fields often need to be modelled with asymmetric and semi-bounded distributions (defined in (0, ∞)), such as the Exponential or Gamma distributions for rainfall, the Generalized Pareto (GP) for peaks over threshold applied to temperature or hydrological series, or the Generalized extreme value (GEV) for annual maximum temperatures or other maxima.
In climate analysis, it is likely that the series used to implement the test do not exhibit the same trend, for example across seasons. Consequently, an alternative H_{1,θ_m} corresponding to this situation is included in the analysis: it uses M series with different θ_m values generated from a N(0.0075, 0.005) distribution. The values 0.0075 and 0.005 are the mean and standard deviation estimated from the sample of trends obtained from the temperature series in Madrid.
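The Monte Carlo estimation of the size described above can be sketched as follows. This is an illustrative Python sketch using a simplified record-count statistic standardised by its null moments, not the actual statistics of the paper; all names are ours.

```python
import math
import random

def upper_record_indicators(x):
    """I_t = 1 if x[t] is a strict upper record (the first value always is)."""
    rec, cur_max = [], float("-inf")
    for v in x:
        rec.append(1 if v > cur_max else 0)
        cur_max = max(cur_max, v)
    return rec

def n_statistic(series_list):
    """Total number of upper records in M series of length T, standardised
    by its null moments: under H0, P(record at time t) = 1/t, so the total
    has mean M * sum 1/t and variance M * sum (1/t)(1 - 1/t)."""
    M, T = len(series_list), len(series_list[0])
    total = sum(sum(upper_record_indicators(x)) for x in series_list)
    mean0 = M * sum(1.0 / t for t in range(1, T + 1))
    var0 = M * sum((1.0 / t) * (1.0 - 1.0 / t) for t in range(1, T + 1))
    return (total - mean0) / math.sqrt(var0)

def estimate_size(M=4, T=25, reps=1000, z=1.645, seed=1):
    """Monte Carlo estimate of the one-sided size under H0:
    generate i.i.d. N(0,1) series and count rejections."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        series = [[rng.gauss(0.0, 1.0) for _ in range(T)] for _ in range(M)]
        if n_statistic(series) > z:
            rejections += 1
    return rejections / reps
```

The estimated size should be close to the nominal 0.05 for moderate M and T; replacing the N(0,1) generator with one that adds a trend θt gives the corresponding power estimate.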

Comparison of the tests using only upper records
A thorough study of all the statistics described in Section 3.2 was performed. Only the results for N, S and T are summarised here, since the statistics which are linear weighted functions of S_t (N_w and S) yield similar results, and the standardised version N_wS does not improve the power with uncorrelated series either. The performance of T and R, the statistics for M series with the same probabilities of record, is also very similar. Table 1 summarises the size of the three statistics. It is adequate in all the cases, even for N and S, whose distributions are asymptotic. The size of S, which is asymptotic when M → ∞, is slightly higher than the nominal value for M = 1, but it decreases quite fast. Figure 3 summarises the power of the tests with Normal series with a trend. The following conclusions are obtained from the results.

Power analysis
- The comparison of N and S shows a clear increase in power when weights are used. However, the value of the weights is not so relevant: the linear weights, obtained empirically, are equivalent in terms of power to the weights theoretically derived for S.
- Unexpectedly, the tests T and R, which assume that the M series have the same probabilities of record, have a lower power even when that assumption is true.

Alternative with series with different trends. The pattern of the power under the alternative H_{1,θ_m} is very similar to that obtained with series with a constant trend, and S provides the best results in all the settings (see Figure S.1 in the Supplementary Material). The values are also equivalent; for example, the power of S under H_{1,θ_m} with random trends with mean 0.0075 and T = 50 and M = 12 is 0.3, while its counterpart with series with θ = 0.0075 is 0.29.
Alternative with negative trends. The analysis of negative trends is not of great interest in climate. However, given that the occurrence of upper records in a series with a negative trend is symmetric to the occurrence of lower records in a series with a positive trend, a brief analysis is carried out. The results are shown in Figure S.2 in the Supplementary Material, and the conclusions are analogous, although the power is a bit lower. Some analyses using distributions other than the Normal were also carried out and, as in the previous cases, S is the most powerful test. More results with other distributions are presented in the next section. To sum up, S is the most powerful test in all the considered situations. Consequently, it is used to build the tests that join the information of different types of records.
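A weighted record statistic of the kind compared above can be sketched as follows. This Python sketch uses illustrative linear weights w_t = t − 1; the weights of the paper's S statistic are derived from the likelihood, so this is not the exact S, only a sketch of the weighting idea.

```python
import math

def weighted_record_statistic(series_list, weights=None):
    """Standardised weighted count of upper records, sum_t w_t * S_t,
    where S_t is the number of series with an upper record at time t.
    Linear weights w_t = t - 1 are the (illustrative) default; under H0,
    S_t ~ Binomial(M, 1/t), which gives the null mean and variance."""
    M, T = len(series_list), len(series_list[0])
    if weights is None:
        weights = [t - 1 for t in range(1, T + 1)]
    S = [0] * T                       # S_t across the M series
    for x in series_list:
        cur_max = float("-inf")
        for t, v in enumerate(x):
            if v > cur_max:
                S[t] += 1
                cur_max = v
    stat = sum(w * s for w, s in zip(weights, S))
    mean0 = M * sum(w / t for w, t in zip(weights, range(1, T + 1)))
    var0 = M * sum(w**2 * (1.0 / t) * (1.0 - 1.0 / t)
                   for w, t in zip(weights, range(1, T + 1)))
    return (stat - mean0) / math.sqrt(var0)
```

Because later times carry larger weights, a late excess of records (as produced by an increasing trend) moves this statistic further from 0 than the unweighted count N, which is the intuition behind the power gain reported above.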

Comparison of the approaches to join different types of records
This section summarises the performance of the approaches proposed to join the information from different types of records. The objective is twofold: to analyse the improvement in power when more than one type of record is used, and to study whether it is more effective to build a joint statistic or a joint p-value. Thus, the tests based on S2, S4, F2 and B4 are compared in the study. Table 2 summarises the estimated size of the four tests. It is adequate, although the tests based on only two statistics, especially F2, tend to yield sizes slightly higher than the nominal value when M is low. Figure 4 summarises the power of the tests with Normal series with a trend; the statistic S, based only on the upper records, is also included for comparison purposes. The following conclusions are obtained.

Power analysis
- The pattern of the power is quite similar in all the settings: the joint tests are clearly more powerful than S, but the differences between them are small.
- Although the improvement resulting from joining information from different types of records is clear, the gain in power of four types of records over two is much smaller. The power of S2 and F2 is slightly higher than that of S4 and B4 when M = 1; however, the power of S4 increases faster with M. In any case, the differences are negligible and, given that the size of S4 and B4 is better with low sample sizes and the computation time is similar, the tests joining four types of records are preferable.
- Concerning the approach to join the information, joint statistics and joint p-values have a similar power in these settings. The power of B4 is slightly lower with M = 1, but the differences are negligible.
- As expected, the power of the joint tests increases with both M and T, and it is over 0.8 for quite low sample sizes: for θ = 0.05 with M = 4 and T = 25 or with M = 1 and T = 50, and for θ = 0.025 with M = 4 and T = 50. To obtain a power higher than 0.8 for θ = 0.01, M = 20 is required with T = 50, or T = 62 with M = 12. Analogously, for θ = 0.005, M = 80 is required with T = 50, or T = 115 with M = 12.
Alternative with series with different trends. The conclusions about the power under the alternative H_{1,θ_m} are analogous. Figure 5 shows that the power of the four joint tests is higher than that of S, and similar among them. The power is very similar to that obtained with M series with the same trend: it is 0.45 with T = 50 and M = 12, while the counterpart with M series with trend θ = 0.0075 is 0.43. A power higher than 0.8 is obtained with T = 50 and M = 36.

Alternative with series with non Normal distributions. The power of the tests is now estimated using distributions commonly used in climate, with different types of tails (one-side bounded, and heavier and lighter tails than the Normal). Figure 6 summarises the power of the tests S4 and B4 when they are applied to series with a trend and the one-side bounded GP distribution with shape parameter ξ = −0.1 (values −0.5 < ξ < 0.5 are common in climate) and σ = 1. In this case, the power is higher than with Normal series in all the settings. Even for θ = 0.005 the power of B4 is higher than 0.9 with T = 50 and M = 4, and for θ = 0.01 also with T = 25 and M = 12. Series with GP(σ = 1, ξ = 0), that is an Exponential distribution, GP(σ = 1, ξ = −0.5), and GP(σ = 1, ξ = −1), that is a Uniform distribution (two-side bounded), also yield a higher power than the Normal, see the Supplementary Material. In those cases, the power of B4 is higher than that of S4.

These results lead to the conclusion that, for one or two-side bounded distributions or distributions with one or two tails lighter than the Normal, the power of the record tests is higher. With the GP distribution, the record tests are even more powerful than the MK test to detect trends in the mean. Indeed, Figure 6 shows that the power of MK is lower, especially with weak trends. This means that, for this type of distributions, the detection of trends is more powerful if we focus on the behaviour of the bounded or light tail instead of on the mean evolution.

To sum up, we conclude that the union of different types of records clearly improves the power of the test. We propose the use of the tests based on B4 or S4, although with bounded or light-tailed distributions, B4 is slightly more powerful.
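The four types of records combined by these tests can be computed as follows. This is a minimal Python sketch (names are ours): the backward series is simply the time-reversed series, and lower records of a series are upper records of its negation.

```python
def record_counts_four_types(x):
    """Numbers of upper and lower records in the forward series and in the
    backward (time-reversed) series.  Under H0 (i.i.d.), the four counts
    have the same distribution; under an increasing trend, forward upper
    (FU) and backward lower (BL) records become more frequent, while
    forward lower (FL) and backward upper (BU) records become rarer."""
    def n_upper(seq):
        cur_max, n = float("-inf"), 0
        for v in seq:
            if v > cur_max:
                n, cur_max = n + 1, v
        return n

    def n_lower(seq):
        # a lower record of seq is an upper record of -seq
        return n_upper([-v for v in seq])

    fwd = list(x)
    bwd = fwd[::-1]
    return {"FU": n_upper(fwd), "FL": n_lower(fwd),
            "BU": n_upper(bwd), "BL": n_lower(bwd)}
```

For a strictly increasing series every forward observation is an upper record and every backward observation is a lower record, which is the extreme case of the asymmetry the joint tests exploit.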

Graphical tools to detect non stationary behavior in records
The use of statistical tests is essential to obtain objective conclusions about the existence of non stationary behaviours in the extremes. However, graphical tools are also important to explore and characterise the existence of non stationarity. Basic exploratory plots based on the times of occurrence of the records were used in Section 2, but here we suggest some more elaborate plots together with confidence intervals (CI). Two types of plots are proposed, based on the variables N_t and the estimated probabilities p̂_t.

Plots based on N_t
Using Property 2 and the approach in Section 3.2, it is obtained that the mean number of records up to time t in the M series, N̄_t = Σ_{m=1}^{M} N_{tm}/M, is asymptotically Normal under the null. Using this result, CI for μ_t = E(N_t) are N̄_t ± z_{1−α/2} √(V(N_t)/M), where z_{1−α/2} is the 1 − α/2 percentile of a N(0, 1) distribution and, under the null, V(N_t) = Σ_{j=1}^{t} (1/j)(1 − 1/j). These intervals, together with the point estimator N̄_t and its expected value under the null, can be plotted versus time. The resulting band is not a real confidence band for the values μ_t, due to the dependence between the different N̄_t. However, it is useful to observe deviations from stationarity in the evolution of the number of records, and to identify the time point from which this deviation is significant. Another advantage is that the four types of records can be displayed in the same plot, since their expected behaviour under the null is the same. The same approach can be used to make plots joining the numbers of lower and upper records in the forward and backward series, which are also asymptotically Normal when M → ∞. It is noteworthy that, at each time t, it is necessary to calculate the forward and backward records in the series observed only up to time t, not the number of backward records up to t for the series observed up to time T. If N̄^B_t denotes the mean number of records in the backward series of the first t observations, a statistic D̄_t can be built by combining the mean numbers of records of the different types so that the expected value of D̄_t under the null is 0.
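The band for N̄_t can be sketched as follows. This Python sketch uses the null moments stated above (E(N_t) = Σ 1/j, V(N_t) = Σ (1/j)(1 − 1/j)); the plotting itself is omitted and the function name is ours.

```python
import math

def mean_records_band(series_list, z=1.96):
    """Pointwise band for the mean number of upper records up to each
    time t across M series.  Under H0 the mean N̄_t is asymptotically
    Normal in M, giving the approximate band mu_t +/- z * sqrt(V(N_t)/M)."""
    M, T = len(series_list), len(series_list[0])
    nbar = [0.0] * T                   # observed mean number of records up to t
    for x in series_list:
        cur_max, n = float("-inf"), 0
        for t, v in enumerate(x):
            if v > cur_max:
                n, cur_max = n + 1, v
            nbar[t] += n / M
    # null mean and half-width of the band at each t (t = index + 1)
    mu = [sum(1.0 / j for j in range(1, t + 2)) for t in range(T)]
    half = [z * math.sqrt(sum((1.0 / j) * (1.0 - 1.0 / j)
                              for j in range(1, t + 2)) / M) for t in range(T)]
    lower = [m - h for m, h in zip(mu, half)]
    upper = [m + h for m, h in zip(mu, half)]
    return nbar, mu, lower, upper
```

Plotting `nbar` against `mu`, `lower` and `upper` versus time reproduces the kind of display described above; values of `nbar` leaving the band mark the times from which the deviation from stationarity is apparent.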

Plots based on p̂_t
The maximum likelihood estimators p̂_t = S_t/M satisfy E(p̂_t) = 1/t under the null, or equivalently E(t p̂_t) = 1. When (X_t) is not an i.i.d. sequence, there is no general expression for p_t and E(t p̂_t). Assuming series with a linear trend θ, Ballerini and Resnick (1985) proved that p_t has an asymptotically constant limit, lim_{t→∞} p_t = β_0, if the distribution has a finite first moment. The assumption p_t = β_0 + β_1/t is compatible with this result. Then, it is reasonable to consider as a general alternative the regression model E(t p̂_t) = β_1 + β_0 t, for t > 1, whose expected behaviour under the null is β_0 = 0, β_1 = 1. Consequently, under the null, the plot of t p̂_t versus time should be a random cloud of points around 1, and the fitted regression line should be close to Y = 1.
This model is heteroscedastic under the null, since V(t p̂_t) = (t − 1)/M. This implies that weighted least squares estimators β̂^W_0 and β̂^W_1, with weights equal to the reciprocal of the variance, must be used. CI for E(t p̂_t) can be obtained using that t p̂_t = t S_t/M and S_t ∼ Binomial(M, 1/t) under the null.

All the statistical tools proposed in this work are implemented in the R-package RecordTest (Castillo-Mateo and Cebrián 2020), publicly available from CRAN.
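The weighted least squares fit of t p̂_t can be sketched as follows. This is an illustrative Python sketch (names are ours), using the closed-form WLS estimators with weights 1/V(t p̂_t) = M/(t − 1) derived above.

```python
def wls_record_regression(S, M):
    """Weighted least squares fit of Y_t = t * phat_t on t, for t = 2..T,
    with phat_t = S[t-1] / M and weights M / (t - 1), since under H0
    S_t ~ Binomial(M, 1/t) gives V(t * S_t / M) = (t - 1) / M.
    Under H0 the fit should be close to intercept 1 and slope 0."""
    pts = [(t, t * S[t - 1] / M, M / (t - 1.0)) for t in range(2, len(S) + 1)]
    sw = sum(w for _, _, w in pts)
    swx = sum(w * t for t, _, w in pts)
    swy = sum(w * y for _, y, w in pts)
    swxx = sum(w * t * t for t, _, w in pts)
    swxy = sum(w * t * y for t, y, w in pts)
    # closed-form WLS solution for slope (beta_0) and intercept (beta_1)
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx**2)
    intercept = (swy - slope * swx) / sw
    return intercept, slope
```

A clearly positive fitted slope, as in the Madrid application below, indicates record probabilities above the i.i.d. value 1/t.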
Analysing the effect of global warming in records of daily temperature

The tests and graphical tools described in the previous sections are used to analyse the effect of global warming on the records of the series presented in Section 2, the daily temperature series in Madrid.
Joint analysis of the tails. Given that the complete series is available, and that our first aim is to study the existence of non stationary behaviour in the tails of daily temperature, we start by applying the tests based on different types of records. Taking into account the power study, the tests S4 and B4 are the most appropriate to assess the hypothesis H_0 : p_tm = 1/t. The resulting p-values, 2.4e-07 and 7.0e-08, lead to the conclusion, at any usual significance level, that the probability of record is higher than in an i.i.d. series. This means that there is evidence of non stationarity in the occurrence of records, due to an increasing trend. Figure 7 shows the mean number of records D̄_t versus time; it allows us to identify when the non stationary behaviour appears and to characterise it over time. It is significant from 1980, and the value of the statistic shows an increasing trend from 2000.
Separate analysis of the upper and the lower tails. We are also interested in studying whether this non stationary behaviour appears in both tails of the distribution or only in one of them, and whether it is equally strong in both cases. To analyse separately the behaviour of the upper and the lower records, we use the statistics S − S_B and S_BL − S_L respectively. The resulting p-values are 1.5e-06 and 0.006 so that, although there is evidence of non stationary behaviour in both tails, it is clearer in the upper records.
In order to study in more detail where the non stationary behaviour appears, Figure 8 is considered. Figure 8 (right) shows the estimated probabilities of upper record t p̂_t for each year t, together with the regression line and the confidence band. In an i.i.d. series the slope of the regression line should be zero, while a positive slope is observed. In addition, many of the estimated probabilities are outside the CI from around 1980. This plot allows us to identify the specific years where the probability of record is much higher than expected. Similar plots can be made for the other types of records, and they confirm the previous conclusions.

Analysis of the tails by season.
To study if the non stationary behaviour differs across the seasons of the year and to identify the periods where it is stronger, the previous tests are applied separately to the four seasons of the year: winter (DJF), spring (MAM), summer (JJA) and autumn (SON).
The resulting p-values are summarised in Table 3. If both tails are analysed jointly, using S4 and B4, the non stationary behaviour is significant in all the seasons except spring. However, if we study only the upper records, with S − S_B, the evidence of a trend is strong in summer and autumn but weak in winter. Concerning the lower records, with S_BL − S_L, only in winter is there evidence of a decrease in the lower records. The absence of changes in the lower records may be caused by an increase in variability. Figure 9 shows the cumulative number of records by season. The plot for spring suggests that, although the joint tests are non significant, the number of upper backward records is significantly lower than expected, but this is compensated by the higher than expected number of lower forward records. Analysing two separate periods, it is concluded that this is due to a decreasing trend before 1970 and an increasing trend afterwards.

Conclusions
In the context of global warming, the interest of analysing the existence of non stationary behaviour in the tails of a series, and in particular in its records, is clear. This work develops several statistical tests and complementary tools to detect this type of behaviour in climate series, using the properties of the occurrence of records in i.i.d. series. More precisely, the tests assess the null hypothesis H_0 : p_tm = 1/t versus the alternative H_1 : p_tm > 1/t. From a methodological point of view, the following conclusions are obtained.
- The approach proposed to arrange the data, based on splitting the series, solves two usual problems of climate series: seasonal behaviour and serial correlation. It also allows us to analyse simultaneously both the increase of the highest temperatures and the number of warm days, while maintaining a daily scale.
- A family of six tests based on the upper records is proposed. N is based on the number of records, and N_w and N_wS are weighted versions of it, the latter using an estimate of the variance. S is based on the likelihood function. Assuming that the M series have the same distribution, two statistics based on the score and the likelihood ratio, T and R, are considered. Asymptotic distributions are obtained for most of these statistics. In addition, the Monte Carlo method can be used to estimate the p-value in all the cases, since they are pivotal statistics. This method is used to check the validity of the asymptotic distributions, revealing that they are valid even for M = 1 and low values of T. We conclude that the statistics that are linear weighted combinations of the variables S_t, N_w and S, are the most powerful. The statistic S, whose weights are analytically justified, is proposed as the best test based on the upper records.
- The second family of tests aims to join the information of different types of records: the upper and the lower records of the forward and the backward series. Four statistics, S2 and S4 (based on joint statistics) and F2 and B4 (based on joint p-values), which include two or four types of records, are considered. A power study shows that the union of two or more types of records clearly improves the power of S, based only on the upper records. The union of the statistics or of the p-values yields tests with a similar power. The power of the joint tests with N(0, 1) series is high even with small sample sizes and weak trends. The power is higher for series with one or two-side bounded distributions, or distributions with one or two light tails, such as the GP and GEV often used in climate analysis. For GP series, even for a weak trend θ = 0.005, the power of B4 is higher than 0.9 with T = 50 and M = 4. The approaches suggested to join the information of different types of records are very flexible. They allow us to define other statistics to study specific features, such as the non stationary behaviour only in the upper or the lower tail, or to give more weight to a particular type of records.
- Of all the considered record tests, B4 and S4 are the most powerful. These tests have important advantages that make them specifically useful in the analysis of global warming. First, they have a high power to detect weak trends. They are non parametric and require few assumptions, so that they can be applied in a wide range of situations. Moreover, they are able to join the information of M independent series, a property that is useful to deal with seasonal behaviour and also to carry out spatial analyses.
The tests are complemented with graphical tools that aim to characterise where and when the non stationary behaviour occurs. Finally, all the tests and tools are easy to apply, and are implemented in the R-package RecordTest.
The proposed inference tools are used to analyse the effect of global warming on the extremes of daily temperature in Madrid. It is concluded that there is strong evidence of non stationary behaviour in the tails of the distribution, which affects the occurrence of records. This non stationary behaviour is statistically significant from around 1980, and it increases from 2000. It is stronger in the upper tail, especially in the last years of the observed period. The behaviour is not homogeneous among seasons: it is significant in all the seasons except spring. If we focus on the behaviour of the lower tail, it is only significant in winter.
It is noteworthy that the proposed inference tools are not only useful to analyse the effect of global warming in the extremes of observed series. They are also a necessary tool in the validation of climate models which, to avoid important biases in their conclusions, should include an analysis of their capability to reproduce the most extreme values. The tests can also be used in other fields where the study of records is important, such as hydrology, finance, etc.

Figure 1
Daily maximum temperature and lower (blue) and upper (red) records, Madrid. (Left) Slopes of the linear trend in Tx for each day of the year; red stars mark the days whose series can be considered independent. (Right) Annual mean of Tx and upper (red) and lower (blue) records, Madrid.

Figure 3
Power analysis of tests based on upper records using N(0, 1) series.

Figure 4
Power analysis of tests based on four types of records using N(0, 1) series with the same trend.

Figure 5
Power analysis of tests based on four types of records using N(0, 1) series with different trends.

Figure 6
Power analysis of tests based on four types of records using GP(σ = 1, ξ = −0.1) series.

Supplementary Files
This is a list of supplementary files associated with this preprint: SupplementaryMaterial.pdf