Systematic investigation of skill opportunities in decadal prediction of air temperature over Europe

Decadal Climate Predictions (DCP) have gained considerable attention for their potential utility in promoting optimised plans of adaptation to climate change and variability. Their effective applicability to a targeted problem is nevertheless conditional on a detailed evaluation of their ability to simulate the near-term climate evolution under specific conditions. Here we explore the performance of the IPSL-CM5A-LR DCP system in predicting air temperature over Europe, by proposing a systematic assessessment of the prediction skill for different time windows (periods of the calendar time, forecast years and months/seasons). In this framework, we also compare raw and de-biased hindcasts, in which the temperature outputs have been corrected using a quantile matching method. The systematic analysis allows to discern certain conditions conferring larger predictability, which we find to be intermittent in time. The predictions appear more skilful around the 1960s and after the 1980s, in coincidence with large shifts of the Atlantic Multidecadal Variability, which are well reproduced in the hindcasts. Averages on longer forecast periods also generally imply better prediction skill, while the best predicted months appear to be mainly those between late spring and early autumn. Moreover, we find an overall added value due to initialisation, while de-biased predictions significantly outperform raw predictions only for a few specific time windows. Finally, we discuss the potential implications of the proposed systematic exploration of skill opportunities in DCPs for integrated applications in climate sensitive sectors.


Introduction
One of the biggest scientific challenge for the twenty-first century concerns the capacity of simulating the future climate evolution through numerical models (Dutton 2002). Climate change and variability is threatening many natural ecosystems (e.g. IPCC 2014), and is having a progressively stronger impact on human society by affecting a wide range of sectors (Arent et al. 2014), like agriculture, fishery, energy, tourism, and transport to name but a few. Knowing a few years in advance how the climate will vary with a certain degree of accuracy is crucial for stakeholders and policymakers (Trenberth et al. 2016), as it potentially allows for well-timed adaptation efforts. Predictions up to 10 years ahead, hereafter Decadal Climate Predictions (DCP), are thus increasingly demanded (Kushnir et al. 2019) as they deal with time ranges typically relevant for infrastructures, long-term investments and other business plans (Dessai and Bruno Soares 2015). For this reason, both public and private sectors are growingly encouraging the development of operational climate services based on near-term predictions (Vaughan and Dessai 2014;Buontempo et al. 2014;Street 2016). Climate services aim at providing customized products for decision makers in climate-sensitive sectors, thus creating a bridge between the academic world and the end-users by translating scientific outcomes into targeted 1 3 information (Goddard 2016; Giannini et al. 2016). Yet, the development of a climate service needs to be based on trustable climate predictions (Lemos et al. 2012;Mehta et al. 2013), and so its effective operability requires a scrupulous evaluation of the actual skill of the existing prediction systems in simulating the near-term climate evolution, notably in those specific contexts that are relevant for the targeted analysis (Bruno Soares et al. 2017). As an example, let us consider the tourism sector: a climate service for ski resorts in the Alpine regions should demonstrate skilful predictions of snow fall during the winter in mountains areas, while a climate service for seaside activities in the Mediterranean region should rather demonstrate skilful prediction of temperature and precipitation in coastal areas during summer. This implies that forecasts based on the same DCP system may show a wide range of confidence when applied to these very different scopes, as the prediction skill considerably depends on a multitude of contingent factors like the relevant climatic variable, the specific season, the particular period, and the region under investigation. The rationale behind this work is therefore to propose a prototype for the identification of the optimal "opportunities" for integrated applications of DCPs. Here, we explore the potential expressed in this respect by the IPSL-CM5A-LR model in decadal predictions of air temperature over Europe, but the same approach might be potentially used to evaluate the potential reliability of other DCP systems in simulating any other climatic variables. In this context, the Copernicus Climate Change Services (https:// clima te. coper nicus. eu/) will soon include operational decadal predictions implemented by different Institutes, which will possibly allow for an optimal selection of model experiments as a function of the specific study.
The way in which the future climate is simulated depends on the considered time scale. On short time scales in the order of days, i.e. O(days), meteorological forecasts are socalled initial condition problems since the prediction primarily depends on the internal state of the climate system at the beginning of the prediction. On long time scales, i.e. O(century), climate projections are essentially so-called boundary condition problems since they primarily depend on the response to an external forcing. The forcing includes both anthropogenic changes in atmospheric greenhouse gas and aerosol concentrations and natural changes such as modulations of the solar insolation and volcanic eruptions. In between these time ranges, near-term DCP are a mix of initial and boundary conditions problem (Murphy et al. 2010). Indeed, the climate evolution over a time horizon of 1-10 years is basically the combined result of (i) the external forcing, and (ii) the unforced internal variability, coming from the intrinsic variations of the climate system. Over Europe, the climate variability on decadal timescales is largely modulated by the Atlantic Multidecadal Variability (AMV) and the North Atlantic Oscillation (NAO). The AMV (Kerr 2000;Sutton and Hodson 2005) is a mode of climate variability affecting the Sea Surface Temperature (SST), characterised by fluctuations between anomalously warm and anomalously cool phases, with enhanced energy in the inter-decadal band (Dima and Lohmann 2007). While the AMV appears to be linked with the variability of the Atlantic Meridional Overturning Circulation (AMOC) (Yeager and Robson 2017;Oelsmann et al. 2020), its main drivers are still uncertain. Indeed, over the historical era, AMV-like decadal fluctuations can be potentially attributable to intrinsic variability in absence of external forcing (Knight et al. 2005;Ting et al. 2009), or to a response to external forcing, e.g. changes in aerosol and greenhouse gas concentrations of anthropogenic origin (Booth et al. 2012;Bellucci et al. 2017) and natural events like volcanic eruptions (Ottera et al. 2010;Swingedouw et al. 2015Swingedouw et al. , 2017Borchert et al. 2021). The switch between positive and negative phases of the AMV causes significant climatic impacts over Europe, leading to distinct mean temperature and precipitation patterns, most notably during summer (Sutton and Hodson 2005). Aside from the AMV, the NAO is an atmospheric mode of variability of the flow patterns over the North Atlantic Ocean, which has important impacts on the weather and climate in Europe, notably during winter (Hurrell et al. 2013). The NAO variability has a weak red spectrum (Wunsch 1999), and it is characterised by statistically significant decadal variations, which in general appear to be out of phase with the AMV signal Omrani et al. 2014). These modes of decadal variability overlap with the long-lasting warming signal due to increased greenhouse gas emissions, thus shaping the near-term climate variations by either amplifying, smoothing or even inverting the long-term trend.
DCPs are designed to account for both internal modes of variability and the effects of changing anthropogenic emissions and natural phenomena. They consist of O(10) years experiments forced by external boundary conditions, and starting from a specific climate state constrained by observations, which represents the imposed initial conditions of the dynamical system. This initialisation supplies the main potential added value of DCPs with respect to climate projections (Boer 2004). A variety of techniques and methodologies of initialisation have been developed by the different modelling groups, influencing the predictability of a specific phenomenon (Matei et al. 2012;Menary et al. 2015). In general, DCPs mostly rely on the large thermal inertia provided by the ocean for the climate system, which may provide a "memory" of O(10) years at the surface and of O(10 3 ) years at depth. Initial conditions are primarily obtained by assimilating "full-field" or "anomalies" of a set of oceanic parameters from observational data (Smith et al. 2013). For example, the IPSL-CM5-LR DCP system Mignot et al. 2016) is based on initial conditions constructed from observed SST anomalies, while both SST and Sea Surface Salinity (SSS) anomalies are used for the IPSL-CM6A-LR DCP (Estella-Perez et al. 2020). More sophisticated methods of initialisation also imply the assimilation of atmospheric variables (see Meehl et al. 2014 for a detailed summary of the different methods of initialisation used within CMIP5 models). By exploiting the slowly evolving interactions within the climate system, the synchronisation of the model initial conditions with the actual climate state might constrain, at least for a certain time window, the stochastic evolution of the climate system in the simulation. Therefore, the key question in the evaluation of DCP is whether the initialization process is able to effectively align the phase of the modelled and observed internal variability, thus producing in principle more skilful simulations than climate projections over a time scale of one decade.
Prediction skill is typically assessed by comparing the hindcasts, i.e. retrospective predictions initialised at a given past climate state, with the corresponding observational data over the common period. Pioneering studies showed the potential skill improvement through initialisation (Collins et al. 2006;Smith et al. 2007;Keenlyside et al. 2008), thus favouring the development of a coordinated protocol for decadal prediction systems within the Coupled Model Intercomparison Project (CMIP5, Taylor et al. 2012). This first multi-model approach allowed a more robust assessment of the impact of the different initialization strategies. In this framework, added value with respect to climate projections have been clearly shown for the 1-10 years predictions of different climatic variables over different regions (van Oldenborgh et al. 2012;Kim et al. 2012;Doblas-Reyes et al. 2013;Bellucci et al. 2015), and notably over the North Atlantic (Branstator and Teng 2012), where both the AMV and AMOC show large potential predictability (Garcia-Serrano et al. 2015). Also, it has been shown that predictions based on a multi-model ensemble mean are more skilful than predictions with individual models (Bellucci et al. 2015). However, multi-model analyses and inter-model comparisons imply a standard procedure for DCP evaluation, which may prevent a detailed exploration of the full range of potential predictability shown by a single DCP system as well as gaining understanding in the physical sources of predictability. Indeed, in previous studies, the skill metrics are typically (but not exclusively) calculated on an annual basis, i.e. by considering the yearly means, over standard lead-time years-usually 1-5 years and 6-10 years-and for the whole period of the available hindcasts, which typically start in 1960 (e.g. Mignot et al. 2016). Yet, the prediction skill can be strongly dependent on (i) the specific period considered, (ii) the prediction lead-time used and (iii) the specific seasons of the year in focus. In this regard, the notion of "windows of opportunity" has been recently adopted for DCPs (Dessai and Bruno Soares 2013;Mariotti et al. 2020), pointing out that specific periods in the past may confer greater predictability than others (Brune et al. 2018;Borchert et al. 2019;Mariotti et al. 2020). Furthermore, the predictability of a specific phenomenon has been shown to be strongly dependent on the region and on the considered time scale (Boer 2004(Boer , 2011van Oldenborgh 2012). At the same time, the predictions' quality may be strongly dependent on the considered season (e.g. Yeager et al. 2018). For example, over Europe, as the summer climate is mainly modulated by the AMV, while in winter it is largely influenced by NAO variations, the different model abilities in reproducing and predicting these two modes of variability and the associated teleconnections may lead to different predictability over Europe for the different seasons. In turn, the predictability of these modes of variability has been shown to be linked with the background climate mean state and variability (Qasmi et al. 2017), leading to an enhanced prediction skill over Europe for specific periods and forecast times. The confluence of factors implying a larger DCP skill may enhance its usability for specific contexts, with implications for the optimisation of climate services. While the existence of some of these factors has been identified in previous studies (e.g. Qasmi et al. 2017;Yeager et al. 2018;Mariotti et al. 2020), their systematic evaluation is missing so far.
The scope of the present study is to assess the potential offered by the DCP system based on the IPSL-CM5A-LR model in predicting air temperature over Europe. Previous studies using predictions with IPSL-CM5A-LR DCP system already showed skill in predicting the AMOC ) and the AMV (Mignot et al. 2016). Due to their influence on the European climate (e.g. Persechino et al. 2013), this suggests the likely existence of some skill also in predicting the near-term temperature over the continent. Here, we explore the possible existence of specific time windows for which the DCP exhibits a higher predictability. These potential windows of opportunity are defined by particular time periods, whose background climate mean state and variability can favour the DCP predictability. Moreover, we analyse the benefits of the initialisation procedure and its time duration, as well as the impact of seasonality on predictability. In this way we extend the notion of windows of opportunity to forecast periods and months of the year.
The expected outcomes of the study are to provide supporting information for the development of climate services. In this framework, potential limitations are the systematic biases in mean state and variance that intrinsically affect the climate models (e.g. Haerter et al. 2011;Maraun 2012;Bilbao et al. 2021), which can produce imprecise assessments of climate evolution and its impacts. To address such a potential limitation, different bias-adjustment techniques have been developed (e.g. Michelangeli et al. 2009). Debiasing consists in adjusting raw model data to calibrate their statistical properties with those of the corresponding 1 3 observational data. Its benefit has been already tested for long-term impact analysis of climate change over specific regions, e.g. West Africa (Famien et al. 2018). While data adjustment reduces the error due to model biases, its effects on DCP skill are still unexplored. For this reason, we also systematically compare, for the first time, raw and de-biased hindcasts, and we evaluate whether the data adjustment, beyond correcting the mean state bias of the predictions, is also beneficial in terms of prediction skill. Our approach, detailed in Sect. 2, is based on a systematic analysis of the prediction skill of raw and debiased hindcasts when simultaneously varying the initialisation periods, the prediction lead times and the predicted months of the year. The main features of skill over Europe and its 7 sub-regions are illustrated in Sect. 3, as well as the pattern of the added-value due to initialisation and the skill improvement due to de-biasing. In Sect. 4 we finally discuss the potential implications of our main findings and stress the utility of this type of analysis for the optimal development of climate services based on DCPs.

The climate model
The decadal predictions analysed in this study have been performed with the IPSL-CM5A-LR model (Dufresne et al. 2013) developed at the Institut Pierre Simon Laplace (Paris). It is a global general circulation model consisting of the coupling between atmospheric and oceanic systems. The atmospheric component is based on the LMDZ5A model (Hourdin et al. 2013) which, for the Low Resolution (LR) configuration, consists in 96 × 95 grid points corresponding to a resolution of 3.75 × 1.875 and 39 vertical levels. The ocean component is based on the NEMOv3.2 model (Madec 2008), with a horizontal resolution varying from 0.5° to 2° and 31 depth levels varying from 10 m thickness near the surface to 500 m at depth. The model also includes the seaice module LIM2 as well as the biogeochemical module PISCES (Aumont and Bopp 2006). The IPSL-CM5A-LR model has been set to produce both an ensemble of climate projections and an ensemble of decadal predictions. The latter employs the progressive imposition of initial conditions, which have been produced by means of a simple assimilation technique consisting of nudging to observed surface SST anomalies (cf. Mignot et al. 2016). It is important to specify that, while decadal predictions with the latest IPSL-CM6A-LR model version have been recently released, this latest version was not available at the time of this study, when de-biasing adjustment was carried out. Since one of the aims of this study is to assess the potential improvement due to a quantile de-biasing adjustment, our analysis here is exclusively based on the IPSL-CM5A-LR version.

Simulations and validation dataset
We analyse monthly averaged temperature from a set of 2 different model experiments, namely (i) the non-initialised historical experiments (HIS), and (ii) the initialised decadal predictions experiments (DCP). For the skill metrics calculation (see Sect. 2.5), we compare these temperature model outputs with observation-based temperature data (OBS), i.e. NOAA-20CR reanalyses data (Slivinski et al. 2019), interpolated on the model grid. NOAA-20CR is a global gridded data set consisting of 56 different members.
Here we use their ensemble mean. While reanalysis data imply the use of a model and thus the possible inclusion of errors due to intrinsic model biases, it is worth stressing that NOAA-20CR temperature data over Europe are significantly consistent with other observational data not implying the use of a model, e.g. the correlation with HadCRUT4 interpolated temperature observational dataset is 0.95 (p < 0.05) for 1960-2014 de-trended monthly anomalies averaged over Europe (not shown).
For both simulations and validation datasets we consider monthly temperature anomalies, which have been computed by removing the corresponding monthly climatology calculated over the common period 1961-2014.
The historical experiments (HIS) are extracted from the CMIP5 database and consist of 3 members running from 1850 to 2005. The initial conditions are obtained randomly from a 1000-year control simulation based on stationary preindustrial climatological forcing. The HIS external boundary conditions consist of the prescribed radiative forcing estimated from observed aerosol and greenhouse gases concentrations in the atmosphere since 1850 as well as changes in ozone and land-use, and the effects of solar radiation and volcanic eruptions over this period. The HIS ensemble will be referred to as "non-initialised", since it does not start from an observed climate state. In conformity with the hindcasts, we only consider the part of the experiments after 1961. Moreover, after 2005, we prolonged this experiment until 2014 by using the RCP4.5 scenario (Taylor et al. 2012).
The initialised DCP experiments are directly derived from nudged experiments, which are based on a data assimilation aimed at adjusting the SST anomalies towards observational i.e. ERSST (Reynolds et al. 2007), anomalies. The nudged experiment is a "constrained" experiment in which, under the same boundary conditions as in the HIS experiments, a SST anomaly term is added into the conservation equations for SST to adjust the heat flux at each model time-step t . The restoring term is expressed as follows: where γ = 40 Wm −2 K −1 is the restoring coefficient corresponding to a relaxation time-scale of around 60 days for a mixed layer of 50 m, SST ' MOD is the modeled SST anomaly, and SST ' ERSST is the measured SST anomaly with respect to the climatological mean obtained from ERSST over the overlapping period. From the constrained nudged simulations after 1960, ensembles of 3 members of 10-year freerunning simulations have been launched from December 31 of every year until 2013, without any constraint applied on the SST. These free-running simulations are still constrained by the external forcing as in the corresponding portions of the HIS experiments. The DCP members are separated by adding a white noise perturbation to the SST field at the time of initialization, chosen randomly at each grid point between − 0.05 and 0.05 °C, thereby mimicking the unpredictable part of the climate signal. This specific protocol is described in further details in Mignot et al. (2016).

De-biased predictions
In this study we analyse both raw temperature predictions and de-biased temperature predictions obtained through a bias adjustment of the raw DCP dataset. Here we use adjusted data provided by "the Climate Data Factory" (https:// thecl imate dataf actory. com/), whose de-biasing procedure relies on the Cumulative Distribution Function transform (CDF-t) method (Michelangeli et al. 2009;Vrac et al. 2016;Famien et al. 2018). It is based on the quantile mapping method (Vrac and Friederichs 2015), consisting of an adjustment of the raw simulated temperature through a transfer function, such that its cumulative distribution function (CDF) matches the observation-based one over a calibration period. Also, the CDF-t represents a variant of the quantile-quantile method as it also accounts for changes of CDF between the correction period and the calibration period (Michelangeli et al. 2009), and was already adopted and validated for various applications (Oettli et al. 2011;Lavaysse et al. 2012;Vautard et al, 2013;Vigaud et al. 2013). For the present evaluation, the reference data for the calculation of the transfer function have been obtained by interpolating the NOAA-20CR reanalysis (Slivinski et al. 2019) over the IPSL-CM5A-LR spatial grid. Moreover, in order to make the different periods used for skill assessment completely independent on the calibration period of the de-biasing process, data correction has been performed separately for two different periods, whose transfer functions were, in turn, calculated over two independent periods, e.g. 1961-1987 and 1988-2014 for lead-time of 1 year.

Data organisation
While HIS and OBS datasets are continuous time-evolving data, raw and de-biased DCP experiments are multiple 10-year simulations starting every year. Therefore, in order to make all the datasets conform to each other, HIS and OBS datasets have been first organized to mimic the DCP outputs, by decoupling them as multiple 10-year pseudopredictions according to the start dates. In this way, the OBS temperature over a generic year Y, corresponds to the OBS pseudo-prediction starting from year Y-LT, where LT is the lead time. For example, the OBS temperature in 1981 corresponds to the OBS pseudo-prediction initialised in 1980 for LT = 1 year, in 1979 for LT = 2 years and so on. Finally, over the common period 1961-2014, we compute all the dataset as continuous time-evolving monthly temperature in function of their individual lead-time years LT and months M. The resulting time-series, hereinafter named principal time series, are at the base of the systematic calculation of skill scores (see Sects. 2.6 and 2.7). Indeed, for each dataset, the different linear combinations of these 120 principal time series, along with the selection of the period of initialisation P, allows to define all the possible combinations of time windows, i.e. forecast periods and multi-months periods (see the table in the Appendix and further discussion in Sect. 2.7).

De-trending
In order to exclude the contribution of the long-term radiative forcing from the skill assessment, which explains a large part of the skill in decadal forecasts (van Oldenborgh et al. 2012), our analysis is exclusively based on de-trended datasets. For each dataset, we remove a linear trend calculated over the overlapping period from all the principal time series defined above. Note however that both the external radiative forcing and the climate response are in reality non-linear (see discussion in Garcia Serrano et al. 2015), so the residual signal represents just an approximation of the un-forced component of the near-term temperature evolution.

Skill metrics
The skill evaluation is based on the comparison between the monthly temperature anomalies of the simulations dataset, i.e. HIS and DCP, and validation dataset, i.e. OBS (see Sect. 2.2). Here we define the prediction skill scores by means of two different verification metrics, namely the anomaly correlation coefficient (ACC), and the root mean square error (RMSE). For the comparison of datasets that have been averaged over different time windows, the latter metric is calculated after having standardised the variance of the time series. Indeed, the averaging process over different forecast times and different months corresponds to a linear combination of different so-called principal time series, so that the variance of time series averaged over longer time windows (e.g. for LT = 2-9 years, for given M and P) is, on average, intrinsically lower than the variance of time series averaged over shorter time windows (e.g. for LT = 2-3 years, for given M and P). This prevents the direct comparison of the RMSE calculated over different LT and M, as their difference may be just the result of this numerical artefact. Therefore, we adjust all the time series that are a combination of different principal time series such that their variance matches the mean of the variances of the principal time series composing it. In this way, the RMSE calculated for these scaled time series (which hereinafter we will refer to as RMSE*) is not affected by the intrinsic differences of variance due to the averaging process. Note that this manipulation is not necessary for ACC, as its calculation implies a standardisation of the data.
The statistical significance of the ACC metric is evaluated through a one-sided Student's t-test, for which the effective degrees of freedom have been calculated taking into account the serial autocorrelation (Bretherton et al. 1999). The test on the significance of the difference between two correlations values is based on the Fisher z-transformation. Finally, the statistical significance of the difference between the RMSE from two different datasets is evaluated through the Welch's t-test.

Systematic analysis
The de-trended principal time series for DCP, HIS and OBS datasets defined in Sects. 2.4 and 2.5 are at the base of the systematic analysis of the DCP skills. Starting from them, (i) we partition the whole period of initialisation 1960-2012 to obtain 28 different 26-year moving initialisation periods P; (ii) we combine the 10 different individual forecast years to obtain 55 different combinations of consecutive (single or multiple) lead-time windows LT, i.e. the so-called forecast periods; (iii) we extract all the 78 different combinations of consecutive predicted months M. This procedure defines a three-dimensional matrix of time windows accounting for all the combined configurations of P, LT and M, hereinafter referred to as contexts (see table in the Appendix for their definition). The systematic approach proposed here is aimed at evaluating the temperature prediction skill score S of DCP (both raw and de-biased) and HIS for each of the defined contexts. In other words, we analyse the function S = f(M,LT,P), where the time windows P, LT and M are considered as independent variables. The choice of 26-year moving periods for the P variable is justified by the fact that it is the largest length of years for which the first initialisation period  has no common time step with the last initialisation period . Results with different moving periods (35-year and 44-year lengths) have been also analysed, and are qualitatively similar to those that will be presented here (not shown).

Prediction skill for a reference context
To establish a reference point in our systematic analysis, we evaluated the skill of the IPSL-CM5A-LR model in predicting air temperature over Europe for the reference context. In Fig. 1 we thus show the spatial distribution of the ACC and RMSE scores for the (i) period P identifying predictions initialised every year from 1960 to 1985, (ii) forecast time from the first to fifth year of prediction (LT = 1-5 years) and (iii) predicted annual temperature means (M = Jan-Dec). Non-initialised historical simulations (Fig. 1a, d) exhibit only limited skill concentrated over the Mediterranean sector, although the ACC is not statistically significant at the 95% level (and thus not visible in Fig. 1a), as for the rest of Europe, which is characterised by both low ACC and relatively high RMSE. Raw predictions (Fig. 1b, e) clearly appear more skilful than the historical experiments. Added value with respect to HIS simulation is statistically significant over most of the land surface north of 45° N when ACC is considered, while RMSE is significantly lower in DCP than in HIS over three main spots: the U.K., the central part of Europe and the region between the Black and Caspian Seas. The ACC becomes significantly positive over most of the Central sectors of Europe, also including the southern part of Scandinavia as well as the peninsular part of Italy, the Balkans and the regions surrounding the Black and Caspian Seas. Finally, de-biased predictions (Fig. 1c, f) exhibit a further improvement of the skill with respect to the raw predictions. The ACC skill for the de-biased DCP is generally higher than for the raw DCP, although such an improvement is statistically significant just for a few grid points, e.g.
in Scandinavia and in the Hellenic Peninsula. At the same time, the RMSE for the de-biased DCP is generally lower than for the raw DCP, notably over Scandinavia where this difference is statistically significant.
To extend the detection of the skill opportunities beyond this reference context, we systematically calculate the skill metrics for different time windows, i.e. for different combinations of P, LT and M. We first focus this analysis on the raw DCP dataset, while we extensively evaluate the added value due to initialisation and the effects of de-biasing in Sect. 3.4, where the performance of the raw DCP have been systematically compared with the performance, respectively, of the HIS simulations and of the de-biased hindcasts.

Skill at varying P, LT and M
We now analyse how the prediction skill illustrated in Fig. 1 changes when the independent variables P, LT and M are successively varied (Fig. 2). For this, we consider time series of air temperature averaged over Europe (see Sect. 2.7 for the definition of its boundaries) and we calculate ACC and standardised RMSE* (see Sect. 2.6) for all possible combinations of consecutive P, LT, and M. For the fixed standard LT = 1-5 years and M = Jan-Dec, ACC skill is statistically significant for most of the 26-yr initialisation periods P (Fig. 2a), but for those starting from 1972 and 1979. For P = 1972-1997 ACC score is the lowest (0.26, p > 0.05) while best ACC is found for P = 1961-1986 (0.52, p < 0.05). Concomitantly, RMSE varies between 0.75 and 0.9 (Fig. 2a, d). The modulation of skill on P (Fig. 2a, d) appears to be relatively less marked than the one on LT (Fig. 2b, e) and M (Fig. 2c, f), at least for the time windows analysed so far. This partly reflects the fact that the different initialisation periods P largely overlap. For P = 1960-1985 and M = Jan-Dec (Fig. 2b, e), the best ACC along the LT axis (for P = 1960-1985 and M = Jan-Dec) is found for Fig. 1 Skill scores exhibited by the different experiments in predicting the 1-5 years annual mean air-temperature over Europe for simulations started along the period 1960-1985. The predictability is defined by two different the skill score metrics, i.e. the ACC (upper panels) and the RMSE (lower panels). For ACC maps, only correlations that are statistically significant at the 95% confidence level have been displayed. The left panels show the skill scores for historical simulation, the middle panels show the skill scores for the raw hindcast and the right panels show the skill scores for the de-biased hindcast. The circles on the central and right panels indicate, respectively, a statistically significant skill improvement (p < 0.05) due to initialisation (with respect to HIS skill) and a statistically significant skill improvement (p < 0.05) due to de-biasing (with respect to raw DCP skill) 1 3 LT = 1-8 years where it reaches 0.74 (p < 0.05), while the worst ACC skill is found for LT = 7 years where it is − 0.24 (p > 0.05). Similarly, the best RMSE* is also found for LT = 1-8 years where it measures 0.59, while the worst RMSE* skill is found for LT = 10 year where it is 1.17. On the M axis, the evolutions of ACC and RMSE* (Fig. 2c, f) appear as a sequence of parabolic-like curves with vertexes centred over those combinations including the late spring and early autumn months. Best ACC and RMSE* scores along the M axis (for P = 1960-1985 and LT = 1-5 years) are found for predicted months M = May-Sep when they respectively measure 0.78 (p < 0.05) and 0.42, thus evidencing a certain degree of conformity between the two metrics used. The worst ACC skill is found for predicted months M = Dec, for which it is − 0.22 (p > 0.05), while the worst RMSE* skill is found for predicted months M = Jan, for which it is 1.51.
A more comprehensive view on the function S = f(P,LT,M) is given in Fig. 3, where two of the independent variables are changed while the third is held constant. These two-dimensional representations qualitatively confirm most of the features shown in Fig. 2. The predictions appear in general more skilful over the extended summer season (from late spring to early autumns) and over longer forecast  Table 1 in the Appendix for a more detailed explanation periods (Fig. 3a, c, d, f). Furthermore, for forecast periods implying an average over the same number of years, the skill appears to be unsurprisingly larger for those lead times that are closer to initialisation time, e.g. LT = 1-5 years shows better skills than LT = 6-10 years. When considering the summer months (Fig. 3b), there is significant ACC skills for the 26-yr periods starting around 1965 and 1980, punctuated by a skill degradation for P beginning between about 1970 and 1980. For example, for LT = 1-5 years and M = May-Sep, the ACC is 0.77 (p < 0.05) for the period 1961-1986, 0.17 (p > 0.05) for the period 1972-1997 and 0.65 (p < 0.05) for the period 1982-2007 (Fig. 3b). For the same periods P, the RMSE* is respectively 0.41, 0.74 and 0.51 (Fig. 3e). When fixing M = Jan-Dec (right panels of Fig. 3), higher ACC and RMSE* skills around 1965 and 1980 are also found for those contexts implying longer time averages and including the first prediction years (Fig. 3c,   f). In general, the peaks and troughs of skill in the threedimensional matrix identify clusters of points characterised by high or low predictability, as all their adjacent points exhibit similar scores. This feature gives confidence in the robustness of the results. The modulation of the predictability for various initialisation periods will be interpreted in the light of the background variability in Sect. 3.3.
Overall, Fig. 3 evidences that, for specific time windows, the prediction skill is significantly higher than for the reference context (cf. black circles in the Fig. 3). In turn, the prediction skill for the reference context is also significantly higher than the skill for other specific time windows (cf. crosses in the Fig. 3). Yet, only part of all the possible 120,120 time windows are shown in Fig. 3. From the three-dimensional matrix of skill scores over the full European domain, it is possible to extract numerically the conditions of best performance for the prediction of  , (middle panels) period P and months M (for LT = 1-5 years), and (right panels) period P and lead-time LT (for M = Jan-Dec). Prediction skills have been obtained by comparing de-biased temperature hindcasts averaged over Europe with the temperature observations averaged over the same region. The skill metrics are the ACC (upper panels) and the RMSE* (lower panels). Only correlations that are sta-tistically significant at the 95% confidence level have been displayed. The violet squares indicate the reference context. Black circles indicate that the skill is significantly (p < 0.05) higher than the skill for the reference context. Grey crosses indicate that the skill is significantly (p < 0.05) lower than the skill for the reference context. Dashed grey lines on M and LT axes correspond to the principal time-series (see Table 1 in the Appendix) near-term air temperature. When the entire continent and all the possible combinations of P, LT and M are considered, the best skill is found for predictions over the period 1980-2005, for LT = 1-9 years and considering the months of the year from June to October. This is largely consistent with the previous findings evidencing that prediction skill scores are higher for an extended summer season, and for forecast periods of several years. Specifically, at the grid point level (Fig. 4a), the ACC scores for this optimal configuration are generally higher than the scores under the standard context (Fig. 1b), with the area characterised by a significant correlation showing an expansion over the Atlantic sector of Europe and part of Iberian Peninsula. Indeed, over these regions, the ACC score is significantly higher than the ACC for the reference context. Most of the western and central part of the continent shows a significant ACC, while poor skill persists over the north-eastern regions and over the southern part of Iberian Peninsula.
It is important to note that the procedure of averaging the air temperature over a relatively large area gives an indication of the predictability over Europe, yet does not allow capturing all regional features of such predictability. Therefore, following the same systematic approach used so far for the whole European region, we additionally considered 7 different sub-regions of Europe (Fig. 4b-h). Such partition shows that significant skill can be found everywhere in Europe under certain conditions of P, LT and M (Fig. 4b-h). The predictability is significantly dependent not only on these independent variables, but also on the specific area considered, as the conditions for the best skill vary from one region to another. Nevertheless, a common feature is that their best performance coincides with the following the systematic approach defined in Sect. 2, implying the use of ensemble mean air temperature averaged over the corresponding regions, here delimited by the black boxes. Only correlations that are statistically significant at the 95% confidence level have been displayed. The circles indicate, a statistically significant ACC increase (p < 0.05) with respect to the ACC for the reference context (Fig. 1b) simulation of summer months. Indeed, the conditions for the best ACC always include at least the months from June to September for all the selected regions. Also, apart from north-eastern Europe, the best skill is associated with forecast periods averaged at least over 7 years and including the first lead-time years. This agrees with what was already shown in Figs. 2 and 3. We interpret this feature as due to (i) the larger impact of initialisation for short lead times and (ii) to a better imprints on climate of the oceanic variations and external forcing for longer forecast periods. Nevertheless, some region may show unexpected higher skills for long lead times or for predictions averaged over just a few years, which are possibly linked to a delayed atmospheric response or a response to external forcing. Such a re-emergence of skill has been, for example, illustrated in the oceanic context by Matei et al. (2012) and Brune et al. (2018). This could possibly explain the peculiar peak of skill found for LT = 3-7 years over north-eastern Europe (Fig. 4d), whose robust interpretation, however, would need a dedicated study.

The relationship between skill and the AMV
In Fig. 3b, we showed a clear pattern of ACC skill score dependence on P for the whole European region, with higher prediction skills occurring for periods starting outside of the early 1970s. Such skill modulation along P appears more marked when M belongs to the central part of the year. This suggests a possible link with the predictability of the AMV, as the latter has been shown to have its largest impact on Fig. 5 Comparison of a the ACC skill of the 1-5 years AMV index (black lines) in hindcasts, and the observed (blue lines) and modelled (red lines) AMV standard deviation over different 26-years periods with (b-h) the two-dimensional ACC skill patterns in predicting the 1-5 years mean temperature over different sub-regions of Europe (see Sect. 2 for their definition). Thick red and blues lines in the first panel have been calculated from ensemble mean temperature, while dashed black and blue lines have been calculated from the single members.
Only correlations that are statistically significant at the 95% confidence level have been displayed. Black circles indicate that the skill is significantly (p < 0.05) higher than the skill for the reference context. Grey crosses indicate indicate that the the skill is significantly (p < 0.05) lower than the skill for the reference context. Dashed grey lines on M and LT axes correspond to the principal time-series (see Table 1 in the Appendix) European temperature during the summer (Sutton and Hodson 2005). In Fig. 5 we compare the skills in predicting the AMV over the different combinations of 26-yr periods P with the pattern of S = f(P,M) for all the 7 sub-regions for a common fixed LT, i.e. LT = 1-5 years. Here we defined the AMV signal as the 5-year low-pass filtered annual mean temperature averaged over the North Atlantic basin (80 W-0 W, 0 N-65 N). Note that we used a 5-year low-pass filter for a direct comparison with temperature predictions over Europe with LT = 1-5 years. In this framework, we also analyse the modelled and observed AMV standard deviation over the different periods, which have been shown to characterise both the predictability of the AMV and its teleconnections with the summer temperature over Europe (Qasmi et al. 2017). We find that the predictability of the AMV is phased with the observed AMV variance (Fig. 5a). Indeed, the highest AMV skill occurs for those periods in which the observed AMV standard deviation (blue curve in Fig. 5a) is greatest, while lower skill corresponds to periods with a less variable AMV. It is also worth noting that individual members of DCP produce a less skilful AMV than the ensemble mean, thereby confirming that averaging more realisations reduces the unpredictable noise (Mignot, et al. 2016;Smith et al. 2019). The simulated AMV (red curves in Fig. 5a) is characterised by an underestimated variance, which is not merely due to the ensemble mean effect, as individual model members also show lower AMV variance than observed. In addition, the AMV variance in the DCP does not exactly phase with the observed AMV variance, although the peaks of maximum AMV variance in the model coincides with those in observations. We aim to understand whether these features are linked to the predictability of air temperature over Europe and over its different sub-regions. The comparison between Figs. 3b and 5a shows that the periods of best ACC skill for air temperature over Europe (Fig. 3b) coincide with the peaks of maximum AMV predictability (Fig. 5a). Nevertheless, low temperature prediction skill for the 26-yr periods P starting around the 1970s appear to be phased with the lower AMV standard deviation in the model, rather than with the AMV predictability itself. Therefore, in agreement with what was already found by Qasmi et al. (2017), the teleconnection between the AMV and European summer temperature in the model appears to be linked with the simulated AMV variance. This demonstrates the possibility to consider the variance of the predicted AMV as a potential indicator of future windows of opportunity in the decadal prediction of air temperature over Europe. As an example, we can estimate that a large shift in predicted AMV, as has recently been suggested for the coming years (Robson et al. 2016), might lead to enhanced predictability. That is, real-world decadal predictions of the coming decade may be more accurate than what overview metrics (cf. Fig. 1) might imply.
The patterns of S = f(P,M) within the 7 sub-regions ( Fig. 5b-g) evidence that southernmost sectors, e.g. Iberian Peninsula, Mediterranean sector, show the highest skill for predictions after the 1970s, contrary to the northernmost regions, e.g. Scandinavia, Central Europe, for which the highest skill is found for P starting prior to the 1970s. Also, the skill variability on the P-axis is in general weaker for the Eastern regions, suggesting a lower impact of AMV variations there. Despite these regional differences, the best ACC skill scores in all the 7 sub-regions mainly concern the temperature prediction of the extended summer seasons for 26-yr initialisation periods starting around 1965 or 1980, when the AMV predictability is maximum. Furthermore, none of the regions is characterised by good temperature prediction skills for 26-yr periods P starting around the 1970s, when the simulated AMV show the minimum standard deviation values.

The pattern of initialisation added value and improvements due to de-biasing
Following the systematic approach adopted so far, we focus in the last part of this study on the potential skill improvement due to both the initialisation and de-biasing procedure for varying P, LT and M. For the reference context, Fig. 1 already demonstrated a progressive skill improvement in predicting air temperature over Europe due, successively, to initialisation (Fig. 1b, e) and to the de-biasing (Fig. 1c, f). By producing a 3-dimensional matrix of skill anomalies for air temperature predictions over Europe, both for ACC and RMSE, we now explore if the added values found for the reference context also hold for different combinations of P, LT and M. In other words, we study the function ∆S = f(P,LT,M), where ∆S is S RAW -S HIS (Fig. 6) and S DEB -S RAW (Fig. 7), respectively. Initialisation leads to an overall improvement of the skill in predicting air temperature over Europe (Fig. 6). Added value can be seen for most of the combinations of M and LT (Fig. 6a, d). Largest skill increases are found, in general, for predictions of spring and summer seasons (Fig. 6a, b, d, e), although the largest ACC increases do not always correspond to the largest RMSE decreases (e.g. Fig. 6b, e). Nevertheless, for both skills, the improvement is statistically significant for windows implying short lead times, e.g. LT = 1-3 years or for relatively long forecast periods, e.g. LT = 2-8 years (Fig. 6a, c, d, f). Finally, the features on the P-axis appear to be the most heterogeneous. Indeed, while skill improvements uniformly involve most of the combinations of P (Fig. 6b, c, e, f), the largest ACC increases are related to the 26-yr initialisation periods starting around the 1960s (Fig. 6b, c). For these periods, the ACC increase is statistically significant, while for initialisation periods starting after 1970s the skill improvement mainly concerns lead-time years averaged over long periods, i.e. more than 7 years (Fig. 6c). At the same time, significant RMSE decreases do not appear to prefer a particular period of initialisation, being uniformly distributed over the P axis (Fig. 6e, f).
The effect of the de-biasing procedure is strongly dependent on the specific time window analysed (Fig. 7). It produces an overall skill improvement for the period P = 1960-1985 (Fig. 7a, d), and, in general, for about the first five combinations of 26-yr initialisation periods P starting after 1960 (Fig. 7b, e). This mainly concerns the prediction of autumn months. On the opposite, for the time windows including the first months of the year, de-biasing produces just a slight skill improvement or even no skill improvement. This pattern is completely reversed for 26-yr initialisation periods starting after around 1965 (Fig. 7b, e), for which de-biasing appears more beneficial for the prediction of the first months of the year. These improvements are, for most of the cases with LT = 1-5 yr, statistically significant when considering the RMSE metric (Fig. 7e). Yet, de-biased predictions appear to be less skilled than raw predictions when simulating the last months of the year (Fig. 7b, e). In total, the de-biasing implies a statistically significant ACC improvement of just less than 2% of the 120,120 time windows detected, while a similar amount of time windows is characterised by a significant ACC decrease. Concurrently, the RMSE skill is significantly improved for slightly more than 2% of the time windows, while no significant RMSE degradation has been reported. Therefore, in general it is not possible to establish a priori whether the data adjustment has a beneficial effect on prediction skill: this strongly depends on the time window considered, thus making the systematic Fig. 6 Two-dimensional patterns of the added value due to initialisation in predicting air temperature over Europe for (left panels) fixed P = 1960-1985, (middle panels) fixed LT = 1-5 years, and (right panels) fixed M = Jan-Dec. The added values have been obtained by subtracting the skills associated with the HIS ensemble mean from the skill associated with the DCP ensemble mean (S RAW -S HIS ). Prediction skills are ACC (upper panels) and RMSE (lower panels), and have been obtained by comparing simulated temperature over Europe with OBS data over the same region. Circles on upper panels indicate where the ACC for the raw DCP experiment is significantly greater (p < 0.05) than the ACC for the HIS experiment. The violet squares indicate the reference context. Circles on the lower panels indicate where the difference between the RMSE in DCP and HIS experiments is statistically significant at the 95% confidence level according with the Welch's t-test on the mean squared errors. The violet squares indicate the reference context. Dashed grey lines on M and LT axes correspond to the principal time-series (see Table 1 in the Appendix) detection of skill opportunities a necessary step before the application of DCP to the analysis of the impact of the nearterm climate variations.

Discussion and conclusions
In this work we have used hindcasts of the IPSL-CM5A-LR DCP system to systematically assess its skill in predicting air temperature over Europe. We explored the degree of predictability for all 78 combinations of consecutive predicted months of the year, and of all the 55 combinations of leadtime years. We also investigated the potential existence of windows of opportunity by evaluating these skills for all the 28 combinations of 26-year moving periods initialisation starting from 1960. Such a systematic approach can be easily adopted for evaluating the skill of different DCP systems, and for different climatic variables, thus representing a prototype for a comprehensive exploration of the potential exhibited by a DCP system.
We found that temperature prediction skill over Europe is generally larger for the simulation of late spring, summer and early autumn seasons. The systematic evaluation of the hindcasts show peaks of predictability for boreal summer, independently of the considered epoch and lead-time, in agreement with what was found in Yeager et al. (2018). The length in years of the forecast period has also been found to be a factor influencing the skill score. In general, the longer this length is, the better the predictability that results, likely because most of the predictable signal might be found at low frequency due to oceanic processes. This might be compared with the principle for which increasing the model members Fig. 7 Comparison of skill metrics between raw and debiased DCP for (left panels) fixed P = 1960-1985, (middle panels) fixed LT = 1-5 years, and (right panels) fixed M = Jan-Dec. The skill improvements have been obtained by subtracting the skills associated with the raw DCP ensemble mean from the skill associated with the de-biased DCP ensemble mean (S DEB -S RAW ). Prediction skills are ACC (upper panels) and RMSE (lower panels), and have been obtained by comparing simulated temperature over Europe with OBS data over the same region. Circles on upper panels indicate where the ACC for the de-biased DCP experiment is significantly greater (p < 0.05) than the ACC for the raw DCP experiment. The violet squares indicate the reference context. Circles on the lower panels indicate where the difference between the RMSE in DCP and HIS experiments is statistically significant at the 95% confidence level according with the Welch's t-test on the mean squared errors. The violet squares indicate the reference context. Dashed grey lines on M and LT axes correspond to the principal time-series (see Table 1 in the Appendix) of a prediction improves its performance (Bellucci et al. 2015;Smith et al. 2019), although this is likely related to a better estimate of the signal and not of its physical characteristics. Furthermore, we identified two main windows of opportunity, namely for the 26-year periods starting around 1965 and around 1980, which we found to be characterised by the largest skill scores. We suggested that, consistent with previous studies, the source of temperature predictability over Europe for these different windows of time is linked with the AMV. Indeed, the peaks of summer temperature predictability over Europe coincide with the peaks of AMV predictability, which is in turn is correlated with the peaks of observed AMV variance. The identification of the sources of skill is an important issue for reducing uncertainties in forecasts and improving the next generation of DCPs. Since IPSL-CM5A-LR underestimates the amplitude of the AMV variability, similarly to state-of-the-art climate models (Qasmi et al. 2017), it is possible that a better reproduction of the AMV variability in the model can produce more skilful air-temperature predictions over Europe. In this context, the possible on-going shift of the AMV phase might lead to an increase in variance of the AMV, which may therefore open a good window of opportunity for predicting European air temperature in the coming decades. Nevertheless, the effect associated with the simulated AMV is just one of the possible factors influencing temperature predictions over the continent, as other factors not analysed here may also explain it, e.g. the atmospheric circulation ).
An important finding of this study is that, when simulating air temperature over Europe, hindcasts are significantly more skilful than historical simulations for most of the contexts analysed, thereby demonstrating the general added value due to initialisation. Furthermore, for certain time windows we have shown a possible additional significant improvement of the prediction skill coming from DCP debiasing. However, this beneficial effect is strongly dependent on the context and on the skill metric considered, and it is overall limited to less than 2% of the time windows analysed. Qualitatively similar results have been found by using adjusted data with a different de-biasing method, yet the skill changes (both improvement and degradation) were much smaller than those presented in this study (not shown here). It is therefore likely that the specific bias-adjustment method (e.g. the calibration period used, the way in which the transfer function is calculated, the observation-based data used as reference, etc.) is essential in affecting the eventual prediction skill.
Concerning the impact analyses that can possibly follow this pilot study, it is worth stressing that the adjusted data also exist at higher spatial resolution, thus potentially representing a better dataset for applications to targeted studies. These higher resolution data have been obtained for the predictions with IPSL-CM5A-LR model by calculating, for instance, the transfer function using WFDEI (WATCH Forcing Data methodology applied to ERA-Interim data) reanalysis data (Weedon et al. 2014) after the projection of model predictions on the 0.5°×0.5° WFDEI grid, thus making the de-biasing procedure both a data correction and a spatial downscaling. The same procedure is planned to be carried out for the IPSL-CM6-LR DCP system. Our results are therefore rather promising in light of the fact that the next generation of DCP is expected to perform even better than the previous one. Indeed, the IPSL-CM6-LR DCP system, which was released recently, relies on a higher resolution, on a better estimation of the external forcing, on more model members, i.e. 10 members against 3 for the IPSL-CM5A-LR, and on a more sophisticated initialisation method, i.e. the data assimilation involves also the Sea Surface Salinity and not only the SST. This new version of the IPSL model, along with the new generation of DCPs contributing to the CMIP6 (Boer et al. 2016), have already shown to generally produce more accurate predictions of the North Atlantic SST than the previous versions (Borchert et al. 2021), although most of the improvement was associated to a more accurate response to the external forcing.
Finally, we suggest that a systematic analysis, such as the one presented here, provides relevant information for the development of climate services based on DCP (Bruno Soares and Buontempo 2019). For example, the fact that temperature predictions appear to be particularly skilful over spring and summer seasons represents a promising base for the development of a reliable climate service for agriculture sector based on DCP. As a specific example, assume we want to use the IPSL-CM5-LR DCP system for impact analyses in viticulture. The climatic impact on grapevine yields is classically studied by using simulated temperature data to force phenological models, e.g. Sgubin et al. (2018Sgubin et al. ( , 2019, which mainly operates over the growing season of the year. While the temperature during the grapevine dormancy occurring between autumn and winter influences the grapevine budburst (Garcia de Cortazar-Atauri et al. 2009), the main following phenological phases (flowering, veraison and maturity) are predominately determined by the spring and summer temperature (Parker et al. 2013). Therefore, the high skill scores found for decadal prediction of temperature over these seasons promises high confidence in DCP for a near-term impact analysis on viticulture. A deeper analysis of the skill in predicting the 1-10 years phenological stages for different grapevine varieties over Europe is the object of an on-going study aimed at testing the effective usability of DCP in the development of a prototype service for viticulture targeting the time horizon of years to decades ahead. In this framework, the use of a new generation of DCP models, along with the de-biasing adjustments proposed in the present study, will enable higher spatial resolution data, which might be an essential factor for a reliable integration of phenological models for near-term predictions of grapevine growing.