## 4.2.1 Uncertainty Intercomparison

We used three performance-based metrics (spatial mean, pattern similarity, and interannual variability) to compare observational and model uncertainties both globally and regionally25 (21 regions). These metrics were calculated for each model \(m\), variable \(q\), season \(k\), and region \(r\) against each observation \(c\). For each model with a total of \(N\) grid points, we defined \({o}_{n}\) and \({y}_{n}\) as the monthly observational and climate model data at grid point \(n\) in region \(r\). In the following, a single overbar indicates the temporal mean across all time steps of season \(k\) in the study period, while a double overbar indicates the spatial mean of the temporal mean over the time steps of season \(k\) within each year (yielding one value per year).

Spatial mean performance (\({P}_{S}\)) was evaluated by averaging the climatological bias (\(e\)) between a model simulation and an observation over all grid points as follows:

$${e}_{n,m,c,r}^{q,k}=\bar{y}_{n,m,r}^{q,k}-\bar{o}_{n,c,r}^{q,k}\tag{1}$$

$${P}_{S}^{q,k,m,c,r}=\frac{1}{N}\sum_{n=1}^{N}{e}_{n,m,c,r}^{q,k}\tag{2a}$$

Spatial pattern similarity performance (\({P}_{R}\)) was assessed using pattern correlation, which, for clarity, was expressed by dropping the indices \(q,k,m,c,r\) as follows:

$${P}_{R}=\frac{\mathrm{cov}\left(\bar{y}_{n},\bar{o}_{n}\right)}{\sigma\left(\bar{y}_{n}\right)\,\sigma\left(\bar{o}_{n}\right)},\quad n=1,\dots,N\tag{2b}$$

where \(\mathrm{cov}\) and \(\sigma\) denote the spatial covariance and standard deviation, respectively.

Interannual variability performance (\({P}_{\sigma }\)) was evaluated using the ratio of the standard deviations of the yearly varying model simulation and observation as follows:

$${P}_{\sigma }=\frac{\sigma\left(\bar{\bar{y}}\right)}{\sigma\left(\bar{\bar{o}}\right)}\tag{2c}$$
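As a minimal sketch, the three metrics in Eqs. (2a)–(2c) can be computed with NumPy. The helper names are hypothetical; `y_clim`/`o_clim` are assumed to be seasonal climatology fields on a common grid, and `y_yearly`/`o_yearly` the yearly, spatially averaged series:

```python
import numpy as np

def spatial_mean_bias(y_clim, o_clim):
    """P_S (Eq. 2a): mean of the climatological bias e over all grid points."""
    return np.mean(y_clim - o_clim)

def pattern_correlation(y_clim, o_clim):
    """P_R (Eq. 2b): spatial correlation between model and observed climatologies."""
    y, o = np.ravel(y_clim), np.ravel(o_clim)
    return np.cov(y, o)[0, 1] / (np.std(y, ddof=1) * np.std(o, ddof=1))

def variability_ratio(y_yearly, o_yearly):
    """P_sigma (Eq. 2c): ratio of interannual standard deviations."""
    return np.std(y_yearly, ddof=1) / np.std(o_yearly, ddof=1)
```

In practice these would be applied per model, variable, season, region, and reference dataset, after regridding to a common grid.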

Further, to diagnose the relative influence of the choice of reference dataset (observational uncertainty) and the choice of GCM (model uncertainty), both were quantified using the methodology suggested by Kotlarski et al.11. Briefly, observational uncertainty was defined as the mean standard deviation of the metric values when a GCM was compared to all seven reference datasets as follows:

$${U}_{obs}=\frac{1}{M}\sum_{m=1}^{M}\sqrt{\frac{1}{6}\sum_{c=1}^{7}\left({P}_{m,c}-\frac{1}{7}\sum_{c'=1}^{7}{P}_{m,c'}\right)^{2}}\tag{3a}$$

where \(M\) is 57 for CMIP6, 47 for CMIP5, and 24 for CMIP3, and \({P}_{m,c}\) is the value of a performance metric for a given variable, season, and region when using reference dataset \(c\) (\(c\in \{1,2,\dots,7\}\)) to evaluate GCM \(m\) (\(m\in \{1,2,\dots,M\}\)). Note that whenever we averaged correlations, we first took the mean of the Fisher's \(z\)-transformed correlation coefficients and then applied the inverse transformation to derive the average correlation (see the Supplementary Materials for more details about the methods).
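The Fisher's \(z\) averaging of correlations can be sketched as follows (the function name is hypothetical):

```python
import numpy as np

def average_correlation(r_values):
    """Average correlation coefficients via Fisher's z transform:
    mean of arctanh(r), then tanh to map back to correlation space."""
    z = np.arctanh(np.asarray(r_values, dtype=float))
    return np.tanh(z.mean())
```

Averaging in \(z\)-space avoids the downward bias of a plain arithmetic mean of correlation coefficients.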

Furthermore, when calculating model uncertainty, 150 GCM realizations (subsets of seven models each) were created for each CMIP vintage by bootstrap resampling without replacement, to ensure consistency with the seven observational references. Increasing the number of realizations beyond 150 had no discernible effect on the uncertainty calculation, so this number was deemed sufficient (Figure S2). The model uncertainty was thus defined as the mean standard deviation of the metric values when all 150 member realizations were compared to a particular reference dataset as follows:

$${U}_{mod}=\frac{1}{7}\sum_{c=1}^{7}\left[\frac{1}{150}\sum_{nr=1}^{150}\sqrt{\frac{1}{6}\sum_{m\in {S}_{nr}}\left({P}_{m,c}-\frac{1}{7}\sum_{m'\in {S}_{nr}}{P}_{m',c}\right)^{2}}\right]\tag{3b}$$

In this case, the ratio of the observational to the model uncertainty defined the robustness of the evaluation; a ratio greater than 0.5 (i.e., a model uncertainty within two times the observational uncertainty) indicated that observational uncertainty contributed significantly to the overall evaluation uncertainty:

$$R=\frac{{U}_{obs}}{{U}_{mod}}\tag{3c}$$
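Eqs. (3a)–(3c) can be sketched as below, assuming `P` is an \(M\times 7\) NumPy array of metric values for one variable, season, and region (rows are GCMs, columns are reference datasets); the seven-model subset size is an assumption consistent with the 1/6 and 1/7 factors in Eq. (3b), and correlations would first need Fisher's \(z\) treatment:

```python
import numpy as np

rng = np.random.default_rng(0)

def u_obs(P):
    """Eq. (3a): mean over models of the std (ddof=1) across the 7 references."""
    return P.std(axis=1, ddof=1).mean()

def u_mod(P, n_boot=150, subset_size=7):
    """Eq. (3b): per reference, average std across bootstrapped model subsets
    drawn without replacement, then average over the references."""
    M, C = P.shape
    per_ref = np.zeros(C)
    for c in range(C):
        stds = [P[rng.choice(M, subset_size, replace=False), c].std(ddof=1)
                for _ in range(n_boot)]
        per_ref[c] = np.mean(stds)
    return per_ref.mean()

def robustness(P):
    """Eq. (3c): R = U_obs / U_mod."""
    return u_obs(P) / u_mod(P)
```

The `ddof=1` choice reproduces the \(1/6\) factors (sample standard deviation over seven values) in Eqs. (3a) and (3b).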

## 4.2.2 Model Dissimilarity

We used independent spatial patterns to assess model dissimilarity26. Because models sharing physics modules and code may have similar error patterns, a symmetric correlation matrix (\(S\)) was created from the spatially varying error structures (Eq. 1). Because precipitation amounts and their variability are higher over the tropics than over other regions27, we normalized these errors by the observed interannual standard deviation (\({\sigma }_{o}\)) at each grid point for each variable, season, and model (Figs. 8a and 8b) as follows:

$${e}'=e/{\sigma }_{o}\tag{4a}$$

Individual climate models also share common errors (Figs. 8c and 8d) because of unresolved physical processes and/or topography28. Importantly, there was a high and statistically significant correlation (~0.72) between individual model errors and the multimodel mean error (MME, \(\overline{e}=\frac{1}{M}\sum_{m=1}^{M}{e}'_{m}\)) across all fields, indicating a shared systematic bias (Figure S1). Model similarity due to shared conceptual frameworks and code implies, by definition, that the error pattern of variable \(q\) in model \(m\) is related to the error in the corresponding \(q\) patterns of the other models. Thus, the MME-associated error pattern of a model can be obtained from the correlation coefficient \(r({e}',\overline{e})\) between \({e}'\) and \(\overline{e}\). The MME-independent portion of each model, in a linear sense, was determined by subtracting its MME-associated error pattern (Figs. 8e and 8f) as follows:

$$d={e'}^{*}-r\left(e',\overline{e}\right)\times {\overline{e}}^{*}\tag{4b}$$

where * denotes standardization, which ensures that all data fields have uniform scales before they are used in Eq. (4b).

In Eq. (4b), the MME pattern \({\overline{e}}^{*}\) and the model error pattern \(d\) are uncorrelated. The correlation between two models (\({S}_{i,j}\)) for any variable and season was then given as follows:

$${S}_{i,j}=r\left({d}_{i},{d}_{j}\right)\tag{5}$$
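Assuming `errors` is an \(M\times n_y\times n_x\) array of \(\sigma_o\)-normalized error fields (Eq. 4a), Eqs. (4b)–(5) can be sketched as follows (helper names are hypothetical):

```python
import numpy as np

def standardize(x):
    """The * operation in Eq. (4b): zero-mean, unit-variance field."""
    return (x - x.mean()) / x.std(ddof=1)

def mme_independent(e_norm, e_mme):
    """Eq. (4b): remove the MME-associated component from one model's
    normalized error pattern, leaving its linearly independent part d."""
    r = np.corrcoef(e_norm.ravel(), e_mme.ravel())[0, 1]
    return standardize(e_norm) - r * standardize(e_mme)

def similarity_matrix(errors):
    """Eq. (5): correlations between the MME-independent error patterns
    of all model pairs, giving the symmetric matrix S."""
    e_mme = np.mean(errors, axis=0)  # multimodel mean error pattern
    d = np.array([mme_independent(e, e_mme).ravel() for e in errors])
    return np.corrcoef(d)
```

A full implementation would additionally apply area weighting and mask grid points without valid observations.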

The near-zero average correlation between model pairs showed that removing the MME signal from a model reduced the combined influence associated with unresolved physical processes and/or topography (Figure S1). The spatial error patterns in annual precipitation simulated by the GFDL-ESM4 model with and without the MME-associated component are shown in Figs. 8c and 8e, and those simulated by the MCM-UA-1 model in Figs. 8d and 8f, respectively. \({S}_{i,j}\) was then converted into a distance measure, \({D}_{i,j}=1-{S}_{i,j}\), where a smaller \({D}_{i,j}\) (i.e., a larger \({S}_{i,j}\)) indicates higher similarity between two models, and vice versa for dissimilar models. The effect of GCM modification (i.e., changes to parametrizations or resolution and the addition of new model components) on inter-model similarity was estimated as follows:

$${D}_{i,j}^{mod}=\sum_{k=1,\,k\ne i,\,k\ne j}^{M}\frac{{D}_{i,k}-{D}_{j,k}}{{D}_{j,k}}\times 100\tag{6}$$
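Given the distance matrix \(D\), Eq. (6) reduces to a short loop (the function name is hypothetical):

```python
import numpy as np

def modification_effect(D, i, j):
    """Eq. (6): summed percent change in distance to all other models
    between model i and model j (both excluded from the sum)."""
    M = D.shape[0]
    ks = [k for k in range(M) if k != i and k != j]
    return sum((D[i, k] - D[j, k]) / D[j, k] for k in ks) * 100
```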

Finally, the effective number of climate models, \({M}_{eff}\), was calculated, following Pennell and Reichler8, from the eigenvalues of the matrix \(S\)26 as follows:

$${M}_{eff}=\frac{\left(\sum_{i=1}^{M}{\lambda }_{i}\right)^{2}}{\sum_{i=1}^{M}{\lambda }_{i}^{2}}\tag{7}$$

where \({\lambda }_{i}\) represents the \(i\)th eigenvalue of matrix \(S\). \({M}_{eff}\) varies between one and \(M\); \({M}_{eff}=1\) indicates that all model error structures are identical, whereas \({M}_{eff}=M\) indicates that they are fully independent.
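Eq. (7) is a two-line computation on the symmetric matrix \(S\) (the function name is hypothetical):

```python
import numpy as np

def effective_models(S):
    """Eq. (7): effective number of independent models from the
    eigenvalues of the symmetric inter-model correlation matrix S."""
    lam = np.linalg.eigvalsh(S)  # eigvalsh: eigenvalues of a symmetric matrix
    return lam.sum() ** 2 / (lam ** 2).sum()
```

For \(S = I\) (fully independent models) this returns \(M\); for an all-ones \(S\) (identical models) it returns 1, matching the limiting cases in the text.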

## 4.2.3 Statistical Analysis

We used a one-tailed 99% confidence limit (i.e., \(r \ge 0.72\)) to identify significantly similar model pairs. This was done under the assumption that the inter-model correlations (i.e., matrix *S* from Eq. 5) follow a Gaussian distribution and that the mean of all averaged inter-model correlations is zero (Figure S1).

In addition, to compute \({M}_{eff}\) irrespective of the choice of models, we used bootstrap resampling without replacement21 to generate 150 samples of randomly chosen models. The number of models in each sample ranged from 3 to 24 for CMIP3, 3 to 47 for CMIP5, and 3 to 57 for CMIP6. Increasing the number of samples beyond 150 had no further discernible effect on \({M}_{eff}\); 150 samples were therefore considered sufficient for the \({M}_{eff}\) analysis (Figure S2). The interval between the 2.5% and 97.5% levels of the 150 bootstrapped samples was used as the 95% confidence interval of \({M}_{eff}\) for each sample size.

Furthermore, the effect of observational uncertainty on model uncertainty was significant at the 95% confidence level when the model uncertainty was at most two times the observational uncertainty (i.e., the ratio of observational to model uncertainty was at least 0.5).

The significance of errors in both the multimodel mean and individual models was tested using the two-tailed Student's t-test (Fig. 8). The equivalent independent sample size (\({N}_{eqv}\)) of a time series of length *N* (i.e., 20), used in the two-tailed Student's t-test at the 95% significance level, was calculated from the lag-one autocorrelation (\({r}_{1}\)) as follows:

$${N}_{eqv}=N\left(\frac{1-{r}_{1}}{1+{r}_{1}}\right)\tag{8}$$
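Eq. (8) can be sketched as below (the function name is hypothetical; \(r_1\) is estimated from the sample):

```python
import numpy as np

def effective_sample_size(x):
    """Eq. (8): shrink the nominal sample size N by the lag-one
    autocorrelation r1 to account for serial dependence."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    r1 = np.corrcoef(x[:-1], x[1:])[0, 1]  # lag-one autocorrelation
    return n * (1 - r1) / (1 + r1)
```

For white noise \(r_1 \approx 0\) and \(N_{eqv} \approx N\), while strong persistence (\(r_1 \to 1\)) drives \(N_{eqv}\) toward zero.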