4.1 Resolving geopotential height into principal components
Figure 2 shows the eigenvalues for each PC. The scree plot flattens out at nine PCs, and the first nine PCs together account for 90% of the total variability of Z500. We therefore regard the first nine PCs as adequate to describe the spatiotemporal variation of the geopotential height.
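The selection rule used here can be sketched in a few lines of Python. This is only an illustration, not the code used in the study: the eigenvalues below are hypothetical placeholders (they are not the values behind Fig. 2), and the function name `n_components_for` and the 90% target are assumptions chosen to mirror the text.

```python
from itertools import accumulate


def n_components_for(eigenvalues, target=0.90):
    """Return (k, cumulative), where k is the smallest number of leading PCs
    whose cumulative explained-variance fraction reaches `target`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = [c / total for c in accumulate(ordered)]
    for k, fraction in enumerate(cumulative, start=1):
        if fraction >= target:
            return k, cumulative
    return len(ordered), cumulative


# Hypothetical eigenvalues for illustration only (NOT the values in Fig. 2).
eigenvalues = [25, 20, 12, 9, 7, 6, 5, 4, 3, 2, 2, 1, 1, 1, 1, 1]

k, cumulative = n_components_for(eigenvalues, target=0.90)
print("PCs retained:", k)
print("cumulative fraction at k:", round(cumulative[k - 1], 2))
```

In practice the cumulative-variance criterion is read together with the scree plot, as in the text, since a fixed threshold alone does not show where the eigenvalue curve flattens.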
Figure 3 shows the spatial pattern associated with each PC and the proportion of the overall variance it explains. The first two PCs, which collectively describe 45% of the variance, exhibit spatial structures that are very similar to well-known circulation patterns over Europe. PC1 corresponds to the summer North Atlantic Oscillation (SNAO) described by Folland et al. (2009). The SNAO is a dipole pattern with nodes of opposite polarity over central Scandinavia and over Greenland. The positive (negative) phase of the SNAO results in warm (cool) and dry (wet) summers in Scandinavia. PC2 corresponds to the East Atlantic pattern (EA), which, unlike the SNAO, exists throughout the entire year (Wulff et al., 2017). Here one node is located west of England, and the other (of opposite polarity) over Eastern Europe and the Mediterranean.
4.2 Variable selection
For variable selection, we used the data from 1982–2020 to have a complete record of all explanatory variables identified in section 2.2. We first investigated dependencies between explanatory variables and then fitted LR models to different combinations of the explanatory variables. Among the SAT manipulations, SATmax gave the best results; among the SST manipulations, the daily maximum for the North and Baltic Seas with a lag of one day gave the best results. Whenever CAPE and SATmax were both in the model, their interaction was added as an explanatory variable.
It can be seen from Fig. 4 that all the VIFs except those of SATmax, SST, and TCW are less than two, which suggests a weak correlation between a given explanatory variable and the other variables in the model. To better understand why SATmax, SST, and TCW have higher VIFs, we calculated the matrix of correlations between all possible pairs of explanatory variables. The correlation matrix shows that SATmax is moderately correlated with SST and TCW, and SST is moderately correlated with TCW, which explains why they have higher VIF values. We also reproduce the positive correlation between PC1 and SATmax reported by Folland et al. (2009). PC2 is a pattern of strong northwesterly flow over Denmark at 500 hPa, and therefore the negative correlation between PC2 and both CAPE and TCW means that unstable and moist air comes with strong southeasterly flow, which is in accordance with domain knowledge (e.g., Solantie et al., 2006). Furthermore, PC3 has positive correlations with SATmax and TCW. This is not straightforward to explain, since PC3 represents a northeasterly flow with its maximum intensity north of Denmark.
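To make the link between pairwise correlation and VIF concrete, the following minimal sketch computes VIFs as the diagonal of the inverse correlation matrix, which equals 1/(1 − R²) for each variable regressed on the others. It is an illustration under stated assumptions rather than the procedure used for Fig. 4: the three columns are synthetic stand-ins, and the helper names (`correlation_matrix`, `invert`, `vifs`) are hypothetical.

```python
import random
from statistics import mean, pstdev


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equally long columns."""
    n = len(columns[0])
    standardized = []
    for col in columns:
        m, s = mean(col), pstdev(col)
        standardized.append([(v - m) / s for v in col])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in standardized]
            for zi in standardized]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (small matrices only)."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vifs(columns):
    """VIF_j is the j-th diagonal element of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Synthetic stand-ins (assumption): a and b are correlated, c is independent.
random.seed(0)
n = 500
a = [random.gauss(0, 1) for _ in range(n)]
b = [0.7 * x + random.gauss(0, 1) for x in a]
c = [random.gauss(0, 1) for _ in range(n)]

print([round(v, 2) for v in vifs([a, b, c])])
```

The two correlated columns receive larger VIFs than the independent one, which is the same mechanism by which the moderate correlations among SATmax, SST, and TCW raise their VIFs.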
Table 2 shows the performance metrics of each explanatory variable independently and of the combinations of variables with the best performance. The Brier score is close to zero and almost the same for all combinations of explanatory variables, which means that the models are equally well-calibrated. In the ROC curve analysis, the mean AUC for all models except SATmax and SST alone is above 0.70, which shows that models based on CAPE, PC Z500, or TCW have good discriminative ability. The Mann-Whitney U test also suggests skillful models with p < 0.05 (not shown). However, these single-variable models have FPR values between 0.29 and 0.40, which for an imbalanced dataset like the one used in our analysis means a lot of false positives. That also shows up in a poorer PR curve in Fig. 5.
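The effect of class imbalance on false positives can be illustrated with a short calculation. The sketch below converts a TPR and FPR into precision for a given event frequency. The TPR and FPR are the CAPE-only values from Table 2, but the 5% base rate is a purely hypothetical assumption, since the event frequency is not stated in this section; the function name `precision` is likewise illustrative.

```python
def precision(tpr, fpr, base_rate):
    """Fraction of predicted events that are true events.

    tpr, fpr  : true and false positive rates of the classifier
    base_rate : fraction of days with an observed event
    """
    true_positives = tpr * base_rate
    false_positives = fpr * (1.0 - base_rate)
    return true_positives / (true_positives + false_positives)


TPR_CAPE, FPR_CAPE = 0.87, 0.29   # CAPE-only model, Table 2
HYPOTHETICAL_BASE_RATE = 0.05     # assumption for illustration only

for base_rate in (0.50, HYPOTHETICAL_BASE_RATE):
    p = precision(TPR_CAPE, FPR_CAPE, base_rate)
    print(f"base rate {base_rate:.2f}: precision {p:.2f}")
```

With balanced classes the same TPR and FPR give a precision of 0.75, whereas at the assumed 5% base rate the precision falls to roughly 0.14, which is why the PR curve is the more informative diagnostic for rare events.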
The model combining CAPE, PC Z500, and TCW as explanatory variables outperforms the single-variable models. Adding SATmax as an explanatory variable increases the PR AUC only a little, but because its interaction with CAPE is statistically significant (Table 3), we choose to include this variable in the model. On the other hand, adding SST provides no additional performance improvement. It can be seen from Fig. 5 that the models “PC Z500 & CAPE & SATmax & TCW” and “PC Z500 & CAPE & SATmax & TCW & SST” have almost the same PR AUC (the highest). Therefore, PC Z500, CAPE, SATmax, and TCW are considered the most important explanatory variables, and we use them in further evaluation.
Table 2
Performance metrics for LR models with different (combinations of) explanatory variables for 1982–2020. Data are displayed as mean (first quartile–third quartile of the cross-validation values). The best model based on each individual metric score is highlighted.

| Explanatory variables | ACC | FPR | TPR | ROC AUC | PR AUC | Brier |
| --- | --- | --- | --- | --- | --- | --- |
| CAPE | 0.72 (0.68–0.76) | 0.29 (0.24–0.33) | 0.87 (0.84–0.91) | 0.84 (0.81–0.87) | 0.29 (0.25–0.36) | 0.07 (0.06–0.09) |
| PC Z500 | 0.62 (0.56–0.7) | 0.4 (0.31–0.48) | 0.78 (0.69–0.89) | 0.72 (0.7–0.74) | 0.17 (0.15–0.2) | 0.08 (0.07–0.09) |
| TCW | 0.64 (0.56–0.71) | 0.37 (0.29–0.48) | 0.74 (0.64–0.83) | 0.72 (0.71–0.74) | 0.19 (0.15–0.24) | 0.08 (0.07–0.09) |
| SATmax | 0.42 (0.3–0.55) | 0.61 (0.45–0.76) | 0.75 (0.58–0.9) | 0.53 (0.5–0.57) | 0.09 (0.08–0.11) | 0.08 (0.07–0.09) |
| SST | 0.52 (0.44–0.65) | 0.49 (0.35–0.58) | 0.71 (0.64–0.74) | 0.59 (0.55–0.62) | 0.11 (0.08–0.14) | 0.08 (0.07–0.09) |
| CAPE & TCW | 0.73 (0.7–0.79) | 0.27 (0.19–0.31) | 0.75 (0.69–0.8) | 0.79 (0.78–0.8) | 0.24 (0.21–0.29) | 0.07 (0.07–0.09) |
| PC Z500 & CAPE & TCW | 0.77 (0.73–0.82) | 0.24 (0.18–0.28) | 0.81 (0.76–0.87) | 0.84 (0.83–0.86) | 0.33 (0.29–0.39) | 0.07 (0.06–0.08) |
| PC Z500 & CAPE & SATmax & TCW | 0.78 (0.73–0.83) | 0.23 (0.16–0.28) | 0.81 (0.81–0.84) | 0.86 (0.84–0.87) | 0.38 (0.34–0.42) | 0.06 (0.06–0.08) |
| PC Z500 & CAPE & SST & TCW | 0.77 (0.73–0.81) | 0.23 (0.19–0.28) | 0.8 (0.78–0.87) | 0.84 (0.83–0.86) | 0.33 (0.29–0.38) | 0.07 (0.06–0.08) |
| PC Z500 & CAPE & SATmax & SST & TCW | 0.78 (0.74–0.82) | 0.22 (0.18–0.27) | 0.81 (0.78–0.84) | 0.86 (0.84–0.87) | 0.38 (0.34–0.42) | 0.06 (0.06–0.08) |


Following the model evaluation procedures of section 3.2.1, CAPE, TCW, PC1, PC3, PC4, PC6, and the interaction of CAPE and SATmax are found to be statistically significant explanatory variables for the models trained with all data and with the thinned data (Table 3). We can conclude that for these explanatory variables, the serial correlation of observations does not affect the confidence intervals to the extent that inference results are at risk of being misinterpreted. The variables CAPE and TCW stand out as having a positive influence on the probability of extreme rainfall. The odds ratio of SATmax is larger than unity for all data, which would mean that high SATmax favors extreme precipitation, but its confidence interval includes unity and the odds ratio is not robust across the subsampled datasets. These findings are in accordance with the arguments we gave in section 2.2 for including these explanatory variables.
The interaction of CAPE and SATmax seems to have a (slightly) negative impact on the probability of an extreme event. In this case, a significant odds ratio of 0.999 for the interaction term (Table 3) suggests that the effect of CAPE on the probability of an extreme event becomes weaker as SATmax increases. Therefore, even though the odds ratios of CAPE and SATmax are both larger than unity for all data, the interaction term decreases the overall effect of CAPE and SATmax on the probability of an extreme event. The interaction term is best understood by examining how the odds ratio of CAPE in the LR changes as the values of SATmax change (Fig. 6).
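The dependence of the CAPE odds ratio on SATmax follows directly from the form of an LR model with an interaction term: for a one-unit increase in CAPE at a fixed SATmax, the odds ratio is exp(b_CAPE + b_int × SATmax), i.e. OR_CAPE × OR_int^SATmax. The sketch below evaluates this expression with the rounded all-data odds ratios from Table 3. It is illustrative only: the function name is hypothetical, the SATmax values are arbitrary, and the units and any scaling of the variables in the fitted model are not given in this section, so the numbers should not be read as a reproduction of Fig. 6.

```python
import math


def conditional_odds_ratio(or_main, or_interaction, moderator):
    """Odds ratio of the main variable at a given value of the moderator,
    for a logistic model with a main effect and a product interaction."""
    return math.exp(math.log(or_main) + moderator * math.log(or_interaction))


OR_CAPE = 1.04          # Table 3, all data (rounded)
OR_INTERACTION = 0.999  # Table 3, all data (rounded)

# Arbitrary SATmax values for illustration; units and scaling are assumptions.
for satmax in (0, 10, 20, 30):
    value = conditional_odds_ratio(OR_CAPE, OR_INTERACTION, satmax)
    print(satmax, round(value, 4))
```

Because the interaction odds ratio is below unity, the conditional odds ratio of CAPE decreases monotonically as SATmax increases, which is the weakening effect described above.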
Table 3
The LR odds ratios and the corresponding confidence intervals in parentheses for all data, and for data thinned to every third and every fifth day, respectively

| Explanatory variable | All data (95% CI) | 1/3 days (95% CI) | 1/5 days (95% CI) |
| --- | --- | --- | --- |
| CAPE | 1.04 (1.03–1.05) | 1.04 (1.03–1.06) | 1.03 (1.02–1.05) |
| TCW | 1.18 (1.15–1.21) | 1.2 (1.14–1.25) | 1.17 (1.11–1.24) |
| SATmax | 1.01 (0.96–1.05) | 1.05 (0.97–1.14) | 0.99 (0.9–1.1) |
| PC1 | 0.51 (0.43–0.59) | 0.55 (0.42–0.71) | 0.58 (0.42–0.79) |
| PC2 | 0.91 (0.8–1.04) | 0.99 (0.78–1.24) | 0.9 (0.68–1.18) |
| PC3 | 1.47 (1.29–1.67) | 1.57 (1.25–1.98) | 1.47 (1.13–1.93) |
| PC4 | 0.73 (0.64–0.83) | 0.76 (0.61–0.94) | 0.69 (0.53–0.9) |
| PC5 | 0.98 (0.87–1.11) | 1.05 (0.85–1.29) | 1.16 (0.91–1.49) |
| PC6 | 1.66 (1.45–1.89) | 1.66 (1.33–2.09) | 1.73 (1.33–2.28) |
| PC7 | 0.93 (0.83–1.05) | 0.96 (0.78–1.19) | 0.96 (0.75–1.24) |
| PC8 | 1.15 (1.01–1.31) | 1.19 (0.95–1.49) | 1.27 (0.97–1.67) |
| PC9 | 0.96 (0.85–1.09) | 0.98 (0.79–1.21) | 0.9 (0.69–1.18) |
| CAPE * SATmax | 0.999 (0.998–0.999) | 0.998 (0.998–0.999) | 0.999 (0.998–1) |

Significant p-values are indicated in bold. Significance is evaluated at the 5% level. Confidence intervals are based on the profiled log-likelihood function.
Turning to the PCs of Z500, the odds ratio of PC1 is smaller than unity and the odds ratio of PC3 is larger than unity. By comparing with Fig. 3, we can interpret this as easterly flow increasing the probability of an extreme precipitation event. Similarly, the odds ratio of PC4 being below unity can be interpreted as southeasterly flow increasing the probability of extreme precipitation events. This is in agreement with local experience.
PC6 has a pattern that implies a stagnant atmosphere over Denmark. It is therefore surprising that it has a significant odds ratio. That said, the exact shape of the higher-order PCs is usually sensitive to the data. This effect is not included in the uncertainty intervals in Table 3, which may therefore be underestimated.
4.3 Variable importance
Regarding hyperparameter tuning for RF, an ensemble of 700 trees was found to be sufficient for achieving good performance, with negligible benefits for larger forests. The optimal hyperparameters for each ML algorithm can be found in the supplementary material.
CAPE and TCW obtained the highest importance across all models, with CAPE ranked first by RF and NNET, and TCW ranked first by LR and SVM. The explanatory variables PC1 and PC6 were consistently rated among the top six variables of the models. One difference was that PC2 was among the top three variables in the NNET model, but made only a minor contribution to LR, RF, and SVM. The most notable difference concerned SATmax, which played no role in LR and a minor role in the RF and SVM models, while it was the fourth most important explanatory variable for predicting extreme events in the NNET model (Fig. 7).
4.4 LR vs. ML
In contrast to LR and NNET, the ROC AUC of RF and SVM decreased from the training set to the test set (Table 4), but all models still showed high performance (ROC AUC of at least 0.80). On the test set, RF and NNET had the greatest ROC AUC (AUC = 0.87) (Fig. 8), followed by LR (AUC = 0.86) and SVM (AUC = 0.80). However, the DeLong test results in Table 4 indicate that only the ROC AUC of SVM differed significantly from that of the other models. In terms of accuracy at the optimal threshold, LR and RF were the best-performing algorithms for predicting extreme events (ACC = 0.75 and 0.73, respectively), and they had the smallest FPR (0.26 and 0.28, respectively), whereas NNET had the highest TPR (0.90, compared with 0.87 for RF and 0.83 for LR). RF had the best area under the PR curve (PR AUC = 0.39), followed by LR (PR AUC = 0.38) and NNET (PR AUC = 0.37). The Brier score did not show any difference among models. SVM had the worst performance in terms of all metrics.
Table 4
ROC AUC (95% CI) performance comparison of the four models applied to the training and test sets. The 95% CI is computed with 2000 stratified bootstrap replicates

| Model | Training | Test |
| --- | --- | --- |
| LR | 0.86 (0.85–0.88) | 0.86 (0.83–0.89) a |
| RF | 0.998 (0.997–1) | 0.87 (0.85–0.9) a |
| NNET | 0.87 (0.86–0.89) | 0.87 (0.84–0.89) a |
| SVM | 0.96 (0.95–0.97) | 0.80 (0.76–0.84) b |

a, b: Different letters in the same column indicate statistically significant differences (p < 0.05, DeLong test)
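The stratified bootstrap behind the confidence intervals in Table 4 can be sketched as follows. This is a minimal illustration and not the implementation used in the study: events and non-events are resampled separately with replacement, so each replicate keeps the original class sizes, and a percentile interval is taken over the replicate AUCs. The labels and scores are synthetic, and the function names and the percentile method are assumptions.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative (Mann-Whitney formulation; ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def stratified_bootstrap_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the AUC, resampling each class separately."""
    rng = random.Random(seed)
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    aucs = []
    for _ in range(n_boot):
        boot_pos = rng.choices(pos, k=len(pos))
        boot_neg = rng.choices(neg, k=len(neg))
        boot_labels = [1] * len(boot_pos) + [0] * len(boot_neg)
        aucs.append(roc_auc(boot_labels, boot_pos + boot_neg))
    aucs.sort()
    lower = aucs[int(round(alpha / 2 * n_boot))]
    upper = aucs[int(round((1 - alpha / 2) * n_boot)) - 1]
    return lower, upper


# Synthetic, imbalanced example (assumption): 20 events, 100 non-events.
data_rng = random.Random(1)
labels = [1] * 20 + [0] * 100
scores = ([data_rng.gauss(1.0, 1.0) for _ in range(20)]
          + [data_rng.gauss(0.0, 1.0) for _ in range(100)])

auc = roc_auc(labels, scores)
lo, hi = stratified_bootstrap_ci(labels, scores)
print(round(auc, 3), (round(lo, 3), round(hi, 3)))
```

Stratifying the resampling matters for rare events, because an unstratified bootstrap can produce replicates with very few or no positive cases, for which the AUC is unstable or undefined.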
When nonlinear effects were incorporated into LR via restricted cubic splines, there was more overfitting, and the performance of LR did not improve compared to traditional LR (see supplementary material).
Figure 9 highlights that the best-performing models give similar predictions. LR, RF, and NNET predictions are strongly correlated (LR and RF, 0.84; LR and NNET, 0.92; RF and NNET, 0.92). On the other hand, the correlations of LR, RF, and NNET predictions with the SVM predictions are weak. The predictions for the observed extreme events alone follow the same pattern: high correlation among LR, RF, and NNET, and low correlation with SVM. The distribution of predicted probabilities for non-extreme events coincides with the expected left-skewed pattern in all models. In contrast, the distribution of the predicted probabilities for extreme event occurrences deviates from the ideal case of a highly right-skewed distribution.