Evaluating boreal summer circulation patterns of CMIP6 climate models over the Asian region

Our confidence in future climate projection depends on the ability of climate models to simulate the current climate, and model performance in simulating atmospheric circulation affects its ability of simulating extreme events. In this study, the self-organizing map (SOM) method is used to evaluate the frequency, persistence, and transition characteristics of models in the Coupled Model Intercomparison Project Phase 6 (CMIP6) for different ensembles of daily 500 hPa geopotential height (Z500) in Asia, and then all ensembles are ranked according to a comprehensive ranking metric (MR). Our results show that the SOM method is a powerful tool for assessing the daily-scale circulation simulation skills in Asia, and the results will not be significantly affected by different map sizes. Positive associations between each two of the performance in frequency, persistence and transition were found, indicating that a good ensemble of simulation for one metric is good for the others. The r10i1p1f1 ensemble of CanESM5 best simulates Z500 in Asia comprehensively, and it is also the best of simulating frequency characteristics. The MR simulation of the highest 10 ensembles for the Western North Pacific Subtropical High (WNPSH) and the South Asia High (SAH) are far better than those of the lowest 10. Such differences may lead to errors in the simulation of extreme events. This study will help future studies in the choice of ensembles with better circulation simulation skills to improve the credibility of their conclusions.


Introduction
The general circulation of the atmosphere is a key factor affecting climate variation, whether on global or regional scales, because it drives the circulation of energy and water vapor (Maidens et al. 2021;Zhao et al. 2019a, b). With the intensification of global warming, various extreme events have occurred frequently and have had a great impact on society and the environment (Gu et al. 2016;Kong et al. 2020;Sales et al. 2018). Some extreme events, such as extreme high temperature and extreme precipitation, are seriously affected by the atmospheric circulation (Boschat et al. 2014;Fischer and Schär 2010;Gibson et al. 2017;Liu et al. 2015;Loikith et al. 2019;Lu et al. 2020;Ohba and Sugimoto 2020). The circulation at 500 hPa is of great importance because it presents a strong relationship between higher level circulation and surface variables (Gao et al. 2019;Horton et al. 2015;Mioduszewski et al. 2016).
Global climate models (GCMs) are powerful tools for simulating the current climate and projecting future climate . The Sixth Coupled Model Intercomparison Project Phase 6 (CMIP6) was initiated by the World Climate Research Programme (WCRP), with the purpose of answering new scientific questions facing the field of climate change and providing data support to achieve the scientific goals established by the WCRP "Grand Challenge" 1 3 plan. The CMIP6 includes about 112 climate models from 33 institutions around the world participating in 23 subprograms. These data will support the next 5-10 years of global climate research (Eyring et al. 2016). The evaluation of CMIP6 climate models is an important requirement for further research on downscaling and projection. However, most of the current assessments of CMIP6 are for surface variables, such as temperature and precipitation (Almazroui et al. 2020). Assessments of the circulation in Asia are rare.
There are many methods for evaluating model circulation, but their main purposes are all downscaling. The mainstream downscaling methods include principal component analysis (PCA), empirical orthogonal function (EOF) analysis, K-means clustering, and the self-organizing map (SOM) method. Previous studies have shown that the PCA and EOF methods are not accurate enough and not very intuitive when evaluating model circulation patterns . K-means clustering is an effective circulation classification method (Agel et al. 2017), but it can only represent some discrete atmospheric systems and cannot organize them into a continuum (Gao et al. 2019). However, the actual atmospheric circulation develops continuously.
The SOM method solves these issues. The SOM is an unsupervised neural network algorithm that maps highdimensional input data to two dimensions. When iterating, it not only updates the winning node, but also updates its neighboring nodes according to the weight. The SOM method was first proposed by Kohonen (1998), and first applied by Hewitson and Crane (2006) in the field of climate downscaling, and has since been widely used in the field of climate research. It can organize long time sequences of atmospheric circulation into a continuum, so that not only can the characteristics of a specific circulation type be seen, but also how this type might develop, because one particular node tends to change from its neighboring nodes. This method is effective in connecting abnormal atmospheric circulation patterns and extreme high temperature and precipitation events (Agel et al. 2017;Loikith et al. 2017).
Some previous studies have used the SOM method to evaluate the ability of CMIP5 models to simulate weatherscale circulation patterns (Cassano et al. 2007;Wang et al. 2015). The circulation simulation capabilities of models can be ranked by comparing the correlation coefficient or root mean square error (Mioduszewski et al. 2016) between models and reanalysis products for the frequency, persistence, and transition metrics (Gibson et al. 2016). Models with better simulation capabilities for one characteristic (such as frequency) tend to be better for other characteristics (persistence and transition; Gibson et al. 2016). These research results provide a basis for studying the causes of extreme events and scenario projections.
The Western North Pacific Subtropical High (WNPSH) is the most important circulation system that affects the summer in Asia at 500 hPa. Its position, shape, and strength dominate the climate of Asia (Zhang et al. 2020;Zhao et al. 2019a, b). Monsoon and typhoon activities over the western Pacific are closely related to the WNPSH . For example, water vapor from the tropical ocean is transported to China around the western ridge of the WNPSH. The convergence of warm humidity and cold air from high latitudes means that rainfall often occurs on the northwestern edge of the WNPSH (Liu et al. 2014;Preethi et al. 2017). Besides, the ensemble performance on SAH (South Asian High) and the westerlies are also tested, which are also prominent circulations in JJA that have impacts on Asian weather in addition to WNPSH.
This study has two main objectives. First, to use the SOM method to evaluate all ensembles of climate models of CMIP6 based on three metrics for frequency, persistence, and transition and to obtain rankings from the performance of these different aspects and a comprehensive ranking metric (MR). Second, to check whether the top-ranked ensembles also give better simulations of the prominent circulations in summer Asia.
This paper is organized as follows: Sect. 2 introduces the data and methodology. Section 3 presents the rankings of CMIP6 ensembles from the performance of different metrics and checks the relationship between SOM-based ensemble rankings and the performance of prominent circulations in summer Asia. A discussion is presented in Sect. 4 and the conclusion in Sect. 5.

Reanalysis data
In this study, the geopotential height at 500 hPa (Z500) of the European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis v5 (ERA5) is mainly used, and data from the Japanese Meteorological Agency (JMA) Reanalysis (JRA-55) is also used to test whether different reanalysis products will lead to different results (Ebita et al. 2011). ERA5 is based on the Integrated Forecasting System (IFS) Cy41r2 and has benefited from long-term development in model physics, core dynamics and data assimilation (Hersbach et al. 2020). Compared with the previous generation JMA reanalysis, a more advanced data assimilation scheme is used for JRA-55, with increased model resolution, new variational bias correction for satellite data, and several additional observational data sources (Ebita et al. 2011). we use absolute Z500 instead of Z500 anomaly because absolute Z500 has a clearer physical meaning and is more intuitive, whereas using Z500 anomaly may mix different circulation regimes (An and Zuo 2021). When the study started, ERA5 was available from 1979, and considering the ending time of most CMIP6 models in historical experiment, we choose 1979-2014 as the research period. We consider only boreal summer (June-August) and calculate the daily average for every 6 hours [0000, 0600, 1200, and 1800 coordinated Universal Time (UTC)], so the total time range is 3312 days. The analysis domain extends from 40° to 180° E, and 0° to 60° N, following the suggestion from Gao et al. (2019) that China is mainly affected by the circulation of this region.

Climate model data
We analyze the absolute Z500 of 140 ensembles from 23 selected climate models participating in CMIP6 (Table 1). Compared with CMIP5, CMIP6 has more institutions and models participating, and also has more sub-experiments. CMIP6 aims to provide a scientific basis for the Intergovernmental Panel on Climate Change (IPCC) 6th Assessment Report (AR6). The Historical experiment is started from the model state of the piControl experiment at a certain time but is driven by various external forcings based on observations that have changed over time since 1850 (Zhou et al. 2019), and the piControl experiment is an experiment that maintains the external forcing (e.g., greenhouse gas, solar radiation, aerosol, land use) at the level of 1850 to drive the global coupled model for long-term integration of more than 500 years. Each model of CMIP6 may have multiple members in the form of "r*i*p*f*", where "r", "i", "p" and "f" represent initial conditions, initialization methods, physical processes, and forcing conditions, respectively, and "*" will be replaced by different numbers.

Methodology
In this study, the SOM method is used to classify Asian Z500 characteristics. Each node classified represents a type of circulation pattern. The input data are daily absolute Z500, which interpolated onto an Equal-Area Scalable Earth-type (EASE) grid at a spatial resolution of approximately 2.5° × 2.5° to ensure that the correct distance is calculated by the SOM procedure (Gibson et al. 2017). Firstly, each node initializes its parameters (i.e., weight coefficient), and the number of parameters share the same dimensions of input data. Then, the closest node is found according to the distance function (such as Euclidean distance), and the node with the smallest distance is regarded as "winning node". After finding the "winning node", the weight vector of each node and its neighboring nodes are updated according to the gradient descent method and weight coefficient, respectively. The algorithm iterates until it converges or meets the termination condition set by the user. This study uses the second version of the SOM toolbox on MATLAB developed by Helsinki University of Technology (http:// www. cis. hut. fi/ proje cts/ somto olbox/). The "hexagonal" lattice and "sheet" map shape are set up on the topological structure, and keep other settings default except random initialization Liu et al. 2006). Gibson et al. (2017) have shown that it is suitable for classifying a greater number of different weather patterns to keep the first 50% neighborhood radius decreasing from 5 to 1, and the last 50% at 1. In this study, 1000 iterations have been performed during SOM training. The neighborhood radius of the first 500 iterations decreased from 5 to 1, and the last 500 was kept at 1.
One of the most important steps in the SOM approach is to determine the number of nodes. Too few nodes may cause one node confuse different circulation features, and too many nodes will cause different nodes share similar features (Ford and Schoof 2017;Loikith et al. 2019;Mioduszewski et al. 2016). To determine the most suitable map size for this study, we have tested 11 experiments of different map sizes, including 2 × 3, 3 × 4, 4 × 4, 4 × 5, 5 × 5, 5 × 6, 5 × 7, 6 × 6, 6 × 7, 7 × 7, and 7 × 8 ( Fig. 1). For each configuration, we have calculated the spatial correlation coefficient between the actual value of each day and the assigned node to quantitatively evaluate the difference between different map sizes.
We call the SOM result obtained from ERA5 the "Master SOM" (Fig. 2), and then map each day onto the "Master SOM" for each ensemble. Mapping is accomplished by finding the node in the "Master SOM" which closest to (i.e., has the smallest squared distance from) the daily state of the ensemble (Cassano et al. 2007). Three metrics were used to evaluate CMIP6 from the SOM derived nodes: frequency, persistence and transition metrics (Gibson et al. 2016). Node frequency refers to the probability of occurrence of this node, and the frequency metric is defined by dividing the number of days the node appears by the total number of days. The 95% confidence level for the node frequency of occurrence is given by The "violin" plot showing pattern correlations under different node configurations between each daily absolute 500 hPa geopotential height and the SOM node it was assigned to, displaying the mean (black line), median (red line), and probability distribution (gray violin)

Fig. 2
Master 20 node (4 × 5) SOM derived from ERA5 daily absolute 500 hPa geopotential height (shading and black isobars every 40 gpm) in summer (June-August) over 1979-2014. In terms of the node referencing, node "a1" refers to the node in the top-left corner of the SOM plane and node "e4" refers to the node in the bottomright corner where l represents the number of days, and p represents the node frequency for a randomly derived data set. In other words, p = 1/N for an SOM with N nodes. Node frequency is considered significant if it exceeds these limits. For the master SOM with 20 nodes the expected frequency of occurrence for each node is 0.05 with a 95% confidence interval of ± 0.74%. Node persistence represents the continuity of a node. In this study, the persistence metric is expressed by dividing the number of events that lasted more than two days at this node by the number of events that lasted only one or two days. The transition metric refers to the transition probability from one node to another. Each node can transit to nodes other than itself, so the SOM with N nodes can have N(N − 1) kinds of transitions in total.
The performance of different ensembles on different metrics is described by calculating the Pearson correlation coefficient between different ensembles and ERA5. For example, the frequency performance of a particular ensemble is represented by the correlation coefficient of the frequency metrics between this ensemble and ERA5 for all nodes. The higher the correlation coefficient, the better the frequency performance of the ensemble, and the topper the frequency ranking. After ranking according to different metrics, the comprehensive rankings can be obtained according to the ranking metric (MR), which is defined as where m is the number of ensembles, and n is the number of metrics. The closer the MR value is to 1, the higher the comprehensive performance of the ensemble (Li et al. 2015).
In order to describe the WNPSH, we calculated the western ridge point index, the northern boundary position index, the area index and the intensity index of ERA5 and each ensemble. The western ridge point index refers to the longitude of the westernmost position of the 5880-gpm isobar at the height of 500 hPa in the domain 10°-60° N, and 90°-180° E. The northern boundary position index refers to the average latitude of the 5880-gpm isobar on the north side of the subtropical high along each meridian at the height of 500 hPa and in the same domain. The area index refers to the sum of the area enclosed by all grid points greater than 5880-gpm isobar at 500 hPa in the domain of north of 10º N, and 110°-180° E. The intensity index refers to the sum of the difference between values of grids greater than 5880-and 5870-gpm value (Liu et al. 2019). The SAH intensity index and the westerlies index are calculated to describe SAH and the westerlies. The SAH intensity index is calculated by summing the difference between the values of grids greater than 1660-and 1660-gpm value at 100 hPa isobar . The westerlies index is obtained by calculating the difference between the geopotential height between 40° N and 60° N at 250 hPa isobar (Gong and Wang 2002).

Determination of SOM map size
In order to determine the optimal number of nodes, we have tested 11 sets of SOM configurations with different map sizes, and have calculated the spatial correlation coefficient between the actual Z500 and the assigned nodes, as shown in Fig. 1. As the number of nodes increases, the spatial correlation coefficient also increases. However, a larger number of nodes may produce nodes that share similar circulation characteristics and a smaller number of nodes may confuse different circulation characteristics. According to most researches about Asia, the number of SOM nodes ranges from 12 to 20 (Gao et al. 2019;Li et al. 2020b;Liu et al. 2015;Wang et al. 2015). If we search to minimize the quantization error within groups and to maximize the topographic error between groups, 20 (4 × 5) nodes are optimal (Li et al. 2020b;Yin et al. 2010). Figure 2 shows the SOM map obtained using the daily absolute Z500 from ERA5 with the aforementioned SOM settings. In this "Master SOM", node "a1" refers to the node in the top-left corner of the SOM plane and node "e4" refers to the node in the bottom-right corner. The orange shading in nodes a3, a4, b3, b4, c3, c4, d3, d4, e2, e3, and e4 to the east highlights the area with geopotential height exceeding 5880 gpm, where the 5880-gpm isobar is considered the boundary of the WNPSH (Liu et al. 2019). The orange shading of nodes d2, e1, a3, a4, b3, b4, c3, c4, d3, d4, e2, e3, and e4 to the west indicates the North Africa High (NAH), which often changes with the movement of the WNPSH (La et al. 2002). According to the "Master SOM", neighboring nodes share similar circulation characteristics, and the difference along the diagonal between node a1 and node e4 is the largest. The circulation pattern shown at node a1 is characterized by the East Asian Trough (EAT) near the Kamchatka Peninsula in middle and high latitudes and the subtropical high represented by the 5840-gpm isobar at middle and low latitudes. Cold advection from the west of the EAT transports cool air from the north to Asia, alleviating the high temperature in summer. The circulation pattern shown at node e4 is more conducive to generating high temperature. The WNPSH extends westward to southern China, and its intensity reaches 5880-gpm, whereas the westerly jets in the middle to high latitudes are stable and block the cold air from northern latitudes from moving south, leading to the development of high temperatures in Asia. If this type of circulation pattern lasts for a long time, coupled with unsupported regional soil moisture conditions, persistent high temperatures will occur (Boschat et al. 2014;Ding and Qian 2011). From node a1 to e4, the EAT gradually weakens, but the WNPSH continues to expand westward and southward, which gradually has a greater impact on Asian weather. The NAH also gradually intensifies from node a1 to e4, but its position does not change much. In addition, the existence of the strong WNPSH of node e4 is conducive to the transport inland of water vapor from the Pacific Ocean, which favors heavy rainfall in southern China, Japan and South Korea.

"Master SOM"
The node frequency of the "Master SOM" shows the frequency of occurrence of each circulation pattern of ERA5 in all 3312 days (Fig. 3a); node frequencies that are statistically (at the 95% level) above or below those expected by chance are shown in red. The frequencies of nodes a1 and e4 are the highest, both exceeding 10%. The frequencies of nodes a2, b2, b3, d2, and e3 are the lowest, only ~ 2%, suggesting these are transition circulation patterns that appear less frequently. It shows that the SOM method is effective in identifying weather patterns that occur less frequently. The frequencies of nodes a3, a4, b1, b4, c1, d4, and e1 are close to the random probability of 5% and are not statistically significant at the 95% level.
The node persistence represents the duration of each node, and the solid black line in Fig. 4 shows the continuity of each node in "Master SOM". The persistence metrics of most nodes are greater than 1 except nodes a2, b2, b3, d2, and e3, suggesting that, for most nodes, the number of events lasting more than two days is greater than the number of events lasting two days or less. Nodes a2, b2, b3, d2, and e3 have more events with a duration of two days or less. These nodes also have the lowest frequencies (~ 2%; Fig. 3a), and on average appear once or twice in summer each year. Nodes a1 and e4 are the most persistent, and the frequencies of these two nodes are the highest (Fig. 3a), indicating that the circulation patterns corresponding to these two nodes are the two dominant circulation patterns in summer from 1979 to 2014.
Node transition is represented by the probability of a node transitioning to another. Figure 5a shows the transition probability of each node in "Master SOM". Generally speaking, a node tends to transfer to its neighboring nodes. This is why the high probability values are concentrated near the diagonal from a1 to e4. The transition probability of Fig. 3 Contour plots of node frequencies corresponding to a the Master SOM from ERA5, b the highest ranked ensemble for frequency performance, and c the lowest ranked ensemble for frequency performance. The highest/lowest ranked ensembles for frequency performance are defined here based on the correlation coefficients of frequency metrics between ensembles and ERA5. Particular node fre-quencies statistically (at p < 0.05) above or below those expected values by chance are shown in red. Node locations correspond to those in Fig. 2. The correlation coefficient between ERA5 and b the highest ranked ensemble for frequency performance is 0.73, while the correlation coefficient between ERA5 and c the lowest ranked ensemble for frequency performance is 0.21 Fig. 4 Line graphs of persistence metric corresponding to ERA5 (black line), the highest ranked ensemble for persistence performance (red line), and the lowest ranked ensemble for persistence performance (blue line). The highest/lowest ranked ensembles of persistence performance are defined here based on the correlation coefficients of persistence metrics between ensembles and ERA5. Node locations correspond to those in Fig. 2. The correlation coefficient between ERA5 and the highest ranked ensemble for persistence performance is 0.97, while the correlation coefficient between ERA5 and the lowest ranked ensemble for persistence performance is 0.39 different nodes and the circulation pattern corresponding to each node in "Master SOM" (Fig. 2) can be combined to analyze the physical explanation of the change of different circulation patterns. Node e4 has the highest probability of transferring to nodes d4, e2, and e3, with a total probability of 0.86, which shows that the position of the WNPSH gradually retreats eastward and northward at low and middle latitudes. However, the probability of node e4 transferring to other nodes is almost zero. The transition probability of node d4 to nodes c4 and e3 is higher, at 0.31 and 0.25, respectively. This means that if the WNPSH is the same shape as in node d4, it is easier for it to shrink (to nodes c4 and e3), but almost never to expand (to node e4), Interestingly, the transition probability of node e4 to d4 is much higher than that of d4 to e4. This shows that although a system is not final and may change in intensity, the probability of changing opposite ways is different, at least for the most important system (the WNPSH) in this study region. Among the transition metrics of all nodes, that from node b4 to a4 is the largest, with a value of 0.47, followed by the transition metric from node a4 to a3, at 0.43. The transition metrics of node b3 to other nodes (nodes a2, a3, a4, b2, b4) are similar but not high (~ 0.1 in each case).

Ensemble ranking of different metrics
We have calculated the correlation coefficients between all ensembles shown in Table 1 and ERA5 for the frequency, persistence, and transition metrics to evaluate the performance skills of each ensemble on these three different metrics, and three sets of rankings for all ensembles have been obtained (Table 2). Positive associations between any two of frequency, persistence and transition performance have been found, indicating that good simulation skill for one kind of characteristic is related to skills in others (Fig. 6). There are, however, a few exceptions. For example, the persistence performance of the r23i1p1f1 ensemble of IPSL-CM6A-LR is good, with a correlation coefficient of 0.97 with ERA5 and a ranking of 2 for persistence metric (Table 2). However, the transition performance of this ensemble is relatively weaker, with a correlation coefficient of 0.40 with ERA5 and a ranking of 121 for transition metric (Table 2). Figure 6 shows that the relationship between frequency performance and transition performance is the strongest, with a correlation coefficient of 0.57, while the correlation coefficient between persistence and transition performance is 0.50, and 0.45 between frequency and persistence performance. They have all passed the 99% significance test. As a consequence, we have calculated the MR metric considering all three different metrics, and the final ensemble rankings are based on the MR metric.
Frequency performances for the highest and lowest ranked ensembles are given in Fig. 3b, c. The r10i1p1f1 ensemble of CanESM5 is the best ensemble for frequency performance. The frequencies of diagonal nodes a1 and e4 are very high, while some transitional nodes in the middle, such as nodes b3, c3, d3, and e3 have very low frequencies. The correlation coefficient between this ensemble and ERA5 reaches 0.73. However, this ensemble also obviously overestimates the frequency of some nodes, such as a3, a4, e1, and e2, and underestimates others, such as b1 and b2. What is interesting is that nodes with a more westward and southward WNPSH appear more frequently in the ensemble that has the highest frequency performance. The r1i1p1f1 ensemble of INM-CM4-8 has the lowest frequency performance, and the correlation coefficient between this ensemble and ERA5 is only 0.21. From Fig. 3c, almost all of the Z500s of this ensemble are assigned to nodes a1, b1, c1, d1, and e1, indicating that this ensemble has a systematic negative bias of Z500, which causes Z500 of this ensemble to be allocated to those nodes with weaker Z500 circulation. Figure 4 shows the highest and lowest ranked ensembles for persistence performance. The r1i1p1f1 ensemble of NorESM2-MM is the highest performing ensemble for Transition probability plots (grayscale indicates probability) for a ERA5, b the highest ranked ensemble for transition performance, and c the lowest ranked ensemble for transition performance. The highest/lowest ranked ensembles for transition performance are defined here based on the correlation coefficients of transition metrics between ensembles and ERA5. Node locations correspond to those in Fig. 2. The correlation coefficient between ERA5 and b the highest ranked ensemble for transition performance is 0.86, while the correlation coefficient between ERA5 and c the lowest ranked ensemble for transition performance is 0.37   r8i1p1f1  123  116  12  88  IPSL-CM6A-LR  r7i1p1f1  79  71  101  89  MIROC6  r4i1p1f1  128  109  18  90  IPSL-CM6A-LR  r13i1p1f1  85  64  109  91  IPSL-CM6A-LR  r28i1p1f1  91  62  110  92  IPSL-CM6A-LR  r10i1p1f1  88  65  112  93  MPI-ESM1-2-HR  r2i1p1f1  58  125  85  94  CESM2  r5i1p1f1  59  119  91  95  INM-CM5-0  r9i1p1f1  133  16  125  96  IPSL-CM6A-LR  r6i1p1f1  108  26  140  97  IPSL-CM6A-LR  r4i1p1f1  80  92  104  98  IPSL-CM6A-LR  r19i1p1f1  96  66  118  99  IPSL-CM6A-LR  r11i1p1f1  78  79  128  100 persistence simulation, with a correlation coefficient of 0.97 with ERA5. The persistence metrics evolution of each node in this ensemble is almost exactly the same as that of ERA5, but the persistence metrics of all nodes except d4 are lower than those of ERA5. This shows that in the r1i1p1f1 ensemble of NorESM2-MM, there are fewer events with a duration of more than two days, and more events with a duration of less than 2 days for each node.
The above results show that the highest performing ensembles for node frequency, persistence and transition in CMIP6 are the r10i1p1f1 ensemble of CanESM5, the r1i1p1f1 ensemble of NorESM2-MM and the r8i1p1f1 ensemble of MPI-ESM1-2-LR, respectively. The lowest performing ensemble for frequency and persistence is the r1i1p1f1 ensemble of INM-CM4-8, and the lowest performing ensemble for transition is the r6i1p1f1 ensemble of IPSL-CM6A-LR. To better describe the simulation ability of CMIP6 on Z500 in Asian region, the MR metric is used afterwards. According to the MR metric (Table 2), the top one ranking is the r10i1p1f1 ensemble of CanESM5, which is also the top one ranking ensemble for frequency performance, and the second ranking ensemble is the r1i1p1f1 ensemble of NorESM2-MM, which is also the top one ranking for persistence performance. The lowest ranked ensemble is the r1i1p1f1 ensemble of INM-CM5-0. The r1i1p1f1 ensemble of INM-CM4-8 is ranked last for frequency and persistence performance, but due to its relatively high ranking for transition performance, it ranked second last rather than last by the MR metric.

Ensemble ranking and prominent circulations in Asia
The WNPSH, SAH and westerlies are key circulation systems that control the summer monsoon and typhoon activities, and are important indicators of summer weather in Asian countries . In order to test whether the top-ranked ensembles can better represent the actual climate in Asia, we calculated the daily WNPSH indexes, the SAH intensity index and the westerlies index for ERA5 and all ensembles shown in Table 1. The probability density 1 3 functions (pdf) of the highest 10 and lowest 10 ensembles are then calculated, and compared with ERA5. Figure 7a, b shows the pdf distributions after fitting to normal distribution of the western ridge point index and the northern boundary position index of ERA5 and the highest and lowest 10 ensembles, respectively. The average western ridge point index of ERA5 is around 130° E, and that of the highest 10 ensembles is around 125° E, which is about 5° E west of ERA5. The pdf distribution of the lowest 10 ensembles shows that the western ridge point index is around 160° E on average, which is about 30° E east of that of ERA5 and even 10° E east of the average of the CMIP5 simulations (Zhao et al. 2019a, b). This may be because only the CMIP5 r1i1p1f1 ensembles were evaluated, and here we evaluated all ensembles. The average northern boundary position index of ERA5 is around 31° N, and the average of the highest 10 ensembles is around 29° N, about 2° to the south. The average northern boundary position index of the last 10 ensembles is around 40° N, which shows that the lowest 10 ensembles shift the northern boundary of the WNPSH northward by about 10°. The results are similar whether choosing highest/lowest 5 or 20 ensembles ( Supplementary  Fig. 1). As extreme events like heatwaves, droughts, and typhoons are greatly affected by the location of the WNPSH (Choi and Kim 2019;Zhao et al. 2019a, b), such differences will inevitably lead to errors extreme events simulation. Figure 7c, d shows the pdf distribution of WNPSH area index and intensity index of ERA5 and the highest 10 and lowest 10 ensembles after gamma distribution. It can be seen that the pdf of the highest 10 ensembles is very close to that of ERA5, but the simulated area and intensity are both larger. For the lower-ranked ensembles, most of the ensembles simulate very small and weak WNPSH, and the indexes are almost near 0, which means that the lowest ensembles can hardly simulate the real area and strength of WNPSH. Figure 8 shows the pdf distribution of the SAH intensity index and the westerlies index of ERA5 and the highest 10 and lowest 10 ensembles after fitting to gamma (Fig. 8a) and , area index (c) and intensity index (d) of ERA5 (black) and the highest 10 (red) and lowest 10 ensembles (blue) after fitting to normal (a, b) and gamma (c, d) distributions. The black and red curves correspond to the left axis, while the blue curve corresponds to the right (c, d)

Fig. 8
Probability density function distributions of SAH intensity index (a) and westerlies index (b) of ERA5 (black) and the highest 10 (red) and lowest 10 ensembles (blue) after fitting to gamma (a) and normal (b) distributions normal (Fig. 8b) distributions. The results prove that the ensembles which simulate WNPSH well have better simulation capabilities for SAH, and vice versa. The eastward shifting of SAH influence the short-term variation of WNPSH at lower level through dynamical and thermodynamical mechanisms. First, the descent flow forced by the negative vorticity advection at upper layer contributes to the vorticity development of WNPSH. Second, the adiabatic warming caused by the descent is also conducive to the maintenance of warm WNPSH at lower level (Ren et al. 2007). The westerly jet is strongest near 200 hPa, but considering the model output data, we have calculated the westerlies index at 250 hPa. The simulation is very close to that of ERA5 for both the highest and lowest 10 ensembles.

Discussion
In order to test whether different reanalysis data give different results, we generate another "Master SOM" (Supplementary Fig. 2) using JRA-55 reanalysis data (Ebita et al. 2011;Kobayashi and Iwasaki 2016), and then apply the same procedure to rank all ensembles listed in Table 1. The rankings (Table omitted) are similar to those shown in Table 2, which is obtained from ERA5 reanalysis data. For example, the 25 ensembles of CanESM5 are still ranked high, and the 32 ensembles of IPS-CM6A-LR are still ranked low, although there are slight differences in the specific ranking numbers. Therefore, we consider that the SOM results obtained from these two different reanalysis datasets are consistent.
The simulation capability of a model is affected by many factors, such as dynamic core, parametric scheme, resolution and so on. Our research results show that the model's ability of simulating Z500 does not seem to have much to do with resolution. The r10i1p1f1 ensemble of CanESM5, which is the top one ranked ensemble for the MR metric, has a relatively coarse spatial resolution of 500 km, while the r1i1p1f1 ensemble of NorESM2-MM, the second ranking ensemble, has a relatively fine horizontal resolution of 100 km. The lower-ranking ensembles are mainly from the IPSL-CM6A-LR and INM-CM5-0 models, with resolutions of 250 and 100 km, respectively. However, we have only four different resolutions here, 500, 250, 100, and 50 km, and there is only one ensemble with 50-km resolution, so we cannot draw a significant conclusion. Subsequent work needs to add more samples with different resolutions to analyze the relationship between circulation simulation skills and model resolution. In addition, the influence of different dynamic cores and parameterization schemes on the circulation simulation is worth further analysis. But it is beyond doubt that the simulation skills for Z500 of CMIP6 is improved compared with that of CMIP5. Zhao et al. (2019a, b) evaluated the ability of CMIP5 to simulate the atmospheric circulation of East Asia and showed that there's a bias of about 20 hPa between CMIP5 ensemble mean and reanalysis of the WNPSH climatology. However, the bias is reduced to 10 hPa and the northern boundary is much closer to that of the reanalysis (figures omitted).
Extreme events have seriously affected the whole world in recent years (Kong et al. 2020;Lee et al. 2020;Li et al. 2020a;Lu et al. 2020;Meehl and Tebaldi 2004). Many studies have pointed out the close connections between circulation and extreme weather events (Boschat et al. 2014;Ding and Qian 2011;Horton et al. 2015;Li and Ruan 2018;Liu et al. 2015;Pezza et al. 2011;Raymond et al. 2017). As mentioned above, the circulation pattern of node e4 favors the occurrence of extreme high temperature and heavy rainfall events in summer. Also, node e4 has a long duration (Fig. 4), which will be more likely to lead to long lasting extreme high temperature events and even compound extreme events (Faranda et al. 2020). Our next work will focus on the extent to which extreme events in Asia, such as extreme high temperatures and heavy precipitation, are affected by the circulation, and how the circulation affects extreme events. If these issues can be studied thoroughly, it will help improve the accuracy of projections of extreme events and reduce the economic losses and casualties in Asia.

Conclusions
The Asian summer climate is significantly affected by the large-scale atmospheric circulation. The ability of models to simulate circulation characteristics is one of the most important factors affecting the future progress of Asian regional climate research. This paper uses the SOM method to evaluate the ability of CMIP6 in simulating Z500 in Asian region. Our results show that the r10i1p1f1 ensemble of CanESM5 is the best ensemble for frequency performance, and it is also the top ensemble for comprehensive performance, measured as the MR metric. The r1i1p1f1 ensemble of NorESM2-MM is the best ensemble for persistence performance, and the r8i1p1f1 ensemble of MPI-ESM1-2-LR is the top ensemble for transition performance. The r1i1p1f1 ensemble of INM-CM4-8 is the lowest ensemble for frequency and persistence performance. The r6i1p1f1 ensemble of IPSL-CM6A-LR has the lowest transition ranking. The r1i1p1f1 ensemble of INM-CM5-0 is the ensemble with the lowest of MR metric. Generally speaking, the rankings of different ensembles from the same insititution are relatively close. For example, the MR rankings of the 25 ensembles of CanESM5 are relatively high, whereas the MR rankings of the 32 ensembles of IPSL-CM6A-LR are relatively low. In addition, pairwise correlation coefficients between the frequency performance, persistence performance and transition performance are all around 0.5 and significant, which means that good simulation skill for one type of circulation characteristic (like frequency) is related to skills in simulating other types of circulation characteristics (persistence and transition).
Judging from the pdf distributions of different circulation indexes, the top-ranked ensembles indeed simulate the WNPSH and the SAH better, indicating that the rankings based on the SOM method are credible. The evaluation of the circulation indexes supports the rankings based on the SOM method.