Understanding the impacts induced by cut-off thresholds and likelihood measures on confidence interval when applying GLUE approach

Generalized likelihood uncertainty estimate (GLUE) approach is affected heavily by the choice of cut-off threshold (Tr) and likelihood measure. This work presents a research on investigating potential mechanisms behind the impacts of Tr and likelihood measure on 95% confidence interval estimated by GLUE (95 CI). Several typical likelihood measures are classified into five groups according to error models. Theoretical analysis reveals strict mathematical relationships between likelihood measures based on a same error model. Besides, a multiple attribute decision making (MADM) framework is proposed to assess overall quality of various geometrical features for 95 CI by integrating a range of interval indicators into a comprehensive score. Case studies indicate that (1) As Tr increases, 95 CI widens in low-level flow section, moves upward in recession phase of medium-level flow section and narrows in high-level flow section. The variations of interval indicators with Tr can be attributed to trade-off mechanism amongst the variations of geometrical features caused by widening, moving and narrowing trends of 95 CIs. (2) Much higher similarity in interval indicators and 95 CIs occurs for likelihood measures of a same group than those of different groups. (3) Formula of likelihood measure controls individuality of 95 CIs, error model of likelihood measure controls commonality of 95 CIs, and Tr acts as a regulator between individuality and commonality. (4) 95 CI shows similar responses to the choice of Tr and likelihood measure regardless of time span and climate variability. To summarize, this work offers further understanding and guidance in chose of Tr and likelihood measure for GLUE applications.

As based on the concept of equifinality, GLUE accepts a series of comparatively reliable (also called 'behavioral') parameter sets rather than a single optimal parameter set (Beven 2006). Behavioral and non-behavioral parameter sets are distinguished by joint use of cut-off threshold and likelihood measure . Likelihood measure is used to quantify the fit-of-goodness between model outputs and observed data (Chen et al. 2021). A parameter set is considered behavioral when its likelihood value is larger than cut-off threshold (Zulkafli et al. 2021). By running hydrological model with all behavioral parameter sets, a set of model outputs are estimated at each time point and, then, used to derive posterior distribution by Bayesian theory. 95% confidence interval (95 CI) is obtained by 2.5th and 97.5th percentiles of posterior distributions at all time points (Xie et al. 2019). 95 CI is often used to quantify the uncertainty of model outputs (Yan et al. 2017), which is also research object of this work.
The largest reason for widespread application of GLUE is the simplicity of assumption, ease of implementation, and no need of modification to source codes (Su et al. 2018;Pang et al. 2019). Ballinas-González et al. (2016) pointed out that GLUE improves uncertainty analysis by introducing new information without making strict error assumptions. Tongal and Booij (2017) stated that model nonlinearity, parameter covariation as well as errors arising from different sources (i.e. model structure, model parameters and input-output datasets) can be handled implicitly in GLUE. Some works considered that GLUE is robust in data-scarce, discontinuous and over-parameterized circumstances (Fuentes-Andino et al. 2017;Shivhare et al. 2018). Due to above merits, GLUE has been extended to the fields of crop growth (Li et al. 2018a, b), nitrous oxide emission (He et al. 2016), acid deposition (Whyatt et al. 2017), coastal engineering (Simmons et al. 2017), wildfire spread (Benali et al. 2017), and virus inactivation (Mayotte et al. 2017).
95 CI is very sensitive to the choice of cut-off threshold and likelihood measure (Simmons et al., 2017). A variety of likelihood measures have been suggested in literature (e.g. Beven and Freer 2001;Mirzaei et al. 2015;Sharafati et al. 2018). Before 2010s, there used to be some debates on informal formulas of likelihood measures and subjective choice of cut-off threshold in GLUE (e.g. Beven and Freer 2001;Vrugt et al. 2009). Till now, no agreement has been reached due to many differences in basic hypotheses of GLUE and formal Bayesian methods. Beven and Binley (1992) considered GLUE to be reasonable because the nature of errors cannot be justified in sophisticated and non-stationary natural circumstances. Sharafati et al. (2018) deemed that GLUE allows flexible definition in formulas of likelihood measures according to modelers' targets. By adopting suitable cut-off threshold and likelihood measure, GLUE can generate similar or even better 95 CIs than other uncertainty analysis methods (e.g. Ballinas-González et al. 2016;Lehbab-Boukezzi and Boukezzi 2019;Vrugt et al. 2009).
Quantitative analysis of 95 CI is a difficult task because 95 CI is not a single value but constituted by a series of intervals at all time points . For this, hydrologists created a range of interval indicators to characterize various geometrical features of 95 CIs. A widely-used interval indicator is the ratio of observed data covered by 95 CI (i.e. CR). A CR close to 100% is desirable (Sharafati et al. 2018), but cannot be achieved in most cases (Beven and Freer 2001). Zhang et al. (2015) believed that an excessively wide 95 CI conveys too little beneficial information. Two indicators of band-width (B) and relative band-width (RB) are often adopted to quantify the thickness of 95 CI (Ballinas-González et al. 2016;Choi et al. 2020;). In addition, Xiong et al. (2009) proposed four interval indicators related to the symmetry degree and deviation amplitude of 95 CI, which have been extensively used in the last decade (e.g. Chen et al. 2013;Ragab et al. 2020;Shi et al. 2019;Pang et al. 2019Pang et al. , 2020.
Recalling current literature, two problems are of urgent need to be addressed in GLUE applications: 1. Some contrast studies reported that GLUE yields poorer 95 CIs than other uncertainty analysis methods (Emam et al. 2018;Shivhare et al. 2018;Tang et al. 2021). Nonetheless, we deem their conclusions to be unconvincing because GLUE was applied with no careful choice of cut-off threshold and likelihood measure in these studies. They abandoned the great merit of flexible definition in cut-off threshold and likelihood measure in GLUE. This equals to compare an uncalibrated model with calibrated ones. To solve this problem, some scholars tried to identify suitable cut-off threshold and likelihood measure by comparing geometrical features of the 95 CIs obtained by different combinations of cut-off thresholds and likelihood measures (e.g. Freni et al. 2008Freni et al. , 2009Li et al. 2011;Zhu and Du 2017). However, most works did not assess overall quality of various geometrical features for 95 CIs, yet focused only on one or two interval indicators (i.e. CR and B/RB in most cases). More importantly, literature has rarely carried out investigations on how and why geometrical features of 95 CIs are affected by choice of cut-off threshold and likelihood measure. 2. Interval indicators characterize a 95 CI from the perspectives of various geometrical features . In this sense, an analysis on overall quality of 95 CIs corresponds to an analysis on comprehensive behaviors of multiple interval indicators. The studies of Chen et al. (2013), Xiong et al. (2009), Ragab et al. (2020 and Shi et al. (2019) revealed that interval indicators of a 95 CI often conflict with each other. Generally, large coverage ratio (i.e. large CR), narrow band-width (i.e. small B or RB) and high symmetry degree (i.e. low S or Ts) cannot be achieved simultaneously (e.g. Xiong et al. 2009;Chen et al. 2013). In many works, cut-off threshold and likelihood measure are determined subjectively by a balance amongst different interval indicators of 95 CIs with no convincing evidences. To the best of our knowledge, none has proposed a quantitative framework to assess overall quality of various geometrical features for 96CIs based on a range of interval indicators.
Based on what's discussed, this work aims to (1) study potential mechanisms behind the impacts induced by choices of cut-off threshold and likelihood measure on 95 CIs; and (2) build a multiple attribute decision making (MADM) framework to assess overall quality of various geometrical features for 95 CIs. Section 2 gives a brief description of GLUE, cut-off threshold, likelihood measure and interval indicator. Particularly, nine typical likelihood measures are classified into five groups according to their error models. Theoretical analysis is conducted to establish mathematical relationships between likelihood measures based on a same error model, and to study how error model and formula of likelihood measure affect the steps of GLUE. Subsequently, an MADM framework is proposed to integrate multiple interval indicators into a comprehensive score. Section 3 describes the study area of our interest and HBV hydrological model. Section 4 carries out three numerical experiments and a real application to investigate the impacts induced by choice of cut-off threshold and likelihood measure on 95 CIs. Section 5 discusses the implications behind some findings of Sect. 4. Finally, Sect. 6 summarizes the conclusions of this work. Step 1 Decide the formula of likelihood measure, value of cut-off threshold, number of samples N sampling , and prior distributions of model parameters.
(2.1) Sample a candidate parameter set from prior distributions of model parameters by Monte Carlo sampling method; (2.2) Run hydrological model with candidate parameter set; (2.3) Calculate likelihood value with the formula of likelihood measure based on model outputs and observed flows.
Step 3 Accept candidate parameter set if likelihood value exceeds cut-off threshold. Such parameter set is called 'behavioral' parameter set.
Step 4 Run hydrological model with all behavioral parameters sets and estimate posterior probability density (PPD) distribution of model outputs at each time point by Bayesian theory. Please refer to Blasone et al. (2008) for details on estimation of PPD distribution.
Step 5 Obtain 95 CI according to the 2.5th and 97.5th percentiles of PPD distributions at all time points.

Cut-off threshold
Cut-off threshold (Tr) is used in step 3 of GLUE to separate behavioral and non-behavioral parameter sets. A behavioral parameter set yields a likelihood value of larger than Tr. Tr can be defined as a value (e.g. 0.8, 0.9) (Shivhare et al. 2018) L2 Inverse of residual variance (IRV) L3 Exponential likelihood (EL) L8 Mean absolute relative error (MRE) Detail information of above likelihood measures is listed in Table 1. These likelihood measures can be categorized into five groups according to error models. For example, the formulas of CE, IRV and EL (i.e. L1, L2 and L3 in Table 1) contain a same term of P n i¼1 ðEq i À Q i Þ 2 , which is a common error model called square error (SE). It can be derived that CE, IRV and EL can be transformed mathematically into each other by following formulas: where where Q i is observed flow at the ith time point, Q is the mean of observed flows, while N 1 and N 2 are factor exponents of IRV and EL, respectively. Considering strict mathematical relationships in Eqs.
(2), (3) and (4), we classify CE, IRV and EL into a same group called square error (SE) group. As such, NAE and IMAE are categorized into absolute error (AE) group, CM into square root error (SRE) group, LEL into square logarithmic error (SLE) group, and MRE and EMRE into absolute relative error (ARE) group. Strict mathematical relationships can always be established between likelihood measures based on a same error model. As to NAE and IMAE in AE group, there is: As to MRE and EMRE in ARE group, there is: The grouping of likelihood measures and the formulas of error models are also shown in Table 1 This also means that the rankings of candidate parameter sets in order of likelihood values are identical for CE, IRV and EL. If Tr takes a percentage form of p%, CE, IRV and EL obtain identical populations of behavioral parameter sets in step 3 of GLUE. Meanwhile, different formulas of CE, IRV and EL yield different likelihood values for a same parameter set, leading to some distinctions in PPD distributions in step 4 and 95 CIs in step 5. Above analysis applies to likelihood measures of a same group, but not to those of different groups. Main reason is that no strict mathematical relationships like Eqs. (1)-(7) can be established between two likelihood measures of different groups. Therefore, identical Tr generates different populations of behavioral parameter sets in step 3, different PPD distributions in step 4 and different 95 CIs in step 5.
As a whole, both error models and formulas of likelihood measures show substantial impacts on GLUE steps and estimated 95 CIs. Error model of likelihood measure determines the population of behavioral parameter sets in step 3 of GLUE, whereas the formula of likelihood measure affects some characteristics of PPD distributions in step 4 of GLUE. Detailed impacts of these two factors on 95 CIs are investigated in Sect. 4.

Quantitative assessment on overall quality
of geometrical features for 95 CI

Interval indicators
The analysis of 95 CI is generally converted to the analysis on interval indicators characterizing various geometric features of 95 CI. Table 2 lists seven widely-used interval indicators, which are further divided into four types related to coverage ratio, band-width, symmetry and deviation amplitude. Some indicators belong to a same type, yet focus on different aspects of a certain geometrical feature. For example, both B and RB describe the thickness of 95 CI, the former of which considers flow magnitude but the latter removes the impacts of flow magnitude. Please refer to the cited references listed in Table 2 for details of interval indicators. Theoretically, ideal 95 CI is characterized by large coverage ratio (i.e. large CR), narrow band-width (i.e. small B and RB), high symmetry degree (i.e. low S and Ts) and short distances from interval centers to observed flows (i.e. small D and RD) (Xiong et al. 2009). Many works, however, reported that good qualities of geometrical features cannot be achieved simultaneously and interval indicators often conflict with each other (e.g. Chen et al. 2013;Shi et al. 2019;Ragab et al. 2020). The analysis on comprehensive behaviors of multiple interval indicators is a typical multiple attribute decision making (MADM) problem, which has not been studied in uncertainty analysis of hydrological simulation.

Multiple attribute decision making (MADM) framework
MADM framework is often employed to identify the best solution from multiple ones (Kelemenis and Askounis 2010). This work adopts modified version of the MADM framework proposed by Li et al. (2018a, b) Table 2 are calculated and, then, stored into a vector of S = (a 1 , a 2 ,. . ., a i ,. . ., a 7 ) where a i is the value of the jth interval indicator (j = 1, 2,. . ., 7). Vector S is called an ''event''. Different combinations of Trs and likelihood measures generate different 95 CIs and events. If there are n events, an indicator matrix of A = [a ij ] n9m can be constructed with a ij being the value of the jth indicator in the ith event (i = 1, 2, . . ., n; j = 1, 2,. . ., 7). The ith row in matrix A corresponds to the ith event, and the jth column in matrix A corresponds to a vector consisting of the jth indicators in all events. A larger B implies worse quality (i.e. wider band-width) for a 95 CI and such indicator is called a ''negative'' indicator. Inversely, a larger CR implies better quality (i.e. higher coverage ratio) for a 95 CI. An indicator like CR is called a ''positive'' indicator. Of all indicators listed in Table 2, CR is the only positive indicator and the others (i.e. B, RB, S, Ts, D, RD) are all negative indicators. Different formulas are adopted to normalize positive and negative indicators. For positive indicators, there is: and, for negative indicators, there is: . . .; n; j ¼ 1; 2; . . .; 7 ð8Þ  Average asymmetry degree 2 (Ts) where a j max = max(a 1j , a 2j , …, a nj ) and a j min = min(a 1j , a 2j , …, a nj ). Each element in the matrix of B = [b ij ] n9m falls into the range from 0 to 1.

2.2.2.2
Step 2: weight assignment to each indicator The weights of interval indicators are calculated by criteria importance through inter-criteria correlation (CRITIC) method (Rostamzadeh et al. 2018). One advantage of CRITIC method is the consideration of both inter-and intra-correlations in indicators. Main steps of CRITIC method are shown below: 1. Calculate standard deviation of each indicator (i.e. inter-correlation of indicator) in matrix B: where b j denotes the mean of the jth indicator in matrix B.

Compute correlation coefficient of every two indicators
(i.e. intra-correlation of indicators): 3. Combine standard deviation with correlation coefficients to construct a label Z j for each indicator: 4. Lastly, obtain CRITIC weight of the jth indicator by: 2.2.2.3 Step 3: generation of comprehensive score by TOPSIS TOPSIS is a powerful ranking tool (Kelemenis and Askounis 2010;Ş engül et al. 2015). The processes of generating comprehensive score for each event by TOPSIS are presented as follows: 1. Multiply each element in matrix B with CRITIC weight: 2. Define ideal and negative-ideal events as: where V j ? = max(vb 1j , vb 2j , …, vb nj ) and V j¯-= min(vb 1j , vb 2j , …, vb nj ) (j = 1, 2,. . ., 7). 3. Calculate Euclidean distances from the ith event to ideal and negative-ideal events by: 4. Generate comprehensive score (CS) based on relative distance from the ith event to negative-ideal event: CS falls into the range from 0 to 1. An event close to ideal event (i.e. S¯? 0) leads to CS ? 1, while an event close to negative-ideal event (i.e. S ? ? 0) leads to CS ? 0. 5. Rank the events in matrix B according to the magnitude of CSs.
MADM framework generates a CS for each event from statistical perspective. An event in this work contains seven interval indicators characterizing different geometric features of a 95 CI. Therefore, it is statistically reasonable to treat CS as quantitative measure for overall quality of various geometrical features for a 95 CI. A larger CS implies better overall quality.
3 Study area, data and hydrological model

Study area and data
The source region of the Yellow River basin (SYR) is chosen as study area of this work (i.e. Fig. 1), which is controlled by Tangnaihai gauge with an upstream area of 121,000 km 2 . This region is well known as ''water tower'' of the Yellow River basin, supplying 38% of total runoff for the Yellow River basin (Wang et al. 2018). As located at the Qinghai-Tibetan Plateau, the SYR region is dominated by typical cold and semi-humid climate with an annual mean precipitation of approximately 411.0 mm. More than 75% of annual precipitation falls between May and September. Permafrost and seasonally-frozen ground are widely distributed over the whole SYR region (Qin et al. 2017). Melting water of seasonally frozen-ground is also crucial water supply of the SYR region. In this work, daily flows at Tangnaihai gauge as well as daily precipitations and temperatures at 11 meteorological stations

HBV hydrological model
A lumped HBV model is used for simulating daily flows, which consists of four modules including snow accumulation and melt module, soil moisture accounting module, runoff response module and routing procedure module (Zhang and Lindström 1997). Degree-day method in HBV model (i.e. snow accumulation and melt module) can, to some extent, simulate freezing and melting processes of seasonally frozen ground. This is a crucial reason for wide applications of HBV model to hydrological simulations of the SYR region in many previous studies (e.g. Liu and Zhou 2019;Wang et al. 2012). Please refer to Montero et al. (2016) for details of HBV hydrological model. Table 3 provides the descriptions and ranges of model parameters in HBV model.
As many other studies (e.g. Shivhare et al. 2018;Li et al. 2018a, b), GLUE is also applied to calibrate HBV model in this work. In calibration process, likelihood measure is specified as CE and the Step 2 of GLUE in Sect. 2.1.1 is executed for 10,000 times. The parameter set yielding the largest CE in calibration period is recognized as the optimal parameter set of HBV model for the SYR region. After calibration, CEs of model outputs simulated by optimal parameter set are larger than 0.90 in both calibration and validation periods. Mean relative errors (MREs) are 4.0% and 9.0% in calibration and validation periods, respectively. High CEs and low MREs in both calibration and validation periods demonstrate that HBV model is appropriate in the estimation of daily flows for the SYR region.
As a real application, the impacts of climate variability on 95 CIs are studied in Sect. 4.4. In each GLUE simulation, sampling number is set as 10,000 and model parameters are uniformly distributed within the ranges listed in Table 3  Rank mean daily flows in descending order and define mean daily flows within the range of [ 70% probability as low level, 30%-70% as medium level and \ 30% as high level. In this section, mean daily flows of 85% probability (i.e. 181.7 m 3 /s on March 4th), 50% probability (i.e. 422.9 m 3 /s on November 20th) and 15% probability (i.e. 1039.9 m 3 /s on August 2nd) are selected as examples of low-, medium-and high-level flows. The SYR region is frequently affected by La Ninas and El Ninos (https://origin.cpc.ncep.noaa.gov/). Climate variability may be major reason for extensively wide intervals between different (i.e. 0th, 25th, 75th and 100th) percentiles of boxplots from July to October, leading to huge differences in annual total flows of different years.   Fig. 3d-f, where observed flows are successfully covered by 95 CIs at all Trs. Much less underestimation in PPD distributions can be detected for medium-level flows (i.e. Fig. 3d-f) than low-level flows (i.e. Fig. 3a-c). The rejection of small model outputs with increase of Tr leads to an increase in 2.5th percentile of all boxplots in Fig. 3d Fig. 3g-i), PPD distributions exhibit much higher degree of symmetry than PPD distributions of low-and medium-level flows in Fig. 3a-f. It can also be observed that enlarging Tr further improves symmetry degree of PPD distributions. Particularly in the case of Tr = 95%, PPD distributions are symmetrical approximately to dotted vertical lines (i.e. observed daily flows) in Fig. 3g-

Case 2: potential impacts induced by the formula of likelihood measure on 95 CI
Case 2 is designed to investigate the impacts induced by formulas of likelihood measures on 95 CIs. Likelihood measures used in Case 2 are all chosen from SE group to remove the impacts induced by error models of likelihood measures, i.e. CE, IRV(0.5), IRV(1), IRV(2), EL(1), EL(5) and EL(10). The value in parenthesis denotes the factor exponent of corresponding likelihood measure.   The impacts induced by factor exponents of likelihood measures are also investigated in Case 2. Three factor exponents (i.e. 0.5, 1 and 2) are adopted by IRV and three (i.e. 1, 5 and 10) by EL. Evidently, a larger factor exponent yields steeper and more symmetrical PPD distributions in Fig. 6a, c, thus narrower 95 CIs and less distances from interval centers to observed flows in Fig. 6b, d. This leads to the improving quality of all indicators with increasing factor exponent in both Figs. 5 and 7. Of all likelihood measures, EL(10) obtains the steepest and most symmetrical PPD distributions at all Trs. A likely reason is that EL(10) assigns very small probability to model outputs far from observed flows but large probability to those near observed flows. As shown obviously in Figs. 5, 6 and 7, the impacts induced by factor exponents of likelihood measures on PPD distributions and 95 CIs are also weakened by increasing Tr. To summarize, above findings indicate that the differences in PPD distributions, interval indicators and 95 CIs induced by formulas or factor exponents of likelihood measures are controlled by the value of Tr. A larger Tr generally leads to less differences.

Impacts on PPD distributions and interval indicators
There are slight distinctions in interval indicators of calibration and validation periods. Most indicators (i.e. CR, B, S, Ts, D and RD) are slightly worse in validation period than in calibration period. The only exception is RB,  Fig. 4a). More medium-and high-level flows lead to a larger denominator in the formula of RB (See Table 2) and, therefore, lower RB.

Assessment on overall quality of 95 CI
It is difficult to identify the best Tr for a given likelihood measure or the best likelihood measure for a given Tr because good qualities of different geometrical features (e.g. large coverage ratio, narrow band-width) cannot be achieved simultaneously by a 95 CI. In Fig. 5, EL(10) obtains the best CR at Tr = 45%, the best S at Tr = 90%, the best Ts at Tr = 85%, and the best B, RB, D and RD at Tr = 95%. In case of Tr = 90%, CE is superior to EL(10) in CR, S, Ts and RD, but inferior to EL (10)   Above analysis indicates that general characteristics of PPD distribution depends heavily on error model of likelihood measure. Different characteristics of PPD distribution at each time point (e.g. Fig. 9) leads to large distinctions in interval indicators of 95 CIs obtained by likelihood measures based on different error models (i.e. Figs. 10, 11).
Due to different formulas of likelihood measures, slight distinctions can be seen in interval indicators obtained by  Figures 10 and 11 reveal that such distinctions are much more remarkable at low Trs than at large Trs. As analyzed before, error models of likelihood measures lead to similarity of 95 CIs while formulas of likelihood measures lead to differences of 95 CIs. It can, therefore, be concluded that increasing Tr can  likelihood measure has more powerful impacts on 95 CIs than formula of likelihood measure.

Assessment on overall quality of 95 CI
The problem of indicator conflicts is even more serious in Figs. 10 and 11 compared to Figs. 5 and 7. Figure 12 shows the CSs of 95 CIs obtained by different likelihood measures in calibration and validation periods. Increasing Tr leads to larger CSs for all likelihood measures, implying an improvement in overall quality of geometrical features for 95 CIs. CSs obtained by likelihood measures of a same group show high consistency, which is further enhanced with the increase of Tr. Besides, it can be found that distinctions in CSs (i.e. Fig. 12) are not as significant as distinctions in a single interval indicator (i.e. Figs. 10, 11) for different likelihood measures. Ranking the CSs of Tr = 95% in calibration period (i.e. Fig. 12a) obtains SE group [ AE group [ SRE group [ ARE group [ SLE group. This, to some extent, demonstrates the rationality of adopting SE-based likelihood measure for GLUE simulations in many works (e.g. Chen et al. 2021;Lehbab-Boukezzi et al. 2016;Simmons et al. 2017;Sheng et al. 2019, etc.). In Fig. 12, the CSs of AEbased likelihood measures are very close to those of SEbased likelihood measures. LEL works poorly because ARE error model evenly treats the large and low flows of different hydrological implications. According to the magnitudes of CSs, SE-based likelihood measure combined with large Tr is highly recommended for estimation of 95 CIs by HBV model in the SYR region.   Fig. 13 exhibit CSs of 95 CIs for individual years, where each radar map corresponds to the results of a likelihood measure. The variations of CSs with years at Tr = 45, 70, 95% are marked by red, blue and black lines in each radar map, respectively. Given a Tr and likelihood measure, the CSs of different years are ranked in descending order and rankings are written beside corresponding CSs.
In all radar maps of Fig. 13, an increase in Tr always leads to an increase in CSs for all years. As to likelihood measures of a same group, high similarities can be seen in magnitudes and rankings of the CSs for a given year. For given year and Tr, the largest CSs are generally obtained by SE-based likelihood measures (i.e. Fig. 13a-c), while the lowest CSs are always obtained by SLE-based likelihood measure (i.e. Fig. 13g). It is also clear that change patterns of CSs with Tr and likelihood measure for individual years are generally consistent with those for the whole period (i.e. Case 3 in Sect. 4.3). To some extent, these findings suggest that 95 CI shows similar responses to choice of Tr and likelihood measure regardless of time span and climate variability.

Impacts of flow magnitude on 95 CI
2008 is a wet year of having the largest annual total flow, and obtains the lowest CSs in most cases (i.e. ranked 9th). The magnitude of annual total flow is affected by spatial-temporal distribution of precipitation, which is controlled by sophisticated interactions of various local and global meteorological systems. Uncertainty of precipitation might be weakened or enhanced by joint influences of different meteorological systems, thus reduce or increase the uncertainty of the results simulated by HBV model. This may be crucial reason for the non-sensitivity of CS to the magnitude of annual total flow.

Impacts of La Ninas and El Ninos on 95 CI
The years affected by La Ninas and El Ninos are marked respectively by blue and red colors on each radar map of Fig. 13. Clearly, different error models show different degrees of sensitivity to La Ninas and El Ninos: 1. As to SE-, AE-and SRE-based likelihood measures in Fig. 13a (i.e. 2002, 2004, 2006, 2007 and 2008) generally obtain lower CSs compared to unaffected years (i.e. 2001, 2003 and 2005). There are no large distinctions in the magnitudes of CSs for the years affected by La Ninas and El Ninos. This implies that both La Ninas and El Ninos lead to a decrease in overall quality of the 95 CIs obtained by LEL-based likelihood measure. 3. In Fig. 13h, i, the years affected and unaffected by La Ninas or El Ninos can also be well distinguished by the CSs obtained by ARE-based likelihood measures (i.e. MRE and EMRE). Unaffected years obtain higher CSs with rankings of lower than 4th compared with affected years. Moreover, the years affected by El Ninos gain lower CSs than those affected by La Ninas in Fig. 13h, i. This indicates that ARE-based likelihood measures can also distinguish the years affected by La Ninas and the years affected by El Ninos. Therefore, we deem that ARE error model is more sensitive to La Ninas and El Ninos than other error models.
La Ninas and El Ninos are anomalous meteorological phenomena, which can interfere spatial-temporal distribution of precipitation in global scale. Anomalous alteration of hydro-meteorological conditions due to La Ninas and El Ninos enhances the uncertainty of precipitation, leading to larger uncertainty of the results simulated by HBV model (i.e. lower CSs). This explains the reason for the poorer overall quality of 95 CIs in the years affected by La Ninas and El Ninos.

Variation of interval indicators with Tr
95 CI can hardly be assessed in direct and quantitative manners because constituted by a set of intervals between the 2.5th and 97.5th percentiles of PPD distributions at all time points. As many other works (e.g. Benali et al. 2017;Tongal and Booij 2017), this work converts the analysis of 95 CIs to the analysis on interval indicators of 95 CIs. In this work, a range of interval indicators (i.e. Table 2) are adopted to facilitate a quantitative assessment on various geometrical features of 95 CIs.
The band-width of 95 CI shows a narrowing trend with increase of Tr for all likelihood measures in Figs. 5, 7, 10 and 11, which agrees with the findings of many previous works (e.g. Pang et al. 2018;Su et al. 2018). It can also be found in Figs. 5, 7, 10 and 11 that larger Tr generally leads to a higher ratio of observed flows covered by 95 CIs (i.e. larger CR). This is, however, opposite to decreasing trend of CR with increasing Tr commonly reported by other studies (e.g. Shi et al. 2019;Xiong et al. 2009). Table 5 collects the results of some studies concerning change patterns of interval indicators with Tr. As reported repeatedly in literature, an interval indicator does not show regular change patterns with Tr, yet shows different change patterns under different circumstances.
Given Tr and likelihood measure, interval indicators of 95 CIs are affected seriously by characteristics of hydrological model and study area. A hydrological model adopts a set of conceptual or physical modules to describe different processes of rainfall-runoff responses in nature. Low-, medium-and high-level flows show different degrees of sensitivity to different modules. For instance, low-level flow is affected mainly by modules related to groundwater and baseflow, and high-level flow by modules related to runoff production and concentration of surface flow. In a study area, these modules may present good descriptions of some natural processes, but poor descriptions of some others. Besides, describing ability of a module may be good in some study areas but poor in some others because a natural process may have different manifestations in study areas dominated by different hydrological, meteorological and topographical features. The differences in describing ability of modules to natural processes lead to distinctions in various geometrical features of 95 CIs in flow sections of different levels. Modules of good describing ability may cause high coverage ratio (i.e. large CR) in mainly-affected flow sections. An increase in Tr improves the quality of most geometrical features for 95 CIs. Inversely, modules of poor describing ability may cause low coverage ratio (i.e. small CR) in mainly-affected flow sections, where increasing Tr leads to decreasing quality of most geometrical features for 95 CIs. As discussed in Sect. 4.1.3, we attribute the variations of interval indicators with Tr to trade-off mechanism amongst the variations of geometrical features caused by narrowing, moving and widening trends of 95 CIs in flow sections of different levels.

Conflicts in interval indicators
Table 5 also shows the existence of conflicts in interval indicators related to different geometrical features. Many studies reported that large coverage ratio (i.e. large CR), narrow band-width (i.e. small B and RB) and high symmetry degree (i.e. low S and Ts) cannot be achieved simultaneously (e.g. Chen et al. 2013;Shi et al. 2019;Xiong et al. 2009). The problem of indicator conflict is widespread in contrastive studies of different Trs , likelihood measures (Choi et al. 2020), hydrological models (Chen et al. 2013) and study areas (Ragab et al. 2020). Obvious indicator conflicts can also be observed in Figs. 5, 7, 10 and 11 of this work.
The problem of indicator conflict is solved in this work by an MADM framework proposed in Sect. 2.2.2. In this framework, a CS is generated for each 95 CI by integrating a range of interval indicators, which can be viewed as a statistically meaningful measure for overall quality of 95 CI. This enables a quantitative comparison of the 95 CIs obtained under different circumstances (e.g. Trs, likelihood measures, hydrological models, study areas, etc.). The MADM framework is very flexible because interval indicators imported into framework can be adapted according to users' objectives. Some scholars seek only for a balance between coverage ratio and band-with. Then, MADM framework retains interval indicators related to coverage ratio and band-width, while removes those related to other geometrical features. If the flows of different study areas are of different magnitude orders, one can remove interval indicators affected by flow magnitudes (e.g. B, D) and retain relative or dimensionless interval indicators (e.g. CR, RB, S, Ts, RD). Of course, newly-created indicators can also be brought into MADM framework.

Implications on hydrological modeling
The SYR region is a crucial water source for Northwest and North China, yet it is a resource-deficient area with a per-capita water resource of 1/20 world average level. Currently, this region has been suffering seriously from many eco-social-economic problems, such as eco-environmental degradation, permafrost reduction, vegetation degeneracy, desertification extent and contradiction between the supply and demand of water resources (Di et al. 2019;Wang et al. 2001). Uncertainty analysis helps the alleviation of supply-demand contradiction, improvement of river health, promotion of water-use efficiency, which are of great hydrological, ecological, social and economic significance to management and development of the SYR region (Di et al. 2019;Feng et al. 2006). In this work, HBV model yields good simulations of medium-and high-level flows, but poor simulations of lowlevel flows for the SYR region (see Figs. 3,4). This accords well with the results of Tegegne et al. (2019) and Tongal and Booij (2017). In low-level flow sections, 95 CI is extremely underestimated at almost all time points, leading to low coverage ratio (i.e. small CR), large bandwidth (i.e. large B and RB), bad symmetry (i.e. large S and Ts) and long distances from interval centers to observed flows (i.e. large D and RD). Figure 4a shows that the worst simulations of HBV model occur in March when seasonally-frozen ground begins to melt and becomes a crucial water supply of the SYR region. This may be caused by two reasons. Firstly, frozen-thaw processes of seasonallyfrozen ground cannot be well described by degree-day method of HBV model. Secondly, daily average temperature is used in HBV model, but seasonally-frozen ground may melt during the periods with temperature of [ 0°C even though daily average temperature is less than 0°C.
The MADM framework proposed in Sect. 2.2.2 also has great implications on hydrological modeling. As tested in this work, this framework can be used to identify the most appropriate combination of Tr and likelihood measure. This maximizes the advantage of flexibility in GLUE and offers a statistically reasonable way of identifying or comparing Tr and likelihood measure in practical cases. It can also be applied to many contrastive studies, e.g. choice of hydrological model or uncertainty analysis method for a study area, comparison of the 95 CIs for different study areas, etc. Besides, this framework can be extended to GLUE-based uncertainty analysis in many other fields, including agriculture (Li et al. 2018a, b;Sheng et al. 2019), chemistry (Whyatt et al. 2017), environmental science (He et al. 2016;Su et al. 2018) and emergency management (Benali et al. 2017).

Conclusions
This paper presents an investigation on potential mechanisms behind the impacts induced by choice of Tr and likelihood measure in GLUE on the characteristics of 95 CI. The SYR region is chosen as study area and HBV model is employed for simulating daily flows. Seven typically-used interval indicators are adopted to describe various geometrical features (i.e. coverage ratio, bandwidth, symmetry and deviation amplitude) of 95 CIs. Major findings are summarized as follows: 1. Nine typical likelihood measures are classified into five groups according to error models, namely square error (SE), absolute error (AE), square root error (SRE), square logarithmic error (SLE) and absolute relative error (ARE). Theoretical analysis reveals strict mathematical relationships amongst the likelihood measures of a same group. It is also proved that error model of likelihood measure determines the population of behavioral parameter sets in step 3 of GLUE. 2. An increase in Tr leads to various change patterns of 95 CIs in low-, medium-and high-level flow sections for the SYR region. The variation of interval indicators with Tr reveals that a larger Tr generally leads to larger coverage ratio, narrower band-width, higher symmetry degree and less distances from interval centers to observed flows for 95 CI. This implies an improvement in the quality of all geometrical features for 95 CIs with increase of Tr. However, some indicators show different change patterns with the findings of previous studies (see Table 5). We deem that the variations of interval indicators with Tr can be attributed to the trade-off mechanism amongst the variations of geometrical features caused by widening, moving and narrowing trends of 95 CIs in flow sections of different sections. 3. Both formula and error model of likelihood measure show substantial influences on PPD distribution, interval indicator and 95 CI. General characteristics of PPD distributions depend on error model of likelihood measure, whereas detail characteristics of PPD distributions are adapted by formula of likelihood measure. Therefore, much higher similarity in PPD distributions and interval indicators can be detected for likelihood measures of a same group than those of different groups. 4. Enlarging Tr weakens the distinctions of the 95 CIs caused by different formulas or error models of likelihood measures. It can be summarized that the formula of likelihood measure controls the individuality of 95 CIs, error model of likelihood measure controls the commonality of 95 CIs, and Tr acts as a regulator between individuality and commonality. These three factors jointly result in ultimate characteristics of PPD distributions and 95 CIs in GLUE application. 5. Obvious conflicts can be observed in interval indicators of the 95 CIs obtained by different Trs and likelihood measures. To solve the problem of indicator conflict, an MADM framework is proposed in this work to integrate a range of interval indicators into a CS, and a larger CS stands for better overall quality of various geometrical features for 95 CI. Case studies show that SE-based likelihood measure combined with large Tr is highly recommended for estimation of 95 CI by HBV model in the SYR region.
6. The YSR region shows significant climate variability during 2001-2009 with large distinction of annual total flow and frequent occurrence of La Ninas or El Ninos. The investigation on CSs for individual years reveals that 95 CI shows similar responses to choice of Tr and likelihood measure regardless of time span and climate variability. Moreover, overall quality of 95 CI is sensitive to La Ninas and El Ninos, but not to the magnitude of annual total flow. Of all error models, ARE error model shows the highest sensitivity to overall quality of 95 CIs for the years affected by La Ninas and El Ninos.
Note that some of above conclusions may apply only to particular hydrological model and study area of this work. In addition, study objects of this work involve only regular likelihood measures, but not those designed for specific targets (e.g. Pang et al. 2019;Liu et al. 2020). More works are needed to testify some findings and MADM framework of this work under more complex and diverse circumstances.