Implementation of Supervised PCA for Global Sensitivity Analysis of Models with Correlated Inputs

Global Sensitivity Analysis (GSA) plays a significant role in quantifying the impact of model inputs on the uncertainty of the response variable. As GSA results are strongly affected by correlated inputs, several studies have addressed this issue, but most of them are quite time-consuming, labor-intensive, and difficult to implement. Accordingly, this paper puts forward a novel strategy based on Supervised Principal Component Analysis (Supervised PCA), benefiting from the Reproducing Kernel Hilbert Space (RKHS). Indeed, by conducting one kind of variance-based sensitivity analysis (SA), a renowned method exclusively customized for models with orthogonal inputs, on Supervised PCA (SPCA) regression, the impact of the correlation structure of the input variables is effectively taken into account. The ability of the suggested technique is evaluated with five test cases as well as two hydrologic and hydraulic models, and the results are compared and contrasted with those obtained from the correlation ratio method, taken as a robust benchmark solution. It is found that the proposed method satisfactorily identifies the sensitivity ordering of model inputs. Furthermore, it is shown that the performance of the proposed approach is also supported by the total contribution index in the derived covariance decomposition equation. Moreover, the proposed method, compared to the correlation ratio method, is found to be time-efficient and easy to implement. Overall, the proposed scheme is appropriate for high-dimensional, relatively nonlinear, or expensive models with correlated inputs whose coefficient of determination is larger than 0.5.

For high-dimensional problems, the process of quantifying and propagating uncertainty in both parameter estimation and model structure identification becomes quite time-consuming and labor-intensive. Moreover, in a typical simulation with a highly nonlinear model structure, a small error in the input variables might result in a large uncertainty (or error) in the model output (Helton et al. 2005; Iman et al. 2002). Thus, it is valuable to investigate how the uncertainty in the output of a model can be apportioned to the uncertainty in the model inputs. Global Sensitivity Analysis is devised to address these issues in model building and construction.
Indeed, GSA is utilized in many applications in water resources engineering and environmental studies (Ciriello et al. 2013; Zheng and Wang 2015), such as model calibration, uncertainty reduction of model output, model validation, identification of redundant parameters, and decision-making processes that explore which model inputs have the most influential impact on output variability (Crosetto and Tarantola 2001). Given these numerous applications, GSA can also be effectively utilized as a preliminary exploratory data analysis when building the corresponding model.
As an example, in a typical rainfall-runoff process, implementing GSA to build a parsimonious model structure would not only help to eliminate redundant parameters but also reduce the uncertainty in the model output, which in turn saves time and cost. This task is an essential part of any effort to convert rainfall into runoff efficiently.
There is a wide range of GSA methods in the literature, based on different philosophies and theoretical definitions of sensitivity measures, such as regression techniques, the Morris method, variance-based methods, regional SA, and density-based methods. The details of these approaches can be found elsewhere (Saltelli et al. 2004; Borgonovo 2007). A more recent breakthrough in this field is the Variogram Analysis of Response Surface (VARS) method.
However, the cited approaches suffer from a fundamental problem with regard to the correlation structure of the input variables. Nowadays, non-orthogonal model inputs can be considered the rule rather than the exception in GSA, and it is apparent that if model factor dependence is belittled or ignored, the GSA outcome will be fallacious. As a result, on no account is it satisfactory to accept the result of a GSA without considering the dependent nature of the input variables. Accordingly, several scientists have deliberately attempted to present or develop GSA methods in light of dependent factors. As an example, a few studies applied parametric and/or non-parametric methods to decompose the variance of the response variable in the presence of dependency among model inputs (Saltelli et al. 2001, 2004; Xu and Gertner 2008; Da Veiga et al. 2009; Chastaing et al. 2012; Li et al. 2010; Xu 2013; Zhang et al. 2015; Mara et al. 2015), whereas other investigators extended analytical formulations for non-orthogonal input variables and the associated numerical approaches to calculate the variance-based sensitivity measure and/or more advanced Sobol' sensitivity indices (Mara and Tarantola 2012; Kucherenko et al. 2012; Kucherenko et al. 2017; Wang et al. 2018). As the computational budgets associated with most of the above approaches are quite expensive due to high dimensionality and/or complex model structure, the development of more innovative, time-efficient approaches for non-orthogonal inputs is strongly recommended for practical purposes (Ge and Menendez 2017; Zhou et al. 2019; Lamboni and Kucherenko 2021). One should also acknowledge how intricate it is for engineers and other investigators with only rudimentary knowledge of GSA to use the above methods on a routine basis, owing to their complicated procedures and concepts.
In light of the above elaboration, this paper intends to couple variance-based SA, a class of methods customized for independent inputs, with Supervised PCA, based on the Hilbert-Schmidt Independence Criterion (HSIC), to address issues concerning CPU time and methodological complexity. Indeed, implementing this kind of variance-based SA on a regression model derived from Supervised PCA leads to an analytical method, expressed in terms of the components of the dominant eigenvector, for estimating the first-order sensitivity measure with no additional computational budget for systems with several correlated input variables.
The rest of the paper is organized as follows: first, an overview of GSA based on the variance-based approach is presented in the next section. Then, the paper gives a detailed account of Supervised PCA, followed by the proposed method along with the proof of its validity. The subsequent section is devoted to assessing the performance of the proposed method via five test functions as well as two more realistic models in the hydrologic and hydraulic domains, for which the corresponding results of all test cases are compared and contrasted with the correlation ratio method based on McKay's scheme. Finally, the conclusions which can be drawn from the paper are summarized in the last section.

Theoretical Background
The methodology presented in this work is intended to couple Supervised PCA with the variance-based SA.
It is important to make this coupling process quite clear. In essence, the variance-based (e.g., Sobol and/or FAST) methodology is limited to transfer functions for which the input variables are independent. On the other hand, Supervised PCA considers the impact of correlation among input variables and dependent variable using regression-based analysis. Could it be possible to effectively benefit from the advantage of Supervised PCA and combine it with the particular variance-based SA to come up with a better tool for prioritizing the input variables when the input variables are correlated? In light of this question, in what follows, at first, a concise account of the basic concept of variance-based SA is provided, followed by touching on the theoretical background of Supervised PCA in some detail to see how this coupling exercise can be implemented.

Variance-based sensitivity analysis
For a computer model $Y = f(X)$ with $k$ input variables, $X = (X_1, X_2, \ldots, X_k)$, if the input variables are random, then the output becomes a random variable as well. For such a model, the decomposition of the total unconditional variance of the output variable can be written as:

$$V(Y) = V[E(Y|X_i)] + E[V(Y|X_i)] \tag{1}$$

where $V(\cdot)$ conveys the notion of variance and $E(\cdot)$ represents the expectation operator. Also, $V[E(Y|X_i)]$ and $E[V(Y|X_i)]$ are called the main and residual effects, respectively (Saltelli et al. 2008). Thus, in order to quantify the achievable reduction in the variance of the model output, one can compute the first-order sensitivity index as the ratio of the main variance (i.e., $V[E(Y|X_i)]$), a variance of a conditional expectation, to the variance of the model output:

$$S_i = \frac{V[E(Y|X_i)]}{V(Y)} \tag{2}$$

Consequently, the sensitivity of higher-order terms, capturing the interaction of model factors such as $X_i$ and $X_j$ on the model output uncertainty, can be written as:

$$S_{ij} = \frac{V[E(Y|X_i, X_j)]}{V(Y)} - S_i - S_j \tag{3}$$

Saltelli et al. (2004) can be consulted for further detail regarding the computation of higher-order interactions.
Under the independence assumption, the variance-based sensitivity indices can be estimated numerically via the two most popular techniques, i.e., Sobol (1993) and FAST (Cukier et al. 1973).
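As a concrete illustration, the first-order index can be estimated with a minimal pick-freeze (Sobol'-type) scheme; the toy model, sample size, and standard-uniform inputs below are our illustrative assumptions, not from the original study:

```python
import numpy as np

def sobol_first_order(f, k, n=200_000, seed=0):
    """Pick-freeze estimate of S_i = V[E(Y|X_i)]/V(Y) for a model f
    with k mutually independent standard-uniform inputs."""
    rng = np.random.default_rng(seed)
    A, B = rng.random((n, k)), rng.random((n, k))
    yA, yB = f(A), f(B)
    var = np.concatenate([yA, yB]).var()
    S = np.empty(k)
    for i in range(k):
        ABi = A.copy()
        ABi[:, i] = B[:, i]                        # resample only column i
        S[i] = np.mean(yB * (f(ABi) - yA)) / var   # Saltelli (2010) estimator
    return S

# Additive example Y = X1 + 2*X2: analytically S1 = 0.2, S2 = 0.8
S = sobol_first_order(lambda X: X[:, 0] + 2.0 * X[:, 1], k=2)
```

The estimator is valid only when the inputs are mutually independent, which is exactly the restriction the rest of the paper works around.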
If the mathematical model under consideration is assumed to be approximately linear, owing to the relatively linear impact of the model factors on the output, the given model $Y = f(X_1, X_2, \ldots, X_k)$ can be simplified by eliminating the nonlinear and interaction terms as follows:

$$Y = b_0 + \sum_{i=1}^{k} b_i X_i \tag{4}$$

Later on, we will see how the original nonlinear function can be converted to Eq. (4). In light of this simplification, one has to acknowledge that only the main effect of each model input is considered in the corresponding regression model. When the inputs of this linear model are mutually independent, it can be proved that the main effect in reference to the variance decomposition satisfies (Saltelli et al. 2008):

$$S_i = \beta_i^2, \qquad \sum_{i=1}^{k} \beta_i^2 = 1 \tag{5}$$

in which $\beta_i = b_i \sigma_{X_i} / \sigma_Y$. The $\beta_i$ are standardized regression coefficients (SRCs). Indeed, in the case of orthogonal model factors, the SRC is considered a sensitivity analysis method for the model inputs, and in accordance with the identity of Eq. (5), the squared SRCs describe the contribution of each factor to the total variance (Saltelli et al. 2008).
In order to evaluate the performance of the SRCs, it is necessary to appreciate the connection between the SRCs and the measure of goodness of fit, the model coefficient of determination $R^2$:

$$R^2 = 1 - \frac{\sum_{j=1}^{n} (y_j - \hat{y}_j)^2}{\sum_{j=1}^{n} (y_j - \bar{y})^2} \tag{6}$$

where $y$ is the output of the original model, i.e., $Y = f(X_1, X_2, \ldots, X_k)$, and $\hat{y}$ stands for its estimate using the regression model, i.e., Eq. (4). This coefficient recognizes how well the model under study can be approximated by a regression of the form of Eq. (4) (Weisberg 2005). When the model is linear, using the SRCs as a GSA method exactly quantifies the amount of response surface variation explained by each model input. Needless to say, a moderately nonlinear model should honor a coefficient of determination greater than 0.7 to be applicable in this context (Cariboni et al. 2007; Saltelli et al. 2004).
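The SRC and coefficient-of-determination machinery above can be sketched numerically; the test model and sample size are illustrative assumptions of ours:

```python
import numpy as np

def src_and_r2(X, y):
    """OLS fit y ~ b0 + sum_i b_i x_i; returns the SRCs
    beta_i = b_i * sd(x_i)/sd(y) and the fit's coefficient of determination."""
    Xc = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    yhat = Xc @ b
    r2 = 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
    beta = b[1:] * X.std(axis=0) / y.std()
    return beta, r2

# For an exactly linear model with independent inputs, R^2 = 1 and
# sum(beta_i^2) ~ 1, so the squared SRCs apportion the output variance.
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 3))
y = 3 * X[:, 0] + 2 * X[:, 1] + X[:, 2]
beta, r2 = src_and_r2(X, y)
```

With correlated inputs the same code runs, but the squared SRCs no longer sum to one, which is the failure mode the proposed method addresses.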

The Supervised PCA
Supervised PCA, initially introduced by Barshan et al. (2011), laid the foundations for this approach in the field of supervised methods. Before that contribution, a considerable number of supervised dimensionality reduction techniques could only take into account similarities and differences for classification purposes. This property of the cited supervised methods is in contrast with Barshan et al.'s approach, which also examines the quantitative value of the target variables; as a result, it is applicable to both classification and regression problems. Their idea can be considered a paradigm shift in the way researchers carry out prediction based on regression approaches. Their study makes reference to the work of Gretton et al. (2005), who proposed an independence criterion in Reproducing Kernel Hilbert Spaces. This criterion measures the dependency between two random variables according to the Hilbert-Schmidt Independence Criterion (HSIC): two random variables are independent if and only if the value of this statistic (i.e., HSIC) is zero.
There is a theoretical expression for HSIC, but it is impractical in actual settings. Consequently, the empirical estimate of HSIC proposed by Gretton et al. (2005), as a practical criterion, can be implemented to check the dependence or independence of two random variables with a finite number of observations. For a series of $n$ observations $Z := \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, this criterion is (Gretton et al. 2005):

$$\mathrm{HSIC}(Z, \mathcal{F}, \mathcal{G}) := (n-1)^{-2}\, \mathrm{tr}(KHLH) \tag{7}$$

where $\mathcal{F}$ is a Reproducing Kernel Hilbert Space (the necessary background on this concept is provided by Gretton et al. 2005) in which, for each point $x \in \mathcal{X}$, there is an element $\phi(x) \in \mathcal{F}$ such that $\langle \phi(x), \phi(x') \rangle_{\mathcal{F}} = k(x, x')$, with $k$ a positive definite kernel, i.e., for $n \ge 1$ and $a_1, a_2, \ldots, a_n \in \mathbb{R}$:

$$\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j\, k(x_i, x_j) \ge 0 \tag{8}$$

Likewise, $\mathcal{G}$ is an RKHS in which, for each point $y \in \mathcal{Y}$, there is an element $\psi(y) \in \mathcal{G}$ such that $\langle \psi(y), \psi(y') \rangle_{\mathcal{G}} = l(y, y')$, with $l$ a positive definite kernel. It is necessary to recall that $\mathcal{F}$ and $\mathcal{G}$ have to be separable; indeed, they must have complete orthonormal systems. In a nutshell, $K, L, H \in \mathbb{R}^{n \times n}$ with $K_{ij} := k(x_i, x_j)$ and $L_{ij} := l(y_i, y_j)$; $H := I - n^{-1} e e^{T}$ is the centering matrix, where $e$ is a vector of ones; and $\mathrm{tr}$ stands for the trace of a matrix.
In Supervised PCA, the subspace spanned by the new features is sought such that the principal components of the input variables with maximum dependency on the response surface are retained, as judged by the empirical criterion, i.e., HSIC. Indeed, for a model with standardized inputs $\tilde{X} = (\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_k)$ and $\ell$ standardized outputs $\tilde{Y} = (\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_\ell)$, maximizing the dependency between the projected data $Z = U^{T}\tilde{X}$ (where $U$ is a modal matrix of eigenvectors that maps the data to a new space in which the features are uncorrelated) and the response surface $\tilde{Y}$ requires maximizing $\mathrm{tr}(KHLH)$. This can be justified by resorting to the empirical HSIC cited above (Barshan et al. 2011). In order to maximize $\mathrm{tr}(KHLH)$, $K$ is replaced with the kernel matrix of the projected data to get:

$$\mathrm{tr}(KHLH) = \mathrm{tr}(\tilde{X}^{T} U U^{T} \tilde{X}\, H L H) \tag{9}$$

in which all the matrices inside the trace are conformable $n \times n$ products.
Based on the cyclic property of the trace in matrix algebra, $\mathrm{tr}(ABC) = \mathrm{tr}(BCA) = \mathrm{tr}(CAB)$, we consequently have:

$$\mathrm{tr}(\tilde{X}^{T} U U^{T} \tilde{X} H L H) = \mathrm{tr}(U^{T} \tilde{X} H L H \tilde{X}^{T} U) \tag{10}$$

where $L$ is a kernel matrix of $\tilde{Y}$ (e.g., $L = \tilde{Y}^{T}\tilde{Y}$), and $H = I - n^{-1}ee^{T}$.
Ultimately, the optimization of the HSIC objective function is carried out under the following constraint (Barshan et al. 2011):

$$\max_{U}\; \mathrm{tr}(U^{T} \tilde{X} H L H \tilde{X}^{T} U) \qquad \text{subject to} \qquad U^{T}U = I \tag{11}$$

The optimal solution of this optimization problem is given by the eigenvectors of the real and symmetric matrix $\mathbb{Q} = \tilde{X} H L H \tilde{X}^{T}$:

$$\mathbb{Q}\, u_m = \lambda_m u_m, \qquad m = 1, \ldots, p \tag{12}$$

Taking the components of the eigenvectors as the decision variables, the optimal solution will be $U = [u_1, u_2, \ldots, u_p]$ associated with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$, which are selected among the eigenvalues; here $p$ is the dimension of the eigenspace (Barshan et al. 2011). The beauty of the above mathematical argument can be justified as follows: if the kernel $L$ is equal to the identity matrix $I$, then, since $H = HH$,

$$\mathbb{Q} = \tilde{X} H I H \tilde{X}^{T} = (\tilde{X}H)(\tilde{X}H)^{T} \tag{13}$$

which is proportional to the covariance matrix of $\tilde{X}$; i.e., the method reduces to PCA (Barshan et al. 2011).
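The reduction to PCA under an identity output kernel is easy to check numerically. In this sketch the samples sit in the rows of X, so the matrix is written as X^T H L H X (the transpose of the convention above); the data are synthetic:

```python
import numpy as np

def spca_directions(X, L):
    """Supervised PCA directions: eigen-decomposition of Q = X^T H L H X,
    which maximizes tr(U^T Q U) subject to U^T U = I (Barshan et al. 2011).
    X is n x d with samples in rows; L is an n x n output kernel."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Q = X.T @ H @ L @ H @ X
    w, U = np.linalg.eigh(Q)          # eigh returns ascending order
    return w[::-1], U[:, ::-1]        # sort descending

# Sanity check: with L = I, Q equals the scatter matrix of centered X,
# so Q/(n-1) has the same eigenvalues as cov(X) -- plain PCA.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
w, _ = spca_directions(X, np.eye(200))
pca_eigs = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
```

Because H is symmetric and idempotent, H I H = H, so Q collapses to the centered scatter matrix exactly.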

Proposed method
As mentioned in the preceding section, the novelty of this study lies in implementing variance-based SA (applicable under the assumption of independent inputs) on the regression model extracted from Supervised PCA. To do so, samples of the model inputs are required. For this purpose, Latin Hypercube Sampling (LHS) can be used to generate realizations of the input variables for both correlated and uncorrelated model inputs (Iman and Conover 1982). For correlated inputs, this sampling is also recognized as an inverse Nataf transformation (Nataf 1962).
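The correlated-sampling step can be sketched with a numpy-only Gaussian-copula construction in the spirit of the inverse Nataf transformation; note that, for brevity, the target correlation matrix is used directly in normal space (a full Nataf transform would first adjust it), and the marginal inverse CDFs are illustrative:

```python
import numpy as np
from math import erf, sqrt

def inverse_nataf(n, R, inv_cdfs, seed=0):
    """Correlated standard normals -> uniforms via Phi -> target marginals."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, len(R))) @ np.linalg.cholesky(R).T
    u = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))   # Phi(z), elementwise
    return np.column_stack([f(u[:, j]) for j, f in enumerate(inv_cdfs)])

def rank_corr(a, b):
    """Spearman rank correlation, preserved under monotone marginals."""
    ra, rb = np.argsort(np.argsort(a)), np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

R = np.array([[1.0, 0.8], [0.8, 1.0]])
# Uniform and exponential marginals, correlated through the normal copula
X = inverse_nataf(10_000, R, [lambda u: u, lambda u: -np.log(1.0 - u)])
```

Because the marginal maps are monotone, the rank correlation of the output equals that of the underlying normals, which for a normal-space Pearson correlation of 0.8 is about (6/pi)*arcsin(0.4), roughly 0.79.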
Then, the original model is run using these samples, and a one-dimensional response surface $Y$ is generated. As soon as the multi-input, single-output realizations are generated, one can cast the regression model stipulated in Eq. (4). When the independence assumption is violated, i.e., when the model inputs are correlated, ordinary least squares gives poor estimates of the parameters $b_i$ of this equation. Thus, performing SA on this linear model will produce unjustified results. Although the Principal Component Regression (PCR) method can overcome this problem, coupling it with a variance-based approach (e.g., Sobol and/or FAST) has a drawback that will be highlighted in what follows.
It is proved in Appendix B that, after replacing the kernel matrix $L$ with $\tilde{Y}^{T}\tilde{Y}$ in the matrix $\mathbb{Q}$, the eigenvalue analysis results in only one nonzero eigenvalue ($\lambda_1$); in that appendix, this unique eigenvalue and the corresponding eigenvector of $\mathbb{Q}$ are rationalized. Thus, the associated eigenvector, $u_1 = [u_{11}, u_{21}, \ldots, u_{k1}]^{T}$, is selected for projecting the explanatory variables, the new features being

$$z_j = u_j^{T}\tilde{X} \tag{14}$$

As the first new feature $z_1$ has maximum linear dependency on $\tilde{Y}$, the other new features can be eliminated from the model owing to their zero correlation with the output, which is also proved in Appendix B. Meanwhile, in linear regression, when the model variables are uncorrelated with zero means, the regression coefficients of

$$\tilde{Y} = \sum_{j=1}^{k} c_j z_j \tag{15}$$

can be estimated as follows (Bedford 1998; Xu and Gertner 2008):

$$\hat{c}_j = \frac{\mathrm{cov}(\tilde{Y}, z_j)}{V(z_j)} \tag{16}$$

As we proved in Appendix B, $\mathrm{cov}(\tilde{Y}, z_j) = 0$ for $j \ne 1$; thus $\hat{c}_j = 0$ for $j \ne 1$. Consequently, with

$$z_1 = \sum_{i=1}^{k} u_{i1}\tilde{x}_i \tag{17}$$

the regression-based SPCA model simplifies to:

$$\tilde{Y} = c_1 z_1 \tag{18}$$

After replacing $z_1$ and $c_1$ with their equivalents [i.e., Eq. (17) and Eq. (16)] in Eq. (18), the regression-based SPCA model can be expressed in terms of standardized variables as:

$$\tilde{Y} = c_1 \sum_{i=1}^{k} u_{i1}\tilde{x}_i \tag{19}$$

where $u_{i1}$ are the entries of the eigenvector $u_1$ corresponding to the maximum eigenvalue $\lambda_1$. As a result, Eq. (19) can be expressed in terms of the original variables $X_i$ and $Y$ as:

$$\frac{Y - \mu_Y}{\sigma_Y} = c_1 \sum_{i=1}^{k} u_{i1}\, \frac{X_i - \mu_{X_i}}{\sigma_{X_i}} \tag{20}$$

The above equation can be written as:

$$Y = b_0 + \sum_{i=1}^{k} b_i X_i, \qquad b_i = c_1 u_{i1}\, \frac{\sigma_Y}{\sigma_{X_i}} \tag{21}$$

Hence, Eq. (21) is considered the regression-based SPCA model for $Y = f(X)$. It is quite important to note that the observed output [i.e., $Y = f(X)$] should exhibit a coefficient of determination greater than 0.7 with the output computed from Eq. (21) for the results to be meaningful (Saltelli et al. 2004); this means that a large part of the variation of the output variable can be effectively described by the regression model. To add another dimension, we also examined the reliability of the proposed method for cases where $R^2$ is smaller than 0.7, with positive feedback.
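The single-nonzero-eigenvalue claim is straightforward to verify numerically: with a linear output kernel the matrix collapses to an outer product. A sketch with synthetic data (the appendix proof itself is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 4
Xs = rng.standard_normal((n, d))                 # standardized inputs
y = Xs @ np.array([3.0, 1.0, 0.5, 0.0])
ys = (y - y.mean()) / y.std()                    # standardized output

# With the linear output kernel L = ys ys^T,
# Q = Xs^T H (ys ys^T) H Xs = v v^T with v = Xs^T H ys,
# i.e. a rank-one matrix: a single nonzero eigenvalue lambda_1.
H = np.eye(n) - np.ones((n, n)) / n
Q = Xs.T @ H @ np.outer(ys, ys) @ H @ Xs
w = np.sort(np.linalg.eigvalsh(Q))[::-1]
```

All remaining eigenvalues are zero up to floating-point noise, which is why only the dominant eigenvector survives in the regression.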
At this stage, variance-based SA can be effectively utilized to investigate the impact of $X_i$ on $Y$. Before that, it might help to summarize the main point of the above mathematical manipulations for practical purposes. Suppose we are given a functional relationship whose predictor variables are correlated. We have two choices for investigating the impact of the various predictor variables on the output. According to the first option, one can implement variance-based SA (e.g., Sobol and/or FAST) directly on the original model and report the most important variables, for which the results are not justified. Alternatively, one can implement regression-based SPCA to come up with Eq. (21) and then impose variance-based SA (applicable under the assumption of independent inputs) on this regression model to get meaningful and justifiable results. We argue that prioritization under the second option gives rise to better and acceptable results.
Following the second option, the estimation of the sensitivity measures based on Eq. (5) using the regression-based SPCA model can be written as:

$$S_i^{u} = \left(\frac{b_i \sigma_{X_i}}{\sigma_Y}\right)^{2} \tag{22}$$

where $S_i^{u}$ stands for an estimate of $S_i$ without considering the correlation among the inputs. Since $U^{T}U = I$, the constraint $u_1^{T}u_1 = 1$ is satisfied; thus:

$$u_{11}^{2} + u_{21}^{2} + u_{31}^{2} + \cdots + u_{k1}^{2} = 1 \tag{23}$$

After proper simplification, substituting $b_i = c_1 u_{i1}\sigma_Y/\sigma_{X_i}$ into Eq. (22), we have:

$$S_i^{u} = c_1^{2}\, u_{i1}^{2} \tag{24}$$

so that, by Eq. (23), $\sum_i S_i^{u} = c_1^{2}$. At this stage, we might want to raise a serious question: while variance-based SA (e.g., Sobol and/or FAST) cannot be implemented on correlated data emerging from a linear or nonlinear model, how and why can such a tool be safely implemented on the regression-based SPCA model for evaluating the importance of correlated predictor variables? The following mathematical elaboration is intended to address this question.
To this end, we first showed in Appendix B that there is a relationship between the first eigenvalue $\lambda_1$ of the matrix $\mathbb{Q}$ (assuming $L = \tilde{Y}^{T}\tilde{Y}$), the components $u_{i1}$ of its associated eigenvector, and $\mathrm{cov}(\tilde{x}_i, \tilde{Y})$, as follows:

$$\mathrm{cov}(\tilde{x}_i, \tilde{Y}) = \sqrt{\lambda_1}\; u_{i1} \tag{25}$$

Eq. (25) is substituted into Eq. (24). As a result, the first-order sensitivity measure can also be computed as follows:

$$S_i^{u} = c_1^{2}\, \frac{\mathrm{cov}^{2}(\tilde{x}_i, \tilde{Y})}{\lambda_1} \tag{26}$$

Since the variables are standardized, $\mathrm{cov}(\tilde{x}_i, \tilde{Y}) = \rho(X_i, Y)$, so Eq. (26) can be restated as:

$$S_i^{u} = c_1^{2}\, \frac{\rho^{2}(X_i, Y)}{\lambda_1} \tag{27}$$

In what follows, we show how the coefficient $c_1$ of the regression-based SPCA model can be expressed through $\lambda_1$. One may start with the coefficient in reference to Eq. (21):

$$c_1 = \frac{\mathrm{cov}(\tilde{Y}, z_1)}{V(z_1)} \tag{28}$$

This coefficient can be simplified via the following mathematical manipulations. After substituting the standardized variables into $z_1$, $\mathrm{cov}(\tilde{Y}, z_1)$ can be written as:

$$\mathrm{cov}(\tilde{Y}, z_1) = \sum_{i=1}^{k} u_{i1}\, \mathrm{cov}(\tilde{x}_i, \tilde{Y}) \tag{29}$$

Following Eq. (25), $\mathrm{cov}(\tilde{Y}, z_1)$ becomes:

$$\mathrm{cov}(\tilde{Y}, z_1) = \frac{1}{\sqrt{\lambda_1}} \sum_{i=1}^{k} \mathrm{cov}^{2}(\tilde{x}_i, \tilde{Y}) \tag{30}$$

As shown in Appendix B, the dominant eigenvalue $\lambda_1$ can be shown to be:

$$\lambda_1 = \sum_{i=1}^{k} \mathrm{cov}^{2}(\tilde{x}_i, \tilde{Y}) \tag{31}$$

After combining Eq. (30) and Eq. (31), $\mathrm{cov}(\tilde{Y}, z_1)$ is found to be:

$$\mathrm{cov}(\tilde{Y}, z_1) = \sqrt{\lambda_1} \tag{32}$$

Now we can switch to the computation of $V(z_1)$ in order to further simplify the coefficient of the regression-based SPCA model. For this purpose, considering Eq. (18) once again and taking the variance of both sides results in:

$$V(\tilde{Y}) = c_1^{2}\, V(z_1) \tag{33}$$

Since $\tilde{Y}$ is a standardized variable, $V(\tilde{Y}) = 1$. Therefore, by substituting Eq. (28) and Eq. (32) into Eq. (33), we have:

$$V(z_1) = \lambda_1 \tag{34}$$

Eq. (34) holds for linear models. However, as one departs from linearity, $V(z_1)$ departs from the dominant eigenvalue $\lambda_1$; as a result, in general $V(z_1) \approx \lambda_1$.
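Under the derivation above, the SPCA-based first-order measure reduces to a function of the input-output correlations alone. A sketch of our reading of that result (the helper name and test data are ours, and the common factor $1/\lambda_1^2$ makes the measure ranking-equivalent to the squared correlations):

```python
import numpy as np

def spca_first_order(X, y):
    """SPCA-based first-order measure: with a linear output kernel,
    lambda_1 = sum_i r_i^2 where r_i = corr(x_i, y), and
    S_i ~ r_i^2 / lambda_1^2 (rankings follow r_i^2)."""
    Xs = (X - X.mean(0)) / X.std(0)
    ys = (y - y.mean()) / y.std()
    r = Xs.T @ ys / len(y)                 # correlations corr(x_i, y)
    lam1 = np.sum(r ** 2)                  # dominant eigenvalue of Q
    return r ** 2 / lam1 ** 2

# Correlated inputs: x2 = 0.8*x1 + noise; y depends directly only on x1
# and x3, yet x2 still receives a nonzero share through the correlation.
rng = np.random.default_rng(2)
x1 = rng.normal(size=20_000)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=20_000)
x3 = rng.normal(size=20_000)
S = spca_first_order(np.column_stack([x1, x2, x3]), 2.0 * x1 + x3)
```

For this example the theoretical squared correlations are 0.8, 0.512, and 0.2, so the ranking is x1 > x2 > x3, with x2 important purely through its correlation with x1.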
After replacing $\mathrm{cov}(\tilde{Y}, z_1)$ with its equivalent, i.e., Eq. (32), in Eq. (28), the coefficient simplifies to the following relation:

$$c_1 = \frac{\sqrt{\lambda_1}}{V(z_1)} \tag{35}$$

Based on Eq. (34) and the subsequent comment, the coefficient can be further simplified to:

$$c_1 \approx \frac{1}{\sqrt{\lambda_1}} \tag{36}$$

Therefore, in reference to the above equation, after implementing variance-based SA on the regression-based SPCA model, the first-order sensitivity measure of Eq. (24) can be written as:

$$S_i^{u} \approx \frac{u_{i1}^{2}}{\lambda_1} \tag{37}$$

Finally, using Eq. (25):

$$S_i^{u} \approx \frac{\mathrm{cov}^{2}(\tilde{x}_i, \tilde{Y})}{\lambda_1^{2}} = \frac{\rho^{2}(X_i, Y)}{\lambda_1^{2}} \tag{38}$$

Meanwhile, in Appendix C, we managed to relate the first-order sensitivity measure [i.e., Eq. (38)] to two sources of variability, i.e., the variability due to each variable and the variability due to the co-variation between the variable under consideration and the other predictor variables, as stipulated below:

$$S_i^{u} \propto \frac{V(b_i X_i) + \mathrm{cov}\!\left(b_i X_i,\; \sum_{j \ne i} b_j X_j\right)}{V(Y)} \tag{39}$$

It is worth noting that the derived covariance decomposition is in line with the covariance decomposition first proposed by Li et al. (2010). In short, returning to the question raised in the theoretical background section, the above mathematical manipulations imply that while variance-based SA (under the assumption of independent factors) cannot be used directly on an original model with correlated inputs, it is possible to effectively benefit from the advantages of Supervised PCA and combine it with this type of variance-based SA to come up with a better tool for prioritizing the input variables when they are correlated.
Having proposed the above approach, an important question can be raised by the reader: is it possible to conduct variance-based SA (e.g., Sobol and/or FAST) on Principal Component Regression (a special form of Supervised PCA) to differentiate between the important and irrelevant variables in models with correlated inputs? It depends on the nature of the problem at hand: the correlation of the new predictor variables with the output, or the lack thereof, can be the cause of success or failure. The process to address this issue is very similar to what we did while coupling variance-based SA with the regression-based SPCA model. In order to keep the integrity of the material in place, interested readers might want to check Appendix D for further detail regarding the coefficients in Eq. (40). In summary, the PCR model can be expressed as a linear combination of the original input variables as follows:

$$Y = b_0 + \sum_{i=1}^{k} b_i X_i, \qquad b_i = \left(\sum_{m=1}^{p} c_m u_{im}\right) \frac{\sigma_Y}{\sigma_{X_i}} \tag{40}$$

where $p$ is the number of retained components, $u_{im}$ are the entries of the $m$-th eigenvector, and $c_m$ are the regression coefficients of the component scores. Thus, after coupling variance-based SA (under the assumption of independent inputs) with PCR, the first-order sensitivity measure can be written as:

$$S_i^{u} = \left(\frac{b_i \sigma_{X_i}}{\sigma_Y}\right)^{2} \tag{41}$$

After some manipulation, the above equation simplifies to:

$$S_i^{u} = \left(\sum_{m=1}^{p} c_m u_{im}\right)^{2} \tag{42}$$

In the subsequent sections, a few numerical test cases are devised to either confirm or refute the validity of the above coupling exercise.
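The PCR counterpart can be sketched in the same style: ordinary PCA scores are regressed against the standardized output and the fit is folded back to the original standardized variables (the appendix-D details are not reproduced; the test model is our illustration):

```python
import numpy as np

def pcr_first_order(X, y, m):
    """Squared SRCs implied by a principal-component regression on the
    top-m components: project standardized inputs on PCA directions,
    regress y on the scores, then map coefficients back to x-space."""
    Xs = (X - X.mean(0)) / X.std(0)
    ys = (y - y.mean()) / y.std()
    w, U = np.linalg.eigh(np.cov(Xs, rowvar=False))
    top = np.argsort(w)[::-1][:m]
    Z = Xs @ U[:, top]                              # component scores
    c = np.linalg.lstsq(Z, ys, rcond=None)[0]       # PCR coefficients
    beta = U[:, top] @ c                            # implied SRCs
    return beta ** 2

# With all components retained, PCR is equivalent to OLS (Jolliffe 1986):
# for Y = 2*X1 + X2 with independent inputs, beta^2 approaches (0.8, 0.2).
rng = np.random.default_rng(3)
X = rng.normal(size=(20_000, 2))
S = pcr_first_order(X, 2.0 * X[:, 0] + X[:, 1], m=2)
```

Dropping components (m < k) is where PCR-based rankings can fail, since a discarded component may still be correlated with the output.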

Applications and results
In this section, the proposed method, based on coupling variance-based SA (applicable under the assumption of independent inputs) with either regression-based SPCA or PCR, is applied to five test cases as well as two simple hydrologic and hydraulic models to evaluate its effectiveness. From the theoretical background section, it is clear that implementing either PCR and/or regression-based SPCA calls for generating realizations of the various input variables. As the governing regression model (e.g., regression-based SPCA) is linear, the number of realizations needed to build this regression can be kept quite small, e.g., 500 (Song et al. 2015). Meanwhile, a few test cases were run with different sample sizes to examine the impact of the number of realizations on the results of the proposed scheme.
It is worth recalling that, after conducting the eigenvalue analysis on the matrix $\mathbb{Q}$ extracted from the Supervised PCA method, the aforementioned matrix has only one dominant eigenvalue and associated eigenvector. However, when it comes to PCR, there is more than one dominant eigenvalue. For this reason, the impact of different numbers of components on model performance was investigated. It is worth noting that if all components are included in the regression model, the PCR coefficients are equivalent to those of ordinary least squares (Jolliffe 1986).
Even though we analytically touched on the validity of the proposed scheme, we also used the correlation ratio method as a benchmark solution, implemented on the original model, to further evaluate the hypotheses incorporated into the various test cases. This benchmark, which is based on McKay's approach (McKay 1997; Saltelli et al. 2001), is generally recognized as a valid approach due to its nonparametric nature (Xu and Gertner 2008); for this reason, it is suitable for nonlinear models even in the presence of strong nonlinear effects. However, as the method uses replicated LHS, it requires a large sample size to acquire acceptable precision. In this work, like Xu and Gertner (2008), we use 100 replications, each with a sample size of 500 (a total of 50,000 model runs). Finally, the negative impact of ignoring the correlation between inputs is examined by performing numerical Sobol variance-based SA (which assumes the model inputs are independent) on all test functions.
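McKay's replicated-LHS machinery is more involved than a short snippet allows, but the quantity it targets, the correlation ratio V[E(Y|X_i)]/V(Y), can be sketched with a simple equal-probability binning estimator (our stand-in for illustration, not McKay's scheme itself):

```python
import numpy as np

def correlation_ratio(xi, y, bins=50):
    """Binning estimate of eta^2 = V[E(Y|X_i)]/V(Y): partition x_i into
    equal-probability bins and take the weighted variance of the bin means.
    Being nonparametric in the (x_i, y) pairs, it remains valid under
    correlated inputs and nonlinear models alike."""
    edges = np.quantile(xi, np.linspace(0.0, 1.0, bins + 1))
    idx = np.clip(np.searchsorted(edges, xi, side="right") - 1, 0, bins - 1)
    means = np.array([y[idx == b].mean() for b in range(bins)])
    counts = np.bincount(idx, minlength=bins)
    return np.average((means - y.mean()) ** 2, weights=counts) / y.var()

rng = np.random.default_rng(4)
x = rng.normal(size=50_000)
noise = rng.normal(size=50_000)
eta_strong = correlation_ratio(x, x ** 2 + 0.1 * noise)   # strong dependence
eta_null = correlation_ratio(x, noise)                    # no dependence
```

As with McKay's scheme, precision comes from sample size: the null estimate is biased upward by roughly (bins - 1)/n, which is why large designs are needed.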

Test case 1: A linear model with constant coefficients
This test case, which had also been used by Li et al. (2010), is a simple additive model with five inputs:

$$Y = X_1 + X_2 + X_3 + X_4 + X_5 \tag{43}$$

In this case, the marginal distribution of each input variable is normal with a mean of 0.5 and unit standard deviation, i.e., $X_i \sim N(0.5, 1)$. The Spearman Rank Correlation Coefficient (SRCC) matrix of the input variables follows Li et al. (2010). Intuitively, in the case of independent inputs, it is straightforward to deduce that each variable has the same impact on the model output uncertainty due to their equal SRCs. Once the correlation structure of the input variables is taken into account, however, each variable has a different impact. The estimated sensitivity measures are summarized in Table 1. In reference to the numerical values of $S_i$ for the various approaches and the different input variables in Table 1, after clustering the input variables into three clusters, the benchmark solution places $X_1$ and $X_2$ in cluster (1), $X_3$ in cluster (2), and $X_4$ and $X_5$ in cluster (3). It is clear that the ordering in the proposed approach and its analytical equivalent is quite consistent with the benchmark solution. However, neither Sobol, implemented on the original model, nor the three flavors of variance-based PCR managed to reproduce the benchmark solution.
Needless to say, the change in the numerical values of $S_i$ from one approach to another is tied to the approach itself, and one should not expect similar values to emerge from each approach. Indeed, for sensitivity analysis purposes, what matters is detecting the importance of the input variables and their rankings in each approach.

Test case 2: A linear model with variable coefficients
Like the preceding test case, the second test case is also adopted from Li et al. (2010):

$$Y = 5X_1 + 4X_2 + 3X_3 + 2X_4 + X_5 \tag{44}$$

Except for the equation, all other conditions of this case are the same as in the first test case. The results of the estimated sensitivity measures based on the methods cited in Test case 1 are summarized in Table 2. The model structure, however, changes the relative importance of the variables.

Test case 3: A simple nonlinear model
In order to evaluate the performance of the proposed method when the transfer function is nonlinear with correlated inputs, a simple nonlinear model with three inputs, first proposed by Xu and Gertner (2008), is considered in this study. Table 3 shows the results of the SA for the methods cited earlier. As the table shows, the most important input variable according to the Sobol scheme is $X_3$. However, both the proposed scheme and the benchmark solution found the most important variable to be $X_1$; even more, the result is quite consistent with the analytical equivalent. By contrast, conducting the variance-based approach on PCR, using either one and/or two components, cannot capture the degree of importance of the input variables.

Test case 4: A nonlinear model--A typical version of the Portfolio model
This nonlinear test function, a typical version of the portfolio model, has four input variables. This relatively strong test case with respect to the degree of nonlinearity, as relates to the interactions among the input variables, is examined under the following three scenarios. The results of the first scenario are summarized in Table 4. As Table 4 clearly demonstrates, the suggested technique, the results summarized in the last column, and the yardstick solution are in good agreement. This can be attributed to the fact that all of these schemes effectively take care of the impact of the correlation structure among the input variables. What is more, this ranking can also be achieved using PCR based on two components. In contrast, performing variance-based SA on PCR using three components does not rank the inputs correctly and cannot differentiate between the importance of $X_1$ and $X_2$. Similar to test case 2, adding the correlation structure does not change the rank of the inputs, although, with respect to the correlation among inputs, $X_2$ becomes slightly less important than $X_1$.

The second scenario
With the exception of the correlation structure, all conditions of this scenario are the same as in the preceding one. The correlation structure is $\rho_{14} = 0.8$, $\rho_{34} = 0.3$, and the remaining pairs are assumed to be independent. Fig. 2 demonstrates the scatter plot matrix for this test case. The results of the SA are summarized in Table 5.

The third scenario
This scenario is proposed by Ge and Menendez (2017), in which the PDF associated with each random variable has a different form: the distributions of $X_1$, $X_2$, $X_3$, and $X_4$ are, respectively, normal $N(0,1)$, gamma $\Gamma(2,1)$, uniform $U(0,1)$, and lognormal $LN(0,1)$. The rank correlation coefficients are the same as in the preceding scenario, i.e., $\rho_{14} = 0.8$, $\rho_{34} = 0.3$, and $\rho_{ij} = 0$ for the remaining pairs. The first-order sensitivity measures based on the adopted approaches can be seen in Table 6. Despite the different governing PDF for each predictor variable, as Table 6 demonstrates, the proposed method managed to mimic the variation of the sensitivity measures documented by the benchmark solution as well as the analytical equivalent, while conducting variance-based SA on PCR failed to reproduce the proper ranking of the model inputs.

Test case 5: A high-dimensional model similar to the Sobol G function
As an additional, more comprehensive example, this test case is designed to evaluate the ability of the proposed scheme on a high-dimensional problem. The model has the following equation: The variables incorporated into this test case have standard uniform marginal distributions, and the parameters ai are always nonnegative. Eq. (47), by its nature, can accommodate interactions of all orders. The Sobol G function, a strongly nonlinear and non-monotonic model, can be generated by replacing Xi with |2Xi − 1|.
The test case considered in this study has a lower degree of nonlinearity and is not as complex as the Sobol G function. In the Sobol G function, increasing ai reduces the importance of the corresponding variable (Saltelli et al. 2008), and this mathematical feature applies to the proposed test function as well. In this test case, we assume the G-type function comprises 12 variables, with the associated coefficients [a1, a2, …, a12] = [0.01, 0.3, 0.6, 18, 25, 32, 39, 57, 77, 83, 90, 99]. Furthermore, it is assumed that the SRCCs of (X1, X12), (X2, X11), and (X3, X10) are 0.8, 0.75, and 0.7, respectively. With this correlation structure and the other pertinent information, the estimated first-order sensitivity measures are summarized in Table 7 and Table 8 for all schemes considered in this study. Once again, Table 7 shows that both the proposed scheme and its approximated equivalence are capable of reproducing the ranking stipulated by the benchmark solution. If one ignores the correlation structure inherent in the input variables, the conventional Sobol scheme fails to highlight the importance of the last three variables. However, the correlation structure leads both the proposed scheme and the benchmark solution to rank the input variables appropriately. Upon coupling variance-based SA with PCR (Table 8), the first-order sensitivity measures of the last three variables become more distinct, but the coupling does not manage to reproduce the ranking of those variables relative to the benchmark solution.
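For reference, the Sobol G function mentioned above has a closed-form variance decomposition under independent uniform inputs, which makes the "larger ai, smaller influence" property easy to verify. The sketch below assumes the standard form of the function (Saltelli et al. 2008) together with the coefficient vector of this test case; it does not reproduce the milder test function of the paper itself:

```python
import numpy as np

def sobol_g(X, a):
    """Sobol G function: prod_i (|4*x_i - 2| + a_i) / (1 + a_i).
    The larger a_i is, the less influential the i-th input."""
    return np.prod((np.abs(4.0 * X - 2.0) + a) / (1.0 + a), axis=1)

a = np.array([0.01, 0.3, 0.6, 18, 25, 32, 39, 57, 77, 83, 90, 99], dtype=float)

rng = np.random.default_rng(1)
X = rng.uniform(size=(100_000, 12))
Y = sobol_g(X, a)

# Analytical first-order indices for independent inputs:
#   V_i = (1/3) / (1 + a_i)^2,   V = prod_i (1 + V_i) - 1,   S_i = V_i / V
Vi = (1.0 / 3.0) / (1.0 + a) ** 2
S = Vi / (np.prod(1.0 + Vi) - 1.0)
print(np.round(S, 3))
```

The Monte Carlo variance of Y matches prod(1 + Vi) − 1, and the indices fall monotonically with ai, mirroring the ordering discussed around Tables 7 and 8; the correlated case, of course, requires the methods of this paper rather than these independence formulas.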

Practical test cases
In this part, two practical test cases are considered. One of them is based on a nonlinear hydrologic model, and the other considers a hydraulic model that simulates flood inundation. The details of these models are discussed below:

Test case 6-A simple hydrologic model
This test case represents a hydrologic model that relates instantaneous peak discharge to watershed characteristics in India through a power model, as given below: The parameters of this model were calibrated against the characteristics of 58 watersheds (McCuen and Snyder 1986). These characteristics are considered as model input variables, and their probability density functions are described in Table 9. After implementing the adopted approaches on the respective governing models (Sobol on the original model, the proposed approach, five flavors of variance-based PCR, the benchmark solution implemented on the original model, and the approximated equivalence of the proposed method), the first-order sensitivity measures are summarized in Table 10 and Table 11. As can be seen in Table 10, the proposed method clearly identifies the three most influential parameters.
Nevertheless, one can hardly differentiate among some of these variables when it comes to the sensitivity measures. In addition, the remaining parameters fall into third, fourth, and fifth groups in terms of their influence. This classification, borne out by the proposed approach, is very consistent with the correlation ratio method and the approximated equivalence of the proposed method. In coupling variance-based SA with PCR (Table 11), the number of principal components considered is quite influential in delineating the important parameters. In our case, beyond two principal components, the PCR approach cannot reproduce the sensitivity ordering of the model input variables. It is quite interesting that, upon ignoring the correlation structure of the input variables, the Sobol approach, implemented on the original model, finds the drainage area to be the most influential model input. Indeed, according to the Sobol approach, important variables are found to be unimportant (i.e., a type II error), while less influential parameters are considered quite important (i.e., a type I error).
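The correlation ratio used as the benchmark throughout this paper is simply Var(E[Y|Xi]) / Var(Y). A minimal binning estimator is sketched below; the bin count, toy model, and sample size are illustrative assumptions, not values from the paper:

```python
import numpy as np

def correlation_ratio(x, y, bins=50):
    """Estimate Var(E[Y|X]) / Var(Y) by averaging y within quantile bins of x."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, bins - 1)
    y_mean = y.mean()
    between = 0.0  # between-bin variance of the conditional means
    for b in range(bins):
        yb = y[idx == b]
        if yb.size:
            between += yb.size * (yb.mean() - y_mean) ** 2
    return between / (y.size * y.var())

# Toy additive model with known first-order indices: Var(Y) = 5, S1 = 0.8, S2 = 0.2
rng = np.random.default_rng(3)
x1 = rng.normal(size=200_000)
x2 = rng.normal(size=200_000)
y = 2.0 * x1 + x2
print(round(correlation_ratio(x1, y), 3), round(correlation_ratio(x2, y), 3))
```

Because each index needs its own conditional-expectation estimate over a large sample, the cost grows with dimension; this is the computational burden the proposed SPCA-based scheme is designed to avoid.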

Test case 7-A hydraulic model: flood inundation in a diversion channel
This practical example is a simplified version of a diversion channel subjected to uniform flow under a flood inundation scenario. In order to protect the service road from flood inundation, a dyke is built between the diversion channel and the service road. Indeed, this simple application, used as an instructive test in Iooss and Lemaître (2015) and Chastaing et al. (2012), simulates the height of water in the diversion channel with respect to the height of the dyke intended to protect the service road and the agricultural and industrial sites adjacent to the channel bank from inundation. The model, which comprises the characteristics of a typical channel reach, has the following equation (see Fig. 4): Upon assuming uniform flow in a wide rectangular channel, h can be derived as (de Rocquigny 2006): where the maximum overflow (in meters) is a function of eight inputs. The symbols used in Eqs. (49) and (50) are defined in Table 12. Computation of the maximum overflow calls for monitoring these eight variables. As these field variables are highly corrupted by noise, and some of them also exhibit both spatial and temporal variability, the dependent variable has to be considered a random variable in its own right. Table 12 also documents the probability density function associated with each variable.
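Assuming the standard parameterization of this benchmark (de Rocquigny 2006; Iooss and Lemaître 2015), namely h = (Q / (B·Ks·sqrt((Zm − Zv)/L)))^(3/5) with overflow S = Zv + h − Hd − Cb, a sketch of the deterministic model is given below. The channel width B = 300 m, reach length L = 5000 m, and the input values in the example call are illustrative assumptions:

```python
import numpy as np

def water_height(Q, Ks, Zv, Zm, B=300.0, L=5000.0):
    """Uniform-flow depth in a wide rectangular channel (Manning-Strickler):
    h = ( Q / (B * Ks * sqrt((Zm - Zv) / L)) )^(3/5)."""
    return (Q / (B * Ks * np.sqrt((Zm - Zv) / L))) ** 0.6

def overflow(Q, Ks, Zv, Zm, Hd, Cb, B=300.0, L=5000.0):
    """Maximum overflow S = Zv + h - Hd - Cb; S > 0 means the dyke is overtopped."""
    return Zv + water_height(Q, Ks, Zv, Zm, B, L) - Hd - Cb

# One deterministic evaluation with illustrative values
print(round(overflow(Q=1000.0, Ks=30.0, Zv=50.0, Zm=55.0, Hd=3.0, Cb=50.0), 3))
```

For the SA itself, each scalar argument is replaced by a sample drawn from its PDF in Table 12 (with the stated correlations), and both functions evaluate vectorized over the whole sample.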

Fig. 4 Flood inundation in a diversion channel
As a general rule, the flow rate is strongly correlated with channel roughness: as channel roughness increases, the flow discharge decreases. Since the Strickler coefficient is inversely proportional to channel roughness, a correlation coefficient of 0.45 between the flow rate and the Strickler coefficient is assumed for the subsequent computations.
Furthermore, a second pair of inputs is considered to be correlated, with a correlation coefficient of 0.3. From an educational point of view, the other parameters are considered uncorrelated. After implementing the adopted approaches on the governing equations, the first-order sensitivity measures are summarized in Table 13. As Table 13 illustrates, the results of the proposed method agree quite well with the assessments reported by the benchmark solution and the approximated equivalence of the proposed method. According to the table, the height of the dyke is found to have an influential impact on the overflow depth, followed by Q and one further input. Indeed, in a practical setting, that would be the case as far as the protection of the adjacent agricultural and industrial sites is concerned. In light of this, the height of the dyke, along with the maximum annual flow rate, has to be chosen with proper care in practice. It is quite puzzling that, if one implements variance-based SA on PCR, the methodology flags the height of the dyke but misses the maximum annual flow rate. In addition, this coupling, compared to the proposed method, cannot effectively demonstrate the importance of the other variables. Once again, the above procedure was also implemented for various numbers of realizations. As Fig. 5 shows, the number of realizations has minimal impact on the ranking process.

Summary and Discussion
Nowadays, GSA is considered a potent approach for organizing input variables in terms of their degree of importance as they affect the dependent variable. Over the last two decades, one of the most challenging tasks in GSA has been to account for the impact of correlated input variables on the variability of the model output, and recent scientific endeavors and the associated literature have developed several approaches to address this issue. When it comes to quantitative approaches (e.g., the correlation ratio method), however, the methodology usually carries a great deal of computational cost and is quite time consuming, particularly in high-dimensional settings. For this reason, and due to the complexity of the developed approaches, the majority of research activities assume the decision variables to be independent for prioritization purposes. In this paper, an innovative methodology is developed whereby the conventional variance-based approach (under the assumption of orthogonal input variables) is coupled with a regression-based SPCA model to evaluate and assess the impact of correlated input variables.
In order to evaluate the effectiveness of the proposed scheme, seven test cases are considered altogether. We conclude by noting that the developed methodology is remarkably simple to implement, time efficient (particularly for high-dimensional problems, owing to its analytical nature), and approximately equivalent to the total contribution index in the covariance decomposition equation, which can be assessed by noting how that index is related to the variance-covariance structure. This equivalence can be justified by the fact that the regression-based SPCA takes into account the impact of the output variable on the new features. In conclusion, the proposed scheme can be considered an enlightening approach for sensitivity analysis modelers. In the future, it is recommended to apply the proposed methodology to more complex transfer functions and to monitor the CPU time in comparison with the more time-consuming approaches in the literature, such as the correlation ratio method.

Appendix B
After replacing $K$ with $\tilde{y}\tilde{y}^{T}$ in the matrix $\mathbb{Q}=\tilde{X}^{T}K\tilde{X}$, we have:

$$\mathbb{Q}=\tilde{X}^{T}\tilde{y}\,\tilde{y}^{T}\tilde{X} \tag{B1}$$

Since $\tilde{X}$ and $\tilde{y}$ are standardized, and the matrix $\mathbb{Q}$ is symmetric, i.e., $\mathbb{Q}^{T}=\mathbb{Q}$, we have:

$$\tilde{y}^{T}\tilde{X}=(\tilde{X}^{T}\tilde{y})^{T} \tag{B2}$$

and

$$\tilde{X}^{T}\tilde{y}\,\tilde{y}^{T}\tilde{X}=(\tilde{X}^{T}\tilde{y})(\tilde{X}^{T}\tilde{y})^{T} \tag{B3}$$

As a result, the matrix $\mathbb{Q}$ is simplified as follows:

$$\mathbb{Q}=(\tilde{X}^{T}\tilde{y})(\tilde{X}^{T}\tilde{y})^{T} \tag{B4}$$

Indeed, the terms $\tilde{X}^{T}\tilde{y}$ and $\tilde{y}^{T}\tilde{X}$ are a column vector and a row vector, respectively, whose entries are $\mathrm{cor}(\tilde{x}_{j},\tilde{y})$:

$$\tilde{X}^{T}\tilde{y}=\big[\mathrm{cor}(\tilde{x}_{1},\tilde{y}),\ \mathrm{cor}(\tilde{x}_{2},\tilde{y}),\ \dots,\ \mathrm{cor}(\tilde{x}_{p},\tilde{y})\big]^{T} \tag{B5}$$

In reference to Eq. (B5), Eq. (B4) implies that the matrix $\mathbb{Q}$ can be considered as the multiplication of a vector with $p$ entries by its transpose.

As a general rule, the multiplication of a vector (of dimension $p\times 1$) by its transpose always results in a symmetric matrix of rank one. In the following, the eigenvalues and eigenvectors of such a matrix are further discussed (Strang 2021). To this end, consider a vector $a=(a_{1},a_{2},\dots,a_{p})^{T}$ and the matrix $A=aa^{T}$. Since $Aa=a(a^{T}a)=\lVert a\rVert^{2}a$, the quantity $\lVert a\rVert^{2}$ is an eigenvalue of the matrix $A$ corresponding to the eigenvector $a$.

Based on the above result, since the matrix $\mathbb{Q}$ is the multiplication of the vector $\tilde{X}^{T}\tilde{y}$ (with $p$ entries) by its transpose, its dominant eigenvalue can be written as

$$\lambda_{1}=\lVert\tilde{X}^{T}\tilde{y}\rVert^{2}=\sum_{j=1}^{p}\mathrm{cor}^{2}(\tilde{x}_{j},\tilde{y}),$$

and the other eigenvalues are zero. In addition, the eigenvector of this matrix corresponding to the dominant eigenvalue $\lambda_{1}$ is $\tilde{X}^{T}\tilde{y}$, with components

$$v_{1j}=\mathrm{cor}(\tilde{x}_{j},\tilde{y}),\qquad j=1,2,\dots,p.$$

As the subsequent application uses the normalized eigenvector, i.e., $v_{1}^{T}v_{1}=1$, each component of the dominant eigenvector is divided by the magnitude of the eigenvector:

$$v_{1j}=\frac{\mathrm{cor}(\tilde{x}_{j},\tilde{y})}{\sqrt{\lambda_{1}}},\qquad j=1,2,\dots,p.$$

Since $\tilde{x}_{j}$ and $\tilde{y}$ are the standardized values of the original variables, $v_{1j}$ can be expressed in terms of the original variables $(X_{j},Y)$ as

$$v_{1j}=\Big(\frac{1}{\lambda_{1}}\Big)^{0.5}\frac{1}{\sigma_{X_{j}}\sigma_{Y}}\,\mathrm{cov}(X_{j},Y),\qquad j=1,2,\dots,p.$$

In addition, we prove in the following that if $K$ is replaced by $\tilde{y}\tilde{y}^{T}$ in the matrix $\mathbb{Q}=\tilde{X}^{T}K\tilde{X}$, the correlation between the standardized output and every new feature other than the dominant one (i.e., $t_{j}$, $j\neq 1$) is zero. For this purpose, substituting the standardized variables into $t_{j}$, $\mathrm{cor}(\tilde{y},t_{j})$ $(j=2,3,\dots,p)$ can be expanded as

$$\mathrm{cor}(\tilde{y},t_{j})=\mathrm{cor}\big(\tilde{y},\,v_{j1}\tilde{x}_{1}+v_{j2}\tilde{x}_{2}+\dots+v_{jp}\tilde{x}_{p}\big)=v_{j1}\,\mathrm{cor}(\tilde{y},\tilde{x}_{1})+v_{j2}\,\mathrm{cor}(\tilde{y},\tilde{x}_{2})+\dots+v_{jp}\,\mathrm{cor}(\tilde{y},\tilde{x}_{p})=\sqrt{\lambda_{1}}\,v_{j}^{T}v_{1}=0,$$

since the eigenvectors of the symmetric matrix $\mathbb{Q}$ are mutually orthogonal.
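The two claims of this appendix, namely that Q has a single nonzero eigenvalue equal to the squared norm of the correlation vector with that vector as its eigenvector, and that the standardized output is uncorrelated with every non-dominant feature, are easy to check numerically. A minimal sketch with synthetic data (the model, coefficients, and sizes below are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(11)
n, p = 2000, 5
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -0.5, 0.25, 0.0, 2.0]) + 0.1 * rng.standard_normal(n)

# Standardized variables (the tilde quantities of the appendix)
Xs = (X - X.mean(0)) / X.std(0)
ys = (y - y.mean()) / y.std()

# Q with the linear kernel on the response: Q = Xs' ys ys' Xs = w w'
w = Xs.T @ ys
Q = Xs.T @ np.outer(ys, ys) @ Xs      # identical to np.outer(w, w)

eigval, eigvec = np.linalg.eigh(Q)    # eigenvalues in ascending order

# New features t_j = Xs v_j: only the dominant (last) one correlates with ys
T = Xs @ eigvec
corrs = np.array([np.corrcoef(ys, T[:, j])[0, 1] for j in range(p)])
print(round(float(eigval[-1]), 3), round(float(w @ w), 3))
print(np.round(corrs, 6))
```

The dominant eigenvector agrees with w up to sign, so its normalized components reproduce the correlation-based loadings derived above, while the remaining correlations vanish to machine precision.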

Appendix D
In this appendix, a rationale is offered to prove Eq. (40) in the text. After regressing $\tilde{Y}$ on the new uncorrelated features based on PCR, we have:

$$\tilde{Y}=\beta_{1}t'_{1}+\beta_{2}t'_{2}+\dots+\beta_{p}t'_{p}+\varepsilon_{0} \tag{D1}$$

where $t'_{j}$ is a new feature on the basis of PCR. As mentioned in the main text, the coefficients of a linear regression whose inputs are uncorrelated with zero means can be estimated by Eq. (17). Thus:

$$\hat{\beta}_{j}=\frac{\mathrm{cov}(\tilde{Y},t'_{j})}{\mathrm{var}(t'_{j})},\qquad j=1,2,\dots,p \tag{D2}$$

Since in PCA $\mathrm{var}(t'_{j})=\lambda_{j}$, the PCR is:
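Eq. (D2) can be checked numerically: on principal-component scores the regressors are exactly uncorrelated, so the analytic coefficients cov(Ỹ, t′j)/λj coincide with an ordinary least-squares fit. A small sketch whose data-generating model is an arbitrary assumption:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 3000, 4
X = rng.standard_normal((n, p)) @ np.diag([2.0, 1.5, 1.0, 0.5])
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.standard_normal(n)

# Center, then build the uncorrelated PCR features
Xc = X - X.mean(0)
yc = y - y.mean()
S = Xc.T @ Xc / (n - 1)               # sample covariance matrix
lam, V = np.linalg.eigh(S)            # eigenvalues = variances of the scores
T = Xc @ V                            # principal-component scores t'_j

# Eq. (D2): beta_j = cov(y, t'_j) / var(t'_j) = cov(y, t'_j) / lambda_j
beta_analytic = (T.T @ yc / (n - 1)) / lam

# Cross-check with ordinary least squares on the scores
beta_ols, *_ = np.linalg.lstsq(T, yc, rcond=None)
print(np.round(beta_analytic, 4))
print(np.round(beta_ols, 4))
```

The agreement is exact (up to floating point) because the score covariance is diagonal with entries λj, which is precisely what the derivation above exploits.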