High-Dimensional Sparse Single-Index Regression Via Hilbert-Schmidt Independence Criterion

Hilbert-Schmidt Independence Criterion (HSIC) has recently been used in the field of single-index models to estimate the directions. Compared with some other well-established methods, it requires relatively weaker conditions. However, its performance has not yet been studied in the high-dimensional scenario, where the number of covariates is much larger than the sample size. In this article, we propose a new efficient sparse estimate in HSIC based single-index model. This new method estimates the subspace spanned by the linear combinations of the covariates directly and performs variable selection simultaneously. Due to the non-convexity of the objective function, we use a majorize-minimize approach together with the linearized alternating direction method of multipliers algorithm to solve the optimization problem. The algorithm does not involve the inverse of the covariance matrix and therefore can handle the large p small n scenario naturally. Through extensive simulation studies and a real data analysis, we show our proposal is efficient and effective in the high-dimensional setting. The Matlab codes for this method are available online.


Introduction
Let Y ∈ R be an univariate response and X ∈ R p be a p × 1 predictor, the single-index model as a practically useful generalization of the classical linear regression model considers the following problem Y = g(β X, ), (1.1) where β is a p × 1 vector, is an unknown random error independent of X, and g is a link function.Let span(β) denote the subspace spanned by β, the goal of the single-index model is to estimate span(β) without specifying or estimating the link function g.To our best knowledge, Li and Duan (1989) firstly studied this problem and proposed to estimate the span(β) under the linear condition that E(X|β X) is a linear function of β X.This linear condition applies to the marginal distribution of X and is common in regression modeling.
Later, Cook (1994Cook ( , 1998) ) introduced sufficient dimension reduction (SDR) expanding the concept of the single-index model.It aims to find the minimal subspace S ⊆ R p such that Y ⊥ ⊥ X|P S X, where ⊥ ⊥ stands for independence and P S stands for the projection operator to the subspace S.Under mild conditions (Cook, 1996;Yin et al., 2008), such a subspace exists and is unique.We call it the central subspace, denote it by S Y |X and call its dimension d = dim(S Y |X ), which is often far less than p.When the central subspace is one dimensional or in other words d = 1, the caused regression problem is the single-index model (1.1).There are many methods proposed to estimate the central subspace (Li, 1991;Cook and Weisberg, 1991;Xia et al., 2002;Cook and Ni, 2005;Zhu and Zeng, 2006;Li and Wang, 2007;Wang and Xia, 2008;Cook and Forzani, 2009;Zeng and Zhu, 2010;Yin and Li, 2011;Ma and Zhu, 2012).For a comprehensive list of references about SDR methods, please refer to Ma and Zhu (2013).
Unfortunately, one drawback of the dimension reduction methods mentioned above is that the estimated linear combinations still contain all the original predictors, which often makes it difficult to interpret the extracted components.To improve interpretability, numerous attempts have been made to perform variable selection and dimension reduction simultaneously, including Cook (2004); Ni et al. (2005); Li et al. (2005); Li (2007); Li and Yin (2008); Chen et al. (2010).These methods perform well when the number of covariates p is less than the sample size denoted by n, but don't work under the scenario p > n.To tackle the difficulty, Yin and Hilafu (2015) proposed sequential procedures in SDR and Lin et al. (2018) proposed high-dimensional sparse sliced inverse regression (SIR).Moreover, Wang et al. (2018) introduced a reduced-rank regression method for estimating the sparse directions, and Tan et al. (2018b) proposed a convex formulation for fitting sparse SIR in high dimensions.Other recent high-dimensional SDR methods can be seen in Qian et al. (2019) and Tan et al. (2020).
In this article, following the work of Zhang and Yin (2015) and Tan et al. (2018b), we develop a new approach using Hilbert-Schmidt Independence Criterion (HSIC) for singleindex models.The proposed method can handle the scenario p > n and require the weakest conditions among the existing high-dimensional sparse SDR methods.The key idea is to formulate the HSIC based single-index model in a form of estimating the orthogonal projection ββ onto the subspace span(β) rather than span(β), with the constraints of the nuclear norm and the operator norm to relax the normalization constraint.Moreover, our proposal uses a lasso penalty on the orthogonal projection ββ to encourage the estimated solution to be sparse.To sum up, the main contributions of our work are as follows.First, our method extends the HSIC-based single-index regression (Zhang and Yin, 2015) to a sufficient variable selection method.Since it does not involve the inversion of the sample covariance matrix, it can naturally handle a large p small n situation.Second, motivated by the majorization-minimization principle, we design a fast and efficient algorithm to solve the problem.The objective function of our method is non-linear, so the algorithm in this article is more complicated and tricky than the algorithm in Tan et al. (2018b).Third, Tan et al. (2018b) proposed a cross-validation scheme based on the idea of Cook and Forzani (2008) to select the tuning parameters.Their method requires that the distribution of X|Y follows normal distribution, while our method apply a kernel method to estimate the link function which perfectly avoid this assumption.Last but not least, we can easily extend our method to situations where the response is multivariate.
The article is organized as follows.Section 2 reviews the background of HSIC-based single-index method and Section 3 details our proposed method.In Section 4, we conduct extensive simulation studies and a real data analysis.A short conclusion and some technical proofs are provided in Section 5 and Appendix.
The following notations will be used in our exposition.2 Overview of HSIC-based Single-Index Regression Gretton et al. (2005aGretton et al. ( , 2007Gretton et al. ( , 2009) ) proposed an independence criterion termed the Hilbert-Schmidt Independence Criterion to detect statistically significant dependence between two random variables.For univariate X and Y , HSIC denoted by H(X, Y ) has a population expression where X and Y denote independent copies of X and Y , and K(•) and L(•) are positive definite kernel functions.The definition of HSIC exists when the various expectations over the kernels are finite, which is true as long as the kernels K(•) and L(•) are bounded.One often used kernel is a Gaussian kernel (see Kankainen, 1995), i.e., Moreover, Feuerverger (1993) showed that the statistic is equivalent to the characteristic function-based statistic when the Gaussian kernel choice is adopted.Throughout the article, we present our method using the Gaussian kernel, however, our method can be extended to other kernel choices without much issue.
According to Gretton et al. (2005b), HSIC equals 0 if and only if two random variables are independent, which makes it possible for its application in the field of SDR.Indeed, under a mild condition, Zhang and Yin (2015) showed that solving (2.2) with respect to a p × 1 vector β would yield a basis of S Y |X , or in other words, the single-index direction: where Σ denotes the covariance matrix of X.Note that solving (2.2) may not have a unique solution in terms of β, but we are interested only in span(β), which is unique as shown in the following proposition.
Proposition 2.1.Assume that the support of X ∈ R p is a compact set, and that η spans the central subspace such that η Ση = 1.
is a sum of three U-statistics (see Serfling, 1980;Gretton et al., 2007): where In later sections, we will utilize the equivalent form (see Gretton et al., 2007;Wu and Chen, 2021), obtained by replacing the U-statistics with V-statistics rather than Equation (2.3), where K and L are the n×n matrix with entries K ij (β) and L ij respectively, H = I − 1 n 11 , and 1 is a n × 1 vector of ones.Here, Lij denotes the (i, j)-th entry of the product matrix HLH.The estimator of a basis for the central subspace Then, the central subspace is estimated as span(η n ) and the sufficient dimension reduced variable is η n X.The following proposition characterizes the asymptotic properties of the estimator η n .
For details about the regularity conditions and the specific form of V 11 , please refer to Zhang and Yin (2015) and its online supplementary material.

Problem Formulation
Let Π = ββ , the HSIC-based single-index regression procedure (2.5) can be rewritten as the following minimization problem: where In high dimensional SDR, it is often true that only a few elements of X are informative and we would like to select these variables only.To achieve this goal, Tan et al. (2018b) introduces the notion of subspace sparsity and imposes a lasso penalty on all elements of Π to encourage such sparsity.Moreover, they utilize the nuclear norm and the spectral norm to relax the constraint.Following the work of them, we propose the sparse estimate by solving min where λ is a tunning parameter.Note that we only consider the dimension of the central subspace to be 1, so there is no need to impose spectral norm constraint.More similar work can be seen in sparse principal component analysis, canonical correlation analysis, and sliced inverse regression (Vu et al., 2013;Gao et al., 2017;Tan et al., 2018aTan et al., ,b, 2020)).In addition, when the kernel is the product kernel, we can naturally extend the method to settings where the response is multivariate.That is, for a q-dimensional response Y = (Y 1 , . . ., Y q ) , we use the product kernel: where Y = (Y 1 , . . ., Y q ) is an independent copy of Y.

Computation
In this subsection, we propose an efficient optimization algorithm for solving the problem (3.2).Let f (Π) denote the objective function of the problems (3.1).Although f (Π) is not convex, it is differentiable and has Lipschitz continuous gradient over the bounded convex set.We state properties about the objective function f (Π) in the following proposition.
Proposition 3.1.f (Π) is differentiable and its derivative function is or equivalently, where C is a n×n matrix with the entry We prove the Proposition 3.1 in the Appendix.
Remark 1.It is worth noting that we would like to use the expression form (3.4) instead of (3.3) when actually calculating the derivative function ∇f (Π).Plus, the Lipschitz continuity property of f (Π) motivates us to design a method for performing the optimization in this work from the viewpoint of the majorization-minimization principle (Lange et al., 2000;Hunter and Lange, 2004).
Since the objective function f (Π) has a Lipschitz continuous gradient over the bounded set D, there exists a positive constant L < ∞ such that for all Π ∈ D and Π ∈ D. Thus, the right hand side of (3.5) is a majorizing function of f (Π) at Π (i.e., the right hand side of (3.5) is greater than or equal to f (Π) for all Π ∈ D with equality when Π = Π).This suggests the following majorize-minimize (MM) iteration to solve the problem (3.2): where Π (r+1) and Π (r) are the (r + 1)-th and r-th iterates of the optimization variable corresponding to Π, respectively.By the property (3.5), we can easily obtain which means that iterates generated from the algorithm are guaranteed to monotonically decrease the objective function value.Hunter and Lange (2004) showed the sequence Π (r) r≥0 obtained by the iterative formula (3.6) converges to a critical point of the problem (3.2).The MM algorithm is a well-applicable and simple algorithmic framework for solving such problems.The key challenge in making the proposed algorithm efficient numerically lies in solving the subproblem (3.6).
The subproblem (3.6) is a quadratic problem with the convex constraint, so any local minimum can be guaranteed to be a global minimum.We employ the linearized alternating direction method of multipliers algorithm (L-ADMM, Zhang et al., 2011;Wang and Yuan, 2012;Yang and Yuan, 2013) to solve it.This algorithm can allow us to tackle the difficult caused by the interaction between the penalty term and the constraints.We give the derivation details of solving the subproblem (3.6) through this algorithm in the Appendix.
In practice, we find that this algorithm can solve the subproblem quite efficiently.
Algorithm 1 presents the entire algorithm flow we use to solve the problem (3.2).It has two loops: an outer loop in which the MM algorithm approximates the original problem (3.2) iteratively by a series of convex relaxations, and an inner loop in which the linearized alternating direction method of multipliers algorithm is used to solve each convex relaxation (3.6).In the inner loop, the update of Π is performing soft-thresholding and the update of H is a projection operator which needs to compute a singular value decomposition, and modify the obtained singular values with a monotone piecewise linear function.For specific details about the projection operator, please refer to the Proposition A.1 in the Appendix.Matlab codes implementing the method are available at https://github.com/runxiong-wu/sHSIC.

Tuning Parameter Selection
The tuning parameter λ in our proposed method determines the sparsity level of the estimate.Tan et al. (2018b) proposed a cross-validation approach based on the framework of principal fitted components (PFC, Cook and Forzani, 2008) to select the corresponding sparsity tuning parameter.However, the PFC method requires that the distribution of X|Y should be normally distributed, which may not be suitable in the real application.To avoid the assumption, we use the Nadaraya-Watson kernel method to estimate the conditional expectation E(Y |X).Let Π be an estimate of the orthogonal projection Π, the sufficient dimension direction estimator β is estimated by the top eigenvector of Π.Given a new Input: X, Y , the tuning parameter λ, the Lipschitz constant L, the L-ADMM parameters ρ > 0 and τ = 4ρλ 2 max ( Σ).
12 until stopping criterion met; Output: β = the most top eigenvector of Π (r+1) .Algorithm 1: MM Algorithm for Solving (3.2) data X * , the Nadaraya-Watson kernel estimator of conditional mean E where K h is a kernel with a bandwidth h.In this article, we use a Gaussian kernel and take the leave-one-out estimate for bandwidth selection.Note that there is a trick to compute the cross-validation function with a single fit.This trick vastly reduces the computational complexity, at the price of the increasing memory consumption.For specific details, please refer to Fan and Gijbels (1996).
We then construct an M-fold cross-validation procedure based on (3.7) to select the tuning parameter λ.Suppose C 1 , . . ., C M are M equally sized and mutually disjoint subsamples of the whole dataset.The cross-validation procedure utilizes each single subsample be the test data, and the remaining M − 1 subsamples be the training data.For each fixed tuning parameter λ, the corresponding overall prediction error is computed as where |C m | denotes the cardinality of the set C m .Finally, we choose the tuning parameter which minimizes the prediction error.
4 Numerical Study

Simulations
In this section, we compare the performance of our proposed method with the most competitive high-dimensional sparse SDR approach (Tan et al., 2018b) under various simulation settings.We use two measures: the true positive rate (TPR) and the false positive rate (FPR), to assess how well the methods select variables.In particular, TPR is defined as the proportion of active predictors that are correctly identified while FPR is defined as the proportion of irrelevant predictors that are falsely identified.An estimate with a bigger TPR and a smaller FPR is better.Furthermore, we calculate the absolute correlation coefficient (corr) between the true sufficient predictor and its estimate to evaluate accuracy of the methods.The larger the absolute correlation coefficient, the better the estimate.For each study, we repeat 200 times.Study 1.This model is a classic linear regression model from Tan et al. (2018b): where ∼ N (0, 1), X = (X 1 , . . ., X p ) ∼ N p (0, Σ) with Σ ij = 0.5 |i−j| for 1 ≤ i, j ≤ p, and X and are independent.In this study, the central subspace is spanned by the vector β = (1, 1, 1, 0, . . ., 0) / √ 3 with p−3 zero coefficients.
Let Π be an estimator of the orthogonal projection Π, the sufficient dimension direction estimator β is obtained by computing the top eigenvector of Π.When computing the TPR and the FPR in practice, we truncated β by zeroing out its entries whose magnitude is smaller than 10 −4 .For the method in Tan et al. (2018b), we use Tan's code with the default parameter setting.
The simulation results from Study 1 to Study 4 are summarized in Table 1.We can see that although our proposed method in Study 1 is slightly better than the method of Tan et al. (2018b) in terms of FPR, it is worse than Tan et al. (2018b) in general.This phenomenon is well explained by that the SIR method has the best performance in a classic linear model.In Study 2, our method outperforms the other method slightly in general.
The performance of the method in Tan et al. (2018b) relies on the choices of the methodspecific kernel matrix while our method does not have this limit.In Study 3, the conditional distribution is approximately symmetrical, which causes serious problem to the method of Tan et al. (2018b).However, our method is still valid in this case.In Study 4, the linearity condition about X is destroyed while most of SDR methods require this condition.Thus, in such a case it is not surprising that our proposed method performs better than the rest method.In short, our proposed method performs very well across all the four studies in the high-dimensional setting.Studies 5 and 6 investigate the effect of our proposed method about variable selection in a multivariate response model.As far as we know, it seems no apparent competitor in such scenarios.The results are summarized in Table 2 and we can see our proposed method works fine even if the response is multivariate.

Real Data Analysis
In this part, we evaluate the performance of our proposed method in a real dataset about riboflavin (vitamin B 2 ) production with Bacillus subtilis, which is publicly available in the R package hdi.This dataset was analyzed by Dezeure et al. (2015), Hilafu andYin (2017), andShi et al. (2020).It consists of a single real-valued response variable which is the logarithm of the riboflavin production rate and p = 4088 predictors measuring the logarithm of the expression level of 4088 genes.The purpose is to systematically search genomic features that contain sufficient information for riboflavin production rate response prediction.We center the response and standardize all the covariates before analysis.
The sample size n = 71 is small compared with the covariate dimension p = 4088.To handle the ultrahigh dimensionality, we preselect the most significant 100 genes via the DC-SIS (Li et al., 2012).Following the work of Hilafu and Yin (2017), we split the data into a training set of 50 samples and a test set of 21 samples.The training set is used to select features and estimate the central subspace.To evaluate the performance in the test data, we fit a linear model with the selected variables as predictors, rather than building a complex model.genes with the adjusted R 2 78.7% while our proposed method only selects 21 genes with the adjusted R 2 76.7%.However, the predicted RMSE of Tan et al. (2018b) and our proposed method in the test set data are 2.192 and 2.068, respectively.The scatterplots of these two methods about the actual and predicted values for the 21 test samples are displayed in Figures 1(c) and 1(d).Thus in terms of prediction, our method is slightly better than Tan et al. (2018b).

Conclusion
In this article, we extend the HSIC based SDR method of Zhang and Yin (2015) to handle a large p and small n scenario by borrowing the idea from Tan et al. (2018b).The proposed method estimates the basis of the central subspace and performs sufficient variable selection simultaneously.Compared with other high-dimensional sparse SDR methods, our proposed method requires the weakest conditions so far.It enjoys a model free property and requires very mild conditions on X and no particular assumption on Y |X, or X|Y .The simulation studies showed that our method is highly efficient and stable in both n > p and n < p scenarios.
There are several possible prospects for future research.It may be of interest to extend this idea to multiple-index models, which is not trivial since it needs a new algorithm design.
Moreover, the current computational bottleneck for our proposed method is on solving the majorization step, which has a computational complexity of O(p 3 ) per iteration.Thus, it will be also interesting to redesign a highly efficient algorithm such that our proposed method is scalable to accommodate large-scale data.Finally, the asymptotic properties for our method are deserved to discuss in the future which are not covered in this article. get 4n 2 is constant which verifies the claim.
A.2 Linearized Alternating Direction Method of Multipliers Algorithm for Solving (3.6) To implement the linearized alternating direction method of multipliers algorithm, we rewrite the subproblem in formula (3.6) as min This is also equivalent to minimize the following scaled augmented Lagrangian function, where ρ is a small constant and Γ is the dual variable.The L-ADMM minimizes the augmented Lagrangian function by alternatively solving one block of variables at a time.
In particular, to update Π at the j-th iteration, we need to minimize where H j and Γ j are the j-th estimates of H and Γ respectively.However, there is no closed-form solution for the above minimization problem.To tackle the difficulty, Fang et al. (2015) proposed to linearize the quadratic term in the above problem by applying a second-order Taylor Expansion.Following the work of them, we obtain the update for Π: As suggested by Fang et al. (2015), we pick τ ≥ 4ρλ 2 max ( Σ) to ensure the convergence of the linearized alternating direction method of multipliers algorithm.The above iterate can be written in the more familiar notation: which has the closed-form solution where Finally, we update the dual variable by and M is the set of p×p symmetric semi-definite positive matrices.In this new formulation, our focus is changed to directly estimate the orthogonal projection Π onto the subspace instead of estimating the basis β.

Figure 1 :
Figure 1: Panels (a) and (b) are the sufficient summary plots of Tan et al. (2018b) and our proposed method in the training set, respectively; Panels (c) and (d) are the scatterplots of Tan et al. (2018b) and our proposed method with the actual and predicted values for the testing samples, respectively.

Table 1 :
Summary of the simulation studies.The mean, averaged over 200 datasets, are reported.All entries are multiplied by 100.

Table 2 :
Summary of the simulation studies 5 and 6.The mean, averaged over 200 datasets, are reported.All entries are multiplied by 100.