Robust Matrix Completion By Exploiting Dynamic Low-Dimensional Structures


This paper studies the robust matrix completion problem for time-varying models. Leveraging the low-rank property and the temporal information of the data, we develop novel methods to recover the original data from partially observed and corrupted measurements. We show that the reconstruction performance can be improved if one leverages the information about the sparse corruptions in addition to the temporal correlations among a sequence of matrices. The dynamic robust matrix completion problem is formulated as a nonconvex optimization problem, and the recovery error is quantified analytically and proved to decay in the same order as that of the state-of-the-art method when there is no corruption. A fast iterative algorithm with a convergence guarantee to a stationary point is proposed to solve the nonconvex problem. Experiments on synthetic data and real video datasets demonstrate the effectiveness of our method.


Introduction
Many practical datasets have intrinsic low-dimensional structures despite the high ambient dimension. Motivated by applications like image and video processing [1,2], remote sensing [3], and collaborative filtering [4], these low-dimensional structures have been extensively studied in problems like Low-Rank Matrix Completion (LRMC) [5,6] and Robust Principal Component Analysis (RPCA) [7,8]. LRMC aims to recover a rank-r matrix L̄ from a subset of its observed entries. In a more general setup of low-rank matrix recovery, the observations can be linear functions of L̄ instead of partial entries [9]. RPCA decomposes a given data matrix into the sum of a low-rank matrix (true data) and a sparse matrix (outliers): given M = L̄ + C̄, where L̄ is a low-rank matrix and C̄ is a sparse matrix with at most s nonzero entries, RPCA aims to separate L̄ and C̄ from M. For example, in power system monitoring, L̄ models the spatial-temporal blocks of ground-truth data, and C̄ models the erroneous measurements [10]. These problems can be formulated as nonconvex optimization problems due to the rank and sparsity constraints. Solving convex relaxations of such problems, obtained by replacing the rank constraint with a nuclear-norm constraint [11,12,5] and the sparsity constraint with an ℓ1-norm constraint [13], is proven to return the original matrix under certain conditions [7,11]. Since the nonconvex formulation directly addresses the low-rank constraint, it is more effective than the convex relaxations and has attracted much attention recently. Despite this effectiveness, it is generally more challenging to prove convergence for nonconvex problems, and it is even harder to prove convergence to the global optimum. A few works provide convergence guarantees for nonconvex approaches [14,15,16,17,18,19,20]. Except for one work that considers robust PCA [15], all the other listed works focus on low-rank matrix completion. In this paper, we also consider the existence of sparse error matrices and guarantee the convergence of our algorithm to stationary points.
The existing works on LRMC and RPCA assume that the low-dimensional structure does not change over time and consider one fixed low-rank matrix. However, users' preferences may change over time in collaborative filtering [21], for example. In this case, a more general model is to assume that the data lie in a low-dimensional subspace that gradually changes over time. The problem of tracking data points that lie in a gradually changing subspace while being robust to sparse outliers is referred to as Dynamic RPCA [22,23,24,25]. Ref. [22] studies the online version of RPCA where the sparsity pattern and the signal values of the sparse components change in a correlated manner over time. Ref. [23] proposes an online algorithm for recovering a time sequence of sparse vectors and a time sequence of dense vectors from their sum, where the dense vectors lie in a slowly changing low-dimensional subspace. Ref. [24] studies the online RPCA problem and proposes a practical algorithm with a theoretical guarantee. RPCA with weak temporal correlations is studied in [25], and the theoretical analysis only holds when the temporal correlations of the data points are relatively weak. These papers consider the temporal correlations in either the underlying subspace or the sparse part, but the problem of missing data is not addressed.
Some methods have been developed to track the low-dimensional structure with incomplete data, under the terminology of subspace tracking with missing data or Dynamic MC. Examples include Grassmannian Rank-One Update Subspace Estimation (GROUSE) [26] and Parallel Estimation and Tracking by Recursive Least Squares (PETRELS) [27] when the low-dimensional structure is represented by a linear subspace, as well as Multiscale Online Union of Subspaces Estimation (MOUSSE) [28] when the low-dimensional structure is represented by a union of subspaces. Ref. [21] proposes a model of a sequence of dynamically correlated matrices through slowly varying subspaces; each matrix is assumed to be low-rank and to contain erasures, but sparse errors are not modeled in [21]. A detailed review of Dynamic RPCA and Dynamic Matrix Completion is provided in [29]. Another line of work is robust tensor completion [30,31,32], which can also be applied to the missing and corrupted data scenario. However, robust tensor completion is far from being well studied compared with robust matrix completion, because it is computationally intractable to determine the tensor rank or its best convex approximation [33].
This paper studies the problem of recovering low-rank matrices when the measurements are partially corrupted and partially lost. It models the dynamic correlations of low-rank matrices through time-varying subspaces. Our model generalizes the one in [21] by additionally modeling sparse errors in the measurements. To the best of our knowledge, this is the first analytical study of robust matrix completion for temporally correlated matrices with partially corrupted measurements. We formulate the problem as a nonconvex optimization problem and theoretically characterize the recovery error. Our analytical bound characterizes the gain of exploiting the temporal correlations in a sequence of low-rank matrices over recovering each low-rank matrix individually. We propose a fast iterative method to solve the nonconvex problem approximately and show that the algorithm always converges to a critical point.
The rest of the paper is organized as follows. Section 2 introduces our model and problem formulation. Section 3 provides the theoretical bound on the matrix recovery error. Section 4 proposes an iterative algorithm and shows its convergence result. Section 5 presents the numerical results. Section 6 concludes the paper.

Model and Proposed Method
Our problem formulation is built upon and generalizes the problem setup in [21]. Let L̄_t ∈ R^{n_1×n_2} denote the actual data at time t, and let C̄_t ∈ R^{n_1×n_2} denote the sparse additive errors in the measurements at time t. The temporal correlations are modeled as a sequence of low-rank matrices with correlated low-dimensional subspaces. Specifically, let X̄_t = L̄_t + C̄_t ∈ R^{n_1×n_2} denote the corrupted data at time t, where L̄_t = Ū_t V̄_t^T has rank at most r, Ū_t ∈ R^{n_1×r}, V̄_t ∈ R^{n_2×r}, and C̄_t has at most s nonzero entries. We further assume ‖L̄_t‖_∞ ≤ α and ‖C̄_t‖_∞ ≤ α for some constant α. The temporal correlations can be modeled by correlations in the Ū_t's and V̄_t's. Without loss of generality, we assume V̄_t changes over time while Ū_t is fixed to be Ū, so that we have L̄_t = Ū V̄_t^T. For the sake of simplicity in our analysis, we consider a simple model on V̄_t as follows:

V̄_t = V̄_{t−1} + ε_t,  t = 2, ..., d,   (1)

where ε_t represents the perturbation noise in V̄_t. Note that the temporal information (1) is only used in the theoretical analysis; our proposed method does not need the information of ε_t. Moreover, the factorization of a low-rank matrix is not unique, and our result holds as long as there exists one factorization such that (1) holds. Z_t ∈ R^{n_1×n_2} denotes the measurement noise, and Ω_t is the set of observed entries of X̄_t with |Ω_t| = m_t. The partially observed measurements can be represented by

y_t = P_{Ω_t}(X̄_t + Z_t) = P_{Ω_t}(L̄_t + C̄_t + Z_t),   (2)

where P_{Ω_t}: R^{n_1×n_2} → R^{m_t} is a linear operator. In this paper, we consider the case that the operator P_{Ω_t} is a uniform sampling ensemble (with replacement), defined as follows.
Definition 1 A linear operator P_{Ω_t}: R^{n_1×n_2} → R^{m_t} is a uniform sampling ensemble (with replacement) if each entry of P_{Ω_t}(X) can be expressed as [P_{Ω_t}(X)]_i = ⟨P^t_i, X⟩, where the matrices P^t_i are i.i.d. uniformly distributed on the set 𝒳 = {e_i(n_1) e_j(n_2)^T : 1 ≤ i ≤ n_1, 1 ≤ j ≤ n_2}, and e_j(n) denotes the j-th canonical basis vector in R^n.
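To make the sampling model concrete, the following sketch (in Python with NumPy, using hypothetical helper names not from the paper) draws the index pairs of Definition 1 i.i.d. uniformly with replacement and applies P_{Ω_t} and its adjoint P*_{Ω_t}; it is only an illustration of the definition.

import numpy as np

def sample_operator(n1, n2, m, rng):
    # Draw m index pairs (i, j) i.i.d. uniformly with replacement,
    # i.e., m matrices P_i^t = e_i(n1) e_j(n2)^T from Definition 1.
    rows = rng.integers(0, n1, size=m)
    cols = rng.integers(0, n2, size=m)
    return rows, cols

def P_Omega(X, rows, cols):
    # [P_Omega(X)]_k = <P_k^t, X> = X[rows[k], cols[k]].
    return X[rows, cols]

def P_Omega_adjoint(y, rows, cols, n1, n2):
    # Adjoint P*_Omega: scatter the m observed values back into an
    # n1 x n2 matrix; repeated indices accumulate because the sampling
    # is with replacement.
    M = np.zeros((n1, n2))
    np.add.at(M, (rows, cols), y)
    return M

For instance, sample_operator(50, 50, 1000, np.random.default_rng(0)) generates one realization of Ω_t.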
The dynamic robust matrix completion problem is stated as follows. Given partially observed and corrupted measurements {y_t} for t = 1, ..., d, can we recover the actual data L̄_d? Note that [21] does not consider C̄_t, and our problem reduces to the one in [21] when C̄_t = 0 for t = 1, ..., d.
We estimate the unknown (L̄_d, C̄_d) using a nonconvex optimization approach. The objective function is given by

F(L, C) = Σ_{t=1}^{d} ω_t ‖P_{Ω_t}(L + C) − y_t‖_2^2,   (3)

where the ω_t's are predetermined non-negative weights with Σ_{t=1}^{d} ω_t = 1. We estimate (L̄_d, C̄_d) by (L̂, Ĉ), where

(L̂, Ĉ) = argmin_{(L,C) ∈ C(r,s,α)} F(L, C),   (4)

and the feasible set is defined as C(r, s, α) = {(L, C) : rank(L) ≤ r, ‖L‖_∞ ≤ α, C has at most s nonzero entries, ‖C‖_∞ ≤ α}. We consider the online setting where the objective is to estimate the data at the current time step t = d. The past observations with t < d are used to enhance the estimation of L̄_d because L̄_1, ..., L̄_d are correlated. One alternative approach is to estimate the matrices at all time instants by replacing the variable L with L_1, L_2, ..., L_d. However, this alternative is more computationally expensive than (4) because the number of variables increases by a factor of d. Similarly, in our proposed approach (4), we only estimate C̄_d rather than all of C̄_1, ..., C̄_d to reduce the computational complexity. We measure the error of fitting the historical data with t < d using P_{Ω_t}(L + C) − y_t rather than P_{Ω_t}(L) − y_t, because the former provides a better estimate numerically when the errors have temporal correlations, which often happens when sensors (e.g., cameras) are partially masked or damaged. In our theoretical analysis in Section 3, we do not assume any correlations among the sparse error matrices. Moreover, our approach does not need the information of ε_t and applies to cases where the temporal correlations in the low-rank matrices are unknown.
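As a minimal illustration of the weighted data-fitting term in (3), reusing the hypothetical helpers sketched after Definition 1 and assuming each observation is stored as a triple (rows, cols, y_t):

def objective(L, C, obs, weights):
    # F(L, C) = sum_t w_t * || P_Omega_t(L + C) - y_t ||_2^2,
    # with obs[t] = (rows_t, cols_t, y_t) and weights summing to one.
    val = 0.0
    for (rows, cols, y), w in zip(obs, weights):
        residual = P_Omega(L + C, rows, cols) - y
        val += w * float(np.sum(residual ** 2))
    return val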

Results: Theoretical
Our theoretical bound is built upon and generalizes the result in [21]. The major difference from [21] is the extra effort needed to handle the nonzero C̄_t's.

Assumptions of Our Approach
We first define some coefficients: n_max = max(n_1, n_2) and n_min = min(n_1, n_2).
For simplicity of the theoretical analysis, we assume the numbers of observations |Ω_t| are the same for t = 1, ..., d, and define m_0 = |Ω_t|. Note that the locations of the observations in Ω_t are different for different t. Let p = m_0/(n_1 n_2) denote the fraction of observed entries.
Suppose that we are given measurements as in (2), where all P_{Ω_t}'s are uniform sampling ensembles. Assume that

(a) L̄_t evolves according to (1), has rank at most r, and is incoherent with parameter μ_0.

Theoretical Recovery Guarantee of Our Approach
We first define some coefficients that will be used in Theorem 1.
where c_1, c_2, c_3, c_4, c_5, c_6, c_7, and c_8 are constants. We provide below our simplified theorem with an order-wise bound; the complete version can be found in Appendix B. For brevity, we assume n_1 and n_2 are of the same order O(n), and we choose the weights ω_t = 1/d for t = 1, ..., d to simplify the order of the observation lower bound.
Theorem 1 establishes the recovery error for the dynamic robust matrix completion problem, and with some tedious calculations one can simplify its bound.

Analysis of the additive errors. If s = O(n), i.e., the number of corrupted measurements per row is bounded, then from the theorem we have that, if m_0 ≥ O(n(log(n))^2), the bounds (5) and (6) hold with probability at least 1 − 11/(2n) − 7dn e^{−n}. Note that (5) and (6) diminish to zero as n increases.
When the measurements do not contain corruptions, i.e., the C̄_t's are all zeros, [21] showed that if m_0 ≥ O(n(log(n))^2), then with probability at least 1 − 5/(2n) − 5dn e^{−n}, a recovery bound of the corresponding form holds when r = O(1). Here, Q_2 ≥ Q_1. Note that the probability of success in Theorem 1 is of the same order as that in [21] as well.
The role of m_0. One can check that Q_1 and Q_2 are of order O(1/m_0), which means the recovery error decreases as the number of observations increases.
The role of d. If we choose the weights ω_t = 1/d for t = 1, ..., d, one can check that the required number of observations of each matrix is reduced by a factor of O(log d / d) as d increases. One can also check that two terms in Q_2 decrease as d increases, which means the recovery error is reduced by exploiting the temporal dynamics in the low-rank matrices.
The optimal weights. Note that if the number of corrupted measurements per row is bounded, the upper bound in Theorem 1 is dominated by the term Σ_{t=1}^{d} ω_t^2 (α^2 c_3 + c_6 σ_1^2 + c_6 (d − t) σ_2^2).
By minimizing this term with respect to ω_t for t = 1, ..., d, subject to Σ_{t=1}^{d} ω_t = 1 and ω_t ≥ 0, we obtain the optimal selection of ω_t. The analytical solution ω_t^* follows from a direct calculation (a short derivation is sketched below). Note that σ_2 reflects the perturbation noise. The V̄_t's are the same when σ_2 = 0, and in this case the L̄_t's are also the same. Therefore, σ_2 = 0 suggests that we choose the same weights for all time instants, i.e., ω_t = 1/d. In contrast, σ_2 > 0 suggests that we should choose larger ω_t as t increases. In addition, the similarities among the L̄_t's become smaller as σ_2 increases, which suggests that we should choose larger ω_t for larger t. Please refer to the Appendix for the proof of Theorem 1.
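For completeness, here is a short derivation of this weight choice, assuming the dominant term takes the quadratic form above and writing a_t := α^2 c_3 + c_6 σ_1^2 + c_6 (d − t) σ_2^2 (the nonnegativity constraints are automatically satisfied since every a_t > 0):

\[
\min_{\sum_{t}\omega_t = 1}\ \sum_{t=1}^{d}\omega_t^2 a_t
\quad\Rightarrow\quad
\frac{\partial}{\partial \omega_t}\Big(\sum_{\tau=1}^{d}\omega_\tau^2 a_\tau - \lambda\Big(\sum_{\tau=1}^{d}\omega_\tau - 1\Big)\Big)
= 2\omega_t a_t - \lambda = 0,
\]
\[
\text{so}\qquad
\omega_t^* = \frac{a_t^{-1}}{\sum_{\tau=1}^{d} a_\tau^{-1}},
\qquad a_t = \alpha^2 c_3 + c_6\sigma_1^2 + c_6 (d-t)\sigma_2^2 .
\]

Since a_t decreases as t increases (for σ_2 > 0), the optimal weight ω_t^* increases with t, consistent with the discussion above; when σ_2 = 0 all a_t are equal and ω_t^* = 1/d.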

Algorithm for Dynamic Matrix Completion with Erroneous Measurements
Since (4) is nonconvex, one option is to solve a convex relaxation instead, replacing the rank constraint with a nuclear-norm regularizer and the sparsity constraint with an ℓ1-norm regularizer. The relaxed problem is

min_{L, C} Σ_{t=1}^{d} ω_t ‖P_{Ω_t}(L + C) − y_t‖_2^2 + λ_1 ‖L‖_* + λ_2 ‖C‖_1,   (8)

where ‖L‖_* is the sum of all singular values of L, and ‖C‖_1 is the sum of the absolute values of the entries of C. Note that (8) reduces to the Robust Matrix Completion (RMC) problem [34] when d = 1. We also remark that setting the same α for ‖L‖_∞ and ‖C‖_∞ simplifies the presentation of the upper bound in the theoretical analysis. In the algorithm, one can set ‖L‖_∞ ≤ α_1 and ‖C‖_∞ ≤ α_2 with different constants α_1 and α_2.
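The experiments below solve (8) with CVX in MATLAB. Purely as an illustration, an analogous formulation in Python with cvxpy (using the hypothetical observation layout from the earlier sketches, and assuming a cvxpy version that supports NumPy-style integer-array indexing of variables) could look as follows; this is a sketch, not the paper's implementation.

import cvxpy as cp

def solve_relaxation(obs, weights, n1, n2, lam1, lam2):
    # Convex relaxation (8): weighted least squares on the observed entries
    # plus nuclear-norm and l1-norm regularizers.
    L = cp.Variable((n1, n2))
    C = cp.Variable((n1, n2))
    fit = 0
    for (rows, cols, y), w in zip(obs, weights):
        fit = fit + w * cp.sum_squares(L[rows, cols] + C[rows, cols] - y)
    problem = cp.Problem(cp.Minimize(fit + lam1 * cp.normNuc(L) + lam2 * cp.norm1(C)))
    problem.solve()
    return L.value, C.value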
Since solving the convex relaxation (8) is not effective, we propose an algorithm to solve the nonconvex problem (4) directly. The low-rank matrix L is factorized into two matrices U ∈ R^{n_1×r} and V ∈ R^{n_2×r} such that L = UV^T. We move the constraint L = UV^T into the objective function as a quadratic penalty. The revised problem of (4) is written as

min_{U ∈ R^{n_1×r}, V ∈ R^{n_2×r}, (L, C) ∈ C(r, s, α_1, α_2)}  Σ_{t=1}^{d} ω_t ‖P_{Ω_t}(L + C) − y_t‖_2^2 + ρ ‖L − U V^T‖_F^2,   (9)

where C(r, s, α_1, α_2) is obtained from C(r, s, α) in (4) by replacing ‖L‖_∞ ≤ α and ‖C‖_∞ ≤ α with ‖L‖_∞ ≤ α_1 and ‖C‖_∞ ≤ α_2. We denote the objective of (9) by H(U, V, L, C), our revised objective function. The solution of (9) coincides with that of (4) when ρ approaches infinity. However, if ρ is too large, the step size becomes very small, which affects the convergence rate. One practical solution is to dynamically increase ρ from a small value with a fixed multiplier γ > 1 and use the result from the previous step as the initialization of the current step [35]. We remark that the decomposition L = UV^T is a natural way to handle the rank inequality constraint [36,20]. With only the low-rank constraint, methods like alternating least squares and gradient descent on the Grassmann manifold can be used to solve the problem [36,20]. To handle the infinity-norm constraint on L, we keep the variable L and leverage the proximal alternating gradient descent method, which provides a straightforward treatment of the infinity-norm constraint.
In each iteration, we apply proximal alternating gradient descent to update the estimates of U, V, L, and C. The proximal map associated with a function σ and step size τ is defined as

prox^σ_τ(x) = argmin_u { σ(u) + (1/(2τ)) ‖u − x‖^2 },

and the functions corresponding to σ in (10) and (11) are the indicator functions of the constraint sets on L and C, namely {L : ‖L‖_∞ ≤ α_1} and {C : ‖C‖_∞ ≤ α_2, C has at most s nonzero entries}, so the corresponding proximal steps reduce to projections onto these sets.
The details of our algorithm are summarized in Algorithm 1. The update (12) is obtained by an entrywise rule, where (L_{i+1})_{kj} denotes the entry in row k and column j of L_{i+1}; this corresponds to lines 5-8 of Algorithm 1. Similarly, (13) is obtained by the analogous entrywise rule. The update (14) is equivalent to keeping only the s entries of C with the largest absolute values. The operations on C correspond to lines 6-10 of Algorithm 1.
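Both projections have simple closed forms: entrywise clipping for the infinity-norm constraint on L, and hard thresholding of C to its s largest-magnitude entries, with the box constraint on C also enforced by clipping. A minimal sketch:

def project_inf_ball(X, alpha):
    # Projection onto {X : ||X||_inf <= alpha}: clip every entry.
    return np.clip(X, -alpha, alpha)

def project_sparse_box(C, s, alpha2):
    # Keep only the s entries of C with the largest absolute values
    # (set the rest to zero), then clip to enforce ||C||_inf <= alpha2.
    out = np.zeros_like(C)
    if s > 0:
        keep = np.argsort(np.abs(C), axis=None)[-s:]
        out.flat[keep] = C.flat[keep]
    return np.clip(out, -alpha2, alpha2)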
The variables L, U, V, and C are initialized as follows. We first form an initial estimate L_0 from the observations and then perform a truncated singular value decomposition with the r largest singular values of L_0 to initialize U_0 and V_0. C_0 is set to be a zero matrix. The step sizes τ_U, τ_V, τ_L, and τ_C in Algorithm 1 are selected by calculating Lipschitz constants in each iteration; more specifically, each step size is the reciprocal of β times the corresponding Lipschitz constant, where β is a constant greater than or equal to one and l^U_i, l^V_i, l^L_i, and l^C_i are the Lipschitz constants for the blocks U, V, L, and C computed in each iteration, with l^C_i = 1. (In [37], we use backtracking line search to obtain the step size; here we utilize the Lipschitz constants in each iteration to calculate the step sizes, which accelerates the algorithm in applications.) Similar to the assumptions made on Nonnegative Matrix Factorization in Ref. [38] and on Nonnegative Tensor Factorization in Ref. [39], we assume that the sequences {U_i} and {V_i} are bounded and that inf_{i∈N} {‖U_i‖_F, ‖V_i‖_F} > 0. As suggested by Ref. [38], the second boundedness assumption can easily be dropped by a modification involving a small positive constant ν. We fix a constant δ that controls the termination of the algorithm.
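Putting the pieces together, here is a minimal single-time-step sketch of the proximal alternating updates. It assumes the penalized objective H(U, V, L, C) reconstructed above, the hypothetical helpers from the earlier sketches, step sizes equal to 1/(β l) for per-block Lipschitz constants l, a zero-filled initialization of L_0, and a stopping rule based on the change of the objective; it illustrates the structure of Algorithm 1 rather than reproducing its exact listing.

def dynamic_rmc(obs, weights, n1, n2, r, s, alpha1, alpha2,
                beta=1.0, rho=0.1, gamma=1.01, delta=1e-6, max_iter=5000):
    # Initialization (assumed): zero-filled weighted average of the
    # observations, then a rank-r truncated SVD to obtain U_0 and V_0.
    L = np.zeros((n1, n2))
    for (rows, cols, y), w in zip(obs, weights):
        np.add.at(L, (rows, cols), w * y)
    Uf, sv, Vt = np.linalg.svd(L, full_matrices=False)
    U = Uf[:, :r] * np.sqrt(sv[:r])
    V = Vt[:r, :].T * np.sqrt(sv[:r])
    C = np.zeros((n1, n2))

    def grad_data(L, C):
        # Gradient of sum_t w_t ||P_Omega_t(L + C) - y_t||^2 w.r.t. (L + C).
        G = np.zeros((n1, n2))
        for (rows, cols, y), w in zip(obs, weights):
            res = P_Omega(L + C, rows, cols) - y
            G += 2.0 * w * P_Omega_adjoint(res, rows, cols, n1, n2)
        return G

    prev_obj = np.inf
    for _ in range(max_iter):
        # U-step: gradient of rho * ||L - U V^T||_F^2 w.r.t. U; Lipschitz 2*rho*||V||_2^2.
        lU = 2.0 * rho * max(np.linalg.norm(V, 2) ** 2, 1e-12)
        U = U - (2.0 * rho * (U @ V.T - L) @ V) / (beta * lU)
        # V-step: analogous update for V.
        lV = 2.0 * rho * max(np.linalg.norm(U, 2) ** 2, 1e-12)
        V = V - (2.0 * rho * (U @ V.T - L).T @ U) / (beta * lV)
        # L-step: gradient step on the smooth part, then projection.
        lL = 2.0 + 2.0 * rho        # rough Lipschitz bound used only for this sketch
        GL = grad_data(L, C) + 2.0 * rho * (L - U @ V.T)
        L = project_inf_ball(L - GL / (beta * lL), alpha1)
        # C-step: gradient step, then projection onto the sparsity/box constraints.
        lC = 2.0                    # rough bound, as above
        C = project_sparse_box(C - grad_data(L, C) / (beta * lC), s, alpha2)
        rho *= gamma                # gradually increase the penalty weight
        obj = objective(L, C, obs, weights) + rho * np.linalg.norm(L - U @ V.T) ** 2
        if abs(prev_obj - obj) < delta:
            break
        prev_obj = obj
    return L, C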

Algorithm 1 Algorithm for dynamic robust matrix completion
Input: initial matrices U_0, V_0, L_0 and zero matrix C_0 ∈ R^{n_1×n_2}; parameters r, α_1, α_2, β, ρ, γ, δ, and s; weights ω_t for t = 1, ..., d.

We next show that every sequence generated by our algorithm converges to a critical point of (9). Our algorithm is a special case of Proximal Alternating Linearized Minimization (PALM) algorithms, whose convergence has been proved in [38]. One can check that ∇_U H, ∇_V H, ∇_L H, and ∇_C H are Lipschitz continuous, and that H(U, V, L, C) + J(L) + Q(C) + K(C) satisfies the Kurdyka-Lojasiewicz (KL) property based on Proposition 3 in Ref. [38]. Then our algorithm converges to a critical point of (9) from every initial point; see Appendix G for more details. According to Theorem 2.9 in Ref. [39], if the initial point is sufficiently close to the optimal point, our algorithm converges at least sublinearly. We also remark that the per-iteration complexity of our algorithm is of order O(n_1 n_2 r).

Results: Numerical Experiments
We test the performance of Algorithm 1 on both synthetic datasets and actual video datasets. The recovery performance is measured by the relative recovery error ‖L̂ − L̄_d‖_F / ‖L̄_d‖_F, where L̄_d represents the actual data and L̂ represents the recovered data. The corruption rate denotes the fraction of nonzero entries in C̄_t. In the simulations on synthetic datasets, we generate the nonzero entries of the sparse matrix C̄_t i.i.d. from the Gaussian distribution N(0, 2.5^2). In the simulations on the actual video datasets, we generate the nonzero entries of the sparse matrix C̄_t i.i.d. from the Gaussian distribution N(0, 1). We consider two types of corruptions: (1) uniformly chosen locations for all corrupted entries, and (2) uniformly chosen locations for corrupted blocks, where each block is a square of corrupted entries. The corruptions follow type (1) if not otherwise specified. The average erasure rate is the percentage of missing entries. We set β = 1, δ = 10^{−6}, γ = 1.01, and set the starting value of ρ to 0.1. The simulations are run in MATLAB on a computer with a 3.4 GHz Intel Core i7. All results are averaged over 100 runs.

Experimental set-up
We first compare Algorithm 1 with the convex relaxation (8), solving (8) by CVX [40]. We set n_1 = 50. The perturbation matrix ε_t is drawn from the Gaussian distribution N(0, σ_2^2), and the noise matrix Z_t is drawn from the Gaussian distribution N(0, σ_1^2). The weights {ω_t} in Algorithm 1 are set to (1/d, ..., 1/d). We first set s = 500, σ_1 = 0.01, and σ_2 = 0.03. One can check that the variance of each entry of L̄ is equal to 5 when d = 1, and the variance increases to 5.018 when d = 5. We then choose α_1 to be 10, which is twice the variance when d = 1. According to Monte Carlo simulations, the probability of ‖L̄‖_∞ being less than 10 in our setup is larger than 99% when d ≤ 5. From the properties of the Gaussian distribution, we choose α_2 = 7.5 since P(‖C̄_t‖_∞ ≤ 7.5) > 99.7%. We set λ_1 = 0.001 and λ_2 = 0.00015 in (8). Fig. 1 shows the recovery performance of the convex relaxation (8) and Algorithm 1 for different d. Algorithm 1 performs better than the convex relaxation for the same d, and the performance of Algorithm 1 improves as d increases. When d = 5, the average running time of Algorithm 1 is 0.35 seconds, while the convex relaxation using CVX needs 300 seconds. Even though CVX is not designed for speed, the comparison still shows the speed advantage of our algorithm. We also study the impact of ρ on the number of iterations. If we set ρ = 5 and γ = 1, i.e., ρ does not change, the average number of iterations is 8000 when the average erasure rate equals 0.8. As a comparison, the average number of iterations is 1300 if ρ starts at 0.1 and γ = 1.01. We also test other fixed values of ρ from 5 to 3000; the smallest resulting number of iterations is still seven times larger than the number of iterations obtained by updating ρ as in our setup. Thus, dynamically adjusting ρ indeed speeds up the convergence.
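A sketch of the synthetic data generation used here is given below (the exact values of n_2, r, d, and the scaling of the factors Ū and V̄_1 are not fully specified in this section, so the factor scaling is illustrative only; the helpers are the hypothetical ones introduced in Section 2):

def make_synthetic(n1, n2, r, d, s, sigma1, sigma2, p, rng):
    # Fixed Ubar, random-walk Vbar_t as in eq. (1), sparse N(0, 2.5^2)
    # corruptions, N(0, sigma1^2) noise, uniform sampling with replacement.
    Ubar = rng.standard_normal((n1, r))      # illustrative scaling
    Vbar = rng.standard_normal((n2, r))
    m = int(round(p * n1 * n2))
    obs = []
    for t in range(d):
        if t > 0:
            Vbar = Vbar + sigma2 * rng.standard_normal((n2, r))  # eq. (1)
        Lbar = Ubar @ Vbar.T
        Cbar = np.zeros((n1, n2))
        support = rng.choice(n1 * n2, size=s, replace=False)     # s corrupted entries
        Cbar.flat[support] = 2.5 * rng.standard_normal(s)
        Z = sigma1 * rng.standard_normal((n1, n2))               # measurement noise
        rows, cols = sample_operator(n1, n2, m, rng)
        obs.append((rows, cols, P_Omega(Lbar + Cbar + Z, rows, cols)))
    return obs, Lbar, Cbar   # the last (Lbar, Cbar) is the ground truth at time d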

Experimental results
We then set d = 3 and keep the rest of the simulation setup the same. Fig. 2 shows the comparison of Algorithm 1 with the locally weighted matrix smoothing (LOWEMS) method proposed in [21] and with the convex relaxation (8). LOWEMS alternately minimizes the objective function of equation (13) in [21] over U and V. This method does not consider corruptions, and its performance degrades when a significant fraction of the measurements are corrupted. We compare the three methods under corruption rates of 20% and 30%. Algorithm 1 performs better than the other two methods. We also vary the perturbation noise by increasing σ_2 from 0.05 to 0.5, fixing d = 3, s = 500, and the average erasure rate at 0.5 in this experiment. We set (ω_1, ω_2, ω_3) to four different groups of values: (a) (0, 0, 1), (b) (0.33, 0.33, 0.33), (c) (0.1, 0.3, 0.6), (d) (0.05, 0.15, 0.8). Here (0, 0, 1) is the baseline since no history is used in this case. The relative recovery error against σ_2 is shown in Fig. 3. One can see that the relative recovery error in case (b) increases as the perturbation noise increases and exceeds the baseline after σ_2 = 0.25. This means that placing equal weights on measurements at all times degrades the performance if the perturbation is significant. One can improve the performance by setting a large weight on the current instant and small weights on the history. The recovery error is the smallest in case (d). We leave the selection of optimal weights as future work.

Performance on actual video dataset 1

Experimental set-up
In this section, we test our method on a noisy video dataset in which a man slowly moves his head. The first row of Fig. 4 shows three different frames of the video. The second row shows the frames with 15% corruptions and 15% missing entries, and the third row shows the frames with 20% block corruptions and 15% missing entries. A video with only a few moving objects usually changes slowly. We remark that the perturbation model on V̄_t and L̄_t simplifies the analysis but is not required to implement our recovery method. Our method applies to datasets with smooth changes, and this testing dataset satisfies this property.

Experimental results
Under different corruption rates (5%, 10%, 15%, 20%), we run our algorithm with different d. Fig. 5 shows that d > 1 outperforms d = 1. Moreover, the performances of d = 3 and d = 2 are close when the erasure rate is small, while d = 3 is the best when the average erasure rate increases. We also test our algorithm on frames with 20% corruptions in blocks. As shown in Fig. 6, increasing d helps decrease the recovery error when the average erasure rate increases. The recovery error curves are similar to the curves for the 20% corruption rate in Fig. 5, indicating that our algorithm is not sensitive to the corruption type. We then set d = 2 and keep the rest of the simulation setup the same. Fig. 7 shows the recovery performance of our method with different corruption rates. The performance of our method improves as the corruption rate decreases.
We also show box plots of the relative recovery error as the average erasure rate changes in Fig. 8. The tops and bottoms of each box are the 25th and 75th percentiles of the samples, respectively. The results show that our method has small variance.

Performance on actual video dataset 2

Experimental set-up
We test our method on a public video dataset in which an anchor is broadcasting news (the dataset can be downloaded from [41]). The dataset contains 300 frames, each of size 720 × 486. We downsize the video into a 3168 × 300 matrix; each L̄_t (t = 1, 2, 3) then contains 100 consecutive frames. r is set to 12. The corrupted entries are set to 20% of the total number of entries in L̄_d. The other parameters are the same as in Section 5.2.
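For illustration, the frames can be arranged into the matrices L̄_t as follows (a hypothetical sketch; it assumes the frames have already been downsized so that each vectorized frame has length 3168, since the exact downsizing procedure is not given):

def frames_to_matrices(frames, per_block=100):
    # frames: array of shape (T, h, w) of already-downsized grayscale frames.
    # Vectorize each frame into a column of the (h*w) x T data matrix and
    # split the columns into consecutive blocks of `per_block` frames,
    # giving the matrices used here (t = 1, 2, 3 for T = 300).
    T, h, w = frames.shape
    M = frames.reshape(T, h * w).T
    return [M[:, i:i + per_block] for i in range(0, T, per_block)]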

Experimental results
In Fig. 9, we show the performance of Algorithm 1 given d = 1, 2, 3. Again, one can see that the recovery error decreases as d increases. We compare Algorithm 1 to Tensor Robust Principal Component Analysis (TRPCA) [32], which extends matrix robust principal component analysis to the tensor case.
We also compare with the convex relaxation (8) when d = 3. Algorithm 1 outperforms the convex relaxation but is slightly worse than TRPCA. We remark that TRPCA needs to estimate the whole dataset and is thus far more time-consuming than Algorithm 1. Fig. 10 shows the visualization of the recovered frames.

Figure 9: The performance of our method on a public video dataset (20% corruptions).

Conclusion and Discussion
This paper develops a data recovery method from partially observed and partially corrupted measurements with time-varying low-dimensional structures. Exploiting the temporal correlations in the low-dimensional structures, we show that the recovery error of our proposed method diminishes as the problem size increases, and the error decays in the same order as that of the state-of-the-art data recovery method with uncorrupted measurements. A proximal algorithm with a convergence guarantee to stationary points is developed. Although the theorem is established for the global optimum, leaving a gap between the theory and our algorithm, the algorithm is shown to be effective on both synthetic datasets and actual video datasets. One direction of future work is to study the global convergence of the algorithm.

Appendix A: Notations
The notations used in the proofs are summarized in Table 1.

Figure 10: Recovery visualization (first row: original frames; second row: frames with 20% random corruptions and 30% missing; third row: the recovery using our method with d = 3).

Table 1: Notations
L̄_d, C̄_d ∈ R^{n_1×n_2}: the low-rank matrix and the sparse matrix at time d, respectively.
L̂, Ĉ ∈ R^{n_1×n_2}: the estimated low-rank matrix and the estimated sparse matrix, respectively.
n_max, n_min: max(n_1, n_2) and min(n_1, n_2), respectively.
p = m_0/(n_1 n_2): the fraction of observed entries.
‖X‖_2, ‖X‖_F: the spectral and Frobenius norms of matrix X, respectively.
P_{Ω_t}: R^{n_1×n_2} → R^{m_t}: the uniform sampling ensemble.
P*_{Ω_t}: R^{m_t} → R^{n_1×n_2}: the adjoint operator of P_{Ω_t}.

Appendix C: Proof sketch of Theorem 1
The proof of Theorem 1 follows the same framework as the proof of Theorem 3.8 in [21]. With the additional sparse matrix, our proof is more involved than the one in [21]. Given the estimator X̃ = L̂ + Ĉ from (4), we define the recovery error to be Δ_d = X̃ − X̄_d. The goal is to provide bounds on ‖L̂ − L̄_d‖_F^2 + ‖Ĉ − C̄_d‖_F^2 and on ‖L̂ − L̄_d‖_F^2, respectively. To simplify the proof, we first define the matrix Σ_R, where γ^t_i is a Rademacher variable, P^t_i ∈ 𝒳, and e_j(n) are the canonical basis vectors in R^n. Our theoretical analysis builds on the following inequality.
Lemma 1 is a counterpart of Proposition 3.1 in [21]. Since we take the sparse matrix into account, Lemma 1 is a probabilistic inequality while Proposition 3.1 in [21] is a deterministic inequality. Please refer to Appendix D for the proof of Lemma 1.
The remaining work is to lower bound the LHS of (15) and upper bound the RHS of (15). In Lemma 2, we provide an upper bound on the term ‖Σ_{t=1}^{d} ω_t P*_{Ω_t}[P_{Ω_t}(X̄_d) − y_t]‖_2, which eventually upper bounds the RHS of (15). In Lemma 3, we provide a lower bound on the term Σ_{t=1}^{d} ω_t ‖P_{Ω_t}(Δ_d)‖_2^2, which involves ‖Δ_d‖_F^2 and E(‖Σ_R‖_2). Lemma 6 in Appendix H provides an upper bound on E(‖Σ_R‖_2). We then obtain an upper bound on ‖Δ_d‖_F^2 by combining Lemma 1, Lemma 2, Lemma 3, and Lemma 6. Because ‖Ĉ‖_∞ ≤ α and Σ_{ij} 1[Ĉ_{ij} ≠ 0] ≤ s, and by the triangle inequality, we then obtain the upper bounds on ‖L̂ − L̄_d‖_F^2, which hold with probability at least 1 − 3/(n_1 + n_2) − 7d n_max exp(−n_min).
The proof of Lemma 2 follows the same line as the part of the proof of Theorem 3.8 after Lemma E.2 in [21]. With the additional sparse matrix, our proof is more involved than that in [21]; please refer to Appendix E for the proof of Lemma 2. We then define the constraint set E(r, s) for given r and s, which will be used in Lemma 3 and in the final proof of Theorem 1.
The proof of Lemma 3 follows a similar line as the proof of Lemma E.1 in [21]. With the additional sparse matrix, our proof is more involved than that in [21]. Please refer to Appendix F for the proof of Lemma 3.
Combining Lemma 1 and (16), and performing some tedious calculations, yields the first part of the bound in Theorem 1. The second part of the inequality (Q_2) in Theorem 1 follows from combining Lemma 2, Lemma 6, and (17).

Appendix D: Proof of Lemma 1
Proof. Since (3) is continuous in X and the set C(r, s, α) is compact, (3) achieves a minimizer at some point X̃ ∈ C(r, s, α). Since X̃ is a minimizer of the constrained problem, for all matrices X ∈ C(r, s, α) we have F̃(vec(X̃)) ≤ F̃(vec(X)). By the second-order Taylor theorem, we expand F̃(x) around x̄_d = vec(X̄_d). With some tedious calculations, we obtain the expression for the gradient of F̃(x). Based on this gradient and the definitions of P*_{Ω_t} and P_{Ω_t}, we obtain, for any B ∈ R^{n_1×n_2} with b = vec(B), the expression (20). The second term in (20) is given in (21) and (22). Based on (21) and (22), the absolute value of the first term in (20) can be bounded using two inequalities: the first comes from the fact that |⟨A, B⟩| ≤ ‖A‖_2 ‖B‖_*, and the second holds because ‖A‖_* ≤ √r ‖A‖_F for a rank-r matrix A. The above analysis follows a similar line as the proof of Proposition 3.1 in [21]; the remainder of the proof takes care of the additional sparse corruptions in our problem formulation. We will use McDiarmid's Inequality (see Lemma 7 in Appendix H) to bound the sparse term. Note that each entry of the random matrix Z_t ∈ R^{n_1×n_2} is i.i.d. Gaussian with variance σ_1^2. The function under consideration has zero expectation, since Z_t and the perturbations ε_s are all drawn from zero-mean Gaussian distributions.
where the factor s comes from the sparsity upper bound on S. The entries of J selected by the uniform sampling P*_{Ω_t} P_{Ω_t} are independent, and X̄_d − y_t is fixed when we apply McDiarmid's inequality below. McDiarmid's Inequality bounds the distance between a function of independent random variables and its expectation (see Lemma 7 in Appendix H). Since E(|⟨J, S⟩|) is bounded by 4 m_0 s α^2 / (n_1 n_2), we can further bound |⟨J, S⟩| using McDiarmid's Inequality. Note that S has s̃ (s̃ ≤ 2s) nonzero entries. By selecting ξ = 4α ‖J‖_∞ s log(n_1 + n_2), the resulting bound holds with probability at least 1 − 2/(n_1 + n_2). We then have the stated bound with probability at least 1 − 2/(n_1 + n_2), where the last inequality follows from the preceding bounds. The result follows from combining (23) and (24).

Appendix E: Proof of Lemma 2
We here upper bound ‖J‖_2. We first rewrite J in terms of the random matrices F_t and S_t, where each entry of the random matrix Z_t ∈ R^{n_1×n_2} is i.i.d. Gaussian with variance σ_1^2; the term S_t is introduced to take care of the sparse part in our problem formulation. The proof of Lemma 2 follows a similar line as that in [21], but we need to pay additional attention to bounding S_t in the following analysis. We introduce a random matrix G_t ∈ R^{n_1×n_2} that has exactly one nonzero entry, where E_{ij} denotes the canonical basis of the matrices. We also introduce the random matrix H_t, defined as the average of m_0 independent copies of G_t, where each G^t_i is an independent copy of G_t. J can then be decomposed as a sum of independent random matrices. Lemma 5 in Appendix H describes the spectral-norm deviation of a sum of uncentered random matrices from its mean value; we will apply Lemma 5 to this sum of d m_0 independent random matrices.
Note that, for given t and k, the first inequality applies the triangle inequality and the second applies Jensen's inequality. The third inequality holds because (E(G^t_k))(E(G^t_k))^T is positive semidefinite, and the last equality uses the fact that, for a fixed t, the G^t_k are i.i.d. random matrices for all 1 ≤ k ≤ m_0. The remaining work is to uniformly upper bound ‖G^t_k‖_2 for all 1 ≤ t ≤ d and 1 ≤ k ≤ m_0 and to upper bound ρ_0. Since L̄_t is incoherent with parameter μ_0 by assumption (a) in Section 3, we have ‖Ū_{i,:}‖_2^2 ≤ μ_0 r / n_1 for any i = 1, ..., n_1.
By the tail probability of Gaussian random variables and the standard union bound, we then have, for all 1 ≤ t ≤ d and 1 ≤ k ≤ m_0, a uniform bound on ‖G^t_k‖_2, where L := n_1 n_2 (√(2 log(d(n_1 + n_2) n_1 n_2) σ_max^2) + 2α).
Based on the analysis in [21], we can bound the relevant expectations and the tail probabilities of the corresponding maxima. Since we involve sparse corruptions in our problem formulation, we also need to use Hoeffding's Inequality and the union bound to bound the term S, as computed in equations (25) and (26). By Hoeffding's Inequality and the union bound, we obtain (25) and (26), and hence the corresponding expectation and tail bounds. Note that 1 ≤ μ_0 ≤ n_1/r, so μ_0^2 r / n_1 ≤ 1. Now we are ready to bound ρ_0 by combining (27) and (28):

Now by Lemma 5, we have a matrix Bernstein bound on the deviation. Substituting the deviation level √(8 log(n_1 + n_2) ν / m_0) into the above matrix Bernstein inequality, we obtain the desired tail probability. A hidden condition for the above inequality to hold is that ν dominates the denominator of the exponential term. The remaining work is to take m_0 sufficiently large to guarantee that ν dominates the denominator of the exponential, which requires ν ≥ (2/3) L √(8 log(n_1 + n_2) ν / m_0).
The above inequality immediately implies that m_0 ≥ (32/27) n_1 n_2 log(n_1 + n_2) (2 log(d(n_1 + n_2) n_1 n_2) σ_max^2 ...) suffices. The remaining work is to bound the terms corresponding to F_t and S_t. In [21], only the first term appears; here we also need to bound the second term because of the sparse corruptions involved. First, note that each entry of F_t is Gaussian with variance not greater than σ_1^2 + (d − t)σ_2^2. Then, according to results on bounds for the spectral norm of an i.i.d. Gaussian ensemble, we obtain the corresponding bound, where c_1 and c_2 are absolute positive constants. Note that c_1 exp(−c_2 n_max) ≤ d n_max exp(−n_min). We also obtain the analogous bound for the term involving S_t. Now we are ready to bound ‖J‖_2^2: with probability at least 1 − 3/(n_1 + n_2) − 7d n_max exp(−n_min), the stated bound holds.

Appendix F: Proof of Lemma 3
The proof of Lemma 3 follows the same line as the proof of Lemma E.1 in [21]. The major modification is that we need to involve the sparse term. Lemma 4 is the key step in proving Lemma 3 and is included in the following proof.
Proof. We will show that the probability of the following event B is small; note that B contains the complement of the event in Lemma 3. We use a peeling argument to bound the probability of B. Let

ν = max( 186624 log(n_1 + n_2) / (25 m_0 log(6/5)), log(n_1 + n_2) / (2 n_1 log(6/5)) ),

and α = 6/5. For l ∈ N, define the slices S_l accordingly. If event B holds for some X ∈ E(r, s), then X must belong to some S_l and satisfy the corresponding inequality. For T > ν, consider the set E(r, s, T) = {X ∈ E(r, s) : ‖X‖_F^2 ≤ n_1 n_2 T} and the associated event B_l. Note that X ∈ S_l implies X ∈ E(r, s, α^l ν). Then (29) implies that B_l holds, and B ⊂ ∪_l B_l. Thus, it is sufficient to bound the probability of the event B_l. Let H_T denote the corresponding quantity to be bounded; such a bound is given by the following Lemma 4.
Lemma 4 Suppose all P_{Ω_t}'s are fixed uniform sampling ensembles. Then the stated bound on H_T holds.

Proof. The proof follows a similar line as the proof of Lemma F.1 in [21]. We will use McDiarmid's Inequality to bound the sparse term introduced by our problem formulation. By Massart's concentration inequality, we have a concentration bound with c_5 = (1/8)(1/18 · 5/12)^2. Next we bound the expectation E(H_T). Using a symmetrization argument, we obtain a bound involving Rademacher variables γ^t_i (independent over both i and t). The assumptions ‖L‖_∞ ≤ 1/2 and ‖C‖_∞ ≤ 1/2 imply that ‖X‖_∞ ≤ 1, so |⟨P^t_i, X⟩| ≤ 1. The contraction inequality then yields (30). One can easily obtain the upper bound on (30) as in [21] when there are no corruptions; if corruptions exist, we need to use McDiarmid's Inequality to bound the sparse term, as computed in the following analysis. Since X ∈ E(r, s, T), we have the corresponding bound. Note that the ω_t γ^t_i ⟨P^t_i, C⟩ are independent variables with E(ω_t γ^t_i ⟨P^t_i, C⟩) = 0. By McDiarmid's Inequality, we then have a tail bound.
By selecting ξ = √(m_0 n_1 T), the resulting bound holds with probability at least 1 − 2e^{−2 n_1 T}. We then have, with probability at least 1 − 2e^{−2 n_1 T}, the desired bound on the sparse term. Combining the bounds above completes the proof.
Appendix G: Convergence analysis of Algorithm 1

To prove the convergence of our algorithm, we first establish the Lipschitz continuity of the first-order partial derivatives, which plays a key role in proving that the function value is nonincreasing.
We here calculate the first-order partial derivatives of H(U, V, L, C). With the other three variables fixed, we can bound the difference of gradients, where (a) follows from the inequality ‖·‖_2 ≤ ‖·‖_F. Under the boundedness assumptions, ∇_U H is Lipschitz continuous, and l^U_i is bounded away from zero.
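As a concrete instance of this step, assuming H takes the penalized form given in Section 4, the partial gradient with respect to U and its Lipschitz modulus can be written as

\[
\nabla_U H(U, V, L, C) = 2\rho\,(UV^T - L)\,V,
\qquad
\|\nabla_U H(U_1, V, L, C) - \nabla_U H(U_2, V, L, C)\|_F
= 2\rho\,\|(U_1 - U_2)V^T V\|_F
\le 2\rho\,\|V\|_2^2\,\|U_1 - U_2\|_F,
\]

so one may take l^U_i = 2ρ ‖V_i‖_2^2, which is bounded away from zero under the assumption inf_i ‖V_i‖_F > 0, since ‖V_i‖_2 ≥ ‖V_i‖_F / √r.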
Similarly, ∇ V H is Lipschitz continuous, and l V i is bounded away from zero.
where (b) follows from the fact that Σ_{t=1}^{d} ω_t = 1.
We have already shown that the partial derivatives of H(U, V, L, C) are Lipschitz continuous. In light of Ref. [38], we separate the main steps to achieve the convergence result into two major parts and outline them as follows. The first part guarantees that the sequence generated by the algorithm has a convergent subsequence and that the set of accumulation points is a subset of the critical points. Let Φ = H(U, V, L, C) + J(L) + Q(C) + K(C). Given the Lipschitz continuity, we can first prove the sufficient decrease property

Φ(z_{i+1}) + γ_1 ‖z_{i+1} − z_i‖^2 ≤ Φ(z_i),

where {z_i} is the sequence generated by our algorithm and γ_1 is a positive constant. This result shows that the function value is nonincreasing. With the boundedness assumption on the sequence, we can then show that

γ_2 ‖ω_i‖ ≤ ‖z_i − z_{i+1}‖,  ω_i ∈ ∂Φ(z_i),  ∀ i = 0, 1, 2, ...,

where γ_2 is another positive constant. The conclusions then follow from the above results.
After the first part, the global convergence to a critical point has not yet been established, since the sequence itself has not been shown to be a Cauchy sequence. We then prove the Kurdyka-Lojasiewicz (KL) property of Φ, which does not depend on the structure of the specific algorithm. Since H(U, V, L, C) is a real analytic function (it is differentiable everywhere and polynomial in its variables), H(U, V, L, C) has the KL property according to Section 2.2 of [39]. Note that J(L), Q(C), and K(C) are all indicator functions of semi-algebraic sets [38]. Thus, Φ satisfies the KL property. From [38], we can then conclude that {z_i} is a Cauchy sequence. All the properties above guarantee that our algorithm converges globally to a critical point of (9).

Appendix H: Additional auxiliary Lemmas
Lemma 5 [Corollary 6.1.2 in [42]] Consider a finite sequence {S_k} of independent random matrices with common dimension n_1 × n_2. Assume that each matrix has a uniformly bounded deviation from its mean. Then, for all ε > 0, the matrix Bernstein tail bound of [42] holds.