Low-Complexity Joint Weighted Neumann Series and Gauss–Seidel Soft-Output Detection for Massive MIMO Systems

We introduce a joint weighted Neumann series (WNS) and Gauss–Seidel (GS) approach to implement an approximated linear minimum mean-squared error (LMMSE) detector for uplink massive multiple-input multiple-output (M-MIMO) systems. We first propose to initialize the GS iteration by a WNS method, which produces a closer-to-LMMSE initial solution than the conventional zero vector and diagonal-matrix based scheme. Then the GS algorithm is applied to implement an approximated LMMSE detection iteratively. Furthermore, based on the WNS, we devise a low-complexity approximate log-likelihood ratios (LLRs) computation method whose performance loss is negligible compared with the exact method. Numerical results illustrate that the proposed joint WNS-GS approach outperforms the conventional method and achieves near-LMMSE performance with significantly lower computational complexity.


Introduction
Massive MIMO (M-MIMO) systems can significantly improve the link reliability and spectral efficiency compared to the small-scale MIMO systems. The theoretically predicted gains of the M-MIMO rely on optimal multi-user signal separation at the receiver. In the uplink channel, the optimality of the maximum ratio combining (MRC) is guaranteed theoretically for M-MIMO systems with an infinite number of BS antennas. However, systems with realistic antenna configurations are far from the infinite-antenna limit, e.g., 64 Basestation antenna, which renders the MRC detector not attractive in practical applications. The advent of the Internet of Things (IoT) will foresee a large number of active users in future systems. Therefore it is highly desirable to design low-complexity high-performance detectors for practical "not-so-massive systems" with a low base station (BS)-to-user-antenna ratio (BUAR). The 1 3 maximum-likelihood (ML) detector can achieve optimal performance, however, its complexity increases exponentially with the number of users, which makes it unaffordable for M-MIMO systems with large number of users. The sphere decoding [1] can achieve near-optimal performance with reduced average complexity. Nevertheless, its worst-case complexity is still prohibitive when the dimension of the MIMO system is large and/or the modulation order is high. Furthermore, the variable complexity of the SD makes it challenging for practical hardware implementation. The linear minimum mean-squared error (LMMSE) detector has been widely utilized in multiple communication applications due to its good performance-complexity tradeoff. The LMMSE detector is shown to achieve near-optimal error-rate performance for M-MIMO systems with a large BUAR [2]. However, the associated matrix-inversion entails high computational complexity for practical implementation with a large number of users. To address the complexity issue, the Richardson method has been proposed in [3], but its performance is highly sensitive to the step-size, thus rendering it less attractive for practical implementation. By exploiting the channel hardening property of the M-MIMO channel, the Neumann series (NS) approximation approach [4] is proposed to convert the matrix inversion into a series of matrix-vector multiplications to alleviate the implementation burden of the matrix inversion. The performance of the NS method approaches that of the LMMSE for a M-MIMO system with a large BUAR. However, the performance of the NS-based approach degrades significantly if the BUAR is not large enough, e.g., less than four [5]. Dai et al. [6] proposed a Gauss-Seidel (GS) method to implement an approximated LMMSE detection. The GS-based approach [6] exhibits a slow convergence rate, especially for the system with a low BUAR.
In this work, we propose a joint weighted Neumann series (WNS) and Gauss-Seidel approach to implement the LMMSE detection without matrix inversion. In the proposed scheme, a two-term NS is first utilized to generate an initial solution for the subsequent GSbased iterative detection. To further improve the quality of the initial solution, a weighted approach is introduced to compensate for the approximation error introduced by the limited expansion term of the NS. The weighting coefficients are determined empirically based on an off-line basis. The GS iterative algorithm then exploits the promising initial solution to achieve a faster convergence rate. We also propose a low-complexity LLR computation based on the WNS with only negligible performance degradation. Numerical results illustrate that the proposed method outperforms the GS method [6] and approaches the performance of the LMMSE detector with significantly lower computational complexity.
The rest of this work is organized as follows. Section 2 describes the system model. The proposed joint WNS and GS method is detailed in Sect. 3. The performance and complexity of the proposed method is compared with the conventional ones in Sect. 4. The concluding remarks are provided in Sect. 5.
Notation: Vectors and matrices are denoted by boldface lowercase and uppercase letters, respectively. The transpose, complex conjugate, and conjugate transpose are represented by (⋅) T , (⋅) * , and (⋅) H , respectively. Tr(⋅) denotes the trace of a matrix. The notation ‖ ⋅ ‖ stands for the Euclidean norm for vectors. [x] represents the expectation of variable x.

System Model
We consider a coded M-MIMO orthogonal frequency division multiplexing (OFDM) system, where the BS is equipped with N R antennas and serves N T single-antenna users simultaneously. The transmitted bit streams from N T users are separately encoded by a channel encoder, and then mapped to a sequence of energy-normalized complex-valued quadrature-amplitude-modulated constellation points. To ease presentation, we only consider one single subcarrier. The input-output relation between the transmitted and received signals on subcarrier l is expressed as where (l) ∈ C N R ×1 represents the received vector on subcarrier l, (l) ∈ C N R ×N T denotes the spatially multiplexed complex channel matrix, ∈ C N T ×1 is the transmitted symbol vector, and (l) ∈ C N R ×1 is the complex additive white Gaussian noise vector with zero mean and variance 2 per complex entry. When there is no ambiguity, we will suppress the superscript l in the remainder of this manuscript. The BUAR is defined as = The a posteriori LLR of bit k for the i-th user can be obtained by the by the max-log approximation as follows: where i = 2 i ∕ 2 i denotes the post-equalization signal-to-interference-and-noise-ratio (SINR), X (0) k and X (1) k are the sets containing the symbols from the modulation constellation X with the k-th bit being 0 and 1, respectively.
The LMMSE algorithm requires matrix inversion −1 to obtain the LMMSE estimate ̂ [cf. (2)], the effective channel gain = −1 , and the NPI variance to calculate the approximate LLRs for soft-input channel decoding. The computation of −1 requires O(N 3 T ) number of operations [7], which results in prohibitive complexity for a M-MIMO system with a large number of users.

Signal Detection Based on the Joint WNS and GS Method
Since the LMMSE filtering matrix W is Hermitian positive, we can decompose W as (1) (l) = (l) + (l) , where , , and H are the diagonal, the strictly lower triangular, and upper triangular parts of , respectively.
Based on the GS method [6], the transmitted signal vector s is estimated as follows: where t denotes the iteration number, and (0) represents the initial solution. In general, if there is no a priori information about the final solution, the initial solution (0) in (6) is normally set to be a zero-vector. The initial solution (0) would impact the convergence rate greatly and affect both complexity and accuracy of the final solution, in particular for complexity constrained practical implementation. An initial solution, which is close to the LMMSE estimate ̂ , can lead to a faster convergence than the iterative method which is started at zero-vector. However, to find a good initial (0) is as difficult as determining the LMMSE estimate ̂ .
To reduce the computational burden of the matrix inversion, the NS expansion [8] is proposed to implement −1 as follows where Θ is an arbitrary matrix satisfying the condition lim n→∞ (I N T − Θ −1 ) n → N T . We can decompose in (5) into its main diagonal D and off-diagonal B such that = + . Since is a diagonally dominant matrix, we can choose Θ = to save computational complexity. By keeping only the first L terms of the Neumann series, we obtain the L-term approximation given by: The number of expansion terms, L, can be used as a tuning parameter to trade-off between complexity and accuracy. For L = 1 , the NS-based approach degenerates into the matchedfilter detector. For L = 2 , the inverse of is approximated by which requires O(N 2 T ) number of operations and is significantly lower than O(N 3 T ) operations required by an exact inversion approach.
For L > 3 , the NS-based approach entails computational complexity O(N 3 T ) which is comparable to or even higher than that of the exact matrix inversion [4]. For a system with a small BUAR, a relatively large number expansion term L (e.g., 4) is usually required to achieve a satisfactory performance [6]. Therefore, the NS-based approach with L > 3 may not be a cost-effective option for LMMSE detection [6]. Nonetheless, it is reasonable to assume that an NS-based approach can provide a promising searching direction even for a limited number of expansion terms. Based on this observation, we propose to utilize the two-term NS-based approach of (9) to obtain the initial solution given by:  Since the NS with a limited number of expansion terms will inevitably introduce approximation errors, we can introduce a weighted Neumann series (WNS) to mitigate it as follows: where Δ = [ 0 , … , L−1 ] T are the weights to provide a better approximation to the exact matrix inversion. The objective is to minimize the approximation error (AE) between the exact matrix inversion −1 and the approximate matrix inversion based on the WNS of (11) given by where ‖ ⋅ ‖ F is the Frobenius norm.
Since and ̂ −1 L are Hermitian, we can constrain to be a real-valued number without loss of optimality. For a 2-term WNS, we can utilize a numerical method to obtain optimized real-valued 0 and 1 . We will elaborate the determination of the optimal 0 and 1 later in Sect. 4.1.

LLR Computation Based on the Proposed WNS
Since we utilize the 2-term WNS to approximate the exact inverse −1 , the approximate equivalent channel matrix is given by Therefore, the approximate effective channel gain of user i is expressed as where (̂ −1 2 ) ii denotes the i-th diagonal entry of ̂ −1 2 . The effective noise variance percieved from user i is given by ̂i =̂i(1 −̂i) . Based on (4) and (14), we can efficiently obtain the approximate LLR.

Computational Complexity Analysis
Since W and the matched-filter output MF are all required by the conventional LMMSE algorithm and the proposed joint WNS-GS method, we focus on the complexity of LLR computation. We assume that one complex-valued multiplication can be realized by four real-valued multiplications and two real-valued additions. Since the complexity of the additions is much simpler than the multiplications, the evaluation is based on the required number of real-valued multiplications.
(1) Computation of −1 2 ∶ Since the multiplication of a N T × N T real-valued diagonal matrix and a N T × N T complex-valued matrix requires 2N 2 T real-valued multiplications, the computation of −1 2 requires 2(2N 2 T − 2N T ) real-valued multiplications based on (9). Since the −1 2 is Hermitian, the total number of real-valued multiplications is (2) Computation of (0) ∶ The matrix-vector multiplication of (10) entails 4N 2 T real-valued multiplications.
(3) Computation of ( ) : Based on (6), the solution ( ) can be expressed as represent the i-th element of (t) , (t−1) , and MF , respectively. Based on (15), the computation of s (t) i requires 4N T real-valued multiplications. Therefore, the ( ) with dimension N T × 1 requires t4N 2 T real-valued multiplications. To sum up, the overall required number of real-valued multiplications of the proposed joint WNS-GS algorithm with iteration t is 4(t + 3∕2)N 2 T . Figure 1 compares the complexity of the NS-based method algorithm [4] and the proposed joint GS-NSE, whereby the LMMSE algorithm based on the Cholesky decomposition is also included as a baseline for comparison. Figure 1 illustrates that the NSbased method exhibits lower computational complexity than the LMMSE algorithm based on the Cholesky decomposition for t ≤ 3 . The NS-based algorithm loses the complexity advantage over the exact matrix inversion for t > 4 as illustrated in Fig. 1. Figure 1 shows that the computational complexity of the proposed joint WNS-GS is similar to that of the conventional GS method proposed in [6]. This is explained as follows: The computational overhead of the LLR computation of the GS method [6] is on par with that of determination of the initial solution (0) [c.f. (10)]. The proposed method requires more effort to obtain the initial solution [c.f. (10)] than that of the diagonal-approximation-based approach [6], and the LLR can be obtained as a by-product [c.f. (10) and (14)], thus, entailing comparable computational complexity. Fig. 1 Complexity comparison of the GS, NSE, LMMSE and the proposed joint WNS-GS method As explained in Sect. 3.1, the proposed method exploits the NS's exploration capability to generate a closer-to-LMMSE estimation as the initial solution, therefore, significantly enhancing the subsequent GS's convergence rate. This will be verified later in Sect. 4.2.

Numerical Results
In this section, we evaluate the block error rate (BLER) performance of the proposed joint WNS-GS method and compare with the conventional NS-based approach [6] and the GS method [4]. A turbo-coded OFDM multiplexing system with 2048-point fast Fourier transform (FFT) and 15-kHz subcarrier spacing is considered in this work. The channel model is the extended vehicular (EVA) model with a maximum Doppler frequency of 10 Hz, which corresponds approximately to a mobile velocity of 5.4 km/h for operation in the 2.0 GHz band. Binary information data bits with a length of 4264 are turbo encoded by a mother code of rate R c = 1/3 (with generator polynomials [13, 15] 8 ) and then punctured to code rate R c = 5/6 as specified in the 3GPP standard [9]. Max-Log-MAP decoding with four decoding iterations was applied in the channel decoder. The modulation scheme of 64 QAM is applied in all simulations.

Determination of ˛0 and ˛1
It is apparent that the determination of optimal 0 and 1 is related to the BUAR, SNR, and specific channel statistics. Since the closed-form solution of the 0 and 1 is difficult to treat analytically, we use a numerical method to obtain the empirically optimized 0 and 1 . We first study the relationship between the AE with respect to 0 and 1 . For 64 × 24 M-MIMO, Fig. 2 illustrates that the AE reaches the minimum when 0 = 1.05 and 1 = 0.55 . The BLER results (not shown due to space limits) illustrate that the proposed method with 0 = 1.05 and 1 = 0.55 achieves the best performance, which validates the observation of the AE. For simplicity, we may choose 0 = 1.0 and 1 = 0.5 for low overload (e.g., < 0.25 ), 0 = 1.05 and 1 = 0.55 for intermediate overload (e.g., 0.25 <= < 0.375 ), and 0 = 1.1 and 1 = 0.6 for high overload (e.g., 0.375 <= < 0.5 ). Overall, the performance degradation due to the mismatch of the optimized weighting factor values for various user loads is marginal based on extensive numerical results.

Remark 1
The results provided in this work are not intended to be exhaustive but rather to serve as a guiding principle. In all cases tested, the proposed joint WNS-GS method exhibits great resilience to the mismatch of the SNR, and/or the BUAR, and/or the channel model, which renders the proposed joint WNS-GS method attractive for implementation in practical systems.

BLER Comparison
We first consider the performance of the proposed approximate LLR computation method. Fig. 3 illustrates that the gap between the proposed approximate LLR computation and the exact method is only about 0.05 dB at BLER = 0.2 for t = 3 . The gap becomes indistinguishable diminish for t > 3 . Therefore, the proposed approximated LLR computation method can be a low-complexity alternative to the complex exact LLR computation method.
We then compare the proposed method with conventional GS and NS methods [6] in Fig. 4. For iteration t = 4 , Fig. 4 illustrates that the proposed joint WNS-GS method achieves a performance gain of approximately 1.2 dB at BLER = 10 −1 compared to the GS method [6] for the 64 × 24 MIMO (the conventional zero initial solution based results were not depicted since its performance is worse than the diagonal-based one [6]). The NS-based approach does not even converge, i.e., BLER = 1, as depicted in Fig. 4. The gap between the conventional GS method with t = 4 and the proposed one with t = 3 is negligible. This translates to reduced latency, which is appealing for delaystringent applications. Even more prominent performance gain is observed for a higher loading 64 × 32 MIMO. Fig. 4 illustrates that the proposed method achieves close-to-LMMSE with a gap of less than 0.2 dB at BLER = 10 −2 with t = 4.

Conclusion
In this paper, we proposed a joint WNS-GS approach to implement the LMMSE detection for uplink M-MIMO systems. The proposed method exploits the NS's exploration capability to generate a promising initial solution for a quick start of the subsequent GS's iterative detection. A weighted approach is incorporated into the NS to further refine the initial solution, thus significantly enhancing the convergence rate of the GS method. We also proposed to compute approximate LLR based on the WNS approach with negligible performance degradation. Numerical results illustrate that the proposed joint WNS-GS approach outperforms the conventional GS method with comparable computational complexity. The proposed joint WNS-GS method achieves near-LMMSE performance with significantly reduced computational complexity. The extension of the proposed joint WNS-GS paradigm to other signal processing problems involving high-order matrix inversion, such as downlink precoding for M-MIMO systems is straightforward.
Yuquan Luo is a master student at the School of Computer and Communication Engineering, University of Science and Technology Beijing, China. His research interests include massive multiple-input multiple-output (MIMO), random access scheme design and non-orthogonal multiple access.
Hua Li received the bachelor's degree from Henan University of science and technology, China. He is currently pursuing his Ph.D. degree at the School of Computer and Communication Engineering, University of Science and Technology Beijing, China. His research interests inclue massive multiple-input multiple-output (MIMO), non-orthogonal multiple access (NOMA) and reconfigurable intelligent surface (RIS).