Low Complexity, Pairwise Layered Tabu Search for Large Scale MIMO Detection

This paper presents a low complexity pairwise layered tabu search based detection algorithm for a large-scale multiple-input multiple-output system. The proposed algorithm can compute two layers simultaneously and reduce the effective number of tabu searches. An efficient Gram matrix and matched filtered output update strategy is developed to reuse the computations from past visited layers. Also, a precomputation technique is adapted to reduce the redundancy in computation within tabu search iterations. Complexity analysis shows that the upper bound of initialization complexity in the proposed algorithm reduces from $O(N_t^4)$ to $O(N_t^3)$. The detection performance of the proposed detector is almost the same as the conventional complex version of LTS for 64QAM and 16QAM modulations. However, the proposed detector outperforms the conventional system for 4QAM modulation, especially in $16 \times 16$ and $8 \times 8$ MIMO.
Simulation results show that the percent of complexity reduction in the proposed method is approximately 75% for $64 \times 64$, 64QAM and 85% for $64 \times 64$, 16QAM systems to achieve a BER of $10^{-3}$. Moreover, we have proposed three layer-wise iteration allocation strategies that can further reduce the upper bound of complexity with minor degradation in detection performance.


Introduction
Large-scale MIMO systems at the base station can improve spectral efficiency [15,16,26]. However, the computational complexity of detection increases exponentially with the number of transmit antennas. The objective of MIMO detection is to attain near maximum likelihood (ML) performance. Sphere decoding (SD) [4] is considered to be the most efficient approach to attain near-ML performance. However, its computational complexity grows exponentially with the number of antennas and the constellation order, and thus SD is not suitable for the detection of large-scale MIMO systems [26]. Quasi-ML detectors such as k-best SD, the fixed complexity sphere decoder (FSD) and the imbalanced FSD (IFSD) have been proposed in the literature to reduce the computational complexity of SD at the cost of lower detection performance [3,13]. However, in large-MIMO detection their performance is not optimal [26] because only a limited, fixed number of search paths is visited. Therefore, achieving near-optimal detection performance with lower complexity is still a challenging task for large-scale MIMO systems. Various low complexity large-scale/massive MIMO detection algorithms have been proposed in the literature to address this issue [1,26]. In [24], one-symbol-update likelihood ascent search (1-LAS), and its improved versions sequential LAS (SLAS) and global LAS (GLAS) [20], are proposed for large MIMO systems. The performance of 1-LAS is not optimal in large MIMO detection, but its complexity is close to that of minimum mean square error (MMSE) detectors. SLAS and GLAS show optimal performance at the cost of significantly higher complexity [20].
The tabu search (TS) detector is based on local neighbourhood search starting from an initial solution [8,9,27]. It has been shown that the computational complexity of TS is far lower than that of SD or FSD for large MIMO systems [23]. In [22], reactive tabu search (RTS) was proposed; it can obtain near-ML performance for BPSK and 4QAM modulations. An FPGA-based tabu search is implemented in [25] for large MIMO detectors. However, RTS performance degrades for higher-order modulations like 16QAM and 64QAM. The random restart tabu search (R3TS) [6] was reported to improve the performance of RTS by running multiple RTS instances and choosing the best solution vector. The recently developed QR decomposition aided tabu search (QR-TS) [19] reduces the computational load of conventional TS by efficiently utilizing the structure of the upper triangular matrix obtained from QR decomposition. A modified QR-TS, namely neighbour-grouped TS (NG-TS) [18], is also proposed to reduce the complexity. A deep learning aided tabu search is proposed in [17] for large-MIMO detection. However, for higher-order modulations like 64QAM in an 8 × 8 MIMO system, the complexity of QR-TS and NG-TS is significantly high, making them unfit for larger antenna systems like 32 × 32 or 64 × 64 MIMO with 64QAM modulation. The concept of tabu search has also been extended to non-orthogonal multiple access (NOMA) based systems [10] and under-determined large-MIMO systems [7]. Layered tabu search (LTS) was proposed to further improve the performance of RTS by layer-wise detection and is suitable for large antenna systems [5,21].
In [11], a modified LTS algorithm for a complex-valued channel matrix (LTS-C) was proposed to further reduce the complexity of conventional LTS. LTS-C achieves lower computational complexity because its effective number of detection layers is half that of the corresponding real-valued LTS detection algorithm. Each RTS performed in LTS/LTS-C requires an initialization phase and a searching phase. The initialization phase of RTS computes the Gram matrix and the matched filter output. Computing the Gram matrix and the matched filter output for multiple layers increases the arithmetic complexity significantly. The searching phase does not require matrix-matrix or matrix-vector multiplications, and mostly scalar operations are performed. Thus, the computational complexity of LTS/LTS-C is primarily dominated by the initialization step of each RTS. In this work, we present an efficient Gram matrix and matched filter output update strategy between layers to address the large initialization complexity of LTS. Our contributions in this work are summarized as follows:
• We propose a real-valued pairwise layered tabu search algorithm (PLTS), which utilizes the QR matrix efficiently in its computation and computes two layers simultaneously. Thus, the effective number of detection layers is the same as in LTS-C.
• An efficient update method for the Gram matrix and matched filtered output is proposed between successive layers of tabu searches to reduce the initialization complexity of each tabu search. The proposed update method reduces the overall RTS initialization complexity from $O(N_t^4)$ to $O(N_t^3)$.
• The complexity of each tabu search iteration is further optimized by using a precomputation technique.
• A layer-dependent iteration allocation strategy is proposed to reduce the number of iterations and the complexity. Also, the effect of receive antenna variation is studied.
Outline: The remainder of this paper is organized as follows: Sect. 2 briefly introduces the system model. Section 3 describes the neighbour definition and the concepts of the RTS and LTS algorithms. In Sect. 4, the proposed LTS and modified RTS algorithms are described. The complexity and BER performance of the proposed and existing algorithms are compared in Sect. 5. Finally, conclusions are drawn in Sect. 6.
Notation: Bold uppercase and lowercase letters represent matrices and vectors, respectively. The ith column vector of $\mathbf{H}$ is denoted by $\mathbf{H}_i$, and $H_{i,j}$ stands for the element in the ith row and jth column of $\mathbf{H}$. The real and imaginary parts of a complex number are represented by $\Re(\cdot)$ and $\Im(\cdot)$, respectively. $\lceil x \rceil$ represents the ceiling value of the number x.

System Model
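We consider a massive multi-user MIMO base station with $N_r$ receive antennas serving $N_t$ single-antenna users, where $N_t \le N_r$. The channel is assumed to be frequency-flat fading. The received signal vector is given by

$$\tilde{\mathbf{r}} = \tilde{\mathbf{H}}\tilde{\mathbf{x}} + \tilde{\mathbf{n}}, \qquad (1)$$

where $\tilde{\mathbf{r}} = [\tilde{r}_1, \tilde{r}_2, \ldots, \tilde{r}_{N_r}]^T \in \mathbb{C}^{N_r \times 1}$ is the received signal vector and $\tilde{\mathbf{x}} = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{N_t}]^T$ is the transmitted vector. The transmitted symbols $\tilde{x}_i$ are drawn from a complex QAM constellation set $\tilde{\mathbb{A}}$. Thus the transmitted symbol vectors are selected from the set $\tilde{\mathbb{A}}^{N_t}$ consisting of $M^{N_t}$ vectors, i.e. $\tilde{\mathbf{x}} \in \tilde{\mathbb{A}}^{N_t \times 1}$, where M is the number of elements in the constellation set $\tilde{\mathbb{A}}$. $\tilde{\mathbf{n}} = [\tilde{n}_1, \tilde{n}_2, \ldots, \tilde{n}_{N_r}]^T$ is an $N_r \times 1$ vector of additive white Gaussian noise (AWGN) samples, whose entries are independent and identically distributed (i.i.d.). $\tilde{\mathbf{H}} \in \mathbb{C}^{N_r \times N_t}$ is the channel matrix, whose (i, j)th element $\tilde{H}_{i,j}$ is the complex channel gain between the jth transmit antenna and the ith receive antenna. The channel state information (CSI) and synchronization are assumed to be perfect at the base station (BS).

The complex system model can be converted to a real system model through real-value decomposition. In this work we adopt the orthogonal version of real-value decomposition (ORVD) [2,14] to represent the received signal as

$$\mathbf{r} = \mathbf{H}\mathbf{x} + \mathbf{n}, \qquad (2)$$

where $\mathbf{r} \in \mathbb{R}^{2N_r \times 1}$ and $\mathbf{n} \in \mathbb{R}^{2N_r \times 1}$. The elements of $\mathbf{r}$ are defined as $r_{2i-1} = \Re(\tilde{r}_i)$ and $r_{2i} = \Im(\tilde{r}_i)$, with $\mathbf{x}$ and $\mathbf{n}$ formed from $\tilde{\mathbf{x}}$ and $\tilde{\mathbf{n}}$ in the same way, and the elements of $\mathbf{H}$ are [14]

$$H_{2i-1,2j-1} = H_{2i,2j} = \Re(\tilde{H}_{i,j}), \qquad H_{2i,2j-1} = -H_{2i-1,2j} = \Im(\tilde{H}_{i,j}),$$

where $i = 1, 2, \ldots, N_r$ and $j = 1, 2, \ldots, N_t$. QR decomposition can be applied to decompose the real-valued channel matrix as $\mathbf{H} = \mathbf{Q}\mathbf{R}$, where $\mathbf{Q} \in \mathbb{R}^{2N_r \times 2N_t}$ has orthonormal columns and $\mathbf{R} \in \mathbb{R}^{2N_t \times 2N_t}$ is an upper triangular matrix. Multiplying both sides of (2) by $\mathbf{Q}^T$, the system model can be represented as

$$\mathbf{y} = \mathbf{R}\mathbf{x} + \mathbf{v},$$

where $\mathbf{y} = \mathbf{Q}^T\mathbf{r}$ and $\mathbf{v} = \mathbf{Q}^T\mathbf{n}$. The statistics of $\mathbf{v}$ and $\mathbf{n}$ are the same because the columns of $\mathbf{Q}$ are orthonormal.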
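The matrix $\mathbf{R}$ obtained from the QR decomposition of the above ORVD channel matrix $\mathbf{H}$ exhibits the following properties, which will be used in later sections to reduce the number of arithmetic operations [14]:

$$R_{2i-1,2j-1} = R_{2i,2j}, \qquad R_{2i-1,2j} = -R_{2i,2j-1},$$

where $i = 1, 2, \ldots, N_t$ and $j = i, i+1, \ldots, N_t$. The maximum likelihood solution can be expressed as

$$\mathbf{x}_{ML} = \arg\min_{\mathbf{x} \in \mathbb{A}^{2N_t}} \phi(\mathbf{x}), \qquad (8)$$

where $\mathbb{A}$ is the real-valued constellation set and $\phi(\mathbf{x}) = \mathbf{x}^T\mathbf{H}^T\mathbf{H}\mathbf{x} - 2\mathbf{r}^T\mathbf{H}\mathbf{x}$ is the cost metric of the solution vector $\mathbf{x}$. The computational complexity of (8) is exponential in $N_t$, and detection in a large-scale MIMO system requires a low complexity solution. Hence, in this work, we have developed a low complexity solution using layered tabu search.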
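The ORVD mapping and the pairwise structure of $\mathbf{R}$ can be checked numerically. The following is a minimal sketch, assuming the interleaved ordering $r_{2i-1} = \Re(\tilde{r}_i)$, $r_{2i} = \Im(\tilde{r}_i)$ and a QR factorization normalized so that the diagonal of $\mathbf{R}$ is positive. The function names `orvd`, `interleave` and `qr_positive` are illustrative and do not come from the paper.

```python
import numpy as np

def orvd(H_c):
    """Orthogonal real-value decomposition: each complex entry a+jb becomes
    the 2x2 block [[a, -b], [b, a]] (interleaved real/imaginary ordering)."""
    n_r, n_t = H_c.shape
    H = np.zeros((2 * n_r, 2 * n_t))
    H[0::2, 0::2] = H_c.real
    H[0::2, 1::2] = -H_c.imag
    H[1::2, 0::2] = H_c.imag
    H[1::2, 1::2] = H_c.real
    return H

def interleave(v_c):
    """Map a complex vector to [Re v1, Im v1, Re v2, Im v2, ...]."""
    out = np.empty(2 * v_c.size)
    out[0::2] = v_c.real
    out[1::2] = v_c.imag
    return out

def qr_positive(H):
    """Thin QR with the diagonal of R forced positive
    (NumPy does not guarantee the sign of the diagonal)."""
    Q, R = np.linalg.qr(H)
    s = np.sign(np.diag(R))
    s[s == 0] = 1.0
    return Q * s, s[:, None] * R

rng = np.random.default_rng(0)
n_r, n_t = 6, 4
H_c = rng.standard_normal((n_r, n_t)) + 1j * rng.standard_normal((n_r, n_t))
x_c = rng.choice([-3, -1, 1, 3], n_t) + 1j * rng.choice([-3, -1, 1, 3], n_t)

H = orvd(H_c)
Q, R = qr_positive(H)

# The real model reproduces the complex one (noise-free here).
model_ok = np.allclose(H @ interleave(x_c), interleave(H_c @ x_c))

# Pairwise structure of R: equal "diagonal" entries and opposite
# "off-diagonal" entries inside every 2x2 block.
same_diag = np.allclose(R[0::2, 0::2], R[1::2, 1::2])
anti_off = np.allclose(R[0::2, 1::2], -R[1::2, 0::2])
print(model_ok, same_diag, anti_off)
```

The sign normalization matters: without it, a library QR routine may flip the sign of individual rows of $\mathbf{R}$, which hides the pairwise structure even though the factorization is still valid.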

Neighbour Definition
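The neighbours of a symbol $a_q$ are defined as the set $N_G(a_q) \subseteq \mathbb{A} \setminus a_q$. The maximum and minimum numbers of elements in $N_G(a_q)$ are $|\mathbb{A}| - 1$ and 1, respectively. Let $\mathbf{x}^{(m)} = [x_1^{(m)}, x_2^{(m)}, \ldots, x_{2N_t}^{(m)}]^T$ be the solution vector corresponding to the mth iteration, where $x_i^{(m)} \in \mathbb{A}$. Then the (u, v)th neighbour vector of $\mathbf{x}^{(m)}$ is defined as [23]

$$\mathbf{z}^{(m)}(u, v) = [z_1^{(m)}(u, v), z_2^{(m)}(u, v), \ldots, z_{2N_t}^{(m)}(u, v)]^T, \qquad (9)$$

where $u = 1, 2, \ldots, 2N_t$ is the position in which the neighbour vector $\mathbf{z}^{(m)}(u, v)$ and $\mathbf{x}^{(m)}$ differ, and $v = 1, 2, \ldots, N$ is the neighbour index. The elements of $\mathbf{z}^{(m)}(u, v)$ are selected as

$$z_i^{(m)}(u, v) = \begin{cases} x_i^{(m)}, & i \ne u \\ w_v(x_u^{(m)}), & i = u \end{cases} \qquad (10)$$

where $w_v(a_q)$ denotes the vth element of $N_G(a_q)$. To illustrate this concept, consider $N_t = 2$, 16-QAM modulation and $N = 2$ neighbours per symbol. The term $w_v(x_u^{(m)})$ in (10) then takes one of the two neighbour values of the symbol in position u. Starting from an initial solution vector $\mathbf{x}^{(m)}$ at the mth iteration, (9) generates $2N_t N = 8$ neighbour vectors, each of which differs from $\mathbf{x}^{(m)}$ in exactly one position.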
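The neighbour definition above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's listing: the symbol neighbourhood is taken to be the `n_nbr` nearest constellation points (ties broken towards the smaller value), the real alphabet for 16-QAM is assumed to be $\{-3, -1, 1, 3\}$, and the vector `x` is an arbitrary example rather than the one used in the paper.

```python
def symbol_neighbours(alphabet, n_nbr):
    """Map each symbol to its n_nbr nearest other symbols
    (assumed neighbourhood rule; ties go to the smaller value)."""
    table = {}
    for a in alphabet:
        others = sorted((b for b in alphabet if b != a),
                        key=lambda b: (abs(b - a), b))
        table[a] = others[:n_nbr]
    return table

def neighbour_vectors(x, table):
    """Yield (u, v, z) with 1-based u and v: z equals x except that
    position u holds the v-th neighbour of x[u]."""
    for u, xu in enumerate(x):
        for v, w in enumerate(table[xu]):
            z = list(x)
            z[u] = w
            yield u + 1, v + 1, z

PAM4 = [-3, -1, 1, 3]                 # real alphabet of 16-QAM (assumed)
table = symbol_neighbours(PAM4, n_nbr=2)
x = [-3, 1, 3, -1]                    # illustrative vector, N_t = 2
nbrs = list(neighbour_vectors(x, table))

for u, v, z in nbrs:
    print(u, v, z)
```

With $N_t = 2$ and $N = 2$ the generator produces $2 N_t N = 8$ vectors of four elements each, which is the storage that the neighbour-difference representation discussed later avoids.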

Reactive Tabu Search
The algorithm starts from the initial vector $\mathbf{x}^{(0)}$, sets $\mathbf{g}^{(0)} = \mathbf{x}^{(0)}$, and searches the neighbourhood for up to $i_M$ iterations. The cost of the initial solution is computed as in [23], where

$$\mathbf{W} = \mathbf{R}^T\mathbf{R}, \qquad (13)$$

$$\mathbf{y}_{MF} = \mathbf{R}^T\mathbf{y}. \qquad (14)$$

The steps followed in the mth iteration are given below:
Step 1 Compute the cost $\phi(\mathbf{z}^{(m)}(u, v))$ of every (u, v)th neighbour.
Step 2 Find $(u', v')$ such that

$$(u', v') = \arg\min_{u,v} \phi(\mathbf{z}^{(m)}(u, v)). \qquad (19)$$

Check the acceptance of the move $(u', v')$ using the tabu matrix. If the move is accepted, choose it as the initial solution of the next iteration, i.e. $\mathbf{x}^{(m+1)} = \mathbf{z}^{(m)}(u', v')$. If the move is not accepted, evaluate (19) again, excluding the moves already rejected in the same iteration.
Step 3 Update the entries of the tabu matrix and compute $\mathbf{f}^{(m+1)}$ for the next iteration.

Stopping condition
The number of iterations can be reduced by using a stopping criterion that terminates the search once the algorithm is judged to have reached the ML solution [23].
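The search loop described above can be illustrated with a deliberately simplified tabu search. The sketch below is not the RTS of [22,23]: it uses a fixed tabu period, has no reactive period adaptation and no stopping criterion, and assumes a nearest-symbol neighbourhood. It also assumes the cost $\phi(\mathbf{x}) = \mathbf{x}^T\mathbf{W}\mathbf{x} - 2\mathbf{y}_{MF}^T\mathbf{x}$ and the auxiliary vector $\mathbf{f} = \mathbf{W}\mathbf{x} - \mathbf{y}_{MF}$, from which the cost change of a single-symbol move follows as $e^2 W_{u,u} + 2 e f_u$. All function and parameter names are hypothetical.

```python
import numpy as np

def tabu_search(W, y_mf, x0, alphabet, n_nbr=2, max_iter=50, tabu_period=3):
    """Simplified tabu search minimising phi(x) = x^T W x - 2 y_mf^T x."""
    alphabet = sorted(float(a) for a in alphabet)
    nbr = {}
    for a in alphabet:  # n_nbr nearest other symbols (assumed neighbourhood)
        others = sorted((b for b in alphabet if b != a),
                        key=lambda b: (abs(b - a), b))
        nbr[a] = others[:n_nbr]

    x = np.array(x0, dtype=float)
    f = W @ x - y_mf                        # f = W x - y_MF
    cost = float(x @ W @ x - 2 * y_mf @ x)
    best_x, best_cost = x.copy(), cost
    tabu = {}                               # (position, symbol) -> iterations left

    for _ in range(max_iter):
        moves = []
        for u in range(x.size):
            for w in nbr[float(x[u])]:
                e = w - x[u]
                delta = e * e * W[u, u] + 2 * e * f[u]   # phi(z) - phi(x)
                moves.append((delta, u, w))
        moves.sort()
        chosen = None
        for delta, u, w in moves:
            improves_best = cost + delta < best_cost     # aspiration rule
            if improves_best or tabu.get((u, w), 0) == 0:
                chosen = (delta, u, w)
                break
        tabu = {k: t - 1 for k, t in tabu.items() if t > 1}  # age the list
        if chosen is None:
            continue
        delta, u, w = chosen
        old = float(x[u])
        f += (w - old) * W[:, u]            # incremental update of f
        x[u] = w
        cost += delta
        tabu[(u, old)] = tabu_period        # forbid moving straight back
        if cost < best_cost:
            best_x, best_cost = x.copy(), cost
    return best_x, float(best_cost)

rng = np.random.default_rng(1)
R = np.triu(rng.standard_normal((6, 6))) + 3 * np.eye(6)
x_true = rng.choice([-3.0, -1.0, 1.0, 3.0], 6)
W, y_mf = R.T @ R, R.T @ (R @ x_true)
x0 = np.full(6, 1.0)
best_x, best_cost = tabu_search(W, y_mf, x0, [-3, -1, 1, 3])
print(best_x, best_cost)
```

Because the best-so-far vector is tracked separately from the current vector, the search may accept cost-increasing moves to escape a local minimum without ever returning a solution worse than the starting point.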

Layered Tabu Search (LTS)
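This section briefly presents the concept of LTS in the complex domain [11,21]. Let $\bar{\mathbf{x}} = \mathbf{H}^{\dagger}\mathbf{r}$, where $\mathbf{H}^{\dagger}$ is the pseudo-inverse of $\mathbf{H}$, and let $\mathbf{R}$ be obtained from the QR decomposition of $\mathbf{H}$. Let $\tilde{\mathbf{x}}$ be obtained by rounding each entry of $\bar{\mathbf{x}}$ to the nearest symbol in $\tilde{\mathbb{A}}$, and let $\hat{\mathbf{x}}$ be the solution vector, which is determined in $N_t$ steps and initialized with $\hat{x}_{N_t} = \tilde{x}_{N_t}$. Then at the kth layer, $k = N_t, (N_t - 1), \ldots, 1$, the following steps are performed:
Step 1 Compute the interference-cancelled symbol $r_k$ using the already detected symbols, where $\hat{x}_l$ is the lth symbol of $\hat{\mathbf{x}}$ and $\hat{\mathbf{x}}$ is obtained from the previous layer.
Step 2 Find the symbol in $\tilde{\mathbb{A}}$ closest to $r_k$ and denote it by $\tilde{a}_q$. If the distance between $r_k$ and $\tilde{a}_q$ is below the threshold [21], $\tilde{a}_q$ is accepted as the decision for the kth layer; otherwise RTS is performed at this layer.
The output of the RTS is the updated partial solution $\hat{\mathbf{x}} = [\hat{x}_k, \hat{x}_{k+1}, \ldots, \hat{x}_{N_t}]$. Set $k = k - 1$ and go back to Step 1.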
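The layer loop above can be sketched as follows. This is a hedged illustration: the interference cancellation formula $r_k = (y_k - \sum_{l>k} R_{k,l}\hat{x}_l)/R_{k,k}$ is an assumption based on the usual successive-cancellation form for an upper triangular system, and the sketch only records the layers where the threshold test fails instead of running RTS there. The name `layered_decisions` and the 4-QAM alphabet are illustrative.

```python
import numpy as np

def layered_decisions(R, y, alphabet, threshold):
    """Detect from the last layer to the first. Returns the hard decisions
    and the 1-based layers whose distance test failed, i.e. where a full
    LTS detector would invoke RTS (this sketch does not)."""
    n = R.shape[1]
    alphabet = np.asarray(alphabet)
    x_hat = np.zeros(n, dtype=alphabet.dtype)
    needs_rts = []
    for k in range(n - 1, -1, -1):
        # Cancel the interference of the symbols detected so far.
        r_k = (y[k] - R[k, k + 1:] @ x_hat[k + 1:]) / R[k, k]
        q = np.argmin(np.abs(alphabet - r_k))
        x_hat[k] = alphabet[q]
        if np.abs(r_k - alphabet[q]) > threshold:
            needs_rts.append(k + 1)
    return x_hat, needs_rts

rng = np.random.default_rng(2)
qam4 = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j])
n_t = 4
R = np.triu(rng.standard_normal((n_t, n_t)) + 1j * rng.standard_normal((n_t, n_t)))
R[np.diag_indices(n_t)] = 2.0 + np.abs(np.diag(R))   # real, positive diagonal
x = rng.choice(qam4, n_t)
y = R @ x                                            # noise-free observation

d_min = 2.0                                          # minimum distance of qam4
x_hat, needs_rts = layered_decisions(R, y, qam4, threshold=d_min / 5)
print(x_hat, needs_rts)
```

In the noise-free case every interference-cancelled symbol coincides with a constellation point, so no layer is flagged; with noise, the flagged layers are the ones that drive the complexity of the detector.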

Proposed Layered Tabu Search
The dimension of the matrices doubles in the real equivalent of a complex-valued system, which doubles the number of layers in LTS. Thus, the chance of invoking RTS and the number of operations increase significantly [11]. The proposed LTS also uses a real-valued system model, but it computes a pair of layers simultaneously. Thus, the proposed PLTS algorithm detects the symbols in $N_t$ steps, where the kth step is termed the 'kth level', and at the kth level the computations of the $(2k-1)$th and $2k$th layers are performed.

Neighbour Difference Generation and Precomputation
In this section, we assume that RTS must be performed at the kth level of detection. The condition for RTS at the kth level of detection is given in line 10 of Algorithm 1. The dimension of the neighbour vectors for each RTS depends on the level of detection: at the kth level, the position index runs over $u = 1, 2, \ldots, 2(N_t - k + 1)$. We consider the same initial vector at the mth iteration as in the earlier example and show the resulting neighbour vector differences. Using (25), we can write $e_{1,1} = 2$, $e_{1,2} = 4$, $e_{2,1} = -2$, $e_{2,2} = 2$, $e_{3,1} = -2$, $e_{3,2} = -4$, $e_{4,1} = -2$, $e_{4,2} = 2$. Therefore, the resultant difference matrix is

$$\begin{bmatrix} 2 & 4 \\ -2 & 2 \\ -2 & -4 \\ -2 & 2 \end{bmatrix}.$$

In this example, only eight elements need to be stored instead of the 32 elements in the previous case. Also, the proposed method eliminates the subtractions used in the conventional method as given in (16).
It can be observed from (18) that in each iteration of RTS, $e_{u,v}^2 W_{u,u}$ must be computed for all $2N(N_t - k + 1)$ neighbours. That is, over m iterations, $e_{u,v}^2 W_{u,u}$ must be computed $2mN(N_t - k + 1)$ times. To reduce this redundancy, we propose precomputation of $e_{u,v}^2 W_{u,u}$. Thus, at the initialization step of RTS, all the possible values of $e_{u,v}^2 W_{u,u}$ are precomputed, where $u = 1, 2, \ldots, 2(N_t - k + 1)$ and $v = 1, 2, \ldots, N$. Therefore, (18) can be modified so that the $e_{u,v}^2 W_{u,u}$ values are read from the table precomputed at the initialization step of each tabu search rather than recomputed in every iteration. Therefore, (18) and (27)
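The effect of the precomputation can be illustrated with a small sketch. It assumes the neighbour cost change $C_{u,v} = e_{u,v}^2 W_{u,u} + 2 e_{u,v} f_u$ with $\mathbf{f} = \mathbf{W}\mathbf{x} - \mathbf{y}_{MF}$, which follows from expanding $\phi(\mathbf{x}) = \mathbf{x}^T\mathbf{W}\mathbf{x} - 2\mathbf{y}_{MF}^T\mathbf{x}$; the exact arrangement of terms in (18) may differ. The table layout (one column per possible magnitude $|e|$) and all names are assumptions made for this example. The difference matrix `E` is the one from the example above, and `x` is one vector consistent with it under a nearest-neighbour rule.

```python
import numpy as np

def neighbour_costs_direct(W, f, E):
    """C[u, v] = e_uv^2 * W_uu + 2 * e_uv * f_u, all products formed on the spot."""
    return E**2 * np.diag(W)[:, None] + 2 * E * f[:, None]

def precompute_table(W, magnitudes):
    """T[u, i] = magnitudes[i]^2 * W_uu, computed once per tabu search."""
    return np.diag(W)[:, None] * np.square(magnitudes)[None, :]

def neighbour_costs_precomputed(T, magnitudes, f, E):
    """Same costs, with the quadratic term looked up instead of multiplied."""
    idx = np.searchsorted(magnitudes, np.abs(E))   # column holding each |e_uv|
    quad = np.take_along_axis(T, idx, axis=1)
    return quad + 2 * E * f[:, None]

rng = np.random.default_rng(3)
R = np.triu(rng.standard_normal((4, 4))) + 2 * np.eye(4)
W = R.T @ R
x = np.array([-3.0, 1.0, 3.0, -1.0])               # illustrative vector
y_mf = R.T @ (R @ x + 0.1 * rng.standard_normal(4))
f = W @ x - y_mf

E = np.array([[2, 4], [-2, 2], [-2, -4], [-2, 2]], dtype=float)
mags = np.array([2.0, 4.0, 6.0])                   # possible |e| for 4-PAM

T = precompute_table(W, mags)                      # done once, at initialization
C_direct = neighbour_costs_direct(W, f, E)
C_pre = neighbour_costs_precomputed(T, mags, f, E)
print(np.allclose(C_direct, C_pre))
```

The table `T` depends only on the diagonal of $\mathbf{W}$, so it stays valid for every iteration of the same tabu search, while $\mathbf{f}$ changes after each accepted move.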

Efficient Gram Matrix and Matched Filter Output Computation
The earlier section discussed the precomputation technique to reduce redundant computation. In conventional LTS, $\mathbf{W}$ in (13) and $\mathbf{y}_{MF}$ in (14) are computed at the initialization step of every RTS, which increases the number of arithmetic operations significantly. To reduce the number of operations, we propose two methods: (i) use the symmetry properties of $\mathbf{W}$ to reduce the number of arithmetic operations and (ii) update $\mathbf{W}$ and $\mathbf{y}_{MF}$ from the values obtained in the previous RTS. The matrix $\mathbf{W} = \mathbf{R}^T\mathbf{R}$ exhibits three properties. Property 1 is the symmetry $W_{i,j} = W_{j,i}$, which holds because $\mathbf{W}$ is a Gram matrix, so no proof is given. Properties 2 and 3 express the pairwise structure that $\mathbf{W}$ inherits from $\mathbf{R}$; their proofs are given in the "Appendix". Using these properties, only $N_t^2$ elements need to be calculated out of the $4N_t^2$ elements in $\mathbf{W}$. The (i, j)th element of $\mathbf{W}$ can be computed as

$$W_{i,j} = \sum_{l=1}^{i} R_{l,i} R_{l,j}, \qquad j \ge i. \qquad (30)$$

Thus, $W_{i,j}$ can be accumulated one pair of layers at a time. Let us define the partial computation of $W_{i,j}$ at the kth level as in (31), which allows $W^{(k)}_{i,j}$ to be obtained with the help of computations from past layers. In the proposed LTS, the upper triangular matrix at the kth level, $\mathbf{R}^{(k)}$, is the lower-right $2(N_t - k + 1) \times 2(N_t - k + 1)$ block of $\mathbf{R}$. As $\mathbf{R}$ shows pairwise symmetry [14], $\mathbf{R}^{(k)}$ has the same properties as $\mathbf{R}$. Also, all the properties defined for $\mathbf{W}$ apply to $\mathbf{W}^{(k)}$, so only one fourth of its elements need to be computed. Direct computation of $\mathbf{W}^{(k)}$ at every level increases the number of arithmetic operations significantly. Thus, a method is developed to update the elements of $\mathbf{W}^{(k)}$ using the computations from past layers. Let the present level of search be k and let the last level where RTS was performed be $(k + t)$, i.e. $\mathbf{W}^{(k+t)}$ is already available. The proposed update method is defined as follows:
(i) The first step is to initialize $\mathbf{W}^{(k)}$ with the elements of $\mathbf{W}^{(k+t)}$. Here, $(k + t) \le N_t$ and $\mathbf{W}^{(N_t+1)} = 0$. The other elements in $\mathbf{W}^{(k)}$ are set to zero.
(ii) The elements of $\mathbf{W}^{(k)}$ are then updated using (31), where $i = 2, 4, \ldots, 2(N_t - k + 1)$ and $j = i, i + 1, \ldots, 2(N_t - k + 1)$, and the rest of the elements in $\mathbf{W}^{(k)}$ are determined using (28) and (29).

Similar to the computation of $\mathbf{W}$, the matched filter output $\mathbf{y}_{MF}$ can be computed and updated level-wise. Because $\mathbf{R}$ is upper triangular, the ith element of $\mathbf{y}_{MF}$ in (14) involves only the first i rows of $\mathbf{R}$. Computing the $2i$th and $(2i - 1)$th elements of $\mathbf{y}_{MF}$ as in (36) and (37) is therefore equivalent to i levels of operations, where at each level a pair of layers is considered. In the proposed method, the matched filter output at the kth level, denoted by $\mathbf{y}^{(k)}_{MF}$, is given by (38) for $i \le 2(N_t - k + 1)$. Direct computation of (38) increases the number of arithmetic operations. Let us assume that $\mathbf{y}^{(k+t)}_{MF}$ is available, i.e. the last computation was done at the $(k + t)$th level and $(k + t) \le N_t$. Thus, t levels of computation are required for the update. The proposed low complexity update method is as follows: (i) initialize $\mathbf{y}^{(k)}_{MF}$ with the elements of $\mathbf{y}^{(k+t)}_{MF}$ and set the rest of the elements in $\mathbf{y}^{(k)}_{MF}$ to zero; (ii) using (36) and (37), update the ith element with the contributions of the t new levels.
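The idea behind the level-wise update can be checked with a block-matrix sketch. It assumes that $\mathbf{R}^{(k)}$ is the lower-right block of $\mathbf{R}$ starting at row and column $2k - 1$, so that $\mathbf{R}^{(k)}$ consists of $2t$ new rows stacked on top of $[\mathbf{0}\ \ \mathbf{R}^{(k+t)}]$. The sketch adds the contribution of the new rows to the stored $\mathbf{W}^{(k+t)}$ and $\mathbf{y}^{(k+t)}_{MF}$; for brevity it does not exploit the pairwise symmetry that lets the proposed method compute only one fourth of the elements. The names `sub_R` and `update_gram` are hypothetical.

```python
import numpy as np

def sub_R(R, k):
    """R^(k): lower-right block covering layers 2k-1 .. 2*N_t (k is 1-based)."""
    s = 2 * (k - 1)
    return R[s:, s:]

def update_gram(R, y, k, t, W_prev, ymf_prev):
    """Build W^(k) and y_MF^(k) from W^(k+t) and y_MF^(k+t) without
    recomputing the products that were already formed at level k+t."""
    n_t = R.shape[0] // 2
    s, m = 2 * (k - 1), 2 * t          # first row of level k, number of new rows
    top = R[s:s + m, s:]               # the 2t new rows of R^(k)
    y_top = y[s:s + m]
    size = 2 * (n_t - k + 1)
    W = np.zeros((size, size))
    y_mf = np.zeros(size)
    # (i) initialise with the results of the previous RTS level
    W[m:, m:] = W_prev
    y_mf[m:] = ymf_prev
    # (ii) add the contribution of the new rows only
    W += top.T @ top
    y_mf += top.T @ y_top
    return W, y_mf

rng = np.random.default_rng(4)
n_t = 4
R = np.triu(rng.standard_normal((2 * n_t, 2 * n_t)))
y = rng.standard_normal(2 * n_t)

# RTS at level 3 (nothing stored yet), then at level 1 (level 2 was skipped).
W3, ymf3 = update_gram(R, y, k=3, t=2, W_prev=np.zeros((0, 0)), ymf_prev=np.zeros(0))
W1, ymf1 = update_gram(R, y, k=1, t=2, W_prev=W3, ymf_prev=ymf3)

R3 = sub_R(R, 3)
print(np.allclose(W3, R3.T @ R3), np.allclose(W1, R.T @ R))
```

Each row of $\mathbf{R}$ enters exactly one update, regardless of how many levels run RTS, which is why the worst-case initialization cost no longer grows with the number of tabu searches.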

Layer-Dependent Iteration Allocation Strategies
The choice of the number of RTS iterations is important for both performance and complexity. A large number of iterations can improve the detection performance significantly but increase the computational complexity [21].
To reduce the effective number of iterations, we propose layer-dependent iteration allocation strategies. In LTS, the size of the partial solution vector grows as the number of layers increases, and at the last layer we get the complete solution vector. Symbol correction in upper layers is more critical than in lower layers. Thus, in the proposed strategies, we assign a lower number of RTS iterations to the lower layers. Let the minimum and maximum limits of RTS iterations be $itr_{min}$ and $itr_{max}$, respectively. Then at the kth level of the proposed algorithm ($k = N_t, N_t - 1, \ldots, 1$), the allowed number of RTS iterations, denoted by $itr_i$, can be chosen using any of the strategies given below. Here, $i = N_t - k + 1$.

Strategy 1 This is a linearly increasing iteration allocation strategy and is expressed as
Strategy 2 This is an exponentially increasing iteration allocation strategy and is defined as
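To make the two allocation ideas concrete, the sketch below interpolates between $itr_{min}$ and $itr_{max}$ over $i = 1, \ldots, N_t$. The specific formulas are assumptions chosen for illustration (a linear ramp and a geometric ramp, both rounded up); they are not the expressions of the proposed strategies, and the function names are hypothetical.

```python
import math

def linear_allocation(n_t, itr_min, itr_max):
    """Assumed form: itr_i grows linearly from itr_min (i = 1) to itr_max (i = n_t)."""
    if n_t == 1:
        return [itr_max]
    step = (itr_max - itr_min) / (n_t - 1)
    return [min(itr_max, math.ceil(itr_min + (i - 1) * step))
            for i in range(1, n_t + 1)]

def exponential_allocation(n_t, itr_min, itr_max):
    """Assumed form: itr_i grows geometrically from itr_min (i = 1) to itr_max (i = n_t)."""
    if n_t == 1:
        return [itr_max]
    ratio = (itr_max / itr_min) ** (1 / (n_t - 1))
    return [min(itr_max, math.ceil(itr_min * ratio ** (i - 1)))
            for i in range(1, n_t + 1)]

lin = linear_allocation(32, 120, 200)
exp = exponential_allocation(32, 120, 200)
print(lin[:4], lin[-1])
print(exp[:4], exp[-1])
print(sum(lin), sum(exp), 32 * 200)   # iteration budgets vs. a fixed itr_max
```

The sum of the allocated iterations gives a quick upper bound on the search effort when every level runs RTS, which can be compared with the fixed budget $N_t \cdot itr_{max}$.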

Modified Low Complexity RTS Algorithm
Algorithm 2 describes the modified low complexity RTS algorithm. First, precomputation is performed (line 6) to reuse the values over multiple iterations, and the cost factors $C_{u,v}$ are computed for all neighbours (line 11). In step 2, acceptance of the move $(u', v')$ is checked using its cost and tabu value (lines 14 to 18). If the move is not accepted, the algorithm goes back to line 12, excludes the previous $(u', v')$ move from the search and obtains a new $(u', v')$. If all the moves are rejected, the non-zero entries of the tabu matrix are decreased by its minimum non-zero value and the algorithm goes back to step 2. After a move is accepted, its repetition is checked by comparing its cost with the costs of the moves selected in previous iterations. If a repetition is found, the repetition length $l_{rep}$, the tabu period P and the repetition count $rep_{count}$ are updated (line 28).
Next, the cost of the accepted move $(u', v')$ is compared with the minimum cost $\phi(\mathbf{g}^{(m)})$. If the cost of the accepted move is higher, the $(u', v')$ move is added to the tabu list and its tabu value is assigned. Otherwise, $\phi(\mathbf{g}^{(m+1)})$ is updated with the cost of the $(u', v')$ move. The variables $q'$, $q''$ and $v''$ (lines 36 and 40) are defined with $a_{q'}, a_{q''} \in \mathbb{A}$. Here, $l_{flag} = 1$ indicates that the algorithm has reached a local minimum. Step 4 of the algorithm defines the termination condition. Finally, in step 5, the entries of the tabu matrix are updated, and $\mathbf{f}^{(m+1)}$ is computed for the next iteration if the termination condition is not met.

Complexity Analysis
In this section, we examine the amount of complexity reduction that can be achieved by the proposed method. From (30) it can be observed that the computation of $W_{i,j}$ requires i multiplications and $(i - 1)$ additions, where $i = 2, 4, \ldots, 2N_t$ and $j = i, i + 1, \ldots, 2N_t$. Thus, the total numbers of additions and multiplications required at the ith layer are $(i - 1)(2N_t - i + 1)$ and $i(2N_t - i + 1)$, respectively. The maximum number of operations is required when tabu search is performed at the last level, corresponding to $k = 1$, and it is independent of the number of tabu search levels. This is because the proposed metric update method avoids all the redundant computations. Now, consider $i = 2l$, where $l = 1, 2, \ldots, N_t$. Thus, the maximum numbers of multiplications and additions required for computing all $\mathbf{W}^{(k)}$, $k = N_t, N_t - 1, \ldots, 1$, can be expressed as

$$\sum_{l=1}^{N_t} 2l\,(2N_t - 2l + 1) = \frac{N_t(N_t + 1)(2N_t + 1)}{3}, \qquad \sum_{l=1}^{N_t} (2l - 1)(2N_t - 2l + 1) = \frac{N_t(N_t + 1)(2N_t + 1)}{3} - N_t^2,$$

respectively, both of which are $O(N_t^3)$.
Similarly, we can find the number of arithmetic operations for all $\mathbf{y}^{(k)}_{MF}$ in the proposed update method defined in (39). In this case, the number of arithmetic operations is equal to the number of operations required for the multiplication of the upper triangular $\mathbf{R} \in \mathbb{R}^{2N_t \times 2N_t}$ and $\mathbf{y} \in \mathbb{R}^{2N_t \times 1}$. Thus, the total numbers of additions and multiplications required are $N_t(2N_t - 1)$ and $N_t(2N_t + 1)$, respectively.
On the other hand, in the conventional approach, $\mathbf{W}^{(k)}$ in (13) and $\mathbf{y}^{(k)}_{MF}$ in (14) are computed at the initialization phase of each RTS, and no past layer results are utilized. Let $n = N_t - k + 1$ be the number of layers associated with the kth level; as $k = N_t, N_t - 1, \ldots, 1$, we have $n = 1, 2, \ldots, N_t$. The total numbers of additions and multiplications at the kth level are $4n^2(2n - 1)$ and $8n^3$ for $\mathbf{W}^{(k)}$, and $2n^2$ and $n(2n + 1)$ for $\mathbf{y}^{(k)}_{MF}$, respectively. The maximum number of operations is required when tabu search occurs at all the layers. Thus, the maximum numbers of additions and multiplications required to compute all $\mathbf{W}^{(k)}$ can be expressed as

$$\sum_{n=1}^{N_t} 4n^2(2n - 1) \quad \text{and} \quad \sum_{n=1}^{N_t} 8n^3 = 2N_t^2(N_t + 1)^2,$$

respectively. Similarly, the maximum numbers of additions and multiplications to compute all $\mathbf{y}^{(k)}_{MF}$ can be expressed as

$$\sum_{n=1}^{N_t} 2n^2 \quad \text{and} \quad \sum_{n=1}^{N_t} n(2n + 1),$$

respectively. Thus, in the proposed update method, the complexity is reduced from $O(N_t^4)$ to $O(N_t^3)$. Table 1 shows the total additions and multiplications required by both algorithms for RTS initialization in the worst-case scenario.
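The worst-case operation counts quoted above are easy to tabulate. The short script below evaluates the per-layer and per-level counts stated in the text for the Gram matrix only; the function names are illustrative, and the closed forms in the comments follow from the standard sums of $l$, $l^2$ and $n^3$.

```python
def proposed_gram_ops(n_t):
    """Worst-case (multiplications, additions) for all W^(k) with the update
    method: row i = 2l costs i*(2*n_t - i + 1) multiplications and
    (i - 1)*(2*n_t - i + 1) additions."""
    mults = sum(2 * l * (2 * n_t - 2 * l + 1) for l in range(1, n_t + 1))
    adds = sum((2 * l - 1) * (2 * n_t - 2 * l + 1) for l in range(1, n_t + 1))
    return mults, adds

def conventional_gram_ops(n_t):
    """Worst-case counts when W^(k) is recomputed at every level n = 1..n_t:
    8*n^3 multiplications and 4*n^2*(2n - 1) additions per level."""
    mults = sum(8 * n**3 for n in range(1, n_t + 1))
    adds = sum(4 * n * n * (2 * n - 1) for n in range(1, n_t + 1))
    return mults, adds

# Closed forms: proposed mults = n(n+1)(2n+1)/3, conventional mults = 2 n^2 (n+1)^2
for n_t in (8, 16, 32, 64):
    p_mult, _ = proposed_gram_ops(n_t)
    c_mult, _ = conventional_gram_ops(n_t)
    print(n_t, p_mult, c_mult, round(c_mult / p_mult, 1))
```

The ratio of the two multiplication counts grows roughly in proportion to $N_t$, which is the practical meaning of the reduction from $O(N_t^4)$ to $O(N_t^3)$.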

Complexity Reduction in Precomputation
In the conventional approach, $e_{u,v}^2 W_{u,u}$ must be computed at every iteration, as given in (18). Thus, for each iteration of RTS performed at the kth level, $2Nn$ multiplications are required. Suppose a total of $i_n$ iterations are performed at the nth level; then a total of $2 i_n n N$ multiplications are required. Thus, precomputation reduces the number of multiplications needed to compute $e_{u,v}^2 W_{u,u}$ at the kth level by a factor of $2 i_n$. It should be noted that $i_n$ is not the same for every level: it depends on the termination condition, and its maximum value is defined at the initialization step. If we assume that tabu search is performed at all levels and the number of iterations for each RTS is fixed to $i_M$, then the total number of multiplications in the conventional method is $\sum_{n=1}^{N_t} 2 i_M n N = i_M N N_t (N_t + 1)$.
Thus, precomputation can reduce the number of these multiplications in conventional LTS by up to $2 i_M$ times.

Performance Evaluation
This section evaluates the bit error rate (BER) performance and complexity of the proposed system in comparison with the conventional complex version of the LTS algorithm. The following parameters are used for both the conventional LTS and the proposed PLTS algorithms: $rep_{max} = 10$, $itr_{max} = 20$, = 10, $N = 1$ for 4-QAM; $rep_{max} = 10$, $itr_{max} = 100$, = 100 and $N = 3$ for 16-QAM; and $rep_{max} = 20$, $itr_{max} = 200$, = 200, $N = 2$ for 64-QAM, with $P_0 = 1$. The threshold value is set to $d_{min}/5$ for our simulation. For a fair comparison, we used sorted QR decomposition (SQRD) [12] based channel matrix ordering for both detection algorithms in the simulations. The proposed system entirely uses a real-valued equivalent system; thus, real-valued SQRD is used for it. The complex model is assumed for conventional LTS, so complex SQRD is used for it.
systems use complex arithmetic operations for interference cancellation, and RTS is performed in a real equivalent system. It can be observed that the lower limit of each detector's arithmetic complexity converges towards a fixed value. For a specific antenna configuration, these lower limits are close to each other. This happens because at higher SNR most RTS operations are skipped, and arithmetic operations are required only for interference cancellation. On the other hand, the upper limit of arithmetic complexity depends on the antenna configuration and modulation order; i.e., as the number of antennas increases, more RTS operations must be performed at low SNR. Thus, when most of the layers are required to perform RTS, the

In Figs. 3 and 4, we notice that the average number of real arithmetic operations depends on the signal-to-noise ratio for both LTS and PLTS. We can identify three regions in each plot: the low-SNR, mid-SNR and high-SNR regions. In the low-SNR region, the average number of real arithmetic operations is highest and almost constant. The reason is that in the low-SNR region the noise magnitude is high, and the distance between the interference-cancelled symbol and its nearest constellation point is above the threshold in most cases. This is the worst-case scenario, where most layers perform the RTS operation. We can also notice that the efficient Gram matrix and matched filter output update and the precomputation technique in the proposed method help to reduce the upper limit of complexity in PLTS.
In the mid-SNR region, the required operations decrease with SNR increase. As the SNR value increases, more RTS operations are skipped, reducing the average number of real arithmetic operations.
We can see that the average number of real arithmetic operations is almost constant in the high-SNR region. The reason is that most of the RTS operations are skipped. Thus, the operations shown in Figs. 3 and 4 in the high-SNR region are due to interference cancellation. We notice that the proposed method requires a much lower number of operations at higher SNR than conventional LTS. This is due to the different method of generating the initial solution vector. In the conventional case, the pseudo-inverse of $\mathbf{H}$ is used to generate the initial solution vector, $\mathbf{x} = \mathbf{H}^{\dagger}\mathbf{r}$. However, the proposed LTS uses $\mathbf{x} = \mathbf{R}^{-1}\mathbf{y}$. As $\mathbf{R}$ is an upper triangular matrix, backward substitution can be used to compute $\mathbf{R}^{-1}\mathbf{y}$, which has much lower complexity than the pseudo-inverse, and that is why the complexity of PLTS at higher SNR is lower than that of LTS. Figure 5 shows the corresponding percentage of complexity reduction achieved by the proposed system. It can be observed that the percentage of complexity reduction is larger for larger antenna systems. Also, the percentage of complexity reduction increases with SNR and saturates at higher SNR, except for 4-QAM systems. Table 2 shows the percentage of complexity reduction and the corresponding SNR values in the proposed system to achieve a BER of $10^{-3}$. The percentage of complexity reduction of the proposed algorithm depends on the number of transmit antennas and the modulation order. For a particular modulation, the complexity reduction is larger for a larger antenna system. Also, the lower-order modulation schemes can achieve more complexity reduction. The reduction of complexity is achieved mainly due to the efficient computation of matrices and the precomputation technique.
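The initial solution $\mathbf{x} = \mathbf{R}^{-1}\mathbf{y}$ mentioned above does not require forming $\mathbf{R}^{-1}$ explicitly. The following is a minimal sketch of backward substitution followed by rounding to the nearest constellation point; the 4-PAM alphabet, the noise level and the function name `back_substitute` are assumptions for illustration.

```python
import numpy as np

def back_substitute(R, y):
    """Solve R x = y for an upper triangular R, last unknown first.
    Costs O(n^2) operations instead of the O(n^3) of a general inverse."""
    n = R.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

rng = np.random.default_rng(5)
n = 6
R = np.triu(rng.standard_normal((n, n)))
R[np.diag_indices(n)] = 3.0 + np.abs(np.diag(R))   # keep the system well conditioned
pam = np.array([-3.0, -1.0, 1.0, 3.0])             # assumed real alphabet
x_true = rng.choice(pam, n)
y = R @ x_true + 0.01 * rng.standard_normal(n)

x_zf = back_substitute(R, y)
# Round every entry to the nearest constellation point.
x_init = pam[np.argmin(np.abs(x_zf[:, None] - pam[None, :]), axis=1)]
print(x_zf, x_init)
```

A library routine for triangular systems could replace the explicit loop; the loop is kept here to show that each unknown needs only one short inner product and one division.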
If we consider conventional LTS in the real domain instead of the complex domain, the complexity reduction would be larger. Table 3 presents a comparison of the complexity in terms of per-symbol complexity (PSC), which is defined as the average number of real arithmetic operations per symbol ($\times 10^3$) required to achieve a specific BER (in our case, $10^{-3}$). The PSC value of both algorithms reduces with the decrease in the number of transmit antennas, and the proposed PLTS algorithm outperforms the LTS algorithm in all cases. It can be observed that for the 64 × 64 MIMO system, the PSC value of LTS increases with a decrease in constellation size. This happens because more layers are required to perform RTS, and the initialization step of RTS primarily dominates the complexity. However, the initialization complexity of the proposed algorithm is not affected by the number of layers visited. Thus, its PSC value is primarily dominated by neighbourhood search operations. Figure 6 shows the effect of varying $itr_{min}$ on the BER performance and the average number of operations. We have considered strategy 1 for this simulation. The BER performance degrades slightly with the decrease in $itr_{min}$. However, the average number of arithmetic operations also decreases with the decrease in $itr_{min}$. Although the floor value of complexity converges to a fixed value at higher SNR, the transition slope of the plot is almost the same for all the iteration numbers. The upper bound of arithmetic complexity reduces with $itr_{min}$, as shown in Fig. 6b. A performance loss of 0.5 dB at a BER of $10^{-3}$ can be seen in Fig. 6a when $itr_{min}$ is reduced from 200 to 120. Figure 7 shows the impact of $itr_{min}$ on the BER performance and the average number of real arithmetic operations for the different iteration strategies. We have considered a 32 × 32, 64QAM system for this simulation. Three different SNR values are considered to understand the BER-complexity trade-off of the different strategies.
At 20 dB SNR, we notice that the detection performance is almost independent of $itr_{min}$ and the iteration strategy. Figure 8 shows the performance of the proposed detector with an increasing number of receive antennas for a fixed number of transmit antennas, $N_t = 32$. The BER performance of the detector improves significantly with the increase in the number of receive antennas, and the complexity plot shows an increase in the slope of the transition with the number of receive antennas. It is worth mentioning that conventional LTS also shows a similar performance improvement with an increase in the number of receive antennas. The upper and lower bounds of the arithmetic complexity are not significantly affected by small variations in the number of receive antennas.

Conclusion
We have proposed a low-complexity pairwise layered tabu search method for large-scale MIMO systems in this work. The proposed algorithm utilizes a metric update method between successive layers of tabu search to reduce the initialization complexity of RTS. A precomputation technique is developed to further reduce the redundant computation within RTS. The proposed PLTS algorithm shows significant complexity reduction with almost the same detection performance. In addition, level-dependent iteration allocation can further reduce the upper bound of arithmetic complexity with almost identical BER performance. Moreover, an increase in the number of receive antennas yields a significant performance improvement without much effect on the upper and lower bounds of complexity, making the detector suitable for massive MIMO systems with a transmit-to-receive antenna ratio close to one.