A Modified Shuffling Method to Reduce Decoding Complexity of QC-LDPC Codes


 Layered decoding (LD) of Low-Density Parity-Check (LDPC) codes is a decoding schedule that facilitates partially parallel architectures for performing Belief Propagation (BP)-based iterative algorithms. It offers reduced implementation complexity and memory overhead compared to fully parallel architectures, and higher convergence speed than both serial and parallel architectures. In this paper, we introduce a modified form of shuffling of the Parity-Check Matrices (PCMs) of Quasi-Cyclic LDPC (QC-LDPC) codes, which is essentially an interleaving operation on the rows of the PCM. Like the conventional shuffling method, the modified method results in a PCM in which each layer can be produced by circulating the layer above it one symbol to the right. In addition, it guarantees that the weight of every column in each layer is either zero or one. We then show that, owing to these two properties, the number of occupied Look-Up Tables (LUTs) on a Field Programmable Gate Array (FPGA) is reduced by about 93% and the consumed on-chip power by nearly 80%. Moreover, shuffling does not degrade Bit Error Rate (BER) performance compared with the non-shuffled case. Decoding throughput is not sacrificed for low SNR values, and its degradation is negligible down to a BER of $10^{-6}$.


Introduction
With the advent of next-generation wireless networks, the straightforward adoption of many conventional transmission techniques and protocols may no longer be viable, since these networks face more stringent constraints in terms of throughput, latency, reliability, and energy efficiency. Forward-Error Correction (FEC) methods are one of the vital elements of such networks, used to provide the demanded level of reliability in the transmission of information. Nevertheless, powerful FEC techniques like LDPC, Turbo, or polar codes, known as capacity-achieving codes, bring about higher complexity and power consumption compared with traditional coding techniques. Among these codes, LDPC codes have so far been incorporated into several previous technologies, and they are still seen as a strong candidate for new standards like Fifth Generation (5G) and IEEE 802.11be.
The distinctive characteristics of LDPC codes such as their low-density PCM and Tanner graph lacking short cycles have facilitated the use of BP as the main decoding method for LDPC codes. As a matter of fact, their promising error-correcting performance occurs only if they are decoded with a BP algorithm. However, an iterative decoding algorithm based on BP is inherently costly and complex, and its use is not straightforward when next-generation wireless communication systems with tight constraints on throughput, latency and energy efficiency are in scope.
Hence, the design of reduced-complexity BP-based decoding algorithms has been a focal point of research.
In a nutshell, the BP algorithm is an attempt to refine the initial estimates of the codeword bits given by the soft-decision sequence received at the output of the demodulator. Toward this goal, and dealing with the PCM from the viewpoint of its Tanner graph, the BP algorithm employs a set of messages representing the probabilities that a given symbol in a received codeword is either a one or a zero [1]. These messages are successively passed between the nodes of the Tanner graph until they produce a sequence satisfying the parity-check equations.
Over the years, many researchers have been trying to modify the BP algorithm in different respects. Numerous works address, for instance, the computation of the reliability messages [2][3][4][5][6][7][8][9][10][11][12][13]. Several other works investigate the scheduling of the algorithm [14][15][16], which determines in what order the reliability messages are exchanged between the nodes of the Tanner graph. The decoding schedule is generally associated with the implementation architecture of the decoding method. The flood schedule [17], for example, facilitates a fully parallel architecture in which all the Variable Nodes (VNs) and Check Nodes (CNs) in the Tanner graph pass messages concurrently to their neighbors in every iteration of the algorithm. Although it achieves high throughput, this schedule demands as many functional units in hardware as there are CNs and VNs, thus requiring a large silicon area with high interconnect complexity [18]. In a serial architecture, in contrast, a smaller number of functional units are re-used several times to perform each decoder iteration. In this way, decoding complexity is lowered, although at the price of reduced decoding throughput.
A partially parallel decoding architecture is a good trade-off between hardware complexity and decoding throughput, and it is best accomplished by the Layered Decoding (LD) schedule [19][20][21][22][23][24][25][26][27][28][29][30][31][32][33][34][35][36][37]. In this schedule, the rows of the PCM are divided into several layers, and each iteration of the BP algorithm is likewise split into several sub-iterations, each running over one layer of the PCM. During each sub-iteration, reliability messages are exchanged between the CNs of that layer and their neighbor VNs, and, ultimately, the updated reliability messages are handed to the next layer.
Accordingly, in each sub-iteration only a subset of the CNs, i.e., as many as the number of rows in each layer, participate in the decoding process. The corresponding functional units in hardware can thus be re-used for the CNs of different layers, with the layers processed successively from the top of the PCM to the bottom. As a result, LD requires fewer hardware resources than the flood schedule. Furthermore, the LD schedule achieves better convergence than the flood schedule, because the latest Variable-To-Check (VTC) messages are always used to update the Check-To-Variable (CTV) messages during a sub-iteration.
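To make the scheduling difference concrete, the following minimal Python sketch contrasts the two schedules. It is only an illustration of the control flow: `update_layer` is a placeholder of our own for the message computations detailed in Section 2, not a function from any library.

```python
def flood_iteration(H, llr, update_layer):
    # Flood schedule: every CN fires once per iteration, all of them
    # working on messages from the previous iteration.
    return update_layer(H, llr)

def layered_iteration(layers, llr, update_layer):
    # Layered schedule: one sub-iteration per layer; each layer receives
    # the LLRs freshly updated by the layer above it, so a single set of
    # functional units can be reused layer after layer.
    for H_i in layers:
        llr = update_layer(H_i, llr)
    return llr
```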
Since the introduction of LDPC codes, many different construction methods have been proposed. Among them, some generate a structured type of LDPC code, known as QC-LDPC codes, which possesses a cyclic property. By leveraging this cyclic property, their encoding and decoding can be considerably simplified, while offering performance comparable to random (or unstructured) LDPC codes [38,39]. To keep decoding complexity low, it is highly desirable that the number of ones in each column of each layer, i.e., the weight of the columns within each layer, be either one or zero when LD is adopted as the decoding schedule. QC-LDPC codes inherently have such a layered structure with this property on the column weights, and they are usually adopted as the FEC code when LD is the decoding method.
In an attempt to exploit the cyclic property of a QC-LDPC code to simplify its decoding, the authors of [27] introduced a novel shuffling method, which shuffles the rows of the PCM of a QC-LDPC code prior to decoding and gives it a new layered format. After this shuffling is applied, each layer can be produced by circulating the layer above it one symbol to the right. The authors then show that this cyclic property can be exploited to simplify LD and speed up the convergence rate. In particular, thanks to the cyclic property, it is enough to realize only the first layer of the PCM in hardware rather than the entire matrix.
The downside of this shuffling idea is that it is likely to destroy the original single-weight column property of the PCM, i.e., the shuffled PCM is unlikely to retain that property. In such cases, LD is no longer a recommended decoding schedule, and the conventional BP algorithm becomes the preferred choice.
In our previous work [40], we outlined a modified shuffling idea that extends the primary one. The modified shuffling method results in a shuffled PCM that has both the desired single-weight column property and the cyclic property. This is accomplished by introducing a set of offset values prior to performing the shuffling.
In this paper, the modified shuffling method of [40] is further investigated, the logic behind it is clearly expressed, and its benefits are extensively highlighted. In particular, [40] lacked any implementation results to verify the improvements promised by the modified shuffling method. Here, the results of implementing LD of several QC-LDPC codes shuffled with the proposed technique are provided and, for the sake of comparison, they are accompanied by the results for LD of the corresponding non-shuffled QC-LDPC codes. The results in the former case reveal improvements in terms of the number of occupied LUTs on the FPGA and also in power consumption. The target FPGA is a Xilinx Virtex-7, and the tested codes are four QC-LDPC codes from the IEEE 802.15.3c standard and one from IEEE 802.16e. It is shown that these improvements are achieved without sacrificing BER performance or decoding throughput for low values of $E_b/N_0$. Although the provided simulation results and analysis indicate that with shuffling the decoder needs more sub-iterations to achieve a given performance, we argue that this does not always translate into a lower decoding throughput. Implementation results reveal that the clock frequency with the shuffled PCM can be increased compared to the non-shuffled case, owing to the simplifications the shuffling method brings to the decoding process. Therefore, the slower convergence rate can be compensated to an extent by the faster clock.
This compensation is complete for low values of $E_b/N_0$, and the throughput degradation is negligible down to a BER of $10^{-6}$.
The organization of the paper is as follows. Section 2 presents the preliminaries, including the fundamentals of QC-LDPC codes and their LD. Section 3 is devoted to the modified shuffling method and its attributes. Implementation and simulation results, along with the necessary analysis, come in Section 4. Final conclusions are drawn in Section 5.

Preliminaries
Before delving into the main proposal of the paper, a short introduction to QC-LDPC codes and the LD schedule is necessary.

QC-LDPC Codes
A linear block code C is called an LDPC code if it is the null space of a PCM, denoted by H, with the following characteristics [41]: (I) the number of ones in each row and column is small compared with n (the length of the code) and J (the number of rows in H); (II) the number of ones in common between any two columns (or rows) is not greater than one. The latter condition is usually referred to as the Row-Column (RC) constraint in the literature [42].
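As a quick illustration, the RC constraint on the columns can be checked numerically: two distinct columns of H share at most one position where both are 1 exactly when every off-diagonal entry of the integer product H^T H is at most one. A small Python sketch, with a function name of our choosing:

```python
import numpy as np

def satisfies_rc(H):
    # overlap[i, j] counts the ones shared by columns i and j of H
    overlap = H.astype(int).T @ H.astype(int)
    np.fill_diagonal(overlap, 0)   # a column trivially overlaps itself
    return overlap.max() <= 1      # RC constraint: at most one shared 1
```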
The PCM of an LDPC code is usually visualized by means of a graph known as the Tanner graph. The Tanner graph of a PCM consists of two sets of nodes. One set represents CNs, i.e. the parity-check sums (or equations), which are, in fact, the rows of H, and the other set stands for VNs that are equivalent to columns of H.
A CN in the Tanner graph is connected to a VN if and only if the corresponding element of H is one.
When the PCM of an LDPC code is composed of Circulant Permutation Matrices (CPMs) and zero matrices, the resultant code is Quasi-Cyclic (QC) as well. A circulant is a matrix in which each row is a cyclically rightward-shifted copy of the row above it. If a circulant is also a permutation matrix, i.e., every row and column has weight one, then it is called a CPM. The PCM of a QC-LDPC code can be represented as

$$H_{qc} = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,t} \\ A_{2,1} & A_{2,2} & \cdots & A_{2,t} \\ \vdots & \vdots & \ddots & \vdots \\ A_{c,1} & A_{c,2} & \cdots & A_{c,t} \end{bmatrix}, \qquad (1)$$

in which $c$ and $t$ are two positive integers with $c \le t$, and the $A_{i,j}$ are either $b \times b$ CPMs or $b \times b$ zero matrices. In the resultant code, the codewords have a sectionized cyclic structure: cyclically shifting each of the $t$ sections of a codeword individually yields another valid codeword [42].
Another, more compact way of representing the PCM of a QC-LDPC code than (1) is known as the base matrix, denoted by W. Note that a CPM is in fact an identity matrix whose rows have been cyclically shifted some positions to the left or right. W is a matrix whose non-negative integer entries specify the shift value of the corresponding CPM with respect to the identity matrix; the remaining entries, usually chosen to be -1, represent zero matrices in the PCM. The transformation of a base matrix into its corresponding PCM is called dispersion. Fig. 1 shows the base matrices of the QC-LDPC codes used in the IEEE 802.15.3c standard, and Fig. 2 shows that of the rate-1/2 (2304,1152) QC-LDPC code used in IEEE 802.16e. In these two figures, empty positions mark the locations of zero matrices.
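To make the dispersion operation concrete, the following Python sketch expands a base matrix into its PCM. As a convention, we assume here that a non-negative entry s denotes the identity matrix cyclically shifted s positions to the right; the toy base matrix below is ours and is not taken from either standard.

```python
import numpy as np

def disperse(W, b):
    # Expand base matrix W into its PCM: entry s >= 0 becomes a b x b
    # identity cyclically shifted s positions to the right (our convention),
    # and entry -1 becomes a b x b zero matrix.
    c, t = W.shape
    H = np.zeros((c * b, t * b), dtype=np.uint8)
    I = np.eye(b, dtype=np.uint8)
    for i in range(c):
        for j in range(t):
            if W[i, j] >= 0:
                H[i*b:(i+1)*b, j*b:(j+1)*b] = np.roll(I, W[i, j], axis=1)
    return H

# Toy example (not from a standard): a 2 x 3 base matrix dispersed with b = 4
W = np.array([[0, 2, -1],
              [1, -1, 3]])
H = disperse(W, 4)   # an 8 x 12 binary PCM
```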

LD of QC-LDPC Codes
Due to the RC constraint, the Tanner graph of the PCM of an LDPC code is free of cycles of length 4. Decoding can therefore be formulated as the computation of marginal probabilities on a factor graph [43].
An efficient way to solve such problems is the BP algorithm, which is exact when the factor graph is free of cycles but approximate when it has cycles [43]. Nevertheless, BP-based decoding algorithms have become the major decoding method for LDPC codes, even though the Tanner graph of an LDPC code may contain a few short cycles.
In the BP algorithm, reliability messages are successively passed between VNs and CNs in the Tanner graph until a codeword that has the maximum probability given the received soft-decision sequence is found.
LD, as stated before, is a schedule that relies on the layered structure of the PCM and can be viewed as a way to realize a partially parallel architecture for the execution of the BP algorithm. In the LD schedule, each iteration is split into several sub-iterations, running over successive layers of the PCM. During each sub-iteration, reliability messages are exchanged between the CNs of that layer and their neighbor VNs, and, at the end, the updated reliability messages are handed down to the next layer. Accordingly, only a subset of the CNs and VNs participate in each sub-iteration, and the layers are processed successively from the top of the PCM to the bottom.
Let $y = (y_0, \ldots, y_{n-1})$ be the soft-decision sequence at the output of the channel that is to be decoded. Assuming that the PCM has been divided into $L$ layers, each containing $E$ consecutive rows, the successive layers of $H_{qc}$ are traversed by the decoding algorithm in order from top to bottom. Specifically, for the $i$-th layer $H_{qc}^{(i)}$, its support is defined as the set of all its neighbor VNs, i.e.,

$$\mathcal{N}_i = \{\, l : h_{j,l} = 1 \ \text{for some row}\ j\ \text{of}\ H_{qc}^{(i)} \,\}. \qquad (2)$$

Analogously, the support of the $l$-th VN within the $i$-th layer is defined as the set of all CNs inside that layer connected to that VN:

$$\mathcal{M}_i(l) = \{\, j \ \text{in layer}\ i : h_{j,l} = 1 \,\}. \qquad (3)$$

Each sub-iteration over layer $i$ then consists of the following steps:

1. VTC message computation: for every VN $l$ in the support of the layer and every CN $j \in \mathcal{M}_i(l)$,
$$m_{l \to j} = L_l - m_{j \to l}, \qquad (4)$$
where $L_l$ is the current APP value of VN $l$ and $m_{j \to l}$ is the CTV message stored the last time this CN was processed.

2. CTV message computation:
$$m_{j \to l} = \Big( \prod_{l' \in \mathcal{N}(j) \setminus \{l\}} \operatorname{sgn}(m_{l' \to j}) \Big) \cdot \min_{l' \in \mathcal{N}(j) \setminus \{l\}} \big| m_{l' \to j} \big|, \qquad (5)$$
where $\mathcal{N}(j)$ denotes the set of VNs connected to CN $j$. Here, without loss of generality, the min-sum scheme [5] has been used for computing the CTV messages, and $\operatorname{sgn}(x)$ is the sign function, equal to $1$ when $x \ge 0$ and $-1$ otherwise. It should, however, be noted that other proposed schemes, like the ones in [4,7,44-48], could instead be used for the computation of the CTV messages.

3. Hard decision and stopping criterion test: APP values for all the VNs are computed as
$$L_l = m_{l \to j} + m_{j \to l}, \qquad j \in \mathcal{M}_i(l), \qquad (6)$$
hard decisions are then made on the $L_l$, and decoding stops as soon as the resulting hard-decision sequence satisfies all the parity-check equations or a maximum number of iterations is reached.
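Putting steps 1-3 together, the following Python sketch is a minimal, unoptimized reference model of LD with the min-sum updates (4)-(6). All names are ours; it assumes nonzero channel LLRs (so the sign function never sees zero), CN degrees of at least two, and the convention that a negative APP maps to bit 1.

```python
import numpy as np

def layered_minsum(layers, llr_ch, max_iter=20):
    # layers: list of L binary matrices, each holding the E rows of one layer
    L = llr_ch.astype(float)                    # APP values, init. to channel LLRs
    R = [np.zeros(Hi.shape) for Hi in layers]   # stored CTV messages per layer
    H = np.vstack(layers)                       # full PCM, for the stopping test only
    for _ in range(max_iter):
        for i, Hi in enumerate(layers):         # one sub-iteration per layer
            for j in range(Hi.shape[0]):        # CNs of this layer; with single-weight
                idx = np.flatnonzero(Hi[j])     # columns they touch disjoint VN sets
                q = L[idx] - R[i][j, idx]       # VTC messages, eq. (4)
                s = np.prod(np.sign(q))         # product of all VTC signs
                a = np.abs(q)
                for k, l in enumerate(idx):
                    m = np.delete(a, k).min()   # min over the other VNs of this CN
                    r = s * np.sign(q[k]) * m   # min-sum CTV message, eq. (5)
                    R[i][j, l] = r
                    L[l] = q[k] + r             # APP update, eq. (6)
        hard = (L < 0).astype(int)              # hard decision on the APPs
        if not (H @ hard % 2).any():            # stop once all checks are satisfied
            break
    return hard
```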

Modified Shuffling of QC-LDPC Codes
In this section, we propose a modified shuffling idea for LD of QC-LDPC codes.
Shuffling is the act of permuting the rows of the PCM in a manner that reduces the complexity of the decoding algorithm while preserving the error-correction performance. The primary shuffling idea was proposed in [27].
An example of such a shuffling is depicted in Fig. 3. It should also be emphasized that the desired cyclic property of $H_{qc}^{sh}$ holds under the modified shuffling just as it does under the basic shuffling method; hence, each layer of $H_{qc}^{sh}$ can still be produced from its previous layer by circular shifting.
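For illustration, the interleaving can be expressed as a single row permutation. The sketch below is our own formulation: layer k of the shuffled PCM collects one row from each of the c block-rows, and the modified method's offsets enter as an input. The rule of [40] for choosing the offsets so that every layer keeps single-weight columns is not reproduced here, and the exact indexing is an assumption; with all offsets zero the permutation reduces to the basic shuffling of [27].

```python
import numpy as np

def shuffle_pcm(H, b, c, offsets):
    # Row interleaving for a PCM made of c block-rows of b rows each.
    # Produces b layers of c rows each; `offsets` holds one offset per
    # block-row (all zeros ~ basic shuffling; a suitable nonzero choice
    # ~ the modified shuffling -- the exact rule is an assumption here).
    perm = []
    for k in range(b):                       # layer k of the shuffled PCM
        for i in range(c):                   # one row from each block-row
            perm.append(i * b + (k + offsets[i]) % b)
    return H[perm, :]
```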
The advantage of LD with a shuffled PCM over LD with a non-shuffled one is also worth noting from an implementation point of view. Fig. 6 illustrates the LD architecture for the $H_{qc}^{sh}$ of Fig. 3. In this figure, VN Processing Unit (VNPU) and CN Processing Unit (CNPU) denote the processing units responsible for computing the VTC and CTV messages in (4) and (5), respectively. Once a sub-iteration is completed, the next layer is obtained by circulating the current one one symbol to the right. As a result of this circulation, the existing connections between VNPUs and CNPUs remain valid and now represent the connections of the next layer. Therefore, the next sub-iteration can be initiated instantly, without the need to redefine the connections between VNPUs and CNPUs. This is the main advantage delivered by shuffling the PCM of a QC-LDPC code, enabling us to make the best use of its inherent cyclic structure.

Implementation and Experimental Results
We have implemented LD with both the shuffled and the non-shuffled PCMs of the example codes of the IEEE 802.16e and IEEE 802.15.3c standards. The hardware used for the implementation was a Xilinx Virtex-7 FPGA, and the acquired results are shown in Table 2. As can be deduced from the figures in the table, with the shuffled PCM the design is considerably smaller and occupies nearly 93% fewer LUTs on the FPGA. In terms of the on-chip power reported by the implementation tool, LD with the shuffled PCM also consumes about 80% less power. In summary, the superiority of the shuffling method in terms of hardware area and consumed power is apparent from the implementation results. Note that the design for the non-shuffled IEEE 802.16e code is too big to fit in the FPGA, and hence its results are not available. It should also be noted that our main concern is to highlight the improvement introduced by the shuffling method itself, not by the other elements of decoding, such as finding the first two minima.

As for convergence, the shuffled decoder needs more sub-iterations to reach a given BER. This stems from the fact that layering differs in the two cases: with the non-shuffled PCM, the $J$ rows are divided into $c$ layers of $b$ rows each, while with the shuffled PCM they are divided into $b$ layers of $c$ rows each. Since $c$ is usually much smaller than $b$, in the first case a larger number of VNs are processed in each sub-iteration, and hence fewer sub-iterations are needed in total. However, the larger number of sub-iterations needed to reach a specific BER is compensated to an extent by the higher clock frequency allowed by the shuffling, as indicated by Table 2. To gain better insight, the average throughput of the different codes is plotted in Fig. 9. The average throughput can be expressed as

$$T_{avg} = \frac{n \cdot f_{clk}}{C_{avg}},$$

where $f_{clk}$ is the clock frequency specified in Table 2 and $C_{avg}$ is the average number of clock cycles required to decode one received sequence.
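As a numerical illustration of this expression (the figures below are hypothetical and not taken from Table 2):

```python
def avg_throughput(n, f_clk, c_avg):
    # Coded bits per second: n bits are delivered every c_avg clock
    # cycles on average, at clock frequency f_clk (Hz).
    return n * f_clk / c_avg

# Hypothetical numbers: a length-672 code, 200 MHz clock, 1500 cycles/word
print(avg_throughput(672, 200e6, 1500) / 1e6, "Mb/s")   # -> 89.6 Mb/s
```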

Conclusion
The shuffling method proposed in this paper is basically an interleaving of the rows of the PCM of a QC-LDPC code with two objectives in mind. First, the columns in each layer of the shuffled PCM must remain of weight zero or one.
Second, each layer must be producible from the layer above it by a one-symbol circular shift to the right. Implementation results on a Xilinx Virtex-7 FPGA showed that these two properties translate into about 93% fewer occupied LUTs and nearly 80% lower on-chip power compared to LD with a non-shuffled PCM, without degrading BER performance.

Availability of data and materials
Data sharing is not applicable to this article as no datasets were generated or analysed during the current study.

Competing interests
The authors declare that they have no competing interests.