Basic-Set Trellis Min–Max Decoder Architecture for Non binary LDPC Codes With High-Order Galois Fields

Basic-Set Trellis Min–Max Decoder Architecture for Non binary LDPC Codes With High-Order Galois Fields


Non binary low-density parity-check (NB-LDPC) codes outperform their binary counterparts in terms of error correction performance. However, the drawback of NB-LDPC decoders is high complexity, especially for the check node unit (CNU), and the complexity increases considerably when increasing the Galois-field (GF) order. In this paper, a novel basic-set trellis min–max algorithm is proposed to greatly reduce not only the CNU complexity but also the number of messages exchanged between the check node and the variable node compared with previous studies, which is highly efficient for higher order GFs. In addition, the proposed CNU is designed to compute the messages in a parallel way. Layered decoder architectures based on the proposed algorithm were implemented for the (837, 726) NB-LDPC code over GF(32) and the (1512, 1323) code over GF(64) using 90-nm CMOS technology, and obtained a reduction in the complexity by 30% and 37% for the CNU, and 40% and 37.4% for the whole decoder, respectively. Moreover, the proposed decoder achieves a higher throughput at 1.67 Gbit/s and 1.4 Gbit/s compared with the other state-of-the-art high-rate NB-LDPC decoders with high-order GFs.


  • Modelsim
  • Xilinx 14.2


Non binary low-density parity-check (NB-LDPC) codes defined over Galois fields (GFs) GF(q) with q > 2 outperform their binary counterparts in terms of error-correcting performance and performance improvement in the error-floor region when code length is moderate. In addition, these codes have good ability of burst error correction, especially for high-order GFs. Research results demonstrate that NB-LDPC codes provide superior performance compared with the best optimized binary LDPC code over fading channels, and the combination of NB-LDPC code with high-order modulations improves both the bandwidth efficiency and the error-correction capability. Moreover, the elimination of the error floor is critical for flash memory applications, and the NB-LDPC codes show much promise for multilevel flash memory applications. However, the main disadvantage of NB-LDPC codes is their highly complex decoding algorithms; it is difficult to achieve maximum throughput and minimum area for their architectures. In practical implementations, the NB-LDPC decoders have several drawbacks, such as a highly complex check node unit (CNU), a large area spent on storage elements, and routing congestion. First, the belief propagation (BP) algorithm used for binary LDPC decoding was introduced for the NB-LDPC decoding. Then, a fast Fourier transform-BP (FFT-BP) algorithm in the probability domain was proposed to reduce the computational complexity in check node processing by replacing the convolutional operations with multiplications in the frequency domain.

Although the probability domain algorithm provides optimal error-correcting performance, the large number of additions and multiplications causes an exponential increase in hardware complexity. In the FFT-BP algorithm based on the logarithm domain used log-likelihood ratio (LLR) values to decode the channel messages instead of probability values, in which the multiplications are replaced with additions. For practical NB-LDPC decoder implementations, suboptimal algorithms such as extended min-sum (EMS) and the min–max algorithm have been proposed to reduce the complexity of the CNU as the main bottleneck of the NB-LDPC decoder.

The min–max algorithm is interesting because it uses comparisons instead of additions in the check node processing, which not only reduces the hardware complexity but also prevents the numerical growth of the decoder. In addition, a forward–backward scheme was utilized to derive the check node output messages. This scheme includes sequential computations, which cause a throughput problem for the decoder architectures. Moreover, additional storage memories are required to store the intermediate messages such as forward and backward messages. Recently, the path construction algorithms and the relaxed min–max (RMM) algorithm introduced the trellis representation for check node processing to eliminate computing the forward–backward messages, and thus reduces the memory requirement for the intermediate messages. The RMM algorithm using the minimum basis to generate the check node output messages was proposed for NB-LDPC decoders, which further reduces the check node complexity.

However, the sequential check node processing requires a large number of clock cycles, which limits the maximum throughput of the decoder. In the trellis EMS algorithm was proposed to improve the throughput of the NB-LDPC decoders, where the check node output messages are generated in parallel by means of an extra column inserted to the original trellis. A disadvantage of the decoders and is high area, which causes a reduction of the overall decoder efficiency. To take advantage of the idea, the simplified trellis min–max (STMM) algorithm was proposed to improve the throughput of the min–max decoders with less complexity. In the one-minimum-only TMM algorithm was introduced on the basis of the STMM algorithm to reduce the CNU complexity by obtaining only one minimum and estimating the second one.

 In q × dc check node output messages are exchanged between the check node and the variable nodes. For high-order GFs or high-rate NB-LDPC codes, there are two main drawbacks are, first, the amount of exchanged messages increases, which causes wiring congestion, and thus limits the maximum throughput of the decoders. Second, the check node output messages are stored in the memory for the next decoding iteration in the layered decoders. Therefore, the memory requirement becomes large, which leads to a significant growth in the decoder area for NB-LDPC codes.

To overcome the drawbacks, originally introduced a compression technique to reduce the exchanged messages between one check node and the variable nodes to four sets, including the intrinsic and extrinsic information, the path coordinates, and the hard-decision symbols with a size of 5 × (q − 1) + dc messages without any errorcorrecting performance loss. For further improvement, the research proposed to simplify the CNU architecture and reduce the exchanged messages to 4×(q −1)+dc messages with a similar error-correcting performance. The approximated TMM algorithms were introduced to reduce the amount of intrinsic information from (q − 1) elements to only two elements and L q elements, respectively, at the cost of some error-correcting performance loss. The remaining elements are calculated from the approximation functions.

In this paper, a novel basic-set TMM (BS-TMM) algorithm is proposed for NB-LDPC codes based on the theory of the GF GF(q = 2p), where each field element is uniquely represented by a linear combination of p independent field elements. In the proposed BS-TMM algorithm, the basis set including the intrinsic information of only p = log2 q independent field elements in the extra column is stored, and the other elements are constructed on the basis of this basic set. Moreover, a novel algorithm is introduced for finding p independent field elements with the most reliable messages of the basic set in parallel. The BS-TMM algorithm allows the reduction of exchanged messages between one check node and variable nodes from 4 × (q − 1) + dc [16] to (q − 1) + 3 × p + dc messages with a negligible performance loss of 0.1 dB. The proposed method provides a great area reduction and throughput improvement for the NB-LDPC decoders with good errorcorrection performance. Therefore, it is extremely efficient for the design of high-rate and high-order NB-LDPC decoders. Two NB-LDPC decoders, including (837, 726) over GF(32) and (1512, 1323) over GF(64), were implemented on the basis of the BS-TMM algorithm.


  • Less Efficient
  • Low complexity


Decoder Architecture

In this section, a complete decoder architecture based on the BS-TMM algorithm is designed for NB-LDPC codes. The proposed decoder architecture achieves a great reduction in the area because of the large area reduction in the CNU architecture. In addition, an improvement in the throughput is obtained since reducing the wires between the check node and variable node processors mitigates the routing congestion. The layered min–max algorithm for the proposed decoder is presented in Algorithm 6, where the BS-TMM in Algorithm 3 is implemented in the check node processor. In addition, the decompression network (DN) corresponding to Algorithm 5 is implemented in the variable node processor to generate the C2V messages R mn(a) from outputs of the CNU architecture. The DN has three parts: 1) generating the LLR

Fig. 1. Proposed extra column and path information generator for GF(8). (a) Extra column generator for the j th element. (b) Control signal generator. (c) Path information generator for the j th element.

values of the extra column Q(a) and the path information d(a) with a maximum of p deviations on the basis of the basic set B∗ = {m1*l , I*l , a*l}1≤l≤p; 2) generating the C2V messages in the delta domain as Rmn(a) on the basis of Q(a), E(a), and d(a); 3) and converting the C2V messages from delta to normal domain. It is noted that two DNs are required in the variable node processor. However, the proposed decoder area is much lower than that of the conventional decoders .

First, Fig. 1 shows the proposed extra column and path information generator for GF(8). The LLR value of each element in the extra column Q(a j) is selected from one of the p LLR values m1*l (1 ≤ l ≤ p) in the basic set B∗ depending on the p control signals sl[ j] (1 ≤ l ≤ p), as shown in Fig. 1(a). (q − 1) architectures as in Fig. 9(a) are required to compute (q − 1) messages in the Q(a) simultaneously. p control signals sl[ j] (1 ≤ l ≤ p) are generated using the architecture in Fig. 1(b). To compute Q(a j), only one of the control signals sl[ j] is equal to “1,” and others are equal to “0.” Thus, only one of p LLR values is selected for the output Q(a j)n order to calculate the p control signals sl[ j], 2p − 1 = q − 1 combinations of p field elements in the basic set B∗ excluding the zero element are divided into p groups, as shown in Fig. 1(b). p control signals sl[ j] correspond to p outputs of the groups. The ith group contains the field element a*l and its combinations with all possible combinations of the previous field elements a*k(0 < k < l). Therefore, the lth group (l > 0) includes 2l−1 combinations of the field elements. In addition, (q−1) path information corresponding to (q−1) field elements is also constructed, where each path information d[ j] has p column indexes dl[ j] (1 ≤ l ≤ p). (q − 1) architectures as in Fig. 1(c) are required to compute (q−1) path information. For p field elements in the basic set a*l (1 ≤ l ≤ p), their paths are one deviation, and thus p values of the path information {dl[ j]}1≤l≤p are the same as column index I*l. The path of the field element generated by the combination of all field elements in the basic set has a maximum of p deviations; thus, p values of the path information dl[ j] (1 ≤ l ≤ p) correspond to p column indexes I *l(1 ≤ l ≤ p) in the basic set. For other field elements generated by the remaining combinations, the number of their deviations is k (1 < k < p), and then the dl[ j] is assigned to the column index I *l with 1 ≤ l ≤ k. Otherwise, the dl[ j] with k < l ≤ p is assigned to the column index I *k .

Fig. 2 presents the proposed C2V message generator for GF(8) with dc = 4. The C2V messages Rmj(a) (1 ≤ j ≤ dc) are simultaneously introduced by either Q(a) or E(a), which are the outputs of the multiplexers. The control signals for the multiplexers depend on the column indexes and p deviations of the path information. If the column index j (1 ≤ j ≤ dc) is equal to at least one of p deviations d(a) (1 ≤ l ≤ p), then the output of the multiplexer is assigned to compensation value E(a). Otherwise, the output of the multiplexer is assigned to Q(a). Fig. 3 shows the top-level decoder architecture for the proposed layered decoding algorithm, where one row of H corresponding to one layer is processed in one clock cycle. It can be seen that the decoder architecture is divided into a variable node processor and check node processor.

To start the decoding process, the LLR messages from channel information Ln(a) are loaded in variable node memory (VNMEM). From the next layer and next iteration, the output messages of the variable node processor Qnk,l(a) are stored in the VNMEM. The VNMEM includes dc memories with a depth of (q − 1) as the size of the CPM and a width of q × w bits. For each decoding time, one address is read and one address is written from each memory

The permutation and depermutation of the variable messages are implemented by modules P and P−1, respectively. Each module requires dc × (q − 1) × log2 q multiplexers of w bits to permute or depermute dc vectors of (q − 1) messages, and the control signals are based on the hmn nonzero values of H.

The normalization module N is responsible for finding the most reliable messages and their locations zn, and generating the Qnk,l(a) messages for the inputs of the check node processor. In addition, normalization ensures that the smallest value in each LLR vector Qnk,l (a) is always equal to zero. At the last decoding iteration, the zn values are the hard-decision symbols c˜n stored in the output memory, and the P module and subtractor are inactive during this process. Since a layered decoding scheme is used, the outputs of the check node processor in one iteration must be stored in the check node memory (CNMEM) for the next iteration process. Thus, the CNMEM in the proposed decoder has a depth of M and a width of p×(w+ log(dc) + p)+(q−1)×w+dc×p bitcorresponding to the output bits of the check node processor. A total of M×[p×(w+ log(dc) +p)+(q−1)×w+dc×p] bits are stored in one iteration. Compared with the M ×q ×dc ×w bits stored in CNMEM in the conventional approach, the memory requirement for CNMEM in the proposed decoder is greatly reduced, which leads to a large reduction in decoder area.


  • Highly efficient
  • Complexity is high


Huyen Pham Thi, Member, IEEE, and Hanho Lee , Senior Member, IEEE, “Basic-Set Trellis Min–Max Decoder Architecturefor Nonbinary LDPC Codes With High-OrderGalois Fields”, IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 26, NO. 3, MARCH 2018.