Low-Complexity VLSI Design of Large Integer Multipliers for Fully Homomorphic Encryption

ABSTRACT:

Large integer multiplication has been widely used in fully homomorphic encryption (FHE). Implementing feasible large integer multiplication hardware is thus critical for accelerating the FHE evaluation process. In this paper, a novel and efficient operand reduction scheme is proposed to reduce the area requirement of radix-r butterfly units. We also extend the single port, merged-bank memory structure to the design of number theoretic transform (NTT) and inverse NTT (INTT) for further area minimization. In addition, an efficient memory addressing scheme is developed to support both NTT/INTT and resolving carries computations. Experimental results reveal that significant area reductions can be achieved for the targeted 786 432- and 1 179 648-bit NTT-based multipliers designed using the proposed schemes in comparison with the related works. Moreover, the two multiplications can be accomplished in 0.196 and 2.21 ms, respectively, based on 90-nm CMOS technology. The low-complexity feature of the proposed large integer multiplier designs is thus obtained without sacrificing the time performance.

SOFTWARE IMPLEMENTATION:

  • Modelsim
  • Xilinx 14.2

EXISTING SYSTEM:

FULLY homomorphic encryption (FHE) allows computations to be carried out directly on ciphertexts, ensuring data privacy on untrusted servers, and has thus attracted much attention for cloud computing applications. Generally, FHE can be classified into three categories: lattice-based, integer-based, and (ring) learning with errors.

One of the main challenges in the development of practical FHE applications is to mitigate the extremely high computational complexity and resource requirements. For example, software implementations of FHE on high-performance computers still consume significant computation time, particularly for large integer multiplications that usually involve more than hundreds of thousands of bits. For lattice-based FHE, a 785 006-bit multiplication is required for the small setting with a lattice dimension of 2048. To accelerate FHE operations, various efficient schemes have been presented to tackle large integer multiplication based on the Schönhage–Strassen algorithm (SSA). In this algorithm, the number theoretic transform (NTT) is applied because the asymptotic complexity of NTT-based multiplication for million-bit applications is lower than that of existing approaches such as the Karatsuba and Toom–Cook algorithms. Prior work described schemes for speeding up the modular multiplication and addition/subtraction of large numbers together with a parameter precomputation strategy, and then implemented all FHE operations on graphics processing units for parallel computation. Cao et al. presented a low-latency hardware architecture for NTT-based multiplication and a low Hamming weight architecture for integer-based FHE with unbalanced input operand bit lengths, both implemented on field-programmable gate arrays (FPGAs). Considering implementations in application-specific integrated circuits (ASICs), Doroz et al. employed a cache architecture together with customized processing units to carry out a 98 304-point NTT for implementing a 1 179 648-bit multiplier. Wang et al. employed a radix-16 butterfly unit (BU) with a memory-based in-place fast Fourier transform (FFT) architecture to perform 65 536-point operations over a finite field for fulfilling 786 432-bit multiplication. Recently, Feng et al. proposed to implement NTT with two moduli along with the Chinese remainder theorem to enlarge the sample size for million-bit multiplier design, thus improving transform efficiency.

Compared to pipelined FFT architectures, memory-based FFT is recognized as a more feasible solution for low area complexity, especially for large-size FFT. The same conclusion applies to NTT, which possesses the same data flow as the traditional FFT but with a different set of twiddle factors. As a result, memory-based FFT/NTT solutions are well suited to FHE applications that attempt to accelerate large integer multiplication through ASIC/FPGA design with restricted hardware cost. For memory-based FFT/NTT architecture designs, efficient memory management schemes are usually demanded to increase the equivalent memory bandwidth by partitioning the required memory into several banks. High-radix BUs are commonly applied to reduce the number of operation stages, thereby increasing the resulting performance. There are always tradeoffs between hardware cost and time complexity for a given application specification. In this paper, a new operand reduction scheme is proposed to optimize the high-radix BU design for reduced area. The basic idea is to combine the partitioned input data of the BU to decrease the number of operands required in the summation operation, based on the inherent properties of NTT.

In addition, we extend the single-port, merged-bank (SPMB) memory structure to the design of large integer multiplication for further reducing the memory area of the NTT implementation. A unified address generator unit (AGU) is also presented to support both NTT/INTT (inverse NTT) and resolving carries computations. Experimental results show that applying the proposed schemes to the designs of 786 432- and 1 179 648-bit multipliers can achieve up to 52.34% and 18.73% area reductions compared to related implementations, respectively, and the area saving is obtained without compromising the time performance.

DISADVANTAGES:

  • Inefficient memory addressing
  • Significant area reductions cannot be achieved
  • High hardware complexity

PROPOSED SYSTEM:

Proposed architecture and hardware design

Fig. 1 depicts the proposed large integer multiplication architecture, which consists of two NTT units, a resolve carries unit, an AGU, a controller unit, and several memory units. An NTT unit comprises one radix-r BU, r 64-bit modular multipliers (MulMod), one radix-r switcher, and one buffer. Each of the two NTT units accesses data from two single-port SRAM banks. The ROMs store the associated twiddle factors, i.e., the powers of the primitive Nth root of unity in Zp, for NTT/INTT computations. The following summarizes the operation of the proposed NTT-based multiplication architecture. The NTT1 and NTT2 units carry out the NTT computations of the two input operands at the same time. In addition, we reuse the NTT1 unit to perform the INTT computation on the result of the point-wise multiplication. Each NTT input word has 64 bits, and the radix-r BU processes r input words at a time. The data are packed using the proposed operand reduction schemes and then processed as 192-bit operations in the BU. The MulMod unit is basically a 64-bit modular multiplier, in which a few modular additions and subtractions are employed to fulfill the multiplication of the BU output data by a chosen 64-bit value.

The chosen value can be a twiddle factor from the ROMs, the constant 1, a BU output of the NTT2 unit for performing point-wise multiplication, or the inverse value N^-1 for INTT computation, depending on the current operational status. The radix-r switcher performs the temporal data relocation for NTT/INTT computations. Moreover, to reduce the critical path delay in the NTT units, the BU is implemented in a five-stage pipelined structure, and the MulMod unit has four pipeline stages. The buffer stores and reschedules NTT/INTT output data to achieve conflict-free memory access. The resolve carries unit, implemented with carry-lookahead addition, handles the carries of the INTT output data and obtains r digits of the multiplication result at a time. Note that each digit of the multiplication result contains b bits for base B = 2^b. Assuming r = 16, the detailed radix-16 BU design is shown in Fig. 8. As mentioned in Section III-A, the outputs of a BU adopt different merging strategies for operand reduction according to their output indices. Specifically, the output X0 requires a 96-bit Wallace tree addition of 16 operands [see Fig. 2(a)]; the eight odd-indexed outputs and four of the even-indexed outputs, i.e., X2, X6, X10, and X14, take a 192-bit Wallace tree addition of six operands [see Fig. 2(b)]; and the remaining three even-indexed outputs, i.e., X4, X8, and X12, demand a 192-bit Wallace tree addition of eight operands [see Fig. 2(c)]. The CSAs and pipeline registers, i.e., the black rectangles, reduce the critical path delay of the multioperand addition. The carry propagation adder converts the carry-save form into its corresponding binary representation. Finally, the modular reduction circuit fulfills the modulo-p operation. Note that X0 requires fewer modular additions in the modulo-p operation than the other NTT outputs.
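To make the flow above concrete, the following minimal Python sketch models the same procedure: pack the operands into base-2^b digits, apply an NTT over a 64-bit prime field, multiply point-wise, apply the INTT, and resolve carries. The modulus p = 2^64 - 2^32 + 1 (with primitive root 7) and the naive O(N^2) transform are illustrative assumptions; the hardware instead uses memory-based radix-r butterflies and the paper's own parameters.

    # Behavioral model of NTT-based large integer multiplication (not the RTL).
    P = (1 << 64) - (1 << 32) + 1        # assumed NTT-friendly 64-bit prime

    def ntt(a, omega):
        """Naive O(N^2) number theoretic transform over Z_p (clarity only)."""
        n = len(a)
        return [sum(a[j] * pow(omega, i * j, P) for j in range(n)) % P
                for i in range(n)]

    def multiply(x, y, n=16, b=8):
        """Multiply x and y; each operand must fit in n/2 base-2^b digits so
        that the cyclic convolution equals the linear one."""
        base = 1 << b
        xs = [(x >> (b * i)) & (base - 1) for i in range(n)]
        ys = [(y >> (b * i)) & (base - 1) for i in range(n)]
        omega = pow(7, (P - 1) // n, P)            # primitive n-th root of unity
        X, Y = ntt(xs, omega), ntt(ys, omega)
        Z = [(u * v) % P for u, v in zip(X, Y)]    # point-wise multiplication
        inv_n = pow(n, P - 2, P)                   # N^-1 used by the INTT
        zs = [(w * inv_n) % P for w in ntt(Z, pow(omega, P - 2, P))]
        result, carry = 0, 0                       # resolve carries digit by digit
        for i, d in enumerate(zs):
            t = d + carry
            result |= (t % base) << (b * i)
            carry = t // base
        return result + (carry << (b * n))

    assert multiply(12345, 6789) == 12345 * 6789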

The AGU is modified to support both NTT/INTT and resolving carries computations, as shown in Fig. 3. The controlling signal CtrlAG1 determines the circular right shift amount of the bit counter (BC) by selecting either the stage counter value or a constant value. Fig. 4 shows the design of the BC for N = 65 536 and r = 16. The other controlling signal CtrlAG2 indicates whether a bit-reversed operation is needed. There are three operation modes in the unified AGU design, each corresponding to a distinct addressing sequence: 1) CtrlAG1 = 0 and CtrlAG2 = 0 for NTT computation; 2) CtrlAG1 = 0 and CtrlAG2 = 1 for INTT computation; and 3) CtrlAG1 = 1 and CtrlAG2 = 0 for resolving carries computation. Finally, assuming N = 65 536 and r = 16, Fig. 10 depicts the hardware implementation of the 12-bit BC for the AGU, covering NTT/INTT computation with CtrlAG1 = 0 and resolving carries computation with CtrlAG1 = 1. The constant value required for resolving carries computation in the AGU is calculated at steps 7 and 8 of the SPMB_RC algorithm.
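The addressing behavior can be summarized in a small software model. The sketch below is an assumed, simplified interpretation of the unified AGU for N = 65 536 and r = 16: CtrlAG1 selects the circular right-shift amount of the 12-bit BC, and CtrlAG2 selects the bit-reversed mode; the constant shift stands in for the value computed at steps 7 and 8 of the SPMB_RC algorithm.

    BC_BITS = 12                         # 12-bit BC for N = 65536, r = 16

    def ror(value, shift, width=BC_BITS):
        """Circular right shift of a width-bit value."""
        shift %= width
        mask = (1 << width) - 1
        return ((value >> shift) | (value << (width - shift))) & mask

    def bit_reverse(value, width=BC_BITS):
        return int(format(value, f'0{width}b')[::-1], 2)

    def agu_address(bc, stage, ctrl_ag1, ctrl_ag2, const_shift=4):
        """ctrl_ag1 = 0: shift by the stage counter (NTT/INTT);
        ctrl_ag1 = 1: shift by a constant (resolving carries);
        ctrl_ag2 = 1: apply bit reversal (INTT)."""
        shift = stage if ctrl_ag1 == 0 else const_shift
        addr = ror(bc, shift)
        return bit_reverse(addr) if ctrl_ag2 else addr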

ADVANTAGES:

  • Efficient memory addressing
  • Significant area reductions can be achieved
  • Low hardware complexity

REFERENCES:

Jheng-Hao Ye and Ming-Der Shieh, Member, IEEE, “Low-Complexity VLSI Design of Large Integer Multipliers for Fully Homomorphic Encryption,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2018.

Vector Processing-Aware Advanced Clock-Gating Techniques for Low-Power Fused Multiply-Add

ABSTRACT:

The need for power efficiency is driving a rethink of design decisions in processor architectures. While vector processors succeeded in the high-performance market in the past, they need retailoring for the mobile market that they are entering now. The floating-point (FP) fused multiply-add (FMA), being a functional unit with high power consumption, deserves special attention. Although clock gating is a well-known method to reduce switching power in synchronous designs, there are unexplored opportunities for its application to vector processors, especially when considering the active operating mode. In this research, we comprehensively identify, propose, and evaluate the most suitable clock-gating techniques for vector FMA units (VFUs). These techniques ensure power savings without jeopardizing the timing. We evaluate the proposed techniques using both synthetic and “real-world” application-based benchmarking. Using vector masking and vector multilane-aware clock gating, we report power reductions of up to 52%, assuming an active VFU operating at peak performance. Among other findings, we observe that vector instruction-based clock-gating techniques achieve power savings for all vector FP instructions. Finally, when evaluating all techniques together using “real-world” benchmarking, the power reductions are up to 80%. Additionally, in accordance with processor design trends, we perform this research in a fully parameterizable and automated fashion.

SOFTWARE IMPLEMENTATION:

  • Modelsim
  • Xilinx 14.2

EXISTING SYSTEM:

Power and energy efficiency have become the dominant limiting factors to processor performance and have significantly increased processor design complexity, especially when considering the mobile market. Being able to exploit high degrees of data-level parallelism (DLP) at low cost in a power- and energy-efficient way, vector processors are an attractive architectural-level solution. Undoubtedly, the design goals for mobile vector processors clearly differ from the performance-driven designs of traditional vector machines. Therefore, mobile vector processors require a redesign of their functional units (FUs) in a power-efficient manner.

Clock gating is a common method to reduce switching power in synchronous pipelines and is practically a standard in low-power design. The goal is to “gate” the clock of any component whenever it does not perform useful work, thereby reducing the power spent in the associated clock tree, registers, and the logic between the registers. It is the most efficient power reduction technique for the active operating mode. Therefore, the conditions under which clock gating can be applied should be extensively studied and identified. A widely used approach is to clock gate a whole FU when it is idle. A complementary and more challenging approach is clock gating the FU or its subblocks when it is active, i.e., operating at peak performance. Furthermore, vector processors have unique characteristics that provide additional clock-gating opportunities. Since fused multiply-add (FMA) units usually dissipate the most power of all FUs, their design requires special attention. Abundant floating-point (FP) FMA is typically found in vector workloads, such as multimedia, computer graphics, or deep learning workloads. Although FMA has historically been used for high performance, it has recently been included in mobile processors as well. In contrast to high-performance vector processors (e.g., the NEC SX-series and Tarantula) that have separate units for each FP operation, mobile vector processors' resources are limited; thus, we typically have a single unit per vector lane capable of performing multiple FP operations rather than separate FP units. Additional advantages of using FMA over a separate FP adder and multiplier are as follows: 1) computation localization inside the same unit reduces the number of interconnections (power and energy efficiency); 2) higher accuracy (a single round/normalize step instead of two); and 3) improved performance (shorter latency).

We present three kinds of techniques: 1) novel ideas that exploit unique characteristics of vector architectures for clock gating during active periods of execution (e.g., vector instructions with a scalar operand or vector masking); 2) novel ideas for clock gating during active periods of execution that are also applicable to scalar architectures but especially beneficial to vector processors (e.g., gating internal blocks depending on the values of input data); and 3) ideas already used in other architectures, presented because their application benefits vector processors and for the sake of completeness (e.g., gating an idle VFU).

Regarding the second and third groups of ideas, an advantage of vector processing that extends the applicability of clock gating is that vector instructions last many cycles, so the state of the clock-gating and bypassing logic remains the same during the whole instruction. As a result, power savings typically overcome the switching overhead of the added hardware (which is often not the case in scalar processors).

To fulfill current trends in digital design that promote building generators rather than instances, we perform this research in a fully parameterizable, scalable, and automated manner. We developed an integrated architecture-circuit framework that consists of several generators, simulators, and other tools, in order to join architectural-level information (e.g., vector length or benchmark configuration) with circuit-level outputs (e.g., VFU power and timing measurements). We implement our clock-gating techniques and generate hardware VFU models using a fully parameterizable Chisel-based FMA generator (FMAgen) and a 40-nm low-power technology.

We discuss the related work individually for each of our clock-gating techniques together with the description of the technique in Section IV. Besides, in the context of alternative low-power techniques for FP units, interesting approaches have been proposed, such as result caching (storing results that can be reused) and byte encoding (performing computation only over the significant bytes). However, detailed and accurate evaluation reveals that the actual savings are often low and come with an unaffordable area overhead.

Fig. 1. Two-lane, four-stage VFU (MVL = EVL = 64) executing FPFMAV V3 <- V0, V1, V2.

DISADVANTAGES:

  • Inaccurate operation
  • Limited power savings

PROPOSED SYSTEM:

Proposed clock-gating techniques

Scalar Operand Clock-Gating

We propose this technique to tackle the cases in which one or two operands do not change during a vector instruction. Table I lists the types of instructions during which scalar operand clock-gating (ScalarCG) is active.

Fig. 2. Simplified block diagram of a one-lane, four-stage VFU with all clock-gating techniques applied (AllCG technique). Input signals for the baseline without clock gating are the multiplicands (A and B), the addend (C), the rounding mode, and the operation sign (opsign), while the output signals are the result (Out) and the exception flags.

Table I. Types of instructions for which ScalarCG and ImplCG are active.

As only one of the supported vector instructions has all three vector operands, at least one operand is often scalar. Only the FPFMAV instruction, in which all operands are vectors, does not benefit from this technique.

During these instructions, the corresponding input register(s) of the scalar operand(s) should latch a new value only on the first clock edge of the instruction's execution; during the rest of the instruction, they can be clock-gated. To implement this, we introduce the signals VS[2..0] (Fig. 2), where VS[i] = 0 means that the ith operand is gated after the mentioned first cycle. The VS signals are derived from the instruction OPCODE, before the first pipeline stage (as shown in Fig. 2). This generation (decoding) requires regular comparators, which are not on the critical path, since the OPCODE is available at least one cycle in advance. Table I shows the corresponding VS signals for all the instructions.

Implicit Scalar Operand Clock-Gating (ImplCG)

This technique is an additional optimization of ScalarCG and aims to further exploit the information given through the instruction OPCODE for clock-gating, operand isolation, and computation bypassing. In the case of addition and subtraction instructions, such as FPADDV and FPSUBV, the 53 × 53 mantissa multiplier is not needed, as one of the multiplicands is known to be “1”; thus, we can bypass, isolate, and clock-gate it, providing the value of the other multiplicand directly to the adder. The situation is analogous for FPMULV, since the addend is known to be “0.” In this case, the 162-bit-wide adder, the leading zero anticipation, and the aligning part are not needed.

To control the bypassing, isolation, and clock-gating of the mentioned submodules, we introduce the signal INSTYP (see Fig. 2 and Table I), generated from the instruction OPCODE, which indicates whether an FPFMAV or an FPADDV/FPSUBV/FPMULV instruction is executed. INSTYP together with the VS signals provides information on the instruction type. For example, INSTYP = 1, VS[0] = 1, and VS[2] = 0 indicate an FPMULV instruction, while INSTYP = 1 and VS[1] = 0 indicate an FPADDV/FPSUBV instruction. Fig. 3 shows the simplified block diagram of the gated FMA submodules when the aforementioned instructions are executed. The circuitry added to implement ImplCG mostly consists of clock-gating cells and MUXs.
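A small Python model may help fix the decode logic. The table entries follow the instruction behavior described above; the opcode strings and submodule names are illustrative assumptions.

    VS_TABLE = {
        # opcode: (VS[0], VS[1], VS[2]); VS[i] = 0 gates operand i's input
        # register after the first cycle of the instruction.
        'FPFMAV': (1, 1, 1),   # all three operands are vectors: no gating
        'FPMULV': (1, 1, 0),   # addend implicitly 0: operand C gated
        'FPADDV': (1, 0, 1),   # one multiplicand implicitly 1: operand B gated
        'FPSUBV': (1, 0, 1),
    }

    def decode(opcode):
        """Derive (VS, INSTYP) from the opcode, one cycle ahead of stage 1."""
        vs = VS_TABLE[opcode]
        instyp = 0 if opcode == 'FPFMAV' else 1
        return vs, instyp

    def gated_submodules(vs, instyp):
        """ImplCG decision: which large FMA submodules can be bypassed."""
        if instyp and vs[2] == 0:    # FPMULV: addend is "0"
            return ['aligner', 'wide_adder', 'leading_zero_anticipation']
        if instyp and vs[1] == 0:    # FPADDV/FPSUBV: a multiplicand is "1"
            return ['mantissa_multiplier']
        return []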

In the context of instruction-dependent techniques, there is interesting research done in the past for scalar processors [8]. The main advantages of our ImplCG proposal over the mentioned research are: 1) we apply the technique for a variable number of pipeline stages; 2) we evaluate power, timing, and area; and 3) we propose the technique for vector processors. The advantage of applying this technique on a vector processor over other models (e.g., scalar) is that vector instructions last many cycles, so the state of the related hardware (clock-gating logic and MUXs) remains the same during the whole instruction. Thus, there will be less switching overhead than in the scalar case.

Fig. 3. Gated FMA submodules during FPMULV (dark gray) and FPADDV (light gray) instructions in the case of ImplCG.

Vector Masking and Vector Multilane-Aware Clock-Gating (MaskCG)

Here, we target cases in which there are idle cycles during vector mask instructions (e.g., FPFMAV_MASK). Common cases in which vector mask control is used are: 1) sparse matrix operations and 2) conditional statements inside a vectorized loop. Additionally, we assume that the same mechanism is also used to reduce the EVL to less than the MVL. We assume that the control logic detects and optimizes this case, skipping the last elements of the vector corresponding to the trailing 0s of the mask. However, in vector designs with nL lanes, there will still be mod(EVL, nL) idle lanes in the last cycle of the operation. The VMR directly controls the clock-gating of the whole arithmetic unit during these idle cycles (see Fig. 2).

Regarding the internal implementation, we perform clock-gating at pipeline-stage granularity [25], so we prevent useless cycles inside the unit, i.e., the data are latched in subsequent stages only if necessary. Once the Enable signal of the first pipeline stage's register gets the value “1,” this Enable signal propagates to the end of the pipeline, one stage per cycle (see Fig. 2). In other words, the Enable signal of the nth stage is the first stage's Enable signal delayed by n−1 cycles. This is implemented by adding a 1-bit-wide, (nS−1)-deep shift register that drives the clock-gating cells. To the best of our knowledge, there is no related work that aims to exploit vector conditional execution with the VMR to lower the power of vector processors.
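The enable propagation can be modeled as a short shift register; the behavioral sketch below assumes a four-stage unit.

    class EnablePipe:
        """Stage n's Enable is stage 1's Enable delayed by n-1 cycles,
        implemented as a 1-bit-wide, (nS - 1)-deep shift register."""
        def __init__(self, n_stages):
            self.shift = [0] * (n_stages - 1)

        def tick(self, first_stage_enable):
            enables = [first_stage_enable] + self.shift
            self.shift = [first_stage_enable] + self.shift[:-1]
            return enables

    pipe = EnablePipe(4)
    print(pipe.tick(1))   # [1, 0, 0, 0]
    print(pipe.tick(0))   # [0, 1, 0, 0]: the enable travels down the pipe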

Input Data Aware Clock-Gating (InputCG)

Here, we identify scenarios in which, depending on the input data, a part of the mantissa processing is not needed for a correct result and, thus, can be bypassed. We use a recoded format for internal representation that allows us to detect special cases (explained in Section III-A) and zeros with negligible hardware overhead; it requires inspection of only the three most significant bits of the exponent (fourth column in Table II). Table II presents the identified scenarios (conditions) in which a hardware block of mantissa arithmetic computations and the corresponding input registers can be bypassed, isolated, and clock-gated.

Table II. InputCG: conditions under which a hardware block of mantissa arithmetic computations and the corresponding input registers can be bypassed.

The recoded format allows detection of the relevant scenarios using simple 3-bit comparators. They are located at the inputs of the VFU (the A, B, and C processing blocks in Fig. 2). In that way, we ensure that the detection comparators are not on the VFU's critical path, i.e., the gating information is available in time. The added internal hardware is similar to that of ImplCG. Having a zero addend is analogous to the FPMULV instruction case (see Fig. 3). A zero multiplicand allows gating and bypassing all the modules from Fig. 3 except the registers that hold the operand C value, as in that case the final result is operand C. In the case of NaN and infinity, there is no need for any computation, as the result that has to be at the VFU output is already known (explained in Section III-A), so we can gate/bypass/isolate the vast majority of FMA submodules. Many workloads contain a lot of zero values in their data and can therefore benefit considerably from the last two subtechniques presented in Table II. Although these techniques are applicable to other architectures as well, their application to vector processors is more efficient, since recurrent values are common within vector data, thus lowering the switching overhead in the added hardware (clock-gating logic and MUXs). While both the ImplCG and InputCG techniques aim to exploit cases when the addend is “0,” in this case there is no external information of the “0” existence via VS signals; it has to be detected, and the gating has to be done on time. As in the case of ImplCG, prior research presents a related data-driven technique for scalar processors. The main advantages (that enable additional savings) of our InputCG technique over the mentioned research are: 1) detection of zero operands and 2) gating the mantissa multiplier when processing NaNs.
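The detection logic can be sketched as follows. The 3-bit exponent patterns are an assumption modeled on the recoded format of the Chisel hardfloat library that the FMA generator builds on; the paper's exact encoding may differ.

    def classify(exp_msb3):
        """3-bit comparator on the exponent MSBs of a recoded operand."""
        return {0b000: 'zero', 0b110: 'infinity', 0b111: 'nan'}.get(
            exp_msb3, 'normal')

    def inputcg_action(a_cls, b_cls, c_cls):
        """Simplified gating decision in the spirit of Table II."""
        if 'nan' in (a_cls, b_cls, c_cls) or 'infinity' in (a_cls, b_cls, c_cls):
            return 'result known: gate/bypass almost all submodules'
        if c_cls == 'zero':
            return 'gate adder/aligner path (as for FPMULV)'
        if a_cls == 'zero' or b_cls == 'zero':
            return 'gate everything except the operand C registers'
        return 'no gating'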

ADVANTAGES:

  • High accuracy
  • Improved performance
  • Efficient power savings

REFERENCES:

Ivan Ratković, Oscar Palomar, Milan Stanić, Osman Sabri Ünsal, Adrian Cristal, and Mateo Valero, Life Fellow, IEEE, “Vector Processing-Aware Advanced Clock-Gating Techniques for Low-Power Fused Multiply-Add,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2018.

A 588-Gb/s LDPC Decoder Based on Finite-Alphabet Message Passing

ABSTRACT:

An ultrahigh-throughput low-density parity-check (LDPC) decoder with an unrolled full-parallel architecture is proposed, which achieves the highest decoding throughput among previously reported LDPC decoders in the literature. The decoder benefits from a serial message-transfer approach between the decoding stages to alleviate the well-known routing congestion problem in parallel LDPC decoders. Furthermore, a finite-alphabet message passing algorithm is employed to replace the VN update rule of the standard min-sum (MS) decoder with lookup tables, which are designed in a way that maximizes the mutual information between decoding messages. The proposed algorithm results in an architecture with reduced bit-width messages, leading to a significantly higher decoding throughput and a lower area compared to an MS decoder when serial message transfer is used. The architecture is placed and routed for both the standard MS reference decoder and the proposed finite-alphabet decoder using a custom pseudohierarchical backend design strategy to further alleviate routing congestion and to handle the large design. Post-layout results show that the finite-alphabet decoder with the serial message-transfer architecture achieves a throughput as large as 588 Gb/s with an area of 16.2 mm2 and dissipates an average power of 22.7 pJ per decoded bit in a 28-nm fully depleted silicon-on-insulator library. Compared to the reference MS decoder, this corresponds to a 3.1 times smaller area and 2 times better energy efficiency.

SOFTWARE IMPLEMENTATION:

  • Modelsim
  • Xilinx 14.2

EXISTING SYSTEM:

Low-density parity-check (LDPC) codes have become the coding scheme of choice in high data-rate communication systems after their rediscovery in the 1990s, due to their excellent error-correcting performance along with the availability of efficient high-throughput hardware implementations in modern CMOS technologies. LDPC codes are commonly decoded using iterative message passing (MP) algorithms, in which the initial estimations of the bits are improved by a continuous exchange of messages between decoder computation nodes.

Among the various MP decoding algorithms, the min-sum (MS) decoding algorithm and its variants [e.g., offset min-sum (OMS) and scaled MS] are the most common choices for hardware implementation. LDPC decoder hardware implementations traditionally start from one of these established algorithms (e.g., MS decoding), where the exchanged messages represent log-likelihood ratios (LLRs). These LLRs are encoded as fixed-point numbers in two's-complement or sign-magnitude representation, using a small number of uniform quantization levels, in order to realize the message update rules with low-complexity conventional arithmetic operations. Recently, there has been significant interest in the design of finite-alphabet decoders for LDPC codes. The main idea behind finite-alphabet LDPC decoders is to start from one or multiple arbitrary message alphabets, which can be encoded with a bit-width that is acceptable from an implementation complexity perspective. The message update rules are then crafted as generic mapping functions that operate on this alphabet.

The main advantage of such finite-alphabet decoders is that the message bit-width can be reduced significantly with respect to a conventional decoder, while maintaining the same error-correcting performance. The downside of this approach is that the message update rules of finite-alphabet decoders usually cannot be described using fast and area-efficient standard arithmetic circuits. Different hardware architectures for LDPC decoders have been proposed in the literature in order to fulfill the power and throughput requirements of various standards. More specifically, various degrees of resource sharing result in flexible decoders with different area requirements. On the one hand, partial-parallel LDPC decoders and block-parallel LDPC decoders are designed for medium throughput with modest silicon area. Full-parallel and unrolled LDPC decoders, on the other hand, achieve very high throughput (on the order of several tens or hundreds of gigabits per second) at the expense of large area requirements. Even though, in principle, LDPC decoders are massively parallelizable, the implementation of ultrahigh-speed LDPC decoders still remains a challenge, especially for long LDPC codes with large node degrees. While synthesis results for such long codes reported in the literature show the potential for a very high throughput, the actual implementation requires several further considerations, mainly due to severe routing problems and the impact of parasitic effects.

Contributions

In this paper, we propose an unrolled full-parallel architecture based on serial transfer of the decoding messages, which enables an ultrahigh throughput implementation of LDPC decoders for codes with large node degrees by reducing the required interconnect wires for such decoders. Moreover, we employ a finite-alphabet LDPC decoding algorithm in order to decrease the required quantization bit-width, and thus, to increase the throughput, which is limited by the serial message transfer in the proposed architecture. We also adopt a linear floor plan for the unrolled full-parallel architecture as well as an efficient pseudohierarchical flow that allows the high-speed physical implementation of the proposed decoder. To the best of the authors’ knowledge, by combining the aforementioned techniques, we present the fastest fully placed and routed LDPC decoder in the literature.

DISADVANTAGES:

  • Large area requirement
  • Low energy efficiency

PROPOSED SYSTEM:

Serial message-transfer LDPC decoder

Unrolled full-parallel LDPC decoders provide the ultimate throughput with smaller routing congestion than conventional full-parallel decoders. However, they are still not trivial to implement for long LDPC codes with high CN and VN degrees, which suffer from severe routing congestion.

Fig. 1. Serial message-transfer decoder architecture.

Hence, in this section, we propose an unrolled full-parallel LDPC decoder architecture that employs a serial message-transfer technique between CNs and VNs. This architecture is similar to bit-serial implementations in the way the messages are transferred; however, as we shall see later, it differs in the fact that it is unrolled and in the way the messages are processed in the CNs and VNs.

Decoder Architecture Overview

An overview of the proposed unrolled serial message-transfer LDPC decoder architecture is shown in Fig. 1. As with all unrolled LDPC decoders, each decoding iteration is mapped to a distinct set of N VN and M CN units, which form a processing pipeline. In essence, the unrolled LDPC decoder is a systolic array, in which a new set of N channel LLRs is read in every clock cycle, and a decoded codeword is put out in every clock cycle. Even though both the CNs and VNs can compute their outgoing messages in a single clock cycle, in the proposed serial message-transfer architecture each message is transferred one bit at a time between consecutive VN and CN stages over the course of Qmsg clock cycles, where Qmsg is the number of bits used for the messages. More specifically, each CN and VN unit contains a serial-to-parallel (S/P) and a parallel-to-serial (P/S) conversion unit at the input and output, respectively, which are clocked Qmsg times faster than the processing clock to collect and transfer messages serially, while keeping the overall decoding throughput constant. More details on the architecture of the CN and VN units as well as the proposed serial message-transfer mechanism are provided in the sequel.
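A toy model of the serial transfer, assuming LSB-first transmission of a Qmsg-bit message over Qmsg fast-clock cycles:

    Q_MSG = 5

    def parallel_to_serial(msg):
        """P/S: emit one bit per fast-clock cycle, LSB first."""
        return [(msg >> i) & 1 for i in range(Q_MSG)]

    def serial_to_parallel(bits):
        """S/P: collect Q_MSG serial bits back into one message."""
        msg = 0
        for i, bit in enumerate(bits):
            msg |= bit << i
        return msg

    msg = 0b10110
    assert serial_to_parallel(parallel_to_serial(msg)) == msg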

Finite-alphabet serial message-transfer LDPC decoder

Even though the serial message-transfer architecture alleviates the routing congestion of an unrolled full-parallel LDPC decoder, it has a negative impact on both throughput and hardware complexity, as discussed in the previous section. In our previous work, we have shown that finite-alphabet decoders have the potential to reduce the required number of message bits while maintaining the same error-rate performance. In this section, we review the basic idea and our design method for this type of decoder, and then show how the bit-width reduction technique can be applied in order to increase the throughput and reduce the area of the serial message-transfer architecture.

Mutual Information-Based Finite-Alphabet Decoder

In our approach, the update rules of the standard MP decoding algorithm are replaced by custom update rules that can be implemented as simple lookup tables (LUTs). These LUTs take integer-valued input messages and produce a corresponding output message. Moreover, the input-output mapping that is represented by the LUTs is designed in a way that maximizes the mutual information between the LUT output messages and the codeword bit that these messages correspond to. We note that a similar approach was also used in prior work, but the corresponding hardware implementation would have a much higher hardware complexity.
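The mechanics of a LUT-based VN update can be sketched on a small message alphabet. For simplicity, the table below is filled by quantizing the sum of its inputs; in the actual design, the mapping is optimized offline to maximize the mutual information, as described above.

    LEVELS = [-3, -2, -1, 1, 2, 3]     # toy 3-bit message alphabet

    def quantize(x):
        return min(LEVELS, key=lambda l: abs(l - x))

    # Precomputed LUT indexed by (channel level, CN message level).
    VN_LUT = {(ch, cn): quantize(ch + cn) for ch in LEVELS for cn in LEVELS}

    def vn_update(ch, cn):
        return VN_LUT[(ch, cn)]         # no arithmetic in the datapath

    print(vn_update(3, -1))             # -> 2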

Fig. 2. Frame error rate of the IEEE 802.3an LDPC code under floating-point MS decoding, fixed-point MS decoding with different bit-widths, LUT-based decoding, and floating-point OMS decoding (offset = 0.5) as reference, all with I = 5 decoding iterations.

Error-Correcting Performance and Bit-Width Reduction

In Fig. 2, we compare the performance of the IEEE 802.3an LDPC code under floating-point MS decoding, fixed-point MS decoding (with Qch = Qmsg ∈ {4, 5}), and LUT-based decoding (with Qch = 4 and Qmsg = 3) when performing I = 5 decoding iterations with a flooding schedule. We also show the performance of a floating-point OMS decoder as a reference. We observe that the fixed-point decoder with Qch = Qmsg = 5 has almost the same performance as the floating-point decoder, while the fixed-point decoder with Qch = Qmsg = 4 shows a significant loss with respect to the floating-point implementation. Thus, a standard MS decoder would need to use at least Qch = Qmsg = 5 quantization bits. The LUT-based decoder, however, can match the performance of the floating-point decoder with only Qch = 4 channel quantization bits and Qmsg = 3 message quantization bits. Additionally, with the above quantization bit choices, no noticeable error floor has been observed for either the MS decoder or the LUT-based decoder when 10^10 frames have been transmitted (which corresponds to a bit error rate of approximately 10^-12 at the current block length). We note that, for the LUT-based decoder, the performance in the error floor region can be traded against the performance in the waterfall region by an appropriate choice of the design signal-to-noise ratio (SNR) for the LUTs.

LUT-Based Decoder Hardware Architecture

The LUT-based serial message-transfer decoder hardware architecture is very similar to the MS decoder architecture described above. However, the LUT-based decoder can take advantage of the significantly fewer message bits that need to be transferred from one decoding stage to the next. This reduction lowers the number of CLKF cycles per iteration, which in turn increases the throughput of the decoder, provided that the CN/VN logic is sufficiently fast. Moreover, the size of the buffers needed for the S/P and P/S conversions is also reduced significantly, which directly reduces the memory complexity of the decoder. On the negative side, we remark that the VN units of each VN stage (decoder iteration) of the LUT-based decoder are different, which slightly complicates the hierarchical physical implementation, as we will see later. Furthermore, since Qch > Qmsg, we now need to transfer the channel LLRs with multiple (two) bits per cycle to avoid artificially limiting the number of CLKF cycles per iteration to Qch rather than to the smaller Qmsg. To reflect this modification, we redefine (10) as Qmax = max(Qmsg, Qch/2). While this partially parallel transfer of the channel LLRs impacts routing congestion, we note that the overhead is negligible, since the number of channel LLRs is small compared to the total number of messages.

ADVANTAGES:

  • Lower area
  • Higher energy efficiency

REFERENCES:

Reza Ghanaatian, Alexios Balatsoukas-Stimming, Thomas Christoph Müller, Student Member, IEEE, Michael Meidlinger, Gerald Matz, Senior Member, IEEE, Adam Teman, and Andreas Burg, Member, IEEE, “A 588-Gb/s LDPC Decoder Based on Finite-Alphabet Message Passing,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26, no. 2, February 2018.

An Efficient Fault-Tolerance Design for Integer Parallel Matrix–Vector Multiplications

ABSTRACT:

Parallel matrix processing is a typical operation in many systems, and in particular matrix–vector multiplication (MVM) is one of the most common operations in modern digital signal processing and digital communication systems. This paper proposes a fault-tolerant design for integer parallel MVMs. The scheme combines ideas from error correction codes with the self-checking capability of MVM. Field-programmable gate array evaluation shows that the proposed scheme can significantly reduce the overheads compared to protecting each MVM on its own. Therefore, the proposed technique can be used to reduce the cost of providing fault tolerance in practical implementations.

SOFTWARE IMPLEMENTATION:

  • Modelsim
  • Xilinx 14.2

EXISTING SYSTEM:

Matrix–vector multiplication (MVM) is a typical computation in domain transformation, projection, or feature extraction. With the increasing demand for high performance and throughput, parallel computing is becoming more important in high-performance computing systems, cloud computing systems, and communication systems. For example, parallel matrix processing with a common input vector (parallel MVMs) is performed for precoding in multiuser multiple-input multiple-output (MIMO) systems (especially large-scale MIMO) and in high-performance low-density parity-check decoders. Parallel processing may run on a large number of processing units, e.g., GPUs or distributed systems, and experiments show that soft errors can affect the outputs. In a classical algorithm-based fault tolerance (ABFT) design, the two input matrices are changed to a “row checksum” matrix and a “column checksum” matrix, respectively, and single errors in the resulting “full checksum” matrix can be detected and corrected based on its row checksum and column checksum.

Since the row checksum does not exist for an input vector, such an ABFT scheme can only be used for error detection in MVM. In other proposed ABFT schemes, a checksum technique is used to locate the affected result element or blocks of result elements in the MVM, and partial recomputation is used for error correction. Such schemes are suitable for cases in which the delay introduced by recomputation is acceptable. But this is not always the case, especially in high-throughput communication receivers where the inputs arrive in real time from A/D converters. A general method to protect linear dynamic systems has also been proposed; part of that system model can be extracted as an MVM.

In another line of work, a method to protect parallel digital filters was proposed. The scheme is based on considering each filter as a bit in an error correction code (ECC), so that parity check filters are added and used to detect and correct errors. These schemes share a similar approach in that redundant rows are added to the processing matrix by combination (or coding) of the original rows, so that the relations among the multiple outputs can be used to locate and correct the erroneous elements. However, this approach only works for a single MVM. The protection of parallel fast Fourier transform (FFT) operations on different input vectors has also been studied; in that case, a self-checking property of the FFT was combined with the ECC method to reduce the overhead required for protection. We previously proposed an idea based on ECC and MVM self-checking (checksums) for error protection of parallel MVMs. We assume real-time applications in which integer vectors come from ADCs with fixed timing, and processing is synchronous and repetitive, so recomputation is not tolerable.

This paper is an extension of that two-page conference paper and provides new results based on the FPGA implementation. To the best of the authors' knowledge, no other fault-tolerant design for parallel MVMs has been proposed. In the proposed structure, a “detection matrix” D and a “sum matrix” A are added for fault tolerance. D is generated by applying an ECC to the checksum row of each matrix, and A is a combination of all matrices under protection. By comparing the results of the original MVMs with that of the check matrix, the fault can be located. Then, the erroneous output vector can be recovered by subtracting the correct output vectors from the sum output vector.

DISADVANTAGES:

  • Overheads are not reduced
  • High cost
  • Errors are not easily identified

PROPOSED SYSTEM:

Fault-tolerance design for parallel matrix processing

1. Basic Structure

The basic structure of the proposed fault-tolerant system for parallel MVMs is shown in Fig. 1. From Fig. 1, we see that the original P matrices remain unchanged, and the redundant part includes two matrices (regardless of the number of parallel matrices under protection): the detection matrix D and the sum matrix A.

Fig. 1. Fault-tolerant structure for parallel MVMs.

By comparing the P outputs with the checking result, the faulty vector can be detected, and then the correct output vector can be obtained by subtracting the correct vectors from the “sum output vector.” The number of rows of D (Rp) is determined by the number of parallel matrices (P). When considering the case that only one MVM may fail, a Hamming code can be used for protection, and the relationship between P and Rp follows the Hamming condition 2^Rp ≥ P + Rp + 1.

On the other hand, in our proposed technique (Fig. 1), a branch is an entire MVM processing.

Generation of Detection Matrix and SUM Matrix

The sum matrix is simply the direct sum of the original matrices.

For the detection matrix, let us assume P = 4 to facilitate the illustration of the basic idea, and assume that only one branch output may contain errors. First, a checksum row is generated for each matrix as the sum of all of its rows. The rows of the detection matrix D are then generated as combinations of these checksum rows, following the parity equations of a Hamming code. If there are no errors, the sum of the elements of each branch output vector matches the corresponding checking result, where “sum(x)” denotes the addition of all items within vector x; each checking result is an integer vector, and sum(·) produces an integer scalar. In this way, error detection of a single branch can be done. After identifying the failing branch, the corresponding output vector can be recovered by subtracting the correct output vectors from the sum output vector.

Fig. 2. Implementation structure of MVM (N = 20).

Since the error detection is based on the sum of the output vectors, and the whole vector is recovered during the correction, the proposed scheme can deal with multiple errors in a single output vector.
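The complete detect-and-correct flow can be illustrated with a short NumPy sketch. For clarity, the detection part keeps one checksum row per matrix instead of compressing the rows with a Hamming code as the paper does; the matrices and the injected error are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    P, N = 4, 8
    mats = [rng.integers(-5, 5, size=(N, N)) for _ in range(P)]

    D = np.stack([m.sum(axis=0) for m in mats])   # checksum row of each matrix
    A = np.sum(mats, axis=0)                      # sum matrix

    x = rng.integers(-5, 5, size=N)
    outputs = [m @ x for m in mats]
    outputs[2][3] += 7                            # inject an error in branch 2

    # Detection: the elements of a correct output vector add up to the
    # corresponding checksum result (MVM self-checking, exact for integers).
    faulty = [i for i in range(P) if outputs[i].sum() != (D @ x)[i]]
    assert faulty == [2]

    # Correction: recover the whole faulty vector from the sum output vector.
    outputs[2] = A @ x - sum(outputs[i] for i in range(P) if i != 2)
    assert np.array_equal(outputs[2], mats[2] @ x)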

ADVANTAGES:

  • Overheads are reduced
  • Low cost
  • Errors can be easily identified by the error detection method

REFERENCES:

Zhen Gao, Qingqing Jing, Yumeng Li, Pedro Reviriego, and Juan Antonio Maestro, “An Efficient Fault-Tolerance Design for Integer Parallel Matrix–Vector Multiplications,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2018.

Analysis and Design of Cost-Effective, High-Throughput LDPC Decoders

ABSTRACT:

This paper introduces a new approach to cost-effective, high-throughput hardware designs for low-density parity-check (LDPC) decoders. The proposed approach, called nonsurjective finite alphabet iterative decoders (NS-FAIDs), exploits the robustness of message-passing LDPC decoders to inaccuracies in the calculation of exchanged messages, and it is shown to provide a unified framework for several designs previously proposed in the literature. NS-FAIDs are optimized by density evolution for regular and irregular LDPC codes and are shown to provide different tradeoffs between hardware complexity and decoding performance. Two hardware architectures targeting high-throughput applications are also proposed, integrating both Min-Sum (MS) and NS-FAID decoding kernels. ASIC post-synthesis implementation results on 65-nm CMOS technology show that NS-FAIDs yield significant improvements in the throughput-to-area ratio, by up to 58.75% with respect to the MS decoder, with even better or only slightly degraded error correction performance.

SOFTWARE IMPLEMENTATION:

  • Modelsim
  • Xilinx 14.2

EXISTING SYSTEM:

The increasing demand for massive data rates in wireless communication systems will require significantly higher processing speed of the baseband signal compared with conventional solutions. This is especially challenging for forward error correction (FEC) mechanisms, since FEC decoding is one of the most computationally intensive baseband processing tasks, consuming a large amount of hardware resources and energy. The use of very large bandwidths will also result in stringent, application-specific requirements in terms of both throughput and latency. The conventional approach to increasing throughput is to use massively parallel architectures. In this context, low-density parity-check (LDPC) codes are recognized as the foremost solution, due to the intrinsic capacity of their decoders to accommodate various degrees of parallelism. They have found extensive applications in modern communication systems, due to their excellent decoding performance, high-throughput capabilities, and power efficiency, and have been adopted in several recent communication standards. This paper targets the design of cost-effective, high-throughput LDPC decoders. One important characteristic of LDPC decoders is that the memory and interconnect blocks dominate the overall area/delay/power performance of the hardware design. To address this issue, we build upon the concept of finite alphabet iterative decoders (FAIDs) introduced in earlier work. While FAIDs have previously been investigated for variable-node (VN) regular LDPC codes over the binary symmetric channel, this paper extends their use to any channel model and to both regular and irregular LDPC codes.

The approach considered in this paper, referred to as nonsurjective FAIDs (NS-FAIDs), is to allow storing the exchanged messages using a lower precision (a smaller number of bits) than that used by the processing units. The basic idea is to reduce the size of the exchanged messages once they have been updated by the processing units. Hence, to some extent, the proposed approach is akin to the use of imprecise storage, which is seen as an enabler for cost and throughput optimizations. Moreover, NS-FAIDs are shown to provide a unified framework for several designs previously proposed in the literature, including the normalized and offset Min-Sum (OMS) decoders, the partially OMS (POMS) decoder, previously proposed MS-based decoders, and the recently introduced dual-quantization-domain MS decoder.

This paper refines and extends some of the concepts we previously introduced. In particular, the definition of NS-FAIDs is extended so as to cover a larger class of decoders, which is shown to significantly improve the decoding performance in case the exchanged messages are quantized on a small number of bits (e.g., 2 bits per exchanged message). We show that NS-FAIDs can be optimized using the density evolution (DE) technique, so as to obtain the best tradeoffs between hardware complexity and decoding performance.
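The storage-side idea can be illustrated with a toy framing/deframing pair. The specific function below (dropping one LSB with saturation) is just one simple instance; the optimized framing functions are obtained by density evolution.

    Q_PROC = 4    # bits used inside the processing units
    Q_MEM = 3     # bits used to store the exchanged messages

    def frame(msg):
        """Map a signed Q_PROC-bit message to a Q_MEM-bit stored value."""
        lim = (1 << (Q_MEM - 1)) - 1
        return max(-lim, min(lim, msg >> 1))

    def deframe(stored):
        """Map a stored value back to the processing alphabet."""
        return stored << 1

    print(frame(7), deframe(frame(7)))   # 3 6: coarser, but cheaper to store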

DISADVANTAGES:

  • Cost is high
  • Error correction performance is poor

PROPOSED SYSTEM:

Full-Layer Architecture

A different possibility to increase throughput is to increase the hardware parallelism by including several nonoverlapping rows of the base matrix in one decoding layer. For instance, for the base matrix considered, we may include RPL = 4 consecutive rows per decoding layer, and thus the number of decoding layers is L = 3. In this case, each column of the base matrix has one (and only one) nonzero entry in each decoding layer; such a decoding layer is referred to as being full.

Fig. 1. Mapping between VNs and VNUs. Black: VNs of degree 2. Red: VNs of degree 3. Blue: VNs of degree 6.

Full layers correspond to the maximum hardware parallelism that can be exploited by layered architectures, but they also prevent the pipelining of the data path (Fig. 1 shows the mapping between VNs and VNUs). One possibility to implement a full-layer decoder is to use an architecture similar to the pipelined one, by removing the registers inserted after the VNU (since pipelining is incompatible with the use of full layers) and updating the control unit. However, in such an architecture, read/write operations from/to the β_memory would occur at the same memory location, corresponding to the current layer being processed. This would require the use of asynchronous dual-port RAM to implement the β_memory, which in general is known to be slower than synchronous dual-port RAM. The architecture proposed in this section is aimed at avoiding the use of asynchronous RAM, while providing an effective way to benefit from the increased hardware parallelism enabled by the use of full layers. We discuss below the main changes with respect to the pipelined architecture, consisting of the α_memory and the barrel shifter blocks (the other blocks are the same as for the pipelined architecture), as well as a complete reorganization of the data path. However, it can easily be verified that the two architectures are logically equivalent, i.e., they both implement the same decoding algorithm.

1) α_Memory: This memory stores the VN-messages for the current decoding layer (unlike the previous architecture, the AP-LLR values are not stored in memory). Since only one unsaturated VN-message is stored for each VN, this memory has exactly the same size as the corresponding memory used within the previous pipelined architecture. VN-messages for the current layer are read from the α_memory, then saturated or framed depending on the decoding kernel, and supplied to the corresponding CNUs. CN-messages computed by the CNUs are stored in the β_memory (at the location corresponding to the current layer ℓ) and also forwarded to the AP-LLR unit through the DCP (decompress) and DE-FRA (deframing) blocks, according to the CNU implementation (compressed or uncompressed) and the decoding kernel (MS or NS-FAID). The AP-LLR unit computes the sum of the incoming VN- and CN-messages, which corresponds to the AP-LLR value to be used at layer ℓ + 1 (since it is already updated by layer ℓ). The AP-LLR value is forwarded to the VNU through the corresponding BS and PER blocks. Eventually, the VN-message for layer ℓ + 1 is computed as the difference between the incoming AP-LLR and the corresponding layer-(ℓ + 1) CN-message computed at the previous iteration, the latter being read from the β_memory.

Fig. 2. High-level description of the proposed HW architectures, with both MS and NS-FAID kernels. (a) Pipelined architecture. (b) Full-layer architecture.

2) PER/BS Blocks: The PER_1/BS_1 blocks permute/shift the data read from the input buffer according to the positions/values of the nonnegative entries in the first decoding layer. Similar to the BS_R blocks in the pipelined architecture, the PER_WR/BS_WR blocks permute/shift the AP-LLR values according to the difference between the positions/values of the nonnegative entries of the current layer (ℓ) and those of the next layer (ℓ + 1). This way, the VN-messages stored in the α_memory are already permuted and shifted for the subsequent decoding layer. Finally, the PER_L/BS_L blocks permute/shift the hard-decision bits (signs of the AP-LLR values) according to the positions/values of the nonnegative entries in the last decoding layer.

ADVANTAGES:

  • Cost effective
  • Error correction performance is good

REFERENCES:

Thien Truong Nguyen-Ly, Valentin Savin, Khoa Le, David Declercq, Senior Member, IEEE, Fakhreddine Ghaffari, and Oana Boncalo, “Analysis and Design of Cost-Effective, High-Throughput LDPC Decoders,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26, no. 3, March 2018.

Approximate Hybrid High Radix Encoding for Energy-Efficient Inexact Multipliers

ABSTRACT:

Approximate computing forms a design alternative that exploits the intrinsic error resilience of various applications and produces energy-efficient circuits with small accuracy loss. In this paper, we propose an approximate hybrid high radix encoding for generating the partial products in signed multiplications that encodes the most significant bits with the accurate radix-4 encoding and the least significant bits with an approximate higher radix encoding. The approximations are performed by rounding the high radix values to their nearest power of two. The proposed technique can be configured to achieve the desired energy–accuracy tradeoffs. Compared with the accurate radix-4 multiplier, the proposed multipliers deliver up to 56% energy and 55% area savings, when operating at the same frequency, while the imposed error is bounded by a Gaussian distribution with near-zero average. Moreover, the proposed multipliers are compared with state-of-the-art inexact multipliers, outperforming them by up to 40% in energy consumption, for similar error values. Finally, we demonstrate the scalability of our technique.

SOFTWARE IMPLEMENTATION:

  • Modelsim
  • Xilinx 14.2

EXISTING SYSTEM:

In modern embedded systems and data centers, energy efficiency is a mandatory design concern. Considering that a large number of application domains exhibit an intrinsic error tolerance, e.g., digital signal processing (DSP), image processing, data analytics, and data mining, approximate computing appears as an effective solution to reduce their power dissipation. In approximate computing, error is viewed as a commodity that can be traded for significant gains in cost (e.g., power, energy, and performance); as a result, it composes a promising design paradigm for energy-efficient systems, extremely decreasing the power consumption of inherently error-resilient applications.

Specifically, approximate computing exploits the innate error tolerance of the respective applications and deliberately relaxes the correctness of some computations in order to decrease their power consumption and/or accelerate their execution. Recently, aiming to take advantage of its benefits, massive research has been conducted in the field of hardware approximate circuits. The main targets are arithmetic units, e.g., adders and multipliers, which are the core components in many embedded devices and hardware accelerators. Extensive research is reported on approximate adders, providing significant gains in terms of delay and power dissipation. However, research activity on approximate multipliers is less comprehensive than that on approximate adders. In inexact multipliers, approximations can be applied to the partial product generation as well as to the partial product accumulation. Approximations on the partial product generation and approximations on their accumulation are synergistic and can be applied in collaboration in order to achieve higher power reduction.

Although significant research has been conducted on the partial product accumulation, research activity on the approximation of the partial product generation is still limited. Another limitation of the existing approximate multipliers is that most of them do not examine signed multiplication. In this paper, targeting the design of inexact multipliers by applying approximations on the partial product generation, we propose a novel approximate hybrid high radix encoding. In the proposed technique, the most significant bits (MSBs) of the multiplicand are encoded using the radix-4 encoding, whereas the k least significant bits (LSBs) are encoded using a radix-2^k encoding (with k ≥ 4). To simplify the increased complexity induced by the proposed hybrid encoding, the circuit for generating the partial products is approximated by altering its truth table accordingly.

Hence, the number of partial products decreases significantly and simpler tree architectures can be used for their accumulation, reducing the multiplier's energy consumption, area, and critical path delay. The major contributions of this paper are summarized as follows. 1) We propose and enable the application of hybrid high radix encodings for the generation of energy-efficient approximate multipliers, overcoming the increased hardware complexity of very high radix encodings. 2) The proposed technique can be applied to any multiplier architecture and is reconfigurable, enabling the user to select the optimal per-application energy–error tradeoff. 3) An analytical error analysis is conducted, showing that the output error of the proposed technique is bounded and predictable. Such a rigorous error analysis leads to precise and a priori error estimation for any input distribution, without the need for time-consuming simulations. 4) We show that the proposed multipliers outperform state-of-the-art inexact multipliers in terms of hardware and accuracy, achieving up to 40% less energy dissipation for comparable error values. More specifically, the proposed technique is applied to a 16×16-bit multiplier and is evaluated using industrial-strength tools, i.e., Synopsys Design Compiler, PrimeTime, and Mentor Graphics ModelSim. Compared with the accurate multiplier, the proposed technique delivers up to 55% energy and area reduction, for a mean relative error of up to 0.93%. Moreover, compared with related state-of-the-art approximate signed multipliers, our technique outperforms them significantly in terms of energy consumption and error. Finally, we examine the scalability of the proposed technique and show that, for the same error value, the delivered energy savings increase as the multiplier size increases.

DISADVANTAGES:

  • Accuracy is low
  • Energy efficiency is low
  • Area requirement is high
  • Errors are not reduced

PROPOSED SYSTEM:

Approximate hybrid high radix multipliers

High radix encodings offer partial product reduction, and as a result, their accumulation requires smaller trees, leading to energy, area, and/or delay savings. However, high radix encodings require complex encoding and partial product generation circuits, negating the benefits of the partial product reduction. In this section, the proposed hybrid high radix encoding and the approximations performed to simplify its circuit complexity are presented. In the proposed technique, the multiplicand B is encoded using the approximate high radix encoding, generating B̃, and the approximate multiplication A·B̃ is performed. Finally, its adaptation to inexact 16-bit hardware multipliers is described, and a qualitative analysis is conducted to estimate the potential area gains.

Hybrid High Radix Encoding

In the proposed hybrid high radix encoding, B is divided into two groups: the MSB part of n − k bits and the LSB part of k bits. The configuration parameter k ≥ 4 is an even number, namely, k = 2m: m ∈ Z, m ≥ 2. The MSB part is encoded using the radix-4 (modified Booth) encoding, while the LSB part is encoded with the high radix-2^k encoding. The radix-4 encoding produces (n − k)/2 digits y_j ∈ {0, ±1, ±2}, while the radix-2^k encoding produces a single digit y_0 ∈ {0, ±1, ±2, ±3, …, ±(2^(k−1) − 1), −2^(k−1)}. Overall, B is encoded with (n − k)/2 + 1 digits. This hybrid high radix encoding is characterized by increased logic complexity, due to the high radix values of y_0 that are not powers of two, and thus an approximate version is proposed. In order to retain high accuracy, however, the radix-4 encoding of the MSBs is performed accurately. In particular, in the approximate encoding, all the values of y_0 that are not powers of two, as well as the k − 4 smallest powers of two, are rounded to the nearest of the four largest powers of two or to 0, such that the sum of all the values of the approximate digit is 0. We keep only the four largest powers of two so that the radix-2^k encoding circuit requires only about double the area of the accurate radix-4 encoder. Therefore, B is approximately encoded as B̃ = ỹ_0 + Σ_(j=k/2)^(n/2−1) y_j·2^(2j), where y_j ∈ {0, ±1, ±2} are the accurate radix-4 digits and ỹ_0 ∈ {0, ±2^(k−4), ±2^(k−3), ±2^(k−2), ±2^(k−1)} is the approximate radix-2^k digit.

Table I presents the accurate radix-4 encoding. The output signals sign_j, ×1_j, and ×2_j define the radix-4 digit y_j. Following the standard modified Booth recoding of the overlapping bit triplet (b_(2j+1), b_(2j), b_(2j−1)), their logic equations are sign_j = b_(2j+1), ×1_j = b_(2j) ⊕ b_(2j−1), and ×2_j = (b_(2j+1) ⊕ b_(2j)) · ¬(b_(2j) ⊕ b_(2j−1)).

Table II presents the approximate radix-2^k encoding. The encoding signals that define the radix-2^k digit ỹ_0 are sign, ×2^(k−4), ×2^(k−3), ×2^(k−2), and ×2^(k−1).

The effectiveness of the approximate hybrid radix encoding technique is explored through its application to 16-bit signed numbers for k = 6, 8, and 10; namely, the LSBs are encoded using the radix-64, radix-256, and radix-1024 encodings, respectively (Figs. 1 and 2).

Fig. 1. i-bit partial product generator based on (a) the accurate radix-4 encoding and the approximate (b) radix-64, (c) radix-256, and (d) radix-1024 encodings. a_i: ith bit of operand A; ā_i = a_i ⊕ sign.

Fig. 2. Partial product trees based on the hybrid encoding of accurate radix-4 and the approximate (a) radix-64, (b) radix-256, and (c) radix-1024 encodings. The tree markers distinguish partial product bits from the approximate high radix encoding, partial product bits from the accurate radix-4 encoding, inverted MSBs of the partial products, and sign factors.

In the radix-64 encoding, the k = 6 LSBs of B are grouped into a single radix-64 digit.

The following values of the digit are rounded to their nearest power of two: ±3, ±5, ±6, ±7, ±9, …, ±15, ±17, …, ±31 are rounded to ±4, ±8, ±16, or ±32, while the smallest powers of two, i.e., ±1 and ±2, are rounded to 0 or ±4.

In the radix-256 encoding, the nonpowers of two are analogously rounded to ±16, ±32, ±64, or ±128, and the smallest powers of two (±1, ±2, ±4, ±8) are rounded to 0 or ±16. In the radix-1024 encoding, the k = 10 LSBs of B are grouped into a single radix-1024 digit.

Similarly, the nonpowers of two are rounded to ±64, ±128, ±256, or ±512, and the smallest powers of two (±1, ±2, ±4, ±8, ±16, ±32) are rounded to 0 or ±64. The encoder's inputs are the bits b9, b8, …, b0; the approximate radix-1024 digit is ỹ_0 ∈ {0, ±64, ±128, ±256, ±512}; and the output signals that define ỹ_0 are sign, ×64, ×128, ×256, and ×512.
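
To make the rounding concrete, the following Python sketch extracts the hybrid digits of a 16-bit operand and applies the approximate rounding described above. It is an illustrative behavioral model only: the function names are ours, and the tie-breaking rule (ties round toward the larger magnitude, so that ±2 maps to ±4 for k = 6) is an assumption consistent with, but not stated by, the text.

def hybrid_digits(B, n, k):
    # Split an n-bit two's-complement B into one radix-2^k LSB digit and
    # (n - k)/2 accurate radix-4 (modified Booth) MSB digits, so that
    # B = lsb + sum(y_j * 2**(2*j)) for j = k/2 .. n/2 - 1.
    b = [(B >> i) & 1 for i in range(n)]                  # b[0] is the LSB
    lsb = -b[k - 1] * 2 ** (k - 1) + sum(b[i] << i for i in range(k - 1))
    msb = [-2 * b[2 * j + 1] + b[2 * j] + b[2 * j - 1]    # Booth triplets
           for j in range(k // 2, n // 2)]
    return lsb, msb

def approx_round(d, k):
    # Round a radix-2^k digit onto {0, +/-2^(k-4), ..., +/-2^(k-1)}, i.e.,
    # zero or the four largest powers of two. Ties go to the larger
    # magnitude (assumption), so +/-2 rounds to +/-4 for k = 6.
    allowed = [0] + [s * 2 ** e for e in range(k - 4, k) for s in (1, -1)]
    return min(allowed, key=lambda a: (abs(d - a), -abs(a)))

# k = 6 (radix-64): the digit is rounded onto {0, +/-4, +/-8, +/-16, +/-32}.
n, k = 16, 6
B = -21931                                # arbitrary 16-bit signed operand
lsb, msb = hybrid_digits(B, n, k)
assert lsb + sum(y << (2 * j) for j, y in
                 zip(range(k // 2, n // 2), msb)) == B    # exact recombination
print(lsb, approx_round(lsb, k), msb)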

Partial Product Generation

In the proposed hybrid encoding, the n − k MSBs of B are encoded with the accurate radix-4 encoding, while the k LSBs are encoded with the approximate radix-2^k encoding. The accurate radix-4 encoder produces the signals sign_j, ×1_j, and ×2_j defined above, whereas the approximate high radix encoder produces the signals sign, ×2^(k−4), …, ×2^(k−1). Overall, k/2 − 1 fewer partial products are generated in the multiplication A · B̃.

In Fig. 1, four partial product generators are presented, i.e., the circuit of the accurate radix-4 encoding and those of the three approximate high radix encodings discussed. The partial products created by each encoding are shown in Table III. In addition, the three hybrid high radix encodings create the partial product trees shown in Fig. 2. The trees also include the encoding's correction term (constant terms and sign factors). The implementation of the partial product accumulation can be chosen by the designer. In this paper, an accurate Wallace tree is used to implement the sum of the partial products, and the two outputs produced by the Wallace tree are added using a prefix (fast) adder. Overall, the multiplication circuit consists of the stages of operand hybrid radix encoding, partial product generation, partial product accumulation, and final addition. The proposed approximate multipliers are named RAD2^k, indicating the selected approximate high radix encoding, e.g., RAD64, RAD256, and RAD1024.
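
As a rough behavioral illustration of the resulting multiplier (not the RTL), the sketch below reuses hybrid_digits and approx_round from the previous listing to model A · B̃ for k = 8 (RAD256). It shows the partial product count dropping from n/2 = 8 for pure radix-4 to (n − k)/2 + 1 = 5, i.e., a reduction of k/2 − 1 = 3, and prints the relative error for one operand pair; the error statistics quoted in the abstract require the paper's analysis, not this toy.

def approx_multiply(A, B, n=16, k=8):
    # Behavioral model of A * B~ under the hybrid encoding: one approximate
    # radix-2^k partial product plus (n - k)/2 accurate radix-4 ones.
    lsb, msb = hybrid_digits(B, n, k)
    pps = [approx_round(lsb, k) * A] + \
          [y * A << (2 * j) for j, y in zip(range(k // 2, n // 2), msb)]
    return sum(pps), len(pps)

A, B = 12345, -6789
approx, n_pp = approx_multiply(A, B)      # RAD256: 5 partial products
print(n_pp, approx, A * B, abs(approx - A * B) / abs(A * B))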

ADVANTAGES:

  • Accuracy is high
  • Energy efficiency is high
  • Area requirement is low
  • Errors are reduced

REFERENCES:

Vasileios Leon, Georgios Zervakis, Dimitrios Soudris, and Kiamal Pekmestzi, “Approximate Hybrid High Radix Encoding for Energy-Efficient Inexact Multipliers”, IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2018.

Basic-Set Trellis Min–Max Decoder Architecture for Nonbinary LDPC Codes With High-Order Galois Fields

Basic-Set Trellis Min–Max Decoder Architecture for Nonbinary LDPC Codes With High-Order Galois Fields

ABSTRACT:

Nonbinary low-density parity-check (NB-LDPC) codes outperform their binary counterparts in terms of error-correction performance. However, the drawback of NB-LDPC decoders is high complexity, especially for the check node unit (CNU), and the complexity increases considerably when increasing the Galois-field (GF) order. In this paper, a novel basic-set trellis min–max algorithm is proposed to greatly reduce not only the CNU complexity but also the number of messages exchanged between the check node and the variable node compared with previous studies, which is highly efficient for higher order GFs. In addition, the proposed CNU is designed to compute the messages in a parallel way. Layered decoder architectures based on the proposed algorithm were implemented for the (837, 726) NB-LDPC code over GF(32) and the (1512, 1323) code over GF(64) using 90-nm CMOS technology, and obtained a reduction in complexity of 30% and 37% for the CNU, and 40% and 37.4% for the whole decoder, respectively. Moreover, the proposed decoder achieves a higher throughput, at 1.67 Gbit/s and 1.4 Gbit/s, compared with the other state-of-the-art high-rate NB-LDPC decoders with high-order GFs.

SOFTWARE IMPLEMENTATION:

  • Modelsim
  • Xilinx 14.2

EXISTING SYSTEM:

Nonbinary low-density parity-check (NB-LDPC) codes defined over Galois fields (GFs) GF(q) with q > 2 outperform their binary counterparts in error-correcting performance, particularly in the error-floor region, when the code length is moderate. In addition, these codes have a good burst-error-correction capability, especially for high-order GFs. Research results demonstrate that NB-LDPC codes provide superior performance compared with the best optimized binary LDPC codes over fading channels, and the combination of NB-LDPC codes with high-order modulations improves both the bandwidth efficiency and the error-correction capability. Moreover, the elimination of the error floor is critical for flash memory applications, and NB-LDPC codes show much promise for multilevel flash memories. However, the main disadvantage of NB-LDPC codes is their highly complex decoding algorithms, which make it difficult to achieve maximum throughput and minimum area in their architectures. In practical implementations, NB-LDPC decoders have several drawbacks, such as a highly complex check node unit (CNU), a large area spent on storage elements, and routing congestion. First, the belief propagation (BP) algorithm used for binary LDPC decoding was introduced for NB-LDPC decoding. Then, a fast Fourier transform BP (FFT-BP) algorithm in the probability domain was proposed to reduce the computational complexity of check node processing by replacing the convolutional operations with multiplications in the frequency domain.

Although the probability-domain algorithm provides optimal error-correcting performance, the large number of additions and multiplications causes an exponential increase in hardware complexity. The FFT-BP algorithm based on the logarithm domain therefore uses log-likelihood ratio (LLR) values instead of probability values to decode the channel messages, so that the multiplications are replaced with additions. For practical NB-LDPC decoder implementations, suboptimal algorithms such as the extended min-sum (EMS) and the min–max algorithm have been proposed to reduce the complexity of the CNU, the main bottleneck of the NB-LDPC decoder.

The min–max algorithm is interesting because it uses comparisons instead of additions in the check node processing, which not only reduces the hardware complexity but also prevents the numerical growth of the decoder. In addition, a forward–backward scheme was utilized to derive the check node output messages. This scheme involves sequential computations, which cause a throughput problem for the decoder architectures; moreover, additional storage memories are required for the intermediate forward and backward messages. Recently, the path construction algorithms and the relaxed min–max (RMM) algorithm introduced the trellis representation for check node processing to eliminate the computation of the forward–backward messages, thus reducing the memory requirement for the intermediate messages. The RMM algorithm, which uses the minimum basis to generate the check node output messages, was proposed for NB-LDPC decoders and further reduces the check node complexity.

However, the sequential check node processing requires a large number of clock cycles, which limits the maximum throughput of the decoder. The trellis EMS algorithm was then proposed to improve the throughput of NB-LDPC decoders, where the check node output messages are generated in parallel by means of an extra column inserted into the original trellis. A disadvantage of these decoders is their high area, which reduces the overall decoder efficiency. Building on this idea, the simplified trellis min–max (STMM) algorithm was proposed to improve the throughput of min–max decoders with less complexity. The one-minimum-only TMM algorithm was then introduced on the basis of the STMM algorithm to reduce the CNU complexity by obtaining only one minimum and estimating the second one.

In these decoders, q × dc check node output messages are exchanged between the check node and the variable nodes. For high-order GFs or high-rate NB-LDPC codes, there are two main drawbacks. First, the amount of exchanged messages increases, which causes wiring congestion and thus limits the maximum throughput of the decoders. Second, the check node output messages are stored in memory for the next decoding iteration in layered decoders. The memory requirement therefore becomes large, which leads to a significant growth in decoder area for NB-LDPC codes.

To overcome these drawbacks, a compression technique was originally introduced to reduce the messages exchanged between one check node and the variable nodes to four sets, including the intrinsic and extrinsic information, the path coordinates, and the hard-decision symbols, with a size of 5 × (q − 1) + dc messages and without any error-correcting performance loss. For further improvement, later research simplified the CNU architecture and reduced the exchanged messages to 4 × (q − 1) + dc with similar error-correcting performance. Approximated TMM algorithms were introduced to reduce the amount of intrinsic information from (q − 1) elements to only two elements and Lq elements, respectively, at the cost of some error-correcting performance loss; the remaining elements are calculated from approximation functions.

In this paper, a novel basic-set TMM (BS-TMM) algorithm is proposed for NB-LDPC codes based on the theory of the GF GF(q = 2^p), where each field element is uniquely represented by a linear combination of p independent field elements. In the proposed BS-TMM algorithm, only the intrinsic information of the basic set of p = log2 q independent field elements in the extra column is stored, and the other elements are constructed from this basic set. Moreover, a novel algorithm is introduced for finding, in parallel, the p independent field elements with the most reliable messages of the basic set. The BS-TMM algorithm reduces the messages exchanged between one check node and the variable nodes from 4 × (q − 1) + dc [16] to (q − 1) + 3 × p + dc, with a negligible performance loss of 0.1 dB. The proposed method provides a great area reduction and throughput improvement for NB-LDPC decoders with good error-correction performance. It is therefore highly efficient for the design of high-rate, high-order NB-LDPC decoders. Two NB-LDPC decoders, for the (837, 726) code over GF(32) and the (1512, 1323) code over GF(64), were implemented on the basis of the BS-TMM algorithm.
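
The message-count formulas above are easy to sanity-check in a few lines of Python. The field order q = 32 (hence p = 5) matches the (837, 726) code; the check node degree dc = 27 used below is our own illustrative assumption, not a value taken from this text.

from math import log2

def message_counts(q, dc):
    p = int(log2(q))
    return (q * dc,                  # conventional exchange: q x dc
            4 * (q - 1) + dc,        # compressed C2V messages
            (q - 1) + 3 * p + dc)    # proposed BS-TMM

print(message_counts(q=32, dc=27))   # (864, 151, 73) for these parameters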

DISADVANTAGES:

  • Less efficient
  • Complexity is high

PROPOSED SYSTEM:

Decoder Architecture

In this section, a complete decoder architecture based on the BS-TMM algorithm is designed for NB-LDPC codes. The proposed decoder architecture achieves a great reduction in area because of the large area reduction in the CNU architecture. In addition, an improvement in throughput is obtained, since reducing the wires between the check node and variable node processors mitigates the routing congestion. The layered min–max algorithm for the proposed decoder is presented in Algorithm 6, where the BS-TMM of Algorithm 3 is implemented in the check node processor. In addition, the decompression network (DN) corresponding to Algorithm 5 is implemented in the variable node processor to generate the C2V messages Rmn(a) from the outputs of the CNU architecture. The DN has three parts:

Fig. 1. Proposed extra column and path information generator for GF(8). (a) Extra column generator for the jth element. (b) Control signal generator. (c) Path information generator for the jth element.

1) generating the LLR values of the extra column Q(a) and the path information d(a), with a maximum of p deviations, on the basis of the basic set B* = {m1*_l, I*_l, a*_l}, 1 ≤ l ≤ p; 2) generating the C2V messages in the delta domain as Rmn(a) on the basis of Q(a), E(a), and d(a); and 3) converting the C2V messages from the delta domain to the normal domain. Note that two DNs are required in the variable node processor; nevertheless, the proposed decoder area is much lower than that of conventional decoders.

First, Fig. 1 shows the proposed extra column and path information generator for GF(8). The LLR value of each element in the extra column, Q(a_j), is selected from one of the p LLR values m1*_l (1 ≤ l ≤ p) in the basic set B*, depending on the p control signals s_l[j] (1 ≤ l ≤ p), as shown in Fig. 1(a). (q − 1) copies of the architecture in Fig. 1(a) are required to compute the (q − 1) messages of Q(a) simultaneously. The p control signals s_l[j] are generated using the architecture in Fig. 1(b). To compute Q(a_j), exactly one of the control signals s_l[j] is equal to “1” and the others are equal to “0”; thus, only one of the p LLR values is selected for the output Q(a_j). In order to calculate the p control signals s_l[j], the 2^p − 1 = q − 1 combinations of the p field elements in the basic set B*, excluding the zero element, are divided into p groups, as shown in Fig. 1(b). The p control signals s_l[j] correspond to the p outputs of the groups. The lth group contains the field element a*_l and its combinations with all possible combinations of the previous field elements a*_k (0 < k < l); therefore, the lth group includes 2^(l−1) combinations of the field elements.

In addition, (q − 1) path information words corresponding to the (q − 1) field elements are constructed, where each path information word d[j] has p column indexes d_l[j] (1 ≤ l ≤ p). (q − 1) copies of the architecture in Fig. 1(c) are required to compute the (q − 1) path information words. For the p field elements a*_l (1 ≤ l ≤ p) in the basic set, the paths have one deviation, and thus the p values of the path information {d_l[j]}, 1 ≤ l ≤ p, are all equal to the column index I*_l. The path of the field element generated by the combination of all the field elements in the basic set has the maximum of p deviations; thus, the p values d_l[j] (1 ≤ l ≤ p) correspond to the p column indexes I*_l (1 ≤ l ≤ p) in the basic set. For the other field elements, generated by the remaining combinations, the number of deviations is k (1 < k < p); in this case, d_l[j] with 1 ≤ l ≤ k is assigned the corresponding column index I*_l, whereas d_l[j] with k < l ≤ p is assigned the column index I*_k.
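
The following Python sketch is one possible software reading of this decompression step for GF(2^p): every nonzero field element is the XOR of a unique subset of the p basis elements, its extra-column LLR is the m1*_l of the highest-index basis element involved (which is what the group-based control signals select), and its path information is filled from the column indexes of the elements used. The bit-mask basis representation and the handling of the trailing path entries are our assumptions.

def extra_column_and_paths(basis):
    # basis: list of p tuples (m1, I, a) for the basic set B*, ordered by
    # the reliability index l = 1..p; 'a' is the field element held as a
    # bit mask over GF(2)^p (assumed representation).
    p = len(basis)
    out = {}
    for comb in range(1, 1 << p):            # the 2^p - 1 nonzero subsets
        elem, used = 0, []
        for l in range(p):
            if (comb >> l) & 1:
                elem ^= basis[l][2]          # XOR of basis field elements
                used.append(l)
        Q = basis[used[-1]][0]               # m1* of the highest-index element
        # k deviations take the first k column indexes; the remaining
        # entries repeat the last one (our reading of the text).
        d = [basis[used[min(i, len(used) - 1)]][1] for i in range(p)]
        out[elem] = (Q, d)
    return out

# GF(8) example (p = 3) with made-up LLRs m1*, column indexes I*, and
# basis elements 1, 2, 4:
print(extra_column_and_paths([(0.10, 2, 1), (0.35, 5, 2), (0.80, 1, 4)]))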

Fig. 2 presents the proposed C2V message generator for GF(8) with dc = 4. The C2V messages Rmj(a) (1 ≤ j ≤ dc) are produced simultaneously from either Q(a) or E(a) through multiplexers. The control signals for the multiplexers depend on the column indexes and the p deviations of the path information. If the column index j (1 ≤ j ≤ dc) is equal to at least one of the p deviations d_l(a) (1 ≤ l ≤ p), the output of the multiplexer is assigned the compensation value E(a); otherwise, it is assigned Q(a). Fig. 3 shows the top-level decoder architecture for the proposed layered decoding algorithm, where one row of H, corresponding to one layer, is processed in one clock cycle. The decoder architecture is divided into a variable node processor and a check node processor.

To start the decoding process, the LLR messages from the channel information Ln(a) are loaded into the variable node memory (VNMEM). From the next layer and the next iteration onward, the output messages of the variable node processor, Qnk,l(a), are stored in the VNMEM. The VNMEM includes dc memories with a depth of (q − 1), matching the size of the CPM, and a width of q × w bits. In each decoding cycle, one address is read from and one address is written to each memory.

The permutation and depermutation of the variable messages are implemented by modules P and P−1, respectively. Each module requires dc × (q − 1) × log2 q multiplexers of w bits to permute or depermute dc vectors of (q − 1) messages, and the control signals are based on the hmn nonzero values of H.

The normalization module N is responsible for finding the most reliable messages and their locations zn, and for generating the Qnk,l(a) messages for the inputs of the check node processor. In addition, normalization ensures that the smallest value in each LLR vector Qnk,l(a) is always equal to zero. At the last decoding iteration, the zn values are the hard-decision symbols c̃n stored in the output memory, and the P module and subtractor are inactive during this process. Since a layered decoding scheme is used, the outputs of the check node processor in one iteration must be stored in the check node memory (CNMEM) for the next iteration. The CNMEM in the proposed decoder therefore has a depth of M and a width of p × (w + ⌈log2 dc⌉ + p) + (q − 1) × w + dc × p bits, corresponding to the output bits of the check node processor. A total of M × [p × (w + ⌈log2 dc⌉ + p) + (q − 1) × w + dc × p] bits are stored in one iteration. Compared with the M × q × dc × w bits stored in the CNMEM of the conventional approach, the memory requirement of the proposed decoder is greatly reduced, which leads to a large reduction in decoder area.
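
A quick per-row bit count illustrates the CNMEM saving. Only the two width formulas come from the text; the message word width w = 5 bits and dc = 27 below are our own illustrative assumptions for the GF(32) code.

from math import ceil, log2

def cnmem_row_bits(q, p, w, dc):
    # Bits stored per H row: proposed compressed format vs. q*dc*w baseline.
    proposed = p * (w + ceil(log2(dc)) + p) + (q - 1) * w + dc * p
    conventional = q * dc * w
    return proposed, conventional

print(cnmem_row_bits(q=32, p=5, w=5, dc=27))   # (365, 4320): ~12x fewer bits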

ADVANTAGES:

  • Highly efficient
  • Complexity is low

REFERENCES:

Huyen Pham Thi, Member, IEEE, and Hanho Lee, Senior Member, IEEE, “Basic-Set Trellis Min–Max Decoder Architecture for Nonbinary LDPC Codes With High-Order Galois Fields”, IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 26, NO. 3, MARCH 2018.

Extending 3-bit Burst Error-Correction Codes With Quadruple Adjacent Error Correction

Extending 3-bit Burst Error-Correction Codes With Quadruple Adjacent Error Correction

ABSTRACT:

The use of error-correction codes (ECCs) with advanced correction capability is a common system-level strategy to harden memories against multiple bit upsets (MBUs). The construction of ECCs with advanced error correction and low redundancy has therefore become an important problem, especially for adjacent ECCs. Existing codes for mitigating MBUs mainly focus on the correction of up to 3-bit burst errors. As the technology scales and the cell interval distance decreases, the number of affected bits can easily extend to more than 3 bits. The previous methods are therefore not enough to satisfy the reliability requirements of applications in harsh environments. In this paper, a technique to extend 3-bit burst error-correction (BEC) codes with quadruple adjacent error correction (QAEC) is presented. First, the design rules are specified, and then a searching algorithm is developed to find the codes that comply with those rules. The H matrices of the 3-bit BEC codes with QAEC obtained are presented. They do not require additional parity check bits compared with a 3-bit BEC code. By applying the new algorithm to previous 3-bit BEC codes, the performance of 3-bit BEC is also remarkably improved. The encoding and decoding procedure of the proposed codes is illustrated with an example. Then, the encoders and decoders are implemented using a 65-nm library, and the results show that our codes have moderate total area and delay overhead to achieve the correction capability extension.

SOFTWARE IMPLEMENTATION:

  • Modelsim
  • Xilinx 14.2

EXISTING SYSTEM:

Reliability is an important requirement for space applications. Memories as the data storing components play a significant role in the electronic systems. They are widely used in the system on a chip and application-specific integrated circuits. In these applications, memories

Fig. 1. Memory cell area of different technologies (cell area shape is simplified to a square, and Length is the length of a side).

account for a large portion of the circuit area. This makes memories suffer more space radiation than other components. Therefore, the sensitivity to radiation of memories has become a critical issue to ensure the reliability of electronic systems. In modern static random access memories (SRAMs), radiation-induced soft errors in the form of the single event upset (SEU) and multiple bit upset (MBU) are two prominent single event effects. As semiconductor technology develops from the submicrometer technology to the ultradeep submicrometer (UDSM) technology, the size of memory cells is smaller and more cells are included in the radius affected by a particle as shown in Fig. 1.

When a particle from a cosmic ray hits a basic memory cell, it generates a radial distribution of electron–hole pairs along the transport track. These electron–hole pairs can cause soft errors by changing the values stored in the memory cell, leading to data corruption and system failure. For transistors with a large feature size, a radiation event affects just one memory cell, which means that only an SEU occurs. In this case, the use of single-error-correction (SEC)–double-error-detection (DED) codes is enough to protect the memory from radiation effects.

As the feature size enters the DSM range, the critical charge keeps decreasing and the area of the memory cell scales down with each successive technology node. This causes more memory cells to be affected by a single particle hit, as shown in Fig. 2. For the CMOS bulk technology, with the cell-to-cell spacing decreasing, the electron–hole pairs generated in the substrate can diffuse to nearby cells and induce MBUs.

Fig. 2. Schematic description of memory cells included in the radiation effect with variation of the technology node.

This contrasts with the FDSOI technology, which isolates transistors and limits the multicollection mechanism. Therefore, the multicollection mechanism is more prominent in a bulk technology, and the MBU probability is higher. Although multiple-bit error-correction codes (ECCs) can correct multiple bit errors in any error pattern, not limited to adjacent bits, the complexity of the decoding process and the limitation of the code block size restrict their use. Meanwhile, from the generation principle of MBUs, the type of MBU depends on the initial angle of incidence and the scattering angle of the secondary particles. Accordingly, adjacent bit errors are the dominant error patterns among multiple bit errors, and adjacent-bit-correcting ECCs have become popular in memory-hardened designs. Many codes have been proposed, with correction capabilities mainly focused on double adjacent error correction (DAEC), triple adjacent error correction (TAEC), and 3-bit burst error correction (BEC). An alternative to codes that can correct adjacent errors is to use an SEC or SEC-DED code combined with interleaving of the memory cells. Interleaving ensures that cells that belong to the same logical word are placed physically apart, so an error on multiple adjacent cells affects multiple words, each having only one bit error that can be corrected by an SEC code. As noted in previous studies, interleaving makes the interconnections and routing of the memory more complex and leads to increased area and power consumption or limitations in the aspect ratio. Therefore, whether it is better to use SEC plus interleaving or a code that can correct adjacent errors is design-dependent, and both alternatives are of interest.

As the technology reaches the UDSM range, the area of the memory cell keeps decreasing, and even memories with atomic-dimension transistors appear. The ionization range of ions, on the order of micrometers, can include more memory cells in the word-line direction (Fig. 2) than the three bits previously considered. This means that SEC-DAEC-TAEC codes may not be effective in ensuring memory reliability, and codes with more advanced correction ability are demanded. For example, codes designed with low redundancy for SEC-DAEC-TAEC and 3-bit BEC have been presented. Therefore, extending the correction ability to quadruple adjacent bit errors is of interest, especially if it can be done without adding extra parity bits.

In this paper, we present an improvement of 3-bit BEC codes to also provide quadruple adjacent error correction (QAEC). The code design technique for QAEC with low redundancy is specified from two aspects: 1) error space satisfiability and 2) unique syndrome satisfiability. Codes with QAEC for 16, 32, and 64 data bits are presented. From the viewpoint of integrated circuit design, two criteria have been used to optimize the decoder complexity and decoder delay at the ECC level: 1) minimizing the total number of ones in the parity check matrix and 2) minimizing the number of ones in the heaviest row of the parity check matrix. Additionally, based on the traditional recursive backtracking algorithm, an algorithm with the functions of weight restriction and recording of the past procedure is developed. The new algorithm not only reduces the program run time but also remarkably improves the performance of previous 3-bit BEC codes. The encoders and decoders for the QAEC codes are implemented in the Verilog hardware description language (HDL). Area and delay overheads are obtained using a TSMC bulk 65-nm library. Compared with the previous 3-bit BEC codes, the area and delay overhead is moderate for achieving the correction capability extension.

DISADVANTAGES:

  • Not reliable in operation
  • Performance of 3-bit BEC is not improved.

PROPOSED SYSTEM:

Searching Algorithms and Tool Development

In this section, an algorithm is proposed to solve the Boolean satisfiability problem based on the discussion in the former section. Based on this algorithm, a code optimization tool is developed to obtain the target H matrix under custom optimization restrictions. The introduction of the algorithm is divided into two subsections. First, the basic part of the algorithm, which finds solutions meeting the Boolean satisfiability requirement, is introduced. Then, on top of the basic part, a method with column weight restriction is designed to force the optimization process to use as few ones as possible, thus optimizing the total number of ones in the matrix and the number of ones in the heaviest row. This optimized version of the algorithm has been used to obtain all the codes presented in this paper.

Basic Part of Code Design Algorithm

In order to construct the expected codes, the first step is to determine the number of check bits: seven for codes with 16 data bits, eight for codes with 32 data bits, and nine for codes with 64 data bits.

The main idea of the algorithm is based on the recursive backtracking algorithm. At first, an identity matrix with block size (n − k) is constructed as the initial matrix, and the corresponding syndromes of the error patterns are added to the syndrome set. Then, a column vector selected from the 2^(n−k) − 1 column candidates is added to the right side of the initial matrix; this process is defined as the column-added action. Meanwhile, the new syndromes belonging to the newly added column are calculated. If none of the new syndromes is equal to an element of the syndrome set, the column-added action succeeds, the corresponding new syndromes are added to the syndrome set, and the base matrix is updated to the previous matrix with the added column. Otherwise, the column-added action fails and another new column is selected from the candidates. If all the candidates have been tried and the column-added action still fails, one column from the right side of the previous matrix and its corresponding syndromes are removed from the base matrix and the syndrome set, respectively. The algorithm then continues the column-added action until the matrix dimension reaches the expected value. The code design algorithm flow is shown in Fig. 3.
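
In software, this search can be prototyped in a few dozen lines. The sketch below follows the flow just described: each column is held as an integer (anticipating the decimal-number speedup discussed next), candidate columns are tried in order, the syndromes of all correctable patterns ending at the new column are checked for freshness, and the search backtracks on failure. The covered pattern set (single error, 2- and 3-bit bursts, triple and quadruple adjacent errors) and all names are ours; a first solution is usually found quickly, but the worst-case runtime is exponential, and the real tool adds the weight restrictions discussed below.

def new_syndromes(cols, c):
    # Syndromes of the correctable error patterns whose rightmost bit falls
    # on the candidate column c: single error, double adjacent, 101 burst,
    # triple adjacent, and quadruple adjacent.
    s = [c]
    if len(cols) >= 1:
        s.append(c ^ cols[-1])
    if len(cols) >= 2:
        s += [c ^ cols[-2], c ^ cols[-1] ^ cols[-2]]
    if len(cols) >= 3:
        s.append(c ^ cols[-1] ^ cols[-2] ^ cols[-3])
    return s

def search(cols, seen, n_total, r):
    if len(cols) == n_total:
        return cols                           # full H found
    for c in range(1, 1 << r):                # the 2^r - 1 column candidates
        syn = new_syndromes(cols, c)
        if len(set(syn)) == len(syn) and not seen & set(syn):
            res = search(cols + [c], seen | set(syn), n_total, r)
            if res:
                return res
    return None                               # dead end: backtrack

r = 7                                         # check bits for 16 data bits
cols = [1 << i for i in range(r)]             # start from the identity block
seen = set()
for i in range(r):
    seen |= set(new_syndromes(cols[:i], cols[i]))
print(search(cols, seen, n_total=16 + r, r=r))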

Normally, the recursive backtracking algorithm demands a large amount of computing resources and computing time. In order to accelerate the algorithm, we first adopt decimal operations instead of matrix operations by converting the column vectors into decimal numbers. Even so, completing the execution over all conditions is not possible. In general, if the expected code exists, it is easy to obtain a first solution, and with different optimization criteria the algorithm can find better solutions. However, finding the best solution requires the complete result of the whole searching process, which is, in most cases, infeasible with today's computing resources. Therefore, it is more practical to use the best result obtained in a reasonable computation time; to be consistent with previous work, that time was set to one week for all the results presented in this paper. The flow of the algorithm with column weight restriction and past-procedure recording is shown in Fig. 4.

It can be observed that the new algorithms are able to find better solutions than existing algorithms such as the one used in previous work. Therefore, they can be applied to the search for QAEC codes. The solutions found for the QAEC codes are presented in the paper.

ADVANTAGES:

  • Reliable in operation
  • Performance of 3-bit BEC is improved.

REFERENCES:

Jiaqiang Li, Student Member, IEEE, Pedro Reviriego, Senior Member, IEEE, Liyi Xiao, Member, IEEE, Costas Argyrides, Senior Member, IEEE, and Jie Li, “Extending 3-bit Burst Error-Correction Codes With Quadruple Adjacent Error Correction”, IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 26, NO. 2, FEBRUARY 2018.

Fast Neural Network Training on FPGA Using Quasi-Newton Optimization Method

Fast Neural Network Training on FPGA Using Quasi-Newton Optimization Method

ABSTRACT:

In this brief, a customized and pipelined hardware implementation of the quasi-Newton (QN) method on a field-programmable gate array (FPGA) is proposed for fast onsite training of artificial neural networks, targeting embedded applications. The architecture is scalable to cope with different neural network sizes while supporting batch-mode training. Experimental results demonstrate the superior performance and power efficiency of the proposed implementation over CPU, graphics processing unit, and prior FPGA QN implementations.

SOFTWARE IMPLEMENTATION:

  • Modelsim
  • Xilinx 14.2

EXISTING SYSTEM:

Field-programmable gate arrays (FPGAs) have been considered a promising alternative for implementing artificial neural networks (ANNs) in high-performance ANN applications because of their massive parallelism, high reconfigurability (compared to application-specific integrated circuits), and better energy efficiency [compared to graphics processing units (GPUs)]. However, the majority of existing FPGA-based ANN implementations are static implementations for specific offline applications, without learning capability. While achieving high performance, static hardware implementations of ANNs suffer from low adaptability across a wide range of applications. Onsite training provides dynamic training flexibility for ANN implementations.

Currently, high-performance CPUs and GPUs are widely used for offline training, but they are not suitable for onsite training, especially in embedded applications. Implementing onsite training in hardware is complex for the following reasons. First, some software features, such as floating-point number representations or advanced training techniques, are impractical or expensive to implement in hardware. Second, the implementation of batch-mode training increases the design complexity of the hardware. In batch training, the ANN's weight update is based on the training error from all the samples in the training data set, which enables good convergence and efficiently protects against sample-data perturbation attacks. However, the implementation of batch training on an FPGA platform requires large data buffering and quick intermediate weight computation. There have been FPGA implementations of ANNs with online training.

These implementations used the backpropagation (BP) learning algorithm with non-batch training. Although widely used, the BP algorithm converges slowly, since it is essentially a steepest descent method. Several advanced training algorithms, such as conjugate gradient, quasi-Newton (QN), and Levenberg–Marquardt, have been proposed to speed up convergence and reduce training time, at the cost of memory resources. This brief provides a feasible solution to the challenges of hardware implementation of onsite training by implementing the QN training algorithm and supporting batch-mode training on the latest FPGA. Our previous work implemented the Davidon–Fletcher–Powell QN (DFP-QN) algorithm on FPGA and achieved a 17-times speedup over the software implementation. To further improve performance, this brief proposes a hardware implementation of the Broyden–Fletcher–Goldfarb–Shanno QN (BFGS-QN) algorithm on FPGA for ANN training.

To the best of our knowledge, this is the first reported FPGA hardware accelerator of the BFGS-QN algorithm in a deeply pipelined fashion. The performance of the implementation is analyzed quantitatively. The architecture is designed with low hardware cost in mind for scalable performance, and for both onsite and offline training. The designed accelerator is applied to ANN training and is able to cope with different network sizes. The experimental results show that the proposed accelerator achieves performance improvement up to 105 times for various neural network sizes, compared with the CPU software implementation. The hardware implementation is also applied to two real-world applications, showing superior performance.

DISADVANTAGES:

  • Performance is not efficient
  • Power efficiency is low

PROPOSED SYSTEM:

FPGA Hardware Implementation

Because the steps for computing B, λ, and g are the three computationally intensive parts of the BFGS-QN algorithm, we mainly describe how these three parts are designed.

B Matrix Computation

The B computation (BC) block derives an n × n matrix, where n is the total number of weights, according to step 5 of Algorithm 1. The most computationally intensive operations are matrix-by-vector multiplications (MVMs). For scalable performance and a modularized architecture, each MVM is implemented as multiple vector-by-vector multiplications (VVMs). Three on-chip RAM units, each storing n words, are used to store the intermediate computation results, which are repeatedly used in subsequent computations.
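
For reference, a NumPy sketch of the standard BFGS inverse-Hessian update is shown below; we assume this is the form computed in step 5 of Algorithm 1, which this brief does not reproduce. The product B·y is the MVM that the BC block decomposes into row-wise VVMs, and s·y and y·(B·y) are the VVMs whose results are the intermediate values held in the on-chip RAMs.

import numpy as np

def bfgs_update(B, s, y):
    # B: n x n inverse-Hessian approximation; s = w_{k+1} - w_k;
    # y = g_{k+1} - g_k (gradient difference).
    By = B @ y                    # MVM, realized in hardware as n VVMs
    sy = s @ y                    # VVM
    yBy = y @ By                  # VVM
    return (B + ((sy + yBy) / sy ** 2) * np.outer(s, s)
              - (np.outer(By, s) + np.outer(s, By)) / sy)

# Secant-condition check on a toy quadratic: B_{k+1} y must equal s.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
s = np.array([0.5, -0.2])
B1 = bfgs_update(np.eye(2), s, A @ s)
print(B1 @ (A @ s), s)            # the two vectors match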

Fig. 2.Architecture of λ computation block. (a) Forward–backward block for determining a search interval [a0, b0]. (b) GSS block. The operations in the dashed blocks share the same hardware.

In addition, an n-word first-in–first-out (FIFO) buffer is used to match the two input data streams of the final addition and form B_(k+1) row by row. The latency of the BC block depends on L_mult, L_sub, and L_div, the latencies of the multiplier, subtractor, and divider, respectively, and on T_vector, the number of execution cycles of a VVM. A deeply pipelined VVM unit is implemented with T_vector = n + L_add⌈log2 L_add⌉ + 2, where the last two terms account for draining the pipeline. 2) Step Size λ Computation: An exact line search method is implemented to find λ_k according to step 2 of Algorithm 1. The method includes two submodules: the forward–backward algorithm, which determines an initial search interval containing the minimizer, and the golden-section search (GSS) algorithm, which iteratively reduces the interval. Fig. 2 shows the hardware architecture, where the computation time of each block is analyzed. The architecture comprises a number of pipelined arithmetic operations and the evaluation of the objective function ET.

The dashed blocks in Fig. 2(a) and (b) are the same piece of hardware, shared by the corresponding operations. The complex objective function (1) must be evaluated frequently during the λ computation; therefore, we also implement the objective function evaluation in hardware as a separate block for high performance. It can be modified according to different types of ANNs while the other parts of the implementation remain the same. As shown in Fig. 3(a), the block is implemented in two parts: the first part exercises the ANN to obtain its outputs, and the second part computes the training error. Fig. 3(b) shows the structure of the first part. Two accumulation units are implemented to obtain h and ŷ. Note that the accumulation unit is implemented differently from the one in the VVM unit, by using variable-length shift registers.
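
In software terms, the two submodules of the line search behave like the sketch below: a forward pass with a growing step brackets the minimizer along the search direction, and golden-section search shrinks the bracket. The initial step, growth factor, and tolerance are arbitrary choices of ours; the hardware realizes the same arithmetic as fixed pipelined blocks.

def bracket(f, h=1e-2, grow=2.0):
    # Forward-backward pass: grow the step until f increases, returning an
    # interval [a0, b0] that contains a minimizer along the direction.
    x_prev, x = 0.0, h
    f_prev, fx = f(0.0), f(h)
    if fx >= f_prev:
        return 0.0, h                      # minimizer already inside [0, h]
    while True:
        x_next = x + (x - x_prev) * grow
        f_next = f(x_next)
        if f_next >= fx:
            return x_prev, x_next
        x_prev, x, fx = x, x_next, f_next

def gss(f, a, b, tol=1e-8):
    # Golden-section search: each iteration reuses one inner evaluation.
    inv_phi = (5 ** 0.5 - 1) / 2
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)

f = lambda lam: (lam - 0.7) ** 2 + 1.0     # toy objective along a direction
a0, b0 = bracket(f)
print(gss(f, a0, b0))                      # ~0.7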

Resource Usage Analysis

During the computation, intermediate results are accessed in different patterns: 1) some results, such as w and g, which are connected to multiple calculation modules, are read more than once; and 2) values such as h, F(h), and δ are read in a different order from that in which they are written. Batch training is also supported. Therefore, all intermediate results are buffered in FPGA on-chip memories for pipelined and parallel computation and data reuse. FPGA on-chip dual-port RAMs are used to implement the buffers. The required storage space of each module is shown in Table I. As n increases, off-chip memory will be used, and a properly designed memory hierarchy containing on-chip and off-chip memories is needed to prevent a slowdown. In addition, all training data are stored in off-chip memories for applications with a large number of training samples. The overall data path of the BFGS-QN hardware design needs a number of floating-point (FP) units, as shown in Table II. The number of required FP units is independent of n.

ADVANTAGES:

  • Performance is efficient
  • Power efficiency is high

REFERENCES:

Qiang Liu, Jia Liu, Ruoyu Sang, Jiajun Li, Tao Zhang, and Qijun Zhang, “Fast Neural Network Training on FPGA Using Quasi-Newton Optimization Method”, IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2018.

Feedback-Based Low-Power Soft-Error-Tolerant Design for Dual-Modular Redundancy

Feedback-Based Low-Power Soft-Error-Tolerant Design for Dual-Modular Redundancy

ABSTRACT:

Triple-modular redundancy (TMR), which consists of three identical modules and a voting circuit, is a common architecture for soft-error tolerance. However, the original TMR suffers from two major drawbacks: the large area overhead and the vulnerability of the voter. In order to overcome these drawbacks, we propose a new complementary dual-modular redundancy (CDMR) scheme for mitigating the effect of soft errors. Inspired by Markov random field (MRF) theory, a two-stage voting system is implemented in CDMR, including a first-stage optimal MRF structure and a second-stage high-performance merging unit. The CDMR scheme can reduce the voting circuit area by 20% while saving the area of one redundant module, achieving at least 26% error-rate reduction at an ultralow supply voltage of 0.25 V with 8.33% faster timing compared to previous voter designs.

SOFTWARE IMPLEMENTATION:

  • Modelsim
  • Xilinx 14.2

EXISTING SYSTEM:

Triple-modular redundancy (TMR) was first proposed by von Neumann and has since been adopted as a technique to improve error tolerance at the cost of increased circuit area. TMR can tolerate soft errors only when the probability of three or two modules failing simultaneously is much lower than that of a single module. However, one obvious drawback is the increased area overhead. Therefore, partial TMR (PTMR) was proposed to reduce the area overhead by trading off reliability. A dual-modular redundancy (DMR) scheme presented previously uses a three-module structure with self-feedback; robust C-elements and multiplexers are used, respectively, to form the voters in two different DMR designs. An algorithmic noise-tolerant (ANT) technique was proposed to solve the problem of soft errors caused by voltage overscaling; algorithmic soft-error tolerance (ASET) and fine-grain soft-error tolerance (FGSET) designs are both extended ANT designs. These designs suffer from two drawbacks. First, they still consume a large area overhead. Second, reliability loss is incurred by soft errors in the voting circuit, because redundancies and estimator-based redundancies work well only when voters never fail, which may be an unrealistic assumption if the circuits are designed in a deep submicrometer technology or operated at an ultralow supply voltage. Under such conditions, a failure is likely to occur in the voting circuit, which is a main cause of TMR failure. For a multistage design, three identical voters could be used in each stage to tolerate errors that occur in one of the TMR voters, but this would add undesirable overhead. Some approaches, such as generalized modular redundancy, approximate TMR, and a simulation-based synthesis scheme, improve the original TMR, but they only offer either an optimal implementation strategy or an accuracy tradeoff.

A number of error-tolerant methods, such as Markov random field (MRF), differential cascode voltage switch (DCVS), and DCVS-MRF designs, have been proposed. In these designs, the basic elements include feedback loops that help them achieve high soft-error tolerance; however, these implementations require higher area overhead than traditional structures. To solve the soft-error issues in the voter and save area overhead, we propose a new complementary DMR (CDMR) scheme, as shown in Fig. 1. The CDMR scheme maintains soft-error tolerance even within the voting circuit. This is achieved by separately processing one module (M1) through a structure with a stable logic "1" as output (structure A in Fig. 1) and processing another identical module (M2) through a structure with a stable logic "0" as output (structure B in Fig. 1). A second-stage feedback structure is then used to merge the stable logic "1" and stable logic "0" outputs from the first stage, ensuring the best performance from the first stage (structure C in Fig. 1). The CDMR scheme outperforms existing designs in two key aspects: 1) it tolerates many soft errors propagated to the voting circuit and 2) it saves area overhead.

DISADVANTAGES:

  • Large area overheads are present
  • Soft errors are not reduced

PROPOSED SYSTEM:

MRF-Inspired Two-Stage Feedback Design

The second stage in Fig. 2 can compensate for the loss of error tolerance in g2 of the first stage by using its latching property. The proposed structure benefits from the presence of stage 2 to improve its reliability, which is a feature that TMR, DMR, and other designs lack. Let us extend the single-error assumption for stage 1 by assuming that only one error can emerge from one of the complementary propagation chains at the same time. In other words, when an error occurs in stage 1, the latch structure of g3–g4 in stage 2 does not propagate the error received from stage 1. With respect to the proposed CDMR, the two redundant inputs to the voter must be complementary and, in the absence of errors, they propagate through stages 1 and 2 as complementary signals (Table I lists the values of the g3–g4 feedback).

For example, an ideal input bit stream for xa (xa = xb) is {x0–x4 = 0 and x5–x9 = 1}. Four bits, x7 and x9 of xd and x1 and x2 of xe, are flipped by noise, as circled in Fig. 2. Their corresponding bits in the other branch are a robust "1" because of the high tolerance of a noisy input bit "0" in both NAND gates g1 and g2; this is why we only consider the cases where errors occur on a weak "0" in xd or xe. This condition causes the second stage g3–g4 to remain in the hold state of Table I, acting as an RS latch and thus protecting the final output from the influence of the error bits in xd and xe on the basis of the previous correct outputs. We adopted the widely used double-exponential current source to simulate the cases where a charged or ionizing particle hits an output "0" of the stage 1 circuit.

The injected current is modeled as I(t) = Qtotal/(τf − τr) × (e^(−t/τf) − e^(−t/τr)), where Qtotal is the total charge deposited by the particle strike, and τr and τf are the rising and falling time constants, respectively. As τr and τf are generally set to 50 and 164 ps for different process technologies, we used a current source with Qtotal = 70 fC in our simulation. Regardless of whether xa and xb are both high or low, when a charged particle strikes xd or xe, only a single short peak appears at the output xf, as shown in Fig. 2. Compared with the much longer pulse at the output of a TMR voter when an error hits one of its inner branches, this pulse can be regarded as far less harmful in the proposed voter after sampling, as the error is too short to be sampled multiple times. The results in Fig. 2 confirm the error tolerance deduced from the proposed structure. Under the extended one-error condition, the output of our module achieves correct operation as long as the two inner complementary signals are not in error at the same time.
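
The hold behavior can be mimicked with a tiny behavioral model. Below, stage 1 nominally drives complementary signals xd and xe; an upset that turns a weak "0" into a "1" puts the pair into the (1, 1) hold state of Table I, and the latch model keeps the previous output. The bit-level polarity conventions are our assumptions; the real circuit behavior (Fig. 2) is analog.

def stage2(xd, xe, prev):
    # Cross-coupled g3-g4 merging unit as an RS-latch-like truth table:
    # complementary inputs drive the output, and (1, 1) holds the old value.
    if (xd, xe) == (1, 0):
        return 0
    if (xd, xe) == (0, 1):
        return 1
    return prev                       # hold state: ride through the upset

# Ideal stream xa = xb: x0-x4 = 0, x5-x9 = 1. Nominally xd = NOT xa and
# xe = xa; upsets flip weak "0"s: x7, x9 of xd and x1, x2 of xe.
xd = [1, 1, 1, 1, 1, 0, 0, 1, 0, 1]
xe = [0, 1, 1, 0, 0, 1, 1, 1, 1, 1]
out, prev = [], 0
for d, e in zip(xd, xe):
    prev = stage2(d, e, prev)
    out.append(prev)
print(out)   # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]: the four upsets are masked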

We see that the proposed design has better soft-error tolerance; the proposed voting circuit thus has both higher modular soft-error tolerance and higher reliability than TMR. For multistage logic, the voter is concatenated in each stage to improve the overall system reliability, as shown in Fig. 4(a)–(d). The original TMR, FGSET, and DMR voters are simply duplicated for multistage operation [Fig. 4(a)–(c)], whereas the proposed voter has enclosed feedback loops and two outputs, without voting duplication between two stages [Fig. 4(d)]. Note that this design has two complementary outputs as references for error correction. Overall, the area overhead is reduced by at least 50% compared with the designs used in TMR and DMR. We consider a 4-bit ripple-carry adder (RCA) as a case study for the proposed voter in Fig. 4. The proposed design requires a differential input; thus, we redesigned the full adder (FA) accordingly. We present two design schemes for adders: scheme 1 (S1) in Fig. 4 is designed for a single unit with DMR, in which the outputs of the two modules are connected to a voter, and scheme 2 (S2) in Fig. 4 is implemented as a multistage design by adding a voter at every stage.

ADVANTAGES:

  • Large area overheads are reduced
  • Soft errors are reduced

REFERENCES:

Yan Li, Yufeng Li, Han Jie, Jianhao Hu, Fan Yang, Xuan Zeng, Bruce Cockburn, and Jie Chen, “Feedback-Based Low-Power Soft-Error-Tolerant Design for Dual-Modular Redundancy”, IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2018.