# A Combined Arithmetic-High-Level Synthesis Solution to Deploy Partial Carry-Save Radix-8 Booth Multipliers in Data paths

ABSTRACT:

EXISTING SYSTEM:

MULTIMEDIA, signal processing and other applications are usually dominated by integer addition and multiplication. Hence, developing faster, smaller and more power-efficient Functional Units (FUs) is critical to implementing efficient circuits. The fastest adders, such as the Kogge-Stone prefix adder, carry a noticeable area and power penalty. Several approximate adders have appeared that aim to improve these parameters, especially by reducing the average delay and offering different area/power trade-offs. Additive structures have also been optimized by using redundant formats such as total or partial carry-save adders.

Figure 1: Radix R Booth multiplier

Nevertheless, data paths are still dominated by multipliers, as their delay, area and power grow faster than in the case of adders. Multipliers typically consist of an additive tree known as the partial product matrix (PPM), which accumulates the partial products and reduces them to just two operands, and a last stage Carry Propagate Adder (CPA), which adds these two operands and calculates the final result. Given an m × n multiplier, the PPM is composed of m × n 1-bit partial products pi,j , defined by Equation 1.

p_{i,j} = x_i · y_j, ∀ i, j, 0 ≤ i < m, 0 ≤ j < n.     (1)
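As a quick illustration of Equation (1) and the PPM-plus-CPA structure described above, the following is a minimal Python sketch (the function name and the bit-level accumulation loop are ours, for illustration only; a real multiplier reduces the PPM with a compressor tree and a final CPA rather than a software loop):

```python
def unsigned_multiply_via_ppm(x, y, m, n):
    """Multiply two unsigned integers by explicitly forming the m x n
    partial product matrix p[i][j] = x_i AND y_j (Equation 1), then
    accumulating every 1-bit partial product with its weight 2**(i+j)."""
    x_bits = [(x >> i) & 1 for i in range(m)]
    y_bits = [(y >> j) & 1 for j in range(n)]
    # PPM: each 1-bit partial product p[i][j] = x_i * y_j
    ppm = [[x_bits[i] & y_bits[j] for j in range(n)] for i in range(m)]
    # Accumulation (in hardware: compressor tree + last-stage CPA)
    result = 0
    for i in range(m):
        for j in range(n):
            result += ppm[i][j] << (i + j)
    return result
```

For example, `unsigned_multiply_via_ppm(13, 11, 8, 8)` returns `143`, i.e. 13 × 11.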

• Difficult to develop efficient circuits
• Higher average delay
• Greater area and power consumption

PROPOSED SYSTEM:

In order to justify this work, several multipliers were synthesized using Synopsys Design Compiler with a 65 nm target library and without placing any delay constraint. We measured the delay, area, power and energy of the radix 4 and radix 8 Booth multipliers. Both types of multipliers implement the PPM stage through a 4:2 compressor tree and the CPA stage through a Kogge-Stone adder. Table II presents the delay, area, power and energy of the radix 8 Booth multipliers after synthesis. All values have been normalized with respect to a radix 4 Booth multiplier for different data widths. It should also be noted that two values are presented for each reported metric: those obtained when the 3X multiple is calculated within the radix 8 Booth multiplier, labelled w/3X, and those obtained when the 3X multiple has been pre-calculated in the data path as an independent operation, labelled w/o 3X. As can be observed, in terms of delay, radix 8 implementations are slightly faster for small sizes, i.e. 8 and 16 bits. However, for 32 bits and larger sizes, radix 4 implementations are better. As shown in column w/o 3X, the delay can be balanced and even shortened by pre-calculating the 3X multiple for radix 8 implementations. Regarding area and power, it is clear that reducing the number of partial product rows from (n + 1)/2 to (n + 1)/3 achieves an improvement.

Figure 2: Example of 3X node insertion

Figure 3: A multiplier with non-redundant operands and radix 4 Booth encoding

As shown in Table II, the reduction in area ranges from 34% to 24% for complete radix 8 Booth multipliers, and from 40% to 25% when pre-calculating the 3X multiple. In the case of power, the reduction ranges from 16% to 9% for complete radix 8 Booth multipliers, and from 28% to 18% when pre-calculating the 3X multiple. In terms of energy, radix 8 implementations are only more efficient for small bit widths. Nevertheless, this issue can also be resolved, actually obtaining energy reductions, by pre-calculating the 3X multiple. All the aforementioned values have been computed considering just an individual multiplier instance, so including the decoupled 3X computations and routing them will increase the area and energy consumption of the data path. Nonetheless, this example illustrates two main ideas: firstly, there is a delay reduction in the radix 8 multiplier when the 3X multiple is pre-calculated; and, secondly, moving to radix 8-based implementations produces a noticeable area and energy reduction with respect to radix 4-based implementations. Therefore, these gains can be leveraged to route and store the 3X computations.
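To make it concrete why 3X is the only "hard" multiple in radix 8, the following is a small Python sketch of standard radix-8 Booth recoding (textbook form, not the paper's exact implementation; the function names and the digit-count parameter are ours):

```python
def radix8_booth_digits(y, num_digits):
    """Standard radix-8 Booth recoding: split the two's-complement
    multiplier y into digits d_k in {-4,...,4}, with
        d_k = -4*y[3k+2] + 2*y[3k+1] + y[3k] + y[3k-1],  y[-1] = 0.
    Python ints are already sign-extended, so (y >> i) & 1 works."""
    bit = lambda i: (y >> i) & 1 if i >= 0 else 0
    return [-4 * bit(3*k + 2) + 2 * bit(3*k + 1) + bit(3*k) + bit(3*k - 1)
            for k in range(num_digits)]

def booth_multiply(x, y, num_digits):
    """x*y = sum_k d_k * 8**k * x. The multiples 0, X, 2X and 4X come
    from wiring and shifts; only 3X (= 2X + X) needs a real
    carry-propagate addition, which is why it is the 'hard' multiple
    worth decoupling and pre-calculating."""
    three_x = 2 * x + x        # the single hard multiple, computed once
    acc = 0
    for k, d in enumerate(radix8_booth_digits(y, num_digits)):
        mult = (0, x, 2 * x, three_x, 4 * x)[abs(d)]
        acc += (mult if d >= 0 else -mult) << (3 * k)
    return acc
```

For an n-bit multiplier, ⌈(n + 1)/3⌉ digits suffice, matching the (n + 1)/3 partial product rows mentioned above.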

Figure 4: On-the-Fly Correcting Radix 4 Booth Multispeculative Multiplier example of use. (a) A multiplier with partial carry-save operands, k = 4. (b) Reconstructing the non-redundant multiplicand. (c) Multiplier branch: radix 4 carry-select Booth encoding generates the multiples.

Problem Statement:

Partial carry-save formats have proved to be efficient in the state of the art. Hence, partial carry-save multipliers are essential to implementing data paths in this format. All the solutions proposed in the literature are based on radix 4 Booth recoding. Nevertheless, a larger radix may provide greater benefits in terms of energy if the hard multiples can be handled efficiently. Therefore, we propose a method for doing this, as well as a generic Booth-based multiplier for partial carry-save inputs and any radix 2^r. In this paper, we apply this method to the specific case of radix 8 and explore different fragment sizes.

1. Inserting the 3X Nodes

In order to decouple the 3X calculation, we introduce an extra 2X + X addition node per product, selecting the farthest predecessor of the multiplication, i.e. X, in terms of DFG height. In this way, the probability of providing enough slack to schedule the 2X + X operations is maximized; in other words, the probability of increasing the latency is minimized. Figure 2 illustrates how this approach works. First, it should be noted that each operation is labelled by a number located to the right of the node. In this figure, it can be observed that if the 2X + X node were inserted after Operation 4, additional latency would be incurred. However, as shown in Figure 2, selecting Operation 2 is the best choice, as its distance to Operation 5 is greater than the distance between Operations 4 and 5. In other words, the height of Operation 2 is less than the height of Operation 4. Thus, by selecting the operand supplied by the output of Operation 2 to pre-calculate 3X (2X + X), the latency of the final schedule is not affected.
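The selection rule above can be sketched in a few lines of Python. The DFG representation (a dict mapping each operation to its list of predecessors) and the function name are our assumptions for illustration; the scheduler in the paper works on its own internal representation:

```python
def choose_3x_predecessor(preds, dfg):
    """Pick the predecessor of a multiplication whose defining
    operation has the smallest DFG height, i.e. the operand that
    becomes available earliest. Routing that operand through an extra
    2X + X node maximizes the slack available to schedule the
    3X pre-calculation without stretching the critical path."""
    def height(op):
        ps = dfg.get(op, [])
        return 0 if not ps else 1 + max(height(p) for p in ps)
    return min(preds, key=height)

# Toy DFG mirroring Figure 2: op5 multiplies the results of op2 and
# op4, and op4 sits deeper in the graph than op2.
dfg = {1: [], 2: [], 3: [1], 4: [3], 5: [2, 4]}
```

Here `choose_3x_predecessor(dfg[5], dfg)` picks Operation 2, whose height (0) is less than that of Operation 4 (2), matching the discussion of Figure 2.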

2. The On-the-Fly Correcting Radix 4 Booth Multispeculative Multiplier

In this subsection, the foundations of the radix 4 partial carry-save Booth multiplier presented in prior work will be outlined through the example shown in Figures 3 and 4. Figure 3 displays an example of multiplication utilizing radix 4 Booth encoding, while Figure 4a depicts a case with the same input operands but using the partial carry-save format. In this example, 16-bit FUs with fragments of k = 4 bits have been considered. This means that there are 16/4 = 4 fragments producing group generate (Gx, Gy) and propagate (Px, Py) signals. It should be mentioned that the group generate signals are actually the carry-out signals produced by the fragments. The generate and propagate signals produced by the most significant fragment have been labelled '-', because they are no longer needed.

Given a multiplicand in partial carry-save format, i.e. X = U + A, Figure 4b displays how its non-redundant version is calculated. For this purpose, the carry-select technique is employed. In other words, all the fragments are computed in parallel with both carry-in '0' (no actual addition is required) and carry-in '1', while the actual carries of the fragments are computed by the Real Carry Calculation unit. This unit is a crucial part of this step, as the carries generated by each fragment of a partial carry-save adder might be incorrect. However, the Gx, Px group signals generated by every fragment allow the fast computation of the true carries. The selection logic simply chooses a fragment belonging to either U0 or U1, according to the values of Cx. It should be noted that U0 = U, while U1 is obtained by adding '1' to every fragment of U, except the first one.
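A minimal Python sketch of the reconstruction in Figure 4b, under assumed data structures: `u_frags` holds the k-bit fragments of U (LSB fragment first), and `g`/`p` stand in for the Gx/Px group signals, with g[i] being the saved carry bit of A entering the fragment boundary above fragment i. The hardware evaluates all fragments and the carry chain in parallel; the sketch loops for clarity:

```python
K = 4                     # fragment size, as in the k = 4 example
MASK = (1 << K) - 1

def reconstruct_multiplicand(u_frags, g, p):
    """Carry-select reconstruction of X = U + A.
    Every fragment is precomputed both as U0 (carry-in '0', i.e. U
    itself) and U1 (carry-in '1', i.e. U plus one). The Real Carry
    Calculation unit resolves the true carry entering each fragment,
        c[i+1] = g[i] OR (p[i] AND c[i]),   c[0] = 0,
    and the selection logic picks the U0 or U1 fragment accordingly."""
    n = len(u_frags)
    u0 = list(u_frags)                       # carry-in '0': no addition
    u1 = [(f + 1) & MASK for f in u_frags]   # carry-in '1', precomputed
    c = [0] * (n + 1)
    for i in range(n):                       # Real Carry Calculation
        c[i + 1] = g[i] | (p[i] & c[i])
    frags = [u1[i] if c[i] else u0[i] for i in range(n)]
    return sum(f << (K * i) for i, f in enumerate(frags))
```

For example, with U fragments `[0xF, 0x3, 0xF, 0x2]`, saved carries `g = [1, 1, 0, 0]` and propagates `p = [1, 0, 1, 0]`, the result is `0x304F`, which equals U + A (0x2F3F + 0x110).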

Figure 4c provides an overview of the multiplier branch. Given a multiplier in partial carry-save format, i.e. Y = V + B, the V signal will be used to generate the Booth encoding. First, a Real Carry Calculation unit computes the actual carries of the multiplier using the Gy, Py signals. Second, tuple generation takes place, which is not as immediate as in a conventional multiplier. As the non-redundant multiplier is not known, the value of some multiplier bits needs to be calculated. This calculation requires few logic levels, as it just depends on the true carries and some bits of V. After the tuples are generated, the B4_0 encoder implements the typical radix 4 recoding, while the B4_1 encoder implements the same recoding but considering that every Booth tuple t_p = v_{2p+1} v_{2p} v_{2p-1} has a carry-in signal equal to '1' in the 2p position. In other words, this is equivalent to adding the constant "010" to every tuple. Nevertheless, instead of adding and then recoding, the encoding is performed directly following a truth table, which is faster. Finally, the correct encoding for every tuple is selected according to the control (ctrl) signals. As can be observed, the final encoding coincides with that provided by the conventional radix 4 Booth recoding shown in Figure 3. The main advantage of this approach is that all the fragments work in parallel once the real carries are known.
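The two encoders and the carry-select step can be sketched in Python as follows. This is illustrative only: the B4_1 table is derived here by adding "010" and recoding, whereas the hardware uses a direct truth table, and the handling of the overlap bit shared with the neighbouring tuple (which absorbs the wrap-around of digit +2 to -1) is omitted:

```python
def b4_0(t):
    """Standard radix-4 Booth digit of the 3-bit tuple t = (t2 t1 t0):
    d = -2*t2 + t1 + t0, in {-2, ..., 2}."""
    t2, t1, t0 = (t >> 2) & 1, (t >> 1) & 1, t & 1
    return -2 * t2 + t1 + t0

# B4_1: encoding of 'tuple plus a carry at the middle (2p) position',
# i.e. the recoding of (t + 0b010) mod 8. In hardware this is a direct
# truth table, which is faster than adding and then recoding.
B4_1 = {t: b4_0((t + 0b010) & 0b111) for t in range(8)}

def encode_tuple(t, carry_in):
    """Carry-select: both encodings are produced in parallel for every
    tuple; the ctrl signal (derived from the true carries) picks one."""
    return B4_1[t] if carry_in else b4_0(t)
```

For instance, the tuple `0b011` encodes to +2 without a carry and to -1 with one (the missing +4 being absorbed by the shared overlap bit of the next tuple).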

After generating the correct Booth encoding as well as the non-redundant multiplicand, Booth multiple generation and PPM addition take place as in conventional multipliers. Finally, the CPA stage is performed via a Multispeculative Adder, generating a result in partial carry-save format, which may be used by other partial carry-save units.

• Easier to develop efficient circuits
• Lower average delay
• Less area and power consumption

# Fast Neural Network Training on FPGA Using Quasi-Newton Optimization Method

ABSTRACT:

In this brief, a customized and pipelined hardware implementation of the quasi-Newton (QN) method on a field-programmable gate array (FPGA) is proposed for fast onsite training of artificial neural networks, targeting embedded applications. The architecture is scalable to cope with different neural network sizes while supporting batch-mode training. Experimental results demonstrate the superior performance and power efficiency of the proposed implementation over CPU, graphics processing unit, and prior FPGA QN implementations.

SOFTWARE IMPLEMENTATION:

• ModelSim
• Xilinx 14.2

EXISTING SYSTEM:

Field-programmable gate arrays (FPGAs) have been considered a promising alternative for implementing artificial neural networks (ANNs) in high-performance ANN applications because of their massive parallelism, high reconfigurability (compared to application-specific integrated circuits), and better energy efficiency (compared to graphics processing units (GPUs)). However, the majority of existing FPGA-based ANN implementations were static implementations for specific offline applications without learning capability. While achieving high performance, static hardware implementations of ANNs suffer from low adaptability across a wide range of applications. Onsite training provides dynamic training flexibility for ANN implementations.

Currently, high-performance CPUs and GPUs are widely used for offline training, but they are not suitable for onsite training, especially in embedded applications. Implementing onsite training in hardware is complex for the following reasons. First, some software features, such as floating-point number representations or advanced training techniques, are impractical or expensive to implement in hardware. Second, the implementation of batch-mode training increases the design complexity of the hardware. In batch training, the ANN's weight update is based on the training error across all the samples in the training data set, which enables good convergence and efficiently prevents perturbation from individual sample data. However, the implementation of batch training on an FPGA platform requires large data buffering and quick intermediate weight computation. There have been FPGA implementations of ANNs with online training.

These implementations used the backpropagation (BP) learning algorithm with non-batch training. Although widely used, the BP algorithm converges slowly, since it is essentially a steepest-descent method. Several advanced training algorithms, such as conjugate gradient, quasi-Newton (QN), and Levenberg–Marquardt, have been proposed to speed up convergence and reduce training time, at the cost of memory resources. This brief provides a feasible solution to the challenges in hardware implementation of onsite training by implementing the QN training algorithm and supporting batch-mode training on the latest FPGA. Our previous work implemented the Davidon–Fletcher–Powell QN (DFP-QN) algorithm on FPGA and achieved a 17-fold speedup over the software implementation. To further improve performance, this brief proposes a hardware implementation of the Broyden–Fletcher–Goldfarb–Shanno QN (BFGS-QN) algorithm on FPGA for ANN training.

To the best of our knowledge, this is the first reported FPGA hardware accelerator of the BFGS-QN algorithm in a deeply pipelined fashion. The performance of the implementation is analyzed quantitatively. The architecture is designed with low hardware cost in mind for scalable performance, and for both onsite and offline training. The designed accelerator is applied to ANN training and is able to cope with different network sizes. The experimental results show that the proposed accelerator achieves performance improvement up to 105 times for various neural network sizes, compared with the CPU software implementation. The hardware implementation is also applied to two real-world applications, showing superior performance.

• Lower performance
• Lower power efficiency

PROPOSED SYSTEM:

FPGA Hardware Implementation

Because the steps for computing B, λ, and g are the three computationally intensive parts of the BFGS-QN algorithm, we mainly describe how these three parts are designed.

B Matrix Computation

The B computation (BC) block derives an n × n matrix, where n is the total number of weights, according to step 5 of Algorithm 1. The most computationally intensive operations are matrix-by-vector multiplications (MVMs). For scalable performance and a modularized architecture, each MVM is implemented as multiple vector-by-vector multiplications (VVMs). Three on-chip RAM units, each storing n words, are used to store the intermediate computation results, which are repeatedly used in the subsequent computations.
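Since the brief's Algorithm 1 is not reproduced here, the following Python sketch assumes the standard BFGS inverse-Hessian update, B+ = (I − ρ s yᵀ) B (I − ρ y sᵀ) + ρ s sᵀ with ρ = 1/(yᵀs), expanded so that the dominant work is the MVM B·y plus outer products, mirroring the MVM-as-VVMs decomposition and intermediate-vector reuse described above:

```python
import numpy as np

def bfgs_update(B, s, y):
    """Standard BFGS update of the inverse-Hessian approximation B,
    where s = w_{k+1} - w_k and y = g_{k+1} - g_k. The product is
    expanded so the intermediate vector By (an MVM, realized as n VVMs
    in hardware) is computed once and reused, as the three on-chip
    RAMs in the BC block allow."""
    rho = 1.0 / (y @ s)
    By = B @ y                           # the expensive MVM
    term = rho * np.outer(s, By)         # rho * s * (By)^T
    return B - term - term.T + (rho + rho * rho * (y @ By)) * np.outer(s, s)
```

Expanding the update this way avoids ever forming an n × n intermediate product, which is exactly why the hardware can stream the computation row by row through the FIFO mentioned below.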

Fig. 2. Architecture of the λ computation block. (a) Forward–backward block for determining a search interval [a0, b0]. (b) GSS block. The operations in the dashed blocks share the same hardware.

In addition, an n-word first-in–first-out (FIFO) buffer is used to match the two input data streams of the final addition and form B_{k+1} row by row. The latency of the BC block depends on L_mult, L_sub, and L_div, the latencies of the multiplier, subtractor, and divider, respectively, and on T_vector, the number of execution cycles of a VVM. A deeply pipelined VVM unit is implemented with T_vector = n + L_add·(log2 L_add + 2), where the last terms account for draining out the pipeline.

2) Step Size λ Computation: An exact line search method is implemented to find λ_k according to step 2 of Algorithm 1. The method comprises two submodules: the forward–backward algorithm, which determines an initial search interval containing the minimizer, and the golden-section search (GSS) algorithm shown in Fig. 1, which reduces the interval iteratively. Fig. 2 shows the hardware architecture, where the computation time of each block is analyzed. The architecture comprises a number of pipelined arithmetic operations and the evaluation of the objective function E_T.
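The GSS stage can be sketched as follows (a generic golden-section search in Python; the function name, iteration count and interval arguments are our assumptions, standing in for the [a0, b0] interval produced by the forward–backward block):

```python
import math

def golden_section_search(f, a, b, iters=30):
    """Iteratively shrink an interval [a, b] known to contain the
    minimizer of the objective f(lambda). Each iteration reuses one of
    the two interior evaluations, so only one new evaluation of the
    (expensive) objective function is needed per step -- the property
    the shared dashed-block hardware in Fig. 2(b) exploits."""
    inv_phi = (math.sqrt(5) - 1) / 2       # ~0.618
    x1 = b - inv_phi * (b - a)
    x2 = a + inv_phi * (b - a)
    f1, f2 = f(x1), f(x2)
    for _ in range(iters):
        if f1 < f2:                        # minimum lies in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = b - inv_phi * (b - a)
            f1 = f(x1)
        else:                              # minimum lies in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + inv_phi * (b - a)
            f2 = f(x2)
    return (a + b) / 2
```

Each iteration shrinks the interval by a factor of about 0.618, so the number of objective evaluations, and hence the λ-computation latency, grows only logarithmically with the required precision.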

The dashed blocks in Fig. 2(a) and (b) are the same piece of hardware, shared by the corresponding operations. The complex objective function (1) needs to be evaluated frequently during the λ computation. Therefore, we also implement the objective function evaluation in hardware as a separate block for high performance. It can be modified for different types of ANNs while the other parts of the implementation remain the same. As shown in Fig. 3(a), the block is implemented in two parts: the first part exercises the ANN to obtain the outputs, and the second part computes the training error. Fig. 3(b) shows the structure of the first part. Two accumulation units are implemented to obtain h and ŷ as shown. Note that the accumulation unit is implemented differently from the one in the VVM unit, by using variable-length shift registers.
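The two-part structure of the objective-function block can be sketched in Python as follows. The single-hidden-layer shape, the sigmoid activation and the sum-of-squares error are assumptions for illustration, since the brief's network details and Equation (1) are not reproduced here:

```python
import numpy as np

def training_error(W1, W2, X, T):
    """Sketch of the objective-function block of Fig. 3: part one
    exercises the ANN on every sample of the training batch, part two
    accumulates the batch training error E_T.
    X: (samples x inputs), W1: (inputs x hidden), W2: (hidden x outputs),
    T: (samples x outputs) target values."""
    H = 1.0 / (1.0 + np.exp(-(X @ W1)))   # hidden activations h
    Y = H @ W2                            # network outputs y-hat
    return 0.5 * np.sum((Y - T) ** 2)     # batch training error
```

Because the whole batch enters the error sum, every λ trial in the line search re-evaluates all samples, which is exactly why this block is pipelined in hardware.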

Resource Usage Analysis

During the computation, intermediate results are accessed in different patterns: 1) some results, such as w and g, which are connected to multiple calculation modules, are read more than once; and 2) some values, such as h, F(h), and δ, are read in a different order from that in which they are written. Batch training is also supported. Therefore, all intermediate results are buffered in FPGA on-chip memories, enabling pipelined and parallel computation and data reuse. FPGA on-chip dual-port RAMs are used to implement the buffers. The required storage space of each module is shown in Table I. As n increases, off-chip memory will be needed, and a properly designed memory hierarchy containing on-chip and off-chip memories is required to avoid a slowdown. In addition, all training data are stored in off-chip memories for applications with a large number of training samples. The overall data path of the BFGS-QN hardware design needs a number of floating-point (FP) units, as shown in Table II. The number of required FP units is independent of n.