A Combined Arithmetic-High-Level Synthesis Solution to Deploy Partial Carry-Save Radix-8 Booth Multipliers in Data paths




While partial carry-save adders are easily designed by splitting them into several fragments that work in parallel, the design of partial carry-save multipliers is more challenging. Prior approaches have proposed several solutions based on radix-4 Booth recoding. This technique halves the height of the multiplier and is the most widespread option when designing multipliers, as only easy multiples are required. Larger radices provide further reductions at the expense of the appearance of hard multiples. Such is the case of radix-8 Booth multipliers, whose critical path lies in the generation of the 3X multiple. In order to mitigate this delay, our prior work proposed to first decouple the 3X computation and introduce it into the dataflow graph, leveraging the available slack. Building on this, we present a partial carry-save radix-8 Booth multiplier that receives three inputs in this format, namely the multiplicand, the multiplier, and the 3X multiple. Moreover, the rest of the data path is adapted to work in partial carry-save. In comparison with conventional radix-4 and radix-8 Booth-based data paths, the proposal reduces execution time and energy consumption while benefiting from the area reduction provided by the selection of radix 8. Furthermore, it outperforms prior state-of-the-art partial carry-save multipliers based on radix 4.



MULTIMEDIA, signal processing, and similar applications are usually dominated by integer addition and multiplication. Hence, developing faster, smaller, and more power-efficient Functional Units (FUs) is critical to implementing efficient circuits. The fastest adders, such as the Kogge-Stone prefix adder, incur a noticeable area and power penalty. Several approximate adders have appeared that aim to improve these parameters, especially by reducing the average delay and offering different area/power tradeoffs. Additive structures have also been optimized by using redundant formats such as total or partial carry-save adders.
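The partial carry-save idea mentioned above can be sketched behaviorally: the operands are split into k-bit fragments that add in parallel, and each fragment's carry-out is saved as a separate bit instead of being propagated. The function and parameter names below are illustrative, not taken from any cited design.

```python
def pcs_add(x, y, n=16, k=4):
    """Add two n-bit integers as n//k independent k-bit fragments.

    Each fragment adds in parallel and keeps its carry-out as a saved
    carry bit instead of propagating it, so the result is a redundant
    pair (U, A) with X + Y = U + A (mod 2**n): partial carry-save."""
    mask = (1 << k) - 1
    u, a = 0, 0
    for f in range(n // k):
        s = ((x >> (f * k)) & mask) + ((y >> (f * k)) & mask)
        u |= (s & mask) << (f * k)        # fragment sum, no propagation
        carry = s >> k                    # fragment carry-out
        if (f + 1) * k < n:
            a |= carry << ((f + 1) * k)   # saved carry for the next fragment
    return u, a
```

The redundant pair (U, A) recombines to the true sum modulo 2^n, and a downstream partial carry-save unit can consume the pair directly without ever performing the full carry propagation.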

Figure 1: Radix R Booth multiplier

Nevertheless, data paths are still dominated by multipliers, as their delay, area and power grow faster than in the case of adders. Multipliers typically consist of an additive tree known as the partial product matrix (PPM), which accumulates the partial products and reduces them to just two operands, and a last stage Carry Propagate Adder (CPA), which adds these two operands and calculates the final result. Given an m × n multiplier, the PPM is composed of m × n 1-bit partial products pi,j , defined by Equation 1.

p_{i,j} = x_i · y_j,  ∀ i, j, 0 ≤ i < m, 0 ≤ j < n,     (1)

where X = x_{m−1} x_{m−2} … x_1 x_0 and Y = y_{n−1} y_{n−2} … y_1 y_0 are the multiplicand and the multiplier, respectively. Hence, in order to reduce the complexity of a multiplier, either the width m or the height n must be diminished. Lowering m leads to truncated multipliers, which is not the purpose of this work. Adopting the other approach, the height of the PPM is usually reduced by applying a Booth recoding in radix R = 2^r, r > 0, which maintains the accuracy of the multiplier. The generic structure of a radix R Booth multiplier is depicted in Figure 1. Thus, the height of an m × n multiplier is reduced from n to ⌈(n + 1)/r⌉. Intuitively, the larger r is, the better. Nevertheless, applying this recoding technique incurs a certain area penalty to calculate the Booth multiples, or sub-products. For this reason, typically r = 2, because given a product X·Y, only the ±0X, ±1X, and ±2X multiples need to be generated; all of them can easily be calculated via shift and negation operations. For r > 2, hard multiples appear, i.e., multiples that cannot be composed with just shifts and negations. The penalty of calculating them then exceeds the gains from the reduction in the PPM height. In particular, radix 8 Booth recoding requires the computation of the following multiples: ±0X, ±1X, ±2X, ±3X, and ±4X. The ±3X multiples are not straightforward to compute: typically, 3X is computed as the addition 2X + X, thus lengthening the critical path of the multiplier. Previous approaches have proposed decoupling this computation, either by pipelining or by inserting the 3X computation as an independent operation in the Dataflow Graph (DFG), leveraging the slack. In this way it is possible to take advantage of the larger height reduction enabled by radix 8 recoding over radix 4 recoding without increasing the radix 8 Booth multiplier's critical path. This paper is based on the latter approach.
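The recoding just described can be illustrated with a small behavioral model (a sketch for intuition, not the hardware encoder): each radix-8 Booth digit lies in {−4, …, 4}, the PPM height drops to ⌈(n + 1)/3⌉ rows, and the digits ±3 are precisely the hard multiples that require 3X.

```python
def booth8_digits(y, n=16):
    """Radix-8 Booth digits of an n-bit two's-complement multiplier.
    Digit p scans the overlapping group y[3p+2] y[3p+1] y[3p] y[3p-1]:
    d_p = -4*y[3p+2] + 2*y[3p+1] + y[3p] + y[3p-1], so d_p is in {-4..4}."""
    v = y & ((1 << n) - 1)
    sign = (v >> (n - 1)) & 1
    def b(i):                       # bit access with sign extension
        if i < 0:
            return 0
        return sign if i >= n else (v >> i) & 1
    ndig = -(-(n + 1) // 3)         # ceil((n+1)/3) digits
    return [-4 * b(3*p + 2) + 2 * b(3*p + 1) + b(3*p) + b(3*p - 1)
            for p in range(ndig)]

def booth8_product(x, y, n=16):
    """X*Y as a sum of shifted digit multiples d_p * X * 8^p;
    |d_p| == 3 is the hard multiple that needs 3X = 2X + X."""
    return sum(d * x * (8 ** p) for p, d in enumerate(booth8_digits(y, n)))
```

Note that the digit list for a multiplier such as 45 contains a ±3 entry, which is where a hardware implementation must supply the pre-computed 3X multiple.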
In this work, the partial carry-save format is incorporated into the whole data path, which implies modifying both the 3X calculation unit and the multiplier. A detailed study of the relationship between the radix and the fragment size is presented and applied to the case of radix 8. Therefore, it is possible to develop a methodology to deploy data paths that work entirely in partial carry-save format and only reduce results to non-redundant form when they are output nodes of the DFG. According to the synthesis results, our proposed multipliers outperform conventional radix 4 and radix 8 designs, as well as our prior radix 4 partial carry-save multipliers. Furthermore, the data paths implemented with the proposed approach reduce energy consumption by 27% while being 22% faster on average.


  • Critical to develop efficient circuits
  • Higher average delay
  • Greater area and power consumption




In order to justify this work, several multipliers were synthesized using Synopsys Design Compiler with a 65 nm target library and without placing any delay constraint. We measured the delay, area, power, and energy of radix 4 and radix 8 Booth multipliers. Both types of multipliers implement the PPM stage through a 4:2 compressor tree and the CPA stage through a Kogge-Stone adder. Table II presents the delay, area, power, and energy of the radix 8 Booth multipliers after synthesis. All the values have been normalized with respect to a radix 4 Booth multiplier for different data widths. Two values are presented for each reported metric: those obtained when the 3X multiple is calculated within the radix 8 Booth multiplier, labelled w/3X, and those obtained when the 3X multiple has been pre-calculated in the data path as an independent operation, labelled w/o 3X. In terms of delay, the radix 8 implementations are slightly faster for small sizes, i.e., 8 and 16 bits; however, for 32 bits and larger sizes, the radix 4 implementations are better. As shown in the w/o 3X column, the delay can be balanced and even shortened by pre-calculating the 3X multiple for the radix 8 implementations. Regarding area and power, it is clear that reducing the PPM height from (n + 1)/2 to (n + 1)/3 achieves an improvement.

Figure 2: Example of 3X node insertion

Figure 3: A multiplier with non-redundant operands and radix 4 Booth encoding

As shown in Table II, the reduction in area ranges from 34% to 24% for complete radix 8 Booth multipliers, and from 40% to 25% when pre-calculating the 3X multiple. In the case of power, the reduction ranges from 16% to 9% for complete radix 8 Booth multipliers, and from 28% to 18% when pre-calculating the 3X multiple. In terms of energy, radix 8 implementations are only more efficient for small bit widths; nevertheless, this issue can also be resolved, actually obtaining energy reductions, by pre-calculating the 3X multiple. All the aforementioned values were computed considering just an individual multiplier instance, so including the decoupled 3X computations and routing them will increase the area and energy consumption of the data path. Nonetheless, this example illustrates two main ideas: first, pre-calculating the 3X multiple reduces the delay of the radix 8 multiplier; and second, moving to radix 8-based implementations produces a noticeable area and energy reduction with respect to radix 4-based implementations. These gains can therefore be leveraged to route and store the 3X computations.

Figure 4: On-the-Fly Correcting Radix 4 Booth Multispeculative Multiplier example of use. (a) A multiplier with partial carry-save operands, k = 4. (b) Reconstructing the non-redundant multiplicand. (c) Multiplier branch: radix 4 carry-select Booth encoding generates the multiples.




Problem Statement:

Partial carry-save formats have proved to be efficient in the state of the art. Hence, partial carry-save multipliers are essential to implement data paths using this format. All the solutions proposed in the literature are based on radix 4 Booth recoding. Nevertheless, a larger radix may provide greater benefits in terms of energy if the hard multiples can be handled efficiently. Therefore, we propose a method for doing this, as well as a generic Booth-based multiplier for partial carry-save inputs and any radix R = 2^r. In this paper, we apply this method to the specific case of radix 8 and explore different fragment sizes.

  1. Inserting the 3X Nodes

In order to decouple the 3X calculation, we introduce an extra 2X + X addition node per product, selecting the farthest predecessor of the multiplication in terms of DFG height, i.e., X. In this way, the probability of providing enough slack to schedule the 2X + X operations is maximized; in other words, the probability of increasing the latency is minimized. Figure 2 illustrates how this approach works. First, it should be noted that each operation is labelled by a number located at the right of the node. In this figure it can be observed that if the 2X + X node were inserted after Operation 4, this would incur additional latency. However, as shown in Figure 2, selecting Operation 2 is the best choice, as its distance to Operation 5 is greater than the distance between Operations 4 and 5; or, in other words, the height of Operation 2 is less than the height of Operation 4. Thus, by selecting the operand supplied by the output of Operation 2 to pre-calculate 3X (2X + X), the latency of the final schedule is not affected.
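The selection rule above can be prototyped in a few lines. The sketch below uses hypothetical node names and represents the DFG as a plain predecessor map; it measures each operand's height and hangs the 2X + X node off the shallower one.

```python
def insert_3x_nodes(dfg, mults):
    """Insert one 2X+X (i.e. 3X) node per multiplication, attached to the
    multiplication's farthest predecessor in terms of DFG height, so the
    extra addition has the largest possible slack.
    dfg: node -> list of predecessor nodes; mults: multiplication nodes."""
    memo = {}
    def height(node):
        if node not in memo:
            preds = dfg.get(node, [])
            memo[node] = 0 if not preds else 1 + max(height(p) for p in preds)
        return memo[node]
    for m in mults:
        src = min(dfg[m], key=height)   # lowest height = most slack below
        node3x = "3x_" + src            # new node computing 2X + X
        dfg[node3x] = [src]
        dfg[m] = dfg[m] + [node3x]      # the multiplier also reads 3X
    return dfg

# Mirroring Figure 2: op5 multiplies the outputs of op2 (shallow) and op4 (deep),
# so the 3X node is derived from op2 and the schedule length is unaffected.
dfg = {"op1": [], "op2": [], "op3": ["op1"], "op4": ["op3"],
       "op5": ["op2", "op4"]}
insert_3x_nodes(dfg, ["op5"])
```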

  2. The On-the-Fly Correcting Radix 4 Booth Multispeculative Multiplier

In this subsection, the foundations of the radix 4 partial carry-save Booth multiplier presented in [26] are outlined through the example shown in Figures 3 and 4. Figure 3 displays an example of multiplication using radix 4 Booth encoding, while Figure 4a depicts a case with the same input operands but using the partial carry-save format. In this example, 16-bit FUs with fragments of k = 4 bits have been considered. This means that there are 16/4 = 4 fragments producing group generate (Gx, Gy) and propagate (Px, Py) signals. It should be mentioned that the group generate signals are actually the carry-out signals produced by the fragments. The generate and propagate signals produced by the most significant fragment are labelled '-', because they are no longer utilized.

Given a multiplicand in partial carry-save format, i.e. X = U + A, Figure 4b displays how its non-redundant version is calculated. For this purpose, the carry-select technique is employed. In other words, all the fragments are computed in parallel with both carry-in ‘0’ (there is no actual addition) and ‘1’, while the actual carries of the fragments are computed by the Real Carry Calculation unit [26], [43]. This unit is a crucial part of this step, as the carries generated by each fragment of a partial carry-save adder might be incorrect. However, the Gx, Px group signals generated by every fragment allow the fast computation of the true carries. The selection logic simply chooses a fragment belonging to either U0 or U1, according to the values of Cx. It should be noted that U0 = U, while U1 is obtained after adding ‘1’ to every fragment belonging to U, except the first one.
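This reconstruction step can be sketched behaviorally as follows. The sketch is illustrative only: the fragment width k and the G/P/C names follow the example above, and the gate-level carry-select duplication (U0/U1) is folded into a single arithmetic selection per fragment.

```python
def reconstruct(u, a, n=16, k=4):
    """Rebuild the non-redundant X from a partial carry-save pair (U, A).

    Each fragment adds its U and A parts in parallel. The group generate
    is the fragment carry-out, the group propagate means the fragment sum
    is all-ones, and the Real Carry Calculation ripples
    C[f+1] = G[f] OR (P[f] AND C[f]) across the n//k fragments. The final
    fragment value is then selected between U0 (carry-in '0') and
    U1 (carry-in '1') by the true carry -- the carry-select step."""
    mask = (1 << k) - 1
    frags, g, p = [], [], []
    for f in range(n // k):
        s = ((u >> (f * k)) & mask) + ((a >> (f * k)) & mask)
        frags.append(s)
        g.append(s >> k)                              # group generate
        p.append(1 if (s & mask) == mask else 0)      # group propagate
    c = [0]
    for f in range(n // k - 1):
        c.append(g[f] | (p[f] & c[f]))                # true carries
    x = 0
    for f, s in enumerate(frags):
        x |= ((s + c[f]) & mask) << (f * k)           # select U0/U1 fragment
    return x
```

In hardware all fragments evaluate both candidate sums in parallel, so the only serial part is the short G/P carry chain over n/k fragments.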

Figure 4c provides an overview of the multiplier branch. Given a multiplier in partial carry-save format, i.e. Y = V + B, the V signal will be used to generate the Booth encoding. First, a Real Carry Calculation unit computes the actual carries of the multiplier using the Gy, Py signals. Second, tuple generation takes place, which is not as immediate as in a conventional multiplier: as the non-redundant multiplier is not known, the value of some multiplier bits needs to be calculated. This calculation requires few logic levels, as it just depends on the true carries and some bits belonging to V. After the tuples are generated, the B4_0 encoder implements the typical radix 4 recoding, while the B4_1 encoder implements the same recoding but considering that every Booth tuple t_p = v_{2p+1} v_{2p} v_{2p−1} has a carry-in signal equal to '1' at position 2p. In other words, this is equivalent to adding the constant "010" to every tuple. Nevertheless, instead of adding and then recoding, the encoding is performed directly following a truth table [26], which is faster. Finally, the correct encoding for every tuple is selected according to the control (ctrl) signals. As can be observed, the final encoding coincides with that provided by the conventional radix 4 Booth recoding shown in Figure 3. The main advantage of this approach is that all the fragments work in parallel once the real carries are known.
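The add-then-recode equivalence that the truth table exploits can be checked with a small behavioral model. This is a sketch of the selection idea only, not the two parallel gate-level encoders of [26], and it assumes nonnegative values for brevity.

```python
# Radix-4 Booth digit for tuple (t2, t1, t0): d = -2*t2 + t1 + t0
B4 = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
      (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0}

def booth4_pcs_digits(v, c, n=16):
    """Radix-4 Booth digits of V + C without forming the sum first.

    C holds the true carries, one bit per even position. Tuple p is
    encoded plainly (the B4_0 case) when no carry reaches position 2p,
    or as tuple + '010' done by table lookup (the B4_1 case) when one
    does; an overflowing tuple selects a carry-in for the next tuple."""
    digits, prev, ripple = [], 0, 0
    for p in range(-(-(n + 1) // 2)):            # ceil((n+1)/2) tuples
        b1 = (v >> (2 * p)) & 1 if 2 * p < n else 0
        b2 = (v >> (2 * p + 1)) & 1 if 2 * p + 1 < n else 0
        s = ((b2 << 1) | b1) + ((c >> (2 * p)) & 1) + ripple
        ripple, s = s >> 2, s & 3                # corrected pair + carry-out
        digits.append(B4[(s >> 1, s & 1, prev)])
        prev = s >> 1                            # corrected v[2p+1] for next tuple
    return digits
```

Expanding the digits as Σ d_p·4^p reproduces the non-redundant value V + C, which is exactly the property the ctrl-signal selection relies on.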

After generating the correct Booth encoding as well as the non-redundant multiplicand, Booth multiple generation and PPM addition take place as in conventional multipliers. Finally, the CPA stage is performed via a Multispeculative Adder, generating a result in partial carry-save format that may be used by other partial carry-save units.



  • Easier to develop efficient circuits
  • Lower average delay
  • Less area and power consumption






Fast Neural Network Training on FPGA Using Quasi-Newton Optimization Method



In this brief, a customized and pipelined hardware implementation of the quasi-Newton (QN) method on a field-programmable gate array (FPGA) is proposed for fast onsite training of artificial neural networks, targeting embedded applications. The architecture is scalable to different neural network sizes while supporting batch-mode training. Experimental results demonstrate the superior performance and power efficiency of the proposed implementation over CPU, graphics processing unit (GPU), and FPGA QN implementations.


  • Modelsim
  • Xilinx 14.2


Field-programmable gate arrays (FPGAs) have been considered a promising alternative for implementing artificial neural networks (ANNs) in high-performance ANN applications because of their massive parallelism, high reconfigurability (compared to application-specific integrated circuits), and better energy efficiency (compared to graphics processing units (GPUs)). However, the majority of existing FPGA-based ANN implementations have been static implementations for specific offline applications without learning capability. While achieving high performance, static hardware implementations of ANNs suffer from low adaptability across a wide range of applications. Onsite training provides dynamic training flexibility for ANN implementations.

Currently, high-performance CPUs and GPUs are widely used for offline training, but they are not suitable for onsite training, especially in embedded applications. Implementing onsite training in hardware is complex for the following reasons. First, some software features, such as floating-point number representations or advanced training techniques, are impractical or expensive to implement in hardware. Second, the implementation of batch-mode training increases the design complexity of the hardware. In batch training, the ANN's weight update is based on the training error accumulated over all the samples in the training data set, which enables good convergence and reduces sensitivity to perturbations from individual samples. However, implementing batch training on an FPGA platform requires large data buffers and quick computation of intermediate weights. There have been FPGA implementations of ANNs with online training.
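To make the batch-mode distinction concrete, here is a minimal sketch of a batch objective. It uses a generic sum-of-squared-errors over all samples; the brief's exact error function is not reproduced here, and the one-parameter linear model in the usage line is purely hypothetical.

```python
def batch_error(w, samples, predict):
    """Batch-mode training error: the squared error is accumulated over
    ALL samples before a single weight update, instead of updating the
    weights after every sample as in stochastic/online training."""
    return 0.5 * sum((predict(w, x) - t) ** 2 for x, t in samples)

# Toy usage with a hypothetical one-parameter linear model y = w*x:
err = batch_error(1.0, [(1.0, 2.0), (2.0, 3.0)], lambda w, x: w * x)
```

In hardware this is what forces the large data buffering: every sample's contribution must be available before one update step completes.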

These implementations used the backpropagation (BP) learning algorithm with non-batch training. Although widely used, the BP algorithm converges slowly, since it is essentially a steepest-descent method. Several advanced training algorithms, such as conjugate gradient, quasi-Newton (QN), and Levenberg–Marquardt, have been proposed to speed up convergence and reduce training time, at the cost of memory resources. This brief provides a feasible solution to the challenges of implementing onsite training in hardware by implementing the QN training algorithm with support for batch-mode training on the latest FPGA. Our previous work implemented the Davidon–Fletcher–Powell QN (DFP-QN) algorithm on FPGA and achieved a 17× speedup over the software implementation. To further improve performance, this brief proposes a hardware implementation of the Broyden–Fletcher–Goldfarb–Shanno QN (BFGS-QN) algorithm on FPGA for ANN training.

To the best of our knowledge, this is the first reported FPGA hardware accelerator of the BFGS-QN algorithm in a deeply pipelined fashion. The performance of the implementation is analyzed quantitatively. The architecture is designed with low hardware cost in mind for scalable performance, and for both onsite and offline training. The designed accelerator is applied to ANN training and is able to cope with different network sizes. The experimental results show that the proposed accelerator achieves performance improvement up to 105 times for various neural network sizes, compared with the CPU software implementation. The hardware implementation is also applied to two real-world applications, showing superior performance.


  • Lower performance
  • Lower power efficiency


FPGA Hardware Implementation

Because the computations of B, λ, and g are the three most computationally intensive parts of the BFGS-QN algorithm, we mainly describe how these three parts are designed.

B Matrix Computation

The B computation (BC) block derives an n × n matrix, where n is the total number of weights, according to step 5 of Algorithm 1. The most computationally intensive operations are matrix-by-vector multiplications (MVMs). For scalable performance and a modularized architecture, each MVM is implemented as multiple vector-by-vector multiplications (VVMs). Three on-chip RAM units, each storing n words, are used to store the intermediate computation results, which are repeatedly used in subsequent computations.
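Since Algorithm 1 itself is not reproduced in this summary, the sketch below uses the textbook BFGS update of the Hessian approximation to show how the MVM decomposes into row-wise VVMs, mirroring the BC block's decomposition. It is a pure-Python illustration, not the paper's pipelined datapath.

```python
def bfgs_update(B, s, y):
    """Textbook BFGS update of the Hessian approximation:
        B' = B + (y y^T)/(y^T s) - (B s)(B s)^T/(s^T B s),
    where s is the weight step and y the gradient difference. The
    matrix-by-vector product B*s is expressed as n row-wise
    vector-by-vector multiplications (VVMs)."""
    n = len(s)
    vvm = lambda u, v: sum(a * b for a, b in zip(u, v))
    Bs = [vvm(row, s) for row in B]     # MVM realized as n VVMs
    ys = vvm(y, s)                      # curvature term y^T s
    sBs = vvm(s, Bs)                    # s^T B s
    return [[B[i][j] + y[i] * y[j] / ys - Bs[i] * Bs[j] / sBs
             for j in range(n)] for i in range(n)]
```

The update touches every matrix entry but reuses the n-word vector Bs throughout, which is why the hardware buffers such intermediate vectors in on-chip RAM.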

Fig. 2. Architecture of the λ computation block. (a) Forward–backward block for determining a search interval [a0, b0]. (b) GSS block. The operations in the dashed blocks share the same hardware.

In addition, an n-word first-in first-out (FIFO) buffer is used to match the two input data streams of the final addition and form B_{k+1} row by row. The latency of the block depends on L_mult, L_sub, and L_div, the latencies of the multiplier, subtractor, and divider, respectively, and on T_vector, the number of execution cycles of a VVM. A deeply pipelined VVM unit is implemented with T_vector = n + L_add⌈log2 L_add⌉ + 2, where the last two terms are for draining out the pipeline.

Step Size λ Computation

An exact line search method is implemented to find λ_k according to step 2 of Algorithm 1. The method comprises two submodules: the forward–backward algorithm, which determines an initial search interval containing the minimizer, and the golden section search (GSS) algorithm shown in Fig. 1, which reduces the interval iteratively. Fig. 2 shows the hardware architecture, where the computation time of each block is analyzed. The architecture comprises a number of pipelined arithmetic operations and the evaluation of the objective function E_T.

The dashed blocks in Fig. 2(a) and (b) are the same piece of hardware, shared by the corresponding operations. The complex objective function (1) needs to be evaluated frequently during λ computation; therefore, we also implement the objective function evaluation in hardware as a separate block for high performance. It can be modified for different types of ANNs while the other parts of the implementation remain the same. As shown in Fig. 3(a), the block is implemented in two parts: the first part exercises the ANN to obtain its outputs, and the second part computes the training error. Fig. 3(b) shows the structure of the first part. Two accumulation units are implemented to obtain the h and ŷ values shown. Note that the accumulation unit is implemented differently from the one in the VVM unit, using variable-length shift registers.
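The two line-search submodules compose as follows in software. This is a behavioral sketch with assumed initial-step and tolerance parameters, not the pipelined hardware: the forward–backward pass grows the step until the objective stops decreasing, and GSS then shrinks the resulting bracket.

```python
import math

def line_search(f, step=0.1, tol=1e-6):
    """Exact line search as in the λ-computation block:
    (a) forward-backward: double the step while f keeps decreasing,
        yielding a bracket [lo, b] that contains the minimizer;
    (b) golden section search (GSS): shrink the bracket iteratively,
        reusing one interior evaluation per step."""
    # (a) forward-backward bracketing, starting from λ = 0
    a, fa = 0.0, f(0.0)
    b, fb = step, f(step)
    while fb < fa:
        a, fa = b, fb
        step *= 2.0
        b, fb = a + step, f(a + step)
    lo = max(0.0, a - step)              # conservative left end of bracket
    # (b) golden section search on [lo, b]
    phi = (math.sqrt(5.0) - 1.0) / 2.0   # golden ratio conjugate, ~0.618
    x1, x2 = b - phi * (b - lo), lo + phi * (b - lo)
    f1, f2 = f(x1), f(x2)
    while b - lo > tol:
        if f1 < f2:
            b, x2, f2 = x2, x1, f1       # keep [lo, x2]
            x1 = b - phi * (b - lo)
            f1 = f(x1)
        else:
            lo, x1, f1 = x1, x2, f2      # keep [x1, b]
            x2 = lo + phi * (b - lo)
            f2 = f(x2)
    return 0.5 * (lo + b)
```

The hardware sharing noted in Fig. 2 corresponds to the fact that both phases repeat the same pattern of arithmetic plus one objective-function evaluation per iteration.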

Resource Usage Analysis

During the computation, intermediate results are accessed in different patterns: 1) some results, such as w and g, which feed multiple calculation modules, are read more than once; and 2) other values, such as h, F(h), and δ, are read in a different order than they are written. Batch training is also supported. Therefore, all intermediate results are buffered in FPGA on-chip memories to enable pipelined and parallel computations and data reuse. FPGA on-chip dual-port RAMs are used to implement the buffers. The required storage space of each module is shown in Table I. As n increases, off-chip memory will be used, and a properly designed memory hierarchy containing on-chip and off-chip memories is needed to prevent a slowdown. In addition, all training data are stored in off-chip memories for applications with large training data sets. The overall data path of the BFGS-QN hardware design needs a number of floating-point (FP) units, as shown in Table II. The number of required FP units is independent of n.


  • Higher performance
  • Higher power efficiency


Qiang Liu, Jia Liu, Ruoyu Sang, Jiajun Li, Tao Zhang, and Qijun Zhang, "Fast Neural Network Training on FPGA Using Quasi-Newton Optimization Method," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2018.