A Combined Arithmetic-High-Level Synthesis Solution to Deploy Partial Carry-Save Radix-8 Booth Multipliers in Data paths
While partial carry-save adders are easily designed by splitting them into several fragments working in parallel, the design of partial carry-save multipliers is more challenging. Prior approaches have proposed several solutions based on the radix-4 Booth recoding. This technique makes it possible to diminish the height of a multiplier by half, this being the most widespread option when designing multipliers, as only easy multiples are required. Larger radices provide further reductions at the expense of the appearance of hard multiples. Such is the case of radix-8 Booth multipliers, whose critical path is located at the generation of the 3X multiple. In order to mitigate this delay, in our prior works, we proposed to first decouple the 3X computation and introduce it in the dataflow graph, leveraging the available slack. Considering this, we then present a partial carry-save radix-8 Booth multiplier that receives three inputs in this format, namely, the multiplicand, the multiplier, and the 3X multiple. Moreover, the rest of the data path is adapted to work in partial carry-save. In comparison with conventional radix-4 and radix-8 Booth-based data paths, the proposal is able to diminish the execution time and energy consumption while benefits from the area reduction provided by the selection of radix 8. Furthermore, it outperforms prior state-of-the-art partial carry-save multipliers based on radix 4.
MULTIMEDIA, signal processing applications and others are usually dominated by integer addition and multiplication. Hence, developing faster, smaller and power efficient Functional Units (FUs) is critical to implement efficient circuits. The fastest adders, such as the Kogge-Stone prefix adder, possess a noticeable area and power penalty. Several approximate adders have appeared that are aimed at improving these paramenters, especially by reducing the average delay and offering different area/power tradeoffs. Additive structures have also been optimized by using redundant formats such as the total or partial carry save adders.
Figure 1 : Radix R Booth multiplier
Nevertheless, data paths are still dominated by multipliers, as their delay, area and power grow faster than in the case of adders. Multipliers typically consist of an additive tree known as the partial product matrix (PPM), which accumulates the partial products and reduces them to just two operands, and a last stage Carry Propagate Adder (CPA), which adds these two operands and calculates the final result. Given an m × n multiplier, the PPM is composed of m × n 1-bit partial products pi,j , defined by Equation 1.
pi,j = xi y j, ∀i, j, 0 ≤ i < m, 0 ≤ j < n. (1)
where, X = xm−1xm−2 … x1x0 and Y = yn−1 yn−2 … y1 y0 are the multiplicand and the multiplier, respectively. Hence, in order to reduce the complexity of a multiplier, either the width m or the height n must be diminished. Lowering m leads to truncated multipliers, which is not the purpose of this work. Adopting the other approach, the height of the PPM is usually reduced by applying a Booth recoding in radix R = 2r,r > 0, which maintains the accuracy of the multiplier. The generic structure of a Radix R Booth Multiplier is depicted in Figure 1. Thus, the height of an m × n multiplier is reduced from n to (n + 1)/r. Intuitively, the larger r is, the better. Nevertheless, applying this recoding technique incurs a certain area penalty to calculate the Booth multiples or sub products. For this reason, typically r = 2, because given a product X*Y, only ± 0X, ± 1X and ± 2X multiples need to be generated. As can be observed, all of them can easily be calculated via shift and negation operations. For r > 2, there appear hard multiples, i.e., those that are not composable with just shifts and negations. Due to their calculation, the penalty then exceeds the gains from the reduction in the PPM height. In particular, the case of the radix 8 Booth recoding requires the computation of the following multiples: ±0X, ±1X, ±2X, ±3X and ±4X. The ±3X multiples are not straightforward to compute. Typically, 3X is computed as the addition of 2X+X, thus increasing the critical path of the multiplier. Previous approaches have proposed decoupling this computation by just pipelining or by inserting the 3X computation as an independent operation in the Dataflow Graph (DFG) by leveraging the slack. In this way it is possible to take advantage of the larger height reduction enabled by the radix 8 recoding instead of the radix 4 recoding, without increasing the radix 8 Booth multiplier’s critical path. This paper is based on the latter approach. In this work the partial carry-save format is incorporated into the whole data path, which will imply modifying both the 3X calculation unit and the multiplier. A detailed study of the relationship between the radix and the fragment size is presented and applied to the case of radix 8. Therefore, it is possible to develop a methodology to deploy data paths that work entirely in partial carry-save format and just reduce results to a non-redundant form when they are output nodes of the DFG. According to the synthesis results, our proposed multipliers outperform conventional radix 4 and radix 8 designs, as well as our prior radix 4 partial carry-save multipliers. Furthermore, the data paths implemented with the proposed approach can reduce 27% energy consumption, being 22% faster on average as well.
- Critical to develop efficient Circuit
- Maximum average delay
- More area and power consumptions
RADIX 8 VS RADIX 4 BOOTH MULTIPLIERS:
In order to justify this work, several multipliers were synthesized using Synopsys Design Compiler with a 65 nm target-library and without placing any delay constraint. We measured the delay, area, power and energy of the radix 4 and radix 8 Booth multipliers. Both types of multipliers implement the PPM stage through a 4:2 compressor tree and the CPA stage through a Kogge-Stone adder. Table II presents the delay, area, power, and energy of the radix 8 Booth multipliers after synthesis. All the values have been normalized with respect to a radix 4 Booth multiplier for different data widths. It should also be noted that two values are presented for each metric that is reported, namely: those considering that the 3X multiple is being calculated within the radix 8 Booth multiplier, labelled as w/3X, and those considering that the 3X multiple has been pre calculated in the data path as an independent operation, labelled as w/o 3X. As can be observed, in terms of delay, radix 8 implementations are slightly faster when the size is small, i.e. 8 and 16-bits. However, for 32-bits and larger sizes, radix 4 implementations are better. As shown in column w/o 3X, the delay can be balanced and even shortened by pre calculating the 3X multiple for radix 8 implementations. Regarding the area and power, it is clear that reducing the PPM from (n + 1)/2 to (n + 1)/3 achieves an improvement.
Figure 2 : Example of 3X node insertion
Figure 3 : A multiplier with non-redundant operands and radix 4 Booth encoding
As shown in Table II, the reduction in area ranges from 34% to 24% for complete radix 8 Booth multipliers, and from 40% to 25% when pre calculating the 3X multiple. In the case of power, the reduction ranges from 16% to 9% for complete radix 8 Booth multipliers, and from 28% to 18% when pre calculating the 3X multiple. In terms of energy, radix 8 implementations are only more efficient for small bit widths. Nevertheless, this issue can also be resolved, actually obtaining energy reductions, by pre calculating the 3X multiple. All the aforementioned values have been computed when considering just an individual instance of multiplier, so including the decoupled 3X computations and routing them will contribute to increasing the area and energy consumption of the data path. Nonetheless, this example illustrates two main ideas: firstly, there is a delay reduction in the radix 8 multiplier when pre calculating the 3X multiple; and, secondly, moving to radix 8-based implementations produces a noticeable area and energy reduction with respect to the radix 4-based implementations. Therefore, these gains can be leveraged to route and store the 3X computations.
Figure 4: On-the-Fly Correcting Radix 4 Booth Multi speculative Multiplier example of use. (a) A multiplier with partial carry-save operands, k = 4. (b) Reconstructing the non redundant multiplicand. (c) Multiplier branch: radix 4 carry select Booth encoding generates the multiples.
Partial carry-save formats have proved to be efficient if we look at the state of the art. Hence, partial carry-save multipliers are essential to implement data paths using this format. All the solutions proposed in literature are based on radix 4 Booth recoding. Nevertheless, a larger radix may provide greater benefits in terms of energy if we are able to efficiently deal with the hard multiples. Therefore, we propose a method for doing this, as well as a generic Booth-based multiplier for partial carry-save inputs and any 2r radix. In this paper, we apply this method to the specific case of radix 8 and explore different fragment sizes.
- Inserting the 3X Nodes
In order to decouple the 3X calculation, we introduce an extra 2X + X addition node per product, selecting the farthest predecessor, i.e. X, of the multiplication in terms of DFG height. In this way, the probability of providing enough slack to schedule the 2X + X operations is maximized. In other words, the probability of increasing the latency is minimized. Figure 2 illustrates how this approach works. First, it should be noted that each operation is labelled by a number located at the right of the node. In this figure it can observed that if the 2X+X node were inserted after Operation 4, this would incur additional latency. However, as shown in Figure 2, selecting Operation 2 is the best choice, as its distance to Operation 5 is greater than the distance between Operations 4 and 5. Or, in other words, the height of Operation 2 is less than the height of Operation 4. Thus, by selecting the operand supplied by the output of Operation 2 to pre-calculate 3X (2X + X), the latency of the final schedule is not affected.
- The On-the-Fly Correcting Radix 4 Booth Multi speculative Multiplier
In this subsection, the foundations of the radix 4 partial carry-save Booth multiplier presented in  will be outlined through the example shown in Figures 3 and 4. Figure 3 displays an example of multiplication utilizing radix 4 Booth encoding, while Figure 4a depicts a case with the same input operands but using partial carry-save format. In this example, 16-bit FUs with fragments of k = 4-bits have been considered. This means that there will be 16/4 = 4 fragments producing group generate (Gx, Gy) and propagate (Px, Py) signals. It should be mentioned that the group generate signals are actually the carry-out signals produced by the fragments. The generate and propagate signals produced by the most significant fragment have been labelled as ‘-’, because they will no longer be utilized.
Given a multiplicand in partial carry-save format, i.e. X = U + A, Figure 4b displays how its non-redundant version is calculated. For this purpose, the carry-select technique is employed. In other words, all the fragments are computed in parallel with both carry-in ‘0’ (there is no actual addition) and ‘1’, while the actual carries of the fragments are computed by the Real Carry Calculation unit , . This unit is a crucial part of this step, as the carries generated by each fragment of a partial carry-save adder might be incorrect. However, the Gx, Px group signals generated by every fragment allow the fast computation of the true carries. The selection logic simply chooses a fragment belonging to either U0 or U1, according to the values of Cx. It should be noted that U0 = U, while U1 is obtained after adding ‘1’ to every fragment belonging to U, except the first one.
Figure 4c provides an overview of the multiplier branch. Given a multiplier in partial carry-save format, i.e. Y = V + B, the V signal will be used to generate the Booth encoding. First, a Real Carry Calculation unit computes the actual carries of the multiplier using the Gy, Py signals. Second, tuples generation takes place, which is not as immediate as in a conventional multiplier. As the non-redundant multiplier is not known, the value of some multiplier bits needs to be calculated. This calculation requires few logic levels, as it just depends on the true carries and some bits belonging to V . After the tuples are generated, the B4_0 encoder implements the typical radix 4 recoding, while the B4_1 encoder implements the same recoding but considering that every Booth tuple tp = v2∗p+1v2∗pv2∗p−1 has a carry-in signal equal to ‘1’ in the 2 ∗ p position. In other words, this is equivalent to adding the constant “010” to every tuple. Nevertheless, instead of adding and then recoding, the encoding is directly performed following a truth table , which is faster. Finally, the correct encoding for every tuple will be selected according to the control (ctrl) signals. As can be observed, the final encoding coincides with that provided by the conventional radix 4 Booth recoding shown in Figure 3. The main advantage of this approach is that all the fragments work in parallel once the real carries are known.
After generating the correct Booth encoding as well as the multiplicand (i.e. non-redundant format), Booth multiples generation and PPM addition take place as in conventional multipliers. Finally, the CPA stage is performed via a Multi speculative Adder, generating a result in partial carry-save format, which may be used by other partial carry-save units.
- Easy to develop efficient Circuit
- Minimum average delay
- Less area and power consumptions
 G. D. Micheli, Synthesis and Optimization of Digital Circuits, 1st ed. New York, NY, USA: McGraw-Hill, 1994.