Vector Processing-Aware Advanced Clock-Gating Techniques for Low-Power Fused Multiply-Add
The need for power efficiency is driving a rethink of design decisions in processor architectures. While vector processors succeeded in the high-performance market in the past, they need a retailoring for the mobile market that they are entering now. Floating-point (FP) fused multiply-add (FMA), being a functional unit with high power consumption, deserves special attention. Although clock gating is a well-known method to reduce switching power in synchronous designs, there are unexplored opportunities for its application to vector processors, especially when considering active operating mode. In this research, we comprehensively identify, propose, and evaluate the most suitable clock-gating techniques for vector FMA units (VFUs). These techniques ensure power savings without jeopardizing the timing. We evaluate the proposed techniques using both synthetic and “real-world” application-based benchmarking. Using vector masking and vector multilane-aware clock gating, we report power reductions of up to 52%, assuming active VFU operating at the peak performance. Among other findings, we observe that vector instruction-based clock-gating techniques achieve power savings for all vector FP instructions. Finally, when evaluating all techniques together, using “real-world” benchmarking, the power reductions are up to 80%. Additionally, in accordance with processor design trends, we perform this research in a fully parameterizable and automated fashion.
- Xilinx 14.2
Power and energy efficiency have become the dominant limiting factor to processor performance and have significantly increased processor design complexity, especially when considering the mobile market. Being able to exploit high degrees of data-level parallelism (DLP) at low cost in a power- and energy-efficient way, vector processors are an attractive architectural-level solution. However, the design goals for mobile vector processors clearly differ from the performance-driven designs of traditional vector machines. Therefore, mobile vector processors require a redesign of their functional units (FUs) in a power-efficient manner.
Clock gating is a common method to reduce switching power in synchronous pipelines, and it is practically a standard in low-power design. The goal is to “gate” the clock of any component whenever it does not perform useful work. In that way, the power spent in the associated clock tree, registers, and the logic between the registers is reduced. It is the most efficient power reduction technique for active operating mode. Therefore, the conditions under which clock gating can be applied should be extensively studied and identified. A widely used approach is to clock gate a whole FU when it is idle. A complementary and more challenging approach is clock gating the FU or its subblocks when it is active, i.e., operating at peak performance. Furthermore, there are characteristics of vector processors that provide additional clock-gating opportunities.
Since fused multiply-add (FMA) units usually dissipate the most power of all FUs, their design requires special attention. Abundant floating-point (FP) FMA is typically found in vector workloads, such as multimedia, computer graphics, or deep learning workloads. Although in the past FMA has been used for high performance, it has recently been included in mobile processors as well. In contrast to high-performance vector processors (e.g., NEC SX-series and Tarantula) that have separate units for each FP operation, mobile vector processors’ resources are limited; thus, we typically have a single unit per vector lane capable of performing multiple FP operations rather than separate FP units. Apart from that, additional advantages of using FMA over a separate FP adder and multiplier are as follows. 1) Computation localization inside the same unit reduces the number of interconnections (power and energy efficiency). 2) Higher accuracy (a single round/normalize step instead of two). 3) Improved performance (shorter latency).
We present three kinds of techniques: 1) novel ideas to exploit unique characteristics of vector architectures for clock gating during active periods of execution (e.g., vector instructions with a scalar operand or vector masking); 2) novel ideas for clock gating during active periods of execution that are also applicable to scalar architectures but especially beneficial to vector processors (e.g., gating internal blocks depending on the values of input data); 3) ideas that are already used in other architectures and that we present because their application is beneficial to vector processors and for the sake of completeness (e.g., idle VFU).
Regarding the second and third groups of ideas, an advantage of vector processing that extends the applicability of clock gating is that vector instructions last many cycles, so the state of the clock-gating and bypassing logic remains the same during the whole instruction. As a result, power savings typically outweigh the switching overhead of the added hardware (which is often not the case in scalar processors).
To fulfill current trends in digital design that promote building generators rather than instances, we perform this research in a fully parameterizable, scalable, and automated manner. We developed an integrated architecture-circuit framework that consists of several generators, simulators, and other tools, in order to join architectural-level information (e.g., vector length or benchmark configuration) with circuit-level outputs (e.g., VFU power and timing measurements). We implement our clock-gating techniques and generate hardware VFU models using a fully parameterizable Chisel-based FMA generator (FMAgen) and a 40-nm low-power technology.
We discuss the related work individually for each of our clock-gating techniques together with the description of the technique in Section IV. Besides, in the context of alternative low-power techniques for FP units, interesting approaches have been proposed, such as caching results that can be reused and byte encoding (computation performed over significant bytes). However, detailed and accurate evaluation reveals that the actual savings are often low and come with an unaffordable area overhead.
Fig. 1. Two-lane, four-stage VFU (MVL = EVL = 64) executing FPFMAV V3<-V0,V1,V2.
- Inaccuracy in operation
- Power saving is not efficient
Proposed Clock-Gating Techniques
Scalar Operand Clock-Gating (ScalarCG)
Fig. 2. Simplified block diagram of a one-lane, four-stage VFU with all clock-gating techniques applied (AllCG technique). Input signals for the baseline without clock-gating are multiplicands (A and B), addend (C), rounding mode, and operation sign (opsign), while output signals are result (Out) and exception flags.
Table I. Types of instructions where ScalarCG and ImplCG are active.
We propose this technique to tackle the cases in which one or two operands do not change during the vector instruction. Table I lists the types of instructions during which scalar operand clock-gating (ScalarCG) is active. As only one of all the supported vector instructions has all three vector operands, often at least one operand is scalar. Only the FPFMAV instruction, in which all operands are vectors, does not benefit from this technique.
During these instructions, the corresponding input register(s) of scalar operand(s) should latch a new value only on the first clock edge of the execution of the instruction, while during the rest of the instruction, they can be clock-gated. To implement this, we introduce the signals VS[2..0] (Fig. 2), where VS[i] = 0 means that the ith operand is gated after the mentioned first cycle. VS signals are derived from the instruction OPCODE. Deriving VS signals from the OPCODE is done before the first pipeline stage (as shown in Fig. 2). This generation (decoding) requires regular comparators, and they are not on the critical path as the OPCODE is available at least one cycle in advance. Table I shows corresponding VS signals for all the instructions.
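The input-register behavior described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the authors' Chisel implementation: the class `GatedInputRegister` and the function `run_instruction` are hypothetical names, and index i of `vs` simply stands for "operand i" because the actual mapping of VS bits to operands is given in Table I, which is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class GatedInputRegister:
    """Hypothetical model of one VFU input register with a gated clock."""
    is_vector: bool        # VS[i]: True = vector operand, False = scalar operand
    value: float = 0.0
    clock_edges: int = 0   # clock edges that actually reached the register

    def tick(self, data_in: float, first_cycle: bool) -> None:
        # Clock enable: always on for a vector operand,
        # on only in the first cycle for a scalar operand.
        if self.is_vector or first_cycle:
            self.value = data_in
            self.clock_edges += 1


def run_instruction(vs, operand_streams):
    """Feed one vector instruction through the three input registers.

    vs              -- three bits; vs[i] == 0 marks operand i as scalar
    operand_streams -- three equally long lists, one value per cycle
    Returns the number of clock edges seen by each register.
    """
    regs = [GatedInputRegister(is_vector=bool(bit)) for bit in vs]
    n_cycles = len(operand_streams[0])
    for cycle in range(n_cycles):
        for reg, stream in zip(regs, operand_streams):
            reg.tick(stream[cycle], first_cycle=(cycle == 0))
    return [reg.clock_edges for reg in regs]


if __name__ == "__main__":
    # Operand 1 is scalar: its register is clocked once, the others every cycle.
    streams = [[1.0] * 8, [2.0] * 8, [3.0] * 8]
    print(run_instruction((1, 0, 1), streams))
```

In this model, the register of a scalar operand sees a single clock edge regardless of the vector length, which is where the saving in the clock tree and input register comes from.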
Implicit Scalar Operand Clock-Gating (ImplCG)
This technique is an additional optimization of ScalarCG and aims to exploit further the information given through the instruction OPCODE for clock-gating, operand isolation, and computation bypassing. In the case of addition and subtraction instructions, such as FPADDV and FPSUBV, the 53 × 53 mantissa multiplier is not needed, as it is known that one of the multiplicands is “1.” Thus, we can bypass, isolate, and clock-gate it, providing the value of the other multiplicand directly to the adder. There is an analogous situation for FPMULV, since the addend is known to be “0.” In this case, the 162-bit-wide adder, the leading zero anticipation, and the aligning part are not needed.
To control bypassing, isolation, and clock-gating of the mentioned submodules, we introduce signal INSTYP (see Fig. 2 and Table I), generated from the instruction OPCODE, which indicates whether an FPFMAV or an FPADDV/FPSUBV/FPMULV instruction is executed. INSTYP together with VS signals provide information of the instruction type. For example, INSTYP = 1, VS = 1, and VS = 0 indicate that we have an FPMULV instruction while INSTYP = 1, and VS = 0 indicates that we have an FPADDV/FPSUBV instruction. Fig. 3 shows the simplified block diagram of gated FMA submodules when the aforementioned instructions are executed. Circuitry added for implementing ImplCG mostly consists of clock-gating cells and MUXs.
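As a rough illustration of the opcode-driven decision, the sketch below maps each instruction to the submodules that ImplCG can gate and shows the corresponding value-level bypass. It rests on assumptions: the block names, the functions `gated_blocks` and `execute`, and the operand roles for FPADDV/FPSUBV are hypothetical, and the arithmetic uses ordinary Python floats rather than a bit-accurate FMA datapath.

```python
# Submodules named in the text; grouping them into two sets is a simplification.
MULTIPLIER_PATH = frozenset({"mantissa_multiplier"})
ADDER_PATH = frozenset({"adder", "leading_zero_anticipation", "aligner"})


def gated_blocks(opcode: str) -> frozenset:
    """Return the submodules that can be bypassed, isolated, and clock-gated."""
    if opcode in ("FPADDV", "FPSUBV"):
        return MULTIPLIER_PATH   # one multiplicand is implicitly 1
    if opcode == "FPMULV":
        return ADDER_PATH        # the addend is implicitly 0
    if opcode == "FPFMAV":
        return frozenset()       # all submodules are needed
    raise ValueError(f"unsupported opcode: {opcode}")


def execute(opcode: str, a: float, b: float = 1.0, c: float = 0.0) -> float:
    """Value-level view of the bypass (Python rounds a*b and the sum separately)."""
    if opcode == "FPADDV":
        return a + c             # multiplier bypassed: a goes straight to the adder
    if opcode == "FPSUBV":
        return a - c
    if opcode == "FPMULV":
        return a * b             # adder path bypassed
    if opcode == "FPFMAV":
        return a * b + c
    raise ValueError(f"unsupported opcode: {opcode}")


if __name__ == "__main__":
    for op in ("FPFMAV", "FPADDV", "FPMULV"):
        print(op, sorted(gated_blocks(op)))
```

Because the decision depends only on the opcode, it stays constant for the whole vector instruction, so the gating cells and MUXs in such a scheme toggle once per instruction rather than once per element.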
In the context of instruction-dependent techniques, there is interesting research done in the past for scalar processors. The main advantages of our ImplCG proposal over the mentioned research are: 1) we apply the technique for a variable number of pipeline stages; 2) we evaluate power, timing, and area; and 3) we propose the technique for vector processors. The advantage of applying this technique on a vector processor over other models (e.g., scalar) is that vector instructions last many cycles, so the related hardware (clock-gating logic and MUXs) maintains the same state during the whole instruction. Thus, there will be less switching overhead than in the scalar case.
Fig. 3. Gated FMA submodules during FPMULV (dark gray) and FPADDV (light gray) instructions in case of ImplCG.
Vector Masking and Vector Multilane-Aware Clock-Gating (MaskCG)
Here, we target cases in which there are idle cycles during the vector mask instructions (e.g., FPFMAV_MASK). Common cases in which vector mask control is used are: 1) sparse matrix operations and 2) conditional statements inside a vectorized loop. Additionally, we assume that the same mechanism is also used to reduce the EVL to less than the MVL. We assume that the control logic will detect and optimize this case, skipping the last elements of the vector corresponding to the trailing 0s of the mask. However, in vector designs with nL lanes, there will still be idle lanes in the last cycle of the operation whenever the EVL is not a multiple of nL: only mod(EVL, nL) lanes receive an element, and the remaining nL − mod(EVL, nL) lanes are idle. The VMR directly controls the clock-gating of the whole arithmetic unit during these idle cycles (see Fig. 2).
Regarding the internal implementation, we perform clock-gating at pipeline stage granularity, so we prevent useless cycles inside the unit, i.e., the data are latched in subsequent stages only if necessary. Once the Enable signal of the first pipeline stage’s register gets the value “1,” this Enable signal propagates to the end of the pipeline, one stage per cycle (see Fig. 2). In other words, the Enable signal of the nth stage is actually the first stage’s Enable signal delayed by n−1 cycles. This is implemented by adding a 1-bit-wide, (nS−1)-long shift register that drives clock-gating cells. To the best of our knowledge, there is no related work that aims to exploit vector conditional execution with VMR to lower the power of vector processors.
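The Enable propagation can be sketched as a cycle-by-cycle model of the shift register. This is a minimal illustration, assuming one mask bit per cycle for a single lane; the function name `stage_enables` and the list-based trace are hypothetical and not part of the original design.

```python
def stage_enables(first_stage_enable, n_stages):
    """Return, for every cycle, the Enable bit seen by each pipeline stage.

    first_stage_enable -- list of 0/1 values, one per cycle (e.g., mask bits)
    n_stages           -- number of pipeline stages (nS)
    """
    shift_reg = [0] * (n_stages - 1)       # 1-bit-wide, (nS-1)-long shift register
    trace = []
    for en in first_stage_enable:
        # Stage 1 sees the current Enable; stage n sees it delayed by n-1 cycles.
        trace.append([en] + shift_reg)
        # Shift by one position for the next cycle.
        shift_reg = ([en] + shift_reg)[: n_stages - 1]
    return trace


if __name__ == "__main__":
    # Alternating mask on a four-stage pipeline.
    for cycle, enables in enumerate(stage_enables([1, 0, 1, 0], 4)):
        print(cycle, enables)
```

Each row of the trace tells which stage registers are clocked in that cycle; a stage whose bit is 0 keeps its clock gated, so masked-off elements never cause latching further down the pipeline.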
Input Data Aware Clock-Gating (InputCG)
Here, we identify the scenarios in which, depending on the input data, a part of mantissa processing is not needed for the correct result and, thus, can be bypassed. We use a recoded format for internal representation that allows us to detect special cases (explained in Section III-A) and zeros with a negligible hardware overhead; it requires inspection of only the three most significant bits of the exponent (fourth column in Table II). Table II presents the identified scenarios (conditions) in which a hardware block of mantissa arithmetic computations and the corresponding input registers can be bypassed, isolated, and clock-gated. The recoded format allows detection of the relevant scenarios by using simple 3-bit comparators. They are located at the inputs of the VFU (A, B, and C processing block in Fig. 2). In that way, we ensure that the detection comparators are not on the VFU’s critical path, i.e., gating information is available in time.
The added internal hardware is similar to that for ImplCG. Having a zero addend is analogous to the FPMULV instruction case (see Fig. 3). A zero multiplicand allows gating and bypassing all the modules from Fig. 3 except the registers that hold the operand C value, as in that case the final result is operand C. In the case of NaN and infinity, there is no need for any computation, as the result that has to be at the VFU output is already known (explained in Section III-A), so we can gate, bypass, and isolate the vast majority of FMA submodules. Many workloads have data that contain a lot of zero values and can thus benefit from the last two subtechniques presented in Table II.
Although these techniques are applicable to other architectures as well, their application to vector processors is more efficient, since recurrent values are common within vector data, which lowers the switching overhead in the added hardware (clock-gating logic and MUXs). While both the ImplCG and InputCG techniques aim to exploit cases when the addend is “0,” in this case there is no external information about the existence of a “0” via the VS signals; it has to be detected, and the gating has to be done on time. As in the case of ImplCG, prior research presents a related data-driven technique for scalar processors. The main advantages (that enable additional savings) of our InputCG technique over the mentioned research are: 1) detection of zero operands and 2) gating the mantissa multiplier when processing NaNs.
Table II. InputCG: conditions under which a hardware block of mantissa arithmetic computations and corresponding input registers can be bypassed.
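The data-dependent conditions described above can be summarized as a small classifier. This is a sketch under assumptions: the function `input_cg_case`, its return labels, and the priority order among the conditions are hypothetical, and it uses Python's `math` checks on ordinary floats, whereas the hardware inspects the three most significant exponent bits of the recoded format.

```python
import math


def input_cg_case(a: float, b: float, c: float) -> str:
    """Classify one element of A*B + C by the gating opportunity it offers.

    a, b -- multiplicands; c -- addend
    """
    if any(math.isnan(v) or math.isinf(v) for v in (a, b, c)):
        # Result is known without mantissa arithmetic; most submodules can be gated.
        return "special"
    if a == 0.0 or b == 0.0:
        # Product is zero: the result is operand C, only C's registers are needed.
        return "zero_multiplicand"
    if c == 0.0:
        # Same situation as FPMULV: the adder path is not needed.
        return "zero_addend"
    return "none"


if __name__ == "__main__":
    samples = [(0.0, 5.0, 2.0), (1.0, 2.0, 0.0), (float("nan"), 1.0, 1.0), (1.0, 2.0, 3.0)]
    for a, b, c in samples:
        print((a, b, c), input_cg_case(a, b, c))
```

Checking NaN and infinity first is a deliberate choice in this sketch, because an infinite multiplicand paired with a zero multiplicand does not yield operand C.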
- High accuracy
- Improved performance
- Power saving is efficient
Ivan Ratković, Oscar Palomar, Milan Stanić, Osman Sabri Ünsal, Adrian Cristal, and Mateo Valero, Life Fellow, IEEE, “Vector Processing-Aware Advanced Clock-Gating Techniques for Low-Power Fused Multiply-Add,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2018.