A Wideband Low-Noise Variable-Gain Amplifier with a 3.4 dB NF and up to 45 dB gain tuning range in 130 nm CMOS


A 130 nm CMOS wideband (0.2 to 3.3 GHz) low-noise variable-gain amplifier (LNVGA) that uses two active baluns for phase cancellation is presented. The LNVGA targets a wide gain tuning range that avoids signal compression while keeping the noise figure low. The first stage of the LNVGA keeps the noise figure low, whereas the second stage provides the gain variation. The second stage delivers a wide gain tuning range by using the phase cancellation technique, which is implemented with two active baluns. Since the phase cancellation technique relies strongly on the balance of the balun outputs, a low-imbalance active balun topology is proposed, analyzed in detail, designed, and tested. The LNVGA achieves a gain tuning range of 45 dB and a noise figure of 3.4 dB, and dissipates 19 mW in the maximum gain condition. The circuit was fabricated in 130 nm CMOS with a 1.2 V supply.



  • System : Pentium Dual Core.
  • Hard Disk : 120 GB.
  • Monitor : 15’’ LED
  • Input Devices : Keyboard, Mouse
  • Ram : 1 GB


Filipe Dias Baumgratz, Member, IEEE, Carlos Saavedra, Senior Member, IEEE, Michiel Steyaert, Fellow, IEEE, Filip Tavernier, Member, IEEE, and Sergio Bampi, Senior Member, IEEE, “A Wideband Low-Noise Variable-Gain Amplifier with a 3.4 dB NF and up to 45 dB gain tuning range in 130 nm CMOS”, IEEE 2019.


Rapid Balise Telegram Decoder with Modified LFSR Architecture for Train Protection Systems


We propose a novel balise telegram decoding scheme for RFID-based high-speed train protection systems. To reduce position error by minimizing the decoding latency, the proposed scheme gives the decoder additional opportunities to find a valid telegram during a single balise passage by reutilizing the values from previous failures. We also present a modified linear-feedback shift register (LFSR) and a parallel connection of LFSR units that complete both error detection and synchronization of telegrams in one cycle. The experimental results show that the proposed scheme achieves approximately 2,000 times faster error detection and synchronization than traditional architectures.


Balises are one of the most important components of recent train protection systems. They are the sources of position markers and movement authorities, and they transmit movement authorities and trackside data (e.g., position reference, speed limits, and maintenance works on the line) to trains [1]–[5]. Fig. 1 shows the typical on-board transmission equipment having the antenna unit and balise transmission module (BTM) function, which communicates with the European train control system (ETCS) kernels [2]. In general, balises are tele-powered via an inductively coupled signal transmitted by the on-board transmission equipment during train passage, and they return a telegram message that includes safety-related information.

In this work, we propose a novel balise telegram decoding architecture for high-speed trains to increase the decoding speed. As the telegram can be corrupted by electromagnetic interference and crosstalk, movement authorities and trackside data are transmitted after undergoing scrambling, substitution with a symbol code, and addition of a checksum for parity checks and synchronization. We focus on minimizing the processing latency required to calculate the remainder of the polynomial division for parity checks and synchronization while maintaining a low computational complexity. The proposed scheme gives the decoder additional opportunities to find a valid telegram during a single balise passage by reutilizing the essential information from the previous decoding failures. We also propose a modified linear-feedback shift register (LFSR) architecture and a parallel connection of two LFSR modules to complete both error detection and synchronization in only one cycle. We then analyze the telegram missing probability, processing latency, and complexity of the proposed structure.

Balise Telegram Encoding:

The user data contain a header (50 bits) and optional packets (variable length). The number of allowed packets depends on the telegram format. To avoid burst errors, random bit errors, and bit slip/insertion, the user data are scrambled and substituted with a symbol code of different Hamming distance and a checksum is added for parity checks and synchronization. Consequently, the telegram format in a binary polynomial

$$ b(x) = b_{n-1}x^{n-1} + \cdots + b_1 x + b_0 \tag{1} $$

consists of shaped data, control bits, scrambling bits, extra shaping bits, and check bits, where n = 1023 (long format) or 341 (short format) indicates the length of telegram:

  • Shaped data ($b_{1022} \ldots b_{110}$ or $b_{340} \ldots b_{110}$) contain the user data (830 bits or 210 bits) after scrambling and substitution.
  • Control bits ($b_{109} \ldots b_{107}$) include the inversion bit and spare bits, which are set to 001.
  • Scrambling bits ($b_{106} \ldots b_{95}$) are the initial state of the scrambler that operates on the data bits before shaping.
  • Extra shaping bits ($b_{94} \ldots b_{85}$) are used to enforce the constraints on check bits independent of the scrambling.
  • Check bits ($b_{84} \ldots b_{0}$) include 75 parity bits of the error correcting code and 10 bits for synchronization. Check bits shall be defined as

$$ b_{84}x^{84} + \cdots + b_1 x + b_0 = R_{f(x)g(x)}\left[ b_{n-1}x^{n-1} + \cdots + b_{84}x^{84} \right] + o(x) \tag{2} $$

where $f(x) = f_m x^m + \cdots + f_1 x + f_0$, $g(x) = g_k x^k + \cdots + g_1 x + g_0$, and $o(x) = g(x)$ depend on the telegram format. $R_{a(x)}[b(x)]$ denotes the remainder of the division of $b(x)$ by $a(x)$.
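
As a quick consistency check on the field boundaries listed above, the following Python sketch adds up the field widths for both formats. The helper names `telegram_fields` and `field_width` are hypothetical and introduced only for illustration. The check that the shaped data consist of 11-bit words carrying 10 user bits each relies on the 10-to-11-bit transformation mentioned in the decoding steps.

```python
def telegram_fields(n):
    """Return (name, msb_index, lsb_index) for each telegram field.

    n is the telegram length: 1023 (long format) or 341 (short format).
    """
    if n not in (1023, 341):
        raise ValueError("n must be 1023 (long) or 341 (short)")
    return [
        ("shaped data", n - 1, 110),
        ("control bits", 109, 107),
        ("scrambling bits", 106, 95),
        ("extra shaping bits", 94, 85),
        ("check bits", 84, 0),
    ]


def field_width(msb, lsb):
    """Number of bits in the inclusive range b_msb ... b_lsb."""
    return msb - lsb + 1


for n in (1023, 341):
    total = sum(field_width(msb, lsb) for _, msb, lsb in telegram_fields(n))
    shaped = field_width(n - 1, 110)
    user_bits = shaped // 11 * 10   # each 11-bit word carries 10 user bits
    print(n, total, shaped, user_bits)
```

The widths sum to n in both formats, and the shaped-data widths (913 and 231 bits) correspond to the 830 and 210 user bits quoted in the list.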

Balise Telegram Decoding:

The on-board transmission equipment stores the received telegrams in a telegram buffer. Assume that the total number of received bits is N (N ≥ n), which depends on the speed of the train and the contact length between the on-board equipment and balise.

Figure 1: Moving window based sequence update

After receiving the first n bits, the telegram decoder starts to capture an n-bit sequence $c(x) = c_{n-1}x^{n-1} + \cdots + c_1 x + c_0$ from the telegram buffer in Fig. 1. c(x) is a cyclically shifted version of the transmitted telegram b(x), except for random or burst bit errors. Any balise telegram receiver shall be at least as good as the following basic receiver operation:

1) Consider a window of n + r consecutive received bits (long format: r = 77; short format: r = 121. If the window has already been shifted over 7,500 bits, set r = n).

2) Is the parity-check satisfied, i.e., are the first n bits divisible by g(x)? If not, shift window and go to step 1.

3) Do the r additional bits (rightmost in the window) coincide with the first r bits (leftmost in the window)? If not, shift the window and go to step 1.

4) Find the beginning (position of $b_{n-1}$) of the telegram with the help of f(x). If $R_{f(x)}[c(x)]$ is an impossible value, go to step 1.

5) Are all 11-bit words ($b_{n-1} \ldots b_{n-11}$), ($b_{n-12} \ldots b_{n-22}$), …, ($b_{10} \ldots b_{0}$) valid? If not, shift the window and go to step 1.

6) At this point, the telegram is considered safe.

7) Is the inversion bit $b_{109} = 1$? If yes, all of the received bits could be used after inversion.

8) Check the other two control bits. If $b_{108} = 1$ or $b_{107} = 0$, abort with the message "unknown telegram format".

9) Invert the 10-to-11-bit transformation and de-scramble.

10) Output the user bits and the original state of the inversion bit ($b_{109}$).
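
Step 3 of the list above works because the received stream is a cyclic repetition of the telegram, so bit n + i of an error-free window repeats bit i. A minimal Python sketch of that comparison follows. The function name `tail_repeats_head` and the toy sizes (n = 8, r = 3) are assumptions for illustration, not values from the specification.

```python
def tail_repeats_head(window, n, r):
    """Step 3 of the basic receiver: do the r extra bits repeat the first r bits?

    window: sequence of n + r received bits, leftmost (oldest) bit first.
    """
    if len(window) != n + r:
        raise ValueError("window must hold exactly n + r bits")
    return list(window[n:]) == list(window[:r])


# Hypothetical toy sizes, not the real 1023/77 or 341/121.
telegram = [1, 0, 1, 1, 0, 0, 1, 0]
good = telegram + telegram[:3]        # cyclic transmission without errors
bad = good[:-1] + [1 - good[-1]]      # last extra bit flipped
print(tail_repeats_head(good, 8, 3), tail_repeats_head(bad, 8, 3))
```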

Figure 2: The conventional telegram decoder architecture

In this paper, we focus on reducing the processing latency and computational complexity required for error detection (step 2) and synchronization (step 4). In general, these procedures are accomplished by sequentially calculating the remainder of the division of c(x) by g(x) and f(x), respectively,

$$ e(x) = R_{g(x)}[c(x)] = e_{k-1}x^{k-1} + \cdots + e_1 x + e_0 \tag{3} $$

and

$$ s(x) = R_{f(x)}[c(x)] = s_{m-1}x^{m-1} + \cdots + s_1 x + s_0 \tag{4} $$

Fig. 2 depicts the error detection and synchronization steps of the conventional receiver. For a given n-bit sequence c(x), the error detection unit (EDU) calculates e(x) to determine whether the parity-check is satisfied. If an error is detected (e(x) ≠ 0), c(x) is updated by shifting the window, as shown in Fig. 1. Otherwise, the synchronization unit (SU) calculates s(x) to find the beginning of the telegram. If s(x) is an impossible value, i.e., s(x) = 0, c(x) is updated by shifting the window. Otherwise, the telegram b(x) can be obtained by cyclically shifting c(x) with the help of s(x) and the look-up table (LUT), which stores the corresponding number of bits elapsed since the beginning of the telegram. Whenever the train passes a balise, the receiver repeats this procedure until it finds a valid telegram. The valid telegram is then converted to user data after the validation of all symbol codes (step 5), the confirmation of control bits (steps 7 and 8), code substitution and descrambling (step 9).

Figure 3: Conventional LFSR architecture

The window may be shifted by either one or several bit positions at a time. Let us specify the window shift size as s bits. For a given total number of received bits N, decreasing s gives the telegram receiver additional opportunities to find a valid telegram at the cost of increased processing latency. The conventional telegram decoder normally uses an LFSR-based architecture, which is widely applied in modern communication systems, to calculate e(x) and s(x). Fig. 3 shows the typical structure of the LFSR for calculating e(x). Before starting each calculation, each register in the LFSR should be initialized to 0, i.e., $e_i = 0$ ($i = 0, \ldots, k-1$). The current sequence c(x) is then entered into the LFSR by inserting its coefficients one by one in descending order. In other words, the coefficient $c_i$ ($i = 0, \ldots, n-1$) is added to the LSB of e(x), i.e., $e_0$. At the same time, the MSB of e(x), i.e., $e_{k-1}$, is added to the other coefficients $e_i$ ($i = 0, \ldots, k-1$) if the corresponding $g_i$ is 1. This feedback path is disconnected when the corresponding $g_i$ is 0. As a result, this LFSR implementation requires n cycles each to calculate e(x) and s(x). Because the basic receiver executes the error detection and synchronization steps sequentially, it uses at least 2n cycles to complete steps 2 and 4. For example, error detection and synchronization take 2,046 cycles for the long format (n = 1023), which causes a large position error for high-speed trains. A parallel LFSR architecture may reduce the processing latency; however, it requires more hardware resources for storing intermediate values. Because train control systems must tolerate various error sources, numerous fault-tolerant solutions, including the triple modular redundancy scheme, are employed by replicating the primitive processing units. Given this additional complexity, the previous parallel LFSR structure is not acceptable as a practical embedded solution. We introduce a novel telegram decoding architecture in the next section to efficiently reduce the processing latency without increasing the hardware cost.
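
The bit-serial division performed by the conventional LFSR can be sketched in Python as follows. This is a software illustration under stated assumptions, not the hardware of Fig. 3: polynomials over GF(2) are stored as integers (bit i holds the coefficient of $x^i$), the function name `lfsr_remainder` is hypothetical, and the toy divisor $x^3 + x + 1$ is chosen for brevity and is not the g(x) of the standard.

```python
def lfsr_remainder(bits, g):
    """Bit-serial remainder of c(x) / g(x) over GF(2).

    bits: coefficients c_{n-1} ... c_0 in descending order, each 0 or 1.
    g:    divisor polynomial as an int, bit i = coefficient of x^i.
    Returns (remainder, cycles); one input bit is consumed per cycle.
    """
    k = g.bit_length() - 1      # degree of g(x)
    e = 0                       # all registers initialised to 0
    cycles = 0
    for c in bits:
        e = (e << 1) | c        # shift the registers, new bit enters at e_0
        if (e >> k) & 1:        # previous e_{k-1} was 1: feed back g(x)
            e ^= g
        cycles += 1
    return e, cycles


# Toy example with an assumed divisor, not the standard's g(x).
G_TOY = 0b1011                      # x^3 + x + 1
C_TOY = [1, 1, 0, 1, 0, 0, 1]       # x^6 + x^5 + x^3 + 1
remainder, cycles = lfsr_remainder(C_TOY, G_TOY)
print(remainder, cycles)
```

The cycle count always equals the number of input bits, which is why the long format needs 1,023 cycles per remainder and 2,046 cycles when e(x) and s(x) are computed one after the other.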


  • More Speed in Decoding process
  • Having More electromagnetic interference and crosstalk
  • Not added checksum for parity checks


Proposed Telegram Decoding Algorithm:

The proposed telegram decoding algorithm allows n-bit sequence c(x) to be updated by one bit whenever the EDU determines that the parity-check is not satisfied or the SU fails to find the beginning of the telegram. In addition, the proposed algorithm does not require initialization of the registers in the LFSR before starting the calculation of the updated sequence. Instead, it reutilizes the remaining values stored in the registers. These values are the results of the previous calculation.

Let $c_u(x)$ denote the updated sequence after a 1-bit window shift. $c_u(x)$ can be derived from c(x) as follows:

$$ c_u(x) = c_{\mathrm{new}} + x\,c(x) - c_{n-1}x^{n} \tag{5} $$

where $c_{\mathrm{new}}$ is the new bit entered into the LSB side of the window, $x\,c(x)$ is the 1-bit window shift, and $c_{n-1}x^{n}$ is the removal of the expired MSB of the previous sequence. The remainder $e_u(x)$ of $c_u(x)$ can be rewritten as follows:

$$ e_u(x) = R_{g(x)}[c_u(x)] = c_{\mathrm{new}} + R_{g(x)}[x\,c(x)] - R_{g(x)}[c_{n-1}x^{n}] \tag{6} $$

$c_{\mathrm{new}}$ is simply added to the other terms. Let us assume that $c(x) = q_e(x)g(x) + e(x)$, where $q_e(x)$ is the quotient polynomial. The second term in (6), i.e., $R_{g(x)}[x\,c(x)]$, is then derived as

$$ R_{g(x)}[x\,c(x)] = R_{g(x)}[x\,q_e(x)g(x) + x\,e(x)] = R_{g(x)}[x\,e(x)] = x\,e(x) - e_{k-1}\,g(x) \tag{7} $$

This means that the remaining values stored in the LFSR, e(x), can be reused to calculate the second term in (6). Specifically, all of the coefficients of e(x) are shifted by one to form $x\,e(x)$, and the divisor g(x) is then subtracted from $x\,e(x)$ if the degree of the polynomial e(x) is $k-1$, i.e., $e_{k-1} = 1$. Let us assume that

$$ r(x) = R_{g(x)}[x^{n}] = r_{k-1}x^{k-1} + \cdots + r_1 x + r_0 \tag{8} $$

Because r(x) does not depend on $c_u(x)$, it can be pre-calculated. The third term in (6), i.e., $R_{g(x)}[c_{n-1}x^{n}] = c_{n-1}R_{g(x)}[x^{n}] = c_{n-1}\,r(x)$, is then subtracted from the other terms if the expired MSB of the previous sequence equals 1, i.e., $c_{n-1} = 1$. Otherwise, $R_{g(x)}[c_{n-1}x^{n}] = 0$. Finally, $e_u(x)$ can be simplified as

$$ e_u(x) = c_{\mathrm{new}} + x\,e(x) - e_{k-1}\,g(x) - c_{n-1}\,r(x) \tag{9} $$

Similarly, calculation of the remainder $s_u(x)$ for the updated sequence $c_u(x)$ can be rewritten as follows:

$$ s_u(x) = R_{f(x)}[c_u(x)] = c_{\mathrm{new}} + R_{f(x)}[x\,c(x)] - R_{f(x)}[c_{n-1}x^{n}] = c_{\mathrm{new}} + x\,s(x) - s_{m-1}\,f(x) - c_{n-1}\,p(x) \tag{10} $$

where

$$ p(x) = R_{f(x)}[x^{n}] = p_{m-1}x^{m-1} + \cdots + p_1 x + p_0 \tag{11} $$

The proposed telegram decoding algorithm reduces the processing latency and computational complexity considerably because it takes only one cycle to calculate $e_u(x)$ and $s_u(x)$ for the updated sequence $c_u(x)$ by reutilizing the remaining values stored in the LFSRs after previous calculations.
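
The one-step update of (9) can be checked numerically with a short Python sketch. It is an illustration under stated assumptions rather than the authors' implementation: polynomials over GF(2) are stored as integers (bit i is the coefficient of $x^i$), addition and subtraction are both XOR, the helper names are hypothetical, and the toy parameters (n = 15, $g(x) = x^4 + x + 1$) are not those of the standard.

```python
import random


def poly_mod(a, g):
    """Remainder of a(x) / g(x) over GF(2); ints with bit i = coeff of x^i."""
    k = g.bit_length() - 1
    while a.bit_length() - 1 >= k:
        a ^= g << (a.bit_length() - 1 - k)
    return a


def update_remainder(e, c_new, c_expired, g, r):
    """Eq. (9): e_u = c_new + x*e - e_{k-1}*g - c_{n-1}*r (add/sub are XOR)."""
    k = g.bit_length() - 1
    e_u = (e << 1) | c_new      # x*e(x) + c_new
    if (e_u >> k) & 1:          # e_{k-1} = 1: subtract g(x)
        e_u ^= g
    if c_expired:               # c_{n-1} = 1: subtract r(x) = R_g[x^n]
        e_u ^= r
    return e_u


def sliding_remainders(stream, n, g):
    """Remainder of every n-bit window of stream, one update per new bit."""
    r = poly_mod(1 << n, g)     # pre-calculated once, as in eq. (8)
    mask = (1 << n) - 1
    window = 0
    for b in stream[:n]:
        window = (window << 1) | b
    e = poly_mod(window, g)     # only the first window needs a full division
    out = [e]
    for c_new in stream[n:]:
        c_expired = (window >> (n - 1)) & 1
        e = update_remainder(e, c_new, c_expired, g, r)
        window = ((window << 1) | c_new) & mask
        out.append(e)
    return out


# Toy check with assumed parameters (n = 15, g(x) = x^4 + x + 1).
rng = random.Random(0)
stream = [rng.randint(0, 1) for _ in range(60)]
updated = sliding_remainders(stream, n=15, g=0b10011)
```

Each entry of `updated` after the first is produced by a single call to `update_remainder`, which mirrors the one-cycle claim; the same function applies to $s_u(x)$ when g and r are replaced by f and p.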

Modified LFSR Architecture:

Figure 4: Modified LFSR architecture supporting the proposed telegram decoding algorithm

We propose a novel LFSR architecture that supports the proposed telegram decoding algorithm and builds on the conventional LFSR architecture presented in Section II. Fig. 4 illustrates the proposed LFSR architecture for the EDU. The upper part of the proposed LFSR architecture calculates the second term in (9) and is identical to the conventional LFSR architecture shown in Fig. 3. The lower part is superimposed to calculate the fourth term in (9). Only one cycle is needed to calculate $e_u(x)$ from the updated sequence $c_u(x)$ because the architecture reutilizes the remaining values stored in the LFSR. The LFSR architecture for the SU can be obtained in a similar manner.

Single-Cycle Telegram Decoding Architecture:

We further propose a single-cycle decoding architecture that performs error detection and synchronization simultaneously to decrease the processing latency. Fig. 5 depicts the single-cycle telegram decoding procedure of the proposed architecture. Using the LFSR architecture proposed in Section III-B, both the EDU and SU require only one cycle to calculate e(x) and s(x) from sequence c(x).

Figure 5: Single Cycle telegram decoding of the proposed algorithm

If the EDU determines that the parity-check is not satisfied or the SU fails to find the beginning of the telegram for a given c(x), the MSB of the current sequence, $c_{n-1}$, expires and the new bit $c_{\mathrm{new}}$ is entered into the LSB side to form a new sequence $c_u(x)$. If both operations are completed with valid results, i.e., $e_u(x) = 0$ and $s_u(x) \neq 0$, b(x) can be obtained by cyclically shifting the current sequence using $s_u(x)$ and the LUT. Whenever the train passes a balise, the decoder repeats this procedure until it finds a valid telegram.


  • Less Speed in Decoding process
  • Having less electromagnetic interference and crosstalk
  • Added checksum for parity checks


[1] System Requirements Specification, Chapter 7 ERTMS/ETCS language, Ref. SUBSET-026, ver 3.4.0, European Union Agency for Railways Std., Dec. 2014.

Radiation-Hardened 14T SRAM Bitcell With Speed and Power Optimized for Space Application


In this paper, a novel radiation-hardened 14-transistor SRAM bitcell with optimized speed and power (RSP-14T, for radiation-hardened with speed and power optimized) is proposed for space applications. With circuit- and layout-level optimization in a 65-nm CMOS technology, the 3-D TCAD mixed-mode simulation results show that the novel structure has increased resilience to single-event upset as well as single-event–multiple-node upsets due to the charge sharing among OFF-transistors. Moreover, the HSPICE simulation results show that the write speed and power consumption of the proposed RSP-14T are improved by ∼65% and ∼50%, respectively, compared with those of the radiation hardened design (RHD)-12T memory cell.


Single-event upset (SEU) is a soft-error, nondestructive form of single-event effects (SEEs). In a radiation environment, when a heavy ion strikes the semiconductor material, it ionizes the material along its track and generates excess charge. This excess charge is collected by the sensitive nodes of the device. As a result, a voltage perturbation appears at those nodes. For an SRAM bitcell, when the amplitude of the voltage perturbation is large enough to exceed the logic threshold level of the inverter, the stored data might be flipped, as shown in Fig. 1; that is, an SEU is caused. With the continuous scaling of CMOS technology, the minimum spacing between the transistors is decreased.

Figure 1: SEU induced by an ion strike in an SRAM memory

As a result, multiple transistors are susceptible to the charge deposited by a single particle strike, whereas in older processes only one transistor was affected. This charge sharing results in single-event–multiple-node upsets (SEMNUs), which are becoming the main effect of energetic particle strikes in emerging nanometer CMOS technology. In addition, supply voltage reduction further increases the susceptibility of circuits to radiation. Thus, the development of radiation-hardened technologies for digital circuits is urgent. Due to its larger sensitive volume per bit and lower node capacitance than the dynamic counterpart, SRAM is more prone to soft errors. Therefore, the soft error rate (SER) [6] in SRAM increases as the technology scales into the nanometer regime.

In order to reduce the SER, numerous alternatives to the standard 6T SRAM cell have been proposed. The main hardening method is to construct a special topology of transistor connections inside the cell to achieve circuit-level protection. The soft-error-robust Quatro-10T SRAM cell, which offers differential read operation with a large noise margin, was proposed. However, it can only recover from a “1” to “0” transition; thus, it is not completely immune to SEUs. Owing to its dual-node feedback, the dual interlocked storage cell (DICE) is fully immune to a single-event transient (SET) occurring on any one of its nodes. However, its SEMNU immunity is minimal, and its radiation hardness still needs to be improved. A Schmitt-trigger-based (STB)-13T memory cell with full SEU immunity was also proposed. However, its limited improvement in SEMNU immunity is achieved at the expense of write speed, power consumption, and layout area compared with DICE. Based on the STB-13T, two novel hardened memory cells with higher reliability, radiation hardened design (RHD)-11T and RHD-13T, were proposed in [9]. Unfortunately, their write speed and write margin are degraded. For low-power and highly reliable radiation-hardened applications, the RH memory (RHM)-12T was proposed; however, the authors used nMOS transistors as pull-up devices, which worsens the read noise margin. Recently, the RHD-12T memory cell with favorable radiation hardness, shown in Fig. 2(a), was proposed. Besides tolerating an SEU on any one of its internal nodes, it also provides SEMNU immunity to some extent. Unfortunately, its slow write speed and large power consumption limit its application.

In addition to hardening at the circuit level, specific layout techniques have been proposed as an alternative method for improving radiation tolerance. A layout technique named layout design through error-aware transistor positioning (LEAP) was applied to DICE, resulting in a new sequential element, LEAP-DICE. TCAD simulations show that it is effective in increasing the linear energy transfer (LET) upset threshold. To investigate charge sharing, a Monte Carlo simulation platform named tool suite for radiation reliability assessment (TIARA) has been reported. By analyzing the TIARA simulation results, layout optimization can be targeted at the most vulnerable transistor pairs.

In this paper, the radiation-hardened with speed and power optimized (RSP)-14T bitcell is proposed. Compared with RHD-12T, its radiation hardness is improved by reinforcing the redundant nodes with two extra pMOS transistors. Furthermore, because the supply of the branch in which the redundant nodes are located is controlled by the extra pMOS transistors, the feedback mechanism is easily interrupted during the write operation. Thus, the write speed and power consumption are improved effectively. Generally, SPICE simulations using the double-exponential current source model are applied to evaluate the radiation tolerance of a circuit, which saves time. However, the model relies on calibration parameters that are not physical. Because the charge sharing between transistors is neglected, the model may overestimate the SEU immunity of other SRAM cells. Thus, in order to account for the charge sharing between transistors while reducing the CPU burden, TCAD mixed-mode simulation is adopted in this paper as a good qualitative approach to evaluating SEU immunity. Combined with the layout-level design, the simulation results show that the proposed circuit has better SEU immunity.


  • More Soft Error Rate
  • More Power consumption
  • More immunity and radiation hardness



  1. Read and Write Operation

The schematic of the proposed RSP-14T is shown in Fig. 2(b). Here, the transistors N4 and N5, controlled by a word line (WL), are access transistors, which control the connection between the bit lines (BL and BLB) and the storage nodes (Q and QB). The nodes S1 and S0 are redundant nodes of Q and QB. If the stored bit is “1,” the logic values at nodes Q, QB, S1, and S0 are “1,” “0,” “1,” and “0,” respectively.

Figure 2: (a) RHD-12T bit cell (b) Proposed RSP-14T bit cell

The functional analysis of the proposed RSP-14T is presented in sequence: 1) write; 2) read; and 3) hold operation. In the write operation, we assume that Q = “1” and QB = “0” and that the bitlines BL and BLB are set to “0” and “1,” respectively. When the WL is activated, the values stored in Q and QB are changed to “0” and “1,” respectively. After that, once WL is discharged to “0,” the new state of the memory cell is stored. For a read operation, BL and BLB are precharged to “1.” When the selected WL is enabled, the transistors N4 and N5 are turned on, and BLB is discharged through transistors N4 and N1. As a result, a differential voltage is generated on the BLs and amplified by the sense amplifier. During the hold operation, WL is deactivated and the storage nodes are isolated from the BLs; thus, they maintain their initial state. In this paper, the transistors P0 and P7 are used to connect or cut off the power supply to transistors P1/P5, which improves the write speed and power consumption compared with RHD-12T.

  2. Error Tolerance Analysis

Neglecting the charge sharing between the transistors in the actual layout, the SEU recovery behavior is analyzed at the circuit level. Assuming that Q = “1,” QB = “0,” S1 = “1,” and S0 = “0,” the analyses of the nodes (Q, QB, S1, and S0) are as follows.

Case 1 (Positive Transient Pulse at Node S0): When the drain of P1 is hit by a particle, it will collect positive charge and increase the voltage at node S0 (i.e., S0 will be changed from “0” to “1”). As a result, P6 and P5 will be turned off. However, this cannot further affect the OFF/ON-states of other transistors, and the storage status of the Q and S1 nodes will remain unchanged. Therefore, the transient fault at S0 cannot propagate inside the cell. Finally, the nodal logic level will be recovered after the radiation event.

Case 2 (Positive Transient Pulse at Node QB): When the drain of P2 is hit by a particle, it will collect positive charge and increase the voltage at node QB (i.e., QB will be changed from “0” to “1”). As a result, N2 and N0 will be turned on. Correspondingly, Q and S1 will be changed from “1” to “0,” P0 and P1 will be turned on, and N3 will be turned off, and then S0 will be changed from “0” to “1.” Finally, the storage state of the cell will be flipped. (This upset has been made difficult to trigger by the layout-level design presented in Section III.) Because the transistors are stacked and the topology is optimized, the parasitic bipolar amplification effect of P2 (the source of P3 is connected to VDD, whereas the source of P2 has a weak connection to “1”) is mitigated. As a result, the quantity of charge collected by the drain of P2 is reduced, which improves the SEU tolerance of node QB.

Case 3 (Negative Transient Pulse at Node S1): When the drain of N0 is hit by a particle, it will collect negative charge and S1 will be discharged from “1” to “0,” and P3 and P1 will be turned on. However, due to the blocking effect of transistors P2 and P0, the fault at S1 cannot further propagate in the cell. Therefore, QB and S0 will remain in their original status. Due to the low status at QB and S0, P7 and P5 are always in the ON state. Hence, the current provided by P7 and P5 will charge S1 continuously. This positive feedback will accelerate the recovery process of S1. Finally, the nodal storage status will be recovered after the radiation event.

Case 4 (Negative Transient Pulse at Node Q): When the drain of N2 is hit by a particle, negative charge will be collected and Q will be discharged from “1” to “0,” and then N1 and N3 will be turned off. This is very similar to Case 1; thus, the storage status of Q will finally recover after the radiation event.

For time efficiency, in order to prove that the above-mentioned analyses are correct and locate the most vulnerable node of the proposed RSP-14T, the transient injections at S0, QB, S1, and Q nodes are simulated, as shown in Fig. 3, by the double-exponential current source. On this basis, further analysis of SEU is given by TCAD in Section III. Here, the double-exponential current is expressed as

$$ I(t) = I_0\left(e^{-t/\tau_\alpha} - e^{-t/\tau_\beta}\right) \tag{1} $$

$$ I_0 = \frac{Q}{\tau_\alpha - \tau_\beta} \tag{2} $$

where $I_0$ is the peak of the current source, Q is the amount of deposited charge, $\tau_\alpha$ is the collection time constant of the junction, and $\tau_\beta$ is the time constant for initially establishing the ion track. In this paper, $I_0$ is set as ∼174 μA, while $\tau_\alpha$ and $\tau_\beta$ are set as 200 and 50 ps, respectively.
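
To make the quoted values concrete, the following Python sketch evaluates the double-exponential pulse and the deposited charge implied by (2). The function names, the 3 ns integration window, and the step count are assumptions made for illustration; only $I_0$, $\tau_\alpha$, and $\tau_\beta$ come from the text.

```python
import math


def double_exp_current(t, i0, tau_a, tau_b):
    """Eq. (1): I(t) = I0 * (exp(-t/tau_a) - exp(-t/tau_b)), t in seconds."""
    return i0 * (math.exp(-t / tau_a) - math.exp(-t / tau_b))


def deposited_charge(i0, tau_a, tau_b):
    """Eq. (2) rearranged: Q = I0 * (tau_a - tau_b)."""
    return i0 * (tau_a - tau_b)


def integrate_current(i0, tau_a, tau_b, t_end=3e-9, steps=30000):
    """Trapezoidal integral of I(t) from 0 to t_end (window and steps assumed)."""
    dt = t_end / steps
    total = 0.0
    for i in range(steps):
        t0, t1 = i * dt, (i + 1) * dt
        total += 0.5 * dt * (double_exp_current(t0, i0, tau_a, tau_b)
                             + double_exp_current(t1, i0, tau_a, tau_b))
    return total


I0, TAU_A, TAU_B = 174e-6, 200e-12, 50e-12       # values quoted in the text
q_formula = deposited_charge(I0, TAU_A, TAU_B)   # about 26.1 fC
q_numeric = integrate_current(I0, TAU_A, TAU_B)
print(f"{q_formula * 1e15:.1f} fC, {q_numeric * 1e15:.1f} fC")
```

The numerical integral of (1) agrees with the closed form from (2), so the quoted parameters correspond to a deposited charge of roughly 26 fC.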


  • Less Soft Error Rate
  • Less Power consumption
  • Less immunity and radiation hardness




[1] P. E. Dodd and L. W. Massengill, “Basic mechanisms and modeling of single-event upset in digital microelectronics,” IEEE Trans. Nucl. Sci., vol. 50, no. 3, pp. 583–602, Jun. 2003.

Power-Efficient Gm-C DSMs With High Immunity to Aliasing, Clock Jitter, and ISI


Recent progress in continuous-time (CT) Delta–Sigma modulator (DSM) research has shown that applying a passive RC low-pass filter (LPF) in the feedback path can significantly improve the power efficiency of a CT DSM. On the other hand, to achieve high performance, a CT DSM must cope with the adverse effects of clock jitter, intersymbol interference (ISI), and degraded antialiasing. These challenges are extremely difficult to tackle simultaneously without consuming excessive power. This paper proposes a Gm-C DSM with a combined RC and switched-capacitor LPF frontend stage that achieves high immunity to aliasing, clock jitter, and ISI simultaneously while consuming extremely low power. Transistor-level simulations of an audio-band modulator and a 10-MHz bandwidth modulator are given, verifying the high immunity of the proposed circuit to clock jitter, ISI, and aliasing while attaining a power efficiency of up to 7.4 fJ/conversion step.


Delta–Sigma modulators (DSMs) have been applied in a variety of electronic products, from biomedical devices with a narrow frequency range of dozens of hertz to wireless communication networks with a bandwidth of up to tens of megahertz or higher. Oversampling and spectral shaping of quantization noise allow a modulator to use low-quality components, such as amplifiers with low gain and comparators with relaxed offset, to achieve high resolution. There are mainly two types of implementations: discrete-time (DT) DSMs and continuous-time (CT) DSMs. CT DSMs are attractive due to their constant input impedance, inherent antialiasing, and relaxed amplifier bandwidth and slew-rate requirements compared with their counterparts implemented with switched-capacitor (SC) circuits. A constant input impedance removes the signal-dependent charge injection associated with an SC input and makes the input current smoother than the rapidly changing current pulses of an SC input, which reduces the power of the input buffer and, in turn, of the whole system. The antialiasing ability eliminates the frontend filters, which also saves power for the whole system. A previously reported pure Gm-C implementation achieves very high resolution but has a very limited input range. In another design, a resistor input and a multibit feedback DAC are used to realize a subtractor that generates a rough virtual ground before the Gm-C integrator, but a source-degenerated transconductor is still needed to enhance the linear input range, which greatly degrades the power efficiency of the Gm-C loop filter. Recent publications point out that applying a passive RC low-pass filter (LPF) in a CT DSM can significantly improve its power efficiency by attenuating the swing of the error signal applied to its first and most crucial active integrator.

On the other hand, CT DSMs suffer from clock jitter and intersymbol interference (ISI). These problems become more pronounced in high-resolution and wide-bandwidth designs. Traditional methods to reduce the sensitivity to clock jitter include the use of SC DACs and finite-impulse response (FIR) DACs. An SC DAC possesses high immunity to clock jitter because the aperture uncertainty only affects the small tail of the SC current pulses. However, the steep current pulses push the output settling requirements of the first amplifier to extremes, thereby significantly increasing the power consumption. Moreover, the conventional SC DAC degrades the antialiasing ability, a fundamental advantage of CT DSMs, because of the sampling operation at the virtual ground of the first amplifier. The FIR DAC approach, despite its success in jitter reduction, requires a second FIR filter to maintain the loop stability, increases the total load resistance (when the DAC is resistively loaded) by a factor of 2N2 (N is the number of FIR taps), draws dynamic power in its driving circuits, and is still sensitive to ISI when non-return-to-zero (NRZ) signaling is applied.

The ISI in a single-bit DAC results from the asymmetrical rise and fall times of the DAC’s output waveform. The effect of ISI in an NRZ DAC can be reduced by calibration approaches and ISI error shaping methods. However, these approaches require complex circuits and consume extra power. Another approach to the ISI problem is to use a return-to-open (RTO) DAC or a return-to-zero (RZ) DAC. However, to keep the same full-scale value, an RZ/RTO DAC needs to increase the height of the feedback current pulses, which increases the power consumption of the amplifier needed to retain the same linearity. In addition, the widely used rectangular-shape RZ DAC is more sensitive to clock jitter than an NRZ DAC.

The problems of clock jitter, ISI, and aliasing are extremely difficult to tackle simultaneously without consuming excessive power. In this paper, a new single-bit CT DSM is proposed that combines high power efficiency with high immunity to clock jitter, ISI, and aliasing. The CT DSM has a combined passive RC and SC LPF frontend stage followed by a transconductance-C (Gm-C) loop filter. To demonstrate the effectiveness of the proposed structure, two modulators are designed: one with an audio bandwidth (25 kHz) and the other with a wider bandwidth (10 MHz) for mobile wireless receiver applications. Transistor-level simulations show that the audio-band design achieves a 95-dB signal-to-noise-and-distortion ratio (SNDR) with 49 μW in a 0.18-μm CMOS, corresponding to a Walden figure of merit (FoMW) of 21.3 fJ/conversion step; the wideband (10 MHz) example achieves 81-dB SNDR while consuming 1.36 mW in a 65-nm CMOS, corresponding to an FoMW of 7.4 fJ/conversion step.
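The two figure-of-merit values quoted above can be reproduced with a short calculation. The sketch below assumes the commonly used Walden definition, FoMW = P / (2^ENOB · 2 · BW) with ENOB = (SNDR − 1.76) / 6.02; the text does not spell out the formula, so this definition and the function names are assumptions made for illustration.

```python
import math


def enob_from_sndr(sndr_db: float) -> float:
    """Effective number of bits from SNDR in dB (standard sine-wave relation)."""
    return (sndr_db - 1.76) / 6.02


def walden_fom(power_w: float, sndr_db: float, bandwidth_hz: float) -> float:
    """Walden figure of merit in joules per conversion step (assumed definition)."""
    return power_w / (2.0 ** enob_from_sndr(sndr_db) * 2.0 * bandwidth_hz)


# Power, SNDR, and bandwidth values are the ones quoted in the text.
audio_fom = walden_fom(power_w=49e-6, sndr_db=95.0, bandwidth_hz=25e3)
wideband_fom = walden_fom(power_w=1.36e-3, sndr_db=81.0, bandwidth_hz=10e6)

print(f"audio design:    {audio_fom * 1e15:.1f} fJ/conversion step")
print(f"wideband design: {wideband_fom * 1e15:.1f} fJ/conversion step")
```

Under this definition both results agree with the quoted 21.3 fJ and 7.4 fJ per conversion step to within rounding.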


  • High gain and High resolution
  • High power efficiency
  • More Clock jitter



In spite of its high linearity, the CT DSM of Fig. 1(d) suffers from clock jitter and ISI in circuit implementation. In this section, we present a frontend circuit to address these problems. The proposed Gm-C CT DSM circuit architecture is shown in Fig. 2(a). The PI block (k1s + 1/s) in Fig. 1(d) is realized by gm2, RE, and C3, where the excess loop delay is compensated by RE and requires no additional active circuit. The dash-boxed part is a combined SC (for the feedback signal) and RC (for the input signal) LPF, abbreviated as RSC-LPF. The RSC-LPF makes the Gm-C modulator immune to clock jitter, ISI, and aliasing while achieving a high power efficiency.

Figure 1:  Third-order CT DSMs with (a) simple single-bit approach; (b) 3-bit quantizer; (c) single-bit quantizer and a seven-tap FIR NRZ DAC; and (d) single-bit quantizer and an LPF. Nonlinearities in the first active integrators are modeled

To appreciate the development of the RSC-LPF, let us start with the approach of using an RC LPF for both the input and feedback signals, as shown in Fig. 2(b). The transfer function of this circuit is given by

$$\frac{\varepsilon(s)}{V_{IN}(s)} = \frac{1}{2 + s R_{in} C_P} \tag{8}$$

where the pole location of the LPF is determined by the RinCP time constant. Balancing the tradeoff between the in-band quantization noise and thermal noise, placing the pole at the signal band edge is a proper choice [8]. The resistive feedback path can implement an NRZ or RZ DAC, but it is sensitive to clock jitter. To suppress the effect of clock jitter, Rf is replaced with a capacitor C1 switched at the sampling frequency fs, as shown in Fig. 2(c). Normally, Rf equals Rin. The choice of their values is discussed in Section VII. When f ≪ fs, i.e., in the signal band, the switched capacitor C1 emulates a resistor of value Req = Ts/C1, where Ts = 1/fs. The transfer function from VIN to ε is still given by (8) in the signal band. There are tiny differences at multiples of the sampling frequency, which will be explained in Section VI. The transfer function from VF to ε is

$$\frac{\varepsilon(z)}{V_F(z)} = \frac{\dfrac{C_1}{C_1 + C_P}}{z - \dfrac{C_P - C_1}{C_1 + C_P}} = \frac{\alpha}{z - (1 - 2\alpha)}, \qquad \alpha = \frac{C_1}{C_1 + C_P} \tag{9}$$

Equation (9) is equivalent to (8) when f ≪ fs under the condition Ts = RinC1. A very small signal ε is left for the subsequent open-loop Gm-C integrator to handle, freeing it from linearity and slew-rate constraints to save power.
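The claimed equivalence of (8) and (9) in the signal band can be checked numerically. The following is a minimal sketch, not the authors' verification: the component values (Rin = 100 kΩ, BW = 25 kHz, fs = 6.4 MHz) are hypothetical and chosen only for illustration, with CP and C1 derived from RinCP = 1/(2π · BW) and Ts = RinC1.

```python
import cmath
import math

# Hypothetical example values (assumptions, not taken from the source).
R_IN = 100e3   # input resistor, ohms
BW = 25e3      # signal bandwidth, Hz
FS = 6.4e6     # sampling frequency, Hz

TS = 1.0 / FS
C_P = 1.0 / (2.0 * math.pi * BW * R_IN)   # from Rin*CP = 1/(2*pi*BW)
C_1 = TS / R_IN                           # from Ts = Rin*C1
ALPHA = C_1 / (C_1 + C_P)


def h_rc(f_hz: float) -> complex:
    """Continuous-time response of (8): 1 / (2 + s*Rin*CP)."""
    s = 2j * math.pi * f_hz
    return 1.0 / (2.0 + s * R_IN * C_P)


def h_sc(f_hz: float) -> complex:
    """Discrete-time response of (9): alpha / (z - (1 - 2*alpha))."""
    z = cmath.exp(2j * math.pi * f_hz * TS)
    return ALPHA / (z - (1.0 - 2.0 * ALPHA))


for f in (1e3, 10e3, 25e3):
    print(f"{f:8.0f} Hz  |H_rc| = {abs(h_rc(f)):.4f}  |H_sc| = {abs(h_sc(f)):.4f}")
```

With these values C1 is much smaller than CP, so the two magnitude responses should track each other closely up to the band edge; both have a dc gain of 1/2.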

A fundamental difference exists between the proposed RSC-LPF approach and the SCR DAC approach [11], which looks similar. The SCR DAC must be assisted by a closed loop OTA and thus imposes settling requirements on the OTA. By contrast, in the proposed approach, the charging and discharging of C1 in the SC DAC does not involve the OTA, and thus does not impose any settling requirements on it. This leads to a significant improvement in power efficiency.


Figure 2 : (a) Proposed Gm-C CT DSM with an RSC LPF front end, where V is the digital output data, VF is the feedback signal, DAC2 is a current steering DAC. (b) RC LPF. (c) RSC LPF alone

In Fig. 2(a), the speed of charge redistribution between C1 and CP is determined by the ON-resistances of the switches. The transistors realizing the switches can be sized to minimize their ON-resistances, as charge injection is not an issue for the following reason. Since VF has only two possible values and ε is nearly a virtual ground, the amount of charge injected from the switches is fixed for either level of VF. The switch charge injection therefore results only in an offset or gain error, not in nonlinearity.

The values of Rin, C1, and CP can be determined with the following procedure.

1) The ratio between the full ranges of the input and the feedback is fixed. Normally Rin = Req is chosen for equal input and feedback ranges.

2) The value of Rin is determined by the thermal noise requirement based on (28) in Section VII.

3) CP is found from RinCP = 1/(2π · BW), where BW is the signal bandwidth.

4) C1 is derived from Ts = ReqC1 = RinC1.
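Steps 3 and 4 of the procedure above reduce to two one-line formulas. The sketch below treats Rin as already chosen (step 2 depends on the thermal-noise expression (28), which is not reproduced here); the function name and the example numbers are hypothetical.

```python
import math


def size_rsc_lpf(r_in: float, bandwidth_hz: float, fs_hz: float):
    """Return (C_P, C_1) in farads for a given Rin, signal bandwidth, and sampling rate.

    Assumes Req = Rin (step 1) and that Rin has already been fixed by the
    thermal-noise requirement (step 2).
    """
    c_p = 1.0 / (2.0 * math.pi * bandwidth_hz * r_in)   # step 3: Rin*CP = 1/(2*pi*BW)
    t_s = 1.0 / fs_hz
    c_1 = t_s / r_in                                     # step 4: Ts = Req*C1 = Rin*C1
    return c_p, c_1


# Hypothetical example: 100 kOhm input resistor, 25 kHz bandwidth, 6.4 MHz clock.
c_p, c_1 = size_rsc_lpf(r_in=100e3, bandwidth_hz=25e3, fs_hz=6.4e6)
print(f"C_P = {c_p * 1e12:.2f} pF, C_1 = {c_1 * 1e12:.4f} pF")
```

Because both capacitors scale as 1/Rin, a lower Rin (lower thermal noise) costs proportionally more capacitance.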

A mismatch between Rin and Rf (= Ts/C1) of the feedback SC causes a change in the pole frequency of the LPF for both VIN and VF (by the same proportion). The result is only a scaling of VIN with respect to VF. Simulations have validated that the modulator can tolerate a ±40% mismatch between Rin and Req without significant performance degradation. When the modulator loop works normally, the node ε in Fig. 3(a), which has a small swing, can be regarded as a virtual ground. Thus, the load presented to the preceding stage of the modulator can be regarded as purely resistive with value Rin. On the other hand, before the modulator loop is locked, which is a negligibly short period, the feedback signal VF does not cancel the input signal and the node ε is not a virtual ground. The input impedance is then (1 + sRinCP)/sCP, whose magnitude is always larger than Rin and which is therefore easier to drive.


  • Low gain and Low resolution
  • Low power efficiency
  • Low Clock jitter


[1] G. Singh, R. Wu, Y. Chae, and K. A. A. Makinwa, “A 20 bit continuous-time modulator with a Gm-C integrator, 120 dB CMRR and 15 ppm INL,” in Proc. ESSCIRC, Sep. 2012, pp. 385–388.

Multiplier-free Implementation of Galois Field Fourier Transform on a FPGA



A novel approach to implementing Galois Field Fourier Transform (GFT) is proposed that completely eliminates the need for any finite field multipliers by transforming the symbols from a vector representation to a power representation. The proposed method is suitable for implementing GFTs of prime and nonprime lengths on modern FPGAs that have a large amount of on-chip distributed embedded memory. For GFT of length 255 that is widely used in many applications, the proposed memory based implementation exhibits 25% improvement in latency, 27% improvement in throughput, and 56% reduction in power consumption compared to a finite field multiplier based implementation.


The Fourier transform over a Galois (finite) field (GFT) and its inverse (IGFT) are some of the most computationally demanding tasks in the implementation of Bose-Chaudhuri-Hocquenghem (BCH) and Reed-Solomon (RS) codes. Finite field multiplication is the bottleneck in implementing the GFT, so several techniques have been proposed in the research literature to reduce the number of multiplications required. A substitution and pre-computation based technique has been proposed to compute the GFT for prime lengths greater than 2 over GF(2^m) for arbitrary m; it saves about one-quarter of the multiplications compared to a brute-force implementation. There have also been research efforts that use an FFT-style implementation of the GFT to reduce the number of multiplications. Researchers have proposed methods that reduce the number of multiplications from n^2 to (1/4) n (log2 n)^2 over GF(p^m), where p is a prime, and to n log2 n for GF(22r ), for an arbitrary value of r, and in [6] the authors improve this further to O(n(log2 (n))log3/2 2 ) using the cyclotomic method. In [2], the authors present the hardware design and implementation of the cyclotomic Fast Fourier Transform (CFFT) over GF(2^m) by reformulating the method presented. Though the architecture has some advantages because some of the results of the computation are reused instead of computing them again, the number of finite field multipliers.

In this paper we propose a technique to implement the GFT that does not use any finite field multipliers. The main insight of our work is that by transforming the GFT computation from a vector representation to a power representation, we can replace multiplication by wrap-around carry addition. The challenge is to do the conversion from vector to power representation and back efficiently. We propose to use the extremely large on-chip embedded memory available in modern FPGAs as ROM (read-only memory) to store the precomputed conversion tables. On a Virtex 7 there is about 37 Mb of embedded memory (called Block RAM). This is sufficient to create the ROMs for GFT computation over finite fields up to a length of 1023, which meets the requirements of many emerging applications. For example, in the new iterative soft-decision decoding of RS codes, the application that motivates the work presented in this paper, a GFT of length 127 is required. Recent FPGAs such as the Stratix 10 TX2800 from Altera have about 229 Mb of memory with over 11721 M20K blocks. With more memory, the proposed scheme can be easily extended to larger field sizes if necessary. At the end of the paper we also propose a simple technique to reduce the memory requirements for prime-length implementations by taking advantage of the cyclic property of multiplications over a Galois field.
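The core idea, replacing a finite field multiplication by two table lookups and one wrap-around carry addition, can be illustrated in software. This is a minimal sketch over the small field GF(2^3) rather than a model of the FPGA design: the primitive polynomial x^3 + x + 1 and all names (`POWER_ROM`, `VECTOR_ROM`, `wrap_add`) are assumptions chosen for illustration.

```python
M = 3                    # field GF(2^3); kept small for illustration
N = (1 << M) - 1         # number of nonzero elements, also the adder modulus
PRIM_POLY = 0b1011       # x^3 + x + 1, an assumed primitive polynomial

# Vector ROM: power (exponent) -> vector (bit pattern) representation.
VECTOR_ROM = []
value = 1
for _ in range(N):
    VECTOR_ROM.append(value)
    value <<= 1
    if value & (1 << M):
        value ^= PRIM_POLY

# Power ROM: vector -> power representation (zero has no power and is handled separately).
POWER_ROM = {vec: exp for exp, vec in enumerate(VECTOR_ROM)}


def wrap_add(x: int, y: int) -> int:
    """m-bit addition with end-around carry, i.e. addition modulo 2^m - 1."""
    s = x + y
    s = (s & N) + (s >> M)        # add the carry-out back into the result
    return 0 if s == N else s     # all-ones is a second encoding of zero


def gf_mul_tables(a: int, b: int) -> int:
    """Multiply two field elements using only table lookups and one addition."""
    if a == 0 or b == 0:
        return 0
    return VECTOR_ROM[wrap_add(POWER_ROM[a], POWER_ROM[b])]


def gf_mul_reference(a: int, b: int) -> int:
    """Shift-and-XOR polynomial multiplication, used only as a cross-check."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & (1 << M):
            a ^= PRIM_POLY
        b >>= 1
    return result
```

The table-based product can be compared against `gf_mul_reference` for every pair of field elements; the zero element needs a separate path because it has no power representation.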

In an FPGA the embedded memory blocks can be configured as extremely wide word ROMs. This allows a highly parallel vector-processing-style implementation, which results in extremely low latency and very high throughput. Note that this is only possible because the embedded memory on an FPGA is very flexible and can be configured with an extremely large word size. For example, when n is 1023, the proposed architecture accesses 1023 10-bit words (i.e., 10230 bits) in parallel each clock cycle (227 MHz), which represents an aggregate memory bandwidth of 2.3 Tb/s. This allows the computation of a GFT of length 1023 in 1027 clock cycles with a throughput of 2.5 Gbps, using about 52% of the on-chip memory resources on the FPGA.
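The aggregate memory bandwidth quoted above follows directly from the word count, word width, and clock rate. A small sketch of that arithmetic (the function name is hypothetical, and n is assumed to be of the form 2^m − 1):

```python
def parallel_memory_bandwidth(n: int, clock_hz: float):
    """Return (bits read per cycle, aggregate bandwidth in bit/s) for n parallel m-bit words."""
    word_bits = (n + 1).bit_length() - 1    # m = log2(n + 1), assuming n = 2^m - 1
    bits_per_cycle = n * word_bits
    return bits_per_cycle, bits_per_cycle * clock_hz


bits_per_cycle, bandwidth_bps = parallel_memory_bandwidth(n=1023, clock_hz=227e6)
print(bits_per_cycle, f"{bandwidth_bps / 1e12:.2f} Tb/s")
```

For n = 1023 at 227 MHz this gives 10230 bits per cycle and roughly 2.3 Tb/s, matching the figures in the text.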


  • Larger number of multiplications
  • Finite field multipliers require more computation
  • No carry additions


We propose two architectures, Serial In Parallel Out (SIPO) and Parallel In Serial Out (PISO), which differ in how the inputs arrive and how the outputs are produced. In the SIPO architecture (see Fig. 1), inputs a0 to an−1 come in serially. The Power ROM converts the vector representation to the power representation. The depth of this ROM is n + 1 and the width is m = log2(n + 1) bits, so the size of the Power ROM is (n + 1) × log2(n + 1). The Beta ROM consists of the powers that need to be added to the inputs. The size of the Beta ROM is (n + 1) × (log2(n + 1) × n).

Figure 1 : Serial In Parallel Out (SIPO) Architecture

In both architectures, the values of the Beta ROM are precomputed. The adder unit performs wrap-around carry addition, where the carry is added back to get the final result. Then the Vector ROMs are used to convert from the power to the vector representation. The size of each Vector ROM is (n + 1) × log2(n + 1). If dual-port memory is used, the number of Vector ROMs needed is ⌊n/2⌋. The output of each Vector ROM goes to a multiplexer, which outputs zero if the input is zero and the output of the Vector ROM otherwise. These values are accumulated by Galois field addition, i.e., using XOR gates. In this architecture, once ak is received, ak ∗ β^z is calculated in parallel for all z values according to the kth column of the square matrix in equation (2). After all the inputs have arrived, the outputs b0 to bn−1 are available simultaneously.
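The SIPO data flow described above (Power ROM, Beta ROM, wrap-around adder, Vector ROM, zero multiplexer, XOR accumulation) can be mimicked with a small behavioral model. This is a sketch under stated assumptions, not the authors' hardware: equation (2) is not reproduced in this text, so the standard definition b_z = Σ_k a_k β^(z·k) is assumed, the field is GF(2^3) with the assumed primitive polynomial x^3 + x + 1, and the Beta ROM contents are computed on the fly instead of being stored.

```python
M = 3
N = (1 << M) - 1           # transform length n = 2^m - 1
PRIM_POLY = 0b1011         # assumed primitive polynomial x^3 + x + 1

VECTOR_ROM = []            # power -> vector
v = 1
for _ in range(N):
    VECTOR_ROM.append(v)
    v <<= 1
    if v & (1 << M):
        v ^= PRIM_POLY
POWER_ROM = {vec: exp for exp, vec in enumerate(VECTOR_ROM)}   # vector -> power


def wrap_add(x: int, y: int) -> int:
    """Addition with end-around carry (modulo 2^m - 1)."""
    s = x + y
    s = (s & N) + (s >> M)
    return 0 if s == N else s


def gft_sipo(symbols, inverse=False):
    """Accumulate b_z = XOR over k of a_k * beta^(z*k), one input symbol per step."""
    outputs = [0] * N
    for k, a_k in enumerate(symbols):        # inputs arrive serially
        if a_k == 0:
            continue                          # multiplexer path: a zero input contributes zero
        p = POWER_ROM[a_k]                    # Power ROM lookup
        for z in range(N):                    # performed in parallel in hardware
            beta_exp = (-z * k) % N if inverse else (z * k) % N   # Beta ROM entry
            outputs[z] ^= VECTOR_ROM[wrap_add(p, beta_exp)]       # Vector ROM, then XOR
    return outputs


data = [1, 5, 0, 7, 2, 0, 3]
spectrum = gft_sipo(data)
recovered = gft_sipo(spectrum, inverse=True)
```

In a field of characteristic 2 the length n = 2^m − 1 is odd, so the usual 1/n scaling of the inverse transform equals 1 and the inverse only needs negated exponents; `recovered` should therefore equal `data`.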

Figure 2 : Parallel In Serial Out (PISO) Architecture

In the PISO architecture shown in Fig. 2, inputs a0 to an−1 are assumed to be available in parallel, so we compute the outputs b0 to bn−1 one by one (serially). The PowerROM converts the vector representation to the power representation. The depth of this ROM is n + 1 and the width is log2(n + 1) bits. The main difference between SIPO and PISO is that we replicate multiple copies of this PowerROM to facilitate parallel look-up. With dual-port ROMs, the number of PowerROMs needed is ⌊n/2⌋. The BetaROM consists of the powers that need to be added to the inputs. The size of the BetaROM is (n + 1) × (log2(n + 1) × n). The adder unit performs wrap-around carry addition, where the carry is added back to the result to get the final result. Then the VectorROMs are used to convert from the power to the vector representation. The size of each VectorROM is (n + 1) × log2(n + 1). The number of VectorROMs needed is ⌊n/2⌋. These outputs are XOR-ed together (as in the previous case) to get the final output. The control unit for this architecture is very simple, as it just generates the address for the BetaROM.
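The ROM sizes stated for the PISO architecture can be tallied with a few lines of arithmetic. The sketch below simply evaluates the expressions given in the text for a length n of the form 2^m − 1; it counts raw bits only and does not model how an FPGA tool would map them onto physical memory blocks, so the totals are illustrative rather than a resource estimate. The function name and dictionary keys are hypothetical.

```python
def piso_rom_bits(n: int) -> dict:
    """Raw ROM bit counts for the PISO architecture, using the sizes stated in the text."""
    if n < 1 or (n + 1) & n:
        raise ValueError("n is assumed to be of the form 2^m - 1")
    m = (n + 1).bit_length() - 1      # log2(n + 1)
    copies = n // 2                   # floor(n/2) dual-port ROM copies
    power_rom = (n + 1) * m           # one PowerROM
    beta_rom = (n + 1) * (m * n)      # the single BetaROM
    vector_rom = (n + 1) * m          # one VectorROM
    return {
        "power_rom_each": power_rom,
        "beta_rom": beta_rom,
        "vector_rom_each": vector_rom,
        "total": copies * power_rom + beta_rom + copies * vector_rom,
    }


sizes_255 = piso_rom_bits(255)
print(sizes_255)
```

For n = 255 the BetaROM accounts for about half of the raw bits, and it grows roughly as n^2 · log2(n), so it becomes the dominant term at longer lengths.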


  • Fewer multiplications
  • Finite field multiplication takes less computation
  • Wrap-around carry additions are used


[1] W. Gappmair, “An efficient prime-length DFT algorithm over finite fields GF(2^m),” Transactions on Emerging Telecommunications Technologies, vol. 14, no. 2, pp. 171–176, 2003.

Multiloop Control for Fast Transient DC–DC Converter



A novel ac coupled feedback (ACCF) is proposed to alternatively realize fast transient response while inherently controlling the start-up in-rush current of a dc–dc switching converter. The proposed ACCF is modified from a conventional capacitor multiplier and connected between the outputs of the converter and the transconductance. With this supplemental feedback, the transient response has been significantly improved due to the gain-boosting effect around the compensator’s midband. Moreover, the ACCF circuit assists to manage the ramping speed of the output voltage during power-up, thereby eliminating the bulky soft-start circuit. The new controller is very simple to implement and occupies a tiny footprint on-chip. A buck converter with the proposed scheme has been fabricated using the 0.18-µm standard CMOS process with an active silicon area of 0.573 mm2. Measurement results show that the output voltage rises linearly for a soft-start period of 1.05 ms according to the designed slope. Excellent load transient responses are achieved under different load current steps; the output voltage overshoot/undershoot of 60 mV settles down within 10 µs for a load variation from 50 µA to 1 A in 1 µs. Moreover, the proposed converter maintains both excellent load and line regulations of 0.018 mV/mA and 0.0056 mV/mV, respectively.





  • Design of a fast-transient DC-DC converter in 45 nm CMOS technology with an input voltage range of 2.6 V to 4.2 V.

Proposed Title:

  • An Efficient Approach to a Fast-Transient DC-DC Converter Using Multiloop Control

Proposed Abstract:

A step-down DC-to-DC power converter is called a buck converter. A DC-DC converter converts direct current (DC) from one voltage level to another. This paper proposes a novel design and implementation of a fast-transient-response current-mode buck converter with ac coupled feedback (ACCF), where the ACCF is a modified version of a conventional capacitor multiplier. Previous DC-DC converters require more power to achieve a fast transient response and exhibit high electromagnetic interference (EMI) noise. To overcome this problem, this work presents a novel DC-DC converter design with ACCF. The ACCF circuit manages the ramping speed of the output voltage during power-up, which eliminates the bulky soft-start circuit. The proposed system uses a current-mode controller to improve the response speed and the load transient response. The proposed scheme is implemented in 45 nm CMOS technology with an input voltage range of 2.6 V to 4.2 V, and is evaluated in terms of the voltage, power, area, and delay of the DC-DC converter.


The demand for fast load transient performance has grown significantly, affecting the power supplies of modern high-speed processors, especially processors that aim for a fast transition from the low-power idle mode to the high-speed active mode. There is a massive load current change when the system switches from an idle mode to an active mode. Ideally, the regulator should maintain a voltage level that is almost constant, which means that there should be a negligible output voltage overshoot/undershoot and a rapid response time. To meet these stringent requirements, various fast-transient dc–dc converters have been proposed.

Among these methods, increasing the system bandwidth has been the most general solution for the majority of analog circuits designed according to linear control theory. In the design of a switching-mode dc–dc converter, a wide system bandwidth can be achieved by adopting the current-mode control method. On top of that, an adaptive pole–zero position circuit has been proposed to instantly move the pole and zero pair of the compensation network to higher frequencies, in order to temporarily extend the bandwidth of the system during the transient event. The pole and zero are moved back immediately after the transient event to stabilize the system. However, system stability during the whole process must be considered carefully once the bandwidth is temporarily changed. Another commonly used method to improve the transient response is to increase the slew rate of the error amplifier.

Different current-boosting modules are used to increase the source/sink current at the output of the error amplifier during the transient period. The required boosting current must be large enough to realize an obvious improvement on the transient response, and at the same time, it needs to avoid the over-response oscillation caused by the excessive boosting current. Moreover, current-boosting modules introduce more power consumption, which degrades the overall efficiency.

Besides that, the use of nonlinear control is an alternative to realize fast transient response. For example, hysteresis control can offer immediate feedback during load variations. However, this method has a drawback—the high electromagnetic interference (EMI) noise due to its variable switching frequencies. Circuits can be added to lock the switching frequency and thus abate the EMI noise, but such additions bring more complexities into the circuit design.

Figure 1: Conventional current-mode buck converter with the proposed ACCF loop

Figure 2: Type-II compensation designs. (a) OpAmp–RC topology. (b) Gm–C topology


The conventional current-mode buck dc–dc converter is highlighted with the blue dashed line. Differing from the conventional type-III compensation network which contains three poles and two zeros to boost the phase to ensure stability, the proposed method generates the same number of poles and zeros to boost the mid-band gain to significantly enhance the response strength of the compensator, thus increasing the transient response. At the same time, the proposed scheme also helps to manage the output voltage ramping speed during the converter power-up.

Theoretically, the transient performance depends mainly on two factors: the response speed and the response strength. The former can be interpreted as the delay time from the load transient event to the change on the control signal, and the latter refers to the amount of the amplitude changes on the control signal. Taking the widely used type-II compensation design as an example here, the structure of the type-II compensation designs can be realized by either the operational amplifier–resistor–capacitor (OpAmp–RC) topology [as shown in Fig. 2(a)] or the transconductance–capacitor (Gm–C) topology [as shown in Fig. 2(b)]. During the load step, the variation in the output voltage results in a change in the feedback signal. The control signal Vc can only respond gradually, due to the compensator’s integrating effect. Its response speed depends on the bandwidth of the compensator. Assuming an error voltage Verr turns up on the feedback, the amount of change of the control signal (|Vc|) from its steady-state level can be expressed in the following equations for the OpAmp–RC and Gm–C topologies, respectively [1]:

where Verr stands for the error voltage that turns up on the feedback, gm0 is the transconductance of the gm0 cell, and Rfb and Rc are the resistors connected to the inverting input terminal of the OpAmp and the compensation resistor, respectively.



  • Difficult to achieve fast load transient performance.
  • Current boosting modules in existing system require more power, which degrades the overall efficiency.
  • High electromagnetic interference (EMI) noise.




This paper utilizes a current-mode controller to improve the response speed by increasing the system bandwidth, which enhances the load transient response. On top of that, a novel ACCF is connected in parallel with the compensator to boost the response strength at the mid-band (as shown in Fig. 3), which is lacking in conventional designs. The output impedances of both the gm0 cell and the ACCF are assumed to be infinite (which will be discussed in Section III). The transfer function of the conventional type-II compensation network can be expressed in (3), which includes two poles and one zero (located at p1, p2, and z1, respectively). With the additional ACCF, three poles and two zeros will be generated (located at p′1, p′2, p′3, z′1, and z′2, respectively), and the newly generated poles and zeros will boost the compensation mid-band gain, as derived in (4). The respective Bode plot can be found in Fig. 4.
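Since equations (3) and (4) are not reproduced in this text, the two-pole, one-zero behavior of the conventional network can only be sketched from a textbook model. The example below assumes an ideal transconductor gm0 with infinite output impedance driving Rc in series with Cc1, shunted by Cc2; this topology, the omission of the Ra/Rb divider, and the values of gm0, Rc, and Cc2 are all assumptions (only Cc1 = 80 pF is taken from the text). It is not the authors' expression and does not include the ACCF path.

```python
import math


def type2_gmc_response(f_hz: float, gm: float, r_c: float, c_c1: float, c_c2: float) -> complex:
    """Vc/Verr of an ideal Gm-C type-II network: gm into (Rc + 1/sCc1) in parallel with 1/sCc2."""
    s = 2j * math.pi * f_hz
    z_series = r_c + 1.0 / (s * c_c1)
    z_shunt = 1.0 / (s * c_c2)
    return gm * (z_series * z_shunt) / (z_series + z_shunt)


GM0 = 20e-6      # hypothetical transconductance, siemens
R_C = 500e3      # hypothetical compensation resistor, ohms
C_C1 = 80e-12    # compensation capacitor value quoted in the text, farads
C_C2 = 0.5e-12   # hypothetical second compensation capacitor, farads

f_zero = 1.0 / (2.0 * math.pi * R_C * C_C1)
f_pole = (C_C1 + C_C2) / (2.0 * math.pi * R_C * C_C1 * C_C2)
f_mid = math.sqrt(f_zero * f_pole)     # geometric center of the mid-band plateau

mid_band_gain = abs(type2_gmc_response(f_mid, GM0, R_C, C_C1, C_C2))
expected_gain = GM0 * R_C * C_C1 / (C_C1 + C_C2)
print(f"zero at {f_zero:.0f} Hz, second pole at {f_pole:.0f} Hz, mid-band gain {mid_band_gain:.2f}")
```

In this model the mid-band gain is close to gm0 · Rc when Cc2 is much smaller than Cc1, which is the plateau that the text says the ACCF is meant to raise.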

Without ACCF:

Figure 3: Proposed compensation network


where i1 and i2 are the output currents from the gm0 cell and the ACCF, respectively, α and α′ are constant coefficients, Ra and Rb are the voltage divider series resistors connected at the output of the dc–dc converter, and Cc1, Cc2, and Rc are the compensation capacitors and resistor, as shown in Fig. 3. Furthermore, the ACCF circuit also defines the output voltage rising slope during the converter power-up. The working principle of the soft-start function is illustrated in Fig. 5. During start-up, Vout is much lower than Vref, so the gm0 cell is saturated with an extremely high output voltage Vc, which would program an unfavorable runaway inductor current. With the help of the ACCF circuit, the fast-rising voltage appearing at Vout caused by the in-rush start-up current is coupled through the ACCF and induces an ac current from Vc into the ACCF itself. As a result, Vc is pulled down, which defines a slow increase in the inductor current and output voltage. This inherent function of the proposed ACCF eliminates the need for the dedicated soft-start circuit used in conventional designs.

Figure 4: Illustration of the Bode magnitude plot of the buck converter with ACCF versus without ACCF.


Figure 5: Start-up response of the proposed controller.





  1. Active Compensation Capacitor

An active capacitor was proposed in [21] and used in an amplifier to multiply capacitance. In this paper, the large passive on-chip compensation capacitor Cc1 is replaced by an equivalent active capacitor in the proposed buck converter to reduce the footprint, as shown in Fig. 6. The schematic of the active capacitor is also modified and used in realizing the ACCF, which will be discussed later. Equations (5)–(7) explain the derivation of the equivalent active capacitor from the passive capacitor mathematically. The equivalent circuit of (7) is modeled in Fig. 6(b) as an active capacitor. A current mirror is used as the current-controlled current source to amplify the ac current in this work. Fig. 7 shows the circuit implementation of the active capacitor using the current mirror. A passive capacitor that is (N + 1) times smaller is placed in parallel with the current mirror, which has an amplification factor of N (N = 19 in this paper), to realize the same capacitance as Cc1 (Cc1 = 80 pF in this paper). Fig. 8 shows an approximately sevenfold reduction in the silicon area with the use of the equivalent active capacitor. The higher the value of N, the smaller the footprint of the active capacitor.

Figure 6: Schematics of the equivalent capacitors. (a) Passive capacitor.(b) Active capacitor.

Figure 7: Circuit implementation of the active capacitor.


Nevertheless, a higher N comes with a higher quiescent current, which increases the static power consumption. As a result, N needs to be set to an optimized value.
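The capacitor scaling described above is a one-line relation: a current mirror with gain N makes a passive capacitor C look like (N + 1) · C. A small sketch with the values from the text (N = 19, Cc1 = 80 pF); the function name is hypothetical.

```python
def passive_cap_for_active(c_equivalent: float, mirror_gain: int) -> float:
    """Passive capacitance needed so that (mirror_gain + 1) * C equals the target value."""
    return c_equivalent / (mirror_gain + 1)


c_passive = passive_cap_for_active(c_equivalent=80e-12, mirror_gain=19)
print(f"passive capacitor: {c_passive * 1e12:.1f} pF")
```

With these numbers the physical capacitor shrinks to 4 pF. The roughly sevenfold area reduction reported in the text is smaller than the factor N + 1 = 20, presumably because the current mirror itself occupies silicon area.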




  1. Multiloop Control for Transient Enhancement and Soft-Start

The main structure of the ACCF circuit is modified from the active capacitor circuit, where an ordinary on-chip capacitor Cf1 is ac coupled from Vout, amplified by the current mirror (transistors M2 and M1 with the ratio M:1, M > 1), and connected back to the main control loop at the gm0 cell output node, as shown in Fig. 9. The circuit implementation of the gm0 cell is shown in Fig. 10. Resistor Rgm is inserted between the source terminals of MP1 and MP2 to sense the voltage difference (V) between the input terminals (V+ and V−) of the two super-source-followers; hence, gm0 of the cell is inversely proportional to the value of Rgm. The output current is generated by the two pairs of current mirrors formed by MP3, MP4, MN3, and MN4 at the output stage, and it charges the compensation network at the output of the gm0 cell. The mid-band gain (Amid-band) of the gm0 cell is, therefore, proportional to Rc and inversely proportional to Rgm, as expressed in (8). As a result, the process, voltage, and temperature variations of the two resistors are effectively canceled. Nevertheless, Rc should be placed close to Rgm in the layout for better matching.

where Rc is the resistor in the compensation network at the output of the gm0 cell, and Rgm is the resistor in between the sources of MP1 and MP2.

Figure 8: Proposed current-mode buck converter.


Figure 9: Schematic of the gm0 cell.



The circuit implementation of the proposed ACCF realizes fast transient response as well as soft-start, which are analyzed in three aspects in the following.

1) Soft-Start Analysis: With the help of the ACCF, the fast-rising voltage appearing at Vout caused by the start-up in-rush current is coupled through Cf1. Then, an ac current is induced into the transistor M1. The M2-to-M1 current mirror amplifies the induced current and pulls it from Vc; therefore, Vc is adjusted to a lower level. Consequently, Vc, the inductor current, and Vout are well managed by the current control loop. The controllable soft-start slope is expressed in the following equation:

where Igm0_max is the maximum output current of the gm0 cell, M is the current ratio of transistors M2 to M1, and Cf1 is the capacitor in the proposed ACCF.

2) Steady-State Analysis: The ACCF path is virtually disconnected from Vout during the steady state. The low-pass filter Rf2Cf2 is added to prevent the Vout ripples from passing through the ACCF path and disturbing Vc. It is designed to cut off around the converter’s switching frequency. Although there is no current flowing between the gm0 cell and the ACCF circuit, the output impedance of the gm0 cell is reduced due to the ACCF output stage. An advanced current mirror with high output impedance should be considered if a large dc loop gain is preferred.

3) Transient Analysis: As discussed previously, the load transient performance is improved on account of the gain-boosting effect around the compensator’s mid-band. This can be understood analytically by deriving the transfer function from the small-signal equivalent circuit, as shown in Fig. 11(a). The transconductances of the transistors M1 and M2 are labeled gm1 and gm2 (gm2 = Mgm1), and RO represents the combined output impedances of the gm0 cell and the ACCF. The original single-loop compensation capacitor Cc1 still determines the dominant pole in this multiloop network.

  1. Continuous-Sensing-Technique for Fast-Response Current Sensor

The current sensor has been used to sense the inductor current in the current-mode dc–dc converter. The conventional design is shown, where a sense FET (SenFET) is implemented to sense the current passing through the high side power FET MP at the ratio of k to 1 (k ≫ 1). MP and Ms1 are switched ON simultaneously when the gate control signal Q is low. Meanwhile, the current sensor works in the “active sensing mode” and the SenFET keeps tracking the current passing through MP using an almost equalized drain-to-source voltage level. In other words, Va is adaptively adjusted to Vb through the feedback of the OpAmp. The outputs of the current sensor are isenc and Vsenc, which represent the sensed current and voltage with the required ratios, respectively. Alternatively, Vb will be at the same voltage level as Vin in the case where Ms2 is switched ON by the high Q signal (Q¯ connected at the gate of Ms2 represents the complement signal of Q). In this situation, Vb equals Vin as well as Va, and the output sensed current isenc is negligible. The OpAmp is in its “sleeping mode.” The current sensor will only change back to the “active sensing mode” once Q switches from high to low in the next switching phase, which means that the OpAmp is required to “wake up” instantly in order to accurately sense the current passing through MP by tracking the changes at node Vb. However, the reaction time that the OpAmp needs to adjust itself from one set of dc operating points to the other must be considered. It causes delays (tdelay) at the beginning of every sensing stage, as illustrated in Fig. 13, where the blue dashed line represents the MP current sensed by the conventional current sensor and the red solid line represents the ideally sensed inductor current. Obviously, the problem will become more severe when the on-time of MP is short.

To solve the stated problem, a continuous-sensing technique is proposed to sense the current passing through MP without any delay (tdelay). As shown in Fig. 14, the proposed current sensor consists of two sensing stages, namely, the high side current sensing stage (which senses the power PMOS MP drain current when Q is low) and the low side current sensing stage (which senses the power NMOS MN drain current when Q is high). The low side sensed MN current (isenN2) is connected to the high side current sensing stage through MH and Mps2. With that, the high side current sensing stage senses the MP drain current when Q is low. When Q is high, Mps1 is switched OFF, Mps2 is turned ON, and the current introduced by the low side current sensor (isenN2) is connected to the high side current sensing stage. With this additional current source, the high side current sensing stage remains continuously active and senses the current isenN2 by tracking Vpa to Vpb. When Q changes to low, Mps2 is switched OFF while Mps1 and MP are switched ON, and the power MOSFET sensing mode is active again. There is no delay between the two switching phases because the two input nodes of the OpAmp (Vpa and Vpb) are continuously well controlled, as are the dc operating points. Since no “wake up” time is needed, there is no delay for the high side current sensing stage. The delay at the low side current sensing stage does not affect the sensing accuracy in peak current-mode control. The simulation results can be found in Fig. 15. The output voltage/current (Vsenp/Isenp) of the current sensor is shown to have negligible delays.

Figure 10: Proposed continuous-sensing current sensor.



  • Fast load transient performance has been achieved.
  • Low power consumption.
  • Low electromagnetic interference (EMI) noise.

Literature Survey:

  • M. Zhou, Z. Sun, Q. Low, and L. Siek, “Fast transient response DC–DC converter with start-up in-rush current control,” Electron. Lett., vol. 52, no. 22, pp. 1883–1885, Oct. 2016. A new alternative to achieve fast transient response while inherently managing the in-rush current during DC/DC switching converter start-up is proposed. An AC coupled feedback (ACCF) is introduced using a capacitor multiplier from the output of the converter to the output of the error amplifier. With this additional feedback, the transient response, which used to be limited by the compensator mid-band gain, has been significantly improved. Meanwhile, the ACCF circuit can help to control the converter output ramping speed during power-up, thus eliminating the bulky soft-start circuit. The simplified circuit design means the new controller can be realised by a tiny on-chip circuit, thereby minimising the footprint and the cost. A buck converter with the proposed technique is designed using a 0.18 μm CMOS process and simulated across different process corners. Simulation shows that the output voltage increases linearly with the designed slope despite variations of the input voltage, inductor, or output capacitor values. An excellent load transient response of 0.021 mV/mA is achieved for a load current variation from 50 μA to 1 A in 1 μs.
  • -Y. Hsieh and K.-H. Chen, “Adaptive pole-zero position (APZP) technique of regulated power supply for improving SNR,” IEEE Trans. Power Electron., vol. 23, no. 6, pp. 2949–2963, Nov. 2008. This paper proposes an adaptive pole-zero position (APZP) technique to achieve excellent transient response of dc–dc converters. The APZP technique triggers the two-step nonlinear control mechanism to speed up the transient response at the beginning of load variations. Before the output voltage is regulated back to its voltage level, the APZP technique merely functions as a linear control method to regulate output voltage in order to ensure the stability of the system. Fast transient response time, low output ripples, and stable transient operation are achieved at the same time by the proposed APZP technique. Experimental results in the UMC 0.18-μm process show that the transient undershoot/overshoot voltage and the recovery time do not exceed 48 mV and 10 μs, respectively. Compared with conventional design without any fast transient technique, the performances of overshoot voltage and recovery time are enhanced by 37.2% and 77.8%. With the APZP technique, the performance of dc–dc converters is improved significantly.
  • -H. Lee, S.-C. Huang, S.-W. Wang, and K.-H. Chen, “Fast transient (FT) technique with adaptive phase margin (APM) for current mode DC-DC buck converters,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 10, pp. 1781–1793, Oct. 2012. This paper proposes a fast transient (FT) control with the adaptive phase margin (APM) to achieve good transient response in current-mode DC-DC buck converters at different load conditions. The overshoot/undershoot voltage and the transient recovery time are effectively reduced. The APM control can always maintain the system phase margin at an adequate value under different load conditions. That is, the compensation pole-zero pair is adapted to load current to extend the system bandwidth and get an adequate phase margin. Experimental results show the overshoot/undershoot voltage is smaller than 60 mV (3%) and transient period is smaller than 12 μs as load current suddenly changes from 100 to 500 mA, or vice versa. Compared with conventional designs without any fast transient technique, the undershoot voltage and recovery time are enhanced by 45% and 85%, respectively.
  • F. Lee and P. K. T. Mok, “A monolithic current-mode CMOS DC-DC converter with on-chip current-sensing technique,” IEEE J. Solid-State Circuits, vol. 39, no. 1, pp. 3–14, Jan. 2004. A monolithic current-mode CMOS DC-DC converter with integrated power switches and a novel on-chip current sensor for feedback control is presented in this paper. With the proposed accurate on-chip current sensor, the sensed inductor current, combined with the internal ramp signal, can be used for current-mode DC-DC converter feedback control. In addition, no external components and no extra I/O pins are needed for the current-mode controller. The DC-DC converter has been fabricated with a standard 0.6-μm CMOS process. The measured absolute error between the sensed signal and the inductor current is less than 4%. Experimental results show that this converter with on-chip current sensor can operate from 300 kHz to 1 MHz with supply voltage from 3 to 5.2 V, which is suitable for single-cell lithium-ion battery supply applications. The output ripple voltage is about 20 mV with a 10-μF off-chip capacitor and 4.7-μH off-chip inductor. The power efficiency is over 80% for load current from 50 to 450 mA.
  • Y. Leung, P. K. T. Mok, K. N. Leung, and M. Chan, “An integrated CMOS current-sensing circuit for low-voltage current-mode buck regulator,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 52, no. 7, pp. 394–397, Jul. 2005. An integrated current-sensing circuit for low-voltage buck regulator is presented. The minimum achievable supply voltage of the proposed current-sensing circuit is 1.2 V implemented in a CMOS technology with VTH = 0.85 V, and the current-sensing accuracy is higher than 94%. With the developed current-sensing circuit, a buck regulator, which is able to operate at a 1.2-V supply, is implemented. A maximum output current of 120 mA and power-conversion efficiency higher than 89% are achieved.
  • -I. Wu, B.-T. Hwang, and C. C.-P. Chen, “Synchronous double pumping technique for integrated current-mode PWM DC–DC converters demand on fast-transient response,” IEEE Trans. Power Electron., vol. 32, no. 1, pp. 849–865, Jan. 2017. While the fast transient techniques have been extensively investigated, the research aiming at current-mode pulse width modulation (PWM) converters is relatively unexplored. This paper presents a synchronous double-pumping (SDP) technique for current-mode PWM dc-dc converters to achieve fast-transient response between different load conditions. The advantages and limitations of the existing conventional techniques are discussed and analyzed. With the proposed SDP technique, a nearly optimized recovery time speedup and voltage drop minimization can be obtained for different conventional current-mode converters. The prototype chip was fabricated using a TSMC 0.35-μm CMOS process and occupies an area of 2.242 mm², including all bonding pads and ESD protection circuits. The output voltage ripple is measured at about 15 mV peak-to-peak. The recovery time is 2.4 and 2.6 μs, respectively, in response to the 400-mA step-up and step-down load changes. Those are improved by a factor of 8.33 and 8.23, respectively.
  • -H. Lee, K.-Y. Chu, C.-J. Shih, and K.-H. Chen, “Proportional compensated buck converter with a differential-in differential-out (DIDO) error amplifier and load regulation enhancement (LRE) mechanism,” IEEE Trans. Power Electron., vol. 27, no. 5, pp. 2426–2436, May 2012. A differential-in differential-out error amplifier and a load regulation enhancement mechanism are proposed in the buck converter that aims to improve load regulation and noise immunity. By using the proportional compensator in the proposed converter, there is no need of external compensation components in this design. As a result, a compact-size and high-performance dc-dc buck converter can be guaranteed. Experimental results show that load regulation can be improved from 0.5 to 0.025 mV/mA. The test chip was fabricated by 0.25 μm CMOS process and occupied 1.65 mm active silicon area.
  • -J. Liu, T.-H. Chen, and S.-R. Hsu, “Area-efficient error amplifier with current-boosting module for fast-transient buck converters,” IET Power Electron., vol. 9, no. 10, pp. 2147–2153, Aug. 2016. The current boosting module (CBM) is proposed to be implemented in an error amplifier (EA) for improving the load transient of a dc-dc converter. To enhance the slew rate of the EA during transient, the CBM is adopted to raise/reduce the gate voltages of the output-stage transistors of the EA for enhancing its current driving capability with simple circuitry. Moreover, to save power consumption in steady state, the accelerative mechanism of the EA is turned off. A buck converter with CBM is implemented with 0.35 μm 2P4M complementary metal-oxide-semiconductor process. The experimental results demonstrate that the recovery time and transient ripple of the buck converter are improved by over ten times and two times, respectively, compared with those of the buck converter without CBM, for a 450 mA load current change. The maximum power conversion efficiency is 92.8% when the input and output voltages are 4 and 2.5 V, respectively. Hence, the buck converter with CBM can concurrently fulfil the functionalities of fixed switching frequency and fast load transient.


[1] M. Zhou, Z. Sun, Q. Low, and L. Siek, “Fast transient response DC–DC converter with start-up in-rush current control,” Electron. Lett., vol. 52, no. 22, pp. 1883–1885, Oct. 2016.

Many-Objective Sizing Optimization of a Class-C/D VCO for Ultralow-Power IoT and Ultralow-Phase-Noise Cellular Applications




In this paper, the performance boundaries and corresponding tradeoffs of a complex dual-mode class-C/D voltage-controlled oscillator (VCO) are extended using a framework for the automatic sizing of radio frequency integrated circuit blocks, where an all-inclusive test bench formulation enhanced with an additional measurement processing system enables the optimization of “everything at once” toward its true optimal tradeoffs. VCOs embedded in the state-of-the-art multistandard transceivers must comply with extremely high performance and ultralow power requirements for modern cellular and Internet of Things applications. However, the proper analysis of the design tradeoffs is tedious and impractical, as a large number of conflicting performance figures obtained from multiple modes, test benches, and/or analyses must be considered simultaneously. Here, the dual-mode design and optimization provided 287 design solutions with figures of merit above 192 dBc/Hz, where the power consumption varies from 0.134 to 1.333 mW, the phase noise at 10 MHz from −133.89 to −142.51 dBc/Hz, and the frequency pushing from 2 to 500 MHz/V, in the worst case of the tuning range. These results pushed this circuit design to its performance limits on a 65-nm CMOS technology, reducing the power consumption of the original design by 49% while also showing its potential for ultralow power with more than 93% reduction. In addition, worst case corner criteria were also applied on top of the worst case tuning range optimization, taking the problem to a human-untreatable 66-D performance space.



Voltage-controlled oscillators (VCOs) play a key role in modern radio frequency (RF) integrated circuit (IC) multistandard transceivers and, therefore, are subject to continuous research efforts that push the boundaries of their multifaceted performance/power efficiency in the state-of-the-art applications and integration technologies. Usually, different wireless systems have various requirements for the VCO performance. For Internet of Things (IoT) applications, the VCO should maintain a low power consumption, while the phase noise performance can be quite relaxed, e.g., −102 dBc/Hz at 2.5-MHz offset for the Bluetooth low-energy receiver at 2.4-GHz carrier frequency. On the other hand, the cellular applications require very stringent phase noise performance, e.g., −162 dBc/Hz at 20-MHz offset at 900-MHz carrier frequency for the Global System for Mobile transmitter (TX) and −160 dBc/Hz at 30-MHz offset at ∼2-GHz carrier frequency for the long-term evolution/wideband code division multiple access TX.

The design of VCOs is usually time-consuming, even after a particular architecture has been selected. In addition to the phase noise and power consumption, other specifications such as the frequency tuning range and frequency pushing due to the supply voltage variation also need to be carefully considered in a practical design. According to the time-variant phase noise model, for a typical voltage-biased VCO employing cross-coupled nMOS transistors (Fig. 1) oscillating at ω0, its phase noise at offset frequency ω can be expressed as

where Q is the tank quality factor, VP is the differential output amplitude, C is the total tank capacitance, ΓT,rms is the rms value of the impulse sensitivity function of the parallel resistance, representing the conversion from tank thermal noise to phase noise, and F is the noise factor defined by the ratio between the total phase noise and the phase noise induced by the tank loss. To meet the phase noise requirement at a certain frequency, a proper tank capacitance C and inductance L need to be chosen. However, it is difficult to obtain accurate values for C and L using (1), since the value of F depends on the working mode of M1/M2, which is related to the gate-biasing voltage and transistor sizes.

Figure 1: Dual-mode class-C/D VCO schematic with SCA for an increased

If the noise contributions of the transistor channel conductance (GDS) are taken into consideration, the situation becomes even more complex. Furthermore, for each iteration, when the L value is changed, the switched-capacitor array (SCA) and varactors also need to be redesigned to meet the frequency-tuning range requirement, which changes the tank Q and, in turn, affects the phase noise performance. In a practical design, even more iterations are required to guarantee satisfactory VCO performance in the presence of process, voltage, and temperature variations. Recent works also reveal that the phase noise and frequency pushing can be improved by utilizing the common-mode resonance at the double-oscillation frequency, which requires extra design effort to balance the differential-mode and common-mode tank inductances and capacitances. Thus, numerous nonsystematic iterations are inevitable to attain high-quality designs.


  • Higher phase noise
  • Higher power consumption
  • More iterations are required to guarantee performance under voltage and temperature variations.



To overcome the difficulties found in the manual sizing of RF IC blocks, different optimization-based sizing approaches were developed. These EDA tools use algorithms that efficiently explore the design space, instead of iterating over designer-defined analytical equations. They can be applied over performance models that capture several circuit and inductor characteristics of RF circuits and, particularly, VCOs. However, the use of foundry-provided device models and a circuit simulator as an evaluation engine, i.e., simulation-based sizing, has proved to be the most accurate and widely adopted approach for RF, despite its increased computational effort. There are several commercially available solutions, e.g., Cadence’s Virtuoso GXL or MunEDA’s DNO/GNO, that also follow the simulation-based architecture; while useful, most of these tools still take a limiting single-objective approach and are used mostly to adjust the manual sizing in a semiautomated manner. Therefore, these simulation-based methodologies are continuously subject to research efforts to cope with the most recent design challenges.

Developed methodologies are usually applied to simpler VCO topologies with a small number of design variables and considering only a small set of performance figures. To exemplify, in one work, a cross-coupled double-differential VCO was optimized for a 4-D performance space (oscillation frequency fosc, phase noise, power, and oscillation amplitude OscAmp). The optimization was done on a 7-D design variable space. In another work, the VCO was optimized for a 9-D performance space (frequency-tuning range, phase noises, power, OscAmp, and area). In this case, a 9-D design variable space was considered. In other works, the performance and design variable spaces are similar, and hardcoded formulas are used to compute other metrics, e.g., figure of merit (FOM). Consequently, when faced with a complex real-world VCO design, designers in both academic and industrial environments end up using EDA tools to solve only subproblems of the manual design, i.e., they change only a subset of the design variables x to tackle local optimization (LO) targets, as illustrated in Fig. 2(b). This mixed iterative/sequential optimization design approach leads to suboptimal solutions, as the tradeoffs between conflicting performance figures are not properly weighted. Therefore, this approach does not fit modern VCO applications, where more complex topologies and a wider set of requirements must be balanced simultaneously, e.g., multimode operation, digitally controlled frequency-tuning ranges, or limited frequency pushing due to supply voltage variation.
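As an illustration of the kind of hardcoded metric mentioned above, the widely used oscillator figure of merit can be computed from phase noise, oscillation frequency, offset frequency, and power. The sketch below uses the common textbook definition with the sign chosen so that larger is better, matching the "above 192 dBc/Hz" convention used in this text. The function name is hypothetical, and whether the paper uses exactly this expression is an assumption.

```python
import math


def vco_fom(phase_noise_dbc_hz: float, f_osc_hz: float,
            f_offset_hz: float, power_w: float) -> float:
    """Oscillator figure of merit in dBc/Hz (positive, larger is better).

    FOM = -PN(df) + 20*log10(f_osc / df) - 10*log10(P_dc / 1 mW)
    """
    if min(f_osc_hz, f_offset_hz, power_w) <= 0:
        raise ValueError("frequencies and power must be positive")
    return (-phase_noise_dbc_hz
            + 20.0 * math.log10(f_osc_hz / f_offset_hz)
            - 10.0 * math.log10(power_w / 1e-3))


# Hypothetical operating point: the 4 GHz carrier is an assumed value,
# used only to exercise the formula.
fom = vco_fom(phase_noise_dbc_hz=-140.0, f_osc_hz=4e9,
              f_offset_hz=10e6, power_w=1e-3)
print(f"FOM = {fom:.2f} dBc/Hz")
```

Because each term trades against the others, a single FOM value hides the individual tradeoffs, which is one reason the text argues for optimizing all performance figures simultaneously.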

This paper applies and adapts an EDA framework to bypass the difficulties faced on the sizing of complex RF IC blocks and, particularly, a dual-mode class-C/D VCO. The major contributions of this paper can be summarized as follows.

1) Adoption of an EDA framework to fully optimize a complex class-C/D VCO for the state-of-the-art IoT and cellular specifications.

2) Study and discussion of the possibility to meet extreme operational requirements in a single optimization run with the same framework setup, by analyzing the complete tradeoffs between power consumption, phase noise, and frequency pushing, obtained with a many-objective optimization. Such a study is impossible to perform using commercially available solutions.

3) Unlike previous research works in VCO sizing optimization, here, the circuit’s performance space greatly surpasses what can be found in EDA solutions in the literature. Two human-untreatable 18-D and 66-D performance spaces are defined over two different modes, i.e., worst case mode in typical conditions and worst case mode in worst case corner (WCC) conditions, respectively, for the same 28-D design variable space that affects the sizing of 43 devices (RF and digital components).

Figure 2 : (a) Knowledge-based manual design. (b) Mixed iterative/sequential optimization design approach. (c) Adopted optimize “everything-at-once” approach, where x is the design variables’ array

4) The adopted automatic design methodology is built over the established all-inclusive test bench formulation for optimization-based RF IC sizing but enhanced with parsers for the native output formats of most widely used off-the-shelf simulators and a comprehensive set of postprocessing options. As such, the proposed formulation enables the optimize “everything-at-once” approach of Fig. 2(c), leading to a more systematic design flow that reduces the risk of bad design decisions while balancing all the design challenges simultaneously.

Preliminaries:

This section reviews important concepts for analog and RF IC automation, i.e., the optimization-based sizing and all-inclusive test bench formulation.


Figure 3: Architecture of the multitest bench RF IC sizing optimization.

  1. Optimization-Based RF IC Sizing

In the traditional optimization-based sizing, the kernel is responsible for proposing P different sizing solutions for circuit simulation, each one with a new set of x design variables (e.g., devices’ widths, lengths, and number of fingers), and is set to solve the constrained many-objective problem

$$
\begin{aligned}
\text{find } x \text{ that minimizes} \quad & f_m(x), & m &= 1, 2, \ldots, M \\
\text{subject to} \quad & g_j(x) \ge 0, & j &= 1, 2, \ldots, J \\
& x_i^{L} \le x_i \le x_i^{U}, & i &= 1, 2, \ldots, N
\end{aligned}
\tag{2}
$$

where x is the vector of N design variables, g(x) are the J constraint functions, and the output is a Pareto-optimal front (POF) representing the tradeoffs between the M objective functions f(x). In this problem, the number of design variables defines the search space order, while the variable ranges (minimum, maximum, and step values) define the size of the search space.
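The structure of problem (2) can be sketched in a few lines: filter candidates by bounds and constraints, evaluate the objectives, and keep the non-dominated points. This is a minimal brute-force illustration, not the optimization kernel used in the paper. All function names and the toy objectives are hypothetical, and a real flow would obtain the objective values from circuit simulation.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Func = Callable[[Vector], float]


def is_feasible(x: Vector, constraints: List[Func],
                lower: Vector, upper: Vector) -> bool:
    """Check x_i^L <= x_i <= x_i^U for all i and g_j(x) >= 0 for all j."""
    in_bounds = all(lo <= xi <= hi for xi, lo, hi in zip(x, lower, upper))
    return in_bounds and all(g(x) >= 0 for g in constraints)


def dominates(fa: Sequence[float], fb: Sequence[float]) -> bool:
    """fa dominates fb (minimization): no worse everywhere, better somewhere."""
    return (all(a <= b for a, b in zip(fa, fb))
            and any(a < b for a, b in zip(fa, fb)))


def pareto_front(candidates: List[Vector], objectives: List[Func],
                 constraints: List[Func], lower: Vector, upper: Vector
                 ) -> List[Tuple[Vector, Tuple[float, ...]]]:
    """Return the feasible, non-dominated candidates with their objective values."""
    scored = [(x, tuple(f(x) for f in objectives))
              for x in candidates
              if is_feasible(x, constraints, lower, upper)]
    return [(x, fx) for x, fx in scored
            if not any(dominates(fy, fx) for _, fy in scored)]


# Toy single-variable problem with two conflicting objectives.
candidates = [(1.0,), (2.0,), (3.0,), (3.5,), (5.0,)]
objectives = [lambda x: x[0],                 # stand-in for "power"
              lambda x: (x[0] - 3.0) ** 2]    # stand-in for "noise"
constraints = [lambda x: 4.0 - x[0]]          # g(x) >= 0 means x <= 4
front = pareto_front(candidates, objectives, constraints,
                     lower=(0.5,), upper=(10.0,))
for x, fx in front:
    print(x, fx)
```

In this toy case the candidate 5.0 is rejected by the constraint and 3.5 is dominated by 3.0, so three points remain on the front. The pairwise comparison grows quadratically with the population size, which is acceptable for a sketch but not for large populations.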



  • Lower phase noise
  • Lower power consumption
  • Fewer iterations are required to guarantee performance under voltage and temperature variations.


[1] A. Mazzanti and P. Andreani, “Class-C harmonic CMOS VCOs, with a general result on phase noise,” IEEE J. Solid-State Circuits, vol. 43, no. 12, pp. 2716–2729, Dec. 2008.

Low-Power Near-Threshold 10T SRAM Bit Cells With Enhanced Data-Independent Read Port Leakage for Array Augmentation in 32-nm CMOS




The conventional six-transistor static random access memory (SRAM) cell allows high density and fast differential sensing but suffers from half-select and read-disturb issues. Although the conventional eight-transistor SRAM cell solves the read-disturb issue, it still suffers from low array efficiency due to deterioration of read bit-line (RBL) swing and Ion/Ioff ratio with increase in the number of cells per column. Previous approaches to solve these issues have been afflicted by low performance, data-dependent leakage, large area, and high energy per access. Therefore, in this paper, we present three iterations of SRAM bit cells with nMOS-only based read ports aimed to greatly reduce data-dependent read port leakage to enable 1k cells/RBL, improve read performance, and reduce area and power over conventional and 10T cell-based works. We compare the proposed work with other works by recording metrics from the simulation of a 128-kb SRAM constructed with divided-wordline-decoding architecture and a 32-bit word size. Apart from large improvements observed over conventional cells, up to 100-mV improvement in read-access performance, up to 19.8% saving in energy per access, and up to 19.5% saving in the area are also observed over other 10T cells, thereby enlarging the design and application gamut for memory designers in low-power sensors and battery-enabled devices.



Static random access memory (SRAM) occupies a significant portion of a system-on-a-chip (SoC) and has a notable contribution to the total power consumption and area of the SoC. Since area is an important factor when designing circuits, memory design engineers aim to place as many cells as possible per column to allow sharing of peripheral circuitry. The conventional 6T and 8T cells are greatly limited by their inability to work in longer columns. This is because they suffer from data-dependent leakage and degraded ION/IOFF ratio and read bit-line swing as more cells are placed on a single column. Therefore, there is a need to design new circuits to address this issue. Previous approaches have tried to solve this issue by improving the ION/IOFF ratio to enable up to 1k cells per column. Although these approaches have been successful at this task, they still suffer from large area or varying data-dependent performance. Some also fail to account for the minimum energy point in SRAMs and therefore consume a lot of energy per access at ultra-low voltages. This work describes three iterations of SRAM bit cells with nMOS-only based read ports aimed to greatly reduce data-dependent read port leakage to enable 1k cells per RBL, improve read performance, and reduce area and power over conventional 6T and 8T cells and other novel read-port based cells. With a unique topology in each of the three cells’ read ports, we obtain improved read access performance, low energy per access, and low area, respectively, thereby enlarging the design and application gamut for memory designers in low-power sensors and battery-enabled devices.

SRAM’s impact has become especially important due to the emergence of battery-powered portable devices and low-power sensor applications. Most SRAM design effort has been directed at facilitating voltage scaling and improving yield. The conventionally implemented six-transistor (6T) cell in SRAMs allows high density, bit-interleaving, and fast differential sensing but suffers from half-select stability, read-disturb stability, and conflicting read and write sizing. Previous attempts to solve these issues have included the implementation of assist techniques, novel cell design, architectural improvements, or technological developments.

Half-select and read-disturb issues in SRAMs can be mitigated by optimization of word-line voltage level. This includes word-line under-drive assists using process corner tracking or using replica access transistors. Delayed word-line boost to match the internal voltage of half-selected cells to that of the bit-line during a read operation helps to improve their stability but requires fine tuning to establish the sensitive tradeoff between read stability and write ability. Cell supply boost assist can also be used to improve half-select stability by increasing the drive strength of pull down nMOS.

Negative cell ground implementation to improve read stability is the most effective assist but has high energy cost due to use of multiple GND rails. Disturb issues can also be mitigated by partial precharge of bit-lines to decrease the strength of access transistors. Pilo et al. make use of regulators to reduce the precharge voltage level of the bit-lines to around 70% of supply voltage to improve the read stability. Alternatively, the bit-lines can be precharged using an nMOS instead of a pMOS to obtain a single VTH drop on the bit-lines. A process variation tolerant selective precharge assist has also been used to decrease bit-line voltage level using charge sharing to improve half-select disturb issues. However, such partial bit-line precharge techniques reduce read ability and become less effective at lower voltages due to reduced VDS of the access transistors. Multiple supply line assist can also be used to improve read and write half-select stability issues in SRAMs. A column-based dynamic supply technique has also been proposed. By implementing different supply voltages for read, write, and standby modes, it relieved half-select stability issues and allowed bit-interleaving. However, this resulted in an increase in dynamic power, design and routing effort, and area due to the generation of multiple supply voltages.

Although assist techniques can be beneficial in improving the performance and yield of SRAMs, they can often have a deteriorating complementary effect on write and read operations. They can also incur large area overhead, increase the energy per access, and have a limited and saturating effect on yield. Furthermore, since read and write stability is greatly dependent on temperature variations, an SRAM can either be write-limited at lower temperatures or read-limited at higher temperatures. Therefore, assists often require process and temperature tracking for effective yield improvement.

Apart from assist techniques, improvements on the architectural front have also been made to address half-select and read-disturb stability issues. These include cross-point selection of words using both row and column word-lines to improve half-select stability. Shorter bit-lines can also be used to improve read stability. These work by reducing bit-line capacitance, thereby improving dynamic read margin. However, this comes at the expense of large area overhead due to the greater number of cell banks. In another work, an array architecture with an area overhead of 12% was implemented in order to address the half-select disturb issue by decoupling the large bit-line capacitance from half-selected cells. A read-and-write-back scheme has also been used to alleviate the write-disturb in half-selected cells. It allows data retention by writing back the stored data after each read. However, such techniques increase the dynamic power consumption since every column is subjected to full voltage swings. Additionally, the sense amplifier cannot be shared amongst several columns and has to be integrated in each column, thereby incurring a large area overhead.

Figure 1 :  Schematic of (a) 6T (b) 8T SRAM cell.

With the 6T SRAM cell being afflicted by various stability issues, the 8T SRAM cell has been proposed (shown in Fig. 1). It has a decoupled read path comprising two nMOS transistors. Although it eliminates the read-disturb issue, it is still afflicted by a pseudo-read during a write operation in half-selected cells on the same row. As such, the issue of loss of bit-interleaving capability arises. Bit-interleaving is essential to low-voltage SRAM operation since it is combined with Error-Correction Code (ECC) to combat soft errors and achieve required yield targets. Soft errors, including Single Bit Upsets (SBUs) and Multiple Cell Upsets (MCUs), are caused by bombardment of alpha-particles, thermal neutrons, or high-energy cosmic rays. The rate of soft errors increases by 18% for every 10% decrease in supply voltage. This is especially problematic for low-voltage SRAMs, since in the sub-threshold operation region, the critical charge in nodes is significantly reduced, leading to frequent MCUs. MCUs have been mitigated by combining a bit-interleaving structure with ECC. In addition, bit-interleaving capable cell structures such as the column-decoupled 8T cell, the disturb-free 9T cell, the two-port disturb-free 9T cell, the multi-port 9T cell, and the differential 10T cell enable bit-interleaving and remove half-select disturb issues by using both row and column word-lines. For cell structures without interleaving capability, such as the single-ended 8T cell, additional parity or ECC bits can be interleaved per word for soft error correction.
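The quoted sensitivity of the soft error rate to supply voltage can be turned into a rough estimate. The sketch below assumes the 18% increase compounds geometrically with each successive 10% reduction in supply voltage. That compounding model, the function name, and the example voltages are assumptions made for illustration and are not stated in the source.

```python
import math


def soft_error_multiplier(v_nominal: float, v_scaled: float,
                          increase_per_step: float = 0.18,
                          step: float = 0.10) -> float:
    """Relative soft error rate at v_scaled compared with v_nominal.

    Assumes each successive `step` (10%) reduction in supply voltage
    multiplies the soft error rate by (1 + increase_per_step).
    """
    if not 0 < v_scaled <= v_nominal:
        raise ValueError("require 0 < v_scaled <= v_nominal")
    # Number of successive 10% reductions needed to go from v_nominal to v_scaled.
    n_steps = math.log(v_scaled / v_nominal) / math.log(1.0 - step)
    return (1.0 + increase_per_step) ** n_steps


# Hypothetical scaling from a 0.9 V nominal supply down to near-threshold values.
for v in (0.9, 0.7, 0.5, 0.4):
    print(f"VDD = {v:.1f} V -> relative soft error rate "
          f"{soft_error_multiplier(0.9, v):.2f}x")
```

Under this assumption the rate grows steadily as the supply approaches the near-threshold region, which is consistent with the emphasis the text places on bit-interleaving and ECC.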

Even if the read and write disturb issues are alleviated using the methods described above, an array implemented using the 8T cells has low array efficiency. This is because its single-ended mechanism requires a hierarchical sensing architecture which implements as few as eight cells per local RBL and multiple local RBLs per global RBL. Additionally, unlike the fast differential sensing in the 6T cell, the single-ended sensing has a slow full-swing operation. As a greater number of cells are put on the same local RBL in order to improve array efficiency, both delay and the read bit-line voltage swing are greatly affected. Therefore, this form of hierarchical sensing does not compare to differential sensing in terms of both performance and array efficiency. Although many techniques have been proposed to improve the single-ended read sensing performance, the area overhead still remains large. In order to improve the array efficiency and read bit-line voltage swing of single-ended-read cells, many modified read ports have been proposed.

These designs aim to put up to 1k cells per bit-line by improving the ION/IOFF ratio of SRAM read ports. This approach helps to greatly improve the array efficiency, as peripheral circuitry can be shared amongst a greater number of cells. Although these approaches have been successful at this task, they still suffer from large area, varying data-dependent performance, and high energy consumption. In this work, we propose three iterations of SRAM bit cells with nMOS-only based read ports and compare them with conventional 6T and 8T cells and previous 10T cell-based works by measuring metrics from simulation of a 128-kb array on the 32-nm technology node. We compute minimum energy per access for all cells considering different activity factors for various levels of caches and calculate dynamic failure rate based on operating frequency and process variations.


  • More area and power consumptions
  • Maximum energy point in SRAM
  • More Transistor Count




Topology of Proposed Bit Cells:

The schematic of the proposed 10T SRAM cells is shown in Fig. 2. Each of them comprises cross-coupled inverters (PUL-PDL and PUR-PDR) and two access transistors (ACL and ACR). The read port of each cell consists of four nMOS transistors (R1, R2, R3 and R4). The read port in Fig. 2(a) has improved data-dependent read bit-line leakage and is aimed at high performance. The read ports in Fig. 2(b) and (c) have completely data-independent read bit-line leakage and are aimed at very low power and high density, respectively. The working of each port is explained in the next section. From here on, the proposed cells are referred to as 10T-P1, 10T-P2 and 10T-P3.


Figure 2 : Schematic of the proposed (a) 10T-P1 (b) 10T-P2 (c) 10T-P3 cells.

Bit Cell Working Mechanism:

When operating in the near- and sub-threshold regions, the ION/IOFF ratio is severely degraded and it becomes increasingly difficult to place a greater number of cells on a single column. As the number of cells increases, the combined pass-gate leakage becomes comparable to the read current, making it difficult for the sense amplifier to correctly evaluate the read bit-line voltage level. Furthermore, the data stored in the cells also affects the read bit-line leakage, causing the off-state read bit-line leakage current to fluctuate widely. This is exacerbated at ultra-low voltages, where the worst-case data pattern can lead to the RBL voltage level for ‘zero’ becoming greater than the RBL voltage level for ‘one’.
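The relationship between the ION/IOFF ratio and the number of cells a column can hold can be made concrete with a rough back-of-the-envelope sketch. It assumes a simple worst-case model that is not taken from the source: every unaccessed cell leaks the same off-current, and sensing works only while the accessed cell's read current is at least `margin` times the combined leakage. The function name, the model and the numbers are illustrative assumptions only.

```python
import math

def max_cells_per_bitline(ion_ioff_ratio, margin):
    """Largest column height N for which the read current still dominates.

    Assumed model (hypothetical, not from the source):
        I_on >= margin * (N - 1) * I_off
    which rearranges to N <= ion_ioff_ratio / margin + 1.
    """
    if ion_ioff_ratio <= 0 or margin <= 0:
        raise ValueError("ratio and margin must be positive")
    return math.floor(ion_ioff_ratio / margin) + 1

# A degraded near-threshold ratio supports far fewer cells per column
# than a healthy super-threshold ratio (illustrative numbers only).
print(max_cells_per_bitline(1000, 10))  # 101
print(max_cells_per_bitline(50, 10))    # 6
```

Under this simplified model the column height scales linearly with the ION/IOFF ratio, which is why the modified read ports discussed here target that ratio directly.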

Figure 3 :  Schematic of read port of (a) Calhoun and Chandrakasan [1] (b) Kim et al. [3] (c) Pasandi and Fakhraie [2] (d) Proposed 10T-P1 (e) 10T-P2 (f) 10T-P3 cell.

To improve the ION/IOFF ratio, the read port shown in Fig. 3(a) was proposed. When the cell stores ‘one,’ the R2 pMOS charges the intermediate node, thereby greatly reducing the read bit-line leakage through the R1 nMOS. However, this also causes leakage current to flow from the intermediate node into the RBL. The combined leakage of all cells on the same column can raise the low logic level of the RBL to several hundred millivolts, reducing the voltage swing and sensing margin. The conceptual effective read bit-line voltage swing for this case is depicted in Fig. 4(a). On the other hand, when the cell stores ‘zero,’ the RBL leakage is reduced through the stacking effect of the nMOS transistors. Such a topology therefore makes the effective RBL swing largely dependent on the data pattern in the column. In another work [3], the data dependency was removed by creating a data-independent leakage path between the cell’s read port and the RBL. This led to a significant voltage swing on the RBL even at lower voltages. The read port and the corresponding effective RBL swing are shown in Fig. 3(b) and Fig. 4(b), respectively. A recent work [2] also proposed a modified read port, shown in Fig. 3(c), to improve the ION/IOFF ratio. However, it is also afflicted by the data-dependent leakage path issue. Depending on the data stored in the cell, the leakage from the intermediate node to the RBL can change drastically, leading to varying low logic voltage levels on the RBL. Despite this issue, it is able to maintain an RBL swing, as shown in Fig. 4(c). From here on, the cells in Fig. 3(a), (b) and (c) are referred to as the 10T-C, 10T-K and 10T-P cells, respectively. These cells share the write-port topology of the proposed cells and differ only in the read port.

Figure 4 : Conceptual Effective Read Bit-Line Swing of (a) Calhoun and Chandrakasan [1] (b) Kim et al. [3] (c) Pasandi and Fakhraie [2] (d) Proposed 10T-P1 (e) 10T-P2 (f) 10T-P3 cell.

The schematic of the proposed read ports is shown in Fig. 3(d)–(f). The proposed 10T-P2 and 10T-P3 cells are aimed at low power and low area, respectively, while simultaneously maintaining a data-independent ION/IOFF ratio. The principle behind their operation is depicted in Fig. 5(c) and (d). As seen in Fig. 5, the magnitude of Ileak is equal in the read ‘zero’ and read ‘one’ cases. This helps to maintain the required difference in magnitude between the accessed-cell currents in the two cases. As such, a significant effective RBL swing can be observed, as shown in Fig. 4(e) and (f). This is not possible with conventional 8T cell sensing because of the large dependence of the leakage current on the data pattern.

Although the proposed 10T-P1 cell is less data-dependent than the 10T-C cell, as seen in Fig. 4(d), it remains largely incapable of performing a read operation at ultra-low voltages. However, in the following subsection, we show that operating at ultra-low voltages increases the energy per access and that operating near the threshold point is optimal for lowest energy consumption. As such, the 10T-P1 cell is operated in the near-threshold region for lowest energy consumption and highest performance. At near-threshold and super-threshold voltages, the read bit-line swing is not an issue for the 10T-P1 cell. A more comprehensive analysis of the RBL swing of each cell with respect to data pattern, supply voltage and temperature is presented in the next section.


  • Less area and power consumption
  • Minimum energy point in SRAM
  • Reduced Transistor Count


[1] B. H. Calhoun and A. P. Chandrakasan, “A 256-kb 65-nm sub-threshold SRAM design for ultra-low-voltage operation,” IEEE J. Solid-State Circuits, vol. 42, no. 3, pp. 680–688, Mar. 2007.

Line Coding Techniques for Channel Equalization: Integrated Pulse-Width Modulation and Consecutive Digit Chopping



This paper presents two new line-coding schemes, integrated pulse-width modulation (iPWM) and consecutive digit chopping (CDC), for equalizing lossy wireline channels with the aim of achieving energy-efficient wireline communication. The proposed technology-friendly encoding schemes overcome the fundamental limitations imposed by Manchester or pulse-width modulation encoding on high-speed wireline transceivers. A highly digital encoder architecture is leveraged to implement the proposed iPWM and CDC encoding. Energy-efficient operation of the proposed encoding is demonstrated on a high-speed wireline transceiver that can operate from 10 to 18 Gb/s. Fabricated in a 65-nm CMOS process, the transceiver operates with supply voltages of 0.9 V, 1 V, and 1.1 V. With the help of the proposed iPWM encoding, the transceiver can equalize over 27 dB of channel loss while operating at 16 Gb/s with an efficiency of 4.37 pJ/bit. The design occupies an active die area of 0.21 mm².


Growth of online video content for high-resolution (4k, 8k) video and of data generated by IoT devices has resulted in an exponential increase in the data rates at every point in the communication chain, from data centers to smartphones. Wireline communication systems address the bandwidth demand in two ways: (1) by increasing the number of channels and (2) by increasing the data rate per channel. Increasing the number of channels typically requires investment in new infrastructure and is therefore discouraged. Consequently, increasing the data rate per channel has been the trend in wireline links over the last 16 years, as shown in Fig. 1(a). While the energy efficiency of these links has continued to improve, the rate of improvement has slowed down in the last six years, as shown in Fig. 1(b).

A major reason for the slowdown in energy efficiency improvement is that, while data rates continue to increase, communication channels have remained more or less the same, since channel upgrades are very expensive. Using the same channel at higher data rates results in more inter-symbol interference (ISI), which requires greater equalization to compensate for the channel loss. Equalization of the channel loss consumes significant power and degrades the energy efficiency of the wireline communication link.

Conventional equalization techniques on the receiver end, such as decision feedback equalizers, have tight feedback timing constraints, which result in higher power consumption as the data rate increases. Feed-forward equalization (FFE) on the transmitter with a voltage-mode driver avoids the feedback path and results in efficient equalization. Based on the FFE tap resolution requirement, the output driver and pre-driver are divided into multiple segments. Although such a segmented FFE implementation helps to maintain a constant output termination impedance (50 Ω) across all tap settings, it comes at the cost of (a) increased signaling power, (b) increased switching power, since multiple segments are required to achieve the desired linearity, and (c) tight coupling between the 50 Ω termination tuning and the FFE tap coefficient tuning. These three constraints reduce the FFE efficiency as the number of FFE taps is increased to equalize heavy channel loss.

Figure 1 : (a) Data rate vs. year of wireline link publication. (b) Energy efficiency vs. year of publication.

Conventional line-coding techniques such as Manchester encoding (also known as pulse-width modulation or PWM) can equalize the wireline channel without increasing signaling power, without segmenting the output driver, and without coupling the 50 Ω termination tuning with the coefficient tuning. However, PWM encoding requires the insertion of a precise narrow pulse in every data bit. These narrow pulses must be accurately reproduced at the transmitter output, which necessitates very wide bandwidth in the high-speed data path, resulting in poor energy efficiency and difficulty in scaling PWM encoding to higher data rates. For example, creating a 10% duty cycle on a 64Gb/s PWM data stream would require a pulse width of about 1.5ps with less than 1ps of rise/fall time at the transmitter output. Researchers have shown that a phase pre-emphasis encoding scheme can help to reduce data-dependent jitter. However, it is ineffective at equalizing high-loss channels.
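The narrow-pulse figure quoted above follows from the unit interval (UI), which is the reciprocal of the data rate. The short sketch below reproduces that arithmetic; the function name is illustrative and not from the source.

```python
def pwm_pulse_width_ps(data_rate_gbps, duty_cycle):
    """Width in picoseconds of a PWM pulse occupying `duty_cycle` of one UI."""
    ui_ps = 1000.0 / data_rate_gbps  # 1 / (Gb/s) expressed in ps
    return duty_cycle * ui_ps

# 64 Gb/s gives a 15.625 ps UI, so a 10% pulse is roughly 1.5 ps wide.
print(pwm_pulse_width_ps(64, 0.10))
```

At 16 Gb/s the same 10% duty cycle corresponds to a 6.25 ps pulse, which shows how quickly the requirement tightens as the data rate scales.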

In view of these limitations, we propose two highly digital phase-domain line-coding/modulation techniques for equalization: (a) integrated pulse-width modulation (iPWM) and (b) consecutive digit chopping (CDC). The proposed iPWM technique can compensate more than 27dB of channel loss at 16Gb/s while consuming 69.9mW of power. Compared to the state-of-the-art PWM designs, the proposed iPWM scheme achieves 36× better energy efficiency for the same data rate, and 3.2× higher data rate for the same energy efficiency. The proposed CDC encoding technique, in tandem with iPWM, can equalize a channel loss of up to 30dB at 14Gb/s.
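The power and data-rate figures above can be cross-checked against the 4.37 pJ/bit efficiency quoted in the abstract, since energy per bit is simply power divided by data rate. A minimal sketch (the helper name is illustrative):

```python
def energy_per_bit_pj(power_mw, data_rate_gbps):
    """Energy per bit in pJ: (1e-3 W) / (1e9 bit/s) = 1e-12 J/bit."""
    return power_mw / data_rate_gbps

# 69.9 mW at 16 Gb/s
print(round(energy_per_bit_pj(69.9, 16), 2))  # 4.37
```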

In the past, researchers have also proposed digital phase modulation techniques such as pulse-width modulation (PWM), which encode the information in pulse widths instead of voltage levels (as in, for example, PAM-4/8/16). This work differs from prior PWM-based modulation research in that it presents a line-coding technique to equalize a wireline channel and does not modulate the pulse width to encode information. In the proposed work, the information is contained in the two voltage levels only.


  • Number of Channels is very low
  • Low Data rate
  • More area and power consumption


Wireline communication channels have low-pass characteristics; that is, the insertion loss of the channel increases with frequency. Suppose the data through the channel has no consecutive identical digits (CIDs), that is, the data is an alternating pattern (101010…) whose power spectrum is limited to just one frequency. With such data, the loss offered by the channel is constant and there is no ISI (no eye closure), as shown in Fig. 2(a). In reality, the transmitted data contains CIDs such as 10110 (2 CIDs), 01110 (3 CIDs), 11110 (4 CIDs), etc., which cause the power spectral density of the data to have a wide bandwidth. Because of the frequency-dependent insertion loss of the channel, the loss offered to 4 CIDs is less than that offered to 3 CIDs or 2 CIDs, because the majority of the power spectrum of data with 4 CIDs is located at a lower frequency than that of data with 3 CIDs or 2 CIDs.
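The CID counts quoted for the example patterns can be reproduced by measuring run lengths in the bit stream. The following is a small illustrative sketch (helper names are not from the source); a longest run of 1 corresponds to the alternating pattern, which the text describes as having no CIDs.

```python
from itertools import groupby

def cid_runs(bits):
    """Length of each run of identical digits in a bit string."""
    return [len(list(group)) for _, group in groupby(bits)]

def longest_run(bits):
    """Longest run of identical digits; 1 means the data has no CIDs."""
    return max(cid_runs(bits), default=0)

for pattern in ("101010", "10110", "01110", "11110"):
    print(pattern, cid_runs(pattern), longest_run(pattern))
```

The longer the run, the more of the signal's energy sits at low frequency, where the channel attenuates it least.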

Figure 2 : (a) Effect of consecutive identical digits on inter-symbol interference (ISI). (b) Time domain pulse response of NRZ data and proposed iPWM encoded data in the presence of CIDs

As a result, the data with 4 CIDs has a higher amplitude at the channel output than the data with 3 CIDs or 2 CIDs. Consequently, the transition time to a bit of opposite polarity immediately following the CIDs is longer for 4 CIDs than for 3 CIDs or 2 CIDs. The long transition time reduces the horizontal and vertical eye opening of the data bit immediately following the CIDs. Hence, due to the difference in the insertion loss offered to CIDs, the data bit immediately following the CIDs suffers from inter-symbol interference (see Fig. 2(a)).

The proposed iPWM relies on the fact that the ISI on the data bits following the CIDs can be reduced by reducing the pulse width of the CIDs at the transmitter. The concept of reducing the pulse width is illustrated in Fig. 2(b). In the proposed iPWM, instead of transmitting 2 CIDs for the complete duration of 2 UI (unit intervals), the 2 CIDs are transmitted for less than 2 UI. As a result, the transition to the bit of opposite polarity happens early; in other words, the post-cursor generated by the 2 CIDs is reduced, which helps to reduce the ISI. Similarly, this technique can be leveraged to remove ISI while transmitting 3 or more CIDs.
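The idea of ending a CID run early can be sketched as a list of transition times measured in UI. The code below is a simplified illustration, not the encoder from the paper: it assumes every run of two or more identical digits has its trailing edge pulled in by the same amount `delta_ui`, and it models post-cursor equalization only. The function name and the fixed-shift rule are assumptions.

```python
from itertools import groupby

def ipwm_edges(bits, delta_ui):
    """Return (time_in_UI, new_level) for each transition in `bits`.

    Assumption (hypothetical): the trailing edge of every run of two or
    more identical digits occurs delta_ui earlier than in plain NRZ.
    """
    if not 0.0 <= delta_ui < 1.0:
        raise ValueError("delta_ui must be in [0, 1)")
    runs = [(level, len(list(group))) for level, group in groupby(bits)]
    edges = []
    t = 0
    for idx, (level, length) in enumerate(runs):
        t += length  # nominal NRZ time at which this run ends
        if idx == len(runs) - 1:
            break    # no transition after the final run
        shift = delta_ui if length >= 2 else 0.0
        edges.append((t - shift, runs[idx + 1][0]))
    return edges

print(ipwm_edges("110100", 0.0))   # plain NRZ edge times
print(ipwm_edges("110100", 0.25))  # first edge moves from 2 UI to 1.75 UI
```

Only the edge that ends the leading "11" run moves; the number of transitions is unchanged, so no extra narrow pulses are created in this model.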

Benefits of iPWM Encoding:

The timing diagram of the proposed iPWM encoding and a comparison with Manchester/PWM encoding are shown in Fig. 3. In Manchester encoding (Fig. 3(a)), bit one is represented by a signal that stays high for 50% of the period and low for the remaining 50%. Bit zero is represented by a signal that is low for 50% of the period and then goes high. In PWM encoding, the time for which the signal stays high and low can be varied from 0% to 50% depending on the channel loss. In the presence of CIDs, Manchester/PWM encoding generates narrow pulses, which must be transmitted through the pre-driver and output driver and precisely reproduced at the channel. These narrow pulses necessitate very wide bandwidth in the high-speed data path, resulting in poor energy efficiency and difficulty in scaling Manchester/PWM encoding to higher data rates.

Figure 3 : (a) Manchester/PWM encoded data and problem of narrow pulse width generation. (b) Timing diagram of the proposed iPWM encoding for post-cursor and pre-cursor equalization.

The proposed iPWM encoding avoids inserting narrow pulses and instead reduces the pulse width of CIDs to achieve equalization, as shown in Fig. 3(b). Post-cursor ISI is reduced by shortening the CIDs at their trailing edges, while pre-cursor ISI can be reduced by shortening the CIDs at their leading edges.

It can be observed from Fig. 3(b) that the number of transitions in the iPWM encoded data is the same as that of NRZ data. Therefore, the bandwidth requirement on the high-speed data path of the output driver is the same for the iPWM encoding technique as for NRZ encoding. This helps in increasing the data rates of iPWM encoded data to 56Gb/s and beyond without exponentially increasing the switching power of the transmitter.
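The bandwidth argument can be illustrated by counting transitions. The sketch below samples each bit twice (once per half UI) and compares plain NRZ with Manchester encoding as described earlier (bit one is high then low, bit zero is low then high). Because iPWM only moves existing edges, its transition count matches the NRZ figure. The helper names are illustrative, not from the source.

```python
def count_transitions(samples):
    """Number of level changes between adjacent samples."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a != b)

def nrz_half_ui(bits):
    """NRZ waveform sampled twice per bit."""
    return [b for b in bits for _ in range(2)]

def manchester_half_ui(bits):
    """Manchester waveform: '1' is high-low, '0' is low-high."""
    out = []
    for b in bits:
        out.extend(("1", "0") if b == "1" else ("0", "1"))
    return out

data = "1110"
print(count_transitions(nrz_half_ui(data)))         # 1
print(count_transitions(manchester_half_ui(data)))  # 6
```

For this pattern with three CIDs, Manchester encoding produces six transitions where NRZ (and hence iPWM) produces one, which is the source of the extra bandwidth demand on the data path.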

Furthermore, the conventional Manchester encoding technique has a lower limit on the minimum pulse width that can be transmitted because of the bandwidth limitation of the data path, which often results in over-equalization of a low-loss channel. When over-equalization occurs, ISI is added instead of being removed, resulting in incorrect detection at the receiver and, consequently, a higher bit error rate. The proposed iPWM encoding can change the pulse width of CIDs with very high precision (1ps precision or better can be easily achieved in 65nm CMOS), which helps to equalize a wide range of channel loss. Since iPWM encoding is done before the pre-driver, the output driver can be implemented as an unsegmented source-series-terminated driver. This reduces the signaling and switching power of the transmitter, which makes the proposed iPWM an energy-efficient equalization scheme. Moreover, the proposed iPWM also helps to decouple the 50 Ω termination tuning from the encoding coefficient tuning in a source-series-terminated output driver. In summary, the proposed iPWM (1) does not generate narrow pulses, (2) can equalize a wide range of loss, (3) reduces transmitter signaling and switching power, (4) decouples 50 Ω termination resistor tuning from coefficient tuning, and (5) uses a technology-scalable encoding architecture.


  • Number of Channels increases
  • Data rate per channel increases
  • Less area and power consumption




[1] T. Anand. Wireline Link Performance Survey. Accessed: Jan. 2018. [Online]. Available: https://web.engr.oregonstate.edu/~anandt/linksurvey

Hardware-Efficient Post-processing Architectures for True Random Number Generators



In this brief we present novel post-processing modules for use in True Random Number Generators. These modules are based on mathematical constructs called strong blenders, which provide theoretical guarantees for the randomness of the output. We start by pointing out problems with current post-processing methods used in state-of-the-art TRNG designs. We present three novel hardware-efficient architectures and provide guidelines for choosing the design parameters.


Hardware True Random Number Generators (TRNGs) are used in all devices that require secure communication, device authentication or data encryption. Applications include smart cards, RFID tags and IoT devices. TRNGs used in cryptography are subject to a strict certification procedure. In the past, the security of TRNG designs was evaluated by running a set of statistical tests such as NIST 800–22 and DIEHARD. However, as has been pointed out, the statistical features exploited by future cryptanalysis techniques cannot be foreseen in advance. Therefore, it is a risky practice to rely only on a finite set of statistical tests to verify the security of a random number generator. A notable incident happened in 2003 when the Motorola TRNG was attacked only one year after the details of the design were disclosed.

Today’s certification authorities require a theoretical explanation for the unpredictability of generated data. Based on the theoretical model of the digital noise source (DNS), a designer has to make an entropy claim – i.e. a lower bound of the generated entropy. Once this bound is determined, an appropriate digital post-processing method is used to compress the sequence of raw numbers into a shorter sequence of full entropy random numbers that could be used by the application.

While TRNGs presented in the open literature often achieve impressive results in terms of throughput, energy and hardware area, they rarely meet all the requirements for use in cryptography. A common mistake is the wrong choice of post-processing algorithm. For example, some published designs use Von Neumann’s post-processing. This method works only when the probability of generating the output bit 1 doesn’t change over time and when there is no correlation between the generated bits. Unfortunately, these conditions are never met in practice. Other TRNGs use a parity filter for post-processing, while another design uses an XOR gate to combine the outputs of two independent physical sources of randomness. In some specific cases, these methods increase the min-entropy of the output, but they don’t provide general-case theoretical guarantees. Still other TRNG designs use LFSR-based whitening schemes instead of post-processing. Such schemes don’t compress the data and thus don’t increase the entropy per bit. In addition, many designs don’t use any post-processing because the raw bits pass the NIST 800–22 statistical tests. As illustrated by the attack on the Motorola TRNG, this is not a good practice. We stress that the Motorola TRNG was also able to pass all statistical tests from the DIEHARD suite.
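For reference, Von Neumann's post-processing mentioned above can be sketched in a few lines. This is the textbook procedure rather than code from any of the cited designs: raw bits are taken in non-overlapping pairs, equal pairs are discarded, and the first bit of each unequal pair is emitted.

```python
def von_neumann(bits):
    """Von Neumann debiasing over non-overlapping pairs of raw bits.

    (0, 1) -> 0, (1, 0) -> 1, (0, 0) and (1, 1) are dropped.
    The output is unbiased only if the raw bits are independent and
    have a constant bias, the conditions discussed in the text.
    """
    out = []
    for a, b in zip(bits[0::2], bits[1::2]):
        if a != b:
            out.append(a)
    return out

raw = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]
print(von_neumann(raw))  # [0, 1, 1]
```

The sketch also shows why the method is fragile: nothing in it detects correlation between successive bits or drift in the bias, so violations of its assumptions pass through silently.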

NIST Special Publication 800–90B recommends using one of the vetted post-processing methods based on cryptographically secure primitives such as block ciphers or cryptographic hashes. However, these methods have a high cost in terms of area, energy and throughput, which makes them unsuitable for lightweight applications. To the best of our knowledge, the only attempt to implement a mathematically secure, hardware-efficient post-processing method was made by Intel in their µRNG design for IoT applications. This method was based on a single finite field addition and multiplication, a construct that had previously been proposed and theoretically analyzed. Unfortunately, that implementation uses the wrong choice of finite field and design parameters, and thus fails to provide security guarantees. Motivated by the current lack of mathematically secure post-processing modules in state-of-the-art TRNGs, we propose three hardware-efficient post-processing architectures suitable for compact implementations. These architectures are based on mathematical constructs called strong blenders, which provide theoretically proven guarantees for the statistical quality and unpredictability of the output. The only requirement imposed on the digital noise source is that it produces a sufficient amount of min-entropy, which makes these post-processing methods compatible with all physical sources of randomness. For all three architectures we provide a method for selecting design parameters given the min-entropy of the digital noise source.


  • Low throughput
  • Low energy
  • More area and power consumption



We propose three hardware-efficient architectures for post-processing modules. These architectures are based on a two-source strong blender [26], which consumes l bits from each source and produces a single w-bit word. Such a blender can be constructed using any set of l×l matrices with the property given in Equation (5). We opt for right-shift matrices because of their efficient hardware implementation and their compatibility with bit-serial DNS architectures. These are superdiagonal matrices given by:

A bit vector multiplied by A_i results in the same bit vector shifted by i positions to the right. In bit-serial architectures this multiplication is implemented by simply delaying the bit sequence by i clock cycles. A sum of any subset of the matrices A_1, …, A_w results in a matrix with a rank of at least l−w. Thus, by Equation (5), r = w, and by Equation (7), the statistical distance of the output from U_w is limited to:

$$\delta < 2^{-(b + 1 - w - l/2)}.$$

Fig. 1 shows the architectures of the proposed post-processing modules. A straightforward implementation using two DNSs is shown in Fig. 1a. In this architecture the strong blender is used as a two-source extractor. The multiplications A_i x are implemented by delaying an input bit stream by i clock cycles. Each inner product is implemented using an AND gate, an XOR gate and a flip-flop for storing intermediate results. The computation is performed in l clock cycles while the sources generate raw bits, after which the result is stored in the w-bit output register.
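A software model can make the datapath of the two-source architecture easier to follow. The sketch below rests on an assumption, since the blender's defining equations are not reproduced in this section: output bit i is taken to be the inner product over GF(2) of the shifted vector A_i x with y, for i = 1, …, w, matching the AND/XOR/flip-flop description above. Function names and the example inputs are hypothetical.

```python
def shift_right(x, i):
    """Model of A_i x: bits move i positions right, zeros are shifted in."""
    return [0] * i + x[:len(x) - i]

def inner_product_gf2(a, b):
    """AND each bit pair and accumulate with XOR, as a flip-flop would."""
    acc = 0
    for u, v in zip(a, b):
        acc ^= u & v
    return acc

def blend(x, y, w):
    """Assumed two-source blender: w output bits from two l-bit inputs."""
    if len(x) != len(y):
        raise ValueError("both sources must supply l bits")
    if not 0 < w < len(x):
        raise ValueError("w must satisfy 0 < w < l")
    return [inner_product_gf2(shift_right(x, i), y) for i in range(1, w + 1)]

x = [1, 0, 1, 1, 0, 1, 0, 0]  # l = 8 raw bits from the first source
y = [1, 1, 0, 1, 0, 0, 1, 1]  # l = 8 raw bits from the second source
print(blend(x, y, 2))         # [1, 1]
```

In hardware the w inner products would run in parallel over l clock cycles; the list comprehension here computes them one after another, which gives the same result.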

Figure 1 : Architectures of the post-processing modules based on strong blenders

Higher utilization of available entropy can be achieved by exploiting the independence of the strong blender output from one of the inputs. In the architecture shown in Fig. 1b one of the input sequences is reused for generating multiple output words. This architecture uses the same computational core (shown in gray) as the two-source architecture, but it requires only one DNS. The operation consists of two phases: the setup phase and the entropy extraction phase. In the setup phase, an l-bit sequence is generated and stored in the circular-shift register. During the extraction phase, this sequence is rotated through this register providing one input for the computational core, while the DNS generates the data for the second input. Between these two phases, the DNS should be restarted in order to guarantee the independence of the two inputs. This way, we bypass the requirement for two distinct entropy sources.

Fig. 1c shows an architecture that is suitable for high throughput designs. This architecture uses m sources and m−1 computational cores. Each generated bit-sequence is used at most twice, thereby avoiding the risk that a single corrupted sequence affects many output words.

The selection of the optimal architecture should be guided by the application requirements and the properties of the digital noise source used. The choice between the one-source and the two-source architecture comes down to the choice between an l-bit shift register and a DNS. If the selected DNS performs better than the shift register in terms of the metric of interest (area, power or energy), a two-source architecture should be used. Otherwise, a single-source architecture is optimal. The multiple-source architecture shown in Fig. 1c should be used when the throughput requirement cannot be met using the other two architectures.


  • High throughput
  • More energy
  • Less area and power consumption


[1] A. Rukhin et al., “A statistical test suite for random and pseudorandom number generators for cryptographic applications.” NIST Special Publication 800-22, August 2008.