10T SRAM Using Half-VDD Precharge and Row-Wise Dynamically Powered Read Port for Low Switching Power and Ultralow RBL Leakage

ABSTRACT:

This paper presents a new 10T static random access memory (SRAM) cell with a single-ended, decoupled read bitline (RBL) and a 4T read port for low-power operation and leakage reduction. The RBL is precharged to half the cell's supply voltage and is allowed to charge or discharge according to the stored data bit. An inverter, driven by the complementary data node (QB), connects the RBL to the virtual power rails through a transmission gate during the read operation. The RBL rises toward the VDD level for a read-1 and discharges toward the ground level for a read-0. The virtual power rails are held at the RBL precharge level during the write and hold modes, and are connected to the true supply levels only during the read operation. This dynamic control of the virtual rails substantially reduces the RBL leakage. The proposed 10T cell in a commercial 65 nm technology is 2.47× the size of a 6T cell with β=2, provides 2.3× the read static noise margin, and reduces the read power dissipation by 50% compared with 6T. The RBL leakage is reduced by more than 3 orders of magnitude, and ION/IOFF is greatly improved compared with the 6T BL leakage. The overall leakage characteristics of 6T and 10T are similar, and competitive performance is achieved. The proposed architecture is analyzed for logic size, area, and power consumption using the Tanner tool.

 

EXISTING SYSTEM:

An SRAM cell must operate robustly in the hold, read, and write modes. It uses the positive feedback of cross-coupled inverters (INVs) to store a single bit of information in a complementary fashion. Access transistors provide the mechanism for the read and write operations. Before every access, the column bitline (BL) pair (BL and BLB) is precharged to the supply voltage. For the write operation, one of the precharged BLs is discharged through the write driver.

Figure 1: Conventional 6T SRAM read. (a) Column of M bit-cells during read. (b) Top: hold and read SNM butterfly curve (with worst case noise polarity during hold). Bottom: transient behavior showing read disturbance

Fig. 1(a) shows a single column of M 6T SRAM cells, where one cell is accessed in read mode with data=0 (Qa=0), while the other M−1 cells are in the hold mode. Leakage components are labeled, and for the worst-case leakage, all M−1 cells store data=1 (Qu=1). Iread flows from BL to VSS through AL and NL of the accessed cell, and the BL voltage decreases. The unaccessed cells on the BL exhibit BL leakage. IuLeak0 is the main component of BL leakage while IuLeak1 is negligible, as VDS of AR of an unaccessed cell is large, while VDS of its AL is very small (it varies from 0 to VBL). These leakage components reduce the differential BL voltage development. As there are a large number of cells in a single column, the worst-case BL leakage can decrease the BLB voltage enough to cause an erroneous read. Thus, Iread must be greater than (M−1)×IuLeak0, where M is the number of cells in a single column.
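The read condition stated above, Iread > (M−1)×IuLeak0, can be checked numerically. The following is a minimal Python sketch; the function names and the current values in the usage note are illustrative assumptions and are not taken from the paper.

```python
def worst_case_leakage(i_leak0, m):
    """Total worst-case BL leakage from the M-1 unaccessed cells (amperes)."""
    return (m - 1) * i_leak0


def read_is_safe(i_read, i_leak0, m):
    """True if the read current exceeds the worst-case column leakage.

    Implements the condition Iread > (M - 1) * IuLeak0 from the text.
    i_read and i_leak0 are in amperes; m is the number of cells per column.
    """
    if m < 1:
        raise ValueError("a column needs at least one cell")
    return i_read > worst_case_leakage(i_leak0, m)
```

For example, with an assumed read current of 20 µA and an assumed per-cell leakage of 10 nA, `read_is_safe(20e-6, 10e-9, 256)` holds, while `read_is_safe(20e-6, 10e-9, 4096)` does not, which illustrates how leakage bounds the column height.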

Figure 2: SRAM read ports. (a) 6T. (b) 8T. (c) 9T. (d) 9T. (e) 10T.

In essence, 6T SRAM has conflicting read and write requirements, so the transistors for the two operations cannot be sized independently. The 6T cell also has an inherent read static noise margin (RSNM) problem, as the read current passes through the cell's internal node, and the RSNM degrades further with VDD scaling. Moreover, as the baseline design, 6T has higher overall power dissipation and higher BL leakage than low-power designs, which employ techniques such as charge sharing and hierarchical BLs to lower the dynamic power dissipation and virtual rails to lower the leakage. The read port of the 6T SRAM cell is shown in Fig. 2(a), which highlights the internal node Q in the read current path. Many alternative bit cells and techniques have been proposed in the literature to improve SRAM cell stability, reduce the leakage currents, and achieve low-power operation compared with the conventional 6T design.

An 8T SRAM cell adds a separate 2T read port, shown in Fig. 2(b), and essentially solves the problem of read stability. The internal nodes are isolated from the read current path, and thus a high RSNM is achieved. Also, the 8T read port can be sized independently without affecting the write operation.

In the 6T SRAM read operation, one of the BLs stays at VDD while the other decreases by VBL. In the case of 8T SRAM, however, there is only one BL (RBL), and it either decreases or stays at the VDD level depending on the bit read. The single-ended (SE) BL can be sensed using different circuits, such as: 1) domino sensing, which requires a full VDD swing on the local BL; 2) pseudo-differential sensing, which requires a reference signal; and 3) ac-coupled sensing, which requires the use of capacitors. With a reference-based sense amplifier, only a small voltage difference is required.

 

DISADVANTAGES:

  • Power dissipation and BL leakage are high

PROPOSED SYSTEM:

We present our half-VDD precharge and charge recycling technique for low-power read operation. A 4T read port is designed to employ the proposed technique. The read BL (RBL) is charged or discharged through the read port according to the state of the stored bit. The read port is powered by virtual power rails that run horizontally and are shared by the cells of a word. The dynamic control of the read port power rails substantially reduces the RBL leakage.

Figure 3: Proposed 10T SRAM cell with row-wise read port dynamic power lines

Proposed cell and low power technique:

The proposed 10T SRAM cell with SE RBL is shown in Fig. 3. We have added a 4T read port to the 6T cell to decouple the internal nodes during the read operation. The read port consists of an INV (P1-N1) driven by node QB and a transmission gate (TG) (P2-N2). The output (Z) of the INV is connected to the RBL during the read operation through the TG, which is controlled by the read control signals. Furthermore, the read port is powered by virtual power rails, VVDD and VVSS, which are dynamically controlled. These virtual power rails (control signals) run horizontally and take the true rail values only during the read operation. To reduce the RBL leakage, both virtual rails are otherwise held at the same level as the RBL precharge level.
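The read port behavior described above can be summarized in a small behavioral model. This is only a logic-level sketch in Python, not a circuit simulation; the supply value, the mode names, and the function names are assumptions made for illustration.

```python
from dataclasses import dataclass

VDD = 1.0  # assumed supply voltage in volts (illustrative only)


@dataclass(frozen=True)
class ReadPortState:
    vvdd: float   # virtual supply rail level
    vvss: float   # virtual ground rail level
    tg_on: bool   # whether the transmission gate connects Z to the RBL


def read_port_rails(mode, vdd=VDD):
    """Virtual rail levels and TG state for 'read', 'write', or 'hold'."""
    if mode == "read":
        # True rail values are applied only during the read operation.
        return ReadPortState(vvdd=vdd, vvss=0.0, tg_on=True)
    if mode in ("write", "hold"):
        # Both rails sit at the RBL precharge level (VDD/2).
        half = vdd / 2
        return ReadPortState(vvdd=half, vvss=half, tg_on=False)
    raise ValueError(f"unknown mode: {mode!r}")


def rbl_direction(q, mode):
    """+1 if the RBL rises from VDD/2, -1 if it falls, 0 if it is undisturbed."""
    if not read_port_rails(mode).tg_on:
        return 0
    qb = 1 - q   # complementary storage node drives the inverter
    z = 1 - qb   # inverter output Z, which equals the stored bit Q
    return 1 if z == 1 else -1
```

In this model a read-1 gives `rbl_direction(1, "read") == 1` and a read-0 gives `-1`, while in the hold and write modes the two virtual rails are equal, so the read port has no voltage across it.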

A 10T SRAM cell using an INV and a TG has been proposed earlier. However, our proposed 10T scheme differs from the previous design in the following aspects:

  • The previous INV+TG-based 10T cell was application specific, while our proposed design is generic.
  • We have used the dynamically controlled power rails for the read port.
  • We precharge RBL at VDD/2, while the previous 10T design eliminated the precharge phase, and used INV to fully charge or discharge the RBL.
  • The basic read technique of both the designs is completely different. The main idea of the proposed design is “the charging or the discharging of the read BL from VDD/2 for every read operation.” The previous design either discharges from VDD to VSS, or charges from VSS to VDD.
  • A powerful INV was used previously to produce full VDD swing on the RBL. In the proposed design, RBL is precharged at VDD/2, and only a small voltage difference (comparable with 6T) is produced for every read cycle.
  • In the proposed design, the RBL exhibits some change (positive or negative) from its precharged value of VDD/2 in every read cycle. In the previous design, by contrast, the RBL does not change for consecutive reads of the same bit value; it changes only if consecutive read bits are different.
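The contrast between the two read techniques listed above can be illustrated by tracking the RBL voltage excursion during the read phase for a sequence of bits. The Python sketch below is a toy model under stated assumptions: `delta_v` (the small sensing swing), `vdd`, and the initial RBL level of the previous design are hypothetical values, and the model only tracks voltage excursions, not energy.

```python
def proposed_excursions(bits, delta_v):
    """Proposed scheme: each read moves the RBL by delta_v from VDD/2.

    The RBL rises for a read-1 and falls for a read-0, then is
    precharged back to VDD/2 before the next read.
    """
    return [delta_v if b else -delta_v for b in bits]


def previous_excursions(bits, vdd, initial_level=0.0):
    """Previous INV+TG scheme: no precharge, full-swing RBL.

    The RBL keeps its last value, so it moves (by the full VDD)
    only when the bit being read differs from the previous one.
    """
    level = initial_level
    excursions = []
    for b in bits:
        target = vdd if b else 0.0
        excursions.append(target - level)
        level = target
    return excursions
```

For the bit sequence `[1, 1, 0, 0, 1]` with an assumed `delta_v` of 0.1 V and `vdd` of 1.0 V, the proposed scheme yields a ±0.1 V excursion on every read, whereas the previous scheme yields full 1.0 V swings on bit changes and none on repeated bits.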

 

ADVANTAGES:

  • Read power dissipation is reduced by 50% compared with 6T, and RBL leakage is reduced by more than 3 orders of magnitude

 

SOFTWARE IMPLEMENTATION:

  • Tanner tool

 

A 2.5-ps Bin Size and 6.7-ps Resolution FPGA Time-to-Digital Converter Based on Delay Wrapping and Averaging

ABSTRACT:

A high-resolution time-to-digital converter (TDC) implemented on a field-programmable gate array (FPGA) and based on delay wrapping and averaging is presented. The fundamental idea is to pass a single clock through a series of delay elements to generate multiple reference clocks with different phases for input time quantization. Due to periodicity, those phases are equivalently wrapped within one reference clock period to achieve the required fine resolution. In practice, a hybrid delay matrix is created to significantly reduce the required number of delay cells. Multiple TDC cores are constructed for parallel measurements, and then exquisite routing control and averaging are applied to smooth out the large quantization errors caused by the inhomogeneity of the TDC delay lines, enhancing both linearity and single-shot precision. To reduce the impact of temperature sensitivity, a cancellation circuit is created to substantially reduce the offset and confine the output difference to within 2 LSB for the same input interval over the full operating temperature range of the FPGA. With a fine resolution of 2.5 ps, the integral nonlinearity is measured to be merely −2.98 to 3.23 LSB, and the corresponding rms resolution is 4.99–6.72 ps. The proposed TDC is tested to be fully functional over the 0 °C–50 °C ambient temperature range with extremely low resolution variation. Its performance is even superior to that of many full-custom-designed TDCs. The proposed architecture is analyzed for logic size, area, and power consumption using Xilinx 14.2.

EXISTING SYSTEM:

Conventionally, a TDC with sub-nanosecond resolution can be realized with emitter-coupled logic (ECL), which consumes considerable power and area and is unsuitable for portable systems or integrated chips. Many different techniques have been developed to achieve a high resolution and a wide measurement range, such as time-to-amplitude conversion, the Vernier principle, time stretching, and time interpolation. In theory, the simplest implementation of a TDC is a high-frequency counter that increments every clock cycle. A 3-ps incremental resolution has been achieved with the help of a time-consuming statistical method based on a lookup table (LUT). Meanwhile, multistage interpolation can be applied straightforwardly to obtain a wide measurement range while keeping a high resolution. Fig. 1 shows the conceptual timing diagram of the two-stage time interpolation technique based on the classic Nutt method. The input interval Tin is segmented into T12, T1, and T2. T12 is synchronous with the reference clock CLK and can be readily digitized by a coarse counter, while T1 and T2, with durations less than one clock period TCLK, are processed by fine TDCs or interpolators with resolutions much smaller than TCLK. Tin can be measured as

Tin = T12 + T1 − T2.                                                                                                (1)
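As a worked instance of (1), the Python sketch below combines a coarse counter value with two fine interpolator measurements. It assumes that T12 is an integer number of reference clock periods (as implied by the coarse counter) and uses integer picoseconds; the function name and the numbers in the usage note are illustrative assumptions.

```python
def nutt_interval_ps(coarse_count, t_clk_ps, t1_ps, t2_ps):
    """Input interval Tin = T12 + T1 - T2, in picoseconds.

    coarse_count: number of full reference clock periods in T12
    t_clk_ps:     reference clock period TCLK
    t1_ps, t2_ps: fine intervals, each shorter than one clock period
    """
    if not (0 <= t1_ps < t_clk_ps and 0 <= t2_ps < t_clk_ps):
        raise ValueError("T1 and T2 must be shorter than one clock period")
    t12_ps = coarse_count * t_clk_ps
    return t12_ps + t1_ps - t2_ps
```

For example, with an assumed 10 ns clock period, three coarse counts, T1 = 1250 ps, and T2 = 400 ps, the call `nutt_interval_ps(3, 10_000, 1_250, 400)` gives 30850 ps.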

Since the interpolator dominates the effective resolution of a TDC, many structures have been created to enhance its accuracy. The most commonly used are the tapped delay line, pulse stretcher (dual-slope conversion), pulse shrinking, and Vernier delay line (differential delay line), which achieve sub-gate-delay resolution. After decades of evolution, it is still a challenge for experienced designers to accomplish an effective resolution better than 10 ps for TDCs, and more subtle techniques are required. Time amplification has been adopted to implement a TDC with 9 b, a 1.25-ps bin size, and an output standard deviation of <1 LSB. The measured differential nonlinearity (DNL) and integral nonlinearity (INL) are 0.8 LSB and 3 LSB, respectively, with a limited dynamic range. Cyclic time-domain successive approximation has been used to obtain a 1.2-ps resolution and a 327-μs dynamic range; the RMS single-shot precision of 3.2 ps is achieved using an external INL-LUT for the interpolators. A Vernier ring has been used to generate an 8-ps LSB width, also with an output standard deviation of <1 LSB. The performance is further improved by a gated Vernier ring structure, which realizes an equivalent resolution of 3.2 ps with an oversampling ratio of 16. An 8-b cyclic TDC has been proposed to achieve a 1.25-ps LSB width, a ±0.7 LSB DNL, and a −3 to +1 LSB INL. To enhance dynamic accuracy for applications with periodic TDC input, time-domain delta-sigma modulation for noise shaping has been adopted to obtain an effective resolution of around 6 ps.

DISADVANTAGES:

  • Limited performance (an effective resolution better than 10 ps is difficult to achieve)

PROPOSED SYSTEM:

Assuming that n wrapped phases are uniformly distributed in one reference clock period, the bin size of the TDC can be calculated as

LSB = TCLK/n = 1/(n × f)                                                                                            (2)

where f is the reference clock frequency.
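Equation (2) can be evaluated directly, and inverted to estimate how many wrapped phases a target bin size needs. The Python sketch below is illustrative; the function names and the numbers in the usage note are assumptions and do not describe the configuration used in the paper.

```python
import math


def bin_size_ps(f_clk_hz, n_phases):
    """Bin size LSB = TCLK / n = 1 / (n * f), returned in picoseconds."""
    t_clk_ps = 1e12 / f_clk_hz
    return t_clk_ps / n_phases


def phases_needed(f_clk_hz, target_lsb_ps):
    """Smallest number of uniformly wrapped phases giving LSB <= target."""
    t_clk_ps = 1e12 / f_clk_hz
    return math.ceil(t_clk_ps / target_lsb_ps)
```

For an assumed 100 MHz reference clock, 100 uniformly wrapped phases give a 100 ps bin, and a 10 ps target requires 1000 phases, which shows why fine resolution calls for a large number of delayed signals.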

During circuit implementation, the pulse-shrinking/stretching mechanism caused by the aspect ratio mismatch among adjacent devices limits the realizable length of the clock delay line [36]. To accomplish picosecond-level resolution, at least hundreds of delay cells are required. After being fed into such a long delay line, the duty cycle of a high-frequency reference clock will be shrunk to 0% or stretched to 100% before reaching the end of the delay line. No delayed clock signal is generated by the remaining delay stages once the duty cycle reaches 0% or 100%, which ruins the TDC accuracy. In theory, the clock frequency can be lowered to get a larger pulse width so that the reference clock can propagate to the end of the delay line. However, the delay line must be lengthened correspondingly to maintain the same resolution, as revealed by (2). The impact of the pulse-shrinking/stretching mechanism increases proportionally, which spoils the effectiveness of lowering the clock frequency.

To solve this dilemma, the input signal can instead be made with a larger pulse width than the reference clock and fed into the delay line. The conceptual timing diagram is shown in Fig. 1. Since all the wrapped clocks quantize the same input signal, Tin can in theory be duplicated so that each clock is paired with one specific input signal (e.g., Ci with Tin,i), as depicted in Fig. 1(a). Then, we can align all clocks while shifting the input signals accordingly to keep exactly the same timing relation between each pair of signals Ci and Tin,i, as in Fig. 1(b). Equivalently, Tin is fed into the same delay line and all delayed input signals are quantized by the same reference clock to get the same output for the proposed TDC. The cost is a long dead time, since the TDC can produce the final conversion output only after Tin has propagated to the last delay stage.

Figure 1: Timing diagram with (a) delayed clocks and (b) delayed inputs
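The delay-wrapping idea can be illustrated numerically: taps along a delay line have delays that grow well beyond one clock period, but their phases modulo TCLK fall inside a single period and subdivide it. The Python sketch below assumes ideal, identical delay cells and integer picoseconds; the names and values are illustrative assumptions, and real FPGA delay lines are not this uniform.

```python
def wrapped_phases(cell_delay_ps, n_taps, t_clk_ps):
    """Sorted tap phases (k * cell_delay) wrapped into one clock period."""
    return sorted((k * cell_delay_ps) % t_clk_ps for k in range(n_taps))


def bin_widths(phases, t_clk_ps):
    """Gaps between adjacent wrapped phases, including the wrap-around gap."""
    gaps = [b - a for a, b in zip(phases, phases[1:])]
    gaps.append(t_clk_ps - phases[-1] + phases[0])
    return gaps
```

With an assumed 30 ps cell delay, 10 taps, and a 100 ps clock period, `wrapped_phases(30, 10, 100)` yields phases 0, 10, ..., 90 ps, so every bin is 10 ps wide, finer than the 30 ps cell delay and equal to TCLK/n.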

Another problem is raised by the above modification of delaying Tin instead of CLK. For a much finer resolution, the input delay line is expected to be very long, with a significant pulse-shrinking/stretching impact that limits the smallest measurable width of the input signal Tin. Consequently, a large TDC offset can be expected. To reduce the offset and logic utilization, a delay matrix with multiple short delay lines can be used for a single input Tin to generate a sufficient number of delayed signals, as revealed in Fig. 2(a). In theory, different delay cells or strict timing constraints need to be adopted for the vertical and horizontal delay lines to maximize the uniformity among the wrapped phases. Since both Tin and CLK can be delayed to generate the required phase shifts, a hybrid delay matrix, the so-called 2-D Vernier, is constructed to substantially reduce the number of delay cells from approximately H×V to H+V, as shown in Fig. 2(b).

Figure 2: (a) Delay matrix. (b) Hybrid delay matrix
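The saving from H×V to H+V delay cells can be sketched as follows. The Python model below assumes that the effective phase shift at position (i, j) is the input delay minus the clock delay, i·tau_in − j·tau_clk, which is the usual Vernier relation but is an assumption here rather than a detail given in the text; the names and delay values are hypothetical.

```python
def hybrid_matrix_delays(h, v, tau_in_ps, tau_clk_ps):
    """Relative input-vs-clock delay for each (i, j) pairing.

    i indexes taps on the H-stage input delay line and j indexes taps on
    the V-stage clock delay line (assumed Vernier relation).
    """
    return [[i * tau_in_ps - j * tau_clk_ps for j in range(v)]
            for i in range(h)]


def cell_counts(h, v):
    """Delay cells needed by a full delay matrix versus the hybrid one."""
    return {"full_matrix": h * v, "hybrid": h + v}
```

With assumed values h = 4, v = 3, tau_in = 30 ps, and tau_clk = 10 ps, the 12 pairings produce 12 distinct relative delays spaced 10 ps apart (from −20 ps to 90 ps) using only 7 delay cells instead of 12.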

One feasible way to evenly distribute the phases among the reference clocks is to use the FPGA's embedded multi-output phase-locked loop (PLL) for phase division, as depicted in Fig. 3. Only one H-stage delay line is then used.

Figure 3: Hybrid delay matrix with PLL for clock phase division.

ADVANTAGES:

  • Better performance (2.5-ps bin size with 4.99–6.72 ps rms resolution)

SOFTWARE IMPLEMENTATION:

  • ModelSim
  • Xilinx ISE