A New CDMA Encoding/Decoding Method for on-Chip Communication Network



As a high-performance on-chip communication method, the code division multiple access (CDMA) technique has recently been applied to networks on chip (NoCs). We propose a new standard-basis-based encoding/decoding method to improve the performance and cost of CDMA NoCs in terms of area, power consumption, and network throughput. In the transmitter module, source data from different senders are separately encoded with an orthogonal code of a standard basis, and these coded data are mixed together by an XOR operation. Then, the sums of data can be transmitted to their destinations through the on-chip communication infrastructure. In the receiver module, a sequence of chips is retrieved by taking an AND operation between the sums of data and the corresponding orthogonal code. After a simple accumulation of these chips, the original data can be reconstructed. We implement our encoding/decoding method and apply it to a CDMA NoC with a star topology. Compared with the state-of-the-art Walsh-code-based (WB) encoding/decoding technique, our method achieves up to 67.46% power saving and 81.24% area saving, together with a 30%–50% reduction in encoding/decoding latency. Moreover, CDMA NoCs of different sizes applying our encoding/decoding method gain power saving, area saving, and maximal throughput improvement of up to 20.25%, 22.91%, and 103.26%, respectively, over the WB CDMA NoC. The logic size, area, and power consumption of the proposed architecture are analyzed using Xilinx 14.2.


The previously proposed CDMA NoCs are based on a digital encoding and decoding method requiring that the spreading codes have both the orthogonality and balance properties. To this end, the Walsh code is typically used. However, the Walsh-code-based (WB) encoding and decoding method has inherent shortcomings, given as follows.

1) Design Complexity: In the encoding method, an arithmetic addition logic unit, whose logic overhead increases with the number of senders, is used to mix coded data together. In the decoding method, a key demux-accumulation-compare unit is used to retrieve the source data from mixed data chips (in this brief, each bit of a spreading code is called a chip, and thus the encoded data is called data chips). This unit is, however, area-consuming.

2) Low Code Utilization: In an S-chip Walsh code set, S must be equal to 2^N, where N is a natural number, and at most S − 1 sequences can be used to encode the original data. This results in a waste of sequences in the code set. For example, a 16-node network needs a 32-chip Walsh code set, because a 16-chip Walsh code set can only provide 15 sequences for data encoding and thus cannot satisfy the requirement of 16 sequences, one for each node.



  • High area
  • High power


The basic structure of applying the CDMA technique to a NoC with a star topology is shown in Fig. 1. In this figure, a PE executes tasks of the application, and the network interface (NI) divides data flows from the PE into packets and reconstructs data flows from packets received over the NoC.

In the sender, packet flits from NI are transformed to a sequential bit stream via a parallel-to-serial (P2S) module. This bit stream is encoded with an orthogonal code in the Encoding module (E in Fig. 1). The coded data from different encoding modules are added together in the Addition module (A in Fig. 1). Then, the sums of data chips are transmitted to receivers. In the receiver, Decoding modules (D in Fig. 1) reconstruct original data bits from the sums of data chips. Then these sequential bit streams are transformed to packet flits by serial-to-parallel (S2P) modules. Finally, these packet flits are transferred to NI.

Fig. 1. Structure of CDMA NoC.

CDMA Encoder

Two different encoding methods, WB encoder and SB encoder, are compared in Fig. 2.

Fig. 2(a) shows the WB encoder architecture. An original data bit is first encoded with a Walsh code by taking an XOR operation. Then, these encoded data are added up to a multibit sum signal by taking arithmetical additions. Each sender needs an XOR gate, and multiple wires are used to express the sum signal if we have two or more senders. Moreover, the number of wires increases as the number of senders increases.

Fig. 2. Block diagram of encoding scheme. (a) WB encoder. (b) SB encoder.

Fig. 2(b) shows our SB encoding scheme. An original data bit from a sender is fed into an AND gate in a chip-by-chip manner, and it is spread to n-chip encoded data with an orthogonal code of a standard basis. The relationship between a bit and a chip is shown in Fig. 3. Then, the encoded data from different senders are mixed together through an XOR operation, and a binary sum signal is generated. Therefore, the output is always a binary sequence that can be transferred to its destination over a single wire. The progressions of both encoding schemes are depicted in Fig. 3.
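As a concrete illustration of the SB encoder's AND-spread and XOR-mix steps, here is a minimal Python sketch. The function name and the 4-sender, 4-chip configuration are illustrative, not taken from the brief:

```python
def sb_encode(bits, codes):
    """Spread each sender's data bit with its standard-basis code (AND),
    then mix the coded chips from all senders with XOR into one binary
    sum sequence that fits on a single wire."""
    n = len(codes[0])
    mixed = []
    for chip in range(n):
        s = 0
        for bit, code in zip(bits, codes):
            s ^= bit & code[chip]   # AND spreads, XOR mixes
        mixed.append(s)
    return mixed

# Standard basis for 4 senders: each code has a single 1-chip.
codes = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(sb_encode([1, 0, 1, 1], codes))  # -> [1, 0, 1, 1]
```

With a standard basis, the mixed output is effectively a time-interleaving of the senders' bits, which is why a single binary wire suffices.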

Fig. 3. Data encoding example. (a) WB encoding. (b) SB encoding.

Fig. 3(a) and (b) illustrates the WB encoding process with four-chip Walsh codes and the SB encoding process with four-chip standard orthogonal codes, respectively.

CDMA Decoder

The WB decoding scheme is presented in Fig. 4(a). According to the chip value of the Walsh code, the received multibit sums are accumulated into a positive part (if the chip value is 0) or a negative part (if the chip value is 1). Therefore, the two accumulators in the WB decoder each contain a multibit adder to accumulate the incoming chips and a group of registers to hold the accumulated value. Through the comparison module after the two accumulators, the original data is reconstructed: if the value of the positive part is larger, the original data is 1; otherwise, the original data is 0.
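To make the accumulate-and-compare mechanism concrete, the following Python sketch models WB encoding and decoding with 4-chip Walsh codes. This is a simplified software model of the scheme, not the hardware described in the brief:

```python
def wb_encode(bits, codes):
    """Each sender XORs its data bit with every chip of its Walsh code;
    the chips from all senders are summed arithmetically, so each chip
    position carries a multibit value (multiple wires in hardware)."""
    n = len(codes[0])
    return [sum(b ^ code[c] for b, code in zip(bits, codes))
            for c in range(n)]

def wb_decode(sums, code):
    """Accumulate the sums where the code chip is 0 into the positive
    part and where it is 1 into the negative part, then compare."""
    pos = sum(s for s, c in zip(sums, code) if c == 0)
    neg = sum(s for s, c in zip(sums, code) if c == 1)
    return 1 if pos > neg else 0

# Two senders with 4-chip Walsh codes, sending bits 1 and 0.
codes = [[0, 1, 0, 1], [0, 0, 1, 1]]
sums = wb_encode([1, 0], codes)
print(sums)                                 # -> [1, 0, 2, 1]
print([wb_decode(sums, c) for c in codes])  # -> [1, 0]
```

Orthogonality of the Walsh codes guarantees that the cross-sender contributions cancel in the positive/negative comparison, so each original bit is recovered exactly.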

Fig. 4. Block diagram of decoding scheme. (a) WB decoder. (b) SB decoder.

The SB decoding scheme is shown in Fig. 4(b). When the binary sum signal arrives at a receiver, an AND operation is taken between the binary sum and the corresponding orthogonal code in a chip-by-chip manner.
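The chip-by-chip AND followed by the simple accumulation can be sketched in a few lines of Python. Since a standard-basis code has a single 1-chip, at most one gated chip is nonzero, so an OR-accumulation recovers the bit (names are illustrative):

```python
def sb_decode(mixed, code):
    """Gate the binary sum with the sender's standard-basis code (AND)
    chip by chip, then accumulate the surviving chips; with a standard
    basis at most one chip is nonzero, so OR-accumulation suffices."""
    acc = 0
    for s, c in zip(mixed, code):
        acc |= s & c
    return acc

# Decode the mixed stream for each of four senders.
mixed = [1, 0, 1, 1]
codes = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print([sb_decode(mixed, c) for c in codes])  # -> [1, 0, 1, 1]
```

Compared with the WB decoder's adders, registers, and comparator, this is a single AND gate plus a trivial accumulator per receiver, which is the source of the reported area and power savings.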



  • Less area
  • Less power


  • Modelsim
  • Xilinx ISE

Memory-Aware Loop Mapping on Coarse-Grained Reconfigurable Architectures



Coarse-grained reconfigurable architectures (CGRAs) are a promising class of architectures offering high performance and high power efficiency. The compute-intensive parts of an application (e.g., loops) are often mapped onto the CGRA for acceleration. Due to the extra overhead of memory access and the limited communication bandwidth between the processing element (PE) array and local memory, previous works trying to solve the routing problem are mainly confined to the internal resources of PE arrays (e.g., PEs and registers). Inevitably, routing with PEs or registers consumes a lot of computational resources and increases the initiation interval. To solve this problem, this paper makes two contributions: 1) establishing a precise formulation for the CGRA mapping problem while using shared local data memory as a routing resource and 2) extracting an effective approach for mapping loops to CGRAs. The experimental results on loops from SPEC2006, Livermore, and MiBench show that our approach (called MEMMap) can improve the performance of kernels on CGRA by up to 1.62×, 1.58×, 1.28×, and 1.23× compared with edge-centric modulo scheduling, EPIMap, REGIMap, and force-directed map, respectively, with an acceptable increase in compilation time. The logic size, area, and power consumption of the proposed architecture are analyzed using Xilinx 14.2.


Coarse-grained reconfigurable architectures (CGRAs) are becoming a very attractive platform because of their high computational performance, high flexibility, and low power dissipation. A CGRA typically consists of an array of processing elements (PEs), a shared local data memory, and a context memory. CGRAs have four major architectural parameters:

  1. PE array size;
  2. interconnection topology of PEs;
  3. local data memory architecture;
  4. configuration method of PEs.

Since it is an indisputable fact that a program spends 90% of its execution time on only 10% of its code (loops), the most important task for a CGRA compiler is to map loops onto the CGRA efficiently and automatically. Software pipelining is usually adopted to exploit more loop-level parallelism; it improves the parallelism of loops by overlapping loop iterations. Modulo scheduling is widely used for software pipelining. The goal of modulo scheduling is to find a valid mapping with a minimum initiation interval (MII), where the II represents the delay between the initiations of successive iterations of the loop. The II is inversely proportional to performance and is limited by both the resource constraints and the loop-carried dependences. In CGRAs, the existing mapping techniques usually use PEs and registers to route data dependences. However, if PEs and registers are overused for routing, there will inevitably be a shortage of PEs for computation, and ultimately a large II. In particular, when the loop contains loop-carried dependences, the number of PEs or registers used for routing increases dramatically. Therefore, to improve the performance of applications mapped onto CGRAs, the key is to reduce the occupation of PEs and registers for routing dependences, so that the II can be reduced. As an on-chip storage resource, local data memory is an alternative to PEs or registers for routing data dependences.
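The MII bound discussed above follows the standard modulo-scheduling formula MII = max(ResMII, RecMII), where ResMII comes from resource pressure and RecMII from loop-carried dependence cycles. A small numeric sketch (the numbers are illustrative, not from the paper):

```python
from math import ceil

def min_initiation_interval(n_ops, n_pes, rec_constraints):
    """MII = max(ResMII, RecMII). ResMII = ceil(#ops / #PEs);
    RecMII = max over dependence cycles of ceil(delay / distance)."""
    res_mii = ceil(n_ops / n_pes)
    rec_mii = max((ceil(delay / dist) for delay, dist in rec_constraints),
                  default=1)
    return max(res_mii, rec_mii)

# 10 operations on a 4-PE array, one dependence cycle with delay 3 and
# iteration distance 1 -> MII = max(ceil(10/4), 3) = 3.
print(min_initiation_interval(10, 4, [(3, 1)]))  # -> 3
```

This makes the paper's point concrete: every PE consumed by routing raises the effective operation count, pushing ResMII (and hence the II) upward.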

Although local data memory is a kind of storage resource just like registers, the hardware structure and access manner of local data memory are quite different from those of registers. First, registers are distributed across the PE array, and each register can only be accessed by its neighboring PEs, while the local data memory is centralized and can be accessed by all PEs. Second, data transfer by registers between two PEs is done in a hop-by-hop manner, which occupies multiple registers, while data transfer through local memory only involves a pair of load and store operations. Last but most importantly, the latency of a memory access is usually longer than that of a logic operation, while the latency of a register access is usually the same as that of a logic operation. Due to these three major differences, we cannot simply treat local data memory as an extension of the registers and use an existing mapping approach, e.g., REGIMap [5], to utilize local data memory for routing data dependences.


  • Power efficiency is low
  • Low performance


Target Architecture

In the past two decades, more than a dozen CGRAs have been proposed, covering a wide range of application fields, e.g., ADRES, MorphoSys, Silicon Hive, and Raw. The proposed CGRAs differ in four major architectural parameters: 1) PE array size; 2) interconnection topology of PEs; 3) local data memory architecture; and 4) configuration method of PEs. To give our approach wide applicability, we summarize these four parameters and parameterize the problem formulation with them.

Fig. 1. CGRA-based computing platform, including a 4 × 4 2-D mesh PE array-based CGRA, a host controller, and a main memory.

If we want to use local memory as a routing resource, we need to solve two main problems. The first is to select which dependences should be replaced by memory operations and which CSs in the kernel should be allocated to the inserted memory operations. The second is to find a search strategy with low complexity for finding a valid mapping. After some analysis, we find that the first problem is equivalent to finding an efficient method to replace the dependences with load and store operation pairs and schedule them at suitable CSs in the kernel. Therefore, we switch our perspective to the rows of the kernel (all the operations at the same CS in the kernel constitute a row of the kernel).

Fig. 2. (a) DDG. (b) Kernel with IICS = 4. (c) Treated kernel when IICS = 4, Nm = 1. (d) Treated kernel when IICS = 4, Nm = 1. (e) Treated kernel when IICS = 4, Nm = 2, lc = {2, 3}. (f) Difference between IICS and II when σ = 2 with the kernel in (e).

For example, in Fig. 2(c) and (d), the Nm is equal to 1, because only a single row (third row and fourth row, respectively) contains the memory operations. After applying this formulation to the kernel in Fig. 2(e), we have achieved the IIs of it, as shown in Fig. 2(f).

For the case σ = 1 (including the case that the CGRA supports the partial reconfiguration), the memory operations should be placed in a distributed way. For example, the load operation aL in Fig. 2(c) should be placed at the same row as the operation d, because it will reduce the number of PEs used for routing the loaded data to operation e and a. In this case, the load operations should be placed at the locations that are close to their target operations, while the store operations should be close to their source operations.

Dependence Replacing

When a pair (IICS, Nm ) is given, we need to select the dependence that should be routed by memory and rows, where the memory operations should be inserted. Considering the fact that Nm is not bigger than IICS, there may be many combinations of rows, where the memory operations can be allocated in the kernel. For example, there are six combinations when (IICS, Nm ) = (4, 2), namely, (0, 1), (0, 2), (0, 3), (1, 2), (1, 3), and (2, 3). A combination (x, y) means that only xth row and yth row of kernel are allowed to place the memory operations. For example, (2, 3) means that only second row and third row are allowed to place the memory operations, as shown in Fig. 2(e).
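The row combinations enumerated above are simply the ways of choosing Nm rows out of IICS, which is easy to check in Python:

```python
from itertools import combinations

def row_combinations(ii_cs, n_m):
    """All ways to choose the N_m kernel rows that may hold memory
    operations, out of II_CS rows in the kernel."""
    return list(combinations(range(ii_cs), n_m))

print(row_combinations(4, 2))
# -> [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```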

Search Strategy

So far, we have solved the problem of which dependences should be replaced when a pair (IICS, Nm) is given. To make the search in the space of (IICS, Nm) efficient, we design a search strategy. To explain all the rules in our search strategy clearly, we give an example: we map the DDG in Fig. 4(c) onto the 2 × 2 CGRA in Fig. 3(c), whose local memory has 2-read and 1-write ports, and the latency of a memory operation is double that of a logic operation (σ = 2).

Fig. 3. (a) DDG. (b) Example to explain the rules for inserting load/store operations. (c) 2 × 2 CGRA with a memory. (d) Example to explain the rule to place the load/store operations.

Next, we define a dynamic collection IIset to record all the search nodes in the current search space. To match them with the combinations (IICS, Nm), we design the IIset as a 3-tuple (line 9). Then, our approach attempts to find a valid mapping in the while loop. In each iteration, it chooses the combination with the MII from the present search space (line 11) and processes the input DDG with the chosen IICS and Nm. Then, our algorithm attempts to do P&R with the replaced DDG. If it fails, the algorithm performs the following steps.

1) If the condition Nm = MNm is satisfied, we need to update the dynamic search space with two new search nodes: 1) (IICS+1, MNm) and 2) (IICS, MNm+1), where IICS is the number of CSs in the present iteration.

2) If the condition Nm = IICS is satisfied, the algorithm only needs to delete the current search node from the dynamic search space because of the limitation of Nm ≤ IICS.

3) If neither condition is satisfied, our algorithm updates the current search node to (IICS, Nm + 1). If the P&R succeeds, we exit because a valid mapping has been achieved.
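Putting the three rules together, the control flow of the search might be sketched as follows. Here `try_map` is a hypothetical stand-in for dependence replacement plus P&R, MNm is taken to be 0, and the priority ordering is a plain min-heap on IICS, so this is a sketch of the search shape rather than the paper's exact algorithm:

```python
import heapq

def memmap_search(mii, max_nm, try_map):
    # Search nodes are (II_CS, N_m); pop the node with minimum II first.
    space = [(mii, mii, 0)]           # (priority, II_CS, N_m)
    seen = {(mii, 0)}
    while space:
        _, ii, nm = heapq.heappop(space)
        mapping = try_map(ii, nm)     # dependence replacement + P&R
        if mapping is not None:
            return ii, nm, mapping    # valid mapping achieved
        if nm == 0:                   # rule 1: open two new nodes
            new_nodes = [(ii + 1, 0), (ii, 1)]
        elif nm == ii:                # rule 2: drop node (N_m <= II_CS)
            new_nodes = []
        else:                         # rule 3: deepen N_m
            new_nodes = [(ii, nm + 1)]
        for node in new_nodes:
            if node not in seen and node[1] <= max_nm:
                seen.add(node)
                heapq.heappush(space, (node[0],) + node)
    return None

# Toy oracle: pretend P&R only succeeds once II_CS reaches 5.
found = memmap_search(4, 4, lambda ii, nm: "map" if ii >= 5 else None)
print(found[:2])  # -> (5, 0)
```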


In a CGRA-based computing platform, data are transferred between the PE array and main memory through the local data memory. Thus, the local memory is not only required to serve the PE array, but also to exchange data with the main memory. To improve memory access efficiency, we design a memory management mechanism based on double buffering. With this mechanism, the data exchange between the PE array and local memory proceeds in parallel with the data exchange between local memory and main memory, so that communication time can be hidden.

In this mechanism, the local data memory is divided into four parts, marked as PI1, PI2, PII1, and PII2, as shown in Fig. 4(c). Each part alternately serves the PE array and exchanges the data with the main memory.

Fig. 4. Example of memory management: (a) Pseudocode of loop, (b) Execution flow of the loop in (a), (c) The process of memory access for the groups from k to k+3.

The data exchange between local memory and main memory is carried out by a multichannel DMA. Each memory part corresponds to an independent channel so that each memory part can have an individual memory access pattern. The host controller configures, initializes, and triggers the DMA read and write operations.


  • High Power efficiency
  • High performance


  • Modelsim
  • Xilinx ISE

PEVA: A Page Endurance Variance Aware Strategy for the Lifetime Extension of NAND Flash



With aggressive scaling and multilevel cell technology, the reliability of NAND flash continuously degrades. The lifetime of NAND flash is highly restricted by the bit error rate (BER), and error-correcting codes (ECCs) can provide only limited error correction capability to tolerate increasing bit errors. To cope with this issue, a novel page endurance variance aware (PEVA) strategy is proposed to extend the lifetime of NAND flash, based on the experimental observations from our hardware–software codesigned experimental platform. These observations indicate that the BER distribution of retention errors shows distinct variances across pages. The key purpose of PEVA is to exploit the lifetime potency of every page in a block by introducing fine-grained bad page management instead of coarse-grained bad block management (BBM). The experimental results show that PEVA can extend the lifetime of 2×-nm NAND flash by 9.8× compared with the conventional BBM, with at most an 8.7% degradation in writing speed compared with the traditional sector mapping technology. In addition, the maximum writing response time increases by at most 5.9% under the PEVA strategy. The logic size, area, and power consumption of the proposed architecture are analyzed using Xilinx 14.2.


Bad blocks are blocks that contain one or more invalid bits whose reliability is not guaranteed. Bad blocks may be present when the device is shipped, or may develop during the lifetime of the device. Devices with bad blocks have the same quality level and the same AC and DC characteristics as devices where all the blocks are valid. A bad block does not affect the performance of valid blocks because it is isolated from the bit line and common source line by a select transistor. Bad block management, block replacement, and the error correction code (ECC) software are necessary to manage the error bits in NAND Flash devices.

Recognizing Bad Blocks

NAND Flash devices are supplied with all the locations inside valid blocks erased (FFh). The bad block information is written prior to shipping. For single-level cell (SLC) small page (528-byte/256-word page) devices, any block where the sixth byte (x8 device)/first word (x16 device) in the spare area of the first page does not contain FFh is a bad block. For SLC large page (2112-byte/1056-word page) devices, any block where the first or sixth byte (x8 device)/first word (x16 device) in the spare area of the first page does not contain FFh is a bad block. For SLC very large page (4224-byte page) devices, any block where the first or sixth byte in the spare area of the first page does not contain FFh is a bad block.

For multilevel cell (MLC) devices, any block where the first byte in the spare area of the last page does not contain FFh is a bad block. The bad block information must be read before any erase is attempted, because the bad block information is erasable and cannot be recovered once erased. It is highly recommended not to erase the original bad block information. To allow the system to recognize the bad blocks based on the original information, it is recommended to implement the bad block management algorithm shown in Figure 1.
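For the SLC large-page case above, the byte-level check might look like this in Python. The byte positions follow the x8 description (counted from 1 in the text, from 0 in code); the function name and spare-area size are illustrative:

```python
def is_bad_block_slc_large_page(spare_page0):
    """SLC large-page (2112-byte) check, x8 device: the block is bad
    when byte 1 or byte 6 of the first page's spare area is not 0xFF."""
    return spare_page0[0] != 0xFF or spare_page0[5] != 0xFF

good = bytes([0xFF] * 64)                    # fully erased spare area
bad = bytes([0x00]) + bytes([0xFF] * 63)     # factory bad-block mark
print(is_bad_block_slc_large_page(good))     # -> False
print(is_bad_block_slc_large_page(bad))      # -> True
```

In a real driver this check runs once per block at first boot to build the bad block table, before any erase can destroy the factory marks.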

Figure 1: Software Tool Chain

The bad block table is created by reading all the spare areas in the NAND Flash memory. The bad block recognition methods that build the bad block table without using the original bad block information provided in the spare areas of the memory are not equally effective. The invalid blocks are detected at the factory during the testing process, which involves severe environmental conditions and PROGRAM/ERASE cycles as well as proprietary test modes. The failures that affect invalid blocks may not all be recognized if methods different from those implemented in the factory are used. Once created, the bad block table is saved to a good block so that on rebooting the NAND Flash memory the bad block table is loaded into RAM. The blocks contained in the bad block table are not addressable. So, if the flash translation layer (FTL) addresses one of the bad blocks, the bad block management software redirects it to a good block.

Block Replacement

Figure 2: Bad Block Management Flow Chart

During the lifetime of the NAND device additional bad blocks may develop. NAND devices have a status register that indicates whether an operation is successful. Additional bad blocks are identified when attempts to PROGRAM or ERASE give errors in the status register. As the failure of a PAGE PROGRAM operation does not affect the data in other pages in the same block, the block can be replaced by reprogramming the current data and copying the rest of the replaced block to an available valid block. Blocks can be marked as bad and new blocks allocated using two general methods:

  • Skip block
  • Reserve block

Skip Block Method

In the skip block method the algorithm creates the bad block table and when the target address corresponds to a bad block address, the data is stored in the next good block, skipping the bad block. When a bad block is generated during the lifetime of the NAND Flash device, its data is also stored in the next good block. In this case, the information that indicates which good block corresponds to each developed bad block must also be stored in the NAND Flash device.
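A minimal sketch of skip-block address translation, assuming the bad block table is held as a set of physical block numbers (names and the linear scan are illustrative; a real FTL would cache this mapping):

```python
def skip_block_translate(logical_block, bad_blocks):
    """Map a logical block to a physical block by skipping bad blocks:
    the n-th logical block lands on the n-th good physical block."""
    phys = 0
    remaining = logical_block
    while True:
        if phys not in bad_blocks:
            if remaining == 0:
                return phys
            remaining -= 1
        phys += 1

# With blocks 1 and 4 bad, good blocks are 0, 2, 3, 5, ...
# so logical block 3 maps to physical block 5.
print(skip_block_translate(3, {1, 4}))  # -> 5
```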

Reserve Block Method

In the reserve block method, the bad block table is created in the same manner as described in Figure 2. In this method bad blocks are not skipped but replaced by good blocks by redirecting the FTL to a known free good block. For that purpose, the bad block management software creates two areas in the NAND Flash: The user addressable block area and the reserved block area as shown in Figure 3.

Figure 3: Reserved Block Method


  • The lifetime of flash memory is less



We first introduce our PEVA strategy and then present the additional considerations for it. In a NAND flash block, the BPP number is much greater than the WPP, so the lifetime improvement can be considerable. The PEVA techniques can be used in any device whose FTL is managed with page granularity.

Implementation in FTLs

Here, we use a simple and clear sector mapping algorithm to illustrate the PEVA strategy. In the sector mapping, every logical sector is mapped to a corresponding physical sector. Therefore, if the file system recognizes m logical sectors, the raw size of the logical-to-physical mapping table is m. In addition, an address translation mapping table and a physical sector number (psn) status table are required during the FTL operations. The psn status table is stored both in the spare area of the NAND flash and in the RAM cache.

The PEVA strategy issues read operations after the file system initiates a read request. Algorithm 1 describes the process of read operations; it has only one input: a logical sector number (lsn). Algorithm 1 primarily consists of two steps:

  1. read page data from NAND flash
  2. identify the bad page.

Fig. 4 shows an example of a reading operation using the PEVA strategy. In this example, it is assumed that a block is composed of four pages, which results in four psn in a data block. When the file system performs the command read (9, A), it must read the data A from lsn 9. According to the address translation mapping table, the FTL algorithm reads the data from psn 7. Then, the BCH ECC will be executed.
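The read path from the Fig. 4 example can be sketched as follows. The table layouts and names are illustrative stand-ins for the paper's Algorithm 1, and the BCH ECC outcome is modeled as a simple boolean:

```python
def peva_read(lsn, mapping, psn_status, flash):
    """Translate lsn -> psn, read the page, then check ECC; on an
    uncorrectable error the page (not the whole block) is marked bad."""
    psn = mapping[lsn]               # address translation mapping table
    data, ecc_ok = flash[psn]        # (payload, modeled ECC outcome)
    if not ecc_ok:
        psn_status[psn] = 'bad'      # fine-grained bad-page marking
        return None
    return data

# read(9, A): lsn 9 maps to psn 7, and the ECC check passes.
mapping = {9: 7}
flash = {7: ('A', True)}
status = {7: 'good'}
print(peva_read(9, mapping, status, flash))  # -> A
```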

Fig. 4. Processing of read operations in the PEVA strategy.

Fig. 5 shows an example of a writing operation using the PEVA strategy. When the file system executes the command write (7, B), it writes the data B to lsn 7. We find that the corresponding psn is psn 8 by checking the mapping table. However, the status of psn 8 is bad. The PEVA strategy will find another free psn in the psn status table.
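The write path can be sketched similarly; again the psn status table and its states are illustrative, and the free-page search is a simple linear scan rather than the paper's exact policy:

```python
def peva_write(lsn, data, mapping, psn_status, flash):
    """Write via the mapping table; if the mapped psn is marked bad,
    redirect the write to the next free psn and update the mapping."""
    psn = mapping[lsn]
    if psn_status.get(psn) == 'bad':
        psn = next(p for p, s in sorted(psn_status.items()) if s == 'free')
    flash[psn] = data
    psn_status[psn] = 'used'
    mapping[lsn] = psn
    return psn

# write(7, B): lsn 7 maps to psn 8, but psn 8 is bad, so the data is
# redirected to the next free page, psn 10.
mapping = {7: 8}
status = {8: 'bad', 9: 'used', 10: 'free'}
flash = {}
print(peva_write(7, 'B', mapping, status, flash))  # -> 10
```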

Fig. 5. Processing of write operations in the PEVA strategy.

An important process shared by the PEVA and the traditional BBM is identifying a bad page or block by means of ECC.

Additional Considerations for the PEVA Strategy

By introducing a bad status for each page, PEVA moves from coarse-granularity block management to fine-granularity page management, which can exploit the full lifetime potency of each page. As more and more pages are marked bad over time, the storage capacity decreases linearly. Some applications have high storage capacity demands. To balance lifetime enhancement against storage capacity, we propose a CT-PEVA strategy.



  • Increase the lifetime of flash memory


  • Modelsim
  • Xilinx ISE

Area-Aware Cache Update Trackers for Postsilicon Validation



The internal state of complex modern processors often needs to be dumped out frequently during postsilicon validation. Since the caches hold most of this state, the volume of data dumped and the transfer time are dominated by the large caches in the architecture. The limited bandwidth available for transferring the contents of these large caches off-chip results in stalling the processor for long durations during each dump. To alleviate this, we propose to transfer only those cache lines that were updated since the previous dump. Since maintaining a bit-vector with a separate bit to track the status of each cache line is expensive, we propose two methods: 1) one in which a bit tracks multiple cache lines and 2) an Interval Table that stores only the starting and ending addresses of continuous runs of updated cache lines. Both methods require significantly less space than a bit-vector and allow the designer to choose the amount of space to allocate for this design-for-debug feature. The cost of reducing storage space is that some nonupdated cache lines are dumped as well; we attempt to minimize this overhead. For physically distributed caches, we propose a scheme to share the cache update tracking hardware (or Update Trackers) across multiple caches so that it is replicated fewer times, thereby limiting the area overhead. We show that the proposed Update Trackers occupy less than 1% of cache area for both shared and distributed caches. The logic size, area, and power consumption of the proposed architecture are analyzed using Xilinx 14.2.


Among the several interesting design-for-debug (DFD) features used in the industry today is a dumping mechanism, in which the entire state of the processor is transferred off-chip to the debug infrastructure to be analyzed offline. However, such DFD hardware needs to operate under tight area constraints and should cause minimal interference with the normal execution of the processor. Thus, the goals of an ideal DFD hardware supporting efficient dumping during postsilicon validation are:

  1. minimal intrusiveness;
  2. minimal space requirements;
  3. maximum visibility into the chip.

These requirements are clearly orthogonal, and balancing the three is a complex task.

The entire processor state consists of all caches and registers in various structures such as pipelines, register files, translation look aside buffers (TLBs), and reorder buffers. Capturing this state at regular intervals and analyzing sequences of such snapshots offline gives crucial hints on possible causes of errors. However, due to the large sizes of the last-level caches, transferring each snapshot off-chip is a time consuming process. We also require that, during the dumping phase, the processor should not update the cache, since it may lead to inconsistencies. Therefore, the processor is stalled during the dumping phase. The duration of processor stalls can be reduced by decreasing the amount of data that is required to be transferred off-chip. Such reduction in off-chip transfers also limits the perturbations experienced by the test workloads due to the debug infrastructure. This significantly improves reproducibility of intermittent bugs.


  • High area overhead



Definition 1: A bit-vector corresponds to a sequence of 0s and 1s, where 1 indicates that the corresponding cache line was modified after the previous cache dump and 0 indicates otherwise. Unless otherwise mentioned, the bit-vector maintains one bit per cache line.

Definition 2: We define the overhead as the number of nonupdated cache lines that are transferred off-chip, expressed as a percentage of the total number of cache lines. It is quantified as ((y − x)/N) × 100%, where x is the number of updated lines, y is the number of dumped lines, and N is the total number of lines in the cache.
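Definition 2 is a one-line computation; the worked numbers below are illustrative:

```python
def overhead_percent(updated, dumped, total):
    """Overhead = nonupdated lines dumped, as a share of all lines:
    ((y - x) / N) * 100%."""
    return (dumped - updated) / total * 100.0

# 12 lines updated but 16 dumped from a 128-line cache:
# (16 - 12) / 128 * 100 = 3.125% overhead.
print(overhead_percent(12, 16, 128))  # -> 3.125
```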

Method 1: t-Lines/Bit Bit-Vector

This method maps each bit to t (>1) adjacent cache lines. A bit is set to 1 if any of the t cache lines to which it corresponds is updated after the previous cache dump. This method reduces the amount of storage required by a factor of t. A designer has the flexibility to choose any value of t that satisfies the area budget. A large value of t increases the overhead due to an increase in the number of non-updated cache lines in the proximity of an updated cache line. In Fig. 1, columns (b), (d), and (f) illustrate the t-lines/bit bit-vector for t = 2 corresponding to the bit-vectors shown in columns (a), (c), and (e), respectively. The respective overheads incurred are shown in the bottom row.
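The t-lines/bit compression and its dump cost can be sketched as follows (illustrative helper names; the bit-vector length is assumed to be a multiple of t):

```python
def t_lines_per_bit(bit_vector, t=2):
    """Compress a per-line bit-vector: one bit covers t adjacent lines
    and is set if any of them was updated since the last dump."""
    return [int(any(bit_vector[i:i + t]))
            for i in range(0, len(bit_vector), t)]

def lines_dumped(compressed, t=2):
    """Each set bit forces all t of its lines to be dumped."""
    return sum(compressed) * t

bv = [1, 0, 0, 0, 1, 1, 0, 0]            # 3 updated lines out of 8
print(t_lines_per_bit(bv))               # -> [1, 0, 1, 0]
print(lines_dumped(t_lines_per_bit(bv))) # -> 4
```

Here 3 lines were updated but 4 are dumped, so by Definition 2 the overhead is (4 − 3)/8 × 100% = 12.5%.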

Fig. 1. (a), (c), and (e) Examples of bit vectors that capture the information on updated cache lines. (b), (d), and (f) The corresponding 2-lines/bit bit vector. The shaded portion of bit vector shows the dumped lines using the interval table with k = 1.

Method 2: Greedy Algorithm

We propose a Greedy online algorithm to capture the information on updated cache lines into Interval Table I[k]. When the number of runs of 1’s exceeds k, we merge the adjacent intervals to form a larger interval. Since this action causes us to include some 0’s in the interval, we select for merging two adjacent intervals with the minimum Gap between them. Of course, some of the included 0’s may get updated to 1 in the future due to spatial locality of accesses.

Definition 3: Local Gap is the distance of a newly updated cache line to a boundary of its nearest runs of 1’s. minLocalGap is the minimum among the two local gaps (one to each boundary). Global Gap refers to the distance between the adjacent intervals already stored in the Interval Table I[k]. Similarly, minGlobalGap is the minimum among the k − 1 global gaps.
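A software sketch of the Greedy update follows. The real structure is a bounded hardware table, and the tie-breaking and coalescing details here are assumptions, but the core step matches the text: when the interval count exceeds k, merge the adjacent pair with the minimum gap:

```python
def greedy_insert(intervals, line, k):
    """Record a newly updated line as [line, line], coalesce touching
    runs of 1's, then enforce the k-interval budget by merging the
    adjacent pair with the minimum gap (taking some 0's into the run)."""
    intervals = sorted(intervals + [(line, line)])
    merged = [intervals[0]]
    for lo, hi in intervals[1:]:
        if lo <= merged[-1][1] + 1:           # touching or overlapping
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    while len(merged) > k:                    # close the minimum gap
        gaps = [merged[i + 1][0] - merged[i][1]
                for i in range(len(merged) - 1)]
        i = gaps.index(min(gaps))
        merged[i:i + 2] = [(merged[i][0], merged[i + 1][1])]
    return merged

# Inserting line 5 into a 2-entry table: three runs result, and the
# minimum gap (between [0,2] and [5,5]) is closed.
print(greedy_insert([(0, 2), (10, 12)], 5, 2))  # -> [(0, 5), (10, 12)]
```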

Method 3: Hybrid Algorithm

Definition 7: The Density (D) of a given window is defined as the number of intervals in the window of cache lines. A window with high density indicates large number of updated cache lines in the window that are not contiguous.

The Hybrid algorithm extends the Greedy algorithm by maintaining an additional bit-vector called auxiliary bit-vector (of size b_size). This bit-vector is used to store the update information of the densest window of size b_size. Since this region can change during the online operation of the cache, we associate a start address b_start with this bit-vector. The updates to the cache line at b_start and the next (b_size-1) lines are tracked by the auxiliary bit-vector. A temporary bit-vector is used for swapping intervals between the Interval Table and the auxiliary bit-vector, as and when the need arises. Since this temporary bit-vector is used only for swapping, we use a t-bit bit-vector with (t = 2) as the temporary bit-vector to limit the area overhead.

The Hybrid algorithm extends the Greedy algorithm in two ways:

1) The auxiliary bit-vector filters the update requests before sending them to the Interval Table.

2) Update information from the bit-vector is swapped with that of the Interval Table in case it leads to freeing up of intervals in the Interval Table.
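A minimal sketch of the filtering step (item 1 above), with hypothetical names `b_start`, `b_size`, and `aux_bits`: an update falling inside the densest window is absorbed by the auxiliary bit-vector and never reaches the Interval Table.

```python
def hybrid_filter(line, b_start, b_size, aux_bits):
    """Return True if the update was absorbed by the auxiliary bit-vector."""
    if b_start <= line < b_start + b_size:
        aux_bits[line - b_start] = 1   # tracked in the dense window
        return True                    # filtered: no Interval Table access
    return False                       # falls through to the Greedy table
```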


The straightforward way to track updated cache lines in distributed caches is to maintain the update information of each cache separately. This results in a linear increase of area overhead with the increasing number of caches.

One way to reduce area overhead is to share the Update Trackers across multiple caches. This requires the Update Trackers to be replicated only Nt = (N/c) times (instead of N times) where N is the total number of caches and c (≥1) is the number of caches sharing the structure.

Horizontal Sharing

Fig. 2 shows an example of two caches A and B sharing a single Update Tracker (bit-vector and Interval Table). The bit-vector and the Interval Table shown in Fig. 2(b) hold the update information of both the caches (A and B) shown in Fig. 2(a). The additional lines that are dumped from each cache due to sharing the bit-vector are marked in blue. The shared bit-vector is the bitwise-OR of the exclusive bit-vectors of A and B.
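In miniature, and with made-up 8-line bit-vectors, the shared vector is simply the bitwise OR of the exclusive vectors, which is why every line set in either cache gets dumped:

```python
vec_a = 0b00100110   # lines updated in cache A (assumed example values)
vec_b = 0b01000010   # lines updated in cache B (assumed example values)
shared = vec_a | vec_b   # a line is dumped if it is set in either cache
```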

Fig. 2. Horizontal sharing; cache line 7 of the shared bit-vector is the OR of Line 7 of A and B. (a) Caches A and B maintain separate Update Trackers (bit-vector and Interval Table). (b) Caches A and B share the Update Tracker.

Vertical Sharing

Fig. 3 shows an example of vertical sharing of Update Trackers (bit-vector and Interval Table) between two caches. The bits shaded in blue in Fig. 3(a) are the additional cache lines that are dumped due to this sharing scheme. Under this scheme, merging the adjacent c cache lines in each of the caches helps us keep the range of line numbers seen by the Interval Table the same as that of a single cache. By merging the adjacent c cache lines, we exploit the spatial locality of the references to a cache to limit the area overhead.
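The merging of c adjacent lines amounts to integer division of the line number, as in this one-line sketch; with c = 2, lines 14 and 15 of cache A both map to shared line 7, matching the Fig. 3 caption.

```python
def merged_line(line, c):
    # c adjacent cache lines share one entry in the shared Update Tracker,
    # keeping the range of line numbers the same as for a single cache.
    return line // c
```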

Fig. 3. Vertical sharing; cache line 7 of the shared bit-vector is the OR of Lines 14 and 15 of A (similar to the t-bit vector for t = 2). (a) Caches A and B maintain independent Interval Tables. (b) Caches A and B share an Interval Table.

Sorted Storage of Intervals

We decided to store the intervals in the Interval Table I[k] in sorted order (based on their starting address) because it allows minLocalGap and minGlobalGap to be computed in a single pass.
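With the intervals sorted by starting address, both quantities fall out of one left-to-right scan, as in this sketch (our own formulation of Definition 3):

```python
def min_gaps(intervals, new_line):
    """Single pass over sorted, non-overlapping intervals.
    Returns (minLocalGap, minGlobalGap)."""
    min_local = min_global = float('inf')
    for i, (s, e) in enumerate(intervals):
        if new_line < s:                   # gap to a run on the right
            min_local = min(min_local, s - new_line)
        elif new_line > e:                 # gap to a run on the left
            min_local = min(min_local, new_line - e)
        if i + 1 < len(intervals):         # gap between adjacent stored intervals
            min_global = min(min_global, intervals[i + 1][0] - e)
    return min_local, min_global
```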

Memory Configuration

An appropriate memory architecture needs to be selected to support the efficient computation of minLocalGap. Fast parallel mechanisms for computing the minimum of n values require log2(n) comparisons (using a tree of comparators). In an aggressive design using b separate dual-ported banks, we can read out 2b intervals simultaneously, and find the minimum distance among these 2b values in log2(2b) cycles (assuming one cycle per comparison).
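For concreteness, the cycle count of such a comparator tree grows only logarithmically with the number of banks; the helper below simply restates the log2(2b) figure from the text.

```python
import math

def tree_compare_cycles(b):
    """Cycles to find the minimum of 2b values read from b dual-ported
    banks, using a comparator tree and one cycle per comparison level."""
    return math.ceil(math.log2(2 * b))
```

With b = 4 banks, 8 intervals are read in parallel and reduced to a minimum in 3 cycles.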

Logic Design

Fig. 4 shows the detailed design of the hardware implementation of the Greedy algorithm. The hardware uses a single dual-ported memory to store all k intervals. As each interval is read out in sequence, the membership check and the computation of minLocalGap and minGlobalGap are performed simultaneously. If the newly updated cache line is found to be a member of an existing interval, the controller aborts all in-flight operations and returns to the initial state.

Fig. 4. Hardware design of Greedy algorithm.

Update Buffer

We use an Update Buffer to temporarily store cache line update requests that are received when the hardware is busy processing the current cache line update. During the 2k + 1 cycles described above, the processor is not allowed to update any other cache line.



  • Reduce the area overhead



  • Modelsim
  • Xilinx ISE

Trigger-Centric Loop Mapping on CGRAs




A coarse-grained reconfigurable architecture (CGRA) is a promising platform in terms of both performance and power efficiency. One of the primary obstacles CGRAs face is accelerating loops with if–then–else (ITE) structures. A recent control paradigm for CGRAs named triggered instruction architecture (TIA) provides an efficient scheme to accelerate loops with ITE structures, yet common loop mapping frameworks cannot leverage this scheme autonomously. To this end, this brief makes two contributions: 1) identifying and removing redundancy nodes from a data flow graph and 2) proposing an integrated approach, TRMap, which consists of operation merging, Boolean operation offloading, and transformation of triggers. Our experimental results on vital kernels extracted from the SPEC2006 benchmarks and digital signal processing applications show that, by using the TIA scheme, TRMap accelerates loops with ITE structures to run 1.38× and 1.64× faster than a full predication scheme (FP-Choi) and a state-of-the-art method (BRMap), respectively. The proposed architecture is analyzed for logic size, area, and power consumption using Xilinx 14.2.


Loop mapping is one of the major challenges associated with CGRAs, especially for loops with if–then–else (ITE) structures, and accelerating such loops is important. Software pipelining is widely used in loop mapping onto CGRAs; its goal is to minimize the initiation interval (II), the number of cycles between the starts of two consecutive iterations of a loop. The smallest possible II is called the minimum II (MII), where MII = max(RecMII, ResMII) and RecMII and ResMII are the lower bounds determined, respectively, by recurrences in the loop and by the available hardware resources.

There are two major problems in mapping loops with ITE structures. First, the branch paths in ITE structures carry many operations, which occupy many PEs. Second, since these operations are guarded by control-flow decisions computed from Boolean operations (AND, OR, and NOT), many PEs are also occupied executing these Boolean operations on their ALUs. These two problems degrade the utilization of the PEs, resulting in a large II.

For the first problem, merging operations in branch paths is a good choice; the state-of-the-art method BRMap leverages this merging and shows better performance than the full predication scheme (FP-Choi). To tackle the second problem, we can offload the Boolean computation of the control-flow predicates from the PEs onto the schedulers, since the schedulers offer the possibility of transforming Boolean operations into triggers. We define redundancy nodes to be the Boolean operations that alter the control flow in a data flow graph (DFG); they can be executed on the schedulers and then removed from the DFG. However, besides the Boolean operations for control-flow decisions, there are also Boolean operations that compute logic values and cannot be executed on the schedulers. Therefore, we need to identify the redundancy nodes in the DFG and guarantee that only redundancy nodes are removed.
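The II lower bound described above can be written down directly; the operation and PE counts in the example are illustrative, not from the paper.

```python
import math

def res_mii(num_ops, num_pes):
    # Resource-constrained bound: operations spread over the PE array,
    # rounded up to whole cycles.
    return math.ceil(num_ops / num_pes)

def mii(rec_mii, num_ops, num_pes):
    # MII = max(RecMII, ResMII)
    return max(rec_mii, res_mii(num_ops, num_pes))
```

Offloading Boolean operations onto the schedulers lowers num_ops and hence ResMII, which is exactly the lever TRMap uses to reduce the achievable II.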



  • Inefficient mapping of loops with ITE structures



Basic Idea

To improve the performance of mapping loops with ITE structures onto the TIA, our approach contains three procedures:

  1. Merging operations from branch paths;
  2. Removing redundancy nodes from the DFG;
  3. Trigger allocation and assignment.

In particular, the operations that update the same variable from branch paths must be mapped to the same PE, since only one of the paths is taken at runtime.

Fig. 1. (a) TDFG after eliminating the Boolean operator P from the DFG since P’s source operands are predicates P1 and P2. (b) Valid mapping of the TDFG with II of 5.


Based on the idea above, we propose a mapping flow for the TIA named TRMap, as shown in Fig. 2. First, TRMap takes a control flow graph (CFG) as input and employs a front-end technique called hyperblock to construct the DFG from multiple basic blocks of the CFG through SSA transformation. Next, our heuristic of legal merging, valid elimination, and trigger transformation (MET) is applied to minimize the DFG and obtain the triggered DFG (TDFG). Finally, TRMap employs a modulo-scheduling-based placement and routing (P&R) algorithm, e.g., REGIMap, to map the TDFG onto the TIA.

Fig. 2. Processing flow of TRMap.

Algorithm and Time Complexity

Our MET algorithm is shown in Algorithm 1. MET starts with operation merging. Here, m and n represent the number of branch paths and the number of operations in the longest branch path, respectively. Operations in the j-th path are stored in the set Vj (Vj ⊂ VP). MET merges the multiple operations that update the same variable across the m branch paths into a merged node v. If the operation from the j-th path is NULL, a nop operation is put into v. All merged nodes are put into VM in our formulation, and the select operations are then eliminated. After operation merging, MET starts eliminating redundancy nodes.

Trigger transformation contains two steps:

1) dependence conversion

2) control-flow decision computation.

We illustrate the process of trigger transformation by generating the triggered instructions in Fig. 3 from the TDFG in Fig. 1(a). The first step converts data dependences into predicate dependences. It first assigns a predicate to each node of the DFG (lines 27 and 28 in Algorithm 1), such as a0, b0, and c0 in lines 1, 3, and 5, respectively, of Fig. 3. These predicates are initialized to false and implicitly set to true when their predicated operations are performed (lines 2, 4, 6, and so on in Fig. 3). Once the output of a source instruction has been consumed by its last destination instruction, the predicates must be reset (lines 8, 10, and 12 in Fig. 3). If the current node has a data dependence, a predicate is assigned to this node and conjoined with the existing predicates (lines 29–31 in Algorithm 1). This process is repeated until no data dependence remains. The second step (lines 32–35 in Algorithm 1) computes the predicate expressions Vr stored by the elimination procedure to obtain the control-flow decisions that form the triggers (lines 17, 19, 21, and 23 in Fig. 3).

Fig. 3. Triggered instructions.


  • Better loop mapping


  • Modelsim
  • Xilinx ISE

Fixed-Point Computing Element Design for Transcendental Functions and Primary Operations in Speech Processing




This brief presents a fixed-point architecture based on a reconfigurable scheme for integrating several commonly used mathematical operations in speech signal processing. The proposed design can perform two transcendental mathematical operations, logarithm and powering, and three commonly used computations with similar operations, namely polynomial calculation, filtering, and windowing. By analyzing the algorithms adopted for the above five operations, a simplified computing unit is designed. This unit can combine six types of operations by reconfiguring the data paths, and the same multiply–add architecture can be reused to reduce redundant usage of logic gates. The experimental results reveal that the proposed design can work at a 200-MHz clock rate with a gate count of only 11.9k. Compared with the results of the floating-point implementation, the median errors of the proposed design for computing the powering and logarithmic functions are 0.57% and 0.11%, respectively. Such results indicate that this simple architecture can be used effectively in most speech processing applications. The proposed architecture is analyzed for logic size, area, and power consumption using Xilinx 14.2.


With respect to the related studies, Ramamoorthy et al. first noted the suitable properties of the Newton–Raphson method (NRM) for computing square-root operations. They found that the division in the recursion of the NRM could be removed when the target function of the NRM was changed to the inverse square root. Seth and Gan also presented a hardware design based on the NRM, in which several lookup tables (LUTs) and polynomial expansions (PEs) are used to approximate the value of the inverse square root, reducing the number of multiplications and additions. However, the memory requirement for the LUTs and the PEs remained a drawback. Auger et al. developed a multiplier-free algorithm based on a dichotomous coordinate descent (DCD) technique for division, square-root, and logarithmic operations. The DCD-based algorithm includes only additions, shift operations, and logic conditions, which makes it well suited to a low-gate-count architecture. Nam et al. developed a 3-D graphics rendering engine for a handheld system; the design uses two LUTs for computing the reciprocal square root and the power function, respectively, which can greatly increase the chip area. To solve this problem, Kim et al. proposed an arithmetic architecture that can compute most fixed-point transcendental functions using equivalent logarithmic functions. To reduce memory usage, the fractional part of the logarithmic functions is approximated by an LUT with several adders and shift operations. That work provided a fast and area-efficient solution for various transcendental functions in a fixed-point 3-D graphics application; however, the designed LUTs limited the range of the output value, so it was hard to use for other applications.
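The division-free recursion alluded to above is the classical Newton–Raphson iteration for the inverse square root, y_{k+1} = y_k(1.5 − 0.5·x·y_k²), which uses only multiplications and subtractions; sqrt(x) is then recovered as x·y. A floating-point sketch (the designs discussed use fixed point):

```python
def inv_sqrt_nrm(x, y0, iters=6):
    """Newton-Raphson for 1/sqrt(x): no division appears in the recursion."""
    y = y0
    for _ in range(iters):
        y = y * (1.5 - 0.5 * x * y * y)
    return y
```

With a seed reasonably close to the root (in hardware, typically taken from a small LUT), convergence is quadratic; for example, sqrt(2) is recovered as 2 * inv_sqrt_nrm(2.0, 0.7).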



  • Gate count is high
  • Power consumption


Computation Analysis for Designing FSM

In the proposed design, the basic operations of the abovementioned algorithms are analyzed first. We note that the basic operations required by these algorithms are very similar; the designed hardware can therefore use the same architecture to compute them, reducing the required hardware area. Table I lists the operations analyzed for each type of computation. The first and second columns list the function type and its included computation, respectively; the remaining columns show the operations necessary for each type of computation.


The state diagram in Fig. 1 consists of seven states, of which six are used to compute the different kinds of computations listed in Table I. These states include:

  1. idle (IDLE);
  2. BLA phase one (BLA P1);
  3. BLA phase two (BLA P2);
  4. the NRM for computing the power function (NRM);
  5. EBS;
  6. polynomial computation/ convolution (POL/CON);
  7. windowing (WIN).

Notably, each state implies several behaviors, and the different behaviors are decided by the values of the corresponding counter and state register.

Fig. 1. Example of state diagram for computing (a) logarithmic function (Mode = 0) and (b) power function (Mode = 1). Dashed lines: controller will not go in these states with the different mode.

Fig. 2 shows the architecture of the proposed system, including a 4-to-2 multiplexer, a 16-word dual-port SRAM, a parallel-in-serial-out (PISO) register, a mathematical computing unit, and a corresponding controller. As shown in Fig. 3, the input Mode determines which operation is performed by the proposed system when the Act signal rises. The input Order gives the order of the polynomial, filter, or window function. The dual-port SRAM has a 16-word space with a 32-bit word length and temporarily stores the coefficients. In addition, the PISO register determines the behavior of the calculation in the EBS state. Finally, the multiplexer array decides which coefficients are to be stored in the SRAM or to be computed.

Fig. 2. Architecture of the proposed system.

Fig. 3 shows the architecture of the proposed mathematical computing unit. This architecture includes only a multiplier, an adder, a barrel shifter, a comparator, two registers, and seven multiplexers. In this design, the same arithmetic logic can be reused for different kinds of computations, as listed in Table IV. As shown in Fig. 3, the left multiplexer selects the coefficients for the different kinds of computations, while the middle multiplier, adder, and shifter can reorganize the computing flow to achieve distinct functionalities.

Fig. 3. Architecture of a mathematical computing unit.

The following briefly describes how the proposed hardware works in these six states.
  1. IDLE: This state waits until the signal Act rises; the input Mode then decides the next state for computing the corresponding function.
  2. BLA P1: This state is responsible for computing BLA P1. The signal Sig_C, sent from the comparator in the mathematical computing unit, is used to confirm whether the next state is BLA P1 or BLA P2.
  3. BLA P2: This state computes the second phase of the BLA. The computation of the second phase continues until the Count value reaches M. Afterward, the next state is decided by the value saved in the state register. If the value is one, the system goes back to BLA P1 to calculate the value of log2(B); otherwise, the next state is set to the NRM for computing (log2(B))−1.
  4. NRM: This state executes the iterations of Newton’s method for computing the power function. Cnt records the computation cycle. The value of xk2M is first computed by the EBS when PCount is zero.
  5. EBS: This state computes εI, εF, and xk2M. In each cycle, the PISO shifts a bit to the controller to determine the behavior of the EBS.
  6. POL/CON and WIN: These two states are responsible for the polynomial/convolution computation and the windowing, respectively.


  • Improved reuse efficiency
  • Reduced gate count
  • Reduced power consumption


  • Modelsim
  • Xilinx ISE

Design of Modified Second-Order Frequency Transformations Based Variable Digital Filters with Large Cutoff Frequency Range and Improved Transition Band Characteristics



Frequency-transformation-based filters (FT filters) provide absolute control over the cutoff frequency. However, the cutoff frequency range (Ωc_range) of FT filters is limited. The filter based on second-order frequency transformations combined with the coefficient decimation technique (FTCDM filter) has a wider Ωc_range than the FT filter; however, the ratio of the transition bandwidth of the transformed filter to that of the prototype filter, tbwFT/tbwmod, is large over a significant portion of Ωc_range. In this paper, we propose the novel idea of relaxing the one-to-one mapping condition between the frequency variables to overcome the issue of limited Ωc_range for tbwFT ≤ tbwmod. In the proposed modified second-order frequency-transformation-based filter (MSFT filter), we relax the one-to-one mapping condition between the frequency variables and use a low-pass to high-pass transformation on the prototype filter to achieve a wider Ωc_range with tbwFT ≤ tbwmod. A design example shows that the MSFT filter provides 3 and 1.22 times wider Ωc_range than the FT and FTCDM filters, respectively. The proposed architecture is analyzed for logic size, area, and power consumption using Xilinx 14.2.


Frequency transformations are one of the first methods proposed to design variable digital filters. The frequency-transformation-based filter (FT filter) provides absolute control over ωc. (A cutoff frequency resolution of 0.00001π can be obtained using frequency transformations.) However, the cutoff frequency range of the transformed filter, Ωc_range, is very limited, and the transition bandwidth of the transformed filter (i.e., tbwFT) can be significantly wider than the prototype filter’s transition bandwidth (i.e., tbwmod). A variable digital filter based on second-order frequency transformations and the CDM-I technique was proposed to increase Ωc_range. (This variable filter is referred to as the frequency transformations combined with coefficient decimation technique based filter, i.e., the FTCDM filter, in the rest of the paper.) However, the ratio tbwFT/tbwmod is large over a significant portion of the Ωc_range of the FTCDM filter. Hence, its prototype filter needs to be overdesigned, i.e., the order of the prototype filter needs to be increased to account for this increase in the transition bandwidth of the transformed filter. In this paper, we present a modified second-order frequency-transformation-based filter (MSFT filter), which provides a wider Ωc_range than the FT and FTCDM filters and has the advantage of maintaining tbwFT ≤ tbwmod over the entire Ωc_range.



  • Fixed high-pass, bandpass, as well as bandstop filter responses
  • Complex circuit



Use of Low-pass to High-Pass Transformation

The proposed MSFT filter can provide transformed low-pass filter responses obtainable from the prototype low-pass filter of cutoff frequency ωc as well as from the prototype low-pass filter of cutoff frequency π − ωc. Consider a high-pass filter of order 2N with symmetric coefficients, implemented in the Taylor form (as in the case of the low-pass filter).

The hardware realization of the low-pass to high-pass transformation, i.e., reversing the sign of every alternate “a” coefficient of the low-pass filter, can be accomplished by changing every alternate “add/sub” block of the add-delay chain of the filter structure to function as a subtractor. The function (addition or subtraction) of the add/sub block can be selected using only one select line. If the high-pass filter response is to be obtained simultaneously along with the low-pass response, a separate add-delay chain can be implemented.
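In coefficient terms, the low-pass to high-pass transformation described above is h_hp[n] = (−1)^n · h_lp[n]; the hardware achieves the same effect by switching every alternate add/sub block to subtract. A one-line sketch:

```python
def lp_to_hp(coeffs):
    """Reverse the sign of every alternate coefficient (odd-indexed taps)."""
    return [c if n % 2 == 0 else -c for n, c in enumerate(coeffs)]
```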

Relaxing the One-to-One Mapping Condition

For a particular value of ωc, the Ωc_range of the second-order frequency-transformation-based filter is determined by the range of the parameters A0, A1, and A2. From (5), it is clear that Ωc can be controlled using only two parameters, say A1 and A0 (or equivalently A1 and A2) [15]. Even though A0 can vary from −1 to 1, the useful range of A0 (and hence Ωc_range) for any particular value of A1 is limited by constraints such as the condition of one-to-one mapping between ω and Ω, and the condition that cos Ωc should be real. For instance, when A1 = 1, the useful range of A0 is −0.5 to 0.5. For A1 = 0, there is not a single value of A0 that can satisfy all the constraints.

Fig. 1. (a) Proposed MSFT filter. (b) Block D(Z).

Realization of the MSFT Filter

The proposed MSFT filter combines the two techniques presented above to provide four types of low-pass filter responses:

1) Low-pass 1: the response HL;

2) Low-pass 2: the response HCH;

3) Low-pass 3: the response obtained as the first band of HL0;

4) Low-pass 4: the response obtained as HCH0 + the first band of the corresponding response HL0.

Hence, the proposed MSFT filter has a wider Ωc_range than the FT filter and is capable of providing various high-pass, bandpass, and bandstop filter responses. In the MSFT filter, the number of variables required for implementation is reduced from three (i.e., A0, A1, and A2) to two (A1 and A2). Since A1 ∈ {1, 0} in the MSFT filter, the multiplication by A1 can be implemented simply using a multiplexer with one input as the other multiplicand and the other input as zero. The low-pass to high-pass transformation (i.e., reversing the sign of alternate “a” coefficients) can be implemented in either of two ways: by reconfiguring the alternate add/sub blocks or by implementing a parallel add-delay chain. Unlike the FT and FTCDM filters, our MSFT filter requires a low-complexity masking filter to obtain two low-pass responses (Low-pass 3 and Low-pass 4) from the response HL0.

Implementation of the MSFT filter is shown in Fig. 1. Block D(Z) implements the frequency transformation. The responses HL, HCL, HH, and HCH (obtained when A1 = 1), and the responses HL0, HCL0, HH0, and HCH0 (obtained when A1 = 0), are shown in Fig. 3(a). The response HCL is the complementary response of HL. The response HH0 is obtained at the output of the high-pass add-delay chain when A1 = 0, and HCH0 is its complementary response.



  • Variable high-pass, bandpass, as well as bandstop filter responses
  • Circuit complexity is reduced


  • Modelsim
  • Xilinx ISE

A Fast-Acquisition All-Digital Delay-Locked Loop Using a Starting-Bit Prediction Algorithm for the Successive-Approximation Register



This brief presents a fast-acquisition 11-bit all-digital delay-locked loop (ADDLL) using a novel starting-bit prediction algorithm for the successive-approximation register (SBP-SAR). It effectively eliminates harmonic lock and false lock. The achievable acquisition time is within 17.5–23.5 or 17.5–32.5 clock cycles when the ADDLL works at a low or high clock rate, respectively. The digital-controlled delay line and the SBP-SAR of the ADDLL chip are synthesized using Taiwan Semiconductor Manufacturing Company’s (TSMC’s) 0.18-µm CMOS cell library. The proposed ADDLL can operate at clock frequencies from 60 MHz to 1.1 GHz. The proposed architecture is analyzed for logic size, area, and power consumption using Xilinx 14.2.


To widen the operating frequency range of the DLLs, various architectures of DLLs were reported. Looking into the acquisition time of the DLLs, we conclude that the successive approximation register (SAR)-based realizations, the time-to-digital converter (TDC)-based realizations, and the hybrid TDC-feedback DLL have sufficiently fast DLL response time compared with the multiphase-clock DLLs. In the SAR-based DLLs, the required acquisition time ranges from tens to hundreds of clock cycles, depending on the number of SAR bits they have. In the TDC-based DLLs, the acquisition time is within 15 clock cycles. Furthermore, the hybrid realization advanced the acquisition time to only three clock cycles.

Although the TDC-based DLLs and the hybrid DLL possess short acquisition times, most of their circuits have to be custom-built; i.e., the design effort for these circuits is high. On the other hand, the force/release SAR-based DLLs were cell-based all-digital DLLs (ADDLLs). El-Shafie and Habib only showed simulation results, whereas we verified our method by measuring the ADDLL chip. The acquisition time of the ADDLL requires 33 clock cycles for an 11-bit SAR, and the jitter performance was comparable with its full-custom counterparts. Since Yao et al. [1] successfully demonstrated the possibility of using a cell-based design to realize a SAR ADDLL running at clock rates up to 1.2 GHz, it is worth developing a more powerful algorithm to shorten the average acquisition time of the SAR ADDLL. Simulation results verified that a short average acquisition time could be obtained using a starting-bit prediction algorithm for the SAR (SBP-SAR) in a wide-operating-range ADDLL. This brief further refines the SBP-SAR algorithm and realizes the SBP-SAR ADDLL using TSMC’s 0.18-μm CMOS cell library. The ADDLL acquisition time is confined to 17.5–32.5 clock cycles.



  • Harmonic lock and false lock occur



DCDL Structure

The DCDL in this brief is similar to that of our previous work, except that we modify the lattice delay unit (LDU) shown in Fig. 1. Unlike the previously proposed LDU, every NAND block of the LDU in this brief is composed of two parallel small NAND gates with their input connections interchanged. This modification further equalizes the loads on all nodes of the LDUs in the DCDL. Compared with our previous work, the DLL jitter performance of this brief is improved.

Fig. 1. Modified LDU.

Intrinsic Delay

The intrinsic delay Tint is the shortest delay of the DCDL. It is equal to the shortest delay of the fine-tuning delay unit plus the delay caused by cascading two NAND gates and a buffer. We duplicate this part of the DCDL as a subblock called τint for delaying CLKin to get CLKint. Fig. 2 shows the schematic of τint.

Fig. 2. Schematic of the subblock τint.

Proposed ADDLL

Fig. 3 shows the flowchart of the proposed ADDLL. In the beginning, k = 0 and the system starts the SBP operation. The ADDLL checks whether the DCDL delay is less than Tc_int + 4Tclk. If CLKout does not show up when Ncounter ≤ 1, then k is increased by 1. This process repeats until the DCDL delay is less than Tc_int + 4Tclk. Next, the starting bit is determined by the SBP-SAR algorithm presented in Section II, and a SAR search takes over the phase acquisition. Once the SAR procedure is completed, the ADDLL operates in a closed-loop tracking mode.
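The SAR phase acquisition itself is an ordinary binary search over the delay code. This behavioral sketch (not the chip's RTL; the hypothetical `delay_of` stands in for the DCDL plus phase detector) shows why starting from a predicted bit rather than the MSB shortens the search:

```python
def sar_search(target_delay, delay_of, n_bits=11, start_bit=None):
    """Binary-search the DCDL control code; bits above start_bit stay cleared,
    so a correct starting-bit prediction skips the trials of those bits."""
    if start_bit is None:
        start_bit = n_bits - 1            # conventional SAR: start at the MSB
    code = 0
    for bit in range(start_bit, -1, -1):
        trial = code | (1 << bit)
        if delay_of(trial) <= target_delay:   # phase detector: not yet late
            code = trial                      # keep the trial bit
    return code
```

With a unit-delay model such as delay_of = lambda c: c, the search settles on the largest code whose delay does not exceed the target, and a correct starting-bit prediction removes the cycles that a full 11-bit search would spend on bits known to be zero.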

Fig. 3. Flowchart of the proposed ADDLL.

Fig. 4 shows the block diagram of the proposed ADDLL. It consists of an 11-bit DCDL, a phase detector, a divide-by-2-or-3 frequency divider (FD), and an SBP-SAR controller. The dotted line in Fig. 4 encloses the SBP-SAR controller, which consists of an 11-bit SAR, a starting-bit generator (SBG), a timing controller, a delay subblock τint, and some supporting circuits. Notably, if SELSAR = 0, then D[10:0] = Ddb[10:0]; otherwise, D[10:0] = DSAR[10:0]. Ddb is the initial code of the DCDL for each SBP operating cycle. The values of Ncounter, HalfDM, and Cycle1 determine the SAR starting bit for the corresponding SBP cycle.

Fig. 4. Block diagram of the proposed ADDLL.

Fig. 5 shows the schematic of the FD. According to the simulation results, when the clock frequency is above 550 MHz, the divisor should be 3 to give the SAR sufficient settling time; in this case, we set SELdiv23 to 1. Otherwise, the divisor is 2 and SELdiv23 is 0. The signal ENSAR is used to enable the FD. Notably, the first cycle of the SAR operation of our ADDLL extends to 3 cycles of CLKin, regardless of the divisor value. The D flip-flop DFFdum in the FD fulfills this requirement.

Fig. 5. Schematic of the FD.


  • Eliminate the harmonic lock and the false lock


  • Modelsim
  • Xilinx ISE

On Efficient Retiming of Fixed-Point Circuits




Retiming of digital circuits is conventionally based on estimates of propagation delays across different paths in the data-flow graph (DFG) obtained by the discrete component timing model, which implicitly assumes that the operation of a node can begin only after the completion of the operation(s) of its preceding node(s), to obey the data-dependence requirement. Such a discrete component timing model very often gives much higher estimates of the propagation delays than the actual values, particularly when the computations in the DFG nodes correspond to fixed-point arithmetic operations such as additions and multiplications. On the other hand, it is often imperative to deal with DFGs of such higher granularity at the architecture-level abstraction of digital system design, for mapping an algorithm to the desired architecture, where the overestimation of propagation delay leads to unwanted pipelining and an undesirable increase in pipeline overheads. In this paper, we propose the connected component timing model to obtain adequately precise estimates of propagation delays across different combinational paths in a DFG easily, for efficient cutset retiming that reduces the critical path substantially without a significant increase in register complexity and latency. Apart from that, we propose novel node-splitting and node-merging techniques that can be used in combination with existing retiming methods to reduce the critical path to a fraction of that of the original DFG with a small increase in overall register complexity. The proposed architecture is analyzed for logic size, area, and power consumption using Xilinx 14.2.



We discuss the limitations of the popularly used conventional retiming of the FIR filter and suggest a low-overhead alternative retiming based on the connected component timing model. Besides, in the case of lattice filters, we show that the two-slow transformation does not help reduce the MSP but increases the power consumption. We discuss alternative retiming of the lattice filter as well, and show the advantage of the alternative schemes.

Connected Component Timing Model

Very often in a DFG, an adder node follows a multiplier node or another adder node. In such cases, and likewise when a multiplier node follows an adder node, the propagation delay across the pair is much less than the sum of the propagation delays of the individual nodes. We refer to this as the connected component timing model and show that it can be used to avoid unwanted pipelining and to reduce the corresponding pipeline overheads.
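The contrast between the two models can be sketched in a few lines; the node delays and overlap values below are illustrative assumptions, not figures from the paper.

```python
# Delay estimate of a multiply-add path under the two timing models.
# All delays are illustrative assumptions, in tenths of a unit time (ut).
delay = {"mul": 10, "add": 6}

def path_delay_discrete(path):
    # Discrete component model: a node starts only after its
    # predecessor(s) have completely finished.
    return sum(delay[n] for n in path)

def path_delay_connected(path, overlap):
    # Connected component model: adjacent components overlap partially,
    # so the path delay is less than the plain sum of node delays.
    total = sum(delay[n] for n in path)
    for a, b in zip(path, path[1:]):
        total -= overlap.get((a, b), 0)
    return total

# Assumed overlaps when an adder directly follows a multiplier or an adder.
overlap = {("mul", "add"): 3, ("add", "add"): 2}

path = ["mul", "add", "add"]
print(path_delay_discrete(path))            # 22
print(path_delay_connected(path, overlap))  # 17
```

The connected component estimate is shorter, which is exactly what can make an extra pipeline stage (and its registers) unnecessary.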

Retiming of FIR Filter

Limitations of Conventional Retiming of FIR Filter: The DFG of the direct-form (DF) FIR filter of length N = 8 is shown in Fig. 1(a). The dashed line in Fig. 1(a) passes across a feed-forward cutset that decomposes the DFG into two subgraphs G1 and G2, as shown in Fig. 1(b). The retimed DFG [shown in Fig. 1(c)] is obtained by adding a delay on every edge going from subgraph G1 to subgraph G2. Here we consider the input samples and coefficients to be L-bit words and the product words to be 2L bits. Moreover, the addition time of 2L-bit words could be comparable with the L-bit multiplication time.

Fig. 1. (a) DFG of FIR filter of length N = 8. (b) Decomposition of DFG to two subgraphs G1 and G2 for feed-forward cutset retiming of DFG. (c) Retimed DFG.
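The behavioral effect of this feed-forward cutset retiming can be checked with a short simulation; the coefficients and input samples below are arbitrary, and the register placement is a simplified model of the G1-to-G2 edges.

```python
# Behavioral check of feed-forward cutset retiming: registering the
# G1 -> G2 edges (the multiplier outputs) leaves the filter's output
# unchanged except for a one-sample latency. Coefficients and inputs
# are arbitrary illustrative values.
def fir_direct(x, h):
    # reference direct-form FIR: y[n] = sum_k h[k] * x[n-k]
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n >= k)
            for n in range(len(x))]

def fir_retimed(x, h):
    # cycle-accurate model with one pipeline register per cutset edge
    N = len(h)
    taps = [0] * N   # input delay line
    preg = [0] * N   # registers inserted on the G1 -> G2 edges
    y = []
    for xn in x:
        taps = [xn] + taps[:-1]
        y.append(sum(preg))                        # adder tree sees registered products
        preg = [h[k] * taps[k] for k in range(N)]  # products latched for next cycle
    return y

h = [3, 1, 4, 1, 5, 9, 2, 6]   # N = 8 taps
x = [1, 0, 2, -1, 3, 0, 1, 2]
# retimed output equals the direct output delayed by one sample
print(fir_retimed(x, h)[1:] == fir_direct(x, h)[:-1])  # True
```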

Flexible Retiming of FIR Filter Based on Connected Component Timing Model: We discuss here a flexible scheme for the retiming of the DFG of Fig. 1(a). To perform the desired retiming, the direction of accumulation path in the DFG is reversed [as shown in Fig. 2(a)], and the cutsets across the dashed lines (after each gray box) are considered one after the other for retiming.

Fig. 2. (a) Cutset selection for the proposed retiming of DFG of the FIR filter of length N = 8. (b) Retimed DFG.


Disadvantages of the existing system:
  • Area coverage is high
  • Delay is high



We discuss, here, the proposed technique for the reduction of critical path by combination of retiming with node-splitting and node-merging. Besides, we discuss the application of the proposed technique in FIR filter and simple examples of recursive computation.

Splitting and Merging of DFG Nodes

Fig. 3. Splitting of adder circuit.

It is quite simple to split an adder circuit into two nearly equal parts as shown in Fig. 3 for the ripple carry adder (RCA) to perform the addition T = P + Q, where P and Q are 8-bit words. Similar splitting can be performed for carry-lookahead adders also. In a recent paper, we have shown that it is quite easy to decompose a Wallace-tree multiplier into two parts where each part involves nearly half of the propagation delay of a multiplier, for fine-grained seamless pipelining of multiplier and multiply–add circuits. Such a scheme is shown in Fig. 4 for the decomposition of a multiplier for the computation of T = P × Q, where each of P and Q is an 8-bit word.

Fig. 4. Splitting of multiplier (multioperand adder) circuit.
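The adder splitting can be illustrated in software; the 4-bit split point mirrors the nearly equal halves of the 8-bit RCA described above.

```python
# Sketch of splitting the 8-bit addition T = P + Q into two 4-bit halves,
# as in the RCA decomposition above: the carry out of the low half is the
# only signal crossing to the high half, so a register placed between the
# halves roughly halves the combinational delay per stage.
def split_add8(p, q):
    lo = (p & 0xF) + (q & 0xF)        # low half: bits 3..0
    carry = lo >> 4                   # single carry into the high half
    hi = (p >> 4) + (q >> 4) + carry  # high half: bits 7..4 plus carry
    return (hi << 4) | (lo & 0xF)     # 9-bit result

print(split_add8(200, 100))  # 300
```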

Retiming Combined With Splitting and Merging of Nodes

We demonstrate, here, the proposed technique for retiming combined with splitting and merging of DFG nodes in three typical examples.

1) Retiming of a Basic Recursive Computation:

Let us consider the conventional retiming of the DFG of a basic recursive computation given by

y[n] = x[n] + h · y[n − 1].                                              (1)

The block diagram of the computation of (1) is shown in Fig. 5 and the retiming of the corresponding DFG is shown in Fig. 6. The parentheses near the nodes denote the relative computation times in terms of unit time (ut).

Fig. 5. Block diagram of the computation of (1).

Fig. 6. (a) DFG of the computation of (1). (b) Two-slow transformation of the DFG. (c) Retimed DFG.
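A behavioral sketch of the two-slow transformation of (1), using arbitrary example values:

```python
# Sketch of the two-slow transformation of y[n] = x[n] + h*y[n-1]:
# doubling the loop delay gives y[n] = x[n] + h*y[n-2], which processes
# two independent interleaved streams, each obeying the original
# recursion at half the rate. Inputs and h are arbitrary illustrations.
def recursion(x, h):
    y, prev = [], 0
    for xn in x:
        prev = xn + h * prev
        y.append(prev)
    return y

def two_slow(x, h):
    y, d1, d2 = [], 0, 0      # two registers in the loop
    for xn in x:
        yn = xn + h * d2      # y[n] = x[n] + h*y[n-2]
        y.append(yn)
        d1, d2 = yn, d1
    return y

h = 2
xa, xb = [1, 2, 3], [4, 5, 6]
interleaved = [v for pair in zip(xa, xb) for v in pair]
y = two_slow(interleaved, h)
print(y[0::2] == recursion(xa, h))  # True
print(y[1::2] == recursion(xb, h))  # True
```

With two registers in the loop, retiming can move one of them through the multiplier or adder, which is the basis of the retimed DFG in Fig. 6(c).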

2) Retiming of an Infinite Impulse Response Filter:

Let us consider an infinite impulse response (IIR) filter given by

y[n] = h1 · y[n − 2] + h2 · y[n − 3] + x[n].                              (2)

The computation of (2) can be represented by the DFG of Fig. 7(a) and conventionally retimed as shown in Fig. 7(b). The critical path of the retimed DFG is nearly 2 ut, which is the same as that of the original DFG [Fig. 7(a)].

Fig. 7. (a) DFG of IIR filter given by y[n] = h1 · y[n −2] +h2 · y[n −3] + x[n]. (b) Conventional retimed DFG.

3) Retiming of an FIR Filter:

We illustrate, here, the reduction of critical path of FIR filter by splitting and merging of nodes combined with retiming. The DFG of FIR filter of length N = 4 is shown in Fig. 8(a) where the multiplication nodes as well as addition nodes are split into two parts of nearly equal delays.

Fig. 8. (a) DFG of DF configuration of FIR filter of length N = 4. (b) Splitting of nodes of the DFG.


Advantages of the proposed system:
  • Reduce the area
  • Reduce the delay


Software tools used:
  • Modelsim
  • Xilinx ISE

Floating-Point Butterfly Architecture Based on Binary Signed-Digit Representation



The fast Fourier transform (FFT) coprocessor, having a significant impact on the performance of communication systems, has been a hot research topic for many years. The FFT function consists of consecutive multiply-add operations over complex numbers, dubbed butterfly units. Applying floating-point (FP) arithmetic to FFT architectures, specifically butterfly units, has recently become more popular. It offloads compute-intensive tasks from general-purpose processors by dismissing FP concerns (e.g., scaling and overflow/underflow). However, the major downside of the FP butterfly is its slowness in comparison with its fixed-point counterpart. This reveals the incentive to develop a high-speed FP butterfly architecture to mitigate FP slowness. This brief proposes a fast FP butterfly unit using a devised FP fused dot-product-add (FDPA) unit, computing AB ± CD ± E, based on binary signed-digit (BSD) representation. The FP three-operand BSD adder and the FP BSD constant multiplier are the constituents of the proposed FDPA unit. A carry-limited BSD adder is proposed and used in the three-operand adder and the parallel BSD multiplier so as to improve the speed of the FDPA unit. Moreover, modified Booth encoding is used to accelerate the BSD multiplier. The synthesis results show that the proposed FP butterfly architecture is much faster than previous counterparts, but at the cost of more area. The proposed architecture of this paper analyzes the logic size, area, and power consumption using Xilinx 14.2.



Fast Fourier transform (FFT) circuitry consists of several consecutive multipliers and adders over complex numbers; hence, an appropriate number representation must be chosen wisely. Most FFT architectures have used fixed-point arithmetic, though FFTs based on floating-point (FP) operations have recently been growing in number. The main advantage of FP over fixed-point arithmetic is the wide dynamic range it provides, but at the expense of higher cost. Moreover, use of the IEEE 754-2008 standard for FP arithmetic allows an FFT coprocessor to work in collaboration with general-purpose processors. This offloads compute-intensive tasks from the processors and leads to higher performance. The main drawback of FP operations is their slowness in comparison with their fixed-point counterparts. One way to speed up FP arithmetic is to merge several operations in a single FP unit and thereby save delay, area, and power consumption. Using redundant number systems is another well-known way of overcoming FP slowness, since there is no word-wide carry propagation within the intermediate operations.


Disadvantages of the existing system:
  • Delay is high
  • Area coverage is high


The FFT can be implemented in hardware based on an efficient algorithm in which the N-input FFT computation is simplified to the computation of two (N/2)-input FFTs. Continuing this decomposition leads to the 2-input FFT block, also known as the butterfly unit. The proposed butterfly unit is actually a complex fused multiply-add with FP operands. Fig. 1 shows the required modules with the complex numbers expanded. According to Fig. 1, the constituent operations of the butterfly unit are a dot-product (e.g., BreWim + BimWre) followed by an addition/subtraction, which leads to the proposed FDPA operation (e.g., BreWim + BimWre + Aim). Implementation details of the FDPA, over FP operands, are discussed below.

Fig. 1. FFT butterfly architecture with expanded complex numbers.
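The expansion of the complex butterfly into real FDPA operations can be sketched as follows; plain Python numbers stand in for the FP/BSD operands, and the signal names follow Fig. 1.

```python
# Sketch of the radix-2 DIT butterfly X = A + B*W, Y = A - B*W expanded
# into four real operations of the FDPA form A*B +/- C*D +/- E, as in
# the decomposition of Fig. 1. FP/BSD encoding details are omitted.
def fdpa(a, b, c, d, e):
    return a * b + c * d + e          # fused dot-product-add

def butterfly(A, B, W):
    Are, Aim = A
    Bre, Bim = B
    Wre, Wim = W
    Xre = fdpa(Bre, Wre, -Bim, Wim, Are)   # Re(A + B*W)
    Xim = fdpa(Bre, Wim, Bim, Wre, Aim)    # Im(A + B*W)
    Yre = fdpa(-Bre, Wre, Bim, Wim, Are)   # Re(A - B*W)
    Yim = fdpa(-Bre, Wim, -Bim, Wre, Aim)  # Im(A - B*W)
    return (Xre, Xim), (Yre, Yim)

# cross-check against complex arithmetic
A, B, W = complex(1, 2), complex(3, -1), complex(0.5, 0.25)
X, Y = butterfly((A.real, A.imag), (B.real, B.imag), (W.real, W.imag))
print(complex(*X) == A + B * W, complex(*Y) == A - B * W)  # True True
```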

The carry-limited addition circuitry for BSD numbers is shown in Fig. 2, where capital (small) letters symbolize negabits (posibits). The critical path delay of this adder consists of three full adders. The proposed FDPA consists of a redundant FP multiplier followed by a redundant FP three-operand adder.

Fig. 2. BSD adder (two-digit slice).
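The carry-limited property can be illustrated with one textbook transfer-selection rule for the digit set {-1, 0, 1}; this is a standard scheme and may differ in circuit-level detail from Fig. 2.

```python
# Sketch of carry-limited addition over binary signed digits {-1, 0, 1},
# LSB first. The transfer out of position i is chosen from the raw digit
# sum there and the sign of the sum at position i-1, so no carry ever
# propagates more than one position -- the property that bounds the
# adder's critical path.
def bsd_add(a, b):
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    s = [a[i] + b[i] for i in range(n)]  # raw digit sums in [-2, 2]
    t = [0] * (n + 1)                    # transfers (t[i+1] leaves position i)
    w = [0] * n                          # interim sums
    for i in range(n):
        lower_nonneg = (i == 0) or (s[i - 1] >= 0)
        if s[i] == 2:    t[i + 1], w[i] = 1, 0
        elif s[i] == -2: t[i + 1], w[i] = -1, 0
        elif s[i] == 1:  t[i + 1], w[i] = (1, -1) if lower_nonneg else (0, 1)
        elif s[i] == -1: t[i + 1], w[i] = (0, -1) if lower_nonneg else (-1, 1)
    return [w[i] + t[i] for i in range(n)] + [t[n]]

def bsd_value(d):
    # value of a BSD digit list, LSB first
    return sum(di * (1 << i) for i, di in enumerate(d))

a = [1, -1, 0, 1]   # 1 - 2 + 0 + 8 = 7
b = [1, 1, 1, 0]    # 1 + 2 + 4 + 0 = 7
print(bsd_value(bsd_add(a, b)))  # 14
```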

Proposed Redundant Floating-Point Multiplier

The proposed multiplier, likewise other parallel multipliers, consists of two major steps, namely, partial product generation (PPG) and PP reduction (PPR). However, contrary to the conventional multipliers, our multiplier keeps the product in redundant format and hence there is no need for the final carry-propagating adder. The exponents of the input operands are taken care of in the same way as is done in the conventional FP multipliers; however, normalization and rounding are left to be done in the next block of the butterfly architecture (i.e., three-operand adder).

1) Partial Product Generation: The PPG step of the proposed multiplier is completely different from that of the conventional one because of the representation of the input operands (B, W, B’, W’).


Fig. 3. Generation of the ith PP.

Fig. 3 shows the required circuitry for the generation of PPi based on Table I where each PP consists of (n + 1) digits (i.e., binary positions).

Fig. 4. Proposed redundant FP multiplier.

2) Partial Product Reduction: The major constituent of the PPR step is the proposed carry-limited addition over the operands represented in BSD format. This carry-limited addition circuitry is shown in Fig. 2 (two-digit slice).

Fig. 4 shows the proposed redundant FP multiplier.

Proposed Redundant Floating-Point Three-Operand Adder

The straightforward approach to perform a three-operand FP addition is to concatenate two FP adders which leads to high latency, power, and area consumption. A better way is to use fused three-operand FP adders.

Fig. 5. Proposed FP three-operand addition.

The proposed FP three-operand adder is implemented as shown in Fig. 5 in which new alignment and addition blocks are introduced. Moreover, due to the sign-embedded representation of the significands (i.e., BSD), a sign logic is not required.
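The fused idea behind the new alignment block can be sketched numerically; integer significands keep the sketch exact, and normalization and rounding are omitted.

```python
# Sketch of fused three-operand addition: all three significands are
# aligned to a common exponent in one step and added in a single pass,
# instead of two cascaded two-operand FP additions. This illustrates the
# fused-alignment idea only, not the BSD datapath of Fig. 5.
def add3(operands):
    # operands: list of (significand, exponent) pairs; value = m * 2**e
    emin = min(e for _, e in operands)
    total = sum(m << (e - emin) for m, e in operands)  # one alignment step
    return total, emin                                 # unnormalized result

m, e = add3([(5, 3), (-3, 1), (7, 0)])  # 40 - 6 + 7 = 41
print(m * 2 ** e)  # 41
```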


Advantages of the proposed system:
  • Area is reduced
  • Delay is reduced


Software tools used:
  • Modelsim
  • Xilinx ISE