New Majority Gate Based Parallel BCD Adder Designs for Quantum-dot Cellular Automata

New Majority Gate Based Parallel BCD Adder Designs for Quantum-dot Cellular Automata


In this paper, we first theoretically re-defined output decimal carry in terms of majority gates and proposed a carry look ahead structure for calculating all the intermediate output carries. We have used this method for designing the multi-digit decimal adders. Theoretically, our best n-digit decimal adder design reduces the delay and area-delay product (ADP) by 50% compared with previous designs. We have implemented our designs using QCA Designer tool. The proposed QCA Designer based 8-digit PBA-BCD adder achieves over 38% less delay compared with the best existing designs.


The decimal arithmetic has received wide attention in response to the increasing demand for precision in financial and commercial based applications. Several digital processors and computers were designed including decimal arithmetic hardware units. The current CMOS technology is approaching its scaling limitation. New nanotechnologies including quantum-dot cellular automata (QCA), nano magnetic Logic (NML), and spin-wave devices (SWD) are studied due to their advantages in terms of low power and high density. These emerging nanotechnologies are based on majority logic, which is different from conventional Boolean logic in CMOS.

Figure 1 : Block diagram of 1-digit BCD adder

As the core of decimal arithmetic, previous works have been conducted into majority-based parallel decimal adders. The existing majority logic based parallel decimal adders mostly share the same structure, but differ from each other in the usage of binary adders. The 1-digit ripple carry adder (RCA) based BCD adders are proposed. However, these designs can be further optimized to reduce hardware complexity. Carry flow adder (CFA) based and carry look ahead adder (CLA) based BCD adders are presented, which show good performance. Moreover, exploits novel binary adder to propose the efficient 1-digit BCD adder, reducing comprehensive consumption. In order to fully utilize the majority gates, rewrite the correction function for less majority gates. Different from the existing designs, we use a new approach to compute carry logic in the multi-digit BCD adder.

In this paper, we propose a new definition for BCD adder output carry computation in terms of majority gates and use it for computing all the carries of the multi-digit BCD adder in parallel. We have introduced decimal group generate and decimal group propagate signals to calculate carries in the BCD adder. As a result, we have reduced delay in the multi digit BCD adder. We have used different types of binary adders, such as RCA, CFA and parallel binary adder (PBA) for realizing the proposed multi-digit BCD adder. Theoretically, our PBA based n-digit BCD adder reduces the delay and area-delay product (ADP) by 50% compared with the existing designs.

Figure 2 Block diagram of 4-digit BCD adder.

We have implemented our designs using QCA technology and designed using QCA Designer. The proposed QCA Designer based 8-digit PBA-BCD adder achieves at least 38% less delay compared with the best existing designs.

Fig. 1 shows the block diagram of conventional 1-digit BCD adder. The 1-digit BCD adder consists of 4-bit binary adder (ADD1), correction logic (CL) and 4-bit binary adder (ADD2). The binary adder (ADD1) adds the decimal number dA3:0, dB3:0 and dCin to produce the binary sum bS3:0 and output carry bCout. The CL circuit produces the cL3:0 and decimal output carry signals dCout for converting binary sum bS3:0 to decimal sum dS3:0. The cL3:0 = (0110)2, if dCout = 1 otherwise cL3:0 = (0000)2. The binary adder (ADD2) produces decimal digit dS3:0 by adding bS3:0 and cL3:0.

The theoretical delay required for generating 1-digit BCD adder dCout signal (dc(1)) and dS3:0 signal (d(1)) are given in (1) and (2).

where da1, dcl and da2 represent the delays required for single ADD1, CL and ADD2 blocks, respectively.



  • More power
  • Low Density
  • More delay



The block diagram of parallel 1-digit BCD circuit is shown in Fig. 3. The design in [12] used the same block diagram for the implementation of BCD adder but they have used ANDOR gate based output carry as shown in (6).

dCout = bCout + (bS3:0 >= 10) + (bS3:0 == 9)dCin     (6)

We are going to define the dCout in terms of majority gates. For this, we rewrite (6) as follows:

dCout = bCout + (bS3:0 >= 10) + (bS3:0 >= 9)dCin

= bCout + (bS3:0 >= 10) + [bCout + (bS3:0 >= 9)]dCin        (7)

Figure 3: Proposed block diagram of 1-digit BCD adder.

The logic signals bCout + (bS3:0 >= 10) and bCout + (bS3:0 >= 9) can be rewritten as [bCout + (bS3:0 >= 10)] · [bCout + (bS3:0 >= 9)] and [bCout + (bS3:0 >= 10)] + [bCout + (bS3:0 >= 9)], respectively. By substituting these values in dCout, we can rewrite the equation of dCout as follows:

dCout = [bCout + (bS3:0 >= 10)] · [bCout + (bS3:0 >= 9)] + [bCout + (bS3:0 >= 10) + bCout + (bS3:0 >= 9)]dCin      (8)

The dCout in (8) is clearly in 3-input majority gate form with inputs bCout + (bS3:0 >= 10), bCout + (bS3:0 >= 9) and dCin. We can write the dCout using the majority gate as shown in (9).

dCout = M(bCout + (bS3:0 >= 10), bCout + (bS3:0 >= 9), dCin) (9)

The terms (bS3:0 >= 10) and (bS3:0 >= 9) are binary signals and we are calling these signals as decimal group generate and decimal group propagate signals. These two signals are represented as dG3:0 and dP3:0, as shown in (10) and (11), respectively

dG3:0 = bCout + (bS3:0 >= 10)          (10)

dP3:0 = bCout + (bS3:0 >= 9)             (11)

The proposed majority gate form of dCout using dG3:0 and dP3:0 signals is given as follows:.

dCout = M(dG3:0, dP3:0, dCin)          (12)

The dCout in (12) uses decimal group generate and decimal group propagate signals for calculation. This is similar to CLA method for the calculation of carry. Because of this, we are calling CL stage as CL-CLA. The cL3:0 signal is calculated using the dCout as shown in (13).

cL3:0 = {0, dCout, dCout, 0}               (13)

The proposed dCout in (12) requires only 1 majority gate after calculating the dG3:0 and dP3:0 signals. Fig. 4 shows the majority gate diagram of proposed dCout in (12). We have used the majority gate results presented in [10] for calculation of dG3:0, as shown in (14).

dG3:0 = bCout + bS3 · bS2 + bS3 · bS1 = M(bCout, M(bCout, bS3, 1), M(bS3, bS2, bS1))             (14)

To save the area, we have calculated dP3:0 as follows:

dP3:0 = bCout + (bS3:0 >= 9)

= bCout + (bS3:0 >= 10) + (bS3:0 == 9)

= dG3:0 + bS3 · bS0                      (15)

We can observe that the decimal group generate and decimal group propagate signals are independent of decimal input carry, which are produced parallelly in the multi-digit BCD adder. Consequently, all decimal group generate and decimal group propagate signals of the multi-digit BCD adder share the same delay. Fig. 5 shows the majority gate circuit for calculating the carries dC1, dC2, dC3 and dC4 using decimal group generate and decimal group propagate signals. The delay required for calculating the dC4 in Fig. 5 is only the delay of four majority gates, which can be achieved from the proposed definition of dCout in (12).


  • Low power
  • High Density
  • Low delay



[1] L. K. Wang, M. J. Schulte, J. D. Thompson, and N. Jairam, “Hardware designs for decimal floating-point addition and related operations,” IEEE Transactions on Computers, vol. 58, no. 3, pp. 322-335, 2009.

Multistage Linear Feedback Shift Register Counters With Reduced Decoding Logic in 130-nm CMOS for Large-Scale Array Applications

Multistage Linear Feedback Shift Register Counters With Reduced Decoding Logic in 130-nm CMOS for Large-Scale Array Applications


Linear-feedback shift register (LFSR) counters have been shown to be well suited to applications requiring large arrays of counters and can improve the area and performance compared with conventional binary counters. However, significant logic is required to decode the count order into binary, causing system-on-chip designs to be unfeasible. This paper presents a counter design based on multiple LFSR stages that retains the advantages of a single-stage LFSR but only requires decoding logic that scales logarithmically with the number of stages rather than exponentially with the number of bits as required by other methods. A four-stage four-bit LFSR proof of concept was fabricated in 130-nm CMOS and was characterized in a time-to-digital converter application at 800 MHz.


WITH recent advances in applications such as single-photon detection, it has become necessary to implement a large number of arrayed counters in small areas. These include time-of-flight (TOF) ranging in depth cameras where counters are required to count clock cycles and also photon-counting cameras that count the number of photons in an interval. Reducing the area consumed by the counter in these applications is critical in increasing the number of pixels in the cameras, as each camera pixel contains a separate counter. While linear-feedback shift registers (LFSRs) are typically used as pseudorandom number generators, it has been shown that they are also an efficient way to implement synchronous counters and are well suited to large arrayed designs, as the shift register can act as a serial readout mechanism. LFSR counters have been used in the CMOS pixel design and in single-photon detection arrays. The clock speed of an LFSR is independent of the number of bits in the counter, and they traverse all states in the state space except the all zero state. However, the count order of LFSRs is pseudorandom, so extra processing is required to decode the LFSR state into binary order. Three different techniques to decode the LFSR sequence into binary are compared in the iteration method, the direct lookup table (LUT) method, and a time-memory tradeoff algorithm. The iteration method iterates over the entire count sequence of the LFSR and compares each to the counter value. For an n-bit LFSR, this requires approximately 2n−1 comparisons on average. The direct LUT method instead uses an n × n LUT that directly decodes the LFSR state. The time-memory tradeoff algorithm introduced in combines both methods by storing 2(N/2) LFSR count values in a table and iterating over the LFSR sequence until the count value matches a value in the table. The number of iterations is then subtracted from thestored value to obtain the decoded value. Another algorithm based on discrete logarithms was introduced in and was adapted for use with ring generator event counters.

Applications with large arrays require every cell in the array to be decoded to binary order for further processing, and for system-on-chip designs, it is necessary to perform this decoding on chip. This requirement dictates that the decoding logic must be integrable and fast, since many conversions need to occur. However, all of the above-mentioned methods grow exponentially in either time or area with the size of the LFSR. For single-photon detection applications, there are several examples of arrayed designs that would not be able to be implemented with LFSR counters without prohibitively large integrated LUTs.

This paper proposes a new counter design based on multiple LFSR stages, which can be decoded with logic that grows logarithmically with the counter size rather than exponentially. While a straightforward concatenation of LFSR counters would cause a significant performance reduction, similar to binary ripple counters, this paper introduces a technique to distribute the ripple signal in time and compensates for this in a generalized decoding logic scheme. This paper also presents a proof of concept implementation and characterization of this counter design in a 130-nm CMOS process. Throughout the remainder of this paper, an n-bit LFSR will be referred to as an n-LFSR.


  • More large number of array counters
  • More area and more power consumptions
  • Technology size is high


The general scheme of the counter design is shown in Fig. 1. There are M identical n-LFSR blocks that are controlled by an enable signal. When the (m − 1)th n-LFSR undergoes a specific state change, the enable signal is asserted so that the mth n-LFSR advances one state. This allows the entire M × n bit state space to be traversed. In large arrayed designs, the counter can also act as a high-speed serial readout chain. This is achieved with minimal additional logic that bypasses the LFSR feedback and ripple-carry blocks. The multistage counter scheme reduces the counter into M independent modules, allowing each n-LFSR to be decoded separately by an n×n bit LUT rather than an (M×n)×(M×n) bit LUT. For small n, the LUT can easily be implemented on chip.

Figure 1: Block diagram of the multistage LFSR counter.

LFSR Block:

Each stage of the counter is triggered once per period of the previous stage, so missing states from the LFSR sequence will cause large blocks of counter states to be missing from the counter state space. Thus, it is important that the n-LFSR is designed for a maximal length. The maximal sequence length of an n-LFSR is only 2n − 1, so additional logic is required to incorporate the missing state into the count sequence. This can be achieved using a NOR and XOR function to disable the feedback logic when the 0x000 …1 state is detected, as shown in Fig. 2(c). This sequence-extension logic extends the sequence length of the individual component LFSRs to 2n so that the counter covers every state in the 2M×n state space. This also allows the multistage counter to be used in applications that require every state to be covered, such as self-starting counters, where traditional LFSRs would not be applicable.

Figure 2: (a) Structure of a conventional n-bit ring generator. (b) Structure of conventional many-to-one 4-LFSRs. (c) Structure of a proposed multistage n-LFSR block with sequence-extension logic (dotted components). The entire feedback block is implemented as a single logic block.

Several LFSR feedback styles exist, including many-to-one, one-to-many (alternatively known as Fibonacci and Galois LFSRs, respectively), and ring generators. Ring generators [depicted in Fig. 2(a)] are typically regarded as the optimal way to implement an LFSR, where the shift register forms a ring and taps form subloops within the ring. However, the sequence-extension requires additional logic in the LFSR, dominating the critical path. Instead, many-to-one style LFSRs [Fig. 2(b)] are used, allowing the feedback logic and the sequence-extension logic to be combined into a single logic block for logic minimization as shown in Fig. 2(c). The multistage counter allows flexibility in choice of the size of the n-LFSR, so that small single-tap LFSRs are preferentially chosen. A single-tap many-to-one LFSR is topologically indistinguishable from the corresponding ring generator.


Ripple-Carry Logic

Since the n-LFSR contains every state in the state space, the LFSR must include the 0b1111 … → 0b0111 … transition. This state is a Gray-code transition and occurs in every n-LFSR design, so it is an ideal ripple trigger transition. This sets the start of the n-LFSR sequence to 0b0111 … so that it is decoded by the decoding logic to 0x …00. If the counter was designed so that an LUT could directly decode every stage correctly in a single clock cycle, the ripple signal would need to propagate through every stage and detect if each stage will transition. Instead, to prevent the performance of the counter from decreasing with every extra stage added to the counter, the ripple signal only acts on the direct next stage and the ripple signal for the subsequent stages is carried to the next clock cycle. This distributes the transition edge over time and, for the mth stage, adds an m clock cycle delay to the transition edge.

Figure 3: Timing diagram of the operation of the multistage LFSR counter. Arrows show the operation of the ripple-carry logic. Highlighted states require further processing by the decoding logic.

The counter timing diagram that demonstrates the operation of the ripple-carry logic is shown in Fig. 3. Each LFSR state is given as a binary value (0b …), whereas the state after decoding with an LUT (the LUT Decode signal) is the hex value in brackets (0x …) for each state. When LFSR 0 transitions from the 0b1111 … state to the 0b0111 … state, the RIPPLE 0 signal is generated. On the next clock edge, the RIPPLE 0 signal acts on LFSR 1 causing it to also undergo the 0b1111 … → 0b0111 … transition. This, therefore, also generates a ripple signal to act on LFSR 2 on the next clock edge. In this way, the ripple-carry logic causes the transition edge to be delayed one clock cycle per stage. The delayed transition causes an error triangle to form, shown by highlighted states in Fig. 3. These states are decoded incorrectly by the LUT and therefore need to be corrected with a minor amount of decoding logic in addition to the n × n bit LUT.

Decoding Logic

The decoding logic acts as a postprocessing step on the multistage LFSR counter array output. As each LFSR value is read out of the array, it is passed through an LUT. Fig. 3 shows that the LUT corrects most states to binary order. However, additional logic is required to correct the errors caused by the delayed transition. Two types of LUT decode errors occur: initial errors and overflow errors. Initial errors occur in the states on the upper edge of the transition error triangle when the counter is stopped on the clock cycle before the mth stage transitions. The decoded value of the previous stages is also the number of clock cycles since the start of the transition edge. Since the ripple takes m clock cycles to reach the mth stage, these errors can be detected if the decoded value of the previous stages is equal to m −1. Overflow errors occur when the previous stage has an error and is equal to 0x … FF. These errors indicate that a previous stage should have caused a ripple event on an earlier clock cycle.

The decoding logic that detects and corrects these errors is shown in Fig. 4. The error detection of each stage depends on the decoded value of the previous stages, so each stage is processed sequentially. If an error condition on the next stage is detected, the next stage invalid register is set, so that it is corrected on the next clock cycle. The errors are only ever one less than the correct value, so the next stage invalid selects either the LUT output or adds one to the LUT output. An overflow error in the next stage will occur if the current stage is an error and also 0x … FF. This can be detected by ANDing the incrementer carryout with the next stage invalid register. Initial errors can be detected by storing the previously decoded stages in latches and comparing with a counter that keeps track of the current stage number. If the current stage is equal to the previously decoded value, the next stage will have an initial error.

Figure 4: Logic to decode the multistage LFSR counter state order into binary. Each stage is decoded separately in sequence

The counter only needs to count to M and therefore needs y = log2(M)bits. Therefore, only a y-bit comparison needs to be made between the previous decoded state and the counter, so Z = (y/n)stages need to be stored. The zeros register is used to ensure that all other bits in the previously decoded value are zero so that the comparison is valid. The counter zero point (0x …00) is an error state. If all counter stages are reset to the zero state, the starting count values will be decoded incorrectly. However, the final state (0b1111 …) is not an error state, so if the counter is reset to this state instead, the counter will correctly transition to zero after one clock cycle. The final count value will be off by one, but this can be corrected by adding one to the final count value if required.

The time to decode the count value will depend on both M and n. The decoding logic takes one clock cycle per stage of the counter, so M clock cycles are required to decode the entire count value. The critical path of the decoding logic is implementation-dependent but potentially includes the LUT, the incrementer, and the comparison to the stage counter, all of which have reduced performance for large n. Thus, the maximum clock rate will be lower for larger n values, but more stages will be required for an equivalently sized counter with a larger n value, requiring fewer clock cycles to decode the counter. If the readout of the array is performed serially, then the decoding logic can be set up as a pipeline in the readout chain. This only adds an M clock cycle pipeline delay to the entire array readout, but all individual counter values will be read out with a decoded value.


  • Less Number of array counters with multistage implementation.
  • Less area and power consumptions
  • Reduced the Technology size



[1] D. Bronzi, Y. Zou, F. Villa, S. Tisa, A. Tosi, and F. Zappa, “Automotive three-dimensional vision through a single-photon counting SPAD camera,” IEEE Trans. Intell. Transp. Syst., vol. 17, no. 3, pp. 782–795, Mar. 2016.