# A Low Power High-Speed 8-Bit Pipelining CLA Design Using Dual Threshold Voltage Domino Logic

Chua-Chin Wang, Senior Member, IEEE, Chi-Chun Huang, Ching-Li Lee, Tsai-Wen Cheng

Abstract—A high-speed, low-power 8-bit carry-lookahead adder (CLA) is presented. It uses two-phase modified dual threshold voltage (dual- $V_t$ ) domino logic blocks arranged in a PLA-like design style with pipelining. The modified domino logic circuits employ dual- $V_t$ transistors and reverse bulk-source biases to reduce the subthreshold leakage current in advanced deep-submicron processes. Moreover, an NMOS transistor is inserted in the discharging path of the output inverter so that the modified domino logic can be used in a pipeline structure and its power consumption is reduced. The addition of two 8-bit binary operands is executed in 2 cycles. The design is also suitable for long adders, and measurement results on silicon show that the dynamic power consumption is reduced by more than 10%.

*Index Terms*—CLA, domino logic, dual threshold voltage CMOS, pipeline, PLA-like.

#### I. INTRODUCTION

FAST adders are key elements in digital circuits, e.g., multipliers and DSP chips. Much effort has been devoted to improving adder designs [4]-[8]. CMOS dynamic logic has been recognized as one of the promising options for adders operating in the GHz range [1]. However, the major trade-off of these prior GHz logic circuits is their high power consumption, which is not a tolerable price to pay in recent mobile technologies. These circuits unavoidably consume power even when they are in standby. A dual- $V_t$ circuit technique was proposed in [2] to reduce standby power dissipation while maintaining high performance in domino logic. Lim et al. [11] proposed an energy-efficient carry-lookahead adder using reversible energy recovery logic with a self-energy-recovery circuit to reduce the reversibility overhead. Kursun et al. [3] employed sleep switches and dual- $V_t$ CMOS technology to place an idle domino logic circuit into a low-leakage state. In this paper, we propose a low-power PLA-like structure using modified dual- $V_t$ domino logic blocks. An 8-bit CLA is built from dual- $V_t$ domino logic blocks that are arranged in a PLA-like manner [6] and synchronously triggered. It is implemented on silicon to verify the power reduction as well as the preservation of high speed. The major advantage of the low-power design methodology is that it remains robust for long data words, e.g., 64-bit binary data. The power reduction is found to be more than 10% compared to the prior works.

T.-W. Cheng is with the Asuka Semiconductor Inc., Hsin-Chu, Taiwan 30078.

#### II. LOW POWER HIGH-SPEED 8-BIT CLA

#### A. Typical Dual- $V_t$ Domino Logic Circuits

The use of dual- $V_t$ transistors to reduce the subthreshold leakage current in domino logic circuits was proposed in [2]. Using this dual- $V_t$ scheme, a typical dual- $V_t$ domino logic circuit is shown in Fig. 1. The high- $V_t$ transistors are drawn in Fig. 1 with a thick line in the channel region. The domino logic operation is divided into two phases: the precharge phase and the evaluation phase.

- 1) During the precharge phase, clock = 0, P11 is on and N11 is off. Node A is then precharged to $V_{DD}$ and the output is initialized low.
- 2) During the evaluation phase, clock = 1, P11 is off and N11 is on. If the low- $V_t$ evaluation block evaluates to "pass", node A is discharged to ground through the low- $V_t$ evaluation block and N11, and the output becomes logic high. If the low- $V_t$ evaluation block evaluates to "stop", there is no discharging path for node A. A "keeper" PMOS, P13, keeps node A at $V_{DD}$, and the output remains logic low.

In summary, the output is high when the low- $V_t$ evaluation block evaluates to "pass", i.e., "1", during clock = 1. Conversely, the output is low when the block evaluates to "stop", i.e., "0", during clock = 1.

The critical signal transitions, which determine the delay of the domino logic circuit, occur along the evaluation path when node A is discharging. Hence, in the dual- $V_t$ domino logic circuits, high- $V_t$ transistors are used in the non-critical precharge paths, whereas low- $V_t$ transistors are used in the speed-critical evaluation paths [2]. As a result, the subthreshold leakage current of the dual- $V_t$ domino logic circuit is expected to be smaller than that of an all-low- $V_t$ domino logic circuit.
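The two-phase behavior of the typical gate in Fig. 1 can be summarized by a small behavioral model. The sketch below is purely logical (no timing, leakage, or charge sharing), and the function name `typical_domino` and its arguments are illustrative assumptions rather than anything defined in the design.

```python
def typical_domino(clock, block_passes):
    """Logic-level model of the typical dual-Vt domino gate of Fig. 1.

    clock        -- 0 for the precharge phase, 1 for the evaluation phase
    block_passes -- True if the low-Vt evaluation block evaluates to "pass"
    Returns (node_a, output) as 0/1 logic values.
    """
    if clock == 0:
        # Precharge: P11 on, N11 off, node A is pulled up to VDD.
        node_a = 1
    elif block_passes:
        # Evaluation with "pass": node A discharges through the block and N11.
        node_a = 0
    else:
        # Evaluation with "stop": the keeper P13 holds node A at VDD.
        node_a = 1
    # The output inverter makes the gate non-inverting with respect to "pass".
    output = 1 - node_a
    return node_a, output


if __name__ == "__main__":
    for clock in (0, 1):
        for block_passes in (False, True):
            print(clock, block_passes, typical_domino(clock, block_passes))
```

Note that the output is always low during precharge in this model, which is the property that becomes a problem once such gates are cascaded in a pipeline.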

#### B. Modified Dual- $V_t$ Domino Logic Circuits

However, the typical dual- $V_t$ domino logic has a problem: its output cannot hold its logic state during the precharge phase of the next cycle. For example, Fig. 2 shows a typical two-stage dual- $V_t$ domino logic circuit used to construct a pipeline structure. For pipelined operation, the two stages must alternate between the precharge phase and the evaluation phase, with one stage precharging while the other evaluates. When the first stage is in the precharge phase and the second stage is in the evaluation phase, node B is forced low rather than holding its previous state, so the second stage cannot evaluate its function. Therefore, the typical circuit cannot be directly applied in a pipeline structure. By contrast, a modified dual- $V_t$ domino logic circuit, shown in Fig. 3, is proposed to resolve this difficulty. A clock-controlled NMOS transistor, N33 in Fig. 3, is inserted in the discharging path of the output inverter. The operation of the modified circuit is the same as that of the typical dual- $V_t$ domino logic circuit except in the precharge phase. During the precharge phase, clock = 0, P31 is on, and N31 and N33 are both off. The dynamic node is precharged to $V_{DD}$, so P32 is switched off. The output then has neither a charging path nor a discharging path, and it keeps its previous state. This also reduces the power consumed by the circuit.

Manuscript received April 3, 2006; revised November 17, 2006, and April 13, 2007. This work was partially supported by National Science Council under grant NSC 94-2213-E-110-022 and 94-2213-E-110-024, and National Health Research Institute under grant NHRI-EX95-9319EI.

C.-C. Wang and C.-C. Huang are with the Department of Electrical Engineering, National Sun Yat-Sen University, Kaohsiung, Taiwan 80424 (e-mail: ccwang@ee.nsysu.edu.tw).

C.-L. Lee is with the Department of Electronic Engineering, Kao Yuan University, Lujhu Township, Kaohsiung, Taiwan 82151.

Fig. 1. Typical dual- $V_t$ domino logic circuit

Fig. 2. Typical two-stage dual- $V_t$ domino logic circuit

Fig. 3. Modified dual- $V_t$ domino logic
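The difference between the typical gate and the modified gate of Fig. 3 in a two-stage pipeline can be illustrated with a logic-level sketch. It assumes that each stage is a simple non-inverting buffer and that the two stages are clocked in opposite phases; the class `DominoStage` and the function `two_stage_pipeline` are hypothetical names introduced only for this illustration.

```python
class DominoStage:
    """Logic-level model of one domino stage acting as a buffer.

    holds_output=False models the typical gate (output forced low in precharge).
    holds_output=True models the modified gate (output floats and keeps its state
    in precharge because the clocked NMOS in the inverter pull-down path is off).
    """

    def __init__(self, holds_output):
        self.holds_output = holds_output
        self.out = 0

    def step(self, clock, data_in):
        if clock == 1:
            # Evaluation phase: the output follows the evaluation result.
            self.out = data_in
        elif not self.holds_output:
            # Precharge phase of a typical gate: the output is reset to low.
            self.out = 0
        # Precharge phase of a modified gate: the output is left unchanged.
        return self.out


def two_stage_pipeline(holds_output, x):
    """Pass bit x through two stages clocked in opposite phases."""
    stage1 = DominoStage(holds_output)
    stage2 = DominoStage(holds_output)
    # First half cycle: stage 1 evaluates, stage 2 precharges.
    out1 = stage1.step(1, x)
    stage2.step(0, out1)
    # Second half cycle: stage 1 precharges, stage 2 evaluates stage 1's output.
    out1 = stage1.step(0, x)
    return stage2.step(1, out1)


if __name__ == "__main__":
    print("typical :", two_stage_pipeline(False, 1))
    print("modified:", two_stage_pipeline(True, 1))
```

In this sketch the typical gate loses a logic 1 before the second stage can evaluate it, while the modified gate passes it through, which mirrors the argument made for inserting N33.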

#### C. High- $V_t$ Transistors with Reverse-Biased Bulk-Source Voltage

According to [9], the threshold voltage $V_{TH}$ is given by

$$V_{TH} = V_{TH0} + \gamma (\sqrt{|2\phi_F| + V_{BS}} - \sqrt{|2\phi_F|}), \quad (1)$$

where $V_{TH0}$ denotes the threshold voltage with zero bulk bias, $\phi_F$ is the Fermi potential, $\gamma$ is the body-effect coefficient, and $V_{BS}$ is the bulk-source bias. Applying a positive $V_{BS}$ increases $V_{TH}$ and consequently reduces the subthreshold leakage current. To reduce the subthreshold leakage current while preserving the high speed of the proposed design, reverse bulk-source biases are applied to the high- $V_t$ transistors P31 and P33 in the non-critical precharge path. The variation of the subthreshold leakage current of the high- $V_t$ PMOS transistors with reverse bulk-source biases in the dual- $V_t$ domino logic circuit is simulated. In the simulation, the bulk-source junctions of both high- $V_t$ transistors P31 and P33 are reverse-biased by 1.2 V, i.e., $V_B = 3$ V. Meanwhile, the low- $V_t$ evaluation block is set to "pass", and the domino logic circuit is operated in the evaluation phase (clock = 1). Table I shows the simulation results for the TT model, $V_{DD} = 1.8$ V, and $25^{\circ}C$. It reveals that the subthreshold leakage current $(I_D)$ of P31 and P33 with a 1.2 V $V_{BS}$ is smaller than that with 0 V $V_{BS}$. Notably, although the $I_{BS}$ of P31 and P33 increases with reverse bulk-source biases, it is still negligible compared with the subthreshold leakage current.
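As a rough numerical illustration of (1), the sketch below evaluates the threshold shift for a 1.2 V bulk-source bias and the $I_D$ ratio taken from Table I. The values of `GAMMA` and `TWO_PHI_F` are textbook-style placeholders chosen only for illustration; they are assumptions and are not parameters of the 0.18 μm process used in this work.

```python
import math


def threshold_shift(v_bs, gamma, two_phi_f):
    """Threshold shift V_TH - V_TH0 from (1) for a reverse bias v_bs (volts)."""
    return gamma * (math.sqrt(two_phi_f + v_bs) - math.sqrt(two_phi_f))


# Illustrative assumptions, NOT extracted from the process used in the paper.
GAMMA = 0.4       # body-effect coefficient in V**0.5
TWO_PHI_F = 0.8   # |2*phi_F| in volts

shift = threshold_shift(1.2, GAMMA, TWO_PHI_F)

# Subthreshold leakage values of P31 and P33 reported in Table I (amperes).
I_D_ZERO_BIAS = 1.28e-12
I_D_REVERSE_BIAS = 2.43e-16
leakage_ratio = I_D_ZERO_BIAS / I_D_REVERSE_BIAS

if __name__ == "__main__":
    print(f"threshold shift with assumed parameters: {shift:.3f} V")
    print(f"I_D reduction factor from Table I: {leakage_ratio:.0f}")
```

With the Table I figures, the reverse bias lowers $I_D$ by a factor of roughly 5000, whereas the computed threshold shift depends entirely on the assumed parameters.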

TABLE I

COMPARISON OF THE LEAKAGE CURRENT

|                                 | $I_D$                    | $I_{BS}$                 | $I_{BD}$                 |
|---------------------------------|--------------------------|--------------------------|--------------------------|
| P31 and P33 with 0 V $V_{BS}$   | $1.28 \times 10^{-12}$ A | $1.34 \times 10^{-27}$ A | $4.07 \times 10^{-17}$ A |
| P31 and P33 with 1.2 V $V_{BS}$ | $2.43 \times 10^{-16}$ A | $1.71 \times 10^{-18}$ A | $1.71 \times 10^{-18}$ A |

#### D. PLA-Styled 8-bit CLA Design

If the propagate signals $(P_i)$ and the generate signals $(G_i)$ of a CLA are produced by combinational logic function blocks before they are fed into the function blocks for the $S_i$'s and $C_i$'s, then the Boolean equations of the $S_i$'s and $C_i$'s imply that a two-level AND-OR logic function block is a possible solution to achieve high-speed operation. Thus, a PLA-styled design is suitable for such a function block. A conceptual PLA-styled design for the CLA is shown in Fig. 4. A typical PLA consists of an AND array and an OR array. It is well known that the series NMOS transistors in the evaluation block of NAND or AND gates produce long discharging delays, which slow down the entire circuit. We can take advantage of the non-inverting feature of domino logic to use a NOT-OR-NOT-OR configuration instead of the typical AND-OR style, where the two OR planes are made of the modified dual- $V_t$ domino logic circuits shown in Fig. 3. This also minimizes the series transistor count in the low- $V_t$ evaluation block. Each OR array is made of the modified dual- $V_t$ domino logic with a predefined low- $V_t$ evaluation block. The inputs to the first OR array are the inverted $P_i$ and $G_i$ signals, which are also produced by other modified dual- $V_t$ domino logic units as shown in Fig. 5. Notably, we define the propagate signals differently from the traditional $P_i = A_i + B_i$, because $P_i = A_i \oplus B_i$ can be reused to generate the sum term, i.e., $S_i$.

Fig. 4. PLA-styled CLA
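The NOT-OR-NOT-OR arrangement can be checked at the logic level with a short sketch. It assumes the standard fully expanded lookahead equations, $C_k = \sum_{j<k} G_j \prod_{j<m<k} P_m + C_0 \prod_{m<k} P_m$, with $P_i = A_i \oplus B_i$ and $G_i = A_i B_i$; the carry-in `c0` and all function names are assumptions made for this illustration, and nothing here models the transistor-level planes.

```python
def inv(x):
    """Logic inverter on a 0/1 value."""
    return 1 - x


def or_plane(signals):
    """One wide OR gate, standing in for a row of a domino OR plane."""
    return int(any(signals))


def product_term(literals):
    """AND of the literals realized as NOT-OR-NOT (De Morgan's law)."""
    return inv(or_plane(inv(x) for x in literals))


def cla_add(a, b, c0=0, n=8):
    """Add two n-bit operands using two-level carry-lookahead equations."""
    A = [(a >> i) & 1 for i in range(n)]
    B = [(b >> i) & 1 for i in range(n)]
    P = [x ^ y for x, y in zip(A, B)]   # propagate, reused for the sum
    G = [x & y for x, y in zip(A, B)]   # generate
    C = []
    for k in range(n + 1):
        # First OR plane (on inverted literals) followed by inversion.
        terms = [product_term([G[j]] + P[j + 1:k]) for j in range(k)]
        terms.append(product_term([c0] + P[:k]))
        # Second OR plane combines the product terms into carry C_k.
        C.append(or_plane(terms))
    S = [P[i] ^ C[i] for i in range(n)]
    return sum(bit << i for i, bit in enumerate(S)) | (C[n] << n)


if __name__ == "__main__":
    for a, b in [(0, 0), (255, 1), (200, 100), (170, 85)]:
        print(a, b, cla_add(a, b))
```

Because every carry is a flat OR of product terms, no carry depends on another carry, which is what allows all $C_i$'s to be produced by the same two planes.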

#### E. Cycle-Based Operation and Area Analysis

1) Cycle-based operation: The critical path of an adder lies in the generation of the carry signals, i.e., $C_8$ in the 8-bit adder. After the binary operands are ready, the generation of the $P_i$'s and $G_i$'s by the modified dual- $V_t$ domino logic takes the high half of a full cycle. That is, the results of the GP blocks are ready when the clock is low. The inverted $P_i$'s and $G_i$'s are then fed into the first OR plane of the modified dual- $V_t$ domino-based PLA. The inverted outputs of the first OR plane are presented to the second OR plane in the high half of the second cycle. The final $C_i$ results are then ready in the low half of the second cycle. As soon as the $C_i$'s are generated, they are inverted and fed into the $S_i$ function blocks. Another half cycle is then required to produce all of the $S_i$'s. The final result is latched after 2 cycles, as shown in Fig. 6.
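One reading of this schedule is that the four stages each occupy one half cycle. The sketch below encodes that interpretation and derives the latency from it; the stage list `STAGES` and the helper names are assumptions based on the description above, not a transcription of Fig. 6.

```python
# (stage name, clock cycle, clock half in which the stage evaluates)
# This assignment is an interpretation of the text, not taken from Fig. 6.
STAGES = [
    ("P/G blocks", 1, "high"),
    ("first OR plane", 1, "low"),
    ("second OR plane (carries)", 2, "high"),
    ("sum blocks", 2, "low"),
]


def latency_cycles():
    """Each stage takes one half cycle, so latency is half the stage count."""
    return len(STAGES) / 2


def latency_ns(clock_hz):
    """Latency in nanoseconds for a given clock frequency in hertz."""
    return latency_cycles() / clock_hz * 1e9


if __name__ == "__main__":
    for name, cycle, half in STAGES:
        print(f"cycle {cycle}, {half} half: {name}")
    print("latency:", latency_cycles(), "cycles,", latency_ns(1e9), "ns at 1 GHz")
```

Under this reading, four half-cycle stages give the 2-cycle latency stated in the text.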

2) Area: An analytic form for the transistor count of the PLA-styled CLA implementation using ANT (all-N-transistor) logic has been derived in [10]. By a similar derivation, the total number of transistors required to implement the proposed *n*-bit CLA with the PLA-styled design using the modified dual- $V_t$ domino logic is as follows.

$$T_{total} = \frac{1}{6}(n+1)(n+2)(n+3) + \frac{9}{2}n(n+1) + 48n+9. \quad (2)$$

For instance, for an 8-bit adder using our proposed design, the overall transistor count is 882.
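The count for $n = 8$ can be reproduced directly from (2). The following sketch evaluates the formula with integer arithmetic; the function name `transistor_count` is an illustrative choice.

```python
def transistor_count(n):
    """Total transistor count of the proposed n-bit PLA-styled CLA, from (2)."""
    cubic = (n + 1) * (n + 2) * (n + 3) // 6   # product of 3 consecutive integers
    quadratic = 9 * n * (n + 1) // 2           # n(n+1) is always even
    return cubic + quadratic + 48 * n + 9


if __name__ == "__main__":
    # For n = 8: 165 + 324 + 384 + 9 = 882.
    print(transistor_count(8))
```

For $n = 8$ the three $n$-dependent terms contribute 165, 324, and 384 transistors, respectively, which together with the constant 9 gives 882.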

Fig. 5. (a) P block. (b) G block

Fig. 6. Operation timing diagram of the PLA-styled CLA

TABLE IV

COMPARISON OF POWER CONSUMPTION (DATA RATE = 100 MHZ)

|                                                                                     | [11]                                  | [7]                                   | Proposed design             |
|-------------------------------------------------------------------------------------|---------------------------------------|---------------------------------------|-----------------------------|
| Clock Freq.                                                                         | 200 MHz                               |                                       |                             |
| Operand Length                                                                      | 32 bit                                | 8 bit                                 | 8 bit                       |
| CMOS Process                                                                        | 1P4M 0.35 μm                          | 1P5M 0.25 μm                          | 1P6M 0.18 μm                |
| VDD                                                                                 | 1.6 V                                 | 2.5 V                                 | 1.8 V                       |
| Avg. Power                                                                          | 80 mW                                 | 9.7 mW                                | 1.03 mW                     |
| Power Reduction (Operand Length Ratio × Voltage Scaling × Energy Operation Scaling) | 6.13× $(4 \times 0.79 \times 1.94)$   | 2.68× $(1 \times 1.93 \times 1.39)$   | 1× $(1 \times 1 \times 1)$  |
| Scaled Power                                                                        | 13.05 mW                              | 3.62 mW                               | 1.03 mW                     |

Operand Length Ratio = $\frac{\text{Bit length of the prior design}}{\text{Bit length of the proposed design}}$

Voltage Scaling = $(\frac{\text{VDD of the prior design}}{\text{VDD of the proposed design}})^2$

Energy Operation Scaling = $\frac{\text{Process of the prior design}}{\text{Process of the proposed design}}$

Scaled Power = $\frac{\text{Avg. Power}}{\text{Power Reduction}}$

#### III. SIMULATIONS AND IMPLEMENTATION

To reveal the power-saving advantage of the proposed low-power design, two 8-bit CLAs are implemented using the same CMOS process, one with the modified single- $V_t$ domino logic and the other with the modified dual- $V_t$ domino logic. The detailed schematic and die photo of the two CLAs, implemented in the TSMC (Taiwan Semiconductor Manufacturing Company) 0.18 $\mu$m 1P6M CMOS process, are shown in Fig. 7 and Fig. 8, respectively. The proposed CLA using the modified dual- $V_t$ domino logic passes simulations at all process corners (FF, TT, and SS) and over the temperature range 0°C to 75°C. A post-layout simulation example is shown in Fig. 9, in which the circuit operates at 1 GHz under the worst case of the SS model at 75°C. It illustrates that the result of an addition appears after two clock cycles. Fig. 10 shows the waveforms of the modified dual- $V_t$ domino logic CLA measured by an Agilent 93000 SOC (system on a chip) test system. The characteristics of the proposed low-power CLA are tabulated in Table II. The power consumption of both CLAs, given the same randomly generated input sequences, is summarized in Table III. A power consumption comparison with several prior works is shown in Table IV. After scaling, the proposed design has the lowest power consumption among the compared designs.
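The scaling used in Table IV can be reproduced from its footnote definitions. The sketch below recomputes the power reduction factor and the scaled power of each design relative to the proposed one (8 bit, 1.8 V, 0.18 μm); the function names and the dictionary layout are illustrative assumptions, and the inputs are the values listed in Table IV.

```python
def power_reduction(bits, vdd, process_um,
                    ref_bits=8, ref_vdd=1.8, ref_process_um=0.18):
    """Power reduction factor relative to the proposed design (Table IV)."""
    operand_length_ratio = bits / ref_bits
    voltage_scaling = (vdd / ref_vdd) ** 2
    energy_operation_scaling = process_um / ref_process_um
    return operand_length_ratio * voltage_scaling * energy_operation_scaling


def scaled_power_mw(avg_power_mw, bits, vdd, process_um):
    """Average power divided by the power reduction factor."""
    return avg_power_mw / power_reduction(bits, vdd, process_um)


# (avg. power in mW, operand length in bits, VDD in V, process in um)
DESIGNS = {
    "[11]": (80.0, 32, 1.6, 0.35),
    "[7]": (9.7, 8, 2.5, 0.25),
    "Proposed design": (1.03, 8, 1.8, 0.18),
}

if __name__ == "__main__":
    for name, (power, bits, vdd, process) in DESIGNS.items():
        factor = power_reduction(bits, vdd, process)
        print(f"{name}: {factor:.2f}x, {power / factor:.2f} mW")
```

Because the sketch does not round the intermediate factors, the result for [11] comes out slightly different from the tabulated 6.13× and 13.05 mW, which are obtained from the rounded factors 4 × 0.79 × 1.94.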



Fig. 7. Schematic of the proposed CLA



Fig. 8. Die photo of the proposed CLA



Fig. 9. Post-layout simulation given SS model, 75°C



Fig. 10. Waveforms of physical measurement on silicon

TABLE II

CHARACTERISTICS OF THE MODIFIED DUAL- $V_t$  DOMINO LOGIC 8-BIT ADDER

|                      | Proposed CLA                    |
|----------------------|---------------------------------|
| VDD                  | 1.8 V                           |
| Highest System Clock | 1.0 GHz                         |
| Input Data Rate      | 500 MHz                         |
| Avg. Power           | 5.64 mW <sup>‡</sup>            |
| Area                 | $1.02 \times 1.02 \text{ mm}^2$ |
| Transistor Count     | 882                             |

<sup>‡</sup>: TT model, 25°C.

TABLE III

POWER REDUCTION BY USING THE MODIFIED DUAL- $V_t$ DOMINO LOGIC CIRCUITRY

|             | Data Rate | Modified Single- $V_t$ | Modified Dual- $V_t$ | Reduction          |
|-------------|-----------|------------------------|----------------------|--------------------|
| Simulation  | 500 MHz   | 6.29 mW                | 5.64 mW              | 10.3% <sup>‡</sup> |
| Measurement | 100 MHz   | 1.17 mW                | 1.03 mW              | 12%                |

<sup>‡</sup>: TT model, 25°C.
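The reduction percentages in Table III follow from the two power columns. A minimal check, with `reduction_percent` as an illustrative helper name:

```python
def reduction_percent(single_vt_mw, dual_vt_mw):
    """Relative power saving of the dual-Vt design over the single-Vt design."""
    return 100.0 * (single_vt_mw - dual_vt_mw) / single_vt_mw


if __name__ == "__main__":
    # Power values (mW) taken from Table III.
    print(f"simulation, 500 MHz : {reduction_percent(6.29, 5.64):.1f}%")
    print(f"measurement, 100 MHz: {reduction_percent(1.17, 1.03):.1f}%")
```

Both results round to the tabulated 10.3% and 12%.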

#### REFERENCES

- [1] R. X. Gu and M. I. Elmasry, "All-N-logic high-speed true-single-phase dynamic CMOS logic," IEEE J. of Solid-State Circuits, vol. 31, no. 2, pp. 221-229, Feb. 1996.
- [2] J. T. Kao and A. P. Chandrakasan, "Dual-threshold voltage techniques for low-power digital circuits," IEEE J. of Solid-State Circuits, vol. 35, no. 7, pp. 1009-1018, July 2000.
- [3] V. Kursun and E. G. Friedman, "Sleep switch dual threshold voltage domino logic with reduced standby leakage current," IEEE Trans. on VLSI Integration Systems, vol. 12, no. 5, pp. 485-496, May 2004.
- [4] R. Rogenmoser and Q. Huang, "An 800-MHz 1-µm CMOS pipelined 8b adder using true phase clocked logic-flip-flops," IEEE J. of Solid-State Circuits, vol. 31, no. 3, pp. 401-409, Mar. 1996.
- [5] Z. Wang, G. A. Jullien, W. C. Miller, J. Wang, and S. S. Bizzan, "Fast adders using enhanced multiple-output domino logic," IEEE J. of Solid-State Circuits, vol. 32, no. 2, pp. 206-214, Feb. 1997.
- [6] C.-C. Wang, C.-J. Huang, and K.-C. Tsai, "A 1.0 GHz 0.6-µm 8-bit carry lookahead adder using PLA-styled all-N-transistor logic," IEEE Trans. on Circuits and Systems, Part II: Analog and Digital Signal Processing, vol. 47, no. 2, pp. 133-135, Feb. 2000.
- [7] C.-C. Wang, C.-L. Lee, and P.-L. Liu, "Power-aware design of an 8-bit pipelining asynchronous ANT-based CLA using data transition detection," 2004 IEEE Asia-Pacific Conference on Circuits and Systems (APCCAS'04), vol. 1, pp. 29-32, Dec. 2004.
- [8] C.-C. Wang, Y.-L. Tseng, P.-M. Lee, R.-C. Lee, and C.-J. Huang, "A 1.25 GHz 32-bit tree-structured carry lookahead adder using modified ANT logic," IEEE Trans. on Circuits and Systems - I: Fundamental Theory and Applications, vol. 50, no. 9, pp. 1208-1216, Sep. 2003.
- [9] B. Razavi, "Design of Analog CMOS Integrated Circuits," New York, McGraw Hill, 2002.
- [10] C.-C. Wang, C.-F. Wu, and K.-C. Tsai, "A 1.0 GHz 64-bit high-speed comparators using ANT dynamic logic with two-phase clocking," IEE Proceedings - Computers and Digital Techniques, vol. 140, no. 6, pp. 433-436, Nov. 1998.
- [11] J. Lim, D.-G. Kim, and S.-I. Chae, "A 16-bit carry-lookahead adder using reversible energy recovery logic for ultra-low-energy systems," IEEE J. of Solid-State Circuits, vol. 34, no. 6, pp. 898-903, June 1999.
- [12] G. Yang, S.-O. Jung, K.-H. Baek, S. H. Kim, S. Kim, and S.-M. Kang, "A 32-bit carry lookahead adder using dual-path all-N logic," IEEE Trans. on VLSI Integration Systems, vol. 13, no. 8, pp. 992-996, Aug. 2005.

#### IV. CONCLUSION

We propose a low-power, high-speed PLA-styled dual- $V_t$ domino logic design for adder implementation. A modified dual- $V_t$ domino logic circuit is used for the pipeline structure, and unnecessary power consumption is avoided. Correct operation in the gigahertz range is preserved while the power dissipation is reduced. The PLA-styled dual- $V_t$ domino logic structure uses only one clock and produces the result of an 8-bit addition in two cycles. The proposed design can be easily expanded to a hierarchical 64-bit adder whose result is obtained in 4 cycles.

#### V. ACKNOWLEDGMENT

The authors would like to thank the National Chip Implementation Center (CIC) for providing the chip fabrication service. The authors would also like to thank the Aim for the Top University Plan project of the Ministry of Education of Taiwan and the technology development program for academia (92-EC-17-A-07-S1-025) in Taiwan for partially supporting this investigation.