

JrnlID 11265\_ArtID 128\_Proof# 1 - 28/8/2007

Journal of VLSI Signal Processing 2007<br>© 2007 Springer Science + Business Media, LLC. Manufactured in The United States.<br>DOI: 10.1007/s11265-007-0128-8

### Power-Aware Design of An 8-Bit Pipelining ANT-Based CLA Using Data Transition Detection

CHUA-CHIN WANG, GANG-NENG SUNG AND PAI-LI LIU

Department of Electrical Engineering, National Sun Yat-Sen University, Kaohsiung, 80424, Taiwan


Received: 1 February 2007; Revised: 00 Month 0000; Accepted: 10 July 2007

Abstract. A high-speed and low-power 8-bit carry-lookahead adder (CLA) using two-phase all-N-transistor (ANT) blocks, which are arranged in a PLA design style with power-aware pipelining, is presented. The pull-up charging and pull-down discharging of the transistor arrays of the PLA are accelerated by inserting two feedback MOS transistors between the evaluation NMOS blocks and the outputs. An analysis of the area (transistor count) tradeoff is also provided in this work. The addition of two 8-bit binary numbers completes in two cycles. The proposed power-aware pipelining design methodology uses a simple data transition detection circuit to shut down the processing stages whose inputs are identical in two consecutive cycles. The data transition detection circuit is used to monitor the state switching of input data. The methodology is shown to be suitable for long adders as well, and the power consumption is reduced by up to 50% at every process corner.

Keywords: power-aware, ANT, data transition detection, CLA, pipeline

### 1. Introduction

Fast adders are key elements in digital circuits, including multipliers [1] and DSP chips [2]. Many efforts have focused on improving adder designs [[5,](#page-8-0) [7–9](#page-8-0)]. CMOS dynamic logic has been recognized as one of the promising options for adder designs operating at GHz frequencies or even higher [[3–4\]](#page-8-0). Other logic families suffer from a variety of difficulties, which were addressed in [\[7](#page-8-0)]. However, the major penalty of these prior GHz logic circuits is high power consumption, which is not a tolerable price to pay in recent mobile technologies. These circuits also unavoidably consume large power even when they are in a stand-by condition. We, hence, propose a power-aware PLA-like structure to improve our high-speed all-N-transistor (ANT) function block [\[7](#page-8-0), [9](#page-8-0), [10](#page-8-0)]. An 8-bit CLA using ANTs, which are arranged in the power-aware PLA-like structure and asynchronously triggered, is implemented to verify the power reduction as well as the preservation of high speed. A simple but effective data transition detection (DTD) circuit is proposed to resolve the power consumption problem. The major advantage of the power-aware design methodology is that it remains robust for long data words, e.g., 64-bit binary data.

A physical 8-bit ANT-based CLA using the proposed power-aware DTD is fabricated on silicon. Physical measurements verify that the power dissipation is considerably lower than that of prior works.

### 2. Power-Aware High-Speed 8-bit CLA

<span id="page-1-0"></span>

Although the N-block dynamic logic intrinsically possesses high speed [\[4](#page-8-0)], it is not good enough for operation in the gigahertz range. The reasons are: firstly, the slopes of the clock's edges must be gentle, and secondly, the number of stacks in the evaluation N-block severely affects the size of all of the transistors in the unit.

#### 2.1. All-N-Transistor (ANT) Function Unit

 Hence, a modified dynamic logic, ANT [\[10](#page-8-0)], has been proposed as shown in Fig. 1. The feature of this modification is the feedback transistor pair, P3 and N3, between the evaluation block and the output.

- 1). When $clk = 0$, P1 is on and the gate of P2 is precharged to $V_{dd}$. Then, P2 is off and N4 is off. This makes the output stay at the previous state.
- 2). When $clk = 1$ and the N-block is evaluated to be "pass", the charge at node A should theoretically be discharged to ground through the N-block and N1. Note that N4 is on and N2 is also on at the beginning. If the previous state of the output is high, then N3 will be turned on via N4. This means that N3 provides another fast discharging path for the charge at node A. When the voltage at node A drops below the threshold voltage of the PMOS, P2 and P3 start to turn on. The output will then be charged to $V_{dd}$ via paths P2 and P3-N4.



Figure 1. ANT logic.

- 3). When $clk = 1$, the previous state of the output is low, and the N-block is evaluated to be "pass", the voltage at node A starts to drop. When $V_A - V_{dd} > V_{tp}$, where $V_{tp}$ is the threshold voltage of P3, P3 will be turned on such that the gate of N3 will be charged to $V_{dd}$. Not only will the charge at node A be discharged faster, but the output will also be charged high via P2 and N4.
- 4). When $clk = 1$ and the N-block is evaluated to be "stop", the charge at node A should be kept if the previous state of the output is low. There will be no discharging path for node A because N3 will be off via N4. If the previous state is high, the output will be pulled to ground via N4 and N2 before the voltage at node A starts to drop.

Summarizing cases 2 and 3 above, the output will be high when the N-block is evaluated "pass", i.e., "1", during $clk = 1$. By case 4, the output will be low when the N-block is evaluated "stop", i.e., "0", during $clk = 1$. The function of the ANT logic block, thus, is correct and non-inverting. Restated, P3 and N3, respectively, provide an extra charging path and an extra discharging path such that the evaluation can be accelerated.
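The four cases can be condensed into a small logic-level model. The following is a minimal sketch that assumes ideal switching and ignores all timing and charge effects; the function name `ant_output` and its arguments are hypothetical and are introduced here only for illustration.

```python
def ant_output(clk: int, n_block_pass: bool, prev_out: int) -> int:
    """Logic-level view of one ANT block (illustrative abstraction only).

    clk          -- clock level, 0 or 1
    n_block_pass -- True if the N-block evaluates to "pass", False for "stop"
    prev_out     -- output value held from the previous state
    """
    if clk == 0:
        # Case 1: precharge phase, P2 and N4 are off, the output holds.
        return prev_out
    # Cases 2 and 3: "pass" drives the output high.
    # Case 4: "stop" drives (or keeps) the output low.
    return 1 if n_block_pass else 0


if __name__ == "__main__":
    for clk in (0, 1):
        for n_pass in (False, True):
            for prev in (0, 1):
                print(clk, n_pass, prev, "->", ant_output(clk, n_pass, prev))
```

During evaluation the output follows the N-block result regardless of the previous state, which is the non-inverting behavior stated in the text.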

In addition to the previous discharging path problem, one of the reasons why other high-speed logic cannot run correctly given clocks with short rise time or fall time is that the size of each transistor cannot be tuned properly. Both [[3\]](#page-8-0) and [\[4\]](#page-8-0) intrinsically possess this shortcoming. The sizing of the transistors in the ANT, besides those in the N-block, drastically affects the speed. We have carried out several simulations to find the best figure of merit for the sizing of each transistor in Fig. 1 using TSMC 0.25 $\mu$m 1P5M CMOS technology.

#### 2.2. PLA-Styled 8-Bit CLA Design

The formulation of an 8-bit CLA is represented by the following equations:

$$
\begin{aligned}
S_i &= C_{i-1} \oplus P_i \\
C_i &= G_{i-1} + P_{i-1}G_{i-2} + P_{i-1}P_{i-2}G_{i-3} + \ldots + P_{i-1}P_{i-2} \ldots P_1P_0C_0
\end{aligned}
\tag{1}
$$

where $A_i, B_i$, $i = 0 \ldots 7$, are inputs, and $P_i, G_i$ are *propagate* and *generate* signals, respectively, $P_i = A_i \oplus B_i$, $G_i = A_i \cdot B_i$.
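As a worked illustration of the carry expansion, the sketch below evaluates the sum-of-products form of the carry directly from the propagate and generate bits and returns the resulting sum. It is a functional model only, not the circuit. The name `cla_add` is hypothetical, and the sketch assumes that `carries[i]` denotes the carry entering bit position `i`, with `carries[0]` the external carry-in; this indexing is an assumption of the sketch.

```python
def cla_add(a: int, b: int, c0: int = 0, n: int = 8):
    """Add two n-bit numbers using the flattened carry-lookahead expansion."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]  # propagate: A xor B
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]  # generate:  A and B

    carries = [c0]
    for i in range(1, n + 1):
        # carry into bit i:
        # G_{i-1} + P_{i-1}G_{i-2} + ... + P_{i-1}...P_1 P_0 C_0
        c = 0
        prefix = 1  # running product P_{i-1} ... P_{k+1}
        for k in range(i - 1, -1, -1):
            c |= prefix & g[k]
            prefix &= p[k]
        c |= prefix & c0
        carries.append(c)

    # each sum bit is the propagate bit xor the carry entering that bit
    s = [p[i] ^ carries[i] for i in range(n)]
    total = sum(bit << i for i, bit in enumerate(s))
    return total, carries[n]


if __name__ == "__main__":
    total, carry_out = cla_add(0b10110101, 0b01101011)
    print(total, carry_out)
```

Because every carry is a two-level AND-OR expression of the $P_i$, $G_i$ and $C_0$, no carry has to wait for a lower-order carry to ripple through.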

<span id="page-2-0"></span>

<span id="page-3-0"></span>

Figure 4. Power-aware circuitry.

Figure 5. Block diagram of the proposed CLA.

If the $P_i$s and $G_i$s are produced by combinatorial logic function blocks before they are fed into the function blocks for $S_i$s and $C_i$s, then Eq. [\(1](#page-1-0)) implies that a two-level AND-OR logic function block is a possible solution to achieve high speed operations. Thus, the PLA-styled design is suitable for such a function block. A conceptual PLA-styled design for CLA is shown in Fig. [2](#page-2-0). A typical PLA consists of an AND array and an OR array. It is well known that the series NMOS in the evaluation block of NAND or AND gates will produce long discharging delays which subsequently slow down the entire circuit. We can take advantage of the non-inverting feature of the ANT logic to utilize a NOT-OR-NOT-OR configuration instead of the typical AND-OR style, where the two OR planes are made of ANT logic blocks. Meanwhile, it can also minimize the series transistor count in the evaluation block. The OR array is made of the ANT logic with a predefined evaluation block. The inputs to the first OR array are the inverted $P_i$s (propagate) and $G_i$s (generate) signals, which are also produced by other ANT logic units as shown in Fig. 3. Note that we define the propagate signals in a different way from the traditional $P_i = A_i + B_i$ because $P_i = A_i \oplus B_i$ can be reused to generate the sum term, i.e., $S_i$.
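The equivalence between the AND-OR form and the NOT-OR-NOT-OR arrangement follows from De Morgan's law. The sketch below checks it exhaustively for one carry term, $G_1 + P_1G_0 + P_1P_0C_0$. It is a Boolean-level illustration only, and the helper names `and_or` and `not_or_not_or` are hypothetical.

```python
from itertools import product


def and_or(terms):
    """Classic two-level form: OR of AND terms."""
    return int(any(all(t) for t in terms))


def not_or_not_or(terms):
    """Two OR planes with inverters in front of each plane."""
    # First plane: OR over the inverted literals of each product term.
    first_plane = [int(any(1 - x for x in t)) for t in terms]
    # Second plane: OR over the inverted first-plane outputs.
    return int(any(1 - y for y in first_plane))


def carry_terms(g1, p1, g0, p0, c0):
    """Product terms of G1 + P1*G0 + P1*P0*C0."""
    return [(g1,), (p1, g0), (p1, p0, c0)]


if __name__ == "__main__":
    agree = all(
        and_or(carry_terms(*bits)) == not_or_not_or(carry_terms(*bits))
        for bits in product((0, 1), repeat=5)
    )
    print("forms agree on all 32 input patterns:", agree)
```

Each product term becomes an inverted OR of inverted literals, so only parallel (OR-type) evaluation blocks are needed and long series NMOS stacks are avoided.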

#### 2.3. Speed and Area Analysis

*Speed.* The critical path of an adder resides in the generation of the carry signals, i.e., $C_8$ in the 8-bit adder. After the binary data are ready, the generation of the $P_i$s and $G_i$s by the ANT logic takes the high half of a full cycle. That is, the results of the GP blocks will be ready when *clk* is low. The inverted $P_i$s and $G_i$s will then be fed into the first OR plane of the ANT-based PLA. The inverted outputs of the first OR plane will be presented to the second OR plane in the high half of the second cycle. The final $C_i$s are then ready in the low half of the second cycle. Right after the $C_i$s are generated, they are inverted and fed into the $S_i$ function blocks. Another half cycle is then required to produce all of the $S_i$s. The final result will be latched after two cycles.

Figure 6. Regulator in the power-aware circuitry.

<span id="page-4-0"></span>

Figure 7. Schematic of the proposed CLA.

*Area.* As for the transistor count of the PLA-styled implementation of the CLA using ANT logic, though an analytic form has been derived in [[10\]](#page-8-0), the buffers required for the system clock tree and the propagation of intermediate signals were ignored. The analysis of the cost of the buffers is as follows.

*Clock Tree and Buffers.* To avoid any degradation resulting from signal propagation, buffers are required at the output of each $C_{i+1}$, $S_i$, $G_i$, and $P_i$, $\forall i = 1, \ldots, n$, and in the clock tree for the system clock. The total is $B_t = 12 \times 5n$, since there are a total of 12 transistors in a single buffer.

Figure 8. Die photo of the proposed CLA.

<span id="page-5-0"></span>

Figure 9. Post-layout simulation result.

In summary, the total number of transistors required to implement an $n$-bit CLA with the PLA-styled design using ANT logic is

$$
T_{ANT\ total}(n) = \frac{1}{6}(n+1)(n+2)(n+3) + 5n(n+1)+50n+3+60n \tag{2}
$$

#### 2.4. Data Transition Detection (DTD)

<span id="page-6-0"></span>

Figure 10. Waveforms of physical measurement by Agilent 1660C.

A simple way to improve the power efficiency is to "deny" the current fed into those function units whose input data are identical in two consecutive operation cycles. The dynamic power of the CMOS logic elements will, hence, be drastically reduced. Take the ANT block shown in Fig. [1](#page-1-0) as an example. Assume the N-block is composed of two cascaded NMOS transistors to constitute an AND gate. The probability that the data inputs are identical in two consecutive operation cycles is 25%, which implies that a significant portion of the power consumption can be saved. Hence, a monitoring circuitry, called data transition detector (DTD), as shown in Fig. 4 is proposed to meet the low power demand.
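The 25% figure for the two-input example can be checked by direct enumeration. The sketch below assumes that every input vector is equally likely and independent from cycle to cycle; this uniform-input assumption and the function name `repeat_probability` are introduced here only for illustration.

```python
from fractions import Fraction
from itertools import product


def repeat_probability(num_inputs: int) -> Fraction:
    """Probability that two consecutive random input vectors are identical."""
    vectors = list(product((0, 1), repeat=num_inputs))
    same = sum(1 for prev in vectors for cur in vectors if prev == cur)
    return Fraction(same, len(vectors) ** 2)


if __name__ == "__main__":
    # Two-input AND gate: 4 identical pairs out of 16.
    print(repeat_probability(2))
```

Under these assumptions the probability is $1/2^k$ for a block with $k$ inputs, so the share of skippable cycles depends strongly on how many inputs a block observes and on how correlated real input data are.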

Table 1. Characteristics of the proposed power-aware 8-bit adder.

|                   | Proposed CLA                       |
|-------------------|------------------------------------|
| Highest data rate | 500 MHz                            |
| Area              | $1.360\times1.180$ mm<sup>2</sup>  |
| Transistor count  | 3,988                              |

Table 2. Power reduction by using the power-aware DTD circuitry (given random input vectors, system clock = 200 MHz).

The DTD design is based on an important observation: the state switching of either $A_i$ or $B_i$ will cause a series of state switches with regard to $P_j$, $G_j$, and $C_j$, $\forall j \geq i$. Hence, an early state transition detection of lower bits can be used to determine whether the computation of higher bits is required or not. The DTD carries out the monitoring mechanism and triggers the addition operations asynchronously depending on the comparison of the previous operands and the current operands. The DTD is composed of three blocks: three stages of delay chains to generate phase-shifted data pulses, a two-cycle generation circuitry, and a voltage regulator.

As shown in Fig. [4](#page-3-0), $D_0$ is propagated through three delay stages, each comprising four cascaded inverters, to generate $D_1$, $D_2$, and $D_3$. The delay time of the delay chains can be adjusted by the control signals SEL1 and SEL2. $D_i$, $i = 0, \ldots, 3$, are inverted respectively to generate $\overline{D_i}$, $i = 0, \ldots, 3$. Since the 8-bit PLA-styled CLA needs 2 cycles to complete an addition operation, we conclude that $(D_3D_2D_1D_0) = 0001$, 0111, 1110, 1000 are the states required to generate two consecutive clock cycles for the required addition. Hence, the individual two-cycle generation circuitry for each ANT logic, e.g., CLK0 for C1 and SUM0, is carried out as shown in Fig. [5](#page-3-0).
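One way to see why these four states yield two clock cycles per data transition is a discrete-time model of the delay chain. The sketch below is a behavioral assumption, not the fabricated circuit: each delay stage is modeled as exactly one time step, the generated clock is taken to be high whenever $(D_3D_2D_1D_0)$ is in one of the four listed states, and the names `two_cycle_clock` and `count_pulses` are hypothetical.

```python
PULSE_STATES = {"0001", "0111", "1110", "1000"}  # (D3 D2 D1 D0) from the text


def two_cycle_clock(d0_wave):
    """Generated clock for a D0 waveform sampled once per delay-stage step."""
    # Assume the chain has settled to the initial input value.
    d1 = d2 = d3 = d0_wave[0]
    clock = []
    for d0 in d0_wave:
        state = f"{d3}{d2}{d1}{d0}"
        clock.append(1 if state in PULSE_STATES else 0)
        # Each stage hands its input on one step later.
        d3, d2, d1 = d2, d1, d0
    return clock


def count_pulses(clock):
    """Number of rising edges in the generated clock."""
    return sum(1 for prev, cur in zip([0] + clock, clock)
               if prev == 0 and cur == 1)


if __name__ == "__main__":
    rising = [0, 0, 1, 1, 1, 1, 1, 1]   # D0 switches 0 -> 1
    falling = [1, 1, 0, 0, 0, 0, 0, 0]  # D0 switches 1 -> 0
    steady = [0] * 8                    # no data transition
    for wave in (rising, falling, steady):
        clk = two_cycle_clock(wave)
        print(wave, "->", clk, "pulses:", count_pulses(clk))
```

In this model a rising input walks through 0001, 0011, 0111, 1111 and a falling input through 1110, 1100, 1000, 0000, so each transition produces two pulses separated by a low phase, while a steady input produces none.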

However, the most critical part of the proposed DTD is the sensitivity of the strobe duration with respect to the power variation. One of the most efficient approaches to cope with an unstable power supply is to employ step-down bandgap-referenced voltage regulators to supply a temperature-independent reference voltage to the rest of the circuitry [6]. Referring to Fig. [6](#page-3-0), the regulator is composed of AMP, PM61, and a resistor string. The generated internal voltage for the DTD is a very stable $V_{int} = V_{dd} - V_{thp}$, where $V_{thp}$ is the threshold voltage of PM61.

*Area Overhead.* A single DTD is composed of a three-stage buffer (12 transistors), four four-input AND gates (32 transistors), four inverters (eight transistors), six transmission gates (24 transistors), and a NOR gate (eight transistors). In brief, there are a total of 84 transistors in a single DTD. For an $n$-bit adder, we need $2n \times 84$ transistors to carry out the DTDs. Besides, in the two-cycle clock generation circuitry, a total of $n$ NAND gates and $(n - 1)$ OR gates are required for the $n$-bit adder. Thus, $4n + 6(n - 1)$ MOS transistors must also be taken into account. Hence, the area overhead caused by the proposed power-aware design is as follows.

$$
T_{\text{DTD}}(n) = 168n + 10n - 6 \tag{3}
$$

Notably, the cost ratio of the DTD is defined as $CR(n) = \frac{T_{\text{DTD}}(n)}{T_{\text{ANT total}}(n)}$. Hence,

$$
\lim_{n \to \infty} CR(n) = \lim_{n \to \infty} \frac{T_{DTD}(n)}{T_{ANT\ total}(n)} = 0,\tag{4}
$$

according to Eqs. ([2\)](#page-5-0) and (3). This implies that the proposed power-aware design is particularly suitable for long adders.
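The trend of the cost ratio can be made concrete by evaluating the transistor-count expressions numerically. The sketch below simply codes Eqs. (2) and (3) as printed, together with the per-DTD component counts listed in the area overhead paragraph; the function and dictionary names are hypothetical.

```python
# Transistors per component of a single DTD, as listed in the text.
DTD_PARTS = {
    "three-stage buffer": 12,
    "four 4-input AND gates": 32,
    "four inverters": 8,
    "six transmission gates": 24,
    "NOR gate": 8,
}


def t_ant_total(n: int) -> int:
    """Eq. (2): transistors of the n-bit PLA-styled ANT CLA, buffers included."""
    # (n+1)(n+2)(n+3) is a product of three consecutive integers,
    # so the division by 6 is exact.
    return (n + 1) * (n + 2) * (n + 3) // 6 + 5 * n * (n + 1) + 50 * n + 3 + 60 * n


def t_dtd(n: int) -> int:
    """Eq. (3): 2n DTDs plus n NAND gates (4n) and n-1 OR gates (6(n-1))."""
    per_dtd = sum(DTD_PARTS.values())
    return 2 * n * per_dtd + 4 * n + 6 * (n - 1)


def cost_ratio(n: int) -> float:
    """CR(n) = T_DTD(n) / T_ANT_total(n)."""
    return t_dtd(n) / t_ant_total(n)


if __name__ == "__main__":
    for n in (8, 16, 32, 64, 128):
        print(n, t_ant_total(n), t_dtd(n), round(cost_ratio(n), 3))
```

Since the numerator grows linearly in $n$ while the denominator grows cubically, the ratio falls quickly with word length. Evaluating the expressions as printed gives 1408 and 1418 transistors at $n = 8$ and a ratio of roughly 0.15 at $n = 64$.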

### 3. Simulations and Measurement

The block diagram of the proposed 8-bit power-aware PLA-styled ANT-based CLA has been shown in Fig. [5](#page-3-0). The detailed schematic and die photo of the CLA implemented in the TSMC 0.25 $\mu$m 1P5M CMOS process are shown in Figs. [7](#page-4-0) and [8](#page-4-0), respectively. An example of the output waveform of the 8-bit power-aware PLA-styled CLA using ANT logic is shown in Fig. 9, which illustrates that the result of an addition appears after two cycles given that $V_{dd}$ is coupled with a 1 MHz sine wave noise possessing 10% $V_{dd}$ amplitude. Fig. [10](#page-6-0) shows the output waveforms of the addition of "00000000" and "00000000", and the addition of "00000000" and "11111111", respectively. The measurement is carried out by using an Agilent 1660CP logic analyzer. The characteristics of the proposed power-aware CLA are summarized in Table [1](#page-6-0).

To reveal the power-saving advantage of the proposed power-aware design, two 8-bit adders are, respectively, implemented by the approach of [\[10](#page-8-0)] and the proposed design using the same CMOS process. The power reduction of the power-aware design is summarized in Table [2](#page-6-0), where random test vectors are used to test these circuits.

### 4. Conclusion

We propose a power-aware high-speed PLA-styled ANT logic design for the implementation of adders. A novel but simple DTD circuit is used to monitor the switching activity of the input data such that unnecessary power consumption is avoided. Not only is the correctness of the function preserved given a fast clock, but the power dissipation is also reduced.

### Acknowledgement

The authors would like to express their deepest gratitude to CIC of NSC for their thoughtful chip fabrication service. The authors would also like to thank the "Aim for Top University Plan" project of NSYSU and MOE, Taiwan, and the technology development program for academia (92-EC-17-A-07-S1-025) in Taiwan. This research was partially supported by National Science Council under grants NSC 92-2220-E-110-001 and 92-2220-E-110-004.

<span id="page-8-0"></span>

### References

- 1. V.G. Oklobdzija, D. Villeger and T. Soulas, "An Integrated Multiplier for Complex Numbers," The Journal of VLSI Signal Processing, vol. 7, no. 3, Oct. 1994, pp. 213–222.
- 2. R.V.K. Pillai, D. Al-Khalili, A.J. Al-Khalili and S.Y.A. Shah, "A Low Power Approach to Floating Point Adder Design for DSP Applications," The Journal of VLSI Signal Processing, vol. 27, no. 3, March 2001, pp. 195–213.
- 3. M. Afghahi, "A Robust Single Phase Clocking for Low Power, High-Speed VLSI Applications," IEEE J of Solid-State Circuits, vol. 31, no. 2, Feb. 1996, pp. 247–253.
- 4. R.X. Gu and M.I. Elmasry, "All-N-Logic High-Speed True-Single-Phase Dynamic CMOS Logic," IEEE J Solid-State Circuits, vol. 31, no. 2, Feb. 1996, pp. 221–229.
- 5. R. Rogenmoser and Q. Huang, "An 800-MHz 1 µm CMOS Pipelined 8-b Adder Using True Phase Clocked Logic-Flip-Flops," IEEE J Solid-State Circuits, vol. 31, no. 3, Mar. 1996, pp. 401–409.
- 6. K. Sundaresan, K.C. Brouse, K. U-Yen, F. Ayazi and P.E. Allen, "A 7-MHz Process, Temperature And Supply Compensated Clock Oscillator in 0.25 µm CMOS," 2003 International Symposium on Circuits and Systems (ISCAS'03), vol. 1, May 2003, pp. 693–696.
- 7. C.-C. Wang, C.-J. Huang and K.-C. Tsai, "A 1.0 GHz 0.6-µm 8-bit Carry Lookahead Adder Using PLA-Styled All-N-Transistor Logic," IEEE Trans of Circuits and Systems, Part II: Analog and Digital Signal Processing, vol. 47, no. 2, Feb. 2000, pp. 133–135.
- 8. Z. Wang, G.A. Jullien, W.C. Miller, J. Wang and S.S. Bizzan, "Fast Adders Using Enhanced Multiple-Output Domino Logic," IEEE J Solid-State Circuits, vol. 32, no. 2, Feb. 1997, pp. 206–214.
- 9. C.-C. Wang, Y.-L. Tseng, P.-M. Lee, R.-C. Lee and C.-J. Huang, "A 1.25 GHz 32-bit Tree-Structured Carry Lookahead Adder Using Modified ANT Logic," IEEE Trans on Circuits and Systems I: Fundamental Theory and Applications, vol. 50, no. 9, Sep. 2003, pp. 1208–1216.
- 10. C.-C. Wang, C.-F. Wu and K.-C. Tsai, "A 1.0 GHz 64-bit High-Speed Comparator Using ANT Dynamic Logic with Two-Phase Clocking," IEE Proceedings - Computers and Digital Techniques, vol. 145, no. 6, Nov. 1998, pp. 433–436.