TABLE III
ENERGY PER BUS CYCLE (EpBC) AND AVERAGE POWER (AvP) FOR DIFFERENT BUS ENCODERS. BUS FREQUENCY: 50 MHz

| Technology | BI AvP [mW] | BI EpBC [pJ] | CBI AvP [mW] | CBI EpBC [pJ] | APBI AvP [mW] | APBI EpBC [pJ] | BS 4-way AvP [mW] | BS 4-way EpBC [pJ] | BS 16-way AvP [mW] | BS 16-way EpBC [pJ] |
|-----------------|------|------|------|------|------|------|------|------|------|------|
| 0.18 μm Lo. Le. | 1.79 | 36 | 1.66 | 33 | 2.38 | 48 | 8.90 | 178 | 6.14 | 123 |
| 0.13 μm Lo. Le. | 0.90 | 18 | 0.77 | 15 | 0.93 | 19 | 4.65 | 93 | 2.01 | 40 |
| 0.18 μm Hi. Sp. | 1.91 | 38 | 1.67 | 33 | 2.10 | 42 | 9.27 | 185 | 7.19 | 144 |
| 0.13 μm Hi. Sp. | 1.04 | 21 | 0.87 | 17 | 1.28 | 26 | 5.10 | 102 | 3.34 | 67 |

Fig. 7. Actual energy consumption ratio $E_{\%}$ versus bus line capacitance for BS implementation.

### VI. CONCLUSIONS

The outcome of our work is that the BS approach can be effectively applied to off-chip buses, obtaining better performance than previous approaches. When area is not a primary constraint, the fully parallel implementation is definitely preferable. Minor design improvements may still be introduced in the pattern transmission.

# A 1.2 GHz Programmable DLL-Based Frequency Multiplier for Wireless Applications

Chua-Chin Wang, Yih-Long Tseng, Hsien-Chih She, and Ron Hu

Abstract—A CMOS local oscillator using a programmable delay-locked loop (DLL)-based frequency multiplier is presented in this paper. The maximum measured output frequency is 1.2 GHz.
The frequency of the output clock is $8\times$ to $10\times$ that of an input reference clock between 100 and 150 MHz in simulation. No LC-tank is used in the proposed design, so the power dissipation as well as the active area is drastically reduced. The design is carried out in the TSMC 1P5M 0.25-$\mu$m CMOS process with a 2.5 V power supply. The average lock time is shortened by initializing the start-up voltage of the voltage-controlled delay tap line at the middle of the working range. Meanwhile, the power dissipation is 52.5 mW at 1.2 GHz output.

Index Terms—DLL, frequency multiplier, programmable.

Manuscript received October 28, 2002; revised June 25, 2003, June 6, 2004. This work was supported in part by the National Science Council under Grant NSC 92-2220-E-110-001 and in part by the National Health Research Institute under Grant NHRI-EX93-9319EI.

- C.-C. Wang and Y.-L. Tseng are with the Department of Electrical Engineering, National Sun Yat-Sen University, 80424 Kaohsiung, Taiwan R.O.C. (e-mail: ccwang@ee.nsysu.edu.tw).
- H.-C. She is with VastView Technology Inc., 300 Hsin-Chu, Taiwan R.O.C.
- R. Hu is with Asuka Microelectronics Inc., 300 Hsin-Chu, Taiwan R.O.C.

Digital Object Identifier 10.1109/TVLSI.2004.837997

Fig. 1. Proposed programmable frequency multiplier.

### I. INTRODUCTION

Ever since low-cost radio frequency (RF) CMOS technology became a challenger to its conventional discrete counterpart [1], CMOS local oscillators (LOs) have been required to offer better phase-noise performance and lower power consumption [2]. Many CMOS RF transceivers have been proposed, e.g., [2]–[4]. Although [4] proposed a fixed-frequency RF LO to block-downconvert the entire RF band to the IF band, it requires another channel-select LO to downconvert the desired channel to baseband.
[2] proposed a nonprogrammable design based upon a delay-locked loop (DLL), but it uses noise-prone and slow current-driven OPAMPs to construct the replica bias. In this paper, we propose an enhanced implementation of LOs using a programmable DLL-based frequency multiplier without LC-tanks.

### II. DLL-BASED FREQUENCY MULTIPLIER

Most current wireless systems implement the frequency multiplier with a phase-locked loop (PLL)-based architecture. A conventional PLL divides the output frequency of the voltage-controlled oscillator (VCO) by a feedback frequency divider and compares the phase relation with the reference in order to speed up or slow down the VCO until the phase is locked. If such a design is integrated in a CMOS RF transceiver, the low-Q CMOS components will deteriorate the phase noise. Regarding jitter, the jitter accumulated in the ring oscillator of a PLL carries over to the starting point of the next clock cycle. Hence, a PLL usually exhibits more jitter than a DLL. Referring to [11], the jitter of PLLs and DLLs is related as follows:

$$\mathrm{Jitter}_{\mathrm{PLL}} = \alpha \cdot \mathrm{Jitter}_{\mathrm{DLL}} \tag{1}$$

where $\mathrm{Jitter}_{\mathrm{PLL}}$ is the phase jitter of the PLL and $\mathrm{Jitter}_{\mathrm{DLL}}$ is the phase jitter of the DLL. The factor $\alpha$ is $\sqrt{1/(2K_dK_wK_LT)}$, where $K_d$ is the phase detector gain, $K_w$ is the VCO gain, $K_L$ is the loop filter gain, and $T$ is the reference period. According to (1), the jitter in PLLs is $\alpha$ times that of DLLs.

In our design, we propose a programmable DLL-based frequency multiplier with selectable output frequency that avoids the above jitter accumulation problem. The proposed programmable frequency multiplier is shown in Fig. 1. It includes a phase-frequency detector (PFD), a charge pump, a loop filter, a voltage-controlled delay tap line (VCDTL), positive edge collectors (PEC), and clock generators (CG).
Since the function of the PFD is well known in the literature, it is not discussed in the following text. S[1:n] is a one-hot code used to select the output frequency. In the case of n=10, if S[1:10] = '0 000 000 001', the $10\times$ mode is chosen. If S[1:10] = '0 000 000 100', the $8\times$ mode is chosen. $f_{\rm ref}$ is the input reference frequency, $f_{\rm back}$ is the output of the VCDTL, and $f_{\rm out}$ is the output frequency of the DLL. The signal relations of the VCDTL, positive edge collector, and clock generator are shown for a $2\times$ mode example in Fig. 2(a). When the DLL is locked, $f_{\rm ref}$ leads $f_{\rm back}$ by exactly one cycle. $\mathrm{PEC}_1$ and $\mathrm{PEC}_2$ sense $\phi_1$ to $\phi_4$ and generate $PULSE_2$ and $PULSE_4$. $PULSE_2$ and $PULSE_4$ drive $\mathrm{CG}_1$ and $\mathrm{CG}_2$, respectively. Finally, $f_{\rm out}=2f_{\rm ref}$ is generated. If the number of delay stages in the VCDTL is 2n, the maximum number of PEC cells is n. The frequency of $f_{\rm out}$ equals $(n/2) \times f_{\rm ref}$. The detailed functions of the remaining modules are described as follows.

Fig. 2. (a) Example of a four-stage tap line and (b) the left-most part of VCDTL.

**VCDTL**: It comprises a plurality of cascaded stages. Each stage is composed of one delay cell and one demultiplexer (DEMUX). Such a series of cascaded stages is called the tap line. The first stage of the tap line is driven by the external reference clock, $f_{\rm ref}$, which is assumed to be crystal clean. At the output node of each stage, a clock with a different phase shift is generated; these clocks are named $\phi_1,\ldots,\phi_{2n}$. The DEMUXs are controlled by the external signals, S[1:n], to determine the feedback path from the VCDTL to the PFD. The feedback path determines the multiplication factor between $f_{\rm ref}$ and $f_{\rm out}$. Referring to Fig.
1, assume the programmable range of $f_{\rm out}$ is $n_1 \times f_{\rm ref}$ to $n_2 \times f_{\rm ref}$ ($n_1 < n_2$), and the range of $f_{\rm ref}$ is $f_1$ to $f_2$ ($f_1 < f_2$). In order to avoid the stuck/harmonic lock problem, the range of $f_{\rm back}$ should be limited as

$$2n_1f_1 > f_{\text{back}} > \frac{n_2f_2}{2}.$$

The left-most part of the VCDTL is shown in Fig. 2(b). The current through MP2 of the first delay cell mirrors the current through MP1. The current of MP2 determines the delay of each delay cell. Hence, the delay is voltage controlled. The current in MP1 is the sum of the currents through MN1 and MN2. MN2 is a fixed current sink as long as it is saturated. The current through MN1 is determined by the voltage at its gate, i.e., $V_{\rm ctrl}$, which is supplied by the loop filter. The resistor R is a current limiter. Notably, a simple way to shorten the lock time is to bias the start-up voltage of $V_{\rm ctrl}$ to the middle of the entire operating range. For instance, if the operating voltage range is 1.4 to 2.6 V, then as soon as the reset signal of the DLL is enabled, the start-up switch is turned on and $V_{\rm ctrl}$ is biased to 2.0 V by a band-gap voltage reference. Note that the DEMUX at the output of each odd-numbered stage is a dummy DEMUX. Its only function is to equalize the load of each delay cell. Hence, the control lines of these dummy DEMUXs are all grounded.

Fig. 3. Circuits of charge pump and loop filter.

**Charge Pump and Loop Filter**: The circuits of the charge pump [10] and the loop filter are shown in Fig. 3. The charge sharing problem is solved by the stand-by current ($I_{\rm stand-by}$). At UP operation, the loop filter is charged by $I_{\rm UP}+I_{\rm stand-by}$. On the other hand, the loop filter is discharged by $I_{\rm DOWN}+I_{\rm stand-by}$ at DOWN operation.
Assuming $I_{\rm UP}+I_{\rm stand-by}=I_{\rm UP-DOWN}=I_{\rm DOWN}+I_{\rm stand-by}$, the open-loop gain $K_{\rm OP-loop}$ of the proposed DLL is

$$\begin{split} K_{\mathrm{OP-loop}} &= (D_I(s) - D_O(s)) \cdot \frac{I_{\mathrm{UP-DOWN}} \cdot K_{\mathrm{DL}} \cdot Z_{\mathrm{CL}}}{n'} \\ Z_{\mathrm{CL}} &= \frac{R_p s C_{p1} + 1}{s (C_{p1} + C_{p2}) + s^2 R_p C_{p1} C_{p2}} \end{split}$$

where $D_O(s)$ is the period of $f_{\rm back}$, $D_I(s)$ is the period of $f_{\rm ref}$, $I_{\rm UP-DOWN}$ is the charge current of the loop filter, $K_{\rm DL}$ is the gain of the VCDTL, $Z_{\rm CL}$ is the impedance of the loop filter, and $n'$ is the selected mode (for example, $n'=8$ for the $8\times$ mode). The closed-loop gain $K_{\mathrm{CL-loop}}$ of the proposed DLL follows from

$$\begin{split} D_O(s) &= (D_I(s) - D_O(s)) \cdot f_{\text{ref}} \cdot \frac{I_{\text{UP-DOWN}} \cdot K_{\text{DL}} \cdot Z_{\text{CL}}}{n'} \\ K_{\text{CL-loop}} &= \frac{D_O(s)}{D_I(s)} = \frac{1}{1 + \frac{n'}{f_{\text{ref}}I_{\text{UP-DOWN}}K_{\text{DL}}Z_{\text{CL}}}}. \end{split}$$

**Positive Edge Collector (PEC)**: This module monitors the rising edges of the clocks generated in the VCDTL. As soon as a rising edge is detected, a corresponding low pulse is triggered, e.g., $PULSE_2, PULSE_4, \ldots, PULSE_{2n}$. Each low-pulse train $PULSE_{2i}$, which marks the corresponding rising edge, is determined by two adjacent generated clocks, i.e., $\phi_{2i-1}$ and $\phi_{2i}$. In short, the PEC detects the rising edges of the generated clocks and produces the corresponding low pulse trains with purely digital logic. For example, Fig. 2(a) shows a four-stage tap line. The top of Fig. 2(a) shows the waveforms of the clocks generated at the output node of each stage. Assume that (0,0,0,0) is the initial state of $Q_1, Q_2, Q_3, Q_4$.
Table I is the truth table of $Q_1, Q_2, Q_3, Q_4$ and the resulting $PULSE_2$ and $PULSE_4$. The function of the PEC cell is proved to be independent of the initial conditions of the DFFs. The main advantage of such a design is its noise immunity, since the cells are entirely digital circuits.

TABLE I
TRUTH TABLE OF THE 4-STAGE TAP LINE

| Clocks | $Q_1$ | $Q_2$ | $Q_3$ | $Q_4$ | $PULSE_2$ | $PULSE_4$ |
|----------|-------|-------|-------|-------|-----------|-----------|
| $\phi_1$ | 1 | 0 | 0 | 0 | 0 | 1 |
| $\phi_2$ | 1 | 1 | 0 | 0 | 1 | 1 |
| $\phi_3$ | 1 | 1 | 1 | 0 | 1 | 0 |
| $\phi_4$ | 1 | 1 | 1 | 1 | 1 | 1 |
| $\phi_1$ | 0 | 1 | 1 | 1 | 0 | 1 |
| $\phi_2$ | 0 | 0 | 1 | 1 | 1 | 1 |
| $\phi_3$ | 0 | 0 | 0 | 1 | 1 | 0 |
| $\phi_4$ | 0 | 0 | 0 | 0 | 1 | 1 |

Fig. 4. Clock generator.

Fig. 5. (a) The die photo and the layout of the proposed design. (b) Simulation waveforms given from $8\times$ to $10\times$.

Fig. 6. (a) Single sideband (SSB) phase noise at $f_{\rm out}=1.2$ GHz. (b) Spurious tones performance at $f_{\rm out}=1.2$ GHz = 120 MHz input $\times$ 10.

**Clock Generator**: This module generates the desired frequency according to the external selection signal, S[1:n]. The PEC outputs, $PULSE_2$ to $PULSE_{2n}$, are the strobe sources of $f_{\rm out}$. Evidently, the low pulse trains generated by the PEC cells cannot be used directly as an ideal clock source. A pseudo-N logic is utilized to realize such a generator, as shown in Fig. 4. S[i] is the *i*th control line, which determines whether the *i*th clock generator is enabled. If S[i] is low, MN3 is on, which grounds the gate of MN4 and disables the corresponding *i*th cell. By contrast, if S[i] is high, MN3 is off and the low pulse train $PULSE_{2i}$ is propagated to the gate of MN4. Then, $f_{\text{out}}$ is generated.
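The PEC behaviour listed in Table I can be reproduced with a small behavioural model. The sketch below is not the authors' circuit: it assumes that each $Q_k$ toggles on the rising edge of $\phi_k$ and that $PULSE_{2i}$ is low whenever $Q_{2i-1}$ and $Q_{2i}$ differ, which is simply a rule inferred from the table rows. It also assumes the all-zero initial state used in the text; the function name `simulate_pec` is hypothetical.

```python
def simulate_pec(num_stages=4, cycles=2):
    """Behavioural sketch of the PEC truth table (assumed toggle/XNOR model).

    Assumptions (not taken from the paper's schematic):
      * Q_k toggles on every rising edge of phi_k;
      * PULSE_{2i} is 0 while Q_{2i-1} != Q_{2i}, and 1 otherwise;
      * all Q start at 0.
    Returns a list of (clock_name, Q_tuple, PULSE_tuple) rows.
    """
    q = [0] * num_stages
    rows = []
    for _ in range(cycles):
        for k in range(num_stages):
            q[k] ^= 1  # rising edge of phi_{k+1} toggles Q_{k+1}
            pulses = tuple(int(q[j] == q[j + 1])
                           for j in range(0, num_stages, 2))
            rows.append((f"phi_{k + 1}", tuple(q), pulses))
    return rows


if __name__ == "__main__":
    for clock, q_state, pulses in simulate_pec():
        print(clock, q_state, pulses)
```

In this model each of the two pulse trains goes low once per reference cycle, so combining them yields two strobes per cycle, in line with the $f_{\rm out}=2f_{\rm ref}$ example of Fig. 2(a).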
Because of the difference between the pull-up time and the pull-down time, the pseudo-N clock generator may suffer from duty cycle drift. Through careful layout matching, good transistor size matching, and complete PVT corner simulations, the duty cycle variation can be reduced. In this design, the worst-case duty cycle variation is reduced to 6.11%.

In short, all of the circuits except the current mirror at the left-most section of the VCDTL are digital and programmable. These digital circuits are robust against noise, power drifting, and temperature variations.

TABLE II
THE SIMULATION AND MEASUREMENT SUMMARY OF THE PROPOSED DESIGN

| | Simulation | Measurement |
|-----------------|------------|-------------|
| VDD | 2.5 V | 2.5 V |
| Ref. clock | 120 MHz | 120 MHz |
| Multiply factor | 10× | 10× |
| Output freq. | 1.2 GHz | 1.2 GHz |
| Output level | -5 dBm | -17 dBm |
| Phase noise | | -88 dBc/Hz @ 10 kHz; -98 dBc/Hz @ 50 kHz |
| Spurious tones | -11 dBc | -20 dBc |
| Average power | 23.2 mW | 52.2 mW |

TABLE III
PERFORMANCE COMPARISON WITH PRIOR DESIGNS

| Design | Power consumption | Max. freq. | VDD | Process |
|------------|---------|---------|-------|--------------|
| [5] | 52 mW | 400 MHz | 2.3 V | 0.16 μm DRAM |
| [6] | 60 mW | 400 MHz | 1.7 V | 0.4 μm CMOS |
| [7] | 132 mW | 130 MHz | 3.3 V | 0.35 μm CMOS |
| [8] | 42 mW | 320 MHz | 3.3 V | 0.35 μm CMOS |
| [2] | 129 mW | 900 MHz | 3.3 V | 0.35 μm CMOS |
| Our design | 52.2 mW | 1.2 GHz | 2.5 V | 0.25 μm CMOS |

#### A. Simulations and Implementation

In order to verify the correctness and performance of the proposed design, we use the Taiwan Semiconductor Manufacturing Company (TSMC) 0.25-$\mu$m 1P5M CMOS process to implement the entire circuit. Fig. 5(a) shows the die photo as well as the layout of our design.
Besides the circuits described in Section II, several proven and well-known circuits are also adopted, including a high-speed and low-power PFD and charge pump by Lee [10], and a glitch-free single-phase dual-O/P DFF by Huang [9]. As shown in Fig. 5(b), when the programmable multiplication factor increases from $8\times$ to $10\times$ given a 150 MHz reference clock, the lock time is 0.6 $\mu$s. With noise of 10% $\times$ $V_{\rm DD}$ coupled to $V_{\rm DD}$, our design still functions correctly, although the lock time is extended to 4.3 $\mu$s.

The HP4433B Signal Generator is used to feed the chip with $f_{\rm ref}$, which ranges from 100 to 150 MHz. The spectrum analyzers used for physical measurements are the HP8563E and HP8594E. Fig. 6(a) shows the SSB phase noise at $f_{\rm out}=120$ MHz $\times$ 10 = 1.2 GHz, and Fig. 6(b) shows the spurious tones at the same output. The overall characteristics of the simulation results and the physical chip measurements are shown in Table II. Table III summarizes the performance comparison of the proposed circuit and several prior designs. Our circuit achieves the highest output frequency with one of the lowest power consumptions; only [8] and [5] consume less.

### B. Conclusion

This paper presents a programmable LO design approach using a DLL-based frequency multiplier. Apart from a current mirror, the design is purely digital logic, which in turn eliminates the noise-prone problem. No large inductors and capacitors are required to balance the output impedance. The power consumption is low compared to most of the prior works. The lock time is also drastically shortened, since the start-up voltage is biased to the middle of the operating range.

#### REFERENCES

- [1] T. H. Lee, *The Design of CMOS Radio-Frequency Integrated Circuits*. Cambridge, U.K.: Cambridge Univ. Press, 1998.
- [2] G. Chien and P. R. Gray, "A 900-MHz local oscillator using a DLL-based frequency multiplier technique for PCS applications," *IEEE J. Solid-State Circuits*, vol. 35, pp. 1996–1999, Dec. 2000.
- [3] Texas Instruments, TRF6900—Single-chip RF transceiver, Reading, MA, May 2000.
- [4] J. C. Rudell, J. J. Ou, T. Cho, G. Chien, F. Brianti, J. A. Weldon, and P. R. Gray, "A 1.9-GHz wide-band IF double conversion CMOS receiver for cordless telephone applications," *IEEE J. Solid-State Circuits*, vol. 32, pp. 2071–2088, Dec. 1997.
- [5] S. J. Kim, S. H. Hong, J. K. Wee, J. H. Cho, P. S. Lee, J. H. Ahn, and J. Y. Chung, "A low-jitter wide-range skew-calibrated dual-loop DLL using antifuse circuitry for high-speed DRAM," *IEEE J. Solid-State Circuits*, vol. 37, pp. 726–734, June 2002.
- [6] B. W. Garlepp, K. S. Donnelly, J. Kim, P. S. Chau, J. L. Zerbe, C. Huang, C. V. Tran, C. L. Portmann, D. Stark, Y. F. Chan, T. H. Lee, and M. A. Horowitz, "A portable digital DLL for high-speed CMOS interface circuits," *IEEE J. Solid-State Circuits*, vol. 34, pp. 632–644, May 1999.
- [7] H. H. Chang, J. W. Lin, C. Y. Yang, and S. I. Liu, "A wide-range delay-locked loop with a fixed latency of one clock cycle," *IEEE J. Solid-State Circuits*, vol. 37, pp. 1021–1027, Aug. 2002.
- [8] S. S. Hwang, K. M. Joo, H. J. Park, J. W. Kim, and P. Chung, "A DLL based 10–320 MHz clock synchronizer," in *Proc. 2000 IEEE Int. Symp. Circuits and Systems*, vol. 5, May 2000, pp. 265–268.
- [9] Q. Huang and R. Rogenmoser, "A glitch-free single-phase CMOS DFF for gigahertz applications," in *Proc. 1994 IEEE Int. Symp. Circuits and Systems*, vol. 4, June 1994, pp. 11–14.
- [10] W.-H. Lee, J.-D. Cho, and S.-D. Lee, "A high speed and low power phase-frequency detector for charge pump," in *Proc. Design Automation Conf. Asia and South Pacific*, vol. 1, Jan. 1999, pp. 269–272.
- [11] B. Kim, T. C. Weigandt, and P. R. Gray, "PLL/DLL system noise analysis for low jitter clock synthesizer design," in *Proc. 1994 IEEE Int. Symp. Circuits and Systems*, vol. 4, June 1994, pp. 31–34.
### Sequence-Switch Coding for Low-Power Data Transmission

Myungchul Yoon

Abstract—Reducing the power dissipated by buses has become one of the most important elements in low-power VLSI design. A new coding scheme called sequence-switch coding (SSC) is proposed in this paper. It is a general-purpose coding scheme that exploits the sequence of data to reduce the number of transitions on buses. A simple switching algorithm is presented to show the feasibility of SSC. According to simulations, this algorithm reduces bus transitions by around 10% in the transmission of benchmark files. SSC can be used for burst data transfer in any application. In particular, it is suitable for internet and multimedia applications that have a stream-type data transfer pattern.

*Index Terms*—Bus-invert (BI) coding, lagger algorithm, low-power transmission, sequence-switch coding (SSC), transition-reduction scheme.

Manuscript received March 13, 2003; revised May 18, 2004. This work was supported by the BrainKorea21 Program in 2003.

The author is with the BrainKorea21, Department of Electronics and Computer Engineering, Korea University, Seoul 136-701, Korea (e-mail: mc_yoon@naver.com).

Digital Object Identifier 10.1109/TVLSI.2004.837995

# I. INTRODUCTION

Buses have been used as an efficient communication link among functional modules in very large scale integration (VLSI) systems. Whereas the size of functional modules decreases with the development of semiconductor technology, the size of VLSI chips increases, and so does the number of functional modules on a chip. Increasing communication requirements among the modules demand more complicated and more efficient buses. Currently, internal bus design plays an important role in the performance of a chip. Since many modules rely on the buses for their communication, the buses are usually heavily loaded, so that they dissipate a considerable amount of power in operation.
Activation of external buses consumes significant power as well, because many input/output (I/O) pins and large I/O drivers are attached to the buses. Typically, 50% of the total power is consumed at the I/Os for well-designed low-power chips [1]. Thus, reducing the power dissipated by buses has become one of the most important concerns in low-power VLSI design.

The dynamic power dissipated in a bus is expressed as follows [2]:

$$P_{\rm BUS} = \sum_{\rm line} C_{\rm load} V_{\rm DD}^2 N_{\rm trans}$$

where $C_{\rm load}$ is the total load capacitance attached to a bus line, $V_{\rm DD}$ is the voltage swing at operation, and $N_{\rm trans}$ is the number of transitions per second. There are two approaches to reducing the dynamic power of buses. One is to save the dynamic power per activation by reducing either $C_{\rm load}$ or $V_{\rm DD}$. The reduced-swing bus is an example of this approach [3]–[5]. The other is to reduce the number of bus activations by coding. Bus-invert (BI) coding [6], Gray code [7], and the beach solution [8] are some examples of this approach.

Before the advent of internet and multimedia systems, most of the I/O data used in VLSI chips were granulated data, which are requested aperiodically on demand and consist of a few bytes of discrete information. The previous coding schemes were devised for applications with this kind of I/O pattern. However, a new pattern of data transmission has become a great concern with the widespread use of internet and multimedia systems. Data are transferred like a stream when used in applications such as MP3 players and video players. Once an operation is started, it requires the transmission of large amounts of data, from a few kilobytes to hundreds of megabytes. For example, web-surfing or downloading files from the internet involves transmission of a few kilobytes of data, while playing music or movies needs streaming data transmission of up to a few gigabytes.
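As a rough illustration of the dynamic-power expression for $P_{\rm BUS}$ given earlier, the sketch below counts the transitions on each bus line for a sequence of words and evaluates $\sum C_{\rm load} V_{\rm DD}^2 N_{\rm trans}$. The function names, the equal load on every line, and the numeric values in the usage example (10 pF, 2.5 V, 50 M words/s) are assumptions chosen only for illustration; they do not come from the paper.

```python
def count_transitions(words, width):
    """Return the number of 0->1 or 1->0 transitions seen on each bus line."""
    counts = [0] * width
    for prev, cur in zip(words, words[1:]):
        diff = prev ^ cur  # bits that changed between consecutive words
        for bit in range(width):
            counts[bit] += (diff >> bit) & 1
    return counts


def bus_dynamic_power(words, width, c_load, v_dd, words_per_second):
    """Evaluate P_BUS = sum over lines of C_load * V_DD^2 * N_trans.

    N_trans is the per-line transition rate (transitions per second),
    estimated from the word sequence. All lines are assumed to carry the
    same load capacitance c_load (an illustrative simplification).
    """
    if len(words) < 2:
        raise ValueError("need at least two words to observe a transition")
    duration = (len(words) - 1) / words_per_second  # seconds spanned
    counts = count_transitions(words, width)
    return sum(c_load * v_dd ** 2 * (n / duration) for n in counts)


if __name__ == "__main__":
    # Hypothetical 4-bit bus toggling every line on every word.
    sequence = [0b0000, 0b1111, 0b0000]
    print(count_transitions(sequence, 4))
    print(bus_dynamic_power(sequence, 4, 10e-12, 2.5, 50e6))
```

Because the power is proportional to the per-line transition counts, any scheme that lowers those counts for the same data lowers $P_{\rm BUS}$ by the same proportion.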
As streaming becomes one of the major data transfer patterns, we have one more degree of freedom, i.e., the sequence of data, that we can exploit to reduce the number of bus transitions during data transmission. A new coding scheme called *sequence-switch coding* (SSC) is proposed in this paper. It is different from previous transition-reduction coding schemes in that it is aimed at applications with the *stream-type* data transfer pattern. SSC reduces the number of bus transitions by rearranging the transmission sequence of data. An algorithm called *lagger algorithm* is presented to show the feasibility of SSC. This algorithm reduces around 10% of bus transitions in transmission of the benchmark files. Section II presents a brief review of previous work for low-power bus transmission. The basic idea of SSC is described in Section III. An SSC algorithm is presented in Section IV, and combination of this algorithm and BI Coding is discussed in Section V. Performance of SSC algorithms is evaluated by simulations, and the results and analysis are shown in Section VI. Finally, Section VII concludes this work. # II. PREVIOUS WORK Since I/O coding was proposed in [9] to reduce transient noises, there have been many efforts to reduce the dynamic power of buses by coding which can minimize the number of bus transitions in transmission. BI coding [6] is a general-purpose coding that is suitable for the transmission of uncorrelated data. Some variations of BI such as partial BI