# An Area-Saving Decoder Structure for ROMs

Chua-Chin Wang, Member, IEEE, Ya-Hsin Hsueh, and Ying-Pei Chen

Abstract—Read-only memories (ROMs) are widely used in both digital communication systems and daily consumer electronics. The major functions of ROMs are storage of data, programs, firmwares, etc. In this paper, a three-dimensional decoding structure for ROMs is proposed. The number of address decoding stages is drastically shortened. Hence, the delay is reduced, as well as the power consumption and area. The analysis of overall transistor count and delay is thoroughly derived. A real  $256 \times 8$  ROM possessing the proposed decoder is physically fabricated by 0.5- $\mu$ m two-poly two-metal (2P2M) CMOS technology.

Index Terms—Area saving, natural encoding, P+implant, ready-only memory (ROM) decoder, three-dimensional decoding.

#### I. INTRODUCTION

EAD-ONLY memories (ROMs) are an important part K of many digital systems, e.g., digital signal processors (DSPs), microprocessors, digital filters, etc. They are particularly important in portable systems due to the storage of programs and data. Hence, the chip size and power consumption need to be enhanced, as well as the improvement of speed. Prior ROM designs were mainly focused on the technology evolution [1], [4], [6], core architecture [5], [7], or special-purpose circuit and logic [2], [3]. The improvement of address decoders and data encoding for ROMs has long been ignored. Most of the prior decoders for ROMs utilized multiplexers to decode the row and column addresses. However, the characteristics of P+implant ROMs [8] cause the Hamming distance of adjacent words to be one. This feature leads the order of the data words to be stored, which will depend on the decoder structure such that they appear in a nonnatural order, e.g., 0, 2, 5, 3, 1, .... The following problems will then be introduced.

- 1) Different ROM users will store the data words in different patterns owing to the different decoder structures.
- Due to the data words stored in various patterns, programs to call the data stored in the ROMs must be adjusted accordingly.
- 3) Such types of ROMs are very difficult to test.

In this paper, we propose a three-dimensional address decoding structure associated with a corresponding data cell encoding arrangement for P+implant ROMs such that the data words are encoded and stored in the ROMs in a natural pattern, i.e.,  $0, 1, 2, 3, \ldots$  without any conversion circuits. Our method simplifies the procedure to encode the ROMs and decode the

Manuscript received April 2, 2001; revised July 5, 2002. This work was supported in part by the National Science Council under Grant NSC 89-2215-E-110-017 and Grant NSC 89-2215-E-110-01.

C.-C. Wang and Y.-H. Hsueh are with the Department of Electrical Engineering, National Sun Yat-Sen University, Taiwan 80424, R.O.C. (e-mail: ccwang@ee.nsysu.edu.tw).

Y.-P. Chen is with EMPIA Technology Inc., Taipei, Taiwan 114, R.O.C. Digital Object Identifier 10.1109/TVLSI.2003.816134



Fig. 1. ROM-based: (a) NOR gate and (b) NAND gate.

address such that application programs need no adjustment and testing becomes much easier. Not only is the size of the entire decoder shrunk, but the access time and power dissipation is also greatly reduced.

(b)

# II. THREE-DIMENSIONAL DECODER FOR ROMS

## A. Structures of ROMs

The storage of a bit "0" or "1" in a ROM is determined by the existence of a pass transistor residing at the cross of a row and column. Referring to Fig. 1, two typical CMOS ROMs are shown. The top of Fig. 1 is a NOR gate. When the input of any



Fig. 2. Simple one-dimensional decoding circuit for ROMs.

row turns high, the pass transistors coupled in the row are then turned on to ground the corresponding outputs. By contrast, the bottom of Fig. 1 shows an NAND gate of which output is low when the inputs to all of the rows coupled to the column are high. These two examples demonstrate that the ROMs can realize any given Boolean functions as long as the word length is adequate.

#### B. One-Dimensional Decoder Structure of ROMs

Referring to Fig. 2, a straightforward decoding circuit, i.e., one-dimensional decoder [2], for a P+implant ROM is shown. The "P+implant" NMOS transistor denotes that an NMOS is added to a layer of P+implant. The gate of this NMOS is then opened such that it cannot be turned on by a voltage asserted on its gate. Meanwhile, PMOS transistors only appear in the designated module in which PMOSs are used as pull-up loads. The operations of such a P+implant ROM are: 1) precharging: when the clock is low, the outputs W7-W0 are all precharged to high and 2) evaluation: when the clock turns high, address inputs A4-A6 will then determine which output line is connected to ground.

For instance, if A4A5A6 = 000, W0 in Fig. 2 will be kept high, while the rest of the outputs will be grounded. Thus, the row selection signal is W7 - W0 = 00000001.

## C. Two-Dimensional Decoder Structure

When the address size of ROMs increases linearly, the one-dimensional decoder structure will inevitable make the core of the ROM grow exponentially. Not only does such a decoding scheme leads to a large area, but it also increases the access time of data. A two-dimensional decoder structure was proposed to



Fig. 3. General format of the two-dimensional decoder circuit.

resolve such a problem [1], [3]. It was also called an "X-Y" decoding. Fig. 3 shows a block diagram of a 128  $\times$  1 ROM. The operations of such a design are illustrated as follows.

- 1) The address lines are equally divided into two parts. A6, A5, and A4 are fed into a P+implant decoder, which, in fact, is a one-dimensional ROM decoder to generate a word selection signal W7, W6, ..., or W0. The generated word selection signal will be fed to the ROM data core and determine which row of the core is activated. For instance, if A6A5A4 = 111, W7 is activated.
- 2) The upper decoder is fed with A3, A2, and A1 to decode which pair of columns in the data core is selected. Notably, N-implant NMOS transistors are utilized in the upper decoder. The N-implant NMOS possesses an N+ layer such that the transistor is always turned on regardless of the asserted gate voltage. Hence, if A3A2A1 = 111, then the pair of columns 14 and 15 is selected.
- 3) The lower decoder is fed with A3, A2, A1, and A0 to determine which one of the selected column is granted to be the output. Similar to the upper decoder circuit, N-implant NMOS transistors are utilized in the lower decoder. It is easy to comprehend that A0 is the one to select the column. Assume that A0 is 0. Column 14 is then chosen to be output.
- 4) The states of the 128 bits in the  $128 \times 1$  ROM is determined by whether there is a P+implant NMOS in the corresponding position.
- 5) In summary, A6A5A4A3A2A1A0 = 11111110, the state of the corresponding bit (row: W7, col: 14) in Fig. 4 will be accessed.

The features of the two-dimensional decoder are summarized as follows.

- Referring to Fig. 5, the upper decoder selects one pair of two adjacent cells in a 4 × 4 K-map. It leads to two cells (bits in ROM core) and can share a column.
- Referring to Fig. 6, the lower decoder also selects a pair of adjacent cells. However, the selection of such a pair is 1 b away from that used in the upper decoder circuit. It also



Fig. 4. Detailed schematic of the two-dimensional decoding circuit.



Fig. 5. K-map for the upper decoder.

enjoys the advantage of column sharing. The intersect of these two selected pairs will decide which bit is delivered to the output. On the other hand, we can also say that A0 is used to select which one of the selected two adjacent cells (bits) is granted by a different point of view.

#### D. Three-Dimensional Decoder Structure

The major disadvantage of the two-dimensional decoding scheme is that the column-sharing design leads to a nonnatural order of data arrangement, i.e., the order of the data bit is 0, 1, 3, 2, 6, 7,  $\dots$  This drawback brings up two side effects, i.e., it is hard to program the needed data and it is difficult to debug. We, thus, propose a three-dimensional decoder structure to resolve these problems without any format conversion circuit. Fig. 7 shows the block diagram of the three-dimensional decoder. Take the same  $128 \times 1$  ROM as an example.

|    | 00 | 01 | 11 | 10 |
|----|----|----|----|----|
| 00 | 0  | 1  | 3  | 2  |
| 01 | 4  | 5  | 7  | 6  |
| 11 | 12 | 13 | 15 | 14 |
| 10 | 8  | 9  | 11 | 10 |

Fig. 6. K-map for the lower decoder.

- 1) The address lines are divided into three parts: A6, A5, and A4 are used to decode the row number of the data core; A3, A2, and A1 are fed to an upper decoder; and only one A0 is required in the lower decoder circuit.
- 2) Two additional modules are needed to resolve the non-natural encoding problem in prior ROM decoder designs. They are the upper pass block associated with the upper decoder and the lower pass block associated with the lower decoder circuit. The elements of these two blocks are NMOSs.
- All of the N-implant NMOS transistors in prior ROM decoders are dispensable since they provide virtually no function. A binary-tree-like decoder can then be constructed, as shown in Fig. 8.
- 4) A6, A5, and A4 selects a row (word) in the data core. A3, A2, and A1 then determine which shared column is activated.



Fig. 7. Block diagram and general format of the three-dimensional decoding circuit.



Fig. 8. Binary-tree-like decoder.

5) Two pass transistors are gated with A0 and then decide which side of the activated column to be the output " $d_{\rm out}$ ."

Notably the order of the data encoded in the ROM core is in a natural order of sequence in such a decoder design, i.e.,  $0, 1, 2, 3, \ldots$  Besides this, the binary-tree-like decoders without any N-implant saves a large number of transistors such that the area becomes much smaller. Fig. 9 shows the detailed schematic of the three-dimensional decoder circuit.

## E. Performance Analysis

1) Transistor Count: Although the transistor count cannot fully represent the area of the whole chip, more transistors bring up more wiring so as to increase the chip area. In order to compare the difference among these decoder schemes, we assume n address lines to be decoded for a ROM. The following analysis excludes inverters (buffers) on the address lines and the precharging PMOSs at the output lines since they are all required in any decoder circuit. It also did not contain the transistors of the ROM core.

**One-dimensional scheme**: It is trivial to derive that the number of the NMOSs needed is

$$M_1(n) = 2^n \cdot n + 2^n + 2^{n-1}. (1)$$

**Two-dimensional scheme**: Referring to Fig. 3, we conclude the following facts, while not counting the N-implant NMOS, which are always on.

Number of MOSs in the upper decoder circuit is

$$2^{n-y-1-1+1} \cdot (n-y-1+1)$$
.

Number of MOSs in the lower decoder circuit is

$$(2^{n-y-1}-1)\cdot (n-y-1)+2(n-y).$$

Number of MOSs in the P+implant core decoder circuit is

$$2^y \cdot (y) + 2^{y-1}$$
.

Thus, the total transistor count in this scheme is the summation of the above three terms as follows:

$$M_2(n,y) = (2^{n-y-1}) \cdot (n-y) + (2^{n-y-1} - 1)$$
$$\cdot (n-y-1) + 2(n-y) + 2^{y-1} + 2^y \cdot (y). \quad (2)$$



Fig. 9. Schematic diagram of the three-dimensional decoding circuit.

 $\label{table} TABLE \quad I$  Comparison of Transistor Count in Different Scenarios

| -              |                |       |       |
|----------------|----------------|-------|-------|
| $\overline{n}$ | $\overline{y}$ | $M_2$ | $M_3$ |
| 7              | 3              | 89    | 88    |
| 7              | 4              | 96    | 92    |
| 8              | 3              | 178   | 144   |
| 8              | 4              | 133   | 120   |
| 8              | 6              | 425   | 270   |

| $\overline{y}$ | $M_3$ | $D_3$ | $M_3 \cdot D_3$ |
|----------------|-------|-------|-----------------|
| 2              | 240   | 12    | 2880            |
| 3              | 144   | 15    | 2160            |
| 4              | 120   | 22    | 1640            |
| 5              | 156   | 37    | 5772            |

For instance, if n=7, y=3, then  $M_2=89$ ; if n=8, y=4, and  $M_2=133$ . They are much smaller than  $M_1(7)=960$  and  $M_1(8)=2176$ , respectively.



Fig. 10. Three-dimensional mesh comparison of two- and three-dimensional decoders.

**Three-dimensional scheme**: Referring to Fig. 7, a total of n address lines are divided into three parts: y for the data core decoding, z for the upper decoder, and x for the lower decoder, i.e., n = x + y + z. However, A0 will be the only address line used in the lower decoder circuit



Fig. 11. Schematic of ROM4.

to determine which side of the activated column is the granted output. x is then fixed to be one. Thus, we attain n=y+z+1.

• Number of MOSs in the upper decoder circuit is

$$(2^{z+1}-1)+2\cdot 2^z$$
.

- Number of MOSs in the upper pass block is  $2^z$ .
- Number of MOSs in the lower decoder circuit is two.
- Number of MOSs in the lower pass block is  $2 \cdot 2^z$ .
- Number of MOSs in the P+implant core decoder circuit is

$$(2^{y+1}-1)+2\cdot 2^y$$
.

Thus, the total transistor count in this scheme is the summation of the above five terms. Besides this, we also substitute z = n - y - 1 to derive the following result:

$$M_3(n, y, z)|_{z=n-y-1} = (2^{z+1} - 1) + 2 \cdot 2^z + 2^z + 2 + 2 \cdot 2^z + (2^{y+1} - 1) + 2 \cdot 2^y + (2^{y+1} - 1) + 2 \cdot 2^y$$

$$M_3(n, y) = 7 \cdot 2^{n-y-1} + 4 \cdot 2^y. \tag{3}$$

By using (3), we can make a comparison in Table I.

Since  $M_1$  is definitely much larger than  $M_2$  and  $M_3$ , there is no need to compare it with the other two. Fig. 10 shows the transistor count of  $M_2(n,y)$  and  $M_3(n,y)$  in a three-dimensional mesh format.  $M_3$  is the most transistor saving.

2) Speed Analysis: Since a design with a minimal number of MOSs might not be the fastest circuit, the cost of delay using the three-dimensional decoder structure should be evaluated before the selection of y or z. If the rows and columns are not too long, we assume that the "delay count" is proportional to the number of MOSs on the data path from input to output. Hence, the delay measure of the three-dimensional decoding scheme is formulated as follows:

$$D_{3}(n,y,z)|_{z=n-y-1} \approx z + 1 + 1 + 1 + 2^{y}$$

$$\approx z + 3 + 2^{y}$$

$$= n - y + 2 + 2^{y}$$

$$= D_{3}(n,y). \tag{4}$$

Table II is then derived to discover what is the best choice for the decoder for a  $256 \times 8$  ROM.



Fig. 12. Simulations waveforms.

According to Table II, although the scenarios given y=3 and y=4 appear to possess almost the same transistor-delay product, the delay is generally deemed as a parameter with a higher priority. This why we choose n=8 and y=3 to implement a  $256\times 8$  ROM below.

#### F. Chip Implementation: $256 \times 8$ ROM

## III. SIMULATION AND TESTING

# A. Speed (Delay) Simulation

Fig. 12 shows the three-dimensional decoder ROM simulation waveforms of the address inputs and the output product given thorough HSPICE simulations. A  $4\times4$  integer multiplier can be realized by a  $256\times8$  ROM in which four address lines denote the multiplicand, while the other four represent the multiplicand.



Fig. 13. Two-dimensional decoder output waveform without pads (delay = 10.6 ns).



Fig. 14. Three-dimensional decoder output waveform without pads (delay = 3.3 ns).

tiplier. The byte-wide output is the product. The  $256 \times 8$  ROM is composed of eight  $256 \times 1$  ROMs, i.e., ROM0, ROM1, ..., ROM7. Their individual output is  $d0, d1, \ldots, d7$ , respectively. Fig. 11 is the schematic of one of the ROMs, e.g., ROM4. The NMOS, which is connected to GND, represents the P+implant NMOS. Figs. 13 and 14 reveal the maximal delay of the two-dimensional and proposed decoders, respectively. The maximal access time (delay) of the three-dimensional decoder is 3.3 ns

TABLE III COMPARISON WITH PRIOR ROMs (WITH PADS)

| Design       | access time | technology                        |
|--------------|-------------|-----------------------------------|
| Neruke's [4] | 120 ns      | $0.72~\mu\mathrm{m}$ CMOS         |
| Sunaga's [5] | 60  ns      | $1.0~\mu\mathrm{m}~\mathrm{CMOS}$ |
| ours         | 20 ns       | $0.5~\mu\mathrm{m}$ CMOS          |

TABLE IV
DELAY, POWER DISSIPATION, AND POWER-DELAY PRODUCT OF DIFFERENT ROM DECODER DESIGNS

|             | 2-D ROM decoder          | 3-D ROM decoder                   |
|-------------|--------------------------|-----------------------------------|
| delay       | 10.6 ns                  | 3.3 ns                            |
| power       | $8.3292 \times 10^{-5}W$ | $8.5133 \times 10^{-5} \text{ W}$ |
| power·delay | 0.8883 (mW·ns)           | 0.2724 (mW·ns)                    |



Fig. 15. Die photograph of the proposed three-dimensional decoder.

without pads. By contrast, that 10.6 ns is that of the two-dimensional decoder. The post-layout simulations show that the highest clock rate for our decoder is 50 MHz (period = 20 ns) with pads. Table III reveals the performance comparison with several other prior ROM designs.

#### B. Power Dissipation Simulations

Regarding the power consumption comparison, we also conduct a series of simulations that employ the Monte Carlo method of HSPICE. The number of sweeps is 30 and the clock period is 10 ns. The power dissipation results are tabulated in Table IV. Since there are more PMOSs and inverters in the three-dimensional decoder architecture, it has higher power dissipation than that of the two-dimensional decoder. However, if we consider the power-delay product as a measure, the superiority of our three-dimensional decoder for ROMs is preserved.

#### C. Physical Chip Testing

The proposed ROM decoder was approved by the Chip Implementation Center (CIC), National Science Council (NSC),





Fig. 16. Testing of the real chip (clock = 20 MHz).

and then fabricated by the United Microelectronics Company (UMC) via 0.5- $\mu$ m two-poly two-metal (2P2M) CMOS technology. The size of the chip is  $1.8 \times 1.8$  mm<sup>2</sup>. Fig. 15 shows the die photograph of the  $256 \times 8$  ROM and the proposed decoder. The input test patterns are supplied by an HP 1660CP logic analyzer. Fig. 16 shows the snapshots generated by the logic analyzer when the chip is fed with a 20-MHz clock rate. The multiplier is denoted by A0-A3 and the multiplicand by A4-A7. The product of the output of ROM from d0 to d7 is displayed in a decimal format. The maximum clock rate is 20 MHz, while the maximal access delay is measured to be 16 ns given 256 test patterns. These results justify our design.

#### IV. CONCLUSION

We have presented a novel area-saving and high-speed decoder structure for ROMs. The transistor count and the delay have been clearly analyzed and verified. The physical chip implementation using the proposed three-dimensional decoding schemes has also been presented. The simulation results turn out to be very appealing.

#### REFERENCES

- [1] E. Bertagnolli, F. Hofmann, J. Willer, R. Mary, F. Lau, P. W. von Basse, M. Bollu, R. Thewes, U. Kollmer, U. Zimmermann, M. Hain, W. H. Krautschneider, A. Rusch, B. Hasler, A. Kohlhase, and H. Klose, "ROS: An extremely high mask ROM technology based on vertical transistor cells," in VLSI Technology Dig. Tech. Papers Symp., 1996, pp. 58–59.
- [2] D. A. Hodges and H. G. Jackson, *Analysis and Design of Digital Integrated Circuits*, 2nd ed. New York: McGraw-Hill, 1988.
- [3] R. Kanan, A. Guyot, B. Hochet, and M. Delclercq, "A divided decoder-matrix (DDM) structure and its application to a 8 Kb GaAs MESFET ROM," in *Proc. IEEE Int. Circuits Systems Symp.*, vol. 3, June 1997, pp. 1888–1891
- [4] Y. Naruke, T. Iwase, M. Takizawa, K. Saito, M. Asano, H. Nishmura, and T. Mochizuki, "A 16 Mb mask ROM with programmable redundancy," in *IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers*, Feb. 1989, pp. 128–129.
- [5] T. Sunaga, "A 30-ns cycle time 4-Mb mask ROM," *IEEE J. Solid-State Circuits*, vol. 29, pp. 1353–1358, Nov. 1994.
- [6] H. Takahashi, S. Muramatsu, and M. Itoigawa, "A new contact programming ROM architecture for digital signal processor," in VLSI Circuits Dig. Tech. Papers Symp., 1998, pp. 158–161.
- [7] A. Tuminaro, "A 400 MHz, 144 Kb CMOS ROM macro for an IBM S/390-class microprocessor," in *Proc. Int. Computer Design Conf.*, Oct. 1997, pp. 253–255.
- [8] J. N. Kao, "Memory device using natural-order encoding," Patent 366496, Aug. 11, 1999.

**Chua-Chin Wang** (M'97) was born in Taiwan, R.O.C., in 1962. He received the B.S. degree in electrical engineering from the National Taiwan University, Taiwan, R.O.C., in 1984, and the M. S. and Ph.D. degrees in electrical engineering from the State University of New York at Stony Brook, in 1988 and 1992, respectively.

He is currently a Professor with the Department of Electrical Engineering, National Sun Yat-Sen University, Kaohsiung, Taiwan, R.O.C. His recent research interests include low-power and high-speed logic circuit design, very large scale integration (VLSI) design, neural networks, and interfacing I/O circuits.

**Ya-Hsin Hsueh** was born in Taiwan, R.O.C., in 1976. She received the B.S. and M.S. degrees in electrical engineering from the National Sun Yat-Sen University, Taiwan, R.O.C., in 1998 and 2000, respectively, and is currently working toward the Ph. D. degree in electrical engineering at the National Sun Yat-Sen University.

Her current research interests are VLSI design and display panel image process circuits.

Ying-Pei Chen was born in Taiwan, R.O.C., in 1974. She received the B.S. degree in computer science and information engineering from Tamkang University, Taipei, Taiwan, R.O.C., in 1998, and the M.S. degree in computer science and information engineering from the National Sun Yat-Sen University, Kaohsiung, Taiwan, R.O.C., in 2000.

She is currently with EMPIA Technology Inc., Taipei, Taiwan, R.O.C.