# 24% Power Reduction by Post-Fabrication Dual Supply Voltage Control of 64 Voltage Domains in V<sub>DDmin</sub> Limited Ultra Low Voltage Logic Circuits

Tadashi Yasufuku<sup>1</sup>, Koji Hirairi<sup>2</sup>, Yu Pu<sup>1</sup>, Yun Fei Zheng<sup>1</sup>, Ryo Takahashi<sup>1</sup>, Masato Sasaki<sup>1</sup>, Hiroshi Fuketa<sup>1</sup>, Atsushi Muramatsu<sup>2</sup>, Masahiro Nomura<sup>2</sup>, Hirofumi Shinohara<sup>2</sup>, Makoto Takamiya<sup>1</sup>, Takayasu Sakurai<sup>1</sup>

<sup>1</sup> University of Tokyo, Japan

<sup>2</sup> Semiconductor Technology Academic Research Center (STARC), Japan

tdsh@iis.u-tokyo.ac.jp

# Abstract

A post-fabrication dual supply voltage ( $V_{DD}$ ) control (PDVC) of multiple voltage domains is proposed for minimum operating voltage ( $V_{DDmin}$ )-limited ultra low voltage logic circuits. PDVC effectively reduces the average $V_{DD}$ below $V_{DDmin}$, thereby reducing the power consumption of logic circuits. PDVC is applied to a DES CODEC's circuit fabricated in 65nm CMOS. The layout of the DES CODEC's is divided into 64 $V_{DD}$ domains, and each domain size is 54µm x 63.2µm. High $V_{DD}$ ( $V_{DDH}$ ) or low $V_{DD}$ ( $V_{DDL}$ ) is applied to each domain, and the selection of $V_{DD}$'s is performed based on multiple built-in self tests. $V_{DDH}$ is selected in $V_{DDmin}$-critical domains. A maximum 24% power reduction was measured with the proposed PDVC at 300kHz, $V_{DDH}$ = 437mV, and $V_{DDL}$ = 397mV.

# Keywords

Low voltage logic circuit, low power, dual supply voltage, fine-grain power supply voltage

# 1. Introduction

Reduction of the power supply voltage (V<sub>DD</sub>) is an effective method for achieving ultra low power logic circuits, since both the active power and the leakage power depend on V<sub>DD</sub>. Thus, many works have been carried out on the low V<sub>DD</sub> operation of logic circuits [1-2]. The V<sub>DD</sub> scaling is, however, limited by the minimum operating voltage $(V_{DDmin})$ of CMOS logic gates. V<sub>DDmin</sub> is the minimum power supply voltage at which the circuits operate without functional errors [3-4]. The dependence of V<sub>DDmin</sub> of flip-flop (F/F), NAND, and NOR gates on the number of logic gates is investigated first. Fig. 1 (a) shows a schematic diagram of a 2-input NAND chain for the V<sub>DDmin</sub> measurement. The NAND chain has outputs from the 11th stage to the 10001st stage. Fig. 1 (b) is a schematic diagram of an F/F chain. $V_{DDmin}$ is defined as the $V_{DD}$ at which the output of the F/F stops. Fig. 2 shows the measured dependence of V<sub>DDmin</sub> of F/F, NAND, and NOR gates on the number of stages in 65nm CMOS. The measurement of V<sub>DDmin</sub> is conducted with a slow clock (1kHz). V<sub>DDmin</sub> increases as the number of stages increases. V<sub>DDmin</sub> of F/F is much higher than that of NAND and NOR gates. For example, V<sub>DDmin</sub> of F/F is 378mV at 4096 stages. This result indicates that V<sub>DD</sub> scaling below 400mV in large scale processors with 10M to 100M logic gates is difficult, because V<sub>DDmin</sub> of 10M to 100M logic gates is above 400mV. In order to achieve ultra low V<sub>DD</sub> logic circuits, a new solution to overcome the $V_{DDmin}$ limit is required. Reducing $V_{DDmin}$ at the design phase, however, is difficult because $V_{DDmin}$ is mainly determined by random variations of the threshold voltage of transistors [5]. Furthermore, a single functional error in one F/F or logic gate raises $V_{DDmin}$ of the whole logic circuit. Therefore, to operate below this $V_{DDmin}$, $V_{DD}$ must be controlled in multiple domains. The conventional fine-grained $V_{DD}$ control at the design phase [6] cannot solve the $V_{DDmin}$ problem, because the position of the logic gate with the highest (=worst) $V_{DDmin}$ is random. Thus, to achieve low power logic circuits with ultra low $V_{DD}$, a post-fabrication dual supply voltage control (PDVC) for fine-grained $V_{DD}$ domains is proposed in this paper. PDVC reduces the power consumption of logic circuits effectively by reducing the average $V_{DD}$ below $V_{DDmin}$.
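The statement that a single failing gate sets $V_{DDmin}$ of the whole circuit can be illustrated with a small Monte Carlo sketch. It assumes, purely for illustration, that each gate has its own minimum voltage drawn independently from a normal distribution (the mean and sigma below are hypothetical, not fitted to Fig. 2) and that the chain $V_{DDmin}$ is the maximum over all gates.

```python
import random
import statistics


def chain_vddmin_mv(n_gates, rng, mean_mv=250.0, sigma_mv=30.0):
    """VDDmin of a chain, taken as the worst (maximum) per-gate value.

    The normal distribution and its parameters are illustrative
    assumptions, not values taken from the measurements.
    """
    return max(rng.gauss(mean_mv, sigma_mv) for _ in range(n_gates))


def mean_chain_vddmin_mv(n_gates, trials=50, seed=0):
    """Average the chain VDDmin over several randomly generated chains."""
    rng = random.Random(seed)
    return statistics.mean(chain_vddmin_mv(n_gates, rng) for _ in range(trials))


if __name__ == "__main__":
    for n in (16, 256, 4096):
        print(n, "stages:", round(mean_chain_vddmin_mv(n), 1), "mV")
```

Under these assumptions the average rises with the number of stages, which qualitatively mirrors the trend in Fig. 2. The absolute numbers carry no meaning.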



Fig. 1. Schematic diagram of chain circuits to measure  $V_{DDmin}$  fabricated in 65nm CMOS. (a) 2-input NAND chain. (b) F/F chain.



Fig. 2. Measured dependence of  $V_{\text{DDmin}}$  on number of stages in F/F, NAND and NOR.

978-1-4673-1036-9/12/\$31.00 ©2012 IEEE

Section 2 shows test circuits divided into 64 $V_{DD}$ domains for PDVC and discusses the area penalty of PDVC. Section 3 presents experimental results and discusses the power reduction obtained by reducing $V_{DD}$ below $V_{DDmin}$ with the proposed PDVC. Section 4 concludes this paper.

# 2. Proposed Post-Fabrication Dual Supply Voltage Control (PDVC)

Fig. 3 illustrates the difference between a conventional dual $V_{DD}$ control and PDVC. In the conventional dual $V_{DD}$ control, $V_{DD}$ is common within a functional block and is independently controlled in each block. In addition, level shifters are inserted between functional blocks with different $V_{DD}$'s. In contrast, in the proposed PDVC, the layout of the whole logic circuit is divided into many domains regardless of the functional blocks. Although the layout of PDVC has connections between different voltage domains, level shifters are not inserted, because the leakage current between different $V_{DD}$ domains is negligible when the difference between $V_{DD1}$ and $V_{DD2}$ is small. In the conventional dual $V_{DD}$ control, $V_{DD}$ of large functional blocks cannot be reduced, because the probability that F/F's with high $V_{DDmin}$ exist within a functional block increases with the block size. On the other hand, the average $V_{DD}$ is reduced below $V_{DDmin}$ by the proposed PDVC, because $V_{DD}$ of domains that do not include bad F/F's (namely F/F's with high $V_{DDmin}$) is reduced. Ref. [7] also shows a within-functional-block fine-grained adaptive dual $V_{DD}$ control. In [7], $V_{DD}$ is controlled by the setup error prediction signals generated by canary F/F's. The $V_{DD}$ control in [7], however, cannot reduce the average $V_{DD}$ below $V_{DDmin}$, because the canary F/F's also have functional errors due to their own $V_{DDmin}$. Therefore, the proposed PDVC is required in $V_{DDmin}$-limited ultra low voltage logic circuits.

Fig. 3. Comparison between conventional and proposed dual $V_{\text{DD}}$ control.

Fig. 4. (a) Block diagram of test circuit divided into 64 $V_{DD}$ domains. (b) Schematic of layout of logic circuit with 64 $V_{DD}$ domains and schematic diagram of domain n.

Fig. 4 (a) shows a block diagram of the fabricated logic circuit to demonstrate the proposed PDVC. The core circuits are series-connected 64-bit data encryption standard codec's (DES CODEC's). These DES CODEC's execute an encryption and a decryption based on the preset key. The inputs of the DES CODEC's are generated by a 64-bit linear feedback shift register, and the outputs are compressed by a 64-bit multiple input signature register (MISR). The outputs of the MISR are read using a scan chain, and the result is compared to expectation vectors. The logic circuit in Fig. 4 (a) is divided into 64 $V_{DD}$ domains without regard to its function, as shown in Fig. 3. Fig. 4 (b) shows a schematic of the layout of the logic circuit with 64 $V_{DD}$ domains. A schematic diagram of domain *n* is also shown. The circuit is divided into 8 x 8 domains, and all domains have the same size. Each domain has 2 power switches to select high $V_{DD}$ ( $V_{DDH}$ ) or low $V_{DD}$ ( $V_{DDL}$ ). The power switches are controlled domain by domain by the Select V<sub>DD</sub> signal from a tester.

Fig. 5 shows a die micrograph and a layout of the fabricated logic circuit (DES CODEC's) in 65-nm CMOS. The layout with 64 $V_{DD}$ domains was designed using a commercial auto P&R tool. The die size is 960µm x 1260µm. The core area of the DES CODEC's with 64 $V_{DD}$ domains is 516µm x 516µm. Each domain size is 54µm x 63.2µm including 2 power switches. The domain size is smaller than that of [7] (100µm x 100µm). The area overhead of the 2 power switches is 7%. Compared with a conventional single V<sub>DD</sub> design with a power switch for the power gating, however, the area overhead due to an additional power switch in PDVC is 3.5%. The area overhead due to the separation between the 64 $V_{DD}$ domains is negligible. The fabricated logic circuit (DES CODEC's) includes 110,045 gates and 4,123 F/F's. As shown in Fig. 2, $V_{DDmin}$ of this circuit is determined by F/F's instead of combinational circuits and will be around 400mV, because $V_{DDmin}$ of F/F's with 4096 stages is 378mV.

Fig. 5. Die micrograph and layout of the DES CODEC's fabricated in 65-nm CMOS.

# 3. Experimental Results

Fig. 6 shows the measured shmoo plot of the fabricated DES CODEC's from 0.4V to 1.2V. In this measurement, a single $V_{DD}$ is provided to all 64 domains. The maximum operating frequency of the circuit is 630MHz at 1.2V and decreases as $V_{DD}$ is reduced. $V_{DDmin}$ of the DES CODEC's is 400mV, because the DES CODEC's have functional errors below 400mV at any clock frequency. In this paper, the clock frequency for the $V_{DDmin}$ measurement is fixed to 300kHz. Table I summarizes the measured $V_{DDmin}$ of 10 dies in ascending order. The minimum and maximum $V_{DDmin}$ among the 10 dies are 399mV and 437mV, respectively. Although $V_{DDmin}$ varies from die to die, measuring $V_{DDmin}$ of all fabricated dies is difficult in terms of test cost. The highest (=worst) $V_{DDmin}$, however, can be calculated [5]. Thus, $V_{DDH}$ is fixed to $V_{DDmin}$ of the worst die (437mV, in this case).

Fig. 7 shows a flow chart of the algorithm used for PDVC. $V_{DDH}$ is selected in $V_{DDmin}$-critical domains, while $V_{DDL}$ is selected in $V_{DDmin}$-non-critical domains. A post-fabrication die-to-die dual $V_{DD}$ selection is inevitable, because the domains where $V_{DDmin}$ is high are determined by random transistor variations and the selection of $V_{DDH}$ or $V_{DDL}$ in each domain differs between dies. In order to minimize the power of the DES CODEC's, the number of $V_{DDL}$ domains should be maximized, because both the dynamic and leakage power of a domain are reduced by changing from $V_{DDH}$ to $V_{DDL}$. Thus, an algorithm to efficiently find $V_{DDL}$ ( $< V_{DDmin}$ of each die) domains is required to shorten the testing time.

Table I. Measured V<sub>DDmin</sub> of 10 dies

| Die # | V<sub>DDmin</sub> [mV] |
|-------|------------------------|
| 1     | 399                    |
| 2     | 404                    |
| 3     | 424                    |
| 4     | 424                    |
| 5     | 426                    |
| 6     | 430                    |
| 7     | 431                    |
| 8     | 433                    |
| 9     | 435                    |
| 10    | 437                    |

Fig. 8. Measured dependence of power reduction ratio on iteration counts in one die.

Fig. 9. (a) Measured dependence of power reduction ratio on $V_{DDH} - V_{DDL}$. (b) Dependence of percentage of $V_{DDH}$ domains on $V_{DDH} - V_{DDL}$.

The algorithm is explained in steps (i)-(v) below, where n (=100) is the maximum number of iterations, i is the current iteration count, c (<1) is a constant, w (<1) is a weighting coefficient, k is the number of domains used from a failed domain list, and a is an initial number (>1).

(i) $V_{DD}$ of all domains is assigned to $V_{DDH}$. The voltage of $V_{DDH}$ is determined by the highest (=worst) $V_{DDmin}$ across dies. Next, the voltage of $V_{DDL}$ is determined. The optimal $V_{DDL}$ in this algorithm will be discussed with Fig. 9.

(ii) Domains with $V_{DDH}$ are selected and changed to $V_{DDL}$ domains before the built-in self test is executed. Domains already at $V_{DDL}$ are not selected, because they do not include F/F's with high $V_{DDmin}$.

(a) The number of domains to be selected is calculated. Only $V_{DDH}$ domains are selected in this step (ii). The number is $ac^i$. When the iteration count is small, a larger number of domains is selected to reduce the power consumption rapidly. In contrast, when the iteration count is large, 1 domain is changed at a time to approach the optimal solution.

(b) The transition probability of each domain is calculated based on a failed domain list and w. The latest k domains are selected from the failed domain list. The failed domain list saves the domains that failed in past built-in self tests. The past failed domains have a high possibility of including logic gates with high V<sub>DDmin</sub>.

(c) Domains are randomly selected by the transition probability ratio calculated in (b). The selected  $V_{DDH}$  domains are changed to  $V_{DDL}$  domains.

(iii) Built-in self test is executed.

(iv) When the test passes, adopt the new set of $V_{DD}$. When the test fails, discard the changes and add the failed domains to the failed domain list.

(v) Increment *i*. While i < n, return to (ii).

Please note that a measurement of the power consumption is not required in this algorithm. This is important because this algorithm has a potential to be embedded in the circuits.
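Steps (i)-(v) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the values of a, c, w, and k are hypothetical (the text only gives their ranges), the weighting rule (weight w for the latest k failed domains, weight 1 otherwise) is one plausible reading of step (b), every domain of a failed trial is added to the failed domain list, and `run_bist` is a hypothetical callback standing in for the built-in self test.

```python
import random


def pdvc_select(run_bist, n_domains=64, n_iter=100,
                a=16.0, c=0.9, w=0.1, k=8, seed=0):
    """Return the set of domains assigned to VDDL after n_iter iterations.

    run_bist(vddl_domains) -> bool is a hypothetical self-test callback.
    a, c, w, k are illustrative values; w must be > 0 and k >= 1 here.
    """
    rng = random.Random(seed)
    vddl = set()      # (i) every domain starts at VDDH
    failed = []       # failed domain list, oldest entry first
    for i in range(n_iter):                       # (v) loop while i < n
        pool = [d for d in range(n_domains) if d not in vddl]
        if not pool:
            break                                 # nothing left at VDDH
        # (ii)(a) number of domains to try: a * c**i, at least 1
        count = min(len(pool), max(1, round(a * c ** i)))
        # (ii)(b) recently failed domains get the reduced weight w (assumed rule)
        recent = set(failed[-k:])
        # (ii)(c) weighted random selection without replacement
        trial = []
        for _ in range(count):
            weights = [w if d in recent else 1.0 for d in pool]
            pick = rng.choices(pool, weights=weights, k=1)[0]
            pool.remove(pick)
            trial.append(pick)
        # (iii) run the self test with the tentative assignment
        if run_bist(vddl | set(trial)):
            vddl.update(trial)                    # (iv) pass: adopt
        else:
            failed.extend(trial)                  # (iv) fail: discard, remember
    return vddl


def make_fake_bist(critical):
    """Toy chip model: the test passes only if no critical domain is at VDDL."""
    critical = frozenset(critical)

    def run_bist(vddl_domains):
        return critical.isdisjoint(vddl_domains)

    return run_bist


if __name__ == "__main__":
    critical = {3, 17, 42, 60}                    # hypothetical bad domains
    vddl = pdvc_select(make_fake_bist(critical))
    print(len(vddl), "of 64 domains at VDDL")
```

As in the text, the loop needs only the pass/fail result of the self test and never a power measurement. In the toy model, a domain is adopted only after a passing test, so a critical domain can never end up at $V_{DDL}$.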

Fig. 8 shows the measured dependence of the power reduction ratio on the number of iterations. 10 trials are executed for 1 die. These 10 lines take different routes, because the algorithm randomly selects domains. The power reduction of a chip is defined as the average of the 10 trials. In this case, the power reduction ratio after 100 iterations is 24%. In order to determine an optimal $V_{DDL}$, Fig. 9 (a) shows the measured dependence of the power reduction ratio on the voltage difference between $V_{\text{DDH}}$ and $V_{\text{DDL}}$ after 100 iterations in 10 dies. When $V_{\text{DDH}}$ - $V_{\text{DDL}}$ is 40mV or 50mV, the power reduction ratio is maximized. In this paper, $V_{\text{DDH}}$ - $V_{\text{DDL}}$ of 40mV is used, because the average power reduction ratio denoted by the dotted line in Fig. 9 (a) is maximized at 40mV. Fig. 9 (b) shows the measured dependence of the percentage of $V_{\text{DDH}}$ domains on $V_{\text{DDH}}$ - $V_{\text{DDL}}$. With increasing $V_{DDH}$ - $V_{DDL}$, the percentage of $V_{DDH}$ domains increases, while the power consumption of V<sub>DDL</sub> domains decreases. This is the reason why the power reduction ratio is maximized at V<sub>DDH</sub> - V<sub>DDL</sub> of 40mV in Fig. 9 (a).
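The trade-off behind this optimum can be written compactly. As a rough sketch, assume that every domain consumes the same power at $V_{DDH}$ (an assumption not stated in the paper) and let $\Delta V = V_{DDH} - V_{DDL}$. With $f_H(\Delta V)$ the fraction of domains left at $V_{DDH}$, as in Fig. 9 (b), and $\rho(\Delta V)$ the fractional power saving of a single domain switched to $V_{DDL}$ (both symbols are introduced here for illustration), the overall power reduction ratio is approximately

$$
R(\Delta V) \approx \bigl(1 - f_H(\Delta V)\bigr)\,\rho(\Delta V).
$$

Since $\rho$ grows with $\Delta V$ while $1 - f_H$ shrinks, their product peaks at an intermediate $\Delta V$, which the measurement places near 40mV.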



Fig. 10. Measured dependence of average power reduction ratio on iteration counts in 10 dies.

Fig. 10 shows the measured dependence of the power reduction ratio on the number of iterations in 10 dies. Each line denotes the average of 10 trials in each die. $V_{DDH}$ is 437mV and $V_{DDL}$ is 397mV. The power is reduced by 6.7% to 17% at 10 iterations, 14% to 24% at 30 iterations, and 20% to 24% at 100 iterations. As the number of iterations increases, the power reduction ratio also increases. When the test cost of 100 iterations is not acceptable, 10 or 30 iterations will be a reasonable choice.

In order to investigate the distribution of $V_{DDH}$ and $V_{DDL}$ in the 64 $V_{DD}$ domains and to check the stability of the convergence of the proposed algorithm in Fig. 7, Fig. 11 shows the measured maps of the probability of $V_{DDH}$ in the 64 $V_{DD}$ domains for 10 dies. 10 trials are performed per die to measure the probability of $V_{DDH}$. In the 64 $V_{DD}$ domains of each die, the probability of $V_{DDH}$ is nearly 0% or 100%, which indicates that the proposed algorithm is stable. The maps for the 10 dies have no strong correlations. In order to check the random and systematic components in the distribution of $V_{DDH}$ and $V_{DDL}$ in the 64 $V_{DD}$ domains, Fig. 12 shows the average of the 10 maps in Fig. 11. The probability of $V_{DDH}$ across 10 dies is less than 40%, which indicates that the position of F/F's with high $V_{DDmin}$ is almost random and that the die-to-die PDVC is required to achieve low power logic circuits with ultra low $V_{DD}$.

Fig. 12. Averaged probability map of V<sub>DDH</sub> in 10 dies

Table II. Summary of key features.

| Feature                                            | Value                       |
|----------------------------------------------------|-----------------------------|
| Technology                                         | 65nm CMOS                   |
| Clock frequency for V<sub>DDmin</sub> measurement  | 300kHz                      |
| Core area                                          | 516µm x 516µm               |
| Average power before PDVC                          | 18.4µW                      |
| Average power after PDVC                           | 14.3µW                      |
| Power reduction by PDVC                            | 6.7% to 17% at 10 iterations |
|                                                    | 14% to 24% at 30 iterations |
|                                                    | 20% to 24% at 100 iterations |

Table II summarizes the key features of the fabricated DES CODEC's. Power reduction up to 24% is achieved by the proposed PDVC.
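As a quick arithmetic check on the summary table, the average power values before and after PDVC (18.4µW and 14.3µW) can be converted into a relative saving. The helper name below is hypothetical. The table does not state which dies or iteration count these averages correspond to.

```python
def reduction_percent(before_uw, after_uw):
    """Relative power saving in percent, from powers given in microwatts."""
    return 100.0 * (before_uw - after_uw) / before_uw


print(round(reduction_percent(18.4, 14.3), 1))  # prints 22.3
```

The result, about 22.3%, lies inside the 20% to 24% range reported at 100 iterations.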

# 4. Conclusion

A post-fabrication dual supply voltage control (PDVC) of multiple voltage domains is proposed for $V_{DDmin}$-limited ultra low voltage logic circuits. PDVC effectively reduces the average $V_{DD}$ below $V_{DDmin}$, thereby reducing the power consumption of logic circuits. PDVC is applied to the DES CODEC's circuit fabricated in 65nm CMOS. The layout of the DES CODEC's is divided into 64 $V_{DD}$ domains, and each domain size is 54µm x 63.2µm. The area penalty of PDVC is 3.5%. A maximum 24% power reduction at 30 iterations was measured with the proposed PDVC at 300kHz, $V_{DDH}$ = 437mV, and $V_{DDL}$ = 397mV.

Fig. 11. Probability map of V<sub>DDH</sub> of 10 dies in 10 trials

# Acknowledgment

This work was carried out as a part of the Extremely Low Power (ELP) project supported by the Ministry of Economy, Trade and Industry (METI) and the New Energy and Industrial Technology Development Organization (NEDO).

# References

- [1] N. Lotze and Y. Manoli, "A 62mV 0.13µm CMOS standard-cell-based design technique using Schmitt-trigger logic," International Solid-State Circuits Conference (ISSCC), pp. 340-341, Feb. 2011.
- [2] M. Seok, D. Jeon, C. Chakrabarti, D. Blaauw, and D. Sylvester, "A 0.27V 30MHz 17.7nJ/transform 1024-pt complex FFT core with super-pipelining," International Solid-State Circuits Conference (ISSCC), pp. 342-343, Feb. 2011.
- [3] T. Yasufuku, T. Niiyama, Z. Piao, K. Ishida, M. Murakata, M. Takamiya, and T. Sakurai, "Difficulty of power supply voltage scaling in large scale subthreshold logic circuits," IEICE Transactions on Electronics, Vol. E93-C, No. 3, pp. 332-339, March 2010.
- [4] T. Yasufuku, S. Iida, H. Fuketa, K. Hirairi, M. Nomura, M. Takamiya, and T. Sakurai, "Investigation of determinant factors of minimum operating voltage of logic gates in 65-nm CMOS," International Symposium on Low Power Electronics and Design (ISLPED), pp. 21-26, Aug. 2011.
- [5] H. Fuketa, S. Iida, T. Yasufuku, M. Takamiya, M. Nomura, H. Shinohara, and T. Sakurai, "A closed-form expression for estimating minimum operating voltage (VDDmin) of CMOS logic gates," ACM Design Automation Conference, pp. 984-989, June 2011.
- [6] M.R. Kakoee and L. Benini, "Fine-grained power and body-bias control for near-threshold deep sub-micron CMOS circuits," IEEE Transaction on Emerging and Selected Topics in Circuits and Systems, Vol. 1, No. 2, pp. 131-140, June 2011.
- [7] A. Muramatsu, T. Yasufuku, M. Nomura, M. Takamiya, H. Shinohara, and T. Sakurai, "12% power reduction by within-functional-block fine-grained adaptive dual supply voltage control in logic circuits with 42 voltage domains," 37th European Solid-State Circuits Conference (ESSCIRC), pp. 191-194, Sep. 2011.