# Expected Vectorless Teacher-Student Swap (TSS) Test Method with Dual Power Supply Voltages for 0.3V Homogeneous Multi-core LSI's

Taro Niiyama, Koichi Ishida, Makoto Takamiya, and Takayasu Sakurai University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan

Abstract- A Teacher-Student Swap (TSS) test method with dual supply voltages ($V_{DD}$) for ultra-low-$V_{DD}$ homogeneous multi-core LSI's is proposed, and test chips are fabricated in 90 nm CMOS. In this method, two identical cores operating at different power supply voltages test each other by comparing their outputs, which eliminates the need for expected vectors. When $V_{DD}$ is less than 0.3V, the power reduction by the dual $V_{DD}$ ranged from 18% to 48% across 5 chips. To manage the large die-to-die variations at low $V_{DD}$, the fine-grain dual $V_{DD}$ with the TSS test method is a promising approach that does not increase the test cost.

## I. INTRODUCTION

Low power supply voltage ($V_{DD}$) and multi-core architectures are both recent trends in power-efficient processors. Much work has been carried out on subthreshold logic circuits [1-4] and homogeneous multi-core LSI's (e.g. 80 cores [5] and 64 cores [6]). Subthreshold logic circuits are energy efficient [1,4] but very slow. Therefore, massively parallel processing with homogeneous subthreshold cores may be a promising approach to energy-efficient processors with practical processing speed.

The test cost, which increases in proportion to the number of cores, is however among the most serious issues facing multi-core LSI's. Fig. 1 shows the conventional test method for homogeneous multi-core LSI's. The outputs from n cores are compared with expected vectors to determine pass or fail. The quantity of outputs grows with the number of cores, which raises the test cost.

Meanwhile, the minimum power supply voltage ($V_{DDmin}$) of logic circuits is determined by the function errors of logic gates due to device variations, and $V_{DDmin}$ increases with the number of logic gates [7]. Therefore, fine-grain $V_{DD}$ tuning [7] is required to reduce $V_{DD}$ of the subthreshold logic circuits, and this tuning also raises the test cost.

Fig. 1. The conventional test method of the homogeneous multi-core LSI's.

This paper presents a new test method to reduce the test cost of ultra-low-$V_{DD}$ homogeneous multi-core LSI's. The concept of the proposed Teacher-Student Swap (TSS) test method is described in Section II. Section III presents the chip implementation of the dual-$V_{DD}$ homogeneous multi-core LSI's with the TSS test method. Measurement results from 90nm CMOS test chips are described in Section IV.

## II. PROPOSED TEACHER-STUDENT SWAP TEST METHOD

Fig. 2 shows the proposed TSS test method for homogeneous multi-core LSI's. Every core is paired with a neighboring core, and the pair is called a "teacher and student pair" in this paper. The teacher core and the student core are swappable, and they test each other. The teacher core is a function-guaranteed core and operates as an expected vector generator for the student core, which eliminates the need to input expected vectors from the tester. The outputs of the student core are compared with the expected vectors from the teacher core, and only the pass or fail signals are reported to the tester. Therefore, the quantity of outputs from the chip is much smaller than that in Fig. 1, which reduces the test cost for multi-core LSI's.

Fig. 2. Proposed TSS test method of the homogeneous multi-core LSI's.

Fig. 3 shows a flow chart of the TSS test. In order to achieve fine-grain $V_{DD}$ tuning for ultra-low-$V_{DD}$ homogeneous multi-core LSI's, a dual $V_{DD}$ (a high $V_{DD}$ ($V_{DDH}$) and a low $V_{DD}$ ($V_{DDL}$)) is adopted. $V_{DDH}$ is the minimum supply voltage at which all cores operate correctly, and $V_{DDL}$ is a supply voltage at which some of the cores fail to operate. Thus the core with $V_{DDH}$ is the teacher core and the core with $V_{DDL}$ is the student core. As shown in Fig. 3, in the initial test (Step1), both core 0 and core 1 are in teacher mode with $V_{DDH}$ to find initial failures of the cores. When an initial failure is found, both core 0 and core 1 are disabled and can be replaced with redundant cores. In the test of core 0 (Step2), core 0 with $V_{DDL}$ is in student mode and core 1 with $V_{DDH}$ is in teacher mode. When the test fails, core 0 operates with $V_{DDH}$; otherwise it operates with $V_{DDL}$. Step3 is the swapped version of Step2 and concludes the TSS test.

Fig. 4. (a) Layout and (b) micrograph of a homogeneous 64 core LSI with the TSS test method in 90 nm CMOS.

Fig. 5. Schematic of the fabricated teacher and student pair with dual $V_{DD}$.
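The three-step flow described for Fig. 3 can be sketched in Python as follows. This is only an illustrative behavioral model, not the on-chip implementation: the `Core` class, its `vmin` and `defective` fields, and the rule that any malfunctioning core causes a comparator mismatch are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Core:
    vmin: float              # hypothetical lowest supply (V) at which the core still works
    defective: bool = False  # hypothetical hard defect: fails at any supply

    def works(self, vdd: float) -> bool:
        return (not self.defective) and vdd >= self.vmin


def outputs_match(core_a: Core, vdd_a: float, core_b: Core, vdd_b: float) -> bool:
    # Assumption: a malfunctioning core always produces a mismatch at the comparator.
    return core_a.works(vdd_a) and core_b.works(vdd_b)


def tss_test(core0: Core, core1: Core, vddh: float, vddl: float):
    """Return the supply assigned to (core 0, core 1) after the TSS test."""
    # Step 1: both cores in teacher mode at VDDH to find initial failures.
    if not outputs_match(core0, vddh, core1, vddh):
        return ("disabled", "disabled")
    # Step 2: core 0 is the student (VDDL), core 1 is the teacher (VDDH).
    supply0 = "VDDL" if outputs_match(core0, vddl, core1, vddh) else "VDDH"
    # Step 3: swapped roles, core 1 is the student and core 0 is the teacher.
    supply1 = "VDDL" if outputs_match(core1, vddl, core0, vddh) else "VDDH"
    return (supply0, supply1)
```

For example, with the hypothetical values `tss_test(Core(0.20), Core(0.26), vddh=0.30, vddl=0.22)`, core 0 passes as a student and is assigned `"VDDL"`, while core 1 fails as a student and stays at `"VDDH"`.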

## III. CHIP IMPLEMENTATION OF HOMOGENEOUS MULTI-CORE LSI WITH TEACHER-STUDENT SWAP TEST METHOD

In order to demonstrate the TSS test method with dual $V_{DD}$ for ultra-low-$V_{DD}$ homogeneous multi-core LSI's, test chips have been designed and fabricated. Figs. 4 (a) and (b) show the layout and the micrograph, respectively, of a homogeneous 64 core LSI with the TSS test method in 90 nm CMOS. The core area is 1.0mm×0.68mm. The 64 core LSI has 32 teacher and student pairs. The two cores in a pair have symmetrical layouts.

Fig. 5 shows the schematic of the fabricated teacher and student pair with dual $V_{DD}$. Each core includes a 16 bit ripple carry adder and D-flip-flops. The cores use the primitive cells provided by the foundry, which are not optimized for low $V_{DD}$ operation [1,8]. A 32 bit LFSR generates pseudo random numbers, and the 17 bit adder outputs from core 0 and core 1 are compared. The pass or fail signals are stored in a $V_{DD}$ memory. $V_{DDH}$ or $V_{DDL}$ is selected by pMOSFET's. All circuits except the cores and I/O's use a separate $V_{DD}$.
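The datapath of one pair (LFSR stimulus, two adders, output comparison) can be modeled behaviorally as below. This is a hedged sketch: the paper does not give the LFSR polynomial or how the LFSR bits feed the adder, so the feedback taps (32, 22, 2, 1), the split of the 32 bit state into two 16 bit operands, and the stuck-at fault in `faulty_add16` are assumptions made for illustration.

```python
def lfsr32_step(state: int) -> int:
    """Advance a 32 bit Fibonacci LFSR by one bit (assumed taps: 32, 22, 2, 1)."""
    bit = ((state >> 31) ^ (state >> 21) ^ (state >> 1) ^ state) & 1
    return ((state << 1) | bit) & 0xFFFFFFFF


def ripple_carry_add16(a: int, b: int) -> int:
    """Bit-by-bit model of a 16 bit ripple carry adder; returns the 17 bit sum."""
    carry = 0
    total = 0
    for i in range(16):
        x = (a >> i) & 1
        y = (b >> i) & 1
        total |= (x ^ y ^ carry) << i          # sum bit of full adder i
        carry = (x & y) | (carry & (x ^ y))    # carry into full adder i + 1
    return total | (carry << 16)               # carry-out becomes bit 16


def faulty_add16(a: int, b: int) -> int:
    """Hypothetical failing core: bit 3 of the sum is stuck at 0."""
    return ripple_carry_add16(a, b) & ~(1 << 3)


def compare_pair(teacher_add, student_add, seed: int = 1, cycles: int = 1000) -> bool:
    """Return True (pass) if the student matches the teacher on every cycle."""
    state = seed
    for _ in range(cycles):
        state = lfsr32_step(state)
        a, b = state >> 16, state & 0xFFFF     # assumed operand split
        if teacher_add(a, b) != student_add(a, b):
            return False
    return True
```

In this model only the single pass or fail bit returned by `compare_pair` would need to leave the pair, which mirrors the reduced output quantity that the TSS method relies on.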

## IV. MEASUREMENT RESULTS

 $V_{\text{DDL}}$  distributions within 64 cores are measured by the TSS test method. The power reduction by the dual  $V_{\text{DD}}$  is also discussed.

### A. $V_{DDL}$ Distributions

Fig. 6 shows the measured $V_{DDL}$ dependence of the number of error cores in the 64 cores of a chip. The clock frequency is varied from 10kHz to 30MHz. $V_{DDH}$ is defined as the $V_{DD}$ where the first error core is observed. The $V_{DDL}$ distributions fall into 2 modes: a delay error mode ($V_{DD} > 0.3V$) and a function error mode ($V_{DD} < 0.3V$). At 30MHz and 15MHz, the $V_{DDL}$ distributions are narrow and determined by the classical timing error of the logic gates, which is the delay error mode. In contrast, at 1MHz and 10kHz, the $V_{DDL}$ distributions do not depend on the clock frequency and are determined by the function error of the logic gates due to device variations [7], which is the function error mode. At 5MHz and 2MHz, the delay error mode and the function error mode are mixed.

Fig. 6. Measured $V_{DDL}$ dependence of the number of error cores in 64 cores of a chip. The clock frequency is varied from 10kHz to 30MHz.

Figs. 7 (a) and (b) show the measured $V_{DDL}$ distribution maps of the 64 cores at 30MHz (delay error mode) and 10kHz (function error mode), respectively. The x- and y-axes in Fig. 7 correspond to those in Fig. 4 (a). At 30MHz, the maximum $V_{DDL}$ is 388mV and the minimum $V_{DDL}$ is 370mV. At 10kHz, the maximum $V_{DDL}$ is 280mV and the minimum $V_{DDL}$ is 178mV. The $V_{DDL}$ distribution in Fig. 7 (a) has both a random and a systematic (y-axis direction) component. The systematic component is likely due to the on-chip IR-drop. In contrast, the $V_{DDL}$ distribution in Fig. 7 (b) has only the random component. Its peak-to-peak variation of $V_{DDL}$ (=102mV) is larger than that (=18mV) in Fig. 7 (a), because the low $V_{DD}$ exposes the effect of the device variations on the circuit variations.
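The quantities discussed in this subsection (the number of error cores at a given supply, the first supply at which an error core appears, and the peak-to-peak spread) follow directly from the per-core minimum passing voltages. The sketch below is a minimal illustration; the function names are hypothetical, voltages are integers in mV, and in the sample list only the extremes 178 mV and 280 mV come from the 10kHz measurement, while the two middle values are made up.

```python
def error_core_count(core_vmin_mv, vdd_mv):
    """Number of cores that fail when every core is supplied with vdd_mv."""
    return sum(1 for vmin in core_vmin_mv if vmin > vdd_mv)


def peak_to_peak(core_vmin_mv):
    """Spread between the largest and smallest per-core minimum voltage."""
    return max(core_vmin_mv) - min(core_vmin_mv)


def first_error_vdd(core_vmin_mv, start_mv, stop_mv, step_mv):
    """Sweep the supply downward; return the first voltage with an error core."""
    for vdd in range(start_mv, stop_mv - 1, -step_mv):
        if error_core_count(core_vmin_mv, vdd) > 0:
            return vdd
    return None  # no error core observed in the swept range


# Extremes taken from the 10kHz measurement; 215 and 240 are hypothetical.
sample_mv = [178, 215, 240, 280]
spread_mv = peak_to_peak(sample_mv)            # 102, matching the reported spread
errors_at_250 = error_core_count(sample_mv, 250)
```

The same three helpers apply unchanged to the 30MHz data, where the reported extremes of 388 mV and 370 mV give the 18 mV spread.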

### B. Power Reduction by Dual $V_{DD}$

Fig. 8 (a) shows the $V_{DDL}$ dependence of the power and the number of error cores in the 64 cores of a chip at 4 different clock frequencies. The power is simulated by SPICE based on the measured number of error cores. The leakage power accounts for 48% and 99.9% of the total power shown in Fig. 8 at 30MHz and 10kHz, respectively. With the proposed TSS test method, $V_{DDH}$ is assigned to the error cores and $V_{DDL}$ is assigned to the non-error cores automatically. When all 64 cores have errors, all cores operate at $V_{DDH}$, which is equivalent to the all-$V_{DDH}$ approach. Therefore, the power is minimized by optimizing $V_{DDL}$. Fig. 8 (b) shows the $V_{DDL}$ dependence of the power reduction by the dual $V_{DD}$ at 4 different clock frequencies. At 30MHz and 15MHz (delay error mode), the power reduction is only 3%, because the $V_{DDL}$ distributions are narrow. In contrast, at 10kHz (function error mode), the power reduction is 27%, because the $V_{DDL}$ distribution is broad.

Fig. 7. Measured $V_{DDL}$ distribution map on 64 cores (a) at 30MHz (delay error mode) and (b) 10kHz (function error mode).

Fig. 8. $V_{DDL}$ dependence of (a) the power and the number of error cores in 64 cores of a chip and (b) the power reduction by the dual $V_{DD}$ at 4 different clock frequencies.
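The trade-off behind the optimum $V_{DDL}$ can be illustrated with a small model: lowering $V_{DDL}$ saves power in the passing cores but pushes more cores back to $V_{DDH}$. The sketch below is not the SPICE-based evaluation used in the paper. The function names, the per-core minimum voltages, and especially `quadratic_power` (a placeholder per-core power model) are assumptions chosen only to make the example runnable.

```python
def dual_vdd_power(core_vmin, vddh, vddl, power_of):
    """Total power when non-error cores run at vddl and error cores at vddh."""
    return sum(power_of(vddl if vmin <= vddl else vddh) for vmin in core_vmin)


def power_reduction(core_vmin, vddh, vddl, power_of):
    """Fractional saving of the dual-VDD scheme relative to all cores at vddh."""
    all_high = len(core_vmin) * power_of(vddh)
    return 1.0 - dual_vdd_power(core_vmin, vddh, vddl, power_of) / all_high


def best_vddl(core_vmin, vddh, candidates, power_of):
    """Candidate vddl that minimizes the total power."""
    return min(candidates, key=lambda v: dual_vdd_power(core_vmin, vddh, v, power_of))


def quadratic_power(vdd):
    # Placeholder model only; the paper obtains per-core power from SPICE.
    return vdd * vdd


# Hypothetical per-core minimum passing voltages (V); vddh is set to the largest.
core_vmin = [0.20, 0.22, 0.25, 0.30]
optimum = best_vddl(core_vmin, 0.30, core_vmin, quadratic_power)
saving = power_reduction(core_vmin, 0.30, optimum, quadratic_power)
```

With these made-up numbers the optimum is 0.22 V: going down to 0.20 V leaves only one core at the low supply, while staying at 0.25 V gives up savings in the cores that could run lower. Setting `vddl` below every core's minimum voltage reproduces the all-$V_{DDH}$ case with zero reduction.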

Finally, the die-to-die variations of the measured $V_{DDL}$ distribution are discussed. Fig. 9 shows the measured $V_{DDL}$ dependence of the number of error cores in the 64 cores of 5 chips at 10kHz and 30MHz clock signals. Chip1 is the same as that in Fig. 6. At both 10kHz and 30MHz, die-to-die variations of the $V_{DDL}$ distribution are observed. An interesting result is that the delay error mode and the function error mode are not correlated, because their mechanisms are different. For example, Chip2 has the highest $V_{DDL}$ at 30MHz, whereas Chip3 or Chip5 has the highest $V_{DDL}$ at 10kHz.

Fig. 9. Measured $V_{\rm DDL}$ dependence of the number of error cores in 64 cores of 5 chips at 10kHz and 30MHz clock.

Fig. 10 (a) shows the $V_{DDL}$ dependence of the simulated power of 5 chips at 10kHz clock. Fig. 10 (b) shows the $V_{DDL}$ dependence of the power reduction by the dual $V_{DD}$. The power reduction ranges from 18% to 48% among the 5 chips, and the high $V_{DDL}$ chip (Chip3) achieves the largest power reduction. Fig. 10 (c) shows the $V_{DDL}/V_{DDH}$ dependence of the power reduction by the dual $V_{DD}$. In conventional processors, the optimum $V_{DDL}/V_{DDH}$ to achieve the minimum power is 0.7 [9]. In contrast, at 0.2 - 0.4V and 10kHz (function error mode), the optimum $V_{DDL}/V_{DDH}$ ranges from 66% to 82% among the 5 chips. In order to manage the large die-to-die variations at low $V_{DD}$, the fine grain dual $V_{DD}$ with the TSS test method is a promising approach without increasing the test cost.

## V. CONCLUSION

The Teacher-Student Swap (TSS) test method with the dual $V_{DD}$ for ultra low $V_{DD}$ homogeneous multi-core LSI's was proposed, and test chips were fabricated in 90 nm CMOS. Depending on $V_{DD}$, the LSI's have the delay error mode ($V_{DD} > 0.3V$) and the function error mode ($V_{DD} < 0.3V$). In the delay error mode, the power reduction by the dual $V_{DD}$ was only 3%, while the power reduction was 27% in the function error mode. The power reduction by the dual $V_{DD}$ ranged from 18% to 48% across the 5 chips, with the optimum $V_{DDL}/V_{DDH}$ ranging from 66% to 82%. In order to manage the large die-to-die variations at low $V_{DD}$, the fine grain dual $V_{DD}$ with the TSS test method is a promising approach without increasing the test cost.

Fig. 10. (a) $V_{\rm DDL}$ dependence of the power of 5 chips at 10kHz clock. (b) $V_{\rm DDL}$ dependence of the power reduction by the dual $V_{\rm DD}$. (c) $V_{\rm DDL}/V_{\rm DDH}$ dependence of the power reduction by the dual $V_{\rm DD}$.

## ACKNOWLEDGMENTS

This work is partially supported by STARC. The VLSI chips were fabricated through the chip fabrication program of VLSI Design and Education Center (VDEC), the University of Tokyo, with the collaboration by STARC, Fujitsu Limited, Matsushita Electric Industrial Company Limited, NEC Electronics Corporation, Renesas Technology Corporation, and Toshiba Corporation.

## REFERENCES

- [1] B. Calhoun, and A. Chandrakasan, "Ultra-dynamic voltage scaling (UDVS) using sub-threshold operation and local voltage dithering," IEEE Journal of Solid-State Circuits, Vol. 41, No. 1, pp. 238-245, Jan. 2006.
- [2] S. Hanson, B. Zhai, M. Seok, B. Cline, K. Zhou, M. Singhal, M. Minuth, J. Olson, L. Nazhandali, T. Austin, D. Sylvester, and D. Blaauw, "Performance and variability optimization strategies in a sub-200mV, 3.5pJ/inst, 11nW subthreshold processor," IEEE Symposium on VLSI Circuits, pp. 152-153, June 2007.
- [3] M. Hwang, A. Raychowdhury, K. Kim, and K. Roy, "A 85mV 40nW process-tolerant subthreshold 8x8 FIR filter in 130nm technology," IEEE Symposium on VLSI Circuits, pp. 154-155, June 2007.
- [4] H. Kaul, M. Anders, S. Mathew, S. Hsu, A. Agarwal, R. Krishnamurthy, and S. Borkar, "A 320mV 56µW 411GOPS/Watt ultra-low voltage motion estimation accelerator in 65nm CMOS," IEEE International Solid-State Circuits Conference, pp. 316-317, Feb. 2008.

- [5] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote, and N. Borkar, "An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS," IEEE International Solid-State Circuits Conference, pp. 98-99, Feb. 2007.
- [6] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L. Bao, J. Brown, M. Mattina, C.-C. Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, and J. Zook, "TILE64 Processor: a 64-Core SoC with mesh interconnect," IEEE International Solid-State Circuits Conference, pp. 88-89, Feb. 2008.
- [7] T. Niiyama, P. Zhe, K. Ishida, M. Murakata, M. Takamiya, and T. Sakurai, "Dependence of minimum operating voltage (V<sub>DDmin</sub>) on block size of 90-nm CMOS ring oscillators and its implications in low power DFM," IEEE International Symposium on Quality Electronic Design, pp. 133-136, March 2008.
- [8] A. Wang, and A. Chandrakasan, "A 180-mV subthreshold FFT processor using a minimum energy design methodology," IEEE Journal of Solid-State Circuits, Vol. 40, No. 1, pp. 310-319, Jan. 2005.
- [9] T. Kuroda, and M. Hamada, "Low-power CMOS digital design with dual embedded adaptive power supplies," IEEE Journal of Solid-State Circuits, Vol. 35, No. 4, pp. 652-655, April 2000.