Power Benefit Study for Ultra-High Density Transistor-Level Monolithic 3D ICs


ABSTRACT

The nano-scale 3D interconnects available in monolithic 3D IC technology enable ultra-high density device integration at the individual transistor level. In this paper, we demonstrate the power benefits of transistor-level monolithic 3D designs. We first build a cell library that consists of 3D gates and model their timing/power characteristics. Next, we build timing-closed, full-chip GDSII layouts and perform sign-off iso-performance power comparisons with 2D IC designs. We also study the characteristics of benchmark circuits that maximize the power benefits in monolithic 3D designs. Lastly, we extend our study to predict the power benefits of monolithic 3D designs built with future devices.

1. INTRODUCTION

To better exploit the benefits of 3D die stacking, monolithic 3D technology is currently being investigated as a next-generation technology. In a monolithic 3D IC, the device layers are fabricated sequentially. When the top layer is attached to the bottom layer, the top layer is blank silicon. Alignment precision is determined by lithography stepper accuracy, which is around 10nm today. Also, the top layer can be made very thin, around 30nm [1]. Thus, monolithic vias (MVIAs) for vertical connections are very small, about two orders of magnitude smaller than through-silicon vias (TSVs), with almost negligible RC parasitics. With these small MVIAs, designers can truly exploit the benefit of the vertical dimension.

The early works on monolithic 3D ICs were technology-driven [6, 4, 9]. Recently, logic design methodologies for monolithic 3D ICs were demonstrated [2, 8, 7]. In these works, the authors presented various comparisons among monolithic 3D ICs, TSV-based 3D ICs, and conventional 2D ICs in terms of footprint, timing, and power. However, timing was not closed in these works, which makes the studies impractical.
In addition, all these works assume that the timing and power characteristics of 3D monolithic gates are the same as those of 2D gates, without demonstrating why that is a reasonable assumption. The authors also did not provide in-depth analyses and discussions of why monolithic 3D technology reduces power consumption and what factors affect the power reduction margin. This knowledge is crucial to maximize the benefit and to justify ongoing and future research on fabrication and design technologies for monolithic 3D ICs.

As discussed in [2, 8], monolithic 3D technology enables very fine-grained 3D circuit partitioning. We can divide standard cells into PMOS and NMOS parts, place them in different layers, and connect them using MVIAs, which we call transistor-level monolithic 3D integration (T-MI) in this paper. Alternatively, as in TSV-based 3D ICs, we may place planar cells in different layers and connect them using MVIAs, which is named gate-level monolithic 3D integration (G-MI). In this paper, we focus on transistor-level integration, which allows the highest integration density possible. T-MI designs differ from G-MI designs as follows: (1) Most of the 3D interconnects are embedded in the cells. (2) PMOS and NMOS transistors are on different layers, thus manufacturing processes can be optimized separately. (3) Physical layout (placement, routing, optimization, etc.) can be performed using existing 2D electronic design automation (EDA) tools, with modifications.

Figure 1: Overall design and analysis flow. Shaded boxes highlight differences in T-MI. WLM means wire load model. Flow steps: extend layer definitions and metal layers; design T-MI cells; 2D timing and power library; benchmark circuit RTL; WLM; synthesis; create physical cell library and interconnect RC library; placement; pre-route optimization; routing; post-route optimization; timing/power analysis.

In this paper, we study the power benefit of T-MI based on timing-closed, fully detail-routed GDSII-level layouts and sign-off timing and power analysis. Our comprehensive work encompasses device- and interconnect-level study, gate-level modeling and optimization, and full-chip layout construction, optimization, and timing/power analysis for current and future technology nodes. With our layout-based simulations and in-depth analyses, we demonstrate how to maximize the power benefit of T-MI technology. For fair comparisons between 3D and 2D designs, timing is closed on all designs (iso-performance), and power consumption is compared. We also investigate the circuit characteristics that affect the power benefit of monolithic 3D ICs. Our major contributions are as follows:

(1) To the best of our knowledge, this is the first work to characterize the timing and power of individual transistor-level monolithic 3D cells. We extract the internal RC parasitics of our T-MI cells and characterize their timing and power. We then compare T-MI cells with their 2D counterparts.
(2) We study the design aspects that significantly affect the power benefit of monolithic 3D ICs. We discuss what kinds of logic circuits are suitable for power reduction in monolithic 3D ICs. In addition, we demonstrate that the power reduction rate also depends on the target clock period.
(3) We build the libraries and full-chip layouts for monolithic 3D ICs implemented using 7nm devices. The goal is to predict the future trend of power savings with monolithic 3D technology and to study how the smaller dimensions and varying RC parasitics affect the power benefit.

2. DESIGN AND ANALYSIS FLOW

One of the major benefits of T-MI is that existing 2D EDA tools can be used, with simple modifications if needed. We extensively use commercial EDA tools in this study. Our design and analysis flow, summarized in Fig.
1, consists of four parts: (1) library preparation, (2) synthesis, (3) layout, and (4) analysis. In the library preparation part, we prepare T-MI-specific library files. We synthesize the RTL codes of the benchmark circuits using Synopsys Design Compiler.¹ In the layout part, we perform placement, routing, and optimizations using Cadence Encounter (v10.12). Finally, we perform static timing analysis and statistical power analysis.

Figure 2: The layout of an inverter from (a) the Nangate 45nm library, and (b) our T-MI library. P, M, and CT represent poly, metal, and contact. The suffix B means the bottom tier. MVIA means monolithic via. Top/bottom tier silicon substrates and p/n-wells are not shown for simplicity. The numbers in parentheses are thicknesses in nm. Layer labels: M1 (130), CT, P (85), MVIA (140), MB1 (130), PB (85), CTB; pins: A, Z, VDD, VSS.

Our major efforts in the T-MI design flow are spent on T-MI cell library construction and characterization, T-MI interconnect structure modeling, and T-MI wire load modeling. We modify the technology files and design rules to account for additional layers on the bottom tier as well as additional metal layers on the top tier (see Section 3.3). Using Cadence Virtuoso, we create our T-MI cells by modifying existing 2D cells. The cells are then abstracted to create the T-MI physical cell library. We also build interconnect RC libraries using the Cadence captable generator and QRC Techgen. For synthesis, we create the T-MI wire load models (see Section 3.4) that guide synthesis optimizations.

During layout construction, we first run the Encounter placer. The tool recognizes T-MI cells as cells with pins on multiple layers. For routing, we set up Encounter to utilize the additional metal layers on the bottom and top tiers. Since our T-MI cells contain routing blockages on the MVIA layer, the router avoids 3D routing through the top tier part of the cells using MVIAs. Using our T-MI interconnect library, which reflects the T-MI metal layer structures and materials, we perform RC extraction on all the nets in the layout.
Our full-chip timing/power optimizations and analyses for T-MI and 2D are the same, because the entire T-MI design (top/bottom tiers) is captured in a single Encounter session. We perform statistical power analysis with the switching activity of the primary inputs and sequential cell outputs set to 0.2 and 0.1, respectively.²

¹ Our benchmark circuits and the synthesis results are shown in Section S2.
² The impact of switching activity is shown in Section S8.

3. 45NM TECHNOLOGY SETUP

3.1 Monolithic 3D Cell Design

We design our T-MI 3D cells using the (2D) standard cells in the Nangate 45nm library [10] as our baseline. As shown in Fig. 2, we fold the 2D standard cells into 3D and create T-MI 3D cells. The thicknesses of the top/bottom tier silicon substrates and the inter-layer dielectric (ILD) are 30nm and 110nm, respectively. The diameter of an MVIA is 70nm. Note that by folding, each input/output pin is on both tiers.

We prefer to place the PMOS transistors on the bottom tier and the NMOS transistors on the top tier. In the Nangate 45nm library, P/NMOS transistors show hole/electron mobility skew. To compensate for the difference, a PMOS is larger than the corresponding NMOS. Since extra silicon space on the top tier is required for MVIAs (not on the bottom tier; see Fig. 2(b)), placing PMOS transistors on the bottom tier balances top/bottom silicon area usage. However, we should also consider manufacturing aspects in deciding the P/NMOS layer assignment.³

After folding the cell, the VDD and VSS strips overlap, as shown in Fig. 2. The power to VDD on the bottom tier can be delivered down through arrays of MVIAs, placed apart from the VSS strip. We may need extra space for these VDD MVIAs. Yet, power delivery network design and IR-drop analysis are outside our scope. Also, since the VDD and VSS strips overlap, they may act as a small decoupling capacitor.
However, in the extracted cell internal RC data for our inverter cell, the coupling capacitance (or cap) between the VDD and VSS strips is around 0.01fF, which is small compared with other cell internal parasitic capacitances.

The transistor model in the Nangate 45nm library is PTM 45nm with bulk silicon technology [11]. In monolithic 3D technology, because of the structure, top tier transistors are similar to silicon-on-insulator (SOI) devices [1]. However, in this study we assume the same transistor model for T-MI and 2D cells, because (1) the original Nangate 45nm library is based on bulk silicon technology, and (2) if we assume that both devices and interconnect structures in T-MI are different from 2D, it becomes harder to understand which factor contributes to the power reduction, and by how much.

3.2 Comparison with 2D Cells

Our T-MI cells preserve the same transistor sizes as in the original 2D cells.⁴ The T-MI cell height is 0.84µm, which is 40% smaller than the original 2D cell height (1.4µm). Thus, the cell footprint reduces by 40%. The reasons why it is not 50% are (1) the P/NMOS size mismatch incurs extra space on the NMOS side, and (2) MVIAs require extra space on the top tier.

When designing T-MI cells, care should be taken to reduce cell internal RC parasitics. As shown in Fig. 2(b), the connection from the PMOS on the bottom tier to the NMOS on the top tier needs to go through CTB, MB1, MVIA, CT, M1, and then CT to diffusion. This 3D path may become longer than the original 2D path and may increase the cell internal parasitic RC. Similarly, the path from the PB on the bottom tier to the P on the top tier goes through multiple layers. To reduce cell internal RC parasitics, it is important to minimize the lengths of 3D paths. To achieve shorter 3D paths, we should place MVIAs close to the connecting transistors. We also need to utilize direct source/drain (S/D) contacts (see Fig. 6(c) in the supplement).
The direct S/D contacts reduce the detour in the 3D paths and the unnecessary RC parasitics.

We examine the cell internal RC parasitics of 3D and 2D cells and their impact on timing/power. In previous works [2, 8, 7], the authors assumed that the delay and power of 3D cells are the same as those of 2D cells and used the 2D timing/power library. In [1], the authors fabricated a transistor-level monolithic 3D IC and measured the top/bottom transistor performances. They reported that the differences between 3D transistors and baseline 2D transistors were negligible. Yet, the delay and power of cells are also affected by cell internal RC parasitics. From Fig. 2(b), we can conjecture that there are coupling capacitances among PB, CTB, MB1, MVIA, CT, and M1. Using Mentor Graphics Calibre XRC with EM-simulation-based extraction rules, we extract these capacitance values as well as resistances and transistors from our T-MI cell layout. Then, we generate a SPICE netlist of the cell that consists of transistors and parasitic RC components.

Since Calibre XRC is designed for 2D ICs, it can only model one diffusion layer. Due to this tool limitation, the top tier diffusion layer can be modeled as either a dielectric or a conductor. Even though the top tier silicon is doped (low resistivity) and the bodies of top tier transistors are tied to ground, we expect that some amount of electric field may penetrate the top tier silicon, and coupling among top and bottom tier objects (M1, MB1, P, PB, etc.) may exist. When we assume that the top tier silicon is a dielectric, the coupling between top and bottom tier objects would be overestimated; when it is a conductor, the coupling would be underestimated. The real case would be between these two extreme cases.

³ In sub-32nm nodes, thanks to advanced channel engineering techniques, the hole/electron mobility is about the same.
⁴ GDSII layouts of some of our T-MI cells are shown in Fig. 6 in the supplement.

Table 1: Cell internal parasitic RC values. 3D-c means 3D with the top tier silicon modeled as a conductor. Columns: R (kΩ) and C (fF), each for 2D, 3D, and 3D-c. Rows: INV, NAND, MUX, DFF.

Table 2: Delay (ps) and internal power consumption (fJ) of cells under various input slew and load capacitance conditions. The library uses different input slew settings for DFF. The values in parentheses are the percentage ratio of 3D to 2D; only these ratios are listed here.

case                                                            cell   delay (3D/2D)   power (3D/2D)
fast: input slew=7.5ps (5ps for DFF), load cap.=0.8fF           INV    (98.3%)         (91.6%)
                                                                NAND   (98.6%)         (94.6%)
                                                                MUX    (97.3%)         (97.5%)
                                                                DFF    (104.2%)        (106.2%)
medium: input slew=37.5ps (28.1ps for DFF), load cap.=3.2fF     INV    (99.4%)         (94.8%)
                                                                NAND   (99.5%)         (96.2%)
                                                                MUX    (98.2%)         (96.8%)
                                                                DFF    (103.1%)        (106.3%)
slow: input slew=150ps (112.5ps for DFF), load cap.=12.8fF      INV    (99.8%)         (96.0%)
                                                                NAND   (99.8%)         (96.7%)
                                                                MUX    (98.8%)         (97.3%)
                                                                DFF    (102.5%)        (104.9%)

The total cell internal RC values, extracted from the original 2D cells and our 3D (T-MI) cells, are shown in Table 1. For the 3D case, the results with the top tier silicon modeled as both a dielectric (3D) and a conductor (3D-c) are shown. From the results, we observe the following: (1) For INV, NAND2, and MUX2, the R values of 3D are noticeably smaller than their 2D counterparts, because we reduce the length of poly and metal lines inside the cells by using 3D interconnects. (2) The C values of 3D are comparable with those of 2D; the 2D value is between the 3D and 3D-c values. (3) For DFF, both R and C of 3D are larger than their 2D counterparts. Due to the complex internal connections, we could not create a 3D cell layout that matches the RC parasitics of 2D. In summary, depending on the cell layout complexity, the internal RC ratio between 3D and 2D may vary. Yet, the delay and power of the cells are more important metrics.

Table 3: Summary of metal layers. Unit is nm. Columns: level, metal layers, width, spacing, thickness.

level          metal layers
global         2D: M7-8, 3D: M…
intermediate   2D: M4-6, 3D: M…
local          2D: M2-3, 3D: M…
M1             2D: M1, 3D: MB1, M…

We perform cell timing/power characterizations using commercial software. The SPICE netlists obtained from the previous RC extractions are fed into Cadence Encounter Library Characterizer, which runs SPICE simulations to characterize the delay and power of cells under various input slew and load capacitance conditions. The delay/power of 3D and 2D cells are shown in Table 2. The values are obtained from the data tables in the characterized Liberty library. The delay is the cell internal delay including the load effect, and the power is the dynamic power consumed within the cell boundary (including short circuit power and power for gate/parasitic capacitances). We observe that for INV, NAND2, and MUX2, the delay and power of 3D are slightly better than 2D, whereas for DFF, they are a little worse.
In addition, as the input slew and load capacitance conditions change from the fast to the slow case, the difference between T-MI and 2D becomes smaller. Note that depending on cell design quality and manufacturing technology, the results may change. We believe that with proper cell designs, the delay and power of 3D cells could be similar to their 2D counterparts.

3.3 Monolithic Interconnect Setup

Our T-MI interconnect structure is an extension of the Nangate (2D) 45nm library. As shown in Table 3, we use 8 out of the 10 metal layers in the Nangate 45nm library. For T-MI, we make two modifications: we add (1) a new metal layer on the bottom tier (MB1), and (2) three local metal layers on the top tier (M4-6).⁵ With T-MI cell folding, the cells become 40% smaller than 2D (see Section 3.2). This results in about 40-50% smaller core footprint area. As a result, the cell pin density in T-MI becomes about 1.7-2X larger than in 2D, leading to a higher routing demand per unit area (or routing tile). To satisfy the high routing demand, we need to increase the routing capacity (number of routing tracks per routing tile). The most area-efficient way is to add local metal layers, because of their small pitch. We found that adding 3 local metal layers increases the routing capacity sufficiently.

Due to manufacturing issues (low thermal budget), the authors of [2] suggest that tungsten is suitable for the bottom tier metal. However, in this work we assume copper, because a copper-based manufacturing process may be developed. Besides, MB1 is mostly used for short interconnects, such as those within cells or short nets.⁶ In our benchmark circuit M256 (see Table 11), the wirelength of MB1 (for net routing) is only 0.3% of the total wirelength. Thus, the impact of the MB1 material on the timing and power of a whole circuit is minimal. When tungsten is used, IR-drop on the VDD strips could be an issue, which is outside our scope.
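
The pin density figure quoted in Section 3.3 follows from simple area arithmetic. The sketch below is only an illustration, assuming that the number of cell pins is unchanged by folding so that pin density scales as the inverse of the remaining footprint; the function name `pin_density_gain` is hypothetical and not part of any tool used in the paper.

```python
def pin_density_gain(footprint_reduction: float) -> float:
    """Pin density ratio (T-MI over 2D) when the same pins occupy a smaller footprint.

    Assumption: the pin count is unchanged, so density scales as 1 / remaining area.
    """
    if not 0.0 <= footprint_reduction < 1.0:
        raise ValueError("footprint_reduction must be in [0, 1)")
    return 1.0 / (1.0 - footprint_reduction)


# Cell heights from Section 3.2: 1.4 um (2D) folded to 0.84 um (T-MI).
cell_reduction = 1.0 - 0.84 / 1.4

# Core footprint reductions of 40% and 50% bracket the range given in the text.
gains = {r: pin_density_gain(r) for r in (0.40, 0.50)}

print(f"cell footprint reduction = {cell_reduction:.0%}")
for reduction, gain in gains.items():
    print(f"{reduction:.0%} smaller footprint -> {gain:.2f}x pin density")
```

Under this assumption, footprint reductions of 40% and 50% give gains of roughly 1.67X and 2X, in line with the 1.7-2X range stated above.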

3.4 Monolithic 3D Wire Load Model

In T-MI designs, the wires are about 20-30% shorter than in 2D designs (see Table 4). We feed this information to synthesis by modifying the wire load models (WLMs). A WLM defines the statistical average of the unit-length resistance, capacitance, and area of wires, as well as the fanout vs. wirelength table. For each net, according to its fanout, the synthesis engine looks up the corresponding wirelength and the capacitance/resistance/area from the WLM. We reflect the reduced wirelength of T-MI designs in the fanout vs. wirelength table. From preliminary layout simulations, we extract a WLM per circuit for T-MI as well as for 2D. With these WLMs, the synthesized netlists for 2D and T-MI are different. The fanout vs. wirelength trends for our benchmark circuits are shown in Fig. 3. Note that the curves of the circuits are distinct, which is related to the circuit characteristics (discussed in Section 4.3).⁷

⁵ Our 2D and T-MI metal layers are shown in Fig. 9 in the supplement.
⁶ The impact of using MB1 for routing on optimization quality is discussed in Section S3.

Figure 3: Fanout vs. wirelength in 2D wire load models. The plot shows wirelength (um) against fanout for FPU, LDPC, M256, AES, and DES.

4. 45NM RESULTS

4.1 Design Analysis Results

Table 4: Summary of layout results for the 45nm node. The values represent the percentage difference of T-MI over 2D.

circuit   footprint   wirelength   total power   cell power   net power   leakage power
FPU       -41.7%      -26.3%       -14.5%        -9.4%        -19.5%      -11.1%
AES       -42.4%      -23.6%       -10.9%        -7.6%        -13.9%      -9.5%
LDPC      -43.2%      -33.6%       -32.1%        -12.8%       -39.2%      -21.7%
DES       -40.9%      -21.5%       -4.1%         -1.6%        -7.7%       -1.4%
M256      -43.4%      -28.4%       -17.5%        -10.7%       -22.2%      -12.9%

Table 5: Summary of design results in our work and previous works. [2]-3D means their INTRACEL method with timing-driven + IPO, which corresponds to transistor-level monolithic 3D design. [7]-3D means their 3TM setup. The original columns are total wirelength (m), longest path delay (ns), and total power (mW); the 3D-over-2D differences are listed here.

circuit   design   total wirelength   total power
AES       ours     (-23.5%)           (-10.9%)
AES       [7]      (-21.0%)           (-6.6%)
LDPC      ours     (-33.6%)           (-32.1%)
LDPC      [2]      (-12.6%)           (-6.0%)
DES       ours     (-21.6%)           (-4.1%)
DES       [2]      (-13.4%)           (-1.9%)
DES       [7]      (-19.7%)           (-3.1%)

Surviving absolute entries for LDPC in [2]: total wirelength of [2]-3D is 1.60 m; the total power entries read …,554 ([2]-2D) and …,461 ([2]-3D).

Figure 4 layout annotations, in panel order: footprint = …x456.4um, wirelength = 3.806m; footprint = …x330.4um, wirelength = 0.611m.

The layout simulation results for the 45nm node are summarized in Table 4.⁸ With T-MI, the footprint reduces by 40.9-43.4% (Table 4), which is larger than the cell footprint reduction rate, 40%. With T-MI, timing is better because of shorter wirelengths, and the optimizer may downsize cells and use fewer buffers while still meeting the target clock period.
Thus, the footprint reduction of the whole T-MI design can exceed the individual cell footprint reduction rate. With T-MI, the total wirelength reduces by 21.5-33.6% (Table 4). Depending on the circuit characteristics, the wirelength reduction rate varies. We observe that a circuit with a larger wirelength reduction rate tends to show a larger power reduction rate. All designs met timing. The power reduction was the largest in LDPC, 32.1%, whereas in DES it was only 4.1%. In LDPC, the net power is much larger than the cell power, thus a large net power reduction with T-MI leads to a large total power reduction. We also observe that with T-MI, not only net power but also cell power reduces; with better timing, cells are downsized and fewer buffers are used, which reduces cell power.

4.2 Comparison with Existing Works

Our results and the results from previous works [2, 7] are summarized in Table 5.⁹ All three works use the Nangate 45nm library as the baseline 2D. The footprint reduction rates of 3D over 2D in this work, [2], and [7] are about 42.3%, 30%, and 40%, respectively. This footprint reduction rate largely determines the overall design quality of 3D designs, because the timing and power reductions in monolithic 3D designs come from reduced footprint and wirelength. Our results show larger wirelength reductions than these previous works. In [2, 7], the authors intentionally chose small target clock periods, thus timing was not closed. Note that the power values vary widely across the works. For AES and LDPC, our results show larger power reduction rates than previous works. Interestingly, in all three works, the power reduction rates for the DES circuit are low (only 2-4%).

⁷ The impact of T-MI WLM on design quality is presented in Section S5.
⁸ Our detailed layout results for the 45nm node are presented in Section S4. Our GDSII layouts of the timing-closed, routing-completed AES design are shown in Fig. 8 in the supplement.
⁹ Note that the purpose of this study is not to directly compare the design quality of ours to the previous works; due to various reasons (floorplan setup, design and analysis flow, optimization methods, target clock period, switching activity factors, etc.), it is not possible to provide fair comparisons.

4.3 Circuit Characteristics Study

As shown in Table 4, LDPC and DES showed very different power reduction rates with T-MI. By contrasting these two designs, we explain for what kinds of circuits T-MI provides a large power benefit. With T-MI, the buffer count reduces by 48.6% (in LDPC) vs. 3.2% (in DES), total wirelength reduces by 33.6% vs. 21.5%, total power reduces by 32.1% vs. 4.1%, cell power reduces by 12.8% vs. 1.6%, and net power reduces by 39.2% vs. 7.7%. Compared with LDPC, the buffer count reduction for DES is very small, which leads to a very small cell power reduction. Although the wirelength reduction in DES is not so small, the net power reduction rate is significantly smaller than in LDPC. The net capacitance/power consists of wire and (cell input) pin parts.¹⁰ For most nets in DES, wires are very short.¹¹ This difference is also observed in Fig. 4. In the DES layout, there are many small regions where cells are tightly connected inside but not so much to the outside. For these short nets, pin capacitances dominate wire capacitances, thus reducing wirelength does not reduce net power as much. Although these two circuits are similar in size (number of cells and nets) and average fanout, the power benefit of T-MI differs widely because of the inherent difference in circuit characteristics.

Figure 4: Snapshots of routing results for (a) LDPC and (b) DES.

4.4 Impact of Target Clock Period

The power benefit of T-MI also depends on the target clock period. For AES and M256, we vary the target clock period and perform full designs, from synthesis to layout optimizations. The power reduction rate is shown in Fig. 5. The trend is clear: when the target clock is faster, the power benefit of T-MI becomes larger. This is because at faster clock speeds, the timing of the 2D design becomes harder to meet than that of T-MI, because of longer wires. The optimization engine uses more buffers and larger cells, leading to a steep increase in cell power. Thus, the cell power reduction rate increases noticeably as the clock becomes faster. With faster clock speeds, core footprint and wirelengths also become larger, leading to a larger net power reduction rate with T-MI.

Figure 5: Power reduction rate (T-MI over 2D) under various target clock periods for (a) AES at slow (1.0ns), medium (0.8ns), and fast (0.72ns) clocks, and (b) M256 at slow (2.6ns), medium (2.4ns), and fast (2.0ns) clocks. The plotted quantities are the reductions (%) in total power, cell power, net power, and leakage.

5. 7NM TECHNOLOGY SETUP

Another major aspect that affects the power benefit of T-MI is the technology node. As the technology advances, devices and wires shrink at different rates, affecting the timing/power of the circuit and changing the power benefit of T-MI. According to the ITRS 2011 roadmap [5], the 7nm node is near the end of the roadmap.
In the ITRS projection for the 7nm node, devices become dramatically more efficient, but wires do not.¹² The copper effective resistivity in 7nm is 3.7X larger than in 45nm, due to various reasons (edge scattering, barrier thickness, etc.). We now predict how the power benefit of T-MI changes in the future 7nm node. The comparison between our 45nm and 7nm setups is shown in Table 6.

¹⁰ We provide the wire vs. pin power breakdown in Section S6.
¹¹ The average wirelengths of DES-2D and LDPC-2D are 10.5µm and 72.0µm, respectively.
¹² A summary of 45nm and 7nm node device and interconnect characteristics from ITRS projections is shown in Table 15 in the supplement.

Table 6: Comparison of our 45nm and 7nm node setups. Rows: transistor (45nm: planar, 7nm: multi-gate), VDD (V), transistor length (drawn, nm), transistor width (45nm: varies, 7nm: fixed), back-end-of-line ILD k, M2 width (nm), MVIA diameter (nm), ILD thickness (nm), standard cell height (um).

Table 7: Summary of layout results for the 7nm node. The values represent the percentage difference of T-MI over 2D.

circuit   footprint   wirelength   total power   cell power   net power   leakage power
FPU       -47.0%      -34.2%       -37.3%        -32.4%       -44.4%      -21.0%
AES       -62.0%      -47.8%       -19.8%        -10.3%       -28.4%      -28.5%
LDPC      -42.9%      -27.7%       -19.1%        -3.7%        -26.6%      -3.5%
DES       -40.8%      -21.9%       -3.4%         -1.3%        -7.3%       -3.0%
M256      …           -23.0%       -17.8%        -14.1%       -23.0%      -2.4%

Since there is no real 7nm node data available today, we scale down our 45nm library data and also use data from the ITRS projection. As the transistor model, we use the ASU PTM-MG HP 7nm model [11]. The interconnect dimensions are scaled down to (7/45)X = 0.156X, and the interconnect RC libraries are rebuilt with a lower dielectric k (=2.2). We scale down the physical shapes of cells to 0.156X. Based on preliminary SPICE simulations,¹³ we also scale down cell input capacitance to 0.179X, cell delay to 0.471X, output slew to 0.420X, cell power to 0.084X, and cell leakage power to 0.678X. We apply these scaling factors to the 45nm Liberty library and create our 7nm Liberty library.
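
The library scaling step above can be sketched as a per-quantity multiplication. This is a minimal illustration, not the authors' actual script: the `CellEntry` container and its field names are hypothetical, the units are nominal, and the 45nm sample values are placeholders chosen only to exercise the code. Only the scaling factors themselves come from the text.

```python
from dataclasses import dataclass

# Scaling factors quoted in the text (7nm value = factor * 45nm value).
SCALE = {
    "input_cap": 0.179,
    "delay": 0.471,
    "output_slew": 0.420,
    "power": 0.084,
    "leakage": 0.678,
}
GEOMETRY_SCALE = 7 / 45  # about 0.156, applied to physical dimensions


@dataclass(frozen=True)
class CellEntry:
    """Hypothetical container for one characterized point of a cell."""
    input_cap: float    # fF
    delay: float        # ps
    output_slew: float  # ps
    power: float        # fJ
    leakage: float      # nW


def scale_to_7nm(cell: CellEntry) -> CellEntry:
    """Apply the per-quantity factors to a 45nm entry."""
    return CellEntry(**{name: getattr(cell, name) * SCALE[name] for name in SCALE})


# Placeholder 45nm numbers (not from the paper).
inv_45 = CellEntry(input_cap=1.0, delay=20.0, output_slew=10.0, power=0.5, leakage=10.0)
inv_7 = scale_to_7nm(inv_45)

print(inv_7)
print(f"geometry scale = {GEOMETRY_SCALE:.3f}")
print(f"pin cap reduction = {1 - SCALE['input_cap']:.1%}")
```

Note that the 0.179X input capacitance factor corresponds to the 82.1% pin cap reduction discussed later in Section 6.1.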

Since the transistors in the 7nm node are not planar but multi-gate (e.g., FinFET), the coupling between top/bottom tier transistors would be much smaller. Thus, we can reduce the ILD thickness to keep the aspect ratio of the MVIA reasonable.¹⁴ The interconnect RC characteristics for 45nm and 7nm are obtained from the captable built with Cadence Encounter, which runs EM simulations. The unit-length resistances (Ω/µm) of the 45nm and 7nm nodes for a local metal layer (M2) are 3.57 and 638, respectively, whereas for a global metal layer (M8), they are … and 2.650, respectively. The unit-length capacitances (fF/µm) of the 45nm and 7nm nodes for M2 are … and 0.153, respectively, whereas for M8, they are … and 0.095, respectively. We observe that in the 7nm node, the local metal layers become very resistive, due to the larger copper effective resistivity and the smaller metal width/thickness. Yet, in the 7nm node, the wirelengths of the nets on local metal layers become shorter, thus the resistances of the net wires do not increase as dramatically. The capacitance per unit length increases for local metal layers, even though the dielectric k becomes smaller.

6. 7NM RESULTS

The layout simulation results for the 7nm node are summarized in Table 7.¹⁵ Compared with the results in Table 4, we see that the footprint reduction rate is larger, especially for AES, where a 62% footprint reduction was achieved. In the AES case, the target clock period is very small, 0.27ns. For the 2D design, Encounter performed high-effort optimization techniques to meet timing, while for the T-MI design it did not. As a result, the buffer count of the T-MI design is 84.5% smaller. We also observed similar optimization differences for FPU. The wirelength reduction is 21.9-47.8% (Table 7). In the FPU case, the total power reduction is the largest, 37.3%. For DES, the power reduction is the smallest, 3.4%.

¹³ Our 7nm cell characterizations are presented in Section S1.
¹⁴ Note that our 7nm library setup is just one of many possibilities; there is a limitation in the prediction accuracy.
¹⁵ Our detailed layout results for the 7nm node are presented in Section S4.

For LDPC, the power reduction rate in the 7nm node is smaller than in 45nm. In LDPC, there are many long wires across the core area. Considering the unit-length metal resistance, the router prefers intermediate/global layers to local metal layers for long nets. However, in T-MI we added 3 metal layers only to the local layers; on intermediate/global layers, T-MI suffers more routing congestion than 2D.¹⁶ Thus, in the 7nm node, the extremely high resistance on local layers (see Section 5) reduces the power reduction rate, because of worse timing (the local metal resistance was not so high in the 45nm node). In summary, depending on circuit characteristics, the power benefit in the 7nm node may become larger or smaller.

6.1 Impact of Pin Cap Reduction Rate

As mentioned in Section 5, when we compare the 7nm node with the 45nm node, the cell pin cap reduces by 82.1%, which is smaller than the wirelength reduction rate of the designs, about 85% (compare the total wirelength of designs in Tables 12 and 13). Thus, in the 7nm node, the (pin cap)/(wire cap) ratio may become larger than in the 45nm node. Then, the wire cap reduction with T-MI reduces the total net cap by a smaller percentage in the 7nm node. However, depending on the materials and manufacturing technology, the pin cap of cells may reduce at a faster rate than our projection. Thus, we explore how the power benefit of T-MI changes when the pin cap reduces more. For this study, we choose DES as the test circuit, because it showed the largest (pin cap)/(wire cap) ratio among our circuits. Thus, we expect to see a larger impact with various pin cap settings. Our simulation results are summarized in Table 8.

Table 8: Impact of lower cell pin cap in the 7nm node. The -p suffix means the cell pin cap reduction rate (p20 means 20% reduced pin cap). The original columns are total WL (mm), total power (mW), cell power (mW), net power (mW), and leakage (mW); the surviving 3D entries are listed here, with percentages relative to the corresponding 2D design.

design       total WL           total power
DES-3D       63.5 (-21.9%)      (-3.4%)
DES-3D-p…    (-21.9%)           (-1.8%)
DES-3D-p…    (-21.8%)           (-2.7%)
DES-3D-p…    (-21.9%)           (-2.3%)
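
The ratio argument in Section 6.1 can be made concrete with a small calculation. The sketch below is a simplified model, not the paper's analysis: it assumes that net cap is the sum of wire cap and pin cap, that wire cap tracks wirelength (unit-length capacitance held constant), and that T-MI shrinks only the wire part. The 45nm ratio `k45` is a hypothetical value, and the DES wirelength reduction from Table 7 is reused for both nodes purely for comparison.

```python
def net_cap_reduction(wire_reduction: float, pin_to_wire: float) -> float:
    """Fractional drop in net cap when only the wire part shrinks.

    Net cap = wire cap + pin cap, with pin cap = pin_to_wire * wire cap,
    so removing a fraction r of the wire cap removes r / (1 + pin_to_wire) of the net cap.
    """
    return wire_reduction / (1.0 + pin_to_wire)


# Scaling from 45nm to 7nm quoted in the text.
PIN_SCALE = 1.0 - 0.821   # pin cap shrinks by 82.1%
WIRE_SCALE = 1.0 - 0.85   # wirelength shrinks by about 85%

# Assumption: wire cap scales with wirelength only.
ratio_growth = PIN_SCALE / WIRE_SCALE

k45 = 1.0                 # hypothetical (pin cap)/(wire cap) ratio at 45nm
k7 = k45 * ratio_growth   # the same ratio after scaling to 7nm
wire_cut = 0.219          # DES wirelength reduction with T-MI (Table 7)

for node, k in (("45nm", k45), ("7nm", k7)):
    reduction = net_cap_reduction(wire_cut, k)
    print(f"{node}: pin/wire = {k:.2f}, net cap reduction = {reduction:.1%}")
```

With these assumptions the (pin cap)/(wire cap) ratio grows by about 1.19X from 45nm to 7nm, so the same wire cap reduction removes a smaller share of the total net cap.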
Surprisingly, the power benefit of T-MI does not increase with a larger pin cap reduction rate. As pin cap reduces, the net power reduces. Then, the cell power becomes the more dominant factor, because cell power does not decrease as much with smaller pin caps. Thus, the power reduction rate with T-MI becomes smaller.

6.2 Impact of Lower Metal Resistivity
As discussed in Section 5, in 7nm node, the effective resistivity of copper becomes very high. However, in the future, thanks to better interconnect materials (e.g. carbon nanotubes, graphene nanoribbons) and manufacturing processes, the resistivity of interconnect may be lower than expected. In this scenario, we may expect that the timing benefit of 3D becomes smaller, because the nets are longer in 2D designs and the lower resistivity would reduce the delay of nets in 2D more than in 3D. As a case study, we reduce the resistivity of local and intermediate layers by 50%. The resistivity of global metal layers is not changed, because the wires on the global layers are large and the resistivity is not too high. We choose M256 as the test circuit, because it is the largest circuit among our benchmark circuits and is more affected by net delay change.

Table 9: Impact of the lower metal resistivity in 7nm node for M256. The -m suffix means reduced metal resistivity.

design       total WL (mm)   total power (mW)   cell (mW)   net (mW)   leak (mW)
M256-2D      --              --                 --          --         --
M256-3D      612 (-23.0%)    -- (-17.8%)        --          --         --
M256-2D-m    --              --                 --          --         --
M256-3D-m    613 (-22.9%)    -- (-17.8%)        --          --         --

The impact of the reduced metal resistivity is shown in Table 9. All designs met the timing. With lower resistivity, the power consumption reduces, because with better timing smaller cells are used. However, there is not much difference in wirelength and total power reduction percentage. The cell and net power reduction rates went down a little, whereas the leakage power reduction rate went up.
From this result, we conclude that the lower metal resistivity does not necessarily lead to smaller power reductions in monolithic 3D ICs.

7. CONCLUSIONS
In transistor-level monolithic 3D ICs, reduced footprints lead to shorter wirelengths, better performance, and lower power consumption. With carefully designed T-MI 3D cells, we performed layout simulations for the benchmark circuits and demonstrated up to 32.1% and 37.3% total power reductions in 45nm and 7nm nodes, respectively. In addition, we discussed other factors that affect the power benefit of T-MI, such as circuit characteristics and target clock periods. We expect to see larger power benefits with T-MI in future technology nodes, where wires become serious problems.

8. REFERENCES
[1] P. Batude et al. Advances in 3D CMOS Sequential Integration. In Proc. IEEE Int. Electron Devices Meeting, pages 1-4.
[2] S. Bobba et al. CELONCEL: Effective Design Technique for 3-D Monolithic Integration targeting High Performance Integrated Circuits. In Proc. Asia and South Pacific Design Automation Conf.
[3] K. D. Boese, A. B. Kahng, and S. Mantik. On the Relevance of Wire Load Models. In Proc. Int. Workshop on System-Level Interconnect Prediction, pages 91-98.
[4] N. Golshani et al. Monolithic 3D Integration of SRAM and Image Sensor Using Two Layers of Single Grain Silicon. In Proc. IEEE Int. Conf. on 3D System Integration, pages 1-4.
[5] International Technology Roadmap for Semiconductors. ITRS 2011 Edition.
[6] S.-M. Jung et al. The Revolutionary and Truly 3-Dimensional 25F² SRAM Technology with the smallest S³ (Stacked Single-crystal Si) Cell, 0.16um², and SSTFT (Stacked Single-crystal Thin Film Transistor) for Ultra High Density SRAM. In Proc. Symposium on VLSI Technology.
[7] Y.-J. Lee, P. Morrow, and S. K. Lim. Ultra High Density Logic Designs Using Transistor-Level Monolithic 3D Integration. In Proc. IEEE Int. Conf. on Computer-Aided Design.
[8] C. Liu and S. K. Lim. A Design Tradeoff Study with Monolithic 3D Integration. In Proc. Int. Symp. on Quality Electronic Design.
[9] T. Naito et al. World's first monolithic 3D-FPGA with TFT SRAM over 90nm 9 layer Cu CMOS. In Proc. Symposium on VLSI Technology.
[10] Nangate. Nangate 45nm Open Cell Library.
[11] NIMO Group at ASU. Predictive Technology Model.

16 The impact of a different metal layer setup is discussed in Section S7.

Figure 6: Layout snapshots of our T-MI cells: (a) INV, (b) NAND2, (c) MUX2, (d) DFF. The figure labels mark the top tier, bottom tier, NMOS, PMOS, MVIA, and direct S/D contact. The S/D means source/drain. The p/nwell and implants are not shown for simplicity.

Table 10: The 7nm cell characterization results. The cell delay, output slew, and cell power are obtained by averaging the rise/fall transition cases, when input slew is 19ps and load capacitance is 3.2fF.

                     INV               NAND2             DFF
                     45nm     7nm      45nm     7nm      45nm      7nm
input cap (fF)       0.463    0.125    0.523    0.082    0.877     0.097
cell delay (ps)      44.27    25.56    49.24    30.50    124.70    27.07
output slew (ps)     31.35    15.13    35.89    19.29    34.55     8.
cell power (fJ)      --       --       --       --       --        --
leakage (pW)         2,844    2,583    4,962    2,906    42,965    23,241

Table 11: Benchmark circuits and synthesis results.

                              FPU       AES       LDPC      DES       M256
45nm node
  target clock period (ns)    1.8       0.8       2.        --        --
  #cells                      9,694     13,891    38,289    51,       ,877
  cell area (µm²)             19,123    16,756    60,590    85,       ,636
  #nets                       11,345    14,218    44,153    54,       ,569
  average fanout              --        --        --        --        --
7nm node
  target clock period (ns)    0.72      0.27      0.9       0.3       1.0
  #cells                      11,378    12,541    37,322    50,833    191,543
  cell area (µm²)             447.1     362.3     1456.4    2061.3    6788.
  #nets                       12,484    12,811    43,183    54,       ,545
  average fanout              --        --        --        --        --

SUPPLEMENT

S1 Scaling Factors of 7nm Standard Cells
To obtain the scaling trends of 7nm cell characteristics, we first create SPICE netlists of 7nm cells. From the SPICE netlists of Nangate 45nm cells, the transistor models are replaced by the ASU PTM-MG HP 7nm model [11]. The transistor fin height, width, and length of the ASU model are 18, 7, and 11nm, respectively. We assume the number of fins per MOS transistor is 1, because the original cells are of X1 strength; the results may change if we use multiple fins. We also scale the cell internal parasitic R and C components in the original SPICE netlists by 7.7X and 0.156X, respectively, because:
(1) The resistance of metal interconnect is $R = \rho L/(W t) = \rho_s L/W$. The sheet resistance ($\rho_s = \rho/t$) becomes 7.7X, because the M1 thickness ($t$) is 0.156X and we increase the effective resistivity ($\rho$) by 20% to account for size effects and barrier thickness. Both the length ($L$) and width ($W$) of cell internal interconnects become 0.156X. Thus, the R components become 7.7X of the original.
(2) The unit length capacitance does not change much, and the length of cell internal interconnects becomes 0.156X. Thus, the C components become 0.156X of the original.

With the SPICE netlists of our 7nm cells, we run Cadence Encounter Library Characterizer (ELC) to obtain the Liberty timing and power library. The ELC runs SPICE simulations for various input slew and load capacitance conditions and builds a library with timing and power data. The characterization results are shown in Table 10. For each cell, we calculate the scaling ratio, then average the ratios over all cells to obtain the final scaling trend.

Figure 7: A zoom-in shot of T-MI design for AES. Skyblue rectangles are standard cells. For clarity, only MB1, M1, and MVIA layers are shown. The figure labels mark the VDD/VSS strips, MB1, M1, MVIA, and the regions where cells cannot be placed.

S2 Benchmark Circuits and Synthesis Results
Our benchmark circuits and synthesis results for 45nm and 7nm nodes are summarized in Table 11. The FPU is a double precision floating point unit. The AES and the DES are encryption engines. The LDPC is a low-density parity-check engine for the IEEE 802.3an standard. The M256 is a simple partial-sum-add-based 256bit integer multiplier. The circuits vary in size. Note that target clock periods for 7nm node are smaller than those for 45nm node. We use Synopsys Design Compiler (ver. F ) for synthesis. The synthesis results are from 2D results. All synthesized designs (2D, T-MI, in 45nm, 7nm) met target clock periods.

S3 Concerns in Layout Optimizations
In the post-route optimization step, the Encounter optimization engine tries to preserve routed wires. In T-MI designs, the MB1 wires and the routing MVIAs block the cell placement, thus the optimizer cannot place cells at (nor move cells to) such places. For example, in Fig. 7, the white spaces (dotted boxes) cannot be used for optimization such as buffering or gate sizing.
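As a numeric check of the Section S1 scaling factors, the sketch below recomputes the R and C scale factors from the two inputs given there (0.156X dimension scaling and a 20% increase in effective resistivity). The function name and parameter names are hypothetical; the formulas are the ones stated in items (1) and (2).

```python
def interconnect_scale_factors(dim_scale=0.156, resistivity_scale=1.2):
    """Scale factors for cell-internal parasitic R and C from 45nm to 7nm.

    R = rho * L / (W * t): rho grows by resistivity_scale, while L, W, and t
    all shrink by dim_scale. C = (unit length cap) * L, with the unit length
    capacitance assumed unchanged.
    """
    sheet_resistance_scale = resistivity_scale / dim_scale      # rho / t
    r_scale = sheet_resistance_scale * dim_scale / dim_scale    # times L / W
    c_scale = dim_scale                                         # times L
    return r_scale, c_scale


r, c = interconnect_scale_factors()
print(f"R scale: {r:.1f}X, C scale: {c:.3f}X")  # prints "R scale: 7.7X, C scale: 0.156X"
```

One consequence of these two factors is that the product of the internal R and C scales by about 1.2X, i.e. by the resistivity increase alone.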
To see whether these MVIA/MB1 blockages cause design quality degradation, we perform a layout simulation. For this case study, we use AES as the target circuit, because it showed a high placement utilization with lots of densely packed placement regions. From layout simulations, we observe that there are negligible differences in design quality, in terms of wirelength (+0.1%), timing (WNS = +25ps in original vs. +21ps without MB1 and MVIA), and total power (-0.1%). Thus, we conclude that under our settings (placement, routing, optimization options, final utilization, etc.), the routings on MB1 and MVIA do not degrade design quality noticeably. Note that the utilization of the above AES design is around 80%; we may see problems caused by the MVIA/MB1 blockages when utilization is very high. However, in general, it is customary not to exceed 80% utilization, for various reasons (placement and routing quality, optimization quality, decap area, etc.).

Figure 8 panels: (a) 2D-placement, 170.53x168.24um; (b) T-MI-placement, 127.70x126.20um.

Table 14: Layout results with/without our T-MI WLMs. The -n suffix means without our T-MI WLM.

design       total WL (mm)      WNS (ps)   total power (mW)
FPU-3D       149.1              +4         7.22
FPU-3D-n     152.0 (+1.9%)      +11        7.20 (-0.3%)
AES-3D       --                 --         --
AES-3D-n     -- (+0.1%)         --         -- (-0.1%)
LDPC-3D      --                 --         --
LDPC-3D-n    -- (+10.1%)        --         -- (+10.1%)
DES-3D       --                 --         --
DES-3D-n     -- (+0.5%)         --         -- (+0.9%)
M256-3D      4760.2             0          160.5
M256-3D-n    5020.6 (+5.5%)     +3         166.8 (+3.9%)

S4 Detailed Layout Results
The detailed layout simulation results for 45nm node are shown in Table 12. We set the target utilization to around 80%, which is common in industry designs. Since we observed severe wire congestion in LDPC (see Fig. 4(a)), the target utilization was lowered to about 33%; the 2D design was barely routable with this setting. We also observed significant wire congestion in M256, thus the target utilization was lowered to 68%. All designs met the timing (WNS ≥ 0). The detailed layout simulation results for 7nm node are shown in Table 13. We set similar target utilizations as for 45nm node. All designs met timing.

S5 Impact of T-MI Wire Load Model
As mentioned in Section 3.4, we create custom WLMs for T-MI designs. There have been debates on whether a WLM is helpful or not to the final layout results [3].
Since our target circuits are small to medium sized, we may expect that a WLM is helpful to some extent. To see the impact of the custom WLMs on design quality, we perform the synthesis for T-MI designs using the 2D WLMs instead of our T-MI WLMs. As a result, the synthesized netlists for T-MI and 2D become similar. The layout results with/without custom WLM for T-MI designs are shown in Table 14. For FPU, AES, and DES, the design quality difference is negligible. However, for LDPC and M256, we observe a significant increase in wirelength and total power without the T-MI WLM. Thus, we conclude that for some designs, T-MI WLMs are helpful for obtaining larger power benefits with T-MI.

Figure 8: The placement and routing snapshots of AES designs. Panels: (c) 2D-routing, (d) T-MI-routing. The figures reflect the relative sizes of 2D vs. T-MI designs.

Table 15: Summary of the ITRS projection on high performance logic devices and interconnects. The 45nm and the 7nm projection data are from ITRS 2008 and 2011, respectively. The copper effective resistivity and unit length capacitance are for local/intermediate metal layers.

node                                   45nm       7nm
year                                   --         --
device type                            bulk Si    multi-gate
NMOS drive current (µA/µm)             1,210      2,228
Cu effective resistivity (µΩ·cm)       --         --
Cu unit length capacitance (fF/µm)     --         --

S6 Breakdown of Net Power
We break net power into wire and pin power components (net = wire + pin). Wire means metal wires and vias used for routing outside cells, and pin means input pins of cells. As shown in Table 16, in LDPC, wire cap is much larger than pin cap, and so is wire power. Most of the net power reduction is from reduced wirelengths, as seen by the wire power reduction. In contrast, in DES, pin cap is much larger than wire cap. Thus, reduced wirelengths and wire power only reduce a small portion of the net power. In fact, most of the nets in DES are short, whereas most are long in LDPC; the average wirelengths of LDPC-2D and DES-2D are 72.0µm and 10.5µm, respectively.
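The net = wire + pin decomposition can be illustrated with a small calculation. The sketch below assumes that T-MI reduces only the wire component and leaves pin power unchanged, which is a simplification; the function name and all power values are hypothetical and are not taken from Table 16.

```python
def net_power_reduction(wire_power, pin_power, wire_reduction):
    """Fractional net power reduction when only the wire part shrinks.

    Net power is modeled as wire + pin. Pin power is held constant, which
    is a simplifying assumption for this sketch.
    """
    net_before = wire_power + pin_power
    net_after = wire_power * (1.0 - wire_reduction) + pin_power
    return 1.0 - net_after / net_before


# Hypothetical power splits in mW, chosen only to contrast the two cases.
wire_dominated = net_power_reduction(wire_power=9.0, pin_power=1.0, wire_reduction=0.3)
pin_dominated = net_power_reduction(wire_power=1.0, pin_power=9.0, wire_reduction=0.3)
print(f"wire-dominated: {wire_dominated:.0%}, pin-dominated: {pin_dominated:.0%}")
```

With the same 30% wire power reduction, the wire-dominated case saves far more net power than the pin-dominated case, which mirrors the LDPC vs. DES contrast described above.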
S7 Impact of the Metal Layer Setup
To see the impact of the metal layer setup on the power benefit of T-MI, we modify the metal layer stack of T-MI. Instead of adding 3 local metal layers on the top tier, we add 2 to local and 2 to intermediate metal layers. The original and modified metal stacks are shown in Fig. 9. We use LDPC and M256 for this case study. The results are summarized in Table 17. With the modified metal layer structure, compared with our T-MI results, the total wirelength decreases by 1.6% for LDPC and increases by 1.0% for M256. The cell power, net power, and leakage power reduce, and the total power of LDPC and M256 reduces by 2.4% and 2.8%, respectively. Thus, we conclude that the metal layer structure of T-MI affects the power benefit and should be chosen carefully.

The local, intermediate, and global metal layer usage for LDPC and M256 designs is shown in Fig. 10. We observe that both local and intermediate layers are heavily used. On global layers, we see a lot of long wires. LDPC used more global metal than M256. Note that a net uses combinations of these layers; the line segments in the snapshot do not represent the whole net.

Table 12: Layout results of 2D and monolithic 3D designs for 45nm node. The #cells mean total number of cells, and #buffers mean the number of inverting/non-inverting buffers. The #cells include #buffers. The utilization means final cell placement density, after all optimizations. The WL and WNS mean wirelength and worst negative slack, respectively. Positive WNS value means timing is met with a positive slack. The values in parentheses show the percentage ratio to the 2D designs.

circuit  design  footprint (µm²)   #cells    #buffers         util. (%)  total WL (m)  WNS (ps)  total power (mW)  cell power (mW)  net power (mW)  leakage (mW)
FPU      2D      24,839 (100)      10,959    1,644 (100)      --         -- (100)      --        -- (100)          3.98 (100)       4.21 (100)      0.25 (100)
FPU      3D      14,476 (58.3)     9,922     1,240 (75.4)     --         -- (73.7)     --        -- (85.5)         3.61 (90.6)      3.39 (80.5)     0.23 (88.9)
AES      2D      25,375 (100)      19,577    4,952 (100)      --         -- (100)      --        -- (100)          6.36 (100)       6.94 (100)      0.40 (100)
AES      3D      14,613 (57.6)     18,996    5,157 (104.1)    --         -- (76.4)     --        -- (89.1)         5.87 (92.4)      5.97 (86.1)     0.36 (90.5)
LDPC     2D      208,954 (100)     47,017    13,374 (100)     --         -- (100)      --        -- (100)          -- (100)         -- (100)        0.85 (100)
LDPC     3D      118,758 (56.8)    42,831    6,868 (51.4)     --         -- (66.4)     --        -- (67.9)         -- (87.2)        -- (60.8)       0.66 (78.3)
DES      2D      109,652 (100)     54,402    8,436 (100)      --         -- (100)      --        -- (100)          -- (100)         -- (100)        1.03 (100)
DES      3D      64,830 (59.1)     53,534    8,170 (96.8)     --         -- (78.5)     --        -- (95.9)         -- (98.4)        -- (92.3)       1.02 (98.6)
M256     2D      478,077 (100)     245,935   62,970 (100)     --         -- (100)      --        -- (100)          -- (100)         -- (100)        4.70 (100)
M256     3D      270,748 (56.6)    216,956   48,125 (76.4)    --         -- (71.6)     --        -- (82.5)         -- (89.3)        -- (77.8)       4.10 (87.1)

Table 13: Layout results of 2D and monolithic 3D designs for 7nm node.

circuit  design  footprint (µm²)   #cells    #buffers         util. (%)  total WL (mm)  WNS (ps)  total power (mW)  cell power (mW)  net power (mW)  leakage (mW)
FPU      2D      639 (100)         17,306    3,931 (100)      --         -- (100)       --        -- (100)          1.37 (100)       1.34 (100)      0.17 (100)
FPU      3D      339 (53.0)        11,371    1,368 (34.8)     --         -- (65.8)      --        -- (62.7)         0.92 (67.6)      0.74 (55.6)     0.13 (79.0)
AES      2D      724 (100)         29,153    11,496 (100)     --         -- (100)       --        -- (100)          1.35 (100)       1.27 (100)      0.23 (100)
AES      3D      275 (38.0)        12,687    1,778 (15.5)     --         -- (52.2)      --        -- (80.2)         1.21 (89.7)      0.91 (71.6)     0.16 (71.5)
LDPC     2D      5,208 (100)       47,503    11,689 (100)     --         -- (100)       --        -- (100)          2.43 (100)       5.83 (100)      0.41 (100)
LDPC     3D      2,972 (57.1)      43,453    7,936 (67.9)     --         -- (72.3)      --        -- (80.9)         2.34 (96.3)      4.28 (73.4)     0.40 (96.5)
DES      2D      2,612 (100)       50,878    6,851 (100)      --         -- (100)       --        -- (100)          9.49 (100)       5.03 (100)      0.60 (100)
DES      3D      1,546 (59.2)      50,758    6,693 (97.7)     --         -- (78.1)      --        -- (96.6)         9.36 (98.7)      4.67 (92.7)     0.58 (97.0)
M256     2D      11,411 (100)      255,364   59,153 (100)     --         -- (100)       --        -- (100)          -- (100)         -- (100)        2.07 (100)
M256     3D      6,172 (55.4)      213,272   40,997 (69.3)    --         -- (77.0)      --        -- (82.2)         -- (85.9)        -- (77.0)       2.02 (97.6)

Table 16: Wire vs. pin capacitance breakdown of LDPC and DES in 45nm node. The values are for the entire circuit.

design     total cap. (pF)      power (mW)
           wire      pin        wire      pin
LDPC-2D    --        --         --        --
LDPC-3D    --        --         --        --
DES-2D     --        --         --        --
DES-3D     --        --         --        --

Table 17: Impact of the different metal layer setup for T-MI. The +M suffix means the modified metal layer stack.

design       total WL (mm)   total power (mW)   cell (mW)   net (mW)   leak (mW)
LDPC-3D      --              --                 --          --         --
LDPC-3D+M    432 (-1.6%)     6.85 (-2.4%)       --          --         --
M256-3D      --              --                 --          --         --
M256-3D+M    618 (+1.0%)     -- (-2.8%)         --          --         --

Figure 9 layer labels: (a) 2D: local M1-3, intermediate M4-6, global M7-8; (b) T-MI: local M1-6, intermediate M7-9, global M10-11; (c) T-MI+M: local M1-5, intermediate M6-10, global M11-12.
Figure 9: Metal layer stack diagrams for (a) 2D, (b) T-MI, and (c) T-MI+M. The +M means modified metal layer stack.

Figure 10: GDSII snapshots of local (MB1, M1-5), intermediate (M6-10), and global (M11-12) metal layers for (a) LDPC and (b) M256.

S8 Impact of Switching Activity Factor
Another major factor that affects the power consumption is the switching activity factor. The switching activity factor is defined as the number of signal transitions (0-1 or 1-0) per given clock period. The power values of cells and nets are linearly proportional to the related switching activities. Depending on various factors (architecture, usage scenario, etc.), the actual switching activity values may vary.

For statistical power analyses, we provide switching activity factors to the primary input ports and the outputs of sequential cells (e.g. flip-flops). Our default settings for primary inputs and sequential cell outputs are 0.2 and 0.1, respectively. Then, the given switching activity values are propagated to the rest of the circuit, based on the netlist connectivity and the functionality of cells. Since the switching activity of the primary inputs propagates only up to the first sequential cells, and these paths are usually short, changing the switching activity factor of primary inputs affects the power by a small amount. In this case study, we vary the switching activity factors of the sequential cell outputs only.

Figure 11: Power dependency on switching activity factor. (a) Total power (mW) of M256-2D and M256-3D with various switching activity factors, and (b) power reduction rate (%) of FPU, LDPC, M256, AES, and DES under various switching activity factors. All results are from 45nm node.

The total power of 2D and 3D designs for M256 under various switching activity factors is shown in Fig. 11(a). Although the total power increases with a larger switching activity factor, the power reduction rate does not change much, as shown in Fig. 11(b). The other circuits also show negligible differences in power reduction rate under various switching activity factors. Thus, we conclude that the power benefit of T-MI is not largely affected by the switching activity level.
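A simple model shows why the reduction rate can stay nearly flat while the total power grows. The sketch below assumes that dynamic power is linear in a single switching activity factor and that leakage is independent of it. The function names and all coefficients are hypothetical, chosen only for illustration, and a real power analysis propagates activities through the netlist rather than applying one global factor.

```python
def total_power(activity, dynamic_at_unit_activity, leakage):
    """Total power with dynamic power linear in the switching activity."""
    return activity * dynamic_at_unit_activity + leakage


def reduction_rate(activity, dyn_2d, leak_2d, dyn_3d, leak_3d):
    """Fractional total power reduction of the 3D design over the 2D design."""
    p_2d = total_power(activity, dyn_2d, leak_2d)
    p_3d = total_power(activity, dyn_3d, leak_3d)
    return 1.0 - p_3d / p_2d


# Hypothetical coefficients in mW, not taken from the layout results.
DYN_2D, LEAK_2D = 1000.0, 5.0
DYN_3D, LEAK_3D = 820.0, 4.0

for activity in (0.1, 0.2, 0.3, 0.4):
    p_2d = total_power(activity, DYN_2D, LEAK_2D)
    rate = reduction_rate(activity, DYN_2D, LEAK_2D, DYN_3D, LEAK_3D)
    print(f"activity={activity:.1f}  2D power={p_2d:.0f} mW  reduction={rate:.1%}")
```

In this model the reduction rate would be exactly constant if leakage were zero; with a small leakage term it drifts only slightly as the activity changes, while the total power scales almost in proportion to the activity.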
