Activity-Aware Registers Placement for Low Power Gated Clock Tree Construction

Weixiang Shen, Yici Cai, Xianlong Hong
Dept. of Computer Science & Technology, Tsinghua University, Beijing, 100084, P. R. China
cwx04@mails.tsinghua.edu.cn

Jiang Hu
Dept. of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA
jianghu@ece.tamu.edu

Abstract

The clock tree accounts for over 40% of the total power in modern high-performance VLSI designs, so its power must be kept under control. One of the most effective methods is clock gating, which shuts off the clock when modules are idle. However, previous work on gated clock tree power minimization has mostly focused on clock routing, and its improvements are often limited by the given register placement. The purpose of this work is to guide the registers during placement so that gated clock tree power can be reduced further. Our method simultaneously performs (1) activity-aware register clustering, which reduces clock tree power not only by clumping registers into a smaller area but also by pulling registers with similar activity patterns close together, so that the resulting subtrees can be shut off for more of the time; (2) timing- and activity-based net weighting, which reduces net switching power by assigning a combination of activity and timing weights to nets with higher switching rates or more critical timing; (3) gate control logic optimization, which keeps the gate signal high across a short idle gap when a register is active for a number of consecutive clock cycles. Experimental results show that our approach greatly reduces the power and total wirelength of the clock tree with minimal overhead.

1. Introduction and Motivation

Due to rapidly increasing on-chip power density with technology evolution and the growing market for battery-powered devices, power dissipation has become one of the most critical concerns in modern IC design.
In the total power budget of a CMOS circuit, the clock distribution network normally accounts for more than 40% [1][7], because the clock nets operate at the highest switching frequency and drive a large fan-out. Low-power clock tree synthesis has therefore been investigated in a number of recent works [2]-[8]. The dynamic power consumption is estimated by $\frac{1}{2}\alpha_i f V_{dd}^2 C_L$, where $V_{dd}$ is the supply voltage, $C_L$ is the load capacitance and $\alpha_i$ is the activity of the target block or net (for a clock net, $\alpha_i = 2$, since there is one rising and one falling edge in every clock cycle). Since the frequency $f$ is fixed by the circuit specification, the power dissipation of the clock tree can be reduced in only two ways. One is to reduce the supply voltage [8], which yields a quadratic reduction; the other is to minimize the switched capacitance $\alpha_i C_L$ [2]-[7].

Among the methods for reducing the switched capacitance, clock gating is a well-known and highly effective approach to power minimization [4][5][6]. Since the clock signal is not always needed, power can be saved by masking off the clock when circuits are idle. In [3], Farrahi et al. defined a methodology based on behavioral synthesis to build an activity-driven clock tree. Given a pre-placement description of the design, the set of active and idle times, representing the activity pattern of each module, is extracted from the module's scheduling table. An activity pattern is a string of 0s and 1s, indicating idle and active control steps, respectively. The tree construction algorithm is a bottom-up heuristic based on recursive weighted matching, where the cost function is the activity of the resulting subtree. The objective is to cluster modules with similar activity patterns into the same subtree, so that the clock tree can be gated with high probability. In [4][5], Oh et al. presented a zero-skew gated clock routing technique for VLSI circuits that improves upon [3] in two ways.
First, it starts from a placed netlist of modules. Second, it accurately accounts for the power consumption of the clock tree and of the gated clock control signals. By delaying the merging of high-activity nodes, the global activity in the tree is reduced. In [6][7], Donno et al. proposed power-aware clock tree planning that generates the clock tree topology at RTL from a merging cost that includes both functional (i.e., clock activity conditions) and physical (i.e., floorplanning) information; the results are passed to traditional back-end physical design tools as a set of design constraints.

Although clock gating is a well-known technique for reducing dynamic power in the clock network, gating can adversely impact clock power if applied in an uncontrolled fashion. To amortize its power and area overhead, clock gating logic should be shared among several flip-flops. If the flip-flops that share a common gated clock are widely dispersed across the chip, a significant wiring overhead is induced in the clock distribution network. As a result, the clock drivers in each domain are loaded with a much larger capacitance, and power may increase even if switching activity decreases.

A register's activity pattern plays an important role in a low-power gated clock tree. In [2], the authors proposed an activity-sensitive clock tree construction algorithm for low power. The parent's activity pattern is formed by OR-ing the activity patterns of its children, so the parent must be active whenever its left or right child is active. Merging two nodes with similar activity patterns reduces both the active periods of the parent and the number of control signal transitions. The resulting subtree then needs the clock signal for a percentage of time comparable to that of its children, which reduces the overall activity in the tree.

However, a traditional gated clock network is constructed and optimized at RTL or after placement. In the final placement, modules with similar activity patterns may end up spread far apart across the chip, or adjacent modules may have different activity patterns.
Therefore, whether we construct the clock tree after placement based on physical distance, on logical distance (the similarity of activity), or on a combination of both [6][7], the resulting tree is always suboptimal. Consider the example in Fig. 1, where the activity patterns of reg1, reg2, reg3 and reg4 are 1000011100, 0111000010, 1100011100 and 0110000011, respectively. If the placer does not differentiate registers from other logic cells, the register distribution may look like case (a). Reg1 and reg3 have similar activity patterns but are far apart; merging them first under an activity-driven merging strategy would greatly increase the wirelength, and the power dissipated on the wires may exceed the power saved by masking off the clock for more of the time. In [6][7], node merging is based on a combination of distance and activity, and the resulting clock tree is as shown in (a). However, if we cluster the registers into a small area and pull registers with similar activity patterns close together, as in (b), we not only save considerable wirelength but, more importantly, can shut off the clock for more cycles. The reason is as follows. In (a), the activity patterns of $n_1$ and $n_2$ are 1111011110 and 1110011111, respectively, so the clock on each of $n_1$ and $n_2$ can be shut off for only 2 cycles. In (b), they are 1100011100 and 0111000011, respectively, and the number of shut-off cycles increases to 5. We can also delete the gate before reg3, since its activity is the same as that of $n_1$ and the gate on $n_1$ plays the same role. The power reduction from (a) to (b) is therefore evident.

Figure 1. Two different placements and their corresponding clock trees. (a) Register distribution of a normal placer and its corresponding gated clock tree. (b) Register distribution when registers are clustered and their activities are considered in placement, and its corresponding gated clock tree.

Clustering the registers and pulling registers with similar activity patterns close together helps the subsequent gated clock tree construction. Several attempts have been made to reduce clock network power consumption during the placement stage [9][10]. [9] used the Manhattan ring to guide registers during placement and minimize the clock network wirelength within a quadratic placement framework. [10] presented a power-aware placement method that includes activity-based register clustering and activity-based net weighting. The basic idea behind these methods is to reduce clock network power by reducing clock network wirelength, clumping the registers within the same leaf cluster of the clock tree into a smaller area. However, these methods may not be effective for gated clock tree construction. There we need not only to clump the registers close together to minimize wirelength (or capacitance) but, most importantly, to pull registers with similar activity patterns close together in order to reduce the active periods of the resulting subtrees.

Exploiting this characteristic of gated clock trees, in this paper we present an effective gated-clock-tree-aware placement method that simultaneously performs:

(1) Activity pattern based register clustering. We add pseudo edges between registers and assign them appropriate weights so that the registers are clustered into a small area, especially registers with similar activity patterns.

(2) Activity and timing based net weighting to reduce the net switching power.

(3) Gate control signal optimization. When a register is active for a number of consecutive clock cycles, it is a waste of energy if the control signal switches on and off during these cycles. To prevent this, we optimize the gate signal so that it remains high for one or two clock cycles even after the register has gone idle.
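The Fig. 1 comparison can be checked numerically. The sketch below is illustrative only: the helper names `or_patterns` and `idle_cycles` are ours, the four patterns are taken in the order they are listed for Fig. 1, and the pairings assume the merge orders described for cases (a) and (b).

```python
def or_patterns(a: str, b: str) -> str:
    """Parent activity: active in a cycle if either child is active."""
    return "".join("1" if x == "1" or y == "1" else "0" for x, y in zip(a, b))


def idle_cycles(pattern: str) -> int:
    """Number of cycles in which the clock of this node can be shut off."""
    return pattern.count("0")


# Activity patterns of the four registers, in the order listed for Fig. 1.
regs = {
    "reg1": "1000011100",
    "reg2": "0111000010",
    "reg3": "1100011100",
    "reg4": "0110000011",
}

# Case (a): registers with dissimilar patterns end up under the same parent.
n1_a = or_patterns(regs["reg1"], regs["reg2"])
n2_a = or_patterns(regs["reg3"], regs["reg4"])

# Case (b): registers with similar patterns are placed and merged together.
n1_b = or_patterns(regs["reg1"], regs["reg3"])
n2_b = or_patterns(regs["reg2"], regs["reg4"])

for label, node in [("(a) n1", n1_a), ("(a) n2", n2_a),
                    ("(b) n1", n1_b), ("(b) n2", n2_b)]:
    print(label, node, "idle cycles:", idle_cycles(node))
```

In case (b) the pattern of the first parent equals that of reg3, which is why the gate in front of reg3 becomes redundant.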
The remainder of the paper is organized as follows. Section 2 introduces the placer and the power model we use. Sections 3 and 4 present our activity-aware register clustering and net weighting in detail. Section 5 describes the gated clock tree construction and optimization after placement. Experimental results and analysis are given in Section 6, and Section 7 concludes the paper.

2. Background

2.1. Cut-based Placement Paradigm

The placer core we use is the cut-based placer Capo from the University of Michigan EDA Lab [11]. A min-cut placement instance contains: 1) a rectangular region (referred to as a bin) where cells are to be placed; 2) a hypergraph, with each node representing a cell and each hyperedge representing a signal net connecting two or more cells. The placer recursively partitions each bin and its associated hypergraph at the current level and assigns the sub-hypergraphs to subbins, minimizing the total weighted net cut in order to reduce the total weighted wirelength. The cut direction usually alternates between horizontal and vertical.

2.2. Power Model

Since our ultimate objective is to reduce gated clock tree power consumption, we use the more accurate power model of [6][7], which accounts for: 1) the input capacitance of the registers and of the clock gating ports; 2) the capacitance switched by the interconnect in the clock tree and by the interconnect that feeds the control signal to the gating logic. Let $c_0$ be the unit wire capacitance, $l_i$ and $l_g$ the interconnect lengths of the clock tree and of the gating control signal, respectively, and $C_i$ and $C_g$ the input capacitances of the register and of the gating logic. Power dissipation is then modeled as:

$$[(c_0 l_i + C_i)\,p(i) + 0.5\,(c_0 l_g + C_g)\,p_{tr}]\, f V_{dd}^2 \qquad (1)$$

where $p(i)$ is the probability that the register is active and $p_{tr}$ is the probability of a transition on the control signal net.

3. Activity-Aware Register Clustering

[10] reports the distribution of clock-tree capacitance on an industrial design.
Most of the total capacitance is at the leaf level, which includes all the clock sinks, the wires connecting them, and the driving buffers. An effective way of reducing clock tree capacitance is therefore to reduce the capacitance at the leaf level. A naive method is register clustering, which tries to clump the registers within the same leaf cluster of the clock tree into a small area. In a gated clock tree, however, placing registers with similar activity patterns close together also reduces the activity of the subtrees that result from merging them, as Fig. 1 illustrates, so the clock signal can be shut off for more cycles. During register clustering we therefore also take each register's activity pattern into consideration and try to pull registers with similar activity patterns close together.

As introduced in Section 2.1, Capo is a min-cut placer. Initially, all cells are assembled in the center. With alternating horizontal and vertical cuts, the placement area is partitioned into two bins of approximately equal area with minimal weighted cut, and the cells are assigned to these bins. Capo does not differentiate registers from logic cells, yet registers are strongly connected to logic cells, so placing registers solely for the benefit of the clock network may adversely affect ordinary placement. To pull the registers close, especially those with similar activity patterns, while keeping the overheads in signal net wirelength, timing, and signal net power minimal, we do not cluster registers at the first partition levels of Capo. Instead, we determine the level at which to start adding pseudo edges according to the circuit size. In our experiments, we observed that if registers with similar activity patterns are not confined to the same bin, then as partitioning continues they may be split into different bins and can no longer be pulled close.
So at the early levels, we pull registers with similar activity patterns together strongly, to avoid partitioning them into different bins later. The weight of a pseudo edge is thus determined by two factors: 1) a weight $w_l$ that depends on the partition level; 2) the similarity $w_s$ of the activity patterns of the two registers. At the early levels, $w_l$ should be relatively large, while at the last levels it can be small: the bins are then small, so even if registers are partitioned into different subbins they remain relatively near each other.

$$w_l = \begin{cases} W & \text{if } level = T \\ 3.0 - level/20.0 + W & \text{if } level > T \end{cases}$$

where $T$ is the partition level at which we start to add pseudo edges, and $W$ controls the range of $w_l$.

For $w_s$, we evaluate the similarity of the registers in the same bin and, according to the similarity function, assign an appropriate weight to the added pseudo edge. However, when the number of registers in the bin is large, adding a pseudo edge between every pair of registers produces so many pseudo edges that the run time of Capo is greatly affected. We therefore first evaluate the average similarity value $Average$ in the bin, and add a pseudo edge between two registers only when the similarity between them is larger than $C_0 \cdot Average$, where $C_0$ can be chosen according to the number of registers in the current bin.

$$w_s(Reg_1, Reg_2) = \alpha\,(C_{Reg_1} + C_{Reg_2})\; p(Reg_1, Reg_2)$$

$$p(Reg_1, Reg_2) = P\big(Reg_1.act(i) = 0,\ Reg_2.act(i) = 0\big)$$

where $C_{Reg_i}$ is the capacitance of $Reg_i$, $\alpha$ is used for normalization, $p(Reg_1, Reg_2)$ is the probability that $Reg_1$ and $Reg_2$ are idle in the same cycle, and $0 \le i < act.size$. A larger $w_s(Reg_1, Reg_2)$ means that the two registers can save more switched capacitance thanks to their similar activities, so they should be placed close together. When registers with similar activities are merged, the resulting subtree needs the clock signal for a percentage of time comparable to that of its leaves, which reduces the overall activity in the tree. For a particular bin containing $n$ registers, $Average$ is determined as follows:

$$Average(bin) = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} w_s(Reg_i, Reg_j)$$

Once $w_l$ and $w_s$ are determined, the weight of the pseudo edge added between registers $Reg_i$ and $Reg_j$ at partition level $L$ is:

$$W(edge(Reg_i, Reg_j)) = w_l(L) \cdot w_s(Reg_i, Reg_j) \qquad (2)$$

4. Activity and Timing Based Net Weighting

Clustering registers and pulling those with similar activity patterns close together effectively reduces the capacitance and the overall activity of the clock tree, so the clock signal can be shut off for more of the time and dynamic power is reduced. Unfortunately, it often increases the length of some signal nets and thus their switching power, which may cancel out the power reduction obtained in the clock tree. There has been extensive work on net weighting strategies for timing optimization in placement [12], and a few other works employ switching-activity-based net weighting to minimize total switching power [10]. Unlike strategies that focus on either timing or power alone, in this paper we use the method of [13], which seeks a weighting scheme that improves both.
The net weight is

$$W = \begin{cases} (1 + c\alpha)\left(1 + \dfrac{T_0}{T_0/N + slack}\right) & \text{if } slack \ge 0 \\ (1 + c\alpha)(1 + N) & \text{otherwise} \end{cases}$$

where $\alpha$ is the switching probability of the signal net, $slack$ is the minimum slack at the inputs of the downstream cells, $T_0$ is typically several times larger than the gate delay, and $N$ and $c$ are constant parameters. To maintain consistency and avoid oscillation, net weights are updated incrementally in the feedback procedure at every partition level. The weight of an edge is updated from both its previous value and its new value:

$$W = \beta w_{new} + (1 - \beta)\, w_{orig}$$

5. Gated Clock Tree Construction

5.1. Hierarchical Merging

Given the register placement, we construct a gated clock tree. Similar to the approach introduced in [6][7], we determine the node merging order based on a cost function:

$$dis(i,j) = \alpha f(d(i,j)) + \beta g(l(i,j)) \qquad (3)$$

where $\alpha$ and $\beta$ tune the weight of wirelength versus switching activity, while $f$ and $g$ are normalization functions for the physical and logical distances. With the same definitions as in [6], the cost function becomes:

$$dis(i,j) = \alpha\left[\frac{1}{dim_{max}}\big(|x_i - x_j| + |y_i - y_j|\big)\right] + \beta\left[1 - \frac{1}{C_{tot}}(C_i + C_j)\,p(i,j)\right] \qquad (4)$$

The second term is similar to the evaluation function used in our placement: since we pull registers with small $g(l(i,j))$ close together, $dis(i,j)$ becomes smaller for such pairs. [6][7] built a balanced clock tree by level-by-level merging, but for low power a better clock tree is obtained with an unbalanced topology [5]. That is, we want to bring the high-activity nodes into the tree as late as possible so that the overall activity in the tree is reduced. However, merging low-activity nodes first may increase the wirelength and cancel out the power saved by shutting off the clock for more cycles. To trade off wirelength against activity, we first calculate $dis(i,j)$ for all possible node pairs at the current level and choose the pair with the smallest $dis(i,j)$ to be merged. Rather than waiting until all nodes at the current level have been pairwise merged, as in [6][7], we push the parent of $i$ and $j$ into the node set immediately for the next iteration. The detailed algorithm is shown in Algorithm 1.

Algorithm 1 Clock Tree Construction
Require: a set of sinks {n_i}
  repeat
    for each pair of n_i, n_j do
      compute dis(n_i, n_j) according to Eqn. 4
    end for
    pick the pair n_i, n_j whose dis(n_i, n_j) is minimum
    merge these two nodes and generate a parent n_k
    remove nodes n_i, n_j
    push n_k to the node set
  until only one node is left in the node set
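A minimal sketch of the greedy merging loop of Algorithm 1 follows, using Eq. (4) as the cost. Several details are assumptions made only for illustration and are not specified in the text: the parent is placed at the midpoint of its children, its capacitance is the sum of theirs, $p(i,j)$ is taken as the fraction of cycles in which both nodes are idle (as in Section 3), and the default values of `alpha` and `beta` are arbitrary. Zero-skew embedding, buffering and gate insertion are ignored.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(eq=False)  # compare nodes by identity so list.remove is unambiguous
class Node:
    name: str
    x: float
    y: float
    cap: float
    act: str  # activity pattern, one character ('0' or '1') per cycle


def joint_idle(a: str, b: str) -> float:
    """Assumed p(i,j): fraction of cycles in which both nodes are idle."""
    return sum(p == "0" and q == "0" for p, q in zip(a, b)) / len(a)


def dis(a: Node, b: Node, dim_max: float, c_tot: float,
        alpha: float = 0.5, beta: float = 0.5) -> float:
    """Merging cost of Eq. (4): normalized Manhattan distance plus activity term."""
    physical = (abs(a.x - b.x) + abs(a.y - b.y)) / dim_max
    logical = 1.0 - (a.cap + b.cap) * joint_idle(a.act, b.act) / c_tot
    return alpha * physical + beta * logical


def merge(a: Node, b: Node) -> Node:
    """Parent node: OR of the children's patterns (midpoint placement is assumed)."""
    act = "".join("1" if p == "1" or q == "1" else "0"
                  for p, q in zip(a.act, b.act))
    return Node(f"({a.name}+{b.name})", (a.x + b.x) / 2, (a.y + b.y) / 2,
                a.cap + b.cap, act)


def build_tree(sinks: list, dim_max: float) -> Node:
    nodes = list(sinks)
    c_tot = sum(n.cap for n in nodes)
    while len(nodes) > 1:
        # Recompute the cost of every pair and merge the cheapest one.
        a, b = min(combinations(nodes, 2),
                   key=lambda pair: dis(pair[0], pair[1], dim_max, c_tot))
        nodes.remove(a)
        nodes.remove(b)
        nodes.append(merge(a, b))  # the parent competes in the next iteration
    return nodes[0]


# Hypothetical coordinates and unit capacitances for the Fig. 1 registers.
sinks = [
    Node("reg1", 0.0, 0.0, 1.0, "1000011100"),
    Node("reg2", 0.0, 5.0, 1.0, "0111000010"),
    Node("reg3", 1.0, 0.0, 1.0, "1100011100"),
    Node("reg4", 1.0, 5.0, 1.0, "0110000011"),
]
root = build_tree(sinks, dim_max=10.0)
print(root.name, root.act)
```

With these coordinates, registers that are both close and similar in activity are paired first, and the parents are merged last. Recomputing all pair costs in every iteration mirrors the pseudocode directly; a practical implementation would likely cache costs for untouched pairs.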

5.2. Gate Moving

Initially, we add a gate only before each sink (or register). This may introduce a large number of gates, and these gate positions are not necessarily the best ones for minimizing power. In this paper, we use the gate moving strategy of [6][7] to explore opportunities for moving gating logic from the leaves towards the upper levels of the clock tree. Besides gate moving, there is another case in which inserting a gate hardly reduces power: when the activity of a node is close to 1, there is almost no time frame during which the node can be shut off, so we can remove the gates before such nodes.

5.3. Gate Control Signal Optimization

When a register is active for a number of consecutive clock cycles, it is a waste of energy if the control signal switches on and off during these cycles. To prevent this, the gate signal may be designed to remain high for one or two clock cycles even after the module has gone idle. This prevents unnecessary switching of the signal between consecutive active cycles of the register. We propose a heuristic approach to deal with this effect. It performs a post-order visit of the clock tree and, for each gated node, tries to find a local minimum of the dissipated power, including both the wire and the gate. The heuristic is illustrated in Fig. 2.

Figure 2. A typical gated clock tree

Suppose the activity pattern of $S_1$ is 111101111000000. If we shut off the clock whenever the register is idle, the signal is set low at the fifth cycle. This causes two additional transitions on the control signal, since the register is active again in the next cycle, and $p_{tr} = 3/14$. The power saved at node $S_1$ may then be less than the power dissipated by the gate.
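The transition count in this example, and the trade-off that the heuristic evaluates, can be reproduced with a short sketch. Everything named here is illustrative: `transition_prob`, `fill_short_gaps`, `gated_power` and `choose_gate_signal` are hypothetical helpers, `max_gap` is an assumed parameter reflecting the "one or two clock cycles" mentioned above, and the two load values stand in for the capacitance terms $c_0 l_i + C_i$ and $c_0 l_g + C_g$ of Eq. (1).

```python
def transition_prob(signal: str) -> float:
    """Fraction of cycle boundaries at which the control signal toggles."""
    toggles = sum(a != b for a, b in zip(signal, signal[1:]))
    return toggles / (len(signal) - 1)


def fill_short_gaps(activity: str, max_gap: int = 2) -> str:
    """Keep the gate signal high across idle gaps of at most max_gap cycles
    that lie between two active periods."""
    out = list(activity)
    n = len(activity)
    i = 0
    while i < n:
        if activity[i] == "0":
            j = i
            while j < n and activity[j] == "0":
                j += 1
            # Bridge only interior gaps; leading and trailing idle runs stay gated.
            if i > 0 and j < n and (j - i) <= max_gap:
                out[i:j] = "1" * (j - i)
            i = j
        else:
            i += 1
    return "".join(out)


def gated_power(signal: str, clock_load: float, ctrl_load: float,
                f: float = 1.0, vdd: float = 1.0) -> float:
    """Eq. (1) with clock_load = c0*l_i + C_i and ctrl_load = c0*l_g + C_g."""
    p_active = signal.count("1") / len(signal)
    return (clock_load * p_active
            + 0.5 * ctrl_load * transition_prob(signal)) * f * vdd ** 2


def choose_gate_signal(activity: str, clock_load: float, ctrl_load: float) -> str:
    """Return whichever gate signal (original or gap-filled) dissipates less power."""
    filled = fill_short_gaps(activity)
    before = gated_power(activity, clock_load, ctrl_load)
    after = gated_power(filled, clock_load, ctrl_load)
    return filled if after < before else activity


s1 = "111101111000000"
print(transition_prob(s1))                   # 3 toggles over 14 boundaries
print(fill_short_gaps(s1))                   # the one-cycle gap is bridged
print(transition_prob(fill_short_gaps(s1)))  # 1 toggle over 14 boundaries
```

Bridging the gap costs one extra active cycle on the clock subtree but removes two transitions on the control net, so whether it pays off depends on the relative size of the two loads; with a heavy control net the filled signal wins, with a light one the original signal is kept.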
For such a case, we first evaluate the power $power_{before}$ obtained when the signal is set low whenever the register is idle, and then the power $power_{after}$ obtained when the signal is kept high across the short idle gap between consecutive active cycles. For the example in Fig. 2, keeping the signal high at the fifth cycle for $S_1$ reduces $p_{tr}$ to 1/14. Finally, we choose the better of the two cases to further reduce the power.

6. Experimental Results

The proposed algorithm has been implemented in C++ on top of Capo 9.3, and experiments are performed on a set of ISCAS89 benchmarks with specified timing constraints and activity profiles. Detailed information on the benchmarks is listed in Table 1.

Table 1. The characteristics of benchmarks

circuit   #cell   #net    #register   period(ns)
s1488     661     672     6           3.9
s15850    10811   10891   534         8.6
s35932    19843   19881   1728        20.6
s38417    24846   24877   1636        5.6
s38584    21369   21410   1426        10.5

To evaluate the effectiveness of the proposed method, we compare the results of two algorithms: 1) the original Capo placement; 2) the proposed algorithm, which guides registers for the gated clock tree during placement. Both placement results use the same clock tree construction and optimization algorithm, described in Section 5. For both algorithms we measure signal net power, minimum slack, total half-perimeter wirelength of signal nets (HPWL), clock tree power, clock tree wirelength and running time. As expected, considering registers during placement yields a significant reduction in clock tree wirelength and clock tree power. Because we do not apply register clustering at the first partition levels of Capo, by the time clustering starts the cells have already been assigned to bins and the registers are relatively scattered across the chip, so the overhead induced by our algorithm is minimal.
Clustering the registers within each bin dramatically reduces the clock tree wirelength, as shown in column 3 (Clock WL) of Table 2. During placement we also pull registers with similar activity patterns close together, which is consistent with the subsequent clock tree construction. When these registers are merged, the resulting subtree needs the clock signal for a percentage of time comparable to that of its leaves, which reduces the overall activity in the tree. We can also remove more gates in the gate moving step, since a parent's activity pattern is more similar to those of its children. The power saving and the reduction in the number of gates on the clock tree are both evident in Table 2.

To show the effect of our gate control signal optimization separately, Table 3 lists the clock tree power with and without this technique. As analyzed in Section 5.3, although gate moving is already very effective at finding gate positions that minimize clock tree power, preventing detrimental transitions of the control signal reduces the clock tree power further.

Fig. 3 illustrates the register placement results of the original Capo and of our algorithm. Since we add pseudo edges between registers, the registers in (b) are clustered more tightly than in (a), which is consistent with the reduction in clock tree wirelength and power observed in our experiments.

Table 2. Comparison of our algorithm against placement without considering clock tree construction

Circuit  Algorithm  Clock WL   Clock Power  #Gate  HPWL       Signal Power  Slack(ns)  CPU Time(sec)
s1488    ref        20566.9    0.000106     2      1.8340E6   0.00324       0.972      23.89
         our        10032.0    0.000071     2      1.9008E6   0.00326       0.973      25.71
s15850   ref        2.3657E6   0.007106     43     25.7373E6  0.01938       0.022      719.24
         our        1.6792E6   0.005351     39     28.8173E6  0.02037       -0.036     924.93
s35932   ref        6.5835E6   0.008928     118    48.4727E6  0.01589       3.325      1097.27
         our        5.4234E6   0.007357     101    54.6089E6  0.01688       3.332      1506.39
s38417   ref        6.7787E6   0.033413     118    55.5543E6  0.06603       -1.665     1466.70
         our        5.2371E6   0.026142     97     63.9635E6  0.07079       -1.270     1657.53
s38584   ref        6.0006E6   0.015735     99     65.3293E6  0.03793       1.669      1561.78
         our        4.8541E6   0.012703     96     71.3371E6  0.03987       1.944      1789.01
Ave                 -27.95%    -23.27%             +10.51%    +4.85%                   +20.21%

Table 3. The comparison of the clock tree power with/without transition optimization

Circuit  Previous     After Opt    Reduction
s1488    7.11569E-5   7.08900E-5   0.38%
s15850   0.005516     0.005351     2.99%
s35932   0.007808     0.007357     5.78%
s38417   0.027732     0.026142     5.73%
s38584   0.013599     0.012703     6.59%
Avg                                4.29%

Figure 3. Registers placement of s35932. (a) Previous placement. (b) Our activity-driven placement.

7. Conclusion

In this paper, we present a placement algorithm driven by low-power gated clock tree construction. Our idea is not only to decrease the clock tree wirelength by clumping the registers, as in previous work, but above all to pull registers with similar activity patterns close together, which is especially effective for gated clock tree construction after placement.

8. Acknowledgement

The authors would like to thank Prof. Igor Markov for his kind help. This work is supported by the National Natural Science Foundation of China (NSFC) 60476014.

References

[1] Nir Magen, Avinoam Kolodny, Uri Weiser, Nachum Shamir, Interconnect-dissipation in a Microprocessor, in Proc. SLIP, pp. 7-13, 2004.
[2] Chunhong Chen, Changjun Kang, Majid Sarrafzadeh, Activity-sensitive clock tree construction for low power, in Proc. ISLPED, pp. 279-282, 2002.
[3] A. Farrahi, C. Chen, A. Srivastava, G. Tellez, M. Sarrafzadeh, Activity-driven clock design, IEEE Transactions on CAD/ICAS, Vol. 20, No. 6, pp. 705-714, June 2001.
[4] Jaewon Oh and Massoud Pedram, Power reduction in microprocessor chips by gated clock routing, in Proc. ASP-DAC, pp. 313-318, 1998.
[5] Jaewon Oh and Massoud Pedram, Gated clock routing for low-power microprocessor design, IEEE Transactions on CAD/ICAS, Vol. 20, No. 6, pp. 715-722, June 2001.
[6] Monica Donno, Alessandro Ivaldi, Luca Benini, Enrico Macii, Clock-tree power optimization based on RTL clock-gating, in Proc. DAC, pp. 622-627, 2003.
[7] Monica Donno, Enrico Macii, Luca Mazzoni, Power-aware clock tree planning, in Proc. ISPD, pp. 138-147, 2004.
[8] Jatuchai Pangjun and Sachin S. Sapatnekar, Low-power clock distribution using multiple voltages and reduced swings, IEEE Transactions on CAD/ICAS, Vol. 10, No. 3, pp. 715-722, June 2002.
[9] Yongqian Lu, C. N. Sze, Xianlong Hong, Qiang Zhou, Yici Cai, et al., Navigating registers in placement for clock network minimization, in Proc. DAC, pp. 176-181, 2005.
[10] Yongseok Cheon, Pei-Hsin Ho, et al., Power-Aware Placement, in Proc. DAC, pp. 795-800, 2005.
[11] http://vlsicad.eecs.umich.edu/bk/pdtools/.
[12] T. Kong, A novel net weighting algorithm for timing-driven placement, in Proc. ICCAD, pp. 172-176, 2002.
[13] Bin Liu, Yici Cai, Qiang Zhou, Xianlong Hong, Power driven placement with layout aware supply voltage assignment for voltage island generation in dual-vdd design, in Proc. ASP-DAC, pp. 582-587, 2006.