Performance Evaluation of Multi-Threaded System vs. Chip-Multi-Processor System

Ho Young Kim, Robert Maxwell, Ankil Patel, Byeong Kil Lee

Abstract

The purpose of this study is to analyze and compare the performance of a multi-threaded system and a chip-multi-processor system. Various types of multi-core and multi-threaded processors are now widely used in all domains; even mobile devices have two or more cores to improve performance. Although both technologies are widely used, it is not clear which one is better in performance and power consumption, or which hardware configuration is optimal for a specific target domain. This research originally arose from the question of which system we should choose to execute a 4-thread workload: (i) a 1-core, 4-threaded computer; or (ii) a 4-core, 1-threaded computer.

Index Terms

Multi-thread system, Chip-multi-processor system, Performance evaluation, Power consumption.

I. INTRODUCTION

In today's computer industry, it is common to use multi-threaded or chip-multi-processor computers. In mobile devices as well, it is becoming more common to see cell phones with multiple cores. Multi-threading and multi-core technology increase performance, but they require more power than single-threaded, single-core computers. Using a lot of power was not a big issue at the beginning of the computer era, but consuming less power is becoming a critical issue in designing computer systems. Multi-threaded and multi-core systems also require more space (area) than a single-threaded or single-core system.

Multi-threaded and multi-core technologies are not the same concept [5][10][11][13]. A thread is the smallest unit of processing that can be scheduled by an operating system; multiple threads can exist within the same process and share its resources while being executed independently. In contrast, each core in a multi-core system has its own hardware resources and uses them for its own processing.
It is not simple to say that the performance of multi-threading is better than that of multi-core or vice versa, and this research started from that curiosity. In this paper, we measure several performance metrics of multi-threading systems with up to 32 threads and multi-core systems with up to 32 cores in order to compare them and find the best configuration for performance computing. For the multi-threading and multi-core systems, the simulation cases are formed for working-thread counts that are powers of two (2, 4, 8, 16, and 32), and the total number of simulations was 20 for each benchmark.

Based on the investigation and analysis, the best condition for performance and power consumption depends on the application and the given number of execution threads. The best combination of number of cores and threads is recommended for each thread group. We also cannot say that either multi-threading or multi-core is clearly better than the other.

As fabrication technology evolves, designers can create smaller transistors, for which leakage power is becoming more important. In general, computer performance is no longer determined by clock frequency or the number of threads and cores alone. Of the many factors that affect computer performance, this research focuses on the impact of different combinations of multi-threaded and multi-core systems. Power consumption is also taken into account across these variations.

Manuscript received Aug 2, 2012. Ho Young Kim, Electrical and Computer Engineering, University of Texas at San Antonio, (e-mail: Derek.kim25@gamil.com). San Antonio, Texas, USA, 210-458-5027. Robert Maxwell, (ramaxwell@gamil.com). Ankil Patel, (ankilpatel@hotmail.com). Byeong Kil Lee, (e-mail: byeong.lee@utsa.edu).

II. METHODOLOGY

A. Multi2sim simulator for performance measurement

Multi2Sim [1] was developed by integrating significant characteristics of popular simulators, such as separate functional and timing simulation, SMT and multiprocessor support, and cache coherence.
Multi2sim is an application-only tool intended to simulate x86 binary executable files. Table 1 shows the simulator's configuration.

Table 1. Multi2sim configuration for this research
Simulator: Multi2sim
Architecture: x86 architecture
Cache configuration: L1: private, 2-way, 64KB (32/32); L2: private, 8-way, 512KB; L3: shared, 16-way, 8MB

The simulation is run in several thread groups: 2-Th, 4-Th, 8-Th, 16-Th, and 32-Th. Within each group, the simulation cases are formed by increasing the number of cores while decreasing the number of threads per core. For example, in the 4-Th group the first simulation case is 1 thread-4 cores, the second case is 2 threads-2 cores, and the last case is 4 threads-1 core. In this way, the total number of simulation cases across the five groups is 20.

Four physical machines are used for this research, each with a Xeon E5506 processor with private L1 and L2 caches and a shared L3 cache. Table 2 shows detailed information about the machines used in the simulation. All configurations are run in parallel, and the result data is gathered and analyzed.
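The grouping described above can be sketched in a few lines of Python. This is only an illustration of the counting, not the authors' tooling: the function names are hypothetical, and it assumes every case keeps cores times threads-per-core equal to the group size, with both factors powers of two, as in the 4-Th example.

```python
# Enumerate the (cores, threads-per-core) cases for each working-thread group.
# Assumption: each case keeps cores * threads_per_core equal to the group size,
# and both factors are powers of two, following the 4-Th example in the text.

THREAD_GROUPS = [2, 4, 8, 16, 32]


def cases_for_group(total_threads):
    """Return (cores, threads_per_core) pairs, from most cores to fewest."""
    cases = []
    cores = total_threads
    while cores >= 1:
        cases.append((cores, total_threads // cores))
        cores //= 2
    return cases


def all_cases():
    """Map each working-thread group to its list of configurations."""
    return {group: cases_for_group(group) for group in THREAD_GROUPS}


if __name__ == "__main__":
    for group, cases in all_cases().items():
        labels = ", ".join(f"{cores}c{threads}t" for cores, threads in cases)
        print(f"{group}-Th: {labels}")
    print("total cases:", sum(len(cases) for cases in all_cases().values()))
```

Under these assumptions the five groups contribute 2, 3, 4, 5, and 6 cases, which matches the total of 20 stated in the text.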

Table 2. Physical machine configuration
CPU: Xeon E5506
# of CPU: 8/12 cores
Cache configuration: L1: private, 2-way, 64KB (32/32); L2: private, 8-way, 256KB; L3: shared, 16-way, 8MB
Technology: 45nm

B. McPAT Simulator for power consumption

McPAT [4] is the first integrated power, area, and timing modeling framework for multithreaded and multicore/manycore processors. It is designed to work with a variety of processor performance simulators (and thermal simulators, etc.) over a large range of technology generations. McPAT allows a user to specify low-level configuration details, and it provides default values when the user specifies only high-level architectural parameters.

For power simulation in this research, the Xeon Tulsa system configuration included with the simulator is used as the CPU configuration, and the CPU simulation data from Multi2sim is dumped to collect activity information. The number of power simulations is the same as the number of CPU performance simulations, but each power simulation takes considerably less time than a CPU simulation. The cache configuration for McPAT is matched to that of the CPU simulator, Multi2sim, to get more accurate results. Table 3 shows the simulator configuration for this research.

Table 3. McPAT power simulator configuration
Simulator: McPAT 0.8
Base CPU: Xeon Tulsa
Cache configuration: L1: private, 2-way, 64KB (32/32); L2: private, 8-way, 512KB; L3: shared, 16-way, 8MB
Technology: 65nm

III. WORKLOADS

The PARSEC benchmark suite [2][3][15] is used for the experiments. The version used is recompiled for the Multi2sim simulator and can be downloaded from the Multi2sim workload site. An important point when using this version is that it needs different execution commands from the original version; the detailed information for running each workload is located inside that workload's folder. Table 4 shows detailed information about the PARSEC 2.1 benchmark suite [12].

Table 4. PARSEC 2.1 benchmark suite

The original plan was to use all 13 benchmarks, but 4 of them (Facesim, Freqmine, Raytrace, and Streamcluster) had problems after recompilation and could not be run. So, 9 benchmarks out of 13 are used for this experiment. The medium-size input sets are used for these workloads because simulation with the native input takes too long. For example, with the medium input set a full simulation takes about 3 or 4 days depending on the benchmark; with the native input, one full simulation might need more than 7 days. The Multi2sim creator also recommended the medium-size input for simulation.

The instruction mix of PARSEC is shown in Figure 1. A total of 9 workloads are analyzed for instruction mix. All workloads have over 20% memory instructions, and integer instructions exceed 36% in all workloads, making them the main part of the instruction mix in the PARSEC 2.1 benchmark suite. Five workloads have over 10% floating-point instructions, while two workloads have none. Together, these three instruction types account for over 84% of the instruction mix in every benchmark.

Figure 1. Instruction Mix in PARSEC benchmark

IV. PERFORMANCE COMPARISON OF MULTI-THREADED AND MULTI-CORE SYSTEM WITH VARIOUS CONFIGURATIONS

A. CPU performance: Instructions Per Cycle variation in working thread groups

One of the most common and widely used performance metrics is IPC (Instructions Per Cycle). It is measured and analyzed in all simulation cases because it is an indicator of processor speed [6][14]. Figure 2 compares thread variation on one core with core variation at one thread per core. In the left chart, the number of threads is increased from 1 to 32 in powers of 2, and IPC decreases from 1.83 to 1.11 as the number of threads increases. Performance degrades as threads are added because of contention among the threads, whereas when the number of cores is increased, performance rises linearly from 2.40 to 11.68. When the number of cores is increased in powers of 2, performance also increases by as much as 1.58 in all cases.
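As a quick check on the size of these two trends, the endpoint IPC values quoted above can be compared directly. The sketch below uses only the four numbers from the text; the helper names are hypothetical, and it does not attempt to reproduce the per-step figure of 1.58.

```python
# Compare the endpoint IPC values quoted in the text for the two sweeps.
# Only the four reported numbers are used; helper names are illustrative.


def relative_change(start, end):
    """Fractional change from start to end (negative means a drop)."""
    return (end - start) / start


def scaling_ratio(start, end):
    """How many times larger end is than start."""
    return end / start


THREAD_SWEEP_IPC = (1.83, 1.11)   # one core, threads increased to 32
CORE_SWEEP_IPC = (2.40, 11.68)    # one thread per core, cores increased

if __name__ == "__main__":
    drop = relative_change(*THREAD_SWEEP_IPC)
    gain = scaling_ratio(*CORE_SWEEP_IPC)
    print(f"thread sweep: IPC changes by {drop:.1%}")
    print(f"core sweep: IPC grows by a factor of {gain:.2f}")
```

On these endpoints, adding threads on one core costs roughly two fifths of the IPC, while adding cores raises it nearly fivefold.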

Figure 2. IPC comparison between threads and cores up

These trends are common to all benchmarks, as shown in Figure 3 and Figure 4. Figure 3 is an IPC chart of thread variation with one fixed core, and Figure 4 is an IPC chart of core variation with one fixed thread.

Figure 3. IPC chart of thread variation in fixed core

Figure 4. IPC chart of core variation in fixed thread

In the other charts with fixed cores and varying threads, an interesting point is that using 2 threads or 4 threads gives better performance than the other thread counts in all simulation cases. In a multi-threading system, hardware resources are shared among threads, and this simulation shows that using 2 threads or 4 threads is the best choice, as shown in Figure 5. In the core variation simulation, however, CPU performance always increases when more cores are used.

Figure 5. IPC chart of threads variation in two core and four core fixed condition

IPC performance is investigated in each working threads group: 2-thread, 4-thread, 8-thread, 16-thread, and 32-thread. The first chart, Figure 6, shows the performance data for the 2-thread and 4-thread groups.

Figure 6. IPC chart of threads variation in 2 and 4 working threads groups

In the 2 and 4 threads groups, it is clear that using more cores instead of threads gives better performance. This is also confirmed in the 8-thread group chart, Figure 7.

Figure 7. IPC chart of threads variation in 8 working threads group

In the 16 and 32 working threads groups, the results are the same as in the other cases: when threads are replaced by cores, performance increases, as shown in Figure 8. In summary, increasing the number of cores instead of the number of threads improves CPU performance, and using many threads (more than 4) hurts CPU performance by causing too much thread contention.

Figure 8. IPC performance in 16 and 32 working thread

B. Area Estimation

Area is also an important factor in CPU development, and it is estimated with the power consumption simulator, McPAT. In general, area depends on hardware resources such as the number of hardware units, threads, and cores. Increasing the number of cores usually needs more space than increasing the number of threads in one central processing unit, and the simulation results reflect this. Figure 9 shows that the estimated area increases as either the number of threads or the number of cores goes up.

Figure 9. Estimated area in threads up and cores up

In the chart, when the number of threads is increased from 2 to 32, the area increases from 352 mm^2 to 617 mm^2, which is less than twice the area of the 2-thread case. In the core variation case, however, each time the number of cores is increased by two the area increases by 122 mm^2, so the total area increase from 2 cores to 32 cores is over 4 times.

To summarize the area estimation for thread variation and core variation: increasing the number of cores is not a good choice for area, because each additional 2 cores need 122 mm^2 more for per-core hardware resources such as the processing unit and the L1 and L2 caches. Increasing the number of threads is the better choice, needing 15 mm^2 for each additional 2 threads.

C. Power measurement

The power simulation results contain several power metrics. Runtime dynamic power, subthreshold leakage, and total leakage power are considered in this section to analyze power behavior in the multi-threading system and the multi-core system.

Dynamic power performance: Runtime dynamic power consumption represents the power consumed by hardware activity and switching driven by the inputs. Peak dynamic power is the worst case of power consumption under maximum hardware activity and switching, so it is not considered in this paper. Figure 10 compares runtime dynamic power between thread variation and core variation.

Figure 10. Runtime dynamic power consumption in thread and core variation

The left chart is the thread variation result from 2 threads to 32 threads on one core, and it shows that runtime dynamic power increases exponentially as the number of threads is increased in powers of 2. The right chart is the same type of chart for core variation with one thread per core, and there the dynamic power increases linearly as the number of cores is increased in powers of 2. For 16 or fewer working threads, the dynamic power consumption in the core variation group, 16c1t (16 cores-1 thread), 8c1t, 4c1t, and 2c1t, is higher than in the thread variation group, 1c16t, 1c8t, 1c4t, and 1c2t. At 32 working threads, however, this changes abruptly: 1 core 32 threads uses about 2 times more power than 1 thread 32 cores.

The result for each working threads group is explained in turn. The first chart, Figure 11, covers the 2 and 4 working threads groups. Using more cores instead of threads is expected to use more power in these groups, and the simulation data follows the expected trend: 2c1t consumes more power than 1c2t, and in the 4 working threads group, 2c2t and 4c1t consume more power than 1c4t. There is no Fluidanimate power consumption data for the 4c1t case because a simulator error occurred in that configuration only, and it could not be fixed even after asking in the Multi2sim forum. In the 8 working threads group, the trend of dynamic power consumption is similar to that of the 4 threads group: when threads are replaced by cores at a fixed 8 working threads, the power consumption increases with the number of cores.

Figure 11. Runtime dynamic consumption in 2 and 4 working threads

Bodytrack and x264 show a clearly linear trend of power consumption as cores are substituted for threads. The trend in the 8 working threads group does not change at all from the 4 working threads group, but from 16 working threads onward the power consumption data shows some meaningful changes. Figure 12 shows runtime dynamic power consumption data for 16 working threads.

Figure 12. Runtime dynamic power in 16 working threads group

In the 16 working threads chart, the 1c16t case uses more power than 2c8t and 4c4t; from 8c2t onward the consumed power is higher than that of 1c16t, and 16c1t uses more power than all the others. This shows that using more than 8 threads can cause enough thread contention to degrade system performance, and the effect becomes more serious in the 32 working threads group. In the 32 working threads group, the data trends change considerably compared with the other cases, as shown in Figure 13. In this group, the most power-consuming case is 1 core 32 threads at 33.6 W, and 4 cores 8 threads consumes the least power at 9.6 W.

Figure 13. Runtime dynamic power in 32 working threads group

That 1 core 32 threads consumes the most power is expected from the thread variation trends, but it is not expected that 4 cores 8 threads uses the least power. The gap between the highest and lowest power consumption is 23.98 W, and the highest value is more than 3 times that of 4c8t, the lowest one. Generally in this simulation, when the number of cores is increased with fixed threads the power increases linearly, but when the number of threads is increased with fixed cores the power increases exponentially, so 1 core 32 threads consumes more power than 1 thread 32 cores and the other cases.

Leakage power consumption: As process technology keeps shrinking, leakage power consumption is becoming a dominant part of power consumption [7]. Moreover, as long as process technology keeps shrinking, leakage power will be an increasingly important part of power consumption to overcome. The simulation data contains gate leakage, subthreshold leakage, and total leakage; gate leakage is less than about a tenth of subthreshold leakage, so subthreshold and total leakage are mainly considered in this section. Leakage power consumption is reviewed in the same order as dynamic power consumption: first the comparison between thread and core variation, then the result for each working threads group. Figure 14 shows the subthreshold leakage power chart for thread variation and core variation.

Figure 14. Subthreshold leakage in threads variation and core variation

In the left chart, with thread variation, subthreshold leakage is 18.44 W for 1 core 2 threads and 23.03 W for 1 core 32 threads. The number of threads is increased 16 times, yet subthreshold leakage increases by only about 5 W. In the right chart, with core variation at one thread per core, subthreshold leakage increases exponentially from 25.57 W to 247.12 W, which is about 10 times the value for 2 cores 1 thread, even though in the area estimation the area increases only 4 times relative to 2 cores 1 thread. Leakage power is directly connected to the number of cores, which is shown clearly for all working threads groups in Figure 15.
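The relation between the power terms, as given in the Table 5 caption (Total = Runtime Dynamic + Total Leakage; Total leakage = Gate leakage + Subthreshold leakage), can be sketched as below. The class and field names are hypothetical. The runtime dynamic and subthreshold values are the ones quoted in the text for 1 core 32 threads, assuming both refer to the same aggregate; the gate leakage value is not reported in the text and is a placeholder chosen to be under a tenth of the subthreshold leakage.

```python
# Power breakdown following the relations in the Table 5 caption.
# Field names are illustrative, not McPAT output keys.
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerBreakdown:
    runtime_dynamic_w: float
    subthreshold_leakage_w: float
    gate_leakage_w: float

    @property
    def total_leakage_w(self):
        """Total leakage = gate leakage + subthreshold leakage."""
        return self.gate_leakage_w + self.subthreshold_leakage_w

    @property
    def total_w(self):
        """Total = runtime dynamic + total leakage."""
        return self.runtime_dynamic_w + self.total_leakage_w

    @property
    def leakage_share(self):
        """Fraction of total power that is leakage."""
        return self.total_leakage_w / self.total_w


# 33.6 W and 23.03 W are quoted in the text for 1 core 32 threads.
# ASSUMPTION: the 1.8 W gate leakage is a hypothetical placeholder.
ONE_CORE_32_THREADS = PowerBreakdown(
    runtime_dynamic_w=33.6,
    subthreshold_leakage_w=23.03,
    gate_leakage_w=1.8,
)

if __name__ == "__main__":
    p = ONE_CORE_32_THREADS
    print(f"total leakage: {p.total_leakage_w:.2f} W")
    print(f"total power:   {p.total_w:.2f} W")
    print(f"leakage share: {p.leakage_share:.1%}")
```

With a real McPAT run, the three fields would be filled from the simulator's reported values rather than from the placeholder used here.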

Figure 15. Total leakage power in working threads groups

In each group, threads are replaced by cores at a fixed number of working threads. The leakage power increases exponentially as threads are replaced by cores, and the rate of increase is higher in groups with more working threads. For the best power consumption, using more threads than cores is the answer.

To see the impact of leakage power consumption, it is compared with dynamic power consumption in Figure 16.

Figure 16. Power consumption comparison between runtime dynamic and total leakage

In the left chart, the gap between runtime dynamic and leakage power gets smaller as more threads are used, because runtime dynamic power increases exponentially while leakage increases linearly at a small rate. In the right chart, however, the gap between runtime dynamic and leakage power gets larger as the number of cores increases, and at 32 cores 1 thread the leakage power is over 10 times the runtime dynamic power. In summary, runtime dynamic power increases exponentially with the number of threads, and leakage power also increases exponentially with the number of cores, but the share of leakage power in this 65 nm technology can reach over 90% of total power consumption.

D. Power and delay product (mixed performance of power and speed)

E-D Product: So far, speed (IPC) and power (runtime dynamic and leakage power) have been simulated and analyzed separately, but it is still not easy to say whether using more threads or more cores is better for overall performance. To see the total performance including both speed and power, the Energy * Delay product [8] is adopted. Calculating the E-D product requires energy per instruction and cycles per instruction, and their product can be rewritten as (Energy/cycle) / IPC^2.

Figure 17 shows the Energy-Delay product chart for the 2, 4, and 8 working threads groups and the best-performing case in each group.

Figure 17. E-D product in 2, 4 and 8 working threads

In the 2 working threads group the best performance is 1 core 2 threads, and in the 4 working threads group the best is 2 cores 2 threads; from the 4 working threads group onward, the best performance point moves toward using more cores than threads. However, the worst choice in each group is 2c1t, 1c4t, and 1c8t respectively; apart from the 2 working threads group, the worst cases use more threads than cores.

Figure 18 shows the 16 working threads group data.

Figure 18. E-D product in 16 working threads

The best performance condition in this group is 8 cores 2 threads, but the mixed performance of 4 cores 4 threads and 8 cores 2 threads is similar. The worst performance in this group is 1c16t, which uses the most threads in the group, and the second worst case is 2c8t, as expected from the IPC and power consumption results. In the 32 working threads group, the best performance is with 8 cores and 4 threads, as shown in Figure 19, and the worst performance is with 1 core and 32 threads. As the number of working threads increases, using more threads cannot give better performance than using more cores.

From these Energy-Delay product results, 2 threads or 4 threads per core can improve performance, but with more than 4 threads performance can be hurt by excessive thread contention. The number of cores should therefore be increased for better performance while using 2 or 4 threads per core. Because of the high leakage power consumption with a large number of cores, 16c1t and 32c1t cannot be the best case in their groups.
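The rewriting of the E-D product in terms of IPC, mentioned in the definition above, follows in two steps. The symbols EPI (energy per instruction), CPI (cycles per instruction), and E_cycle (energy per cycle) are shorthand introduced here for illustration.

```latex
\begin{align*}
\text{EPI} &= \frac{E_{\text{cycle}}}{\text{IPC}}, \qquad
\text{CPI} = \frac{1}{\text{IPC}} \\
\text{E-D} &= \text{EPI} \times \text{CPI}
            = \frac{E_{\text{cycle}}}{\text{IPC}} \cdot \frac{1}{\text{IPC}}
            = \frac{E_{\text{cycle}}}{\text{IPC}^{2}}
\end{align*}
```

A lower E-D value is better, so a configuration can win either by lowering energy per cycle or by raising IPC, with IPC counting twice.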

Figure 19. E-D product in 32 working threads

E-D^2 Product: The other popular mixed performance metric is E-D^2 [9]. Along with the E-D product, the E-D^2 metric is calculated from the IPC and power data of this research in order to compare the two. Figure 20 shows the E-D^2 chart from the same simulations for the 2, 4, and 8 working threads groups.

Figure 20. E-D^2 product in 2,4,8 working threads

In the 2 working threads group, the best performance case is 2 cores 1 thread, not 1 core 2 threads as with the E-D product, because the speed factor is weighted more heavily by multiplying E-D by Delay once more. In the 4 working threads group, the worst case is 1c4t, the same as with the E-D product, but the best performance case changes from 2c2t to 4c1t. In the 8 working threads group, the gaps among the mixed performance data increase, and the best combination in this group is 8 cores 1 thread. The weak point of increasing the number of cores instead of threads is that more cores consume more power as hardware and area increase; the merit is the speed gained from more cores. In the E-D^2 product the more important factor is speed (delay), so the best choice moves toward using more cores in all groups.

Figure 21 shows the 16 working threads group data. The best performance in this group was 8c2t with the E-D product, but with the E-D^2 metric the best choice moves to twice the cores and half the threads, 16c1t; the worst choice is 1c16t, the same as with the E-D product. In the 32 working threads group, the E-D^2 trend is almost the same as in the other groups: 16c2t is the best case and 1c32t is the worst, as shown in Figure 22.

Figure 21. E-D^2 product in 16 working threads group

Figure 22. E-D^2 product in 32 working threads group

The E-D and E-D^2 products show that more cores should be used for better mixed performance in each working threads group. Using more cores is usually better for speed and runtime dynamic power consumption, but its drawback is increased leakage power. According to the McPAT power simulations, leakage power is 42.5% and runtime dynamic power is 57.5% of total power consumption for 1 core 32 threads. For 32 cores 1 thread, leakage power accounts for 93.7% and runtime dynamic power for 6.3% of total power. If a solution for leakage power is found, the answer for better performance will always be to choose the highest number of cores in each group. This is shown in Table 5, the power consumption comparison.

Table 5. Power consumption comparison in 32 working threading group (Total = Runtime Dynamic + Total Leakage, Total leakage = Gate leakage + Subthreshold leakage)
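The shift in the best configuration between the two metrics can be illustrated with a small sketch. The function names are hypothetical and the three (energy per cycle, IPC) pairs are made-up numbers, not simulation results; they are chosen only to show how a heavier delay weight can move the preferred point toward more cores.

```python
# Energy-delay metrics from energy per cycle and IPC.
# delay_exponent=1 gives the E-D product, delay_exponent=2 gives E-D^2.


def energy_delay_metric(energy_per_cycle, ipc, delay_exponent=1):
    """Return (energy per instruction) * (cycles per instruction)^n."""
    if ipc <= 0:
        raise ValueError("ipc must be positive")
    energy_per_instruction = energy_per_cycle / ipc
    cycles_per_instruction = 1.0 / ipc
    return energy_per_instruction * cycles_per_instruction ** delay_exponent


def best_configuration(measurements, delay_exponent):
    """Return the label with the lowest metric value (lower is better)."""
    return min(
        measurements,
        key=lambda label: energy_delay_metric(
            *measurements[label], delay_exponent
        ),
    )


# ASSUMPTION: hypothetical (energy_per_cycle, ipc) pairs for illustration only.
HYPOTHETICAL_4TH_GROUP = {
    "1c4t": (20.0, 1.5),
    "2c2t": (24.0, 2.5),
    "4c1t": (40.0, 3.0),
}

if __name__ == "__main__":
    for n, name in ((1, "E-D"), (2, "E-D^2")):
        winner = best_configuration(HYPOTHETICAL_4TH_GROUP, n)
        print(f"best by {name}: {winner}")
```

With these placeholder inputs the E-D product prefers the balanced 2c2t point, while E-D^2 prefers 4c1t because the extra IPC outweighs the higher energy per cycle.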

V. CONCLUSION

A large amount of information was gathered via simulation, and finding the relationships between multi-thread and multi-core computing performance is not easy. Many charts were drawn for different configurations and analyzed.

For CPU performance, IPC behavior was measured and analyzed under variations of threads and cores for each working thread group. IPC degrades as the number of threads increases under a fixed number of cores, and this is seen in all thread variation cases. In contrast, IPC always increases when the number of cores is increased with a fixed number of threads. In each working thread group, IPC increases when a thread is replaced by a core. Although 2 or 3 benchmarks show saturation behavior among the groups, their performance does not decrease. It is clear that increasing the number of cores provides better performance than increasing the number of threads.

In the area estimation, increasing the number of threads is a better choice than increasing the number of cores for keeping the CPU area small at the same number of threads or cores. For example, the area of a 32-thread CPU is almost 2 times larger than that of a 2-thread CPU, whereas a 32-core CPU is over 4 times larger than a 2-core CPU. This is attributed to the area growth rate, which is linear in the number of threads and exponential in the number of cores.

Dynamic power consumption shows an exponential relationship when the number of threads is increased and a linear relationship when the number of cores is increased. Leakage power consumption is higher than dynamic power consumption for variations in both threads and cores.
Leakage power consumption is more dominant when the number of cores is increased, consuming up to 93.7% of the power in the worst case.

To make a final decision between a multi-thread computer and a multi-core computer, the Energy-Delay product is adopted, and it leads to a new conclusion: in all working threads groups except the 2 working threads group, it is better to use the multi-core concept with 2 or 4 threads per core. For example, 2c2t, 4c2t, 8c2t, and 8c4t are the best performance choices in their working threads groups under the E-D product. This trend is expected to continue as the number of working threads increases, because leakage power is the main part of power consumption and increases exponentially with the number of cores.

The E-D^2 product is also calculated from the simulation data because it is another widely used metric for mixed performance. With the increased weight on Delay, the results move toward using more cores than with the E-D product. From the 2 working threads group to the 16 working threads group, every best choice uses the highest number of cores, and for the 32 working threads group the best case uses 16 cores and 2 threads.

Based on the simulation and analysis, a multi-core system with 2-way multi-threading can be recommended for better speed and better power consumption. There is no single winner between the multi-core system and the multi-threading system for speed, area, and power consumption. To achieve the best performance, multi-core and multi-threading have to be used together. However, if the leakage power problem can be solved, the winner will be the multi-core system.

REFERENCES

[1] R. Ubal, J. Sahuquillo, S. Petit, P. Lopez. Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors, IEEE 2007.
[2] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, October 2008.
[3] Christian Bienia and Kai Li. PARSEC 2.0: A New Benchmark Suite for Chip-Multiprocessors. In Proceedings of the 5th Annual Workshop on Modeling, Benchmarking and Simulation, June 2009.
[4] Sheng Li; Jung Ho Ahn; Strong, R.D.; Brockman, J.B.; Tullsen, D.M.; Jouppi, N.P. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. Microarchitecture, MICRO-42. 2009.
[5] Intel Hyper-Threading Technology, Jan 2003. [Online]. Available: http://www.intel/com/
[6] A. Suleman. What makes parallel programming hard? May 2011. [Online]. Available: http://www.futurechips.org/tips-for-power-coders/parallel-programming.html
[7] Zyuban, V.V.; Kogge, P.M. Inherently lower-power high-performance superscalar architectures. Computers, IEEE Transactions on, Volume: 50, Issue: 3. Page(s): 268 28, 2001.
[8] Yingmin Li; Brooks, D.; Zhigang Hu; Skadron, K.; Bose, P. Understanding the energy efficiency of simultaneous multithreading. Low Power Electronics and Design, 2004. ISLPED '04. Proceedings of the 2004 International Symposium on. Page(s): 44-49, 2004.
[9] Cong, J.; Jagannathan, A.; Reinman, G.; Tamir, Y. Understanding the energy efficiency of SMT and CMP with multiclustering. Low Power Electronics and Design, 2005. ISLPED '05. Proceedings of the 2005 International Symposium on. Page(s): 48-53, 2005.
[10] Multithreading (computer architecture). Wikipedia, n.p. n.d.
[11] Multi-core processor. Wikipedia, n.p. n.d.
[12] Contreras, G.; Martonosi, M. Characterizing and improving the performance of Intel Threading Building Blocks. Workload Characterization, IISWC 2008. IEEE International Symposium on. Page(s): 57-66, 2008.
[13] Alameldeen, A.R.; Wood, D.A. Variability in architectural simulations of multi-threaded workloads. High-Performance Computer Architecture, HPCA-9 2003. Proceedings. The Ninth International Symposium on. Page(s): 7-18, 2003.
[14] Wu, C.-J.; Martonosi, M. Characterization and dynamic mitigation of intra-application cache interference. Performance Analysis of Systems and Software (ISPASS), IEEE International Symposium on. Page(s): 2-11, 2011.
[15] Bhadauria, M.; Weaver, V.M.; McKee, S.A. Understanding PARSEC performance on contemporary CMPs. Workload Characterization, IISWC 2009. IEEE International Symposium on. Page(s): 98-107, 2009.