
Jose Renau
Siebel Center for Computer Science
201 N. Goodwin
Urbana, IL 61801
Email: renau@cs.uiuc.edu
Homepage: http://www.uiuc.edu/~renau
Phone: (217) 721-5255 (mobile), (217) 244-2445 (work)

22nd December 2003

Dear Sir/Madam:

Please find enclosed my application for a tenure-track Assistant Professor position in your department. I will finish my Ph.D. in Computer Science at the University of Illinois at Urbana-Champaign this coming summer and would be available to start in Fall 2004.

My research area is computer architecture. I have been working under the supervision of Professor Josep Torrellas. I have broad expertise in computer architecture, including chip and processor architecture, multiprocessor systems architecture, low-power design, Thread-Level Speculation (TLS), and processor-in-memory systems. I have also worked on compilation support for new architectures and on Linux kernel development. Finally, I have developed substantial software systems, including a simulator for computer architectures and a TLS compiler.

Please find enclosed my resume, statements of research and teaching, and samples of my publications. All of these documents are also available on my website (www.uiuc.edu/~renau).

I am very excited at the opportunity to contribute to your department, and I am confident that my work can enhance its visibility and reputation. I look forward to hearing from you about an interview.

Yours sincerely,
Jose Renau

Personal Information
Jose Renau (Citizenship: Spain)
Siebel Center for Computer Science, 201 N. Goodwin, Urbana, IL 61801
Email: renau@cs.uiuc.edu
Homepage: http://www.uiuc.edu/~renau
Phone: (217) 721-5255 (mobile), (217) 244-2445 (work)

Research Interests
Computer architecture, chip multiprocessors, energy/performance trade-offs, thread-level speculation, interaction between architecture and compilers, Linux kernel.

Education
University of Illinois at Urbana-Champaign (Advisor: Professor Josep Torrellas)
2004 (expected)  Ph.D., Computer Science
Thesis: Chip Multiprocessor with Thread Level Speculation: Performance and Energy
The thesis challenges, for the first time, the commonly held view that Thread-Level Speculation (TLS) consumes excessive energy. It also proposes novel micro-architectural mechanisms to support out-of-order task spawning in Chip Multiprocessors (CMPs) with TLS. The experimental work included the development of a full TLS compiler.
1999  M.S., Computer Science
Thesis: Memory Hierarchies in Intelligent Memories: Energy/Performance Design
The thesis describes the FlexRAM architecture, a processor-in-memory architecture, focusing on energy, performance, and complexity issues.

Ramon Llull University, Spain
1997  M.S., Computer Science
Thesis: Linux Kernel IEEE 1284 Implementation
The thesis consisted of building TCP/IP over IEEE 1284 and SCSI in Linux. The implementation also included the drivers.
1994  B.S., Computer Science
Final project: ILZR, a New Data Compression Algorithm

Awards
IBM Graduate Research Fellowship (2003-2004).
J. Poppelbaum Memorial Award, University of Illinois (2003). Given to one graduate student every year for academic merit and creativity in computer architecture.

Publications

Conferences and Journals
[1] Speculative Multithreading Does Not (Necessarily) Waste Energy, Jose Renau, Smruti Sarangi, James Tuck, Karin Strauss, Luis Ceze, Wei Liu, and Josep Torrellas. Submitted to the International Symposium on Computer Architecture (ISCA), November 2003.
[2] TLS Chip Multiprocessors: Micro-Architectural Mechanisms for Tasking with Out-of-Order Spawn, Jose Renau, James Tuck, Wei Liu, Luis Ceze, Karin Strauss, and Josep Torrellas. Submitted to the International Symposium on Computer Architecture (ISCA), November 2003.
[3] Managing Multiple Low-Power Adaptation Techniques: The Positional Approach, Michael Huang, Jose Renau, and Josep Torrellas. Sidebar, IEEE Computer Magazine, December 2003.
[4] Programming the FlexRAM Parallel Intelligent Memory System, Basilio Fraguela, Jose Renau, Paul Feautrier, David Padua, and Josep Torrellas. International Symposium on Principles and Practice of Parallel Programming (PPoPP), June 2003.
[5] Positional Adaptation of Processors: Application to Energy Reduction, Michael Huang, Jose Renau, and Josep Torrellas. International Symposium on Computer Architecture (ISCA), June 2003.
[6] Cherry: Checkpointed Early Resource Recycling in Out-of-order Microprocessors, José F. Martínez, Jose Renau, Michael Huang, Milos Prvulovic, and Josep Torrellas. International Symposium on Microarchitecture (MICRO), November 2002.
[7] Energy-Efficient Hybrid Wakeup Logic, Michael Huang, Jose Renau, and Josep Torrellas. International Symposium on Low Power Electronics and Design (ISLPED), August 2002.
[8] A Framework for Dynamic Energy Efficiency and Temperature Management, Wei Huang, Jose Renau, and Josep Torrellas. Journal on Instruction Level Parallelism (JILP), October 2001.
[9] Cache Decomposition for Energy-Efficient Processors, Michael Huang, Jose Renau, Seung-Moon Yoo, and Josep Torrellas. International Symposium on Low Power Electronics and Design (ISLPED), August 2001.
[10] A Framework for Dynamic Energy Efficiency and Temperature Management, Wei Huang, Jose Renau, Seung-Moon Yoo, and Josep Torrellas. International Symposium on Microarchitecture (MICRO), December 2000.

Workshops
[11] Profile-Based Energy Reduction for High Performance, Wei Huang, Jose Renau, and Josep Torrellas. ACM Workshop on Feedback-Directed and Dynamic Optimization (FDDO), December 2001.
[12] Energy/Performance Design of Memory Hierarchies for Processor-In-Memory Chips, Wei Huang, Jose Renau, Seung-Moon Yoo, and Josep Torrellas. Workshop on Intelligent Memory Systems, November 2000. Also appeared in Lecture Notes in Computer Science (Vol. 2107), Springer-Verlag, 2001.
[13] Memory Hierarchies in Intelligent Memories: Energy/Performance Design, Wei Huang, Jose Renau, Seung-Moon Yoo, and Josep Torrellas. Ninth Workshop on Scalable Shared Memory Multiprocessors, June 2000.

Technical Reports and Theses
[14] CFlex: A Programming Language for the FlexRAM Intelligent Memory Architecture, Basilio Fraguela, Jose Renau, Paul Feautrier, David Padua, and Josep Torrellas. Technical Report UIUCDCS-R-2002-2287, July 2002.
[15] FlexRAM Architecture Design Parameters, Seung-Moon Yoo, Jose Renau, Wei Huang, and Josep Torrellas. Technical Report 1584, October 2000.
[16] Memory Hierarchies in Intelligent Memories: Energy/Performance Design, Jose Renau. M.S. Thesis, University of Illinois, December 1999.
[17] Linux Kernel IEEE 1284 Implementation, Jose Renau. M.S. Thesis, Ramon Llull University, June 1997.

Talks

As Presenter at Conferences/Workshops
- Cherry: Checkpointed Early Resource Recycling in Out-of-order Microprocessors, International Symposium on Microarchitecture (MICRO), November 2002.
- Cache Decomposition for Energy-Efficient Processors, International Symposium on Low Power Electronics and Design (ISLPED), August 2001.
- Memory Hierarchies in Intelligent Memories: Energy/Performance Design, Ninth Workshop on Scalable Shared Memory Multiprocessors, June 2000.

As Invited Speaker
- Architectural Support for Hierarchical Thread-Level Speculation, IBM T. J. Watson Research Center, New York, August 2003.

As Presenter at a DARPA PI Meeting
- Morphable Multithreaded Memory Tiles (M3T) Architecture, IBM T. J. Watson Research Center, New York, April 2002.

Software Created
- Designed and implemented a new simulator of computer architectures (Sesc). It is used by several research groups at the University of Illinois, University of Rochester, North Carolina State University, Georgia Institute of Technology, and Cornell University. It models a variety of architectures, including dynamic superscalar processors, CMPs, processor-in-memory, and TLS architectures.
- Created a fully automatic TLS compiler pass using GCC. It generates tasks with software value prediction. This is the compiler used to evaluate the architecture proposed in my Ph.D. thesis.
- Extended CACTI, a widely used cache power model. The extensions have been used at the University of Illinois, University of Rochester, North Carolina State University, U.C. Davis, U.C. Irvine, U.C. Riverside, and the University of Arizona.
- Contributed to the official Shared Memory Multiprocessor (SMP) Linux patches to support SMP boards. These patches have been included in all Linux kernel distributions since 1995.
- Co-developed the IEEE 1284 (parallel port) support in Linux. This implementation has been included in all Linux kernels since 1996.
- Developed official GCC patches, which are included in the main distribution (2002).
- Developed TCP/IP over SCSI boards, which involved several modifications to the Linux kernel to support a high-performance interconnect between Linux machines.
- Invented a new data compression algorithm (ILZR), a variant of Lempel-Ziv Ross Williams (LZRW), distributed as public domain for Amiga computers on the Aminet CD (1993).
- Developed the superscalar simulation infrastructure used by the Architecture Group at the Computer Science Department of Ramon Llull University (1992-1994).

Teaching Experience
- Substitute teacher for several senior- and graduate-level computer architecture classes at the University of Illinois (2002, 2003).
- Tutored graduate students at the University of Illinois (2002, 2003).
- Created and taught a course for system administrators at Ramon Llull University, Spain; four hours a week for ten weeks (1997).

Professional Experience
Jan 1999 - Aug 2003  Research Assistant, University of Illinois at Urbana-Champaign.
Aug 1998 - Dec 1998  System Administrator, University of Illinois at Urbana-Champaign. Worked for the Computing and Communications Services Office.
Jan 1998 - Jul 1998  Computer Network Specialist, FIHOCA, S.A. (Spain).
Sep 1996 - Sep 1997  System Administrator, Asertel, S.A. (Spain). In charge of the computer infrastructure; specialized in network security.
May 1995 - Sep 1996  Systems Manager, Ramon Llull University (Spain). In charge of the administration of the UNIX machines, PCs, and the network of the University.

Professional Activities and Memberships
Reviewer of papers for conferences and journals in computer architecture (ISCA, MICRO, HPCA, ICS, CAL, and IPDPS).
ACM member since 1997.

References
Josep Torrellas (advisor)
Professor and Willett Faculty Scholar, Department of Computer Science
University of Illinois at Urbana-Champaign
Siebel Center for Computer Science, 201 North Goodwin, Urbana, IL 61801
(217) 244-4148, torrellas@cs.uiuc.edu

Marc Snir
Faiman/Muroga Professor and Head, Department of Computer Science
University of Illinois at Urbana-Champaign
Siebel Center for Computer Science, 201 North Goodwin, Urbana, IL 61801
(217) 333-3373, snir@cs.uiuc.edu

David Padua
Professor, Department of Computer Science
University of Illinois at Urbana-Champaign
Siebel Center for Computer Science, 201 North Goodwin, Urbana, IL 61801
(217) 333-4223, padua@cs.uiuc.edu

Sarita Adve
Associate Professor, Department of Computer Science
University of Illinois at Urbana-Champaign
Siebel Center for Computer Science, 201 North Goodwin, Urbana, IL 61801
(217) 333-8461, sadve@cs.uiuc.edu

Wen-Mei Hwu
Franklin W. Woeltge Professor, Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign
215 Coordinated Science Laboratory, 1308 West Main, Urbana, IL 61801
(217) 244-8270, hwu@crhc.uiuc.edu

Appendix: Abstracts of My Conference Papers

[1] Speculative Multithreading Does Not (Necessarily) Waste Energy (submitted to ISCA 2004)

While Chip Multiprocessors (CMPs) with Speculative Multithreading (SM) have been gaining momentum, experienced processor designers in industry have reservations about their practical implementation. In particular, it is felt that SM is too energy-inefficient to compete against conventional superscalars. This paper challenges the commonly held view that SM consumes excessive energy. We show a CMP with SM support that is not only faster but also more energy-efficient than a state-of-the-art wide-issue superscalar. We demonstrate it with a new energy-efficient CMP micro-architecture. In addition, we identify the additional sources of energy consumption in SM and propose energy-centric optimizations that mitigate them. Experiments with the SPECint 2000 codes show that a CMP with two 4-issue cores and support for SM delivers a speedup of 1.08 over an 8-issue superscalar while consuming only 54% of its power. Alternatively, for the same average power in both chips, the SM CMP is on average 1.6 times faster than the superscalar.

[2] TLS Chip Multiprocessors: Micro-Architectural Mechanisms for Tasking with Out-of-Order Spawn (submitted to ISCA 2004)

Chip Multiprocessors (CMPs) are flexible, high-frequency platforms on which to support Thread-Level Speculation (TLS). However, for TLS to deliver on its promise, CMPs must exploit multiple sources of speculative task-level parallelism, including any nesting levels of both subroutines and loop iterations. Unfortunately, these environments are hard to support in decentralized CMP hardware: since tasks are spawned out of order and unpredictably, maintaining key TLS basics such as task ordering and efficient resource allocation is challenging. This paper is the first to propose micro-architectural mechanisms that, taken together, fundamentally enable fast TLS with out-of-order spawn in a CMP. These simple mechanisms are: Splitting Timestamp Intervals, the Immediate Successor List, and Dynamic Task Merging. To evaluate them, we develop a TLS compiler with out-of-order spawn. With our mechanisms, a TLS CMP with two 4-issue processors increases the average speedup of full SPECint 2000 applications from 1.15 (no out-of-order spawn) to 1.25 (with out-of-order spawn). Moreover, the resulting CMP outperforms a very aggressive 8-issue superscalar. Specifically, at the same clock frequency, the CMP delivers an average speedup of 1.14 over the 8-issue processor.

[4] Programming the FlexRAM Parallel Intelligent Memory System (PPoPP 2003)

In an intelligent memory architecture, the main memory of a computer is enhanced with many simple processors. The result is a highly-parallel, heterogeneous machine that is able to exploit computation in the main memory. While several instantiations of this architecture have been proposed, the question of how to program them effectively with little effort has remained a major challenge. In this paper, we show how to effectively hand-program an intelligent memory architecture at a high level and with very modest effort. We use FlexRAM as a prototype architecture. To program it, we propose a family of high-level compiler directives inspired by OpenMP, called CFlex. Such directives enable the processors in memory to execute the program in cooperation with the main processor. In addition, we propose libraries of highly-optimized functions called Intelligent Memory Operations (IMOs). These functions program the processors in memory through CFlex, but make them completely transparent to the programmer. Simulation results show that, with CFlex and IMOs, a server with 64 simple processors in memory runs on average 10 times faster than a conventional server. Moreover, a set of conventional programs with 240 lines on average are transformed into CFlex parallel form with only 7 CFlex directives and 2 additional statements on average.

[5] Positional Adaptation of Processors: Application to Energy Reduction (ISCA 2003)

Although adaptive processors can exploit application variability to improve performance or save energy, effectively managing their adaptivity is challenging. To address this problem, we introduce a new approach to adaptivity: the Positional approach. In this approach, both the testing of configurations and the application of the chosen configurations are associated with particular code sections. This is in contrast to the currently used Temporal approach to adaptation, where both the testing and application of configurations are tied to successive intervals in time.

We propose to use subroutines as the granularity of code sections in positional adaptation. Moreover, we design three implementations of subroutine-based positional adaptation that target energy reduction in three different workload environments: embedded or specialized server, general purpose, and highly dynamic. All three implementations of positional adaptation are much more effective than temporal schemes. On average, they boost the energy savings of applications by 50% and 84% over temporal schemes in two experiments.

[6] Cherry: Checkpointed Early Resource Recycling in Out-of-order Microprocessors (MICRO 2002)

This paper presents CHeckpointed Early Resource RecYcling (Cherry), a hybrid mode of execution based on ROB and checkpointing that decouples resource recycling from instruction retirement. Resources are recycled early, resulting in more efficient utilization. Cherry relies on state checkpointing and rollback to service exceptions for instructions whose resources have been recycled. Cherry leverages the ROB to (1) not require in-order execution as a fallback mechanism, (2) allow memory replay traps and branch mispredictions without rolling back to the Cherry checkpoint, and (3) quickly fall back to conventional out-of-order execution without rolling back to the checkpoint or flushing the pipeline. We present a Cherry implementation with early recycling at three different points of the execution engine: the load queue, the store queue, and the register file. We report average speedups of 1.06 and 1.26 in SPECint and SPECfp applications, respectively, relative to an aggressive conventional architecture. We also describe how Cherry and speculative multithreading can be combined and complement each other.

[7] Energy-Efficient Hybrid Wakeup Logic (ISLPED 2002)

The instruction window is a critical component and a major energy consumer in out-of-order superscalar processors. An important source of energy consumption in the instruction window is instruction wakeup: a completing instruction broadcasts its result register tag, and an associative comparison is performed with all the entries in the window. This paper shows that a very large fraction of completing instructions have to wake up no more than a single instruction currently in the window. Consequently, we propose to save energy by using indexing to enable only the comparator at the single instruction to wake up. Only in the rare case when more than one instruction needs to wake up does our scheme revert to enabling all the comparators or a subset of them. For this reason, we call our scheme Hybrid. Overall, our scheme is very effective: for a processor with a 96-entry window, the number of comparisons performed by the average completing instruction is reduced to 1.1. The exact magnitude of the energy savings depends on the specific instruction window implementation. Furthermore, in the Hybrid schemes, the application suffers no performance penalty.

[9] Cache Decomposition for Energy-Efficient Processors (ISLPED 2001)

The L1 data cache is a time-critical module and, at the same time, a major source of energy consumption. To reduce its energy-delay product, we apply two principles of low-power design: specialize part of the cache structure, and break down the cache into smaller caches. To this end, we propose an L1 cache that combines new designs of a stack cache and a PSA cache. Individually, our stack and PSA cache designs have a lower energy-delay product than previously proposed designs. In addition, their combined operation is very effective. Relative to a conventional 2-way 32KB data cache, our design containing a 4-way 32KB PSA cache and a 512B stack cache reduces the energy-delay product of several applications by an average of 44%.
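The comparator-count argument behind the hybrid wakeup scheme can be illustrated with a toy model. This is a sketch written for this document, not code from the paper: the 96-entry window matches the configuration reported above, but the fraction of completions with multiple waiting consumers (`multi_consumer_rate`) is a hypothetical knob.

```python
import random

WINDOW = 96  # instruction window entries, as in the reported experiment

def simulate(completions=10_000, multi_consumer_rate=0.1, seed=0):
    """Count comparator enables per completing instruction under the
    conventional (broadcast) and hybrid (indexed) wakeup schemes."""
    rng = random.Random(seed)
    broadcast = hybrid = 0
    for _ in range(completions):
        if rng.random() < multi_consumer_rate:
            # Rare case: several consumers wait, so the hybrid scheme
            # reverts to enabling every comparator in the window.
            hybrid += WINDOW
        else:
            # Common case: at most one consumer waits, so indexing
            # enables a single comparator.
            hybrid += 1
        broadcast += WINDOW  # the conventional scheme always broadcasts
    return broadcast / completions, hybrid / completions

conv, hyb = simulate()
print(f"comparisons per completion: conventional={conv:.1f}, hybrid={hyb:.1f}")
```

The gap between the two averages is the point of the scheme: the hybrid cost is dominated by the common single-consumer case, while broadcast always pays for the full window.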
[10] A Framework for Dynamic Energy Efficiency and Temperature Management (MICRO 2000)

While technology is delivering increasingly sophisticated and powerful chip designs, it is also imposing alarmingly high energy requirements on the chips. One way to address this problem is to manage the energy dynamically. Unfortunately, current dynamic schemes for energy management are relatively limited. In addition, they manage energy either for energy efficiency or for temperature control, but not for both simultaneously. In this paper, we design and evaluate for the first time an energy-management framework that tackles both energy efficiency and temperature control in a unified manner. We call this general approach Dynamic Energy Efficiency and Temperature Management (DEETM). Our framework combines many energy-management techniques

and can activate them individually or in groups in a fine-grained manner according to a given policy. The goal of the framework is two-fold: maximize energy savings without extending application execution time beyond a given tolerable limit, and guarantee that the temperature remains below a given limit while minimizing any resulting slowdown. The framework successfully meets these goals. For example, it delivers a 40% energy reduction with only a 10% application slowdown.
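A DEETM-style policy of the kind described above can be sketched as a greedy activation loop. This is an illustrative model only: the technique names, savings, slowdowns, and temperature drops below are hypothetical placeholders, not values from the paper.

```python
# Hypothetical technique table: each entry trades a slowdown for an
# energy saving and a temperature drop (all numbers are made up).
TECHNIQUES = {
    "cache_subbanking":   {"energy_saving": 0.10, "slowdown": 0.02, "temp_drop": 3.0},
    "filter_instr_cache": {"energy_saving": 0.15, "slowdown": 0.04, "temp_drop": 4.0},
    "voltage_scaling":    {"energy_saving": 0.25, "slowdown": 0.08, "temp_drop": 8.0},
}

def choose(current_temp, temp_limit, slowdown_limit):
    """Greedy DEETM-style policy: activate techniques (best saving per
    unit of slowdown first) until the temperature is under the limit,
    without exceeding the tolerable slowdown."""
    ranked = sorted(TECHNIQUES.items(),
                    key=lambda kv: kv[1]["energy_saving"] / kv[1]["slowdown"],
                    reverse=True)
    active, slowdown, temp = [], 0.0, current_temp
    for name, t in ranked:
        if temp <= temp_limit:
            break  # temperature goal met; stop paying slowdown
        if slowdown + t["slowdown"] <= slowdown_limit:
            active.append(name)
            slowdown += t["slowdown"]
            temp -= t["temp_drop"]
    return active, slowdown, temp

# Example: chip at 85 degrees, limit 80, tolerable slowdown 10%.
active, sd, temp = choose(85.0, 80.0, 0.10)
print(active, sd, temp)
```

The greedy ordering is one simple way to realize the two-fold goal stated above: it stops activating techniques as soon as the temperature limit is met, so the slowdown paid stays minimal.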

Jose Renau (http://www.uiuc.edu/~renau)
Research Statement

Research Interests
I am a computer architect with broad interdisciplinary research interests and experience. I have made contributions to chip-level architectures for Thread-Level Speculation (TLS) [1,2]; superscalar processor microarchitecture [6]; low-power architectures [2,7,9] and adaptive processors [3,5,8,10,11]; the programmability, energy, and performance of processor-in-memory architectures [4,12,13,14,15,16]; and compilation support for emerging architectures [1,2,4]. I feel that interdisciplinary research is required to push the envelope in computer architecture.

Past and Present Research

Thread Level Speculation
Most of my research has been on performance and energy trade-offs in chip-level architectures. My thesis focuses on improving the performance and minimizing the energy consumption of TLS architectures [1,2]. The thesis challenges, for the first time, the commonly held view that TLS consumes excessive energy. This is an important issue because energy and power are arguably the main design constraints in current processors. My thesis describes the architecture of a Chip Multiprocessor (CMP) with TLS support that is both faster and more energy-efficient than a state-of-the-art wide-issue superscalar processor. Additionally, it identifies the sources of energy waste in TLS and proposes novel energy-centric optimizations. My thesis is also the first to propose detailed microarchitectural mechanisms that enable speculative tasking with out-of-order task spawn in a TLS CMP; out-of-order task spawn unlocks higher performance. To evaluate my proposals, I built a detailed simulator and a novel TLS compiler on top of GCC. The compiler generates energy-efficient tasks with out-of-order spawning. Experiments with SPECint codes show that a TLS CMP with two narrow-issue cores delivers significant speedups over a wider-issue superscalar, while consuming a fraction of its average power. Therefore, I claim that TLS CMPs are highly promising platforms for next-generation processors.

Processor Checkpointing
As part of my TLS work, I reused TLS's support for program-state checkpointing and rollback recovery to improve superscalar pipeline design [6]. Current superscalar pipelines are sub-optimal in that instructions retain their resources (registers or load/store queue entries) well past their completion, until they retire. In our proposal, called Cherry, we decouple resource recycling from instruction retirement. Registers and load/store queue entries are recycled before instruction retirement, boosting pipeline utilization and, as a result, processor performance. Cherry relies on TLS's support for register and cache state checkpointing and rollback to service exceptions for instructions whose resources have been recycled. The higher resource utilization enabled by Cherry leads to substantial speedups for SPECint and SPECfp codes. Overall, this work is significant in that it proposes enhancing processor performance through aggressive resource recycling rather than through adding more resources, which is not scalable.

Low-Power Adaptive Processors
Before working on TLS and Cherry, I worked on low-power adaptive (or reconfigurable) processor architectures. Run-time adaptation of hardware structures such as caches or pipelines is a promising approach to partially solve the problem of high energy and power requirements in current processors. We proposed a hardware and software algorithm that controls energy consumption and temperature in a unified manner [8,10]. We also designed a novel approach to processor adaptation called Positional adaptation [3,5,11]. In positional adaptation, a processor remembers the best configuration found when it executes a code section; it then uses the same configuration when the code section is invoked again. This is in contrast to conventional adaptation schemes. In such schemes, which we call temporal, the configuration is chosen based on the behavior of the code section immediately preceding the current one. In our work, we show that positional adaptation is more effective than temporal adaptation at saving processor energy with little performance impact. In addition to this work, I have also proposed new energy-efficient cache [9] and instruction window [7] organizations.

Processor-in-Memory Architectures
I have also worked on the FlexRAM project, which proposes a new processor-in-memory architecture. A FlexRAM chip includes up to 64 simple processors and 64 Mbytes of DRAM. Several such chips can be placed in the memory system of a workstation, resulting in a very versatile computing platform. For example,

Jose Renau http://www.uiuc.edu/~renau Research Statement highly-parallel, or memory-intensive tasks can be off-loaded to the memory processors, which can execute in parallel with the main processor. The original FlexRAM architecture did not focus on energy or complexity issues. For my M.S. thesis, I redesigned the FlexRAM chip (on paper) to make it energy-efficient [12,13,15,16]. The resulting architecture has energy and performance advantages over conventional workstations. However, it is quite complex to program. To mitigate this problem, we proposed Open MP-based extensions to a high-level language to help program FlexRAM. These extensions, called C-Flex [4,14], substantially enhance the programmability of FlexRAM and similar processor-in-memory architectures. Tool Development During my Ph.D., I have developed a large number of software tools that my colleagues and I have used for research. Specifically, I have designed and implemented a simulator of computer architectures (Sesc). It is used by several research groups at the Univ. of Illinois, Univ. of Rochester, North Carolina State Univ., Cornell Univ., and Georgia Institute of Technology. It models a variety of architectures, including dynamic superscalar processors, CMPs, processor-in-memory, and TLS architectures. To evaluate the TLS architecture proposed in my thesis, I built together with three other graduate students a TLS compiler pass using GCC. The pass automatically generates tasks with software value prediction. In addition, to improve the task selection quality, we built a profiler pass. I made some extensions to CACTI, a widely-used tool that models power consumption in caches. The extensions have been used at the Univ. of Illinois, Univ. of Rochester, North Carolina State Univ, U.C. Davis, U.C. Irvine, U.C. Riverside, and Univ. of Arizona. Finally, I also made several open source contributions. I co-developed the IEEE 1284 and some multiprocessor patches for Linux. 
They have been included in all Linux kernels since 1996. I also contributed several official patches to GCC in 2003.

Future Research

In the short-to-medium term, I plan to keep investigating TLS architectures. I believe that TLS has great potential for future processors, but many issues must be solved before we can see commercial processors supporting TLS. I have observed that processor designers at companies such as IBM and Intel have reservations about TLS; they are especially concerned about power consumption and design complexity. I plan to focus on making TLS a viable alternative by systematically addressing the open questions and problems in TLS, starting with the impact of TLS on chip temperature. This research is challenging because it requires interdisciplinary expertise in energy, performance, and compilation support.

I also plan to contribute novel ideas to improve the performance and complexity trade-offs in Chip Multiprocessors (CMPs) built from out-of-order superscalars. I believe that Cherry-style checkpointing in modern pipelines is a promising approach to boost performance while limiting complexity. For these architectures, I also want to make fundamental advances on energy, power, and temperature issues, which I consider the true constraints in future CMP designs.

Current microprocessors and multiprocessor systems are very complex, and in computer architecture, more complexity implies harder-to-test designs and longer time to market. I believe that microarchitectural proposals to reduce design complexity will be the next big thing in computer architecture. Therefore, in the next five years, I would like to open new research areas in complexity management and make contributions that simplify the design of hardware and software.

Finally, I have worked in many areas because I truly enjoy working in groups. Group work is a very gratifying experience, and I want to keep doing it as I build my research team of graduate students.
I plan to build an interdisciplinary research team, with graduate students performing research on microarchitecture, multiprocessor systems architecture, energy and temperature issues, compilers, and performance evaluation.

Jose Renau http://www.uiuc.edu/~renau

Teaching Statement

I consider teaching one of the most effective ways to make the world better. Teachers have had a strong influence on my life, second only to my family. While research can affect a large group of people, I feel that teaching has a much larger effect on a small group of people. In my life, some professors have had a bigger impact on me than any paper I have ever read. I would like my teaching to have this kind of impact.

As a senior student in a large research group at the University of Illinois, I have had the pleasure of coordinating the research of several younger Ph.D. and M.S. students in the group. I have always liked helping new members by suggesting lists of papers to read and research problems to examine. In addition, I have coordinated the work of several group members on our research tool infrastructure. While working on my M.S. degree in Spain, I instructed a group of 20 system administrators, preparing and teaching a course on networking and security that lasted a couple of months. Moreover, at the University of Illinois, I have been a substitute instructor several times: when my advisor has been out of town, I have volunteered to teach his classes. This has given me the opportunity to interact with students on multiple occasions.

When I teach, I like to balance an abstract, global view with real industrial examples. I aim to impart a solid understanding along with insights that would be difficult to find in a book; I extract most of these insights from conferences or recent news. Given my background in computer science, I am comfortable teaching any computer architecture class at the graduate or undergraduate level. I would like to teach the following subjects: single-processor architectures, multiprocessor architectures, energy and performance issues, and emerging architectural approaches such as processors-in-memory and thread-level speculation.
At the senior undergraduate level, I can also teach compiler and operating systems courses, and at the undergraduate level, VLSI and networking courses. Aside from already established courses, I would like to create new interdisciplinary courses. I feel that the emerging research topics in computer architecture lie at the intersection of multiple areas. I would like to teach courses analyzing the interaction between architectures and compilers, and between performance and energy optimizations.

University professors are particularly fortunate in that they interact with groups of smart graduate students. I fondly remember becoming interested in computer architecture by participating in small reading groups. As a professor, I would like to create reading groups where junior students can discover their own interests in computer architecture.