
MULTISCALAR PROCESSORS

THE KLUWER INTERNATIONAL SERIES IN ENGINEERING AND COMPUTER SCIENCE

MULTISCALAR PROCESSORS
by Manoj Franklin
University of Maryland, U.S.A.
SPRINGER SCIENCE+BUSINESS MEDIA, LLC

Library of Congress Cataloging-in-Publication Data
A C.I.P. Catalogue record for this book is available from the Library of Congress.

Franklin, Manoj
MULTISCALAR PROCESSORS
ISBN 978-1-4613-5364-5
ISBN 978-1-4615-1039-0 (ebook)
DOI 10.1007/978-1-4615-1039-0

Copyright 2003 by Springer Science+Business Media New York. Originally published by Kluwer Academic Publishers in 2003. Softcover reprint of the hardcover 1st edition 2003.

All rights reserved. No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording, or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.

Permissions for books published in Europe: permissions@wkap.nl
Permissions for books published in the United States of America: permissions@wkap.com

Printed on acid-free paper.

Foreword

The revolution of semiconductor technology has continued to provide microprocessor architects with an ever-increasing number of faster transistors with which to build microprocessors. Microprocessor architects have responded by using the available transistors to build faster microprocessors that exploit instruction-level parallelism (ILP) to attain their performance objectives. Starting with serial instruction processing in the 1970s, microprocessors progressed to pipelined and superscalar instruction processing in the 1980s, and eventually (in the mid 1990s) to the currently popular dynamically-scheduled instruction processing models. During this progression, microprocessor architects borrowed heavily from ideas that were initially developed for processors of mainframe computers and rapidly adopted them for their designs. In the late 1980s it was clear that most of the ideas developed for high-performance instruction processing were either already adopted, or were soon going to be adopted. New ideas would have to be developed to continue the march of microprocessor performance.

The initial multiscalar ideas were developed with this background in the late 1980s at the University of Wisconsin. The objective was to develop an instruction processing paradigm for future microprocessors when transistors were abundant, but other constraints, such as wire delays and design verification, were important. The multiscalar research at Wisconsin started out small but quickly grew to a much larger effort as the ideas generated interest in the research community. Manoj Franklin's Ph.D. thesis was the first to develop and study the initial ideas. This was followed by the Wisconsin Ph.D. theses of Scott Breach, T.N. Vijaykumar, Andreas Moshovos, Quinn Jacobson, and Eric Rotenberg, which studied various aspects of the multiscalar execution models. A significant amount of research on processing models derived from multiscalar was also carried out at other universities and research labs in the 1990s. Today, variants of the basic multiscalar paradigm and other follow-on models continue to be the focus of significant research activity as researchers continue to build the knowledge base that will be crucial to the design of future microprocessors.

This book provides an excellent synopsis of a large body of research carried out on multiscalar processors in the 1990s. It will be a valuable resource for designers of future microprocessors as well as for students interested in learning about the concepts of speculative multithreading.

GURI SOHI
UNIVERSITY OF WISCONSIN-MADISON

Soli Deo Gloria

Contents

Foreword by Guri Sohi
Preface
Acknowledgments

1 INTRODUCTION
  1.1 Technology Trends
    1.1.1 Sub-Micron Technology
    1.1.2 Implications of Sub-Micron Technology
  1.2 Instruction-Level Parallelism (ILP)
    1.2.1 Extracting ILP by Software
    1.2.2 Extracting ILP by Hardware
  1.3 Thread-Level Parallelism (TLP)
    1.3.1 Speculative TLP
    1.3.2 Challenges for TLP Processing
  1.4 The Multiscalar Paradigm
  1.5 The Multiscalar Story
    1.5.1 Developing the Idea
    1.5.2 Multi-block based Threads and the ARB
    1.5.3 Maturing of the Ideas
    1.5.4 Other Speculative Multithreading Models
  1.6 The Rest of the Story

2 THE MULTISCALAR PARADIGM
  2.1 Ideal TLP Processing Paradigm-The Goal
  2.2 Multiscalar Paradigm-The Basic Idea
  2.3 Multiscalar Execution Example
    2.3.1 Control Dependences
    2.3.2 Register Data Dependences
    2.3.3 Memory Data Dependences
  2.4 Interesting Aspects of the Multiscalar Paradigm
  2.5 Comparison with Other Processing Paradigms
    2.5.1 Multiprocessing Paradigm
    2.5.2 Superscalar Paradigm
    2.5.3 VLIW Paradigm
  2.6 The Multiscalar Processor
  2.7 Summary

3 MULTISCALAR THREADS-STATIC ASPECTS
  3.1 Structural Aspects of Multiscalar Threads
    3.1.1 Definition
    3.1.2 Thread Spawning Model
    3.1.3 Thread Flow Graph
    3.1.4 Thread Granularity
    3.1.5 Thread Size Variance
    3.1.6 Thread Shape
    3.1.7 Thread Entry Points
    3.1.8 Thread Exit Points
  3.2 Data Flow Aspects of Multiscalar Threads
    3.2.1 Shared Name Spaces
    3.2.2 Inter-Thread Data Dependence
  3.3 Program Partitioning
    3.3.1 Compiler-based Partitioning
    3.3.2 Hardware-based Partitioning
  3.4 Static Thread Descriptor
    3.4.1 Nature of Information
    3.4.2 Compatibility Issues and Binary Representation
  3.5 Concluding Remarks

4 MULTISCALAR THREADS-DYNAMIC ASPECTS
  4.1 Multiscalar Microarchitecture
    4.1.1 Circular Queue Organization of Processing Units
    4.1.2 PU Interconnect
  4.2 Thread Processing Phases
    4.2.1 Spawn: Inter-Thread Control Prediction
    4.2.2 Activate
    4.2.3 Execute
    4.2.4 Resolve
    4.2.5 Commit
    4.2.6 Squash
  4.3 Thread Assignment Policies
    4.3.1 Number of Threads in a PU
    4.3.2 Thread-PU Mapping Policy
  4.4 Thread Execution Policies
    4.4.1 Intra-PU Thread Concurrency Policy: TLP
    4.4.2 Intra-Thread Instruction Concurrency Policy: ILP
  4.5 Recovery Policies
    4.5.1 Thread Squashing
    4.5.2 Basic Block Squashing
    4.5.3 Instruction Re-execution
  4.6 Exception Handling
    4.6.1 Exceptions
    4.6.2 Interrupt Handling
  4.7 Concluding Remarks

5 MULTISCALAR PROCESSOR-CONTROL FLOW
  5.1 Inter-Thread Control Flow Predictor
    5.1.1 Dynamic Inter-Thread Control Prediction
    5.1.2 Control Flow Outcome
    5.1.3 Thread History
    5.1.4 Prediction Automata
    5.1.5 History Updates
    5.1.6 Return Address Prediction
  5.2 Intra-Thread Branch Prediction
    5.2.1 Problems with Conventional Branch Predictors
    5.2.2 Bimodal Predictor
    5.2.3 Extrapolation with Shared Predictor
    5.2.4 Correlation with Thread-Level Information to Obtain Accurate History
    5.2.5 Hybrid of Extrapolation and Correlation
  5.3 Intra-Thread Return Address Prediction
    5.3.1 Private RASes with Support from Inter-Thread RAS
    5.3.2 Detailed Example
  5.4 Instruction Supply
    5.4.1 Instruction Cache Options
    5.4.2 A Hybrid Instruction Cache Organization for Multiscalar Processor
    5.4.3 Static Thread Descriptor Cache (STDC)
  5.5 Concluding Remarks

6 MULTISCALAR PROCESSOR-REGISTER DATA FLOW
  6.1 Nature of Register Data Flow in a Multiscalar Processor
    6.1.1 Correctness Issues: Synchronization
    6.1.2 Register Data Flow in Example Code
    6.1.3 Performance Issues
    6.1.4 Decentralized Register File
  6.2 Multi-Version Register File-Basic Idea
    6.2.1 Local Register File
    6.2.2 Performing Intra-Thread Register Data Flow
    6.2.3 Performing Inter-Thread Register Data Flow
  6.3 Inter-Thread Synchronization: Busy Bits
    6.3.1 How are Busy Bits Set? Forwarding of Create Mask
    6.3.2 How are Busy Bits Reset? Forwarding of Register Values
    6.3.3 Strategies for Inter-Thread Forwarding
  6.4 Multi-Version Register File-Detailed Operation
    6.4.1 Algorithms for Register Write and Register Read
    6.4.2 Committing a Thread
    6.4.3 Squashing a Thread
    6.4.4 Example
  6.5 Data Speculation: Relaxing Inter-Thread Synchronization
    6.5.1 Producer Identity Speculation
    6.5.2 Producer Result Speculation
    6.5.3 Consumer Source Speculation
  6.6 Compiler and ISA Support
    6.6.1 Inter-Thread Data Flow Information
    6.6.2 Utilizing Dead Register Information
    6.6.3 Effect of Anti-Dependences
  6.7 Concluding Remarks

7 MULTISCALAR PROCESSOR-MEMORY DATA FLOW
  7.1 Nature of Memory Data Flow in a Multiscalar Processor
    7.1.1 Example
    7.1.2 Performance Issues
  7.2 Address Resolution Buffer (ARB)
    7.2.1 Basic Idea
    7.2.2 Hardware Structure
    7.2.3 Handling of Loads and Stores
    7.2.4 Committing or Squashing a Thread
    7.2.5 Reclaiming the ARB Entries
    7.2.6 Example
    7.2.7 Two-Level Hierarchical ARB
    7.2.8 Novel Features of ARB
    7.2.9 ARB Extensions
    7.2.10 Memory Dependence Table: Controlled Dependence Speculation
  7.3 Multi-Version Cache
    7.3.1 Local Data Cache
    7.3.2 Performing Intra-Thread Memory Data Flow
    7.3.3 Performing Inter-Thread Memory Data Flow
    7.3.4 Detailed Working
    7.3.5 Comparison with Multiprocessor Caches
  7.4 Speculative Version Cache
  7.5 Concluding Remarks

8 MULTISCALAR COMPILATION
  8.1 Role of the Compiler
    8.1.1 Correctness Issues
    8.1.2 Performance Issues
    8.1.3 Compiler Organization
  8.2 Program Partitioning Criteria
    8.2.1 Thread Size Criteria
    8.2.2 Control Flow Criteria
    8.2.3 Data Dependence Criteria
    8.2.4 Interaction Among the Criteria
  8.3 Program Partitioning Heuristics
    8.3.1 Basic Thread Formation Process
    8.3.2 Control Flow Heuristic
    8.3.3 Data Dependence Heuristics
    8.3.4 Loop Recurrence Heuristics
  8.4 Implementation of Program Partitioning
    8.4.1 Program Profiling
    8.4.2 Optimizations
    8.4.3 Code Replication
    8.4.4 Code Layout
  8.5 Intra-Thread Static Scheduling
    8.5.1 Identifying the Instructions for Motion
    8.5.2 Cost Model
    8.5.3 Code Transformations
    8.5.4 Scheduling Loop Induction Variables
    8.5.5 Controlling Code Explosion
    8.5.6 Crosscutting Issues
  8.6 Concluding Remarks

9 RECENT DEVELOPMENTS
  9.1 Incorporating Fault Tolerance
    9.1.1 Where to Execute the Duplicate Thread?
    9.1.2 When to Execute the Duplicate Thread?
    9.1.3 Partitioning of PUs
  9.2 Multiscalar Processor with Trace-based Threads
    9.2.1 Implementation Hurdles of Complex Threads
    9.2.2 Tree-Like Threads
    9.2.3 Instruction Cache Organization
    9.2.4 Advantages
    9.2.5 Trace Processors
  9.3 Hierarchical Multiscalar Processor
    9.3.1 Microarchitecture
    9.3.2 Inter-Superthread Register Data Flow
    9.3.3 Inter-Superthread Memory Data Flow
    9.3.4 Advantages of Hierarchical Multiscalar Processing
  9.4 Compiler-Directed Thread Execution
    9.4.1 Non-speculative Inter-Thread Memory Data Flow
    9.4.2 Thread-Level Pipelining
    9.4.3 Increased Role of Compiler
  9.5 A Commercial Implementation: NEC Merlot

Index

Preface

Semiconductor technology projections indicate that we are on the verge of having billion-transistor chips. This ongoing explosion in transistor count is complemented by similar projections for clock speeds, thanks again to advances in semiconductor process technology. These projections are tempered by two problems that are germane to single-chip microprocessors: on-chip wire delays and power consumption constraints. Wire delays, especially in the global wires, become more important, as only a small portion of the chip area will be reachable in a single clock cycle. Power density levels, which already exceed that of a kitchen hot plate, threaten to reach that of a nuclear reactor!

Looking at software trends, sequential programs still constitute a major portion of the real-world software used by various professionals as well as the general public. State-of-the-art processors are therefore designed with sequential applications as the primary target. Continued demands for performance boosts have traditionally been met by increasing the clock speed and by incorporating an array of sophisticated microarchitectural techniques and compiler optimizations to extract instruction-level parallelism (ILP) from sequential programs. From that perspective, ILP can be viewed as the main success story of parallelism, as it was adopted in a big way in the commercial world for reducing the completion time of ordinary applications. Today's superscalar processors are able to issue up to six instructions per cycle from a sequential instruction stream. VLSI technology may soon allow future microprocessors to issue even more instructions per cycle. Despite this success story, the amount of parallelism that can realistically be exploited in the form of ILP appears to be reaching its limits, especially when the hardware is limited to pursuing a single flow of control. The limitations arise primarily from the inability to support large instruction windows (due to wire delay limitations and complex program control flow characteristics) and from the ever-increasing latency to memory.

Research on the multiscalar execution model started in the early 1990s, after recognizing this inadequacy of relying on ILP alone. The goal was to expand the "parallelism bridgehead" established by ILP by augmenting it with the "ground forces" of thread-level parallelism (TLP), a coarser form of parallelism that is more amenable to exploiting control independence. Many studies on parallelism indeed confirm the significant performance potential of executing multiple threads of a program in parallel. The difficulties that have been plaguing the parallelization of ordinary, non-numeric programs for decades are complex control flows and ambiguous data dependences through memory. The breakthrough provided by the multiscalar execution model was the use of "sequential threads," i.e., threads that form a strict sequential ordering. Multiscalar threads are nothing but subgraphs of the control flow graph of the original sequential program. The sequential ordering of threads dictates that control passes from a thread to exactly one successor thread (among different alternatives). At run time, the multiscalar hardware exploits TLP (in addition to ILP) by predicting and executing a dynamic sequence of threads on multiple processing units (PUs). This sequence is constructed by performing the required number of thread-level control predictions in succession. Thread-level control speculation is the essence of multiscalar processing: sequentially ordered threads are executed in parallel in a speculative manner on independent PUs, without violating sequential program semantics. In case of misspeculation, the results of the incorrectly speculated thread and subsequent threads are discarded. The collection of PUs is built in such a way that (i) there are only a few global wires, and (ii) very little communication occurs through global wires. Localized communication can be done using short wires, and can be expected to be fast. Thus the use of multiple hardware sequencers (to fetch and execute multiple threads), besides making judicious use of the available transistor budget increase, fits nicely with the goal of reducing on-chip wire delays through decentralization.

Besides forming the backbone of several Ph.D. theses, the multiscalar model has sparked research on several other speculative multithreading models: superthreading, trace processing, clustered multithreading, and dynamic multithreading. It has become one of the landmark paradigms, with appearances in the Calls for Papers of important conferences such as ISCA and Micro. It has been featured in an article entitled "What's Next for Microprocessor Design?" in the October 2, 1995 issue of Microprocessor Report. Recently, multiscalar ideas have found their way into a commercial implementation from NEC called Merlot, furthering the expectation that this execution model will become one of the "paradigms of choice" for future microprocessor design.
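To make the mechanics of thread-level control speculation concrete, here is a minimal Python sketch of the spawn/commit/squash cycle described above. It is only an illustration: the five-thread control flow graph, the four-PU window, and all names (CFG, run, NUM_PUS) are assumptions invented for this example, not the microarchitecture detailed in later chapters.

    from collections import deque

    NUM_PUS = 4  # assumed number of processing units (PUs)

    # Hypothetical thread-level control flow graph: each thread maps to
    # (predicted successor, actual successor). A real predictor derives
    # the prediction from thread history; here one misprediction is
    # hard-coded at thread "B" for illustration.
    CFG = {
        "A": ("B", "B"),
        "B": ("C", "D"),   # predicted C, but execution resolves to D
        "C": ("E", "E"),
        "D": (None, None),
        "E": (None, None),
    }

    def run(entry):
        window = deque([entry])  # in-flight threads; oldest is non-speculative
        while True:
            # Spawn: fill idle PUs by successive thread-level control predictions.
            while len(window) < NUM_PUS:
                predicted = CFG[window[-1]][0]
                if predicted is None:
                    break
                window.append(predicted)
                print("spawn  ", predicted, "(speculative)")
            # Commit: retire the oldest thread and resolve its true successor.
            head = window.popleft()
            actual = CFG[head][1]
            print("commit ", head)
            if actual is None:
                break  # program finished
            if window and window[0] != actual:
                # Squash: discard the mispredicted thread and all younger ones.
                print("squash ", list(window))
                window.clear()
            if not window:
                window.append(actual)

    run("A")

Running the sketch spawns threads B, C, and E speculatively behind A; when B commits and resolves to D rather than the predicted C, the younger threads C and E are squashed and execution restarts at D, which is exactly the discard-and-restart behavior of misspeculation recovery described above.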

A detailed understanding of the software and hardware issues related to the multiscalar paradigm is of utmost importance to researchers and graduate students working in advanced computer architecture. The past few years have indeed seen many publications on the multiscalar paradigm, both from academia and from industry. However, there has been no book that integrates all of the concepts in a cohesive manner. This book is intended to serve computer professionals and students by providing a comprehensive treatment of the basic principles of multiscalar execution as well as advanced techniques for implementing the multiscalar concepts. The presentation benefits from the many years of experience the author has had with the multiscalar execution model, both as Ph.D. dissertation work and as follow-up research.

The discussion within most of the sections follows a top-down approach, accompanied by a wealth of examples for clarity and ease of understanding. For each major building block, the book presents alternative designs and discusses design trade-offs. Special emphasis is placed on highlighting the major challenges. Of particular importance is deciding where a thread should start and end. Another challenge is enforcing proper synchronization and communication of register values as well as memory values from an active thread to its successors.

The book provides comprehensive coverage of all topics related to multiscalar processors. It starts with an introduction to the topic, including the technology trends that provided an impetus to the development of multiscalar processors and are likely to shape their future development. It ends with a review of recent developments related to multiscalar processors.

We have three audiences in mind: (1) designers and programmers of next-generation processors, (2) researchers in computer architecture, and (3) graduate students studying advanced computer architecture. The primary intended audience is computer engineers and researchers in the field of computer science and engineering. The book can also be used as a textbook for advanced graduate-level computer architecture courses where the students have a strong background in computer architecture. This book would certainly engage the students, and would better prepare them to be effective researchers in the broad areas of multithreading and parallel processing.

MANOJ FRANKLIN

Acknowledgments

First of all, I praise and thank my Lord JESUS CHRIST, to whom this book is dedicated, for HIS love and divine guidance all through my life. Everything that I am and will ever be will be because of HIM. It was HE who bestowed me with the ability to do research and write this book. Over the years, I have come to realize that without such an acknowledgement, all achievements are meaningless, and a mere chasing after the wind. So, to HIM be praise, glory, and honor, for ever and ever.

I thank my family and friends for their support and encouragement throughout the writing of this book. I would like to acknowledge my parents, Prof. G. Aruldhas and Mrs. Myrtle Grace Aruldhas, who have been a constant inspiration to me in intellectual pursuits. My father has always encouraged me to strive for insight and excellence. Thanks to my wife, Bini, for her companionship, love, understanding, and undying support. And thanks to my children, Zaneta, Joshua, and Tesiya, who often succeeded in stealing my time away from this book and have provided the necessary distraction.

Prof. Guri Sohi, my Ph.D. advisor, was instrumental in the development and publicizing of the multiscalar paradigm. He provided much insightful advice while I was working on the multiscalar architecture for my Ph.D. Besides myself, Scott Breach and T. N. Vijaykumar also completed Ph.D. theses on the multiscalar paradigm. Much of the information presented in this book has been assimilated from our theses and papers on the multiscalar paradigm.

The National Science Foundation, DARPA, and IBM have been instrumental in funding the research on the multiscalar architecture at the University of Wisconsin-Madison, the University of Minnesota, and the University of Maryland. Without their support, multiscalar research would not have progressed very far.

Finally, I thank Susan Lagerstrom-Fife and Sharon Palleschi of Kluwer Academic Publishers for their hard work in bringing this manuscript to publication.