MIT OpenCourseWare, http://ocw.mit.edu
6.189 Multicore Programming Primer, January (IAP) 2007

Please use the following citation format: Rodric Rabbah, 6.189 Multicore Programming Primer, January (IAP) 2007. (Massachusetts Institute of Technology: MIT OpenCourseWare). http://ocw.mit.edu (accessed MM DD, YYYY). License: Creative Commons Attribution-Noncommercial-Share Alike.

Note: Please use the actual date you accessed this material in your citation. For more information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms

6.189 IAP 2007 Lecture 17: The Raw Experience. 6.189 IAP 2007 MIT

Raw Chips (October '02)

Raw Microprocessor
- Tiled microprocessor with a point-to-point pipelined scalar operand network
- Each tile is 4 mm x 4 mm: MIPS-style compute processor, single-issue 8-stage pipeline, 32b FPU, 32K D-cache, I-cache
- 4 on-chip mesh networks: two for operands, one for cache misses and I/O, one for message passing

Raw Microprocessor
- 16 tiles (16-issue)
- 180 nm ASIC (IBM SA-27E): ~100 million transistors, 1 million gates
- 3-4 years of development, 1.5 years of testing, 200K lines of test code
- Core frequency: 425 MHz @ 1.8 V, 500 MHz @ 2.2 V; frequency competitive with IBM-implemented POWER processors in the same process
- 18 W average power

One Cycle in the Life of a Tiled Processor (Image by MIT OCW.) An application uses as many tiles as needed to exploit its parallelism.

Raw Motherboard

Raw in Action

MPEG-2 Encoder Performance (speedup and frames/s vs. number of tiles, for 350 x 240 and 720 x 480 images). Legend: square = linear speedup; diamond = hand-optimized, slice-parallel implementation; circle = slice-parallel implementation; triangle = baseline macroblock-parallel implementation.

MPEG-2 Encoder Performance: encoding rate (frames/s)

# Tiles   352 x 240   640 x 480   720 x 480
1         4.30        1.14        1.00
2         8.48        2.24        1.97
4         16.18       4.45        3.84
8         30.82       8.69        7.52
16        58.65       16.74       14.57
32        103*        30*
64        158*        51.90

* Estimated data rates
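To make the scaling trend in the table concrete, here is a quick sketch that computes speedup and parallel efficiency for the 352 x 240 case, assuming the 1-tile rate as the baseline. The starred 32- and 64-tile rates are estimates on the slide and are omitted here.

```python
# Speedup and parallel efficiency from the measured MPEG-2 encoding
# rates (frames/s) in the table above, 352 x 240 images only.
rates_352x240 = {1: 4.30, 2: 8.48, 4: 16.18, 8: 30.82, 16: 58.65}

def scaling(rates):
    """Return {tiles: (speedup, efficiency)} relative to the 1-tile rate."""
    base = rates[1]
    return {t: (r / base, r / base / t) for t, r in rates.items()}

for tiles, (speedup, eff) in scaling(rates_352x240).items():
    print(f"{tiles:2d} tiles: speedup {speedup:5.2f}, efficiency {eff:.2f}")
```

At 16 tiles the speedup is about 13.6x, i.e. roughly 85% parallel efficiency, consistent with the near-linear curves on the previous slide.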

Programmable Graphics Pipeline (simplified graphics pipeline): Input -> Vertex -> Vertex Sync -> Triangle Setup -> Pixel.

Phong Shading. Per-pixel Phong-shaded polyhedron: 162 vertices, 1 light. Output rendered using the Raw simulator.

Phong Shading (64 tiles). Compared with the fixed pipeline, the reconfigurable pipeline is 33% faster with 150% better utilization.

Shadow Volumes. 4 textured triangles, 1 point light, rendered in 3 passes. Output rendered using the Raw simulator.

Shadow Volumes (64 tiles). Across the three passes, the reconfigurable pipeline is 40% faster (in cycles) than the fixed pipeline.

1020-Element Microphone Array

Case Study: Beamformer (MFLOPS). 19: C program on a 1 GHz Pentium III; 240: C program on a single 420 MHz Raw tile; 640: unoptimized StreamIt on a 420 MHz 64-tile Raw; 1,430: optimized StreamIt on a 420 MHz 16-tile Raw.

The Raw Experience: insights into the design
- Raw architecture
- Raw parallelizing compiler
- StreamIt language and compiler

Scalability Problems in Wide-Issue Processors: wide fetch (16 inst), control, bypass network, unified load/store queue.

Area and Frequency Scalability Problems. Centralized structures such as the bypass network grow superlinearly with issue width N (~N^3 and ~N^2 area, ~N wire length).

Operand Routing Is Global. Every functional unit result (e.g., from + or >>) must traverse the global bypass network.

Idea: Make Operand Routing Local

Idea: Exploit Locality

Replace the Crossbar with a Point-to-Point Network

Replace the Crossbar with a Point-to-Point Network (the functional units, e.g. + and >>, now exchange operands over the network).

Operand Transport Latency

Placement          Crossbar   Point-to-Point Network
Non-local          ~N         ~N^(1/2)
Locality-driven    ~N         ~1
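A toy model of the scaling in the table above, under the assumption that a centralized crossbar's latency grows linearly with N while a 2-D mesh pays on the order of sqrt(N) hops for random (non-local) placement and about one hop once placement exploits locality:

```python
import math

def crossbar_latency(n):
    # Centralized crossbar: wire/mux delay grows ~N.
    return n

def mesh_latency_nonlocal(n):
    # sqrt(N) x sqrt(N) mesh, random placement: ~sqrt(N) hops on average.
    return math.sqrt(n)

def mesh_latency_local(n):
    # Locality-driven placement: communicating operations sit on
    # neighboring tiles, so latency stays ~1 hop regardless of N.
    return 1

for n in (16, 64, 256):
    print(f"N={n:3d}: crossbar ~{crossbar_latency(n)}, "
          f"mesh non-local ~{mesh_latency_nonlocal(n):.1f}, "
          f"mesh local ~{mesh_latency_local(n)}")
```

The gap widens quickly: at N = 256 the crossbar model costs 256 units, the non-local mesh 16, and the locality-driven mesh stays constant.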

Distribute the Register File

More Scalability Problems. With the operand network now SCALABLE, the wide fetch (16 inst), control, and unified load/store queue remain.

More Scalability Problems: wide fetch (16 inst), control, unified load/store queue.

Distribute Everything: wide fetch (16 inst), control, unified load/store queue.

Tiled Processor


Tiled Processor (Taylor PhD 2007). Fast inter-tile communication through a point-to-point pipelined scalar operand network (SON); easy to scale, for the same reasons as multicores.

Raw Compute Processor Internals (pipeline diagram: network-mapped registers r24-r27; pipeline stages IF, D, A, TL, M1, M2, F, P, U, E).

Tile-Tile Communication. Sending tile: add $25,$1,$2 (writing $25 injects the result into the network). Switches: route $P->$E, then route $W->$P. Receiving tile: sub $20,$1,$25 (reading $25 pulls the operand off the network).
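The register-mapped send/receive on the slide can be sketched as follows. This is a toy model: a Python deque stands in for the inter-tile operand FIFO, whereas on real Raw the static switches perform the routing.

```python
from collections import deque

# One operand FIFO standing in for the routed path between two tiles.
link = deque()

def tile0(r1=5, r2=7):
    r25 = r1 + r2        # add $25,$1,$2: result is network-bound
    link.append(r25)     # switch: route $P->$E (processor port to east)

def tile1(r1=100):
    r25 = link.popleft() # switch: route $W->$P (west port to processor)
    r20 = r1 - r25       # sub $20,$1,$25: operand came off the network
    return r20

tile0()
print(tile1())  # 100 - (5 + 7) = 88
```

The key point the slide makes is that communication costs no extra instructions: the add and sub are ordinary ALU operations whose register names happen to address the network.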

Why Communication Is Expensive on Multicores. Between multiprocessor node 1 and node 2, the transport cost comprises: send occupancy, send overhead, send latency; then receive latency, receive overhead, receive occupancy.

Why Communication Is Expensive on Multicores (send side, node 1). Send occupancy: destination node name, sequence number, value, launch sequence. Send latency: commit latency, network injection.

Why Communication Is Expensive on Multicores (receive side, node 2): receive sequence, injection cost, receive occupancy, receive latency.

A Figure of Merit for SONs: the 5-tuple <SO, SL, NHL, RL, RO>: send occupancy, send latency, network hop latency, receive latency, receive occupancy. Tip: the ordering follows the timing of a message from sender to receiver.

The Interesting Region
- Scalable multiprocessor (on-chip): <2, 14, 3, 14, 4>
- Raw SON (scalable): <0, 0, 1, 2, 0>
- Superscalar (not scalable): <0, 0, 0, 0, 0>
Where is Cell in this space?
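A simple additive reading of the 5-tuple makes the region concrete: assuming the end-to-end cost of one message over `hops` hops is SO + SL + NHL*hops + RL + RO (an assumption of this sketch; the slide only defines the tuple, not a cost formula), the three design points compare as follows.

```python
# End-to-end message cost under the SON 5-tuple <SO, SL, NHL, RL, RO>:
# send occupancy, send latency, network hop latency, receive latency,
# receive occupancy.
def son_cost(tup, hops):
    so, sl, nhl, rl, ro = tup
    return so + sl + nhl * hops + rl + ro

designs = {
    "scalable multiprocessor": (2, 14, 3, 14, 4),
    "Raw SON":                 (0, 0, 1, 2, 0),
    "superscalar bypass":      (0, 0, 0, 0, 0),
}
for name, tup in designs.items():
    print(f"{name}: {son_cost(tup, hops=4)} cycles for a 4-hop message")
```

Under this model a 4-hop message costs 46 cycles on the scalable multiprocessor but only 6 on the Raw SON, while the superscalar's bypass is free yet cannot scale to many hops at all.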

The Raw Experience: insights into the design
- Raw architecture
- Raw parallelizing compiler

Raw Parallelizing Compiler (Lee PhD 2005). Sequential program -> compiler -> fine-grained, orchestrated parallel program. The compiler handles data distribution, instruction distribution, and coordination (communication, control flow).

Data Distribution. load r1 <- addr; add r1 <- r1, 1. With memory distributed across the tiles' banks (M M M M), which bank does addr map to?

Instruction Distribution (one basic block):

seed.0=seed
pval1=seed.0*3.0
pval0=pval1+2.0
tmp0.1=pval0/2.0
tmp0=tmp0.1
v1.2=v1
pval2=seed.0*v1.2
tmp1.3=pval2+2.0
tmp1=tmp1.3
v2.4=v2
pval3=seed.0*v2.4
tmp2.5=pval3+2.0
tmp2=tmp2.5
pval5=seed.0*6.0
pval4=pval5+2.0
tmp3.6=pval4/3.0
tmp3=tmp3.6
pval7=tmp1.3+tmp2.5
pval6=tmp1.3-tmp2.5
v1.8=pval7*3.0
v2.7=pval6*5.0
v0.9=tmp0.1-v1.8
v0=v0.9
v1=v1.8
v2=v2.7
v3.10=tmp3.6-v2.7
v3=v3.10

Clustering: Parallelism vs. Communication. The basic block's instructions (the same dataflow graph as the previous slide) are grouped into clusters, trading available parallelism against inter-cluster communication.

Adjusting Granularity: Load Balancing. Clusters are merged or split (same dataflow graph, repartitioned) to balance the computational load across tiles.

Placement. The clusters are assigned to specific tiles (the dataflow graph from the previous slides, now mapped onto the tile grid).

Communication Coordination. For each pair of communicating tiles, the compiler generates matching send and receive instructions on the compute processors and route instructions on the intervening switches.

Instruction Scheduling. The per-tile compute streams (with send/recv instructions) and switch streams (with route(...) instructions) are scheduled together: inter-tile cycle scheduling schedules communication and can guarantee deadlock freedom.

Final Code Representation. The final code interleaves compute instructions with send/recv and switch route(...) instructions for each tile: the fully scheduled version of the running example.

Control Coordination

Asynchronous Global Branching. Each tile executes its own br x; the branch condition x is broadcast over the network, so the tiles branch independently rather than in lockstep.
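The branching scheme above can be sketched as a small simulation. The per-tile inbox queues and the broadcast helper are assumptions of this toy model; on real Raw the condition travels over the scalar operand network.

```python
from collections import deque

N_TILES = 4
# One inbox per tile, standing in for the network delivering the
# branch condition to each tile at its own pace.
inboxes = [deque() for _ in range(N_TILES)]

def compute_and_broadcast(x):
    """One tile computes the branch condition x and sends it to all tiles."""
    for q in inboxes:
        q.append(x)

def tile_step(tile_id):
    """Each tile receives x asynchronously and executes its local 'br x'."""
    x = inboxes[tile_id].popleft()
    return f"tile {tile_id}: {'taken' if x else 'fall through'}"

compute_and_broadcast(True)
print([tile_step(i) for i in range(N_TILES)])
```

Because each tile consumes the condition from its own queue, no tile waits for a global synchronization point: the branch is coordinated but asynchronous.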

Summary. Tiled microprocessors incorporate the best elements of superscalars and multiprocessors:

                              Superscalar   Multicore   Tiled Processor with SON
PE-PE communication           Free          Expensive   Cheap
Exploitation of parallelism   Implicit      Explicit    Both
Clean semantics               Yes           No          Yes
Scalable                      No            Yes         Yes
Power efficient               No            Yes         Yes

Raw Project Contributors: Anant Agarwal, Saman Amarasinghe, Jonathan Babb, Rajeev Barua, Ian Bratt, Jonathan Eastep, Matt Frank, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, Theo Konstantakopoulos, Walter Lee, Jason Miller, James Psota, Arvind Saraf, Vivek Sarkar, Nathan Shnidman, Volker Strumpen, Michael Taylor, Elliot Waingold, David Wentzlaff.