Embedded Hardware (1) Kai Huang

1 Embedded Hardware (1) Kai Huang

2 News: PS4 and Xbox One are Coming /9/203


3 The Hardware: PS4 and Xbox One
Both: semi-custom x86 AMD APU, 28 nm, 8-core Jaguar CPU
CPU frequency: 1.6 GHz (PS4); 1.75 GHz (Xbox One)
GPU: 18 CUs, 1152 shaders at 800 MHz (PS4); 12 CUs, 768 shaders at 853 MHz (Xbox One)
Memory: 8 GB 5500 MHz GDDR5 (PS4); 8 GB 2133 MHz DDR3 (Xbox One)
Memory bandwidth: 176 GB/sec (PS4); 68.3 GB/sec (Xbox One)
Embedded SRAM: N/A (PS4); on-chip SRAM at 204 GB/sec (Xbox One)


5 Dataflow MoC Recap
Kahn Process Network
Synchronous DataFlow
[Figure: example networks built from processes f, g, and h]
kai.huang@t

6 Outline
Processor
Memory
I/O

7 Outline
Processor
  o Single-cycle datapath
  o Pipeline datapath
  o Processor types
Memory
I/O

8 Y-Chart Methodology
[Figure: Y-chart with Architecture model, Applications model, Mapping, Performance Evaluation, and Performance Numbers]


9 Embedded System Hardware
Embedded system hardware is frequently used in a loop ("hardware in a loop"):
[Figure: loop through the environment with sensors, sample-and-hold, A/D converter, information processing, display, D/A converter, and actuators; the figure marks the part covered by this course]

10 The Big Picture
Since 1946 all computers have had 5 components: Control, Datapath, Memory, Input, Output (Control and Datapath together form the Processor)
Control unit coordinates various actions: input, output, processing
Datapath: the part of the central processing unit (CPU) that does the actual computations
Memory stores information: instructions, data
Input unit accepts information: human operators, electromechanical devices, other computers
Output unit sends results of processing: to a monitor display, to a printer

11 Datapath Components
Combinational Elements
  o ALU, Adder
  o Immediate extender
  o Multiplexers
Storage Elements
  o Instruction memory
  o Data memory
  o PC register
  o Register file
Clocking methodology
  o Timing of reads and writes
[Figure: datapath components: PC, instruction memory, register file, extender, ALU with ALU control, and data memory]

12 ALU: Arithmetic Logic Unit
1. ALU is a digital circuit that performs Arithmetic (Add, Sub, ...) and Logical (AND, OR, NOT) operations.
2. John von Neumann proposed the ALU in 1945 when he was working on EDVAC.
[Figure: 1-bit ALU]

13 Multifunction ALU
Shift Operation: None = 00, SLL = 01, SRL = 10, SRA = 11
Arithmetic Operation: ADD = 0, SUB = 1
Logical Operation: AND = 00, OR = 01, NOR = 10, XOR = 11
ALU Selection: Shift = 00, SLT = 01, Arith = 10, Logic = 11
SLT: ALU does a SUB and checks the sign and overflow
[Figure: multifunction ALU with a shifter (5-bit shift amount), an adder producing sign and overflow, and a logic unit; a multiplexer selects the ALU result, which also drives the zero flag]
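The behavior of such a multifunction ALU can be sketched in a few lines of Python. This is a minimal illustration, not the slide's circuit: the 32-bit width, the operation names used as strings, and the choice to shift operand `a` by the low 5 bits of `b` are assumptions. SLT is computed as the slide describes, from the sign and overflow of a subtraction.

```python
MASK = 0xFFFFFFFF  # assumption: 32-bit datapath
SIGN = 0x80000000


def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK
    return x - (1 << 32) if x & SIGN else x


def alu(op, a, b):
    """Return (result, zero, overflow) for one ALU operation."""
    a &= MASK
    b &= MASK
    overflow = False
    if op in ("ADD", "SUB", "SLT"):
        # SUB and SLT reuse the adder: a - b == a + (~b) + 1
        operand = b if op == "ADD" else (~b & MASK)
        carry_in = 0 if op == "ADD" else 1
        total = (a + operand + carry_in) & MASK
        # Signed overflow: both adder inputs have a sign the sum lacks
        overflow = bool((a ^ total) & (operand ^ total) & SIGN)
        if op == "SLT":
            # a < b exactly when sign and overflow of (a - b) differ
            result = ((total >> 31) & 1) ^ int(overflow)
            overflow = False
        else:
            result = total
    elif op == "AND":
        result = a & b
    elif op == "OR":
        result = a | b
    elif op == "NOR":
        result = ~(a | b) & MASK
    elif op == "XOR":
        result = a ^ b
    elif op in ("SLL", "SRL", "SRA"):
        amount = b & 0x1F  # assumption: shift amount is the low 5 bits of b
        if op == "SLL":
            result = (a << amount) & MASK
        elif op == "SRL":
            result = a >> amount
        else:
            result = (to_signed(a) >> amount) & MASK
    else:
        raise ValueError(f"unknown ALU operation: {op}")
    return result, result == 0, overflow
```

For example, `alu("SLT", 0x80000000, 1)` returns a result of 1 even though the internal subtraction overflows, which is why the sign bit alone is not enough.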

14 Single-Cycle Datapath (with Control Signal)
[Figure: single-cycle datapath with PC, next-PC logic (jump or branch target address), instruction memory, register file, extender, ALU, and data memory. Main Control decodes Op and produces RegDst, RegWrite, ExtOp, ALUSrc, ALUOp, MemRead, MemWrite, MemtoReg, and J, Beq, Bne; ALU Ctrl decodes func and ALUOp into ALUCtrl; PCSrc selects the next PC]

15 Register Transfer Level (RTL)
RTL is a description of data flow between registers
RTL gives a meaning to the instructions
All instructions are fetched from memory at address PC
Instruction: RTL Description
ADD: Reg(Rd) ← Reg(Rs) + Reg(Rt); PC ← PC + 4
SUB: Reg(Rd) ← Reg(Rs) - Reg(Rt); PC ← PC + 4
ORI: Reg(Rt) ← Reg(Rs) | zero_ext(imm16); PC ← PC + 4
LW: Reg(Rt) ← MEM[Reg(Rs) + sign_ext(imm16)]; PC ← PC + 4
SW: MEM[Reg(Rs) + sign_ext(imm16)] ← Reg(Rt); PC ← PC + 4
BEQ: if (Reg(Rs) == Reg(Rt)) PC ← PC + 4 + 4 × sign_ext(imm16) else PC ← PC + 4
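The RTL descriptions translate almost line by line into an executable model. The sketch below is illustrative only: the tuple encoding of instructions, the dictionary used for memory, and the function name `step` are assumptions, and the BEQ target follows the usual MIPS convention PC + 4 + 4 × sign_ext(imm16), which is assumed here.

```python
MASK = 0xFFFFFFFF  # assumption: 32-bit registers


def sign_ext(imm16):
    """Sign-extend a 16-bit immediate to a Python integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def zero_ext(imm16):
    """Zero-extend a 16-bit immediate."""
    return imm16 & 0xFFFF


def step(instr, reg, mem, pc):
    """Apply one instruction's register transfers and return the next PC."""
    op = instr[0]
    if op == "ADD":
        _, rd, rs, rt = instr
        reg[rd] = (reg[rs] + reg[rt]) & MASK
    elif op == "SUB":
        _, rd, rs, rt = instr
        reg[rd] = (reg[rs] - reg[rt]) & MASK
    elif op == "ORI":
        _, rt, rs, imm = instr
        reg[rt] = reg[rs] | zero_ext(imm)
    elif op == "LW":
        _, rt, rs, imm = instr
        reg[rt] = mem[reg[rs] + sign_ext(imm)]
    elif op == "SW":
        _, rt, rs, imm = instr
        mem[reg[rs] + sign_ext(imm)] = reg[rt]
    elif op == "BEQ":
        _, rs, rt, imm = instr
        if reg[rs] == reg[rt]:
            return pc + 4 + 4 * sign_ext(imm)  # taken branch
    else:
        raise ValueError(f"unknown instruction: {op}")
    return pc + 4  # sequential execution
```

A hardwired zero register is not modeled here; the sketch only mirrors the transfers listed on the slide.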

16 Instructions are Executed in Steps
R-type
  Fetch instruction: Instruction ← MEM[PC]
  Fetch operands: data1 ← Reg(Rs), data2 ← Reg(Rt)
  Execute operation: ALU_result ← func(data1, data2)
  Write ALU result: Reg(Rd) ← ALU_result
  Next PC address: PC ← PC + 4
I-type
  Fetch instruction: Instruction ← MEM[PC]
  Fetch operands: data1 ← Reg(Rs), data2 ← Extend(imm16)
  Execute operation: ALU_result ← op(data1, data2)
  Write ALU result: Reg(Rt) ← ALU_result
  Next PC address: PC ← PC + 4
BEQ
  Fetch instruction: Instruction ← MEM[PC]
  Fetch operands: data1 ← Reg(Rs), data2 ← Reg(Rt)
  Equality: zero ← subtract(data1, data2)
  Branch: if (zero) PC ← PC + 4 + 4 × sign_ext(imm16) else PC ← PC + 4

17 Instruction Execution Examples
LW: lw Rt,C(Rs)
  Fetch instruction: Instruction ← MEM[PC]
  Fetch base register: base ← Reg(Rs)
  Calculate address: address ← base + sign_extend(imm16)
  Read memory: data ← MEM[address]
  Write register Rt: Reg(Rt) ← data
  Next PC address: PC ← PC + 4
SW: sw Rt,C(Rs)
  Fetch instruction: Instruction ← MEM[PC]
  Fetch registers: base ← Reg(Rs), data ← Reg(Rt)
  Calculate address: address ← base + sign_extend(imm16)
  Write memory: MEM[address] ← data
  Next PC address: PC ← PC + 4
Jump: j C
  Fetch instruction: Instruction ← MEM[PC]
  Target PC address: target ← PC[31:28], imm26, 00 (concatenation)
  Jump: PC ← target
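The jump target is a bit-level concatenation rather than an addition, which is easy to get wrong. A minimal sketch, assuming a 32-bit PC, a 26-bit immediate, and that the caller passes the already incremented PC (the function name `jump_target` is hypothetical):

```python
def jump_target(incremented_pc, imm26):
    """Concatenate PC[31:28], the 26-bit immediate, and two zero bits."""
    upper4 = incremented_pc & 0xF0000000      # PC[31:28] kept in place
    middle = (imm26 & 0x03FFFFFF) << 2        # imm26 followed by '00'
    return upper4 | middle
```

Because only the upper 4 bits of the PC survive, a jump can reach any word-aligned address inside the current 256 MB region but not outside it.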

18 Execution of Load Instruction: lw Rt,C(Rs)
ExtOp = sign to sign-extend Immediate16 to 32 bits
RegDst = 0 selects Rt as destination register
MemRead = 1 to read data memory
ALUSrc = 1 selects extended immediate as second ALU input
ALUCtrl = ADD to calculate data memory address as Reg(Rs) + sign-extend(imm16)
MemtoReg = 1 places the data read from memory on BusW
RegWrite = 1 to write the memory data on BusW to register Rt
[Figure: single-cycle datapath annotated with the signal values for lw, including MemWrite = 0]

19 Execution of Store Instruction: sw Rt,C(Rs)
ExtOp = sign to sign-extend Immediate16 to 32 bits
RegDst = x because no destination register
MemWrite = 1 to write data memory
ALUSrc = 1 to select the extended immediate as second ALU input
ALUCtrl = ADD to calculate data memory address as Reg(Rs) + sign-extend(imm16)
MemtoReg = x because we don't care what data is placed on BusW
RegWrite = 0 because no register is written by the store instruction
[Figure: single-cycle datapath annotated with the signal values for sw, including MemRead = 0]

20 Execution of Jump Instruction: j C
J = 1 selects imm26 as jump target address
Upper 4 bits are from the incremented PC
PCSrc = 1 to select jump target address
MemRead, MemWrite & RegWrite are 0
We don't care about RegDst, ExtOp, ALUSrc, ALUCtrl, and MemtoReg
[Figure: single-cycle datapath annotated with the signal values for j]
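The control settings for lw, sw, and j can be collected into one table, which makes it easy to check which state each instruction changes. The sketch below is an illustration under assumptions: `None` stands for a don't-care (x), the dictionary layout and the helper `state_written` are hypothetical, and `PCSrc = 0` for lw and sw is assumed to mean "take PC + 4".

```python
# None marks a don't-care ("x") signal.
CONTROL = {
    "lw": {"RegDst": 0, "RegWrite": 1, "ExtOp": "sign", "ALUSrc": 1,
           "ALUCtrl": "ADD", "MemRead": 1, "MemWrite": 0, "MemtoReg": 1,
           "PCSrc": 0},  # PCSrc = 0 is an assumption (sequential PC)
    "sw": {"RegDst": None, "RegWrite": 0, "ExtOp": "sign", "ALUSrc": 1,
           "ALUCtrl": "ADD", "MemRead": 0, "MemWrite": 1, "MemtoReg": None,
           "PCSrc": 0},  # PCSrc = 0 is an assumption (sequential PC)
    "j": {"RegDst": None, "RegWrite": 0, "ExtOp": None, "ALUSrc": None,
          "ALUCtrl": None, "MemRead": 0, "MemWrite": 0, "MemtoReg": None,
          "PCSrc": 1},
}


def state_written(op):
    """List the state elements an instruction updates, from its control word."""
    signals = CONTROL[op]
    written = ["PC"]  # every instruction updates the PC
    if signals["RegWrite"]:
        written.append("register file")
    if signals["MemWrite"]:
        written.append("data memory")
    return written
```

Note that the write enables (RegWrite, MemWrite) and MemRead are never don't-cares: leaving them undefined could corrupt state, whereas a multiplexer select such as MemtoReg is harmless when its output is unused.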

21 Drawbacks of Single Cycle Processor
Long cycle time
  o All instructions take as much time as the slowest
Arithmetic: Instruction Fetch, Reg Read, ALU, Reg Write
Load: Instruction Fetch, Reg Read, ALU, Memory Read, Reg Write (longest delay)
Store: Instruction Fetch, Reg Read, ALU, Memory Write
Branch: Instruction Fetch, Reg Read, ALU
Jump: Instruction Fetch, Decode
Alternative Solution: Multicycle implementation
  o Break down instruction execution into multiple cycles

22 Single-Cycle vs. Multicycle
[Figure: two timelines comparing time needed with time allotted for Instr 1 to Instr 4. With the single-cycle clock every instruction is allotted the same long cycle. With the multicycle clock the instructions take 3, 5, 3, and 4 cycles, and the difference is the time saved]
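The time saved in this comparison can be worked out directly from the cycle counts 3, 5, 3, and 4. The sketch below assumes, for illustration only, that every multicycle step takes one time unit and that the single-cycle clock period equals the longest instruction (5 units); real designs rarely split this evenly.

```python
def total_times(steps_per_instr, step_time=1):
    """Return (single_cycle_total, multicycle_total) in time units.

    Assumptions: each multicycle step lasts step_time, and the
    single-cycle period must cover the longest instruction.
    """
    single_period = max(steps_per_instr) * step_time
    single_total = single_period * len(steps_per_instr)
    multi_total = sum(steps_per_instr) * step_time
    return single_total, multi_total


single, multi = total_times([3, 5, 3, 4])
saved = single - multi
```

Under these assumptions the four instructions need 20 units on the single-cycle design and 15 on the multicycle one, so 5 units are saved.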

23 Outline
Processor
  o Single-cycle datapath
  o Pipeline datapath
  o Processor types
Memory
I/O

24 Single-Cycle Datapath
Shown below is the single-cycle datapath
How to pipeline this single-cycle datapath?
Answer: Introduce registers at the end of each stage
IF = Instruction Fetch
ID = Decode and Register Fetch
EX = Execute and Calculate Address
MEM = Memory Access
WB = Write Back
[Figure: single-cycle datapath divided into the five stages]

25 Pipelined Datapath
Pipeline registers, in green, separate each pipeline stage
Pipeline registers are labeled by the stages they separate: IF/ID, ID/EX, EX/MEM, MEM/WB
Is there a problem with the register destination address?
[Figure: pipelined datapath with stages IF = Instruction Fetch, ID = Decode, EX = Execute, MEM = Memory, WB]

26 Corrected Pipelined Datapath
Destination register number should come from MEM/WB
  o Along with the data during the write back stage
Destination register number is passed from ID to WB stage
[Figure: pipelined datapath in which Rd travels through ID/EX, EX/MEM, and MEM/WB before reaching the register file]


27 Graphically Representing Pipelines
Multiple instruction execution over multiple clock cycles
  o Instructions are listed in execution order from top to bottom
  o Clock cycles move from left to right
  o Figure shows the use of resources at each stage and each cycle
Time (in cycles)   CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8
lw  $6, 8($5)      IM   Reg  ALU  DM   Reg
add $1, $2, $3          IM   Reg  ALU  DM   Reg
ori $4, $3, 7                IM   Reg  ALU  DM   Reg
sub $5, $2, $3                    IM   Reg  ALU  DM   Reg
sw  $2, 0($3)                          IM   Reg  ALU  DM
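A diagram of this kind can be generated mechanically: instruction i enters the pipeline in cycle i and moves one stage per cycle. The sketch below assumes an ideal pipeline with no stalls or hazards; the function names `pipeline_chart` and `render` are hypothetical.

```python
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]


def pipeline_chart(instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage used in cycle c."""
    total_cycles = len(instructions) + len(stages) - 1
    rows = []
    for index in range(len(instructions)):
        row = [""] * total_cycles
        for offset, name in enumerate(stages):
            row[index + offset] = name  # one stage later each cycle
        rows.append(row)
    return rows


def render(instructions, stages=STAGES):
    """Format the chart as aligned plain text."""
    rows = pipeline_chart(instructions, stages)
    width = max(len(text) for text in instructions) + 2
    header = " " * width + "".join(
        f"{'CC' + str(c + 1):<6}" for c in range(len(rows[0])))
    lines = [header.rstrip()]
    for text, row in zip(instructions, rows):
        cells = "".join(f"{cell:<6}" for cell in row)
        lines.append((text.ljust(width) + cells).rstrip())
    return "\n".join(lines)


program = ["lw $6, 8($5)", "add $1, $2, $3", "ori $4, $3, 7",
           "sub $5, $2, $3", "sw $2, 0($3)"]
chart_text = render(program)
```

Five instructions on five stages occupy nine cycles in total, and in the fifth cycle every stage is busy with a different instruction.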

28 Instruction-Time Diagram
Diagram shows:
  o Which instruction occupies what stage at each clock cycle
Instruction execution is pipelined over the 5 stages
Up to five instructions can be in execution during a single cycle
ALU instructions skip the MEM stage. Store instructions skip the WB stage
Instructions in order, with the stages each one uses (time axis CC1 to CC9):
lw $7, 8($3): IF ID EX MEM WB
lw $6, 8($5): IF ID EX MEM WB
ori $4, $3, 7: IF ID EX WB
sub $5, $2, $3: IF ID EX WB
sw $2, 0($3): IF ID EX MEM

29 Single-Cycle vs Pipelined Performance
Consider a 5-stage instruction execution in which
  o Instruction fetch = ALU operation = Data memory access = 200 ps
  o Register read = register write = 150 ps
What is the single-cycle non-pipelined time?
What is the pipelined cycle time?
What is the speedup factor for pipelined execution?
Solution
Non-pipelined cycle = 200 + 150 + 200 + 200 + 150 = 900 ps
[Figure: back-to-back instructions, each passing through IF, Reg, ALU, MEM, Reg within one 900 ps cycle]

30 Single-Cycle versus Pipelined (cont'd)
Pipelined cycle time = max(200, 150) = 200 ps
CPI for pipelined execution = 1
  o One instruction completes each cycle (ignoring pipeline fill)
Speedup of pipelined execution = 900 ps / 200 ps = 4.5
  o Instruction count and CPI are equal in both cases
Speedup factor is less than 5 (number of pipeline stages)
  o Because the pipeline stages are not balanced
[Figure: overlapped instructions, each stage IF, Reg, ALU, MEM, Reg occupying one 200 ps cycle]
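The arithmetic behind the speedup can be checked in a few lines. The sketch below assumes register read and write each take 150 ps, the value consistent with the 900 ps total, and adds a hypothetical helper `speedup_for` that includes pipeline fill for a finite number of instructions.

```python
# Stage delays in picoseconds; the 150 ps register delays are an assumption
# chosen to match the stated 900 ps non-pipelined cycle.
STAGE_PS = {"IF": 200, "Reg read": 150, "ALU": 200, "MEM": 200, "Reg write": 150}


def single_cycle_time(stage_ps):
    """Non-pipelined cycle: one instruction passes through every stage."""
    return sum(stage_ps.values())


def pipelined_cycle_time(stage_ps):
    """Pipelined cycle: limited by the slowest stage."""
    return max(stage_ps.values())


def steady_state_speedup(stage_ps):
    """Speedup when pipeline fill is ignored (CPI = 1 in both designs)."""
    return single_cycle_time(stage_ps) / pipelined_cycle_time(stage_ps)


def speedup_for(n, stage_ps):
    """Speedup for n instructions, counting the cycles needed to fill the pipe."""
    k = len(stage_ps)
    single_total = n * single_cycle_time(stage_ps)
    pipelined_total = (k + n - 1) * pipelined_cycle_time(stage_ps)
    return single_total / pipelined_total
```

With perfectly balanced stages of 180 ps each, the same 900 ps of work would give a speedup of exactly 5, which is why the unbalanced stages cap the result at 4.5.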

31 Summary between Datapaths
Clock Cycle Time
  o Single Cycle: Long (long enough for the slowest instruction)
  o Multiple Cycle: Short (long enough for the slowest instruction step)
  o Pipeline: Short (long enough for the slowest pipeline stage)
Cycles Per Instruction
  o Single Cycle: 1 clock cycle per instruction (by definition)
  o Multiple Cycle: Variable number of clock cycles per instruction
  o Pipeline: Fixed number of clock cycles per instruction, one for each pipeline stage
# instructions executing concurrently
  o Single Cycle: 1
  o Multiple Cycle: 1
  o Pipeline: # pipeline stages
Duplicate Hardware
  o Single Cycle: Yes, since we can use a functional unit (FU) for at most one subtask per instruction
  o Multiple Cycle: No, since the instruction generally is broken into single-FU steps
  o Pipeline: Yes, to avoid restriction on pipeline execution
Extra Register
  o Single Cycle: No
  o Multiple Cycle: Yes, to hold results for the next step
  o Pipeline: Yes, to provide results for the pipeline stage
Performance
  o Single Cycle: Baseline
  o Multiple Cycle: Faster, but not too fast
  o Pipeline: Fastest, if pipeline is balanced

32 Outline
Processor
  o Single-cycle datapath
  o Pipeline datapath
  o Processor types
Memory
I/O

33 General Purpose Processors (GPP)
High performance
  o Highly optimized circuits and technology
  o Use of parallelism
      superscalar: dynamic scheduling of instructions
      super-pipelining: instruction pipelining, branch prediction, speculation
  o Complex memory hierarchy
Not suited for real-time applications
  o Execution times are highly unpredictable because of intensive resource sharing and dynamic decisions
Properties
  o Good average performance for large application mix
  o High power consumption

34 GPP + Memory (I): von Neumann Architecture
[Figure: CPU with PC, IR, and N general-purpose registers (GPR), connected by address and data lines to a single memory that holds both data and program]
1. PC := initial value
2. Fetch => IR := Mem[PC]
3. Decode IR
4. Execute
5. PC := PC + 1
6. goto 2
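The fetch, decode, execute cycle can be sketched as a small interpreter in which program and data share one memory, which is the defining property of the von Neumann architecture. Everything specific here is an assumption for illustration: the tuple instruction format, the LOAD/STORE/ADD/HALT operations, the start address 0, and the PC increment of 1 per instruction.

```python
def run(memory, pc=0, num_regs=4):
    """Execute instructions from a shared program/data memory until HALT."""
    gpr = [0] * num_regs              # general-purpose registers
    while True:
        ir = memory[pc]               # fetch: IR := Mem[PC]
        op = ir[0]                    # decode
        if op == "HALT":
            break
        if op == "LOAD":              # execute
            gpr[ir[1]] = memory[ir[2]]
        elif op == "STORE":
            memory[ir[2]] = gpr[ir[1]]
        elif op == "ADD":
            gpr[ir[1]] = gpr[ir[2]] + gpr[ir[3]]
        else:
            raise ValueError(f"unknown operation: {op}")
        pc += 1                       # PC := PC + 1, then fetch again
    return gpr


# Addresses 0-4 hold the program, addresses 5-7 hold data.
memory = [
    ("LOAD", 0, 5),
    ("LOAD", 1, 6),
    ("ADD", 2, 0, 1),
    ("STORE", 2, 7),
    ("HALT",),
    20, 22, 0,
]
registers = run(memory)
```

Since instructions and data sit in the same list, a STORE to a low address would overwrite the program itself, something a Harvard machine with separate memories rules out.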

35 GPP + Memory (II): Harvard Architecture
[Figure: CPU with PC, IR, and GPR, connected by separate address and data lines to a Program Memory and a Data Memory]

36 Intel Lynnfield (Core i5/i7)
4-8 Mbytes L3 Cache
4 cores, 8 threads

37 Simple GPP: Xilinx MicroBlaze
IOPB: Instruction-side On-chip Peripheral Bus
IXCL_M: Instruction-side Xilinx Cache Link Master
IXCL_S: Instruction-side Xilinx Cache Link Slave
ILMB: Instruction-side Local Memory Bus
DOPB: Data-side On-chip Peripheral Bus
DXCL_M: Data-side Xilinx Cache Link Master
DXCL_S: Data-side Xilinx Cache Link Slave
DLMB: Data-side Local Memory Bus
MFSL: Master Fast Simple Link
SFSL: Slave Fast Simple Link

38 Embedded Processors: RISC vs. CISC
Complex instruction set: CISC (e.g. x86)
  o Map complexity of common instructions directly in machine code
  o Complex instructions can consist of several simple instructions
  o Can lead to subtle timing issues
  o Used in general purpose computing
Reduced instruction set: RISC (e.g. ARM, Acorn RISC Machine)
  o Only simple machine instructions; compiler has to map high-level language onto simple instructions
  o All instructions take the same time
  o Used in embedded systems (real-time hardware, smart phones, ...)

39 Application Specific Instruction Set Processors
Micro Controllers (MicroCtrl)
  o Used in Control Dominated Systems
  o Reactive systems with event driven behavior
  o Application examples: cars, consumer electronics (washing machines, dishwashers etc.)
Digital Signal Processors (DSPs)
  o Used in Data Dominated Systems
  o Streaming-oriented systems with mostly periodic behavior
  o Application examples: signal processing
Very Long Instruction Word Processors (VLIWs)
  o Used in Data Dominated Systems
  o Application examples: image processing


40 ASIP: Micro Controllers
Control-dominant applications
  o Supports process scheduling and synchronization
  o Preemption (interrupt), context switch
  o Short latency times
Low power consumption
Peripheral units often integrated
Suited for real-time applications
Philips 83 C552: 8 bit-8051 based microcontroller


41 ASIP: Digital Signal Processors
Optimized for data-flow applications
Suited for simple control flow
Parallel hardware units
Specialized instruction set
High data throughput
Zero-overhead loops
Specialized memory
Suited for real-time applications

42 Very Long Instruction Word Processors
Key idea: detection of possible parallelism to be done by compiler, not by hardware at run-time (inefficient).
VLIW: parallel operations (instructions) encoded in one long word (instruction packet), each instruction controlling one functional unit.
VLIW processors are an example of the so-called Explicit Parallelism Instruction Computers (EPIC)

43 Philips TriMedia VLIW CPU
5 issue slots (functional units, FU), therefore up to 5 instructions can be executed in parallel


44 Application Specific Integrated Circuits (ASICs)
Custom-designed circuits necessary
  o if ultimate speed or
  o energy efficiency is the goal and
  o large numbers can be sold.
Approach suffers from
  o long design times,
  o lack of flexibility (changing standards)
  o high costs, i.e., millions of $ mask costs

45 Reconfigurable Processing Units (RPUs)
Full custom chips (HW) may be too expensive, software (SW) too slow.
Combine the speed of HW with the flexibility of SW
  o HW with programmable functions and interconnect.
  o HW (Re-)Configurable at design-time or at run-time (dynamic reconfiguration)
Field Programmable Gate Arrays (FPGAs)
  o Currently the most sophisticated and used RPUs
  o Applications
      Fast and very cheap prototyping of (MP-)SoCs
      Encryption, fast object recognition (medical and military)
      Adapting mobile phones to different standards
Very popular devices from
  o XILINX (Virtex II(Pro), Virtex 4, Virtex 5, Virtex 6, Virtex 7)
  o Altera (Cyclone, Arria, Stratix)
  o Actel and others

46 Floor-plan of VIRTEX II FPGAs
Configurable Logic Block (CLB)
Digital Clock Manager (DCM)
Input/Output Blocks (IOB)

47 System-on-Chip (SoC)

48 System Specialization
The main difference between general purpose highest volume microprocessors and embedded systems is specialization.
Specialization should respect flexibility
  o application domain specific systems shall cover a class of applications
  o some flexibility is required to account for late changes, debugging
System analysis required
  o identification of application properties which can be used for specialization
  o quantification of individual specialization effects

49 Why Implementation Alternatives?
Trade-off between Flexibility and Performance/Power Efficiency

50 Energy Efficiency
Hugo De Man, IMEC, Philips, 2007
