Design of Pipelined DSP Microprocessor MUN DSP2000

Cheng Li, Lu Xiao, Qiyao Yu, P. Gillard and R. Venkatesan
Faculty of Engineering and Applied Science
Memorial University of Newfoundland
St. John's, NF, Canada A1B 3X5
E-mail: {licheng, xiao, qiyao, venky}@engr.mun.ca, paul@cs.mun.ca

Abstract
Programmable digital signal processing (DSP) microprocessors are processors designed to perform well in digital signal processing-intensive applications. In this paper, we present the design of a simplified DSP microprocessor with a restricted instruction set, MUN DSP2000, which consists of three major components: the control unit, the datapath and the system memory. A Harvard architecture, a pipelined datapath and data forwarding techniques are used to improve the system performance and avoid hazards (data hazards, control hazards, etc.). The whole system is coded in VHDL and simulated using Synopsys CAD tools. The system performance is briefly analyzed based on the synthesis results.

Key words: DSP microprocessor, Pipeline, Multiply and Accumulate (MAC), VHDL

1. Introduction
Since commercial DSP microprocessors were first introduced in the early 1980s, DSP technology and DSP microprocessors have gained more and more significance in the current information world. Though DSP microprocessors and general-purpose microprocessors share a number of common features, they have important differences. DSP microprocessors are primarily designed for real-time, high-speed calculation applications [1]. Besides possessing many of the features of a general-purpose microprocessor, a DSP is also characterized by fast multiply-accumulate, a multiple-access memory architecture, specialized addressing modes and specialized execution control [2].

The MUN DSP2000 microprocessor is a 32-bit fixed-point prototype computational engine on which DSP applications can be built. There are 32 32-bit general-purpose registers within the CPU, which are used for most of the operations. Except that R0 is a read-only register and always stores 0 as its value, all the other registers are the same. A load and store architecture is used for the instructions except the MAC operation.
Our target is to build a DSP microprocessor that can support basic DSP operations like the DFT, FFT, digital filters, and so on. For such DSP applications, the convolution operation is widely used. In a computer system, a DSP application such as a digital filter can be achieved by multiplying the current signal value (multiply operand a) with the coefficient of the digital filter (multiply operand b) and then adding the product to the previous sum (add operand c). All this work should be finished in one step, i.e., in a single system clock cycle. Thus an efficient MAC operation (multiply two operands and then add the result to a third operand) is a key requirement for a DSP processor.

In this project, we aim at building a DSP microprocessor that supports the basic function every DSP processor should support: the multiply and accumulate (MAC) operation. Besides this DSP feature, we also include some functions that a general-purpose processor should support. The instruction set we chose is a subset of the complete instruction set of a general-purpose processor. Restricted data forwarding is realized to improve our design. Some highlighted features of MUN DSP2000 are:
- Pipelined data-path design
- MAC operation
- Separate instruction and data memory (Harvard Memory Structure)
- Fast data memory access (in one single clock cycle, two data memory elements can be fetched)
- Support for various operations such as arithmetic, logic and control functions

2. MUN DSP2000 control unit design
2.1 Instruction set design
A restricted instruction set is included in the MUN DSP2000 design. It is a subset of that of the MIPS processor [2]. The instructions are all 32 bits long. A total of 16 instructions have been implemented in the design. According to their purposes, the instructions can be divided into five groups:
1. Arithmetic operation: ADD, ADDI, SUB, SUBI, MUL, MAC
2. Branch operation: BNEZ, JP, JR
3. Logic operation: AND, OR, XOR
4. Memory operation: LW, SW, SWS
5. Other: NOP

A 32-bit instruction is encoded as follows. The most significant 6 bits represent the opcode (bit 31 to bit 26). The next 5 bits represent the first register number (bit 25 to bit 21), and the following two 5-bit fields represent the second register (bit 20 to bit 16) and the third register (bit 15 to bit 11) respectively. Not every instruction uses all three registers. The 16 least significant bits are used as the address offset. When a new instruction is fetched from the instruction memory, for example the arithmetic operation ADD R1, R2, R3, it is encoded in the instruction memory in the following format:

MSB 001011 00001 00010 00011 00000 000000 LSB
    Opcode R1    R2    R3

When this instruction comes to the controller and register file, the controller interprets the opcode bits and finds that this is an add operation. The controller outputs the corresponding control signals that set up the correct path for this operation. The register file outputs the stored values of the corresponding registers. These data are used as the sources for the specified operation.

2.2 System control design
The control unit is the key component for MUN DSP2000 to perform properly. The control unit performs the instruction validation and instruction-decoding tasks during the instruction decoding (ID) pipeline stage. System control is achieved through the controlled outputs using a series of different multiplexers. In the pipeline structure, the control signals generated by the control unit are propagated down the pipeline synchronously with the system clock through pipeline registers.
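
The field layout of Section 2.1 can be illustrated with a short Python sketch. This is only an illustration of the bit positions given in the text, not the VHDL used in the design; the function names `encode_r` and `decode` are hypothetical, and the opcode value 0b001011 for ADD is taken from the example encoding above.

```python
# Illustrative sketch of the 32-bit instruction layout from Section 2.1.
# encode_r/decode are hypothetical helper names, not part of the design.

def encode_r(opcode, r1, r2, r3):
    """Pack opcode (bits 31-26) and three 5-bit register numbers
    (bits 25-21, 20-16, 15-11) into one 32-bit word."""
    return ((opcode & 0x3F) << 26) | ((r1 & 0x1F) << 21) \
         | ((r2 & 0x1F) << 16) | ((r3 & 0x1F) << 11)

def decode(word):
    """Split a 32-bit word into the fields named in the text.
    The 16-bit offset overlaps the third register field, so an
    instruction uses one or the other, not both."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "r1": (word >> 21) & 0x1F,
        "r2": (word >> 16) & 0x1F,
        "r3": (word >> 11) & 0x1F,
        "offset": word & 0xFFFF,
    }

# ADD R1, R2, R3 with the opcode bits from the example above
ADD_WORD = encode_r(0b001011, 1, 2, 3)
```

Here `ADD_WORD` reproduces the bit pattern 001011 00001 00010 00011 00000 000000, and `decode(ADD_WORD)` returns the opcode and the three register numbers that the controller and register file would use.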
Besides these control signals, there are two other types of control signals generated by the control unit. One is used for general control purposes, such as the memory read/write control signals, and the other is used to control the ALU operation.

3. MUN DSP2000 data-path design
In this paper, we only look at some key components of the data-path design, including the ALU, multiplier and memory design. In the next section, we discuss how all the components are put together to make a pipelined microprocessor.

3.1 ALU design
The arithmetic logic unit (ALU) is an essential part of a computer processor. It performs the arithmetic operations (addition and subtraction) and logic operations (AND, OR, XOR etc.). It has two 32-bit data inputs A and B, and a carry input C. The three control inputs ALUOP0, ALUOP1 and ALUOP2 decide which operation is performed. The outputs include a 32-bit arithmetic logic result and a 3-bit flag (v: overflow; c: carry; z: zero).

The most time-consuming operations in the ALU are addition and subtraction. In a ripple adder design, the adder propagates the carry from the lowest bit to the highest sequentially. Thus the most significant bit of the sum must wait for the sequential evaluation of the previous 31 1-bit adders, which is very slow. In theory we can anticipate the carry input without waiting for it to be generated by the previous 1-bit adder component. This can be done by applying some calculations to the two operands and the carry input to the least significant bit of the adder. The delay of this kind of adder is on the order of log2 N, where N is the number of bits of the operands (32 in our case), instead of N as in the usual ripple adder. Therefore, to increase the speed, a fast parallel adder, the Look-Ahead-Carry adder, is used in our design.

3.2 Multiplier design
Multiplication with accumulation is a typical operation in DSP applications. Therefore the hardware multiplier is an important characteristic of a DSP processor. The speed of the multiplier is one of the most important
factors determining the overall system performance. Ideally the multiplier should be able to multiply two operands in a single system clock cycle. We can implement the multiplier as a single unit using a pure combinational logic circuit. However, this design is not efficient because the corresponding pipeline stage will have a much longer delay than the other stages. This makes the system clock cycle, which is determined by the worst-case delay, unnecessarily long. The synthesized delay of the 16-bit multiplier is 49 ns, much longer than that of the other components. Therefore it is more attractive to split the multiplier into several pipeline stages. We use a 16-4 bit multiplication module and split the multiplier into 5 stages. The synthesis result shows that the delay of each stage within the multiplier is 19 ns. Although the total time is 19 × 5 = 95 ns, almost twice that of the single unit, the split is advantageous since the clock cycle can be reduced greatly when we integrate the multiplier pipeline into the system pipeline of the microprocessor.

3.3 MUN DSP2000 memory design
Typical DSP operations require many additions and multiplications. In our design, the MAC operation requires us to fetch two operands from the data memory, at the addresses held in the two source registers, and perform the multiplication and addition. To fetch the two operands in a single instruction cycle, we need to make two memory accesses simultaneously. The result of the multiplication and addition is stored back in the register holding the addition operand. By doing so, we can accumulate the results of many multiplication operations in this register through consecutive MAC instructions. One MAC instruction needs four register access operations (three reads and one write) besides two data memory reads in one instruction cycle. The other instructions access data memory at most once (e.g. LW, SW).

There are two common methods to achieve multiple memory accesses per instruction cycle: the extended Harvard architecture and the modified von Neumann architecture [5].
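
The behaviour of the MAC instruction described above can be sketched in a few lines of Python. This is a behavioural illustration only, under stated assumptions: the function names `mac` and `dot_product` are hypothetical, memory is modelled as a word-addressed list, and 32-bit overflow is ignored because the text does not say how it is handled.

```python
# Behavioural sketch of the MAC instruction (assumptions: word-addressed
# memory, no 32-bit overflow handling; names are illustrative only).

def mac(regs, data_mem, rs1, rs2, rd):
    """regs[rd] <- regs[rd] + data_mem[regs[rs1]] * data_mem[regs[rs2]]

    Three register reads (rs1, rs2, rd), two data memory reads and one
    register write, matching the access counts given in the text."""
    a = data_mem[regs[rs1]]      # first operand, address in rs1
    b = data_mem[regs[rs2]]      # second operand, address in rs2
    c = regs[rd]                 # running sum (addition operand)
    if rd != 0:                  # R0 is read-only and always 0
        regs[rd] = c + a * b

def dot_product(samples, coeffs):
    """Accumulate sum(samples[i] * coeffs[i]) with consecutive MACs."""
    data_mem = list(samples) + list(coeffs)
    regs = [0] * 32
    regs[1] = 0                  # R1 points at the first sample
    regs[2] = len(samples)       # R2 points at the first coefficient
    for _ in range(len(samples)):
        mac(regs, data_mem, rs1=1, rs2=2, rd=3)
        regs[1] += 1             # stands in for an ADDI on R1
        regs[2] += 1             # stands in for an ADDI on R2
    return regs[3]
```

For example, `dot_product([4, 5, 6], [1, 2, 3])` accumulates 4·1 + 5·2 + 6·3 in R3, which is the inner loop of a digital filter.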
Extended Harvard architecture
The normal Harvard architecture has two separate physical memory buses. This allows two simultaneous memory accesses: one for instruction memory and one for data memory. This is inadequate for MAC operations, which involve two operands in data memory. Some DSP Harvard architectures permit the instruction bus to be used for operand access as well. It is often necessary to fetch three words (the instruction plus two operands), and the Harvard architecture alone cannot support this. Thus DSP Harvard architectures often include a cache memory which can be used to store instructions that will be reused, leaving both Harvard buses free for fetching operands. This extension is sometimes called an extended Harvard architecture.

Modified von Neumann architecture
The von Neumann architecture uses only a single memory bus for both data memory and instruction memory. This is economical and simple to use because the instructions or data can be located anywhere throughout the available memory. But it does not permit multiple memory accesses. The modified von Neumann architecture allows multiple memory accesses per instruction cycle by using two separate clocks: one for instructions and the other for data memory. The clock for data memory is (n-1) times faster than that for the instruction cycle. Each instruction cycle is divided into n machine states, and a memory access can be made in each machine state. Consequently, a total of n memory accesses per instruction cycle are allowed.

The data memory architecture of MUN DSP2000 uses the modified Harvard architecture. The data memory is separate from the instruction memory, just as in the normal Harvard architecture. The data memory consists of a master memory and a slave memory, each with its own address and data buses. The master memory is used as the common data memory. For all instructions except MAC, the data memory accesses go to the master memory. The MAC instruction gets the first operand from the master memory and the second operand from the slave memory. Because they have dedicated address and data buses, the two data memories can be accessed simultaneously.
One advantage is that the programmer can use addresses freely, since no address conflict can occur between the two memories.

4. Pipelining and data forwarding
The pipeline structure can be implemented by inserting registers between different stages. With these registers, both the control signals and the intermediate data in one stage are separated from the adjacent stages. These registers are triggered by the system clock. Appropriate logic circuits can be added between the inputs and outputs of different pipeline registers to perform data forwarding. In the design of the MUN DSP2000 microprocessor, the
data-path is comprised of a pipeline with five stages: Instruction fetch (IF), Instruction decoding (ID), Memory Access (MEM), Execution (EX) and Write back (WB) [2, 6]. For the MAC operation, we need to get the memory data before we can proceed. Thus we move the MEM stage in front of the EX stage. Each stage of the pipeline operates independently and is synchronized with the system clock. Each instruction takes five clock cycles to complete, and a new instruction is fetched during each clock cycle. Figure 1 shows the pipeline design diagram of the MUN DSP2000 microprocessor.

5. Performance analysis
Based on the timing specification from the synthesis result of each system building component, as shown in table 1, we get the execution time for all the instructions depending on the different routes. To implement the pipeline structure of the processor, we divide the total execution time equally among all the pipeline stages. According to the current timing figures, the speed of our DSP processor is around 10 MHz. The instruction cycle T is determined by the maximum of the worst-case delays among all pipeline stages.

Component                 Maximal delay (ns)   Location
Program counter           1.3                  Stage 1
Instruction memory (1K)   13.23                Stage 1
                          1                    Stage 1, 2, 4
MUX2                      1.39                 Stage 1, 2, 3, 4
Register file             22                   Stage 2
Controller                5.4                  Stage 2
Zero Detector             3.11                 Stage 2
MUX4                      2.7                  Stage 2, 3, 4
OR Gate                   0.8                  Stage 2
Data Memory (1K)          13.23                Stage 3
Multiplier                49                   Stage 4
ALU                       19                   Stage 4

The table indicates that the multiplier is the critical component for the maximum delay. The main reason for the low speed of our DSP microprocessor is that we put both the multiplier and the ALU into the same pipeline stage. Thus the delay of this pipeline stage limits the speed of the whole system. To improve this, we redesigned the multiplier and ALU using an internal pipeline structure. It performed around 4 times faster than the previous one (around 19 ns for the multiplier). Based on the new design, the speed of our DSP microprocessor will reach around 25 MHz. All these results are based on the current 0.35um CMOS technology we are using.
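
The timing trade-off behind the pipelined multiplier can be checked with a small Python calculation. It is a sketch under simplifying assumptions: only the 49 ns single-unit delay and the 19 ns per-stage delay of the five-stage multiplier are taken from the synthesis figures; the function names are hypothetical, operations are assumed to be issued back to back with no stalls, and register and multiplexer overheads are ignored.

```python
# Rough timing model (assumptions: back-to-back operations, no stalls,
# no register or multiplexer overhead; function names are illustrative).

def cycle_time_ns(stage_delays_ns):
    """The instruction cycle T is the worst-case delay over all stages."""
    return max(stage_delays_ns)

def clock_mhz(cycle_ns):
    """Clock frequency in MHz for a cycle time given in ns."""
    return 1000.0 / cycle_ns

def total_time_ns(n_ops, n_stages, stage_delay_ns):
    """Time to finish n_ops operations in an n_stages pipeline:
    n_stages cycles to fill it, then one result per cycle."""
    return (n_stages + n_ops - 1) * stage_delay_ns

single_unit = total_time_ns(100, 1, 49)   # combinational multiplier
pipelined = total_time_ns(100, 5, 19)     # five 19 ns stages
```

A single multiplication takes longer in the pipelined unit (95 ns against 49 ns), but under these assumptions 100 consecutive multiplications finish in 1976 ns instead of 4900 ns. `clock_mhz(100.0)` gives 10.0, which is the relation between a 100 ns instruction cycle and a 10 MHz clock.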
With more advanced CMOS technology such as 0.18um and 0.1um, MUN DSP2000 can achieve much higher speed and performance [6].

6. Conclusion and future work
The whole system is described and implemented in VHDL. Both the basic components and the complete pipelined microprocessor are simulated and tested using the Synopsys design compiler. Each system building component is synthesized using the Synopsys design analyzer and is shown to work properly. Timing specifications from the simulation and synthesis results are used to decide the system structure and pipeline design. Based on the design and testing results of our implementation, we conclude that MUN DSP2000 is a simple and practical prototype of a DSP processor. It performs normal DSP applications efficiently and reliably. Because of design time limitations, some attractive features of DSP processors were not included in our design at this time. In the future, data forwarding can be further improved and the number of pipeline stages can be further extended. The performance of the MUN DSP2000 microprocessor will then be further improved.

References
1. P. Lapsley et al., "DSP Processor Fundamentals", IEEE Press, 1995
2. J. H. Hennessy and D. A. Patterson, Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann Publishers, 1994
3. Synopsys: Design Compiler Reference Manual, V3.2, 1992 (Online documentation)
4. Z. Navabi, VHDL: Analysis and Modelling of Digital Systems, McGraw Hill, 1998
5. Steve Heath, Microprocessor Architectures and Systems: RISC, CISC & DSP, Newnes, 1991
6. Cheng Li, Lu Xiao, Qiyao Yu, MUN DSP2000 Processor Design, Course project report for Advanced Digital System, Memorial University, Aug. 2000
[Figure 1: MUN DSP2000 System Diagram. Block diagram of the pipelined data path showing the PC, instruction memory, controller, register file, sign extend unit, shifters, data memory, forwarding unit, multiplier and ALU, separated by the IF/ID, ID/MEM, MEM/EX and EX/WB pipeline registers.]