Subword Permutation Instructions for Two-Dimensional Multimedia Processing in MicroSIMD Architectures

Ruby B. Lee
Princeton University
rblee@ee.princeton.edu

Abstract

MicroSIMD architectures incorporating subword parallelism are very efficient for application-specific media processors as well as for fast multimedia information processing in general-purpose processors. This paper addresses the unsolved problem of the need to permute the subwords packed in registers for maximum parallelism performance, especially for two-dimensional (2-D) multimedia algorithms. We propose a new systematic approach for identifying the fundamental data rearrangement needs in current and future 2-D pixel processing programs based on the hierarchical decomposition of frames and objects into atomic 2-D structures. We define new subword permutation instructions, Check, Excheck, Exchange, and Permset, that achieve these data rearrangements across multiple registers. We also define an alphabet of subword permutation primitives, including these new instructions and the Mix instruction defined for PA-RISC MAX-2 and IA-64, which supports the data rearrangement needs of 2-D frames and objects. We show the sufficiency and efficiency of this alphabet for achieving all possible permutations of hierarchical 2-D blocks.

1. Introduction

Multimedia information processing can be considered an increasing part of the general-purpose workload or a special-purpose application area. In this paper, we consider new instructions for accelerating multimedia processing in any programmable processor, whether general-purpose or application-specific. The focus is on simple, single-cycle instructions, which can be used to construct any type of permutation needed in two-dimensional (2-D) multimedia processing.

Multimedia extensions have been added to general-purpose processors to accelerate the processing of different media types [1-7,15,16]. The types of application-specific processors we target in this paper are designed to execute various multimedia programs, rather than just one.
They include digital signal processors [8], video signal processors [9, 10], and mediaprocessors [11, 12].

Subword parallelism [1,4] is now widely deployed by multimedia instructions in microprocessor architectures [1-7, 15,16] and in media processors [12] to accelerate the processing of lower-precision data, like 16-bit audio samples or 8-bit pixel components. We also call this microSIMD architecture [13], since it applies SIMD (Single Instruction Multiple Data) parallel processor techniques [14] within a single processor. A subword-parallel (or microSIMD) instruction performs the same operation in parallel on multiple pairs of subwords packed into two registers, which are typically 32 to 128 bits wide in today's microprocessors and mediaprocessors (see Figure 1). For example, a 64-bit word-oriented datapath can be partitioned into eight 8-bit subwords, or four 16-bit subwords, or two 32-bit subwords. Substantial performance improvements have been realized using subword parallel instructions, at very low cost compared to other forms of parallelism, like superscalar, VLIW or parallel processor organizations, for the same degree of operation parallelism [13].

Figure 1: microSIMD subword parallelism leveraging word datapaths (block diagram: a register file feeding a subword arithmetic unit and a shift/permute unit; multiple subword arithmetic and shift/permute functional units can be implemented)

With packed subwords in registers, we now need to be able to rearrange subwords within a register, and between registers. This is necessary to achieve the maximum parallelism for subsequent processing. Unfortunately, subword permutation operations are not understood as clearly as subword arithmetic operations. They require moving several fields (subwords) in parallel. Conventional shift and rotate instructions move all the bits in a register by the same amount. Extract and deposit instructions, found in instruction-set architectures like PA-RISC [17], move one field using one or two instructions. Early subword permutation instructions like mix and permute [4] in the PA-RISC MAX-2 multimedia instructions are a first attempt to find efficient and general-purpose subword permutation primitives. However, the sufficiency or efficiency of these permutation primitives in achieving any arbitrary permutation has not been demonstrated.

The problem is further complicated by the fact that image, video or graphics processing requires mapping two-dimensional objects onto subwords in multiple registers, and then permuting these subwords between registers. In addition, since permutations have not been easily achieved by programmable processors, algorithm designers may not have optimized algorithms using permutations. Hence, one cannot just comb through all the common multimedia algorithms to determine what permutations are used and the performance impact they have: one would often need to re-think algorithms to see if efficient permutations would help improve the performance.
Furthermore, in designing subword permutation primitives, we need to project the permutation needs of future, yet-to-be defined multimedia algorithms, and this seems to be an intractable problem.

In this paper, we propose a systematic solution to this unsolved problem of finding generic subword permutation primitives for both current and future algorithms for processing two-dimensional multimedia data. We also define a small set of subword permutation primitives, and show that this is both a sufficient and an efficient set.

In section 2, we describe how 2-dimensional frames can be mapped into the packed subwords of microSIMD architectures. We also show that two-dimensional objects can be decomposed into smaller blocks or polygons, and ultimately into atomic 2x2 matrices and triangles. In section 3, we review the subword permutation instructions that have been defined in the multimedia instructions MAX-2 for PA-RISC processors [4] and for IA-64 EPIC processors [15], especially the mix instruction. We show an example of how a permutation on a 2-D object can be decomposed into hierarchical permutations on 2x2 matrices. In section 4, we investigate the subword permutation needs of atomic 2-D structures, and postulate that these are generic primitives since all 2-D frames and objects can be decomposed into these atomic 2-D structures. In section 5, we propose small subsets of subword permutation primitives that are sufficient and also efficient for different performance and cost levels. Section 6 summarizes and concludes the paper.

2. Mapping and decomposition of 2-D blocks

To use microSIMD architectures for maximum performance, we need to map multimedia data into packed subwords in a way that permits maximum parallel execution, SIMD style. Pixel-oriented multimedia data in images, graphics, video or animation are two-dimensional (2-D) in nature. How should 2-D blocks of data be mapped into the packed subwords of microSIMD architectures?

A 2-D array of pixels in memory is normally stored in row-major format: elements of row one are stored sequentially in successive memory locations, followed by elements of row two, and so forth. When words are loaded into registers from memory, this translates into mapping the first row into a set of registers, mapping the second row into another set of registers, and so forth. This is called area-mapping [13] of a 2-D block: different rows of the 2-D block are held in different registers.

A 2-D image or frame is easily decomposed into smaller 2-D blocks. The smallest 2-D block is a 2x2 block (or matrix). A 2-D object within a frame can also be decomposed into smaller blocks, where again the smallest 2-D rectangular block is a 2x2 matrix of pixels. For example, an 8x8 matrix used in DCT or IDCT can be decomposed into four 4x4 matrices, each stored in four 64-bit registers, as shown in Figure 2a, where each element is a 16-bit subword. Each such 4x4 matrix can be further decomposed into four 2x2 matrices (Figure 2b). Matrices with dimensions that are a power of two can be successively decomposed into smaller matrices, and ultimately into the smallest 2x2 matrix.
(a) Area mapping of a 4x4 matrix:

    R1 = 00 01 02 03
    R2 = 10 11 12 13
    R3 = 20 21 22 23
    R4 = 30 31 32 33

(b) Decomposition into four 2x2 matrices:

    R1 = a00 a01 b00 b01
    R2 = a10 a11 b10 b11
    R3 = c00 c01 d00 d01
    R4 = c10 c11 d10 d11

Figure 2: Area mapping and decomposition of 2-D blocks

Non-rectangular objects may more accurately be decomposed into non-rectangular polygons, the smallest of which is a triangle. Since all 2-D frames and objects can be decomposed into atomic 2-D units like the 2x2 matrix and the triangle, we postulate that if we can determine the permutation needs of these atomic units, they can serve as permutation primitives for the entire frame or object. At the lowest level, we permute the four pixels of a 2x2 matrix. At the next higher level, we again permute a 2x2 matrix, where each element is now itself a 2x2 matrix.
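The area mapping and 2x2 decomposition described in this section can be sketched in Python. This is a minimal illustration under stated assumptions, not code from the paper: a register is modeled as a Python integer with the leftmost subword in the most significant bits, and the helper names area_map and split_2x2 are hypothetical.

```python
def area_map(matrix, subword_bits=16):
    """Pack each row of a 2-D block into one register (modeled as an int).

    Assumption: the leftmost element of a row occupies the most
    significant subword of its register.
    """
    mask = (1 << subword_bits) - 1
    regs = []
    for row in matrix:
        reg = 0
        for element in row:
            reg = (reg << subword_bits) | (element & mask)
        regs.append(reg)
    return regs


def split_2x2(matrix):
    """Decompose a matrix with even dimensions into a grid of 2x2 blocks."""
    blocks = []
    for r in range(0, len(matrix), 2):
        block_row = []
        for c in range(0, len(matrix[0]), 2):
            block_row.append([row[c:c + 2] for row in matrix[r:r + 2]])
        blocks.append(block_row)
    return blocks


# A 4x4 block whose element values encode (row, column), as in Figure 2a.
example = [[0x00, 0x01, 0x02, 0x03],
           [0x10, 0x11, 0x12, 0x13],
           [0x20, 0x21, 0x22, 0x23],
           [0x30, 0x31, 0x32, 0x33]]
registers = area_map(example)   # four 64-bit values, one per row
quadrants = split_2x2(example)  # quadrants[0][0] is the top-left 2x2 block
```

Applying split_2x2 again to a matrix whose elements are themselves blocks gives the next level of the hierarchy.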

3. Mix permutation instruction

For microprocessor multimedia instructions, only PA-RISC MAX-2 [4], IA-64 [15] and the PowerPC AltiVec [16] have a few instructions designed for general-purpose subword permutation. Because our focus in this paper is on 2-D multimedia processing, and area-mapped 2-D objects span at least two registers, we seek permutation primitives that reorder subwords from two source registers. We describe mix, defined by the author for MAX-2 and IA-64, which is currently the only subword permutation instruction with two source registers.

3.1. Definition of Mix instruction

The mix operation selects either all even elements, or all odd elements, from the two source registers [4,15]. The pair of mixl and mixr operations is defined as follows:

mixl: interleave the corresponding even elements from the two source registers, starting from the leftmost elements in each register

mixr: interleave the corresponding odd elements from the two source registers, ending with the rightmost elements in each register

Table 1 defines these mix instructions, for three different subword sizes: 8 bits, 16 bits and 32 bits. Each letter in the register contents represents an 8-bit subword, and each register holds a total of 64 bits.

Table 1: Definition of Mix instruction

    Register contents:   R1 = a b c d e f g h
                         R2 = A B C D E F G H

    Instruction          Definition
    mixl,8  R1,R2,R3     R3 = a A c C e E g G
    mixr,8  R1,R2,R3     R3 = b B d D f F h H
    mixl,16 R1,R2,R3     R3 = a b A B e f E F
    mixr,16 R1,R2,R3     R3 = c d C D g h G H
    mixl,32 R1,R2,R3     R3 = a b c d A B C D
    mixr,32 R1,R2,R3     R3 = e f g h E F G H

3.2. Example of decomposable subword permutations

A common permutation of a 2-D object is matrix transpose, where the matrix is flipped along its diagonal: rows become columns, and columns become rows. This is a decomposable permutation. For example, an 8x8 matrix of 16-bit elements stored in 16 registers can be decomposed into four 4x4 matrices (Figure 2a), each of which can be further decomposed into four 2x2 matrices (Figure 2b).
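The mix definitions of Table 1 can be modeled in a few lines of Python. This is a minimal sketch, not the MAX-2 or IA-64 implementation: a register is modeled as a list of subwords, and the names mixl, mixr, base_bits and transpose_4x4 are illustrative assumptions. The last function applies the two-level mix sequence of Figure 3 to a 4x4 block of 16-bit elements.

```python
def _mix(r1, r2, size_bits, base_bits, odd):
    """Interleave even (odd=False) or odd (odd=True) elements of r1 and r2.

    r1 and r2 are lists of subwords, each base_bits wide; an element is
    size_bits wide, that is, size_bits // base_bits consecutive entries.
    """
    k = size_bits // base_bits
    out = []
    for i in range(k if odd else 0, len(r1), 2 * k):
        out += r1[i:i + k] + r2[i:i + k]
    return out


def mixl(r1, r2, size_bits, base_bits=8):
    return _mix(r1, r2, size_bits, base_bits, odd=False)


def mixr(r1, r2, size_bits, base_bits=8):
    return _mix(r1, r2, size_bits, base_bits, odd=True)


def transpose_4x4(r1, r2, r3, r4):
    """Transpose a 4x4 block of 16-bit elements held one row per register."""
    t1 = mixl(r1, r2, 16, base_bits=16)   # transpose the 2x2 blocks ...
    t2 = mixr(r1, r2, 16, base_bits=16)
    t3 = mixl(r3, r4, 16, base_bits=16)
    t4 = mixr(r3, r4, 16, base_bits=16)
    return (mixl(t1, t3, 32, base_bits=16),   # ... then the 2x2 grid of blocks
            mixl(t2, t4, 32, base_bits=16),
            mixr(t1, t3, 32, base_bits=16),
            mixr(t2, t4, 32, base_bits=16))


R1, R2 = list("abcdefgh"), list("ABCDEFGH")
rows = (["a00", "a01", "b00", "b01"], ["a10", "a11", "b10", "b11"],
        ["c00", "c01", "d00", "d01"], ["c10", "c11", "d10", "d11"])
transposed = transpose_4x4(*rows)
```

Eight mix operations transpose the 4x4 block, and the two operations in each pair are independent of each other.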
By transposing each of the 2x2 matrices, then transposing the larger 2x2 matrix, where each element is itself one of these 2x2 matrices, we obtain the matrix transpose of a 4x4 matrix (see Figure 3). The mix instructions can perform these hierarchical 2x2 matrix transpositions. The mixl and mixr instructions are used in pairs at the level of subword size equal to the matrix element size. Then, they are used at the size of subwords that are twice as large. Repeating this on each of the four 4x4 matrices, with the two off-diagonal 4x4 matrices also exchanging places (which only changes the registers that hold them), completes the transpose of the original 8x8 matrix.

    Initial registers:
      r1 = a00 a01 b00 b01
      r2 = a10 a11 b10 b11
      r3 = c00 c01 d00 d01
      r4 = c10 c11 d10 d11

    Mix on 16-bit subwords:
      mixl,16 r1,r2,t1      t1 = a00 a10 b00 b10
      mixr,16 r1,r2,t2      t2 = a01 a11 b01 b11
      mixl,16 r3,r4,t3      t3 = c00 c10 d00 d10
      mixr,16 r3,r4,r4      r4 = c01 c11 d01 d11

    Mix on 32-bit subwords:
      mixl,32 t1,t3,r1      r1 = a00 a10 c00 c10
      mixl,32 t2,r4,r2      r2 = a01 a11 c01 c11
      mixr,32 t1,t3,r3      r3 = b00 b10 d00 d10
      mixr,32 t2,r4,r4      r4 = b01 b11 d01 d11

Figure 3: Hierarchical Decomposition of Matrix Transpose Permutation

4. Fundamental data rearrangements in 2-D blocks

We propose that a systematic approach to finding a set of permutation primitives for current and future 2-D multimedia programs can be based on decomposing images and objects into atomic units, then finding the permutations desired for these 2-D building blocks. The subword permutation instructions for these 2-D building blocks are also defined for larger subword sizes at successively higher hierarchical levels. We propose studying the permutations of a 2x2 matrix, and the permutations of the four triangles contained within this 2x2 building block. What are the useful data rearrangements in a 2x2 matrix and its four embedded triangles (section 4.1)? What are permutation primitives that can perform these data rearrangements (section 4.2)? Are these permutation primitives sufficient and efficient (section 4.3)? Can they be generalized (section 4.4)?

4.1. Characterization of 2-D data rearrangements

The first set of data rearrangements likely to be needed in a 2x2 matrix is to be able to swap elements vertically, horizontally and diagonally. This is based on observing that nearest neighbor interactions are perhaps the most common 2-D pixel operations. The eight nearest-neighbor movements for a pixel in a 2-D frame are shown in Figure 4a. Figure 4b expresses the 9-element matrix of Figure 4a as four 2x2 matrices (outlined in bold).
Here, an element of a 2x2 matrix can move to its right (or left) neighbor, its downward (or upward) neighbor, or its diagonal right (or left) neighbor. Figure 4c shows all possible nearest neighbor movements, for one or two pairs of elements for a 2x2 matrix.

The four elements of a 2x2 matrix can also be rotated clockwise by 1, 2 or 3 positions (Figure 5a). This is equivalent to rotating counter-clockwise by 3, 2 or 1 position. Also, rotating by 2 positions is equivalent to swapping both the diagonal and anti-diagonal elements, as already covered in Figure 4c. Hence, we need only consider rotating clockwise or anticlockwise by 1 position.

Figure 4: Nearest Neighbor Permutations. (a) Nearest neighbor movement for 2-D blocks. (b) Nearest neighbor movement for 2x2 matrices. (c) Nearest neighbor moves for a 2x2 matrix: up-down, right-left, diagonal-antidiagonal, identity, matrix transpose.

Figure 5: Rotations of a 2x2 matrix and its embedded triangles. (a) Rotation of a 2x2 matrix, clockwise and anti-clockwise; rotating by 2 elements = swapping diagonal and antidiagonal elements. (b) Rotations of the four embedded triangles.

A 2x2 matrix contains four triangles, each of which can be rotated clockwise or anticlockwise by 1 position. This results in 8 different permutations of the 2x2 matrix, as shown in Figure 5b. Triangles are useful for representing non-rectangular shapes.
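The rotation counts in this section can be checked with a short Python sketch. It is an illustration under assumed conventions rather than anything defined in the paper: a 2x2 matrix is a tuple (top-left, top-right, bottom-left, bottom-right), and the names CLOCKWISE, rotate and triangle_rotations are hypothetical.

```python
# Tuple indices of the four corners visited in clockwise order:
# top-left (0), top-right (1), bottom-right (3), bottom-left (2).
CLOCKWISE = [0, 1, 3, 2]


def rotate(m, cycle):
    """Move each element one step along `cycle`; other positions stay put."""
    out = list(m)
    for j, pos in enumerate(cycle):
        out[cycle[(j + 1) % len(cycle)]] = m[pos]
    return tuple(out)


def rotate_cw(m):
    return rotate(m, CLOCKWISE)


def rotate_ccw(m):
    return rotate(m, CLOCKWISE[::-1])


def triangle_rotations(m):
    """All rotations by one position of the four embedded triangles."""
    results = set()
    for omitted in CLOCKWISE:
        triangle = [p for p in CLOCKWISE if p != omitted]
        results.add(rotate(m, triangle))        # clockwise
        results.add(rotate(m, triangle[::-1]))  # anticlockwise
    return results


m = ("a", "b", "A", "B")        # rows "a b" and "A B"
once = rotate_cw(m)             # rows "A a" and "B b"
twice = rotate_cw(once)         # both diagonals swapped
triangles = triangle_rotations(m)
```

Each triangle rotation leaves exactly one corner fixed, which is why the four triangles and two directions give eight distinct permutations.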

We postulate that these permutations of 2x2 matrices and triangles should be efficiently supported, at all subword sizes (powers of 2), for use in decomposable permutations of 2-D objects. What subword permutation instructions can achieve these common data rearrangements?

4.2. Check, Exchange and Excheck operations

To achieve these common data rearrangements, we only need to define three new subword permutation primitives (see Table 2). The check instruction allows the downward and upward swapping of elements, the exchange instruction allows the right and left movement, while the excheck instruction allows the rotation of a triangle of three elements within a 2x2 matrix. The mixl and mixr operations, defined earlier, achieve the swapping of diagonal elements.

The check instruction performs a checkerboard pattern: it selects alternately from the corresponding subwords in the two source registers, for each position in the result register (see Table 2). Exchange is an operation on a single source register: it swaps adjacent subwords in each pair of consecutive subwords. Excheck can be described as a composite operation: it performs a check on the two source registers, followed by an exchange operation on the result.

Table 2: Definition of Check, Exchange and Excheck

    Register contents:     R1 = a b c d e f g h
                           R2 = A B C D E F G H

    Instruction            Definition
    check,8    R1,R2,R3    R3 = a B c D e F g H
    check,16   R1,R2,R3    R3 = a b C D e f G H
    check,32   R1,R2,R3    R3 = a b c d E F G H
    exchange,8  R1,R3      R3 = b a d c f e h g
    exchange,16 R1,R3      R3 = c d a b g h e f
    exchange,32 R1,R3      R3 = e f g h a b c d
    excheck,8  R1,R2,R3    R3 = B a D c F e H g
    excheck,16 R1,R2,R3    R3 = C D a b G H e f
    excheck,32 R1,R2,R3    R3 = E F G H a b c d

4.3. Sufficiency and efficiency of permutation primitives

In Table 3, we systematically enumerate the permutations of area-mapped 2x2 matrices, to verify that the subword permutation instructions defined above can indeed perform all these permutations efficiently. R1 and R2 contain four 2x2 matrices. It is easier to follow just the leftmost matrix (in bold), which is labeled as in Figures 4-6, initially a b in R1 and A B in R2.
The permutations are enumerated as follows: each of the 4 elements in a resulting 2x2 matrix can be in the top left corner in R3. Thereafter, each of the 3 remaining elements can be in the top right corner in R3. This gives 12 possibilities for the top row, which is used for the numeric numbering of the cases. The two remaining elements of each 2x2 matrix are in the bottom row in R4, and their two possible orderings give the (a) and (b) numbering in Table 3.
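The enumeration just described can be reproduced mechanically. The following Python sketch models the five primitives on 8-bit subwords and searches for a single instruction for each result row; it is an illustrative model, with registers as lists of letters and the helper names single_instruction_rows and enumerate_2x2 chosen here rather than taken from the paper.

```python
from itertools import permutations


def mixl(x, y):
    """Interleave the even-indexed subwords of x and y."""
    return [s for i in range(0, len(x), 2) for s in (x[i], y[i])]


def mixr(x, y):
    """Interleave the odd-indexed subwords of x and y."""
    return [s for i in range(1, len(x), 2) for s in (x[i], y[i])]


def check(x, y):
    """Checkerboard: take x at even positions and y at odd positions."""
    return [x[i] if i % 2 == 0 else y[i] for i in range(len(x))]


def exchange(x):
    """Swap the two subwords in each consecutive pair."""
    return [x[i ^ 1] for i in range(len(x))]


def excheck(x, y):
    """Check followed by exchange, modeled as one primitive."""
    return exchange(check(x, y))


def single_instruction_rows(r1, r2):
    """Map every register value reachable with at most one instruction."""
    regs = {"R1": r1, "R2": r2}
    rows = {tuple(r1): "R1", tuple(r2): "R2",
            tuple(exchange(r1)): "exchange(R1)",
            tuple(exchange(r2)): "exchange(R2)"}
    ops = (("mixl", mixl), ("mixr", mixr), ("check", check), ("excheck", excheck))
    for name, op in ops:
        for x, y in (("R1", "R2"), ("R2", "R1")):
            rows[tuple(op(regs[x], regs[y]))] = f"{name}({x},{y})"
    return rows


def enumerate_2x2(r1, r2):
    """For each ordering of the leftmost 2x2 matrix, find (R3, R4) sources."""
    by_pair = {row[:2]: text for row, text in single_instruction_rows(r1, r2).items()}
    corners = (r1[0], r1[1], r2[0], r2[1])
    # A KeyError here would mean some row needs more than one instruction.
    return {p: (by_pair[p[:2]], by_pair[p[2:]]) for p in permutations(corners)}


table = enumerate_2x2(list("abcdefgh"), list("ABCDEFGH"))
```

The dictionary holds 24 entries, one per ordering of the four corner elements, each giving the instruction for R3 and the instruction for R4.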

Table 3: All Permutations of Four area-mapped 2x2 Matrices

    Operand registers:   R1 = a b c d e f g h
                         R2 = A B C D E F G H

    Case    Result registers        Instructions used     Type of data movement

    a at top left:
    1(a)    R3 = a b c d e f g h    R3=R1                 identity permutation
            R4 = A B C D E F G H    R4=R2
    1(b)    R3 = a b c d e f g h    R3=R1                 swap bottom row elements right-left
            R4 = B A D C F E H G    R4=exchange(R2)
    2(a)    R3 = a B c D e F g H    R3=check(R1,R2)       swap right column elements up-down
            R4 = A b C d E f G h    R4=check(R2,R1)
    2(b)    R3 = a B c D e F g H    R3=check(R1,R2)       rotate bottom-right triangle anticlockwise
            R4 = b A d C f E h G    R4=excheck(R2,R1)
    3(a)    R3 = a A c C e E g G    R3=mixL(R1,R2)        swap diagonal elements = transpose
            R4 = b B d D f F h H    R4=mixR(R1,R2)
    3(b)    R3 = a A c C e E g G    R3=mixL(R1,R2)        rotate bottom-right triangle clockwise
            R4 = B b D d F f H h    R4=mixR(R2,R1)

    b at top left:
    4(a)    R3 = b a d c f e h g    R3=exchange(R1)       swap top row elements right-left
            R4 = A B C D E F G H    R4=R2
    4(b)    R3 = b a d c f e h g    R3=exchange(R1)       swap both rows' elements right-left
            R4 = B A D C F E H G    R4=exchange(R2)
    5(a)    R3 = b B d D f F h H    R3=mixR(R1,R2)        rotate top-right triangle anticlockwise
            R4 = A a C c E e G g    R4=mixL(R2,R1)
    5(b)    R3 = b B d D f F h H    R3=mixR(R1,R2)        rotate anti-clockwise 1 element
            R4 = a A c C e E g G    R4=mixL(R1,R2)
    6(a)    R3 = b A d C f E h G    R3=excheck(R2,R1)     rotate top-left triangle anticlockwise
            R4 = a B c D e F g H    R4=check(R1,R2)
    6(b)    R3 = b A d C f E h G    R3=excheck(R2,R1)     (not labeled)
            R4 = B a D c F e H g    R4=excheck(R1,R2)

    A at top left:
    7(a)    R3 = A a C c E e G g    R3=mixL(R2,R1)        rotate top-left triangle clockwise
            R4 = b B d D f F h H    R4=mixR(R1,R2)
    7(b)    R3 = A a C c E e G g    R3=mixL(R2,R1)        rotate clockwise 1 element
            R4 = B b D d F f H h    R4=mixR(R2,R1)
    8(a)    R3 = A b C d E f G h    R3=check(R2,R1)       swap left column elements up-down
            R4 = a B c D e F g H    R4=check(R1,R2)
    8(b)    R3 = A b C d E f G h    R3=check(R2,R1)       rotate bottom-left triangle clockwise
            R4 = B a D c F e H g    R4=excheck(R1,R2)
    9(a)    R3 = A B C D E F G H    R3=R2                 swap left and right column elements up-down
            R4 = a b c d e f g h    R4=R1
    9(b)    R3 = A B C D E F G H    R3=R2                 (not labeled)
            R4 = b a d c f e h g    R4=exchange(R1)

    B at top left:
    10(a)   R3 = B a D c F e H g    R3=excheck(R1,R2)     rotate top-right triangle clockwise
            R4 = A b C d E f G h    R4=check(R2,R1)
    10(b)   R3 = B a D c F e H g    R3=excheck(R1,R2)     (not labeled)
            R4 = b A d C f E h G    R4=excheck(R2,R1)
    11(a)   R3 = B b D d F f H h    R3=mixR(R2,R1)        rotate bottom-left triangle anticlockwise
            R4 = a A c C e E g G    R4=mixL(R1,R2)
    11(b)   R3 = B b D d F f H h    R3=mixR(R2,R1)        swap anti-diagonal elements
            R4 = A a C c E e G g    R4=mixL(R2,R1)
    12(a)   R3 = B A D C F E H G    R3=exchange(R2)       (not labeled)
            R4 = a b c d e f g h    R4=R1
    12(b)   R3 = B A D C F E H G    R3=exchange(R2)       swap diagonal and anti-diagonal elements
            R4 = b a d c f e h g    R4=exchange(R1)       = rotate clockwise by 2

The subword permutation instructions used to achieve each of the 2x2 block permutations are shown. Only the 5 subword permutation primitives defined earlier are needed: mixl, mixr, exchange, check, and excheck. If the processor has at least two permutation units, then each case in Table 3 can be executed in one cycle, since there are no dependencies in generating R3 and R4. This establishes the efficiency of these permutation primitives.

Each 2x2 matrix permutation is also labeled with one of the 20 data movements (including identity) described in Figures 4c, 5a and 5b. There are four permutations in Table 3 that are not labeled with a data movement described earlier. They correspond to more esoteric data rearrangements of a 2x2 matrix, described best as changing rows into diagonals, and changing diagonals into columns (Figure 6). Even though these four permutations were not initially identified as data rearrangements to be supported, the permutation primitives we defined efficiently support them. This supports the thesis that if we can define permutation primitives that somehow form a basis set, they can be used to implement other permutations that may be needed in algorithms yet to be invented.

Figure 6: Four unlabeled permutations of a 2x2 matrix (panels: identity, changing rows to diagonals, changing diagonals to columns)

4.4. Repeating permutations on smaller subsets of subwords

The exchange instruction can be replaced by a more general permset instruction, which repeats a permutation on a subset of elements over the rest of the elements in the register. Permset is also a generalization of the permute instruction in MAX-2 [4]. The subwords in the source register are numbered, and permute specifies the new ordering desired in terms of this numbering. The mux instruction in IA-64 [15] and the vperm instruction in AltiVec [16] are similar. Table 4 gives examples of this permute instruction on 8-bit and 16-bit subwords.
Table 4: Examples of Permute Instruction on 8-bit and 16-bit Subwords

    Operand register:   R1 = a b c d e f g h

    Permute instruction              Result register contents    Type of permutation
    permute,8,01234567 R1, Rt        Rt = a b c d e f g h        identity permutation
    permute,8,10325476 R1, Rt        Rt = b a d c f e h g        exchange
    permute,8,66666666 R1, Rt        Rt = g g g g g g g g        broadcast
    permute,8,76543210 R1, Rt        Rt = h g f e d c b a        reverse
    permute,8,05276341 R1, Rt        Rt = a f c h g d e b        arbitrary permutation
    permute,8,55000366 R1, Rt        Rt = f f a a a d g g        permutation with repetitions
    permute,16,0213 R1, Rt           Rt = a b e f c d g h        permuting four 16-bit subwords

There is a limit to the efficiency of the permute instruction for permuting many subwords, since the control bits quickly exceed the number of bits permuted. Permuting four subwords requires only 8 control bits, which can be encoded in the permute instruction itself [4, 15]. Beyond four elements and up to sixteen elements, any arbitrary permutation can still be performed with one instruction, by providing the control bits for the permutation in a second source register [16], rather than in the 32-bit instruction. Permuting 32 elements requires 160 bits, and permuting 64 elements requires 384 bits (n*log2(n) bits for n elements). Hence, permuting more than 16 elements cannot be achieved by a single instruction with two source registers, using this method of specifying permutations.

To permute more subwords without increasing the number of control bits required, we define a new permset instruction which permutes a subset of m subwords, where m is less than the number of subwords in the register. The same permutation is repeated on consecutive subsets of m subwords. If the total number of subwords in the register is not a multiple of m, we can pad this last set of subwords with zeros.

Table 5: Replacing 8-element Permute with 4-element Permset instructions

    Permute example                  Equivalent Permset instructions    Type of permutation
    permute,8,01234567 R1, Rt        permset,8,4,0123 R1, Rt            identity
    permute,8,10325476 R1, Rt        permset,8,4,1032 R1, Rt            exchange
    permute,8,66666666 R1, Rt        permset,8,4,2222 R1, Rt            broadcast
                                     permset,16,4,2222 Rt, Rt
    permute,8,76543210 R1, Rt        permset,8,4,3210 R1, Rt            reverse
                                     permset,16,4,2301 Rt, Rt

A permute instruction can be turned into a permset instruction, by inserting a new parameter which specifies the number of elements to be permuted in each set. In Table 5, this can be a second parameter, inserted between the two existing parameters of subword size and permutation control bits. Using this new permset instruction, the first four permutations in Table 4 can also be specified as permutations on sets of 4 elements, as shown in Table 5. The identity and exchange operations can be replaced by exactly one such permset instruction. The broadcast and reverse operations each need two permset instructions, with 4-element permute sets. The next two permute instructions in Table 4 cannot be accomplished in 1 or 2 instructions, because of the lack of symmetry in the permutation done on consecutive sets of 4 elements.
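The permute and permset semantics of Tables 4 and 5, and the control-bit count, can be sketched as follows. This is a hedged Python model, not an instruction definition from MAX-2, IA-64 or AltiVec: a 64-bit register is a string of eight letters (one per 8-bit subword), the control field is a string of digits, and the function names and the omission of zero padding are simplifying assumptions.

```python
def _elements(reg, size_bits):
    """Split a register (string of 8-bit subwords) into size_bits elements."""
    k = size_bits // 8
    return [reg[i:i + k] for i in range(0, len(reg), k)]


def permute(reg, size_bits, control):
    """Result element j is source element number control[j]."""
    elems = _elements(reg, size_bits)
    return "".join(elems[int(c)] for c in control)


def permset(reg, size_bits, m, control):
    """Repeat an m-element permutation on consecutive sets of m elements.

    Assumption: the element count is a multiple of m, so the zero padding
    described in the text is not modeled.
    """
    elems = _elements(reg, size_bits)
    assert len(elems) % m == 0 and len(control) == m
    out = []
    for base in range(0, len(elems), m):
        out += [elems[base + int(c)] for c in control]
    return "".join(out)


def control_bits(n):
    """Bits needed to name a source for each of n elements (n a power of 2)."""
    return n * (n - 1).bit_length()


R1 = "abcdefgh"
broadcast = permset(permset(R1, 8, 4, "2222"), 16, 4, "2222")
reverse = permset(permset(R1, 8, 4, "3210"), 16, 4, "2301")
```

In this model, broadcast and reverse each take two dependent permset steps, the second one operating on 16-bit subwords, which matches the two-instruction entries of Table 5.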
So, while the permset instruction with 4-element sets is not as general as the full permute instruction on 8 elements, it can specify all possible permutations of 2x2 matrices, with lower implementation cost.

5. Alphabet of Subword Permutation Primitives

An alphabet is a small set of basic primitives from which words, phrases, sentences, paragraphs and stories can be built. Many of these stories and words were not even conceived when the alphabet was designed. We propose an alphabet of fundamental permutation primitives, which are simple yet powerful enough to express all data rearrangement needs of current and future 2-D media processing programs.

The mix operations appear to be truly fundamental, selecting fairly between elements across the width of both source registers, embodying the powerful even-odd paradigm. Although the check instruction can be derived from the mix operation, it can also be considered a fundamental permutation since it embodies the checkerboard pattern. The exchange operation, while a useful permutation primitive in itself, can be replaced by the more general permset instruction, as described above.

An initial alphabet of subword permutation primitives is shown in Figure 7, including mixl, mixr, permset, check and excheck, defined on 8, 16 and 32 bit subwords. For very low cost implementations, at slightly reduced performance, a minimal alphabet could exclude check and excheck. Check may be excluded from a minimal set, because a Shift_Left of the second operand, followed by a mixl instruction, can accomplish it. Excheck is the composition of check followed by exchange, so it may also be omitted from a minimal set of fundamental permutations. They are included in the initial alphabet for efficiency and uniformity in performance, so that every permutation of a basic 2x2 matrix, as enumerated in Table 3, can be done in a single cycle (or a single step).

    Minimal alphabet:
      mixl, mixr on 8, 16 and 32 bit subwords
      permset on 8, 16 and 32 bit subwords, with 4-element sets

    Additional primitives:
      check on 8, 16 and 32-bit subwords
      excheck on 8, 16 and 32-bit subwords

Figure 7: Alphabet of Subword Permutation Primitives

The minimal set of mixl, mixr and permset may be further reduced depending on the size of the registers in the processor. For example, if registers are only 64 bits wide, then permutation instructions for the two 32-bit subwords may not be needed, since they can easily be specified as permutations on the four 16-bit subwords. These permutation instructions may also be extended down to subwords of 4 bits, 2 bits and 1 bit, especially if it is also desired to support permutations for cryptography efficiently.

6. Summary

MicroSIMD architecture incorporating subword parallelism is very efficient for application-specific media processors, as well as for fast multimedia information processing in general-purpose microprocessors. This is because, in the large majority of cases, microSIMD architectures can exploit the data-parallelism present in multimedia programs as efficiently as other more expensive parallel architectures. The reduced complexity in register ports and register bypassing in microSIMD architectures results in faster cycle times, less area and less design complexity for the same degree of parallelism as other parallel architectures like VLIW, superscalar, or conventional SIMD or MIMD parallel processor architectures [13].
We pose the problem of finding a small set of fundamental subword permutation operations that can be used efficiently for current and future two-dimensional multimedia programs. Such a subword permutation instruction rearranges data between subword tracks in microSIMD architectures, performing a function like that of interconnection networks which move data between parallel processors in conventional SIMD or MIMD parallel processor architectures. While this initially appears to be an intractable problem, this paper describes a novel approach to solving this problem systematically.

We first describe how 2-dimensional objects are loaded into registers as packed subwords in area-mapped format, corresponding to how 2-dimensional data is usually stored in memory. We use the 2x2 matrix as a basic building block to which 2-dimensional frames of pixels and 2-D objects can be hierarchically decomposed. We then characterize the interesting permutation operations of this basic 2x2 matrix, as well as the four triangles that it contains. These are vertical, horizontal, diagonal, and rotational rearrangements of various kinds.

We define new subword permutation primitives: check, exchange, excheck, and permset. The check instruction allows the downward and upward swapping of elements, the exchange instruction allows the right and left movement, while the excheck instruction allows the rotation of triangles. The mixl and mixr operations defined earlier [4] achieve the swapping of diagonal elements. Permset allows the permutation of a smaller set of subwords to be repeated on other subwords in the source register, enabling symmetric permutations to be specified on many more elements, without increasing the number of permutation control bits. Exchange is one example of the permset instruction.

We then define an initial alphabet (Alphabet A) of subword permutations which contains mix, permset, check and excheck. Processors designed for high performance can implement Alphabet A, while very cost sensitive processors can choose to implement an even smaller set: a minimal alphabet of only mix and permset instructions. The omitted instructions, check and excheck in Alphabet A, can be composed from mix and permset. That this minimal set is essentially equivalent to the set consisting of mix and permute in MAX-2 is a partial validation of the sufficiency of the subword permutation instructions chosen for MAX-2 [4]. We verify that all the 24 permutations of a 2x2 matrix can be obtained using only instructions from Alphabet A, in a single cycle, in a processor with at least two permutation units.

Just as subword parallelism is useful beyond multimedia processing for accelerating all forms of data-parallel computations on lower precision data, we expect that subword permutations will be equally useful. The problem is that there are so many possible rearrangements of a rectangular grid of pixels of arbitrary size that it is extremely difficult to select a set of fundamental permutation primitives, from which all other permutations can be built. This paper has proposed a systematic approach to solving this problem, and has proposed a very small alphabet of fundamental subword permutation primitives for existing and future two-dimensional processing in microSIMD architectures.

7. References

1. Ruby Lee, "Accelerating Multimedia with Enhanced Microprocessors", IEEE Micro, Vol. 15 No. 2, April 1995, pp. 22-32.
2. Marc Tremblay, J. O'Connor, V. Narayanan, and L. He, "VIS Speeds New Media Processing", IEEE Micro, Vol. 16 No. 4, August 1996, pp. 10-20.
3. Alex Peleg and Uri Weiser, "MMX Technology Extension to the Intel Architecture", IEEE Micro, Vol. 16 No. 4, August 1996, pp. 42-50.
4. Ruby Lee, "Subword Parallelism with MAX-2", IEEE Micro, Vol. 16 No. 4, August 1996, pp. 51-59.
5. Ninth Annual Microprocessor Forum, October 21-24, 1996, San Jose, California (Mips and Alpha multimedia).
6. Ruby Lee, "Multimedia Extensions for General-Purpose Processors", IEEE Workshop on Signal Processing Systems SiPS97 Design and Implementation, November 3-5, 1997, Leicester, United Kingdom, pp. 9-23.
7. S. Oberman, F. Weber, N. Juffa, G. Favor, "AMD 3DNow! Technology and the K6-2 Microprocessor", Hot Chips 10 Symposium on High-Performance Chips, August 16-18, 1998, Palo Alto, California, pp. 245-254.
8. J. Golston, "Single-Chip H.324 Videoconferencing", IEEE Micro, Vol. 16 No. 4, August 1996, pp. 21-33.
9. S. Dutta, K. O'Connor, W. Wolf, and A. Wolfe, "A Design Study of a 0.25-um Video Signal Processor", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 8 No. 4, August 1998.
10. Karl Guttag et al, "A Single-Chip Multiprocessor for Multimedia: the MVP", IEEE Computer Graphics and Applications, Vol. 12 No. 6, November 1992, pp. 53-64.
11. P. Foley, "The Mpact Media Processor Redefines the Multimedia PC", Proc. Compcon, IEEE Computer Society Press, Los Alamitos, Calif., 1996, pp. 311-318.
12. C. Basoglu, W. Lee, J. O'Donnell, "The MAP1000 VLIW Mediaprocessor", Equator Technologies Inc.
13. Ruby Lee, "Efficiency of microSIMD architectures and index-mapped data for media processing", Proceedings of IS&T/SPIE Symposium on Electronic Imaging: Media Processors 99, January 1999, San Jose, California.
14. Michael Flynn, "Very High-Speed Computing Systems", Proceedings of IEEE, Vol. 54 No. 12, 1966, pp. 1901-1909.
15. IA-64 Application Developer's Architecture Guide, Intel Corporation, Order Number: 245188-001, May 1999. http://developer.intel.com/design/ia64.
16. AltiVec Extension to PowerPC Instruction Set Architecture Specification. Motorola, Inc., May 1998. http://www.motorola.com/altivec.
17. Ruby Lee, "Precision Architecture", IEEE Computer, Vol. 22, No. 1, Jan 1989, pp. 78-91.