Implementation of Different Architectures of Forward 4x4 Integer DCT For H.264/AVC Encoder

Bunji Antoinette Ringnyu¹, Ali Tangel², Emre Karabulut³

¹ Kocaeli University, Institute of Science and Technology, Kocaeli, Turkey, unjintoinetteringnyu@gmail.com
² Kocaeli University, Department of Electronics and Communication Engineering, Kocaeli, Turkey, tngel@kocaeli.edu.tr
³ YONGATEK, Teknopark Istanbul, Turkey, emre.krulut.dde@gmail.com

Abstract

This paper presents an overview and different implementations of the 4x4 Integer Discrete Cosine Transform (DCT) used in the H.264 standard, together with the utilization report and the maximum operating frequency of each implementation. The H.264 standard specifies the use of the integer DCT to transform the results of inter prediction and intra prediction from the spatial domain to the frequency domain. We implemented the direct 2-D multiplication, the 1-D (butterfly) method and the 2-D multiplication with adders. The results show that the 2-D with adders and the 1-D implementations outperform the direct 2-D multiplication. However, the 2-D with adders uses fewer resources than the 1-D butterfly and still achieves a relatively high frequency.

Keywords: H.264/AVC, Integer DCT, Image Compression, VHDL.

1. Introduction

Image compression has progressed significantly, from H.261 and JPEG in the 1980s to MPEG-1 and MPEG-2 in the 1990s. As the internet grew popular in the 2000s, standards such as MPEG-4, H.264 (MPEG-4 Part 10/AVC), MP4 and HEVC emerged. Of all the standards mentioned above, H.264 is the most widely used codec. Compared to previous standards, H.264 provides about 50% higher compression over a wide range of bit rates and high video resolutions. Although its coding algorithms are based on the same block-based motion compensation and transform-based spatial coding framework as prior video coding standards, H.264 has many innovations compared to the older standards [1], such as hybrid predictive/transform coding of intra frames and integer transforms.
As in the existing standards, AVC encoding is accomplished through many blocks, which include Motion Estimation and Motion Compensation, Intra Prediction, Transform and Quantization, Inverse Transform and Quantization, and the Entropy Encoder. Fig. 1 shows a block diagram of the H.264/AVC encoder scheme [2].

The Motion Estimation module is used to identify and eliminate the temporal redundancies that exist between individual frames. It uses motion vectors that describe the transformation of the image from one frame to the next. Motion vectors may be applied to the whole image, in which case we have global motion estimation, to parts of the image, in which case it becomes local motion estimation, or even per pixel. Motion Compensation (MC) decodes the image that is encoded by Motion Estimation [2], [3].

The inputs to the inter prediction and intra prediction blocks are macroblocks. These blocks are encoded in either inter or intra mode. In inter mode, the prediction is formed by motion-compensated prediction from one or two reference picture(s) selected from the set of list 0 and/or list 1 reference pictures [4]. Where motion estimation cannot be exploited, intra mode is used to eliminate spatial redundancies by predicting the current block through extrapolation of the neighboring pixels of adjacent blocks in a defined set of directions.

Fig. 1. Diagram of AVC encoding scheme [3].

To transform the results of inter prediction and intra prediction from the spatial domain to the frequency domain, the integer DCT is used. This is usually achieved with a 4x4 DCT. In the case of a macroblock coded in 16x16 intra prediction mode, its luma pixels are first transformed using the 4x4 DCT and, as a second step, the gathered 4x4 block of DC coefficients is transformed again using a 4x4 Hadamard transform [5]. The coefficients from the Transform block are quantized to remove unimportant information, keeping only the significant coefficients to represent the residual frame. At the level of the transformation block, the overall precision of the integer coefficients is reduced, leading to the elimination of high-frequency coefficients.

There are several papers [6, 7, 9, 10] discussing hardware implementations of Transform and Quantization. In this paper we focus on the Transform block. The core transform matrix is a 4x4 matrix which can be implemented using 2-D matrix multiplication or the popular butterfly algorithm, which is a 1-D method using additions, subtractions and shift operations along the rows and then along the columns. In addition to these two methods, we implement the 4x4 DCT using 2-D multiplication with addition operations only, and compare the final results of the three architectures to see which of them uses fewer resources. These architectures are implemented in VHDL.

The rest of the paper is organized as follows. Section 2 presents an overview of the H.264 transform algorithm. Section 3 presents the implementation of the three different architectures for the 4x4 Integer DCT block. The synthesis and results are presented in Section 4. The paper is concluded in Section 5.

2. Overview of H.264 Transform

In the AVC standard, the residual frame of the prediction, which is the difference between the original frame and the predicted frame, is partitioned into fixed-size macroblocks. A macroblock is composed of 16x16 luminance (Y) samples, 8x8 chroma blue (Cb) samples and 8x8 chroma red (Cr) samples in the case of the 4:2:0 chroma subsampling format. A block diagram of these three sample blocks is shown in Fig. 2.
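The residual formation and macroblock layout described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the helper names (`residual`, `split_into_blocks`), the sample pixel values and the raster ordering of the 4x4 blocks are all introduced here and are not taken from the paper, whose actual block processing order is the one given in its Fig. 2.

```python
# Illustrative sketch (not the authors' code): form a residual macroblock
# and cut it into 4x4 blocks for the transform stage.

MB = 16   # luma macroblock size (16x16 samples)
BLK = 4   # transform block size (4x4 samples)

def residual(original, predicted):
    """Sample-wise difference between the original and predicted blocks."""
    return [[o - p for o, p in zip(orow, prow)]
            for orow, prow in zip(original, predicted)]

def split_into_blocks(frame, size=BLK):
    """Return the size x size sub-blocks of `frame` in raster order (assumed order)."""
    blocks = []
    for r in range(0, len(frame), size):
        for c in range(0, len(frame[0]), size):
            blocks.append([row[c:c + size] for row in frame[r:r + size]])
    return blocks

# Hypothetical 8-bit luma data: a ramp as the "original", flat grey as the prediction.
original = [[(r * MB + c) % 256 for c in range(MB)] for r in range(MB)]
predicted = [[128] * MB for _ in range(MB)]

luma_residual = residual(original, predicted)
luma_blocks = split_into_blocks(luma_residual)

# In 4:2:0 each chroma plane (Cb, Cr) is 8x8, i.e. four 4x4 blocks per plane.
chroma_blocks_per_plane = (8 // BLK) ** 2
```

With 8-bit samples each residual value lies between -255 and 255, so a 9-bit signed word is enough at the input of the transform.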
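Fig. 2. Processing order of blocks in a macroblock [3].

At the sub-macroblock level, macroblocks are subdivided into sub-blocks of 4x4 samples for encoding. Because there are three different kinds of samples, three different transforms are used in H.264/AVC. According to the block diagram of the H.264 transform component illustrated in Fig. 3, the residue is transformed using the integer DCT. In the 16x16 intra prediction mode, the DC coefficients of all transformed residual blocks are grouped into a 4x4 array before being sent to the Hadamard transform. These processes are described as mathematical models in Sections 2.1 and 2.2.

Fig. 3. Block diagram of H.264 Transform Component.

2.1 4x4 Forward Transforms

The DCT has been used both in previous image compression standards (such as the 8x8 DCT) and in existing ones (such as the integer DCT used in H.264 and H.265). H.264/AVC is based on a 4x4 integer DCT that can be computed exactly with integer arithmetic in order to avoid inverse transform mismatch problems. There are two types of 4x4 integer transforms for residual coding. The first one is for luminance residual blocks and is described by (1) [11]:

$$Y = C X C^{T} \qquad (1)$$

where $X$ is the 4x4 residual input of the Transform block and $C$ is specified by

$$C = \begin{bmatrix} a & a & a & a \\ b & c & -c & -b \\ a & -a & -a & a \\ c & -b & b & -c \end{bmatrix}, \qquad a = \frac{1}{2}, \quad b = \sqrt{\frac{1}{2}}\cos\left(\frac{\pi}{8}\right), \quad c = \sqrt{\frac{1}{2}}\cos\left(\frac{3\pi}{8}\right)$$

This can be factorized as

$$Y = (C X C^{T}) \otimes E \qquad (2)$$

where, in (2), $C$ and $E$ are

$$C = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & d & -d & -1 \\ 1 & -1 & -1 & 1 \\ d & -1 & 1 & -d \end{bmatrix}, \qquad E = \begin{bmatrix} a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \\ a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \end{bmatrix}, \qquad d = \frac{c}{b} \approx 0.414$$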

Here $E$ is a matrix of scaling factors, and the symbol $\otimes$ means that each component of $C X C^{T}$ is multiplied by the corresponding coefficient in $E$. To reduce the hardware cost of the transform, the constant $d$ is approximated by 0.5 and the constant $b$ by $\sqrt{2/5}$. The final forward transform expression becomes [3], [12]:

$$Y = (C_f X C_f^{T}) \otimes E_f \qquad (3)$$

$$C_f = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix}, \qquad E_f = \begin{bmatrix} a^2 & ab/2 & a^2 & ab/2 \\ ab/2 & b^2/4 & ab/2 & b^2/4 \\ a^2 & ab/2 & a^2 & ab/2 \\ ab/2 & b^2/4 & ab/2 & b^2/4 \end{bmatrix}$$

The scaling matrix $E_f$ can therefore be incorporated into the quantization process, and $C_f X C_f^{T}$ becomes the core of the 2-D integer forward transform. This core can be computed using two 1-D transforms. The first 1-D transform is applied to the rows of the incoming residue; the second is then applied to the columns. This is what is popularly known as the butterfly algorithm. Since the 4x4 transform matrix has only 1, -1, 2 and -2 as coefficients, its butterfly is as shown below:

Fig. 4. Butterfly Diagram of 4x4 Integer DCT.

2.2 4x4 Hadamard Transform

The other kind of transform is the Hadamard Transform (HT). It is applied to the luminance DC terms in 16x16 intra prediction mode. The Hadamard transform is defined by equation (4):

$$Y = H_f X H_f^{T}, \qquad H_f = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix} \qquad (4)$$

The Hadamard transform matrix is very similar to the forward transform matrix, the only difference being that each 2 in the forward transform is replaced with 1 in the Hadamard transform. The butterfly of the Hadamard transform is as shown below:

Fig. 5. Butterfly Diagram of 4x4 Integer DCT.

3. Implementation of 4x4 DCT Transform

There are several papers discussing the VLSI implementation of the 2-D integer transform for H.264. Implementations of a fast 2-D transform can be classified into two categories: the row/column decomposition (1-D) approach and the direct 2-D approach. Although the direct 2-D approach requires more resources, it is implemented here for comparison with the other architectures. The 2-D transform is also implemented with full adders only. Finally, the butterfly is implemented.

3.1 Direct 2-D Multiplication

This is implemented using normal matrix multiplication controlled by a Finite State Machine (FSM). The FSM consists of the state INITIALIZATION, then MULT1, which performs the first matrix multiplication of equation (1), namely C*X, and finally the state MULT2, which performs the second transform. All the other architectures are implemented using similar states.

3.2 2-D Multiplication with Adders

Instead of performing the multiplication directly, this architecture replaces the multiplications (*) with full adders. This is accomplished with the help of the concatenation operator in VHDL.

3.3 Implementation of the 2-D Transform using the 1-D Approach

This is accomplished by first performing the butterfly algorithm on the rows. The transpose of the resulting output is taken and the second butterfly algorithm is applied, so that this second butterfly operates on the columns. A final transpose is taken to obtain the required output. The block diagram of this stage is shown below:

Fig. 6. Block of the Integer DCT using the 1-D approach.
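The equivalence between the direct 2-D multiplication and the row/column (1-D butterfly) approach can be checked with a short Python sketch. This is an illustration only and not the authors' VHDL: the function names and the sample residual block are assumptions introduced here, and the butterfly is written as the usual two-stage add/subtract/shift network for the core matrix of equation (3).

```python
# Illustrative model of two of the architectures (names are hypothetical).

CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]

def matmul(a, b):
    """Plain 4x4 integer matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def direct_2d(x):
    """Direct 2-D form: Cf * X, followed by (Cf * X) * Cf^T."""
    return matmul(matmul(CF, x), transpose(CF))

def butterfly_1d(v):
    """1-D transform of four samples using only adds, subtracts and shifts."""
    x0, x1, x2, x3 = v
    s0, s1 = x0 + x3, x1 + x2      # first stage: sums
    d0, d1 = x0 - x3, x1 - x2      # first stage: differences
    return [s0 + s1,               # row (1,  1,  1,  1)
            (d0 << 1) + d1,        # row (2,  1, -1, -2)
            s0 - s1,               # row (1, -1, -1,  1)
            d0 - (d1 << 1)]        # row (1, -2,  2, -1)

def row_column_1d(x):
    """Butterfly on rows, transpose, butterfly again, final transpose."""
    stage1 = [butterfly_1d(row) for row in x]
    stage2 = [butterfly_1d(row) for row in transpose(stage1)]
    return transpose(stage2)

# Hypothetical residual block used only for the comparison.
X = [[5, 11, 8, 10],
     [9, 8, 4, 12],
     [1, 10, 11, 4],
     [19, 6, 15, 7]]

Y_direct = direct_2d(X)
Y_butterfly = row_column_1d(X)
```

Both functions return the same 4x4 coefficient block; the butterfly needs eight additions or subtractions and two shifts per 1-D pass, while the direct form evaluates sixteen multiply-accumulate inner products per matrix product.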

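The adder-only idea of Section 3.2 can be sketched in the same spirit. The paper does not spell out how the concatenation operator is used, so the sketch below assumes one plausible reading: a multiplication by 2 is obtained by appending a '0' bit to the two's-complement word (the analogue of `v & '0'` in VHDL), and every other coefficient is handled by addition or subtraction. The function names and the 9-bit and 12-bit word widths are assumptions made for this illustration.

```python
# Sketch only: multiplier-free evaluation of Cf * X * Cf^T.
# Names and bit widths are illustrative assumptions, not taken from the paper.

CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]

def to_bits(value, width):
    """Two's-complement bit string of `value` using `width` bits."""
    return format(value & ((1 << width) - 1), "0{}b".format(width))

def from_bits(bits):
    """Signed integer represented by a two's-complement bit string."""
    value = int(bits, 2)
    return value - (1 << len(bits)) if bits[0] == "1" else value

def times2_concat(value, width):
    """Double a value by appending a '0' bit to its bit string."""
    return from_bits(to_bits(value, width) + "0")

def dot_adders(coeffs, vec, width):
    """Inner product with coefficients in {1, -1, 2, -2}, no multiplier used."""
    acc = 0
    for c, v in zip(coeffs, vec):
        term = times2_concat(v, width) if abs(c) == 2 else v
        acc = acc + term if c > 0 else acc - term
    return acc

def matmul_adders(a, b, width):
    """a * b where `a` holds the small integer coefficients."""
    cols = list(zip(*b))
    return [[dot_adders(row, col, width) for col in cols] for row in a]

def forward_adders(x):
    """Cf * X * Cf^T computed with additions, subtractions and concatenation."""
    stage1 = matmul_adders(CF, x, width=9)          # Cf * X, 9-bit residual inputs
    stage1_t = [list(r) for r in zip(*stage1)]
    stage2 = matmul_adders(CF, stage1_t, width=12)  # Cf * (Cf * X)^T
    return [list(r) for r in zip(*stage2)]          # transpose gives Cf * X * Cf^T
```

The 12-bit width of the second stage follows from the first stage: each output there is a sum of four 9-bit inputs weighted by at most 2, so its magnitude stays below 6 x 256 = 1536.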
4. Results and Discussions

The implementation is done both in VHDL and in Matlab, as explained below. First, the residue values are generated in Matlab and written to a dumper (a text file). The residue values are then read from the dumper by both Matlab and VHDL, and the algorithms are executed. The results from the Transform block are again dumped to two separate text files and, lastly, compared to make sure that they are the same. The whole process, from generating the residue values to comparing the results, is depicted in Fig. 7. This process was repeated for different sets of data to confirm that the algorithms were working correctly and that no bits were lost, especially in VHDL.

Fig. 7. Block diagram of the synthesis and simulation processes.

Finally, synthesis was done to view the utilization analysis and the maximum frequency supported by each architecture. The results are presented in Table 1. The direct 2-D multiplication does the operation with multipliers, using 0 Slice LUTs and 45 Slice Registers. The architecture also has a maximum operating frequency of 5 MHz. This confirms the reasons why multipliers are discouraged in the implementation of the integer DCT: they use more resources and operate at low frequencies. The 1-D approach, just as expected, uses fewer resources (66 Slice LUTs and 730 Slice Registers) and has a maximum frequency of 00 MHz. The 2-D with adders, on the other hand, achieves the same maximum frequency as the 1-D method (00 MHz) while using the least resources (794 Slice LUTs and 4 Slice Registers).

Table 1. Synthesis Report of the Three Different Architectures

Architecture                      Slice LUTs   Slice Registers   Maximum Operating Frequency
Direct 2-D Multiplication         0            45                5 MHz
2-D Multiplication with Adders    794          4                 00 MHz
The 1-D Approach                  66           730               00 MHz

The use of a 2-D integer DCT is generally not advisable because of the hardware resources it consumes. However, it can be seen from the table above that the 2-D integer DCT implemented with full adders can still be used in H.264/AVC encoders.

5. Conclusion

This paper has presented the maximum frequency and resource usage improvements for the 2-D integer DCT in the H.264 encoder. The 2-D with adders and the 1-D architectures achieved a maximum frequency about 33.3 % higher than that of the 2-D architecture with multipliers. Even though both architectures reach relatively high frequencies, the 2-D with adders uses fewer hardware resources. With this maximum frequency and its low hardware utilization, the 2-D architecture with adders can also be used in systems where the 1-D architecture is used, especially in real-time applications such as mobile communication and video broadcasting. Even though the 2-D implementation is not always used, the one implemented here can still be used in low-frequency systems such as video compression for storage devices like DVDs, and in digital signal processing.

6. Acknowledgement

We would like to thank Mr. Muhammed Aslam for his constant support during this project, and the entire staff of YONGATEK, Teknopark Istanbul, for providing a suitable environment for this research.

7. References

[1] Meihua Gu et al., Hardware Prototyping for Various Transforms in H.264 High Profile, Journal of Information & Computational Science, 0, pp. 9 8.
[2] Thomas Wiegand and Gary J. Sullivan, The H.264/MPEG4 Advanced Video Coding Standard and its Applications, IEEE Signal Processing Magazine, August 2006, pp. 134-143.
[3] H. S. Malvar, A. Hallapuro, M. Karczewicz and L. Kerofsky, Low Complexity Transform and Quantization in H.264/AVC, IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, July 2003, pp. 560-576.
[4] I. E. G. Richardson, H.264 and MPEG-4 Video Compression, John Wiley and Sons, West Sussex, UK, 2003.
[5] Meihua Gu et al., Hardware Prototyping for Various Transforms in H.264 High Profile, Journal of Information & Computational Science, 0, pp. 9 8.
[6] H. S. Malvar, A. Hallapuro, M. Karczewicz and L. Kerofsky, Low Complexity Transform and Quantization in H.264/AVC, IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, July 2003, pp. 560-576.
[7] Meihua Gu et al., Hardware Prototyping for Various Transforms in H.264 High Profile, Journal of Information & Computational Science, 0, pp. 9 8.
[8] H. Kalva and J. B. Lee, The VC-1 and H.264 Video Compression Standards for Broadband Video Services, Springer, New York, USA, 2008.
[9] Charles S. Lubobya, Mqele M. Dlodlo, Gerhard De Jager and Keith L. Ferguson, Optimization of 4x4 Integer DCT in H.264/AVC Encoder, Council for Scientific and Industrial Research.
[10] Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification, ITU-T Rec. H.264 and ISO/IEC 14496-10 AVC, 2003.
[11] I. E. G. Richardson, H.264 and MPEG4 Video Compression: Video Coding for Next Generation Multimedia, New York: Wiley, 2003.
[12] Xuan-Tu Tran and Van-Huan Tran, An Efficient Architecture of Forward Transforms and Quantization for H.264/AVC Codecs, Journal on Electronics and Communications, Vol. 1, No. 2, April-June 2011, pp. 122-129.