OVER-SAMPLING FOR ACCURATE MASKING THRESHOLD CALCULATION IN WAVELET PACKET AUDIO CODERS

Ferdinan Sinaga #, Eliathamby Ambikairajah # and Andrew P. Bradley *

# School of Electrical Engineering and Telecommunications, The University of New South Wales, NSW 2052, Australia
* Cooperative Research Centre for Sensor Signal and Information Processing (CSSIP), School of Information Technology and Electrical Engineering, The University of Queensland, QLD 4072, Australia

ABSTRACT

Many existing audio coders use a critically sampled discrete wavelet transform (DWT) for the decomposition of audio signals. While the aliasing present in the wavelet coefficients is cancelled in the decoder, these coders normally calculate the simultaneous masking threshold directly on these aliased coefficients. This paper uses over-sampling in the wavelet packet decomposition in order to provide alias-free coefficients for accurate simultaneous masking threshold calculation. The proposed technique is compared with masking threshold calculation based upon the FFT and upon critically sampled wavelet coefficients, and the results show that a bit rate saving of up to 16 kbit/s can be achieved using over-sampling.

Keywords: wavelet packet, over-sampling, simultaneous masking.

1. INTRODUCTION

The discrete wavelet transform (DWT) is a powerful technique for audio coding because the DWT coefficients provide a compact, non-redundant representation of the signal [1, 2]. Moreover, using the more general wavelet packet decomposition, the decomposed sub-bands can be arranged to approximate the critical bands of the human auditory system, allowing the calculation of simultaneous masking thresholds directly from the resulting coefficients. In wavelet-based coders, simultaneous masking threshold calculations are normally performed on the critically sampled wavelet packet coefficients; however, these are known to be affected by the aliasing inherent in the wavelet decomposition structure.
In this paper, we demonstrate that the use of over-sampling in the decomposition stage allows us to accurately calculate the simultaneous masking threshold, thus enabling a lower signal-to-mask ratio. The critically sampled wavelet coefficients are available as a subset of the over-sampled wavelet coefficients, and hence the decoder is unaffected by this masking calculation technique.

In sections 2 and 3, the critically sampled and over-sampled wavelet decompositions are explained respectively, including their relationship. Simultaneous masking for the removal of perceptually redundant signal components is described in section 4. In section 5, an audio coder is proposed that combines critically sampled wavelet packet decomposition with masking threshold calculations based on the over-sampled wavelet packet coefficients. Computational complexity is discussed in section 6, and performance comparisons on the proposed coder are made in section 7.

2. CRITICALLY SAMPLED DISCRETE WAVELET TRANSFORM

The discrete wavelet transform and discrete wavelet packet decomposition have been instrumental in localizing transient events in the time-frequency domain, and have hence found strong applications in audio coding. The DWT is described as follows [3]:

$$ w(2^i, 2^i n) = \frac{1}{\sqrt{2^i}} \sum_{k} \psi\!\left(\frac{k}{2^i} - n\right) s(k), \tag{1} $$

where $\psi$ is the mother wavelet, $s(k)$ is the discrete signal, $2^i$ is the time dilation and $2^i n$ is the time translation of the wavelet transform, indicating decimation. The filter bank implementation of the critically sampled DWT is performed by down-sampling the wavelet coefficients by a factor of two after sub-band filtering using the quadrature mirror filters. For the critically sampled wavelet decomposition, the number of output coefficients is identical to the number of input samples. In this paper, the Daubechies wavelet was selected for the decomposition, with db8 for the first to sixth levels and db2 for the seventh and eighth levels.

0-7803-8549-7/04/$20.00 © 2004 IEEE
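The filter bank view of Section 2 can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it performs a single critically sampled level with the 4-tap db2 filters and assumes periodic signal extension (the paper does not state its boundary handling). The names `db2_filters` and `dwt_level` are hypothetical.

```python
import math


def db2_filters():
    """4-tap Daubechies (db2) analysis filters: low-pass h and high-pass g."""
    s3, norm = math.sqrt(3.0), 4.0 * math.sqrt(2.0)
    h = [(1 + s3) / norm, (3 + s3) / norm, (3 - s3) / norm, (1 - s3) / norm]
    # Quadrature mirror relation: g[k] = (-1)^k * h[K - 1 - k]
    g = [(-1) ** k * h[len(h) - 1 - k] for k in range(len(h))]
    return h, g


def dwt_level(x, h, g):
    """One critically sampled level: filter, then keep every second output.

    Periodic extension is an assumption made for this sketch.
    """
    n = len(x)
    if n % 2:
        raise ValueError("frame length must be even")
    # Stepping m by 2 is the down-sampling by a factor of two.
    approx = [sum(h[k] * x[(m + k) % n] for k in range(len(h))) for m in range(0, n, 2)]
    detail = [sum(g[k] * x[(m + k) % n] for k in range(len(g))) for m in range(0, n, 2)]
    return approx, detail


h, g = db2_filters()
frame = [1.0, 4.0, -2.0, 0.5, 3.0, 3.0, -1.0, 2.0]
approx, detail = dwt_level(frame, h, g)
print(len(frame), len(approx) + len(detail))
```

The two printed numbers are equal, which is the non-redundancy property stated above: the level produces as many coefficients as there are input samples.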

ICCS 2004

3. OVER-SAMPLED DISCRETE WAVELET TRANSFORM

The over-sampled DWT differs from the critically sampled DWT in its filter bank implementation, in that it is performed without down-sampling. Over-sampled wavelet packet (WP) coefficients can be obtained by performing the critically sampled wavelet packet decomposition twice, first without shifting and then with a shift of one sample, and interleaving the resulting wavelet coefficients [4]; however, this is computationally intensive. Over-sampled WP decomposition can be performed more efficiently using the à trous algorithm [3]. The à trous algorithm is performed by inserting $2^i - 1$ zeroes between filter coefficients, where $i$ is the decomposition level, and with no sub-sampling.

The critically sampled wavelet coefficients and the over-sampled WP coefficients are closely related, since the critically sampled wavelet coefficients are contained in the over-sampled WP coefficients. The critically sampled wavelet coefficients can be obtained by down-sampling the over-sampled WP coefficients by a factor of 2 per decomposition level [3]. The coefficients of the over-sampled WP decomposition are closer to those of the continuous wavelet transform than those of the critically sampled WP decomposition [5]. Therefore, the over-sampled WP coefficients derived using the à trous algorithm provide more accurate time-frequency information than the conventional WP decomposition, in addition to shift-invariance, a property that can be important in some applications. The disadvantages are the increased computational complexity and the increased memory required to represent the signal.

4. SIMULTANEOUS MASKING

Auditory masking is a well-known phenomenon in the human auditory system whereby signal components are rendered inaudible by the presence of masking signals that occur within the same critical band. The simultaneous masking model used in this work was obtained from [6]. After obtaining the simultaneous masking threshold (dB), the signal-to-mask ratio (SMR) is calculated using the maximum power (dB) in the processed frame.
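As a small worked example of the SMR step just described, the sketch below takes the coefficients of one band and a masking threshold in dB and returns the SMR as the maximum coefficient power in dB minus the threshold. The masking model of [6] is not reproduced here: the threshold is simply passed in, and the function name `band_smr_db`, the per-band usage and the numerical floor are assumptions made for illustration.

```python
import math


def band_smr_db(band_coeffs, threshold_db):
    """SMR (dB) for one band: maximum coefficient power (dB) minus the threshold (dB)."""
    max_power = max(c * c for c in band_coeffs)
    # The floor is an assumption of this sketch; it only avoids log10(0) on silent frames.
    max_power_db = 10.0 * math.log10(max(max_power, 1e-12))
    return max_power_db - threshold_db


# Hypothetical band: the largest coefficient has unit power (0 dB), the threshold is -20 dB.
print(band_smr_db([0.1, -1.0, 0.5], -20.0))
```

A lower threshold estimate for the same coefficients gives a larger SMR and therefore more bits, which is why the accuracy of the threshold matters for the bit rate.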
The SMR is used in the coder bit allocation algorithm to determine the minimum number of bits needed to represent the input signal in a perceptually lossless fashion. Reducing the number of bits used to represent the audio signal increases the quantisation noise; however, the use of the maximum power in each critical band in the SMR calculation ensures that the maximum quantisation noise is still under the masking threshold.

Temporal masking, which has been shown to provide bit rate reductions of up to 20 kbit/s in wavelet packet-based audio coders, can also be combined [7] with the simultaneous masking to further reduce the bit rate. The functional model used to produce this improvement was based upon the following equation [8]:

$$ TM_F = a\,(b - \log_{10} \Delta t)\,(L_m - c), \tag{3} $$

where $TM_F$ is the amount of forward masking threshold in dB in the $m$th band, $\Delta t$ is the time difference between the masker and the maskee in milliseconds, $L_m$ is the masker level in dB, obtained by taking the average power of all samples in the $m$th critical band, and $a$, $b$ and $c$ are parameters derived from psychoacoustic data [8]. Temporal masking was not used in this work, since the objective of this paper was to evaluate the effect of critically sampled wavelet coefficients on simultaneous masking threshold calculation.

5. AUDIO CODER

The audio coder developed in this work improves upon the simultaneous masking threshold calculation used in existing critically sampled wavelet packet coders. Following the masking threshold calculation, the over-sampled WP coefficients are discarded, whereas the critically sampled WP coefficients obtained from the over-sampled WP coefficients are retained for the actual coding, as seen in Fig. 1.

[Figure 1: Wavelet packet-based audio coder. OS: over-sampled, CS: critically sampled. Blocks: Input; 8 levels of OS-DWT; picking CS-DWT coefficients; masking (simultaneous masking; temporal masking [7]); SMR calculation and bit allocation; quantisation and block companding; coded data.]
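The "picking CS-DWT coefficients" step of the coder can be illustrated with a short sketch. It is an illustration under stated assumptions rather than the coder itself: it uses a full 3-level packet tree instead of the 8-level critical-band tree, db2 filters at every level, periodic extension, and stages counted from 0 so that the first stage uses the unmodified filter. All function names are hypothetical. It builds the over-sampled packet with à trous filtering, keeps every $2^L$-th coefficient of each band, and compares the result with a directly computed critically sampled packet.

```python
import math


def db2_filters():
    """4-tap Daubechies (db2) analysis filters: low-pass h and high-pass g."""
    s3, norm = math.sqrt(3.0), 4.0 * math.sqrt(2.0)
    h = [(1 + s3) / norm, (3 + s3) / norm, (3 - s3) / norm, (1 - s3) / norm]
    g = [(-1) ** k * h[3 - k] for k in range(4)]
    return h, g


def cs_stage(x, f):
    """Critically sampled stage: periodic filtering, keep every second output."""
    n = len(x)
    return [sum(f[k] * x[(m + k) % n] for k in range(len(f))) for m in range(0, n, 2)]


def atrous_stage(x, f, stage):
    """Over-sampled stage: taps spaced 2**stage apart (2**stage - 1 zeros), no down-sampling."""
    n, step = len(x), 2 ** stage
    return [sum(f[k] * x[(m + k * step) % n] for k in range(len(f))) for m in range(n)]


def wavelet_packet(x, h, g, levels, oversampled):
    """Full packet tree; returns 2**levels bands in natural (not frequency) order."""
    bands = [list(x)]
    for stage in range(levels):
        if oversampled:
            bands = [atrous_stage(b, f, stage) for b in bands for f in (h, g)]
        else:
            bands = [cs_stage(b, f) for b in bands for f in (h, g)]
    return bands


def pick_cs(os_bands, levels):
    """Keep every 2**levels-th over-sampled coefficient in each band."""
    return [band[:: 2 ** levels] for band in os_bands]


h, g = db2_filters()
frame = [math.sin(0.7 * n) + 0.3 * math.cos(2.1 * n) for n in range(32)]
os_bands = wavelet_packet(frame, h, g, levels=3, oversampled=True)
cs_bands = wavelet_packet(frame, h, g, levels=3, oversampled=False)
picked = pick_cs(os_bands, levels=3)
max_err = max(abs(p - c) for pb, cb in zip(picked, cs_bands) for p, c in zip(pb, cb))
print(f"largest |picked - CS| = {max_err:.2e}")
```

In this sketch each over-sampled band keeps the full frame length, so the masking calculation sees $2^L$ times more coefficients than the quantiser, while the coefficients passed on for coding are the same as those of a conventional critically sampled decomposition.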

6. COMPUTATIONAL COMPLEXITY COMPARISON

One major disadvantage of the over-sampled DWT using the à trous algorithm compared to the critically sampled DWT is that the computational complexity increases. Nevertheless, the increase in computational complexity occurs in the coding process only, because the decoder simply receives the coded critically sampled data obtained from the over-sampled DWT coefficients.

In order to compare computational complexity between critically sampled WP decomposition and over-sampled WP decomposition, the fully decomposed discrete wavelet packet is considered. In this comparison, the number of multiplications is taken as the measure of computational complexity. The results are seen in Table 1, where N is the number of samples in a frame, L is the number of decomposition levels and K is the number of non-zero filter coefficients (applicable in the à trous algorithm, to avoid multiplication by zero).

Table 1. Computational complexity of fully decomposed critically sampled DWT and over-sampled DWT

Method                    Computational complexity
FFT                       N log N
Critically sampled DWT    L N K
Over-sampled DWT          [garbled in source: "L NK 2 1 L 1"]

As seen in Table 1, the over-sampled wavelet decomposition is more computationally expensive than the critically sampled wavelet decomposition; however, it is still perfectly feasible for applications in which the critically sampled wavelet decomposition is normally used. The FFT also requires less computation than the over-sampled wavelet decomposition, but the small improvement in complexity is easily offset by the reductions in bit rate made possible by the more accurate masking threshold calculation resulting from the use of the over-sampled wavelet decomposition. Moreover, in the MPEG standard, the FFT is only used for masking threshold calculations, while a separate computation is needed for time-frequency mapping. In the DWT, the coefficients are directly used for masking threshold calculation.

In this work, the wavelet packet was not fully decomposed. The decomposition was limited to approximate the critical bands of the human auditory system. This simplification of the full decomposition reduces computational complexity by about 25% in our implementation.

7. PERFORMANCE EVALUATION

In evaluating the efficacy of the over-sampled DWT for bit rate reduction, four audio materials with 44.1 kHz sampling frequency were used. Additionally, the FFT algorithm, as used in MPEG 1 layer I, was also included in the comparison.

7.1. SMR and Bit Rate Comparison

In these experiments, DWT coefficients are scaled to have unity gain, while FFT coefficients have been normalized as per MPEG 1 layer I. This normalization equalizes the signal power of DWT and FFT so that the SMR can be compared, in order to make a meaningful bit rate comparison. The results are shown in Figure 2 and Table 2. The average SMR for each band was calculated, to observe the effect on the bit rate of improved masking threshold calculation from the alias-free wavelet coefficients. Figure 2 shows that the average SMR value of the over-sampled DWT is lower than that of the critically sampled DWT, meaning that the over-sampled DWT can be exploited to reduce the bit rate. The average SMR value of the over-sampled DWT is also lower than that calculated using the FFT.

[Figure 2. Average SMR for FFT, Critically Sampled (CS) DWT and Over-Sampled (OS) DWT]

Table 2. Comparison of bit rate (kbps) for three schemes and four audio materials

Audio Material    Length (sec)    FFT (kbps)    CS DWT (kbps)    OS DWT (kbps)
Africa            12              184.8         160.7            144.4
Pop               25              168.0         162.2            149.1
Tracy Chapman     24              174.2         150.1            137.9
Rahan             22              162.4         164.5            150.9
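The savings implied by Table 2 can be checked with a few lines of arithmetic. The sketch below only transcribes the table and subtracts columns; the dictionary layout and function names are illustrative choices, not part of the original work.

```python
# Bit rates in kbit/s transcribed from Table 2: (FFT, CS DWT, OS DWT).
TABLE_2 = {
    "Africa": (184.8, 160.7, 144.4),
    "Pop": (168.0, 162.2, 149.1),
    "Tracy Chapman": (174.2, 150.1, 137.9),
    "Rahan": (162.4, 164.5, 150.9),
}


def savings_vs_cs(table):
    """Saving of the OS DWT relative to the CS DWT for each item, in kbit/s."""
    return {name: round(cs - os_, 1) for name, (_fft, cs, os_) in table.items()}


def savings_vs_fft(table):
    """Saving of the OS DWT relative to the FFT-based scheme for each item, in kbit/s."""
    return {name: round(fft - os_, 1) for name, (fft, _cs, os_) in table.items()}


for name, saving in savings_vs_cs(TABLE_2).items():
    print(f"{name}: {saving} kbit/s saved relative to CS DWT")
```

Relative to the critically sampled DWT the savings are 16.3, 13.1, 12.2 and 13.6 kbit/s; the largest of these (Africa) corresponds to the "up to 16 kbit/s" figure quoted in the abstract.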

It can be seen from Figure 2 that the aliasing inherent in the critically sampled DWT produces an inter-band smoothing effect on the SMR. Therefore, the difference between critical bands with a high and a low SMR becomes less pronounced, and hence the difference in bits allocated to these bands becomes less pronounced. This means that the critically sampled DWT is less able to take advantage of any narrow band simultaneous masking. It should be noted that the difference between the average level of the critically sampled and over-sampled SMR graphs is due to the fact that the maximum signal power found in the critically sampled DWT will always be less than or equal to that in the over-sampled DWT.

7.2. Subjective Test

Semi-formal subjective tests, involving 14 subjects, were performed on the decoded audio corresponding to the results in Table 2. The subjective tests comply with ITU-R Recommendation BS.1116 [9]. The test procedures were conducted double-blind using A-B-C triple stimuli with hidden reference. These criteria were achieved by using the ABC/HR software [10]. Audio A was the original version, serving as the reference, while B and C were the original and the decoded version, assigned randomly by the software. This is a double-blind criterion, where neither the subject nor the test administrator knows which one of B and C is the reference during the test.

Firstly, the original version (A) was presented as the reference to the subjects. Secondly, the randomly assigned original and decoded versions were presented to the subjects. The subjects could listen to A, B or C as many times as they liked. Thirdly, the subjects were asked to identify B or C as the decoded version after comparing to A. The grading scale was as follows: 1.0 to 1.9 for Very Annoying quality, 2.0 to 2.9 for Annoying quality, 3.0 to 3.9 for Slightly Annoying quality, 4.0 to 4.9 for Perceptible but Not Annoying quality and 5.0 for Imperceptible quality, with 0.1 resolution.
The score was calculated as the subjective difference grade (SDG):

$$ \mathrm{SDG} = \mathrm{Grade}_{\mathrm{decoded}} - \mathrm{Grade}_{\mathrm{original}}, \tag{4} $$

where $\mathrm{Grade}_{\mathrm{decoded}}$ is the score of the audio material that is selected by the subject as the decoded version and $\mathrm{Grade}_{\mathrm{original}}$ is the score of the original version or the reference, which is 5.0. Correct selection of the decoded version results in a negative SDG, while incorrect selection results in a positive SDG. The average SDG is then subtracted from 5.0, the original grade, to give the mean subjective grade (MSG).

After the subjects graded the audio quality, the original and the decoded audio signals were presented to the subject, and then a signal was randomly selected from these two audio signals and presented to the subjects. The subjects then identified the audio signal as the original or the decoded signal. These steps were repeated five times. The probability that the subject is guessing is 0.031 (that is, $0.5^5$) if the subject identifies the signal correctly every time, and 1.0 if the subject identifies it incorrectly every time.

From the data obtained from the subjective tests, the MSG for the critically sampled DWT and the over-sampled DWT is 4.900 and 4.854 respectively. The average probability that the subjects are guessing is 0.45 for the critically sampled DWT and 0.44 for the over-sampled DWT. These numbers show that the MSG of the over-sampled DWT decoded output is very close to that of the critically sampled DWT decoded output, which indicates almost equal quality. Combined with the probability that the subjects are guessing (both probabilities are close to 0.5), this indicates that transparent quality has been achieved.

8. CONCLUSIONS

The use of over-sampling in wavelet packet audio coding has been presented in this paper. It has been shown that bit rate reductions of up to 16 kbit/s can be achieved using over-sampling in a variable bit rate scheme, as compared with the conventional critically sampled DWT or the FFT as used in MPEG 1 layer I. Further, subjective tests have shown that this bit rate reduction can be achieved while maintaining transparent quality in the decoded audio signals.
Future research will concentrate on the integration of functional temporal masking models in an over-sampled wavelet packet audio coder, and the performance of fixed bit rate coders based on over-sampled wavelet packet coefficients.

9. REFERENCES

[1] A. P. Bradley, "Shift-invariance in the discrete wavelet transform," in Proc. of Digital Image Computing: Techniques and Applications (DICTA'03), Sydney, Australia, pp. 29-38, 2003.
[2] A. Cohen and J. Kovacevic, "Wavelets: the mathematical background," Proceedings of the IEEE, vol. 84, pp. 514-522, 1996.
[3] M. J. Shensa, "The discrete wavelet transform: wedding the à trous and Mallat algorithms," IEEE Transactions on Signal Processing, vol. 40, pp. 2464-2482, 1992.
[4] Y. Andreopoulos, M. V. Schaar, A. Munteanu, J. Barbarien, P. Schelkens, and J. Cornelis, "Complete-to-overcomplete discrete wavelet transforms for scalable video coding with MCTF," Visual Communication and Image Processing, vol. 5150, pp. 719-731, 2003.
[5] H. Guo and C. S. Burrus, "Convolution using the undecimated discrete wavelet transform," in Proc. of Acoustics, Speech and Signal Processing, pp. 1291-1294, 1996.

[6] M. Black and M. Zeytinoglu, "Computationally efficient wavelet packet coding of wide-band stereo audio signals," in Proc. of International Conference on Acoustics, Speech, and Signal Processing, pp. 3075-3078, 1995.
[7] F. Sinaga, T. S. Gunawan, and E. Ambikairajah, "Wavelet packet based audio coding using temporal masking," in Proc. of International Conference on Information, Communications and Signal Processing and Pacific-Rim Conference on Multimedia, Singapore, 2003.
[8] W. Jesteadt, S. P. Bacon, and J. R. Lehman, "Forward masking as a function of frequency, masker level, and signal delay," Journal of Acoustic Society of America, vol. 71, pp. 950-962, 1982.
[9] ITU, "ITU-R BS.1116.1, Methods for the subjective assessment of small impairments in audio systems including multichannel sound systems," International Telecommunication Union, Geneva, 1997.
[10] ff123, "ABC/Hidden Reference: Tool for comparing multiple audio samples," 2002.