arXiv v1 [eess.AS], 30 Dec 2017


LOGARITHMIC FREQUENCY SCALING AND CONSISTENT FREQUENCY COVERAGE FOR THE SELECTION OF AUDITORY FILTERBANK CENTER FREQUENCIES

Shoufeng Lin

Department of Electrical and Computer Engineering, Curtin University
Kent Street, Bentley, Perth, Western Australia, 6102
shoufeng.lin@postgrad.curtin.edu.au; ee.linsf@gmail.com

ABSTRACT

This paper provides new insights into the problem of selecting filter center frequencies for auditory filterbanks. We propose a constant frequency distance and a consistent frequency coverage as the two metrics that motivate logarithmic frequency scaling and a regularized selection of center frequencies. The frequency scaling and the consistent frequency coverage are derived from a common harmonic speaker signal model. Furthermore, we find that the existing linear equivalent rectangular bandwidth (ERB) function, as well as any possible linear ERB approximation, can also lead to a consistent frequency coverage. The results are verified and demonstrated using the gammatone filterbank.

Index Terms: auditory filterbank, speech signal processing, frequency scaling, frequency coverage, ERB.

1. INTRODUCTION

Auditory filterbanks have been widely accepted and applied in numerous speech signal processing algorithms, especially in the area of computational auditory scene analysis (CASA) [1], for applications including speech enhancement, recognition and transcription. A typical auditory filterbank is specified by two parts, i.e. the filter type and the center frequencies of the filters. Common filter types include the gammatone, the gammachirp, and their variants [2], which simulate the auditory response of human listeners. The choice of center frequencies of the auditory filters has evolved from the earlier critical bandwidth and critical-band-rate scale [3], to the polynomial approximation of the equivalent rectangular bandwidth (ERB) [4], and the currently well-accepted linear ERB [5], as well as their corresponding ERB-rate scales (ERBS).
Although the linear ERB approximation in [5] has been found useful in practical implementations, it is based on experimental findings obtained through psychoacoustic measurement and curve-fitting. Logarithmic frequency scales have also been applied [6, 7, 8]. However, the selection of the number of subbands for a given frequency range remains empirical for both the ERB-rate scale and the logarithmic scale.

In this paper, we further investigate the frequency scaling and provide new insights, including a newly proposed frequency coverage metric and the derivation of a new frequency scaling function that leads to consistent frequency coverage for auditory filterbanks. Moreover, based on the proposed definition of frequency coverage, we also derive an expression for the frequency coverage of the existing linear ERB.

2. EQUIVALENT RECTANGULAR BANDWIDTH SCALE

The ERB of a particular filter is defined as the bandwidth of a rectangular filter that passes the same energy as that filter [4, 5]. The relationship between the ERB of the human auditory filter and the center frequency has been studied extensively, using analytical expressions to approximate measurement data from psychoacoustic experiments. An early approximation has the polynomial form [4]

$$\widehat{\mathrm{ERB}}(f) = a f^2 + b f + c, \tag{1}$$

where $f$ is the frequency in Hz, and $a, b, c \in \mathbb{R}$ are parameters. However, one of the most widely accepted analytical approximations over the past decades has been the linear form [5]

$$\widetilde{\mathrm{ERB}}(f) = 24.7\,(0.00437 f + 1) = 24.7 + 0.108 f. \tag{2}$$

Each ERB corresponds to a constant distance along the basilar membrane in the cochlea [9, 5]. The ERB-rate scale (ERBS) has been developed to scale frequency in units of the ERB, by solving the integral [4, 5]

$$\widetilde{\mathrm{ERBS}}(f) = \int \frac{1}{\widetilde{\mathrm{ERB}}(f)}\, df, \tag{3}$$

with the boundary condition

$$\widetilde{\mathrm{ERBS}}(0) = 0. \tag{4}$$
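As a worked step that is not part of the original text, the integral in (3) can be evaluated in closed form once the linear form (2) is substituted. The sketch below assumes that the boundary condition (4) fixes the lower integration limit at 0, and uses $x$ as a dummy integration variable (an illustrative choice of notation):

```latex
\begin{align*}
\widetilde{\mathrm{ERBS}}(f)
  &= \int_{0}^{f} \frac{dx}{24.7\,(0.00437\,x + 1)} \\
  &= \frac{1}{24.7 \times 0.00437}\,\ln(0.00437 f + 1) \\
  &= \frac{\ln 10}{0.107939}\,\lg(0.00437 f + 1) \\
  &\approx 21.3\,\lg(0.00437 f + 1).
\end{align*}
```

The coefficient evaluates to roughly 21.3 with these rounded constants. The ERBS expression quoted from [5] uses 21.4; the small difference presumably comes from rounding of the constants in (2).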

Using (2) in (3) and (4) yields [5]

$$\widetilde{\mathrm{ERBS}}(f) = 21.4\,\lg(0.00437 f + 1). \tag{5}$$

The ERB and ERBS given in (2) and (5) have been applied in numerous auditory studies for selecting the center frequencies of the auditory filterbank [10]. Yet the ERB approximation is still obtained by curve-fitting to experimental data, and the number of subbands for a given frequency range is still an empirical parameter.

3. SUGGESTED FREQUENCY SCALING AND COVERAGE

3.1. Speaker Signal Model

Based on the source-excitation and vocal-tract models of speech production [11], as well as the amplitude modulation (AM) and frequency modulation (FM) structure [12], a harmonic model is used for the speaker signal:

$$s_q(t) = \sum_{h=1}^{H_q} s_q^{(h)}(t), \tag{6}$$

$$s_q^{(h)}(t) = A_q^{(h)}(t)\,\cos\!\left(h\,\omega_q t + \varphi_q^{(h)}(t)\right), \tag{7}$$

where $t \in \mathbb{R}$ is continuous time, $s_q(t)$ the speech signal from the $q$-th speaker, $q = 1, \ldots, Q$, integer $Q$ the number of concurrent speakers, $s_q^{(h)}(t)$ the $h$-th harmonic of speaker $q$, integer $h$ the order of harmonics for a speaker, integer $H_q$ the maximum order of harmonics for speaker $q$, $A_q^{(h)}(t)$ the envelope of each harmonic, $\varphi_q^{(h)}(t) \in \mathbb{R}$ the phase (which is short-time constant for speech signals), and $\omega_q > 0$ the (angular) fundamental frequency.

With appropriate selection of filter center frequencies, the auditory filterbank ideally separates into subbands the harmonic components of not only a single speaker but also multiple concurrent speakers, based on the time-frequency sparsity assumption of speech signals [13].

3.2. Logarithmic Frequency Scaling

In practice, concurrent speakers usually have different fundamental frequencies. We can thus denote the fundamental frequencies of two speakers as $f_1, f_2$ ($f_1 = \omega_1/2\pi$, $f_2 = \omega_2/2\pi$, $f_1 \neq f_2$), and their difference is

$$\Delta f = f_1 - f_2. \tag{8}$$

Thus from (7) the frequency difference of their $h$-th harmonics is $h\,\Delta f$.
This means that their harmonics (of the same order) are more distant at higher frequencies on the linear frequency scale, which makes selection of the filterbank center frequencies difficult for a regular per-speaker estimate. We thus propose a frequency scaling function $\Upsilon(\cdot)$ that satisfies (9), so that speech components of separate speakers appear equidistantly with respect to (w.r.t.) $h$:

$$\Upsilon(h f_1) - \Upsilon(h f_2) \equiv \text{Constant}, \quad \text{w.r.t. } h. \tag{9}$$

The logarithmic functions are functional solutions to (9):

$$\Upsilon(f) = A \log_B(f) + C, \tag{10}$$

where $A > 0$, $B > 1$, $C \in \mathbb{R}$. They also have better resolution at lower frequencies, which aligns with the fact that most speech energy falls in low frequencies (e.g. fundamental frequencies and their lower-order harmonics). We can easily verify from (10) that $\Upsilon(h f_1) - \Upsilon(h f_2) \equiv A \log_B(f_1/f_2)$, which is constant with respect to $h$.

Denote the ratio of center frequency to bandwidth as $\eta^{(b)}$ for filter band $b$ ($b = 1, \ldots, N_b$, where integer $N_b > 1$ is the number of filter bands), i.e.

$$\eta^{(b)} = \frac{f_c^{(b)}}{f_{\mathrm{BW}}^{(b)}}, \tag{11}$$

where $f_{\mathrm{BW}}^{(b)}$ and $f_c^{(b)}$ denote the bandwidth and center frequency of filter band $b$, respectively. $\eta^{(b)}$ is also referred to as the quality factor (Q-factor) of subband $b$.

Denote the frequency range of interest as $[f_{\min}, f_{\max}]$, where $f_{\max} > f_{\min} > 0$. Assuming that the center frequencies of the filter bands are equidistantly spaced on the proposed frequency scale over this range, we have

$$\Upsilon\!\left(f_c^{(b)}\right) = \frac{(N_b - b)\,\Upsilon(f_{\min}) + (b - 1)\,\Upsilon(f_{\max})}{N_b - 1}, \tag{12}$$

and

$$f_c^{(b)} = \Upsilon^{-1}\!\left(\Upsilon\!\left(f_c^{(b)}\right)\right), \tag{13}$$

where $\Upsilon^{-1}(\cdot)$ denotes the inverse function of $\Upsilon(\cdot)$. From (10), (12) and (13), we get for the new logarithmic frequency scaling

$$f_c^{(b)} = \Upsilon^{-1}\!\left(\frac{(N_b - b)\,\Upsilon(f_{\min}) + (b - 1)\,\Upsilon(f_{\max})}{N_b - 1}\right) = B^{\frac{1}{A}\left[\frac{(N_b - b)\,\Upsilon(f_{\min}) + (b - 1)\,\Upsilon(f_{\max})}{N_b - 1} - C\right]}. \tag{14}$$

3.3. Proposed Frequency Coverage

The auditory filterbank requires sufficient frequency coverage to capture all harmonic components of concurrent speakers. Here we propose to define the frequency coverage of the filterbank on the proposed frequency scale as

$$\eta_c^{(b)} = \frac{\Sigma_f^{(b)}}{\Delta_f^{(b)}}, \tag{15}$$

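where $\Delta_f^{(b)}$ and $\Sigma_f^{(b)}$ denote the distance between consecutive filter bands and half of the sum of their bandwidths, as shown in (16) and (17), respectively:

$$\Delta_f^{(b)} = f_c^{(b+1)} - f_c^{(b)}, \tag{16}$$

and

$$\Sigma_f^{(b)} = \frac{1}{2}\left(f_{\mathrm{BW}}^{(b+1)} + f_{\mathrm{BW}}^{(b)}\right). \tag{17}$$

Apparently $\eta_c^{(b)} = 1$ gives full coverage for ideal brickwall bandpass filters with no overlap. For a practical auditory filterbank, however, the filters always have a finite roll-off rate, so reasonable overlap is required for full coverage, leading to $\eta_c^{(b)} > 1$. Also, depending on the application, we may have $\eta_c^{(b)} < 1$ when full coverage is not required.

Therefore from (11), (14) and (15), when $\eta^{(b)} = \eta^{(b+1)} = \eta$, we have

$$\begin{aligned}
\eta_c^{(b)} &= \frac{1}{2\eta}\,\frac{f_c^{(b+1)} + f_c^{(b)}}{f_c^{(b+1)} - f_c^{(b)}}
= \frac{1}{2\eta}\,\frac{f_c^{(b+1)}/f_c^{(b)} + 1}{f_c^{(b+1)}/f_c^{(b)} - 1} \\
&= \frac{1}{2\eta}\,\frac{B^{\frac{\Upsilon(f_{\max}) - \Upsilon(f_{\min})}{A (N_b - 1)}} + 1}{B^{\frac{\Upsilon(f_{\max}) - \Upsilon(f_{\min})}{A (N_b - 1)}} - 1}
= \frac{1}{2\eta}\,\frac{\left(\frac{f_{\max}}{f_{\min}}\right)^{\frac{1}{N_b - 1}} + 1}{\left(\frac{f_{\max}}{f_{\min}}\right)^{\frac{1}{N_b - 1}} - 1},
\end{aligned} \tag{18}$$

which clearly shows that the frequency coverage on the logarithmic frequency scaling is consistent over the frequency range, i.e. if the Q-factor $\eta^{(b)}$ is constant w.r.t. the subband index $b$, the resulting $\eta_c^{(b)}$ is also constant.

3.4. Frequency Coverage of the Existing ERBS

The existing ERB function (2) does not lead to a constant Q-factor $\eta^{(b)}$; here we investigate its corresponding frequency coverage by applying the definition in (15). Denote the general form of the ERB in (2) as

$$\hat{\upsilon}(f) = D + E f, \tag{19}$$

where $D, E > 0$. When $D = 24.7$ and $E = 0.108$ we have (2). The resulting ERBS following the process of (3) and (4) becomes

$$\hat{\Upsilon}(f) = \hat{E}\,\lg(1 + \hat{D} f), \tag{20}$$

where

$$\hat{D} \triangleq \frac{E}{D}, \tag{21}$$

and

$$\hat{E} \triangleq \frac{1}{E \lg e}. \tag{22}$$

Assume that the filter bandwidth is a constant multiple of the ERB, which is true for some auditory filters, e.g. the gammatone filter [2], i.e.

$$f_{\mathrm{BW}}^{(b)} = K\,\hat{\upsilon}\!\left(f_c^{(b)}\right), \tag{23}$$

where $K > 0$ is a constant. Note that the Q-factor is then not constant, as $D \neq 0$. Therefore, selecting $f_c^{(b)}$ equidistantly on the scale $\hat{\Upsilon}(f)$, similar to (14), we have

$$\begin{aligned}
f_c^{(b)} &= \hat{\Upsilon}^{-1}\!\left(\hat{\Upsilon}\!\left(f_c^{(b)}\right)\right)
= \hat{\Upsilon}^{-1}\!\left(\frac{(N_b - b)\,\hat{\Upsilon}(f_{\min}) + (b - 1)\,\hat{\Upsilon}(f_{\max})}{N_b - 1}\right) \\
&= \frac{1}{\hat{D}}\left[(1 + \hat{D} f_{\min})^{\frac{N_b - b}{N_b - 1}}\,(1 + \hat{D} f_{\max})^{\frac{b - 1}{N_b - 1}}\right] - \frac{1}{\hat{D}}.
\end{aligned} \tag{24}$$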
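The two center-frequency rules, (14) for the logarithmic scale and (24) for the linear-ERB scale, can be sketched in a few lines. This is a minimal illustration rather than the author's implementation: the function names are hypothetical, the defaults `D=24.7` and `E=0.108` are the constants of (2), and the range 200 Hz to 3600 Hz with 16 bands in the usage lines is only an example setting. Note that $A$, $B$ and $C$ cancel in (14), so the logarithmic rule reduces to a geometric progression.

```python
import math


def log_scale_centers(f_min, f_max, n_bands):
    """Center frequencies equidistant on Y(f) = A*log_B(f) + C, as in (14).

    A, B and C cancel out of (14), leaving a geometric progression
    from f_min to f_max.
    """
    if n_bands < 2 or not 0 < f_min < f_max:
        raise ValueError("need n_bands >= 2 and 0 < f_min < f_max")
    ratio = f_max / f_min
    return [f_min * ratio ** ((b - 1) / (n_bands - 1))
            for b in range(1, n_bands + 1)]


def erb_scale_centers(f_min, f_max, n_bands, D=24.7, E=0.108):
    """Center frequencies equidistant on the ERB-rate scale of D + E*f, as in (24)."""
    if n_bands < 2 or not 0 < f_min < f_max:
        raise ValueError("need n_bands >= 2 and 0 < f_min < f_max")
    d_hat = E / D                      # D-hat of (21)
    g_min = 1.0 + d_hat * f_min
    g_max = 1.0 + d_hat * f_max
    centers = []
    for b in range(1, n_bands + 1):
        w = (b - 1) / (n_bands - 1)    # 0 at f_min, 1 at f_max
        centers.append((g_min ** (1.0 - w) * g_max ** w - 1.0) / d_hat)
    return centers


if __name__ == "__main__":
    # Example setting (an assumption for illustration): 16 bands, 200-3600 Hz.
    for name, fn in (("log", log_scale_centers), ("ERB", erb_scale_centers)):
        print(name, [round(f) for f in fn(200.0, 3600.0, 16)])
```

Both functions return the band edges of the range as the first and last center frequencies; the logarithmic rule places more of the intermediate points at low frequencies than the linear-ERB rule does.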
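Thus from (15) and (19) we have

$$\begin{aligned}
\eta_{c,\hat{\upsilon}}^{(b)} &= \frac{1}{2}\,\frac{f_{\mathrm{BW}}^{(b+1)} + f_{\mathrm{BW}}^{(b)}}{f_c^{(b+1)} - f_c^{(b)}}
= K\,\frac{D + \frac{E}{2}\left(f_c^{(b+1)} + f_c^{(b)}\right)}{f_c^{(b+1)} - f_c^{(b)}} \\
&= \frac{E K}{2}\,
\frac{(1 + \hat{D} f_{\min})^{\frac{N_b - b - 1}{N_b - 1}} (1 + \hat{D} f_{\max})^{\frac{b}{N_b - 1}} + (1 + \hat{D} f_{\min})^{\frac{N_b - b}{N_b - 1}} (1 + \hat{D} f_{\max})^{\frac{b - 1}{N_b - 1}}}
{(1 + \hat{D} f_{\min})^{\frac{N_b - b - 1}{N_b - 1}} (1 + \hat{D} f_{\max})^{\frac{b}{N_b - 1}} - (1 + \hat{D} f_{\min})^{\frac{N_b - b}{N_b - 1}} (1 + \hat{D} f_{\max})^{\frac{b - 1}{N_b - 1}}} \\
&= \frac{E K}{2}\,
\frac{\left(\frac{1 + \hat{D} f_{\max}}{1 + \hat{D} f_{\min}}\right)^{\frac{1}{N_b - 1}} + 1}
{\left(\frac{1 + \hat{D} f_{\max}}{1 + \hat{D} f_{\min}}\right)^{\frac{1}{N_b - 1}} - 1}
= \frac{E K}{2}\,
\frac{\left(\frac{D + E f_{\max}}{D + E f_{\min}}\right)^{\frac{1}{N_b - 1}} + 1}
{\left(\frac{D + E f_{\max}}{D + E f_{\min}}\right)^{\frac{1}{N_b - 1}} - 1},
\end{aligned} \tag{25}$$

which is also constant over the filter subbands. Thus, as long as the ERB has the linear form (19) and (23) holds, the resulting frequency coverage is constant over frequency for given $f_{\min}$, $f_{\max}$ and $N_b$. The number of subbands $N_b$ for a given frequency range can therefore be derived from the required frequency coverage using (25), and the subband center frequencies can then be calculated from (14) or (24).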
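The closed forms (18) and (25) are easy to evaluate numerically, and (25) can be inverted to obtain the number of subbands for a required coverage. The sketch below is illustrative only: the function names are hypothetical, the inversion of (25) is worked out here rather than quoted from the text, and the values used in the usage lines ($K = 0.8865$, the range 200 Hz to 3600 Hz, and a Q-factor of 8.0) are example assumptions.

```python
import math


def coverage_log(q_factor, f_min, f_max, n_bands):
    """Frequency coverage (18) for a constant-Q filterbank on the log scale."""
    r = (f_max / f_min) ** (1.0 / (n_bands - 1))
    return (r + 1.0) / (2.0 * q_factor * (r - 1.0))


def coverage_erb(K, f_min, f_max, n_bands, D=24.7, E=0.108):
    """Frequency coverage (25) when bandwidth = K * (D + E*f)."""
    r = ((D + E * f_max) / (D + E * f_min)) ** (1.0 / (n_bands - 1))
    return 0.5 * E * K * (r + 1.0) / (r - 1.0)


def bands_for_coverage_erb(target, K, f_min, f_max, D=24.7, E=0.108):
    """Smallest N_b whose coverage (25) reaches `target`.

    Solving (25) for the per-band ratio r gives r = (c + 1) / (c - 1) with
    c = 2 * target / (E * K); N_b then follows from the definition of r.
    """
    c = 2.0 * target / (E * K)
    if c <= 1.0:
        return 2  # (25) always exceeds E*K/2, so two bands already suffice
    r = (c + 1.0) / (c - 1.0)
    n = 1.0 + math.log((D + E * f_max) / (D + E * f_min)) / math.log(r)
    return max(2, math.ceil(n))


if __name__ == "__main__":
    # Example values (assumptions for illustration only).
    print(coverage_erb(0.8865, 200.0, 3600.0, 16))
    print(coverage_log(8.0, 200.0, 3600.0, 16))
    print(bands_for_coverage_erb(1.0, 0.8865, 200.0, 3600.0))
```

Because the coverage in (25) grows monotonically with $N_b$, rounding the inverted value up gives the smallest filterbank that meets the target; the same inversion applies to (18) with $E K / 2$ replaced by $1 / (2\eta)$.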

4. NUMERICAL STUDIES

4.1. New ERB and ERBS Functions

From (10) we have a new frequency scaling function that leads to consistent frequency coverage for the auditory filterbank, as well as a constant Q-factor. Now we calculate its parameters. Denote the maximum inaudible frequency as $f_m$, usually $f_m \approx 20$ Hz. We use the boundary condition

$$\Upsilon(f_m) = 0, \tag{26}$$

instead of (4). Thus from (10) we have

$$C = -A \log_B(f_m). \tag{27}$$

From (3) and (10) we have a new approximation of the ERB:

$$\upsilon(f) = 1 \Big/ \frac{d\Upsilon(f)}{df} = \frac{\ln B}{A}\, f. \tag{28}$$

Choosing the natural logarithm, i.e. $B = e$, where $e \approx 2.718$ is Euler's number, we can get $A$ from a linear fit of experimental readings from the literature [14, 15, 16, 17, 18, 19], as shown in Fig. 1. We can see that

$$\upsilon(f) = \frac{f}{A}, \tag{29}$$

where $A = 7.7$ fits the data well. Then we have

$$\Upsilon(f) = \begin{cases} A \ln(f) + C, & f > f_m, \\ 0, & f \le f_m, \end{cases} \tag{30}$$

where $C \approx -23.1$. Equations (29) and (30) are the proposed new ERB and ERBS functions.

[Figure: "ERB v.s. Center Frequency". x-axis: Center Frequency (Hz); y-axis: ERB (Hz). Data: Fidell 1983, Shailer 1983, Houtgast 1977, Patterson 1976, Patterson 1982, Weber 1977. Curves: $\widehat{\mathrm{ERB}}(f)$, $\widetilde{\mathrm{ERB}}(f)$, $\upsilon(f)$.]

Fig. 1: Measured equivalent rectangular bandwidth versus center frequency, and ERB curves.

Note that the ERB of the human auditory system may vary with age and sound level, and from one listener to another [4]. Thus the precise values of $A$ and $C$ may vary. However, the derivation from (10) to (18) shows that, as long as the ERB function has the proposed form of (28) or (29), the resulting frequency scaling always satisfies the frequency coverage given by (18).

The existing and proposed ERBS functions are plotted in Fig. 2. We can see that the proposed scaling follows the proposed logarithmic scaling, and is steeper than the existing ERBS at frequencies lower than about 1 kHz. In this section we use $f_{\min} = 200$ Hz and $f_{\max} = 3600$ Hz. The center frequencies that correspond to equidistant points on the respective ERBS for $N_b = 16$ are also plotted in Fig. 2. We can see that the proposed ERBS has more points at low frequencies. This can provide better frequency resolution at lower frequencies, as most speaker fundamental frequencies are below 500 Hz, and most speech energy is usually in the fundamental frequency or its lower-order harmonics [11].

[Figure: two panels, "ERBS v.s. Frequency" (x-axis: Frequency (Hz)) and "ERBS v.s. Frequency (Logarithmic Scale)" (x-axis: Frequency, Logarithmic Scale (Hz)); y-axis: ERBS. Curves: existing $\widetilde{\mathrm{ERBS}}(f)$, proposed $\Upsilon(f)$.]

Fig. 2: The existing and proposed ERBS and corresponding selected center frequencies.

4.2. Frequency Coverage of the Gammatone Filterbank

The frequency coverage is the property that we propose for the selection of center frequencies of an auditory filterbank. Here we use the gammatone filter to demonstrate it. From [2], the bandwidth of the gammatone filter depends only on the filter order $n$ ($n \in \mathbb{N}$) and the ERB, i.e.

$$f_{\mathrm{BW},\gamma}^{(b)} = k(n)\,\widetilde{\mathrm{ERB}}\!\left(f_c^{(b)}\right), \tag{31}$$

where

$$k(n) = 2\sqrt{2^{1/n} - 1}\,\left[\frac{\pi\,(2n - 2)!\;2^{-(2n - 2)}}{\left((n - 1)!\right)^2}\right]^{-1}. \tag{32}$$

This satisfies the assumptions in (18) and (23). Thus using the new ERB function (29) instead of (2) in (31), we have the Q-factor for the gammatone filter

$$\eta_\gamma = \frac{A}{k(n)}, \tag{33}$$

which is constant over frequency, e.g. when $n = 4$, we have $k(4) = 0.8865$, and $\eta_\gamma = A / k(4) \approx 8.69$. Thus given $N_b = 16$, we can get $\eta_c^{(b)} \approx 0.60$ from (18), and $\eta_{c,\hat{\upsilon}}^{(b)} \approx 0.66$ from (25).

[Figure: top panel "$\eta^{(b)}$ v.s. Frequency" (x-axis: Frequency (Hz); curves: existing $\widetilde{\mathrm{ERB}}(f)$, proposed $\upsilon(f)$); bottom panel "Frequency Coverage v.s. Number of Sub bands" (x-axis: Number of Sub bands; curves: existing $\widetilde{\mathrm{ERBS}}(f)$, proposed $\Upsilon(f)$).]

Fig. 3: Q-factor and frequency coverage for a 4th-order gammatone filterbank.

Fig. 3 further shows the frequency coverage of the proposed and existing ERBS versus the number of subbands of the 4th-order gammatone auditory filterbank for the frequency range [200, 3600] Hz. We can see from the top panel that for frequencies above about 500 Hz, both ERBs align well with each other. However, the existing ERB has decreasing Q-factors as the frequency decreases below about 500 Hz, while the proposed ERB is consistent across the entire frequency range. We can also see from the bottom panel that for both ERB scaling functions, the frequency coverage is constant for a given number of subbands $N_b$, and increases almost linearly with the number of subbands. The frequency coverage reaches about 1 at $N_b = 24$ for both scalings. However, it can also be noted that for the same frequency range, the ERBS requires fewer subbands than the new logarithmic scale for a desired frequency coverage.

5. CONCLUSIONS

This paper investigates the frequency scaling of auditory filterbanks, and proposes a novel frequency coverage metric for the selection of center frequencies of auditory filterbanks. We also propose a new ERB that aligns with the logarithmic frequency scaling, and show that equidistant frequencies on the logarithmic frequency scale provide a consistent frequency coverage for the filterbanks. Moreover, we show that the existing linear ERB, and any possible linear ERB, can also provide consistent frequency coverage. The suggested frequency coverage is demonstrated using the gammatone filterbank.
Acknowledgment

The author would like to acknowledge the contribution of the Australian Postgraduate Award and the Australian Government Research Training Program Scholarship in supporting this research. Due thanks are given to Professor S. Nordholm and the anonymous reviewers for their review comments on early revisions of the manuscript.

6. REFERENCES

[1] D. Wang and G. J. Brown, Computational auditory scene analysis: Principles, algorithms, and applications. Wiley-IEEE Press, 2006.
[2] J. Holdsworth, I. Nimmo-Smith, R. Patterson, and P. Rice, "Implementing a gammatone filter bank," Annex C of the SVOS Final Report: Part A: The Auditory Filterbank, vol. 1, pp. 1-5, 1988.
[3] E. Zwicker and E. Terhardt, "Analytical expressions for critical-band rate and critical bandwidth as a function of frequency," The Journal of the Acoustical Society of America, vol. 68, no. 5, 1980.
[4] B. C. Moore and B. R. Glasberg, "Suggested formulae for calculating auditory-filter bandwidths and excitation patterns," The Journal of the Acoustical Society of America, vol. 74, no. 3, 1983.
[5] B. R. Glasberg and B. C. Moore, "Derivation of auditory filter shapes from notched-noise data," Hearing Research, vol. 47, no. 1, pp. 103-138, 1990.
[6] X. Sun, "Pitch determination and voice quality analysis using subharmonic-to-harmonic ratio," in Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, vol. 1. IEEE, 2002, pp. I-333.
[7] F. Nolan, "Intonational equivalence: an experimental evaluation of pitch scales," in Proceedings of the 15th International Congress of Phonetic Sciences, Barcelona, vol. 39, 2003.
[8] W. Biesmans, N. Das, T. Francart, and A. Bertrand, "Auditory-inspired speech envelope extraction methods for improved EEG-based auditory attention detection in a cocktail party scenario," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 25, no. 5, 2017.
[9] B. Moore, "Parallels between frequency selectivity measured psychophysically and in cochlear mechanics," 1986.

[10] R. Patterson, I. Nimmo-Smith, J. Holdsworth, and P. Rice, "An efficient auditory filterbank based on the gammatone function," in a meeting of the IOC Speech Group on Auditory Modelling at RSRE, vol. 2, no. 7, 1987.
[11] J. R. Deller Jr, J. G. Proakis, and J. H. Hansen, Discrete time processing of speech signals. Prentice Hall PTR, 1993.
[12] P. Maragos, J. F. Kaiser, and T. F. Quatieri, "Energy separation in signal modulations with application to speech analysis," IEEE Transactions on Signal Processing, vol. 41, no. 10, 1993.
[13] O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Transactions on Signal Processing, vol. 52, no. 7, 2004.
[14] R. D. Patterson, "Auditory filter shapes derived with noise stimuli," The Journal of the Acoustical Society of America, vol. 59, no. 3, 1976.
[15] D. L. Weber, "Growth of masking and the auditory filter," The Journal of the Acoustical Society of America, vol. 62, no. 2, 1977.
[16] T. Houtgast, "Auditory-filter characteristics derived from direct-masking data and pulsation-threshold data with a rippled-noise masker," The Journal of the Acoustical Society of America, vol. 62, no. 2, 1977.
[17] R. D. Patterson, I. Nimmo-Smith, D. L. Weber, and R. Milroy, "The deterioration of hearing with age: Frequency selectivity, the critical ratio, the audiogram, and speech threshold," The Journal of the Acoustical Society of America, vol. 72, no. 6, 1982.
[18] S. Fidell, R. Horonjeff, S. Teffeteller, and D. M. Green, "Effective masking bandwidths at low frequencies," The Journal of the Acoustical Society of America, vol. 73, no. 2, 1983.
[19] M. J. Shailer and B. C. Moore, "Gap detection as a function of frequency, bandwidth, and level," The Journal of the Acoustical Society of America, vol. 74, no. 2, 1983.
