APPLICATION OF THE FAN-CHIRP TRANSFORM TO HYBRID SINUSOIDAL+NOISE MODELING OF POLYPHONIC AUDIO

16th European Signal Processing Conference (EUSIPCO 2008), Lausanne, Switzerland, August 25-29, 2008, copyright by EURASIP

Maciej Bartkowiak
Chair of Multimedia Telecommunications and Microelectronics, Poznan University of Technology
Polanka 3, 60-965 Poznan, Poland
phone: + (48-6) 665385, fax: + (48-6) 6653899, email: mbartkow@multimedia.edu.pl
web: www.multimedia.edu.pl

ABSTRACT

Reliable classification of spectral peaks as tonal or noise-related is an important stage of hybrid sinusoidal+noise modeling. Spectral peaks of higher harmonics are often missed due to their wide frequency spread resulting from pitch variation. The recently introduced fan-chirp transform makes it possible to compensate for changes of the fundamental frequency in the spectral analysis of speech and harmonic sounds. In the case of polyphonic audio the fundamental is often not unique and/or is hard to estimate. We propose a simple technique for estimating chirp rates from already detected partials in order to improve the detection of higher harmonics through the application of frequency warping and fan-chirp analysis.

1. INTRODUCTION

Sinusoidal modeling is a well established signal processing framework applicable to speech and audio analysis, enhancement, restoration, source separation, automatic recognition, watermarking, compression, and synthesis [1]. Sinusoidal+noise (SN) modeling is an important member of the family of hybrid techniques that use different models to efficiently represent different classes of signal components. Within an SN model, a short segment of audio data is modeled as a sum of quasi-sinusoids with continuously varying magnitudes and frequencies (called the deterministic component), and a stochastic component (noise), whose short-time power spectral envelope changes over time,

$$\hat{x}(t) = \underbrace{\sum_{k=1}^{K} A_k(t)\,\sin\!\left(\varphi_k + 2\pi\int_0^t f_k(\tau)\,d\tau\right)}_{\text{deterministic component}} + \underbrace{h_n(t) \ast \xi(t)}_{\text{noise component}}. \tag{1}$$

In fact, this distinction is not so much critical from the perceptual point of view as it is important for representation efficiency (in applications related to compression) and flexibility (in applications involving sound manipulation).

In general, the separation of the tonal (sinusoidal) and stochastic (noise) components is a difficult problem. First of all, the bulk of spectral components observed in natural audio exhibit only a certain degree of coherence in the time evolution of phase and instantaneous frequency. Consequently, most of them are neither purely sinusoidal nor purely random. A common approach to the separation problem is to model the greatest possible part of the signal energy by the deterministic component, under certain constraints (e.g. $f_k$ being a harmonic series [2], which strongly narrows the range of applications). A residual signal is obtained by plain (time-domain) or spectral subtraction of the reconstructed sinusoids from the original signal. It is subsequently modeled as the stochastic component.

A more flexible approach is to classify spectral peaks (lobes surrounding local maxima of the short-time magnitude spectrum) into tonal and non-tonal according to their shape. For example, Rodet [3] proposes a measure of sinusoidality based on complex cross-correlation of the short-time spectra and the DFT of the analysis window. This approach is limited to stationary sinusoids, whereas time-varying components often exist in natural audio (fig. 1).

Figure 1. Narrowband spectrogram of an example music excerpt (horizontal axis: time [s]) showing a significant frequency spread of energy related to higher harmonics due to pitch variations. An analysis window of 4096 samples is necessary here to resolve low frequency partials.

Lagrange et al. [4] estimate the degree of local amplitude and frequency modulation using the time-frequency reassignment method of Auger and Flandrin [5].
Subsequently, individual spectral peaks are cross-correlated with the DFT of a distorted window function, and the degree of sinusoidality is determined and used in peak classification. Zivanovic et al. [6,7] developed a peak classification system based on several local spectrum descriptors: normalized bandwidth (NBD), normalized duration (NDD), and frequency coherence (FCD). The distinction between sinusoidal peaks (main and side lobes) and noise is made by inspecting the combined descriptor values.

The fundamental problem with all the approaches mentioned above is that they work under the assumption that tonal energy manifests itself in the short-time spectrum as a distinct peak, allowing simple detection. In practice, this assumption hardly holds for instruments with free intonation (such as violin, trombone, etc.), as shown in fig. 1, because variations of pitch cause the energy of higher partials to be spread over a wide frequency range and to overlap mutually. The traditional DFT-based ML estimation method often fails at the task of musical spectrum analysis because its underlying model inappropriately assumes local stationarity of partials.

Musical scales of many bass instruments start at 7Hz, the commonly used range begins at about 45Hz. The high spectral resolution necessary for proper analysis of low pitched sounds requires the use of long DFT windows (6-s, i.e. - samples if f_s = 44.1kHz) in order to reliably resolve individual partials (cf. fig. 1). In a typical situation, instantaneous frequencies of partials change significantly during such a long period, so they are no longer observable as narrow spectral lines. Hence, it is reasonable to seek locally adaptive TF analysis methods [5,8,9], which commonly attempt to model the non-stationary spectral content on a chirp basis. Among the many chirp transforms and chirp estimation techniques proposed hitherto, which often exhibit high computational complexity, the fan-chirp transform (FChT) introduced by Kepesi and Weruaga [10,11] offers two fundamental advantages in the context of music analysis.
It adapts simultaneously to the pitch variations of all harmonics of a given sound, and its computational complexity is very low, enabling online processing.

Developed primarily for the analysis of speech, the FChT computes the spectrum of a signal on a set of basis functions with fan-like geometry in the time-frequency plane. The short-time fan-chirp transform (STFChT) is defined as

$$X(k,\alpha) = \sum_{n=0}^{N-1} x(n)\,\sqrt{\left|\phi_\alpha'(n)\right|}\;\exp\!\left(-j\,\frac{2\pi k}{N}\,\phi_\alpha(n)\right), \tag{2}$$

where $\phi_\alpha(n)$ is a time-frequency warping operator,

$$\phi_\alpha(n) = \left(1 + 0.5\,\alpha\,(n - N)\right) n, \tag{3}$$

and α is the skew parameter corresponding to the chirp rate. In fact, the STFChT of a given signal is equivalent to the DFT of the same signal sampled on a non-uniform grid obtained by inverting the warping operator (3). Therefore a fast implementation is possible which requires just a resampling step followed by an FFT []. Since the mapping (3) is bijective in [0..N], the transform is reversible, provided no aliasing terms are introduced in the process of resampling. These aliasing terms may be avoided by appropriate upsampling of the original signal prior to warping.

2. MODELING OF POLYPHONIC MUSIC

2.1 The problem of fundamental frequency

The STFChT is able to resolve harmonic partials whose frequency deviation within the analysis window is greater than the spacing between the corresponding mean frequencies. This is possible provided that an appropriate value of α is used, one that corresponds to the rate of change of the fundamental frequency, and that |α| < 2/N, which keeps the derivative of (3), $\phi_\alpha'(n) = 1 + \alpha(n - N/2)$, positive over the whole window. In the context of speech analysis, α may be approximated as

$$\alpha = \frac{f_0'(n)}{f_0(n)} \approx \frac{f_0(n+1) - f_0(n-1)}{2\,f_0(n)}, \tag{4}$$

where $f_0(n)$ denotes the fundamental frequency estimated within a symmetric time window centered around n. Several techniques for FChT-supported estimation of the fundamental, using either an inter-frame or an intra-frame approach, are described in [].

In the context of polyphonic music, $f_0$ is not unique due to the presence of multiple sounds of different pitch, often generated by different instruments.
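As an illustration of the transform described above, the following is a minimal sketch that evaluates the STFChT directly, reading the definition as X(k, α) = Σ x(n) sqrt(|φ'(n)|) exp(-j2πk φ(n)/N) with φ(n) = (1 + 0.5 α (n - N)) n. The square-root weighting, the summation range 0..N-1 and the function names `warp` and `stfcht_direct` are assumptions made for this sketch; it is a slow O(N²) reference, not the fast resampling-based implementation mentioned in the text.

```python
import numpy as np

def warp(n, alpha, N):
    """Warping operator phi_alpha(n) = (1 + 0.5*alpha*(n - N)) * n."""
    return (1.0 + 0.5 * alpha * (n - N)) * n

def stfcht_direct(x, alpha):
    """Direct evaluation of the fan-chirp transform of one frame.

    x     : real or complex frame of N samples
    alpha : skew parameter, assumed to satisfy |alpha| < 2/N
    Returns bins k = 0 .. N/2 (same layout as numpy.fft.rfft).
    """
    x = np.asarray(x)
    N = len(x)
    n = np.arange(N, dtype=float)
    phi = warp(n, alpha, N)
    dphi = 1.0 + alpha * (n - 0.5 * N)        # derivative of the warping
    k = np.arange(N // 2 + 1)[:, None]        # column of bin indices
    kernel = np.exp(-2j * np.pi * k * phi[None, :] / N)
    # Assumed energy-preserving weighting sqrt(|phi'|) of the samples.
    return kernel @ (x * np.sqrt(np.abs(dphi)))
```

With `alpha = 0` the warping is the identity and the result coincides with `numpy.fft.rfft(x)`. For a chirp whose phase follows the same warping, the matched transform concentrates the energy in a single bin, whereas the plain DFT spreads it over the whole swept range.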
The issue of multiple pitch estimation from polyphonic audio has been addressed by many researchers (e.g. [12,13]) and is generally considered a difficult task. Furthermore, some musical instruments (like bells, glockenspiel or Rhodes piano) exhibit strongly inharmonic spectra, therefore their fundamental is undefined.

It is important to note, however, that even without a strictly defined fundamental all the sinusoidal partials of pitched sounds follow a similar pattern in the time-frequency plane. Considering the partials of a harmonically rich sound, their individual chirp rate estimates taken relative to their mean frequency estimates are strictly related to the pitch change rate. Therefore, instead of (4), α may be estimated from the statistics of the individual chirp rates $\alpha_k$ of some lower frequency partials detected before calculating the FChT. This is a feasible solution, since low partials usually exhibit more stable frequencies and are relatively easy to detect.

2.2 Estimation of individual partials

Partials with a limited depth of frequency modulation may often (but not always) be modeled as linear chirps. It is possible to estimate their mean frequency and individual chirp rate by using one of several techniques developed for sinusoidal modeling. For example, Abe and Smith [14] demonstrated that for a chirp expressed as

$$x(t) = A\,\exp\!\left(\gamma t + j\left(\varphi + \omega t + \beta t^2\right)\right), \tag{5}$$

weighted by a Gaussian window (as well as other windows), a non-zero frequency modulation term β results in a quadratic shape of the log amplitude and phase spectra. They proposed a quadratically interpolated FFT method for estimating ω and β,

$$\hat{\omega} = \frac{2\pi}{N}\left(k - \frac{b}{2a}\right), \qquad \hat{\beta} = p\,\frac{d}{a}, \tag{6}$$

from the parameters of a parabola fitted to the log magnitude and phase spectrum surrounding peaks,

where

$$a = \tfrac{1}{2}\left(\log|X_{k+1}| - 2\log|X_k| + \log|X_{k-1}|\right), \quad b = \tfrac{1}{2}\left(\log|X_{k+1}| - \log|X_{k-1}|\right), \quad d = \tfrac{1}{2}\left(\angle X_{k+1} - 2\angle X_k + \angle X_{k-1}\right), \tag{7}$$

$$p = -\frac{\pi^2}{N^2}\,\frac{a}{a^2 + d^2}, \tag{8}$$

and k is the index of the FFT bin corresponding to the local maximum of magnitude.

2.3 Chirp rate estimation for groups of partials

We propose a two-stage analysis procedure for sinusoidal modeling of polyphonic music. The main idea is to perform a standard analysis first, using the DFT to detect reliable low frequency partials and to estimate their parameters $\omega_k$ and $\beta_k$. Subsequently, the non-stationary high frequency partials are detected and their parameters are estimated by means of FChT analysis, taking into account several of the most interesting values of the chirp rate α, i.e. those values that most probably correspond to the local time-frequency skewness related to the underlying pitch modulation.

Let us assume that sounds coming from different instruments with different pitch variation are present simultaneously in the current analysis frame. Its spectrum shows a mixture of harmonic and inharmonic series of partials. We observe that the estimates of individual chirp rates $\alpha_k = \beta_k/\omega_k$ of individual partials follow a multi-modal distribution that may be approximated by a Gaussian mixture model (GMM),

$$p_\alpha(\alpha) = \sum_m w_m\, p(\alpha \mid \phi_m), \tag{9a}$$

where

$$p(\alpha \mid \phi_m) = \frac{1}{\sigma_m\sqrt{2\pi}}\,\exp\!\left(-\frac{(\alpha - \mu_m)^2}{2\sigma_m^2}\right), \tag{9b}$$

and $\phi_m$ denotes a certain state of the model representing a group of partials sharing a common chirp rate. The weights $w_m$ are not explicitly known, but may be regarded as representing the bulk of partials exhibiting similar temporal evolution, thus they may be estimated from the heights of the modes of the empirical distribution. Our aim is to find the values of $\mu_m$, which are the interesting chirp rates that may reveal additional high frequency partials thanks to the time-frequency warping inherent in the FChT. We estimate the values of $\mu_m$ by employing an iterative algorithm based on the Expectation Maximization method.
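The first stage depends on per-peak estimates of ω and β obtained from the parabola fit of section 2.2. A minimal sketch of that fit is given below, under stated assumptions: the window is the Gaussian w(t) = exp(-p t²) with a known parameter p and time t measured in samples from the frame centre, the frame is made zero-phase before the FFT, and only the strongest peak is analysed. The function name `qifft_chirp` is hypothetical, and β is computed as p·d/a with the known window parameter rather than with an estimated one.

```python
import numpy as np

def qifft_chirp(frame, p):
    """Estimate (omega, beta) of the strongest partial in `frame`.

    frame : real frame of N samples (N even), time origin at the centre
    p     : parameter of the Gaussian window w(t) = exp(-p * t**2)
    Returns omega [rad/sample] and beta [rad/sample^2] of the model
    x(t) = A * exp(j * (phi + omega * t + beta * t**2)).
    """
    frame = np.asarray(frame, dtype=float)
    N = len(frame)
    t = np.arange(N) - N // 2
    window = np.exp(-p * t.astype(float) ** 2)
    # ifftshift moves the t = 0 sample to index 0 (zero-phase windowing),
    # so the phase spectrum around a peak carries no large linear term.
    X = np.fft.fft(np.fft.ifftshift(frame * window))
    mag = np.abs(X[: N // 2])
    k = int(np.argmax(mag[1:-1])) + 1          # peak bin, keeps k-1 and k+1 valid
    u = np.log(mag[k - 1 : k + 2])             # log magnitude at k-1, k, k+1
    th = np.unwrap(np.angle(X[k - 1 : k + 2])) # phase at k-1, k, k+1
    a = (u[2] - 2.0 * u[1] + u[0]) / 2.0       # curvature of log magnitude
    b = (u[2] - u[0]) / 2.0                    # slope of log magnitude
    d = (th[2] - 2.0 * th[1] + th[0]) / 2.0    # curvature of phase
    omega = 2.0 * np.pi / N * (k - b / (2.0 * a))
    beta = p * d / a
    return omega, beta
```

The relative chirp rate of the partial then follows as `beta / omega`, which is the quantity collected for every reliable low-frequency peak.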
The algorithm starts with a classical sinusoidal analysis of a given audio frame, with an optional peak verification in order to reject peaks induced by noise [6,7]. Initially, for each frame we gather the observed values of $\alpha_k = \beta_k/\omega_k$ and form a pdf estimate (fig. 2) by means of a histogram smoothing method [15]. The locations of the peaks of this pdf estimate are the initial estimates of $\mu_m$, which may be iteratively improved as follows:

- For each sample of $\alpha_k$, calculate its distance to every $\mu_m$.
- Calculate new estimates of $\mu_m$ through weighted averaging of the values of $\alpha_k$, with weights inversely proportional to the distances.
- Iterate until there is no significant change in $\mu_m$.

The results of such iterations (fig. 2) are the values of the chirp rates that may be applied within the second stage, which employs FChT analysis for enhanced estimation of non-stationary high-frequency partials. We have observed experimentally that for real world music the values of α are usually constrained to the range <-1, 1>, and most often do not exceed 0.5.

Figure 2. Above: distribution of the estimated values of α for a single frame of the test signal (fig. 3), showing the pdf estimate and the Gaussian mixture model. Below: estimated values of α in consecutive frames (horizontal axis: frame number).

2.4 FChT-based music analysis

Music spectrum analysis with the FChT offers the possibility of revealing otherwise hidden spectral peaks related to non-stationary high frequency partials. It also offers an enhanced estimation of the parameters of lower partials, because the frequency deviations are compensated by the time-frequency skew inherent in the fan-chirp basis functions. Since the chirp rates α are estimated in the first stage of the proposed technique (sec. 2.3), it is necessary to calculate the FChT only for those few values of α, which is a computationally feasible operation.
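The candidate chirp rates used in this stage come from the mode refinement of section 2.3, which can be sketched as follows. This is one possible reading of the three steps, not a definitive implementation: a three-tap smoothing of a plain histogram stands in for the histogram smoothing method of [15], each sample is assigned to its nearest mode before the inverse-distance weighted averaging (the text does not say how the samples are shared between modes), and the names `initial_modes` and `refine_modes` as well as all default parameter values are assumptions.

```python
import numpy as np

def initial_modes(alphas, bins=41, lo=-1.0, hi=1.0, min_rel_height=0.2):
    """Initial mode locations: peaks of a smoothed histogram of alpha."""
    hist, edges = np.histogram(alphas, bins=bins, range=(lo, hi))
    smooth = np.convolve(hist, [0.25, 0.5, 0.25], mode="same")
    centers = 0.5 * (edges[:-1] + edges[1:])
    padded = np.concatenate(([0.0], smooth, [0.0]))
    # Local maximum: strictly above the left neighbour, not below the right.
    is_peak = (padded[1:-1] > padded[:-2]) & (padded[1:-1] >= padded[2:])
    is_peak &= smooth >= min_rel_height * smooth.max()
    return centers[is_peak]

def refine_modes(alphas, mu, tol=1e-6, max_iter=100, eps=1e-6):
    """Iteratively refine mode locations mu from the samples alphas."""
    alphas = np.asarray(alphas, dtype=float)
    mu = np.array(mu, dtype=float)
    if mu.size == 0:
        return mu
    for _ in range(max_iter):
        # Step 1: distance of every sample to every mode.
        dist = np.abs(alphas[:, None] - mu[None, :])
        nearest = np.argmin(dist, axis=1)   # assumption: hard assignment
        new_mu = mu.copy()
        for m in range(mu.size):
            members = nearest == m
            if not members.any():
                continue                    # keep a mode that attracts no samples
            # Step 2: weights inversely proportional to the distances.
            w = 1.0 / (dist[members, m] + eps)
            new_mu[m] = np.sum(w * alphas[members]) / np.sum(w)
        # Step 3: stop when the modes no longer move significantly.
        converged = np.max(np.abs(new_mu - mu)) < tol
        mu = new_mu
        if converged:
            break
    return mu
```

With the hard assignment, each inverse-distance update is a step towards the median of the samples attached to a mode, which keeps isolated outliers among the chirp-rate estimates from pulling a mode away from its group.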
Our peak detection and estimation algorithm relies on the observation that, for each sinusoidal partial with varying frequency, the highest value of the corresponding peak in the magnitude spectrum is offered by the output of the FChT computed with the value of α that is closest to the first-order approximation of the real frequency variation function. In other words, the closer the chirp rate used in the fan-chirp transform is to the real frequency change rate, the more the spectrum resembles the spectrum of a sinusoid. The algorithm for the analysis is very straightforward:

1. For a data segment x of N samples, initialize a vector of peaks P[] with N/2 zeros. Also, insert the peak values estimated from the DFT analysis in the first stage into the locations corresponding to the DFT bin numbers.
2. For the first candidate value of α estimated as described in 2.3, calculate the result of FChT(x,α).
3. Find all sinusoidal peaks in the FChT output, according to the chosen peak detection criteria.
4. For each of those peaks, compare its magnitude to the magnitude of the corresponding peak already gathered in the vector P. If the magnitude is higher, a better approximation of the corresponding partial has been found. In such a case, replace the existing peak with the new peak from the FChT result. Also, collect the neighboring spectral data and write it to the entries of P. Label the peak with the current chirp rate, α.
5. Iterate steps 2..4 with subsequent values of α from the set.
6. For each of the peaks gathered in the vector P, calculate the corrected ω and β according to (6-8). Correct the value of β by taking into account the chirp rate α used for the particular peak.

Note that the above procedure does not guarantee that all hidden partials are detected. Unfortunately, a group of highly non-stationary sinusoids may be missed if none of its members has been detected in the first stage, so that none could contribute to the estimation of the optimal skew parameter α.

3. EXPERIMENTAL RESULTS

3.1 Synthetic signal

In order to verify the procedure proposed in section 2.3, a simple test has been set up. An artificial signal has been constructed by summing two inharmonic spectra of two bell sounds with deeply modulated pitch, synthesized using the FM synthesis technique (fig. 3). As can easily be observed, the deep frequency deviation causes most of the high frequency partials to be blurred to a significant degree. Clearly, the spectrum of this signal contains at least two groups of partials, and the distribution of α should reveal in each frame at least two modes of the pdf, corresponding to the different frequency modulation patterns.

Figure 3. Spectrogram of the synthetic signal (axes: time, frequency). The black frame shows a data segment of N=4096 samples, further analyzed in fig. 4.

Experiments show that for this synthetic signal about lowest harmonics are detected reliably in the first stage of sinusoidal analysis. In fact, due to overlapping partials, the estimation of ω and β is not free of errors, therefore the actual values of α are slightly biased. The resulting chirp spectra are shown in fig. 4.

Figure 4. Comparison of standard DFT (above, black) and FChT analysis (below) with two values of automatically estimated α (shown in blue and red); vertical axes: magnitude [dB]. In both cases, the analysis window is Hamming, 4096 samples.

As can clearly be seen, most of the high-frequency partials that are entirely indiscernible in the DFT output become quite visible in the result of the FChT. It is important to note that fan-chirp analysis makes it possible to discriminate partials that are very close in frequency but differ mostly in the chirp rate, α.

3.2 Analysis of real music

A series of experiments with various excerpts of popular and classical music from the EBU SQAM reference CD has been performed in order to verify the effectiveness of the new sinusoidal analysis technique in real-life applications. In each experiment, a benchmark was created from the results of standard sinusoidal analysis with an additional peak selection procedure based on spectral descriptors (NBD+FCD). The results of FChT-based analysis compared favorably with the benchmark, since many existing partials were detected in the high frequency range. Moreover, the more robust peak detection offered by the chirp analysis made it possible to tighten the detection thresholds. As a result, far fewer false partials were detected due to the spectral energy induced by noise.

A sample comparison from these experiments is shown in fig. 5. In the upper plot we show the sinusoidal partials detected with the standard DFT method followed by peak classification based on spectral descriptors. This typical result reveals serious deficiencies of the analysis technique.
Most of the high frequency partials have not been properly detected due to the vibrato modulation, while there are many false multiple partials in the range of -5Hz induced by the irregular spectral peaks which are the side lobes of deeply modulated harmonics. It is worth mentioning that a simpler analysis without the application of spectral descriptors (not shown here) gives even worse results. In the lower plot, the results of FChT-based analysis show many of the partials in the high frequency range being properly detected, and the number of false partials is significantly reduced.

Figure 5. Comparison of sinusoidal partial detection based on the standard DFT technique (above) and the proposed technique exploiting fan-chirp transform analysis (below); horizontal axes: time [s]. These plots should be compared with figure 1.

One significant disadvantage of the proposed technique for sinusoidal analysis is the additional computational burden of calculating several fan-chirp transforms. However, detection and estimation are usually not very demanding in terms of computational complexity compared to tracking, whose complexity is often data-dependent. Since our analysis delivers much cleaner input data to the tracking algorithm, the total processing time of a modeling system may not increase significantly.

4. CONCLUSIONS

A computationally feasible application of the fan-chirp transform to hybrid sinusoidal+noise modeling of polyphonic music has been presented in this paper. A very simple technique has been proposed for the estimation of the frequency warping parameter α that does not require pitch estimation. Experimental results confirm that a substantial improvement in the detection of highly non-stationary partials is achieved, which enables good quality modeling of wideband audio without restrictions regarding harmonicity.

5. ACKNOWLEDGMENTS

This work was supported by the research grant 3 TD 7 3 of the Polish Ministry of Science and Higher Education.

REFERENCES

[1] J. Beauchamp (ed.), Analysis, Synthesis, and Perception of Musical Sounds: The Sound of Music, Springer, 2006.
[2] X. Serra, J.O. Smith, "Spectral modelling synthesis: A sound analysis/synthesis system based on deterministic plus stochastic decomposition", Computer Music Journal, 14(4), 1990, pp. 12-24.
[3] X. Rodet, "Musical sound signal analysis/synthesis: Sinusoidal + residual and elementary waveform models", IEEE Time-Frequency and Time-Scale Workshop, TFTS'97, Coventry, UK, August 1997.
[4] M. Lagrange, S. Marchand, J-B. Rault, "Sinusoidal parameter extraction and component selection in a non-stationary model", Proc. DAFx'02, Hamburg, 2002, pp. 59-64.
[5] F. Auger, P. Flandrin, "Improving the readability of time-frequency and time-scale representations by the reassignment method", Proc. ICASSP'95, May 1995, vol. 4, pp. 1068-1089.
[6] A. Röbel, M. Zivanovic, X. Rodet, "Signal decomposition by means of classification of spectral peaks", Proc. ICMC'04, Miami, 2004.
[7] M. Zivanovic, A. Röbel, X. Rodet, "Adaptive threshold determination for spectral peak classification", Proc. DAFx'07, Bordeaux, 2007.
[8] S. Mann, S. Haykin, "Adaptive chirplet transform: an adaptive generalization of the wavelet transform", Optical Engineering, vol. 31, no. 6, pp. 1243-1256, June 1992.
[9] X-G. Xia, "Discrete chirp-Fourier transform and its applications to chirp rate estimation", IEEE Trans. Sig. Proc., vol. 48, no. 11, pp. 3122-3133, November 2000.
[10] M. Kepesi, L. Weruaga, "Adaptive chirp-based time-frequency analysis of speech signals", Speech Comm., vol. 48, pp. 474-492, 2006.
[11] L. Weruaga, M. Kepesi, "The fan-chirp transform for non-stationary harmonic sounds", Signal Proc., vol. 87, pp. 1504-1522, 2007.
[12] P.J. Walmsley, S.J. Godsill, P.J.W. Rayner, "Polyphonic pitch tracking using joint Bayesian estimation of multiple frame parameters", Proc. IEEE Workshop on Audio and Acoustics, Mohonk, NY State, 1999.
[13] Y. Chunghsin, A. Röbel, X. Rodet, "Multiple fundamental frequency estimation of polyphonic music signals", Proc. ICASSP '05, March 2005, vol. 3, pp. 225-228.
[14] M. Abe, J.O. Smith, "Design criteria for the quadratically interpolated FFT method (III): Bias due to amplitude and frequency modulation", CCRMA Rep. STAN-M-116, October 2004.
[15] W. Hardle, Smoothing Techniques: With Implementation in S, Springer-Verlag, Berlin, 1991.