SINUSOIDAL PARAMETER EXTRACTION AND COMPONENT SELECTION IN A NON STATIONARY MODEL

Proc. of the 5th Int. Conference on Digital Audio Effects (DAFx-02), Hamburg, Germany, September 26-28, 2002

Mathieu Lagrange, Sylvain Marchand, and Jean-Bernard Rault

France Telecom R&D, 4, rue du Clos Courtel, BP 59, F-35512 Cesson Sevigné cedex, France (firstname.name@rd.francetelecom.com)
SCRIME - LaBRI, Université Bordeaux 1, 351, cours de la Libération, F-33405 Talence cedex, France (sm@labri.u-bordeaux.fr)

ABSTRACT

In this paper, we introduce a new analysis technique particularly suitable for the sinusoidal modeling of non-stationary signals. This method, based on amplitude and frequency modulation estimation, aims at improving traditional Fourier parameters and enables us to introduce a new peak selection process, so that only peaks having coherent parameters are considered in subsequent stages (e.g. partial tracking, synthesis). This allows our spectral model to better handle natural sounds.

1. INTRODUCTION

Spectral sound models provide general representations for many applications such as compression, content extraction and transformation. Most of these models, such as additive synthesis, are based on Fourier analysis, which has proven to be accurate under the condition of local stationarity. Unfortunately, most natural signals are not stable enough to be considered locally stationary over typical analysis frame lengths (close to the perceptual sensitivity of the human auditory system). To address this issue, the window length can be shortened but, according to the well-known time/frequency resolution tradeoff, this leads to poor frequency resolution. In the context of voiced speech analysis, McAulay and Quatieri [1] proposed to adapt the analysis window size to the pitch of the voice, although this cannot be applied to polyphonic sources. This paper introduces an alternative that relies on modeling the amplitude and frequency modulations of the audio signal within the analysis time slot.
Intra-frame parameter variations are extracted, and the bias introduced when estimating the stationary parameters can be compensated. Thanks to these modulation measures and to a more accurate stationary parameter extraction, we are able to develop an efficient peak selection process that better separates noisy peaks from modulated ones.

After a brief introduction in Section 2 to the non-stationary model used in this paper, we extend in Section 3 the studies of Masri [2] to the case of the Hann window in order to estimate intra-frame variations. Section 4 is dedicated to the correction of the stationary parameters with two different approaches: time reassignment and the computation of the spectrum of a modulated sinusoid. After an introduction on peak selection processing and sinusoidal characterization, a new peak selection process is presented in Section 5. Possible applications as well as comparative results follow.

2. SINUSOIDAL MODELING

2.1. Stationary Case

Additive synthesis is the original spectrum modeling technique. It is rooted in Fourier's theorem, which states that any periodic function can be modeled as a sum of sinusoids at various amplitudes and harmonic frequencies. For stationary pseudo-periodic sounds, these amplitudes and frequencies evolve slowly and continuously with time, controlling a set of pseudo-sinusoidal oscillators commonly called partials. The audio signal $a$ can be calculated from the additive parameters using Equations 1 and 2, where $n$ is the number of partials and the functions $f_p$, $a_p$, and $\phi_p$ are the instantaneous frequency, amplitude, and phase of the $p$-th partial, respectively. The $n$ pairs $(f_p, a_p)$ are the parameters of the additive model and represent points in the frequency-amplitude plane at time $t$. This representation is used in many analysis/synthesis programs such as SMS [3] or InSpect [4].

$$a(t) = \sum_{p=1}^{n} a_p(t) \cos(\phi_p(t)) \tag{1}$$

$$\phi_p(t) = \phi_p(0) + 2\pi \int_0^t f_p(u)\, du \tag{2}$$

2.2. Non-Stationary Case

For non-stationary signals, the amplitude and frequency parameters $a_p$ and $f_p$ appear as the mean of the amplitude/frequency evolutions in the analysis frame. In our model, we consider that the parameters can evolve within the analysis window. Since the human ear, like every sensory organ, perceives amplitude variation as the logarithm of the excitation, it is convenient to express the amplitude modulation $\Delta_{a_p}$ in dB (decibels). Although a similar logarithmic scale would be appropriate for the frequency variations $\Delta_{f_p}$ as well, they will be considered linear for the sake of simplicity. Thus, the audio signal $a$ is given by the following equations:

$$a(t) = \sum_{p=1}^{n} a_p\, 10^{\Delta_{a_p} t / 20} \cos(\phi_p(t)) \tag{3}$$

$$\phi_p(t) = \phi_p(0) + 2\pi \int_0^t \left( f_p + \Delta_{f_p} u \right) du \tag{4}$$

where $a_p$, $f_p$, $\Delta_{a_p}$, and $\Delta_{f_p}$ are considered constant during the analysis window. The following section is dedicated to the estimation of the intra-frame modulations.
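As a hedged illustration of the non-stationary model of Equations 3 and 4, the following Python sketch synthesizes a single partial whose amplitude varies linearly on a dB scale and whose frequency varies linearly. The function name `synth_partial`, the use of NumPy, and the choice of seconds, dB per second and Hz per second as units are assumptions made for this example only.

```python
import numpy as np

def synth_partial(a, f, phi, delta_a, delta_f, t):
    """Samples of one non-stationary partial at times t (seconds).

    a, f, phi : amplitude, frequency (Hz) and phase (rad) at t = 0
    delta_a   : amplitude modulation in dB per second (assumed unit)
    delta_f   : frequency modulation in Hz per second (assumed unit)
    """
    t = np.asarray(t, dtype=float)
    # Linear on a dB scale means exponential on a linear amplitude scale.
    amplitude = a * 10.0 ** (delta_a * t / 20.0)
    # Integrating f + delta_f * u from 0 to t gives f*t + delta_f*t^2/2.
    phase = phi + 2.0 * np.pi * (f * t + 0.5 * delta_f * t ** 2)
    return amplitude * np.cos(phase)

# Example: a 440 Hz partial gliding upward by 100 Hz/s while fading by 6 dB/s.
sample_rate = 44100.0
t = np.arange(1024) / sample_rate
frame = synth_partial(1.0, 440.0, 0.0, -6.0, 100.0, t)
```

Summing several such partials gives a test signal with known modulations, which is convenient for checking the estimators discussed in the following sections.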

[Figure 1: Influence of amplitude (left) and frequency (right) modulations on the zero-padded phase spectrum. Axes: phase (radians) versus frequency offset from peak (bins).]

[Figure 2: Evolution of the empirical measures $\Phi_{i+1} - \Phi_{i-1}$ (amplitude modulation measure) and $\Phi_{i+1} + \Phi_{i-1}$ (frequency modulation measure) as a function of the actual modulations ($\Delta_a$ in dB per frame, $\Delta_f$ in bins per frame).]

[Figure 3: Influence of frequency modulation on the amplitude modulation measure when there is no amplitude modulation, and vice versa.]

3. MODULATED SINUSOIDS

3.1. Modulation Estimation

Several techniques have been proposed to estimate intra-frame modulations. The studies of Marques [5] and Peeters [6] were based on the use of truncated Gaussian windows, known for their good theoretical properties. Indeed, the Fourier transform of a Gaussian window is a Gaussian function. However, Gaussian windows have poor frequency resolution [7]. The Hann window, with its prominent and narrow main lobe, has proven to be better for our purposes. We use the empirical studies of Masri [2] based on phase distortion analysis. For a peak located in the $i$-th bin of the zero-padded phase spectrum $\Phi$, picking the values of $\Phi$ at indices $i-1$ and $i+1$ allows us to estimate the frequency/amplitude modulations (see Figure 1). Indeed, a constant relationship between $\Delta_a$ and $\Phi_{i+1} - \Phi_{i-1}$ was found, as well as a more complex one between $\Delta_f$ and $\Phi_{i+1} + \Phi_{i-1}$ (see Figure 2). More formally:

$$\Delta_a = c\, \Phi_a, \qquad \Delta_f = G(\Phi_f)$$

where $\Phi_a$ and $\Phi_f$ are expressed by:

$$\Phi_a = \Phi_{i+1} - \Phi_{i-1}, \qquad \Phi_f = \Phi_{i+1} + \Phi_{i-1}$$

The value of $c$ and those of the coefficients of $G$ depend on the window, its size and the zero-padding level.
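The two phase measures can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `phase_distortion_measures`, the odd frame length, the zero-phase buffer layout and the choice of taking both neighbouring phases relative to the phase of the peak bin are all assumptions of this example. The calibration constants ($c$ and the coefficients of $G$) are not reproduced here, since they depend on the window, its size and the zero-padding level.

```python
import numpy as np

def phase_distortion_measures(frame, peak_bin, pad_factor=4):
    """Return (phi_a, phi_f) for an odd-length frame.

    peak_bin is the peak index in bins of the zero-padded spectrum.
    """
    n = len(frame)
    if n % 2 == 0:
        raise ValueError("an odd frame length keeps the window exactly centered")
    half = (n - 1) // 2
    windowed = np.asarray(frame, dtype=float) * np.hanning(n)
    # Zero-phase layout: center sample at index 0, zeros inserted in the middle.
    buf = np.zeros(pad_factor * (n + 1))
    buf[:half + 1] = windowed[half:]
    buf[-half:] = windowed[:half]
    spectrum = np.fft.fft(buf)
    # Neighbour phases measured relative to the peak phase (assumption).
    below = np.angle(spectrum[peak_bin - 1] * np.conj(spectrum[peak_bin]))
    above = np.angle(spectrum[peak_bin + 1] * np.conj(spectrum[peak_bin]))
    # Difference isolates the anti-symmetric part, sum the symmetric part.
    return above - below, above + below
```

With a frame of 1023 samples and `pad_factor=4`, the FFT size is 4096, so a sinusoid at 400/4096 cycles per sample peaks exactly at `peak_bin=400`. In that setting a pure amplitude modulation mainly moves the first measure and a pure linear frequency modulation mainly moves the second.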
These empirical measures come from the fact that the influence of frequency modulation on the phase was found to be symmetrical, whereas that of amplitude modulation was found to be anti-symmetrical, as can be seen in Figure 1. This theoretically guarantees the independence of the two estimations. In practice this independence is not perfect, as shown in Figure 3, but it is quite sufficient for our purposes. Zero-padding, which interpolates the spectrum by artificially adding zeros to the frame buffer before the Fast Fourier Transform (FFT), is a computationally expensive prerequisite. By interpolating the spectrum, it reduces, but does not totally exclude, the bad case where only one of the phase spectrum bins used is shifted by $\pi$.

3.2. Computation of Modulated Spectrum

The spectrum, at frequency $f$, of a modulated sinusoid whose set of parameters is $s = (a, f, \phi, \Delta_a, \Delta_f)$ is given by the classic short-time Fourier transform (STFT) formula:

$$A_s(f) = \sum_{n=-N/2}^{N/2} w[n]\, a[n]\, e^{-j 2\pi f n / N} \tag{5}$$

where $a[n]$ is the discrete version of $a$ (see Equation 3), $w[n]$ is the analysis window, and $N$ is the size of the STFT.

4. STATIONARY PARAMETER CORRECTION

The complex spectrum is distorted by the intra-frame modulations $\Delta_a$ and $\Delta_f$, which severely corrupts the estimation of the stationary parameters $a$, $\phi$ and $f$. This section deals with different techniques used to correct the bias in extracting the amplitude $a$ and phase $\phi$, provided that the modulations were correctly measured. In the remainder of this paper, the estimation of the corrected frequency $\hat{f}$ will not be addressed because the analysis method proposed in [8] already performs a frequency correction by considering the signal derivative.

[Figure 4: Time coordinate evolution in a non-linear amplitude model. Axes: amplitude versus time.]

4.1. Time Reassignment

Using a zero-phase window, in a stationary context, it is convenient to consider that the amplitude parameter $a$ is equal to $a(t_0)$, where $t_0$ is the time position of the center of the analysis window. In a non-linear amplitude context, the point of maximum spectral energy (center of gravity) of a component has a changing time coordinate (see Figure 4). The difference $\Delta_t$ between this point and the center of the window can be estimated using the time reassignment method. This method was presented by Auger and Flandrin [9] for a large variety of known time-frequency and time-scale distributions. It was introduced in the sinusoidal modeling context in [10, 6]. More precisely, we have:

$$\Delta_t = \Re\left( \frac{X_{th}\, \overline{X_h}}{|X_h|^2} \right) \tag{6}$$

where $X_{th}$ denotes the short-time Fourier transform computed using the window multiplied by a time ramp ($w_{th}(t) = w_h(t)\, t$) and $X_h$ is the short-time Fourier transform computed using the original window $w_h$. Since the measured amplitude $a$ corresponds to the time $t_0 + \Delta_t$, the corrected amplitude is then given by:

$$\hat{a} = a \cdot 10^{-\Delta_a \Delta_t / 20} \tag{7}$$

During the time interval $\Delta_t$, there is a phase travel due to the periodic oscillation. The phase can be corrected in this way:

$$\hat{\phi} = \phi - 2\pi \hat{f} \Delta_t \tag{8}$$

Unfortunately, this method only takes the amplitude modulation into account. The following section presents an alternative solution that takes advantage of both the amplitude and frequency modulations.

4.2. Modulated Spectrum

In [8], the amplitude parameter is corrected by considering the continuous power spectrum of the analysis window:

$$\hat{a} = \frac{a}{W(|\hat{f} - f|)} \tag{9}$$

where $W$ is the power spectrum of the analysis window $w$ and $f$ is the frequency of the spectrum local maximum. Since the frequency has changed, we then have to estimate the new amplitude. Fortunately, in a stationary model, the power spectrum of a sinusoid has the shape of the window power spectrum. As can be seen in Figure 5, for a frequency correction of half a bin, the amplitude factor is approximately 0.8. In the case of a strong modulation, the main lobe is flattened and thus differs markedly from the main lobe of the window power spectrum, which leads to a poor estimate of the amplitude correction. This correction can be improved by considering the power spectrum $A_s(|\hat{f} - f|)$ of a modulated sinusoid computed with Equation 5 instead of $W(|\hat{f} - f|)$. The corrected amplitude is then expressed by:

$$\hat{a} = \frac{a}{A_s(|\hat{f} - f|)} \tag{10}$$

For a frequency correction of half a bin and a modulated sinusoid such as the one of Figure 5, the amplitude factor drops to approximately 0.7.

[Figure 5: Window power spectrum (line) versus modulated sinusoid power spectrum (dashed). Axes: amplitude versus frequency offset from peak (bins).]

4.3. Comparative Results

This section presents comparative results for the different methods described above. The modulation parameters $\Delta_a$ and $\Delta_f$ used to compute the corrections are extracted from the signal, i.e. the errors presented below are attributable not only to the correction process but also to the modulation extraction process. The top part of Figure 6 shows the errors as a function of amplitude modulation and the bottom part as a function of frequency modulation. Since the reassignment method does not handle frequency modulation, its error is not plotted in the bottom part. Thanks to these modulation measures and the more accurate stationary parameter extraction, we are now able to develop an efficient peak selection process.

5. PEAK SELECTION

Our peak selection process retains a bin if it is a local maximum and its frequency correction as defined in [8] is below one bin.
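The retention rule above can be sketched as follows. This is only an illustration: the function name `select_peaks` is hypothetical, and the per-bin frequency corrections are assumed to have been computed beforehand (for example with the derivative-based method of [8]) and to be expressed in bins.

```python
import numpy as np

def select_peaks(magnitude, freq_correction_bins):
    """Indices of bins that are local maxima with a correction below one bin."""
    mag = np.asarray(magnitude, dtype=float)
    corr = np.asarray(freq_correction_bins, dtype=float)
    k = np.arange(1, len(mag) - 1)            # interior bins only
    is_local_max = (mag[k] > mag[k - 1]) & (mag[k] >= mag[k + 1])
    is_reliable = np.abs(corr[k]) < 1.0        # correction below one bin
    return k[is_local_max & is_reliable]

# Two local maxima (bins 2 and 6); the second one needs a correction of
# 1.7 bins and is therefore discarded.
magnitude = [0.0, 1.0, 3.0, 1.0, 0.0, 2.0, 5.0, 2.0, 0.0]
correction = [0.0, 0.0, 0.2, 0.0, 0.0, 0.0, 1.7, 0.0, 0.0]
peaks = select_peaks(magnitude, correction)
```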
Unfortunately, this simple peak selection process also retains so-called noisy peaks (spectral manifestations of a process that can hardly be modeled by sinusoidal additive algorithms). Studies [6] have been made in order to decide whether a peak is tonal (i.e. whether it has to be considered as a sinusoidal contribution) or not. In a way, these methods could seem redundant because spectral peaks are by nature sinusoids. Conversely, it is possible to synthesize noisy sounds with additive algorithms [11]. Thus, instead of having a tonal criterion, one would like to know whether the set of parameters $s$ is coherent and reliable, i.e. is a spectral manifestation of a process that can correctly be modeled by sinusoidal additive algorithms.

The first subsection presents frequency prediction based criteria for sinusoidal characterization. The second one is dedicated to spectrum shape criteria. In the latter, after an introduction to a simple criterion used in MPEG Layer II, we present the cross-correlation and its use in this field. Its limitations lead us to introduce a new criterion that, thanks to modulation estimation, better separates noisy peaks from modulated peaks.

[Figure 6: Errors in amplitude estimation as a function of amplitude (top, $\Delta_a$ in dB per frame) and frequency (bottom, $\Delta_f$ in bins per frame) modulations. Standard method (dot-dashed), reassignment method (dashed) and proposed method (line).]

5.1. Frequency Prediction Criterion

Sinusoids, in a stationary model, should have coherent phase and frequency evolutions in time. In the general Advanced Audio Coding (AAC) Standard [12] and the Phase Derived Sinusoidality Measure (PDSM) in [6], frame-to-frame information is used to separate spectral peaks. Since frequency is the derivative of the phase, the closeness between the evolution of the phase measurements and that of the frequency measurements is evaluated and used as a peak selection criterion. This assumes a stationary behavior for the sinusoid, so that its spectral contribution stays in the same bin during the measurement process (up to 3 frames).

5.2. Spectrum Shape Criterion

In the MPEG Layer I and II psychoacoustic model [13], an amplitude criterion is used to establish the tonality of a peak using a fixed amplitude relation between surrounding bins. More formally, a peak is a tonal component if the following relation is satisfied:

$$X(k) - X(k+j) \geq 7\ \text{dB} \tag{11}$$

where $j$ and $k$ are bin indices of a 1024-length analysis frame. For MPEG Layer I, $j$ is chosen according to:

$$j = \begin{cases} -2, +2 & \text{for } 2 < k < 63 \\ -3, -2, +2, +3 & \text{for } 63 \leq k < 127 \\ -6, \ldots, -2, +2, \ldots, +6 & \text{for } 127 \leq k \leq 250 \end{cases} \tag{12}$$

This criterion, despite its quite empirical values, has shown good results in frequency masking curve computation, where precision is not so important.

The cross-correlation method has been used successfully in speech coding [14] as a voicing index and in sinusoidal modeling as a sinusoidal likeness measure [6]. These approaches rely on a stationarity assumption because the model used is a set of pure steady sinusoids (harmonic in the first case). Indeed, if a peak and its surrounding bins $H(f_k)$ have the same values as those of the analysis window spectrum translated in frequency and shifted in phase, they come from a pure steady sinusoid. In real signals, this is rarely true, so it is natural to use the cross-correlation $\Gamma_s$ ($s$ stands for stationary) to measure the correlation between the normalized $H(f_k)$ (real lobe computed via STFT, plotted with crosses in Figure 7) and $W(f_k)$ (plotted with a solid line), the normalized STFT of the analysis window $w$, over a narrow bandwidth $[-B, B]$:

$$\Gamma_s = \left| \sum_{k \,|\, f_k \in [f-B,\, f+B]} H(f_k)\, \overline{W(f_k)} \right| \tag{13}$$

where $f_k$ and $f$ are expressed in bins.

For harmonic sounds, if the fundamental frequency is modulated by a frequency factor $\Delta_f$, the $i$-th harmonic is modulated by a factor $i \Delta_f$, corrupting the spectrum so much that it led Marques in [5] to wonder whether there really were harmonics in the high frequencies of a speech spectrum. Indeed, frequency modulation spreads the energy over a large number of bins; this phenomenon can be observed on the spectrogram of the highest harmonics in Figure 8.
A solution to this problem is to flatten the spectrum by filtering the input signal to give it a constant fundamental frequency, equal to its mean value over time, thereby suppressing the frequency modulation. This method relies on a good fundamental frequency estimation and a monophonic source context. With an estimation of the modulations, we are able to get rid of these constraints. To estimate the coherence of the parameter set, instead of considering $W(f_k)$, we compare $H(f_k)$ to $A_s(f_k)$ (modulated lobe plotted with circles in Figure 7), the spectrum of a modulated sinusoid computed with Equation 5 from the extracted set of parameters $s$. $\Gamma_n$ ($n$ stands for non-stationary) is then defined as:

$$\Gamma_n = \left| \sum_{k \,|\, f_k \in [f-B,\, f+B]} H(f_k)\, \overline{A_s(f_k)} \right| \tag{14}$$
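The comparison between the stationary and non-stationary criteria can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the helper names `modulated_lobe` and `coherence` are hypothetical, the lobe is computed at baseband (frequency offsets relative to the sinusoid frequency), both lobes are normalized to unit energy over the band, and the complex conjugate in the correlation is an assumption chosen so that a constant phase shift does not change the result.

```python
import numpy as np

def modulated_lobe(offsets, n, delta_a=0.0, delta_f=0.0, phase=0.0):
    """Spectrum of a Hann-windowed modulated sinusoid around its own frequency.

    offsets : frequency offsets from the sinusoid frequency, in bins
    delta_a : amplitude modulation in dB per frame
    delta_f : frequency modulation in bins per frame
    """
    t = (np.arange(n) - (n - 1) / 2.0) / n      # time in frames, centered on 0
    envelope = np.hanning(n) * 10.0 ** (delta_a * t / 20.0)
    # One bin is one cycle per frame, so the chirp adds delta_f * t^2 / 2 cycles.
    signal = envelope * np.exp(1j * (phase + np.pi * delta_f * t ** 2))
    offsets = np.asarray(offsets, dtype=float)
    kernel = np.exp(-2j * np.pi * np.outer(offsets, t))
    return kernel @ signal

def coherence(measured, model):
    """Magnitude of the normalized cross-correlation of two complex lobes."""
    measured = measured / np.linalg.norm(measured)
    model = model / np.linalg.norm(model)
    return np.abs(np.vdot(model, measured))     # vdot conjugates its first argument

# A strongly modulated lobe compared with the window lobe (stationary model)
# and with a lobe computed from the same modulation parameters.
offsets = np.linspace(-2.0, 2.0, 17)            # band [-B, B] with B = 2 bins
measured = modulated_lobe(offsets, 1024, delta_a=3.0, delta_f=6.0, phase=1.0)
gamma_s = coherence(measured, modulated_lobe(offsets, 1024))
gamma_n = coherence(measured, modulated_lobe(offsets, 1024, delta_a=3.0, delta_f=6.0))
```

In this synthetic case the modulated model matches the measured lobe up to a phase factor, so `gamma_n` is close to 1 while `gamma_s` is lower. With real data the extracted parameters are only estimates, so `gamma_n` would also fall below 1.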

[Figure 7: Real modulated spectrum main lobe (crosses), compared with the window lobe (solid line) and the modulated lobe (circles) computed from Eq. 5 with measured parameters $(a, f, \phi, \Delta_a, \Delta_f)$. Horizontal axis: frequency offset from peak (bins).]

5.3. Applications

Since the $\Gamma_n$ criterion denotes the coherence and the reliability of the peak (see the discussion at the beginning of Section 5), we are able to robustly classify peaks. The applications can then be: noise reduction, by setting a threshold below which the peak is considered noisy; component selection in a sinusoidal coding framework; and ordering of peaks under a trust criterion for the next analysis stage.

As far as sinusoidal coding is concerned, component selection is mainly done in the literature by amplitude or loudness criteria [15]. This means that the peaks having the highest amplitude, Signal to Mask Ratio (SMR) or loudness are selected. Unfortunately, sinusoidal representations are usually associated in a hybrid framework with other representations (transient and noise). The sinusoidal section, which is often processed first, will try to model the input signal even if the representation is not relevant. $\Gamma_n$ selection allows us to select relevant components, i.e. those which adequately model the deterministic part of the input signal.

5.4. Results

In this section, we present results for three component selection criteria. The analyzed sound is a set of sinusoids with increasing frequency and decreasing amplitude, mixed with a bandpass-filtered white noise signal; its spectrogram is plotted in Figure 8. The first peak selection process uses a simple amplitude criterion that orders peaks by decreasing amplitude (see Figure 9(a)). The second one uses the stationary correlation criterion (see Figure 9(b)); it sorts peaks in decreasing values of $\Gamma_s$.
The last one uses the non-stationary criterion (see Figure 9(c)); it sorts peaks in decreasing values of $\Gamma_n$. Since the set of peaks is now ordered, we can choose the best peaks if we choose to retain only a few peaks at each frame ("best" means here that a peak has a criterion value higher than the other peaks in the set). In the three figures, at each frame, only the highest ten percent of the peaks detected by the analysis stage are plotted. Because the noise level is very high in the limited band, the amplitude criterion selects peaks generated by the white noise. As far as the detection of quasi-stationary sinusoids (fundamental and lowest harmonics) is concerned, the standard correlation criterion $\Gamma_s$ shows good results but fails quickly as the frequency modulation increases. The non-stationary criterion $\Gamma_n$ handles this problem better, even though, for very high harmonics, it does not seem possible to decide whether a peak comes from a highly modulated sinusoid or from white noise.

[Figure 8: Spectrogram of the analyzed signal. Axes: frequency versus time.]

6. CONCLUSION

In this paper, we have presented a way to handle non-stationarity in a sinusoidal model. By using the modulation extraction algorithm proposed by Masri in [2], we first correct the traditional Fourier parameters. Thanks to these modulation measures and more accurate Fourier parameters, we proposed a new peak selection process which better differentiates modulated components from stochastic components. These considerations greatly improve the accuracy and robustness of spectral modeling.

7. REFERENCES

[1] Robert J. McAulay and Thomas F. Quatieri, "Speech Analysis/Synthesis Based on a Sinusoidal Representation," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 34, no. 4, pp. 744-754, 1986.
[2] Paul Masri, "Computer Modeling of Sound for Transformation and Synthesis of Musical Signals," Ph.D. thesis, University of Bristol, 1996.
[3] Xavier Serra, Musical Signal Processing, chapter "Musical Sound Modeling with Sinusoids plus Noise," pp. 91-122, Studies on New Music Research. Swets & Zeitlinger, Lisse, the Netherlands, 1997.

[4] Sylvain Marchand and Robert Strandh, "InSpect and ReSpect: Spectral Modeling, Analysis and Real-Time Synthesis Software Tools for Researchers and Composers," in Proceedings of the International Computer Music Conference (ICMC), Beijing, China, October 1999, International Computer Music Association (ICMA), pp. 341-344.
[5] J. Marques and L. Almeida, "A Background for Sinusoid Based Representation of the Voiced Speech," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Tokyo, 1986, pp. 1233-1236.
[6] Geoffroy Peeters and Xavier Rodet, "SINOLA: A New Analysis/Synthesis Method using Spectrum Peak Shape Distortion, Phase and Reassigned Spectrum," in Proceedings of the International Computer Music Conference (ICMC), Beijing, China, October 1999, International Computer Music Association (ICMA).
[7] Sylvain Marchand, "Sound Models for Computer Music (analysis, transformation, synthesis)," Ph.D. thesis, University of Bordeaux 1, LaBRI, December 2000.
[8] Myriam Desainte-Catherine and Sylvain Marchand, "High Precision Fourier Analysis of Sounds Using Signal Derivatives," Journal of the Audio Engineering Society, vol. 48, no. 7/8, pp. 654-667, July/August 2000.
[9] François Auger and Patrick Flandrin, "Improving the readability of time-frequency and time-scale representations by the reassignment method," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 43, pp. 1068-1089, May 1995.
[10] Kelly Raymond Fitz, "The Reassigned Bandwidth-Enhanced Method of Additive Synthesis," Ph.D. thesis, University of Illinois, 1999.
[11] Pierre Hanna and Myriam Desainte-Catherine, "Influence of Frequency Distribution on Intensity Fluctuation of Noise," in Proceedings of the Digital Audio Effects (DAFx) Conference. University of Limerick and COST (European Cooperation in the Field of Scientific and Technical Research), December 2001.
[12] ISO MPEG4, "ISO/IEC JTC1/SC29/WG11 FDIS 14496 Information technology - Generic Coding of Audio Visual Objects, Part 3 (MPEG-4)."
[13] ISO MPEG, "ISO/IEC JTC1/SC29/WG11 Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to About 1.5 Mbit/s," standard no. 11172, alias MPEG-1 ISO-MPEG, November 1992.
[14] Daniel W. Griffin and Jae S. Lim, "A New Model-Based Speech Analysis/Synthesis System," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Tampa, 1985.
[15] Heiko Purnhagen, Nikolaus Meine, and Bernd Edler, "Sinusoidal Coding Using Loudness-Based Component Selection," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2002.

[Figure 9: Spectral peaks retained by the three peak selection processes: amplitude criterion (a), stationary correlation criterion (b) and non-stationary one (c). Axes: frequency (in Hz) versus frame number.]