Non-stationary Analysis/Synthesis using Spectrum Peak Shape Distortion, Phase and Reassignment

Geoffroy Peeters, Xavier Rodet
Ircam - Centre Georges-Pompidou, Analysis/Synthesis Team, 1, pl. Igor Stravinsky, 75004 Paris, France
Geoffroy.Peeters@ircam.fr, Xavier.Rodet@ircam.fr

Abstract

A new Analysis/Synthesis method, named SINOLA, based on sinusoidal additive and OLA/PSOLA synthesis, is proposed. It allows high-quality transformation of both the stationary and the non-stationary parts of a signal. Time-frequency characterization and synthesis parameter estimation are done by a novel method based on spectrum peak shape distortions and time-frequency phase evolutions.

Introduction

Speech and musical sound transformation plays an essential role in many applications today, such as movie production (post-synchronization), musical studio effects (pitch-shifting, time-warping), Text-to-Speech, prosody matching and so on. Depending on the required quality and on the allowed complexity, several methods can be used, from the simplest (elementary resampling by changing the speed of reading from a circular read/write buffer) to the most complex (the creation of an elaborate model of the source signal).

In the simplest case, the signal is transformed through a blind process, as with the phase vocoder. For better results, one can apply an Analysis/Synthesis (A/S) method. The analysis stage extracts the parameters necessary for an accurate transformation of the signal. These parameters are then changed according to the desired modification and used to synthesize the transformed signal.

In this paper, we propose SINOLA, a new sound transformation method which uses two different A/S methods: the sinusoidal additive method and the OLA/PSOLA method. Each of these methods is appropriate for parts of the signal having different characteristics.
1 The SINOLA model

Sinusoidal additive A/S consists of decomposing a signal into a sum of sinusoidal components with parameters varying slowly over time. This method is extremely accurate for signals which can be considered as a sum of sinusoids with stationary parameters in a window of 3 to 4 pitch periods. It allows high-quality and extended sound transformation thanks to complete control of the sinusoidal parameters. However, it is not appropriate for transitory components, non-periodic pulses and random components, which are difficult to represent by slowly varying sinusoids. On the other hand, Time-Domain Overlap-Add (TD-OLA) and TD-Pitch-Synchronous OLA (TD-PSOLA, which is important for periodic, i.e. harmonic, sounds) are well adapted to non-stationary or non-sinusoidal components, which require shorter windows. In SINOLA, the sinusoidal additive A/S is used to model the stationary sinusoidal components, while the OLA/PSOLA method is used to process attacks, transients, non-periodic pulses and random components (see Figure 1, bottom).

SIN: Sinusoidal additive A/S model [5]

$$s(t) = \sum_k A_k(t) \cos\left(2\pi f_k(t)\, t + \phi_k\right)$$

where $A_k(t)$, $f_k(t)$ and $\phi_k$ are the amplitude, frequency and initial phase of the $k$-th frequency component of the signal. Usually $A_k(t)$ and $f_k(t)$ are supposed to be low-pass signals and are therefore considered constant during a short analysis frame. At the synthesis stage, these parameters are interpolated between adjacent frames in order to avoid signal discontinuities. In section 2.3 we show how parameter variations can be included and evaluated in the analysis stage.

OLA: TD-OLA/TD-PSOLA method [3]

As opposed to sinusoidal additive A/S, OLA and PSOLA do not use a model. This can be viewed as a drawback, since the possibilities for sound modification are limited. But it can also be viewed as an advantage, since the whole signal frame is taken into account, not only the stationary sinusoidal part. The OLA method consists of decomposing the signal into overlapping frames, while PSOLA constrains these frames to be positioned in a pitch-synchronous way at the analysis and at the synthesis stage. A general formulation is:

$$s_i(t) = h_L(t - m_i)\, s(t), \qquad s'(t) = \sum_i s'_i(t - m'_i + m_i)$$

where $s_i$ is the $i$-th frame obtained by windowing the signal $s$ with a function $h_L$ defined during a duration $L$ centered around time $m_i$, $s'_i$ is the modified $i$-th frame, and $s'$ is the synthesis signal constructed by overlap-adding the successive frames positioned at the $m'_i$. In the case of PSOLA, the $m_i$ are positioned in a pitch-synchronous way, $L$ is equal to twice the local pitch period, and the positions of the $m'_i$ determine the pitch periods of the synthesis signal. The OLA/PSOLA method is depicted in Table 1 for each type of signal.

Frequency Shifted OLA / PSOLA

We introduce the Frequency Shifted OLA (FS-OLA) and the Frequency Shifted PSOLA (FS-PSOLA) in order to allow low-cost spectral modifications of the sound, independently of the pitch and time modifications. As opposed to FD-PSOLA [3], which is based on spectrum resampling, FS-OLA and FS-PSOLA are based on spectrum shifting:

$$S'(\omega) = S(\omega - \omega_s) \qquad (1)$$

where $\omega_s$ denotes the frequency shift. Unfortunately, when (1) is applied without care to a harmonic signal, the signal becomes inharmonic. However, if (1) is applied to each fundamental waveform (FW) separately, it results only in a shift of the spectral envelope and does not change the harmonicity properties. This is because a single FW does not carry any notion of pitch.

[Figure 1: SINOLA: (top) analysis stage, (bottom) synthesis stage. Block labels: input signal, transients, S/NS, partial tracking, SIN, OLA/PSOLA, output signal.]
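The loss of harmonicity under a plain spectrum shift can be checked with a short worked example. The symbols $f_0$ (fundamental frequency) and $\Delta f$ (shift in Hz), as well as the numerical values, are illustrative assumptions and do not come from the paper.

```latex
% Harmonic partials before the shift: f_k = k f_0, with k = 1, 2, 3, ...
% After a constant shift \Delta f applied to the whole spectrum:
f'_k = k f_0 + \Delta f
% Ratio of the k-th shifted partial to the first shifted partial:
\frac{f'_k}{f'_1} = \frac{k f_0 + \Delta f}{f_0 + \Delta f} \neq k
\quad \text{for } \Delta f \neq 0,\ k > 1
% Example: f_0 = 200 Hz and \Delta f = 50 Hz give 250, 450, 650 Hz.
% The ratios to the lowest partial are 1, 1.8, 2.6 instead of 1, 2, 3.
```

The spacing between partials is unchanged (200 Hz), but the partials are no longer integer multiples of the lowest one, which is what is heard as inharmonicity.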
Since a single FW carries no pitch information, its spectrum can be used as a rough approximation of the spectral envelope. The process is the following:

$$s'_i(t) = \mathrm{Re}\left[ a_i(t)\, e^{j 2\pi \beta t / L} \right] \qquad (2)$$

where $a_i$ denotes the analytic signal corresponding to the FW $s_i$, $L$ is the size of $s_i$ and $\beta$ is the frequency shift factor. $s'_i$ is then processed by the PSOLA method, giving the required pitch. In FS-OLA, the same frequency shifting is applied, but this time without any consideration of pitch or harmonicity.

Table 1: OLA/PSOLA method for different types of signals

Type:   transient | random | random (and periodic part) | periodic
Method: OLA       | OLA    | OLA-PSOLA                  | PSOLA

The remaining rows of the table give, for each type, the analysis marks, the synthesis marks and the frame modification. For transients, the analysis and synthesis marks are the transient positions. For periodic signals, the marks follow the pitch period of the original signal (see 2.4) at analysis and the pitch period of the synthesis signal at synthesis. For a random component with a periodic part, the marks are those of the periodic part. For random components, the marks rely on an average pitch period (see footnote 1) and the frames are modified by alternate time reversing and morphing between frames.

2 Parameter Estimation

Three types of information are needed for SINOLA and are retrieved simultaneously using the Short Time Fourier Transform (STFT) of the signal (see Figure 1, top):

1. a time-frequency characterization of the signal for its decomposition into transients, sinusoidal and non-sinusoidal components (see 2.1, 2.2),
2. the time-varying frequency, amplitude and phase of the sinusoidal components (see 2.3),
3. the pitch-synchronous markers in the case of PSOLA (see 2.4).

2.1 Transient detection

Transients are detected using a cross-entropy measurement derived from the Kullback-Leibler distance [2]. This measure (3) is computed from $A(t, \omega)$, the amplitude of the STFT at time $t$ and frequency $\omega$.

2.2 S/NS signal characterization

The Sinusoidal versus Non-sinusoidal (S/NS) signal characterization consists of measuring how well a part of the time/frequency plane can be represented by a sinusoidal model. It is therefore strongly dependent on the assumptions defining the sinusoidal model: local stationarity or non-stationarity of the sinusoidal parameters. Numerous methods have been proposed for S/NS characterization (see [8] for a review), but most of them use this stationarity assumption. In [6] we proposed a method, called the Phase Derived Sinusoidality Measure (PDSM), which allows measurement of sinusoidality without a stationary frequency assumption. For this, PDSM compares a temporal model of the evolution of the measured frequencies and a temporal model of the corresponding phase derivative. For a specific frequency band, if the models are close (according to a distance measure), this band can be represented by a sinusoidal model.

We give here a low-cost method to compute it using frequency reassignment [1], which can be written (using the band-pass convention, BP):

$$\hat{\omega}(t, \omega) = \frac{\partial \phi_{BP}(t, \omega)}{\partial t} = \omega + \Delta\omega(t, \omega) \qquad (4)$$

where $\phi_{BP}$ is the phase of the band-pass STFT and the correction $\Delta\omega$ is obtained from the ratio of two STFTs, one computed with the time derivative of the analysis window and one computed with the window itself [1]. The first formulation of $\hat{\omega}$ is the instantaneous frequency definition, which is often used in order to obtain precise frequencies from a Discrete Fourier Transform. The second formulation gives a low-cost method to compute it. It also expresses the correction to apply to the discrete frequency in order to obtain the exact frequency. The distance given by PDSM can be shown to be similar to this correction.
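The frequency correction described in section 2.2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `reassigned_frequency`, the Gaussian window, its width `sigma` and the sign convention (window written as h(n) with n measured from the frame centre) are all assumptions made for this example.

```python
import numpy as np

def reassigned_frequency(frame, fs, sigma):
    """Estimate the frequency (Hz) of the strongest peak of one frame.

    Hypothetical helper: the bin frequency is corrected with the ratio of
    the spectrum taken with the window derivative to the spectrum taken
    with the window itself (frequency reassignment).
    """
    n_samples = len(frame)
    # Time axis in samples, measured from the centre of the frame.
    n = np.arange(n_samples) - (n_samples - 1) / 2.0
    h = np.exp(-0.5 * (n / sigma) ** 2)   # Gaussian analysis window
    dh = -n / sigma ** 2 * h              # its analytic time derivative
    spec_h = np.fft.rfft(frame * h)
    spec_dh = np.fft.rfft(frame * dh)
    k = int(np.argmax(np.abs(spec_h)))    # strongest bin
    omega_bin = 2.0 * np.pi * k / n_samples          # rad/sample
    # Correction term; the minus sign follows from the window convention above.
    correction = -np.imag(spec_dh[k] * np.conj(spec_h[k])) / np.abs(spec_h[k]) ** 2
    return (omega_bin + correction) * fs / (2.0 * np.pi)

if __name__ == "__main__":
    fs = 16000.0
    t = np.arange(513) / fs
    frame = np.cos(2.0 * np.pi * 1234.5 * t + 0.3)   # test tone between two bins
    print(reassigned_frequency(frame, fs, sigma=64.0))
```

With a 513-sample frame the bins are about 31 Hz apart, so the test tone falls between two bins; the correction moves the estimate from the bin frequency towards the true frequency of the tone.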
Footnote 1 (Table 1): average pitch period of neighboring periodic regions.

2.3 Complex Short-Time Spectrum Distortion measure

In classical A/S methods, parameters are often estimated from short-time spectra. The signal is usually assumed to be stationary during the analysis window, and thus the spectrum is assumed to have peaks at the frequencies of the sinusoidal components. Unfortunately, the signal is rarely stationary during the analysis window: amplitude and frequency modulation of the signal components distort the shape of the assumed spectral peaks, therefore inducing incorrect parameter estimation. Previous studies have shown the importance of the spectrum distortion induced by these variations and have proposed partial solutions (neural network, signal normalization [6]) or an analytical formulation [4]. We propose here a complete parameter estimation method taking into account amplitude and frequency modulation.

The signal model is a sum of sinusoids with linear variation of amplitude ($a_k$) and of frequency ($\psi_k$). $\phi_k$ is the initial phase and $k$ is the peak index. For $t$ in the $i$-th frame centered on $t_i$ (we note $\tau = t - t_i$):

$$s(t) = \sum_k (A_k + a_k \tau) \cos\left(\phi_k + 2\pi f_k \tau + \pi \psi_k \tau^2\right)$$

The Short Time Complex Spectrum is estimated using a truncated Gaussian window

$$h(t) = \exp\left(-\frac{(t - \mu)^2}{2\sigma^2}\right), \qquad |t - \mu| \le L/2$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the Gaussian function and $L$ is the size of the truncation ($L$ must be sufficiently large compared with $\sigma$ in order to reduce the truncation effect). The distortion is measured by fitting a second-order polynomial around each log-amplitude spectrum peak ($P_A$) and around each corresponding unwrapped phase spectrum region ($P_\phi$). For a specific peak index $k$, the parameters of the signal model are obtained from the coefficients of $P_A$ and $P_\phi$ (5).

It is easy to show that the usual sinusoidal estimators of frequency, amplitude and phase (the peak position and the values of $P_A$ and $P_\phi$ at the peak) have a bias proportional to the frequency and amplitude modulation and to the size of the window (see [7] for details).

Partial Tracking with time-varying parameters

Once the sinusoidal parameters are estimated, the peaks of adjacent analysis frames are connected to form frequency tracks. This is called Partial Tracking. Usual partial tracking methods consider three successive frames in order to construct a track. Since the time derivatives of the parameters are part of our model, it suffices to consider only two frames together. For each couple of peaks, a track-score (6) is computed. In a frequency band, the couple that leads to the maximum score is chosen, provided this score is above a certain threshold. If the maximum score is below the threshold, there is a birth, a death or no track in this band.

The track-score (6) depends on the maximum curvature (see footnote 2) of two 3rd-order polynomials, one for frequency and one for amplitude, whose boundary conditions are the values and slopes estimated in the two frames, together with a set of model parameters. Results obtained with (6) are shown in Figure 2.

[Figure 2: Partial Tracking method: frequency and frequency slope estimations (thin dashed lines), partial births (thick dashed lines), partials (thick lines). Signal: female singing voice, window size: 14 ms, analysis step: 7 ms. Axes: Time (s), Frequency (Hz).]
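The polynomial fits of section 2.3 can be illustrated on a single chirp. The sketch below is a simplified example under stated assumptions, not the paper's equation (5): it assumes a Gaussian window, a constant-amplitude sinusoid with a linear frequency slope, and uses closed-form Gaussian relations derived for this illustration (the ratio of the two second-order coefficients equals the chirp rate times sigma squared). The function name `fit_peak` and all parameter values are hypothetical.

```python
import numpy as np

def fit_peak(frame, fs, sigma, half_width=2):
    """Return (frequency in Hz at the frame centre, frequency slope in Hz/s).

    Hypothetical helper: fits second-order polynomials to the log-amplitude
    and to the unwrapped phase around the strongest spectral peak.
    Assumes an odd frame length and no amplitude modulation.
    """
    n_samples = len(frame)
    n = np.arange(n_samples) - n_samples // 2        # centred time axis
    h = np.exp(-0.5 * (n / sigma) ** 2)              # Gaussian window
    # ifftshift puts the centre sample at index 0, so the phase is
    # referenced to the middle of the frame (no linear phase term).
    spec = np.fft.rfft(np.fft.ifftshift(frame * h))
    k = int(np.argmax(np.abs(spec)))
    bins = np.arange(k - half_width, k + half_width + 1)
    d_omega = (bins - k) * 2.0 * np.pi / n_samples   # rad/sample offsets
    log_amp = np.log(np.abs(spec[bins]))
    phase = np.unwrap(np.angle(spec[bins]))
    p = np.polyfit(d_omega, log_amp, 2)              # log-amplitude parabola
    q = np.polyfit(d_omega, phase, 2)                # phase parabola
    omega0 = 2.0 * np.pi * k / n_samples - p[1] / (2.0 * p[0])  # vertex
    chirp = q[0] / (p[0] * sigma ** 2)               # rad/sample^2
    return omega0 * fs / (2.0 * np.pi), chirp * fs ** 2 / (2.0 * np.pi)

if __name__ == "__main__":
    fs, f0, slope = 16000.0, 2000.0, 4000.0          # Hz, Hz, Hz/s (test values)
    n = np.arange(513) - 256
    psi = 2.0 * np.pi * slope / fs ** 2
    frame = np.cos(2.0 * np.pi * f0 / fs * n + 0.5 * psi * n ** 2 + 0.7)
    print(fit_peak(frame, fs, sigma=64.0))
```

For a stationary sinusoid the phase parabola is flat and the estimated slope is close to zero; a frequency slope both widens the log-amplitude peak and bends the phase, which is the kind of distortion the section exploits. Amplitude slopes are deliberately left out of this sketch.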
Footnote 2: curvature here means the second-order derivative.

2.4 PSOLA markers positioning

PSOLA markers have to be placed in a pitch-synchronous way, i.e. the distance between two markers must be equal to the local pitch period. Moreover, because of the windowing applied in the PSOLA method, the markers must be close to the local maxima of the signal energy. In speech processing, Glottal Closure Instants (GCI) detection methods are used in order to place PSOLA markers [9]. But for musical signals, GCI methods are not relevant. This is why other methods, which use phase spectrum information, have been proposed. With these, however, we cannot guarantee that the markers will be close to the local maxima of energy.

In order to fulfill both the periodicity and the energy conditions, we propose here a new method based on group delay. The method uses a weighted sum of the group delays of the frequency components. The weighting is made according to the component amplitudes. Let us define $\overline{Gd}(t)$ as this amplitude-weighted sum over $\omega$ of $Gd(t, \omega)$ (7), where $Gd(t, \omega)$ is the group delay of frequency $\omega$ for a window centered at time $t$. $Gd(t, \omega)$ can be computed in an efficient way using time reassignment [1] (band-pass convention), from the ratio of two STFTs, one computed with the analysis window multiplied by time and one computed with the window itself (8).

Marker positions are then given by the local minima of the time derivative of $\overline{Gd}(t)$ (special care has to be taken, as $\overline{Gd}$ is not injective). Because of the windowing applied before computing $Gd$, a confidence measure of $\overline{Gd}(t)$ must be computed for each $t$. It is given by an amplitude-weighted standard deviation (in $\omega$) of the $Gd$. Large standard deviation values mean small confidence, while small values mean large confidence. Results obtained with this new method are shown in Figure 3.

[Figure 3: PSOLA markers positioning: signal (top), confidence measure (middle), inverse of the derivative of $\overline{Gd}$ (bottom). Signal: male speech voice, window size: 20 ms, analysis step: 1 ms. Horizontal axis: Time (samples, x 10^4).]

Conclusion

From spectrum analysis, SINOLA derives all the information necessary for high-quality sound processing such as time warping, pitch shifting, spectrum dilatation and so on. Because of its dual processing (SIN and OLA), it preserves the inherent local characteristics of the signal (sinusoidal, random noise, attacks and transients) and allows easy and natural modifications of the signal. Examples of the sound quality obtained with this method will be given during the presentation of this paper.

References

[1] F. Auger and P. Flandrin. Improving the Readability of Time-Frequency and Time-Scale Representations by the Reassignment Method. IEEE Trans. Signal Processing, 43(5):1068-1089, 1995.
[2] M. Basseville. Distance Measures for Signal Processing and Pattern Recognition. Signal Processing, 18:349-369, 1989.
[3] F. Charpentier and M. Stella. Diphone Synthesis Using an Overlap-Add Technique for Speech Waveforms Concatenation. In ICASSP, Tokyo, 1986.
[4] J. Marques and L. Almeida. A Background for Sinusoid Based Representation of Voiced Speech. In ICASSP, Tokyo, 1986.
[5] R. McAulay and T. Quatieri. Speech Analysis/Synthesis based on a Sinusoidal Representation. IEEE Trans. Acoust. Speech Signal Process., 34(4):744-754, 1986.
[6] G. Peeters and X. Rodet. Sinusoidal versus Non-Sinusoidal Signal Characterisation. In COST-G6 DAFX, Barcelona, 1998.
[7] G. Peeters and X. Rodet. SINOLA: A New Analysis/Synthesis Method using Spectrum Peak Shape Distortion, Phase and Reassigned Spectrum. In ICMC, Peking, 1999.
[8] G. Richard and C. d'Alessandro. Analysis/Synthesis and Modification of the Speech Aperiodic Component. Speech Communication, 19:221-244, 1996.
[9] H. Strube. Determination of the Instant of Glottal Closures from the Speech Wave. J. Acoust. Soc. Am., 56(5):1625-1629, 1974.
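As a supplement to section 2.4, the amplitude-weighted group delay of one frame can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function name `weighted_group_delay`, the Hann window, and the normalization of the weighted sum by the sum of the amplitudes are choices made for this example.

```python
import numpy as np

def weighted_group_delay(frame):
    """Amplitude-weighted mean group delay, in samples, relative to the frame centre.

    Hypothetical helper: the group delay of each bin is obtained from the
    ratio of the spectrum taken with the time-weighted window n*h(n) to the
    spectrum taken with h(n) (time reassignment).
    """
    n_samples = len(frame)
    n = np.arange(n_samples) - (n_samples - 1) / 2.0   # centred time axis
    h = np.hanning(n_samples)
    spec_h = np.fft.rfft(frame * h)
    spec_th = np.fft.rfft(frame * n * h)
    power = np.abs(spec_h) ** 2
    valid = power > 1e-12 * power.max()                # skip empty bins
    gd = np.zeros_like(power)
    gd[valid] = np.real(spec_th[valid] * np.conj(spec_h[valid])) / power[valid]
    weights = np.abs(spec_h)                           # amplitude weighting
    return float(np.sum(weights * gd) / np.sum(weights))

if __name__ == "__main__":
    frame = np.zeros(513)
    frame[300] = 1.0                                   # single pulse, 44 samples after the centre
    print(weighted_group_delay(frame))
```

For a single pulse the value is the offset of the pulse from the frame centre. As the analysis window slides over a pulse, this offset decreases linearly and crosses zero when the window is centred on the pulse, which is why the time derivative of the weighted group delay is informative for marker placement.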