Accurate Delay Measurement of Coded Speech Signals with Subsample Resolution


Wenliang Lu, D. Sen, and Shuai Wang
School of Electrical Engineering & Telecommunications, University of New South Wales, Australia

Abstract

The accurate estimation of the relative delay between two signals is of vital importance in systems that strive to objectively measure the quality of synthetic speech. The measured delay is used to align the two signals before subsequent analysis, and thus inaccurate estimates lead to significant deviations from subjective measures of quality. In this paper, we explore four different methods in terms of their accuracy and complexity in calculating the delay between two signals. One method is the traditional cross-correlation method; the other three techniques offer sub-sample resolution. The methods are tested on a variety of different signals, including sinusoids and speech.

1. Introduction

In intrusive objective speech quality measurement, offline analysis is performed on segments of both original and synthesized speech. The first pre-processing step is the precise alignment of the original and degraded samples (Voran, 1999). Simple cross-correlation in the time domain provides a resolution of one sample. However, the synthesized signal produced by speech coding systems is most likely to have subsample delays (Quackenbush, Barnwell III, and Clements, 1988), making cross-correlation an inadequate technique.

The goal of objective measurement of speech quality is to predict corresponding subjective measurements of quality. Unlike non-intrusive methods, which do not require the reference (original) signal from which the synthesized signal was derived, intrusive systems work best when the original and synthesized signals are accurately aligned. This allows the comparison of the spectral content of the signals and makes it feasible to incorporate models of hearing to compute the amount of noise that is perceived by a listener. It is of no surprise, therefore, that better delay detection produces better prediction of subjective quality.

The current ITU standard for objective measurement of speech quality (P.862), PESQ (Perceptual Evaluation of Speech Quality) (ITU-T, 2001), measures the delay between the original and synthesized speech to a resolution of one sample. We found that when the actual delay between the original and synthesized signal differs by a non-integer number of samples, PESQ can produce considerably different scores. This is especially pronounced in non-waveform coders, for which the original and synthetic signals bear little waveform resemblance. In one test, 10 different codec systems, each containing 6 speech sentences of 4 seconds duration, were passed through PESQ. Subsequently, all 60 speech sentences were delayed by 0.5 samples before the same PESQ measurement was conducted again. The artificially introduced delay obviously did not change the perceptual quality of the signals. Results are shown in Figure 1.

Figure 1: PESQ (ITU-T, 2001) scores for ten different systems, 6 speakers each, with and without the additional 0.5-sample delay. Of the 60 speech segments, only a few have identical scores. The maximum difference between two scores is 1.98 MOS points, which is significant since MOS scores range from 1 to 5.
It is also significant that the largest discrepancies indicate a lower quality than when the signals were not artificially delayed, obviously indicating that the subsequent comparison of the time-frequency content is disrupted by an incorrect computation of the delay. Even though recursive algorithms are adopted in PESQ to account for changing delays in the system, the results above indicate that PESQ fails to account for the artificial delay. It is hypothesized, therefore, that better results are possible if higher-resolution alignment (1/4 or 1/16 of a sample, for example) can be achieved. In this paper, we investigate three methods, in addition to simple cross-correlation, for finding subsample delays between signals.
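The half-sample shift used for the experiment in Figure 1 can be applied without any resampling by adding a linear phase lag in the DFT domain, which is also how the delayed test signals of Section 2.2 are generated. The following is a minimal NumPy sketch of that idea, not the code used for the paper: the signal is a synthetic stand-in for a speech file, and the commented scoring lines assume the third-party `pesq` Python package rather than the ITU-T reference implementation.

```python
import numpy as np

def fractional_delay(x, d, pad=64):
    """Delay x by d samples (d may be non-integer) via a DFT phase ramp.

    Zero-padding before the FFT keeps the circular shift implied by the DFT
    from wrapping the tail of the signal back to its start, mirroring the
    zero-padding described in Section 2.2.
    """
    n = len(x) + pad
    X = np.fft.fft(x, n)
    k = np.fft.fftfreq(n) * n                 # signed bin indices 0..n/2-1, -n/2..-1
    y = np.fft.ifft(X * np.exp(-2j * np.pi * d * k / n)).real  # .real discards
    return y[: len(x)]                        # numerical residue at the Nyquist bin

if __name__ == "__main__":
    fs = 8000
    rng = np.random.default_rng(0)
    ref = rng.standard_normal(4 * fs)         # stand-in for a 4 s, 8 kHz speech file
    deg = fractional_delay(ref, 0.5)          # the 0.5-sample shift of Figure 1
    # Scoring both pairs would then look like (assuming the `pesq` package):
    # from pesq import pesq
    # print(pesq(fs, ref, ref, "nb"), pesq(fs, ref, deg, "nb"))
```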

2. Methods and Simulations

2.1. Techniques

As stated above, intrusive objective measurement uses the original speech as a reference. An intuitive use of the original signal is to calculate a masking threshold according to a psychoacoustic model. The masking threshold can be superimposed on the noise spectrum, calculated as the difference between the original and synthesized spectra, to indicate the amount of noise energy above the masking threshold.

In this paper, original signals are referred to as s_orig[n] and degraded signals as s_synth[n]. It is assumed that s_synth[n] is derived by passing s_orig[n] through some system such as a speech coder, and that s_orig[n] and s_synth[n] have the same length L. These systems are usually non-linear and produce a variable delay between the input and output.

Four different algorithms are investigated in terms of their ability to calculate the delay between the original and synthesized waveforms; a code sketch of the methods appears after this list.

1. The first method is the traditional normalized cross-correlation method (Knapp and Carter, 1976). The index of maximum cross-correlation between s_orig[n] and s_synth[n] is taken to be the delay. This method obviously has a resolution limited to one sample. In practical speech coders the actual delay d_real is rarely an integer number of samples, and the best result achievable by this method is the nearest integer to d_real.

2. The second method operates in the frequency domain, where one signal can be delayed with any precision, integer or fractional, by adding a phase lag to the signal's discrete Fourier transform. The delay is achieved without any sampling-rate change (which would usually require an upsampling followed by a downsampling) in the time domain. To estimate the delay, we first delay the original spectrum S_orig(θ) to produce a delayed version S_orig_d(θ):

S_orig_d(θ, d_test) = S_orig(θ) e^{-j d_test θ 2π / L}    (1)

Here L is the length of s_orig[n]. Subsequently, a dot product P_d(d_test) is computed between S_orig_d(θ) and the conjugate of S_synth(θ):

P_d(d_test) = Σ_{θ=0}^{2π} S_orig_d(θ, d_test) S*_synth(θ)    (2)

The dot product can be interpreted as the zero lag (R_orig_d,synth[0]) of the cross-correlation between s_orig_d[n] and s_synth[n] in the time domain. This procedure is repeated within a predefined range of delays [d_down, d_up], with a step size of 1/2, 1/4 or 1/16, depending on the desired resolution. The d_test for which P_d(d_test) is maximum is taken as the final estimate of the delay. The predefined range [d_down, d_up] should be chosen carefully to ensure that it covers the true delay. This algorithm does not require the use of the IDFT, which would introduce a circular-shift effect caused by the periodic extension implicit in the use of the DFT/IDFT.

3. The third method measures a subsample-resolution delay in the time domain. It is essentially the cross-correlation method of Method 1, but with an upsampling factor M applied to both s_orig[n] and s_synth[n], producing an estimate of the delay d_detected with a resolution of 1/M. In practice, the characteristics of the linear-phase low-pass filter used for the interpolation also have an effect on the result.
For comparison, we have used a linear-phase FIR filter of order 31 to ensure very narrow transition bands. The frequency response of the filter is shown in Figure 2.

Figure 2: Frequency response (magnitude and phase) of the linear-phase low-pass FIR filter used for interpolation.

4. The final method is similar to Method 2 but zero-pads the signals before applying the same procedure, so that the length of the signals increases from L to 2L - 1. Without zero-padding in the time domain, Method 2 uses less than half the computation time of this method for the same signals. Technically, however, it should produce results identical to Method 2, since zero-padding in the time domain amounts to band-limited interpolation in the frequency domain, which does not provide extra information. Since the lengths of the signals are doubled, the complexity of this algorithm is increased significantly.
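To make the four estimators concrete, the sketch below gives a NumPy/SciPy rendering of Methods 1 to 3 under a few stated assumptions: the function and variable names are ours, the Method 2 search range is seeded by the integer estimate of Method 1 (in the spirit of the recursive strategy discussed in Section 4) rather than a wide preset [d_down, d_up], and the Method 3 interpolation relies on scipy.signal.resample_poly with its default low-pass filter rather than the specific filter of Figure 2. Method 4 is obtained from Method 2 by zero-padding both signals to length 2L - 1 before the FFT, as noted in the docstring.

```python
import numpy as np
from scipy.signal import resample_poly

def method1_integer_delay(orig, synth):
    """Method 1: peak of the time-domain cross-correlation (1-sample resolution)."""
    r = np.correlate(synth, orig, mode="full")
    return int(np.argmax(r)) - (len(orig) - 1)

def method2_subsample_delay(orig, synth, step=1/16, half_range=2, zero_pad=False):
    """Method 2: delay the original spectrum by a trial phase ramp and maximise
    the zero-lag correlation P_d(d_test) of equation (2).

    With zero_pad=True the signals are padded to length 2L-1 first, which is
    Method 4: the estimates are identical, only the cost roughly doubles.
    """
    L = len(orig)
    n = 2 * L - 1 if zero_pad else L
    S_orig = np.fft.fft(orig, n)
    S_synth = np.fft.fft(synth, n)
    k = np.fft.fftfreq(n) * n                       # signed DFT bin indices
    cross = S_orig * np.conj(S_synth)               # bin-wise product, computed once

    # Search a small window around a coarse integer estimate (Method 1)
    # instead of a wide preset range [d_down, d_up].
    d0 = method1_integer_delay(orig, synth)
    best_d, best_p = d0, -np.inf
    for d_test in np.arange(d0 - half_range, d0 + half_range + step, step):
        p = np.sum(cross * np.exp(-2j * np.pi * d_test * k / n)).real
        if p > best_p:
            best_d, best_p = d_test, p
    return best_d

def method3_upsampled_delay(orig, synth, M=16):
    """Method 3: upsample both signals by M, then integer cross-correlation,
    giving 1/M resolution. The quality of the low-pass interpolation filter
    (here scipy's default design) affects the result."""
    up_o = resample_poly(orig, M, 1)
    up_s = resample_poly(synth, M, 1)
    lag = np.argmax(np.correlate(up_s, up_o, mode="full")) - (len(up_o) - 1)
    return lag / M
```

A quick self-check is to generate a noise burst, delay it by, say, 3.3125 samples with a fractional-delay routine such as the one sketched in the Introduction, and confirm that Method 1 returns 3 while Methods 2 and 3 return a value within one step of the true delay.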

2.2. Source signals and delay generation

Four types of signals are used to test the algorithms described in the previous section. The sampling rate f_s of all signals was 8000 Hz. The signals are:

1. Single sinusoid: This set of test stimuli consisted of single sinusoids ranging in frequency from 0.2 f_s to 0.4 f_s. The multiple frequencies serve to expose the performance of the methods as a function of frequency.

2. Multiple sinusoids: This set of stimuli was created by combining sinusoids of different frequencies, with 3 random frequency components for each combination tested.

3. Speech signal: Speech segments containing one sentence were tested.

4. Random noise: Random noise with a Gaussian distribution and zero mean was used as the test stimulus.

To make the results more relevant, random delays were used to create the delayed signal for comparison. There are several ways of producing a delayed signal s_synth[n] from a known signal s_orig[n]. In this paper, we add to the DFT phase of the original signal a linear phase lag -2π d_real k / L, where d_real is the delay between s_orig[n] and s_synth[n], L is the length of the original signal, and k = 0, 1, ..., L - 1. To avoid the effect of circular shifts, the original signal is padded with zeros and the delayed signal is recovered beginning at a certain starting index.

3. Results

3.1. Accuracy of the methods

Figures 3 to 6 show the performance of the four techniques applied to the four stimuli as a function of SNR. Each procedure is repeated 32 times and the mean error is computed. As Method 1 is the only method not expected to produce subsample resolution, its error is always larger than that of the other methods for all signals, as can be anticipated.

White noise of varying power was added to the stimuli to measure the performance as a function of SNR. As can be seen from Figure 3, for the single sinusoid the log error levels out at around 4 at low SNR; as the SNR is increased, the error decreases before leveling out again once the SNR is sufficiently high. The same trend can be observed for the other three sets of stimuli: the error reaches a minimum once the SNR exceeds a signal-dependent threshold. One factor that affects the performance in noise is the number of frequency components in the stimuli. The four types of signals tested here have different numbers of frequency components, and the smallest SNR needed to achieve the best performance is directly related to this. Table 1 shows the minimum SNR required to achieve the best performance for each stimulus: the greater the frequency content of the stimulus, the lower the SNR required. As single sinusoids have only one frequency component, they are affected most easily by noise. In comparison, random noise has the largest number of frequency components of the four signal types and thus achieves the best results even at SNRs as low as 8 dB. As expected, Methods 2 and 4 give the same results for all four types of signals.

Table 1: Minimum SNR required for best performance, by signal type.

  Signal                SNR needed
  Single sinusoid          dB
  Multiple sinusoids     3 dB
  Speech                   dB
  Random noise           9 dB
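The accuracy curves of Figures 3 to 6 are obtained by repeating each estimate over independent noise realisations. A minimal sketch of such an evaluation loop is shown below; the helper names are ours, the estimator is passed in as a callable (for example one of the methods sketched in Section 2.1), and the 32 trials per SNR point follow the text above.

```python
import numpy as np

def add_noise_at_snr(x, snr_db, rng):
    """Add white Gaussian noise scaled so that 10*log10(P_signal/P_noise) = snr_db."""
    p_sig = np.mean(x ** 2)
    noise = rng.standard_normal(len(x))
    noise *= np.sqrt(p_sig / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return x + noise

def mean_abs_delay_error(estimator, orig, synth_clean, d_real, snr_db,
                         trials=32, seed=0):
    """Average |estimated - true| delay over `trials` independent noise draws.

    `estimator(orig, synth)` returns a delay estimate in samples;
    `synth_clean` is the noiselessly delayed version of `orig` (true delay d_real).
    """
    rng = np.random.default_rng(seed)
    errs = [abs(estimator(orig, add_noise_at_snr(synth_clean, snr_db, rng)) - d_real)
            for _ in range(trials)]
    return float(np.mean(errs))
```

Sweeping snr_db and plotting the result on a log scale should show the trend described above: an error floor at low SNR, a decrease as the noise is reduced, and a method-dependent plateau once the SNR is high enough.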
3.2. Complexity

In this section we investigate the complexity of Methods 2, 3 and 4, as they are the methods that produce subsample-resolution delays. From the algorithm, the complexity of Method 2 is (6 log2 l + 2(e - b)m) l real additions and (4 log2 l + 4(e - b)m) l real multiplications, where b and e indicate the beginning and end of the detection range, m represents the subsample resolution, for example 4 or 16 (for 1/4 and 1/16 resolution), and l is the length of the signals involved.

Method 4 has almost identical performance to Method 2. However, it uses zero-padding, which results in sequences twice as long as those used for Methods 1 to 3. This leads to a complexity of (6 log2 2l + 2(e - b)m) 2l real additions and (4 log2 2l + 2(e - b)m) 2l real multiplications. The method is thus more than twice as complex as Method 2. Another important factor, the detection range e - b, plays an important role in the complexity of Methods 2 and 4. e and b are determined in advance, and the complexity is almost linear in e - b. Also, when m is high, for example m = 64, even a small e - b leads to an enormous amount of computation. Further strategies are discussed in the next section.

Method 3 requires (18 log2 l + 2 l_f m + 22) l real additions and (12 log2 l + 2 l_f m + ...) l real multiplications, where l_f is the length of the low-pass filter used for interpolation. Obviously, when l is fixed, l_f is the main factor that controls the complexity of Method 3.

Figure 7 shows the CPU time used when Methods 2, 3 and 4 are applied to two signals of different lengths (16384 and 31716 samples) on a PC with a Pentium 4 processor (3.0 GHz) and 2 GBytes of memory. When the signal length approximately doubles, the complexity of Methods 2 and 4 almost doubles. Method 4 takes about twice the CPU time of Method 2 for the same signal, as expected. For Method 3, the CPU time increases from 17 seconds to 23 seconds, a ratio of about 1.3, which is small relative to the increase in signal size. In a nutshell, Method 3 has an almost fixed computational complexity, while Methods 2 and 4 can reduce their complexity with the recursive-like strategy described in the next section.

Figure 7: CPU time (in seconds) for Methods 2, 3 and 4 applied to signals of length 16384 and 31716.
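As a rough sanity check on these operation counts, the snippet below evaluates the addition formulas for the two signal lengths of Figure 7. The resolution m = 16 corresponds to 1/16-sample steps; the detection-range width e - b = 4 samples and the interpolation-filter length l_f = 320 are illustrative assumptions, not values taken from the paper.

```python
import math

def adds_method2(l, m, width):
    # Method 2: (6*log2(l) + 2*(e-b)*m) * l real additions
    return (6 * math.log2(l) + 2 * width * m) * l

def adds_method4(l, m, width):
    # Method 4: zero-padded to ~2l, so (6*log2(2l) + 2*(e-b)*m) * 2l
    return (6 * math.log2(2 * l) + 2 * width * m) * 2 * l

def adds_method3(l, m, lf):
    # Method 3: (18*log2(l) + 2*lf*m + 22) * l real additions
    return (18 * math.log2(l) + 2 * lf * m + 22) * l

m, width, lf = 16, 4, 320          # assumed: 1/16 resolution, e-b = 4, filter length 320
for l in (16384, 31716):           # the two signal lengths used in Figure 7
    print(f"l={l}: "
          f"M2 adds ~ {adds_method2(l, m, width):.2e}, "
          f"M3 adds ~ {adds_method3(l, m, lf):.2e}, "
          f"M4 adds ~ {adds_method4(l, m, width):.2e}")
```

With these assumptions, Method 4 costs slightly more than twice Method 2, and Method 3 is dominated by the 2*l_f*m interpolation term, in line with the observations above.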

Figure 3: Mean error for the single sinusoid.
Figure 4: Mean error for the multiple sinusoids.
Figure 5: Mean error for the speech signal.
Figure 6: Mean error for the random signal.

4. Discussion

A slight disadvantage of Methods 2 and 4 is the requirement of a preset range of test delays. This range can be determined by an initial use of Method 1 to detect candidate integer values around which to search for the actual delay. The size of the preset range has a linearly proportional impact on the computational complexity, so if the range can be narrowed in advance, the complexity can be reduced significantly. A combination of Method 2 with Method 1 in such a recursive strategy leads to excellent accuracy with the least computational complexity for the same performance.

For low SNR, Method 3 produces the same error as Method 2. However, the error performance of Method 3 improves sharply at higher SNR. This characteristic can be attributed to the low-pass filter used in the algorithm. The filter requires a transition band as narrow as possible, at the cost of stop-band ripple; achieving this requires a relatively high-order low-pass filter, which makes the method computationally expensive. When the SNR is high enough, the effect of the linear-phase low-pass filter disappears, except for the single sinusoid, which is distorted by the filter and cannot achieve the expected results.

All the methods discussed in this paper are based on waveform similarity. In practical speech coding systems, the synthesized signal will not only be delayed but also significantly altered in amplitude. This means that the error in the delay estimation may be larger for these systems.

5. Conclusion

In this paper, we have investigated three methods of aligning signals with sub-sample resolution. It was found that Method 2, combined with Method 1 in a recursive strategy, provides the best tradeoff between complexity and performance. Method 3 may be practical for signals with high SNR, but at the cost of significantly higher complexity.

References

ITU-T (2001). Perceptual Evaluation of Speech Quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. ITU-T Recommendation P.862.

Knapp, C. H. and G. C. Carter (1976). The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing.

Quackenbush, S., T. Barnwell III, and M. Clements (1988). Objective Measurement of Speech Quality. Prentice Hall.

Voran, S. (1999). Objective estimation of perceived speech quality, Part I: Development of the measuring normalizing block technique. IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 4.