
Enhancing 3D Audio Using Blind Bandwidth Extension (PREPRINT)

Tim Habigt, Marko Ðurković, Martin Rothbucher, and Klaus Diepold
Institute for Data Processing, Technische Universität München, 80290 München, Germany
Correspondence should be addressed to Tim Habigt (tim@tum.de)

AES 129th Convention, San Francisco, CA, USA, 2010 November 4-7

ABSTRACT

Blind bandwidth extension techniques recreate the high-frequency bands of a narrowband audio signal. These methods make it possible to increase the perceived quality of signals that are transmitted over a narrow frequency band, as in telephone or radio communication systems. We evaluate the use of blind bandwidth extension methods in 3D audio applications, where high-frequency components are necessary to create an impression of elevated sound sources.

1. INTRODUCTION

An emerging feature in communication systems is 3D audio using Head-Related Transfer Functions (HRTFs). In this scenario, virtual sound sources are placed at different azimuths and elevations to create an immersive three-dimensional impression.

The human auditory system uses several cues to localize sounds, including interaural time differences (ITD), interaural level differences (ILD), and spectral cues [2]. Whereas lateral localization can be achieved using ILD and ITD alone, spectral cues are needed to place virtual sound sources at different elevations. Several studies have shown that spectral cues in high frequency bands are required to perceive the elevation of sound sources. Algazi et al. [1], for example, showed that localization accuracy drops significantly if the audio bandwidth is reduced from 22 kHz to 3 kHz. Although elevation cues also exist in low frequency regions, high frequency bands play an important role in the perception of elevated sound sources. Figure 1 shows magnitude plots of HRTFs of a KEMAR dummy head over the elevation range from 0° to 50°. Elevation-dependent spectral cues are visible especially in the frequency range above 5 kHz.

Fig. 1: HRTF magnitude plots of a KEMAR over a range of different elevations (elevation in degrees versus frequency in kHz). The colorbar indicates the magnitude in dB.

Telephone networks and radio communication systems are generally band-limited. Such systems work, for example, with sampling rates of 8 kHz and therefore transmit a usable bandwidth of only about 4 kHz. This degrades the speech quality, but the loss of fidelity is usually tolerable as it does not severely impact the intelligibility of the communication. It does, however, degrade 3D audio quality in band-limited communication systems, because the narrowband speech signal does not excite the required high-frequency components.
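For readers who want to reproduce a plot like Figure 1, the elevation-dependent magnitude responses can be computed directly from a set of head-related impulse responses. The following Python sketch assumes the HRIRs are already loaded into an array with one row per elevation; the sampling rate and FFT length are illustrative assumptions.

```python
import numpy as np

def hrtf_magnitude_db(hrirs, fs=44100, nfft=512):
    """Magnitude responses (in dB) of a set of HRIRs, one per elevation.

    hrirs : array of shape (n_elevations, n_taps), assumed already loaded
            from a database such as the MIT KEMAR set.
    Returns the frequency axis in Hz and an (n_elevations, nfft//2 + 1)
    matrix of dB magnitudes, i.e. the data behind a plot like Figure 1.
    """
    spectra = np.fft.rfft(hrirs, n=nfft, axis=-1)           # one spectrum per elevation
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)                # frequency axis in Hz
    return freqs, 20.0 * np.log10(np.abs(spectra) + 1e-12)   # small floor avoids log(0)
```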

Recently, bandwidth extension or spectral band replication has been used in audio codecs such as AAC+ [5]. These techniques make it possible to replicate high-frequency spectral features from low-frequency information. Therefore, only a small part of the speech bandwidth (e.g. 0-5 kHz) needs to be coded and transmitted; the high-frequency components (5-10 kHz) can be replicated with only a small amount of guiding information. Blind bandwidth extension (BWE) recreates high-frequency components without any additional information.

In previous work, blind bandwidth extension techniques have primarily been used to improve the perceived audio quality. In this work, we evaluate whether high-frequency components created by blind bandwidth extension can excite the spectral cues needed to provide an impression of sound source elevation. Several algorithms for high-quality audio bandwidth extension have been proposed. Our main goal is to excite high-frequency components of the HRTF to provide elevation cues; we concentrate on localization accuracy and do not evaluate the perceived speech quality. We conducted listening experiments to evaluate this approach. In the listening tests the participants had to judge the elevation of a virtual sound source. Three different sound signals were used, representing the cases of narrowband speech, bandwidth-extended (wideband) speech, and broadband noise.

2. BLIND BANDWIDTH EXTENSION

We want to spatialize signals sampled at 8 kHz (a usable bandwidth of about 4 kHz) and employ a BWE algorithm that doubles the sampling rate to 16 kHz, extending the bandwidth to about 8 kHz. There are several blind bandwidth extension algorithms that aim for high-quality speech reconstruction ([6], [4], [8], [3]). Blind bandwidth extension generally consists of two tasks: first, the high-frequency band of the signal is reconstructed; afterwards, these components are shaped by a filter to match the spectral envelope of natural speech. We employ a bandwidth extension method with low computational complexity proposed by Yasukawa [11]. The processing scheme is composed of several steps, illustrated in Figure 2.

Fig. 2: Blind bandwidth extension scheme (1:2 interpolation with low-pass filter LPF, full-wave rectifier REC, high-pass filter HPF, shaping filter SF, gains g, and summation SUM).

Fig. 3: Transfer function of the shaping filter (SF).

The first step doubles the sampling rate of the signal: the input is interpolated by a factor of two by inserting zeros and applying a low-pass filter (LPF) to remove the spectral images created by the zero insertion. In the next step the high-frequency components are generated by a non-linear processing element. Yasukawa proposes a full-wave rectifier (REC) as the non-linear element. The rectifier creates all even harmonics of the input signal. If the input is, for example, a sine wave sin(ωt), the harmonics appear at the even multiples of the fundamental frequency ω in the Fourier series of the rectified signal,

|\sin(\omega t)| = \frac{2}{\pi} - \frac{4}{\pi} \sum_{n=2,4,6,\ldots} \frac{\cos(n\omega t)}{n^2 - 1}.   (1)

The following high-pass filter (HPF) removes the signal components in the 0-4 kHz band. This filter removes the DC offset created by the rectification and ensures that the original signal is not distorted. A shaping filter (SF) then modifies the spectrum of the generated components to better match the spectral envelope of the human voice; its transfer function is shown in Figure 3. We designed the shaping filter according to the version proposed by Yasukawa. At the end of the processing chain, the generated high-frequency components are added to the original low-bandwidth signal. Two amplifiers (g) allow the gains of both paths to be adjusted.
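To make the processing chain concrete, the following Python/SciPy sketch implements the same structure (interpolation, rectification, high-pass, shaping, weighted sum). The filter orders, cutoffs, shaping-filter slope and gain values are illustrative assumptions, not the exact designs used in the paper.

```python
import numpy as np
from scipy import signal

def blind_bandwidth_extension(x, fs_in=8000, gain_low=1.0, gain_high=0.3):
    """Extend a narrowband signal sampled at fs_in to 2*fs_in.

    Sketch of the rectifier-based scheme described above; filter orders,
    cutoffs and gains are assumptions chosen for illustration.
    """
    # 1:2 interpolation: insert zeros, then low-pass at the old Nyquist frequency
    up = np.zeros(2 * len(x))
    up[::2] = x
    lpf = signal.firwin(129, 0.5 * fs_in, fs=2 * fs_in)        # anti-imaging low-pass
    low = 2.0 * signal.lfilter(lpf, 1.0, up)                    # factor 2 restores level

    # non-linear element: full-wave rectification generates even harmonics
    rect = np.abs(low)

    # high-pass removes the 0-4 kHz band and the DC offset of the rectifier
    hpf = signal.firwin(129, 0.5 * fs_in, fs=2 * fs_in, pass_zero=False)
    high = signal.lfilter(hpf, 1.0, rect)

    # shaping filter: gentle roll-off toward a speech-like envelope (assumed slope)
    sf = signal.firwin2(65, [0.0, 0.5 * fs_in, fs_in], [1.0, 1.0, 0.25], fs=2 * fs_in)
    shaped = signal.lfilter(sf, 1.0, high)

    # weighted sum of the interpolated original band and the synthetic high band
    return gain_low * low + gain_high * shaped
```

For an 8 kHz speech signal `x`, `y = blind_bandwidth_extension(x)` returns a 16 kHz signal whose 4-8 kHz band is synthesized from the rectified low band while the original 0-4 kHz content is left intact.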

Fig. 4: Spectrum of a narrowband speech signal.

Fig. 5: Spectrum of the same speech signal after bandwidth extension.

The effect of the bandwidth extension is illustrated in Figure 5, which shows the spectrum of the narrowband speech signal from Figure 4 after the extension. It can be seen that the original frequency components in the 0-4 kHz range are not affected by the algorithm.

3. SUBJECTIVE EVALUATION

Algazi et al. [1] showed that elevation localization depends on the bandwidth of the signal. They observed that localization accuracy drops significantly if the audio bandwidth is reduced from 22 kHz to 3 kHz, and that the reduction in localization performance is largest in the median plane. For this reason, we chose the positions of the virtual sound sources as follows. The virtual sound sources were positioned at three different azimuths (135°, 180°, 215°), where 0° is directly in front of the listener and 90° is to the left. All positions are in the back hemisphere to avoid front-back confusions. The elevations were chosen from 0° to 50° with a spacing of 10°. In our coordinate system, 0° denotes the horizontal plane and 90° is directly above. No negative elevation angles were used, to prevent up-down confusions.

We spatialized the sound sources by convolving them with Head-Related Impulse Responses (HRIRs) from the publicly available MIT database of a KEMAR [7]; a minimal sketch of this convolution step is shown after Section 3.1. We were not able to measure the individual HRTFs of the test subjects, so we used HRTFs measured with a dummy head that matches average human torso and head dimensions. The choice of a generic set of HRTFs can cause difficulties, as it is not guaranteed that the spatial cues match the test subject. We therefore included broadband noise signals in our sound samples and used them as an indicator of whether a subject could extract correct elevation cues from the HRTFs. The speech sources were taken from VoxForge [10], a free speech database under the GPL license.

3.1. Stimuli

The sound samples fall into three classes:

1. Narrowband speech signals (4 kHz bandwidth)
2. Bandwidth-extended speech signals (8 kHz bandwidth)
3. Wideband noise signals (22 kHz bandwidth)

The speech samples have a duration of about 5 s. The wideband noise signal acts as a reference. Ten subjects aged 22-29 volunteered for the study. The datasets of two participants were discarded because they were not able to correctly localize the broadband noise signal for any spatial configuration, and it was assumed that the generic HRTFs did not match their physical characteristics.
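As referenced above, the binaural stimuli were produced by convolving each mono sample with the HRIR pair for the desired direction (the paper used a Matlab implementation). A minimal Python sketch of that step, assuming the left- and right-ear HRIRs have already been loaded from the MIT KEMAR database:

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(mono, hrir_left, hrir_right):
    """Render a mono signal at the direction of the given HRIR pair.

    hrir_left, hrir_right : impulse responses for one measured direction,
    assumed to be loaded from a database such as the MIT KEMAR set.
    Returns a two-column (left, right) array, normalized to avoid clipping.
    """
    left = fftconvolve(mono, hrir_left)     # left-ear signal
    right = fftconvolve(mono, hrir_right)   # right-ear signal
    binaural = np.stack([left, right], axis=1)
    return binaural / np.max(np.abs(binaural))
```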

3.2. Evaluation procedure

The participants had to decide which of two sound samples was located at the higher position. For each trial, a randomly selected sound sample was taken from the database and spatialized at two different elevations on the same azimuth. These two virtual sound sources were presented to the listener with a short pause in between. The participant then had to choose which sample was located higher and mark the choice in a graphical user interface. This procedure is similar to the one used by Kulkarni and Colburn [9]. The graphical user interface as well as the convolution of the sound signals with the HRTFs was implemented in Matlab. The signals were presented over a pair of Sennheiser HD 380 Pro headphones, and the experiments were conducted in a distraction-free environment. At the beginning of the evaluation, each participant listened to 2 different sound samples from different directions to become familiar with the task and the signals; these results were not included in the evaluation. Furthermore, the participants were informed about the possible locations of the sound sources.

4. RESULTS

Our main goal was to evaluate the impact of blind bandwidth extension on elevation perception. Figure 6 shows the results of this evaluation. Three graphs represent the different stimuli and show the percentage of correct localizations over the angular distance. The smallest elevation distance between two points is 10°, due to the spatial sampling grid of the HRTF database. The distances 30°, 40° and 50° were combined to get roughly the same number of samples in each of the three elevation-distance bins. All results are shown with their respective 95% confidence intervals. A localization success rate of 50% corresponds to guessing and indicates that the test subject could not extract the elevation cues necessary for correct localization.

Fig. 6: Results of the subjective evaluation. The plots show the percentage of correct localizations over the elevation distance (10°, 20°, and >30°) for the three stimuli (narrowband, bandwidth-extended, wideband noise). The error bars represent the 95% confidence intervals.

In the case of wideband noise, the participants correctly identified the higher source in 83% of the cases for distances > 30°. This shows that the participants were able to extract the elevation cues even though non-individualized HRTFs were used. The percentage of correct localizations for the narrowband speech signal is 50% in this case, which shows that the participants could only guess and obtained no information about the elevation. In the case of bandwidth-extended speech, the elevation localization success rate is 68%. Considering the chosen confidence level, this is a significant improvement over the narrowband case. For small distances between the sound sources (10° in this test), the localization success rate lies within the bounds of chance performance.

It has been reported that localization in the vertical direction is impaired especially in the median plane. For this reason, we divided the test samples into two groups of azimuths: half of the samples are located in the median plane, whereas the other half are at azimuth angles 135° and 215°. Table 1 shows the results of this comparison.

Azimuth       Wideband noise    Bandwidth-extended speech    Narrowband speech
180°          83.9 ± 9.2%       62.0 ± 13.5%                 39.2 ± 13.4%
135°, 215°    82.5 ± 11.8%      72.6 ± 11.1%                 57.3 ± 11.2%

Table 1: Localization success rate with respect to azimuth.

The elevation localization success rate is roughly 10 percentage points higher at positions outside of the median plane for the bandwidth-extended stimuli.
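The success rates above are proportions of correct two-alternative judgments, so the error bars can be computed with a standard binomial confidence interval. A small sketch, assuming a normal-approximation interval (the paper only states that 95% intervals were used, not which construction):

```python
import math

def success_rate_ci(correct, total, z=1.96):
    """Proportion of correct responses with a normal-approximation 95% CI.

    The interval construction is an assumption; z = 1.96 corresponds to
    the 95% confidence level reported for Figure 6 and Table 1.
    """
    p = correct / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# e.g. 83 correct out of 100 trials -> (0.83, 0.756, 0.904)
```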
5. CONCLUSION

We were able to show that blind bandwidth extension is a valuable tool for improving the reproduction of 3D audio. The bandwidth of speech signals was widened from 4 kHz to 8 kHz using the blind bandwidth extension scheme proposed by Yasukawa.

The subjective evaluation showed a significant improvement in elevation localization accuracy over narrowband speech, even though generic HRTFs were used.

6. REFERENCES

[1] V. Algazi, C. Avendano, and R. Duda. Elevation localization and head-related transfer function analysis at low frequencies. The Journal of the Acoustical Society of America, 109:1110, 2001.

[2] J. Blauert. Spatial Hearing: The Psychophysics of Human Sound Localization. The MIT Press, 1997.

[3] J. Cabral and L. Oliveira. Pitch-synchronous time-scaling for high-frequency excitation regeneration. In Ninth European Conference on Speech Communication and Technology, 2005.

[4] S. Chennoukh, A. Gerrits, G. Miet, and R. Sluijter. Speech enhancement via frequency bandwidth extension using line spectral frequencies. In IEEE International Conference on Acoustics, Speech and Signal Processing, volume 1. IEEE, 2001.

[5] M. Dietz, L. Liljeryd, K. Kjorling, and O. Kunz. Spectral Band Replication, a novel approach in audio coding. In Audio Engineering Society Convention Preprints, 2002.

[6] J. Fuemmeler, R. Hardie, and W. Gardner. Techniques for the regeneration of wideband speech from narrowband speech. EURASIP Journal on Applied Signal Processing, 2001(1):274, 2001.

[7] B. Gardner and K. Martin. HRTF measurements of a KEMAR dummy-head microphone. MIT Media Lab Perceptual Computing, 1994.

[8] U. Kornagel. Spectral widening of the excitation signal for telephone-band speech enhancement. In Proc. International Workshop on Acoustic Echo and Noise Control, pages 215-218.

[9] A. Kulkarni and H. Colburn. Role of spectral detail in sound-source localization. Nature, 396(6713):747-749, 1998.

[10] VoxForge. Free speech recognition. http://www.voxforge.org/, last accessed August 2010.

[11] H. Yasukawa. Signal restoration of broad band speech using nonlinear processing. In Proc. European Signal Processing Conference (EUSIPCO-96), 1996.