Multizone Wideband Reproduction of Speech Soundfields

1 Multizone Wideband Reproduction of Speech Soundfields Associate Professor Christian Ritz School of Electrical, Computer and Telecommunications Engineering, University of Wollongong

2 Overview. Brief introduction to the University of Wollongong and the School of Electrical, Computer and Telecommunications Engineering (SECTE). Overview of spatial audio at the University of Wollongong. Mathematical description of soundfields: brief fundamentals. Reproducing soundfields using loudspeakers: single zone approaches. Multizone soundfield reproduction using loudspeakers: orthogonal basis expansion approach. Conclusions and open research challenges.

3 Where is Wollongong? About 1 hour south of Sydney on the East coast of NSW

4 The city of Wollongong. Population of Wollongong: 203,500. Population of the Illawarra area: 292,500. Average daily temperature: 22 °C (71.6 °F). Average summer temperature: 27 °C (80.6 °F). Patrolled beaches: 17. Photo: view from the escarpment (small mountains).

5 University of Wollongong. 2011: Celebrated its 60th Anniversary. Total number of students: 31,464 (as of 2014), including 12,811 international students in Australia and abroad. Rankings: Top 2% of research universities in the world (QS and Times Higher Education World University Rankings 2013/14). Recently improved ranking to 31st in the world's top 100 younger universities in the 2015 Times Higher Education (THE) 100 Under 50. Also ranked by TE to be in the top 250 institutions in the world in the subject fields of Electrical and Electronic Engineering and Computer Science and Information Systems. Photos: UOW Library; UOW Innovation Campus.

6 School of Electrical, Computer and Telecommunications Engineering (SECTE): Students and Staff Around 25 Academic Staff and many research students Degrees offered: Undergraduate : Bachelor of Engineering majoring in Electrical Engineering, Computer Engineering or Telecommunications Engineering Postgraduate: Masters by Coursework, Masters by Research and PhD ~300 Bachelor Students, ~200 Graduate students (including ~50 PhD students) ~50% of students are international

7 School of Electrical Computer and Telecommunications Engineering: Research Major research groups and labs ADVANCED MANUFACTURING TECHNOLOGIES (AMT) Australian Power Quality & Reliability Centre Centre for Intelligent Mechatronics Research (CIMR) INFORMATION AND COMMUNICATIONS TECHNOLOGY RESEARCH (ICTR) Emerging Networks & Applications (ENA) Optoelectronic Signal Processing Research Lab (OSPR) Visual & Audio Signal Processing (VASP) SUSTAINABLE BUILDINGS RESEARCH CENTRE

8 SPATIAL AUDIO AT THE UNIVERSITY OF WOLLONGONG Brief overview

9 Research led by A/Prof Ritz. Current Flagship Projects: microphone arrays for sound processing; spatial audio signal processing, compression, enhancement and reproduction; acoustic design of microphone arrays, loudspeaker arrays and customised musical instruments; Quality of Experience (QoE) for multimedia. Team: research students (mostly PhD). Selected projects: ad-hoc microphone arrays for speech and audio signal enhancement and localisation; spatial audio coding and enhancement; multizone audio reproduction and control; single and multichannel speech signal enhancement based on dictionary learning and sparse coding; Quality of Experience for image matching applications. Expertise: digital signal processing for speech, audio and multimedia applications; sound source localisation using microphones; 3D audio recording, analysis, synthesis and coding; human perception of multimedia and social media; multimedia annotation and semantics; international standards for multimedia communication and processing. Collaboration Highlights: RMIT University (spatial audio, semantics of multimedia); Smart Services CRC (New Media Services), collaborating with Fairfax Media to deliver AirLink; University of Klagenfurt, Austria; Peking University (Shenzhen Graduate School), China (microphone arrays); Beijing University of Technology (speech coding and enhancement, spatial audio); plus numerous other collaborators in the Faculty and University.

10 Research Facilities. UOW Anechoic Chamber. Configurable Hemispheric Environment for Spatialised Sound (CHESS): 16 loudspeaker hemisphere for 3D sound reproduction; a unique collaboration with staff from the Creative Arts discipline. Audio hardware equipment and software: microphone arrays, loudspeaker arrays, amplifiers, pre-amps, custom hardware and software. Technical workshop resources. Plus access to University research infrastructure and expertise including High Performance Computing, electronics and mechanical workshops, 3D printing services etc. Photos: UOW Anechoic Chamber; CHESS.

11 Spatial (3D) Audio Communication Multiple people, multiple sites communicating through a network Systems key stages: Microphone Array Recording Sound scene analysis and enhancement Compression Spatial Audio Rendering Spatial Audio Communication System

12 Microphone Array Recording. A microphone array is required for: source location (Direction of Arrival (DOA)) estimation and separation into individual speech objects; multichannel speech enhancement in noisy environments. We have researched miniaturised coincident microphone arrays. Coincident: microphones located very close together, giving increased performance in a much smaller package than alternative technologies. Acoustic Vector Sensor (AVS): uses specialist gradient microphones to record the pressure gradient in the x, y and z directions. Miniature B-Format microphones: can also provide the x, y and z directions but using standard pressure microphones. Photos: example AVS designed by our lab (scale: 1 cm); 3D printed B-Format microphone.

13 Enhancement based on Multichannel Linear Prediction (MC-LP). Apply MC-LP to the AVS channels o(n), x(n), y(n) and z(n) to obtain the LP coefficients and a residual signal. Key idea: The speech signal can then be enhanced by further processing of both the spectrum represented by the LP coefficients and the residual signal. Enhanced speech is then produced using the enhanced residual and LP coefficients. Shujau, M., Ritz, C., Burnett, I., "Speech Dereverberation Based On Linear Prediction: An Acoustic Vector Sensor Approach," Proc. ICASSP'2013, pp. 1-5, Vancouver, Canada, May 2013.

14 Multichannel 3D audio coding. Spatially Squeezed Surround Audio Coding (S³AC): converts 5 channel surround to 2 channel stereo; the stereo signal can be compressed with existing technology (e.g. MP3); the 5 channels are recovered from analysis of the decoded stereo signal. S³AC 3D: generalisation to more than 5 channels, e.g. 16 channel 3D audio coded at 128 kbps results in equivalent subjective quality to separate coding of each channel via AAC. Bin Cheng; Ritz, C.; Burnett, I.; Xiguang Zheng, "A General Compression Approach to Multi-Channel Three-Dimensional Audio," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 21, no. 8, pp. 1676-1688, Aug. 2013.

15 Compression of Multiple Soundfield Zones. Example application: spatial audio teleconferencing across multiple, distributed sites with one or more talkers at each site. Analysis of speakers using microphone array analysis. Extract individual talkers as separate spatial audio objects. Joint compression of spatial audio objects. Spatial audio playback at each site, combining all streams from each other site. Xiguang Zheng; Ritz, C.; Jiangtao Xi, "Encoding Navigable Speech Sources: A Psychoacoustic-Based Analysis-by-Synthesis Approach," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 21, no. 1, pp. 29-38, Jan. 2013.

16 MATHEMATICAL DESCRIPTION OF SOUNDFIELDS Brief fundamentals

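17 What is a sound field? Sound that has both amplitude and direction, e.g. 2D spatial audio, 3D spatial audio, binaural audio. In real environments, it includes reverberation due to echo off walls, ceiling and floor. Described mathematically as the solution of the acoustic wave equation:
$$\nabla^2 p(\mathbf{z}, t) - \frac{1}{c^2} \frac{\partial^2 p(\mathbf{z}, t)}{\partial t^2} = 0$$
where $p(\mathbf{z}, t)$ is the pressure at a point in space $\mathbf{z} = (x, y, z)$ and at time $t$; $\nabla^2$ is the Laplacian operator, i.e. $\nabla^2 p = \frac{\partial^2 p}{\partial x^2} + \frac{\partial^2 p}{\partial y^2} + \frac{\partial^2 p}{\partial z^2}$; and $c$ is the speed of sound ($\approx 340$ m/s).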
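As a quick numerical sanity check of the wave equation above, the following sketch evaluates a 1D travelling wave $p(x, t) = \cos(kx - \omega t)$ with $\omega = kc$ and estimates both second derivatives by central differences. The function names, the 1 kHz test frequency and the step sizes are illustrative assumptions, not part of the slides.

```python
import math

C = 340.0  # speed of sound in m/s, as quoted on the slide


def pressure(x, t, freq=1000.0):
    """1D travelling wave p(x, t) = cos(k x - w t) with w = k c."""
    omega = 2.0 * math.pi * freq
    k = omega / C
    return math.cos(k * x - omega * t)


def wave_equation_residual(x, t, dx=1e-4, dt=1e-7):
    """Central-difference estimate of d2p/dx2 - (1/c^2) d2p/dt2."""
    d2p_dx2 = (pressure(x + dx, t) - 2.0 * pressure(x, t) + pressure(x - dx, t)) / dx**2
    d2p_dt2 = (pressure(x, t + dt) - 2.0 * pressure(x, t) + pressure(x, t - dt)) / dt**2
    return d2p_dx2 - d2p_dt2 / C**2


residual = wave_equation_residual(0.3, 0.002)
scale = (2.0 * math.pi * 1000.0 / C) ** 2  # k^2, the size of each individual term
print("relative residual:", abs(residual) / scale)
```

The residual is not exactly zero because of finite-difference and rounding error, but it should be several orders of magnitude smaller than either term on its own.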

18 Planewave solution. Consider the 2D case and define $\mathbf{x} = (x, y)$ as a position in space (in Cartesian coordinates), which is equivalent to $(r, \theta)$ in polar coordinates, i.e. $(x, y) = (r\cos\theta, r\sin\theta)$. A simple solution to the wave equation for 2D soundfields is a function of planewaves. A planewave is a sound wave of constant frequency travelling in a specific direction $\theta$.

19 Cylindrical Harmonic Expansion. Description of the soundfield as a weighted sum of cylindrical harmonic functions:
$$S(\mathbf{x}, k) = \sum_{m=-\infty}^{+\infty} \alpha_m(k) J_m(kr) e^{jm\theta}$$
where $J_m$ is the $m$th order Bessel function of the first kind, $j = \sqrt{-1}$, and $\alpha_m(k)$ are known as the Fourier-Bessel coefficients. This can be solved to find $\alpha_m(k)$, but it is not an ideal approach because the zeros of $J_m(kr)$ cause problems.
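To make the expansion concrete, the sketch below checks it for the simplest case, a single unit planewave, where the coefficients are known in closed form from the Jacobi-Anger identity: $\alpha_m = j^m e^{-jm\theta_{pw}}$. The Bessel function is computed from its integral representation using only the standard library; the function names, the test point and the truncation orders are assumptions made for illustration.

```python
import cmath
import math


def bessel_j(m, x, steps=256):
    """J_m(x) from (1/2pi) * integral over [-pi, pi] of cos(x sin t - m t) dt."""
    total = 0.0
    for i in range(steps):
        t = -math.pi + 2.0 * math.pi * i / steps
        total += math.cos(x * math.sin(t) - m * t)
    return total / steps


def planewave(k, r, theta, theta_pw):
    """Unit planewave exp(j k r cos(theta - theta_pw)) at the polar point (r, theta)."""
    return cmath.exp(1j * k * r * math.cos(theta - theta_pw))


def harmonic_sum(k, r, theta, theta_pw, order):
    """Truncated sum over m = -order..order of alpha_m J_m(kr) e^{jm theta}."""
    total = 0j
    for m in range(-order, order + 1):
        alpha_m = (1j ** m) * cmath.exp(-1j * m * theta_pw)  # Jacobi-Anger coefficient
        total += alpha_m * bessel_j(m, k * r) * cmath.exp(1j * m * theta)
    return total


k = 2.0 * math.pi * 1000.0 / 340.0  # wavenumber at 1 kHz with c = 340 m/s
exact = planewave(k, 0.3, 0.7, 1.2)
for order in (2, 6, 20):
    approx = harmonic_sum(k, 0.3, 0.7, 1.2, order)
    print(order, abs(exact - approx))
```

The error drops quickly once the truncation order exceeds $kr$ (about 5.5 here), which is the usual motivation for choosing a truncation length that grows with frequency and region size.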

20 Green's Functions. Green's functions describe the acoustic transfer function between two points in space. For an impulse source $s(\mathbf{z}_s, k)$ arriving from location $\mathbf{z}_s$, the 3D Green's function in the free field (anechoic) is:
$$p(\mathbf{z} \mid \mathbf{z}_s, k) = \frac{1}{4\pi \|\mathbf{z} - \mathbf{z}_s\|} e^{jk\|\mathbf{z} - \mathbf{z}_s\|}$$
where $k = \omega / c$ is the wavenumber and $j = \sqrt{-1}$. For 2D:
$$p(\mathbf{x} \mid \mathbf{x}_s, k) = \frac{j}{4} H_0^{(1)}(k\|\mathbf{x}_s - \mathbf{x}\|)$$
where $H_0^{(1)}$ is the zeroth order Hankel function of the first kind (a function of Bessel functions; see the mathematics literature for the definition). The frequency domain soundfield at the listening point, $S(\mathbf{x}, k)$, is the multiplication of the source with the Green's function (or convolution in the time domain):
$$S(\mathbf{x}, k) = s(\mathbf{x}_s, k) \frac{j}{4} H_0^{(1)}(k\|\mathbf{x}_s - \mathbf{x}\|)$$
Figure labels: source $s(\mathbf{x}_s, k)$; listening point $S(\mathbf{x}, k)$.

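21 Green's functions and Superposition. Using Green's functions, we can derive the soundfield at a point $\mathbf{x}$ due to an impulse as the sum of the contributions of all sources arriving from all locations, i.e., in 2D for $l = 1$ to $L$ sources:
$$p(\mathbf{x}, k) = \sum_{l=1}^{L} \frac{j}{4} H_0^{(1)}(k\|\mathbf{x}_l - \mathbf{x}\|)$$
Hence, the total soundfield at $\mathbf{x}$ is:
$$S(\mathbf{x}, k) = \sum_{l=1}^{L} s(\mathbf{x}_l, k) \frac{j}{4} H_0^{(1)}(k\|\mathbf{x}_l - \mathbf{x}\|)$$
Figure labels: source 1, $s(\mathbf{x}_1, k)$; source 2, $s(\mathbf{x}_2, k)$; source $L$, $s(\mathbf{x}_L, k)$; listening point $S(\mathbf{x}, k)$.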
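The superposition sum can be sketched in a few lines. The example below builds $H_0^{(1)} = J_0 + jY_0$ from the standard power series so that it needs only the standard library; in practice a library routine such as `scipy.special.hankel1` would normally be used instead. The series are only adequate for arguments up to roughly 10, so a low test frequency (200 Hz) is chosen. The source positions and amplitudes are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329


def bessel_j0(x, terms=40):
    """Power series J_0(x) = sum_k (-1)^k (x^2/4)^k / (k!)^2 (adequate for x up to ~10)."""
    q = x * x / 4.0
    term, total = 1.0, 1.0
    for k in range(1, terms):
        term *= -q / (k * k)
        total += term
    return total


def bessel_y0(x, terms=40):
    """Series for Y_0(x), the zeroth-order Bessel function of the second kind."""
    q = x * x / 4.0
    term, harmonic, series = 1.0, 0.0, 0.0
    for k in range(1, terms):
        term *= -q / (k * k)       # (-1)^k q^k / (k!)^2
        harmonic += 1.0 / k        # H_k = 1 + 1/2 + ... + 1/k
        series -= term * harmonic  # adds (-1)^(k+1) H_k q^k / (k!)^2
    log_part = (math.log(x / 2.0) + EULER_GAMMA) * bessel_j0(x, terms)
    return (2.0 / math.pi) * (log_part + series)


def green_2d(x, x_src, k):
    """2D free-field Green's function (j/4) H_0^(1)(k ||x_src - x||)."""
    dist = math.hypot(x_src[0] - x[0], x_src[1] - x[1])
    return 0.25j * complex(bessel_j0(k * dist), bessel_y0(k * dist))


def total_field(x, sources, k):
    """S(x, k) = sum_l s(x_l, k) (j/4) H_0^(1)(k ||x_l - x||)."""
    return sum(s_l * green_2d(x, x_l, k) for x_l, s_l in sources)


k = 2.0 * math.pi * 200.0 / 340.0  # wavenumber at 200 Hz with c = 340 m/s
# Hypothetical (position, complex amplitude) pairs for three sources.
sources = [((1.5, 0.0), 1.0 + 0j), ((0.0, 1.2), 0.5j), ((-1.0, -1.0), -0.8 + 0j)]
print(total_field((0.2, 0.1), sources, k))
```

Because the sum is linear in the source amplitudes, scaling every $s(\mathbf{x}_l, k)$ by a constant scales the field at the listening point by the same constant.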

22 Broadband Soundfields. The previous slide was for a single frequency. Reproducing broadband (i.e. multiple frequency) soundfields (e.g. speech) requires the inverse Fourier transform of $S(\mathbf{x}, k)$ or, alternatively, a Fourier series representation. Total (time-domain) broadband soundfield $s_b(\mathbf{x}, t)$, assuming each source contains frequencies $k = 1$ to $K$:
$$s_b(\mathbf{x}, t) = \sum_{k=1}^{K} S(\mathbf{x}, k)$$

23 REPRODUCING SOUNDFIELDS USING LOUDSPEAKERS Single zone approaches

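24 Reproducing sound fields using loudspeakers. Aim: Find loudspeaker signals, $D(\mathbf{x}_l, \omega)$, such that a soundfield within a restricted region (zone) is accurately reproduced, i.e. we want $P(\mathbf{x}, \omega)$ to match a desired (virtual) soundfield produced by source $S(\mathbf{x}, \omega)$. Figure labels: loudspeakers at positions $\mathbf{x}_l$ with signals $D(\mathbf{x}_l, \omega)$; surface $\partial D$ surrounding volume $D$; reproduced field $P(\mathbf{x}, \omega)$ at point $\mathbf{x}$; virtual sound source $S(\mathbf{x}, \omega)$.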

25 Existing Loudspeaker Approaches Simpler (panning-based) loudspeaker approaches: Stereo (2 channels) only (limited) 2D sound fields reproduced Surround sound (5.1 channels) 2D sound fields (but with limited accuracy in reproducing source direction) Ambisonics (basic approach) Vector Base Amplitude Panning (VBAP) More sophisticated (sound field synthesis) approaches: Higher Order Ambisonics Wave Field Synthesis (WFS) Least Squares

26 Panning-based approaches Sound source direction perceived based on level differences between loudspeaker signals Channel pairs: stereo, 5.1 surround and VBAP Channel triplets: 3D VBAP All channels: Ambisonics Relatively simpler signal processing compared to sound field synthesis approaches Good results for practical numbers of loudspeakers Ambisonics reproduction using UOW 16 channel hemisphere
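As an illustration of the level-difference idea for a channel pair, here is a minimal sketch of 2D pairwise amplitude panning in the VBAP style: the gains are chosen so that the gain-weighted sum of the two loudspeaker unit vectors points at the source, then normalised to constant power. The function name and the ±30° layout are assumptions for the example.

```python
import math


def vbap_2d_gains(source_deg, spk1_deg, spk2_deg):
    """Constant-power gains (g1, g2) for one loudspeaker pair and one source direction."""
    p = (math.cos(math.radians(source_deg)), math.sin(math.radians(source_deg)))
    l1 = (math.cos(math.radians(spk1_deg)), math.sin(math.radians(spk1_deg)))
    l2 = (math.cos(math.radians(spk2_deg)), math.sin(math.radians(spk2_deg)))
    det = l1[0] * l2[1] - l1[1] * l2[0]
    if abs(det) < 1e-12:
        raise ValueError("the two loudspeaker directions must not be collinear")
    # Solve p = g1 * l1 + g2 * l2 by Cramer's rule.
    g1 = (p[0] * l2[1] - p[1] * l2[0]) / det
    g2 = (l1[0] * p[1] - l1[1] * p[0]) / det
    norm = math.hypot(g1, g2)  # normalise so that g1^2 + g2^2 = 1
    return g1 / norm, g2 / norm


for angle in (-30.0, -10.0, 0.0, 20.0, 30.0):
    print(angle, vbap_2d_gains(angle, 30.0, -30.0))
```

A source exactly at one loudspeaker drives only that loudspeaker, and a source midway between them gives equal gains. A source outside the arc spanned by the pair yields a negative gain, which is the cue to pick a different pair.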

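27 Sound Field Synthesis (SFS). Consider an enclosed volume $D$ with surface $\partial D$. Synthesise the soundfield using multiple loudspeakers (as secondary sources) located at positions $\mathbf{x}_l$, $l = 1$ to $L$. SFS synthesis equation:
$$S(\mathbf{x}, \omega) = \oint_{\partial D} D(\mathbf{x}_l, \omega)\, G(\mathbf{x} \mid \mathbf{x}_l, \omega)\, dA(\mathbf{x}_l)$$
where $D(\mathbf{x}_l, \omega)$ are the loudspeaker signal weights, $G(\mathbf{x} \mid \mathbf{x}_l, \omega)$ is the Green's function between $\mathbf{x}$ and $\mathbf{x}_l$, and $dA(\mathbf{x}_l)$ is the surface area element of the enclosure. Figure labels: loudspeakers at $\mathbf{x}_l$ with signals $D(\mathbf{x}_l, \omega)$; Green's function $G(\mathbf{x} \mid \mathbf{x}_l, \omega)$; reproduced field $P(\mathbf{x}, \omega)$ at point $\mathbf{x}$; virtual sound source $S(\mathbf{x}, \omega)$; surface $\partial D$ surrounding volume $D$.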

28 SFS Solutions Higher Order Ambisonics (HOA) Loudspeaker signals derived by decomposing the sound field into an orthogonal set of basis functions Often relies on planewave representations of sound fields Wave Field Synthesis (WFS) Loudspeaker signals derived by directly solving the SFS synthesis integral on the previous slide Comparing HOA and WFS for the same number of loudspeakers WFS produces accurate first wavefront across a larger area but with strong artefacts outside this area HOA produces accurate sound field but within a smaller area HOA and WFS generally require analytic solutions to the integral expressions Assume continuous functions Alternative Least Squares solutions: Minimise the least squared error between desired and achievable pressure values for a given set up (i.e. number of loudspeakers) and spatial sampling resolution (i.e. number of chosen pressure samples) within a chosen reproduction region
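The least squares alternative can be sketched as follows: sample the desired pressure at a handful of control points, build the matrix of Green's functions from each loudspeaker to each point, and solve the normal equations for the loudspeaker weights. This is a minimal illustration under stated assumptions: it uses the 3D free-field Green's function, a hypothetical three-loudspeaker geometry, and a small hand-written solver in place of a library least squares routine.

```python
import cmath
import math


def green_3d(x, x_src, k):
    """3D free-field Green's function exp(j k r) / (4 pi r), r = ||x - x_src||."""
    r = math.dist(x, x_src)
    return cmath.exp(1j * k * r) / (4.0 * math.pi * r)


def solve_linear(a, b):
    """Solve a small complex system by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0j] * n
    for r in range(n - 1, -1, -1):
        acc = aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = acc / aug[r][r]
    return x


def least_squares_weights(speakers, points, desired, k):
    """Minimise sum_i |sum_l G(x_i | x_l) d_l - desired_i|^2 via the normal equations."""
    g = [[green_3d(x, spk, k) for spk in speakers] for x in points]
    n, m = len(speakers), len(points)
    gram = [[sum(g[i][a].conjugate() * g[i][b] for i in range(m)) for b in range(n)]
            for a in range(n)]
    rhs = [sum(g[i][a].conjugate() * desired[i] for i in range(m)) for a in range(n)]
    return solve_linear(gram, rhs)


# Hypothetical set-up: three loudspeakers 2 m from the origin, six control points.
speakers = [(2.0, 0.0, 0.0), (-1.0, 1.732, 0.0), (-1.0, -1.732, 0.0)]
points = [(0.3, 0.0, 0.0), (-0.3, 0.0, 0.0), (0.0, 0.3, 0.0),
          (0.0, -0.3, 0.0), (0.2, 0.2, 0.0), (0.0, 0.0, 0.0)]
k = 2.0 * math.pi * 500.0 / 340.0
true_weights = [1.0 + 0j, 0.5j, -0.25 + 0.1j]
desired = [sum(green_3d(x, s, k) * d for s, d in zip(speakers, true_weights)) for x in points]
print(least_squares_weights(speakers, points, desired, k))
```

Here the desired field was generated by the array itself, so the solver recovers the original weights. For a field the array cannot reproduce exactly, the same code returns the weights with the smallest squared error at the chosen control points.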

29 MULTIZONE SOUNDFIELD REPRODUCTION USING LOUDSPEAKERS Orthogonal basis expansion approach

30 Key Question: How can we create multiple independent listening zones within a room, using a single set of loudspeakers? Figure labels: room containing active zone 1 (listening to speech or music), active zone 2 (listening to speech or music) and a quiet zone (no external sound).

31 Multizone Soundfields. Move from one zone to multiple zones. Example: Region $D$ contains three sub-regions: the bright zone $D_b$, the quiet zone $D_q$, and the unattended zone, which is all other space in the region $D$. The direction of the desired planewave in $D_b$ is $\theta$, and it is reproduced by loudspeakers positioned at $\mathbf{x}_l$ (or equivalently $\phi_l$) with first loudspeaker at $\phi$. Each zone has radius $r$ within a total region $D$ of size $R$ surrounded by a loudspeaker array of radius $R_l$.

32 Existing solutions. 2D multizone reproduction was first introduced in 2008 by Poletti [5] and was based on a least squares approach to SFS. A multizone approach using cylindrical harmonic expansion was proposed by Wu and Abhayapala in 2011 [6]. Previous approaches attempt to completely suppress any interzone interference, which can result in impractically large loudspeaker signal amplitudes. More recent work by Jin and Kleijn [7] uses an orthogonal basis expansion approach with weightings to control reproduction of each zone according to its importance, which leads to more practical numbers of loudspeakers. A conceptually similar approach was also proposed by Chen, Abhayapala, and Zhang [8]. There is limited work on practical solutions for multiple frequency soundfields. Multizone narrowband and wideband speech soundfields were presented in 2013 by Radmanesh and Burnett [9].

33 Orthogonal Basis Expansion. Similar to a planewave decomposition, a soundfield can be described more generally as a weighted sum of orthogonal basis functions:
$$S(\mathbf{x}, k) = \sum_{n} C_n G_n(\mathbf{x}, k)$$
where $G_n(\mathbf{x}, k)$ are the basis functions and $C_n$ are the coefficients, which can be derived using an inner product (with $^{*}$ denoting complex conjugation) as:
$$C_n = \int_{D} S(\mathbf{x}, k)\, G_n^{*}(\mathbf{x}, k)\, d\mathbf{x}$$

34 Weighted Orthogonal Basis Expansion. Add a weighting term to the inner product, i.e.:
$$C_n = \int_{D} w(\mathbf{x})\, S(\mathbf{x}, k)\, G_n^{*}(\mathbf{x}, k)\, d\mathbf{x}, \quad n = 1 \text{ to } N$$
Weights can be chosen in various ways, e.g. assume constant weight values within each zone, i.e. $w(\mathbf{x}_b) = w_b$, $w(\mathbf{x}_q) = w_q$, $w(\mathbf{x}_u) = w_u$ for the bright, quiet and unattended zones, respectively. In our work, we explore alternative weighting schemes and efficient implementations. Basic idea: Derive the coefficients for a chosen set of basis functions that minimise the reconstruction error between the modelled and desired soundfield.

35 Deriving the basis functions. We begin with a set of $P$ planewaves. We factorise these into an orthogonal set of $N$ basis functions or wavefields, via QR factorisation of a set of $P$ planewaves, $F_p(\mathbf{x}, k)$, arriving from angles 0 to $2\pi$. The resulting soundfield can then be described as:
$$S(\mathbf{x}, k) = \sum_{p} P_p(k, w) F_p(\mathbf{x}, k)$$
where the coefficients $P_p(k, w)$ are related to the desired $C_n$ values via the QR decomposition (see [7] for more details).
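The weighted expansion and the orthogonalisation step can be sketched together. The example below samples $P$ planewaves on a grid of points, orthonormalises them under a discretised weighted inner product, and then computes the coefficients $C_n$ of a target field. Modified Gram-Schmidt is swapped in for a library QR factorisation; in exact arithmetic it yields the same orthonormal set up to phase. The grid, zone geometry and the weights (1.0, 2.5, 0.05) are hypothetical choices, not values from the slides.

```python
import cmath
import math


def planewave_samples(k, angle, points):
    """Samples of a unit planewave exp(j k (x cos a + y sin a)) at the given points."""
    return [cmath.exp(1j * k * (x * math.cos(angle) + y * math.sin(angle))) for x, y in points]


def zone_weights(points):
    """Hypothetical weights: bright zone 1.0, quiet zone 2.5, unattended area 0.05."""
    weights = []
    for x, y in points:
        if math.hypot(x + 0.4, y) <= 0.3:
            weights.append(1.0)
        elif math.hypot(x - 0.4, y) <= 0.3:
            weights.append(2.5)
        else:
            weights.append(0.05)
    return weights


def inner(f, g, w):
    """Discretised weighted inner product sum_i w_i f_i conj(g_i)."""
    return sum(wi * fi * gi.conjugate() for wi, fi, gi in zip(w, f, g))


def orthonormal_basis(waves, w, tol=1e-9):
    """Modified Gram-Schmidt under the weighted inner product (stand-in for QR)."""
    basis = []
    for f in waves:
        v = list(f)
        for g in basis:
            c = inner(v, g, w)
            v = [vi - c * gi for vi, gi in zip(v, g)]
        norm = math.sqrt(inner(v, v, w).real)
        if norm > tol:  # skip planewaves that add nothing new on this grid
            basis.append([vi / norm for vi in v])
    return basis


def expand(field, basis, w):
    """Coefficients C_n = <S, G_n>_w and the reconstruction sum_n C_n G_n."""
    coeffs = [inner(field, g, w) for g in basis]
    recon = [sum(c * g[i] for c, g in zip(coeffs, basis)) for i in range(len(field))]
    return coeffs, recon


k = 2.0 * math.pi * 500.0 / 340.0
points = [(-1.0 + 0.2 * i, -1.0 + 0.2 * j) for i in range(11) for j in range(11)]
weights = zone_weights(points)
waves = [planewave_samples(k, 2.0 * math.pi * p / 8, points) for p in range(8)]
basis = orthonormal_basis(waves, weights)
coeffs, recon = expand(waves[3], basis, weights)
print(len(basis), max(abs(r - f) for r, f in zip(recon, waves[3])))
```

A target that lies in the span of the planewaves, as here, is reconstructed to rounding error. A genuinely multizone target (a planewave in the bright zone and silence in the quiet zone) generally does not, and the weights then decide where the remaining error is allowed to fall.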

36 Deriving loudspeaker signal weights. Recall from before that the free field soundfield at a point $\mathbf{x}$ produced by multiple sources is the sum of the signals arriving from each loudspeaker. For loudspeaker signal weights at position $l$ derived using the above approach with weighting $w$:
$$S(\mathbf{x}, k) = \sum_{l=1}^{L} d_l(k, w) \frac{j}{4} H_0^{(1)}(k\|\mathbf{x}_l - \mathbf{x}\|)$$
The loudspeaker signal weights are derived based on the orthogonal decomposition previously described. Basic idea: Find loudspeaker signal weights, $d_l(k, w)$, to produce the desired soundfield described by the orthogonal set of basis wavefields.

37 Deriving loudspeaker signal weights. The loudspeaker signal weights $d_l(k, w)$ are derived as:
$$d_l(k, w) = \sum_{m=-M}^{M} \frac{2}{j\pi H_m^{(1)}(k r_l)} \sum_{p} P_p(k, w)\, j^m e^{-jm\phi_{pw}} e^{jm\phi_l} \Delta\phi_s$$
where $H_m^{(1)}$ is the $m$th order Hankel function of the first kind; $M = kr$ is the truncation length [10] (based on spatial frequency and size of the region); $\phi_{pw} = (p - 1)\Delta\phi$, where $\Delta\phi = 2\pi / P$, are the angles of arrival of the planewaves; and $\phi_l$ are the angular positions of the $L$ loudspeakers, separated by an angular spacing of $\Delta\phi_s$.

38 Reproducing broadband speech soundfields. $d_l(k, w)$ are the weights that must be applied to a desired source signal, e.g. a speech signal, to reproduce a desired soundfield. E.g. assume we wish to synthesise a speech soundfield at an arbitrary location, $\mathbf{x}$, in the reproduction area, $D$, using a set of $L$ loudspeakers. The value of the loudspeaker signals, $Y_l(\mathbf{x}, k)$, required to reproduce this soundfield at $\mathbf{x}$ is:
$$Y_l(\mathbf{x}, k) = d_l(k, w)\, Y(k)$$
where $d_l(k, w)$ are the free field loudspeaker signal weights as given on the earlier slide and $Y(k)$ is the complex Fourier coefficient of a desired time domain virtual source, $y(n)$, e.g. a desired speech signal. We can also derive the estimated soundfield reproduced by the array of loudspeakers, such that we can minimise the mean squared error between the desired and actual soundfield and understand the impact of the weights used for the basis functions.
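The per-frequency multiplication $Y_l = d_l Y$ amounts to frequency-domain filtering of the source signal, once per loudspeaker. The sketch below shows the mechanics with a naive DFT from the standard library. The weights used here are a placeholder (a pure circular delay) chosen because the result is easy to verify; they are not the $d_l(k, w)$ of the slides. Here the index `k` is a DFT bin rather than a wavenumber; mapping bin to wavenumber via $k = 2\pi f / c$ is left as an assumption of the caller.

```python
import cmath
import math


def dft(signal):
    """Naive O(N^2) discrete Fourier transform Y(k) of a sequence y(n)."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft(): returns the complex time-domain sequence."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def loudspeaker_signal(y, weights):
    """Apply Y_l(k) = d_l(k) Y(k) in every bin and return the time-domain signal."""
    spectrum = dft(y)
    return idft([d * y_k for d, y_k in zip(weights, spectrum)])


def delay_weights(n, delay):
    """Placeholder weights: a circular delay of `delay` samples (not the slide's d_l)."""
    return [cmath.exp(-2j * math.pi * k * delay / n) for k in range(n)]


y = [0.0, 1.0, 2.0, 3.0, 0.0, 0.0, 0.0, 0.0]
out = loudspeaker_signal(y, delay_weights(len(y), 2))
print([round(v.real, 6) for v in out])
```

With real weights derived for each loudspeaker, the same three steps (transform, multiply per bin, inverse transform) would be repeated for all $L$ loudspeakers, typically frame by frame with an FFT rather than this quadratic-time DFT.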

39 Reproduction of multiple frequencies (i.e. broadband speech soundfields). The pressure generated by the loudspeakers at any point in the reproduced soundfield is given by
$$p(\mathbf{x}, w) = \sum_{k=1}^{K} S(\mathbf{x}, k, w)$$
where there are $K$ different sinusoidal components. Multiple nested summations are required to derive this value: a summation occurs for every sample in the reproduction region, $D$, for $P$ planewaves, for $N$ basis wavefields, for $2M+1$ modes, for $L$ loudspeakers and for $K$ sinusoidal components. This is computationally demanding, e.g. a three second audio file sampled at 16 kHz requires approximately independent reproductions due to the QR decomposition. Hence, more efficient implementations are required.

40 Examples. Soundfields are best visualised using animations of single frequencies. Example desired soundfields generated based on weighted orthogonal basis expansion: Orthogonal Basis Expansion with one frequency (2 kHz); Independent Multizone Soundfields with Orthogonal Basis Expansion (2 kHz and 1.25 kHz). Example soundfields reproduced using loudspeakers: Loudspeaker Reproduced Orthogonal Basis Expansion; Loudspeaker Reproduced Occlusion Problem. Example showing the effect of varying the weightings used for the basis functions: Loudspeaker Reproduced Multizone Soundfield with Varied Weighting.

41 Conclusions and open research challenges. Multizone soundfield reproduction is theoretically possible, and simulations demonstrate its practical feasibility. Current approaches are computationally demanding, e.g. a 3 s audio file sampled at 16 kHz requires approximately independent derivations due to the QR decomposition and nested summations to derive multiple frequency components. We are investigating codebook-based approaches to reduce complexity. Techniques require many loudspeakers, with the number governed by the desired bandwidth of the sound sources as well as the size of the reproduction area. We are working on alternative weighting approaches to reduce loudspeaker counts whilst maintaining perceptual quality of the reproduced soundfield. Some initial results are to be published at ChinaSIP 2015.

42 References
1. J. Ahrens, R. Rabenstein, and S. Spors, "Sound Field Synthesis for Audio Presentation," Acoustics Today, vol. 10, no. 2, Spring.
2. S. Spors, H. Wierstorf, A. Raake, F. Melchior, M. Frank, and F. Zotter, "Spatial Sound With Loudspeakers and Its Perception: A Review of the Current State," Proceedings of the IEEE, vol. 101, no. 9, pp. 1920-1938, Sept.
3. R. Rabenstein and S. Spors, "Sound Field Reproduction," in Springer Handbook of Speech Processing, J. Benesty, M. M. Sondhi, and Y. Huang, Eds. Springer Berlin Heidelberg, 2008.
4. M. Kolundzija, C. Faller, and M. Vetterli, "Reproducing Sound Fields Using MIMO Acoustic Channel Inversion," JAES, vol. 59, no. 10, Nov.
5. M. Poletti, "An Investigation of 2-D Multizone Surround Sound Systems," presented at the Audio Engineering Society Convention 125, 2008.
6. Y. J. Wu and T. D. Abhayapala, "Spatial multizone soundfield reproduction: Theory and design," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 19, no. 6, 2011.
7. W. Jin, W. B. Kleijn, and D. Virette, "Multizone soundfield reproduction using orthogonal basis expansion," presented at the Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, 2013.
8. H. Chen, T. D. Abhayapala, and W. Zhang, "Enhanced sound field reproduction within prioritized control region," in INTER-NOISE and NOISE-CON Congress and Conference Proceedings, 2014, vol. 249.
9. N. Radmanesh and I. S. Burnett, "Generation of isolated wideband sound fields using a combined two-stage lasso-ls algorithm," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 21, no. 2, 2013.
10. Y. J. Wu and T. D. Abhayapala, "Theory and design of soundfield reproduction using continuous loudspeaker concept," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 17, no. 1, 2009.
