Current Approaches to 3-D Sound Reproduction

Elizabeth M. Wenzel
NASA Ames Research Center, Moffett Field, CA 94035
Elizabeth.M.Wenzel@nasa.gov

Abstract

Current approaches to spatial sound synthesis are reviewed, particularly as they relate to the topics addressed in the special session on 3-D Sound Reproduction. Most currently available virtual audio systems fall into two categories: those aimed at high-end simulations for research purposes emphasize high-fidelity rendering, while others are directed toward entertainment and game applications. The papers represented in this special session are primarily concerned with the goals of high-fidelity simulation of spatial sound presented over headphones. They seek to elucidate the nature of the acoustic parameters that must be rendered in order to provide a listener with an accurate or authentic perceptual experience.

1. Comparison of VAE Systems

Different virtual acoustic environment (VAE) applications emphasize different aspects of the listening experience and therefore require different approaches to rendering software and hardware. Auralization requires computationally intensive synthesis of the entire binaural room response, which typically must be done offline and/or with specialized hardware. A simpler simulation that emphasizes accurate control of the direct path, and perhaps a limited number of early reflections, may be better suited to information display. The fact that such a simulation does not sound "real" may have little to do with the quality of the directional information provided.

Achieving both directional accuracy and presence in virtual reality applications requires that head tracking be enabled, with special attention devoted to the dynamic response of the system. A relatively high update rate (~60 Hz) and low latency (less than ~100 ms) may be required to optimize localization cues from head motion and to provide a smooth and responsive simulation of a moving listener or sound source [1-4]. Implementing a perceptually adequate dynamic response for a complex room is computationally intensive and may require multiple CPUs or DSPs.

One solution for synthesizing interactive virtual audio has been the development of hybrid systems [e.g., 5, 6]. These systems attempt to reconcile the goals of directional accuracy and realism by implementing real-time processing of the direct path and early reflections using a model (e.g., the image model), combined with measured or modeled representations of late reflections and reverberation. During dynamic, real-time synthesis, only the direct path and early reflections can be readily updated in response to changes in listener or source position. A densely measured or interpolated head-related transfer function (HRTF) database is needed to avoid artifacts during updates. Late portions of the room response typically remain static in response to head motion or, given enough computational power, could be updated using a database of impulse responses pre-computed for a limited set of listener-source positions. Model-based synthesis is computationally more expensive but requires less memory than data-based rendering [6]. The Lake Huron/HeadScape system relies entirely on long, densely pre-computed binaural room impulse responses (BRIRs) rendered with a fast frequency-domain algorithm. The early portion of the BRIR (4000 samples) is updated in response to head motion and the late reverberation remains static.
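As a rough sketch of the hybrid approach just described (not any particular product's implementation), the block renderer below re-spatializes the direct path with a head-tracked HRIR on every block while applying a fixed late-reverberation tail taken from a pre-computed BRIR. All names are hypothetical, the HRIR lookup is a stub, and a real renderer would add early-reflection paths, use partitioned FFT convolution, and carry convolution tails across blocks.

```cpp
// Hybrid VAE render sketch: dynamic direct path + static late reverb.
#include <cstddef>
#include <vector>

struct Hrir { std::vector<float> left, right; };

// Placeholder HRIR lookup: returns a unit impulse for both ears. A real
// system would interpolate a densely measured HRTF database here.
static Hrir lookupHrir(float /*azimuthDeg*/, float /*elevationDeg*/) {
    Hrir h;
    h.left.assign(128, 0.0f);  h.left[0]  = 1.0f;
    h.right.assign(128, 0.0f); h.right[0] = 1.0f;
    return h;
}

// Direct-form FIR convolution; real systems would use partitioned FFTs.
static std::vector<float> fir(const std::vector<float>& x,
                              const std::vector<float>& h) {
    if (x.empty() || h.empty()) return {};
    std::vector<float> y(x.size() + h.size() - 1, 0.0f);
    for (std::size_t n = 0; n < x.size(); ++n)
        for (std::size_t k = 0; k < h.size(); ++k)
            y[n + k] += x[n] * h[k];
    return y;
}

struct HybridRenderer {
    Hrir lateTail;  // static late-reverberation portion of a measured BRIR

    // Render one block of a mono source into a binaural pair.
    void renderBlock(const std::vector<float>& block,
                     float azimuthDeg, float elevationDeg,
                     std::vector<float>& outL, std::vector<float>& outR) {
        // Dynamic stage: re-selected whenever the head tracker reports motion.
        Hrir h = lookupHrir(azimuthDeg, elevationDeg);
        outL = fir(block, h.left);
        outR = fir(block, h.right);

        // Static stage: the late tail is NOT re-selected on head motion.
        std::vector<float> lateL = fir(block, lateTail.left);
        std::vector<float> lateR = fir(block, lateTail.right);
        for (std::size_t n = 0; n < outL.size() && n < lateL.size(); ++n) {
            outL[n] += lateL[n];
            outR[n] += lateR[n];
        }
    }
};
```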
Another recent trend is that, in some spatial sound systems, synthesis is now performed entirely in software for use on generic hardware platforms such as a personal computer running a Windows or Linux operating system. NASA's SLAB software is an example of this approach.

Tables 1 and 2 summarize system characteristics and specifications for some of the currently available virtual audio systems targeting different applications. (The Convolvotron is listed for historical comparison purposes.) These systems tend to fall into two categories. Those aimed at high-end simulations for research purposes (e.g., auralization, psychoacoustics, information displays, virtual reality) tend to emphasize high-fidelity rendering of the direct path and/or early reflections, accurate models of reverberation, and good system dynamics (high update rate, low latency). Other systems are directed toward entertainment and game applications. The rendering algorithms in such systems are proprietary and appear to emphasize efficient reverberation modeling; it is often not clear whether the direct path and/or early reflections are independently spatialized. The information in the tables is based on published papers in a few cases [e.g., 3, 5, 7] but more often on product literature and websites [8].

Table 1. Summary table describing system characteristics for various VAE systems.

System / Primary Target Application | Audio Display | User Interface | OS | Implementation | Rendering Domain / Room Model
SLAB / ? | headphone | C++ | Windows 98/2k | software / Intel | image model
DIVA / ? | ? | C++ | UNIX, Linux | software / SGI | image model
AuSIM / ? | headphone | C, client-server model | client: Win98/2k, DOS, Mac, etc. | software / Intel | direct path
Spat (IRCAM) / ? | headphone | Graphical (Max, jmax) | Mac, Linux, IRIX | software / Mac, Intel, SGI | direct path, reverb engine
AM3D / ?, games | ? | Graphical / ActiveX | Windows 98/2k | software / Intel (MMX) | direct path?
Tucker-Davis / ? | ? | ? | Windows 98/2k | special-purpose DSP hardware (RP2.1) | ?
Lake / ?, entertainment | ? | C++ | Windows NT | special-purpose DSP hardware (CP4, Huron) | frequency (HRTF) / precomputed BRIR
Creative Audigy / games | 3D sound engine | C++ | Windows 98/2k | consumer sound card | proprietary
Sensaura / entertainment | 3D sound engine | N/A | N/A | software / hardware engine | proprietary
QSound / games | 3D sound engine | N/A | N/A | software / hardware engine | proprietary
Crystal River Convolvotron / ? | headphone | C | DOS | special-purpose DSP hardware | direct path

Table 2. Summary table describing system specifications for various VAE systems.

System | # Sources | Filter Order | Room Effect | Scenario Update Rate | Latency | Internal Sampling Rate
SLAB | arbitrary, CPU-limited (4 typical) | arbitrary (max. direct: 128, reflections: 32) | image model, 6 first-order reflections | 120 Hz typical, 690 Hz max. | 24 ms default (adjustable output buffer size) | 44.1 kHz
DIVA | arbitrary, CPU-limited | modeled HRIRs (typical direct: 30, reflections: 10) | image model, 2nd-order reflections, late reverb | 20 Hz | ~110-160 ms | arbitrary (32 kHz typical)
AuSIM | 32 per CPU GHz | arbitrary (128 typical, 256 max.) | N/A | arbitrary (375 Hz default max.) | 8 ms default (adjustable output buffer size) | 44.1 kHz, 48 kHz (default), 96 kHz
AM3D | 32-140, CPU-limited | ? | N/A | ~22 Hz max. | 45 ms min. | 22 kHz (current), 44.1 kHz (future)
Lake | 1 | 2058 to 27988 (HeadScape, 4 DSPs) | precomputed response | ? | 0.02 ms min. | 48 kHz
Convolvotron | 4 | 256 | N/A | 33 Hz | 32 ms | 50 kHz

It is often difficult to determine details about a particular system's rendering algorithm and performance specifications. For example, critical dynamic parameters like scenario update rate and internal rendering latency are either not readily available, or not enough information about the measurement scenario is provided to evaluate the quoted values. Some systems listed in Table 1 are not present in Table 2 because not enough information was found regarding their performance specifications.

2. NASA's SLAB System

SLAB is an example of a software-based, real-time virtual acoustic environment rendering system designed for use in the personal computer environment. It is being developed by the Spatial Auditory Displays Lab at NASA Ames Research Center primarily as a tool for the study of spatial hearing. To enable a wide variety of psychoacoustic studies, SLAB provides extensive control over the VAE rendering process. It provides an API (Application Programming Interface) for specifying the acoustic scene and setting the low-level digital signal processing (DSP) parameters, as well as an extensible architecture for exploring multiple rendering strategies. The project is also intended to provide a low-cost system for dynamic synthesis of virtual audio over headphones that does not require special-purpose signal processing hardware. Because it is a software-only solution designed for the Windows/Intel platform, it can take advantage of improvements in hardware performance without extensive software revision.

Table 3. Acoustic Scenario Parameters.

SOURCE: Location (Implied Velocity), Orientation, Sound Pressure Level, Waveform, Radiation Pattern, Source Radius
ENVIRONMENT: Speed of Sound, Spreading Loss, Air Absorption, Surface Locations, Surface Boundaries, Surface Reflection, Surface Transmission, Late Reverberation
LISTENER: Location (Implied Velocity), Orientation, HRIR, ITD

2.1. SLAB Acoustic Scenario

The acoustic scenario of a sound source radiating into an environment and heard by a listener can be specified by the parameters shown in Table 3. A source, characterized by its waveform, level, radiation pattern, size, and dynamic quantities including position and orientation, radiates into an environment. Propagation of acoustic energy in the environment is specified by the speed of sound, spherical spreading loss, and air absorption; the environment is further specified by the location and characteristics of reflecting and transmitting objects. The source signal propagates through the environment, arriving at a listener characterized by a head-related impulse response (HRIR) and interaural time delay (ITD), as well as a dynamically changing position and orientation.

The HRIRs used here are derived from minimum-phase representations of the raw left- and right-ear impulse responses measured for individual subjects. ITDs are estimated from the raw left- and right-ear impulse responses and represented as a pure delay. HRTFs, on the other hand, refer to the equivalent frequency-domain representations of the raw HRIRs. Currently, the SLAB Renderer supports all but the following parameters: radiation pattern, air absorption, surface transmission, and late reverberation.
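The scenario parameters of Table 3 map naturally onto a few plain data structures. The sketch below is illustrative only and does not reflect the actual SLAB API; all type and field names are invented here.

```cpp
// One possible in-code mirror of the Table 3 scenario parameters
// (hypothetical names, not the SLAB API).
#include <array>
#include <vector>

using Vec3 = std::array<float, 3>;

struct Source {
    Vec3  location;           // implied velocity follows from successive updates
    Vec3  orientation;        // e.g., yaw/pitch/roll in degrees
    float soundPressureLevel; // dB SPL at a reference distance
    std::vector<float> waveform;
    // Radiation pattern and source radius omitted: not yet rendered by SLAB.
};

struct Environment {
    float speedOfSound;       // m/s
    bool  sphericalSpreading; // distance-dependent spreading loss
    // Surface locations/boundaries/reflection feed the image model;
    // air absorption, transmission, and late reverb are not yet rendered.
    std::vector<std::array<Vec3, 4>> surfaces;
};

struct Listener {
    Vec3 location;
    Vec3 orientation;
    std::vector<float> hrirLeft, hrirRight; // minimum-phase HRIRs
    float itdSeconds;                       // ITD applied as a pure delay
};
```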
A signal path may be modeled according to the physical scenario using the signal flow architecture shown in Fig. 1(a). A set of P paths from the source to the listener (including the direct path) is separately rendered. The filter r(z) imposes the source radiation pattern on the source signal, taking the signal from the source to a point in the vicinity of the source along a particular radiation direction. The filter z^-τa·a(z) applies the propagation delay, spherical spreading loss, and air absorption experienced as the source signal propagates from near the source to near the listener; the filter m(z) imposes the transmission or reflection characteristics of any objects encountered. The filter z^-τh·h(z) represents the HRIR and ITD, and takes any arriving signal from the vicinity of the listener along a particular direction to the listener's ear canals.

The SLAB signal flow shown in Fig. 1(b) was designed to implement the physical effects discussed above in an easily maintained, efficient architecture. It consists of a set of parallel signal paths, one for each rendered path from the source to a listener's ears. The propagation delay and interaural time delay for each source-to-ear path are combined and implemented via an interpolated delay line. Static effects along each path, such as material reflection filtering, are combined and implemented as an infinite impulse response (IIR) filter. A finite impulse response (FIR) filter is used to implement dynamic effects such as the head-related impulse response and the source radiation pattern.
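Written out, the per-path cascade above amounts to one transfer function per path, summed at each ear. The following restatement is added here for clarity; the path index p and the source and ear signals S(z) and Y(z) are notation introduced for this sketch, not symbols from the paper:

\[
Y_{\mathrm{ear}}(z) \;=\; \sum_{p=1}^{P} S(z)\, r_p(z)\, z^{-\tau_{a,p}}\, a_p(z)\, m_p(z)\, z^{-\tau_{h,p}}\, h_p(z)
\]

Evaluating this sum once for each ear gives the 2P rendered paths noted in Fig. 1.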

2.2. Dynamic Behavior

Interactive virtual audio systems are necessarily time-varying. As the scenario changes over time, different signal processing parameters are required to render the changing physical effects imposed on the source signal. The difficulty is that the signal processing structures available for implementing the changing scenario are inherently static, assuming fixed coefficients. As a result, care must be taken when updating signal processing parameters. Ideally, new parameters are switched in sufficiently frequently that the change from one parameter set to the next is imperceptibly small. Certain parameters such as time delays need to be updated every sample to avoid artifacts; minimum-phase head-related impulse responses are somewhat more forgiving. A primary problem with this approach is that it is expensive to compute signal processing parameters from scenario information. There is also the additional issue that peripherals such as head trackers typically provide update rates ranging from only 30 to 120 Hz, so that intermediate scenario data must be derived.

Two methods are typically used to accommodate a changing scenario: output crossfading and parameter crossfading (described as commutation in [9]). In output crossfading (e.g., as in early versions of the Convolvotron that used non-minimum-phase HRIRs), the output is a blend of the input processed once according to past parameters and again according to present parameters. While the two processing paths use static coefficients, the blend is varied over time to achieve a transition between the parameter sets. Parameter crossfading, by contrast, processes the input only once, according to a varying set of rendering parameters that have been crossfaded before processing of the input signal.

[Figure 1 block diagrams: (a) Physical signal flow: source -> r(z) (radiation pattern) -> z^-τa·a(z) (propagation delay, air absorption, spherical spreading) -> m(z) (surface reflection, object transmission) -> z^-τh·h(z) (ITD, left/right HRIR) -> binaural mix. (b) SLAB signal flow: source -> interpolated delay line (propagation delay, ITD) -> IIR filter m(z) (reflection, transmission) -> FIR filter (radiation pattern, air absorption, spherical spreading, HRIR) -> IIR filter e(z) (output device equalization) -> mix -> headphone output. P = number of paths (direct path and reflections); 2P = paths rendered for the left and right ears.]

Figure 1. (a) The physical signal flow partitions the properties of the acoustic scenario into the relevant signal processing components. (b) The SLAB signal flow partitions the physical scenario into signal processing components as they are implemented in the SLAB system architecture.

Overlap-add methods that operate in the frequency domain are, in effect, a type of output crossfade in which the crossfade interval corresponds to the overlap-add interval. Undesirable artifacts when updating the scenario are mitigated by the use of frequent updates and densely measured HRTF databases and/or densely pre-computed binaural room impulse responses [10, 11]. Disadvantages of this method include large memory requirements and the fact that changes in the source, room, and receiver characteristics require new measurements or simulations. Other systems utilizing convolution in the time domain also appear to have used densely interpolated HRIR databases (e.g., spatial resolution on the order of 2° after interpolation), perhaps combined with a short period of output crossfade, to mitigate possible artifacts due to switching between filters [1, 12].
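The difference between the two methods can be made concrete with a single FIR output sample. This is a hedged, illustrative sketch, not code from any of the systems discussed; hist holds the most recent input samples with hist[k] = x[n-k]:

```cpp
// Output crossfading vs. parameter crossfading for one FIR output sample.
#include <cstddef>
#include <vector>

static float firSample(const std::vector<float>& h,
                       const std::vector<float>& hist) {
    float acc = 0.0f;
    for (std::size_t k = 0; k < h.size(); ++k) acc += h[k] * hist[k];
    return acc;
}

// Output crossfading: filter the input twice, with the old and the new
// coefficient sets, then blend the two OUTPUTS over the transition.
float outputCrossfadeSample(const std::vector<float>& hOld,
                            const std::vector<float>& hNew,
                            const std::vector<float>& hist, float blend) {
    return (1.0f - blend) * firSample(hOld, hist)
         +         blend  * firSample(hNew, hist);
}

// Parameter crossfading: blend the COEFFICIENTS first, then filter once.
float parameterCrossfadeSample(const std::vector<float>& hOld,
                               const std::vector<float>& hNew,
                               std::vector<float>& hBlend,
                               const std::vector<float>& hist, float blend) {
    for (std::size_t k = 0; k < hBlend.size(); ++k)
        hBlend[k] = (1.0f - blend) * hOld[k] + blend * hNew[k];
    return firSample(hBlend, hist);
}
```

Note the cost asymmetry visible in the sketch: output crossfading runs the convolution twice per sample during the transition, while parameter crossfading runs it once.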

Output crossfading has the drawback of being computationally burdensome. In addition, the output is a mixture of two different systems and might not resemble that of a single system intermediate between the two. Accordingly, the SLAB system uses a variation of parameter crossfading that we term "parameter tracking." Since new scenario information may be available relatively infrequently and contains measurement noise, signal processing parameters computed with each new scenario update become target parameters that are tracked, or smoothed. Currently in the SLAB system, the scenario is updated at an average interval of about 8.3 ms, given a 120 Hz scenario update rate.

In parameter crossfading, there may be multiple update rates for the various signal processing parameters. In SLAB, there are two parameter update rates. Every other input frame, or every 1.45 ms (64 samples), filter coefficients are replaced with ones slightly closer to the target coefficients, while path delays are updated every sample (22.7 µs) to preserve embedded Doppler shifts. A more detailed discussion of dynamic synthesis methods in SLAB and other systems can be found in [13].

Informal listening tests of the SLAB system indicate that its dynamic behavior is both smooth and responsive. The smoothness is enhanced by the 120-Hz scenario update rate, as well as by the parameter tracking method, which smooths at rather high parameter update rates; i.e., time delays are updated at 44.1 kHz and the FIR filter coefficients are updated at 690 Hz. The responsiveness of the system is enhanced by the relatively low latency of 24 ms. The scenario update rate, parameter update rates, and latency compare favorably to those of other virtual audio systems.
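A minimal sketch of the parameter-tracking idea follows, using the rates quoted above (120 Hz scenario updates, coefficient steps every 64 samples, per-sample delay updates). The structure, names, and the smoothing step size are assumptions for illustration, not SLAB source code:

```cpp
// Parameter tracking sketch: scenario updates set *targets*; the render
// loop moves the active DSP parameters toward those targets at fixed,
// higher rates (coefficients per 64-sample frame pair, delays per sample).
#include <cstddef>
#include <utility>
#include <vector>

struct TrackedPath {
    std::vector<float> coeffs, targetCoeffs; // FIR coefficients (e.g., HRIR)
    float delay = 0.0f, targetDelay = 0.0f;  // path delay in samples
    float coeffStep = 0.1f;                  // smoothing factor (assumed value)

    // Called at the ~120 Hz scenario rate (e.g., on each head-tracker poll).
    void setTargets(std::vector<float> hrir, float delaySamples) {
        targetCoeffs = std::move(hrir);
        if (coeffs.size() != targetCoeffs.size())
            coeffs.assign(targetCoeffs.size(), 0.0f);
        targetDelay = delaySamples;
    }

    // Called every other input frame (~1.45 ms / 64 samples at 44.1 kHz):
    // step the coefficients part of the way toward the targets.
    void updateCoefficients() {
        for (std::size_t k = 0; k < coeffs.size(); ++k)
            coeffs[k] += coeffStep * (targetCoeffs[k] - coeffs[k]);
    }

    // Called every sample (22.7 µs at 44.1 kHz): ramp the delay smoothly,
    // which preserves the Doppler shift embedded in the delay trajectory.
    void updateDelay(float maxStepPerSample) {
        float d = targetDelay - delay;
        if (d >  maxStepPerSample) d =  maxStepPerSample;
        if (d < -maxStepPerSample) d = -maxStepPerSample;
        delay += d;
    }
};
```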
2.3. SLAB Software Features

In addition to the scenario parameters, SLAB provides hooks into the DSP parameters, such as the FIR update smoothing time constant or the number of FIR filter taps used for rendering. Various features of the renderer can also be modified, such as exaggerating spreading loss or disabling a surface reflection [14]. Recently implemented features include source trajectories, API scripting, user callback routines, reflection offsets, the Scene layer, and internal plug-ins. An external renderer plug-in interface has also been developed that allows users to implement and insert their own custom rendering algorithms.

SLAB is being released via the web at http://humanfactors.arc.nasa.gov/slab. The SLAB User Release consists of a set of Windows applications and libraries for writing spatial audio applications. The primary components are the SLABScape demonstration application, the SLABServer server application, and the SLAB Host and SLAB Client libraries. SLABScape (Figure 2) allows the user to experiment with the SLAB Renderer API. This API provides access to the acoustic scenario parameters listed in Table 3. The user can also specify sound source trajectories, enable Fastrak head tracking, edit and play SLAB Scripts, A/B different rendering strategies, and visualize the environment via a Direct3D display.

Figure 2. SLABScape screenshot.

3. Conclusions

Interest in the simulation of acoustic environments has prompted a number of technology development efforts over the years for applications such as the auralization of concert halls and listening rooms, virtual reality, spatial information displays in aviation, and better sound effects for video games. Each of these applications implies different task requirements that call for different approaches to the development of rendering software and hardware. For example, the auralization of a concert hall or listening room requires accurate synthesis of the room response in order to create what may be perceived as an authentic experience. Information displays that rely on spatial hearing, on the other hand, are more often concerned with localization accuracy than with the subjective authenticity of the experience. Virtual reality applications such as astronaut training environments, where both good directional information and a sense of presence in the environment are desired, may have requirements for both accuracy and realism.

All applications could benefit from the research represented by the papers in this special session on 3-D Sound Reproduction [see also 15, 16], which help to specify the acoustic parameters required for perceptually accurate spatial sound synthesis. Such studies can give system designers guidance about where to devote computational resources without sacrificing perceptual validity.

4. Acknowledgements

Work supported by the Human Measurement and Performance Project within NASA's Airspace Systems Program.

5. References

[1] Sandvad, J. Dynamic aspects of auditory virtual environments. 100th Conv. Aud. Eng. Soc., Copenhagen, preprint 4226, 1996.
[2] Wenzel, E. M. Analysis of the role of update rate and system latency in interactive virtual acoustic environments. 103rd Conv. Aud. Eng. Soc., New York, preprint 4633, 1997.
[3] Wenzel, E. M. The impact of system latency on dynamic performance in virtual acoustic environments. Proc. 15th Int. Cong. Acoust. & 135th Acoust. Soc. Amer., Seattle, pp. 2405-2406, 1998.
[4] Wenzel, E. M. Effect of increasing system latency on localization of virtual sounds. Proc. Aud. Eng. Soc. 16th Int. Conf. Spat. Sound Repro., Rovaniemi, Finland, April 10-12. New York: Audio Engineering Society, pp. 42-50, 1999.
[5] Savioja, L., Huopaniemi, J., Lokki, T. and Väänänen, R. Creating interactive virtual acoustic environments. J. Aud. Eng. Soc., vol. 47, pp. 675-705, 1999.
[6] Pellegrini, R. S. Comparison of data- and model-based simulation algorithms for auditory virtual environments. 107th Conv. Aud. Eng. Soc., Munich, 1999.
[7] Wenzel, E. M., Miller, J. D. and Abel, J. S. A software-based system for interactive spatial sound synthesis. ICAD 2000, 6th Intl. Conf. on Aud. Disp., Atlanta, Georgia, 2000.
[8] Websites: www.3dsoundsurge.com, www.ausim3d.com, www.ircam.fr, www.am3d.com, www.tdt.com, www.lake.com.au, www.creative.com, www.sensaura.com, www.qsound.com
[9] Jot, J. M., Larcher, V. and Warusfel, O. Digital signal processing issues in the context of binaural and transaural stereophony. 98th Conv. Aud. Eng. Soc., Paris, France, preprint 3980, 1995.
[10] Bronkhorst, A. W. Localization of real and virtual sources. J. Acoust. Soc. Amer., vol. 98, pp. 2542-2553, 1995.
[11] Gardner, W. G. Efficient convolution without input-output delay. J. Aud. Eng. Soc., vol. 43, pp. 127-136, 1995.
[12] Sahrhage, J., Blauert, J. and Lehnert, H. Implementation of an auditory/tactile virtual environment. Proc. 2nd FIVE Int. Conf., Palazzo dei Congressi, Italy, pp. 18-26, 1996.
[13] Wenzel, E. M., Miller, J. D. and Abel, J. S. Sound Lab: A real-time, software-based system for the study of spatial hearing. 108th Conv. Aud. Eng. Soc., Paris, preprint 5140, 2000.
[14] Miller, J. D. and Wenzel, E. M. Recent developments in SLAB: A software-based system for interactive spatial sound synthesis. Proc. Int. Conf. Aud. Displ., ICAD 2002, Kyoto, Japan, pp. 403-408, 2002.
[15] Begault, D. R. Audible and inaudible early reflections: Thresholds for auralization system design. 100th Conv. Aud. Eng. Soc., Copenhagen, preprint 4244, 1996.
[16] Begault, D. R., Wenzel, E. M. and Anderson, M. R. Direct comparison of the impact of head tracking, reverberation, and individualized head-related transfer functions on the spatial perception of a virtual speech source. J. Aud. Eng. Soc., vol. 49, pp. 904-916, 2001.