Spatial Audio Transmission Technology for Multi-point Mobile Voice Chat


Voice Chat   Multi-channel Coding   Binaural Signal Processing

Kei Kikuiri, Nobuhiko Naka, Shinya Iizuka (Research Laboratories)

We have developed a spatial audio transmission technology for comfortable and smooth telecommunication in the mobile environment, which allows users participating in multi-point voice chat to assign a unique spatial position to each distant talker's voice. This enables customization of the listening environment according to the individual user's preferences, provides an intuitive cue for speaker identification, and makes voice chat less tiring.

1. Introduction

Recently, communication services that allow multiple simultaneous participants, such as content shares*1 and online games, have been receiving much attention. For these types of services, a multi-point voice chat function will play an important role in realizing rich communication, because it is able to convey a sense of emotion and excitement in real time.

At the same time, as the bandwidth of access networks increases, there is much research and development toward more natural voice communication that provides a sense of presence while also transmitting wider-band speech signals. Intended for use in VoIP services over high-speed mobile packet access connections such as Long Term Evolution (LTE)*2 and Fourth-Generation (4G) mobile communications, NTT DOCOMO has developed a high-quality speech coding technology able to transmit super-wideband speech (with a frequency bandwidth over 10 kHz) at bit rates from 48 to 64 kbit/s [1].

As one of our initiatives to improve Quality of Experience (QoE) for this sort of mobile voice communication service, we have also developed spatial audio transmission as an extension of the above-mentioned high-quality speech coding technology. It provides a comfortable listening environment for conversation among several people, such as in multi-point voice chat.

In contrast to one-to-one conversation, mixed environments with several speakers involve new difficulties, including speaker identification and following multiple simultaneous topics in the conversation. It is well known that applying spatial information, such as direction and/or distance, to each speaker's voice signal using binaural signal processing*3 can be effective in reducing these difficulties [2].

Conventionally, spatial audio playback has been used mainly for reproducing a real or virtual acoustic space to create presence or to share a sound space between participants [3]. In contrast, the objective of the proposed spatial audio transmission technology is to allow each user to allocate a unique position to the voices of the remote participants to improve ease of listening.

*1 Content share: A service for sharing information such as video or images over a network.
*2 LTE: An evolutional standard of the Third-Generation mobile communication system specified at 3GPP; LTE is synonymous with Super3G proposed by NTT DOCOMO.
*3 Binaural signal processing: A type of signal processing which artificially adjusts the sound heard by each ear to create a spatial effect when playing back monaural sound.

There are three typical approaches for a voice chat system that allows listeners to place remote voices at positions of their own choosing (Table 1). The first is client-side processing. Each client directly receives voice data from the remote participants and individually processes that data for spatial audio synthesis. All of the functions required to generate the spatial audio are therefore implemented in the client, but the amount of data transmitted and the processing load on each client increase with the number of participants. The second is server-side processing. A server renders the spatial audio signals from the received participants' voices and multiplexes them for transmission. This reduces the volume of transmitted data and the amount of processing required in each client, but an additional back channel is needed to send the control information for generating the spatial audio from the clients to the server. The last is a hybrid approach. The server performs compression and multiplexing, while the client performs the spatial audio synthesis. Compression on the server may degrade quality compared with the other two approaches, but it reduces the volume of transmitted data, distributes the processing load, and keeps the spatial audio processing at the client side.

Table 1  Comparison of voice chat systems with spatial audio playback

Approach | Server processing | Back channel | Downlink transmission data (volume) | Client processing (load)
Client-side processing | None | Not required | Encoded data from all participants (increases with number of participants) | Decoding and spatial audio synthesis of each data stream (increases with number of participants)
Server-side processing | Spatial audio synthesis and multiplexing | Required | Spatially synthesized data | Decoding of the spatially synthesized data
Hybrid processing | Compression and multiplexing | Not required | Multiplexed data | Decoding and spatial audio synthesis of the multiplexed data

Our spatial audio transmission technology is based on the hybrid approach, taking into consideration the limitations of wireless transmission and of client processing capacity in a mobile environment. It consists of two major developments. One is multi-channel coding*4 on the server, which compresses multiple high-quality speech-coded signals into a single stream at 48 to 96 kbit/s while limiting quality degradation by exploiting human auditory characteristics. The other is spatial audio decoding, which reduces the complexity of client processing through efficient integration of decoding and spatial audio synthesis. Practical implementation of this technology enables smooth, multi-point voice chat communication, with voices that are easy to distinguish intuitively, in mobile environments.

This article describes the development of the spatial audio transmission technology, the results of a sound-quality evaluation, and the development of a mobile VoIP multi-point voice chat prototype using the technology.

2. Spatial Audio Transmission Technology

2.1 Architecture of Spatial Audio Transmission Technology

The spatial audio transmission technology is composed of processes for speech encoding, multi-channel coding and spatial audio decoding. The client performs the speech encoding and the spatial audio decoding, and the server performs the multi-channel coding (Figure 1).

[Figure 1  Architecture of the spatial audio transmission technology: a speech encoding process at each client, the multi-channel coding process at the multiplexing server, and a spatial audio decoding process at the receiving client, with a portrayal of spatial audio listening.]

A high-quality speech coding algorithm developed by NTT DOCOMO is used for the speech encoding. It transforms the input time-domain speech signal into frequency-domain coefficients using a Modified Discrete Cosine Transform (MDCT)*5 and quantizes each coefficient according to its auditory significance. The method is able to encode a super-wideband speech signal with a low latency of several tens of milliseconds and with a processing load comparable to that of conventional speech encoding methods.
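To make the speech-encoding step concrete, the sketch below transforms a windowed frame to MDCT coefficients and allocates quantization precision by band energy. This is a minimal, hypothetical illustration in Python, not NTT DOCOMO's codec: the sine window, the eight uniform bands and the energy-based significance measure are assumptions standing in for a real psychoacoustic model.

```python
import numpy as np

def mdct(windowed_frame):
    """MDCT of a 2N-sample windowed frame, returning N coefficients."""
    two_n = len(windowed_frame)
    half = two_n // 2
    n = np.arange(two_n)
    k = np.arange(half)
    basis = np.cos(np.pi / half * (n[None, :] + 0.5 + half / 2) * (k[:, None] + 0.5))
    return basis @ windowed_frame

def encode_frame(frame, num_bands=8, bits_per_frame=80):
    """Toy perceptual quantization: spend more bits on high-energy bands."""
    window = np.sin(np.pi * (np.arange(len(frame)) + 0.5) / len(frame))  # sine window
    coeffs = mdct(frame * window)
    bands = np.array_split(np.arange(len(coeffs)), num_bands)
    energy = np.array([np.sum(coeffs[b] ** 2) for b in bands]) + 1e-12
    bits = np.maximum(1, np.round(bits_per_frame * energy / energy.sum())).astype(int)
    quantized, steps = [], []
    for band, nbits in zip(bands, bits):
        step = (np.abs(coeffs[band]).max() + 1e-12) / 2 ** (nbits - 1)
        quantized.append(np.round(coeffs[band] / step).astype(int))
        steps.append(step)
    return quantized, steps, bits  # indices, step sizes and bit allocation to transmit

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    q, steps, bits = encode_frame(rng.standard_normal(512))
    print(bits)  # bit allocation across the eight bands
```

Decoding reverses the process: the indices are multiplied by their step sizes and passed through the inverse, overlapped transform, which is where the block-boundary property of the MDCT mentioned in footnote *5 comes into play.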
The multi-channel coding process decodes the high-quality speech-coded stream from each client, determines the most important components by comparing the frequency-domain coefficients, and compresses and multiplexes them to create a single compressed and encoded data stream (Figure 2).

[Figure 2  Multiplex processing for multi-channel coding: the voice data of users A, B and C are compressed and multiplexed into one stream, and sound components are discarded based on their auditory significance.]

*4 Multi-channel coding: A form of signal processing which takes input signals from multiple systems and performs multiplexing and data compression onto a single system.
*5 MDCT: A method for converting a time-series signal to its frequency components. It is able to reduce distortion at block boundaries without losing information by applying an overlapping transform with the preceding and following blocks, so it is widely used for audio encoding.
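The article does not spell out the selection rule, so the following sketch only illustrates the general shape of the multi-channel coding step: gather each participant's decoded frequency-domain coefficients, keep the components judged most important across all talkers within a fixed budget, and pack the survivors into a single payload. The magnitude-based ranking and the (talker, bin, value) payload layout are hypothetical; the actual system ranks components by auditory significance and re-encodes them.

```python
import numpy as np

def multichannel_merge(decoded_coeffs, keep_ratio=0.5):
    """decoded_coeffs: dict of talker_id -> 1-D array of MDCT coefficients (equal length).
    Returns (talker_id, bin_index, value) triples for the retained components,
    i.e. the material that would be re-quantized and multiplexed downstream."""
    talkers = list(decoded_coeffs)
    stacked = np.stack([decoded_coeffs[t] for t in talkers])      # (talkers, bins)
    importance = np.abs(stacked)                                  # stand-in for auditory significance
    budget = int(keep_ratio * importance.size)
    strongest = np.argsort(importance, axis=None)[::-1][:budget]  # flat indices, largest first
    kept = []
    for flat_idx in strongest:
        t_idx, bin_idx = np.unravel_index(flat_idx, stacked.shape)
        kept.append((talkers[t_idx], int(bin_idx), float(stacked[t_idx, bin_idx])))
    return kept

streams = {"A": np.array([0.9, 0.1, 0.0, 0.4]),
           "B": np.array([0.2, 0.8, 0.05, 0.3])}
print(multichannel_merge(streams, keep_ratio=0.5))
```

At the client, the components belonging to each talker are separated out again and the discarded ones are treated as zero; discarding the perceptually weakest components is what lets the single downlink stream stay within 48 to 96 kbit/s.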

The spatial audio decoding process receives the compressed and multiplexed encoded data from the multi-channel coding process, separates out and decodes the frequency-domain components of each participant's voice, and performs spatial audio synthesis.

Figure 3 shows the mechanism by which humans recognize the location of a sound source. Sound generated by a source propagates to both ears along different paths. The direction from which it arrives is recognized from the Inter-aural Intensity Difference (IID) and the Inter-aural Time Difference (ITD), both of which result from the difference in the distances from the source to each ear. Thus, if signal processing is used to simulate the IID and ITD for a monaural signal, and the resulting signals are presented separately to the left and right ears over headphones, the listener perceives the signal with a spatial effect.

[Figure 3  Sound-location recognition mechanism: the waveforms arriving at the left and right ears differ in intensity and arrival time because of the difference in distance from the source to each ear.]

Conventionally, spatial audio synthesis is applied to the decoded time-domain signal. For this technology, however, we developed a method of spatial audio synthesis that operates directly on the frequency-domain coefficients (i.e., the decoded MDCT coefficients) while the encoded data is being decoded (Figure 4). By combining the decoding and the spatial audio synthesis in this way, we reduced the processing required for spatial audio playback by approximately 30% to 50% relative to conventional methods.

[Figure 4  Spatial audio decoding architecture: (a) general, conventional processing, in which each encoded stream is dequantized, inverse-transformed and decoded into a speech signal before spatial audio synthesis; (b) proposed processing, in which dequantization, coefficient operations and correction are applied in the frequency domain before the transform back to a speech signal.]
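As a concrete illustration of the binaural cues described above, the sketch below places a monaural signal at a chosen azimuth by imposing a level difference (IID) and a small arrival-time difference (ITD) between the two ears. It works on the time-domain signal for readability, whereas the developed system applies equivalent coefficient operations directly to the decoded MDCT coefficients; the constant-power panning law and the 0.66 ms maximum ITD are textbook-style assumptions, not values taken from the article.

```python
import numpy as np

def binauralize(mono, fs, azimuth_deg):
    """Render a mono signal at azimuth_deg (-90 = left, +90 = right) as a stereo pair
    using an interaural intensity difference and an interaural time difference."""
    az = np.radians(np.clip(azimuth_deg, -90.0, 90.0))
    # IID: constant-power pan between the two ears
    theta = (az + np.pi / 2) / 2
    gain_l, gain_r = np.cos(theta), np.sin(theta)
    # ITD: up to ~0.66 ms for a source fully to one side (typical head size)
    itd = 0.00066 * np.sin(az)                  # positive -> right ear leads
    delay_l = int(round(max(itd, 0.0) * fs))    # delay the far ear
    delay_r = int(round(max(-itd, 0.0) * fs))
    left = np.concatenate([np.zeros(delay_l), gain_l * mono])
    right = np.concatenate([np.zeros(delay_r), gain_r * mono])
    n = max(len(left), len(right))
    left = np.pad(left, (0, n - len(left)))
    right = np.pad(right, (0, n - len(right)))
    return np.stack([left, right], axis=1)      # (samples, 2) for headphone playback

fs = 22050
t = np.arange(fs) / fs
voice = np.sin(2 * np.pi * 440 * t)             # placeholder for a decoded voice signal
stereo = binauralize(voice, fs, azimuth_deg=40) # place this talker to the listener's right
```

Applying a different azimuth and gain to each remote participant before summing the left and right channels gives each voice its own position in the listener's auditory space, which is the effect the voice chat client exposes to the user.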

2.2 Verification of Sound Quality

To verify the quality of the sound delivered by the spatial audio transmission technology, we conducted subjective evaluation tests under the conditions shown in Table 2. We used the MUlti-Stimulus test with Hidden Reference and Anchor (MUSHRA) method [4], in which each test stimulus (including the original sound) is rated on a scale from 0 to 100 points.

Table 2  Subjective evaluation test conditions

Methodology: MUSHRA
Number of subjects: 10
Test items: Conversation A (five participants, few concurrent utterances); Conversation B (six participants, many concurrent utterances)
Reference sound (sampling frequency): Binaural playback with sound sources reconstructed separately (22.05 kHz)
Encoded sound (bit rate / sampling frequency): Binaural playback with spatial audio transmission (48, 64, 96 kbit/s / 22.05 kHz)
Uncompressed, multiplex-encoded sound: High-quality encoded sound (64 kbit/s per channel / 22.05 kHz), spatial audio synthesis with separately reconstructed sources
Band-limited sound: 7 kHz bandwidth, 3.5 kHz bandwidth
Listening method: Headphones (both ears)

Figure 5 shows the test results. The error bars in the figure show the 95% confidence intervals*6 of the averaged scores. Conversation A contains only momentary instances of simultaneous utterances, while conversation B contains continuous periods with two or more participants speaking. The results for conversation A at 64 kbit/s and for conversation B at 96 kbit/s show that our technology achieves quality equivalent to that obtained with multiple high-quality signals encoded at 64 kbit/s per channel. In other words, through the multi-channel coding, the spatial audio transmission technology reduces the downlink data transmission for each conversation to 20% to 25% of the volume needed to multiplex the separately encoded streams (64 kbit/s versus 320 kbit/s for conversation A, and 96 kbit/s versus 384 kbit/s for conversation B).

[Figure 5  Subjective evaluation test results for (a) conversation A and (b) conversation B: MUSHRA scores (0 to 100, with 95% confidence intervals) for the reference; for uncompressed, multiplexed high-quality encoding (320 kbit/s = 64 kbit/s x 5 and 384 kbit/s = 64 kbit/s x 6); for spatial audio transmission at 96, 64 and 48 kbit/s; and for 7 kHz and 3.5 kHz band-limited transmission.]

*6 95% confidence interval: Assuming the sample has a particular distribution, an interval containing 95% of the sample.
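For reference, MUSHRA results such as those in Figure 5 are normally reported as the mean score per condition with a 95% confidence interval across listeners. The sketch below shows that aggregation for an assumed score matrix; the listener scores, the four condition labels and the use of a Student's t interval are illustrative placeholders, not data or details from this test.

```python
import numpy as np
from scipy import stats

def mushra_summary(scores):
    """scores: (num_listeners, num_conditions) array of 0-100 MUSHRA ratings.
    Returns per-condition mean and half-width of the 95% confidence interval
    based on Student's t distribution."""
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    mean = scores.mean(axis=0)
    sem = scores.std(axis=0, ddof=1) / np.sqrt(n)
    half_width = stats.t.ppf(0.975, df=n - 1) * sem
    return mean, half_width

# Hypothetical scores for 10 listeners and 4 conditions (e.g. reference, 96, 64, 48 kbit/s)
rng = np.random.default_rng(0)
example = np.clip(rng.normal([98, 85, 80, 65], 5, size=(10, 4)), 0, 100)
means, ci95 = mushra_summary(example)
print(np.round(means, 1), np.round(ci95, 1))
```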

3. Prototype

This technology was implemented in a VoIP-based, multi-point voice chat system using the Session Initiation Protocol (SIP)*7. The server and client functions were implemented as Windows*8 and Windows Mobile*9 applications respectively. We confirmed execution of the client software on FOMA PRO Series HT-01A terminals (Photo 1). Clients participate in a voice chat session by placing calls to meeting rooms configured on the server. The client screen shows a participant list and whether each participant is speaking; after selecting a participant, the left and right buttons can be used to adjust that speaker's position, while the up and down buttons adjust the volume.

[Photo 1  Prototype software display example: volume is adjusted with the up/down buttons and direction with the left/right buttons.]

4. Conclusion

In this article, we have described a spatial audio transmission technology used in a multi-point voice chat application for mobile environments. The technology provides spatial audio synthesis that allows participants to adjust the positions of the other participants' voices according to their preferences, and was developed to provide comfortable, smooth, multi-user voice communication. The subjective listening test results indicated that the proposed multi-channel coding method reduced the transmitted data volume to 20% to 25% of that of the separately encoded streams while maintaining sound quality. We also described a prototype of this technology, implemented in the form of a VoIP-based, multi-point voice-chat system.

In addition to improving the experience of voice-chat participants, the spatial audio transmission technology, which allows the direction and volume of individual participants' voices to be adjusted, is promising for applications attempting to improve a sense of shared space or presence. In the future, we plan to continue studying improvements to the technology's binaural signal processing, such as personalizing the spatial audio effects to user preferences.

*7 SIP: A call control protocol defined by the Internet Engineering Task Force (IETF) and used for IP-phone with VoIP, etc.
*8 Windows: A trademark or registered trademark of Microsoft Corp. in the United States and other countries.
*9 Windows Mobile: A trademark or registered trademark of Microsoft Corp. in the United States and other countries.

References
[1] K. Kikuiri et al.: "High-quality Speech Coding," NTT DoCoMo Technical Journal, Vol. 9, No. 2, pp. 38-41, Sep. 2007.
[2] R. Drullman and A. W. Bronkhorst: "Multichannel speech intelligibility and talker recognition using monaural, binaural, and three-dimensional auditory presentation," J. Acoust. Soc. Am., Vol. 107, pp. 2224-2235, 2000.
[3] Y. Yasuda et al.: "Reality Speech/Audio Communications Technologies," NTT DoCoMo Technical Journal, Vol. 5, No. 1, pp. 61-69, Jun. 2003.
[4] ITU-R Recommendation BS.1534-1: "Method for the subjective assessment of intermediate quality level of coding systems," 2003.