INVITED REVIEW

STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds

Hideki Kawahara
Faculty of Systems Engineering, Wakayama University, 930 Sakaedani, Wakayama, 640-8510 Japan

Abstract: STRAIGHT, a speech analysis, modification, and synthesis system, is an extension of the classical channel VOCODER that exploits the advantages of progress in information processing technologies and a new conceptualization of the role of repetitive structures in speech sounds. This review outlines the historical background, architecture, underlying principles, and representative applications of STRAIGHT.

Keywords: Periodicity, Excitation source, Spectral analysis, Speech perception, VOCODER
PACS number: 43.72.Ar, 43.72.Ja, 43.70.Fq, 43.66.Ba, 43.71.An [doi:10.1250/ast.27.349]

This article contains supplementary media files (see Appendix). Underlined file names in the article correspond to the supplementary files. For more information, see http://www.asj.gr.jp/2006/data/ast2706.html.

1. INTRODUCTION

This article provides an overview of the underlying principles, the current implementation, and applications of the STRAIGHT [1] speech analysis, modification, and resynthesis system. STRAIGHT is basically a channel VOCODER [2]; however, its design objective differs greatly from those of its predecessors.

It is still amazing to listen to the voice of the VODER, which was generated by human operation using pre-computer-age technologies. It effectively demonstrated that speech can be transmitted using a far narrower frequency bandwidth, an important motivation of telecommunication research in the 1930s. This aim was recapitulated in the original paper on the VOCODER [2] and led to the development of speech coding technologies. The demonstration also provided a foundation for the conceptualization of a source-filter model of speech sounds, the other aspect of VOCODER.
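The source-filter idea mentioned above can be illustrated with a toy numpy sketch: the same resonator driven by a pulse train (a voiced-style source) or by noise (an unvoiced-style source) keeps the same spectral envelope. All names and filter values here are illustrative, not part of STRAIGHT itself.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 8000
t0 = 80                                  # 100 Hz fundamental period in samples

# A toy "filter": one damped resonance around 500 Hz.
n = np.arange(300)
h = np.exp(-n / 60.0) * np.sin(2 * np.pi * 500.0 * n / fs)

pulses = np.zeros(fs); pulses[::t0] = 1.0    # voiced-style excitation
noise = rng.standard_normal(fs)              # unvoiced-style excitation

voiced = np.convolve(pulses, h)[:fs]
unvoiced = np.convolve(noise, h)[:fs]

def peak_hz(x):
    """Locate the spectral-envelope peak of a signal."""
    spec = np.abs(np.fft.rfft(x * np.hanning(x.size))) ** 2
    # Smooth away the source fine structure to expose the resonance.
    smooth = np.convolve(spec, np.ones(201) / 201, mode="same")
    return np.fft.rfftfreq(x.size, 1.0 / fs)[np.argmax(smooth)]

# Either source, same resonator: the envelope peak stays near 500 Hz,
# while the fine structure (harmonics vs. noise) carries the source.
print(abs(peak_hz(voiced) - 500.0) < 100.0,
      abs(peak_hz(unvoiced) - 500.0) < 100.0)
```

The decomposition the VOCODER performs, and that STRAIGHT refines, is the inverse of this construction: recovering the envelope and the excitation separately from the composite signal.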
It is not a trivial concept that our auditory system decomposes input sounds in terms of excitation (source) and resonant (filter) characteristics. Retrospectively, this decomposition can be considered an ecologically relevant strategy that evolved through selection pressure. However, this important aspect of VOCODER was not exploited independently of its primary aspect, narrow-band transmission, or in other words, parsimonious parametric representation. This coupling with parsimony resulted in poor resynthesized speech quality; indeed, "VOCODER voice" used to be a synonym for poor voice quality. The high-quality synthetic speech produced by STRAIGHT presented a counterexample to this belief. STRAIGHT was not designed for parsimonious representation; it was designed to provide a representation consistent with our perception of sounds [1]. The next section introduces an interpretation of the role of repetitive structures in vowel sounds and shows how this interpretation leads to spectral extraction in STRAIGHT.
(Author's e-mail: kawahara@sys.wakayama-u.ac.jp)

2. SURFACE RECONSTRUCTION FROM TIME-FREQUENCY SAMPLING

Repeated excitation of a resonator is an effective strategy for improving the signal-to-noise ratio when transmitting resonant information. However, this repetition introduces periodic interference in both the time and frequency domains, as shown in the top panel of Fig. 1. It is necessary to reconstruct the underlying smooth time-frequency surface from the representation deteriorated by this interference. The following two-step procedure was introduced to solve this problem. The first step applies a complementary set of time windows to extract power spectra that minimize temporal variation. The second step is inverse filtering in a spline space to remove frequency-domain periodicity while preserving the original spectral levels at harmonic frequencies.

2.1. Complementary Set of Windows

So-called pitch-synchronous analysis is a common practice for capturing a stable representation of a periodic signal.

However, due to intrinsic fluctuations in speech periodicity and the wide spectral dynamic range of speech, spectral distortions caused by fundamental frequency (F0) estimation errors are not negligible. These distortions are reduced by introducing time windows with weaker discontinuities at the window boundaries, such as a pitch-adaptive Bartlett window. To further reduce the side-lobe levels of the time window, Gaussian weighting in the frequency domain was introduced.

The remaining temporal periodicity, due to phase interference between adjacent harmonic components, is then reduced by introducing a complementary time window. The complementary window w_C(t) of a window w(t) is defined by the following equation:

  w_C(t) = w(t) sin(πt / T_0),  (1)

where T_0 is the fundamental period of the signal. The complementary spectrogram P_C(ω, t), calculated using this complementary window, has peaks where the spectrogram P(ω, t), calculated using the original window, yields dips. A spectrogram with reduced temporal variation, P_R(ω, t), is then calculated by blending these spectrograms using a numerically optimized mixing coefficient ξ:

  P_R(ω, t) = P(ω, t) + ξ P_C(ω, t).  (2)

The cost function σ(ξ) used in this optimization is defined, using B_R(ω, t) = sqrt(P_R(ω, t)), as

  σ²(ξ) = ∬ |B_R(ω, t) − B̄_R(ω)|² dt dω / ∬ P_R(ω, t) dt dω,  (3)

where B̄_R(ω) is the temporal average of B_R(ω, t). The optimization was conducted using periodic signals with constant F0. The cost is 0.004 for the current STRAIGHT implementation; for comparison, the cost for a Gaussian window with a frequency resolution equivalent to that of STRAIGHT's window is 0.08.

The center panel of Fig. 1 shows the spectrogram with reduced temporal variation, P_R(ω, t), obtained using the optimized mixing coefficient. Note that all the negative spikes found in the top panel, that is, in P(ω, t), have disappeared.

Fig. 1: Estimated spectra of the Japanese vowel /a/ spoken by a male. The left wall of each panel also shows the waveform and window shape. The three-dimensional plots have a frequency axis (left to right, in Hz), a time axis (front to back, in ms), and a relative level axis (vertical, in dB). The top panel shows the spectrogram calculated using an isometric Gaussian window; the center panel shows the spectrogram with reduced temporal variation using a complementary set of windows; the bottom panel shows the STRAIGHT spectrogram.

2.2. Inverse Filtering in a Spline Space

Piecewise linear interpolation of the values at harmonic frequencies provides an approximation of the missing values when the precise F0 is known. Instead of implementing this idea directly, a smoothing operation using the basis function of the 2nd-order B-spline is introduced, because this operation yields the same results for line spectra and is less sensitive to F0 estimation errors. The smoothed spectrogram P_S(ω, t) is calculated from the spectrogram P_R(ω, t) using the following equation when the spectrogram consists only of line spectra:

  P_S(ω, t) = [∫ h(λ/ω_0) P_R^η(ω − λ, t) dλ]^(1/η),  (4)

where ω_0 represents F0. The parameter η represents a nonlinearity and was set to 0.3 based on subjective listening tests. The smoothing kernel h is an isosceles triangle defined on [−1, 1]. Because a spectrogram calculated using a complementary set of windows does not consist of line spectra, the modified smoothing kernel h_Ω shown in Fig. 2 is used instead to recover the smeared values at harmonic frequencies.

The shape of h_Ω is calculated by solving a set of linear equations derived from w(t), w_C(t), and η.

Fig. 2: Smoothing kernel h_Ω(λ/ω_0) for η = 0.3. The horizontal frequency axis is normalized by F0.

The following equation yields the reconstructed spectrogram P_ST(ω, t) (the STRAIGHT spectrogram):

  P_ST(ω, t) = r([∫ h_Ω(λ/ω_0) P_R^η(ω − λ, t) dλ]^(1/η)),  (5)

where the soft rectification function r(x) is introduced to ensure that the result is positive everywhere. The following is the function used in the current implementation:

  r(x) = log(e^x + 1).  (6)

The bottom panel of Fig. 1 shows the STRAIGHT spectrogram of the Japanese vowel /a/ spoken by a male speaker. Note that the interference due to periodicity is systematically removed from the top panel to the bottom panel while the details at harmonic frequencies are preserved. It should also be noted that this pitch-adaptive procedure does not require alignment of the analysis position to pitch marks.

3. FUNDAMENTAL FREQUENCY EXTRACTION

The surface reconstruction process described in the previous section depends heavily on F0. In the development of STRAIGHT, it was also observed that minor errors in F0 trajectories affect synthesized speech quality. These observations motivated the development of dedicated F0 extractors for STRAIGHT [1,3,4] based on instantaneous frequency. The instantaneous frequency of the fundamental component is, by definition, the fundamental frequency. It is extracted as a fixed point of the mapping from frequency to the instantaneous frequency of a short-term Fourier transform [5]. An autonomous procedure for selecting the fundamental component that does not require a priori knowledge of F0 was introduced and later revised [1,3].
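The complementary-window step of Sec. 2.1 (Eqs. (1) and (2)) can be sketched numerically. The Gaussian analysis window, the two-harmonic test signal, and the grid search for the mixing coefficient below are stand-ins for STRAIGHT's pitch-adaptive machinery, chosen only to make the ripple-cancellation effect visible.

```python
import numpy as np

fs, f0 = 8000, 100
t0 = fs // f0                                  # fundamental period: 80 samples
n = np.arange(6 * t0)
sig = (np.sin(2 * np.pi * f0 * n / fs)
       + 0.5 * np.sin(2 * np.pi * 2 * f0 * n / fs))   # two-harmonic test signal

L = 3 * t0                                     # analysis frame length
t = np.arange(L) - (L - 1) / 2.0               # centered time axis
w = np.exp(-0.5 * (t / (t0 / 2.0)) ** 2)       # stand-in Gaussian window
wc = w * np.sin(np.pi * t / t0)                # Eq. (1): complementary window

def power_track(win, hz):
    """Power at one frequency as the frame slides sample by sample."""
    k = np.exp(-2j * np.pi * hz * np.arange(L) / fs)
    return np.array([np.abs(np.dot(sig[i:i + L] * win, k)) ** 2
                     for i in range(len(sig) - L)])

p = power_track(w, 150.0)     # between harmonics: strong temporal ripple
pc = power_track(wc, 150.0)   # complementary track peaks where p dips

def cv(x):                    # relative temporal variation
    return np.std(x) / np.mean(x)

# Eq. (2): blend with a numerically optimized mixing coefficient.
xi = min(np.linspace(0.0, 1.0, 101), key=lambda v: cv(p + v * pc))
print(cv(p + xi * pc) < 0.5 * cv(p))   # ripple clearly reduced by blending
```

Because the two tracks ripple in antiphase, a single scalar mixing coefficient removes most of the phase-interference ripple, which is what the optimized ξ in Eq. (2) achieves globally.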
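The inverse filtering of Sec. 2.2 can likewise be sketched on a synthetic line spectrum. Here the plain triangular kernel of Eq. (4) stands in for the modified kernel h_Ω (whose exact shape comes from solving the linear equations mentioned above); η = 0.3 follows the text, while the grid, envelope, and variable names are illustrative.

```python
import numpy as np

# Synthetic line spectrum: a smooth envelope observed only at harmonics.
f0_hz, df = 100.0, 1.0
freqs = np.arange(0.0, 1000.0, df)
envelope = np.exp(-freqs / 400.0)

P = np.zeros_like(freqs)
idx = np.arange(1, 10) * int(f0_hz)        # harmonic bin indices
P[idx] = envelope[idx]

eta = 0.3                                  # nonlinearity parameter (Sec. 2.2)
half = int(f0_hz / df)                     # kernel half-width: one F0 spacing
tri = 1.0 - np.abs(np.arange(-half, half + 1)) / half   # triangle on [-1, 1]

# Eq. (4): smooth P^eta with the kernel, then invert the nonlinearity.
# For a line spectrum this interpolates between harmonics while leaving
# the values at the harmonic frequencies themselves untouched.
P_S = np.convolve(P ** eta, tri, mode="same") ** (1.0 / eta)

def r(x):                                  # Eq. (6): soft rectification
    return np.log(np.exp(x) + 1.0)

print(abs(P_S[idx[2]] - envelope[idx[2]]) < 1e-9)  # level kept at 300 Hz
print(P_S[150] > 0.0)                      # gap between harmonics filled in
print(r(-5.0) > 0.0)                       # rectifier keeps results positive
```

With the real kernel h_Ω, which has negative lobes, the smoothed values can dip below zero; the soft rectification r(x) of Eq. (6) guarantees a positive spectrogram, as the last check illustrates.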
In the current implementation, a normalized-autocorrelation-based procedure was integrated with the previous instantaneous-frequency-based procedure to further reduce F0 extraction errors [4].

3.1. Aperiodicity Map

In the current implementation, the aperiodic component is estimated from the residuals between harmonic components and is smoothed to generate a time-frequency map of aperiodicity A(ω, t). The estimated F0 information f_0(t) is used to generate a new time axis u(t) that makes the apparent fundamental frequency of the transformed waveform constant at f_c. This manipulation removes artifacts due to the frequency modulation of the harmonic components:

  u(t) = ∫_0^t f_0(τ)/f_c dτ.  (7)

When periodic excitation due to voicing is not detected, the estimated f_0 is set to zero to indicate the unvoiced part.

4. REMAKING SPEECH FROM PARAMETERS

A set of parameters (the STRAIGHT spectrogram P_ST(ω, t), the aperiodicity map A(ω, t), and F0 with voicing information f_0(t)) is used to synthesize speech. All of these parameters are real-valued and enable independent manipulation without introducing inconsistencies between manipulated values. A pitch-event-based algorithm employing a minimum-phase impulse response calculation is currently used. A mixed-mode signal (shaped pulse plus noise) serves as the excitation source for the impulse response. Group delay manipulation is primarily used to enable sub-sample temporal resolution in F0 control. Randomization of the group delay in the higher frequency region (namely, above 4 kHz) is also used to reduce the perceived buzziness typically found in VOCODER speech.

5. APPLICATIONS

STRAIGHT was designed as a tool for speech perception research, to test speech perception characteristics using natural-sounding stimuli.
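The time-axis warping of Eq. (7) can be sketched numerically: integrating f_0/f_c gives a warped axis on which a gliding-pitch signal becomes constant-pitch. The gliding F0 trajectory and the target f_c below are illustrative values, not STRAIGHT defaults.

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs                     # one second of time
f0 = 100.0 + 40.0 * t                      # linearly gliding F0 (Hz)
fc = 120.0                                 # target constant fundamental

# Eq. (7): warped time axis u(t) = integral of f0(tau)/fc d tau.
u = np.cumsum(f0 / fc) / fs

# A signal whose instantaneous frequency follows f0:
phase = 2 * np.pi * np.cumsum(f0) / fs
x = np.sin(phase)

# Resample x onto a uniform grid in u: the apparent fundamental
# becomes the constant fc, removing the FM of the harmonics.
u_grid = np.arange(0.0, u[-1], 1.0 / fs)
y = np.interp(u_grid, u, x)

spec = np.abs(np.fft.rfft(y))
peak_hz = np.argmax(spec) * fs / len(y)
print(round(peak_hz))                      # constant apparent F0 after warping
```

With the frequency modulation removed in this way, residuals between harmonic components can be attributed to genuine aperiodicity rather than to pitch movement.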
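The minimum-phase impulse response mentioned in Sec. 4 can be obtained from a spectral slice by the standard cepstrum-folding construction; this is a sketch of that generic technique, not STRAIGHT's actual implementation, and the toy spectral slice is an assumption.

```python
import numpy as np

n = 1024
power_spec = np.exp(-np.linspace(0.0, 4.0, n // 2 + 1))  # smooth toy slice

log_mag = 0.5 * np.log(power_spec)     # target log magnitude, |H| = sqrt(power)
cep = np.fft.irfft(log_mag, n)         # real cepstrum of the slice

folded = np.zeros(n)                   # fold onto non-negative quefrencies
folded[0] = cep[0]
folded[1:n // 2] = 2.0 * cep[1:n // 2]
folded[n // 2] = cep[n // 2]

# Exponentiate the folded cepstrum's spectrum: minimum-phase response.
ir = np.fft.irfft(np.exp(np.fft.rfft(folded)), n)

# The response keeps the prescribed magnitude while packing its energy
# at the front, which suits excitation at individual pitch events.
print(np.allclose(np.abs(np.fft.rfft(ir)), np.sqrt(power_spec)))
print(np.sum(ir[:n // 4] ** 2) / np.sum(ir ** 2) > 0.9)
```

In a synthesizer of the kind described above, one such response would be computed per pitch event from the (manipulated) STRAIGHT spectrogram and driven by the shaped-pulse-plus-noise excitation.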
Selective manipulation of formant locations and trajectories suggested that results obtained using STRAIGHT are essentially consistent with classical findings, while seeming to shed new light on spectral dynamics [6,7]. It is interesting to note that evidence for the perceptual decomposition of sounds into size and shape (in other words, resonant) information was provided by a series of experiments using STRAIGHT [8].

5.1. Morphing Speech Sounds

Morphing speech samples [9] is an interesting strategy for investigating the physical correlates of perceptual attributes. It enables us to provide a stimulus continuum between two or more exemplar stimuli by evenly interpolating STRAIGHT parameters. Emotional morphing demonstrations (media file: straightmorph.swf; refer to the Appendix) were displayed at the Miraikan (the Japanese name of the National Museum of Emerging Science and Innovation) from April 22 to August 15, 2005. Figure 3 shows a screenshot of the display. Three phrases were portrayed by one female and two male actors in three emotional styles (pleasure, sadness, and anger). Simple resyntheses of the original samples were placed at the vertices of a triangle; morphed sounds were located on the edges and in the interior of the triangle and were reproduced by mouse clicks.

Fig. 3: User interface for the morphing demonstration (courtesy of the Miraikan, designed by Takashi Yamaguchi).

5.2. Testing STRAIGHT

A set of web pages is available that includes the morphing demonstration mentioned above and links to executable Matlab implementations of STRAIGHT and the morphing programs [10]. It also offers an extensive list of STRAIGHT-related literature and detailed technical information helpful for testing those executables.

6. CONCLUSION

Representing sounds in terms of excitation source and resonator characteristics has proven to be a fruitful idea; it was suggested by the classical channel VOCODER and extensively exploited in STRAIGHT. The extended pitch-adaptive procedure for recovering a smoothed time-frequency representation from voiced sounds enables versatile speech manipulation in terms of perceptually relevant attributes. It also enables exemplar-based speech manipulations such as auditory morphing, a powerful tool for investigating para- and non-linguistic aspects of speech communication that is also useful in multimedia applications. STRAIGHT is still actively being revised through the introduction of new ideas and feedback from applications. The exploitation of excitation information is going to be a hot topic in the coming years.

ACKNOWLEDGEMENTS

The author appreciates support from ATR, where the original version of STRAIGHT was invented. He also appreciates JST for funding the exploitation of the underlying principles of STRAIGHT as the CREST Auditory Brain Project from 1997 to 2002. The implementation of realtime STRAIGHT and its rewriting in the C language are supported by the e-Society leading project of MEXT. Applications of STRAIGHT to vocal music analysis and synthesis are currently supported by the CrestMuse project of JST.

REFERENCES

[1] H. Kawahara, I. Masuda-Katsuse and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction," Speech Commun., 27, 187-207 (1999).
[2] H. Dudley, "Remaking speech," J. Acoust. Soc. Am., 11, 169-177 (1939).
[3] H. Kawahara, H. Katayose, A. de Cheveigné and R. D. Patterson, "Fixed point analysis of frequency to instantaneous frequency mapping for accurate estimation of F0 and periodicity," EUROSPEECH '99, 6, pp. 2781-2784 (1999).
[4] H. Kawahara, A. de Cheveigné, H. Banno, T. Takahashi and T. Irino, "Nearly defect-free F0 trajectory extraction for expressive speech modifications based on STRAIGHT," Interspeech 2005, pp. 537-540 (2005).
[5] F. J. Charpentier, "Pitch detection using the short-term phase spectrum," ICASSP '86, pp. 113-116 (1986).
[6] C. Liu and D. Kewley-Port, "Vowel formant discrimination for high-fidelity speech," J. Acoust. Soc. Am., 116, 1224-1233 (2004).
[7] P. F. Assmann and W. F. Katz, "Synthesis fidelity and time-varying spectral change in vowels," J. Acoust. Soc. Am., 117, 886-895 (2005).
[8] D. R. R. Smith, R. D. Patterson, R. Turner, H. Kawahara and T. Irino, "The processing and perception of size information in speech sounds," J. Acoust. Soc. Am., 117, 305-318 (2005).
[9] H. Kawahara and H. Matsui, "Auditory morphing based on an elastic perceptual distance metric in an interference-free time-frequency representation," ICASSP 2003, I, pp. 256-259 (2003).
[10] http://www.wakayama-u.ac.jp/~kawahara/index-e.html

APPENDIX: SUPPLEMENTARY FILES

The animation file (straightmorph.swf) was produced with Macromedia Flash. Open-source as well as commercial Flash players and plug-ins are available for playing Flash movies on Windows and Mac OS. Please click the upper right corner of the interface (a button titled "I love you.") first to start playing the English examples. The manipulated sound files embedded in the Flash animation (straightmorph.swf) for the English demonstration mentioned above are listed in Tables A.1, A.2, and A.3.

The sample sentence "I love you." was portrayed by a male actor in three different emotional expressions. The centroid (iloveyoucentroid.wav) of the three expressions was generated by morphing them. The centroid was then used to generate the other three-way morphing examples. Finally, the centroid was morphed with the average (50% point) of two expressions.

Table A.1: Morphing between two expressions.

(a)
file name              anger  sadness
iloveyouangsad1a.wav   100    0
iloveyouangsad1b.wav   90     10
iloveyouangsad1c.wav   80     20
iloveyouangsad1d.wav   70     30
iloveyouangsad1e.wav   60     40
iloveyouangsad1f.wav   50     50
iloveyouangsad1g.wav   40     60
iloveyouangsad1h.wav   30     70
iloveyouangsad1i.wav   20     80
iloveyouangsad1j.wav   10     90
iloveyouangsad1k.wav   0      100

(b)
file name              pleasure  anger
iloveyouhpyang1a.wav   100       0
iloveyouhpyang1b.wav   90        10
iloveyouhpyang1c.wav   80        20
iloveyouhpyang1d.wav   70        30
iloveyouhpyang1e.wav   60        40
iloveyouhpyang1f.wav   50        50
iloveyouhpyang1g.wav   40        60
iloveyouhpyang1h.wav   30        70
iloveyouhpyang1i.wav   20        80
iloveyouhpyang1j.wav   10        90
iloveyouhpyang1k.wav   0         100

(c)
file name              sadness  pleasure
iloveyousadhpy1a.wav   100      0
iloveyousadhpy1b.wav   90       10
iloveyousadhpy1c.wav   80       20
iloveyousadhpy1d.wav   70       30
iloveyousadhpy1e.wav   60       40
iloveyousadhpy1f.wav   50       50
iloveyousadhpy1g.wav   40       60
iloveyousadhpy1h.wav   30       70
iloveyousadhpy1i.wav   20       80
iloveyousadhpy1j.wav   10       90
iloveyousadhpy1k.wav   0        100

Table A.2: Morphing between the centroid (iloveyoucentroid.wav) and each expression.

file name           centroid  anger
iloveyouctoaa.wav   75        25
iloveyouctoab.wav   50        50
iloveyouctoac.wav   25        75

file name           centroid  pleasure
iloveyouctoha.wav   75        25
iloveyouctohb.wav   50        50
iloveyouctohc.wav   25        75

file name           centroid  sadness
iloveyouctosa.wav   75        25
iloveyouctosb.wav   50        50
iloveyouctosc.wav   25        75

Table A.3: Morphing between the centroid and the average of two expressions.

file name            two expressions
iloveyousideas.wav   anger and sadness
iloveyousideha.wav   pleasure and anger
iloveyousidesh.wav   sadness and pleasure

Hideki Kawahara received B.E., M.E., and Ph.D. degrees in Electrical Engineering from Hokkaido University, Sapporo, Japan, in 1972, 1974, and 1977, respectively. In 1977, he joined the Electrical Communications Laboratories of the Nippon Telegraph and Telephone Public Corporation. In 1992, he joined the ATR Human Information Processing Research Laboratories in Japan as a department head, and in 1997 he became an invited researcher at ATR. Since 1997 he has been a professor in the Faculty of Systems Engineering, Wakayama University. He received the Sato Prize from the ASJ in 1998 and the EURASIP best paper award in 2000. His research interests include auditory signal processing models, speech analysis and synthesis, and auditory perception. He is a member of ASA, ASJ, IEICE, IEEE, IPSJ, ISCA, and JNNS.
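The morphing rates listed in the appendix tables amount to weighted interpolation of STRAIGHT's parameter sets. A minimal sketch follows; the function and parameter names are hypothetical, the parameters are assumed to be already time-frequency aligned (the actual morphing procedure [9] performs this alignment first), and the choice of log-domain spectral and geometric F0 interpolation is an assumption for perceptually even steps, not a statement of STRAIGHT's exact recipe.

```python
import numpy as np

def morph(params_a, params_b, rate_a):
    """Blend two pre-aligned STRAIGHT-style parameter sets.

    rate_a is the percentage of exemplar A, e.g. 70 for a 70/30 morph
    as in Table A.1.  Spectra are interpolated in the log domain, F0
    geometrically, and aperiodicity linearly (assumed conventions).
    """
    w = rate_a / 100.0
    return {
        "spectrogram": np.exp(w * np.log(params_a["spectrogram"])
                              + (1 - w) * np.log(params_b["spectrogram"])),
        "f0": params_a["f0"] ** w * params_b["f0"] ** (1 - w),
        "aperiodicity": (w * params_a["aperiodicity"]
                         + (1 - w) * params_b["aperiodicity"]),
    }

# Two toy exemplars (3 frames, 4 frequency bins).
a = {"spectrogram": np.full((3, 4), 4.0), "f0": np.full(3, 100.0),
     "aperiodicity": np.zeros((3, 4))}
b = {"spectrogram": np.full((3, 4), 1.0), "f0": np.full(3, 200.0),
     "aperiodicity": np.full((3, 4), 0.2)}

mid = morph(a, b, 50)                         # the 50/50 point of the tables
print(round(float(mid["spectrogram"][0, 0]), 6))  # geometric mean of 4 and 1
print(round(float(mid["f0"][0]), 1))              # geometric mean of 100 and 200
```

Stepping rate_a through 100, 90, ..., 0 reproduces the eleven-point continua of Table A.1; synthesizing each blended parameter set yields the corresponding stimulus file.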