A GMM-STRAIGHT Approach to Voice Conversion


EE 225D Project, May 2009
Stephen Shum (SID: 18066044)
sshum@berkeley.edu

Abstract

This paper explores the topic of voice conversion, pursued as a joint project with Percy Liang (EECS, Berkeley). For our purposes, voice conversion is the process of modifying the speech signal of one speaker (source) such that it sounds as though it had been pronounced by a different speaker (target). Following the Source-Filter model of speech production, we begin by assuming that most of a speaker's characteristics can be summarized in the spectral envelope as represented by a set of linear predictive coefficients. By using a Gaussian mixture model (GMM) to model the features of the source speaker, we can then learn a mapping of features from the source to the target, and then resynthesize via various methods. In this paper, we explore different approaches to modeling the features and describe our results from experimenting with the resynthesis process, including the integration of a STRAIGHT vocoder system, which provides an advanced model of the excitation signal. Further discussion includes ways to immediately improve the system and how we would like to proceed in the future.

Contents

1 The Motivation
2 System Overview
3 Data
4 Feature Extraction and Alignment
  4.1 Linear Predictive Coefficients
  4.2 Line Spectral Frequencies
  4.3 Dynamic Time Alignment
5 The GMM-Linear Mapping of Features
  5.1 The Gaussian Mixture Model and EM Algorithm
  5.2 Learning a mapping function
  5.3 Increasing Model Complexity
6 Synthesizing Converted Speech
  6.1 Copying Source Residuals
  6.2 Residual Codebook and Selection
  6.3 The Vocoder Approach
7 The STRAIGHT Vocoder
  7.1 F0 Extraction
  7.2 Aperiodicity Extraction
  7.3 Spectrogram Extraction
  7.4 Synthesis
8 Integrating STRAIGHT
  8.1 Implementation
  8.2 Results, Analysis, and Further Work
9 Discussion
10 Conclusion
11 Acknowledgements

1 The Motivation

Speech is used as a way of conveying a wide range of information. Though a primary interest in human speech is to communicate the meaning of a message (a topic of much interest in Automatic Speech Recognition and Natural Language Processing), also present in the signal is secondary information that includes a speaker's identity, emotion, age, and even possible pathology. The specific correlates of these characteristics to the actual signal are still being explored in many areas of research. For now, what we do know is that the individuality amongst voices is what makes life interesting.

Voice modification technology has many applications in all systems that make use of pre-recorded speech, such as voice mailboxes or elaborate text-to-speech synthesizers. In these cases, voice conversion would prove to be a simple and efficient way to create the desired variety of speakers without the need to record different speakers. On a more medically related note, persons who suffer from some form of voice pathology, or who have had some form of surgery that renders them speech impaired, would find assistance in voice modification, which might restore their previous speaking capabilities. In the same sense, with international communication becoming more and more commonplace, work is being done to recognize utterances (ASR), translate them from one language to another (also known as Machine Translation), and re-synthesize the translated utterance. Voice conversion would be of valuable assistance in preserving the naturalness of the re-synthesized speech. All in all, we can see that our problem of voice conversion is directly related to the advancement of human language technologies, including the aforementioned automatic speech recognition, machine translation, and natural language processing, as well as speaker identification and telephony in general. Perhaps the most significant distinction between our topic and the rest of those at hand is that, in voice conversion, we are ultimately interested in the quality of the re-synthesis of speech, targeted towards a human listener [1]. This is an important distinction to keep in mind as we look ahead to the evaluation of our system.

2 System Overview

A speaker's characteristics, including speaking rate, average pitch, average pause duration between words, phrases, and sentences, and voice timbre, can be summarized at fixed intervals by extracting features, a representation of the original signal using reduced dimensionality. We would like to be able to learn some sort of a mapping function from the features of the source speaker to the features of the target speaker. We begin our implementation by focusing on the text-dependent data scenario, in which our training data is a set of pairs of speech signals T = {(v_i^1, v_i^2)}_{i=1}^m, where each pair corresponds to the speakers saying the same words. In particular, v_i^s is a sequence of real numbers corresponding to the speech of speaker s on utterance i. During training, we build a Gaussian Mixture Model (GMM) that, using a variant of the Expectation-Maximization algorithm, develops a linear mapping from the features of the source to the features of the target. At test time, the goal for our system is: given a new speech signal v^a for speaker 1 (source), convert it to some v^b, which is supposed to correspond to speaker 2 (target) saying the same words [2].

The rest of this paper will outline the theories around which we built our voice conversion system.
First, we will explain the ideas surrounding the front-end signal processing and dynamic alignment that must be done to generate a workable set of features before any training or testing. Then we will briefly discuss the graphical models that were applied to the data to build our mapping function. Next, we will go over the process of resynthesizing the converted speech and methods for overcoming the difficulties of this problem. Finally, we take a look at the results generated from a few experiments and conclude with a discussion on the potential for future work.

Figure 1: Plots (a) and (b) are signal and spectral plots of the typical glottal source excitation signal. Plot (c) shows the frequency response of some vocal tract configuration with resonant (formant) frequencies F1, F2, F3, F4. Plot (d) shows the resulting spectrum of the speech; in particular, (d) = (b) x (c). [5]

3 Data

The desire to work in the text-dependent scenario requires an appropriate corpus from which to run our experiments. We have been provided with the University of Wisconsin's Microbeam X-ray Speech Production Database by Professor Keith Johnson of the UC Berkeley Linguistics Department. The database is composed of more than 60 speakers, each saying a specific set of utterances ranging from word sequences to sentences and paragraphs [3]. For each speaker we have approximately 10 minutes worth of utterances, from which we devoted 6 minutes to a training set and 4 minutes to a test set.

4 Feature Extraction and Alignment

Speech can be explained by the Source-Filter Model [4]. The lungs and glottis (source) supply the power and pitch, while the shape of the vocal tract, which consists of the pharynx and the mouth and nose cavities, works like a musical instrument to produce sound. The speech we hear is a result of the resonant frequencies caused by differing shapes in our vocal tract (filter). This filter can, like all filters, be characterized by its frequency response, as shown in part (c) of Figure 1. Note the resonant spectral peaks, or formant frequencies, of the vocal tract.

4.1 Linear Predictive Coefficients

Based on the Source-Filter Model, one of the most fundamental feature extraction methods in speech processing is linear predictive coefficients (LPCs). In this setting, we assume that the present sample of speech x(n) is predicted by the past M samples of the speech such that

    \hat{x}(n) = a_1 x(n-1) + a_2 x(n-2) + \cdots + a_M x(n-M) = \sum_{i=1}^{M} a_i x(n-i),    (1)

where \hat{x}(n) is the prediction of x(n), and x(n-i) is the i-th previous sample. Then {a_i} are the LPCs of the signal. As such, the error between the actual sample and the predicted one can be expressed as

    \epsilon(n) = x(n) - \hat{x}(n) = x(n) - \sum_{i=1}^{M} a_i x(n-i).    (2)

The objective is then to minimize the sum of the squared error with respect to each a_i:

    E = \sum_n \epsilon^2(n) = \sum_n \left( x(n) - \sum_{i=1}^{M} a_i x(n-i) \right)^2.    (3)

This can be done by taking the corresponding derivatives with respect to each a_i and solving the resulting system of linear equations [5]; moreover, the Levinson-Durbin recursion algorithm can also efficiently provide a solution to our LPC calculation [6]. With a little more work, we can also see that the linear predictive formulation does indeed define an all-pole filter. By slightly modifying (2) and taking the z-transform, we have

    \epsilon(z) = X(z) - \sum_{k=1}^{M} a_k z^{-k} X(z).    (4)

Supposing now that the error (or residual) is the input to our system, and x is our output, we have a transfer function of the following form,

    \frac{X(z)}{\epsilon(z)} = H(z) = \frac{1}{1 - \sum_{k=1}^{M} a_k z^{-k}},    (5)

which indeed defines an all-pole filter [7].

The key element to take away from all this is that linear predictive coefficients are a form of feature extraction, a representation of the original signal using fewer dimensions. Indeed, they were originally and are still widely used for compression in speech coding over telephone channels. This allows for easy quantization and transmission of the residual, which would then allow for near-perfect reconstruction of the signal. Because the success of this project relies rather heavily on the quality of speech re-synthesis, LPCs were an ideal starting point. For our initial implementation, we extract a set of LPCs from a 25 ms Hamming-windowed segment of the signal. To minimize cut-off errors, we do this extraction every 10 ms.

A few final remarks are necessary before we conclude this discussion on linear predictive coefficients. Because they define the filter that represents a spectrum with peaks at the resonant frequencies of the vocal tract, the exact number of LPCs that is best for a given signal actually depends on the bandwidth of the speech signal. The larger the bandwidth, the more spectral peaks, and thus the more LPCs required to sufficiently model the vocal tract. In our corpus, we have speech sampled at approximately 22 kHz; hence we will use an order of 22 + 2 = 24 coefficients. Lastly, we should note that resonant frequencies only exist during voiced sounds - where the glottis actually vibrates - such as vowels. Formants do not necessarily exist for consonants, and thus we should note that the LPCs extracted for unvoiced sounds do not exactly correspond to anything with respect to the vocal tract. Nevertheless, linear predictive coefficients are effective in serving as features for the analysis and synthesis of a speech signal.
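To make the front end concrete, the following MATLAB sketch illustrates the framewise analysis just described (25 ms Hamming windows every 10 ms, order-24 LPC). It is a minimal illustration under assumed variable names, not our exact implementation; the Signal Processing Toolbox function lpc computes the coefficients via the autocorrelation method and Levinson-Durbin recursion.

    % Framewise LPC analysis: 25 ms Hamming window, 10 ms hop, order 24.
    % Minimal sketch; the filename is a hypothetical placeholder.
    [x, fs] = audioread('utterance.wav');
    winLen = round(0.025 * fs);             % 25 ms analysis window
    hopLen = round(0.010 * fs);             % 10 ms frame shift
    M      = 24;                            % LPC order (~ fs/1000 + 2)
    win    = hamming(winLen);
    nFrames = floor((length(x) - winLen) / hopLen) + 1;
    A = zeros(nFrames, M + 1);              % each row: coefficients of A(z)
    for i = 1:nFrames
        seg = x((i-1)*hopLen + (1:winLen)) .* win;
        A(i, :) = lpc(seg, M);              % Levinson-Durbin via autocorrelation
    end
    % The residual for frame i is obtained by inverse filtering:
    % e = filter(A(i,:), 1, seg);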

4.2 Line Spectral Frequencies

Some preliminary experimentation with creating our model to map source speaker features to target speaker features, however, showed that linear predictive coefficients, as defined above, have their limitations. Perhaps it was that the LPCs were noisy and rendered the system unable to learn a successful mapping function. Luckily, a look into the literature introduced the notion of Line Spectral Frequencies (LSFs), which, in the speech coding domain, are often used to represent LPCs for transmission over a channel.

To arrive at LSFs, we first decompose the linear predictive polynomial A(z) = 1 - \sum_{k=1}^{M} a_k z^{-k} into

    P(z) = A(z) + z^{-(M+1)} A(z^{-1})    (6)
    Q(z) = A(z) - z^{-(M+1)} A(z^{-1})    (7)

All the roots of P and Q lie on the unit circle and correspond directly to pole frequencies. Interestingly enough, for voiced sounds, P(z) corresponds to the vocal tract with the glottis closed, while Q(z) corresponds to an open glottis [8]. And though there are M + 1 roots for each polynomial, because of the palindromic symmetry of P and the antipalindromic symmetry of Q, the roots come in conjugate pairs +/-w, so dim(LPC) = dim(LSF) [9]. For this final reason, LSFs are sometimes also referred to as Line Spectral Pairs.

In our context, LSFs perform better than LPCs for a variety of reasons. LPCs are not very robust to noise - a small quantization error may lead to large spectral distortion - nor do they interpolate well. Because LSFs correspond directly to frequencies, they have more physical meaning and, as such, are better suited for our purposes. The implementation of this calculation is done easily in MATLAB via the command lsf = poly2lsf(lpc).

4.3 Dynamic Time Alignment

In order to map features from one speaker to another, we need to create a monotonic alignment between our sets of features. That is, given a training pair (v^1, v^2) that has been divided into respective feature frame vectors such that

    v^i = (x_1^i, ..., x_{n_i}^i),  x \in R^d,  i = 1, 2    (8)

where d is the number of LSF coefficients extracted per frame, we want to find a monotonic alignment a = {(u^1, u^2)} that minimizes the cost \sum_{(u^1,u^2) \in a} (x_{u^1} - x_{u^2})^2. Because the rate at which a given speaker speaks may change given differing phonetic classes of the utterance, a linear warping of the time axis would not be ideal. Instead, we implemented the Dynamic Time Warp algorithm that, as its name suggests, employs dynamic programming to solve our problem [7]. Now, given the alignment, we merge adjacent frames from one signal that are aligned to one frame in the other signal. This merging is done by recalculating LSF features over a wider window, ultimately leaving us with two sequences of modified feature frames \tilde{v}^i = (\tilde{x}_1^i, ..., \tilde{x}_{n_i}^i). Doing this for all training pairs in T, we now have a set of frame pairs F = {(\tilde{x}^1, \tilde{x}^2)} [2].
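For reference, a bare-bones DTW alignment following the recurrence above might look like the MATLAB sketch below. It is our own minimal illustration, assuming two LSF matrices X (n1 x d) and Y (n2 x d) and a squared-Euclidean local cost; the frame-merging step is omitted.

    % Minimal DTW sketch: aligns feature matrices X (n1 x d) and Y (n2 x d)
    % under a squared-Euclidean local cost; returns the monotonic path.
    function path = dtw_align(X, Y)
        n1 = size(X, 1); n2 = size(Y, 1);
        C = zeros(n1, n2);                  % local cost C(i,j) = ||X(i,:)-Y(j,:)||^2
        for i = 1:n1
            d = bsxfun(@minus, Y, X(i, :));
            C(i, :) = sum(d.^2, 2)';
        end
        D = inf(n1, n2);                    % accumulated cost
        D(1, 1) = C(1, 1);
        for i = 1:n1
            for j = 1:n2
                if i == 1 && j == 1, continue; end
                best = inf;
                if i > 1, best = min(best, D(i-1, j)); end
                if j > 1, best = min(best, D(i, j-1)); end
                if i > 1 && j > 1, best = min(best, D(i-1, j-1)); end
                D(i, j) = C(i, j) + best;
            end
        end
        i = n1; j = n2; path = [i j];       % backtrack from (n1,n2) to (1,1)
        while i > 1 || j > 1
            [~, k] = min([cand(D,i-1,j), cand(D,i,j-1), cand(D,i-1,j-1)]);
            switch k
                case 1, i = i - 1;
                case 2, j = j - 1;
                case 3, i = i - 1; j = j - 1;
            end
            path = [i j; path];
        end
    end

    function v = cand(D, i, j)
        % Helper: accumulated cost, or inf when the index is out of range.
        if i >= 1 && j >= 1, v = D(i, j); else, v = inf; end
    end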

5 The GMM-Linear Mapping of Features

Now that we have extracted the desired features, we can look into the modeling of the data and the mapping of features from source speaker to target speaker. We assume that the features of our source speaker lie in some d-dimensional space, where d is the number of LSFs extracted from a given frame, and that these features are clustered in some way related to the phonemes from which they originate. Previous work has employed vector-quantization techniques for these features, in which a codebook of mappings is generated from the training pairs and testing is done through a VQ lookup [10]. Our approach uses a softened view of the vector-quantization method; our speakers' feature space is represented as a Gaussian Mixture Model (GMM). And from there, we proceed to learn a mapping, or conversion function.

5.1 The Gaussian Mixture Model and EM Algorithm

Given enough mixtures, a Gaussian Mixture Model should be able to approximate any arbitrary distribution. In fact, its usefulness and applicability has already been demonstrated in many applications related to speech, from text-independent speaker recognition to speaker-independent speech recognition [7]. The GMM assumes that the probability distribution of observed parameters takes the following parametric form [1]:

    p(x) = \sum_{i=1}^{k} \pi_i N(x; \mu_i, \Sigma_i)    (9)

where x \in R^d, \pi is a multinomial distribution over the k mixture components, and

    N(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}.    (10)

As such, we can say that, given a source of feature vectors {x_t}, each of our features lies in some mixture of k acoustic classes C_l, l = 1, ..., k, that are to be determined. Then, by an application of Bayes' rule, we can see that

    p(C_l | x) = \frac{\pi_l N(x; \mu_l, \Sigma_l)}{\sum_{j=1}^{k} \pi_j N(x; \mu_j, \Sigma_j)}.    (11)

Now, if we iterate through the entire training set {x_t}, we will find the expected counts c_l^{(n)} for mixture l:

    c_l^{(n)} = \sum_{x \in \{x_t\}} p(C_l | x).    (12)

This gives us the E-step of the Expectation-Maximization (EM) algorithm during iteration n; we can summarize the M-step of the algorithm as follows. Update mixture weights with the relative frequency of expected counts:

    \pi_l^{(n+1)} = \frac{c_l^{(n)}}{\sum_{i=1}^{k} c_i^{(n)}}    (13)

Gaussian means are updated below (covariances omitted):

    \mu_l^{(n+1)} = \frac{1}{c_l^{(n)}} \sum_{x \in \{x_t\}} \tau_l^{(n)}(x) \, x    (14)

where \tau_l^{(n)}(x) = p^{(n)}(C_l | x). The EM algorithm iteratively increases the likelihood of the model parameters by successive maximizations of an intermediate quantity which, in this case, involves the conditional probabilities p(C_l | x) [11]. The initialization of the algorithm was uniformly random over the span of our feature space. We ran the algorithm for a fixed number of iterations and used a minimal variance criterion.
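A compact MATLAB sketch of one such EM pass, following (11)-(14), is given below. This is an illustration under assumed array shapes, not the exact code we used; mvnpdf is the Statistics Toolbox Gaussian density.

    % One EM pass for a GMM, following eqs. (11)-(14).
    % Sketch only: X is n x d, mu is k x d, Sigma is d x d x k, pi_ is k x 1.
    function [pi_, mu, Sigma] = em_step(X, pi_, mu, Sigma)
        [n, d] = size(X); k = numel(pi_);
        % E-step: responsibilities tau(t,l) = p(C_l | x_t), eq. (11).
        tau = zeros(n, k);
        for l = 1:k
            tau(:, l) = pi_(l) * mvnpdf(X, mu(l, :), Sigma(:, :, l));
        end
        tau = bsxfun(@rdivide, tau, sum(tau, 2));
        % M-step: expected counts (12), weights (13), means (14), covariances.
        c = sum(tau, 1)';                    % expected counts, eq. (12)
        pi_ = c / n;                         % eq. (13)
        for l = 1:k
            mu(l, :) = (tau(:, l)' * X) / c(l);               % eq. (14)
            Xc = bsxfun(@minus, X, mu(l, :));
            Sigma(:, :, l) = (Xc' * bsxfun(@times, Xc, tau(:, l))) / c(l);
            Sigma(:, :, l) = Sigma(:, :, l) + 1e-6 * eye(d);  % variance floor
        end
    end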

Figure 2: The frame-independent approach treats each data vector pair (x_i, y_i) as a separate model.

5.2 Learning a mapping function

With our acoustic classes C_l now defined by a Gaussian Mixture Model, we can begin to learn a linear mapping for our source and target feature pairs. What comes to mind is something of the form

    y_i = \sum_{l=1}^{k} p(C_l | x_i) [A_l x_i + b_l] + w_i    (15)

in which we need to estimate the (d x d) linear transform A_l and the (d x 1) bias vector b_l over all observed training pairs (x_i, y_i) in order to minimize the noise w_i. More specifically, we incorporate the parameters of the GMM that were just estimated [1], to give us

    y_i = \sum_{l=1}^{k} p(C_l | x_i) [A_l \Sigma_l^{-1} (x_i - \mu_l) + b_l] + w_i    (16)

From now on, to simplify notation, let us denote x_i^{(l)} = \Sigma_l^{-1} (x_i - \mu_l). Thus, the problem of minimizing the error resembles that of a least squares linear regression. By solving this directly, we can obtain our voice conversion function.

5.3 Increasing Model Complexity

Evaluation of this initial system produced decent results. The results from a subjective listening test using 3 evaluators with limited knowledge of the project suggest that, modulo the quality of synthesized speech, the converted speech resembles, at best, a 50/50 blend between the source and target speakers. However, our initial GMM and linear mapping was based on a very limited model. We had employed the naive frame-independent approach as shown in Figure 2. Such a bag-of-frames methodology resulted in the following linear mapping:

    y_i = \sum_{l=1}^{k} p(C_l | x_i) [A_l x_i^{(l)} + b_l] + w_i    (17)

It made sense, however, to also try incorporating the neighboring data pairs into the model, resulting in another linear mapping function and the graphical model depicted in Figure 3:

    y_i = \sum_{l=1}^{k} p(C_l | x_{i-1}, x_i, x_{i+1}) [A_l x_i^{(l)} + b_l + C_l x_{i+1}^{(l)} + D_l x_{i-1}^{(l)}] + w_i    (18)

This approach worked better, improving slightly upon subjective evaluation. What seemed to be missing, however, was the notion of time dependence in the output. The final enhancement we made, then, was to add a sort of Markovian dependence to the output sequence, such that our estimated target features y_i depend on the previously estimated target features y_{i-1} as well as the present and future source features x_i, x_{i+1}. The corresponding model is depicted in Figure 4, and the resulting linear mapping function is of the form:

    y_i = \sum_{l=1}^{k} p(C_l | y_{i-1}, x_i, x_{i+1}) [A_l x_i^{(l)} + b_l + C_l x_{i+1}^{(l)} + D_l \bar{y}_{i-1}^{(l)}] + w_i    (19)

Figure 3: This approach uses data locally observed to map R^{3d} -> R^d. This is the locally dependent approach.

Figure 4: This approach uses data locally observed as well as previously estimated data to map R^{3d} -> R^d. We call this the Markov dependence approach.

Unless otherwise specified, we used this model for most of the experiments discussed in this paper. Sometimes, due to unforeseen errors or computational complexity issues, it was faster to get things going with the initial frame-independent approach at little cost to our results. In any case, the final implementation of this graphical model involved training a mixture of k = 16 Gaussians with 10 iterations of the Expectation-Maximization algorithm.
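As a concrete illustration of how a converted frame is produced, the MATLAB sketch below evaluates the frame-independent mapping (17); the Markov variant (19) differs only in the conditioning and the extra C_l, D_l terms. The helper gmm_posteriors is an assumed name for a routine implementing (11), not an actual function of ours or of MATLAB.

    % Convert one source frame x (d x 1) using the mapping of eq. (17).
    % A is d x d x k, b is d x k; mu (k x d), Sigma (d x d x k), pi_ (k x 1).
    % gmm_posteriors is assumed to implement eq. (11).
    function y = convert_frame(x, A, b, pi_, mu, Sigma)
        k = numel(pi_); d = numel(x);
        post = gmm_posteriors(x, pi_, mu, Sigma);   % p(C_l | x), eq. (11)
        y = zeros(d, 1);
        for l = 1:k
            xl = Sigma(:, :, l) \ (x - mu(l, :)');  % x^(l) = Sigma_l^{-1}(x - mu_l)
            y = y + post(l) * (A(:, :, l) * xl + b(:, l));
        end
    end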

6 Synthesizing Converted Speech

In our discussion of linear predictive coefficients and the Source-Filter Model, it was seen that if the appropriate residual error values \epsilon are applied to the filter defined by LPCs {a_i}, we can obtain a near-perfect reconstruction of the original signal. This works, of course, when the residual can indeed excite the resonant frequencies of the filter. In voice conversion, given a set of predicted features, discovering the proper residual for high quality, target-speaker-sounding speech is a difficult problem in itself. Here, this process is known as residual prediction [6].

Figure 5: Schematic diagram for copying source residuals to synthesize converted speech.

6.1 Copying Source Residuals

A number of interesting methods have been proposed. The simplest one, in fact, does no actual prediction at all; it is to simply copy source residuals. Technically speaking, the ideal source-filter model assumes that a large majority of speaker-dependent information can be represented by the vocal tract and, hence, the extracted features. We expect the source excitation to be less crucial; hence it makes sense to just use the source residuals and apply the transformed features [6]. As depicted in Figure 5, this was the first method implemented into the initial voice conversion system. Its result can be witnessed on Tracks 1-3 (Source, Target, Converted) of the accompanying demo CD. The quality of the utterance, while a bit crude and unpolished, was actually somewhat better than initially expected. The converted speech was, for the most part, intelligible and had at least a slight resemblance to both speakers. Nevertheless, it is clear that merely changing the parameters of our filter is not enough to change a speaker's identity. At the end of it all, we might say that a third speaker was created in this process [12]. The shortcomings of this method were made more pronounced in a more ambitious voice conversion task: cross-gender. Tracks 1-3 demonstrated a conversion between two female speakers. Tracks 4-6 (Source, Target, Converted) show the system's lack of robustness in converting from a male voice to a female voice. This prompted an investigation of other approaches to residual prediction.

6.2 Residual Codebook and Selection

In an effort to avoid using the source speaker's parameters in the converted utterance, the next logical approach would be to build a codebook that stores the target's feature vectors y_m seen in training alongside the corresponding residuals r_m. Then during synthesis, given a predicted feature vector \tilde{y}, we can select the residual from the codebook whose corresponding feature vector minimizes the squared error between \tilde{y} and all feature vectors {y_m} seen in training [13]. That is, choose r = r_{m^*} where

    m^* = \arg\min_m \| \tilde{y} - y_m \|    (20)

While it sounds promising, we ought to keep in mind that selecting from disjoint and discontinuous residuals often causes problems, such as phase mismatch, which would introduce unwanted disturbances into the synthesized utterance. Unfortunately, due to some shortcomings in computing capability, we were unable to test this approach in time for the completion of this paper. It would, nevertheless, be interesting to see how the results of this method compare with those of the previous one.
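Codebook selection itself, as in (20), is just a nearest-neighbor search; a minimal MATLAB sketch, assuming a codebook matrix Y of size m x d and a matching residual store R, follows.

    % Residual selection per eq. (20): nearest codebook entry in feature space.
    % Y is m x d (target feature vectors from training); R is a cell array of
    % the corresponding residual frames; ytilde is the predicted 1 x d vector.
    d2 = sum(bsxfun(@minus, Y, ytilde).^2, 2);  % squared distances to ytilde
    [~, mstar] = min(d2);                       % index of the best match
    r = R{mstar};                               % selected residual frame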

6.3 The Vocoder Approach

The vocoder approach is based on the standard formulation of linear predictive coding, where the source residual signal is either white noise or a pulse train, resembling unvoiced or voiced excitations, respectively [6]. This approach, as depicted in Figure 6, may seem like a step backwards from the approaches mentioned above; indeed, vocoders have a reputation for creating less natural, more synthetic sounding speech. However, there have been recent advances to the vocoder in the STRAIGHT project, which managed to overcome the monotony of a synthetic sounding voice via techniques involving natural residual waveforms.

Figure 6: Schematic diagram for the synthesis of converted speech using the vocoder approach.

Figure 7: Parameters that are extracted by the STRAIGHT system: (Top-Left) A vector of F0 values, (Bottom-Left) The aperiodic component of the speech, (Right) A smoothed spectrogram.

7 The STRAIGHT Vocoder

The STRAIGHT (Speech Transformation and Representation by Adaptive Interpolation of weighted spectrum) system was originally designed and built by Professor Hideki Kawahara to investigate human speech perception. It was also motivated by the need for flexible speech modification, the simplicity of the channel vocoder methods (source-filter separation), and the lack of natural quality in the resulting speech produced by such methods [14]. The result of this project was a system whose reproduced speech sounds, after modification, are sometimes indistinguishable from the original speech sounds in terms of naturalness [15]. STRAIGHT uses procedures that can be grouped into three subsystems: a source information extractor, a smoothed time-frequency representation extractor, and a synthesis engine consisting of an excitation source and a time-varying filter [16]. The elements that are extracted from the first two subsystems are depicted in Figure 7. And to witness the ability of the synthesis, Tracks 7-8 of the accompanying demo CD consist of an original utterance (Track 7), while Track 8 is the result of using STRAIGHT to extract information and then using that as the input to the synthesis engine. Before we dive any deeper, we should realize that the STRAIGHT system has no desire for information reduction. Because quality and flexibility for manipulations were the main foci of its development, we need to be wary that dimensionality and computability become a concern when trying to scale STRAIGHT for use in our own voice conversion system. The following subsections will provide a summary of the methodology used in the STRAIGHT system to provide high quality speech analysis-modification-synthesis based on the channel vocoder formulation.

7.1 F0 Extraction

For the ability to model, adjust, and reproduce speech, it is important to be able to extract F0 trajectories which do not have any trace of interferences caused by the length of the analysis window and the signal waveform. Pitch extraction algorithms based on the usual definition of periodicity do not behave well for this purpose, because a natural speech signal is neither purely periodic nor stable [14]. To work around this issue, the STRAIGHT system extracts as the fundamental frequency the instantaneous frequency of the fundamental component of the signal. The F0 estimation method of STRAIGHT assumes that the signal has a nearly harmonic structure, as follows,

    x(t) = \sum_{k=1}^{N} a_k(t) \cos\left( \int_0^t (k w_0(\tau) + w_k(\tau)) \, d\tau + \phi_k(0) \right),    (21)

where a_k(t) represents a slowly changing instantaneous amplitude, and w_k(\tau) represents slowly changing perturbations of the k-th harmonic component. As such, F0 is the instantaneous frequency of the fundamental component, where k = 1 [16]. To extract the fundamental component, we apply a series of band-pass filters arranged in log-linear fashion (6-24 per octave) and use them to extract fixed points that map from the filter center frequency to the instantaneous frequencies of the filter output. The filter impulse response w_F(t, \lambda) is composed of a Gabor function w(t, \lambda) convolved with a 2nd-order cardinal B-spline basis function h(t, \lambda) that can be tuned to some frequency \lambda. Hence, we have

    w_F(t, \lambda) = w(t, \lambda) * h(t, \lambda),    (22)
    w(t, \lambda) = e^{-\lambda^2 t^2 / (4\pi\eta^2)} e^{j\lambda t},    (23)
    h(t, \lambda) = \max\left\{ 0, 1 - \left| \frac{\lambda t}{2\pi\eta} \right| \right\},    (24)

which is essentially a continuous wavelet transform [17]. Now, if we can tune \lambda such that \lambda = 2\pi F0, then this filter will effectively suppress interference from neighboring harmonic components. To do so, we seek a set of fixed points {\lambda_s^*} such that a filter with center frequency \lambda_s^* has an output with instantaneous frequency w_c(t, \lambda_s^*) = \lambda_s^* [17]. The F0 that is selected is the one having the distinctly higher signal-to-noise ratio between sinusoidal component and background noise [16]. In the event of no distinct fundamental frequency (e.g., pauses between words, unvoiced sounds), the expected pitch value of 0 is returned. The above outlined the procedure for an initial estimate of F0, which obtains reasonable accuracy. However, this process can be improved by a refinement procedure that uses the initial estimates to perform another iteration, finding fixed points corresponding to harmonic components that can, once again, use a signal-to-noise ratio to provide an updated estimate with minimum estimation error. A look at the MATLAB implementation of STRAIGHT provides some additional insight into this methodology. As part of its default parameter settings for extracting F0 information, the STRAIGHT system searches over a default window length of 40 ms for possible F0s between 40 and 800 Hz. Performance is said to improve if this search range can be decreased, that is, if we have a priori knowledge of the speaker's F0 range [18].
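To make (22)-(24) concrete, the sketch below constructs the composite analysis filter for one candidate frequency \lambda and measures the instantaneous frequency of the filter output from the phase of its complex response. This is our own simplified illustration of the idea, with assumed parameter values; it is not STRAIGHT's actual code.

    % Simplified illustration of eqs. (22)-(24): build the Gabor/B-spline
    % analysis filter for one candidate frequency lam (rad/s) and measure
    % the instantaneous frequency of its output. Not STRAIGHT's actual code.
    fs  = 22050;                 % sampling rate (Hz), assumed
    eta = 1.1;                   % time-stretching factor, assumed
    lam = 2*pi*200;              % candidate frequency: 200 Hz
    t   = (-0.02:1/fs:0.02);     % +/- 20 ms filter support
    w   = exp(-lam^2 .* t.^2 ./ (4*pi*eta^2)) .* exp(1j*lam*t);   % eq. (23)
    h   = max(0, 1 - abs(lam*t/(2*pi*eta)));                      % eq. (24)
    wF  = conv(w, h, 'same');                                     % eq. (22)
    y   = conv(x(:).', wF, 'same');    % x: input speech vector, assumed loaded
    % Instantaneous frequency (Hz) from the phase derivative of the output:
    instf = [0, diff(unwrap(angle(y)))] * fs / (2*pi);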

7.2 Aperiodicity Extraction

Previously, we saw that the F0 estimation method of STRAIGHT assumes a nearly harmonic structure of the signal. However, there will always exist deviations from periodicity, which introduce additional components at inharmonic frequencies. As such, we can find a measure of aperiodicity by taking the energy at inharmonic frequencies normalized by the total energy.

Figure 8: Aperiodic component extraction [19].

During the extraction of the fundamental frequency, STRAIGHT creates a window function by convolving a slightly time-stretched (\eta) Gaussian with a 2nd-order cardinal B-spline function (24) that is tuned to the fixed F0 [16]. This window is designed to have zeros between harmonic components; thus a power spectrum calculated using this window provides the energy sum of periodic and aperiodic components at each harmonic frequency, and the energy of the aperiodic component alone at each in-between harmonic frequency. To summarize the procedure, let |S(w)|^2 represent the power spectrum of an utterance, and then let |S_U(w)|^2 and |S_L(w)|^2 represent the upper and lower spectral envelopes, respectively. The upper envelope is calculated by connecting spectral peaks, while the lower envelope is calculated by connecting spectral valleys, as depicted in Figure 8. The aperiodicity measure at a certain center frequency w, integrated over some window of neighboring frequencies W(w), is then defined as the lower envelope normalized by the upper envelope:

    P_{AP}(w) = \frac{ \int_{W(w)} |S(\lambda)|^2 \left( |S_L(\lambda)|^2 / |S_U(\lambda)|^2 \right) d\lambda }{ \int_{W(w)} |S(\lambda)|^2 \, d\lambda }    (25)

This aperiodicity measure provides additional information regarding the speaker's source that is not summarized in the F0 extraction. While it is mostly true that most of a speaker's prosodic features can be well represented in the previously discussed F0 extraction and the to-be-discussed time-frequency spectral envelope, the aperiodicity measure just summarized still contains information that is perceptually significant.
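A rough MATLAB sketch of the envelope construction underlying (25), assuming a one-sided power spectrum P on a frequency axis f, could look like the following; findpeaks is the Signal Processing Toolbox peak picker.

    % Rough sketch of the upper/lower spectral envelopes used in eq. (25).
    % P is a one-sided power spectrum (vector); f is its frequency axis (Hz).
    [pk, ipk] = findpeaks(P);            % spectral peaks and their indices
    [vl, ivl] = findpeaks(-P);           % spectral valleys (peaks of -P)
    SU = interp1(f(ipk), pk,  f, 'linear', 'extrap');   % upper envelope
    SL = interp1(f(ivl), -vl, f, 'linear', 'extrap');   % lower envelope
    apratio = SL ./ max(SU, eps);        % pointwise S_L/S_U before weighting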

7.3 Spectrogram Extraction

Speech is inherently periodic and, ironically, this does pose problems. It is, in some ways, contradictory and frustrating that periodicity induces such difficulty in speech analysis and manipulation; for human listeners, voiced sounds are perceived to be smoother and richer than unvoiced sounds [14]. And yet, what we now seek in this problem of spectrogram extraction is to obtain a time-frequency representation that does not have any trace of periodicity. When the length of a time window for spectral analysis is comparable to the fundamental period of the signal repetition, the resultant power spectrum shows periodic variation in the time domain. Conversely, when the length of the time window spans several repetitions of this fundamental period, the resultant power spectrum shows periodic variation in the frequency domain [14]. What we seek, then, is a perfectly sized rectangular window that is equal to the fundamental period, such that variations in either domain are not apparent. Unfortunately, this is not possible in the context of natural speech. Fundamental frequencies of these signals change all the time; moreover, the sharp discontinuities of a rectangular window make these representations highly sensitive to minor errors.

The most unique feature of STRAIGHT is its ability to perform extended pitch-synchronous analysis without requiring, as other pitch-synchronous procedures do, that its analysis frame be aligned to the pitch in any specific manner. Work for this has already been done in the process of F0 extraction. A filter similar to (22) is used, resulting from a convolution of an isotropic Gaussian with the 2nd-order cardinal B-spline basis h(t, \lambda) as copied from (24):

    w_p(t, \lambda) = e^{-\lambda^2 t^2 / (4\pi\eta^2)} * h(t, \lambda),    (26)
    h(t, \lambda) = \max\left\{ 0, 1 - \left| \frac{\lambda t}{2\pi\eta} \right| \right\},    (27)

where again we have tuned \lambda = 2\pi F0 and allowed for some time stretching \eta for improved frequency resolution [15]. This places second-order zeros on the other harmonic frequencies, which makes the resultant spectrum less sensitive to any F0 extraction errors [14]. The compensatory window w_c that produces peaks at positions where zeros were located in (22) (i.e., fills in the holes) is achieved by the following modulation:

    w_c(t) = w_p(t) \sin\left( \frac{\lambda t}{2} \right)    (28)

Then finally, the resulting composite spectrum P_r(w, t, \eta) can be represented as a weighted squared sum of the power spectra P_o^2(w, t, \eta), using the original time window, and P_c^2(w, t, \eta), using the compensatory window:

    P_r(w, t, \eta) = P_o^2(w, t, \eta) + \xi(\eta) P_c^2(w, t, \eta)    (29)

where \xi is selected to minimize temporal variation of the resultant spectrogram. This result, in addition to some additional smoothing procedures, allows for pitch-synchronous spectral analysis and the creation of the STRAIGHT spectrum [15].

7.4 Synthesis

The final element left to discuss is that of STRAIGHT synthesis. However, now that we know the source (F0, aperiodic component) and filter (time-frequency spectral envelope) parameters, there is not much to do but to put it all back together and get back the output speech. Figure 9 shows a schematic diagram of how STRAIGHT is used. One of the big pitfalls in speech synthesis following the vocoder framework is the inherent buzzy sound that results from a pulse-like excitation [15]. To combat this effect, STRAIGHT introduced a group delay manipulation to enable user control of the F0 that is finer than the resolution determined by the sampling interval. While the mathematical details are explained in [20], we can summarize by saying that the resynthesis engine in STRAIGHT includes an all-pass filter generated via random numbers, as well as the introduction of asymmetry into the group delay to produce more interesting timbre. The final signal is generated by a pitch-synchronous overlap-add of the resulting source signal convolved with the minimum-phase impulse response calculated from the extracted (and possibly modified) spectral envelopes.
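For illustration, the window pair of (26)-(28) and the composite spectrum of (29) could be formed for a single frame as in the sketch below; the parameter values and the fixed weight \xi are our own assumptions, not STRAIGHT's (which selects \xi to minimize temporal variation).

    % Sketch of the pitch-adaptive window pair of eqs. (26)-(28) and the
    % composite spectrum of eq. (29), for one frame with fundamental f0.
    fs  = 22050; eta = 1.1; f0 = 200; lam = 2*pi*f0;   % assumed values
    t   = (-0.02:1/fs:0.02);
    g   = exp(-lam^2 .* t.^2 ./ (4*pi*eta^2));          % isotropic Gaussian
    h   = max(0, 1 - abs(lam*t/(2*pi*eta)));            % B-spline, eq. (27)
    wp  = conv(g, h, 'same');                           % eq. (26)
    wc  = wp .* sin(lam*t/2);                           % compensatory, eq. (28)
    % seg: one frame of speech, a row vector the same length as t (assumed).
    Po2 = abs(fft(seg .* wp, 1024)).^2;                 % original-window spectrum
    Pc2 = abs(fft(seg .* wc, 1024)).^2;                 % compensatory spectrum
    xi  = 0.25;                                         % weight; ad hoc here
    Pr  = Po2 + xi * Pc2;                               % eq. (29)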

Figure 9: A schematic diagram of the STRAIGHT system. Note, however, that in our implementation we are doing conversion instead of modification [14].

8 Integrating STRAIGHT

The intricacies and details within the STRAIGHT system took some time to understand. One of the original goals of the project this semester was to actually get into the STRAIGHT source code and extract the relevant parts to improve our voice conversion system. Unfortunately, for right now, we will have to settle for the ability to merely integrate relevant parts of STRAIGHT into the present system.

8.1 Implementation

Our initial approach to resynthesis merely copied source residuals which, because they do not properly excite the filter parameters at every given frame of speech, resulted in a very noisy sounding resynthesis. As such, our top priority was to use the STRAIGHT resynthesis engine, which meant we would need to be able to provide all three aforementioned components (F0, aperiodicity, spectrogram) in proper STRAIGHT format to the synthesizer.

Pitch was easy to handle. In fact, we had previously neglected to directly compensate for differences in pitch between our source and target speakers, which was at least a partial reason for our failed attempt to convert a male voice into that of a female (Tracks 4-6 on the demo CD). So we began by figuring out the bias between the source pitch p_s and target pitch p_t. That is, we learned the function p_t(n) = p_s(n) + b, where b = \bar{p}_t - \bar{p}_s is the difference between the means of the source and target pitches seen in training. We also began looking at a way to modify pitch trajectories in order to model the differences in vocal inflection between the speakers. However, the implementation could not be debugged in time to report on its results.

At this time, we decided not to worry so much about generating or learning a conversion function between aperiodic components. As such, instead of copying the source residuals per se, we copied the source aperiodic component and hoped that the fundamental frequency and spectrogram mapping would suffice for a successful conversion.

The generation of a spectrogram was an interesting problem. As mentioned previously, the STRAIGHT approach makes no attempt at information reduction [14]. Unfortunately, to learn a mapping function in some finite amount of time, we could not afford to let dimensionality get out of hand. The STRAIGHT system extracted a 512-point spectral envelope every 1 ms over a 40 ms window, while our conversion system extracted a set of 24 LSFs every 10 ms over a 25 ms window. In a slight effort to compromise, we began extracting and mapping LSF vectors of 24 dimensions every 5 ms over a 40 ms window. To create the spectrogram, we would then use noise to excite the spectral envelope parametrized by the converted LSFs and simply smear it over 5 ms (i.e., MATLAB: repmat(specenv,1,5)).
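Put together, the pitch and spectrogram handling described above amounts to something like the following sketch; the variable names are ours, not STRAIGHT's, and the noise-excitation step is reduced to its envelope computation.

    % Sketch of the pitch bias and spectrogram assembly of Section 8.1.
    % Variable names are ours, not STRAIGHT's.
    b   = mean(f0_target_train) - mean(f0_source_train);  % pitch bias b
    f0c = f0_source + b;                 % converted F0: p_t(n) = p_s(n) + b
    f0c(f0_source == 0) = 0;             % keep unvoiced frames at 0
    % Build the converted spectrogram: one 512-point envelope per 5 ms LSF
    % frame, repeated to STRAIGHT's 1 ms frame rate.
    nFrames = size(lsf_conv, 1);
    S = zeros(512, nFrames * 5);
    for i = 1:nFrames
        a = lsf2poly(lsf_conv(i, :));               % back to the LPC polynomial
        env = abs(freqz(1, a, 512)).^2;             % all-pole spectral envelope
        S(:, (i-1)*5 + (1:5)) = repmat(env, 1, 5);  % smear over 5 ms
    end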

8.2 Results, Analysis, and Further Work

Though much of the theory suggested that everything would work out, the results were less satisfying than expected. Though the quality of speech may have been slightly better than with the original method of copying source residuals, the signal still sounded very noisy and unclear. Our best result was the ability to make the male-to-female conversion at the very least intelligible. The result of said conversion using the STRAIGHT implementation can be heard on Track 9 (Source: Track 4, Target: Track 5), while the result of a female-to-female conversion (Source: Track 1, Target: Track 2) can be found on Track 10. Perhaps there is an underlying bug that has plagued our implementation from the beginning; it is unclear where the problem truly lies. If one listens hard enough a sufficient number of times, he or she could probably say the converted speech resembles parts of both speakers, but at the end of the day, we have mostly just created a third speaker speaking in a noisy, artifact-filled environment.

There are many directions in which we could progress; however, we might begin by looking more introspectively, starting with experiments that would bring us closer to explaining the shortcomings of our solution. The STRAIGHT project is an incredibly useful tool, and it would be great to work with it more and understand it in greater detail. Perhaps we ought to look a little deeper into the mapping of aperiodic content and into our attempt to represent the smoothed, pitch-synchronous spectrogram using just LSFs. It may even be beneficial to the ongoing STRAIGHT effort to work on methods of information reduction for faster real-time implementations of STRAIGHT [15]. Wherever we decide to move next with this project, there will be potential for improvement.

9 Discussion

Looking past the immediate problems, there are many other areas that can be explored to improve our voice conversion results. In addition to finishing up the learning of a pitch trajectory mapping, we could look into durational differences between speakers. Even though a dynamic time-alignment was performed to pair up the feature vectors, we have yet to do anything to warp the time axis during the actual conversion phase. That might, in fact, lead to a more complicated system of feature extraction and graphical modeling in general. On a note related to this project, there has already been some work done to incorporate a GMM model with the STRAIGHT excitation to generate a converted residual signal based on Maximum Likelihood Estimation [19]. Other work has employed a Hidden Markov Model (HMM) to model the dynamic characteristics of a speaker, where the HMM has a state-dependent codebook of feature mappings [21]. The codebook approach has proven to be outdated, but recently there was work done on combining an HMM with the GMM model. Analogous to the graphical model framework in automatic speech recognition, this HMM does not use phonemes as states; rather, it uses utterance- and speaker-specific information trained from the GMM to determine its states and transition probabilities [22]. Following the work in this area is a definite possibility for the future.

10 Conclusion

Our ultimate goal is to be able to do voice conversion in the text-independent setting, the form of unsupervised learning where training utterances from the target and source speakers need not be aligned. Until we get to that point, however, we will be able to keep ourselves busy working on applying audio signal processing and machine learning techniques to the text-dependent problem.
This project turned out to be an enriching opportunity to further investigate the complexities of speech outside of the ASR domain. I had a lot of fun playing with sound files and see potential for continuing work on this problem in the future.

11 Acknowledgements

For the training and testing of our system, we were generously provided with the University of Wisconsin's Microbeam X-ray Speech Production Database by Professor Keith Johnson of the Linguistics Department at UC Berkeley. Our data manages to provide a lot more than just same-text speech for over 60 speakers; the Microbeam X-ray technology used provides data on the movement of various articulators in speech, which is incredibly useful for work in all areas of linguistics and speech processing. We do apologize for not using this corpus to its fullest potential, though we are, nonetheless, grateful to have been granted access to this data.

And for guiding an undergraduate into this project, I would like to thank EECS Graduate Student Percy Liang for his patience in letting me badger him about the machine learning code (for the GMM-Linear Mapping) he provided. I appreciate his assistance in that regard, as I had my hands full trying to figure out how to put everything else in place. Also, thanks to EECS Graduate Student Suman Ravuri for introducing us to the STRAIGHT system; I cannot imagine what it would have been like trying to figure it all out from scratch. Finally, a thank you to Professor Nelson Morgan for supporting this work-in-progress as my EE 225D course project.

References

[1] Yannis Stylianou, Olivier Cappe, and Eric Moulines. Continuous probabilistic transform for voice conversion. IEEE Trans. on Speech and Audio Processing, 6(2):131-142, 1998.

[2] Percy Liang and Stephen Shum. Voice conversion notes - version 1. Updated November 6, 2008.

[3] John R. Westbury. X-ray Microbeam Speech Production Database User's Handbook. University of Wisconsin, 1994.

[4] Keith Johnson. Acoustic and Auditory Phonetics. Blackwell, Malden, Massachusetts, 2003.

[5] S. Park. DSP, Chapter 7. www.engineer.tamuk.edu/spark/chap7.pdf.

[6] David Suendermann. Text-Independent Voice Conversion. PhD thesis, Bundeswehr University Munich, 2007.

[7] Nelson Morgan and Ben Gold. Speech and Audio Signal Processing. John Wiley and Sons, Inc., New York, 2000.

[8] Tony Robinson. Line spectral pairs, 1998. http://svr-www.eng.cam.ac.uk/~ajr/speechanalysis/node51.html.

[9] Jonathan Y. Stein. Digital Signal Processing: A Computer Science Perspective. Wiley-Interscience, New York, 2000.

[10] Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and Hisao Kuwabara. Voice conversion through vector quantization. In Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 1988.

[11] Michael I. Jordan. Introduction to Probabilistic Graphical Models. Berkeley, CA, 2003. Unpublished.

[12] Alexander Kain and Michael W. Macon. Spectral voice conversion for text-to-speech synthesis. In Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 1998.

[13] Hui Ye and Steve Young. High quality voice morphing. In Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 2004.

[14] Hideki Kawahara, Ikuyo Masuda-Katsuse, and Alain de Cheveigne. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction. Speech Communication, 27:187-207, 1999.

[15] Hideki Banno, Hiroaki Hata, Masanori Morise, Toru Takahashi, Toshio Irino, and Hideki Kawahara. Implementation of realtime STRAIGHT speech manipulation system: Report on its first implementation. Acoustical Science and Technology, 28(3), 2007.

[16] Hideki Kawahara, Jo Estill, and Osamu Fujimura. Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT. In Proc. of MAVEBA, Firenze, Italy, 2001.

[17] Hideki Kawahara, Haruhiro Katayose, Alain de Cheveigne, and Roy D. Patterson. Fixed point analysis of frequency to instantaneous frequency mapping for accurate estimation of F0 and periodicity. In Proc. Eurospeech, 1999.

[18] Hideki Kawahara. Tips to make re-synthesis quality better. STRAIGHT Trial Webpage. http://www.wakayama-u.ac.jp/~kawahara/straighttrial/betterresynth.html.

[19] Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro Shikano. Maximum likelihood voice conversion based on GMM with STRAIGHT mixed excitation. In Proc. InterSpeech, 2006.

[20] Hideki Kawahara. Speech representation and transformation using adaptive interpolation of weighted spectrum: Vocoder revisited. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997.

[21] E. Kim, S. Lee, and Y. Oh. Hidden Markov model based voice conversion using dynamic characteristics of speaker. In Proc. European Conference on Speech Communication and Technology, 1997.

[22] Z. Yue, X. Zou, Y. Jia, and H. Wang. Voice conversion using HMM combined with GMM. In Proc. Congress on Image and Signal Processing, 2008.