Formant Synthesis of Haegeum: A Sound Analysis/Synthesis System using Cepstral Envelope


Myeongsu Kang, School of Computer Engineering and Information Technology, Ulsan, South Korea (ilmareboy@ulsan.ac.kr)
Yeonwoo Hong (1), School of Electrical Engineering, Ulsan, South Korea (ducj16@ulsan.ac.kr)

Abstract: This paper presents a formant synthesis method for the haegeum that uses the cepstral envelope for spectral modeling. Spectral modeling synthesis (SMS) is a technique that models time-varying spectra as a combination of sinusoids (the "deterministic" part) and a time-varying filtered noise component (the "stochastic" part). SMS is appropriate for synthesizing the sounds of string and wind instruments, whose harmonics are evenly distributed over the whole frequency band. Formants (or acoustic resonances) are extracted from the cepstral envelope and used to synthesize the sinusoids. A second-order digital resonator obtained by the impulse-invariant transform (IIT) generates the deterministic components, and its output is band-pass filtered to adjust the magnitude. The noise is calculated by first generating the sinusoids with formant synthesis, subtracting them from the original sound, and then removing any harmonics that remain. A line-segment approximation is used to model the noise components. The synthesized sounds are finally produced by adding the sinusoids and the noise components, and they are shown to be similar to the original haegeum sounds.

Keywords: sound synthesis of haegeum, spectral modeling, formant synthesis, cepstral envelope, spectral analysis

I. INTRODUCTION

The synthesis of musical instrument sounds has been studied through methods such as sampling, modulation, filtering, and modeling [1]. Sampling and filtering are the most traditional methods: they use recorded instrumental sounds or produce a desired waveform and spectrum using filters. Sampling can synthesize the most natural sound and describes the tone color of the original instrument well, but it is only appropriate for solo playing, and the sound must be resampled whenever the playing style changes. Modulation can create new sounds, such as electronic timbres, rather than natural ones; sounds synthesized by modulation tend to feel artificial, so it is very hard to produce natural sounds of various instruments this way. Modeling, on the other hand, synthesizes musical sounds using digital filters designed from acoustical characteristics. With modeling we can adjust the tone color by changing filter parameters and can describe the playing style by adding or removing certain filters, at the cost of much higher computational complexity [2]. To reproduce realistic instrumental sounds it is important to select a good model.

The main synthesis techniques based on modeling are physical modeling and spectral modeling. Physical modeling analyzes the sound production mechanism and designs an appropriate model; physical modeling with digital waveguides is widely used for sound synthesis. Spectral modeling synthesizes sounds by analyzing the spectrum of the instrumental sound as a sum of sinusoids and other components that shape the character of the sound. Spectral modeling is appropriate for the synthesis of string and wind instruments, which have a periodic character [3]. The main advantage of spectral modeling is the existence of analysis procedures that extract the synthesis parameters out of real sounds, making it possible to reproduce actual sounds [4].
Additive synthesis is the original spectral modeling technique. In general, techniques that model instrumental sounds as a sum of sinusoids and noise components are referred to as spectral modeling [5]. The modeling of the sinusoids is the more important part because the noise components are not closely related to the pitch. Additive synthesis, subtractive synthesis, and formant synthesis can all be used to model the sinusoids. Formant synthesis is a technique that synthesizes harmonics using a spectral envelope in the frequency domain; it does not need the magnitude and frequency of every harmonic, and inharmonic components can also be synthesized with it [6]. To extract the spectral envelope for formant synthesis, interpolation, linear prediction coding (LPC), and the cepstrum are generally used; in this paper the cepstrum is used. In 1963, Bogert et al. introduced cepstral processing, which had been studied for seismic analysis. At about the same time, Oppenheim proposed a new class of systems, called homomorphic systems, whose fundamental concept is the same as that of the cepstrum [7]. Cepstral coefficients are obtained from the Fourier transform of the logarithm of the input spectrum: an input signal formed by convolution in the time domain becomes a product in the frequency domain, the logarithm turns that product into a sum, and the Fourier transform of the log spectrum can therefore be interpreted as a superposition of the components contained in the input signal. In 1990, Serra introduced spectral modeling synthesis (SMS), which synthesizes musical instrument sounds using the spectral modeling technique [4].

(1) Corresponding author. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology.

Since then, sound synthesis of musical instruments using SMS has been studied [3, 8], and the sinusoids+noise model has been developed into a sinusoids+noise+transient model [9-11]. Sound synthesis of Korean traditional instruments using spectral modeling has not yet been studied. The acoustical characteristics of the haegeum, a Korean traditional bowed string instrument, have been analyzed [12], but the synthesis of haegeum sounds has not. The harmonics of the haegeum are evenly distributed up to a very high frequency band, similar to woodwind instruments, so spectral modeling is suitable for the haegeum. In this paper, we propose a formant synthesis technique using the cepstral envelope for the spectral modeling of the haegeum. Formants are extracted from the cepstral envelope and used as parameters for the synthesis of the sinusoids. A digital resonator obtained by the impulse-invariant transform (IIT) generates the sinusoids, and a band-pass filter is applied to the resonator output to adjust the magnitude of the filtered signal. The noise is calculated by generating the sinusoids with formant synthesis and subtracting them from the original sound. The synthesized single-notes are then produced by adding the sinusoids from formant synthesis and the noise components from a line-segment approximation.

The rest of this paper is organized as follows. Section II introduces the structure of the haegeum and Section III gives background on the cepstrum. Section IV shows the frequency characteristics of the haegeum and explains how the spectral synthesis parameters are extracted. The synthetic results are illustrated in Section V, discussions are given in Section VI, and Section VII concludes the paper.

II. A BOWED INSTRUMENT: HAEGEUM

A. Structure of Haegeum

The haegeum is a Korean traditional bowed string instrument that produces sound by rubbing two strings with a bow. It has characteristics of both wind and string instruments: it can be classified as a wind instrument by its structure and as a string instrument because of its two strings. As shown in Fig. 1, the thick and thin strings are called Junghyun and Yuhyun, respectively. The other parts of the haegeum include the cylindrical body, the Jua, the Bokpan, the Wonsan, the Ipjuk, and the fiddle bow. The Jua tightens the two strings. One side of the cylindrical body is hollow; the other is flat and is known as the Bokpan, through which the sound echoes. The Wonsan elevates the strings above the body, and various sounds are produced depending on its position. The Ipjuk functions as the handle and neck. The fiddle bow is made of horsehair and is always inserted between the two strings, which are attached to the small barrel-shaped resonator. The two strings are generally tuned to A3 and E4, and the resonator is tuned to about 300 Hz. The resonator has a paulownia wood plate at one end, on which the bridge rests.

B. Musical Range of Haegeum

The pitch of the haegeum changes with the position of the left hand on the strings, and the instrument is usually played at the following three positions: Hwangjong (near the Jua), Jungryeo (middle of the strings), and Cheonghwangjong (near the Bokpan). The haegeum has a wide range of two and a half octaves, as shown in Table I. Some single-notes are duplicated at different positions (e.g., note Joong (Ab4) at the Hwangjong and Jungryeo positions).
TABLE I. SINGLE-NOTES AT DIFFERENT POSITIONS (fundamental frequency, Hz)

Hwangjong position     Jungryeo position      Cheonghwangjong position
Joong (Ab3)   210      Hwang (Eb4)   315      Joong (Ab4)   408
Im (Bb3)      228      Tae (F4)      351      Im (Bb4)      457
Nam (C4)      257      Joong (Ab4)   427      Nam (C5)      519
Hwang (Eb4)   317      Im (Bb4)      474      Hwang (Eb5)   608
Tae (F4)      354      Nam (C5)      527      Tae (F5)      687
Joong (Ab4)   433      Mu (Db5)      556      Joong (Ab5)   827
Im (Bb4)      478      Hwang (Eb5)   611      Im (Bb5)      930
Nam (C5)      524      Tae (F5)      668      Nam (C6)      1013

Figure 1. Structure of a Korean traditional bowed string instrument, the haegeum

The term yul corresponds to the term note in Western music. In Korean musical theory, an octave is divided into twelve tones, and the tones are named as follows:

Hwangjong (Eb), Daeryo (E), Taeju (F), Hyeopjeong (Gb), Goseon (G), Jungryeo (Ab), Yubin (A), Imjong (Bb), Ichik (B), Namryeo (C), Muyeok (Db), and Eungjong (D). The bolded syllables are those currently used in notation for the haegeum.

III. BACKGROUND: CEPSTRUM

It is convenient to assume that the input signal is the convolution of an excitation signal with a filter,

x(n) = e(n) * s(n),  (1)

where x(n) is the input signal, e(n) is the excitation signal, and s(n) is a filter that generates the input signal by filtering the given excitation. Taking the logarithm of the Fourier transform of both sides of (1) yields

log X(e^{jω}) = log E(e^{jω}) + log S(e^{jω}).  (2)

The term E(e^{jω}) yields a spectrum characterized by a relatively rapidly varying function of ω (e.g., a noise signal), while the term S(e^{jω}) varies more slowly with ω (e.g., the harmonic components of a musical sound). Consequently, the left-hand side of (2) can be separated into the two right-hand-side components by a filter that separates the log-spectral components that vary rapidly with ω from those that vary slowly with ω. The cepstrum is calculated by taking the inverse discrete Fourier transform of the left-hand side of (2), yielding

c(n) = (1/2π) ∫_{-π}^{π} log X(e^{jω}) e^{jωn} dω,  (3)

where c(n) is called the n-th cepstral coefficient [14]. The spectral envelope, which varies slowly with respect to frequency, yields large-valued cepstral coefficients for low values of n that die out for high n, while the rapidly varying spectral fine structure yields small-valued cepstral coefficients for small n. Consequently, the contributions of the excitation and the filter can be separated in the cepstral domain and recovered by taking the Fourier transform of the separated cepstral coefficients: a sinusoidal component and a stochastic component (e.g., the excitation signal).
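As a concrete illustration of (1)-(3), the following minimal numpy sketch (our own illustration, not code from the paper) computes the real cepstrum of one analysis frame and recovers a spectral envelope by keeping only the low-quefrency coefficients; frame, lifter_len, and n_fft are illustrative names and values.

import numpy as np

def cepstral_envelope(frame, lifter_len, n_fft=4096):
    """Real cepstrum of one analysis frame and its low-quefrency envelope.

    frame      : one frame of the input signal x(n)
    lifter_len : number of cepstral coefficients kept, i.e. the window size
                 used to truncate the cepstrum (about one pitch period here)
    """
    # log-magnitude spectrum in dB; eq. (2) uses the natural log, but the
    # scale factor does not affect the liftering
    spectrum = np.fft.rfft(frame, n_fft)
    log_mag = 20.0 * np.log10(np.abs(spectrum) + 1e-12)

    # real cepstrum, cf. eq. (3): inverse transform of the log spectrum
    cepstrum = np.fft.irfft(log_mag, n_fft)

    # keep only the slowly varying (low-quefrency) part
    lifter = np.zeros(n_fft)
    lifter[:lifter_len] = 1.0
    if lifter_len > 1:
        lifter[-(lifter_len - 1):] = 1.0   # the real cepstrum is symmetric

    # transforming the liftered cepstrum back gives the smoothed log spectrum,
    # i.e. the cepstral envelope used in Section IV
    envelope = np.fft.rfft(cepstrum * lifter, n_fft).real
    return cepstrum, envelope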
IV. EXTRACTION OF SYNTHESIS PARAMETERS FOR HAEGEUM

A. Frequency Characteristics of Haegeum

The haegeum produces a variety of sounds and can generate very sharp sounds when the length of a string is reduced by pressing a finger down on it. Moreover, the player produces sound by pushing or drawing a bow across one or more strings, so the resulting haegeum sounds can be characterized by the bow velocity. Fig. 2 shows a comparison of the waveforms of the haegeum and the gayageum, a Korean traditional plucked-string instrument. The haegeum has a slightly slower attack than the gayageum. Since the haegeum is bowed to generate sound, there is no decay and a very long sustain: constant vibrational energy is delivered to the instrument, and the sound only fades once the bowing stops. The haegeum generates considerably regular harmonics, and they are distributed up to 20 kHz.

Figure 2. Comparison of waveforms: (a) gayageum and (b) haegeum

Furthermore, Fig. 3 shows that the maximum amplitude lies in the range from 1600 Hz to 1800 Hz. This matches the frequency characteristic of the body of the haegeum reported in [12], which indicates that the body gives the haegeum its distinctive tone. As shown in Fig. 3, several large amplitudes also appear outside the 1600-1800 Hz range; these formants correspond to resonant frequencies that shape the character of the haegeum sound.

Figure 3. Formants (or acoustical resonances) of note Hwang (Eb4) at the Hwangjong position

B. Extraction of Spectral Parameters

As mentioned in Section II, there are three positions at which the haegeum is played, and several single-notes are duplicated across these positions. For this reason, only the single-notes generated at the Hwangjong position are considered for spectral parameter extraction in this paper. The spectral parameters are extracted from the sustain region because the haegeum has a slow attack and no decay. In this case, it can be assumed that a steady bow velocity is supplied by the player, so it is not necessary to consider the sinusoidal frequency and phase content of local sections of the signal as it changes over time.

Figure 4. A flow diagram to extract the spectral synthesis parameters

Therefore, we do not track the change of the harmonics and their magnitudes over time through a short-time Fourier transform (STFT) in this paper. Instead, the fast Fourier transform (FFT) is applied to a signal of 500 to 1,000 samples extracted from the sustain region of each note, treated as a single frame of STFT analysis. Fig. 4 illustrates the flow diagram used to extract the spectral parameters of the haegeum sound.

In formant synthesis, the basic assumption is that the transfer function can be satisfactorily modeled by simulating the formant (or resonant) frequencies, formant amplitudes, and bandwidths. The synthesis thus consists of an artificial reconstruction of the formant characteristics to be produced, achieved by exciting a set of resonators to obtain the desired sound spectrum. To extract these synthesis parameters, it is very important to determine an appropriate window size for truncating the cepstral coefficients. Fig. 5 shows the real cepstrum of note Hwang (Eb4); the window size is set to one period in length. The cepstral envelope is finally obtained from the truncated cepstrum by taking its Fourier transform.

Figure 5. Real cepstrum of note Hwang (solid) and the window used to extract the cepstral envelope (dotted)

The cepstral envelope represents the formant structure, and many resonances lie within 20 kHz. Although many resonances lie within 20 kHz, the resonance amplitudes drop sharply above 15 kHz, as shown in Fig. 3. Consequently, we consider only resonances within 15 kHz that satisfy the following two conditions as efficient resonances that represent the frequency characteristics of the haegeum sound well: i) the interval between two adjacent resonances should be greater than the fundamental frequency, and ii) the amplitude of the current resonance should be larger than the amplitudes of the adjacent resonances. The bandwidth of each resonance is measured at the half-power points (a gain of -3 dB relative to the peak). Fig. 6 illustrates the cepstral envelope and the extracted resonances of note Im at the Hwangjong position.

Figure 6. Cepstral envelope and acoustical resonances of note Im at the Hwangjong position

As a result, sinusoids are synthesized with these extracted resonances, and the noise components (or residual signal) are obtained by subtracting the synthesized sinusoids from the original sound. This also reveals how well the extracted resonances match the harmonic components of the original sound. If sinusoids remain in the residual signal, the sound is reanalyzed until the residual is sufficiently free of sinusoidal components; ideally, the resulting residual should be as close as possible to a stochastic signal. To model the residual signal, we first obtain the local maximum values at every 100 samples of the residual, and then a line-segment approximation is applied to its log-magnitude spectrum.
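The selection rules above can be sketched as follows (an illustrative implementation under our own assumptions: envelope is the cepstral envelope in dB over the rfft bins of the analysis frame, f0 the fundamental frequency of the note, and fs the sampling rate; none of these names come from the paper).

import numpy as np

def pick_resonances(envelope, fs, f0, f_max=15000.0):
    """Pick 'efficient' resonances from a log-magnitude (dB) cepstral envelope.

    i)  two selected resonances must be more than f0 apart
    ii) a resonance must be a local maximum of the envelope
    The bandwidth is read off at the half-power (-3 dB) points around each peak.
    Returns a list of (frequency_hz, amplitude_db, bandwidth_hz) tuples.
    """
    n_bins = len(envelope)
    freqs = np.linspace(0.0, fs / 2.0, n_bins)     # rfft bin frequencies

    # condition ii): local maxima of the envelope below f_max
    peaks = [k for k in range(1, n_bins - 1)
             if envelope[k] > envelope[k - 1]
             and envelope[k] > envelope[k + 1]
             and freqs[k] <= f_max]

    resonances = []
    # condition i): visit peaks from strongest to weakest and keep a peak
    # only if it is more than f0 away from every peak already kept
    for k in sorted(peaks, key=lambda k: envelope[k], reverse=True):
        if all(abs(freqs[k] - fr) > f0 for fr, _, _ in resonances):
            half_power = envelope[k] - 3.0
            lo, hi = k, k
            while lo > 0 and envelope[lo] > half_power:
                lo -= 1
            while hi < n_bins - 1 and envelope[hi] > half_power:
                hi += 1
            resonances.append((freqs[k], envelope[k], freqs[hi] - freqs[lo]))
    return sorted(resonances)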

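A matching sketch of the residual and its line-segment approximation, again with illustrative names (x is the original frame and sinusoids is the formant-synthesized signal of the same length); the 100-sample hop follows the rule quoted above, and the exact placement of the anchor points is our assumption.

import numpy as np

def line_segment_noise_model(x, sinusoids, hop=100):
    """Line-segment approximation of the residual's log-magnitude spectrum."""
    # residual (stochastic part): original minus the synthesized sinusoids
    residual = x - sinusoids

    # log-magnitude spectrum of the residual
    log_spec = 20.0 * np.log10(np.abs(np.fft.rfft(residual)) + 1e-12)

    # local maximum in every block of 'hop' bins ...
    anchors = np.arange(0, len(log_spec), hop)
    peaks = np.array([log_spec[a:a + hop].max() for a in anchors])

    # ... connected by straight lines to form the approximated noise envelope
    approx = np.interp(np.arange(len(log_spec)), anchors, peaks)
    return residual, approx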
V. SYNTHESIS OF HAEGEUM SOUNDS

The digital resonator that generates the sinusoids is illustrated in Fig. 7. Two parameters specify the input-output characteristics of the resonator: the resonant (or formant) frequency f_r and the resonance bandwidth ω_BW.

Figure 7. Frequency response of the resonator (f_r = 1000 Hz, ω_BW = 50 Hz)

Samples of the output of the digital resonator, y(n), are computed from the input signal, x(n), by

y(n) = A x(n) + B y(n-1) + C y(n-2).  (4)

The constants A, B, and C are related to the resonant frequency f_r and the bandwidth ω_BW by the impulse-invariant transformation:

C = -e^{-2ω_BW}
B = 2 e^{-ω_BW} cos(ω_r)  (5)
A = 1 - B - C.

The digital resonator is a second-order difference equation, and its transfer function is given by

T(z) = A / (1 - B z^{-1} - C z^{-2}),  (6)

where z = e^{jω} [6].

To produce the sinusoids, a parallel formant synthesizer is used in this paper: it sums the outputs of the simultaneously excited formant resonators. In this configuration, resonators at adjacent resonant frequencies can overlap, so the attenuation around the resonances may not be described well. To overcome this drawback, a digital band-pass filter (BPF) is connected to the output of each formant resonator. Its center frequency is set to the resonant frequency and its bandwidth to the difference between the two adjacent resonant frequencies. If, for example, the first and second resonant frequencies are 250 Hz and 550 Hz, the center frequency, bandwidth, and lower and upper cutoff frequencies of the BPF are 250 Hz, 300 Hz, 100 Hz, and 400 Hz, respectively. Fig. 8 depicts the spectra of the synthetic notes Im (Bb4) and Nam (C5) produced by band-pass-filtered formant synthesis.

Figure 8. Spectra of synthesized single-notes by formant synthesis with a digital band-pass filter: (a) Im (Bb4) and (b) Nam (C5)

The synthetic spectra are not as satisfactory as expected, which results from the lack of stochastic components. As mentioned in the previous section, the line-segment approximation is applied to model the noise components, with parameters specified by the local maxima at every 100 samples of the residual. Fig. 9 shows a comparison between the original noise components and the noise components synthesized by the line-segment approximation.

Figure 9. A comparison of the original noise components (dotted) and the synthesized noise components (solid) of Nam (C5) by the line-segment approximation

Fig. 10 illustrates a block diagram of the synthesis of haegeum single-notes. The complete system consists of the digital resonators, the digital band-pass filters, and the line-segment approximation.
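A minimal implementation of the resonator of (4)-(6) might look as follows. The coefficient formulas are written in the usual impulse-invariant (Klatt-style) form with fr and bw in Hz and fs the sampling rate; the exact normalization of ω_r and ω_BW used in the paper may differ, so treat this as a sketch.

import numpy as np
from scipy.signal import lfilter

def iit_resonator(x, fr, bw, fs):
    """Second-order digital resonator via the impulse-invariant transform.

    Difference equation (4): y(n) = A x(n) + B y(n-1) + C y(n-2)
    Transfer function  (6): T(z) = A / (1 - B z^{-1} - C z^{-2})
    """
    C = -np.exp(-2.0 * np.pi * bw / fs)
    B = 2.0 * np.exp(-np.pi * bw / fs) * np.cos(2.0 * np.pi * fr / fs)
    A = 1.0 - B - C                      # normalizes the gain at 0 Hz to one
    return lfilter([A], [1.0, -B, -C], x)

Driving this filter with a short excitation (e.g., an impulse or a pulse train at the fundamental frequency) produces a decaying sinusoid at fr with a -3 dB bandwidth of roughly bw.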

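Putting the resonators together, a hedged sketch of the parallel, band-pass-filtered formant synthesizer described above: scipy Butterworth filters stand in for whatever band-pass filters were actually used, and the amplitude weighting from the envelope is our own assumption; resonances is a frequency-sorted list of (frequency, amplitude in dB, bandwidth) tuples as produced by the earlier sketch.

import numpy as np
from scipy.signal import butter, filtfilt, lfilter

def synthesize_sinusoids(excitation, resonances, fs):
    """Parallel formant synthesis with a band-pass filter after each resonator."""
    freqs = [fr for fr, _, _ in resonances]
    out = np.zeros(len(excitation))

    for i, (fr, amp_db, bw) in enumerate(resonances):
        # second-order IIT resonator (see the sketch above)
        C = -np.exp(-2.0 * np.pi * bw / fs)
        B = 2.0 * np.exp(-np.pi * bw / fs) * np.cos(2.0 * np.pi * fr / fs)
        A = 1.0 - B - C
        y = lfilter([A], [1.0, -B, -C], excitation)

        # band-pass around fr; the bandwidth equals the spacing to the
        # neighbouring resonant frequency (e.g., 250 Hz and 550 Hz -> 100-400 Hz)
        if len(freqs) > 1:
            spacing = freqs[i + 1] - fr if i + 1 < len(freqs) else fr - freqs[i - 1]
        else:
            spacing = bw
        lo = max(fr - spacing / 2.0, 1.0)
        hi = min(fr + spacing / 2.0, fs / 2.0 - 1.0)
        b, a = butter(2, [lo, hi], btype="bandpass", fs=fs)
        y = filtfilt(b, a, y)

        out += 10.0 ** (amp_db / 20.0) * y    # weight by the envelope amplitude
    return out

The final single-note is then obtained, as in Fig. 10, by adding this sinusoidal part to a noise signal shaped with the line-segment envelope.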
Figure 10. A block diagram of the synthesis of haegeum single-notes from the extracted spectral parameters

Figure 11. A comparison of the spectra of original single-notes (dotted) and synthesized single-notes (solid) at the Hwangjong position: (a) Joong (Ab3), (b) Im (Bb3), (c) Nam (C4), (d) Hwang (Eb4), (e) Tae (F4), (f) Joong (Ab4), (g) Im (Bb4), and (h) Nam (C5)

Fig. 11 shows the spectral peaks of the synthesized and original single-notes at the Hwangjong position, and their spectra are clearly similar in shape. As shown in Fig. 11, the synthesized single-notes with a high fundamental frequency (Tae (F4), Joong (Ab4), Im (Bb4), and Nam (C5)) have spectra very close to the originals. However, for notes with a relatively low fundamental frequency (Joong (Ab3), Im (Bb3), Nam (C4), and Hwang (Eb4)) there are differences at several frequencies: they have slightly larger spectral magnitudes than the corresponding original single-notes below 10 kHz, and their spectral amplitudes are larger above 10 kHz.

These results can occur for the following two reasons:

First, notes with low fundamental frequencies have many resonances, so narrow intervals between resonant frequencies are measured. In this paper a resonance is regarded as efficient only when the interval between two adjacent resonances is greater than the fundamental frequency and the amplitude of the current resonance is larger than the amplitudes of the adjacent resonances. Consequently, compared with notes having high fundamental frequencies, more resonances can be missed at high frequencies, and these missing resonances remain in the residual signal, which lowers the quality of the synthesized sound. This can be remedied by adjusting the control parameter used to select resonances from the original notes.

Second, to generate the noise components, the sinusoids are subtracted from the original sound in the frequency domain, producing a residual on which the stochastic approximation is performed. We assumed that the residual is a stochastic signal, so it is not necessary to keep exact spectral shape information, and this paper uses the line-segment approximation to model the residual. The line-segment approximation finds local maxima at every 100 samples, giving equally spaced points in the spectrum that are connected by straight lines to form the spectral envelope. If, however, there are many spectral peaks in the residual, this approximation assigns high amplitudes to the valleys between them. To overcome this problem, another type of approximation should be considered.

Sound demo samples are available at

VI. DISCUSSIONS

The spectrum expresses the energy of the input signal in the frequency domain, so it is influenced by the number of samples in the input signal. As the number of input samples increases, the cepstrum reflects the average of the total energy rather than the change over time, and the energy near a peak (rather than at the peak itself) becomes larger, because the cepstrum is derived from the spectrum. For this reason, regions of 10-20 ms of each note are used as input samples to extract accurate formants.

In this paper, the sustain region was used to extract synthesis parameters such as the resonant frequencies, magnitudes, and bandwidths. However, the attack region and the change of the sound over time should also be considered to generate more realistic haegeum sounds. We assumed that the input signal is a single frame of STFT analysis; the proposed model should therefore be extended to an STFT-based model that describes the change over time.

In spectral modeling, many synthesis parameters are required to synthesize sounds that are similar to the originals. Formant synthesis using the spectral envelope, however, can describe the harmonic and inharmonic components while using fewer parameters than the additive synthesis employed in conventional spectral modeling. Using the FFT to extract the envelope yields a dense, fine curve, but more data are needed. In the case of LPC, the envelope is so smooth that it is very hard to express the characteristics of the haegeum's harmonics, so a higher LPC order is required to obtain a fine envelope. In cepstral analysis, if the cepstral coefficients are separated with a window of appropriate size, a denser envelope than with the FFT can be obtained. As a result, cepstral analysis is more suitable for synthesizing haegeum sounds.
Noise components might contain some peaks that should be regarded as harmonics. To eliminate them, we added an additional process that searches for such peaks in the residual. The synthetic results also tend to show larger magnitudes at higher frequencies than the originals; this is caused by the added noise components, so the noise model needs to be modified.

VII. CONCLUSIONS

In this paper, we studied a formant synthesis method for haegeum sounds that uses the cepstral envelope for spectral modeling. The number of parameters required in the synthesis process is reduced by using formant synthesis instead of the additive synthesis used for the sinusoids in existing spectral modeling. Formants are extracted from the cepstral envelope based on the characteristics of the haegeum and exploited in the synthesis of the sinusoids. To model the noise components, the synthesized sinusoids are subtracted from the original signal, and to improve the synthetic results we added an additional process that finds sinusoidal components remaining in the residual. A digital resonator obtained by the IIT is used for the synthesis of the sinusoids, and the line-segment approximation is used for the synthesis of the noise. The synthesized sounds are generated by adding the sinusoids and the noise components, and they are very similar to the originals.

REFERENCES

[1] C. Roads, The Computer Music Tutorial, The MIT Press, London.
[2] S. Cho and U. Chong, "Sound Synthesis of Right-Hand Playing Styles using Improved Physical Modeling of Sanjo Gayageum," Acoust. Soc. Kor., vol. 25, no. 8.
[3] X. Serra and J. O. Smith, "Residual Minimization in a Musical Signal Model based on a Deterministic plus Stochastic Decomposition," J. Acoust. Soc. Am., vol. 95, no. 5.
[4] X. Serra and J. O. Smith, "Spectral Modeling Synthesis: A Sound Analysis/Synthesis System based on a Deterministic plus Stochastic Decomposition," Comput. Music J., vol. 14, no. 4.
[5] Spectral Audio Signal Processing, online book.
[6] D. H. Klatt, "Software for a cascade/parallel formant synthesizer," J. Acoust. Soc. Am., vol. 67, no. 3.
[7] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal Processing, 2nd ed., Prentice Hall.
[8] X. Serra and J. Bonada, "Sound Transformations Based on the SMS High Level Attributes," in Proc. Int'l Conf. Digital Audio Effects (DAFX98).
[9] T. S. Verma and T. H. Y. Meng, "Time Scale Modification Using a Sines+Transients+Noise Signal Model," in Proc. Int'l Conf. Digital Audio Effects (DAFX98).
[10] T. S. Verma and T. H. Y. Meng, "An analysis/synthesis tool for transient signals," in Proc. 16th Int'l Cong. Acoustics, vol. 1.
[11] T. S. Verma and T. H. Y. Meng, "Extending Spectral Modeling Synthesis with Transient Modeling Synthesis," Comput. Music J., vol. 24, no. 2, 2000.

[12] J. Noh, S. Park, and K. M. Sung, "Acoustic Characteristics of the Haegeum Body," Acoust. Soc. Kor., vol. 26, no. 7.
[13] H. Song, Korean Musical Instruments, 1st ed., Youl Hwa Dang.
[14] B. Gold and N. Morgan, Speech and Audio Signal Processing, John Wiley & Sons, 1999.

More information

Khlui-Phiang-Aw Sound Synthesis Using A Warped FIR Filter

Khlui-Phiang-Aw Sound Synthesis Using A Warped FIR Filter Khlui-Phiang-Aw Sound Synthesis Using A Warped FIR Filter Korakoch Saengrattanakul Faculty of Engineering, Khon Kaen University Khon Kaen-40002, Thailand. ORCID: 0000-0001-8620-8782 Kittipitch Meesawat*

More information

Vocoder (LPC) Analysis by Variation of Input Parameters and Signals

Vocoder (LPC) Analysis by Variation of Input Parameters and Signals ISCA Journal of Engineering Sciences ISCA J. Engineering Sci. Vocoder (LPC) Analysis by Variation of Input Parameters and Signals Abstract Gupta Rajani, Mehta Alok K. and Tiwari Vebhav Truba College of

More information

Sound pressure level calculation methodology investigation of corona noise in AC substations

Sound pressure level calculation methodology investigation of corona noise in AC substations International Conference on Advanced Electronic Science and Technology (AEST 06) Sound pressure level calculation methodology investigation of corona noise in AC substations,a Xiaowen Wu, Nianguang Zhou,

More information

Comparison of Multirate two-channel Quadrature Mirror Filter Bank with FIR Filters Based Multiband Dynamic Range Control for audio

Comparison of Multirate two-channel Quadrature Mirror Filter Bank with FIR Filters Based Multiband Dynamic Range Control for audio IOSR Journal of Electronics and Communication Engineering (IOSR-JECE) e-issn: 2278-2834,p- ISSN: 2278-8735.Volume 9, Issue 3, Ver. IV (May - Jun. 2014), PP 19-24 Comparison of Multirate two-channel Quadrature

More information

Subtractive Synthesis. Describing a Filter. Filters. CMPT 468: Subtractive Synthesis

Subtractive Synthesis. Describing a Filter. Filters. CMPT 468: Subtractive Synthesis Subtractive Synthesis CMPT 468: Subtractive Synthesis Tamara Smyth, tamaras@cs.sfu.ca School of Computing Science, Simon Fraser University November, 23 Additive synthesis involves building the sound by

More information

Audio Restoration Based on DSP Tools

Audio Restoration Based on DSP Tools Audio Restoration Based on DSP Tools EECS 451 Final Project Report Nan Wu School of Electrical Engineering and Computer Science University of Michigan Ann Arbor, MI, United States wunan@umich.edu Abstract

More information

Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech

Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech INTERSPEECH 5 Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech M. A. Tuğtekin Turan and Engin Erzin Multimedia, Vision and Graphics Laboratory,

More information

EE482: Digital Signal Processing Applications

EE482: Digital Signal Processing Applications Professor Brendan Morris, SEB 3216, brendan.morris@unlv.edu EE482: Digital Signal Processing Applications Spring 2014 TTh 14:30-15:45 CBC C222 Lecture 12 Speech Signal Processing 14/03/25 http://www.ee.unlv.edu/~b1morris/ee482/

More information

Pitch Period of Speech Signals Preface, Determination and Transformation

Pitch Period of Speech Signals Preface, Determination and Transformation Pitch Period of Speech Signals Preface, Determination and Transformation Mohammad Hossein Saeidinezhad 1, Bahareh Karamsichani 2, Ehsan Movahedi 3 1 Islamic Azad university, Najafabad Branch, Saidinezhad@yahoo.com

More information