The Correlogram: a visual display of periodicity

Svante Granqvist* and Britta Hammarberg**

* Dept of Speech, Music and Hearing, KTH, Stockholm; Electronic mail: svante.granqvist@speech.kth.se
** Dept of Logopedics and Phoniatrics, Karolinska Institute, Huddinge University Hospital; Electronic mail: britta.hammarberg@klinvet.ki.se

Abstract

Fundamental frequency (F0) extraction is often used in voice quality analysis. In pathological voices with a high degree of instability in F0, F0 extraction algorithms commonly fail, and the faulty F0 values may spoil the possibilities for further data analysis. This paper presents the correlogram, a new method of displaying periodicity. The correlogram is based on the waveform-matching techniques often used in F0 extraction programs, but it has no mechanism for selecting an actual F0 value. Instead, several candidates for F0 are shown as dark bands. The result is presented as a 3D plot with time on the x-axis, correlation delay (inverted to frequency) on the y-axis and correlation on the z-axis. The z-axis is represented in a gray scale, as in a spectrogram. Delays corresponding to integer multiples of the period time receive high correlation, thus producing candidates at F0, F0/2, F0/3, etc. While the correlogram adds little to F0 analysis of normal voices, it is useful for the analysis of pathological voices, since it illustrates the full complexity of the periodicity in the voice signal. Also, in combination with manual tracing, the correlogram can be used for semi-manual F0 extraction. In that case, F0 extraction can be performed on many voices that cause problems for conventional F0 extractors. To demonstrate the properties of the method, it is applied to synthetic and natural voices, among them six pathological voices characterized by roughness, vocal fry, gratings/scrape, hypofunctional breathiness and voice breaks, or combinations of these.

Introduction

Fundamental frequency (F0) is a commonly used parameter, being the main acoustic correlate of perceived pitch. In the field of voice quality research, F0 extraction is particularly relevant, for example for evaluating voice disorders before and after treatment and for measuring F0 perturbation. Fundamental frequency extraction has received a great deal of attention in speech and voice research. Several different algorithms have been invented (e.g., Hess, 1983; Titze & Liang, 1993; Hess, 1995), and the algorithms have been applied both to acoustic waveforms and to electroglottographic (EGG) signals (Rothenberg, 1973; Fourcin, 1986). Examples of such methods are peak picking and methods based on spectral or cepstral properties of the signal, or on waveform matching by means of autocorrelation or autodifference (e.g., Hess, 1983).

The waveform-matching technique has several important advantages. For example, it does not depend on determining the instant of excitation, and it has a low sensitivity to noise (Titze & Liang, 1993). Also, it can offer several estimates of fundamental frequency per period. The basic idea of the waveform-matching technique is to compare the signal in two time windows separated by a variable time delay. Certain values of this delay will yield a high correlation; these delays correspond to multiples of the period time. The comparison can, for example, be realized in terms of a correlation function, which is a straightforward procedure since few variables are involved.
If the waveform matching technique is to be used for F0 extraction, the F0 extraction algorithm must select which peak in the correlation function corresponds to the fundamental period time. Normal voices rarely cause selection problems. However, for dysphonic voices with an unstable period time, different F0 extraction algorithms will give different results (Rabinov et al., 1995; Karnell et al., 1991). Phonation containing bicyclic segments (equivalent to period doubling, Titze, 1995) is a typical example; most F0 extractors will select a fundamental frequency of F0/2 for the bicyclic segments. This is in some sense correct, since the period actually is doubled. However, different algorithms require different magnitudes of bicyclicity in order to arrive at this result, so different F0 extraction programs yield different results. This is problematic when the extracted F0 data are used for deriving perturbation measures, such as jitter. For example, in a study of vocal fry, Blomgren et al. (1998) reverted to a semi-manual extraction method instead of using the automatic methods, since their voice samples were characterized by a high amount of variability.

The problems outlined above do not originate from the waveform-matching algorithm but rather from the selection mechanism. Because fundamental frequency is defined as the inverse of the period time, strictly speaking no F0 exists if the signal is not perfectly repetitive. Therefore, especially for pathological voices, there will be cases where the task of extracting a single F0 is ill-defined or unrealistic. In such cases, an improved description of the perturbation itself may be more relevant.

In this paper, we present a display showing the raw correlation functions in a three-dimensional graph. We propose the term correlogram for these displays. The result is a picture reflecting the periodicity characteristics of a voice rather than an extracted F0 curve. The correlogram is free from a selection mechanism, leaving it to the user to select the F0 value, or to decide to what extent F0 extraction is at all appropriate. This type of display should be particularly useful for voices where F0 selection is difficult, that is, for all voices with a high amount of F0 perturbation. The method is tested on synthetic signals and natural voices and compared with other methods.

Descriptions of selected perceptual voice terms

In the case of pathological voices, information about periodicity or lack of periodicity is particularly relevant. Many attempts have been made to correlate F0 perturbation characteristics with perceptual features. Such correlates are interesting, since a complete understanding of the relationship between perception and acoustics would allow objective measurement of voice qualities. In the following, some perceptual terms frequently used for pathological voices are reviewed together with the F0 extraction problems typically associated with them.

Creaky voice, vocal fry and pulsed phonation. These terms appear to be associated either with low pitch and a prolonged glottal closed phase, or with a complex pattern of glottal excitations giving rise to subharmonics (Titze, 1995; Laver, 1980; Ladefoged, 1988; Hammarberg & Gauffin, 1995).

Roughness. This term also appears to be associated with period time perturbation. However, the term appears mostly, but not always, to be linked to a more random perturbation than the multi-cyclic type commonly associated with vocal fry. The term is also sometimes associated with low-frequency noise (Hammarberg & Gauffin, 1995; De Krom, 1995; Titze, 1995; Ishiki et al., 1969; Hillenbrand, 1988; Omori et al., 1997; Imaizumi, 1986).

Gratings/scrape is a term mainly used in Sweden (Swedish: skrap).
The term is often translated as high-frequency roughness (Hammarberg & Gauffin, 1995).

Breathiness is caused by soft or incomplete closure of the glottis and is often associated with high-frequency noise. Breathy voice can be produced in both hypo- and hyperfunctional laryngeal settings, which give rise to two different types of breathiness. These modes of phonation correlate strongly with a high or low relative level of the fundamental, respectively (Titze, 1995; Hammarberg & Gauffin, 1995; Hammarberg, 1986).

Voice breaks, vocal breaks or register breaks occur when the vocal folds suddenly switch from one mode of vibration to another, for example between modal and falsetto register (Sundberg, 1987; Hammarberg, 1986; Švec & Pešák, 1994).

Most of the above voice qualities present problems for F0 extraction. The complex patterns of glottal excitation that are often associated with vocal fry or gratings/scrape typically cause octave leaps in the F0 curve, while pulsed phonation in principle can produce a smooth, continuous curve. Roughness, when associated with a random distribution of period times, mostly generates an unstable F0 curve; in addition, different F0 algorithms tend to yield different F0 values. Most programs generally handle breathy voices successfully, since such voices contain little F0 perturbation, but for certain algorithms, particularly event-based ones, a high noise level can cause errors. F0 extractors generally succeed in tracking F0 through voice breaks. All these problems with F0 extraction in pathological voices are unfortunate, since the F0 perturbation appears to represent an important characteristic of such voices (Hillenbrand, 1988; Gauffin et al., 1995). Hence, alternative methods for displaying periodicity variation should be useful in the analysis of pathological voices.

Method

The correlogram is based on the correlation between two time windows of the signal (Figure 1). It displays the correlation in a novel manner, as a graph showing several such correlation functions in a gray scale, similar to the Fourier transforms in a spectrogram. The method has been implemented by the first author (SG) as a program module of the Soundswell Signal Workstation software (Hitech Development AB, Sweden). Different waveform matching functions can be used. In this paper we have chosen the Pearson correlation coefficient:

$$ r_{m,n} = \frac{\sum_{k=m}^{m+w-1} \left( x_k - \bar{x} \right)\left( x_{k+n} - \bar{x}_{+n} \right)}{\sqrt{\sum_{k=m}^{m+w-1} \left( x_k - \bar{x} \right)^2 \; \sum_{k=m}^{m+w-1} \left( x_{k+n} - \bar{x}_{+n} \right)^2}} \qquad [1] $$

where $\bar{x} = \frac{1}{w}\sum_{i=m}^{m+w-1} x_i$ and $\bar{x}_{+n} = \frac{1}{w}\sum_{i=m}^{m+w-1} x_{i+n}$ are the means of the two windows, or, in computational form,

$$ r_{m,n} = \frac{w\sum_{k=m}^{m+w-1} x_k x_{k+n} - \sum_{k=m}^{m+w-1} x_k \sum_{k=m}^{m+w-1} x_{k+n}}{\sqrt{\left[ w\sum_{k=m}^{m+w-1} x_k^2 - \left( \sum_{k=m}^{m+w-1} x_k \right)^2 \right]\left[ w\sum_{k=m}^{m+w-1} x_{k+n}^2 - \left( \sum_{k=m}^{m+w-1} x_{k+n} \right)^2 \right]}} \qquad [2] $$

Here x_j is the j:th sample, m is the starting sample number, n is the delay separating the starting points of the windows, and w is the window width. This function is normalized, so the result is restricted to the range −1 ≤ r_{m,n} ≤ 1. When the delay n corresponds to one or several fundamental periods, a maximum will occur in the correlation coefficient. This is true regardless of where in the fundamental period the starting point of the window is located, so there is no need to determine the point of excitation. Note that the Pearson correlation coefficient is preferable to a simple cross-correlation, since only the Pearson alternative is insensitive to a DC component in the signal. A DC component inflates the cross-correlation, since the voice signal becomes relatively smaller as the DC component increases; such an increase would be irrelevant in periodicity analysis.

For each time value along the x-axis, a correlation coefficient is calculated with a starting sample m corresponding to the time coordinate on the x-axis. This correlation coefficient is calculated for different delays n along the y-axis. The correlation r_{m,n} at each point is displayed on a gray scale, with black corresponding to r_{m,n} = 1 and white to r_{m,n} ≤ 0. If the signal is perfectly periodic with a period time T0, the correlogram will show a set of horizontal black bands, representing different candidates for the fundamental period time, C1, C2, C3, ..., at delays n corresponding to T0, 2T0, 3T0, ... (Figure 1, middle panel). Correlograms can also be presented with an inverted y-axis, thus showing frequency rather than time. In this case the F0 candidates C1, C2, C3 will appear at F0, F0/2, F0/3 and so on (Figure 1, lower panel). Both these representations have certain advantages. In a time correlogram, the candidates Cn appear as equidistant horizontal stripes, and the high-order candidates are more salient.
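To make the computation concrete, the following Python sketch builds a correlogram matrix from the Pearson correlation of equation [1]. It is our own illustration, not the Soundswell implementation; the function name correlogram and its parameters (f_min, f_max, hop) are assumptions, and the loops are written for clarity rather than speed.

```python
import numpy as np

def correlogram(x, fs, w, f_min=50.0, f_max=1000.0, hop=None):
    """Pearson correlation r[m, n] between two windows of length w separated by delay n.

    x            : 1-D signal
    fs           : sampling rate in Hz
    w            : window length in samples (about one fundamental period)
    f_min, f_max : frequency range of interest; delays run from fs/f_max to fs/f_min samples
    hop          : step between successive starting samples m (defaults to w // 2)
    Returns (times, delays_s, R) with R[i, j] = r_{m_i, n_j}.
    """
    x = np.asarray(x, dtype=float)
    hop = hop or max(1, w // 2)
    delays = np.arange(int(fs / f_max), int(fs / f_min) + 1)   # delay n in samples
    starts = np.arange(0, len(x) - w - delays[-1], hop)        # starting sample m
    R = np.zeros((len(starts), len(delays)))
    for i, m in enumerate(starts):
        a = x[m:m + w]
        am = a - a.mean()                                      # zero-mean first window
        for j, n in enumerate(delays):
            b = x[m + n:m + n + w]
            bm = b - b.mean()                                  # zero-mean delayed window
            den = np.sqrt(np.sum(am ** 2) * np.sum(bm ** 2))
            R[i, j] = np.sum(am * bm) / den if den > 0 else 0.0   # eq. [1]
    return starts / fs, delays / fs, R
```

A frequency correlogram is then obtained by plotting R against 1/delay (Hz) on the y-axis, with dark shades for high r, as described above.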

Figure 1. The upper panel shows how the two correlation windows are applied to a perfectly periodic signal (100 Hz sawtooth wave, formant filter at 1000 Hz, bandwidth 100 Hz). The middle and lower panels show the resulting time and frequency correlograms.

Figure 2. The correlogram used for semi-manual F0 extraction. The first candidate was traced manually (upper panel) and the computer program then extracted the F0 value within the traces that corresponded to the highest correlation (lower panel). Note the absence of octave leaps during the bicyclic segment at s.

In a frequency correlogram, the stripe density is greater at lower than at higher frequencies, but on the other hand the display is more clearly related to pitch. The selection of delay representation is a matter of choice, but since periodicity is mostly expressed as fundamental frequency rather than fundamental period time, the frequency representation seems intuitively more appropriate.

The length of the correlation window, w, will affect the appearance of the correlogram. A shorter window gives better resolution in time, but may also show the first formant in terms of side bands surrounding each candidate. Normally, a frame length of about one fundamental period is appropriate. An interesting possibility is to let the window length vary with the correlation delay, such that the window length is equal to the delay. With this procedure the window length is automatically adjusted to an appropriate value as the fundamental frequency varies. The procedure is computationally less efficient, however, especially if correlations are calculated for frequencies close to 0 Hz. A sketch of this variant is given below.
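A minimal sketch of the delay-dependent window variant, under the same assumptions as the earlier correlogram sketch; here the window length is simply set equal to each delay n.

```python
import numpy as np

def correlogram_adaptive(x, fs, f_min=50.0, f_max=1000.0, hop=80):
    """Correlogram variant in which the window length w equals the correlation delay n."""
    x = np.asarray(x, dtype=float)
    delays = np.arange(int(fs / f_max), int(fs / f_min) + 1)
    starts = np.arange(0, len(x) - 2 * delays[-1], hop)   # need room for m + n + w = m + 2n
    R = np.zeros((len(starts), len(delays)))
    for i, m in enumerate(starts):
        for j, n in enumerate(delays):
            a = x[m:m + n]                  # window length w = n
            b = x[m + n:m + 2 * n]
            am, bm = a - a.mean(), b - b.mean()
            den = np.sqrt(np.sum(am ** 2) * np.sum(bm ** 2))
            R[i, j] = np.sum(am * bm) / den if den > 0 else 0.0
    return starts / fs, delays / fs, R
```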

For practical reasons, the correlograms show r raised to the power of γ for values of r > 0. Higher values of γ make the middle levels of gray brighter, and vice versa (see below); γ = 4 has been found to be an appropriate value.

The correlogram allows semi-manual extraction of F0. In this case the user restricts the range of allowed F0 values to a range around a candidate (Figure 2). The software then extracts the frequency corresponding to the highest correlation within this range. The manual control allows the user to select the appropriate candidate, thus eliminating the risk that an automatic algorithm selects a faulty candidate, and placing the responsibility on the user. A sketch of this selection step is given below.
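The following sketch illustrates, under our own assumptions, how the semi-manual step can be realized on top of the correlogram matrix from the earlier sketch: the user supplies lower and upper frequency traces around a candidate, and the program picks the frequency of maximum correlation inside those bounds for each frame. The γ mapping used for display is included for completeness. This is only an illustration, not the Soundswell module.

```python
import numpy as np

def gamma_map(R, gamma=4.0):
    """Gray-scale mapping: r**gamma for r > 0, zero (white) otherwise."""
    return np.clip(R, 0.0, 1.0) ** gamma

def semi_manual_f0(times, delays_s, R, lower_hz, upper_hz):
    """Pick, for each frame, the frequency of maximum correlation within traced bounds.

    delays_s           : correlation delays in seconds (as returned by correlogram)
    lower_hz, upper_hz : arrays with one value per frame, traced manually around a candidate
    Returns an F0 contour in Hz (NaN where no candidate lies inside the trace).
    """
    freqs = 1.0 / delays_s                     # candidate frequency for each delay
    f0 = np.full(len(times), np.nan)
    for i in range(len(times)):
        mask = (freqs >= lower_hz[i]) & (freqs <= upper_hz[i])
        if mask.any():
            j = np.argmax(np.where(mask, R[i], -np.inf))
            f0[i] = freqs[j]
    return f0
```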

Applications

Synthesized signals

The properties of the correlogram analysis method can be efficiently demonstrated by applying it to synthesized sounds, since the properties of such sounds are well defined, while natural voices mostly contain combinations of acoustic properties in unknown magnitudes. All synthesized sound files were generated using a sampling rate of 16 kHz.

Figure 3 illustrates side bands and the effect of the chosen value of γ. The signal was created using a 100 Hz saw-tooth waveform that was fed through a formant filter with a fixed bandwidth of 100 Hz and a resonance frequency, F1, increasing from 0 to 1000 Hz during the 10 s long sound file. In the left panels (γ=1), a segment of a secondary periodicity of 5 ms / 200 Hz appears at 2 s, when F1 equals 200 Hz. As F1 is increased, this secondary periodicity shifts and new periodicities appear. The amplitude of the side bands increases when F1 coincides with a partial. Eventually, the secondary periodicities form continuous side bands around the candidates. If γ is increased to 4 (right panels), the side bands are suppressed such that the candidates can more easily be differentiated from them. The candidates also appear narrower.

Figure 3. Illustration of side bands. The signal was a 100 Hz saw-tooth wave passed through a formant filter in which the formant frequency was increased from 0 to 1000 Hz (bandwidth = 100 Hz). The side bands are prominent with γ=1 (left panels) but become considerably suppressed by γ=4 (right panels).

Figure 4 illustrates the effects of window length, w, on the frequency correlogram. The signal was a saw-tooth wave of 100 Hz, frequency modulated ±10% at 10 Hz and fed through a formant filter at 1000 Hz, bandwidth 100 Hz. For w=5 ms (upper left panel), candidates are present at all times, while the side bands are intermittent; time resolution is good. For w=10 ms (upper right panel), the side bands are attenuated and time resolution is still good, due to the similarity between T0 and the window length. For w=20 ms (lower left panel), the candidates fade at rapid frequency transitions due to poor time resolution, while the side bands are well suppressed. A window length equal to the correlation delay n (e.g., 10 ms at the n corresponding to 100 Hz, 20 ms at the n corresponding to 50 Hz, etc.) yields high time resolution and visible, intermittent side bands at high frequencies, but low time resolution and no side bands at low frequencies (bottom right panel). This is illustrated by the steps in C2, which overlap and extend over a longer time than those pertaining to C1. As seen in the figure, a window length approximating T0 appears appropriate. In cases of completely unknown T0, however, a window length equal to the correlation delay might be preferable; in this case, C1 will always be analyzed with a window of length T0.

Figure 5 illustrates the effect of spectral tilt on side bands. The signal, F0 = 100 Hz, was created by adding sinusoids according to a spectral tilt, which was continuously varied at a constant rate from 0 to −18 dB/octave over an 18 s long sound file. This signal was fed through a formant filter at 1000 Hz, bandwidth 100 Hz.

Figure 4. The effects of window length: 5 ms, 10 ms, 20 ms, and variable (upper left, upper right, lower left, and lower right panels, respectively). The signal was a 100 Hz saw-tooth wave, frequency modulated at 10 Hz, ±10% (formant filter at 1000 Hz, bandwidth 100 Hz).
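Test signals of the kind used above are easy to reproduce. The sketch below, under our own assumptions about implementation details (the helper names formant_filter and fm_sawtooth are ours), generates a frequency-modulated saw-tooth wave and passes it through a single two-pole resonator following the standard digital formant-filter design.

```python
import numpy as np
from scipy.signal import lfilter

FS = 16000  # sampling rate used for the synthetic examples, Hz

def formant_filter(x, f_c, bw, fs=FS):
    """Single two-pole resonator with centre frequency f_c and bandwidth bw (Hz)."""
    r = np.exp(-np.pi * bw / fs)
    a = [1.0, -2.0 * r * np.cos(2.0 * np.pi * f_c / fs), r * r]   # pole pair
    b = [1.0 - r]                                                  # rough gain normalization
    return lfilter(b, a, x)

def fm_sawtooth(f0=100.0, mod_rate=10.0, mod_depth=0.10, dur=2.0, fs=FS):
    """Saw-tooth wave whose frequency is modulated sinusoidally by +/- mod_depth."""
    t = np.arange(int(dur * fs)) / fs
    inst_f = f0 * (1.0 + mod_depth * np.sin(2.0 * np.pi * mod_rate * t))
    phase = np.cumsum(inst_f) / fs          # integrate instantaneous frequency
    return 2.0 * (phase % 1.0) - 1.0        # saw-tooth in [-1, 1)

# 100 Hz FM saw-tooth through a 1000 Hz formant, bandwidth 100 Hz (as in Figure 4)
x = formant_filter(fm_sawtooth(), f_c=1000.0, bw=100.0)
```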

Figure 5. The effect of spectral tilt on side bands. The source signal was created by additive synthesis. The spectral tilt was varied from 0 dB/octave at 0 s to −18 dB/octave at 18 s. The source signal was fed through a formant filter at 1000 Hz, bandwidth 100 Hz.

It can be expected that strong harmonics in the formant region will add extra periodicities that are visible in the correlogram. Correspondingly, if the signal contains only the fundamental, this is the only periodicity that will be visible. These effects can be seen in the figure; a tilt of 0 dB/octave (t=0 s) produces pronounced side bands. The absolute width of the candidates is largely determined by the distance between the candidate and the side band. This distance is a direct function of F1, and thus the candidate width is largely determined by F1; in other terms, the relative width is determined by F1/F0. At about −10 dB/octave (t=10 s), the candidates grow wide and have no visible side bands. At −18 dB/octave (t=18 s) the signal is dominated by the fundamental, the absolute width is mainly determined by F0, and thus the relative width is nearly independent of both F0 and F1. The width's dependence on F1 can also be seen in Figure 3, where the candidates become narrower for higher values of F1. It should be noted that it is not the increasing spectral tilt per se that makes the side bands fade away, but rather the fact that the level of the partial at the formant is reduced. As a rule of thumb, side bands appear if the spectral level at the first formant is near or above the level of the fundamental. However, this also depends slightly on the chosen value of γ.

Figure 6 illustrates the effect of adding noise to a periodic signal. At 0 s, the signal, F0 = 100 Hz, is a saw-tooth waveform, and at 10 s it consists of white noise only. This signal was fed through a 1000 Hz formant filter, bandwidth 100 Hz. The levels of the saw-tooth waveform and the noise were matched, so that the output level of the formant filter was equal at the start and end of the sound file. This means that the harmonics-to-noise ratio (HNR) was infinite at 0 s, 1 (0 dB) at 5 s, and 0 at 10 s. The effect of the noise on the correlogram becomes noticeable at about 2 s, corresponding to an HNR of 4 (12 dB). At about 5 s, or HNR=1 (0 dB), the candidates more or less disappear, the only visible periodicity appearing at 1 ms, created by F1 at 1000 Hz.

Figure 7 compares time and frequency correlograms of synthesized saw-tooth waveforms that contain bicyclic F0 or amplitude variation; this can be seen as a special case of jitter or shimmer. In the jitter case, the period time (mean 10 ms) was varied every other period, starting at 0% and ending at ±10%. The jitter (left panels) can be seen as an F0 fluctuation, the first candidate C1 splitting into two stripes that reach 91 and 111 Hz at the end of the frequency correlogram, and 11 and 9 ms at the end of the time correlogram. In the shimmer case (right panels), the amplitude of the periods was varied every other period. The magnitude of the shimmer varies from 0% at the start to ±100% at the end; that is, at the end every second period has an amplitude of zero, while the intermediate periods have an amplitude twice the original. In this case, the first candidate also shows an oscillating pattern, distinct, however, from that characterizing jitter. The odd-order candidates gradually fade as the shimmer quantity increases. The shimmer magnitude is seen less clearly than the jitter magnitude, since the former is reflected in the gray scale while the latter is represented by the position along the y-axis.

Figure 8 compares time and frequency correlograms of synthesized saw-tooth waveforms that contain random F0 or amplitude variation; these represent other types of jitter and shimmer.
In the jitter case, the period time (mean 10 ms) was randomly distributed within ±10%. The jitter (left panels) can be seen as a random F0 fluctuation, the first candidate C1 fluctuating in the range 91 to 111 Hz in the frequency correlogram and 11 to 9 ms in the time correlogram. In the shimmer case (right panels), the amplitude of each period was varied randomly. The magnitude of the changes was 100%; in other words, the amplitude was randomly distributed between zero and full scale. In this case, the first candidate also shows a fluctuating pattern, again distinct from that characterizing jitter. Although the time correlogram is directly linked to the waveform matching function, we shall henceforth focus on frequency correlograms.

Natural voices

Some examples of correlograms and narrow-band spectrograms of pathological voices are presented in Figures 9 to 15. All these figures concern examples of voices that may cause difficulties for F0 tracking programs. The difficulties are due to ambiguity about whether F0 is represented by C1 or by C2 (Figures 9, 10 and 11), due to a high noise level and an unstable C1 (Figures 12, 13 and 14), or due to a well-excited first formant, which makes the side bands hard to differentiate from C1 (Figure 15).

Figure 6. The effect of adding white noise. At 0 s, the source signal consists of a sawtooth wave only, and at 10 s of white noise only. The source signal was passed through a formant filter at 1000 Hz, bandwidth 100 Hz.
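For reference, a sketch of how a Figure 6 style stimulus can be generated, under our own assumptions: a saw-tooth wave is cross-faded linearly into white noise, with the two components level-matched before the formant filter so that the output level stays roughly constant while the harmonics-to-noise ratio falls.

```python
import numpy as np

def hnr_sweep(f0=100.0, dur=10.0, fs=16000):
    """Saw-tooth fading linearly into level-matched white noise over `dur` seconds."""
    n = int(dur * fs)
    t = np.arange(n) / fs
    saw = 2.0 * ((f0 * t) % 1.0) - 1.0
    noise = np.random.randn(n)
    noise *= np.std(saw) / np.std(noise)   # match RMS levels of the two components
    g = t / dur                            # 0 at start (tone only) -> 1 at end (noise only)
    return (1.0 - g) * saw + g * noise
```

The result can then be passed through a resonator such as the formant_filter helper sketched earlier.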

Figure 7. Examples of bicyclic F0 variation (left panels) and amplitude variation (right panels) in sawtooth waveforms. F0 and amplitude variations increased from 0 to 10% and from 0 to 100%, respectively.

Figure 8. Examples of random F0 variation (left panels) and random amplitude variation (right panels) in sawtooth waveforms. The F0 variation was ±10% and the amplitude was randomly distributed between 0 and 100%.
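A sketch of how such period-by-period perturbations can be synthesized, under our own assumptions: each saw-tooth period is generated individually, with its duration (jitter) or amplitude (shimmer) alternated every other period for the bicyclic case, or drawn at random for the random case.

```python
import numpy as np

def perturbed_sawtooth(n_periods=200, t0=0.010, fs=16000,
                       jitter=0.10, shimmer=0.0, bicyclic=True):
    """Saw-tooth built period by period, with jitter (period) and shimmer (amplitude) applied."""
    out = []
    for p in range(n_periods):
        if bicyclic:
            dj = jitter if p % 2 else -jitter           # alternate every other period
            ds = shimmer if p % 2 else -shimmer
        else:
            dj = np.random.uniform(-jitter, jitter)     # random perturbation
            ds = np.random.uniform(-shimmer, shimmer)
        period = t0 * (1.0 + dj)
        amp = 1.0 + ds
        n = max(2, int(round(period * fs)))
        out.append(amp * (2.0 * np.arange(n) / n - 1.0))  # one saw-tooth cycle
    return np.concatenate(out)

# Bicyclic jitter as in Figure 7 (left panels): period alternates by +/-10 %
x_jit = perturbed_sawtooth(jitter=0.10, shimmer=0.0, bicyclic=True)
# Bicyclic shimmer as in Figure 7 (right panels): amplitude alternates between 0 and 2
x_shim = perturbed_sawtooth(jitter=0.0, shimmer=1.0, bicyclic=True)
```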

For describing the voices, the terminology proposed by Hammarberg and Gauffin (1995) was used; all voice examples except one (Figure 15) were taken from Hammarberg's library of pathological voices. The voice samples had been rated by groups of 6 to 14 voice clinicians (speech-language pathologists and phoniatricians) using the Stockholm Voice Evaluation Approach (SVEA) assessment protocols (Hammarberg, 1986; 2000). These examples have been found useful in teaching as archetypes of various perceived voice properties, as in each example a particular voice quality is dominant over the others. However, since all natural voices are perceptually multi-dimensional (Kreiman et al., 1994; 1996), each example still represents more than one single perceptual feature. This well-known fact makes direct mapping of acoustic to perceptual features difficult.

Figure 9 presents the voice of a man, age 41, who was diagnosed with chronic laryngitis and whose voice quality was characterized as rough. This example shows short (about 100 ms) bursts of bicyclicity starting at 0, 0.2 and 0.45 s. Widening of the candidates can be seen at s, s, and s. The two former cases of widening are probably due to the low amplitude of the overtones in the signal, and the last is probably due to a low first formant.

Figure 10 presents the voice of a man, age 32, who was diagnosed with incomplete voice mutation and whose voice quality was characterized as a mixture of vocal fry and gratings/scrape. This example contains bicyclicity throughout most of its duration.

Figure 11 presents the voice of a man, age 29, who was diagnosed with a benign tumor and perceived as having gratings/scrape only. These examples all show short (100 ms) bursts of bicyclicity, and the only obvious difference among them is the F0 at which they occur. In Figure 10 there is also a longer (about 250 ms) bicyclic sequence. Careful inspection of the spectrograms (lower panels) reveals subharmonics coinciding with the bicyclic segments in the correlograms, but with poorer time resolution. The subharmonics could be visualized more clearly in the spectrogram if a narrower bandwidth had been chosen; this would, however, have further deteriorated the time resolution.

Figure 12 presents the voice of a man, age 50, who was diagnosed with paralytic dysphonia and whose voice quality was characterized by hypofunctional breathiness with roughness. The voice produces wide and unstable candidates. As sometimes also found in rough voices, an instant of bicyclicity occurs, near t=0.7 s. However, the corresponding subharmonics in the spectrogram (lower panel) are not easily spotted, probably due to the short duration of the bicyclicity.

Figure 13 presents the voice of a man, age 40, who was diagnosed with paralytic dysphonia; the voice was characterized by hypofunctional breathiness. In the narrow-band spectrogram, few harmonics apart from the fundamental are visible. The correlogram shows wide candidates and no bicyclicity.

Figure 14 presents the voice of a woman, age 75, who was diagnosed with paralytic dysphonia; the voice shows repeated voice breaks between falsetto and modal register with a high degree of instability (Hammarberg, 1986). C1 suddenly disappears at t=0.5 s and 1.35 s as the voice switches from falsetto to modal.

Finally, Figure 15 presents the voice of an opera singer. Side bands are prominent, indicating a well-excited first formant. As can also be seen in the spectrogram, the singer apparently tuned F1 to either two or three times F0, such that either the second or the third partial coincides with F1.
This strategy increases the sound pressure level, which is an important ability in operatic singing.

Discussion

Since F0 fluctuation plays an important role in many different pathological voice qualities, F0 extraction would be one way to study such voice qualities. Unfortunately, F0 extraction applied to voices with a high degree of F0 perturbation presents problems that are not easily solved. The most typical example is bicyclic voice, where F0 extractors tend to yield F0/2. Since the transition from normal phonation to bicyclicity can be gradual, although without a pitch glide, an F0 extraction algorithm must determine when to switch from displaying F0 to displaying F0/2. Such switching results in an octave leap. In the correlogram, this problem is circumvented by eliminating the selection mechanism and displaying the raw correlation functions in a three-dimensional graph. Hence, the correlogram can describe highly perturbed voices, even when the value of F0 is far from obvious.

The appropriateness of extracting F0 from pathological voices can sometimes be questioned.

Figure 9. Waveform (top), correlogram (middle) and narrow-band spectrogram (bottom) of speech. The voice was characterized by roughness. Bicyclic segments appear around 0.05, 0.25 and 0.5 s. At about 0.8 s, there is a lack of periodicity due to the voiceless consonant /t/.

Figure 10. Waveform (top), correlogram (middle) and narrow-band spectrogram (bottom) of speech. The voice was characterized by vocal fry and gratings/scrape. Bicyclic segments appear around 0.25 s, at s and at s.

Figure 11. Waveform (top), correlogram (middle) and narrow-band spectrogram (bottom) of speech. The voice was characterized by gratings/scrape. Bicyclic segments appear around 0.05 s, 0.35 s and 0.5 s, and also, with less magnitude, around 0.2 s and 0.8 s. Note the abnormally high F0.

Figure 12. Waveform (top), correlogram (middle) and narrow-band spectrogram (bottom) of speech. The voice was characterized by hypofunctional breathiness with roughness. The candidates are all unstable and wide, and a short instance of bicyclicity can be seen at around 0.7 s.

Figure 13. Waveform (top), correlogram (middle) and narrow-band spectrogram (bottom) of speech. The voice was characterized by hypofunctional breathiness. All candidates are wide due to the dominant fundamental. The segment between 1.0 and 1.7 s represents silence.

Figure 14. Waveform (top), correlogram (middle) and narrow-band spectrogram (bottom) of speech. The voice was characterized by repeated register breaks. At 0.4 s and 1.35 s the voice switched from falsetto to modal. The noise between 0.7 and 0.9 s is inhalatory stridor.

Figure 15. Waveform (top), correlogram (middle) and narrow-band spectrogram (bottom) of singing, operatic style. Prominent side bands can be seen at either two or three times F0, due to a well-excited first formant.

Pathological voices often show large period-to-period variation, and since the signal is not exactly repetitive, no strict period time exists; or, an ambiguity exists with regard to F0. For such voices, it appears appropriate not to enforce an F0 selection, but rather to display the correlation functions, as in a correlogram. In some cases a perceptual evaluation of pitch may also be a worthwhile alternative.

The interpretation of a correlogram requires some care, however. While the F0 candidates appear as dark horizontal bands, there can also be side bands, which originate from high-amplitude formant ringing and are thus not to be considered candidates. Side bands appear when an overtone coincides with a formant and can mostly be identified by their relatively less dark appearance (i.e., lower correlation). In some cases, such as when F1 is twice F0, the distinction between side bands and candidates can be less clear. The presence of side bands can also be used as an indication of a high positive level difference between the first formant (L1) and the fundamental (L0). Hence, the presence of side bands may indicate a sonorous or pressed voice with a well-excited first formant. On the other hand, a low or negative L1−L0 difference has been shown to be related to hypofunctional breathiness (Hammarberg, 1986); in the correlogram this would correspond to a wide candidate. It must be kept in mind, however, that the candidate width also depends on the formant frequencies.

A correlogram is a time-domain analysis tool. This means that it does not directly display spectral properties, such as harmonics, which would require a Fourier transform. It should be noted that the candidates have no direct connection to the harmonics of the signal. It is true that C1 corresponds to the first harmonic, H1, but the presence of C2 does not necessarily indicate the presence of a subharmonic. However, the combined occurrence of a constant C2 and a varying C1 would indicate bicyclicity; a constant C3 together with varying C2 and C1 would indicate tricyclicity, and so on. These characteristics indicate the presence of subharmonics, although the subharmonics per se are not visualized in a correlogram.

Compared to the narrow-band spectrogram, the correlogram shows better time resolution, due to the shorter time windows needed. For instance, to display a narrow-band spectrogram with visible subharmonics, the length of the time window must correspond to several fundamental periods, whereas in the correlogram the time windows are typically as short as one fundamental period. The short time windows have the effect that the correlogram can visualize short bursts of bicyclicity that would not be easily seen in a narrow-band spectrogram.

The correlogram has also been used for extraction of F0 from violin playing by means of manual tracing (Gleiser et al., 1998). In these experiments, the violin player was accompanied by piano, which however was suppressed by placing the microphone on the violin bridge. Violin sound typically presents difficulties for F0 extraction; nevertheless, the correlogram method was surprisingly successful and showed a remarkable insensitivity to the piano sound. In this study, the vibrato rate was also extracted in a second step, by performing correlogram analysis and tracing on the extracted oscillating F0 curve.

The computation of a correlogram is generally more computationally intensive than the computation of a spectrogram. However, with the increasing power of computers, computation speed is less of a problem.
For example, every correlogram presented above required less than 3 seconds of computing time on a 1700 MHz Pentium 4 system running Windows.

These initial applications suggest that the correlogram will be useful in future work on refining, revising and standardizing the relations between acoustic voice characteristics and perceived voice quality parameters. Correlograms should also be useful for training analytic listening to voice qualities. Presenting images of the perturbation of voices together with the sounds seems a valuable opportunity that may pave the way to better agreement on the meaning of voice terms across the voice community. The correlogram illustrates the periodicity of the waveform in a robust way, since it lacks the selection mechanism of F0 extractors. It illustrates differences between periodic and random period-to-period variations. The robustness of the correlogram should make it a particularly valuable tool for periodicity analysis in those cases of pathologic speech where standard F0 extraction methods fail or present ambiguous results.

Conclusions

Correlation functions have previously been used to extract F0 information from voice signals, automatically selecting a single value to represent F0, sometimes even in ambiguous cases. The correlogram method presented here shows the raw correlation functions. In cases of periodic or quasi-periodic phonation, such as in some pathological voices, it displays several F0 candidates and leaves it to the user to select one by tracing, if appropriate. In some cases of quasi-periodic phonation, the correlogram illustrates the type of aperiodicity, differentiating signal characteristics such as multi-cyclic or random perturbations, typically associated with vocal fry or roughness. It should be worthwhile to test the correlogram in cases of quasi-periodic signals where traditional F0 tracking methods fail.

Acknowledgements

This work was supported by research grants from the Bank of Sweden Tercentenary Foundation and the Swedish Council for Work Life Research. We would also like to thank Jan Gauffin and Stellan Hertegård for valuable advice and Johan Sundberg for discussions and editorial assistance.

References

Blomgren M, Chen Y, Ng M, Gilbert H (1998). Acoustic, aerodynamic, physiologic and perceptual properties of modal and vocal fry registers. J Acoust Soc Am 103:
De Krom G (1995). Some spectral correlates of pathological breathy and rough voice quality for different types of vowel fragments. J Speech Hear Res 38:
Fourcin A (1986). Electrolaryngographic assessment of vocal fold vibration. J Phonetics 14:
Gauffin J, Granqvist S, Hammarberg B, Hertegård S, Håkansson A (1995). Irregularities in the voice: some perceptual experiments using synthetic voices. Proc ICPhS-95, Vol 2.
Gleiser J, Friberg A, Granqvist S (1998). A method for extracting vibrato parameters applied to violin performance. TMH-QPSR, KTH, 4/1998:
Hammarberg B (1986). Perceptual and acoustic analysis of dysphonia. Doctoral thesis. Dept of Logopedics and Phoniatrics, Karolinska Institute, Stockholm.
Hammarberg B (2000). Voice research and clinical needs. Folia Phoniatr Logop 52:
Hammarberg B, Gauffin J (1995). Perceptual and acoustical characteristics of quality differences in pathological voices as related to physiological aspects. In: Fujimura O, Hirano M (eds). Vocal Fold Physiology, Voice Quality Control. San Diego: Singular Publishing Group.
Hess W (1983). Pitch determination of speech signals. Springer-Verlag.
Hess W (1995). Determination of glottal excitation cycles in running speech. Phonetica 52:
Hillenbrand J (1988). Perception of aperiodicities in synthetically generated vowels. J Acoust Soc Am 83:
Imaizumi S (1986). Acoustic measures of roughness in pathological voice. J Phonetics 14:
Ishiki N, Okamura H, Tanabe M, Morimoto M (1969). Differential diagnosis of hoarseness. Folia Phoniatrica 21:
Karnell M, Scherer R, Fischer L (1991). Comparison of acoustic voice perturbation measures among three independent voice laboratories. J Speech Hear Res 34:
Kreiman J, Gerratt B, Berke G (1994). The multidimensional nature of pathologic voice quality. J Acoust Soc Am 96:
Kreiman J, Gerratt BR (1996). The perceptual structure of pathologic voice quality. J Acoust Soc Am 100:
Laver J (1980). The Phonetic Description of Voice Quality. Cambridge: Cambridge University Press.
Ladefoged P (1988). Discussion of phonetics: a note on some terms for phonation types. In: Fujimura O (ed). Vocal physiology: Voice production, mechanisms and functions. New York: Raven Press.
McAllister A (1997). Acoustic, perceptual and physiological studies of ten-year-old children's voices. Doctoral thesis. Dept of Logopedics and Phoniatrics, Karolinska Institute and Dept of Speech, Music and Hearing, Royal Institute of Technology (KTH), Stockholm.
Omori K, Kojima H, Kakani R, Slavit D, Blaugrund S (1997). Acoustic characteristics of rough voice: Subharmonics. J Voice 11:
Pabon P (1991). Objective acoustic voice-quality parameters in the computer phonetogram. J Voice 5:
Rabinov R, Kreiman J, Gerratt B, Bielamowicz S (1995). Comparing reliability of perceptual ratings of roughness and acoustic measures of jitter. J Speech Hear Res 38:
Rothenberg M (1973). A new inverse filtering technique for deriving the glottal airflow waveform during voicing. J Acoust Soc Am 53:
Sederholm E (1996). Hoarseness in ten-year-old children: Perceptual characteristics, prevalence and etiology. Doctoral thesis. Dept of Logopedics and Phoniatrics, Karolinska Institute and Dept of Speech, Music and Hearing, Royal Institute of Technology (KTH), Stockholm.
Švec J, Pešák J (1994). Vocal breaks from the modal to falsetto register. Folia Phoniatr Logop 46:
Sundberg J (1987). The Science of the Singing Voice. Northern Illinois University Press.
Titze I (1995). Definitions and nomenclature related to voice quality. In: Fujimura O, Hirano M (eds). Vocal Fold Physiology, Voice Quality Control. San Diego: Singular Publishing Group.
Titze I, Liang H (1993). Comparison of F0 extraction methods for high-precision voice perturbation measurements. J Speech Hear Res 36:


More information

USING A WHITE NOISE SOURCE TO CHARACTERIZE A GLOTTAL SOURCE WAVEFORM FOR IMPLEMENTATION IN A SPEECH SYNTHESIS SYSTEM

USING A WHITE NOISE SOURCE TO CHARACTERIZE A GLOTTAL SOURCE WAVEFORM FOR IMPLEMENTATION IN A SPEECH SYNTHESIS SYSTEM USING A WHITE NOISE SOURCE TO CHARACTERIZE A GLOTTAL SOURCE WAVEFORM FOR IMPLEMENTATION IN A SPEECH SYNTHESIS SYSTEM by Brandon R. Graham A report submitted in partial fulfillment of the requirements for

More information

X. SPEECH ANALYSIS. Prof. M. Halle G. W. Hughes H. J. Jacobsen A. I. Engel F. Poza A. VOWEL IDENTIFIER

X. SPEECH ANALYSIS. Prof. M. Halle G. W. Hughes H. J. Jacobsen A. I. Engel F. Poza A. VOWEL IDENTIFIER X. SPEECH ANALYSIS Prof. M. Halle G. W. Hughes H. J. Jacobsen A. I. Engel F. Poza A. VOWEL IDENTIFIER Most vowel identifiers constructed in the past were designed on the principle of "pattern matching";

More information

Acoustic Phonetics. Chapter 8

Acoustic Phonetics. Chapter 8 Acoustic Phonetics Chapter 8 1 1. Sound waves Vocal folds/cords: Frequency: 300 Hz 0 0 0.01 0.02 0.03 2 1.1 Sound waves: The parts of waves We will be considering the parts of a wave with the wave represented

More information

Sound is the human ear s perceived effect of pressure changes in the ambient air. Sound can be modeled as a function of time.

Sound is the human ear s perceived effect of pressure changes in the ambient air. Sound can be modeled as a function of time. 2. Physical sound 2.1 What is sound? Sound is the human ear s perceived effect of pressure changes in the ambient air. Sound can be modeled as a function of time. Figure 2.1: A 0.56-second audio clip of

More information

Source-filter Analysis of Consonants: Nasals and Laterals

Source-filter Analysis of Consonants: Nasals and Laterals L105/205 Phonetics Scarborough Handout 11 Nov. 3, 2005 reading: Johnson Ch. 9 (today); Pickett Ch. 5 (Tues.) Source-filter Analysis of Consonants: Nasals and Laterals 1. Both nasals and laterals have voicing

More information

CMPT 468: Frequency Modulation (FM) Synthesis

CMPT 468: Frequency Modulation (FM) Synthesis CMPT 468: Frequency Modulation (FM) Synthesis Tamara Smyth, tamaras@cs.sfu.ca School of Computing Science, Simon Fraser University October 6, 23 Linear Frequency Modulation (FM) Till now we ve seen signals

More information

Perceptual evaluation of voice source models a)

Perceptual evaluation of voice source models a) Perceptual evaluation of voice source models a) Jody Kreiman, 1,b) Marc Garellek, 2 Gang Chen, 3,c) Abeer Alwan, 3 and Bruce R. Gerratt 1 1 Department of Head and Neck Surgery, University of California

More information

Respiration, Phonation, and Resonation: How dependent are they on each other? (Kay-Pentax Lecture in Upper Airway Science) Ingo R.

Respiration, Phonation, and Resonation: How dependent are they on each other? (Kay-Pentax Lecture in Upper Airway Science) Ingo R. Respiration, Phonation, and Resonation: How dependent are they on each other? (Kay-Pentax Lecture in Upper Airway Science) Ingo R. Titze Director, National Center for Voice and Speech, University of Utah

More information

INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET)

INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET) INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET) Proceedings of the 2 nd International Conference on Current Trends in Engineering and Management ICCTEM -214 ISSN

More information

A Multichannel Electroglottograph

A Multichannel Electroglottograph Publications of Dr. Martin Rothenberg: A Multichannel Electroglottograph Published in the Journal of Voice, Vol. 6., No. 1, pp. 36-43, 1992 Raven Press, Ltd., New York Summary: It is shown that a practical

More information

Tempo and Beat Tracking

Tempo and Beat Tracking Lecture Music Processing Tempo and Beat Tracking Meinard Müller International Audio Laboratories Erlangen meinard.mueller@audiolabs-erlangen.de Book: Fundamentals of Music Processing Meinard Müller Fundamentals

More information

AN ANALYSIS OF ITERATIVE ALGORITHM FOR ESTIMATION OF HARMONICS-TO-NOISE RATIO IN SPEECH

AN ANALYSIS OF ITERATIVE ALGORITHM FOR ESTIMATION OF HARMONICS-TO-NOISE RATIO IN SPEECH AN ANALYSIS OF ITERATIVE ALGORITHM FOR ESTIMATION OF HARMONICS-TO-NOISE RATIO IN SPEECH A. Stráník, R. Čmejla Department of Circuit Theory, Faculty of Electrical Engineering, CTU in Prague Abstract Acoustic

More information

Between physics and perception signal models for high level audio processing. Axel Röbel. Analysis / synthesis team, IRCAM. DAFx 2010 iem Graz

Between physics and perception signal models for high level audio processing. Axel Röbel. Analysis / synthesis team, IRCAM. DAFx 2010 iem Graz Between physics and perception signal models for high level audio processing Axel Röbel Analysis / synthesis team, IRCAM DAFx 2010 iem Graz Overview Introduction High level control of signal transformation

More information

THE BEATING EQUALIZER AND ITS APPLICATION TO THE SYNTHESIS AND MODIFICATION OF PIANO TONES

THE BEATING EQUALIZER AND ITS APPLICATION TO THE SYNTHESIS AND MODIFICATION OF PIANO TONES J. Rauhala, The beating equalizer and its application to the synthesis and modification of piano tones, in Proceedings of the 1th International Conference on Digital Audio Effects, Bordeaux, France, 27,

More information

Speech Processing. Undergraduate course code: LASC10061 Postgraduate course code: LASC11065

Speech Processing. Undergraduate course code: LASC10061 Postgraduate course code: LASC11065 Speech Processing Undergraduate course code: LASC10061 Postgraduate course code: LASC11065 All course materials and handouts are the same for both versions. Differences: credits (20 for UG, 10 for PG);

More information

Voice Activity Detection

Voice Activity Detection Voice Activity Detection Speech Processing Tom Bäckström Aalto University October 2015 Introduction Voice activity detection (VAD) (or speech activity detection, or speech detection) refers to a class

More information

Friedrich-Alexander Universität Erlangen-Nürnberg. Lab Course. Pitch Estimation. International Audio Laboratories Erlangen. Prof. Dr.-Ing.

Friedrich-Alexander Universität Erlangen-Nürnberg. Lab Course. Pitch Estimation. International Audio Laboratories Erlangen. Prof. Dr.-Ing. Friedrich-Alexander-Universität Erlangen-Nürnberg Lab Course Pitch Estimation International Audio Laboratories Erlangen Prof. Dr.-Ing. Bernd Edler Friedrich-Alexander Universität Erlangen-Nürnberg International

More information

GSM Interference Cancellation For Forensic Audio

GSM Interference Cancellation For Forensic Audio Application Report BACK April 2001 GSM Interference Cancellation For Forensic Audio Philip Harrison and Dr Boaz Rafaely (supervisor) Institute of Sound and Vibration Research (ISVR) University of Southampton,

More information

Signal segmentation and waveform characterization. Biosignal processing, S Autumn 2012

Signal segmentation and waveform characterization. Biosignal processing, S Autumn 2012 Signal segmentation and waveform characterization Biosignal processing, 5173S Autumn 01 Short-time analysis of signals Signal statistics may vary in time: nonstationary how to compute signal characterizations?

More information

SGN Audio and Speech Processing

SGN Audio and Speech Processing Introduction 1 Course goals Introduction 2 SGN 14006 Audio and Speech Processing Lectures, Fall 2014 Anssi Klapuri Tampere University of Technology! Learn basics of audio signal processing Basic operations

More information

THE HUMANISATION OF STOCHASTIC PROCESSES FOR THE MODELLING OF F0 DRIFT IN SINGING

THE HUMANISATION OF STOCHASTIC PROCESSES FOR THE MODELLING OF F0 DRIFT IN SINGING THE HUMANISATION OF STOCHASTIC PROCESSES FOR THE MODELLING OF F0 DRIFT IN SINGING Ryan Stables [1], Dr. Jamie Bullock [2], Dr. Cham Athwal [3] [1] Institute of Digital Experience, Birmingham City University,

More information

Signal Processing for Speech Applications - Part 2-1. Signal Processing For Speech Applications - Part 2

Signal Processing for Speech Applications - Part 2-1. Signal Processing For Speech Applications - Part 2 Signal Processing for Speech Applications - Part 2-1 Signal Processing For Speech Applications - Part 2 May 14, 2013 Signal Processing for Speech Applications - Part 2-2 References Huang et al., Chapter

More information

NCCF ACF. cepstrum coef. error signal > samples

NCCF ACF. cepstrum coef. error signal > samples ESTIMATION OF FUNDAMENTAL FREQUENCY IN SPEECH Petr Motl»cek 1 Abstract This paper presents an application of one method for improving fundamental frequency detection from the speech. The method is based

More information

ANALYSIS AND EVALUATION OF IRREGULARITY IN PITCH VIBRATO FOR STRING-INSTRUMENT TONES

ANALYSIS AND EVALUATION OF IRREGULARITY IN PITCH VIBRATO FOR STRING-INSTRUMENT TONES Abstract ANALYSIS AND EVALUATION OF IRREGULARITY IN PITCH VIBRATO FOR STRING-INSTRUMENT TONES William L. Martens Faculty of Architecture, Design and Planning University of Sydney, Sydney NSW 2006, Australia

More information

Quarterly Progress and Status Report. On certain irregularities of voiced-speech waveforms

Quarterly Progress and Status Report. On certain irregularities of voiced-speech waveforms Dept. for Speech, Music and Hearing Quarterly Progress and Status Report On certain irregularities of voiced-speech waveforms Dolansky, L. and Tjernlund, P. journal: STL-QPSR volume: 8 number: 2-3 year:

More information

Envelope Modulation Spectrum (EMS)

Envelope Modulation Spectrum (EMS) Envelope Modulation Spectrum (EMS) The Envelope Modulation Spectrum (EMS) is a representation of the slow amplitude modulations in a signal and the distribution of energy in the amplitude fluctuations

More information

Transforming High-Effort Voices Into Breathy Voices Using Adaptive Pre-Emphasis Linear Prediction

Transforming High-Effort Voices Into Breathy Voices Using Adaptive Pre-Emphasis Linear Prediction Transforming High-Effort Voices Into Breathy Voices Using Adaptive Pre-Emphasis Linear Prediction by Karl Ingram Nordstrom B.Eng., University of Victoria, 1995 M.A.Sc., University of Victoria, 2000 A Dissertation

More information

Source-filter analysis of fricatives

Source-filter analysis of fricatives 24.915/24.963 Linguistic Phonetics Source-filter analysis of fricatives Figure removed due to copyright restrictions. Readings: Johnson chapter 5 (speech perception) 24.963: Fujimura et al (1978) Noise

More information

A METHODOLOGICAL STUDY OF PERTURBATION AND ADDITIVE NOISE IN SYNTHETICALLY GENERATED VOICE SIGNALS

A METHODOLOGICAL STUDY OF PERTURBATION AND ADDITIVE NOISE IN SYNTHETICALLY GENERATED VOICE SIGNALS Journal of Speech and Hearing Research, Volume 30, 448--461, December 1987 A METHODOLOGICAL STUDY OF PERTURBATION AND ADDITIVE NOISE IN SYNTHETICALLY GENERATED VOICE SIGNALS JAMES HILLENBRAND RIT Research

More information

INDIANA UNIVERSITY, DEPT. OF PHYSICS P105, Basic Physics of Sound, Spring 2010

INDIANA UNIVERSITY, DEPT. OF PHYSICS P105, Basic Physics of Sound, Spring 2010 Name: ID#: INDIANA UNIVERSITY, DEPT. OF PHYSICS P105, Basic Physics of Sound, Spring 2010 Midterm Exam #2 Thursday, 25 March 2010, 7:30 9:30 p.m. Closed book. You are allowed a calculator. There is a Formula

More information

Review: Frequency Response Graph. Introduction to Speech and Science. Review: Vowels. Response Graph. Review: Acoustic tube models

Review: Frequency Response Graph. Introduction to Speech and Science. Review: Vowels. Response Graph. Review: Acoustic tube models eview: requency esponse Graph Introduction to Speech and Science Lecture 5 ricatives and Spectrograms requency Domain Description Input Signal System Output Signal Output = Input esponse? eview: requency

More information

Communications Theory and Engineering

Communications Theory and Engineering Communications Theory and Engineering Master's Degree in Electronic Engineering Sapienza University of Rome A.A. 2018-2019 Speech and telephone speech Based on a voice production model Parametric representation

More information

Preeti Rao 2 nd CompMusicWorkshop, Istanbul 2012

Preeti Rao 2 nd CompMusicWorkshop, Istanbul 2012 Preeti Rao 2 nd CompMusicWorkshop, Istanbul 2012 o Music signal characteristics o Perceptual attributes and acoustic properties o Signal representations for pitch detection o STFT o Sinusoidal model o

More information

Processor Setting Fundamentals -or- What Is the Crossover Point?

Processor Setting Fundamentals -or- What Is the Crossover Point? The Law of Physics / The Art of Listening Processor Setting Fundamentals -or- What Is the Crossover Point? Nathan Butler Design Engineer, EAW There are many misconceptions about what a crossover is, and

More information

Signals A Preliminary Discussion EE442 Analog & Digital Communication Systems Lecture 2

Signals A Preliminary Discussion EE442 Analog & Digital Communication Systems Lecture 2 Signals A Preliminary Discussion EE442 Analog & Digital Communication Systems Lecture 2 The Fourier transform of single pulse is the sinc function. EE 442 Signal Preliminaries 1 Communication Systems and

More information

A perceptually and physiologically motivated voice source model

A perceptually and physiologically motivated voice source model INTERSPEECH 23 A perceptually and physiologically motivated voice source model Gang Chen, Marc Garellek 2,3, Jody Kreiman 3, Bruce R. Gerratt 3, Abeer Alwan Department of Electrical Engineering, University

More information

Whole geometry Finite-Difference modeling of the violin

Whole geometry Finite-Difference modeling of the violin Whole geometry Finite-Difference modeling of the violin Institute of Musicology, Neue Rabenstr. 13, 20354 Hamburg, Germany e-mail: R_Bader@t-online.de, A Finite-Difference Modelling of the complete violin

More information

Fundamentals of Music Technology

Fundamentals of Music Technology Fundamentals of Music Technology Juan P. Bello Office: 409, 4th floor, 383 LaFayette Street (ext. 85736) Office Hours: Wednesdays 2-5pm Email: jpbello@nyu.edu URL: http://homepages.nyu.edu/~jb2843/ Course-info:

More information

FFT 1 /n octave analysis wavelet

FFT 1 /n octave analysis wavelet 06/16 For most acoustic examinations, a simple sound level analysis is insufficient, as not only the overall sound pressure level, but also the frequency-dependent distribution of the level has a significant

More information

The Discrete Fourier Transform. Claudia Feregrino-Uribe, Alicia Morales-Reyes Original material: Dr. René Cumplido

The Discrete Fourier Transform. Claudia Feregrino-Uribe, Alicia Morales-Reyes Original material: Dr. René Cumplido The Discrete Fourier Transform Claudia Feregrino-Uribe, Alicia Morales-Reyes Original material: Dr. René Cumplido CCC-INAOE Autumn 2015 The Discrete Fourier Transform Fourier analysis is a family of mathematical

More information

Audio Restoration Based on DSP Tools

Audio Restoration Based on DSP Tools Audio Restoration Based on DSP Tools EECS 451 Final Project Report Nan Wu School of Electrical Engineering and Computer Science University of Michigan Ann Arbor, MI, United States wunan@umich.edu Abstract

More information

The quality of the transmission signal The characteristics of the transmission medium. Some type of transmission medium is required for transmission:

The quality of the transmission signal The characteristics of the transmission medium. Some type of transmission medium is required for transmission: Data Transmission The successful transmission of data depends upon two factors: The quality of the transmission signal The characteristics of the transmission medium Some type of transmission medium is

More information

Signal Characteristics

Signal Characteristics Data Transmission The successful transmission of data depends upon two factors:» The quality of the transmission signal» The characteristics of the transmission medium Some type of transmission medium

More information