HIGH-FIDELITY, ANALYSIS-SYNTHESIS DATA RATE REDUCTION FOR AUDIO SIGNALS

HIGH-FIDELITY, ANALYSIS-SYNTHESIS DATA RATE REDUCTION FOR AUDIO SIGNALS

Master's Thesis submitted to the faculty of the University of Miami in partial fulfillment of the requirements of the degree of Master of Science

by Arvind Venkatasubramanian
Music Engineering Technology, Frost School of Music, University of Miami, P.O. Box, Coral Gables, FL
May 2005

Research Advisor: Mr. Colby N. Leider, Assistant Professor, Music Engineering Technology, Frost School of Music, University of Miami, Coral Gables

Thesis Panel:
Mr. Kenneth C. Pohlmann, Director of Music Engineering, Frost School of Music, University of Miami, Coral Gables
Dr. Edward P. Asmus, Associate Dean, Graduate Music Studies, Frost School of Music, University of Miami, Coral Gables

University of Miami

Master's Thesis submitted to the faculty of the University of Miami in partial fulfillment of the requirements of the degree of Master of Science

HIGH-FIDELITY, ANALYSIS-SYNTHESIS DATA RATE REDUCTION FOR AUDIO SIGNALS

Approved:
Ken C. Pohlmann, Director of Music Engineering
Colby N. Leider, Assistant Professor of Music Engineering
Dr. Edward P. Asmus, Associate Dean of Graduate Studies

VENKATASUBRAMANIAN, ARVIND (M.S., Music Engineering)
High-Fidelity, Analysis-Synthesis Data Rate Reduction for Audio Signals (May 2005)

Abstract of the master's research project at the University of Miami. Research project supervised by Assistant Professor Colby Leider. Number of pages in text: 146.

Powerful music-compression algorithms have made large reductions in audio data rates possible. This thesis describes a basic communication system that synthesizes the portion of the audio information to which the human auditory system is relatively insensitive and adds it back to the audio information to which the ear is sensitive, thus preserving the original signal in the sensitive region. One of the biggest advantages available when writing data-reduction algorithms today is avoiding the coding of audio data that humans do not hear. The Fletcher-Munson curves depict the loudness-perception behavior of the human auditory system: the auditory system is much more sensitive to the mid-frequency region than to the low-frequency and high-frequency regions of the human hearing range. In the proposed coder, the mid-frequency PCM information from the audio data is transmitted as such through the channel. This involves modulation at the transmitter end to reduce the sampling rate through the channel and demodulation at the receiver end to recover the message. A sinusoidal model is used to synthesize the audio data corresponding to the low-frequency and high-frequency regions. The sinusoidal model involves short-time Fourier analysis that extracts meaningful parameters, which are fed into an oscillator at the synthesis end to reconstruct the sound. Therefore, modifications are possible before resynthesis.

This method is called the two-filter method, and it is an alternative to existing audio data-reduction algorithms that rely on psychoacoustic models and perceptual coding. Because no approximation is made to the signal that lies in the region of greatest sensitivity, the application tends to be perceptually transparent. Low frequencies have poorer spectral resolution than high frequencies. Therefore, the two-filter model was improved by downsampling the input, causing a pitch shift. This enables the SMS system [2] to track clear sinusoids, and because the input is downsampled, the data and computation cost are halved. Modifications are applied to recover the original time length. Another model, called the four-filter method, which outperforms the partial sinusoidal analysis/synthesis system (PSMS), was simulated based on the duplex theory of pitch perception by using a variable time-frame length for different spectral bands. The four-filter method was then improved further. In high-frequency regions, the ear follows the amplitude envelope of the frequency spectrum and disregards its phase content. This psychoacoustic evidence was used in our sinusoidal model by discarding the high-frequency phase parameters that would otherwise be fed into the oscillator. The results showed that the four-filter method is better than the two-filter method both in quality and in data-rate reduction. The high frequencies synthesized without phase information sounded the same as the high frequencies synthesized with phase. This showed that the phase of high-frequency signals does not have much perceptual importance.

DEDICATION

To human empiricism that protects humanity, accepting the pluralism, as a mark of respecting the uncertainty that rules my humbleness & humility.

ACKNOWLEDGEMENT

I would like to thank my family, relatives, and friends. I would like to extend a word of thanks to Dr. Srinivasan Narasimhan. I acknowledge the thesis committee members for their time. I would like to thank my academic mentor, Ken Pohlmann, for being patient, giving me enough time, encouraging me at the right times, and for his understanding. I would like to thank and appreciate my thesis supervisor, Colby Leider, for introducing me to intellectual music and for guiding me through this research by contributing his ideas. His knowledge of sound synthesis led me to this final report; without his guidance this work would not have been possible. His guidance helped me stay on the right track until the end. I extend my thankfulness to Dr. Asmus for his help during the summer. My sincere thanks to all my high school friends and teachers, undergraduate friends and teachers, Joe, Music Engineering friends, and the Indian graduate friends at UM. I appreciate those who participated in the listening tests. I am thankful to Girish and to the Department of Psychiatry and Department of Gerontology at UM, who gave me a job. I thank Ali Habashi for his help. I would like to thank Dr. Shariar Negadaripour (EE), Dr. Murat Dogruel (EE), Dr. Micheal Scodrilis (EE), Dr. Don Wilson (Composition), Dr. Modestino (EE), Dr. Mermelstein (EE), and Dr. Moeiz Tapia (EE), with whom I took coursework and class projects at UM. These courses and projects have indirectly helped me in this thesis.

Table of Contents:

Chapter 1: Introduction
Chapter 2: Literature Review
  Pitch Perception
  Psychoacoustics of music
  Ear mechanisms and human auditory system
  Duplex theory of pitch perception
  Missing fundamental effect
  Virtual and Spectral pitch
  Fletcher-Munson Curves
  Basic Communication System
  Analysis-Resynthesis
  Fourier Philosophy: Fast Fourier Transform Analysis
  Classical theory of timbre
  Overview of sound synthesis techniques
  Spectral modeling synthesis
  MQ-Synthesis
  Bandwidth Enhanced Sinusoidal Modeling
  Phase Synthesis
  Perceptual Coding vs. Partial Synthesis Based Data Reduction
Chapter 3: The Research: A Partial synthesis based audio data reduction
  On the use of pitch perception in data reduction
  Fletcher and Munson Curves: Data Sets
  The General Procedure: Two-filter method
  Modulation and Demodulation: Mid-frequency band
  FFT Analysis of low-frequency and high-frequency bands
  Peak detection
  Peak Continuation
  Additive Partial Sinusoidal Synthesis (PSMS)
  Cubic Spline Interpolation
  Fusion of the sensitive and less sensitive data

  Downsampling method
  Duplex theory of pitch perception: Applications
  Improving Partial SMS: Four-filter method
  Discarding HF Phase
  Possibilities in modifications
  Advantages: Experimenting with a sine plus noise model
Chapter 4: Results
  Listening tests and results
  Data reduction and results
Chapter 5: Conclusion
  Future extension of the project
  Pros and Cons: Perceptual coding vs. PSMS
BIBLIOGRAPHY
APPENDIX

List of Figures:

Figure 1.1: Overview of the proposed coder
Figure 2.1: The second dimension of pitch: chroma
Figure 2.2: The missing fundamental effect
Figure 2.3: Fletcher-Munson curves
Figure 2.4: Stages of communication
Figure 2.5: Basic communication system
Figure 2.6: Types of communication
Figure 2.7: Overview of general analysis and synthesis technique
Figure 2.8: (a-c) Nyquist sampling theorem
Figure 2.9: Additive synthesis
Figure 2.10: The amplitude progression of the partials of a trumpet tone
Figure 2.11: SMS: Block diagram of the analysis process [2]
Figure 2.12: SMS: Block diagram of the synthesis process [2]
Figure 2.13: McAulay-Quatieri sinusoidal analysis-synthesis system [5]
Figure 2.14: McAulay-Quatieri sinusoidal model for speech [5]
Figure 2.15: Peak detection in the MQ approach [5]
Figure 2.16: McAulay-Quatieri sinusoidal analysis-synthesis (peak picking) [5]
Figure 2.17: Lemur
Figure 2.18: Lemur graphical tool
Figure 2.19: Bandwidth-enhanced sinusoidal modeling
Figure 2.20: MPEG audio compression and decompression
Figure 3.1: Perceptual coding approach vs. synthesis-based approach
Figure 3.2: Fletcher-Munson original curves (Fig. 3 in [1])
Figure 3.3: Figure 2 mentioned in [1]
Figure 3.4: Figure 3 in [1]
Figure 3.5: MATLAB plot of Figure 3.4
Figure 3.6: The schematic block diagram employed in our synthesis-based data reduction (two-filter method)
Figure 3.7: The two-filter method: band-pass and band-elimination filters; violin spectrum, FS = 44100, mono, 16 bit
Figure 3.8: Modulation and demodulation of sensitive data (country music, 44.1 kHz, 16 bit, mono at 705 kbps)
Figure 3.9: Original and windowed short-time signals, Fourier analysis (Hanning window, 75% overlap)
Figure 3.10: Magnitude and phase spectrum of LF and HF bands
Figure 3.11: Peak detection in LF and HF bands
Figure 3.12: Missed peaks
Figure 3.13: Peaks below threshold
Figure 3.14: Peak continuation process
Figure 3.15: Crack removal: (top) with cracks; (bottom) without cracks
Figure 3.16: Cubic spline interpolation: crack removal
Figure 3.17: Spectral fusion (a more general model)

Figure 3.18: The downsampling method
Figure 3.19: High spectral resolution four-filter method
Figure 3.20: A schematic picture explaining how the analysis frame length is changed over the frequency scale
Figure 3.21: The four-filter method
Figure 3.22: (Top) High resolution; (bottom) low spectral resolution: analysis of spectral resolution of low-frequency sounds (0-700 Hz)
Figure 3.23: Synthesizing the HF band with phase parameters and without phase parameters
Figure 3.24: Plots explaining the synthesis of different bands that have various analysis frame lengths and their final fusion (four-filter method)
Figure 3.25: Possibilities of modifications
Figure 3.26: Cross effect using PSMS modifications
Figure 3.27: Demonstration: advantages of a sine plus noise model in a two-filter method
Figure 4.1: Qualitative results: music genre: two-filter method
Figure 4.2: Qualitative results: tonal instruments: two-filter method
Figure 4.3: Qualitative results: percussion instruments: two-filter method
Figure 4.4: Qualitative results: music genre: four-filter method
Figure 4.5: Qualitative results: tonal instruments: four-filter method
Figure 4.6: Qualitative results: percussion instruments: four-filter method

Appendix figures:
Rock music, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Country music, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Speech, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Chinese pipa, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Gottuvadhyam and tabla, sampled at 44.1 kHz, 16 bit, mono, 705 kbps
Sitar and tabla, sampled at 44.1 kHz, 16 bit, mono, 705 kbps

List of Tables:

Table 2.1: Critical bandwidth as a function of center frequency and critical band rate [8]
Table 3.1: Fletcher-Munson curve: data sets
Table 4.1: Bit rates and audio compression ratio (two-filter and four-filter methods)
Table 5.1: Pros and cons: perceptual coding vs. PSMS

CHAPTER 1
INTRODUCTION

The aim of this research project is to present a communication system that encodes less information while still decoding all the necessary information, at the same time improving the fidelity of existing synthesis-based data-reduction algorithms. The primary motive of data reduction is achieved by following a simple tactic. The human auditory system is very sensitive to the mid-frequency range (1-5 kHz) of the spectrum. Experiments show that critical bands are much narrower at low frequencies than at high frequencies; three-fourths of the critical bands are below 5 kHz; the ear receives more information from low and mid frequencies and less from high frequencies. We transmit the audio PCM data that represents the mid-frequency spectrum above the human threshold of hearing through the communication channel. A simple modulation technique is used: the mid-frequency passband is modulated down to baseband and downsampled in time. At the receiver end the modulated data is upsampled and then demodulated so that the message is recovered. This facilitates data reduction. The sinusoidal model of Xavier Serra [2] is used to synthesize the low-frequency and high-frequency bands of the spectrum. The audio for the low-frequency and high-frequency regions of the human auditory range is synthesized and received at the receiver end of the communication system. A band-elimination filter is used to remove the mid-frequency band to which humans are most sensitive. The outputs of the band-elimination filter, comprising the less-sensitive low and high ends of the

spectrum, become inputs to the sinusoidal model. A band-pass filter is used to extract the sensitive frequencies of the mid spectrum. Since two filters are used, this method is called the two-filter method. The sinusoidal modeling involves a short-time Fourier analysis of overlapping time frames. Each time frame is windowed before analysis. The short-time Fourier transform gives the spectral details of the current frame. A simple peak-detection algorithm detects the salient peaks, i.e., local maxima, in the spectrum. The amplitudes of the peaks and their corresponding frequencies and phases form the inputs of the oscillator at the synthesis end. Before synthesis, a peak-continuation algorithm connects the spectral points (peaks) to form the sinusoids, commonly called tracks. The connected tracks are smoothly interpolated from frame to frame, as described in spectral modeling synthesis and the McAulay-Quatieri algorithm. However, the peak interpolation of frequency-domain parameters is replaced in this thesis by a smooth cubic spline interpolation of abruptly changing voltage levels between frames in the time domain. This helps reduce the computational expense. The stochastic signals in the sensitive mid spectral band are transmitted as PCM data. Therefore, the sinusoidal-plus-noise model was replaced with a purely sinusoidal model in which even noise is modeled into tracks. However, a typical SMS is a sine-plus-noise model: shorter tracks are usually deleted and are stochastically analyzed by a linear prediction method. A convincing nth-order polynomial fit of the stochastic frequency response is possible at the synthesis end if we send the linear prediction coefficients through the channel. White noise could then be used as excitation into the linear prediction filter to get a stochastic approximation of the noise. Stochastic modeling is not

included in this project. It is always better to build a deterministic-plus-stochastic model, because in the real world physical music signals are made up of sinusoids plus musical noise (excitation). Moreover, human listeners do not track exact phase information for noisy transient sounds; such phase carries little perceptually meaningful information in most cases. Even though the stochastic modeling approach is not followed in this thesis, the advantages of including stochastic analysis in the two-filter method are demonstrated briefly.

Figure 1.1: Overview of the proposed coder

A logical note about this system is that low frequencies have poorer frequency resolution than mid and high frequencies, according to the duplex theory of pitch perception. We use the SMS technique for synthesizing the LF region. Hence, false sinusoidal trajectory connections and trajectory breaks are expected not to show up, or at least not to be audible, as artifacts in the low frequencies of the output sound. Though this is a logical conclusion, engineering is all about improving models. Hence, this research work

also introduces a new four-filter method that models the sound better, giving more scope for high fidelity and effective data reduction. The four-filter method involves two filters in the low-frequency region, one in the mid-frequency (sensitive) region, and one in the high-frequency region. Based on the frequency location of the local spectral band, the frame length for analysis is changed in the time domain, which results in an appropriate frequency resolution in the frequency domain. Low frequencies have poor spectral resolution; the auditory system follows the time impulses for low frequencies. Hence, the analysis time-frame length is set to a small value to capture time resolution, and this value increases for each filter as we move to the high-frequency end. This not only results in smoother sinusoidal connections but also reduces the number of parameters that have to be sent through the channel for proper sound reconstruction. The four-filter method was then improved further. In HF regions, the ear follows the amplitude envelope of the frequency spectrum and disregards its phase content. This psychoacoustic evidence was used in our sinusoidal model by discarding the HF phase parameters that would otherwise be fed into the oscillator. The other way to improve the system is to shift the low-frequency region up the frequency scale by downsampling in time by a factor of two. The sinusoids then have better resolution and the trajectory tracking becomes smooth and reliable. At the same time the duration of the music is halved. This reduces the number of

parameters by a factor of two. After the analysis, the frequency parameters are divided by two and sent through the channel. Modifications are applied to time-stretch the signal back to its original length. If the downsampling factor becomes larger, the listener could hear a warbling effect, which is an undesirable artifact in this case. Chapter 3 contains more information on sinusoidal trajectory tracking. In this complex world, perfection is never achieved; this applies to this project too. The success of this algorithm is based on how close the system's audio output is to the original audio. The complexity of real-world sounds is very difficult to represent, even when we break it into the separate dimensions of time, frequency, and amplitude. To represent that complexity, we need both the magnitude response and the phase response of the system. Therefore, the success of this synthesis-based music data reduction rests entirely on how close the synthesized music is to the original music. The synthesis method used here is sinusoidal-plus-noise modeling of music, based on the past research work of Xavier Serra [2]. Sounds produced by musical instruments and other physical systems can be modeled as a sum of deterministic and stochastic parts, or as a sum of sinusoids plus a noise residual. The sinusoids are produced by a harmonic vibrating system; the residual contains the energy produced by the excitation mechanisms and other components that are not the result of periodic vibration [3]. This synthesis method is applicable only for musical purposes. A more general scheme that fits any sound, sometimes even noise, is the McAulay-Quatieri algorithm [5]. Our system

in this research works for any type of sound, ranging from tonal to non-tonal to noisy transient sounds. Our system allocates more bits to the signals to which humans are sensitive. A perceptual coder has to analyze a short-time signal to adaptively allocate more bits to the meaningful music signal, fewer bits to the less meaningful non-musical signal, and no bits to useless information. If low-bit-rate coding could be used to code the sensitive mid-frequency signals, this scheme could avoid the computational cost of adaptive bit allocation, because switching between the sensitive and less sensitive signals becomes a direct, one-step external decision for the engineer. At the same time, high fidelity could be attained because the algorithm focuses on the sensitive frequencies, granting them bits liberally from the bit pool. However, this work is reserved for the future. Chapter 3 describes the research project in more detail.
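As a rough illustration of the two-filter split and the spectral peak analysis described in this chapter, the following Python sketch separates a test signal into a sensitive mid band (which the coder would keep as PCM) and a band-eliminated residual, then picks spectral peaks in one windowed analysis frame of the residual. The 1-5 kHz band edges, filter order, frame length, and peak threshold are illustrative assumptions, not the exact values used in this thesis.

import numpy as np
from scipy import signal

fs = 44100
t = np.arange(fs) / fs
# Test signal: partials below, inside, and above the assumed sensitive band.
x = (np.sin(2 * np.pi * 220 * t)
     + 0.5 * np.sin(2 * np.pi * 2000 * t)
     + 0.3 * np.sin(2 * np.pi * 9000 * t))

# Two-filter split: band-pass keeps the sensitive mid band as PCM,
# band-elimination keeps the low/high residual for the sinusoidal model.
lo, hi = 1000.0, 5000.0
sos_bp = signal.butter(8, [lo, hi], btype='bandpass', fs=fs, output='sos')
sos_bs = signal.butter(8, [lo, hi], btype='bandstop', fs=fs, output='sos')
mid_band = signal.sosfiltfilt(sos_bp, x)      # transmitted as PCM
residual = signal.sosfiltfilt(sos_bs, x)      # input to the sinusoidal model

# One short-time analysis frame of the residual: Hann window + FFT + naive peak picking.
N = 4096
frame = residual[:N] * np.hanning(N)
spectrum = np.fft.rfft(frame)
mag = np.abs(spectrum)
freqs = np.fft.rfftfreq(N, 1 / fs)

peaks = [k for k in range(1, len(mag) - 1)
         if mag[k] > mag[k - 1] and mag[k] > mag[k + 1] and mag[k] > 0.1 * mag.max()]
for k in peaks:
    print(f"peak near {freqs[k]:7.1f} Hz, magnitude {mag[k]:8.1f}, phase {np.angle(spectrum[k]):+.2f} rad")

In the full coder the mid band is additionally modulated to baseband and downsampled before transmission, and the peak amplitudes, frequencies, and phases from every frame feed the peak-continuation stage and the oscillator bank at the receiver.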

CHAPTER 2
LITERATURE REVIEW

PSYCHOACOUSTICS OF MUSIC

Psychoacoustics is the study of human auditory perception, ranging from the biological design of the ear to the brain's interpretation of aural information. Sound is only an academic concept without our perception of it. Psychoacoustics is the branch of study that explains the subjective response to everything we hear; it is only our response to sound that fundamentally matters. Psychoacoustics seeks to reconcile acoustical stimuli, and all the scientific, objective, and physical properties that surround them, with the physiological and psychological responses evoked by them. Psychoacoustics can be defined simply as the psychological study of hearing. The aim of psychoacoustic research is to find out how hearing works. In other words, the aim is to discover how sounds entering the ear are processed by the ear and the brain in order to give the listener useful information about the world outside. The ear and its associated nervous system form an enormously complex, interactive system. The physiology of the human hearing system has evolved incredible powers of perception; at the same time it has its limitations. The ear is astonishingly acute in its ability to detect nuance or defect in a signal. It is also insensitive to portions of the signal that do

not have perceptual importance. The accuracy of a coded signal can be very low, but this accuracy is very frequency-dependent and time-dependent. The ear is a highly developed physical organ (the eye, for example, can only receive frequencies spanning one octave), but the ear is useful only when coupled to the interpretative powers of the brain. Those mental judgments form the basis for everything we experience from sound and music. The left and right ears do not differ physiologically in their capacity for detecting sound, but their respective right and left brain halves do. The two halves loosely divide the brain's functions. [8]

PITCH AND PITCH PERCEPTION

Pitch refers to the tonal height of a sound object, e.g., a musical tone or the human voice. The use of the term pitch is, however, often inconsistent, in that the term is used both for a stimulus parameter (i.e., synonymous with frequency) and for an attribute of auditory sensation. People concerned with the processing of speech mostly use the term in the former sense, meaning the fundamental frequency (oscillation frequency) of the glottal oscillation (vibration of the vocal folds). In psychoacoustics (and so in the present discussion) the term is used throughout in the latter sense, i.e., meaning an auditory (subjective) attribute. The ANSI definition of psychoacoustical terminology says that pitch is that auditory attribute of sound according to which sounds can be ordered on a scale from low to high. To date this definition is still a useful basis, though it must be complemented by taking account of certain additional aspects. [12]

EAR MECHANISMS IN PITCH PERCEPTION

The ear performs the transformation from acoustical energy to mechanical energy and ultimately to the electrical impulses sent to the brain, where the information contained in sound is perceived. The outer ear collects sound, and its intricate folds help us to assess directionality. The ear canal resonates at about 3 kHz, providing extra sensitivity in the frequency range critical for speech intelligibility. The eardrum transduces acoustical energy into mechanical energy; it reaches maximum excursion at about 120 dB SPL, above which it begins to distort the waveform. Three bones in the middle ear, colloquially known as the hammer, anvil, and stirrup, provide impedance matching to efficiently convey sounds in air to the fluid-filled inner ear. The coiled basilar membrane detects the amplitude and frequency of sound; those vibrations are converted to electrical impulses and sent to the brain as neural information along a bundle of nerve fibers. The brain decodes the period of the stimulus and the point of maximum stimulation along the basilar membrane to determine frequency activity in local regions surrounding the stimulus. Examination of the basilar membrane shows that the ear contains roughly 30,000 hair cells arranged in multiple rows along the basilar membrane, which is roughly 32 mm long; this is the organ of Corti. The cells detect local vibrations of the basilar membrane and convey audio information to the brain via electrical impulses. Frequency discrimination is such that at low frequencies, tones a few hertz apart can be distinguished; however, at high

frequencies, tones must differ by hundreds of hertz. In any case, hair cells respond to the strongest stimulation in their local region; this is called a critical band, a concept introduced by Harvey Fletcher. Experiments show that critical bands are much narrower at low frequencies than at high frequencies; three-fourths of the critical bands are below 5 kHz; the ear receives more information from low frequencies and less from high frequencies. Critical bands are approximately 100 Hz wide for frequencies from 20 to 400 Hz and approximately 1/5 octave in width for frequencies from 1 to 7 kHz. Previous research shows that critical bands can be approximated with the equation: critical bandwidth in hertz = 24.7(4.37F + 1), where F is the center frequency in kHz. [12] The bark is the unit of perceptual frequency; a critical band has a width of one bark, and 1/100 of a bark equals 1 mel. The bark scale relates absolute frequency (in hertz) to perceptually measured frequencies such as pitch or critical bands. Using a bark scale, the physical spectrum can be converted to a psychological spectrum. In this way, a pure tone (a single spectral line) can be represented as a psychological masking curve.
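As a quick numerical check of the critical-bandwidth approximation just quoted, the short Python sketch below evaluates the formula at a few arbitrary center frequencies; it assumes nothing beyond the equation given above.

def critical_bandwidth_hz(f_hz):
    """Approximate critical bandwidth (Hz) for a band centered at f_hz,
    using the formula quoted above: 24.7 * (4.37 * F + 1), with F in kHz."""
    f_khz = f_hz / 1000.0
    return 24.7 * (4.37 * f_khz + 1.0)

for f in (100, 400, 1000, 4000, 10000):
    print(f"{f:>6} Hz center  ->  ~{critical_bandwidth_hz(f):7.1f} Hz critical bandwidth")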

The pitch place theory further explains the action of the basilar membrane. Carried by the surrounding fluid, a sound wave travels the length of the membrane, and the wave peaks at particular places along the membrane, where the greatest vibration occurs, corresponding to different frequencies. Specifically, high frequencies are sensed on the membrane near the middle ear while low frequencies are sensed at the farther end. The wave excited by a high-frequency sound does not reach the far end of the basilar membrane; however, a low-frequency sound will pass through all the high-frequency places to reach the far end. Because hair cells tend to vibrate at the frequency of strongest stimulation, they will convey that frequency in a critical band, ignoring lesser stimulation. This excitation curve is described by the cochlear spreading function, an asymmetrical contour. Critical bands are important in perceptual coding because they show that the ear discriminates between energy inside the band and energy outside the band; in particular, this promotes masking. [8]

Perception of pitch is a complicated issue. Definitions of pitch as a sensory (subjective) attribute tend toward a concept that includes not only the aspect of perceived height but, in addition, one or more aspects of tones that are relevant in music. The most prominent of these additional aspects is octave equivalence: the notion that tones an octave apart are somehow similar and so, in certain musical respects, "equivalent". Pitch must therefore be regarded as a two-dimensional attribute, such that height is only one of two dimensions. The second "dimension" is ordinarily termed chroma. According to this concept, a pitch is said to have both a certain height and a certain musical-categorical value (chroma), e.g., "c-ness", "d-ness", etc. This is often illustrated by Roger Shepard's helical model. In that model, pitches are represented by points on an ascending helix such that the vertical height of their position reflects pitch height, while the rotational angle of the position corresponds to chroma. On the helix, pitches with one and the same chroma (in music denoted by the same letter: c, d, e, etc.) are situated vertically above or below one another. Practically any sound of real life, including the tones of musical instruments,

evokes several pitches at a time, though often (in particular for the harmonic complex tones produced by conventional musical instruments) one of them is most prominent and is then said to be the pitch. So the weak point of the ANSI definition is that there is no guarantee that any sound having pitch can indeed be unambiguously positioned on the low-high dimension. The auditory system also gets confused when a C and a G are played simultaneously, one tone perceived by the right ear and one by the left. [12] When one listens to a pair of successive musical tones, one can ordinarily tell whether or not the tones are equal in pitch, or whether the first is higher in pitch than the second, or vice versa. However, even for ordinary musical tones there is octave equivalence, which means that tones may be confused with one another although their oscillation frequencies differ by a factor of two. This implies that for harmonic complex tones there exists a certain ambiguity of pitch, which naturally emerges from the multiplicity of pitch. The ambiguity of pitch can be much amplified by suppressing certain harmonics from the Fourier spectrum of a "natural" harmonic complex tone. Shepard has described observations on harmonic complex tones whose Fourier spectrum consists only of harmonics that are in an octave relationship, i.e., the 1st, 2nd, 4th, 8th, 16th, and so on. While the musical pitch class (the chroma) of such tones is well defined, the absolute height of pitch is quite ambiguous; that is, octave confusions are very likely to occur.
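To make the height/chroma decomposition concrete, the following sketch maps a frequency onto Shepard's helix: height as the continuous octave number above a reference, and chroma as the angle around the helix (one full turn per octave). The C-based reference frequency and the note naming are illustrative assumptions. Tones an octave apart land at the same angle, which is the octave equivalence discussed above.

import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
C0 = 16.3516   # assumed reference frequency for chroma class C (Hz)

def helix_coordinates(f_hz):
    """Return (height in octaves, chroma angle in degrees, nearest note name)."""
    octaves = math.log2(f_hz / C0)     # total height up the helix
    chroma = octaves % 1.0             # fractional part = position within the octave
    angle = 360.0 * chroma             # one full turn of the helix per octave
    name = NOTE_NAMES[int(round(chroma * 12)) % 12]
    return octaves, angle, name

for f in (261.63, 523.25, 1046.50, 440.00):
    h, a, n = helix_coordinates(f)
    print(f"{f:8.2f} Hz -> height {h:5.2f} octaves, chroma angle {a:6.1f} deg (~{n})")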

Figure 2.1: The second dimension of pitch: chroma [12]

This is particularly true when the frequency of the first harmonic is near the lower limit of the hearing range while the upper part tones extend up to the high end of the hearing range. In that case there is indeed little, if any, information available to the ear about what the fundamental frequency (oscillation frequency) actually is. When, for instance, the oscillation frequency of the above type of tone is 10 Hz and the number of part tones chosen is 11, the listener is exposed to a spectrum of part tones with the frequencies 10, 20, 40, 80, ..., 10240 Hz. When that tone is followed by another having twice the oscillation frequency of the first, the listener is exposed to 20, 40, 80, ..., 20480 Hz, and it is not surprising that one will not perceive much of a difference, if any. So, under these conditions there is "perfect" octave equivalence. From these notions it is easy to understand that when the ratio between the oscillation frequencies of the two tones is 1.414, the listener at first sight cannot be expected to be

able to tell whether the second tone is higher in pitch than the first or vice versa. The tritone paradox originates from the observation that listeners in fact do make fairly consistent decisions on which of the two tones is higher in pitch, i.e., whether they heard an upward or downward step of pitch. However, while the responses of individual listeners are fairly consistent and reproducible, different listeners may give opposite responses. Moreover, the responses of individual listeners turn out to depend on the absolute height of the oscillation frequencies. That is, when the listening experiment is made with a base frequency of, e.g., 12 Hz instead of 10 Hz, the individual responses may systematically change. This was regarded as a particularly paradoxical outcome. The basic aspects of the tritone paradox can be fairly well explained by the theory of virtual pitch. However, the theory cannot account for the observed individual differences, as the factors governing those differences are as yet unknown. [12]

DUPLEX THEORY OF PITCH PERCEPTION

Generally, the ear mechanism works differently for low-frequency and high-frequency pitch perception. At very low frequencies, we may hear successive features of a waveform, so that it is not heard as having just one pitch. The ear follows the energy envelope in the LF region on the time scale: it takes into account the number of time bursts per second, and hence the pitch sensation is based on periodicity. For high-frequency content, the ear takes into account the position of vibration along the basilar membrane of the cochlea. In HF regions, the ear follows the amplitude envelope of the frequency spectrum and disregards its phase content. The two mechanisms appear to be about equally effective at a frequency around 640 Hz. This is popularly called the duplex

theory of pitch perception. [7] For frequencies well above 1000 Hz, the pitch frequency is heard only when the fundamental is actually present.

THE MISSING FUNDAMENTAL EFFECT

When two single-frequency tones are present in the air at the same time, they will interfere with each other and produce a beat frequency. The beat frequency is equal to the difference between the frequencies of the two tones, and if it is in the mid-frequency region, the human ear will perceive it as a third tone, called a "subjective tone" or "difference tone". When two sound waves of different frequency approach the ear, the alternating constructive and destructive interference causes the sound to be alternately soft and loud, a phenomenon called beating. The beat frequency is equal to the absolute value of the difference in frequency of the two waves. The subjective tones produced by the beating of the various harmonics of the sound of a musical instrument help to reinforce the pitch of the fundamental frequency. Most musical instruments produce a fundamental frequency plus several higher tones that are whole-number multiples of the fundamental. The beat frequencies between the successive harmonics constitute subjective tones that are at the same frequency as the fundamental and therefore reinforce the sense of pitch of the fundamental note being played. If the fundamental is 50 Hz and its two successive harmonics at 150 Hz and 200 Hz

beat with each other, the resultant is a 50 Hz tone, which equals the fundamental and hence reinforces the pitch. If the lower harmonics are not produced because of the poor fidelity or filtering of the sound-reproduction equipment, humans still hear the tone as having the pitch of the non-existent fundamental because of the presence of these beat frequencies. This is called the missing fundamental effect. It plays an important role in sound reproduction by preserving the sense of pitch (including the perception of melody) when reproduced sound loses some of its lower frequencies. The presence of the beat frequencies between the harmonics gives a strong sense of pitch.

Figure 2.2: The missing fundamental effect

Fletcher in his first paper proposed that the missing fundamental was re-created by nonlinearities in the mechanism of the ear. He soon abandoned this false conclusion and in his second paper described experiments showing that a tone must include three successive partials in order to be heard as a musical tone, that is, a tone that has the pitch of the fundamental, whether or not the fundamental is present. [7]
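The example above is easy to reproduce numerically. The sketch below synthesizes the 150 Hz and 200 Hz harmonics with no 50 Hz component at all and then checks the repetition period of the resulting waveform via its autocorrelation; the period comes out at 20 ms, i.e., the 50 Hz missing fundamental. The sample rate and duration are arbitrary choices.

import numpy as np

fs = 8000
t = np.arange(int(0.5 * fs)) / fs

# Two harmonics of a 50 Hz fundamental; the fundamental itself is absent.
x = np.sin(2 * np.pi * 150 * t) + np.sin(2 * np.pi * 200 * t)

# The autocorrelation peaks at the waveform's repetition period, which is 1/50 s
# even though there is no 50 Hz partial in the spectrum.
ac = np.correlate(x, x, mode='full')[len(x) - 1:]
start = int(fs / 400)                      # skip the broad peak around zero lag
lag = np.argmax(ac[start:]) + start
print(f"repetition period ~ {1000 * lag / fs:.1f} ms -> perceived pitch ~ {fs / lag:.1f} Hz")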

For fundamental frequencies of up to about 1400 Hz, the pitch of a complex tone is determined by the second and higher harmonics and not by the fundamental, whereas beyond this frequency the opposite holds; this is the case both for tones with harmonics of equal amplitude and for tones with harmonics whose amplitudes fall by 6 dB/octave. For fundamental frequencies of up to about 700 Hz, the pitch is determined by the third and higher harmonics; for frequencies up to about 350 Hz, by the fourth and higher harmonics [18].

SPECTRAL PITCH AND VIRTUAL PITCH

The pitch of sine tones is, with high probability, a "place pitch", i.e., dependent on the place of maximal excitation of the cochlear partition, and ultimately a result of peripheral auditory Fourier analysis. On the other hand, it is evident that the pitch of many types of complex tone cannot be explained by that principle, in particular the pitch of harmonic complex tones whose fundamental Fourier component is weak or entirely missing. Attempts have been made to resolve this conflict by searching for a parameter of sound and a mechanism that account both for the pitch of sine tones and for that of complex tones. That search has failed to this day. Besides the pitch of sine tones there is another type of pitch, namely virtual pitch. Both spectral pitch and virtual pitch ultimately depend on aural Fourier analysis; the conceptual distinction is that while any spectral pitch is conceived as immediately corresponding to a spectral singularity, virtual pitch is modeled as being deduced from a set of spectral pitches at a further stage of auditory processing. The relationship between spectral pitch

and virtual pitch is analogous in many respects to that between primary and virtual visual contour. There is hardly any sound that does not elicit any spectral pitch at all. The harmonic complex tones of real life, i.e., voiced speech and musical tones, are aurally represented by a number of spectral pitches that correspond to the lower harmonics. The formants of speech vowels elicit corresponding spectral pitches. Even random sound signals are often either "colorized" by spectral irregularities, which give rise to steady spectral pitches, or there can occur instantaneous irregularities in the short-term Fourier spectrum that elicit spectral pitches of which both the height and the instant of occurrence are random. When any real-life sound (e.g., a footstep, a knock at the door, splashing water, the sound of a car's engine, or a fricative phoneme of speech) can be identified by ear, one can be sure that, besides temporal structure, spectral pitch is involved. Spectral pitch is the most important carrier of auditory information, as it is an element of higher-order, Gestalt-like types of auditory percepts such as the pitch of musical tones, the strike note of bells, the root of musical chords, and the quality of a particular vowel. The telephone channel does not distort the pitch of speech, although transmission is confined to the frequency range from about 300 to about 3400 Hz. When we suppress bass reproduction, we will notice that the fundamental becomes inaudible, while the speaker's pitch continues to be well reproduced. The kind of pitch of the fundamental that we may hear if the fundamental is strong enough is the pitch of a sine tone; it is of the spectral pitch type. The pitch that we ordinarily hear, however, does not depend on the fundamental being audible; it is extracted by the auditory system from a range of

the Fourier spectrum that extends above the fundamental. The latter type of pitch is termed virtual pitch. A procedure has been described for the automatic extraction of the various pitch percepts which may be simultaneously evoked by complex tonal stimuli. The procedure is based on the theory of virtual pitch, and in particular on the principle that the whole pitch percept depends both on analytic listening (yielding spectral pitch) and on holistic perception (yielding virtual pitch). The more or less ambiguous pitch percept governed by these two pitch modes is described by two pitch patterns: the spectral-pitch pattern and the virtual-pitch pattern. Each of these patterns consists of a number of pitch (height) values and associated weights, which account for the relative prominence of every individual pitch. The spectral-pitch pattern is constructed by spectral analysis, extraction of tonal components, evaluation of masking effects (masking and pitch shifts), and weighting according to the principle of spectral dominance. The virtual-pitch pattern is obtained from the spectral-pitch pattern by an advanced algorithm of subharmonic coincidence assessment. It can be concluded that, as an attribute of auditory sensation, virtual pitch is fundamentally different in type from spectral pitch. This conclusion is strongly suggested by the fact that one can hear both types of pitch at the same time, having the same height. Evidently, it is possible to communicate one and the same pitch (in terms of pitch height) through two drastically different perceptual "channels": spectral pitch is communicated immediately, i.e., by a Fourier component's frequency, while virtual pitch is

communicated by providing to the auditory system information about the oscillation frequency of a complex signal that is implied in the Fourier spectrum as a whole. The formation of virtual pitch can essentially be said to be a process of subharmonic matching. The tonal aspects of any sound are primarily represented by a set of spectral pitches, and pertinent virtual pitches are "inferred" on the basis of the presumption that in any case they must be subharmonic to the spectral pitches. The virtual-pitch mechanism deals with both "harmonic" and "inharmonic" sounds, though internally it strictly sticks to the presumption that each and every virtual-pitch candidate must be a subharmonic of a spectral pitch. Where the partials in a sound are harmonically related, but with the first member of the series missing (for example, a sound with partials at 500 Hz, 750 Hz, 1000 Hz, 1250 Hz, etc.), a virtual pitch can be heard at 250 Hz, the missing fundamental. Where the partials are not exactly harmonic, a virtual pitch is still heard at about the same place, but the exact frequency turns out to be determined in quite a complicated way by the frequencies of the individual partials. No comprehensive rule for determining virtual pitch is yet known. Extensive research has been done on virtual pitch, and it has been shown that it is not due to a simple explanation such as difference tones, but rather is a side effect of the human hearing mechanism. There is no doubt that the strike note of a bell is a virtual pitch, as will be explained below. Virtual pitch effects often dominate spectral pitches; for example, in bells the strike note is about an octave below the nominal even if the tierce, only a minor third away, is very strong.
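As a toy illustration of this subharmonic-matching idea (and emphatically not Terhardt's actual algorithm), the sketch below scores candidate fundamentals by how many of the supplied partials lie close to one of their integer multiples, breaking ties in favor of the smaller total mismatch and then the higher candidate. Applied to the 500/750/1000/1250 Hz example above, it returns 250 Hz. The search range, step, and tolerance are arbitrary.

def virtual_pitch_candidate(partials_hz, f0_lo=50.0, f0_hi=500.0, step=1.0, tol=0.02):
    """Naive subharmonic matching: count partials within a relative tolerance of an
    integer multiple of each candidate fundamental; prefer more matches, then a
    smaller total mismatch, then the higher candidate frequency."""
    best = None   # (score, -total_mismatch, f0)
    f0 = f0_lo
    while f0 <= f0_hi + 1e-9:
        score, mismatch = 0, 0.0
        for p in partials_hz:
            harmonic = max(1, round(p / f0))
            rel_err = abs(p - harmonic * f0) / p
            if rel_err <= tol:
                score += 1
                mismatch += rel_err
        key = (score, -mismatch, f0)
        if best is None or key > best:
            best = key
        f0 += step
    return best[2], best[0]

print(virtual_pitch_candidate([500.0, 750.0, 1000.0, 1250.0]))   # -> (250.0, 4)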

The frequencies of partials present in sounds can be measured with scientific instruments or spectrum analyzers. Pitches cannot be measured with instruments; they exist only in our perception of a sound. Only a human listener can tell us the pitch of a sound, and different listeners may disagree on the perceived pitch. [12] Spectral pitch is defined as an elementary auditory object that immediately represents a spectral singularity; the simplest and most prominent example is the pitch of a sine tone. A virtual pitch is characterized by the presence of harmonics or near-harmonics. A spectral pitch corresponds to individual audible pure-tone components. Most pitches heard in normal sounds are virtual pitches, and this is true whether the fundamental is present in the spectrum or not. The crossover from virtual to spectral pitch is at about 800 Hz, but this estimate depends on the selection of clear sinusoidal components in the spectrum. This follows the duplex theory of pitch perception. [7] A procedure for the schematic and automatic extraction of "fundamental pitch" from complex tonal signals, such as voiced speech and music, was developed by Ernst Terhardt. While the aurally relevant "fundamental" of a complex signal cannot be defined in purely mathematical terms, an existing model of virtual-pitch perception provided a suitable basis [13]. The procedure comprised the formation of determinant spectral pitches (which correspond to the frequencies of certain signal components) and the deduction of virtual pitch (or "fundamental frequency") from those spectral pitches. The latter deduction was accomplished by a principle of subharmonic matching, for whose realization a simple, universal, and efficient algorithm was found. While the calculation may be confined to the determination of "nominal" virtual pitch, certain typical auditory

phenomena, such as the influence of SPL, partial masking, and interval stretch, were accounted for well, in which case the "true" virtual pitch was obtained [14]. An algorithm for the extraction of pitch and pitch salience from complex tonal signals is described in [13]. The core idea behind the project presented here is to make use of this natural phenomenon of virtual pitch in audio data reduction. Human sound perception is not sensitive to the detailed spectral shape or phase of non-periodic sounds. The sinusoidal model used in this research work takes advantage of the human inability to perceive the exact spectral shape of signals. Even the phase of transient noisy signals has limited perceptual importance. [3] In pure tones, because the frequency composition is so simple, no distinction can be made between three different properties of a tone: its fundamental frequency, its pitch, and its spectral balance. The case is different with complex tones. Let us take the three properties in turn. The fundamental frequency of a tone is the frequency of repetition of the waveform. If the tone is composed of harmonics of a fundamental, this repetition rate will be at the frequency of the fundamental regardless of whether the fundamental is actually present. In harmonic tones, the perceived pitch of the tone is determined by the fundamental of the series, even when it is not present; this phenomenon is sometimes called the perception of the missing fundamental. Finally, the spectral balance is the relative intensity of the higher and lower harmonics. This feature controls our perception of the brightness of the tone [6]. We will make great use of these pitch-related concepts, especially the missing fundamental effect and the duplex theory of pitch perception, in this research project in an

effective manner, as described in Chapter 3. While synthesizing the low-frequency and high-frequency spectrum, our system might not follow all the exact harmonics and their corresponding energy levels. The missing fundamental effect explained above is therefore usefully applied here to mask the absence of missing partials in the case of music, and the absence of formants in speech applications. Minor errors made while tracking the sinusoids would hence be less perceptible; only major tracking errors in SMS will clearly show up.

THE FLETCHER-MUNSON CURVES, 1933

Figure 2.3: The Fletcher-Munson equal-loudness curves (1933)

In 1933, Fletcher and Munson decided to gather information about how we perceive different frequencies at different amplitudes. They came up with the equal-loudness contours, or the Fletcher-Munson curves. These curves give information

on the threshold of hearing at different frequencies and the apparent levels of equal loudness at different frequencies. The ear is not equally sensitive to all frequencies, particularly in the low- and high-frequency ranges. The response to frequencies over the entire audio range has been charted, originally by Fletcher and Munson in 1933, with later revisions by other authors such as Robinson and Dadson, as a set of curves showing the sound pressure levels (SPL) of pure tones that are perceived as being equally loud. The curves are plotted for each 10 dB rise in level, with the reference tone being at 1 kHz. The curves are lowest in the range from 1 to 5 kHz, with a maximum dip around 3300 Hz, indicating that the ear is most sensitive to frequencies in this range. The intensity level of higher or lower tones must be raised substantially in order to create the same impression of loudness. The phon scale was devised to express this subjective impression of loudness, since the decibel scale alone refers to actual sound pressure or sound intensity levels. Historically, the A, B, and C weighting networks on a sound level meter were derived as the inverse of the 40, 70, and 100 dB Fletcher-Munson curves and used to determine the sound level. The lowest curve represents the threshold of hearing, the highest the threshold of pain. The actual data sets and their role in this research will be explained in detail in Chapter 3.

Dynamic Range

The instruments used to measure the magnitudes of sounds respond to changes in air pressure. However, sound magnitudes are often specified in terms of intensity, which is the sound energy transmitted per second (i.e., the power) through a unit area in a sound field. For a medium such as air, there is a simple relationship between the pressure variations of a plane sound wave in a free field (i.e., in the absence of reflected sound) and the acoustic intensity: intensity is proportional to the square of the pressure variation. [17] Dynamic range is the difference in sound-pressure level between the saturation or overload level and the background noise level of an acoustic or electro-acoustic system, measured in decibels. This range may be expressed as a signal-to-noise ratio for maximum output. For a sound or a signal, its dynamic range is the difference between the loudest and quietest portions. The human hearing system has a dynamic range of about 120 dB between the threshold of hearing and the threshold of pain. The intensity level at which a sound becomes just audible is the threshold of hearing. For a continuous tone of between 2000 and 4000 hertz, heard by a person with good hearing acuity under laboratory conditions, this is a sound pressure of 0.0002 dyne/cm^2 and is given the reference level of 0 dB. While 0 dB is the reference employed, the threshold of hearing varies considerably at lower and higher frequencies. This curve is also called the minimum audible field (MAF). Alternate units for this reference level are: 2 x 10^-4 microbar (µbar); 2 x 10^-5 newton/m^2 (N/m^2); 2 x 10^-5 pascal (Pa); 20 micropascal (µPa).

The threshold of pain is the intensity level of a loud sound which gives pain to the ear, usually between 115 and 140 dB. The square of the sound pressure is proportional to the sound intensity, so SPL can be calculated in the same manner and is measured in decibels:

SPL = 10 log10 (r/r_ref)^2 = 20 log10 (r/r_ref)

where r is the given sound pressure and r_ref is the reference sound pressure. The decibel is the unit of a logarithmic scale of power or intensity called the power level or intensity level. The decibel is defined as one tenth of a bel, where one bel represents a difference in level between two intensities I_1 and I_0 in which one is ten times greater than the other:

Intensity level = 10 log10 (I_1/I_0) dB

Because of the very large range of sound intensity which the ear can accommodate, from the loudest (1 watt/m^2) to the quietest (10^-12 watt/m^2), it is convenient to express these values as a function of powers of 10. The result of this logarithmic basis for the scale is that increasing a sound intensity by a factor of 10 raises its level by 10 dB; increasing it by a factor of 100 raises its level by 20 dB; by 1,000, 30 dB; and so on. When two sound sources of equal intensity or power are measured together, their combined intensity level is 3 dB higher than the level of either separately. 0 dB is defined as the threshold of hearing, and it is with reference to this internationally agreed upon quantity that decibel measurements are made.
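A few lines of arithmetic illustrate the level formulas above: converting a pressure and an intensity to decibels relative to the usual reference values, and the 3 dB increase obtained when the power is doubled. The example values are arbitrary.

import math

def spl_db(p, p_ref=20e-6):
    """Sound pressure level: 20 * log10(p / p_ref), with p in pascals (reference 20 µPa)."""
    return 20.0 * math.log10(p / p_ref)

def intensity_level_db(i, i_ref=1e-12):
    """Intensity level: 10 * log10(I / I_ref), with I in W/m^2 (reference 1e-12 W/m^2)."""
    return 10.0 * math.log10(i / i_ref)

print(f"1 Pa     -> {spl_db(1.0):.1f} dB SPL")                  # about 94 dB
print(f"1 W/m^2  -> {intensity_level_db(1.0):.1f} dB")          # 120 dB, roughly the pain threshold
print(f"doubling the power adds {10 * math.log10(2):.1f} dB")   # the 3 dB rule above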

The phon is a unit used to describe the loudness level of a given sound or noise. The system is based on the equal-loudness contours, where 0 phons at 1,000 Hz is set at 0 decibels, the threshold of hearing at that frequency. The hearing threshold of 0 phons then lies along the lowest equal-loudness contour. If the intensity level at 1,000 Hz is raised to 20 dB, the second curve is followed. For the purpose of measuring sounds of different loudness, the sone scale of subjective loudness was invented. One sone is arbitrarily taken to be 40 phons at any frequency, i.e., at any point along the 40-phon curve on the graph. Two sones are twice as loud, i.e., 40 + 10 phons = 50 phons. Four sones are twice as loud again, i.e., 50 + 10 phons = 60 phons. The relationship between phons and sones is shown in the chart and is expressed by the equation:

Phons = 40 + 10 log2 (Sones)
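A small sketch of the phon/sone relationship stated above, under the usual convention that loudness in sones doubles for every 10-phon increase above the 40-phon anchor; loudness levels below 40 phons fall outside this simple rule and are not handled here.

import math

def phons_to_sones(phons):
    """1 sone = 40 phons; loudness doubles for every additional 10 phons."""
    return 2.0 ** ((phons - 40.0) / 10.0)

def sones_to_phons(sones):
    """Inverse relation: phons = 40 + 10 * log2(sones)."""
    return 40.0 + 10.0 * math.log2(sones)

for p in (40, 50, 60, 70):
    print(f"{p} phons -> {phons_to_sones(p):.0f} sone(s)")
print(f"4 sones -> {sones_to_phons(4):.0f} phons")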

BASIC COMMUNICATION SYSTEM BLOCK

The research project described in Chapter 3 is basically a transmitting and receiving communication system; therefore an overview of a very basic functional communication block is given here. Today, communication has entered our lives in so many different forms that it is very difficult to lead a life without the various appliances and tools born out of it. Communication is the process of conveying something from one point to another. It can be classified according to the line of sight and the distance between the transmitter and the receiver. If the two points are beyond the line of sight, then the branch of communication engineering known as telecommunication engineering comes into the picture.

Figure 2.4: Stages of communication

In communication engineering, physical messages such as sound, words, and pictures are converted into equivalent electrical values, called signals. This electrical signal is conveyed to a distant place through a communication medium, and at the receiving end the electrical signal is converted back into the original message.

Figure 2.5: Basic communication system

Source

The message produced by the source is not necessarily electrical in nature; it may be a voice signal, a picture signal, etc. So an input transducer is required to convert the original physical message into a time-varying electrical signal. These signals are called baseband signals, message signals, or modulating signals. At the destination, another transducer is used to convert the electrical signal back into the appropriate message.

Transmitter

The transmitter, comprising electrical and electronic components, converts the message signal into a form suitable for propagation over the communication medium. This is often achieved by modulating a carrier signal (i.e., a high-frequency signal that carries the modulating or message signal), which may be an electromagnetic wave. This wave is often referred to as the modulated signal.

Modulation and Demodulation

Modulation is the process by which some characteristic of a high-frequency carrier signal is varied in accordance with the instantaneous value of another signal, called the modulating or message signal. The signal containing the information or intelligence to be transmitted is known as the modulating or message signal; it is also known as the baseband signal. The term baseband designates the band of frequencies representing the signal supplied by the source of information. Usually the frequency of the carrier is greater than that of the modulating signal. The signal resulting from the process of modulation is called

the modulated signal. Demodulation is the process of recovering the original signal from the channel; it is performed at the receiving end.

Channel

The transmitter and the receiver are usually separated in space. The channel provides the connection between the source and the destination. Regardless of its type, the channel degrades the transmitted signal in a number of ways, producing signal distortion. This occurs due to the imperfect frequency response of the channel and contamination of the signal by channel noise.

Receiver

The main function of the receiver is to extract the message signal from the degraded version of the transmitted signal. The transmitter and receiver are carefully designed to avoid distortion and to minimize the effect of noise, so that faithful reproduction of the message emitted by the source is possible. The receiver operates on the received signal so as to reconstruct a recognizable form of the original message signal and deliver it to the user destination.
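To connect this generic description with the coder of Chapter 3, here is a hedged sketch of one plausible way to shift an assumed 1-5 kHz sensitive band down to a complex baseband, decimate it for transmission, and reverse the process at the receiver. The band edges, center frequency, decimation factor, and filters are illustrative assumptions, not the exact scheme implemented in the thesis.

import numpy as np
from scipy import signal

fs = 44100
M = 8                                  # decimation factor; complex baseband rate fs/M ~ 5512 Hz
f_lo, f_hi = 1000.0, 5000.0            # assumed sensitive band
f_c = 0.5 * (f_lo + f_hi)              # band center used for the frequency shift

n = np.arange((fs // M) * M)           # signal length chosen as a multiple of M
t = n / fs
x = np.sin(2 * np.pi * 1500 * t) + 0.5 * np.sin(2 * np.pi * 4200 * t)   # in-band test tones

def lowpass(sig, cutoff, rate):
    sos = signal.butter(8, cutoff, btype='lowpass', fs=rate, output='sos')
    return signal.sosfiltfilt(sos, sig)

# Transmitter: shift the band down to baseband, low-pass, and decimate.
shift = np.exp(-2j * np.pi * f_c * t)
bb = x * shift
bb = lowpass(bb.real, 2300.0, fs) + 1j * lowpass(bb.imag, 2300.0, fs)
bb_dec = bb[::M]                       # reduced-rate complex baseband sent over the channel

# Receiver: zero-stuff back to the original rate, interpolate, and shift the band up again.
up = np.zeros(len(bb_dec) * M, dtype=complex)
up[::M] = bb_dec * M                   # gain M compensates for the inserted zeros
up = lowpass(up.real, 2300.0, fs) + 1j * lowpass(up.imag, 2300.0, fs)
y = 2.0 * np.real(up * np.conj(shift))

err = np.max(np.abs(y[4000:-4000] - x[4000:-4000]))
print(f"max reconstruction error away from the filter edges: {err:.2e}")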

Communication types

Figure 2.6: Types of communication

This research focuses on digital communication of the mid-frequency-band PCM samples, as indicated in the figure above, plus transmission of the spectral parameters of the low-frequency and high-frequency bands.

ANALYSIS/RESYNTHESIS

Analysis-resynthesis is a technique in which the input signal is analyzed over short time intervals and its spectrum is computed. The musician makes the desired modifications, and the sound is resynthesized in the final stage. One of the major application tools that uses this technique is the phase vocoder [4]. The analysis of a sound, to identify the harmonics that occur in the signal, is performed through the estimation of its power spectrum. Samples of musical tones are analyzed using short-time Fourier analysis to determine their time-varying frequency characteristics. The analysis is carried out on short segments of the input signal through a

technique called windowing, with window widths based on the amplitude-envelope parameters obtained by analyzing the amplitude of the waveform with respect to time. The Fast Fourier Transform (FFT) is then applied to these discrete sections to obtain a spectrum of the sound signal. This data is then used for resynthesis of the original sound. The spectral modeling synthesis employed in this research works in the same way.

Figure 2.7: Overview of general analysis and synthesis technique

THE FOURIER PHILOSOPHY: DISCRETE FOURIER TRANSFORM

Continuous

For a continuous function of one variable f(t), the Fourier transform F(f) is defined as:

F(f) = \int_{-\infty}^{\infty} f(t)\, e^{-j 2\pi f t}\, dt

and the inverse transform as

f(t) = \int_{-\infty}^{\infty} F(f)\, e^{j 2\pi f t}\, df

where j is the square root of -1 and e denotes the natural exponent:

e^{j\phi} = \cos(\phi) + j \sin(\phi)

Discrete

Consider a complex series x(k) with N samples of the form x_0, x_1, x_2, ..., x_k, ..., x_{N-1}, where each x_k is a complex number,

x_k = x_{real} + j\, x_{imag}

Further, assume that the series outside the range 0 to N-1 is extended N-periodically, that is, x_k = x_{k+N} for all k. The transform of this series, denoted X(n), also has N samples. The forward transform is defined as

X(n) = \frac{1}{N} \sum_{k=0}^{N-1} x(k)\, e^{-j k 2\pi n / N} \qquad \text{for } n = 0 \ldots N-1

The inverse transform is defined as

x(k) = \sum_{n=0}^{N-1} X(n)\, e^{j k 2\pi n / N} \qquad \text{for } k = 0 \ldots N-1

Although the functions here are described as complex series, real-valued series can be represented by setting the imaginary part to zero. In general, the transform into the frequency domain is a complex-valued function, that is, it has magnitude and phase:

|X(n)| = \sqrt{X_{real}^2 + X_{imag}^2}

\text{Phase} = \tan^{-1}(X_{imag} / X_{real})
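To make the discrete transform pair above concrete, here is a minimal Python sketch that implements it directly from the definitions (with the 1/N factor on the forward transform, matching the convention used in the text); it is written for clarity rather than speed, and an FFT would be used in practice:

```python
import numpy as np

def dft_forward(x: np.ndarray) -> np.ndarray:
    """Forward DFT with 1/N scaling: X(n) = (1/N) sum_k x(k) e^{-j 2 pi k n / N}."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape((N, 1))
    return (x * np.exp(-2j * np.pi * k * n / N)).sum(axis=1) / N

def dft_inverse(X: np.ndarray) -> np.ndarray:
    """Inverse DFT matching the forward definition: x(k) = sum_n X(n) e^{+j 2 pi k n / N}."""
    N = len(X)
    n = np.arange(N)
    k = n.reshape((N, 1))
    return (X * np.exp(2j * np.pi * k * n / N)).sum(axis=1)

# Round trip on a short real-valued test signal.
x = np.cos(2 * np.pi * 3 * np.arange(16) / 16)
X = dft_forward(x)
magnitude = np.abs(X)     # sqrt(real^2 + imag^2)
phase = np.angle(X)       # atan2(imag, real)
assert np.allclose(dft_inverse(X).real, x)
```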

The Nyquist Criterion and Sampling Theorem

The sampling theorem (often called "Shannon's sampling theorem") states that a continuous signal must be sampled at at least twice the frequency of the highest frequency present in the signal. More precisely, a continuous function f(t) is completely defined by samples taken every 1/fs seconds (where fs is the sampling frequency) if its frequency spectrum F(f) is zero for f > fs/2. The frequency fs/2 is called the Nyquist frequency and places a limit on the minimum sampling frequency when digitizing a continuous signal. Normally the signal to be digitized is appropriately low-pass filtered before sampling to remove higher-frequency components. If the sampling frequency is not high enough, the high-frequency components wrap around and appear at other locations in the discrete spectrum, thus corrupting it.

The key features and consequences of sampling a continuous signal can be shown graphically as follows. Consider a continuous signal in the time and frequency domains.

Figure 2.8 (a): Fourier transform (continuous)

Sampling this signal with a sampling frequency fs (time between samples 1/fs) is equivalent to convolving its frequency spectrum with a train of delta functions spaced fs apart.

Figure 2.8 (b): Fourier transform (discrete)

If the sampling frequency is too low, the copies of the frequency spectrum overlap and become corrupted. This is called aliasing.

Figure 2.8 (c): Aliasing
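The effect can also be demonstrated numerically. In the short sketch below (our own illustrative example), a 7 kHz sine wave sampled at 10 kHz produces the same sample values as a 3 kHz sine wave (with inverted sign, since the spectrum folds about the Nyquist frequency), so the component above 5 kHz aliases to fs - f = 3 kHz:

```python
import numpy as np

fs = 10_000          # sampling frequency, Hz
n = np.arange(32)    # sample indices
t = n / fs

high = np.sin(2 * np.pi * 7_000 * t)    # above the 5 kHz Nyquist frequency
alias = np.sin(2 * np.pi * 3_000 * t)   # fs - 7 kHz = 3 kHz

# The two sampled sequences are indistinguishable apart from the sign flip of the fold.
print(np.allclose(high, -alias))        # True
```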

Another way to look at this is to consider a sine function sampled twice per period (the Nyquist rate). There are other sinusoids of higher frequency that would give exactly the same samples and thus cannot be distinguished from the original sinusoid.

CLASSICAL THEORY OF TIMBRE

An overview of timbre definitions and the theory behind them is provided here because timbre modifications are possible in this research project; Chapter 3 contains details of creating various sound effects through such modifications. The International Standards Organization and the American National Standards Institute define timbre as follows: "Timbre is that attribute of auditory sensation in terms of which a listener can judge that two sounds similarly presented and having the same loudness and pitch are dissimilar."

DIMENSIONS OF TIMBRE

A considerable amount of effort has been devoted to finding the perceptual dimensions of timbre, the color of a sound. Often these studies have involved multidimensional

scaling experiments, where a set of sound stimuli is presented to subjects, who then rate their similarity or dissimilarity. On the basis of these judgments, a low-dimensional space that best accommodates the similarity ratings is constructed, and a perceptual or acoustic interpretation is sought for these dimensions. Two of the main dimensions described in these experiments have usually been spectral centroid and rise time. The first measures the spectral energy distribution in the steady-state portion of a tone, which corresponds to perceived brightness. The second is the time between the onset and the instant of maximal amplitude. The psychophysical meaning of the third dimension has varied, but it has often been related to temporal variations or irregularity in the spectral envelope. These results provide a good starting point in the search for features to be used in musical instrument recognition systems [21]. Since this research project does not focus on timbre theory except where modifications are concerned, timbre is not discussed further here. In a pure sinusoidal model, one has access to a particular frequency component in the form of a track; a track can be modified, and new timbres can be created. In this project, there are opportunities for modifying tracks to create partial timbre modifications. Examples of this kind will be provided in the third chapter.

AN OVERVIEW OF SOUND-SYNTHESIS TECHNIQUES

When generating musical sound on a digital computer, it is important to have a good model whose parameters provide a rich source of meaningful sound transformations.

Three basic model types are in prevalent use today for musical sound generation: instrument models, spectrum models, and abstract models. Instrument models attempt to parametrize a sound at its source, such as a violin, clarinet, or vocal tract. Spectrum models attempt to parametrize a sound at the basilar membrane of the ear, discarding whatever information the ear seems to discard in the spectrum. Abstract models, such as FM, attempt to provide musically useful parameters in an abstract formula. The following passages give an overview of widely used sound-synthesis techniques for musical purposes.

Additive Synthesis

The philosophy behind all sound-synthesis methods is Euler's idea that any physical-world sound we hear can be broken down into a collection of sinusoids, namely sine and cosine waves. These building blocks can be subjected to mathematical operations, and desired sounds can be synthesized from them. This is exactly what the Fourier transform exploits in its time-to-frequency mapping. Additive synthesis, one of the oldest and most heavily researched synthesis techniques, is also based on the summation of elementary waveforms to create more complex waveforms. This technique is accepted as the most powerful and flexible spectral modeling technique. It was among the first synthesis techniques in computer music and was described extensively in the very first article of the very first issue of the Computer Music Journal. Additive synthesis assumes that any periodic waveform can be modeled as a sum of sinusoids with individual amplitude envelopes and time-varying frequencies. Basically, it puts together a number of different wave components, which can be partials or harmonics, to arrive at a

particular sound. The figure below shows the additive synthesis of specific waveforms by summing sinusoids. Additive synthesis allows more control than any other kind of synthesis, as it permits fine control over individual frequency components. Moreover, a synthesis may interpolate between the frequency spectra of two or more different sounds. Additive synthesis is effective in modeling steady-state sounds rather than portions of sound such as the transients in the attack.

Figure 2.9: Additive Synthesis

The phase factor: Phase is a trickster. Depending on the context, it may or may not be a significant factor in additive synthesis. For example, if one changes the starting phases of the frequency components of a fixed waveform and resynthesizes the tone, this makes no difference to the listener, and yet such a change may have a significant effect on the visual

appearance of the waveform. Phase relations become apparent in the perception of the brilliant but short life of attacks, grains, and transients. The ear is also sensitive to phase relationships in complex sounds where the phases of certain components shift over time. Addition of partials with fixed amplitudes is limited in that it succeeds only in creating a more interesting fixed-waveform sound. Since the spectrum in fixed-waveform synthesis is constant over the course of a note, such partial addition can never accurately reproduce the sound of an acoustic instrument; it approximates only the steady-state portion of an instrumental tone. Research has shown that the attack portion of a tone, where the frequency mixture is changing on a millisecond-by-millisecond timescale, is far more useful for identifying traditional instrument tones than the steady-state portion. In any case, a time-varying timbre is usually more tantalizing to the ear than a constant spectrum (Grey 1975) [9].

Time-varying Additive Synthesis

By changing the mixture of sine waves over time, one obtains more interesting synthetic timbres and more realistic instrumental tones. In the trumpet note in the figure below, it takes 12 sine waves to reproduce the initial attack portion of the event; after 300 ms, only three or four sine waves are needed [9].

Figure 2.10: The amplitude progression of the partials of a trumpet tone
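A minimal sketch of time-varying additive synthesis is shown below; the partial frequencies and the envelope shape are invented for the example and are not taken from the trumpet analysis. Each partial is a sinusoid whose amplitude follows its own envelope, and the partials are summed:

```python
import numpy as np

fs = 44_100
duration = 1.0
t = np.arange(int(fs * duration)) / fs

# Hypothetical harmonic partials of a 220 Hz tone with per-partial peak amplitudes.
f0 = 220.0
partials = [(f0 * k, 1.0 / k) for k in range(1, 9)]   # (frequency, peak amplitude)

def envelope(t, attack=0.02, decay=0.8):
    """Simple attack / exponential-decay amplitude envelope."""
    env = np.minimum(t / attack, 1.0)
    return env * np.exp(-t / decay)

# Additive synthesis: sum of sinusoids, each with a time-varying amplitude.
signal = sum(amp * envelope(t) * np.sin(2 * np.pi * freq * t) for freq, amp in partials)
signal /= np.max(np.abs(signal))    # normalize to avoid clipping
```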

Subtractive Synthesis

Subtractive synthesis implies the use of filters to shape the spectrum of a sound source by subtracting unwanted partials from its spectrum while favoring the resonance of others. As the source signal passes through a filter, the filter boosts or attenuates selected regions of the frequency spectrum. If the original source is spectrally rich and the filter is flexible, subtractive synthesis can shape close approximations of many natural sounds, as well as a wide variety of new and unclassified timbres. This technique has been used successfully to model percussion-like instruments and the human voice [9]. Subtractive synthesis is often referred to as analogue synthesis because most analogue (i.e., non-digital) synthesizers use this method of generating sounds. In its most basic form, subtractive synthesis is a very simple process:

OSCILLATOR > FILTER > AMPLIFIER

An oscillator is used to generate a suitably bright sound, which is routed through a filter. The filter is used to cut off or reduce the brightness to something more suitable. The resulting sound is routed to an amplifier, which controls the loudness of the sound over time so as to emulate a natural instrument. A filter can be literally any operation on a signal (Rabiner et al. 1972), but the most common use of the term, and the usage here, describes devices that boost or attenuate regions of a sound spectrum. Filters can be implemented by one of these methods: delaying a copy of the input signal slightly and combining the delayed copy with the new input signal, or delaying a copy of the output signal and combining it with the input signal [9]. Chapter 3, which contains the research project, involves a great deal of band-elimination and band-pass filtering. The bandwidth is a measure of the selectivity of the filter and is equal to the difference between the upper and lower cutoff frequencies. The response of a band-pass filter is often described by terms such as sharp (narrow) or broad (wide), depending on the actual width. The pass-band sharpness is often quantified by means of the quality factor (Q). When the cutoff frequencies are defined at the -3 dB points, Q is given by

Q = f_0 / BW

where BW is the bandwidth [25]. A high Q therefore denotes a narrow bandwidth. Bandwidth may also be described as a percentage of the center frequency.

SPECTRUM MODELING SYNTHESIS

The proposed coder in this research work synthesizes the less perceptually sensitive portions of the signal. While many synthesis techniques are available, sinusoidal modeling synthesis is preferred here because sound can be modeled as a set of sinusoids; moreover, the sinusoidal model can sometimes be used in denoising applications. The main advantage of this group of techniques is the existence of analysis procedures that extract the synthesis parameters from real sounds, making it possible to reproduce and modify actual sounds. Our particular approach is based on modeling sounds as stable sinusoids (partials) plus noise (a residual component), analyzing sounds with this model and generating new sounds from the analyzed data. The analysis procedure detects partials by studying the time-varying spectral characteristics of a sound and represents them with time-varying sinusoids. These partials are then subtracted from the original sound, and the remaining "residual" is represented as a time-varying filtered white-noise component. The synthesis procedure is a combination of additive synthesis for the sinusoidal part and subtractive synthesis for the noise part. This analysis/synthesis strategy can be used either for generating sounds (synthesis) or for transforming pre-existing ones (sound processing). To synthesize sounds we generally

want to model an entire timbre family (i.e., an instrument). That can be done by analyzing single tones and isolated note transitions performed on an instrument and building a database that characterizes the whole instrument, or any desired timbre family, from which new sounds are synthesized. In the case of the sound-processing application, the goal is to manipulate any given sound, that is, without being restricted to isolated tones and without requiring a previously built database of analyzed data [2]. Some of the intermediate results from this analysis/synthesis scheme, and some of the techniques developed for it, can also be applied to other music-related problems, e.g., sound compression, sound-source separation, musical acoustics, music perception, and performance analysis.

The Deterministic Plus Stochastic Model

A sound model assumes certain characteristics of the sound waveform or the sound-generation mechanism. In general, every analysis/synthesis system has an underlying model. The sounds produced by musical instruments, or by any physical system, can be modeled as the sum of a set of sinusoids plus a noise residual. The sinusoidal, or deterministic, component normally corresponds to the main modes of vibration of the system. The residual comprises the energy produced by the excitation mechanism that is not transformed by the system into stationary vibrations, plus any other energy component that is not sinusoidal in nature. For example, in the sound of wind-driven instruments, the deterministic signal is the result of the self-sustained oscillations produced inside the bore, and the residual is the noise generated by the turbulent streaming that takes place when the air from the player passes through the narrow slit. In the case of

bowed strings, the stable sinusoids are the result of the main modes of vibration of the strings, and the noise is generated by the sliding of the bow against the string and by other non-linear behavior of the combined bow-string-resonator system. This type of separation can also be applied to vocal sounds, percussion instruments, and even to non-musical sounds produced in nature. A deterministic signal is traditionally defined as anything that is not noise (i.e., an analytic signal, or perfectly predictable part, predictable from measurements over any continuous interval). However, in the present discussion the class of deterministic signals considered is restricted to sums of quasi-sinusoidal components (sinusoids with slowly varying amplitude and frequency). Each sinusoid models a narrowband component of the original sound and is described by an amplitude function and a frequency function. A stochastic signal is fully described by its power spectral density, which gives the expected signal power versus frequency. When a signal is assumed stochastic, it is not necessary to preserve either the instantaneous phase or the exact magnitude details of individual FFT frames. Therefore, the input sound is modeled by

s(t) = \sum_{r=1}^{R} A_r(t) \cos[\theta_r(t)] + e(t)

where A_r(t) and θ_r(t) are the instantaneous amplitude and phase of the r-th sinusoid, respectively, and e(t) is the noise component at time t (in seconds).
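A minimal synthesis-side sketch of this deterministic-plus-stochastic model is shown below; the partial frequencies, amplitudes, and noise-shaping filter are placeholder values chosen only to illustrate the structure of s(t), not parameters from any analyzed sound:

```python
import numpy as np
from scipy.signal import butter, lfilter

fs = 44_100
t = np.arange(fs) / fs          # one second of signal

# Deterministic part: R slowly varying sinusoidal partials (placeholder parameters).
partial_freqs = [200.0, 401.0, 603.0]          # Hz
partial_amps = [0.5, 0.3, 0.2]
deterministic = np.zeros_like(t)
for f, a in zip(partial_freqs, partial_amps):
    amp_env = a * np.exp(-2.0 * t)             # slowly varying amplitude A_r(t)
    phase = 2 * np.pi * f * t                  # theta_r(t) for a constant frequency
    deterministic += amp_env * np.cos(phase)

# Stochastic part: white noise shaped by a (here fixed) low-order filter, e(t).
b, a = butter(2, 4000 / (fs / 2), btype="low")
residual = 0.05 * lfilter(b, a, np.random.randn(len(t)))

s = deterministic + residual                   # s(t) = sum_r A_r(t) cos(theta_r(t)) + e(t)
```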

The model assumes that the sinusoids are stable partials of the sound and that each one has a slowly changing amplitude and frequency. The instantaneous phase is then taken to be the integral of the instantaneous frequency ω_r(t), and therefore satisfies

\theta_r(t) = \int_0^t \omega_r(\tau)\, d\tau

where ω_r(t) is the frequency in radians per second and r is the sinusoid number. By assuming that e(t) is a stochastic signal, it can be described as filtered white noise,

e(t) = \int_0^t h(t, \tau)\, u(\tau)\, d\tau

where u(τ) is white noise and h(t, τ) is the response of a time-varying filter to an impulse at time t. That is, the residual is modeled by the convolution of white noise with a time-varying, frequency-shaping filter [2].

Analysis/Synthesis Process: Sinusoids + Noise

The deterministic plus stochastic model has many possible implementations; we will present a general one while giving indications of variations that have been proposed. Both the analysis and the synthesis are frame-based processes, with the computation done one frame at a time. Throughout this description, we will consider that we have already processed a few frames of the sound and are ready to compute the next one.

Figure 2.11: Block diagram of the analysis process [2]

The figure above shows the block diagram for the analysis. First, we prepare the next section of the sound to be analyzed by multiplying it with an appropriate analysis window. Its spectrum is obtained by the Fast Fourier Transform (FFT), and the prominent spectral peaks are detected and incorporated into the existing partial trajectories by means of a peak-continuation algorithm. The relevance of this algorithm is that it detects the magnitude, frequency, and phase of the partials present in the original sound (the deterministic component). When the sound is pseudo-harmonic, a pitch-detection step can improve the analysis by using the fundamental-frequency information in the peak-continuation algorithm and in choosing the size of the analysis window (pitch-synchronous analysis) [2]. The stochastic component of the current frame is calculated by first generating the deterministic signal with additive synthesis and then subtracting it from the original

waveform in the time domain. This is possible because the phases of the original sound are matched, and therefore the shape of the time-domain waveform is preserved. The stochastic representation is then obtained by performing a spectral fitting of the residual signal.

Figure 2.12: Block diagram of the synthesis process [2]

The deterministic signal, i.e., the sinusoidal component, results from the magnitude and frequency trajectories, or their transformation, by generating a sine wave for each trajectory (i.e., additive synthesis). This can be implemented either in the time domain with the traditional oscillator-bank method or in the frequency domain using the inverse-FFT approach. The synthesized stochastic signal is the result of generating a noise signal with the time-varying spectral shape obtained in the analysis (i.e., subtractive synthesis). As with the deterministic synthesis, it can be implemented in the time domain by a convolution or in

the frequency domain by creating a complex spectrum (i.e., magnitude and phase spectra) for every spectral envelope of the residual and performing an inverse FFT [2].

M-Q Synthesis

The spectrum-modeling synthesis described above is mainly targeted toward musical signals. A similar application, using a slightly different algorithm, was proposed by Robert McAulay and Thomas Quatieri for voice and speech signals [5, 26]. In 1986, McAulay and Quatieri proposed a new method of analysis/synthesis for discrete-time speech signals that attempted to develop a reconstruction process resulting in the best possible approximation of the original signal. They modeled speech signals as two components. The first was an excitation signal consisting of a sum of sinusoids with time-varying amplitudes and frequencies, as well as an initial phase offset. The second component is the vocal tract, modeled as a time-variant filter with time-varying magnitudes and phases. These two components are combined and expressed as

s(t) = \sum_{l=1}^{L(t)} A_l(t)\, e^{j \Psi_l(t)}

where A_l(t) combines the time-varying magnitude response of the vocal tract and the amplitude of the excitation signal, and the phase of the exponential includes the time-varying phase of the vocal tract as well as the initial phase offset of the excitation signal.

Figure 2.13: McAulay-Quatieri sinusoidal analysis-synthesis system [5]

To find expressions for these sinusoids, they derived a new technique to analyze the signal. Using overlapping windowing methods similar to standard short-time analysis, the MQ method computes Fourier transforms of the individual windows. The peak frequencies of each window (the partials) are found, and their amplitudes and phases are extracted. The partials for each window are linked to those in the following window to develop a trend in the progression of frequencies (and of their amplitudes and phases). We call each progression a track. The birth of a track occurs when there is no partial in the previous window with which to connect one in the current window. Conversely, the death of a track occurs when there is no partial in the following window with which to connect one in the current window.
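The following Python sketch shows one simple way such frame-to-frame peak matching can be organized: a greedy nearest-frequency matcher with a maximum allowed frequency jump. The tolerance and data structures are our own simplification for illustration, not the exact MQ procedure:

```python
def continue_tracks(prev_peaks, new_peaks, max_jump_hz=50.0):
    """Match peak frequencies in the new frame to those in the previous frame.

    prev_peaks, new_peaks: lists of peak frequencies (Hz) for two consecutive frames.
    Returns (matches, births, deaths), where matches pairs indices (prev, new).
    """
    matches, used_new = [], set()
    for i, f_prev in enumerate(prev_peaks):
        # Greedy nearest-frequency match within the allowed jump.
        best_j, best_dist = None, max_jump_hz
        for j, f_new in enumerate(new_peaks):
            if j in used_new:
                continue
            dist = abs(f_new - f_prev)
            if dist <= best_dist:
                best_j, best_dist = j, dist
        if best_j is None:
            continue                      # no continuation found: this track dies
        matches.append((i, best_j))
        used_new.add(best_j)

    deaths = [i for i in range(len(prev_peaks)) if i not in {m[0] for m in matches}]
    births = [j for j in range(len(new_peaks)) if j not in used_new]
    return matches, births, deaths

# Example: the 550 Hz track dies, and a track near 1210 Hz is born.
print(continue_tracks([220.0, 440.0, 550.0], [221.5, 438.0, 1210.0]))
```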

Figure 2.14: McAulay-Quatieri sinusoidal model for speech [5]

The MQ model produces outstanding results and reproduces inaudibly different signals when applied to a wide variety of quasi-harmonic sounds. Perhaps its greatest advantage is the small amount of data required. To reproduce a signal using standard Fourier techniques, information about a great many coefficients must be retained; to reconstruct perfectly, an infinite number must be used. With the MQ method, only information about several time-varying sinusoids must be stored, and little else. One of the flaws of the MQ method is how it represents noise. Noise shows up as tracks that span only a small number of windows. It is difficult to represent these short tracks using sinusoids, so other methods must be developed (see the section on bandwidth enhancement).

In the analysis stage, the amplitudes, frequencies, and phases of the model are estimated on a frame-by-frame basis, while in the synthesis stage these parameter estimates are interpolated to allow for a continuous evolution of the parameters at all sample points between the frame boundaries.

The Sine-Wave Speech Model

In the speech-production model, the speech waveform s(t) is assumed to be the output of passing a vocal-cord (glottal) excitation waveform through a linear system representing the characteristics of the vocal tract. The excitation function is usually represented as a periodic pulse train during voiced speech, where the spacing between consecutive pulses corresponds to the pitch of the speaker. Alternately, the binary voiced/unvoiced excitation model can be replaced by a sum of sine waves.

Figure 2.15: Peak detection in the MQ approach [5]

The motivation for this sine-wave representation is that voiced excitation, when perfectly periodic, can be represented by a Fourier series decomposition in which each harmonic component corresponds to a single sine wave. Passing this sine-wave representation of the excitation through the time-varying vocal tract results in the sinusoidal representation of the speech waveform, which, on a given analysis frame, is described by

s(n) = \sum_{l=1}^{L} A_l \cos(\omega_l n + \phi_l)

where A_l and φ_l represent the amplitude and phase of each sine-wave component associated with the frequency track ω_l, and L is the number of sine waves [5].

Figure 2.16: McAulay-Quatieri sinusoidal analysis-synthesis (peak picking) [5]
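Peak picking of this kind is usually implemented as a search for local maxima in the magnitude spectrum that lie above some threshold. The sketch below is our own simplified illustration, not the exact MQ peak picker; a real implementation would also refine each peak frequency with parabolic interpolation:

```python
import numpy as np

def pick_peaks(frame: np.ndarray, fs: float, threshold_db: float = -60.0):
    """Return (frequencies_hz, magnitudes_db) of local maxima in the frame's magnitude spectrum."""
    windowed = frame * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    mag_db = 20 * np.log10(spectrum + 1e-12)

    peaks = []
    for k in range(1, len(mag_db) - 1):
        # A peak is a bin that is a local maximum and lies above the threshold.
        if mag_db[k] > mag_db[k - 1] and mag_db[k] > mag_db[k + 1] and mag_db[k] > threshold_db:
            peaks.append(k)

    freqs = [k * fs / len(frame) for k in peaks]
    return freqs, [mag_db[k] for k in peaks]

# Example: a frame containing 440 Hz and 1000 Hz tones (threshold chosen to skip window sidelobes).
fs = 8_000
n = np.arange(1024)
frame = np.sin(2 * np.pi * 440 * n / fs) + 0.5 * np.sin(2 * np.pi * 1000 * n / fs)
print(pick_peaks(frame, fs, threshold_db=20.0)[0])
```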

Spectral Models Related to the Sinusoidal Model

Additive synthesis is a traditional sound-synthesis method that is very close to the sinusoidal model. It has been used in electronic music for several decades [Roads 1995]. Like the sinusoidal model, it represents the original signal as a sum of sinusoids with time-varying amplitudes, frequencies, and phases [Moorer 1985]. However, it does not differentiate harmonic and inharmonic components. To represent non-harmonic components it requires a very large number of sinusoids, and it therefore gives its best results for harmonic input signals. Vocoders are another group of spectral models. They represent the input signal in multiple parallel channels, each of which describes the signal in a particular frequency band. Vocoders simplify the spectral information and therefore reduce the amount of data. The phase vocoder is a special type of vocoder that uses a complex short-time spectrum, thus preserving the phase information of the signal. The phase vocoder is implemented with a set of band-pass filters or with a short-time Fourier transform. The phase vocoder allows time-scale and pitch-scale modifications, as the sinusoidal model does [4]. The sinusoidal model was originally proposed by McAulay and Quatieri for speech coding [5] and by Smith and Serra [2, 11] [McAulay-Quatieri 1986; Smith & Serra 1987] for the representation of musical signals. Even though the systems were developed independently, they were quite similar. Some parts of the systems, such as the peak detection, were slightly different, but both systems had all the basic ideas needed for sinusoidal analysis and synthesis: the original signal was windowed into frames, and the

short-time spectrum was examined to obtain the prominent spectral peaks. The frequencies, amplitudes, and phases of the peaks were estimated, and the peaks were connected into sinusoidal tracks. The tracks were synthesized using linear interpolation for amplitudes and cubic polynomial interpolation for frequencies and phases. Serra [1989] was the first to decompose the signal into deterministic and stochastic parts, using a stochastic model together with the sinusoidal model. Since then, this decomposition has been used in several systems. The majority of noise-modeling systems use one of two approaches: the spectrum is characterized either by a time-varying filter or by the short-time energies within certain frequency bands [3].

Pitch-Synchronous Analysis

The estimation of the sinusoidal modeling parameters is a difficult task in general. Most of the problems are related to the analysis window length. If the input signal is monophonic or consists of harmonic voices that do not overlap in time, it is advantageous to synchronize the analysis window length to the fundamental frequency of the sound. Usually, the frequencies of the harmonic components of voiced sounds are integral multiples of the fundamental frequency. The advantage of pitch-synchronous analysis is most easily seen in the frequency domain: the frequencies of the harmonic components correspond exactly to the frequencies of the DFT coefficients. The estimation of the parameters is very easy, since no interpolation is needed, and the amplitudes and phases can be obtained directly from the complex spectrum. Also, pitch-synchronous analysis

allows the use of window lengths as small as one period of the sound, while non-synchronized windows must be 2-4 times the period, depending on the estimation method. This means that much better time resolution is gained by using pitch-synchronous analysis. Unfortunately, pitch-synchronous analysis cannot be utilized when several sounds with different fundamental frequencies occur simultaneously. In general, monophonic recordings represent only a small minority of musical signals, and therefore pitch-synchronous analysis typically cannot be used. To keep the complexity of the system low, pitch-synchronous analysis was not included in our system. Adaptive window length has been used successfully in modern audio coding systems, but in a quite different manner: a long window is used for stationary parts of the signal, and when rapid changes occur, the window is switched to a shorter one. This enables good frequency resolution for the stable parts and good time resolution during rapid changes.

Bandwidth-Enhanced Sinusoidal Modeling

The Reassigned Bandwidth-Enhanced Method [15], developed by Kelly Fitz, resolves the noise-modeling problems associated with the MQ method. Using the MQ method, signals are represented by a collection of sinusoidal components. The peaks in the spectrum of each window are linked together (short-time analysis). If the signal being represented has obvious high peaks or a clear trend in the frequencies from window to window, this analysis provides an accurate reconstruction. However, if a signal has

significant energy outside the peaks, or very high-frequency noise, the MQ method does not represent the signal adequately. Such signals are said to be noisy. The energy that cannot be represented is called noisy energy because it occupies frequencies with rapidly varying amplitude.

Figure 2.17: Lemur

These types of signals require many sinusoids to be represented sufficiently. The sinusoids that do represent them become tracks of short-duration partials with rapidly varying amplitudes and frequencies. It is difficult to distinguish the noisy tracks caused by external unwanted noise from the short, jittery tracks that belong to the wanted sound representation. The sinusoidal model does not provide a way of distinguishing noisy components from deterministic components. In addition, the representation of this noisy

signal is very fragile: time and frequency manipulation changes the phase, which destroys the properties of the sound and introduces errors into the reconstructed signal.

Figure 2.18: Lemur graphical tool

To provide a better way of representing noise, the Reassigned Bandwidth-Enhanced Method uses bandwidth-enhanced oscillators, which spread spectral energy away from the partial's center frequency. The partial's energy is increased while its bandwidth also increases relative to its spectral amplitude. The center frequency stays the same, so that energy is spread evenly on both sides. By removing the noisy tracks and increasing the bandwidth of neighboring tracks, the energy in the signal is conserved and a closer representation of the original signal can be constructed.

Figure 2.19: Bandwidth-enhanced sinusoidal modeling
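One way such a bandwidth-enhanced oscillator is often described is as a sinusoid whose amplitude is modulated by low-pass-filtered noise, so that a controllable fraction of the partial's energy becomes narrowband noise around its center frequency. The sketch below is our own simplified interpretation of that idea; the parameter names, filter choice, and scaling are illustrative assumptions, not the Loris implementation:

```python
import numpy as np
from scipy.signal import butter, lfilter

def bwe_partial(freq_hz, sine_amp, noise_amp, duration_s, fs=44_100, noise_bw_hz=200.0):
    """Bandwidth-enhanced partial: a sinusoid whose amplitude is modulated by
    low-pass-filtered noise, spreading energy around the center frequency."""
    t = np.arange(int(fs * duration_s)) / fs
    # Slowly varying noise term, band-limited to roughly noise_bw_hz.
    b, a = butter(2, noise_bw_hz / (fs / 2), btype="low")
    noise = lfilter(b, a, np.random.randn(len(t)))
    noise /= np.max(np.abs(noise)) + 1e-12
    # Amplitude = steady sinusoidal part plus a noise-modulated part.
    amplitude = sine_amp + noise_amp * noise
    return amplitude * np.cos(2 * np.pi * freq_hz * t)

# A mostly sinusoidal partial versus a much noisier one at the same center frequency.
clean = bwe_partial(440.0, sine_amp=1.0, noise_amp=0.05, duration_s=0.5)
noisy = bwe_partial(440.0, sine_amp=0.6, noise_amp=0.6, duration_s=0.5)
```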

These bandwidth-enhanced oscillators can now be used to synthesize a sound signal from components that have varying frequencies, amplitudes, and concentrations of noise and sinusoidal energy. A greater variety of sounds can be represented with greater accuracy while still using the sinusoidal model, with longer, better-defined tracks. Bandwidth-enhanced partials allow us to manipulate noise representations without discarding the desired noise content, and the bandwidth-enhanced sinusoidal model provides a way to distinguish non-sinusoidal noise that must be removed.

PHASE SYNTHESIS

A summary of the importance of phase to audibility is given here. In this research work we examine the importance of the phase parameters in the high-frequency region of the spectrum. Models of pitch perception found in the literature often discard the phase of the frequency components. This contradicts time-domain models, where the pitch of a complex tone is given as a function of the time interval between peaks in the waveform in some dominant region of the basilar membrane. To verify whether the relative phase of

harmonics of a complex tone is of importance to the perception of pitch, Moore conducted a number of experiments. He concluded that phase did in fact have an effect on perceived pitch in some cases; most often, however, it only affected the strength of the perceived pitch. Cariani and Delgutte later verified this. Terhardt's view of the pitch of complex tones assumes that only the frequency spectrum of the stimulus is important in determining pitch, the relative phase of the frequency components being irrelevant. In the words of Schouten, the pitch of a complex tone is given by the reciprocal of the time interval between corresponding peaks in the fine structure of the waveform evoked at some dominant region of the basilar membrane. The fine structure of a waveform may be influenced by changes in the relative phase of the components, and thus under some circumstances pitch ought to be affected by relative phase. Bennen stated that relative phase can affect pitch, but that the effect is not mediated by changes in the temporal structure of the waveform. According to the temporal model, two types of changes in pitch perception might occur with changes in the relative phase of the components: a change in pitch value and a change in the clarity of the pitch. [17]

The frequency-domain representation of periodic sounds was studied by Ohm, Helmholtz, and Hermann in particular. Ohm stated that changes in the phase spectrum, although they altered the wave shape, did not affect its aural effect. Helmholtz developed a method of harmonic analysis with acoustic resonators. According to these studies, the ear is phase-deaf, and timbre is determined exclusively by the spectrum. Such

conclusions are still considered essentially valid for periodic sounds only, because Fourier series analysis-synthesis works only for those. It was Ohm who first postulated, in the early nineteenth century, that the ear was in general phase-deaf. This view was a gross simplification. There are many instances in which phase plays an extremely important role in the perception of sound stimuli. In fact, it was Helmholtz who noted that Ohm's law did not hold for simple combinations of pure tones. However, for non-simple tones Ohm's law seems to be well supported by the psychoacoustic literature. The importance of phase in perceiving musical sounds was demonstrated by Clark, who clearly showed that in the absence of phase information, acoustic waveforms sounded unrealistic.

Effect of phase on timbre

One of the dimensions that governs the quality, or timbre, of a musical instrument is the directionality of the sound. The directionality, binaural or monaural, is determined by the phase of the sound signal. The older literature in general does not pay much attention to the phase of the signal, but considering its directional property and spatialization, phase plays a major role; when we are prepared to leave some dimensions out of the synthesis, however, phase does not seem to be of much importance. Changes in timbre are not distinct enough to be observed after the few seconds required to alter the phases; in any case, these changes are too small to carry over from one vowel to

another. Harmonics beyond the sixth to eighth give dissonances and beats, so it is not excluded that, for these higher harmonics, a phase effect exists. The maximum effect of phase on timbre is the difference between a complex tone in which the harmonics are in phase and one in which alternate harmonics differ in phase by 90 degrees. The effect of lowering each successive harmonic by 2 dB is greater than the maximum phase effect described above. The effect of phase on timbre appears to be independent of sound level and of the spectrum. Phase is perceptually important in many situations. However, it remains a question to what extent phase is important in modeling naturally occurring sounds using spectral sound models. For non-periodic parts of a sound, such as transients, the phase of the frequency components is of greater importance to the perception of the sound than it is for steady parts. [10] The sound synthesized using the STFT is

s(t) = \sum_{n=0}^{k} a_n(t) \cos[\theta_n(t)]

where k is the number of partials, a_n(t) is the time-varying amplitude, and θ_n(t) is the time-varying phase. For synthesis without phase information, θ_n(t) is simply obtained by integration of the measured frequency values over time: [24]

\theta_n(t) = \int_0^t \omega_n(\tau)\, d\tau

Phase is an important parameter when performing additive analysis/synthesis of binaural recordings. If the phase is left out, the ability to perceive the spatial qualities of sounds is substantially degraded. Phase is important for all incident positions except front/back, whereas the spectral envelope is mainly influential in the lateral positions. Perceptually important localization cues are expected to be less present in sounds synthesized without phase than in those synthesized with phase information. [24]

Perceptual Coding vs. the Synthesis-Based Approach

Perceptual coding is a digital audio coding technique that reduces the amount of data needed to produce high-quality sound. Perceptual digital audio coding takes advantage of the fact that the human ear screens out a certain amount of sound that is perceived as noise. Reducing, eliminating, or masking this noise significantly reduces the amount of data that needs to be provided. With perceptual coding of a recording, physical identity is waived in favor of perceptual identity. Using a psychoacoustical model of the human auditory system, the codec identifies imperceptible signal content (to remove irrelevancy) as bits are allocated. The signal is then coded efficiently (to avoid redundancy) in the final bit stream. These steps reduce the quantity of data needed to represent an audio signal. The intent is to hide quantization noise below signal-dependent thresholds of hearing and then code as efficiently as possible. The method asks how much noise can be introduced into the signal without becoming audible.

In the view of many observers, compared to newer perceptual coding methods, pulse-code modulation is a powerful but inefficient dinosaur. Because of its appetite for bits, PCM coding is limited in its usefulness. The desire to achieve lower bit rates through perceptual coding is appealing because it opens new applications for digital audio (and video) with acceptable signal degradation. Through psychoacoustics, we can understand how information is perceived by the ear [8].

Masking and Perceptual Coding

The world presents us with a multitude of sounds simultaneously. We automatically accomplish the task of distinguishing each of the sounds and attending to the ones of greatest importance. It is often difficult to hear one sound when a much louder sound is present. This process seems intuitive, but on the psychoacoustic and cognitive levels it becomes very complex. The term for this process is masking. In order to gain a broad and thorough understanding of the masking phenomenon, we can survey the definition and its accompanying explanation from several viewpoints. Masking, as defined by the American Standards Association (ASA), is the amount (or the process) by which the threshold of audibility for one sound is raised by the presence of another (masking) sound. For example, a loud car stereo could mask the car's engine noise. The term was originally borrowed from studies of vision, meaning the failure to recognize the presence of one stimulus in the presence of another at a level normally adequate to elicit the first perception.

The purpose of any data-reduction system is to decrease the data rate, the product of the sampling frequency and the word length. This can be accomplished by decreasing the sampling frequency; however, the Nyquist theorem dictates a corresponding decrease in high-frequency audio bandwidth. Another approach uniformly decreases the word length; however, this reduces the dynamic range of the audio signal by 6 dB per bit, thus increasing the quantization noise. A more enlightened approach uses psychoacoustics. Perceptual coders maintain the sampling frequency but selectively decrease the word length; word-length reduction is done dynamically based on signal conditions [8]. Perceptual coders analyze the frequency and amplitude content of the input signal and compare it to a model of human auditory perception. Using the model, the encoder removes the irrelevancy and statistical redundancy of the audio signal. In theory, although the method is lossy, the human perceiver will not hear degradation in the decoded signal. Considerable data reduction is possible. For example, a perceptual coder might reduce a channel's bit rate from 768 kbps to 128 kbps; a word length of 16 bits/sample is reduced to an average of 2.67 bits/sample, and the data quantity is reduced by about 83%. A well-designed perceptually coded recording, with a conservative level of reduction, can rival the sound quality of a conventional recording because the data is coded in a much more intelligent fashion, and, quite simply, because we do not hear all of what is recorded anyway. Perceptual coders are so efficient that they require only a fraction of the data needed by a conventional system. Part of this efficiency stems from the adaptive quantization used by most perceptual coders. With PCM, all signals are given equal word lengths. Perceptual coders assign

bits according to audibility. A prominent tone is given a large number of bits to ensure audible integrity; conversely, fewer bits can be used to code soft tones, and inaudible tones are not coded at all. Together, these steps achieve bit-rate reduction. A coder's reduction ratio is the ratio of input bit rate to output bit rate. Reduction ratios of 4:1, 6:1, or 12:1 are common. Perceptual coders have achieved remarkable transparency, so that in many applications the reduced data is audibly indistinguishable from linearly represented data. Tests show that ratios of 4:1 or 6:1 can be transparent. [8]

Critical Bands

To determine the threshold of audibility, an experiment must be performed. A typical masking experiment might proceed as follows. A short (about 400 ms) pulse of a 1,000 Hz sine wave acts as the target, i.e., the sound the listener is trying to hear. Another sound, the masker, is a band of noise centered on the frequency of the target (the masker could also be another pure tone). The intensity of the masker is increased until the target cannot be heard. This point is then recorded as the masked threshold. Another way of proceeding is to slowly widen the bandwidth of the noise without adding energy to the original band. The increased bandwidth gradually causes more masking until a certain point is reached at which no more masking occurs. This bandwidth is called the critical band [6]. We can keep extending the masker until it is full-bandwidth white noise, and it will have no more effect than at the critical band.
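Table 2.1 below lists critical bandwidth as a function of center frequency. A commonly cited analytic approximation due to Zwicker and Terhardt expresses the same trend directly from the center frequency; the short sketch below evaluates it at a few frequencies (the tabulated band edges come from listening experiments, so the formula only approximates them):

```python
def critical_bandwidth_hz(fc_hz: float) -> float:
    """Approximate critical bandwidth (Hz) at center frequency fc_hz,
    using the Zwicker-Terhardt formula: 25 + 75 * (1 + 1.4 (f/kHz)^2)^0.69."""
    f_khz = fc_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69

# Critical bands are roughly 100 Hz wide at low frequencies and widen toward high frequencies.
for fc in (100, 500, 1000, 2000, 5000, 10000):
    print(f"{fc:>6} Hz -> ~{critical_bandwidth_hz(fc):.0f} Hz wide")
```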

Bark Band | Center Frequency (Hz) | Critical Bandwidth (Hz) | Low-Frequency Cutoff (Hz) | High-Frequency Cutoff (Hz)

Table 2.1: Critical bandwidth as a function of center frequency and critical-band rate [8]

Critical bands grow larger as we ascend the frequency spectrum; conversely, there are many more bands in the lower frequency range, because they are smaller. Critical bands seem to be formed at some level by an auditory filter bank. Critical bands and their center frequencies are continuous, as opposed to having strict boundaries at specific frequency locations; therefore, the filters must be easily variable. Use of the auditory filter bank may be the unconscious equivalent of willfully focusing on a specific frequency range.

Non-Simultaneous Masking

The ASA definition of masking does not address non-simultaneous masking. Sometimes a signal can be masked by a sound preceding it, called forward masking, or even by a sound following it, called backward masking. Forward masking results from the accumulation of neural excitation, which can persist for up to 200 ms. In other words, neurons store the initial energy and cannot receive another signal until after they have passed it on, which may take up to 200 ms. Forward-masking effects are slight because maskers need to be within the same critical band, and even then they do not have the broad

masked audiograms of simultaneous masking. Likewise, backward masking occurs only under tight tolerances.

Central Masking and Other Effects

Another way to approach masking is to ask at what level it occurs. Studies in cognition have shown that masking can occur at or above the point where the audio signals from the two ears combine. The threshold of a signal entering one ear monaurally can be raised by a masker entering the other ear monaurally. This phenomenon is referred to as central masking, because the effect occurs between the ears. Spatial location can have a profound effect on the effectiveness of a masker. Many studies have shown that unintelligible speech can be understood once its source is separated in space from the interference. The effect holds whether the sources are physically separated or perceptually separated through the use of interaural time delay. Asynchrony of the onsets of two sounds has been shown to help prevent masking, as long as the onset difference does not fall within the realm of non-simultaneous masking. Each 10 ms increase in the inter-onset interval was perceived as being equal to a 10 dB increase in the target's intensity [6].

Fusion

The concept of fusion must be included in any intelligent discussion of masking, because the two are similar and often confused. In both cases, the distinct qualities of a sound are lost, and both phenomena respond in the same manner to the same variables. In fusion, as in masking, the target sound cannot be identified, but in fusion the masker takes on a

different quality. The typical masking experiment does not necessarily provide a measure of perceptual fusion. In a fusion experiment, on the other hand, listeners are asked whether they can or cannot hear the target in the mixture or, even better, to rate how clearly they can hear it there. What we want to know is whether the target has retained its individual identity in the mixture. [6] Fusion takes into consideration the interactive, global effects of two sound sources on each other, instead of trying to reduce the situation to two separate and distinct entities. Masking experiments are concerned with finding the threshold at which the target cannot be identified, ignoring the effect of the target on the masker. The use of psychoacoustic principles in the design of audio recording, reproduction, and data-reduction devices makes perfect sense. Audio equipment is intended for interaction with humans, with all their abilities and limitations of perception. Traditional audio equipment attempts to produce or reproduce signals with the utmost fidelity to the original. A more appropriately directed, and often more efficient, goal is to achieve the fidelity perceivable by humans. This is the goal of perceptual coders. The psychoacoustic model is the heart of a perceptual coder. Generally, a psychoacoustic model performs a time-to-frequency mapping, determines maximum SPL levels, determines the threshold in quiet, identifies tonal and non-tonal components, decimates the maskers, calculates masking thresholds, determines the global masking threshold, determines minimum masking thresholds, and calculates signal-to-mask ratios.
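One of those steps, the threshold in quiet, is often computed from a closed-form approximation to the absolute hearing threshold (Terhardt's formula, widely quoted in perceptual-coding texts); a minimal sketch:

```python
import math

def threshold_in_quiet_db(f_hz: float) -> float:
    """Approximate absolute threshold of hearing (dB SPL) at frequency f_hz,
    using Terhardt's commonly quoted formula."""
    f = f_hz / 1000.0  # frequency in kHz
    return 3.64 * f ** -0.8 - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2) + 1e-3 * f ** 4

# The ear is most sensitive (lowest threshold) in the few-kHz region.
for f in (100, 500, 1000, 3500, 10000, 15000):
    print(f"{f:>6} Hz: {threshold_in_quiet_db(f):6.1f} dB SPL")
```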

Figure 2.20: MPEG audio compression and decompression

Although one main goal of digital audio perceptual coders is data reduction, this is not a necessary characteristic; perceptual coding can also be used to improve the representation of digital audio through advanced bit allocation. Not all data-reduction schemes are perceptual coders. Some systems, the DAT 16/12 scheme for example, achieve data reduction by simply reducing the word length, in this case cutting off four bits from the least-significant side of the data word and achieving a 25% reduction. The data-reduction scheme in the present research work, however, uses a different approach based on partial sound synthesis that relies on the auditory phenomenon of human sensitivity to certain frequencies. The mid-frequencies are modulated to baseband and downsampled in time; at the receiver, an upsampling process interpolates the in-between samples, and the signal is then demodulated back to its original spectral location. This research is

not a perceptual-coding scheme, although it takes advantage of perceptual phenomena.

Out of a desire for simplicity, the first digital audio systems were wide-band systems, tackling the entire audio spectrum at once. Presently, perceptual coders are multiband systems, dividing up the spectrum in a fashion that mimics the critical bands of psychoacoustics. By modeling human perception, perceptual coders can process signals much the way humans do and take advantage of phenomena such as masking. When adaptive differential pulse-code modulation (ADPCM) is used, the frequency spectrum is divided into four bands to remove imperceptible material. Once a determination is made as to what can be discarded, the remainder is allocated the available number of bits. This process is called dynamic bit allocation.

History of Synthesis-Based Audio Data Reduction

Synthesis-based music data reduction is a relatively young area of research. Xavier Serra's sinusoidal plus stochastic residual noise model [2, 11] was an effective synthesis-based data-reduction system. A similar idea was MQ synthesis by McAulay and Quatieri [5, 20, 26], where the application was data reduction for digital speech processing. This research work is about improving on or replacing these two powerful music and speech data-reduction approaches; the third chapter forms the research project. Eric Scheirer wrote a synthesis-based data-reduction approach as part of MPEG-4, 7, and 21: the Structured Audio Orchestra Language (SAOL). Unlike these historical synthesis-based data-reduction schemes, which employ complete synthesis

techniques, our data-reduction scheme is a partial synthesis scheme. There is no connection to SAOL in this research work; however, we review it here because we are reviewing the history of synthesis-based approaches to data reduction.

AN OVERVIEW OF SAOL (Structured Audio Orchestra Language)

SAOL is a powerful, flexible language for describing music synthesis and for integrating synthetic sound with "natural" (recorded) sound in an MPEG-4 bit stream. MPEG-4 integrates the two common methods of describing audio on the internet today: streaming low-bit-rate coding and structured audio descriptions (like MIDI files). SAOL lives within the MPEG-4 paradigm of streaming data and decoding processes. Thus, the Structured Audio toolset is not only a method of synthesis but also a streaming format appropriate for transmission of audio data over the internet or any other channel. The saolc package contains a program for encoding scores and orchestras into the streaming format, and a facility for decoding this format. MPEG-4 Structured Audio has its roots in another Media Lab project called NetSound, developed by Michael Casey and other members of the Machine Listening group at the MIT Media Lab. NetSound has similar concepts to MPEG-4 Structured Audio but uses Csound, developed by Barry Vercoe, for synthesis. There are five major elements in the Structured Audio toolset: The Structured Audio Orchestra Language (SAOL) is a digital-signal-processing language that allows for the description of arbitrary synthesis and control algorithms

as part of the content bit stream. The syntax and semantics of SAOL are standardized in a normative fashion. The Structured Audio Score Language (SASL) is a simple score and control language used in certain profiles to describe the manner in which sound-generation algorithms described in SAOL are used to produce sound. The Structured Audio Sample Bank Format (SASBF) allows for the transmission of banks of audio samples to be used in wavetable synthesis, along with the description of simple processing algorithms to use with them. A normative scheduler description is also included: the scheduler is the supervisory run-time element of the Structured Audio decoding process, mapping structural sound control, specified in SASL or MIDI, to real-time events dispatched using the normative sound-generation algorithms. Finally, there is a normative reference to the MIDI standards, standardized externally by the MIDI Manufacturers Association. MIDI is an alternate means of structural control that can be used in conjunction with, or instead of, SASL. Although less powerful and flexible than SASL, MIDI support in the standard provides important backward compatibility with existing content and authoring tools. [16]

Our research does not focus on synthesis methods that use an audio/music description language, nor does it include a package for writing music scores; such a project is reserved as a future extension of this research work. Our method transmits the PCM samples to which we are perceptually most sensitive and synthesizes the spectral regions to which we are less sensitive. Details of the project implementation follow in the next chapter.

CHAPTER 3

THE RESEARCH PROJECT

A high-fidelity audio data-reduction scheme using a Partial Sinusoidal Modeling Synthesis scheme (PSMS)

While perceptual coding is based on the energy levels of our perception and on the masking phenomenon, the synthesis-based data-reduction approach presented here is based on making use of our complex pitch perception. In other words, the former takes the vertical amplitude scale of hearing into account, coding the bits we hear according to a loudness criterion, whereas the latter works along the horizontal frequency scale, making use of the complex pitch-perception mechanisms that our auditory system somehow manages to perform. This project takes advantage of the auditory system's complexity in order to engineer the music product.

A real-world example

We hear music inside a room. When we come out of the room and close the door, we still hear the music, with an obvious attenuation in the energy level we perceive. The door acts as an attenuator; it also filters out some frequency components. Since our auditory system is sensitive to mid-frequencies, it still perceives most of the mid-frequency content. We hear the pitch (fundamental) but not necessarily the timbre. The music that we perceive after the door is closed is essentially the mid-frequency spectrum. We

transmit the mid-frequency content; the low and high frequencies are synthesized. The physical spectrum of the synthesized parts will differ, widely or slightly, over each short time period. However, the variation in spectral shape does not mean that the sound will be perceived as different from the original. The synthesized content will sound very close to the original content because our auditory system does not follow spectral shapes exactly. Hence we make use of our inability to follow complex pitch exactly, the missing-fundamental concept, and the other concepts discussed in detail in Chapter 2. Another example is our daily conversation over telephones and mobile phones, where we hear only between about 1 and 4 kHz because of the limited frequency range allotted to the speech signal; this helps for transmission purposes. The aim of our project is not to conduct fundamental research on the complex topic of pitch perception in the auditory system, but to make use of those established ideas to engineer a music compression and synthesis system. Frequency is a literal measurement; pitch is not. Pitch is a subjective, complex characteristic based on frequency as well as other physical quantities such as waveform and intensity. For example, if a 200 Hz sine wave is sounded first at a soft and then at a louder level, most listeners will agree that the louder sound has a lower pitch. In fact, a 10% increase in frequency might be necessary to maintain a listener's subjective evaluation of constant pitch at low frequencies. On the other hand, in the ear's most sensitive region, 1 to 5 kHz, there is almost no change in pitch with loudness. Also, with musical tones, the effect is much smaller [8].

Figure 3.1: Perceptual coding approach vs. synthesis-based approach

Humans can identify pitch defects very easily in the sensitive mid-frequency region, but in this work we do not approximate, alter, or modify the mid-frequency samples. We transmit them as they are, thereby avoiding the possibility of creating a defective signal. This chapter covers four major topics: 1. the two-filter method; 2. the frequency-resolution downsampling method; 3. the four-filter method; 4. the advantages of the pure sinusoidal model over a sine plus noise model in the two-filter method.

FLETCHER AND MUNSON CONTOURS: DATA SETS

In order to analyze the success of the four topics mentioned above, visual Fletcher-Munson plots will be useful; therefore the full data set is included in the figures.

Figure 3.2: Fletcher-Munson original curves (Fig. 3 of [1])

91 79 Figure 3.3: Figure 2 mentioned in [1] From these curves in Figure 3.3 loudness level contours can be drawn. The first sets of loudness level contours are plotted with levels above the reference threshold as ordinates.

92 80 For example, the zero loudness level contours correspond to points where the curves of figure 3.3 intersect the abscissa axis. The number of db above these points is plotted as the ordinate in the loudness level contour shown in Figure 3.4 [1]. Figure 3.4: Figure 3 in [1] In Figure 3.2 similar sets of loudness level contours are shown using intensity levels as ordinates.

Figure 3.5: MATLAB plot of Figure 3.4

The formulae for computing the Figure 3 plots of the 1933 Fletcher-Munson paper are given in Table BI of the paper cited in [22]. This is a nonlinear regression fit to the raw data of the 1933 paper rather than to the published curves, which are quite smoothed. Each contour is a cubic polynomial in the loudness level x (in phons),

    y = C3*x^3 + C2*x^2 + C1*x + C0,

where C0, C1, C2, and C3 are the fitted polynomial coefficients and y is the sound pressure level in dB. A separate coefficient set, y(1) through y(11), is given for each of the eleven measurement frequencies: 62, 125, 250, 500, 1000, 2000, 4000, 5650, 8000, 11300, and 16000 Hz. At 1000 Hz the relation reduces to y = x, since the loudness level in phons is defined with reference to a 1 kHz tone.

Table 3.1: Fletcher-Munson curves, data sets (phons versus sound pressure levels in dB)
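As a concrete illustration of how such a fitted contour can be evaluated, the sketch below applies the general cubic form to a grid of loudness levels. The coefficient values shown are placeholders for illustration only; they are not the values from Table BI of [22].

```python
import numpy as np

def contour_spl(loudness_phon, coeffs):
    """Evaluate y = C3*x^3 + C2*x^2 + C1*x + C0 for one frequency's contour.

    x is the loudness level in phons; y is the sound pressure level in dB
    required to reach that loudness at the frequency the coefficients were
    fitted for.  `coeffs` = (C3, C2, C1, C0), placeholders in this example.
    """
    c3, c2, c1, c0 = coeffs
    x = np.asarray(loudness_phon, dtype=float)
    return c3 * x**3 + c2 * x**2 + c1 * x + c0

# Hypothetical coefficients for a single contour (illustrative only).
example_coeffs = (1.0e-5, -2.0e-3, 1.1, 5.0)
print(contour_spl([10, 20, 40, 60, 80], example_coeffs))
```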

THE GENERAL PROCEDURE

Two-Filter Method

Figure 3.6: Schematic block diagram employed in our synthesis-based data reduction (two-filter method)

We are most sensitive to the mid frequencies around 3300 Hz. The input audio signal is filtered with a band-pass filter with cutoff frequencies of 0.9 kHz and 6.1 kHz, which passes these mid frequencies. A band-elimination filter with cutoff frequencies of 1.1 kHz and 5.9 kHz removes the mid band from the input signal. A 200 Hz overlap is set between the transition bands to avoid phase distortion and spectral leakage. The output of the band-pass filter is the most sensitive data that we hear. The spectral contents above the threshold of hearing are modulated to baseband and downsampled in time. The base

band signal is transmitted over the communication channel. At the receiver, the baseband is reconstructed back to the pass-band region. The band-pass filter (0.9 kHz and 6.1 kHz) and the band-elimination filter (1.1 kHz and 5.9 kHz) can be designed according to one's needs; the cutoff frequencies here were chosen from the Fletcher-Munson curves. Moreover, the Fletcher-Munson curves provide information about the threshold of hearing, which helps the peak detection algorithm. The output of the band-rejection filter forms the less sensitive data, and this output forms the input to a sinusoidal/spectral model [2]. A short-time Fourier transform is applied to windowed time frames with a 75% frame overlap. The length of the window is fixed such that it captures the periodicity more than once. The 75% overlap is set to avoid signal leakage: normally a minimum of 75% overlap for Hamming windows and 50% overlap for rectangular windows is needed, because the main-lobe width of the Hamming window is four bins and that of the rectangular window is two bins. The hop size is the analysis frame length divided by the main-lobe width in bins. In any analysis-synthesis method the choice of window is critical; in this case the Kaiser and Hanning windowing schemes turned out to be more successful than the other windowing schemes. The different windowing schemes are discussed in more detail later in this chapter. The short-time Fourier spectrum was computed and the prominent peaks in the power spectrum were picked. A peak here is defined as a local maximum. There are chances

where two peaks might be very close to each other. In such cases the larger of the two peaks is chosen and the other peak is deleted. This helps maintain the frequency characteristic of a sinusoid while connecting the peaks into frequency-distinct tracks.

Figure 3.7: The two-filter method: band-pass and band-elimination filters; violin spectrum, Fs = 44100 Hz, mono, 16 bit

Modulation and Demodulation: MF Band

The output of the band-pass filter contains the most sensitive data, which has to be transferred through the channel as PCM audio samples. The sensitive data is modulated to baseband with a sampling frequency twice the highest frequency in the baseband. The baseband signal is downsampled in time by a factor of four. This reduces the data through the

channel. On the receiver end, the data is upsampled by a factor of four to recover the original length, moved back to the original pass band at the sampling rate of the input mid band, and fused with the less sensitive data that forms the output of the oscillator at the synthesis end. However, extreme care should be taken while dealing with the sensitive data, because any artifacts in this data will be clearly audible.

Figure 3.8: Modulation and demodulation of sensitive data (country music, 44.1 kHz, 16 bit, mono at 705 kbps)

In Figure 3.8 a downsampling factor of two is used; with a factor of four, better data reduction can be obtained.
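A minimal sketch of this two-filter front end is given below, assuming SciPy's filter design and resampling routines. The filter order, the real-carrier mixing used to move the mid band toward baseband, and the decimation factor shown are illustrative choices rather than values prescribed by this thesis.

```python
import numpy as np
from scipy import signal

FS = 44100  # input sampling rate (Hz)

# Band-pass (sensitive mid band) and band-elimination (LF + HF residual) filters.
BP_SOS = signal.butter(4, [900, 6100], btype="bandpass", fs=FS, output="sos")
BS_SOS = signal.butter(4, [1100, 5900], btype="bandstop", fs=FS, output="sos")

def split_bands(x):
    mid = signal.sosfilt(BP_SOS, x)    # transmitted as PCM samples
    rest = signal.sosfilt(BS_SOS, x)   # fed to the sinusoidal model
    return mid, rest

def modulate_and_decimate(mid, f_shift=900.0, factor=4):
    """Shift the mid band down toward baseband and reduce the sample rate.

    Mixing with a real cosine carrier at the lower band edge is one simple
    way to do this; decimate() applies its own anti-aliasing filter.
    """
    n = np.arange(len(mid))
    baseband = mid * np.cos(2 * np.pi * f_shift * n / FS)
    return signal.decimate(baseband, factor)

def interpolate_and_demodulate(baseband, f_shift=900.0, factor=4):
    """Receiver side: restore the original rate and move back to the pass band."""
    restored = signal.resample_poly(baseband, factor, 1)
    n = np.arange(len(restored))
    return restored * np.cos(2 * np.pi * f_shift * n / FS)
```

In a real system the unwanted mixing image left by the real carrier would be filtered out after demodulation, for instance by re-applying the band-pass filter before fusion, as the thesis later notes for suppressing transition-band distortion.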

Channel

The modulated mid-frequency content is transmitted over the communication channel. During transmission, the channel noise, which consists of transmission noise and reception noise, will affect the original data.

Partial Sinusoidal Modeling Synthesis (PSMS): LF and HF bands

The main advantage of spectrum modeling techniques is the existence of analysis procedures that extract the synthesis parameters from real sounds, making it possible to reproduce and modify actual sounds. SMS is based on modeling sounds as stable sinusoids (partials) plus noise (a residual component), analyzing sounds with this model and generating new sounds from the analyzed data. Before the analysis begins, the band-elimination filter removes the mid-frequency spectral data to which we are sensitive. The analysis procedure detects partials by studying the time-varying spectral characteristics of a sound and represents them with time-varying sinusoids. The synthesis procedure is an additive synthesis method in which the instantaneous amplitudes, frequencies, and phases are fed into separate oscillators and all sinusoids are added frame by frame. In audio signal spectrum modeling, the aim is to transform a signal into a more easily applicable form, removing the information that is irrelevant to perception. Sufficient time and frequency resolution are also difficult to achieve at the same time. The standard pulse code modulated (PCM) signal, which basically describes the sound pressure levels reaching the ear, is not a good representation for the analysis of sounds. A

100 88 general approach is to use spectrum modeling, or a suitable middle-level representation to transform the signal into a form that can be generated easily from the PCM signal, but from which also the higher level information can be more easily obtained. The sinusoids+noise model is one of these representations. The sinusoidal part utilizes the physical properties of general resonating systems by representing the resonating components by sinusoids. The noise model utilizes the inability of humans to perceive the exact spectral shape or phase of stochastic signals. The sinusoids+noise model has the ability to remove irrelevant data and encode signals with lower bit rate. It has also been successfully used in audio and speech coding [3]. This project focuses on only a complete sinusoidal model. PSMS using a sine plus noise model for less sensitive data synthesis will be a future extension of the project. A short overview of the process theoretically and the results obtained during the step-bystep implementation follows. At first, the input signal is analyzed to obtain time-varying amplitudes, frequencies and phases of the sinusoids. Then, the sinusoids are synthesized. In the parametric domain, we can make modifications to produce effects like pitch shifting or time stretching. The analysis of sinusoids is the most complex part of the system. Firstly, the input signal is divided into partly overlapping and windowed frames. Secondly, the short-time spectrum of the frame is obtained by taking a discrete Fourier transform (DFT/FFT). The spectrum is analyzed, prominent spectral peaks are detected,and their parameters, amplitudes, frequencies, and phases are estimated. Once the amplitudes, frequencies and phases of the detected sinusoidal peaks are estimated, they are connected to form interframe trajectories. A peak continuation algorithm tries to find

the appropriate continuations for existing trajectories from among the peaks of the next frame. The obtained sinusoidal trajectories contain all the information required for the resynthesis of the sinusoids: they can be synthesized by interpolating the parameters of the trajectories and summing the resulting waveforms in the time domain. The next phases of Partial SMS are magnitude and phase spectra computation, peak detection, peak continuation, cubic spline interpolation between time frames, modification of the analysis data, synthesis, and fusion of the sensitive and less sensitive data. The first step is the magnitude and phase spectra computation.

FFT Analysis of Low and High Frequency Bands

The computation of the magnitude and phase spectra of the current frame is the first step in the analysis. The control parameters for the STFT (window size, window type, FFT size, and frame rate) have to be set in accordance with the sound to be processed. First of all, good spectral resolution is needed, since the process that tracks the partials has to be able to identify the peaks. If the PSMS involves a sine plus noise model, the phase information is particularly important for subtracting the deterministic component to find the residual; we should then use an odd-length analysis window, and the windowed data should be centered in the FFT buffer at the origin to obtain a phase spectrum free of the linear phase trend induced by the window position ("zero-phase" windowing).
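A small sketch of such zero-phase windowing is shown below: the odd-length windowed frame is rotated so that its centre sample sits at index 0 of the FFT buffer, removing the linear phase trend. The Hanning window and buffer size are illustrative choices.

```python
import numpy as np

def zero_phase_spectrum(frame, n_fft):
    """Window an odd-length frame and centre it at the origin of the FFT buffer."""
    m = len(frame)
    assert m % 2 == 1 and n_fft >= m, "use an odd window length and a zero-padded FFT"
    windowed = frame * np.hanning(m)
    half = (m - 1) // 2
    buf = np.zeros(n_fft)
    buf[:half + 1] = windowed[half:]   # centre and right half of the frame
    buf[-half:] = windowed[:half]      # left half wrapped to the end of the buffer
    spectrum = np.fft.rfft(buf)
    mag_db = 20 * np.log10(np.abs(spectrum) + 1e-12)
    return mag_db, np.angle(spectrum)
```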

102 90 Figure 3.9: Original and Windowed short time signals, Fourier analysis (Hanning window) The time-frequency compromise of the STFT must be well understood. For the deterministic analysis, it is important to have enough frequency resolution to resolve the partials of the sound. For the stochastic analysis the frequency resolution is not that important, since we are not interested in particular frequency components and we are more concerned with a good time resolution. This can be accomplished by using different parameters for the deterministic and the stochastic analysis. In this project we use a

103 91 sinusoidal model and not a sine plus noise model. The reason for not using a sine plus noise model will be discussed in the coming sections. Figure 3.10: Magnitude and Phase spectrum of LF and HF bands The computation of the spectra is carried out by the short-time Fourier transform technique. Window choice The window choice plays an important role in any analysis and synthesis system. The choice depends on the problem encountered, the amount of overlap rate, and the precision

expected, and the computational cost. Popular windows like the Hamming require a 75% overlap, i.e., the hop size is 25% of the analysis frame length; the rectangular window needs at least 50% overlap. In steady-state sounds we should use long windows (several periods) with good side-lobe rejection (for example, Blackman-Harris 92 dB) for the deterministic analysis. This gives good frequency resolution and therefore a good measure of the frequencies of the partials. In the case of harmonic sounds the actual size of the window will change as the pitch changes, in order to assure a constant time-frequency trade-off for the whole sound. The choice of analysis window determines the trade-off of time versus frequency resolution, which affects the smoothness of the spectrum and the detectability of the different sinusoidal components. The most commonly used windows are the rectangular, Hamming, Hanning, Kaiser, Blackman, and Blackman-Harris windows. All the standard windows are real and symmetric and have a frequency spectrum with a sinc-like shape. For the purposes of SMS, and in general for any sound analysis/synthesis application, the choice of window is mainly determined by two of its spectral characteristics: the width of the main lobe, defined for present purposes as the number of bins between zero crossings on either side of the main lobe when the DFT length equals the window length, and the highest side-lobe level, which measures the gain from the highest side lobe to the main lobe. Ideally, we want a narrow main lobe (i.e., good frequency resolution) and a very low side-lobe level; the choice of window determines this trade-off. The rectangular window has the narrowest main lobe (two bins), but its first side lobe is very high, only 13 dB below the main-lobe peak. The

Hamming window has a wider main lobe, four bins, and its highest side lobe is 43 dB down. A very different window, the Kaiser, allows control of the trade-off between main-lobe width and the highest side-lobe level: if a narrower main lobe is desired, the side-lobe level will be higher, and vice versa. Since control of this trade-off is valuable, the Kaiser window is a good general-purpose choice. The window length must be sufficient to resolve the most closely spaced sinusoidal frequencies; a nominal choice for periodic signals is about four periods [11].

DFT Computation

Once a section of the waveform has been windowed, the next step is to compute the spectrum using the DFT. For practical purposes the FFT should be used whenever possible, but this requires the length of the analyzed signal to be a power of two. This can be accomplished by taking any desired window length and zero padding, i.e., filling with zeros out to the length required by the FFT. This not only allows use of the FFT algorithm, but computes a smoother spectrum as well: zero padding in the time domain corresponds to spectral interpolation. The size of the FFT, N, is normally chosen to be the first power of two that is at least twice the window length M, with the difference N-M filled with zeros. If B is the number of samples in the main lobe when the zero-padding factor is 1 (N = M), then a zero-padding factor of N/M gives B*(N/M) samples across the same main lobe (and the same main-lobe bandwidth). The zero-padding (interpolation) factor N/M should be large enough to enable an accurate estimation of the true maximum of the main lobe. That is, since the window length is not an exact number of periods for every sinusoidal frequency, the

106 94 spectral peaks do not, in general, occur at FFT bin frequencies (multiples of F s /N). Therefore, the bins must be interpolated to estimate peak frequencies. Zero padding is one type of spectral interpolation [11]. Choice of Hop Size Once the spectrum has been computed at a particular frame in the waveform, the STFT hops along the waveform and computes the spectrum of next section in the sound. This hop size H is an important parameter. Its choice depends very much on the purpose of the analysis. In general, more overlap will give more analysis points and therefore smoother results across time, but the computational expense is proportionally greater. A general and valid criterion is that the successive frames should overlap in time, in such a way that all the data are weighted equally. A good choice is the window length divided by the main lobe width in bins. For example, a practical value for the hamming window is to use a hop size equal to one fourth of the window size [11]. Peak Detection in PSMS The input sound is filtered by a band-pass filter in and around the region of sensitivity of human ear. Once the spectrum of the current frame is computed, the next step is to detect its prominent magnitude peaks. Theoretically, a sinusoid that is stable both in amplitude and in frequency (a partial) has a well defined frequency representation: the transform of the analysis window used to compute the Fourier transform.

107 95 It should be possible to take advantage of this characteristic to distinguish partials from other frequency components. However, in practice, this is rarely the case, since most natural sounds are not perfectly periodic and do not have nicely spaced and clearly defined peaks in the frequency domain. Figure 3.11: Peak detection in LF and HF bands There are interactions between the different components, and the shapes of the spectral peaks cannot be detected without tolerating some mismatch. Only some instrumental sounds (e.g., the steady-state part of an oboe sound) are periodic enough and sufficiently free from prominent noise components that the frequency representation of a stable

sinusoid can be recognized easily in a single spectrum. A practical solution is to detect as many peaks as possible and delay the decision of what is a deterministic, or "well behaved," partial to the next step in the analysis: the peak continuation algorithm [2]. In this project, however, we track all the available tracks, just as a McAulay-Quatieri algorithm does. A "peak" is defined as a local maximum in the magnitude spectrum, and the only practical constraints in the peak search are a frequency range and a magnitude threshold. In fact, we should detect more than what we hear and get as many sample bits as possible from the original sound, ideally more than 16. The measurement of very soft partials, sometimes more than 80 dB below the maximum amplitude, will be difficult, and they will have little resolution. These peak measurements are very sensitive to transformations, because as soon as modifications are applied to the analysis data, parts of the sound that could not be heard in the original can become audible. The original sound should therefore be as clean as possible and have the maximum dynamic range; the magnitude threshold can then be set to the amplitude of the background noise floor. Due to the sampled nature of the spectra returned by the FFT, each peak is accurate only to within half a sample. A spectral sample represents a frequency interval of Fs/N Hz, where Fs is the sampling rate and N is the FFT size. Zero-padding in the time domain increases the number of spectral samples per Hz and thus increases the accuracy of the simple peak detection. However, to obtain frequency accuracy on the level of 0.1% of the distance from the top of an ideal peak to its first zero crossing (in the case of a rectangular window), the required zero-padding factor becomes impractically large. A more efficient spectral

interpolation scheme is to zero-pad only enough so that quadratic (or other simple) spectral interpolation, using only the samples immediately surrounding the maximum-magnitude sample, suffices to refine the estimate to 0.1% accuracy.

Figure 3.12: Missed peaks

In real cases, some peaks might lie below the threshold of hearing. This matters because partials that were not audible before modifications may be clearly perceivable after modifications; the SMS literature suggests going as much as 80 decibels below the threshold of hearing. However, the sinusoidal model involved in this project is not a sine plus noise model, and the peak detection algorithm only finds the prominent peak around a local maximum separated by a specified minimum distance between peaks. We therefore went 10 or 20 decibels below the

threshold of hearing to pick the peaks that were missed. In this project we do not deal much with modifications, because this is a data reduction scheme.

Figure 3.13: Peaks below threshold. The figure shows the peaks picked and their corresponding phase matches; some of the peaks below threshold were missed. This problem is corrected by choosing a threshold 10 dB below the actual threshold.
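The sketch below shows one way such thresholded peak picking with quadratic (parabolic) refinement of the peak position could be implemented. The threshold and minimum peak spacing are illustrative parameters, not the exact settings used in the experiments, and the phase is simply taken from the nearest bin.

```python
import numpy as np

def detect_peaks(mag_db, phase, fs, n_fft, threshold_db=-70.0, min_sep_bins=4):
    """Pick local maxima above a threshold and refine them by parabolic interpolation.

    Returns (frequency_hz, magnitude_db, phase) triads for the detected peaks.
    """
    bin_hz = fs / n_fft
    peaks = []
    for k in range(1, len(mag_db) - 1):
        a, b, c = mag_db[k - 1], mag_db[k], mag_db[k + 1]
        if b < threshold_db or b <= a or b <= c:
            continue
        delta = 0.5 * (a - c) / (a - 2 * b + c)   # fractional bin offset of the maximum
        peaks.append(((k + delta) * bin_hz, b - 0.25 * (a - c) * delta, phase[k]))
    # If two peaks are closer than min_sep_bins, keep only the larger one.
    pruned = []
    for p in sorted(peaks):
        if pruned and p[0] - pruned[-1][0] < min_sep_bins * bin_hz:
            if p[1] > pruned[-1][1]:
                pruned[-1] = p
        else:
            pruned.append(p)
    return pruned
```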

Peak Continuation

Once the spectral peaks corresponding to the low-frequency and high-frequency bands of the current frame have been detected, the peak continuation algorithm adds them to the incoming peak trajectories. The basic idea of the algorithm is that a set of "guides" advances in time through the spectral peaks, looking for the appropriate ones (according to the specified constraints) and forming trajectories out of them. Thus a guide is an abstract entity used by the algorithm to create the trajectories, and the trajectories are the actual result of the peak continuation process. The instantaneous state of the guides (their frequency, phase, and magnitude) is continuously updated as the guides are turned on, advanced, and finally turned off. The schemes used in the sinusoidal model (McAulay and Quatieri, 1984; 1986) [5] find peak trajectories both in the noise and in the deterministic parts of a waveform, thus obtaining a sinusoidal representation for the whole sound. These schemes are unsuitable when we want the trajectories to follow just the partials: for example, when the partials change substantially in frequency from one frame to the next, these algorithms easily switch from the partial they were tracking to another one that is closer at that point. In this project the user supplies some parameters as input to the algorithm, so the process of tracking the music and noise is not fully automatic. The specifications of the parameters are given in the following passages.

Initial Guides

With this parameter the user specifies the approximate frequencies of the partials that are known to be present in the sound, thus reserving guides for them. The algorithm adds new guides to this initial set as it finds them; when no initial guides are specified, the algorithm creates all of them. Another method is to create initial guides at equal intervals and allow the algorithm to update them using the data set being tracked.

Maximum Peak Deviation

Guides advance through the sound, selecting peaks. This parameter controls the maximum allowable frequency distance from a peak to the guide that selects it. It is useful to make this parameter a function of frequency in such a way that the allowable distance is bigger for higher frequencies than for lower ones; the deviation can then follow a logarithmic scale, which is perceptually more meaningful than a linear frequency scale. However, since the model used in this project tracks both noisy components and meaningful music components, we do not use a logarithmic scale.

Peak Contribution to Guide

The frequency of each guide does not have to correspond to the frequency of the actual trajectory; it is updated every time it incorporates a new peak. This parameter is a

number from 0 to 1 that controls how much the guide frequency changes when a new peak is incorporated. That is, given that the current guide has a frequency f~, what will its value be when it incorporates a peak with frequency h? For example, if the value of the parameter is 1, the value of the guide, f~, is updated to h, so the peak makes the maximum contribution. If the value of the parameter is smaller, the contribution of the peak is correspondingly smaller and the new value falls between the current value f~ and h. This parameter is useful, for example, to circumscribe a guide to a narrow frequency band.

Maximum Number of Guides

This is the maximum number of guides used by the peak-continuation process at each particular moment in time. The total number of guides created over the whole sound may be bigger, because when a guide is turned off a new one can take its place.

Minimum Starting Guide Separation

A new guide can be created at any frame from a peak that has not yet been incorporated into any existing guide. This parameter specifies the minimum required frequency separation from a peak to the existing guides in order to create a new guide at that peak. Consequently, through this parameter, peaks that are very close to existing guides can be rejected as candidates for starting guides.

Maximum Sleeping Time

When a guide has not found a continuation peak for a certain number of frames, the guide is killed. This parameter specifies the maximum non-active time, that is, the maximum number of frames that the guide can stay alive while not finding continuation peaks.

Maximum Length of Filled Gaps

Given that a certain sleeping time is allowed, we may wish to fill the resulting gaps. This parameter specifies the length of the biggest gap to be filled (a number smaller than or equal to the maximum sleeping time). The gaps are filled by interpolating between the end points in the trajectory.

Minimum Trajectory Length

Once all the trajectories are created, this parameter controls the minimum trajectory length: all trajectories shorter than this length are deleted. The last two specifications are optional, and in this research work they were not used, because the short trajectories that would be deleted constitute most of the noise-like part of the sound. Only when a sine plus noise model is employed do these specifications become useful.
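For reference, the control parameters just listed can be gathered into a single settings object, as in the sketch below; the default values shown are illustrative placeholders that would have to be tuned per sound, as the text emphasises.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GuideSettings:
    """User-supplied controls for the peak-continuation (guide) algorithm."""
    initial_guides_hz: List[float] = field(default_factory=list)  # known partials
    max_peak_deviation_hz: float = 50.0     # may also be made frequency dependent
    peak_contribution: float = 0.5          # alpha in [0, 1]
    max_guides: int = 100
    min_starting_separation_hz: float = 30.0
    max_sleeping_frames: int = 3
    max_filled_gap_frames: int = 0          # optional, not used in this work
    min_trajectory_frames: int = 0          # optional, not used in this work
```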

To describe the peak continuation algorithm, let us assume that the frequency guides were initialized with initial guides and that they are currently at frame n. Suppose that the guide frequencies at the current frame are f~1, f~2, f~3, ..., f~p, where p is the number of existing guides. We want to continue the p guides through the peaks of frame n, with frequencies g1, g2, g3, ..., gm, thus continuing the corresponding trajectories. There are three steps in the algorithm: (1) guide advancement, (2) update of the guide values, and (3) start of new guides. These steps are described next.

Guide Advancement

Each guide is advanced through frame n by finding the peak closest to its current value. The r-th guide claims the frequency g_i for which |f~r - g_i| is a minimum; the change in frequency must be less than the maximum peak deviation. The possible situations are as follows:

1. If a match is found within the maximum deviation, the guide is continued (unless there is a conflict to resolve) and the selected peak is incorporated into the corresponding trajectory.

2. If no match is found, it is assumed that the corresponding trajectory must turn off entering frame n, and its current frequency is matched to itself with zero magnitude. Since the trajectory amplitudes are linearly ramped from one frame to the next, the terminating trajectory ramps to zero over the duration of one hop

size. Whether the actual guide is killed or not depends on the maximum sleeping time.

3. If a guide finds a match which has already been claimed by another guide, we give the peak to the guide that is closest in frequency, and the loser looks for another match. If the current guide loses the conflict, it simply picks the best available non-conflicting peak within the maximum peak deviation. If the current guide wins the conflict, it calls the assignment procedure recursively on behalf of the dislodged guide; when the dislodged guide finds the same peak and wants to claim it, it sees that there is a conflict which it loses, and it moves on. This process is repeated for each guide, solving conflicts recursively, until all possible matches are made.

Update of Guide Values

Once all the existing guides and their trajectories have been continued through frame n, the guide frequencies are updated. There are two possible situations:

1. If a guide finds a continuation peak, its frequency is updated from f~r to h~r according to

    h~r = α (g_i - f~r) + f~r,    α in [0, 1],

where g_i is the frequency of the peak that the guide has found at frame n and α is the peak contribution to the guide. When α is 1, the frequency of the guide becomes the same as the frequency of the peak trajectory, and the difference between guide and trajectory is lost.

2. If a guide does not find a continuation peak for the maximum sleeping time number of frames, the guide is killed at frame n. If it is still under the sleeping time, it keeps the same value (its value can be negated in order to remember that it has not found a peak). When the maximum sleeping time is 0, any guide that does not find a continuation peak at frame n is killed. In order to distinguish the guides that find a continuation peak from the ones that do not but are still alive, we refer to the first ones as active guides and the second ones as sleeping guides.

Start of New Guides

New guides, and therefore new trajectories, are created from the peaks of frame n that are not incorporated into trajectories by the existing guides. If the number of current guides is smaller than the maximum number of guides, a new guide can be started. A guide is created at frame n by searching through the unclaimed peaks of the frame for the one with the highest magnitude which is separated from every existing guide by at least the minimum starting guide separation. The frequency value of the selected peak is the frequency of the new guide. The actual trajectory is started in the previous frame, n-1, where its amplitude value is set to 0 and its frequency value to the same as the current frequency, thus ramping in amplitude up to the current frame. This process is done recursively until there are no more unclaimed peaks in the current frame, or the number of guides has reached the maximum number of guides.
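A compact sketch of the guide-advancement and update steps described above is given below. It resolves contested peaks by letting the guide whose candidate is nearest claim first, which is a simplification of the recursive conflict resolution in the text; sleeping-time handling and new-guide creation are omitted.

```python
def advance_guides(guide_freqs, peak_freqs, max_dev_hz, alpha):
    """Advance each guide to its closest unclaimed peak within max_dev_hz.

    Returns the matched peak index per guide (None marks a sleeping guide)
    and the updated guide frequencies  f_r <- f_r + alpha * (g_i - f_r).
    """
    guide_freqs = list(guide_freqs)
    matches = [None] * len(guide_freqs)
    claimed = set()

    def best_distance(r):
        return min((abs(g - guide_freqs[r]) for g in peak_freqs),
                   default=float("inf"))

    # Visit guides in order of how close their best candidate peak is, so a
    # contested peak ends up with the guide that is nearest to it.
    for r in sorted(range(len(guide_freqs)), key=best_distance):
        best_i, best_d = None, max_dev_hz
        for i, g in enumerate(peak_freqs):
            d = abs(g - guide_freqs[r])
            if i not in claimed and d <= best_d:
                best_i, best_d = i, d
        if best_i is not None:
            claimed.add(best_i)
            matches[r] = best_i
            guide_freqs[r] += alpha * (peak_freqs[best_i] - guide_freqs[r])
    return matches, guide_freqs
```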

118 106 In order to minimize the creation of guides with little chance of surviving, a temporary buffer is used for the starting guides. The peaks selected to start a trajectory are stored into this buffer and continued by only using peaks that have not been taken by the consolidated guides. Once these temporary guides have reached a certain length they become normal guides. Figure 3.14: Peak-continuation process. Here, g represent the guides and p the spectral peaks. The magnitude, frequency and phase information at p form input to the oscillator For the case of harmonic sounds these guides could be created at the beginning of the analysis, setting their frequencies according to the harmonic series of the first fundamental found, and for inharmonic sounds each guide is created when it finds the first available peak. When a fundamental has been found in the current frame, the guides can use this information to update their values. Also the guides can be modified depending on the last peak incorporated. Therefore by using the current fundamental and the previous peak we control the adaptation of the guides to the instantaneous changes in

the sound. For a very harmonic sound, since all the harmonics evolve together, the fundamental should be the main control; but when the sound is not very harmonic, or the harmonics are not locked to each other and we cannot rely on the fundamental as a strong reference for all the harmonics, the information of the previous peak should have a bigger weight. However, we do not focus on just harmonic partials: this project uses an algorithm that tracks any sound, harmonic or inharmonic. Each peak is assigned to the guide that is closest to it and that is within a given frequency deviation. If a guide does not find a match, it is assumed that the corresponding trajectory must "turn off". In inharmonic sounds, if a guide has not found a continuation peak for a given amount of time, the guide is killed. New guides, and therefore new trajectories, are created from the peaks of the current frame that are not incorporated into trajectories by the existing guides; if there are killed or unused guides, a new guide can be started. A guide is created by searching through the "unclaimed" peaks of the frame for the one with the highest magnitude. The peak continuation algorithm presented here is only one approach to the peak continuation problem. The creation of trajectories from the spectral peaks is compatible with very different strategies and algorithms; for example, hidden Markov models have been applied. An order-N Markov model provides a probability distribution for a parameter in the current frame as a function of its value across the past N frames.

With a hidden Markov model we are able to optimize groups of trajectories according to defined criteria, such as frequency continuity. This type of approach might be very valuable for tracking partials in polyphonic sounds and complex inharmonic tones; in particular, the notion of "momentum" is introduced, helping to properly resolve crossing fundamental frequencies [3].

Additive Synthesis: PSMS

A short tutorial on additive synthesis was given in Chapter 2. The peak continuation algorithm returns the values of the prominent peaks organized into frequency trajectories. Each peak is a triad (A_r, ω_r, φ_r) for frame l, where l is the frame number and r is the track number to which it belongs. The synthesis process takes these trajectories, or a modification of them, and computes one frame of synthesized sound s(t) by

    s(t) = sum over r = 1, ..., R_l of  A_r cos[ t ω_r + φ_r ],

where R_l is the number of trajectories present in frame l and S is the length of the synthesis frame. A synthesis frame is S samples long and does not correspond to an analysis frame. Without time scaling, the synthesis frame l goes from the middle of analysis frame l-1 to the middle of analysis frame l, i.e., it corresponds to the analysis hop size. The final sound s(t) results from the juxtaposition of all synthesis frames [2].
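A direct oscillator-bank realisation of this synthesis equation is sketched below, with frequencies expressed in Hz. The second function illustrates the variant used later in this chapter for high-frequency bands whose measured phases are discarded: each partial simply accumulates a running phase so that successive frames join continuously.

```python
import numpy as np

def synthesize_frame(amps, freqs_hz, phases, frame_len, fs):
    """Sum the R_l partials of one frame: s(t) = sum_r A_r * cos(w_r * t + phi_r)."""
    t = np.arange(frame_len) / fs
    out = np.zeros(frame_len)
    for a, f, p in zip(amps, freqs_hz, phases):
        out += a * np.cos(2 * np.pi * f * t + p)
    return out

def synthesize_frame_no_phase(amps, freqs_hz, frame_len, fs, running_phase):
    """Variant for bands whose measured phases are discarded.

    `running_phase` holds one accumulated phase per partial and is updated
    in place so that consecutive frames remain continuous.
    """
    t = np.arange(frame_len) / fs
    out = np.zeros(frame_len)
    for r, (a, f) in enumerate(zip(amps, freqs_hz)):
        out += a * np.cos(2 * np.pi * f * t + running_phase[r])
        running_phase[r] = (running_phase[r]
                            + 2 * np.pi * f * frame_len / fs) % (2 * np.pi)
    return out
```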

In the SMS scheme, a peak interpolation strategy is usually followed to avoid clicks between frames. In this project we replace that peak interpolation with a smooth cubic spline interpolation stage.

Cubic Spline Interpolation

In any SMS or MQ synthesis scheme, the peak continuation stage is followed by a peak interpolation stage. Peak interpolation is the process of smoothly interpolating the amplitude, frequency, and phase parameters of each sinusoidal track between frames; it helps to avoid sharp clicks between frames. A click or crack is an undesirable audio artifact and is always a target for cleaning; in digital audio restoration, the undesirable noises include clicks/cracks and steady-state hiss. Peak interpolation is thus an interpolation in the parameter (frequency) domain. In this research project, however, we have replaced peak interpolation with a smooth cubic spline interpolation method, which smoothly interpolates the abrupt voltage levels at the frame edges. The usual way of writing a vinyl-restoration code is to first detect the click and then fix it; our implementation is simpler because it is already known that the crack exists only at the frame edges.

Figure 3.15: (Top) with cracks; (bottom) without cracks

At the heart of most click removal methods is an interpolation scheme which replaces missing or corrupted samples with estimates of their true value. It is usually appropriate to assume that clicks have in no way interfered with the timing of the material, so the task is to fill in the gap with appropriate material of identical duration to the click. This amounts to an interpolation problem which makes use of the good data values surrounding the corruption, and possibly takes account of signal information buried in the corrupted sections of data. An effective technique will have the ability to interpolate gap lengths from one sample up to at least 100 samples at a sampling rate of 44.1 kHz [28].

Figure 3.16: Cubic spline interpolation

However, the process employed in this work is not fully automatic: the user has to specify or change a parameter depending on the sound. A specific number of points is selected around the crack centre located at the frame edge, i.e., a few points from the previous frame and a few points from the current frame, and these are smoothly interpolated using the cubic spline technique. The number of points that form the input to the interpolation algorithm depends on the nature of the sound, in particular the wavelength and the zero crossings: for a low-frequency sound we might want to specify a large number, and for a high-frequency sound a small number, to arrive at a desirable output.
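A sketch of this boundary repair using SciPy's CubicSpline is shown below. Here half_width plays the role of the user-set, sound-dependent parameter mentioned above (larger for low-frequency material, smaller for high-frequency material), and context is an assumed number of good samples used on each side of the gap.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def smooth_frame_edge(x, edge_index, half_width=8, context=16):
    """Re-interpolate 2*half_width samples centred on a frame edge.

    A cubic spline is fitted to `context` good samples on each side of the
    suspect region and evaluated over the region, removing the crack.
    """
    x = np.asarray(x, dtype=float).copy()
    lo = max(edge_index - half_width - context, 0)
    hi = min(edge_index + half_width + context, len(x))
    idx = np.arange(lo, hi)
    good = (idx < edge_index - half_width) | (idx >= edge_index + half_width)
    spline = CubicSpline(idx[good], x[idx[good]])
    x[idx[~good]] = spline(idx[~good])
    return x
```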

124 112 FUSION OF SENSITIVE AND LESS SENSITIVE SPECTRA Once the PCM samples of the MF region are demodulated to the original frequency band by the demodulator, we fuse the synthesized bands and the MF band. The synthesized bands are synthesized from the channel parameters. Here is a model of the fusion. Figure 3.17: Spectral Fusion (A more general model) DISTORTION AT TRANSITION BAND EDGES Distortion at transition bands could be expected sometimes. These distortions may be phase distortions or other type of distortions at the transition edges between frequency bands. This may be suppressed by using the same filter that we have used before

125 113 analysis, this time at the synthesis end before fusing the spectral bands to form final output and this turns successful. The Downsampling method This is the second method to be discussed after the two filter method. The downsampling method is applicable only for very low frequency sounds. According to the duplex theory of pitch perception, low frequencies have very poor frequency resolution but better time resolution and high frequencies have better spectral resolution and poor time resolution. At 640 Hz both, temporal and pattern recognition effects are sensed [7]. Figure 3.18: The downsampling method.

This is because the human auditory system follows time impulses in the LF region. The downsampling method targets two things. The first is spectral resolution: since the LF sinusoids have poor spectral resolution, the low-frequency sound is downsampled in time by a factor of two. This shifts all the tracks up by a factor of two on the analysis frequency scale, so the band-rejection filter must be designed with cutoff frequencies that are two times the cutoff frequencies that rejected the sensitive mid band in the two-filter method. The pitch increases by a factor of two and the duration decreases by a factor of two, hence the spectral resolution of the tracks is increased, which facilitates the peak tracking process. The second target is data reduction, which is also achieved because the time is compressed by a factor of two when we downsample. Finally, before resynthesis, a modification factor of two is set and the frequency parameters are divided by two to recover the original time and pitch. However, this method is applicable only to very low-frequency sounds, because the sampling rate of the signal must be taken into consideration before shifting the frequency components; moreover, the sound is comparatively defective if the modification algorithms are not robust. Although the downsampling method only fits low-frequency instruments like the bass guitar or the kick drum, it can be used effectively within the four-filter method that follows.
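A rough outline of the downsampling method is given below. The analyse() and synthesize() callables are hypothetical wrappers around the PSMS analysis and oscillator-bank synthesis stages described earlier, and the trajectory data layout (per-frame lists of frequencies) is an assumption made only for this sketch.

```python
from scipy import signal

def downsampling_method(x_lf, fs, analyse, synthesize):
    """Analyse a low-frequency band at half length, then restore pitch and time."""
    # Downsample by two but keep treating the samples as if they were still at
    # the original rate: the material is time-compressed, every track appears
    # an octave higher, and the data/computation cost is roughly halved.
    y = signal.decimate(x_lf, 2)
    amps, freqs, phases = analyse(y, fs)
    # Before resynthesis, halve the frequency parameters and stretch time by
    # two so that the original pitch and duration are recovered.
    freqs_restored = [[f / 2.0 for f in frame] for frame in freqs]
    return synthesize(amps, freqs_restored, phases, fs, time_stretch=2.0)
```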

127 115 Duplex theory of pitch perception: Applications: Four-Filter method Figure 3.19: High Spectral resolution Four-filter method The four-filter PSMS makes use of the duplex theory of pitch perception discussed in chapter 2. The dual purpose of the four filter method is more spectral resolution for better tracking and re-synthesis and more chances of data reduction. The four filter method improves the two filter method in terms of fidelity, and data reduction. Moreover, logically it sets a unique example for better ways to analyze signals, truthfully in future. This way the tradeoff in time and frequency resolution can be compensated to some extent. According to the duplex theory of pitch perception, the low

frequencies have poor spectral resolution and good time resolution, whereas the high frequencies have poor time resolution and good spectral resolution.

Figure 3.20: A schematic picture explaining how the analysis frame length is changed over the frequency scale. At the 640 Hz crossover frequency, both temporal and pattern-recognition resolution are present.

Regardless of which domain needs the higher resolution, when we synthesize the sound we need spectral frequency parameters that are very close to the original ones; therefore the spectral resolution (number of bins) cannot simply be reduced. Hence a variable analysis time-frame length is employed to achieve a better PSMS system: a short analysis frame length (35-45 ms) is set for the low-frequency sounds and a longer analysis frame length (100 ms) is set for the high-frequency sounds. The downsampling method is optional here, because it works well only for low-frequency harmonic signals.

Figure 3.21: The four-filter method

The input sound is filtered using four band-pass filters: a DC to 0.7 kHz filter, a 0.5 kHz to 1 kHz filter, a 0.9 kHz to 6.1 kHz (mid) filter, and a 5.9 kHz to 20 kHz filter were used to separate the different frequency bands. Each filtered band except the mid band forms an input to the sinusoidal model. Overlapping frames of variable time length are formed before sending each pass-band signal into the sinusoidal model: a low-frequency band gets a short analysis frame length and the high-frequency band gets a somewhat longer frame length. This method is a little more expensive, but the resolution is comparatively good and it is a very effective tactic for enhanced data reduction.
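One way to organise this band split with per-band analysis frame lengths is sketched below; the frame lengths follow the millisecond ranges quoted above, while the Butterworth filters and their order are illustrative choices rather than the filters actually used.

```python
from scipy import signal

FS = 44100

# (low cutoff Hz or None for DC, high cutoff Hz, analysis frame length in ms, role)
BANDS = [
    (None, 700,   40,   "LF1: synthesized, short analysis frames"),
    (500,  1000,  40,   "LF2: synthesized, short analysis frames"),
    (900,  6100,  None, "MID: transmitted as PCM, no sinusoidal analysis"),
    (5900, 20000, 100,  "HF: synthesized, long frames, phase discarded"),
]

def split_four_bands(x):
    """Return (band signal, analysis frame length in samples, role) per band."""
    out = []
    for lo, hi, frame_ms, role in BANDS:
        if lo is None:
            sos = signal.butter(4, hi, btype="lowpass", fs=FS, output="sos")
        else:
            sos = signal.butter(4, [lo, hi], btype="bandpass", fs=FS, output="sos")
        frame_len = None if frame_ms is None else int(FS * frame_ms / 1000)
        out.append((signal.sosfilt(sos, x), frame_len, role))
    return out
```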

130 118 Figure 3.22: (Top) High resolution (Bottom) Low spectral resolution Analysis of spectral resolution of low frequency sounds (0-700 Hz) Note: There is no zero padding involved in the above plots

131 119 However, the parametric information through the channel for both two-filter method and four-filter method almost remains the same. A two filter method captures large number of parameters from two wide spectral bands whereas a four frequency method captures less number of parameters from three spectral bands. Discarding Phase in HF regions In HF regions, the ear follows the amplitude envelope of the frequency spectrum and leaves out its phase content. This psychological evidence was used in our sinusoidal model by discarding the HF phase parameters that are to be fed into the oscillator. In the four filter method, the fourth filter chops off the lower spectral details, leaving us with only high frequency spectral energy. The amplitude and frequency parameters corresponding to this HF band were transmitted. The phase parameters were discarded because they have no perceptual importance in the auditory scene. The experiment was conducted with a cymbal crash sound and the results are shown in Figure 3.23.

132 120 Figure 3.23 (a): Synthesizing HF band with phase parameters Figure 3.23 (b): Synthesizing HF band without phase parameters

133 121 Figure 3.24: Plots explaining the synthesis of different band that have various analysis frame length and their final fusion (Four filter method)

Hence, it was experimentally verified that phase does not carry very significant sonic meaning in the high frequencies. In this way the four-filter method was improved: more audio data is compressed by discarding data that would otherwise traverse the channel to the synthesis port without perceptual benefit. Experiments were conducted both on inharmonic instruments, such as a cymbal crash, and on harmonic instruments, such as a harmonium. The results clearly indicate that phase information is not perceived at higher frequencies; therefore the high-frequency phase parameters can be discarded, because they do not have much perceptual importance. Hence one third of the high-frequency parameter information sent through the channel is removed.

POSSIBILITIES OF MODIFICATION

Figure 3.25: Possibilities of modifications

The most powerful option of analysis-synthesis based schemes is that the music can be modified after analysis. The modifications can serve different applications in music production and composition, such as the traditional time/pitch-scale modifications, reverberation, chorusing, harmonizing, and other production effects. Our synthesis-based method allows some modifications, although they are a little complex and involve more computation than other schemes. The sinusoidal parameters are modified using a modification instruction to get the desired effect for the LF and HF regions, while the mid spectrum is modified using a phase vocoder to attain the desired sound effect; the two sounds are then added together. The phase vocoder used here is a commercially available tool.

Cross Effects

Figure 3.26: Cross effect using PSMS modifications

The unique modification possibility in this project is the introduction of meaningful cross effects. For example, we may obtain a pitch-shift effect by modifying the sinusoidal parameters while using a harmonizing effect on the mid spectral samples via a phase vocoder. In the example above, the mid spectrum is given a chorusing effect using a commercial phase vocoder while the frequency parameters of the LF-HF sinusoids are multiplied by random numbers; the cross effect provides a meaningful digital effect as output.

Demonstration: Advantages over a sine plus noise model

Fidelity and data reduction are both increased if a traditional sinusoidal plus noise model is replaced with a pure sinusoidal model, as followed in this thesis work (two-filter method). Noise generally consists of musical excitation and other noisy components; in fact, SMS is sometimes used in denoising applications. While a pure deterministic model consists only of well-behaved harmonic tracks, the noise may contain frequencies near the harmonics of the deterministic tracks, and these nearby harmonics contribute a great deal to the music. When these nearby harmonics are modeled with an LPC estimate, the user needs to send a number of LPC coefficients close to the analysis frame length to get the best estimate and minimize the error; this becomes an additional burden on the channel. If fewer LPC coefficients are sent, the nearby harmonics are not properly modeled as stochastic noise when the LPC filter is excited with white noise.

137 125 In most of the real world cases, most of the non musical noises are associated in LF or HF regions as rumblings (LF) or hiss sounds (Tape HF). In the two filter method, we send all the sensitive nearby harmonics as PCM samples. The noises in the LF and HF that are less sensitive are modeled stochastically. Therefore the resulting residual sound comprising of mid nearby harmonics and musical noise plus stochastic LF and HF noise sounds better than a full stochastic model. Moreover, the number of LPC coefficients will also decrease by a greater factor. This becomes a better model in all terms. A demonstration by pictures and plots is shown below. Figure 3.27 (a)

138 126 Figure 3.27 (b) Figure 3.27 (c)

139 127 Figure 3.27 (d) Figure 3.27 (e)

140 128 Figure 3.27 (f) Figure 3.27 (g)

141 129 Figure 3.27 (h) Figure 3.27 (i)

142 130 Figure 3.27 (j) Figure 3.27 (k) Figure 3.27(a-k): Demonstration: Disadvantages of sine plus noise model in a two filter method

CHAPTER 4: TESTS AND RESULTS

Tests were conducted to analyze the success of this research project, and the results were plotted. The two-filter method and the four-filter method were analyzed qualitatively, and the bit reduction and audio compression ratios were tabulated.

Qualitative analysis

Listening tests were conducted on 12 subjects; five were musically trained and the rest were not. The subjects were instructed to rate the quality of the output sound by comparing it to the input sound, and were specifically asked to take into account any possible distortions in the output. They rated the output sound from 1 to 5, 1 being worst and 5 being best; decimal values between these integers were allowed so that the subjects could rate the sound as precisely as possible while making critical judgments. The qualitative results for the two-filter method are plotted below:

144 132 Figure 4.1: Qualitative Results: Music Genre: Two filter method Figure 4.2: Qualitative Results: Tonal Instruments: Two filter method

145 133 Figure 4.3: Qualitative Results: Percussion Instruments: Two filter method The qualitative results for the four filter method are plotted below: Figure 4.4: Qualitative Results: Music Genre: Four filter method

146 134 Figure 4.5: Qualitative Results: Tonal Instruments: Four filter method Figure 4.6: Qualitative Results: Percussion Instruments: Four filter method

Bit rate calculation

The bit rates for all the above sounds were calculated for both the two-filter method and the four-filter method. The mid-frequency sensitive signal that forms the input to the channel as PCM samples was reduced by an audio compression ratio of 4:1; for example, a sound sampled at the CD rate of 44.1 kHz, 705 kbps (mono), is reduced to 176 kbps. The parameters (amplitudes, frequencies, phases) were converted to binary format to calculate the number of bits per second, and the kbps from the LF-HF sinusoidal parameters were added to the kbps from the sensitive PCM samples to find the overall bit rate and audio compression ratio. All the audio files were mono. For the two-filter method the total is the PSMS parameter rate plus the mid-band PCM rate; for the four-filter method it is the LF1, LF2, and HF parameter rates plus the mid-band PCM rate. The resulting compression ratios are summarized below.

Music/Instrument    Source rate            Two-filter method    Four-filter method
Rock                44.1 kHz, 705 kbps     3:2                  2:1
Classical           44.1 kHz, 705 kbps     3:1                  3:1
Country             44.1 kHz, 705 kbps     2:1                  3:1
Jazz                44.1 kHz, 705 kbps     2:1                  3:1
Speech              38 kHz, 608 kbps       2:1                  3:1
Gottuvadhyam        44.1 kHz, 705 kbps     3:1                  3:1
Pipa                48 kHz, 768 kbps       3:1                  3:1
Sitar               44.1 kHz, 705 kbps     2:1                  3:1
Flute               44.1 kHz, 705 kbps     3:1                  3:1
Violin              44.1 kHz, 705 kbps     2:1                  3:1
Piano               24 kHz, 384 kbps       4:1                  4:1
Acoustic Guitar     32 kHz, 512 kbps       142 kbps (3:1)       4:1
Electric Bass       30 kHz, 480 kbps       3:1                  3:1
Low Tom             44.1 kHz, 705 kbps     3:1                  3:1
High Tom            44.1 kHz, 705 kbps     3:1                  3:1
Closed Snare        44.1 kHz, 705 kbps     3:2                  2:1
Open Hi-hat         44.1 kHz, 705 kbps     3:2                  3:1
Mid Tom             44.1 kHz, 705 kbps     3:1                  200 kbps (3:1)
Cymbal Crash        44.1 kHz, 705 kbps     3:2                  2:1

Table 4.1: Bit rates and compression ratios for the two-filter and four-filter methods

Earlier, in the third chapter, we mentioned that one needs to go below the threshold of hearing to pick peaks, because peaks that were not audible may become perceptible after modifications. However, the primary focus of this research is data reduction, and modifications have lesser significance; therefore, when good bit-rate reduction is desired, the margin below the threshold of hearing should be set to zero.

CHAPTER 5: CONCLUSION

It was found that the interpolation of voltage levels between time frames was not completely successful, so the output quality was affected to a small extent. However, if a commercial SMS system is used, this partial-synthesis idea can certainly improve on the existing schemes in terms of data reduction and fidelity, although the flexibility of modification is reduced by the partial synthesis. The bit reduction rates and compression ratios are given in Table 4.1, and the various features of this synthesis scheme are compared to standard perceptual coding schemes in Table 5.1 below.

Future Extensions

The future extensions of this project include transient modeling of the low- and high-frequency sounds using the sine plus transient plus noise scheme mentioned in [27]. The entire project could also be implemented for MIDI files. There is a dip at the high-frequency end of the Fletcher-Munson contours; this dip could be exploited as the mid-frequency dip was in this research. However, the dip is not common to all age groups, since older listeners often cannot hear the highest frequencies. Run-length coding could be applied to the parameters representing frequency tracks by sending (frequency parameter, run length) pairs, as mentioned in [23]. Another approach would be to use loudness-dependent band-pass and band-rejection filters in the two-filter method.

However, this may not induce a big change in data reduction. A perceptual coder may be used for coding the mid-frequency signals in the future.

PROS AND CONS: A COMPARATIVE STUDY

Feature                | Perceptual Audio Coding                                   | PSMS
Bit allocation         | Dynamic and adaptive                                      | No bit pool; bits are assigned according to sensitivity in a single external decision, with no adaptive techniques involved
Bit rate               | Fixed bit rates                                           | Depends on the sound
Psychoacoustic basis   | Loudness perception, with little use of pitch perception  | Pitch perception
Channel noise          | No                                                        | Possible, since it is a communication system
Applications           | Storage-space reduction, television and radio broadcast, satellite transmission, military and musical applications | Data reduction, television and radio broadcast, satellite transmission, military and musical applications
Commercial usage       | Used commercially today                                   | A possible future commercial product

Table 5.1: Pros and cons: perceptual coding and PSMS

Instead of the SMS/MQ synthesis technique that was used for synthesizing the less sensitive LF-HF regions, other synthesis techniques could be tried to determine which technique fits the system best. A low-bit-rate coder could be used to code the sensitive signal instead of the entire signal; this would facilitate accurate coding of the sensitive signal, while the less sensitive signals are synthesized using SMS. The synthesized signal and the low-bit-rate coded signal could then be added and the quality evaluated.

BIBLIOGRAPHY

1. Fletcher, H., and Munson, W. A., "Loudness, its definition, measurement and calculation," Journal of the Acoustical Society of America 5 (1933).
2. Xavier Serra, "A System for Analysis/Transformation/Synthesis Based on a Deterministic plus Stochastic Decomposition" (Ph.D. diss., Stanford University, 1989).
3. Tuomas Virtanen, "Audio Signal Modeling with Sinusoids plus Noise" (Master of Science thesis, Tampere University of Technology).
4. Jean Laroche and Mark Dolson, "New phase-vocoder techniques for pitch-shifting, harmonizing and other exotic effects," Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (New Paltz, New York, 1999).
5. McAulay, R. J., and Quatieri, T. F., "Speech analysis/synthesis based on a sinusoidal representation," IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-34, no. 4 (1986).
6. Albert Bregman, Auditory Scene Analysis (Cambridge: MIT Press, 1994).
7. Perry R. Cook, ed., Music, Cognition, and Computerized Sound: An Introduction to Psychoacoustics (Cambridge: MIT Press, 2001).
8. Ken C. Pohlmann, Principles of Digital Audio, 4th ed. (New York: McGraw-Hill, 2000).
9. Curtis Roads, The Computer Music Tutorial (Cambridge: MIT Press, 1996).
10. T. H. Andersen and K. Jensen, "Importance and representation of phase in the sinusoidal model," Journal of the Audio Engineering Society 52, no. 11 (2004).
11. Serra, X., and Smith, J., "Spectral Modeling Synthesis: A Sound Analysis/Synthesis System Based on a Deterministic plus Stochastic Decomposition," Computer Music Journal 14, no. 4 (1990).
12. Website:
13. Terhardt, E., and Seewann, S. G., "Algorithm for extraction of pitch and pitch salience from complex tonal signals," Journal of the Acoustical Society of America 71 (1982).
14. Ernst Terhardt, "Calculating virtual pitch," Hearing Research 1 (1979).
15. Kelly Fitz, "The Reassigned Bandwidth-Enhanced Method of Additive Synthesis" (Ph.D. diss., University of Illinois at Urbana-Champaign, 1999).
16. Scheirer, Eric D., and Barry L. Vercoe, "SAOL: The MPEG-4 Structured Audio Orchestra Language," Computer Music Journal 23, no. 2 (1999).
17. Brian C. J. Moore, An Introduction to the Psychology of Hearing (San Diego: Academic Press, 2003).
18. Plomp, R., "Pitch of complex tones," Journal of the Acoustical Society of America 41, no. 6 (1967).
19. Rodet, X., and P. Depalle, "Spectral Envelopes and Inverse FFT Synthesis," 93rd Convention of the Audio Engineering Society (San Francisco, 1992).
20. McAulay, R. J., and T. F. Quatieri, "Speech transformations based on a sinusoidal representation," IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-34, no. 6 (1986).
21. Eronen, A., and Klapuri, A., "Musical Instrument Recognition Using Cepstral Coefficients and Temporal Features," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2000 (Istanbul, 2000).
22. J. B. Allen and S. T. Neely, "Modeling the relation between the intensity just-noticeable difference and loudness for pure tones and wideband noise," Journal of the Acoustical Society of America 102, no. 6 (1997).
23. Sylvain Marchand, "Compression of sinusoidal modeling parameters," Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-00) (Verona, 2000).
24. T. H. Andersen and K. Jensen, "Phase models in analysis/synthesis of voiced sounds," in Proceedings of the DSAGM (Copenhagen, 2001).
25. Charles Dodge and Thomas A. Jerse, Computer Music: Synthesis, Composition, and Performance, 2nd ed. (New York: Prentice Hall).
26. Thomas Quatieri, Discrete-Time Speech Signal Processing (New Jersey: Prentice Hall, 2001).
27. T. Verma and T. Meng, "Extended spectral modeling synthesis with transient modeling synthesis," Computer Music Journal 24, no. 2 (2000).
28. Simon J. Godsill and Peter J. W. Rayner, Digital Audio Restoration (New York: Springer, 1998).

APPENDIX: Sample plots obtained from experiments

Plot captions: Rock music, sampled at 44.1 kHz, 16-bit, mono, 705 kbps; Country music, sampled at 44.1 kHz, 16-bit, mono, 705 kbps.
