University of Groningen

Continuity Preserving Signal Processing
Tjeerd C. Andringa

Document version: publisher's PDF, also known as version of record. Publication date: 2002.

Citation for published version (APA): Andringa, T. (2002). Continuity preserving signal processing. Groningen: s.n.

Rijksuniversiteit Groningen

Continuity Preserving Signal Processing

Proefschrift (doctoral dissertation) to obtain the doctorate in Mathematics and Natural Sciences at the Rijksuniversiteit Groningen, by authority of the Rector Magnificus, Dr. D.F.J. Bosscher, to be defended on Friday 22 February 2002 by Tjeerd Catharinus Andringa, born on 30 May 1964 in Leeuwarden.

Promotor: Prof. dr. ir. H. Duifhuis

Beoordelingscommissie (assessment committee):
Prof. dr. L.W.J. Boves
Prof. dr. L.P. Kok
Prof. dr. L. Schomaker


TABLE OF CONTENTS

Introduction

CHAPTER 1  Recognizing Arbitrary Sounds
  1.1 Speech and Speech Recognition
  1.2 Research Approach
  1.3 Defining a Speech Recognition System
  1.4 The Signal-in-Noise-Paradox
  1.5 Solving the Paradox
  1.6 Design Basics
  1.7 Quasi-Stationarity and Continuity
  1.8 Task Conclusions and Definitions

CHAPTER 2  Introduction to CPSP
  2.1 Overview of Representations and Techniques
  2.2 The Basilar Membrane Model
  2.3 The Cochleogram
  2.4 Tuned Autocorrelation
  2.5 Time Normalized Correlogram
  2.6 Estimation of Ridges
  2.7 Local Instantaneous Frequency Contours
  2.8 Design Choices and Overview of Implemented System
  2.9 Fundamental Period Contour Estimation
  2.10 Selection of Periodic Signal Contributions
  2.11 Resynthesis of Target Signal
  2.12 Parameterization of Selections
  2.13 Recognition Experiment

CHAPTER 3  The Basilar Membrane Response
  3.1 The Basilar Membrane Model
  3.2 Sensitivity Versus Group Delay
  3.3 Place-Frequency Relation
  3.4 Estimating Individual Signal Components
  3.5 Estimation of Ridges
  3.6 Cochleogram Reconstruction of TAC Selections

CHAPTER 4  Time Normalized Correlogram
  4.1 Three Continuous Correlogram Variants
  4.2 Dynamic Properties of the TNC
  4.3 The Characteristic Period Correlation
  4.4 On- and Offset Transients
  4.5 Estimation of Local Instantaneous Frequency Contours
  4.6 Estimating the Tuned Autocorrelation
  4.7 The Tuned Autocorrelation in Noise

CHAPTER 5  Fundamental Period Contour Estimation
  5.1 Fundamental Period Contour Estimation in Noise
  5.2 Fundamental Period Contour Estimation for Clean Speech
  5.3 A Comparison of Fundamental Period Estimation Algorithms

CHAPTER 6  Auditory Element Estimation
  6.1 Auditory Element Estimation and Selection
  6.2 The Robustness of Mask Forming
  6.3 Robustness of Auditory Element Estimation

CHAPTER 7  Overview and Discussion
  7.1 Overview of CPSP
  7.2 Obtaining Acceptable Recognition Results
  7.3 Conclusions

Samenvatting
Nawoord
References

Introduction

This thesis is the product of a research project that is simple to formulate: apply the rich temporal information of a model of the human inner ear (or cochlea) to automatic speech recognition. This project seemed, and proved to be, an interesting combination of physics, engineering, and cognitive science.

When I started to study speech and auditory processing after my undergraduate studies in physics, I was struck by the large number of specialisms and subspecialisms. Is this compartmentalization of research the optimal approach for a phenomenon like speech? A definite advantage of a high degree of specialization is that speech and auditory processing are studied in detail, but too much focus on details may prevent the detection of more global regularities. The fact that a speech signal can cause far-reaching behavioral responses in less than a second convinces me that the specialisms ought not only to communicate, but in fact be highly integrated to meet the challenges of the phenomenon of speech. Unfortunately, research efforts do not reflect this integration, since each specialism seems preoccupied with questions unique to its level of description. Without a solid justification of the compartmentalization of speech science, one is forced to be suspicious about the interpretation of experiments by one or a few subspecialisms. In particular, one might view ideas and theories that are limited to phenomena within a single (or a few) specialism(s) with suspicion, because they might represent (subtle) artifacts due to the unnecessary separation of one or more specialisms from the rest of speech processing.

It cannot be excluded that these artifacts have prevented the development of a single comprehensive theory of speech processing able to account for the wealth of available experimental evidence.

Another observation that struck me is that, in general, the scientific study of speech processing, as well as engineering approaches to Automatic Speech Recognition (ASR), seem to start from a very strong basic assumption: namely that speech is presented without any background noise. Yet, as everyone knows, speech is normally produced in uncontrollable situations with multiple sound sources. Consequently, source separation ought to be an inevitable ingredient of a speech recognition system. The low noise robustness of modern ASR systems might be hard-coded into these systems by the exclusion of an explicit source separation mechanism and a preference for noiseless signals during design.

A theory of speech, as well as a robust ASR system, might reflect the communicative function of speech as its central premise. Efficient communication requires that the speech signal changes slowly enough for the information-carrying features to be detected in a wide range of acoustic environments, and fast enough to transfer sufficient information. The balance between these two conflicting demands is likely to determine the main features of the speech signal and is consequently likely to be a suitable starting point for a theory of speech processing.

This thesis is the result of a 10-year period of intermittent research in three groups that each had important influences. In Professor Duifhuis' group at the department of Sensory Biophysics I was able to work with the model of the human inner ear (or cochlea) that forms the starting point of this work. At the same time I worked as a lecturer for the undergraduate programme in applied cognitive science (Technische Cognitiewetenschap, now Artificial Intelligence). Here I learned to appreciate the sheer complexity and surprisingly optimal functioning of the human cognitive system. In the last years I worked with the research company Human Quality (HuQ) Speech Technologies.

Duifhuis' cochlea model has, apart from its careful design and great numerical stability, two special features: it can implement various nonlinearities in a neurophysiologically plausible way, and it is continuous in time and place. After a while we decided to focus on the model's continuity for a very pragmatic reason: nonlinearities tend to complicate the mathematical description of a problem, while continuity tends to simplify it. After learning more about speech signal processing I realized that the generally applied forms of signal processing were chosen for reasons of mathematical convenience and not to represent the physics of the problem.

The symmetry between the continuous nature of the speech production system and the continuity of the cochlea seemed too valuable to ignore. Continuity Preserving Signal Processing (CPSP), possibly in a form similar to the one developed in this work, is in my opinion the reason why our auditory system is able to keep track of information of multiple concurrent sound sources. Frame-based approaches (FFTs, wavelets, LPC) introduce discontinuities at every transition to a next frame and pose unnecessary restrictions on the trade-off between temporal and frequency resolution. This complicates the analysis of complex signals more than the benefits of a convenient mathematical description warrant. As will be demonstrated in this work, CPSP is not only powerful because it solves problems, but also because it provides alternative signal descriptions in which some of the usual problems of signal processing do not occur. All techniques and representations developed in the course of this work have been patented (Andringa 1999).

Continuity Preserving Signal Processing, as developed in this work, is based on two conjectures. One states that the auditory system is optimally useful when it functions reliably in as many complex acoustic environments as possible. The other says the same for the speech process. The consequences of these conjectures are discussed in chapter 1. This nontechnical chapter formulates a restrictive framework to guide the development of ideas in later chapters and ends with a formulation of the objectives of this thesis. Chapter 2 starts with an overview of some of the core techniques of CPSP; it continues with the introduction of most of them and demonstrates their potential for ASR applications. Chapters 3 to 5 study special aspects of CPSP. Chapter 6 combines all derived representations to identify coherent areas of the time-frequency plane (termed auditory elements) that are likely to represent the most reliable sources of information in the signal. Chapter 7 provides a summary of CPSP and studies its generality. It then returns to the framework presented in chapter 1 by proposing a method to recognize speech sounds in arbitrary acoustic environments. The thesis is concluded by a short analysis of the consistency of CPSP with human performance and an overview of the advantages of CPSP.

Although CPSP is conceptually simple, it deviates in its basic assumptions from modern (speech) signal processing techniques. To allow an optimal presentation, a tutorial form was considered to be more useful than a set of articles.

The mathematical prerequisites are minimal, but familiarity with signal analysis is helpful. To help the reader appreciate the differences and special properties of the large number of different representations presented in this work, only a single target sentence (namely a Dutch version of "zero one two three") is used for visual representations. Although the focus is primarily on speech analysis and speech recognition, this is predominantly because speech is an example of a very complex class of sound. Notwithstanding the single example signal, CPSP is intended as a form of signal processing that can deal with unknown signals of arbitrary complexity.

HuQ Speech Technologies was founded to take care of the transition of CPSP from a new research tool for the study of complex sounds to a useful engineering technique that can be readily applied in a wide range of applications. HuQ's products and demonstration systems will prove that CPSP can indeed deal with a large number of qualitatively different sounds. CPSP is in active development, and given its huge potential for the analysis of complex sounds it is to be expected that CPSP will develop considerably in the next few years.

Tjeerd Andringa, January 2002

CHAPTER 1  Recognizing Arbitrary Sounds

This chapter provides a framework for speech signal processing that is intended to function in arbitrary, varying, uncontrollable, and unknown acoustic environments. The framework is based on insights from both physics and cognitive science, which is in contrast with common speech signal processing techniques that rely on mathematics and typically use standard electrical engineering techniques based on digital signal processing and applied statistics. The new approach is based upon a number of straightforward basic design considerations that are presented in this chapter. Subsequent chapters explore the consequences of these design considerations.

1.1 Speech and Speech Recognition

What is speech, and what distinguishes speech sounds from other sounds? These are hard questions because speech sounds are meaningful sounds. One might consider speech as an intermediate product of a very complex system that can communicate conscious thoughts and ideas of a speaker to a listener. In the case of speech the intermediate product is meaningful sound; in the similar case of sign language the physical manifestation is a set of meaningful gestures. In both cases, it is impossible to determine the meaningfulness of the physical manifestation by studying it in isolation from the rest of the system.

Just like most people are unable to determine whether or not a combination of gestures is part of sign language, so is even the most competent linguist unable to determine whether or not a sound is speech when he or she does not know the language. Like human listeners, an automatic speech recognition (ASR) system cannot determine that a signal is speech unless the system knows about the regularities of speech sounds, the regularities of the language, and the structures of the world that are associated with meaning. When the system does not know enough of the structure of speech sounds, it will treat nonspeech sounds as speech or, conversely, it will ignore valid speech. When the system does not know how speech sounds ought to be combined to form valid sounds of, e.g., English or Dutch, it will treat foreign or nonsense words as meaningful words of the target language. Vice versa, it might ignore or misinterpret valid words. And when the system does not know enough about syntax and semantics, it is likely to choose a nonsensical combination of words as its recognition result. Building a speech recognition system is therefore an extremely challenging task: it requires the integration of signal processing, linguistics, and knowledge of the world. It requires, in fact, much more knowledge of these fields than is currently available. Designing a speech recognition system with the competence of a normal human listener is therefore beyond our current capabilities.

At this moment ASR systems cannot reliably recognize spontaneous speech, or speech with a moderate amount of background noise that does not impair human performance. Nevertheless, technology allows us to build systems that function reasonably well in one important special case: namely when the user ensures that the system's input is limited to sounds the system can process correctly. When the user cannot or does not satisfy this condition, the system will fail and/or produce a nonsensical output.

An important example of a useful speech recognition application is a dictation system. Such a system requires the user to train it:[1] for optimal results, hours or even days of training are required. It is speaker-dependent in the sense that it functions well with the user(s) that trained the system, but functions suboptimally (or fails) with all other users. Yet these systems, when functioning, allow the users to interact with their computer in a very natural way.

1. At the same time the user is trained to produce speech in a manner that the ASR system can recognize.

Unfortunately, not many users are willing to spend the necessary hours on training. Hitherto these systems are more sold than used.

Another useful application area of speech recognition technology is that of automatic telephone information services. These information systems function best when the user is cooperative and experienced and when the dialogue is very structured. A structured dialogue reduces the number of possible words, which facilitates recognition enormously. Because the dialogue is very structured and the system usually does not need to recognize each word to estimate the next course of action, these systems can function satisfactorily for a large number of cooperative users.

Before addressing the basics of modern speech recognition systems, it is useful to describe some aspects of the speech signal (O'Shaughnessy 2000, Gold 2000). Speech is produced in the vocal tract. Its energy stems either from the flow of air around constrictions in the vocal tract, which leads to a signal in which all frequencies contribute (a continuous frequency distribution), or from the periodic opening and closing of the vocal folds in the glottis, which restricts the signal to a discrete set of frequencies that are periodic with the main period of the vocal fold movement. The resulting sound is changed (filtered) by the resonant properties of the vocal tract. The excitation of the vocal folds determines the pitch and the loudness, which carry intonation and information about speaker identity. Pitch and loudness are relatively unimportant for meaning in most European languages. The filtering effect of the vocal tract, on the other hand, is the main source of information for word-identity estimation.

The speech production process is often modeled by a source-filter model, as depicted in figure 1.1. This figure provides two equivalent versions of this process. The upper panel shows a time-domain representation in which the driving force is either a periodic pulse produced by the vocal folds or an aperiodic contribution from turbulent flow. This excitation is modulated by the resonance effect of the oral, nasal, and pharyngeal cavities. The resulting speech is radiated through lips and nose. The lower panel represents the same process in terms of the frequency domain. Now the vocal fold excitation is a set of harmonics: frequency components at integer multiples of the fundamental frequency (f0) of the vocal fold movement. The first harmonic, h1, represents the lowest frequency of the vocal fold movement and equals f0; the nth harmonic, hn, lies at n times the fundamental frequency. This discrete set of frequency components is filtered by the vocal tract.

[Figure 1.1: An overview of speech production. The upper panel shows a time-domain representation where the periodic pulses of the glottis resonate in the oral and nasal cavities before being radiated. The lower panel shows a frequency-domain representation of the same process. The periodic input is now a harmonic complex of frequency components, each of which is an integer multiple of the fundamental frequency f0. This set of frequencies is multiplied with a spectral envelope representing three formants. This leads to the final speech spectrum that consists of weighted harmonics. ASR systems use an estimate of the temporal development of the spectral envelope to determine word identity.]

This leads to a relative enhancement of some of the harmonics, and to a relative attenuation of others. Frequency regions where harmonics are enhanced are termed formants. The first two formants, F1 and F2, are most important for word-identity estimation. F3 is somewhat less important. Higher formants are of no special importance for word-identity estimation, but do contribute to speaker identification.
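The source-filter picture can be made concrete in a few lines of code. The sketch below, assuming NumPy and SciPy, builds a harmonic source and passes it through a cascade of formant resonators. The fundamental frequency, formant frequencies, bandwidths, and the 1/n roll-off are illustrative choices for a vowel-like sound, not values taken from this chapter.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000                  # sample rate in Hz (illustrative)
f0 = 120.0                  # fundamental frequency in Hz (illustrative)
t = np.arange(int(0.5 * fs)) / fs

# Source: a harmonic complex with components hn at integer multiples of f0,
# with a 1/n roll-off as a crude stand-in for the glottal pulse spectrum.
n_harmonics = int((fs / 2) // f0)
source = sum(np.sin(2 * np.pi * n * f0 * t) / n
             for n in range(1, n_harmonics + 1))

def formant(x, fc, bw):
    """Filter x with a second-order resonator: one formant at fc Hz."""
    r = np.exp(-np.pi * bw / fs)   # pole radius from the formant bandwidth
    a = [1.0, -2 * r * np.cos(2 * np.pi * fc / fs), r * r]
    return lfilter([1.0 - r], a, x)

# Filter: the vocal tract as a cascade of three formant resonators
# (F1, F2, F3); the envelope they form weights the harmonics.
speech = source
for fc, bw in [(500, 80), (1500, 120), (2500, 200)]:
    speech = formant(speech, fc, bw)
```

Changing f0 changes the pitch of the result while leaving the formant positions, and hence the vowel identity, untouched; this mirrors the independence of pitch and spectral envelope described above.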

The upper panel of figure 1.2 depicts the waveform of the Dutch utterance /NUL EEN TWEE DRIE/ (English translation: /ZERO ONE TWO THREE/). Because the temporal resolution of the picture is low, it is impossible to identify individual oscillations. But even with a good temporal resolution the waveform is hard to analyze.

[Figure 1.2: Two representations of speech. The upper panel shows the waveform. The lower panel shows the corresponding Fourier spectrogram. The utterance is the Dutch string /NUL EEN TWEE DRIE/ (English translation: /ZERO ONE TWO THREE/). A 25 ms Hamming window was used and the time difference between the start of consecutive frames was 10 ms. (Source: Groningen corpus, Sulter 1994)]

By performing a so-called Short Term Fourier Transformation (STFT), the signal can be transformed to a frequency-domain representation that is easier to analyze. The STFT divides the signal into a number of frequency contributions, or spectrum, that describe the signal during a certain time interval. These intervals are called frames or windows. A succession of spectra, describing the temporal development of a signal, is called a spectrogram or time-varying spectrum (Papoulis 1984). An example is depicted in the lower panel of figure 1.2. The horizontal axis represents time, the vertical axis frequency, and the color coding reflects the relative contribution of a certain frequency within a certain time window: blue for low values, yellow for intermediate values, and (dark) red for the highest values. Spectrograms are generally believed to provide sufficient and suitably organized information for speech recognition purposes (Furui 1989, O'Shaughnessy 2000, Young 1997). The individual harmonics show up as slowly varying (horizontal) structures.
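An STFT spectrogram with the parameters stated in the caption of figure 1.2 (25 ms Hamming windows, 10 ms frame steps) can be sketched as follows. This is a minimal NumPy illustration of the frame-based analysis just described, not the exact code behind the figure.

```python
import numpy as np

def spectrogram(x, fs, win_ms=25.0, hop_ms=10.0):
    """Short Term Fourier Transformation magnitude spectrogram.

    Each row is the log-energy spectrum of one 25 ms Hamming-windowed
    frame; consecutive frames start 10 ms apart, as in figure 1.2.
    """
    win = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    w = np.hamming(win)
    frames = np.stack([x[i:i + win] * w
                       for i in range(0, len(x) - win + 1, hop)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return 10.0 * np.log10(power + 1e-12)   # log compression for display
```

Note how the window length fixes the frequency resolution (roughly 1/0.025 s = 40 Hz here) for all components at once; this rigid coupling of temporal and spectral resolution is exactly the frame-based trade-off criticized later in this chapter.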

Natural sources, and speech in particular, are rarely perfectly periodic: usually the fundamental frequency, or its perceptive analogue pitch, changes as a function of time. This is reflected in the frequencies of individual harmonics, which must remain integer multiples of the fundamental frequency. Signals with changing pitch are called quasiperiodic and form the main study object of this work. Formants correlate to areas in the time-frequency plane where harmonics are particularly prominent. Between t=0.2 and t=0.3 s a formant glide can be observed. During this period the second formant drops from 2000 Hz to 700 Hz, which is caused by a change of the vocal tract during the /L/ of /NUL/ that results in a matching change in the relative contribution of individual harmonics.[2] Most speech recognition systems use the Short Term Fourier Transformation as the basis for further analysis.

[Figure 1.3: The standard speech recognition paradigm is based on Bayes' decision rule, which is depicted in the upper right corner. The input is assumed to consist of words for which probabilistic templates (Hidden Markov Models, HMMs) exist. The combination of templates with the highest probability to reproduce the input signal is the recognition result.]

Modern automatic speech recognition (ASR) systems are based on Bayes' decision rule, as depicted in figure 1.3, to determine which word, or string of words, corresponds to a given sound. Bayes' decision rule leads to a probabilistic framework where each sound is modeled as a so-called Hidden Markov Model (HMM). Each HMM is specialized to produce a word or phoneme in the sense that it has a relatively high probability of producing the corresponding word or phoneme and a much lower probability of producing other words or phonemes.

2. Note that the harmonics go up, while the formant goes down: changes in formants and changes in pitch are uncorrelated.

The probability that a string of word templates w (or another unit of speech) produces the observation sequence y is denoted as P(y|w). Some word strings are more probable than others. This is described by the stochastic language model,[3] which is represented by a set of values for P(w). The word string w with the highest probability of producing y forms the output of the recognition system. This process finds the best word sequence ŵ given input y, i.e., it maximizes P(w|y), according to Bayes' decision rule:

\[
\hat{w} \;=\; \arg\max_{w} P(w \mid y) \;=\; \arg\max_{w} \frac{P(y \mid w)\,P(w)}{P(y)} \;=\; \arg\max_{w} P(y \mid w)\,P(w) \tag{1.1}
\]

The second step on the right-hand side results from the application of Bayes' rule. P(y) is independent of the word sequence w and can therefore be discarded in the final step.

Automatic speech recognition systems require a suitable form of the input[4] y. This input is usually based on a spectrogram-like representation that is tailored to estimate the spectral envelope as efficiently as possible while minimizing pitch information in the parameterization. Typical examples are Mel-Frequency scaled Cepstral Coefficients (Young 1997, Gold 2000) and Perceptual Linear Prediction (Hermansky 1990). Both represent approximations of aspects of human perception, but neither differentiates between types of input: consequently a speech signal is processed in the same way as a nonspeech signal or a mixture of speech and noise. This limits the application of speech-specific knowledge.

3. A stochastic language model of order N represents the probability of a word given its N-1 predecessors. Syntax, semantics, and other linguistic knowledge is implicitly coded in these N-grams.

4. To simplify implementation and to speed up processing, most ASR systems assume statistical independence of consecutive (overlapping) input vectors and statistical independence of the parameter values that constitute each input vector. This assumption is equivalent to assuming that speech is produced by a random process where each parameter of the input is uncorrelated with all other parameter values. Given the speech production overview in figure 1.1 and the discussion in section 1.7, this assumption cannot be justified.
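As a toy illustration of equation (1.1), consider a recognizer choosing among three candidate word strings. The log-probability values below are invented placeholders; in a real system, log P(y|w) would come from the HMM acoustic models and log P(w) from the stochastic language model.

```python
# Invented placeholder scores: acoustic log P(y|w) and language model log P(w).
log_p_y_given_w = {"one two": -42.1, "want to": -44.8, "one too": -43.0}
log_p_w         = {"one two":  -3.2, "want to":  -2.1, "one too":  -6.5}

def best_word_string(candidates):
    # Bayes' decision rule (1.1): maximize log P(y|w) + log P(w).
    # P(y) is independent of w, so it drops out of the argmax.
    return max(candidates, key=lambda w: log_p_y_given_w[w] + log_p_w[w])

print(best_word_string(log_p_y_given_w))   # -> 'one two'
```

The acoustically second-best hypothesis "want to" has the best language-model score, yet "one two" wins on the combined score: the decision rule trades acoustic against linguistic evidence, exactly as the framework above requires.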

The knowledge about the development of speech sounds is stored in the acoustic models P(y|w). These models are based on as much speech data as possible. After training, each model represents the corpus mean (and variance) of the sounds it represents. Consequently, HMM-based recognition systems represent a form of pattern matching where the input is compared with the average development of the training corpus. Often separate models are trained for male and female voices. Although this recognition approach was not originally intended for noisy speech, it is used as a natural starting point for the development of truly noise robust applications.

1.2 Research Approach

This chapter started with the observation that speech is meaningful sound and that its meaningfulness cannot be studied in isolation from the rest of the (communication) system. Since the rest of the communication system involves large parts of our brain and the articulatory and auditory organs, the speech signal itself might only be the tip of the iceberg, in the sense that most of the regularities in the signal are due to processes and limitations of the rest of the system. When this is the case, the speech signal cannot be understood adequately without some understanding of the basic processes that produce and process it. It is even possible that the optimal (and natural) representation of speech depends on the possibilities and limitations of the whole cognitive system. A suboptimal choice of the basic representation of speech and of the required functionality to process speech might be part of the reason why it is still impossible to develop reliable ASR systems for spontaneous and/or noisy speech.

One might therefore expect that ASR research necessarily involves an integration of (cognitive) science and technology. It does not. The development of ASR techniques in the last decades was almost exclusively technology driven and not science driven. On the other hand, research in the diverse fields of cognitive science (the sciences that study the algorithms of the brain) has been extremely fragmented and often aimed at explaining experimental data obtained by studying idealized stimuli in controlled environments.

The history of ASR technology (Gold 2000) shows that ASR technology has converged on the widely accepted HMM approach. There are some alternative schemes, usually involving artificial neural networks, which may lead to systems that function just as well. But these systems suffer from the same basic limitations as HMM-based systems do: they function well within a certain range of operating conditions and fail when these conditions are not met.

Engineers are educated to select, use, and adapt a subset of the large set of well-known and extremely powerful engineering tools. When years of research fail to produce systems with the desired functionality, one might consider the possibility that the current set of engineering tools is not suitable to deal with the problem at hand. In fact, it suggests a problem with the basic assumptions on which modern ASR systems are based. The set of basic assumptions comprises all that the system takes to be infallible: consequently, when these assumptions are not justified, the systems will produce incorrect (and even unpredictable) results. This behavior is in stark contrast to the reliability of human speech recognition performance. This leads to the question: which set of basic assumptions allows such a reliable human speech recognition performance?

One might expect this to be a core question of the cognitive sciences that study the speech process. It is not. Cognitive science is still very fragmented and in search of a common general framework. Consequently, it has great difficulty answering questions about the performance of the whole (speech) process/system. Most cognitive scientists who study speech study abstractions of the speech process that are still far from conversational speech in a normal acoustic environment. This results in scientific modelling that is either too abstract to implement (and therefore very difficult to falsify in real-life situations), or too detailed and complex to be of instant use in any feasible engineering system.

The approach of an engineer is to build a system that works, using a subset of standard engineering techniques. The approach of a cognitive scientist, on the other hand, is to understand aspects of cognition with the learned conceptual background and experimental techniques. The focus of an engineer is on building a working and integrated system, while the focus of the cognitive scientist is (often, but not necessarily) on the understanding of the processes that make up the system. This suggests that the approaches of the engineer and the cognitive scientist might complement each other. This work tries to find a hybrid approach by identifying and applying aspects of cognition that allow the implementation of more reliable speech signal analysis systems.

1.3 Defining a Speech Recognition System

As a first step towards the formulation of a set of reliable basic assumptions, it is useful to formulate the criterion of a speech recognition system. This is what the developer tries to achieve, and consequently what is optimized in the system. In ASR research the criterion is traditionally the reduction of the word-error rate (typically by improving the acoustic model P(y|w)) or the increase in the fraction of successful dialogues (typically by improving the language model P(w) and the system integration). These measures are relevant for human speech recognition as well. But human perception, being the product of evolution, optimizes behavioral benefit in combination with the requirements of the environment. One can therefore postulate that the human auditory system balances requirements like:

- the quality of recognition,
- the speed of recognition,
- the ability to rapidly and correctly process unexpected and potentially behaviorally significant sounds,
- the number of (acoustic) environments in which the system works adequately,
- source characterization (e.g., speaker identification),
- the integration of perception with behavior (the effect of the rest of cognition).

Human behavior shows that words in sentences are recognized almost flawlessly within 1 second (typically 600 to 800 ms), in a wide range of signal-to-noise ratios and in a wide range of qualitatively different acoustic environments (Alefs 1999). More difficult acoustic environments require longer processing (up to a few seconds) and more attention. In most acoustic environments the transition, in terms of the signal-to-noise ratio, from unimpaired speech perception to the complete inability to recognize speech is usually sharp. Around the speech reception threshold (Plomp 1979) the fraction of sentences of which all words are recognized correctly typically decreases by 15% to 20% per dB of signal-to-noise ratio. For a noise condition with a speech reception threshold of -6 dB, all sentences are correctly recognized above -2 dB, and none of the sentences is recognized correctly below -10 dB. The wide range of unimpaired recognition performance, in combination with the sharpness of the transition between unimpaired performance and the complete inability to recognize speech, suggests that human speech perception is optimized to function, both reliably and as fast as possible, over a range of signal-to-noise ratios that is as wide as possible and in a number of realistic acoustic environments that is as large as possible.

Two tentative conclusions can be drawn from this observation. The first conclusion is that the human auditory system optimizes the fraction of time (assuming real-life conditions) that it functions adequately. This forms the first basic conjecture of this thesis:

(1.1) The human auditory system is optimized to function as often as possible in varying and uncontrollable acoustic environments.

In ASR research the term robustness is used to denote a similar property: a method or algorithm is robust if it can deal with a broad range of applications and adapt to unknown conditions (Junqua 1996). The second tentative conclusion is that the speech communication process is optimized to function as often as possible in varying and uncontrollable acoustic environments. This leads to the second basic conjecture of this work:

(1.2) The most important (linguistic) information in a speech signal is represented by those signal properties that can be estimated by the human auditory system in the widest possible range of real-life conditions. These signal properties are, by definition, the most robust properties.

Although this conjecture may seem obvious to some, it is actually difficult to prove, because the information in a speech signal cannot be quantified as long as a quantitative theory of linguistic meaning does not exist. Yet speech communication, by definition, requires the transmission of information. This second conjecture states that the signal contributions that can be estimated in the widest range of (real-life) environments are also the most informative. This allows a change of the search effort from a search for optimally informative linguistic features (which are difficult to quantify) to a search for features that can be estimated in a maximally wide range of acoustic environments (which is easier to quantify).

Although conjecture 1.2 is difficult to prove, there is ample supporting evidence. As discussed in the context of figure 1.1, the peaks in the energy spectrum, which are likely to exert the largest influence in a measurement system, represent the formant information that is associated with word identity. ASR systems use formant information by modelling the spectral envelope. Speech synthesis systems (Keller 1994) use the formant development as the main determinant of intelligibility, and speech-like signals that consist of sinusoids at each formant position are intelligible for trained listeners (Remez 1981). Furthermore, work in the period between 1921 and 1951 by Fletcher, French, Steinberg, and Galt (French 1947, reviewed in Allen 1993) showed that intelligibility scores can be predicted by the articulation index: a measure related to the signal-to-noise ratio (SNR) per frequency band. This suggests again an important role for spectral peaks, because they correspond to positions in the time-frequency plane where the signal-to-noise ratio is maximal.[5] Consequently, spectral peaks are likely to carry linguistic information.

5. This led Allen (Allen 2000) to conclude that the role of the formant is to improve the local SNR.
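The articulation-index idea just mentioned can be made concrete with a small sketch: clip each band's SNR to a range assumed to contribute to intelligibility, normalize, and average with band-importance weights. The 0-30 dB clipping range and the uniform default weights are textbook simplifications, not values taken from French and Steinberg's work or from this thesis.

```python
import numpy as np

def articulation_index(band_snr_db, weights=None):
    """Crude articulation-index-style intelligibility predictor.

    Each frequency band contributes in proportion to its SNR, clipped
    to an assumed useful range of 0 to 30 dB and normalized to [0, 1].
    """
    snr = np.clip(np.asarray(band_snr_db, dtype=float), 0.0, 30.0) / 30.0
    if weights is None:
        weights = np.full(snr.size, 1.0 / snr.size)  # uniform band importance
    return float(np.dot(weights, snr))

# Bands holding spectral peaks (high local SNR) dominate the prediction;
# bands drowned in noise contribute nothing.
print(articulation_index([25.0, 12.0, 3.0, -5.0, 18.0]))   # ~0.39
```

The sketch makes the supporting argument visible: only time-frequency regions with a positive local SNR, i.e., the spectral peaks, can contribute to the predicted intelligibility.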

Note that conjecture 1.1 can be rephrased by stating that the human auditory system is able to correctly process arbitrary sounds as often as possible. The term arbitrary sound now has a special meaning:

(1.3) An arbitrary sound is a sound of which no a priori knowledge is available: it can be any sound out of the set of all possible sound combinations.

This means that of an arbitrary sound nothing is known, except for the most general properties that are valid for all sounds. An arbitrary signal might, or might not, be recognizable by a recognition system. Modern ASR systems cannot recognize arbitrary sounds, and not even arbitrary speech sounds. The correct functioning of an ASR system is usually restricted to a very specific set of words, speakers, types of background noises, transmission and coding properties, etc. This entails that, in the context of ASR, an unknown sound has an extremely restricted meaning: virtually everything about the signal has to be known and has to conform to the system's expectations; the only aspect that is left to be estimated is the word order. When the signal does not comply with the expectations of the system, it fails. Such systems are speech recognition systems in the sense that they recognize any input as being speech. A much more useful system is able to deal with arbitrary input. This can be termed a general recognition system:

(1.4) A general recognition system is a system that can recognize a signal that belongs to a certain target class, if and only if it is actually present in the input.

Note that the term general does not refer to the recognition of all types of input, but is aimed at recognizing all signals of a certain class embedded in an arbitrary background. The rest of this chapter presents a framework for the development of general (speech) recognition systems. Chapter 7 returns to this framework and reflects on the way the techniques developed in this work help to design general recognition systems.

1.4 The Signal-in-Noise-Paradox

This section is devoted to a more careful presentation of one of the main problems one encounters while recognizing arbitrary sounds: how to decide which part of the sound is the target, or otherwise meaningful, without having to recognize the signal. This problem becomes apparent while studying a single example of a signal that might be recorded at a lively cocktail party. A spectrogram of such a signal[6] is depicted in figure 1.4.

[Figure 1.4: A Fourier-based spectrogram of a recognizable signal in cocktail party noise. It is difficult to find all signal components that belong to a single source. This figure was derived with a standard FFT spectrum based on 25 ms Hamming windows and a time step between consecutive frames of 10 ms. A logarithmic energy compression was applied. The signal-to-noise ratio is 0 dB.]

Suppose we aim to recognize speech: that means that we are not interested in the unintelligible background babble or in nonspeech contributions like music. We just want to recognize any intelligible speech that is part of the signal. But since we have not recognized the signal, we do not know where that speech is, what is said, or who was speaking. In fact, usually we cannot even be sure that there is any speech at all.

6. Available at

Since we do not know what to expect, we do not know what to look for. In this example the situation is even more difficult because most of the signal consists of speech sounds. Figure 1.4 shows a lot of structure: part of the visible structure reflects the target speech, part the background, and part a mixture of both. If we could select the target speech, we could discard the background and present the selection to an automatic speech recognition system. If we made the correct selection, the probability that the ASR system will provide the correct answer is maximal. But each selection error, even a small one, will reduce this probability considerably. Now the question changes to: how to make a sufficiently correct selection? We might guess a word string and search for evidence of that word string. When the words are correctly guessed, it is likely that a sufficient fraction of the evidence can be selected. Unfortunately, there is an effectively infinite number of possible word strings, so the search space is huge, and it will be extremely difficult to determine what the best combination of word string and selection is. We are stuck in the speech-in-noise-paradox, or more generally:

(1.5) The signal-in-noise-paradox: a correct recognition result is only possible with a correct selection of evidence, and a correct selection of evidence is only possible with a correct recognition result. Unfortunately, neither a selection nor a recognition result is available.

This paradoxical situation is a direct consequence of working with arbitrary input. The paradox represents a central problem of (speech) recognition, whether natural[7] or artificial. Apparently, the human speech recognition system is very efficient in combining the correct bits and pieces of information. Building an artificial speech recognition system that solves the signal-in-noise-paradox requires intimate knowledge of the functional units of this process and of how they can be modeled computationally.

7. Although the signal-in-noise-paradox is relevant for natural systems, it has obviously been avoided. The paradoxical situation arises from the assumption that selection and recognition must be different processes (a direct consequence of the application of pattern matching for speech recognition purposes). It is likely that our brain avoids the paradox by integrating selection and recognition.

1.5 Solving the Paradox

An obvious question is what it actually means that a sound source can be recognized or characterized. To recognize something entails that it can be classified as an instance of a certain class.[8] Generally speaking:

(1.6) A sound source is classifiable as a member of a certain class if its signal has a sufficient fraction of the characteristic properties of that class, without being inconsistent with the class.

The set of characteristic properties of a signal like speech is defined by the laws of physics and by conventions between speakers. In the case of speech, meaning is a typical characteristic constraint. Furthermore:

(1.7) A recognizable signal has a set of properties that allows its classification. And the source that produced the signal is subject to characteristic (physical) constraints that are reflected in the characteristics of the sounds it can produce.[9]

It is the task of a general sound recognition system to use these characteristics optimally. Modern ASR systems map the input to the best matching template, but do not check whether the input satisfies the characteristic constraints of the class of signals to be recognized. Such systems will produce nonsense if the user cannot ensure the correct type of input. To prevent this type of error an additional constraint must be introduced:

(1.8) An acceptable recognition result is a recognition result that satisfies the characteristic constraints/properties of its class.

The best acceptable recognition result is therefore the best recognition result that matches sufficiently well. Note that a recognition system can never be absolutely certain that a recognition result is correct. It can determine the best interpretation of the data and determine whether the recognition result is acceptable, but this is not a guarantee that the recognition result is correct. Yet the fact that we are rarely confronted with far-reaching and negative consequences of an incorrect interpretation of speech shows that this guarantee is rarely essential.[10]

8. Note the small difference between recognition and classification. Both refer to the same process, but recognition is, by definition, impossible when a signal has not been presented before. In this context recognition and classification will be considered to refer to the final outcome of the system. A more general definition may allow recognition and classification to occur at all levels of processing.

9. Sound sources that are designed to be unclassifiable (on the basis of acoustic information alone) are hi-fi audio systems. These systems have hardly any noticeable physical constraints and can deceive listeners by producing any recognizable class of sound at will.

10. This forms further support for conjecture

Being able to recognize or to characterize a sound source requires detecting it. In animals this has to happen as quickly as possible, since the sound source might require a rapid adaptation of behavior. The onset, here defined as the moment the difference between the absence and presence of the source is detectable, is therefore of great potential importance. The onset is by definition aperiodic. After the onset, the signal might be perceivable for some time. Since the signal is produced by a physical system, it cannot change more rapidly than the physics of the source (and the transmission medium) allow. This means that its signal changes in a continuous manner. Finally, at some moment the sound source will stop or become otherwise unnoticeable by the perceptive system. This produces another discontinuity: the offset. The total response of the perceptive system to a single sound source consists of an onset, often followed by a relatively slowly changing signal body, and concluded by an offset. This leads to the definition of a sound event:

(1.9) A sound event is the audible physical signal that consists of the sequence of an onset, an optional continuous development, and an offset. A sound event describes the input originating from a single source to the auditory system.

Our auditory system usually receives multiple sound events at any given time. Prior to further analysis the auditory system is unaware of the existence of the individual sound events; it simply responds to the combined influence of all concurrent sound events. Yet, since the auditory system is in fact able to recognize multiple concurrent sound events, it apparently assumes that the input is produced by different sources that all produce classifiable signals. According to conclusion 1.7 this means that the auditory system assumes that all sound events show a characteristic set of properties that allow recognition. This leads to a basic assumption of a general sound recognition system:

(1.10) A general sound recognition system may assume that an arbitrary signal consists of a superposition of sound events, each of which, in principle, can be recognized.
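Definition (1.9) suggests a first, deliberately naive way to segment a signal into sound events: mark an onset where smoothed frame energy rises above a threshold and an offset where it falls below again. The frame size and threshold below are illustrative choices, and a single global energy threshold is far cruder than the onset processing developed later in this thesis; the sketch only serves to make the onset/body/offset structure tangible.

```python
import numpy as np

def sound_events(x, fs, frame_ms=10.0, thresh_db=-40.0):
    """Return (onset, offset) times in seconds, per definition (1.9)."""
    hop = int(fs * frame_ms / 1000)
    n = len(x) // hop
    energy = np.array([np.mean(x[i * hop:(i + 1) * hop] ** 2)
                       for i in range(n)])
    level_db = 10 * np.log10(energy / (energy.max() + 1e-12) + 1e-12)
    active = level_db > thresh_db          # frame holds detectable signal

    events, onset = [], None
    for i, a in enumerate(active):
        if a and onset is None:
            onset = i * hop / fs           # silence -> signal: an onset
        elif not a and onset is not None:
            events.append((onset, i * hop / fs))   # signal -> silence: offset
            onset = None
    if onset is not None:                  # source still audible at the end
        events.append((onset, n * hop / fs))
    return events
```

Note that such a detector only finds onsets of the combined signal; separating the concurrent sound events behind those onsets is exactly the problem discussed next.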

If the system is able to assign enough information from the sound events of a single source to a coherent representation that does not contain information of other sources,[11] it can recognize the sound as if the other sounds were not interfering. This can solve the signal-in-noise-paradox.

(1.11) An auditory event is the internal representation of information that can be estimated from a single sound event. Auditory events may not represent information of uncorrelated sources.

If auditory events can be estimated reliably, it is possible to reduce the task of a general sound processing system to a pattern recognition task through the set of combinations of auditory events. A number of combinations of auditory events might be acceptable in the sense of conclusion 1.8. Summarizing:

(1.12) The signal-in-noise-paradox can be solved by grouping continuously developing acoustic information of a single source into auditory events. A search through the set of auditory event combinations might produce a number of acceptable recognition results.

Conclusion 1.12 suggests that speech recognition requires much more than pattern matching:

(1.13) Speech recognition is an active process that aims to find the most meaningful and most consistent interpretation of the signal.

Although this conclusion is used as a central design guideline for the innovations described in this work, it will not be addressed directly.

These conclusions are supported by experiments within the field of auditory scene analysis (see Bregman 1990 for an excellent overview). Bregman describes a large number of experiments in which acoustic evidence is either integrated into a single percept or segregated into multiple percepts. Bregman divides auditory scene analysis into primitive processes and schema-based processes. Primitive processes form a first processing stage. This stage functions independently of the types of sound sources, since it is based on simple and reliable physical cues in the signal. A typical cue for primitive processes is common onset: signal contributions that start at the same time are likely to belong to the same source. Another cue is based on the fact that the frequency developments of the harmonics of a quasiperiodic source are correlated through time. Consequently, correctly coevolving harmonics tend to be integrated into a single percept.

11. As will be shown in chapter 2 and section 4.7, it is generally impossible to form a bottom-up representation without any contamination from concurrent sound events. A noiseless reconstruction of the information of the sound event is only possible after correct recognition.
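The common-onset cue just described is easy to caricature in code: frequency tracks whose measured onset times fall within a small tolerance are grouped as evidence for a single source. The 30 ms tolerance and the input format are illustrative assumptions, not values taken from Bregman or from this thesis.

```python
def group_by_common_onset(tracks, tol_s=0.030):
    """Group (onset_time_s, frequency_hz) tracks that start together."""
    groups = []
    for onset, freq in sorted(tracks):
        for g in groups:
            if abs(onset - g["onset"]) <= tol_s:   # starts with this group
                g["freqs"].append(freq)
                break
        else:                                      # no group close enough
            groups.append({"onset": onset, "freqs": [freq]})
    return groups

# Three harmonics starting together are grouped; the late track stays apart.
tracks = [(0.100, 200.0), (0.102, 400.0), (0.099, 600.0), (0.350, 1300.0)]
print(group_by_common_onset(tracks))
```

A real primitive-grouping stage would of course combine several such cues (common onset, harmonicity, coevolving frequency contours) rather than rely on any single one.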

Schema-based processing forms a second stage. The schemas represent our knowledge about different types of potentially meaningful sound. The activation of a schema requires the activity of primitive processes that comply with the regularities represented by the schemas. This work is mainly aimed at modelling important functional aspects of primitive auditory scene analysis. In particular, it is aimed at auditory event forming. Schema-based processing, although essential for robust speech perception, will be addressed (implicitly) in a proposal for robust speech recognition in section 7.2. The output of a limited implementation of a primitive scene analysis stage, as presented in chapter 2, will be evaluated with a traditional speech recognition system using HMM-based pattern matching. Section 6.1 provides the most elaborate form of primitive ASA in this work.

1.6 Design Basics

The numbered conclusions of the previous sections reduce the search space associated with the development of general sound recognition systems. Conjecture 1.1 is probably the most restrictive of all. If a recognition system has to function as often as possible, it is not allowed to make any unjustified prior assumptions that may prevent the system from reaching correct recognition results. Typical built-in assumptions of modern ASR systems that are often unjustified are (Junqua 1996):

- the input consists of a single source,
- the background noise is absent, stationary, or known,
- the training set matches the operating environment,
- the input can be recognized.

When all of these conditions are met, the system may work adequately, but when one of these conditions is violated the system will fail. Unfortunately, in most application environments, these conditions cannot be guaranteed without considerable effort. A general speech recognition system, of which the human auditory system is an example, remains almost unimpaired in these environments. Apparently, a general speech recognition system uses weaker prior assumptions than modern ASR systems.

(1.14) A recognition system that functions as often as possible must be based on a set of the weakest possible prior assumptions.

Applying the results of previous sections, one can propose a set of two basic assumptions:

1. all sound events show an onset, an optional continuous development, and an offset (definition 1.9);
2. the (physical) constraints of the source that are reflected in the signal allow and determine recognizability (conclusions 1.7 and 1.10).

These basic assumptions form reliable starting points for a recognition process. However, the ability to reach a correct recognition result is determined by the state of the perceptive system and the interaction between all concurrent sound events. Generally speaking:

(1.15) A recognition system is able to recognize a sequence of sound events of a single source correctly, if and only if it can assign a sufficient amount of informative evidence about the source to a single representation, i.e., if it can form a sufficiently informative set of auditory events corresponding to the sound event sequence of the source.

More evidence, i.e., greater redundancy, might facilitate processing, but will not change the recognition result. Less evidence, i.e., reduced redundancy, will eventually produce ambiguous results, incorrect results, or no results at all. The work of Fletcher, French, Steinberg and Galt (French 1947, Allen 1993) in the context of the articulation index showed that the local (in terms of time and frequency) signal-to-noise ratio is the sole determinant of whether or not reliable information can be derived from a speech signal. Reducing the global SNR will reduce the area of the time-frequency plane from which reliable information can be derived, up to a point where insufficient reliable information is left to guide a successful search.

Limits of the measurement process

In accordance with conclusion 1.15, the central problem of the recognition of arbitrary sounds is the assignment of the correct information to the correct representation. Physical measurement theory tells us that any measurement includes an error. This error decreases with the inverse of the square root of the number of measuring points. For stationary signals, the measurement error can be reduced to an arbitrarily small value by increasing the duration or number of measurements. For nonstationary signals, like speech, this is impossible since the signal changes between and/or during measurements.

This means that any system that processes arbitrary, nonstationary signals has to deal with an unknown amount[12] of noise on the values of all feature estimates. For signals of a single source this is usually not very problematic, since most natural signals, including speech, are extremely redundant. But in the case of unknown signals, which is the case until the signals have been identified, it eventually becomes impossible to separate the measurement error from the effects of the interaction between sound events. This means that:

(1.16) It is generally impossible to detect and estimate the values of individual features completely and/or sufficiently reliably. This entails that individual features can never be trusted before an acceptable recognition result is reached.

This is an extremely important operational problem for all measurement systems, including the mammalian sensory system. The best solution is to hypothesize all likely features and feature values and to work with the feature hypotheses as possible interpretations of the input. Although the reliability of individual feature hypotheses might be low, this is certainly not the case for combinations of feature hypotheses. A higher-level interpretation of multiple noisy features can be estimated with a considerably smaller error because it is based on more evidence (a larger context). As will be shown in this work, the pitch of a noisy signal can be hypothesized relatively reliably, although it might be difficult to find the individual harmonics on which the estimate is based and to estimate their frequency. Note that higher-level interpretations are hypotheses as well. In the case of pitch estimation in noise, it is often impossible to determine the precise time of on- and offset, and it is difficult to avoid octave errors. Multiple pitch-contour hypotheses, each differing in the on- and offset time and frequency range, might be necessary. As will be shown in the next chapter, each pitch-contour hypothesis leads to an auditory event hypothesis. Some combinations of auditory events stem from the same source and may lead to acceptable recognition results (and possibly even the correct recognition result). Other sequences will be incomplete or might combine information of multiple sound sources. A good recognition system must, of course, be able to discard these sequences.

12. The amount of noise can only be estimated in the context of a (correct) recognition result.
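The square-root law mentioned above is easy to verify numerically: averaging N independent samples shrinks the standard error of the estimate by 1/sqrt(N), which is exactly why longer measurements help for stationary signals and stop helping once the signal changes under the measurement. A minimal sketch with invented numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate a constant quantity (true value 1.0) from N noisy samples
# (noise sigma 0.5) and look at the spread of the estimate: it shrinks
# as 1/sqrt(N), so quadrupling N halves the measurement error.
for n in (100, 400, 1600):
    estimates = rng.normal(1.0, 0.5, size=(2000, n)).mean(axis=1)
    print(n, round(estimates.std(), 4))   # ~0.05, ~0.025, ~0.0125
```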

To summarize the consequences of the physical limitations of the measurement process:

(1.17) The inability to estimate features of arbitrary signals reliably forces a recognition system to work with hypotheses about features and feature values.

(1.18) Higher level feature hypotheses are more reliable than the lower level hypotheses they are based upon, and the whole is easier to recognize than the constituting features. 13

and, in accordance with conclusion 1.15:

(1.19) The first processing stages of the auditory system must result in a set of auditory event hypotheses that includes the auditory events required to reach a correct recognition result. Later processing stages must select the best acceptable recognition result.

This thesis is aimed at the development of a number of techniques that, eventually, allow the formation of a set of auditory event hypotheses in accordance with conclusion 1.19. A technique that approaches auditory event forming is presented in section 6.1.

1.7 Quasi-Stationarity and Continuity

This section addresses the justified application of quasi-stationarity and the importance of continuity to keep track of signal components of a single source.

(1.20) A signal component is a single physically coherent signal constituent that can be described by specifying the temporal development of frequency and energy (phase is optional). Signal components can be combined to form sound events.

Signal components usually refer to either harmonics, complexes of harmonics, or aperiodic contributions such as noise bursts and on- and offset transients (see section 4.4).

13. Note that conclusion 1.18 entails that the final disambiguation of noisy features is a top-down process.

Quasi-stationarity

Most speech signal processing techniques are based upon the quasi-stationarity assumption. This means that certain aspects of the signal, like amplitude and frequency content, can be modeled as originating from a process that is assumed to be constant over short periods 14 (for speech a value of around 10 ms is usually chosen). The rationale for this assumption is that speech is produced by a physical system that cannot change infinitely fast (Young 1997). This is a perfectly reasonable assumption that is used extensively in this thesis. But the assumption holds exclusively for the signal of a single source (speaker). If a signal is produced by two speakers, it will change more rapidly, and certainly differently, than is allowed by the physics of a single vocal tract. Consequently, a form of quasi-stationarity that is only valid for a single vocal tract is not justified for mixtures of speakers and should be avoided. In an arbitrary, unknown environment the situation is even worse, since signal contributions might exist for which quasi-stationarity is never a useful approximation. If quasi-stationarity is nevertheless applied, the induced approximation errors will degrade the combined signal irreparably and therefore reduce the probability of reaching a correct recognition result.

Quasi-stationarity is often implemented by blocking the signal into frames and assuming (i.e., hoping) that the sequence of consecutive frames provides a sufficiently adequate description of the frequency content of the target signal through time. Since the width of the frame (or the effective width of a window) is inversely proportional to the frequency resolution, a trade-off between temporal and frequency resolution is introduced. Signals in which frequency detail and temporal detail are both important cannot be processed optimally in a frame-based approach. Another direct consequence of this approach is that the use of frames introduces discontinuities that make it difficult to determine the continuity (and consequently the existence) of underlying signal components. This in turn makes it more difficult to assign signal information of a single source to a single representation.

14. The application of quasi-stationarity is very similar to the sample-and-hold process that is used to transform continuous signals into discrete signals. The correct application of quasi-stationarity is subject to the same conditions as the sampling process: the target features of the original signal must not contain frequencies that are represented ambiguously in the sampled version (Papoulis 1984, see "The sampling theorem for stochastic processes"). The dynamics of the source and the choice of features determine the sample frequency and thereby the maximal period of stationarity.

The use of non-rectangular windows while discarding phase (the temporal information within the windowed signal) exacerbates this problem even more. 15 To conclude:

(1.21) Quasi-stationarity, with a proper time constant, must only be applied to individual signal components or to complex signals, like the speech of a single speaker, for which it holds. As long as the signal, or some selection of it, is not positively identified as suitable for the quasi-stationarity assumption, the application of quasi-stationarity is not justified and may lead to suboptimal or incorrect results.

This entails that virtually all speech signal processing techniques are ill-suited for use in a general acoustic environment. In particular, techniques like the Short Term Fourier Transform (STFT), Linear Prediction (LP) and frame-based filterbank methods should not be used on arbitrary signals. These techniques are nevertheless applied to such signals, often without much success, or with success on a very narrow range of applications (see Junqua 1996, which provides an overview of robust speech signal processing techniques). In the latter case it is, again, the responsibility of the user to ensure that the input complies with the demands of the class of signals that can be dealt with. Quasi-stationarity, with a proper time constant, can only be applied safely to signal contributions of a single source. For an unknown mixture of sound events a more suitable form of signal processing is required.

Continuity

When aiming to develop a sound analysis system that assigns information of a single source to a single representation, one has to exploit the regularities of the source as well as possible. Unfortunately the regularities of the source are unknown, because the source is not yet classified. The system can, according to conclusion 1.14, only assume the weakest possible prior knowledge. But in accordance with conclusion 1.9, it is safe to assume that any sound event has an onset, an optional continuous development and an offset. Consequently: all sound events that are not impulse-like show a continuously developing part. 16 In the case of speech, most kinds of music and a wealth of natural signals, a continuous development is prominent most of the time.

15. More generally: each irreversible operation destroys information, and all irreversible operations should be avoided as long as the signal is not split into individual signal components.

16. It will be demonstrated in section 4.3 that even the response to an impulse can be characterized by a distinctive continuous development.

Only for some plosives, like the /t/, the /k/ or the /p/, might it be argued that a continuous development is not visible in the signal. Utterances like "Why I owe you an hour?", on the other hand, can be pronounced in such a way that the complete utterance forms a single continuous whole.

Continuity between on- and offset is a well-defined signal property that is shared by all sound events. Continuity, provided it can be justified from the signal, can therefore be exploited without any further knowledge of the type of signal. Continuity is therefore ideal to guide the assignment of acoustic evidence from individual sound events to auditory events. 17 As long as a signal component shows a continuous development, it is likely to stem from a single source. This is a fairly safe conclusion, because the probability is small that uncorrelated sources give rise to signal components that extend each other smoothly. Furthermore, signal properties such as pitch-contours are continuous as well, and can help to group different signal components together: all harmonics of a single quasiperiodic sound event remain integer multiples of the fundamental frequency. Frequency contours consistent with a certain fundamental frequency contour are likely to belong to the same source, or, as is often the case with music, to multiple sources with a correlated temporal development. Consequently:

(1.22) Continuity of signal components forms one of the most reliable cues for assigning information from a single source (or multiple correlated sources) to a single representation. While this process is not complete, continuity through time and frequency ought to be conserved.

In this light, it is to be expected that the first processing stages of the auditory system conserve continuity as well as possible. In the auditory system the transduction from sound, i.e., pressure fluctuations, to neural information is performed around a structure called the basilar membrane (see figure 2.2 or figure 3.1). The basilar membrane is a coherent physical structure that can be described by the physics of transmission lines. A transmission line is a structure that is continuous in both time and place, where in the case of the basilar membrane each position corresponds to a certain frequency that it responds best to. Consequently:

(1.23) The basilar membrane transduces acoustic vibrations to neural information and conserves continuity in time and frequency (via its correspondence to place) for further processing.

17. Although powerful, the use of continuity is in general neither required nor sufficient.
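For intuition about this place-frequency correspondence, a commonly cited parameterization of the human cochlear map is Greenwood's formula (Greenwood 1990); note that this is an illustrative standard relation, not the transmission-line model used in this work:

$$f(x) = 165.4\left(10^{2.1x} - 0.88\right)\ \text{Hz}$$

where x runs from 0 at the apex to 1 at the base, giving roughly 20 Hz at the apex, roughly 20 kHz at the base, and an approximately logarithmic place-frequency relation over most of the membrane.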

This property forms the basis of all new techniques described in this work. Furthermore, each signal component tries to enforce its frequency and phase upon the basilar membrane. This succeeds best at the basilar membrane positions that are most sensitive to its frequency. Some signal components may entrain a region so efficiently that they are locally dominant. 18 This entails that regions of the time-place plane where the local signal-to-noise ratio is favorable will be dominated by the target. This is in accordance with conclusion 1.15 and the experimental and theoretical work of Fletcher, French, Steinberg and Galt, which showed that the local SNR (and not the spectrum!) determines the intelligibility of (nonsense) words. Consequently:

(1.24) The separation of a sound into the individual signal components starts at the basilar membrane, where each signal component tries to dominate the corresponding BM-region.

This conclusion is (implicitly) implemented in Computational Auditory Scene Analysis (CASA) systems like those of Brown (1994) and Cooke. Since these systems are not based on representations that preserve continuity as well as possible, they cannot exploit the continuous nature of signal development optimally. Most computational auditory scene analysis systems (Rosenthal 1998) are based on varying combinations of neurophysiological plausibility and functional considerations. Yet, although Brown acknowledges the importance of weak assumptions, current CASA is not based on the most rigorous application of the weakest possible basic assumptions; consequently these endeavors are unlikely to lead to general recognition systems.

1.8 Task

The previous sections provided a starting point for sound processing that can help to develop recognition systems that are suitable for a maximally wide range of environments and tasks. As stated in section 1.4, the signal-in-noise-paradox represents a central problem of modern automatic speech recognition. It can be solved by assigning all estimable evidence of a single source to a single auditory event stream.

18. Whenever the term local is used in the context of the time-frequency or time-place plane, it reflects locality in both time and frequency, or both time and BM-position.

When the combined evidence represents sufficient information, the signal can be recognized. This work focuses on a few essential steps in the auditory event estimation process. In particular: the identification of BM-regions where the influence of a single signal component can be estimated, the characterization of the information represented by these basilar membrane regions (typically the temporal development of energy and frequency), and the identification of combinations of signal contributions of the same source.

It is, according to conjecture 1.2, assumed that the most robust features of a speech signal represent the most important (linguistic) information. The work of Fletcher et al. showed that robust speech features correspond to regions of the time-frequency plane with a favorable SNR. Consequently: this work aims to identify and combine signal components of a single source that are likely to have a favorable (local) SNR. Success is therefore measured by:

1. identifying and describing, in terms of the temporal development of frequency and energy, the signal components of a clean target signal that are likely to have a high SNR,
2. selecting target signal components and discarding non-target components from a noisy version of the signal, and
3. determining that the selected signal components represent the same temporal development of frequency and energy as the clean target.

When all three demands are satisfied (and conjecture 1.2 is valid) it is possible to develop speech recognition systems that are maximally robust, because they are based on the most reliable (i.e., highest local SNR) and most informative features in the signal. Section 6.3 (figure 6.15 in particular) provides a quantification of a measure of success based on these three demands.

The main focus of this work is on the segregation of a signal into signal components. The integration of signal components can sometimes be justified from the signal, but often the combination of signal components depends on the application of linguistic or other knowledge, which will be discussed in the form of a proposed recognition system in section 7.2. This work only addresses the use of monaural information. Binaural processing (i.e., interaural time and level differences), although powerful, is not addressed.

1.9 Conclusions and Definitions

This chapter can be summarized by repeating the numbered conjectures, definitions, and conclusions:

(1.1) The human auditory system is optimized to function as often as possible in varying and uncontrollable acoustic environments.

(1.2) The most important (linguistic) information in a speech signal is represented by those signal properties that can be estimated by the human auditory system in the widest possible range of real-life conditions. These signal properties are, by definition, the most robust properties.

(1.3) An arbitrary sound is a sound of which no a priori knowledge is available: it can be any sound out of the set of all possible sound combinations.

(1.4) A general recognition system is a system that can recognize a signal that belongs to a certain target class, if and only if it is actually present in the input.

(1.5) The signal-in-noise-paradox: a correct recognition result is only possible with a correct selection of evidence, and a correct selection of evidence is only possible with a correct recognition result.

(1.6) A sound source is classifiable as a member of a certain class if its signal has a sufficient fraction of the characteristic properties of that class, without being inconsistent with the class.

(1.7) A recognizable signal has a set of properties that allows its classification. And the source that produced the signal is subject to characteristic (physical) constraints that are reflected in the characteristics of the sounds it can produce.

(1.8) An acceptable recognition result is a recognition result that satisfies the characteristic constraints/properties of its class.

(1.9) A sound event is the audible physical signal that consists of the sequence of an onset, an optional continuous development, and an offset.

(1.10) A general sound recognition system may assume that an arbitrary signal consists of a superposition of sound events of which each, in principle, can be recognized.

(1.11) An auditory event is the internal representation of information that can be estimated from a single sound event. Auditory events may not represent information of uncorrelated sources.

(1.12) The signal-in-noise-paradox can be solved by grouping continuously developing acoustic information of a single source into auditory events. A search through the set of auditory event combinations might produce a number of acceptable recognition results.

(1.13) Speech recognition is an active process that aims to find the most meaningful and most consistent interpretation of the signal.

(1.14) A recognition system that functions as often as possible must be based on a set of the weakest possible prior assumptions.

(1.15) A recognition system is able to recognize a sequence of sound events of a single source correctly, if and only if it can assign a sufficient amount of informative evidence about the source to a single representation, i.e., if it can form a sufficiently informative set of auditory events corresponding to the sound event sequence of the source.

(1.16) It is generally impossible to detect and estimate the values of individual features completely and/or sufficiently reliably.

(1.17) The inability to estimate features of arbitrary signals reliably forces a recognition system to work with hypotheses about features and feature values.

(1.18) Higher level feature hypotheses are more reliable than the lower level hypotheses they are based upon, and the whole is easier to recognize than the constituting features.

(1.19) The first processing stages of the auditory system must result in a set of auditory event hypotheses that includes the auditory events required to reach a correct recognition result. Later processing stages must select the best acceptable recognition result.

(1.20) A signal component is a single physically coherent signal constituent that can be described by specifying the temporal development of frequency and energy (phase is optional). Signal components can be combined to form sound events.

(1.21) Quasi-stationarity, with a proper time constant, must only be applied to individual signal components or to complex signals, like the speech of a single speaker, for which it holds. As long as the signal, or some selection of it, is not positively identified as suitable for the quasi-stationarity assumption, the application of quasi-stationarity is not justified and may lead to suboptimal or incorrect results.

(1.22) Continuity of signal components forms one of the most reliable cues for assigning information from a single source (or multiple correlated sources) to a single representation. While this process is not complete, continuity through time and frequency ought to be conserved.

(1.23) The basilar membrane transduces acoustic vibrations to neural information and conserves continuity in time and frequency (via its correspondence to place) for further processing.

(1.24) The separation of a sound into the individual signal components starts at the basilar membrane, where each signal component tries to dominate the corresponding BM-region.

CHAPTER 2 Introduction to CPSP

The previous chapter formulated the theoretical basis for the recognition of arbitrary signals on which the rest of this work is built. The purpose of this chapter is twofold. In the first place it provides a general overview of the basic Continuity Preserving Signal Processing (CPSP) techniques. Secondly, it describes a robust speech signal selection system and a first recognition experiment with a standard HMM-based speech recognition system.

This chapter focuses on the selection of quasiperiodic signals. In terms of time and energy, quasiperiodic signals form an important fraction of natural signals and are of central importance for speech recognition in noise. The last five numbered theorems of the previous sections form the basis for the main line of thought. Operational and implementation details are presented in later chapters. The chapter starts with an overview of the most important techniques and representations that are addressed in this work.

[Figure 2.1 is a block diagram. An input sound feeds the basilar membrane model (2.3 and 3.1) and the cochleogram (2.3). The path for quasi-periodic signals runs via ridge estimation (2.6 and 3.5), running autocorrelation (2.6 and 4.5), instantaneous frequency estimation (2.7 and 4.5), the TNC: Time Normalized Correlogram (2.5, 4.1 and 4.2), P0-contour estimation (2.9 and 5.1) and the TAC: Tuned Autocorrelation (2.4, 4.6 and 4.7). The path for all signals runs via the characteristic period correlation (4.3) and onset and offset characterization (4.4). The results feed masks and auditory elements (2.11 and 6.1), inverse BM filtering to sound (2.11 and 6.2), reconstruction of the cochleogram (2.12, 3.4 and 3.1), parameterization of the reconstruction (2.12) and ASR coding (2.13).]

Figure 2.1. Overview of techniques and representations presented in this work. The numbers refer to the sections that discuss the topics. Chapter 2 introduces most techniques, but focuses on quasi-periodic signals. The topics in the dashed boxes are developed to function with aperiodic signal components as well; these topics will be introduced in chapter 4. The Time Normalized Correlogram (TNC) forms the central representation of this work.

2.1 Overview of Representations and Techniques

Figure 2.1 provides an overview of the techniques and representations discussed in chapters 2 to 6. Together, these techniques describe a new methodology to select coherent information about individual signal components from (possible) mixtures of sound sources (see sections 2.11 and 6.1). The selected information can either be used to resynthesize sound (see section 6.2), or it can be used to form a parametric description of the sound for coding and/or automatic recognition (see sections 2.12 and 3.6).

The input of the system is an arbitrary sound that is processed by a basilar membrane model that preserves continuity through time and place.

The system identifies regions of the time-frequency plane (in this work actually a time-place plane) that have a very high probability of being dominated by a single source. A set of regions that is attributed to a target source is called a mask. The boxes in figure 2.1 near the arrow marked "quasi-periodic signals" are necessary to determine the fundamental period contours that are used to select periodic signals. The dashed boxes below the lower arrow marked "all signals" can be used to identify and characterize aperiodic signal contributions such as noise bursts and on- and offsets. This chapter introduces the techniques in the solid boxes; later chapters provide a more detailed treatment of all representations and techniques.

Section 2.2 addresses a continuity preserving split-up of the input sound into a large number of coupled frequency channels with a model of the basilar membrane (BM). A continuous representation of the temporal development of the energy as expressed by all basilar membrane segments is called a cochleogram. The cochleogram, defined in section 2.3, represents information similar to that of the short term Fourier energy spectrum, but the cochleogram is a representation of energy as a function of time and place instead of time and frequency. Chapter 3 discusses the basilar membrane model and the cochleogram in depth.

To separate sound sources, knowledge about the desired target sources must be applied. Section 2.4 introduces a fundamental period contour (or pitch-contour) into the cochleogram definition to arrive at the Tuned Autocorrelation (TAC). A more detailed study of TAC estimation and the influences of interfering sounds is provided in section 4.6 and section 4.7, respectively. Both the tuned autocorrelation and the cochleogram form a subset of the Time Normalized Correlogram (TNC): a three-dimensional quadratic-domain representation of basilar membrane movements. Because the TNC conserves continuity through time, place (corresponding to frequency) and periodicity, it forms the central representation of this thesis. The TNC is introduced in section 2.5. Chapter 4 is devoted entirely to the TNC and its properties.

For the tuned autocorrelation to be of practical use, fundamental period contours must be estimated from unknown input. This requires the identification of ridges in the cochleogram. Ridges often correspond to resolved harmonics and pinpoint regions with a high SNR; consequently they indicate regions from which reliable information can be derived. Fortunately, these are also the regions where the quasi-stationarity assumption can be applied safely. Ridge estimation is treated in section 2.6, while section 3.5 explains how strong and resolved harmonics inevitably lead to ridges.

When ridges represent resolved harmonics, it is possible to model the frequency development of the signal contribution (harmonics) that produced the ridge. This leads to the estimation of Local Instantaneous Frequency (LIF) contours, which are introduced in section 2.7 and described in more detail in section 4.5. Local instantaneous frequency contours can be combined to arrive at the fundamental period contours required for tuned autocorrelation estimation. This is the subject of section 2.9 and chapter 5. The pitch estimation algorithms that are presented in this work are suboptimal in the sense that their basic assumptions are not suitable for a truly arbitrary input signal. The presented algorithms are therefore intended as a proof-of-concept and a basis for further development. The design decisions that led to the presented system are outlined in section 2.8.

When suitable fundamental period contours have been estimated, they can be used to select quasi-periodic information with the tuned autocorrelation (section 2.10). The resulting TAC-selections can be processed further. As a first step it is useful to form a mask of the most reliable information. A simple example of a mask is introduced in section 2.11, while a more formal and complete treatment of mask forming is given in section 6.1, which addresses the formation of auditory elements, i.e., place-time regions with diverse properties that are likely to be dominated by a single source. The information represented by the auditory elements can be used for the resynthesis of the target sounds (section 2.11 and section 6.2). Alternatively, it can be used to measure the individual signal contributions (e.g., section 3.4) and to reconstruct the clean (i.e., original) cochleogram (section 3.6). Or it can be used to parameterize the selected information (section 2.12) for automatic speech recognition experiments. Some preliminary ASR experiments are presented in section 2.13.

Chapter 2 does not address aperiodic signals. On- and offset effects, as well as the response of the TNC to an impulse and to white noise, are treated in section 4.2. Since aperiodic signals represent a continuous distribution of frequencies (instead of the discrete set of frequencies of periodic signals), one can actively search for continuous distributions of frequency components. This is discussed in section 4.3, which introduces the Characteristic Period Correlation (CPC) as a means to measure (local) dominance. The CPC and the cochleogram can be used as a basis for the identification and characterization of on- and offsets. On- and offset characterization is treated briefly in section 4.4.

Figure 2.2. Left: a schematic representation of the uncoiled basilar membrane; right: some examples of the BM-segment velocity as functions of time, at a high, middle and low position. Neighboring segments show a similar, but not equal, velocity at each moment in time. The BM is continuous in time and place. Notice that the lower and middle BM responses show a common periodicity T.

Chapter 3 addresses the basilar membrane model, but focuses on the main properties of the cochleogram as a continuous representation of energy 1 r(s,t) as a function of basilar membrane position s and time t. Chapter 4 addresses the Time Normalized Correlogram. The TNC r(s,t,T) is a generalization of the cochleogram that includes periodicity T. Chapter 5 addresses the details of the fundamental period estimation techniques that were developed. Chapter 6 focuses on the identification and use of coherent signal contributions. These coherent signal contributions resemble the concept of auditory events that was introduced in chapter 1. Finally, chapter 7 summarizes the main findings and reflects on the whole work.

2.2 The Basilar Membrane Model

Figure 2.2 shows a very schematic representation of the essential features of the basilar membrane. The basilar membrane is a coiled-up structure with a length of 3.5 cm that is situated in the cochlea, a snail-house-like structure of about 1 cm³.

1. Note that energy is not referred to by the symbol E, as is customary. The representation from which the energy measure is derived is a correlation function, for which the symbol r is typically used.

The side of the basilar membrane near the opening of the snail-house is most sensitive to frequencies of about 20 kHz. Further inside the cochlea the frequency to which each position is most sensitive decreases, down to 20 Hz, according to an (approximately) logarithmic place-frequency relation. The frequency range of the basilar membrane therefore spans 3 orders of magnitude, or about 10 octaves. Approximately 3000 hair cells, evenly spread along the basilar membrane, transduce the local vibrations to graded potentials, which in turn are coded as action potentials and transmitted by neurons to the brainstem. The axons of these neurons form the auditory nerve. Figure 3.1 on page 84 shows some of these details.

This thesis uses a (simplified) linear, one-dimensional transmission line model of the basilar membrane (Duifhuis 1985, van Hengel 1996). The most relevant properties of the model are continuity in both time and place and a one-to-one place-frequency relation. This entails that the basilar membrane model can be interpreted as a filter bank with physically coupled filters: neighboring filters show similar displacements at all points in time. Although the original basilar membrane model can be nonlinear, like the actual basilar membrane, a linear version of the model is used. This allows an efficient implementation as an overlap-and-add filter bank and it helps to solve the central problem: how to separate a mixture of signals. After all, linearity entails additivity, which can be interpreted such that a mixture of signals a and b can be split without introducing cross-terms that depend on both a and b. The absence of cross-terms, which cannot be guaranteed in most nonlinear systems, simplifies the design and implementation of a sound separation system.

The original basilar membrane model requires an internal update frequency of 400 kHz and has 400 segments that span the full human frequency range. To reduce processing time the model was reimplemented as a filter bank with 100 channels spanning a frequency range between 30 and 6100 Hz. The filter bank implementation requires an in- and output sample frequency of 20 kHz. This reduction of the number of channels trades spatial and temporal resolution, and indirectly noise robustness, for computational efficiency.
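The transmission-line model itself is too involved for a short illustration, but its filter-bank reading can be sketched. The fragment below uses independent gammatone channels as a stand-in: the function names (centre_freqs, bm_response), the gammatone shape and the ERB bandwidths are illustrative assumptions, and independent filters lack the physical coupling between neighboring segments that the transmission line provides. Only the channel count, frequency range and sample rate follow the text.

```python
import numpy as np

FS = 20_000        # in- and output sample frequency (Hz), as in the text
N_SEG = 100        # number of channels/segments, as in the text

def centre_freqs(n=N_SEG, fmin=30.0, fmax=6100.0):
    """Log-spaced centre frequencies between 30 and 6100 Hz, mimicking the
    approximately logarithmic place-frequency relation."""
    return np.geomspace(fmin, fmax, n)

def gammatone_ir(fc, fs=FS, dur=0.025):
    """4th-order gammatone impulse response: a common single-channel
    cochlear approximation, NOT the coupled transmission-line segments
    of the thesis model."""
    t = np.arange(int(dur * fs)) / fs
    erb = 24.7 + 0.108 * fc                      # Glasberg & Moore ERB (Hz)
    g = t**3 * np.exp(-2*np.pi*1.019*erb*t) * np.cos(2*np.pi*fc*t)
    return g / np.sqrt((g**2).sum())             # unit-energy normalization

def bm_response(signal, fs=FS):
    """Return x[s, t]: per-segment 'BM velocity' for a 1-D input signal."""
    return np.stack([np.convolve(signal, gammatone_ir(fc, fs), mode='same')
                     for fc in centre_freqs()])
```

Because every filter is linear, bm_response(a + b) equals bm_response(a) + bm_response(b) up to rounding error, which is exactly the additivity that the argument above relies on.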

2.3 The Cochleogram

A time-frequency representation, like an FFT-based energy spectrogram, is thought to represent the relevant information for the perception of speech and can be interpreted straightforwardly. Unfortunately (given the discussion in section 1.7) it is discontinuous in both time and frequency. A spectrogram-like time-frequency representation, continuous in place (and indirectly frequency), can be computed by averaging the energy of (overlapping) frames of each basilar membrane segment. However, this procedure implies quasi-stationarity, which (in accordance with theorem 1.21) ought to be avoided since the input is not yet identified as a signal for which quasi-stationarity holds.

A continuous alternative in both time and place for the FFT spectrogram is the leaky integrated square of the displacement or velocity of the basilar membrane segments. 2 The use of velocity is preferred over the use of displacement because velocity enhances high-frequency components, which reduces the masking of high-frequency components by lower frequency components (see section 3.3). Leaky integration describes a process where the system, at each point in time, loses information about its previous state, but learns about the present. In this case:

$$r_s(t) = r_s(t-\Delta t)\,e^{-\Delta t/\tau} + x_s(t)\,x_s(t), \qquad s = 1, \ldots, s_{\max} \tag{2.1}$$

r_s(t) denotes the value of the leaky integrated energy of segment s at time t, Δt is the sample period, t-Δt denotes the time of the previous sample, and x_s(t) is the current output value of the channel. The time constant τ of this first order system determines a scope of memory. For large values of τ the value of the exponential function is very close to unity; for small values the influence of the exponential function becomes more prominent, since it reduces the contribution of the previous value of r_s(t). The square term x_s(t)x_s(t) is non-negative, hence r_s(t) is non-negative as well. Equation 2.1 can be generalized to:

$$r_s(t) = L\{x_s(t)\,x_s(t)\} \tag{2.2}$$

where the operator L{} denotes any form of lowpass filtering. The value of τ is kept at 10 ms throughout this thesis. Real neurons perform a leaky integration process as well, and 10 ms is a normal value for neurons (Weiss 1996, Segev 1995).

2. Note that the use of the basilar membrane displacement leads to a true energy representation. The use of the BM velocity leads to a representation that is related to energy via a frequency dependent correction factor. Although not completely justified, the term energy will be used throughout the work.
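A direct, deliberately unoptimized transcription of the recursion in equation 2.1, with the 10 ms time constant used throughout this work, might read as follows. The compression and downsampling steps anticipate equation 2.3 and the 200 Hz frame rate introduced below; the input is assumed to be an array of per-segment BM velocities such as the bm_response sketch above produces.

```python
import numpy as np

def cochleogram(x, fs=20_000, tau=0.010, frame_hz=200):
    """Leaky-integrated energy per equation 2.1, compressed per equation 2.3.
    x: [n_segments, n_samples] array of BM velocities."""
    decay = np.exp(-1.0 / (fs * tau))      # e^(-Δt/τ) for Δt = 1/fs
    r = np.empty_like(x)
    r[:, 0] = x[:, 0] ** 2
    for t in range(1, x.shape[1]):         # r_s(t) = r_s(t-Δt)e^(-Δt/τ) + x_s(t)²
        r[:, t] = r[:, t - 1] * decay + x[:, t] ** 2
    step = fs // frame_hz                  # one frame per 5 ms at 200 Hz
    return r[:, ::step] ** 0.15            # R_s(t) = [r_s(t)]^0.15 (eq. 2.3)
```

The loop form mirrors the recursion; since leaky integration is a first-order low-pass filter, the same result can also be obtained with a one-pole IIR filter.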

While the input of equation 2.1 is the squared basilar membrane velocity (Lim 1985), the neurophysiological equivalent is the all-positive, amplitude compressed, half-wave rectified basilar membrane velocity. The half-wave rectification is performed by the hair cells in the organ of Corti. The natural system shows a dynamic range compression of the BM movements x that is often approximated as a cubed root: x^(1/3) (Stevens 1957, Hermansky 1990). Dynamic range compression is necessary to bring all relevant features within the same range. This is important because r_s(t), computed according to equation 2.1, has a dynamic range that, due to the nature of speech, can be 50 dB or more (Furui 1989). To compensate for the square in equation 2.1, the effect of the cubed root is doubled and approximated by x^0.15:

$$R_s(t) = [r_s(t)]^{0.15} \tag{2.3}$$

All signal processing in this work is performed in the linear domain. Equation 2.3 is applied either as a last step of processing or prior to visual presentation (as is the case in most figures). Since leaky integration is a low-pass filtering process, the output r_s(t) can be downsampled to sampling rates of the order of the inverse of the integration time constant. To accommodate sharp onsets a sampling rate of 200 Hz, corresponding to 1 sample per 5 ms, is chosen. 3 This leads to the cochleogram as the desired doubly continuous time-frequency representation.

Figure 2.3 shows the cochleogram of the Dutch word /NUL/ (English: ZERO), spoken by a female speaker. This word is part of the target sentence /NUL EEN TWEE DRIE/ (English translation: /ZERO ONE TWO THREE/) that will be used throughout this thesis. Unlike the FFT-spectrum of figure 1.2, the cochleogram conserves continuity in time and place (or frequency). It is instructive to consider the main structures of this figure. The broad red band, starting at approximately t=50 ms and f=220 Hz, is the first harmonic h1, corresponding to the fundamental frequency f0. The fundamental frequency rises during the utterance to a value above 350 Hz. The band above and parallel to the first harmonic is the second harmonic h2. The lowest few harmonics form the first formant F1.

3. This is an application of quasi-stationarity. It is assumed that the local change in energy and frequency of a noisy signal can be modeled sufficiently well with a sampling rate of 200 Hz. Speech scientists often use 100 Hz as sampling rate. The optimal sampling rate may be a function of the basilar membrane position.

Figure 2.3. The cochleogram of the word /NUL/. Since it is produced by a single vocalization, it shows a single coherent development. The cochleogram is sampled every 5 ms. Notice the discontinuity at higher frequencies marking the end of the /N/ and the onset of the /U/. The /L/ appears as a glide of the second formant between t=200 and t=350 ms that sequentially highlights the best matching underlying harmonic. The vertical line indicates the position from which the information of figure 2.4 is derived.

A second formant F2 becomes visible after the transition from the /N/ to the /U/ at t=120 ms and drops during the /L/ from 2000 Hz to about 900 Hz. Notice that this change of formant position entails that different harmonics succeed each other as the most prominent local frequency contribution. A third formant, F3, is marginally visible during the /N/ but becomes prominent during the rest of the utterance. In the higher frequency regions a fourth and, vaguely, even a fifth formant are visible.

The transition from the /U/ to the /L/ is smooth; the transition from the /N/ to the /U/ is partially discontinuous due to the transition from the nasal /N/ to the vowel /U/: at the end of the /N/ the tip of the tongue leaves the hard palate, allowing the oral cavity to resonate in addition to the nasal cavity. Notice that the onset discontinuity of the word is sharp and the offset is smooth. This is due to the exponential decay of the leaky integration process and the ringing-out effect of the basilar membrane in combination with the nonlinearity of equation 2.3.

Figure 2.4. The cross-section at t=175 ms of the cochleogram of figure 2.3. The numbers correspond to the harmonics of a fundamental frequency of 217 Hz. The lower harmonics are resolved; the higher harmonics form complexes that correspond to the second and higher formants. Note that each signal component tries to recruit a region of the BM. The strongest local contribution dominates. The upper horizontal axis indicates the corresponding frequency in Hz of the segments of the lower horizontal axis.

A vertical cross-section of the cochleogram at t=175 ms is depicted in figure 2.4. This figure shows a representation of the energy distribution as a function of segment number (the lower horizontal axis) or the corresponding frequency (upper axis), corresponding to the information under the vertical line in figure 2.3. Note the peaked structure. For low segment numbers the peaks correspond to resolved harmonics. For higher segment numbers the individual harmonics become less well resolved and eventually merge into formants. This behavior is a direct consequence of the nonlinear place-frequency relation. Several harmonics are depicted in the figure. The first three harmonics, and the 9th, the 13th, the 18th and the 25th, dominate the response. The 4th to 8th harmonics are just resolved; for the 10th to 12th harmonics only minimal visible evidence exists. These harmonics are (partially) masked by the other components. Although the higher harmonics are not resolved, they do contribute to the shape of the formants and to the timbre of the vowel /U/.
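Peaks like those of figure 2.4 are the anchors that the ridge estimation of section 2.6 links through time. A minimal peak picker over a single cochleogram frame could look as follows; the 5% floor relative to the frame maximum is an illustrative choice, not a parameter from this work. Note how masked components, such as the 10th to 12th harmonics above, are skipped automatically because they produce no local maximum.

```python
import numpy as np

def frame_peaks(energy, min_rel=0.05):
    """Indices of local maxima in one cochleogram frame.
    energy: 1-D array of per-segment (compressed) energies."""
    e = np.asarray(energy)
    is_peak = (e[1:-1] > e[:-2]) & (e[1:-1] >= e[2:])   # above left neighbour,
    peaks = np.flatnonzero(is_peak) + 1                 # at least equal right
    return peaks[e[peaks] > min_rel * e.max()]          # drop minor ripples
```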

Entrainment of segments by a dominant signal component is a very important property of a transmission line model and is due to the fact that the basilar membrane forms a single continuous structure. When a prominent signal component drives a certain segment, the segment will drag its neighbors along, and they drag their neighbors along, etc. This effect attenuates rapidly as a function of place. Only the signal components that can overcome the recruitment effect of other signal components will dominate locally and produce peaks. Entrainment is, like masking, more prominent on the high-frequency side than on the low-frequency side. Entrainment and dominance are studied in more detail in section 4.3.

The first 12 segments show a low and irregular response with a value of around 0.7. This response is due to quantization noise. The highest value, just 7 segments away (in the original model 21 segments), is 3.25, a factor 4.7 higher. Considering the nonlinear compression in equation 2.3, this corresponds to a difference of approximately 90 dB. This is consistent with the dynamic range of 90 dB that is associated with the 16 bit input.

2.4 Tuned Autocorrelation

Splitting a mixture of signals without certainty about the signals' origin requires the use of the weakest possible basic assumptions, i.e., the use of universally valid signal properties. An important general property is periodicity. In both speech and music, periodic signals represent the largest fraction of time and energy. Perfectly periodic signals do not occur often; most natural signals show amplitude and/or frequency modulations due to source properties. A sound event y(t) is quasiperiodic 4 with fundamental period contour T(t) = 1/f0(t), if for each harmonic y_i(t):

$$y_i(t) \approx y_i(t + T(t)) \tag{2.4}$$

If the harmonic y_i(t) of the sound event entrains segment s of the basilar membrane, the response x_s(t) of the segment will show quasiperiodicity as well.

4. Although the amplitude of a quasiperiodic signal y(t) can be a function of time as well, the effects of amplitude modulation are usually small for consecutive cycles and will be ignored in this section.

Consequently:

$$x_s(t) \approx x_s(t + T(t)) \tag{2.5}$$

If T(t) is known, equation 2.5 can be combined with equation 2.2 to yield:

$$r_{s,0}(t) = L\{x_s(t)\,x_s(t)\} \approx L\{x_s(t)\,x_s(t + T(t))\} = r_{s,T(t)}(t) \tag{2.6}$$

This entails that, under the condition that T(t) is the correct fundamental period contour, r_{s,T(t)}(t) closely approximates the cochleogram contributions r_{s,0}(t) for all segments that are entrained by the sound event y(t). This is important because T(t) is a signal property with a very high probability of being unique for sound event y(t). The set of values r_{s,T(t)}(t) is defined as the Tuned Autocorrelation 5 (TAC), because it is based on autocorrelation values x_s(t)x_s(t+T(t)) and tuned to a fundamental period contour T(t) (and hence also to a fundamental frequency contour f0(t) = 1/T(t)).

Equation 2.6 holds only for a correct fundamental period contour. For fundamental period contours that are not correlated with the contour of the target source, the values of x_s(t) and x_s(t+T) will not correlate and their average will be close to zero. This means that the TAC has values similar to the energy measure of the cochleogram for a correctly estimated period contour and values close to 0 for randomly chosen or uncorrelated period contours:

$$r_{s,T(t)}(t) \approx \begin{cases} r_{s,0}(t) & \text{for correct } T(t) \\ 0 & \text{for uncorrelated } T(t) \end{cases} \tag{2.7}$$

This property forms the basis for the assignment of information of quasiperiodic sound events into auditory events. When it is not known which segments are entrained by the quasiperiodic source, the TAC of all segments is computed with: 6

$$r_{s,T(t)}(t) = r_{s,T(t)}(t-\Delta t)\,e^{-\Delta t/\tau} + x_s(t)\,x_s(t + T(t)), \qquad s = 1, \ldots, s_{\max} \tag{2.8}$$

5. Papoulis 1984 defines the autocorrelation R_xx of a real-valued random signal as R_xx(t1,t2) = E{x(t1)x(t2)}, where E denotes an expectation operator. The TAC r_{T(t)}(t) is a subset of the generic autocorrelation: r_{T(t)}(t) = R_xx(t, t+T(t)) = E{x(t)x(t+T(t))}.

6. The actual formula (see equation 4.16) is slightly more complicated due to the effects of group delay. This phenomenon will be ignored in this overview, but it will play an important role in subsequent chapters.
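Equation 2.8 translates almost line by line into code. The sketch below follows that recursion only; the group-delay compensation of the actual implementation (footnote 6, equation 4.16) is deliberately omitted, and the argument conventions are illustrative.

```python
import numpy as np

def tuned_autocorrelation(x, period_contour, fs=20_000, tau=0.010):
    """TAC per equation 2.8.
    x: [n_segments, n_samples] BM velocities.
    period_contour: T(t) in seconds, one value per sample; NaN where the
    contour (and hence the TAC) is undefined."""
    n_seg, n = x.shape
    decay = np.exp(-1.0 / (fs * tau))
    r = np.zeros((n_seg, n))
    for t in range(1, n):
        T = period_contour[t]
        if np.isnan(T):                # no contour: the integrator just decays
            r[:, t] = r[:, t - 1] * decay
            continue
        lag = min(t + int(round(T * fs)), n - 1)
        r[:, t] = r[:, t - 1] * decay + x[:, t] * x[:, lag]
    return r
```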

Figure 2.5. Examples of the Tuned Autocorrelation. The upper panels show the signal as in figure 2.3 and its associated TAC. All negative values are set to zero (dark blue). The lower panels show the cochleogram of this signal when cocktail party noise is added at a signal-to-noise ratio of 0 dB. The lower right-hand panel shows the associated TAC. Notice that, compared with the panel above, most of the prominent structures are conserved, but some of the background has been selected as well.

Two examples are shown in figure 2.5. The upper panels show the cochleogram of the word /NUL/ (cf. figure 2.3) and the positive values of the associated TAC. The lower panels show the cochleogram of this signal when cocktail party noise is added, resulting in a signal-to-noise ratio of 0 dB (equality of signal and noise energy). The lower right-hand panel shows the associated TAC. Compared with the panel above, most of the prominent structures are conserved, but some of the background has been selected as well. This cannot be avoided (see section 4.7). The TAC is not defined over the complete 500 ms, since the period contour of the sound event is only defined when the sound event is present. Note that negative values of the TAC representation are set to zero in visible representations only. This will be done throughout this thesis.
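Property 2.7 can be checked directly with the sketches above. The following toy example builds a five-harmonic complex with a rising fundamental (loosely modeled on the /NUL/ contour) and compares the TAC for the correct contour with the TAC for a deliberately wrong, constant 140 Hz contour; all signal parameters are invented for the demonstration.

```python
import numpy as np

fs, dur = 20_000, 0.5
t = np.arange(int(fs * dur)) / fs
f0 = 220.0 * (1.0 + 0.3 * t)                   # rising fundamental (Hz)
phase = 2 * np.pi * np.cumsum(f0) / fs
sig = sum(np.sin(k * phase) for k in range(1, 6))

x = bm_response(sig)                           # filter-bank sketch, section 2.2
tac_ok = tuned_autocorrelation(x, 1.0 / f0)    # correct contour
tac_bad = tuned_autocorrelation(x, np.full(t.size, 1.0 / 140.0))
print(tac_ok[:, -1].mean(), np.abs(tac_bad[:, -1]).mean())
# the first value is markedly larger: equation 2.7 in action
```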

A tuned autocorrelation that results from a properly estimated period contour represents quasiperiodic information consistent with this contour. There is no guarantee that all information belongs to the same source; it is, however, guaranteed that all periodic contributions of the target source that entrain BM-regions will be represented.

The tuned autocorrelation is very robust. There are several reasons for this. A first reason is that the tuned autocorrelation selects all segment ranges dominated by target harmonics. In the case of broadband signals, like speech, in which a few harmonics or formants dominate, a peaked cochleogram results. The probability that formants (similar structures) of other sounds produce even stronger peaks that dominate the same regions even more prominently is usually small (but not zero). This probability is of course strongly dependent on the signal-to-noise ratio (SNR) and the distribution of energy over the frequency range. With common broadband signals that mask the target speech at a signal-to-noise ratio of -6 dB (ratio 1:4), the number of unmasked peaks of the target speech is reduced to a level where it becomes difficult to find a set of reliable starting points for the search of auditory events. Human speech perception deteriorates rapidly in these conditions (Plomp 1979, Alefs 1999).

A second reason for the robustness of the TAC is that a source does not need to dominate to provide a consistent local contribution. As long as the average contribution x(t)x(t+T) of a less dominant source is larger than the average x'(t)x'(t+T) of a source that is dominating locally, the less dominant source will provide a positive contribution, even if it is masked optically (i.e., not visible as a separate peak). Since there are no coherent peaks, this situation does not provide reliable starting points for auditory event estimation. Yet this might explain why some noisy sentences cannot be perceived on first presentation when the listener does not know what to expect, whereas the same sentence is recognizable when the listener could form a correct expectation. For example, a naive listener might have difficulties with a target sentence at an SNR of -6 dB, while an experienced listener can perceive each word of the target sentence at -10 dB or less.

The most important problem with the application of the TAC is the necessity of a correct estimate of the fundamental period contour T(t). Since it is not directly available, it has to be estimated from the signal. There exists an abundance of pitch estimation techniques, but none of these performs adequately on arbitrary (noisy) signals. The tuned autocorrelation is only a useful technique if a robust pitch estimation technique is available. Such a technique will be proposed in section 2.9.

The next section addresses a generalization of the TAC that is used as the theoretical basis for most of the techniques described in this work.

2.5 Time Normalized Correlogram

Equation 2.8 presents a subset of a more general continuous autocorrelation function:

$$r_{s,T}(t) = L\{x_s(t)\,x_s(t + T)\}, \qquad s = 1, \ldots, s_{\max}, \quad T \in [0, T_{\max}] \tag{2.9}$$

r_{s,T}(t) is typically implemented as a time-evolving matrix of dimensions (# segments) x (# periods) that is called the Time Normalized Correlogram (TNC). Of central importance is that the TNC is continuous in time, periodicity and place (with place related to frequency). The positive values of the TNC can be depicted in a similar way to the TAC-spectrograms. This is shown in figure 2.6, which shows the TNC for t=175 ms, in the middle of the /U/ of /NUL/. The vertical line at T=0 corresponds to the energy spectrum that was depicted in figure 2.4. The vertical band at T=4.6 ms represents the TAC for the fundamental period T0. This band is repeated around 9.2 ms for 2T0. These bands form the peaks of a large vertical structure that narrows as the frequencies of the individual harmonics increase. Each broadband quasiperiodic source has a similar structure, which is mainly determined by the instantaneous fundamental period.

The name TNC is derived from the fact that its definition in equation 2.9 ensures that whenever the first full period of a quasiperiodic signal ends at time t0, its TNC starts to build up at t0 irrespective of the period T of the signal: for t < t0 the temporal average of x(t)x(t+T(t)) is close to zero, while for t >= t0 the average is large and positive and independent of the value of T(t). This form of time-of-onset normalization (which does not hold for L{x(t)x(t-T(t))}) helps to study the temporal development of all types of sources. A full discussion, where different definitions of correlograms are compared, is given in section 4.1.

Since it is unlikely that uncorrelated sources show a similar development of the instantaneous fundamental frequency, the probability is low that the vertical structures of different sources overlap.
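A literal (and literally inefficient) rendering of equation 2.9 makes the structure of the TNC concrete: one leaky-integrated product per segment and per lag. The function name and the choice to return a single cross-section at one time are illustrative; the 25 ms maximum period matches the 500 lags mentioned in section 2.6 below.

```python
import numpy as np

def tnc_frame(x, t_idx, fs=20_000, max_period=0.025, tau=0.010):
    """One TNC cross-section r_{s,T}(t_idx) per equation 2.9.
    x: [n_segments, n_samples] BM velocities. Column T=0 of the result is
    the cochleogram frame; the column at T = T0 is the TAC value."""
    lags = np.arange(int(max_period * fs))       # T = 0 .. T_max in samples
    decay = np.exp(-1.0 / (fs * tau))
    r = np.zeros((x.shape[0], lags.size))
    last = min(t_idx, x.shape[1] - lags.size)    # keep t + T inside the signal
    for t in range(last):
        r = r * decay + x[:, t, None] * x[:, t + lags]
    return r
```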

Figure 2.6. The Time Normalized Correlogram derived from the /U/ of /NUL/ at t=175 ms. The TNC makes it possible to follow arbitrary continuous paths through time (t), place (s) and periodicity (T). The vertical line at T=0 corresponds to the energy spectrum. The structure at T=4.6 ms is the TAC for the fundamental period T0. This structure is mirrored around 9.2 ms for 2T0. Note the way harmonics and formants are expressed.

This is not the case for the energy term at T=0, where the energy of all sound events gets expressed on top of each other. The introduction of periodicity as an extra signal dimension allows not only a mixture of a periodic and an aperiodic signal to be split, but also mixtures of quasiperiodic signals! Note that this is partly an idealization: the combination of two or more quasiperiodic signals leads to a superposition of the individual TNCs that is more difficult to interpret than a single one. 7

The vertical cross-section of the TNC corresponds to an autocorrelation lag T for all segments s. The horizontal cross-section corresponds to the full running autocorrelation of a single segment. For aperiodic signals the correlation decreases as a function of T, but since this source is periodic, the autocorrelation has the appearance of a cosine. Notice that most segments are dominated by a single harmonic. This is most prominent for segments that correspond to the lower harmonics.

7. Interference effects between the different signals complicate the interpretation further.

The periodicity of the local running autocorrelation reflects the frequency of the segment's main driving force as a function of time. The first period that occurs in all segments is 4.60 ms, which corresponds to 217 Hz. For the second harmonic the second period peaks at 4.60 ms. This corresponds to an instantaneous frequency of 1/(4.6/2) = 434 Hz, as is to be expected. Just above 2000 Hz a region of the BM is dominated by the ninth harmonic. This region corresponds to the second formant. Note that the position of the tenth harmonic cannot be estimated, as it is masked by the ninth. The third formant gets expressed just below 3000 Hz and is dominated by the 13th harmonic at (4.6 ms/13)^-1 = 2.8 kHz. Note that the TNC allows the estimation of instantaneous local frequencies with high accuracy. This is a direct consequence of the avoidance of a frame-based approach and the conservation of continuity. Section 4.5 provides the details of the local frequency estimation algorithm and discusses its accuracy.

The TNC is an extremely rich representation, but one of its most important features is that:

The TNC can represent arbitrary continuous paths through time (t), place (s) and periodicity (T).

This means that if we know or hypothesize a period contour T(t) as a source property, we can investigate the consequences of T(t) as a continuous function of time. On the other hand, if it is known or hypothesized that a segment sequence s(t) represents information of a single signal component or sound event, it is possible to use the TNC to study the development of information represented by the running autocorrelation under the segment sequence s(t).

2.6 Estimation of Ridges

The instantaneous local frequency information as represented by the TNC forms the basis for the estimation of pitch-contours in unknown noisy circumstances. Computationally, the TNC is extremely inefficient 8, since it is of the order (# segments) times (# samples per second) times (# periods). For 100 segments, a sample frequency of 20 kHz and a maximum period of 25 ms (500 different values) this corresponds to 10^9 x (2 multiplications + 1 addition) per second. Although it is possible to increase the efficiency of the computation considerably, a more efficient approach is required.

8. According to theorem 1.21, efficient FFT-based approximations of the TNC (Slaney 1993) are not allowed for signals that are not (yet) recognized.

Fortunately, it is easy to determine regions in the cochleogram that are likely to provide prominent information about a single signal component (e.g., a harmonic). As discussed in the context of figure 2.4, each signal contribution tries to entrain a region of the basilar membrane. A periodic signal component that succeeds in dominating a region of the BM will lead to a peak at the BM position corresponding to the frequency of the signal component. This means that peaks usually correspond to a single strong signal component. Relatively weak signal contributions, like the 10th to 12th harmonics in figure 2.4, are almost completely masked by stronger contributions and do not show up as separate peaks. When the search space is limited to peaks in the cochleogram, one efficiently selects positions where information of entraining signal components can be estimated!

To reduce the number of spurious peaks, ridges can be formed by combining peaks through time. All peak positions that cannot be classified as members of ridges equal to or longer than 20 ms (4 frames 9 of 5 ms) are discarded. This leads to figure 2.7, which allows a comparison of ridge estimates in noise with estimates in the clean situation. The left-hand panel gives the ridges as estimated in 0 dB cocktail party noise, superimposed on the cochleogram of the clean target signal /NUL/. The right-hand panel shows the complementary information: the noisy cochleogram with the ridges as estimated in a clean signal. The ridges estimated in the noisy signal often coincide with the most prominent peaks of the clean target. Since these ridges are estimated from a noisy signal, they include positions where signal components of the background dominate.

As can be seen in the right-hand panel, the cocktail party background consists mainly of the intensity peaks of the speech of other speakers. Since these intensity peaks last for a shorter time than the whole sound event, the requirement of ridges of at least 20 ms removes an important fraction of the background peaks. For backgrounds consisting of many uncorrelated sources, or backgrounds containing aperiodic noises, this is often the case. This requirement helps to alleviate the problems associated with the signal-in-noise paradox by applying knowledge of the target signal: in this case the fact that speech usually consists of contributions of at least 20 ms.

9. Note that although the word frame is used to denote a signal-block of 5 ms, quasi-stationarity by frame-blocking is not applied. The basic signal representation conserves continuity, but some features, like peak positions and the local instantaneous frequency, are sampled every 5 ms.
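Building on the frame_peaks sketch of section 2.3, ridge forming can be phrased as greedy linking of peaks in consecutive frames, discarding chains shorter than 20 ms (4 frames of 5 ms). The tolerance of two segments for connecting peaks across frames is an illustrative guess; the text does not specify the actual linking criterion at this point.

```python
def link_ridges(peak_lists, min_len=4, max_jump=2):
    """peak_lists: per 5 ms frame, a list of peak segment indices
    (e.g. from frame_peaks). Returns ridges as lists of (frame, segment)
    pairs lasting at least min_len frames (20 ms)."""
    active, finished = [], []
    for fr, peaks in enumerate(peak_lists):
        taken, survivors = set(), []
        for ridge in active:                         # try to extend each ridge
            last_seg = ridge[-1][1]
            best = min((p for p in peaks
                        if p not in taken and abs(p - last_seg) <= max_jump),
                       key=lambda p: abs(p - last_seg), default=None)
            if best is None:
                finished.append(ridge)               # ridge ends here
            else:
                taken.add(best)
                ridge.append((fr, best))
                survivors.append(ridge)
        active = survivors + [[(fr, p)] for p in peaks if p not in taken]
    finished += active
    return [r for r in finished if len(r) >= min_len]
```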

Figure 2.7. The left hand panel shows the clean cochleogram of /NUL/ with the ridges as estimated in 0 dB cocktail party noise. The right hand panel shows the complementary information: the noisy cochleogram with the ridges as estimated in a clean signal. Note that the ridges point to reliable sources of information. The bar denotes the time on which the autocorrelation of figure 2.8 is based.

2.7 Local Instantaneous Frequency Contours

The next step towards a robust pitch estimation algorithm is the estimation of Local Instantaneous Frequency (LIF) contours. We now have a set of continuous ridges {s_i(t)}, and since the TNC is continuous in time t and place s, it is possible to compute the running autocorrelation along the ridge s(t) as:

$r_{s(t),T}(t) = r_{s(t),T}(t - \Delta t)\, e^{-\Delta t / \tau} + x_{s(t)}(t)\, x_{s(t)}(t + T), \qquad T = 0, \ldots, T_{max}$   (2.10)

As the peak position changes smoothly, so does its associated autocorrelation. Note the symmetry with the tuned autocorrelation of equation 2.8. That equation represented a set of functions over all segments s with the period contour T(t) as a function of time, while equation 2.10 is a set of functions over all T with the segment sequence s(t) as a function of time. The TAC describes vertical cross-sections of the TNC, the running autocorrelation a horizontal cross-section.

Figure 2.8. Examples of autocorrelations estimated at time t=250 ms (see figure 2.7) from the noisy /NUL/. The lower panel shows the running autocorrelation of the ridges 2, 4, 6, 7 and 8 (numbering starting from the lowest ridge in the left hand panel, at the position of the bar in figure 2.7). Since they all agree on a periodicity of approximately 4.10 ms, these ridges might reflect harmonics of a single source. The upper panel shows the autocorrelation of ridges 1, 3 and 5, which do not agree with this periodicity.

Figure 2.8 gives examples of a few autocorrelations estimated at time t=250 ms (see figure 2.7) from the noisy /NUL/. The lower panel shows the running autocorrelation of the ridges 2, 4, 6, 7, and 8 (numbering starting from the lowest ridge) at t=250 ms in the left hand panel of figure 2.7. The autocorrelations suggest that these ridges arise from harmonics that belong to the same source: they all agree on a periodicity of 4.10 ms (244 Hz). The upper panel shows the autocorrelation of ridges 1, 3, and 5, which do not agree with this periodicity. Of these, ridges 3 and 5 might agree on a common periodicity of 2.9, 5.8, or 8.7 ms.

The local instantaneous frequency is related to the inverse of the period corresponding to the position of the first peak in the autocorrelation.^10 These values are computed and depicted in figure 2.9 for two conditions.
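In code, equation 2.10 and the first-peak estimate might look as follows. This is an illustrative sketch under assumed array conventions, not the thesis implementation (which, moreover, replaces the naive first-peak estimate by a regression over the whole autocorrelation, see section 4.5):

```python
import numpy as np

def ridge_autocorrelation(x, ridge_seg, fs=20_000, t_max=0.025, tau=0.010):
    """Leaky running autocorrelation along a ridge (equation 2.10).

    x: BM signals, shape (n_segments, n_samples); ridge_seg[n] is the
    ridge's segment index at sample n.  Returns r[n, T] for the lags
    T = 0 .. t_max (500 values at 20 kHz).
    """
    n_lags = int(t_max * fs)
    decay = np.exp(-1.0 / (fs * tau))           # e^{-dt/tau} per sample
    n_use = min(len(ridge_seg), x.shape[1] - n_lags)
    r = np.zeros((n_use, n_lags))
    for i in range(1, n_use):
        s = ridge_seg[i]
        r[i] = decay * r[i - 1] + x[s, i] * x[s, i:i + n_lags]
    return r

def first_peak_period(r_row, fs=20_000):
    """Period of the first local maximum of one autocorrelation row."""
    for k in range(1, len(r_row) - 1):
        if r_row[k] > r_row[k - 1] and r_row[k] >= r_row[k + 1]:
            return k / fs                        # period in seconds
    return None
```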

Figure 2.9. Comparison of the instantaneous local frequency as estimated in clean (blue) and in noisy (red) conditions. Although the noisy representation contains many more frequency contributions, a large fraction of the original signal is unperturbed.

The blue stars are the values of the local instantaneous frequencies as estimated from the clean /NUL/. The red circles are estimated from the noisy /NUL/. Note that most frequency contributions in the clean signal remain clearly present in the noisy environment. A closer examination shows that perturbations are often less than 2 percent. This shows that the ridges form a very reliable source of information for the estimation of the frequency development of signal components.

Since the basilar membrane model (as used in this work) has only 100 segments, which span more than 2 orders of magnitude in frequency, the estimate of the instantaneous local frequency based on the position of the ridge(s) cannot be very accurate. The estimation of the instantaneous local frequency based on the running autocorrelation, however, is accurate to within less than a percent for frequencies far below the Nyquist frequency of the input (here 10 kHz). This is surprising because the running autocorrelation has an associated exponential window with a time constant of 10 ms.

10. Actually, a somewhat more complicated and more accurate regression method is used that takes the rest of the running autocorrelation into account as well. This is discussed in section 4.5.

One could argue that the resulting time-frequency trade-off ought to limit the frequency resolution to approximately 100 Hz. However, one can easily make a distinction between quasi-stationary signal components of 100 and 101 Hz (see section 4.5). The time-frequency trade-off is determined by the BM model (and not by the exponential window) and the source that produced the signal. Because the BM model has an infinite window of integration, it can produce frequency estimations that are only limited by the development of the target signal. This is another illustration of the benefits of the avoidance of frame-based approaches.

2.8 Design Choices and Overview of Implemented System

In the last part of this chapter the target signal is extended to /NUL EEN TWEE DRIE/ (English: /ZERO ONE TWO THREE/).^11 Cocktail party (or babble) noise is added so that a signal-to-noise ratio of 0 dB results. The associated cochleogram and the estimated ridges are depicted in the upper panel of figure 2.10. The lower panel shows the frequency contributions along the ridges. (Note the difference in scaling of the frequency axes.)

An audio presentation of this sentence can be recognized without much difficulty by speakers of the Dutch language. Yet at -3 dB, naive listeners are unable to recognize the noisy /DRIE/ when it is presented without the preceding digits. When these are added to the signal, the /DRIE/ is usually perceived clearly. This suggests the importance of linguistic context at this signal-to-noise ratio. At -6 dB, listeners are often unable to detect the sentence at first presentation. This thesis uses the target sentence of figure 2.10 because these informal experiments suggested that an SNR of 0 dB (in the case of babble noise) is just above the threshold that prevents speech from being processed correctly without additional information from linguistic context, binaural hearing, visual cues, etc.

Figure 2.10 shows the information available for the estimation of fundamental period contour hypotheses. The use of ridges, ensuring an efficient reduction of the search space, and the availability of mostly accurate estimations of the local instantaneous frequency must facilitate this process. Yet the robust estimation of the fundamental period contour, or its corresponding pitch-contour, is surprisingly difficult: often there is no unique solution.

11. Will be made available at

Figure 2.10. The upper panel shows the cochleogram of /NUL EEN TWEE DRIE/ with added cocktail party noise at a signal-to-noise ratio of 0 dB. The lower panel shows the frequency contours associated with the ridges in the upper panel. Note the difference in scaling of the frequency axis.

This is a direct consequence of the limits of the measurement process (see page 31). The optimal solution is to work with a set of hypotheses, estimated to ensure that it contains, as often as possible, enough information to allow a correct recognition result. Then a full search through this set is performed to find the hypothesis combination that explains the data best. Unfortunately, speech recognition systems based on this approach are not yet available (but section 7.2 proposes such a system).

At this stage an important and suboptimal design decision is made. In order to use a standard HMM-based speech recognition system for testing the benefits of the selection process, a single sequence of the target signal must be produced. This entails that the system needs to find the best fundamental period contours while limiting itself to the information in the signal. Since no linguistic knowledge can be applied, the system will be unable to detect pitch estimation errors and nonspeech sounds.^12

Finding a unique and correct pitch-contour becomes progressively more difficult and eventually impossible when the signal-to-noise ratio decreases. Furthermore, problems arise in situations with multiple speakers and/or music. To facilitate implementation even more, some other suboptimal design decisions are made. One choice is to develop a proof-of-concept with parameter values that have not been optimized in any formal way. Another important design choice is to use the whole signal instead of the most recent information. This introduces a delay that effectively entails that the system can only be used in a multistage mode. The first stage computes the basilar membrane response, the ridges and the instantaneous frequency contours of the whole input signal. The second stage computes the fundamental period contour, performs selection with the TAC and produces a parameterization suitable for the last stage, which computes the recognition result. This leads to the following system overview:

Noisy speech
 -> basilar membrane, ridges, instantaneous frequency
 -> pitch-contours (max. 1 at a time), selection, parameterization
 -> HMM-based recognition system
 -> words

This is, of course, far from the ultimate goal: a real-time system that can correctly process sounds of which a priori knowledge might or might not be available. Nevertheless, a system like this will be useful for applications that have to deal with variable and uncontrollable signals, e.g., telephone-based information systems. The rest of this chapter is not aimed at building such a system, but aims to develop a signal processing approach that is suitable for variable acoustic environments.

A suitable test of this system differs slightly from the evaluation of the process described in the system overview. It requires an evaluation of the behavior of the system when a correct pitch-contour is available, since this gives an indication of the quality of the preprocessing. It is, in accordance with conclusions 1.8 and 1.13, the task of later recognition stages to identify the correct pitch hypotheses on the basis that these lead to an acceptable recognition result with a meaningful interpretation.

12. Note that this is a consequence of the separation of selection and recognition in two separate processes. The more integrative approach of section 7.2 avoids this problem.

Although pitch estimation in noise might become fairly reliable, it will never be possible to guarantee a correct choice between two or more well-supported pitch hypotheses without knowing which choice leads to the most meaningful interpretation.

2.9 Fundamental Period Contour Estimation

The development of a reliable and robust pitch estimation technique is difficult. The main reason for this is that it is generally impossible to determine which signal contributions or signal properties belong to a certain source prior to recognizing the sources. This is a direct consequence of the inability, discussed in section 1.1, to determine whether a signal is speech or not without being able to recognize the signal. Yet although this problem is generally insoluble, some features, like smoothly developing harmonics, can be used to search for evidence of speech. Although these features are shared by sounds other than speech, they can be used as long as the user ensures that other types of sounds do not occur (which is, unfortunately, not what one requires of a system that can deal with unknown situations).

Two fundamental period contour estimation techniques were developed: one for clean signals, and one for noisy signals. The first uses the fact that the target is not contaminated with noise. It is reliable, but sensitive to added noise. It is based upon the property that all harmonics of a periodic source show a common periodicity, as is shown in the lower panel of figure 2.8. This technique is used to determine the quality of the TAC selection technique in section 2.13. A more complete description of this method is postponed to section 5.2. The second period contour estimation technique is, in accordance with the limitations on page 63, developed for a larger class of signals. This technique is not developed to be flawless (which is impossible), but is intended as a proof-of-concept: it has to come up with an approximately correct pitch-contour most of the time. An overview of this technique will be given in this section. A more detailed description can be found in section 5.1 on page 140.

A first choice is whether to work in the period domain or the frequency domain. This is an arbitrary choice because both domains are equivalent, but since the final result of the technique is a period contour, two domain changes can be avoided. The available information is depicted in the upper panel of figure 2.11.

Figure 2.11. The upper panel shows raw periodicity information. This is the inverse of the information as depicted in the lower panel of figure 2.10. The lower panel shows the result of an algorithm that selects long contours from the upper representation and smooths the final result. These form the seeds for the next stages in the estimation process of the best fundamental period contours.

For visual clarity only periods smaller than 5 ms (i.e., frequencies higher than 200 Hz) are depicted. Since this representation is based on a situation with a signal-to-noise ratio of 0 dB, it shows a lot of spurious contributions that must be eliminated somehow. To attack this problem, a set of heuristics is required. Inspection of noisy speech signals showed that the most energetic, the longest (longer than 50 ms) and the smoothest ridges often reflect the most reliable sources of information. These heuristics are used to select a number of points of the upper panel of figure 2.11. The selected points form ridges of segments that on average belong to contours longer than 75 ms, or to contours longer than 50 ms that, on average, contain the (two) most energetic segments per time step. The selected ridges are substituted by a smooth approximation that is just as long. These ridges are depicted in the lower panel.
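In code, these heuristics might reduce to something like the following sketch. This is a strong simplification: the length thresholds are those quoted above, but the energy criterion and the smooth approximation (here a quadratic polynomial fit) are illustrative stand-ins for the actual procedure:

```python
import numpy as np

def select_and_smooth(contours, frame_ms=5.0, top_energy=2):
    """Keep contours longer than 75 ms, or longer than 50 ms when among
    the most energetic, and replace each survivor by a smooth fit.

    contours: list of dicts with 'frames' (frame indices), 'period' and
    'energy' arrays of equal length.
    """
    by_energy = sorted(contours, key=lambda c: -np.mean(c['energy']))
    energetic = {id(c) for c in by_energy[:top_energy]}
    selected = []
    for c in contours:
        duration = len(c['frames']) * frame_ms
        if duration > 75.0 or (duration > 50.0 and id(c) in energetic):
            coef = np.polyfit(c['frames'], c['period'], deg=2)
            selected.append({'frames': c['frames'],
                             'period': np.polyval(coef, c['frames'])})
    return selected
```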

Figure 2.12. The upper panel shows all possible fundamental period contour hypotheses consistent with the smoothed contours of figure 2.11. Note that some hypotheses overlap. The lower panel shows a selection of the contours of the upper panel based on length and smoothness. The fine structure in some of the lines of the upper panel is an artefact of the plot program.

The smooth ridges p(t) might, or might not, stem from harmonics of the target speech. If the harmonic number n were known, the fundamental period p_0(t) would be known, since:

$p_0(t) = p(t)\,n \quad \text{or} \quad f_0(t) = f(t)/n$   (2.11)

As a further limitation, valid fundamental period values are limited to values between 2.5 ms (400 Hz) and 13.3 ms (75 Hz), a range that spans most speakers (Furui 1989). For example, an instantaneous period p = 6 ms can be the result of the second harmonic of a fundamental period p_0 = 12 ms, or the first harmonic of p_0 = 6 ms. A period p' = 2 ms can represent any harmonic number in the range of 2 to 6. This corresponds to any p_0 in the set {4, 6, 8, 10, 12} ms. If p and p' stem from the same source, they share the same fundamental period p_0, in this case either 6 or 12 ms. This property is used for the contours as depicted in the lower panel of figure 2.11.
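The hypothesis enumeration implied by equation 2.11 and the 2.5-13.3 ms range is easily sketched (the function name is illustrative):

```python
def p0_hypotheses(p_ms, p0_min=2.5, p0_max=13.3):
    """All fundamental periods n * p (n = 1, 2, ...) inside the valid
    75-400 Hz speaker range, for a measured period p in ms."""
    hypotheses, n = [], 1
    while n * p_ms <= p0_max:
        if n * p_ms >= p0_min:
            hypotheses.append(n * p_ms)
        n += 1
    return hypotheses

print(p0_hypotheses(6.0))   # [6.0, 12.0]
print(p0_hypotheses(2.0))   # [4.0, 6.0, 8.0, 10.0, 12.0]
# periods from the same source must share a hypothesis:
print(set(p0_hypotheses(6.0)) & set(p0_hypotheses(2.0)))   # {6.0, 12.0}
```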

The upper panel of figure 2.12 shows all fundamental period contour hypotheses consistent with the smoothed contours of figure 2.11. Some of the fundamental period contour hypotheses overlap or extend each other smoothly. This is a strong indication that the period contours stem from the same source: the probability that uncorrelated period contours form a consistent whole is small, but not zero! The lower panel depicts a selection of the upper panel based on three main criteria: the contours must have a certain minimal length (50 ms), they must be sufficiently smooth, and in the case of multiple concurrent contours only the longest contours are selected. This results in a strong reduction, and it often results in a set that includes a more or less correct pitch-contour candidate.

The final step compares the remaining concurrent candidates with the original local periodicity information, depicted in figure 2.11, to determine which candidate explains most of the period values and, to prevent octave errors, has a reasonable ratio of odd and even harmonics. The candidates that meet these demands best form the final output of the algorithm. Figure 2.13 shows a comparison between pitch-contours estimated from signals with different signal-to-noise ratios of babble noise. Apart from some differences during on- and offset, the algorithm is able to find the correct contours for SNRs of -3 dB and better.

Figure 2.13. Pitch-contours estimated from the target signal in cocktail party noise. In an SNR of 0 dB the match is generally very good; at -3 dB the estimated periods begin to deteriorate, especially during on- and offset, while at -6 dB only part of the target contours are estimated correctly. The period contours (estimated from an SNR of -6 dB) around 100 Hz stem from a combination of sources.

When the algorithm produces a correct contour, the match is usually well within 1% of the actual value. This is not surprising since the most prominent harmonics of the target sounds are still quite able to dominate locally in these conditions. The algorithm identifies these regions and uses periodicity information to find the pitch-contour that combines as many of these regions as possible. Because the periodicity information is still almost unimpaired, the pitch-contour must be of a quality similar to that estimated in clean conditions. During onset and offset the energy of the target is generally lower, which may lead to an unfavorable local signal-to-noise ratio. This makes it more difficult to determine the period contour unambiguously. Since the pitch-contour estimation technique looks for long, smooth and well-supported fundamental frequency contours, it finds all combinations of evidence that can be supported. These may, or may not, correspond to the true pitch-contours. Below an SNR of approximately -5 dB in babble, speech, or car factory noise, the algorithm is unlikely to function reliably because it becomes impossible to form a set of hypotheses that includes the correct period contours. As said before: this version of the pitch estimator is intended as a proof-of-concept. It can be improved, and eventually it ought to be replaced by a version that produces multiple hypotheses that are evaluated by higher levels of processing.

2.10 Selection of Periodic Signal Contributions

The next step is the actual assignment of information to an auditory event-like representation. The lower panel of figure 2.14 shows typical examples of TAC-based auditory events. It is surprising to see what the application of a single constraint, the period contour, can do with the noisy signal in the upper panel. On the low-frequency side, the TAC cochleogram selects the first harmonics reliably; on the high-frequency side it selects large areas of the time-place plane. On the low-frequency side the selected regions are dominated by a single harmonic. On the high-frequency side the regions are dominated by formants: complexes of harmonics that agree on a common fundamental period.

Figure 2.14. The upper panel shows the original noisy cochleogram (0 dB babble noise). The TAC cochleogram, computed from the noisy signal (the red contours of figure 2.13), is depicted in the lower panel. The blue areas denote negative TAC values (usually corresponding to regions that are dominated by other sources), or epochs in which no period contours could be estimated.

It is now evident that the TAC contributes to the solution of the signal-in-noise-paradox (definition 1.5). Furthermore, the power of the TAC-approach illustrates the meaning and feasibility of the theoretical solution of the signal-in-noise-paradox as formulated in conclusion 1.12: the signal-in-noise-paradox can be solved by grouping continuously developing acoustic information of a single source into auditory events. A search through the set of auditory event combinations might produce a number of acceptable recognition results.

Although the selection as depicted in figure 2.14 was based on correct period contours, it cannot be guaranteed that the selection is correct: one of the background speakers might be the source of one of the period contours. Section 4.7 addresses this problem in more detail. Further processing, using knowledge of speaker characteristics and all aspects of language, must solve this problem. Fortunately, the information represented by an auditory event, based on a correct period contour estimated in rather noisy situations, comprises accurate information about the relative importance of individual harmonics and formants. This is enough to reduce the number of interpretations to a few hypotheses.
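In the spirit of the tuned autocorrelation of equation 2.8, selection along a hypothesized fundamental period contour can be sketched as follows (assumed array conventions; the actual estimation of the TAC is discussed in section 4.6). Note the symmetry with the ridge autocorrelation sketched in section 2.7: here the lag follows the contour T(t) and all segments are computed.

```python
import numpy as np

def tuned_autocorrelation(x, period_samples, fs=20_000, tau=0.010):
    """For every segment s, leaky-integrate x_s(t) * x_s(t + T(t)) along
    a hypothesized fundamental period contour T(t), given in samples.
    Positive values mark time-place regions the contour can explain."""
    n_seg, n_samp = x.shape
    decay = np.exp(-1.0 / (fs * tau))
    r = np.zeros((n_seg, n_samp))
    for t in range(1, n_samp):
        T = int(period_samples[t])
        future = x[:, t + T] if t + T < n_samp else np.zeros(n_seg)
        r[:, t] = decay * r[:, t - 1] + x[:, t] * future
    return r

# a simple auditory-event mask: regions dominated by the hypothesized source
# mask = tuned_autocorrelation(x, T_contour) > 0
```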

Although the TAC-approach cannot assign non-periodic information to auditory events, it can help in determining the position of likely candidates of aperiodic auditory events that might be assigned to the same stream. In normal speech the position of aperiodic signal components is strongly correlated with the periodic components. In most cases, these contributions end just before or during the onset, and start at or after the offset, of a periodic contribution. In the case of the /T/ of /TWEE/ (/TWO/), starting at t=1000 ms and most noticeable in the segment range from 90 to 100 in the upper panel of figure 2.14, some form of template matching in combination with the Characteristic Period Correlation (section 4.3) may suffice to detect and characterize likely candidates of aperiodic contributions.

2.11 Resynthesis of Target Signal

Because the TAC forms a reliable basis for the assignment of information to auditory events, one might ask whether it could be used to split a combination of sounds into the constituting sound events. This is, within the limits of the discussion in section 4.7, possible, and it is in fact a straightforward procedure. All quasiperiodic signal contributions that dominate a certain region in the time-place plane of the TAC cochleogram represent basilar membrane oscillations. Since the basilar membrane model is implemented as an impulse response-based finite impulse response (FIR) filter, it is possible to invert the filtering by reversing the impulse response in time and compensating for the frequency effects caused by the double use of the basilar membrane filter (Slaney 1994). A full inversion results in the original mixture of signals. But if inverse filtering is based on the regions of the time-place plane that are dominated by the target source, the output is, ideally, exclusively based on information from the target.

If all positive values of the TAC cochleogram of figure 2.14 are used as a mask, the result sounds unpleasant. This is mainly due to on- and offset effects of filtering and to the contribution of incidental correlations. To reduce the latter effect, TAC-values smaller than a certain fraction, e.g., 0.25, of the local instantaneous energy are discarded. The remaining TAC-values are depicted in the upper panel of figure 2.15. To reduce the effects of on- and offsets, the mask is tailored to consist of long (here minimally 50 ms) continuous contributions of single segments: small holes (here 20 ms) in the positive values of the TAC-traces are filled up and isolated positive points are discarded. Finally, the mask is provided with smooth 10 ms wide on- and offsets. This leads to the mask as depicted in the lower panel of figure 2.15.
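A sketch of this mask-forming step (the thresholds are those quoted above; the 10 ms tapers are omitted for brevity, and the run-length bookkeeping is an illustrative reading of the procedure):

```python
import numpy as np

def _runs(row):
    """(start, stop, value) for each maximal constant run of a 1-D bool array."""
    edges = np.flatnonzero(np.diff(row)) + 1
    bounds = np.concatenate(([0], edges, [len(row)]))
    return [(a, b, bool(row[a])) for a, b in zip(bounds[:-1], bounds[1:])]

def resynthesis_mask(tac, energy, frame_ms=5, min_run_ms=50, max_hole_ms=20):
    """Frames with TAC > 0.25 * local energy are candidates; per segment,
    interior holes up to 20 ms are filled and runs shorter than 50 ms
    (including isolated points) are discarded."""
    keep = tac > 0.25 * energy
    min_run = min_run_ms // frame_ms
    max_hole = max_hole_ms // frame_ms
    mask = np.zeros_like(keep, dtype=bool)
    for s in range(keep.shape[0]):
        row = keep[s].copy()
        for a, b, v in _runs(row):          # fill interior holes <= 20 ms
            if not v and b - a <= max_hole and a > 0 and b < len(row):
                row[a:b] = True
        for a, b, v in _runs(row):          # discard runs < 50 ms
            if v and b - a < min_run:
                row[a:b] = False
        mask[s] = row
    return mask
```

For the actual resynthesis the signal outside the mask is then attenuated rather than removed completely, as described below.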

Figure 2.15. The upper panel shows the regions where the TAC-value is larger than 25% of the local energy. The lower panel shows the mask that is used as a template for resynthesis. The mask is based on the positive values in the upper panel, but tailored to consist of segment contributions of a certain minimum length. The light points at the end of the lines signify the tapering to further reduce on- and offset effects.

To improve the sound quality, the background is not completely discarded, but reduced by an adjustable factor: in this case a factor of 100 in amplitude (40 dB in terms of energy). By not completely discarding the background, unnaturally deep silences are reduced and some evidence of aperiodic contributions, like the /T/ of /TWEE/, remains in the signal; this facilitates perception.

When the resulting resynthesized sound is again presented to the basilar membrane model, the cochleogram of the resynthesized sound can be computed. This is presented in the middle panel of figure 2.16. The upper panel shows the cochleogram of the original signal. This signal formed the only source of information: no a priori information was used, nor was any necessary. The lower panel shows the clean reference. Apart from the second formant structure of the last word, which is masked completely, all important periodic contributions are represented faithfully.

Figure 2.16. The cochleogram of the resynthesized signal, shown in the middle panel, reflects most features of the original signal in the lower panel. The representation of the middle panel was entirely based on the noisy signal shown in the upper panel. The fuzziness in the reconstructed signal is due to spurious signal contributions of the cocktail party background. The /T/ starting around t=1.0 s is absent in the reconstruction since it is unvoiced.

Note that the resynthesized cochleogram is more fuzzy; this is due to spurious contributions of the background. One way to avoid this is to measure and smooth all individual signal components and add these together in a true speech synthesis process.

Although the intelligibility is high, the perceptual quality of very noisy (SNR 0 dB) resynthesized signals is not very good. This is because quite a lot of the background, visible as the fuzziness in the middle panel of figure 2.16, is still present in the selection. The original signal provided our auditory system with enough cues to separate the signal components and assign individual signal components, on the basis of perceptual coherence, to the correct source.

In the resynthesized signal the majority of these cues are removed, which makes it more difficult to assign spurious contributions to other percepts. Consequently, the perceptual quality of the signal is low. For true speech enhancement, further optimizations of the resynthesis process are required. An improvement would be to estimate relevant signal properties, such as the pitch-contour, the development of the first harmonics, and the position and development of the second and third formants, and then use this information as input for a speech synthesis system. Techniques like this may lead to high quality coding at very low bit-rates. In its current form, resynthesis is mainly suitable for visual purposes. Mask forming and resynthesis will be improved in sections 6.1 and 6.2.

2.12 Parameterization of Selections

This section and the following show that the information in the TAC-selections can form a suitable basis for robust HMM-based ASR systems.^13 The resynthesized sound, as computed in the previous section, can be used as input for a standard off-the-shelf speech recognition system. Unfortunately, this resynthesis does not contain any information about unvoiced (aperiodic) phonemes, and it is difficult to predict how a standard, pretrained ASR system will respond to these signals. A practical problem is that the resynthesis procedure requires an extra pass through the basilar membrane model. Since the basilar membrane model forms the computational bottleneck of the system, its application ought to be minimized. Instead, the input of the recognition system can be based on the TAC-cochleogram.

A suitable input for an ASR system is a description of the temporal development of the spectral envelope of the target speech (see figure 1.1) while suppressing pitch effects. As the upper right-hand panel of figure 2.5 shows, the TAC-cochleogram of the voiced parts of a clean signal resembles the standard cochleogram closely. The TAC-cochleograms in the lower panel of figure 2.14 resemble the clean cochleogram better when the negative values are replaced by suitable positive values. The TAC-approach allows a (usually) accurate estimate of the energy and frequency development of the first few harmonics.

13. The results of more elaborate and convincing recognition experiments on the standardized Aurora test for robust speech recognition are available at:

Because voiced speech mainly consists of a superposition of harmonics, and because the TAC-approach is linear, it is possible to recreate the lower part of the cochleogram by adding contributions of the ideal responses of individual harmonics. This superposition does not contain any negative parts. This procedure is introduced in section 3.4 and described in detail in section 3.6. The information in the upper part of the cochleogram, where individual harmonics cannot be resolved, is based on masks as computed in the previous section. The information under the mask is kept unchanged. Outside the borders of the mask, vertical tails are added to reflect masking upward and downward in frequency. As a last measure a (horizontal) tail, as an approximation of the ringing-out effect of the basilar membrane and the leaky integration process, is included. This results in synthetic correlograms as depicted in figure 2.17.

The upper panel shows the reconstruction based on the TAC of the clean signal. A comparison with the lower panel of figure 2.16 shows that the main components of both figures are very similar. This indicates the validity of the reconstruction method. The lower panel of figure 2.17 shows the reconstruction based on the TAC as estimated from noisy data. Since part of this signal is masked and some spurious contributions of the background are added, the match is not perfect, but the main features of both figures are similar (under a visual inspection). Prior to a correct recognition it is unknown which of the contributions are spurious, and consequently it is impossible to make a perfect selection. This problem can only be alleviated by the application of more and more characteristic constraints, which is not possible within traditional ASR-systems.

An HMM-based ASR system requires an estimation of the spectral envelope of the target speech without pitch effects. The representation as depicted in figure 2.17 is not very suitable since the first harmonics are the most energetic components. Although these carry formant information, the detailed realization of the first formant depends strongly on pitch. To reduce the effect of irrelevant pitch differences and to stress the second and third formants, a very simple and quite arbitrary trick is used: the values of the compressed cochleogram are multiplied by a segment-dependent factor. This factor is 1 for the first segment and maximal, e.g., 5, for the last segment. The multiplication factor of intermediate segments is a linear interpolation between the two extremes. This is an operation with an effect similar to pre-emphasis, a form of high-pass filtering that is usually applied within the standard methodology of ASR, and it results in a speech spectrum where all frequencies contribute on average the same amount of energy.

Figure 2.17. Reconstructions of the original cochleogram based on a superposition of information derived from the TAC-spectrum. The contributions of the first few harmonics are based on an estimate of the development of their frequency and energy. The unresolved harmonics are based on information under the mask of figure 2.15. Small holes are filled up. Masking effects are added when appropriate. The upper panel shows the reconstructed clean signal, the lower shows the reconstruction based on the noisy TAC, which can be compared to the resynthesis in figure 2.16.

As a final step, the envelope of the cochleogram must be coded as efficiently as possible. To produce a set of parameters similar to MFCCs,^14 a cosine transform of the weighted cochleogram is performed. The result is a variant of a standard cepstrum. Typically the first 8 to 14 values of the cepstrum, representing low spatial frequencies, are kept; the rest is discarded. Finally, the time-step between successive frames is increased from 5 ms to 10 ms by averaging successive values.

14. MFCC: Mel Frequency scaled Cepstral Coefficient. MFCCs are a well-known FFT-based representation involving an approximation of the place-frequency relation of the basilar membrane and a logarithmic compression. The envelope structure is derived with a cosine transform. The resulting representation is termed a cepstrum. The values that represent the contributions of each cosine-base vector are called cepstral coefficients (Rabiner 1993, Gold 2000).
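Taken together, the parameterization steps might look as follows (a sketch under assumed array conventions; the cosine transform is written out as a DCT-II to make the "cosine transform" explicit):

```python
import numpy as np

def cepstral_features(cochleogram, n_coef=12, max_weight=5.0):
    """Segment-dependent weighting (1 ... 5), cosine transform over the
    place axis, truncation to n_coef coefficients, and averaging of
    5 ms frame pairs into 10 ms frames."""
    n_seg, n_frames = cochleogram.shape
    weights = np.linspace(1.0, max_weight, n_seg)[:, None]  # pre-emphasis-like
    weighted = cochleogram * weights
    s = np.arange(n_seg)
    basis = np.cos(np.pi * np.outer(np.arange(n_coef), s + 0.5) / n_seg)
    cepstrum = basis @ weighted                      # (n_coef, n_frames)
    n_even = n_frames - (n_frames % 2)               # 5 ms -> 10 ms frame step
    return cepstrum[:, :n_even].reshape(n_coef, -1, 2).mean(axis=2)
```

Applying the inverse cosine transform to the truncated coefficients gives back a smoothed, cochleogram-like picture, which is how the representations discussed next can be visualized.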

Figure 2.18. Cochleogram representations of the information represented by the cepstral coefficients that are used for recognition. The upper panel shows the information available to the ASR system based on the clean cochleogram, the lower panel shows similar information but now based on the noisy TAC. Apart from some noise, spurious contributions and masked information (all prominent in the last word), the main difference is the absence of unvoiced phonemes.

This makes the frame step a standard value and speeds up processing. These values are stored and used as input for the speech recognition system. The stored parameters are not very informative by themselves, but they can be transformed back to a cochleogram-like representation by applying the inverse cosine transform. The result is shown in figure 2.18, which reflects the information available to the speech recognition system. The upper panel is based on the original clean signal. The energy contributions per segment are enhanced by values between 1 and 5; the spectral envelope is coded with 12 cepstral coefficients. Compared to the lower panel of figure 2.16, the high-frequency segments are much more prominent, the first harmonics are less prominent, and the formant features are broader. The lower panel is based on the reconstructed TAC-cochleogram of figure 2.17 and has a good general agreement with the ideal cochleogram, but it is noisy due to masking and spurious background contributions.

These two representations form the basis for the recognition experiments of the next section.

2.13 Recognition Experiment

This chapter set out to prove that the basic framework presented in chapter 1 is useful for the improvement of a standard speech recognition system. So far, a selection technique for quasiperiodic sound events has been developed that appeared robust under visual inspection. The final proof is whether TAC-based features are robust to added noise. Preliminary experiments with a standard ASR system give an indication of the robustness of the TAC-approach. This section provides, in accordance with the design decisions of section 2.8, an overview of recognition experiments with period contours estimated from noiseless signals. The way these period contours are estimated is outlined in section 5.2.

For the ASR experiments the Hidden Markov ToolKit (HTK, version 2.2, Young 1999), a well-known state-of-the-art ASR development system, was used. The ASR test is limited to the 5414 connected digit strings of the female training data of the TI-DIGITS database (Linguistic Data Consortium 1993). Both HTK and the TI-DIGITS task are well known and are often used for benchmarking and comparison. The recognition system uses 12 whole word models: /ONE/ to /NINE/, /ZERO/, /OH/ and /SILENCE/, and recognizes digit strings like /SILENCE ONE FOUR SEVEN TWO ONE SILENCE/. A potential problem is that the TAC cannot select unvoiced phones like /TH/, /T/, /S/, /K/, /F/, and it will select only the periodic parts of phonemes like /Z/ and /V/. The most difficult digit is the /SIX/ since it consists of two unvoiced parts, while the voiced part of the /I/ is often short (even less than 50 ms) and of relatively low intensity. This makes it hard to detect the pitch reliably.

Two speech recognition systems were built: one with standard MFCC coefficients and one based on TAC-selection with the parameterization of section 2.12. Both were trained on clean data and tested on the training data contaminated with added noise.^15 The test used babble noise, car factory noise, speech noise and white noise at -5, 0, 5, 10 and 15 dB SNR from the NOISEX database (Varga and Steeneken, 1992). For comparison, recognition tests on the clean signals were performed as well.

Both systems are based on continuous density HMMs with diagonal covariance matrices and 10 states with self-loops but no skip transitions. The input consisted of 12 cepstral coefficients, their temporal derivatives (or deltas), the energy and the first order derivative of the energy. Training was based on a flat start (all states filled with the global average of the training data). Baum-Welch reestimation was applied to bootstrap the models. The models were refined by adding two mixtures at a time, followed by two Baum-Welch reestimation passes. Eventually the models consisted of 15 mixtures. The benefit of the last 6 mixtures is minimal, but somewhat more important for the selection-based system.

The TAC-based system is expected to be more robust than the MFCC-based system. But when tested on clean input, the MFCC-based system is expected to outperform the TAC-based system, since the latter ignores the unvoiced phonemes and involves more processing steps that can all introduce an error. Yet at some point along the signal-to-noise ratio continuum these disadvantages ought to be overcome by the benefits of selection.

The results in figure 2.19 confirm these expectations. For the matched condition, i.e., testing on clean speech, the performance of the MFCC system is 99.3% word correct, but performance drops rapidly for decreasing SNR. The TAC-based system reaches only 94.3% for the matched test, but its performance is much less sensitive to added noise. The rather poor performance^16 can be attributed to several causes.

Lack of optimization. These results are a first attempt, and no optimization whatsoever is performed. Since almost every aspect of the system can be improved, this is probably the main cause.

Absence of unvoiced signal components. This might be an important factor, since a recognition system trained on a cepstral representation of the cochleogram (including all voiced peaks) reaches a performance of 99.1%.

15. Note that this test does not aim to establish the quality of a recognition system, but to measure the robustness to noise of the parameterization. This kind of task requires that the training and testing databases are equal. Experiments show that an unmatched test leads to a minimal degradation, which shows that the system has not been overtrained (i.e., its parameters do not reflect the peculiarities of individual samples in the training set).

16. See Bourlard 1995 ("Towards increasing speech recognition error rates") for another form of justification of these scores.

Figure 2.19. The results of the recognition tests confirm that the TAC-selections lead to a robust representation of speech. The recognition performance of the TAC-selections in clean situations is below the results of the matched MFCC test (trained and tested on clean speech). But a 10% average performance loss compared to the matched condition occurs for the MFCCs around +20 dB SNR, while for the TAC-selections the 10% performance loss occurs at an SNR of +4 dB.

Missing pitch-contours. The selections are based upon offline computed pitch-contours. This algorithm is not flawless, which means that some contours, especially for the /SIX/ and the /EIGHT/, were not estimated, while other contours might be a fraction too short or interrupted. This might explain up to 10% of the errors (or 0.6% in terms of percentage correct). See section 5.2 and figure 5.4 for the basis of this argument.

Pitch effects in the parameterization. As figure 2.18 shows, the first and second harmonics are still noticeable in the inverse cepstral representations; this means that a considerable fraction of the input information represents pitch information instead of envelope information. Since pitch information does not contribute to word identity estimation (in English), the HMM models will not represent word identity information efficiently.

Independence assumption of HMMs. A cochleogram-based MFCC representation of speech is smoother and therefore more predictable than standard MFCCs. This entails that the TAC-based parameterization is more dependent (in a statistical sense) than MFCCs. Since HMM-based systems assume statistical independence of consecutive frames (see the footnote on page 19), the TAC-based parameterization violates this basic assumption of HMMs more than standard MFCCs do.

This effect is difficult to quantify and will be the subject of further research.

Although the TAC-based parameterization might be far from optimal, it is very robust to different types and levels of added noise. There is little performance gain for +5 dB and better. This suggests that the parameterization changes little in this range, which supports the visual evidence of the previous sections. Furthermore, the selection is only weakly dependent on the type of added noise; this proves that, as claimed in section 2.4, the TAC selects the target much more efficiently than the noise. Finally, the recognition performance for very adverse signal-to-noise ratios is still high. In an SNR of -5 dB more than 50% of the digits are recognized correctly (chance level is below 10%). This result is positively influenced by the availability of the correct period contours. Nevertheless, an SNR of -5 dB is a very challenging situation that requires the full attention of a human listener.

One might take the SNR corresponding to 10% performance loss compared to the matched condition as a measure of robustness. This level is depicted in figure 2.19 as the dashed line. For the MFCCs the point of 10% performance loss is around an SNR of 20 dB; the exact point cannot be estimated due to the limited number of tested noise levels. For the TAC parameterization the 10% performance-loss point lies, on average, at +4 dB. It can be concluded that the application of the tuned autocorrelation may lead to a considerable increase in the noise robustness of a standard speech recognition system. The performance in clean situations can be improved by an optimized recognition strategy and by the inclusion of aperiodic speech components.


CHAPTER 3 The Basilar Membrane Response

This chapter focuses on the cochleogram: a continuous function r(s,t) of place s and time t reflecting the energy of basilar membrane segments. A generalization of the cochleogram, which includes periodicity as a third dimension, will be addressed in the next chapter.

Auditory modeling for speech signal processing often focuses on the nonlinear behavior of a filterbank description of the basilar membrane (Brown 1994, Patterson 1995, Tchorz 1999). As discussed in section 2.2, for the time being a linear system is preferred because it facilitates the separation of sound sources. Section 3.1 addresses the implementation of the BM model. Section 3.2 studies spatial differentiation of the BM and the trade-off between frequency selectivity and group delay. The place-frequency relation of the BM and the response to sinusoidal stimuli is addressed in section 3.3. Section 3.4 addresses ways to estimate the contribution (in terms of energy) of individual harmonics of a known fundamental. The origin of stable ridges and the way ridges are estimated are discussed in section 3.5. The chapter is concluded with section 3.6, which addresses the cochleogram reconstruction technique that was used in section 2.12.

Figure 3.1. A schematic depiction of the basilar membrane (BM) and an electrical equivalent network of the BM model. The BM is modeled as a cascade of 400 second-order filters that model the cochlear fluid mass (L_fluid), the BM-segment mass (L_BM), the friction (damping) due to the basilar membrane movements (R_BM) and the stiffness of the BM (C_BM). The velocity of the natural basilar membrane is transduced to graded potentials by approximately 3000 hair-cells and transmitted in the form of action potentials to the brainstem by spiking neurons.

3.1 The Basilar Membrane Model

The basilar membrane (BM) model is based on the work of Duifhuis (1985) and van Netten, Hoogstraaten and van Hengel (1997), who developed and validated a one-dimensional model of the human cochlea and implemented the model in FORTRAN. Although the model represents the middle ear as well as the cochlear partition (which includes the BM), it is referred to as the basilar membrane model. An overview of the basic structure of the model and its relation to the natural basilar membrane is given in figure 3.1.

The model is developed for auditory research purposes and is optimized to be as physiologically plausible as possible with a one-dimensional basilar membrane model. The BM model was originally implemented as a nonlinear transmission line model. Special attention was paid to the numerical stability of the implementation. The original numerical model consists of 400 segments and functions at an internal sampling frequency of 400 kHz.

The main drawback of this model for speech recognition research is its high computational demand. In 1993, with a 33 MHz machine, the system was a factor 1000 slower than real-time. To improve processing speed beyond the rate provided by improving computer technology, a faster implementation of the algorithm was required. To benefit optimally from linear signal processing theory, a linear version of the model is used. This choice is mathematically convenient and could be motivated by the argument given in section 2.2. It is an open question whether or not a biophysically more realistic nonlinear model might eventually be superior for ASR purposes.

An overlap-and-add filter bank implementation is chosen because it requires a minimal number of floating-point operations. The impulse response that defines the filter bank is based on the velocity of a linear 400-segment version of the model with an internal sampling frequency of 400 kHz. Its impulse response is down-sampled to 20 kHz to reduce the number of operations per second and to allow the system to work with standard speech databases at 20 kHz. To reduce computation time, 300 segments, corresponding to frequencies below 6100 Hz, are used and the impulse response is limited to 50 ms. For all experiments only every third segment is used, so that only 100 segments have to be computed. Although this reduces the computational load by a factor of three, it reduces spatial resolution and sensitivity just as well.

The impulse responses of all 300 low-frequency segments^1 are depicted in figure 3.2. As in figure 2.6, all negative values are set to zero. Figure 3.2 shows amplitude information, while the figures in chapter 2 all depict information in the quadratic energy domain. Accordingly, the dynamic range compression is chosen as x^0.30 and not x^0.15 as in section 2.3. This leads to dynamic range compression effects comparable to the nonlinear behavior of the intact cochlea. The dynamic range compression is only applied for visualization purposes and to compute the final output; prior to visual presentation all processing is linear.

The impulse response of figure 3.2 demonstrates the most important properties of the BM model. First of all, the figure is continuous in both time and place: it is possible to choose arbitrary paths through the impulse response plane of the BM without encountering discontinuities. The place-frequency relation is visible as well: segments sensitive to the highest frequencies show many oscillations in a given interval, while the low-frequency segments show only a few.

1. Because it is often convenient to have a frequency axis and a segment numbering that correspond closely, the segment numbering starts from the low-frequency side. This is contrary to the convention in auditory modeling.

Figure 3.2. The positive values of the impulse response of the linearized basilar membrane model show rapid oscillations and a narrow temporal envelope for segments sensitive to high frequencies. On the low-frequency side the oscillations are slow and the envelope is broad. The part of the impulse response that extends beyond 50 ms is discarded. Note that the strong band, constituting the second peak at each segment, can be described as the peak of a traveling wave. The very weak and broad vertical bands and the structure in the lower left corner originate in the original model from reflections from the apex.

The impulse response profile shows why the excitation of the BM is often described as a traveling wave. Consider for example the prominent band that ends around t=30 ms at segment 1. This band can be interpreted as the top of a wave that starts at the high-frequency side and travels within 30 ms to the low-frequency side. This entails that the low-frequency side shows a delay compared to the high-frequency side.

A few examples of impulse responses of different segments are presented in figure 3.3. Each impulse response shows a delay that decreases with increasing segment number. Each segment has a frequency to which it is most sensitive: its characteristic frequency f_c. This frequency corresponds to the most prominent frequency in the impulse response. The moment the envelope of the oscillation reaches a maximum can be interpreted as a delay associated with the segment.

Figure 3.3. Examples of impulse responses of different segments (300-segment version). After a delay that decreases with segment number, oscillations start. The average period of the oscillation corresponds closely to the characteristic frequency of the segment. The peak of the oscillation's envelope correlates with the delay due to propagation effects. The first oscillations are the strongest. This corresponds with figure 3.2. The difference between the impulse responses of segments 50 and 51 is very small at each point in time. This is a direct consequence of the physical coherence of the BM, which cannot be guaranteed in generic filter banks.

Notice that there is no true delay: the whole BM responds immediately due to the incompressibility of the cochlear fluid. A formalization of this delay will be studied in the next section.

The impulse response defines the filter function of the filter bank version of the basilar membrane model. The filtering is implemented as an overlap-and-add filter bank (Lynn 1998). Typically, the signal is filtered in blocks of 2^12 = 4096 points. The impulse response is reduced from 1000 samples (= 50 ms) to 997 samples. This entails that 4096 - 997 + 1 = 3100 samples of the completely filtered BM-response become available after each block. The 3100 samples correspond to 155 ms (i.e., exactly 31 frames of 5 ms). For each segment the last 600 ms of the BM-response are stored in a buffer. The BM filtering is one of the two processing bottlenecks, so considerable effort was made to ensure that the Matlab implementation^2 was as efficient as possible. The 100-segment BM is 10 times slower than real-time on a 400 MHz computer.
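The block bookkeeping can be illustrated with a minimal FFT-based overlap-add sketch (in Python rather than the Matlab of the thesis; the numbers match those quoted above):

```python
import numpy as np

def overlap_add_filterbank(signal, impulse_responses, nfft=4096):
    """FFT-based overlap-add FIR filtering for a bank of filters.

    impulse_responses: shape (n_segments, 997).  Each FFT block then
    accepts nfft - 997 + 1 = 3100 new input samples (155 ms at 20 kHz,
    exactly 31 frames of 5 ms) and completes 3100 output samples.
    """
    n_seg, ir_len = impulse_responses.shape
    step = nfft - ir_len + 1
    H = np.fft.rfft(impulse_responses, nfft, axis=1)
    out = np.zeros((n_seg, len(signal) + ir_len - 1))
    for start in range(0, len(signal), step):
        block = signal[start:start + step]            # zero-padded by rfft
        y = np.fft.irfft(H * np.fft.rfft(block, nfft), nfft, axis=1)
        stop = min(start + nfft, out.shape[1])
        out[:, start:stop] += y[:, :stop - start]     # add overlapping tails
    return out[:, :len(signal)]
```

Since len(block) + ir_len - 1 never exceeds nfft, each block yields a segment of the exact linear convolution, and summing the overlapping tails reconstructs the complete filtered response.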

3.2 Sensitivity Versus Group Delay

Although implemented as a regular filter bank, the BM model behaves like a linear transmission line. Compared to more conventional cochlear filter banks, which are often based on gamma-tone filters (Brown 1994, Patterson 1995, Tchorz 1999), a transmission line shows a well-defined spatial continuity. And in contrast to frame-based methods, like FFT or LPC, it shows spatial as well as temporal continuity. To study the merits of spatial continuity, the first and second order spatial derivatives of the output were computed with:

0th-order: $x_s^0(t) = x_s(t)$
1st-order: $x_s^1(t) = x_s(t) - x_{s-1}(t)$
2nd-order: $x_s^2(t) = x_{s-1}(t) - 2x_s(t) + x_{s+1}(t)$   (3.1)

The superscripted index denotes the order of differentiation. These are used as input for the cochleogram estimation according to:

$r_s^i(t) = r_s^i(t - \Delta t)\, e^{-\Delta t / \tau} + x_s^i(t)\, x_s^i(t)$   (3.2)

Versions of the cochleogram of the word /NUL/, based on different orders of spatial differentiation, are depicted in the upper panels of figure 3.4. The differences are obvious: spatial differentiation highlights the differences between neighboring segments, which in turn leads to enhanced frequency resolution. The lower panel shows the cochleogram cross-section at t=175 ms. These figures show that the original basilar membrane model can hardly resolve individual harmonics: the first harmonic dominates the response of all segments. The first and especially the second order differentiated cochleograms represent individual harmonics much better because the segments' response curves are sharper. The increased discriminability comes at a price: sharper response curves represent more specific frequency information, which in turn requires more temporal information. Consequently, the response of the segments will be slower. The response time of a linear system can be formalized as group delay.

2. All presented work is implemented in Matlab. Matlab does not optimize speed, but allows rapid prototyping in combination with extensive visualization capabilities.
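Equations 3.1 and 3.2 translate almost directly into code (a sketch with assumed array conventions):

```python
import numpy as np

def cochleogram(x, order=2, fs=20_000, tau=0.010):
    """Leaky-integrated energy of the spatially differentiated BM output.

    x: BM response, shape (n_segments, n_samples).  order selects the
    0th, 1st or 2nd order spatial derivative of equation 3.1; the
    recursion of equation 3.2 then yields r_s(t).
    """
    if order == 1:
        x = x[1:] - x[:-1]                   # x_s - x_{s-1}
    elif order == 2:
        x = x[:-2] - 2 * x[1:-1] + x[2:]     # x_{s-1} - 2 x_s + x_{s+1}
    decay = np.exp(-1.0 / (fs * tau))        # e^{-dt/tau} per sample
    r = np.empty_like(x)
    r[:, 0] = x[:, 0] ** 2
    for t in range(1, x.shape[1]):
        r[:, t] = decay * r[:, t - 1] + x[:, t] ** 2
    return r
```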

Figure 3.4. Spatial differentiation sharpens the response of the basilar membrane segments. The cochleogram based on the 0th-order (original) signal is featureless. Differentiation highlights the differences between neighboring segments, which leads to an enhancement of spectral detail. A related effect is that differentiation functions as a first-order high-pass filter. This is noticeable in the lower panel as a relative enhancement of high-frequency segments that leads to a reduction of the average slope. The responses of the first and second order derivatives are scaled to match the height of the peak of the original response.

The group delay of a linear system can be defined as the center of gravity of the squared impulse response h_s(t):

    d_s = ∫ t [h_s(t)]^2 dt / ∫ [h_s(t)]^2 dt                             (3.3)

Figure 3.5 presents the group delay of different segments for different orders of spatial differentiation. The lower x-axis shows the corresponding characteristic frequency of the segments; the upper x-axis shows a normalized position axis where the last segment corresponds to 100 (the number of segments that is used in most experiments).
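Equation 3.3 translates directly into Matlab; in the sketch below, h is assumed to hold one segment's sampled impulse response.

    % Sketch of eq. 3.3: group delay as the centre of gravity of the squared
    % impulse response. h: impulse response of one segment, sampled at fs.
    t = (0:numel(h)-1).' / fs;               % time axis in seconds
    d = sum(t .* h(:).^2) / sum(h(:).^2);    % group delay in seconds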

Figure 3.5. Group delay as a function of characteristic frequency and segment number (upper axis). Group delay depends on the specificity of the cochlear filters. High-frequency filters are broadly tuned and show a short group delay. Spatial differentiation sharpens the cochlear filters and increases group delay. The shoulders and other effects on the low-frequency side of the BM stem from limiting the impulse response to 50 ms.

The first 10% of the basilar membrane is sensitive to the narrow frequency range between 28 and 116 Hz, and the corresponding group delay is long. The last 10% of the basilar membrane segments, on the other hand, represents a 2000 Hz wide frequency range; the corresponding group delay is short. Consequently, spatial differentiation sharpens the response, but leads to an increased group delay. As a general rule, a segment's group delay d is at least a few times the characteristic period T_c (the inverse of the characteristic frequency f_c) of the segment:

    d = c T_c = c / f_c                                                   (3.4)

The higher the multiplication factor c, the narrower the segment's response and the more discriminative the segment (footnote 3). The group delay curves of figure 3.5 show some odd details in the first 20 segments. These are attributed to the limitation of the impulse response to 50 ms: the true impulse response of low-frequency segments extends well beyond 50 ms. This suboptimal choice is a compromise to optimize processing speed.

3. The factor c is closely related to the quality factor Q, defined in the frequency domain as f_c/Δf, with Δf a measure of the width of the response curve.

Fortunately these low frequencies are not of central importance for speech, and the cut-off at 50 ms is partially masked by the smoothing effects of leaky integration.

The most important effects of group delay occur when the pitch of a signal changes rapidly. The high-frequency segments follow the instantaneous pitch rapidly, while the low-frequency segments still reflect the instantaneous pitch of t minus 20 ms and more (see figure 3.5). This complicates virtually all aspects of further processing.

The place-frequency relation, as well as the trade-off between group delay and the sharpness of the response (the quality factor Q), can be adjusted in the original BM model (Duifhuis 1985). This work is based on a version of the model which operates with a constant Q of 10. This gives the rather blunt response profile of figure 3.4 (blue line). Since our analysis uses the second spatial derivative, the increased group delay has to be dealt with appropriately. This will be discussed in several contexts in the next chapters (in particular in sections 4.6 and 5.1). The effect of spatial differentiation can be incorporated in the impulse response by computing the second spatial derivative of the original 300-segment impulse response. Selecting the impulse response of every third segment leads again to a 100-segment model.

3.3 Place-Frequency Relation

Most figures in this work use a place-frequency relation that has been estimated experimentally. In the original BM model the resonance frequency of the uncoupled segments is chosen according to the Greenwood place-frequency relation (Greenwood 1990, 1991), where x is measured in mm from the apex:

    f_res[Hz] = 165.4 · 10^{0.06 x[mm]} - 145 [Hz]                        (3.5)

The segment index s can be related to x by correcting for the length of the BM (35 mm), the number of segments in the original model (400), and the fact that only one out of three segments is actually used:

    s = (400/3) · x[mm] / 35[mm]                                          (3.6)
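With the constants of equations 3.5 and 3.6 as reconstructed above, the hard-coded map from segment index to resonance frequency can be sketched as follows; as a consistency check, segment 100 lands close to the 6100 Hz quoted later for the last segment.

    % Sketch of eqs. 3.5 and 3.6: hard-coded Greenwood map for the
    % 100-segment model (constants as reconstructed above).
    s    = (1:100).';                        % segment index
    x    = 3 * 35 * s / 400;                 % position in mm from the apex
    fres = 165.4 * 10.^(0.06 * x) - 145;     % resonance frequency in Hz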

Figure 3.6. The place-frequency relations. Each segment responds strongest to a characteristic frequency f_c. This place-frequency relation is denoted by the black line. The Greenwood place-frequency relation, which is hard-coded in the model, corresponds to the dashed line. The upper axis gives the characteristic frequency of the segment numbers on the lower axis. The color coding of the background codes the strength of the response of each segment to unit-amplitude sine stimuli with frequencies corresponding to the values on the y-axis. The y-axis and the color coding are logarithmic.

The interaction between segments shifts the actual resonance frequency to a slightly lower value than the hard-coded f_res of equation 3.5. Figure 3.6 shows the Greenwood place-frequency relation as the dashed line. The local resonance frequency defines the characteristic frequency f_c and is depicted as the thin black line. It is always below the Greenwood place-frequency relation. Analogous to the characteristic frequency of a segment, every frequency has a characteristic segment.

The actual place-frequency relation is based on measuring the response strength of each segment to a range of logarithmically spaced sinusoids of unit amplitude. The resulting matrix, of which the values are color-coded logarithmically, forms the background of figure 3.6. High-frequency segments respond more strongly to low-frequency stimulation than vice versa.

Figure 3.7. The BM response to sine-wave stimuli of selected frequencies (in Hz). The upper axis provides the characteristic frequency (in Hz) of the segments on the lower axis. The red line shows the response of all segments of the BM to a unit-strength sinusoidal stimulus of 300 Hz. This frequency excites segment 25 most. The excitation drops much more steeply towards the low-frequency side than towards the high-frequency side. Consequently, low-frequency stimuli mask high-frequency stimuli more than vice versa. The information in this figure corresponds to horizontal cross-sections of figure 3.6.

Since (quasi-)periodic signals consist of combinations of harmonics, it is useful to study the response of the BM to single-frequency stimuli. The response of the BM to a certain fixed frequency is termed a sine response. Several examples are depicted in figure 3.7. Irrespective of the driving frequency, all BM responses have a similar asymmetrical form, with a more prominent tail towards the high-frequency side than towards the low-frequency side. The figure depicts steady-state situations that are only reached after a sufficient number (e.g., ten) of oscillations and/or a few (e.g., five) times the integration time-constant τ. Natural signals rarely show signal components that change slowly enough to fully justify this steady-state assumption. On the low-frequency side of the BM, the pitch as well as the amplitude are seldom constant enough during the 50 ms or more that are required to reach a steady state. This results in broader responses than the ideal sine response.

On the high-frequency side of the BM, a steady state is reached more quickly, but random pitch fluctuations of natural signals broaden the responses as well. Nevertheless the sine response is a very useful approximation that is used extensively in section 3.4.

3.4 Estimating Individual Signal Components

Since the BM model is linear, its response is a summation of the responses to the individual components of the driving sound events. In the case of a quasi-periodic sound event y(t) the input can be described as:

    y(t) = Σ_n a_n(t) h_n(t),    h_n(t) = sin( (2πn / T(t)) t + φ_n(t) )  (3.7)

where a_n(t) denotes the amplitude of the harmonic contribution h_n(t). The harmonic is a function of the period contour T(t) and a phase function φ_n(t) (footnote 4). The cochleogram of this signal is defined by equation 2.1. The square and a sufficiently long integration time-constant τ ensure that the effect of the phase term φ_n(t) vanishes (except for some exceptional phenomena that are not considered here; footnote 5). In most cases a_n(t) changes slowly compared to the value of the time-constant τ. This means that a_n(t) can be treated (for short intervals) as a constant (footnote 6) that scales the cochleogram contribution of h_n(t) with a factor <a_n^2(t)>. The < > denotes the temporal average as estimated by the leaky integration process. The cochleogram contribution of h_n(t) is denoted as R[h_n(t)]. For arbitrarily developing harmonic contributions R[h_n(t)] is difficult to calculate exactly, but for slowly developing h_n(t) it can be approximated by the sine-responses as depicted in figure 3.6 and figure 3.7.

4. This phase term might be estimated from the local basilar membrane movement.
5. The phase term can, in principle, counteract any effect of the period contour T(t) and might dominate the signal completely. These cases are not considered here.
6. This application of stationarity is justified since it applies to individual signal components.

This means that the cochleogram R(t), resulting from a signal y(t) according to equation 3.7, can be approximated as:

    R(t) = Σ_n <a_n^2(t)> R[h_n(t)] ≈ Σ_n w_n(t) R_n(t)                   (3.8)

R_n(t) is the response of a unit-amplitude harmonic contribution h_n(t), approximated by a succession of the sine-responses of the characteristic segments corresponding to the temporal development of the local instantaneous frequency of h_n(t). The weighting w_n(t) determines the scaling of this sine-response.

Any quasi-periodic signal can be approximated by superimposing correctly weighted response curves for each harmonic contribution. The quality of the approximation is, as ever, a measure of the validity of the underlying assumptions. In general, equation 3.8 can be applied safely in regions that are dominated by a single source: it can therefore be applied safely along ridges that result from periodic signal components (see the discussion about indications of periodicity at the end of section 4.3). The approximation of equation 3.8 will be used in section 3.6 (which addresses the reconstruction technique introduced in section 2.12) and is valid when the quasi-periodicity assumption is valid. Errors may occur during the onset, when the BM response is not yet quasi-periodic.

Normally, the weighting w_n(t) of the sine-responses is unknown and must be estimated from the signal. To estimate the contributions of the individual harmonics of the signal in figure 2.4, two different approaches are proposed. The first approach exploits the asymmetry in the sine-responses by neglecting masking towards the low-frequency side. In this case the signal in figure 2.4 is approximated by first weighting the sine-response corresponding to the frequency of the fundamental. This accounts for part of the excitation at the position of the second harmonic; the remainder is attributed to the second harmonic. At the position of the next harmonic, the contribution of all previous harmonics is subtracted and the remainder is attributed to the current harmonic. This process can continue until the frequency of the harmonics exceeds the characteristic frequency of the last segment, but in practice it is limited to BM regions where harmonics are resolved (footnote 7). This method therefore works particularly well for the first harmonics and will be used for cochleogram reconstruction in section 3.6.

7. When harmonics are not resolved it is often possible to skip harmonics so that each remaining harmonic approximates the best frequency of a single segment. The estimated contribution of the selected harmonics must then be shared with the harmonics that were skipped.
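A minimal Matlab sketch of this first, upward-stepping approach follows. All names are illustrative: E is a cochleogram cross-section, S(:,n) the unit sine-response of harmonic n, and seg(n) its characteristic segment.

    % Sketch of the upward-stepping estimation of harmonic weights
    % (illustrative names; masking is assumed to spread only upward).
    recon = zeros(size(E));
    w = zeros(nHarm, 1);
    for n = 1:nHarm
        residual = E(seg(n)) - recon(seg(n));  % energy not explained by lower harmonics
        w(n) = max(residual, 0);               % attribute the remainder to harmonic n
        recon = recon + w(n) * S(:,n);         % add its weighted sine-response
    end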

Figure 3.8. Estimating and adding individual harmonic contributions. The blue cochleogram cross-section in the upper panel can be approximated by adding weighted harmonic contributions. The weighting of each harmonic h_n, depicted in the lower panel, can be computed by solving a matrix equation. The red line in the upper panel gives the weighted sum of 28 harmonics. The average mismatch for segment 15 and higher is less than 3%. The values of h_n in the lower panel are scaled to provide a unit contribution of the strongest harmonic.

The second method is a numerical solution of the matrix equation Rw = E. E is the target cochleogram cross-section, R the matrix of sine-responses associated with the frequencies of the individual harmonics, and w the vector with the desired weighting values. When applied to the signal in figure 2.4, the fundamental f_0 is 1/(4.60 ms) = 217 Hz, as can be estimated from the TNC in figure 2.6. The associated harmonic frequencies are n·f_0. The characteristic frequency of the last segment of the BM is 6100 Hz; the highest harmonic number that can be expressed is therefore 28. For each frequency a sine-response can be selected and added to the matrix R. Solving w = R^{-1}E (in a least-squares sense) and setting negative values of w_n to zero leads to the results in figure 3.8. The upper panel depicts the target E in blue, the lower panel presents the scaled contribution w_n of each harmonic. The red curve gives the weighted sum of sine-responses.
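In Matlab, this second method amounts to a few lines (a sketch, with R and E as defined above). Matlab's lsqnonneg would yield a nonnegative solution directly; clipping after an ordinary least-squares solve follows the text.

    % Sketch of the second method. R: [nSegments x nHarmonics] matrix of
    % sine-responses; E: target cochleogram cross-section (column vector).
    w = R \ E;          % solve R*w = E in a least-squares sense
    w(w < 0) = 0;       % set negative weights to zero
    approx = R * w;     % weighted sum of sine-responses (red curve in figure 3.8)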

As can be seen, the match is very good, which entails that the harmonic content of the first three formants was estimated correctly. The weights of the highest harmonics can only be estimated reliably around formant peaks. At other positions the sine-responses associated with the harmonics overlap almost completely and numerical errors will influence the results. Lower fundamental frequencies exacerbate this problem, but the use of a BM model with more segments alleviates it.

This is an efficient and rather elegant method for analyzing the harmonic content of a periodic signal when the fundamental frequency contour is known. The technique also works when the pitch of the signal changes within the validity range of the quasi-periodicity assumption (which seems to be the case for spontaneous speech). In this case, the approximation with the sine-responses is somewhat less valid, and group delay effects have to be accounted for by choosing a set of frequencies that reflect the local instantaneous frequencies of the harmonics. Yet this correction is straightforward if a correct pitch contour is provided. Further work will extend this technique to deal with rapid onsets, its use on mixtures of periodic signals, and its application to TAC-cochleograms.

3.5 Estimation of Ridges

As discussed in section 2.6, it is important to reduce the search space to allow computationally feasible implementations. The search space is reduced very efficiently by limiting TNC computation to ridges s(t) in the cochleogram. The existence of more or less stable ridges is, again, a direct physical consequence of BM continuity.

Two signal contributions can interact in different ways. An uninteresting case occurs when one signal contribution masks the other altogether. In this case, the weaker contribution cannot be estimated. More interesting cases arise when both exert a noticeable influence. An important situation occurs when both signal contributions have frequencies that correspond to a single segment or close neighbors. In this case, intervals with constructive and destructive interference alternate. This results in amplitude modulation, with a period corresponding to the inverse of the frequency difference between both signal components, and the formation of a ridge at the position corresponding to the weighted mean frequency of both components. The leaky-integrated energy value associated with this ridge shows amplitude modulation. In noisy situations this may result in interrupted ridges. This type of interaction is important for harmonic

complexes at formant positions, which show amplitude modulations with the fundamental period of the signal. These modulations can be detected when the lowpass filtering has a sufficiently fine temporal resolution (which is generally not the case in the current implementation).

Another important interaction between signal components arises when the signal components correspond to segments that are further apart, so that both dominate their corresponding characteristic segment. Somewhere in between (due to the asymmetrical nature of masking, usually close to the high-frequency segment) segments exist that feel a comparable influence from both components. These segments must follow two different frequencies without rupturing the BM. Consequently the local amplitude is reduced, especially when the signal components are anti-phasic. The corresponding local energy is small as well. This leads inevitably to a situation with two energy peaks separated by a valley. For signal contributions that persist for some time, the corresponding peaks string together to form temporal ridges. Ridge forming proves the existence of stable ridges corresponding to sufficiently separated, continuously developing signal components. This is a transmission line property that cannot be guaranteed in general filter-bank-based BM models.

Ridge estimation is implemented straightforwardly, as sketched below. Candidate ridges are formed by stringing together peaks that differ less than 2 segments. In the odd case that ridges split or merge, the most energetic peak on the ambiguous side forms part of the continuation; the other peak forms the start of a new ridge or the end of an old ridge. When candidate ridges have a duration of at least 20 ms (4 frames) they are accepted as valid ridges. Ridge estimation can be improved by an approximation of the frequency and energy development of the ridges.
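A simplified Matlab sketch of this tracking procedure is given below. It omits the split/merge disambiguation and uses illustrative names: peaks{f} is assumed to hold the segment numbers of the cochleogram peaks in frame f (5 ms frames).

    % Simplified ridge-tracking sketch (split/merge handling omitted).
    % peaks{f}: peak segment numbers in frame f; each track is a list of
    % [frame, segment] rows.
    tracks = {};
    for f = 1:numel(peaks)
        for p = peaks{f}(:).'
            extended = false;
            for k = 1:numel(tracks)
                lastRow = tracks{k}(end,:);
                if lastRow(1) == f-1 && abs(lastRow(2) - p) < 2
                    tracks{k} = [tracks{k}; f, p];   % continue an existing ridge
                    extended = true;
                    break
                end
            end
            if ~extended
                tracks{end+1} = [f, p];              % start a new candidate ridge
            end
        end
    end
    valid = tracks(cellfun(@(r) size(r,1) >= 4, tracks));  % >= 4 frames = 20 ms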

3.6 Cochleogram Reconstruction of TAC Selections

Section 2.12 provided an overview of the techniques that are used to transform the TAC-selections into a form that is suitable for HMM-based ASR purposes. This section studies the details of the reconstruction process that was used to remove the negative correlations in the TAC-selection. This reconstruction technique can also be applied to the more refined mask forming technique that will be developed in chapter 6.

Cochleogram reconstruction is a two-step process that is illustrated in figure 3.9. The first stage searches for evidence of resolved harmonics and uses this evidence to compute the harmonic contributions in the lower half of the reconstruction. The second stage adds information about the rest of the cochleogram using the mask and an approximation of actual activation patterns.

The first stage of the algorithm involves the estimation of coherent ridges in the (60) low-frequency segments. Ridges are formed as in the previous section, with ridges longer than 15 ms accepted as candidates for harmonics. Since the fundamental period contour is known, it is possible to predict the segment numbers of the lowest harmonics. The ridges that are, on average, within 1 segment of the expected position of the first 5 harmonics are accepted as harmonics of the target signal. This criterion discards spurious ridges on the basis of a mismatch in temporal development. The number of harmonics that can be modeled this way depends on the spatial resolution of the basilar membrane. With a more sharply tuned BM model and a higher number of segments, more harmonics can be treated individually. In this case 5 harmonics were treated individually because the acceptance regions of the first 5 harmonics do not overlap in the present BM model (footnote 8). The algorithm is only weakly sensitive to the value of this parameter and usually works fine with up to 6 or 7 harmonics.

8. This limits sufficiently resolved harmonics to the first 60 segments.

The upper left-hand panel of figure 3.9 shows all candidate ridges. The energy development along these ridges is smoothed by replacing each value with a three-point local average. The smoothed harmonic ridges are used to reconstruct an estimate of the original cochleogram by adding contributions of successive harmonics following equation 3.8. This process is shown in the top panel of figure 3.10. The reconstruction starts with weighting the ideal sine-response (as shown in figure 3.7) of the fundamental. It is assumed that harmonics influence each other only upward in frequency (see section 3.4 for the justification of this assumption). At the position of the second harmonic, part of the energy can be attributed to the first harmonic, and the rest of the energy is used to weight the ideal sine-response of the second harmonic. In the top panel of figure 3.10, a large fraction of the energy at the position of the third harmonic must be attributed to the second harmonic; the fourth and fifth are relatively more important.

The resulting partial reconstruction, based on 5 harmonics, is depicted in black.

Figure 3.9. Overview of the reconstruction process. In the first step candidate ridges are selected. Ridges that are close to the expected position of the first 4 harmonics are selected and replaced by a weighted sine-response. This is depicted in the upper right-hand panel. Next, the remaining information under the mask is added (lower left-hand panel). In the final step, an exponentially decaying tail and an approximation of the effects of upward and downward masking are added. The reconstruction can be compared to the original selection.

The second stage of the algorithm is the reconstruction of the high-frequency range. Again the mask is used to pinpoint the regions that are most likely to represent information of the target. The selected values under the mask that exceed the partial reconstruction replace the values of the partial reconstruction. The result of this step is depicted in the lower left-hand panel of figure 3.9.

This stage leads to high-frequency contributions with unrealistically steep upward and downward slopes. The black peaks in the upper panel of figure 3.10 show this clearly. To make the reconstruction more realistic without adding extra information, the ridges of the mask can be augmented with flanks that represent masking effects consistent with a harmonic that excites the position of the peak associated with the flank. These can, again, be estimated from the sine-responses and added to the reconstruction. Finally, the effect of leaky integration can be modeled as

exponential decay. The final reconstruction is shown in the lower right-hand panel of figure 3.9 and drawn in black in the lower panel of figure 3.10.

Figure 3.10. Example of a reconstructed cochleogram cross-section. The upper panel shows the reconstruction of the first 5 harmonics by addition. It is assumed that higher harmonics do not influence lower harmonics. The peaked black line above the position of the fourth harmonic shows the selected energy under the mask. The lower panel compares the original noisy cross-section (blue) with the target (red) and the estimated reconstruction (black), which includes upward and downward masking. The background is often much more energetic than the target (the average local SNR is -5 dB). The red, green and black stars mark segments with a local SNR better than 3, 0 and -3 dB, respectively. Tuned autocorrelation selections are reliable when the local SNR is better than 3 dB. This cross-section corresponds to t=275 ms in figure 3.9.

Visual inspection shows that the reconstruction is often of high quality. The presented situation, with babble noise at a global SNR of 0 dB, is close to the limits of the selection technique. In figure 3.10, the average instantaneous SNR per segment is -5 dB. Part of the signal, in particular the high-frequency range of figure 3.10, has a very unfavorable local signal-to-noise ratio. As can be seen in the lower panel of figure 3.10, a correct reconstruction is likely when the red target is close to the blue line that corresponds to the total energy. This corresponds to situations where the local SNR is favorable (SNR > 3 dB, red

stars) and the BM is dominated by the target. The difference between the red target energy and the black reconstruction shows that the energy of the fifth harmonic is underestimated; this is the result of a TAC estimation error due to the local optimization algorithm that will be discussed in section 4.6. The underestimation of the fifth harmonic was balanced by the selection of the two contributions at segments 81 and 88. When the distance between the black and the blue line increases, the probability increases that the reconstruction is incorrect. When the distance between the red and the blue lines corresponds to an SNR between 0 and 3 dB, the target can still dominate, but the influence of the noise is considerable. For negative SNR (no green stars), the BM segments are not dominated by the target and the reconstruction is likely to represent spurious contributions.

Generally the quality of the reconstruction deteriorates gracefully. Selections will be reliable for local signal-to-noise ratios of 3 dB and better, and deteriorate rapidly with negative local signal-to-noise ratios. Below a local SNR of -3 dB, reconstruction leads to unreliable contributions. This provides a basis for the choice of a threshold of 0 dB for the background model in equation 6.2. The link between the local SNR and the probability of a correct reconstruction (and consequently the assignment of information to the correct representation) relates the tuned autocorrelation to the experimental and theoretical work of Fletcher, French, Steinberg and Galt (French 1947, Allen 1993), which showed that the local SNR, and not the spectrum, determines the intelligibility of (nonsense) words.

CHAPTER 4  Time Normalized Correlogram

The previous chapter studied the energy of the basilar membrane response r(s,t) as a function of s and t. This chapter studies the Time Normalized Correlogram (TNC), a generalization r(s,t,T) that includes periodicity. The TNC plays a central role in this thesis. The tuned autocorrelation (TAC) based selection of quasi-periodic speech signals depends on its two main properties. The first and most important property is continuity through time, place and periodicity. The second property is the time-of-onset normalization, which ensures that the onset time does not depend on pitch.

This chapter starts with a comparison of several continuous correlogram variants. The chapter proceeds in section 4.2 with a description of the dynamic (time-dependent) properties of the TNC. This leads to two broad classes of TNC responses: (quasi-)periodic responses due to sinusoids or (complexes of) harmonics, and aperiodic responses (due to on- and offset transients, pulses, steps and aperiodic noises). Section 4.3 introduces the Characteristic Period Correlation (CPC), which is used to determine which time-place regions are dominated by signal components with a corresponding frequency. Section 4.4 addresses on- and offset transients in some detail. Section 4.5 describes the estimation of the local instantaneous frequency contours. The next two sections discuss the tuned autocorrelation selection process. Section 4.6

describes the estimation process of TAC-selection, and section 4.7 discusses the effects of aperiodic and periodic noise on TAC selections.

The Time Normalized Correlogram as defined in section 2.5 is a member of a larger family. Licklider (1951) proposed a first version of the correlogram as a model for pitch perception. Slaney (1993) discusses a typical implementation that is optimized for speed. It computes the autocorrelations with an FFT implementation: this implies the quasi-stationarity assumption that, according to conclusion 1.21, should be avoided for unknown signals. Correlograms (usually frame-based) are used to model human pitch perception (Meddis 1996) or, like Brown (1994), for auditory scene analysis (footnote 1).

1. The correlogram helps to organize and visualize acoustic information. It might, or might not, exist as an explicit representation in the brain. But since it consists of spatio-temporal correlations, which can be computed in extremely complex forms in single dendrites, it might be completely implicit in the sense that correlogram values are computed, but in a way that is not measurable as a field of neurons with responses resembling figure 2.6. At this moment there is no neurophysiological evidence for a correlogram-based representation. Yet the mere fact that dendrites can process highly complex spatio-temporal correlations (Koch 1999, Bower 1998) warrants attention to the kind of information presented by these correlations.

4.1 Three Continuous Correlogram Variants

For frame-based discrete-time autocorrelations based on

    r[n] = Σ_{i=0}^{N} x[i] x[i±n] w[i],    with x[i] = 0 for i < 0 or i > N     (4.1)

(with w[i] an arbitrary window) it makes no difference whether the correlation is based on leads or lags, i.e., on i+n or on i-n: as long as all contributions are summed, the result is the same. The convention is to choose a minus sign, since it resembles a causal system without delay. For a continuously updated (or running) autocorrelation, however, the choice for i-n leads to a different presentation of information than an implementation based on i+n. Three slightly different continuous implementations of a leaky-integration-based correlogram will be discussed in this section. Although these correlograms will be defined for continuous time, they will be computed and visualized in discrete time.
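The frame-based definition of equation 4.1 has a direct Matlab rendering (a sketch; the lead version with i+n is shown).

    % Sketch of eq. 4.1: frame-based windowed autocorrelation (lead version).
    % x and w are vectors of equal length; x is zero outside its support.
    function r = frameAutocorr(x, w, maxLag)
        x = x(:); w = w(:);
        r = zeros(maxLag+1, 1);
        for n = 0:maxLag
            idx = 1:numel(x)-n;
            r(n+1) = sum(x(idx) .* x(idx+n) .* w(idx));
        end
    end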

The most conventional choice is (Meddis 1992):

    r^-_{s,T}(t) = L{ x_s(t) x_s(t - T) },    s = 1, ..., s_max,  T ∈ [0, T_max]     (4.2)

This equation is the defining function of the matrix elements of the time-evolving matrix r^-_{s,T}(t). Following equation 2.6, r^-_{s,T} denotes the value of the autocorrelation of segment s and autocorrelation lag T. The superscripted - refers to the minus sign in the last term. The matrix indexes of the T-dimension span T = 0, Δt, 2Δt, ..., T_max, with T_max = NΔt and N the number of matrix elements in the period dimension. The s-dimension spans s = 1, ..., s_max. Again, x_s(t) denotes the output of BM segment s. This implementation is causal. The original article of Licklider (1951) resembles this variant.

An alternative implementation, defining the TNC, looks forward in time:

    r^+_{s,T}(t) = L{ x_s(t) x_s(t + T) },    s = 1, ..., s_max,  T ∈ [0, T_max]     (4.3)

Dropping the indices, this implementation will be referred to as r^+(t). Equation 4.3 is not causal, but it can be implemented by the introduction of a delay T_max in the computation. This entails that the correlogram of the current time t becomes available with a delay of T_max (e.g., 12.5 ms).

Often (Patterson 1995, Brown 1994) a form of group delay normalization is applied. In this normalization, the envelope of the impulse response is time-shifted by the value of the local group delay d_s (see figure 3.5), so that all segments reflect the moment of maximal excitation due to an impulse at approximately (or, as in Brown 1994, exactly) the same time. This representation entails that information of high-frequency segments at time t is lined up with information of low-frequency segments at t+30 ms or later. Group delay normalization can be performed either with a + or a - sign, but for reasons apparent later only the +-version is considered here:

    r^{gd}_{s,T}(t) = L{ x_s(t + d_s) x_s(t + d_s + T) },    s = 1, ..., s_max,  T ∈ [0, T_max]     (4.4)

Compared to equation 4.3 an even longer delay is necessary. While the minimal delay in equation 4.3 was T_max, now the delay is T_max + d_s. The inverse of T_max is the lowest frequency to be expressed in the correlogram. For most speech, a useful lower limit (footnote 2) is 80 Hz and T_max = 12.5 ms, and the corresponding group delay is 28 ms (see figure 3.5). The combined delay is consequently more than 40 ms.

2. This lower limit to pitch is valid for most speakers, but some speakers reach pitch values as low as 40 Hz (Furui 1989). The limit of 80 Hz is for presentational purposes.
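In discrete time, the forward-looking variant of equation 4.3 can be sketched for a single segment as follows (illustrative names; xs is the segment output and tau the integration time-constant).

    % Sketch of eq. 4.3 for one segment: the forward-looking, leaky-integrated
    % running autocorrelation. xs: segment output; Tmax: maximum lag in samples.
    decay = exp(-1/(fs*tau));
    nOut  = numel(xs) - Tmax;                 % the last Tmax samples must wait
    r     = zeros(Tmax+1, nOut);              % one TNC column per sample
    acc   = zeros(Tmax+1, 1);
    for n = 1:nOut
        seg = xs(n:n+Tmax);                   % x_s(t+T) for T = 0..Tmax
        acc = acc * decay + xs(n) * seg(:);   % leaky integration of x_s(t)x_s(t+T)
        r(:,n) = acc;
    end
    % The causal r^-(t) of eq. 4.2 would use past samples x_s(t-T) instead.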

Note that r^{gd}(t) presents, for any given t, information that originally was separated 30 ms or more in time. With a frame sampling period of 5 ms, this corresponds to 6 frames or more. The redistribution of temporal information may lead to a reordering of information that makes it more difficult to study the temporal development of the signal.

From the viewpoint of the conservation of continuity, equation 4.4 conserves the basic assumption of continuity of time in this discrete-time implementation less well than the other implementations (footnote 3). In these implementations a change between neighboring segments a and b associated with a time step from t to t+Δt:

    s_a(t) → s_b(t + Δt)                                                  (4.5)

involves a minimal time step Δt, while in the case of group delay normalization the effective change is:

    s_a(t) → s_b(t + Δt + d(s_b) - d(s_a))                                (4.6)

Because the group delay differs slightly between neighboring segments, an additional temporal shift d(s_b) - d(s_a) is introduced that increases discontinuities due to discretization. These may reduce the validity of the continuity assumption, especially during rapid changes of the signal. This effect is exacerbated when low sampling rates are used. Since the continuity of the proposed correlograms is a basic assumption, i.e., an assumption that cannot be checked during processing, violations of continuity may lead to unpredictable results.

3. Note that group delay correction in a BM that is continuous in both time and place is simply a correction with a continuous function that does not influence continuity in any way. Since this BM implementation is discrete in both time and place, care has to be taken to ensure the functional equivalence between a continuous BM and a discrete approximation. This argument is therefore exclusively a consequence of the chosen implementation.

To compare these definitions qualitatively, three sequences of correlogram frames are presented in figure 4.2. Before interpreting these different representations, it is important to stress that they represent exactly the same information (apart from sampling effects due to the presentation of one frame every 5 ms). The only difference is the temporal

ordering of the information.

Figure 4.1. The stimulus used for computing figure 4.2 has a pitch of 100 Hz and consists of the weighted harmonics 1 to 5, 9, 15 and 40 in cosine phase. The onset is tapered with a raised cosine to ensure a gradual build-up in 10 ms; this limits onset effects to the lower frequency range of the BM, where the onset is still fast compared to the period. The dashed red line shows the untapered onset as used in section 4.2.

Figure 4.2 is based on a signal with f_0 = 100 Hz consisting of a number of weighted harmonics; it is presented in figure 4.1. This signal is tapered with a raised cosine to ensure a gradual build-up within a single fundamental period. This tapering is comparable to the onset of voiced plosives like a /B/. However, it is too rapid to describe the onset of natural vowels.

The left column of figure 4.2 represents r^-(t). This correlogram definition does not respond before t=0, and builds up gradually after t>0. At t=5 ms the scope of non-zero values of x(t)x(t-T) is maximally 5 ms, because the contributions x(t-T) are still 0 for T>5 ms. This limits the correlogram build-up to periods shorter than 5 ms and to segments with characteristic frequencies above 200 Hz. This explains the quarter-circle form in the 4th frame. At t=10 ms the cut-off is at 10 ms and at segments sensitive to 100 Hz. This leaves the right side of frame 5 as yet unperturbed. The lower right side will always reflect a correlation that is delayed compared to the upper-left corner.

The middle correlograms represent r^+(t): the Time Normalized Correlogram. The TNC derives its name from its onset behavior, which is visible in the transition from the third to the fourth frame. In this case, the complete correlogram starts to respond at the onset of the basilar membrane response, and it follows the build-up of the basilar membrane response as well as is possible with an integration time-constant of 10 ms. The choice of a + in the definition of the correlogram is perfectly natural, since periods can only be estimated after at least a full period has become available.
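For reference, the stimulus of figure 4.1 can be generated along the following lines. The sketch assumes a 20 kHz sample rate and a 1/n harmonic weighting; the thesis only states that the harmonics are weighted, so both are assumptions.

    % Sketch of the figure 4.1 stimulus: a 100 Hz complex of the harmonics
    % 1-5, 9, 15 and 40 in cosine phase with a 10 ms raised-cosine onset.
    % The 1/n weights and the 20 kHz rate are assumptions.
    fs = 20000; f0 = 100; dur = 0.1;
    t  = (0:round(dur*fs)-1) / fs;
    y  = zeros(size(t));
    for n = [1 2 3 4 5 9 15 40]
        y = y + cos(2*pi*n*f0*t) / n;            % cosine phase, assumed weighting
    end
    nTap = round(0.010 * fs);                    % 10 ms raised-cosine taper
    taper = ones(size(t));
    taper(1:nTap) = 0.5 * (1 - cos(pi*(0:nTap-1)/nTap));
    y = y .* taper;                              % omit the taper for the hard onset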

Figure 4.2. The temporal development of the three leaky-integrated correlogram variants with the tapered test signal of figure 4.1. The left variant, r^-(t), is based on a backward-looking autocorrelation. Note the upper-left to bottom-right development between t=0 and t=15 ms. The second is the forward-looking r^+(t); this is called the Time Normalized Correlogram (TNC) because it has a well-defined onset at t=0, where all correlogram values start to respond simultaneously. The third variant, r^{gd}(t), looks forward as well, but has additional group delay normalization. In this variant the response starts to build up well before t=0 and from the low-frequency side upwards.

The definition of r^+(t) ensures that visualization of evidence of periodicity is shifted to a point in time where it can be estimated.

The right column shows the correlogram with additional group delay normalization, r^{gd}(t). Here the first response is visible in the lowest segments at a time corresponding to the maximal group delay: around t minus 30 ms. From then on, the correlogram starts to build up from the bottom. At t=5 ms it is further developed than the TNC at the same instant, and it provides a fairly good indication of the final harmonic distribution. At the high-frequency side the difference with the TNC is minimal because the local group delay is small. At the low-frequency side the difference becomes more and more prominent. At t=30 ms, r^{gd}(t) is completely developed. This corresponds to 60 ms after the initial onset. The same build-up time is required for r^+(t). The r^-(t) requires an extra 12.5 ms for the lower right corner. In saturation, at t > 70 ms, it makes no difference which method is chosen: all reach an identical steady state. This is of course only the case for stationary signals! Information-carrying natural signals are nonstationary, so we must choose the correlogram version that is most suited to reflect nonstationary signals.

The way r^-(t) builds up leads to all kinds of practical problems in the context of the tasks and solutions of chapter 2, particularly during onsets and rapid changes. The r^-(t) is therefore not a suitable choice. The combination of a minus sign with group delay normalization complicates this further (e.g., Brown 1994). This is the reason why it was not considered in the first place. Unlike r^-(t), r^{gd}(t) can be used without additional problems; it can even provide some benefit in modeling the onset of pitch. The problem with group delay correction is that it, even more than r^-(t), disguises timed signal changes by presenting BM information at different times in a single frame. Furthermore, r^{gd}(t) cannot guarantee the continuity assumption during periods of rapid change as well as the TNC can. These objections amount to a minor deviation from optimality that ought to be avoided in any system that has to function with unknown signals. Since group delay normalization has no real benefits, it is avoided in this study. One can conclude that the TNC is the best representation for reflecting the development of nonstationary signals, because the TNC, unlike r^-(t) and r^{gd}(t), reflects onsets and other signal changes for all segments at the same time.

4.2 Dynamic Properties of the TNC

The TNC has a large number of convenient properties. Some of the steady-state properties of the TNC were discussed in chapter 2 and in the previous section. This section focuses on its dynamic properties, in particular its onset behavior. To improve the understanding of the onset behavior of the TNC, it is instructive to compare its impulse response with the onset of a periodic signal. Again the signal of figure 4.1 is used, but now also a version without the tapered onset. Since all harmonics are in cosine phase, this leads to a maximally sharp onset. Although the signal is unnatural, it shows an important limiting case. The impulse response TNC and the TNCs of the periodic signal with the hard and the tapered onset are displayed in figure 4.3. Probably the most striking observation from the left-hand column is that the general structure of the impulse response TNC is almost independent of time. Only the relative weighting of TNC areas develops through time; the sign of the autocorrelation remains generally constant. This has important consequences that will be investigated later in this and the following section.

The middle trace of figure 4.3 shows that the onset of the untapered periodic signal resembles the impulse response. The transition to the stable periodic form develops from the high-frequency side down to the low-frequency side, and from high values of the autocorrelation lag (T) on the right to low lags on the left. This is again a consequence of the fact that a period at the basilar membrane can only be estimated after at least a full period; and to achieve any measure of certainty, multiple periods are necessary. Group delay is a measure of the scope of time a segment requires before it can decide whether its neighborhood ought to respond. This property of group delay helps to interpret the temporal development of the onset of the periodic signal.

The difference between the tapered onset and the hard onset is very prominent for high frequencies. The build-up of the TNC of the tapered signal does not show any onset effects above 1500 Hz, the second highest frequency in the signal. Compared to the width of the tapering window (10 ms), 1500 Hz is a high frequency. Furthermore, the red line of figure 3.5 shows that the local group delay of the segment most sensitive to 1500 Hz is 5 ms.

Figure 4.3. A comparison between the impulse response of the TNC and the onset of an untapered and a tapered version of the periodic signal of figure 4.1. The general structure of the impulse response is time-invariant. The build-up of the untapered periodic signal starts in a similar way as the impulse response. Low-frequency segments require more time to determine the difference between aperiodic and periodic signals and show a longer impulse-like response. The high frequencies of the tapered TNC are stable from the onset, but the upper left corner of the untapered periodic TNC requires more than 30 ms to stabilize. This is associated with the energetic upper left corner of the impulse response TNC.

Interpreting group delay as the required interval before a segment can decide whether or not to respond, we can conclude that for BM positions corresponding to 1500 Hz and above:

1. the onset is slow compared to the period of the basilar membrane segment, so onset effects are small, and
2. after 5 ms enough information is available to warrant stable periodicity estimation.

Consequently, no onset effects occur at 1500 Hz and above for the tapered signal. In the untapered situation of figure 4.3, the first conclusion is not valid for any of the segments. Here the stimulus starts with a hard onset that produces an impulse-response-like pattern that first has to diminish sufficiently before the stable response appears. This lasts particularly long in the upper left-hand corner of the periodic TNC, since this region reflects the highest energy values in the impulse response. Similar arguments hold for the lower frequency regions of the TNC. Only after a period equal to the local group delay (approximately 20 ms) does the difference between the second and the third harmonic become prominent in the strong band at T=10 ms that signifies the first full period. The details of the onset are therefore noticeable in a characteristic manner in the TNC development. Unfortunately, the optimal exploitation of these effects requires the computation of the full TNC, which is computationally demanding: these examples require 200 times real-time on a 400 MHz machine. This is (still) far too much for practical applications (footnote 4).

4. The number of TNC values to compute is strongly dependent on the number of ridges that have to be followed to determine LIF contours. The number of ridges depends on the signal and on the level and type of the noise. The selection mechanism is therefore optimized in the sense that it only requires the computation of 5% to 10% (depending on the SNR and system parameters) of the total number of TNC values.

The time-independence of the general structure of the impulse response TNC leads to a distinction between two broad classes of signal contributions: aperiodic versus periodic. Periodic signal contributions try to dominate regions of the BM around the characteristic segments of their frequency components. The harmonics of a single quasi-periodic source induce the characteristic vertical structures in the TNC, with the form of the peaks of resolved harmonics similar to the peaks of the corresponding sine responses (see section 3.3).

Aperiodic signal contributions, on the other hand, consist of a continuum of frequency contributions that do not agree on a common periodicity. These signal contributions appear in the TNC as superpositions of the impulse response, and since the impulse response TNC has a time-invariant general structure, aperiodic signal contributions induce, on average, a characteristic impulse response structure (footnote 5). Narrowband noise induces the impulse response TNC in only a narrow range of BM segments (footnote 6).

5. The samples of periodic signals do also induce a superposition of impulse responses, but periodic signals lead to stable interference patterns of a discrete set of contributions.
6. Sinusoids can be interpreted as narrowband noises with the narrowest bandwidth possible. In this case only a single segment responds with its own frequency.

The temporal invariance of the aperiodic TNC response can be demonstrated by comparing the average impulse response TNC with the average TNC of white noise (which has the same frequency content as an impulse). White noise can be simulated as a series of random values, each independent of its predecessor. Each sample can be treated as an impulse, and each sample leads to the impulse response TNC. The time-invariant structure of the impulse response ensures that the average of these random contributions converges to the impulse response TNC. The left side of figure 4.4 shows the average impulse response of the TNC over 75 ms; the right side shows the average of 10 seconds of white random noise (both are sampled every 5 ms). Both representations are, except for the upper right part, in good general agreement. A longer averaging period will lead to a further reduction of the differences. Given the large dynamic range of the TNC representation, very long integration periods are required for full convergence. This is especially true for the upper right corner, where random correlogram contributions are large compared to the long-time average. The stability of the impulse response TNC will be used to estimate a measure of local dominance in section 4.3.

To summarize: two classes of signal components exist for the TNC. Aperiodic signal components (like onsets, impulses and noises) induce an impulse response TNC or some perturbed version of it, while periodic signal components, which reflect (complexes of) sinusoids, lead to a superposition of a finite number of periodic responses and the characteristic vertical structure. Every input results in a combination of both types of responses. When the assumption of quasi-periodicity becomes more valid, the aperiodic response becomes less prominent, and vice versa.

Figure 4.4. Comparison of the average of the impulse response of the TNC with the average TNC of 10 seconds of white noise. In the limit t → ∞ the average TNC of white noise converges to the average impulse response. This shows that aperiodic signals, like impulses, onsets and broadband noises, lead to variants of a similar pattern.

4.3 The Characteristic Period Correlation

Section 4.2 demonstrated the time-invariant structure of the impulse response of the TNC. This section uses this invariance to define a measure of local dominance. While periodic contributions are characterized by a discrete set of frequencies, aperiodic contributions are characterized by a continuous frequency distribution. A prototypical aperiodic signal like a unit pulse represents an equally weighted distribution of frequencies (a flat spectrum). Each frequency tries to dominate the basilar membrane segments with the corresponding characteristic frequency. Since all frequencies succeed equally well, a situation results in which each segment oscillates with its characteristic period T_c.

Figure 4.5 shows the positive values of the impulse response TNC for the first 50 segments. Superimposed on the figure are lines that reflect the positions of

the maxima and minima of the autocorrelations.

Figure 4.5. Positions of harmonics superimposed on the mean impulse response TNC. The lower solid line connects the first autocorrelation peaks for each segment. This line corresponds to the expected position of the first harmonic and provides the characteristic segment for each period or, alternatively, the characteristic period for each segment. For example, the first harmonic (h_1 = f_0) of an 8 ms periodic signal is approximately 120 Hz. This yields a T_c = 8 ms for segment 10. The other black solid lines denote the expected segment numbers for the second to fifth harmonic. The lower dashed line provides the position of the first negative autocorrelation peak. Although corresponding to a negative correlation, this peak does provide periodicity information. The other dashed lines reflect the positions of the second to fifth negative TNC peaks.

The lower solid line links segments with their characteristic period. This line reflects the positions where the first harmonic (h_1 = f_0) of a given fundamental period P_0 ought to be expressed. The other solid lines reflect the positions of the second to fifth harmonic. The dashed lines show the positions of the negative autocorrelation peaks. Although these are set to zero in the visible representation, they do reflect TNC positions from which periodicity information can be derived.

When a segment responds with its characteristic period, its TNC shows a peak at T = T_c. This leads to a first preliminary definition of the Characteristic Period Correlation (CPC):

    r^c_s(t) = L{ x_s(t) x_s(t + T_c(s)) }                                (4.7)

where T_c(s) denotes the characteristic period of segment s. The CPC is, like the cochleogram and the TAC, a subset of the TNC. But unlike the TAC, the CPC does not require a signal property.

Figure 4.6. The computation of the Characteristic Period Correlation (CPC). The upper panel shows a steady-state TNC of a 250 Hz tone. The solid line shows the relation between the characteristic period T_c and the segment number (right axis) and the characteristic frequency (left axis). The dashed line shows T_c/2. The lower panel shows two different CPC responses to the 250 Hz tone as a function of T_c, normalized by the energy of the corresponding segment. The solid line corresponds to equation 4.7, the dashed line to equation 4.8. Note that the dashed line does not show high values near T=0 and T=2T_c.

The upper panel of figure 4.6 shows the steady-state TNC of a 250 Hz tone. The solid line shows the TNC points that correspond to the CPC as defined in equation 4.7. The solid line in the lower panel shows the corresponding TNC values, which are normalized with the local energy value (at T=0). The CPC as defined in equation 4.7 provides values close to the energy values of the cochleogram for basilar membrane regions that oscillate with the characteristic period. Unfortunately, it also produces high values for high-frequency regions (i.e., low T_c) that are entrained by very low-frequency components. This is a situation that can occur due to the asymmetric response curves of the segments (i.e., upward spread of masking). To prevent this contamination, the CPC can be redefined to:

    r^c_s(t) = (1/2) [ L{ x_s(t) x_s(t + T_c(s)) } - L{ x_s(t) x_s(t + T_c(s)/2) } ]     (4.8)

which corresponds to half of the difference between the correlation of x_s(t) with x_s(t+T_c(s)) and the correlation of x_s(t) with x_s(t+T_c(s)/2). When the segments oscillate with a period close to the characteristic period, the latter correlation will be maximally negative. Half of the difference between the two correlations results in CPC values close to cochleogram values. The dashed line in the upper panel of figure 4.6 shows the TNC positions that correspond to half of the local characteristic period. The dashed line in the lower panel shows the normalized CPC, based on equation 4.8, for the 250 Hz tone as a function of the characteristic period. The CPC is large for the entraining frequency 1/T_c = 250 Hz and negative for segments with a characteristic period smaller than that of the entraining frequency. It is possible to formulate alternative definitions of the CPC that show similar behavior by combining contributions that depend on more and/or different fractions of T_c.

On average, the correlation x(t)x(t+T) is smaller for T = T_c than for T = 0 (the local energy) (footnote 7). Furthermore, temporal discretization effects prevent the true values of T_c(s) and T_c(s)/2 from being sampled. This leads to a segment-dependent bias of the CPC. Both effects can be corrected for by measuring the average response of the CPC (as defined in equation 4.8) to white noise. The ratio of the average of the CPC as a function of segment and the temporal average of the energy can be used as a normalization factor c_s that can be applied to each segment of each frame. This leads to a third and final definition of the CPC:

    r^c_s(t) = c_s (1/2) [ L{ x_s(t) x_s(t + T_c(s)) } - L{ x_s(t) x_s(t + T_c(s)/2) } ]     (4.9)

7. When the amplitude envelope of x(t) increases as a function of time, L{ x(t)x(t+T_c) } can be larger than L{ x(t)x(t) }. The inequality of autocorrelation lags R_0 ≥ R_T holds only for frame-based autocorrelations.

The CPC responses in this work will be based on equation 4.9. Note that this CPC definition ensures values close to the local energy value for dominating aperiodic contributions. Dominating periodic contributions show a higher steady-state value than the local energy value. The normalized CPC can be used to detect time-place regions that are dominated by signal components (aperiodic as well as periodic) that entrain regions of the BM.
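A Matlab sketch of equation 4.9 for a single analysis time follows; X, Tc, c, fs, tau and n0 are assumed inputs (the BM output matrix, characteristic periods in samples, normalization factors, sample rate, time-constant and analysis sample, respectively).

    % Sketch of eq. 4.9 at analysis sample n0 (assumed inputs; n0 + Tc(s)
    % must not exceed the number of samples in X).
    decay = exp(-1/(fs*tau));
    cpc = zeros(nSeg, 1);
    for s = 1:nSeg
        rT = 0; rT2 = 0;
        for n = 1:n0                                         % leaky integration
            rT  = rT  * decay + X(s,n) * X(s, n + Tc(s));
            rT2 = rT2 * decay + X(s,n) * X(s, n + round(Tc(s)/2));
        end
        cpc(s) = c(s) * (rT - rT2) / 2;                      % eq. 4.9
    end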

The normalized CPC can be used to detect time-place regions that are dominated by signal components (aperiodic as well as periodic) that entrain regions of the BM. Regions for which this holds can be identified with a criterion like:

$r_s^c(t) > (1 - C_s)\,r_s(t)$   (4.10)

where C_s is a constant close to zero that may depend on the segment number. This criterion accepts all regions of the place-time plane where the CPC explains more than a fraction 1−C_s of the local energy r_s(t) (i.e., the cochleogram). Note that the CPC correlates information that is separated by T_c(s) seconds. Amplitude envelope changes of the time-domain signal will be more noticeable in the low-frequency range, where T_c(s) is large, than in the high-frequency range. To accommodate the effect of (random) amplitude fluctuations, C_s must be more permissive for low-frequency segments than for high-frequency segments. This leads to two thresholds: 1−C_s^high for the highest frequency segment and 1−C_s^low for the lowest frequency segment. The intermediate values are based on a linear interpolation.

Figure 4.7 gives examples of the CPC and the criterion in equation 4.10 for clean and noisy signals. The target signal consists of the Dutch words /NUL ÉÉN/ with a strong impulse between both words. The left column represents the clean signal, the right column represents the target signal in 0 dB SNR babble noise. The upper row shows the normalized CPC. Visually it is, apart from some low energy regions, almost indistinguishable from the cochleogram. The middle row applies criterion 4.10 with C_s^high = 0.10 and C_s^low = 0.15. This leads to the acceptance of almost all high-frequency regions. The lower row applies criterion 4.10 with C_s^high = 0.02 and C_s^low = 0.08, which leads to a more pronounced rejection of some cochleogram regions. In both cases most of the aperiodic response of the impulse is accepted in the clean condition. In the noisy condition, the impulse response is partially masked by the babble noise. A comparison between the left and right columns shows that, although spurious contributions are introduced, the most salient target features are still dominated by the target in the noisy representations.

Figure 4.8 provides an example. The upper panel compares the dominance regions for the clean and the noisy signal. The dashed red line denotes the noisy and the solid blue line the clean energy spectrum. The red stars and the blue diamonds show the segments that meet criterion 4.10 with the thresholds of the lower panels: C_s^high = 0.02 and C_s^low = 0.08. Even in the clean condition not all peaks meet this dominance criterion.
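A sketch of how criterion 4.10 can be applied with linearly interpolated thresholds may be helpful; segment 0 is assumed here to be the lowest-frequency segment, and all names are illustrative:

```python
import numpy as np

def dominance_mask(r_cpc, r_energy, c_high=0.02, c_low=0.08):
    """Sketch of criterion 4.10: accept time-place points where the CPC
    explains more than a fraction 1 - C_s of the local energy r_s(t).

    r_cpc    : (segments, frames) normalized CPC values
    r_energy : (segments, frames) cochleogram (local energy) values
    c_high   : threshold constant for the highest-frequency segment
    c_low    : threshold constant for the lowest-frequency segment
    """
    n_seg = r_cpc.shape[0]
    # Linear interpolation of C_s between the low- and high-frequency ends;
    # the criterion is more permissive for low-frequency segments.
    c_s = np.linspace(c_low, c_high, n_seg)
    return r_cpc > (1.0 - c_s)[:, None] * r_energy
```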

Figure 4.7. The characteristic period correlation (CPC) can be used to identify signal components that are locally dominant. The left column shows the Dutch words /NUL ÉÉN/ with a strong impulse added at t = 500 ms. The right column shows the same signal in 0 dB SNR babble noise. The upper row shows the CPC as defined in equation 4.8. The other rows show the accepted cochleogram regions with the criteria C_s^high = 0.10 and C_s^low = 0.15 for the second row, and C_s^high = 0.02 and C_s^low = 0.08 for the third row. Note that these criteria help to identify periodic as well as aperiodic signal contributions.

The peaks on the high-frequency flanks of strong contributions are partially entrained by these contributions. Note that almost all blue diamonds are paired with a red star. Only around segment 63 are four segments that were dominated by the target in the clean condition no longer dominated by the target in the noisy condition. The change is the consequence of a strong noise contribution around segment 58. The lower left panel of figure 4.8 shows the dominance regions for the clean signal. The right panel shows the time-place regions that meet the combined criterion that the regions must be dominated by the target in both the clean condition and the noisy condition. The very good general agreement between the left and right panels demonstrates that dominance as estimated with criterion 4.10 can provide a robust source of information. In this case 72% of the area is entrained in both the clean and the noisy condition. Unfortunately this measure does not weight the perceptual importance of each area. Section 6.2 develops a better measure to quantify the similarity between the noisy and the clean condition.

Figure 4.8. Local dominance is a robust measure. The upper panel shows the cochleogram cross-section corresponding to t = 185 ms in the lower panels. The dashed red line and the solid blue line show the noisy and clean cochleogram cross-sections of the /U/ in /NUL/. The blue diamonds and the red stars show the regions where the CPC meets the criterion in equation 4.10, with C_s^high = 0.02 and C_s^low = 0.08. Wherever the target (blue) dominates in the noisy condition (red), the BM is locally dominated as if no other sources are present. The lower panels show this as well, by comparing dominance by the target in the clean and the noisy condition. The right panel shows the regions that are dominated in both the clean (left) and the noisy condition. This results in a right panel that is similar to the left panel: local dominance is a very robust feature.

Compared to the tuned autocorrelation, the CPC represents a BM property and not a signal property. The application of a property of the target signal is necessary before the CPC can be used to separate signal components. To separate periodic contributions from aperiodic contributions, the absence of characteristic effects of periodicity is, by definition, sufficient. The effects of periodicity depend on the situation but are typically:
- criterion 4.10 satisfied with 1−C_s > 1 at the position of the ridge (in practice 1−C_s > 1.03 for steady-state vowels)
- local cochleogram responses of single harmonics that resemble the peaks of the sine-responses as depicted in figure 3.7
- a common periodicity or common amplitude modulation of a range of BM segments due to complexes of unresolved or partially resolved harmonics.

When these effects are absent, while criterion 4.10 ensures local dominance, a time-place region can be identified as aperiodic. Section 6.1 provides examples of the combination of knowledge sources to identify regions with various physical properties.

4.4 On- and Offset Transients

Section 4.2 discussed the onset effects in the TNC; this section studies on- and offset transients as they appear in the CPC. These transients are by definition aperiodic and therefore result in a continuum of frequency components. Onsets may differ in rise time. A rapid onset of a sinusoid leads to a transient with a broad frequency content that evolves within a few times the local group delay d_s to an ideal sine response. A more gradual buildup of the sinusoid leads to reduced transients, in combination with the gradual buildup of the strength of the ideal sine-response. When the rise time of the sinusoid is short compared to the local group delay, the transients represent a broad frequency range. When the rise time is large compared to the local group delay, the transients are small or absent, so that the BM response can be approximated by the ideal sine-response at each point in time. The transient is related to the aperiodic response class of the TNC, while the steady state is related to the periodic response class of the TNC.

This is illustrated in figure 4.9. The left hand panels compare the CPC-response to an untapered 100 Hz cosine y(t) with a tapered version y_t(t). The latter is tapered with a raised cosine with a rise-time of 26 ms, which equals the group delay of a segment with 100 Hz as characteristic frequency. The untapered signal shows a prominent BM-wide aperiodic response, while aperiodic responses are minimal in the tapered version. The upper panel shows that the aperiodic transient is gradually replaced by a typical periodic CPC response. The CPC-response to the signal y_u(t) (in red) that must be added to the tapered cosine to produce the untapered version shows a typical BM-wide aperiodic response, as depicted in the lower right panel. The upper right hand panel shows (the positive values of) the sum of the two CPC-responses in the lower panels. It shows a very good general agreement with the response of the untapered cosine in the upper left hand panel. The panels are not identical because of interference effects at the BM, which spoil the equivalence between E{y_t(t) + y_u(t)} and E{y_t(t)} + E{y_u(t)} (where E{} is an operator that produces either the CPC or a cochleogram).⁸

8. An extreme case arises when two input signals are antiphasic. Their individual CPCs (or cochleograms) are equal, but the CPC of the sum of the input signals is zero for all s and t.

Figure 4.9. Transients appear whenever the change of a periodic signal is fast compared to the group delay of the characteristic segment of the signal's frequency. The upper left hand panel shows the positive values of the CPC-response to an untapered 100 Hz cosine starting at t = 20 ms that is depicted in the upper part of the cochleogram. Initially, the whole BM responds aperiodically. The lower left hand panel shows the response to a tapered cosine with a build-up time equal to the local group delay. Transients are minimal and the CPC is only positive around the entraining frequency of 100 Hz. The lower right hand panel shows the CPC of the signal (depicted in red) that must be added to the tapered signal to produce the untapered cosine. The upper right hand panel gives the sum of the two lower CPC-responses. This leads to a close approximation of the CPC of the untapered cosine. From t = 90 ms (i.e., 70 ms after the onset) the periodic CPC-response starts to dominate more and more of the BM.

Offset transients appear as the onset transients of a new signal that destructively interferes with the old signal. This is shown for the CPC in figure 4.10. The transients during the onset and offset of the untapered 100 Hz (upper panel) and 1500 Hz cosines are similar for each frequency. Compared to the characteristic periods of most segments, the onset of the untapered 100 Hz cosine resembles a unit step.

Figure 4.10. Onsets and offsets are, by definition, aperiodic and show up as broad responses that gradually develop from the high-frequency side downwards to the steady state form. An offset is equivalent to the start of a new signal that interferes destructively. Consequently, the offset of the untapered 100 Hz cosine in the upper panel resembles the onset. For most BM segments, the untapered onset of a 100 Hz cosine resembles step responses, as shown on the right of the upper panel. The untapered onset of a 1500 Hz cosine in the lower panel resembles, for most segments, impulse responses, as shown on the right. A sine-phase onset has less energetic transients, but requires a similar amount of time to reach steady state.

The right-hand side of the upper panel shows the on- and offset step responses to a unit strength 150 ms block that starts at t = 300 ms. On the other hand, the on- and offset of the untapered 1500 Hz cosine in the lower panel is fast compared to the characteristic period of most segments. Consequently, the on- and offset transients resemble impulse responses, as depicted on the right of the lower panel. This suggests the use of pattern matching techniques for the detection of on- and offsets.⁹

9. Offsets can also be detected by searching for cochleogram areas where the energy decreases with e^(−t/τ). See section 6.1.

A way to detect onsets is by determining whether or not the scaled energy gradient stemming from either the cochleogram or the CPC exceeds a threshold C_Onset:

$\frac{\partial E(s,t)}{\partial t} > E(s,t)\,C_{\mathrm{Onset}}(s,t_0)$   (4.11)

An example of a threshold C_Onset(s,t_0) that can be used is n times the energy variance during [t−t_0, t] ms (for example n = 2 and t_0 = 20 for noisy speech). Alternative thresholds, which depend on the local group delay and/or on the gradient in the segment direction s, can be formulated. As will be shown in section 6.1, it is seldom necessary to search for onsets with a criterion like equation 4.11. The start of regions that meet dominance criteria like equation 4.10 and the start of TAC-selections usually provide enough information to determine onsets reliably. The detection of onsets as such is often less important than the characterization of the start-up behavior of the signal component. This requires a more careful analysis of the details of the onset and the subsequent development of the signal component. The broadness (in terms of the frequency range of the responding segments) and duration of the initial transient, in combination with the rise-time of the steady state signal, are reliable indicators of the rise-time of the signal. In the case of speech, the plosives (/B/, /P/, /K/ and especially the /T/) give rise to transients that involve a large number of segments. These show up as vertical structures in the cochleogram. Broad transients are missing or minimal in noise bursts like the /s/ and the /f/. Usually, the onsets of voiced speech are slow compared to the local group delay and transient effects are minimal. Artificial sounds, like the beeps of a telephone, can be identified on the basis of a (for speech) uncharacteristically rapid on- and offset.
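As an illustration of equation 4.11, the sketch below flags frames where the temporal energy gradient exceeds the local energy times a threshold, using the example threshold mentioned above (n times the energy variance over the preceding t_0 seconds); all names and default values are illustrative:

```python
import numpy as np

def detect_onsets(E, frame_dt=0.005, n=2.0, t0=0.020):
    """Sketch of the onset criterion of equation 4.11.

    E        : (segments, frames) cochleogram or CPC energy
    frame_dt : frame step in seconds
    n, t0    : the example threshold from the text: C_Onset(s, t0) is
               n times the energy variance over the last t0 seconds
    """
    hist = max(1, int(round(t0 / frame_dt)))    # frames in [t - t0, t]
    dE = np.gradient(E, frame_dt, axis=1)       # temporal energy gradient
    onsets = np.zeros(E.shape, dtype=bool)
    for t in range(hist, E.shape[1]):
        c_onset = n * np.var(E[:, t - hist:t], axis=1)  # per-segment threshold
        onsets[:, t] = dE[:, t] > E[:, t] * c_onset
    return onsets
```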

4.5 Estimation of Local Instantaneous Frequency Contours

Local instantaneous frequency (LIF) contours reflect the frequency development of the signal components that give rise to ridges (see section 3.5). Ridges arise when a basilar membrane region is dominated by a single quasi-periodic signal component. Since local dominance occurs only when the signal persists for some time, the local application of quasi-stationarity is justified. Since the TNC ensures continuity in time, place and periodicity, it allows the computation of the running autocorrelation along ridges s(t) according to:

$r_{s(t),T}(t) = r_{s(t),T}(t-\Delta t)\,e^{-\Delta t/\tau} + x_{s(t)}(t)\,x_{s(t)}(t+T), \quad T \in [0, T_{max}]$   (4.12)

Figure 4.11. The upper left hand panel shows the autocorrelation of the fourth harmonic of the target signal of figure 2.7. It shows 13 oscillations in 12.0 ms, which suggests an average frequency of approximately 1080 Hz. The true local instantaneous frequency (LIF) increases as a function of time, which results in a decrease of the time difference between the autocorrelation peaks. The corresponding frequencies (the inverse of the inter-peak distances) are shown in the right hand panel. A first order approximation through these values yields an LIF of 1064 ± 5 Hz. The lower left hand panel shows partial dominance of the autocorrelation. Situations like this arise sometimes and complicate LIF estimation.

Examples of these autocorrelations are given in the left hand panels of figure 4.11. The upper autocorrelation is well formed and is representative of a large majority of ridges. In some situations the autocorrelation shows a mixture of frequency contributions. The lower autocorrelation gives an example where the position of the third harmonic is partially dominated by the second. This leads to a confused autocorrelation and a more complicated, or even impossible, LIF estimation. LIF estimation requires a well-formed autocorrelation, which can be checked with a measure based on the CPC as defined in section 4.3. Ill-formed autocorrelations occur occasionally and lead to (incorrect) LIF estimations that do not correspond to the segment's characteristic frequency¹⁰ and can be discarded.

10. This can be used to define the terms well- and ill-formed.

The autocorrelation in the upper panel corresponds to the fourth harmonic of the target signal of figure 2.7 at time t = 285 ms. This situation was chosen because it does not correspond to a very prominent ridge and the local frequency is changing quite rapidly. The local instantaneous frequency can be approximated by computing the average peak distance, in this case 13 oscillations that fit in 12.0 ms, which is equivalent to approximately 1080 Hz. But this is an unnecessary application of quasi-stationarity that results in the average local frequency between t = 285 ms and t = 307.5 ms (10 ms for the frame width and 12.5 ms for the autocorrelation scope). The application of a first order approximation of the development of the inter-peak distance improves the local instantaneous frequency estimation. This can be implemented by taking the distance in samples between peaks and fitting a first order model through these values. The value of this model for the autocorrelation peak at T = 0 yields an estimate for the LIF. Although this is an efficient method that is used in section 5.1, it may suffer from temporal discretization effects because the sample period of 0.05 ms is not vanishingly small compared to the local instantaneous period of 0.94 ms.

Temporal discretization effects can be reduced by improving the estimation of the peak positions with a three-point quadratic fit. This leads to an approximately tenfold improvement of the peak position estimation (van Hengel, personal communication). The right hand panel shows the frequencies that correspond to the re-estimated inter-peak distances and the linear fit through these values. The resulting LIF value at the position of peak number 0 is 1064 ± 5 Hz. The error is less than 0.5%! This is within the human range as measured psychophysically for stationary stimuli (Owens 1993). For speech sounds, fluctuations in pitch prevent a higher accuracy. Note that the LIF changes 25 Hz, or 2.35%, per 10 ms. A rate of change of 2.35% per 10 ms corresponds to a factor 10, or 3.3 octaves, per second, which are natural values for spontaneous speech.
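The procedure just described (peak picking, three-point quadratic refinement of the peak positions, and a first-order fit extrapolated to peak number 0) might be sketched as follows; the function name and the exact peak-picking rule are illustrative, not the implementation used in this work:

```python
import numpy as np

def lif_from_autocorrelation(r, fs):
    """Sketch of the LIF estimate: r is a running autocorrelation along one
    ridge, sampled at fs. Returns the extrapolated LIF at peak number 0 and
    the slope of the first-order fit (a reliability indicator)."""
    # Integer positions of local maxima (the autocorrelation peaks).
    idx = np.flatnonzero((r[1:-1] > r[:-2]) & (r[1:-1] >= r[2:])) + 1
    refined = []
    for i in idx:
        # Three-point quadratic fit: vertex of the parabola through the
        # samples around the integer peak position.
        a, b, c = r[i - 1], r[i], r[i + 1]
        denom = a - 2 * b + c
        shift = 0.5 * (a - c) / denom if denom != 0 else 0.0
        refined.append((i + shift) / fs)         # peak position in seconds
    refined = np.asarray(refined)
    freqs = 1.0 / np.diff(refined)               # one frequency per interval
    # First-order model of frequency against peak number, evaluated at
    # peak number 0; the fit residual can serve as a reliability measure.
    n = np.arange(len(freqs))
    slope, intercept = np.polyfit(n, freqs, 1)
    return intercept, slope
```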

The local instantaneous frequency estimation is accurate in both time and frequency because the application of quasi-stationarity within time windows is avoided. The estimation of the local instantaneous frequency with any windowed signal is limited by the application of quasi-stationarity and the trade-off between temporal and frequency resolution:

$\Delta t = \frac{1}{\Delta f}$   (4.13)

Increasing the size of the window reduces temporal resolution Δt, but increases frequency resolution Δf, and vice versa. Additional assumptions about the signal, such as assuming that the signal consists of a single harmonic complex, can increase the accuracy of instantaneous frequency estimation, but with arbitrary signals the validity of these assumptions is not guaranteed. The TNC-based LIF-estimation is subject to the time-frequency trade-off as well, but in a different way. It is based on local dominance and limited to signal contributions that lead to ridges. This means that two periodic signal components with frequencies corresponding to neighboring segments cannot be resolved, since they lead to a single ridge. This inability can be alleviated somewhat by increasing the number of BM-segments and sharpening the response curves (see section 3.2). But sharpening the response curves leads to an increase in group delay in accordance with equation 4.13. In a transmission line based system, the Δt of equation 4.13 can be interpreted as group delay, while the Δf denotes a measure of sharpness for tuning curves. The accuracy of TNC-based LIF estimation is therefore limited to signal components that lead to ridges. It is further limited by the number of autocorrelation peaks and the accuracy of the peak position estimation. And finally, it is limited by the validity of the first-order approximation. In the case of slowly changing signal contributions, stable ridges are formed and the maximal lag of the autocorrelation can be chosen to represent a number of periods that allows a very accurate LIF-estimation without invalidating the first-order approximation.¹¹ Note that the error associated with the first-order fit provides a measure of the reliability of the estimate. This can, for example, be used during pitch estimation.

11. A higher order approximation might be more appropriate when a very large maximal lag (e.g., 30 ms) is chosen to model the development of low-frequency components.

4.6 Estimating the Tuned Autocorrelation

The next application of the TNC is the selection of basilar membrane regions with a common periodicity. While the local instantaneous frequency estimation is based on horizontal cross-sections, the Tuned Autocorrelation (TAC) is based on more or less vertical cross-sections of the TNC. Section 2.4 defined the TAC as:

$r_{s,T(t)}(t) = r_{s,T(t)}(t-\Delta t)\,e^{-\Delta t/\tau} + x_s(t)\,x_s(t+T(t)), \quad s = 1 \ldots s_{max}$   (4.14)

This is suboptimal because the locally measurable fundamental period is influenced by group delay: the low-frequency segments reflect an instantaneous fundamental frequency up to 30 ms later than the high-frequency segments (see figure 3.5). If the instantaneous fundamental period T(t) is defined as the instantaneous period of the signal,¹² the local fundamental period T_s(t) is not T(t) but:

$T_s(t) = T(t + d_s)$   (4.15)

where d_s represents the segment dependent group delay. This can be incorporated in equation 4.14 to yield:

$r_{s,T(t)}(t) = r_{s,T(t)}(t-\Delta t)\,e^{-\Delta t/\tau} + x_s(t)\,x_s(t+T(t+d_s))$   (4.16)

This is the best way to treat group delay. An example of this form of group delay correction is visualized in figure 4.12. This figure shows part of the TNC from which the information of figure 4.11 was derived. It corresponds to the /L/ in /NUL/ at t_1 = 285 ms (as depicted in figure 2.3) and it shows the effects of a pitch change of 2.35% per 10 ms. The straight line denotes the fundamental period T(t_1), the dashed line gives the local instantaneous fundamental period T_s(t_1).

12. The instantaneous fundamental period is computed as the (idealized) instantaneous frequency of the first harmonic. This is a convenient choice for the period estimation algorithm, but complicates the mathematical description (see section 5.1 for details). The instantaneous period of the source and the instantaneous frequency of the first harmonic are related via a group delay dependent time-shift.
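A minimal sketch of equations 4.14-4.16 may make the group delay correction concrete; the array layout, the time constant and the rounding to integer lags are assumptions of the sketch:

```python
import numpy as np

def tuned_autocorrelation(x, T, d, fs, tau=0.01):
    """Sketch of the TAC with group delay correction (equation 4.16):
    leaky-integrated product of each BM segment with itself, delayed by
    the group-delay-corrected fundamental period T(t + d_s).

    x  : (segments, samples) BM filter outputs
    T  : (samples,) instantaneous fundamental period contour in seconds
    d  : (segments,) group delay d_s per segment in seconds
    fs : sample rate in Hz; tau : leaky-integration time constant
    """
    n_seg, n_samp = x.shape
    decay = np.exp(-1.0 / (fs * tau))
    r = np.zeros_like(x)
    for s in range(n_seg):
        ds = int(round(d[s] * fs))
        acc = 0.0
        for n in range(n_samp):
            # local period at this segment: T(t + d_s), equation 4.15
            Tn = T[min(n + ds, n_samp - 1)]
            lag = int(round(Tn * fs))
            prod = x[s, n] * x[s, n + lag] if n + lag < n_samp else 0.0
            acc = acc * decay + prod              # equation 4.16
            r[s, n] = acc
    return r
```

Using T(t) instead of T(t + d_s) in the inner loop reduces the sketch to the uncorrected TAC of equation 4.14.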

Figure 4.12. Part of the TNC from which the autocorrelations of figure 4.11 were derived. The straight white line denotes the instantaneous period T(t), the dashed line denotes the local instantaneous period T_s(t) in accordance with equation 4.15. Note that group delay correction is not required for an all-positive TAC response. The upper left corner shows similarities to the impulse response TNC. This is a consequence of the rapid change of fundamental frequency that violates the periodicity assumption and leads to the appearance of the aperiodic response class of the TNC.

The dashed line follows the peaks of the autocorrelation responses better than the straight line. However, it also shows that good results can (often) be reached without group delay correction. The tuned autocorrelation spectrum is very sensitive to a correct choice of the period T. If the TAC is perfectly tuned, relation 4.16 leads to an all-positive response. But small deviations from the correct period will quickly result in negative contributions x(t)x(t+T) for high-frequency segments. When T is perturbed by a fraction 1+η, the corresponding TNC value associated with the nth harmonic ω_n is determined by:

$\cos(\omega_n (1+\eta) T) = \cos(2\pi n\eta)$   (4.17)

When the mismatch is 1%, η = 0.01, and for n = 25 and n = 50 this yields cos(2π·0.25) = cos(π/2) = 0 and cos(2π·0.5) = cos(π) = −1, respectively.

Figure 4.13. The effects of pitch mismatch for the same signal as depicted in figure 2.4. The high-frequency end of the TAC is very sensitive to estimation errors of the fundamental period. In this case an estimation error of approximately 1% leads to negative correlations above 3000 Hz. The negative correlations are compressed with the same type of compression as the positive values.

In other words: when the pitch is 1% out of tune, the first zero correlation occurs, theoretically, at the 25th harmonic, and the 50th harmonic corresponds to a maximally negative correlation. Note that these are ideal values; pitch fluctuations and discretization errors will influence the actual realization. The effects of pitch mismatch in the TNC of figure 2.6 are shown in figure 4.13. Even a mismatch of 1% leads to a very noticeable error in the high-frequency end of the TAC selection. Small and rapid deviations arise from natural pitch fluctuations in speech signals and must be dealt with. Larger mismatches show an ever increasing deviation from the ideal TAC. A 10% mismatch TAC shows positive responses for the first, second and eighth to twelfth harmonics. In accordance with equation 4.17, the first two lead to cos(0.1·2π) ≈ 0.81 and cos(0.2·2π) ≈ 0.31. These are positive, while the values of the third to seventh harmonics, −0.31, −0.81, −1, −0.81 and −0.31, respectively, are all negative. The next five values are positive again.¹³

13. The harmonic numbers of the corresponding cochleogram cross-sections are depicted in figure 3.10.
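The harmonic-dependent effect of a period mismatch in equation 4.17 is easy to verify numerically; the snippet below reproduces the values discussed above:

```python
import numpy as np

# Equation 4.17: the steady-state correlation of harmonic n, when the
# assumed period is off by a fraction eta, is cos(2*pi*n*eta).
for eta in (0.01, 0.10):
    n = np.arange(1, 13)
    print(f"eta = {eta:4.2f}:", np.round(np.cos(2 * np.pi * n * eta), 2))
# For eta = 0.01 the correlation first reaches zero at n = 25 and is
# maximally negative at n = 50; for eta = 0.10 harmonics 3 to 7 come out
# negative and harmonics 8 to 12 positive, as discussed in the text.
```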

The details of figure 4.13 depend on the details of the recruitment of basilar membrane regions by individual signal components. Strong harmonics entrain the position of others and enforce their cosine value. Because small pitch estimation errors can lead to large effects, the final TAC-selections must be based on a local optimization process. As will be described in chapter 5, the fundamental period contour, as estimated by the pitch estimation algorithms, describes the average development of the period contour well, but does not represent rapid pitch fluctuations in individual frames. The estimation process results in a period value and a local temporal derivative, which reflects the local temporal development of the period contour, but the actual instantaneous period may fluctuate around the estimated values. The local period estimation in combination with its derivative leads to local instantaneous periodicity values T_s. The dashed line in figure 4.12 shows these as a local instantaneous periodicity curve.

To estimate the optimal value for the instantaneous period, this curve is shifted upwards and downwards in periodicity (i.e., right and left in figure 4.12) and the corresponding TNC-values for each choice of the local instantaneous period are computed. The choice of the instantaneous period that maximizes the sum of the positive values of the nonlinearly compressed TNC is chosen as the final instantaneous period on which the selection is based. It is important to use the nonlinearly compressed energy (with equation 2.3) because in the linear domain the strongest signal component is likely to dominate all other signal components. The best instantaneous period is the value that maximizes the area between the positive values and the x-axis of figure 4.13 (the negative values have no physical meaning). The 1% mismatch condition corresponds to a one sample shift and the 2% mismatch to a two sample shift. This simple optimization procedure is an efficient way to reduce the effects of natural pitch fluctuations and small period estimation errors. It is not infallible: the energy of the fifth harmonic in the lower panel of figure 3.10 was underestimated because the local optimization favored the inclusion of contributions at segments 81 and 88 over the correct estimation of the energy of the fifth harmonic.
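A sketch of this local optimization follows; the compression function is a placeholder for equation 2.3 (its exact form is not reproduced here; the exponent is borrowed from footnote 14), and the set of candidate shifts is an assumption:

```python
import numpy as np

def optimize_local_period(tnc_frame, T_curve_lags, shifts=range(-3, 4)):
    """Sketch of the local periodicity optimization: shift the local
    instantaneous periodicity curve a few lag samples left and right and
    keep the shift that maximizes the summed positive values of the
    nonlinearly compressed TNC.

    tnc_frame    : (segments, lags) TNC values of one frame
    T_curve_lags : (segments,) lag index of the periodicity curve per segment
    shifts       : candidate shifts in lag samples
    """
    def compress(v):  # placeholder for the compression of equation 2.3
        return np.sign(v) * np.abs(v) ** 0.15
    n_seg, n_lag = tnc_frame.shape
    best_shift, best_score = 0, -np.inf
    for dT in shifts:
        idx = np.clip(T_curve_lags + dT, 0, n_lag - 1)
        vals = compress(tnc_frame[np.arange(n_seg), idx])
        score = vals[vals > 0].sum()   # area under the positive values
        if score > best_score:
            best_score, best_shift = score, dT
    return best_shift
```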

Figure 4.14. The importance of TAC optimization. The upper left hand panel shows the basic TAC cochleogram. The upper right and lower left hand panels show the effects of group delay correction and local periodicity optimization. The lower right hand panel shows the effect of both optimizations. The numbers denote the fraction of the nonlinearly compressed energy that is accounted for by the positive values of the selections. The actual realization of each of the TAC-cochleograms depends strongly on the initial period contour estimate. TAC-optimization often leads to a considerable improvement, while group delay correction is beneficial during rapid changes of pitch.

A representative example of the benefits of group delay correction and local periodicity optimization is depicted in figure 4.14. The upper left hand panel shows the basic TAC-cochleogram without additional optimization. It accounts for 87.5% of the nonlinearly scaled energy. Both group delay correction (upper right hand panel) and local periodicity optimization (lower left hand panel) lead to an improved estimation. Because the local periodicity optimization corrects small period estimation errors, it is more effective than group delay correction (96.5% versus 95.8%). The latter is only beneficial when the pitch changes rapidly. The combination of both leads to the final TAC-cochleogram in the lower right hand panel that accounts for 97.2% of the nonlinearly scaled energy. The combination is applied to all TAC selections in this work. The original fundamental period estimation and the final re-estimation are compared in section 5.3.

4.7 The Tuned Autocorrelation in Noise

The previous section discussed the tuned autocorrelation without considering the influences of noise (here defined as one or more concurrent sounds other than the target sound). This section discusses the effects of noise and, in particular, the way it contaminates the TAC-selection. An optimal signal representation for standard HMM-based ASR systems represents the spectral envelope of the target source perfectly, while discarding the background completely. TAC-selections, when based on a correct pitch-contour, approach this ideal to a certain degree. With a correct period contour, all BM regions dominated by the target elicit a maximally strong TAC-response. The other regions lead to random autocorrelation values based on the leaky integrated product of the BM energy and a cosine with a random phase development. Since only the positive parts of the time normalized correlogram are used for further processing, half of these values are of no concern, but the positive half will contaminate the ideal TAC-cochleogram.

A rough estimate of the contamination in a generic 0 dB SNR broadband noise situation is useful. For a generic broadband signal one can assume that roughly half of the BM is dominated by the target and half by the noise. Half of the regions contaminated by the noise show a negative autocorrelation that is not represented in the (all positive) TAC. The remaining quarter reflects the local energy multiplied by a positive cosine value of random phase. These values are compressed nonlinearly according to equation 2.3. The average¹⁴ of these random dynamic range compressed cosine values is 0.91. Since positive cosine contributions only occur in 25% of the segments, about 25% × 0.91 ≈ 23% of the positive contributions of the TAC result from the background. Fortunately, these contributions occur mainly in the valleys between the ridges due to harmonics or formants and might not spoil the formant structure. In the unprocessed situation (i.e., the noisy cochleogram), noise and target cannot be separated and all the noise energy contributes. This makes it much more difficult, if not impossible, to estimate the spectral envelope with any measure of reliability.

14. Based on a numerical evaluation of: $\frac{1}{\pi}\int_{-\pi/2}^{\pi/2} [\cos(\varphi)]^{0.15}\, d\varphi \approx 0.91$.
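The numerical evaluation in footnote 14 can be checked directly:

```python
import numpy as np

# Expected value of the compressed positive cosine of random phase:
# (1/pi) * integral over [-pi/2, pi/2] of cos(phi)**0.15 (footnote 14).
phi = np.linspace(-np.pi / 2, np.pi / 2, 100001)
avg = np.trapz(np.cos(phi) ** 0.15, phi) / np.pi
print(round(avg, 2))   # approximately 0.91, so 25% x 0.91 is about 23%
```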

Figure 4.15. TAC estimation in 0 dB babble noise (see figure 2.5). The upper left hand panel shows the cochleogram of the noise, the lower left hand panel shows the corresponding TAC-selection with the period contour of the signal in the upper right hand panel. The lower right hand panel shows the TAC-selection of the combination of target and noise. The TAC-cochleogram mainly reflects information of the target, but includes part of the background as well. The color-coding is scaled identically in each panel. Note the reduction of the average energy in the TAC contribution of the background.

Aperiodic noise

The limited contamination of the TAC forms the basis for its usefulness for ASR. Examples of TAC-estimation in noise are depicted in figure 4.15. The noisy cochleogram of the target sound in 0 dB babble noise has been depicted in the lower left hand panel of figure 2.5 on page 53. The upper left hand panel of figure 4.15 shows the cochleogram of the babble-noise without the target. The upper right hand panel shows the ideal TAC-cochleogram of the target. The lower left hand panel shows the TAC-selection that results when the fundamental period contour of the target is used on the noise. Note the more or less regular harmonic structure in the low-frequency region of the selection. Although the background may consist of a combination of mainly periodic contributions, the combination itself is not periodic. This entails that its TNC resembles the impulse response TNC as in figure 4.4.

The corresponding TAC-cochleogram is a sequence of (perturbed) vertical cross-sections from the impulse response TNC as depicted in figure 4.5, which results in the quasi-harmonic TAC that (as in section 6.1) can be discarded with criteria like equation 4.10. The selection of the target in noise is a combination of the ideal TAC-cochleogram in the upper right corner and the contribution of the noise in the lower left corner of figure 4.15. The representation is clearly biased towards the target, but some information of the noise is inevitably present. Since signal-contributions that do not correlate well with the fundamental period contour tend to show up with considerably lower values in the selections than contributions that do correlate with the period contour, it is possible to apply an additional threshold that helps to separate both types of contributions. This was illustrated in the upper panel of figure 2.15, which showed selected values that represented 25% or more of the local signal energy. Its lower panel showed a mask of coherent cochleogram areas. There is no guarantee that the mask contains information of the target source exclusively, but the threshold is more often than not beneficial in pinpointing reliable sources of information. Mask forming will be studied in more detail in section 6.1.

Periodic noise

The previous example used a type of background noise that consisted of so many and so variable harmonic contributions that it could not be described as periodic. The usefulness of the TAC with a periodic background is investigated with a superposition of the word /nul/ with a time-reversed, phase-inverted copy of itself¹⁵ as input for the BM model. This represents a worst case scenario where signal contributions interact maximally. Consequently this signal is very suitable to study the limits of the selection process. The cochleogram of the combination is depicted in the upper left of figure 4.16. The necessary period contours were estimated from the clean signals. The TAC-selection of the time-reversed, phase-inverted signal is depicted in the upper right hand panel. The lower panels show the effects of phase-inverting the distractor.

15. The inverted signal sounds like /ONE/. The combination sounds ambiguous, with a most likely interpretation as the nonwords /NUN/ and /WUL/. The percept of the word /NUN/ is particularly clear. The visible evidence in the upper left hand panel of figure 4.16 seems to support this interpretation.

Figure 4.16. Separation of periodic sounds. A superposition of the word /nul/ and a time-reversed, phase-inverted copy of it was presented to the selection system. The upper left hand panel shows the combined cochleogram with interference effects around the line (corrected for group delay). Note the destructive interference of the second harmonic. The upper right hand panel shows the TAC-selection of the time-reversed signal. The lower panels show two TAC-selections. The left is based on the signal as depicted in the upper left, the right is based on a similar mixture, but without the phase inversion. Note the differences in interference effects around the black line. Spurious contributions show up in the original, as well as the time-reversed, selection.

The lower left hand panel shows the TAC selection with phase-reversal (as in both upper panels); the lower right hand panel shows the selection based on the superposition without phase-inverting the distractor. At some points strong interaction effects exist. This is particularly noticeable in the middle of the combined utterance. The upper left hand panel shows a black line indicating the group delay corrected position of the mirror in the cochleogram. At these positions, the original and the time-reversed signal have similar amplitude and frequency content. The lower panels of figure 4.16 show several examples of interference effects at these positions. The most visible effect is the destructive interference of the second harmonic in the left hand panel and the constructive interference in the right hand panel. This leads to its disappearance in the left and enhancement in the right hand panel.

Interference effects occur only at positions that reflect similar frequency and similar intensity; these occur in this worst case scenario. For most signal combinations interference effects are of minor importance because signal components of uncorrelated sources rarely show the similarity in frequency and intensity required for strong interference effects: usually one of the sources dominates a BM region. This region may belong to the target, in which case it will be selected when a correct period contour is available. When a BM region is dominated by the distractor, there is still a 50% chance that the region correlates positively, but the average correlation will be lower than for the target because the average correlation between x(t) and x(t+T) is rarely maximal.¹⁶ These effects are visible in figure 4.16. Compared to the aperiodic background, the TAC-selection shows larger areas of well-defined negative contributions, but there exists a similar number of spurious positive contributions that contaminate the ideal TAC-selection. These contaminations are symmetric, but the selected energy might not be equal due to the memory of leaky integration. In general the pitch contour that explains the most energy is the correct one. As stated in chapter 1, it is the responsibility of the recognition stage to make the final selection; section 7.2 proposes such a recognition system.

16. This provides a way to separate signal contributions stemming from two sources with known period contours, which may be important for the separation of concurrent vowels. Another method is to require that selected regions contain a ridge over the full duration of the region.


CHAPTER 5 Fundamental Period Contour Estimation

The estimation of pitch-contours in noise is difficult. The main reason for this is the general inability to detect and estimate features reliably (conclusion 1.16). Even for clean situations it is not trivial to implement a reliable pitch estimation algorithm; the situation is exacerbated in noise, where the signal-in-noise-paradox (conclusion 1.12) makes it difficult to decide on the focus of the search effort. In fact, the difficulties encountered during the formulation of a reasonably functioning pitch-extraction algorithm for noisy situations formed the basis for the framework developed in chapter 1. This chapter focuses on two methods to estimate pitch. The first method has been introduced in section 2.9 and aims to estimate the best period contour at any time in unknown noisy situations. The second method is a ridge-based variant of well-known correlogram-based approaches (Meddis 1997) that are based on correlogram summation. Both implementations are proofs of concept. They are not optimized and do not use the full potential of the techniques presented so far. In particular the CPC (as discussed in section 4.3) is not utilized. Yet both form a useful starting point for future developments.

5.1 Fundamental Period Contour Estimation in Noise

Section 2.9 outlined an algorithm for pitch estimation in noise. This section provides additional information and focuses on the design decisions that formed the basis for its formulation. Most pitch estimation techniques are developed for clean conditions, or conditions with considerable constraints on the noise, and are typically based on cues (O'Shaughnessy 2000) like:
- periodicity of the short term FFT-based energy spectrum, a cepstrum (Gold 2000), a harmonic sieve (Duifhuis 1982) or subharmonic summation (Parsons 1976, Hermes 1988)
- the first large peak in the autocorrelation
- correlogram summation (Meddis 1997), similar to the technique in section 5.2
- zero crossings

These techniques assume a single source, assume known noise, or apply quasi-stationarity before the signal is split into individual components for which quasi-stationarity can be guaranteed (see conclusion 1.21). Consequently, these techniques will produce incorrect results whenever the quasi-stationarity assumption is violated. The set of situations in which the pitch estimation technique is guaranteed to function (or fail) is extremely difficult to quantify, since it can depend on the pitch value that still has to be estimated. Consequently, these pitch estimation algorithms are not as robust as is required for a system that can deal with arbitrary signals (definition 1.3).

The basic design choices that led to the formulation of the robust period contour estimation algorithm are based on an informal study of a large number of very noisy (SNR < +3 dB) speech signals. The speaker characteristics, as well as the characteristics of the noise, were varied, and the resulting cochleograms, ridges and instantaneous frequency contours were inspected to determine which features were most robust and could serve as the basis of a pitch estimation algorithm. Good indicators for reliable sources of information were:¹

1. the strongest ridges at each moment
2. long ridges
3. smooth ridges
4. ridges with frequencies that correspond to the local characteristic frequency

1. Note that these features result from (strong) local dominance.

Ridges in which several of these features are combined are particularly reliable. Among the less reliable indicators are direct extensions of typical techniques that work well in noiseless environments. For example, a standard harmonic sieve applied to the local instantaneous frequency estimates (LIF, see section 4.5) along ridges results in successions of very good local estimates alternated with very poor or wrong estimates. Since the quality of the estimates could not be determined automatically, this method did not lead to reliable results.

To facilitate implementation, a multi-stage algorithm that operates on the complete utterance was chosen. Figure 5.1 provides an overview of the fundamental period contour estimation algorithm as introduced in section 2.9 on page 65. The input for the algorithm is the information as represented in figure 2.10 and encompasses the cochleogram, the ridges and the local instantaneous period. These values are sampled every 5 ms. The first stage of the algorithm is the selection and smoothing of the strongest ridges.² This stage starts with the detection of instantaneous periods whose corresponding characteristic segment (see section 3.3) is more than one segment from the ridge's characteristic segment. These period values are replaced by the segment's characteristic period. Each ridge is followed and, as long as successive periods are within 5% of each other, the local periods are assigned to the same period contour. When successive periods are not within 5%, an additional check determines whether the next value is within 5%. If a valid next value can be found, the gap is filled with the average of its neighbors; otherwise a new contour is started (see the sketch below). All contours are augmented with a smoothed version of the contour. Smoothing is performed with a 5-point (25 ms) moving average. In the middle of the contour the smoothed local period is based on a local neighborhood of 2 frames on each side. In the first or last two points of a contour the smoothed period values are based on the corresponding values of a first order approximation. Finally, the average ordinality of each contour is computed. The ordinality is a measure of the relative importance in terms of energy: a segment of the strongest ridge at time t has ordinality 1, the second most energetic segment has ordinality 2, etc.

2. This is a task that can be improved upon with the CPC as a measure of reliability.
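The contour-forming step just described (successive periods within 5%, single-frame gaps bridged by the average of their neighbors) might be sketched as follows; the names and the list-based representation are illustrative:

```python
def group_periods_into_contours(period_frames, max_rel_step=0.05):
    """Sketch of the first stage: walk along one ridge and assign successive
    instantaneous periods to the same contour as long as they stay within
    5% of each other; bridge a single deviating frame (a 5 ms gap) with the
    average of its neighbors, otherwise start a new contour.

    period_frames : list of per-frame period values along one ridge
    """
    contours, current = [], [period_frames[0]]
    for i in range(1, len(period_frames)):
        prev, cur = current[-1], period_frames[i]
        if abs(cur - prev) <= max_rel_step * prev:
            current.append(cur)
        elif (i + 1 < len(period_frames)
              and abs(period_frames[i + 1] - prev) <= max_rel_step * prev):
            # single-frame gap: fill with the average of its neighbors
            current.append(0.5 * (prev + period_frames[i + 1]))
        else:
            contours.append(current)
            current = [cur]
    contours.append(current)
    return contours
```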

Figure 5.1. Overview of the fundamental period contour estimation process. The input of the system is visualized in figure 2.10. The other steps are visualized in the figures of section 2.9. The figure summarizes the stages and their main parameters:
1. Estimation of ridges and instantaneous period contours. Ridges are formed by combining successive peaks that differ less than 2 in terms of segment number; minimal length of ridges: 40 ms; gaps of 5 ms filled; the instantaneous period is computed under all ridges.
2. Selection of the most energetic smooth instantaneous period contours. Acceptance: all successive periods within 5% (gaps of 5 ms allowed); minimal contour length 75 ms, or 50 ms with an average ordinality ≤ 2. Contours are smoothed with a 5-point linear approximation in each point.
3. Cloning of contours to all possible fundamental periods. Assumption: pitch between 75 and 400 Hz. A frequency change requires correction for group delay due to the change in characteristic segments s_1 to s_2: t → t + d(s_2) − d(s_1).
4. Combination of cloned contours to smooth period contour hypotheses. Acceptance of a combination of 2 or more contours: for overlapping contours a 2nd order model must fit all contours within 3%; for contours that extend each other, the first and last 25 ms on each side of the transition ought to be fit by a 2nd order model to within 3%.
5. Selection of the period contour hypothesis that explains most period estimates, with group delay correction for each choice of period contour hypothesis. A period p is accepted if cos(2π p_0/p) > 0.95. The best choice maximizes the number of included periods and shows a flat distribution of odd and even harmonics.
6. Final re-estimation and smoothing of the fundamental period contours with a first order model over 25 ms or 25 points, yielding the smoothed fundamental period contours and the instantaneous period plus its temporal derivative.

A period contour is accepted whenever its length exceeds 50 ms and its average ordinality is smaller than or equal to 2, or alternatively whenever its length exceeds 75 ms. So far, these practical values have not been optimized formally. The second step assumes that all possible fundamental periods lie between 13.3 ms (75 Hz) and 2.5 ms (400 Hz), a range that spans most speakers (Furui 1989). It is assumed that each contour represents a single harmonic number from start to end. The range of possible harmonic numbers can be estimated using the limits of the pitch range and the place-frequency relation of section 3.3. As visualized in the upper panel of figure 2.12, the smoothed period contours are multiplied by each possible harmonic number and copied to all possible fundamental periods.

This involves a change in the corresponding characteristic segments of the contours, and since each segment has its own group delay this implies a temporal shift according to:

$t \rightarrow t + d(s_{np}) - d(s_p), \quad n \in \{1, 2, \ldots\}$   (5.1)

d(s_p) and d(s_np) are the group delays associated with the segments that are most sensitive to period p and period np, respectively. Note that this time-shift implicitly defines the instantaneous fundamental period as the instantaneous period of the first harmonic as expressed on the BM.

The third step combines the copied (cloned) contours into smooth fundamental period contour hypotheses. This is a complicated process, since contours can sometimes be combined in different ways. When the local periods of two cloned contours match, on average, within 3%, the clones are combined into a single hypothesis. Clones that extend each other are combined when a second order fit can be estimated that matches both contours within 3% during 25 ms. Note that this criterion is stricter than the 5% that was used for contour estimation in the first step. When this criterion is loosened the algorithm may produce unreliable results due to the combination of contours of different sources. The time-shift of equation 5.1 is very important because it allows a reliable comparison between multiple contours. When this form of group delay compensation is absent, contours of the same source will not be combined during rapid changes of pitch. Next, concurrent fundamental period contour hypotheses are compared. Concurrent hypotheses that assimilate more contours and lead to the longest hypotheses are favored over short hypotheses that consist of fewer contours. Based on this criterion some of the least supported hypotheses are discarded. This decision is applied with care, so that only the least supported hypotheses are discarded. Hypotheses based on less than three concurrent ridges are always discarded. The remaining set of period contour hypotheses is likely to contain the correct period contours.

The fourth and last step is the forced choice between concurrent contour hypotheses. For ASR systems this is a very important stage, since this choice decides which part of the signal will be interpreted according to the expectations and limitations of the recognition system. Any error at this stage leads to recognition errors. This warrants a very careful decision process that is based on all available information: i.e., all ridges and their corresponding instantaneous periods. The decision process chooses at most one period contour for each moment.

The selected hypothesis maximizes the number of instantaneous period values that it can claim as a possible harmonic, in combination with a comparable number of odd and even harmonics. The number of claimed harmonics per fundamental period contour hypothesis p(t) is determined by counting the number of instantaneous period values that satisfy:

$\cos\left(2\pi\,\frac{p(t+d_s)}{p_{s,t}}\right) > 0.95$   (5.2)

p_s,t is the instantaneous period value derived from a ridge at time t in segment s, and p(t+d_s) is the fundamental period hypothesis that is group delay corrected with a value d_s to denote the expected instantaneous fundamental period of segment s. This is again a situation in which group delay correction is necessary, because instantaneous frequency information of different regions of the basilar membrane is compared. The cosine based criterion of equation 5.2 is equivalent to accepting a deviation of 5.1% around the expected value. A variant of equation 5.2 can be used to count the number of odd and even harmonics that are within 5.1% of the expected value:

$N_{p(t)} = N^o_{p(t)} + N^e_{p(t)} = \sum_i \left[\cos\left(\pi\,\frac{p(t+d_i)}{p_i}\right) < -0.95\right] + \sum_i \left[\cos\left(\pi\,\frac{p(t+d_i)}{p_i}\right) > 0.95\right]$   (5.3)

The square brackets denote a Boolean value: 1 if the statement is true, 0 if the statement is false. The index i refers to the period values p_s,t in the selected set of ridges, while p(t+d_i) is the required group delay corrected value for the local instantaneous fundamental period reflected at time t in segment s. N_p(t) is the total number of accepted harmonics, N^o_p(t) and N^e_p(t) are the numbers of odd and even harmonics. The odd harmonics fall around the minimal values, while the even harmonics coincide with the maximal values of the cosine function. The best hypothesis of two or more concurrent hypotheses is the one that maximizes:

Average # harmonics per frame × Fraction odd harmonics   (5.4)

Both criteria are important. The average number of claimed harmonics is a measure of the quality of the hypothesis: efficient hypotheses that claim a large number of harmonics per frame are usually to be preferred over hypotheses that claim a lower number of harmonics per frame. If the fraction of odd harmonics is low, it is likely that the fundamental period contour is an octave too low.
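A sketch of the counting and scoring of equations 5.2-5.4 for one frame follows; it pools all ridge measurements of the frame and leaves the per-frame averaging of equation 5.4 to the caller, and all names are illustrative:

```python
import numpy as np

def score_frame(p0, ridge_periods):
    """Sketch of equations 5.2-5.4 for a single frame.

    p0            : (N,) group-delay corrected hypothesis values p(t + d_i)
    ridge_periods : (N,) instantaneous periods p_i measured along ridges
    """
    ratio = p0 / ridge_periods                     # approximate harmonic number
    claimed = np.cos(2 * np.pi * ratio) > 0.95     # equation 5.2
    n_odd = int(np.sum(np.cos(np.pi * ratio) < -0.95))   # equation 5.3, odd
    n_even = int(np.sum(np.cos(np.pi * ratio) > 0.95))   # equation 5.3, even
    frac_odd = n_odd / (n_odd + n_even) if (n_odd + n_even) else 0.0
    # Equation 5.4 multiplies the average number of claimed harmonics per
    # frame by the fraction of odd harmonics; for one frame this reduces to:
    return claimed.sum() * frac_odd
```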

This happens quite often, because spurious contributions tend to increase the average number of claimed harmonics. The combined criterion reduces to the average number of odd harmonics per frame. This simple criterion has a high probability of selecting the correct hypothesis. Similar criteria can be found in the literature (e.g., Duifhuis 1982). The selected information allows a re-estimation and smoothing of the fundamental period contour based on all claimed harmonics. The local moving average for each frame is based on all data points within 12.5 ms of the center of the frame. Because equation 5.1 computes the fundamental period contour as the instantaneous period contour of the first harmonic as expressed on the BM, it must be time-shifted to reflect the instantaneous period of the source. The final output of the algorithm is a sequence of parameters defining the local instantaneous fundamental period and its temporal derivative. These are used to approximate the local fundamental period value T_s(t) (equation 4.15) according to:

$T_s(t) = T(t + d_s) \approx T(t) + d_s\,\frac{d}{dt}T(t)$   (5.5)

The actual period may fluctuate around the estimated values. The local optimization technique described in section 4.6 decides on the final and optimal value.

The current implementation is, as mentioned before, a first proof-of-concept; it is not fully optimized for quality of estimation, stability or speed. Informal experiments suggest that the algorithm estimates 95% or more of the period contours correctly in most noisy situations with an SNR higher than 0 dB. In these cases it allows a very good TAC-estimation. Between 0 and −3 dB the probability of a correct estimation reduces to 70%; below −3 dB the algorithm breaks down rapidly. The algorithm fails when its basic assumptions are invalid, i.e., when the pitch of the target is out of range or when it is presented with a concurrent mixture of period contours. The algorithm seems to work slightly better for voices within the female pitch range than in the typical male pitch range. This difference results from the higher harmonic density of male speech and can probably be reduced by minor adjustments. Although the current implementation requires the whole signal, it is possible to reimplement the system in a way that estimates period contours with a delay of 100 ms or less. The lower limit of this delay is determined by a combination of group delay effects, the temporal scope required for the computation of local frequencies and, most importantly, the number of period hypotheses the system is allowed to produce.

has less information available to reduce the number of likely fundamental period candidates than when it is allowed to combine evidence over 100 ms. Optimally, the delay ought to depend on the signal itself: very redundant signals require a small delay, while less redundant signals require more and longer processing.

5.2 Fundamental Period Contour Estimation for Clean Speech

A fundamental period estimation algorithm that can be applied to clean signals has been developed for the speech recognition test. This algorithm ought to be stable, efficient and reliable, because every major error in the period estimation algorithm leads to recognition errors. The general algorithm of the previous section leads to very good estimates in clean circumstances, but it is a first version of a complex algorithm; its stability cannot be guaranteed for use on large databases and it is not optimized for speed. Moreover, a simpler algorithm suffices for clean situations. The algorithm presented in this section is intended to become a reliable and relatively fast alternative to the more general period estimation technique for clean signals.

The demands for a fundamental period estimation algorithm to test the robustness of a speech recognition system are slightly different from those for a system that aims to select and track as much of the source as possible. The latter was optimized in the general fundamental period estimator. For an ASR test it is necessary to produce a signal representation that resembles the stored templates as well as possible. This entails that spurious contributions should contaminate the selection as little as possible. During onset, but more often during offset, the signal energy might be relatively low and the probability of spurious contributions relatively high, while little linguistic information is conveyed. For example, the information after t=360 ms in the word /NUL/ in figure 2.3 on page 49 is of little consequence, while a rising pitch can be estimated for at least another 100 ms. During these last 100 ms, the signal-to-noise ratio decreases rapidly, which results in a more contaminated TAC-selection. To reduce this contamination it is beneficial to be conservative when determining whether or not the start or end of a signal is voiced. This is implemented by restricting both the absolute energy and the decay behavior of the ridges in the low-frequency half of the basilar membrane model. This part of the basilar membrane is hardly affected by unvoiced signal components.

When the energy loss corresponds to 50% or more in 10 ms, or when the energy does not exceed 1% of the average maximal energy of the utterance, the frames are considered unvoiced. This combined criterion works only for well calibrated databases and rather short utterances. It must be replaced by more sophisticated criteria when the data behaves more like spontaneous speech (Furui 1989). Speech recognition results suggest that less than 0.5% of the voiced sound events were incorrectly assigned as unvoiced.

The decay criterion is a bit more restrictive than the decay of the leaky integration process in the absence of input. The decay in 10 ms associated with a leaky-integration time constant of τ = 10 ms is e^{-10/τ} = e^{-1} ≈ 0.37, while the applied threshold is 0.5. For speech signals this threshold is very efficient. Coherent voiced speech is rarely incorrectly broken up into subparts due to the decay criterion.³ The combination of the thresholds for absolute energy and decay leads to fundamental period contours that tend to have earlier offsets and later onsets.

The fundamental period algorithm is based on a summation of the autocorrelation along ridges.⁴ As mentioned in the context of figure 2.8, the autocorrelations along ridges that stem from the same source agree on the fundamental period as the first common periodicity that all ridges share. Figure 5.2 shows a typical example of a set of autocorrelations and the corresponding summation. Note that all autocorrelations are simply added and no group delay correction has been performed. Consequently, the result is an approximation. The optimization in the selection algorithm (see section 4.6) determines the final instantaneous fundamental period.

In each frame the three highest peaks in the summed autocorrelation with values higher than 0.3 times the local energy along the ridge are selected and sorted according to the autocorrelation value (the highest peak first). When no peak satisfies the criterion, the frame is considered unvoiced. It is assumed that one of these autocorrelation lags corresponds to the desired fundamental period value for this frame. The candidate autocorrelation lags are depicted in the upper panel of figure 5.3. A blue dot denotes the highest local autocorrelation value, a green dot the second value and a red dot the lowest selected value.

3. In the odd case that a period contour is broken up into subparts, the leaky integration will smooth small gaps in the selection.
4. This algorithm is similar to correlogram-based algorithms that claim to model aspects of human pitch perception (e.g., Meddis 1997). The main difference is the use of the running autocorrelations under ridges, instead of computing and summing the complete correlogram.
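In outline, the summation and peak-picking step can be written down directly. The sketch below is schematic: the array shapes, the names and the use of the summed ridge energy as reference for the 0.3 threshold are assumptions for illustration.

    import numpy as np

    def pitch_candidates(ac_ridges, ridge_energy, fs=20000, n_cand=3):
        """Candidate fundamental periods from the summed running autocorrelation.

        ac_ridges    : (R, L) running autocorrelation of each of R ridges at L lags
        ridge_energy : (R,) local energy along each ridge
        Returns up to n_cand candidate periods in ms, strongest peak first.
        """
        summed = ac_ridges.sum(axis=0)               # ridges agree on the first common periodicity
        threshold = 0.3 * ridge_energy.sum()         # peaks must represent enough of the local energy
        min_lag = max(int(0.001 * fs), 1)            # ignore periodicities shorter than 1 ms
        peaks = [k for k in range(min_lag, len(summed) - 1)
                 if summed[k] > summed[k - 1] and summed[k] >= summed[k + 1]
                 and summed[k] > threshold]
        peaks.sort(key=lambda k: summed[k], reverse=True)
        return [1000.0 * k / fs for k in peaks[:n_cand]]  # empty list: frame is unvoiced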

Figure 5.2. Fundamental period estimation by summing the running autocorrelation along all ridges in the cochleogram. The upper panel shows the individual autocorrelations, the lower panel the resulting sum. The first, second, third and eleventh harmonic form the most important contributions. The highest peaks corresponding to a periodicity longer than 1 ms are the candidates for the average fundamental period. The ridges are derived at t=700 ms from the clean cochleogram of figure 5.3.

The selected peaks are combined into temporal contours. Contours shorter than 25 ms are discarded. In each frame the remaining contours are compared with the corresponding characteristic frequency of the segment of the lowest ridge. This is depicted in the lower panel. The contour that falls 60% or more of the time within 10% of the characteristic frequency of the lowest ridge⁵ is chosen; the other contours are discarded. Finally, the selected period contours are smoothed with the same procedure as described in section 5.1. The final output of the algorithm is, conforming to the demands of the TAC-selection algorithm in section 4.6, a parameter set defining a first-order approximation of the local instantaneous fundamental period per frame.

This technique combines two strategies that complement each other: periodicity information in the running autocorrelation provides an accurate periodicity estimate, while position information facilitates the choice of the correct fundamental period candidate.⁶

5. This implementation does not work properly with a missing fundamental.
6. Note that the two sources of knowledge correspond to time and place coding in the auditory nerve.
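The disambiguation step (contours of at least 25 ms, matched against the characteristic frequency of the lowest ridge) can be sketched as follows. The frame rate, the names and the tie-breaking rule are assumptions for illustration.

    def select_contour(contours, cp_lowest, frame_ms=5.0):
        """Choose the candidate period contour that matches the lowest ridge.

        contours  : list of (start_frame, periods) tuples, where periods holds one
                    candidate fundamental period (ms) per frame
        cp_lowest : characteristic period (ms) of the lowest ridge, per frame
        """
        min_frames = int(25.0 / frame_ms)            # discard contours shorter than 25 ms
        best = None
        for start, periods in contours:
            if len(periods) < min_frames:
                continue
            ref = cp_lowest[start:start + len(periods)]
            hits = sum(abs(p - r) <= 0.1 * r for p, r in zip(periods, ref))
            if hits >= 0.6 * len(periods):           # within 10% of the reference in >= 60% of frames
                if best is None or len(periods) > len(best[1]):
                    best = (start, periods)          # prefer the longest matching contour
        return best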

Figure 5.3. Fundamental period disambiguation. The upper panel shows the position of the lowest peak superimposed on the cochleogram. The vertical line denotes the time from which the information of figure 5.2 is derived. The lower panel shows the candidate fundamental periods. The colors blue, green and red denote whether the candidate has the highest, second or lowest autocorrelation value. The black dots (which overlap most of the blue dots) signify the corresponding characteristic period of the lowest ridge. The output consists of contours that match more than 60% of the frames.

To determine a lower bound of the quality of the period contour estimation, a standard MFCC-based HMM system was trained on the voiced frames, as estimated by the algorithm, of the female voices of the TIDIGIT database. A recognition report based on a test with the training data is shown in figure 5.4. Less than 0.5% of the voiced sound events are missed or estimated with a major error that prevents correct recognition. Two thirds of the deletions occur in words with short vowels like /SIX/ and /EIGHT/. The overall performance of the speech recognition system shows that it is possible to reach a 1% error rate on the TIDIGIT task with voiced frames exclusively. This rate forms an upper bound for the speech recognition test on selected speech presented in section 2.13 and chapter 7.

Figure 5.4. Recognition report of the speech recognition test using the voiced parts, i.e., frames in which a fundamental period contour was estimated, of the female training data of the TIDIGIT database. The header of the report reads:

    SENT: %Correct=92.66 [H=2599, S=206, N=2805]
    WORD: %Corr=99.24, Acc=98.72 [H=17268, D=60, S=73, I=89, N=17401]

The lower part of the report gives a confusion matrix over the words one, two, three, four, five, six, seven, eight, nine, oh, zero and silence, with per-word percentages correct/error: one 99.7/0.0, two 99.1/0.1, three 99.3/0.0, four 99.5/0.0, five 99.8/0.0, six 97.8/0.1, seven 99.6/0.0, eight 99.6/0.0, nine 100/0.0, oh 98.7/0.1, zero 99.9/0.0, silence 100/0.0. The six represents almost half of the deletions (Del). The test data was the same as the training data. Only 60 (0.5%) of the non-silence digits are deleted. The /SIX/ was recognized correctly in 1012 cases and was recognized as an /EIGHT/ in 16 cases. Its percentage correct was 97.8% and it was 8 times incorrectly inserted and 27 times deleted.

5.3 A Comparison of Fundamental Period Estimation Algorithms

This thesis presents several ways to estimate and reestimate fundamental period contours. Section 5.1 presented a general algorithm that is optimized to function in very noisy situations (SNR ≥ 0 dB). Section 5.2 presented an algorithm tuned to the demands of an ASR system that assumes a clean speech signal. Both algorithms provide a smoothed version of the actual period

contour that is based on a selection of the available data. The TAC-selection can, in principle, provide a reestimation of the instantaneous local fundamental period because it combines all available periodic information.

The quality of the reestimation is limited by the sample frequency and the integration time constant. When the time constant is relatively long (which is the case for most fundamental periods), temporal averaging reduces the temporal resolution. Frequency resolution is limited by the sample frequency. With a sample frequency of 20 kHz the difference of one sample corresponds to 0.5% (0.5 Hz) around periods corresponding to 100 Hz, while around 400 Hz each sample leads to a difference of 2% (8 Hz). The natural period-to-period fluctuations in duration are called jitter and are typically between 0.5% and 1% (O'Shaughnessy). The combination of an integration time constant of 10 ms and the sample frequency leads to a sensitivity that is insufficient to study jitter in detail, but the sensitivity is not low enough to average out all effects of jitter. The current system settings are therefore unsuitable for a detailed comparison of the pitch estimation techniques. Yet it is useful to compare the results and consequences of the different algorithms.

Figure 5.5 shows some examples, in terms of pitch, of the two fundamental period contour algorithms of the previous sections. The solid lines give the output of the period estimation algorithms. The green line denotes the output of the general algorithm, the black line that of the algorithm for clean speech that was optimized for ASR purposes. In general, the agreement between both is very good. The main difference between the two estimates occurs at the end of the third contour, where the general algorithm interpreted some noise as a valid extension of the contour. This type of error is, due to the complexity of the general algorithm, extremely difficult to reduce, because the reduction of the number of one type of error usually leads to an increase in the number of other errors. In part, this is a consequence of conclusion 1.16, which formulated a limitation of the measurement process. It is, however, possible to evaluate multiple hypotheses to find the best interpretation of the local periodicity information.

The plus-signs and open circles in figure 5.5 denote the frequencies that result from the reestimation process during TAC-selection. The plus-signs refer to a clean signal and the open circles refer to the signal with added babble noise at 0 dB SNR. These values fluctuate around the original estimates that were estimated from clean signals: the green contour for the red values and the

Figure 5.5. Comparison of different fundamental period estimation algorithms. Note that the panels depict frequency and not the fundamental period. The legend applies to both panels. 'General' refers to the fundamental period estimator of section 5.1 and 'ASR' refers to the algorithm of section 5.2. The inset shows that all methods show a very good general agreement. Some differences occur in the third contour. The main figure shows a close-up of this contour. The fundamental period contour computed by the general algorithm is longer and ends incorrectly.

black contour for the blue values. In clean situations the fluctuations are minimal and usually due to natural pitch fluctuations and the discrete periodicity values. In noise, the fluctuations are only marginally larger. This increase is primarily the result of the opportunistic local optimization of the positive area under the compressed selection (section 4.6). At each time-step it is possible that spurious contributions are selected. The TAC selects the information that correlates with the fundamental period contour more efficiently than the uncorrelated contributions. This entails that spurious contributions usually have a small and local effect that shows itself as a small increase in the noise of the reestimated fundamental period value.

The close-up in the main figure shows that both fundamental period estimation techniques failed to estimate a small increase in frequency between 1380 and 1420 ms that shows up very clearly in both the clean and the noisy reestimates. This implies that the information under the ridges does

Figure 5.6. The selections corresponding to the fundamental period contour of figure 5.5 (the word /TWEE/). The upper panels show the selections without background noise, the lower panels in 0 dB babble noise. A comparison of the lower left-hand panel with the lower right-hand panel shows that the effect of the incorrect final part of the fundamental period contour is minimal. The local instantaneous fundamental period optimization works efficiently on the target and less efficiently on uncorrelated backgrounds.

not always provide sufficient and completely correct information. In this case the local frequency estimation is completely dominated by the first two harmonics: the higher harmonics do not give rise to consistent ridges. The reestimation based on the selection uses all available information and consequently allows an improved pitch estimate.

The other interesting effect in the third contour is the estimation error at the end of the red, general contour. Here the signal had ended, but the general period contour estimation was still able to find some supportive evidence for continuation. The corresponding reestimated period values in the clean condition (marked with '+') vary considerably. In noise the reestimated values vary less because some spurious contribution provided a consistent target. This period estimation error has only a minor effect on the final TAC-selection as depicted in figure 5.6. In clean conditions (the upper panels) hardly any difference is visible: the estimation error occurred when the signal energy was

low and decreasing. In noisy conditions the effect of the estimation error is still small. The lower left-hand panel shows the additional signal contribution after t=1450 ms, while the lower right-hand panel only shows the effect of the leaky integration. This example shows that pitch estimation errors do not necessarily lead to large errors in the selection.

CHAPTER 6  Auditory Element Estimation

A general sound recognition system (definition 1.4) must avoid the signal-in-noise-paradox. This chapter refers back to conclusion 1.12, which avoided the paradox by grouping and selecting evidence represented by coherent cochleogram areas into a representation on which auditory event (definition 1.11) formation can be based. The selected evidence can be used for sound resynthesis, cochleogram reconstruction, as input for automatic speech recognition systems, or as input for a more elaborate auditory scene analysis stage.

As described in section 1.8, this work tries to identify signal components that are likely to have a favorable (local) SNR and to combine these into a single representation whenever they might stem from the same source. Section 6.1 identifies the regions of the time-place plane of the cochleogram that are likely to have a high SNR; these areas are called auditory elements (conform Brown 1994). The section provides a selection method with a high preference for regions dominated by speech-like signals and a low preference for regions dominated by common noise types. This addresses the first two steps of the measure of success in section 1.8. Section 6.2 quantifies the similarity between the selected information and the target signal by comparing the cochleogram of the target signal with a cochleogram based on resynthesized sound derived from information represented by the selected auditory elements. While section 6.2 addresses a measure of similarity based

on the shape of the cochleogram, the main task of this work is to select the most informative signal components of target signals in a wide range of signal-to-noise ratios. This task is quantified in section 6.3, which addresses the third and last step of the measure of success in section 1.8.

6.1 Auditory Element Estimation and Selection

In this section a number of simple physical thresholds select cochleogram regions. These place-time regions are registered as either accepted areas A or as masks M. The accepted areas reflect the raw suprathreshold information. The application of a hard threshold may lead to adverse effects: when short intervals are discarded while their neighborhood is accepted, they are likely to be close to the threshold and consequently might be included to form coherent cochleogram regions. The masks provide such regions by combining segment contributions of a minimal duration L in which holes of maximal duration H are filled (a code sketch of this mask forming operation follows below). Mask formation involves the application of a property of signals like speech: accepted signal contributions are limited to coherent contributions of a minimal length of 30 ms. This value corresponds to the duration of the shortest phonemes (e.g., /P/, /T/, /K/). This work uses a minimal duration L = 30 ms and a maximal hole size H = 10 ms.

It is assumed that a combination of constraints is more effective and more robust than the optimal use of a single constraint. Consequently the general approach of this section is to use a combination of low thresholds, rather than a single strict threshold (as applied in section 2.11 and section 4.3). Coherent area forming is based on:

- a model of the background energy
- the driving energy per segment
- a CPC-based measure of local dominance
- a TAC-based measure of compliance to a period contour
- combined criteria for periodic signal contributions
- combined criteria for aperiodic signal contributions
- combined criteria for speech-like signal contributions

The combined criteria are able to identify regions of the s-t-plane that are dominated by a single sound source. All thresholds are applied to a clean and to a noisy version (at 0 dB babble noise) of the standard target signal. As in section 4.3, a click is inserted between the first and second word. Both cochleograms are depicted in figure 6.1.
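As announced above, the mask forming function m_{L,H} can be read as a one-dimensional operation per segment on the boolean accepted area: fill holes of at most H ms, then keep only runs of at least L ms. The sketch below is a plausible reading of this behavior; the frame rate and all names are assumptions, not the implementation of this work.

    import numpy as np

    def mask_forming(accepted, frame_ms=5.0, L_ms=30.0, H_ms=10.0):
        """m_{L,H}: turn a raw accepted area A into a mask M, per segment.

        accepted : (S, T) boolean accepted area (segments x frames)
        """
        L = int(round(L_ms / frame_ms))
        H = int(round(H_ms / frame_ms))
        mask = np.zeros_like(accepted, dtype=bool)
        for s in range(accepted.shape[0]):
            row = accepted[s].copy()
            t = 0
            while t < len(row):                  # fill holes bounded on both sides
                if not row[t]:
                    start = t
                    while t < len(row) and not row[t]:
                        t += 1
                    if 0 < start and t < len(row) and t - start <= H:
                        row[start:t] = True
                else:
                    t += 1
            t = 0
            while t < len(row):                  # keep contributions of at least L frames
                if row[t]:
                    start = t
                    while t < len(row) and row[t]:
                        t += 1
                    if t - start >= L:
                        mask[s, start:t] = True
                else:
                    t += 1
        return mask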

Figure 6.1. The cochleograms of the input signals used in this section. The standard target sentence /NUL ÉÉN TWEE DRIE/ is used, but, as in section 4.3, an additional impulse is inserted between the first and second word. The left-hand panel depicts the clean target plus impulse. The right-hand panel depicts the target embedded in babble noise at 0 dB SNR.

The remaining part of this section depicts the raw selected areas A as well as the corresponding mask M for each threshold.

Background model

The first threshold is based on a model C_B(s,t) of the background energy. Its main purpose is to discard cochleogram areas with a high probability of being dominated by the noise. A background model must be applied with utmost care, because it is crude and error-prone. Background models based on unchecked assumptions about the noise (like assuming a slow rate-of-change) are particularly dangerous. Ideally the background model must represent all signal contributions that do not vary synchronously with the speech (constant contributions and contributions that do not comply with the demands of the target class). Consequently, a perfect background model can only be estimated after the recognizable components of the signal have been classified (a direct consequence of the signal-in-noise-paradox).

The crude background model used here is based on the nonlinearly compressed energy (equation 2.3) because it represents perceptive effects better than the uncompressed energy (section 2.2). Typically, the background model is a function of time and place that is based on a moving average of the total energy with a large time constant (e.g., τ_B > 100 ms). In this example, the onset of the target signal comes too early to provide sufficient history. As an alternative, the background model is based on the average value C_B^{av}(s) and the associated standard deviation C_B^{std}(s) of the nonlinearly compressed energy r_s(t)^{0.15}.

It is assumed that the target dominates during periods of voiced speech; consequently the average and the standard deviation are based on intervals where no pitch could be estimated:

    C_B^{av}(s) = ⟨ r_s(t)^{0.15} ⟩_{t∉P}
    C_B^{std}(s) = ⟨ ( r_s(t)^{0.15} - C_B^{av}(s) )² ⟩_{t∉P}^{1/2}    (6.1)

The operator ⟨·⟩ denotes the average. The set P denotes the intervals in which a pitch could be estimated with the fundamental period estimation algorithm of section 5.1.

Another function of the background model is to discard quantization noise.¹ The quantization noise level is determined by the dynamic range of the input signal, which in turn is determined by the type of signal, the dynamic range of the transmission channel, scaling of the data and/or the number of bits used per sample. In this example, it is set to c_min = 1, which corresponds to 38 dB below the most energetic peak of the target cochleogram. (See figure 2.4 for an indication of this threshold in relation to the dynamic range of the speech signal.)

This leads to a background model C_B(s,t) with a lower limit of c_min = 1 that is further defined as the average value of the nonlinearly scaled background C_B^{av}(s) plus a fraction c_std of the standard deviation C_B^{std}(s):

    C_B(s,t) = max[ C_B^{av}(s) + c_std(t) C_B^{std}(s), c_min ]    (6.2)

The criterion c_std(t) is set to c_std^v = 0 (v for voiced) for intervals with a TAC-selection. This value is based on the observation at the end of section 3.6 that the TAC is likely to produce spurious contributions whenever the local SNR drops below 0 dB. The criterion c_std(t) is set to a less permissive c_std^u = 1 for intervals without TAC-selections. The application of the threshold C_B(s,t) to the cochleogram r_s(t) leads to an accepted area A_B:

    A_B = { (s,t) : r_s(t)^{0.15} > C_B(s,t) }    (6.3)

and a corresponding mask M_B:

    M_B = m_{L,H}(A_B),  with L = 30 ms and H = 10 ms    (6.4)

1. The numerical noise level lies well below the quantization noise of the input signal.
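Equations 6.1 to 6.3 translate into a few array operations. The sketch below is illustrative; the names, the array conventions and the use of a single compression exponent are assumptions (r is the cochleogram energy and voiced marks the frames in P):

    import numpy as np

    def background_area(r, voiced, c_min=1.0, c_std_v=0.0, c_std_u=1.0):
        """Accepted area A_B from the crude background model (eqs 6.1-6.3).

        r      : (S, T) cochleogram energy; r**0.15 is the compressed energy
        voiced : (T,) boolean, True for frames with an estimated pitch (the set P)
        """
        c = r ** 0.15
        unvoiced = ~voiced
        avg = c[:, unvoiced].mean(axis=1)              # C_B^av(s), eq 6.1
        std = c[:, unvoiced].std(axis=1)               # C_B^std(s), eq 6.1
        c_std = np.where(voiced, c_std_v, c_std_u)     # relaxed during TAC-selection
        C_B = np.maximum(avg[:, None] + c_std[None, :] * std[:, None], c_min)  # eq 6.2
        return c > C_B                                 # A_B, eq 6.3

The mask then follows as M_B = mask_forming(A_B) with L = 30 ms and H = 10 ms (equation 6.4).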

Figure 6.2. Application of the selection criterion based on the background model. The left-hand panels give the accepted cochleogram regions for the clean condition, the right-hand panels for the noisy condition. The upper panels show the raw accepted regions. The lower panels show the masks, which consist of segment contributions of at least 30 ms (with holes of 10 ms filled). Mask formation leads, in the noisy condition, to an area reduction of 9%. The use of a background model leads to an important reduction of the search space. The effect of different threshold settings for voiced and unvoiced regions is most noticeable in the low-frequency region.

The function m_{L,H} is the mask forming function with parameters L = 30 ms and H = 10 ms. The effect of the background model is depicted in figure 6.2. The left-hand panels show the clean condition and the right-hand panels show the noisy condition. The upper panels show the raw accepted areas, the lower panels the associated masks. The effect of different threshold settings for periodic and aperiodic intervals is most noticeable for the lowest segment numbers.

Driving Energy and Decay

A second criterion is whether a cochleogram area is being excited or whether it reflects the decay of a response of a source that has effectively ended. Without driving energy, the cochleogram responses decrease at a rate determined by the lowpass filtering L{·}, which, in this case, is implemented

Figure 6.3. Cochleogram areas where decay dominates. The strong impulse at t=500 ms is the main cause of decay. Strong decay is not prominent in the noisy condition, but is of some significance in the clean condition.

as leaky integration with time constant τ = 10 ms (equation 2.1). When a decay criterion² C_D(τ) is based on τ_D = 11 ms, it can identify cochleogram areas A_Decay with very little or no driving energy:

    A_Decay = { (s,t) : (∂r_s(t)/∂t) / r_s(t) = (∂/∂t) log r_s(t) < C_D(τ_D) }    (6.5)

Such areas are depicted in figure 6.3. In the noisy condition, decaying signal contributions are rapidly masked by noise, but in clean conditions areas dominated by decay may be of some importance and may help to identify pulses and offsets. The acceptance criterion τ_D = 11 ms is set close to the limit of τ = 10 ms. This is a rather strict demand, but its complement, the areas A_D with sufficient driving energy that we are interested in, is permissive. Much of the effect of the impulse, which was prominent in A_Decay, will be discarded in A_D.

2. In general, the decay criterion C_D(s,t) may depend on segment-dependent time constants. The current implementation is based on a single global time constant.
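A sketch of equation 6.5 follows, assuming frames Δt apart and the reading C_D(τ_D) = -1/τ_D, which matches the description of decay close to that of free leaky integration; the names are assumptions:

    import numpy as np

    def decay_area(r, dt_ms=5.0, tau_D_ms=11.0):
        """A_Decay: areas whose log energy decays almost as fast as free decay
        of the leaky integration (eq 6.5). r is the (S, T) cochleogram energy."""
        dlog = np.diff(np.log(r), axis=1) / dt_ms      # (d/dt) log r_s(t), per ms
        flagged = dlog < -1.0 / tau_D_ms               # decays faster than -1/tau_D
        # repeat the last column so the result has the same number of frames as r
        return np.concatenate([flagged, flagged[:, -1:]], axis=1)

The complement with sufficient driving energy, A_D, is simply the negation of this area.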

Figure 6.4. Acceptance regions and masks for local dominance. In the clean condition 45% of the time-place area is sufficiently entrained. In the noisy condition the entrained area increases to 56%.

Local Dominance

The next criterion is based on the characteristic period correlation and selects areas that are dominated by signal contributions that lead to the expression of their frequency at the corresponding characteristic segments (see section 3.3). This threshold has been introduced in section 4.3. In this section the local dominance criterion is renamed to C_C(s) and relaxed, with C_C(s_high) = c_high = 0.02 and C_C(s_low) = c_low = 0.15. The values for the intermediate segments are based on linear interpolation. The dominated area A_C is defined by:

    A_C = { (s,t) : r_s^c(t) / r_s(t) > 1 - C_C(s) }    (6.6)

This leads to figure 6.4. The figure uses a similar threshold for the high-frequency side as the lower panel of figure 4.7 (where C_C(s_high) = 0.02) and a threshold on the low-frequency side comparable to the middle panel of figure 4.7, where C_C(s_low) = 0.15.
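The interpolated criterion of equation 6.6 can be sketched as follows (assuming segment 0 is the low-frequency side; the names are illustrative):

    import numpy as np

    def dominance_area(r, cpc, c_high=0.02, c_low=0.15):
        """A_C: areas where the CPC expresses local dominance (eq 6.6).

        r   : (S, T) cochleogram energy
        cpc : (S, T) characteristic period correlation r_s^c(t)
        """
        S = r.shape[0]
        C_C = np.linspace(c_low, c_high, S)        # linear interpolation of C_C(s)
        return cpc / r > 1.0 - C_C[:, None]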

Figure 6.5. Accepted areas and associated masks for a TAC-selection based criterion C_T(s) = c_T = 0.15. Although the criterion is somewhat relaxed compared to the upper panel of figure 2.15, the application of a constraining period contour is still very powerful. It is the only signal property used for the grouping of signal components.

Mask forming leads again to smoother and more coherent regions. The considerations at the end of section 3.5 entail that this dominance-based criterion ensures the selection of regions that include spectral peaks.

Compliance to a period contour

Compliance to a periodic contour is computed with the Tuned Autocorrelation. The upper panel of figure 2.15 showed selected values that represented 25% or more of the local signal energy. In general this criterion can be formulated as:

    A_T = { (s,t) : r_{s,T(t)}(t) / r_s(t) > C_T(s), t ∈ P }    (6.7)

Figure 6.6. Periodic signal components according to the combined criterion for active domination of the BM, compliance to a period contour and energy exceeding the background model. Note that individual coherent areas of the mask in the noisy condition are likely to belong to either the background or the target signal, in which case they often resemble the clean condition. This justifies the term auditory elements.

In this section the constraint is relaxed to C_T(s) = c_T = 0.15. The resulting accepted areas and their associated masks are depicted in figure 6.5. This TAC-based criterion is very powerful and it is the only (implemented) signal property that can combine different cochleogram regions in a single representation. The rest of this section uses combinations of acceptance areas to form auditory elements.

Periodic Signal Contributions

Periodic signal contributions can be defined as signal contributions that dominate BM ranges by supplying sufficient driving energy and that comply with a period contour. This leads to a combined criterion for periodic signal contributions and to a new accepted area A_P defined as:

    A_P = A_D ∩ A_C ∩ A_T ∩ A_B    (6.8)

which reflects signal contributions defined by:

- A_D: sufficient driving energy
- A_C: sufficient local dominance
- A_T: compliance to a period contour
- A_B: exceeding the background model

To prevent adverse effects due to the multiple application of the gap-filling mask forming function m_{L,H}, the corresponding mask is defined as:

    M_P = m_{L,H}(A_P),  and not as  M_P = m_{L,H}(M_D ∩ M_C ∩ M_T ∩ M_B)    (6.9)

The resulting regions are depicted in figure 6.6. The combination of constraints leads to a considerable reduction in the area of both A_C and A_T and a strong focus on the main signal contributions (i.e., strong individual harmonics and formants). This is not only the case for the clean condition, but also for the noisy condition. Inspection of the form and energy development of individual coherent areas in the noisy condition shows that they separate into regions that tend to be dominated by either the background or the target signal. In the latter case the selected areas often resemble the clean condition. This is an important step towards auditory event estimation.
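Because all criteria are boolean areas on the same (s,t) grid, equations 6.8 and 6.9 reduce to a few array operations. The following lines reuse the sketches introduced earlier in this chapter and are, like them, illustrative; A_T is assumed to be available as a boolean area computed from the TAC according to equation 6.7:

    A_D = ~decay_area(r)                    # sufficient driving energy
    A_C = dominance_area(r, cpc)            # sufficient local dominance
    A_B = background_area(r, voiced)        # exceeding the background model
    A_P = A_D & A_C & A_T & A_B             # eq 6.8: periodic contributions
    M_P = mask_forming(A_P)                 # eq 6.9: a single mask forming pass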

Figure 6.7. The aperiodic signal contributions are defined as contributions that actively dominate regions of the BM during periods where no pitch could be estimated, and of which the CPC does not exceed the energy too much (here a fraction c_A = 0.03). Note that the /T/ of the third word is selected in all conditions.

The acceptance criterion C_B(s,t) for the background model A_B is relaxed during intervals with a TAC-selection. The TAC-selection uses periodicity as a very specific signal property that correlates with the target signal, but (on average) does not correlate with all other signal contributions. The requirements of mask formation in combination with the relaxed background criterion and the thresholds for TAC-acceptance and dominance result in an effective reduction of spurious contributions.

Aperiodic Signal Contributions

The complementary information based on an estimation of aperiodic signal contributions is defined in a somewhat more complex way:

    A_A = A_D ∩ A_C ∩ A_B ∩ Ā_P ∩ A_N    (6.10)

It reflects signal contributions defined by:

- A_D: sufficient driving energy

- A_C: sufficient local dominance
- A_B: exceeding the background model
- Ā_P: frames without an estimated period contour³
- A_N: a CPC to energy ratio sufficiently close to (but above) unity:

    A_N = { (s,t) : r_s^c(t) / r_s(t) < 1 + C_A(s) }    (6.11)

The last threshold C_A(s) is included to allow a distinction between periodic contributions, of which the CPC value can exceed unity, and dominating aperiodic contributions, for which the CPC value is defined to be close to unity (equation 4.9). It is set to C_A(s) = c_A = 0.03 for all segments. This combination of thresholds limits the accepted CPC-energy ratios to values below 1.03 and above 0.98 down to 0.78 with decreasing segment number.

3. This choice is suboptimal since it assumes that the target signal is either voiced or unvoiced. Speech can be a combination of both.

Figure 6.8. Speech-like signal contributions are the union of the aperiodic and the periodic signal contributions.

Figure 6.7 provides the raw areas and the associated masks. Only a few regions meet the demands. The impulse is visible in the raw areas of both conditions. The /T/ of /TWEE/ meets the demands as well.

Speech-like Signal Contributions

A combination of the accepted periodic and aperiodic contributions can be used to select areas that might be target speech:

    A_S = A_P ∪ A_A  and  M_S = m_{L,H}(A_S)    (6.12)

These are depicted in figure 6.8. These masks form the final output of the area selection algorithms. Although this mask will be referred to as the speech mask, it does not necessarily reflect speech. In fact very little knowledge of speech has been applied and consequently the bias towards the speech class is minimal. The main demands are pitch contours with a duration of at least 50 ms that lie between 75 and 400 Hz (due to the limitations of the period contour algorithm of section 5.1) and segment contributions of a minimal duration of 30 ms (due to mask forming). These demands are implementation dependent and can be relaxed if necessary.

Criterion | Function | Remark
L = 30 ms | Mask: minimal duration of segment contribution | Related to the minimal duration of speech-like signal contributions
H = 10 ms | Mask: maximal hole size | Reduces the effect of hard thresholds
c_D = τ_D = 11 ms | Decay: accepted time constant for energy loss | Criterion somewhat larger than the leaky-integration time constant τ
c_min = 1 | Energy: minimal value | Prevents contributions due to quantization noise
c_std^u = 1 | Background: fraction of std. deviation above average | For intervals where no period contour could be estimated (unvoiced intervals)
c_std^v = 0 | Background: fraction of std. deviation above average | Weaker criterion than c_std^u, used during intervals with TAC-selection
c_high = 0.02 | Dominance: acceptance criterion for high-frequency segments | 1 - c_high must be close to, but less than, unity for high-frequency segments
c_low = 0.15 | Dominance: acceptance criterion for low-frequency segments | Criterion for low-frequency segments to accommodate amplitude variation in the target signal
c_T = 0.15 | Periodicity: fraction of energy explained | Weaker criterion than in section 2.11
c_A = 0.03 | Aperiodic: deviation from unity of the CPC to energy ratio | Can make a distinction between dominating aperiodic noise and harmonic contributions

Table 6.1. Overview of thresholds for area selection and mask forming.

Other types of signals will be selected as long as they comply with these general demands. Additional signal properties (e.g., pitch variability) may be included to reduce the range of accepted signal types. It is the responsibility of the pattern recognition stage to determine how the selected evidence ought to be combined and interpreted (see e.g., section 7.2).

There has not been any optimization of the thresholds. Each threshold is set to be permissive. The selection of speech-like contributions depends on the combination of knowledge sources and not on the optimization of each threshold individually. It will be demonstrated in the next section that this approach is applicable to different types and levels of noise. Table 6.1 summarizes all threshold values. These values will be used in the next sections on a range of noise types and signal-to-noise ratios.

6.2 The Robustness of Mask Forming

The resynthesis process, as described in section 2.11, was developed as a means to form a visual representation of the selected information by converting the information represented by the speech mask, via a conversion to sound (Slaney 1994), back to a cochleogram representation. The advantage of this procedure is that masking effects are determined by the model itself and not approximated imperfectly (as for example in section 3.6). This section provides a quantitative comparison between the acoustic information represented by the masks and the target sound.

Since the basilar membrane model is implemented as a finite impulse response (FIR) filter, it is possible to invert the filtering by reversing the impulse response in time and by compensating for segment-dependent frequency effects caused by the double use of the basilar membrane filter (a sketch follows below). This compensation is based on the sensitivity of the BM as a function of place. When the outputs of all filters are added, all BM information represented by the basilar membrane oscillations is converted back to a close approximation of the input sound.

The cochleogram of the resynthesized standard target signal (without the extra pulse) is depicted in the upper panel of figure 6.9. The resynthesis is based on an all-pass mask (i.e., a mask that spans the whole place-time plane). When the resynthesis is based on the clean signal in combination with the mask as estimated in 0 dB babble noise, a very similar cochleogram results (middle panel). The main differences are the low values during intervals that were not included in the mask. If the resynthesis is based on the noisy signal and the mask estimated from the noisy signal (lower panel), the cochleogram remains similar, but spurious contributions are introduced.

To measure the efficiency of the auditory element forming and selection process, cochleograms based on resynthesized signals were computed for a range of different levels and types of noise. The masks were estimated with the criteria of table 6.1 at SNRs that range from 20 down to -20 dB in steps of 5 dB. Four different noise types were added: babble noise, white noise, car factory noise and speech noise (all derived from the NOISEX-92 database, Varga 1992). The noise selections were randomly chosen from the noise files and scaled so that the root-mean-squares of the speech signal and the noise had the desired SNR.
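The inversion sketched above relies on the fact that filtering with the time-reversed impulse response inverts the phase (group delay) of the forward FIR filter; after compensating each segment for the double application of its filter, the summed outputs approximate the input sound. The following sketch illustrates the idea only; the compensation gains, the mask resolution and all names are assumptions, not the implementation of this work:

    import numpy as np

    def resynthesize(bm_output, impulse_responses, gain, mask=None):
        """Approximate inversion of an FIR basilar membrane filterbank.

        bm_output         : (S, N) basilar membrane oscillations per segment
        impulse_responses : list of S FIR impulse responses
        gain              : (S,) compensation for the double use of each filter,
                            based on the sensitivity of the BM as function of place
        mask              : optional (S, N) boolean mask; excluded samples are zeroed
        """
        if mask is not None:
            bm_output = np.where(mask, bm_output, 0.0)
        out = np.zeros(bm_output.shape[1])
        for s, h in enumerate(impulse_responses):
            # time-reversed filtering undoes the forward group delay
            out += gain[s] * np.convolve(bm_output[s], h[::-1], mode="same")
        return out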

Figure 6.9. Three examples of resynthesis (panels, from top to bottom: resynthesis of the clean signal without mask; resynthesis of the clean target with the mask estimated in 0 dB babble noise; resynthesis of the noisy target with the mask estimated in 0 dB babble noise). The upper panel shows the result of a full inversion of the signal. Visually, it cannot be distinguished from the original. The second panel shows the cochleogram of the resynthesized clean signal after applying the noisy mask as estimated in figure 6.8. The corresponding sound sounds muffled. The last panel shows the resynthesis of the noisy target. This signal is intelligible, but its naturalness is reduced.

Eight different noisy signals were created for each of the 36 noise conditions.

For most broadband noises it is unlikely that (even with an improved estimation technique) the required period contours can be estimated reliably below an SNR of approximately -6 dB. This is the result of the breaking up of ridges due to masking by noise peaks and an associated reduction of the reliability of the local frequency estimates. The combined effect leads to a rapid increase in the search space, which eventually prevents the estimation of the correct period contour. The inability to estimate period contours below approximately -6 dB SNR restricts the mask forming technique to SNRs above this level.

For normalization purposes, it is necessary to compute masks of noise signals that contain no evidence for the correct period contour. These masks must be based on similar period contours as the other noise conditions to exclude effects of period contour estimation differences and/or errors. This led to the

decision to base all masks on the period contours as estimated from the 0 dB SNR babble noise condition (see figure 2.13).

The similarity s_te between a target vector v_t (the clean signal) and an estimate v_e of the target from a noise condition is based on a normalized dot product:

    cos(α_{t,e}) = (v_t · v_e) / (‖v_t‖ ‖v_e‖)    (6.13)

This measure is normalized so that s_te = 1 when v_t and v_e point in the same direction and s_te = 0 when v_e equals the estimate v_n in noise without the target, i.e., when cos(α_{t,e}) = cos(α_{n,e}). The baseline cos(α_{n,e}) is computed as the average for each noise type (N=72). This leads to the definition of s_te as:

    s_te = ( cos(α_{t,e}) - ⟨cos(α_{n,e})⟩ ) / ( 1 - ⟨cos(α_{n,e})⟩ )    (6.14)

The similarity measure is computed using the nonlinearly compressed cochleograms (interpreted as vectors) and is based on the average similarity of the 8 samples per combination of noise type and SNR. The four noise types are normalized individually. Five different similarities were computed (a code sketch of the similarity measure itself follows the list):

1. Target cochleogram vs. noisy cochleogram. This comparison provides a baseline similarity, reflecting the similarity between the target and the unprocessed noisy cochleogram.

2. Cochleogram of the noise vs. noisy cochleogram. This comparison provides the similarity between the unprocessed signal and the cochleogram of the noise without the target signal. It provides information complementary to similarity 1 and is mainly included as a check for consistency.

3. Target cochleogram vs. resynthesized selection. This condition provides the similarity between the target cochleogram and the cochleogram of the resynthesized auditory elements.

4. Target cochleogram vs. resynthesized selection, voiced only. The resynthesized cochleogram and the target cochleogram differ during 'silences': the target cochleogram has a minimum energy level due to (quantization) noise, while the resynthesized cochleogram shows an exponential decay during silences. The effects of this qualitative difference are reduced when the similarity is based only on the frames where pitch is defined.

5. Resynthesized target vs. resynthesized selection. This provides the similarity between the cochleogram of the resynthesized target (based on speech masks estimated from the clean condition) and the cochleogram of the resynthesized selection. Because these two signals have been subject to the same processing steps, it is the best indication of the robustness of the auditory element selection technique.
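The similarity of equations 6.13 and 6.14 amounts to a normalized inner product; the sketch below is illustrative (the flattening and the names are assumptions):

    import numpy as np

    def normalized_similarity(v_target, v_estimate, baseline):
        """s_te of equations 6.13 and 6.14.

        v_target, v_estimate : cochleograms interpreted as vectors
        baseline             : the average cos(alpha_n,e) for this noise type,
                               estimated from conditions without the target
        """
        t = v_target.ravel()
        e = v_estimate.ravel()
        cos_te = (t @ e) / (np.linalg.norm(t) * np.linalg.norm(e))   # eq 6.13
        return (cos_te - baseline) / (1.0 - baseline)                # eq 6.14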

Figure 6.10. Similarity between target and selection as a function of SNR. The solid black line reflects the baseline similarity between the noisy cochleograms and the target (1). The dashed black line reflects its complement (2): the similarity between the noisy cochleogram and the cochleogram of the noise (without target). The dashed red line reflects the similarity between the resynthesized cochleograms and the target (3). The solid red line provides the same measure, but limited to voiced intervals (4). The blue line reflects the similarity between the resynthesized target and the resynthesized cochleograms (5). The vertical bar at -6 dB gives the SNR below which it is unlikely that the correct period contours can be estimated.

The results (as averages over the four noise conditions) are summarized in figure 6.10 and in table 6.2. Figure 6.10 depicts similarity 1 as the solid black line. Its converse (2) is depicted as the dashed black line. Both lines intersect at approximately 10 dB, with a similarity of ½√2 (or α = 45°). This point denotes the SNR where the noisy cochleogram resembles the target as much as it resembles the noise.

The dashed red line shows the third similarity. Its degradation between 20 dB and 5 dB is less than 2%, but even in the clean condition the similarity remains below unity. When the comparison is exclusively based on the voiced intervals, the similarity increases. The solid red line gives the average similarity for

Table 6.2. Different similarity measures for different SNRs. For each SNR, from the clean condition down to noise only, the table lists the average similarity over all noises (N=32) for the five measures: (1) target cochleogram vs. noisy cochleogram, (2) noise cochleogram vs. noisy cochleogram, (3) target cochleogram vs. resynthesis, (4) target cochleogram vs. resynthesis, voiced intervals only, and (5) resynthesized target vs. resynthesis.

the voiced intervals (4). This fourth similarity lies, for SNRs better than 5 dB, 3.5% above the dashed red line. The solid red line degrades only 0.6% for the SNR range between 20 and 10 dB and 3% between 20 and 5 dB, while the reference (1) degrades 20% and 33%, respectively.

The comparison of the resynthesis based on the clean condition with the resynthesis of the noisy condition (similarity 5) is depicted as the solid blue line. It lies above the other measures for -6 dB and better. It degrades 0.7% between the clean condition and 20 dB (unprocessed reference 9%), 6% for the range between 20 and 5 dB (reference 33%) and 12% between 20 and 0 dB (reference 50%).

The horizontal distance between the unprocessed reference and similarity 5 may serve as an indication of the improvement of the SNR due to auditory element selection. For 5 dB SNR this improvement equals 18 dB (assuming that the noiseless condition equals 30 dB). Alternatively: the degradation of the processed data at 5 dB equals the degradation of the unprocessed data at 23 dB SNR. The improvement of the SNR is more or less constant over a large range of SNRs: between 5 and -10 dB the improvement lies between 15 and 18 dB. Above 5 dB the improvement is limited by the choice of an SNR of 30 dB associated with the clean condition.

Figure 6.11. Development of the similarity between the masked target cochleogram and the noisy resynthesized versions for different types of noise. Above 10 dB there is little difference between the types of noise. The degrading effect of white noise is relatively weak because it is an inefficient masker of low-frequency information. Speech noise is an inefficient masker of the peaks in the spectrum. The minimal degradation difference between speech and white noise between -15 and -20 dB is attributable to the few remaining harmonics that are particularly resilient to speech noise.

Figure 6.11 shows that the degradation differences between the different noise types are small for SNRs above 10 dB. Because of their nonstationary character, babble and car factory noise degrade the signal relatively efficiently. The degrading effect of white noise is relatively weak at high noise levels because it is an inefficient masker of the low-frequency information that dominates the general form of the cochleogram. Speech noise is also a relatively inefficient masker because it is defined as aperiodic noise with a spectral envelope equal to the average spectrum of speech. This entails that strong formants are still able to dominate locally while the rest of the harmonics are masked.

The measure of similarity between the clean and the noisy cochleogram, as depicted in figure 6.10, correlates with the performance of standard HMM-based ASR systems trained on similar data. Figure 6.12 shows a prominent correlation between the similarity measures of this section and the results of the speech recognition experiment of section 2.13. This shows that an HMM-

Figure 6.12. A comparison between the similarity measurements (solid lines) and the recognition results of section 2.13 (dashed lines) shows a prominent qualitative correlation between both. The error scores of the HMM system are presented as fractions instead of percentages.

based system reacts to an increase in the distance between a reference pattern and a noisy input similarly to an inner-product-based similarity measure. This validates the choice of the normalized inner product as an indication of the quality of the auditory element estimation and selection technique of section 6.1.

A final note about the perceptive quality of the resynthesized sound. The perceptive quality depends in the first place on the amount of background noise that is mixed with the signal.⁴ When a signal is reconstructed with an all-pass mask (i.e., a mask that spans the whole place-time plane), only a direct comparison reveals a minimal perceptual difference due to the absence of frequencies above 6200 Hz. It is difficult to tell which of the two is the original. When the resynthesis is based on the clean signal in combination with the mask as estimated in 0 dB babble noise, the resynthesis sounds muffled, but still quite natural. This demonstrates that the noisy selection, on which the mask is based, includes the features that are perceptively most relevant, while major distortions are avoided.

4. These signals are available at:

If the resynthesis is based on the noisy signal and the mask estimated from the 0 dB SNR signal, it is intelligible, but the naturalness of the resynthesis is reduced due to distortions and spurious contributions. This reduction is attributed to the reduced presence of the background noise, which forces the auditory system to include distortions of the target signal in the target stream. Some distortions that are not included in the target stream are perceived as so-called musical notes (see section 2.11).

6.3 Robustness of Auditory Element Estimation

The previous section measured the similarity between the clean target cochleogram and a resynthesized cochleogram with the BM excitation represented by a mask. However, section 1.8 described a measure of success as:

1. identifying and describing, in terms of the temporal development of frequency and energy, the signal components of a clean target signal that are likely to have a high SNR,
2. selecting target signal components and discarding non-target components from a noisy version of the signal, and
3. determining that the selected signal components represent the same temporal development of frequency and energy as the clean target.

The first and second tasks are performed by the auditory element estimation technique that was described in section 6.1. Each auditory element represents at least a single signal component with an associated energy and local frequency development (in the case of a quasi-periodic component a local instantaneous frequency development as well). Because section 6.1 combined auditory element estimation and selection, tasks 1 and 2 are combined in a single technique. What remains is the quantification of task 3.

An energy-domain development of a signal component comprises a description of the temporal development of frequency and energy (definition 1.20). The position and height of the associated ridge indicate its frequency and energy, respectively. Consequently, when a ridge has been preserved, a signal component has been preserved. The use of ridges prevents the dominance of a single strong harmonic and ensures a reasonable weighting of individual harmonics. The use of ridges entails that harmonics are counted

Figure 6.13. Robust target mask and the intersection of the target mask with the set of ridges estimated from the clean signal.

once. However, a focus on ridges may lead to an underestimation of the linguistic importance of harmonic complexes at formant positions that may give rise to only a single ridge.

It is possible to use the mask estimated from the clean condition (see the upper panel of figure 6.8) as the reference mask. But because of the absence of masking effects by other sources, the clean condition is rather special and not a reliable indicator of the robust signal components of a target signal. The clean mask is therefore replaced by a mask based on cochleogram regions that were selected in 25% or more of arbitrarily chosen unmatched noise conditions above -10 dB SNR. This robust speech mask is depicted in the upper panel of figure 6.13. It is very similar to the mask derived from the clean condition, but approximately 10% of the low-energy regions and approximately 15% of the low-energy ridges are excluded. The energy development of the robust speech mask is based on the target cochleogram (clean condition). Resynthesized sound based on the robust speech mask sounds a bit muffled, but perfectly intelligible and natural.

A measure of the overlap of a lean ridge mask with a fat (in terms of the broadness in the spatial direction of the coherent acceptance areas)

A measure of the overlap between a lean ridge mask and a fat noisy speech mask (fat in terms of the broadness, in the spatial direction, of the coherent acceptance areas) allows a robust comparison. Small estimation errors in either the ridge mask or the noisy speech mask are unlikely to lead to differences in the fraction of overlap. Because the ridge mask corresponds to a set of individual signal components (representing the most important linguistic information), it can be demonstrated that it is possible to select signal components that represent the same temporal development of frequency and energy as the clean target in a range of different types of background noises. This comparison completes task 3 of this work.

The ridges in the robust speech mask are determined with the algorithm of section 2.6 and form a ridge template M_R as depicted in the lower panel of figure 6.13. The ridges are determined only for periodic signal contributions; the aperiodic contribution of the ridge mask (i.e., the signal component corresponding to the /T/) is copied from the robust speech mask. The ridge template M_R represents not only the most robust acoustic evidence of the target, but also pitch, formants, intonation and the main aperiodic contributions. Consequently, the most relevant linguistic information is represented (this is an implicit validation of the use of conjecture 1.2). When additional information about the width of formants (which can also be estimated from the noisy speech masks) is included, it is possible to reconstruct a synthetic cochleogram that resembles the original clean cochleogram with techniques as described in section 3.6. The robust and informative nature of the ridge mask will be used in a proposal for a robust recognition system in section 7.2.

Figure 6.14 shows an illustrative set of (very) noisy speech masks in red and their intersection with the ridge template, M_R^n = M_S^n ∩ M_R, in black. A comparison between the lower panel of figure 6.13 and the black lines of the upper left-hand panel shows that the most important ridges are selected in speech noise at an SNR of 0 dB (and above). This holds for the other noises as well, except for white noise, which masks the high-frequency region efficiently. At -5 dB babble noise, most of the ridges are still conserved. At -10 dB white noise, the low-frequency ridges are still relatively unimpaired, but the high-frequency region is masked completely. The regions of the noisy speech masks of the other noise conditions reduce in area and begin to split up into smaller regions. Yet the ridge template still shows a considerable overlap. Below -10 dB the masks break down into a large number of small subregions.
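The intersection M_R^n = M_S^n ∩ M_R and the resulting fraction of preserved ridge cells are straightforward to compute once both masks are available as binary matrices. A minimal sketch under that assumption (the function and variable names are illustrative, not taken from the thesis):

```python
import numpy as np

def ridge_overlap(noisy_mask, ridge_template):
    """Intersect a noisy speech mask with the ridge template and report
    the fraction of ridge-template cells that survive in noise."""
    m_s_n = np.asarray(noisy_mask, dtype=bool)     # M_S^n: noisy speech mask
    m_r = np.asarray(ridge_template, dtype=bool)   # M_R: ridge template
    m_r_n = m_s_n & m_r                            # M_R^n = M_S^n ∩ M_R
    fraction = m_r_n.sum() / max(m_r.sum(), 1)     # preserved fraction of M_R
    return m_r_n, fraction
```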

Figure 6.14. Examples of area selection that illustrate the robustness of the match between the ridge template M_R and the area selection technique of section 6.1. Panels: speech noise at 0 dB, babble noise at -5 dB, white noise at -10 dB and car factory noise at -15 dB; axes: frequency in Hz versus segment number (time in ms). The noisy mask M_S^n is depicted in red; the ridges common to the noisy mask and the ridge mask are depicted in black. With decreasing SNR the masks reduce in area and the overlap with the ridge mask reduces as well. Yet even at -15 dB car factory noise, evidence of some important ridges remains in the noisy masks.

M_R and M_R^n reflect patterns of ridges that, as in section 6.2, can be interpreted as matrices of ones and zeros. These matrices can be weighted with the nonlinearly compressed energy and interpreted as vectors. Their normalized dot product is computed as:

\cos(\alpha)(M_R, M_R^n) = \frac{M_R \cdot M_R^n}{\|M_R\| \, \|M_R^n\|} = \frac{M_R \cdot (M_S^n \cap M_R)}{\|M_R\| \, \|M_S^n \cap M_R\|} \qquad (6.15)

These values are normalized in the same way as in section 6.2. Table 6.3 summarizes the resulting similarities for the same set of noise conditions as were used in section 6.2. Figure 6.15 provides a graphical representation of the information in the table. The solid black line provides the unprocessed baseline similarity, as in figure 6.10. The dashed black line denotes the average similarity over all noise conditions. The average degradation of the similarity between an SNR of 20 dB and 0 dB is 0.06 and can be attributed to the disappearance of the least energetic ridges.
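Equation 6.15 amounts to a cosine similarity between energy-weighted mask vectors. A minimal sketch, assuming the nonlinearly compressed energy is available as a matrix of the same shape as the (boolean) masks; all names are illustrative:

```python
import numpy as np

def mask_similarity(ridge_template, noisy_mask, energy):
    """Cosine similarity (equation 6.15) between the energy-weighted ridge
    template M_R and its intersection with the noisy speech mask M_S^n.

    ridge_template, noisy_mask: boolean matrices [segments, frames];
    energy: nonlinearly compressed energy, same shape."""
    m_r = ridge_template.astype(float) * energy                    # weighted M_R
    m_r_n = (ridge_template & noisy_mask).astype(float) * energy   # weighted M_R^n
    a, b = m_r.ravel(), m_r_n.ravel()       # interpret matrices as vectors
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0
```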

The most informative ridges, those that define the formant positions, remain almost unimpaired in this SNR range. Even at -10 dB the average similarity is still substantial. This is possible because the system has a priori knowledge of the period contour, which allows the selection of a significant fraction of the strongest harmonics that dominate the similarity measure. The improvement of the signal-to-noise ratio compared to the unprocessed condition (based on comparing cochleograms), as in section 6.2, is more than 20 dB (for -5 dB SNR or worse).

There are some differences between the noise types. Babble noise, car factory noise and speech noise behave similarly; white noise behaves somewhat aberrantly because it has a flat spectrum, while the other noise spectra behave more like 1/f, which characterizes speech as well. This entails that white noise is an efficient masker of the high frequencies but an inefficient masker of low-frequency information.

Table 6.3. Similarities between the speech mask and the ridge mask for different noise types (reference, N=32; babble, white, car factory and speech noise, N=8 each). The second column gives the unprocessed similarity degradation. The last column shows the average degradation over all noise conditions. The degradation at -10 dB SNR is still minimal. Babble and speech noise have almost identical degradation curves for low noise levels.

The measure of distance in this section is related to a fundamentally different recognition approach than that of section 6.2. That section was related to a traditional automatic speech recognition approach based on the comparison of an estimate of complete spectra (or spectral envelopes) with stored (stochastic) templates.

Figure 6.15. Similarity between the noisy mask and the target ridge mask. The black line denotes (as in figure 6.10) the unprocessed baseline similarity between the target cochleogram and the noisy cochleogram. The dashed black line gives the average similarity between E_R^n = E_S^n M_R (the noisy energy restricted to the ridge template) and E_R. The colored lines provide the similarity for the different noise types (babble, white, car factory, speech); axes: measure of similarity versus SNR in dB. The difference with the unprocessed condition is approximately 25 dB (for -5 dB SNR or lower). The bar at -6 dB denotes the SNR below which it is unlikely (for most broadband noises) that a correct period contour can be estimated. The aberrant behavior of the white noise condition results because white noise is an efficient masker of high-frequency ridges, but an inefficient masker of low-frequency ridges. The curves of babble and speech noise overlap for positive SNRs.

Such an approach requires the transformation of a noisy input into a form with a reduced distance to the stored templates. The approach of this section is related to recognition systems in which word or syllable models use the estimated pitch to produce a detailed expectation (here represented by the ridge masks) that allows them to search actively for supporting and conflicting evidence. Such models are more versatile and more constraining than is usually implied by the term "model" in ASR research. In fact they behave much like the concept of a schema as used in the field of auditory scene analysis (Bregman 1990). Section 7.2 proposes a recognition system that is based on this type of actively searching model.
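The contrast with template-distance recognition can be made concrete: instead of normalizing the input toward stored templates, an actively searching model scores its own expectation against the noisy evidence. The sketch below is a speculative illustration of this idea, not the system proposed in section 7.2; a word model renders a pitch-conditioned ridge expectation and weighs supporting against conflicting cells in the noisy mask. All names, and the particular support/conflict score, are hypothetical.

```python
import numpy as np

def evidence_score(expected_ridges, noisy_mask):
    """Score a model-generated ridge expectation against a noisy speech mask.
    Supporting evidence: expected ridge cells present in the noisy mask.
    Conflicting evidence: expected ridge cells absent from the noisy mask."""
    expected = np.asarray(expected_ridges, dtype=bool)
    observed = np.asarray(noisy_mask, dtype=bool)
    support = np.sum(expected & observed)
    conflict = np.sum(expected & ~observed)
    return (support - conflict) / max(expected.sum(), 1)

# Hypothetical usage: each word model renders its expectation from the
# estimated pitch contour, and the best-scoring model wins.
# scores = {w: evidence_score(model.render(pitch_contour), noisy_mask)
#           for w, model in word_models.items()}
```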

CHAPTER 7 Overview and Discussion

This chapter discusses CPSP in the context of the theoretical framework of chapter 1. Section 7.1 starts with an overview of the main representations, their properties and some possible extensions. Section 7.2 proposes a method to apply these techniques to speech recognition in much more variable environments than is possible with current HMM-based technology. Finally, section 7.3 argues that the use of conjectures 1.1 and 1.2, on which this work is founded, leads to a system with properties consistent with human performance, and it draws some conclusions about the advantages and application domains of CPSP.

7.1 Overview of CPSP

The main goal of this work is, in accordance with conjecture 1.1, the formulation of a signal processing framework that allows recognition systems to function as often as possible in varying and uncontrollable acoustic environments. Following conclusion 1.14, the approach has been to start from the weakest possible prior assumptions.

For sounds, the weakest possible prior assumption is that a sound consists of signal components that each show an onset, an optional continuous development and an offset (definitions 1.9 and 1.20). Consequently, the decision was made to preserve continuity as long as possible and to postpone the application of quasi-stationarity to the moment it can be justified (conclusion 1.21). This led to the use of a transmission line model of the basilar membrane, the formulation of the cochleogram, and its generalization that includes periodicity: the Time Normalized Correlogram.

The TNC always reflects a superposition of two qualitatively different stable patterns: one associated with the aperiodic excitation of the corresponding BM region, the other associated with a periodic excitation. This allows the analysis of signals in terms of periodic and aperiodic components, for example to separate on- and offset transients from the steady-state behavior (section 4.4).

Some subsets of the TNC provide special information about the state of the BM. Figure 7.1 depicts examples of these subsets as lines superimposed on a typical periodic TNC. The red line at T=0 corresponds to the cochleogram and reflects the energy of the basilar membrane as a function of time and place. The yellow vertical line at T=46 ms marks the tuned autocorrelation and is related to the periodic excitation of the TNC. When a correct period contour is

Figure 7.1. Overview of the relations between the TNC and four special subsets. The red line at T=0 corresponds to the cochleogram, the yellow vertical line at T=46 ms marks the tuned autocorrelation, the green line corresponds to the characteristic period correlation (according to definition 4.7) and the horizontal blue line at segment 32 reflects the running autocorrelation along a ridge. The TNC structure in the background is an example of the periodic TNC. Axes: segment number (frequency in Hz) versus period in ms.
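The subsets of Figure 7.1 are simply slices of the TNC, viewed as a three-dimensional array over segments, time and period lag. A minimal sketch of how they could be read out, assuming a TNC array tnc[segment, frame, lag] and a lag step dt; the array layout and all names are assumptions, not the thesis implementation, and the characteristic period correlation (definition 4.7) is omitted because its definition is not reproduced here.

```python
import numpy as np

def tnc_subsets(tnc, period_contour, ridge_segment, dt):
    """Extract three of the four special TNC subsets of Figure 7.1.

    tnc: array [segments, frames, lags]; dt: lag step in seconds;
    period_contour: estimated fundamental period (s) per frame."""
    n_seg, n_frames, n_lags = tnc.shape
    cochleogram = tnc[:, :, 0]                       # energy at lag T = 0
    lag_idx = np.clip((period_contour / dt).round().astype(int), 0, n_lags - 1)
    tuned_ac = tnc[:, np.arange(n_frames), lag_idx]  # lag follows the period contour
    ridge_ac = tnc[ridge_segment, :, :]              # running autocorrelation along a ridge
    return cochleogram, tuned_ac, ridge_ac
```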
