SOUND SOURCE RECOGNITION AND MODELING
CASA seminar, summer 2000
Antti Eronen, antti.eronen@tut.fi

Contents:
- Basics of human sound source recognition
- Timbre
- Voice recognition
- Recognition of environmental sounds and events
- Musical instrument recognition
Human sound source recognition abilities
- The different acoustic properties of sound-producing objects enable us to recognize sound sources by listening
- These properties are the result of the production process
- The produced sound waves are slightly different at each sound event
- Acoustic properties change over time
- The acoustic world is linear: sound waves from different sources combine together
- The combination and interaction of the properties of single objects in the mix generate new, emergent properties belonging to the larger sound-producing system
Timbre (äänen väri, "the colour of sound")
- The perceptual quality of objects and events; that is, what something sounds like
- ANSI 1973: the quality of a sound by which a listener can tell that two sounds of the same loudness and pitch are dissimilar
- Many stable and time-varying acoustic properties affect timbre
- It is unlikely that any one property, or fixed combination of properties, uniquely determines timbre
- The sense of timbre comes from the emergent, interactive properties of the vibration pattern
- Identification is the result of
  - the apprehension of acoustical invariants (the bowing of a violin sounds like this)
  - inferences made according to learned experience (we learn what the violin sounds like in different acoustic environments)
Source-filter model of sound production
- The source is excited by energy to generate a vibration pattern
- The filter acts as a resonator, having different vibration modes
- Each mode can be characterized by its resonant frequency and by its damping, or quality factor Q
- When the excitation is imposed on the filter, the filter modifies the relative amplitudes of the components of the source input
- This results in peaks in the frequency spectrum of the signal at the resonant frequencies
- The damping of a vibration mode is a measure of its sharpness of tuning and of its temporal response
- A lightly damped mode (high Q) results in a sharp peak in the spectrum and a long decay in time (and vice versa)
- We can hear both the change in the sound spectrum and the time differences (if they are more than a few milliseconds)
- The final sound is the combined result of the excitation, the resonators, and the radiation characteristics
- In sound-producing mechanisms that can be modeled as linear systems, the transfer function of the resulting signal is the product of the transfer functions of the partial systems (if they are in cascade); mathematically

    Y(z) = X(z) * prod_{i=1..N} H_i(z),    (1)

  where Y(z) and X(z) are the z-transforms of the output and excitation signals, respectively, and H_i(z), i = 1, ..., N, are the transfer functions of the N subsystems (for instance, the vocal tract and the reflections at the lips)
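A small sketch of Eq. 1 for the FIR case: cascading two linear filters is equivalent to one filter whose transfer function is the product H_1(z)H_2(z), which for coefficient sequences means polynomial (convolution) multiplication. The filter coefficients and test signal below are arbitrary illustrative values, not from the seminar.

```python
def convolve(a, b):
    """Discrete convolution of two coefficient sequences (polynomial product)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def fir_filter(h, x):
    """Full convolution of signal x with FIR coefficients h (zero initial state)."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if 0 <= n - k < len(x))
            for n in range(len(x) + len(h) - 1)]

h1 = [1.0, 0.5]         # hypothetical subsystem 1
h2 = [1.0, -0.3, 0.2]   # hypothetical subsystem 2
x = [1.0, 2.0, 0.5, -1.0]

# Filtering through the cascade equals filtering once with the
# combined impulse response h1 * h2.
cascade = fir_filter(h2, fir_filter(h1, x))
combined = fir_filter(convolve(h1, h2), x)
```

The two outputs agree sample by sample, which is Eq. 1 evaluated in the time domain.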
Machine sound source recognition
A good sound source recognition system should:
- Exhibit generalization. Different instances of the same kind of sound should be recognized as similar (for instance, musical instruments played in different environments or by different players).
- Handle real-world complexity. It should work with realistic recording conditions, with noise, reverberation and even competing sound sources.
- Be scalable. It should be able to learn to recognize additional sounds without a severe effect on performance.
- Exhibit graceful degradation. The system's performance should worsen gradually as noise, the degree of reverberation and the number of competing sound sources increase.
- Employ a flexible learning strategy. It should be able to introduce new categories as necessary and refine its classification criteria.
- Be simple and computationally efficient. Of two systems performing equally well, the simpler one is better (memory and processing requirements, and how easy it is to understand how the system works).
A typical sound source recognition system
- Preprocessing (filtering, noise removal)
- Feature extraction
- Training and learning (supervised or unsupervised)
- Classification (pattern recognition, neural networks, stochastic models)
- Is able to work with a limited number of sound classes and limited test data
Features for sound source recognition
Frequency spectrum
- The spectral centroid measures the spectral energy distribution and corresponds to the perceived brightness:

    f_c = sum_{k=1..N} A(k) f(k) / sum_{k=1..N} A(k),    (2)

  where k indexes the spectral components, and A(k) and f(k) are the amplitude and frequency of component k, respectively.
- A normalized version can also be used, f_cnorm = f_c / f0, where f0 is the fundamental frequency of a harmonic sound.
- The centroid is the first moment of the spectrum; higher-order moments have also been used as features.
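A minimal sketch of Eq. 2, using a direct DFT (no external libraries) to get the amplitudes A(k); the test tone, sample rate and frame length are illustrative choices:

```python
import math

def dft_magnitudes(x):
    """Magnitude spectrum of a real signal via a direct DFT (first N/2 bins)."""
    n = len(x)
    mags = []
    for k in range(n // 2):
        re = sum(x[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(x[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    return mags

def spectral_centroid(x, sample_rate):
    """Amplitude-weighted mean frequency of the spectrum (Eq. 2)."""
    mags = dft_magnitudes(x)
    freqs = [k * sample_rate / len(x) for k in range(len(mags))]
    return sum(a * f for a, f in zip(mags, freqs)) / sum(mags)

# A pure 50 Hz sine: all spectral energy sits at 50 Hz,
# so the centroid should come out at about 50 Hz.
sr = 500
x = [math.sin(2 * math.pi * 50 * t / sr) for t in range(sr)]
```

A brighter sound (more energy at high harmonics) would pull the centroid upward.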
The power spectrum across a set of critical bands or successive frequency regions
- The power spectrum of a signal x(n) is the Fourier transform of its autocorrelation sequence r(n):

    P(w) = sum_{n=-inf..inf} r(n) e^{-jwn}.    (3)

- It can equivalently be calculated as the magnitude squared of the Fourier transform of the signal x(n):

    P(w) = |X(w)|^2,    (4)

  where

    X(w) = sum_{n=-inf..inf} x(n) e^{-jwn}.    (5)
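The equivalence of Eqs. 3 and 4-5 can be checked numerically; for a finite signal the discrete analogue uses the circular autocorrelation. The test signal below is arbitrary:

```python
import cmath

def dft(x):
    """Direct discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def circular_autocorr(x):
    """r(m) = sum_n x(n) x((n+m) mod N), the circular autocorrelation."""
    n = len(x)
    return [sum(x[t] * x[(t + m) % n] for t in range(n)) for m in range(n)]

x = [1.0, 2.0, 0.5, -1.0, 0.0, 1.5, -0.5, 0.25]

# Route 1: DFT of the autocorrelation sequence (Eq. 3)
p1 = [c.real for c in dft(circular_autocorr(x))]
# Route 2: magnitude-squared DFT of the signal itself (Eqs. 4-5)
p2 = [abs(c) ** 2 for c in dft(x)]
```

Both routes produce the same power spectrum, bin by bin.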
Spectral irregularity
- Corresponds to the deviation of the time-averaged harmonic amplitudes from a smooth spectral envelope:

    IRR = 20 log sum_{k=2..N-1} | A_k - (A_{k-1} + A_k + A_{k+1}) / 3 |    (6)

Even and odd harmonic content in the signal spectrum
- Even harmonic content:

    h_ev = (A_2^2 + A_4^2 + A_6^2 + ...) / (A_1^2 + A_2^2 + A_3^2 + ...)
         = sum_{k=1..M} A_{2k}^2 / sum_{n=1..N} A_n^2,  M = floor(N/2).    (7)
- Odd harmonic content:

    h_odd = (A_1^2 + A_3^2 + A_5^2 + ...) / (A_1^2 + A_2^2 + A_3^2 + ...)
          = sum_{k=0..L} A_{2k+1}^2 / sum_{n=1..N} A_n^2,  L = ceil(N/2) - 1.    (8)

Formants
- Spectral prominences created by one or more resonances in the filter of the sound source
- A robust set of features for describing the formant structure are the cepstral coefficients
- The cepstrum of a signal x(n) is defined as

    c(n) = F^{-1}{ log F{ x(n) } }    (9)
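A sketch of Eq. 9 as the real cepstrum, i.e. using the log magnitude spectrum (which discards phase); the DFT is computed directly so the example needs no external libraries:

```python
import cmath, math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def real_cepstrum(x):
    """c(n) = IDFT{ log |DFT{x}| } (Eq. 9 with the log magnitude)."""
    log_mag = [math.log(abs(c) + 1e-12) for c in dft(x)]  # small floor avoids log(0)
    return [c.real for c in idft(log_mag)]

# A unit impulse has a flat magnitude spectrum, so its log spectrum is
# (essentially) zero everywhere and the cepstrum is near zero.
cep = real_cepstrum([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```

The low-order cepstral coefficients describe the smooth spectral envelope, which is why they capture formant structure.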
- In practice the coefficients may be obtained, for instance, with linear prediction (LP)
- In LP, the filter of the sound source is approximated with an all-pole filter G / A(z)

  [Figure: the normalized input u(n) (pulse train or white noise) is scaled by the gain G and filtered with the all-pole filter 1/A(z), producing the output s(n).]

- The coefficients of the all-pole filter can be solved from the signal
- These coefficients describe the magnitude spectrum of the sound source's filter
- The coefficients are converted into cepstral coefficients, which behave nicely for recognition purposes
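As a sketch (not from the seminar), the LP coefficients can be solved from the autocorrelation sequence with the Levinson-Durbin recursion, and then converted to cepstral coefficients with the standard recursion c_n = -a_n - (1/n) sum_k k c_k a_{n-k}. The all-pole model used as test data is a hypothetical example:

```python
def autocorr(x, order):
    """Autocorrelation r(0..order) of a finite signal."""
    n = len(x)
    return [sum(x[t] * x[t + m] for t in range(n - m)) for m in range(order + 1)]

def levinson_durbin(r, order):
    """Solve the all-pole coefficients of A(z) = 1 + a1 z^-1 + ... + ap z^-p."""
    a = [0.0] * (order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err

def lpc_to_cepstrum(a, n_cep):
    """Convert LP coefficients to cepstral coefficients c_1..c_n_cep."""
    p = len(a) - 1
    c = [0.0] * (n_cep + 1)
    for n in range(1, n_cep + 1):
        an = a[n] if n <= p else 0.0
        c[n] = -an - sum(k * c[k] * (a[n - k] if n - k <= p else 0.0)
                         for k in range(1, n)) / n
    return c[1:]

# Hypothetical all-pole model A(z) = 1 - 0.9 z^-1 + 0.5 z^-2;
# its impulse response is the "recorded" signal.
true_a = [1.0, -0.9, 0.5]
s = []
for n in range(200):
    u = 1.0 if n == 0 else 0.0
    s.append(u - sum(true_a[k] * s[n - k] for k in range(1, 3) if n - k >= 0))

a_est, _ = levinson_durbin(autocorr(s, 2), 2)
cep = lpc_to_cepstrum(a_est, 10)
```

LP analysis of the model's own impulse response recovers the coefficients, and the first cepstral coefficient equals -a_1.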
[Figure: magnitude spectrum (dB) of a 40 ms frame of a guitar tone, and an approximating LPC spectrum of order 15; frequency axis 0-25 kHz.]
[Figure: average LPC spectra (dB, 0-25 kHz) of a violin tone and a trumpet tone, respectively.]
Onset and offset transients
- Rise time (the duration of the attack): the time interval between the onset and the instant of maximal amplitude
- Usually some kind of energy thresholds are used to locate these points from an overall amplitude envelope
Onset asynchrony
- Calculate the individual rise times of different harmonics or different frequency ranges
- Onset harmonic skew: a linear fit to the onset times of the harmonic partials as a function of frequency
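A minimal rise-time sketch: compute a crude amplitude envelope and locate the onset and the near-maximum point with energy thresholds. The 10%/90% thresholds, window size and test tone are illustrative assumptions, not values from the seminar:

```python
import math

def amplitude_envelope(x, win=32):
    """Crude amplitude envelope: moving average of the rectified signal."""
    return [sum(abs(x[j]) for j in range(max(0, i - win), i + 1)) /
            (i - max(0, i - win) + 1) for i in range(len(x))]

def rise_time(env, sample_rate, lo=0.1, hi=0.9):
    """Time from the first crossing of lo*max to the first crossing of hi*max."""
    peak = max(env)
    t_on = next(i for i, v in enumerate(env) if v >= lo * peak)
    t_hi = next(i for i, v in enumerate(env) if v >= hi * peak)
    return (t_hi - t_on) / sample_rate

# Synthetic 50 Hz tone with a 100-sample (0.1 s) linear attack
sr = 1000
x = [min(1.0, n / 100) * math.sin(2 * math.pi * 50 * n / sr) for n in range(400)]
rt = rise_time(amplitude_envelope(x), sr)
```

For this tone the measured rise time lands near the 0.1 s attack, shortened somewhat because the thresholds sit inside the ramp.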
Modulations
Frequency modulation
- Vibrato (periodic), jitter (random)
- Difficult to measure reliably
- Presence/absence/degree of periodic and random modulations
Amplitude modulation
- Tremolo
- Presence/absence/degree of periodic and random modulations
- These features can be extracted from an amplitude envelope
[Figure: intensity (dB) as a function of Bark frequency (0-20 Bark) and time (0-1 s) for flute, violin, trumpet and clarinet tones.]
Classification
Pattern recognition
- The data is presented as N-dimensional feature vectors, which are assigned to different classes or clusters
- Supervised classification: the input pattern is identified as a member of a predefined class
- Unsupervised classification: the pattern is assigned to a hitherto unknown class (e.g. clustering)

[Figure: scatter plot of two-dimensional feature vectors (Feature 1 vs. Feature 2) forming two separable clusters, class1 and class2.]
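As a toy illustration of supervised classification on such 2-D feature vectors (a nearest-centroid classifier, with made-up training data; not a method from the seminar):

```python
def centroid(points):
    """Mean of a list of equal-length feature vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def classify(x, centroids):
    """Assign x to the class with the nearest centroid (squared Euclidean)."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(centroids, key=lambda label: dist2(x, centroids[label]))

# Hypothetical 2-D training data for two predefined classes
train = {
    "class1": [[0.4, 0.9], [0.6, 1.1], [0.5, 1.0]],
    "class2": [[3.0, 2.8], [3.2, 3.1], [2.9, 3.0]],
}
cents = {label: centroid(pts) for label, pts in train.items()}
```

An unsupervised method would instead discover the two clusters from the unlabeled points, e.g. with k-means.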
Speaker recognition
- The most studied sound source recognition problem
- Two tasks: recognition and verification
Three major approaches:
Long-term averages of acoustic features
- Averages out the phonetic variations affecting the features, leaving only the speaker-dependent component
- The earliest approach; has been used successfully in demanding applications
- Discards much speaker-dependent information
- Can require long speech utterances to derive stable long-term statistics
Modeling the speaker-dependent features within phonetic sounds
- Compares similar phonetic sounds in the training and test utterances
- Explicit segmentation: a hidden Markov model (HMM)-based continuous speech recognizer as a front end -> little or no improvement in performance
- Implicit segmentation: unsupervised clustering of acoustic features during training and recognition (Gaussian mixture models, GMM)
Discriminative neural networks (NN)
- NNs are trained to model the decision function which best discriminates speakers within a known set
Problems
- Fundamental frequency information is not used
- Speech rhythm is not used
- Lack of generality: the systems do not work well when acoustic conditions vary from those used in training
- Cannot deal with mixtures of sounds
- Performance suffers as the population size grows
Case: Reynolds 1995
- 20 mel-frequency cepstral coefficients from 20 ms frames
- Given a recorded utterance, a probabilistic model is formed based on Gaussian distributions
- Motivations for using Gaussian mixture models (GMM) in speaker recognition:
  - The individual component Gaussians in a speaker-dependent GMM can be interpreted as representing broad acoustic classes
  - A Gaussian mixture density is able to model the long-term sample distribution smoothly
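A sketch of GMM-based speaker identification: score each frame of the test utterance against each speaker's mixture density and pick the speaker with the highest total log-likelihood. The two-component, 2-D models and the "utterance" below are made-up toy parameters, not Reynolds' trained models (which use EM-trained mixtures over MFCC frames):

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, weights, means, vars_):
    """log p(x | GMM) via log-sum-exp over the mixture components."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, vars_)]
    mx = max(logs)
    return mx + math.log(sum(math.exp(l - mx) for l in logs))

def identify(frames, speaker_models):
    """Pick the speaker whose GMM gives the highest total log-likelihood."""
    def score(name):
        w, m, v = speaker_models[name]
        return sum(gmm_log_likelihood(x, w, m, v) for x in frames)
    return max(speaker_models, key=score)

# Hypothetical models for two speakers: (weights, means, variances)
models = {
    "alice": ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]),
    "bob":   ([0.5, 0.5], [[4.0, 4.0], [5.0, 5.0]], [[0.5, 0.5], [0.5, 0.5]]),
}
frames = [[0.1, 0.2], [0.9, 1.1], [0.4, 0.5]]  # test frames near alice's model
```

Summing per-frame log-likelihoods treats the frames as independent, which is the standard assumption in this kind of scoring.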
The performance of the system depends on
- The noise characteristics of the signal
- The population size
Results
- Nearly perfect performance with pristine recordings (630 talkers)
- Under varying acoustic conditions (e.g. using different telephone handsets during testing and training):
  - 94% with a population of 10 talkers
  - 83% with a population of 113 talkers
Automatic noise recognition
Case: Gaunard 1998
- Car, truck, moped, aircraft, train
- 12 cepstral coefficients from 50-100 ms frames
- 1-5 state HMMs
- Recognition performance:
  - 90-95% with cepstral coefficients as features
  - 80% with a 1/3-octave filter bank as front end
Case: El-Maleh 1999
- Frame-level noise classification for mobile environments
- Car, voice babble, street, bus and factory noise
- Line spectral frequencies (LSFs) based on order-10 LPC analysis as features
- 89% average performance
- Shows some ability to generalize, and some robustness:
  - New noises were classified as the most similar training noises (restaurant noise (babble, music) -> babble or bus noise)
  - Human speech-like noise (superimposed independent speech signals) was classified as speech when the number of superimpositions was low
  - As the number of superimposed signals increased, it was classified more often as babble than as speech
Musical instrument recognition
Difficulties
- Wide pitch ranges
- Variety of playing techniques
- The properties of the sounds may change completely with different techniques and different notes
- Interfering sounds in polyphony
- Different recording conditions
- Differences between instrument pieces (a Stradivarius vs. a cheap violin)
Psychological research as a starting point for finding features
- Lots of work has been done to resolve what it is that makes musical instrument sounds distinguishable (timbre)
- This knowledge has been used in musical instrument recognition systems
- There is also lots of work on human voices
- Much less is known about environmental sounds
The state of the art is still quite modest
- Good results with isolated tones, but with only one example of each particular instrument
- Good results with monophonic phrases, but with only four instruments
- Not so good results with monophonic phrases with several instruments
- Some first attempts towards polyphonic recognition