SOUND SOURCE RECOGNITION AND MODELING

CASA seminar, summer 2000
Antti Eronen, antti.eronen@tut.fi

Contents:
- Basics of human sound source recognition
- Timbre
- Voice recognition
- Recognition of environmental sounds and events
- Musical instrument recognition

Human sound source recognition abilities

- The different acoustic properties of sound-producing objects enable us to recognize sound sources by listening.
- These properties are the result of the production process: the produced sound waves differ at each event, and the acoustic properties change over time.
- The acoustic world is linear: sound waves from different sources combine, resulting in larger (composite) sound sources.
- The combination and interaction of the properties of the single objects in the mix generate new, emergent properties belonging to the larger sound-producing system.

Timbre (Finnish: äänen väri, "the color of sound")

- The perceptual quality of objects and events; that is, what something sounds like.
- ANSI 1973: "The quality of a sound by which a listener can tell that two sounds of the same loudness and pitch are dissimilar."
- Many stable and time-varying acoustic properties affect timbre; it is unlikely that any one property or combination of properties uniquely determines it.
- The sense of timbre comes from the emergent, interactive properties of the vibration pattern.
- Identification is the result of
  - the apprehension of acoustical invariants (the bowing of a violin sounds like this), and
  - inferences made according to learned experience (we learn how a violin sounds in different acoustic environments).

Source-filter model of sound production

- The source is excited by energy to generate a vibration pattern.
- The filter acts as a resonator with different vibration modes. Each mode can be characterized by its resonant frequency and by its damping, or quality factor Q.
- When the excitation is imposed on the filter, it modifies the relative amplitudes of the components of the source input. This results in peaks in the frequency spectrum of the signal at the resonant frequencies.
- The damping of a vibration mode is a measure of its sharpness of tuning and of its temporal response: a lightly damped mode (high Q) results in a sharp peak in the spectrum and a longer decay (ringing) in the signal, and vice versa.

- We can hear both the change in the sound spectrum and the time differences (if they are more than a few milliseconds).
- The final sound is the combined result of the effects of the excitation, the resonators, and the radiation characteristics.
- In sound-producing mechanisms that can be modeled as linear systems, the transfer function of the resulting signal is the product of the transfer functions of the partial systems (if they are in cascade); mathematically,

  Y(z) = X(z) \prod_{i=1}^{N} H_i(z),    (1)

  where Y(z) and X(z) are the z-transforms of the output and excitation signal, respectively, and H_i(z), i = 1, ..., N, are the transfer functions of the N subsystems (for instance, the vocal tract and the reflections at the lips).
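A minimal sketch of the cascade property in Eq. (1): filtering a source signal through two resonators in series gives the same output as filtering it once with the product of their transfer functions. The resonator frequencies and Q values below are arbitrary illustration choices, not values from the seminar.

```python
import numpy as np
from scipy import signal

fs = 16000                        # sample rate [Hz]
rng = np.random.default_rng(0)
x = rng.standard_normal(fs)       # white-noise excitation ("source")

def resonator(f0, q, fs):
    """Second-order resonant (peaking) filter as (b, a) coefficients."""
    return signal.iirpeak(f0, q, fs=fs)

b1, a1 = resonator(500.0, 10.0, fs)    # lightly damped mode: sharp peak
b2, a2 = resonator(1500.0, 2.0, fs)    # heavily damped mode: broad peak

# Cascade in the time domain: filter with H1, then with H2.
y_cascade = signal.lfilter(b2, a2, signal.lfilter(b1, a1, x))

# Product of transfer functions: convolve numerators and denominators.
b12, a12 = np.convolve(b1, b2), np.convolve(a1, a2)
y_product = signal.lfilter(b12, a12, x)

print(np.allclose(y_cascade, y_product))   # True: Y(z) = X(z) H1(z) H2(z)
```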

Machine sound source recognition

A good sound source recognition system should:

- Exhibit generalization. Different instances of the same kind of sound should be recognized as similar (for instance, musical instruments played in different environments or by different players).
- Handle real-world complexity. It should work under realistic recording conditions, with noise, reverberation, and even competing sound sources.
- Be scalable. It should be able to learn to recognize additional sounds without a large effect on its performance.
- Exhibit graceful degradation. The system's performance should worsen gradually as the noise, the degree of reverberation, and the number of competing sound sources increase.

- Employ a flexible learning strategy. It should be able to introduce new categories as necessary and refine its classification criteria.
- Be simple and computationally efficient. Of two systems performing equally well, the simpler one is better (in memory or processing requirements, and in how easy it is to understand how the system works).

A typical sound source recognition system

- Preprocessing (filtering, noise removal)
- Feature extraction
- Training and learning (supervised or unsupervised)
- Classification (pattern recognition, neural networks, stochastic models)

Typical systems are able to work with only a limited number of sound classes and limited test data. A toy sketch of this pipeline follows below.
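A minimal, hypothetical sketch of the pipeline above: preprocess, extract features, train, classify. The features (frame RMS energy and spectral centroid) and the nearest-class-mean classifier are illustration choices, not the seminar's method.

```python
import numpy as np

def preprocess(x):
    return x - np.mean(x)                     # remove DC offset

def extract_features(x, frame=1024, fs=16000):
    feats = []
    for i in range(0, len(x) - frame, frame):
        seg = x[i:i + frame] * np.hanning(frame)
        mag = np.abs(np.fft.rfft(seg))
        freqs = np.fft.rfftfreq(frame, 1.0 / fs)
        centroid = np.sum(freqs * mag) / (np.sum(mag) + 1e-12)
        rms = np.sqrt(np.mean(seg ** 2))
        feats.append([rms, centroid])
    return np.array(feats)

def train(examples):                          # examples: {label: [signals]}
    return {lab: np.mean(np.vstack([extract_features(preprocess(x))
                                    for x in xs]), axis=0)
            for lab, xs in examples.items()}

def classify(model, x):                       # nearest class mean
    f = np.mean(extract_features(preprocess(x)), axis=0)
    return min(model, key=lambda lab: np.linalg.norm(model[lab] - f))

rng = np.random.default_rng(0)
examples = {"tone": [np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)],
            "noise": [rng.standard_normal(16000)]}
model = train(examples)
print(classify(model, rng.standard_normal(16000)))   # -> "noise"
```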

Features for sound source recognition

Frequency spectrum

- The spectral centroid measures the spectral energy distribution and corresponds to the perceived brightness:

  f_c = \frac{\sum_{k=1}^{N} A(k) f(k)}{\sum_{k=1}^{N} A(k)},    (2)

  where k is the index of the spectral component and A(k) and f(k) are its amplitude and frequency, respectively.
- A normalized version can also be used, f_{c,norm} = f_c / f_0, where f_0 is the fundamental frequency of a harmonic sound.
- The spectral centroid is the first moment of the spectrum; higher-order moments have also been used as features. A sketch of the centroid computation follows.
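A minimal sketch of Eq. (2): the spectral centroid of one analysis frame, computed from FFT magnitudes. The test tones are arbitrary illustrations.

```python
import numpy as np

def spectral_centroid(frame, fs):
    """Amplitude-weighted mean frequency of a signal frame (Eq. 2)."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))   # A(k)
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)               # f(k)
    return np.sum(mag * freqs) / np.sum(mag)

fs = 16000
t = np.arange(2048) / fs
bright = np.sin(2 * np.pi * 440 * t) + 0.8 * np.sin(2 * np.pi * 3520 * t)
dull = np.sin(2 * np.pi * 440 * t)
print(spectral_centroid(bright, fs) > spectral_centroid(dull, fs))  # True

# Normalized version: divide by the fundamental of a harmonic sound.
f0 = 440.0
print(spectral_centroid(bright, fs) / f0)
```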

The power spectrum, computed across a set of critical bands or successive frequency regions

- The power spectrum of a signal x(n) is the Fourier transform of its autocorrelation sequence r(n):

  P(\omega) = \sum_{n=-\infty}^{\infty} r(n) e^{-j\omega n}.    (3)

- It can also be calculated as the magnitude squared of the Fourier transform of the signal x(n):

  P(\omega) = |X(\omega)|^2,    (4)

  where

  X(\omega) = \sum_{n=-\infty}^{\infty} x(n) e^{-j\omega n}.    (5)
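A minimal sketch verifying Eqs. (3)-(5) on a finite frame: the DFT of the autocorrelation sequence equals the squared magnitude of the (zero-padded) DFT of the signal.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(256)
N = len(x)

# Eq. (3): Fourier transform of the autocorrelation sequence
# r(n) = sum_m x(m) x(m+n), computed directly with np.correlate.
r = np.correlate(x, x, mode="full")                   # lags -(N-1) .. (N-1)
P_from_r = np.fft.fft(np.roll(r, -(N - 1))).real      # wrap lag 0 to index 0

# Eqs. (4)-(5): magnitude squared of the Fourier transform of x(n),
# evaluated on the same 2N-1 point frequency grid.
P_direct = np.abs(np.fft.fft(x, 2 * N - 1)) ** 2

print(np.allclose(P_from_r, P_direct))                # True
```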

Spectral irregularity

- Corresponds to the standard deviation of the time-averaged harmonic amplitudes from a spectral envelope:

  IRR = 20 \log_{10} \sum_{k=2}^{N-1} \left| A_k - \frac{A_{k+1} + A_k + A_{k-1}}{3} \right|.    (6)

Even and odd harmonic content in the signal spectrum

- Even harmonic content:

  h_{ev} = \frac{A_2^2 + A_4^2 + A_6^2 + \cdots}{A_1^2 + A_2^2 + A_3^2 + \cdots} = \frac{\sum_{k=1}^{M} A_{2k}^2}{\sum_{n=1}^{N} A_n^2}, \quad M = \frac{N}{2}.    (7)

- Odd harmonic content (computed in the sketch following this slide):

  h_{odd} = \frac{A_1^2 + A_3^2 + A_5^2 + \cdots}{A_1^2 + A_2^2 + A_3^2 + \cdots} = \frac{\sum_{k=0}^{L} A_{2k+1}^2}{\sum_{n=1}^{N} A_n^2}, \quad L = \frac{N}{2} - 1.    (8)

Formants

- Spectral prominences created by one or more resonances in the filter of the sound source.
- A robust feature for measuring formants is the set of cepstral coefficients. The cepstrum of a signal x(n) is defined as

  c(n) = \mathcal{F}^{-1}\{\log \mathcal{F}\{x(n)\}\}.    (9)
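A minimal sketch of Eqs. (6)-(8), computing spectral irregularity and even/odd harmonic content from a vector of measured harmonic amplitudes A_1..A_N. The amplitude values are arbitrary illustration data, not from the seminar.

```python
import numpy as np

def irregularity(A):
    """Eq. (6): deviation of harmonic amplitudes from a 3-point envelope."""
    dev = np.abs(A[1:-1] - (A[:-2] + A[1:-1] + A[2:]) / 3.0)   # k = 2..N-1
    return 20.0 * np.log10(np.sum(dev))

def even_odd_content(A):
    """Eqs. (7)-(8): energy share of even and odd harmonics (they sum to 1)."""
    total = np.sum(A ** 2)
    h_ev = np.sum(A[1::2] ** 2) / total    # A_2, A_4, ... (0-indexed array)
    h_odd = np.sum(A[0::2] ** 2) / total   # A_1, A_3, ...
    return h_ev, h_odd

# A clarinet-like spectrum: odd harmonics dominate.
A = np.array([1.0, 0.05, 0.6, 0.04, 0.35, 0.03, 0.2, 0.02])
print(irregularity(A))
print(even_odd_content(A))   # h_odd clearly larger than h_ev
```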

- In practice the coefficients may be obtained, for instance, with linear prediction (LP).
- In LP, the filter of the sound source is approximated with an all-pole filter:

  [Figure: the normalized input u(n) (a pulse train or white noise) is scaled by the gain G and filtered with the all-pole filter 1/A(z) to produce s(n).]

- The coefficients of the all-pole filter can be solved; these coefficients describe the magnitude spectrum of the sound source filter.
- The coefficients are then converted into cepstral coefficients, which behave nicely for recognition purposes. A sketch of the procedure follows.
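A minimal sketch (not the seminar's exact recipe): estimate all-pole (LPC) coefficients for one frame by the autocorrelation method, then convert them to cepstral coefficients with the standard LPC-to-cepstrum recursion. The synthetic "guitar-like" test tone is an assumption for illustration.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, order):
    """Autocorrelation-method LPC: returns a_1..a_p of A(z) = 1 + sum a_k z^-k."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    alpha = solve_toeplitz(r[:order], r[1:order + 1])   # normal equations
    return -alpha                                       # sign convention for A(z)

def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum of the all-pole model 1/A(z) via the recursion
    c_n = -a_n - sum_{k=1}^{n-1} (k/n) c_k a_{n-k}."""
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = -(a[n - 1] if n <= p else 0.0)
        for k in range(1, n):
            if n - k <= p:
                acc -= (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c

# Example: a 40 ms frame of a synthetic decaying-harmonic "guitar-like" tone.
fs = 16000
t = np.arange(int(0.040 * fs)) / fs
x = sum((0.8 ** i) * np.sin(2 * np.pi * 196 * (i + 1) * t) for i in range(10))
a = lpc(x * np.hamming(len(x)), order=15)
print(lpc_to_cepstrum(a, n_ceps=12))
```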

[Figure: magnitude spectrum (dB vs. frequency in Hz) of a 40 ms frame of a guitar tone, with an approximating LPC spectrum of order 15.]

[Figure: average LPC spectra (dB vs. frequency in Hz) of a violin tone (top) and a trumpet tone (bottom).]

Onset and offset transients

- Rise time (the duration of the attack): the time interval between the onset and the instant of maximal amplitude. Usually some kind of energy thresholds are used to locate these points on an overall amplitude envelope (see the sketch below).
- Onset asynchrony: calculate the individual rise times of different harmonics or different frequency ranges.
- Onset harmonic skew: a linear fit to the onset times of the harmonic partials as a function of frequency.
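A minimal sketch of rise-time measurement: build an amplitude envelope and locate onset and peak with an energy threshold. The 10% onset threshold and the envelope smoothing length are arbitrary illustration choices.

```python
import numpy as np

def rise_time(x, fs, onset_frac=0.1, smooth_ms=5.0):
    """Time from the envelope first exceeding onset_frac * peak to the peak."""
    win = max(1, int(fs * smooth_ms / 1000.0))
    env = np.convolve(np.abs(x), np.ones(win) / win, mode="same")  # envelope
    peak_idx = int(np.argmax(env))
    above = np.nonzero(env[:peak_idx + 1] >= onset_frac * env[peak_idx])[0]
    onset_idx = int(above[0]) if above.size else 0
    return (peak_idx - onset_idx) / fs

# Synthetic tone with a 50 ms linear attack and an exponential decay.
fs = 16000
t = np.arange(fs) / fs
attack = np.clip(t / 0.05, 0.0, 1.0)
x = attack * np.exp(-3 * t) * np.sin(2 * np.pi * 440 * t)
print(rise_time(x, fs))   # close to the 50 ms attack duration
```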

Modulations

- Frequency modulation
  - Vibrato (periodic), jitter (random)
  - Difficult to measure reliably
  - Presence/absence/degree of periodic and random modulations
- Amplitude modulation
  - Tremolo
  - Presence/absence/degree of periodic and random modulations
- These features can be extracted from an amplitude envelope, as in the sketch below.
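A minimal sketch of amplitude-modulation (tremolo) analysis: take the amplitude envelope, remove its mean, and look for a dominant low-frequency peak in its spectrum. The 4-10 Hz search range and the smoothing length are assumptions for illustration.

```python
import numpy as np

def tremolo(x, fs, lo=4.0, hi=10.0, smooth_ms=10.0):
    """Return (rate_hz, strength) of the dominant envelope modulation."""
    win = max(1, int(fs * smooth_ms / 1000.0))
    env = np.convolve(np.abs(x), np.ones(win) / win, mode="same")
    env = env - np.mean(env)
    spec = np.abs(np.fft.rfft(env * np.hanning(len(env))))
    freqs = np.fft.rfftfreq(len(env), 1.0 / fs)
    band = (freqs >= lo) & (freqs <= hi)
    k = np.argmax(spec[band])
    return freqs[band][k], spec[band][k] / (np.max(spec) + 1e-12)

# A 440 Hz tone with 6 Hz, 30% amplitude modulation.
fs = 16000
t = np.arange(2 * fs) / fs
x = (1.0 + 0.3 * np.sin(2 * np.pi * 6 * t)) * np.sin(2 * np.pi * 440 * t)
print(tremolo(x, fs))   # rate close to 6 Hz
```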

[Figure: time-varying intensity (dB) across Bark frequency bands for flute, violin, trumpet, and clarinet tones.]

Classification

Pattern recognition

- The data is presented as N-dimensional feature vectors, which are assigned to different classes or clusters.
- Supervised classification: the input pattern is identified as a member of a predefined class.
- Unsupervised classification: the pattern is assigned to a hitherto unknown class (e.g., clustering). A sketch of both settings follows.

[Figure: scatter plot of two-dimensional feature vectors from two classes (class1, class2) in the Feature 1 / Feature 2 plane.]
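A minimal sketch contrasting supervised and unsupervised classification on 2-D feature vectors, using scikit-learn for illustration (not the seminar's tooling): a Gaussian classifier trained with labels vs. k-means clustering without labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
class1 = rng.normal(loc=[0.0, 0.5], scale=0.4, size=(50, 2))
class2 = rng.normal(loc=[3.0, 2.5], scale=0.4, size=(50, 2))
X = np.vstack([class1, class2])
y = np.array([0] * 50 + [1] * 50)

# Supervised: class labels are known during training.
clf = GaussianNB().fit(X, y)
print(clf.predict([[0.2, 0.4], [2.8, 2.6]]))    # -> [0 1]

# Unsupervised: no labels; k-means discovers two clusters on its own
# (cluster indices are arbitrary and need not match the labels above).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.predict([[0.2, 0.4], [2.8, 2.6]]))
```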

Speaker recognition

- The most studied sound source recognition problem; covers both recognition (identification) and verification.
- Three major approaches:

1. Long-term averages of acoustic features
   - Average out the phonetic variations affecting the features, leaving only the speaker-dependent component.
   - The earliest approach; has been used successfully in demanding applications.
   - Discards much speaker-dependent information and can require long speech utterances to derive stable long-term statistics.

2. Modeling the speaker-dependent features within phonetic sounds
   - Compare within similar phonetic sounds in the training and test utterances.

   - Explicit segmentation: a hidden Markov model (HMM)-based continuous speech recognizer as a front end -> little or no improvement in performance.
   - Implicit segmentation: unsupervised clustering of acoustic features during training and recognition (Gaussian mixture models, GMMs).

3. Discriminative neural networks (NNs)
   - NNs are trained to model the decision function which best discriminates speakers within a known set.

Problems

- Fundamental frequency information is not used.
- Speech rhythm is not used.
- Lack of generality: the systems do not work well when the acoustic conditions vary from those used in training.
- They cannot deal with mixtures of sounds.
- Performance suffers as the population size grows.

Case: Reynolds 1995

- 20 mel-frequency cepstral coefficients computed in 20 ms frames.
- Given a recorded utterance, a probabilistic model is formed based on Gaussian distributions.
- Motivations for using Gaussian mixture models (GMMs) in speaker recognition:
  - The individual component Gaussians in a speaker-dependent GMM are interpreted to represent some broad acoustic classes.
  - A Gaussian mixture density is able to model the long-term sample distribution smoothly. A toy sketch follows.
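A minimal sketch of GMM-based speaker identification in the spirit of Reynolds 1995: one GMM per speaker trained on per-frame cepstral features, identification by maximum average log-likelihood. The synthetic feature arrays below stand in for real 20-coefficient MFCC frames and are illustration data only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(train_data, n_components=16):
    """train_data: {speaker_id: feature array of shape (n_frames, 20)}."""
    return {spk: GaussianMixture(n_components=n_components,
                                 covariance_type="diag",
                                 random_state=0).fit(feats)
            for spk, feats in train_data.items()}

def identify(models, feats):
    """Pick the speaker whose GMM gives the highest per-frame log-likelihood."""
    return max(models, key=lambda spk: models[spk].score(feats))

# Toy demo with synthetic "cepstral" features instead of real MFCCs.
rng = np.random.default_rng(0)
train = {"alice": rng.normal(0.0, 1.0, (500, 20)),
         "bob": rng.normal(1.5, 1.0, (500, 20))}
models = train_speaker_models(train)
test = rng.normal(1.5, 1.0, (100, 20))    # an unknown utterance by "bob"
print(identify(models, test))             # -> bob
```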

The performance of the system depends on:

- The noise characteristics of the signal
- The population size

Nearly perfect performance with pristine recordings (630 talkers). Under varying acoustic conditions (e.g., using different telephone handsets during testing and training):

- 94% with a population of 10 talkers
- 83% with a population of 113 talkers

Automatic noise recognition

Case: Gaunard 1998

- Classes: car, truck, moped, aircraft, train
- 12 cepstral coefficients from 50-100 ms frames
- 1-5 state HMMs
- Recognition performance: 90-95% with cepstral coefficients as features, 80% with a 1/3-octave filter bank as the front end (a sketch follows)
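A minimal sketch of HMM-based noise classification along the lines of Gaunard 1998: one small Gaussian HMM per noise class trained on frame-level cepstral features, classification by maximum log-likelihood. This uses the hmmlearn package, and the synthetic feature arrays stand in for real cepstra.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_noise_models(train_data, n_states=3):
    """train_data: {class_name: feature array of shape (n_frames, 12)}."""
    return {name: GaussianHMM(n_components=n_states, covariance_type="diag",
                              n_iter=50, random_state=0).fit(feats)
            for name, feats in train_data.items()}

def classify_noise(models, feats):
    return max(models, key=lambda name: models[name].score(feats))

rng = np.random.default_rng(0)
train = {"car": rng.normal(0.0, 1.0, (400, 12)),
         "train": rng.normal(2.0, 1.0, (400, 12))}
models = train_noise_models(train)
print(classify_noise(models, rng.normal(2.0, 1.0, (80, 12))))   # -> train
```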

Case: El-Maleh 1999

- Frame-level noise classification for mobile environments
- Classes: car, voice babble, street, bus, and factory
- Features: line spectral frequencies (LSFs) based on order-10 LPC analysis
- 89% average performance
- Shows some ability to generalize, and some robustness:
  - New noises were classified as similar training noises (restaurant noise (babble, music) -> babble or bus noise)
  - Human speech-like noise (superimposed independent speech signals) was classified as speech when the number of superimposed signals was low; as the number of superimposed signals increased, it was classified more often as babble than as speech

Musical instrument recognition

Difficulties:

- Wide pitch ranges
- A variety of playing techniques; the properties of the sounds may change completely with different techniques and different notes
- Interfering sounds in polyphony
- Different recording conditions
- Differences between instrument specimens (a Stradivarius vs. a cheap violin)

Psychological research as a starting point for finding features:

- A lot of work has been done to resolve what makes musical instrument sounds distinguishable (timbre).

- This knowledge has been used in musical instrument recognition systems.
- There has also been a lot of work on human voices, but much less is known about environmental sounds.

The state of the art is still quite modest:

- Good results with isolated tones, but with only one example of each particular instrument
- Good results with monophonic phrases, but with only four instruments
- Not so good results with monophonic phrases with several instruments
- Some first attempts towards polyphonic recognition