Mel-frequency cepstral coefficients (MFCCs) and gammatone filter banks

SGN-14006 Audio and Speech Processing, Pasi Pertilä, 2015. Mel-frequency cepstral coefficients (MFCCs) and gammatone filter banks. Slides for this lecture are based on those created by Katariina Mahkonen for the TUT course Puheenkäsittelyn menetelmät (Methods of Speech Processing) in Spring 2013.

Introduction
- MFCC coefficients model the spectral energy distribution in a perceptually meaningful way.
- MFCCs are the most widely used acoustic feature for speech recognition, speaker recognition, and audio classification.
- MFCCs take into account certain properties of the human auditory system: critical-band frequency resolution (approximately) and log-power (dB magnitudes).

Spectrogram of piano notes C1-C8
Note that the fundamental frequency (16, 32, 65, 131, 261, 523, 1045, 2093, 4186 Hz) doubles in each octave, and the spacing between the harmonic partials doubles too. Such an octave change is perceived as a doubling of the height of the note.
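As a quick numeric sketch of the doubling: the base value 16.35 Hz below is an assumption (the slide only lists the rounded value 16 Hz).

```python
import numpy as np

# The fundamental frequency doubles with each octave:
# f0(k) = f0(0) * 2**k, starting from an assumed base of 16.35 Hz.
f0 = 16.35 * 2.0 ** np.arange(9)
print(f0)  # roughly 16, 33, 65, 131, 262, 523, 1046, 2093, 4186 Hz
```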

Mel scale
- The Mel-frequency scale represents subjective (perceived) pitch. It is one of the perceptually motivated frequency scales (see the figure referenced below).
- The Mel scale is constructed using pairwise comparisons of sinusoidal tones: a reference frequency is fixed, and a test subject (human listener) is asked to adjust the frequency of the other tone until it sounds twice as high or half as high.
- It models the non-linear perception of frequencies in the human auditory system.
- For comparison, the Bark critical-band scale has been constructed based on the masking properties of nearby frequency components: the audible bandwidth is filled with adjacent critical bands 1-26.
- Note that all the scales are related, and very roughly f_Mel ≈ 100 · f_Bark.
Figure: the perceptual scales compared; axes show position on the basilar membrane (mm), frequency (kHz), frequency (Mel), and frequency (Bark).

Mel scale
f_Mel = 2595 · log10(1 + f_Hz / 700)
The anchor point of the Mel scale is chosen so that 1000 Hz = 1000 Mel.
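A small sketch of the conversion formula and its inverse (the inverse is obtained by solving the slide's formula for f_Hz; it is not given on the slide):

```python
import numpy as np

def hz_to_mel(f_hz):
    # Mel-scale formula from the slide
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    # Inverse mapping, solving the formula above for f_Hz
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

print(hz_to_mel(1000.0))  # approximately 1000 Mel, the anchor point
```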

Piano tones C1-C5: Mel-frequency spectrogram and Bark-scale spectrogram (figure).

Properties of human hearing: perception of loudness differences
The Weber rule says that the perceived change in a physical quantity is proportional to the relative change (ΔI / I). Therefore it makes sense to measure sound levels in decibels: L_I = 10 · log10(I).
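A tiny illustration of why the logarithmic (dB) level fits the Weber rule: equal intensity ratios map to equal level differences. The function follows the slide's L_I = 10 · log10(I); a reference intensity I_0 would normally appear in the denominator.

```python
import numpy as np

def level_db(intensity):
    # Sound level in decibels, L_I = 10 * log10(I), as on the slide
    return 10.0 * np.log10(intensity)

# Equal intensity *ratios* give equal dB *differences*, matching the Weber rule:
print(level_db(2.0) - level_db(1.0))      # ~3.01 dB
print(level_db(200.0) - level_db(100.0))  # ~3.01 dB
```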

Now let's get back to the calculation of MFCC coefficients, the most widely used acoustic feature for representing a speech frame (in speech recognition, for example).

Calculation of MFCC coefficients
- Define triangular bandpass filters uniformly distributed on the Mel scale (usually about 40 filters in the range 0-8 kHz), i.e., with linear spacing on the Mel scale.
- Note that the Mel filter bank has overlap between adjacent frequency bands.
- The center (Mel-scale) frequency of band n is f_Mel,c(n). The Mel filter of band n starts at zero amplitude at f_Mel,c(n-1), has maximum amplitude at f_Mel,c(n), and decays to zero at f_Mel,c(n+1).
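A minimal sketch of how such a triangular filter bank could be built, assuming a 16 kHz sampling rate and a 512-point DFT; these parameters and the bin-mapping details are illustrative choices, not from the slide.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=512, fs=16000, f_low=0.0, f_high=8000.0):
    """Triangular filters with centers uniformly spaced on the Mel scale.

    Returns an (n_filters, n_fft // 2 + 1) weight matrix; row n rises from zero
    at f_Mel,c(n-1) to one at f_Mel,c(n) and falls back to zero at f_Mel,c(n+1).
    """
    # n_filters + 2 equally spaced Mel points give the band edges and centers.
    mel_points = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bins = np.floor((n_fft + 1) * hz_points / fs).astype(int)  # map Hz to DFT bins

    W = np.zeros((n_filters, n_fft // 2 + 1))
    for n in range(1, n_filters + 1):
        left, center, right = bins[n - 1], bins[n], bins[n + 1]
        for k in range(left, center):                     # rising edge
            W[n - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                    # falling edge
            W[n - 1, k] = (right - k) / max(right - center, 1)
    return W

W = mel_filterbank()
print(W.shape)  # (40, 257)
```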

Calculation of MFCC coefficients
- Pre-emphasize the signal, i.e., filter it with H(z) = 1 - a·z^-1, 0.95 < a < 0.99.
- The signal is processed in short windows x(n).
- Window the short signal x(n) with a window function w(n).
- Take the DFT of x(n) -> X(f).
- Obtain the MFCCs (next slide) and proceed to the next window.
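A sketch of this per-frame front end, assuming a Hamming window and a = 0.97 (the slide only constrains 0.95 < a < 0.99); applying pre-emphasis per frame rather than to the whole signal is a simplification.

```python
import numpy as np

def frame_power_spectrum(frame, n_fft=512, a=0.97):
    # One analysis window: pre-emphasis H(z) = 1 - a*z^-1, windowing, DFT, |X(f)|^2.
    x = np.asarray(frame, dtype=float)
    x = np.append(x[0], x[1:] - a * x[:-1])  # pre-emphasis (applied per frame here)
    x = x * np.hamming(len(x))               # window function w(n)
    X = np.fft.rfft(x, n_fft)                # DFT of the windowed frame -> X(f)
    return np.abs(X) ** 2                    # bin energies |X(f)|^2, length n_fft//2 + 1
```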

Calculation of MFCC coefficients
- Define triangular bandpass filters W_k, k = 1, ..., K, uniformly distributed on the Mel scale (usually K = 40 filters in the range 0-8 kHz).
- The DFT bin energies |X(f)|^2 under each filter are weighted with the k-th band's filter shape W_k(f) and accumulated:
  E(k) = Σ_f W_k(f) |X(f)|^2
- Take the logarithm of each E(k), k = 1, 2, ..., K.
- Calculate the discrete cosine transform (DCT-II) of the log energies:
  c_n = Σ_{k=1}^{K} log(E(k)) · cos( n · (k - 1/2) · π / K ), for n = 1, ..., M
- The c_n are called MFCCs.
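Continuing the sketch, one frame's bin energies can be mapped to MFCCs as follows. The filterbank matrix W is assumed to come from the earlier sketch, and the small epsilon added before the logarithm is an implementation detail, not part of the slide's formula.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_power_spectrum(power_spec, W, n_coeffs=13):
    # power_spec: |X(f)|^2 of one frame; W: (K, n_bins) triangular Mel filterbank.
    E = W @ power_spec                    # E(k) = sum_f W_k(f) |X(f)|^2
    log_E = np.log(E + 1e-10)             # log band energies (epsilon avoids log(0))
    c = dct(log_E, type=2, norm='ortho')  # DCT-II of the log energies
    return c[:n_coeffs]                   # keep the lowest coefficients as MFCCs
```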

Log-Mel energies example
- Figure: time-domain signal and spectrogram of the signal.
- Time domain: take one window of data x(n); use (pre-emphasis and) windowing.
- Collect the Mel-scale filter coefficients into a matrix W of size 30 x 512.
- Multiply W with |X(f)|^2 and take the logarithm:
  W (30x512) · x (512x1) -> (30x1) for a single frame,
  W (30x512) · X (512x121) -> (30x121) for a whole 121-frame spectrogram.
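The matrix view above, written out; the random arrays are stand-ins for the real filterbank and spectrogram.

```python
import numpy as np

W = np.random.rand(30, 512)      # stand-in for the 30 x 512 Mel weight matrix
P = np.random.rand(512, 121)     # stand-in for |X(f)|^2 over 121 frames
log_mel = np.log(W @ P + 1e-10)  # 30 x 121 log-Mel energies, one column per frame
print(log_mel.shape)             # (30, 121)
```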

MFCCs from log-Mel energies example
Apply the DCT (DCT-II) to the log-Mel energy spectrum of each frame.

Why are MFCC coefficients successful in audio classification?
- Perceptually motivated (near log-f) frequency resolution.
- Perceptually motivated decibel-magnitude scale.
- The discrete cosine transform decorrelates the features (improves their statistical properties by removing correlations between the features).
- Convenient control of the model order: picking only the lowest N coefficients gives a lower-resolution approximation of the spectral energy distribution (vocal tract, etc.).

Gammatone filter bank
- The gammatone filter bank emulates human hearing by simulating the impulse response of the auditory nerve fiber. The shape resembles a tone modulated with a gamma function:
  g(t) = a · t^(n-1) · e^(-2π·b(f_c)·t) · cos(2π·f_c·t + φ)
  where a is the peak value, the t^(n-1) term shapes the time onset, the exponential term defines the bandwidth and decay, f_c is the characteristic frequency, and φ is the initial phase.
- Typically 42 bands are used, from 30 Hz to 18 kHz.
- Drawback: it does not emulate the level-dependent characteristics of auditory filters.
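A sketch of a sampled gammatone impulse response following the formula above. The bandwidth term b(f_c) is filled in with the common Glasberg and Moore ERB-based choice; that constant is an assumption, since the slide leaves b(f_c) unspecified.

```python
import numpy as np

def gammatone_ir(fc, fs=16000, n=4, duration=0.05, phi=0.0, a=1.0):
    # g(t) = a * t^(n-1) * exp(-2*pi*b(fc)*t) * cos(2*pi*fc*t + phi)
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)  # Glasberg & Moore ERB (assumed choice)
    b = 1.019 * erb                          # common bandwidth scaling (assumed)
    return a * t ** (n - 1) * np.exp(-2.0 * np.pi * b * t) * np.cos(2.0 * np.pi * fc * t + phi)

g = gammatone_ir(fc=1000.0)  # impulse response of a 1 kHz channel
```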

Example: Gammatone filter bank (http://ltfat.sourceforge.net/)
Figure panels: a) response shape (time domain), b) magnitude responses of 40 filters on the ERB scale, c) log output of 30 filters, d) conventional spectrogram.