Advanced audio analysis. Martin Gasser


Motivation
- Which methods are common in MIR research?
- How can we parameterize audio signals?
- Interesting dimensions of audio: spectral, temporal, and melodic structure; high-level descriptions
- Which properties of the signals are captured by the features?

Topics
- STFT, phase vocoder
- Constant Q transform
- Source-filter analysis (LPC, cepstrum, MFCC)
- Spectral modeling synthesis
- Beat tracking
- Pitch estimation
- Chord/key recognition

STFT (short-time Fourier transform)
- Take DFTs of (overlapping) frames of audio data
- Before the DFT, multiply each frame with a window function
- Efficiently implemented via the FFT (e.g., FFTW)
- Resolution of the STFT is limited
  - by sample rate / number of bins
  - by window type (the spectrum is convolved with the DFT of the window function)
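The framing, windowing, and transform steps can be sketched as follows. This is a minimal illustration, not an efficient implementation: it uses a naive O(N^2) DFT from the standard library instead of an FFT, and the function names, the Hann window, and the default frame size and hop are assumptions chosen for the example.

```python
import cmath
import math

def hann(n_points):
    """Periodic Hann window of length n_points."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_points) for n in range(n_points)]

def dft(frame):
    """Naive DFT, returning bins 0 .. N/2 of a real-valued frame."""
    n_points = len(frame)
    return [
        sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_points) for n in range(n_points))
        for k in range(n_points // 2 + 1)
    ]

def stft(signal, frame_size=64, hop=32):
    """Window overlapping frames and transform each one."""
    window = hann(frame_size)
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_size)]
        frames.append(dft(frame))
    return frames

# Usage: a sinusoid centred on bin 8 peaks at bin 8 in every frame.
signal = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
frames = stft(signal)
peak_bins = [max(range(len(f)), key=lambda k: abs(f[k])) for f in frames]
```

The Hann window spreads the sinusoid over bins 7 to 9, which illustrates the convolution of the spectrum with the window transform mentioned above.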

Phase vocoder
- Analysis/resynthesis method based on the STFT
- Independent modification of magnitude and phase values in STFT bins
- High-quality pitch shifting, time stretching, and other effects
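One building block commonly used in phase vocoders is the estimate of a bin's instantaneous frequency from the phase advance between two overlapping frames. The sketch below shows only that analysis step, under the assumption of a Hann window and a single stationary sinusoid; the function names are hypothetical and the resynthesis stage is not covered.

```python
import cmath
import math

def windowed_bin(signal, start, frame_size, k):
    """DFT bin k of a Hann-windowed frame beginning at sample `start`."""
    total = 0j
    for n in range(frame_size):
        w = 0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
        total += w * signal[start + n] * cmath.exp(-2j * math.pi * k * n / frame_size)
    return total

def principal_arg(phase):
    """Wrap a phase value into [-pi, pi)."""
    return (phase + math.pi) % (2 * math.pi) - math.pi

def instantaneous_bin_frequency(signal, start, frame_size, hop, k):
    """Refine the frequency of bin k (in fractional bins) from two frames."""
    first = windowed_bin(signal, start, frame_size, k)
    second = windowed_bin(signal, start + hop, frame_size, k)
    # Phase advance a sinusoid exactly at the bin centre would show.
    expected = 2 * math.pi * k * hop / frame_size
    deviation = principal_arg(cmath.phase(second) - cmath.phase(first) - expected)
    return k + deviation * frame_size / (2 * math.pi * hop)

# Usage: a sinusoid at 8.3 bins is resolved more finely than the bin spacing.
signal = [math.sin(2 * math.pi * 8.3 * n / 64) for n in range(128)]
estimate = instantaneous_bin_frequency(signal, 0, 64, 16, 8)
```

The hop must be small enough that the phase deviation stays within half a cycle, otherwise the wrapped value is ambiguous.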

Problems of STFT
- Window size/type has to be manually adjusted to the data
- Equal time/frequency resolution for all frequency bands
- Human auditory perception has good frequency resolution in lower bands and good time resolution in upper bands
- Ratio of center frequency to bandwidth of auditory filters ("filter Q") is approximately constant

Constant Q transform
- Window length of the basis sinusoids is inversely proportional to their center frequencies
- Center frequencies are logarithmically spaced (no 0 frequency!)
- Basis matrix is not invertible, so there is no unique inversion (yet?)
- Efficient implementation: leverages the sparsity of the basis functions in the frequency domain

Fast CQT
- Time kernel: k (dense)
- Spectral kernel: K (sparse), the DFT of the time kernel

$$X^{cq}[k_{cq}] = \sum_{n=0}^{N-1} x[n]\, k^{*}[n, k_{cq}] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, K^{*}[k, k_{cq}]$$
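For comparison with the fast variant, the direct time-domain form (the first sum above) can be sketched as follows. The Hamming window, the twelve bins per octave, and all names are assumptions made for this illustration; the sparse spectral-kernel optimisation is not implemented here.

```python
import cmath
import math

def constant_q_transform(signal, sample_rate, f_min, n_bins, bins_per_octave=12):
    """Direct (slow) constant Q transform evaluated at the start of `signal`."""
    # Constant ratio of center frequency to bandwidth.
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    coefficients = []
    for k in range(n_bins):
        f_k = f_min * 2.0 ** (k / bins_per_octave)   # logarithmically spaced
        n_k = math.ceil(q * sample_rate / f_k)       # window shrinks as f_k grows
        if n_k > len(signal):
            raise ValueError("signal too short for the lowest bin")
        total = 0j
        for n in range(n_k):
            window = 0.54 - 0.46 * math.cos(2 * math.pi * n / n_k)
            total += window * signal[n] * cmath.exp(-2j * math.pi * q * n / n_k)
        coefficients.append(total / n_k)
    return coefficients

# Usage: a 440 Hz tone lands one octave (12 bins) above f_min = 220 Hz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(1024)]
cq = constant_q_transform(tone, sample_rate, f_min=220.0, n_bins=24)
strongest = max(range(len(cq)), key=lambda k: abs(cq[k]))
```

Each bin uses its own window length, which is what makes the time kernel dense and motivates moving the computation to the frequency domain.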

STFT vs. CQT

SMS (spectral modeling synthesis)
- Enhancement of the tracking phase vocoder
- Tries to separate the signal into sinusoidal and residual (filtered white noise) parts
- Stores sinusoidal tracks and filter coefficients
- Mixed bottom-up/top-down approach
- Usage: transcription, high-quality time stretching/pitch shifting

Algorithm

Deterministic part
(a) Peak picking
(b) Peak interpolation to increase accuracy
(c) Peak tracking
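Steps (a) and (b) can be sketched with a local-maximum search followed by parabolic interpolation through the three bins around each peak. Parabolic interpolation is one common choice and is assumed here, since the slides do not name the interpolation method; it is usually applied to log magnitudes. Function names are hypothetical.

```python
def pick_peaks(magnitudes, threshold=0.0):
    """Indices of local maxima above `threshold` (edges excluded)."""
    return [
        k for k in range(1, len(magnitudes) - 1)
        if magnitudes[k] > threshold
        and magnitudes[k] > magnitudes[k - 1]
        and magnitudes[k] >= magnitudes[k + 1]
    ]

def interpolate_peak(magnitudes, k):
    """Fit a parabola through bins k-1, k, k+1; return (position, height)."""
    alpha, beta, gamma = magnitudes[k - 1], magnitudes[k], magnitudes[k + 1]
    denominator = alpha - 2.0 * beta + gamma
    if denominator == 0.0:
        return float(k), beta
    offset = 0.5 * (alpha - gamma) / denominator
    height = beta - 0.25 * (alpha - gamma) * offset
    return k + offset, height

# Usage: samples of a parabola whose true maximum lies at 3.3.
values = [-(k - 3.3) ** 2 for k in range(7)]
peaks = pick_peaks(values, threshold=-100.0)
position, height = interpolate_peak(values, peaks[0])
```

Peak tracking, step (c), would then link the interpolated peaks of consecutive frames into sinusoidal tracks.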

Stochastic part
- Spectral subtraction can be done in the frequency or the time domain
- Frequency domain: synthesize the spectral shape of each sinusoid (main lobe of the window function) and resynthesize
- Time domain: use phase-matched additive synthesis
- The ideal residual is stochastic

Stochastic part (cont'd)
- Perform amplitude rescaling in order to reduce smearing artifacts
  - Compare the residual to the original signal
  - Whenever residual > original, reduce the amplitude of the residual
- Model the spectral envelope of the resulting signal (smoothed DFT, LPC, cepstrum)

Critical steps
- Spectral analysis: currently the STFT. Can we improve?
- Additive resynthesis: smearing at transients!

Source-filter analysis
- Idea: signal = excitation convolved with resonance
- Models human speech production and many musical instruments
- Excitation: broadband pitched source signal (e.g., glottal pulse train)
- Resonance: slowly varying filter (e.g., vocal tract), which produces the formants

Source-filter analysis

Source-filter analysis
- The source signal is convolved with a time-varying filter
- How to deconvolve the resulting signal?
- How to calculate the coefficients of the filter?
- Applications: pitch tracking, speech recognition/synthesis, music similarity, ...

Linear Predictive Coding
- Analysis: optimize the coefficients of a predictive model (FIR filter) such that the prediction error is minimized
- Difference between input signal and prediction: residual
- Inverse filter: all-pole (IIR) filter
- Resynthesis: use the (compressed) residual as input to the inverse filter

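LPC maths

Prediction error:

$$e(n) = x(n) - \sum_{k=1}^{p} a_k\, x(n-k)$$

Set the derivative of the mean squared error with respect to each coefficient to zero:

$$\frac{\partial E\{e^2(n)\}}{\partial a_i} = 2E\left\{e(n)\,\frac{\partial e(n)}{\partial a_i}\right\} = -2E\{e(n)\,x(n-i)\} = -2E\left\{\left[x(n) - \sum_{k=1}^{p} a_k\, x(n-k)\right] x(n-i)\right\} = 0$$

Normal equations (Toeplitz matrix):

$$\sum_{k=1}^{p} a_k\, E\{x(n-k)\,x(n-i)\} = E\{x(n)\,x(n-i)\}$$

$$\sum_{k=1}^{p} a_k\, r_{xx}(i-k) = r_{xx}(i), \quad i = 1, \ldots, p$$

Efficient solution: Levinson-Durbin recursion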
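The normal equations can be solved with a short Levinson-Durbin recursion. The sketch below assumes the sign convention used above (the prediction is the sum of a_k x(n-k)) and a biased autocorrelation estimate; the function names are chosen for this example.

```python
def autocorrelation(x, max_lag):
    """Biased autocorrelation estimates r[0] .. r[max_lag]."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) / n for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve sum_k a_k r[|i-k|] = r[i] for i = 1..order.

    Returns (a, error), where a[0] holds a_1 and `error` is the
    remaining mean squared prediction error.
    """
    a = [0.0] * order
    error = r[0]
    for i in range(order):
        # Reflection coefficient for model order i + 1.
        acc = r[i + 1] - sum(a[k] * r[i - k] for k in range(i))
        reflection = acc / error
        updated = a[:]
        updated[i] = reflection
        for k in range(i):
            updated[k] = a[k] - reflection * a[i - 1 - k]
        a = updated
        error *= 1.0 - reflection * reflection
    return a, error

# Usage: autocorrelation of a first-order process with coefficient 0.5.
coefficients, error = levinson_durbin([1.0, 0.5, 0.25], order=2)
# coefficients is approximately [0.5, 0.0], error approximately 0.75
```

The recursion exploits the Toeplitz structure, so the cost grows with the square of the order rather than the cube needed for a general linear solve.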

Cepstral techniques
- "Cepstrum": spectrum of log(|spectrum|)
- Spectrum of the signal: spectrum of the source × spectrum of the filter
- "Quefrency": abscissa of the cepstrum plot; unit of quefrency: time (!)
- "Cepstrogram": plot of time intervals vs. spectral periodicities
- "Liftering": filtering in the cepstral domain

Cepstrum
- Inverse transform (DFT) of the (liftered) cepstrum yields the spectral envelope

MFCC

$$\mathrm{MFCC}(x) = \mathrm{DCT}(\mathrm{Mel}(\log|\mathrm{DFT}(x)|))$$

- Logarithm: transforms the product spectrum into a sum
- Mel: perceptual scale of pitches judged by listeners to be equal in distance from one another
- DCT: decorrelates the signal (DCT-II)
- The spectral envelope (timbre) is captured by the low coefficients
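A compact sketch of the pipeline for one frame is given below. Several details are assumptions rather than something the slides specify: the mel formula 2595 log10(1 + f/700), triangular filters, 20 filters and 13 coefficients, and the ordering in which the logarithm is taken after the mel filterbank, which is common in practice and differs from the order written in the formula above.

```python
import math

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters over n_bins magnitude bins spanning 0 .. sample_rate/2."""
    top = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 edge frequencies, equally spaced on the mel scale.
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [k * (sample_rate / 2.0) / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def dct2(values):
    """Unnormalized DCT-II."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n)
    ]

def mfcc(magnitudes, sample_rate, n_filters=20, n_coeffs=13):
    """MFCCs of one magnitude spectrum (bins 0 .. N/2)."""
    bank = mel_filterbank(n_filters, len(magnitudes), sample_rate)
    energies = [sum(w * m for w, m in zip(row, magnitudes)) for row in bank]
    log_energies = [math.log(e + 1e-10) for e in energies]  # small floor avoids log(0)
    return dct2(log_energies)[:n_coeffs]
```

Keeping only the first coefficients discards fine spectral detail such as individual harmonics and retains the smooth envelope.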

Music similarity
- Model timbre as a Gaussian distribution

$$\Sigma = E(XX^T) - \mu\mu^T, \qquad \mu = \frac{1}{n}\sum_i x_i, \qquad E(XX^T) = \frac{1}{n}\sum_i x_i x_i^T$$

- Compute the similarity between distributions (KL divergence, earth mover's distance, ...)
- Simple genre classification
  - "Training": labeled reference samples
  - Nearest neighbor classification
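As an illustration of the Gaussian timbre model, the sketch below fits one Gaussian per track and compares two tracks with a symmetrized KL divergence. To stay short it assumes a diagonal covariance matrix, which is a simplification of the full covariance in the formulas above; the function names and the variance floor are also assumptions.

```python
import math

def fit_diagonal_gaussian(frames):
    """Mean and per-dimension variance of a list of feature vectors."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Diagonal of E(XX^T) - mu mu^T, floored to keep the divergence finite.
    var = [
        max(sum(f[d] ** 2 for f in frames) / n - mean[d] ** 2, 1e-9)
        for d in range(dim)
    ]
    return mean, var

def kl_divergence(p, q):
    """KL(p || q) between two diagonal Gaussians given as (mean, var)."""
    (mean_p, var_p), (mean_q, var_q) = p, q
    total = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total

def symmetric_kl(p, q):
    """Symmetrized divergence, usable as a dissimilarity between tracks."""
    return kl_divergence(p, q) + kl_divergence(q, p)

# Usage with two tiny hypothetical sets of two-dimensional features.
track_a = fit_diagonal_gaussian([[0.0, 0.0], [2.0, 2.0]])
track_b = fit_diagonal_gaussian([[3.0, 0.0], [5.0, 2.0]])
distance = symmetric_kl(track_a, track_b)
```

A nearest neighbor classifier would assign a query track the genre label of the reference track with the smallest such distance.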

High-level music analysis
- Beat tracking: track locations of downbeats
- Tempo estimation: find the (perceptual) tempo of a musical piece
- Pitch estimation
- Chord/key estimation

Beat tracking
- First step: onset detection
  - Can be done in the spectral or the time domain
- Causal/real-time methods: model the beat as a dynamically excited oscillator
- Offline methods: cluster inter-onset intervals (IOIs) and find the most plausible beat hypothesis

Scheirer's algorithm
- Subband decomposition (6 bands)
- Input half-wave rectified envelopes to a resonator filterbank (150 bands, approx. 60-240 bpm)
- Choose the resonator with the maximum output over all bands (this gives the tempo)

Scheirer's algorithm (cont'd)
- Beat phase determination can be done by inspecting the output or the internal state of the winning oscillator
- Pros: predicts what is happening NOW (in contrast to simple autocorrelation, which performs its calculation "after the fact")
- Cons: discretizes tempo
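The resonator idea can be sketched with a bank of feedback comb filters, one per candidate beat period, applied to a single envelope. This is a heavily reduced illustration and not the full algorithm: it uses one band instead of six, a fixed feedback gain, integer delays, and a narrower tempo range (70-200 bpm) chosen to sidestep octave ambiguity. All names and parameter values are assumptions.

```python
import math

def comb_filter_energy(envelope, lag, alpha=0.9):
    """Output energy of y[n] = (1 - alpha) * x[n] + alpha * y[n - lag]."""
    output = [0.0] * len(envelope)
    energy = 0.0
    for n, value in enumerate(envelope):
        feedback = output[n - lag] if n >= lag else 0.0
        output[n] = (1.0 - alpha) * value + alpha * feedback
        energy += output[n] ** 2
    return energy

def estimate_tempo(envelope, envelope_rate, min_bpm=70.0, max_bpm=200.0):
    """Tempo (bpm) of the resonator with maximum output energy."""
    min_lag = math.ceil(60.0 * envelope_rate / max_bpm)
    max_lag = math.floor(60.0 * envelope_rate / min_bpm)
    best_lag = max(
        range(min_lag, max_lag + 1),
        key=lambda lag: comb_filter_energy(envelope, lag),
    )
    return 60.0 * envelope_rate / best_lag

# Usage: an idealized envelope sampled at 100 Hz with a pulse every 0.5 s.
envelope = [1.0 if n % 50 == 0 else 0.0 for n in range(1000)]
tempo = estimate_tempo(envelope, envelope_rate=100.0)
```

Because the delays are integers, only a discrete set of tempi can be reported, which is the discretization drawback listed above.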

Dixon's algorithm
- Non-causal
- IOI clustering
- Multiple agents

Dixon's algorithm (cont'd)
- Onset detection: "surfboard" method
  - Calculate the amplitude envelope of the signal
  - Linear regression of the envelope
- Use IOI clusters as input to agents which predict beat times

Pitch estimation
- Task: find the fundamental frequency in a signal
- Problems:
  - The lowest peak is not always the fundamental frequency
  - The perceived fundamental may not even be physically present

Pitch estimation
Time-domain
- Zero-crossing rate
- Maxima in autocorrelation: $\phi(\tau) = \frac{1}{N} \sum_{n=0}^{N-1} x(n)\, x(n+\tau)$
- Minima in magnitude difference: $\psi(\tau) = \frac{1}{N} \sum_{n=0}^{N-1} |x(n) - x(n+\tau)|$
Frequency-domain
- Cepstrum
- Maximum likelihood, HPS
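The two lag-domain criteria can be sketched as follows. The sums are truncated at the end of the frame, the search is restricted to a caller-supplied lag range so that lag 0 is excluded, and the function names are assumptions for this example.

```python
import math

def autocorrelation(x, lag):
    """phi(lag): mean product of the frame and its shifted copy."""
    n = len(x)
    return sum(x[i] * x[i + lag] for i in range(n - lag)) / n

def magnitude_difference(x, lag):
    """psi(lag): mean absolute difference, averaged over the overlap."""
    n = len(x)
    return sum(abs(x[i] - x[i + lag]) for i in range(n - lag)) / (n - lag)

def pitch_period(x, min_lag, max_lag):
    """Period in samples from both criteria: (autocorrelation, magnitude difference)."""
    lags = range(min_lag, max_lag + 1)
    by_autocorrelation = max(lags, key=lambda lag: autocorrelation(x, lag))
    by_difference = min(lags, key=lambda lag: magnitude_difference(x, lag))
    return by_autocorrelation, by_difference

# Usage: a sinusoid with a period of 20 samples.
frame = [math.sin(2 * math.pi * n / 20) for n in range(200)]
periods = pitch_period(frame, min_lag=10, max_lag=30)
```

Dividing the sample rate by the detected period gives the fundamental frequency in Hz.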

Cepstrum pitch detection
- Real cepstrum: $C(x) = \mathrm{IFFT}(\log(|\mathrm{DFT}(x)|))$
- The logarithm scales values into a usable range
- Regular partials appear as peaks in the cepstrum
- Unit of quefrency is ms (period)
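A minimal sketch of this detector is shown below, using naive standard-library transforms instead of an FFT. The small constant added before the logarithm and the restriction of the peak search to a plausible quefrency range are assumptions of this example, as are the function names.

```python
import cmath
import math

def real_cepstrum(frame):
    """Inverse DFT of the log magnitude spectrum of `frame`."""
    n = len(frame)
    log_magnitude = []
    for k in range(n):
        x_k = sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        log_magnitude.append(math.log(abs(x_k) + 1e-9))  # floor avoids log(0)
    return [
        sum(log_magnitude[k] * cmath.exp(2j * math.pi * k * q / n) for k in range(n)).real / n
        for q in range(n)
    ]

def cepstral_pitch_period(frame, min_quefrency, max_quefrency):
    """Quefrency (in samples) of the largest cepstral peak in the given range."""
    cepstrum = real_cepstrum(frame)
    return max(range(min_quefrency, max_quefrency + 1), key=lambda q: cepstrum[q])

# Usage: an idealized pulse train with a period of 32 samples.
frame = [1.0 if n % 32 == 0 else 0.0 for n in range(256)]
period = cepstral_pitch_period(frame, min_quefrency=16, max_quefrency=48)
```

The regularly spaced partials of the pulse train form a periodic pattern in the log spectrum, which the second transform turns into a peak at the pitch period.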

HPS, ML
Harmonic product spectrum

$$Y(\omega) = \prod_{r=1}^{R} |X(\omega r)|, \qquad \hat{Y} = \max_{\omega_i} Y(\omega_i)$$

Maximum likelihood
- Correlate ideal spectra with the input
- Ideal spectrum: pulse train starting at ω, convolved with the analysis window function
- Select the spectral template with maximum correlation
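The harmonic product spectrum can be sketched directly from its definition. The example below works on DFT bin indices rather than continuous frequency, uses R = 3 harmonics, and skips bin 0 in the search; these choices and the function names are assumptions.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0 .. N/2 (naive transform)."""
    n = len(frame)
    return [
        abs(sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n)))
        for k in range(n // 2 + 1)
    ]

def harmonic_product_spectrum(magnitudes, n_harmonics=3):
    """Y[k] = product over r = 1..R of |X[r * k]|."""
    limit = (len(magnitudes) - 1) // n_harmonics
    hps = []
    for k in range(limit + 1):
        product = 1.0
        for r in range(1, n_harmonics + 1):
            product *= magnitudes[r * k]
        hps.append(product)
    return hps

def hps_pitch_bin(magnitudes, n_harmonics=3):
    """Bin index of the HPS maximum, ignoring bin 0."""
    hps = harmonic_product_spectrum(magnitudes, n_harmonics)
    return max(range(1, len(hps)), key=lambda k: hps[k])

# Usage: the fundamental (bin 10) is weaker than its second harmonic (bin 20).
frame = [
    0.3 * math.sin(2 * math.pi * 10 * n / 256)
    + 1.0 * math.sin(2 * math.pi * 20 * n / 256)
    + 0.6 * math.sin(2 * math.pi * 30 * n / 256)
    for n in range(256)
]
magnitudes = magnitude_spectrum(frame)
strongest_bin = max(range(len(magnitudes)), key=lambda k: magnitudes[k])
fundamental_bin = hps_pitch_bin(magnitudes)
```

Picking the strongest spectral peak would report bin 20 here, while the product over harmonics favors bin 10, the bin whose multiples all carry energy.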

Key/Chord recognition
- Chroma: fold the spectral representation down to 12 bins; one bin covers one pitch class
- Correlate chroma vectors with pitch-class distribution templates
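Both steps can be sketched on a list of spectral peaks. The example assumes equal temperament with A4 = 440 Hz and, for simplicity, binary major and minor triad templates compared by dot product; weighted key profiles would be used in the same way. All names are chosen for this illustration.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(frequency_hz, reference_a4=440.0):
    """Pitch class index (C = 0) of the nearest equal-tempered note."""
    semitones_from_a4 = round(12 * math.log2(frequency_hz / reference_a4))
    return (semitones_from_a4 + 9) % 12

def chroma_vector(peaks):
    """Fold (frequency, magnitude) pairs into a 12-bin chroma vector."""
    chroma = [0.0] * 12
    for frequency_hz, magnitude in peaks:
        if frequency_hz > 0:
            chroma[pitch_class(frequency_hz)] += magnitude
    return chroma

def triad_templates():
    """Binary templates for the 12 major and 12 minor triads."""
    templates = {}
    for root in range(12):
        for quality, third in (("major", 4), ("minor", 3)):
            template = [0.0] * 12
            for interval in (0, third, 7):
                template[(root + interval) % 12] = 1.0
            templates[PITCH_CLASSES[root] + " " + quality] = template
    return templates

def recognize_chord(chroma):
    """Name of the template with the largest dot product with `chroma`."""
    templates = triad_templates()
    return max(
        templates,
        key=lambda name: sum(c * t for c, t in zip(chroma, templates[name])),
    )

# Usage: peaks at C4, E4 and G4 with equal magnitude.
chroma = chroma_vector([(261.63, 1.0), (329.63, 1.0), (392.00, 1.0)])
chord = recognize_chord(chroma)
```

Folding discards octave information, so the same chord played in different registers maps to the same chroma vector.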

Thank you!