Advanced audio analysis. Martin Gasser

Motivation. Which methods are common in MIR (music information retrieval) research? How can we parameterize audio signals? Interesting dimensions of audio: spectral, temporal, and melodic structure, plus high-level descriptions. Which properties of the signals are captured by the features?

Topics. STFT and phase vocoder; constant-Q transform; source-filter analysis (LPC, cepstrum, MFCC); spectral modeling synthesis; beat tracking; pitch estimation; chord/key recognition.

STFT. Short-time Fourier transform: take DFTs of (overlapping) frames of audio data. Before each DFT, multiply the frame by a window function. Efficiently implemented via the FFT (e.g., FFTW). The resolution of the STFT is limited by the bin spacing (sample rate divided by the number of bins) and by the window type (the spectrum is convolved with the DFT of the window function).
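
The framing, windowing, and DFT steps can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the Hann window, the frame length of 1024, the hop size of 256, and the 1 kHz test tone are assumptions chosen for the example and do not come from the slides.

```python
import numpy as np

def stft(x, frame_len=1024, hop=256):
    """Return a (num_frames, frame_len // 2 + 1) array of complex STFT bins."""
    window = np.hanning(frame_len)              # assumed window type
    num_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(num_frames)])
    return np.fft.rfft(frames * window, axis=1)  # one DFT per windowed frame

# Hypothetical test signal: one second of a 1 kHz sine at 8 kHz sample rate.
sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1000.0 * t)

X = stft(x)
peak_bin = int(np.argmax(np.abs(X[0])))
peak_hz = peak_bin * sr / 1024                   # bin spacing = sr / frame_len
```

With these settings the bin spacing is 8000 / 1024 = 7.8125 Hz, so the test tone falls exactly on a bin center; tones between bin centers would be smeared by the window's main lobe.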

Phase vocoder. Analysis/resynthesis method based on the STFT. Magnitude and phase values in the STFT bins can be modified independently. Enables high-quality pitch shifting, time stretching, and other effects.
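
One building block of phase vocoder processing is the estimation of a bin's instantaneous frequency from the phase difference between two overlapping frames. The sketch below shows only this analysis step, under assumed parameters (Hann window, FFT size 1024, hop 256, a 1003 Hz test tone); it is not a full pitch shifter or time stretcher.

```python
import numpy as np

def instantaneous_freq(x, sr, n_fft=1024, hop=256):
    """Estimate the frequency (Hz) of the strongest partial from two STFT frames."""
    window = np.hanning(n_fft)
    X0 = np.fft.rfft(x[:n_fft] * window)
    X1 = np.fft.rfft(x[hop:hop + n_fft] * window)
    k = int(np.argmax(np.abs(X0)))                    # strongest bin
    expected = 2 * np.pi * k * hop / n_fft            # phase advance of the bin center
    deviation = np.angle(X1[k]) - np.angle(X0[k]) - expected
    deviation = (deviation + np.pi) % (2 * np.pi) - np.pi   # wrap to [-pi, pi)
    omega = 2 * np.pi * k / n_fft + deviation / hop   # radians per sample
    return omega * sr / (2 * np.pi)

sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1003.0 * t)   # hypothetical tone between two bin centers
f_est = instantaneous_freq(x, sr)
```

The magnitude spectrum alone can only say that the tone lies near bin 128 (1000 Hz); the phase difference refines this to a value close to the true 1003 Hz.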

Problems of STFT. Window size and type have to be adjusted manually to the data. Time and frequency resolution are the same in all frequency bands. Human auditory perception has good frequency resolution in lower bands and good time resolution in upper bands. The ratio of center frequency to bandwidth of the auditory filters (the "filter Q") is approximately constant.

Constant Q transform. The window length of the basis sinusoids is inversely proportional to their center frequencies. Center frequencies are logarithmically spaced (no 0 frequency!). The basis matrix is not invertible, so there is no unique inversion (yet?). Efficient implementation: leverages the sparsity of the basis functions in the frequency domain.
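
The relation between center frequency and window length can be made concrete with a short calculation. The sketch assumes the common convention of a fixed number of bins per octave, which gives Q = 1 / (2^(1/B) - 1) and a window length of Q * sr / f_k samples; the function name and the example values (55 Hz, 12 bins per octave, 44.1 kHz) are illustrative assumptions.

```python
import math

def cq_parameters(f_min, bins_per_octave, n_bins, sr):
    """Return (Q, center frequencies in Hz, window lengths in samples)."""
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)    # center frequency / bandwidth
    freqs = [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]
    lengths = [math.ceil(q * sr / f) for f in freqs]    # inversely proportional to f_k
    return q, freqs, lengths

q, freqs, lengths = cq_parameters(f_min=55.0, bins_per_octave=12, n_bins=48, sr=44100)
```

With semitone spacing, Q is about 16.8, and the window for the lowest bin is 16 times longer than the window four octaves above it.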

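Fast CQT. Time-domain kernel $\mathcal{K}$ (dense); spectral kernel $K$ (sparse), the DFT of $\mathcal{K}$. By Parseval's relation:
$$X_{cq}[k_{cq}] = \sum_{n=0}^{N-1} x[n]\,\mathcal{K}^*[n, k_{cq}] = \frac{1}{N}\sum_{k=0}^{N-1} X[k]\,K^*[k, k_{cq}]$$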
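
The equivalence of the time-domain and frequency-domain sums can be checked numerically for a single constant-Q bin. The sketch below assumes a Hamming-windowed complex exponential as the time kernel, an FFT size of 1024, Q = 17, and a 1% magnitude threshold for sparsifying the spectral kernel; all of these are illustrative choices rather than values from the slides.

```python
import numpy as np

sr, n_fft, q = 8000, 1024, 17
f_k = 440.0                                   # nominal center frequency (assumed)
n_k = int(np.ceil(q * sr / f_k))              # window length for this bin
n = np.arange(n_k)

# Dense time kernel, zero-padded to the FFT size.
time_kernel = np.zeros(n_fft, dtype=complex)
time_kernel[:n_k] = np.hamming(n_k) / n_k * np.exp(2j * np.pi * q * n / n_k)
spec_kernel = np.fft.fft(time_kernel)         # spectral kernel

rng = np.random.default_rng(0)
x = rng.standard_normal(n_fft)                # arbitrary test frame
X = np.fft.fft(x)

# np.vdot conjugates its first argument: sum(conj(a) * b).
direct = np.vdot(time_kernel, x)              # time-domain sum
fast = np.vdot(spec_kernel, X) / n_fft        # frequency-domain sum

# Keep only the few spectral-kernel entries that carry most of the energy.
significant = np.abs(spec_kernel) > 0.01 * np.abs(spec_kernel).max()
sparse = np.vdot(spec_kernel[significant], X[significant]) / n_fft
```

Here only a small number of the 1024 spectral-kernel entries exceed the threshold, which is the sparsity that the fast algorithm exploits; `sparse` is an approximation of `direct`, while `fast` matches it up to rounding.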

STFT vs. CQT

SMS. Spectral modeling synthesis: an enhancement of the tracking phase vocoder. Tries to separate the signal into a sinusoidal part and a residual part (modeled as filtered white noise). Stores sinusoidal tracks and filter coefficients. Mixed bottom-up/top-down approach. Usage: transcription, high-quality time stretching and pitch shifting.

Algorithm

Deterministic part. (a) Peak picking. (b) Peak interpolation to increase accuracy.
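
The slide does not say which interpolation is used. A common choice, assumed here, is to fit a parabola through the (log-)magnitudes of the peak bin and its two neighbors and take the vertex as the refined peak. The function name and the test values are hypothetical.

```python
def parabolic_peak(left, center, right):
    """Vertex (offset, height) of the parabola through points at x = -1, 0, 1.

    `center` is the magnitude at the detected peak bin, `left` and `right`
    are its neighbors. The offset is in bins, relative to the peak bin.
    """
    denom = left - 2.0 * center + right
    if denom == 0.0:                      # flat top: no refinement possible
        return 0.0, center
    offset = 0.5 * (left - right) / denom
    height = center - 0.25 * (left - right) * offset
    return offset, height

# Samples of a parabola whose true maximum lies 0.3 bins above the peak bin.
curve = lambda b: -(b - 0.3) ** 2
offset, height = parabolic_peak(curve(-1.0), curve(0.0), curve(1.0))

# Refined frequency for a peak at bin k: (k + offset) * sr / n_fft.
```

For an exact parabola the vertex is recovered exactly; for real window main lobes the fit is an approximation that is usually better on a dB scale than on a linear one.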

Stochastic part. Spectral subtraction can be done in the frequency or the time domain. Frequency domain: synthesize the spectral shape of each sinusoid (main lobe of the window function) and resynthesize. Time domain: use phase-matched additive synthesis. The ideal residual is stochastic.

Stochastic part (cont'd). Perform amplitude rescaling to reduce smearing artifacts: compare the residual to the original signal and, wherever the residual exceeds the original, reduce the amplitude of the residual. Model the spectral envelope of the resulting signal (smoothed DFT, LPC, cepstrum).

Critical steps. Spectral analysis: currently the STFT; can we improve on it? Additive resynthesis: smearing at transients!

Source-filter analysis. Idea: signal = excitation convolved with resonance. Models human speech production and many musical instruments. Excitation: broadband, pitched source signal (e.g., glottal pulse train). Resonance: slowly varying filter (e.g., vocal tract), which produces formants.

Source-filter analysis (cont'd). The source signal is convolved with a time-varying filter. How can the resulting signal be deconvolved? How can the filter coefficients be calculated? Applications: pitch tracking, speech recognition/synthesis, music similarity, ...

Linear Predictive Coding. Analysis: optimize the coefficients of a predictive model (FIR filter) such that the prediction error is minimized. Residual: the difference between the input signal and the prediction. Inverse filter: all-pole (IIR) filter. Resynthesis: use the (compressed) residual as input to the inverse filter.

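LPC maths. Prediction error:
$$e(n) = x(n) - \sum_{k=1}^{p} a_k\,x(n-k)$$
Setting the derivative of the expected squared error with respect to each coefficient to zero:
$$\frac{\partial E\{e^2(n)\}}{\partial a_i} = 2E\left\{e(n)\,\frac{\partial e(n)}{\partial a_i}\right\} = -2E\{e(n)\,x(n-i)\} = -2E\left\{\left[x(n) - \sum_{k=1}^{p} a_k\,x(n-k)\right]x(n-i)\right\} = 0$$
Normal equations (Toeplitz matrix):
$$\sum_{k=1}^{p} a_k\,E\{x(n-k)\,x(n-i)\} = E\{x(n)\,x(n-i)\}$$
$$\sum_{k=1}^{p} a_k\,r_{xx}(i-k) = r_{xx}(i), \quad i = 1, \dots, p$$
Efficient solution: Levinson-Durbin recursion.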
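
The Levinson-Durbin recursion solves the normal equations in O(p^2) operations by exploiting the Toeplitz structure. The following pure-Python sketch uses the autocorrelation method on a short frame; the function names and the test signal (a decaying exponential, which obeys x(n) = 0.9 x(n-1)) are assumptions made for the example.

```python
def autocorrelation(x, max_lag):
    """r_xx(lag) for lag = 0..max_lag (unnormalized, frame treated as zero outside)."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(max_lag + 1)]

def levinson_durbin(r, p):
    """Solve sum_k a_k r(i - k) = r(i), i = 1..p. Returns (a_1..a_p, error power)."""
    a = [0.0] * (p + 1)            # a[0] is unused; a[k] multiplies x(n - k)
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        k_i = acc / err            # reflection coefficient of order i
        new_a = a[:]
        new_a[i] = k_i
        for k in range(1, i):
            new_a[k] = a[k] - k_i * a[i - k]
        a = new_a
        err *= 1.0 - k_i * k_i     # prediction error power shrinks with each order
    return a[1:], err

x = [0.9 ** n for n in range(200)]          # first-order autoregressive test signal
r = autocorrelation(x, 2)
coeffs, err = levinson_durbin(r, 2)
```

For this signal a second-order predictor needs only its first coefficient: the recursion returns a value very close to 0.9 for a_1 and close to zero for a_2.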

Cepstral techniques. "Cepstrum": spectrum of log(abs(spectrum)). Spectrum of the signal: spectrum of the source × spectrum of the filter. "Quefrency": abscissa of the cepstrum plot; the unit of quefrency is time (!). "Cepstrogram": plot of time intervals vs. spectral periodicities. "Liftering": filtering in the cepstral domain.

Cepstrum. The inverse transform (DFT) of the (liftered) cepstrum yields the spectral envelope.
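
Liftering can be illustrated by keeping only the low-quefrency part of the real cepstrum and transforming back to the log-magnitude domain. The sketch below is one possible reading of the slide; the Hann window, the cutoff of 30 cepstral coefficients, and the synthetic source-filter test signal are assumptions.

```python
import numpy as np

def cepstral_envelope(x, n_lifter=30):
    """Return (log magnitude spectrum, real cepstrum, smoothed spectral envelope)."""
    n = len(x)
    log_mag = np.log(np.abs(np.fft.fft(x * np.hanning(n))) + 1e-12)
    cepstrum = np.fft.ifft(log_mag).real
    lifter = np.zeros(n)
    lifter[:n_lifter] = 1.0              # keep low quefrencies ...
    lifter[-(n_lifter - 1):] = 1.0       # ... and their mirrored counterparts
    envelope = np.fft.fft(cepstrum * lifter).real
    return log_mag, cepstrum, envelope

# Hypothetical source-filter signal: 100 Hz pulse train through a 1 kHz resonance.
sr, n = 8000, 1024
pulses = np.zeros(n)
pulses[::80] = 1.0
m = np.arange(64)
resonance = np.exp(-m / 10.0) * np.cos(2 * np.pi * 1000.0 * m / sr)
x = np.convolve(pulses, resonance)[:n]

log_mag, cepstrum, envelope = cepstral_envelope(x)
```

The harmonic ripple of the pulse train lives at quefrencies around the pitch period (80 samples here) and is removed by the lifter, so `envelope` follows the resonance while `log_mag` still shows the individual harmonics.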

MFCC. $\mathrm{MFCC}(x) = \mathrm{DCT}(\mathrm{Mel}(\log|\mathrm{DFT}(x)|))$. Logarithm: transforms the product spectrum into a sum. Mel: perceptual scale of pitches judged by listeners to be equally spaced from one another. DCT: decorrelates the signal (DCT-II); the spectral envelope (timbre) is captured by the low coefficients.
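
A compact MFCC sketch for a single frame follows. Note one deliberate difference from the shorthand formula on the slide: like many practical implementations, the sketch applies the mel filterbank first and takes the logarithm afterwards. The mel formula 2595 log10(1 + f/700), the triangular filters, the 26 filters, and the 13 coefficients are common conventions assumed here, not details given in the slides.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters with centers equally spaced on the mel scale."""
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        lo, ctr, hi = hz_pts[i:i + 3]
        rising = (freqs - lo) / (ctr - lo)
        falling = (hi - freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def dct_ii(v, n_coeffs):
    """Unnormalized DCT-II, keeping the first n_coeffs coefficients."""
    m = len(v)
    k = np.arange(n_coeffs)[:, None]
    n = np.arange(m)[None, :]
    return np.cos(np.pi * k * (n + 0.5) / m) @ v

def mfcc(frame, sr, n_filters=26, n_coeffs=13):
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    mel_energies = mel_filterbank(n_filters, len(frame), sr) @ mag
    return dct_ii(np.log(mel_energies + 1e-10), n_coeffs)   # log after mel (assumed order)

sr = 8000
rng = np.random.default_rng(0)
frame = rng.standard_normal(1024)        # hypothetical noise frame
coeffs = mfcc(frame, sr)
```

Keeping only the first 13 DCT coefficients discards fine spectral detail and retains the coarse envelope, which is why the low coefficients are used as a timbre descriptor.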

Music similarity. Model timbre as a Gaussian distribution: $\mu = \frac{1}{n}\sum_i x_i$, $E(XX^T) = \frac{1}{n}\sum_i x_i x_i^T$, $\Sigma = E(XX^T) - \mu\mu^T$. Compute the similarity between distributions (KL divergence, earth mover's distance, ...). Simple genre classification: "training" with labeled reference samples, then nearest-neighbor classification.
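
The Gaussian model and a KL-based distance can be sketched as follows. The closed-form KL divergence between two Gaussians is standard; the symmetrization by summing both directions, the function names, and the random stand-in "tracks" (which would be MFCC frames in practice) are assumptions for illustration.

```python
import numpy as np

def fit_gaussian(X):
    """X holds one feature vector per row. Returns (mean, covariance)."""
    n = X.shape[0]
    mu = X.sum(axis=0) / n
    second_moment = X.T @ X / n                  # E(XX^T)
    return mu, second_moment - np.outer(mu, mu)  # Sigma = E(XX^T) - mu mu^T

def kl_gaussian(mu0, cov0, mu1, cov1):
    """KL divergence KL(N0 || N1) between two multivariate Gaussians."""
    d = len(mu0)
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - d + logdet1 - logdet0)

def symmetric_kl(g0, g1):
    return kl_gaussian(*g0, *g1) + kl_gaussian(*g1, *g0)

# Hypothetical feature matrices standing in for the frames of three tracks.
rng = np.random.default_rng(0)
track_a = rng.standard_normal((500, 4))
track_b = rng.standard_normal((500, 4)) * 2.0 + 1.0   # different "timbre"
track_c = rng.standard_normal((500, 4))               # same distribution as track_a

ga, gb, gc = fit_gaussian(track_a), fit_gaussian(track_b), fit_gaussian(track_c)
d_ab = symmetric_kl(ga, gb)
d_ac = symmetric_kl(ga, gc)
```

A nearest-neighbor genre classifier in this setting would assign a query track the label of the reference track with the smallest distance.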

High-level music analysis. Beat tracking: track the locations of beats. Tempo estimation: find the (perceptual) tempo of a musical piece. Pitch estimation. Chord/key estimation.

Beat tracking. First step: onset detection, which can be done in the spectral or the time domain. Causal/real-time methods: model the beat as a dynamically excited oscillator. Offline methods: cluster inter-onset intervals (IOIs) and find the most plausible beat hypothesis.

Scheirer's algorithm. Subband decomposition (6 bands). Feed the half-wave rectified envelopes into a resonator filterbank (150 bands, approx. 60-240 bpm). Choose the resonator with the maximum output over all bands (this gives the tempo).

Scheirer's algorithm (cont'd). Beat phase can be determined by inspecting the output or the internal state of the winning oscillator. Pros: predicts what is happening NOW (in contrast to simple autocorrelation, which performs its calculation "after the fact"). Cons: discretizes tempo.

Dixon's algorithm. Non-causal. IOI clustering. Multiple agents.

Dixon's algorithm (cont'd). Onset detection ("surfboard" method): calculate the amplitude envelope of the signal, then apply linear regression to the envelope. Use the IOI clusters as input to agents which predict beat times.
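
The IOI clustering step can be illustrated with a strongly simplified sketch. It collects the intervals between all pairs of onsets, groups them greedily, and takes the most populated cluster as a tempo hypothesis. This is not Dixon's full method (which scores related clusters and hands them to beat-tracking agents); the cluster width of 25 ms, the 2.5 s limit, and the onset times are assumptions.

```python
def cluster_iois(onsets, max_ioi=2.5, width=0.025):
    """Greedy 1-D clustering of inter-onset intervals. Returns [(mean_ioi, count), ...]."""
    iois = sorted(
        later - earlier
        for i, earlier in enumerate(onsets)
        for later in onsets[i + 1:]
        if later - earlier <= max_ioi
    )
    clusters = []                      # each cluster is a list of similar IOIs
    for ioi in iois:
        if clusters and ioi - sum(clusters[-1]) / len(clusters[-1]) <= width:
            clusters[-1].append(ioi)   # close to the current cluster mean
        else:
            clusters.append([ioi])     # start a new cluster
    return [(sum(c) / len(c), len(c)) for c in clusters]

# Hypothetical onset times (seconds) of a slightly unsteady 120 bpm pulse.
onsets = [0.0, 0.5, 1.01, 1.5, 2.0, 2.49, 3.0]
clusters = cluster_iois(onsets)
best_ioi, best_count = max(clusters, key=lambda c: c[1])
tempo_bpm = 60.0 / best_ioi
```

The other clusters sit near integer multiples of the beat period (1.0 s, 1.5 s, ...), which is the kind of relationship a fuller implementation would use to rank the hypotheses.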

Pitch estimation. Task: find the fundamental frequency of a signal. Problems: the lowest spectral peak is not always the fundamental frequency; the perceived fundamental may not even be physically present.

Pitch estimation (cont'd). Time domain: zero-crossing rate; maxima of the autocorrelation $\phi(\tau) = \frac{1}{N}\sum_{n=0}^{N-1} x(n)\,x(n+\tau)$; minima of the magnitude difference $\psi(\tau) = \frac{1}{N}\sum_{n=0}^{N-1} |x(n) - x(n+\tau)|$. Frequency domain: cepstrum; maximum likelihood; HPS.
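
Both time-domain functions can be evaluated directly from their definitions. The sketch below searches a limited lag range for the autocorrelation maximum and the magnitude-difference minimum; the 200 Hz test tone, the frame length of 400 samples, and the lag range of 20 to 60 samples are assumptions for the example.

```python
import math

def autocorrelation(x, tau, n):
    """phi(tau) = (1/N) * sum_{i=0}^{N-1} x(i) x(i + tau)"""
    return sum(x[i] * x[i + tau] for i in range(n)) / n

def magnitude_difference(x, tau, n):
    """psi(tau) = (1/N) * sum_{i=0}^{N-1} |x(i) - x(i + tau)|"""
    return sum(abs(x[i] - x[i + tau]) for i in range(n)) / n

sr, f0, n = 8000, 200.0, 400
lags = range(20, 61)                  # candidate periods: 2.5 ms to 7.5 ms
x = [math.sin(2 * math.pi * f0 * i / sr) for i in range(n + max(lags))]

acf_lag = max(lags, key=lambda tau: autocorrelation(x, tau, n))
amdf_lag = min(lags, key=lambda tau: magnitude_difference(x, tau, n))
acf_pitch = sr / acf_lag
amdf_pitch = sr / amdf_lag
```

The lag range matters: for a periodic signal both functions repeat at every multiple of the period, so an unrestricted search can return an octave error.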

Cepstrum pitch detection. Real cepstrum: $C(x) = \mathrm{IFFT}(\log|\mathrm{DFT}(x)|)$. The log scales values into a usable range. Regular partials appear as peaks in the cepstrum. The unit of quefrency is ms (period).
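
A minimal cepstral pitch detector picks the largest cepstral peak inside a plausible range of periods. The sketch below assumes a Hann window, a search range of 60 to 400 Hz, and a synthetic pulse train through a simple resonance as test input; none of these values come from the slides.

```python
import numpy as np

def cepstrum_pitch(x, sr, f_min=60.0, f_max=400.0):
    """Return the pitch (Hz) at the strongest cepstral peak in the search range."""
    spectrum = np.fft.fft(x * np.hanning(len(x)))
    cepstrum = np.fft.ifft(np.log(np.abs(spectrum) + 1e-12)).real
    q_lo, q_hi = int(sr / f_max), int(sr / f_min)     # quefrency range in samples
    peak = q_lo + int(np.argmax(cepstrum[q_lo:q_hi + 1]))
    return sr / peak                                  # period in samples -> Hz

# Hypothetical test signal: 100 Hz pulse train filtered by a 1 kHz resonance.
sr, n = 8000, 1024
pulses = np.zeros(n)
pulses[::80] = 1.0
m = np.arange(64)
resonance = np.exp(-m / 10.0) * np.cos(2 * np.pi * 1000.0 * m / sr)
x = np.convolve(pulses, resonance)[:n]

f0 = cepstrum_pitch(x, sr)
```

The lower quefrency bound keeps the search away from the low-quefrency region, where the cepstrum is dominated by the filter (spectral envelope) rather than by the pitch.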

HPS, ML. Harmonic product spectrum: $Y(\omega) = \prod_{r=1}^{R} |X(\omega r)|$, $\hat{Y} = \max_{\omega_i} Y(\omega_i)$. Maximum likelihood: correlate ideal spectra with the input. Ideal spectrum: pulse train starting at $\omega$, convolved with the (spectrum of the) analysis window function. Select the spectral template with the maximum correlation.
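
The harmonic product spectrum can be computed by multiplying the magnitude spectrum with copies of itself that are downsampled by 2, 3, ..., R. The sketch below assumes R = 3, a Hann window, and a test tone whose fundamental is weaker than its second partial; these values are illustrative.

```python
import numpy as np

def harmonic_product_spectrum(x, sr, n_harmonics=3):
    """Return the frequency (Hz) that maximizes Y(w) = prod_r |X(r * w)|."""
    mag = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    n_out = len(mag) // n_harmonics
    hps = np.ones(n_out)
    for r in range(1, n_harmonics + 1):
        hps *= mag[::r][:n_out]            # |X(r * w)| sampled on the bin grid
    peak_bin = 1 + int(np.argmax(hps[1:])) # skip the DC bin
    return peak_bin * sr / len(x)

# Hypothetical tone: 200 Hz fundamental that is weaker than its second partial.
sr, n = 8000, 2000
t = np.arange(n) / sr
x = (0.3 * np.sin(2 * np.pi * 200 * t) + 1.0 * np.sin(2 * np.pi * 400 * t)
     + 0.8 * np.sin(2 * np.pi * 600 * t) + 0.6 * np.sin(2 * np.pi * 800 * t))

f0 = harmonic_product_spectrum(x, sr)
strongest = np.argmax(np.abs(np.fft.rfft(x * np.hanning(n)))) * sr / n
```

Picking the strongest spectral peak would report 400 Hz here; the product only becomes large at 200 Hz, where the fundamental and its multiples all carry energy.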

Key/Chord recognition. Chroma: fold a spectral representation down to 12 bins, one bin per pitch class. Correlate chroma vectors with pitch-class distribution templates.
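
Chroma folding and template matching can be sketched on a handful of spectral peaks. The example assumes A4 = 440 Hz tuning, binary major and minor triad templates, and a plain dot product as an unnormalized correlation; real systems typically use full spectra and weighted pitch-class profiles. The peak list is hypothetical.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def chroma_from_peaks(peaks, a4=440.0):
    """Fold (frequency in Hz, magnitude) pairs into a 12-bin pitch-class vector."""
    chroma = [0.0] * 12
    for freq, mag in peaks:
        midi = 69 + 12 * math.log2(freq / a4)   # fractional MIDI note number
        chroma[round(midi) % 12] += mag         # pitch class 0 = C
    return chroma

def best_triad(chroma):
    """Return the major ("C") or minor ("Cm") triad whose template scores highest."""
    best = None
    for root in range(12):
        for suffix, third in (("", 4), ("m", 3)):
            score = chroma[root] + chroma[(root + third) % 12] + chroma[(root + 7) % 12]
            if best is None or score > best[0]:
                best = (score, NOTE_NAMES[root] + suffix)
    return best[1]

# Hypothetical spectral peaks: C4, E4, G4, and C5.
peaks = [(261.63, 1.0), (329.63, 0.8), (392.00, 0.9), (523.25, 0.5)]
chroma = chroma_from_peaks(peaks)
chord = best_triad(chroma)
```

C4 and C5 land in the same chroma bin, which is the octave folding described on the slide; the same matching scheme works for key recognition if the triad templates are replaced by key profiles.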

Thank you!