EE E682: Speech & Audio Processing & Recognition
Lecture 6: Nonspeech and Music

1 Music and nonspeech
2 Environmental sounds
3 Music synthesis techniques
4 Sinewave synthesis
5 Music analysis

Dan Ellis <dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/e682/
Columbia University Dept. of Electrical Engineering
Spring 2006

E682 SAPR - Dan Ellis: L6 - Nonspeech & Music, 2006-02-23
1 Music & nonspeech

What is nonspeech?
- according to research effort: a little music
- in the world: almost everything

[Figure: sounds arranged by information content (low to high) and origin (natural to man-made): wind & water, animal sounds, contact/collision, machines & engines, music, speech]

attributes?
Sound attributes

Attributes suggest model parameters

What do we notice about general sound?
- psychophysics: pitch, loudness, timbre
- bright/dull; sharp/soft; grating/soothing
- sound is not abstract: the tendency is to describe it by source-events

Ecological perspective
- what matters about sound is what happened
- our percepts express this more-or-less directly
Motivations for modeling

Describe/classify
- cast sound into a model because we want to use the resulting parameters

Store/transmit
- the model implicitly exploits the limited structure of the signal

Resynthesize/modify
- the model separates out the interesting parameters

[Figure: Sound mapped into a model parameter space]
Analysis and synthesis

Analysis is the converse of synthesis:
[Figure: Sound <-> Model/representation, linked by Analysis and Synthesis]

Can exist apart:
- analysis for classification
- synthesis of artificial sounds

Often used together:
- encoding/decoding of compressed formats
- resynthesis based on analyses
- analysis-by-synthesis
Outline

1 Music and nonspeech
2 Environmental sounds
  - Collision sounds
  - Sound textures
3 Music synthesis techniques
4 Sinewave synthesis
5 Music analysis
2 Environmental Sounds

Where sound comes from: mechanical interactions
- contact / collisions
- rubbing / scraping
- ringing / vibrating

Interest in environmental sounds
- they carry information about events around us.. including indirect hints
- we need to create them in virtual environments.. including soundtracks

Approaches to synthesis
- recording / sampling
- synthesis algorithms
Collision sounds (from Gaver 1993)

Factors influencing the sound:
- colliding bodies: size, material, damping
- local properties at the contact point (hardness)
- energy of the collision

Source-filter model
- source = excitation of the collision event (energy, local properties at contact)
- filter = resonance and radiation of energy (body properties)

Variety of strike/scraping sounds
- resonant frequencies ~ size/shape
- damping ~ material
- HF content in excitation/strike ~ mallet, force
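The source-filter collision idea can be sketched in a few lines: an impulsive "strike" exciting a bank of damped resonators (the body modes). This is a minimal illustration, not Gaver's actual model; the mode frequencies, decay times, and amplitudes below are hypothetical values chosen to sound vaguely wood-block-like.

```python
import numpy as np

def strike(freqs_hz, decays_s, amps, sr=16000, dur=0.5):
    """Source-filter collision sketch: an impulse excites a bank of
    exponentially damped resonators.  Resonant frequencies ~ size/shape;
    decay times ~ material damping; amplitudes ~ strike energy."""
    n = np.arange(int(sr * dur))
    out = np.zeros(len(n))
    for f, tau, a in zip(freqs_hz, decays_s, amps):
        # each mode: a damped sinusoid = impulse response of a 2-pole resonator
        out += a * np.exp(-n / (sr * tau)) * np.sin(2 * np.pi * f * n / sr)
    return out

# hypothetical 'wood block': a few modes with fast decay
y = strike([820, 1430, 2260], [0.06, 0.04, 0.03], [1.0, 0.5, 0.3])
```

Changing the decay times alone (e.g. 0.5 s instead of 0.05 s) turns "wood" into something bell-like, which is exactly the damping ~ material correspondence above.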
Sound textures

What do we hear in:
- a city street
- a symphony orchestra

How do we distinguish:
- waterfall / rainfall
- applause / static

[Figure: spectrograms of "Applause" and "Rain" examples, frequency vs. time over 4 s]

Levels of ecological description...
Sound texture modeling (Athineos)

Model broad spectral structure with LPC
- could just resynthesize with noise

Model fine temporal structure in the residual with linear prediction in the time domain:
- TD-LP: y[n] = Σ_i a_i y[n-i] → per-frame spectral parameters + whitened residual e[n]
- DCT of the residual → residual spectrum E[k]
- FD-LP: E[k] = Σ_i b_i E[k-i] → per-frame temporal envelope parameters
- a precise dual of LPC in frequency: poles model temporal events

[Figure: temporal envelopes (4 poles, 256 ms) fitted over a residual waveform]

Allows modification / synthesis?
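The TD-LP / FD-LP duality above can be sketched directly: apply ordinary autocorrelation-method linear prediction, but to the DCT of a frame rather than to the waveform, so the predictor's poles model the temporal envelope instead of the spectral one. A minimal sketch, assuming scipy is available; the signal, frame length, and model order are arbitrary illustration choices.

```python
import numpy as np
from scipy.fft import dct

def lpc(x, order):
    """Autocorrelation-method linear prediction: solve the normal
    equations directly (a Levinson recursion would be the usual route)."""
    r = np.correlate(x, x, 'full')[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

sr = 8000
t = np.arange(sr // 4) / sr
x = np.exp(-20 * t) * np.random.randn(len(t))  # decaying noise burst

X = dct(x, norm='ortho')  # "frequency-domain" signal for the frame
b = lpc(X, 8)             # FD-LP: poles now model temporal events
```

Applied to the waveform, `lpc` gives the usual spectral envelope; applied to `X`, the same machinery gives an all-pole fit to the frame's temporal envelope, which here should capture the exponential decay.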
Outline

1 Music and nonspeech
2 Environmental sounds
3 Music synthesis techniques
  - Framework
  - Historical development
4 Sinewave synthesis
5 Music analysis

elements?
3 Music synthesis techniques

What is music?
- could be anything → flexible synthesis needed!

Key elements of conventional music
- instruments
- note-events (time, pitch, accent level)
- melody, harmony, rhythm: patterns of repetition & variation

Synthesis framework:
- instruments: a common framework for many notes
- score: a sequence of (time, pitch, level) note events
The nature of musical instrument notes

Characterized by instrument (register), note, loudness/emphasis, articulation...

[Figure: spectrograms of single notes on piano, violin, clarinet, and trumpet, frequency vs. time]

distinguish how?
Development of music synthesis

Goals of music synthesis:
- generate realistic / pleasant new notes
- control / explore timbre (quality)

Earliest computer systems in the 1960s (voice synthesis, algorithmic)

Pure synthesis approaches:
- 1970s: Analog synths
- 1980s: FM (Stanford/Yamaha)
- 1990s: Physical modeling, hybrids

Analysis-synthesis methods:
- sampling / wavetables
- sinusoid modeling
- harmonics + noise (+ transients)

others?
Analog synthesis

The minimum to make an interesting sound:
[Figure: trigger + pitch/vibrato drive an oscillator; an envelope modulates the filter cutoff frequency and the output gain]

Elements:
- harmonics-rich oscillators
- time-varying filters
- time-varying envelope
- modulation: low frequency + envelope-based

Result:
- time-varying spectrum, independently controlled pitch
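The oscillator → filter → envelope chain above can be sketched digitally. This is a crude block-wise approximation (redesigning the filter every 512 samples rather than per-sample, which can leave small discontinuities at block edges); the pitch, envelope, and cutoff-sweep values are illustrative choices.

```python
import numpy as np
from scipy.signal import butter, lfilter

sr = 16000
t = np.arange(int(0.5 * sr)) / sr

# harmonics-rich oscillator: a sawtooth at the note pitch
f0 = 220.0
saw = 2 * ((f0 * t) % 1.0) - 1.0

# simple envelope: fast attack, exponential decay
env = np.minimum(t / 0.01, 1.0) * np.exp(-3 * t)

# time-varying filter, block-wise: cutoff follows the envelope
out = np.zeros_like(saw)
block = 512
for i in range(0, len(saw), block):
    fc = 300 + 3000 * env[i]              # bright at the attack, dull later
    b, a = butter(2, fc / (sr / 2))
    out[i:i + block] = lfilter(b, a, saw[i:i + block])
out *= env                                # envelope also controls gain
```

The result has exactly the character listed on the slide: a spectrum that evolves over the note while the pitch stays fixed at `f0`.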
FM synthesis

Fast frequency modulation → sidebands:
  cos(ω_c t + β sin(ω_m t))  (phase modulation)
    = Σ_n J_n(β) cos((ω_c + n ω_m) t)
- a harmonic series if ω_c = r ω_m
- J_n(β) is a Bessel function; J_n(β) ≈ 0 for β < n - 2

[Figure: J_0 .. J_4 versus modulation index β]

Complex harmonic spectra by varying β
[Figure: spectrogram of an FM tone with ω_c = ω_m as β varies over time]

what use?
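The slide's formula is directly executable: phase-modulate a carrier and the sidebands appear at ω_c + nω_m with Bessel-function amplitudes J_n(β). A minimal sketch with ω_c = ω_m (so the sidebands form a harmonic series); the 200 Hz and β = 2 values are illustrative.

```python
import numpy as np

sr, dur = 16000, 0.5
t = np.arange(int(sr * dur)) / sr
fc = fm = 200.0   # carrier = modulator -> harmonic spectrum
beta = 2.0        # modulation index: larger beta -> richer spectrum

# cos(w_c t + beta sin(w_m t)): phase modulation
x = np.cos(2 * np.pi * fc * t + beta * np.sin(2 * np.pi * fm * t))

# spectrum: energy should concentrate at multiples of fm,
# with amplitudes governed by J_n(beta)
X = np.abs(np.fft.rfft(x * np.hanning(len(x))))
```

With 0.5 s at 16 kHz the FFT bins are 2 Hz apart, so the 200 Hz and 400 Hz sidebands land on bins 100 and 200, well above the leakage between harmonics.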
Sampling synthesis

Resynthesis from real notes → vary pitch, duration, level

Pitch: stretch (resample) the waveform
[Figure: the same source waveform resampled to 596 Hz and 894 Hz]

Duration: loop a sustain section
[Figure: looped sustain section with matched splice points]
- need to line up the source samples

Level: cross-fade different examples
[Figure: soft and loud recordings mixed according to velocity]

good & bad?
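The "pitch by resampling" trick can be sketched with linear interpolation. Note the classic sampler trade-off it exposes: raising the pitch also shortens the note, which is why looping is needed to restore duration. The 440 Hz test note and the fifth-up interval are illustrative choices.

```python
import numpy as np

def repitch(x, ratio):
    """Pitch-shift a sampled note by resampling with linear
    interpolation.  ratio > 1 raises pitch AND shortens the note."""
    idx = np.arange(0, len(x) - 1, ratio)   # fractional read positions
    lo = idx.astype(int)
    frac = idx - lo
    return (1 - frac) * x[lo] + frac * x[lo + 1]

sr = 8000
t = np.arange(sr) / sr
note = np.sin(2 * np.pi * 440 * t)   # a hypothetical 440 Hz source note
up = repitch(note, 2 ** (7 / 12))    # up a fifth (ratio ~1.498)
```

The output is ~1.5x higher in pitch and correspondingly ~1.5x shorter, so real samplers resample only modest intervals per source note (multisampling) and loop the sustain to recover length.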
Outline

1 Music and nonspeech
2 Environmental sounds
3 Music synthesis techniques
4 Sinewave synthesis (detail)
  - Sinewave modeling
  - Sines + residual...
5 Music analysis
4 Sinewave synthesis

If patterns of harmonics are what matter, why not generate them all explicitly:
  s[n] = Σ_k A_k[n] cos(k ω_0[n] n)
- a particularly powerful model for pitched signals

Analysis (as with speech):
- find peaks in the STFT S[ω,n] & track them
- or track the fundamental ω_0 (harmonics / autocorrelation) & sample the STFT at k ω_0 → a set of A_k[n] to duplicate the tone

[Figure: spectrogram of a tone and the harmonic magnitudes sampled from it]

Synthesis via a bank of oscillators
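The synthesis equation s[n] = Σ_k A_k[n] cos(k ω_0[n] n) reduces, for fixed amplitudes and pitch, to a plain additive oscillator bank. A minimal sketch; the 200 Hz pitch and the 1/2^k amplitude rolloff are illustrative.

```python
import numpy as np

def harmonic_tone(f0, amps, sr=16000, dur=0.25):
    """Additive synthesis: one oscillator per harmonic,
    s[n] = sum_k A_k cos(k * w0 * n), with constant A_k here."""
    n = np.arange(int(sr * dur))
    w0 = 2 * np.pi * f0 / sr
    return sum(a * np.cos((k + 1) * w0 * n) for k, a in enumerate(amps))

tone = harmonic_tone(200.0, [1.0, 0.5, 0.25, 0.125])
```

Time-varying A_k[n] and ω_0[n] (as recovered by the analysis steps on the following slides) turn this fixed tone into a full resynthesis of a recorded note.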
Steps to sinewave modeling - 1

The underlying STFT:
  X[k, n0] = Σ_{n=0}^{N-1} x[n0 + n] w[n] exp(-j 2πkn / N)
- what value for N (FFT length & window size)?
- what value for H (hop size: n0 = rH, r = 0, 1, 2...)?

STFT window length determines frequency resolution:
  X_w(e^jω) = X(e^jω) * W(e^jω)

Choose N long enough to resolve harmonics:
- 2-3x the longest (lowest) fundamental period
- e.g. 30-60 ms = 480-960 samples @ 16 kHz
- choose H ≤ N/2

N too long → lost time resolution
- limits the sinusoid amplitudes' rate of change
Steps to sinewave modeling - 2

Choose candidate sinusoids at each time step by picking peaks in each STFT frame:
[Figure: spectrogram and one frame's magnitude spectrum with peaks marked]

Quadratic fit for each peak:
- fit y = ax(x - b) through the bins around the local maximum
  → peak at x = b/2, height ab²/4
- plus linear interpolation of the unwrapped phase
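The quadratic fit above is usually implemented as three-point parabolic interpolation around the maximum bin: with neighboring magnitudes a, b, c, the fractional offset is p = (a - c) / (2(a - 2b + c)) and the interpolated height is b - (a - c)p/4. A small sketch, verified on an exact parabola:

```python
import numpy as np

def quad_peak(mag, k):
    """Parabolic interpolation of a spectral peak at bin k using its two
    neighbors: returns (fractional-bin location, interpolated level)."""
    a, b, c = mag[k - 1], mag[k], mag[k + 1]
    p = 0.5 * (a - c) / (a - 2 * b + c)   # offset in (-0.5, 0.5)
    return k + p, b - 0.25 * (a - c) * p

# sanity check on an exact parabola peaking at x = 5.3, height 10
x = np.arange(10)
y = 10 - (x - 5.3) ** 2
loc, lev = quad_peak(y, 5)   # recovers (5.3, 10.0) exactly
```

In practice the interpolation is applied to log-magnitudes (dB), where windowed sinusoid mainlobes are closest to parabolic.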
Steps to sinewave modeling - 3

Which peaks to pick?

Want true sinusoids, not noise fluctuations
- prominence threshold above a smoothed spectrum
[Figure: frame spectrum with a smoothed threshold curve overlaid]

Sinusoids exhibit stability...
- of amplitude in time
- of phase derivative in time
→ compare with adjacent time frames to test?
Steps to sinewave modeling - 4

Grow tracks by appending newly-found peaks to existing tracks:
[Figure: tracks in time-frequency; new peaks extend existing tracks, with births and deaths]
- ambiguous assignments are possible

Unclaimed new peak
- birth of a new track
- backtrack to find its earliest trace?

No continuation peak for an existing track
- death of the track
- or: reduce the peak threshold for hysteresis
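The birth/death bookkeeping above can be sketched with greedy nearest-frequency matching. This is a simplification: real trackers resolve the ambiguous assignments globally (and may apply the hysteresis mentioned on the slide); the 50 Hz jump limit is an arbitrary illustration value.

```python
def continue_tracks(tracks, new_peaks, max_jump=50.0):
    """Each existing track (a list of frequencies) claims the nearest
    unclaimed new peak within max_jump Hz.  Unmatched tracks die (None);
    unmatched peaks are born as new one-element tracks."""
    claimed = set()
    out = []
    for trk in tracks:
        best, best_d = None, max_jump
        for i, f in enumerate(new_peaks):
            d = abs(f - trk[-1])
            if i not in claimed and d <= best_d:
                best, best_d = i, d
        if best is None:
            out.append(None)                     # death of track
        else:
            claimed.add(best)
            out.append(trk + [new_peaks[best]])  # continuation
    births = [[f] for i, f in enumerate(new_peaks) if i not in claimed]
    return out, births

tracks = [[200.0, 202.0], [400.0, 401.0]]
cont, births = continue_tracks(tracks, [203.0, 600.0])
# first track continues to 203 Hz; second dies; 600 Hz is a birth
```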
Resynthesis of sinewave models

After analysis, each track defines contours in frequency and amplitude, f_k[n], A_k[n] (+ phase?)
- use them to drive a bank of sinewave oscillators & sum:
  Σ_k A_k[n] cos(2π f_k[n] t)

[Figure: extracted amplitude and frequency contours and the resulting oscillator-bank output]

Regularize to exactly harmonic: f_k[n] = k f_0[n]

what to do?
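Driving the oscillator bank from the f_k[n], A_k[n] contours needs one care point: the phase must be accumulated (the integral of instantaneous frequency), not computed as cos(2π f_k[n] n), or frequency changes produce clicks. A minimal sketch with one gliding partial as a stand-in for real analyzed tracks:

```python
import numpy as np

def oscillator_bank(f_tracks, a_tracks, sr=16000):
    """Resynthesize from per-sample frequency/amplitude contours
    f_k[n], A_k[n] by accumulating phase per track and summing."""
    out = None
    for f, a in zip(f_tracks, a_tracks):
        phase = 2 * np.pi * np.cumsum(f) / sr   # integrate inst. frequency
        y = a * np.cos(phase)
        out = y if out is None else out + y
    return out

n = 8000
f1 = np.linspace(440.0, 660.0, n)   # a gliding partial (illustrative)
a1 = np.linspace(1.0, 0.2, n)       # fading amplitude contour
y = oscillator_bank([f1], [a1])
```

With the analyzed contours of all tracks passed in, this is the complete resynthesis path; time and pitch modifications (next slide) are just edits to `f_tracks`/`a_tracks` before this call.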
Modification in sinewave resynthesis

Change duration by warping the timebase
- may want to keep the onset unwarped
[Figure: frequency tracks with the sustain stretched and the onset intact]

Change pitch by scaling frequencies
- either stretching or resampling the spectral envelope
[Figure: spectral envelope under frequency stretching vs. resampling]

Change timbre by interpolating parameters
Sinusoids + residual

Only prominent peaks became tracks
- the remainder of the spectral energy was noisy?
→ model the residual energy with noise

How to obtain the non-harmonic spectrum?
- zero-out the spectrum near extracted peaks?
- or: resynthesize (exactly) & subtract waveforms:
  e_s[n] = s[n] - Σ_k A_k[n] cos(2πn f_k[n])
  .. must preserve phase!

[Figure: original spectrum, sinusoid model, residual, and LPC fit to the residual]

Can model the residual signal with LPC
→ flexible representation of the noisy residual
Sinusoids + noise + transients

Sound represented as sinusoids plus filtered noise:
  s[n] = Σ_k A_k[n] cos(2πn f_k[n]) + h_n[n] * b[n]
- parameters are {A_k[n], f_k[n]} (sinusoids) and h_n[n] (noise filter for the residual e_s[n])

[Figure: spectrogram decomposed into sinusoid tracks {A_k[n], f_k[n]} and noise envelope h_n[n]]

Separate out abrupt transients in the residual?
  e_s[n] = Σ_k t_k[n] + h_n[n] * b'[n]
- more specific → more flexible
Outline

1 Music and nonspeech
2 Environmental sounds
3 Music synthesis techniques
4 Sinewave synthesis
5 Music analysis
  - Instrument identification
  - Pitch tracking
5 Music analysis

What might we want to get out of music?

Instrument identification
- different levels of specificity
- registers within instruments

Score recovery
- transcribe the note sequence
- extract the performance

Ensemble performance
- gestalts: chords, tone colors

Broader timescales
- phrasing & musical structure
- artist / genre clustering and classification
Instrument identification

Research looks for a perceptual timbre space
[Figure: notes arranged on axes of dull/bright, low/high attack, low/high spectral flux]

procedure?

Cues to instrument identification
- onset (rise time), sustain (brightness)

Hierarchy of instrument families
- strings / reeds / brass
- optimize features at each level
Pitch tracking

Fundamental frequency (≈ pitch) is a key attribute of musical sounds
→ pitch tracking as a key technology

Pitch tracking for speech
- voice pitch & spectrum are highly dynamic
- speech is voiced and unvoiced

ground truth?

Applications
- voice coders (excitation description)
- harmonic modeling
Pitch tracking for music

Pitch in music
- pitch is more stable (although vibrato)
- but: multiple pitches
[Figure: spectrogram of a polyphonic passage over 5 s]

Applications
- harmonic modeling
- music transcription (→ storage, resynthesis)
- source separation

Approaches: place & time
Meddis & Hewitt pitch model

Autocorrelation (time-domain) based pitch extraction
- fundamental period → peak(s) in the autocorrelation:
  r_xx(T) = Σ_t x(t) x(t + T), maximized when T is the period
[Figure: waveform and its autocorrelation, with a peak at the fundamental lag]

Compute separately in each frequency band & summarize across (perceptual) channels:
- sound → bandpass filters → rectification & low-pass filtering → per-channel periodicity detection (autocorrelogram) → cross-channel sum → summary ACF
[Figure: autocorrelogram, channel center frequency vs. lag, with the summary ACF below]
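The core of the model, per channel, is the autocorrelation peak-pick itself. A bare-bones sketch of that step (without the filterbank or the cross-channel summary), with an illustrative 220 Hz test tone:

```python
import numpy as np

def acf_pitch(x, sr, fmin=80.0, fmax=500.0):
    """Pick the fundamental period as the largest autocorrelation peak
    within a plausible lag range [sr/fmax, sr/fmin]."""
    r = np.correlate(x, x, 'full')[len(x) - 1:]   # r_xx(T) for T >= 0
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(r[lo:hi])
    return sr / lag

sr = 8000
t = np.arange(sr // 2) / sr
x = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
f0 = acf_pitch(x, sr)   # near 220 Hz despite the strong 2nd harmonic
```

The lag-range restriction is what keeps the estimator from picking T = 0 or a multiple of the period; the full model gains robustness by repeating this per cochlear channel and summing.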
Tolonen & Karjalainen simplification

Multiple frequency channels can have different dominant pitches...

But equalizing (flattening) the spectrum works:
- sound → prewhitening → two branches split at 1 kHz (the highpass branch rectified & low-passed) → periodicity detection in each → summary ACF (SACF) → enhancement → ESACF

[Figure: periodogram of a male/female voice mix over time, and the summary autocorrelation at t = 0.775 s with peaks at 200 Hz (5 ms lag) and 125 Hz (8 ms lag)]

lag vs. freq?

- Enhancement = cancel subharmonics
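The "cancel subharmonics" enhancement step can be sketched as follows: clip the SACF to positive values, then subtract time-stretched copies of it (stretch factor m moves the true-period peak out to m times its lag), which removes the spurious peaks at integer multiples of the period. A toy-data sketch of that idea, not the authors' exact procedure:

```python
import numpy as np

def enhance_sacf(sacf, max_factor=4):
    """ESACF-style subharmonic cancellation: half-wave rectify, then for
    m = 2..max_factor subtract a copy stretched by m (so a peak at lag T
    lands on the multiple-period peak at m*T) and clip again."""
    e = np.maximum(sacf, 0.0)
    lags = np.arange(len(e))
    for m in range(2, max_factor + 1):
        stretched = np.interp(lags / m, lags, e)  # e's peak at T -> m*T
        e = np.maximum(e - np.maximum(stretched, 0.0), 0.0)
    return e

# toy SACF: a true-period peak at lag 40 and its subharmonic at lag 80
sacf = np.zeros(200)
sacf[40], sacf[80] = 1.0, 0.8
out = enhance_sacf(sacf)   # lag-40 peak survives, lag-80 peak cancelled
```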
Post-processing of pitch tracks

Remove outliers with median filtering (e.g. a 5-point median)

Octave errors are common:
- if x(t) ≈ x(t + T) then x(t) ≈ x(t + 2T) etc.
→ dynamic programming / HMM smoothing

Validity
- is there a pitch at this time?
- voiced/unvoiced decision for speech

Event detection
- when does a pitch slide indicate a new note?
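The 5-point median step is one line with scipy; here it is applied to a hypothetical pitch track contaminated with isolated octave errors (values at 2x and 0.5x the true pitch), which the median removes while a linear smoother would only smear them:

```python
import numpy as np
from scipy.signal import medfilt

# a pitch track (Hz) with two spurious octave errors at frames 2 and 5
track = np.array([220., 221., 440., 222., 223., 110., 224., 225., 226.])
smoothed = medfilt(track, kernel_size=5)   # 5-point running median
```

Note `medfilt` zero-pads at the edges, so the first and last couple of frames are less trustworthy; runs of consecutive octave errors longer than half the kernel also survive, which is why the slide pairs median filtering with dynamic-programming/HMM smoothing.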
Summary

Nonspeech audio
- i.e. sound in general
- characteristics: ecological

Music synthesis
- control of pitch, duration, loudness, articulation
- evolution of techniques
- sinusoids + noise + transients

Music analysis
- different aspects: instruments, pitches, performance

and beyond?