Onset detection and Attack Phase Descriptors
IMV Signal Processing Meetup, 16 March 2017

Onset detection vs. attack phase description
- MIREX competition:
  - Detect the approximate temporal location of new onsets in an audio file.
  - Algorithms are compared against manual expert annotation (which is inherently imprecise).
  - False positives and false negatives are penalized.
- Attack phase description:
  - What are the salient time points at the beginning of this sound event?
  - What are the relations between these time points?
- Paper:
  - K. Nymoen, A. Danielsen and J. London: Attack Phase Descriptor Estimation in Matlab toolboxes. Submitted for SMC 2017, Helsinki.
  - Comparing the MIRtoolbox (Lartillot) and the Timbre Toolbox (Peeters)

Onset detection
[Figure: audio waveform (amplitude vs. time in seconds) and onset curve (envelope; amplitude vs. time in seconds)]

Are these really onsets?
[Figure: onset curve (envelope), amplitude vs. time in seconds]
- What would our research question typically be when using this function in the MIRtoolbox?
  - Segmentation
  - Melody
  - Rhythm(?)
  - Microrhythm
- Are we interested in onsets, or rather perceived moments of metrical alignment?

Salient time points in the initial phase of a sonic event
[Figure labels: Physical Onset, Perceptual Onset, Perceptual Attack, Energy peak]
References on the slide: Schaeffer (196x), Gordon (1987), Collins (2006), Wright (2008), Villing (2010)

Attack phase descriptors
[Figure labels: Physical Onset, Perceptual Onset, Perceptual Attack, Energy peak, Temporal centroid, Attack time, Rise time, Attack Slope, Attack Leap]
Log-Attack Time = log(attack time)
- Time points
- Time spans
- Energy spans
- (Energy points)

Attack phase descriptors (our definitions)

Name               Type  Description
-----------------  ----  -----------
Physical onset     phtp  Time point where the sound energy first rises from 0.
Perceptual onset   petp  Time point when the sound event becomes audible.
Perceptual attack  petp  Time point perceived as the rhythmic emphasis of the sound.
Energy peak        phtp  Time point when the energy envelope reaches its maximum value.
Rise time          phts  Time span between physical onset and energy peak.
Attack time        pets  Time span between perceptual onset and perceptual attack.
Log-Attack Time    phts  The base 10 logarithm of attack time.
Attack slope       pees  Weighted average of the energy envelope slope in the attack phase.
Attack leap        pees  The difference between energy level at perceptual attack and perceptual onset.
Temporal centroid  phtp  The temporal barycentre of the sound event's energy envelope.

Type codes: ph = physical, pe = perceptual; tp = time point, ts = time span, es = energy span.

Log-Attack Time is a commonly used descriptor in the MIR community. There is no consensus on how to estimate it: some use physical descriptors, some use perceptual ones, and some use a combination.

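The span descriptors in the table follow directly from the time points, so they can be sketched in a few lines of Python. This is a minimal illustration, not code from either toolbox: the function name and arguments are hypothetical, and the three onset/attack indices are assumed to be given (estimating them is the hard part). Attack slope is left out because the slide does not specify the weighting.

```python
import math


def attack_phase_descriptors(times, envelope, physical_onset,
                             perceptual_onset, perceptual_attack):
    """Derive descriptors from an energy envelope and three given indices.

    `times` and `envelope` are equal-length lists; the three onset/attack
    arguments are sample indices into them (assumed known in advance).
    """
    # Energy peak: time point where the envelope reaches its maximum.
    peak = max(range(len(envelope)), key=lambda i: envelope[i])

    attack_time = times[perceptual_attack] - times[perceptual_onset]
    if attack_time <= 0:
        raise ValueError("perceptual attack must come after perceptual onset")

    total_energy = sum(envelope)
    if total_energy <= 0:
        raise ValueError("envelope must contain some energy")

    return {
        "energy_peak": times[peak],
        # Rise time: physical onset to energy peak.
        "rise_time": times[peak] - times[physical_onset],
        # Attack time: perceptual onset to perceptual attack.
        "attack_time": attack_time,
        # Log-Attack Time: base 10 logarithm of attack time.
        "log_attack_time": math.log10(attack_time),
        # Attack leap: energy difference between the two perceptual points.
        "attack_leap": envelope[perceptual_attack] - envelope[perceptual_onset],
        # Temporal centroid: energy-weighted mean time of the envelope.
        "temporal_centroid": sum(t * e for t, e in zip(times, envelope)) / total_energy,
    }
```

With an attack time of 10 ms, for instance, the Log-Attack Time is log10(0.01) = -2.
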
Attack phase descriptors step 1: Envelope extraction
Timbre Toolbox
- Apply the Hilbert transform to the audio signal,
- followed by a 3rd-order Butterworth lowpass filter with cutoff frequency at 5 Hz.
- No compensation for filter group delay.
MIRtoolbox
- Spectrogram, Hanning window, 100 ms frame, 10% hop
- Envelope = sum of columns in spectrogram
[Figure: audio waveform and energy envelopes over time (seconds); legend: MIRtoolbox (D), MIRtoolbox (A), Timbre Toolbox (D), Timbre Toolbox (A)]

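The spectrogram-based envelope described for the MIRtoolbox can be sketched as follows. This is only an illustration of the idea on the slide (Hann window, frame length, hop as a fraction of the frame, one envelope value per spectrogram column). It is not the MIRtoolbox implementation, which may apply further processing. The function names are hypothetical, and a naive DFT is used so that the example needs nothing beyond the standard library.

```python
import cmath
import math


def hann(n):
    """Symmetric Hann window of length n (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_spectrum(frame):
    """Magnitudes of the non-negative-frequency bins (naive O(n^2) DFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectrogram_envelope(signal, sr, frame_s=0.1, hop_fraction=0.1):
    """Return (times, envelope): one summed spectrogram column per frame.

    frame_s is the frame length in seconds; hop_fraction is the hop size
    as a fraction of the frame length.
    """
    frame_len = max(2, int(round(frame_s * sr)))
    hop = max(1, int(round(hop_fraction * frame_len)))
    window = hann(frame_len)
    times, envelope = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [s * w for s, w in zip(signal[start:start + frame_len], window)]
        envelope.append(sum(magnitude_spectrum(frame)))
        # Time stamp at the centre of the frame.
        times.append((start + frame_len / 2) / sr)
    return times, envelope
```

A shorter frame gives finer time resolution at the cost of a noisier envelope, which is why frame size matters for attack estimation.
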
Attack phase descriptors step 2: Salient time steps
- Both the MIRtoolbox and the Timbre Toolbox provide equivalents to beginning of attack and end of attack.
[Figure, two panels over time (seconds). Timbre Toolbox attack phase estimation: energy envelope, effort function, mean effort, attack start and end. MIRtoolbox attack phase estimation: energy envelope, time derivative, peak position, threshold, attack start and end.]
- But are these supposed to reflect physical or perceptual features?

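To make the role of a threshold concrete, here is a deliberately simplified sketch of a derivative-based attack estimate. It is loosely inspired by the labels in the MIRtoolbox panel (time derivative, peak position, threshold) but is not either toolbox's algorithm: the rule, the function name and the `threshold_fraction` parameter are all assumptions made for illustration.

```python
def estimate_attack_phase(envelope, threshold_fraction=0.2):
    """Return (start_index, end_index) of the attack phase in `envelope`.

    Hypothetical rule: locate the steepest rise of the envelope, then
    extend the attack in both directions for as long as the slope stays at
    or above threshold_fraction times that steepest slope.
    """
    if len(envelope) < 2:
        raise ValueError("envelope needs at least two samples")

    # First-order difference as a stand-in for the time derivative.
    slope = [b - a for a, b in zip(envelope, envelope[1:])]
    peak = max(range(len(slope)), key=lambda i: slope[i])
    if slope[peak] <= 0:
        raise ValueError("envelope never rises")
    threshold = threshold_fraction * slope[peak]

    # Walk backwards from the steepest point to the start of the rise.
    first = peak
    while first > 0 and slope[first - 1] >= threshold:
        first -= 1

    # Walk forwards from the steepest point to the end of the rise.
    last = peak
    while last < len(slope) - 1 and slope[last + 1] >= threshold:
        last += 1

    # slope[i] spans samples i and i + 1, so the attack ends at last + 1.
    return first, last + 1
```

Lowering the threshold can only widen (or keep) the estimated attack span, so the threshold value directly shapes descriptors such as attack time.
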
Attack phase descriptors step 3: ...
- Decide on the definitions of attack phase descriptors, and the methods for extracting salient time points

Into the nitty-gritty: Timbre Toolbox
- NB! Make sure that you download the latest version from GitHub. (Don't trust the CIRMMT link, the 1st hit on Google, which gives you version 1.2 from 23.)
- My impression: best used if you need a large range of audio descriptors for a large audio set, and don't want to fiddle with choosing parameters for your functions.
- Need to dig deep into the code to change the parameters (hard-coded):
  - Lowpass filter cutoff frequency value
  - Fix group delay problem

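One common way to remove filter group delay is forward-backward (zero-phase) filtering: run the filter once, reverse the result, run it again, and reverse back. The sketch below shows the principle with a simple one-pole lowpass standing in for the 3rd-order Butterworth filter. It is an illustration under that assumption, with hypothetical function names, and not the fix applied to the Timbre Toolbox code.

```python
def one_pole_lowpass(x, alpha):
    """Causal one-pole lowpass: y[n] = alpha * x[n] + (1 - alpha) * y[n - 1]."""
    y, prev = [], 0.0
    for value in x:
        prev = alpha * value + (1.0 - alpha) * prev
        y.append(prev)
    return y


def zero_phase_lowpass(x, alpha):
    """Forward-backward filtering: the delays of the two passes cancel."""
    forward = one_pole_lowpass(x, alpha)
    backward = one_pole_lowpass(forward[::-1], alpha)
    return backward[::-1]


def peak_index(x):
    """Index of the first maximum in x."""
    return max(range(len(x)), key=lambda i: x[i])
```

With a single causal pass the envelope peak lands later than the peak of the input, which biases every time-point descriptor; the forward-backward version keeps the peak in place at the cost of smoothing twice.
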
Into the nitty-gritty: MIRtoolbox
- Quite user-friendly: well documented, easy to access most parameters
- mironsets() function, attack option
  - threshold value is hard-coded
- mirgetdata problem:
  - uncell(get(a, 'AttackPosUnit'))
  - uncell(get(a, 'PeakPosUnit'))

Perceptual experiment
- Task: align a repeated musical sound to a click track.
- 17 participants
- 9 sound stimuli (8 musical instruments + click)
- Inter-stimulus interval of 600 ms
- Click track and stimuli started with a random offset.
- Participants controlled the sync using a keyboard and/or a slider on the screen.

Parameter optimisation and perceptual results
[Figure: time relative to physical onset (in milliseconds) for each sound: Bright Piano, Snare Drum, Dark Piano, Kick Drum, Fiddle, Shaker, Synth Bass, Arco Bass, Click; legend: MIRtoolbox (D), MIRtoolbox (O), Timbre Toolbox (D), Timbre Toolbox (O), Perceptual results]
[Figure: Jaccard index for MIRtoolbox (mean for all sounds) as a function of frame size (seconds) and threshold (fraction of e peak)]

Toolbox         Envelope parameter                                         Threshold parameter
--------------  ---------------------------------------------------------  -------------------
Timbre Toolbox  LP filter cutoff frequency: default 5 Hz, optimised 37 Hz   Default 3, optimised 3.75
MIRtoolbox      Frame size: default 0.1 s, optimised 0.03 s                 Fraction of e peak: default 20%, optimised 7.5%

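The Jaccard index measures the overlap of two sets as the size of their intersection divided by the size of their union. The slide does not state how the compared sets are defined, so the sketch below assumes they are time intervals (for example an estimated attack span and a perceptually derived span); the function name is hypothetical.

```python
def interval_jaccard(a, b):
    """Jaccard index of two time intervals given as (start, end) tuples.

    Returns intersection length divided by union length: 1.0 for
    identical intervals, 0.0 for disjoint ones.
    """
    a_start, a_end = a
    b_start, b_end = b
    intersection = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - intersection
    return intersection / union if union > 0 else 0.0
```

A parameter sweep such as the one in the figure would compute this score for every combination of frame size and threshold, average it over the sounds, and pick the combination with the highest mean.
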