Preeti Rao, 2nd CompMusic Workshop, Istanbul 2012

2
o Music signal characteristics
   o Perceptual attributes and acoustic properties
o Signal representations for pitch detection
   o STFT
   o Sinusoidal model
o Pitch detection algorithms
o Polyphonic context and predominant pitch tracking
o Applications in MIR

3 Digital audio format: PCM
Sampling rate: 44.1 kHz, … kHz
Amplitude resolution: 16 bits/sample
*The Physics Classroom: phys/class/sound/u11l2a.html
WiSSAP 2007
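A minimal sketch of 16-bit PCM coding, assuming NumPy. The helper name `to_pcm16`, the 440 Hz test tone and the symmetric scaling by 32767 are illustrative choices, not taken from the slide:

```python
import numpy as np

FS = 44100  # sampling rate in Hz

def to_pcm16(x):
    """Quantize a float signal in [-1, 1] to 16-bit signed PCM integers."""
    x = np.clip(np.asarray(x, dtype=float), -1.0, 1.0)
    # 16 bits/sample: scale by 32767 so that +1.0 maps to the largest int16 value
    return np.round(x * 32767).astype(np.int16)

# 10 ms of a 440 Hz tone at half of full scale
t = np.arange(FS // 100) / FS
pcm = to_pcm16(0.5 * np.sin(2 * np.pi * 440 * t))
print(pcm.dtype, len(pcm))  # int16 441
```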

4 Interesting sounds are typically coded in the form of a temporal sequence of atomic sound events, e.g. speech -> a sequence of phones; music -> an evolving pattern of notes. An atomic sound event, or a single gestalt, can be a complex acoustical signal described by a set of temporal and spectral properties => an evoked sensation.
Department of Electrical Engineering, IIT Bombay

5 A sound with given frequency components and sound pressure levels evokes perceived sensations that can be distinguished in terms of:
o loudness <-- intensity
o pitch <-- fundamental frequency
o timbre (quality or colour) <-- other spectro-temporal properties

6 Figure: air pressure variation over time for a low-pitch tone (frequency = 100 Hz, period T0 = 10 ms) and a high-pitch tone (frequency = 300 Hz, T0 = 3.3 ms); 1 hertz = 1 vibration/s.

7 Musical pitch scale (from low pitch to high pitch): one semitone corresponds to a frequency ratio of $2^{1/12}$
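Because a semitone is a fixed ratio of $2^{1/12}$, twelve semitones double the frequency (one octave), and intervals add in the logarithm of frequency. A short sketch assuming twelve-tone equal temperament; the 440 Hz reference is an illustrative assumption:

```python
import math

SEMITONE = 2 ** (1 / 12)  # frequency ratio of one equal-tempered semitone

def note_freq(n, f_ref=440.0):
    """Frequency n semitones above (n > 0) or below (n < 0) the reference f_ref."""
    return f_ref * SEMITONE ** n

def semitones_between(f1, f2):
    """Signed interval from f1 to f2 in (fractional) semitones."""
    return 12 * math.log2(f2 / f1)

print(round(note_freq(12), 2))                # 880.0: twelve semitones = one octave
print(round(semitones_between(100, 300), 2))  # 19.02: the 100 Hz and 300 Hz tones of slide 6
```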

8 o The construction of a musical scale is based on two assumptions about the human hearing process:
   o The ear is sensitive to ratios of fundamental frequencies (pitches), not so much to absolute pitch.
   o The preferred musical intervals, i.e. those perceived to be most consonant, are ratios of small whole numbers.
o A musical sound is typically composed of several frequencies. These frequencies become evident when we observe the spectrum of the sound.
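To make the small-whole-number idea concrete, the sketch below (an illustration, not from the slides) compares a few consonant ratios with the nearest equal-tempered interval, measuring sizes in cents (1200 cents per octave, 100 per semitone):

```python
import math
from fractions import Fraction

def cents(ratio):
    """Interval size in cents: 1200 * log2(ratio)."""
    return 1200 * math.log2(ratio)

# Consonant intervals expressed as ratios of small whole numbers
just_intervals = {
    "octave": Fraction(2, 1),
    "perfect fifth": Fraction(3, 2),
    "perfect fourth": Fraction(4, 3),
    "major third": Fraction(5, 4),
}

for name, r in just_intervals.items():
    c = cents(float(r))
    nearest_et = round(c / 100) * 100  # nearest equal-tempered interval
    print(f"{name:15s} {str(r):>4s} {c:8.2f} cents, deviation from ET {c - nearest_et:+.2f}")
```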

9 Figure: pure tones of 300 Hz, 600 Hz and 900 Hz, and their combinations 300 Hz + 600 Hz and 300 Hz + 600 Hz + 900 Hz.
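The 300 Hz + 600 Hz + 900 Hz tone can be synthesized and its components read back from the magnitude spectrum. A minimal sketch assuming NumPy, an 8 kHz sampling rate and a 0.1 s duration (both hypothetical), chosen so that the FFT bin spacing is 10 Hz and every component falls exactly on a bin:

```python
import numpy as np

FS = 8000   # sampling rate in Hz (assumed)
N = 800     # 0.1 s of signal -> FFT bin spacing FS / N = 10 Hz

t = np.arange(N) / FS
x = sum(np.cos(2 * np.pi * f * t) for f in (300, 600, 900))  # complex tone

spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(N, d=1 / FS)

# Only the bins at the three component frequencies carry significant energy
peaks = freqs[spectrum > 0.5 * spectrum.max()]
print(peaks)  # [300. 600. 900.]
```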

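10 Sound atoms: single tone signal. Figure: waveform $x_1(t)$ vs. t (ms) and spectrum $X_1(f)$ vs. f (Hz).

11 Non-tonal signal. Figure: waveform $x_2(t)$ vs. t (ms) and spectrum $X_2(f)$ vs. f (Hz).

12 Complex tone signal. Figure: waveform $x_3(t)$ vs. t (ms) and spectrum $X_3(f)$ vs. f (Hz).

13 Bandpass noise signal. Figure: waveform $x_4(t)$ vs. t (ms) and spectrum $X_4(f)$ vs. f (Hz).

14 A flute note. Figure: waveform $x_1(t)$ vs. t (ms) and spectrum $X_1(f)$ in dB vs. f (kHz).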

15 o We see that the distinctive signal characteristics are more evident in the frequency domain.
o The ear is a frequency analyzer. It combines analysis and synthesis in a unique way => we do not perceive individual spectral components but rather the composite sound.
o A single note is perceived as one entity with well-defined subjective sensations. This is due to the spatial pattern recognition process carried out by the central auditory system.

16 Major dimensions of music for retrieval are melody, rhythm, harmony and timbre.
o Melody, harmony -> based on pitch content
o Rhythm -> based on timing information
o Timbre -> relates to instrumentation, texture
A representation of these high-level attributes can be obtained from pitch, timing and spectro-temporal information extracted by audio signal analysis. Representations are then compared via a similarity measure to achieve retrieval.

17 o The temporal pattern of frame-level features can offer important cues to signal identity.
Figure: the audio signal is divided into texture windows (duration: … s), each spanning several analysis windows (duration: … ms); feature extraction yields frame-level features, which are summarized over each texture window (feature summary) into a feature vector.
M. F. McKinney and J. Breebaart, "Features for Audio and Music Classification," in Proc. ISMIR.
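A minimal sketch of the analysis-window / texture-window scheme, assuming NumPy. Because the slide's window durations are missing, the sizes here are hypothetical: 1024-sample analysis windows with a 512-sample hop, and texture windows of 43 frames (about 1 s at 22.05 kHz). The frame-level features (RMS energy, spectral centroid) and the mean/std summary are also our choices:

```python
import numpy as np

def frame_features(x, fs, win=1024, hop=512):
    """Frame-level features per analysis window: [RMS energy, spectral centroid (Hz)]."""
    w = np.hanning(win)
    n_frames = 1 + (len(x) - win) // hop
    freqs = np.fft.rfftfreq(win, d=1 / fs)
    feats = np.empty((n_frames, 2))
    for i in range(n_frames):
        frame = x[i * hop : i * hop + win]
        mag = np.abs(np.fft.rfft(frame * w))
        feats[i, 0] = np.sqrt(np.mean(frame ** 2))                 # RMS energy
        feats[i, 1] = np.sum(freqs * mag) / (np.sum(mag) + 1e-12)  # spectral centroid
    return feats

def texture_summary(feats, frames_per_texture=43):
    """One feature vector per texture window: mean and std of the frame-level features."""
    out = []
    for start in range(0, len(feats) - frames_per_texture + 1, frames_per_texture):
        block = feats[start : start + frames_per_texture]
        out.append(np.concatenate([block.mean(axis=0), block.std(axis=0)]))
    return np.array(out)

fs = 22050
t = np.arange(3 * fs) / fs
summary = texture_summary(frame_features(np.sin(2 * np.pi * 1000 * t), fs))
print(summary.shape)  # (2, 4): [mean RMS, mean centroid, std RMS, std centroid] per texture window
```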

18 Melody: a pitch-related feature
Melody is the temporal sequence of notes usually played by a single instrument (fixed timbre). The discrete notes (pitches) are typically selected from a musical scale.
Figure: frequency/note vs. time.

19 o Typical implementation:
   o Pitch detection is carried out on the audio signal at uniformly spaced intervals
   o The pitch sequence is segmented into notes (regions of relatively steady pitch)
   o Notes are labeled
   o Note patterns are matched to determine melodic similarity
o Challenges:
   o Note segmentation can be a difficult task
   o Pitch detection in polyphonic music is tough
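The segmentation and labeling steps can be sketched as follows, assuming NumPy, a pitch track in Hz sampled at uniform intervals with 0 marking unvoiced frames, and a simple rule of our own: a frame joins the current note if it lies within 0.5 semitone of the note's median pitch, and notes shorter than `min_frames` are discarded. Labels are MIDI note numbers (A4 = 440 Hz = 69); `segment_notes` is a hypothetical name:

```python
import numpy as np

def hz_to_midi(f):
    """Convert frequency in Hz to a (fractional) MIDI note number."""
    return 69 + 12 * np.log2(f / 440.0)

def segment_notes(f0, min_frames=5, tol=0.5):
    """Split a pitch track into notes: list of (start_frame, end_frame_exclusive, midi_label)."""
    notes = []
    seg = []    # fractional MIDI values of the open segment
    start = 0

    def close(end):
        if len(seg) >= min_frames:
            notes.append((start, end, int(round(np.median(seg)))))

    for i, f in enumerate(f0):
        if f <= 0:                          # unvoiced frame ends any open note
            close(i)
            seg, start = [], i + 1
            continue
        m = hz_to_midi(f)
        if seg and abs(m - np.median(seg)) > tol:
            close(i)                        # pitch moved away: end the current note
            seg, start = [], i
        seg.append(m)
    close(len(f0))
    return notes

f0 = [220] * 10 + [0] * 3 + [246.94] * 8 + [261.63] * 2
print(segment_notes(f0))  # [(0, 10, 57), (13, 21, 59)]; the 2-frame note is too short
```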

20 Monophonic signal: cues to perceived pitch
Figure: spectrum and waveform; Schroeder histogram pitch detection algorithm (PDA).
A. de Cheveigné, "Multiple F0 estimation," in D.-L. Wang and G. J. Brown, editors, Computational Auditory Scene Analysis: Principles, Algorithms and Applications, IEEE Press / Wiley.
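The Schroeder histogram named above is a frequency-domain pitch detector in which every spectral peak votes for its sub-multiples f/k, and the most-voted candidate is taken as F0. A simplified sketch assuming NumPy, peak frequencies that are already measured, a hypothetical 1 Hz histogram resolution and 50 to 500 Hz search range, and a tie-break rule of our own:

```python
import numpy as np

def schroeder_f0(peak_freqs, f0_min=50.0, f0_max=500.0, n_sub=8, res=1.0):
    """Estimate F0 from a subharmonic histogram of spectral peak frequencies.

    Each peak at f votes for the candidates f/1, f/2, ..., f/n_sub that fall in
    [f0_min, f0_max]; the most-voted candidate is returned.
    """
    bins = np.arange(f0_min, f0_max + res, res)
    hist = np.zeros(len(bins))
    for f in peak_freqs:
        for k in range(1, n_sub + 1):
            c = f / k
            if f0_min <= c <= f0_max:
                hist[int(round((c - f0_min) / res))] += 1
    # Tie-break (our assumption): prefer the highest candidate, because
    # sub-multiples of the true F0 can collect as many votes
    best = np.flatnonzero(hist == hist.max()).max()
    return float(bins[best])

print(schroeder_f0([600, 900, 1200]))  # 300.0 even though no peak lies at 300 Hz
```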

21 o Time (lag) domain: maximise the autocorrelation value
o Frequency domain: minimise the error between estimated and predicted harmonic structures
o Other
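For the time (lag) domain approach, a minimal autocorrelation pitch detector is sketched below, assuming NumPy, a single frame of a monophonic signal, and a hypothetical 50 to 500 Hz search range. The plain (biased) autocorrelation is used, which slightly favours shorter lags and so avoids choosing a multiple of the true period:

```python
import numpy as np

def acf_pitch(frame, fs, f0_min=50.0, f0_max=500.0):
    """Estimate F0 of one frame by maximising the autocorrelation over candidate lags."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()
    n = len(x)
    r = np.correlate(x, x, mode="full")[n - 1:]   # r[tau] = sum_m x[m] x[m + tau], tau >= 0
    lag_min = int(fs / f0_max)
    lag_max = min(int(fs / f0_min), n - 1)
    best_lag = lag_min + int(np.argmax(r[lag_min:lag_max + 1]))
    return fs / best_lag

fs = 8000
t = np.arange(1024) / fs
x = sum(np.cos(2 * np.pi * f * t) for f in (200, 400, 600))
print(acf_pitch(x, fs))  # 200.0 (lag of 40 samples)
```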

23 Music and speech signals are typically time-varying in nature => a time-frequency representation is required to visualize signal characteristics. The short-time Fourier transform (STFT) affords such a representation based on an assumption of signal quasi-stationarity. The window length and shape dictate the trade-off between time and frequency resolution.
$$X(n, \omega) = \sum_{m=-\infty}^{\infty} x(m)\, w(n-m)\, e^{-j\omega m}$$

24 Figure: the window w(n-m) is applied to x(m), and the DFT of the windowed segment x(m)w(n-m) gives X(n, ω) for 0 ≤ ω ≤ π.
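A frame-by-frame sketch of the STFT, assuming NumPy, a Hann window and a hop of `hop` samples (sizes are illustrative). Each frame's DFT is taken with the time origin at the frame start, which changes only the phase relative to the equation's convention, not the magnitude; ω is sampled at the DFT frequencies between 0 and π:

```python
import numpy as np

def stft(x, win_len=1024, hop=256):
    """Complex STFT: one row per frame n, one column per DFT bin (0 <= omega <= pi)."""
    w = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    X = np.empty((n_frames, win_len // 2 + 1), dtype=complex)
    for i in range(n_frames):
        segment = x[i * hop : i * hop + win_len] * w   # windowed segment x(m) w(n - m)
        X[i] = np.fft.rfft(segment)                    # DFT of the windowed segment
    return X

# Usage: 0.5 s at 500 Hz followed by 0.5 s at 1500 Hz (bin spacing 16000/1024 = 15.625 Hz)
fs = 16000
t = np.arange(fs) / fs
x = np.where(t < 0.5, np.sin(2 * np.pi * 500 * t), np.sin(2 * np.pi * 1500 * t))
X = stft(x)
print(X.shape, np.argmax(np.abs(X[0])), np.argmax(np.abs(X[-1])))  # (59, 513) 32 96
```

A longer `win_len` narrows each spectral peak but blurs the moment of the frequency change, which is the resolution trade-off described above.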

26 $$\hat{x}[t] = \sum_{i=1}^{I[t]} a_i[t] \cos \Phi_i[t] + e[t]$$
where $a_i[t]$ is the amplitude variation of the i-th sinusoidal component (partial); $\Phi_i[t] = \omega_i[t]\, t + \varphi_i[t]$ is the total phase (represents both frequency and phase variation); $I[t]$ is the number of partials, which can vary with time; and $e[t]$ is the residual.
Model parameters to be estimated in each frame l: $\{a_i, \omega_i, \varphi_i\}_l$
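The deterministic part of the model can be synthesized by accumulating each partial's instantaneous frequency into its total phase. A minimal sketch, assuming NumPy and amplitude and frequency tracks already interpolated to one value per sample; `additive_synth` is a hypothetical name and the residual $e[t]$ is not generated:

```python
import numpy as np

def additive_synth(amps, freqs, fs, phase0=None):
    """Synthesize sum_i a_i[t] cos(Phi_i[t]) from per-sample tracks of shape (I, T).

    freqs are in Hz; Phi_i[t] = phase0_i + sum_{tau < t} omega_i[tau], which reduces
    to omega_i * t + phase0_i when the frequency is constant.
    """
    amps = np.asarray(amps, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    n_partials = amps.shape[0]
    if phase0 is None:
        phase0 = np.zeros(n_partials)
    omega = 2 * np.pi * freqs / fs                          # rad/sample
    phi = phase0[:, None] + np.concatenate(
        [np.zeros((n_partials, 1)), np.cumsum(omega[:, :-1], axis=1)], axis=1)
    return np.sum(amps * np.cos(phi), axis=0)

fs, T = 8000, 800
y = additive_synth([[1.0] * T, [0.5] * T], [[200.0] * T, [400.0] * T], fs)
```

With time-varying `freqs`, the cumulative sum keeps the phase continuous, so partials glide smoothly instead of clicking at frame boundaries.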

27 Figure: sinusoidal analysis-synthesis. Audio signal x -> window -> DFT -> peak detection -> peak tracking -> sinusoid parameters $\{a_i, \omega_i, \varphi_i\}_l$ -> additive synthesis -> tonal component; subtracting the tonal component from x gives the residual.
To model the smooth evolution of the signal, sine components are detected in each frame and linked to tracks from the previous frame based on frequency proximity.

28 Figure: peak picking on the spectral magnitude (dB) vs. frequency (Hz), using a fixed threshold (MaxPeak - 40 dB) and using thresholds relative to the spectral envelope (envelope - 20 dB, - 25 dB, - 30 dB); final peaks picked.
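A sketch of the fixed-threshold variant: spectral local maxima are kept if they lie within 40 dB of the largest peak. It assumes NumPy, a periodic Hann window and "local maximum" meaning strictly greater than both neighbouring bins; the envelope-relative thresholds are not implemented, and `pick_peaks` is a hypothetical name:

```python
import numpy as np

def pick_peaks(frame, fs, threshold_db=40.0):
    """Frequencies (Hz) of spectral local maxima within threshold_db of the largest peak."""
    n = len(frame)
    w = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n)   # periodic Hann window
    mag_db = 20 * np.log10(np.abs(np.fft.rfft(frame * w)) + 1e-12)
    # Local maxima: bins strictly greater than both neighbours
    is_max = (mag_db[1:-1] > mag_db[:-2]) & (mag_db[1:-1] > mag_db[2:])
    idx = np.flatnonzero(is_max) + 1
    if idx.size == 0:
        return np.array([])
    # Fixed threshold: keep peaks no lower than MaxPeak - threshold_db
    idx = idx[mag_db[idx] >= mag_db[idx].max() - threshold_db]
    return idx * fs / n

fs, n = 8000, 1024
t = np.arange(n) / fs
# Components at 0 dB, -20 dB and -60 dB relative to the strongest one
x = (np.cos(2 * np.pi * 500 * t) + 0.1 * np.cos(2 * np.pi * 1500 * t)
     + 0.001 * np.cos(2 * np.pi * 2500 * t))
print(pick_peaks(x, fs))  # [ 500. 1500.]: the -60 dB component falls below the threshold
```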

29 Match the spectrum around each peak with that of an ideal sinusoid (i.e. the transform of the analysis window), and apply a threshold to the error.
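One way to realize this test, as a sketch under stated assumptions: NumPy, a periodic Hann window, 4x zero-padding, and a sinusoid lying on the zero-padded bin grid near the peak. Measured and ideal shapes are both normalized to 1 at the peak and compared over plus or minus one original DFT bin. The name `sinusoidality_error` and the mean-squared error measure are our choices; the threshold on the error would have to be tuned:

```python
import numpy as np

def sinusoidality_error(frame, k, zpad=4):
    """Mean squared mismatch between the spectrum around zero-padded bin k and an ideal sinusoid's."""
    n = len(frame)
    nfft = n * zpad
    w = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n)   # periodic Hann window
    mag = np.abs(np.fft.fft(frame * w, nfft))
    win_mag = np.abs(np.fft.fft(w, nfft))                  # window transform, centred at bin 0
    offsets = np.arange(-zpad, zpad + 1)                   # +/- one original DFT bin
    measured = mag[k + offsets] / mag[k]
    ideal = win_mag[offsets % nfft] / win_mag[0]
    return float(np.mean((measured - ideal) ** 2))

fs, n, k = 8000, 512, 257
t = np.arange(n) / fs
sine = np.cos(2 * np.pi * (k * fs / (4 * n)) * t)   # sinusoid centred on zero-padded bin k
impulse = np.zeros(n)
impulse[n // 2] = 1.0                               # flat spectrum: not sinusoidal
print(sinusoidality_error(sine, k), sinusoidality_error(impulse, k))
```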

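30 Peak tracking
Figure: sine peaks in the frequency-time plane linked into tracks A, B, C and D; a track is born when a new peak appears and dies when no continuation is found.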
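Frame-to-frame linking by frequency proximity can be sketched as a greedy nearest-neighbour match. This is a simplification under our own assumptions (a hypothetical 20 Hz maximum jump, tracks served in creation order), not necessarily the procedure behind the figure:

```python
def track_peaks(frames, max_jump_hz=20.0):
    """Link per-frame peak frequencies into tracks.

    frames: list of lists of peak frequencies (Hz), one list per frame.
    Returns a list of tracks, each a list of (frame_index, freq).
    """
    tracks = []   # every track ever born
    active = []   # indices of tracks continued in the previous frame
    for j, peaks in enumerate(frames):
        unmatched = sorted(peaks)
        next_active = []
        for ti in active:
            if not unmatched:
                break                          # remaining tracks die
            last_f = tracks[ti][-1][1]
            best = min(unmatched, key=lambda f: abs(f - last_f))
            if abs(best - last_f) <= max_jump_hz:
                tracks[ti].append((j, best))   # track continues
                unmatched.remove(best)
                next_active.append(ti)
            # otherwise this track dies
        for f in unmatched:                    # leftover peaks give birth to new tracks
            tracks.append([(j, f)])
            next_active.append(len(tracks) - 1)
        active = next_active
    return tracks

frames = [[100, 300], [102, 305], [104, 600], [106]]
for tr in track_peaks(frames):
    print(tr)
```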

31 Figure: spectrogram (frequency in Hz vs. time in s) of a polyphonic excerpt showing the singer (main melody), tanpura (drone), tabla (percussion; strokes Ghe, Na, Tun) and harmonium (secondary melody).

32 o Input: magnitudes and locations of sinusoids
o For a range of trial fundamentals, generate predicted harmonics
o Minimise the two-way mismatch (TWM) error with respect to the trial fundamentals:
$$Err_{total} = \frac{Err_{p\to m}}{N} + \rho\,\frac{Err_{m\to p}}{K}$$
where N is the number of predicted harmonics and K is the number of measured components.
Figure: nearest-neighbour matching between (a) predicted components and (b) measured components.

34 Figure: trellis of pitch candidates p across frames j, with local cost E(p, j), transition cost W(p, p') and next-frame cost E(p', j+1).
p: pitch candidates; j: frame (time instant); E: measurement cost (local); W: smoothness cost.
Minimize the global transition cost (accumulated measurement and smoothness costs) over the singing spurt.
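A minimal sketch of the dynamic-programming search, assuming NumPy and given local costs E (for example, normalized TWM errors of each candidate). The smoothness cost is an assumption here: a Gaussian-shaped penalty on the jump in log frequency, $W(p, p') = 1 - \exp(-(\log_2 p' - \log_2 p)^2 / (2\sigma^2))$, with a hypothetical $\sigma = 0.1$ octave:

```python
import numpy as np

def dp_pitch_track(cands, E, sigma=0.1):
    """Pick one candidate per frame minimizing sum of local costs E and smoothness costs W.

    cands, E: arrays of shape (J, P) (frames x candidates). Returns chosen F0 per frame.
    """
    cands = np.asarray(cands, dtype=float)
    E = np.asarray(E, dtype=float)
    J, P = E.shape
    logf = np.log2(cands)
    cost = E[0].copy()                       # best accumulated cost ending at each candidate
    back = np.zeros((J, P), dtype=int)       # backpointers
    for j in range(1, J):
        jump = logf[j][None, :] - logf[j - 1][:, None]       # (previous, current)
        W = 1 - np.exp(-jump ** 2 / (2 * sigma ** 2))
        total = cost[:, None] + W
        back[j] = np.argmin(total, axis=0)
        cost = total[back[j], np.arange(P)] + E[j]
    path = [int(np.argmin(cost))]            # backtrack the globally optimal path
    for j in range(J - 1, 0, -1):
        path.append(int(back[j][path[-1]]))
    path.reverse()
    return cands[np.arange(J), path]

# Frame 2 has an octave-error candidate (400 Hz) with a lower local cost
cands = [[200, 400], [201, 402], [400, 202], [203, 406]]
E = [[0.1, 0.5], [0.1, 0.5], [0.2, 0.3], [0.1, 0.5]]
print(dp_pitch_track(cands, E))  # the smooth path 200 -> 201 -> 202 -> 203 Hz wins
```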

36 Polyphonic audio signal -> Signal representation -> Multi-F0 analysis -> Predominant-F0 trajectory extraction -> Singing voice detection -> Voice F0 contour

38 Pitch class profile
o Pitch histogram
o Similarity measure involves a match between histograms
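A sketch of a pitch-class histogram and a histogram match, assuming NumPy, an F0 contour in Hz with 0 for unvoiced frames, a known tonic, and 12 octave-folded bins (one per semitone). The Bhattacharyya coefficient is our choice of similarity measure, not necessarily the one used on the slide:

```python
import numpy as np

def pitch_class_histogram(f0, tonic, n_bins=12):
    """Octave-folded, normalized histogram of pitch relative to the tonic."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0[f0 > 0]                                  # ignore unvoiced frames
    semis = 12 * np.log2(voiced / tonic)                 # semitones above the tonic
    pc = np.mod(np.round(semis * n_bins / 12), n_bins).astype(int)
    hist = np.bincount(pc, minlength=n_bins).astype(float)
    return hist / hist.sum()

def bhattacharyya(h1, h2):
    """Similarity of two normalized histograms: 1 if identical, 0 if disjoint."""
    return float(np.sum(np.sqrt(h1 * h2)))

h_a = pitch_class_histogram([200] * 4 + [300] * 4 + [400] * 2, tonic=200)
h_b = pitch_class_histogram([100] * 5 + [299] * 5, tonic=200)
print(round(bhattacharyya(h_a, h_b), 3))  # 0.995: same pitch classes, different octaves
```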

40 Figure: positive phrases and a negative phrase.

42 Detects phrases melodically similar to Guru Bina.
Figure: pitch contour with swaras (S SN R) marked; positive phrases, a negative phrase, and the emphatic beat (sam).

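46 Predicted-to-measured error:
$$Err_{p\to m} = \sum_{n=1}^{N} \left[ \Delta f_n\, (f_n)^{-p} + \frac{a_n}{A_{max}} \left( q\, \Delta f_n\, (f_n)^{-p} - r \right) \right]$$
Significant term: $\Delta f / f^{p}$, where $\Delta f$ = frequency mismatch error, $f$ = partial frequency, $a$ = partial amplitude, and $A_{max}$ = maximum measured amplitude.
Measured-to-predicted error:
$$Err_{m\to p} = \sum_{k=1}^{K} \left[ \Delta f_k\, (f_k)^{-p} + \frac{a_k}{A_{max}} \left( q\, \Delta f_k\, (f_k)^{-p} - r \right) \right]$$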
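The two error terms can be combined into $Err_{total} = Err_{p\to m}/N + \rho\, Err_{m\to p}/K$ and minimized over a grid of trial fundamentals. A sketch assuming NumPy; the parameter values (p = 0.5, q = 1.4, r = 0.5, ρ = 0.33), predicting harmonics only up to the highest measured peak, and the 1 Hz trial grid are illustrative assumptions, not the tuned settings of the system described here:

```python
import numpy as np

def twm_error(f0, peak_f, peak_a, p=0.5, q=1.4, r=0.5, rho=0.33, f_max=None):
    """Two-way mismatch error Err_total for one trial fundamental f0 (Hz)."""
    peak_f = np.asarray(peak_f, dtype=float)
    peak_a = np.asarray(peak_a, dtype=float)
    a_max = peak_a.max()
    if f_max is None:
        f_max = peak_f.max()                           # predict over the measured range
    harm = f0 * np.arange(1, max(1, int(f_max // f0)) + 1)
    N, K = len(harm), len(peak_f)

    # Predicted -> measured: each harmonic vs. its nearest measured peak
    near = np.argmin(np.abs(harm[:, None] - peak_f[None, :]), axis=1)
    t = np.abs(harm - peak_f[near]) * harm ** (-p)    # Delta f_n / f_n^p
    err_pm = np.sum(t + (peak_a[near] / a_max) * (q * t - r))

    # Measured -> predicted: each measured peak vs. its nearest harmonic
    near = np.argmin(np.abs(peak_f[:, None] - harm[None, :]), axis=1)
    t = np.abs(peak_f - harm[near]) * peak_f ** (-p)  # Delta f_k / f_k^p
    err_mp = np.sum(t + (peak_a / a_max) * (q * t - r))

    return err_pm / N + rho * err_mp / K

def twm_f0(peak_f, peak_a, f0_min=100.0, f0_max=500.0, step=1.0):
    """Trial fundamental on a grid with the smallest TWM error."""
    trials = np.arange(f0_min, f0_max + step, step)
    errs = [twm_error(f, peak_f, peak_a) for f in trials]
    return float(trials[int(np.argmin(errs))])

print(twm_f0([300, 600, 900, 1200], [1, 1, 1, 1]))  # 300.0
```

The subharmonic 150 Hz also explains every measured peak, but its odd harmonics find no measured partial, so the predicted-to-measured term rules it out.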

47 Melody detection system [1]

48 o F0 search range (male/female)
o p, q, r
o ρ (male/female)
o Window length (pitch range and rate of variation)
o Smoothness cost parameter (rate of pitch variation)
o Voicing threshold

49 o Window length is an analysis parameter that influences the accuracy of sinusoidal modeling of the signal
o Closely spaced components in the polyphony => need for higher frequency resolution = longer windows
o Pitch variation with time can be rapid in ornamented regions => need for better time resolution = shorter windows

50 o Easily computable measures for adapting window length
o Signal sparsity: a sparse spectrum is more concentrated => sinusoidal components are better represented
o Window length selection (20, 30, 40 ms) based on maximizing signal sparsity
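A sketch of sparsity-driven window selection, assuming NumPy and a Hann window. The slide does not name the sparsity measure, so we assume a normalized L1/L2 ratio (0 for a flat spectrum, 1 for a single nonzero bin). All candidate windows are centred on the same instant and zero-padded to a common FFT size so their spectra are comparable; the function names are hypothetical:

```python
import numpy as np

def l1_l2_sparsity(v):
    """Normalized L1/L2 sparsity: 0 for a flat vector, 1 for a single nonzero entry."""
    v = np.abs(np.asarray(v))
    n = len(v)
    ratio = np.sum(v) / (np.sqrt(np.sum(v ** 2)) + 1e-12)
    return (np.sqrt(n) - ratio) / (np.sqrt(n) - 1)

def select_window(x, fs, center, lengths_ms=(20, 30, 40)):
    """Return the window length (ms) whose spectrum around sample `center` is sparsest."""
    longest = int(round(max(lengths_ms) * fs / 1000))
    nfft = 1 << (longest - 1).bit_length()        # common zero-padded FFT size
    scores = []
    for ms in lengths_ms:
        n = int(round(ms * fs / 1000))
        start = center - n // 2
        seg = x[start : start + n] * np.hanning(n)
        scores.append(l1_l2_sparsity(np.fft.rfft(seg, nfft)))
    return lengths_ms[int(np.argmax(scores))], scores

fs = 16000
t = np.arange(fs) / fs
best, scores = select_window(np.sin(2 * np.pi * 1000 * t), fs, center=8000)
print(best)  # 40: for a steady tone, the longest window gives the sparsest spectrum
```

For rapidly ornamented pitch, a longer window smears each partial over more bins, which is when a shorter window can score higher.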

51
1. V. Rao and P. Rao, "Vocal melody extraction in the presence of pitched accompaniment in polyphonic music," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 8, Nov.
2. V. Rao, P. Gaddipati and P. Rao, "Signal-driven window adaptation for sinusoid identification in polyphonic music," IEEE Transactions on Audio, Speech, and Language Processing, Jan.
