Preeti Rao 2 nd CompMusicWorkshop, Istanbul 2012


o Music signal characteristics
o Perceptual attributes and acoustic properties
o Signal representations for pitch detection
  o STFT
  o Sinusoidal model
o Pitch detection algorithms
o Polyphonic context and predominant pitch tracking
o Applications in MIR

Digital audio format: PCM
Sampling rate: 44.1 kHz, 22.05 kHz
Amplitude resolution: 16 bits/sample
*The Physics Classroom: http://www.glenbrook.k12.il.us/gbssci/phys/class/sound/u11l2a.html

Interesting sounds are typically coded in the form of a temporal sequence of atomic sound events, e.g. speech -> a sequence of phones; music -> an evolving pattern of notes. An atomic sound event, or a single gestalt, can be a complex acoustical signal described by a set of temporal and spectral properties => an evoked sensation. Department of Electrical Engineering, IIT Bombay

A sound of given frequency components and sound pressure levels leads to perceived sensations that can be distinguished in terms of:
o loudness <-- intensity
o pitch <-- fundamental frequency
o timbre ("quality" or "colour") <-- other spectro-temporal properties

[Figure: air pressure variation for a low pitch tone (frequency = 100 Hz, T0 = 10 msec) and a high pitch tone (frequency = 300 Hz, T0 = 3.3 msec); 1 Hertz = 1 vibration/sec]

[Figure: musical pitch scale from low pitch to high pitch; adjacent notes are a semitone apart, with frequency ratio 2^(1/12)]
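The semitone ratio 2^(1/12) can be sketched in a few lines of Python; the helper names and the A4 = 440 Hz reference are illustrative conventions, not part of the original slides.

```python
# Equal-tempered scale: each semitone step multiplies frequency by 2**(1/12).

def semitone_ratio(steps: int) -> float:
    """Frequency ratio corresponding to `steps` semitones."""
    return 2.0 ** (steps / 12.0)

def note_frequency(steps_from_a4: int, a4: float = 440.0) -> float:
    """Frequency of a note `steps_from_a4` semitones away from A4 (assumed 440 Hz)."""
    return a4 * semitone_ratio(steps_from_a4)

print(round(semitone_ratio(12), 3))   # 2.0 -- one octave doubles the frequency
print(round(note_frequency(3), 2))    # 523.25 -- three semitones above A4
```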

o The construction of a musical scale is based on two assumptions about the human hearing process:
  o The ear is sensitive to ratios of fundamental frequencies (pitches), not so much to absolute pitch.
  o The preferred musical intervals, i.e. those perceived to be most consonant, are the ratios of small whole numbers.
o A musical sound is typically comprised of several frequencies. The frequencies are evident if we observe the spectrum of the sound.

[Figure: spectra of pure tones at 300 Hz, 600 Hz and 900 Hz, and of their sums 300 Hz + 600 Hz and 300 Hz + 600 Hz + 900 Hz]

Sound atoms:
[Figure: single tone signal x1(t) and its spectrum X1(f), one component at 500 Hz]
[Figure: non-tonal signal x2(t) and its broadband spectrum X2(f)]
[Figure: complex tone signal x3(t) and its spectrum X3(f), harmonic components up to 1000 Hz]
[Figure: bandpass noise signal x4(t) and its spectrum X4(f), energy between 250 and 800 Hz]
[Figure: a flute note x1(t) and its spectrum X1(f) in dB, 0-5 kHz]

o We see that the distinctive signal characteristics are more evident in the frequency domain.
o The ear is a frequency analyzer. It represents a unique combination of analysis and synthesis => we do not perceive spectral components but rather the composite sounds.
o We observe that a single note is perceived as one entity of well-defined subjective sensations. This is due to the spatial pattern recognition process achieved by the central auditory system.

Major dimensions of music for retrieval are melody, rhythm, harmony and timbre.
o Melody, harmony -> based on pitch content
o Rhythm -> based on timing information
o Timbre -> relates to instrumentation, texture
A representation of these high-level attributes can be obtained from pitch, timing and spectro-temporal information extracted by audio signal analysis. Representations are then compared via a similarity measure to achieve retrieval.

o The temporal pattern of frame-level features can offer important cues to signal identity.
[Figure: feature extraction - texture windows (duration 0.5-1.0 s) composed of analysis windows (duration 50-100 ms); frame-level feature vectors are summarized per texture window]
M. F. McKinney and J. Breebaart, "Features for Audio and Music Classification," in Proc. ISMIR, 2003.

Melody: pitch related feature
Melody is the temporal sequence of notes usually played by a single instrument (fixed timbre). The discrete notes (pitches) are typically selected from a musical scale.
[Figure: note sequence on a frequency/note vs. time grid]

o Typical implementation:
  o Pitch detection is carried out on the audio signal at uniformly spaced intervals
  o The pitch sequence is segmented into notes (regions of relatively steady pitch)
  o Notes are labeled
  o Note patterns are matched to determine melodic similarity
o Challenges:
  o Note segmentation can be a difficult task
  o Pitch detection in polyphonic music is tough
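The note-segmentation step above can be sketched as grouping consecutive pitch estimates into steady-pitch regions. This is a minimal illustration, not the slides' actual method; the tolerance of half a semitone and the running-median center are assumptions.

```python
# Group consecutive per-frame pitch estimates (in semitones) into notes wherever
# the pitch stays within `tol` semitones of the running note median.

def segment_notes(pitches, tol=0.5):
    """Split a per-frame pitch sequence into lists of frames, one per note."""
    notes, current = [], [pitches[0]]
    for p in pitches[1:]:
        center = sorted(current)[len(current) // 2]  # running median of the note
        if abs(p - center) <= tol:
            current.append(p)
        else:
            notes.append(current)   # note boundary: pitch jumped
            current = [p]
    notes.append(current)
    return notes

seq = [60.0, 60.1, 59.9, 62.0, 62.1, 62.0, 64.0]
print([len(n) for n in segment_notes(seq)])  # [3, 3, 1]
```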

Monophonic signal: cues to perceived pitch
[Figure: waveform, spectrum and Schroeder histogram representations used by pitch detection algorithms (PDAs)]
A. de Cheveigne, "Multiple F0 estimation," in D.-L. Wang and G. J. Brown, editors, Computational Auditory Scene Analysis: Principles, Algorithms and Applications, IEEE Press / Wiley, 2006.

o Time (lag) domain: maximise the autocorrelation value
o Frequency domain: minimise the error between estimated and predicted harmonic structures
o Other
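The lag-domain approach can be illustrated directly: the autocorrelation of a periodic signal peaks at multiples of the period, so the best non-zero lag gives 1/F0. The search bounds (80-500 Hz) below are illustrative defaults, not values from the slides.

```python
import numpy as np

def autocorr_pitch(x, fs, fmin=80.0, fmax=500.0):
    """Estimate F0 by locating the autocorrelation peak in the allowed lag range."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0 .. N-1
    lo, hi = int(fs / fmax), int(fs / fmin)            # lag bounds from F0 range
    lag = lo + r[lo:hi + 1].argmax()
    return fs / lag

fs = 8000
t = np.arange(int(0.04 * fs)) / fs
x = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)  # F0 = 200 Hz
print(round(autocorr_pitch(x, fs)))  # 200
```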


Music and speech signals are typically time-varying in nature => a time-frequency representation is required to visualize signal characteristics. The short-time Fourier transform (STFT) affords such a representation based on an assumption of signal quasi-stationarity. The window shape dictates the time and frequency resolution trade-off.

X(n, ω) = Σ_{m=-∞}^{∞} x(m) w(n-m) e^{-jωm}
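The STFT above can be sketched with NumPy: slide a window over the signal and take the DFT of each frame. The frame length, hop size and Hann window below are illustrative choices.

```python
import numpy as np

def stft(x, win_len=1024, hop=256):
    """Short-time Fourier transform: DFT of overlapping windowed frames."""
    w = np.hanning(win_len)
    frames = [x[n:n + win_len] * w
              for n in range(0, len(x) - win_len + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])  # frames x freq bins

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)              # 1 s, 440 Hz tone
X = stft(x)
peak_hz = np.abs(X[0]).argmax() * fs / 1024  # bin spacing = fs / win_len
print(round(peak_hz, 1))                     # close to 440 Hz (bin quantized)
```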

[Figure: windowing of x(m) by w(n-m); the DFT of x(m)w(n-m) gives X(n, ω) for 0 <= ω <= π]

Sinusoidal model:

x̂[t] = Σ_{i=1}^{I[t]} a_i[t] cos(Φ_i[t]) + e[t]

where
a_i[t] - amplitude variation of the i-th sinusoidal component ("partial")
Φ_i[t] = ω_i[t]·t + φ_i[t] - total phase (represents both frequency and phase variation)
I[t] - number of partials, can vary with time
Model parameters to be estimated: {a_i, ω_i, φ_i}
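A direct reading of the model is additive synthesis of partials. In this minimal sketch the amplitudes and frequencies are held constant over the note (the model allows them to vary with time); the parameter values are arbitrary examples.

```python
import numpy as np

def additive_synth(partials, dur=0.5, fs=16000):
    """Synthesize x̂[t] = sum of a_i * cos(2π f_i t + φ_i).

    partials: list of (amplitude, freq_hz, phase) triples, i.e. {a_i, ω_i, φ_i}.
    """
    t = np.arange(int(dur * fs)) / fs
    return sum(a * np.cos(2 * np.pi * f * t + phi) for a, f, phi in partials)

# A complex tone with three harmonic partials at 300, 600, 900 Hz.
x = additive_synth([(1.0, 300, 0.0), (0.5, 600, 0.0), (0.3, 900, 0.0)])
print(x.shape)  # (8000,)
```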

[Figure: analysis/synthesis chain - audio signal x -> windowed DFT -> peak detection -> peak tracking -> sinusoid parameters {a_i, ω_i, φ_i}; additive synthesis of the tonal component, which is subtracted from the input to leave a residual]
For the smooth evolution of the signal, sine components are detected in each frame and linked to tracks from the previous frame based on frequency proximity.

[Figure: spectral magnitude (dB) over 0-3000 Hz with peak-picking thresholds - top: fixed threshold at MaxPeak - 40 dB and the final peaks picked; bottom: envelope-based thresholds at envelope - 20, 25 and 30 dB]
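The fixed-threshold variant shown above can be sketched as follows: keep local maxima of the dB magnitude spectrum that lie within 40 dB of the strongest peak. The toy spectrum is illustrative.

```python
import numpy as np

def pick_peaks(mag_db, rel_thresh=40.0):
    """Indices of local maxima within `rel_thresh` dB of the global maximum."""
    floor = mag_db.max() - rel_thresh
    return [k for k in range(1, len(mag_db) - 1)
            if mag_db[k] > mag_db[k - 1]      # rising into the peak
            and mag_db[k] >= mag_db[k + 1]    # falling after the peak
            and mag_db[k] >= floor]           # above the fixed threshold

spec = np.array([-50, -10, -45, -60, -20, -55, -48.0])
print(pick_peaks(spec))  # [1, 4]
```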

Match the spectrum around each peak with that of an ideal sinusoid, and apply a threshold to the error.

[Figure: peak tracking in the time-frequency plane - tracks A, B, C, D formed by linking sine peaks across frames; a track is "born" when a new peak appears and "dies" when no continuing peak is found]
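The birth/death linking above can be sketched in minimal form: each live track extends to the nearest new peak within a frequency-jump limit; unmatched tracks die and unmatched peaks start new tracks. The greedy matching and the 30 Hz limit are illustrative simplifications; real trackers resolve conflicts more carefully.

```python
def link_frame(tracks, peaks, max_jump=30.0):
    """Extend tracks (lists of freqs) with the peaks of a new frame.

    Returns the surviving tracks plus newly born single-peak tracks.
    """
    unused = list(peaks)
    alive = []
    for tr in tracks:
        if unused:
            cand = min(unused, key=lambda f: abs(f - tr[-1]))  # nearest peak
            if abs(cand - tr[-1]) <= max_jump:
                tr.append(cand)
                unused.remove(cand)
                alive.append(tr)
            # else: track dies (no continuation within max_jump)
    alive.extend([f] for f in unused)   # births: unclaimed peaks start tracks
    return alive

tracks = [[440.0], [880.0]]
tracks = link_frame(tracks, [445.0, 1200.0])
print([t[-1] for t in tracks])  # [445.0, 1200.0] -- 880 Hz track died
```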

[Figure: spectrogram (0-2000 Hz, 0-20 sec) of a polyphonic recording showing the singer (main melody), tanpura (drone), tabla (percussion, bols "Ghe Na Tun") and harmonium (secondary melody)]

o Input: magnitudes + locations of sinusoids
o For a range of trial fundamentals, generate predicted harmonics
o Minimise the two-way mismatch (TWM) error w.r.t. the trial fundamentals

Err_total = Err_{p->m}/N + ρ · Err_{m->p}/K

[Figure: nearest-neighbour matching between predicted harmonic components and measured components]


[Figure: dynamic-programming trellis - pitch candidates p at frame (time instant) j with measurement cost E(p, j), connected to candidates p' at frame j+1 with smoothness cost W(p, p')]
E - measurement cost (local), W - smoothness cost
Minimize the global transition cost over the singing spurt.
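The global cost minimization above is textbook dynamic programming: at each frame, keep for every pitch candidate the cheapest path cost so far, then backtrack. The candidate values and costs below are toy stand-ins for the measurement term E and smoothness term W.

```python
def dp_pitch_track(candidates, E, W):
    """candidates: per-frame lists of pitch values; E[j][i]: local cost of
    candidate i at frame j; W(p, q): transition cost. Returns min-cost path."""
    n = len(candidates)
    cost = [E[0][i] for i in range(len(candidates[0]))]
    back = []
    for j in range(1, n):
        new_cost, ptr = [], []
        for i, p in enumerate(candidates[j]):
            best = min(range(len(candidates[j - 1])),
                       key=lambda k: cost[k] + W(candidates[j - 1][k], p))
            new_cost.append(cost[best] + W(candidates[j - 1][best], p) + E[j][i])
            ptr.append(best)
        cost, back = new_cost, back + [ptr]
    i = min(range(len(cost)), key=cost.__getitem__)   # best final candidate
    path = [i]
    for ptr in reversed(back):                        # backtrack to frame 0
        path.append(ptr[path[-1]])
    path.reverse()
    return [candidates[j][path[j]] for j in range(n)]

cands = [[100, 200], [105, 210], [110, 200]]
E = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.5]]
track = dp_pitch_track(cands, E, W=lambda p, q: abs(p - q) / 100)
print(track)  # [100, 105, 110] -- the smooth low trajectory wins
```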


[System diagram: polyphonic audio signal -> signal representation -> multi-F0 analysis -> predominant-F0 trajectory extraction -> voice F0 contour, with a singing voice detection stage]


Pitch class profile
o pitch histogram
o similarity measure involves match between histograms
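A pitch-class-profile comparison can be sketched as: fold pitches into 12 classes, histogram them, and score similarity by histogram intersection. The binning and the intersection score are illustrative assumptions, not the specific measure used in the talk.

```python
import numpy as np

def pitch_class_hist(midi_pitches):
    """Normalized 12-bin histogram of pitch classes (MIDI numbers mod 12)."""
    h = np.bincount(np.mod(np.round(midi_pitches).astype(int), 12), minlength=12)
    return h / h.sum()

def similarity(h1, h2):
    """Histogram intersection: 1.0 means identical profiles."""
    return float(np.minimum(h1, h2).sum())

a = pitch_class_hist(np.array([60, 62, 64, 60, 67]))  # C D E C G
b = pitch_class_hist(np.array([60, 60, 64, 67, 67]))  # C C E G G
print(round(similarity(a, b), 2))  # 0.8
```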


[Figure: detection of phrases melodically similar to the "Guru Bina" pitch contour (swaras: S SN R); positive phrases align with the emphatic beat (sam), the negative phrase does not]


[System diagram (recap): polyphonic audio signal -> signal representation -> multi-F0 analysis -> predominant-F0 trajectory extraction -> voice F0 contour, with a singing voice detection stage]

o Input: magnitudes + locations of sinusoids
o For a range of trial fundamentals, generate predicted harmonics
o Minimise the TWM error w.r.t. the trial fundamentals
Err_total = Err_{p->m}/N + ρ · Err_{m->p}/K
[Figure: nearest-neighbour matching between predicted harmonic components and measured components]

Predicted to measured error:

Err_{p->m} = Σ_{n=1}^{N} [ Δf_n · (f_n)^{-p} + (a_n / A_max) · (q · Δf_n · (f_n)^{-p} - r) ]

Measured to predicted error:

Err_{m->p} = Σ_{k=1}^{K} [ Δf_k · (f_k)^{-p} + (a_k / A_max) · (q · Δf_k · (f_k)^{-p} - r) ]

Significant term: Δf / (f)^p, where Δf = frequency mismatch error and f = partial frequency.
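The predicted-to-measured term can be sketched directly from the equation above (following Maher and Beauchamp's two-way mismatch formulation): match each predicted harmonic to its nearest measured partial and accumulate the weighted mismatch. The parameter values p = 0.5, q = 1.4, r = 0.5 and the five-harmonic limit are illustrative defaults.

```python
import numpy as np

def err_p_to_m(f0, measured_f, measured_a, n_harm=5, p=0.5, q=1.4, r=0.5):
    """Predicted-to-measured TWM error for a trial fundamental f0."""
    a_max = measured_a.max()
    err = 0.0
    for n in range(1, n_harm + 1):
        fn = n * f0                              # predicted harmonic
        k = np.abs(measured_f - fn).argmin()     # nearest measured partial
        df = abs(measured_f[k] - fn)             # frequency mismatch Δf
        err += df * fn ** -p + (measured_a[k] / a_max) * (q * df * fn ** -p - r)
    return err

# Harmonic series on 300 Hz: the true fundamental scores lower than a wrong one.
freqs = np.array([300.0, 600.0, 900.0, 1200.0, 1500.0])
amps = np.array([1.0, 0.7, 0.5, 0.3, 0.2])
print(err_p_to_m(300.0, freqs, amps) < err_p_to_m(410.0, freqs, amps))  # True
```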

[Figure: melody detection system [1]]

o F0 search range (male/female)
o p, q, r
o ρ (male/female)
o Window length (pitch range and rate of variation)
o Smoothness cost parameter (rate of pitch variation)
o Voicing threshold

o Window length is an analysis parameter that influences the accuracy of sinusoidal modeling of the signal
o Closely-spaced components in the polyphony => need for higher frequency resolution = longer windows
o Pitch variation with time can be rapid in ornamented regions => need for better time resolution = shorter windows

o Easily computable measures for adapting window length
o Signal sparsity: a sparse spectrum is more concentrated => better represented sinusoidal components
o Window length selection (20, 30, 40 ms) based on maximizing signal sparsity
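One plausible reading of this criterion can be sketched as: score each candidate window length by a spectral concentration measure and keep the winner. The L2/L1 norm ratio used here (it grows as energy concentrates in fewer bins) is an assumption for illustration; the cited work [2] defines its own sparsity measure.

```python
import numpy as np

def sparsity(mag):
    """L2/L1 ratio of a magnitude spectrum: higher = more concentrated."""
    return np.linalg.norm(mag, 2) / np.linalg.norm(mag, 1)

def best_window(x, fs, lengths_ms=(20, 30, 40)):
    """Pick the window length (ms) whose spectrum is most concentrated."""
    scores = {}
    for ms in lengths_ms:
        n = int(fs * ms / 1000)
        frame = x[:n] * np.hanning(n)
        scores[ms] = sparsity(np.abs(np.fft.rfft(frame)))
    return max(scores, key=scores.get)

fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440 * t)
print(best_window(tone, fs))
```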

1. V. Rao and P. Rao, "Vocal melody extraction in the presence of pitched accompaniment in polyphonic music," IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 8, pp. 2145-2154, Nov. 2010.
2. V. Rao, P. Gaddipati and P. Rao, "Signal-driven window adaptation for sinusoid identification in polyphonic music," IEEE Transactions on Audio, Speech, and Language Processing, Jan. 2012.