
University of Colorado at Boulder ECEN 4/5532
Lab 1. Lab report due on February 2, 2015

This is a MATLAB-only lab; therefore each student needs to turn in her/his own lab report and own programs.

1 Introduction

The goal of this lab is to explore some of the recent techniques developed in the audio industry to organize and search large music collections by content. The explosion of digital music has created a need to invent new tools to search and organize music. Several websites provide large databases of musical tracks:

http://magnatune.com/
http://www.allmusic.com/
http://www.last.fm/

and also allow users to find musical tracks and artists that are similar. Companies such as Gracenote and Shazam have applications that can recognize a song based solely on the actual music. Other examples of automated music analysis include:

1. score following: Rock Prodigy, SmartMusic, Rock Band
2. automatic music transcription: Zenph
3. music recommendation, playlisting: Google Music, Last.fm, Pandora, Spotify
4. machine listening: Echonest

At the moment these tools are still rudimentary. Fast computational methods are needed to expand these tools beyond their current primitive scope and to integrate them on portable music players (e.g., the iPod). The goal of this lab is to develop digital signal processing tools to automatically extract features that characterize music.

2 Audio signal: from the sound to the wav file

Sounds are produced by the vibration of air: sound waves produce variations of the air pressure. The sound waves can be measured using a microphone that converts the mechanical energy into electrical energy; precisely, the microphone converts air pressure into voltage levels. The electrical signal is then sampled in time at a sampling frequency f_s, and quantized. The current standard for high-quality audio and DVD is a sampling frequency of f_s = 96 kHz, and a 24-bit depth for the quantization. CD quality is 16 bits sampled at f_s = 44.1 kHz.

1. Dolphins can only hear sounds over the frequency range [7, 120] kHz. At what sampling frequency f_s should we sample digital audio signals for dolphins?

2.1 Segments, Frames, and Samples

In order to analyze the music automatically, the audio files are segmented into non-overlapping segments of a few seconds. This provides the coarsest resolution for the analysis. The audio features are computed at a much finer resolution: the audio file is divided into overlapping intervals of a few milliseconds over which the analysis is conducted. In this lab the audio files are sampled at f_s = 11,025 Hz, and we consider intervals of size N = 512 samples, or about 46 ms. This 46-millisecond interval is called a frame. We will implement several algorithms that return a set of features for each frame. While there are 512 samples per frame, we will usually return a feature vector of much smaller size.

3 The music

In the file http://ecee.colorado.edu/~fmeyer/class/ecen4532/audio-files.zip you will find 12 tracks of various lengths. There are two examples of each of six musical genres:

1. Classical
2. Electronic
3. Jazz
4. Metal and punk
5. Rock and pop
6. World

The name of the file indicates its genre. The musical tracks were chosen because they are diverse but also have interesting characteristics, which should be revealed by your analysis.

track201-classical is a part of a violin concerto by J.S. Bach (13-BWV 1001: IV. Presto).

track204-classical is a classical piano piece composed by Robert Schumann (Davidsbundlertanze, Op. 6, XVI).

track370-electronic is an example of synthetic music generated by a software package that makes it possible to create notes that do not have the regular structure of Western music. The pitches may be unfamiliar, but the rhythm is quite predictable.

track396-electronic is an example of an electronic dance-floor piece. The monotony and the simplicity of the synthesizer and the bass drum are broken by vocals.

track437-jazz is another track by the same trio as track439-jazz.

track439-jazz is an example of (funk) jazz with a Hammond B3 organ, a bass, and percussion. The song is characterized by a well-defined rhythm and a simple melody.

track463-metal is an example of rock and metal with vocals, bass, electric guitars, and percussion.

track492-metal is an example of heavy metal with heavily distorted guitars, and drums with a double bass drum.

track547-rock is an example of new-wave rock of the 80s. The music varies from edgy to hard. It has guitars, keyboards, and percussion.

track550-rock is an example of pop with keyboard, guitar, bass, and vocals. There is a rich melody and the sound of a steel-string guitar.

track707-world is a Japanese flute piece, monophonic, with background sounds between the notes. The notes are held for a long time.

track729-world is a piece with a mix of Eastern and Western influences, played on electric and acoustic sarod and on classical steel-string guitars.

2. Write a MATLAB function that extracts T seconds of music from a given track. You will use the MATLAB function wavread (audioread in newer MATLAB releases) to read a track, and the function sound (or an audioplayer object and its play method) to listen to it. In the lab you will use T = 24 seconds from the middle of each track to compare the different algorithms. Download the files, and test your function. A minimal sketch is given below.
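The following is a minimal sketch of such a function, assuming the track is a .wav file and that wavread is available; the function name extract_middle and the mono fold-down are illustrative choices, not requirements of the assignment.

    function [x, fs] = extract_middle(trackname, T)
    % EXTRACT_MIDDLE  Return T seconds taken from the middle of an audio file.
    [x, fs] = wavread(trackname);       % use audioread(trackname) on newer MATLAB
    x = mean(x, 2);                     % fold a stereo track down to mono
    nsamp = round(T*fs);                % number of samples to keep
    start = max(1, floor((length(x) - nsamp)/2) + 1);
    x = x(start : start + nsamp - 1);   % T seconds centered in the track
    end

For example, [x, fs] = extract_middle('track201-classical.wav', 24); sound(x, fs); should play 24 seconds from the middle of the Bach concerto.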

4 Low-level features: time-domain analysis

The most interesting features will involve some sophisticated spectral analysis. There are, however, a few simple features that can be computed directly in the time domain. We describe some of the most popular features in the following.

4.1 Loudness

The standard deviation of the original audio signal x[n], computed over a frame of size N, provides a sense of the loudness,

\sigma(n) = \sqrt{ \frac{1}{N-1} \sum_{m=-N/2}^{N/2-1} \big( x(n+m) - E[x_n] \big)^2 }
\quad \text{with} \quad
E[x_n] = \frac{1}{N} \sum_{m=-N/2}^{N/2-1} x(n+m).   (1)

4.2 Zero-crossing

The zero-crossing rate (ZCR) is the average number of times the audio signal crosses the zero-amplitude line per time unit. The ZCR is related to the pitch height, and it is also correlated with the noisiness of the signal: the ZCR is high for noisy (unvoiced) sounds and low for tonal (voiced) sounds, and for simple periodic signals it is roughly related to the fundamental frequency. We use the following definition of the ZCR,

\mathrm{ZCR}(n) = \frac{1}{2N} \sum_{m=-N/2}^{N/2-1} \big| \operatorname{sgn}(x(n+m)) - \operatorname{sgn}(x(n+m-1)) \big|.   (2)

3. Implement the loudness and the ZCR, and evaluate these features on the different music tracks. Your MATLAB function should display each feature as a time series in a separate figure (see, e.g., Fig. 1). A minimal sketch follows Fig. 1.

4. Comment on the specificity of each feature, and on its ability to separate different musical genres.

Figure 1: Zero crossing rate as a function of the frame number.
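Here is a minimal sketch of (1) and (2), assuming the signal x has already been loaded and folded down to mono; the function name and the hop size of N/2 samples between frames are illustrative choices.

    function [loudness, zcr] = time_features(x, N)
    % TIME_FEATURES  Loudness (1) and zero-crossing rate (2) for each frame.
    hop  = N/2;                                % frame advance (a choice, not a spec)
    nfrm = floor((length(x) - N)/hop) + 1;     % number of complete frames
    loudness = zeros(1, nfrm);
    zcr      = zeros(1, nfrm);
    for i = 1:nfrm
        frame = x((i-1)*hop + (1:N));          % the N samples of frame i
        loudness(i) = std(frame);              % standard deviation, as in (1)
        zcr(i) = sum(abs(diff(sign(frame)))) / (2*N);   % as in (2)
    end
    end

Plotting plot(loudness) and plot(zcr) in separate figures then displays each feature as a time series indexed by the frame number, as in Fig. 1.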

5 Low-level features: spectral analysis

5.1 Windowing and spectral leakage

Music is made up of notes of different pitches. It is only natural that most of the automated analysis of music should be performed in the spectral (frequency) domain. The spectral analysis proceeds as follows. Each frame is smoothly extracted by multiplying the original audio signal by a taper window w. The Fourier transform of the windowed signal is then computed. If x_n denotes a frame of size N = 512 extracted at frame n, and w is a window of size N, then the Fourier transform X_n (of size K = N/2 + 1) for the frame n is given by

Y = FFT(w .* x_n);  K = N/2 + 1;  X_n = Y(1:K);   (3)

There are several simple descriptors that can be used to characterize the spectral information provided by the Fourier transform over a frame.

Figure 2: Spectrogram: square of the magnitude (color-coded) of the Fourier transform, |X_n(k)|^2, as a function of the frame index n (x-axis) and the frequency index k (y-axis).

5. Let

x[n] = \cos(\omega_0 n), \quad n \in \mathbb{Z}.   (4)

Derive the theoretical expression of the discrete-time Fourier transform of x, given by

X(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x[n] e^{-j\omega n}.   (5)

6. In practice, we work with a finite signal, and we multiply the signal x[n] by a window w[n]. We assume that the window w is nonzero at times n = -N/2, ..., N/2, and we define

y[n] = x[n] w[n + N/2].   (6)

Derive the expression of the Fourier transform of y[n] in terms of the Fourier transform of x and the Fourier transform of the window w.
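A minimal sketch of the windowed Fourier transform (3) for a single frame is given below; question 7 asks you to evaluate it with different windows. The helper name frame_fft and the Kaiser parameter are illustrative choices.

    function Xn = frame_fft(frame, w)
    % FRAME_FFT  Windowed Fourier transform of one frame, as in (3).
    %   frame : column vector of N samples
    %   w     : taper window of the same length, e.g. kaiser(N, 8)
    N  = length(frame);
    Y  = fft(w(:) .* frame(:));    % pointwise taper, then FFT
    K  = N/2 + 1;
    Xn = Y(1:K);                   % keep the non-negative frequencies
    end

To see the spectral leakage, compare plot(abs(frame_fft(s, bartlett(512)))), with s a pure tone such as cos(2*pi*1000*(0:511)'/11025), against the same plot with Hann and Kaiser windows.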

7. Implement the computation of the windowed Fourier transform of y, given by (3). Evaluate its performance with pure sinusoidal signals and different windows: Bartlett, Hann, and Kaiser.

8. Compute the spectrogram of an audio track as follows (a sketch is given at the end of Sec. 5.2):

(a) Decompose the track into a sequence of N_f overlapping frames of size N. The overlap between two frames should be N/2.

(b) Compute the magnitude squared of the Fourier transform, |X_n(k)|^2, k = 1, ..., K, over each frame n.

(c) Display the Fourier transforms of all the frames in a matrix of size K × N_f. The spectrogram should look like Fig. 2.

You will experiment with different audio tracks, as well as with pure sinusoidal tones. Do the spectrograms look like what you hear?

In the rest of the lab, we will be using a Kaiser window to compute the Fourier transform, as explained in (3).

5.2 Spectral centroid and spread

The first descriptors are the first- and second-order moments of the magnitude of the Fourier transform, given by its mean and its standard deviation. In fact, rather than working directly with the Fourier transform, we prefer to define a frequency distribution for frame n according to

\tilde{X}_n(k) = \frac{|X_n(k)|}{\sum_{l=1}^{K} |X_n(l)|}.   (7)

Then the first two moments of \tilde{X}_n(k) are given by

\sigma(\tilde{X}_n) = \sqrt{ \frac{1}{K-1} \sum_{k=1}^{K} \big[ \tilde{X}_n(k) - E[\tilde{X}_n] \big]^2 }
\quad \text{with} \quad
E[\tilde{X}_n] = \frac{1}{K} \sum_{k=1}^{K} k \tilde{X}_n(k).   (8)

The spectral centroid E[\tilde{X}_n] can be used to quantify sound sharpness or brightness. The spread, \sigma(\tilde{X}_n), quantifies the spread of the spectrum around the centroid, and thus helps differentiate between tone-like and noise-like sounds.
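The following sketch combines question 8 with the moments (7) and (8); it assumes a mono signal x and reuses the frame_fft helper sketched above. The bsxfun calls are only there so the sketch runs on older MATLAB releases without implicit expansion.

    % Spectrogram (question 8) and spectral centroid/spread (7)-(8); a sketch.
    N   = 512;  K = N/2 + 1;  hop = N/2;
    w   = kaiser(N, 8);
    Nf  = floor((length(x) - N)/hop) + 1;
    S   = zeros(K, Nf);                      % spectrogram, K x Nf
    for n = 1:Nf
        frame  = x((n-1)*hop + (1:N));
        S(:,n) = abs(frame_fft(frame, w)).^2;
    end
    imagesc(10*log10(S + eps)); axis xy;     % frame index on the x-axis, as in Fig. 2

    mag = sqrt(S);                           % |X_n(k)|, one column per frame
    Xt  = bsxfun(@rdivide, mag, sum(mag, 1));          % distribution (7)
    k   = (1:K)';
    centroid = sum(bsxfun(@times, k, Xt), 1) / K;      % E[X~_n] in (8)
    spread   = sqrt(sum(bsxfun(@minus, Xt, centroid).^2, 1) / (K-1));  % sigma in (8)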

5.3 Spectral flatness

The spectral flatness is the ratio between the geometric and arithmetic means of the magnitude of the Fourier transform,

SF(n) = \frac{ \left( \prod_{k=1}^{K} |X_n(k)| \right)^{1/K} }{ \frac{1}{K} \sum_{k=1}^{K} |X_n(k)| }.   (9)

The flatness is at most one, since the geometric mean never exceeds the arithmetic mean. The flatness is equal to one if all the |X_n(k)| are equal; this happens for a very noisy signal. A very small flatness corresponds to the presence of tonal components. In summary, this is a measure of the noisiness of the spectrum.

Figure 3: Spectral flatness as a function of the frame number.

5.4 Spectral flux

The spectral flux is a global measure of the spectral changes between two adjacent frames, n-1 and n,

F_n = \sum_{k=1}^{K} \big( \tilde{X}_n(k) - \tilde{X}_{n-1}(k) \big)^2,   (10)

where \tilde{X}_n(k) is the normalized frequency distribution for frame n, given by (7).

9. Implement all the low-level spectral features and evaluate them on the different tracks. Your MATLAB function should display each feature as a time series in a separate figure (see, e.g., Fig. 3). A sketch follows question 10.

10. Comment on the specificity of each feature, and on its ability to separate different musical genres.
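A minimal sketch of (9) and (10), computed from the spectrogram matrix S of the previous sketch; working with logarithms avoids numerical underflow in the geometric mean, and the eps guard against empty bins is an illustrative choice.

    % Spectral flatness (9) and spectral flux (10); a sketch, S is K x Nf.
    mag = sqrt(S) + eps;                        % |X_n(k)|, guarded against zeros
    geo = exp(mean(log(mag), 1));               % geometric mean over k
    ari = mean(mag, 1);                         % arithmetic mean over k
    SF  = geo ./ ari;                           % flatness, one value per frame

    Xt   = bsxfun(@rdivide, mag, sum(mag, 1));  % normalized distribution (7)
    flux = sum(diff(Xt, 1, 2).^2, 1);           % (10), one value per frame transition
    figure; plot(SF);   title('Spectral flatness');
    figure; plot(flux); title('Spectral flux');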

5.5 Application: MPEG-7 Low-Level Audio Descriptors

MPEG-7, also known as the Multimedia Content Description Interface, provides a standardized set of technologies for describing multimedia content. Part 4 of the standard specifies description tools that pertain to multimedia in the audio domain. The standard defines Low-level Audio Descriptors (LLDs) that consist of a collection of simple, low-complexity descriptors to characterize the audio content. Some of these features are similar to the spectral features defined above.

6 Basic Psychoacoustic Quantities

In order to develop more sophisticated algorithms to analyze music based on its content, we need to define several subjective features such as timbre, melody, harmony, rhythm, tempo, mood, lyrics, etc. Some of these concepts can be defined formally, while others are more subjective and can be formalized using a wide variety of different algorithms. We will focus on the features that can be defined mathematically.

6.1 Psychoacoustics

Psychoacoustics involves the study of the human auditory system, and the formal quantification of the relationships between the physics of sounds and our perception of audio. We will describe some key aspects of the human auditory system:

1. the perception of frequencies and pitch for pure and complex tones;
2. the frequency selectivity of the auditory system: our ability to perceive two similar frequencies as distinct;
3. the modeling of the auditory system as a bank of auditory filters;
4. the perception of loudness;
5. the perception of rhythm.

6.2 Perception of frequencies

The auditory system, like the visual system, is able to detect frequencies over a wide range of scales. In order to measure frequencies over a very large range, it operates using a logarithmic scale. Let us consider a pure tone, modeled by a sinusoidal signal oscillating at a frequency ω. If ω < 500 Hz, then the perceived tone, or pitch, varies as a linear function of ω. When ω > 1,000 Hz, the perceived pitch increases logarithmically with ω. Several frequency scales have been proposed to capture this logarithmic scaling of frequency perception.

6.3 The mel/bark scale

The Bark (named after the German physicist Barkhausen) is defined as

z = 7 \operatorname{arcsinh}(\omega/650) = 7 \log\!\left( \omega/650 + \sqrt{1 + (\omega/650)^2} \right),

where ω is measured in Hz. The mel scale is defined by the fact that 1 bark = 100 mel. In this lab we will use a slightly modified version of the mel scale, defined by

m = 1127.01048 \log(1 + \omega/700).   (11)
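The conversion (11) and its inverse are one-line anonymous functions in MATLAB; this small sketch will be reused when the filterbank is constructed (the names hz2mel and mel2hz are illustrative).

    hz2mel = @(f) 1127.01048 * log(1 + f/700);     % the mel scale (11)
    mel2hz = @(m) 700 * (exp(m/1127.01048) - 1);   % inverse mapping
    % Sanity check: hz2mel(1000) is approximately 1000 mel, by design of the scale.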

6.4 The cochlear filterbank

Finally, we need to account for the fact that the auditory system behaves as a set of filterbanks with overlapping frequency responses. For each filter, the range of frequencies over which the filter response is significant is called the critical band. Our perception of pitch can be quantified using the total energy at the output of each filter bank: all the spectral energy that falls into one critical band is summed up, leading to a single number for that frequency band. We describe in the following a simple model of the cochlear filterbank.

The filter bank is constructed using N_B = 40 logarithmically spaced triangular filters centered at the frequencies Ω_p, defined by

\text{mel}_p = 1127.01048 \log(1 + \Omega_p/700),   (12)

where the sequence of mel frequencies is equally spaced on the mel scale, with

\text{mel}_n = n \, \frac{\text{mel}_{\max} - \text{mel}_{\min}}{N_B},   (13)

\text{mel}_{\max} = 1127.01048 \log(1 + 0.5\,\omega_s/700),   (14)

\text{mel}_{\min} = 1127.01048 \log(1 + 20/700),   (15)

and N_B = 40. Each filter H_p is centered around the frequency Ω_p, and is defined by

H_p(\omega) =
\begin{cases}
\dfrac{2}{\Omega_{p+1} - \Omega_{p-1}} \, \dfrac{\omega - \Omega_{p-1}}{\Omega_p - \Omega_{p-1}} & \text{if } \omega \in [\Omega_{p-1}, \Omega_p), \\[1ex]
\dfrac{2}{\Omega_{p+1} - \Omega_{p-1}} \, \dfrac{\Omega_{p+1} - \omega}{\Omega_{p+1} - \Omega_p} & \text{if } \omega \in [\Omega_p, \Omega_{p+1}).
\end{cases}   (16)

Each triangular filter is normalized such that the integral of each filter is 1; this normalization is needed so that a perfectly flat input Fourier spectrum produces a flat mel-spectrum. In addition, the filters overlap, so that the frequency at which the filter H_p is maximum is the starting frequency of the next filter H_{p+1}, and the edge frequency of H_{p-1}.

Figure 4: The filterbanks used to compute the MFCC.

Finally, the mel-spectrum (MFCC) coefficients of the n-th frame are defined for p = 1, ..., N_B as

\text{mfcc}[p] = \sum_{k=1}^{K} H_p(k) \, |X_n(k)|^2,   (17)

where the filter H_p is a discrete implementation of the continuous filter defined by (16). The discrete filter is normalized such that

\sum_{j} H_p(j) = 1, \quad p = 1, \ldots, N_B.   (18)

The MATLAB code in Fig. 5 computes the sequence of frequencies Ω_p.

    nbanks = 40;   %% number of mel frequency bands
    % linear frequencies
    linfrq = 20:fs/2;
    % mel frequencies
    melfrq = log(1 + linfrq/700) * 1127.01048;
    % equispaced mel indices
    melidx = linspace(1, max(melfrq), nbanks+2);
    % from mel index to linear frequency: melidx2frq(p) = \Omega_p
    melidx2frq = zeros(1, nbanks+2);
    for i = 1:nbanks+2
        [val, indx] = min(abs(melfrq - melidx(i)));
        melidx2frq(i) = linfrq(indx);
    end

Figure 5: Computation of the sequence of frequencies Ω_p associated with the normalized hat filters defined in (16).

11. Implement the computation of the triangular filterbanks H_p, p = 1, ..., N_B. Your function will return an array fbank of size N_B × K such that fbank(p,:) contains the filter bank H_p.

12. Implement the computation of the MFCC coefficients, as defined in (17). A sketch of both steps is given below.
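Putting the pieces together, here is a minimal sketch of questions 11 and 12; it assumes the center frequencies from Fig. 5 are available in melidx2frq, and the mapping of each Ω_p to the nearest FFT bin, as well as the function name mel_features, are illustrative choices.

    function [mfcc, fbank] = mel_features(S, melidx2frq, fs, N)
    % MEL_FEATURES  Triangular filterbank (16) and MFCC coefficients (17).
    %   S          : K x Nf spectrogram, S(k,n) = |X_n(k)|^2
    %   melidx2frq : the nbanks+2 frequencies Omega_p computed as in Fig. 5
    K      = N/2 + 1;
    nbanks = length(melidx2frq) - 2;
    bin    = round(melidx2frq/fs * N) + 1;       % nearest FFT bin of each Omega_p
    fbank  = zeros(nbanks, K);
    for p = 1:nbanks
        lo = bin(p); c = bin(p+1); hi = bin(p+2);         % Omega_{p-1}, Omega_p, Omega_{p+1}
        fbank(p, lo:c) = linspace(0, 1, c - lo + 1);      % rising edge of (16)
        fbank(p, c:hi) = linspace(1, 0, hi - c + 1);      % falling edge of (16)
        fbank(p, :)    = fbank(p, :) / sum(fbank(p, :));  % normalization (18)
    end
    mfcc = fbank * S;    % (17): a column of N_B coefficients for each frame
    end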