Introduction of Audio and Music

1 Introduction of Audio and Music Wei-Ta Chu 2009/12/3

Outline 2 Introduction of Audio Signals Introduction of Music

3 Introduction of Audio Signals Wei-Ta Chu 2009/12/3 Li and Drew, Fundamentals of Multimedia, Prentice Hall, 2004

Sound 4 Sound is a wave phenomenon like light, but is macroscopic and involves molecules of air being compressed and expanded under the action of some physical device. Since sound is a pressure wave, it takes on continuous values, as opposed to digitized ones.

Digitization 5 Digitization means conversion to a stream of numbers, and preferably these numbers should be integers for efficiency. Sampling and Quantization

Digitization 6 The first kind of sampling, using measurements only at evenly spaced time intervals, is simply called sampling. The rate at which it is performed is called the sampling frequency. For audio, typical sampling rates range from 8 kHz (8,000 samples per second) to 48 kHz. This range is determined by the Nyquist theorem, discussed later. Typical uniform quantization rates are 8-bit and 16-bit. 8-bit quantization divides the vertical axis into 256 levels, and 16-bit divides it into 65,536 levels.
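
A minimal sketch of sampling and uniform quantization with NumPy; the sample rate, bit depth, and test tone below are illustrative choices, not values from the slides.

```python
import numpy as np

fs = 8000                                # sampling frequency: 8 kHz (telephone rate)
t = np.arange(0, 1.0, 1 / fs)            # one second of evenly spaced sample instants
x = 0.8 * np.sin(2 * np.pi * 440 * t)    # a 440 Hz test tone with amplitude in [-1, 1]

bits = 8
levels = 2 ** bits                       # 8-bit quantization -> 256 levels
# Map [-1, 1) onto the integer codes -128..127, then back to amplitudes.
codes = np.clip(np.round(x * (levels / 2)), -levels / 2, levels / 2 - 1)
x_quantized = codes * 2 / levels
```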

Sinusoids 7

Nyquist Theorem 8 The Nyquist theorem states how frequently we must sample in time to be able to recover the original sound. If the sampling rate exactly equals the actual frequency of a sinusoid, a false signal is detected: the samples trace out a constant, with zero frequency.

Nyquist Theorem 9 If we sample at 1.5 times the actual frequency, we obtain an incorrect (alias) frequency that is lower than the correct one. For correct sampling we must use a sampling rate equal to at least twice the maximum frequency content in the signal. This rate is called the Nyquist rate.

Nyquist Theorem 10 Nyquist Theorem: The sampling rate has to be at least twice the maximum frequency content in the signal. Nyquist frequency: half of the Nyquist rate. Since it would be impossible to recover frequencies higher than the Nyquist frequency, most systems have an antialiasing filter that restricts the frequency content to a range at or below the Nyquist frequency.
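
A hedged sketch of aliasing: sampling a tone above the Nyquist frequency folds it down to a lower frequency. The rates and the tone are illustrative.

```python
import numpy as np

fs = 8000                        # sampling rate -> Nyquist frequency is 4000 Hz
f_true = 5500                    # above the Nyquist frequency, so it will alias
n = np.arange(1024)
x = np.sin(2 * np.pi * f_true * n / fs)

spectrum = np.abs(np.fft.rfft(x))
f_detected = np.argmax(spectrum) * fs / len(x)
print(f_detected)                # ~2500 Hz = fs - f_true, not the true 5500 Hz
```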

Signal-to-Noise Ratio (SNR) 11 The ratio of the power of the correct signal to the power of the noise is called the signal-to-noise ratio (SNR), a measure of the quality of the signal. The SNR is usually measured in decibels (dB), where 1 dB is a tenth of a bel. The SNR value, in units of dB, is defined in terms of base-10 logarithms of squared voltages: $\mathrm{SNR} = 10\log_{10}\frac{V_{signal}^2}{V_{noise}^2} = 20\log_{10}\frac{V_{signal}}{V_{noise}}$. If the signal voltage $V_{signal}$ is 10 times the noise voltage, the SNR is $20\log_{10}10 = 20$ dB.

12 Signal-to-Quantization-Noise Ratio (SQNR) Aside from any noise that may have been present in the original analog signal, there is also an additional error that results from quantization: a roundoff error. It is not really noise; nevertheless it is called quantization noise (or quantization error). The quality of the quantization is characterized by the signal-to-quantization-noise ratio (SQNR).

13 Signal-to-Quantization-Noise Ratio (SQNR) For a quantization accuracy of N bits per sample, the range of the digital signal is $-2^{N-1}$ to $2^{N-1}-1$. If the actual analog signal is in the range from $-V_{max}$ to $+V_{max}$, each quantization level represents a voltage of $2V_{max}/2^{N} = V_{max}/2^{N-1}$. The peak SQNR compares the peak signal value $2^{N-1}$ with the worst quantization error of half a level: $\mathrm{SQNR}_{peak} = 20\log_{10}\frac{2^{N-1}}{1/2} = 20\,N\log_{10}2 \approx 6.02N$ dB.

14 Signal-to-Quantization-Noise Ratio (SQNR)

15 Signal-to-Quantization-Noise Ratio (SQNR) 6.02N is the worst case. If the input signal is sinusoidal, the quantization error is statistically independent, and its magnitude is uniformly distributed between 0 and half of the interval, then it can be shown that the expression for the SQNR becomes $\mathrm{SQNR} = 6.02N + 1.76$ dB. Typical digital audio sample precision is either 8 bits per sample, equivalent to about telephone quality, or 16 bits, for CD quality.
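
A small empirical check of this formula, assuming a full-scale sine and simple rounding quantization (a sketch, not the slides' derivation):

```python
import numpy as np

def sqnr_db(x, bits):
    scale = 2 ** (bits - 1)
    xq = np.round(x * scale) / scale      # uniform quantization to N bits
    noise = x - xq                        # quantization error
    return 10 * np.log10(np.sum(x ** 2) / np.sum(noise ** 2))

x = np.sin(2 * np.pi * 0.1234 * np.arange(100000))  # full-scale sine
print(sqnr_db(x, 16))   # close to 6.02 * 16 + 1.76 = 98.1 dB
```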

Linear and Nonlinear Quantization 16 Non-uniform quantization: set up more finely spaced levels where humans hear with the most acuity. We are quantizing magnitude, or amplitude: how loud the signal is. Weber's Law, stated formally, says that equally perceived differences have values proportional to absolute levels: $\Delta\mathrm{Response} \propto \Delta\mathrm{Stimulus}/\mathrm{Stimulus}$. If we can feel an increase in weight from 10 to 11 pounds, then starting from 20 pounds it would take 22 pounds for us to feel an increase in weight.

Linear and Nonlinear Quantization 17

Linear and Nonlinear Quantization 18 For steps near the low end of the signal, quantization steps are effectively more concentrated on the s axis, whereas for large values of s, one quantization step in r encompasses a wide range of s values.

Linear and Nonlinear Quantization 19 The μ-law transform maps signal values nonlinearly: $r = \frac{\mathrm{sgn}(s)}{\ln(1+\mu)}\ln\left(1+\mu\frac{|s|}{s_p}\right)$, where $s_p$ is the peak signal value and $s$ is the current signal value.
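
A minimal sketch of μ-law companding per the formula above; μ = 255 (the North American telephony value) is an illustrative default.

```python
import numpy as np

def mu_law_compress(s, s_p, mu=255.0):
    # r = sgn(s) * ln(1 + mu*|s|/s_p) / ln(1 + mu)
    return np.sign(s) * np.log1p(mu * np.abs(s) / s_p) / np.log1p(mu)

def mu_law_expand(r, s_p, mu=255.0):
    # Inverse transform: |s| = (s_p/mu) * (exp(|r| * ln(1 + mu)) - 1)
    return np.sign(r) * (s_p / mu) * np.expm1(np.abs(r) * np.log1p(mu))
```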

Linear and Nonlinear Quantization 20

Audio Filtering 21 Prior to sampling and A/D conversion, the audio signal is usually filtered to remove unwanted frequencies. For speech, typically the band from 50 Hz to 10 kHz is retained, and other frequencies are blocked by a band-pass filter that screens out lower and higher frequencies. An audio music signal will typically contain frequencies from about 20 Hz up to 20 kHz.

Audio Quality vs. Data Rate 22 The uncompressed data rate increases as more bits are used for quantization. Stereo: double the bandwidth
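
The uncompressed rate is simply the product of sample rate, bits per sample, and channel count; a worked one-liner (the CD parameters are standard):

```python
def data_rate_bps(fs, bits, channels):
    # uncompressed data rate = sample rate * bits per sample * channels
    return fs * bits * channels

print(data_rate_bps(44100, 16, 2))  # 1,411,200 bits/s for CD-quality stereo
```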

Processing in Frequency Domain 23 It's hard to infer much from the time-domain waveform. Human hearing is based on frequency analysis, so using frequency analysis often facilitates understanding. Part of the slides are from Prof. Hsu, NTU http://www.csie.ntu.edu.tw/~winston/courses/mm.ana.idx/index.html

Problems in Using Fourier Transform 24 The Fourier transform contains only frequency information; no time information is retained. It works fine for stationary signals, but non-stationary or changing signals cause problems: the Fourier transform shows frequencies occurring at all times instead of at specific times.

Short-Time Fourier Transform (STFT) 25 How can we still use the FT but handle nonstationary signals? How can we include time? Idea: break up the signal into discrete windows. The signal within each window is approximately stationary. Take the FT over each part.

STFT Example 26 Window function

Short-Time Fourier Analysis 27 Problem: conventional Fourier analysis does not capture the time-varying nature of audio signals. Solution: multiply the signal by a finite-duration window function, then compute the DTFT: $X_n(e^{j\omega}) = \sum_{m=-\infty}^{\infty} w(n-m)\,x(m)\,e^{-j\omega m}$.

Window Functions 28 Rectangular window: $w(n) = 1$ for $0 \le n \le N-1$, and 0 otherwise. Hamming window: $w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right)$ for $0 \le n \le N-1$, and 0 otherwise.
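
A short sketch computing both windows and one windowed DFT frame (one STFT column); the frame length is an illustrative choice.

```python
import numpy as np

N = 512                                   # frame length (illustrative)
n = np.arange(N)
rectangular = np.ones(N)
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

frame = np.random.randn(N)                # stand-in for one audio frame
spectrum = np.fft.rfft(frame * hamming)   # one column of the STFT
```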

Window Functions 29

30 Break Up Original Audio Signals into Frames Using Hamming Window

31 Introduction of Audio Features Wei-Ta Chu 2009/12/3

Introduction of Audio Features 32 Short-term frame level vs. long-term clip level. A frame is defined as a group of neighboring samples that lasts about 10 to 40 ms. For audio clips with a sampling frequency of 16 kHz, how many samples are in a 20 ms audio frame? (16,000 samples/s × 0.02 s = 320 samples.) Within an audio frame we can assume that the audio signal is stationary. A clip consists of a sequence of frames, and clip-level features usually characterize how frame-level features change over a clip. Y. Wang, Z. Liu, and J.-C. Huang, Multimedia content analysis using both audio and visual clues, IEEE Signal Processing Magazine, Nov. 2000, pp. 12-36.
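
A minimal framing sketch; the 20 ms frames with 50% overlap are common but illustrative choices, and the signal is assumed to be at least one frame long.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=20, hop_ms=10):
    # Split x into overlapping frames: 320-sample frames at 16 kHz, hop 160.
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
```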

Frames and Clips 33 Fixed-length clips (1 to 2 seconds) or variable-length clips. Both frames and clips may overlap with their predecessors.

Frame-Level Features 34 Most of the frame-level features are inherited from speech signal processing: time-domain features and frequency-domain features. We use $N$ to denote the frame length and $s_n(i)$ to denote the $i$th sample in the $n$th audio frame.

Volume (Loudness, Energy) 35 Volume is a reliable indicator for silence detection, which may help to segment an audio sequence and to determine clip boundaries. It is approximated by the root mean square of the signal magnitude within each frame: $V_n = \sqrt{\frac{1}{N}\sum_{i=1}^{N} s_n(i)^2}$. The volume of an audio signal depends on the gain of the recording and digitizing devices, so we may normalize the volume of a frame by the maximum volume of some previous frames.
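
The per-frame RMS volume in a couple of lines (a sketch of the definition above):

```python
import numpy as np

def volume(frame):
    # Root mean square of the samples in one frame.
    return np.sqrt(np.mean(frame.astype(float) ** 2))
```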

Zero Crossing Rate 36 Count the number of times that the audio waveform crosses the zero axis: ZCR = the number of zero crossings per second. ZCR is one of the most indicative and robust measures for discerning unvoiced speech. Typically, unvoiced speech has a low volume but a high ZCR. Using ZCR and volume together, one can prevent low-energy unvoiced speech frames from being classified as silent.
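
A simple ZCR sketch (sign-change counting; the treatment of exact zeros is a simplification):

```python
import numpy as np

def zcr(frame, fs):
    # Count sign changes between consecutive samples, scaled to per-second.
    signs = np.sign(frame)
    crossings = np.sum(np.abs(np.diff(signs)) > 0)
    return crossings * fs / len(frame)
```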

Pitch 37 Pitch is the fundamental frequency of an audio waveform. Normally only voiced speech and harmonic music have well-defined pitch. Temporal estimation methods rely on computation of the short-time autocorrelation function or the AMDF (average magnitude difference function).
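
A sketch of the AMDF; for a periodic frame it shows a deep valley at the lag equal to the pitch period (the lag range is an illustrative parameter):

```python
import numpy as np

def amdf(frame, max_lag):
    # AMDF(l) = mean |s(i) - s(i+l)|; valleys mark candidate pitch periods.
    return np.array([np.mean(np.abs(frame[l:] - frame[:len(frame) - l]))
                     for l in range(1, max_lag + 1)])
```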

Pitch 38 Valleys in the AMDF exist in voiced and music frames and vanish in noise and unvoiced frames.

Spectral Features 39 Spectrum: the Fourier transform of the samples in a frame. The difference among these three example clips is more noticeable in the frequency domain than in the waveform domain. Spectrogram.

Spectral Features 40 Let $P_n(f)$ denote the power spectrum (i.e., the magnitude square of the spectrum) of frame $n$. If we think of frequency $f$ as a random variable and treat $P_n(f)$, normalized by the total power, as its probability density function, we can define the mean and standard deviation of $f$. The mean is the frequency centroid, or brightness; the standard deviation is the bandwidth.
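
A sketch of the centroid and bandwidth as the mean and standard deviation of frequency under the normalized power spectrum:

```python
import numpy as np

def centroid_bandwidth(frame, fs):
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1 / fs)
    p = power / np.sum(power)          # normalized power as a pdf over frequency
    centroid = np.sum(freqs * p)       # frequency centroid (brightness)
    bandwidth = np.sqrt(np.sum((freqs - centroid) ** 2 * p))
    return centroid, bandwidth
```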

Subband Energy Ratio 41 The ratio of the energy in a frequency subband to the total energy. When the sampling rate is 22050 Hz, the frequency ranges for the four subbands are 0-630 Hz, 630-1720 Hz, 1720-4400 Hz, and 4400-11025 Hz. Z. Liu, Y. Wang, and T. Chen, Audio feature extraction and analysis for scene segmentation and classification, Journal of VLSI Signal Processing, vol. 20, 1998, pp. 61-79.
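
A sketch using the four band edges quoted above:

```python
import numpy as np

def subband_energy_ratio(frame, fs=22050, edges=(0, 630, 1720, 4400, 11025)):
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1 / fs)
    total = np.sum(power)
    return [np.sum(power[(freqs >= lo) & (freqs < hi)]) / total
            for lo, hi in zip(edges[:-1], edges[1:])]
```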

Spectral Flux 42 Spectral flux (SF) is defined as the average variation of the spectrum between two adjacent frames. The SF values of speech are higher than those of music; environmental sound is among the highest and changes more dramatically than the other two types of signal. L. Lu, H.-J. Zhang, and H. Jiang, Content analysis for audio classification and segmentation, IEEE Trans. on Speech and Audio Processing, vol. 10, no. 7, 2002, pp. 504-516.
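
One simple variant of spectral flux over a 2-D array of frames (Lu et al. use log magnitudes; plain magnitudes are used here for brevity):

```python
import numpy as np

def spectral_flux(frames):
    # frames: 2-D array, one frame per row.
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    return np.mean(np.abs(np.diff(spectra, axis=0)))
```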

Spectral Rolloff 43 The 95th percentile of the power spectral distribution. This is a measure of the skewness of the spectral shape: the value is higher for right-skewed distributions. Unvoiced speech has a high proportion of its energy contained in the high-frequency range of the spectrum, so this measure distinguishes voiced from unvoiced speech. E. Scheirer and M. Slaney, Construction and evaluation of a robust multifeature speech/music discriminator, Proc. of ICASSP, vol. 2, 1997, pp. 3741-3744.
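
A rolloff sketch: the frequency below which 95% of the spectral power lies.

```python
import numpy as np

def spectral_rolloff(frame, fs, fraction=0.95):
    power = np.abs(np.fft.rfft(frame)) ** 2
    cumulative = np.cumsum(power)
    idx = np.searchsorted(cumulative, fraction * cumulative[-1])
    return idx * fs / len(frame)        # convert bin index to Hz
```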

44 MFCC (Mel-Frequency Cepstral Coefficients) The most popular features in speech/audio/music processing. Segment the incoming waveform into frames. Compute the frequency response of each frame using the DFT. Group the magnitudes of the frequency response into 25-40 channels using triangular weighting functions. Compute the log of the weighted magnitudes for each channel. Take the inverse DCT/DFT of the weighted magnitudes, producing ~14 cepstral coefficients for each frame. Part of the slides are from Prof. Hsu, NTU http://www.csie.ntu.edu.tw/~winston/courses/mm.ana.idx/index.html
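
A condensed sketch of that pipeline for a single frame; the filterbank construction is simplified, and the channel count (26) and coefficient count (14) are common but illustrative choices.

```python
import numpy as np

def mfcc_frame(frame, fs, n_channels=26, n_coeffs=14):
    # 1-2. Window the frame and compute its magnitude spectrum.
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1 / fs)
    mel = 2595 * np.log10(1 + freqs / 700)            # Hz -> mel scale
    # 3. Group magnitudes into channels with triangular weights on the mel axis.
    centers = np.linspace(mel.min(), mel.max(), n_channels + 2)
    energies = np.empty(n_channels)
    for c in range(n_channels):
        lo, mid, hi = centers[c], centers[c + 1], centers[c + 2]
        up = np.clip((mel - lo) / (mid - lo), 0, 1)
        down = np.clip((hi - mel) / (hi - mid), 0, 1)
        energies[c] = np.sum(spectrum * np.minimum(up, down))
    # 4. Log of each channel's weighted magnitude.
    log_e = np.log(energies + 1e-10)
    # 5. Inverse transform (DCT-II) -> cepstral coefficients.
    k = np.arange(n_channels)
    return np.array([np.sum(log_e * np.cos(np.pi * i * (k + 0.5) / n_channels))
                     for i in range(n_coeffs)])
```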

The Mel Weighting Functions 45 Human pitch perception is most accurate between 100 Hz and 1,000 Hz: linear in this range and logarithmic above 1,000 Hz. A mel is a unit of pitch defined so that pairs of sounds which are perceptually equidistant in pitch are separated by an equal number of mels.

Clip-Level Features 46 To extract the semantic content, we need to observe the temporal variation of frame features on a longer time scale. Volume-based features: VSTD (volume standard deviation); VDR (volume dynamic range); percentage of low-energy frames: the proportion of frames with RMS volume less than 50% of the mean volume within one clip; NSR (nonsilence ratio): the ratio of the number of nonsilent frames to the total number of frames in a clip.
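
A sketch of the volume-based clip features over a vector of per-frame RMS volumes; the VDR normalization by the maximum is one common convention.

```python
import numpy as np

def volume_clip_features(volumes):
    vstd = np.std(volumes)                                   # VSTD
    vdr = (volumes.max() - volumes.min()) / volumes.max()    # VDR
    low_energy = np.mean(volumes < 0.5 * volumes.mean())     # % low-energy frames
    return vstd, vdr, low_energy
```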

Clip-Level Features 47 ZCR-based features: in a speech signal, low and high ZCR periods are interlaced. ZSTD (standard deviation of ZCR); standard deviation of the first-order difference; third central moment about the mean; total number of zero crossings exceeding a threshold; difference between the number of zero crossings above and below the mean value. J. Saunders, Real-time discrimination of broadcast speech/music, Proc. of ICASSP, vol. 2, 1996, pp. 993-996.

Clip-Level Features 48 Pitch-based features: PSTD (standard deviation of pitch); SPR (smooth pitch ratio): the percentage of frames in a clip that have pitch similar to that of their previous frames, measuring the percentage of voiced or music frames within a clip; NPR (nonpitch ratio): the percentage of frames without pitch, measuring how many frames are unvoiced speech or noise within a clip.

49 Audio Segmentation and Classification Wei-Ta Chu 2009/12/3 T. Zhang and C.-C. J. Kuo, Audio content analysis for online audiovisual data segmentation and classification, IEEE Trans. on Speech and Audio Processing, vol. 9, no. 4, 2001, pp. 441-457.

Overview 50

Characteristics 51 1. Takes into account hybrid types of sound that contain more than one kind of audio component. 2. Puts more emphasis on the distinction of environmental sounds, which is essential in many applications such as the post-processing of films. 3. Exploits integrated features for audio classification. 4. Low complexity. 5. The proposed method is generic and model-free, and can be used as a tool for segmentation and indexing of radio and TV programs.

Features ZCR 52

Features Fundamental Frequency 53 Distinguishes harmonic from nonharmonic sounds. Sounds from most musical instruments are harmonic; a speech signal is a mixture of harmonic and nonharmonic sound; most environmental sounds are nonharmonic.

Features Fundamental Frequency 54 All maxima in the spectrum are detected as potential harmonic peaks. Check whether the locations of the peaks have sharpness, amplitude, and width values satisfying certain criteria. If all conditions are met, the SFuF (short-time fundamental frequency) value is estimated as the frequency corresponding to the greatest common divisor of the locations of the harmonic peaks; otherwise, SFuF is set to zero.

Features Fundamental Frequency 55

Audio Segmentation 56 Two adjoining sliding windows are installed, with the average feature value computed within each window.

Audio Segmentation 57

Audio Classification 58 1. Detecting silence: if the short-time energy is continuously lower than a threshold, or if most short-time average ZCR values are lower than a threshold. 2. Separating sounds with/without music components: by detecting continuous and stable frequency peaks from the power spectrum, based on a threshold for the zero ratio at about 0.7. Above 0.7: pure speech or nonharmonic environmental sound; below 0.7: otherwise.

Audio Classification 59 3. Detecting harmonic environmental sounds: if the fundamental frequency of a sound clip changes over time but takes only a few distinct values. 4. Distinguishing pure music: based on statistical analysis of the ZCR and SFuF curves, including the degree of being harmonic, the degree of the fundamental frequency's concentration on certain values during a period of time, the variance of the average zero-crossing rate, and the range of amplitude of the average zero-crossing rate.

Audio Classification 60 5. Distinguishing songs. 6. Separating speech/environmental sound with music background. 7. Distinguishing pure speech. 8. Classifying nonharmonic environmental sounds.

Evaluation 61

References 62 Y. Wang, Z. Liu, and J.-C. Huang, Multimedia content analysis using both audio and visual clues, IEEE Signal Processing Magazine, Nov. 2000, pp. 12-36. (must read) Z. Liu, Y. Wang, and T. Chen, Audio feature extraction and analysis for scene segmentation and classification, Journal of VLSI Signal Processing, vol. 20, 1998, pp. 61-79. T. Zhang and C.-C. J. Kuo, Audio content analysis for online audiovisual data segmentation and classification, IEEE Trans. on Speech and Audio Processing, vol. 9, no. 4, 2001, pp. 441-457.

Next Week 63 Project proposal