1 Introduction of Audio and Music Wei-Ta Chu 2009/12/3
Outline 2 Introduction of Audio Signals Introduction of Music
3 Introduction of Audio Signals Li and Drew, Fundamentals of Multimedia, Prentice Hall, 2004
Sound 4 Sound is a wave phenomenon like light, but is macroscopic and involves molecules of air being compressed and expanded under the action of some physical device. Since sound is a pressure wave, it takes on continuous values, as opposed to digitized ones.
Digitization 5 Digitization means conversion to a stream of numbers, and preferably these numbers should be integers for efficiency. It involves two steps: sampling (in time) and quantization (in amplitude).
Digitization 6 The first kind of sampling, using measurements only at evenly spaced time intervals, is simply called sampling. The rate at which it is performed is called the sampling frequency. For audio, typical sampling rates range from 8 kHz (8,000 samples per second) to 48 kHz. This range is determined by the Nyquist theorem, discussed later. Typical uniform quantization rates are 8-bit and 16-bit: 8-bit quantization divides the vertical axis into 256 levels, and 16-bit divides it into 65,536 levels.
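As a minimal sketch of uniform quantization (the function name and clamping convention are illustrative, not from the slides), mapping a sample in [-V_max, V_max] to one of 2^N integer levels:

```python
import math

def quantize(sample, n_bits, v_max=1.0):
    """Uniformly quantize a sample in [-v_max, v_max] to an n_bits integer level."""
    levels = 2 ** n_bits               # e.g. 256 levels for 8-bit
    step = 2 * v_max / levels          # width of one quantization interval
    idx = math.floor(sample / step)
    # clamp to the representable range -2^(N-1) .. 2^(N-1)-1
    return max(-(levels // 2), min(levels // 2 - 1, idx))

# 8-bit quantization splits [-1, 1] into 256 levels of width 1/128
print(quantize(0.5, 8))    # -> 64
print(quantize(-1.0, 8))   # -> -128
print(quantize(1.0, 8))    # -> 127 (clamped to the top level)
```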
Sinusoids 7
Nyquist Theorem 8 The Nyquist theorem states how frequently we must sample in time to be able to recover the original sound. If the sampling rate exactly equals the actual frequency, this figure shows that a false signal is detected: it is simply a constant, with zero frequency.
Nyquist Theorem 9 If we sample at 1.5 times the actual frequency, this figure shows that we obtain an incorrect (alias) frequency that is lower than the correct one. For correct sampling we must use a sampling rate equal to at least twice the maximum frequency content in the signal. This rate is called the Nyquist rate.
Nyquist Theorem 10 Nyquist theorem: the sampling rate has to be at least twice the maximum frequency content in the signal. Nyquist frequency: half of the Nyquist rate. Since it is impossible to recover frequencies higher than the Nyquist frequency, most systems have an antialiasing filter that restricts the frequency content to a range at or below the Nyquist frequency.
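The aliasing behavior in the figures can be reproduced numerically. A small sketch (the folding formula is standard; the function name is illustrative) that computes the apparent frequency after sampling:

```python
def alias_frequency(f_signal, f_sample):
    """Apparent frequency after sampling: aliases fold into [0, fs/2]."""
    f = f_signal % f_sample        # aliases repeat every fs
    return min(f, f_sample - f)    # fold into the baseband [0, fs/2]

# Sampling a 1 kHz tone at exactly 1 kHz yields a constant (0 Hz)
print(alias_frequency(1000, 1000))   # -> 0
# Sampling at 1.5x the signal frequency gives a lower, incorrect frequency
print(alias_frequency(1000, 1500))   # -> 500
# At the Nyquist rate (2x) the frequency is preserved
print(alias_frequency(1000, 2000))   # -> 1000
```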
Signal-to-Noise Ratio (SNR) 11 The ratio of the power of the correct signal to that of the noise is called the signal-to-noise ratio (SNR), a measure of the quality of the signal. The SNR is usually measured in decibels (dB), where 1 dB is a tenth of a bel. The SNR value, in units of dB, is defined in terms of base-10 logarithms of squared voltages: SNR = 10 log10(V_signal^2 / V_noise^2) = 20 log10(V_signal / V_noise). If the signal voltage V_signal is 10 times the noise voltage, the SNR is 20 log10(10) = 20 dB.
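The SNR definition in terms of squared voltages can be checked directly:

```python
import math

def snr_db(v_signal, v_noise):
    """SNR in decibels, via base-10 logs of squared voltages."""
    return 10 * math.log10(v_signal ** 2 / v_noise ** 2)  # = 20*log10(Vs/Vn)

# A signal voltage 10 times the noise voltage gives 20 dB
print(snr_db(10.0, 1.0))            # -> 20.0
# Doubling the voltage ratio adds about 6 dB
print(round(snr_db(2.0, 1.0), 2))   # -> 6.02
```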
12 Signal-to-Quantization-Noise Ratio (SQNR) Aside from any noise that may have been present in the original analog signal, there is also an additional error that results from quantization. This introduces a roundoff error. It is not really noise; nevertheless it is called quantization noise (or quantization error). The quality of the quantization is characterized by the Signal-to-Quantization-Noise Ratio (SQNR).
13 Signal-to-Quantization-Noise Ratio (SQNR) For a quantization accuracy of N bits per sample, the range of the digital signal is -2^(N-1) to 2^(N-1) - 1. If the actual analog signal is in the range from -V_max to +V_max, each quantization level represents a voltage of 2V_max / 2^N, or V_max / 2^(N-1). Peak SQNR = 20 log10(2^(N-1) / (1/2)) = 20 N log10 2 ≈ 6.02N dB.
14 Signal-to-Quantization-Noise Ratio (SQNR)
15 Signal-to-Quantization-Noise Ratio (SQNR) 6.02N is the worst case. If the input signal is sinusoidal, the quantization error is statistically independent, and its magnitude is uniformly distributed between 0 and half of the interval, then it can be shown that the expression for the SQNR becomes SQNR = 6.02N + 1.76 (dB). Typical digital audio sample precision is either 8 bits per sample, equivalent to about telephone quality, or 16 bits, for CD quality.
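The sinusoidal-input SQNR expression above is easy to evaluate for the two typical sample precisions:

```python
def sqnr_db(n_bits):
    """SQNR for a full-scale sinusoid with uniformly distributed quantization error."""
    return 6.02 * n_bits + 1.76

print(round(sqnr_db(8), 2))    # -> 49.92 (about telephone quality)
print(round(sqnr_db(16), 2))   # -> 98.08 (CD quality)
```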
Linear and Nonlinear Quantization 16 Non-uniform quantization: set up more finely spaced levels where humans hear with the most acuity. We are quantizing magnitude, or amplitude (how loud the signal is). Weber's Law, stated formally, says that equally perceived differences have values proportional to absolute levels: ΔResponse ∝ ΔStimulus / Stimulus. If we can feel an increase in weight from 10 to 11 pounds, then if instead we start at 20 pounds, it would take 22 pounds for us to feel an increase in weight.
Linear and Nonlinear Quantization 17
Linear and Nonlinear Quantization 18 For steps near the low end of the signal, quantization steps are effectively more concentrated on the s axis, whereas for large values of s, one quantization step in r encompasses a wide range of s values.
Linear and Nonlinear Quantization 19 μ-law encoder: r = sgn(s) * ln(1 + μ|s|/s_p) / ln(1 + μ), where s_p is the peak signal value, s is the current signal value, and μ is typically set to 100 or 255.
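A minimal sketch of the μ-law compression curve, assuming the standard formula r = sgn(s) * ln(1 + μ|s|/s_p) / ln(1 + μ) with μ = 255 (the value used in North American and Japanese telephony):

```python
import math

MU = 255.0  # assumed companding parameter

def mu_law_encode(s, s_p=1.0, mu=MU):
    """mu-law compression: finer effective quantization steps near small amplitudes."""
    sign = 1.0 if s >= 0 else -1.0
    return sign * math.log(1 + mu * abs(s) / s_p) / math.log(1 + mu)

# Small signals are boosted toward larger code values...
print(round(mu_law_encode(0.01), 3))
# ...while the peak maps exactly to 1
print(mu_law_encode(1.0))   # -> 1.0
```

This concentrates quantization levels near s = 0, matching the perceptual argument from Weber's Law.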
Linear and Nonlinear Quantization 20
Audio Filtering 21 Prior to sampling and A/D conversion, the audio signal is also usually filtered to remove unwanted frequencies. For speech, frequencies from 50 Hz to 10 kHz are typically retained; other frequencies are blocked by a band-pass filter that screens out lower and higher frequencies. An audio music signal will typically contain from about 20 Hz up to 20 kHz.
Audio Quality vs. Data Rate 22 The uncompressed data rate increases as more bits are used for quantization. Stereo: double the bandwidth
Processing in Frequency Domain 23 It is hard to infer much from the time-domain waveform. Human hearing is based on frequency analysis, and the use of frequency analysis often facilitates understanding. Part of the slides are from Prof. Hsu, NTU http://www.csie.ntu.edu.tw/~winston/courses/mm.ana.idx/index.html
Problems in Using Fourier Transform 24 The Fourier transform contains only frequency information; no time information is retained. It works fine for stationary signals, but non-stationary or changing signals cause problems: the Fourier transform shows frequencies occurring at all times instead of at specific times.
Short-Time Fourier Transform (STFT) 25 How can we still use FT, but handle nonstationary signals? How can we include time? Idea: Break up the signal into discrete windows Each signal within a window is a stationary signal Take FT over each part
STFT Example 26 Window function
Short-Time Fourier Analysis 27 Problem: conventional Fourier analysis does not capture the time-varying nature of audio signals. Solution: multiply the signal by a finite-duration window function, then compute the DTFT: X_n(e^jw) = sum over m of x(m) w(n - m) e^(-jwm).
Window Functions 28 Rectangular window: w(n) = 1 for 0 <= n <= N-1, 0 otherwise. Hamming window: w(n) = 0.54 - 0.46 cos(2*pi*n / (N-1)) for 0 <= n <= N-1, 0 otherwise.
Window Functions 29
30 Break Up Original Audio Signals into Frames Using Hamming Window
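The framing step above can be sketched in a few lines (frame length and hop size below are illustrative choices, not values from the slides):

```python
import math

def hamming(n_len):
    """Hamming window: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_len - 1))
            for n in range(n_len)]

def frames(signal, frame_len, hop):
    """Split a signal into overlapping Hamming-windowed frames."""
    win = hamming(frame_len)
    return [[signal[i + n] * win[n] for n in range(frame_len)]
            for i in range(0, len(signal) - frame_len + 1, hop)]

sig = [1.0] * 100
f = frames(sig, 20, 10)        # 20-sample frames, 50% overlap
print(len(f))                  # -> 9 frames
print(round(f[0][0], 2))       # -> 0.08 (window tapers the frame edges)
```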
31 Introduction of Audio Features Wei-Ta Chu 2009/12/3
Introduction of Audio Features 32 Short-term frame level vs. long-term clip level. A frame is defined as a group of neighboring samples lasting about 10 to 40 ms. For audio clips with a sampling frequency of 16 kHz, how many samples are in a 20 ms audio frame? Within an audio frame we can assume that the audio signal is stationary. A clip consists of a sequence of frames, and clip-level features usually characterize how frame-level features change over a clip. Y. Wang, Z. Liu, and J.-C. Huang, Multimedia content analysis using both audio and visual clues, IEEE Signal Processing Magazine, Nov. 2000, pp. 12-36.
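The question on the slide is just a unit conversion:

```python
def samples_per_frame(fs_hz, frame_ms):
    """Number of samples in one frame of frame_ms milliseconds at fs_hz."""
    return int(fs_hz * frame_ms / 1000)

# A 20 ms frame at a 16 kHz sampling rate
print(samples_per_frame(16000, 20))   # -> 320
```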
Frames and Clips 33 Clips may be of fixed length (1 to 2 seconds) or of variable length. Both frames and clips may overlap with their predecessors.
Frame-Level Features 34 Most of the frame-level features are inherited from speech signal processing. Time-domain features; frequency-domain features. We use N to denote the frame length, and s_n(i) to denote the i-th sample in the n-th audio frame.
Volume (Loudness, Energy) 35 Volume is a reliable indicator for silence detection, which may help to segment an audio sequence and to determine clip boundaries. It is approximated by the root mean square of the signal magnitude within each frame Volume of an audio signal depends on the gain value of the recording and digitizing devices. We may normalize the volume for a frame by the maximum volume of some previous frames.
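The RMS approximation of frame volume described above, as a short sketch:

```python
import math

def volume(frame):
    """Frame volume approximated by the root mean square of the samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

print(volume([0.0, 0.0, 0.0, 0.0]))             # -> 0.0 (silence)
print(volume([0.5, -0.5, 0.5, -0.5]))           # -> 0.5
print(round(volume([1.0, 0.0, -1.0, 0.0]), 4))  # -> 0.7071
```

A silence detector would then compare this value against a threshold, possibly normalized by the maximum volume of recent frames as the slide suggests.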
Zero Crossing Rate 36 Count the number of times that the audio waveform crosses the zero axis. ZCR = the number of zero crossings per second ZCR is one of the most indicative and robust measures to discern unvoiced speech. Typically, unvoiced speech has a low volume but a high ZCR. Using ZCR and volume together, one can prevent low energy unvoiced speech frames from being classified as silent.
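Counting zero crossings is a one-liner over consecutive sample pairs; scaling by the sampling rate gives crossings per second as defined above:

```python
def zcr(frame, fs_hz):
    """Zero crossings per second: sign changes between consecutive samples."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings * fs_hz / len(frame)

# An alternating-sign frame crosses zero between every pair of samples
frame = [1.0, -1.0] * 4              # 8 samples, 7 sign changes
print(zcr(frame, 8000))              # -> 7000.0
```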
Pitch 37 Pitch is the fundamental frequency of an audio waveform. Normally only voiced speech and harmonic (overtone-rich) music have well-defined pitch. Temporal estimation methods rely on computing the short-time autocorrelation function or the AMDF (average magnitude difference function).
Pitch 38 Valleys exist in voiced and music frames and vanish in noise and unvoiced frames.
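A minimal sketch of AMDF-based pitch estimation: the AMDF of a periodic frame has a valley at the pitch period, so picking the deepest valley within a plausible lag range recovers the fundamental. The lag range below is an illustrative choice, not from the slides.

```python
import math

def amdf(frame, lag):
    """Average magnitude difference function at a given lag."""
    n = len(frame) - lag
    return sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n

def pitch_period(frame, min_lag, max_lag):
    """Pick the lag with the deepest AMDF valley as the pitch period."""
    return min(range(min_lag, max_lag + 1), key=lambda k: amdf(frame, k))

fs = 8000
f0 = 200                                   # 200 Hz tone -> 40-sample period
frame = [math.sin(2 * math.pi * f0 * i / fs) for i in range(400)]
period = pitch_period(frame, 20, 60)
print(period, fs / period)                 # -> 40 200.0
```

For noise or unvoiced frames the AMDF has no pronounced valley, which is exactly the behavior the slide describes.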
Spectral Features 39 Spectrum: the Fourier transform of the samples in a frame. The difference among these three clips is more noticeable in the frequency domain than in the waveform domain. Spectrogram.
Spectral Features 40 Let S_n(k) denote the power spectrum (i.e., magnitude square of the spectrum) of frame n. If we think of frequency as a random variable and treat S_n(k), normalized by the total power, as its probability density function, we can define the mean and standard deviation of the frequency. Frequency centroid (brightness): the power-weighted mean frequency. Bandwidth: the power-weighted standard deviation around the centroid.
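Treating the normalized power spectrum as a probability density, centroid and bandwidth are just its mean and standard deviation. A sketch operating on a precomputed power spectrum (the bin frequencies below are illustrative):

```python
import math

def centroid_and_bandwidth(power, freqs):
    """Power-weighted mean frequency (centroid) and std. deviation (bandwidth)."""
    total = sum(power)
    mean = sum(f * p for f, p in zip(freqs, power)) / total
    var = sum((f - mean) ** 2 * p for f, p in zip(freqs, power)) / total
    return mean, math.sqrt(var)

# All power at 1 kHz: centroid 1000 Hz, zero bandwidth
c, b = centroid_and_bandwidth([0.0, 1.0, 0.0], [500, 1000, 1500])
print(c, b)    # -> 1000.0 0.0
# Power split between 500 and 1500 Hz: same centroid, 500 Hz bandwidth
c, b = centroid_and_bandwidth([1.0, 0.0, 1.0], [500, 1000, 1500])
print(c, b)    # -> 1000.0 500.0
```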
Subband Energy Ratio 41 The ratio of the energy in a frequency subband to the total energy. When the sampling rate is 22050 Hz, the frequency ranges for the four subbands are 0-630 Hz, 630-1720 Hz, 1720-4400 Hz, and 4400-11025 Hz. Z. Liu, Y. Wang, and T. Chen, Audio feature extraction and analysis for scene segmentation and classification, Journal of VLSI Signal Processing, vol. 20, 1998, pp. 61-79.
Spectral Flux 42 Spectral flux (SF) is defined as the average variation of the spectrum between two adjacent frames. The SF values of speech are higher than those of music. Environmental sound has the highest SF values and changes more dramatically than the other two types of signal. L. Lu, H.-J. Zhang, H. Jiang, Content analysis for audio classification and segmentation, IEEE Trans. on Speech and Audio Processing, vol. 10, no. 7, 2002, pp. 504-516.
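A sketch of spectral flux as the average absolute change per spectrum bin between two adjacent frames (one common variant of the definition above):

```python
def spectral_flux(spec_prev, spec_cur):
    """Average absolute change of spectrum magnitude between adjacent frames."""
    return sum(abs(a - b) for a, b in zip(spec_prev, spec_cur)) / len(spec_cur)

# Identical spectra -> zero flux; a change in every bin -> large flux
print(spectral_flux([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))            # -> 0.0
print(round(spectral_flux([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]), 2))  # -> 1.33
```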
Spectral Rolloff 43 The 95th percentile of the power spectral distribution. This measure distinguishes voiced from unvoiced speech: the value is higher for right-skewed distributions, and unvoiced speech has a high proportion of energy contained in the high-frequency range of the spectrum. It is a measure of the skewness of the spectral shape. E. Scheirer and M. Slaney, Construction and evaluation of a robust multifeature speech/music discriminator, Proc. of ICASSP, vol. 2, 1997, pp. 3741-3744.
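Rolloff as the frequency below which 95% of the spectral power lies, sketched over a precomputed power spectrum (bin frequencies are illustrative):

```python
def spectral_rolloff(power, freqs, fraction=0.95):
    """Frequency below which `fraction` of the total spectral power lies."""
    target = fraction * sum(power)
    running = 0.0
    for f, p in zip(freqs, power):
        running += p
        if running >= target:
            return f
    return freqs[-1]

# Power concentrated in low bins -> low rolloff frequency
print(spectral_rolloff([8.0, 1.0, 0.5, 0.5], [100, 200, 300, 400]))  # -> 300
```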
44 MFCC (Mel-Frequency Cepstral Coefficients) The most popular features in speech/audio/music processing. Segment the incoming waveform into frames. Compute the frequency response of each frame using the DFT. Group magnitudes of the frequency response into 25-40 channels using triangular weighting functions. Compute the log of the weighted magnitudes for each channel. Take the inverse DCT/DFT of the log channel magnitudes, producing ~14 cepstral coefficients for each frame. Part of the slides are from Prof. Hsu, NTU http://www.csie.ntu.edu.tw/~winston/courses/mm.ana.idx/index.html
The Mel Weighting Functions 45 Human pitch perception is most accurate between 100 Hz and 1000 Hz: linear in this range, logarithmic above 1000 Hz. A mel is a unit of pitch defined so that pairs of sounds which are perceptually equidistant in pitch are separated by an equal number of mels.
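A commonly used closed-form Hz-to-mel conversion (an assumption; the slide defines the mel only perceptually) that matches the linear-below-1-kHz, logarithmic-above behavior:

```python
import math

def hz_to_mel(f_hz):
    """Common mel-scale formula: roughly linear below 1 kHz, logarithmic above."""
    return 2595 * math.log10(1 + f_hz / 700)

print(round(hz_to_mel(1000)))   # -> 1000 (the scale is anchored at 1 kHz)
print(round(hz_to_mel(8000)))   # -> 2840 (compressed: 8x the Hz, ~2.8x the mels)
```

The triangular MFCC channels mentioned on the previous slide are spaced uniformly on this mel axis, not on the Hz axis.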
Clip-Level Features 46 To extract the semantic content, we need to observe the temporal variation of frame features on a longer time scale. Volume-based features: VSTD (volume standard deviation); VDR (volume dynamic range); percentage of low-energy frames: proportion of frames with RMS volume less than 50% of the mean volume within one clip; NSR (nonsilence ratio): the ratio of the number of nonsilent frames to the total number of frames in a clip.
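A sketch of the volume-based clip features listed above, computed from one clip's sequence of frame volumes (the VDR normalization by the maximum is one common convention, assumed here):

```python
def clip_volume_features(frame_volumes):
    """VSTD, VDR, and low-energy frame ratio over one clip of frame volumes."""
    n = len(frame_volumes)
    mean = sum(frame_volumes) / n
    vstd = (sum((v - mean) ** 2 for v in frame_volumes) / n) ** 0.5
    vmax = max(frame_volumes)
    vdr = (vmax - min(frame_volumes)) / vmax          # dynamic range
    low_energy = sum(1 for v in frame_volumes if v < 0.5 * mean) / n
    return vstd, vdr, low_energy

vols = [1.0, 1.0, 1.0, 0.1]          # one quiet frame out of four
vstd, vdr, low = clip_volume_features(vols)
print(vdr, low)                      # -> 0.9 0.25
```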
Clip-Level Features 47 ZCR-based features: with a speech signal, low- and high-ZCR periods are interleaved. ZSTD (standard deviation of ZCR); standard deviation of the first-order difference; third central moment about the mean; total number of zero crossings exceeding a threshold; difference between the number of zero crossings above and below the mean value. J. Saunders, Real-time discrimination of broadcast speech/music, Proc. of ICASSP, vol. 2, 1996, pp. 993-996.
Clip-Level Features 48 Pitch-based features: PSTD (standard deviation of pitch) SPR (smooth pitch ratio): the percentage of frames in a clip that have similar pitch as the previous frames Measure the percentage of voiced or music frames within a clip NPR (nonpitch ratio): percentage of frames without pitch. Measure how many frames are unvoiced speech or noise within a clip
49 Audio Segmentation and Classification T. Zhang and C.-C. J. Kuo, Audio content analysis for online audiovisual data segmentation and classification, IEEE Trans. on Speech and Audio Processing, vol. 9, no. 4, 2001, pp. 441-457.
Overview 50
Characteristics 51 1. Takes into account hybrid types of sound that contain more than one kind of audio component. 2. Puts more emphasis on distinguishing environmental sounds, essential in many applications such as the post-processing of films. 3. Exploits integrated features for audio classification. 4. Low complexity. 5. The proposed method is generic and model-free, usable as a tool for segmentation and indexing of radio and TV programs.
Features ZCR 52
Features Fundamental Frequency 53 Distinguishes harmonic from nonharmonic sounds. Sounds from most musical instruments are harmonic. A speech signal is a mixture of harmonic and nonharmonic sound. Most environmental sounds are nonharmonic.
Features Fundamental Frequency 54 All maxima in the spectrum are detected as potential harmonic peaks. Check whether the peak locations have sharpness, amplitude, and width values satisfying certain criteria. If all conditions are met, the SFuF (short-time fundamental frequency) value is estimated as the frequency corresponding to the greatest common divisor of the harmonic peak locations; otherwise, SFuF is set to zero.
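The greatest-common-divisor step can be sketched as follows (the frequency-bin quantization and resolution are illustrative assumptions; the paper's peak-validity checks are omitted):

```python
from math import gcd
from functools import reduce

def fundamental_from_peaks(peak_freqs_hz, resolution_hz=10):
    """Estimate SFuF as the GCD of harmonic peak locations, quantized to bins."""
    bins = [round(f / resolution_hz) for f in peak_freqs_hz]
    return reduce(gcd, bins) * resolution_hz

# Harmonic peaks at 200, 400, 600 Hz -> fundamental 200 Hz
print(fundamental_from_peaks([200, 400, 600]))    # -> 200
# Peaks with no common divisor collapse to the resolution floor
print(fundamental_from_peaks([210, 430, 730]))
```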
Features Fundamental Frequency 55
Audio Segmentation 56 Two adjoining sliding windows move along the signal, and the average feature value is computed within each window.
Audio Segmentation 57
Audio Classification 58 1. Detecting silence: if the short-time energy is continuously lower than a threshold, or if most short-time average ZCR values are lower than a threshold. 2. Separating sounds with/without music components, by detecting continuous and stable frequency peaks in the power spectrum, based on a threshold for the zero ratio at about 0.7: > 0.7 indicates pure speech or nonharmonic environmental sound; < 0.7 otherwise.
Audio Classification 59 3. Detecting harmonic environmental sounds: the fundamental frequency of such a sound clip changes over time but takes only a few distinct values. 4. Distinguishing pure music, based on statistical analysis of the ZCR and SFuF curves: the degree of being harmonic, the degree of the fundamental frequency's concentration on certain values during a period of time, the variance of the average zero-crossing rate, and the range of amplitude of the average zero-crossing rate.
Audio Classification 60 5. Distinguishing songs. 6. Separating speech/environmental sound with music background. 7. Distinguishing pure speech. 8. Classifying nonharmonic environmental sounds.
Evaluation 61
References 62 Y. Wang, Z. Liu, and J.-C. Huang, Multimedia content analysis using both audio and visual clues, IEEE Signal Processing Magazine, Nov. 2000, pp. 12-36. (must read) Z. Liu, Y. Wang, and T. Chen, Audio feature extraction and analysis for scene segmentation and classification, Journal of VLSI Signal Processing, vol. 20, 1998, pp. 61-79. T. Zhang and C.-C. J. Kuo, Audio content analysis for online audiovisual data segmentation and classification, IEEE Trans. on Speech and Audio Processing, vol. 9, no. 4, 2001, pp. 441-457.
Next Week 63 Project proposal