Chapter 1: Introduction to audio signal processing

KH Wong, Rm 907, SHB, CSE Dept., CUHK
Email: khwong@cse.cuhk.edu.hk
http://www.cse.cuhk.edu.hk/~khwong/cmsc5707
Audio signal processing Ch1, v.3c

Reference books
- Theory and Applications of Digital Speech Processing, Lawrence Rabiner and Ronald Schafer, Pearson, 2011.
- DAFX: Digital Audio Effects, Udo Zölzer, 2nd edition, John Wiley & Sons, Ltd., 2011. The first edition can be found at http://books.google.com.hk
- The Audio Programming Book, Richard Boulanger and Victor Lazzarini, The MIT Press, 2010. Can be found at the CUHK e-library.
- Digital Audio Signal Processing, Udo Zölzer, Wiley, 2008.
- Real Sound Synthesis for Interactive Applications, Perry Cook, AK Peters.

Overview (lecture 1)
- Chapter 1.A: Introduction
- Chapter 1.B: Signals in time & frequency domain
- Chapter 2.A: Audio feature extraction techniques
- Chapter 2.B: Recognition procedures

Chapter 1
- Chapter 1.A: Introduction
- Chapter 1.B: Signals in time & frequency domain

Chapter 1: Introduction
Content
- Components of a speech recognition system
- Types of speech recognition systems
- Speech recognition hardware
- A speech production model
- Phonetics: English and Cantonese

Components of a speech recognition system
- Pre-processor
- Feature extraction
- Training of the system
- Recognition

Types of speech recognition technology
- Isolated speech recognition: the speaker has to speak word by word into the system.
- Connected speech recognition: the speaker can speak a number of words without stopping.
- Continuous speech recognition: the speaker talks naturally, as when speaking to another person.
Current product: Voice Actions for Android, http://googlemobile.blogspot.com/2010/08/just-speak-it-introducing-voice-actions.html

Types depending on speakers
- Speaker-dependent recognition: designed for one speaker who has trained the system.
- Speaker-independent recognition: designed for all users without prior training.

Class exercise 1.1
Discuss the features of the speech recognition module in the following systems:
- Mobile phone speech command dialing system
- Android speech input system

Conversion time and sampling time
- The human hearing range is 20 Hz to 20 kHz.
- The sampling rate must be at least double the highest frequency (sampling theory), so sampling for Hi-Fi music is > 40 kHz.
- 74 minutes of CD music, 44.1 kHz sampling, 16-bit sound = 44.1 kHz * 2 bytes * 2 channels * 60 seconds * 74 min. = 783,216,000 bytes (about 747 MB). (see http://en.wikipedia.org/wiki/cd-rom)
- Compromise: telephone-quality sound uses 8 kHz, 8-bit sampling.
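The storage arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the helper name pcm_bytes is hypothetical (it is not from the course material), and it assumes uncompressed PCM with a whole number of bytes per sample.

```python
def pcm_bytes(sample_rate_hz, bits_per_sample, channels, seconds):
    """Bytes needed to store uncompressed PCM audio (hypothetical helper)."""
    bytes_per_sample = bits_per_sample // 8   # assumes bits is a multiple of 8
    return sample_rate_hz * bytes_per_sample * channels * seconds

# 74 minutes of stereo CD audio: 44.1 kHz, 16-bit, 2 channels
cd_bytes = pcm_bytes(44100, 16, 2, 74 * 60)
print(cd_bytes)                      # 783216000

# Telephone quality: 8 kHz, 8-bit, mono, for one second
print(pcm_bytes(8000, 8, 1, 1))      # 8000
```

The same function covers the byte-count exercises in this chapter by changing the sampling rate, resolution and duration.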

Sampling
[Figure: a signal sampled at 1 kHz and quantized to 16 bits, range 0 to 2^16 - 1 = 65535; x-axis: time in ms. Source: www.webkinesia.com/games/images/quant.gif]

Sampling and reconstruction
[Figure: sampling and reconstruction of a signal; y-axis: 0 to 2^16 - 1 = 65535; x-axis: time. Source: https://edocs.uis.edu/jduva1/www/courses/455/sampling.jpg]

Hardware for speech recognition setup
- Speech is captured by a microphone and sampled periodically (e.g. at 16 kHz) by an analogue-to-digital converter (ADC).
- Each converted sample is 16-bit data.
- Tutorial: for a signal sampled at 16 kHz with 16 bits per sample, how many bytes are used in 1 second? (Answer: 16,000 * 2 bytes = 32 Kbytes)
[Figure source: http://www.ras.ucalgary.ca/grad_project_2005/asph_sampling.jpg]

A speech wave
[Figure: speech waveform; x-axis: time samples]

Music wave: violin3.wav (repeated 6 times for demo purposes)
(http://www.youtube.com/watch?v=xdmx5d99xgu&feature=youtu.be)
- Sampling frequency FS = 44100 Hz (42070 samples)
- How long is the play time? Answer: (1/44100) * 42070 = 0.954 seconds
[Figure: three views of the waveform: all 42070 samples, zoomed in to 1000 samples, zoomed in to 300 samples]
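The play-time answer is just the number of samples divided by the sampling frequency. A minimal sketch, where the function name play_time_seconds is a hypothetical choice for illustration:

```python
def play_time_seconds(num_samples, sample_rate_hz):
    """Duration of a sampled signal: each sample lasts 1/sample_rate seconds."""
    return num_samples / sample_rate_hz

duration = play_time_seconds(42070, 44100)
print(round(duration, 3))   # 0.954
```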

Class exercise 1.2
For a signal sampled at 20 kHz with 16 bits per sample, how many bytes are used in 5 seconds?
Answer: ?

Speech recognition hardware
[Figure: a speech recording system with an ADC (analog-to-digital converter) and a DAC (digital-to-analog converter)]

Discussion: Conversion resolution
- Music: 44.1 kHz, 16-bit is very good. Higher specifications may be used, e.g. 96 kHz sampling, 24-bit.
- Compression: MP3 and similar formats can compress the data.
- Speech: 20 kHz sampling, 16-bit is good enough.

Class exercise 1.3
A sound is sampled at 22 kHz and the resolution is 16 bits. How many bytes are needed to store the sound wave for 10 seconds?
Answer: ?

Signal analysis spectrum

Can we see speech?
Yes, using a spectrogram.
- The time domain signal shows the amplitude of air pressure (the output of the microphone) against time.
- The spectrogram shows the energy of the frequency content against time (MATLAB function Specgram.m).
[Figure: time domain signal (pressure / output of mic vs. time) and its spectrogram (freq. vs. time)]

Basic Phonetics
Phonemes are symbols that show how a word is pronounced.
Phonemes
- Vowels: /AA/, /I/, /UH/
- Diphthongs: /AY/, /AW/
- Consonants
  - Nasals: /M/
  - Stops: /B/, /P/
  - Fricatives: /V/, /S/
  - Whisper: /H/
  - Affricates: /JH/, /CH/

Phonetic table
[Figure: http://www.telefonica.net/web2/eseducativa/phonetics/tablea.gif]

Special features of Cantonese (廣東話) phonetics
- Each word is formed by combining an initial (consonant) and a final (vowel).
- Nine tones: lower-flat, lower-rising, lower-go, higher-flat, higher-rising, higher-go, plus the entering tones, which end in /p/, /t/ or /k/.

Chapter 1.B: Signals in time and frequency domain
- Time framing
- Frequency model
- Fourier transform
- Spectrogram

Revision: Raw data and PCM
- Human hearing range: 20 Hz to 20 kHz.
- CD Hi-Fi quality music: at least 40 kHz sampling (44.1 kHz on CD), 16-bit.
- People can still understand speech sampled at low rates (5 kHz or less); telephone-quality speech uses 8 kHz sampling with 8-bit data.
- Speech recognition systems normally use 10~16 kHz, 12~16 bit.

Humans perceive data in blocks
We see 24 still pictures in one second, and from them the brain builds up the perception of motion.
Source: http://antoniopo.files.wordpress.com/2011/03/eadweard_muybridge_horse.jpg?w=733&h=538

Time framing
- Since our ears cannot respond to very fast changes in speech content, we normally cut the speech data into frames before analysis (similar to watching fast-changing still pictures to perceive motion).
- Frame size is 10~30 ms.
- Frames can overlap; normally the overlapping region ranges from 0 to 75% of the frame size.

Frame blocking and windowing
- Choose the frame size (N samples) and the separation between the starts of adjacent frames (m samples).
- E.g. for a signal sampled at 16 kHz, a 10 ms window has N = 160 samples; the non-overlapping part is m = 40 samples.
[Figure: signal s_n against time n, showing the first window (l = 1) and the second window (l = 2), each of length N, starting m samples apart]

Tutorial for frame blocking
- A signal is sampled at 12 kHz, the frame size is chosen to be 20 ms and adjacent frames are separated by 5 ms. Calculate N and m and draw the frame blocking diagram. (Ans: N = 240, m = 60.)
- Repeat the above when adjacent frames do not overlap. (Ans: N = 240, m = 240.)
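The tutorial numbers can be reproduced with a short Python sketch. The function names frame_params and split_into_frames are hypothetical, and dropping an incomplete final frame is an assumption made here; the slides do not say how the tail of the signal is handled.

```python
def frame_params(sample_rate_hz, frame_ms, shift_ms):
    """Return (N, m): frame length and frame separation, both in samples."""
    n = round(sample_rate_hz * frame_ms / 1000)
    m = round(sample_rate_hz * shift_ms / 1000)
    return n, m

def split_into_frames(signal, n, m):
    """Cut signal into frames of n samples whose starts are m samples apart.
    Assumption: an incomplete frame at the end is dropped."""
    return [signal[start:start + n]
            for start in range(0, len(signal) - n + 1, m)]

n, m = frame_params(12000, 20, 5)
print(n, m)                               # 240 60

frames = split_into_frames(list(range(1000)), n, m)
print(len(frames), frames[1][0])          # 13 60
```

With shift_ms equal to frame_ms the frames no longer overlap, which gives the second tutorial answer (N = 240, m = 240).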

Class exercise 1.4
For a speech wave sampled at 22 kHz with 16 bits per sample, the frame size is 15 ms and the frame overlapping period is 40% of the frame size. Draw the frame block diagram.

The frequency model
- For a frame, we can calculate its frequency content by the Fourier Transform (FT).
- Computationally, you may use the Discrete FT (DFT), which is usually computed with a Fast FT (FFT) algorithm.
- FFT is popular because it is more efficient.
- FFT algorithms can be found in most numerical method textbooks/web pages, e.g. http://en.wikipedia.org/wiki/fast_fourier_transform

The Fourier Transform (FT) method (see the appendix for why m <= N/2)
Forward transform: X_m (complex number) = FT{s_k (real number)}
$$X_m = \sum_{k=0}^{N-1} s_k \, e^{-j 2\pi k m / N}, \quad m = 0, 1, 2, 3, \ldots, \frac{N}{2}, \qquad e^{j\theta} = \cos(\theta) + j\sin(\theta)$$
- Input (time domain): s_k = s_0, s_1, ..., s_{N-1} (N samples)
- Output (frequency domain) after FT: X_0, X_1, ..., X_{N/2}, which are (N/2+1) complex numbers.
- Since X_m is complex, $X_m = |X_m| e^{j\theta_m}$.

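Fourier Transform
$$X_m = \sum_{k=0}^{N-1} s_k \, e^{-j 2\pi k m / N}, \quad m = 0, 1, 2, 3, \ldots, \frac{N}{2}$$
Note: $e^{j\theta} = \cos(\theta) + j\sin(\theta)$, where $\theta = 2\pi k m / N$ and $j = \sqrt{-1}$.
$X_m = \text{real} + j(\text{imaginary})$, so $|X_m| = \sqrt{\text{real}^2 + \text{imaginary}^2}$.
[Figure: the Fourier transform maps the time samples s_0, s_1, s_2, s_3, ..., s_{N-1} (signal voltage / pressure level vs. time) to |X_m| vs. freq. (m), showing a single-frequency peak and a spectral envelope]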


Examples of FT (pure wave vs. speech wave)
- A pure cosine has one frequency band: its FT shows a single frequency.
- A complex speech wave has many different frequency bands: its FT shows a spectral envelope.
[Figure: for each case, s_k vs. time (k) on the left and |X_m| vs. freq. (m) on the right]

Use of short-term Fourier Transform (Fourier Transform of a frame)
The power spectrum envelope is a plot of energy against frequency.
[Figure: the time domain signal of a frame (amplitude vs. time) is passed through DFT or FFT to give the frequency domain output (energy vs. freq.); the spectral envelope shows the first formant and the second formant; frequency axis marked at 1 kHz and 2 kHz]

Class exercise 1.5: Fourier Transform
Write pseudo code (or a C/MATLAB/Octave program segment, but not using a library function) to transform a signal in an array int s[256] into the frequency domain, giving float X[128+1] (real part of the result) and float IX[128+1] (imaginary part of the result).
How to generate a spectrogram?
$$X_m = \sum_{k=0}^{N-1} s_k \, e^{-j 2\pi k m / N}, \quad m = 0, 1, 2, 3, \ldots, \frac{N}{2}$$
$$e^{j\theta} = \cos(\theta) + j\sin(\theta)$$

The spectrogram: to see the spectral envelope as time goes by
It is a visualization method (tool) to look at the frequency content of a signal.
Parameter setting:
(1) Window size N (e.g. 512) = number of time samples for each Fourier transform.
(2) Window shift D (e.g. 128): each window starts D samples after the previous one, so adjacent windows overlap by N - D samples.
Procedure:
- X-axis = time: take the FT of the N samples S_t to S_{t+N-1} (e.g. S_t to S_{t+511}).
- Y-axis = freq.: plot the frequency energy envelope vertically using different gray levels.
- Repeat the above for the samples starting at S_{D+t}, and so on, until the window would run past the end of the input signal.

A specgram
Specgram: the white bands are the formants, which represent the high-energy frequency content of the speech signal.

[Figure: two spectrograms (y-axis: freq.), one labelled "Better frequency resolution" and the other "Better time resolution"]
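The trade-off named in this figure follows from the window size N: a DFT of N samples has bins spaced fs/N apart, while each frame spans N/fs seconds. The sketch below assumes that the two panels differ only in window length, which the slide does not state explicitly; the function name resolution and the window sizes tried are hypothetical.

```python
def resolution(sample_rate_hz, window_size):
    """Return (frequency bin spacing in Hz, frame duration in seconds)."""
    freq_res_hz = sample_rate_hz / window_size
    time_res_s = window_size / sample_rate_hz
    return freq_res_hz, time_res_s

# A longer window narrows the frequency bins but smears events over more time.
for n in (128, 256, 1024):
    df, dt = resolution(22050, n)
    print(n, round(df, 1), "Hz per bin,", round(dt * 1000, 1), "ms per frame")
```

The product of the two values is always 1, so improving one resolution necessarily worsens the other.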

How to generate a spectrogram?

Procedures to generate a spectrogram (Specgram1)
- Window = 256: each frame has 256 samples.
- Sampling is fs = 22050 Hz, so the maximum frequency is 22050/2 = 11025 Hz.
- Non-overlap = window * 0.95 = 256 * 0.95 = 243 samples (rounded), so the overlap is small (overlapping = 256 - 243 = 13 samples).
- For each frame (256 samples):
  - Find the magnitude of the Fourier transform, X_magnitude(m), m = 0, 1, 2, ..., 128.
  - Plot X_magnitude(m) vertically: m is the vertical axis, and |X(m)| = X_magnitude(m) is represented by intensity.
- Repeat the above for all frames q = 1, 2, ..., Q.
[Figure: frames q = 1, q = 2, ..., q = Q along the time axis, each drawn as a vertical strip running from X(0) to X(128)]
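The Specgram1 steps can be sketched in Python as follows. This is a minimal sketch, not the course's own code: the function names dft_magnitudes and spectrogram and the 1 kHz test tone are hypothetical, a direct DFT sum is used instead of an FFT, no window function is applied, and an incomplete final frame is simply dropped.

```python
import math

def dft_magnitudes(frame):
    """|X_m| for m = 0..N/2, computed with the direct DFT sum (no FFT library)."""
    n = len(frame)
    mags = []
    for m in range(n // 2 + 1):
        re = sum(s * math.cos(2 * math.pi * k * m / n) for k, s in enumerate(frame))
        im = -sum(s * math.sin(2 * math.pi * k * m / n) for k, s in enumerate(frame))
        mags.append(math.hypot(re, im))
    return mags

def spectrogram(signal, window=256, shift=243):
    """One list of magnitudes per frame: columns[q][m] = |X_m| of frame q (0-based)."""
    return [dft_magnitudes(signal[start:start + window])
            for start in range(0, len(signal) - window + 1, shift)]

# Hypothetical test signal: a 1 kHz cosine sampled at fs = 22050 Hz
fs = 22050
tone = [math.cos(2 * math.pi * 1000 * k / fs) for k in range(1024)]
columns = spectrogram(tone)

# Strongest bin of the first frame, converted back to Hz (bin spacing = fs / window)
peak_bin = max(range(len(columns[0])), key=columns[0].__getitem__)
print(len(columns), len(columns[0]), peak_bin * fs / 256)
```

Plotting each entry of columns as a vertical strip of gray levels, left to right, gives the spectrogram image; the plotting step itself is left out here.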

Class exercise 1.6: In specgram1
Calculate the first sample location and the last sample location of the frames q = 3 and 7. Note: N = 256, m = 243.
Answer:
- q = 1, frame starts at sample index = ?
- q = 1, frame ends at sample index = ?
- q = 2, frame starts at sample index = ?
- q = 2, frame ends at sample index = ?
- q = 3, frame starts at sample index = ?
- q = 3, frame ends at sample index = ?
- q = 7, frame starts at sample index = ?
- q = 7, frame ends at sample index = ?

Spectrogram plots of some music sounds
Sound file: tz1.wav
[Figure: spectrogram of tz1.wav; x-axis: seconds; the high-energy bands are formants]

Spectrogram plots of some music sounds
Sound files:
- http://www.cse.cuhk.edu.hk/%7ekhwong/www2/cmsc5707/tz1.wav
- http://www.cse.cuhk.edu.hk/%7ekhwong/www2/cmsc5707/trumpet.wav
- http://www.cse.cuhk.edu.hk/%7ekhwong/www2/cmsc5707/violin3.wav
[Figure: spectrogram of Trumpet.wav (high-energy bands: formants) and spectrogram of Violin3.wav (the violin has a complex spectrum); x-axis: seconds]

Exercise 1.7
Write the procedures for generating a spectrogram from a source signal X.

Summary
Studied:
- Basic digital audio recording systems
- Speech recognition system applications and classifications
- Fourier analysis and spectrogram

Appendix

Answer: Class exercise 1.1
Discuss the features of the speech recognition module in the following systems:
- Speech command dialing system: probably an isolated speech recognition system, speaker dependent (if training is needed).
- Android speech input system: continuous speech recognition, speaker independent.

Answer: Class exercise 1.2
For a signal sampled at 20 kHz with 16 bits per sample, how many bytes are used in 5 seconds?
Answer: 20 kHz * 2 bytes * 5 seconds = 200 Kbytes.

Answer: Class exercise 1.3
A sound is sampled at 22 kHz and the resolution is 16 bits. How many bytes are needed to store the sound wave for 10 seconds?
Answer: one second has 22K samples, so for 10 seconds: 22K * 2 bytes * 10 seconds = 440 Kbytes.
Note: 2 bytes are used because 16 bits = 2 bytes.

Answer: Class exercise 1.4
For a speech wave sampled at 22 kHz with 16 bits per sample, the frame size is 15 ms and the frame overlapping period is 40% of the frame size. Draw the frame block diagram.
Answer:
- Number of samples in one frame: N = 15 ms * 22 kHz = 330.
- Overlapping samples = 40% of 330 = 132, so m = N - 132 = 198.
- Overlapping time = 132 * (1/22k) = 6 ms; time in one frame = 330 * (1/22k) = 15 ms.
[Figure: signal s_n against time n, showing the first window (l = 1) and the second window (l = 2), each of length N, starting m samples apart]

Answer (revised): Class exercise 1.5: Fourier Transform
$$X_m = \sum_{k=0}^{N-1} s_k \, e^{-j 2\pi k m / N}, \quad m = 0, 1, 2, 3, \ldots, \frac{N}{2}, \qquad e^{\pm j\theta} = \cos(\theta) \pm j\sin(\theta)$$
http://en.wikipedia.org/wiki/list_of_trigonometric_identitie

    for (m = 0; m <= N/2; m++) {
        tmp_real = 0;
        tmp_img = 0;
        for (k = 0; k < N; k++) {   /* k runs over all N samples, 0 .. N-1 */
            tmp_real = tmp_real + s[k] * cos(2*pi*k*m/N);
            tmp_img  = tmp_img  - s[k] * sin(2*pi*k*m/N);   /* minus sign from e^(-j theta) */
        }
        X_real[m] = tmp_real;
        X_img[m]  = tmp_img;
    }

From N input data s_k, k = 0, 1, 2, 3, ..., N-1, there will be 2*(N/2+1) data generated, i.e. X_real[m] and X_img[m] for m = 0, 1, 2, 3, ..., N/2.
E.g. s_k = s_0, s_1, ..., s_511 gives X_real_0, X_real_1, ..., X_real_256 and X_img_0, X_img_1, ..., X_img_256.
Note that X_magnitude(m) = sqrt(X_real[m]^2 + X_img[m]^2).
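For readers who want to run the answer, here is the same double loop as a self-contained Python sketch. The function name dft_half and the 4-cycle cosine test signal are hypothetical choices for illustration; the loop structure follows the pseudo code, with k covering all N samples.

```python
import math

def dft_half(s):
    """Return (X_real, X_img) for m = 0..N/2, given real samples s of length N."""
    n = len(s)
    x_real, x_img = [], []
    for m in range(n // 2 + 1):
        re = 0.0
        im = 0.0
        for k in range(n):                      # k = 0 .. N-1 inclusive
            theta = 2 * math.pi * k * m / n
            re += s[k] * math.cos(theta)
            im -= s[k] * math.sin(theta)        # minus sign from e^(-j*theta)
        x_real.append(re)
        x_img.append(im)
    return x_real, x_img

# Check: a cosine that completes exactly 4 cycles in N = 64 samples
N = 64
s = [math.cos(2 * math.pi * 4 * k / N) for k in range(N)]
x_real, x_img = dft_half(s)
mag = [math.hypot(a, b) for a, b in zip(x_real, x_img)]
print(len(mag), round(mag[4], 6))               # 33 32.0
```

All the energy lands in bin m = 4 (magnitude N/2 = 32) and the other bins are numerically zero, which matches the "pure cosine has one frequency band" picture earlier in the chapter.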

Answer: Class exercise 1.6: In specgram1 (updated)
Calculate the first sample location and the last sample location of the frames q = 3 and 7. Note: N = 256, m = 243.
Answer:
- q = 1, frame starts at sample index = 0
- q = 1, frame ends at sample index = 255
- q = 2, frame starts at sample index = 0 + 243 = 243
- q = 2, frame ends at sample index = 243 + (N-1) = 243 + 255 = 498
- q = 3, frame starts at sample index = 0 + 243 + 243 = 486
- q = 3, frame ends at sample index = 486 + (N-1) = 486 + 255 = 741
- q = 7, frame starts at sample index = 243 * 6 = 1458
- q = 7, frame ends at sample index = 1458 + (N-1) = 1458 + 255 = 1713
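The pattern in these answers is start = (q - 1) * m and end = start + N - 1. A minimal Python sketch, where the function name frame_bounds is hypothetical and frames are numbered from q = 1 with 0-based sample indices, as in the answer:

```python
def frame_bounds(q, n=256, m=243):
    """First and last sample index (0-based) of frame q, with q counted from 1."""
    start = (q - 1) * m
    return start, start + n - 1

for q in (1, 2, 3, 7):
    print(q, frame_bounds(q))
# 1 (0, 255)
# 2 (243, 498)
# 3 (486, 741)
# 7 (1458, 1713)
```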

Why in the Discrete Fourier Transform m is limited to N/2
$$X_m = \sum_{k=0}^{N-1} s_k \, e^{-j 2\pi k m / N}, \quad m = 0, 1, 2, 3, \ldots, \frac{N}{2}, \qquad e^{j\theta} = \cos(\theta) + j\sin(\theta)$$
The reason is this:
- In theory, m can be any number from -infinity to +infinity (the original Fourier transform definition). In practice it runs from 0 to N-1, because only N samples are available to work on.
- In signal processing, however, there is the problem of aliasing (see http://en.wikipedia.org/wiki/aliasing): when the input frequency (Fx) is more than 1/2 of the sampling frequency (Fs), aliasing noise will happen.
- Index m corresponds to the frequency m*Fs/N. If you use m = N-1, you are measuring the energy level of the input signal very close to the sampling frequency, where aliasing happens.
- For example, if signal X is sampled at 10 kHz, then for m = N-1 you are calculating the energy level of a frequency very close to 10 kHz. That value is not useful: any input content above Fs/2 is aliased, and for a real input signal the values above m = N/2 only mirror those below it.
- Our measurement should therefore stay inside half of the sampling frequency range, so at maximum it should not be more than 5 kHz. That corresponds to m = N/2.
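One way to see that nothing is lost by stopping at m = N/2: for real-valued samples, the DFT value at index N - m is the complex conjugate of the value at index m, so the upper half repeats the magnitudes of the lower half. The sketch below checks this numerically; the function name dft_full and the sample values are hypothetical, chosen only for illustration.

```python
import cmath

def dft_full(s):
    """Direct DFT for every index m = 0..N-1 (not just up to N/2)."""
    n = len(s)
    return [sum(s[k] * cmath.exp(-2j * cmath.pi * k * m / n) for k in range(n))
            for m in range(n)]

s = [3, 1, -2, 5, 0, 4, -1, 2]      # arbitrary real samples (made-up data)
X = dft_full(s)
N = len(s)

for m in range(1, N // 2):
    # X[N - m] is the complex conjugate of X[m], so the two magnitudes match
    print(m, round(abs(X[m]), 6), round(abs(X[N - m]), 6))
```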