Robust Speech Recognition and its ROBOT implementation

1 Robust Speech Recognition and its ROBOT implementation Yoshikazu Miyanaga Hokkaido University

2 Conditions for Speech Recognition
Speech: short isolated speech (words, phrases; <2 sec) or continuous speech (sentences; >2 sec).
Microphone: attached mic (several cm to 10 cm), remote mic (10 cm to 5 m) or long-distance mic (>5 m).
Environment: silent room (>20 dB), living room (20 to 10 dB) or noisy room such as an exhibition (<10 dB).

3 Conventional ASR
Continuous speech (>2 sec), attached mic (<10 cm), silent room (>20 dB).
Short isolated speech (<2 sec), attached mic (<10 cm), living room (20 to 10 dB).
Array microphone: short isolated speech (<2 sec), remote mic (<5 m), living room (20 to 10 dB).

4 Hokkaido University Speech Communication System (HU-SCS)
Speech: short isolated speech (words, phrases; <2 sec).
Microphone: attached mic (several cm to 10 cm), remote mic (10 cm to 5 m) and long-distance mic (>5 m).
Environment: silent room (>20 dB), living room (20 to 10 dB) and noisy room such as an exhibition (<10 dB).

5 HU-SCS Automatic Speech Detection

6 HU-SCS Automatic Speech Detection
Current technology: 97% (SNR 10 dB), using wavelet and non-linear processing.
Reference: S-H Chen, H-T Wu, Y. Chang and T.K. Truong, Robust voice activity detection using perceptual wavelet-packet transform and Teager energy operator, Pattern Recognition Letters (2007).

7 HU-SCS Automatic Speech Detection
HU-SCS v4: 99% over SNR 10 dB, using BP + threshold operation and F0 detection.

8 HU-SCS Automatic Speech Recognition Candidates of Recognition Results (1) Good Morning (2) See you (3) How are you?

9 HU-SCS Automatic Speech Recognition
Current technology: 71% (SNR 10 dB), 97.4% (SNR 20 dB), using spectral subtraction, RASTA, CMS and a priori information.
Candidates of recognition results: (1) Good Morning (2) See you (3) How are you?

10 HU-SCS Automatic Speech Recognition
HU-SCS v4: 95.3% (SNR 10 dB), 98.3% (SNR 20 dB), using RSF/DRA with no a priori information.
Candidates of recognition results: (1) Good Morning (2) See you (3) How are you?

11 HU-SCS Automatic Speech Rejection
Candidates of recognition results: (1) Good Morning (2) See you (3) How are you?
Recognition result: Good Morning

12 HU-SCS Automatic Speech Rejection
Current technology: 90%, using confidence scoring.
Reference: T.J. Hazen, S. Seneff and J. Polifroni, Recognition confidence scoring and its use in speech understanding systems, Computer Speech and Language (2002).
Candidates of recognition results: (1) Good Morning (2) See you (3) How are you?
Recognition result: Good Morning

13 HU-SCS Automatic Speech Rejection
HU-SCS v4: dependent GMM by weighted HMM (90% accuracy), AI (artificial intelligence).
Candidates of recognition results: (1) Good Morning (2) See you (3) How are you?
Recognition result: Good Morning

14 HU-SCS: First SCS HW LSI IP
Blocks: automatic speech detection, automatic speech recognition, automatic speech rejection, HW with low power.
Targets: mobile, intelligent consumer electronics, etc.
Advantages: (1) mobile applications: small, low power; (2) PC free.
Super low-power consumption design.
Real-time SCS: recognition time 180 nsec/word (10 MHz clock).
Small-scale design with a specially designed LSI.
Noise reduction by array microphone.

15 HU-SCS Automatic Speech Detection HW with Low Power Automatic Speech Recognition Automatic Speech Rejection

16 Running Spectrum Domain
[Figure: waveform and mel-spectra over frames t-6 to t.]

17 BP and Threshold OP
[Figure: detected start point and end point of the speech segment.]
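
The slide names only "BP and Threshold OP" with a start point and an end point, so the following is a minimal sketch of threshold-based endpoint detection under stated assumptions: frame power is used as the detection measure, and `frame_len`, `threshold` and `min_frames` are hypothetical parameters, not values from HU-SCS.

```python
def frame_power(signal, frame_len):
    """Mean power of each non-overlapping frame."""
    n_frames = len(signal) // frame_len
    return [
        sum(s * s for s in signal[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def detect_endpoints(signal, frame_len=80, threshold=0.01, min_frames=3):
    """Return (start_frame, end_frame) of the first run of at least
    min_frames consecutive frames whose power exceeds threshold, or None.
    end_frame is exclusive. All parameter values are illustrative."""
    power = frame_power(signal, frame_len)
    start = None
    for i, p in enumerate(power + [0.0]):  # trailing 0.0 closes a final run
        if p > threshold:
            if start is None:
                start = i                  # candidate start point
        elif start is not None:
            if i - start >= min_frames:
                return start, i            # end point found
            start = None                   # run too short: ignore it
    return None


# Synthetic input: silence, a burst of amplitude 0.5, then silence again.
silence = [0.0] * 400
burst = [0.5 if t % 2 == 0 else -0.5 for t in range(800)]
endpoints = detect_endpoints(silence + burst + silence)
```

With 80-sample frames the burst occupies frames 5 to 14, so `endpoints` is `(5, 15)`.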

18 Switch-Less Recognition System by Automatic Detection
Hands-free operation/control: speech is detected automatically, so no switch is needed to start or end recognition.
[Figure: timeline of silent interval, start, recognition, end, silent interval, followed by operation/control.]

19 HU-SCS Automatic Speech Detection HW with Low Power Automatic Speech Recognition Automatic Speech Rejection

20 Speech Analysis and Robust Processing
Speech analysis: LPC cepstrum, mel-frequency cepstrum.
Robust processing: various types of techniques have been proposed, such as spectral subtraction, Wiener filtering and microphone arrays.
RSF/DRA (Running Spectrum Filtering/Dynamic Range Adjustment) applies filtering and normalization to cepstral vectors.

21 Procedure of Mel-Frequency Cepstrum
Speech signals $x(t)$
-> cut into short-time frames: $x_f(n, t_s)$
-> discrete Fourier transform (DFT): $X(n, f)$
-> filterbanks with mel-frequency scale: $X_s(n, f_m)$
-> logarithm: $\log(X_s(n, f_m))$
-> discrete cosine transform (DCT): $C(n, k)$, the cepstral coefficients
$n$: frame index, $k$: cepstral order
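
The procedure on this slide can be sketched in plain Python as follows. This is an illustrative sketch, not the HU-SCS implementation: the Hamming window, the triangular filter shape, the mel formula and all default sizes (`frame_len`, `hop`, `n_filters`, `n_ceps`) are assumptions, and a naive DFT is used to keep the example dependency-free.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """|DFT|^2 for bins 0..N/2 (naive O(N^2) DFT, fine for a demo)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * f * t / n)
                for t, x in enumerate(frame))) ** 2
        for f in range(n // 2 + 1)
    ]


def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters with centres equally spaced on the mel scale."""
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    edges = [mel_to_hz(top * i / (n_filters + 1)) * (n_bins - 1) / nyquist
             for i in range(n_filters + 2)]   # fractional bin positions
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        bank.append([
            (b - lo) / (mid - lo) if lo < b <= mid
            else (hi - b) / (hi - mid) if mid < b < hi
            else 0.0
            for b in range(n_bins)
        ])
    return bank


def mfcc(signal, sample_rate=8000, frame_len=64, hop=32, n_filters=12, n_ceps=8):
    """Return a list of cepstral vectors C(n, k), one per frame n."""
    bank = mel_filterbank(n_filters, frame_len // 2 + 1, sample_rate)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * t / (frame_len - 1))
              for t in range(frame_len)]
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + t] * window[t] for t in range(frame_len)]
        spec = power_spectrum(frame)                              # X(n, f)
        mel = [sum(w * s for w, s in zip(row, spec)) for row in bank]
        log_mel = [math.log(e + 1e-10) for e in mel]              # log X_s
        features.append([                                         # DCT-II
            sum(v * math.cos(math.pi * k * (m + 0.5) / n_filters)
                for m, v in enumerate(log_mel))
            for k in range(n_ceps)
        ])
    return features


# Usage: a 440 Hz tone, 512 samples at 8 kHz.
tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(512)]
cepstra = mfcc(tone)
```

Each entry of `cepstra` is one frame's cepstral vector; the small constant added before the logarithm only guards against empty filter outputs.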

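22 Noise modeling
A spectrum including noise can be modeled as
$X(n,\omega) = S(n,\omega) H(\omega) + A(\omega)$
where $S(n,\omega)$ is the clean spectrum, $H(\omega)$ is the multiplicative noise and $A(\omega)$ is the additive noise.
With $E(n,\omega) = S(n,\omega) H(\omega)$, the log spectrum is
$\log X(n,\omega) = \log(E(n,\omega) + A(\omega))$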

23 Noise Corruption in Power Spectrum
Noise corruption changes the gains and DC components.
[Figure: power spectrum of clean speech, E(n,ω), and of noisy speech, E(n,ω)+A (white noise at 10 dB SNR).]

24 Noise Corruption in Log Power Spectrum
Noise corruption changes the gains and DC components.
[Figure: log-power spectrum of clean speech, E(n,ω), and of noisy speech, E(n,ω)+A (white noise at 10 dB SNR), with the differences in gain and DC components marked.]
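
One way to read the gain and DC remarks is shown in the small numeric sketch below. The values of `clean`, `gain` and `noise` are made up for illustration. A multiplicative term only shifts the log spectrum by a constant, while an additive term raises the floor and compresses the dynamic range of the log spectrum.

```python
import math


def log_dynamic_range(power):
    """Difference between the largest and smallest log power."""
    logs = [math.log(p) for p in power]
    return max(logs) - min(logs)


clean = [0.01, 0.1, 1.0, 10.0]      # hypothetical E(n, w) along frames
gain = 4.0                          # hypothetical multiplicative term
noise = 1.0                         # hypothetical additive noise power A

scaled = [gain * e for e in clean]  # multiplicative change only
noisy = [e + noise for e in clean]  # additive noise only

range_clean = log_dynamic_range(clean)    # log(1000), about 6.9
range_scaled = log_dynamic_range(scaled)  # same range, shifted by log(gain)
range_noisy = log_dynamic_range(noisy)    # much smaller: the floor is log(A)
```

The constant shift is what mean or DC removal can undo; the compressed range is the distortion that a dynamic range normalization targets.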

25 Running Spectrum
The running spectrum is obtained by accumulating short-time spectra (the DFT of each frame).
Running spectrum: the time trajectory of each frequency.
[Figure: running spectrum, frequency versus frame number.]

26 Spectral Subtraction
Estimate the spectrum of the noise from the short-time spectra in the first several frames.
Subtract the estimated spectrum from each short-time spectrum.
[Figure: running spectrum of noisy speech (white noise at 5 dB SNR) and the result after subtraction.]
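
The two steps on this slide can be sketched as below. The spectral floor (`floor`) is an added assumption, a common safeguard against negative power values, and `n_noise_frames` is a hypothetical choice for "the first several frames".

```python
def spectral_subtraction(spectra, n_noise_frames=5, floor=0.01):
    """spectra[n][b]: power of frame n in frequency bin b.
    Returns (cleaned spectra, estimated noise spectrum)."""
    head = spectra[:n_noise_frames]          # assumed to contain noise only
    n_bins = len(spectra[0])
    # Step 1: average the leading frames to estimate the noise spectrum.
    noise = [sum(frame[b] for frame in head) / len(head) for b in range(n_bins)]
    # Step 2: subtract the estimate from every frame, never going below
    # a small fraction of the original power.
    cleaned = [
        [max(frame[b] - noise[b], floor * frame[b]) for b in range(n_bins)]
        for frame in spectra
    ]
    return cleaned, noise


# Usage: five noise-only frames followed by three frames with speech in bin 0.
frames = [[1.0, 2.0]] * 5 + [[11.0, 2.0]] * 3
cleaned, noise = spectral_subtraction(frames)
```

Here `noise` is `[1.0, 2.0]` and the speech frames keep a power of 10.0 in bin 0, while bin 1 falls to the floor.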

27 Noise Reduction Techniques
Conventional method: spectral subtraction. Its parameters are not optimized for speech from various environments, and excessive subtraction may cause musical noise.
Robust speech feature extraction: advanced speech analysis using RSF (running spectrum filtering) and DRA (dynamic range adjustment).

28 Modulation Spectrum
RSF focuses on the modulation spectrum.
Modulation spectrum: the spectrum of the time trajectory of each frequency, obtained by applying a DFT along the frame axis at each frequency.
[Figure: running spectrum (frequency versus frame number) and modulation spectrum (frequency versus modulation frequency).]

29 Mod-F of Clean and Noisy Speech
Speech components are dominant around 4 Hz in the modulation spectrum.
Lower modulation frequency components can be assumed to be noise, because noise components change little over time.
[Figure: modulation spectra of clean and noisy speech (white noise at 5 dB SNR).]
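
As an illustration of the modulation spectrum, the sketch below takes the time trajectory of a single frequency band and applies a DFT along the frame axis. The trajectory is synthetic: a constant offset stands in for a slowly varying noise component and a 4 Hz oscillation stands in for speech. The frame rate of 100 frames per second is an assumption.

```python
import cmath
import math


def modulation_spectrum(trajectory, frame_rate):
    """DFT of one band's time trajectory.
    Returns a list of (modulation frequency in Hz, magnitude) pairs."""
    n = len(trajectory)
    result = []
    for m in range(n // 2 + 1):
        acc = sum(v * cmath.exp(-2j * math.pi * m * t / n)
                  for t, v in enumerate(trajectory))
        result.append((m * frame_rate / n, abs(acc) / n))
    return result


frame_rate = 100.0                          # assumed frames per second
trajectory = [1.0 + 0.5 * math.cos(2 * math.pi * 4.0 * t / frame_rate)
              for t in range(200)]          # 2 seconds of one band

spectrum = modulation_spectrum(trajectory, frame_rate)
dc_magnitude = spectrum[0][1]               # constant (noise-like) part
peak_freq, peak_mag = max(spectrum[1:], key=lambda fm: fm[1])
```

The constant part appears only at 0 Hz, and the largest non-DC component sits at 4 Hz, which is the separation that a band-pass filter along the frame axis exploits.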

30 RSF (Running Spectrum Filtering)
Speech components are dominant around 4 Hz in the modulation spectrum.
[Figure: modulation spectrum, frequency (Hz) versus modulation frequency [Hz], marking the noise components, the speech components and the unnecessary part.]

31 RSF
RSF (Running Spectrum Filtering) enhances perceptual auditory components and relatively decreases noise components by bandpass filtering of cepstral sequences:
$\tilde{C}(n,k) = \sum_{i=0}^{Q} h(i)\, C(n-i,k)$
where $h(i)$ are the coefficients of the FIR filter.
[Figure: modulation frequency responses of RASTA (IIR) and RSF.]
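
The FIR form on this slide filters each cepstral order along the frame axis. A minimal sketch follows. The coefficients `h` are a hypothetical three-tap band-pass (zeros at DC and at the highest modulation frequency), not the RSF coefficients used in HU-SCS, and frames before the start of the utterance are treated as zero.

```python
def rsf_filter(cepstra, h):
    """Apply C~(n, k) = sum_i h(i) * C(n - i, k) to every cepstral order k.
    cepstra[n][k]; C(n, k) is taken as 0 for n < 0 (an assumption)."""
    n_ceps = len(cepstra[0])
    filtered = []
    for n in range(len(cepstra)):
        filtered.append([
            sum(h[i] * cepstra[n - i][k] for i in range(len(h)) if n - i >= 0)
            for k in range(n_ceps)
        ])
    return filtered


h = [0.5, 0.0, -0.5]                        # hypothetical band-pass taps

# A trajectory that never changes (noise-like) is removed after the
# two-frame start-up transient.
constant = [[3.0]] * 10
out_constant = rsf_filter(constant, h)

# A single impulse reproduces the filter taps.
impulse = [[1.0]] + [[0.0]] * 4
out_impulse = rsf_filter(impulse, h)
```

Because the taps sum to zero, any constant offset in a cepstral trajectory is cancelled, which matches the idea that very low modulation frequencies carry mostly noise.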

32 DRA
DRA (Dynamic Range Adjustment) normalizes the amplitude of cepstral vectors in the time domain (using the maximum value during the utterance) and suppresses dynamic range distortions caused by additive noise:
$\hat{C}(n,k) = \dfrac{\tilde{C}(n,k)}{\max_{1 \le j \le T} |\tilde{C}(j,k)|}$
where $T$ is the number of frames in the utterance.
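
A minimal sketch of this normalization, assuming the maximum is taken over the absolute values of each cepstral order across all frames of one utterance:

```python
def dra(cepstra):
    """Divide each cepstral order by its maximum absolute value over the
    utterance. cepstra[n][k]; an all-zero order is left at zero."""
    n_ceps = len(cepstra[0])
    peaks = [max(abs(frame[k]) for frame in cepstra) for k in range(n_ceps)]
    return [
        [frame[k] / peaks[k] if peaks[k] > 0 else 0.0 for k in range(n_ceps)]
        for frame in cepstra
    ]


clean = [[1.0, -4.0], [2.0, 2.0], [-0.5, 1.0]]
# Hypothetical distortion: the dynamic range shrinks by half.
compressed = [[0.5 * v for v in frame] for frame in clean]

normalized_clean = dra(clean)            # [[0.5, -1.0], [1.0, 0.5], [-0.25, 0.25]]
normalized_compressed = dra(compressed)  # identical to normalized_clean
```

After normalization every trajectory lies in the range -1 to 1, so a uniform shrinkage of the dynamic range no longer separates the clean and distorted features.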

33 RSF / DRA
Comparison of cepstral time trajectories at the 4th order.
[Figure: clean and noisy trajectories, baseline versus RSF/DRA processing.]

34 HU-SCS Automatic Speech Detection HW with Low Power Automatic Speech Recognition Automatic Speech Rejection

35 Likelihoods of HMM
Each HMM state is modeled by a GMM, an approximation by many multidimensional Gaussian distributions, each with an average and a variance.
[Figure: HMM with one GMM per state.]

36 Evaluation on Likelihoods
The MFCC features are scored against each HMM, giving the likelihoods $p_1, p_2, \ldots$ of the MFCC for each HMM.
The maximum likelihood is selected and its label is recognized as the result.
The result is correct, isn't it?

37 Evaluation of Reliability
[Figure: two likelihood distributions. In one, the result of the top score is trusted; in the other, the result of the top score is NOT trusted.]

38 Rejection Method using Multiple Criteria
Criteria: tendency, ratio, maximum score, square of ratio.
MFCC cluster groups: evaluation of cluster.
A new type of speech rejection for noisy conditions.
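
The slides do not define the criteria, so the following is only a sketch of how two of them, the maximum score and the ratio, might be combined. The thresholds `min_top` and `min_ratio`, and the reading of "ratio" as top score over runner-up score, are assumptions.

```python
def accept_or_reject(scores, min_top=0.5, min_ratio=2.0):
    """scores: label -> likelihood-like score (positive, higher is better).
    Needs at least two candidates.
    Returns (label, ratio) if accepted, (None, ratio) if rejected."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (label, top), (_, second) = ranked[0], ranked[1]
    ratio = top / second if second > 0 else float("inf")
    # Accept only if the top score is high enough AND clearly ahead
    # of the runner-up.
    accepted = top >= min_top and ratio >= min_ratio
    return (label if accepted else None), ratio


clear = {"Good Morning": 0.8, "See you": 0.1, "How are you?": 0.1}
ambiguous = {"Good Morning": 0.4, "See you": 0.35, "How are you?": 0.25}

clear_label, clear_ratio = accept_or_reject(clear)
ambiguous_label, ambiguous_ratio = accept_or_reject(ambiguous)
```

The first input is accepted as "Good Morning"; the second is rejected because its top score is both low and close to the runner-up.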

39 HU-SCS Automatic Speech Detection HW with Low Power Automatic Speech Recognition Automatic Speech Rejection

40 Overview of ASR System
Current ASR systems adopt robust processing that removes the influence of noise distortions.
Speech data -> speech analysis (convert to spectrum or cepstrum) -> robust processing (decrease noise distortions) -> speech feature vectors -> speech recognition (calculate probability, i.e. likelihood scores) -> results.
Reference models: prepare reference patterns by speech training.

41 Circuit Structure of Complete Recognition System
[Figure: block diagram with speech signal input, speech analysis, robust processing, speech recognition, SRAMs, data control, system control, external memory (SRAM) and a connection from/to the processor.]

42 Circuit Implementations
Required operating performance: speech analysis 10 MIPS; robust processing 500 MIPS (mainly in FIR).
[Figure: datapath from speech data to feature vectors through speech analysis (MFCC) and robust processing (RSF/DRA). Blocks shown: FFT, IDCT, FIR, Log, divider, buffers, Cos/Sin ROM. Memories shown: ROM and RAM of 256*16 bits, 512*24 bits, 4096*16 bits and 256*24 bits.]

43 Block Diagram
Interfaces: microprocessor, external RAM, and master/slave.
[Figure: block diagram with MPU interface, master bus, slave bus, bus control, system control, data control, MFCC, RSF/DRA, HMM, SRAMs and SRAM interface. Signals: interrupt signal, CLK, SW, RESET, address, chip select. SRAM contents: filter coefficients for RSF, working memory for MFCC and RSF, feature parameters before speech detection.]
All rights reserved. Copyright Yoshikazu Miyanaga.

44 New Scalable Architectures
Two types of scalable techniques are applied to the system.
(1) Multiple processing elements (PEs) in the HMM circuit: the PEs enable high-speed processing and improve recognition performance.
(2) Master/slave operation in the complete system: this operation enables high-speed processing and increases the vocabulary size.

45 HMM (Hidden Markov Models)
A statistical modeling approach using a Markov chain. Powerful for expressing time-varying data sequences and robust to speaker differences.
$q_n$ ($1 \le n \le N$): set of states
$a_{ij}$: state transition probability
$b_1(k), b_2(k), \ldots, b_N(k)$: output probability
[Figure: left-to-right HMM with states q1 to q4, self-transitions a11, a22, a33, a44 and forward transitions a12, a23, a34, a45.]

46 Full-Parallel Computations in HMM
The output probabilities and temporal scores can be computed concurrently, with one computation per HMM state.
[Figure: for input o_t, each state has its own output probability calculation and score calculation; Max(δ) selects the maximum, with the path for the upper state.]
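
To show why the per-state computations are independent, here is a sketch of a Viterbi-style score update in the log domain. It is a generic textbook recursion, not the HU-SCS circuit. The function names and the requirement that a path starts in the first state and ends in the last state are assumptions.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """log(p), with log(0) mapped to -infinity."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi_score(log_a, log_b):
    """log_a[i][j]: log transition probability from state i to state j.
    log_b[t][j]: log output probability of frame t in state j.
    Returns the best log path score that ends in the last state."""
    n_states = len(log_a)
    delta = [NEG_INF] * n_states
    delta[0] = log_b[0][0]                 # paths start in the first state
    for t in range(1, len(log_b)):
        # Every entry j depends only on the previous delta, so the
        # n_states updates below could be computed at the same time.
        delta = [
            max(delta[i] + log_a[i][j] for i in range(n_states)) + log_b[t][j]
            for j in range(n_states)
        ]
    return delta[-1]


# Two-state left-to-right model and three observation frames.
a = [[0.5, 0.5], [0.0, 1.0]]
b = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
log_a = [[safe_log(p) for p in row] for row in a]
log_b = [[safe_log(p) for p in row] for row in b]
best = viterbi_score(log_a, log_b)
```

The best path stays one frame in the first state and then moves to the second, giving a probability of 0.9 * 0.5 * 0.8 * 1.0 * 0.9 = 0.324, so `best` equals log(0.324).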

47 Master/Slave Operation
Steps: (1) set reference data, (2) speech analysis and robust processing, (3) broadcast, (4) speech recognition, (5) gather results.
[Figure: microprocessor, master with RAM, Slave1, Slave2 and Slave3.]

48 Master/Slave Operation
Steps: (1) set reference data, (2) speech analysis and robust processing, (3) broadcast, (4) speech recognition, (5) gather results.
[Figure: microprocessor, master with RAM, Slave1, Slave2 and Slave3, labeled Master [4], Slave1 [3], Slave2 [2], Slave3 [1].]

49 Master/Slave Operation
Steps: (1) set reference data, (2) speech analysis and robust processing, (3) broadcast, (4) speech recognition, (5) gather results.
[Figure: microprocessor, master with RAM, Slave1, Slave2 and Slave3.]

50 Master/Slave Operation
Steps: (1) set reference data, (2) speech analysis and robust processing, (3) broadcast, (4) speech recognition, (5) gather results.
[Figure: microprocessor, master with RAM, Slave1, Slave2 and Slave3, labeled Master [1] and Slave3 [2].]

51 Master/Slave Operation (2)
Steps: (1) set reference data, (2) speech analysis and robust processing, (3) broadcast, (4) speech recognition, (5) gather results.
[Figure: microprocessor, master with RAM, Slave1, Slave2 and Slave3.]

52 Master/Slave Operation (2)
Steps: (1) set reference data, (2) speech analysis and robust processing, (3) broadcast, (4) speech recognition, (5) gather results.
[Figure: microprocessor, master with RAM, Slave1, Slave2 and Slave3, labeled Master [4], Slave1 [3], Slave2 [2], Slave3 [1].]
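
The five steps can be simulated in software as below. This is a sketch of one plausible reading, in which the vocabulary is divided among the chips, every chip receives the same feature vector, and the best partial result wins. The `score` function, the round-robin split and the sequential loop standing in for parallel chips are all assumptions.

```python
def score(features, template):
    """Hypothetical similarity: negative squared distance (higher is better)."""
    return -sum((f - t) ** 2 for f, t in zip(features, template))


def split_vocabulary(words, n_chips):
    """Deal the words out round-robin, one share per chip."""
    return [words[i::n_chips] for i in range(n_chips)]


def recognize(features, templates, n_chips=4):
    """Return (best score, best word) over the whole vocabulary."""
    shares = split_vocabulary(sorted(templates), n_chips)  # (1) set reference data
    partial_results = []
    for share in shares:         # (3) every chip receives the same features
        if not share:
            continue
        # (4) each chip scores only its own share of the vocabulary
        best = max(share, key=lambda w: score(features, templates[w]))
        partial_results.append((score(features, templates[best]), best))
    return max(partial_results)  # (5) gather: keep the overall best


templates = {
    "Good Morning": [1.0, 0.0],
    "See you": [0.0, 1.0],
    "How are you?": [1.0, 1.0],
    "Hello": [0.0, 0.0],
    "Thanks": [2.0, 2.0],
}
features = [0.9, 0.1]            # (2) output of analysis and robust processing
result = recognize(features, templates)
```

The winner is "Good Morning" whether one chip or four chips are simulated; adding chips only shortens each chip's share, which is how such a scheme can support a larger vocabulary.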

53 Circuit Design (Analysis & HMM TEG)
Technology: Rohm CMOS 0.35 μm
Standard cell library: Univ. of Tokyo EXD
Voltage supply: 3.3 V
RTL-level design: Verilog-HDL
[Figure: layout view of evaluation V2.]
[Table columns: Clock Freq. (MHz), Proc. Time (ms/word), Power Cons. (mW); values missing.]

54 Comparison on Power Consumption
Proposed HW (10 MHz) and DSP design (80 MIPS).
[Table: rows are the DSP-based system (TMS320C549, 80 MIPS) and the proposed system (dedicated processor, 10 MHz); columns are memory access time (ns), processor power (mW; core: 3.3 V), memory power (mW; SRAM, core: 3.3 V) and total; values missing.]

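55 Processing Time of HU-SCS
Comparison with software design: 54 times faster. No high-speed clock is needed, which is useful for low-power design.
[Table: proposed system (hardware) versus Pentium 4 (software); rows are number of arithmetic units, number of cycles (455,200 for the proposed system, "-" for Pentium 4), frequency (MHz) and recognition processing time (ms); other values missing.]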

56 Design by Standard Cells
Technology: TSMC 0.25 µm CMOS standard cell
Voltage: 2.5 V
Highest clock rate: 80.6 MHz (12.4 ns, temperature condition: typical)
No. of parallel processing    32         8
HMM                           491,       ,980
RSF/DRA                       11,910
MFCC                          39,670
System control                18,310
Bus control                   1,310
SRAM                          63,400
Total                         626,       ,580

57 Current HU-SCS
[Figure: PC interface with HU-SCS board; HU-SCS board, 55 mm x 44 mm.]

58 Overview of Current HU-SCS
Improvement of noise robustness: accurate ASR under SNR 0-10 dB; robustness against echo.
Improvement of speech recognition: higher accuracy in MFCC calculation; low-power design and higher-speed processing.
Improvement of total HW system: higher-speed response time.

59 Comparison on Performance
Environment                            Noise level   Correctness (current)   Correctness (previous)
Meeting room                           50 dB         96.4%                   90.0%
Elevator                               50 dB         95.0%                   84.4%
Stairs                                 45 dB         85.1%                   50.5%
Car A (idling, not moving)             50 dB         99.4%                   95.6%
Car B (high speed, open window)        75 dB         93.3%                   85.0%
Car C (high speed, audio on (FM))      75 dB         88.9%                   65.6%
Total                                                93.0%                   78.5%
Cruiser board (outside, high speed)    80 dB         82.7%                   -
[Figure: bar chart of previous versus current correctness, captioned "Comparisons between HU-SCS v4 and v" (second version number truncated).]

60 Results on Some Distances
[Figure: six panels (Car A, Car B, Car C, Meeting Room, Elevator, Stair), each plotting a percentage against microphone distance (30 cm, 60 cm, 90 cm).]

61 Robot Implementation
Speech recognition and synthesis.
Quick response.
Control of consumer electronics and machines.

62 Communications and Controls

63 Summary
Hokkaido University Speech Communication System
Integrated architecture of speech detection, robust speech analysis, speech recognition and speech rejection.
Higher-speed processing than DSP and software.
Superior energy saving compared with DSP solutions.
Improved noise robustness by the RSF/DRA technique.
Small, fast and low power.

64 Who? Yoshikazu Miyanaga He received the B.S., M.S., and Dr. Eng. degrees from Hokkaido University, Sapporo, Japan, in 1979, 1981, and 1986, respectively. He is currently a Professor at Graduate School of Information Science and Technology, Hokkaido University. His research interests are in the areas of signal processing for wireless communications, nonlinear signal processing and low-power LSI systems. He was a chair of Technical Group on Smart Info-Media System, IEICE. He is an advisory member of this technical group. Currently, he is IEICE fellow. He served as a member in the board of directors, IEEE Japan Council as a chair of student activity committee from 2002 to He is a chair of student activity committee in IEEE Sapporo Section from He is a chair of IEEE Circuits and Systems Society, Digital Signal Processing Technical Committee from He has been serving as international steering committee chairs/members of IEEE ISPACS, IEEE ISCIT, IEEE/EURASIP NSIP and honorary/general chairs/co-chairs of their international symposiums/workshops, i.e., ISPACS 2003, ISCIT 2004, ISCIT 2005, NSIP 2005, ISPACS 2008, ISMAC 2009 and APSIPA ASC He also served as international organizing committee chairs of IEICE ITC-CSCC , IEEE MSCAS 2004, IEEE ISCAS
