Distributed Speech Recognition Standardization Activity

1 Distributed Speech Recognition Standardization Activity
Alex Sorin, Ron Hoory, Dan Chazan
Telecom and Media Systems Group, IBM Research Lab in Haifa
June 30, 2003

2 Advanced Speech Enabled Services
- Example dialogue. System: "Flight Booking Center! How can I help you?" User: "I'd like to fly from NY to Chicago."
- Low-resource mobile device
- Advanced ASR task where accuracy is crucial
- Server-based ASR
[Diagram labels: ASR, App, DB, TTS]

3 Speech Recognition Over Mobile Networks: NSR vs DSR
Network Speech Recognition (NSR)
- Low-bitrate coding and channel errors degrade speech recognition accuracy
[Diagram: Encoder -> encoded voice -> Decoder -> playback; on the speech recognition server, ASR front-end -> ASR back-end -> text]
Distributed Speech Recognition (DSR)
- Compress and transmit recognition features (MFCC)
[Diagram: ASR front-end -> DSR encoder -> low-bit-rate stream of recognition features -> DSR decoder -> playback and ASR back-end -> text]
To-do list
- Proof of concept
- Standardization
- Speech reconstruction

4 DSR in ETSI/Aurora
- 2000: standard ETSI ES, DSR Front-end (FE)
- 2002: standard ETSI ES, DSR Advanced Front-end (AFE)
- Sept 2003: extended standards developed by IBM HRL & Motorola, ETSI ES (XFE) and ETSI ES (XAFE)
[Diagram: speech -> robust feature extraction -> compression -> wireless channel -> decompression -> ASR back-end -> text. Extension path: pitch and voicing class (VC) -> pitch & VC compression -> reconstruction (speech to playback) and tonal-language ASR back-end -> text]

5 ASR Accuracy Comparison: DSR vs GSM AMR
- In-car recordings, 5 languages, connected digits task
- AMR 4.75 kbps is 55% worse than DSR AFE
[Chart: WER (%) for AMR 4.75 kbps, AMR 12.2 kbps and DSR AFE 4.8 kbps]

6 Channel Errors
- DSR error mitigation: interpolation in feature space
- Simulation by BT: 3% accuracy degradation at 50% packet loss, vs. 63% degradation for coded speech transmission
[Chart: word accuracy (%) vs packet loss (%) for G.723.1 and for front-end interpolation]
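The idea of mitigating packet loss by interpolating in feature space can be sketched as follows. This is a minimal illustration, not the scheme used in the BT simulation or in the ETSI standards: it assumes lost frames are marked with `None`, and the name `conceal_lost_frames` and the choice of plain linear interpolation are assumptions made for the example.

```python
from typing import List, Optional

Frame = Optional[List[float]]  # None marks a frame lost in the channel (assumed convention)


def conceal_lost_frames(frames: List[Frame]) -> List[List[float]]:
    """Fill lost feature frames by linear interpolation between received neighbours.

    Lost frames at the very start or end, which have only one received
    neighbour, are filled by repeating that neighbour.
    """
    received = [i for i, f in enumerate(frames) if f is not None]
    if not received:
        raise ValueError("no received frames to interpolate from")

    out: List[List[float]] = []
    for i, frame in enumerate(frames):
        if frame is not None:
            out.append(list(frame))
            continue
        prev = max((r for r in received if r < i), default=None)
        nxt = min((r for r in received if r > i), default=None)
        if prev is None:
            out.append(list(frames[nxt]))
        elif nxt is None:
            out.append(list(frames[prev]))
        else:
            # Weight grows from 0 at the previous good frame to 1 at the next one.
            w = (i - prev) / (nxt - prev)
            out.append([(1 - w) * a + w * b
                        for a, b in zip(frames[prev], frames[nxt])])
    return out
```

Because recognition features such as cepstra vary smoothly over time, a gap filled this way stays close to a plausible trajectory, whereas a lost packet of coded speech leaves the speech decoder with nothing comparable to interpolate.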

7 DSR XFE/XAFE Requirements (IBM Haifa Research Lab)
[Diagram labels: state-of-the-art TLR; ASR improvement; accuracy; GSM AMR 4.75; LPC10; MELP; complexity & bitrate; speech intelligibility]

8 Extended DSR Client Diagram
[Block diagram labels: Speech; Feature Extraction (Spectrum, Downsampling, High Pass, Cepstra, VAD); Car noise detection; Compression; Pitch Estimation; Voicing classification; xcompression; Packing]

9 Robust Low Complexity Pitch Estimator
Inputs: spectrum, car noise flag, downsampled speech. Output: pitch.
- Spectral Peaks: estimate locations and amplitudes of spectral peaks
- Preliminary Candidates: use a few major peaks to find preliminary pitch candidates
- Candidates: use all peaks to determine a few best candidates and their spectral scores
- Correlation Scores: compute correlation scores of the candidates
- Decision Logic: select the final pitch candidate using spectral scores, correlation scores and history
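The correlation-score step can be illustrated with a normalized time-domain correlation evaluated only at the candidate lags. This is a hedged sketch: the slide does not give the exact score, so the normalization, the function names `correlation_score` and `pick_candidate`, and the idea of choosing by correlation alone (the slide's decision logic also uses spectral scores and history) are assumptions made for the example.

```python
import math
from typing import List, Sequence


def correlation_score(signal: Sequence[float], lag: int) -> float:
    """Normalized correlation between the signal and a copy shifted by `lag` samples.

    Returns a value in [-1, 1]; values near 1 mean the signal repeats
    with a period of about `lag` samples.
    """
    if lag <= 0 or lag >= len(signal):
        raise ValueError("lag must be between 1 and len(signal) - 1")
    a = signal[:-lag]
    b = signal[lag:]
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0 else 0.0


def pick_candidate(signal: Sequence[float], candidate_lags: List[int]) -> int:
    """Return the candidate lag with the highest correlation score (simplified decision)."""
    return max(candidate_lags, key=lambda lag: correlation_score(signal, lag))
```

Scoring only a handful of candidates, rather than every possible lag, is what keeps this stage cheap: the spectral stages do the search, and the correlation is used to rank the survivors.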

10 Pitch Contours Example: Clean vs Babble Noise 10 dB

11 xDSR Encoder Parameters
- Bitrate = 5.6 kbps
[Bar charts: ROM (kwords), CPU (wMOPS) and RAM (kwords) for XFE, AMR and XAFE, each split into basis and extension]

12 Server Side Speech Reconstruction
Inputs: raw pitch, voicing class, cepstra & energy. Output: synthesized speech.
- Harmonic Magnitudes: the heart of the reconstruction process
[Block diagram labels: Pitch Tracking; Control / Harmonic Structure Init; Harmonic Magnitudes; All-pole Modelling; Postfilter; Unvoiced Phase; Voiced Phase; Voiced & Unvoiced Combination; Line spectrum; Time-domain OLA]

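13 Magnitudes Reconstruction Problem
Front-end chain: Speech -> STFT (128) -> Abs/Power -> Mel-scale triangular filters (23 bins) -> LOG -> DCT (23) -> 13 MFCC (LOC) -> Quantization
Harmonic model of the spectrum:
$$S(f) = \sum_i C_i\, w(f - f_i), \qquad C_i = A_i e^{j\varphi_i}$$
Harmonic magnitudes: $\{A_i\}$
[Figure: spectrum S(f) over frequency f, with components C1, C2, C3 at frequencies f0, f1, f2]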
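The last stage of the front-end chain, where 23 log Mel-bin values become 13 cepstra, can be sketched as below. The slide does not state the DCT scaling, so an unnormalized DCT-II is assumed here, and the names `log_mel_to_cepstra`, `NUM_CHANNELS` and `NUM_CEPSTRA` are illustrative.

```python
import math
from typing import List, Sequence

NUM_CHANNELS = 23   # Mel filter-bank outputs, per the slide
NUM_CEPSTRA = 13    # cepstra kept after truncation, per the slide


def log_mel_to_cepstra(log_bins: Sequence[float]) -> List[float]:
    """Unnormalized DCT-II of 23 log Mel-bin values, truncated to 13 coefficients.

    The scaling convention is an assumption; only the 23 -> 13 truncation
    is taken from the slide.
    """
    if len(log_bins) != NUM_CHANNELS:
        raise ValueError("expected %d log Mel-bin values" % NUM_CHANNELS)
    return [
        sum(log_bins[j] * math.cos(math.pi * n * (j + 0.5) / NUM_CHANNELS)
            for j in range(NUM_CHANNELS))
        for n in range(NUM_CEPSTRA)
    ]
```

The truncation is where the reconstruction problem comes from: 13 numbers describe only a smoothed version of the 23-bin spectral envelope, and the harmonic magnitudes have to be estimated from that smoothed description plus the pitch.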

14 Magnitudes Reconstruction by IBM and by Motorola
IBM
- Convert cepstra to spectral bins (IDCT, exp)
- Describe the front-end processing by a linear equation linking the bins with the harmonic magnitudes
- Represent the magnitudes by a linear combination of 23 basis functions
- Rewrite and solve the equation in the basis function weights
- Compute the magnitudes
Motorola
- Find the (non-integer) index $\alpha_k$ of the harmonic frequency location on the Mel-channel grid: $0.5 < \alpha_k < 23.5$
- Extended IDCT:
$$LA_k = \frac{2}{23} \sum_{n=0}^{12} Cep_n \cos\!\left(\frac{\pi}{23}\, n\, (\alpha_k - 0.5)\right), \qquad k = 1, \dots, N_{harm}$$
- Take the exponent
- Normalize to compensate for the variable width of the Mel triangles
Comparison
- 23-dimensional cepstrum: IBM outperforms Motorola
- Quantized 13 cepstra: IBM and Motorola perform equally
- Cepstra truncation significantly degrades reconstruction accuracy
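The Extended IDCT and exponent steps of the Motorola method can be sketched as follows. The code applies the sum as transcribed from the slide, with a factor of 2/23 on every term including n = 0; whether the original treats the n = 0 term differently cannot be recovered from the slide. The final normalization for the Mel-triangle widths is omitted, and the function names are illustrative.

```python
import math
from typing import List, Sequence

NUM_CHANNELS = 23  # size of the Mel-channel grid, per the slide


def extended_idct(cepstra: Sequence[float], alpha: float) -> float:
    """Log-magnitude LA_k at a non-integer Mel-channel index `alpha`.

    Evaluates (2/23) * sum_n Cep_n * cos(pi * n * (alpha - 0.5) / 23),
    i.e. the inverse DCT at a point between the channel centres.
    """
    if not 0.5 < alpha < NUM_CHANNELS + 0.5:
        raise ValueError("alpha must satisfy 0.5 < alpha < 23.5")
    total = sum(c * math.cos(math.pi * n * (alpha - 0.5) / NUM_CHANNELS)
                for n, c in enumerate(cepstra))
    return (2.0 / NUM_CHANNELS) * total


def harmonic_magnitudes(cepstra: Sequence[float],
                        alphas: Sequence[float]) -> List[float]:
    """Exponent of the extended IDCT at each harmonic's index.

    The Mel-triangle width normalization mentioned on the slide is NOT applied.
    """
    return [math.exp(extended_idct(cepstra, a)) for a in alphas]
```

Mapping each harmonic frequency to its index `alpha` requires the Mel-channel centre frequencies of the front-end, which the slide does not list, so the sketch takes the indices as given.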

15 Combined Magnitudes Reconstruction
- LOC: Low Order Cepstra (C0 to C12)
- HOC: High Order Cepstra (C13 to C23)
[Diagram: pitch, LOC and HOC feed the Motorola algorithm, giving $A_k^{Mot}$, and the IBM algorithm, giving $A_k^{IBM}$; HOC synthesis block with HOC 1, HOC 2, HOC 3, ..., HOC k, ..., HOC N]
Combination:
$$A_k = \mu(k, pitch)\, A_k^{Mot} + \bigl(1 - \mu(k, pitch)\bigr)\, A_k^{IBM}$$
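The combination rule is a per-harmonic weighted average of the two estimates. A minimal sketch, assuming the weights have already been computed: the slide does not define how the weight depends on the harmonic index k and the pitch, so the caller supplies one weight per harmonic, and `combine_magnitudes` is an illustrative name.

```python
from typing import List, Sequence


def combine_magnitudes(a_mot: Sequence[float],
                       a_ibm: Sequence[float],
                       mu: Sequence[float]) -> List[float]:
    """Per-harmonic mix A_k = mu_k * A_k_Mot + (1 - mu_k) * A_k_IBM.

    `mu` holds one weight in [0, 1] per harmonic; how the weights are
    derived from k and the pitch is not given on the slide.
    """
    if not (len(a_mot) == len(a_ibm) == len(mu)):
        raise ValueError("all inputs must have one entry per harmonic")
    if any(m < 0.0 or m > 1.0 for m in mu):
        raise ValueError("weights must lie in [0, 1]")
    return [m * x + (1.0 - m) * y for x, y, m in zip(a_mot, a_ibm, mu)]
```

For example, `combine_magnitudes([2.0, 4.0], [6.0, 8.0], [1.0, 0.25])` keeps the Motorola estimate for the first harmonic and leans towards the IBM estimate for the second.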

16 Magnitudes Reconstruction Accuracy Evaluation
[Chart: relative reconstruction error vs pitch (ms) for the IBM, Motorola and combined methods]

17 Intelligibility of Reconstructed Speech
- Average over background noise conditions: clean, car, street, babble
- Tests: Diagnostic Rhyme Test (DRT) and Transcription Test (TT)
[Chart: intelligibility testing results, DRT WER (%) and 10 * TT WER (%), for PCM, XAFE, XFE, MELP and LPC-10]

18 Decoded Speech Examples
[Audio examples, female voice and male voice, for each coder: Original, LPC10, MELP, XFE, XAFE]

19 Tonal Language Recognition Evaluation
- Standard pitch keeps state-of-the-art TLR performance intact
[Chart, TLR evaluation by Motorola: WER (%) with proprietary pitch vs standard pitch for Mandarin digits, Mandarin commands and Cantonese digits]
[Chart, TLR evaluation by IBM: WER (%) with proprietary pitch vs standard pitch for Mandarin digits and Cantonese digits]
