ON THE PERFORMANCE OF WTIMIT FOR WIDE BAND TELEPHONY


D. Nagajyothi (1) and P. Siddaiah (2)
(1) Department of Electronics and Communication Engineering, Vardhaman College of Engineering, Shamshabad, Telangana, India
(2) Department of Electrical and Computer Engineering, University College of Engineering and Technology, Acharya Nagarjuna University, Guntur, India
Email: nagajyothi1998@gmail.com

ABSTRACT
WTIMIT, a derivative of TIMIT, has recently emerged as a resource for studying wideband speech quality. It has good wideband characteristics over the range of 50 Hz to 7 kHz. In this paper, the performance of a phoneme recognition system is studied. The study includes the effect of decimating the signal to 8 kHz, as in the conventional case. It is further possible to evaluate the AMR wideband codec for several acoustic models. WTIMIT-type wideband channel data can therefore be proposed for training interactive voice response (IVR) systems.

Keywords: speech codecs, IVR, AMR-WB, TIMIT.

1. INTRODUCTION
The typical bandwidth of speech in telephony applications is less than 4 kHz. This is often termed narrowband (NB). IVR systems operate at a sampling rate of 8 kHz [1-3]. In view of this, advanced speech services have expanded their bandwidth to the wideband (WB) frequency range of 0.05-7 kHz. The NTIMIT corpus, which captures the influence of a conventional telephony network on TIMIT, is used to evaluate the performance of recognition systems in traditional telephony. The phoneme error rate (PER) is considerably lower for direct WB speech, whereas a relative PER degradation of 23% has been identified in the NB case. The impact of a WB mobile network has also been reported using the WTIMIT corpus: a relative PER difference of 19% with respect to direct WB speech is observed, together with a 3% PER reduction relative to narrowband. Despite these efforts, the effects of the telephony network still need to be validated.
This is particularly desirable for IVR-based telephony systems. The development of the DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus paved the way for evaluating automatic speech recognition (ASR) systems [4]. It consists of wideband speech recordings sampled at 16 kHz, covering the range of 50 Hz to 7 kHz, from 630 native speakers representing 8 major dialect regions of the US. Ten phonetically rich sentences were collected from every speaker. Every utterance is provided with the speech waveform together with time-aligned orthographic, phonetic and word transcriptions.

As of now there are five TIMIT derivatives, namely FFMTIMIT, NTIMIT, CTIMIT, HTIMIT and STC-TIMIT. FFMTIMIT (Free Field Microphone TIMIT) consists of the original TIMIT database recorded with a free-field microphone. NTIMIT (Network TIMIT) is an adjunct to TIMIT whose database consists of the TIMIT speech waveforms transmitted over a telephone handset [4]. Similarly, CTIMIT consists of the original TIMIT recordings passed through cellular telephone circuits. In the case of HTIMIT (Handset TIMIT), the database consists of two subsets with 192 male and 192 female speakers [5]. The corresponding speech signals were transmitted through different telephone handsets, which helps in investigating telephone transducer effects on speech. For STC-TIMIT (single-channel telephone corpus), the speech signals were sent through a real and, in contrast to NTIMIT, single telephone channel. All of these corpora are derived from wideband speech [6-11], while some are telephony corpora containing narrowband speech, sampled at 8 kHz with a range of 200 Hz to 3.4 kHz. In spite of all this, it should be noted that no real-world wideband telephony speech corpus has been available.
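The step from 16 kHz wideband data to 8 kHz narrowband data mentioned in the abstract can be illustrated with a small decimation sketch. This is only an illustration under stated assumptions, not the processing used in the paper: the 31-tap Hamming-windowed sinc filter, the 3.52 kHz cutoff and the function names are all hypothetical choices made for this example.

```python
import math

def lowpass_taps(num_taps=31, cutoff=0.22):
    """Hamming-windowed sinc low-pass filter (illustrative design).
    cutoff is in cycles per input sample: 0.22 * 16 kHz = 3.52 kHz."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            ideal = 2 * cutoff
        else:
            ideal = math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]  # normalise to unity gain at DC

def decimate_by_2(signal, taps):
    """Low-pass filter, then keep every second sample (16 kHz -> 8 kHz)."""
    delay = (len(taps) - 1) // 2  # compensate the filter's group delay
    out = []
    for n in range(0, len(signal), 2):
        acc = 0.0
        for k, h in enumerate(taps):
            idx = n + delay - k
            if 0 <= idx < len(signal):
                acc += h * signal[idx]
        out.append(acc)
    return out

def tone(freq_hz, fs_hz, num_samples):
    """Unit-amplitude sine wave used as a test signal."""
    return [math.sin(2 * math.pi * freq_hz * i / fs_hz) for i in range(num_samples)]

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

taps = lowpass_taps()
low = decimate_by_2(tone(1000, 16000, 1600), taps)   # inside the narrowband range
high = decimate_by_2(tone(6000, 16000, 1600), taps)  # wideband-only content
print(len(low), round(rms(low[50:-50]), 3), round(rms(high[50:-50]), 4))
```

The 1 kHz tone survives decimation almost unchanged, while the 6 kHz tone, which lies in the part of the spectrum that only wideband telephony carries, is strongly attenuated by the anti-aliasing filter.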
Several wideband speech codecs, such as G.722 (1988), G.722.1 (1999), G.722.2 (2001) and G.711.1 (2008), have come into operation, using techniques such as ADPCM, 3GPP coding [6] and wideband PCM. It is interesting to note that wideband speech transmission systems (WBSTS) for telephony are widely available and adaptable. Given the ever-increasing spread of mobile networks, it is essential to have wideband telephony data in the TIMIT family for a wide range of scientific investigations. There are several advantages and applications associated with WBSTS. An integrated speech recognition system can provide remote dictation or spelling, which was not possible with earlier telephony systems.

In this paper, the performance of speech codecs is investigated in terms of bit rates. The analysis is based on experiments carried out in MATLAB on a Windows platform, on an i3 machine with 4 GB RAM. The rest of the paper is organized as follows. A brief discussion of the standards and corresponding bit rates of several speech codecs is presented in Section 2. Results pertaining to synthesis, followed by an analysis of the speech codecs, are given in Section 3. Overall conclusions are given in Section 4.

2. SPEECH CODECS
In this section, a brief introduction to speech codecs is given. The aim of a speech codec is to compress the speech signal in order to reduce the bandwidth and the storage space it requires. When the signal is reconstructed, it must be very close to the original.
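The bandwidth and storage saving that a codec bit rate implies can be worked out with simple arithmetic. The sketch below uses bit rates of the kind listed in Table-1 and Table-2. Two things are assumptions made for this illustration rather than statements from the paper: the 20 ms frame length and the 16-bit linear PCM reference. The function names are hypothetical.

```python
FRAME_MS = 20   # assumption: 20 ms speech frames, as in the AMR family
PCM_BITS = 16   # assumption: 16-bit linear PCM as the uncompressed reference

def bits_per_frame(bit_rate_kbps, frame_ms=FRAME_MS):
    """Coded bits per frame: kbit/s multiplied by ms gives bits."""
    return round(bit_rate_kbps * frame_ms)

def compression_ratio(bit_rate_kbps, fs_khz, pcm_bits=PCM_BITS):
    """Uncompressed PCM rate divided by the codec rate."""
    return (fs_khz * pcm_bits) / bit_rate_kbps

def storage_kbytes(bit_rate_kbps, seconds):
    """Storage for a recording, in kilobytes (1 kB = 1000 bytes)."""
    return bit_rate_kbps * seconds / 8

for name, rate, fs in [("G.711", 64, 8), ("G.729", 8, 8),
                       ("AMR-WB 12.65", 12.65, 16), ("AMR-WB 23.85", 23.85, 16)]:
    print(name, bits_per_frame(rate), round(compression_ratio(rate, fs), 1),
          storage_kbytes(rate, 60))
```

For instance, one minute of G.711 speech at 64 kbps needs 480 kB, whereas the same minute at the 8 kbps rate of G.729 needs 60 kB.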

The perceived quality of the signal is measured on the basis of intelligibility and naturalness. Two types of network are considered here, GSM and VoIP. Table-1 and Table-2 show the specification summary of all the supported narrowband and wideband codecs.

Table-1. ITU-T approved VoIP supported narrowband and wideband speech codecs.

Coding standard      Algorithm                  Sampling frequency (kHz)   Bit rates (kbps)
G.711 (A/U)          Companded PCM              8                          64
G.726                ADPCM                      8                          16/24/32/40
G.729                CS-ACELP                   8                          8
G.723.1A             ACELP / MP-MLQ             8                          5.3 / 6.3
G.722 (WB)           SB-ADPCM                   16                         48, 56, 64
G.711.1 (WB)         Companded PCM, MDCT        16                         64, 80, 96
G.729.1 (WB)         CELP, TD-BWE, TDAC         16                         8-32
G.722.2 (AMR-WB)     (Multi Rate) MRWB-ACELP    16                         6.60, 8.85, 12.65, 14.25, 15.85, 18.25, 19.85, 23.05, 23.85

Table-2. ETSI/3GPP approved GSM supported narrowband and wideband speech codecs.

Coding standard      Algorithm                  Sampling frequency (kHz)   Bit rates (kbps)
GSM FR               RPE-LTP                    8                          13
GSM EFR              ACELP                      8                          12.2
GSM HR               VSELP                      8                          5.6
GSM AMR              (Multi Rate) MR-ACELP      8                          4.75, 5.15, 5.90, 6.70, 7.40, 7.95, 10.2, 12.2
GSM AMR-WB           MRWB-ACELP                 16                         6.60, 8.85, 12.65, 14.25, 15.85, 18.25, 19.85, 23.05, 23.85

3. RESULTS AND DISCUSSIONS
Results pertaining to the technique and the proposed method are presented in this section.

Testing of the coded data with G.711-coded models (8-kHz HMMs)
The 8-kHz un-coded speech data and the speech data coded with the wireline codecs G.711, G.726 and G.729 are tested with the G.711-coded models (8-kHz trained HMMs), for the context-independent (CI) models and for the context-dependent (CD) tied tri-phone models with 1, 2, 4 and 8 Gaussians per state. The results are reported in Table-3.

Case-1: Testing the wireline coded data (NB codecs) with G.711-coded trained models
The corresponding comparison of ASR accuracy for G.711, G.726, G.729 and un-coded data is shown in Figure-1. It is evident that G.726 and G.711 give the highest accuracy.

Table-3. Results of testing the wireline coded data (NB codecs) with G.711-coded trained models (8-kHz HMMs).

Coded data used in testing   CI       CD-1gau   CD-2gau   CD-4gau   CD-8gau
Un-coded                     86.11    91.41     94.29     95.23     95.89
G.711                        89.54    93.86     95.31     96.22     96.51
G.726                        90.43    94.33     95.77     96.48     96.83
G.729                        85.19    90.86     94.03     94.98     95.43
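The comparison in Table-3 can be made more concrete by converting accuracies into error rates and computing the relative error reduction between two rows. The sketch below is only an illustration. It assumes that the five result columns correspond, in order, to the CI models and the CD models with 1, 2, 4 and 8 Gaussians per state, and that the values are percentage accuracies. The function names are hypothetical.

```python
# Assumed column order for the five values in each Table-3 row.
MODELS = ["CI", "CD-1gau", "CD-2gau", "CD-4gau", "CD-8gau"]

# ASR accuracies (%) copied from Table-3 (G.711-coded 8-kHz HMMs).
TABLE3 = {
    "Un-coded": [86.11, 91.41, 94.29, 95.23, 95.89],
    "G.711":    [89.54, 93.86, 95.31, 96.22, 96.51],
    "G.726":    [90.43, 94.33, 95.77, 96.48, 96.83],
    "G.729":    [85.19, 90.86, 94.03, 94.98, 95.43],
}

def relative_error_reduction(baseline_acc, new_acc):
    """Relative reduction (%) of the error rate 100 - accuracy."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err

def best_per_model(table):
    """Name of the test condition with the highest accuracy in each column."""
    return {model: max(table, key=lambda row: table[row][i])
            for i, model in enumerate(MODELS)}

print(best_per_model(TABLE3))
print(round(relative_error_reduction(TABLE3["Un-coded"][0], TABLE3["G.726"][0]), 1))
```

With these figures, G.726 is the best row in every column, and for the CI models its error rate is roughly 31% lower in relative terms than that of the un-coded data.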

Figure-1. Graphic results of testing the wireline coded data (NB codecs) with G.711-coded trained models (8-kHz HMMs).

Case-2: Testing of the coded data with G.729-coded models (8-kHz HMMs)
The 8-kHz un-coded speech data and the speech data coded with the wireline codecs G.711, G.726 and G.729 are tested with the G.729-coded models (8-kHz HMMs), for the CI models and for the CD-tied tri-phone models with 1, 2, 4 and 8 Gaussians per state. The results are reported in Table-4.

Table-4. Results of testing the wireline coded data (NB codecs) with G.729-coded trained models (8-kHz HMMs).

Coded data used in testing   CI       CD-1gau   CD-2gau   CD-4gau   CD-8gau
Un-coded                     89.54    93.95     95.63     96.5      96.56
G.711                        89.208   93.409    95.22     96.281    96.494
G.726                        89.063   93.767    95.241    96.716    96.57
G.729                        89.952   94.656    96.026    96.853    96.942

It can be inferred from the comparative results shown in Figure-2 for G.711, G.726, G.729 and un-coded data, tested with the 8-kHz G.729-coded HMMs, that the impact of the respective coding is minimal.

Figure-2. Graphic results of testing the wireline coded data (NB codecs) with G.729-coded trained models (8-kHz HMMs).

Case-3: Testing of the coded data with HR-coded models (8-kHz HMMs)
The 8-kHz un-coded speech data and the speech data coded with the NB wireless codecs, such as FR, EFR, HR and AMR, are tested with the HR-coded models (8-kHz HMMs), for the CI models and for the CD-tied tri-phone models with 1, 2, 4 and 8 Gaussians per state. The results are reported in Table-5.

Table-5. Results of testing the wireless coded data (NB codecs) with HR-coded trained models (8-kHz HMMs).

Coded data used in testing   CI       CD-1gau   CD-2gau   CD-4gau   CD-8gau
Un-coded                     90.35    94.13     95.86     96.74     96.75
FR                           93.2     96.26     97.3      97.7      97.55
EFR                          91.31    94.73     96.28     96.9      96.92
HR                           90.81    95.39     96.59     97.32     97.36
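Taken together, Table-3 and Table-4 give a rough sense of how much it helps to match the codec of the training data to the codec of the test data. The sketch below compares the two wireline codecs that appear both as test data and as training condition. It is an illustration only: it assumes the first result column of each table holds the CI accuracy, and the dictionary layout and function name are hypothetical.

```python
# CI accuracies (%) keyed by (codec of test data, codec of the trained HMMs).
# Values copied from the first result column of Table-3 and Table-4
# (assumed to be the CI models).
CI_ACCURACY = {
    ("G.711", "G.711"): 89.54,   # Table-3, matched condition
    ("G.729", "G.711"): 85.19,   # Table-3, mismatched condition
    ("G.711", "G.729"): 89.208,  # Table-4, mismatched condition
    ("G.729", "G.729"): 89.952,  # Table-4, matched condition
}

def matched_gain(test_codec, other_model_codec):
    """Accuracy gain (percentage points) from testing with models trained
    on the same codec instead of models trained on another codec."""
    matched = CI_ACCURACY[(test_codec, test_codec)]
    mismatched = CI_ACCURACY[(test_codec, other_model_codec)]
    return matched - mismatched

print(round(matched_gain("G.729", "G.711"), 3))
print(round(matched_gain("G.711", "G.729"), 3))
```

On these CI figures the gain from matching is several percentage points for G.729-coded test data and well under one point for G.711-coded test data.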

FR gives the highest accuracy when compared with EFR, HR and un-coded data, as shown in Figure-3. HR gives the lowest accuracy of the coded data for the CI models, although it is close to EFR for CD-8gau.

Figure-3. Graphic results of testing the wireless coded data (NB codecs) with HR-coded trained models (8-kHz HMMs).

Case-4: Testing of the coded data with AMR4.75-coded models (8-kHz HMMs)
The 8-kHz un-coded speech data and the speech data coded with the NB wireless codecs, such as FR, EFR, HR and AMR, and with the wireline codecs, such as G.711, G.726 and G.729, are tested with the AMR@4.75 kbps coded models (8-kHz HMMs), for the CI models and for the CD-tied tri-phone models with 1, 2, 4 and 8 Gaussians per state. The results are reported in Table-6.

Table-6. Results of testing the wireless coded data (NB codecs) with AMR@4.75-coded trained models (8-kHz HMMs).

Coded data used in testing   CI       CD-1gau   CD-2gau   CD-4gau   CD-8gau
Un-coded                     89.02    93.54     95.33     96.315    96.57
FR                           91.31    95.43     96.77     97.3      97.49
EFR                          89.77    94.48     95.8      96.48     96.85
HR                           88.71    93.9      95.59     96.51     96.6
AMR@4.75                     89.29    94.71     96.04     96.77     97.05
AMR@12.2                     90.49    94.48     95.98     96.5      96.98
G.711                        89.62    94.58     95.88     96.3      96.83
G.726                        90.03    94.97     95.97     96.69     97.06
G.729                        89.06    93.64     95.45     96.53     96.76

When accuracy is compared as shown in Figure-4, it is evident that FR gives the highest accuracy, while the un-coded data gives comparatively poor results, when the wireless coded data (NB codecs) is tested with the AMR@4.75-coded trained models (8-kHz HMMs).

Figure-4. Graphic results of testing the wireless coded data (NB codecs) with AMR@4.75-coded trained models (8-kHz HMMs).

Case-5: Testing of the AMR coded data with AMR12.2-coded models (8-kHz HMMs)
The 8-kHz un-coded speech data and the speech data coded with the NB wireless codecs, such as FR, EFR, HR and AMR, are tested with the AMR@12.2 kbps coded models, for the CI models and for the CD-tied tri-phone models with 1, 2, 4 and 8 Gaussians per state. The results are reported in the following table.

Table-7. Results of testing the AMR coded data (NB codecs) with AMR@12.2-coded trained models (8-kHz HMMs).

Coded data used in testing   CI       CD-1gau   CD-2gau   CD-4gau   CD-8gau
Un-coded                     88.96    93.16     95.33     96.2      96.58
FR                           91.41    95.02     96.7      97.05     97.39
EFR                          89.93    94.48     95.88     96.43     96.97
HR                           88.62    93.77     95.45     96.11     96.48
AMR@4.75                     88.89    93.94     95.57     96.31     96.84
AMR@12.2                     90.02    94.13     95.68     96.44     96.94

Figure-5. Graphic results of testing the wireless coded data (NB codecs) with AMR@12.2-coded trained models (8-kHz HMMs).

When the wireless coded data (NB codecs) is tested with the AMR@12.2-coded trained models (8-kHz HMMs), and in line with the previous analysis in Case-4, FR again gives higher accuracy than EFR, HR, AMR@4.75 and AMR@12.2, as well as the un-coded data.

4. CONCLUSIONS
ASR accuracies were obtained for un-coded and coded data tested with different coded models, including 8-kHz and 16-kHz HMMs. The major observations are as follows. The ASR accuracy always increases with 8-kHz coded trained models when compared to 8-kHz un-coded models, for all the narrowband codecs. The ASR accuracy of coded data of any particular codec increases by at least 2% when the same type of coded models is used. The ASR results for coded data of specific codecs, such as G.711, G.729, HR and AMR12.2, for un-coded and respective coded models, are re-organized to see the ASR improvements. The conditions compared are:

- coded data (G.711/G.729/HR/AMR12.2) tested with 8-kHz un-coded HMMs;
- coded data (G.711/G.729/HR/AMR12.2) tested with 16-kHz un-coded HMMs;
- coded data (G.711/G.729/HR/AMR12.2) tested with 8-kHz coded (G.711/G.729/HR/AMR12.2) HMMs;
- coded data (G.711/G.729/HR/AMR12.2) tested with 16-kHz coded (G.711/G.729/HR/AMR12.2) HMMs.

All these codecs perform well with the respective 8-kHz coded models. The ASR performance is almost the same when tested with either 16-kHz coded models or 8-kHz un-coded models. The ASR performance is poor when tested with 16-kHz un-coded models.

REFERENCES
[1] X. Huang, A. Acero and H. W. Hon. 2001. Spoken Language Processing: A Guide to Theory, Algorithm and System Development. Prentice Hall.
[2] K. W. Church and R. L. Mercer. 1993. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. Computational Linguistics. 19(1): 1-24.
[3] ETSI. 2007. Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Advanced Front-End Feature Extraction Algorithm; Compression Algorithms.
[4] J. S. Garofolo et al. 1993. TIMIT Acoustic-Phonetic Continuous Speech Corpus. Linguistic Data Consortium, Philadelphia, USA.
[5] J. S. Garofolo et al. 1996. FFMTIMIT. Linguistic Data Consortium, Philadelphia, USA.
[6] 3GPP. 1999. Mandatory Speech Codec Speech Processing Functions: AMR Speech Codec; Transcoding Functions (3G TS 26.090).
[7] C. Jankowski et al. 1990. NTIMIT: A Phonetically Balanced, Continuous Speech, Telephone Bandwidth Speech Database. In Proc. of ICASSP. pp. 109-112.
[8] K.-F. Lee and H.-W. Hon. 1989. Speaker-Independent Phone Recognition Using Hidden Markov Models. IEEE Transactions on Acoustics, Speech and Signal Processing. 37(11): 1641-1648.
[9] N. Morales et al. 2008. STC-TIMIT: Generation of a Single-channel Telephone Corpus. In Proc. of LREC. pp. 391-395.
[10] D. A. Reynolds. 1997. HTIMIT and LLHDB: Speech Corpora for the Study of Handset Transducer Effects. In Proc. of ICASSP. 2: 1535-1538.
[11] P. Bauer and T. Fingscheidt. 2010. WTIMIT 1.0. Linguistic Data Consortium, Philadelphia.