Michael Dorman Department of Speech and Hearing Science, Arizona State University, Tempe, Arizona 85287

On the number of channels needed to understand speech

Philipos C. Loizou a)
Department of Electrical Engineering, University of Texas at Dallas, Richardson, Texas 75083-0688

Michael Dorman
Department of Speech and Hearing Science, Arizona State University, Tempe, Arizona 85287

Zhemin Tu
Department of Applied Science, University of Arkansas at Little Rock, Little Rock, Arkansas 72204-1099

(Received 5 December 1998; revised 7 April 1999; accepted 21 May 1999)

Recent studies have shown that high levels of speech understanding could be achieved when the speech spectrum was divided into four channels and then reconstructed as a sum of four noise bands or sine waves with frequencies equal to the center frequencies of the channels. In these studies speech understanding was assessed using sentences produced by a single male talker. The aim of experiment 1 was to assess the number of channels necessary for a high level of speech understanding when sentences were produced by multiple talkers. In experiment 1, sentences produced by 135 different talkers were processed through n (2 ≤ n ≤ 16) channels, synthesized as a sum of n sine waves with frequencies equal to the center frequencies of the filters, and presented to normal-hearing listeners for identification. A minimum of five channels was needed to achieve a high level (90%) of speech understanding. Asymptotic performance was achieved with eight channels, at least for the speech material used in this study. The outcome of experiment 1 demonstrated that the number of channels needed to reach asymptotic performance varies as a function of the recognition task and/or need for listeners to attend to fine phonetic detail. In experiment 2, sentences were processed through 6 and 16 channels and quantized into a small number of steps. The purpose of this experiment was to investigate whether listeners use across-channel differences in amplitude to code frequency information, particularly when speech is processed through a small number of channels.
For sentences processed through six channels there was a significant reduction in speech understanding when the spectral amplitudes were quantized into a small number (<8) of steps. High levels (92%) of speech understanding were maintained for sentences processed through 16 channels and quantized into only 2 steps. The findings of experiment 2 suggest an inverse relationship between the importance of spectral amplitude resolution (number of steps) and spectral resolution (number of channels). © 1999 Acoustical Society of America. [S0001-4966(99)01810-X]

PACS numbers: 43.72.Ar, 43.71.Es [JMH]

J. Acoust. Soc. Am. 106 (4), Pt. 1, October 1999. 0001-4966/99/106(4)/2097/7/$15.00

a) Electronic mail: loizou@utdallas.edu

INTRODUCTION

Dudley (1939) provided one of the earliest demonstrations that speech understanding does not require a highly detailed spectral representation of the speech signal. After bandpass filtering the speech signal into ten spectral bands, Dudley (1939) estimated the envelopes of the bandpassed waveforms using rectification and low-pass filtering (20-Hz cutoff). Speech was synthesized by filtering an excitation signal (either buzz or hiss) through the same bandpass filters, and amplitude modulating the outputs of the filters by the envelopes of the bandpassed waveforms. The resulting speech was highly intelligible. Dudley (1939) concluded that much of the information in the speech spectrum is redundant. The channel vocoder approach, pioneered by Dudley, was later exploited for efficient transmission of speech over telephone channels (see review by Schroeder, 1966; Flanagan, 1972).

In the 1950s, researchers at Haskins Laboratories used a 50-component sine wave synthesizer to investigate the minimal cues necessary for the recognition of speech. Investigators showed that speech could be recognized with a high degree of accuracy when sine waves specifying only the first two or three formants of the signal were presented (e.g., Delattre et al., 1952). In these experiments as few as four or six sine wave components out of 50 were sufficient to create intelligible speech, if the sine wave components specified harmonics at or near the formant frequencies of the signal. Remez et al. (1981), elaborating on earlier work on syllable recognition by Cutting (1974) and Bailey et al. (1976), carried the minimal cues approach to one extreme by replacing the rich harmonic structure of speech with only three sine waves at the formant frequencies of the consonants and vowels in the words of sentences. Most listeners were able to identify the words with high accuracy.

The aforementioned studies, and many others (e.g., Hill et al., 1968), provide overwhelming evidence that speech recognition does not require the fine spectral detail present in naturally produced utterances. This fortunate circumstance has proved essential in restoring speech understanding to deaf individuals fitted with cochlear implants, because it is not currently possible to provide fine spectral detail to implant patients. However, in the context of signal processing for cochlear implants, it is still unclear as to how little or how much spectral detail is necessary to allow speech understanding at a high level.

In the work cited above, high levels of speech understanding were obtained if signals were filtered into a reasonably large number of frequency bands and/or a small number of sine waves were output at or near the formant frequencies. Such a strategy is implemented in one of the two current signal processing strategies used for cochlear implants (McDermott et al., 1992; Loizou, 1998). The other signal processing strategy used for cochlear implants divides the speech spectrum into a small number of bands, 4 to 12 depending on the device, and, instead of picking high amplitude channels, transmits the energy in all of the bands. This strategy is the focus of this article. At issue is how many channels of stimulation are necessary to achieve a high level of speech understanding in quiet.

Shannon et al. (1995) showed that high levels of speech understanding (e.g., 90% correct for sentences) could be achieved using as few as four spectral bands. In Shannon et al. (1995) envelopes of the speech signal were extracted from a small number (1–4) of frequency bands, and used to modulate noise of the same bandwidth. The noise-modulated bands preserved the temporal cues within each band but eliminated the spectral details within each band. Dorman et al. (1997) synthesized speech as a sum of a small number of sine waves rather than noise bands. As in Shannon et al. (1995), sentence recognition using four channels was found to be 90% correct.

In Shannon et al. (1995) and Dorman et al. (1997) speech understanding was assessed using sentences produced by a single male speaker. It is very likely that the use of a single speaker overestimates the speech perception abilities of listeners in real-world situations because the use of a single speaker eliminates the need for listeners to accommodate to variability in the acoustic signal (e.g., Mullennix et al., 1989; Sommers et al., 1997).
Variability in the acoustic signal arises from differences in the size and shape of vocal tracts, differences in phonetic realization (e.g., pronunciation), and differences in speaking rate. The aim of experiment 1 was to determine the number of channels of stimulation necessary to allow a high level of sentence understanding when speech was produced by 135 talkers, half of whom were female.

The aim of experiment 2 was to assess the intelligibility of speech processed through 6 and 16 channels and quantized into a small number of steps. The purpose of this experiment was to assess the importance of amplitude resolution for speech understanding when signals are processed into a relatively small, and a relatively large, number of channels. Our hypothesis was that a relatively high degree of amplitude resolution is a necessary condition for speech understanding when signals are processed into a small number of channels because, with a small number of channels, listeners must use differences in signal levels across channels to infer the location of formant frequencies (Dorman et al., 1997; Loizou et al., 1998). In contrast, when speech is processed into a large number of channels, a high level of spectral amplitude resolution is not necessary because the location of frequencies in the input spectrum is well specified by the channels which contain energy.

The outcome of experiment 2 is of interest because a recent experiment by Nelson et al. (1996) with cochlear implant subjects showed that the total number of discriminable intensity steps varied from a low of 6 to a high of 45. If a high degree of amplitude resolution is necessary for frequency analysis when speech is processed into a small number of channels, then it is possible that speech perception in some cochlear implant subjects is constrained by a limited ability to resolve differences in signal level across channels.

I. EXPERIMENT 1

A. Method

1. Subjects

Nine graduate students from the Applied Science Department, UALR, served as subjects. All of the subjects were native speakers of American English and had normal hearing.
The subjects were paid for their participation.

2. Sentence material

The multi-talker TIMIT database (Garofolo et al., 1993) was used for testing. The TIMIT database contains speech from 630 speakers, representing 8 major dialect divisions of American English, each speaking 10 phonetically rich sentences. Some of the sentences were designed to provide a good coverage of pairs of phones with extra occurrences of difficult phonetic contexts and some of the sentences were designed to maximize the variety of allophonic contexts (Lamel et al., 1986). A total of 135 sentences were randomly selected from the TIMIT database from the DR3 (north midland) dialect region. The sentences were produced by an equal number of female and male speakers (one sentence per speaker). The 135 sentences were divided into 9 lists (1 list per channel condition), with 15 sentences in each list. Fifteen sentences were used for the first channel condition, 15 different sentences were used for the second channel condition, etc. There were eight sentences spoken by eight different male speakers and seven sentences spoken by seven different female speakers within each list. Each sentence contained, on the average, 7 words, and the 15 sentences in each list contained, on the average, a total of 100 words. Each subject listened to a total of 135 sentences (15 sentences/condition × 9 channel conditions).

3. Signal processing

Signals were first processed through a pre-emphasis filter (2000-Hz cutoff), with a 3-dB/octave rolloff, and then bandpassed into n frequency bands (n = 2, 3, 4, 5, 6, 8, 10, 12, 16) using sixth-order Butterworth filters. Logarithmic filter spacing was used for n < 8 and mel spacing¹ was used for n ≥ 8. Logarithmic and semi-logarithmic (mel) filter spacing was used because: (1) the filter bandwidths can be computed systematically; and (2) it is the type of filter spacing used in current cochlear implant devices (e.g., Zierhofer et al., 1994; Loizou, 1998).
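The remark that logarithmically spaced filter bandwidths "can be computed systematically" can be illustrated with a short Python sketch. It is not the authors' code. The function name and the 300–5500 Hz analysis range are assumptions: the range is not stated in the text and was inferred here from the tabulated center frequencies and bandwidths. With that range, log-spaced band edges and arithmetic band midpoints appear to reproduce the tabulated values for n < 8 to within about a hertz.

```python
import math


def log_spaced_bands(n_channels, f_low=300.0, f_high=5500.0):
    """Return (center, bandwidth) pairs in Hz for n logarithmically spaced bands.

    f_low and f_high are assumptions inferred from the tabulated filter
    values; they are not stated in the text.
    """
    # Constant ratio between consecutive band edges.
    ratio = (f_high / f_low) ** (1.0 / n_channels)
    edges = [f_low * ratio ** k for k in range(n_channels + 1)]

    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        center = 0.5 * (lo + hi)  # arithmetic midpoint of the two edges
        bandwidth = hi - lo       # distance between the two edges
        bands.append((center, bandwidth))
    return bands


if __name__ == "__main__":
    for center, bandwidth in log_spaced_bands(4):
        print(f"center {center:7.1f} Hz, bandwidth {bandwidth:7.1f} Hz")
```

The mel-spaced conditions (n ≥ 8) follow a different rule and are not covered by this sketch.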

TABLE I. The center frequencies (Hz) of the filters.

No. of    Channel
channels     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16
2          792  3392
3          545  1438  3793
4          460   953  1971  4078
5          418   748  1339  2396  4287
6          393   639  1037  1685  2736  4444
8          394   692  1064  1528  2109  2834  3740  4871
10         322   546   814  1137  1524  1988  2545  3213  4014  4976
12         274   453   662   905  1190  1521  1908  2359  2885  3499  4215  5050
16         216   343   486   647   828  1031  1260  1518  1808  2134  2501  2914  3378  3901  4489  5150

The center frequencies and the 3-dB bandwidths of the filters are given in Tables I and II, respectively. The envelope of the signal was extracted by full-wave rectification, and low-pass filtering (second-order Butterworth) with a 400-Hz cutoff frequency. Sinusoids were generated with amplitudes equal to the root-mean-square (rms) energy of the envelopes (computed every 4 ms) and frequencies equal to the center frequencies of the bandpass filters. The phases of the sinusoids were estimated from the FFT of the speech segment² (McAulay and Quatieri, 1986). The sinusoids of each band were finally summed and the level of the synthesized speech segment was adjusted to have the same rms value as the original speech segment.

4. Procedure

The experiment was performed on a PC equipped with a Creative Labs SoundBlaster 16 soundcard. The subjects listened to the sentences via closed ear-cushion headphones at a comfortable level set by the subject. A graphical interface was used that allowed the subjects to type the words they heard. After listening to each sentence, subjects were asked to type in as many words as they could understand. Before each channel condition, subjects were given a practice session with examples of ten sentences processed through the same number of channels in that condition. None of the sentences used in the practice was used in the test. A sequential test order, starting with sentences processed through a large number of channels (n = 16) and continuing to sentences processed through a small number of channels (n = 2), was employed.
We chose this sequential test design to give the subjects time to adapt to listening to the altered speech signals. There is no doubt a warm-up effect when listening to sine wave speech of any kind.

TABLE II. The 3-dB bandwidths (Hz) of the filters.

No. of    Channel
channels     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16
2          984  4215
3          491  1295  3414
4          321   664  1373  2842
5          237   423   758  1356  2426
6          187   304   493   801  1301  2113
8          265   331   431   516   645   805  1006  1257
10         204   244   293   352   422   506   607   729   874  1049
12         165   193   225   262   306   357   416   486   567   661   771   900
16         120   135   151   170   192   216   242   273   307   345   389   437   492   553   622   700

B. Results and discussion

The subjects' responses were scored as percentage of words correct. The results are shown in Fig. 1. A repeated measures analysis of variance indicated a main effect [F(8,64) = 261.94, p < 0.0001] for number of channels. Post hoc tests according to Scheffé showed no statistically significant differences in scores when the number of channels was increased beyond eight. There was a significant difference (p = 0.001) between the scores obtained with six and eight channels. There was no significant difference between the scores obtained with five and six channels.

FIG. 1. Sentence understanding (percent correct) as a function of number of channels. Error bars indicate ±1 standard deviation.

Speech recognition performance with four channels was 63% correct. This score was significantly lower than the score (90%) reported by Shannon et al. (1995) and the score (90%) reported by Dorman et al. (1997) using sentences from the H.I.N.T. database produced by a single male talker. This outcome, as well as others, demonstrates that the number of channels necessary to reach asymptotic performance varies as a function of the task and/or need for a listener to attend to acoustic/phonetic detail. In our study, the task was recognition of speech produced by multiple speakers.

Four channels did not seem to be sufficient for achieving a high level of sentence understanding. To see why, consider, in Fig. 2(a), the channel spectrum of the vowel /ɛ/ ("head"), spoken by a male talker, and processed through four channels. Four channels are sufficient to code the frequency of F1 and F2. The F1 of /ɛ/ is coded by a high amplitude in channel one, and a low amplitude in channel two. The F2 of /ɛ/ is coded by a high amplitude in channel three, and a low amplitude in channels two and four. Now, consider the four-channel spectrum [Fig. 2(b)] of the vowel /ɛ/ produced by a female talker. In this case, four channels are not sufficient for coding F2 information, since channel three is no longer a peak in the spectrum. Figures 2(c) and 2(d) show the channel spectra of the same vowels processed through five channels. The F2 information is coded adequately for both male and female vowels. The F2 is coded by a high amplitude in channel four, and a low amplitude in channels three and five [see Fig. 2(c)]. Most generally, four-channel processors use two channels (channels three and four) for coding F2 and the other high-frequency information needed for consonant recognition, while five-channel processors use three channels (channels three, four, and five). Overall, our results suggest that a minimum of three channels is needed to code F2 and/or high-frequency information for multi-talker speech recognition.

FIG. 2. The channel spectra of the vowel /ɛ/ ("head") produced by a male and a female talker. The spectra in (a) and (b) were generated using a four-channel processor, and the spectra in (c) and (d) were generated using a five-channel processor. The filled triangles indicate the formant frequencies of the vowels.

It is possible that four channels might yield higher levels of speech understanding if the filter spacing were optimized. Shannon et al. (1998) showed that there was a significant difference in sentence recognition scores as a function of three filter spacings (linear, logarithmic, and intermediate) of a four-channel processor.
This filter optimization, however, can only be tailored for a particular speaker, e.g., a particular female or male, and is therefore not practical for real-world situations where multiple talkers must be accommodated.

Although five channels achieved high levels (90%) of intelligibility, asymptotic performance was not achieved until eight channels were used. Increasing the number of channels beyond eight did not improve speech intelligibility, but did improve the subjective quality of speech. The finding that eight channels are needed to reach asymptotic performance is consistent with the study by Dorman et al. (1997) who showed that eight channels were needed to reach asymptote for multi-talker vowel recognition.

Training (i.e., practice) is a factor that needs to be taken into account when interpreting the above results, since the normal-hearing listeners were not accustomed to listening to speech containing limited spectral/temporal information. The order of the test conditions was purposely confounded with the amount of experience in listening to the altered speech signals because it was felt that giving listeners additional practice before encountering signals with the least spectral information would maximize performance in the most difficult listening situations.

II. EXPERIMENT 2

Experiment 1 showed that a high level (90%) of intelligibility can be achieved using processors with five or more channels of stimulation. This finding is surprising given that the processors did not track or follow formant frequencies, like the pattern playback or the Remez et al. sine wave synthesizer. In the Remez et al. synthesizer, for instance, three sine waves trace out three formant frequencies in each update cycle. In contrast, the processors used in experiment 1 generated sine waves in each cycle (4 ms) at fixed frequencies (Table I). The only parameter that varied from cycle to cycle was the amplitudes of the sine waves. The frequencies of the sine waves coincided with the formant frequencies of speech only by chance and only rarely.
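As a rough illustration of the fixed-frequency synthesis just described, the following Python sketch sums sinusoids at fixed center frequencies and updates only their amplitudes once per 4-ms cycle. It is a simplified sketch, not the authors' implementation. The function name, the 16-kHz sampling rate, and the zero starting phase are assumptions made for illustration; in the study the phases were estimated from the FFT of the speech segment, which this sketch does not attempt.

```python
import math


def synthesize(frame_amps, center_freqs, fs=16000, frame_ms=4.0):
    """Sum fixed-frequency sinusoids whose amplitudes change once per frame.

    frame_amps   -- one list of per-channel amplitudes for each 4-ms frame
    center_freqs -- fixed sinusoid frequencies in Hz, one per channel
    fs, frame_ms -- assumed sampling rate (Hz) and frame length (ms)
    """
    frame_len = int(round(fs * frame_ms / 1000.0))
    samples = []
    for frame_idx, amps in enumerate(frame_amps):
        for k in range(frame_len):
            # Absolute time keeps each sinusoid phase-continuous across frames.
            t = (frame_idx * frame_len + k) / fs
            samples.append(sum(a * math.sin(2.0 * math.pi * f * t)
                               for a, f in zip(amps, center_freqs)))
    return samples


if __name__ == "__main__":
    centers = [460.0, 953.0, 1971.0, 4078.0]  # four-channel row of Table I
    frames = [[1.0, 0.2, 0.6, 0.1], [0.8, 0.3, 0.5, 0.1]]
    print(len(synthesize(frames, centers)), "samples")
```

Because the frequencies never move, any information about formant location has to be carried by the pattern of amplitudes across channels.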
This circumstance raises the question: how is information coded in the frequency domain with processors that do not track formant frequencies? As pointed out by Dorman et al. (1997), the relative differences in across-channel amplitudes must be used to code frequency information. On this view, if amplitude resolution were to be distorted, then speech recognition ought to decline. This hypothesis was tested in experiment 2 where the channel amplitudes of a six-channel processor were quantized to a finite number (2, 4, 8, 16) of steps. At issue was how many discriminable steps are needed to maintain high levels of speech intelligibility when speech is processed through a small number of channels. The answer to that question is of interest because it could provide some insight into whether the speech perception abilities of some cochlear implant patients are limited by electrode dynamic range or the number of discriminable intensity steps within the dynamic range (Nelson et al., 1998).

It is reasonable to expect that the number of steps used to code amplitude information within a channel will be less important when speech is processed through a large number of channels than when processed through a small number of channels. This is because in the case of a large number of channels, signal frequency will be indicated by the channel or channels with significant energy. To test this hypothesis we processed speech through 16 channels, and quantized the channel amplitudes into 2–16 steps. At issue was whether the same number of steps are needed to maintain high levels of speech intelligibility for speech processed through a large number (16) of channels and through a small number (6) of channels.

A. Method

1. Subjects

The same subjects as in experiment 1 were used.

2. Sentence material

One hundred and fifty new sentences from the TIMIT database, produced by an equal number of female and male speakers, were randomly selected. Seventy-five sentences were used for the 6-channel processor and 75 sentences for the 16-channel processor. The 75 sentences used in each experiment were divided into five lists with 15 sentences in each list: one list was used for each of the four quantized conditions (Q = 2, 4, 8, 16 levels), and one list was used for the unquantized condition. The subjects listened to a total of 150 sentences, 75 sentences processed through 6 channels, and 75 sentences processed through 16 channels.

3. Quantization and signal processing

The envelope dynamic range of speech processed through a finite number of channels differs from channel to channel. For that reason, different quantization step sizes are needed for each channel. We first determined the amplitude dynamic range of each channel by computing envelope histograms of 100 TIMIT sentences. The TIMIT sentences were scaled so that all sentences had the same peak amplitude. The maximum envelope amplitude in each channel, denoted as $X_{\max}^{i}$ where $i$ is the channel number, was chosen to include 99% of all amplitude counts in that channel. The minimum envelope amplitude ($X_{\min}^{i}$) was set 0.5 dB above the rms value of the noise floor.
The $X_{\max}^{i}$ and $X_{\min}^{i}$ values were then used to estimate the quantization step size, $\Delta_i$, of each channel as follows:

\[
\Delta_i = \frac{X_{\max}^{i} - X_{\min}^{i}}{Q - 1}, \qquad i = 1, 2, \ldots, N,
\]

where $Q$ is the number of quantization levels or steps, and $N$ is the number of channels (6 or 16 in our case). Note that each channel had a different value for $X_{\max}^{i}$ and $X_{\min}^{i}$ since the envelope dynamic range of each channel was different. Consequently, the step sizes were different in each channel.

The quantized version of the six-channel sine wave processor was implemented as follows. Six envelope amplitudes were computed as before by pre-emphasizing the signal, bandpass filtering the signal into six logarithmic frequency bands (Table I), full-wave rectifying the bandpassed waveforms, and low-pass filtering (400 Hz) the rectified waveforms. The envelope amplitudes were then uniformly quantized to Q discrete levels (Q = 2, 4, 8, 16). Sine waves were generated with amplitudes equal to the quantized envelope amplitudes, and frequencies equal to the center frequencies of the bandpass filters. The phases of the sinusoids were estimated from the FFT of the speech segment (McAulay and Quatieri, 1986). The sinusoids of each band were finally summed and the level of the synthesized speech segment was adjusted to have the same rms value as the original speech segment.

The quantized version of the 16-channel sine wave processor was implemented as follows. Sixteen envelope amplitudes were computed as before by pre-emphasizing the signal, bandpass filtering the signal into 16 frequency bands (Table I), full-wave rectifying the bandpassed waveforms, and low-pass filtering (400 Hz) the rectified waveforms. Of the 16 envelopes computed, the six envelopes with the largest amplitude were selected in each 4-ms cycle.³ The six selected envelope amplitudes were then uniformly quantized to Q discrete levels (Q = 2, 4, 8, 16). Sine waves were generated with amplitudes equal to the quantized envelope amplitudes, and frequencies equal to the center frequencies of the selected bandpass filters. The phases of the sinusoids were estimated from the FFT of the speech segment.
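The per-channel uniform quantization and the selection of the six largest envelopes described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code. All function names are hypothetical. The text says only that quantization was uniform, so clipping to the channel's range and rounding to the nearest level are assumptions, as is setting the unselected channels to zero.

```python
import math


def step_size(x_min, x_max, q_levels):
    """Step size for one channel: (x_max - x_min) / (Q - 1)."""
    return (x_max - x_min) / (q_levels - 1)


def quantize(value, x_min, x_max, q_levels):
    """Map an envelope amplitude to the nearest of q_levels uniform levels.

    Clipping to [x_min, x_max] and nearest-level rounding are assumptions.
    """
    delta = step_size(x_min, x_max, q_levels)
    clipped = min(max(value, x_min), x_max)
    index = math.floor((clipped - x_min) / delta + 0.5)
    return x_min + index * delta


def select_largest(envelopes, n_keep=6):
    """Indices of the n_keep largest envelope amplitudes, in channel order."""
    ranked = sorted(range(len(envelopes)),
                    key=lambda i: envelopes[i], reverse=True)
    return sorted(ranked[:n_keep])


def quantize_frame(envelopes, x_mins, x_maxs, q_levels, n_keep=6):
    """Quantize the selected channels of one frame; others are set to 0."""
    keep = set(select_largest(envelopes, n_keep))
    return [
        quantize(env, x_mins[i], x_maxs[i], q_levels) if i in keep else 0.0
        for i, env in enumerate(envelopes)
    ]
```

Setting `n_keep` equal to the number of channels skips the selection step, which corresponds to the six-channel processor where all channels are used.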
The sinusoids of the six selected bands were finally summed and the level of the synthesized speech segment was adjusted to have the same rms value as the original speech segment.

4. Procedure

The experiment was run in two independent 1½-h sessions. In the first session, the listeners were presented with a list of 75 sentences processed through the 6-channel processor: 60 quantized sentences (15 for each of the 4 conditions) and 15 unquantized sentences. In the second session, the listeners were presented with a list of 75 sentences processed through the 16-channel processor: 60 quantized sentences (15 for each of the 4 conditions) and 15 unquantized sentences. The quantized and the unquantized sentences, in both experiments, were completely randomized. A practice session preceded each test session, in which the listeners were presented with ten examples of sentences from each quantized condition. None of the sentences used in the practice session were used in the test session.

B. Results and discussion

The results for the 6- and 16-channel processors are shown in Fig. 3. A repeated measures analysis of variance on the data for the six-channel processor indicated a main effect [F(4,32) = 112.54, p < 0.0001] for the number of quantization steps. The mean scores were 41% correct for the 2-step condition, 52% correct for the 4-step condition, 80% correct for the 8-step condition, 83% correct for the 16-step condition, and 92% correct for the unquantized condition. Post hoc tests indicated that 4 steps allowed better performance than 2 steps, 8 steps allowed better performance than 4 steps, 8 and 16 steps produced scores which did not differ, and the unquantized signal allowed better scores than the signal processed into 16 steps. Relatively high levels of intelligibility were achieved using 8 levels (mean score 80% correct) and 16 levels (mean score 83% correct). These results are similar to the results found with early-model ten-channel vocoders (David, 1956), e.g., 82% correct with six levels of intensity quantization.

FIG. 3. Speech recognition with 6-channel (filled circles) and 16-channel (empty circles) processors as a function of the number of steps used to quantize the spectral amplitudes. "Inf" refers to the condition in which the spectral amplitudes were not quantized. Error bars indicate ±1 standard deviation.

A repeated measures analysis of variance on the data for the 16-channel processor indicated a main effect [F(4,32) = 7.67, p < 0.0001] for the number of quantization steps. Post hoc tests according to Scheffé showed that there was a significant difference (p = 0.002) between the scores obtained with two and four steps, 92% correct and 96% correct, respectively. There was no statistically significant difference between the scores obtained with four steps and a greater number of steps. Thus two steps were sufficient for achieving a high level (92%) of performance. This outcome is consistent with the findings of Drullman et al. (1995), who reported nearly perfect intelligibility when speech was processed through 24 ¼-octave bands, and the amplitude envelopes were quantized into two levels.

Our results and those of Drullman et al. (1995) suggest that poor amplitude resolution (defined in terms of the number of steps) does not have a large effect on intelligibility when speech is processed through a large number of channels. In contrast, when speech was processed into a small number (six) of channels, performance was poor (<55% correct) when the number of levels was smaller than eight. This outcome can be accounted for by the view that listeners must rely on relative amplitude differences across channels to infer frequency information when speech is processed into a small number of channels. If amplitude differences are distorted, then recognition accuracy will suffer.
On this view, cochlear implant patients who are able to use only a few channels of stimulation, and who are able to discriminate only a small number of intensity differences on each channel (Nelson et al., 1996), should find speech recognition relatively difficult.

III. GENERAL DISCUSSION

A. Number of channels

The results in experiment 1 showed that five channels are needed to achieve high levels of sentence understanding and eight channels are needed to reach asymptotic performance. The task at hand was recognition of TIMIT sentences produced by multiple speakers. It is very likely that the number of channels needed to reach asymptotic performance, as well as the shape of the performance-channels function, will depend on the speech material and on whether listeners are required to rely on phonetic detail. A different asymptote would be expected, for instance, if the task were nonsense-syllable recognition, since the listeners would need to attend to fine acoustic/phonetic detail in order to understand what was being said. The results of experiment 1 do not support a general conclusion that eight channels are needed for all types of speech material, but rather for recognition of syntactically well-formed and meaningful sentences produced by multiple speakers.

Other factors that could affect the number of channels needed to achieve a high level of sentence intelligibility include speaking rate, speaking style (conversational versus clear), and background noise. Higher speaking rates are often associated with reduced sentence understanding, and speaking clearly is associated with improved sentence understanding in noise for normal-hearing listeners (Tolhurst, 1955) and improved sentence understanding in quiet for hearing-impaired listeners (Picheny et al., 1985). Both speaking style and speaking rate deserve further study in the context of the number of channels necessary for speech understanding. Speech understanding in noise has been studied by Dorman et al. (1998) and by Fu et al. (1998). More channels are needed in noise than in quiet to achieve high levels of speech understanding.

B. Number of steps and number of channels

The findings obtained in experiment 2 with the 6- and 16-channel processors suggest an inverse relationship between the importance of spectral amplitude resolution and spectral resolution (defined in terms of the number of spectral channels available). Two levels of amplitude resolution were sufficient for nearly perfect intelligibility (92%) when speech was processed through 16 channels. However, eight or more levels were needed for high intelligibility when speech was processed through six channels. We have only investigated the effect of quantization on two extreme cases, i.e., a small number of channels and a large number of channels. Further studies are needed to complete our understanding of the effects of spectral amplitude resolution and spectral resolution on speech understanding.

IV. CONCLUSIONS

These studies have provided yet another demonstration that speech understanding does not require a detailed spectral representation of the speech signal. In experiment 1 we found that five channels of fixed-frequency stimulation allowed 90% identification accuracy for sentences produced by multiple speakers. Asymptotic performance was achieved with eight channels. In experiment 2 we found that the number of levels used to code spectral amplitude information has a significant effect on speech understanding. If speech is processed into a large number of channels, two levels of amplitude resolution are sufficient to achieve a high level of speech understanding. However, when speech is processed into a small number of channels, eight or more levels are necessary. Thus the number of channels of stimulation and the resolution of amplitude information within those channels trade off in determining the level of speech understanding allowed by signal processors which reduce the speech signal to a relatively small number of fixed-frequency channels.

ACKNOWLEDGMENTS

The authors would like to thank James Hillenbrand, Robert Shannon, and Steve Greenberg for providing valuable suggestions on earlier drafts of this paper. This research was supported by a Shannon award (R55 DC03421) from the National Institute on Deafness and Other Communication Disorders, NIH.

¹ For n ≥ 8, the filter bandwidths were computed according to the equation $1100 \log(f/800 + 1)$, where $f$ indicates the frequency in Hz. This is similar to the technical mel scale of Fant (1973), which is a variant of the critical band scale. As shown in Table II, the channel filter bandwidths, for n ≥ 8, are approximately 1/4 of an octave wide, which is roughly the bandwidth of the critical band. Logarithmic spacing was used for n < 8 to conform with the spacing used in current cochlear implant devices (e.g., Zierhofer et al., 1994).

² The phases of the sinusoids were computed from the FFT of the speech segment as follows. Let $\phi(k)$ be the phases of the FFT of a 4-ms speech segment. The phases $\theta(j)$ of the $N$ sinusoids, $N$ being the number of channels, were set equal to the phases of the FFT spectrum evaluated at frequencies closest to the center frequencies of the bandpass filters, i.e.,

$$\theta(j) = \phi\left(\left\lfloor f_j / r \right\rceil\right), \qquad j = 1, 2, \ldots, N,$$

where $f_j$ is the center frequency (Hz) of the $j$th bandpass filter (Table I), $r$ is the FFT resolution ($r$ = sampling frequency/FFT length) in Hz, and $\lfloor \cdot \rceil$ denotes the nearest integer. Due to the limited FFT resolution, the above equation only provides a rough estimate of the underlying sinewave phases.
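The nearest-bin phase lookup described in this footnote can be illustrated with a short Python sketch. It is a hedged example rather than the authors' code: the function names, the 16-kHz sampling frequency, the 64-sample segment, and the 1000-Hz center frequency are all hypothetical, the FFT length is assumed to equal the segment length, and a direct DFT sum stands in for an FFT routine so that the example needs only the standard library.

```python
import cmath
import math

def nearest_bin(f_center, fs, n_fft):
    """Index of the FFT bin closest to f_center: nearest integer to f_center / r."""
    r = fs / n_fft                       # FFT resolution in Hz
    return math.floor(f_center / r + 0.5)

def dft_phase(segment, k):
    """Phase in radians of DFT bin k of a real segment, computed by direct summation."""
    n = len(segment)
    x_k = sum(s * cmath.exp(-2j * math.pi * k * t / n) for t, s in enumerate(segment))
    return cmath.phase(x_k)

def sinusoid_phases(segment, centers_hz, fs):
    """Phase assigned to each channel's sine wave for one speech segment."""
    n_fft = len(segment)                 # assumption: FFT length equals segment length
    return [dft_phase(segment, nearest_bin(f, fs, n_fft)) for f in centers_hz]

# Hypothetical 4-ms segment at fs = 16 kHz (64 samples): a 1000-Hz cosine with phase 0.5 rad.
fs = 16000
segment = [math.cos(2 * math.pi * 1000 * t / fs + 0.5) for t in range(64)]
print(sinusoid_phases(segment, [1000], fs))
```

In this hypothetical setup the resolution is r = 250 Hz, so a center frequency that does not fall on a multiple of 250 Hz is mapped to a neighboring bin, which is why the footnote calls the result only a rough estimate.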
This estimate seems to be sufficient in our case, however, since we are only concerned with speech intelligibility rather than speech quality (see McAulay and Quatieri, 1995, for a discussion of alternative sinewave phase representations).

³ This spectral-maximum implementation was chosen to mimic the signal processing used in the Nucleus 22 cochlear implant processor (McDermott et al., 1992). In this processor, speech is processed through 16 channels, and the six channel amplitudes with the largest energy are selected in each cycle for electrical stimulation.

Bailey, P., Summerfield, Q., and Dorman, M. (1977). "On the identification of sine-wave analogues of certain speech sounds," Haskins Laboratories Status Report on Speech Perception, SR 51/52, 1-26.
Cutting, J. (1974). "Two left-hemisphere mechanisms in speech perception," Percept. Psychophys. 16, 601-612.
David, E. (1956). "Naturalness and distortion in speech-processing devices," J. Acoust. Soc. Am. 28, 586-589.
Delattre, F., Liberman, A., Cooper, F., and Gerstman, L. (1952). "An experimental study of the acoustic determinants of vowel color: Observations on one- and two-formant vowels synthesized from spectrographic displays," Word 8, 195-210.
Dorman, M., Loizou, P., and Rainey, D. (1997). "Speech intelligibility as a function of the number of channels of stimulation for signal processors using sine-wave and noise-band outputs," J. Acoust. Soc. Am. 102, 2403-2411.
Drullman, R. (1995). "Temporal envelope and fine structure cues for speech intelligibility," J. Acoust. Soc. Am. 97, 585-592.
Dudley, H. (1939). "Remaking speech," J. Acoust. Soc. Am. 11, 169-177.
Fant, G. (1973). Speech Sounds and Features (MIT Press, Boston).
Flanagan, J. (1972). Speech Analysis, Synthesis and Perception (Springer Verlag, New York).
Fu, Q-J., Shannon, R., and Wang, X. (1998). "Effects of noise and spectral resolution on vowel and consonant recognition: Acoustic and electric hearing," J. Acoust. Soc. Am. 104, 3586-3596.
Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., Pallett, D., and Dahlgren, N. (1993). "DARPA TIMIT: Acoustic-phonetic continuous speech corpus," NIST Technical Report (distributed with the TIMIT CD-ROM).
Hill, F., McRae, L., and McClellan, R. (1968). "Speech recognition as a function of channel capacity in a discrete set of channels," J. Acoust. Soc. Am. 44, 13-18.
Lamel, L., Kassel, R., and Seneff, S. (1986). "Speech database development: Design and analysis of the acoustic-phonetic corpus," Proc. of the DARPA Speech Recognition Workshop, Report No. SAIC-86/1546.
Loizou, P. (1998). "Mimicking the human ear: An overview of signal processing techniques for converting sound to electrical signals in cochlear implants," IEEE Signal Process. Mag. 15, 101-130.
Loizou, P., Dorman, M., and Powell, V. (1998). "The recognition of vowels produced by men, women, boys and girls by cochlear implant patients using a six-channel CIS processor," J. Acoust. Soc. Am. 103, 1141-1149.
McAulay, R., and Quatieri, T. (1986). "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans. Acoust., Speech, Signal Process. ASSP-34, 744-754.
McAulay, R., and Quatieri, T. (1995). "Sinusoidal coding," in Speech Coding and Synthesis, edited by W. Kleijn and K. Paliwal (Elsevier Science, New York).
McDermott, H., McKay, C., and Vandali, A. (1992). "A new portable sound processor for the University of Melbourne/Nucleus Limited multielectrode cochlear implant," J. Acoust. Soc. Am. 91, 3367-3371.
Mullennix, J., Pisoni, D., and Martin, C. (1989). "Some effects of talker variability on spoken word recognition," J. Acoust. Soc. Am. 85, 365-378.
Nelson, D., Schmitz, J., Donaldson, G., Viemeister, N., and Javel, E. (1996). "Intensity discrimination as a function of stimulus level with electric stimulation," J. Acoust. Soc. Am. 100, 2393-2414.
Picheny, M., Durlach, N., and Braida, L. (1985). "Speaking clearly for the hard of hearing I: Intelligibility differences between clear and conversational speech," J. Speech Hear. Res. 28, 96-103.
Remez, R., Rubin, P., Pisoni, D., and Carrell, T. (1981). "Speech perception without traditional cues," Science 212, 947-950.
Schroeder, M. (1966). "Vocoders: Analysis and synthesis of speech," Proc. IEEE 54, 720-734.
Shannon, R., Zeng, F.-G., Kamath, V., Wygonski, J., and Ekelid, M. (1995). "Speech recognition with primarily temporal cues," Science 270, 303-304.
Shannon, R., Zeng, F.-G., and Wygonski, J. (1998). "Speech recognition with altered spectral distribution of envelope cues," J. Acoust. Soc. Am. 104, 2467-2476.
Sommers, M., Kirk, K., and Pisoni, D. (1997). "Some considerations in evaluating spoken word recognition by normal-hearing, noise-masked normal-hearing, and cochlear implant listeners. I: The effects of response format," Ear Hear. 18, 89-99.
Tolhurst, G. (1955). "The effect of intelligibility scores of specific instructions regarding talking," USAM Report No. NM 001 064 01 35 (Naval Air Station, Pensacola, FL).
Zierhofer, C., Peter, O., Bril, S., Pohl, P., Hochmair-Desoyer, I., and Hochmair, E. (1994). "A multichannel cochlear implant system for high-rate pulsatile stimulation strategies," in Advances in Cochlear Implants, edited by I. Hochmair-Desoyer and E. Hochmair (International Interscience Seminars, Vienna), pp. 204-207.