Investigation of Several Types of Nonlinearities for Use in Stereo Acoustic Echo Cancellation

IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 6, SEPTEMBER 2001, pp. 686-696

Dennis R. Morgan, Senior Member, IEEE, Joseph L. Hall, and Jacob Benesty, Member, IEEE

Abstract: In this paper, we investigate several types of nonlinearities used for the unique identification of receiving room impulse responses in stereo acoustic echo cancellation. The effectiveness is quantified by the mutual coherence of the transformed signals. The perceptual degradation is studied by psychoacoustic experiments in terms of subjective quality and localization accuracy in the medial plane. The results indicate that, of the several nonlinearities considered, ideal half-wave rectification and smoothed half-wave rectification appear to be the best choices for speech. For music, the nonlinearity parameter of the ideal rectifier must be readjusted. The smoothed rectifier does not require this readjustment, but is a little more difficult to implement.

Index Terms: Acoustic echo cancellation, adaptive filters, nonlinearity, psychoacoustics, stereo.

Manuscript received June 2, 2000; revised April 30, 2001. The associate editor coordinating the review of this paper and approving it for publication was Dr. Michael S. Brandstein. D. R. Morgan and J. Benesty are with Bell Laboratories, Lucent Technologies, Murray Hill, NJ, USA (e-mail: drrm@bell-labs.com; jbenesty@bell-labs.com). J. L. Hall, retired, was with Bell Laboratories, Lucent Technologies, Murray Hill, NJ, USA.

I. INTRODUCTION

In stereo acoustic echo cancellation, there is a fundamental problem in uniquely identifying the receiving room impulse responses because of the coherence between the two loudspeaker signals [1]. This is of particular concern because, lacking proper identification, echo cancellation will depend on the impulse responses in the (actual or synthesized) transmission room. This means that one must track not only changes in the receiving room but also changes in the transmission room, which can be very rapid (e.g., when one person stops talking and another person starts).

One successful solution to the fundamental problem is to deliberately add a small amount of distortion to each channel, e.g., through half-wave rectification [2]. This distortion is effective in reducing the coherence and thereby enabling correct identification of the room responses, both for fullband stereo room-to-room conferencing [2]-[4] and for low frequencies only in a hybrid arrangement [5]. The technique has also been proposed for synthesized stereo in desktop conferencing [6], where the same identification problem arises even though one knows the synthesizing transfer functions. In all of these applications, the nonlinearity technique has been shown to be effective for enabling unique identification of the receiving room impulse responses, yet it is hardly audible for speech signals because of self-masking effects. Until now, however, the perceptual degradation has not been quantified.

In this paper, we compare the effectiveness of several nonlinearities for achieving the above objectives. It is known that the conditioning of the multichannel covariance matrix, i.e., the ratio of largest to smallest eigenvalues, determines the misalignment of the solution and the speed of convergence of any adaptive algorithm. In [2, App. A], a link is established between the coherence and the covariance matrix, whereby the eigenvalues are lower bounded by a factor that shrinks as the interchannel coherence grows.
Accordingly, the closer the coherence is to 1, the higher the misalignment and the slower the convergence. Therefore, we choose coherence reduction as a proxy for performance. In Section II, the coherences are derived theoretically for the half-wave rectifier as well as several other memoryless transformations. Parameter values of the nonlinear functions are selected such that they result in similar coherence reduction. These results are supported by simulations described in Section III. Then, in Section IV, we compare, for equivalent coherence reduction, the perceptual quality for both speech and music using formal psychoacoustic methods. We also investigate psychoacoustic effects on localization in the medial plane to see if the nonlinearity results in any impairment. Conclusions are summarized in Section V.

II. THEORETICAL COHERENCE CALCULATIONS

In this section, we derive mathematical expressions for the coherence of two signals modified by several types of nonlinear transformations and calculate some explicit examples using an actual measured room response.

A. Formal Description of Nonlinear Distortion

The starting point is the general memoryless nonlinear transformation

x'(t) = f[x(t)],   (1)

where f(.) is an arbitrary function. This distorted signal is added to the original signal to form the modified signal

x̃(t) = x(t) + α f[x(t)],   (2)

where the parameter α controls the amount of distortion added. For two signals, such as in stereo teleconferencing, we similarly define

x̃1(t) = x1(t) + α f[x1(t)],   (3a)
x̃2(t) = x2(t) + α f[x2(t)].   (3b)
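As a concrete reading of (1)-(3), the following Python/NumPy sketch builds the modified stereo signals for the memoryless nonlinearities considered in this paper. It is an illustration, not the authors' code; the input signals and the α = 0.3 setting are placeholder examples (the parameter values actually used appear in Fig. 1).

```python
import numpy as np

# Memoryless nonlinearities of the type listed in Table I.
def half_wave(x):
    """Ideal half-wave rectifier: passes positive samples, zeroes negative ones."""
    return 0.5 * (x + np.abs(x))

def full_wave(x):
    """Full-wave rectifier (absolute value)."""
    return np.abs(x)

def hard_limiter(x):
    """Hard limiter (signum)."""
    return np.sign(x)

def square_law(x):
    """Square-law device."""
    return x ** 2

def square_sign(x):
    """Squarer that preserves the sign of the input."""
    return np.sign(x) * x ** 2

def cubic(x):
    """Cubic nonlinearity."""
    return x ** 3

def add_distortion(x, f, alpha):
    """Modified signal of (2) and (3): x_tilde = x + alpha * f(x)."""
    return x + alpha * f(x)

# Placeholder stereo signals standing in for the two transmission-room
# microphone signals; in the application they are the same source filtered
# by two different transmission-room responses.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(16000)
x2 = 0.9 * x1 + 0.1 * rng.standard_normal(16000)

# Apply the same nonlinearity independently to each channel, as in (3a), (3b).
x1_mod = add_distortion(x1, half_wave, alpha=0.3)
x2_mod = add_distortion(x2, half_wave, alpha=0.3)
```

The key point is that the same memoryless function is applied to each channel independently, so the added components are only weakly correlated across channels even though the underlying signals are highly coherent.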

B. General Formulation of Coherence

We assume here that the speech signal can be represented as a stationary random Gaussian process over short intervals. This assumption is appropriate for the memoryless nonlinear transformations considered in this paper and will be further justified by the simulation results. If x1 and x2 are jointly Gaussian stationary processes with variances σ1² and σ2², respectively, then, using Price's theorem [7], the cross correlation functions between the original and distorted signals take the form of (4a) and (4b), where the constants are given by (5a) and (5b). Expressions (4) and (5a) are a slight generalization of Bussgang's theorem [7] to the case of cross-channel relations. The second form of (5), which is more convenient for our purposes, is derived from the first using integration by parts.

Expressions for the correlation function at the output of the nonlinearity can be obtained as a function of the normalized input correlation function [8] and are used with the above to compute the autocorrelation and cross correlation functions of the modified signals (3), as given in (6). By taking the Fourier transform, we can express these quantities in the frequency domain (7), and these spectra are then used to calculate the coherence (8).

For balanced operation, we assume that the two channels have the same statistics and define the common constants of (9) and (10). (In general, each such quantity is to be interpreted here as a single constant; we choose this notation because it is suggestive of the completely matched conditions.) With these definitions, the coherence of the modified signals (8) is expressed as (11). For the completely matched case, the same expressions as above are obtained, only now with (12).

We note that the quantities in (7) are dimensionless; therefore, the units (and consequently the power scaling) of the distorted signal, and hence the values of the constants defined above, will depend on the explicit form of the nonlinearity.

C. Coherence with Example Nonlinear Functions

Table I, adapted from [8], lists some representative expressions of the output correlation function for several simple nonlinearities in terms of the normalized input correlation function.

TABLE I
OUTPUT CORRELATION FUNCTION OF SEVERAL NONLINEARITIES AS A FUNCTION OF THE NORMALIZED INPUT CORRELATION FUNCTION

To compute the associated coefficients for these nonlinearities, it is useful to evaluate integrals of the form (13), which are listed in Table II for n = 1, 2, 3, 4.

TABLE II
CALCULATED VALUES OF (13) FOR n = 1, 2, 3, 4

Applying these results in the defining equations (5), (9), and (10), we obtain the values listed in Table III, where, for convenience, the constants are defined in terms of a common factor.

TABLE III
CROSS CORRELATION CONSTANTS AND COHERENCE PARAMETERS CALCULATED FOR SEVERAL NONLINEARITIES

Finally, substituting these coefficients together with the Fourier transforms of the correlation functions into (11) determines the coherence.
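The coherence that the closed-form expressions predict can also be estimated directly from signals. The sketch below (Python/SciPy, not the authors' Matlab code) estimates the magnitude coherence of the two modified channels with Welch averaging and reduces it to two frequency-averaged summaries of the kind listed in Table IV; the segment length is an implementation choice, and the 1/(1 - |γ|²) form used for the "inverse eigenvalue bound" column is our assumption, since its exact definition is not reproduced here.

```python
import numpy as np
from scipy.signal import coherence

def coherence_measures(x1_mod, x2_mod, fs=16000, nperseg=4096):
    """Estimate the magnitude coherence |gamma(f)| between the two modified
    channels (Welch averaging) and reduce it to two scalar summaries:
      * the coherence magnitude averaged over frequency, and
      * a frequency-averaged 1 / (1 - |gamma|^2), used here only as a plausible
        stand-in for the 'inverse eigenvalue bound' column of Table IV."""
    f, gamma_sq = coherence(x1_mod, x2_mod, fs=fs, nperseg=nperseg)
    gamma_mag = np.sqrt(gamma_sq)
    avg_coherence = gamma_mag.mean()
    inv_bound = np.mean(1.0 / np.maximum(1.0 - gamma_sq, 1e-6))  # guard |gamma| ~ 1
    return f, gamma_mag, avg_coherence, inv_bound
```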

D. Parameter Values to Obtain Similar Coherence Properties

Measurements were taken of a room impulse response (HuMaNet room B [9]) using SYSid [10] over an 8-kHz bandwidth (4096 points with a 16-kHz sampling rate). The front and back walls of the listening room are 16.6 ft (5.06 m) long, the side walls are 11.2 ft (3.41 m) long, and the room is 8.0 ft (2.44 m) high. The walls of the room are covered with acoustic-absorption panels. A white noise source was assumed, so that products of the room response Fourier transforms determine the spectra. These spectra were inverse transformed (4096-point IFFT), modified using Table I, and retransformed (4096-point FFT); the results were then substituted into (11), along with the results of Table III, to compute the coherence. Matlab was used to perform all of these calculations and plot the results, which appear in Fig. 1 for the various nonlinearities.

Fig. 1. Theoretical coherence of white noise through measured room responses for various nonlinearities. (a) Half-wave, α = 0.3. (b) Full-wave, α = 0.3/2.3. (c) Hard limiter, 0.15. (d) Square-law, α = 0.05. (e) Square-sign, α = 0.15. (f) Cubic, α = 0.03.

In each case, the value of α was selected to produce approximately the same amount of coherence reduction averaged over frequency, as shown in the second column of Table IV (the entry in the bottom row will be discussed in the next section).

TABLE IV
WHITE NOISE COHERENCE MAGNITUDE AND INVERSE EIGENVALUE BOUND, AVERAGED OVER FREQUENCY, FOR NONLINEARITIES AND PARAMETER VALUES OF FIGS. 1 AND 3

As mentioned in the Introduction, the misalignment and convergence properties are more properly determined by the eigenvalues of the covariance matrix, which are lower bounded in terms of the coherence. Thus, the inverse of this eigenvalue bound, averaged over frequency, is a measure of misalignment and convergence time. We also show these values in the third column of Table IV. These measures are likewise roughly comparable across the set of nonlinearities.

We note that adding a half-wave rectified signal with factor α results in a gain of 1 + α to positive signals and a gain of 1 to negative signals. For a full-wave rectifier with factor β, the corresponding gains are 1 + β and 1 - β. Therefore, adding a half-wave rectified signal with factor α is exactly equivalent to adding a full-wave rectified signal with factor β = α/(2 + α) and rescaling the composite signal by 1 + α/2. Thus, Fig. 1(a) and (b) are identical.
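The equivalence between the two rectifiers is easy to check numerically. In the sketch below, the factor β = α/(2 + α) and the overall rescaling 1 + α/2 follow from matching the gains applied to positive and negative samples; this is our own working of the relation stated above, verified to machine precision.

```python
import numpy as np

def half_wave(x):
    return 0.5 * (x + np.abs(x))

alpha = 0.3
beta = alpha / (2.0 + alpha)            # full-wave factor, as in Fig. 1(b)
rescale = 1.0 + alpha / 2.0             # composite gain matching the two forms

rng = np.random.default_rng(1)
x = rng.standard_normal(10000)

hw = x + alpha * half_wave(x)           # half-wave composite with factor alpha
fw = rescale * (x + beta * np.abs(x))   # rescaled full-wave composite with factor beta

print(np.max(np.abs(hw - fw)))          # ~1e-16: the two composites are identical
```

Apart from the fixed overall gain of 1 + α/2 on the composite signal, the two preprocessors therefore produce the same waveform, which is why their coherence curves coincide.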

III. COHERENCE CALCULATIONS FROM SIMULATIONS

Computer simulations were performed using Gaussian white noise, speech, and music signals. The white noise was generated using a standard Matlab routine. This case is used as a cross-check on the theoretical results of the last section. The speech signal was compiled from a digital speech database sampled at 16 kHz. It consists of the following three sentences spoken by a male talker: "Bobby did a good deed." "Do you abide by your bid?" "A teacher patched it up." The signal extends over about 5.3 s at the 16-kHz sampling rate. The music signal was of a piano playing the first few bars of Beethoven's Moonlight Sonata.

These signals were convolved with the aforementioned measured room impulse responses and processed (using the Matlab spectrum function) to obtain the magnitude-squared coherence. After taking the square root, we smoothed the coherence magnitude over 100 of 8193 frequency points (approximately 100 Hz). For white noise, the coherence of Fig. 1(a) was likewise smoothed over 50 of 4097 frequency points (approximately 100 Hz) and is plotted in Fig. 2(a). The simulation produced the coherence plotted in Fig. 2(b), which is seen to be in reasonable agreement, thereby verifying the methodology.

Fig. 2. Coherence of white noise through measured room responses for the half-wave nonlinearity with α = 0.3, smoothed by averaging over 100-Hz blocks. (a) Theoretical. (b) Simulation.

Having established close agreement between theoretical and simulated coherence, we next use the simulation to compare the coherence for all three signal sources, as shown on the left side of Fig. 3 for ideal half-wave rectification. The simulation was also used to compute the coherence of smoothed half-wave rectification [11], given by (14), in which a parameter is used to round the edge of the discontinuous derivative at the break point. This function would be difficult to treat on a theoretical basis. The simulated coherence plots for this function appear on the right side of Fig. 3 for the parameter values given in the caption.

Fig. 3. Simulated coherence of white noise [top panels (a), (b)], speech [middle panels (c), (d)], and music [bottom panels (e), (f)] through measured room responses for ideal rectification [left panels (a), (c), (e)] with α = 0.3 and smoothed rectification [right panels (b), (d), (f)] with smoothing parameter 1.0 and normalized break point 0.65.

For purposes of comparison, we list the white noise coherence measures that were computed from the smoothed half-wave rectifier simulation in the bottom row of Table IV. As can be seen, the white noise coherence measures for the smoothed half-wave rectifier are both lower than those obtained for the other nonlinearities. However, these differences tend to diminish for speech and music, as is evident in Fig. 3. We have not listed the speech and music coherence measures because they are very dependent on the particular sample used, and meaningful results would require much more extensive evaluation over a comprehensive database relating to the ultimate application. We prefer to separate the variabilities so that, on the one hand, we generally characterize the intrinsic capability of the nonlinearity by the white noise coherence, while on the other hand we determine the psychoacoustic degradations with a representative speech and music sample.

On the basis of the coherence calculations, we can say at this point that, for the nonlinear functions considered here, roughly comparable coherence reduction is achieved for the parameter values in Figs. 1 and 3 [the exception being for music, Fig. 3(e) and (f), to be discussed later]. As previously discussed, this reduction would lead to comparable misalignment and convergence performance for the stereo acoustic echo cancellation problem. Now, with this as a basis of comparison, we go on to evaluate the perceptual degradation introduced by these various nonlinearities with respect to subjective quality and auditory localization.
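As a rough illustration of the simulation procedure just described, the sketch below convolves a white-noise source with two room impulse responses (placeholder random decays here, standing in for the measured 4096-point responses), applies the ideal half-wave preprocessing with α = 0.3, and smooths the resulting coherence magnitude over roughly 100-Hz blocks. It follows the description in the text rather than the authors' Matlab code.

```python
import numpy as np
from scipy.signal import fftconvolve, coherence

fs = 16000
rng = np.random.default_rng(2)

# Placeholder exponentially decaying random sequences standing in for the
# two measured 4096-point room impulse responses.
decay = np.exp(-np.arange(4096) / 800.0)
h1 = rng.standard_normal(4096) * decay
h2 = rng.standard_normal(4096) * decay

source = rng.standard_normal(5 * fs)            # ~5 s of white noise
x1 = fftconvolve(source, h1)[: len(source)]     # two fully coherent channels
x2 = fftconvolve(source, h2)[: len(source)]

def half_wave(x):
    return 0.5 * (x + np.abs(x))

alpha = 0.3
x1m = x1 + alpha * half_wave(x1)                # ideal half-wave preprocessing
x2m = x2 + alpha * half_wave(x2)

f, gamma_sq = coherence(x1m, x2m, fs=fs, nperseg=16384)   # 8193 frequency points
gamma_mag = np.sqrt(gamma_sq)

# Smooth the coherence magnitude over roughly 100-Hz blocks, as in Figs. 2 and 3.
bin_hz = f[1] - f[0]
block = max(1, int(round(100.0 / bin_hz)))
smoothed = np.convolve(gamma_mag, np.ones(block) / block, mode="same")
```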

IV. PERCEPTUAL DEGRADATION

A. Subjective Quality

The psychoacoustic listening experiment described in this section determines the effect of the above nonlinear transformations on the subjective quality of speech and of music. Three audio tokens were used: a male talker and a female talker uttering the sentence "A teacher patched it up," and the first few bars of the Moonlight Sonata as used in Section III. All three tokens were stored as 16-bit PCM with a sampling rate of 16 kHz and were normalized to an rms level of 1528 units. The male-talker token had a duration of 1.8 s, the female-talker token a duration of 1.5 s, and the music token a duration of 5.5 s.

Signals were presented and controlled by a Concurrent MC5400 computer fitted with a DA04H 16-bit D/A converter. The D/A output was lowpass filtered at 5 kHz and presented to a subject inside a double-walled Industrial Acoustics Company soundproof booth. Signals were presented diotically through Sennheiser HD-250 headphones at a comfortable listening level of approximately 80 dB SPL.

The transformation used for these experiments was to replace the original signals by the modified signals in (3), with one of the following four nonlinear functions of Table I: (A) half-wave, (C) hard limiter, (D) square-law, and (E) square-sign. (For ease of comparison, this designation was chosen to match that used in Fig. 1.) The full-wave nonlinearity was not considered since, with appropriate scaling, it is identical to the half-wave nonlinearity, as previously noted; the cubic nonlinearity was not used here because it produced overflow of the 16-bit D/A converter. The same parameter values as in Fig. 1 were used for these four nonlinearities. In addition, a fifth nonlinearity, designated (G), smoothed half-wave rectification (14) with the same parameter values as used in Fig. 3, was also included. For control purposes, an additional processing condition (O) was defined as no processing [α = 0 in (3)].

Thirteen subjects took part in this experiment. The ages of the subjects ranged from 33 to 65 years. Some of the subjects had a moderate amount of presbycusis (normal age-related hearing loss), but all subjects had audiologically normal hearing according to the 1964 ISO reference of average hearing loss at 500, 1000, and 2000 Hz of less than 26 dB [12].

There was no apparent relationship between age or amount of presbycusis and performance in this experiment, and results from all 13 subjects are pooled.

Each subject took part in a single experimental session. A session consisted of two replications of the 18 stimuli, consisting of the three audio tokens either undistorted or processed as in (3) with one of the five nonlinearities described above. Subjects were instructed to indicate the quality of each stimulus by pushing one of five pushbuttons labeled excellent, good, fair, poor, and bad. (Note that while these categories are identical to the listening-quality scale recommended for MOS testing in ITU-T Recommendation P.800, the present test should not be regarded as an MOS test; we did not include reference conditions, and many of the listeners had previous experience in speech coding.) Printed instructions, which the subjects read before the experiment, are reproduced here as Appendix A.

The five possible response ratings, bad to excellent, were assigned numerical values one to five, and statistical analysis was done on the resulting set of 468 numbers (three audio tokens × six processing conditions × thirteen subjects × two replications). As was stated above, results from all 13 subjects were pooled. In addition, since analysis revealed no significant differences between results for male and for female talkers, results from the two talkers were pooled. An analysis of variance assuming 13 subject effects plus 39 token-by-processing-condition effects (18 for the subset of experiments reported here plus another 21 for seven additional conditions included in a larger experiment) gave 95% confidence intervals of 0.18 response units for speech and 0.26 response units for music.

Experimental results for speech and for music are summarized in Fig. 4.

Fig. 4. Average response ratings for speech and for music. The horizontal bars indicate 95% confidence intervals. Processing conditions: (A) half-wave, α = 0.3; (C) hard limiter, 0.15; (D) square-law, α = 0.05; (E) square-sign, α = 0.15; (G) smoothed half-wave rectification with the parameters of Fig. 3; (O) no processing (α = 0).

The average response rating is shown on the horizontal axis, and the two types of source material are shown on the vertical axis. The horizontal bars indicate 95% confidence intervals. The key relating symbols in Fig. 4 to processing condition appears in the caption: A, C, D, and E relate directly to the labels in the subplots of Fig. 1, with the same parameter value in each case; G is smoothed half-wave rectification (14) with the parameters of Fig. 3; and O designates unprocessed signals. As previously stated, these values were selected to produce approximately the same amount of coherence reduction.

Of the distortion conditions reported in Fig. 4, hard limiting (C) and square-sign distortion (E) can be rejected out of hand: they both received average ratings of fair or worse for both speech and music. For speech, the other three conditions all appear to be quite satisfactory: they all received average ratings between good and excellent. Square-law distortion (D) and smoothed rectification (G) fared almost as well for music as for speech, but half-wave distortion (A) did substantially worse. However, it must be recalled from Fig. 3(e) and (f) that the half-wave rectifier with α = 0.3 is overly aggressive for music. Therefore, the value of α could be reduced for music, which would improve the perceptual quality.
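For orientation only, the following is a minimal sketch of how such category ratings can be reduced to mean ratings with approximate 95% confidence intervals. It uses a simple pooled normal approximation rather than the analysis of variance described above, and the ratings listed are made-up placeholders, not the experimental data.

```python
import numpy as np

# Map the five response categories to the numerical values used in the text.
SCALE = {"bad": 1, "poor": 2, "fair": 3, "good": 4, "excellent": 5}

def mean_and_ci(ratings, z=1.96):
    """Pooled mean rating and an approximate 95% confidence half-width.
    (The paper derives its intervals from an analysis of variance with
    subject effects; this normal approximation is only a rough stand-in.)"""
    r = np.asarray(ratings, dtype=float)
    return r.mean(), z * r.std(ddof=1) / np.sqrt(len(r))

# Hypothetical ratings for one token/condition (13 subjects x 2 replications).
example = ["good", "excellent", "good", "fair", "good"] * 5 + ["excellent"]
mean, ci = mean_and_ci([SCALE[label] for label in example])
print(f"average rating {mean:.2f} +/- {ci:.2f}")
```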
It is interesting to note that the differences between average ratings for speech and music were greatest for the ideal half-wave rectifier (A) and the hard limiter (C), whereas conditions O, D, E, and G produced differences of less than 0.5. Note that the latter are exactly the four conditions that do not have a sharp discontinuity at the origin. Our speculation is that the sharp discontinuity that occurs with half-wave rectification and with hard limiting produces distortion that was more detrimental for the sustained tonal musical sample than it was for speech.

B. Auditory Localization

The second experiment reported here demonstrates that moderate nonlinear processing has essentially no effect on auditory localization.

It differs from the first experiment in that stimuli were presented over a pair of loudspeakers rather than through headphones. The original speech token was the same male-talker token used in the first experiment. The left- and right-channel signals were produced in a simulated transmission room with a sound source and two microphones using the image model [13]. The simulated room was specified so as to model the actual room used in Sections II and III. A top view of the simulated room is shown in Fig. 5(a). (We use units of feet, as designated in the original room specifications.) The reflection coefficients of the walls, ceiling, and floor were 0.85, 0.65, and 0.80, respectively.

Fig. 5. Room layouts used for auditory localization experiments (coordinate units in feet). (a) Simulated room used to generate signals. (b) Actual room used to present nonlinearly transformed signals to the listener.

Two conditions were investigated. In one, the left- and right-channel signals were produced by a widely spaced pair of omnidirectional microphones ["Left Mike" and "Right Mike" in Fig. 5(a)]. In the other, the left- and right-channel signals were produced by a closely spaced pair of cardioid microphones directed at right angles to each other ["Crossed Cardioid Mikes" in Fig. 5(a)]. The position of the centered talker was the same for both conditions. Note that the centered talker was not equidistant from the left and right walls.

This asymmetry was introduced deliberately to eliminate atypical artifacts. The height of the room was 8.0 ft. The speech source was 3.25 ft above the floor, and all microphones were 2.25 ft above the floor. Each cardioid microphone was implemented by means of a closely spaced pair of omnidirectional microphones (2.0 cm between microphones) with appropriately delayed and integrated outputs [14]. This spacing is small enough to provide an acceptable frequency response and large enough to provide an acceptable spatial resolution. The left-channel microphone was directed 45° toward the left and the right-channel microphone was directed 45° toward the right.

The transformation used for these experiments was to replace the left- and right-channel signals by the modified signals of (3), using the half-wave rectifier of Table I with nonlinearity parameter α = 0.45. This value was chosen to be somewhat larger than the value used in the coherence calculations and subjective quality experiments in order to better evoke any possible impairment of localization performance. The transformed signals were amplified and presented at a comfortable listening level of approximately 75 dB SPL over a pair of Quad ESL-63 electrostatic loudspeakers. The listening room, shown in Fig. 5(b), was the same room described earlier for the coherence measurements.

In each experimental trial, the subject listened to a pair of stimulus presentations that differed in the location of the talker in the simulated transmission room. The subject was required to judge whether the perceived location of the talker in the second presentation was to the left of or to the right of the perceived location of the talker in the first presentation. Talker positions for the two stimulus presentations comprising a trial were symmetrical about the centered position: one was displaced to the left of the centered position and the other by an equal amount to the right, so that the two presentations differed by a specified total change of talker position. The order of presentation was randomized from trial to trial.

We carried out a series of preliminary listening experiments to determine what values of this position change would be included in the experiment. The criterion was to select values that would cover the range from difficult (probability of correct response slightly better than chance) to easy (probability of correct response close to unity). These preliminary experiments revealed that a much smaller change is detectable with the omnidirectional microphone configuration than with the cardioid microphone configuration. We selected four position changes for the omnidirectional microphone configuration (0.02, 0.04, 0.1, and 0.2 ft) and four for the cardioid microphone configuration (0.2, 0.4, 1.0, and 2.0 ft). There were thus 16 different stimulus conditions: two microphone types (omnidirectional or cardioid) × four position changes (as described above) × two distortion conditions [unmodified (α = 0) and modified (α = 0.45)]. An experimental session consisted of five replications of the 16 conditions, for a total of 80 trials, with the order of conditions randomized within each replication. An experimental session took about seven minutes to complete.
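The cardioid pickups described above (each realized as a 2.0-cm pair of omnidirectional microphones with appropriately delayed and integrated outputs [14]) can be sketched as a standard delay-and-subtract first-order differential pair. The delay value, the leaky-integrator equalization, and the parameter choices below are our own simplified reading of that construction, not the exact processing used for the experiment.

```python
import numpy as np

def fractional_delay(x, delay_samples):
    """Delay a signal by a (possibly fractional) number of samples using
    linear interpolation; adequate for this illustrative sketch."""
    n = np.arange(len(x))
    return np.interp(n - delay_samples, n, x, left=0.0, right=0.0)

def cardioid_from_omni_pair(front, rear, fs=16000, d=0.02, c=343.0, leak=0.995):
    """Delay-and-subtract cardioid: delay the rear omni by the acoustic travel
    time d/c across the 2-cm spacing, subtract it from the front omni, then
    apply a leaky integrator to equalize the differentiation inherent in the
    subtraction.  Parameter values here are assumptions for illustration."""
    delay = d / c * fs                     # about 0.93 samples at 16 kHz
    diff = front - fractional_delay(rear, delay)
    out = np.empty_like(diff)
    acc = 0.0
    for i, v in enumerate(diff):           # first-order leaky integrator
        acc = leak * acc + v
        out[i] = acc
    return out
```

Pointing such a pair 45° left and another 45° right yields the crossed-cardioid configuration, whose directional cue is mainly an interchannel level difference rather than a time difference.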
Four subjects participated in the experiment. The ages of the subjects ranged from 33 to 63 years. As in the previous experiment, some of the subjects exhibited a moderate amount of presbycusis, all subjects had audiologically normal hearing, and there was no apparent relationship between age or amount of presbycusis and performance in the auditory localization task. Each subject read a sheet of printed instructions, reproduced here as Appendix B, and then participated in one or two practice sessions followed, on separate days, by two data-collection sessions, so the experimental results are based on ten observations per condition for each subject.

The results for the four subjects were averaged and are plotted in Fig. 6. Each panel shows the proportion of correct responses, P(C), versus the amount by which the talker position changed. The left panel shows results with omnidirectional microphones and the right panel shows results with crossed cardioid microphones. In each panel, the points labeled O are for unmodified speech [α = 0 in (3)] and the points labeled A are for speech modified using half-wave rectification [α = 0.45 in (3)]. Each point shows the average of 40 trials. The error bars between the left and right panels show 95% confidence intervals of the probability of correct response for P(C) = 0.5 and P(C) = 0.8.

The results indicate that introduction of this nonlinear distortion produces little or no degradation of stereophonic localization. Of the 32 cases (four subjects × two types of microphone × four position changes), the proportion of correct responses was equal for distorted and undistorted speech in 14 cases, was higher for undistorted than for distorted speech in seven cases, and was higher for distorted than for undistorted speech in 11 cases. A simple nonparametric sign test based on these numbers shows that the difference between distorted and undistorted speech over the ensemble of experimental subjects is not significant. Fig. 6 supports this conclusion. In only one of the eight cases (0.02-ft position change, omnidirectional microphones) is the difference between the proportions of correct responses for distorted and undistorted speech significant at the 95% level.
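The nonparametric sign test quoted above can be checked with an exact binomial test on the non-tied cases; the sketch below assumes SciPy's binomtest and reproduces the stated conclusion.

```python
from scipy.stats import binomtest

# Of the 32 subject/microphone/position-change cases, 14 were ties,
# undistorted speech was better in 7, and distorted speech was better in 11.
# A sign test keeps only the 18 non-tied cases and asks whether 7 "wins"
# out of 18 is consistent with a fair coin.
result = binomtest(7, n=18, p=0.5, alternative="two-sided")
print(result.pvalue)   # about 0.48, i.e., no significant overall difference
```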

There were also some intersubject differences. Not all subjects did equally well overall, and in addition two of the subjects showed more of a difference between crossed cardioid and omnidirectional microphones than did the two other subjects. This was true for both undistorted and distorted speech. The primary directional cue produced by the omnidirectional microphone configuration is interchannel time difference, while the primary directional cue produced by the cardioid microphone configuration is interchannel intensity difference. We know [15] that some subjects are more dependent on interaural time difference in a lateralization task while other subjects are more dependent on interaural intensity difference, so it comes as no surprise that the effect of microphone configuration is different for different subjects.

Fig. 6. Proportion of correct responses versus change of talker position, averaged over the four subjects. Each point shows the average of 40 trials. Left panel: omnidirectional microphones. Right panel: crossed cardioid microphones. Processing conditions: (O) unmodified [α = 0 in (3)]; (A) modified with half-wave rectification [α = 0.45 in (3)]. The error bars between the left and right panels show 95% confidence intervals for P(C) = 0.5 and P(C) = 0.8.

Fig. 6 shows that the threshold change of talker position, averaged over the four subjects, was more than an order of magnitude smaller for the omnidirectional configuration (0.05 ft) than it was for the cardioid configuration (1.5 ft). However, the omnidirectional configuration gives a distorted representation of talker position: the perceived position of the talker moves abruptly from extreme right to extreme left as the actual talker position crosses the midline between the two microphones. This distortion occurs because the primary directional cue produced by the omnidirectional microphone configuration is interchannel time difference; because of the precedence effect [16], [17], the perceived source in this case is localized at the loudspeaker receiving the earlier signal. With the cardioid microphone configuration, the directional cue is interchannel intensity difference, and the perceived position of the talker changes gradually from left to right as the actual position of the talker changes. In our experiment, the simulated talker and listener are effectively 15.1 ft apart (7.2 ft plus 7.9 ft from the front walls), so the 1.5-ft threshold we measured corresponds to an angle shift of about 5.7°. This compares favorably with results presented by Mills [18], who reports minimum audible angles for tone bursts in the range of 1°-4°, depending on the frequency of the stimulus.

V. CONCLUSIONS

We investigated several types of nonlinearities for reducing the mutual coherence of stereo signals for the purpose of uniquely identifying the impulse responses in acoustic echo cancellation applications. The intention is that this reduction in coherence, while being effective for its intended purpose, does not seriously degrade the subjective quality of the audio source or the ability to localize the direction of sound. First, the parameters of the nonlinearities were selected so as to produce approximately equal reduction of coherence, and then psychoacoustic experiments were conducted to quantify the subjective loss of quality and to determine whether localization is compromised.

Of the types of nonlinearities evaluated, half-wave rectification is the simplest to implement and only minimally affects the speech quality. However, for music the nonlinearity parameter must be reduced to maintain the same level of decoherence and, we presume, the same perceptual quality. The smoothed rectifier also provides good speech quality but is a little more difficult to implement because a running estimate of the standard deviation must be computed and used to normalize the break point. However, its performance seems to affect speech and music more uniformly, therefore not requiring readjustment of the nonlinearity parameter.

We found no statistically meaningful effect of half-wave rectification on localization performance. Informal listening also did not reveal any localization impairment for any of the other nonlinearities; mild nonlinearities of any kind seem to have no effect whatsoever.

APPENDIX A
PRINTED INSTRUCTIONS FOR SUBJECTIVE QUALITY EXPERIMENT

APPENDIX B
PRINTED INSTRUCTIONS FOR AUDITORY LOCALIZATION EXPERIMENT

ACKNOWLEDGMENT

The authors would like to thank M. M. Sondhi and G. W. Elko for helpful discussions, and T. Gaensler and the reviewers for providing useful comments to improve the text.

REFERENCES

[1] M. M. Sondhi, D. R. Morgan, and J. L. Hall, "Stereophonic acoustic echo cancellation - An overview of the fundamental problem," IEEE Signal Processing Lett., vol. 2, Aug. 1995.
[2] J. Benesty, D. R. Morgan, and M. M. Sondhi, "A better understanding and an improved solution to the specific problems of stereophonic acoustic echo cancellation," IEEE Trans. Speech Audio Processing, vol. 6, Mar. 1998.
[3] A. Gilloire and V. Turbin, "Using auditory properties to improve the behavior of stereophonic acoustic cancellers," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1998.
[4] S. Shimauchi, Y. Haneda, S. Makino, and Y. Kaneda, "New configuration for a stereo echo canceller with nonlinear pre-processing," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1998.
[5] J. Benesty, D. R. Morgan, and M. M. Sondhi, "A hybrid mono/stereo acoustic echo canceler," IEEE Trans. Speech Audio Processing, vol. 6, Sept. 1998.
[6] J. Benesty, D. R. Morgan, J. L. Hall, and M. M. Sondhi, "Synthesized stereo combined with acoustic echo cancellation for desktop conferencing," Bell Labs Tech. J., vol. 3, July-Sept. 1998.
[7] A. Papoulis, Probability, Random Variables and Stochastic Processes. New York: McGraw-Hill, 1984.
[8] R. F. Baum, "The correlation function of Gaussian noise passed through nonlinear devices," IEEE Trans. Inform. Theory, vol. IT-15, July 1969.
[9] D. A. Berkley and J. L. Flanagan, "HuMaNet: An experimental human-machine communications network based on ISDN wideband audio," AT&T Tech. J., vol. 69, Sept./Oct. 1990.
[10] User's Manual for the SYSid Audio-Band Measurement and Analysis System, Version 4.0. Highland Park, NJ: Ariel Corporation.
[11] M. R. Schroeder and J. L. Hall, "Model for mechanical to neural transduction in the auditory receptor," J. Acoust. Soc. Am., vol. 55, May 1974.
[12] D. S. Green, "Pure tone air conduction thresholds," in Handbook of Clinical Audiology, J. Katz, Ed. Baltimore, MD: Williams & Wilkins, 1972, ch. 5.
[13] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," J. Acoust. Soc. Am., vol. 65, pp. 943-950, Apr. 1979.
[14] H. F. Olsen, Modern Sound Reproduction. New York: Van Nostrand Reinhold, 1972.
[15] L. A. Jeffress and D. McFadden, "Differences of interaural phase and level in detection and lateralization," J. Acoust. Soc. Am., vol. 49, 1971.
[16] W. M. Hall, "A method for maintaining in a public address system the illusion that the sound comes from the speaker's mouth," J. Acoust. Soc. Am., vol. 7, p. 239.
[17] M. B. Gardner, "Historical background of the Haas and/or precedence effect," J. Acoust. Soc. Am., vol. 43, 1968.
[18] A. W. Mills, "On the minimum audible angle," J. Acoust. Soc. Am., vol. 30, pp. 237-246, 1958.

Dennis R. Morgan (S'63-M'69-SM'92) was born in Cincinnati, OH, on February 19. He received the B.S. degree in 1965 from the University of Cincinnati, and the M.S. and Ph.D. degrees from Syracuse University, Syracuse, NY, in 1968 and 1970, respectively, all in electrical engineering. From 1965 to 1984, he was with the Electronics Laboratory, General Electric Company, Syracuse, specializing in the analysis and design of signal processing systems used in radar, sonar, and communications.
He is now a Distinguished Member of Technical Staff at Bell Laboratories, Lucent Technologies (formerly AT&T), Murray Hill, NJ, where he has been employed since 1984. From 1984 to 1990, he was with the Special Systems Analysis Department, Whippany, NJ, where he was involved in the analysis and development of advanced signal processing techniques associated with communications, array processing, detection and estimation, and adaptive systems. Since 1990, he has been with the Acoustics Research Department, Murray Hill, NJ, where he is engaged in research on adaptive signal processing techniques applied to communication systems. He has authored numerous journal publications and is coauthor of Active Noise Control Systems: Algorithms and DSP Implementations (New York: Wiley, 1996) and Advances in Network and Acoustic Echo Cancellation (New York: Springer-Verlag, 2001). Dr. Morgan served as Associate Editor for the IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING beginning in 1995, and he is currently serving as Associate Editor for the IEEE TRANSACTIONS ON SIGNAL PROCESSING.

Joseph L. Hall was born in Boston, MA, on January 22. He received the B.A. degree in physics in 1959 from Williams College, Williamstown, MA, and the S.B. and S.M. degrees in electrical engineering in 1959 and the Ph.D. degree in electrical engineering in 1963, all from the Massachusetts Institute of Technology (MIT), Cambridge. From 1964 through 1966, he was with the Department of Electrical Engineering and the Department of Biomedical Engineering at Johns Hopkins University, Baltimore, MD. In 1966, he joined the Acoustics and Speech Research Department at Bell Labs, Lucent Technologies (formerly AT&T), Murray Hill, NJ, where he is now a Distinguished Member of Technical Staff. His research interests are in the area of auditory psychophysics. He was Associate Editor of the Journal of the Acoustical Society of America. Dr. Hall is a Fellow of the Acoustical Society of America, where he has served on the society's executive council.

Jacob Benesty (M'98) was born in Marrakesh, Morocco, on April 8. He received the M.S. degree in microwaves from Pierre & Marie Curie University, France, in 1987, and the Ph.D. degree in control and signal processing from Orsay University, France, in April 1991. While pursuing the Ph.D. degree (from November 1989 to April 1991), he worked on adaptive filters and fast algorithms at the Centre National d'Etudes des Télécommunications (CNET), Paris, France. From January 1994 to July 1995, he worked at Telecom Paris on multichannel adaptive filters and acoustic echo cancellation. He joined Bell Labs, Lucent Technologies (formerly AT&T), Murray Hill, NJ, in October 1995, first as a Consultant and then as a Member of Technical Staff. Since that date, he has been working on stereophonic acoustic echo cancellation, adaptive filters, source localization, robust network echo cancellation, and blind deconvolution. He was the Co-Chair of the 1999 International Workshop on Acoustic Echo and Noise Control. He coauthored the book Advances in Network and Acoustic Echo Cancellation (New York: Springer-Verlag, 2001) and is co-editor/coauthor of the book Acoustic Signal Processing for Telecommunication (Norwell, MA: Kluwer, 2000).


More information

Effects of Reverberation on Pitch, Onset/Offset, and Binaural Cues

Effects of Reverberation on Pitch, Onset/Offset, and Binaural Cues Effects of Reverberation on Pitch, Onset/Offset, and Binaural Cues DeLiang Wang Perception & Neurodynamics Lab The Ohio State University Outline of presentation Introduction Human performance Reverberation

More information

The role of intrinsic masker fluctuations on the spectral spread of masking

The role of intrinsic masker fluctuations on the spectral spread of masking The role of intrinsic masker fluctuations on the spectral spread of masking Steven van de Par Philips Research, Prof. Holstlaan 4, 5656 AA Eindhoven, The Netherlands, Steven.van.de.Par@philips.com, Armin

More information

6-channel recording/reproduction system for 3-dimensional auralization of sound fields

6-channel recording/reproduction system for 3-dimensional auralization of sound fields Acoust. Sci. & Tech. 23, 2 (2002) TECHNICAL REPORT 6-channel recording/reproduction system for 3-dimensional auralization of sound fields Sakae Yokoyama 1;*, Kanako Ueno 2;{, Shinichi Sakamoto 2;{ and

More information

AN AUDITORILY MOTIVATED ANALYSIS METHOD FOR ROOM IMPULSE RESPONSES

AN AUDITORILY MOTIVATED ANALYSIS METHOD FOR ROOM IMPULSE RESPONSES Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-), Verona, Italy, December 7-9,2 AN AUDITORILY MOTIVATED ANALYSIS METHOD FOR ROOM IMPULSE RESPONSES Tapio Lokki Telecommunications

More information

DESIGN OF ROOMS FOR MULTICHANNEL AUDIO MONITORING

DESIGN OF ROOMS FOR MULTICHANNEL AUDIO MONITORING DESIGN OF ROOMS FOR MULTICHANNEL AUDIO MONITORING A.VARLA, A. MÄKIVIRTA, I. MARTIKAINEN, M. PILCHNER 1, R. SCHOUSTAL 1, C. ANET Genelec OY, Finland genelec@genelec.com 1 Pilchner Schoustal Inc, Canada

More information

Nonuniform multi level crossing for signal reconstruction

Nonuniform multi level crossing for signal reconstruction 6 Nonuniform multi level crossing for signal reconstruction 6.1 Introduction In recent years, there has been considerable interest in level crossing algorithms for sampling continuous time signals. Driven

More information

Interference in stimuli employed to assess masking by substitution. Bernt Christian Skottun. Ullevaalsalleen 4C Oslo. Norway

Interference in stimuli employed to assess masking by substitution. Bernt Christian Skottun. Ullevaalsalleen 4C Oslo. Norway Interference in stimuli employed to assess masking by substitution Bernt Christian Skottun Ullevaalsalleen 4C 0852 Oslo Norway Short heading: Interference ABSTRACT Enns and Di Lollo (1997, Psychological

More information

Tone-in-noise detection: Observed discrepancies in spectral integration. Nicolas Le Goff a) Technische Universiteit Eindhoven, P.O.

Tone-in-noise detection: Observed discrepancies in spectral integration. Nicolas Le Goff a) Technische Universiteit Eindhoven, P.O. Tone-in-noise detection: Observed discrepancies in spectral integration Nicolas Le Goff a) Technische Universiteit Eindhoven, P.O. Box 513, NL-5600 MB Eindhoven, The Netherlands Armin Kohlrausch b) and

More information

The analysis of multi-channel sound reproduction algorithms using HRTF data

The analysis of multi-channel sound reproduction algorithms using HRTF data The analysis of multichannel sound reproduction algorithms using HRTF data B. Wiggins, I. PatersonStephens, P. Schillebeeckx Processing Applications Research Group University of Derby Derby, United Kingdom

More information

Pattern Recognition. Part 6: Bandwidth Extension. Gerhard Schmidt

Pattern Recognition. Part 6: Bandwidth Extension. Gerhard Schmidt Pattern Recognition Part 6: Gerhard Schmidt Christian-Albrechts-Universität zu Kiel Faculty of Engineering Institute of Electrical and Information Engineering Digital Signal Processing and System Theory

More information

Spatial audio is a field that

Spatial audio is a field that [applications CORNER] Ville Pulkki and Matti Karjalainen Multichannel Audio Rendering Using Amplitude Panning Spatial audio is a field that investigates techniques to reproduce spatial attributes of sound

More information

WARPED FILTER DESIGN FOR THE BODY MODELING AND SOUND SYNTHESIS OF STRING INSTRUMENTS

WARPED FILTER DESIGN FOR THE BODY MODELING AND SOUND SYNTHESIS OF STRING INSTRUMENTS NORDIC ACOUSTICAL MEETING 12-14 JUNE 1996 HELSINKI WARPED FILTER DESIGN FOR THE BODY MODELING AND SOUND SYNTHESIS OF STRING INSTRUMENTS Helsinki University of Technology Laboratory of Acoustics and Audio

More information

Low frequency sound reproduction in irregular rooms using CABS (Control Acoustic Bass System) Celestinos, Adrian; Nielsen, Sofus Birkedal

Low frequency sound reproduction in irregular rooms using CABS (Control Acoustic Bass System) Celestinos, Adrian; Nielsen, Sofus Birkedal Aalborg Universitet Low frequency sound reproduction in irregular rooms using CABS (Control Acoustic Bass System) Celestinos, Adrian; Nielsen, Sofus Birkedal Published in: Acustica United with Acta Acustica

More information

SUPERVISED SIGNAL PROCESSING FOR SEPARATION AND INDEPENDENT GAIN CONTROL OF DIFFERENT PERCUSSION INSTRUMENTS USING A LIMITED NUMBER OF MICROPHONES

SUPERVISED SIGNAL PROCESSING FOR SEPARATION AND INDEPENDENT GAIN CONTROL OF DIFFERENT PERCUSSION INSTRUMENTS USING A LIMITED NUMBER OF MICROPHONES SUPERVISED SIGNAL PROCESSING FOR SEPARATION AND INDEPENDENT GAIN CONTROL OF DIFFERENT PERCUSSION INSTRUMENTS USING A LIMITED NUMBER OF MICROPHONES SF Minhas A Barton P Gaydecki School of Electrical and

More information

Pre- and Post Ringing Of Impulse Response

Pre- and Post Ringing Of Impulse Response Pre- and Post Ringing Of Impulse Response Source: http://zone.ni.com/reference/en-xx/help/373398b-01/svaconcepts/svtimemask/ Time (Temporal) Masking.Simultaneous masking describes the effect when the masked

More information

Exposure schedule for multiplexing holograms in photopolymer films

Exposure schedule for multiplexing holograms in photopolymer films Exposure schedule for multiplexing holograms in photopolymer films Allen Pu, MEMBER SPIE Kevin Curtis,* MEMBER SPIE Demetri Psaltis, MEMBER SPIE California Institute of Technology 136-93 Caltech Pasadena,

More information

A Fast Recursive Algorithm for Optimum Sequential Signal Detection in a BLAST System

A Fast Recursive Algorithm for Optimum Sequential Signal Detection in a BLAST System 1722 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL 51, NO 7, JULY 2003 A Fast Recursive Algorithm for Optimum Sequential Signal Detection in a BLAST System Jacob Benesty, Member, IEEE, Yiteng (Arden) Huang,

More information

Speech Enhancement: Reduction of Additive Noise in the Digital Processing of Speech

Speech Enhancement: Reduction of Additive Noise in the Digital Processing of Speech Speech Enhancement: Reduction of Additive Noise in the Digital Processing of Speech Project Proposal Avner Halevy Department of Mathematics University of Maryland, College Park ahalevy at math.umd.edu

More information

Effect of the number of loudspeakers on sense of presence in 3D audio system based on multiple vertical panning

Effect of the number of loudspeakers on sense of presence in 3D audio system based on multiple vertical panning Effect of the number of loudspeakers on sense of presence in 3D audio system based on multiple vertical panning Toshiyuki Kimura and Hiroshi Ando Universal Communication Research Institute, National Institute

More information

2920 J. Acoust. Soc. Am. 102 (5), Pt. 1, November /97/102(5)/2920/5/$ Acoustical Society of America 2920

2920 J. Acoust. Soc. Am. 102 (5), Pt. 1, November /97/102(5)/2920/5/$ Acoustical Society of America 2920 Detection and discrimination of frequency glides as a function of direction, duration, frequency span, and center frequency John P. Madden and Kevin M. Fire Department of Communication Sciences and Disorders,

More information

Speech Enhancement in Presence of Noise using Spectral Subtraction and Wiener Filter

Speech Enhancement in Presence of Noise using Spectral Subtraction and Wiener Filter Speech Enhancement in Presence of Noise using Spectral Subtraction and Wiener Filter 1 Gupteswar Sahu, 2 D. Arun Kumar, 3 M. Bala Krishna and 4 Jami Venkata Suman Assistant Professor, Department of ECE,

More information

IS SII BETTER THAN STI AT RECOGNISING THE EFFECTS OF POOR TONAL BALANCE ON INTELLIGIBILITY?

IS SII BETTER THAN STI AT RECOGNISING THE EFFECTS OF POOR TONAL BALANCE ON INTELLIGIBILITY? IS SII BETTER THAN STI AT RECOGNISING THE EFFECTS OF POOR TONAL BALANCE ON INTELLIGIBILITY? G. Leembruggen Acoustic Directions, Sydney Australia 1 INTRODUCTION 1.1 Motivation for the Work With over fifteen

More information

Analysis of room transfer function and reverberant signal statistics

Analysis of room transfer function and reverberant signal statistics Analysis of room transfer function and reverberant signal statistics E. Georganti a, J. Mourjopoulos b and F. Jacobsen a a Acoustic Technology Department, Technical University of Denmark, Ørsted Plads,

More information

Sound Processing Technologies for Realistic Sensations in Teleworking

Sound Processing Technologies for Realistic Sensations in Teleworking Sound Processing Technologies for Realistic Sensations in Teleworking Takashi Yazu Makoto Morito In an office environment we usually acquire a large amount of information without any particular effort

More information

Reduction of Musical Residual Noise Using Harmonic- Adapted-Median Filter

Reduction of Musical Residual Noise Using Harmonic- Adapted-Median Filter Reduction of Musical Residual Noise Using Harmonic- Adapted-Median Filter Ching-Ta Lu, Kun-Fu Tseng 2, Chih-Tsung Chen 2 Department of Information Communication, Asia University, Taichung, Taiwan, ROC

More information

NSMRL Report JULY 2001

NSMRL Report JULY 2001 Naval Submarine Medical Research Laboratory NSMRL Report 1221 02 JULY 2001 AN ALGORITHM FOR CALCULATING THE ESSENTIAL BANDWIDTH OF A DISCRETE SPECTRUM AND THE ESSENTIAL DURATION OF A DISCRETE TIME-SERIES

More information

Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm

Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm International OPEN ACCESS Journal Of Modern Engineering Research (IJMER) Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm A.T. Rajamanickam, N.P.Subiramaniyam, A.Balamurugan*,

More information

Automotive three-microphone voice activity detector and noise-canceller

Automotive three-microphone voice activity detector and noise-canceller Res. Lett. Inf. Math. Sci., 005, Vol. 7, pp 47-55 47 Available online at http://iims.massey.ac.nz/research/letters/ Automotive three-microphone voice activity detector and noise-canceller Z. QI and T.J.MOIR

More information

HARMONIC INSTABILITY OF DIGITAL SOFT CLIPPING ALGORITHMS

HARMONIC INSTABILITY OF DIGITAL SOFT CLIPPING ALGORITHMS HARMONIC INSTABILITY OF DIGITAL SOFT CLIPPING ALGORITHMS Sean Enderby and Zlatko Baracskai Department of Digital Media Technology Birmingham City University Birmingham, UK ABSTRACT In this paper several

More information

Drum Transcription Based on Independent Subspace Analysis

Drum Transcription Based on Independent Subspace Analysis Report for EE 391 Special Studies and Reports for Electrical Engineering Drum Transcription Based on Independent Subspace Analysis Yinyi Guo Center for Computer Research in Music and Acoustics, Stanford,

More information

NOISE ESTIMATION IN A SINGLE CHANNEL

NOISE ESTIMATION IN A SINGLE CHANNEL SPEECH ENHANCEMENT FOR CROSS-TALK INTERFERENCE by Levent M. Arslan and John H.L. Hansen Robust Speech Processing Laboratory Department of Electrical Engineering Box 99 Duke University Durham, North Carolina

More information

A spatial squeezing approach to ambisonic audio compression

A spatial squeezing approach to ambisonic audio compression University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2008 A spatial squeezing approach to ambisonic audio compression Bin Cheng

More information

Audio Engineering Society. Convention Paper. Presented at the 115th Convention 2003 October New York, New York

Audio Engineering Society. Convention Paper. Presented at the 115th Convention 2003 October New York, New York Audio Engineering Society Convention Paper Presented at the 115th Convention 2003 October 10 13 New York, New York This convention paper has been reproduced from the author's advance manuscript, without

More information

Auditory modelling for speech processing in the perceptual domain

Auditory modelling for speech processing in the perceptual domain ANZIAM J. 45 (E) ppc964 C980, 2004 C964 Auditory modelling for speech processing in the perceptual domain L. Lin E. Ambikairajah W. H. Holmes (Received 8 August 2003; revised 28 January 2004) Abstract

More information

Laboratory Assignment 4. Fourier Sound Synthesis

Laboratory Assignment 4. Fourier Sound Synthesis Laboratory Assignment 4 Fourier Sound Synthesis PURPOSE This lab investigates how to use a computer to evaluate the Fourier series for periodic signals and to synthesize audio signals from Fourier series

More information

SEPARATION AND DEREVERBERATION PERFORMANCE OF FREQUENCY DOMAIN BLIND SOURCE SEPARATION. Ryo Mukai Shoko Araki Shoji Makino

SEPARATION AND DEREVERBERATION PERFORMANCE OF FREQUENCY DOMAIN BLIND SOURCE SEPARATION. Ryo Mukai Shoko Araki Shoji Makino % > SEPARATION AND DEREVERBERATION PERFORMANCE OF FREQUENCY DOMAIN BLIND SOURCE SEPARATION Ryo Mukai Shoko Araki Shoji Makino NTT Communication Science Laboratories 2-4 Hikaridai, Seika-cho, Soraku-gun,

More information

Laboratory Assignment 2 Signal Sampling, Manipulation, and Playback

Laboratory Assignment 2 Signal Sampling, Manipulation, and Playback Laboratory Assignment 2 Signal Sampling, Manipulation, and Playback PURPOSE This lab will introduce you to the laboratory equipment and the software that allows you to link your computer to the hardware.

More information

RECOMMENDATION ITU-R BS User requirements for audio coding systems for digital broadcasting

RECOMMENDATION ITU-R BS User requirements for audio coding systems for digital broadcasting Rec. ITU-R BS.1548-1 1 RECOMMENDATION ITU-R BS.1548-1 User requirements for audio coding systems for digital broadcasting (Question ITU-R 19/6) (2001-2002) The ITU Radiocommunication Assembly, considering

More information

Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications

Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications Brochure More information from http://www.researchandmarkets.com/reports/569388/ Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications Description: Multimedia Signal

More information

Design of Robust Differential Microphone Arrays

Design of Robust Differential Microphone Arrays IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 22, NO. 10, OCTOBER 2014 1455 Design of Robust Differential Microphone Arrays Liheng Zhao, Jacob Benesty, Jingdong Chen, Senior Member,

More information

DIGITAL Radio Mondiale (DRM) is a new

DIGITAL Radio Mondiale (DRM) is a new Synchronization Strategy for a PC-based DRM Receiver Volker Fischer and Alexander Kurpiers Institute for Communication Technology Darmstadt University of Technology Germany v.fischer, a.kurpiers @nt.tu-darmstadt.de

More information

Improving room acoustics at low frequencies with multiple loudspeakers and time based room correction

Improving room acoustics at low frequencies with multiple loudspeakers and time based room correction Improving room acoustics at low frequencies with multiple loudspeakers and time based room correction S.B. Nielsen a and A. Celestinos b a Aalborg University, Fredrik Bajers Vej 7 B, 9220 Aalborg Ø, Denmark

More information

EE 791 EEG-5 Measures of EEG Dynamic Properties

EE 791 EEG-5 Measures of EEG Dynamic Properties EE 791 EEG-5 Measures of EEG Dynamic Properties Computer analysis of EEG EEG scientists must be especially wary of mathematics in search of applications after all the number of ways to transform data is

More information