PAPER Enhanced Vertical Perception through Head-Related Impulse Response Customization Based on Pinna Response Tuning in the Median Plane
IEICE TRANS. FUNDAMENTALS, VOL.E91-A, NO.1 JANUARY 2008

PAPER

Ki Hoon SHIN a), Nonmember and Youngjin PARK, Member

SUMMARY Humans' ability to perceive the elevation of a sound and to distinguish whether a sound is coming from the front or the rear depends strongly on the monaural spectral features of the pinnae. To realize an effective virtual auditory display by HRTF (head-related transfer function) customization, the pinna responses were isolated from the median-plane HRIRs (head-related impulse responses) of 45 subjects in the CIPIC HRTF database and modeled as linear combinations of 4 or 5 basic temporal shapes (basis functions) at each elevation on the median plane by PCA (principal components analysis) in the time domain. By tuning the weight on each basis function computed for a specific elevation, replacing the pinna response in the KEMAR HRIR at that elevation with the resulting customized pinna response, and listening to the filtered stimuli over headphones, 4 individuals with normal hearing sensitivity were able to create sets of HRIRs that outperformed the KEMAR HRIRs in producing vertical effects with reduced front/back ambiguity in the median plane. Since the monaural spectral features of the pinnae are almost independent of the azimuth of the source direction, similar vertical effects could also be generated at other azimuths simply by varying the ITD (interaural time difference) according to the source direction and the size of each listener's own head.

key words: HRTF customization, HRIR, pinna response tuning, principal components analysis

1. Introduction

The ability of humans to use sonic cues to localize a sound in the surrounding 3-dimensional space is referred to as auditory localization.
At its very core lies the head-related transfer function (HRTF), which comprises the major cues for spatial hearing: the ITD (interaural time difference), the ILD (interaural level difference), and the spectral modification induced by the pinna folds. Synthesis of spatial hearing based on HRTFs is of great practical and research importance, and non-individualized HRTFs measured with a dummy-head microphone system (the KEMAR, for instance) are used in most virtual audio syntheses. However, subjective evaluations of these non-individualized HRTFs across groups of individuals often report front/back reversals and poor vertical effects. Both front/back distinction and vertical perception in humans are mainly triggered by the spectral features (peaks and notches) produced by the direction-dependent filtering of the pinna, as described by Shaw and Teranishi [1]. In particular, the importance of spectral notches (or nulls) as localization cues in the median plane (0° azimuth) is supported by Blauert [2] and by Hebrank and Wright [3], who concluded that elevation in the median plane, where both ITD and ILD are zero, is cued by a spectral notch whose frequency has a dependence on elevation similar to that previously observed by Shaw and Teranishi in the lateral plane. Further results confirmed this conclusion both in the median plane [4] and in the lateral plane [5]. In an attempt to explain this prominent feature of HRTFs, Lopez-Poveda and Meddis [6] suggested a diffraction/reflection model based on the posterior wall of the human concha and were able to predict the notch frequencies with reasonable accuracy.

[Manuscript received April 9; manuscript revised June 29. The first author is with Samsung Electronics, Suwon-City, Republic of Korea. The second author is with KAIST, Science Town, Daejeon, Republic of Korea. a) E-mail: kihoon221.shin@samsung.com DOI: /ietfec/e91 a]
More recently, Langendijk and Bronkhorst [7] were able to isolate the frequency bands responsible for front/back and up/down cues in human HRTFs via a series of subjective listening tests. They concluded that front/back cues and up/down cues are located mainly in the 8–16-kHz band and in the 6–12-kHz band, respectively. Both bands lie in the spectral region of the pinna response, which generally spans from 2 kHz to above 14 kHz [8]. Individual pinnae vary widely in size and shape, and the artificial pinnae mounted on the KEMAR are manufactured from the average dimensions of human pinna cavities. Therefore, the pinna response of a non-individualized HRTF generally cannot match that of each individual HRTF, resulting in front/back confusion and compromised vertical effects for most listeners. Based on the hypothesis that the structure of an HRTF is closely related to the dimensions and orientation of each individual body part, i.e. head, torso, shoulders, and pinnae, a variety of HRTF customization techniques that modify other people's HRTFs have been introduced to accomplish perceptual fidelity in virtual audio synthesis. Studies such as HRTF clustering and selection of a few most representative sets by Shimada et al. [9], a structural model for composition and decomposition of HRTFs by Algazi et al. [10], HRTF frequency scaling by Middlebrooks [11], and database matching by Zotkin et al. [12] already suggest that the hypothesis is somewhat valid, although localization equivalent to that obtained with the listener's own HRTFs was never closely achieved.

[Copyright © 2008 The Institute of Electronics, Information and Communication Engineers]

For example, the work of Middlebrooks is based on the idea that
the HRTF shifts toward lower frequencies while maintaining its shape when the pinna is scaled up in size. If the listener deduces the source elevation from the positions of peaks and notches in the oncoming sound spectrum, localization with a scaled-up pinna larger than the listener's own will result in a systematic bias in elevation perception, and personalization may be achieved simply by scaling the HRTF back down. However, the pinnae of different individuals differ in many more respects than a simple scaling, and even a seemingly insignificant change in the shape of the pinna can cause dramatic changes in the HRTF. The database matching technique suggested by Zotkin et al. [12] relies on the HRTF database released by the CIPIC Interface Laboratory at UC Davis, which contains 43 sets of individual HRTFs and 2 sets of KEMAR HRTFs along with some anthropometric information. By taking a picture of the listener's ear and comparing the anthropometric parameters measured from the image to those provided in the database, they selected the best-matching set of individual HRTFs for virtual auditory synthesis. Although localization performance on source elevation improved by 20–30% for 4 out of 6 subjects, this method requires a sophisticated imaging system that can capture the subject's ear at its real-life size and automatically compute the anthropometric dimensions from the image. In 1984, Morimoto and Aokata [13] introduced the interaural-polar coordinate system and showed that spectral cues similar to those observed in the median plane occur in any sagittal plane. Moreover, Wightman and Kistler [14] conducted a series of experiments in which the stimuli contained an ITD signaling one direction and ILD and pinna cues signaling another direction, through manipulation of the ITD in the measured HRTFs of several individuals.
The apparent lateral directions of such stimuli with conflicting cues almost always followed the ITD cue as long as the stimuli included low frequencies. Morimoto et al. [15] proposed a new sound localization method based on [13] that successfully rendered 3-d sound images in a sagittal plane by simulating interaural differences (ITD and ILD) together with individual HRTFs measured in the median plane. They further showed that the ITD dominates lateral perception by performing localization tests in which either the ITD or the ILD was manipulated while the other was kept at zero. In this paper, a measurement-free yet effective HRTF customization method that can be built on any individual HRTF database of substantial size is proposed. The goal of our study is not the retrieval of exact individual HRTFs. Rather, it lies in the development of hybrid HRTFs that deliver the necessary vertical perception better than non-individual HRTFs while reducing front/back reversals for any particular listener. The basic idea is similar to that suggested in [15]. Vertical perception is controlled by modifying the pinna responses extracted from the median-plane HRIRs of any individual HRTF database that does not contain the HRTF of the target subject, and lateral perception is controlled by introducing the head shadow effect to compensate for ILDs, together with proper ITDs represented as simple linear delays. Justification for approximating the HRTF phase as linear (a pure delay, independent of frequency) can be found in the work of Kulkarni et al. [16]. Our method is developed primarily in the time domain because structural decomposition of an HRTF is generally not easy in the frequency domain. An HRIR is a sequence of temporal events of sound waves reaching the ears over multiple paths.
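The linear-delay representation of the ITD described above can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the Woodworth-style spherical-head formula, the head radius, and the sampling rate used here are assumptions for the example.

```python
import numpy as np

def apply_itd(left_hrir, right_hrir, itd_samples):
    """Simulate an ITD as a simple linear delay: shift the far ear's
    HRIR by an integer number of samples (positive delays the right ear)."""
    def delayed(h, n):
        out = np.zeros_like(h)
        out[n:] = h[:len(h) - n] if n > 0 else h
        return out
    if itd_samples >= 0:
        return left_hrir.copy(), delayed(right_hrir, itd_samples)
    return delayed(left_hrir, -itd_samples), right_hrir.copy()

def itd_in_samples(azimuth_deg, fs=44100, head_radius_m=0.0875, c=343.0):
    """Woodworth-style spherical-head ITD estimate (an assumption here),
    rounded to whole samples at the given sampling rate."""
    az = np.deg2rad(azimuth_deg)
    return int(round((head_radius_m / c) * (az + np.sin(az)) * fs))
```

For a source at 0° azimuth the delay is zero, consistent with the median-plane case considered in this paper; scaling `head_radius_m` per listener is how the individualized ITDs mentioned above would enter.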
Therefore, the pinna response can be extracted from an HRIR simply by clipping away the shoulder/torso response and keeping only the early response, since the pinna is located closest to the ear canal. Brown and Duda [17] argued that most pinna activity occurs in the first 0.7 ms after the arrival of the direct pulse, based on a comparison of KEMAR HRIRs measured with and without pinnae. However, a more detailed comparison of the data presented in their work reveals that the difference is not prominent after the first 0.2 ms. Examination of the HRIRs from our HRTF database [18] and those from the CIPIC HRTF database [19] also indicates that most pinna activity with the largest intersubject variation is concentrated in the first 0.2 ms, which corresponds to 10 samples at the sampling rate of the database. The proposed HRTF customization procedure consists of the following steps (see Fig. 1). First, the temporal pinna responses, each containing exactly 10 samples from the beginning of the direct pulse, are extracted from a group of individual HRIRs measured in the median plane after all initial time delays are removed. Then, principal components analysis (PCA) is performed on the isolated pinna

Fig. 1 Outline of procedures for the proposed HRTF customization method.
responses at each selected elevation angle to model them as linear combinations of 4 or 5 basis functions (or principal components) using the covariance method [20]. A graphical user interface (GUI) designed in MATLAB™ allows the subject to tune the pinna response by changing the weight on each basis function and listening to a broadband stimulus (100 Hz–20 kHz), filtered with the resulting pinna response aligned with a shoulder/torso response extracted from the KEMAR HRIR at the same elevation angle, over a set of headphones (Sennheiser HD 250 Linear II). KEMAR's shoulder/torso response at each elevation angle can be obtained simply by clipping away the pinna response and linear delay from the corresponding KEMAR HRIR; this step is indicated by the dashed crosses shown in Fig. 1. Adjustment of the weight on each basis function can continue until a satisfactory elevation perception is achieved. The proposed HRIR customization procedure also includes steps for introducing the head shadow effect and individualized ITDs to the customized pinna responses, as shown in Fig. 1, for accurate virtual auditory synthesis in the entire 3-d space around a target listener's head. However, these interaural differences were ignored in this study because we wanted first to verify the effectiveness of the proposed HRIR customization method in rendering enhanced elevation perception and reduced front/back confusion in the median plane only, where all interaural differences are zero. A total of 4 subjects with normal hearing sensitivity participated in this study. For performance comparison, the individual HRTFs of these 4 participants were measured in the median plane. Subjective listening tests were performed on the customized HRIRs, individual HRIRs, and KEMAR HRIRs in order to verify the feasibility of the proposed method.
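The 10-sample pinna-response extraction described in the first step above can be sketched as follows; the threshold-based onset detector is a hypothetical choice for illustration, since the paper removes the initial delays without specifying a detector.

```python
import numpy as np

def extract_pinna_response(hrir, n_samples=10, onset_fraction=0.2):
    """Strip the initial time delay and keep the early (pinna) response:
    the first n_samples from the onset of the direct pulse. The onset is
    taken as the first sample whose magnitude exceeds onset_fraction of
    the peak magnitude (an assumed detector, for illustration only)."""
    h = np.asarray(hrir, float)
    threshold = onset_fraction * np.max(np.abs(h))
    onset = int(np.argmax(np.abs(h) >= threshold))
    return h[onset:onset + n_samples]
```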
2. Method

2.1 PCA of Pinna Responses in the Time Domain

A typical HRIR can be decomposed into a series of temporal sound events as shown in Fig. 2. There is first an initial time delay due to the distance of the source with respect to the ears. Then a direct pulse, whose amplitude depends on the source distance and shadowing, arrives, followed by a ridge-trough combination caused by reflection and diffraction in the pinna cavities. The rest of the signal contains reflections from the shoulders, torso, and measurement devices such as the turntable and the vertical hoop stand holding the point source at the desired angle. Strictly speaking, the direct pulse is not part of the pinna response, but the early response lasting about 0.2 ms after the arrival of the direct pulse is referred to as the pinna response throughout the rest of this paper for convenience. Note that the individual HRIRs used in our analysis are those from the CIPIC HRTF database [19], which contains HRTFs of 43 individual subjects plus the KEMAR with 2 sets of pinnae of different sizes. The procedure of the covariance method [20] used for PCA is as follows. Let X be an M by N data matrix containing the extracted pinna responses at a selected elevation angle, where M is the number of dimensions (10 in this case) and N is the number of available data sets (45 in this case). The empirical mean of X along each dimension m = 1,...,M is

u[m] = \frac{1}{N} \sum_{n=1}^{N} X[m, n].   (1)

The empirical mean of the 45 individual pinna responses measured at 45° elevation is shown for both ears in Fig. 3 as an example. This mean vector u is then subtracted from each column of X to get a mean-subtracted data matrix B:

B = X - u h   (2)

where h is a 1 by N row vector of all 1's. The M by M covariance matrix C is obtained from the outer product of B with itself:

C = E[B \otimes B] = \frac{1}{N-1} B B^{*}   (3)

where * is the conjugate transpose operator.
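Equations (1)–(3) amount to a few lines of linear algebra. A minimal sketch, assuming real-valued pinna responses stacked as the columns of X:

```python
import numpy as np

def covariance_of_pinna_responses(X):
    """X: M x N matrix of pinna responses (M samples, N subjects).
    Returns the empirical mean u (Eq. (1)), the mean-subtracted data
    B = X - u h (Eq. (2)), and the covariance C = B B* / (N-1) (Eq. (3))."""
    M, N = X.shape
    u = X.mean(axis=1)                 # Eq. (1): mean along each dimension
    B = X - u[:, None]                 # Eq. (2): subtract u from every column
    C = (B @ B.conj().T) / (N - 1)     # Eq. (3): M x M covariance matrix
    return u, B, C
```

For the data described in the text, X would be 10 x 45 and C 10 x 10; the result matches NumPy's own `np.cov` with its default N−1 normalization.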
Fig. 2 Structural decomposition of an HRIR measured with a B&K HATS (Head And Torso Simulator) with an acoustic point source located at 0° azimuth and 0° elevation [18].

Fig. 3 Empirical mean of 45 pinna responses per ear collected from the CIPIC HRIRs measured at 45° elevation.

Next, the eigenvalue matrix D and the orthonormal eigenvector matrix V of the covariance matrix C are computed satisfying the following relationship:

C V = V D   (4)

where D is an M by M diagonal matrix with the eigenvalues of C on its diagonal. Matrices V and D must be rearranged in order of decreasing eigenvalue. The eigenvalues then represent the energy distribution of the data X among the eigenvectors, which form a basis for the data. The cumulative energy content g is the sum of the energy content across the eigenvectors from 1 through m:

g[m] = \sum_{q=1}^{m} \lambda_q   (5)

where \lambda_q is the qth eigenvalue and m = 1,...,M. By choosing a suitable accuracy bound, set to more than 90% of the total energy of the original data in our analysis, a subset of the eigenvectors is selected as basis vectors (principal components). The first L columns of V that satisfy the following accuracy bound on the cumulative energy ratio (CER) are chosen as the principal components (PCs):

CER (%) = \frac{g[L]}{g[M]} \times 100 > 90%.   (6)

The CER computed for the pinna responses at 45° elevation using the above equation with L = 1,...,10 is shown in Fig. 4. It can be seen that at least 5 PCs are required for the modeled data to represent more than 90% of the energy in the original data for both ears, so L = 5 in this case. These 5 PCs obtained for each ear are shown in Fig. 5. Note that the PCs obtained for the left ear pinna responses are almost identical to those obtained for the right ear; this was generally the case for the data at other elevation angles as well. Depending on the elevation angle, the required number of PCs was sometimes 4. Now let W be an M by L matrix with the L PCs as its column vectors:

W[p, q] = V[p, q]   (7)

for p = 1,...,M and q = 1,...,L. A new data matrix Y, which is the projection of X onto the L principal components, can be obtained simply by

Y = W^{*} B
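The eigen-selection of Eqs. (4)–(6), the projection of Eq. (8), and the reconstruction of Eq. (9) can be sketched as follows. This assumes real-valued data; `numpy.linalg.eigh` is used because C is a symmetric covariance matrix.

```python
import numpy as np

def choose_pcs(C, cer_bound=90.0):
    """Solve C V = V D (Eq. (4)), rearrange in order of decreasing
    eigenvalue, and keep the first L eigenvectors whose cumulative
    energy ratio (Eqs. (5)-(6)) exceeds cer_bound percent."""
    eigvals, V = np.linalg.eigh(C)        # ascending order for symmetric C
    order = np.argsort(eigvals)[::-1]     # rearrange: decreasing eigenvalue
    eigvals, V = eigvals[order], V[:, order]
    g = np.cumsum(eigvals)                # Eq. (5): cumulative energy content
    cer = 100.0 * g / g[-1]               # Eq. (6): cumulative energy ratio
    L = int(np.argmax(cer > cer_bound)) + 1
    return V[:, :L]                       # W: first L columns of V (Eq. (7))

def project_and_reconstruct(X, W):
    """Eq. (8): PCWs Y = W* B; Eq. (9): truncated reconstruction W Y + u h."""
    u = X.mean(axis=1)
    B = X - u[:, None]
    Y = W.conj().T @ B                    # L x N matrix of PCWs
    X_hat = W @ Y + u[:, None]            # approximate recovery of X
    return Y, X_hat
```

When the data truly lie in the span of W (plus the mean), the reconstruction is exact; with the 4 or 5 PCs retained in the text it recovers the original pinna responses to within the discarded 10% of energy.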
(8)

This new data matrix Y (an L by N matrix) can then be used to retrieve a truncated version of the original data X by

\hat{X} = W Y + u h.   (9)

In essence, a linear superposition of the L PCs in W, with the nth column of Y as the set of L principal component weights (PCWs), approximately recovers the nth column of the original data X.

Fig. 4 Cumulative energy ratio (CER in Eq. (6)) plotted against an increasing number of PCs for 45° elevation. The number of PCs on the horizontal axis represents L in Eq. (6).

Fig. 5 Five basis functions (principal components: PC1–PC5) of the pinna responses at 45° elevation. The solid lines denote the left ear principal components and the dashed lines the right ear principal components.

Fig. 6 Pinna responses at 45° elevation of subject 50 (solid) in the CIPIC HRTF database and their approximations (dashed) computed as linear combinations of the 5 PCs per ear shown in Fig. 5. Left ear responses are plotted in the upper panel and right ear responses in the lower panel.

Fig. 7 Five sets of PCWs required to recover the original pinna responses in the CIPIC HRTF database as linear combinations of the five PCs (left ear) depicted in Fig. 5. Note that the distribution of the PCWs becomes narrower as the eigenvalue decreases.

Fig. 8 Left ear pinna responses at 45° elevation of 4 randomly selected subjects from the CIPIC HRTF database.

Fig. 9 Left ear pinna responses of subject 8 (solid), subject 60 (dashed), and subject 153 (dotted) from the CIPIC HRTF database at various elevation angles. The numbers at the right indicate the corresponding angles.

The left and right pinna responses at 45° elevation for subject 50 from the CIPIC HRTF database, along with the approximations computed using Eq. (9), are plotted for comparison in Fig. 6. It can be seen that 5 PCs are enough to recover the original data with close resemblance. The 45 sets of 5 PCWs for the left ear PCs shown in Fig. 5, required to model all 45 left ear pinna responses in the CIPIC HRTF database, are captured in Fig. 7. Note that the spread of the PCWs is largest for PC 1 and smallest for PC 5. This is a direct consequence of rearranging V and D (Eq. (4)) in order of decreasing eigenvalue, since a larger eigenvalue implies a larger energy distribution of the original data along the corresponding eigenvector. In other words, the first 2 PCs are more important basis functions than the latter 3 PCs in representing the variation of the original data. The left ear pinna responses of 4 randomly selected individuals at 45° elevation, depicted in Fig. 8, show large intersubject variations around 0.08, 0.11, and 0.16 ms. One can easily observe from the left ear PCs in Fig. 5 that the first 3 PCs have ridges at these temporal positions, indicating that a linear combination of the first 3 PCs with appropriate PCWs can cover most intersubject variation in the shape and amplitude of the ridge-trough pair following the direct pulse. Amplitude variation of the direct pulse can be covered with PC 5 because it has a ridge in the region where the direct pulse is likely to reside. Therefore, by allowing a subject to tune the weight on each PC for customization, one is merely adding a timed ridge-trough pair with adjusted amplitude and an overall level shift to the mean pinna response in Fig. 3. The left ear pinna responses of 3 randomly selected individuals at elevations from −30° through 210° are plotted in Fig. 9 in order to observe the intersubject variation pattern per elevation angle in the median plane. The most common and salient change in the individual pinna responses as the source climbs in elevation lies in the arrival time and level of the first reflection (second ridge) immediately after the direct pulse (first ridge), and also in the shape and duration of the trough that follows. The temporal interval between the arrivals of the direct pulse and the first reflection contracts as the source rises in the frontal hemisphere, up to 60° where the two pulses merge into a single ridge. The two pulses stay merged for all rear source positions. Meanwhile, the width of the following trough decreases as the source rises to 90°, directly over the head, and increases again as the source descends in the rear hemisphere. The above
phenomenon is similar to that observed by Hiranaka and Yamasaki [21]. After examining many individual pinna responses in the CIPIC HRTF database, we conclude that most intersubject variation in pinna responses lies in the amplitude and arrival times of either the direct pulse or the ridge-trough pair, depending on the elevation angle of the source. Note that these intersubject variations become quite small as the source moves into the rear hemisphere, especially when the source lies directly behind the listener at 180°. However, it can be shown that even a very small difference in the time domain yields a large difference in the frequency domain.

2.2 PCW Tuning for Customization

As mentioned above, letting a subject tune the weight on each PC brings an actual change in the shape of the pinna response. Four male subjects with normal hearing sensitivity participated in making customized HRTFs using the GUI (graphical user interface) depicted in Fig. 10, whose sectors are bound by boxes and labeled by function. A subject may choose any elevation angle from −45° to 230° in the median plane, since the HRTFs from the CIPIC HRTF database are available over that angular range at regular intervals. However, customization was carried out only at 9 specific elevation angles, from −30° to 210° at 30° intervals in the median plane, in order to compare the localization performance of the customized HRTFs to that of the individual HRTFs of the participants measured at those angles. The balance control in the GUI adjusts the gains applied to the left and right channels, since it is necessary to render sound images in the center before tuning commences, and an interaural difference in perceived levels between the left and right ears is quite common even for individuals with normal hearing sensitivity.
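The tuning step itself reduces to splicing a weighted-PC pinna response into the KEMAR HRIR. A simplified sketch follows; the onset index and the hard overwrite are assumptions for illustration, whereas the actual GUI aligns the tuned pinna response with KEMAR's shoulder/torso response as described in Sect. 1.

```python
import numpy as np

def synthesize_custom_hrir(kemar_hrir, onset, mean_pinna, pcs, pcws):
    """Build a customized pinna response as mean + PCs @ weights (the
    tuned PCWs) and overwrite the pinna portion (the 10 samples after
    the direct-pulse onset) of the KEMAR HRIR, leaving KEMAR's
    shoulder/torso tail untouched."""
    custom_pinna = mean_pinna + pcs @ np.asarray(pcws, float)
    out = np.asarray(kemar_hrir, float).copy()
    out[onset:onset + len(mean_pinna)] = custom_pinna
    return out
```

Moving one slider in the GUI corresponds to changing one entry of `pcws`, i.e. adding one timed ridge-trough shape (or level shift) to the mean pinna response, exactly as described in Sect. 2.1.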
As mentioned in the previous section, the PCs obtained for the left and right ears turned out to be similar to each other at most elevation angles, despite the interaural shape differences in the pinna responses of some individuals in the CIPIC HRTF database. As a result, ear symmetry was assumed and customization was performed by tuning the PCWs of one ear only. The slider on each slide-bar of the GUI represents the PCW value for one PC. After entering the elevation angle at which customization is to be performed, principal components analysis is executed on the isolated pinna responses measured at the specified angle, and the corresponding PCs are computed by pushing the PCA button. Then each participant adjusts the slide-bars to set the PCW on each PC and listens to an input stimulus (100 Hz–20 kHz) filtered by the newly created HRIR (marked as Custom HRIR in Fig. 10) by pushing the PLAY button. This Custom HRIR is formed by aligning the pinna response obtained as a linear combination of the tuned PCs with the shoulder/torso response of the KEMAR HRIR measured at the same angle. The PLAY KEMAR button is for listening to the same input stimulus filtered by the KEMAR HRIR.

Fig. 10 A MATLAB™ GUI for pinna response customization based on tuning of PCWs (see text for details).

Some listeners may find the vertical perception produced by the KEMAR HRIRs good enough, in which case they can tune the PCWs so
that the resulting pinna response, shown as a solid line in the top-right panel of the GUI, takes a shape similar to that of the KEMAR's, shown as a dashed line in the same plot, or simply keep the KEMAR HRIR as their customized HRIR at each angle of concern. On the other hand, if the KEMAR HRIR performs poorly in producing the necessary vertical effects, the tuning can continue until the participant is satisfied with the resulting vertical effect he or she perceives. In our study, all participants reported unsatisfactory vertical perception with the KEMAR HRIRs, so tuning was performed at all target angles. Note that the headphone-pinna coupling effect for each subject was cancelled using the subject's own headphone-to-meatus-entrance transfer function for all output stimuli produced in the above tuning experiment.

2.3 Individual HRTF Measurement

The individual HRTFs of the four subjects who participated in the above tuning experiment were measured at the elevation angles where the pinna customization took place. Subjects were seated in a chair coupled to a vertical hoop designed to hold an acoustic point source. Details on the measurement apparatus and method can be found in our previous work on modeling HRTFs for nearby sources [18]. For correct headphone-presented simulation of free-field listening when evaluating these individual HRTFs for their localization capabilities, the headphone-pinna coupling effect was cancelled using the headphone-to-meatus-entrance transfer function measured on each subject according to the method suggested by Wightman and Kistler [22]. A typical HRTF measurement for an individual is carried out by placing a probe tube in the ear canal at a position very close to the eardrum, which is obviously a very difficult task.
Møller, Sørensen, Hammershøi, and Jensen [23] demonstrated that HRTF measurements can also be made by measuring free-field and headphone responses at the entrance of a blocked ear canal. Their technique, however, requires a miniature microphone embedded in an earplug that can be fitted in each subject's ear canal. Instead of dealing with the laborious procedures involved in the conventional measurement techniques, we adopted the blocked-meatus measurement technique using a B&K Binaural Microphone Type 4101 mounted inside each subject's pinna, as shown in Fig. 11, for the sake of convenience and efficiency.

Fig. 11 B&K Binaural Microphone Type 4101 (right) for measuring individual HRTFs, mounted inside a subject's pinna at the entrance to the ear canal (left).

Although this stethoscope-like microphone set simplifies the overall measurement process considerably, it was difficult to bend the microphone arms so that the microphone tips could be fitted precisely at the ear canal entrance without touching the tragus. Anchoring them in exactly the same positions throughout the measurement was another difficulty we faced. The microphone arms were taped to each subject's lower cheeks in an effort to anchor the microphone tips, and the subjects were instructed to refrain from making any noticeable movement during the experiment. However, as the evaluation results in the next section suggest, we believe that our individual HRTFs contain some errors induced by imprecise positioning of the microphone tips.

3. Subjective Evaluation Results

Subjective listening tests were carried out on all four subjects (ID: SK, HS, KB, and CH) to assess the performance of three HRIR sets: customized HRIRs, individual HRIRs, and KEMAR HRIRs. In an attempt to prevent any learning acquired by the subjects during the tuning process from affecting the overall evaluation result, the evaluation experiment was conducted several days after all subjects had completed tuning.
The subjects listened to broadband stimuli filtered by HRIRs from each of the above three HRIR sets over the headphones and gave their perceived responses by typing into a GUI designed for the evaluation test. Each of the 9 elevation angles was simulated 10 times in random order, yielding in total 90 stimuli to evaluate per HRIR set. The subjective evaluation results are shown in Figs. 12–15 for all 4 subjects. Evaluations of the KEMAR, individual, and customized HRIRs are displayed in the left, center, and right panels, respectively, of each figure. The horizontal axis denotes the actual source positions and the vertical axis the perceived source positions in each panel. Note the response frequency scale drawn in a small box in the right panel of Fig. 15. The response frequency is represented by the size of the square, with the largest square indicating 10 identical responses and the smallest square indicating 1 response per source location. The positive-sloped diagonal line in each panel indicates the perfect hearing condition in which the perceived source position corresponds exactly to the actual source position. The following observations are based on the evaluation responses presented in Figs. 12–15. All subjects reported difficulties of varying degree in making correct judgments of the source elevation on most trials with the KEMAR HRIRs. Either front/back reversal was frequent (especially for subjects SK and CH), which is evident from the many off-diagonal responses in positions symmetric with respect to the diagonal, or localization performance was low (for all 4 subjects), judging by the large response spread about
the diagonal.

Fig. 12 Subjective evaluation result for subject SK on 3 HRIR sets: KEMAR, individual, and customized HRIRs (refer to text for details).

Fig. 13 Subjective evaluation result for subject HS (refer to text for details).

Fig. 14 Subjective evaluation result for subject KB (refer to text for details).

Fig. 15 Subjective evaluation result for subject CH (refer to text for details).

With individual HRIRs, front/back reversals were reduced for all subjects except subject HS, who often perceived the frontal sources at −30° and 0° to be in the rear instead. Subject KB made quite a few errors in
localizing the rear sources even with his own HRIRs, and the scattered responses produced by subject CH for sources at −30°, 0°, and 30° suggest that he too had difficulty localizing the frontal sources near the horizontal plane. In general, however, all subjects performed better with their own individual HRIRs than with the KEMAR HRIRs, judging by the tighter distribution of the responses around the diagonal. Comparison of the responses made with customized HRIRs to those made with the KEMAR HRIRs reveals the following. Front/back reversals were reduced for all subjects with customized HRIRs, except for subject HS, who made confusion errors for the sources on and below the horizontal plane similar to those he made with his own HRIR set. Localization performance was enhanced for all subjects at most source positions, judging by the smaller spread about the diagonal. Although subject HS's localization performance with customized HRIRs was poor for sources near the horizontal plane, it was slightly improved for sources positioned at other elevation angles, i.e. from 0° to 150°. Subject KB made poor elevation judgments with customized HRIRs as the source shifted from 90° to 210° into the rear hemisphere, but it should be noted that his localization performance on rear sources was poor with all 3 HRIR sets.

Table 1 Localization errors s in Eq. (10) and front/back confusion counts computed by resolution of the responses shown in Figs. 12–15. The letters denote confusion clusters, i.e. C indicates the total confusions, B the backward confusions, and F the forward confusions.

Fig. 16 Illustration of resolving front/back confusions. The confusions are reflected about the vertical plane (horizontal dashed line) onto the correct hemisphere.
When computing error indices to characterize the localization performance associated with a particular set of HRIRs, it has been common practice to treat front/back confusions and localization accuracy separately by resolving the confusions in order to avoid error inflation [24]. On the other hand, resolving the confusions can be misleading if we assume that the responses correctly reflect the subject's perception. However, since our primary goal was to compare the three HRIR sets in terms of localization performance, we too elected to resolve all apparent confusions and to report the incidence of confusions associated with each set of HRIRs. If the angle between the actual source position and the perceived response is made smaller by reflecting the response about the vertical plane passing through the subject's ears, as shown in Fig. 16, the response is entered in reflected form and the confusion count is increased by one. The localization error is then computed in the root-mean-square sense, including both the responses lying in the same hemisphere as the sources and the confusions in reflected form, by the following definition:

$$ s = \left[ \frac{1}{90} \sum_{i=1}^{90} \bigl( x_i - \phi_{\mathrm{source}}(i) \bigr)^2 \right]^{1/2} \tag{10} $$

where $x_i$ is the perceived response for the $i$th stimulus corresponding to the actual source position $\phi_{\mathrm{source}}(i)$, and 90 is the total number of stimuli presented per HRIR set. Table 1 lists these RMS errors and the confusion counts organized per subject and per HRIR set. The RMS error appears in the top row of each cell, and the confusion counts follow in the bottom row in the form: no. of total confusions (no. of backward confusions + no. of forward confusions). From the error indices in Table 1 we can draw the following conclusions regarding the localization performance associated with each set of HRIRs.
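The confusion-resolution and scoring procedure above can be sketched in a few lines. This is a hedged illustration, assuming polar elevation angles in degrees on the median plane and a reflection of the form φ → 180° − φ; the function names and the angle-wrapping helper are our own, not the paper's code.

```python
import numpy as np

def angle_diff(a, b):
    """Signed angular difference a - b, wrapped to (-180, 180] degrees."""
    return (a - b + 180.0) % 360.0 - 180.0

def resolve_and_score(sources_deg, responses_deg):
    """Sketch of the scoring around Eq. (10): a response is counted as a
    front/back confusion and entered in reflected form (phi -> 180 - phi,
    i.e. mirrored about the vertical plane through the ears) whenever the
    reflection brings it closer to the actual source; the RMS error s is
    then taken over all resolved responses."""
    src = np.asarray(sources_deg, dtype=float)
    rsp = np.asarray(responses_deg, dtype=float)
    mirrored = 180.0 - rsp
    # Reflect only when reflection reduces the angular error.
    confused = np.abs(angle_diff(mirrored, src)) < np.abs(angle_diff(rsp, src))
    resolved = np.where(confused, mirrored, rsp)
    s = np.sqrt(np.mean(angle_diff(resolved, src) ** 2))  # Eq. (10)
    return s, int(confused.sum())
```

For example, a response of 180° to a source at 0° is resolved to 0° and counted as one backward confusion, contributing nothing to the RMS error after resolution.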
Comparison of the localization errors produced with the KEMAR HRIRs to those with the customized HRIRs reveals that localization accuracy improved markedly with the customized HRIRs for subjects KB and CH, whereas subjects SK and HS showed slightly better accuracy with the KEMAR HRIRs. Obviously this is a direct result of resolving the confusions, because it appears to be otherwise for subjects SK and HS in Figs. 12 and 13. Of course, with the customized HRIRs front/back confusions were reduced for all subjects; in particular, subjects SK and CH showed dramatic improvements, i.e., the confusion counts went from 29 to 9 for SK and from 43 to 6 for CH. On the contrary, the localization performance with individual HRIRs was not quite satisfactory for all subjects. Individual HRIRs are generally known to produce good localization results, but past studies such as that by Wightman and Kistler [24] show that headphone simulation of free-field listening tends to produce more frequent front/back confusions and less well-defined source elevation than the free-field condition. With individual HRIRs, subjects HS and KB produced the best overall localization accuracy, and subject KB's front/back confusions were the fewest of all three HRIR cases. On the other hand, the localization performance indices for the customized and individual HRIRs indicate that subjects SK and CH showed better localization accuracy, and subjects HS and CH produced fewer confusions, with the customized HRIRs than with the individual HRIRs. In short, with the customized HRIRs most subjects produced fewer confusions, and 2 out of 4 subjects (SK and CH) performed best in terms of both localization accuracy and front/back confusion.

Fig. 17 KEMAR (solid), individual (dashed), and customized (dotted) HRIRs (left) and the corresponding HRTFs (right) for subject SK.
Fig. 18 KEMAR (solid), individual (dashed), and customized (dotted) HRIRs (left) and the corresponding HRTFs (right) for subject CH.
The customized and individual HRIRs for subjects SK and CH, along with the KEMAR HRIRs and the corresponding HRTFs (direct Fourier transforms of the temporal responses), are depicted in Figs. 17 and 18 for example. These plots immediately reveal that most spectral deviations among the HRTFs occur in the high-frequency region and that the differences between the KEMAR and customized HRTFs mostly occur above 6 kHz, a direct consequence of the pinna response modification by tuning. It is also clear that even a small variation in the time response produces a substantial difference in the frequency response. In our study, we had hoped to find some similarity between the customized and individual responses in both the temporal and spectral shapes, because in theory the two sets of responses should capture and reflect the individual pinna features better than the KEMAR HRIRs if the tuning worked well, as it did for these two subjects in particular. Unfortunately, however, as was expected during the measurement phase of our study and also from the analysis of the evaluation results, there was very little similarity, or none at all, between the customized and individual HRTFs. The spectral notches and roll-offs that are known to be responsible for elevation perception barely coincide except in a few spectral regions, i.e., the notches at 7 kHz at 210°, the roll-offs at 10 kHz at 150°, the notches at 16 kHz at 120°, and the notches at 11.3 kHz at 90° for subject SK in Fig. 17.
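The spectral comparison described above can be reproduced numerically: the HRTF is the discrete Fourier transform of the HRIR, and the pinna notches appear as local minima of the magnitude response above roughly 6 kHz. A minimal sketch, assuming 44.1 kHz sampling as in the CIPIC database; the notch picker is a crude illustration, not the analysis method of the paper:

```python
import numpy as np

def hrtf_magnitude(hrir, fs=44100, nfft=512):
    """Magnitude of the HRTF in dB, computed as the DFT of the HRIR."""
    spectrum = np.fft.rfft(hrir, n=nfft)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    mag_db = 20.0 * np.log10(np.maximum(np.abs(spectrum), 1e-12))
    return freqs, mag_db

def notch_frequencies(freqs, mag_db, fmin=6000.0):
    """Local minima of the magnitude response above fmin (Hz) -- a rough
    stand-in for locating elevation-dependent pinna notches."""
    notches = [freqs[i] for i in range(1, len(mag_db) - 1)
               if freqs[i] >= fmin
               and mag_db[i] < mag_db[i - 1]
               and mag_db[i] < mag_db[i + 1]]
    return np.array(notches)
```

As a sanity check, a toy two-tap HRIR (a direct path plus an equal reflection two samples later) produces its first comb-filter notch at fs/4, i.e. near 11 kHz, which the picker recovers.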
Although the localization performance of subjects SK and CH using their own HRIRs was passable compared with the headphone simulations of free-field conditions achieved by others in the past, we believe that the individual HRIRs measured in this study contain errors, probably induced by imprecise positioning of the microphone tips at the ear-canal entrance, as mentioned earlier. As a result, we cannot confirm at this point whether the spectral features in the HRTFs obtained by the proposed customization method indeed represent each individual's pinna characteristics, even though they have been shown to improve localization performance.

4. Discussion and Future Work

The proposed HRIR customization method, based on tuning of the basis functions obtained by time-domain PCA decomposition of the pinna responses, was shown to be effective in producing the necessary vertical effects while reducing front/back reversals. We confirmed this by a series of subjective listening tests. With the customized HRIRs, in comparison to the KEMAR HRIRs, 2 out of 4 subjects showed explicit improvements with a noticeable decrease in front/back reversals, while the other 2 subjects demonstrated enhanced elevation perception to some degree. All subjects reported that the sources at 60°, 90°, and 120° in elevation angle were among the toughest to discriminate from one another with both the individual and customized HRIRs, and that they had to guess the source elevation on most trials with the KEMAR HRIRs. We also verified that similar vertical effects could be generated in other azimuthal directions simply by adding proper ITDs to the customized HRIRs developed using the proposed method. The localization performance in other sagittal planes, along with detailed analysis, will follow in a subsequent paper.
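The last point, reusing the median-plane customization at other azimuths by adding a direction-dependent ITD, can be sketched as follows. The Woodworth spherical-head formula stands in here for whatever head model was actually used; the default head radius, sampling rate, and integer-sample delay are simplifying assumptions, not the paper's exact procedure.

```python
import numpy as np

def woodworth_itd(azimuth_deg, head_radius=0.0875, c=343.0):
    """Approximate ITD (s) from the Woodworth spherical-head formula,
    ITD = (a / c) * (theta + sin(theta)), valid for |azimuth| <= 90 deg.
    The 8.75 cm radius is a common default and should be scaled to the
    listener's own head, as the paper suggests."""
    theta = np.deg2rad(azimuth_deg)
    return (head_radius / c) * (theta + np.sin(theta))

def delay_hrir(hrir, delay_s, fs=44100):
    """Delay the contralateral-ear HRIR by the ITD, rounded to whole
    samples (a fractional-delay filter would be more precise)."""
    n = int(round(delay_s * fs))
    out = np.zeros_like(hrir)
    out[n:] = hrir[:len(hrir) - n]
    return out
```

Applying `delay_hrir` to the contralateral ear of a customized median-plane HRIR pair then shifts the virtual source toward the given azimuth while keeping the tuned pinna spectrum intact.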
Acknowledgments

This work was supported by the Korea Science and Engineering Foundation (KOSEF) through the National Research Laboratory Program (M J ) and the BK21 Project (2006) of the Republic of Korea.

References

[1] E.A.G. Shaw and R. Teranishi, "Sound pressure generated in an external-ear replica and real human ears by a nearby point source," J. Acoust. Soc. Am., vol.44, pp. –.
[2] J. Blauert, "Sound localization in the median plane," Acustica, vol.22, pp. –, 1969/1970.
[3] J. Hebrank and D. Wright, "Spectral cues used in the localization of sound sources on the median plane," J. Acoust. Soc. Am., vol.56, pp. –.
[4] R.A. Butler and K. Belendiuk, "Spectral cues utilized in the localization of sound in the median sagittal plane," J. Acoust. Soc. Am., vol.61, pp. –.
[5] P.J. Bloom, "Determination of monaural sensitivity changes due to the pinna by use of minimum-audible-field measurements in the lateral vertical plane," J. Acoust. Soc. Am., vol.61, pp. –.
[6] E.A. Lopez-Poveda and R. Meddis, "A physical model of sound diffraction and reflections in the human concha," J. Acoust. Soc. Am., vol.100, pp. –.
[7] E.H.A. Langendijk and A.W. Bronkhorst, "Contribution of spectral cues to human sound localization," J. Acoust. Soc. Am., vol.112, pp. –.
[8] H.W. Gierlich, "The application of binaural technology," Applied Acoustics, vol.36, pp. –.
[9] S. Shimada, M. Hayashi, and S. Hayashi, "A clustering method for sound localization transfer functions," J. Audio Eng. Soc., vol.42, pp. –.
[10] V.R. Algazi, R.O. Duda, R.P. Morrison, and D.M. Thompson, "Structural composition and decomposition of HRTFs," Proc. WASPAA'01, pp. –, New Paltz, NY.
[11] J.C. Middlebrooks, "Virtual localization improved by scaling nonindividualized external-ear transfer functions in frequency," J. Acoust. Soc. Am., vol.106, pp. –.
[12] D.N. Zotkin, R. Duraiswami, and L.S. Davis, "Customizable auditory displays," Proc. Int. Conf. on Auditory Display (ICAD), pp. –, Kyoto, Japan.
[13] M. Morimoto and H. Aokata, "Localization cues of sound sources in the upper hemisphere," J. Acoust. Soc. Jpn. (E), vol.5, pp. –.
[14] F.L. Wightman and D.J. Kistler, "The dominant role of low-frequency interaural time differences in sound localization," J. Acoust. Soc. Am., vol.91, pp. –.
[15] M. Morimoto, M. Itoh, and K. Iida, "3-D sound image localization by interaural differences and the median plane HRTF," Proc. Int. Conf. on Auditory Display (ICAD), Kyoto, Japan, July.
[16] A. Kulkarni, S.K. Isabelle, and H.S. Colburn, "Sensitivity of human subjects to head-related transfer-function phase spectra," J. Acoust. Soc. Am., vol.105, pp. –.
[17] C.P. Brown and R.O. Duda, "A structural model for binaural sound synthesis," IEEE Trans. Speech Audio Process., vol.6, no.5, pp. –.
[18] K. Shin and Y. Park, "Modeling of non-individualized head-related transfer functions for nearby sources," Proc. 9th Western Pacific Acoustics Conf. (WESPAC), pp. –, Seoul, Korea, June.
[19] CIPIC HRTF Database Files, Release 1.1, August 2001, CIPIC Interface Laboratory, U.C. Davis, available from ucdavis.edu/
[20] J.E. Jackson, A User's Guide to Principal Components, pp.1–25, John Wiley & Sons.
[21] Y. Hiranaka and H. Yamasaki, "Envelope representation of pinna impulse responses relating to three-dimensional localization of sound sources," J. Acoust. Soc. Am., vol.73, pp. –.
[22] F.L. Wightman and D.J. Kistler, "Headphone simulation of free-field listening. I: Stimulus synthesis," J. Acoust. Soc. Am., vol.85, pp. –.
[23] H. Møller, M.F. Sorensen, D. Hammershøi, and C.B. Jensen, "Head-related transfer functions of human subjects," J. Audio Eng. Soc., vol.43, pp. –.
[24] F.L. Wightman and D.J. Kistler, "Headphone simulation of free-field listening. II: Psychophysical validation," J. Acoust. Soc. Am., vol.85, pp. –.

Ki Hoon Shin was born in Seoul, Korea. He received his B.S. and M.S. degrees in mechanical engineering from the University of Rochester, NY, in 1996 and 1998, respectively. From 1998 to 2000, he was enrolled in a Ph.D. program in aerospace engineering at Georgia Tech, GA. Since 2001, he has been engaged in research on virtual audio synthesis for a Ph.D. in mechanical engineering at Korea Advanced Institute of Science and Technology (KAIST). He is now at the Digital Media R&D Center of Samsung Electronics, developing audio algorithms for DTVs and home theaters.

Youngjin Park was born in Seoul, Korea. He received his B.S. and M.S. degrees in mechanical engineering from Seoul National University in 1980 and 1982, respectively, and the Ph.D. in mechanical engineering from the University of Michigan, MI. From 1987 to 1988, he worked as a research fellow at the University of Michigan.
He also worked as an assistant professor at NJIT, NJ, from 1988 to . He joined the faculty of Korea Advanced Institute of Science and Technology (KAIST) in 1990, where he is a Professor of Mechanical Engineering. His research interests include general control theories, virtual audio synthesis, active control of noise and vibration, and system identification.
Audio Engineering Society Convention Paper Presented at the 129th Convention 2010 November 4 7 San Francisco, CA, USA The papers at this Convention have been selected on the basis of a submitted abstract
More informationIntensity Discrimination and Binaural Interaction
Technical University of Denmark Intensity Discrimination and Binaural Interaction 2 nd semester project DTU Electrical Engineering Acoustic Technology Spring semester 2008 Group 5 Troels Schmidt Lindgreen
More informationANALYSIS AND EVALUATION OF IRREGULARITY IN PITCH VIBRATO FOR STRING-INSTRUMENT TONES
Abstract ANALYSIS AND EVALUATION OF IRREGULARITY IN PITCH VIBRATO FOR STRING-INSTRUMENT TONES William L. Martens Faculty of Architecture, Design and Planning University of Sydney, Sydney NSW 2006, Australia
More informationInterference in stimuli employed to assess masking by substitution. Bernt Christian Skottun. Ullevaalsalleen 4C Oslo. Norway
Interference in stimuli employed to assess masking by substitution Bernt Christian Skottun Ullevaalsalleen 4C 0852 Oslo Norway Short heading: Interference ABSTRACT Enns and Di Lollo (1997, Psychological
More informationAN AUDITORILY MOTIVATED ANALYSIS METHOD FOR ROOM IMPULSE RESPONSES
Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-), Verona, Italy, December 7-9,2 AN AUDITORILY MOTIVATED ANALYSIS METHOD FOR ROOM IMPULSE RESPONSES Tapio Lokki Telecommunications
More informationAalborg Universitet. Binaural Technique Hammershøi, Dorte; Møller, Henrik. Published in: Communication Acoustics. Publication date: 2005
Aalborg Universitet Binaural Technique Hammershøi, Dorte; Møller, Henrik Published in: Communication Acoustics Publication date: 25 Link to publication from Aalborg University Citation for published version
More informationSimulation of wave field synthesis
Simulation of wave field synthesis F. Völk, J. Konradl and H. Fastl AG Technische Akustik, MMK, TU München, Arcisstr. 21, 80333 München, Germany florian.voelk@mytum.de 1165 Wave field synthesis utilizes
More informationConvention e-brief 433
Audio Engineering Society Convention e-brief 433 Presented at the 144 th Convention 2018 May 23 26, Milan, Italy This Engineering Brief was selected on the basis of a submitted synopsis. The author is
More informationValidation of lateral fraction results in room acoustic measurements
Validation of lateral fraction results in room acoustic measurements Daniel PROTHEROE 1 ; Christopher DAY 2 1, 2 Marshall Day Acoustics, New Zealand ABSTRACT The early lateral energy fraction (LF) is one
More informationProceedings of Meetings on Acoustics
Proceedings of Meetings on Acoustics Volume 19, 2013 http://acousticalsociety.org/ ICA 2013 Montreal Montreal, Canada 2-7 June 2013 Architectural Acoustics Session 1pAAa: Advanced Analysis of Room Acoustics:
More informationA binaural auditory model and applications to spatial sound evaluation
A binaural auditory model and applications to spatial sound evaluation Ma r k o Ta k a n e n 1, Ga ë ta n Lo r h o 2, a n d Mat t i Ka r ja l a i n e n 1 1 Helsinki University of Technology, Dept. of Signal
More informationAalborg Universitet. Audibility of time switching in dynamic binaural synthesis Hoffmann, Pablo Francisco F.; Møller, Henrik
Aalborg Universitet Audibility of time switching in dynamic binaural synthesis Hoffmann, Pablo Francisco F.; Møller, Henrik Published in: Journal of the Audio Engineering Society Publication date: 2005
More informationDistortion products and the perceived pitch of harmonic complex tones
Distortion products and the perceived pitch of harmonic complex tones D. Pressnitzer and R.D. Patterson Centre for the Neural Basis of Hearing, Dept. of Physiology, Downing street, Cambridge CB2 3EG, U.K.
More informationWAVELET-BASED SPECTRAL SMOOTHING FOR HEAD-RELATED TRANSFER FUNCTION FILTER DESIGN
WAVELET-BASE SPECTRAL SMOOTHING FOR HEA-RELATE TRANSFER FUNCTION FILTER ESIGN HUSEYIN HACIHABIBOGLU, BANU GUNEL, AN FIONN MURTAGH Sonic Arts Research Centre (SARC), Queen s University Belfast, Belfast,
More informationTDE-ILD-HRTF-Based 2D Whole-Plane Sound Source Localization Using Only Two Microphones and Source Counting
TDE-ILD-HRTF-Based 2D Whole-Plane Sound Source Localization Using Only Two Microphones Source Counting Ali Pourmohammad, Member, IACSIT Seyed Mohammad Ahadi Abstract In outdoor cases, TDOA-based methods
More informationImproving room acoustics at low frequencies with multiple loudspeakers and time based room correction
Improving room acoustics at low frequencies with multiple loudspeakers and time based room correction S.B. Nielsen a and A. Celestinos b a Aalborg University, Fredrik Bajers Vej 7 B, 9220 Aalborg Ø, Denmark
More informationOn the Estimation of Interleaved Pulse Train Phases
3420 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 48, NO. 12, DECEMBER 2000 On the Estimation of Interleaved Pulse Train Phases Tanya L. Conroy and John B. Moore, Fellow, IEEE Abstract Some signals are
More informationProceedings of Meetings on Acoustics
Proceedings of Meetings on Acoustics Volume 19, 2013 http://acousticalsociety.org/ ICA 2013 Montreal Montreal, Canada 2-7 June 2013 Engineering Acoustics Session 2pEAb: Controlling Sound Quality 2pEAb10.
More informationURBANA-CHAMPAIGN. CS 498PS Audio Computing Lab. 3D and Virtual Sound. Paris Smaragdis. paris.cs.illinois.
UNIVERSITY ILLINOIS @ URBANA-CHAMPAIGN OF CS 498PS Audio Computing Lab 3D and Virtual Sound Paris Smaragdis paris@illinois.edu paris.cs.illinois.edu Overview Human perception of sound and space ITD, IID,
More informationEvaluating HRTF Similarity through Subjective Assessments: Factors that can Affect Judgment
Evaluating HRTF Similarity through Subjective Assessments: Factors that can Affect Judgment Areti Andreopoulou Audio Acoustics Group, LIMSI - CNRS andreopoulou@limsi.fr Agnieszka Roginska Music and Audio
More informationCitation for published version (APA): Nutma, T. A. (2010). Kac-Moody Symmetries and Gauged Supergravity Groningen: s.n.
University of Groningen Kac-Moody Symmetries and Gauged Supergravity Nutma, Teake IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please
More informationCapturing 360 Audio Using an Equal Segment Microphone Array (ESMA)
H. Lee, Capturing 360 Audio Using an Equal Segment Microphone Array (ESMA), J. Audio Eng. Soc., vol. 67, no. 1/2, pp. 13 26, (2019 January/February.). DOI: https://doi.org/10.17743/jaes.2018.0068 Capturing
More informationSpeech Compression. Application Scenarios
Speech Compression Application Scenarios Multimedia application Live conversation? Real-time network? Video telephony/conference Yes Yes Business conference with data sharing Yes Yes Distance learning
More informationBEAT DETECTION BY DYNAMIC PROGRAMMING. Racquel Ivy Awuor
BEAT DETECTION BY DYNAMIC PROGRAMMING Racquel Ivy Awuor University of Rochester Department of Electrical and Computer Engineering Rochester, NY 14627 rawuor@ur.rochester.edu ABSTRACT A beat is a salient
More informationENHANCED PRECISION IN SOURCE LOCALIZATION BY USING 3D-INTENSITY ARRAY MODULE
BeBeC-2016-D11 ENHANCED PRECISION IN SOURCE LOCALIZATION BY USING 3D-INTENSITY ARRAY MODULE 1 Jung-Han Woo, In-Jee Jung, and Jeong-Guon Ih 1 Center for Noise and Vibration Control (NoViC), Department of
More informationBinaural Speaker Recognition for Humanoid Robots
Binaural Speaker Recognition for Humanoid Robots Karim Youssef, Sylvain Argentieri and Jean-Luc Zarader Université Pierre et Marie Curie Institut des Systèmes Intelligents et de Robotique, CNRS UMR 7222
More informationSOPA version 3. SOPA project. July 22, Principle Introduction Direction of propagation Speed of propagation...
SOPA version 3 SOPA project July 22, 2015 Contents 1 Principle 2 1.1 Introduction............................ 2 1.2 Direction of propagation..................... 3 1.3 Speed of propagation.......................
More information