Using sound levels for location tracking


Sasha Ames
sasha@cs.ucsc.edu
CMPE250 Multimedia Systems, University of California, Santa Cruz

Abstract

We present an experiment that attempts to track the location of sound sources. It is a tightly controlled experiment: four microphones sit at the corners of a square field within which we make the sounds to be tracked. Our software performs the tracking either by interpolation or by locating sources with a formula derived from a physical model. We show disappointing results.

1. Introduction

The experiment described in this paper uses multiple audio input channels in an attempt to locate sound sources. Georgiou and Kyriakakis mention this as applicable to tracking the locations of speakers in a video conference [4]. Brandstein has done work on location tracking using microphone arrays [3, 5]; his main focus was on time delay as the metric for location. This experiment, by contrast, uses differences in amplitude between microphones to determine the sound location.

My experience with recording audio has taught me that the level recorded by a microphone changes as a sound source moves closer to or further from it. For example, I have recorded a drum kit with a pair of microphones to get better coverage of the kit. The drums on the left-hand side of the kit (snare, hi-hat) are recorded at a stronger level than those on the right (ride cymbal, floor tom) by a microphone placed on the left side of the kit, and vice versa. This is apparent when examining the audio data for each channel. I therefore expected that the same principle could be used to localize the positions of sounds within a field bounded by several microphones.

We accomplish this by first capturing audio data at known locations, with which we train our software to determine the locations of any other sound recorded within the field. Since a sound event at an unknown location may never exactly match the amplitudes recorded at the known locations, interpolation is one option for approximating the location. Additionally, formulae exist that relate measured sound level to distance from the source, so we can also compute distances from the recorded levels and compare the results against interpolation.

1.1. A simple example using two microphones

A sound registers at 12 on microphone A and 28 on microphone B. I have already recorded sounds at positions registering 10 on A and 30 on B, 20 on A and 20 on B, and 30 on A and 10 on B. I can interpolate that the sound is probably close to the 10-on-A, 30-on-B spot, but slightly towards the 20/20 spot.
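To make the example concrete, here is a minimal Java sketch. All numbers come from the example above; the nearest-neighbor comparison in amplitude space is my own crude stand-in for the fuller interpolation developed later in Section 2.3.3:

```java
public class TwoMicExample {
    public static void main(String[] args) {
        double[][] known = {{10, 30}, {20, 20}, {30, 10}}; // recorded (A, B) readings
        double[] obs = {12, 28};                           // the new sound's readings
        for (double[] k : known) {
            double d = Math.hypot(obs[0] - k[0], obs[1] - k[1]);
            System.out.printf("amplitude distance to (%.0f on A, %.0f on B): %.1f%n",
                              k[0], k[1], d);
        }
        // prints 2.8, 11.3, 25.5: nearest the 10-on-A/30-on-B spot, with the
        // 20/20 spot next closest, matching the reading given in the text
    }
}
```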

2. Proposed experimental methodology and design

2.1. Experiment layout

I place my four microphones at the corners of an imaginary square of predetermined size (probably 10' x 10'). Since the microphones may be directional, they should be oriented so as to best pick up sound within the square. Each microphone must be properly calibrated against the loudest sound that can occur in the experiment at very close proximity: we want the best possible SNR and to take advantage of all quantization steps of 16-bit audio.

Figure 1. Sound gathering setup diagram. The spot enclosed by concentric circles represents a sound event, with each circle being of diminishing amplitude as it spreads from the center. The microphone in the upper-right corner records the greatest amplitude for the event, followed by the lower-right, then the upper-left, etc.

2.2. Recording of audio data

We record on four audio channels simultaneously using Echo Layla audio hardware [1], which supports simultaneous recording and playback on up to eight channels. We record the audio with conventional multi-track recording software for the PC, such as Cakewalk [6]. Although the WAV file format supports any number of audio channels (more than the two used for stereo), we recorded to four separate audio files. We also used the multi-tracking software to clean up the audio data, that is, to remove unneeded segments of silence and unexpected noise, leaving only the audio we are interested in working with.

I subdivide the square enclosed by the microphones into a grid, and at each point on the grid I record a loud percussive sound on all four channels, keeping track of each location so it may later be given as input to the software along with the audio data. This forms the audio data from which I derive the known sound-location data used later for interpolation.

Since I am unfortunately unable to track sound locations in real time, I prerecorded the unknown-location sounds that need to be tracked, again on four audio channels via the four microphones. All such sounds are within the square, but they need not fall on grid points, and I do not keep track of their exact locations. I do, however, note roughly where I made these sounds so I can check the results when I later attempt to track them.

2.3. Methodology for location tracking

We can track locations by two methods: 1) apply a formula to three measured amplitude values, or 2) interpolate from pairs of measured amplitudes and distances.

2.3.1. Formula for plotting locations

Both methods for finding the distances from the sound source to the microphones ultimately use the same formula to plot the location of the source. By placing our microphones on the corners of a square, we significantly simplify the calculations needed to determine locations. It would be possible to pinpoint the source given the distances to only three microphones placed anywhere in a room, assuming all lie on the same plane, but this would require much more complex calculations.

We need the following values for the formula. Let w be the width of the square, which is also the distance between any pair of microphones not in opposite corners. Let d_1, d_2, d_3, and d_4 be the distances from the sound source to the microphones in the upper-left, lower-left, upper-right, and lower-right corners of the square, respectively. Though we have four distance values, we need only a pair from adjacent microphones. To find the coordinates (x, y), given that we use d_1 and d_2, the formulas are:

    x = \frac{d_1^2 - d_2^2 + w^2}{2w}, \qquad y = \sqrt{d_1^2 - x^2}

For a better approximation, we can repeat this formula for each adjacent pair of distances: d_1 and d_3, d_2 and d_4, d_3 and d_4. The four resulting location estimates can then be averaged together.

2.3.2. Formula for determining distance

Given the amplitudes of the sound registered at the four microphones, we can calculate the distances we need. Let I be the true intensity of the sound, d_n any one of the distances above, and I_n the amplitude measured by the corresponding microphone. The relationship between these is:

    I_n = \frac{I}{d_n^2}

I may remain unknown throughout, as it proves not to be needed. Using this relationship for each microphone together with the location formulas above, we may derive a distance given three of the amplitudes. Substituting d_n = \sqrt{I/I_n} into the coordinate formulas for the two adjacent pairs that share microphone 1, together with the constraint d_1^2 = x^2 + y^2, yields a quadratic in I:

    (p^2 + q^2)\, I^2 + 2w^2\left(p + q - \frac{2}{I_1}\right) I + 2w^4 = 0, \qquad p = \frac{1}{I_1} - \frac{1}{I_2}, \quad q = \frac{1}{I_1} - \frac{1}{I_3}

Solving for I then gives d_1 = \sqrt{I/I_1}. This can be repeated for each triplet of amplitudes to find the corresponding distance, i.e., for d_2 use I_2, I_4, and I_1, etc.

2.3.3. Procedure for interpolation

As an alternative to calculating distances with the formula above, we may interpolate a distance given a measured amplitude and a table of amplitudes and corresponding distances for a given microphone. Interpolation is worth considering because our microphones may not behave exactly as the equation above predicts, or may not be properly calibrated, i.e., I_n = k_n I / d_n^2 with k_1 \neq k_2 \neq k_3, etc. Our method of interpolation is Lagrange's classical formula of polynomial interpolation [2]. The algorithm involves two nested loops and runs in O(n^2), where n is the number of data points collected for interpolation. The x_1..x_n are the measured amplitudes, the y_1..y_n are the corresponding measured distances, and x is the measured amplitude whose distance is unknown.
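A minimal Java sketch of this routine might look as follows; the table values in main are hypothetical, standing in for one microphone's measured amplitude/distance pairs:

```java
public class LagrangeInterp {
    /** Lagrange's classical polynomial interpolation [2]: evaluates at xq the
     *  unique degree n-1 polynomial through the points (x[i], y[i]).
     *  Two nested loops, hence O(n^2) in the number of data points. */
    static double interpolate(double[] x, double[] y, double xq) {
        double result = 0;
        for (int i = 0; i < x.length; i++) {
            double term = y[i];
            for (int j = 0; j < x.length; j++)
                if (j != i) term *= (xq - x[j]) / (x[i] - x[j]);
            result += term;
        }
        return result;
    }

    public static void main(String[] args) {
        // hypothetical table for one microphone: amplitudes and known distances
        double[] amps = {30000, 12000, 6500, 4000};
        double[] dists = {1.0, 2.0, 3.0, 4.0};
        System.out.println(interpolate(amps, dists, 9000)); // distance estimate
    }
}
```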
Once we have found the four distances from the microphones, we may apply the formula of Section 2.3.1.

2.4. Some basic assumptions

We make several simplifying assumptions about the experiment. It is certainly not impossible to account for these factors, but doing so would require a much more complex implementation. First, we assume the microphones are omnidirectional over the 90-degree field in front of them, as they are placed at the corners of our grid facing the center. Second, no objects occlude the sound from any of the microphones in a way that would affect the measured amplitude. Third, the acoustical properties of the room do not affect the measured amplitudes.
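Putting Sections 2.3.2 and 2.3.1 together, formula mode might be sketched as below. The coordinate convention (microphone 1 at the origin, its two adjacent microphones along the axes) and all identifiers are mine, not the project's actual code:

```java
public class FormulaMode {

    /** Candidate values of d1 from three measured amplitudes via I_n = I/d_n^2.
     *  Solves the quadratic in I from Section 2.3.2; assumes p and q are not
     *  both zero and that noise has not driven the discriminant negative. */
    static double[] d1Candidates(double i1, double i2, double i3, double w) {
        double p = 1 / i1 - 1 / i2, q = 1 / i1 - 1 / i3;
        double a = p * p + q * q;
        double b = 2 * w * w * (p + q - 2 / i1);
        double c = 2 * w * w * w * w;
        double disc = Math.sqrt(b * b - 4 * a * c);
        // the two roots correspond to mirror-image source positions;
        // the caller keeps whichever location lands inside the square
        return new double[] { Math.sqrt((-b - disc) / (2 * a) / i1),
                              Math.sqrt((-b + disc) / (2 * a) / i1) };
    }

    /** Location from the distances to two adjacent microphones (Section 2.3.1). */
    static double[] locate(double d1, double d2, double w) {
        double x = (d1 * d1 - d2 * d2 + w * w) / (2 * w);
        double y = Math.sqrt(Math.max(0, d1 * d1 - x * x));
        return new double[] { x, y };
    }
}
```

As the paper describes, locate() would then be repeated for each adjacent pair of distances and the four estimates averaged.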

3. Software Implementation

I have implemented two pieces of software: an XML table constructor for use in interpolation, and the location tracker itself. Both share a module that scans audio files for the audio events whose locations we wish to record or determine. The implementation is in the Java programming language, chosen for its ease of rapid development and acceptable performance.

3.1. Audio file reader

The audio file reader processes files stored in the WAV format, which is the default export format of the Cakewalk multi-tracking software and common in Windows environments. In WAV files the samples are stored little-endian, which is not how Java stores its integers, so the reader decodes each individual sample accordingly. For each file, the reader first reads a small number of samples to determine the baseline noise level. It then scans through the file one sample at a time until it encounters data above some threshold over the noise level; this indicates the start of an audio event. The reader can report events by either their peak or their power level. Peaks are found by taking the greatest absolute sample value within a window after the start of an event; powers are computed with a formula over a window of some defined constant size. Once an event is measured, the reader reads and disregards samples until the signal returns to the baseline level, and then resumes scanning for the next event. For more continuous audio, where there are no isolated events to locate, we can instead repeatedly compute peaks or power levels within a window.
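The following is a minimal sketch of such a reader, under my own simplifying assumptions: the WAV header has already been skipped, the audio is 16-bit little-endian mono PCM, and the constants (baseline window, peak window, threshold multiple) are placeholders rather than the values the project used:

```java
import java.io.*;

public class EventScanner {
    static final int NOISE_SAMPLES = 1000;  // baseline estimation window (assumed)
    static final int PEAK_WINDOW = 4410;    // ~0.1 s at 44.1 kHz (assumed)
    static final double THRESHOLD = 4.0;    // multiple of the noise floor (assumed)

    /** WAV samples are little endian: low byte first, unlike Java's integers. */
    static int readSample(DataInputStream in) throws IOException {
        int lo = in.readUnsignedByte();
        int hi = in.readByte();             // sign lives in the high byte
        return (hi << 8) | lo;
    }

    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(args[0])))) {
            // 1) estimate the baseline noise level from the first samples
            long acc = 0;
            for (int i = 0; i < NOISE_SAMPLES; i++) acc += Math.abs(readSample(in));
            double noise = (double) acc / NOISE_SAMPLES;
            // 2) scan; a sample above the threshold starts an event, whose peak
            //    is the largest absolute sample within the following window
            while (true) {
                int s = Math.abs(readSample(in));
                if (s <= noise * THRESHOLD) continue;
                int peak = s;
                for (int i = 0; i < PEAK_WINDOW; i++)
                    peak = Math.max(peak, Math.abs(readSample(in)));
                System.out.println("event peak = " + peak);
                // (the skip back down to the baseline level is omitted here)
            }
        } catch (EOFException endOfAudio) {
            // done: ran off the end of the file
        }
    }
}
```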
3.2. Location data compiling software

The location data compiling software is set up very specifically for my experimental configuration, in which audio events are recorded at known locations on a grid of 25 points (5 x 5). The software reads four audio files and expects to find exactly 25 audio events in each. The location of each event is predetermined and methodical (like raster-scan ordering), so the corresponding grid location can be recorded with the peak or power value. The event data is written in XML, with one element per audio event. Each element contains the peak or power value as an integer for each of the four microphones, with peak values ranging from 0 to 32767 (absolute values for 16-bit PCM audio data), and the x and y coordinates on the grid, each ranging from 0 to 4.

3.3. Location tracking software

The location tracking software uses the same audio file reader module to find the audio events whose locations within the grid we wish to determine. We run four audio reader modules, one per microphone, simultaneously in separate threads. Each reader fills a buffer corresponding to its microphone as peak or power data for events is determined. When data is available in all four buffers, the main thread consumes from the front of each and proceeds to determine the distances for the event.

The software determines the distances from the event to each of the microphones using implementations of the methods described in Section 2.3, with a mode for each. In formula mode, the distances are determined simply by applying the formula of Section 2.3.2 to the three measured amplitude values. Once all distances are determined, the location (coordinates) of the event can be computed. Interpolation mode has additional steps: before we may interpolate, tables must be constructed for each microphone. Before processing any audio data, the software reads in the XML data generated by the compiling software of Section 3.2 and, for each grid point, calculates the distance from that point to each of the microphones; for instance, (0, 0) is \sqrt{2} from microphone 1, while (4, 4) is the same distance from microphone 4. Each distance is placed in the table with the measured amplitude. From these tables we apply the interpolation algorithm described in Section 2.3.3 and subsequently find the location.

Once the coordinates of an audio event on the 10' x 10' field have been determined, we can plot them graphically. The software includes a module that plots events in a 2-D window representing the field, in real time as the events are processed from the audio files.
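As an illustration of the table-construction step, the sketch below computes the distance from each of the 25 grid points to each microphone. It assumes, for simplicity, that the microphones sit exactly at the four corners of the field and that the grid spacing is w/4; the actual setup's geometry (e.g., the \sqrt{2} distance quoted above) may differ:

```java
public class DistanceTable {
    /** dist[m][gx][gy] = distance from grid point (gx, gy) to microphone m.
     *  Microphones assumed at the corners: 0 = upper-left, 1 = lower-left,
     *  2 = upper-right, 3 = lower-right; grid spacing assumed w/4. */
    static double[][][] buildDistanceTable(double w) {
        double[][] mics = {{0, w}, {0, 0}, {w, w}, {w, 0}};
        double step = w / 4.0;
        double[][][] dist = new double[4][5][5];
        for (int m = 0; m < 4; m++)
            for (int gx = 0; gx <= 4; gx++)
                for (int gy = 0; gy <= 4; gy++)
                    dist[m][gx][gy] = Math.hypot(gx * step - mics[m][0],
                                                 gy * step - mics[m][1]);
        return dist;
    }

    public static void main(String[] args) {
        double[][][] d = buildDistanceTable(10.0); // 10' x 10' field
        System.out.println(d[3][4][0]);            // grid (4,0) to lower-right mic: 0.0
    }
}
```

Each such distance, paired with the amplitude measured at that grid point, yields the (x_i, y_i) table fed to the Lagrange routine of Section 2.3.3.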

4. Results

The results for this experiment should answer the question: how well can the software approximate the locations of recorded sounds? As this section shows, the results were well below my expectations.

Experimentation ran with two mode switches, peak vs. power level and interpolation vs. formula-based distance approximation, giving four possible modes. I ran the trials on four sets of audio files, one file per microphone in each set. The first set of files was also used to generate the interpolation table.

The most successful results came from running the software, in peak mode with interpolation, over the same audio tracks used to create the interpolation data. This is to be expected. With the exception of the final row of points, the locations were plotted exactly where I expected them; this serves as a baseline for interpolation mode. However, when the software was run on the same data in formula mode, very few points were plotted at all, and those were not in the correct locations.

Interpolation for the first set of audio files with unknown locations failed to plot any points. Formula mode on the same set produced some plotted points, but their path does not resemble that of the sounds as I made them when gathering the data; some points may be correct, but it is unclear how many. Furthermore, for the remaining two sets of audio files, the results show few or no plotted points at all.

5. Conclusion

The question remains: what went wrong? Having no other choice, I performed the experiment with four different microphones. These microphones varied in quality and directionality, and perhaps in having a nonlinear response to the audio signal. I made every attempt to prevent the sound from being occluded by any object in the field, including myself, but that may have proved too difficult for one person on a first attempt. Finally, it is quite possible that the nature of the sounds themselves did not suit this experiment: they may not have been consistent enough to work with the simple model and interpolation, or they may not have produced the consistent spherical waves that would be necessary.

In all, it was of course disappointing that I could not track locations for my groups of unknown audio. However, I consider this work a positive experience in a number of ways. It required serious planning to devise the experiment, capture the audio, and write and test the software. I was forced to refine my ideas and consider what was really required of the software to carry out this experiment. I had intended to set up the audio capture environment a second time to try to produce better results, but unforeseen circumstances prevented that. In the end, the results were not to my liking, but I am pleased to have had the opportunity to attempt this work. Additionally, I am proud of the software written for this project, and working on it has given me a better feel for what goes into audio processing.

References

[1] Echo Layla product description page. http://www.echoaudio.com/products/discontinued/layla24/index.php.
[2] Numerical interpolation: Polynomial interpolation. http://www.efunda.com/math/num_interpolation/num_interpolation.cfm#polynomial.
[3] M. Brandstein, J. Adcock, and H. Silverman. A closed-form location estimator for use with room environment microphone arrays. IEEE Transactions on Speech and Audio Processing, 5:45-50, 1997.
[4] P. Georgiou, C. Kyriakakis, and P. Tsakalides. Robust time delay estimation for sound source localization in noisy environments. In Applications of Signal Processing to Audio and Acoustics, 1997 IEEE ASSP Workshop on, pages 19-22, 1997.
[5] D. Sturim, M. Brandstein, and H. Silverman. Tracking multiple talkers using microphone-array measurements. In Acoustics, Speech, and Signal Processing (ICASSP-97), 1997 IEEE International Conference on, volume 1, pages 371-374, 1997.
[6] Twelve Tone Systems, Inc. Cakewalk website, 2004. http://www.cakewalk.com.