Detection, Interpolation and Cancellation Algorithms for GSM Burst Removal for Forensic Audio

>Bitzer and Rademacher (Paper Nr. 21)<

Joerg Bitzer and Jan Rademacher

Abstract

An increasing problem in forensic audio analysis is the so-called 'bumblebee noise' caused by GSM radio transmission. The disturbance can be described as short pulses with a fundamental frequency of approximately 217 Hz. The pulse nature of the signal produces many harmonics, so most of the desired speech signal is masked. In this contribution we introduce a new algorithm to reduce the burst. After analyzing the burst structure we found that filtering and adaptive noise cancelling are not suitable for this problem, for two reasons. First, the burst itself is time-varying, or at least the recording device and medium are not constant. Second, the burst can distort the signal more or less completely (overload, clipping). Therefore, we propose a two-step method: the first step detects the disturbance and the second removes it. Because the burst varies strongly between recordings (inter-variation) but only slightly within one recording (intra-variation), we decided to use a fingerprint approach for detection: the user selects one typical burst and the algorithm finds all other occurrences. These parts of the signal are then removed by cancellation or interpolation algorithms. The interpolation is based on the auto-regressive (AR) model of the speech signal. The resulting speech signal sounds natural, and the remaining noise neither disturbs the listener nor reduces intelligibility.

I. INTRODUCTION

In the last decade the usage of mobile phones has increased dramatically, and with it the number of audio recordings for forensic analysis that are contaminated with a new kind of audio disturbance. This noise is periodically impulsive in nature, which leads to a harsh sound, and in many cases the Signal-to-Noise Ratio (SNR) in dB is negative.
Therefore, both intelligibility and audibility are reduced. Several algorithms have been developed to reduce this harsh sound. The first publicly available work was done by Rosengren and Nilsson [1] for "bumblebee noise" reduction in mobile phones. Because the noise is periodic, its spectrum has peaks only at multiples of the main frequency, which is approximately 217 Hz. The authors therefore suggested fixed notch filters for the removal. Furthermore, an algorithm based on matched filtering or correlation shows very good noise removal capabilities. Unfortunately, this algorithm depends on a-priori knowledge, which is available in a mobile phone but not in a forensic case. This problem was addressed by Harrison [2], who proposed a solution based on an adaptive noise canceller to remove the impulse. This approach seems to work well for stationary environments but fails for real recordings because of speed fluctuations in small magnetic recording devices. In this contribution, we first analyse the problem in more detail and show why the approaches published so far fail for real-world signals. In the second part of the paper a new algorithm based on detection and interpolation is introduced, and a best-practice method is given. In the third part we show some simulations and results on real recordings.

II. PROBLEM ANALYSIS

To give deeper insight into the nature of the GSM burst, we analyze it from a coarse to a fine scale. Figure 1 shows a typical recording of a microphone signal with a GSM mobile phone close to the amplifier. In this first example the signal was digitized directly, so no intermediate analogue recording equipment was used. This excerpt of about 3 s length has some active regions (e.g. 0 to 0.2 s) and some idle parts (0.6 s to 1.2 s). The desired speech signal has very low power, so the SNR for this example is very low.
The idle parts are caused by the voice activity detection (VAD) mechanism inside the GSM system. If the VAD detects periods of silence, the transmission of the radio signal is reduced to give more bandwidth to other mobile phone users. Furthermore (although this cannot be seen in this example), the transmitter has a power control, and some recording devices have an automatic gain control (AGC) system. Both systems change the amplitude of the GSM signal over time. As a first result of this analysis, we cannot assume long-term stationarity for the burst signal.

J. Bitzer (j.bitzer@hda.de) and J. Rademacher (j.rademacher@hda.de) are with Houpert Digital Audio, Anne-Conway-Str. 1, 28359 Bremen, Germany.

Figure 1: Large-scale structure of a speech signal contaminated with GSM bursts

Figure 2 shows a zoomed view of the active region at the beginning of the file. The signal is full of spikes, which occur every 4.615 ms. This period is given by the Time-Division Multiple Access (TDMA) method used in GSM. Furthermore, every 120 ms one frame (the so-called idle frame) is not used, which can clearly be seen in the figure. The shorter period of 4.615 ms determines the main frequency of the disturbance, f_m = 1/period = 216.67 Hz ≈ 217 Hz; all other frequencies are integer multiples of f_m. The spikes are not constant in height. This is caused by the desired signal, which is additive to the GSM signal. This leads to two conclusions: even short-time stationarity cannot be assumed because of the idle frame, and the GSM burst is, for this example, an additive disturbance.

Figure 2: Structure of the GSM signal with idle frame

For further analysis, Figure 3 shows three different burst signals recorded with different microphones and amplifiers. All bursts are synchronized and the signals are normalized in order to show the differences more clearly. The signals differ significantly, although we used the same mobile phone for all recordings. All signals have the burst structure in common, but the single impulses have different shapes. It can also be seen that two impulses of the same recording are nearly identical. Therefore, the variance between recordings (inter-variance) is much higher than the variance within one recording (intra-variance), at least for the shape of the impulse.

Figure 3: Different shapes of a GSM burst recorded with different equipment (horizontal axis: time in ms)

In our opinion, the signal model shown in Figure 4 explains the different shapes. The desired signal s(k) is disturbed by an additive burst signal b(k), whose amplitude is slowly changed (modulated) over time by a gain signal g(k). This mixture is finally filtered by an approximately linear time-invariant (LTI) system H(z). For recordings of microphone signals this LTI system is in most cases a high-pass filter to reduce low-frequency bumping sounds combined with a low-pass filter for high-frequency noise reduction. Furthermore, all recording devices are band-limited, and therefore all signals will have low-pass characteristics. The shape of the original pulse b(k) is rectangular. This was suggested by Harrison [2], and the dotted line in Figure 3, which was recorded with the best equipment, supports this assumption.

Figure 4: Additive signal model (inputs s(k), b(k) and g(k); LTI system H(z))

However, this first model is not always correct. In particular, if transmission devices with limited dynamic range are used, the GSM burst can cause overloading, which means clipping for digital signals and magnetic tape saturation for analog media. In both cases the desired signal will be changed drastically,
e.g. for digital clipping we have a many-to-one transformation with information loss. In this case, the first model can be changed to the replacement model shown in Figure 5, where the rectangular pulse controls the replacement switch.

Figure 5: Replacement signal model (inputs s(k), b(k) and g(k); LTI system H(z))

This model can become more complex if more than one LTI system is involved or if the systems for the pulse and for the signal are different. However, we will focus on these two models.

III. OPTIMAL SOLUTIONS

For the additive signal model, the output signal is given by

$$y(k) = s(k) * h(k) + \big(b(k)\,g(k)\big) * h(k) \qquad (1)$$

where * denotes convolution and h(k) is the impulse response of the LTI system. In order to estimate the desired signal s(k) it would be necessary to estimate the inverse system of h(k), which may not exist. Therefore, we will be content with the convolution result s(k) * h(k) as our target signal. In this case the optimal estimator is

$$s(k) * h(k) = y(k) - \underbrace{\big(b(k)\,g(k)\big) * h(k)}_{=\,i(k)} \qquad (2)$$

which represents a compensation of the estimated impulse. The remaining problem is that we do not know how to estimate the impulse i(k).

For the replacement model given in Figure 5, the optimal solution is more difficult and cannot be expressed easily. The observable signal y(k) contains two different parts. If the LTI system is H(z) = 1, a fifth of the desired signal is completely removed. However, for interpolation algorithms the remaining material is enough to estimate the missing samples. A very reliable method is based on the assumption that speech can be modeled as an auto-regressive (AR) process. The missing samples can then be estimated with a least-squares approach (LSAR algorithm) [3]. For real-world data the rectangular impulse is smeared out by the impulse response of the filter. If this additional region is small enough, for example with very good equipment, the remaining undistorted signal can still be used for interpolation. The quality of the remaining signal will degrade, but not significantly.
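The LSAR interpolation mentioned above can be illustrated with a minimal Python/NumPy sketch. It is not the authors' implementation: the function names, the AR order, the toy test signal and the least-squares fit of the AR coefficients from the undistorted samples are assumptions made for illustration. The missing samples are chosen so that the energy of the AR prediction error over the whole block is minimal.

```python
import numpy as np

def estimate_ar(x, known, order):
    """Least-squares AR fit using only prediction equations without missing samples."""
    rows, targets = [], []
    for n in range(order, len(x)):
        if known[n - order:n + 1].all():
            rows.append(x[n - order:n][::-1])  # x[n-1], ..., x[n-order]
            targets.append(x[n])
    a, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return a  # model: x[n] ~ sum_i a[i] * x[n-1-i]

def lsar_interpolate(x, known, order=4):
    """Replace the samples where `known` is False by their LSAR estimate."""
    x = np.asarray(x, dtype=float)
    known = np.asarray(known, dtype=bool)
    a = estimate_ar(x, known, order)
    n = len(x)
    # Row r of A computes the prediction error e[r+order] from x[r], ..., x[r+order].
    A = np.zeros((n - order, n))
    weights = np.concatenate((-a[::-1], [1.0]))
    for r in range(n - order):
        A[r, r:r + order + 1] = weights
    A_u, A_k = A[:, ~known], A[:, known]
    # Minimize ||A_u x_u + A_k x_k||^2 over the unknown samples x_u.
    x_u, *_ = np.linalg.lstsq(A_u, -A_k @ x[known], rcond=None)
    y = x.copy()
    y[~known] = x_u
    return y

# Toy example (assumed data): two sinusoids stand in for voiced speech.
k = np.arange(400)
clean = np.sin(2 * np.pi * 0.03 * k) + 0.5 * np.sin(2 * np.pi * 0.11 * k + 0.7)
known = np.ones(len(k), dtype=bool)
known[200:230] = False                      # region destroyed by one burst
damaged = np.where(known, clean, 0.0)
restored = lsar_interpolate(damaged, known, order=4)
max_error = np.max(np.abs(restored - clean))
```

For this noise-free toy signal the gap is recovered almost exactly, because two sinusoids follow an AR model of order 4 without error. Real speech only approximately follows an AR model, so the order and the amount of undistorted context would have to be chosen for the recording at hand.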
The optimal solution would include the estimation of the inverse system h_inv(k). A perfect inversion is possible if the system has minimum-phase characteristics [4]. However, this cannot be assumed for most real-world systems. Assuming a minimum-phase system, the optimal algorithm would be:

1. Estimate h_inv(k).
2. Filter y(k) with h_inv(k).
3. Interpolate the impulse region (which is now rectangular).

IV. USED ALGORITHMS

Implementing the optimal solutions given in the last section is hardly possible; in most cases the solution would be very difficult or not reliable. Therefore, we focus on a practical solution that uses the a-priori knowledge of the expert working with the disturbance reduction software.

Figure 6: Determination of the template impulse (block diagram: the user determines one impulse i_user(k) via the user interface; impulses i(k) are detected using i_user(k); the impulses are normalized, i.e. g(k) is removed, giving i_norm(k); time alignment and averaging give the template i_t(k))

Figure 6 shows our approach in detail. In a first step the user determines one instance of the filtered impulse, i_user(k). Of course, this single instance is in most cases "disturbed" by the desired signal. Nevertheless, it is used as a detector sequence for the other occurrences in a matched-filter approach. All detected impulses are normalized to reduce or remove the amplitude modulation signal g(k), and the mean of all these normalized and time-aligned impulses is used as a template i_t(k) for the final reduction step. This approach is related to the template approach used by Vaseghi for the removal of low-frequency clicks in audio recordings [5]. The final step differs for the two signal models. For the additive model we use a cancellation approach to remove the impulse (see Figure 7). The template impulse is used as a matched-filter detector. Finally, the template, aligned in time and amplitude, is subtracted to obtain the final enhanced output signal.
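The template determination of Figure 6 and the cancellation step of Figure 7 can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names, the normalized-correlation detector with the thresholds `min_ncc` and `min_gain`, the least-squares amplitude estimate and the synthetic test signal are all illustrative choices, since the paper does not specify these details.

```python
import numpy as np

def detect_bursts(y, ref, min_ncc=0.7, min_gain=0.5):
    """Matched-filter detection: start indices where y resembles ref in shape and level."""
    m = len(ref)
    corr = np.correlate(y, ref, mode="valid")            # corr[n] = <y[n:n+m], ref>
    energy = np.convolve(y ** 2, np.ones(m), mode="valid")
    ncc = corr / (np.sqrt(energy * (ref @ ref)) + 1e-12)  # shape similarity
    gain = corr / (ref @ ref)                             # least-squares amplitude
    score = np.where((ncc > min_ncc) & (gain > min_gain), ncc, 0.0)
    starts, n = [], 0
    while n < len(score):
        if score[n] > 0.0:
            peak = n + int(np.argmax(score[n:n + m]))    # best alignment nearby
            starts.append(peak)
            n = peak + m                                  # skip past this burst
        else:
            n += 1
    return starts

def build_template(y, starts, ref):
    """Normalize each detected impulse (removes g(k)) and average them."""
    m = len(ref)
    segs = np.array([y[s:s + m] for s in starts])
    gains = segs @ ref / (ref @ ref)
    return (segs / gains[:, None]).mean(axis=0)

def cancel_bursts(y, template):
    """Subtract the time- and amplitude-aligned template at every detection."""
    out = y.copy()
    m = len(template)
    for s in detect_bursts(y, template):
        gain = y[s:s + m] @ template / (template @ template)
        out[s:s + m] -= gain * template
    return out

# Synthetic additive model (assumed data): weak "speech" plus smeared rectangular bursts.
n = np.arange(2000)
speech = 0.01 * np.sin(2 * np.pi * n / 37)
pulse = np.convolve(np.ones(20), np.ones(5) / 5)         # rectangle after a low-pass
true_starts = list(range(100, 1900, 200))
y = speech.copy()
for i, s in enumerate(true_starts):
    y[s:s + len(pulse)] += (1 + 0.2 * np.sin(i)) * pulse  # slowly varying gain g(k)

i_user = y[100:100 + len(pulse)].copy()                   # impulse marked by the user
starts = detect_bursts(y, i_user)
i_t = build_template(y, starts, i_user)
enhanced = cancel_bursts(y, i_t)
```

The second threshold on the amplitude is a design choice of this sketch: a normalized correlation alone ignores level, so a quiet speech segment that happens to resemble the pulse shape could otherwise be detected as a burst.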

Figure 7: Burst reduction algorithm based on the cancellation approach (block diagram: inputs y(k) and template i_t(k); detection of impulses using the template impulse; decision "impulse found?"; time and amplitude adjustment of the template impulse; output s(k) * h(k))

The more complex approach for the replacement model is not yet fully solved. At the moment we are able to interpolate the detected impulse. However, very good results are only possible if the system is "well-behaved".

V. RESULTS

To show some results, we enhanced several examples. It is not easy to find an objective measure for the proposed methods. Therefore, we show some excerpts from the waveforms and give some audio demonstrations. Informal listening tests have shown that our time-domain method gives far better results in terms of impulse reduction than the previously published approaches. The residual speech signal is clear and not reverberant. The remaining artifacts can be reduced by standard decrackling and filtering algorithms. For the first analysis of the enhancement performance, we computed the Power Spectral Density (PSD) of the input signal given in Figure 1. In this frequency plot (Figure 8), the spikes caused by the GSM burst with its 217 Hz repetition can be seen clearly. The PSD of the desired signal is masked more or less completely. The PSD of the enhanced file is shown in Figure 9. The spikes are no longer visible, and the true PSD of the signal can be analyzed.

Figure 8: Power Spectral Density of the disturbed signal (amplitude in dB over frequency in kHz)

Figure 9: Power Spectral Density of the enhanced signal (amplitude in dB over frequency in kHz)

The reduction performance in the time domain is shown in Figure 10. In comparison to Figure 2, the bursts have been removed and only the residual speech signal is present.

Figure 10: Result of the enhancement of the signal used in Figure 2

In Figure 11 we take a closer look at the fine structure of the enhanced signal. The direct comparison shows that the burst and the DC offset can be removed by our template technique if a suitable template is used. However, none of these analysis plots says anything about the audio quality of the residual signal. This can be judged by listening tests only.
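The spectral check behind Figures 8 and 9 can be reproduced on synthetic data. The sketch below is an illustration only: the sample rate of 26 kHz (chosen so that one 4.615 ms frame is exactly 120 samples), the rectangular burst width and the hand-written averaged periodogram are assumptions, and the desired speech signal and the idle frames are omitted. It shows that a periodic burst train concentrates its power in narrow lines at multiples of about 217 Hz.

```python
import numpy as np

fs = 26_000        # assumed sample rate in Hz; one 4.615 ms frame = 120 samples
frame = 120        # frame period in samples
burst_len = 15     # assumed burst width in samples (illustrative)

# One second of a rectangular burst train (no idle frames, no speech).
y = np.zeros(fs)
for start in range(0, len(y), frame):
    y[start:start + burst_len] = 1.0
y -= y.mean()      # remove the DC offset so it does not dominate the spectrum

def averaged_periodogram(x, nfft, fs):
    """Average the Hann-windowed periodograms of non-overlapping segments."""
    n_seg = len(x) // nfft
    segs = x[:n_seg * nfft].reshape(n_seg, nfft) * np.hanning(nfft)
    psd = (np.abs(np.fft.rfft(segs, axis=1)) ** 2).mean(axis=0)
    return np.fft.rfftfreq(nfft, d=1.0 / fs), psd

# A segment of 26 frames (120 ms) puts every harmonic exactly on a frequency bin.
freqs, psd = averaged_periodogram(y, nfft=26 * frame, fs=fs)
f_peak = freqs[np.argmax(psd)]                      # strongest spectral line
psd_db = 10 * np.log10(psd / psd.max() + 1e-20)     # normalized PSD in dB for plotting
```

Plotting `psd_db` over `freqs` gives a line spectrum comparable in character to the disturbed PSD of Figure 8; applying the same function to an enhanced signal would correspond to Figure 9.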

Figure 11: Comparison of original signal (dashed line) and enhanced signal (solid line); horizontal axis: time in ms

VI. CONCLUSION

In this paper we have shown that the removal of the GSM burst is possible for forensic applications. After providing a detailed analysis of the problem, we presented a new time-domain solution. It is very important to understand that the GSM burst structure is a time-domain problem and that a time-domain approach should therefore be used. Frequency-domain approaches like notch filtering will fail because of the mismatch between problem and solution. Our results, especially the informal listening tests, clearly indicate that our two-step template algorithm outperforms previously presented algorithms. However, further analyses based on clear objective performance measures are still necessary. In the future we will concentrate on solutions for the replacement model to overcome the problem of the inverse system.

ACKNOWLEDGMENT

We thank Mr. Axel Behrens for his help in providing the VPI (Virtual Precision Instrument) and the audio demonstration files, and Mr. Jerome Luepkes for the first review of the paper and language advice. Any remaining mistakes are courtesy of the authors.

REFERENCES

[1] P. Rosengren and A. Nilsson, "Bumblebee Killer", Master Thesis, University of Karlskrona/Ronneby, 1999.
[2] P. Harrison, "GSM Interference Cancellation for forensic audio: a report on work in progress", Forensic Linguistics 8 (2), 2001, p. 9-23.
[3] S. J. Godsill and P. J. W. Rayner, "Digital Audio Restoration: A Statistical Model-Based Approach", Springer Verlag, 1998.
[4] S. T. Neely and J. B. Allen, "Invertibility of a room impulse response", Journal of the Acoustical Society of America (JASA), vol. 66, p. 165-169.
[5] S. V. Vaseghi, "Advanced Signal Processing and Digital Noise Reduction", Wiley & Teubner, 1996.