Journal of Information & Computational Science 8: 14 (2011) 3027-3034
Available at http://www.joics.com

An Audio Fingerprint Algorithm Based on Statistical Characteristics of the db4 Wavelet

Jianguo JIANG a, Kaige MA a, Mingxing WEN a, Yongqing LIU a, Shuangji WANG a,b
a School of Computer Science and Technology, Xidian Univ., Xi'an 710071, China
b No. 91388 Troops of PLA, Zhanjiang 524022, China

Abstract

To address the weak robustness of general algorithms against linear speed change attacks and the excessive memory space occupied by their fingerprints, an audio fingerprint algorithm based on the db4 wavelet transform combined with statistical characteristics of the wavelet domain is proposed. First, the audio signal is decomposed with a 5-level wavelet transform. Then the plus-minus change of the low-frequency sub-band's wavelet coefficients, the energy distribution center, the sub-band energy in the wavelet domain, and the variance of the wavelet coefficients are calculated. Finally, using these results as the parameters of the audio fingerprint, an 8-bit fingerprint block per frame is generated. Simulation results suggest that the algorithm shows excellent robustness against common signal-content attacks, additive white Gaussian noise, and linear speed change attacks, and that the memory space occupied by the fingerprints is smaller.

Keywords: Audio Fingerprints; Wavelet Transform; Linear Attack; Additive White Gaussian Noise; Robustness

1 Introduction

To ease the search for a needed song among massive amounts of audio information, digital audio fingerprinting technology with automatic music recognition came into being. An audio fingerprint is a compact content-based digital signature that represents the important acoustic characteristics of a piece of music. Its main purpose is to establish an effective mechanism for comparing two audio items in terms of human auditory perception [1,2].
The wavelet transform is a local transformation of a signal in both the time and frequency domains: it can effectively extract information from the signal and perform multi-scale detailed analysis of a function or signal through operations such as scaling and translation, thereby solving many difficult problems that the Fourier transform cannot handle.

Supported by the Fundamental Research Funds for the Defense of China (No. D1120060967).
Corresponding author. Email address: jjg3306@126.com (Jianguo JIANG).
1548-7741 / Copyright 2011 Binary Information Press, December 2011

C. S. Lu and others proposed a method
that adopts a one-dimensional continuous wavelet transform to extract audio characteristics; on this basis, audio fingerprint generation methods for identification and for authentication were constructed. L. Ghouti and others proposed an audio hashing algorithm that extracts coefficient features using balanced multiwavelets (BMW) [3]. Other authors have proposed audio fingerprint algorithms that draw on computer vision: Y. Ke and others treated the audio spectrogram as a two-dimensional image [4], while S. Baluja and M. Covell applied computer vision techniques to data stream processing, generating audio fingerprints with the Haar wavelet transform and Min-Hash, and using Locality Sensitive Hashing (LSH) [5,6] for fingerprint retrieval.

Starting from time- and frequency-domain audio fingerprint algorithms, this paper considers how to improve robustness against linear speed change attacks and how to reduce the memory space of the fingerprints, and proposes an audio fingerprint algorithm based on the db4 wavelet transform, combining the wavelet transform with audio fingerprinting. In the algorithm, the audio signal is decomposed with a 5-level wavelet transform; then the plus-minus change of the low-frequency sub-band's wavelet coefficients, the energy distribution center, the sub-band energy in the wavelet domain, and the variance of the wavelet coefficients are calculated. Finally, using these results as the parameters of the audio fingerprint, an 8-bit fingerprint block per frame is generated. Comparative simulation results suggest that the algorithm shows excellent robustness and identification performance, and that the fingerprint is smaller in size.
By using the index relationship established between audio fingerprints and audio information, real-time audio search can be realized, which greatly improves the efficiency of audio retrieval [7,8].

2 Algorithm Process

The main steps of the algorithm are as follows:

(1) Pretreatment: convert the input audio signal to a mono signal and down-sample it to 5 kHz.

(2) Framing, windowing and overlapping: the frame length is 0.37 s, a Hanning window is used, and the overlap factor is P = 28/32. The Hanning window is

w(n) = 0.5 [1 − cos(2πn/(N − 1))], 0 ≤ n ≤ N − 1;  w(n) = 0 otherwise.

(3) Decompose each frame with a 5-level db4 wavelet transform. Six components are obtained: one approximation component ca5 and five detail components cd1, …, cd5.

(4) For each component, calculate the variance of the wavelet coefficients, the zero-crossing rate of the wavelet coefficients, the centroid of the wavelet domain, and the sub-band energy in the wavelet domain.

(5) Extract a hash bit from each set of parameters to obtain an 8-bit audio fingerprint per frame.

The principle framework of the algorithm is shown in Fig. 1.
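The front-end steps above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the PyWavelets package (`pywt`) for the db4 decomposition, assumes the input has already been converted to mono and down-sampled to 5 kHz, and uses a synthetic sine signal as input.

```python
import numpy as np
import pywt

FS = 5000                                    # 5 kHz sampling rate after down-sampling
FRAME_LEN = int(0.37 * FS)                   # 0.37 s frames -> 1850 samples
HOP = FRAME_LEN - int(FRAME_LEN * 28 / 32)   # overlap factor P = 28/32 -> hop of 232 samples

def frames(signal):
    """Split a mono signal into overlapping Hanning-windowed frames."""
    window = np.hanning(FRAME_LEN)
    starts = range(0, len(signal) - FRAME_LEN + 1, HOP)
    return [signal[s:s + FRAME_LEN] * window for s in starts]

def decompose(frame):
    """5-level db4 DWT: returns [ca5, cd5, cd4, cd3, cd2, cd1]."""
    return pywt.wavedec(frame, 'db4', level=5)

# Example on one second of synthetic audio
audio = np.sin(2 * np.pi * 440 * np.arange(FS) / FS)
coeffs = decompose(frames(audio)[0])
print(len(coeffs))  # 6 components: one approximation + five details
```

Note that `pywt.wavedec` returns the approximation component first, followed by the detail components from coarsest (cd5) to finest (cd1).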
3 Generation of Fingerprint

Fig. 1: Principle framework of the algorithm

3.1 The variance of the wavelet coefficients

The variance of the wavelet coefficients [9] is

σ(i, j) = (1/N) Σ_{k=1}^{N} (cd_j(k) − c̄d_j)²,  with  c̄d_j = (1/N) Σ_{k=1}^{N} cd_j(k)

where σ(i, j) is the variance of the j-th sub-band's wavelet coefficients in the i-th frame, cd_j(k) is the k-th coefficient of that sub-band, and N is the total number of wavelet coefficients (the same definitions apply below).

3.2 The zero-crossing rate of the wavelet domain

The zero-crossing rate of the wavelet domain reflects the plus-minus change of the low-frequency sub-band's wavelet coefficients [9] after the wavelet transform. It is computed as

zcr_m = (1/2) Σ_n |sign[x(n)] − sign[x(n − 1)]| w(n − m)

where x(n) is the n-th wavelet coefficient in the m-th frame (taken from ca5 and cd5 respectively), and w(n) is the window function of length N. If x(n) ≥ 0 then sign[x(n)] = 1; otherwise sign[x(n)] = 0.

3.3 The centroid of the wavelet domain

The centroid of the wavelet domain is the center of the energy distribution. In the wavelet domain, the centroid of an audio signal changes with time, so it can serve as a characteristic reflecting the non-stationarity of the signal.
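The two statistics above can be computed directly from a sub-band's coefficient vector. A small numpy sketch under the definitions above (the 0/1 sign convention means each sign change contributes 1/2; the window term is omitted here since the frames are assumed already windowed):

```python
import numpy as np

def wavelet_variance(coeffs):
    """Variance of a sub-band's wavelet coefficients (Sec. 3.1)."""
    c = np.asarray(coeffs, dtype=float)
    return np.mean((c - c.mean()) ** 2)

def wavelet_zcr(coeffs):
    """Zero-crossing rate of the coefficients, with sign[x] = 1 if x >= 0, else 0."""
    s = (np.asarray(coeffs, dtype=float) >= 0).astype(int)
    return 0.5 * np.sum(np.abs(np.diff(s)))

print(wavelet_variance([1.0, -1.0, 2.0, 3.0]))  # 2.1875
print(wavelet_zcr([1.0, -1.0, 2.0, 3.0]))       # 1.0 (two sign changes, 1/2 each)
```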
The centroid [9] is computed as

centroid = Σ_{i=1}^{N} i |x(i)|² / Σ_{i=1}^{N} |x(i)|²

where x(i) is the i-th wavelet coefficient.

3.4 The sub-band energy in the wavelet domain

The change in amplitude of an audio signal is an important dynamic characteristic, and the change in amplitude reflects the change of energy. The wavelet coefficients can be used to measure the energy characteristics of the audio because the energy of the wavelet coefficients corresponds to the energy in the time domain. The sub-band energy [9] is

energy = (1/N) Σ_{i=1}^{N} |x(i)|²

3.5 Generation of fingerprint

The Hash-bit sequence derived from the variance of the wavelet coefficients is

F₁(n, m) = 1 if [σ(n, m) − σ(n, m + 1)] − [σ(n + 1, m) − σ(n + 1, m + 1)] > 0, and 0 otherwise  (1)

where F₁(n, m) represents the m-th bit value in the n-th frame. The Hash-bit values of the zero-crossing rate of the wavelet coefficients, the centroid of the wavelet domain and the energy in the wavelet domain are

F₂(n, c) = 1 if S_c(n) − (1/N) Σ_{i=1}^{N} S_c(i) > 0, and 0 otherwise  (2)

where S_c(n) is, for the n-th frame, the zero-crossing rate of the wavelet coefficients (c = 1), the centroid of the wavelet domain (c = 2), or the energy in the wavelet domain (c = 3), and the sum is the mean of S_c over all N frames.
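The centroid and sub-band energy can likewise be sketched in a few lines of numpy; the 1-based index follows the formula above, and the function names are illustrative, not the authors' code:

```python
import numpy as np

def wavelet_centroid(coeffs):
    """Energy-distribution center of a coefficient vector (Sec. 3.3)."""
    p = np.asarray(coeffs, dtype=float) ** 2      # per-coefficient energy
    idx = np.arange(1, len(p) + 1)                # 1-based coefficient index
    return np.sum(idx * p) / np.sum(p)

def subband_energy(coeffs):
    """Mean energy of a sub-band's coefficients (Sec. 3.4)."""
    c = np.asarray(coeffs, dtype=float)
    return np.mean(c ** 2)

print(wavelet_centroid([0.0, 2.0]))  # 2.0 (all energy sits at index 2)
print(subband_energy([0.0, 2.0]))    # 2.0
```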
The final audio fingerprint bit value for each frame is

F(n, m) = F₁(n, m) for 0 < m ≤ 5;  F₂(n, 1) for m = 6;  F₂(n, 2) for m = 7;  F₂(n, 3) for m = 8  (3)
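Putting Eqs. (1)-(3) together, a hedged numpy sketch of fingerprint assembly follows. The helper names are hypothetical; the feature matrices are assumed precomputed per frame (`sigma` holds the variances of the six components, `features` the zero-crossing rate, centroid and energy), and the reference value in Eq. (2) is read here as the mean over all frames. A BER helper, matching the metric used in Section 4, is included for comparing two fingerprints.

```python
import numpy as np

def fingerprint(sigma, features):
    """8-bit fingerprint block per frame.

    sigma:    (n_frames, 6) variances of the six wavelet components per frame
    features: (n_frames, 3) zero-crossing rate, centroid, energy per frame
    Returns a (n_frames - 1, 8) bit array.
    """
    sigma = np.asarray(sigma, dtype=float)
    features = np.asarray(features, dtype=float)
    d = sigma[:, :-1] - sigma[:, 1:]                      # adjacent sub-band differences
    f1 = (d[:-1] - d[1:] > 0).astype(int)                 # Eq. (1): 5 bits per frame
    f2 = (features > features.mean(axis=0)).astype(int)   # Eq. (2): 3 bits per frame
    return np.hstack([f1, f2[:-1]])                       # Eq. (3): 8 bits per frame

def bit_error_rate(fp_a, fp_b):
    """Fraction of differing bits between two fingerprints (BER)."""
    return float(np.mean(np.asarray(fp_a) != np.asarray(fp_b)))

bits = fingerprint(np.random.rand(10, 6), np.random.rand(10, 3))
print(bits.shape)  # (9, 8)
```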
4 Simulation Results and Comparison

The simulation uses 100 randomly selected popular songs as test audio. For each test audio, 4 starting points are randomly selected and 3.3 s clips are intercepted, giving 400 audio clips in total as experimental samples. After applying the attacks, the algorithm proposed in this paper and the traditional Mel-frequency cepstral coefficients (MFCC) algorithm are each used for a simulation comparison. The results show that the proposed algorithm has better robustness against general content attacks, especially the linear speed change attack. The simulation uses the Bit Error Rate (BER), Correct Identification Rate (CIR) and Best Recognition Rate (BRR) to measure the robustness of the algorithm [10,11]. The experimental environment is Windows XP with a 1.61 GHz CPU and 512 MB memory; the tools used are MATLAB 6.5 and Adobe Audition 3.0.

4.1 Robustness analysis under common signal-content attacks

For common signal-content attacks, the average BER between the fingerprints of the attacked audio clips and the fingerprint of the source audio is shown in Fig. 2. (In all figures in this paper, the dotted line represents the algorithm based on db4 wavelet characteristics, and the solid line represents the MFCC algorithm.)

Fig. 2: Comparison of the average bit error rate (BER) under different attacks

In Fig. 2, attack types 1-20 are: 32 Kbps MP3 compression, 128 Kbps MP3 compression, band-pass filtering (BPF), amplitude compression, equalization, echo, time scale modification (TSM of ±2%, ±4% and ±5%, with the positive value listed before the negative in the figure) and linear speed change (LSC of ±1%, ±2%, ±3%, ±4% and ±5%). Fig.
2 shows that the average BER of the algorithm under attack is more stable than that of the MFCC algorithm, especially for the linear speed change attack [12] and the time scale modification attack, while MFCC has a relative advantage for 32 Kbps MP3 compression, 128 Kbps MP3 compression, band-pass filtering (BPF), amplitude compression and equalization. The correct identification rate and the best recognition rate of the algorithm under different attacks are shown in Fig. 3.

Fig. 3: Performance of the algorithm under different attacks. (a) Comparison of correct identification rate (CIR); (b) Comparison of best recognition rate (BRR)

Fig. 3 shows that the algorithm based on db4 wavelet statistical characteristics has stronger robustness against the linear speed change attack, thus overcoming the weak robustness of most algorithms in this respect.

4.2 Robustness analysis under additive white Gaussian noise

The robustness of the algorithm under different levels of additive white Gaussian noise is shown in Fig. 4. In this experiment, the signal-to-noise ratio (SNR) of the additive white Gaussian noise is set to 20 dB, 15 dB, 10 dB, 5 dB, 3 dB and 2 dB. Fig. 4 shows that the average BER of the algorithm based on db4 wavelet statistical characteristics is more stable than that of the MFCC algorithm under additive white Gaussian noise. In terms of the correct identification rate, the db4-based algorithm performs better at low SNR, but does not improve much as the SNR grows; in terms of the best recognition rate, it is better than the MFCC algorithm at low SNR.

5 Conclusion

This paper proposes an audio fingerprinting algorithm based on db4 wavelet statistical characteristics.
The plus-minus change of the low-frequency sub-band's wavelet coefficients after the wavelet transform, the energy distribution center in the wavelet domain, the sub-band energy in the wavelet domain, and the variance of the wavelet coefficients are used as the parameters for extracting the audio fingerprint. The results of the simulation and the comparison with the MFCC algorithm show that the algorithm has
Fig. 4: Performance of the algorithm under an additive white Gaussian noise attack. (a) Comparison of average bit error rate (BER); (b) Comparison of correct identification rate (CIR); (c) Comparison of the best recognition rate (BRR)

better robustness and a higher recognition rate, but relatively weak ability against the band-pass filter attack, the amplitude compression attack and the equalization attack. Future research should further study reducing the fingerprint size and improving the robustness against these three attacks, so as to increase the efficiency and the correct recognition rate of the algorithm.

Acknowledgement

This work is supported by the Fundamental Research Funds for the Defense of China (No. D1120060967).

References

[1] Yaduo Liu, Wei Li, Xiaoqiang Li, A Robust Compressed-Domain Music Fingerprinting Technique Based on MDCT Spectral Entropy, Acta Electronica Sinica, 38: 1172-1176, 2010.
[2] C. S. Lu, Audio Fingerprinting Based on Analyzing Time-Frequency Localization of Signals, Multimedia Signal Processing, pp. 174-177, 2002.
[3] L. Ghouti and A. Bouridane, A Robust Perceptual Audio Hashing Using Balanced Multiwavelets, In International Conference on Acoustics, Speech and Signal Processing, 5: 209-212, 2006.
[4] Y. Ke, D. Hoiem, R. Sukthankar, Computer Vision for Music Identification, Proceedings of Computer Vision and Pattern Recognition, pp. 597-604, 2005.
[5] S. Baluja and M. Covell, Content Fingerprinting Using Wavelets, In Conference on Visual Media Production, pp. 198-207, 2006.
[6] S. Baluja and M. Covell, Audio Fingerprinting Combining Computer Vision and Data Stream Processing, In International Conference on Acoustics, Speech and Signal Processing, 2: 213-216, 2007.
[7] G. H. Li, D. F. Wu, J. Zhang, Concept framework for audio information retrieval: ARF, Journal of Computer Science and Technology, 18: 667-673, 2003.
[8] J. Haitsma and T. Kalker, A Highly Robust Audio Fingerprinting System, In Proceedings of the International Conference on Music Information Retrieval, 2002.
[9] Jiming Zheng, Guohua Wei, Yu Wu, New effective method on content based audio feature extraction, Computer Engineering and Applications, 45: 131-137, 2009.
[10] S. Baluja and M. Covell, Waveprint: Efficient Wavelet-based Audio Fingerprinting, Pattern Recognition, 41: 3467-3480, November 2008.
[11] Y. Jiao, B. Yang, M. Li and X. Niu, MDCT-Based Perceptual Hashing for Compressed Audio Content Identification, In IEEE Workshop on Multimedia Signal Processing, pp. 381-384, 2007.
[12] Jingbing Li, Yingbin Wei, A Novel Watermarking Algorithm Robust to Local Nonlinear Geometrical Attacks, Journal of Computational Information Systems, 3(5): 2181-2186, 2007.