Speech Compression based on Psychoacoustic Model and A General Approach for Filter Bank Design using Optimization


The International Arab Conference on Information Technology (ACIT)

Speech Compression based on Psychoacoustic Model and A General Approach for Filter Bank Design using Optimization

Mourad Talbi, Chafik Barnoussi, Cherif Adnane
Laboratory of Signal Processing, Electronics Department, Faculty of Sciences of Tunis, Tunisia
mouradtalbi96@yahoo.fr, Chafik.Barnoussi@gmail.com, adnane.cher@fst.rnu.tn

Abstract: In this paper we propose a new speech compression technique based on the application of a psychoacoustic model combined with a general approach for filter bank design using optimization. This technique is a modified version of a compression technique using MDCT (Modified Discrete Cosine Transform) filter banks of 32 filters each and a psychoacoustic model. The two techniques are evaluated and compared with each other by computing the number of bits before and after compression. They are tested on different speech signals, and the obtained simulation results show that the proposed technique outperforms the second technique in terms of compressed file size. In terms of speech quality, the output speech signals of the proposed compression system are of good quality, as confirmed by SNR, PSNR, NRMSE and PESQ computation.

Keywords: speech compression, psychoacoustic model, filter bank design, optimization, bits before/after compression.

1. Introduction

The rapidly increasing number of mobile users and the explosive growth of the Internet have made speech compression an important research issue in digital speech processing. The essential purpose of speech compression is to represent the digital speech waveform with a minimum number of bits while preserving its perceptual quality [, ]. Speech compression is essential either for reducing memory storage requirements or for reducing transmission bandwidth requirements, without harming the speech quality.
For example, digital cellular phones use compression techniques to compress the speech signal in real time over general switched telephone networks. Speech compression is also needed for reducing the storage requirements of voice messages or for mail forwarding of voice messages. All these applications depend on the efficiency of the speech compression technique. Consequently, different techniques [3] have been developed in the past to meet the rising demand for better speech compression algorithms. Speech is a natural way for humans to convey information and emotion from one person to another or from one place to another. It is typically classified as unvoiced, voiced, or a mixture of the two. Unvoiced sounds are produced when the vocal cords are too slack or too tense to vibrate periodically, while voiced sounds are generated when the vocal cords are held together [4]. A detailed study of speech production is given in [5] and the references therein. Like other digital data compression techniques, speech compression techniques can be classified into two categories: lossy compression and lossless compression. Lossless compression is frequently performed by waveform coding techniques. In these techniques [5, 4, 6] the actual shape of the signal produced by the microphone and its associated circuits is conserved. The most popular waveform coding technique is pulse code modulation (PCM). Other lossless techniques, such as differential quantization and adaptive PCM, compress speech signals by localizing redundancy and reducing it through the quantization process. All such techniques require simple signal processing and lead to minimal distortion with small compression [6, 7, 8]. A detailed study of these techniques is presented in [6, 7], [], [4], [9] and the references therein. Concerning lossy compression, the compressed data is a close approximation of the original data rather than an exact copy.
However, it leads to a much higher compression ratio than lossless compression. The literature review reveals that considerable progress has been made on lossy compression techniques such as sub-band coding [4], linear predictive coding (LPC) [3] and the turning point algorithm [5]. In these techniques, more sophisticated signal processing is used. LPC is a robust tool extensively employed for the analysis of ECG and speech signals in various respects, such as adaptive filtering, spectral estimation and data compression [, 3, 4]. Different efficient techniques [, 4, 5] based on LPC have been reported in the literature. In sub-band decomposition, the spectral information is divided into a set of signals that can then be encoded using a variety of techniques. Based on sub-band decomposition, different techniques have been devised for speech compression [6, 7, 8, 9]. During the last decade, the Wavelet Transform, and more precisely the Discrete Wavelet Transform, has emerged as a robust and powerful tool for extracting and analyzing information from non-stationary signals because of the time-varying nature of these signals. Non-stationary signals are characterized by transitory drifts, trends and numerous abrupt changes. The wavelet's localization feature, along with its time-frequency resolution properties, makes it appropriate for analyzing non-stationary signals such as speech []. Indeed, many wavelet and wavelet packet techniques have been developed for compressing speech signals [, , 3, 4, 5]. In this context, optimized wavelet filters have been developed for speech

compression, and the filter coefficients are derived from linear optimization employing different windows [6]. In this paper, we have replaced the MDCT (Modified Discrete Cosine Transform) filter banks of 32 filters each, used in the compression system proposed by Alex et al. [7], by a non-uniform filter bank designed using optimization [8]. The rest of the paper is organized as follows. A background on the psychoacoustic model is provided in the following section. Section 3 presents the quantization schemes, Section 4 deals with filter banks, and Section 5 outlines the proposed speech compression scheme. Section 6 describes the performance evaluation criteria, Section 7 gives results and discussion, and finally we give our conclusion.

2. Background on the Psychoacoustic Model

The psychoacoustic model is based on extensive research on human perception. This research has demonstrated that the average human does not hear all frequencies in the same way. Effects due to the limitations of the human sensory system and to different sounds in the environment lead to facts that can be exploited to remove unnecessary data contained in an audio signal [7]. The two principal properties of the human auditory system that make up the psychoacoustic model are auditory masking and the absolute threshold of hearing. Each of them provides a way of determining which signal portions are indiscernible or inaudible to the average human, and can therefore be eliminated from a signal [7].

2.1. Absolute Threshold of Hearing

Human hearing covers frequencies in the range from 20 Hz to 20,000 Hz. However, this does not mean that all frequencies are heard in the same manner. We can suppose that humans hear the frequencies that make up speech better than others; this is a good guess [7]. Moreover, we can also hypothesize that hearing a tone becomes more difficult as its frequency nears either of the extremes. One other observation forms the basis for modeling.
Because humans hear lower frequencies, like those making up speech, better than others, like high frequencies around 20 kHz, the ear probably has a better ability to detect differences in pitch at lower frequencies than at high ones. For example, a human has an easier time telling the difference between 500 Hz and 600 Hz than determining whether something is 7,000 Hz or 8,000 Hz [7]. Many studies have led to the conclusion that the frequency range from 20 Hz to 20,000 Hz can be broken up into critical bandwidths, which are non-linear, non-uniform and dependent on the sound being heard. Signals within one critical bandwidth are hard for a human observer to separate [7]. A more uniform measure of frequency based on critical bandwidths is the Bark. From the remarks above, we would expect a Bark bandwidth to be larger at high frequencies and smaller at low ones; indeed, this is the case. The Bark frequency scale can be approximately expressed as follows:

z(f) = 13 arctan(0.00076 f) + 3.5 arctan((f / 7500)^2)   [Barks]   (1)

To determine the effect of frequency on hearing ability, researchers played a sinusoidal tone at a very low power. The power was slowly raised until the subject could hear the tone. This level is the threshold at which the tone becomes audible. The process was repeated at many frequencies across the human auditory range, with many subjects. The experimental data can be modeled by equation (2):

ATH(f) = 3.64 (f/1000)^(-0.8) - 6.5 exp(-0.6 (f/1000 - 3.3)^2) + 10^(-3) (f/1000)^4   (dB SPL)   (2)

where f designates the frequency in Hertz. Therefore, the following conclusion can be drawn for the purposes of compression: if a signal has frequency components with power levels that fall below the absolute threshold of hearing, then these components can be removed, as the average listener will not be able to hear those frequencies anyway [7].

2.2. Auditory Masking

Humans do not have the ability to hear minute differences in frequency. For example, it is very difficult to distinguish a 1,000 Hz signal from one that is 1,001 Hz.
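As a numerical sketch, the Bark mapping and the absolute-threshold curve above can be implemented as follows. The constants are the standard Zwicker/Terhardt-style fits that the formulas appear to use, so treat them as assumptions rather than the exact values of the original implementation:

```python
import math

def bark(f):
    """Map a frequency f in Hz to the Bark scale."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

def ath(f):
    """Absolute threshold of hearing in dB SPL (Terhardt-style fit)."""
    k = f / 1000.0  # frequency in kHz
    return (3.64 * k ** -0.8
            - 6.5 * math.exp(-0.6 * (k - 3.3) ** 2)
            + 1e-3 * k ** 4)

# The Bark value grows monotonically with frequency, and the threshold
# curve dips below 0 dB SPL in the most sensitive 2-5 kHz region while
# rising steeply toward both ends of the audible range.
```

Any spectral component whose level falls below ath(f) at its frequency can be discarded, which is exactly the pruning step the compression scheme performs.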
This becomes even more difficult if the two signals are playing at the same time. Moreover, the 1,000 Hz signal would also affect a human's ability to hear a signal at 1,010 Hz, or 1,100 Hz, or 990 Hz. This concept is known as masking. If the 1,000 Hz signal is strong, it will mask signals at nearby frequencies, making them inaudible to the listener. For a masked signal to be heard, its power needs to be increased to a level greater than a threshold determined by the frequency and strength of the masker tone.

2.2.1. Tone Maskers

To determine whether a frequency component is a tone, one must know whether it has been held constant for a period of time, as well as whether it is a sharp peak in the frequency spectrum, which indicates that it is above the ambient noise of the signal. Whether a certain frequency is a tone (masker) can be determined with the definition given in [7].

2.2.2. Noise Maskers

When a signal is not a tone, it is necessarily noise. Therefore, all frequency components that are not part of a tone's neighborhood are taken and treated as noise. Combining such components into maskers, though, takes a little more thought. Because humans have difficulty discriminating signals within a critical band, the noise found within each of the bands can be combined to obtain one mask. The idea is thus to take all frequency components within a critical band that do not fit within tone neighborhoods, add them together, and place them at the geometric-mean location within the critical band. This is repeated for all critical bands [7].

2.2.3. Masking Effect

The determined maskers affect not only the frequencies within a critical band, but also those in surrounding bands. Studies show that this masking spreads with an approximate slope of +25 dB/Bark before and -10 dB/Bark after the masker. The spreading can be described as a function of

the masker location j, the maskee location i, the power spectrum P_TM at j, and the distance between the maskee and masker locations in Barks [7]. There is a slight difference in the resulting mask depending on whether the masker is noise or a tone. Consequently, one can model the masks by the following equations, with the same variables as described above.

For noise maskers:

T_NM(i, j) = P_NM(j) - 0.175 z(j) + SF(i, j) - 2.025   (dB SPL)   (3)

For tone maskers:

T_TM(i, j) = P_TM(j) - 0.275 z(j) + SF(i, j) - 6.025   (dB SPL)   (4)

Obviously, if there are multiple tone and noise maskers, the overall effect is a little harder to determine. In their work, Alex et al. [7] suppose that the effects are power-additive. This is a reasonable supposition to make, but note that there is definitely an interplay that can occur between maskers that would lower or raise thresholds [7].

3. Quantization Simulation

Alex et al. [7] developed two different quantization techniques for performing the audio compression. The first technique, named full range quantization, requires a predefined range that includes all possible input values. Since this technique gives a noticeable degradation of sound quality, they decided to develop a different quantization technique. The second is a dynamic technique, named narrow range quantization, which determines the quantization range and the step size based on the current set of input data. The inputs to be quantized range over [-1, 1], and each input is quantized with 16 bits (65,536 distinct values over [-1, 1]) [7].

4. Filter Bank

A filter bank is an array of band-pass filters that spans the entire audible frequency spectrum. Figure 1 illustrates a filter bank with M bands.

Figure 1. Filter banks.

The bank serves to isolate different frequency components in a signal. This is useful since some frequencies are deemed more important than others through the use of the psychoacoustic model. Magnitudes at these important frequencies need to be coded with a fine resolution.
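The masking thresholds of equations (3) and (4) can be sketched as follows. The exact spreading function SF(i, j) used in [7] is not reproduced in the text, so a simple two-slope function with roughly +25 dB/Bark before and -10 dB/Bark after the masker is assumed here:

```python
def spread(dz):
    """Assumed two-slope spreading function SF(i, j) in dB, where
    dz = z(i) - z(j) is the maskee-minus-masker distance in Barks.
    The true SF of [7] may differ in shape."""
    return 25.0 * dz if dz < 0 else -10.0 * dz

def noise_mask(p_nm, z_j, dz):
    """Noise-masking threshold T_NM(i, j) of equation (3), in dB SPL."""
    return p_nm - 0.175 * z_j + spread(dz) - 2.025

def tone_mask(p_tm, z_j, dz):
    """Tone-masking threshold T_TM(i, j) of equation (4), in dB SPL."""
    return p_tm - 0.275 * z_j + spread(dz) - 6.025
```

For equal masker power and position, the noise masker yields the higher threshold (it masks more), and both thresholds fall off as the maskee moves away from the masker in Barks.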
Small differences at these frequencies are significant, and a coding scheme that preserves these differences must be used. On the other hand, frequencies that are less important do not have to be exact. A coarser coding scheme can be employed, although some of the finer details will be lost in the coding. We can obtain different coding resolutions by using fewer bits to encode less significant frequencies and more bits to encode important frequencies. The filter bank thus permits different parts of the signal to be encoded with different numbers of bits, resulting in a compressed data-stream representation of the signal. In practice, there are two sets of filter banks. The first set of filters is named the analysis filter bank. The input signal passes through each of these filters and is then quantized with the proper number of bits, as determined by the psychoacoustic model. The signal then needs to be reconstructed from the quantized individual components. This is performed through a bank of synthesis filters. Finally, all the outputs of the synthesis filters are added together to reconstruct the final compressed output signal. There is one final point to make. Once the signal is passed through each filter in the analysis bank, it is down-sampled by the number of filters in the bank, because redundant information is present in each of the signals output from the filters. The decimation does not result in any loss of information, but it does shift the frequency of the signal. After quantization, the signal is up-sampled to restore the frequency content to its original scale. Figure 2 illustrates the analysis and synthesis filter bank setup [7].

Figure 2. Analysis and Synthesis Filter Bank Setup.

4.1. Filter Bank Design Considerations

A tradeoff exists between coarse and fine frequency resolution. No single tradeoff is optimal for all signals.
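The analysis-decimation / interpolation-synthesis chain just described can be sketched for a toy two-channel case. The Haar filters below are chosen only because they give perfect reconstruction in a few lines; they are not the bank used in [7], and the quantization step between analysis and synthesis is omitted:

```python
def convolve(x, h):
    """Plain FIR filtering (full linear convolution)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

s = 2 ** -0.5
h0, h1 = [s, s], [s, -s]    # analysis: low-pass / high-pass (Haar)
g0, g1 = [s, s], [-s, s]    # matching synthesis filters

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# analysis: filter, then keep every 2nd sample (decimation by 2)
sub0 = convolve(x, h0)[1::2]
sub1 = convolve(x, h1)[1::2]

# (per-band quantization, driven by the psychoacoustic model, goes here)

def upsample(v):
    """Insert a zero after each sample (interpolation by 2)."""
    out = []
    for c in v:
        out += [c, 0.0]
    return out

# synthesis: upsample, filter, and sum the band outputs
y = [a + b for a, b in zip(convolve(upsample(sub0), g0),
                           convolve(upsample(sub1), g1))]
# for this toy bank, y[:len(x)] equals x exactly
```

The 32-band analysis/synthesis setup of Figure 2 follows exactly this pattern, with decimation and interpolation by 32 instead of 2.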
Take, for instance, castanets and a piccolo, two musical instruments with very different qualities. A harmonic piccolo calls for fine frequency resolution and coarse time resolution. This is because a piccolo plays within a small range, consequently necessitating more filters per bank in order to sufficiently capture all tones. The opposite is true for castanets, which are localized in time but widely dispersed in frequency. In this case, one would want to use fewer filters per bank. Furthermore, many signals are non-stationary and need the coder to make adaptive decisions regarding the optimal time-frequency tradeoff. For their purposes, Alex et al. [7] employed non-adaptive filter banks that are commonly used in audio applications.

5. The Proposed Method

Alex et al. [7] implemented a compression scheme that uses psychoacoustic modeling to determine which portions of the audio signal can be removed without perceptible loss of sound quality to the human ear. In their compression system, the original signal is run through cosine-modulated perfect reconstruction filter banks having 32 filters in each bank. The MDCT filter banks of 32 filters each used by Alex et al. [7] are defined as follows:

h_k(n) = w(n) cos((2n + M + 1)(2k + 1) π / (4M))   (5)

g_k(n) = h_k(L - 1 - n)   (6)

with k = 0, ..., M - 1, n = 0, ..., L - 1, L = 2M, M = 32 and w(n) = sin((n + 0.5) π / (2M)).
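Reading equations (5) and (6) as a standard cosine-modulated (MDCT-style) bank with a sine window — an assumption, since the printed formulas are partly garbled — the filters can be generated as:

```python
import math

M = 32        # filters per bank, as in Alex et al.
L = 2 * M     # filter length

def window(n):
    """Sine window w(n) = sin((n + 0.5) * pi / (2M)) of equation (5)."""
    return math.sin((n + 0.5) * math.pi / (2 * M))

# analysis filters, equation (5):
# h_k(n) = w(n) * cos((2n + M + 1)(2k + 1) * pi / (4M))
h = [[window(n) * math.cos((2 * n + M + 1) * (2 * k + 1) * math.pi / (4 * M))
      for n in range(L)]
     for k in range(M)]

# synthesis filters, equation (6): time-reversed analysis filters
g = [[h[k][L - 1 - n] for n in range(L)] for k in range(M)]
```

Each of the M = 32 analysis filters is a cosine-modulated copy of the same 64-tap window, and each synthesis filter is simply its time reversal.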

The signal is divided by the filter banks into distinct frequency components and then quantized with a variable number of bits, based on the results of the psychoacoustic model. Alex et al. analyzed this compressed version of the signal and, by using different quantization schemes, obtained 3 to 75 percent compression of the original signal. This difference is due to the overhead needed for decoding the quantized signal in each scheme. A simplified block diagram of their scheme is given in Figure 3 [7].

Table 1. Coefficients of the analysis and synthesis impulse responses (h0, h1, g0, g1) of the filter bank designed using optimization.

Figure 3. Encoding/Decoding systems.

In this paper, we have modified the compression system of Alex et al. [7] by replacing the MDCT (Modified Discrete Cosine Transform) filter banks of 32 filters each by a uniform/non-uniform filter bank designed using optimization [8]. The goal is to design M analysis and synthesis FIR filters so that the analysis filters satisfy some frequency specifications and the filter bank (almost) meets the perfect reconstruction (PR) conditions. Both goals are achieved by minimizing the following performance index [8]:

J = w1 (PR error) + w2 (frequency specification error)   (7)

where w1 and w2 are optional weights. The algorithm can design both uniform (critically/over-sampled) and non-uniform filter banks [8]. Figure 4 illustrates the non-uniform filter bank used:

Figure 4. Analysis-Synthesis optimized filterbank.

Here H0(z), H1(z), G0(z) and G1(z) are respectively the z-transforms of the impulse responses h0(n), h1(n), g0(n) and g1(n) of the analysis and synthesis filters. These impulse responses are obtained by minimizing the performance index given by (7). We have therefore replaced the impulse responses h_k and g_k associated with the MDCT filter banks of 32 filters each, given by (5) and (6), by h0, h1, g0 and g1. Table 1 reports the coefficients of these impulse responses obtained from optimization.
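A minimal sketch of the PR-error term of the performance index (7), for the two-channel case of Figure 4: the distortion transfer function T(z) = (G0(z)H0(z) + G1(z)H1(z)) / 2 should reduce to a pure delay. The weights and the frequency-specification term of [8], and the aliasing term, are omitted here:

```python
def conv(a, b):
    """Polynomial (impulse-response) product, i.e. linear convolution."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def pr_error(h0, h1, g0, g1):
    """Squared deviation of T(z) = (G0 H0 + G1 H1)/2 from a pure delay:
    the dominant tap should be 1 and every other tap should be 0."""
    t = [(a + b) / 2.0 for a, b in zip(conv(g0, h0), conv(g1, h1))]
    d = max(range(len(t)), key=lambda i: abs(t[i]))  # dominant tap = delay
    return ((abs(t[d]) - 1.0) ** 2
            + sum(v * v for i, v in enumerate(t) if i != d))

# J of equation (7) would then be
#   J = w1 * pr_error(h0, h1, g0, g1) + w2 * freq_spec_error(h0, h1)
# with freq_spec_error measuring deviation from the desired responses.
```

An optimizer driving pr_error toward zero (together with the frequency-specification term) yields the near-PR filters whose coefficients are reported in Table 1.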
h0 and h1 are designed for analysis, and g0 and g1 are designed for synthesis.

6. Performance Evaluation

In this section, we present the objective criteria used for the evaluation and comparison of the proposed speech compression technique and that of Alex et al. [7]. These criteria are the number of bits before and after compression, SNR, PSNR, NRMSE and PESQ. The output speech quality is objectively evaluated for the proposed and the conventional speech compression techniques in the case of narrow range quantization. These objective criteria are defined as follows:

o Signal-to-noise ratio (SNR):

SNR = 10 log10( Σ_n s²(n) / Σ_n (s(n) - ŝ(n))² )   (8)

o Peak signal-to-noise ratio (PSNR):

PSNR = 10 log10( N X² / ||s - ŝ||² )   (9)

where X = max_n |s(n)|.

o Normalized root mean square error (NRMSE):

NRMSE = sqrt( Σ_n (s(n) - ŝ(n))² / Σ_n (s(n) - µ_s)² )   (10)

Here s(n) and ŝ(n) represent respectively the original and the reconstructed signal, N is the number of samples per signal, and µ_s is the mean of the speech signal s(n).

o Perceptual evaluation of speech quality (PESQ): The PESQ algorithm is an objective quality measure adopted as ITU-T recommendation P.862. It is an objective measurement tool conceived to predict the results of a subjective Mean Opinion Score (MOS) test. It has been shown [9] that PESQ is more reliable and correlates better with MOS than traditional objective speech measures.

6.1. File Format and Comparison

To determine compression ratios for our compression schemes, we first have to determine the number of bytes that each file

takes. We have used the same computation rules for file sizes (original file size, 16-bit compression, 8-bit compression, full range compression, and narrow range compression) as used in [7].

Table 3. Results (bits before / bits after compression) obtained from the research work of Alex et al. [7].

7. Results and Discussion

Figure 6 illustrates an example of a reconstructed speech signal obtained by applying the proposed speech compression technique and the technique of Alex et al. [7].

Figure 6. (a) Original speech signal, (b) reconstructed speech signal using the compression scheme of Alex et al. [7], (c) reconstructed speech signal using the proposed compression scheme.

Figure 6 clearly shows that the proposed technique yields a reconstructed speech signal of good quality, as can be seen by comparison with the original speech signal, whereas the compression technique proposed by Alex et al. [7] introduces some degradation in the reconstructed speech signal. Table 2 reports the results concerning bits before and after compression using the proposed technique and the technique of Alex et al. [7]. The two techniques are applied to three different speech signals.

Table 2. Bits before compression and bits after compression.

These results clearly show that the proposed technique outperforms the technique of Alex et al. [7] in terms of output file size: the output files of the proposed compression system are smaller than those of the system of Alex et al. [7]. Table 3 reports the results (bytes before and after compression) obtained from the application of the technique of Alex et al. [7] to a number of audio signals and sine-wave signals. According to Alex et al. [7], two findings can be stated: for full range, we have the smallest file and the worst sound quality; for narrow range, we have better sound quality and a larger file.
In this work, for the tested speech signals, we can state the following two findings: for full range, we obtain the smallest file and better sound quality; for narrow range, we obtain a completely degraded sound quality and a larger file. To solve the problem of speech degradation when using narrow range in the proposed technique and in the technique of Alex et al. [7], we have multiplied the psychoacoustic model threshold by an adjustment factor α. We have selected α equal to 3, a choice based on simulation results. Figures 7 and 8 clearly show that by multiplying the psychoacoustic model threshold by the factor α, we obtain an output speech signal with good quality.

Figure 7: (a) Original speech signal, (b) degraded output speech signal obtained from the compression system of Alex et al. [7] without multiplying the threshold by α, (c) output speech signal from the compression system of Alex et al. [7] with the threshold multiplied by α.
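For reference, the SNR, PSNR and NRMSE measures reported in the result tables can be computed as below; these are the standard definitions matching the variables stated in Section 6 (PESQ requires the full ITU-T P.862 algorithm and is not sketched):

```python
import math

def snr(s, s_hat):
    """Signal-to-noise ratio in dB, equation (8)."""
    num = sum(v * v for v in s)
    den = sum((a - b) ** 2 for a, b in zip(s, s_hat))
    return 10.0 * math.log10(num / den)

def psnr(s, s_hat):
    """Peak signal-to-noise ratio in dB, equation (9), with X = max |s(n)|."""
    n = len(s)
    peak = max(abs(v) for v in s)
    den = sum((a - b) ** 2 for a, b in zip(s, s_hat))
    return 10.0 * math.log10(n * peak * peak / den)

def nrmse(s, s_hat):
    """Normalized root mean square error, equation (10)."""
    mu = sum(s) / len(s)
    num = sum((a - b) ** 2 for a, b in zip(s, s_hat))
    den = sum((a - mu) ** 2 for a in s)
    return math.sqrt(num / den)
```

Higher SNR/PSNR and lower NRMSE indicate a reconstructed signal closer to the original, which is how the tables below rank the two compression systems.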

Table 5. PESQ values of the reconstructed speech signals in the case of narrow range.

Figure 8: (a) Original speech signal, (b) degraded output speech signal obtained from the proposed compression system without multiplying the threshold by α, (c) output speech signal from the proposed compression system with the threshold multiplied by α.

Table 4 reports the results obtained from the application of the two techniques to the three speech signals, for narrow range, with and without multiplying the psychoacoustic model threshold by the factor α. These results are the numbers of bits before and after speech compression. Tables 4 and 5 show that the performances of the proposed speech compression system and of the system of Alex et al. [7] are improved, and that the outputs of these systems have good quality, when the psychoacoustic model threshold is multiplied by the factor α. The output speech signals of the proposed speech compression system exhibit a small constant delay with respect to the original speech signals in the case of narrow range. To solve this problem, we suppressed this delay and obtained the results reported in Table 6.

Table 6. SNR, PSNR, PESQ and NRMSE of the method of Alex et al. and of the proposed speech compression technique for narrow range quantization.

Table 4. Bits after compression for narrow range in the two cases, with and without multiplying by the factor α.

References

[1] Xie, N., Dong, G., & Zhang, T. (2011). Using lossless data compression in data storage systems: not for saving space. IEEE Transactions on Computers, 60(3),

[2] Gibson, J. D. (2005). Speech coding methods, standards, and applications. IEEE Circuits and Systems Magazine, 5(4),
[3] Junejo, N., Ahmed, N., Unar, M. A., & Rajput, A. Q. K. (2005). Speech and image compression using discrete wavelet transform. In IEEE symposium on advances in wired and wireless communication (pp ).
[4] Agbinya, J. I. (1996). Discrete wavelet transform techniques in speech processing. In IEEE TENCON digital signal processing applications proceedings (pp ). New York: IEEE.
[5] Arif, M., & Anand, R. S. (). Turning point algorithm for speech signal compression. International Journal of Speech Technology. doi:.7/s
[6] Gersho, A. (1992). Speech coding. In A. N. Ince (Ed.), Digital speech processing (pp. 73 ). Boston: Kluwer Academic.
[7] Gersho, A. (1994). Advances in speech and audio compression. Proceedings of the IEEE, 82(6),
[8] Shlomot, E., Cuperman, V., & Gersho, A. (1998). Combined harmonic and waveform coding of speech at low bit rates. In IEEE conference on acoustics, speech and signal processing (ICASSP 98) (Vol. , pp ).
[9] Shlomot, E., Cuperman, V., & Gersho, A. (2001). Hybrid coding: combined harmonic and waveform coding of speech at 4 kb/s. IEEE Transactions on Speech and Audio Processing, 9(6),
[10] Junejo, N., Ahmed, N., Unar, M. A., & Rajput, A. Q. K. (2005). Speech and image compression using discrete wavelet transform. In IEEE symposium on advances in wired and wireless communication (pp ).
[11] Zois, E. N., & Anastassopoulos, V. (2000). Morphological waveform coding for writer identification. Pattern Recognition, 33(3),
[12] Laskar, R. H., Banerjee, K., Talukdar, F. A., & Sreenivasa Rao, K. (). A pitch synchronous approach to design voice conversion system using source-filter correlation. International Journal of Speech Technology, 15,
[13] Shahin, I. M. A. (). Speaker identification investigation and analysis in unbiased and biased emotional talking environments. International Journal of Speech Technology, 15,
[14] Vankateswaran, P., Sanyal, A., Das, S., Nandi, R., & Sanyal, S. K. (2009). An efficient time domain speech compression algorithm based on LPC and sub-band coding techniques. Journal of Communication, 4(6),
[15] Magboun, H. M., Ali, N., Osman, M. A., & Alfandi, S. A. (2010). Multimedia speech compression techniques. In IEEE international conference on computing science and information technology (ICCSIT) (Vol. 9, pp ).
[16] Osman, M. A., Ali, N., Magboud, H. M., & Alfandi, S. A. (2010). Speech compression using LPC and wavelet. In IEEE international conference on computer engineering and technology (ICCET) (Vol. 7, pp. 9 99).
[17] McCauley, J., Ming, J., Stewart, D., & Hanna, P. (2005). Subband correlation and robust speech recognition. IEEE Transactions on Speech and Audio Processing, 13(5),
[18] Ramchandran, K., Vetterli, M., & Herley, C. (1996). Wavelets, subband coding, and best bases. Proceedings of the IEEE, 84(4),
[19] Gershikov, E., & Porat, M. (2007). On color transforms and bit allocation for optimal subband image compression. Signal Processing: Image Communication,
[20] Shao, Y., & Chang, C. H. (). Bayesian separation with sparsity promotion in perceptual wavelet domain for speech enhancement and hybrid speech recognition. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 4(),
[21] Satt, A., & Malah, D. (1989). Design of uniform DFT filter banks optimized for subband coding of speech. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(),
[22] Joseph, S. M. (). Spoken digit compression using wavelet packet. In IEEE international conference on signal and image processing (ICSIP) (pp ).
[23] Mallat, S. G. (1989). A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11,
[24] Fgee, E. B., Philips, W. J., & Robertson, W. (1999). Comparing audio compression using wavelets with other audio compression schemes. Proceedings IEEE Electrical and Computer Engineering,
[25] Dusan, S., Flanagan, J. L., Karve, A., & Balaraman, M. (2007). Speech compression using polynomial approximation. IEEE Transactions on Audio, Speech, and Language Processing, 15,


More information

Overview of Code Excited Linear Predictive Coder

Overview of Code Excited Linear Predictive Coder Overview of Code Excited Linear Predictive Coder Minal Mulye 1, Sonal Jagtap 2 1 PG Student, 2 Assistant Professor, Department of E&TC, Smt. Kashibai Navale College of Engg, Pune, India Abstract Advances

More information

Audio and Speech Compression Using DCT and DWT Techniques

Audio and Speech Compression Using DCT and DWT Techniques Audio and Speech Compression Using DCT and DWT Techniques M. V. Patil 1, Apoorva Gupta 2, Ankita Varma 3, Shikhar Salil 4 Asst. Professor, Dept.of Elex, Bharati Vidyapeeth Univ.Coll.of Engg, Pune, Maharashtra,

More information

Evaluation of Audio Compression Artifacts M. Herrera Martinez

Evaluation of Audio Compression Artifacts M. Herrera Martinez Evaluation of Audio Compression Artifacts M. Herrera Martinez This paper deals with subjective evaluation of audio-coding systems. From this evaluation, it is found that, depending on the type of signal

More information

EC 6501 DIGITAL COMMUNICATION UNIT - II PART A

EC 6501 DIGITAL COMMUNICATION UNIT - II PART A EC 6501 DIGITAL COMMUNICATION 1.What is the need of prediction filtering? UNIT - II PART A [N/D-16] Prediction filtering is used mostly in audio signal processing and speech processing for representing

More information

Wideband Speech Coding & Its Application

Wideband Speech Coding & Its Application Wideband Speech Coding & Its Application Apeksha B. landge. M.E. [student] Aditya Engineering College Beed Prof. Amir Lodhi. Guide & HOD, Aditya Engineering College Beed ABSTRACT: Increasing the bandwidth

More information

Mel Spectrum Analysis of Speech Recognition using Single Microphone

Mel Spectrum Analysis of Speech Recognition using Single Microphone International Journal of Engineering Research in Electronics and Communication Mel Spectrum Analysis of Speech Recognition using Single Microphone [1] Lakshmi S.A, [2] Cholavendan M [1] PG Scholar, Sree

More information

speech signal S(n). This involves a transformation of S(n) into another signal or a set of signals

speech signal S(n). This involves a transformation of S(n) into another signal or a set of signals 16 3. SPEECH ANALYSIS 3.1 INTRODUCTION TO SPEECH ANALYSIS Many speech processing [22] applications exploits speech production and perception to accomplish speech analysis. By speech analysis we extract

More information

Accurate Delay Measurement of Coded Speech Signals with Subsample Resolution

Accurate Delay Measurement of Coded Speech Signals with Subsample Resolution PAGE 433 Accurate Delay Measurement of Coded Speech Signals with Subsample Resolution Wenliang Lu, D. Sen, and Shuai Wang School of Electrical Engineering & Telecommunications University of New South Wales,

More information

Comparative Analysis between DWT and WPD Techniques of Speech Compression

Comparative Analysis between DWT and WPD Techniques of Speech Compression IOSR Journal of Engineering (IOSRJEN) ISSN: 225-321 Volume 2, Issue 8 (August 212), PP 12-128 Comparative Analysis between DWT and WPD Techniques of Speech Compression Preet Kaur 1, Pallavi Bahl 2 1 (Assistant

More information

EE482: Digital Signal Processing Applications

EE482: Digital Signal Processing Applications Professor Brendan Morris, SEB 3216, brendan.morris@unlv.edu EE482: Digital Signal Processing Applications Spring 2014 TTh 14:30-15:45 CBC C222 Lecture 14 Quiz 04 Review 14/04/07 http://www.ee.unlv.edu/~b1morris/ee482/

More information

Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm

Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm International OPEN ACCESS Journal Of Modern Engineering Research (IJMER) Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm A.T. Rajamanickam, N.P.Subiramaniyam, A.Balamurugan*,

More information

Speech Compression Using Wavelet Transform

Speech Compression Using Wavelet Transform IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 19, Issue 3, Ver. VI (May - June 2017), PP 33-41 www.iosrjournals.org Speech Compression Using Wavelet Transform

More information

Dilpreet Singh 1, Parminder Singh 2 1 M.Tech. Student, 2 Associate Professor

Dilpreet Singh 1, Parminder Singh 2 1 M.Tech. Student, 2 Associate Professor A Novel Approach for Waveform Compression Dilpreet Singh 1, Parminder Singh 2 1 M.Tech. Student, 2 Associate Professor CSE Department, Guru Nanak Dev Engineering College, Ludhiana Abstract Waveform Compression

More information

MODIFIED DCT BASED SPEECH ENHANCEMENT IN VEHICULAR ENVIRONMENTS

MODIFIED DCT BASED SPEECH ENHANCEMENT IN VEHICULAR ENVIRONMENTS MODIFIED DCT BASED SPEECH ENHANCEMENT IN VEHICULAR ENVIRONMENTS 1 S.PRASANNA VENKATESH, 2 NITIN NARAYAN, 3 K.SAILESH BHARATHWAAJ, 4 M.P.ACTLIN JEEVA, 5 P.VIJAYALAKSHMI 1,2,3,4,5 SSN College of Engineering,

More information

Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications

Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications Brochure More information from http://www.researchandmarkets.com/reports/569388/ Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications Description: Multimedia Signal

More information

Speech Enhancement Using Spectral Flatness Measure Based Spectral Subtraction

Speech Enhancement Using Spectral Flatness Measure Based Spectral Subtraction IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 7, Issue, Ver. I (Mar. - Apr. 7), PP 4-46 e-issn: 9 4, p-issn No. : 9 497 www.iosrjournals.org Speech Enhancement Using Spectral Flatness Measure

More information

SPEECH COMPRESSION USING WAVELETS

SPEECH COMPRESSION USING WAVELETS SPEECH COMPRESSION USING WAVELETS HATEM ELAYDI Electrical & Computer Engineering Department Islamic University of Gaza Gaza, Palestine helaydi@mail.iugaza.edu MUSTAFA I. JABER Electrical & Computer Engineering

More information

Chapter 2: Digitization of Sound

Chapter 2: Digitization of Sound Chapter 2: Digitization of Sound Acoustics pressure waves are converted to electrical signals by use of a microphone. The output signal from the microphone is an analog signal, i.e., a continuous-valued

More information

Speech Compression Using Voice Excited Linear Predictive Coding

Speech Compression Using Voice Excited Linear Predictive Coding Speech Compression Using Voice Excited Linear Predictive Coding Ms.Tosha Sen, Ms.Kruti Jay Pancholi PG Student, Asst. Professor, L J I E T, Ahmedabad Abstract : The aim of the thesis is design good quality

More information

Audio Compression using the MLT and SPIHT

Audio Compression using the MLT and SPIHT Audio Compression using the MLT and SPIHT Mohammed Raad, Alfred Mertins and Ian Burnett School of Electrical, Computer and Telecommunications Engineering University Of Wollongong Northfields Ave Wollongong

More information

APPLICATIONS OF DSP OBJECTIVES

APPLICATIONS OF DSP OBJECTIVES APPLICATIONS OF DSP OBJECTIVES This lecture will discuss the following: Introduce analog and digital waveform coding Introduce Pulse Coded Modulation Consider speech-coding principles Introduce the channel

More information

FPGA implementation of DWT for Audio Watermarking Application

FPGA implementation of DWT for Audio Watermarking Application FPGA implementation of DWT for Audio Watermarking Application Naveen.S.Hampannavar 1, Sajeevan Joseph 2, C.B.Bidhul 3, Arunachalam V 4 1, 2, 3 M.Tech VLSI Students, 4 Assistant Professor Selection Grade

More information

Speech Enhancement in Presence of Noise using Spectral Subtraction and Wiener Filter

Speech Enhancement in Presence of Noise using Spectral Subtraction and Wiener Filter Speech Enhancement in Presence of Noise using Spectral Subtraction and Wiener Filter 1 Gupteswar Sahu, 2 D. Arun Kumar, 3 M. Bala Krishna and 4 Jami Venkata Suman Assistant Professor, Department of ECE,

More information

Department of Electronics and Communication Engineering 1

Department of Electronics and Communication Engineering 1 UNIT I SAMPLING AND QUANTIZATION Pulse Modulation 1. Explain in detail the generation of PWM and PPM signals (16) (M/J 2011) 2. Explain in detail the concept of PWM and PAM (16) (N/D 2012) 3. What is the

More information

Speech Coding Technique And Analysis Of Speech Codec Using CS-ACELP

Speech Coding Technique And Analysis Of Speech Codec Using CS-ACELP Speech Coding Technique And Analysis Of Speech Codec Using CS-ACELP Monika S.Yadav Vidarbha Institute of Technology Rashtrasant Tukdoji Maharaj Nagpur University, Nagpur, India monika.yadav@rediffmail.com

More information

Chapter 9 Image Compression Standards

Chapter 9 Image Compression Standards Chapter 9 Image Compression Standards 9.1 The JPEG Standard 9.2 The JPEG2000 Standard 9.3 The JPEG-LS Standard 1IT342 Image Compression Standards The image standard specifies the codec, which defines how

More information

Audio Watermarking Scheme in MDCT Domain

Audio Watermarking Scheme in MDCT Domain Santosh Kumar Singh and Jyotsna Singh Electronics and Communication Engineering, Netaji Subhas Institute of Technology, Sec. 3, Dwarka, New Delhi, 110078, India. E-mails: ersksingh_mtnl@yahoo.com & jsingh.nsit@gmail.com

More information

Introduction to Audio Watermarking Schemes

Introduction to Audio Watermarking Schemes Introduction to Audio Watermarking Schemes N. Lazic and P. Aarabi, Communication over an Acoustic Channel Using Data Hiding Techniques, IEEE Transactions on Multimedia, Vol. 8, No. 5, October 2006 Multimedia

More information

Detection, localization, and classification of power quality disturbances using discrete wavelet transform technique

Detection, localization, and classification of power quality disturbances using discrete wavelet transform technique From the SelectedWorks of Tarek Ibrahim ElShennawy 2003 Detection, localization, and classification of power quality disturbances using discrete wavelet transform technique Tarek Ibrahim ElShennawy, Dr.

More information

NOISE SHAPING IN AN ITU-T G.711-INTEROPERABLE EMBEDDED CODEC

NOISE SHAPING IN AN ITU-T G.711-INTEROPERABLE EMBEDDED CODEC NOISE SHAPING IN AN ITU-T G.711-INTEROPERABLE EMBEDDED CODEC Jimmy Lapierre 1, Roch Lefebvre 1, Bruno Bessette 1, Vladimir Malenovsky 1, Redwan Salami 2 1 Université de Sherbrooke, Sherbrooke (Québec),

More information

RECENTLY, there has been an increasing interest in noisy

RECENTLY, there has been an increasing interest in noisy IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 52, NO. 9, SEPTEMBER 2005 535 Warped Discrete Cosine Transform-Based Noisy Speech Enhancement Joon-Hyuk Chang, Member, IEEE Abstract In

More information

Speech Enhancement using Wiener filtering

Speech Enhancement using Wiener filtering Speech Enhancement using Wiener filtering S. Chirtmay and M. Tahernezhadi Department of Electrical Engineering Northern Illinois University DeKalb, IL 60115 ABSTRACT The problem of reducing the disturbing

More information

Nonuniform multi level crossing for signal reconstruction

Nonuniform multi level crossing for signal reconstruction 6 Nonuniform multi level crossing for signal reconstruction 6.1 Introduction In recent years, there has been considerable interest in level crossing algorithms for sampling continuous time signals. Driven

More information

Introduction of Audio and Music

Introduction of Audio and Music 1 Introduction of Audio and Music Wei-Ta Chu 2009/12/3 Outline 2 Introduction of Audio Signals Introduction of Music 3 Introduction of Audio Signals Wei-Ta Chu 2009/12/3 Li and Drew, Fundamentals of Multimedia,

More information

Perception of pitch. Definitions. Why is pitch important? BSc Audiology/MSc SHS Psychoacoustics wk 4: 7 Feb A. Faulkner.

Perception of pitch. Definitions. Why is pitch important? BSc Audiology/MSc SHS Psychoacoustics wk 4: 7 Feb A. Faulkner. Perception of pitch BSc Audiology/MSc SHS Psychoacoustics wk 4: 7 Feb 2008. A. Faulkner. See Moore, BCJ Introduction to the Psychology of Hearing, Chapter 5. Or Plack CJ The Sense of Hearing Lawrence Erlbaum,

More information

10 Speech and Audio Signals

10 Speech and Audio Signals 0 Speech and Audio Signals Introduction Speech and audio signals are normally converted into PCM, which can be stored or transmitted as a PCM code, or compressed to reduce the number of bits used to code

More information

Perception of pitch. Definitions. Why is pitch important? BSc Audiology/MSc SHS Psychoacoustics wk 5: 12 Feb A. Faulkner.

Perception of pitch. Definitions. Why is pitch important? BSc Audiology/MSc SHS Psychoacoustics wk 5: 12 Feb A. Faulkner. Perception of pitch BSc Audiology/MSc SHS Psychoacoustics wk 5: 12 Feb 2009. A. Faulkner. See Moore, BCJ Introduction to the Psychology of Hearing, Chapter 5. Or Plack CJ The Sense of Hearing Lawrence

More information

ScienceDirect. Unsupervised Speech Segregation Using Pitch Information and Time Frequency Masking

ScienceDirect. Unsupervised Speech Segregation Using Pitch Information and Time Frequency Masking Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 46 (2015 ) 122 126 International Conference on Information and Communication Technologies (ICICT 2014) Unsupervised Speech

More information

Transcoding of Narrowband to Wideband Speech

Transcoding of Narrowband to Wideband Speech University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2005 Transcoding of Narrowband to Wideband Speech Christian H. Ritz University

More information

IN RECENT YEARS, there has been a great deal of interest

IN RECENT YEARS, there has been a great deal of interest IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL 12, NO 1, JANUARY 2004 9 Signal Modification for Robust Speech Coding Nam Soo Kim, Member, IEEE, and Joon-Hyuk Chang, Member, IEEE Abstract Usually,

More information

techniques are means of reducing the bandwidth needed to represent the human voice. In mobile

techniques are means of reducing the bandwidth needed to represent the human voice. In mobile 8 2. LITERATURE SURVEY The available radio spectrum for the wireless radio communication is very limited hence to accommodate maximum number of users the speech is compressed. The speech compression techniques

More information

Speech Coding using Linear Prediction

Speech Coding using Linear Prediction Speech Coding using Linear Prediction Jesper Kjær Nielsen Aalborg University and Bang & Olufsen jkn@es.aau.dk September 10, 2015 1 Background Speech is generated when air is pushed from the lungs through

More information

EC 2301 Digital communication Question bank

EC 2301 Digital communication Question bank EC 2301 Digital communication Question bank UNIT I Digital communication system 2 marks 1.Draw block diagram of digital communication system. Information source and input transducer formatter Source encoder

More information

A Closed-loop Multimode Variable Bit Rate Characteristic Waveform Interpolation Coder

A Closed-loop Multimode Variable Bit Rate Characteristic Waveform Interpolation Coder A Closed-loop Multimode Variable Bit Rate Characteristic Waveform Interpolation Coder Jing Wang, Jingg Kuang, and Shenghui Zhao Research Center of Digital Communication Technology,Department of Electronic

More information

High capacity robust audio watermarking scheme based on DWT transform

High capacity robust audio watermarking scheme based on DWT transform High capacity robust audio watermarking scheme based on DWT transform Davod Zangene * (Sama technical and vocational training college, Islamic Azad University, Mahshahr Branch, Mahshahr, Iran) davodzangene@mail.com

More information

Wavelet-based image compression

Wavelet-based image compression Institut Mines-Telecom Wavelet-based image compression Marco Cagnazzo Multimedia Compression Outline Introduction Discrete wavelet transform and multiresolution analysis Filter banks and DWT Multiresolution

More information

Efficient Image Compression Technique using JPEG2000 with Adaptive Threshold

Efficient Image Compression Technique using JPEG2000 with Adaptive Threshold Efficient Image Compression Technique using JPEG2000 with Adaptive Threshold Md. Masudur Rahman Mawlana Bhashani Science and Technology University Santosh, Tangail-1902 (Bangladesh) Mohammad Motiur Rahman

More information

REAL-TIME BROADBAND NOISE REDUCTION

REAL-TIME BROADBAND NOISE REDUCTION REAL-TIME BROADBAND NOISE REDUCTION Robert Hoeldrich and Markus Lorber Institute of Electronic Music Graz Jakoministrasse 3-5, A-8010 Graz, Austria email: robert.hoeldrich@mhsg.ac.at Abstract A real-time

More information

Advances in Applied and Pure Mathematics

Advances in Applied and Pure Mathematics Enhancement of speech signal based on application of the Maximum a Posterior Estimator of Magnitude-Squared Spectrum in Stationary Bionic Wavelet Domain MOURAD TALBI, ANIS BEN AICHA 1 mouradtalbi196@yahoo.fr,

More information

Signal Processing for Speech Applications - Part 2-1. Signal Processing For Speech Applications - Part 2

Signal Processing for Speech Applications - Part 2-1. Signal Processing For Speech Applications - Part 2 Signal Processing for Speech Applications - Part 2-1 Signal Processing For Speech Applications - Part 2 May 14, 2013 Signal Processing for Speech Applications - Part 2-2 References Huang et al., Chapter

More information

Speech Compression for Better Audibility Using Wavelet Transformation with Adaptive Kalman Filtering

Speech Compression for Better Audibility Using Wavelet Transformation with Adaptive Kalman Filtering Speech Compression for Better Audibility Using Wavelet Transformation with Adaptive Kalman Filtering P. Sunitha 1, Satya Prasad Chitneedi 2 1 Assoc. Professor, Department of ECE, Pragathi Engineering College,

More information

Simulation of Conjugate Structure Algebraic Code Excited Linear Prediction Speech Coder

Simulation of Conjugate Structure Algebraic Code Excited Linear Prediction Speech Coder COMPUSOFT, An international journal of advanced computer technology, 3 (3), March-204 (Volume-III, Issue-III) ISSN:2320-0790 Simulation of Conjugate Structure Algebraic Code Excited Linear Prediction Speech

More information

Speech Synthesis; Pitch Detection and Vocoders

Speech Synthesis; Pitch Detection and Vocoders Speech Synthesis; Pitch Detection and Vocoders Tai-Shih Chi ( 冀泰石 ) Department of Communication Engineering National Chiao Tung University May. 29, 2008 Speech Synthesis Basic components of the text-to-speech

More information

A COMPARATIVE ANALYSIS OF DCT AND DWT BASED FOR IMAGE COMPRESSION ON FPGA

A COMPARATIVE ANALYSIS OF DCT AND DWT BASED FOR IMAGE COMPRESSION ON FPGA International Journal of Applied Engineering Research and Development (IJAERD) ISSN:2250 1584 Vol.2, Issue 1 (2012) 13-21 TJPRC Pvt. Ltd., A COMPARATIVE ANALYSIS OF DCT AND DWT BASED FOR IMAGE COMPRESSION

More information

New algorithm for QMF Banks Design and Its Application in Speech Compression using DWT

New algorithm for QMF Banks Design and Its Application in Speech Compression using DWT 86 The International Arab Journal of Information Technology, Vol. 1, No.1, January 015 New algorithm for QMF Banks Design and Its Application in Speech Compression using DWT Noureddine Aloui, Chafik Barnoussi

More information

(Refer Slide Time: 3:11)

(Refer Slide Time: 3:11) Digital Communication. Professor Surendra Prasad. Department of Electrical Engineering. Indian Institute of Technology, Delhi. Lecture-2. Digital Representation of Analog Signals: Delta Modulation. Professor:

More information

TE 302 DISCRETE SIGNALS AND SYSTEMS. Chapter 1: INTRODUCTION

TE 302 DISCRETE SIGNALS AND SYSTEMS. Chapter 1: INTRODUCTION TE 302 DISCRETE SIGNALS AND SYSTEMS Study on the behavior and processing of information bearing functions as they are currently used in human communication and the systems involved. Chapter 1: INTRODUCTION

More information

Spatial Audio Transmission Technology for Multi-point Mobile Voice Chat

Spatial Audio Transmission Technology for Multi-point Mobile Voice Chat Audio Transmission Technology for Multi-point Mobile Voice Chat Voice Chat Multi-channel Coding Binaural Signal Processing Audio Transmission Technology for Multi-point Mobile Voice Chat We have developed

More information

Chapter 4. Digital Audio Representation CS 3570

Chapter 4. Digital Audio Representation CS 3570 Chapter 4. Digital Audio Representation CS 3570 1 Objectives Be able to apply the Nyquist theorem to understand digital audio aliasing. Understand how dithering and noise shaping are done. Understand the

More information

NOISE ESTIMATION IN A SINGLE CHANNEL

NOISE ESTIMATION IN A SINGLE CHANNEL SPEECH ENHANCEMENT FOR CROSS-TALK INTERFERENCE by Levent M. Arslan and John H.L. Hansen Robust Speech Processing Laboratory Department of Electrical Engineering Box 99 Duke University Durham, North Carolina

More information

Golomb-Rice Coding Optimized via LPC for Frequency Domain Audio Coder

Golomb-Rice Coding Optimized via LPC for Frequency Domain Audio Coder Golomb-Rice Coding Optimized via LPC for Frequency Domain Audio Coder Ryosue Sugiura, Yutaa Kamamoto, Noboru Harada, Hiroazu Kameoa and Taehiro Moriya Graduate School of Information Science and Technology,

More information

EEE 309 Communication Theory

EEE 309 Communication Theory EEE 309 Communication Theory Semester: January 2016 Dr. Md. Farhad Hossain Associate Professor Department of EEE, BUET Email: mfarhadhossain@eee.buet.ac.bd Office: ECE 331, ECE Building Part 05 Pulse Code

More information

HCS 7367 Speech Perception

HCS 7367 Speech Perception HCS 7367 Speech Perception Dr. Peter Assmann Fall 212 Power spectrum model of masking Assumptions: Only frequencies within the passband of the auditory filter contribute to masking. Detection is based

More information

Frequency Domain Analysis for Noise Suppression Using Spectral Processing Methods for Degraded Speech Signal in Speech Enhancement

Frequency Domain Analysis for Noise Suppression Using Spectral Processing Methods for Degraded Speech Signal in Speech Enhancement Frequency Domain Analysis for Noise Suppression Using Spectral Processing Methods for Degraded Speech Signal in Speech Enhancement 1 Zeeshan Hashmi Khateeb, 2 Gopalaiah 1,2 Department of Instrumentation

More information

An objective method for evaluating data hiding in pitch gain and pitch delay parameters of the AMR codec

An objective method for evaluating data hiding in pitch gain and pitch delay parameters of the AMR codec An objective method for evaluating data hiding in pitch gain and pitch delay parameters of the AMR codec Akira Nishimura 1 1 Department of Media and Cultural Studies, Tokyo University of Information Sciences,

More information

Speech Enhancement Based On Noise Reduction

Speech Enhancement Based On Noise Reduction Speech Enhancement Based On Noise Reduction Kundan Kumar Singh Electrical Engineering Department University Of Rochester ksingh11@z.rochester.edu ABSTRACT This paper addresses the problem of signal distortion

More information

I D I A P R E S E A R C H R E P O R T. June published in Interspeech 2008

I D I A P R E S E A R C H R E P O R T. June published in Interspeech 2008 R E S E A R C H R E P O R T I D I A P Spectral Noise Shaping: Improvements in Speech/Audio Codec Based on Linear Prediction in Spectral Domain Sriram Ganapathy a b Petr Motlicek a Hynek Hermansky a b Harinath

More information

Performance study of Text-independent Speaker identification system using MFCC & IMFCC for Telephone and Microphone Speeches

Performance study of Text-independent Speaker identification system using MFCC & IMFCC for Telephone and Microphone Speeches Performance study of Text-independent Speaker identification system using & I for Telephone and Microphone Speeches Ruchi Chaudhary, National Technical Research Organization Abstract: A state-of-the-art

More information

International Journal of Modern Trends in Engineering and Research e-issn No.: , Date: 2-4 July, 2015

International Journal of Modern Trends in Engineering and Research   e-issn No.: , Date: 2-4 July, 2015 International Journal of Modern Trends in Engineering and Research www.ijmter.com e-issn No.:2349-9745, Date: 2-4 July, 2015 Analysis of Speech Signal Using Graphic User Interface Solly Joy 1, Savitha

More information

ELEC9344:Speech & Audio Processing. Chapter 13 (Week 13) Professor E. Ambikairajah. UNSW, Australia. Auditory Masking

ELEC9344:Speech & Audio Processing. Chapter 13 (Week 13) Professor E. Ambikairajah. UNSW, Australia. Auditory Masking ELEC9344:Speech & Audio Processing Chapter 13 (Week 13) Auditory Masking Anatomy of the ear The ear divided into three sections: The outer Middle Inner ear (see next slide) The outer ear is terminated

More information

Reduction of Musical Residual Noise Using Harmonic- Adapted-Median Filter

Reduction of Musical Residual Noise Using Harmonic- Adapted-Median Filter Reduction of Musical Residual Noise Using Harmonic- Adapted-Median Filter Ching-Ta Lu, Kun-Fu Tseng 2, Chih-Tsung Chen 2 Department of Information Communication, Asia University, Taichung, Taiwan, ROC

More information

Encoding a Hidden Digital Signature onto an Audio Signal Using Psychoacoustic Masking

Encoding a Hidden Digital Signature onto an Audio Signal Using Psychoacoustic Masking The 7th International Conference on Signal Processing Applications & Technology, Boston MA, pp. 476-480, 7-10 October 1996. Encoding a Hidden Digital Signature onto an Audio Signal Using Psychoacoustic

More information

arxiv: v1 [cs.it] 9 Mar 2016

arxiv: v1 [cs.it] 9 Mar 2016 A Novel Design of Linear Phase Non-uniform Digital Filter Banks arxiv:163.78v1 [cs.it] 9 Mar 16 Sakthivel V, Elizabeth Elias Department of Electronics and Communication Engineering, National Institute

More information

Speech/Music Change Point Detection using Sonogram and AANN

Speech/Music Change Point Detection using Sonogram and AANN International Journal of Information & Computation Technology. ISSN 0974-2239 Volume 6, Number 1 (2016), pp. 45-49 International Research Publications House http://www. irphouse.com Speech/Music Change

More information

MPEG-4 Structured Audio Systems

MPEG-4 Structured Audio Systems MPEG-4 Structured Audio Systems Mihir Anandpara The University of Texas at Austin anandpar@ece.utexas.edu 1 Abstract The MPEG-4 standard has been proposed to provide high quality audio and video content

More information

Wavelet Speech Enhancement based on the Teager Energy Operator

Wavelet Speech Enhancement based on the Teager Energy Operator Wavelet Speech Enhancement based on the Teager Energy Operator Mohammed Bahoura and Jean Rouat ERMETIS, DSA, Université du Québec à Chicoutimi, Chicoutimi, Québec, G7H 2B1, Canada. Abstract We propose

More information

Distributed Speech Recognition Standardization Activity

Distributed Speech Recognition Standardization Activity Distributed Speech Recognition Standardization Activity Alex Sorin, Ron Hoory, Dan Chazan Telecom and Media Systems Group June 30, 2003 IBM Research Lab in Haifa Advanced Speech Enabled Services ASR App

More information

Modulator Domain Adaptive Gain Equalizer for Speech Enhancement

Modulator Domain Adaptive Gain Equalizer for Speech Enhancement Modulator Domain Adaptive Gain Equalizer for Speech Enhancement Ravindra d. Dhage, Prof. Pravinkumar R.Badadapure Abstract M.E Scholar, Professor. This paper presents a speech enhancement method for personal

More information

ECE 556 BASICS OF DIGITAL SPEECH PROCESSING. Assıst.Prof.Dr. Selma ÖZAYDIN Spring Term-2017 Lecture 2

ECE 556 BASICS OF DIGITAL SPEECH PROCESSING. Assıst.Prof.Dr. Selma ÖZAYDIN Spring Term-2017 Lecture 2 ECE 556 BASICS OF DIGITAL SPEECH PROCESSING Assıst.Prof.Dr. Selma ÖZAYDIN Spring Term-2017 Lecture 2 Analog Sound to Digital Sound Characteristics of Sound Amplitude Wavelength (w) Frequency ( ) Timbre

More information

General outline of HF digital radiotelephone systems

General outline of HF digital radiotelephone systems Rec. ITU-R F.111-1 1 RECOMMENDATION ITU-R F.111-1* DIGITIZED SPEECH TRANSMISSIONS FOR SYSTEMS OPERATING BELOW ABOUT 30 MHz (Question ITU-R 164/9) Rec. ITU-R F.111-1 (1994-1995) The ITU Radiocommunication

More information

Introduction to Wavelet Transform. Chapter 7 Instructor: Hossein Pourghassem

Introduction to Wavelet Transform. Chapter 7 Instructor: Hossein Pourghassem Introduction to Wavelet Transform Chapter 7 Instructor: Hossein Pourghassem Introduction Most of the signals in practice, are TIME-DOMAIN signals in their raw format. It means that measured signal is a

More information

YOUR WAVELET BASED PITCH DETECTION AND VOICED/UNVOICED DECISION

YOUR WAVELET BASED PITCH DETECTION AND VOICED/UNVOICED DECISION American Journal of Engineering and Technology Research Vol. 3, No., 03 YOUR WAVELET BASED PITCH DETECTION AND VOICED/UNVOICED DECISION Yinan Kong Department of Electronic Engineering, Macquarie University

More information

Downloaded from 1

Downloaded from  1 VII SEMESTER FINAL EXAMINATION-2004 Attempt ALL questions. Q. [1] How does Digital communication System differ from Analog systems? Draw functional block diagram of DCS and explain the significance of

More information

COMBINING ADVANCED SINUSOIDAL AND WAVEFORM MATCHING MODELS FOR PARAMETRIC AUDIO/SPEECH CODING

COMBINING ADVANCED SINUSOIDAL AND WAVEFORM MATCHING MODELS FOR PARAMETRIC AUDIO/SPEECH CODING 17th European Signal Processing Conference (EUSIPCO 29) Glasgow, Scotland, August 24-28, 29 COMBINING ADVANCED SINUSOIDAL AND WAVEFORM MATCHING MODELS FOR PARAMETRIC AUDIO/SPEECH CODING Alexey Petrovsky

More information

THE STATISTICAL ANALYSIS OF AUDIO WATERMARKING USING THE DISCRETE WAVELETS TRANSFORM AND SINGULAR VALUE DECOMPOSITION

THE STATISTICAL ANALYSIS OF AUDIO WATERMARKING USING THE DISCRETE WAVELETS TRANSFORM AND SINGULAR VALUE DECOMPOSITION THE STATISTICAL ANALYSIS OF AUDIO WATERMARKING USING THE DISCRETE WAVELETS TRANSFORM AND SINGULAR VALUE DECOMPOSITION Mr. Jaykumar. S. Dhage Assistant Professor, Department of Computer Science & Engineering

More information