United Codec
Mofei Zhu, Hugo Guo, Deepak
Music 422, Winter 09, Stanford University
March 13, 2009

1. Motivation/Background

The goal of this project is to build a perceptual audio coder that reduces the data size of audio files while producing good sound quality at a bit rate below 128 kbps/ch. As in any perceptual audio coder, we quantize the signal in the frequency domain rather than the time domain, so that a psychoacoustic masking curve can be used to discard inaudible components and reduce the file size. To improve on the baseline coder, we implement block switching, Huffman coding, and M/S stereo coding.

2. Overview

Figure 1 shows an overview of our codec. The encoder follows two main paths that interact at several points and merge into a single output path. In the first path, each block of data is multiplied by a sine window and transformed with an FFT; the FFT result is used to find the peaks that define the masking curve, from which the SMRs are computed to decide how many bits to allocate to each group of MDCT lines. In the second path, the block is also multiplied by a sine window and transformed with an MDCT, whose size is chosen from the FFT result by the block-switching logic. M/S stereo coding is then applied to the MDCT data to obtain additional coding gain, and the bit-allocation results are used to quantize the M/S data. Finally, Huffman coding is applied to exploit the probability distribution of the quantized values.
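Both paths start by multiplying the input block by a sine window and taking a lapped transform. As a reference point for the transforms used throughout, below is a minimal sine-window MDCT/IMDCT pair using the usual phase term n0 = (N/2 + 1)/2 for symmetric windows (Section 3.1 adjusts this term for transition windows). This is a plain direct-form sketch written for clarity, not our optimized implementation; the function names are ours.

```python
import numpy as np

def sine_window(N):
    """Sine window satisfying the perfect-reconstruction condition for 50% overlap."""
    n = np.arange(N)
    return np.sin(np.pi * (n + 0.5) / N)

def mdct(x):
    """Direct-form MDCT of a windowed block of N samples -> N/2 lines."""
    N = len(x)
    n0 = (N / 2 + 1) / 2
    n = np.arange(N)
    k = np.arange(N // 2)
    basis = np.cos(2 * np.pi / N * np.outer(n + n0, k + 0.5))
    return (2.0 / N) * (x @ basis)

def imdct(X):
    """Inverse MDCT of N/2 lines -> N aliased time samples (aliases resolved by overlap-add)."""
    N = 2 * len(X)
    n0 = (N / 2 + 1) / 2
    n = np.arange(N)
    k = np.arange(N // 2)
    basis = np.cos(2 * np.pi / N * np.outer(k + 0.5, n + n0))
    return 2.0 * (X @ basis)

# Perfect-reconstruction check with two half-overlapping windowed blocks.
N = 2048
w = sine_window(N)
x = np.random.randn(3 * N // 2)
block0, block1 = x[:N], x[N // 2:]
y0 = w * imdct(mdct(w * block0))
y1 = w * imdct(mdct(w * block1))
middle = y0[N // 2:] + y1[:N // 2]           # overlap-add of the shared half
print(np.allclose(middle, x[N // 2:N]))      # expected: True
```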

[Figure 1: Codec block diagram (input data, sine window, FFT, block switching, MDCT, masking curve, M/S stereo coding, bit allocation, quantization, Huffman coding)]

3. Implementation

3.1 Block Switching

Block switching is an effective way to trade time resolution against frequency resolution, thereby reducing artifacts such as the pre-echo that precedes fast-attack transient sounds. We use the block-switching scheme developed by the Dolby AC-2A team [Bosi and Davidson 92]. Long blocks have length 2048 and short blocks have length 256. Block switching consists of two parts:

1) Transient detection: We use the transient detection algorithm of the Dolby AC-3 encoder. In a nutshell, a high-pass-filtered version of each full-bandwidth channel is examined for a rapid surge in energy, which denotes a transient.

Subsequently, if the onset of a transient is detected in the second half of a long block in a given channel, that channel switches to a short block. The transient detector takes a block of 2048 samples as input and processes it in two steps, each operating on 1024 samples. Its output is a one-bit flag per full-bandwidth channel which, when set to one, indicates the presence of a transient in the second half of the corresponding 1024-point half-block. Each step works in four stages: high-pass filtering, segmentation of the time samples, peak-amplitude detection for each segment, and comparison of the peak values against thresholds set to trigger only on significant changes in amplitude. The high-pass filter is an IIR filter with a cutoff frequency of 8 kHz. The high-passed block of 1024 samples is then decomposed into a hierarchical tree of segments, the shortest of which is 256 samples. The sample with the largest magnitude is identified in each segment and compared against thresholds to decide whether there is a significant change in level within the current block. First, the overall peak is compared to a silence threshold; if the overall peak is below it, the block is treated as steady state and a long block is used. If the ratio of peak values for adjacent segments exceeds a predefined threshold, the flag is set to indicate a transient in the current 1024-point half-block. The second step repeats the same stages on the second 1024-point half-block and determines whether a transient is present in the second half of the input block.

2) Frequency mapping: Once a transient is detected, transition windows come into play. The transition window from a long block to a short block has the left half of a long window on its left side and the right half of a short window on its right side; the transition window from a short block back to a long block is the time reverse of the long-to-short window. Time-domain alias cancellation is preserved by changing the kernel of the MDCT. For transition blocks, the MDCT is of size 0.5(Nlong + Nshort), with phase term n0 = b/2 + 1/2, where b is the length of the right half of the window.
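The sketch below illustrates the detection step for one 1024-sample half-block. It is a simplified, hypothetical rendering, not the exact AC-3 code: a first-order IIR difference stands in for the 8 kHz high-pass filter, the segmentation is flattened to two levels, and the silence and peak-ratio thresholds are placeholder values.

```python
import numpy as np

def detect_transient(half_block, silence_threshold=1e-3, ratio_threshold=8.0):
    """Return True if a transient is detected in a 1024-sample half-block.

    Stages (per Section 3.1): high-pass filter, hierarchical segmentation,
    per-segment peak detection, and peak-ratio comparison against a threshold.
    Thresholds here are illustrative placeholders.
    """
    # Stage 1: crude first-order IIR high-pass (stand-in for the 8 kHz filter).
    hp = np.empty(len(half_block), dtype=float)
    hp[0] = half_block[0]
    for n in range(1, len(half_block)):
        hp[n] = 0.5 * (half_block[n] - half_block[n - 1]) + 0.5 * hp[n - 1]

    # Silence check: if the overall peak is tiny, treat the block as steady state.
    if np.max(np.abs(hp)) < silence_threshold:
        return False

    # Stages 2-4: split into progressively shorter segments (down to 256 samples),
    # find the peak of each segment, and flag a transient if a segment's peak
    # jumps by more than ratio_threshold relative to the previous segment.
    for num_segments in (2, 4):                      # 512- and 256-sample segments
        segments = np.array_split(hp, num_segments)
        peaks = [np.max(np.abs(seg)) for seg in segments]
        for prev, cur in zip(peaks[:-1], peaks[1:]):
            if prev > 0 and cur / prev > ratio_threshold:
                return True
    return False

# Example: a quiet passage followed by a sharp attack triggers the detector.
x = np.concatenate([0.01 * np.random.randn(768), 0.8 * np.random.randn(256)])
print(detect_transient(x))   # expected: True
```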

3.2 Huffman Coding

Huffman coding is an entropy-coding algorithm used for lossless data compression. It encodes the input with a variable-length code table that is derived from the estimated probability of occurrence of each possible input value.

In this project we apply Huffman coding to the quantized MDCT lines. We first estimate the probabilities of value sequences from a variety of training material (speech, harpsichord, piano, castanet, drum, flute, etc.) and build code tables that are looked up when coding new sound files. Computing the probabilities of a new file in real time would require heavy computation and a lot of time, so we do it beforehand. We trained tables for mantissa sizes of 2 through 8 bits; when encoding, we take the number of mantissa bits allocated to each band, map each mantissa through the corresponding table in the codebook, and write the resulting codes to the file. At decoding time we look up each code in the codebook and map it back to the original value.

The details of the Huffman coding are as follows. First we train on the mantissas produced by the floating-point quantizer to obtain the probability of each value. Because Huffman coding works best for distributions whose probabilities are close to powers of 0.5 (0.5, 0.25, 0.125, ...), we compute the probabilities of groups of two or four consecutive MDCT mantissas (if only two mantissa bits are allocated, a single bit remains after the sign bit, so we group four consecutive values; for other mantissa sizes we group two). From these probabilities we build a Huffman tree and obtain a dictionary mapping each possible group to a code. We store the results for different kinds of music and different mantissa sizes in a codebook file, so that at encode time we simply look up the codebook and pick the best table for the input data.
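As an illustration of the training step, the following minimal sketch builds a Huffman table from pairs of consecutive quantized mantissas and encodes a sequence with it. It is a self-contained example using Python's heapq, not the codebook format of our coder, and the sample data is invented.

```python
import heapq
from collections import Counter
from itertools import count

def build_huffman_table(symbols):
    """Build a {symbol: bitstring} table from a list of training symbols."""
    freq = Counter(symbols)
    if len(freq) == 1:                       # degenerate case: a single symbol
        return {next(iter(freq)): "0"}
    tie = count()                            # tie-breaker so heapq never compares dicts
    heap = [(f, next(tie), {sym: ""}) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, t0 = heapq.heappop(heap)
        f1, _, t1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t0.items()}
        merged.update({s: "1" + c for s, c in t1.items()})
        heapq.heappush(heap, (f0 + f1, next(tie), merged))
    return heap[0][2]

def encode(symbols, table):
    return "".join(table[s] for s in symbols)

# Train on pairs of consecutive quantized mantissas (invented sample data),
# mirroring the grouping of two consecutive MDCT values described above.
mantissas = [0, 0, 1, 0, 0, 0, 2, 1, 0, 0, 1, 0, 0, 0, 0, 1]
pairs = list(zip(mantissas[0::2], mantissas[1::2]))
table = build_huffman_table(pairs)
print(table)                 # e.g. {(0, 0): '0', (1, 0): '10', ...}
print(encode(pairs, table))  # bitstring to be written to the file
```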

3.3 M/S Stereo Coding

3.3.1 Stereo coding

In general, jointly coding the left and right channels provides a higher coding gain. However, the data rate needed for transparent stereo coding may exceed twice the rate needed to transparently code one mono signal: artifacts that are masked in single-channel coding may become audible when the stereo signal is encoded as dual mono. To avoid this "cocktail party effect", we obtain a new binaural masking curve by accounting for the binaural masking level difference (BMLD), which is the difference between the masked threshold measured when the signal is presented over a single channel and the masked threshold under binaural presentation.

3.3.2 M/S stereo coding

In this project we mainly use M/S stereo coding to remove redundancy. Instead of transmitting the left and right signals separately, the normalized sum and difference signals are transmitted. The definition is shown below, where L and R are the filter-bank spectral line amplitudes:

M = (L + R)/2, S = (L - R)/2

The coding gain achieved by M/S stereo coding is signal dependent; the maximum gain is reached when the left and right signals are equal or phase-shifted by pi. At the decoder, if M/S stereo coding was used, the left and right signals are reconstructed as

L = M + S, R = M - S

3.3.3 M/S stereo coding decision

M/S stereo coding is applied in our codec only when

Σ_k (L_k - R_k)² < 0.8 · Σ_k (L_k + R_k)²

where L_k and R_k are the FFT spectral line amplitudes computed in the psychoacoustic model. If this condition is met, M/S is transmitted; otherwise L/R is transmitted. The condition allows M/S transmission when the side energy is below a fraction (in this case, 80%) of the mid energy.
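A compact sketch of the M/S transform and the decision rule is given below. It implements the energy criterion as reconstructed above, with the threshold of 0.8 following the 80% figure of Section 3.3.3; the function names and test data are ours.

```python
import numpy as np

def ms_decision(fft_left, fft_right, threshold=0.8):
    """Decide whether to code the block as M/S (True) or L/R (False).

    Criterion of Section 3.3.3: the side energy must be below `threshold`
    times the mid energy, computed from FFT line amplitudes.
    """
    l = np.abs(fft_left)
    r = np.abs(fft_right)
    return np.sum((l - r) ** 2) < threshold * np.sum((l + r) ** 2)

def ms_encode(mdct_left, mdct_right):
    """M = (L + R)/2, S = (L - R)/2 applied to the MDCT lines."""
    return (mdct_left + mdct_right) / 2.0, (mdct_left - mdct_right) / 2.0

def ms_decode(mid, side):
    """Inverse transform: L = M + S, R = M - S."""
    return mid + side, mid - side

# Round trip on random MDCT data (for illustration only).
L = np.random.randn(512)
R = 0.9 * L + 0.1 * np.random.randn(512)           # highly correlated channels
M, S = ms_encode(L, R)
L2, R2 = ms_decode(M, S)
assert np.allclose(L, L2) and np.allclose(R, R2)   # perfect reconstruction
print("side/mid energy ratio:", np.sum(S ** 2) / np.sum(M ** 2))
```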

3.3.4 Masking in Stereo

The masking thresholds for M and S need to be calculated, which is a step-wise process. First, the masking calculation of the previous section is applied to each M and S frequency line in exactly the same way as for L and R, giving the basic masking thresholds BTHRm and BTHRs. To capture the stereo masking contributions of the M and S channels, an additional masking level difference factor, MLD(z), is computed at each frequency line and multiplied by each of the M and S masking thresholds to obtain the masking level differences MLDm and MLDs. The MLD provides a second level of detectability of noise in the M and S channels based on the masking level differences between them: essentially, it measures how detectable a signal masked in the M channel is in the S channel, and vice versa. The MLD factor is calculated as

MLD(z) = 10^(1.25·(1 - cos(π·min(z, 15.5)/15.5)) - 2.5)

where z is the frequency in Barks. The MLD contributions are then

MLDm = BTHRm · MLD(z), MLDs = BTHRs · MLD(z)

and the actual thresholds for M and S are

THRm = max(BTHRm, min(BTHRs, MLDs)), THRs = max(BTHRs, min(BTHRm, MLDm))

The MLD term essentially substitutes for the BTHR term in cases where there is a chance of stereo unmasking.

3.3.5 Bit Allocation

The bit allocation structure is almost the same as in the basic coder. The only difference is that, when M/S stereo coding is used, the M channel and the S channel share a common bit pool whose size is twice the original one, and the water-filling algorithm is applied to the frequency lines of both channels at the same time, based on the two SMRs.
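The threshold computation above can be written compactly as follows. This is a sketch under our reconstruction of the formulas (standard forms from the cited stereo-coding literature); BTHRm and BTHRs would come from the psychoacoustic model, and here they are random placeholders.

```python
import numpy as np

def mld_factor(z):
    """Masking level difference factor as a function of Bark frequency z."""
    z = np.minimum(z, 15.5)
    return 10.0 ** (1.25 * (1.0 - np.cos(np.pi * z / 15.5)) - 2.5)

def stereo_thresholds(bthr_m, bthr_s, z):
    """Combine basic M/S thresholds with the MLD to guard against stereo unmasking."""
    mld = mld_factor(z)
    mld_m = bthr_m * mld
    mld_s = bthr_s * mld
    thr_m = np.maximum(bthr_m, np.minimum(bthr_s, mld_s))
    thr_s = np.maximum(bthr_s, np.minimum(bthr_m, mld_m))
    return thr_m, thr_s

# Placeholder inputs: 25 critical-band thresholds and their Bark centers.
z = np.linspace(0.5, 24.5, 25)
bthr_m = np.random.uniform(1e-6, 1e-3, size=25)
bthr_s = np.random.uniform(1e-6, 1e-3, size=25)
thr_m, thr_s = stereo_thresholds(bthr_m, bthr_s, z)
print(thr_m[:3], thr_s[:3])
```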

4. Results

Sound Type              Bitrate (kb/s/ch)   SDG (-4 to 0)
Flute                   82                  -0.1
Castanet                116                 -0.5
Piano                   94                   0
French Female Speech    102                  0
Harpsichord             83.2                -0.2
Classical               94                   0
Pop                     114.2857             0
Rock                    110.53              -0.2

Sound Type              Bitrate (kb/s/ch)   SDG (-4 to 0)
Flute                   57                  -0.5
Castanet                92                  -1.5
Piano                   80                   0
French Female Speech    89                  -0.2
Harpsichord             65.6                -1
Classical               79                  -0.2
Pop                     99.685              -0.5
Rock                    97.34               -0.8

5. Conclusions and Future Work

The results show a good compression rate together with good performance, which is quite satisfactory. Still, we have several ideas for future improvement.

For block switching: the AC-3 block-switching algorithm is a suboptimal solution aimed at ease of implementation; it has discontinuities in the transform and zero overlap at certain points. We would like to implement a more thorough model such as the AC-2A solution [Bosi and Davidson 92], which solves these issues. Also, the block-switching path currently bypasses Huffman coding, since the tables are not optimal for shorter blocks. Eventually we would like a solution in which block switching, Huffman coding, and M/S stereo coding work together for all block types.

For Huffman coding: we can build tables for mantissa sizes larger than 8 bits, because with M/S stereo coding the M and S channels share the same bit pool; bits saved on the S channel can be given to the M channel, so the M channel can receive more bits. For sound files whose energy is concentrated in frequency (instruments such as piano and flute), this results in larger mantissa sizes.

For stereo coding: we would like to introduce intensity stereo coding at low bit rates to save bits in the higher frequency bands.

6. References

[Bosi and Goldberg (2003)] M. Bosi and R. E. Goldberg. Introduction to Digital Audio Coding and Standards. Kluwer, 2003.

[Liu et al. (2003)] C. M. Liu, W. C. Lee, and Y. H. Hsiao. M/S coding based on allocation entropy. In Proc. of DAFx-03, pages 1-4, London, UK, September 8-11, 2003.

[Song et al. (2008)] Y. Song, J. Nam, and D. Yeh. Perceptual Audio Coder with Entropy Encoding and Joint Stereo.

[Wang et al. (2005)] R. Wang, H. Nyikal, and J. Yu. Stereo coding for audio compression, March 7, 2005. URL http://www.scribd.com/doc/266577/stereo-coding-for-audio-Compression.

[Wikipedia (2009)] Wikipedia. Huffman coding. URL http://en.wikipedia.org/wiki/huffman_coding. [Online; accessed 13-March-2009].

[Johnston and Ferreira (1992)] J. D. Johnston and A. J. Ferreira. Sum-difference stereo transform coding. In Proc. ICASSP-92, pages 569-572, San Francisco, CA, March 1992.

Introduction to Dolby Digital Plus, an Enhancement to the Dolby Digital Coding System. Audio Engineering Society Convention Paper 6196.

Digital Audio Compression Standard (AC-3, E-AC-3), Revision B.