Audio Engineering Society Convention Paper

Presented at the 122nd Convention, 2007 May 5-8, Vienna, Austria

The papers at this Convention have been selected on the basis of a submitted abstract and extended precis that have been peer reviewed by at least two qualified anonymous reviewers. This convention paper has been reproduced from the author's advance manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.

A Biologically-Inspired Low-Bit-Rate Universal Audio Coder

Ramin Pichevar, Hossein Najaf-Zadeh, Louis Thibault
Advanced Audio Systems, Communications Research Centre, Ottawa, Canada
Correspondence should be addressed to Ramin Pichevar (ramin.pishehvar@crc.ca)

ABSTRACT

We propose a new biologically-inspired paradigm for universal audio coding based on neural spikes. Our approach generates sparse 2-D representations of audio signals, dubbed spikegrams, by projecting the signal onto a set of overcomplete adaptive gammachirp kernels (gammatones with additional tuning parameters). A masking model is applied to the spikegrams to remove inaudible spikes and to increase the coding efficiency. The paradigm proposed in this paper is a first step towards the implementation of a high-quality audio encoder obtained by further processing the acoustical events in the spikegrams. Upon necessary optimization and fine-tuning, our coding system, operating at 1 bit/sample for sound sampled at 44.1 kHz, is expected to deliver high-quality audio for broadcast applications and for other applications such as archiving and audio recording.

1. INTRODUCTION

Non-stationary and time-relative structures such as transients, timing relations among acoustic events, and harmonic periodicities provide important cues for many types of audio processing (e.g., audio coding). Obtaining these cues is difficult, chiefly because most approaches to signal representation and analysis are block-based, i.e., the signal is processed piecewise in a series of discrete blocks. Transients and non-stationary periodicities in the signal can be temporally smeared across blocks, and large changes in the representation of an acoustic event can occur depending on the arbitrary alignment of the processing blocks with events in the signal. Signal analysis techniques such as windowing or the choice of the transform can reduce these effects, but it would be preferable if the representation were insensitive to signal shifts. Shift-invariance alone, however, is not a sufficient constraint for designing a general sound processing algorithm. Another important feature is coding efficiency, that is, the ability of the representation to reduce the information redundancy of the raw time-domain signal. A desirable representation should capture the underlying 2-D time-frequency structures, so that they are more directly observable and well represented at low bit rates [11].

The aim of this article is to propose a shift-invariant representation that extracts acoustic events without smearing them, while providing coding efficiency. We then show how this representation can be applied to audio coding by using adequate information coding and masking strategies, and we compare it with similar techniques. In the remainder of this section we give a brief survey of different coding schemes to justify the choices made in our proposed approach.

1.1. Block-Based Coding

Most of the signal representations used in speech and audio coding are block-based (e.g., DCT, MDCT, FFT). In a block-based coding scheme, the signal is processed piecewise in a series of discrete blocks, which temporally smears transients and non-stationary periodicities. Moreover, large changes in the representation of an acoustic event can occur depending on the arbitrary alignment of the processing blocks with events in the signal. Windowing and the choice of the transform can reduce these effects, but it would be preferable if the representation were insensitive to signal shifts.

1.2. Filterbank-Based Shift-Invariant Coding

In the filterbank paradigm, the signal is continuously applied to the filters of the filterbank and its convolution with their impulse responses is computed, so the outputs of these filters are shift-invariant. This representation does not have the drawbacks of block-based coding mentioned above, such as time variance. However, filterbank analysis alone is still not sufficient for designing a general sound processing algorithm: it does not address coding efficiency or, equivalently, the ability of the representation to capture underlying structures in the signal. A desirable code/representation should reduce the information redundancy of the raw signal so that the underlying structures become more directly observable, whereas convolutional representations (i.e., filterbank outputs) increase the dimensionality of the input signal.

1.3. Overcomplete Shift-Invariant Representations

In an overcomplete basis, the number of basis vectors (kernels) is greater than the real dimensionality of the input (the number of non-zero eigenvalues in the covariance matrix of the signal). The approach consists of matching the best kernels to different acoustic cues using convergence criteria such as the residual energy. However, minimizing the energy of the residual (error) signal alone is not sufficient to obtain an overcomplete representation of an input signal; other constraints, such as sparseness, must be imposed in order to have a unique solution. Overcomplete representations have been advocated because they are more robust in the presence of noise, and because they maximize information transfer when different regions/objects of the underlying signal have strong correlations [4]. In other words, the peakiness of the coefficient values can be exploited efficiently in entropy coding. To find the best matching kernels, matching pursuit is used.
1.3.1. Generating Overcomplete Representations with Matching Pursuit (MP)

In mathematical notation, the signal x(t) can be decomposed into the overcomplete kernels as follows:

  x(t) = \sum_{m=1}^{M} \sum_{i=1}^{n_m} a_i^m \, g_m(t - \tau_i^m) + \epsilon(t)    (1)

where \tau_i^m and a_i^m are the temporal position and amplitude of the ith instance of the kernel g_m, respectively. The notation n_m indicates the number of instances of g_m, which need not be the same across kernels. In addition, the kernels are not restricted in form or length. To find adequate \tau_i^m, a_i^m, and g_m, matching pursuit can be used. In this technique the signal x(t) is decomposed over a set of kernels so as to capture the structure of the signal, by iteratively approximating the input with successive orthogonal projections onto some basis:

  x(t) = \langle x(t), g_m \rangle g_m + R_x(t)    (2)

where \langle x(t), g_m \rangle is the inner product between the signal and the kernel (equivalent to a_i^m in Eq. (1)) and R_x(t) is the residual signal. It can be shown [3] that the computational load of matching pursuit can be reduced if one saves the values of all correlations in memory, or finds an analytical formulation for the correlation between specific kernels.

2. A NEW PARADIGM FOR AUDIO CODING

2.1. Generation of the Spike-Based Representation

We propose an auditory sparse and overcomplete representation suitable for audio compression. In this paradigm the signal is decomposed into its constituent parts (kernels) by a matching pursuit algorithm. We use gammatone/gammachirp filterbanks as the projection basis, as proposed in [11][10]. The advantage of asymmetric kernels such as gammatone/gammachirp atoms is that they do not create pre-echoes at onsets [3]. Very asymmetric kernels such as damped sinusoids [3], however, cannot suitably model harmonic signals, whereas gammatone/gammachirp kernels have additional parameters that control their attack and decay parts (degree of symmetry), which our technique modifies according to the nature of the signal.

As described above, the approach is iterative. We compare two variants of the technique. The first, non-adaptive variant is roughly similar to the general approach used in [10], which we applied to the specific task of audio coding. The second, adaptive variant is novel; it takes advantage of the additional parameters of the gammachirp kernels and of the inherent nonlinearity of the auditory pathway [6][7]. Some details on each variant are given below.

2.1.1. Non-Adaptive Paradigm

In the non-adaptive paradigm, only gammatone filters are used. The impulse response of a gammatone filter is

  g(f_c, t) = t^3 e^{-2\pi b t} \cos(2\pi f_c t),   t > 0,    (3)

where f_c is the center frequency of the filter, distributed on the ERB (Equivalent Rectangular Bandwidth) scale. At each step (iteration), the signal is projected onto the gammatone kernels (with different center frequencies and different time delays). The center frequency and time delay that give the maximum projection are chosen, and a spike with the value of the projection is added to the auditory representation at the corresponding center frequency and time delay (see Fig. 2). The residual signal R_x(t) decreases at each step.

2.1.2. Adaptive Paradigm

In the adaptive paradigm, gammachirp filters are used. The impulse response of a gammachirp filter with tuning parameters (b, l, c) is

  g(f_c, t, b, l, c) = t^{l-1} e^{-2\pi b t} \cos(2\pi f_c t + c \ln t),   t > 0.    (4)

It has been shown that gammachirp filters minimize the scale/time uncertainty [6].
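To make the kernels and the pursuit loop concrete, the following Python/NumPy sketch implements Eqs. (1)-(4). It is an illustration rather than the authors' implementation: the kernel duration, the 24 log-spaced center frequencies, and the linear ERB-style bandwidth b = 24.7 + 0.108 f_c are assumptions made here for the example.

```python
import numpy as np

def gammatone(fc, b, fs, duration=0.05):
    """Gammatone kernel of Eq. (3): t^3 exp(-2 pi b t) cos(2 pi fc t), t > 0."""
    t = np.arange(1, int(duration * fs) + 1) / fs
    g = t ** 3 * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.linalg.norm(g)      # unit norm: the projection is then the amplitude

def gammachirp(fc, b, l, c, fs, duration=0.05):
    """Gammachirp kernel of Eq. (4): t^(l-1) exp(-2 pi b t) cos(2 pi fc t + c ln t)."""
    t = np.arange(1, int(duration * fs) + 1) / fs
    g = t ** (l - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + c * np.log(t))
    return g / np.linalg.norm(g)

def matching_pursuit(x, kernels, n_spikes):
    """Greedy pursuit of Eqs. (1)-(2): pick the kernel/time pair with the largest
    correlation, record it as a spike, subtract its projection, iterate."""
    residual = np.asarray(x, dtype=float).copy()
    spikes = []
    for _ in range(n_spikes):
        best_m, best_tau, best_a = 0, 0, 0.0
        for m, g in enumerate(kernels):
            corr = np.correlate(residual, g, mode="valid")  # <R_x, g_m(t - tau)> for all tau
            tau = int(np.argmax(np.abs(corr)))
            if abs(corr[tau]) > abs(best_a):
                best_m, best_tau, best_a = m, tau, corr[tau]
        spikes.append((best_m, best_tau, best_a))           # (channel, time, amplitude)
        g = kernels[best_m]
        residual[best_tau:best_tau + len(g)] -= best_a * g  # residual update of Eq. (2)
    return spikes, residual

# Illustrative 24-channel gammatone dictionary (the center frequencies and the
# linear ERB approximation are assumptions of this sketch, not the paper's).
fs = 44100
fcs = np.geomspace(100.0, 10000.0, 24)
dictionary = [gammatone(fc, 24.7 + 0.108 * fc, fs) for fc in fcs]
```

Because each kernel is normalized to unit norm, the correlation at the selected lag is exactly the projection a_i^m of Eq. (1), so subtracting best_a * g realizes the residual update R_x(t).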
In this approach the chirp factor c, l, and b are found adaptively at each step. The chirp factor c allows us to slightly modify the instantaneous frequency of the kernels, while l and b control the attack and the decay of the kernels. Searching the three parameters jointly over the whole parameter space is computationally very expensive, so we use a suboptimal search [5]. In our suboptimal technique, we first use the same gammatone filters as in the non-adaptive paradigm, with the values of l and b given in [6]. This step gives us the center frequency and start time (t_0) of the best matching gammatone filter; we also keep the second-best gammatone kernel (center frequency) and its start time:

  G_{max1} = \arg\max_{f,t} \{ |\langle r, g(f, t, b, l, c) \rangle| \},   g \in G    (5)

  G_{max2} = \arg\max_{f,t} \{ |\langle r, g(f, t, b, l, c) \rangle| \},   g \in G \setminus G_{max1},    (6)

where G is the set of all kernels and G \setminus G_{max1} excludes G_{max1} from the search space. For the sake of simplicity, we write f instead of f_c in Eqs. (5) to (9). We then use the information found in the first step to find c: keeping only the two best kernels from step one, we search for the best chirp factor given G_{max1} and G_{max2},

  G_{maxc} = \arg\max_{c} \{ |\langle r, g(f, t, b, l, c) \rangle| \},   g \in \{G_{max1}, G_{max2}\}.    (7)

We then use the information found in the second step to find the best b,

  G_{maxb} = \arg\max_{b} \{ |\langle r, g(f, t, b, l, c) \rangle| \},   g \in G_{maxc},    (8)

and we finally find the best l among the G_{maxb} found in the previous step:

  G_{maxl} = \arg\max_{l} \{ |\langle r, g(f, t, b, l, c) \rangle| \},   g \in G_{maxb}.    (9)

Six parameters are therefore extracted per spike in the adaptive technique: the center frequency, the chirp factor c, the time delay, the spike amplitude, b, and l; the last two control the attack and decay slopes of the kernels. Although this second variant carries additional parameters, as shown later, the adaptive technique yields better coding gains: a much smaller number of filters (in the filterbank) and a smaller number of iterations are needed to achieve the same SNR, which roughly reflects the audio quality.
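The four-stage suboptimal search of Eqs. (5)-(9) can be sketched as follows, reusing gammachirp() from the previous listing. The search grids c_grid, b_grid, and l_grid, as well as the gammatone defaults b0 and l0, are hypothetical placeholders; the paper takes the first-stage l and b from [6].

```python
def adaptive_spike(residual, fcs, fs,
                   c_grid=(-3.0, -1.5, 0.0, 1.5, 3.0),   # hypothetical search grids
                   b_grid=(60.0, 120.0, 240.0),
                   l_grid=(2.0, 3.0, 4.0, 5.0),
                   b0=120.0, l0=4.0):
    """Sequential search of Eqs. (5)-(9): two best gammatones over (f, tau),
    then the chirp factor c, then the decay b, then the attack l."""
    def fit(fc, b, l, c):
        g = gammachirp(fc, b, l, c, fs)
        corr = np.correlate(residual, g, mode="valid")
        tau = int(np.argmax(np.abs(corr)))
        return corr[tau], tau                  # signed projection and its lag

    # Eqs. (5)-(6): G_max1 and G_max2, searched with gammatones (c = 0).
    top2 = sorted(fcs, key=lambda fc: -abs(fit(fc, b0, l0, 0.0)[0]))[:2]
    # Eq. (7): best chirp factor over the two retained kernels only.
    fc, c = max(((f, cc) for f in top2 for cc in c_grid),
                key=lambda p: abs(fit(p[0], b0, l0, p[1])[0]))
    # Eq. (8): best decay b given fc and c.
    b = max(b_grid, key=lambda bb: abs(fit(fc, bb, l0, c)[0]))
    # Eq. (9): best attack l given fc, c, and b.
    l = max(l_grid, key=lambda ll: abs(fit(fc, b, ll, c)[0]))
    a, tau = fit(fc, b, l, c)
    return fc, tau, a, b, l, c                 # the six parameters of one spike
```

With c = 0 and l0 = 4, the gammachirp reduces to the gammatone of Eq. (3), so the first stage is exactly the non-adaptive projection.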

2.2. Masking

We use gammachirp functions to decompose audio signals. To reduce the number of spikes, and in return increase the coding efficiency, we have developed a tentative masking model that removes inaudible spikes. Since there is little difference between the spectra of the gammachirp and gammatone functions, we used gammatone functions to develop the masking model.

For on-frequency temporal masking, i.e., the temporal masking effects within each critical band (channel), we calculate the temporal forward and backward masking as follows. First we calculate the absolute threshold of hearing in each critical band. Since the basis functions are short, the absolute threshold of hearing is elevated by 10 dB/decade when the duration of the basis function is less than 200 msec [14]:

  QT_k = AT_k + 10 \left( \log_{10}(0.2) - \log_{10}(d_k) \right)    (10)

where AT_k is the absolute threshold of hearing for critical band k, QT_k is the elevated threshold in quiet for the same critical band, and d_k is the effective duration of the kth basis function, defined as the time interval between the points on the temporal envelope of the gammatone function where the amplitude drops by 90%. The masker sensation level is given by

  SL_k(i) = 10 \log_{10} \left( \frac{a_k^2(i) A_k^2}{QT_k} \right)    (11)

where SL_k(i) is the sensation level of the ith spike in critical band k, a_k(i) is the amplitude of the ith spike in critical band k, and A_k is the peak value of the Fourier transform of the normalized gammatone function in critical band k.

We set the initial level of the masking pattern in critical band k to QT_k and consider three situations for the masking pattern caused by a spike. When a maskee starts within the effective duration of the masker, the masking threshold is given by

  M_k(n_i : n_i + L_k) = \max\left( M_k(n_i : n_i + L_k), \; SL_k(i) - 20 \right)    (12)

where M_k is the masking pattern (in dB) in critical band k, n_i is the start time index of the ith spike, and L_k is the effective length of the gammatone function in critical band k (its effective duration d_k multiplied by the sampling frequency). Since gammatone functions are tonal-like signals, we assume that the masking level caused by a spike is 20 dB below its sensation level. To avoid overmasking the spikes, we take the maximum of the masking threshold due to a spike and the threshold caused by the other spikes in the same critical band at any time instance.
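The on-frequency pattern of Eqs. (10)-(12) for a single critical band can be sketched as below. The units are an assumption: the pattern is kept in dB throughout and Eq. (11) is read as the spike level above the elevated threshold in quiet; the argument names (AT_dB, A_k, d_k) are ours.

```python
def on_frequency_pattern(spikes_k, AT_dB, A_k, d_k, fs, n_samples):
    """Masking pattern M_k of Eqs. (10)-(12) for one critical band k.
    spikes_k: list of (n_i, a_i) = (start index, amplitude) pairs in band k."""
    # Eq. (10): elevate the threshold in quiet by 10 dB/decade for kernels < 200 ms.
    QT_dB = AT_dB + (10.0 * (np.log10(0.2) - np.log10(d_k)) if d_k < 0.2 else 0.0)
    L_k = int(round(d_k * fs))              # effective kernel length in samples
    M_k = np.full(n_samples, QT_dB)         # initial pattern = threshold in quiet
    for n_i, a_i in spikes_k:
        SL = 10.0 * np.log10(a_i ** 2 * A_k ** 2) - QT_dB   # Eq. (11), in dB
        end = min(n_i + L_k, n_samples)
        # Eq. (12): a tonal-like masker masks 20 dB below its sensation level;
        # taking the max against other spikes' contributions avoids overmasking.
        M_k[n_i:end] = np.maximum(M_k[n_i:end], SL - 20.0)
    return M_k, QT_dB, L_k
```

One plausible reading of how inaudible spikes are removed is that a spike is discarded when its level falls below the pattern built from all the other spikes.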
We also investigated adding up (in the linear domain) the masking thresholds caused by all spikes in the same critical band at any time instance; that approach overmasks the spikes and results in audible distortion.

The other two situations are when a maskee starts after the effective duration of the masker (i.e., forward masking) and when a maskee starts before the masker (i.e., backward masking). For forward and backward masking, we assume a linear relation between the masking threshold (in dB) and the logarithm of the time delay between the masker and the maskee in msec [8].

Since the effective duration of forward masking depends on the masker duration [13], we define an effective duration for forward masking in critical band k as

  Fd_k = 10 \arctan(d_k).    (13)

The forward masking threshold is given by

  FM_i(n) = (SL(i) - 20) \, \frac{\log_{10}\!\left( \frac{n}{n_i + L_k + FL_k} \right)}{\log_{10}\!\left( \frac{n_i + L_k + 1}{n_i + L_k + FL_k} \right)},   n_i + L_k + 1 \le n \le n_i + L_k + FL_k,    (14)

where

  FL_k = \mathrm{round}(Fd_k \, f_s)    (15)

and f_s denotes the sampling frequency; the index i denotes the spike and k the channel number. This forward masking contributes to the global masking pattern in critical band k as follows:

  M_k(n_i + L_k + 1 : n_i + L_k + FL_k) = \max\left( M_k(n_i + L_k + 1 : n_i + L_k + FL_k), \; FM_i \right).    (16)

For backward masking, we assume an effective masking duration of 5 msec for all critical bands, regardless of the effective duration of the gammatone functions. Hence, the backward masking threshold is given by

  BM_i(n) = (SL(i) - 20) \, \frac{\log_{10}\!\left( \frac{n}{n_i - 0.005 f_s} \right)}{\log_{10}\!\left( \frac{n_i - 1}{n_i - 0.005 f_s} \right)}.    (17)

Similar to the forward masking effect, the backward masking affects the global masking pattern in critical band k as follows:

  M_k(n_i - 0.005 f_s : n_i - 1) = \max\left( M_k(n_i - 0.005 f_s : n_i - 1), \; BM_i \right).    (18)

For off-frequency masking effects (the masking effect of a masker on a maskee that lies in a different channel), we considered the masking caused by any spike in the two adjacent critical bands. According to [12], a single masker produces an asymmetric linear masking pattern in the Bark domain, with a slope of -27 dB/Bark on the lower-frequency side and a level-dependent slope on the upper-frequency side, given by

  s_u = -24 - \frac{230}{f} + 0.2 L    (19)

where f = f_c is the masker frequency (the gammatone center frequency in our work) in Hertz and L is the masker level in dB. We used this approach to calculate the masking effects caused by each spike in its two immediately neighboring critical bands. The effect was insignificant, which indicates the need for a more effective off-frequency masking model in spike coding. Note that the masking models used in most audio coding systems do not perform well in spike coding systems. The reason may be that in this coding paradigm spikes are well localized in both time and frequency, so removing any audible spike produces musical noise that cannot be tolerated in high-quality audio coding.
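The forward and backward contributions of Eqs. (13)-(18) can be added to the same band pattern as sketched below, again in dB; boundary cases (spikes too close to the signal edges, degenerate window lengths) are only minimally guarded, as befits a sketch.

```python
def add_temporal_masking(M_k, n_i, L_k, SL, d_k, fs):
    """Add one spike's forward (Eqs. 13-16) and backward (Eqs. 17-18) masking
    to the band-k pattern M_k (in dB), in place."""
    # Eqs. (13) and (15): effective forward-masking duration and length in samples.
    FL_k = int(round(10.0 * np.arctan(d_k) * fs))
    start, stop = n_i + L_k + 1, min(n_i + L_k + FL_k, len(M_k) - 1)
    if stop > start:
        n = np.arange(start, stop + 1)
        # Eq. (14): decays log-linearly from SL - 20 dB down to 0 dB over FL_k samples.
        FM = (SL - 20.0) * (np.log10(n / (n_i + L_k + FL_k))
                            / np.log10((n_i + L_k + 1) / (n_i + L_k + FL_k)))
        M_k[n] = np.maximum(M_k[n], FM)                     # Eq. (16)
    # Eqs. (17)-(18): backward masking over a fixed 5 ms window before the spike.
    BL = int(round(0.005 * fs))
    if n_i - BL >= 1 and BL > 1:
        m = np.arange(n_i - BL, n_i)                        # n_i - 0.005 fs ... n_i - 1
        BM = (SL - 20.0) * (np.log10(m / (n_i - BL))
                            / np.log10((n_i - 1) / (n_i - BL)))
        M_k[m] = np.maximum(M_k[m], BM)
    return M_k
```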

2.3. Coding

We pointed out earlier that sparse codes generate peaky histograms that are well suited to entropy coding. We therefore use arithmetic coding to allocate bits to these quantities, and time-differential coding to further reduce the bit rate. More robust and efficient differential coding schemes, such as coding over a Minimum Spanning Tree (MST), are under investigation; preliminary results give a 5% bit-rate gain over the simple time-differential coding used in this article.

3. GENERATION OF SPIKEGRAMS FOR DIFFERENT SOUND CLASSES

We tested the algorithm on four different sounds: percussion, speech, castanet, and white noise. The next few sections give the results obtained on each of them.

3.1. Coding of Percussion

In this experiment, we code the percussion signal shown in Fig. 1 using the two variants described in the previous section: the adaptive and non-adaptive approaches.

[Fig. 1: Samples of a percussion sound (amplitude versus discrete time).]

3.1.1. Non-Adaptive Scheme

The matching pursuit is run for 3000 iterations to generate 3000 spikes. Fig. 2 shows the spikegram generated by the non-adaptive method. As we can see, the onsets and offsets of the percussion are detected clearly by the algorithm. There are 3000 spikes in the code (for the 8000 samples of the original sound file) before temporal masking is applied.

[Fig. 2: Spikegram of the percussion signal using the gammatone matching pursuit algorithm (spike amplitudes are not represented). Each dot represents the time and the channel where the spike fired (extracted by MP). No spike is extracted between channels 21 and 24.]

We then applied the masking technique detailed in Section 2.2, which reduces the number of spikes after temporal masking to 2937; the spike coding gain in this case is 0.37N, where N is the number of samples in the original signal. Two parameters are important for each spike: its position (spiking time) and its amplitude, which can be seen as the strength of the synapse connecting one neuron to another. For now, we use lossless compression to encode these two parameters. We first extracted the histogram of the amplitude values, which is very peaked, so arithmetic coding is used to compress these values. For the spike timing, a differential paradigm is used: the times are first sorted in increasing order, and only the time elapsed since the previous sorted spike is stored. This trick reduces the dynamic range of the spike timings and makes it possible to perform arithmetic coding on the timing information as well; we also used arithmetic coding to compress the center frequencies. In total, we used 13533 bits to code the spike amplitudes, 5193 bits for the timing information, and 4544 bits for the center frequencies, giving a total bit rate of 2.9 bits/sample.

3.1.2. Adaptive Scheme

In this scheme the gammachirp filters are used as described in the previous section. Fig. 3 shows the decrease of the residual error over the iterations for the adaptive and non-adaptive approaches, and Table 1 compares the two schemes. The number of spikes needed by the non-adaptive scheme before masking, for the same residual energy, is 44% more than for the adaptive scheme. Note that the spike gain is 0.12N.

3.2. Coding of Speech

The same two techniques are applied to speech coding. The speech signal used is the utterance "I'll willingly marry Marylin."

3.2.1. Non-Adaptive Scheme

The spikegram contains 5600 spikes before temporal masking, reduced to 3528 after masking; the spike coding gain is therefore 0.44N (N is the signal length). We used arithmetic coding to compress the spike amplitudes and the differential timing (time elapsed between consecutive spikes). Results are given in Table 2. The overall coding rate is 3.07 bits/sample.
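The time-differential coding of Section 2.3 can be sketched with a zeroth-order entropy estimate standing in for the arithmetic coder (a bound the coder can approach on such peaky histograms). The 64-bin amplitude quantizer is an assumption; the paper does not specify its quantization.

```python
def entropy_bits(symbols):
    """Total zeroth-order entropy of a symbol stream, in bits."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(counts.sum() * -(p * np.log2(p)).sum())

def spikegram_bit_rate(spikes, n_samples, amp_bins=64):
    """Rate estimate for (channel, time, amplitude) spikes: sort the times and
    keep only the gaps (time-differential coding), quantize the amplitudes,
    then sum the entropies of the three symbol streams."""
    chans, times, amps = (np.asarray(v) for v in zip(*spikes))
    order = np.argsort(times)
    gaps = np.diff(times[order], prepend=0)     # small dynamic range, peaky histogram
    edges = np.linspace(amps.min(), amps.max(), amp_bins)
    q_amps = np.digitize(amps, edges)
    bits = entropy_bits(gaps) + entropy_bits(q_amps) + entropy_bits(chans)
    return bits / n_samples                     # bits per sample

# e.g. spikegram_bit_rate(matching_pursuit(x, dictionary, 3000)[0], len(x))
```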

                                Adaptive (24 channels)   Non-Adaptive (24 channels)
  Spikes before masking         1000                     3000
  Spikes after masking          943                      2937
  Spike gain                    0.12N                    0.37N
  Bits for channel coding       362                      4544
  Bits for amplitude coding     3743                     13535
  Bits for time coding          325                      5193
  Bits for chirp factor coding  994                      -
  Bits for coding b             2135                     -
  Bits for coding l             255                      -
  Total bits                    1559                     23272
  Bit rate (bits/sample)        1.93                     2.9

Table 1: Comparative results for the coding of percussion (8000 samples) at high quality (scores above 4 on the ITU-R 5-grade impairment scale in informal listening tests) for the adaptive and non-adaptive schemes.

                                Adaptive (24 channels)   Non-Adaptive (24 channels)
  Spikes before masking         1200                     5600
  Spikes after masking          1492                     3528
  Spike gain                    0.13N                    0.44N
  Bits for channel coding       496                      118536
  Bits for amplitude coding     35432                    6748
  Bits for time coding          419                      6376
  Bits for chirp factor coding  9836                     -
  Bits for coding b             1526                     -
  Bits for coding l             16                       -
  Total bits                    157676                   24596
  Bit rate (bits/sample)        1.98                     3.07

Table 2: Comparative results for the coding of speech (8000 samples) at high quality (scores above 4 on the ITU-R 5-grade impairment scale in informal listening tests) for the adaptive and non-adaptive schemes.

                                Adaptive (24 channels)   Non-Adaptive (24 channels)
  Spikes before masking         700                      3000
  Spikes after masking          651                      2458
  Spike gain                    0.08N                    0.3N
  Bits for channel coding       2293                     8500
  Bits for amplitude coding     3945                     8381
  Bits for time coding          3300                     7354
  Bits for chirp factor coding  778                      -
  Bits for coding b             1390                     -
  Bits for coding l             651                      -
  Total bits                    12357                    24235
  Bit rate (bits/sample)        1.54                     3.03

Table 3: Comparative results for the coding of castanet (8000 samples) at high quality (scores above 4 on the ITU-R 5-grade impairment scale in informal listening tests) for the adaptive and non-adaptive schemes.

[Fig. 3: Comparison of the adaptive and non-adaptive spike coding schemes for percussion (residual norm versus iterations/spikes). In this figure, three parameters (the chirp factor c, b, and l) are adapted.]

3.2.2. Adaptive Scheme

Figs. 4 and 5 show that, in the case of speech, the adaptive scheme can drastically reduce both the number of spikes and the number of cochlear channels (filterbank channels). To achieve the same quality, we need 1200 spikes (compared to 5600 spikes in the non-adaptive case). The number of spikes after masking is 1492, and the spike coding gain is 0.13N (versus 0.44N in the non-adaptive case). Results are given in Table 2. The overall required bit rate is 1.98 bits/sample in this case (roughly 35 percent lower than in the non-adaptive case).

[Fig. 4: Comparison of the adaptive and non-adaptive spike coding schemes for speech, for different numbers of channels (residual norm versus iterations/spikes). In this figure, only the chirp factor is adapted.]

3.3. Coding of Castanet

We used the adaptive coding algorithm and obtained an ITU-R impairment scale score of 4 in informal listening tests. The number of spikes before temporal masking is 700; applying temporal masking reduced it to 651. The spike coding gain is 0.08N in the adaptive case and 0.3N in the non-adaptive case, and the bit rate is 1.54 bits/sample in the adaptive case versus 3.03 bits/sample in the non-adaptive case. Results for both schemes are given in Table 3.

3.4. Coding of White Noise

In another set of experiments, we modelled white noise with the adaptive and non-adaptive approaches and compared the results. As we can see in Fig. 6, as for the other signal types, the adaptive paradigm outperforms the non-adaptive one. Note that our deterministic model has been able to model the stochastic white noise.

3.5. Discussion

Our technique generated spike gains ranging from 0.08N to 0.12N. This is much lower than the 1.26N obtained in [1], the 3.2N in [9], and the 0.66N obtained in [2] for signals sampled at 4 to 8 kHz. Those techniques are based on thresholding the outputs of a filterbank and generating a spike each time the threshold is crossed. This peak-picking approach generates redundant spikes, and hence a higher number of spikes for the same audio material, compared to the proposed technique.

4. FUTURE WORK

The matching pursuit algorithm is relatively slow. We have derived a closed-form formula for the correlation between gammatone and gammachirp filters that can be used to speed up the process. The dynamics (evolution through time) of spike amplitudes, channel frequencies, etc. can also give good hints on how these values should be coded; preliminary results on this issue have been very encouraging.

[Fig. 5: Comparison of the adaptive and non-adaptive spike coding schemes for speech with 16 channels (residual norm versus iterations/spikes). In this figure, three parameters (the chirp factor, b, and l) are adapted.]

[Fig. 6: Convergence rate of the adaptive and non-adaptive paradigms for white noise (residual norm / signal norm versus iterations/spikes).]

The introduction of perceptual criteria and weak/weighted matching pursuit is another potential performance booster to be investigated.

In this article, we used time-differential coding to code the spikes. A more efficient way would be to consider spikes as graph nodes and to optimize the coding cost over different paths; this approach is under investigation, as is a preliminary quantization sensitivity analysis.

The representation proposed in this article generates independent acoustical events on a coarse time scale. On a finer time scale, however, each acoustical event consists of dependent and correlated elements (spikes). This dependency can be used to further reduce redundancy.

The masking paradigm used in this article works better for low-frequency content than for high-frequency information. A modified version of the approach should put more emphasis on higher frequencies.

5. CONCLUSION

We have proposed a new biologically-inspired paradigm for universal audio coding based on neural spikes. Our approach generates sparse 2-D representations of audio signals, dubbed spikegrams, by matching pursuit. A masking model is applied to the spikegrams to remove inaudible spikes and to increase the coding efficiency. We have replaced the peak-picking approach used in [2] and [1] by matching pursuit, which is much more efficient in terms of dimensionality reduction. We have also proposed the adaptive fitting of the chirp, decay, and attack factors in the gammachirp filterbank; this change has reduced both the computational load and the bit rate of the coding system. We further introduced additional masking at the output of the representation obtained by matching pursuit, and arithmetic coding is used for lossless compression of the spike parameters to further reduce the bit rate.

6. REFERENCES

[1] E. Ambikairajah, J. Epps, and L. Lin. Wideband speech and audio coding using gammatone filterbanks. In ICASSP, pages 773-776, 2001.

[2] C. Feldbauer, G. Kubin, and W. B. Kleijn. Anthropomorphic coding of speech and audio: A model inversion approach. EURASIP Journal on Applied Signal Processing, (9):1334-1349, 2005.

[3] M. Goodwin and M. Vetterli. Matching pursuit and atomic signal models based on recursive filter banks. IEEE Transactions on Signal Processing, 47(7):1890-1902, 1999.

[4] D. J. Graham and D. J. Field. Sparse coding in the neocortex. In J. H. Kaas and L. A. Krubitzer, editors, Evolution of Nervous Systems. 2006.

[5] R. Gribonval. Fast matching pursuit with a multiscale dictionary of Gaussian chirps. IEEE Transactions on Signal Processing, 49(5):994-1001, 2001.

[6] T. Irino and R. D. Patterson. A compressive gammachirp auditory filter for both physiological and psychophysical data. JASA, 109(5):2008-2022, 2001.

[7] T. Irino and R. D. Patterson. A dynamic compressive gammachirp auditory filterbank. IEEE Transactions on Audio, Speech, and Language Processing, 14(6):2222-2232, 2006.

[8] W. Jesteadt, S. Bacon, and J. Lehman. Forward masking as a function of frequency, masker level, and signal delay. JASA, 71(4):950-962, 1982.

[9] G. Kubin and W. B. Kleijn. On speech coding in a perceptual domain. In ICASSP, pages 205-208, 1999.

[10] E. Smith and M. S. Lewicki. Efficient auditory coding. Nature, 439(7079):978-982, 2006.

[11] E. Smith and M. S. Lewicki. Efficient coding of time-relative structure using spikes. Neural Computation, 17:19-45, 2005.

[12] E. Terhardt, G. Stoll, and M. Seewann. Algorithm for extraction of pitch and pitch salience from complex tonal signals. JASA, 71(3):679-688, 1982.

[13] E. Zwicker. Dependence of post-masking on masker duration and its relation to temporal effects in loudness. JASA, 75(1):219-223, 1984.

[14] E. Zwicker and H. Fastl. Psychoacoustics: Facts and Models. Springer-Verlag, Berlin, 1990.