Environmental Sound Recognition using MP-based Features


Selina Chu, Shri Narayanan*, and C.-C. Jay Kuo*
Speech Analysis and Interpretation Lab, Signal & Image Processing Institute
Department of Computer Science, *Department of Electrical Engineering
University of Southern California, Los Angeles, CA
Email: selinach@sipi.usc.edu
Apr 01, 2008

Outline
- What environmental sounds are and the challenges in recognizing them
- Feature extraction using Matching Pursuit (MP-Features)
- Obtaining MP-Features
- Experimental setup and results
- Conclusion and future work

Environmental sounds
- What we hear every day, everywhere: restaurants, streets, parks, airports and train stations, hallways, etc.
- Unlike speech and music, which are structured sounds
  - Speech: formant structure, i.e. vowels and consonants
  - Music: harmonic structure, i.e. notes
- Environmental sounds are unstructured and similar to noise
  - Variably composed
  - Thus, difficult to build models for
- Can we incorporate such recognition and understanding in an automatic classification system?

Why recognize environmental sounds?
- Use audio information to assist activities such as robotic navigation and human-computer interaction
- Vision-based robots have limitations
  - Require much world knowledge
  - Lighting problems (or lack of lighting) and camera angle
- Mitigate these limitations by incorporating audio information into the recognition process
  - Fusion of audio and visual information
  - Capture additional, semantically richer information
  - Disambiguation of environment and object types
- Audio data can be easily acquired and is computationally cheaper to process than visual data
- Other applications: surveillance, search and rescue, obstacle detection, wearable devices, and other context-aware applications

Difficulties in acoustic environment classification
- Environmental sounds are dynamic and unpredictable
  - Same location at different times: different sounds
  - Streets: characterized by cars and people, but these could be absent (or other, additional sounds could be present)
- Sounding similar
  - Different locations, same sounds: many people in a restaurant or bar sound the same as in a crowded mall
  - Different sounds that sound similar: running water in a bathtub sounds similar to a running stream in the mountains

Audio classification results from current literature
- Categories compared: unstructured (environmental) and structured (music; speech, music, non-speech, silence)

  Number of classes   Overall accuracy (%)
  18                  61
  13                  56
  16                  90
  10                  87
  10                  75
  8                   77
  6                   85
  2                   99
  5                   88
  4                   82
  3                   90
  2                   98

- Recognizing general acoustic environments: restaurants, street, schoolyard
- Recognizing discrete acoustic events: door knocks, gunshots, laughter, applause
- Recognition rates for general acoustic environments using conventional features (e.g. MFCCs): 92% for 5 classes (Chu et al.), 77% for 11 classes (Malkin et al.), and 60% for 13 or more classes (Eronen et al.)
  - Due to randomness, high variance, and other difficulties in working with various environments

Intuitions about environmental sounds
[Figure: example signals and their decompositions for Nature-nighttime, Nature-daytime, On Boat Harbor, Near Highway, Subway platform]
- Different environments have different characteristics
- Decompositions are noticeably different from one another
- There are underlying structures within each type of signal
- Need features that can discover these structures and capture these time-domain signatures

Audio features
- Conventional features for audio recognition/classification do not perform well with environmental sounds
  - MFCCs, MFCC derivatives, sub-band energy, fundamental frequency, LPCCs, energy, zero-crossing, and spectral centroid, bandwidth, roll-off, flux, and flatness
- MFCCs are among the most widely used features for audio classification
  - MFCCs describe the shape of the overall spectrum
  - Favorable for modeling single sound sources
  - Work well for structured sounds such as speech and music
  - Performance degrades in the presence of noise
- Limitations of MFCCs and other commonly used features: difficulty in describing
  - Finer resolution of temporal characteristics
  - Dynamic aspects of the signal
- In general, few time-domain features have been used to characterize audio signals in the past

MP-Features
- MP-Features can capture the time-domain signature of a signal in a concise way
- Use the matching pursuit (MP) algorithm to analyze environmental sounds
- Matching pursuit algorithm (Mallat and Zhang, 1993)
  - Builds up a sequence of sparse approximations stepwise
- The process includes
  - Finding the decomposition of a signal from a dictionary of atoms
  - Yielding the best set of time-domain functions to form an approximate representation
  - Using the information found from that set to compute the MP-Features
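The greedy, stepwise selection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`dot`, `normalize`, `matching_pursuit`) and the toy dictionary of unit spikes and cosines are hypothetical and chosen only to keep the example small; the slides themselves use a Gabor dictionary.

```python
import math

def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy MP over unit-norm atoms.

    Returns ([(atom_index, coefficient), ...], final_residual).
    """
    residual = list(signal)
    picks = []
    for _ in range(n_atoms):
        # Correlate the current residual with every atom.
        scores = [dot(residual, atom) for atom in dictionary]
        # Keep the atom with the largest absolute correlation.
        best = max(range(len(dictionary)), key=lambda k: abs(scores[k]))
        coeff = scores[best]
        picks.append((best, coeff))
        # Subtract that atom's contribution from the residual.
        residual = [r - coeff * a for r, a in zip(residual, dictionary[best])]
    return picks, residual

# Toy dictionary (an assumption for illustration): spikes plus cosines.
N = 16
spikes = [[1.0 if i == k else 0.0 for i in range(N)] for k in range(N)]
cosines = [
    normalize([math.cos(math.pi * (i + 0.5) * k / N) for i in range(N)])
    for k in range(1, N)
]
dictionary = spikes + cosines

# Test signal: one cosine atom plus one spike.
signal = [2.0 * c for c in cosines[2]]
signal[2] += 1.5

picks, residual = matching_pursuit(signal, dictionary, n_atoms=2)
print([index for index, _ in picks])
print(dot(residual, residual))
```

Because every atom has unit norm, each step removes energy equal to the squared coefficient, so the residual energy never increases from one iteration to the next.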

Dictionaries
- The quality of MP depends on the chosen dictionary (or dictionaries)
- A dictionary contains a set of bases, or simply parameterized waveforms
- Examples of dictionaries include Gabor, Fourier, wavelets, cosine packets, chirplets, warplets, and many others
- Desirable characteristics:
  - Should be complete or overcomplete
    - MP is then guaranteed to converge (zero-energy residual)
  - The atoms in the dictionary should be discriminative among themselves
    - Otherwise, similar atoms will compete with each other in the MP process
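A dictionary of parameterized waveforms can be generated by sweeping the atom parameters over a grid. The sketch below builds real Gabor atoms as a Gaussian envelope multiplied by a cosine, which is a common textbook form. The exact atom definition, normalization, and parameter grids used by the authors are not given in the slides, so `gabor_atom`, `gabor_dictionary`, and all grid values here are assumptions for illustration.

```python
import math

def gabor_atom(length, scale, shift, freq, phase=0.0):
    """Unit-norm real Gabor atom: Gaussian envelope times a cosine.

    scale and shift are in samples; freq is in cycles per sample.
    """
    atom = [
        math.exp(-math.pi * ((n - shift) / scale) ** 2)
        * math.cos(2.0 * math.pi * freq * (n - shift) + phase)
        for n in range(length)
    ]
    norm = math.sqrt(sum(x * x for x in atom))
    return [x / norm for x in atom]

def gabor_dictionary(length, scales, shifts, freqs):
    """Build one atom per (scale, shift, freq) combination.

    Returns (params, atoms) where params[i] = (freq, scale, shift)
    describes atoms[i].
    """
    params, atoms = [], []
    for scale in scales:
        for shift in shifts:
            for freq in freqs:
                params.append((freq, scale, shift))
                atoms.append(gabor_atom(length, scale, shift, freq))
    return params, atoms

# Hypothetical grid: 2 scales x 3 shifts x 3 frequencies = 18 atoms.
params, atoms = gabor_dictionary(
    length=64, scales=[4, 16], shifts=[16, 32, 48], freqs=[0.05, 0.1, 0.2]
)
print(len(atoms))
```

Keeping the parameter tuple next to each atom is what later allows a selected atom to be decoded back into its frequency, scale, and shift.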

Examples of various dictionaries
- Frequency dictionary: Fourier
- Time-scale dictionary: Haar
- Time-frequency dictionary: Gabor
[Figure: first five atoms found on the same signal using different dictionaries]
[Figure: original signal and its approximation reconstructed from the first 10 atoms of each dictionary: Fourier, MSE = 23.87; Haar, MSE = 12.40; Gabor, MSE = 9.91]

Process of obtaining MP features
- For each sampling window:
  - Decompose using the MP algorithm
    - The MP algorithm is stopped after obtaining n atoms
    - n is determined experimentally
  - Decode each atom into its original parameters, obtaining frequency, scale, and shift
  - Accumulate the parameters of all the atoms
  - Find the mean and standard deviation of each parameter separately
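The last two steps above reduce the selected atoms of one window to a short feature vector. The sketch below assumes each atom has already been decoded into a `(frequency, scale, shift)` tuple and that the six MP-Features are the mean and standard deviation of those three parameters, which matches the feature count quoted in the experimental setup but is otherwise an inference. The function name `mp_features`, the use of the population standard deviation, and the output ordering are assumptions.

```python
from statistics import mean, pstdev

def mp_features(atom_params):
    """Summarize the atoms selected for one window.

    atom_params: list of (frequency, scale, shift), one tuple per atom.
    Returns [mean_freq, std_freq, mean_scale, std_scale,
             mean_shift, std_shift].
    """
    features = []
    # zip(*...) regroups the tuples into one column per parameter.
    for column in zip(*atom_params):
        features.append(mean(column))
        features.append(pstdev(column))
    return features

# Two hypothetical atoms decoded from one window.
window_atoms = [(100, 4, 10), (300, 8, 30)]
features = mp_features(window_atoms)
print(features)
```

In a full pipeline, `atom_params` would hold the n atoms returned by matching pursuit for that window, with n chosen experimentally as described above.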

Finding n atoms
[Figure: recognition accuracy (axis from 50 to 90) versus the first n atoms used as features (n = 1 to 10)]
- Classification performance levels off after 4 or 5 atoms
- A larger number increases the complexity and makes the features more specific to each data item
- A smaller number represents the data in a more general way

A small example
[Figure: 3-D plot of MP-feature centroids over frequency, scale, and shift indices for Raining, Stream / River, Nature-nighttime, Nature-daytime, and School playground]
- Plot of MP-features
- Captures time and frequency simultaneously
- Forms clear natural clusters automatically
- For visual purposes, only the centroids are plotted

Experimental setup
- Dataset: audio clips of 14 environmental sounds
  - Audio sources: recordings of natural (unsynthesized) ambient sounds
  - BBC sound effects, original series [9]; Freesound project [10]
  - Sound files were of varying lengths (1-3 minutes each)
  - Manually labeled and separated into 4-second segments
- Feature extraction: features are analyzed and extracted for every 30 ms rectangular window frame, with 15 ms overlap
  - MFCC (12), MP-Features (6)
  - Dictionary: Gabor function
- Classifier: Gaussian Mixture Model (GMM)
- 4-fold cross-validation
  - Separate sound sources: trained on data from 3 different sources and tested on data from 1 source
  - Minimum number of files for a class was 4
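The framing step (30 ms rectangular windows with 15 ms overlap, applied to 4-second segments) can be sketched as follows. The sample rate is not stated in the slides, so the 16 kHz value below is an assumption, as are the function name `frame_signal` and the choice to drop a trailing partial frame.

```python
def frame_signal(samples, sample_rate, win_ms=30, overlap_ms=15):
    """Split a signal into fixed-length, overlapping rectangular frames.

    A rectangular window applies no tapering, so each frame is a plain
    slice of the signal. A trailing partial frame is discarded.
    """
    frame_len = sample_rate * win_ms // 1000
    hop = frame_len - sample_rate * overlap_ms // 1000
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames

sample_rate = 16000                   # assumed; not given in the slides
segment = [0.0] * (4 * sample_rate)   # one silent 4-second segment
frames = frame_signal(segment, sample_rate)
print(len(frames), len(frames[0]))    # 265 480
```

Each frame would then be passed to the MFCC and MP-Feature extractors, and the resulting per-frame vectors used to train the classifier.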

14 types of environmental sounds
1. Inside moving vehicles
2. Restaurant
3. Casino
4. Nature daytime
5. Nature nighttime
6. Street with police cars
7. Street with ambulance
8. Street with traffic and pedestrians
9. Playground
10. Raining
11. River/running water
12. Thundering
13. Train passing
14. Waves
Choosing the types
- Diverse in characteristics, i.e. they sound distinctly different from one another
- Homogeneous enough to provide a typical representation of the class
- Similar types of sounds to those in other works in the literature

Classification results
[Figure: bar chart of classification rate (%) per class for MFCCs (12), MP-Features (6), and MP+MFCC; classes: Inside moving vehicles, Casino, Nature-daytime, Restaurant, Nature-nighttime, Street with police car, School playground, Street with traffic, Raining, Running water, Thundering, Train passing, Waves, Street with ambulance]
- MFCCs tend to operate at the extremes: they do better than MP-features alone in 7 classes, but very poorly (0-5%) in 5 classes
- MP-features range between 35% and 100%
- MFCCs and MP-features complement one another
- MFCC only: 68%, MP-features only: 71%, MP-features + MFCC: 83%

A closer look
- Nature-nighttime class
  - Contains many insect sounds of higher frequencies (noise-like, flat spectrum)
  - Characterized by narrow spectral peaks
  - MFCCs have difficulty encoding narrow-band structure (e.g. the engine sound of a passing train)
  - MFCCs: 0%, MP: 100%, MFCC+MP: 100%
- School playground
  - Contains children playing and screaming, and other outdoor ambient sounds, e.g. birds, passing traffic, etc.
  - MFCCs: 75%, MP: 60%, MFCC+MP: 95%
  - MFCCs act as a better discriminator for more structured sounds, such as vocal sounds (e.g. talking inside a restaurant)
[Figure: plots for Nature-nighttime and School playground]

Conclusion
- Introduced a feature extraction framework that provides a way to capture the temporal-spectral characteristics of an audio signal
  - Uncovers underlying structures within each type of signal
  - MP-features are less sensitive to noise and can represent sounds originating from different sources and different ranges
- Used matching pursuit for feature extraction and applied it to unstructured audio processing
- MP-features can supplement MFCCs to yield higher classification performance for environmental sounds than conventional features alone

Thank you
Questions?
- Email: selinach@sipi.usc.edu
- Speech Analysis and Interpretation Laboratory: http://sail.usc.edu
- Acknowledgement: This work was supported by the National Science Foundation, DHS and the Army.