Environmental Sound Recognition using MP-based Features
Selina Chu, Shri Narayanan*, and C.-C. Jay Kuo*
Speech Analysis and Interpretation Lab, Signal & Image Processing Institute
Department of Computer Science, *Department of Electrical Engineering
University of Southern California, Los Angeles, CA
Email: selinach@sipi.usc.edu
Apr 01, 2008
Outline
- What environmental sounds are and the challenges in recognizing them
- Feature extraction using matching pursuit (MP-features)
- Obtaining MP-features
- Experimental setup and results
- Conclusion and future work
Environmental sounds
What we hear every day, anywhere: restaurants, streets, parks, airports and train stations, hallways, etc.
Unlike speech and music, which are structured sounds
- Speech: formant structure, i.e. vowels and consonants
- Music: harmonic structure, i.e. notes
Environmental sounds are unstructured, similar to noise, and variably composed; this makes them difficult to model
Can we incorporate such recognition and understanding into an automatic classification system?
Why recognize environmental sounds?
Use audio information to assist activities such as robotic navigation and human-computer interaction
Vision-based robots have limitations
- Require much world knowledge
- Lighting problems (or lack of light) and camera-angle issues
Mitigate these limitations by incorporating audio information into the recognition process
- Fusion of audio and visual information captures additional, semantically richer information
- Disambiguation of environment and object types
Audio data can be easily acquired and is computationally cheaper to process than visual data
Other applications: surveillance, search and rescue, obstacle detection, wearable devices, and other context-aware applications
Difficulties in acoustic environment classification
Environmental sounds are dynamic and unpredictable
Same location, different time: different sounds
- Streets are characterized by cars and people, but those could be absent (or other sounds could be added)
Different locations, same sounds
- Many people in restaurants and bars sound the same as in a crowded mall
Different sounds, similar sounding
- The noise of running water in a bathtub is similar to a running stream in the mountains
Audio classification results from the current literature
[Table: reported number of classes vs. overall accuracy (%), spanning structured audio (speech, music, non-speech, silent; music) and unstructured environmental audio: 18 classes / 61%, 13 / 56%, 16 / 90%, 10 / 87%, 10 / 75%, 8 / 77%, 6 / 85%, 2 / 99%, 5 / 88%, 4 / 82%, 3 / 90%, 2 / 98%]
Recognizing general acoustic environments: restaurants, street, schoolyard
Recognizing discrete acoustic events: door knocks, gun shots, laughter, applause
Recognition rates for general acoustic environments using conventional features (i.e. MFCCs): 92% for 5 classes (Chu et al.), 77% for 11 classes (Malkin et al.), and 60% for 13 or more classes (Eronen et al.)
- Lower rates with more classes stem from the randomness, high variance, and other difficulties of working with varied environments
Intuitions about environmental sounds
Different environments have different characteristics
- Their decompositions are noticeably different from one another
- There are underlying structures within each type of signal
Need features that can discover these structures and capture their time-domain signatures
[Figure: example decompositions for six environments: nature-nighttime, nature-daytime, on boat, harbor, near highway, subway platform]
Audio features
Conventional features for audio recognition/classification do not perform well with environmental sounds
- MFCCs, MFCC derivatives, sub-band energy, fundamental frequency, LPCCs, energy, zero-crossing rate, and spectral centroid, bandwidth, roll-off, flux, flatness
MFCCs are among the most widely used features for audio classification
- They describe the shape of the overall spectrum
- Favorable for modeling single sound sources
- Work well for structured sounds such as speech and music
- Performance degrades in the presence of noise
Limitations of MFCCs and other commonly used features: difficulty describing
- Finer resolution of temporal characteristics
- Dynamic aspects of the signal
In general, few time-domain features have been used to characterize audio signals in the past
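For reference, here is a minimal sketch of frame-level MFCC extraction as described above; the librosa library is an assumed choice (the slides name no toolkit), and the file name is a placeholder.

```python
# Minimal MFCC extraction sketch (librosa is an assumed choice; file name is a placeholder)
import librosa

y, sr = librosa.load("example_clip.wav", sr=None)   # load at the native sampling rate
mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=12,                     # 12 coefficients, as used in this work
    n_fft=int(0.030 * sr),         # ~30 ms analysis window
    hop_length=int(0.015 * sr),    # ~15 ms hop (50% overlap)
    window="boxcar",               # rectangular window, as in the experimental setup
)
print(mfcc.shape)                  # (12, number_of_frames)
```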
MP-Features
MP-features capture the time-domain signature of a signal in a concise way
Use the matching pursuit (MP) algorithm to analyze environmental sounds
Matching pursuit algorithm (Mallat and Zhang, 1993)
- Builds up a sparse approximation stepwise
The process includes:
- Finding the decomposition of a signal from a dictionary of atoms
- Yielding the best set of time-domain functions to form an approximate representation
- Using the information found from that set to form the MP-features
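To make the stepwise procedure concrete, here is a minimal sketch of the greedy matching pursuit loop; it assumes a NumPy array whose columns are unit-norm atoms and is an illustration, not the authors' implementation.

```python
# Greedy matching pursuit sketch: dictionary columns are assumed to be unit-norm atoms
import numpy as np

def matching_pursuit(signal, dictionary, n_atoms):
    """Pick n_atoms atoms greedily; return (atom index, weight) pairs and the residual."""
    residual = np.asarray(signal, dtype=float).copy()
    picks = []
    for _ in range(n_atoms):
        correlations = dictionary.T @ residual              # inner product with every atom
        best = int(np.argmax(np.abs(correlations)))         # atom that best matches the residual
        weight = correlations[best]
        residual = residual - weight * dictionary[:, best]  # remove that atom's contribution
        picks.append((best, weight))
    return picks, residual
```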
Dictionaries
The quality of MP depends on the chosen dictionary (or dictionaries)
A dictionary contains a set of bases, or simply parameterized waveforms
Examples of dictionaries include Gabor, Fourier, wavelets, cosine packets, chirplets, warplets, and many others
Desirable characteristics:
- Should be complete or overcomplete, so that MP is guaranteed to converge (zero energy residual)
- The atoms in the dictionary should be discriminative among themselves; otherwise similar atoms will compete with each other in the MP process
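As an illustration of a time-frequency dictionary, here is a hedged sketch of building a small Gabor (Gaussian-windowed sinusoid) dictionary; the parameter grid is arbitrary, not the one used in this work, and the resulting matrix can feed the matching pursuit sketch above.

```python
# Small Gabor dictionary sketch (scales, frequencies, and shifts below are illustrative only)
import numpy as np

def gabor_atom(length, scale, freq, shift, phase=0.0):
    """Unit-norm Gabor atom: Gaussian envelope of width `scale` times a cosine of `freq`."""
    t = np.arange(length) - shift
    atom = np.exp(-np.pi * (t / scale) ** 2) * np.cos(2 * np.pi * freq * t + phase)
    return atom / np.linalg.norm(atom)

length = 256
atoms = [gabor_atom(length, s, f, u)
         for s in (8, 16, 32, 64)            # scales (samples)
         for f in (0.05, 0.1, 0.2, 0.4)      # normalized frequencies (cycles/sample)
         for u in range(0, length, 32)]      # time shifts (samples)
dictionary = np.stack(atoms, axis=1)         # one atom per column
```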
Examples of various dictionaries
- Frequency dictionary: Fourier
- Time-scale dictionary: Haar
- Time-frequency dictionary: Gabor
[Figure: the first five atoms found for the same signal with each dictionary, and reconstructions of the original signal from its first 10 atoms; reconstruction error: Fourier MSE = 23.87, Haar MSE = 12.40, Gabor MSE = 9.91]
Process of obtaining MP-features
For each sampling window:
- Decompose the signal using the MP algorithm
- Stop the MP algorithm after obtaining n atoms (n is determined experimentally)
- Decode each atom into its original parameters: frequency, scale, and shift
- Accumulate all the atoms' parameters
- Compute the mean and standard deviation of each parameter separately
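A minimal sketch of the last two steps: once the n selected atoms have been decoded back into their (frequency, scale, shift) parameters, the MP-feature vector is just their per-parameter mean and standard deviation (the data layout here is an assumption).

```python
# MP-feature sketch: summarize the decoded atom parameters of one analysis window
import numpy as np

def mp_features(atom_params):
    """atom_params: list of (frequency, scale, shift) tuples for the n chosen atoms."""
    params = np.asarray(atom_params, dtype=float)   # shape (n, 3)
    means = params.mean(axis=0)                     # mean frequency, scale, shift
    stds = params.std(axis=0)                       # std  frequency, scale, shift
    return np.concatenate([means, stds])            # 6-dimensional MP-feature vector
```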
Finding n atoms
Classification performance levels off after 4 or 5 atoms
[Figure: recognition accuracy (roughly 50-90%) vs. the first n atoms used as features, n = 1 to 10]
- A larger n increases complexity and makes the representation more specific to each data item
- A smaller n represents the data in a more general way
A small example
Plotting the MP-features captures time and frequency simultaneously, forming clear natural clusters automatically
[Figure: class centroids plotted over frequency, scale, and shift indices, with distinct clusters for raining, stream/river, nature-nighttime, nature-daytime, and school playground; only the centroids are plotted, for visual clarity]
Experimental setup
Dataset: audio clips of 14 environmental sounds
- Audio sources: recordings of natural (unsynthesized) ambient sounds from the BBC Sound Effects original series [9] and the Freesound project [10]
- Each sound file was of varying length (1-3 minutes); manually labeled and separated into 4-second segments
Feature extraction: features are analyzed and extracted for every 30 ms rectangular window, with 15 ms overlap
- MFCCs (12), MP-features (6)
- Dictionary: Gabor functions
Classifier: Gaussian mixture model (GMM)
- 4-fold cross-validation over separate sound sources: trained on data from 3 sources and tested on data from the remaining source
- The minimum number of files for a class was 4
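A hedged sketch of the per-class GMM classifier: one mixture is fit per class and a test vector is assigned to the class with the highest likelihood. scikit-learn and the number of mixture components are my assumptions; the slides specify neither.

```python
# Per-class GMM classification sketch (scikit-learn and n_components are assumptions)
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_gmms(features, labels, n_components=5):
    """Fit one GMM per class on that class's feature vectors."""
    return {c: GaussianMixture(n_components=n_components).fit(features[labels == c])
            for c in np.unique(labels)}

def classify(gmms, features):
    """Assign each feature vector to the class whose GMM gives the highest log-likelihood."""
    classes = sorted(gmms)
    scores = np.column_stack([gmms[c].score_samples(features) for c in classes])
    return np.asarray(classes)[scores.argmax(axis=1)]
```

The 4-fold scheme above would repeat this fit/classify step four times, each time training on three recording sources and holding out the fourth for testing.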
14 types of environmental sounds
1. Inside moving vehicles
2. Restaurant
3. Casino
4. Nature-daytime
5. Nature-nighttime
6. Street with police cars
7. Street with ambulance
8. Street with traffic and pedestrians
9. Playground
10. Raining
11. River/running water
12. Thundering
13. Train passing
14. Waves
Choosing the types:
- Diverse in characteristics, i.e. they sound distinctively different from one another
- Homogeneous enough to provide a typical representation of the class
- Similar types of sounds as in other works in the literature
Classification results
[Figure: per-class classification rate (%) for MFCCs (12), MP-features (6), and MP+MFCC over all 14 classes]
- MFCCs tend to operate at the extremes, doing better than MP-features alone in 7 classes but very poorly (0-5%) in 5 classes
- MP-features range between 35% and 100%
- MFCCs and MP-features provide a complementary effect for one another
- Overall: MFCCs only 68%, MP-features only 71%, MP-features + MFCCs 83%
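The slide does not spell out how the MP+MFCC run is formed; a plausible minimal reading, sketched here with placeholder vectors, is simple per-frame concatenation of the two feature sets before GMM modeling (this fusion choice is an assumption).

```python
# Assumed fusion sketch: concatenate per-frame MFCC and MP-feature vectors (placeholders)
import numpy as np

mfcc_frame = np.zeros(12)                            # placeholder 12-dim MFCC vector for one frame
mp_frame = np.zeros(6)                               # placeholder 6-dim MP-feature vector, same frame
combined = np.concatenate([mfcc_frame, mp_frame])    # 18-dim joint feature vector fed to the GMMs
```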
A closer look
Nature-nighttime class
- Contains many insect sounds at higher frequencies (noise-like, flat spectrum), characterized by narrow spectral peaks
- MFCCs have difficulty encoding narrow-band structure (similarly for the engine sound of a train passing)
- MFCCs: 0%, MP: 100%, MFCC+MP: 100%
School playground
- Contains children playing and screaming and other outdoor ambient sounds, e.g. birds, passing traffic
- MFCCs: 75%, MP: 60%, MFCC+MP: 95%
- MFCCs act as a better discriminator for more structured sounds, like vocal sounds (e.g. talking inside a restaurant)
[Figure: example spectra for the nature-nighttime and school playground classes]
Conclusion
Introduced a feature-extraction framework that captures the temporal-spectral characteristics of an audio signal
- Uncovers underlying structures within each type of signal
- Less sensitive to noise and able to represent sounds originating from different sources and ranges
Used matching pursuit for feature extraction and applied it to unstructured audio processing
MP-features can supplement MFCCs to yield higher classification performance for environmental sounds than conventional features alone
Thank you Email: selinach@sipi.usc.edu Speech Analysis and Interpretation Laboratory: http://sail.usc.edu Acknowledgement: This work was supported by the National Science Foundation, DHS and the Army. Questions?