Minimal-Impact Audio-Based Personal Archives

Dan Ellis and Keansub Lee
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
{dpwe,kslee}@ee.columbia.edu

1. Personal Audio Archives
2. Features
3. Segmentation
4. Clustering
5. Privacy
6. Future Work

1. Personal Audio
Easy to record everything you hear
  - <2GB / week @ 64 kbps
Very hard to find anything
  - how to scan?
  - how to visualize?
  - how to index?
Need automatic analysis
Need minimal impact
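
A quick check of the storage figure can be sketched as follows. It assumes decimal gigabytes and a recording rate of about 7.5 hours per day (the rate quoted for the data set); both are assumptions about how the slide's number was derived, not something the slide states.

```python
# Back-of-the-envelope storage estimate for continuous personal recording.
BITRATE_BPS = 64_000      # 64 kbps, as quoted on the slide
HOURS_PER_DAY = 7.5       # assumption: same rate as the ~7.5 hr/day data set

bytes_per_hour = BITRATE_BPS / 8 * 3600                    # bits/s -> bytes/hour
gb_per_week = bytes_per_hour * HOURS_PER_DAY * 7 / 1e9     # decimal GB

print(f"{bytes_per_hour / 1e6:.1f} MB/hour, {gb_per_week:.2f} GB/week")
```

Under these assumptions the weekly total stays below 2 GB; recording around the clock at the same bitrate would not.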

Applications
Automatic appointment-book history
  - fills in when & where of movements
Life statistics
  - how long did I spend in meetings this week vs. last
  - most frequent conversations
  - favorite phrases??
Retrieving details
  - what exactly did I promise?
  - privacy issues...
Nostalgia?

Data Set
Starting point: Collect data
  - 62 hours recorded (8 days, ~7.5 hr/day)
  - hand-mark 139 segments, 16 classes

Label        Total mins   Total segs
Library         981           27
Campus          750           56
Restaurant      560            5
Bowling         244            2
Lecture 1       234            4
Car/Taxi        165            7
Street          162           16

minimal impact?

2. Features
Long duration recordings may benefit from longer basic time-frames
  - 60 s rather than 10 ms?
Perceptually-motivated features
  - broad spectrum + some detail?
For diary application...
  - background more important than foreground?
  - smooth out uncharacteristic transients

Feature sets
[Figure: six panels plotted as freq / bark against time / min: Average Linear Energy (dB), Normalized Energy Deviation, Average Log Energy (dB), Log Energy Deviation (dB), Average Spectral Entropy (bits), Spectral Entropy Deviation (bits)]
Capture both average and variation
Capture a little more detail in subbands...
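
The "average and variation" idea can be sketched as per-band statistics over long blocks of short-time frames. This is a minimal illustration, not the authors' code: the function name `long_frame_stats`, the 21-band layout, and the 10 ms / 60 s framing in the demo are all assumptions.

```python
import numpy as np

def long_frame_stats(band_energy, frames_per_block):
    """Summarise short-time band energies over long blocks.

    band_energy: array (n_frames, n_bands) of non-negative linear energies.
    Returns a dict of (n_blocks, n_bands) arrays.
    """
    n_blocks = band_energy.shape[0] // frames_per_block
    # Drop the incomplete trailing block, then reshape to (block, frame, band).
    x = band_energy[: n_blocks * frames_per_block]
    x = x.reshape(n_blocks, frames_per_block, -1)
    eps = 1e-12                          # avoids log(0) and division by zero
    log_x = 10.0 * np.log10(x + eps)     # energy in dB
    mean_lin = x.mean(axis=1)
    return {
        "avg_linear_energy": mean_lin,
        "norm_energy_deviation": x.std(axis=1) / (mean_lin + eps),
        "avg_log_energy_db": log_x.mean(axis=1),
        "log_energy_deviation_db": log_x.std(axis=1),
    }

# Hypothetical input: 3 minutes of 10 ms frames in 21 bands (random data).
rng = np.random.default_rng(0)
demo = rng.random((6000 * 3, 21))
stats = long_frame_stats(demo, frames_per_block=6000)   # 60 s blocks
```

Each output row then describes one minute of audio, which is the time scale the diary application cares about.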

Spectral Entropy
Auditory spectrum:
\[ A[n,j] = \sum_{k=0}^{N_F} w_{jk} X[n,k] \]
Spectral entropy ('peakiness' of each band):
\[ H[n,j] = -\sum_{k=0}^{N_F} \frac{w_{jk} X[n,k]}{A[n,j]} \log\left( \frac{w_{jk} X[n,k]}{A[n,j]} \right) \]
[Figure: top panel, FFT spectral magnitude and Auditory Spectrum (energy / dB against freq / Hz); bottom panel, per-band Spectral Entropies (rel. entropy / bits against freq / Hz)]
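
A minimal NumPy sketch of the auditory spectrum and per-band entropy follows. The weight matrix `W` (rows are bands, columns are FFT bins) and the use of base-2 logarithms (to get bits) are assumptions; how the weights are designed is not covered here.

```python
import numpy as np

def auditory_spectrum_and_entropy(X, W):
    """X: (n_frames, n_bins) non-negative spectral values X[n, k].
    W: (n_bands, n_bins) non-negative band weights w_jk.
    Returns A[n, j] and H[n, j] (entropy in bits)."""
    eps = 1e-12
    # contrib[n, j, k] = w_jk * X[n, k]
    contrib = W[None, :, :] * X[:, None, :]
    A = contrib.sum(axis=2)
    p = contrib / (A[:, :, None] + eps)      # distribution over bins per band
    # Treat 0 * log(0) as 0 by only taking the log where p > 0.
    plogp = np.where(p > 0, p * np.log2(np.where(p > 0, p, 1.0)), 0.0)
    H = -plogp.sum(axis=2)
    return A, H

# One band spanning four bins: a flat frame and a single-peak frame.
X = np.array([[1.0, 1.0, 1.0, 1.0],
              [4.0, 0.0, 0.0, 0.0]])
W = np.ones((1, 4))
A, H = auditory_spectrum_and_entropy(X, W)
```

Both frames carry the same band energy, but the flat frame has the maximum entropy for four bins while the peaky frame has entropy near zero, which is the extra within-band detail the feature is meant to capture.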

3. BIC segmentation
BIC (Bayesian Information Criterion): compare more and less complex models
\[ \log \frac{L(X_1;M_1)\,L(X_2;M_2)}{L(X;M_0)} \gtrless \frac{\lambda}{2} \log(N)\,\Delta\#(M) \]
For segmentation:
  - Grow context window from current boundary
  - For each window, test every possible segmentation
  - When BIC is positive, mark new segment
[Figure: timeline from 0 to N (time) marking the last segmentation point, a candidate boundary, and the current context limit; L(X_1;M_1) and L(X_2;M_2) cover the two sides of the candidate boundary, L(X;M_0) the whole window]
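
The test at a single candidate boundary can be sketched as below. It assumes each model is one full-covariance Gaussian fitted by maximum likelihood, a common choice for BIC segmentation that the slide does not confirm; `delta_bic`, `best_split`, and the `margin` parameter are hypothetical names.

```python
import numpy as np

def delta_bic(X, t, lam=1.0):
    """BIC score for splitting frames X (N, d) at index t.
    Positive values favour two Gaussians over one."""
    N, d = X.shape

    def n_logdet(Z):
        # len(Z) * log|ML covariance of Z|; a small ridge keeps it invertible.
        cov = np.cov(Z, rowvar=False, bias=True) + 1e-6 * np.eye(d)
        return len(Z) * np.linalg.slogdet(cov)[1]

    # Log-likelihood ratio for ML-fitted full-covariance Gaussians.
    llr = 0.5 * (n_logdet(X) - n_logdet(X[:t]) - n_logdet(X[t:]))
    extra_params = d + d * (d + 1) / 2       # one extra mean and covariance
    return llr - 0.5 * lam * np.log(N) * extra_params

def best_split(X, margin=5, lam=1.0):
    """Test every candidate boundary in the window; return (index, score)."""
    scores = [delta_bic(X, t, lam) for t in range(margin, len(X) - margin)]
    i = int(np.argmax(scores))
    return i + margin, scores[i]

# Synthetic window with a change in mean at frame 60.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 3)), rng.normal(5, 1, (60, 3))])
t, score = best_split(X)
```

In a full segmenter the window would grow from the last boundary until `score` becomes positive, at which point `t` is marked and the window restarts from there.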

BIC Segmentation Example
[Figure: 04-09-10-1023_AvgLEnergy. Top panel: AvgLogAudSpec; bottom panel: BIC score against time / hr (13:30 to 16:00), annotated with the last seg point, "no boundary found with shorter window", "boundary passes BIC", and the current window limit]
No training or stored models

Segmentation Results
Evaluate: 60hr hand-marked boundaries
  - different features & combinations
Correct Accept % @ False Accept = 2%:

Feature                 Correct Accept
µdB                         80.8%
µH                          81.1%
σH/µH                       81.6%
µdB + σH/µH                 84.0%
µdB + σH/µH + µH            83.6%
avg. mfcc                   73.6%

[Figure: ROC curves, Sensitivity against 1 - Specificity, for µdB, µH, σH/µH, µdB + σH/µH, and µdB + µH + σH/µH]

4. Segment clustering
Daily activity has lots of repetition:
  - Automatically cluster similar segments
  - affinity of segments as KL2 distances
[Figure: segment affinity matrix on a 0 to 1 scale, with rows and columns grouped by class: supermkt, meeting, karaoke, barber, lecture2, billiard, break, lecture1, car/taxi, home, bowling, street, restaurant, library, campus]
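
A sketch of the KL2 (symmetric Kullback-Leibler) distance between segments follows, assuming each segment is summarised by a single Gaussian over its feature frames. The exponential mapping from distance to affinity and the `scale` parameter are assumptions; the slide does not say how distances become affinities.

```python
import numpy as np

def kl2_gaussian(mu_p, cov_p, mu_q, cov_q):
    """Symmetric KL divergence KL(p||q) + KL(q||p) between two Gaussians."""
    d = len(mu_p)
    inv_p, inv_q = np.linalg.inv(cov_p), np.linalg.inv(cov_q)
    dmu = mu_p - mu_q
    # The log-determinant terms of the two directed divergences cancel.
    return 0.5 * (np.trace(inv_q @ cov_p) + np.trace(inv_p @ cov_q) - 2 * d
                  + dmu @ (inv_p + inv_q) @ dmu)

def affinity_matrix(segments, scale=1.0):
    """segments: list of (n_i, d) feature arrays, one per segment."""
    d = segments[0].shape[1]
    models = [(s.mean(axis=0), np.cov(s, rowvar=False) + 1e-6 * np.eye(d))
              for s in segments]
    n = len(models)
    A = np.ones((n, n))                      # a segment is fully similar to itself
    for i in range(n):
        for j in range(i + 1, n):
            dist = kl2_gaussian(*models[i], *models[j])
            A[i, j] = A[j, i] = np.exp(-dist / scale)   # assumed mapping
    return A

rng = np.random.default_rng(0)
segs = [rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2)),
        rng.normal(6, 1, (200, 2))]
A = affinity_matrix(segs)
```

In this synthetic example the first two segments come from the same distribution and so end up with a much higher mutual affinity than either has with the third.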

Spectral Clustering
Eigenanalysis of affinity matrix: A = U S V'
[Figure: Affinity Matrix alongside SVD components u_k s_kk v_k' for k=1, k=2, k=3, k=4]
  - eigenvectors v_k give cluster memberships
  - Number of clusters?
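
How the leading components can yield memberships is sketched below on a toy affinity matrix. Assigning each segment to the component with the largest absolute entry is a deliberate simplification chosen for illustration, not the method confirmed by the slides, and `spectral_labels` is a hypothetical name.

```python
import numpy as np

def spectral_labels(A, k):
    """Assign each row of affinity matrix A to one of k SVD components."""
    U, s, Vt = np.linalg.svd(A)          # singular values come sorted, largest first
    V = Vt[:k].T                         # (n_segments, k) leading right vectors
    # abs() removes the arbitrary sign of each singular vector.
    return np.argmax(np.abs(V), axis=1), s

# Toy affinity: two groups (sizes 3 and 2) with weak cross-affinity.
A = np.full((5, 5), 0.05)
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
labels, s = spectral_labels(A, k=2)
```

Here the two dominant singular values stand out from the rest, which hints at how the spectrum could inform the choice of the number of clusters.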

Clustering Results
Clustering of automatic segments gives anonymous classes
  - BIC criterion to choose number of clusters
  - make best correspondence to 16 GT clusters
Frame-level scoring gives ~70% correct
  - errors when same place has multiple ambiences

5. Privacy
Recording conversations conflicts with expectations of privacy
  - critical barrier to progress
Technical solutions to improve acceptance?
Speaker/speech search and destroy
  - scramble 100 ms segs of speech (preserving longer-term statistics)
  - high-confidence speaker ID to bypass

Speech Scrambling
Permute 200 ms segments within 1 s blocks
  - removes intelligibility
  - preserves local structure
  - segment features almost unchanged
[Figure: spectrograms (freq / kHz against time / s, level / dB) of Original (dan+kean-ex.wav) and Scrambled (200ms wins over 1s)]
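
The permutation step can be sketched as follows. The segment and block lengths are parameters (200 ms and 1 s are used as illustrative defaults), and the hard cuts between segments are an assumption made for brevity; a practical version would likely window and overlap-add the segments to avoid clicks.

```python
import numpy as np

def scramble(signal, sr, seg_s=0.2, block_s=1.0, seed=0):
    """Shuffle fixed-length segments within each block of a 1-D signal."""
    rng = np.random.default_rng(seed)
    seg = int(round(seg_s * sr))                 # samples per segment
    per_block = int(round(block_s / seg_s))      # segments per block
    out = signal.copy()                          # trailing partial segment is kept as-is
    n_segs = len(signal) // seg
    for start in range(0, n_segs, per_block):
        idx = np.arange(start, min(start + per_block, n_segs))
        perm = rng.permutation(idx)              # new order within this block only
        for dst, src in zip(idx, perm):
            out[dst * seg:(dst + 1) * seg] = signal[src * seg:(src + 1) * seg]
    return out

sr = 1000
signal = np.arange(3 * sr, dtype=float)          # stand-in for 3 s of audio
scrambled = scramble(signal, sr)
```

Because samples never leave their block, statistics computed over a second or longer are essentially unchanged, while the word-level ordering that carries intelligibility is destroyed.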

6. Future Work
Visualization / browsing / diary inference
Link in other information sources
  - diary
  - email
What is it good for?
  - NoteTaker interface

Conclusions
Personal Audio is easy & cheap to collect
  - but is it any use?
Boundaries quite easy to spot
  - e.g. moving to a new location
Repeated activities can cluster together
  - so user's labels can propagate
Still gaining experience with the data
  - speech, speaker ID, privacy, ...