Minimal-Impact Audio-Based Personal Archives

Dan Ellis and Keansub Lee
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
{dpwe,kslee}@ee.columbia.edu

Outline:
1. Personal Audio Archives
2. Features
3. Segmentation
4. Clustering
5. Privacy
6. Future Work

1. Personal Audio
- Easy to record everything you hear: <2 GB / week @ 64 kbps
- Very hard to find anything
  - how to scan?
  - how to visualize?
  - how to index?
- Need automatic analysis
- Need minimal impact
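The storage figure quoted on this slide can be checked with a few lines of arithmetic. This is only a sketch: the function name is hypothetical, and the 7.5 hr/day recording rate is an assumption borrowed from the data-set slide rather than something this slide states.

```python
def archive_size_gb(bitrate_kbps: float, hours: float) -> float:
    """Storage in gigabytes (10^9 bytes) for `hours` of audio at `bitrate_kbps`."""
    bytes_per_second = bitrate_kbps * 1000 / 8   # kilobits/s -> bytes/s
    return bytes_per_second * hours * 3600 / 1e9


# Assumption: about 7.5 hours of recording per day, seven days a week.
weekly_gb = archive_size_gb(64, 7.5 * 7)
print(f"{weekly_gb:.2f} GB per week")   # prints "1.51 GB per week"
```

At that recording rate the weekly total stays under the 2 GB bound; recording around the clock at the same bitrate would exceed it.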

Applications
- Automatic appointment-book history: fills in the when & where of movements
- Life statistics
  - how long did I spend in meetings this week vs. last?
  - most frequent conversations
  - favorite phrases??
- Retrieving details
  - what exactly did I promise?
  - privacy issues...
- Nostalgia?

Data Set
- Starting point: collect data
  - 62 hours recorded (8 days, ~7.5 hr/day)
  - hand-mark 139 segments, 16 classes

Label        total mins   total segs
Library         981          27
Campus          750          56
Restaurant      560           5
Bowling         244           2
Lecture 1       234           4
Car/Taxi        165           7
Street          162          16

- minimal impact?

2. Features
- Long-duration recordings may benefit from longer basic time-frames: 60 s rather than 10 ms?
- Perceptually-motivated features: broad spectrum + some detail?
- For diary application...
  - background more important than foreground?
  - smooth out uncharacteristic transients

Feature sets
[Figure: six panels plotted as freq / bark against time / min, with colour scales in dB and bits: Average Linear Energy, Normalized Energy Deviation, Average Log Energy, Log Energy Deviation, Average Spectral Entropy, Spectral Entropy Deviation]
- Capture both average and variation
- Capture a little more detail in subbands...
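One way to read the energy panels is as per-band statistics of short-frame energies pooled over a long window. The sketch below assumes the short-frame, per-band energies have already been computed; the function name, the dictionary keys and the choice of standard deviation divided by mean for the normalized deviation are illustrative assumptions, not the authors' code.

```python
import numpy as np


def long_window_features(band_energy, frames_per_window):
    """Pool short-frame band energies into long-window statistics.

    band_energy: (n_frames, n_bands) non-negative linear energies.
    Returns a dict of (n_windows, n_bands) arrays; leftover frames are dropped.
    """
    n_frames, n_bands = band_energy.shape
    n_win = n_frames // frames_per_window
    x = band_energy[: n_win * frames_per_window]
    x = x.reshape(n_win, frames_per_window, n_bands)
    eps = 1e-12                                   # guards log(0) and 0/0
    mean_lin = x.mean(axis=1)
    log_db = 10.0 * np.log10(x + eps)
    return {
        "avg_linear_energy": mean_lin,
        # Assumption: deviation normalized by the mean (sigma / mu).
        "normalized_energy_deviation": x.std(axis=1) / (mean_lin + eps),
        "avg_log_energy_db": log_db.mean(axis=1),
        "log_energy_deviation_db": log_db.std(axis=1),
    }


# Usage: 10 ms frames pooled into 60 s windows means 6000 frames per window.
rng = np.random.default_rng(0)
feats = long_window_features(rng.random((18000, 21)), 6000)
print(feats["avg_log_energy_db"].shape)
```

Pooling over a whole minute is what smooths out short transients, so the averages mostly reflect the background.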

Spectral Entropy
- Auditory spectrum:
  A[n,j] = \sum_{k=0}^{N_F} w_{jk} X[n,k]
- Spectral entropy: peakiness of each band:
  H[n,j] = -\sum_{k=0}^{N_F} \frac{w_{jk} X[n,k]}{A[n,j]} \log\left( \frac{w_{jk} X[n,k]}{A[n,j]} \right)
  where X[n,k] is the FFT spectral magnitude of frame n at bin k, and w_{jk} weights bin k into auditory band j.
[Figure: FFT spectral magnitude and auditory spectrum (energy / dB), and per-band spectral entropies (rel. entropy / bits), against freq / Hz]
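The two quantities on this slide can be sketched in NumPy as follows. The function name and array layout are hypothetical, the logarithm is taken base 2 so that the result is in bits, and the leading minus sign follows the usual entropy convention (an assumption, since the slide formula did not survive extraction cleanly).

```python
import numpy as np


def band_spectral_entropy(X, W):
    """Auditory spectrum A[n, j] and per-band spectral entropy H[n, j].

    X: (n_frames, n_bins) non-negative FFT magnitudes.
    W: (n_bands, n_bins) non-negative band weights w_jk.
    """
    eps = 1e-12
    weighted = X[:, None, :] * W[None, :, :]      # w_jk * X[n, k]
    A = weighted.sum(axis=2)                      # A[n, j] = sum_k w_jk X[n, k]
    p = weighted / (A[:, :, None] + eps)          # per-band distribution over bins
    H = -(p * np.log2(p + eps)).sum(axis=2)       # entropy in bits
    return A, H


# Usage: one band covering four bins; a flat frame and a single-peak frame.
X = np.array([[1.0, 1.0, 1.0, 1.0],
              [4.0, 0.0, 0.0, 0.0]])
W = np.ones((1, 4))
A, H = band_spectral_entropy(X, W)
print(A.ravel(), H.ravel())
```

Both frames carry the same band energy, but the flat one has high entropy and the peaky one has entropy near zero, which is the extra detail within a band that the entropy feature is meant to capture.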

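3. BIC segmentation
- BIC (Bayesian Information Criterion): compare more and less complex models
  BIC = \log \frac{L(X_1;M_1)\,L(X_2;M_2)}{L(X;M_0)} - \frac{\lambda}{2} \log(N)\,\Delta\#(M)
  where N is the number of points in the window and \Delta\#(M) is the number of extra parameters in the two-model description.
- For segmentation:
  - Grow context window from current boundary
  - For each window, test every possible segmentation
  - When BIC is positive, mark new segment
[Figure: timeline from 0 to N showing the last segmentation point, a candidate boundary and the current context limit; L(X_1;M_1) and L(X_2;M_2) cover the data either side of the candidate, L(X;M_0) the whole window]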
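The growing-window procedure on this slide can be sketched as below. The sketch assumes each model is a single full-covariance Gaussian fitted by maximum likelihood, in which case the log-likelihood ratio reduces to a difference of N log|Σ| terms. The function names, `min_seg`, `grow` and the small covariance regularizer are all hypothetical choices made for the illustration.

```python
import numpy as np


def _n_logdet(X):
    """N * log|Sigma| for the maximum-likelihood Gaussian fitted to the rows of X."""
    n, d = X.shape
    cov = np.cov(X, rowvar=False, bias=True).reshape(d, d) + 1e-6 * np.eye(d)
    return n * np.linalg.slogdet(cov)[1]


def bic_score(X, t, lam=1.0):
    """BIC gain of splitting window X at row t into X[:t] and X[t:]."""
    n, d = X.shape
    log_ratio = 0.5 * (_n_logdet(X) - _n_logdet(X[:t]) - _n_logdet(X[t:]))
    extra_params = d + d * (d + 1) / 2            # one extra mean and covariance
    return log_ratio - 0.5 * lam * np.log(n) * extra_params


def bic_segment(X, min_seg=10, grow=20, lam=1.0):
    """Return boundary indices found by growing a window from the last boundary."""
    boundaries, start = [], 0
    end = min(len(X), 2 * min_seg)
    while True:
        window = X[start:end]
        best_t, best = None, 0.0
        # Test every candidate split that leaves min_seg frames on each side.
        for t in range(min_seg, len(window) - min_seg + 1):
            score = bic_score(window, t, lam)
            if score > best:
                best_t, best = t, score
        if best_t is not None:                    # positive BIC: mark a new segment
            start += best_t
            boundaries.append(start)
            end = min(len(X), start + 2 * min_seg)
        elif end >= len(X):                       # window reached the end of the data
            break
        else:                                     # no boundary yet: grow the window
            end = min(len(X), end + grow)
    return boundaries


# Usage: two synthetic "environments" with different means, change at frame 200.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(5.0, 1.0, (200, 2))])
print(bic_segment(X, min_seg=20, grow=20, lam=2.0))
```

No models are trained or stored: each decision compares Gaussians fitted on the spot to the data inside the current window, and λ trades missed boundaries against false alarms.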

BIC Segmentation Example
[Figure: 04-09-10-1023_AvgLEnergy: AvgLogAudSpec feature (freq / bark) and BIC score against time / hr, 13:30 to 16:00, annotated with: last seg point; no boundary found with shorter window; boundary passes BIC; current window limit]
- No training or stored models

Segmentation Results
- Evaluate: 60 hr hand-marked boundaries
- different features & combinations
- Correct Accept % @ False Accept = 2%:

Feature               Correct Accept
µdB                       80.8%
µH                        81.1%
σH/µH                     81.6%
µdB + σH/µH               84.0%
µdB + σH/µH + µH          83.6%
avg. mfcc                 73.6%

[Figure: ROC curves, Sensitivity against 1 - Specificity (0 to 0.04), for µdB, µH, σH/µH, µdB + σH/µH and µdB + µH + σH/µH]

4. Segment clustering
- Daily activity has lots of repetition: automatically cluster similar segments
- Affinity of segments as KL2 (symmetrized Kullback-Leibler) distances
[Figure: segment-by-segment affinity matrix (scale 0 to 1), rows and columns grouped by class: supermkt, meeting, karaoke, barber, lecture2, billiard, break, lecture1, car/taxi, home, bowling, street, restaurant, library, campus]
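A segment-to-segment affinity built from KL2 distances might look like the sketch below. It assumes each segment is summarized by one full-covariance Gaussian over its feature frames, for which the symmetrized KL divergence has a closed form. The exponential mapping from distance to affinity and the `sigma` scale are assumptions, as are all function names.

```python
import numpy as np


def fit_gaussian(frames, reg=1e-6):
    """Mean and (lightly regularized) covariance of a (n_frames, d) segment."""
    d = frames.shape[1]
    cov = np.cov(frames, rowvar=False, bias=True).reshape(d, d) + reg * np.eye(d)
    return frames.mean(axis=0), cov


def kl2(g1, g2):
    """Symmetrized KL divergence KL(g1||g2) + KL(g2||g1) between two Gaussians."""
    (m1, c1), (m2, c2) = g1, g2
    d = len(m1)
    ic1, ic2 = np.linalg.inv(c1), np.linalg.inv(c2)
    dm = m1 - m2
    # The log-determinant terms of the two directed divergences cancel.
    return 0.5 * (np.trace(ic2 @ c1) + np.trace(ic1 @ c2) - 2 * d
                  + dm @ (ic1 + ic2) @ dm)


def affinity_matrix(segments, sigma=1.0):
    """Pairwise affinities exp(-KL2 / (2 sigma^2)); 1 on the diagonal."""
    models = [fit_gaussian(s) for s in segments]
    n = len(models)
    D = np.array([[kl2(models[i], models[j]) for j in range(n)]
                  for i in range(n)])
    return np.exp(-D / (2.0 * sigma ** 2))


# Usage: two segments from the same synthetic "place" and one from another.
rng = np.random.default_rng(0)
segs = [rng.normal(0, 1, (500, 3)), rng.normal(0, 1, (500, 3)),
        rng.normal(4, 1, (500, 3))]
Aff = affinity_matrix(segs, sigma=2.0)
print(np.round(Aff, 3))
```

Segments recorded in similar conditions end up with affinity near 1 and dissimilar ones near 0, which produces the block structure visible in the affinity matrix.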

Spectral Clustering
- Eigenanalysis of affinity matrix: A = U S V'
- SVD components: u_k s_kk v_k'
- Eigenvectors v_k give cluster memberships
- Number of clusters?
[Figure: the affinity matrix and its first four SVD components, k=1 to k=4]
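The decomposition on this slide can be sketched with `numpy.linalg.svd`. The slide only says that the eigenvectors give cluster memberships; turning them into hard labels by running a small k-means on the row-normalized eigenvector coordinates (in the style of Ng, Jordan and Weiss) is a swapped-in assumption, and the function names are hypothetical.

```python
import numpy as np


def svd_components(A, k):
    """First k rank-one terms u_k s_kk v_k' of A = U S V'."""
    U, s, Vt = np.linalg.svd(A)
    return [s[i] * np.outer(U[:, i], Vt[i]) for i in range(k)]


def spectral_cluster(A, k, n_iter=50):
    """Label each row of affinity matrix A with one of k clusters."""
    _, _, Vt = np.linalg.svd(A)
    E = Vt[:k].T                                  # item i -> its top-k coordinates
    E = E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-12)
    # Deterministic farthest-point initialization of the k-means centers.
    centers = [E[0]]
    for _ in range(1, k):
        d = np.min([((E - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(E[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((E[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([E[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels


# Usage: a toy affinity matrix with two blocks of three similar segments each.
A = np.full((6, 6), 0.05)
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
labels = spectral_cluster(A, k=2)
print(labels)
```

The number of clusters k has to be supplied here; choosing it automatically is the open question the slide raises.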

Clustering Results
- Clustering of automatic segments gives anonymous classes
  - BIC criterion to choose number of clusters
  - make best correspondence to 16 GT (ground-truth) clusters
- Frame-level scoring gives ~70% correct
  - errors when same place has multiple ambiences

5. Privacy
- Recording conversations conflicts with expectations of privacy: a critical barrier to progress
- Technical solutions to improve acceptance?
- Speaker/speech search and destroy
  - scramble 100 ms segs of speech (preserving longer-term statistics)
  - high-confidence speaker ID to bypass

Speech Scrambling
- Permute 200 ms segments within 1 s blocks
  - removes intelligibility
  - preserves local structure
  - segment features almost unchanged
[Figure: spectrograms (freq / kHz against time / s, level / dB) of the original (dan+kean-ex.wav) and the scrambled version (200 ms wins over 1 s)]
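A minimal sketch of block-wise scrambling is given below. The function name and parameters are hypothetical, the segment and block lengths are left as arguments to tune, and segments are cut with hard edges (no windowing or cross-fade), which is a simplification rather than the authors' implementation.

```python
import numpy as np


def scramble(signal, sr, seg_ms=200, block_s=1.0, seed=None):
    """Randomly permute seg_ms-long segments within each block_s-long block.

    signal: 1-D array of samples; sr: sample rate in Hz.
    A trailing partial block is left untouched.
    """
    rng = np.random.default_rng(seed)
    seg = int(round(sr * seg_ms / 1000))          # samples per segment
    block = int(round(sr * block_s))              # samples per block
    n_seg = block // seg                          # whole segments per block
    out = np.array(signal, copy=True)
    for start in range(0, len(out) - block + 1, block):
        stop = start + n_seg * seg
        segs = out[start:stop].reshape(n_seg, seg)
        # Fancy indexing copies the rows before they are written back.
        out[start:stop] = segs[rng.permutation(n_seg)].ravel()
    return out


# Usage: a ramp makes it easy to see that samples only move within their block.
sr = 1000
x = np.arange(3500, dtype=float)
y = scramble(x, sr, seed=0)
print(y[:5], y[200:205])
```

Every sample stays inside its own one-second block, so statistics pooled over longer windows change little, while the word-level ordering that carries intelligibility is broken up.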

6. Future Work
- Visualization / browsing / diary inference
- Link in other information sources
  - diary
  - email
- What is it good for?
- NoteTaker interface

Conclusions
- Personal Audio is easy & cheap to collect, but is it any use?
- Boundaries quite easy to spot, e.g. moving to a new location
- Repeated activities can cluster together, so user's labels can propagate
- Still gaining experience with the data: speech, speaker ID, privacy, ...