Minimal-Impact Audio-Based Personal Archives

Dan Ellis and Keansub Lee
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY, USA

Outline
1. Personal Audio Archives
2. Features
3. Segmentation
4. Clustering
5. Privacy
6. Future Work

1. Personal Audio

- Easy to record everything you hear: <2 GB / week @ 64 kbps
- Very hard to find anything:
  - how to scan?
  - how to visualize?
  - how to index?
- Need automatic analysis
- Need minimal impact
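The storage figure can be checked with a little arithmetic. The sketch below assumes about 7.5 recorded hours per day, every day of the week (the per-day figure quoted for the data set), and counts a gigabyte as 10^9 bytes; the function name is hypothetical.

```python
def archive_size_gb(bitrate_kbps: float, hours: float) -> float:
    """Storage in gigabytes (10^9 bytes) for `hours` of audio at `bitrate_kbps`."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return bytes_per_second * hours * 3600 / 1e9


# Assumption: roughly 7.5 hours recorded per day, seven days a week.
weekly_gb = archive_size_gb(64, 7.5 * 7)
print(f"{weekly_gb:.2f} GB/week")
```

Under these assumptions the weekly total stays below the 2 GB bound quoted on the slide.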

Applications

- Automatic appointment-book history: fills in the when & where of movements
- Life statistics:
  - how long did I spend in meetings this week vs. last?
  - most frequent conversations
  - favorite phrases??
- Retrieving details: what exactly did I promise? (privacy issues...)
- Nostalgia?

Data Set

- Starting point: collect data
- 62 hours recorded (8 days, ~7.5 hr/day)
- Hand-mark 139 segments, 16 classes
- [Table: Label / total mins / total segs; surviving row labels: Library, Campus, Restaurant, Bowling, Lecture, Car/Taxi, Street]
- Minimal impact?

2. Features

- Long-duration recordings may benefit from longer basic time-frames: 60 s rather than 10 ms?
- Perceptually-motivated features: broad spectrum + some detail?
- For the diary application, is background more important than foreground?
  - smooth out uncharacteristic transients

Feature sets

[Figure: six panels (Average Linear Energy, Normalized Energy Deviation, Average Log Energy, Log Energy Deviation, Average Spectral Entropy, Spectral Entropy Deviation); axes freq / bark vs. time / min; color scales in dB and bits]

- Capture both average and variation
- Capture a little more detail in subbands...
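As a rough illustration of "average and variation" over long frames, the sketch below pools short-time per-band energies into longer blocks and reports four of the statistics named above. The function name, the dictionary keys, and the use of the population standard deviation are assumptions, not details given on the slides.

```python
import math
from statistics import mean, pstdev


def long_frame_features(band_energy, frames_per_block):
    """Pool short-time per-band linear energies into long-block statistics.

    band_energy: list of short-time frames, each a list of per-band energies.
    Returns one dict of per-band feature lists for every complete block.
    """
    feats = []
    last_start = len(band_energy) - frames_per_block
    for start in range(0, last_start + 1, frames_per_block):
        block = band_energy[start:start + frames_per_block]
        bands = list(zip(*block))  # one sequence per band across the block
        avg_lin = [mean(b) for b in bands]
        # Deviation normalized by the mean, so it is independent of level.
        norm_dev = [pstdev(b) / m if m > 0 else 0.0
                    for b, m in zip(bands, avg_lin)]
        # Floor the energy to avoid log(0) on silent frames.
        log_db = [[10 * math.log10(max(e, 1e-12)) for e in b] for b in bands]
        feats.append({
            "avg_lin": avg_lin,
            "norm_dev": norm_dev,
            "avg_log_db": [mean(b) for b in log_db],
            "log_dev_db": [pstdev(b) for b in log_db],
        })
    return feats


# Toy input: band 0 is steady, band 1 alternates between 0 dB and 20 dB.
frames = [[1.0, 1.0], [1.0, 100.0], [1.0, 1.0], [1.0, 100.0]]
features = long_frame_features(frames, frames_per_block=4)
```

With 10 ms short-time frames, a 60 s block as suggested on the Features slide would correspond to `frames_per_block=6000`.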

Spectral Entropy

- Auditory spectrum:
  $$A[n,j] = \sum_{k=0}^{N_F} w_{jk}\, X[n,k]$$
- Spectral entropy, the 'peakiness' of each band:
  $$H[n,j] = -\sum_{k=0}^{N_F} \frac{w_{jk}\, X[n,k]}{A[n,j]} \log\left(\frac{w_{jk}\, X[n,k]}{A[n,j]}\right)$$

[Figure: FFT spectral magnitude and auditory spectrum (energy / dB) with per-band spectral entropies (rel. entropy / bits), plotted against freq / Hz]
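A minimal sketch of the two per-band quantities for a single frame n. It assumes base-2 logarithms (the figure axis is in bits) and takes the band weights w_jk as a plain list of rows; the function name is hypothetical.

```python
import math


def spectral_entropy(spectrum, weights):
    """Per-band energy A[j] and entropy H[j] (in bits) for one frame.

    spectrum: FFT magnitudes X[k].
    weights:  rows w[j][k] mapping FFT bins onto auditory bands.
    """
    energies, entropies = [], []
    for w_j in weights:
        contrib = [w * x for w, x in zip(w_j, spectrum)]
        a = sum(contrib)
        h = 0.0
        if a > 0:
            for c in contrib:
                if c > 0:  # terms with zero weight or magnitude contribute 0
                    p = c / a
                    h -= p * math.log2(p)
        energies.append(a)
        entropies.append(h)
    return energies, entropies


# Two bands of four bins each: band 0 is flat, band 1 is a single peak.
X = [1.0, 1.0, 1.0, 1.0, 4.0, 0.0, 0.0, 0.0]
W = [[1, 1, 1, 1, 0, 0, 0, 0],
     [0, 0, 0, 0, 1, 1, 1, 1]]
A, H = spectral_entropy(X, W)
```

Both bands carry the same energy, but the flat band has the maximum entropy for four bins (2 bits) while the peaky band has zero, which is the distinction the entropy feature is meant to capture.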

3. BIC segmentation

- BIC (Bayesian Information Criterion): compare more and less complex models
  $$\log \frac{L(X_1; M_1)\, L(X_2; M_2)}{L(X; M_0)} \gtrless \frac{\lambda}{2} \log(N)\, \Delta\#(M)$$
- For segmentation:
  - Grow context window from current boundary
  - For each window, test every possible segmentation
  - When BIC is positive, mark new segment

[Figure: time axis from 0 to N showing the last segmentation point, a candidate boundary, and the current context limit; L(X_1; M_1) and L(X_2; M_2) cover the two sides of the candidate, L(X; M_0) the whole window]
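The test for a single context window can be sketched as follows. This assumes each model is one maximum-likelihood Gaussian with diagonal covariance, so splitting adds 2d parameters for d-dimensional features; the slides do not state the model family, and the function names, the variance floor, `min_len`, and the default λ = 1 are illustrative choices.

```python
import math


def gauss_loglik(frames):
    """Log-likelihood of frames under their own ML diagonal Gaussian."""
    n = len(frames)
    total = 0.0
    for dim in zip(*frames):
        mu = sum(dim) / n
        var = max(sum((v - mu) ** 2 for v in dim) / n, 1e-6)  # variance floor
        total += -0.5 * n * (math.log(2 * math.pi * var) + 1.0)
    return total


def bic_score(frames, t, lam=1.0):
    """Likelihood gain from splitting at frame t, minus the BIC penalty."""
    d = len(frames[0])
    gain = (gauss_loglik(frames[:t]) + gauss_loglik(frames[t:])
            - gauss_loglik(frames))
    extra_params = 2 * d  # one extra mean and variance per dimension
    return gain - 0.5 * lam * math.log(len(frames)) * extra_params


def best_boundary(frames, min_len=3, lam=1.0):
    """Test every split of the window; return the best positive one or None."""
    best_t, best = None, 0.0
    for t in range(min_len, len(frames) - min_len + 1):
        score = bic_score(frames, t, lam)
        if score > best:
            best_t, best = t, score
    return best_t


# Toy window: the feature level jumps from about 0 to about 5 at frame 20.
window = ([[0.1 * (i % 3)] for i in range(20)]
          + [[5.0 + 0.1 * (i % 3)] for i in range(20)])
boundary = best_boundary(window)
```

In a full segmenter, a `None` result would lead to growing the window and retesting, and a found boundary would become the start of the next window, as the bullets above describe.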

BIC Segmentation Example

[Figure: AvgLEnergy and AvgLogAudSpec feature tracks with the BIC score, 13:30 to 16:00 (time / hr); annotations: last seg point, no boundary found with shorter window, boundary passes BIC, current window limit]

- No training or stored models

Segmentation Results

- Evaluate: 60 hr of hand-marked boundaries; different features & combinations
- Correct Accept at False Accept = 2%:

  Feature                Correct Accept
  µdB                    80.8%
  µH                     81.1%
  σH/µH                  81.6%
  µdB + σH/µH            84.0%
  µdB + σH/µH + µH       83.6%
  avg. mfcc              73.6%

[Figure: Sensitivity vs. Specificity curves for µdB, µH, σH/µH, µdB + σH/µH, and µdB + µH + σH/µH]

4. Segment clustering

- Daily activity has lots of repetition: automatically cluster similar segments
- Affinity of segments as KL2 distances

[Figure: segment affinity matrix; class labels: supermkt, meeting, karaoke, barber, lecture2, billiard, break, lecture1, car/taxi, home, bowling, street, restaurant, library, campus (abbreviated on the second axis as cmp, lib, rst, str)]
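One way to turn segment statistics into affinities is sketched below. It assumes each segment is summarized by a diagonal-covariance Gaussian (a per-dimension mean and variance), takes KL2 to be the symmetrized Kullback-Leibler divergence, and maps distance to affinity with an exponential whose `scale` is a free parameter; none of these specifics appear on the slide, and the function names are hypothetical.

```python
import math


def kl2_diag(mu_p, var_p, mu_q, var_q):
    """Symmetrized KL divergence KL(p||q) + KL(q||p), diagonal Gaussians."""
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        d2 = (mp - mq) ** 2
        total += 0.5 * (vp / vq + vq / vp - 2.0 + d2 * (1.0 / vp + 1.0 / vq))
    return total


def affinity_matrix(segments, scale=1.0):
    """segments: list of (mean, variance) pairs, one per segment."""
    n = len(segments)
    return [[math.exp(-kl2_diag(*segments[i], *segments[j]) / (2 * scale ** 2))
             for j in range(n)]
            for i in range(n)]


# Three toy one-dimensional segments: two similar, one very different.
segs = [([0.0], [1.0]), ([0.1], [1.0]), ([5.0], [1.0])]
aff = affinity_matrix(segs)
```

Similar segments end up with affinity near 1 and dissimilar ones near 0, which gives the block structure that clustering can then exploit.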

Spectral Clustering

- Eigenanalysis of affinity matrix: A = U S V'
- SVD components: u_k s_kk v_k'
- Eigenvectors v_k give cluster memberships
- Number of clusters?

[Figure: affinity matrix and its SVD components u_k s_kk v_k' for successive k]
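A small sketch of reading cluster memberships off the leading eigenvectors. To stay dependency-free it swaps the SVD for power iteration with deflation, which yields the same leading vectors v_k when the affinity matrix is symmetric and positive semidefinite with distinct leading eigenvalues; the assignment rule (largest absolute loading) and the function names are illustrative assumptions.

```python
import math


def top_eigenvectors(A, k, iters=500):
    """Leading k eigenvectors of a symmetric PSD matrix (power iteration)."""
    n = len(A)
    M = [row[:] for row in A]
    vecs = []
    for _ in range(k):
        v = [float(i + 1) for i in range(n)]  # fixed, generic starting vector
        for _ in range(iters):
            w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the converged vector.
        lam = sum(v[i] * sum(M[i][j] * v[j] for j in range(n))
                  for i in range(n))
        vecs.append(v)
        # Deflate: remove this component so the next pass finds the next one.
        for i in range(n):
            for j in range(n):
                M[i][j] -= lam * v[i] * v[j]
    return vecs


def cluster_labels(A, k):
    """Assign each segment to the eigenvector on which it loads most."""
    vecs = top_eigenvectors(A, k)
    return [max(range(k), key=lambda c: abs(vecs[c][i]))
            for i in range(len(A))]


# Toy affinity: segments 0-2 are mutually similar, as are segments 3-4.
affinity = [
    [1.00, 0.90, 0.90, 0.05, 0.05],
    [0.90, 1.00, 0.90, 0.05, 0.05],
    [0.90, 0.90, 1.00, 0.05, 0.05],
    [0.05, 0.05, 0.05, 1.00, 0.80],
    [0.05, 0.05, 0.05, 0.80, 1.00],
]
labels = cluster_labels(affinity, k=2)
```

Here `k` is supplied by hand; choosing it automatically is the open question the slide raises.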

Clustering Results

- Clustering of automatic segments gives 'anonymous classes'
  - BIC criterion to choose number of clusters
  - make best correspondence to 16 GT (ground-truth) clusters
- Frame-level scoring gives ~70% correct
  - errors when same place has multiple ambiences

5. Privacy

- Recording conversations conflicts with expectations of privacy: a critical barrier to progress
- Technical solutions to improve acceptance?
- Speaker/speech 'search and destroy':
  - scramble 100 ms segs of speech (preserving longer-term statistics)
  - high-confidence speaker ID to bypass

Speech Scrambling

- Permute short segments within 1 s blocks
  - removes intelligibility
  - preserves local structure
  - segment features almost unchanged

[Figure: spectrograms (freq / kHz vs. time / s, level / dB) of the original (dan+kean-ex.wav) and the scrambled version]
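The permutation itself is simple to sketch. The code below works on a plain list of samples and shuffles fixed-length segments within each block; the function name, the optional seed, and the sample counts in the example are hypothetical, and a real implementation would probably smooth the segment joins, which this sketch does not do.

```python
import random


def scramble(samples, seg_len, block_len, seed=None):
    """Randomly reorder seg_len-sample segments within each block."""
    rng = random.Random(seed)
    out = []
    for b in range(0, len(samples), block_len):
        block = samples[b:b + block_len]
        segs = [block[i:i + seg_len] for i in range(0, len(block), seg_len)]
        rng.shuffle(segs)  # segments move, but never leave their block
        for s in segs:
            out.extend(s)
    return out


# Toy signal: 32 samples, 2-sample segments, 8-sample blocks.
signal = list(range(32))
scrambled = scramble(signal, seg_len=2, block_len=8, seed=0)
```

At a hypothetical 16 kHz sample rate, the 100 ms segments mentioned on the Privacy slide and 1 s blocks would correspond to `seg_len=1600` and `block_len=16000`. Because every sample stays inside its original block, statistics computed over a block or longer are unaffected by the reordering.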

6. Future Work

- Visualization / browsing / diary inference
- Link in other information sources: diary
- NoteTaker interface
- What is it good for?

Conclusions

- Personal Audio is easy & cheap to collect, but is it any use?
- Boundaries quite easy to spot, e.g. moving to a new location
- Repeated activities can cluster together, so user's labels can propagate
- Still gaining experience with the data: speech, speaker ID, privacy, ...
