Collection of re-transmitted data and impulse responses and remote ASR and speaker verification. Igor Szoke, Lada Mosner (et al.)

Collection of re-transmitted data and impulse responses and remote ASR and speaker verification. Igor Szoke, Lada Mosner (et al.) BUT Speech@FIT LISTEN Workshop, Bonn, 19.7.2018

Why the DRAPAK project
- To ship an ASR system that copes with distant and hidden mics (bugs).
- There is a gap in WER between ASR systems trained on retransmitted data using real RIR or artificial RIR. It is a few percent, but it is still a gap (Ravanelli 2012).
- No dataset of this size exists (our goal is 50 environments), according to AcouSP and openairlib.net. Or is there?
- To support other R&D at BUT and also in the world.
- Status:
  - 8 rooms processed so far
  - Running verification experiments now
  - If OK, then scale up.

Microphone placement
Mic positions:
- 1-8 - spherical mic array
- ~5 - table top, close to the 1st speaker position (SPKID01)
- ~5 - hidden (in a shelf, AC, waste bin, under a table, in a drawer, ...)
- 2 - IoT
- ~5 - ceiling, light, etc.
- ~5 - table top in other places
Speaker positions:
- Sitting person
- Standing person
- Noise source (near wall)
- Non-standard position (rotated to ceiling, etc.)

How
- To take it seriously, we made a recording protocol:
  - Measure the room size, material, etc.
  - Position of the speaker
  - Position of the microphones (delay compensation)
  - Set mic gains
  - Take photos
- Visualisation tool
  - Absolute & relative coord.
- It takes ~5 hrs to set up a room
- And ~3 hrs to dismount.
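The protocol records speaker and microphone positions for delay compensation. A minimal sketch of how a measured position pair could be turned into a delay is given here. It assumes the delay is the acoustic propagation time over the straight-line distance, a speed of sound of 343 m/s, and the 48 kHz sampling rate mentioned on the Tools slide. The function names and the example coordinates are hypothetical and do not come from the slides.

```python
import numpy as np

def propagation_delay_samples(spk_xyz, mic_xyz, fs=48000, c=343.0):
    """Delay (in samples) of the direct path between speaker and microphone.

    Assumes straight-line propagation at speed c (m/s); coordinates in metres.
    """
    spk = np.asarray(spk_xyz, dtype=float)
    mic = np.asarray(mic_xyz, dtype=float)
    distance = np.linalg.norm(mic - spk)
    return distance / c * fs

def compensate_delay(signal, delay_samples):
    """Drop the leading samples that correspond to the propagation delay."""
    k = int(round(delay_samples))
    return signal[k:]

# Hypothetical positions: microphone 3.43 m from the speaker along x.
delay = propagation_delay_samples((0.0, 0.0, 0.0), (3.43, 0.0, 0.0))
```

With these made-up coordinates the direct path takes 10 ms. In practice the delay would be computed per microphone from the coordinates logged in the protocol.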

L207 - Speaker positions

L207 - Microphones positions to SpkID01

RIR estimation environment
Maximum length sequence (MLS) - real RIR
- White-noise-like signal
- h(t) is the result of the circular cross-correlation of y(t) and x(t)
- Expects the same clocks (synchronized input and output) - bad for our case
Exponential sine sweep (ESS) - real RIR
- Deconvolution
- Sine with increasing frequency (exponentially, to overcome some distortions)
- h(t) is the result of the convolution of y(t) with the inverse filter
- It works fine for our case
Image source model (ISM) - artificial RIR
- Numerical method to calculate a RIR given room dimensions, spk + mic coord., and reflection coef.
- Cannot simulate obstacles
- Can generate an unlimited number of RIRs
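The ESS procedure described above (play a sweep x(t), record y(t), convolve y(t) with an inverse filter) can be sketched as follows. This is an illustration under stated assumptions, not the authors' tooling. The inverse filter is taken to be the time-reversed sweep with a decaying amplitude envelope, which is the common construction for exponential sweeps. The frequency range, sweep length, and all function names are hypothetical. The estimate is correct only up to a constant scale factor.

```python
import numpy as np

def exponential_sine_sweep(f1, f2, duration, fs):
    """Sine whose frequency rises exponentially from f1 to f2 (Hz)."""
    t = np.arange(int(round(duration * fs))) / fs
    rate = np.log(f2 / f1)
    phase = 2 * np.pi * f1 * duration / rate * (np.exp(t * rate / duration) - 1.0)
    return np.sin(phase)

def inverse_filter(sweep, f1, f2, duration, fs):
    """Time-reversed sweep with a decaying envelope.

    The envelope compensates for the sweep spending more time
    (and energy) at low frequencies.
    """
    t = np.arange(len(sweep)) / fs
    rate = np.log(f2 / f1)
    return sweep[::-1] * np.exp(-t * rate / duration)

def estimate_rir(recorded, sweep, inv):
    """Convolve the recording with the inverse filter (via FFT).

    The linear response starts at lag len(sweep) - 1; earlier lags are dropped.
    The result is scaled by an arbitrary constant.
    """
    n = len(recorded) + len(inv) - 1
    nfft = 1 << (n - 1).bit_length()
    spectrum = np.fft.rfft(recorded, nfft) * np.fft.rfft(inv, nfft)
    full = np.fft.irfft(spectrum, nfft)[:n]
    return full[len(sweep) - 1:]
```

A quick check is to convolve the sweep with a known toy response, run `estimate_rir`, and compare the relative heights of the recovered taps. Because the sweep covers a limited band, each tap comes back as a narrow band-limited pulse, not a single sample.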

What to record
- Everything (we do not want to re-setup the room again)
- Real RIR
  - MLS - Maximum length sequence - (bad) - few of them
  - ESS - Exponential sine sweep - good - 1s to 30s
- Silence :) = environmental noise
- Speech data
  - A Czech test set (not public :( ) - few hours
  - English Librispeech Test Clean (public :) ) - few hours
  - English NIST SRE 2010 subset (not public :( ) - 2 days :(
  - The Czech train set - to fill space if possible
- Any other ideas???
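For the MLS recordings, the RIR is obtained by circular cross-correlation of the recorded signal y(t) with the played sequence x(t). The sketch below illustrates that idea on synthetic data. The 10-bit register, the feedback taps, and the function names are assumptions chosen for a short example. It also assumes perfectly synchronized playback and recording, which is exactly the condition the slides flag as a problem for MLS.

```python
import numpy as np

def mls(nbits=10, taps=(10, 7)):
    """Maximum length sequence of length 2**nbits - 1 with values +1/-1.

    Generated by a linear feedback shift register; the default taps
    give a full-period sequence for nbits=10.
    """
    state = [1] * nbits
    bits = []
    for _ in range(2 ** nbits - 1):
        feedback = state[taps[0] - 1] ^ state[taps[1] - 1]
        bits.append(state[-1])
        state = [feedback] + state[:-1]
    return 1.0 - 2.0 * np.array(bits)

def mls_rir(recorded_period, sequence):
    """Circular cross-correlation of one recorded period with the MLS.

    The MLS autocorrelation is close to an impulse, so the result
    approximates h(t), with a small constant offset of -sum(h)/(N+1).
    """
    n = len(sequence)
    spectrum = np.fft.rfft(recorded_period) * np.conj(np.fft.rfft(sequence))
    return np.fft.irfft(spectrum, n=n) / (n + 1)
```

If the playback and recording clocks drift, the recorded period no longer lines up with the sequence and the correlation peak smears, which is consistent with the "(bad)" label above.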

Tools
- Synced 32 chs, 48 kHz, 24 bit
- Soft gain
- Run for 3 days
^^ BUT Workshop on Room Acoustics Measurement ^^
Stojan Jakotytch

Results

RIRs collected (so far)
- 8 rooms
- 14 test sets
- 50 RIRs times 31 microphones = 1550 RIRs

Room                       Size     #Tests  #RIRs  #Mics
VUT_FIT_L212               middle   2       5      31
VUT_FIT_Q301               middle   4       6      31
VUT_FIT_D105               large    2       7      31
VUT_FIT_E112               large    1       3      31
Hotel_SD_R112              small    0       5      31
Hotel_SD_ConferenceRoom2   large    0       4      31
VUT_FIT_L207               middle   3       9      31
VUT_FIT_L227               large    2       11     31

ASR experiments
Czech - retransmission experiment
- Decent DNN-based ASR, trained on 400 hrs, incl. reverb and noises
- Scoring uses fixed segmentation
- Baseline 75.5% WAC
English - retransmission experiment
- Librispeech - standard Kaldi recipe
- Baseline 95.86% WAC
English - simulation experiment
- AMI - standard Kaldi TDNN recipe
- SDM baseline 39.6% WER
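The baselines above mix two metrics, WAC and WER. The sketch below shows how they relate, assuming WAC means word accuracy defined as 100% minus WER, and that WER is the word-level edit distance divided by the reference length. The slides do not spell out either definition, so both are assumptions, and the function names and example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: (substitutions + deletions + insertions) / len(reference)."""
    ref, hyp = list(reference), list(hypothesis)
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)

def word_accuracy(reference, hypothesis):
    """WAC in percent, assumed here to be 100 - WER."""
    return 100.0 - word_error_rate(reference, hypothesis)

ref = "the cat sat on the mat".split()
hyp = "the cat sat mat".split()
wer = word_error_rate(ref, hyp)   # two deleted words out of six
wac = word_accuracy(ref, hyp)
```

Under this assumed definition, the 95.86% WAC Librispeech baseline would correspond to a WER of 4.14%.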

The experiment on English
- Speaker data -> retransmitted data
- RIR -> Exponential Sine Sweep (real RIR)

ESS to ISM comparison (Q301, L207)
The experiment on Czech
- ISM -> Image Source Model (artificial RIR)
- ESS -> Exponential Sine Sweep (real RIR)
- Baseline on playback data (headset)
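The ISM side of this comparison computes a RIR from room dimensions, speaker and microphone coordinates, and reflection coefficients. A simplified sketch for an empty shoebox room follows. It is not the generator used in the experiments. It assumes a single reflection coefficient `beta` shared by all six walls, delays rounded to the nearest sample, and a truncated reflection order. All names and default values are hypothetical.

```python
import itertools
import numpy as np

def ism_rir(room, src, mic, beta, fs=16000, c=343.0, max_order=3, length=4096):
    """Image source model for a shoebox room (simplified).

    room: (Lx, Ly, Lz) in metres; src, mic: coordinates in metres;
    beta: reflection coefficient applied at every wall bounce.
    src and mic must not coincide.
    """
    room, src, mic = (np.asarray(v, dtype=float) for v in (room, src, mic))
    h = np.zeros(length)
    orders = range(-max_order, max_order + 1)
    for n in itertools.product(orders, repeat=3):
        for p in itertools.product((0, 1), repeat=3):
            n_arr, p_arr = np.array(n), np.array(p)
            # Mirror the source across the walls to get the image position.
            image = (1 - 2 * p_arr) * src + 2 * n_arr * room
            dist = np.linalg.norm(image - mic)
            # Number of wall reflections on the path from this image.
            n_refl = int(np.sum(np.abs(n_arr - p_arr) + np.abs(n_arr)))
            k = int(round(dist / c * fs))
            if k < length:
                h[k] += beta ** n_refl / (4 * np.pi * dist)
    return h
```

Because the model only mirrors the source across six flat walls, furniture and other obstacles are absent from the result, which matches the limitation noted on the RIR estimation slide.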

The AMI experiment - still running
- Standard Kaldi TDNN recipe
- ISM -> Image Source Model (artificial RIR): 450x (RND 2-5 x 2-5 x 2-6) m
- ESS -> Exponential Sine Sweep (real RIR): 190x, 4 rooms (3.1x4.6x6.9, 2.6x6.9x10.8, 2.6x2.8x4.4, 3.1x4.6x7.5) m
- Noise -> environmental noise recorded in the 4 rooms

Test: SDM Eval (WER)
Train            WER
IHM              70.9
IHM+ISM          57.7
IHM+ISM+Noise    49.0
IHM+ESS          55.1
IHM+ESS+Noise    49.2
SDM              39.6
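The training conditions such as IHM+ESS+Noise combine close-talk speech with a RIR and recorded environmental noise. One plausible way to build such data is sketched below: convolve the clean signal with the RIR, then add noise scaled to a target signal-to-noise ratio. The SNR parameter, the looping of short noise recordings, and the function names are assumptions. The slides do not state how the noise level was chosen.

```python
import numpy as np

def fft_convolve(x, h):
    """Linear convolution of two 1-D signals via FFT."""
    n = len(x) + len(h) - 1
    nfft = 1 << (n - 1).bit_length()
    return np.fft.irfft(np.fft.rfft(x, nfft) * np.fft.rfft(h, nfft), nfft)[:n]

def simulate_far_field(speech, rir, noise=None, snr_db=None):
    """Reverberate close-talk speech with a RIR and optionally add noise.

    The output is trimmed to the length of the input speech.
    The noise is looped or cropped to fit and scaled to snr_db.
    """
    reverberant = fft_convolve(speech, rir)[:len(speech)]
    if noise is None:
        return reverberant
    noise = np.resize(noise, len(reverberant))  # repeats the noise if too short
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return reverberant + gain * noise
```

The same function would serve both conditions in the table, passing either an ISM-generated or an ESS-measured response as `rir`.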

Conclusion
- It works (the laboratory setup)
- It is comparable to artificial RIRs (ISM method) so far
- We can get close to real retransmitted data using RIR + noise
- It is stable
- But the setups are not fully comparable
- We are using it in SID (Odyssey and ICASSP 2018 papers)
The big question
- Are you interested?
- Any suggestions?
2 rooms freely available here: http://speech.fit.vutbr.cz/software/but-speech-fit-reverb-database