Leverage always-on voice trigger IP to reach ultra-low power consumption in voice-controlled devices


All rights reserved - This article is the property of Dolphin Integration company

Voice-controlled devices will continue to gain new market segments in the electronics industry in the coming years, because voice technologies open new horizons for the user experience. In low-power smart devices (smartphones, smartwatches, TV remote controls), two functionalities are generally combined to perform voice control:

Voice Activity Detection (VAD): The VAD function runs in the always-on domain. Its role, in real hands-free applications, is to wake up a subset of the system as soon as voice is detected. As the other blocks stay in deep sleep during the always-listening mode, the result is a drastic reduction of the power consumption of the audio subsystem.

Voice recognition: The voice recognition function, often part of an audio DSP, analyzes the audio signal and turns the recognized instructions (keywords or full speech meaning) into commands.

Due to the lack of thorough specifications for assessing the performance of voice detection solutions, it is a real conundrum for users to identify the best solution among a jungle of true and false detection claims, without any benchmark. The aim of this article is to help IP procurement managers, SoC architects and system makers who use voice detection IP in their systems to assess and compare detection performances. This article also clarifies the triggering solutions available for voice-controlled systems and provides means to specify and compare existing solutions thanks to the MIWOK benchmarks.

Which characteristics to rely on to detect a voice in voice-controlled applications?

You may think, and you would be right, that the first quality of a voice activity detector is to catch every voice event while skipping all noise events. But how is that translated into reliable figures and specifications?
Whereas the speech recognition community specifies recognition performance as the Word Error Rate (WER), especially for Speech To Text (STT), the voice detection community has not yet created its own metrics for voice-controlled applications. The challenge is to create figures representative of realistic usage and situations. Developers need to consider the following parameters that influence the performance of a voice detection algorithm:

Input word characteristics: voice power level, language, pronunciation, accent, tone or stress characteristics;

Environment characteristics: background power level, type of noise (wind, cars, conversation).

The voice detection functionality differs from the voice recognition functionality, as voice triggering consists only in sending an interrupt signal. This signal is a flag used to wake up parts of a system during voice activity. The most relevant performances for characterizing the quality of voice detection are:

The capability to detect the beginning of a voice event;

The capability to avoid detecting a noise event;

The latency of detection.

In low-power applications with voice capabilities, such as voice-activated remote controls, the recognition circuitry can take advantage of a real-time voice trigger (no need for data storage) to enable the best power optimization with the lowest silicon area.

Figure 1: Latency of the real-time voice detection algorithm

Detection latency vs. language characteristics

To ensure the best recognition performance, the voice trigger must detect voice activity as fast as possible. Only a minimal percentage of the first phoneme (i.e. the perceptually distinct unit of sound that distinguishes one word from another) can be missed without degrading recognition performance. Phonemes have diverse lengths and forms, which is why keywords are often chosen for their spelling characteristics. The following figure shows how the "OK Google" keyword is recognized when its first phoneme has been shortened.

Figure 2: Percentage of keyword recognition depending on first-phoneme truncation

When the voice triggering system is coupled with a recognition algorithm, the first phoneme should be shortened as little as possible in order to maintain the full performance of the recognition algorithm. Therefore, a Pass or Fail criterion must be defined as the maximum detection latency that remains acceptable before the recognition performance degrades. Regarding the impact on voice processing, the detection latency has to be expressed as a percentage of the first phoneme. According to tests made by Dolphin Integration's sound experts on several recognition algorithms, the Pass criterion has been set at less than 40% of the first phoneme's duration. Within this limit, the first phoneme keeps its characteristics, remains intelligible to a voice-processing algorithm, and the word is not degraded.

And what about unwanted detections?

Another challenge is to properly define the false detection rate. A false detection occurs due to background sounds, without any speech. Thus, it would not be relevant to relate the number of false detections to a total number of words, as this figure would be distorted in proportion to the density of speech, false detections being hidden by voice activity. It is more relevant to state the false detection rate as the percentage of time during which the application is uselessly awoken in the presence of noise. On the one hand, this figure gives a more relevant image of the false detection rate; on the other, it also indicates the impact on power consumption at the system level. We can summarize the main voice triggering performances as:

Voice Detected as Voice

VDV = (number of detected words*) / (total number of words)

Noise Detected as Voice

NDV = (duration of false detections**) / (total duration of a given benchmark)

Voice Trigger Efficiency

VTE = 0.5 × VDV + 0.5 × (1 − NDV)

* Word detected within 40% of its first phoneme
** False detection: no voice event occurs, but a voice event is detected

How to evaluate, specify and compare voice triggers

Today, no standard voice benchmarks or reference testbenches have been adopted. Whenever a fabless company needs to acquire a voice trigger solution, the buyer cannot rely on standard figures to evaluate the performances, nor fairly compare the different solutions available on the market. The user has no other choice than to test the system in real conditions. To deal with this issue, some methodologies already exist, which include noises and sentences, but they mainly target different usages or applications. Their output figures are given as speech hit-rate and non-speech hit-rate for all the words inside the sentences, which is not relevant for a voice trigger application. Indeed, the performances must rely on the detection of the first word at the beginning of a speech, or on specific keywords, and more precisely on the first phoneme of these words.

The first voice trigger corpus of benchmarks: MIWOK

With more than 25 years of experience in advanced audio, Dolphin Integration drives voice triggering forward with the release of its own corpus of benchmarks, MIWOK, specifically developed to evaluate and compare voice triggering solutions running in always-on mode.
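The three figures defined above can be computed mechanically from benchmark logs. The following is a minimal sketch (function and variable names are illustrative, not part of the MIWOK tooling); the `passes_40pct` helper encodes the footnote's Pass criterion:

```python
def passes_40pct(latency_s, first_phoneme_s, max_fraction=0.40):
    """Pass criterion from the article: the trigger must fire within
    40% of the first phoneme's duration."""
    return 0.0 <= latency_s <= max_fraction * first_phoneme_s

def voice_trigger_metrics(detected_words, total_words,
                          false_detection_s, benchmark_s):
    """Compute the three voice-trigger figures defined above.

    detected_words    -- words detected within 40% of their first phoneme
    total_words       -- words in the benchmark
    false_detection_s -- cumulative time spent falsely awake, in seconds
    benchmark_s       -- total duration of the benchmark, in seconds
    """
    vdv = detected_words / total_words     # Voice Detected as Voice
    ndv = false_detection_s / benchmark_s  # Noise Detected as Voice
    vte = 0.5 * vdv + 0.5 * (1.0 - ndv)    # Voice Trigger Efficiency
    return vdv, ndv, vte

# Hypothetical run: 172 of 200 words caught in time, 18 s of false
# wake-ups over a 15-minute benchmark.
vdv, ndv, vte = voice_trigger_metrics(172, 200, 18.0, 900.0)
print(f"VDV={vdv:.2%}  NDV={ndv:.2%}  VTE={vte:.2%}")
# VDV=86.00%  NDV=2.00%  VTE=92.00%

print(passes_40pct(0.030, 0.120))  # True: 30 ms is 25% of a 120 ms phoneme
print(passes_40pct(0.060, 0.120))  # False: 60 ms is 50% of it
```

Note that because NDV is a fraction of time rather than a fraction of words, the VTE weighting stays meaningful even for benchmarks with sparse speech.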

Figure 3: Test of voice triggering applications with MIWOK

The MIWOK Corpus contains a set of words, background noises and tools, freely available through Dolphin Integration's website, which aims at providing reference benchmarks and procedures to evaluate the performance of any voice triggering system. Its first release is in the Chinese language. It includes more than one hundred words, recorded by three different people, and more than a dozen background noises (street, airport, station, home, car, etc.). The selection of the application context (near-field or far-field detection) and the mix of words vs. background noises is automatically computed from a GUI.

Figure 4: Benchmark input is the mix of two .wav frames: background noise (top) and word sample (bottom)

The MIWOK Corpus also provides a comprehensive methodology and detailed instructions to run simulations or measurements in a reproducible and fair manner. It also enables easy extraction of the performance figures. Statistical data regarding true and false detections are sorted for each .wav pattern, and final figures are given for the whole benchmark.

Dolphin Integration offers high-performance voice trigger IPs

Dolphin Integration is pushing the boundaries of voice triggering by offering two types of voice trigger IP as hard IP. This type of implementation represents a breakthrough on the market, in contrast to standard DSP-based voice detection features. The combination of a position early in the acquisition chain with real-time detection results in extra-low power consumption.
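The benchmark-input mixing shown in Figure 4 above, overlaying a word sample onto a background-noise bed, can be sketched on raw 16-bit PCM sample lists. This is an illustrative reimplementation, not MIWOK's actual tooling; the `noise_gain` parameter and function name are assumptions:

```python
def mix_frames(word, noise, noise_gain=0.5):
    """Overlay a word sample onto a (longer) background-noise bed.

    word, noise -- sequences of 16-bit PCM samples at the same rate
    noise_gain  -- linear attenuation applied to the noise bed
    Returns the mixed track, clipped to the signed 16-bit range.
    """
    if len(noise) < len(word):
        raise ValueError("noise bed must cover the whole word")
    mixed = []
    for i, n in enumerate(noise):
        s = n * noise_gain + (word[i] if i < len(word) else 0)
        # Clip to the 16-bit PCM range instead of wrapping around.
        mixed.append(max(-32768, min(32767, int(s))))
    return mixed

track = mix_frames([1000, -1000, 500], [200, 200, 200, 200])
print(track)  # [1100, -900, 600, 100]
```

In a real benchmark run the two inputs would come from .wav files (e.g. via Python's `wave` module) and the noise gain would be chosen to reach the target signal-to-noise ratio of the near-field or far-field scenario.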

Figure 5: Architecture comparison of voice triggering solutions

1) Digital voice trigger: WhisperTrigger-D

This voice trigger is suitable for interfacing with a digital PDM microphone or with any PDM output of an audio converter. It is a unique algorithm that takes advantage of the best state-of-the-art voice activity detection techniques.

Figure 6: Distribution of the detection latency as a percentage of the first phoneme (Far-field detection benchmark of MIWOK-C)

2) Analog voice trigger: WhisperTrigger-A

This voice trigger is suitable for interfacing with an analog microphone (ECM or MEMS). During

always-on operation, the WhisperTrigger-A saves the power consumption of the whole audio recording chain. In other words, the mixed-signal audio chain (amplifier and ADC) and the DSP can stay in sleep mode while the voice trigger is listening.

Final results are given here for the Far-field detection (e.g. STB, DTV) benchmark of MIWOK-C:

        WhisperTrigger-A   WhisperTrigger-D
VDV     85.64%             85.77%
NDV     2.02%              7.03%
VTE     91.81%             89.37%

Conclusion

Meeting today's low-power challenges for applications embedding a voice-awakening feature requires specific performances and well-defined test parameters. The final specification is a weighted figure combining detection latency, false detection rate and true detection rate. To help IC designers and IP procurement managers assess and compare voice trigger solutions, Dolphin Integration commits to offering the first corpus of benchmarks, MIWOK, dedicated to this functionality. Dolphin Integration's voice triggers have been tested with the proposed benchmarks and have been successfully embedded in several ICs on the market for a broad range of applications.

Explore Dolphin Integration IP here:

scoda96-h1-lb-io-vd.01 (28 nm)
scods100-lb-io-n.12 (40 nm)
scodp-mt1-vd.02 (130 nm)

Please download the MIWOK-C Corpus for free at the following address:
http://www.dolphin.fr/index.php/silicon_ip/ip_products/triggers/overview
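As a sanity check, the VTE row of the results table above follows directly from the VDV and NDV rows via the equal weighting defined earlier in this article:

```python
def vte(vdv, ndv):
    """Voice Trigger Efficiency: equal weighting of the true-detection
    rate and the complement of the false-detection rate."""
    return 0.5 * vdv + 0.5 * (1.0 - ndv)

# Figures from the Far-field MIWOK-C benchmark table above.
print(f"WhisperTrigger-A: {vte(0.8564, 0.0202):.2%}")  # 91.81%
print(f"WhisperTrigger-D: {vte(0.8577, 0.0703):.2%}")  # 89.37%
```

The analog trigger scores a higher VTE despite a nearly identical VDV because its false-detection time (NDV) is more than three times lower.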

About the authors:

Paul Giletti - Audio product line manager, Dolphin Integration
After a brief experience in I/O and physical design (2003-2004), Paul Giletti became an analog design engineer at Dolphin Integration. Specialized in delta-sigma converters and audio power amplifiers, his R&D work mainly focuses on high-density and low-power design for high-performance audio applications. He graduated in electronics and signal processing from the Polytechnical National Institute of Toulouse (INPT-ENSEEIHT, 2003).

Vincent Richard - Audio product marketing manager, Dolphin Integration
Vincent Richard is Product Marketing Manager for Audio IP. He joined Dolphin Integration in 2012 after a master's degree in Management of Technologies and Innovation at Grenoble Business School. He is in charge of the product definition and promotion of audio and measurement silicon IPs. He also holds a Master's degree in Microelectronic Engineering, specialized in integrated circuit design.

Main contributors of the MIWOK Corpus:
Laurent Sauzéat - Digital design engineer, Dolphin Integration
Julien Gilleron - Analog design engineer, Dolphin Integration