Auditory Mood Detection for Social and Educational Robots


Paul Ruvolo, Ian Fasel, and Javier Movellan
University of California San Diego
{paul, ianfasel, movellan}@mplab.ucsd.edu

Abstract
Social robots face the fundamental challenge of detecting and adapting their behavior to the current social mood. For example, robots that assist teachers in early education must choose different behaviors depending on whether the children are crying, laughing, sleeping, or singing songs. Interactive robotic applications require perceptual algorithms that both run in real time and are adaptable to the challenging conditions of daily life. This paper explores a novel approach to auditory mood detection which was born out of our experience immersing social robots in classroom environments. We propose a new set of low-level spectral contrast features that extends a class of features which have proven very successful for object recognition in the modern computer vision literature. Features are selected and combined using machine learning approaches so as to make decisions about the ongoing auditory mood. We demonstrate excellent performance on two standard emotional speech databases (the Berlin Emotional Speech Database [1], and the ORATOR dataset [2]). In addition we establish strong baseline performance for mood detection on a database collected from a social robot immersed in a classroom of 18- months old children [3]. This approach operates in real time at little computational cost. It has the potential to greatly enhance the effectiveness of social robots in daily life environments.

I. MOTIVATION
The development of social robots brings a wealth of scientific questions and technological challenges to the robotics community [], [], [6], [7], [8], [9], []. Social environments are complex, highly uncertain, and rapidly evolving, requiring subtle adaptations at multiple time-scales. A case in point is the use of robots in early childhood education, an area of research that we have been pursuing for the past 3 years as part of the RUBI project [3]. At any moment the students in a classroom may be crying, laughing, dancing, sleeping, overly excited, or bored. Depending on the mood, the robot must choose different behaviors so as to assist the teachers in achieving their educational goals. In addition, much of the work of teachers in early education occurs at mood transition times, e.g., the transition from play time to sleep time.
Social robots capable of recognizing the current mood could potentially assist the teachers during these transition periods. The goal of the RUBI project is to explore the potential use of social robots to assist teachers in early childhood education environments [3], [11]. As part of the project, for the last three years we have conducted more than 00 hours of field studies immersing social robots at the Early Childhood Education Center at UCSD. A critical aspect of these field studies is to identify perceptual problems that social robots may face in such environments and to develop perceptual primitives for such problems. One of the phenomena we identified early on is that over the course of a day the mood of the classroom goes through dramatic changes, and that much of the work of the teacher occurs when they need to maintain a desired mood or make mood transitions, e.g. transitioning from play time to nap time. Human teachers are indeed masters at detecting, influencing, and operating on classroom moods. As such we identified detection of such moods as a critical perceptual primitive for social robots.

Here we investigate a novel approach to detecting social mood based on auditory information. The proposed approach emerged out of our previous experience developing visual perception primitives for social robots, and the realization of the critical role that auditory mood plays in early childhood education settings. In the following sections we describe the proposed approach, evaluate it using two standard datasets from the emotion recognition literature, and finally test it on a mood detection task for a social robot immersed in an early childhood education center.

Fig. 1. Two of the robots developed as part of the RUBI project. Top: RUBI-1, the first prototype, was for the most part remote controlled. Bottom: RUBI-3 (Asobo), the third prototype, teaches children autonomously for weeks at a time.

II. AUTOMATIC RECOGNITION OF AUDIO CATEGORIES
Recognition of audio categories has recently become an active area of research in both the machine perception and robotics communities. Problems of interest include recognition of emotion in a user's voice, music genre classification, language identification, person identification, and in our case, auditory mood recognition. The robotics community has also recognized the importance of this area of research. For example, in [1] auditory information is used to determine the environment a robot is operating in (e.g. street, elevator, hallway). Formally all these problems reduce to predicting a category label for audio samples and thus are a prime target for modern machine learning methods.

In this paper we explore an approach to recognition of auditory categories inspired by machine learning methods that have recently revolutionized the computer vision literature [13], [1]. It is interesting to note that historically machine perception in the auditory and visual domains has evolved in similar ways. Early approaches to object detection in computer vision were typically based on compositions of a small set of high-level, hand-coded feature detectors. For example, human faces were found by combining the output of hand-coded detectors of eyes and other facial features [1]. Instead, modern approaches rely on a large collection of simple low-level features that are selected and composed using machine learning methods. Similarly, much of the pioneering work on recognition of auditory categories was initially based on the composition of a small collection of hand-coded high-level features (e.g., pitch detection, glottal velocity detection, formant detection, syllable segmentation) [16], [17], [18]. An alternative approach, which is the one we explore in this document, is based on the use of machine learning methods on a large collection of simple, light-weight features. While such methods have been recently explored with some success [19], here we introduce significant changes. For example, while [19] uses learning methods to select from a pool of 76 low-level audio features, here we utilize a new collection of , 000, 000 light-weight spatio-temporal box filters, orders of magnitude larger than what has appeared in the previous literature.

The potential advantage of the approach proposed here is three-fold: (1) It removes the need to engineer domain specific features, such as glottal velocity, that apply only to human speech. This characteristic is vital for auditory mood detection, in which the salient auditory phenomena are not constrained to human speech. (2) The approach relies on general purpose machine learning methods and thus could be applied to a wide variety of tasks and category distinctions. (3) The pool of auditory features was designed to be computationally lightweight and to afford real-time detection on current hardware, a critical issue for social robot applications.

Figure 2 describes the steps involved in the proposed approach. First the auditory signal is preprocessed and converted into a Sonogram, which is an image-like representation of the acoustic signal. A bank of spatio-temporal box filters is then applied to the Sonogram image and combined to make a set of binary classifiers.
The output of these classifiers is then combined into an n-category classifier.

III. FRONT END: AUDITORY SIGNAL PROCESSING
We use a popular auditory processing front end, motivated by human psychoacoustic phenomena. It converts the raw audio signal into a 2-dimensional Sonogram, where one dimension is time and the other is frequency band, and the value for each time-frequency combination is the perceived loudness of the sound. To obtain the Sonogram, Short Term Fast Fourier Transforms (STFT) are first computed over 0 millisecond windows overlapping by ms and modulated by a Hanning function. The energy of the different frequencies is then integrated into bands according to the Bark model [], which uses narrow bands in low frequency regions and broader bands in high frequency regions. The energy values from the Bark bands are then transformed into psychoacoustical Sone units of perceived loudness. This is done by transforming the energy of each band into decibels, transforming the decibel values into Phon units using the Fletcher-Munson equal-loudness curves [], and finally applying the standard phon-to-sone non-linearity to convert into Sone units []. The main advantage of working with Sone units is that they are directly proportional to the subjective impression of loudness in humans []. The result of these transformations is a 2-d, image-like representation of the original signal. An example of a transformed audio signal is shown in figure 3.

Train-Time Algorithm
1) Compute the 2-d Sonogram image from the raw audio signal.
2) Use GentleBoost to choose a set of Spatio-Temporal Box Filters to solve multiple binary classification problems.
3) Combine the output of the binary classifiers using multinomial logistic regression to produce an n-category classifier.

Run-Time Algorithm
1) Compute the 2-d Sonogram image from the raw audio signal.
2) Apply the bank of Spatio-Temporal Box Filters selected during the training process.
3) Combine the output of the filters into binary classifiers.
4) Combine the output of the binary classifiers into an n-category classifier.

Fig. 2: General Description of the Approach at Train-Time and Run-Time

IV. SPATIO-TEMPORAL BOX FILTERS
Box filters [1], [], [3] are characterized by a rectangular, box-like kernel, a property that makes their implementation in digital computers very efficient. Their main advantage over other filtering approaches, such as those involving Fourier Transforms, is apparent when non shift-invariant filtering operations are required [3]. Box filters became popular in the computer graphics community [1], [], [3] and have recently become one of the most popular features used in machine learning approaches to computer vision [13].
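The efficiency of rectangular kernels mentioned above is usually obtained with a summed-area table (integral image), the device used for box features in the computer vision literature. The following is a minimal sketch under that assumption, not the authors' implementation: the function names, the toy Sonogram size, and the choice of "later box minus earlier box" as the sign convention are all illustrative.

```python
import numpy as np

def integral_image(sonogram):
    """Summed-area table with a leading row and column of zeros."""
    ii = np.zeros((sonogram.shape[0] + 1, sonogram.shape[1] + 1))
    ii[1:, 1:] = sonogram.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, band_lo, band_hi, t_lo, t_hi):
    """Sum of sonogram[band_lo:band_hi, t_lo:t_hi] using four table lookups."""
    return (ii[band_hi, t_hi] - ii[band_lo, t_hi]
            - ii[band_hi, t_lo] + ii[band_lo, t_lo])

def two_box_filter(ii, band_lo, band_hi, t_lo, t_mid, t_hi):
    """Later box minus earlier box: a coarse time derivative of loudness
    over a range of frequency bands (sign convention is an assumption)."""
    later = box_sum(ii, band_lo, band_hi, t_mid, t_hi)
    earlier = box_sum(ii, band_lo, band_hi, t_lo, t_mid)
    return later - earlier

# Toy Sonogram: 20 frequency bands x 100 time frames of random loudness values.
rng = np.random.default_rng(0)
sonogram = rng.random((20, 100))
ii = integral_image(sonogram)
response = two_box_filter(ii, band_lo=2, band_hi=7, t_lo=10, t_mid=20, t_hi=30)
```

Once the table is built, every box sum costs four lookups regardless of the box size, which is what makes it practical to evaluate a very large pool of candidate filters.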
In this paper we propose a spatio-temporal generalization of box filters for real-time machine perception problems in the auditory domain. The filters are designed to capture critical properties of signals in the auditory domain. The first is periodic sampling in time, to capture properties such as beat, rhythm, and cadence. The second is the temporal integration of filter outputs via five summary statistics: mean, min, max, standard deviation, and quadrature pair. All but the last are self-explanatory. Quadrature pairs are a popular approach in the signal processing literature to detect modulation patterns in a phase independent manner. In our case each box filter has a quadrature pair which is identical to the original box filter but phase shifted by half a period. Each of these summary statistics can be seen as a way of converting local evidence of the auditory category into a global estimate.

Fig. 3: Depicted above is the original 1-d temporal audio signal (left), the Sonogram (middle), and a box filter applied to the Sonogram (right). The output of the filter serves as the input to the learning framework described in section IV-A. [Panels: PCM signal, amplitude versus time (sec); Sonogram, critical band (Bark) versus time, in Sones; filter applied to the Sonogram.]

We use six types of box filter configurations (see Figure ). The specific configuration of the box filters explored in this document is taken directly from the computer vision literature [13], because they appear to compute quantities important for describing a Sonogram. In the vision literature, the response of a box filter to an image patch is given by the sum of the pixel brightnesses in the white area minus the sum of the pixel brightnesses in the black area (pixels not encompassed by the box filter are ignored). Similarly, the response of a box filter to a portion of a Sonogram is the sum of the spectral energies of the frequency/time cells that fall in the white region minus the sum of the spectral energies of the cells that fall in the black region. In the auditory domain these filters compute partial derivatives of the spectral energy with respect to time or frequency band. For instance, one filter type computes the partial derivative of loudness with respect to time in a particular frequency band, filters of type 3 compute the second partial derivative with respect to frequency and time, and another type computes the partial derivative of loudness with respect to frequency at a specific time location. These low-level time and frequency derivatives have been shown to be useful features in sound classification.

Fig. : This figure depicts the various parameters that characterize each spatio-temporal box filter. Each of the five basic Viola Jones feature types, plus a center-surround kernel, is shown. The response of each repetition of the simple filter yields a time series that is fed into the summary statistic specific to the particular spatio-temporal feature. [Panel labels: basic box filter type, sampling interval, phase, summary statistic (mean, min, max, quadrature pair).]

Fig. : Shown above are several examples of spatio-temporal box filters. Each of the six basic features is shown. For each simple filter, the pixels in the black rectangle are subtracted from the pixels in the white rectangle. The response of each repetition of the simple filter yields a time series that is fed into the summary statistic specific to the particular spatio-temporal feature.

Figure  shows one of the extensions studied in this document, in this case a simple filter periodically applied to a Sonogram. The total number of features used in this work is approximately , 000, 000. All combinations of summary statistics, sampling intervals, and , 000 basic box filters are considered.

A. Training
We use GentleBoost [] to construct a strong classifier that combines a subset of all possible box filters. GentleBoost is a popular method for sequential maximum likelihood estimation and feature selection. At each round of boosting, a transfer function, or tuning curve, is constructed for each box filter, which maps the feature response to a real number in [-1, 1]. Each tuning curve is computed using non-parametric regression methods to be the optimal tuning curve for the corresponding box filter at this round of boosting (see [] for details). The feature + tuning curve that yields the best improvement in the GentleBoost loss function is then added into the ensemble, and the process repeats until performance no longer improves on a holdout set. In this way, GentleBoost simultaneously builds a classifier and selects a subset of good box filters. At each round of boosting, an optimal tuning curve is constructed and the training loss is computed for each feature under consideration for being added to the ensemble. To speed up the search for the best feature to add (since a brute-force search through all 6 possible features would be very expensive) we employ a search procedure known as Tabu Search [6]. First, a random set of n filters is selected and

evaluated on the training set, and these are used to initialize the tabu list of filters already evaluated in this round. The top k of these n filters are then used as the starting points for a series of local searches. From each starting filter, a set of new candidate filters is generated by replicating the filter and slightly modifying its parameters (sampling interval, phase, etc.). If the best feature from this set improves the loss, that feature is retained and the local search is repeated until a local optimum is reached. The amount of time needed to train a classifier scales linearly with the number of examples. On a standard desktop computer it takes approximately 1 hour to train a classifier on a dataset of audio that is roughly 0 minutes in length.

V. EVALUATION
In order to benchmark the proposed approach we performed experiments on two standard datasets of emotional speech and one on data we collected ourselves from a preschool. Once we confirmed that the approach produced competitive performance we evaluated it on a mood detection task for a social robot immersed at an early childhood education center.

A. Recognition of Emotion from Speech: Berlin Dataset
First the system was tested on the Berlin Emotional Database [1]. The dataset consists of acted emotion from five female and five male German actors. Each utterance in the database was classified by human labelers into seven emotional categories: anger, boredom, disgust, fear, joy, neutral, and sadness. Five long utterances and five short utterances are given by each speaker for each of the seven emotional categories. Speech samples that were correctly classified by at least 80% of the human labelers were selected for training and testing. To ensure speaker independence, we performed 10-fold leave one speaker out cross validation. That is, we trained our system 10 times, each time leaving one speaker out of the training set and testing performance on the speaker left out. Each classifier consisted of 1 box filters selected by the GentleBoost algorithm. In order to make a multi-class decision, we trained all possible non-empty subsets of emotions versus the rest. For a seven-way classification experiment this makes a total of 63 binary classifiers. To make the final classification decision, multinomial ridge logistic regression [7] was applied to the continuous outputs of each of the 63 binary detectors. The confusion matrix of the final system on the hold out set is presented in table I. The overall recognition rate on this seven-way classification task was 7.3%. These results are superior to several other published approaches [8].
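The count of 63 binary problems can be checked by enumeration. One reading that reproduces it (an assumption about how the subsets were counted, not a statement from the authors) is that a subset and its complement define the same binary task, so each split of the seven labels into two non-empty groups is counted once. The helper name below is hypothetical.

```python
from itertools import combinations

EMOTIONS = ["anger", "boredom", "disgust", "fear", "joy", "neutral", "sadness"]

def subset_vs_rest_tasks(labels):
    """List every split of the labels into two non-empty groups, once each.

    Fixing the first label on the positive side prevents a subset and its
    complement from being counted as two different tasks.
    """
    anchor, rest = labels[0], labels[1:]
    tasks = []
    # Take 0 .. len(rest)-1 extra labels so the negative side is never empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            positive = frozenset((anchor,) + extra)
            negative = frozenset(labels) - positive
            tasks.append((positive, negative))
    return tasks

tasks = subset_vs_rest_tasks(EMOTIONS)
print(len(tasks))  # 63, i.e. (2**7 - 2) / 2
```

Each of these splits would get its own boosted detector, and the continuous outputs of all of them form the input vector for the final multinomial logistic regression.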
Although it falls short of the best performance in the literature of 8.7% [9], we believe this is because the work in [9] used many optimizations to tailor their system to the classification of human speech, a route that we wish to avoid for the sake of generality. Also note that our approach is quite novel, and performed well despite this being our first attempt to employ these features. Thus we believe this approach shows great potential for improvement as we begin exploring the parameters of the technique in greater detail. The lightweight nature of the box filters allows us to easily compute each of the 63 classifiers in real time. Even using an inefficient run-time implementation, this system can provide an estimate of the current emotion every 0 ms on a standard desktop.

B. Determining Emotional Intensity: ORATOR Dataset
The ORATOR dataset [2] contains audio from 13 actors and 1 non-actors reciting a monologue in German. The actors were instructed to deliver the monologue as if they were in a variety of settings, such as talking to a close friend or delivering a speech. The non-actors spoke spontaneously. Contrary to the Berlin dataset, in ORATOR specific emotion categories were not explicitly prompted, but rather were situationally based. Single sentence segments of the original monologue were labeled by non-German speaking native English speakers. Each labeler was asked to rate the speech sample on seven different emotional dimensions: agitation, anger, confidence, happiness, leadership, pleasantness, and strength. This highlights another key difference between Berlin and ORATOR. In the Berlin dataset each audio clip belongs to one of a mutually exclusive set of emotions; in ORATOR, however, the emotions are not mutually exclusive, with each monologue being rated on a continuum for each emotional dimension. The resulting dataset consists of 10 audio samples of approximately 6 seconds each, rated by a panel of labelers on 7 different emotional dimensions.

We trained a series of binary detectors to distinguish the top n versus the bottom n samples of each emotional dimension. By increasing the value of n the task becomes harder, since the system is forced to correctly discriminate more subtle differences. We used two different values of n, which correspond to using one third and two thirds of the original samples respectively. The consensus label for each sample was computed by taking the mean judgment across all labelers. Table II shows the results of the 1 binary classification experiments.
Our approach shows performance comparable to that of the average human labeler on each task, which is considerably better than the previously reported performance on this dataset []. In addition to being able to successfully place a binary label on each emotional axis, the approach also achieved human-like performance at estimating continuous emotional intensity, i.e., the correlation coefficients between the detector outputs and the continuous emotion labels on a hold-out set were comparable to those of individual coders (see Table III). This ability is crucial for social robotics applications where the degree of a specific social mood is desired. We computed several descriptive statistics of the learned features for solving this task. The most popular temporal integration function is the mean, followed by the quadrature pair. This suggests that some form of phase invariance may be critical for learning the emotional characteristics of speech. The most popular frequency bands were in the range of 0 0 Hertz, which contains the pitch of the average conversational male and female voice.
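The labeling scheme used for the ORATOR experiments (a consensus label computed as the mean judgment across labelers, a top-n versus bottom-n split, and a correlation between detector output and consensus intensity) can be sketched as follows. This is an illustrative sketch with made-up toy ratings; the function names and the rating scale are assumptions, not details taken from the dataset.

```python
import numpy as np

def consensus_labels(ratings):
    """Mean rating per sample; ratings has shape (num_labelers, num_samples)."""
    return np.asarray(ratings, dtype=float).mean(axis=0)

def top_bottom_split(consensus, n):
    """Indices of the n lowest-rated and the n highest-rated samples."""
    if n < 1 or 2 * n > len(consensus):
        raise ValueError("n must allow two disjoint, non-empty groups")
    order = np.argsort(consensus, kind="stable")
    return order[:n], order[-n:]

def intensity_correlation(detector_outputs, consensus):
    """Pearson correlation between detector outputs and consensus intensity."""
    return float(np.corrcoef(detector_outputs, consensus)[0, 1])

# Toy data: 3 hypothetical labelers rating 6 samples on one emotional dimension.
ratings = np.array([
    [1, 5, 3, 7, 2, 6],
    [2, 4, 3, 6, 1, 7],
    [1, 6, 4, 7, 2, 5],
])
consensus = consensus_labels(ratings)
bottom, top = top_bottom_split(consensus, n=2)
```

Growing n pulls samples with intermediate consensus ratings into both groups, which is why the binary task becomes harder as n increases.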

           Anger   Boredom  Disgust  Fear    Happy   Neutral  Sadness
Anger      .901    0        0        .038    .071    0        0
Boredom    .081    .7817    .01      .009    0       .0677    .07
Disgust    .19     .089     .6061    .03     .019    .078     .099
Fear       .0909   0        .0166    .678    .030    .10      .0939
Happy      .378    0        .018     .0370   .80     .089     0
Neutral    .019    .133     0        .01     .0      .781     0
Sadness    0       .093     .01      0       0       .031     .866

TABLE I
A CONFUSION MATRIX FOR THE BERLIN EMO DATABASE. THE CELL IN THE ITH ROW AND JTH COLUMN REPRESENTS THE FRACTION OF SAMPLES OF EMOTION I CLASSIFIED AS EMOTION J. THE RECOGNITION RATE USING 10-FOLD LEAVE ONE SPEAKER OUT CROSS VALIDATION IS 7.3%.

                          agitated  angry  confident  happy  leadership  pleasant  strong
Box filters ( vs. )       .133      .033   .133       .      .1          .         .133
Average labeler ( vs. )   .08       .1     .13        .16    .           .11       .1
Box filters (0 vs. 0)     .166      .33    .166       .66    .33         .33       .1833
Average labeler ( vs. )   .188      .      .183       .      .181        .90       .11

TABLE II
COMPARISON ON THE ORATOR DATASET OF THE PERFORMANCE OF VARIOUS APPROACHES ON THE BINARY CLASSIFICATION TASK OF RECOGNIZING THE TOP N EXAMPLES OF A SPECIFIC EMOTIONAL CATEGORY FROM THE BOTTOM N EXAMPLES OF THAT CATEGORY. THE QUANTITY REPORTED IS THE BALANCED ERROR RATE (THE PERCENT CORRECT WHEN THE TRUE POSITIVE RATE EQUALS THE TRUE NEGATIVE RATE). NOTE THAT LOWER NUMBERS ARE BETTER SINCE THAT IMPLIES A PARTICULAR APPROACH WAS BETTER ABLE TO MODEL THE CONSENSUS OF THE HUMAN LABELERS.

                  agitated  angry  confident  happy  leadership  pleasant  strong
Box filters       .8        .6     .3         .      .           .3        .
Average labeler   .3        .3     .9         .3     .           .3        .1

TABLE III
THE FIRST ROW SHOWS THE CORRELATION COEFFICIENT BETWEEN THE OUTPUT OF THE TRAINED CLASSIFIER AND THE AVERAGE INTENSITY ASSIGNED BY THE LABELERS (RECALL THAT EACH LABELER PROVIDED AN ESTIMATE OF INTENSITY FOR EACH EMOTIONAL DIMENSION). THE SECOND ROW SHOWS THE AVERAGE CORRELATION COEFFICIENT BETWEEN THE INTENSITY RATING OF A PARTICULAR LABELER AND THE AVERAGE INTENSITY ASSIGNED BY ALL OF THE LABELERS.

C. Detecting Mood in a Preschool Environment
The original motivation for our work was to develop perceptual primitives for social robots. Here we present a pilot study to evaluate the performance of our approach in an actual robot setting.
The study was conducted at Room 1 of the Early Childhood Education Center (ECEC) at UCSD and it was part of the RUBI project, whose goal is to explore the use of social robots in early childhood education. The experiment was conducted on a robot, named Asobo, that has been autonomously operating in Room 1 of ECEC for weeks at a time, teaching the children materials targeted by the California Department of Education. Through discussions with the teachers at Room 1 we identified three basic moods: crying, singing / dancing, and background (everything else). Detection of these moods could result in new robot abilities with tangible benefits: (1) the robot could help reduce crying; (2) the robot could help improve the atmosphere in the classroom by dancing and singing when other children are dancing and singing; (3) the robot could avoid inappropriate behaviors, like dancing and singing when the teachers are reading to the children.

We collected a database of audio from one full day at ECEC and coded it into the three moods described above. We extracted non-overlapping audio segments of eight seconds each. There were 79 examples of crying, 7 examples of playing and singing, and 11 examples from the background category. We used 80% of each of these categories for training and 20% for testing. In order to test the time/accuracy tradeoff, we ran our detector on intervals of various lengths sampled from the test set. For instance, to test the performance using seconds of audio, a sliding window of duration seconds was slid over all of the audio samples in the test set. The figure shows the time/accuracy tradeoff function of the system. With 8-second audio segments, the system achieves an accuracy of 90%. As expected, the performance declines if shorter audio segments are used, and it is basically at chance with segments of less than 600 milliseconds. The obtained levels of performance are very encouraging considering that this was a non-trivial task in a very challenging environment.

[Figure: Accuracy Tradeoff Function. Vertical axis: classification accuracy; horizontal axis: response (seconds).] The performance on the three-way classification task as a function of the time interval of sound used for classification. Using approximately half a second for classification results in a decision slightly above chance. The maximum performance is attained when using a 7.8-second sample.

We are currently in the process of developing new behaviors for Asobo to respond to the perceived mood. Preliminary anecdotal evidence is encouraging. For example, we observed that a child immediately stopped crying when Asobo asked "Are you OK?". This behavior could be made more effective if a system to localize the source of audio signals were integrated into the system [30]. In this case Asobo could direct his gaze to the crying child before offering his concern. In addition, the mood detector could also potentially be used for robots to learn on their own how to behave so as to accomplish classroom goals. For example, a reduction in crying and an increase in playing could be used as a reinforcement signal for the robot to learn to improve the atmosphere of the classroom.

VI. CONCLUSION

We identified automatic recognition of mood as a critical perceptual primitive for social robots, and proposed a novel approach for auditory mood detection. The approach was inspired by machine learning techniques for object recognition that have recently proven so successful in the visual domain. We proposed a family of spatio-temporal box filters that differ in terms of kernel, temporal integration method, and tuning. The advantage of the proposed approach is that it removes the need to engineer high-level, domain-specific feature detectors, such as glottal velocity detectors, that apply only to human speech. Instead we let machine learning methods select and combine light-weight, low-level features from a large collection. In addition, the filters are computationally efficient, thus allowing real-time mood detection at little computational cost, an aspect critical for robot applications. The approach provided excellent performance on the problem of recognizing emotional categories in human speech, comparing favorably to previous approaches in terms of accuracy while being much more general. A pilot study in a classroom environment also confirmed the very promising performance of the approach. In the near future, and as part of the RUBI project, we are planning to incorporate the mood detector into the robot Asobo to operate as a sort of social "Moodstat", i.e. a device that helps achieve a desired mood, in a way analogous to how thermostats help maintain a desired temperature level.

REFERENCES

[1] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss, "A database of German emotional speech," Interspeech Proceedings, 0.
[2] H. Quast, "Automatic recognition of nonverbal speech: An approach to model the perception of para- and extralinguistic vocal communication with neural networks," Master's thesis, University of Gottingen, 2001.
[3] J. Movellan, I. Fasel, F. Tanaka, C. Taylor, P. Ruvolo, and M. Eckhardt, "The RUBI project: a progress report," Human Robot Interaction (HRI), Washington, D.C., 2007.
[4] R. W. Picard, Affective Computing. MIT Press, 1997.
[5] R. A. Brooks, C. Breazeal, M. Marjanovic, B. Scassellati, and M. M. Williamson, "The Cog Project: Building a Humanoid Robot," Lecture Notes in Artificial Intelligence, vol. 16, pp. 87, 1999.
[6] C. Breazeal, Designing Sociable Robots. Cambridge, MA: MIT Press, 0.
[7] T. Fong, I. Nourbakhsh, and K. Dautenhahn, "A Survey of Socially Interactive Robots," Robotics and Autonomous Systems, vol., no. 3-, pp. 13 166, 2003.
[8] T. Kanda, T. Hirano, D. Eaton, and H. Ishiguro, "Interactive Robots as Social Partners and Peer Tutors for Children: A Field Trial," Human-Computer Interaction, vol. 19, no. 1-, pp. 61 8, 0.
[9] H. Kozima, C. Nakagawa, and Y. Yasuda, "Interactive Robots for Communication-care: A Case-study in Autism Therapy," Proceedings of the 0 IEEE International Workshop on Robot and Human Interactive Communication, 0, pp. 31 36.
[10] P. H. Kahn Jr., B. Friedman, D. R. Perez-Granados, and N. G. Freier, "Robotic Pets in the Lives of Preschool Children," Interaction Studies, vol. 7, no. 3, pp. 0 36, 2006.
[11] F. Tanaka, A. Cicourel, and J. R. Movellan, "Socialization between toddlers and robots at an early childhood education center," Proceedings of the National Academy of Sciences, In Press.
[12] S. Chu, S. Narayanan, C.-C. J. Kuo, and M. J. Matarić, "Where am I? Scene recognition for mobile robots using audio features," IEEE International Conference on Multimedia & Expo (ICME), 2006.
[13] P. Viola and M. Jones, "Robust real-time object detection," International Journal of Computer Vision, 0.
[14] I. Fasel and J. Movellan, "Segmental Boltzmann fields," (unpublished manuscript), 2007.
[15] K. C. Yow and R. Cipolla, "Probabilistic framework for perceptual grouping of features for human face detection," Proc. Second Intl Conf. Automatic Face and Gesture Recognition, vol., pp. 16 1, 1996.
[16] R. Fernandez and R. W. Picard, "Classical and novel discriminant features for affect recognition from speech," Interspeech Proceedings, 0.
[17] P. Mertens, "The prosogram: Semi-automatic transcription of prosody based on a tonal perception model," Proceedings of Speech Prosody, 0.
[18] M. Lang, B. Schuller, and G. Rigoll, "Hidden Markov model-based speech emotion recognition," Acoustics, Speech, and Signal Processing Proceedings, 2003.
[19] B. Schuller, S. Reiter, R. Müller, M. Al-Hames, M. Lang, and G. Rigoll, "Speaker independent speech emotion recognition by ensemble classification," Multimedia and Expo ICME, 0.
[20] H. Fastl and E. Zwicker, Psychoacoustics, Facts and Models. Springer-Verlag, Berlin Heidelberg, Germany, 1990.
[21] M. J. McDonnell, "Box-filtering techniques," Comput. Graph. Image Process., vol. 17, no. 1, 1981.
[22] J. Shen and S. Castan, "Fast approximate realization of linear filters by translating cascading sum-box technique," Proceedings of CVPR, pp. 678 680, 198.
[23] P. S. Heckbert, "Filtering by repeated integration," International Conference on Computer Graphics and Interactive Techniques, pp. 31 31, 1986.
[24] J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: a statistical view of boosting," Department of Statistics, Stanford University Technical Report, 1998.
[25] J. R. Movellan and I. R. Fasel, "A generative framework for real time object detection and classification," Computer Vision and Image Understanding, 0.
[26] F. W. Glover and M. Laguna, Tabu Search. Kluwer Academic Publishers, 1997.
[27] J. R. Movellan, "Tutorial on multinomial logistic regression," MPLab Tutorials. http://mplab.ucsd.edu, 2006.
[28] Z. Xiao, E. Dellandrea, W. Dou, and L. Chen, "Two-stage classification of emotional speech," International Conference on Digital Telecommunications, 2006.
[29] T. Vogt and E. André, "Improving automatic emotion recognition from speech via gender differentiation," Language Resources and Evaluation Conference, 2006.
[30] J. Hershey and J. Movellan, "Audio vision: Using audiovisual synchrony to locate sounds," 2000.