CITEC Central Lab Facilities
Performance Assessment and System Design in Human-Robot Interaction
Sven Wachsmuth, Bielefeld University
May 2011
What are the FLOPs of cognitive systems?
"... supercomputing power is already on the order of estimated human brain capacity, but intelligent or human-simulating machines do not yet exist. ..." [futurememes.blogspot.com/2010/04]
Beyond FLOPs
[Richard Murphy] says they designed the benchmark [Graph 500] to spur both researchers and industry toward mastering the architectural problems of next-generation supercomputers.
Limits of benchmarking
- Evaluation and benchmarking is an inherently multidimensional problem (how to define progress?)
- Benchmarks significantly influence the design of system architectures
- Evaluation metrics do not necessarily make us aware of architectural bottlenecks
- Benchmarks do not capture the richness of applications
Benchmarks need to be scalable [Perona, ICCV Workshop, 2007]
Limits of offline datasets
- Ground truth is not always easy to capture
- Image datasets ignore the acquisition step (sensing)
- Image datasets ignore the relevance of results
- Offline processing ignores system aspects
Focus on experimental studies:
- Need for live systems
- Need for live users/interaction partners
Human-Robot Interaction
Human-robot interaction scenarios:
- Home tour (navigation tasks / human-initiative teaching)
- Curious robot (manipulation tasks / mixed-initiative learning)
- Museum guide (assistive tasks / robot-initiative explanation)
Challenges in defining benchmarks
- How to measure progress? Multi-dimensionality, system complexity, small datasets
- How to define ground truth? User behavior is highly variable
- How to prevent architectural bottlenecks? Tests are task- and platform-specific
Evaluation criteria / interacting levels
- Human: user experience / user performance
- System: task performance
- Architecture: reliability / robustness / simplicity
- Components: accuracy / efficiency
Overview of methodologies (each level)
Interaction between levels: Systemic Interaction Analysis (SInA)
- Expectation-driven (based on video data)
- Define a prototypical script of the task
- Annotation & system logging
- Identify deviation patterns (statistical analysis / system analysis)
- Identify causes of deviation patterns (system and interaction level); estimate their impact
- Judge results; derive feedback changes, component changes, and architectural changes
Lohse, M., Hanheide, M., Pitsch, K., Rohlfing, K. J., and Sagerer, G. (2009). Improving HRI design by applying Systemic Interaction Analysis (SInA). Interaction Studies (Special Issue: Robots in the Wild: Exploring HRI in Naturalistic Environments), 10(3), John Benjamins Publishing Company, pp. 299-324.
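The core SInA step of comparing observed interaction events against a prototypical task script can be sketched as a sequence alignment. The sketch below is illustrative only (the event names are invented, and the published method works on video annotations, not on this toy alignment):

```python
# Illustrative sketch: report where an observed event sequence deviates
# from the prototypical task script, via sequence alignment.
import difflib

def deviations(script, observed):
    """Align observed events against the script; return non-matching spans."""
    sm = difflib.SequenceMatcher(a=script, b=observed)
    out = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":  # 'replace', 'insert', or 'delete' = a deviation
            out.append((op, script[i1:i2], observed[j1:j2]))
    return out
```

Deviation patterns found this way would then be traced back to their system- or interaction-level causes, as the slide describes.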
Statistical Analysis of ELAN files in Matlab (SALEM)
Hanheide, M., Lohse, M., and Dierker, A. (2010). SALEM: Statistical AnaLysis of Elan files in Matlab. Workshop on Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality, 7th Intl. Conf. on Language Resources and Evaluation (LREC), Malta, pp. 121-123.
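SALEM itself is a Matlab toolbox, but the first step it relies on, reading ELAN annotation tiers into analyzable records, can be sketched in a few lines of Python, since ELAN's .eaf format stores time slots and tier annotations as plain XML:

```python
# Minimal ELAN (.eaf) reader: extracts (tier, label, start_ms, end_ms) tuples.
# Illustrative sketch only -- not part of SALEM.
import xml.etree.ElementTree as ET

def read_eaf(source):
    root = ET.parse(source).getroot()
    # TIME_SLOT elements map symbolic ids to milliseconds.
    times = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
             for ts in root.iter("TIME_SLOT")}
    records = []
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID")
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            records.append((tier_id,
                            ann.findtext("ANNOTATION_VALUE", default=""),
                            times[ann.get("TIME_SLOT_REF1")],
                            times[ann.get("TIME_SLOT_REF2")]))
    return records
```

Such records are the input for the kind of duration and frequency statistics SALEM computes.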
How to measure progress? Benchmarking questions:
- Did the overall number of different problem-related tasks decrease?
- Did the percentage of time users spent on problem-related tasks (compared to social and functional tasks) decrease?
- Did the mean duration of problem-related tasks decrease?
- Did the handling of problem-related tasks improve?
- When did the problem-related tasks occur in the task structure?
Siepmann, F., Lohse, M., and Wachsmuth, S. Towards robot architectures for user-driven system design (in preparation).
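The first three benchmarking questions reduce to descriptive statistics over annotated task segments. A minimal sketch, assuming segments are (category, start_s, end_s) tuples with hypothetical category names (this is not the authors' implementation):

```python
# Descriptive statistics mirroring the benchmarking questions:
# count, time share, and mean duration of one task category.
def task_stats(segments, category="problem"):
    durs = [end - start for cat, start, end in segments if cat == category]
    total = sum(end - start for _, start, end in segments)
    return {
        "count": len(durs),                                   # how many occurred
        "time_share": sum(durs) / total if total else 0.0,    # vs. all tasks
        "mean_duration": sum(durs) / len(durs) if durs else 0.0,
    }
```

Comparing these numbers across system iterations is what turns the questions into a progress measure.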
Challenges in defining benchmarks
- How to measure progress? Multi-dimensionality, system complexity, small datasets
- How to define ground truth? User behavior is highly variable
- How to prevent architectural bottlenecks? Tests are task- and platform-specific
Social cues in teaching scenarios
- Valence judged by non-verbal cues (facial expressions)
- Reduction to the evaluation of a single skill
- How to provoke natural user behavior?
[Lang et al., ROMAN, 2009]
Uncertain ground truth in HRI
Human judgements (without sound): 44 judges, 88 video sequences of 11 subjects
- success videos
- failure videos
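When ground truth rests on human judgements like these, a standard way to quantify how uncertain it is, for any pair of judges, is Cohen's kappa: chance-corrected agreement on the binary success/failure labels. A generic sketch (not the analysis used in the study):

```python
# Cohen's kappa: chance-corrected agreement between two judges.
from collections import Counter

def cohen_kappa(a, b):
    assert len(a) == len(b) and a
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    ca, cb = Counter(a), Counter(b)
    # Expected agreement if both judges labelled independently at their own rates.
    p_exp = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    if p_exp == 1.0:  # both judges used one identical label throughout
        return 1.0
    return (p_obs - p_exp) / (1.0 - p_exp)
```

Kappa near 1 means the judges define a usable ground truth; kappa near 0 means agreement is no better than chance, i.e. the ground truth itself is uncertain.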
Assistance in real applications
- Supporting cognitively disabled persons in ADLs (epilepsy, autism, learning disorders, hemiparesis)
- Cooperation with the v. Bodelschwinghsche Anstalten Bethel
- WOZ study (23 trials including 7 users): teeth cleaning
- Feedback by audio/video prompts
Individual reaction behavior
WOZ study: user reactions to prompts
- wizard (WIZ) vs. caregiver (CG)
- audio (A) vs. audio/video (A/V)
Challenges in defining benchmarks
- How to measure progress? Multi-dimensionality, system complexity, small datasets
- How to define ground truth? User behavior is highly variable
- How to prevent architectural bottlenecks? Tests are task- and platform-specific
Scalability and transfer of system frameworks and skills (1993-2011)
- Interactive manipulation: DACS, ASR, dialog, ...
- Service robotics: multi-modal anchoring, person attention, tutoring, receptionist, motivation, ...; XCF, Active Memory, ...
- Task assistance: task state pattern, dialog framework, BonSAI, social feedback, working memory, application design, ...
Scalability in competitions: RoboCup@Home
- Graz, 2009
- Singapore, 2010
RoboCup@Home
Desired abilities: navigation, fast and easy setup, object recognition, object manipulation, recognition of humans, human-robot interaction, speech recognition, gesture recognition, robot applications, ambient intelligence
Tests: Robot Inspection, Follow Me, Go Get It, Who Is Who, Open Challenge, Enhanced Who Is Who, General Purpose Service Robot, Shopping Mall, Demo Challenge, Final
RoboCup@Home: tests are not completely pre-specified
- The Open Challenge allows a free performance
- The General Purpose test includes task specification
- The Shopping Mall test takes place in a real, unknown environment
- The Demo Challenge focuses on application domains
- Points are given for (partial) task completion within a time limit
- Judging is (partially) subjective!
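Scoring by partial task completion under a time limit can be sketched as follows; the point values, sub-task names, and hard time cut-off are illustrative assumptions, not the official RoboCup@Home rulebook:

```python
# Illustrative partial-completion scoring with a hard time limit.
def score_test(completed_steps, step_points, elapsed_s, time_limit_s):
    """Sum points for completed sub-tasks; zero once time runs out."""
    if elapsed_s > time_limit_s:
        return 0
    return sum(step_points.get(step, 0) for step in completed_steps)

# Hypothetical point breakdown for a fetch-and-carry style test.
POINTS = {"understand_command": 100, "navigate": 200, "grasp": 400, "deliver": 300}
```

Partial credit of this kind rewards robustness: a robot that reliably completes half of every test can outscore one that occasionally completes everything.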
System development is an implicit part of the competition
- Team effort of 10-12 people
- Major team change from 2009 to 2010
- Large number of modules
- Limited computing power
- Prototyping of tasks
- Short evaluation cycles
- The robot needs to perform instantly
Conclusions
- Benchmarking cognitive systems is inherently multidimensional (there is no FLOPs measure)
- Evaluation needs to be based on live systems (performance is not characterized by offline error rates)
- System frameworks and skills profit significantly from transfer to other scenarios and platforms
- System integration and evaluation is costly (there is no free lunch)
- Internal system analysis and external interaction analysis need to be coupled
Conclusions (cont.)
- Benchmarking tasks should not be overspecified
- Human behavior is shaped by the system response (human input cannot be normed)
- Ground truth needs to be defined by the setup (otherwise it may be ill-defined)
- Human behavior is highly individual (there is no "average user")
- Competitions in HRI are inherently not completely fair, but they are good for research
Thanks to a lot of people...