Continuous Gesture Recognition Fact Sheet


August 17, 2016

1 Team details

Team name: ICT NHCI
Team leader name: Xiujuan Chai
Team leader address, phone number and email
Address: No.6 Kexueyuan South Road, Zhongguancun, Haidian District, Beijing, China
Phone number: +86 10 62600553
E-mail: chaixiujuan@ict.ac.cn
Rest of the team members: Zhipeng Liu, Fang Yin, Zhuang Liu and Xilin Chen
Affiliation: Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS

2 Contribution details

Title of the contribution: Two streams RNN for Continuous Gesture Recognition with Efficient Segmentation.

Final score:

General method description: The continuous gesture sequence is first segmented into isolated gestures based on the hand positions. The two streams RNN is then applied to each segmented gesture to obtain the recognition result.

References: Faster Region-based Convolutional Neural Network (Faster R-CNN) [4]. Recurrent Neural Network (RNN) [2]. Keras: Deep Learning library [1]. Caffe [3]. Face detection [6].

Representative image / diagram of the method: Figure 1 is the diagram of our method.

Figure 1: Diagram of the method.

Describe data preprocessing techniques applied (if any): For each image frame, the hand is detected with our pre-trained hand detector.

3 Visual Analysis

3.1 Gesture Recognition and Spotting Stage

3.1.1 Features / Data representation

In the gesture segmentation stage, the feature is the hand position. In the gesture recognition stage, each frame is represented by the hand shape and hand positions, extracted separately from two channels, i.e. RGB and depth. For the hand shape representation, HOG is extracted from the detected hand regions. For the hand position representation, the skeleton pairwise feature [5] is used: the face and the two hands are selected as key points, and the feature is constructed from the distances between each pair of the three points.

3.1.2 Dimensionality reduction

PCA is used to reduce the dimensionality of the HOG feature. The final hand shape representation is reduced from 324 to 81 dimensions for both RGB and depth hand images, with nearly 90% of the energy preserved.
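The position feature of Section 3.1.1 and the energy criterion of Section 3.1.2 can be illustrated with a minimal Python sketch. It is not the team's implementation (which was written in C++). The function names, the use of 2D coordinates, and the assumption that "energy" means the cumulative share of the PCA eigenvalues are all illustrative choices made here.

```python
import math
from itertools import combinations


def skeleton_pairwise_feature(face, left_hand, right_hand):
    """Distances between each pair of the three key points.

    Each point is a coordinate tuple (2D here, by assumption).
    Returned order: face-left, face-right, left-right.
    """
    points = (face, left_hand, right_hand)
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def components_for_energy(eigenvalues, energy=0.90):
    """Smallest number of leading PCA components whose eigenvalues
    sum to at least the requested share of the total (assumed
    definition of 'energy')."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running >= energy * total:
            return k
    return len(ordered)


# Hypothetical key point positions in pixels.
feature = skeleton_pairwise_feature((0.0, 0.0), (3.0, 4.0), (0.0, 8.0))
print(feature)  # [5.0, 8.0, 5.0]

# Hypothetical eigenvalue spectrum: three components reach 90%.
print(components_for_energy([6.0, 2.0, 1.5, 0.5], energy=0.90))  # 3
```

With three key points the position feature has three entries per frame. In the fact sheet's setting, the same energy criterion would be applied to the 324 eigenvalues of the HOG covariance, where it reportedly yields about 81 components.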

3.1.3 Compositional model

There are two main modules in our continuous gesture recognition method: temporal segmentation and the two streams RNN. We use the hand positions to perform temporal segmentation, based on the fact that the subject usually puts the hands down after performing each gesture. Figure 2 illustrates the structure of the two streams RNN.

Figure 2: Diagram of the two streams RNN.

3.1.4 Learning strategy

The features described above are extracted from the training and validation data and used to train the two streams RNN.

3.1.5 Other techniques

Face detection [6] is used for skeleton pair feature extraction, as described in Section 3.1.1. Faster R-CNN [4] is used for hand detection.

3.1.6 Method complexity

With accurate gesture segmentation, continuous gesture recognition is transformed into an isolated gesture recognition problem. Thus the core of our method is the two streams RNN. The architecture has four layers: the first layer has two independent RNN channels with 330 neurons, the second layer is a fusion layer, the third layer is an LSTM layer with 165 neurons, and the last layer is a softmax layer.

3.2 Data Fusion Strategies

The hand shape and position features are extracted for both RGB and depth videos. Within each channel, the hand shape feature and the position feature are fused by direct concatenation. The features from the two channels are fused by the RNN model: they are fed into two separate RNN layers and then combined by the fusion layer.

3.3 Global Method Description

Which pre-trained or external methods have been used (for any stage, if any): The pre-trained face detection model [6] is used in the preprocessing step.

Qualitative advantages of the proposed solution:
1) Continuous gesture recognition is transformed into an isolated gesture recognition problem with the help of hand detection.
2) In isolated gesture recognition, the RNN can model the contextual information of a gesture.
3) The two input channels make full use of the RGB and depth information.

Novelty degree of the solution and if it has been previously published:
1) Gesture segmentation is realized with accurate hand detection.
2) The two streams RNN fuses the RGB and depth information effectively and can model the contextual information of temporal gesture sequences.
3) The hand detection module gives precise hand positions, which is very important for correct recognition.
4) The hand HOG and skeleton pair features are integrated to describe the gesture well while avoiding background noise.
The work has not been published.

4 Other details

Language and implementation details (including platform, memory, parallelization requirements): Hand detection is implemented in Caffe [3]. The face detection SDK, HOG and skeleton pair feature extraction are programmed in C++ with Visual Studio 2012. RNN classifier training and testing are implemented in Keras with cuDNN on a Titan X GPU.

Human effort required for implementation, training and validation? The hand regions of 50000 images from the training and validation data were annotated by hand and used to train the hand detection model.

Training/testing expended time? In the training stage, it takes about 16 hours to train the RGB and depth hand detection models (8 hours per model) using Faster R-CNN.
Running hand and face detection on the training and validation data takes about 80 hours and 4 hours respectively (one Titan X GPU). After the detection results are obtained, it takes about 9 hours to extract features from the training and validation data. Finally, it takes only about 20 minutes to train the final two streams RNN model. In the test stage, it takes about 12 hours to detect hands (one Titan X GPU) and 30 minutes to detect faces on the test data. It then takes about 1 hour to extract features from the test data. Finally, it takes only 5 minutes to obtain the recognition results on the test data.

General comments and impressions of the challenge? What do you expect from a new challenge in face and looking at people analysis? Given the complicated environments and the large variations between different subjects, the dataset is quite challenging.

References

[1] F. Chollet. Keras. https://github.com/fchollet/keras, 2015.
[2] K. Greff, R. K. Srivastava, J. Koutnik, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey. Computer Science, 2015.
[3] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[4] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1-1, 2016.
[5] J. Wang, Z. Liu, Y. Wu, and J. Yuan. Mining actionlet ensemble for action recognition with depth cameras. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1290-1297. IEEE, 2012.
[6] S. Wu, M. Kan, Z. He, S. Shan, and X. Chen. Funnel-structured cascade for multiview face detection with alignment awareness. Neurocomputing (under review).
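Illustrative sketch: temporal segmentation (Section 3.1.3)

The fact sheet states only that segmentation relies on the subject putting the hands down after each gesture. It gives no thresholds or rules. The following Python sketch shows one plausible way to turn that observation into code. The function name, the resting-row threshold `rest_y`, the minimum segment length `min_frames`, and the use of image rows (larger value means lower in the image) are all assumptions made for illustration.

```python
def segment_gestures(hand_y, rest_y, min_frames=2):
    """Split a frame sequence into gestures using hand height.

    hand_y: per-frame image row of the highest detected hand
            (larger row = lower in the image).
    rest_y: rows at or below this value count as 'hands down'
            (assumed threshold, not from the fact sheet).
    Returns inclusive (start, end) frame index pairs.
    """
    segments = []
    start = None
    for i, y in enumerate(hand_y):
        raised = y < rest_y
        if raised and start is None:
            # Hand has just left the resting area: a gesture begins.
            start = i
        elif not raised and start is not None:
            # Hand is back down: close the gesture if long enough.
            if i - start >= min_frames:
                segments.append((start, i - 1))
            start = None
    # Close a gesture that is still open at the end of the sequence.
    if start is not None and len(hand_y) - start >= min_frames:
        segments.append((start, len(hand_y) - 1))
    return segments


# Hypothetical hand rows for an 11-frame clip with two gestures.
rows = [400, 395, 300, 250, 260, 390, 400, 310, 240, 235, 230]
segments = segment_gestures(rows, rest_y=350)
print(segments)  # [(2, 4), (7, 10)]
```

Each returned frame range would then be passed to the isolated gesture classifier. A practical version would also need to handle frames with missing hand detections, which this sketch ignores.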