Research on Hand Gesture Recognition Using Convolutional Neural Network

Tian Zhaoyang (a), Cheng Lee Lung (b)

(a) Department of Electronic Engineering, City University of Hong Kong, Hong Kong, China. E-mail address: zytian3-c@my.cityu.edu.hk
(b) Department of Electronic Engineering, City University of Hong Kong, Hong Kong, China. E-mail address: itacheng@cityu.edu.hk

Abstract

The combination of deep learning and human-computer interaction can be applied to many real-world problems. We propose a five-layer convolutional neural network named MixNET for hand gesture classification. MixNET consists of two convolutional and max-pooling layers, two fully connected and dropout layers, and a softmax loss layer. MixNET achieved a best error rate of 1.8%, which is better than other similar networks working on the same 6 classes of hand gestures. The model is trained and tested on a GPU with 1 GB of graphics memory.

Keywords: Convolutional Neural Network, Hand Gesture Recognition, Deep Learning

1. Introduction

Hand gesture recognition is one of the important approaches to human-computer interaction (HCI). Hand gesture recognition based on computer vision can be applied to many real-world problems, for instance sign language translation, virtual reality, intelligent homes, assistive environments and video game control (Mitra and Acharya, 2007). As the field of vision-based hand gesture recognition matures, there is an emerging need for a systematic, high-performance method that offers more accurate gesture recognition, so that HCI remains a workable approach for real-life applications. Applying a convolutional neural network to vision-based hand gesture recognition helps to address this need. In particular, recognizing hand gestures by computer vision can simplify the end user's access method while remaining easy and natural to use.
Many earlier works, such as Murthy and Jadon (2009), Li (2012) and Kulshreshth and LaViola (2013), tried to use vision-based hand gesture recognition to carry out real-life tasks; however, the results are still not satisfactory for real-life applications. Due to the nature of optical sensing, the lighting and the color and texture of the background can easily affect the quality of the captured images. Detecting and tracking hands robustly is therefore hard, which leads to low hand gesture recognition performance (Ren et al., 2013).

Many studies, such as Garg et al. (2009), Quek (1994) and Ji et al. (2013), showed the advantage of applying a convolutional neural network (CNN) to image data. A CNN is mainly used to identify two-dimensional patterns with invariance to displacement, scaling and other forms of distortion. It differs from earlier neural network classifiers in its restructured data layout and its reduced number of weights. It extracts features within the multilayer network, processes the image directly as a multi-dimensional array, and acts as a classifier. Because of these properties, a CNN can be applied directly to the hand gesture data to be classified in this research. We propose MixNET, a five-layer convolutional neural network that recognizes and classifies hand gestures accurately and efficiently.

2. Methods

The dataset used in this project is the Sebastien Marcel Static Hand Posture Database (Marcel and Bernier, 1999), which has 6 different hand postures: A, B, C, Five, Point and V. In total it contains 4872 pictures of hand postures. They are divided into three sets: the training set has 2899 samples, the validation set has 1243 samples and the test set has 730 samples.

Figure 1. Samples of six different hand postures.

In figure 1, from left to right, the six pictures are A, B, C, Five, Point and V. The raw images are in pnm format. They were collected from about 10 different persons against uniform and complex backgrounds. Because the raw images in the database come in various sizes, the pictures are resized to 45 × 45 pixels using bilinear interpolation.
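The resizing step can be illustrated with a minimal sketch of bilinear interpolation. It is not the authors' code: it works on a single-channel image stored as nested Python lists, and the function name `bilinear_resize` and the pixel-center alignment convention are assumptions made for illustration. Each output pixel is a weighted blend of at most four neighboring input pixels.

```python
def bilinear_resize(image, out_h, out_w):
    """Resize a single-channel image (list of rows) with bilinear interpolation.

    Illustrative sketch only; pixel-center alignment is an assumed convention.
    """
    in_h, in_w = len(image), len(image[0])
    # Scale factors that map output pixel centers back to input coordinates.
    scale_y = in_h / out_h
    scale_x = in_w / out_w
    out = []
    for i in range(out_h):
        # Source row coordinate, clamped to the valid range.
        y = min(max((i + 0.5) * scale_y - 0.5, 0.0), in_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            # Source column coordinate, clamped to the valid range.
            x = min(max((j + 0.5) * scale_x - 0.5, 0.0), in_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Blend the four neighboring pixels: first along x, then along y.
            top = (1 - wx) * image[y0][x0] + wx * image[y0][x1]
            bottom = (1 - wx) * image[y1][x0] + wx * image[y1][x1]
            row.append((1 - wy) * top + wy * bottom)
        out.append(row)
    return out


# Hypothetical 80 x 60 test image (a diagonal intensity ramp), resized to 45 x 45.
ramp = [[float(r + c) for c in range(60)] for r in range(80)]
small = bilinear_resize(ramp, 45, 45)
print(len(small), len(small[0]))
```

For a color image, the same function would be applied to each of the three channels separately.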
Bilinear interpolation needs the color intensities of only 4 neighboring pixels to compute each interpolated value, so it avoids discontinuities in the pixel values and gives a smooth resized image. The pixel values in the dataset range from 0 to 255; they are divided by 255 to normalize them to [0, 1]. To center the data of all dimensions on the origin, the mean over all image feature values is subtracted from each feature. This is applied to the red, green and blue channels of each image.

Figure 2. Raw image (right), rescaled image (center) and preprocessed image (left).

To run the program on a GPU, the data type of the dataset is converted to float32, as required by the development library, which provides many interfaces for handling multi-dimensional arrays on the GPU. The whole preprocessing procedure is shown in figure 2.

3. Results

3.1 Network Architecture of MixNET

Figure 3. The architecture of the proposed MixNET.

MixNET has five layers, excluding the input data layer. Each layer has multiple feature maps. Each feature map extracts one feature through a convolution filter and contains multiple neurons. The data layer holds the image after preprocessing. The activation function used in this project is the rectifier, $f(x) = \max(0, x)$. Compared with the hyperbolic tangent function, $f(x) = \tanh(x)$, the rectified linear unit (ReLU) requires much less computation and sets the output of some of the neurons to zero. This effect makes the network sparser and reduces the interdependence of parameters among the neurons (Krizhevsky et al., 2012).

In the first convolutional layer, MixNET uses ten kernels of size 8 × 8 × 3 with no zero padding and unit stride to scan across all the inputs. The relationship can be described as $\mathrm{conv}_s = m - k + 1$, where $\mathrm{conv}_s$ is the size of the feature map after the convolution, $m$ is the size of the square input and $k$ is the size of the square kernel. The input data are the 45 × 45 preprocessed images with 3 RGB channels, which become 38 × 38 feature maps. Each neuron in this layer computes a weighted sum of its inputs using the trained weights and then adds a bias. After the ReLU activation function, the result is passed to max-pooling layer 1.

Figure 4. The data flow diagram of two conv-pool layers: input 45 × 45 × 3 → convolution (10 outputs) → 38 × 38 × 10 → ReLU → max-pooling (kernel size 2, stride 2) → 19 × 19 × 10 → convolution (60 outputs, kernel size 6) → 14 × 14 × 60 → ReLU → max-pooling (kernel size 2, stride 2) → 7 × 7 × 60. (The figure labels the first kernel size as 11; the 8 × 8 kernel stated in the text is the size consistent with a 38 × 38 output.)

The down-sampling layer uses max-pooling, which keeps the useful information and cuts the amount of data to be processed by the upper layers. In the first max-pooling layer, MixNET uses 2 × 2 kernels with stride 2 to scan across all the inputs. The relationship is similar to that of the convolutional layer: $\mathrm{pool}_s = (m - k)/s + 1$, where $\mathrm{pool}_s$ is the size of the feature map after max-pooling, $m$ is the size of the square input, $k$ is the size of the square kernel and $s$ is the stride. Because the receptive fields of the units do not overlap, the feature map in this layer has a quarter of the size of the one in the previous layer (the numbers of rows and columns are both halved). In practice, we combine the convolutional layer and the max-pooling layer and call the pair a conv-pool layer, as shown in figure 3.
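The size relationships for the convolution and max-pooling layers can be checked with a short sketch. The function names are hypothetical; the kernel sizes (8 and 6), kernel counts (10 and 60) and the 2 × 2, stride-2 pooling are the values given in the text and in figure 4.

```python
def conv_output_size(m, k):
    """Side length after a square convolution with no padding and unit stride."""
    return m - k + 1


def pool_output_size(m, k, s):
    """Side length after square max-pooling with kernel size k and stride s."""
    return (m - k) // s + 1


def trace_conv_pool_shapes(input_size=45):
    """Return (stage, side, channels) tuples for the two conv-pool layers."""
    shapes = [("input", input_size, 3)]
    side = input_size
    # (kernel size, number of kernels) for each convolution layer.
    for index, (kernel, channels) in enumerate([(8, 10), (6, 60)], start=1):
        side = conv_output_size(side, kernel)
        shapes.append((f"conv{index}", side, channels))
        side = pool_output_size(side, k=2, s=2)
        shapes.append((f"pool{index}", side, channels))
    return shapes


for stage, side, channels in trace_conv_pool_shapes():
    print(f"{stage}: {side} x {side} x {channels}")
```

The trace reproduces the sizes quoted for the data flow: 45 → 38 → 19 → 14 → 7, so the final 7 × 7 × 60 volume holds 2940 values before the fully connected layers.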
Following the same pattern, the second layer is also a conv-pool layer. The data flow of these two layers is shown in figure 4.

The fully connected layer computes the dot product of the input vector and the weights, then adds a bias. The outputs of all neurons pass through the ReLU before being passed to the next layer. The dropout layer randomly disables some nodes of the hidden layer during training, so that these nodes can temporarily be treated as absent from the network architecture. The weights of these nodes are nevertheless kept, and they are updated again when a later input arrives. In this research, the dropout ratio is set to 0.5. Srivastava et al. (2014) report that dropout is more effective than standard regularizers.

Figure 5. The data flow diagram of the two fully connected and dropout layers and the loss layer: 7 × 7 × 60 → fully connected (200) → ReLU → dropout → fully connected (200) → ReLU → dropout → softmax (6).

Combining each fully connected layer with its dropout layer makes the structure easier to understand, as shown in figure 3. The softmax function used in the last layer is defined as $\sigma(z)_j = e^{z_j} / \sum_{k=1}^{K} e^{z_k}$ for $j = 1, \dots, K$, where $z$ is the input vector of the layer and $K = 6$ is the number of classes. Intuitively, the softmax computes the probability of each class. The output of the last dropout layer is passed to the 6-way softmax of the loss layer. The final output is a probability distribution over the 6 different hand postures. The data flow diagram is shown in figure 5.

3.2 Performance Evaluation

MixNET is trained using stochastic gradient descent. The initial weights are random values drawn from a uniform distribution. The training data are divided into two sets: 70% of the samples are used for training and 30% for validation. The validation set helps to tune the parameters and select the model so that overfitting is minimized. The trained convolutional neural network is applied to the hand posture test set, which contains 730 samples from 6 different categories. Figure 6 shows the performance on the training, validation and test sets.
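The dropout and softmax steps described in Section 3.1 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the softmax subtracts the maximum score for numerical stability, and the rescaling of kept activations by 1 / (1 - ratio) (often called inverted dropout) is an assumption that the paper does not mention.

```python
import math
import random


def softmax(scores):
    """Turn a list of class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dropout(activations, ratio=0.5, rng=random):
    """Zero each activation with probability `ratio` (training time only).

    Kept activations are rescaled by 1 / (1 - ratio); this rescaling is an
    assumption of the sketch, not something stated in the paper.
    """
    keep = 1.0 - ratio
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# Hypothetical scores for the 6 postures: A, B, C, Five, Point, V.
scores = [1.2, 0.3, -0.8, 2.5, 0.0, 0.7]
probabilities = softmax(scores)
predicted = probabilities.index(max(probabilities))
print(predicted, round(sum(probabilities), 6))

# Dropout with ratio 0.5 on a hypothetical 200-unit hidden layer.
hidden = [1.0] * 200
dropped = dropout(hidden, ratio=0.5, rng=random.Random(0))
print(sum(1 for value in dropped if value == 0.0), "of", len(dropped), "units disabled")
```

At test time dropout is switched off, and the full hidden vector is passed to the softmax.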

Figure 6. Train loss, validation loss and test score of the final MixNET model.

The model achieved its lowest validation loss at epoch 120, with its best performance on the test set, 98.2%. This MixNET test was performed on a hardware system with a 2.6 GHz Intel Core i7-6600U and an Nvidia GeForce graphics card with 1 GB of GDDR5 memory. The whole test lasts 5.82 minutes in this environment.

Classification Approach              Performance
K-Neighbors Classifier (uniform)     75%
K-Neighbors Classifier (distance)    78%
SVC                                  86%
LeNet-5                              93%
MPCNN                                95%
MixNET                               98%

Table 1. Comparison with other classification approaches.

Table 1 shows the performance of each classification method when trained on the training set and then applied to the test set. The results indicate why convolutional neural networks are a popular approach to hand gesture recognition. Applying LeNet-5 (LeCun et al., 1998) to learn and classify the preprocessed hand gesture data gives a best test performance of 93%, about 5 percentage points below the proposed MixNET. The max-pooling convolutional neural network (MPCNN) (Nagi et al., 2011) performs better than LeNet-5; however, its test performance of 94.8% is still 3.4 percentage points below the proposed MixNET.
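The comparison figures can be put in perspective with a little arithmetic. The sketch below assumes that each quoted test performance is a classification accuracy measured over all 730 test samples; the misclassification counts it prints are inferred from the percentages and are not reported in the paper. The names `REPORTED`, `implied_errors` and `gap_to_mixnet` are hypothetical.

```python
TEST_SET_SIZE = 730  # number of test samples stated in the paper

# Test performance (%) as quoted in the text.
REPORTED = {"LeNet-5": 93.0, "MPCNN": 94.8, "MixNET": 98.2}


def implied_errors(accuracy_percent, n_samples=TEST_SET_SIZE):
    """Approximate number of misclassified samples implied by an accuracy."""
    correct = round(accuracy_percent / 100 * n_samples)
    return n_samples - correct


def gap_to_mixnet(name):
    """Difference to MixNET in percentage points, rounded to one decimal."""
    return round(REPORTED["MixNET"] - REPORTED[name], 1)


for name, accuracy in REPORTED.items():
    print(f"{name}: {accuracy}% accuracy, about {implied_errors(accuracy)} errors, "
          f"{gap_to_mixnet(name)} points below MixNET")
```

Under this assumption, 98.2% on 730 samples corresponds to roughly 13 misclassified images, which matches the 1.8% error rate quoted in the abstract.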

3.3 Conclusion

This paper gives a brief introduction to the proposed convolutional neural network architecture named MixNET. MixNET is trained for hand gesture recognition and achieves a classification accuracy of 98.2% on the test set. MixNET has been trained and tested on the Sebastien Marcel Static Hand Posture Database; however, it can also be applied to other hand gesture databases with more than 6 hand gestures. Since the whole process takes only 5.82 minutes on a GPU with 1 GB of GDDR5 memory, it could serve as a low-cost, state-of-the-art deep learning tool for human-computer interaction (HCI) in the future.

4. References

Mitra, S., & Acharya, T. (2007). Gesture recognition: A survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 37(3), 311-324.

Murthy, G. R. S., & Jadon, R. S. (2009). A review of vision based hand gestures recognition. International Journal of Information Technology and Knowledge Management, 2(2), 405-410.

Li, Y. (2012, June). Hand gesture recognition using Kinect. In 2012 IEEE International Conference on Computer Science and Automation Engineering (pp. 196-199). IEEE.

Kulshreshth, A., Zorn, C., & LaViola, J. J. (2013, March). Poster: Real-time markerless kinect based finger tracking and hand gesture recognition for HCI. In 3D User Interfaces (3DUI), 2013 IEEE Symposium on (pp. 187-188). IEEE.

Ren, Z., Yuan, J., Meng, J., & Zhang, Z. (2013). Robust part-based hand gesture recognition using kinect sensor. IEEE Transactions on Multimedia, 15(5), 1110-1120.

Garg, P., Aggarwal, N., & Sofat, S. (2009). Vision based hand gesture recognition. World Academy of Science, Engineering and Technology, 49(1), 972-977.

Quek, F. K. (1994, August). Toward a vision-based hand gesture interface. In Virtual Reality Software and Technology Conference (Vol. 94, pp. 17-29).

Ji, S., Xu, W., Yang, M., & Yu, K. (2013). 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 221-231.

Marcel, S., & Bernier, O. (1999, March). Hand posture recognition in a body-face centered space. In International Gesture Workshop (pp. 97-100). Springer Berlin Heidelberg.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929-1958.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.

Nagi, J., Ducatelle, F., Di Caro, G. A., Cireşan, D., Meier, U., Giusti, A., ... & Gambardella, L. M. (2011, November). Max-pooling convolutional neural networks for vision-based hand gesture recognition. In Signal and Image Processing Applications (ICSIPA), 2011 IEEE International Conference on (pp. 342-347). IEEE.