Research on Hand Gesture Recognition Using Convolutional Neural Network

Tian Zhaoyang a, Cheng Lee Lung b
a Department of Electronic Engineering, City University of Hong Kong, Hong Kong, China. E-mail address: zytian3-c@my.cityu.edu.hk
b Department of Electronic Engineering, City University of Hong Kong, Hong Kong, China. E-mail address: itacheng@cityu.edu.hk

Abstract

The combination of deep learning and human-computer interaction can be applied to many real-world problems. We propose a five-layer convolutional neural network named MixNET for hand gesture classification. MixNET consists of two convolutional and max-pooling layers, two fully connected and dropout layers, and a softmax loss layer. MixNET achieved a best test error rate of 1.8%, which is better than other similar networks working on the same 6 classes of hand gestures. The model is trained and tested on a GPU with 1 GB of GDDR5 memory.

Keywords: Convolutional Neural Network, Hand Gesture Recognition, Deep Learning

1. Introduction

Human-computer interaction (HCI) through hand gesture recognition is one of the important ways of interacting with machines. Hand gesture recognition based on computer vision can be applied to many real-world problems, for instance sign language translation, virtual reality, intelligent homes, assistive environments and video game control (Mitra and Acharya, 2007). As the field of vision-based hand gesture recognition matures, there is an emerging need for a systematic, high-performance method that recognizes gestures more accurately, so that HCI remains a workable approach for real-life applications. Applying a convolutional neural network to vision-based hand gesture recognition helps to address this need. In particular, recognizing hand gestures with computer vision simplifies the way end users access a system while keeping the interaction easy and natural.
Many earlier works, such as Murthy and Jadon (2009), Li (2012) and Kulshreshth et al. (2013), tried to use vision-based hand gesture recognition to carry out real-life tasks; however, the results are still not satisfactory for real-life applications. Owing to the nature of optical sensing, the lighting
and the color and texture of the background can easily affect the quality of the captured images. Detecting and tracking hands robustly is therefore difficult, which leads to low hand gesture recognition performance (Ren et al., 2013).

Many studies, such as Garg et al. (2009), Quek (1994) and Ji et al. (2013), have shown the advantage of applying a convolutional neural network (CNN) to image data. A CNN is mainly used to recognize two-dimensional patterns, with invariance to displacement, scaling and other forms of distortion. The CNN differs from earlier neural network classifiers in that it restructures the data and reduces the number of weights. It extracts features within the multilayer perceptron, processes images directly as multi-dimensional vectors and acts as a classifier. Because of these properties, the CNN can be applied directly to the hand gesture data that need to be classified in this research. We propose MixNET, a five-layer convolutional neural network that recognizes and classifies hand gestures accurately and efficiently.

2. Methods

The dataset used in this project is the Sebastien Marcel Static Hand Posture database (Marcel and Bernier, 1999), which contains 6 different hand postures: A, B, C, Five, Point and V. In total, it has 4872 pictures of hand postures. We divide them into three sets: the training set has 2899 samples, the validation set has 1243 samples and the test set has 730 samples.

Figure 1. Samples of six different hand postures.

In Figure 1, from left to right, the six pictures show A, B, C, Five, Point and V. The raw images are in PNM format. They were collected from about 10 different persons against uniform and complex backgrounds. Because the raw images in the database have various sizes, the pictures are resized to 45 × 45 px by using bilinear interpolation.
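The resizing step can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function name `bilinear_resize` is hypothetical, the image is a single channel stored as a list of rows (an RGB image would be processed one channel at a time), and the sampling grid maps the corner pixels of the output onto the corner pixels of the input, a convention the paper does not specify.

```python
def bilinear_resize(image, out_h, out_w):
    """Resize a 2-D list of pixel intensities with bilinear interpolation.

    Assumption: output corner pixels are aligned with input corner pixels.
    """
    in_h, in_w = len(image), len(image[0])
    # Step between sample positions, measured in input-pixel units.
    scale_y = (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
    scale_x = (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0

    resized = []
    for i in range(out_h):
        y = i * scale_y
        y0 = int(y)                      # row above the sample position
        y1 = min(y0 + 1, in_h - 1)       # row below, clamped at the border
        wy = y - y0                      # vertical weight of the lower row
        row = []
        for j in range(out_w):
            x = j * scale_x
            x0 = int(x)                  # column to the left
            x1 = min(x0 + 1, in_w - 1)   # column to the right, clamped
            wx = x - x0                  # horizontal weight of the right column
            # Blend the four neighbouring pixels: first along x, then along y.
            top = (1 - wx) * image[y0][x0] + wx * image[y0][x1]
            bottom = (1 - wx) * image[y1][x0] + wx * image[y1][x1]
            row.append((1 - wy) * top + wy * bottom)
        resized.append(row)
    return resized


# Made-up 2 x 2 image, enlarged to 3 x 3 for illustration.
small = [[0.0, 10.0], [20.0, 30.0]]
enlarged = bilinear_resize(small, 3, 3)
```

Calling `bilinear_resize(channel, 45, 45)` on each color channel would give the 45 × 45 input size used in this work.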
Bilinear interpolation needs the color intensities of only 4 pixels to compute each interpolated value, so it avoids discontinuities in the pixel values and yields a smooth resized image. The pixel values in the dataset range from 0 to 255. To normalize them to [0, 1], the data are divided by 255. To make the data
values of all dimensions centered on the origin, the mean of all image feature values is subtracted from each feature. This is applied to the red, green and blue channels of each image.

Figure 2. Raw image (right), rescaled image (center) and preprocessed image (left).

To run the program on a GPU, the dataset must be converted to the float32 data type, as required by the development library, which provides many interfaces for processing multi-dimensional arrays on the GPU. The whole preprocessing procedure is illustrated in Figure 2.

3. Results

3.1 Network Architecture of MixNET

Figure 3. The architecture of the proposed MixNET.

MixNET has five layers, excluding the input data layer. Each layer has a number of feature maps. Each feature map extracts one selected feature through a convolution filter and contains multiple neurons. The data layer contains the preprocessed image. The activation function used in this project is the rectifier, $f(x) = \max(0, x)$. Compared with the hyperbolic tangent function, $f(x) = \tanh(x)$, the rectified linear unit (ReLU) saves a considerable amount of computation and lets the
output of some of the neurons be zero. This makes the whole network sparser and reduces the interdependence of the parameters passed among the neurons (Krizhevsky et al., 2012).

In the first convolutional layer, MixNET uses ten kernels of size 8 × 8 × 3 with no zero padding and unit strides to scan across the inputs. The relationship can be described as $\mathit{convs} = m - k + 1$, where $\mathit{convs}$ is the size of the feature map after the convolution, $m$ is the size of the square input and $k$ is the size of the square kernel. The input data are the 45 × 45 preprocessed images with 3 RGB channels, which become 38 × 38 feature maps. In each neuron of this layer, the inputs are multiplied by the trained weights and summed, and a bias is added. After the ReLU activation function, the result is passed to max-pooling layer 1.

[Figure 4 data flow: Input 45 × 45 × 3 → Convolution (number of outputs: 10; kernel size: 11 as labeled in the figure, 8 in the text) → 38 × 38 × 10 → ReLU → 38 × 38 × 10 → Max-pooling (kernel size: 2, stride: 2) → 19 × 19 × 10 → Convolution (number of outputs: 60, kernel size: 6) → 14 × 14 × 60 → ReLU → 14 × 14 × 60 → Max-pooling (kernel size: 2, stride: 2) → 7 × 7 × 60]

Figure 4. The data flow diagram of two conv-pool layers.

The down-sampling layer uses max-pooling, which keeps the useful information and reduces the amount of data to be processed by the upper layers. In the first max-pooling layer, MixNET uses 2 × 2 kernels with stride 2 to scan across the inputs. The relationship is similar to that of the convolutional layer: $\mathit{pools} = (m - k)/s + 1$, where $\mathit{pools}$ is the size of the feature map after max-pooling, $m$ is the size of the square input, $k$ is the size of the square kernel and $s$ is the stride. Because the receptive fields of the units do not overlap, the feature map in this layer is a quarter of the size of that in the previous layer (the numbers of rows and columns are both halved). In practice, we combine the convolutional layer and the max-pooling layer and call the result a conv-pool layer, as shown in Figure 3.
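The two size relationships above can be checked with a few lines of Python. This is a small sketch for illustration only: the function names are hypothetical, the first kernel size is taken as the 8 × 8 stated in the text (which is the value that yields 38 × 38), and the second conv-pool layer uses the kernel size of 6 and the 60 outputs given in Figure 4.

```python
def conv_output_size(m, k):
    """Side length after a convolution with no padding and unit stride: m - k + 1."""
    return m - k + 1


def pool_output_size(m, k, s):
    """Side length after max-pooling with kernel k and stride s: (m - k) / s + 1."""
    return (m - k) // s + 1


def trace_conv_pool_sizes(input_size=45, conv_kernels=(8, 6), pool_kernel=2, pool_stride=2):
    """Return the side length of the data after each convolution and pooling stage."""
    sizes = [input_size]
    for k in conv_kernels:
        sizes.append(conv_output_size(sizes[-1], k))                          # convolution
        sizes.append(pool_output_size(sizes[-1], pool_kernel, pool_stride))   # max-pooling
    return sizes


sizes = trace_conv_pool_sizes()
# 60 feature maps leave the second conv-pool layer (see Figure 4).
flattened = sizes[-1] * sizes[-1] * 60
print(sizes, flattened)  # [45, 38, 19, 14, 7] 2940
```

The trace reproduces the 45 → 38 → 19 → 14 → 7 progression of Figure 4, so 7 × 7 × 60 = 2940 values enter the first fully connected layer.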
Accordingly, the second layer is also a conv-pool layer. The data flow of these two layers is shown in Figure 4.

The fully connected layer computes the dot product of the input vector and the weights, then adds a bias. The outputs of all neurons are passed through the ReLU before being
passed to the next layer. In the dropout layer, the algorithm randomly disables some of the nodes of the hidden layer, so those nodes can temporarily be regarded as absent from the network architecture. Their weights are kept, however, and can be updated again when a later input arrives. In this research, the dropout ratio is set to 0.5. Srivastava et al. (2014) report that dropout is more effective than standard regularizers.

[Figure 5 data flow, as far as it can be recovered: 7 × 7 × 60 → Fully connected → 200 → ReLU → 200 → Dropout → 200, followed by ReLU → 200 → Dropout → 200 → Softmax → 6]

Figure 5. The data flow diagram of the two fully connected and dropout layers and the loss layer.

Combining these two layers makes the structure easier to understand, as shown in Figure 3. The softmax function used in the last layer is defined as $\sigma(z)_j = e^{z_j} / \sum_{k=1}^{K} e^{z_k}$ for $j = 1, \dots, K$, where $K = 6$ is the number of classes. Intuitively, the softmax computes the probability of each class. The output of the last dropout layer is passed to the 6-way softmax of the loss layer. The final output is a probability distribution over the 6 hand postures. The data flow diagram is shown in Figure 5.

3.2 Performance Evaluation

MixNET is trained by using stochastic gradient descent. The initial weights are random values drawn from a uniform distribution. The training data are divided into two sets: 70% of the samples are used for training and 30% for validation. The validation set helps to tune the parameters and to select the model so that overfitting is minimized. The trained convolutional neural network is applied to the hand posture test set, which contains 730 samples from 6 categories. Figure 6 shows the training loss, the validation loss and the test score of the final model.
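For readers who want to see the dropout and 6-way softmax steps of Section 3.1 in concrete form, a minimal Python sketch follows. It is not the authors' code: the function names are hypothetical, the scores are made up, subtracting the maximum score inside the softmax is a common numerical safeguard rather than something stated in this paper, and the rescaling of surviving activations ("inverted dropout") is an assumption about which dropout variant is used.

```python
import math
import random


def softmax(scores):
    """Map raw class scores to probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dropout(activations, ratio=0.5, rng=None):
    """Zero each activation with probability `ratio` (used during training only).

    Assumption: survivors are scaled by 1 / (1 - ratio) ("inverted dropout");
    the paper does not say which variant it uses.
    """
    rng = rng or random.Random()
    keep = 1.0 - ratio
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


labels = ["A", "B", "C", "Five", "Point", "V"]
scores = [0.2, 1.5, -0.3, 3.0, 0.1, 0.7]  # made-up scores for illustration
probs = softmax(scores)
predicted = labels[probs.index(max(probs))]
print(predicted, round(sum(probs), 6))  # Five 1.0
```

At test time the dropout step is skipped and the class with the largest softmax probability is taken as the predicted posture.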
Figure 6. Train loss, validation loss and test score of final model of MixNET.

The model achieved its lowest validation loss at epoch 120, where it also reached its best performance on the test set, 98.2%. The experiment was performed on a system with a 2.6 GHz Intel Core i7-6600U and an Nvidia GeForce graphics card with 1 GB of GDDR5 memory. The whole run takes 5.82 minutes in this environment.

Classification Approach               Performance
K-Neighbors Classifier (uniform)      75%
K-Neighbors Classifier (distance)     78%
SVC                                   86%
LeNet-5                               93%
MPCNN                                 95%
MixNET                                98%

Table 1. Comparison with other classification approaches.

Table 1 shows the performance of each classification method, trained on the training set and then applied to the test set. The results indicate why the convolutional neural network is a popular approach to hand gesture recognition. Applying LeNet-5 (LeCun et al., 1998) to the same preprocessed hand gesture data gives a best test performance of 93%, which is about 5 percentage points lower than that of the proposed MixNET. The max-pooling convolutional neural network (MPCNN) (Nagi et al., 2011) performs better than LeNet-5; however, its test performance of 94.8% is still 3.4 percentage points lower than that of the proposed MixNET.
3.3 Conclusion

This paper gives a brief introduction to the proposed convolutional neural network architecture named MixNET. MixNET is trained for hand gesture recognition and achieves a test classification accuracy of 98.2%. MixNET has been trained and tested on the Sebastien Marcel Static Hand Posture database; however, it can also be applied to other hand gesture databases with more than 6 hand gestures. Since the whole process takes only 5.82 minutes on a GPU with 1 GB of GDDR5 memory, it can be a state-of-the-art, low-cost deep learning tool for human-computer interaction (HCI) in the future.

4. References

Mitra, S., & Acharya, T. (2007). Gesture recognition: A survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 37(3), 311-324.
Murthy, G. R. S., & Jadon, R. S. (2009). A review of vision based hand gestures recognition. International Journal of Information Technology and Knowledge Management, 2(2), 405-410.
Li, Y. (2012, June). Hand gesture recognition using Kinect. In 2012 IEEE International Conference on Computer Science and Automation Engineering (pp. 196-199). IEEE.
Kulshreshth, A., Zorn, C., & LaViola, J. J. (2013, March). Poster: Real-time markerless Kinect based finger tracking and hand gesture recognition for HCI. In 3D User Interfaces (3DUI), 2013 IEEE Symposium on (pp. 187-188). IEEE.
Ren, Z., Yuan, J., Meng, J., & Zhang, Z. (2013). Robust part-based hand gesture recognition using Kinect sensor. IEEE Transactions on Multimedia, 15(5), 1110-1120.
Garg, P., Aggarwal, N., & Sofat, S. (2009). Vision based hand gesture recognition. World Academy of Science, Engineering and Technology, 49(1), 972-977.
Quek, F. K. (1994, August). Toward a vision-based hand gesture interface. In Virtual Reality Software and Technology Conference (Vol. 94, pp. 17-29).
Ji, S., Xu, W., Yang, M., & Yu, K. (2013). 3D convolutional neural networks for human action recognition.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 221-231.
Marcel, S., & Bernier, O. (1999, March). Hand posture recognition in a body-face centered space. In International Gesture Workshop (pp. 97-100). Springer Berlin Heidelberg.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929-1958.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
Nagi, J., Ducatelle, F., Di Caro, G. A., Cireşan, D., Meier, U., Giusti, A., ... & Gambardella, L. M. (2011, November). Max-pooling convolutional neural networks for vision-based hand gesture recognition. In Signal and Image Processing Applications (ICSIPA), 2011 IEEE International Conference on (pp. 342-347). IEEE.