Camera Based Hand Gesture Recognition Daniel Snowden B.Sc. Computing with Artificial Intelligence 2005/2006


The candidate confirms that the work submitted is their own and the appropriate credit has been given where reference has been made to the work of others.

I understand that failure to attribute material which is obtained from another source may be considered as plagiarism.

(Signature of student)

Summary

The minimum requirement was to develop a system capable of recognising at least 2 gestures from at least 1 user, and the system meets this requirement. It uses histograms of motion history images (MHIs) as templates for gestures. A similar MHI is built from the live input, and a histogram of that image is compared against the templates. The system does produce recognitions, but it fails to detect many gestures; this is likely to be the result of optimisation and template issues. In addition, there were hardware and software issues which slowed the development of this project. The following report describes the development of this system.

Acknowledgements

I'd like to thank my supervisor, Andy Bulpit, for his help throughout this project. I'd also like to thank Chris Needham and Andrew Bennett (Leeds Vision Group) for their assistance in resolving technical issues surrounding the webcam and image processing library. And of course I'd like to thank my family and friends, who put up with me constantly talking about Computer Vision.

Contents

1 Introduction
    Background
2 Literature Review
    Computer Vision
        Tracking
        Motion History Images
        Template Matching
        Markov Models
        Recursive Induction
        Bayesian Networks
    Software Development
        Languages
        Development Methodologies
3 System Design
    Vision Techniques
        Background Subtraction
        Image Segmentation
        Motion History Images
        Histograms
        Comparison
4 System Implementation
    Platforms
        Video Input
        Image Processing Library
    Vision Techniques
        Background Subtraction
        Image Segmentation
        Motion History Images
        Histograms
        Training
        Comparison
        Results and Testing
5 Evaluation
    Performance of the System
    Potential Future Improvements
        Generation of Templates
        Optimisation of Recognition Threshold
Bibliography

Chapter 1
Introduction

1.1 Background to the Problem

Computer vision is an exciting field of research, with the potential to create new ways of interacting with computers. The aim of this project is to develop an application capable of recognising a range of gestures from a single user through computer vision techniques. A requirement of this project is to produce a low cost, real-time solution. A potential use for the technology is to allow users to interact with a virtual document through gestures, for example by pointing at an area to select it.

The School of Computing has a number of Philips webcams, one of which is to be used during this project. It is intended to develop a system which can use a video source (live video input or pre-recorded) to recognise at least 2 gestures from a single user.

Chapter 2
Literature Review

2.1 Computer Vision

Tracking

One of the tracking methods reviewed is the adaptive background mixture model [6], also known as the Grimson and Stauffer tracker, which can be used for real-time tracking. It is based on the common background subtraction methods sometimes used in vision systems [towardsmodel]. What is different about the adaptive background mixture model is that each pixel is modelled as a mixture of Gaussians, built up over a period of time and evaluated to determine what is and is not in the background. The computational power required by the algorithm could prove to be a limiting factor, and as a result it may only be possible to use a small number of Gaussians. This may not be as problematic as it first sounds, because the program will not be run on particularly noisy scenes (it will be used in controlled situations, where changes in lighting and scenery are not expected).

There could be great benefit in using a tracking system which does not require human intervention when being initialised. Such a method is discussed by Yuan, Sclaroff and Athitsos, who use a probabilistic approach which optimises itself [9]. While this approach could certainly prove useful, it may take too long to implement.
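As an illustration of the per-pixel Gaussian modelling described above, the following is a minimal sketch of the simplest case, with a single running Gaussian per pixel. It is not the Leeds library's tracker or the implementation from [6]; the struct name, the learning rate `alpha`, the threshold `k`, the initial variance and the variance floor are all assumptions made for this example. A mixture model would keep several weighted Gaussians per pixel instead of one.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical single-Gaussian background model (one Gaussian per pixel).
struct GaussianBackground {
    std::vector<float> mean;      // running mean intensity per pixel
    std::vector<float> variance;  // running variance per pixel
    float alpha;                  // learning rate (assumed value)
    float k;                      // foreground threshold in standard deviations

    explicit GaussianBackground(const std::vector<std::uint8_t>& firstFrame,
                                float alpha_ = 0.05f, float k_ = 2.5f)
        : mean(firstFrame.begin(), firstFrame.end()),
          variance(firstFrame.size(), 25.0f),  // assumed initial variance
          alpha(alpha_), k(k_) {}

    // Returns a binary mask (255 = foreground, 0 = background) and updates
    // the model from the pixels that were classified as background.
    std::vector<std::uint8_t> apply(const std::vector<std::uint8_t>& frame) {
        assert(frame.size() == mean.size());
        std::vector<std::uint8_t> mask(frame.size(), 0);
        for (std::size_t i = 0; i < frame.size(); ++i) {
            const float diff = static_cast<float>(frame[i]) - mean[i];
            if (std::fabs(diff) > k * std::sqrt(variance[i])) {
                mask[i] = 255;  // too far from the model: foreground
            } else {
                mean[i] += alpha * diff;
                variance[i] += alpha * (diff * diff - variance[i]);
                if (variance[i] < 4.0f) variance[i] = 4.0f;  // assumed floor
            }
        }
        return mask;
    }
};
```

Updating only from background pixels keeps a hand that pauses in view from being absorbed into the model too quickly; this is a design choice of the sketch, not something the report specifies.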

2.1.2 Motion History Images

The motion history image (MHI) is a technique in which successive image silhouettes are layered into a single template image. Each silhouette decreases in value with every successive frame [3].

Template Matching

A form of template matching is used by Bobick and Davis [2]. Their technique uses the motion history images described above: motion history and motion energy images are compared against similar images which represent templates of known actions.

Markov Models

A number of different approaches to implementing hidden Markov models have been studied. These include pseudo 3-D models and human anatomy based coupled models [7]. Both methods involve adapting the models to the specifics of the movement of human anatomy, i.e. the degrees of freedom of the human hand.

Recursive Induction

One method studied is recursive induction learning, in which the program develops its own learning model from examples [10]; these examples can be supplied during initialisation. A feature of this approach is the development of decision trees based on the degrees of freedom of the human hand. This could prove useful, given that the number of degrees of freedom makes hand gesture recognition a difficult task.

Bayesian Networks

An approach based on a Bayesian framework could be considered [11], as the approach described by Zhou and Huang has a relatively low hardware requirement (the program was run on a 1GHz CPU) and is readily able to track a hand in a relatively cluttered scene.

2.2 Software Development

Languages

The only programming language research has been into understanding the usage of C++, rather than into deciding which language to use. The reason for this is that the project will be using the Leeds Real Time Image Library, which is written in C++, so the program will need to be developed in the same language. A guide to using program libraries in C++ has been found, and will be used to integrate the image processing library into the program to be built [8]. Static libraries are to be used rather than dynamic libraries, as this will enable the program to run on systems without the library installed (the program carries its own statically linked copy). This matters because portability may be required in the testing stages, rather than the development stages.

Development Methodologies

Few development methodologies are suitable for a project such as this, due to the experimental nature of computer vision. Traditional methodologies such as the waterfall model are not appropriate. A form of rapid prototyping is to be used (with an iterative approach), as this will allow the flexibility to make alterations to the program and to the CV algorithms used. Prototyping can be used with the waterfall model or with other development approaches. It allows functionality to be developed early in the project, and allows the project to be refined through a number of iterations with good risk control [1].

Chapter 3
System Design

3.1 Vision Techniques

Background Subtraction

The background subtraction is provided by the Background Modelling Methods library from the Leeds Vision Group. (For SOC students, at the time of writing, this is available at ~vislib/libs/.)

Image Segmentation

The binary image resulting from the background subtraction is to be run through an edge detector, which provides an outline of the hand. As the background subtraction should produce a white image of a hand on a black background, this technique should prove effective.

Motion History Images

The binary image resulting from the image segmentation is to be added to a second image. As well as adding new images, a decay rate is factored in. This way, new motion appears brightest while older motion gradually decays to black. This is relatively simple to implement in a greyscale image, as each pixel is assigned a value from 0 to 255. The binary image from the edge detector produces values of 0 for no edges and 255 for edges, so it is simple to build in a decay rate (based on subtracting a specific pixel value at each iteration).

Figure 3.1: The Overall Design Of The System

    for(i=0; i<width; i++) {
        for(j=0; j<height; j++) {
        }
    }

Figure 3.2: Iteration through all pixels in an image

Figure 3.3: Division of MHI (regions labelled histogram a to histogram h)

Histograms

The approach used is based on work by Aaron Bobick and James Davis [2]. 1-dimensional histograms of pixel intensity are to be used to recognise the gestures. 1D histograms are used because greyscale images use a single value of pixel intensity (0-255). (If RGB or a similar colourspace were used, a 2D or 3D histogram would be used.) A histogram class is provided in the Leeds Real Time Image Processing Library.

8 histograms are to be used, with the motion history image being split into 8 sections (4 rows, 2 columns) and the contents of each section added to the appropriate histogram. A similar process is to be carried out by a separate training program. This program will build histograms for each gesture and save them for future comparison.

This approach should not suffer from problems associated with colour variations, because the histograms are built from the MHI, which is a greyscale image (with intensity linked to time). In addition, histograms are invariant to rotation, because they only store the frequency of pixel values, not their position. The pixel values are grouped into a number of bins. It is hoped that clustering the pixels in this way will help limit the effects of minor variations in the pixel values and clean up the values.

Comparison

The comparison will take place by calculating the mean difference between the histograms of the templates and that of the current MHI. The templates can then be ranked according to how close they are to the current image, and the closest template will be chosen as the current gesture. A failsafe will be required so that the system knows when the user is not making a gesture. 2 possibilities are to have a default template for no gesture, or to set a maximum deviation from the template, whereby any template which exceeds it is ignored. It is intended to test both techniques.
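The division of the MHI into 8 sections described in the Histograms section can be sketched as follows. This is a hedged illustration only: it does not use the histogram class from the Leeds library, the function name `buildRegionHistograms` is hypothetical, the bin count of 16 is an assumption (the report does not state one), and the row-major ordering of the regions is an assumption that may not match the lettering in Figure 3.3.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kRows = 4;   // the MHI is split into 4 rows ...
constexpr int kCols = 2;   // ... and 2 columns, giving 8 regions
constexpr int kBins = 16;  // assumed bin count, not taken from the report

using Histogram = std::array<unsigned, kBins>;

// Builds one intensity histogram per region of a row-major greyscale MHI.
std::array<Histogram, kRows * kCols> buildRegionHistograms(
    const std::vector<std::uint8_t>& mhi, int width, int height) {
    assert(width > 0 && height > 0);
    assert(mhi.size() == static_cast<std::size_t>(width) * height);
    std::array<Histogram, kRows * kCols> histograms{};  // all counts start at 0
    for (int y = 0; y < height; ++y) {
        const int regionRow = y * kRows / height;       // 0..3
        for (int x = 0; x < width; ++x) {
            const int regionCol = x * kCols / width;    // 0..1
            const int value = mhi[static_cast<std::size_t>(y) * width + x];
            const int bin = value * kBins / 256;        // group 0-255 into bins
            ++histograms[regionRow * kCols + regionCol][bin];
        }
    }
    return histograms;
}
```

Deriving the region index directly from the pixel coordinates avoids a chain of nested conditions, so every pixel lands in exactly one of the 8 histograms.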

Chapter 4
System Implementation

During the implementation of this project, it was decided to use the Eclipse IDE. The reason for this was my familiarity with Eclipse and the existence of a C/C++ development plugin which included build control (through GNU make).

4.1 Platforms

Video Input

The Philips webcam caused a number of problems with building the system. For an as yet unknown reason, it was not possible to disable the automatic gain control. As a result, the camera adjusted the image brightness as the light levels differed, which in turn meant that objects moving in a scene caused changes across the whole image. Because the images were changing constantly, it was difficult to separate foreground and background: changes in lighting produce pixel values which differ from the background image and are therefore assumed to be foreground. This problem does not occur on SOC computers. As a result, in the later stages of the project, I was allowed access to the Vision Lab, which allowed data to be gathered without the automatic gain control issue.

Figure 4.1: Variations in lighting caused there to be many differences between the current image and the mean image

Another ongoing problem with the Philips webcam is compatibility with operating systems. Although a driver exists, it was not compatible with Fedora Core 5. This was only a minor compatibility problem, as I was able to restore previous installations of version 4 on my home computers.

Image Processing Library

A substantial amount of time was spent building the library, to the detriment of other components of the project.

4.2 Vision Techniques

Background Subtraction

As mentioned in the previous section, the automatic gain control was an issue for the background subtraction, producing peculiar results. The binary image resulting from the background subtraction should contain black pixels for the background and white pixels for the foreground. The library provided 2 trackers: a tracker based on a single Gaussian hypothesis, and a Grimson and Stauffer tracker based on multiple Gaussian hypotheses [6]. It was decided to use the single Gaussian tracker, as the Grimson and Stauffer tracker proved to be very computationally expensive.

Figure 4.2: Comparison of background subtraction with resulting Sobel edge detection. (a) Background Subtraction (b) After Edge Detection

Image Segmentation

Although the Leeds Real Time Image Processing Library provides an edge detector, unfortunately it is a Sobel edge detector. A Canny edge detector would have been preferable, because it first applies a gradient edge detector and then finds edge pixels through non-maximal suppression and hysteresis tracking. This is a more sophisticated technique, producing a cleaner image [4, 5]. Nonetheless, the Sobel edge detector proved to be remarkably resilient, which could be because the background subtraction produced a relatively clean image.

A diagnostic mode was included in the application, which directed the output of the various stages of image processing to display windows. This enabled the performance of individual components of the system to be viewed. These were the background subtraction (image difference and mean image), the edge detection and the motion history image.

Motion History Images

The motion history images were not as easily implemented as the design in the previous chapter suggested. The first problem was that the initial approach used a set of nested loops designed to iterate through the width and height, which proved to be computationally expensive. A faster approach was found to involve a single loop, iterating over the image using pointers rather than indexed assignment. The use of pointers provided a substantial increase in the speed of the system.
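The single-loop, pointer-based iteration mentioned above might look like the following sketch. It assumes the image is stored as one contiguous greyscale buffer and uses the subtractive decay from the design chapter; the function name `updateMhi` and the default decay step of 16 are hypothetical, and the code does not use the Leeds library's image classes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Updates a motion history image in place with one pass over the buffer.
// `edges` holds 0 (no edge) or 255 (edge) for the current frame;
// `decay` is the amount subtracted from older motion each frame (assumed).
void updateMhi(std::vector<std::uint8_t>& mhi,
               const std::vector<std::uint8_t>& edges,
               std::uint8_t decay = 16) {
    assert(mhi.size() == edges.size());
    std::uint8_t* dst = mhi.data();
    const std::uint8_t* src = edges.data();
    const std::uint8_t* const end = dst + mhi.size();
    for (; dst != end; ++dst, ++src) {
        if (*src == 255) {
            *dst = 255;  // new motion appears brightest
        } else if (*dst > decay) {
            *dst = static_cast<std::uint8_t>(*dst - decay);  // older motion fades
        } else {
            *dst = 0;    // fully decayed to black
        }
    }
}
```

Walking two pointers through the buffers removes the per-pixel index arithmetic of the nested width and height loops, which is one plausible source of the speed-up the text reports.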

Figure 4.3: Video output of the various stages of image processing

Processing speed matters because the library processes video in real time. As a result, delays in processing can result in obsolete data being written to the motion history image. Not all of the effects of the delay were obvious. One example is a ghosting effect, where an object in a scene (e.g. a hand) decays but reappears moments later.

Another change from the initial design was the way decay functioned. As well as being slow, the subtraction method also tended to produce a ghosting effect, where pixels count down to a particular value but remain at that point because it is lower than the threshold value. It was decided instead to multiply the pixel values by a decay rate (a value lower than 1; 0.8 was chosen).

Histograms

There were serious problems encountered when implementing the histograms. Due to delays at various stages, I was left with far less time than I had hoped for at this stage. The problem that emerged was that only 3 histograms were being populated: histograms a, b and c. Because the program iterates through each pixel, through width then height, it can be concluded that these are the first 3 histograms to be populated by the program. (The program treats the image as a 1D array of length w*d.) At this stage, it was impossible to determine exactly what was causing this loss of functionality, but it is possible that it was the result of a computationally expensive loop and nested if statements. In the final version, a single histogram was used for each image because of the problems mentioned. I decided to make this modification as there was little time remaining to repair the program.

Training

It was decided to use a separate training program to build the histograms for the templates. The histograms were cleared at each iteration and saved to file after being built. This allows the final template to be chosen by simply closing the program. Although this is not a particularly advanced technique, it has proven adequate for building the templates.

Comparison

Development of the comparison mechanism was slower than expected. Rather than use a template to recognise no gesture, I decided to use the absolute difference calculated between the video histogram and the templates. This worked by creating a threshold, where any absolute difference which exceeds it is discounted. Choosing a reasonable value is a difficult process; however, the testing mechanism helped considerably. The intended gesture, detected gesture and absolute difference were recorded in a file. It was then a simple matter to examine the differences present in false positives and to reduce the threshold accordingly. Of course, on some occasions this prevented gestures from being picked up.

Results and Testing

As mentioned, a file was used to record the data. The reason was to create a blind test, where seeing the results does not affect the way I constructed gestures. A mechanism was put in place where I could simply click on the window with the left mouse button to state that a point is intended, and with the right mouse button for a clench.
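The thresholded comparison described in the Comparison section can be sketched as follows. This is an assumption-laden illustration rather than the project's code: the struct `GestureTemplate`, the functions `meanAbsDifference` and `classify`, and the use of the mean absolute difference per bin as the distance are all choices made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

struct GestureTemplate {
    std::string name;               // e.g. "point" or "clench"
    std::vector<double> histogram;  // template histogram (bin counts)
};

// Mean absolute difference between two histograms of equal length.
double meanAbsDifference(const std::vector<double>& a,
                         const std::vector<double>& b) {
    assert(a.size() == b.size() && !a.empty());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += std::fabs(a[i] - b[i]);
    return sum / static_cast<double>(a.size());
}

// Returns the name of the closest template, or an empty string when even
// the best match exceeds `threshold` (treated here as "no gesture").
std::string classify(const std::vector<double>& live,
                     const std::vector<GestureTemplate>& templates,
                     double threshold) {
    std::string best;
    double bestDiff = threshold;
    for (const GestureTemplate& t : templates) {
        const double d = meanAbsDifference(live, t.histogram);
        if (d <= bestDiff) {
            bestDiff = d;
            best = t.name;
        }
    }
    return best;
}
```

Lowering `threshold` rejects more false positives at the cost of missing genuine gestures, which is the trade-off the text describes.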

Chapter 5
Evaluation

5.1 Performance of the System

The system was tested by using a selector which indicated the intended gesture. This is written to a file along with the recognised gesture and the absolute difference between the template histogram and the live video histogram. This data could then be used to create a cut-off value, where all differences that exceed it are ignored.

2 gestures were used in this test: a user pointing their hand and a user clenching their fist. In the final version of the program tested, 18 gestures were recognised. Of those, 8 were false.

Figure 5.1: An Example MHI Used For Construction of Histogram

No.   Intended Gesture   Detected
1     clench             point
2     clench             clench
3     clench             clench
4     clench             point
5     clench             point
6     clench             clench
7     clench             point
8     clench             point
9     clench             point
10    point              point
11    point              point
12    point              point
13    point              clench
14    point              clench
15    clench             clench
16    clench             clench
17    clench             clench
18    clench             clench

Figure 5.2: Results of Testing (the Absolute Diff column values are missing from this copy)
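As a worked check of the figures quoted for this test, the short sketch below tallies the 18 intended and detected pairs listed in Figure 5.2. The type and function names are hypothetical and the code is not part of the original system; it only counts the rows of the table.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

enum class Gesture { Point, Clench };

struct Trial {
    Gesture intended;
    Gesture detected;
};

constexpr Gesture P = Gesture::Point;
constexpr Gesture C = Gesture::Clench;

// The 18 (intended, detected) pairs from Figure 5.2, in order.
constexpr std::array<Trial, 18> kTrials = {{
    {C, P}, {C, C}, {C, C}, {C, P}, {C, P}, {C, C},
    {C, P}, {C, P}, {C, P}, {P, P}, {P, P}, {P, P},
    {P, C}, {P, C}, {C, C}, {C, C}, {C, C}, {C, C},
}};

struct Tally {
    int total = 0;    // recognitions where this gesture was intended
    int correct = 0;  // of those, recognitions that matched the intention
};

// Counts the trials for one intended gesture.
Tally tally(Gesture intended) {
    Tally t;
    for (const Trial& trial : kTrials) {
        if (trial.intended != intended) continue;
        ++t.total;
        if (trial.detected == intended) ++t.correct;
    }
    return t;
}
```

Counted this way, 7 of 13 intended clenches and 3 of 5 intended points were recognised correctly, giving 10 correct recognitions out of 18 and the 8 false recognitions reported in Section 5.1.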

It should be noted that this test did not include the intended gestures that were not picked up (false negatives). It is highly likely that the poor quality results were partly the result of poor templates. Unfortunately, there was not much time to develop more templates and a better approach to template generation.

5.2 Potential Future Improvements

The system developed is very crude and there are many possible improvements that could be made.

Generation of Templates

The generation of templates involved building a histogram of a single gesture. A more effective system might use mean histograms, that is, building histograms of several attempts at the same gesture and calculating their mean. This would provide a more robust template, as the quality of the input gesture affects the performance of the entire system.

Optimisation of Recognition Threshold

This is a reference to the cut-off value used to prevent incorrect recognition. A future system could optimise this value itself, using the intended gesture flags to recognise false positives and false negatives. With this approach, the system could be extended to decrease the threshold when it detects false positives (reducing the permitted absolute difference) and to raise it when it detects false negatives. It would of course have to use small increments.
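Both proposed improvements can be sketched in a few lines. The following is a hypothetical illustration of the ideas above, not an implemented part of the system: the function names, the `Outcome` enumeration and the default step of 0.5 are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Averages several example histograms of the same gesture, bin by bin.
std::vector<double> meanHistogram(
    const std::vector<std::vector<double>>& examples) {
    assert(!examples.empty());
    std::vector<double> mean(examples.front().size(), 0.0);
    for (const std::vector<double>& h : examples) {
        assert(h.size() == mean.size());
        for (std::size_t i = 0; i < h.size(); ++i) mean[i] += h[i];
    }
    for (double& v : mean) v /= static_cast<double>(examples.size());
    return mean;
}

enum class Outcome { Correct, FalsePositive, FalseNegative };

// Nudges the cut-off: tighter after a false positive, looser after a
// false negative. `step` is a hypothetical small increment.
double adjustThreshold(double threshold, Outcome outcome, double step = 0.5) {
    if (outcome == Outcome::FalsePositive) threshold -= step;
    if (outcome == Outcome::FalseNegative) threshold += step;
    return threshold < 0.0 ? 0.0 : threshold;
}
```

A small fixed step keeps a single mislabelled trial from swinging the threshold too far, in line with the remark about small increments.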

Bibliography

[1] Comparison of software development methodologies. crosstalk/1995/01/comparis.asp. Last visited 07/12/2005.
[2] Aaron Bobick and James Davis. Real-Time Recognition of Activity Using Temporal Templates.
[3] Gary R. Bradski and James W. Davis. Motion segmentation and pose recognition with history gradients. Machine Vision and Applications.
[4] Bob Fisher, Simon Perkins, Ashley Walker, and Eric Wolfart. Canny edge detector. http://
[5] Bob Fisher, Simon Perkins, Ashley Walker, and Eric Wolfart. Edge detectors. cee.hw.ac.uk/hipr/html/edgdetct.html.
[6] Chris Stauffer and W.E.L. Grimson. Adaptive Background Mixture Models for Real-Time Tracking. In IEEE Computer Vision and Pattern Recognition.
[7] Stefan Muller, Stefan Eickeler, and Gerhard Rigoll. Crane gesture recognition using pseudo 3-d hidden markov models. In Fourth IEEE Conference on Automatic Face and Gesture Recognition, pages 1-4.
[8] David A. Wheeler. Program library howto. program-library/. Last visited 03/12/2005.
[9] Quan Yuan, Stan Sclaroff, and Vassilis Athitsos. Automatic 2d hand tracking in video sequences. Technical report, Computer Science Department, Boston University. Available from http://
[10] Meide Zhao, Francis K.H. Quek, and Xindong Wu. Recursive induction learning in hand gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20. Available from csdl.computer.org/comp/trans/tp/1998/11/i1174abs.htm.
[11] Hanning Zhou and Thomas S. Huang. A bayesian framework for real-time 3d hand tracking in high clutter background. In The International Conference on Human Computer Interaction, volume 10, pages 1-7. Available from hzhou/pub/hcii03.pdf.

Appendix A - Personal Reflection

The final year project is a piece of work which I had dreaded for almost 2 years. After working on this project for almost a year, I can safely say that it has been an amazing learning experience, as it is an individual piece of work with none of the rigid limitations seen in assignments and module criteria. This is the first piece of work of this type where I have had such a level of academic freedom. It has also been the single hardest piece of work I've ever done. Even so, I have no regrets about choosing this particular topic.

This project has helped me develop several important skills, including programming and analytical/problem solving skills, as well as a better understanding of computer vision techniques. My greatest weakness has probably been time management and balancing this project with other assignments, which unfortunately left little time for implementing the later stages of the project. I have found that it is easy to get caught up in a few specific components and to lose sight of the bigger picture. In this sense, many of the problems which occurred were my own fault.

Of course, there were problems which were outside my control, for example the problems with the camera that I mentioned in the implementation chapter. In this sense, it has been a valuable learning experience, and in future I will certainly make provision for hardware and software problems. This certainly was a task I was working on until the 11th hour.

In closing, I feel I have learned a lot from this project, and having the opportunity to develop a Computer Vision project has added to what has already been a fascinating and enjoyable year.
