Visual Interpretation of Hand Gestures as a Practical Interface Modality

Frederik C. M. Kjeldsen

Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy
in the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY
1997
© 1997
Frederik C. M. Kjeldsen
All Rights Reserved
Abstract

This dissertation describes a user interface in which many tasks traditionally performed with a mouse are instead performed using visual recognition of hand gestures. The goals are to explore both how a vision system should be designed to recognize hand gestures and how gestures are best used in a general-purpose interface. Observed by a camera below the screen, the user manipulates objects directly with gestures incorporating both motion and pose. Task and domain knowledge provide context, allowing real-time recognition on standard PC hardware.

A color-based algorithm is trained to segment the user's hands from complex backgrounds without visual aids. Training uses a novel combination of both positive and negative data to improve segmentation quality. The apparent path of the hand is smoothed with an algorithm that reduces the types of noise inherent in the domain while producing cursor motion on the screen that feels natural to the user. Salient features of the motion are extracted, including a newly discovered natural gesture (a Comma), which helps provide punctuation for each gestural sentence.

Neural networks are trained to classify the pose of the user's hand from cropped and preprocessed images. The nets correctly classify 90-95% of the hand images in real time. A transition network encodes the interaction language. It controls the application of feature-extraction operators and interprets their results to determine when to perform actions on the user's behalf. The style of interaction is based on studies of natural gesticulation and incorporates various features designed to make it natural and easy for the user to remember.

The system demonstrates an 80-90% success rate on most tasks. Object selection time for large objects is shown to be equal to or better than that of a mouse. Object selection performance is modeled accurately by augmenting Fitts' Law with terms for lag and random cursor noise. Finally, the suitability of gesture for this type of task is considered: various interaction styles are examined, and problems specific to hand gesture are discussed.
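Two of the techniques summarized above can be sketched concretely. First, the selection-time model: in the standard Shannon form of Fitts' Law, selection time T is predicted from target distance D and target width W via empirically fitted constants a and b. The abstract's augmentation adds a lag contribution and a random-noise contribution; in the hedged sketch below, L and N(W, σ) are illustrative placeholder terms for those two contributions, not the dissertation's own notation (the exact model appears in Chapter 4):

\[
T \;=\; a + b\,\log_2\!\left(\frac{D}{W} + 1\right) \;+\; L \;+\; N(W,\sigma)
\]

Intuitively, lag adds a delay to each correction the user makes, while random cursor noise penalizes small targets disproportionately, since the cursor must land and stay inside the target despite jitter.

Second, the color-based segmentation training. The following Python sketch is a loose reconstruction of the idea of training a color predicate from combined positive and negative examples, not the dissertation's implementation: the RGB color space, bin count, vote weights, and smoothing width are all assumed values chosen for illustration. Chapter 3 gives the actual color predicate and training procedure.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

BINS = 32  # assumed quantization of each RGB axis (illustrative only)

def quantize(pixels):
    """Map an Nx3 array of uint8 RGB pixels to histogram bin indices."""
    return tuple((pixels // (256 // BINS)).T)

def train_color_predicate(hand_pixels, background_pixels,
                          pos_weight=1.0, neg_weight=2.0, sigma=1.0):
    """Train a binary lookup table (a "color predicate") over color space.

    Hand (positive) pixels vote for their color; background (negative)
    pixels vote against it. Gaussian smoothing spreads each vote to
    neighboring colors, loosely echoing the weighting-and-subtraction
    training the thesis describes. All weights here are assumed values.
    """
    votes = np.zeros((BINS, BINS, BINS))
    np.add.at(votes, quantize(hand_pixels), pos_weight)
    np.add.at(votes, quantize(background_pixels), -neg_weight)
    votes = gaussian_filter(votes, sigma)  # smooth votes across nearby bins
    return votes > 0  # colors with net positive support count as "hand"

def segment(image, predicate):
    """Label each pixel of an HxWx3 uint8 image as hand (True) or not."""
    flat = image.reshape(-1, 3)
    return predicate[quantize(flat)].reshape(image.shape[:2])
```

In use, one would gather hand_pixels from marked hand regions and background_pixels from the rest of the scene, then take the largest connected component of the segmented mask to isolate the hand, as the figures in Chapter 3 suggest.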
Acknowledgments

I would like to express my thanks to IBM for its generous support of this work via the Resident Study Program. Several individuals deserve special mention. My advisor, John Kender, has given his support in many ways. Ross Beveridge's suggestions helped shape the direction of this work, and Steve Feiner's comments helped make the thesis much more complete. Several members of the T.J. Watson community have been very helpful, both as colleagues and as laboratory rats. In particular, thanks to Jon Connell for both inspiration and his many excellent comments, as well as to Sharatchandra Pankanti, Michael Yao, Chitra Dorai, and Lisa Brown. Finally, apologies to my son, Joseph, for having to do without his father so much during the precious first year of his life, and sincere thanks to his mother, Lorraine, both for picking up the slack when I was not there and for suffering my moods when I was.
Contents

Chapter 1: Introduction ... 1
  1.1 Why Gesture ... 2
    1.1.1 How should it be used? ... 5
  1.2 Why Vision ... 6
  1.3 Scope of Problem ... 7
  1.4 Difficulties ... 9
  1.5 Overview of Thesis ... 9

Chapter 2: Background ... 11
  2.1 Hand Gesture Theory ... 11
  2.2 Hand Gesture Recognition ... 13
    2.2.1 Hand Segmentation ... 14
    2.2.2 Pose Recognition ... 18
    2.2.3 Motion Interpretation ... 23
  2.3 Applications of Gesture Recognition ... 26
    2.3.1 Virtual Environments ... 26
    2.3.2 Gesture in Traditional Interfaces ... 28

Chapter 3: System Description ... 30
  3.1 Overview ... 30
    3.1.1 Design Discussion ... 31
  3.2 Hand Segmentation ... 33
    3.2.1 Overview ... 33
    3.2.2 Color Predicate and Training ... 34
    3.2.3 Segmentation Process ... 36
    3.2.4 Design Discussion ... 37
  3.3 Hand Tracking ... 43
    3.3.1 Design Discussion ... 48
  3.4 Motion ... 49
    3.4.1 Smoothing the Hand Path ... 50
    3.4.2 Extracting Motion Features ... 56
    3.4.3 Design Discussion ... 58
  3.5 Pose Recognition ... 63
  3.6 Gesture Interpretation ... 76
    3.6.1 Design Discussion ... 81
  3.7 Implementation Details and Parameters ... 87
    3.7.1 Hardware ... 87
    3.7.2 Segmentation ... 89
    3.7.3 Tracking ... 90
    3.7.4 Motion ... 91
    3.7.5 Pose Recognition ... 93
    3.7.6 Interaction Language Details ... 93
    3.7.7 Window System Interface ... 94

Chapter 4: Performance Evaluation ... 96
  4.1 Segmentation ... 96
    4.1.1 Overall Performance ... 96
    4.1.2 Calibration ... 98
    4.1.3 Performance in Different Environmental Conditions ... 99
    4.1.4 Performance on Different Skin Tones ... 101
    4.1.5 Non-Hand Skin Regions ... 101
    4.1.6 Other Issues Affecting Segmentation Quality ... 102
  4.2 Hand Motion Tracking ... 106
    4.2.1 Smoothing Algorithm Performance ... 106
    4.2.2 Object Selection Performance ... 107
    4.2.3 Subjective Evaluation of Tracking Performance ... 122
  4.3 Pose Recognition ... 124
    4.3.1 Evaluating Network Performance ... 124
    4.3.2 Network Training ... 126
    4.3.3 Sources of Error ... 127
    4.3.4 Network Weight Analysis ... 130
    4.3.5 Variations ... 133
  4.4 The System as a Whole ... 135
    4.4.1 Speed ... 135
    4.4.2 Task Performance ... 137
    4.4.3 User Comments ... 141

Chapter 5: Discussion ... 143
  5.1 Vision Systems for Hand Gesture Recognition ... 144
    5.1.1 Segmentation ... 144
    5.1.2 Tracking ... 145
    5.1.3 Motion Feature Extraction ... 147
    5.1.4 Pose Recognition ... 151
    5.1.5 Language Representation ... 153
    5.1.6 General Considerations ... 153
  5.2 Hand Gestures as an Interface Modality ... 160
    5.2.1 Characteristics of Free-Hand Gesture ... 161
    5.2.2 Designing an Interface for Hand Gestures ... 167
    5.2.3 Climbing the Learning Curve ... 180
    5.2.4 Design of a Practical Gesture System ... 182

Chapter 6: Summary and Conclusions ... 186
  6.1 Summary ... 186
  6.2 In Conclusion ... 189

References ... 190
Appendix ... 196
List of Figures and Tables

Chapter 3 ... 30
  Physical layout ... 30
  System's view of the user ... 30
  Image labeled by the CP, and the largest connected component ... 34
  Weighting function around CP training data ... 35
  User training the system ... 35
  Segmentations of Figure 2 for tracking and pose recognition ... 36
  Optimal CP, and CP produced by histogramming the positive training examples ... 38
  Color Predicates trained using simple update, and with Gaussian smoothing and subtraction ... 41
  Images segmented using the CPs in Figure 8 ... 41
  Segmented images of the user pointing to the four corners of the screen ... 44
  The centroid of the user's hand pointing to the corners of the screen forms a quadrilateral in image space ... 45
  The centroid of the hand, as it follows a grid in screen space, forms a warped grid in image space ... 48
  Sigmoid force scaling functions ... 52
  The hand backing up behind the cursor between cycles, causing overshoot in a simple smoothing algorithm ... 54
  Force applied to the cursor versus hand displacement ... 55
  Table: Motion features ... 57
  Pose recognition network architecture ... 64
  Various appearances the Point pose takes on ... 66
  Hand pointing to the top and bottom of the screen ... 67
  Color-to-gray conversion ... 68
  Two poses very similar in joint-angle space but easy to differentiate in image space ... 70
  Two pointing poses with corrupted outlines ... 71
  A pointing pose with the finger removed, and a fist pose ... 71
  Two extremes of roll in a pointing pose ... 74
  Transition network for the window control task ... 76
  Table: The actions which can be performed at each node ... 77
  Interaction language using only motion features ... 80
  Three CP training templates ... 89

Chapter 4 ... 96
  Segmentation performance: the good, the average, and the ugly ... 97
  Point missing a finger ... 97
  Fist with a hole ... 97
  Example of the face and arm extracted with the hand ... 102
  Example hand images from the PCN training set ... 105
  Hand location before and after smoothing ... 107
  Table: Selection times for a 1-inch target with free-hand pointing ... 109
  Time to select a screen object versus its size in inches ... 110
  Table: Selection time in seconds versus target size in inches ... 110
  Predicted and actual mouse selection time for objects of various sizes ... 113
  Table: Probability of the cursor landing in the target in any one cycle, for various cursor error distributions and target sizes ... 115
  Table: Expected number of cycles for the cursor to land inside the target for 3.5 consecutive cycles, at various levels of noise ... 116
  Predicted and actual selection time for targets of various sizes using free-hand pointing ... 117
  Predicted selection time from simply increasing the tracking rate ... 118
  Predicted free-hand selection time with a reduced level of random noise, and with no noise ... 119
  Selection time performance for realistic targets as a function of tracking rate and noise ... 120
  Predicted free-hand selection times under ideal conditions ... 121
  Examples of the three pose classes differentiated by one of the PCNs ... 124
  Total classification performance versus training cycle for the training and test sets ... 127
  Weights in a typical pose classification network ... 131
  Example images for the network weights discussion ... 132
  Total classification performance during training for binary pose images ... 133
  Classification performance for palm poses during training for binary pose images ... 134
  Table: Results of system task testing ... 137
  Table: Percentage of total errors by category ... 140

Chapter 5 ... 143
  Alternate interaction language for the window control task, using the pose of the hand and the motion that occurs after it to signal an action ... 173
  Interaction language allowing multiple actions, separated by a Comma ... 176
  Menu layout better suited for hand gesture ... 179