Augmented Keyboard: a Virtual Keyboard Interface for Smart glasses


Jinki Jung (jk@paradise.kaist.ac.kr), Jinwoo Jeon (zkrkwlek@paradise.kaist.ac.kr), Hyeopwoo Lee (leehyeopwoo@paradise.kaist.ac.kr), Kichan Kwon (fabulous@paradise.kaist.ac.kr), and Hyun Seung Yang (yang@paradise.kaist.ac.kr), KAIST
Jamal Zemerly (jamal@kustar.ac.ae), KUSTAR, United Arab Emirates

Abstract

We present Augmented Keyboard, a novel virtual keyboard interface for a glass-type interface, i.e., Smart glasses. An intuitive and easy-to-use interaction design is proposed that carries the familiar use of the QWERTY keyboard over to the interface by employing two features of the hand: hand orientation and hand pose. We also address the technical issues of the interface and present a wrist estimation method and a hand orientation estimation method as solutions. Extensive experiments show that robust performance is achieved at 62 fps. Throughout this paper we demonstrate that the proposed interface is fully capable of driving chat applications or games for Smart glasses.

CR Categories: H.5.2 [Information Interfaces and Representation]: User Interfaces - User-centered design

Keywords: Virtual Keyboard, Hand Recognition, Augmented Reality, Natural User Interface

1. Introduction

As a next-generation interface in the mobile device market, the glass-type interface (i.e., Smart glasses), which maximizes the usability and mobility of the device, continues to evolve. The diverse built-in sensors of Smart glasses, such as a touchpad, microphone, and camera, turn the interface from a mere display into a wearable computer that acquires the user's commands and performs tasks. To capture the user's commands in a natural way, many Natural User Interface (NUI) methods employ the user's voice and gestures as input.

However, the most commonly used and most versatile typewriter-style interface, the keyboard, has remained a challenging problem in NUI, because a keyboard is functionally designed to map characters and symbols onto buttons and therefore naturally has many of them. In the case of a keyboard for mobile devices, there are too many buttons to visualize, while interactions must be produced accurately within a possibly small button area [Higuchi and Komuro 2013]. A laser-projected keyboard [Mistry and Maes 2009], which physically augments buttons onto the hand using a projector, is one possible solution, but there remains a large gap between it and the QWERTY keyboard in common use.

We propose Augmented Keyboard, a novel virtual keyboard interface for Smart glasses that carries the use of the QWERTY keyboard over to the interface by employing two features of the hand: hand orientation and hand pose. To reduce the constraints on the background and the computation required to segment the arm, the proposed method assumes a depth camera attached to the Smart glasses, sharing the user's view. In this paper we define the technical issues in accurately estimating hand orientation and pose. To solve them, a wrist estimation method, which is essential for separating the hand from the forearm when estimating hand orientation, is presented and evaluated quantitatively. We also present a hand orientation estimation method that yields accurate performance. The analysis of the two-dimensional hand shape from the depth image is described using both model-based and appearance-based approaches. Extensive experimental results demonstrate that accurate and robust performance of the interface is achievable at 62 fps. The contributions of this paper can be summarized as follows: 1) an intuitive and easy-to-use interaction design for a Smart glasses keyboard; 2) a wrist estimation method that is robust against variations in hand orientation and pose, together with its quantitative evaluation; 3) a hand orientation estimation method that is robust to changes in hand orientation and pose; and 4) real-time performance of the proposed interface.
1.1 Related Work

Studies of vision-based hand pose recognition and tracking have been explored extensively. An RGB color image based method is presented in [Shen et al. 2011] that utilizes fingertips and convexity defect points to establish interactions with virtual objects, such as augmentation and pointing. A camera-projector based hand gesture recognition method in [Licsár and Szirányi 2005] introduces an online training scheme that adaptively learns and recognizes the hand gestures of different users. Since hand segmentation can hardly achieve robust performance under background variation, depth sensors are employed to ease the constraints on segmentation. [Raheja et al. 2011] and [Ren et al. 2011] exploit such segmentation to extract features from the contour of the hand. Both works are based on the distance transform, which is not only robust for hand center detection but also discriminative with respect to hand pose.

Figure 1. Proposed interface. Three characters are assigned to each finger and can be selected by changing the orientation of the hand. The user types the aimed character with a click gesture of the corresponding finger.

Figure 2. Arm model, consisting of a palm model (yellow circle) and a forearm model (blue rectangle).

There are three major technical issues. First, the wrist line should be robustly estimated to separate the hand from the forearm. Second, the estimation of both the wrist and the hand orientation should not be influenced by the hand pose (especially the pose of the fingers). Third, the proposed interface should satisfy the real-time constraint. All of these issues, as well as hand pose recognition, are deeply coupled with the wrist estimation. The wrist can be used to: 1) eliminate the uncertainty in the shape of the forearm area, 2) extract a hand-only image, and 3) estimate the hand orientation. The estimation of the wrist, however, is hard to accomplish due to the absence of a stable feature [Ren et al. 2011]. The depth image is an excellent feature even for a 3D hand model, as described in [Oikonomidis et al. 2007], whose authors argue that a model-based approach provides continuous solutions to the problem of tracking hand articulation.

2. Interaction Design

Analyzing the use of a conventional keyboard shows that not only the hand or finger pose is related to an input; the orientation of the hand also contributes to it. For example, when a user pushes a button that is far from the center of the hand, he may change the hand's orientation and stretch a finger to reach the button, because he wants to minimize hand and arm movement. We also note that there exists an implicit one-to-multiple mapping between a finger and the buttons of a keyboard. Based on these two observations, we present a novel keyboard interface, shown in Figure 1, with the mapping sketched in code below. We assume that only the hands and forearms are observable in the view and that the back of the hand faces upward. Two features, quantized hand orientation and hand pose, are employed to form different dimensions, so that the number of interactions that can be produced by the proposed interface is N x M, where N and M are the quantization levels of the hand orientation and the hand poses, respectively. We restrict the recognizable hand pose to an open palm in which all fingers are separated from each other. To provide visual guidance, the characters denoting the possible inputs for each finger are augmented upon the fingertips. The characters assigned to each finger are selected by their position on the QWERTY keyboard near the corresponding finger. The user aims at one of the characters by controlling the orientation of the hand; the aimed character is then input by clicking the finger. This procedure is straightforward, and the typing speed can improve with some practice, since the interface does not always require the user's visual attention.
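To make the N x M design concrete, the following is a minimal sketch of the orientation-quantized character lookup, assuming N = 3 orientation bins. The specific character layout, bin boundaries, and function names are hypothetical, since the paper shows the assignment only in Figure 1.

```python
# Minimal sketch of the N x M interaction mapping (illustrative only).
# Assumption: N = 3 quantized orientation bins per finger; the actual
# character-to-finger layout of the paper is not specified here, so
# this table is hypothetical.
import numpy as np

N_ORIENT_BINS = 3  # quantization level of hand orientation (N)

# CHARMAP[finger_id][orientation_bin] -> character (hypothetical layout)
CHARMAP = {
    0: ['q', 'a', 'z'],  # thumb
    1: ['w', 's', 'x'],  # index
    2: ['e', 'd', 'c'],  # middle
    3: ['r', 'f', 'v'],  # ring
    4: ['t', 'g', 'b'],  # little
}

def quantize_orientation(angle_deg, lo=-30.0, hi=30.0, n=N_ORIENT_BINS):
    """Map a relative hand orientation (degrees) to one of n bins."""
    bin_idx = int((angle_deg - lo) / (hi - lo) * n)
    return int(np.clip(bin_idx, 0, n - 1))

def lookup_character(finger_id, hand_orientation_deg):
    """Return the character aimed at by this finger at this orientation."""
    return CHARMAP[finger_id][quantize_orientation(hand_orientation_deg)]

# e.g. clicking the middle finger while the hand is rotated +20 degrees
print(lookup_character(2, 20.0))  # -> 'c' under this hypothetical layout
```

Clicking a finger then commits the character currently selected by the orientation bin, matching the three-characters-per-finger guidance of Figure 1.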

Figure 4. Proposed feature for wrist estimation: the intersection point P_i of the scan line (dotted) with the hand contour (a), and its geometric visualization (b).

Figure 5. Process of hand orientation estimation (see Section 3.2).

2.1 Problem Statement

For robust estimation of the wrist and the hand orientation, a model-based approach is presented that exploits the geometric properties of an arm. The employed arm model consists of a forearm and a palm. The palm is modeled as a circle of radius r_h, and the forearm is modeled as a rectangle. The wrist line is defined as the line whose slope is orthogonal to the principal axis of the arm and whose orthogonal distance from the center of the palm model, i.e., the hand center, is r_h. From this definition, accurate estimation of r_h is the most important issue in the wrist estimation. The hand orientation is defined as the relative angular difference between two axes, the principal axis of the palm blob and the principal axis of the forearm blob, which meet at the center of the wrist (Figure 2). Since it is defined as a relative angle, it is invariant to the principal axis of the arm as a whole.

3. Proposed Method

Figure 3 presents the flow diagram of the proposed method. The input to the method is a 320 x 240 depth image acquired by an Intel Creative camera. In preprocessing, Connected Component Labeling of the depth image is performed to extract the one or two largest blobs larger than 100 pixels. We consider these blobs as potential arm blobs and process them independently. The principal axis of each blob is found using PCA for orientation normalization of the blob. To extract the hand center we apply the distance transform method of [Raheja et al. 2011] and [Ren et al. 2011], which shows the most robust localization performance under changes in hand pose and orientation. The wrist estimation is then processed with the orientation-normalized arm blob and the hand center. With the estimated position of the wrist line, the two hand features are extracted in the hand orientation estimation and hand pose recognition stages. Once all features are extracted, feature parsing returns the character mapped to the corresponding features. The validation of the potential arm blobs is performed throughout all stages of the hand feature extractor, as shown in Figure 3. A code sketch of this preprocessing stage is given below.

Figure 3. Flow diagram of the proposed interface.
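The preprocessing stage just described can be sketched with OpenCV as follows. The 100-pixel blob threshold, the PCA normalization, and the distance-transform hand center follow the text; the function names and exact calls are our own reading, not the authors' implementation.

```python
# Sketch of the preprocessing stage: connected-component labeling,
# selection of the largest blobs, PCA-based orientation normalization,
# and distance-transform hand-center localization (after [Raheja et al.
# 2011]). This follows the paper's description, not its actual code.
import cv2
import numpy as np

MIN_BLOB_PIXELS = 100  # blobs smaller than this are discarded (per paper)

def extract_arm_blobs(depth, max_blobs=2):
    """Return binary masks of the one or two largest foreground blobs."""
    fg = (depth > 0).astype(np.uint8)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(fg)
    # sort component ids (skipping background 0) by area, largest first
    order = sorted(range(1, n), key=lambda i: -stats[i, cv2.CC_STAT_AREA])
    return [(labels == i).astype(np.uint8)
            for i in order[:max_blobs]
            if stats[i, cv2.CC_STAT_AREA] >= MIN_BLOB_PIXELS]

def principal_axis(mask):
    """PCA over blob pixel coordinates; returns mean and dominant axis."""
    ys, xs = np.nonzero(mask)
    pts = np.column_stack([xs, ys]).astype(np.float64)
    mean = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - mean, full_matrices=False)
    return mean, vt[0]  # unit vector of the principal axis

def hand_center(mask):
    """Distance-transform maximum as the hand center."""
    dist = cv2.distanceTransform(mask, cv2.DIST_L2, 5)
    _, r_max, _, c = cv2.minMaxLoc(dist)
    return c, r_max  # (x, y) of the center and its distance to the contour
```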
3.1 Wrist Estimation

As described in the previous section, estimating the radius r_h of the palm model is the key function in the wrist estimation. To accomplish a robust estimation, we employ two stable features of the hand: the hand center [Raheja et al. 2011] and the convex hull of the arm blob. We define the term concave region: a region bounded by convex hull points and convexity defects, as depicted in Figure 4. We present a novel wrist estimation method using the hand center and one concave region. The concave region is chosen whose center of mass has a lower y-coordinate than that of the hand center c_H and the minimum Euclidean distance to the hand center, shown as the yellow region in Figure 4(a). The line projected from the hand center c_H to the center of mass P_m of the concave region is used as a scan line, so that the distance from the hand center to the point P_i where the scan line meets the hand contour is taken as the radius of the circular palm model. A geometric analysis is given in Figure 4(b) for the case where \| c_H - P_m \|_{L2} is minimal. If we approximate the concave region by the triangle drawn with red lines in Figure 4(b) and set P_i as the origin, then the center of mass P_m can be written as

    P_m = \left( \frac{r_h (\sin\alpha_a + \sin\alpha_p)}{3}, \; \frac{r_h (\cos\alpha_a + \cos\alpha_p)}{3} \right)    (1)

where \alpha_a and \alpha_p are the angles of the two remaining triangle vertices as seen from P_i. Although \alpha_p varies with the contour of the forearm, \alpha_p \le \alpha_a is always satisfied, so that only the lower part of the hand contour needs to be examined to estimate r_h. A numeric check of this reading of Eq. (1) is given below.
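Because Eq. (1) is damaged in the source, the form above (a plain triangle centroid with P_i at the origin and angles measured from the scan-line axis) is a reconstruction; the short check below only verifies that this reading is self-consistent.

```python
# Numeric check of the triangle-centroid reading of Eq. (1).
# Vertices: P_i at the origin, plus two points on a circle of radius r_h
# at angles alpha_a and alpha_p (measured from the vertical axis, hence
# the sin/cos ordering). This reconstruction is ours, not the authors'.
import numpy as np

r_h, alpha_a, alpha_p = 40.0, np.deg2rad(35.0), np.deg2rad(-20.0)

A = r_h * np.array([np.sin(alpha_a), np.cos(alpha_a)])
B = r_h * np.array([np.sin(alpha_p), np.cos(alpha_p)])
centroid = (np.zeros(2) + A + B) / 3.0  # plain centroid of the triangle

# Eq. (1) as reconstructed in the text:
P_m = np.array([r_h * (np.sin(alpha_a) + np.sin(alpha_p)) / 3.0,
                r_h * (np.cos(alpha_a) + np.cos(alpha_p)) / 3.0])

assert np.allclose(centroid, P_m)  # the two expressions agree
```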

3.2 Hand Axis Estimation

Following the definition of the hand axis estimation problem in Section 2.1, the two axes, palm and forearm, are estimated from the two images separated by the wrist line obtained in the wrist estimation. Figure 5(a) shows the distance-transformed depth image; the local maxima points in the palm and forearm regions are denoted with different colors in (b). The local maxima points are used to extract the two axis lines with an M-estimator technique (c). The relative difference between the two lines is then computed as the result of the hand axis estimation; a code sketch follows below.
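A sketch of this axis extraction under stated assumptions: the paper does not name its M-estimator, so cv2.fitLine with a Huber robust distance stands in for it, and the local-maxima (ridge) selection is approximated by thresholding the distance transform.

```python
# Sketch of hand-axis estimation: take distance-transform ridge points
# in the palm and forearm regions, robustly fit a line to each set, and
# report the relative angle (Section 3.2). cv2.fitLine with DIST_HUBER
# stands in for the paper's unspecified M-estimator.
import cv2
import numpy as np

def ridge_points(mask, thresh_ratio=0.8):
    """Approximate distance-transform local maxima: per row of the
    orientation-normalized blob, keep pixels near the row maximum."""
    dist = cv2.distanceTransform(mask, cv2.DIST_L2, 5)
    row_max = dist.max(axis=1, keepdims=True)
    ys, xs = np.nonzero((dist >= thresh_ratio * row_max) & (dist > 0))
    return np.column_stack([xs, ys]).astype(np.float32)

def axis_angle(points):
    """Robust line fit (Huber M-estimator); returns angle in degrees."""
    vx, vy, _, _ = cv2.fitLine(points, cv2.DIST_HUBER, 0, 0.01, 0.01).ravel()
    return float(np.degrees(np.arctan2(vy, vx)))

def hand_orientation(palm_mask, forearm_mask):
    """Relative angle between the palm axis and the forearm axis."""
    return axis_angle(ridge_points(palm_mask)) - axis_angle(ridge_points(forearm_mask))
```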
3.3 Hand Pose Recognition

Since the movement of a finger is not completely independent of the other fingers, we treat hand pose recognition as an image classification problem. The wrist line is used to crop a hand-only image. The image is scale-normalized to 50 x 50, and it is then validated whether the hand pose is recognizable. In the validation, the observability of the five fingers is checked via the number of local maxima lines in the distance-transformed hand image. Two classification methods, Locality Sensitive Hashing (LSH) and Support Vector Machine (SVM), are employed and quantitatively compared in the experiment; a minimal classifier sketch follows below.
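A minimal appearance-based classifier sketch for this section: the 50 x 50 normalization and the SVM choice come from the paper, while the raw-pixel feature vector and scikit-learn are our substitutions, since neither the descriptor nor the library is specified.

```python
# Sketch of hand-pose recognition as image classification (Section 3.3).
# The hand-only crop is scale-normalized to 50 x 50 and fed to an SVM;
# using raw pixels as the feature vector is our assumption, since the
# paper does not specify the descriptor.
import cv2
import numpy as np
from sklearn.svm import SVC

POSE_SIZE = (50, 50)  # scale normalization used in the paper

def pose_feature(hand_crop):
    """Resize the hand-only depth crop and flatten it into a feature."""
    img = cv2.resize(hand_crop, POSE_SIZE, interpolation=cv2.INTER_AREA)
    return img.astype(np.float32).ravel() / max(img.max(), 1)

def train_pose_classifier(crops, labels):
    """Train an SVM over 12 hand poses (6 per hand, as in the paper)."""
    X = np.stack([pose_feature(c) for c in crops])
    clf = SVC(kernel='rbf', gamma='scale')
    clf.fit(X, np.asarray(labels))
    return clf

def recognize_pose(clf, hand_crop):
    """Predict the pose label of a single hand-only crop."""
    return int(clf.predict(pose_feature(hand_crop)[None, :])[0])
```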

4. Experiment

The experiment was conducted on a computer equipped with a dual-core Intel i5-3570 at 3.40 GHz and 16 GB of RAM. An Intel Creative Senz3D camera was used to capture the depth and RGB color images simultaneously. For the quantization parameters, N = 3 and M = 6 were used. The experimental evaluation of the proposed method was based on ground truth data captured from the calibrated RGB camera image: we manually labeled the physical position of the wrist and the principal axes of the hand and forearm with color labels. Three test datasets were built, with variations in hand orientation, in hand shape, and in both. On this system, the average frame rates of the entire process are 62.03 fps for two hands and 105.78 fps for one hand.

4.1 Performance of Proposed Interface

We first evaluate the accuracy on the test datasets against the wrist estimation methods of [Licsár and Szirányi 2005] and [Seo et al. 2008]. The average pixel error is measured as the projected distance from the ground truth. Table 1 indicates that the method of [Seo et al. 2008] is more accurate than ours under hand pose variation; we suspect the main reason is the unstable localization of the hand center for some poses, which biases the distance transform. In every other case, the proposed method yields the most accurate performance. The standard deviation of the error demonstrates the robustness of our method: as Figure 6 illustrates, the distribution of our distances converges around 20 pixels, while those of the other methods show severe misdetections.

Table 1. Comparison of accuracy with the other wrist estimation methods. Entries are average pixel distance (standard deviation).

Method                                Hand orientation set   Hand pose set    Mixed set        Overall
Method in [Licsár and Szirányi 2005]  209.48 (51.49)         186.87 (39.17)   233.05 (46.93)   209.80 (45.86)
Method in [Seo et al. 2008]           68.68 (101.87)         24.63 (18.76)    86.75 (125.58)   60.02 (82.07)
Proposed                              40.98 (8.20)           38.51 (10.48)    47.51 (10.48)    42.33 (9.72)

Figure 6. Comparison of histograms of the wrist estimation methods.

We also evaluated the accuracy of the hand pose classification with 12 hand poses (6 for each hand). Figure 7 illustrates that SVM outperforms LSH in most cases. The poses of fingers such as the ring and middle finger, which are constrained by neighboring fingers, yield poorer recognition. The total average recognition rates were 95.8% for SVM and 87.7% for LSH.

Figure 7. Accuracy of the hand pose classification.

Table 2 shows the accuracy of the hand orientation estimation, measured in degrees. Like the wrist estimation, the hand orientation estimation performs worst under hand pose variation. Given the SVM classifier and a hand orientation estimate with a mean error below 1 degree, the proposed system is capable of realizing the proposed interaction design.

Table 2. Accuracy of hand orientation estimation.

                      Hand orientation set   Hand pose set   Mixed set   Overall
Mean (degree error)   0.68                   1.22            0.95        0.95
Standard dev.         2.53                   2.29            3.20        2.67

Figure 8. Screenshots of the prototype of the proposed interface. Each hand orientation is visualized as yellow lines; the recognized hand pose is visualized as a number indicating the finger's order.

Figure 9. Demonstration of Augmented Keyboard. The characters assigned to each finger are visualized upon the corresponding finger. The characters 'hello', entered using the key-press gesture, are shown in the parent window.

5. Conclusions

For a keyboard interface for Smart glasses, a novel interface is proposed that comprises an interaction design and the two estimation methods, wrist estimation and hand orientation estimation, that enable the designed interaction. Not only is the proposed wrist estimation more accurate than the other methods evaluated, it also tends to yield the most stable detection rates, remaining robust to hand pose and orientation. The keyboard interface implemented in the experiment demonstrates its potential for entertainment applications. In future work, we will introduce a part-model based hand pose estimation that can recover both the overall hand pose and each finger's pose, so that the precision of the clicking gesture can be improved. An HMM-based approach will also be adopted to enhance the parsing ability of the interface.

Acknowledgements

This research was supported by the IT R&D program of MKE/KEIT (10039165, Development of learner-participatory and interactive 3D virtual learning contents technology). This work was supported by the IT R&D program of MKE & KEIT [10041610, The development of the recognition technology for user identity, behavior and location that has a performance approaching recognition rates of 99% on 30 people by using perception sensor network in the real environment]. This research was supported by the KUSTAR-KAIST Institute, Korea.

References

HIGUCHI, M., AND KOMURO, T. 2013. AR typing interface for mobile devices. In Proceedings of the 12th International Conference on Mobile and Ubiquitous Multimedia, ACM.

LICSÁR, A., AND SZIRÁNYI, T. 2005. User-adaptive hand gesture recognition system with interactive training. Image and Vision Computing 23, 1102-1114.

MISTRY, P., AND MAES, P. 2009. SixthSense: a wearable gestural interface. In Proceedings of SIGGRAPH Asia, ACM.

OIKONOMIDIS, I., KYRIAZIS, N., AND ARGYROS, A. A. 2007. Efficient model-based 3D tracking of hand articulations using Kinect. Computer Vision and Image Understanding 108, 1-2, 52-73.

RAHEJA, J. L., CHAUDHARY, A., AND SINGAL, K. 2011. Tracking of fingertips and centers of palm using Kinect. In Computational Intelligence, Modelling and Simulation, 248-252.

REN, Z., YUAN, J., AND ZHANG, Z. 2011. Robust hand gesture recognition based on finger-earth mover's distance with a commodity depth camera. In Proceedings of the 19th ACM International Conference on Multimedia, ACM, 1093-1096.

SEO, B.-K., CHOI, J., HAN, J.-H., PARK, H., AND PARK, J. 2008. One-handed interaction with augmented virtual objects on mobile devices. In Proceedings of the 7th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and Its Applications in Industry, ACM, p. 8.

SHEN, Y., ONG, S. K., AND NEE, A. Y. 2011. Vision-based hand interaction in augmented reality environment. International Journal of Human-Computer Interaction, Taylor & Francis, 523-544.