Real Time Word to Picture Translation for Chinese Restaurant Menus

Similar documents
Book Cover Recognition Project

Improved SIFT Matching for Image Pairs with a Scale Difference

A Fast Segmentation Algorithm for Bi-Level Image Compression using JBIG2

Contrast adaptive binarization of low quality document images

VEHICLE LICENSE PLATE DETECTION ALGORITHM BASED ON STATISTICAL CHARACTERISTICS IN HSI COLOR MODEL

MAV-ID card processing using camera images

An Efficient Method for Vehicle License Plate Detection in Complex Scenes

Preprocessing and Segregating Offline Gujarati Handwritten Datasheet for Character Recognition

Module Contact: Dr Barry-John Theobald, CMP Copyright of the University of East Anglia Version 1

Scrabble Board Automatic Detector for Third Party Applications

Impeding Forgers at Photo Inception

Extraction and Recognition of Text From Digital English Comic Image Using Median Filter

Automatic Electricity Meter Reading Based on Image Processing

Midterm Examination CS 534: Computational Photography

Contents 1 Introduction Optical Character Recognition Systems Soft Computing Techniques for Optical Character Recognition Systems

8.2 IMAGE PROCESSING VERSUS IMAGE ANALYSIS Image processing: The collection of routines and

INDIAN VEHICLE LICENSE PLATE EXTRACTION AND SEGMENTATION

Restoration of Motion Blurred Document Images

Automatic Counterfeit Protection System Code Classification

Interframe Coding of Global Image Signatures for Mobile Augmented Reality

AUTOMATIC LICENSE PLATE RECOGNITION USING IMAGE PROCESSING AND NEURAL NETWORK

Checkerboard Tracker for Camera Calibration. Andrew DeKelaita EE368

Study Impact of Architectural Style and Partial View on Landmark Recognition

Libyan Licenses Plate Recognition Using Template Matching Method

CHARACTERS RECONGNIZATION OF AUTOMOBILE LICENSE PLATES ON THE DIGITAL IMAGE Rajasekhar Junjunuri* 1, Sandeep Kotta 1

Biometrics Final Project Report

Implementation of License Plate Recognition System in ARM Cortex A8 Board

AUTOMATIC LICENSE PLATE RECOGNITION USING PYTHON

Number Plate Recognition Using Segmentation

Why Should We Care? Everyone uses plotting But most people ignore or are unaware of simple principles Default plotting tools are not always the best

Image Recognition for PCB Soldering Platform Controlled by Embedded Microchip Based on Hopfield Neural Network

Deep Green. System for real-time tracking and playing the board game Reversi. Final Project Submitted by: Nadav Erell

Vehicle Number Plate Recognition with Bilinear Interpolation and Plotting Horizontal and Vertical Edge Processing Histogram with Sound Signals

An Evaluation of Automatic License Plate Recognition Vikas Kotagyale, Prof.S.D.Joshi

Proposed Method for Off-line Signature Recognition and Verification using Neural Network

Line Segmentation and Orientation Algorithm for Automatic Bengali License Plate Localization and Recognition

A Study of Slanted-Edge MTF Stability and Repeatability

A new seal verification for Chinese color seal

A Novel Morphological Method for Detection and Recognition of Vehicle License Plates

International Journal of Scientific & Engineering Research, Volume 5, Issue 5, May ISSN

Automatic Licenses Plate Recognition System

Smart License Plate Recognition Using Optical Character Recognition Based on the Multicopter

Automatics Vehicle License Plate Recognition using MATLAB

An Improved Bernsen Algorithm Approaches For License Plate Recognition

A Review of Optical Character Recognition System for Recognition of Printed Text

Matlab Based Vehicle Number Plate Recognition

Combination of Web and Android Application to Implement Automated Meter Reader Based on OCR

Vehicle License Plate Recognition System Using LoG Operator for Edge Detection and Radon Transform for Slant Correction

Recognizing Panoramas

EE 5359 MULTIMEDIA PROCESSING. Vehicle License Plate Detection Algorithm Based on Statistical Characteristics in HSI Color Model

Research on Pupil Segmentation and Localization in Micro Operation Hu BinLiang1, a, Chen GuoLiang2, b, Ma Hui2, c

Evaluation of Voting with Form Dropout Techniques for Ballot Vote Counting

Malaysian Car Number Plate Detection System Based on Template Matching and Colour Information

Webcam Image Alignment

Computer Vision. Howie Choset Introduction to Robotics

3) Start ImageJ, install CM Engine as a macro (instructions here:

Chapter 6. [6]Preprocessing

Study guide for Graduate Computer Vision

Method for Real Time Text Extraction of Digital Manga Comic

Blur Detection for Historical Document Images

FPGA based Real-time Automatic Number Plate Recognition System for Modern License Plates in Sri Lanka

An Efficient Color Image Segmentation using Edge Detection and Thresholding Methods

Experiments with An Improved Iris Segmentation Algorithm

Automatic Morphological Segmentation and Region Growing Method of Diagnosing Medical Images

Object Recognition System using Template Matching Based on Signature and Principal Component Analysis

Multi-Script Line identification from Indian Documents

Efficient Car License Plate Detection and Recognition by Using Vertical Edge Based Method

Why Should We Care? More importantly, it is easy to lie or deceive people with bad plots

Robust Hand Gesture Recognition for Robotic Hand Control

Linear Gaussian Method to Detect Blurry Digital Images using SIFT

Automatic License Plate Recognition System using Histogram Graph Algorithm

Study and Analysis of various preprocessing approaches to enhance Offline Handwritten Gujarati Numerals for feature extraction

An Improved Binarization Method for Degraded Document Seema Pardhi 1, Dr. G. U. Kharat 2

Traffic Sign Recognition Senior Project Final Report

Recursive Text Segmentation for Color Images for Indonesian Automated Document Reader

A Novel Method for Enhancing Satellite & Land Survey Images Using Color Filter Array Interpolation Technique (CFA)

Thresholding Technique for Document Images using a Digital Camera

Feature Extraction Technique Based On Circular Strip for Palmprint Recognition

Efficient Construction of SIFT Multi-Scale Image Pyramids for Embedded Robot Vision

Automatic Ground Truth Generation of Camera Captured Documents Using Document Image Retrieval

FILTERING THE RESULTS OF ZIGBEE DISTANCE MEASUREMENTS WITH RANSAC ALGORITHM

Autocomplete Sketch Tool

Extraction of Newspaper Headlines from Microfilm for Automatic Indexing

Locating the Query Block in a Source Document Image

Manuscript Investigation in the Sinai II Project

Automatic Locating the Centromere on Human Chromosome Pictures

International Journal of Advance Engineering and Research Development

Iris Segmentation & Recognition in Unconstrained Environment

Image stitching. Image stitching. Video summarization. Applications of image stitching. Stitching = alignment + blending. geometrical registration

A Simple Skew Correction Method of Sudanese License Plate

Image Enhancement in spatial domain. Digital Image Processing GW Chapter 3 from Section (pag 110) Part 2: Filtering in spatial domain

Image Segmentation of Historical Handwriting from Palm Leaf Manuscripts

Face Detection System on Ada boost Algorithm Using Haar Classifiers

An Effective Method for Removing Scratches and Restoring Low -Quality QR Code Images

NOVEL APPROACH OF ACCURATE IRIS LOCALISATION FORM HIGH RESOLUTION EYE IMAGES SUITABLE FOR FAKE IRIS DETECTION

Image Processing for feature extraction

Lane Detection in Automotive

Importing and processing gel images

A NOVEL APPROACH FOR CHARACTER RECOGNITION OF VEHICLE NUMBER PLATES USING CLASSIFICATION

Fast identification of individuals based on iris characteristics for biometric systems

Transcription:

Real Time Word to Picture Translation for Chinese Restaurant Menus

Michelle Jin, Ling Xiao Wang, Boyang Zhang
Email: {mzjin12, lx2wang, boyangz}@stanford.edu
EE368 Project Report, Spring 2014

Abstract--We created a new mobile application that translates pictures of restaurant menu items written in Chinese into photos of the corresponding entrées. The system takes an image captured with an Android mobile device camera as input, processes it on a server, and looks up the corresponding entrée image before sending it back over the network; the matching photo is then displayed on the mobile device screen. The server-based image processing consists of two components. The first preprocesses the original scanned menu to form a SIFT feature database of menu items. The second is a run-time process that performs segmentation and feature matching of the camera input image against the preprocessed menu database. The feature matching algorithm contains a histogram-based vocabulary tree matching step as well as a pairwise matching step using SIFT features and RANSAC. Experimental results show that our method is robust and fast: it finds a matching entrée picture within 5 to 6 seconds and has a success rate of around 91% for clear, focused camera inputs.

I. INTRODUCTION

People often want to see images of the food items available at a restaurant before deciding what to order. This is especially true for people who visit ethnic restaurants or travel in a foreign country. In these cases, restaurant menus are written in a foreign language, and the menu item names have no standard English translation. When the menu itself lacks entrée pictures, it becomes very difficult to pick a desirable dish. Sometimes, even when translations exist, the translated names are obscure and do little to help identify the ingredients of the dish. With this project, we designed and built an Android mobile application that lets users see an image of a Chinese dish they want to eat by taking a picture of the dish's name in Chinese with their mobile device. This report first describes the implementation of our application with its client- and server-based approach, followed by the photo matching algorithm of our MATLAB image processing code, and lastly discusses some experimental results from testing the application.

Many existing Chinese text recognition algorithms take an Optical Character Recognition (OCR) approach in which the input image is analyzed using connected components and outline approximation [1]. These methods work fairly well and can achieve a high success rate for English and other alphabet-based languages. However, for character-based East Asian languages such as Chinese, OCR results are much worse. In experiments performed for this project, the Tesseract Chinese OCR engine recognized at best 80% of the characters; clean scanned menu images with black text on white backgrounds yielded the best results. With the much less ideal input from a phone's camera, where the words can be rotated and skewed, with lighting gradients and additional noise, recognition dropped below 60%. This would give poor results for menu item lookup, since some item names differ from others by only one character yet denote completely different dishes.
In this report, we propose a SIFT-based Chinese entrée word detection scheme that is robust and fast. The menu we used was a 50-item grayscale image. Our project is implemented as part of the mobile client-server application we created to translate Chinese restaurant menu items into their corresponding pictures. In our application, the client runs on an Android mobile device and captures an input image. The captured picture is sent to a server, which extracts the menu item's name and runs it through the two-step SIFT-based matching process. Once a matching menu item picture is found, it is sent back to the mobile device and displayed by the client application.

II. IMPLEMENTATION

Fig. 1: Image Processing Pipeline

The implementation is two-fold, as shown in Fig. 1. First, the original menu is preprocessed to create a database containing each separate menu item for later comparison. Second, the photo taken by the Android phone is processed in real time.

In the preprocessing phase, each item of the original menu is cropped out and separated into its own image to be stored in a database. This cropping completely excludes all other menu item words from each photo. The SIFT keypoints and descriptors, as well as the vocabulary tree histogram for each item, are stored in the database for use during real-time matching. Additionally, the database contains the relevant photo for each dish on the menu.

In real time, the Android phone takes a photo of a section of the menu that includes the desired menu item and sends that photo to a server, which invokes MATLAB processing before sending the image of the best-matched food item back to the phone. The processing generally takes 5 to 6 seconds per photo.

During the MATLAB processing phase, the original photo is first resized to 0.2 times its original size for faster processing, without trading off accuracy. The resized photo is then binarized using Otsu's method. Next, the binarized photo is median filtered with a 3x3 window to reduce image noise without altering the words in the menu item. After median filtering, the image is rotated for proper alignment using the Hough transform. Although SIFT is rotationally invariant, proper orientation of the photo is necessary to crop out extraneous menu items; with the phone and menu we used, a typical photo can contain 4 or 5 items in addition to the one actually desired. We make two assumptions in this step: 1) the photo is roughly properly oriented, and 2) the user's desired menu item is in the center of the photo. Under these assumptions, the Hough transform rotation angle is limited to a set number of degrees either way for processing speed and accuracy.

We then crop the photo so that only the desired menu item remains. The cropping could be done simply by taking a central horizontal sliver of the aligned image that covers a fixed portion of the original image. However, since the menu we used contains horizontal lines separating the menu items, the aligned image is instead eroded with a horizontal line element, the image center is selected, and the horizontal lines are used to indicate the edges of the desired section. Finally, a vocabulary tree lookup and SIFT with RANSAC are performed on the aligned, sectioned-off image, and the best matching photo, along with the matching item words from the original database, is returned to the server. The photo is then sent back to the Android phone.
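To make the pipeline concrete, below is a minimal MATLAB sketch of this run-time preprocessing chain using standard Image Processing Toolbox functions. This is an illustration under stated assumptions, not the authors' actual code: the input filename, the line-element length, and the angle step are hypothetical.

```matlab
% Minimal sketch of the run-time preprocessing chain (illustrative values).
I = imread('menu_photo.jpg');            % hypothetical camera capture
if size(I, 3) == 3, I = rgb2gray(I); end

I  = imresize(I, 0.2);                   % 1) downsample to 0.2x for speed
bw = ~im2bw(I, graythresh(I));           % 2) Otsu binarization, text -> 1
bw = medfilt2(bw, [3 3]);                % 3) 3x3 median filter for noise

% 4) Estimate skew from the menu's near-vertical grid lines; the search
%    is restricted to +/-5 degrees, as in Section IV-D.
[H, theta, ~] = hough(bw, 'Theta', -5:0.5:5);
peak = houghpeaks(H, 1);
bw   = imrotate(bw, -theta(peak(2)), 'crop');

% 5) Erode with a horizontal line element so only the separator rules
%    survive, then keep the band between the two rules closest to the
%    image center, assumed to contain the desired item.
rules = imerode(bw, strel('line', 60, 0));   % 60 px length is illustrative
rows  = find(any(rules, 2));
mid   = size(bw, 1) / 2;                     % assumes a rule on each side
item  = bw(max(rows(rows < mid)) : min(rows(rows > mid)), :);
```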
III. ALGORITHM

The two primary goals of the image processing algorithm are accuracy of detection and speed of detection. Accuracy can be measured as the percentage of entrée items on the menu correctly matched to their corresponding database entries. Speed can be measured directly as the time used by the image processing algorithm. The time needed to upload the input image from client to server, and to download the resulting image from server to client, also affects the user experience, but it is outside the scope of this report, as it is unaffected by algorithm optimizations.

A. Scale Invariant Feature Transform on Chinese Characters

The Scale Invariant Feature Transform (SIFT) [10] is well suited to matching an image of Chinese characters against a limited set of known character images. In particular, SIFT's scale and rotational invariance relaxes the constraints placed on the input image from the mobile client. In addition, Chinese characters generate a comparatively large number of SIFT descriptors due to their prevalence of sharp edges and geometric spacing. The test menu contained a total of 50 entrée items, each consisting of between 2 and 5 Chinese characters. Running SIFT on a closely cropped image of each entrée item yielded the following statistics:

TABLE I. SIFT DESCRIPTOR STATISTICS
Statistic Type        Count
Maximum               227
Minimum               82
Average               148.26
Standard Deviation    30.68
Total                 7413
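As an illustration of how such statistics can be gathered with the VLFeat toolbox [10], the sketch below runs vl_sift over the cropped item images. The cell array items, the VLFeat install path, and the db structure are our assumptions, not the authors' code.

```matlab
% Sketch: per-item SIFT descriptor counts behind Table I (VLFeat [10]).
run('vlfeat/toolbox/vl_setup');          % hypothetical VLFeat location

n      = numel(items);                   % items: 50 cropped item images
db     = cell(n, 1);                     % keypoints + descriptors per item
counts = zeros(n, 1);
for i = 1:n
    [f, d]    = vl_sift(single(items{i}));  % f: 4xN frames, d: 128xN uint8
    db{i}     = struct('f', f, 'd', d);
    counts(i) = size(d, 2);
end
fprintf('max %d, min %d, mean %.2f, std %.2f, total %d\n', ...
        max(counts), min(counts), mean(counts), std(counts), sum(counts));
```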

B. RANSAC for SIFT Match

Once SIFT has been applied to both the input image and the menu images, RANSAC is used to determine which database entry most closely matches the input menu item name, similar to the poster matching done in class [8]. While RANSAC can produce highly accurate results, it is also well known that the algorithm can result in long processing times due to the large number of iterations it must cycle through. The general equation for the number of iterations S is:

S = log(1 - P) / log(1 - q^k)

Here, P is the probability of producing a useful result, q is the inlier ratio (the number of inliers over the number of data points), and k is the number of points from which the model parameters are estimated. Setting P = 0.99, q = 0.5, and k = 3, we obtain S = 35 iterations, which is quite expensive in computation time.

C. RANSAC Cutoff

One way to speed up RANSAC is to terminate the iterative algorithm once the number of matches exceeds a certain threshold. Even used naively, this cutoff can be expected to halve the average SIFT match computation time if the threshold is set properly.

D. Vocabulary Tree for SIFT Match

Another way to speed up matching is to use a vocabulary tree as a first pass to sort the 50 menu items from most to least closely matched to the input image. The key design parameters of a vocabulary tree are the branching factor, the number of leaf nodes, and the training data. In our case, the branching factor is set to 10, and the training data comes from hw7_training_descriptors.mat [7]. To set the number of leaf nodes, we must take into account the total number of SIFT descriptors: Table I shows that the 50 items in our sample menu contribute 7413 descriptors in total, so the number of leaf nodes should be less than that.

One use of the sorted menu items is to apply pairwise RANSAC SIFT matching between the input image and only a subset of top-ranked menu items, which reduces computation time by a fixed amount. Another is to apply RANSAC SIFT matching between the input image and all menu items with the cutoff described in Part C; this can reduce computation time by an order of magnitude if both the vocabulary tree and the cutoff are accurate. However, the vocabulary tree should not be used independently of the RANSAC SIFT match, because it does not take into account the positions of the SIFT descriptors, which carry valuable information and increase the probability of a correct match.

E. Binarization

Since the sample menu was grayscale with dark text on a white background, the binarization threshold was set to a constant value in our algorithm. For other types and color scales of menus, the binarization method would have to adapt to the specific image; locally adaptive thresholding is one such method.

F. Hough Transform

The sample menu has repeating vertical and horizontal lines, so the Hough transform was used to find the angle of rotation. Its primary purpose is to aid text detection and to segment away portions of the captured image that are not needed for comparison.

G. Text Detection

We assume that the input image contains multiple entrée items and that the dish of interest is in the middle of the image. Vertical and horizontal lines are detected by eroding with horizontal and vertical line elements, and the result is used to compute the central text region, which is surrounded by grid lines. Cutting down the input image reduces the time needed for SIFT and SIFT matching.
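To make the matching stage of Parts B and C concrete, here is a minimal MATLAB sketch that fits an affine model from k = 3 correspondences, matching the iteration count derived above, with the early cutoff of Part C. The inlier tolerance, the cutoff fraction, and the use of VLFeat's vl_ubcmatch (rather than the class sift_match.m [8]) are our illustrative assumptions.

```matlab
% Sketch: pairwise RANSAC SIFT match with early cutoff (Parts B and C).
function n_best = ransac_match(f1, d1, f2, d2)
    m  = vl_ubcmatch(d1, d2);            % putative matches via ratio test
    x1 = f1(1:2, m(1,:))';               % Mx2 matched keypoint locations
    x2 = f2(1:2, m(2,:))';
    M  = size(m, 2);                     % assumes M >= 3 for the sketch

    S      = ceil(log(1 - 0.99) / log(1 - 0.5^3));  % = 35, as derived above
    cutoff = round(0.6 * M);             % illustrative early-exit threshold
    n_best = 0;
    for s = 1:S
        idx  = randperm(M, 3);           % k = 3 pairs -> affine model
        A    = [x1(idx,:) ones(3,1)] \ x2(idx,:);   % solve 3x2 affine params
        pred = [x1 ones(M,1)] * A;       % map all points through the model
        n_in = sum(sqrt(sum((pred - x2).^2, 2)) < 5);   % 5 px tolerance
        n_best = max(n_best, n_in);
        if n_best >= cutoff, break; end  % Part C cutoff: stop early
    end
end
```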
IV. RESULTS

A. SIFT Speed

SIFT features must be extracted from every input image. For input images cropped to roughly half a megabyte, Table II shows the speed of SIFT extraction and of SIFT matching per pair of images:

TABLE II. VLFEAT SIFT SPEED
Statistic Type        SIFT Time (s)    SIFT Match Time (s)
Maximum               0.9138           0.2098
Minimum               0.7706           0.0888
Average               0.8689           0.1423
Standard Deviation    0.0276           0.0278

B. Vocabulary Tree Speed

For each input image, the computation required is equivalent to generating a histogram from the SIFT descriptors and calculating the overlap of that histogram with every menu item histogram (which is precomputed). In Table III, the second column is the time to generate the histogram, and the third column is the time to compare the input histogram against all histograms in the menu.

TABLE III. VOCABULARY TREE SPEED (LEAVES = 100000)
Statistic Type        Histogram Generation Time (s)    Histogram Match Time (s)
Maximum               0.8483                           0.2098
Minimum               0.1065                           0.0888
Average               0.7327                           0.1423
Standard Deviation    0.0276                           0.0278
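For reference, the two timed steps in Table III might look like the following with VLFeat's hierarchical k-means (vl_hikmeans), scoring by histogram intersection [6]. The variable names, the leaf-index flattening, and the 10,000-leaf setting (the final configuration of Section IV-F) are our illustrative assumptions.

```matlab
% Sketch: histogram generation and histogram match using a vocabulary
% tree built with VLFeat's hierarchical k-means.
K    = 10;                                  % branching factor (Sec. III-D)
tree = vl_hikmeans(train_descrs, K, 10000); % train_descrs: 128xN uint8 [7]

% Histogram generation (Table III, column 2): push the query image's
% descriptors to the leaves and flatten each root-to-leaf path.
paths = vl_hikmeanspush(tree, query_descrs);    % depth x N node indices
leaf  = zeros(1, size(paths, 2));
for d = 1:size(paths, 1)
    leaf = leaf * K + double(paths(d, :)) - 1;  % mixed-radix leaf index
end
h = accumarray(leaf(:) + 1, 1, [K^size(paths, 1) 1]);
h = h / sum(h);                                 % normalized leaf histogram

% Histogram match (Table III, column 3): intersection [6] against the 50
% precomputed menu-item histograms, then rank best-first for RANSAC.
scores = cellfun(@(hm) sum(min(h, hm)), menu_hists);
[~, ranking] = sort(scores, 'descend');
```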

C. Vocabulary Tree Accuracy

The accuracy of the vocabulary tree is determined by the ranking of the correct menu image for a given input image. When the number of leaves is set to 100,000, only half of the correct images are ranked in the top half, which is no better than guessing (Fig. 2).

Fig. 2: Vocabulary Tree Accuracy with Leaves = 100000

Reducing the number of leaves yields an improvement, as shown with 10,000 leaves in Fig. 3.

Fig. 3: Vocabulary Tree Accuracy with Leaves = 10000

Reducing the number of leaves further, from 10,000 to 1,000 (Fig. 4), does not produce any measurable improvement because the centroids become spaced too far apart. On the other hand, with 1,000 leaves there are more descriptors than leaf nodes, which is preferable.

Fig. 4: Vocabulary Tree Accuracy with Leaves = 1000

D. Hough Transform

The Hough transform worked well for perspectives in which the horizontal and vertical grid lines are much longer than any single character. For macro images, where the grid lines are only marginally longer than a character, the Hough transform may rotate the input image by unexpected angles. Consequently, the maximum rotation angle was restricted to +/- 5 degrees.

E. Imperfect Text Detection

Due to the graininess of both the scanned menu and the input image, it was difficult to precisely demarcate the rectangular box around an item. If part of the text was cut off but at least two characters remained, there were enough SIFT descriptors for a correct SIFT match. On the other hand, if the image contains too much information, such as adjacent menu items, SIFT is slower and the detected item may not be the one in the center of the image. Furthermore, we found that items with more characters, and characters of greater complexity, gave better accuracy, since they contain more features for matching.

F. Overall Speed

In the final configuration, with the vocabulary tree passing the top 25 of the 50 menu items to the RANSAC stage, the entire image processing took between 4.5 and 6.5 seconds.

G. Overall Accuracy

The overall accuracy depended on several factors, such as the number of characters in the image and the quality of the camera. Using an iPhone, the accuracy was as high as 91.67%. Using the Android phone provided for the class project, we found the autofocus to be a significant source of error. If the image was in focus and contained at least 3 characters, we empirically found the accuracy to be at least 75%.

On the other hand, out-of-focus images typically resulted in incorrect detection.

REFERENCES

[1] R. Smith, "Tesseract OCR Engine," Tesseract OSCON, Google Inc.
[2] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 2004.
[3] M. A. Fischler and R. C. Bolles, "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography," June 1981.
[4] P. V. C. Hough, "Method and Means for Recognizing Complex Patterns," U.S. Patent 3,069,654, Dec. 18, 1962.
[5] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Trans. Sys., Man, Cyber., 1979.
[6] M. Swain and D. Ballard, "Color Indexing," International Journal of Computer Vision, 1991.
[7] EE368: hw7_training_descriptors.mat
[8] EE368: sift_match.m
[9] EE368: Tutorial 3: Client-Server Communication for Android Project
[10] VLFeat, www.vlfeat.org

APPENDIX: CONTRIBUTION DISTRIBUTION
(Columns: Michelle Jin, Ling Xiao Wang, Boyang Zhang)

Hough Transform: 1
Menu Preprocessing: 1/2, 1/2
Text Detection: 1/3, 1/3, 1/3
RANSAC: 1
Vocabulary Tree: 1
Android Client: 1/3, 1/3, 1/3
Server: 1/3, 2/3
Tesseract Experiment: 1
Menu/Entrée Photo Collection: 1/4, 1/2, 1/4
Poster: 1/3, 1/3, 1/3
Report: 1/3, 1/3, 1/3