Aks: A Database for Detection and Extraction of Devanagari Text in Camera Based Images


Ganesh K Sethi #1, Rajesh K Bawa *2
# Assistant Professor, Department of Computer Science, Multani Mal Modi College, Patiala, India
* Professor, Department of Computer Science, Punjabi University, Patiala, India

Abstract With the advent of digital cameras and other hand-held imaging devices, a new type of text-containing image has emerged that traditional optical character recognition (OCR) technology is unable to handle. These camera based images impose a number of challenges that are absent in scanned or born-digital images. To detect and extract text from camera based images, a Text Information Extraction (TIE) process is carried out that detects the presence of text in an image and separates it from the background. In this paper a detailed comparison between camera based, born-digital and scanned images is presented. A database of camera based images containing text, particularly in Devanagari script, is created. A survey of the available benchmark databases is done and, keeping in view the challenges of camera based images, an exhaustive dataset of images is prepared. The paper also discusses the evaluation metrics used to compute the accuracy of text detection and extraction from camera based images.

Keywords Born Digital Images, Camera Images, Scanned Images, Text Information Extraction (TIE), optical character recognition (OCR).

I. INTRODUCTION

OCR is a field related to the recognition of characters in scanned documents, PDF files or images captured by a digital camera. Commercial OCR systems originated about five decades ago and made automatic interpretation of hard copy documents possible. Since then many improvements have been made, and the accuracy of commercial OCRs in certain languages with clear text and a clean background is more than 99%. Traditional OCRs use a flatbed scanner to input an image with a clean background, even illumination and minimal or no skew.
With the advent of digital cameras, PDAs and mobile cameras, a new type of document has originated. Documents are no longer confined to scanned images, but also include images of text on real-world objects taken from real scenes. Images captured with a digital camera, such as sign boards on roads, buildings, vehicle license plates and clothes, contain text that needs to be understood. This has given rise to a sub-field of OCR for camera based digital images. A wide variety of images is taken with hand-held devices like digital cameras, PDAs, mobile cameras and hand-held scanners. These documents differ from traditional scanned images due to complex backgrounds, uneven illumination, variable font size and style, and arbitrary layouts of text in the image [1]. The traditional OCR technology that works for scanned images is not applicable here. This failure of traditional OCR systems gave birth to a new area of research: Camera Based Document Analysis and Recognition. Since 2003, a Robust Reading Competition for extracting and recognizing text in digital images has been held with the International Conference on Document Analysis and Recognition (ICDAR) to find the best method. Owing to the many challenges, the accuracy of localizing, extracting and recognizing text in camera based images is still not satisfactory.

To recognize text in camera based images, the text has to be completely extracted before it is passed to an OCR. The process carried out before passing the image to the OCR is Text Information Extraction (TIE). Text information extraction is a task performed on digital images to extract and recognize the text content in them. Natural scene images contain large amounts of information, which often needs to be automatically recognized and processed.
TIE is an active and important area of research in digital image processing because of its potential applications in mobile robot navigation, vehicle number plate detection and recognition, object identification, reading foreign language text, content based indexing and searching, and helping visually impaired persons to know their environment better. The input to a TIE system is an image captured with a digital camera or any other hand-held device. To check the accuracy of any TIE system, an exhaustive set of images is required that includes all the challenging cases. In the present study, we have captured images under different conditions to create a dataset of Devanagari text images. Multi-script images, which occur frequently in real-world environments, are also taken into account. Section II of the paper discusses the differences between camera, scanned and born-digital images. Section III discusses various benchmark databases. Evaluation metrics are given in Section IV and the proposed work is presented in Section V.

ISSN: Page 32


II. WHAT IS AN IMAGE?

The data input to any OCR or TIE system is an image. A digital image may originate from converting a paper document into digital form using a scanner (scanned image), from capturing a scene with a digital camera (camera based image), or it may be created synthetically using software (born-digital image).

A. Scanned Image

A scanned image is formed by scanning a document using a scanner against a plain background. Scanners are used to digitize old manuscripts, books and other documents. The quality of a scanned image depends on the quality of the scanner and of the document to be digitized. Usually, scanners are used for document images and there is no problem of uneven illumination, background noise or texture. Further, scanned images mostly have black text on a white background or vice versa. Commercially available OCRs are able to recognize text in scanned images having text in a particular orientation by applying a series of steps such as preprocessing, segmentation, feature extraction and classification. The accuracy of recognition in certain languages is more than 99%. So extracting and recognizing text from scanned images is a relatively solved problem for some scripts, while work is still going on for other languages.

B. Born Digital Image

Born-digital images are images formed synthetically using software tools. These images are mostly used in web pages, messages and on social media (WhatsApp, Facebook etc.) to insert textual information. The practice of embedding text in images originates from a variety of needs, for example in advertisements to attract attention, in titles and headings for decoration, in spam e-mails to avoid text-based filtering, to hide information, and for generating CAPTCHA tests. Born-digital text images resemble other complex text-containing images, but they present certain distinct characteristics.
In such images, the resolution is inherently low (so that they can be transmitted online and displayed on a mobile screen or monitor), and the text is normally superimposed on the image. The automatic detection and extraction of text from born-digital images is an interesting area as it would enable a number of applications such as improved indexing, retrieval of web images and content filtering (e.g. advertisements or spam e-mails). The presence of text in born-digital images has been quantified in the past [2]-[3]. Research has shown that a considerable amount of the text on web pages is presented in images (17%), while an important fraction of this text (76%) is not to be found anywhere else on the Web [2]. Work focused specifically on born-digital images has also been published [4]-[6].

Figure 2: Born-digital Images taken from internet

Figure 2: Scanned Images

C. Camera Based Image

With the evolution of low-cost consumer digital cameras, a new challenge has opened up for the document analysis community. Camera based images are real scene images captured using a digital camera or any hand-held device with a digital camera, such as a mobile phone, PDA or tablet. The images are captured from real scenic environments. As the resolution of the devices varies, the resolution of camera images varies greatly. Further, camera images have many distortions such as uneven illumination, complex backgrounds, motion blur and perspective deformation. Camera based images, besides document images, include natural scene images containing sign boards, vehicle number plates, street names, electricity meters, and so on. The extraction of text from these camera based images has enormous applications, viz. license plate recognition, helping the visually impaired, understanding foreign languages, industrial automation, web search etc.

| Criteria | Camera Based Image | Scanned Image | Born Digital Image |
|---|---|---|---|
| Background | Heterogeneous, any color | Homogeneous, usually white | Usually single color |
| Sharpness | Possibly motion blur | Usually sharp | Usually sharp |
| Lens position | Variable; geometric and perspective distortions present | Fixed; document lies on scanner's glass | Artificially created |
| Colours | High variation of colors | Black text on white background | Variable colours |
| Contrast | Depends on colors, shadows, lighting, illumination, texture | Good (black/dark text on white/light background) | Good (dark/light text superimposed on light/dark background) |
| Words/Characters arrangement | May have horizontal and vertical lines, curved, wavy text | Clear horizontal lines | Depends on target application |
| Surface | Freestanding (detached) or attached to objects with arbitrary non-planar surfaces | Text attached to plain paper | Text superimposed on a plain or textured background |
| Rotation | Arbitrary rotations | Horizontally aligned text lines or rotated by 90 degrees | Arbitrary rotations |
| Font size | Large variation in font sizes in single image | Limited number of font sizes, usually same throughout image | Variable font sizes mingled together |
| Font type | Machine-printed, handwritten, special effects (textured or 3D) | Machine-printed, handwritten | Superimposed typed text |
| Number of lines | Often only single line or few words in image | Usually several lines of text | Depends on target application |

Table 1: A comparison between Camera based, Scanned and Digital image
For Roman scripts, research on text detection and recognition in camera based scene images started in the mid-90s [7]-[8] and has since seen substantial growth, with a large number of approaches published in the last two decades [8]-[12].

Figure 3: Camera-Based Images from ICDAR and our dataset

III. BENCHMARK DATABASES AND ROBUST READING COMPETITIONS

A number of methods for localization, segmentation and recognition of text in camera images are available. To check their accuracy and to compare different methods, a standard or benchmark dataset is required. A number of datasets are available for Roman, Korean, Chinese and multi-script text, and these are used for ranking different algorithms in various competitions.

A. ICDAR 2003 Data Set

The ICDAR 2003 data set was created for the Robust Reading competition [14]. The data set includes images of text in real environments captured with hand-held devices. It is used for the tasks of character recognition, word recognition, text localization and end-to-end recognition. The dataset is organized into Sample, Trial and Competition sets that contain 20, 258 and 251 color images respectively. The corresponding sets contain 171, 1157 and 1111 words, with 854, 6185 and 5430 characters of Roman script. Ground truth information is specified by word bounding box locations and labels as well as character locations and labels for both the training and test sets. Figure 4 shows sample images from this data set.


B. ICDAR 2011 Data Set

A benchmark data set was created for the ICDAR 2011 Robust Reading competition [15] as an extension of the ICDAR 2003 data set. Repeated words were removed and a few additions were made. The dataset contains 255 color images with 1189 words. There is also a training set of a similar size. Ground truth information is provided for word bounding box locations and labels, which makes this data set suitable for the tasks of end-to-end recognition, text localization and word recognition. Unlike the ICDAR 2003 data set, it does not contain ground truth character location and label information. Sample images from the ICDAR 2011 data set are shown in figure 4.

Figure 4: Sample Images taken from ICDAR 2003 and 2011 Data Set

C. ICDAR 2013 Data Set

The latest version of the ICDAR data set is ICDAR 2013 [16]. It is also based on the previous data sets, but ground truth errors are fixed and image duplicates are removed. It contains 233 color images with 1095 words. The training set is of a similar size. Ground truth information is provided for word bounding box locations and labels, which makes this data set suitable for the tasks of end-to-end recognition, text localization and word recognition. In addition, pixel-level ground truth labels are provided for the first time for text segments, enabling the task of text segmentation.

D. The Street View Text Dataset

As a part of the Google Street View project, Wang and Belongie [17] introduced the Street View Text (SVT) data set. Amazon's Mechanical Turk was used for harvesting and labeling the images from Google Street View. Several Human Intelligence Tasks (HITs) were created and completed on Mechanical Turk to build the data set. The SVT data set was specifically created for the word spotting problem in street view images. The dataset consists of 250 images containing 647 words. In each image, word bounding box locations and ground truth labels are provided. In addition, there is a lexicon given for every word, which contains the ground truth label plus business names in the locality extracted using the `Search Nearby' feature in Google Maps. The lexicons consist of around 50 unique words each. The task is to locate all the words in an image that appear in its lexicon. While there may be other text in the image, the target is to detect only the lexicon words. This contrasts with the more general OCR problem.

Figure 5: Sample images from the SVT data set

E. MSRA Text Detection 500 Database (MSRA-TD500)

C. Yao, X. Bai, W. Liu, Y. Ma and Z. Tu [18] collected and publicly released the MSRA Text Detection 500 Database (MSRA-TD500) as a benchmark to evaluate different text detection algorithms in natural images. The database consists of 500 natural images, taken from outdoor (streets) and indoor (office and mall) scenes using a pocket camera. The resolutions of the images are in the range of 1296x864 to 1920x1280. The text in the images may vary in language (English, Chinese, or a mixture of both), size, font, color and orientation. The dataset is divided into two parts: a training set and a testing set. The training set is formed by taking 300 images randomly selected from the original dataset, and the remaining 200 images are used for testing. The fundamental unit for detection in the dataset is the text line instead of the word. The reason is that it is hard to partition Chinese text lines into individual words based on their spacing. Even for English text lines, it is difficult to partition words in natural scene images.

Figure 6: Sample images from MSRA-TD500 data set
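A random partition of the kind described for MSRA-TD500 can be sketched as follows. This is only an illustration of drawing 300 training images at random from 500 and keeping the remaining 200 for testing; the file names, the function name `split_dataset` and the seed are hypothetical, and the sketch does not reproduce the fixed split distributed with the database.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(image_names: Sequence[str],
                  n_train: int = 300,
                  seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly pick n_train names for training; the rest form the test set."""
    names = sorted(image_names)          # fixed order so the seed is reproducible
    rng = random.Random(seed)            # local generator, no global state touched
    chosen = set(rng.sample(names, n_train))
    training = [n for n in names if n in chosen]
    testing = [n for n in names if n not in chosen]
    return training, testing


if __name__ == "__main__":
    # Hypothetical file names standing in for the 500 images.
    all_images = [f"img_{i:03d}.jpg" for i in range(500)]
    train, test = split_dataset(all_images)
    print(len(train), len(test))
```

Fixing the seed keeps the partition reproducible, which matters when several detection methods are to be compared on the same test images.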


F. NEOCR: Natural Environment OCR Dataset

R. Nagy, A. Dicker and K. Meyer-Wegener [19]-[20] developed the Natural Environment OCR Dataset. The dataset comprises a broad range of natural scene images with rich annotation for OCR, covering a variety of features that distinguish natural scene text images from scanned images. NEOCR consists of 659 natural scene images with a total of 5238 bounding boxes containing text. Images were captured with different digital cameras at various camera settings to include real-world text with arbitrary colors, font types and font sizes, lighting effects, and a variety of textures or occlusion. Images containing text were then manually selected, and the web-based tool of [21] for the LabelMe dataset [22] was used for image annotation. Annotations are provided in XML for each image separately, describing global image features, bounding boxes of text and its special characteristics.

Figure 7: Example images from the NEOCR dataset.

G. KAIST Scene Text Database

The KAIST scene text dataset consists of 3000 images captured in variable environments, including indoor and outdoor scenes under different illumination conditions (strong artificial lights, daylight, night, etc.). Images were captured using either a high-resolution digital camera or a low-resolution mobile phone camera. The images were captured at different resolutions but later resized to 640x480. The KAIST scene text database is arranged into categories according to the script of the scene text captured: Korean, English (Number), and Mixed (English + Korean + Number). For each category, images are placed into two classes: digital camera and mobile camera. Further, images are classified as outdoor, indoor, shadow effect, night etc. The scene text in the images is mainly representative of frequent text used in Korean streets or shops.

Figure 8: Sample Images taken from KAIST Scene Text Database

IV. EVALUATION METRICS

A. Text Localization task

The localized text in an image is specified by a bounding box around each word of the text. In the text localization task of ICDAR 2005 (Lucas et al. [14]), the organizers provided bounding boxes of words for each of the images. The ground truth is provided as separate text files (one per image) in which each line specifies the coordinates of one word's bounding rectangle and its transcription in a comma separated format. For a given image, with all the word-region rectangles, let T be the ground truth set of targets and E be the set of estimates returned by the system under test. The number of boxes which are correct is denoted by c. Then precision, p, is defined as the number of correct estimated rectangles divided by the total number of estimated rectangles:

$$ p = \frac{c}{|E|} \tag{1} $$

Recall, r, is defined as the number of correct estimates divided by the total number of targets:

$$ r = \frac{c}{|T|} \tag{2} $$

The above measures of accuracy are quite unrealistic, as a small deviation from the exact area results in a large deviation in the final score. So Lucas et al. [14] adopted a flexible measure using the notion of area match. The area match m_a between two rectangles r_1 and r_2 is defined as twice the common area divided by the sum of the areas of the two rectangles:

$$ m_a(r_1, r_2) = \frac{2\, a(r_1 \cap r_2)}{a(r_1) + a(r_2)} \tag{3} $$

where a(r) is the area of rectangle r. The value of m_a is one for identical rectangles (100% matching) and zero for rectangles that have no intersection (entirely different). For each rectangle in the set of targets we find the closest match in the set of estimates, and vice versa. So the best match m(r, R) for a rectangle r in a set of rectangles R is defined as:

$$ m(r, R) = \max \{\, m_a(r, r') \mid r' \in R \,\} \tag{4} $$

Finally, precision and recall are re-defined as:

$$ p = \frac{\sum_{r_e \in E} m(r_e, T)}{|E|} \tag{5} $$

$$ r = \frac{\sum_{r_t \in T} m(r_t, E)}{|T|} \tag{6} $$

where m(r_e, T) or m(r_t, E) is the best match for a rectangle r_e or r_t in the set of rectangles T or E, respectively. The two measures are combined into a single measure, called the f score, computed as:

$$ f = \frac{1}{\alpha / p + (1 - \alpha) / r} \tag{7} $$

where α is a weight that controls the relative importance of precision and recall.

Wolf and Jolion [23] pointed out several drawbacks in the above methods of evaluation: they do not give exact information about the proportion of correctly detected words and the number of false alarms. Further, they cannot be accumulated across multiple images without creating ambiguity in their interpretation. So another evaluation procedure based on object counts was proposed. Apart from one-to-one matches, one-to-many and many-to-one matches are also counted, with a small penalty.

B. Text Segmentation task

In the Robust Reading competition of ICDAR 2011, text segmentation was introduced as a separate task. For comparing the results of different algorithms, the evaluation procedure has to calculate the precision, recall and f-score values. Ground truth data is provided for the task, in which text is represented as black against a white background. The segmentation task is to segment the text from the input image, with the text pixels represented in black against a white background. The precision and recall values are calculated between the result of the method and the ground truth for each image at the pixel level. The F-score, calculated as the harmonic mean of the precision and recall values for the image set, is used to rank the different methods. These metrics, however, existed earlier for scanned document images in the Document Image Binarization Contest (DIBCO) [17]. They are calculated from the pixel-level comparison of the segmented image and the ground truth image by using True Positive (TP), False Positive (FP) and False Negative (FN) pixels:

True Positives: True positives are those pixels in the image which are text pixels and have also been detected by the algorithm as text.
False Positives: False positives are those pixels in the image which are actually not text pixels, but have been detected by the algorithm as text pixels.

False Negatives: False negatives are those pixels in the image which are actually text pixels, but have not been detected by the algorithm as text pixels.

Precision Rate: The precision rate is defined as the ratio of correctly detected pixels to the sum of correctly detected pixels and false positives:

$$ \text{Precision Rate} = \frac{TP}{TP + FP} \times 100 \tag{8} $$

Recall Rate: The recall rate is the ratio of correctly detected pixels to the sum of correctly detected pixels and false negatives:

$$ \text{Recall Rate} = \frac{TP}{TP + FN} \times 100 \tag{9} $$

F-Measure: The F-score is the harmonic mean of the recall and precision rates:

$$ \text{F-Score} = \frac{2 \times \text{Recall Rate} \times \text{Precision Rate}}{\text{Recall Rate} + \text{Precision Rate}} \tag{10} $$

Figure 9: Ground Truth data for pixel level segmentation task (KAIST Dataset)

V. PROPOSED WORK

A. Aks: Database for proposed work

For Devanagari script there is no benchmark database of camera captured images. Aks, the proposed database, is created by taking images from different environments. Camera captured images are useful for understanding natural scene text, so most of the database is dedicated to natural scene text images. The proposed dataset is divided into two categories (figure 10): Devanagari images and multi-script images. In each category we further collected images in natural light (outdoor and indoor), artificial light (outdoor and indoor) and no light at all. Images of book titles are also taken. The images are captured from road sides (sign boards), name plates (indoor and outdoor), hospitals, schools, historic monuments, book titles etc. to cover different types of backgrounds. The captured images have different types of fonts and variable font sizes. An image may contain a single word or multiple words arranged in different patterns of lines. The text appearing in the image may be machine-printed or handwritten.
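For reference, the area-match measures of Section IV.A (equations 3 to 7) can be sketched in Python as below. This is a minimal sketch under stated assumptions rather than the competition's evaluation code: rectangles are taken to be axis-aligned tuples (x_min, y_min, x_max, y_max), the weight in the f score is assumed to be 0.5 (which reduces it to the harmonic mean), and all function names are illustrative.

```python
from typing import Sequence, Tuple

# Assumed convention: (x_min, y_min, x_max, y_max), axis-aligned.
Rect = Tuple[float, float, float, float]


def area(r: Rect) -> float:
    """Area a(r) of a rectangle; degenerate rectangles have area 0."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def intersection_area(r1: Rect, r2: Rect) -> float:
    """Area of the overlap of two rectangles (0 if they do not intersect)."""
    width = min(r1[2], r2[2]) - max(r1[0], r2[0])
    height = min(r1[3], r2[3]) - max(r1[1], r2[1])
    return max(0.0, width) * max(0.0, height)


def area_match(r1: Rect, r2: Rect) -> float:
    """Equation (3): twice the common area over the sum of both areas."""
    total = area(r1) + area(r2)
    return 2.0 * intersection_area(r1, r2) / total if total > 0 else 0.0


def best_match(r: Rect, rects: Sequence[Rect]) -> float:
    """Equation (4): best area match of r against a set of rectangles."""
    return max((area_match(r, other) for other in rects), default=0.0)


def localization_scores(targets: Sequence[Rect],
                        estimates: Sequence[Rect],
                        alpha: float = 0.5) -> Tuple[float, float, float]:
    """Equations (5)-(7): area-match precision, recall and f score."""
    precision = (sum(best_match(e, targets) for e in estimates) / len(estimates)
                 if estimates else 0.0)
    recall = (sum(best_match(t, estimates) for t in targets) / len(targets)
              if targets else 0.0)
    if precision == 0.0 or recall == 0.0:
        return precision, recall, 0.0
    f_score = 1.0 / (alpha / precision + (1.0 - alpha) / recall)
    return precision, recall, f_score


if __name__ == "__main__":
    ground_truth = [(0, 0, 10, 10)]
    detected = [(5, 0, 15, 10)]   # overlaps half of the target
    print(localization_scores(ground_truth, detected))
```

Returning zero for empty sets is a design choice made here to avoid division by zero; an evaluation protocol may treat images without text differently.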

Figure 10: Aks: Dataset of Camera Based Images

The images are captured either with a high-resolution digital camera or with a mobile camera. Further, we used capturing devices with different resolutions. Sample images from our dataset are shown in figure 10:

(a) Natural Light (Outdoor) (b) Natural Light (Indoor) (c) Artificial Light (d) No Light

Figure 10: Sample Images from Aks

B. Ground Truth Data

A TIE system extracts text from an image, and the extracted text image can then be fed to an OCR for recognition. The recognition of text depends on the quality of the OCR and also on the text image that is given as input to it. The accuracy of TIE is measured as discussed in Section IV. To calculate the required metrics, the output image of TIE has to be compared with a ground truth image that is created manually. To create the ground truth data, we used the toolkit developed by [24]. The ground truth data for some images is shown in figure 11.

Figure 11: Ground Truth Images from Aks

VI. CONCLUSION

In this paper a collection of camera based images for Devanagari script, along with ground truth data, is discussed. The proposed database will be used to detect and extract Devanagari text from these camera based images. Since the ground truth data for the images is also created, the database can be used to compare the efficiency of different algorithms.
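Comparing an algorithm's output against a ground truth image reduces to the pixel-level measures of equations (8) to (10). The sketch below illustrates that computation in plain Python. It assumes that both images have already been converted to binary masks of equal size, stored as nested lists in which a truthy value marks a text pixel; the function name `pixel_scores` and this mask representation are illustrative assumptions and not part of the annotation toolkit of [24].

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]  # assumed: truthy value = text pixel


def pixel_scores(result: Mask, truth: Mask) -> Tuple[float, float, float]:
    """Return (precision rate, recall rate, F-score) in percent."""
    if len(result) != len(truth) or any(
            len(a) != len(b) for a, b in zip(result, truth)):
        raise ValueError("result and ground truth masks must have the same size")

    tp = fp = fn = 0
    for result_row, truth_row in zip(result, truth):
        for detected, actual in zip(result_row, truth_row):
            if detected and actual:
                tp += 1          # text pixel detected as text
            elif detected:
                fp += 1          # background pixel detected as text
            elif actual:
                fn += 1          # text pixel missed by the algorithm

    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0   # equation (8)
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0      # equation (9)
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2.0 * recall * precision / (recall + precision)  # equation (10)
    return precision, recall, f_score


if __name__ == "__main__":
    ground_truth = [[1, 1, 0],
                    [0, 0, 0]]
    segmented = [[1, 0, 1],
                 [0, 0, 0]]
    print(pixel_scores(segmented, ground_truth))
```

Since the ground truth described in the paper stores text as black on white, a real pipeline would first threshold and invert the images to obtain masks of this form.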

REFERENCES

[1] J. Liang, D. Doermann, H. Li, "Camera-based analysis of text and documents: a survey", International Journal on Document Analysis and Recognition (IJDAR), Vol. 7, Issue 2-3.
[2] A. Antonacopoulos, D. Karatzas, J. Ortiz Lopez, "Accessing Textual Information Embedded in Internet Images", Proc. of SPIE, Internet Imaging II, San Jose, USA, Vol. 4311, January.
[3] T. Kanungo and C. Lee, "What fraction of images on the web contain text?", International Workshop on Web Document Analysis (WDA), pp. 43-46.
[4] D. Karatzas, A. Antonacopoulos, "Colour Text Segmentation in Web Images Based on Human Perception", Image and Vision Computing, Vol. 25, Issue 5, Elsevier, May.
[5] D. Lopresti, J. Zhou, "Locating and recognizing text in WWW images", Information Retrieval, 2.
[6] S.J. Perantonis, B. Gatos and V. Maragos, "A novel Web image processing algorithm for text area identification that helps commercial OCR engines to improve their Web image recognition efficiency", Second Int. Workshop on Web Document Analysis (WDA2003), Edinburgh, Scotland, August 2003.
[7] J. Ohya, A. Shio, and S. Akamatsu, "Recognizing characters in scene images", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 2, Feb.
[8] Y. Zhong, K. Karu, and A. Jain, "Locating text in complex color images", Pattern Recognition, vol. 28, no. 10, Oct.
[9] P. Clark and M. Mirmehdi, "Recognising text in real scenes", Int. Jour. on Document Analysis and Recognition, vol. 4, no. 4.
[10] C. Mancas Thillou and B. Gosselin, "Color text extraction with selective metric-based clustering", Computer Vision and Image Understanding, vol. 107, no. 1-2, Jul.
[11] J. Weinman, E. Learned Miller, and A. Hanson, "Scene text recognition using similarity and a lexicon with sparse belief propagation", IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 31, no. 10, Oct.
[12] J. Park, G. Lee, E. Kim, J. Lim, S. Kim, H. Yang, M. Lee, and S. Hwang, "Automatic detection and recognition of korean text in outdoor signboard images", Pattern Recognition Letters, vol. 31, no. 12, Sep.
[13] Y. Pan, X. Hou, and C. Liu, "A hybrid approach to detect and localize texts in natural scene images", IEEE Trans. on Image Processing, vol. 20, no. 3, Mar.
[14] S.M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young, "ICDAR 2003 robust reading competitions", In Proceedings of International Conference on Document Analysis and Recognition, IEEE Computer Society.
[15] A. Shahab, F. Shafait, and A. Dengel, "ICDAR 2011 Robust Reading Competition Challenge 2: Reading Text in Scene Images", In Proceedings of the International Conference on Document Analysis and Recognition, IEEE Computer Society.
[16] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i. Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and D. L. Heras, "ICDAR 2013 Robust Reading Competition", In Proceedings of the 12th International Conference on Document Analysis and Recognition.
[17] K. Wang and S. Belongie, "Word spotting in the wild", In Proceedings of 11th ECCV, 2010.
[18] C. Yao, X. Bai, W. Liu, Y. Ma and Z. Tu, "Detecting Texts of Arbitrary Orientations in Natural Images", CVPR 2012.
[19] R. Nagy, A. Dicker and K. Meyer-Wegener, "NEOCR: A Configurable Dataset for Natural Image Text Recognition", In Proceedings of CBDAR Workshop 2011 at ICDAR, September.
[20] R. Nagy, A. Dicker, and K. Meyer-Wegener, "Definition and Evaluation of the NEOCR Dataset for Natural-Image Text Recognition", University of Erlangen, Dept. of Computer Science, Technical Reports, CS-2011, 07, September.
[21] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, "Labelme: A database and web-based tool for image annotation", IJCV, vol. 77, May.
[22] LabelMe Dataset. [Online]. Available: csail.mit.edu/
[23] C. Wolf and J.M. Jolion, "Object Count / Area Graphs for the Evaluation of Object Detection and Segmentation Algorithms", International Journal of Document Analysis, vol. 8, no. 4.
[24] T. Kasar, D. Kumar, M.N. Anil Prasad, D. Girish and A.G. Ramakrishnan, "MAST: Multi-Script Annotation toolkit for Scenic Text", Proc. Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data (J-MOCR-AND), Sept. 17, Beijing, China.
