
Improving Optical Character Recognition Process for Low Resolution Images

1 Imad Qasim Habeeb, 2 Shahrul Azmi Mohd Yusof, 3 Faudziah B. Ahmad
1, First Author: Iraqi Commission for Computers and Informatics, Iraq, emadkassam@yahoo.com
*2, Corresponding Author: Universiti Utara Malaysia, shahrulazmi@uum.edu.my
3 Universiti Utara Malaysia, fudz@uum.edu.my

Abstract

Optical Character Recognition (OCR) systems often generate errors for images with noise or with low scanning resolution. This paper presents a novel approach that improves and restores the quality of clean, low-resolution images so that they are more easily recognized by the OCR process. The method produces four copies of the original image, each of which undergoes a different restoration process. The four copies are then passed to a single OCR engine in parallel. The method does not need any traditional alignment between the four resulting texts, which is time consuming and requires complex calculation. Instead, it implements a new procedure to choose the best among them, and it can be applied without prior training on errors. The experimental results show an improvement in word error rate for low resolution images of more than 67%.

Keywords: OCR, Low resolution image, Alignment resulting text, Multi inputs

1. Introduction

The process of optical character recognition (OCR) extracts the text in images so that it can be modified and searched [1]. OCR systems often produce errors when the images contain noise or the scanning resolution is low [2, 3]. The optimal scanning resolution for most OCR systems is 300 dots per inch (dpi) [4]. Fig. 1 shows the output of two OCR engines at scanning resolutions of 300 dpi and 72 dpi respectively, and illustrates the effect of a low resolution image on OCR systems. The two OCR engines used are Tesseract [5], which is supported by Google Inc, and Asprise [6], which is commercial software.

Figure 1.
The difference in OCR output for two images having different resolutions

Low resolution images can be extracted from sequences of low quality video [3, 7], or they may simply be the only images available, as is the case for thousands of document images on the Internet. Increasing image resolution after the scanning process will not add more details to the image unless the original scanning resolution is high [8]. This research designed an effective method for dealing with any available low-resolution image, improving and restoring its quality so that it is more easily recognized by the OCR process.

To determine the word error rate (WER) for images having low resolution, several documents containing words were scanned twice at 72 dpi, the first time in gray level and the second time in one-bit black/white. The texts were extracted from Wikipedia's website. The documents contain normal text without any layout or images. All document images were passed to two OCR engines, Tesseract version 3.02 and Asprise version 4.0. The output of both engines gave an average WER of greater than 64%, which is high. The resulting WER for both engines is shown in Table 1. Because of this high error rate, a new method that improves image quality in order to reduce WER is proposed.

International Journal of Advancements in Computing Technology(IJACT) Volume 6, Number 3, May

Table 1. Word error rate for two types of images by two engines

Engine                  WER for gray images   WER for binary images   Average WER
Tesseract version 3.02  65%                   79%                     72%
Asprise version 4.0     57%                   71%                     64%

The resolution of an image represents its quality, i.e. the higher the resolution, the higher the quality. The image becomes clearer, sharper, and more detailed when the resolution is high. On the other hand, its file size becomes larger and the number of pixels increases [9]. For example, when a single page of text is scanned as an image at 75 dpi and stored in bmp format, the file size is nearly 353 KB. When the same page is scanned at 300 dpi, the file size is almost 5240 KB. The file size and the number of pixels therefore increase by a factor of practically 15, which means more processing time for OCR systems. Because the proposed method works on low-resolution images, it also reduces file size on the hard disk and increases the speed of OCR systems, although the scope of this research is limited to reducing OCR errors for low resolution images.

Furthermore, these images are blurry because there are not enough pixels to represent all the details of the characters. As a result, several letters seem to touch each other, which makes it difficult to differentiate the outlines of these characters. Such poor-quality images make the OCR segmentation and feature extraction processes more complex.
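The word error rates in Table 1 can be read against the usual definition of WER: the word-level edit distance (substitutions, deletions, and insertions) between the OCR output and the reference text, divided by the number of reference words. The paper does not spell out the formula, so the following is a minimal sketch assuming that standard definition; the function name `word_error_rate` is illustrative and is not taken from the paper or from Sclite.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference text must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("I want to eat", "I want too eat"))
```

Under this definition a WER of 64% means that roughly two out of every three reference words need an edit to be recovered from the OCR output.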
For example, many character blocks in the segmentation process may contain noise; sometimes part of a character falls in one block and the rest in another; in other cases, a single block contains two or more touching characters [3]. Using several threshold values when converting a grayscale image to a binary image can lead to different results from a single OCR engine [10, 11]. The proposed method improves on this by not treating pixel values as a full range between 0 and 255, but grouping them into three classes. After that, operations such as character cleaning and restoration are performed based on these classes to produce four images. These images are sent to the same OCR engine to produce four outputs, from which the best is chosen. The details of these operations are given in section 3.

Using more than one input leads to the problem of aligning the resulting texts of the OCR engines. This problem needs a long time and complex calculation, especially when the number of characters exceeds 2500 [12, 13]. The proposed method uses multiple inputs with alignment between words only, whereas previous work on related methods requires alignment line by line, or alignment between the complete resulting texts, before words or characters are aligned. Furthermore, aligning words is easier and needs only simple alignment methods, because the number of characters in any word does not exceed 20 [13], as described in section 3.

The errors resulting from the OCR process can be classified into two types: non-word errors and real word errors. Non-word errors are words that do not exist in the lexicon, such as the word "foed". Real word errors are words that exist in the lexicon but are unsuitable for the sentence, such as the word "too" in the sentence "I want too eat" [14, 15].
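The distinction between the two error types can be illustrated with a plain lexicon lookup. This is only a sketch under assumptions: the toy lexicon and the helper name `find_non_word_errors` are hypothetical and are not part of the paper's prototype.

```python
# Hypothetical toy lexicon, used only for this illustration.
TOY_LEXICON = {"i", "want", "to", "too", "eat", "food"}


def find_non_word_errors(words, lexicon):
    """Return the OCR output words that are absent from the lexicon."""
    return [word for word in words if word.lower() not in lexicon]


if __name__ == "__main__":
    # "foed" is a non-word error: the lookup flags it.
    print(find_non_word_errors("I want foed".split(), TOY_LEXICON))
    # "too" is a real word error: it is in the lexicon, so the lookup misses it.
    print(find_non_word_errors("I want too eat".split(), TOY_LEXICON))
```

The second call returns an empty list, which shows why a lexicon alone can detect non-word errors but cannot detect real word errors.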
In addition to improving the quality of the image before it enters the OCR engine, this method can perform multiple-pass decoding to detect and correct these errors without prior training. The contributions of this research are: (1) a new method that improves the WER of OCR systems when the inputs are noise-free images with low resolution, and (2) a new alignment method that uses multi-inputs with alignment between the words only.

The paper is organized in five sections: section 1 presents the introduction; section 2 discusses related work on OCR error correction; section 3 explains the proposed method and its implementation; section 4 describes the interface and the data collected, together with the experimental results and evaluation; and the last section presents the conclusions and future work of our research.

2. Related work of OCR Error Correction

The proposed method involves three themes: multi-inputs, low-resolution images, and OCR post-processing error correction.

Much research relies on multi-inputs for OCR engines to improve OCR accuracy. For example, Lopresti and Zhou [16] stated that scanning a page multiple times, entering the images into an OCR engine, and running a voting procedure to select the best output eliminates 20% to 50% of the text errors resulting from a single OCR engine. The method does not require training; the alignment between the output texts was done line by line, with a scanning resolution of 300 dpi. Lund and Ringger [12] took advantage of the differences between the outputs of three OCR engines to improve accuracy. Their method enhanced the A* algorithm used to align the three OCR outputs, which reduced WER from 22.9% to 10.3%. The alignment process used the complete output texts of the three OCR engines, and the test images were scanned at 400 dpi. In a second attempt, Lund and Ringger [17] created a decision list from in-domain training data that was used to select the best output of the three OCR engines. The A* algorithm with a Reverse Dijkstra admissible heuristic was used to align the complete output texts from the three engines. The method led to a 19.5% improvement in WER compared to the best single OCR engine. Progressive alignment of five different OCR engines was presented in the third attempt, by Lund, Walker, and Ringger [18]. The method used a maximum entropy model to select the best output from the five different OCR engines.
The complete output texts were used in the alignment of the five OCR engines; the scanning resolution was 1500 dpi for the document images. A 24.6% improvement in WER relative to the best of the five OCR engines was attained. In a fourth attempt, Lund, Kennard, and Ringger [11] improved OCR accuracy by using seven values of the binarization threshold for a single image, so that multiple images were passed to the same OCR engine. Progressive alignment of the complete output texts was used. The test images were scanned at 400 dots per inch. The method corrected 2.68% of all tokens in the test corpus. Compared to these prior methods, the method proposed in this research employs, in addition to multi-inputs: (1) an algorithm to clean and restore the characters, (2) no previous training on errors, and (3) a testing stage that uses images with low scanning resolution (72 dpi).

Among the researchers who considered low-resolution images, Jacobs, Simard, Viola, and Rinker [8] presented a camera-based OCR system that can improve and recognize poor-quality document images. The system uses a machine learning approach and consists of two parts: a character recognizer and a word recognizer. The character recognizer is implemented with a neural network that predicts the character at a specific location in the image, while the word recognizer finds the word inside a given box in the image. The system was trained on a large amount of data and reached a recognition accuracy of between 80% and 95% on images captured with size of and font size of 10-point. This accuracy is achieved when the system uses a language model. On the other hand, processing was slow, taking about 2 minutes and 40 seconds to produce the output for a medium-sized paragraph. Ma and Agam [3] presented a super resolution framework for low resolution images that was based on machine learning, using the K-means algorithm.
The goal of the framework was to reconstruct a high-resolution image from a low-quality image in order to enhance the accuracy of OCR. The results of the method showed a 50% reduction in error rate. The testing images were re-sampled from high-resolution images of to lower resolution ratios (1:2, 1:3, and 1:4 sampling rates) so that they could be used in the experiments. However, the method proposed in this study improves WER compared to the previous methods.

The last theme in the related work is OCR post-processing error correction, which means correcting the errors after the OCR engine generates the text. It can be divided into several categories: (1) proofreading-based correction, (2) lexicon-based correction, and (3) context-based correction [19]. Proofreading-based correction requires humans to read and rewrite the text produced by the OCR process. This is inefficient because it is time-consuming, especially when the number of words is in the thousands. Lexicon-based correction is used to identify non-word errors; such an error happens when a word produced by the OCR does not exist in the lexicon [14]. Lastly, context-based correction takes into account the words surrounding the wrong word. It is more complex than the previous techniques and can detect real word errors [19].

Examples of OCR post-processing error correction methods include the following. Naseem and Hussain [20] used the similarity in shape among characters in words. Guyon and Pereira [21] proposed grammar rules. Lapata and Keller [22] suggested word count, i.e., the frequency of the word on the web or in a corpus is used to select the right word. Mays, Damerau, and Mercer [23] used two techniques in their method: the first was a dictionary to identify and process non-word errors, and the second was a language model to identify and correct grammar errors. Liu, Babad, Sun, and Chan [24] presented a matrix of the sequence and count of characters for all words resulting from OCR, where an incorrect word was replaced with the word that has the highest count in the matrix. Tong, Zhai, Milic, and Evans [25] proposed confusion sets based on common errors in words. Bassil and Alwani [19] used Google's spelling suggestions. Choudhury, et al. [26] suggested a probability-based language model. The method proposed in this research does not use a complex technique; it uses only a lexicon as an integral part of the correction process.

3. The Proposed method in this research

The research involves six major steps:
(S1) extract word images from the document image and store them in an array,
(S2) pass each word image in the array sequentially to four processes,
(S3) perform cleaning, restoration, and resampling on each word image in each of the four processes, based on different conditions,
(S4) have each OCR engine receive the word images from one process in sequence,
(S5) apply a procedure to select the best word resulting from the four OCR engines, and
(S6) compile all words into one output text.
Figure 2 shows the proposed method.

Figure 2. The proposed method framework

In S1, the document image is checked for gray scale. If the image is colored, it is converted to gray scale before further processing. This method does not accept binary images. The document image is searched to locate words, which are denoted as blocks, so that each block contains one word image. At the end, the words of the document image are extracted and stored as an array of word images.

The first reason for extracting only word images is that if the document image were passed directly to multiple OCR engines, the problem of aligning the output texts would occur. Such alignment is computationally complex and takes a long time for lengthy sequences [12, 13]. The proposed method does not need this type of alignment, as described further in the decision and collection stage. The second reason is that most OCR systems begin by identifying words in the image and then split the words into letter blocks before feature extraction [27]; selecting the words in the image once, rather than repeating the step in each OCR engine, therefore reduces processing time. Note that words are easy to identify because of the spaces between them. The spaces between letters are very small, and letters are sometimes attached to each other, especially when the scanning resolution of the image is low [3]. An example of spaces between two words or two characters is shown in Fig. 1.

Next, in S1, each pixel value in the word image is compared with a threshold value of 220. If the pixel value is greater than 220, the value is changed to 255. The aim is to remove some weak pixels. Furthermore, each word image is searched from left to right for any vertical line that runs from the top of the word image to the bottom, has a width of one pixel, and contains no pixel value less than 160. If such a line is found, all pixel values in the line are set to 255. This facilitates the splitting of the word image into letters (Fig. 3). The threshold values 220 and 160 were chosen based on experiments conducted specifically to identify threshold values for which the process does not neglect or distort other important information. The output of S1 is an array of word images.

Figure 3. Word image before and after preprocessing & extraction stage

Next, in S2, each word image is passed in sequence to four processes (S3). In S3, the proposed method classifies and converts the pixel values of a word image, which range between 0 and 255, into three classes. This is followed by restoration of the characters. The classification and conversion of the values is done four times with a variable threshold, named x, whose value changes in each process. It takes the values 130, 150, 170, and 190 for processes 1, 2, 3, and 4 respectively (Fig. 4). Steps S3 and S4 are performed in a multithreaded manner (in parallel) to reduce processing time [28].
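The two S1 clean-up rules described above can be sketched in a few lines. This is a minimal sketch rather than the authors' VB.NET implementation: it assumes a word image is a list of rows of gray values between 0 and 255, and the function and constant names are hypothetical. Only the threshold values 220 and 160 come from the text.

```python
WEAK_PIXEL_THRESHOLD = 220  # values above this are pushed to white (from the text)
GAP_COLUMN_THRESHOLD = 160  # a column with no value below this is a gap (from the text)
WHITE = 255


def preprocess_word_image(image):
    """Apply the two S1 clean-up rules to a grayscale word image (list of rows)."""
    # Rule 1: remove weak pixels by setting every value above 220 to 255.
    cleaned = [
        [WHITE if value > WEAK_PIXEL_THRESHOLD else value for value in row]
        for row in image
    ]
    if not cleaned:
        return cleaned

    # Rule 2: whiten every one-pixel-wide column, running from the top of the
    # word image to the bottom, that contains no value below 160.
    for col in range(len(cleaned[0])):
        if all(row[col] >= GAP_COLUMN_THRESHOLD for row in cleaned):
            for row in cleaned:
                row[col] = WHITE
    return cleaned


if __name__ == "__main__":
    word_image = [
        [0, 180, 0, 230],
        [50, 170, 200, 221],
    ]
    for row in preprocess_word_image(word_image):
        print(row)
```

In the small example, the second column holds only faint values (180 and 170), so it is whitened and becomes a clean separator between the two dark strokes on either side of it.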
Furthermore, the OCR engines in this method are not different engines but multiple copies of the Tesseract engine.

Figure 4. Processes conditions

In process 1, several operations are performed on each word image. First, each pixel value smaller than x becomes zero, in order to confirm the strong pixels. Any pixel value greater than x remains the same, because the restoration process will be performed on these pixels. The second operation in process 1 identifies all pixels that have values between (x+1) and (x+20) and are located beside pixels with a value of zero, as shown in Fig. 5. These become the primary starting pixels for the restoration process.

Figure 5. Example of position of starting pixels

In the restoration stage, each primary starting pixel goes through a cycle of operations: (1) the value of the starting pixel is changed to zero, and (2) all the neighboring pixels on all sides with values not equal to zero are arranged in ascending order, so that the pixel with the smallest value becomes a secondary starting pixel, on condition that its value is between (x+1) and (x+20). If this condition is met, operations (1) and (2) are performed for the new starting pixel, and so on; otherwise the cycle ends for the current starting pixel, and another cycle is initiated for the next primary starting pixel. The last operation in process 1 increases the resolution of each word image to 300 dpi, because most OCR engines are optimized for this resolution.

All the operations are the same in processes 2, 3 and 4, except that the value of the variable x becomes 150, 170 and 190 respectively. These values of x were selected based on the results of a series of experiments conducted to choose the best values between 0 and 255. As previously mentioned, processes 1, 2, 3 and 4 and the OCR engines are run in parallel to reduce the processing time. Thus, each word image is sent to the four processes to produce four word images (S5), each of which passes through one OCR engine to turn into a word, so that the final results are four words sent to S6.
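The cleaning and restoration cycle, run once per threshold, can be sketched as follows. Several details are assumptions made for this sketch and are not stated in the paper: "beside" and "neighboring pixels from all sides" are read as the 8-connected neighborhood, pixels exactly equal to x are left unchanged, the resampling to 300 dpi and the OCR call are omitted, and all function names are hypothetical. The thresholds and the (x+1) to (x+20) band are taken from the text.

```python
from concurrent.futures import ThreadPoolExecutor

THRESHOLDS = (130, 150, 170, 190)  # x for processes 1 to 4 (from the text)
BAND = 20                          # restoration band is (x+1) .. (x+20)


def neighbours(image, r, c):
    """Yield coordinates of the up-to-8 pixels around (r, c) (assumed neighborhood)."""
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if (dr or dc) and 0 <= rr < len(image) and 0 <= cc < len(image[0]):
                yield rr, cc


def restore(image, x):
    """Clean and restore one grayscale word image (list of rows) for threshold x."""
    # Confirm strong pixels: every value smaller than x becomes zero.
    out = [[0 if value < x else value for value in row] for row in image]

    def in_band(value):
        return x + 1 <= value <= x + BAND

    # Primary starting pixels: in-band pixels that touch a zero-valued pixel.
    primaries = [
        (r, c)
        for r, row in enumerate(out)
        for c, value in enumerate(row)
        if in_band(value) and any(out[rr][cc] == 0 for rr, cc in neighbours(out, r, c))
    ]

    for r, c in primaries:
        while True:
            out[r][c] = 0  # operation (1): the starting pixel becomes zero
            # Operation (2): take the smallest non-zero neighbour.
            candidates = [
                (out[rr][cc], rr, cc)
                for rr, cc in neighbours(out, r, c)
                if out[rr][cc] != 0
            ]
            if not candidates:
                break
            value, nr, nc = min(candidates)
            if not in_band(value):
                break  # the cycle ends for this primary starting pixel
            r, c = nr, nc  # the neighbour becomes the secondary starting pixel
    return out


def run_four_processes(image):
    """Run the four processes in parallel, one per threshold."""
    with ThreadPoolExecutor(max_workers=len(THRESHOLDS)) as pool:
        return list(pool.map(lambda x: restore(image, x), THRESHOLDS))


if __name__ == "__main__":
    word_image = [[100, 135, 140], [255, 255, 200]]
    for x, restored in zip(THRESHOLDS, run_four_processes(word_image)):
        print(x, restored)
```

Each cycle zeroes one pixel before moving on, so the loop always terminates. In the example, the faint pixels 135 and 140 are absorbed into the neighboring stroke when x is 130, while the pixel with value 200 is only absorbed when x is 190, which is why the four processes can yield four different word images.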
Fig. 6 shows the steps in the decision & collection stage.

Figure 6. The steps in decision & collection stage

The first step in the decision & collection stage selects only the unique words from the four words resulting from the OCR engines. The second step checks whether each word belongs to the lexicon. If just one word belongs to the lexicon, it is marked as correct; if more than one belongs to the lexicon, the valid word containing the letters that occur most frequently among the non-shared letters is selected. The third step creates an array named "a_shared" containing the letters that exist in all four words. Array "a_shared" is created using a character-based bigram model. For example, if the four words resulting from the OCR engines are "imoge", "imagc", "imdgc", and "imogc", then array "a_shared" contains ("im", "m?", "?g", "g?"). The fourth step suggests, for each unique word, a list of no more than five words from the lexicon. The last step chooses the word from the suggestion lists that has the largest number of letters in the array "a_shared". If more than one word satisfies this condition, only the word containing the letters that occur most frequently among the non-shared letters is selected. The resulting words are used to build the output text by putting spaces between them.

4. Results and evaluation

To evaluate the proposed method, a prototype was developed using VB.NET. It uses Tesseract version 3.02 as the OCR engine to convert images into text. The Tesseract engine is a software library supported by Google Inc [5]. The prototype passes document images to the OCR engine before and after applying the proposed method and displays the results. The experiments test several English documents containing words; the texts in these documents are from Wikipedia's website and act as the reference or standard text. The texts contain no layout or images and use five font sizes and five font types. The Sclite toolkit, supported by the National Institute of Standards and Technology, was used to compute WER by comparing a reference text with the OCR output text [1, 29]. To generate two types of test images from the reference, the text is first printed on paper.
Then the hardcopies are used in two types of experiments: (1) the documents are scanned at 72 dpi in gray level with a normal scanner to produce 487x662 images, and (2) the same documents are photographed with a 1024x768 camera. All images from this camera are resampled to produce several low resolution 500x680 images; the default resolution of the camera is 75 dpi. The two groups of images, from the scanner and from the camera, are tested. Results showed that the output of the Tesseract engine without the proposed method was very poor, with an average WER of more than 69% for scanner images and more than 78% for camera images. Table 2 shows the comparison results.

Table 2. Results of the proposal method testing

Source of images   Average WER (Tesseract engine only)   Average WER (Tesseract engine and the proposed method)
Scanner            69.35%                                2.21%
Camera             78.82%                                4.63%

From Table 2, it can be observed that the average WER for scanner images is 69.35% when using only the Tesseract engine and 2.21% when using the Tesseract engine and the proposed method. The average WER for camera images is 78.82% when using only the Tesseract engine and 4.63% when using both the Tesseract engine and the proposed method. The difference in accuracy between scanner images and camera images is due to variations in lighting between camera and scanner, and to the properties of each device. The results in Table 2 therefore show that the proposed method reduces the average WER by about 67 percentage points for scanner images and about 74 percentage points for camera images.

The alignment of multiple outputs in the proposed method was compared with two methods named "1" [18] and "2" [4]. The factors of comparison are the type of alignment and the probability of error in the alignment, as shown in Table 3. The first row shows that the alignment used in the proposed method requires less complex calculation, because it deals with only four words at a time, in contrast to the other two methods, which deal with long sequences of strings.
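The word-level comparison that replaces full-text alignment, i.e. the decision & collection stage of section 3, can be sketched as below. Much of this is interpretation and should be read as such: the paper does not say how suggestions are generated, so `difflib.get_close_matches` from the Python standard library stands in for that step; the four OCR words are assumed to have equal length; "most frequently among non shared letters" is read as a per-position vote over the positions where the four outputs disagree; and the fallback when no suggestion exists is an addition of this sketch. All function names are hypothetical.

```python
import difflib
from collections import Counter


def shared_bigrams(words):
    """Bigrams of the position-wise consensus; '?' marks a non-shared position."""
    pattern = "".join(
        chars[0] if len(set(chars)) == 1 else "?" for chars in zip(*words)
    )
    return [pattern[i:i + 2] for i in range(len(pattern) - 1)]


def bigram_score(candidate, a_shared):
    """Count the a_shared bigrams that the candidate matches at the same position."""
    score = 0
    for i, bigram in enumerate(a_shared):
        piece = candidate[i:i + 2]
        if len(piece) == 2 and all(p in ("?", ch) for p, ch in zip(bigram, piece)):
            score += 1
    return score


def letter_votes(candidate, ocr_words):
    """Votes for the candidate's letters at positions where the OCR outputs disagree."""
    columns = [Counter(chars) for chars in zip(*ocr_words)]
    return sum(col[ch] for col, ch in zip(columns, candidate) if len(col) > 1)


def select_word(ocr_words, lexicon):
    unique = list(dict.fromkeys(ocr_words))           # step 1: unique words
    valid = [word for word in unique if word in lexicon]  # step 2: lexicon check
    if len(valid) == 1:
        return valid[0]
    if valid:
        return max(valid, key=lambda w: letter_votes(w, ocr_words))

    a_shared = shared_bigrams(unique)                 # step 3: build a_shared
    suggestions = []                                  # step 4: up to five per word
    for word in unique:
        suggestions += difflib.get_close_matches(word, sorted(lexicon), n=5)
    if not suggestions:
        return unique[0]                              # fallback, not from the paper
    return max(                                       # step 5: best a_shared match
        dict.fromkeys(suggestions),
        key=lambda s: (bigram_score(s, a_shared), letter_votes(s, ocr_words)),
    )


if __name__ == "__main__":
    outputs = ["imoge", "imagc", "imdgc", "imogc"]
    print(shared_bigrams(outputs))
    print(select_word(outputs, {"image", "magic", "table"}))
```

Because only four short strings are compared at a time, every step is a simple position-by-position scan, which is the point made about the first row of Table 3.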
The second row reveals that there is a possibility of error in the alignment for methods 1 and 2 [4, 18], while it is zero for the proposed method. All the experiments in this research confirmed that the proposed method is better than the other two methods.

Table 3. Comparison of the proposed method with two related methods

Factor                                      Method 1                                                Method 2                                               The proposed method
Alignment type                              sequence by sequence                                    page by page                                           word by word
Probability of the error in the alignment   Cannot guarantee 100% error free in alignment [18]      Cannot guarantee 100% error free in alignment [4]      Guaranteed 100% error free in alignment

5. Conclusion and Future Work

The accuracy of OCR systems is very high in ideal conditions, but the error rate increases when the images contain noise, when the scanning resolution is low, or when the language is written in cursive script. This paper presents a new method that restores the quality of characters in low resolution images before passing them to OCR engines. The method avoids the traditional alignment among resulting texts used by related methods, and no training on errors is needed before executing it. Furthermore, it performs a procedure to select the best among the output texts and to correct wrong words if they occur in the output texts. In addition, this method can be used for any language with simple modification. The experimental results show that this method reduces the WER of the output text considerably.

Further research can seek more improvement in the WER of OCR systems for noisy images, low resolution images, and cursive writing languages. In addition, there is a need to minimize the size on hard disk of the data sets of N-gram based language models, which perform well in correcting the errors that result from the OCR process. This should be done without affecting the speed of data access, so that such models can be included effectively in desktop OCR applications.

6. References

[1] S. Impedovo, L. Ottaviano, and S. Occhinegro, "Optical character recognition a survey," International Journal of Pattern Recognition and Artificial Intelligence, vol. 5, pp. 1-24,
[2] B. Alex, C. Grover, E. Klein, and R. Tobin, "Digitised Historical Text: Does it have to be mediocre?,"
[3] D. Ma and G. Agam, "A super resolution framework for low resolution document image OCR," in IS&T/SPIE Electronic Imaging, 2013, pp P-86580P-9.
[4] M. Volk, L. Furrer, and R. Sennrich, "Strategies for Reducing and Correcting OCR Errors," in Language Technology for Cultural Heritage, ed: Springer, 2011, pp
[5] Google Inc. (2014, January 02). Tesseract-ocr v3.02. Available:
[6] LAB Asprise. (2014, January 05). Asprise OCR SDK library v4.0. Available:
[7] D. Ma and G. Agam, "Lecture video segmentation and indexing," in IS&T/SPIE Electronic Imaging, 2012, pp V-82970V-8.
[8] C. Jacobs, P. Y. Simard, P. Viola, and J. Rinker, "Text recognition of low-resolution document images," in Document Analysis and Recognition, Proceedings. Eighth International Conference on, 2005, pp
[9] B. J. Dawson, "Method and apparatus for dynamically selecting an image compression process based on image size and color resolution," ed: Google Patents,
[10] M. R. Gupta, N. P. Jacobson, and E. K. Garcia, "OCR binarization and image pre-processing for searching historical documents," Pattern Recognition, vol. 40, pp ,
[11] W. B. Lund, D. J. Kennard, and E. K. Ringger, "Why multiple document image binarizations improve OCR," presented at the Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing, Washington, District of Columbia,
[12] W. B. Lund and E. K. Ringger, "Improving optical character recognition through efficient multiple system alignment," in Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries, 2009, pp
[13] I. Elias, "Settling the intractability of multiple alignment," Journal of Computational Biology, vol. 13, pp ,
[14] J. F. Daðason, "Post-Correction of Icelandic OCR Text," (Master's thesis, University of Iceland, Reykjavik, Iceland),
[15] X. Sun, J. Gao, D. Micol, and C. Quirk, "Learning phrase-based spelling error models from clickthrough data," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010, pp
[16] D. Lopresti and J. Zhou, "Using consensus sequence voting to correct OCR errors," Computer Vision and Image Understanding, vol. 67, pp ,
[17] W. B. Lund and E. K. Ringger, "Error Correction with In-Domain Training Across Multiple OCR System Outputs," in Document Analysis and Recognition (ICDAR), 2011 International Conference on, 2011, pp
[18] W. B. Lund, D. D. Walker, and E. K. Ringger, "Progressive alignment and discriminative error correction for multiple OCR engines," in Document Analysis and Recognition (ICDAR), 2011 International Conference on, 2011, pp
[19] Y. Bassil and M. Alwani, "Ocr post-processing error correction algorithm using google online spelling suggestion," arxiv preprint arxiv: ,
[20] T. Naseem and S. Hussain, "A novel approach for ranking spelling error corrections for Urdu," Language Resources and Evaluation, vol. 41, pp ,
[21] I. Guyon and F. Pereira, "Design of a linguistic postprocessor using variable memory length Markov models," in Document Analysis and Recognition, 1995., Proceedings of the Third International Conference on, 1995, pp
[22] M. Lapata and F. Keller, "The Web as a baseline: Evaluating the performance of unsupervised Web-based models for a range of NLP tasks," in Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, 2004, pp
[23] E. Mays, F. J. Damerau, and R. L. Mercer, "Context based spelling correction," Information Processing & Management, vol. 27, pp ,
[24] L.-M. Liu, Y. M. Babad, W. Sun, and K.-K. Chan, "Adaptive post-processing of OCR text via knowledge acquisition," in Proceedings of the 19th annual conference on Computer Science, 1991, pp
[25] X. Tong, C. Zhai, N. Milic-Frayling, and D. A. Evans, "OCR Correction and Query Expansion for Retrieval on OCR Data -- CLARIT TREC-5 Confusion Track Report," in TREC,
[26] M. Choudhury, R. Saraf, V. Jain, A. Mukherjee, S. Sarkar, and A. Basu, "Investigation and modeling of the structure of texting language," International Journal of Document Analysis and Recognition (IJDAR), vol. 10, pp ,
[27] M. Labidi, M. Khemakhem, and M. Jemni, "Grid 5000 Based Large Scale OCR Using the DTW Algorithm: Case of the Arabic Cursive Writing," Recent Advances in Document Recognition and Understanding, p. 73,
[28] Microsoft Corporation. (2014, January 08). Multithreading in Visual Basic. Available:
