Optical Character Recognition for Hindi

Prasanta Pratim Bairagi
Assistant Professor, Department of CSE, Assam down town University, Assam, India

2018, IRJET Impact Factor value: 6.171 ISO 9001:2008 Certified Journal Page 3968

Abstract - Optical Character Recognition (OCR) is a system that translates images of handwritten or printed text into machine-editable form. Devanagari script is used in many Indian languages such as Hindi, Nepali, Marathi and Sindhi. The script forms the foundation of Hindi, an official language of India and its most widely spoken one. There is currently a large demand for storing the information available in paper documents in digital format and later reusing it through search. In this paper we propose a new method for the recognition of printed Hindi characters in Devanagari script. Different operations such as pre-processing, feature extraction, segmentation and classification have been studied and implemented in order to design an OCR system for Hindi based on Devanagari script. During this research, related papers on existing OCR systems were studied. The main emphasis of this project is the recognition of individual consonants and vowels, which can later be extended to recognize complex derived letters and words.

Key Words: Optical Character Recognition, Feature Extraction, Segmentation, Hindi Character, Devanagari Script

1. INTRODUCTION

The introduction is divided into two parts. The first part describes OCR, its types and its uses; the second part describes Devanagari script, the foundation of the Hindi language.

1.1 About OCR

Optical Character Recognition has been a major research area since the 1950s. Optical Character Recognition is the mechanical or electronic translation of images of handwritten or printed text into machine-editable text [1]. The images are usually captured by a scanner. Throughout this paper, however, OCR refers to the recognition of printed text. Data entry through OCR is faster, more accurate, and generally more efficient than manual keyboard entry. An OCR system enables us to store a book or a magazine article directly in digital form and also to make it editable. Development of OCR for Indian scripts is an active area of research. Designing such a system is challenging because of the large number of letters in the alphabet, the sophisticated ways in which they combine, and the complicated graphemes that result. In Devanagari script there is usually no separation between the characters written in a word. In this research work, different pre-processing operations such as conversion of gray scale images to binary images, image rectification and segmentation are considered in order to design the system.

1.2 Types of OCR

There are three types of OCR input. They are briefly discussed below.

Offline Handwritten Text
Text that a person writes on paper with a pen or pencil, and that is then scanned to digitize it, is called offline handwritten text.

Online Handwritten Text
Online handwritten text is written directly on a digital platform using a digital device. The output is a sequence of x-y coordinates that express pen position, along with other information such as pressure and speed of writing.

Machine Printed Text
Machine-printed text is commonly found in printed documents and is produced by offset printing processes.

1.3 Uses of OCR

Optical Character Recognition is used to scan different types of documents, such as PDF files or images, and convert them into editable files. OCR systems are used for the following purposes:
Processing bank cheques.
Documenting library materials into digital format.
Storing documents in digital form, searching text and extracting data.

1.4 About Devanagari Script

Devanagari script is the foundation of many Indian languages such as Hindi, Nepali, Marathi and Sindhi, and is used by more than 300 million people around the world. Devanagari script therefore plays a major role in the development of literature and manuscripts. A large body of literature exists in old manuscripts, Vedas and scriptures, and because these are so old they are not easily accessible to everyone. The need to read these old scriptures led to their digital conversion by scanning the books. OCR systems for Devanagari text were introduced to scan such documents and convert them into editable form. The editable output text can then be fed to other systems; for example, it can be passed to a speech synthesizer so that the chanting of scriptures can be heard.

Devanagari script is written from left to right and from top to bottom [2]. It consists of 11 vowels and 33 basic consonants. Each vowel except the first has a corresponding modifier that can be used to modify a consonant. The horizontal line that runs along the top of a character is called the shirorekha. Based on this shirorekha, each character is divided into three distinct parts: the portion above the shirorekha holds the upper modifiers, the middle portion holds the character itself, and the bottom portion holds the lower modifiers. Moreover, some characters combine to form a new set of characters called joint characters (conjuncts). Optical Character Recognition for Hindi is comparatively complex because of its rich set of conjuncts. The script is partly phonetic in that a word written in Devanagari can be pronounced in only one way, but not all possible pronunciations can be written perfectly [7].

2. RELATED WORK

Work on developing a character recognition system was initiated by Sinha [3, 4] at the Indian Institute of Technology, Kanpur. Since then a great deal of effort has been devoted to designing an OCR for the Devanagari script [5, 6], but no complete OCR for Devanagari is yet available.

Chirag I Patel et al. [7] describe a method to recognize the characters in a scanned document and study the effects of changing the models using an Artificial Neural Network.

Jawahar et al. [8] have proposed a recognition scheme for the Indian script of Devanagari. Recognition accuracy for Devanagari script is not yet comparable to that of its Roman counterparts.

Dileep Kumar Patel et al. [9] address the problem of handwritten character recognition with a multiresolution technique using the Discrete Wavelet Transform (DWT) and the Euclidean Distance Metric (EDM).

3. METHODOLOGY

The algorithm used to develop the OCR software for printed Hindi characters is based on the geometrical features and shapes of Hindi characters. The input image is parsed into several sub-images based on these features. Other properties, such as the distribution of pixels and edges within each sub-image, are then used as features to recognize the parsed symbol. The major properties used to segment the input character image into sub-symbols are horizontal lines, vertical lines, cross lines, curves and loops. Of these, horizontal and vertical lines form an integral part of most Hindi characters.

3.1 Various steps involved in the proposed system

The proposed system includes the following steps:
First, take the printed, binarized image of a character as input.
Second, extract the pixel information from that image and store it in memory.
Third, after successful completion of the second step, find the skeleton of the character based on the pixel information.
Fourth, once the skeleton is available, find the different features or geometrical shapes present in that skeleton.

The feature extraction process consists of the following:
Detection of horizontal lines
Detection of vertical lines
Detection of cross lines
Detection of curves
Detection of loops

In parallel, we prepare a database in which the features of every character are stored. The features found in the input image are compared with the database to check whether the features obtained from that character match a stored feature list. If a match is found, the Unicode value of that character is passed to the file writer, which writes the character into a text file. In this way the character is converted from image format into an editable format.
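The final matching and writing steps can be illustrated with a short sketch. It is not the paper's implementation: the dictionary layout, the feature names and the functions `match_character` and `write_character` are assumptions made for illustration. The single database entry uses the four features and the Unicode value 092B that the Results section reports for the character PHA.

```python
# Hypothetical feature database: each Unicode code point maps to the set of
# geometric features expected in the character's skeleton. The layout and
# the feature names are assumptions; only the PHA entry mirrors the paper's
# worked example (horizontal line, vertical line, right curve, left curve).
FEATURE_DB = {
    0x092B: frozenset(
        {"horizontal_line", "vertical_line", "right_curve", "left_curve"}
    ),
}


def match_character(features, database=FEATURE_DB):
    """Return the code point whose stored feature set equals `features`.

    Returns None when no stored character matches.
    """
    wanted = frozenset(features)
    for code_point, stored in database.items():
        if stored == wanted:
            return code_point
    return None


def write_character(code_point, path):
    """Append the character for `code_point` to a UTF-8 encoded text file."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(chr(code_point))
```

A caller would pass the features extracted from the skeleton to `match_character` and, when the result is not None, hand it to `write_character`. Exact set equality is the simplest possible matching rule; a practical system would probably need a tolerance for missing or spurious features.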

Figure 1: Steps involved in this system

3.2 Design of an OCR

The implementation details of the various steps in the proposed algorithm are given below.

Input file/image format to the OCR
The implemented OCR expects the input image to be in either .bmp or .jpg format. The image should be binary, with the text in one of two colour combinations: black text on a white background, or white text on a black background. That is, the image should contain only two pixel values: 0 for the background and 1 for the foreground.

Binarization
For testing purposes we collected images of characters and prepared a database of them. Because the developed system can operate only on binarized images, binarization must be performed before the actual task starts. The collected images are already binarized, however, so no binarization step was needed here.

Extracting pixel information
The binary images used for testing consist of a white foreground on a large black background. The number of background pixels far exceeds the number of foreground pixels; in these images the number of 0s is always at least five times the number of 1s. A smaller number of 1s also means fewer calculations during correlation. Pixel information is extracted by analyzing the foreground and background colours and storing the colour information as 0s and 1s in a matrix of the same size as the image.

Thinning or finding the skeleton of the image
The skeletonization phase is the first to manipulate the input binarized image; it produces polylines that describe the strokes making up the character. Because the algorithm is based on the geometrical and structural properties of Hindi characters, we thin the image to single-pixel width so that the contours are brought out more clearly. In this way, the attributes studied later are not affected by uneven thickness of edges or lines in the symbol. Thinning is a morphological operation that removes selected foreground pixels from binary images. The key is to select the right pixels. The pixels of an image can be grouped into three categories:

Critical Pixels
Pixels whose removal damages the connectivity of the image. Any pixel that is the only link between a boundary pixel and the rest of the image is a critical pixel. Its removal would isolate the boundary pixel, so it should not be removed.

End Pixels
Pixels whose removal shortens the length of the image. An end pixel is connected to two or fewer pixels. Note that this refers to 8-connectivity; different considerations apply for 4-connectivity.

Simple Pixels
Pixels that are neither critical nor end pixels. These are the pixels that can be removed during thinning.

Like other morphological operations, the behaviour of the thinning operation is determined by a structuring element. In our thinning algorithm we used the eight-neighbourhood concept to find the skeleton of the character. Instead of eliminating one pixel at a time, we identify the unwanted pixels of the same region and delete them all at once, which decreases the time required to find the skeleton of the image.
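The three pixel categories can be illustrated with a small sketch. It is not the paper's thinning algorithm, which is not given in detail. The function names, the clockwise neighbour ordering, and the use of a 0-to-1 transition count as a stand-in for the critical-pixel test are all assumptions made for illustration. The end-pixel rule (two or fewer neighbours) follows the text above.

```python
# Offsets of the eight neighbours, clockwise, starting from the pixel above.
NEIGHBOURS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]


def neighbour_values(image, row, col):
    """Return the eight neighbour values of (row, col) in clockwise order.

    `image` is a list of rows of 0/1 values; positions outside the image
    are treated as background (0).
    """
    values = []
    for d_row, d_col in NEIGHBOURS:
        r, c = row + d_row, col + d_col
        inside = 0 <= r < len(image) and 0 <= c < len(image[0])
        values.append(image[r][c] if inside else 0)
    return values


def classify_pixel(image, row, col):
    """Classify a foreground pixel as 'critical', 'end' or 'simple'."""
    values = neighbour_values(image, row, col)
    foreground = sum(values)
    # Count 0 -> 1 transitions while walking once around the neighbourhood.
    # More than one transition means the neighbours form separate groups
    # that are joined only through this pixel (assumed critical-pixel test).
    transitions = sum(
        1 for i in range(8) if values[i] == 0 and values[(i + 1) % 8] == 1
    )
    if transitions > 1:
        return "critical"
    if foreground <= 2:  # rule from the text: connected to two or fewer pixels
        return "end"
    return "simple"
```

On a one-pixel-wide horizontal stroke, the two tips are classified as end pixels and the pixels between them as critical, so nothing further is removed. A full thinning pass would also restrict removal candidates to boundary pixels, which this sketch does not check.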

Figure 2: Eight neighbourhood of a pixel

Detection of lines
After thinning a given character to single-pixel width, we detect its features, that is, the distinct parts present in the character, taking the horizontal line (shirorekha) and the vertical line as baselines. For a given input image we move from the starting pixel, termed the base pixel, to the next neighbour pixel and determine the type of line using the following rules:
If the next neighbour pixel lies to the left or right of the base pixel, the line is considered horizontal.
If the next neighbour pixel lies above or below the base pixel, the line is considered vertical.
If the next neighbour pixel lies to the upper left or lower right of the base pixel, the line is considered to have a negative slope.
If the next neighbour pixel lies to the lower left or upper right of the base pixel, the line is considered to have a positive slope.

Detection of Loop
Along with the set of lines, we detect any loops present in the character. If the starting pixel and the ending pixel of a set of lines are the same, that set of lines constitutes a loop.

Compression of the obtained line segments
Compression is performed to ignore distortions in the set of lines constituting the character. The result is the minimum set of necessary line segments that clearly represent the character.

Detection of Curves
Because most characters in the Hindi alphabet have a horizontal and a vertical line, we extract these lines first from the obtained line set and try to construct loops and curves from the remaining lines. Choose the line segment closest to the vertical line and draw a connecting line from its starting point to the ending point of the next consecutive segment. If the sum of the lengths of these segments exceeds the length of the connecting line by more than a threshold value, the segments are considered to form a curve. If the connecting line intersects the character at any point, the operation is reversed in order to detect a common line segment that belongs to two different parts of the character.

Identification of individual character
Because most characters in Hindi have a horizontal or vertical line, we find these lines first and then the other lines, loops and curves, and compare these features with the features stored in the database to identify the character.

4. RESULTS

The program was tested on sample images of printed Hindi characters, including all the vowels and consonants, and the observed accuracy of the developed software was good. Because not all characters can be shown here, we use one specific character, 'PHA', to illustrate how a character is recognized.

Step 1: Take the binarized character image as an input.

Figure 3: Input Image

Step 2: Find the skeleton of the character.

Figure 4: Skeleton of the image

Step 3: Extract the different features available in that image.

Feature 1: Horizontal line

Figure 5: Horizontal Line

Feature 2: Vertical line

Figure 6: Vertical Line

Feature 3: Right side curve

Figure 7: Right Side Curve

Feature 4: Left side curve

Figure 8: Left side Curve

Step 4: Write the character in editable text format.

Figure 9: Output text with Unicode value 092B

5. CONCLUSION

In this paper we have described a system for OCR of printed Hindi characters. The recognition accuracy of the prototype implementation is very promising. During this project it became clear that the classification of patterns strongly affects the accuracy of an OCR system: the finer the classification, the more accurate the results that can be produced.

6. FUTURE SCOPE

The current thinning (skeleton-finding) algorithm depends on the size of the image, which is not a good basis for developing OCR software. This limitation could be overcome by improving the thinning algorithm, which will require more time and effort. In this project we used a simple form of database to recognize a character. With the current database the character recognition is good, but a good-quality OCR system needs a fully organized database, so further effort and analysis are also required on the database side.
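The neighbour-direction rules, the loop test and the curve test from Section 3.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: pixels are (row, column) pairs with rows increasing downward, and the function names and the `threshold` parameter are hypothetical.

```python
import math


def classify_step(base, neighbour):
    """Classify the move from `base` to an 8-connected `neighbour` pixel.

    Pixels are (row, column) pairs; rows increase downward (an assumption
    about the image coordinate system).
    """
    d_row = neighbour[0] - base[0]
    d_col = neighbour[1] - base[1]
    if max(abs(d_row), abs(d_col)) != 1:
        raise ValueError("neighbour must be one of the eight adjacent pixels")
    if d_row == 0:
        return "horizontal"  # left or right
    if d_col == 0:
        return "vertical"  # up or down
    # Upper-left and lower-right moves have d_row * d_col > 0.
    return "negative_slope" if d_row * d_col > 0 else "positive_slope"


def is_loop(path):
    """A traced set of lines forms a loop when it ends where it started."""
    return len(path) > 2 and path[0] == path[-1]


def is_curve(points, threshold):
    """Apply the curve rule to consecutive segments joined at `points`.

    The segments count as a curve when their total length exceeds the
    length of the straight line joining the two end points by more than
    `threshold` (a hypothetical tuning parameter).
    """
    path_length = sum(
        math.dist(points[i], points[i + 1]) for i in range(len(points) - 1)
    )
    chord_length = math.dist(points[0], points[-1])
    return path_length - chord_length > threshold
```

For instance, two segments of length 3 and 4 meeting at a right angle have a total length of 7 against a chord of 5, so they count as a curve for any threshold below 2. Collinear segments give a difference of 0 and are never reported as a curve.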

ACKNOWLEDGMENT

The author would like to thank Prof. (Dr.) L. P. Saikia (Department of Computer Science Engineering, AdtU) for his constant moral support, encouragement and guidance, which helped in correcting mistakes and in producing the paper to the required standard.

REFERENCES

[1] S. Mori, et al., Historical Review of OCR Research and Development, Proceedings of the IEEE, Vol. 80, No. 7, 1992, pp. 1029-1058.
[2] R. Plamondon and S. N. Srihari, On-Line and Off-Line Handwriting Recognition: A Comprehensive Survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 1, January 2000.
[3] B. Philip and R. D. Sudhaker Samuel, An Efficient OCR for Printed Malayalam Text using Novel Segmentation Algorithm and SVM Classifiers, International Journal of Recent Trends in Engineering, Vol. 1, Issue 1, May 2009.
[4] S. Mohanty, H. N. Dasbebartta, An Efficient Bilingual Optical Character Recognition (English-Oriya) System for Printed Documents, IEEE Conference, 13 February 2009.
[5] U. Garain, B. B. Chaudhuri, Segmentation of Touching Characters in Printed Devanagari and Bangla Scripts Using Fuzzy Multifactorial Analysis, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, Vol. 32, No. 4, 2002, pp. 449-459.
[6] C. V. Jawahar, M. N. S. S. K. P. Kumar, S. S. Ravi Kiran, A Bilingual OCR for Hindi-Telugu Documents and its Application, Proceedings of the Seventh International Conference on Document Analysis and Recognition, IEEE, Vol. 1, 2003, pp. 408-412.
[7] C. I. Patel, R. Patel, P. Patel, Handwritten Character Recognition using Neural Network, International Journal of Scientific & Engineering Research, Volume 2, Issue 5, May 2011.
[8] N. Sankaran and C. V. Jawahar, Recognition of Printed Devanagari Text using BLSTM Neural Network, 21st International Conference on Pattern Recognition (ICPR), IEEE, 2012.
[9] D. K. Patel, T. Som, S. K. Yadav, M. K. Singh, Handwritten Character Recognition Using Multiresolution Technique and Euclidean Distance Metric, JSIP, 2012, pp. 208-214.
[10] B. Philip and R. D. Sudhaker Samuel, A Novel Bilingual OCR for Printed Malayalam-English Text based on Gabor Features and Dominant Singular Values, IEEE, 2009.
[11] V. Bansal, R. M. K. Sinha, On How to Describe Shapes of Devanagari Characters and Use Them for Recognition, 5th International Conference on Document Analysis and Recognition (ICDAR 99), Bangalore, India, 1999.
[12] V. Bansal, R. M. K. Sinha, A Devanagari OCR and a Brief Overview of OCR Research for Indian Scripts, Proceedings of the Symposium on Translation Support Systems (STRANS 2001), Kanpur, India, 2001.

Author Profile

Mr. Prasanta Pratim Bairagi received his MCA degree from Tezpur University in 2013. He has five (05) years of teaching experience and is currently working as an assistant professor in the Department of Computer Science and Engineering, Assam down town University. His research interests include Image Processing and Wireless Sensor Networks.