Summary

This project aims to produce software that can convert images of scanned pages of Braille into ASCII text. The Braille alphabet consists of 3x2 matrices of dots raised above the surface of the paper. Each letter, number and punctuation mark is rendered using either one or two matrices (in the two-matrix case the first acts as a modifier, indicating a capital letter, for example). Appendix B gives the exact details. Braille can come on double-sided pages, and has a number of different grammars; for instance, Grade 2 (also known as contracted Braille) is similar to shorthand in that a single character may have different meanings depending on the context. These are considered as possible extensions to the minimum requirements of this project.

The challenge belongs to the computer vision domain, specifically optical character recognition (OCR). Image processing techniques that have been in use over the last few decades can be applied to solve this problem successfully. The aims of this project are categorised as follows:

- A discussion of the problem, offering a review of the relevant literature in this area and of which techniques will be of use
- The creation of internal representations of each Braille character, which may then be used in the final software
- The creation of a system that can convert scans of Braille pages into ASCII text

Once created, the software will be tested and evaluated. Its limitations will be discussed, along with an outline of any possible future improvements. Finally, a conclusion is offered, along with a reflection on the entire project experience.

Acknowledgements

My thanks (in no particular order) go to the following people: my supervisors James Handley and Matthew Hubbard, for offering their help and advice and for coming up with the project idea; Roger Boyle, for posing more questions than answers in the Speech and Image Processing module; Andy Bulpitt, for making the Computer Vision module interesting and whose coursework gave me a head start on this project; and Nick Efford, who wrote the Java graphics libraries used herein.

quaere verum

Contents

Summary
Acknowledgements
Chapter 1: Introduction
  1.1 Motivation
  1.2 Report Structure
  1.3 Minimum Requirements
  1.4 Possible Enhancements
  1.5 Project Schedule
  1.6 Summary
Chapter 2: Background Research
  2.1 Introduction
  2.2 Image Processing Techniques
    2.2.1 Noise Removal and Suppression
    2.2.2 Contrast Enhancement
    2.2.3 Segmentation
    2.2.4 Morphological Image Processing
  2.3 Skew Detection
  2.4 Summary
Chapter 3: Implementation
  3.1 Introduction
  3.2 Initial Character Representation and Image Processing
    3.2.1 Internal Braille Character Representation
    3.2.2 File Format
    3.2.3 Image Pre-Processing
    3.2.4 Segmentation: Thresholding
    3.2.5 Morphological Operations: Dilation and Erosion
    3.2.6 Conclusion
  3.3 Image Understanding: Character Recognition
    3.3.1 Using Projection to Calculate Cell Coordinates
    3.3.2 Intermediate Braille Representation
    3.3.3 Character Matching
    3.3.4 Conclusion
  3.4 Extensions to the Minimum Requirements
    3.4.1 Alternative Braille Grammars: Braille Grade 2
    3.4.2 Skew Detection and Correction
    3.4.3 Strengthening of the Character Matching Algorithm: Backtracking
    3.4.4 Handling Large Braille Pages
    3.4.5 Conclusion
  3.5 Summary
Chapter 4: Testing
  4.1 Introduction
  4.2 Test Plan
    4.2.1 Minimum Requirements Testing
    4.2.2 Project Extensions Testing
  4.3 Summary of Test Results
  4.4 Analysis and Discussion of Test Results
  4.5 Summary
Chapter 5: Evaluation
  5.1 Introduction
  5.2 Project Evaluation
  5.3 Software Evaluation
    5.3.1 Minimum Requirements
    5.3.2 Extended Software Features
  5.4 Further Work and Possible Improvements
  5.5 Summary
Chapter 6: Conclusion
References
Appendix A: Project Reflection
Appendix B: The Braille Alphabet
Appendix C: Test Results

Chapter 1: Introduction

The project is essentially an OCR problem, where the aim is to get a computer to recognise characters (in this case, Braille matrices) in a digitised image. Images are obtained by scanning Braille pages using a flatbed scanner. Character recognition has been an active area of computer science research since the late 1950s. It was initially perceived as an easy problem, but turned out to be much harder than anticipated. Although OCR has been in use for some time (the United States Postal Service, for example, has been using OCR machines to sort mail since 1965), it will be many decades, if ever, before computers are able to read all documents with the same accuracy as human beings.

Optical Braille recognition is a simple subset of the OCR area in general. After initial image processing, the main challenge lies in recognising individual Braille character cells and matching them to the ones stored internally. The creation of internal representations of each letter or character is straightforward, as each one is simply a binary 3x2 matrix. Pre-processing can be done by applying a number of textbook algorithms, but it is the image understanding part that poses the biggest challenge. In this project, Nick Efford's com.pearsoneduc.ip graphics libraries from Digital Image Processing: a Practical Introduction Using Java (Pearson Education Limited, 2000) are used in the implementation of the software.

1.1 Motivation

The potential of OCR systems is enormous because they enable users to harness the power of computers to access printed documents. Such documents do not have to contain ordinary text; they can also include Braille. In the case of this project, anyone who works with blind people but does not read Braille can benefit from the software. This includes teachers, lecturers, organisations communicating with blind individuals, and computerised Braille libraries.

1.2 Report Structure

The structure is typical of a final year project report, and roughly follows the stages of developing the actual software. Chapter 2 discusses the research in this area, along with the relevant literature. Chapter 3 follows the actual implementation (coding) of the software in detail, and discusses which solutions and algorithms were used, and why. Chapter 4 summarises the test results. Chapter 5 evaluates the outcome of the project and the produced solution, and outlines possible future improvements. A conclusion is offered in Chapter 6, along with a reflection on the entire project experience in Appendix A. The Braille alphabet is reproduced in Appendix B. Finally, the full test results are shown in Appendix C.

1.3 Minimum Requirements

These are set out in the Mid-Project Report, and are as follows:

- A discussion of the problem to be solved, outlining the likely problems and detailing the solutions
- The creation of internal (to the computer) Braille character representations that can then be converted into ASCII text
- The creation of a system that can transliterate (under perfect conditions) scanned-in pages of Braille into ASCII text

1.4 Possible Enhancements

The possible enhancements concern the software itself, and include the addition of extra features to the program:

- The ability to handle double-sided Braille
- The ability to translate different types of Braille grammars
- Detection (and correction) of small amounts of skew present in the images
- Correction of defects (such as shadows on the page)
- Handling of large Braille pages that do not fit on an A4 flatbed scanner

1.5 Project Schedule

Table 1 sets out the projected schedule.

  Task                                            Start date   Expected completion
  Preparation                                     Oct 2003     Feb 2004
  Project preference form                         Oct 2003     Oct 2003
  Minimum requirements                            Oct 2003     Oct 2003
  Background reading                              Oct 2003     Feb 2004
  System implementation, testing and evaluation   Feb 2004     Apr 2004
  Design & coding                                 Feb 2004     Mar 2004
  Testing                                         Mar 2004     Apr 2004
  Evaluation                                      Mar 2004     Apr 2004
  Write-up                                        Dec 2003     Apr 2004
  Mid-project report                              Dec 2003     Dec 2003
  Draft chapter and table of contents             Feb 2004     Mar 2004
  Final report                                    Dec 2003     Apr 2004

  Table 1: Projected schedule of completion

1.6 Summary

To achieve the stated goals, a number of processing techniques chosen after background research will be applied to the acquired image. The Braille dots will be segmented out of the scan, processed, and translated into ASCII text. Any possible enhancements will then be implemented. Finally, the software will be tested and evaluated.

Chapter 2: Background Research

2.1 Introduction

The problem to be solved is a computer vision task, and involves areas of artificial intelligence. This chapter details some of the relevant literature in this area. The image processing algorithms and techniques are discussed, along with how and why they may be useful in this project. Virtually all of the algorithms discussed are implemented in the com.pearsoneduc.ip library. This has the obvious advantage of code reuse, allowing most of the effort to be concentrated on the choice of techniques to be used and leaving the specific implementation details aside.

Braille recognition is a simpler computer vision problem than some other pattern recognition tasks, such as handwriting or fingerprint identification. Typically, image processing is split into two parts: low-level image processing and high-level image understanding. Low-level methods usually use very little knowledge about the image contents and typically include noise filtering, feature extraction, and image sharpening. High-level processing is based on knowledge and goals (Sonka et al. 1999, p. 3). OCR is a subset of the general pattern recognition problem, and consists of roughly three parts: pre-processing, feature extraction, and discrimination. The principal problem in OCR can be narrowed down to understanding the concept of a character's shape and the mechanism that identifies any instantiation of this concept (Mori et al. 1999, p. 3).

2.2 Image Processing Techniques

2.2.1 Noise Removal and Suppression

The techniques discussed here are called neighbourhood operations, i.e., operations in which the new value calculated for a pixel depends on its neighbourhood as well as its original value. Such filtering techniques can be used effectively to remove various types of noise in digital images (Umbaugh 1998, p. 159).

An excellent technique for removing certain types of noise (such as impulse noise) is the median filter (Efford 2000, p. 175). This is a rank filtering technique that smoothes out noise in the input image. However, it generates a black border around the image when no special border handling algorithm is used, something that will have to be dealt with later. Mean filtering reduces the amplitude of noise within an image. It also has the effect of giving the image a softer appearance, effectively blurring it; it is essentially a low-pass filter, similar in effect to Gaussian smoothing.
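To make the neighbourhood idea concrete, the following is a minimal sketch of a 3x3 median filter for an 8-bit greyscale image held as a two-dimensional int array. It stands in for the library routine actually used in the project (the com.pearsoneduc.ip classes are not reproduced here), and it simply copies border pixels unchanged instead of generating the black border mentioned above.

    import java.util.Arrays;

    /** Minimal 3x3 median filter sketch for an 8-bit greyscale image stored as int[rows][cols]. */
    public final class MedianFilter3x3 {

        public static int[][] apply(int[][] in) {
            int h = in.length, w = in[0].length;
            int[][] out = new int[h][w];
            int[] window = new int[9];
            for (int y = 1; y < h - 1; y++) {
                for (int x = 1; x < w - 1; x++) {
                    int k = 0;
                    for (int dy = -1; dy <= 1; dy++)
                        for (int dx = -1; dx <= 1; dx++)
                            window[k++] = in[y + dy][x + dx];
                    Arrays.sort(window);
                    out[y][x] = window[4];   // the median of the nine neighbourhood values
                }
            }
            // Border pixels are copied unchanged here; the library used in the project
            // instead produces a black border, which has to be handled later.
            for (int x = 0; x < w; x++) { out[0][x] = in[0][x]; out[h - 1][x] = in[h - 1][x]; }
            for (int y = 0; y < h; y++) { out[y][0] = in[y][0]; out[y][w - 1] = in[y][w - 1]; }
            return out;
        }
    }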

Other noise filtering techniques include hybrid filters, such as the α-trimmed mean filter (which sorts the values from a neighbourhood into ascending order, discards a certain number of values from either end of the list and outputs the mean of the remaining values); adaptive filters (which change their behaviour in response to variations in local image properties); and minimum and maximum filters (where the bottom- or top-ranked grey level from the neighbourhood is selected as the output value).

2.2.2 Contrast Enhancement

This section deals with modifications of the grey level values of an image. The assumption is that the image is in black and white, with a number of different grey levels present. Depending on how the image is stored, the greater the number of bits used to represent the grey level value, the greater the number of grey levels present in the image. The relationship may be defined as

  n = 2^b

where n is the number of grey levels and b is the number of bits used.

Grey level mapping is one of the simplest, yet most useful, image processing techniques. It falls under the category of point processes because each new pixel's grey level value is calculated independently of its neighbours. The simplest mappings use a general expression for brightness and contrast modification:

  g(x, y) = a * f(x, y) + b

where f(x, y) is the input grey level at (x, y), g(x, y) is the output, b is a constant bias added to pixel values to change the brightness (if b < 0 the overall brightness is decreased), and a is a constant gain used to increase (a > 1) or decrease (a < 1) the contrast. The mapping itself is linear, although any one-to-one function could be used.

Histogram modification is a classic technique used in image processing. It uses the histogram of an image, which shows the distribution of grey levels within it. Generally speaking, an image whose histogram has a wide spread has high contrast, while one with a narrow spread has low contrast. Similarly, a histogram clustered at the low end of the range indicates a dark image, and vice versa. A number of processing methods are therefore available: histogram shrinking (which compresses the histogram); histogram sliding (which slides the histogram in one direction, thus brightening or darkening the image); and histogram stretching. A variation of the last technique is worth focusing on: also known as histogram equalisation, it defines a non-linear mapping of grey levels that results in optimal improvement in contrast (Efford 2000, p. 124). It redistributes the grey levels, allocating more of them where there are the most pixels and fewer where there are fewer pixels. This has the effect of flattening the frequency distribution, tends to increase the contrast in the most heavily populated regions of the histogram, and often reveals previously hidden detail (Efford 2000, p. 124). These are known as full-frame histogram equalisation techniques; their main drawback is that the global image properties may not be appropriate in a local context (Sonka et al. 1999, p. 100). It is possible to acquire a histogram for a local fixed-size neighbourhood of a pixel, equalise it, and then use the result to compute a new grey level value for that pixel (Pizer et al. 1987).
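The gain-and-bias mapping above is simple enough to sketch directly; the fragment below assumes 8-bit grey levels stored in an int array and clamps the results to the range 0-255.

    /** Linear brightness/contrast mapping g(x,y) = a*f(x,y) + b for an 8-bit greyscale image. */
    public final class GreyLevelMapping {

        public static int[][] map(int[][] f, double a, double b) {
            int h = f.length, w = f[0].length;
            int[][] g = new int[h][w];
            for (int y = 0; y < h; y++) {
                for (int x = 0; x < w; x++) {
                    int v = (int) Math.round(a * f[y][x] + b);   // gain a, bias b
                    g[y][x] = Math.max(0, Math.min(255, v));     // clamp to the 8-bit range
                }
            }
            return g;
        }
    }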

2.2.3 Segmentation

Image segmentation methods look for objects that either have some measure of homogeneity within themselves or some measure of contrast with the objects on their border. Their goal is to find regions that represent objects or meaningful parts of objects. Complete segmentation (which is rare) provides a set of disjoint regions, each corresponding to a different real-world object; partial segmentation provides regions which do not directly correspond to real-world objects (Umbaugh 1998, p. 80; Bulpitt 2003, p. 5).

Segmentation techniques fall into three categories: edge-based techniques (e.g., border tracing, Hough transforms), region-based techniques (region growing and splitting/shrinking), and global techniques. Since the images being dealt with in this project are relatively simple, the first two categories are likely to be unnecessary. The focus will fall on the last category, or more specifically on a technique known as thresholding.

Thresholding transforms a dataset containing values that vary over some range into a new dataset containing just two values. Input that falls below a specified threshold value is mapped to one of the output values; input above the threshold is mapped to the other output value. Sonka et al. (1999, p. 124) state that if objects do not touch each other, and if their grey levels are clearly distinct from background grey levels, thresholding is a suitable segmentation method. Both properties are satisfied in the case of scanned-in Braille pages. However, correct threshold selection is absolutely vital for successful threshold segmentation (Sonka et al. 1999, p. 124; Efford 2000, p. 253).

In general, there are two approaches to automatic threshold selection: statistical and model-based (Mori et al. 1999, p. 105). A simple method of threshold selection is to take the average of all the grey levels in an image; this simple choice often works well (Mori et al. 1999, p. 107). A method that may be worth considering is p-tile thresholding. It assumes some prior knowledge of a property of the image after segmentation, a good example being text printed on paper: if we know that the text covers 1/p of the sheet area, we may easily choose a threshold T, based on the image histogram, such that 1/p of the image area has grey level values less than T and the rest has grey level values larger than T (Sonka et al. 1999, p. 127). Finally, a good method for automatically selecting a threshold based on an approximation of the image histogram is optimal thresholding (Sonka et al. 1999, p. 128). It results in minimum-error segmentation (Chow and Kaneko 1972; Rosenfeld and Kak 1982); however, deciding whether a histogram is bi-modal (which is required for this method to work) may not be straightforward (Rosenfeld and de la Torre 1983).
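As an illustration of global thresholding, the sketch below selects a threshold as the mean grey level (the simple method mentioned above) and then maps every pixel to one of two output values. It is illustrative only; the threshold selection actually used in the implementation is described in Section 3.2.4.

    /** Global thresholding sketch: binarise an 8-bit greyscale image using the mean grey level as T. */
    public final class GlobalThreshold {

        public static int meanThreshold(int[][] f) {
            long sum = 0;
            long count = 0;
            for (int[] row : f) {
                for (int v : row) { sum += v; count++; }
            }
            return (int) (sum / count);                  // average grey level of the whole image
        }

        public static int[][] binarise(int[][] f, int t) {
            int h = f.length, w = f[0].length;
            int[][] out = new int[h][w];
            for (int y = 0; y < h; y++)
                for (int x = 0; x < w; x++)
                    out[y][x] = (f[y][x] < t) ? 0 : 255; // below T -> one value, at or above -> the other
            return out;
        }
    }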

2.2.4 Morphological Image Processing

Morphological techniques may be used for non-linear smoothing and feature enhancement (Efford 2000, p. 271). They operate on a binary image, and use a small template called a structuring element, which is positioned at all possible locations in the image and compared with the corresponding neighbourhood of pixels. Where the template intersects or fits within the neighbourhood (depending on the type of operation being performed), the resulting output pixel has a non-zero value. The most basic of these techniques are erosion and dilation. Erosion can remove unwanted, small-scale features from a binary image, while dilation has the opposite effect of enhancing features of interest. The two operations are complementary, that is, they have opposite effects.

2.3 Skew Detection

One of the possible enhancements to the minimum requirements is the ability to detect (and correct) small amounts of rotation present in images. Several methods exist for skew detection in the OCR domain. Most are based on the Hough transform (Hinds et al., 1990; Le et al., 1994; Yan, 1993), the Fourier transform (Hase and Hoshino, 1985; Postl, 1986), or projection (Ciardiello et al., 1988; Baird, 1987). Fourier transform-based techniques are unreliable and difficult (Nicel 2000, p. 15); Hough transform-based approaches are usually computationally expensive (Gatos et al. 1996, p. 1). Other approaches include connected component clustering and correlation-based algorithms. Nicel (2003) describes each method and summarises the relative advantages and disadvantages of each approach.

2.4 Summary

The various image processing techniques that exist and are relevant to this project have been outlined. In order to meet the minimum requirements, input images will have to be processed and correctly segmented. Chapter 3 describes in detail how the segmented image will then be used to calculate Braille dot coordinates. An intermediate representation of the Braille page will be constructed, which will finally be translated into ASCII text. As an extension to the minimum requirements, a technique for detecting the amount of rotation present in an image will be chosen and implemented.

At this stage (end of February), all the background research work has been completed. The project is on schedule.

Chapter 3: Implementation

3.1 Introduction

This chapter describes the implementation of the working system. The image processing algorithms and techniques used are discussed; their use is justified and linked to the background research. The following are examined in detail: problems that have arisen during this phase; their solutions; modifications to the original approach; and the reasons for not using certain methods. Section 3.4 examines extensions to the minimum requirements that have and have not been put into practice.

The implementation approximately follows the original framework. Internal representations of each Braille character are constructed. Image pre-processing in the form of a median filter is carried out to suppress or remove any impulse noise. Features of interest are segmented out and enhanced using thresholding and morphological operations (erosion and dilation). High-level image understanding is carried out with the use of projection, which also helps with detecting any rotation present in the original image. An intermediate representation of the Braille page is constructed. A search algorithm is then used to match the characters in the intermediate form to the ones stored internally.

3.2 Initial Character Representation and Image Processing

3.2.1 Internal Braille Character Representation

Each Braille character consists of a 3x2 matrix. The dots are numbered 1-6 in the manner shown in Appendix B. The character representation is a binary array, with 0 representing the absence of a dot and 1 representing its presence. This design is very simple, but effective. Each character is defined at the start of the program, and characters may be added or modified easily in the future. Arrays allow for efficient comparison, something that will be useful at later stages when a constructed Braille character is compared to every possible match in the alphabet.
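A minimal sketch of this representation is given below, with the six dots stored in a boolean array ordered 1-6 (left column top to bottom, then right column) and a dictionary mapping ASCII characters to their dot patterns. Only the first three letters are shown; the real program defines the full alphabet, digits and punctuation in the same way, and the class and method names here are illustrative rather than those of the actual code.

    import java.util.Arrays;
    import java.util.LinkedHashMap;
    import java.util.Map;

    /** Sketch of the internal Braille character dictionary: a binary 3x2 cell stored as dots 1-6. */
    public final class BrailleDictionary {

        // Dot order follows the standard numbering: 1, 2, 3 down the left column, 4, 5, 6 down the right.
        private static final Map<String, boolean[]> ALPHABET = new LinkedHashMap<>();
        static {
            ALPHABET.put("a", new boolean[] { true,  false, false, false, false, false }); // dot 1
            ALPHABET.put("b", new boolean[] { true,  true,  false, false, false, false }); // dots 1, 2
            ALPHABET.put("c", new boolean[] { true,  false, false, true,  false, false }); // dots 1, 4
            // ... the remaining letters, digits and punctuation are defined the same way
        }

        /** Returns the ASCII character for a cell, or "*" if no match is found. */
        public static String match(boolean[] cell) {
            for (Map.Entry<String, boolean[]> e : ALPHABET.entrySet()) {
                if (Arrays.equals(e.getValue(), cell)) return e.getKey();
            }
            return "*";
        }
    }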

3.2.2 File Format

Initially, the decision was taken to work with greyscale images only. As in traditional OCR problems (i.e., those involving printed text), there is little point in scanning full-colour pages of what is essentially a binary image. It was therefore assumed that the user will not use, for example, RGB images as input. Extra colour in such scans yields no useful extra information, for the obvious reason that Braille texts tend to come on paper of one colour only. Colour images are also typically three times larger in size.

The chosen file format was Portable Network Graphics (PNG). The majority of scanners allow the user to save images in this format. Using a format without compression was not considered, due to very large file sizes: a greyscale Windows bitmap scan of an A4 page at 600 dots per inch (dpi) is over 30 megabytes in size, and the same file in RGB colour is over 100 megabytes. The com.pearsoneduc.ip.io package provides an interface that allows a choice of different file formats, the most popular being JPEG and PNG. PNG was chosen over JPEG as it produces better quality images while file sizes remain reasonable, in the order of under one megabyte for an A4 scan at 150 dpi. Also, because of the way JPEG compression works, images degrade in quality each time they are modified. The figures below show the differences between the two file types.

Figure 3-1: JPEG. Note the continuous tones have been reproduced, but visible artefacts surround text. Also, there is ghosting outside vertical window edges.

Figure 3-2: PNG. Text is clear and sharp. Continuous tones are reproduced.

Following the initial scans, a possible project enhancement was rejected: working with double-sided Braille. Even a high resolution (600 dpi) scan in a lossless, full-colour format revealed very little difference between Braille dots that were raised above the page and those that were recessed below it. Figure 3-3 shows the image, along with a close-up of a relevant area (highlighted with a dashed border in Figure 3-4).

Figure 3-3: Original full-page, full-colour, double-sided scan at 600 dpi (over 100 megabytes).

Figure 3-4: Close-up of the dashed area from Figure 3-3.

As can clearly be seen, the difference between the raised and the recessed dots is negligible, especially on the right side of Figure 3-4.

3.2.3 Image Pre-Processing

A pre-processing stage is standard in OCR systems. In this case, a median filter with a 3x3 neighbourhood is applied first. It reduces impulse noise and has the effect of smoothing the grey levels in the scan, which is useful as it makes the page itself more uniform in appearance. This will assist in distinguishing the features of interest (i.e., the Braille dots) from the background. However, because of the way the median filter works, a black border is generated around the image: something that will have to be dealt with at a later stage. Initial tests showed that the dots are clearly distinct from the page. For this reason, contrast-enhancing techniques are not used.

3.2.4 Segmentation: Thresholding

Edge- and region-based segmentation techniques were found to be unnecessary. The former find borders between regions, while the latter construct regions directly. The images dealt with in this project are simple enough to use a global segmentation technique. The global knowledge about these images may be represented as a histogram of image features (Sonka et al. 1999, p. 123); here, grey levels are used as they distinguish the features of interest: Braille dots are darker (and hence have lower grey level values) than the rest of the page.

As previously stated, the correct choice of threshold level is absolutely vital to the success of this technique. A preliminary training set of test images showed little variation in shadows across the page. Scanners provide a uniform light source as they scan an image, so illumination variance is minimised. This simplifies the problem of choosing a correct threshold value compared to images from other sources (such as a webcam). It also suggests that regional thresholding, where an image is split into regions and each region is then thresholded independently, is not necessary.

Initially, the threshold value was chosen manually. Figure 3-5 shows a typical scanned image, together with a correct threshold resulting in successful segmentation (Figure 3-6). Note that some impulse noise remains at the bottom left (highlighted). Figure 3-7 shows an incorrect threshold value for comparison. Here, the value chosen was too high and some residual shadow (due to the page not being completely flat against the scanner face) remains.

Figure 3-5: A typical scanned image.

Figure 3-6: The result of a correct threshold.

Figure 3-7: The result of the threshold value being too high.

An automatic threshold selection method was then implemented. Optimal thresholding was initially considered a good candidate, as it results in the smallest number of pixels being mis-segmented (Gonzalez and Wintz, 1987; Rosenfeld and Kak, 1982). The threshold is set as the minimum grey level between the maxima of two or more normal distributions. Figure 3-8 (from Sonka et al., 1999) shows (a) the probability distributions of background and objects, and (b) the corresponding histograms and optimal threshold.

Figure 3-8: Grey level histograms approximated by two normal distributions.

The main difficulty with this technique, as stated in Chapter 2, is deciding whether or not the histogram is bi-modal. Figure 3-9 shows a histogram that is typical of the images being worked with.

Figure 3-9: Grey level histogram of the image in Figure 3-5.

At first glance it seems that the histogram is, indeed, bi-modal. Figure 3-10 shows a close-up of the relevant area.

Figure 3-10: The bi-modal histogram of Figure 3-5 (close-up).

Although this seems to confirm the choice of optimal thresholding as the preferred method for selecting the threshold level, actually using the calculated value (252 in this case) yields no useful result (Figure 3-11).

Figure 3-11: Threshold using the value calculated with optimal thresholding.

Figure 3-12 shows the histogram of another typical image.

Figure 3-12: Histogram of another typical image.

Although this histogram is broadly similar to Figure 3-10, in that it shows there are fewer dark pixels (those representing the Braille dots) than bright ones (those representing the rest of the page), it does not exhibit two clear maxima with a distinct minimum in between. The reason for this is that the grey level distributions of the foreground and the background are not well represented by normal distribution curves. Optimal thresholding was therefore found not to be a suitable method for selecting the global threshold value.

P-tile thresholding was the next approach considered for automatic threshold selection. The training set of initial scans was used to determine the proportion of the page covered by Braille. After some experimentation, it was found that just under 4% of the page belonged to the Braille characters, which shared the characteristic of being darker than the surrounding area. The threshold value T may now be chosen automatically using a cumulative frequency histogram of the grey level values, such that 4% of the image has grey level values less than T and the rest has grey level values larger than T. This method gave excellent results, and in the majority of cases resulted in correct automatic segmentation of the images.
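A sketch of this selection procedure is shown below. It assumes an 8-bit greyscale image and a proportion p of roughly 0.04, and walks the cumulative histogram until that fraction of the pixels has been covered; the names used are illustrative.

    /** P-tile threshold selection sketch: choose T so that a fixed fraction of pixels lies below it. */
    public final class PTileThreshold {

        public static int select(int[][] f, double p) {      // p is roughly 0.04 for the Braille scans
            int[] hist = new int[256];
            long total = 0;
            for (int[] row : f) {
                for (int v : row) { hist[v]++; total++; }
            }
            long target = Math.round(p * total);
            long cumulative = 0;
            for (int t = 0; t < 256; t++) {                   // walk the cumulative histogram
                cumulative += hist[t];
                if (cumulative >= target) return t;           // first grey level covering p of the image
            }
            return 255;
        }
    }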

Two problems surfaced at this stage, however. Firstly, some noise still remained; it was similar in nature to the impulse noise highlighted in Figure 3-6. Applying another 3x3 median filter virtually eliminated this problem (this operation makes the reasonable assumption that the Braille dots are larger than impulse noise). Figure 3-13 shows the result: note the absence of noise (circled) and the black border generated by the second median filter.

Figure 3-13: The result of a second median filter (after thresholding).

The second problem can be seen in Figure 3-14.

Figure 3-14: The top half of a segmented image.

A black, vertical line is present on the left side of this image, distinct from the border generated by the median filter. It is caused by a shadow cast on the page as it is being scanned, which results from a typical Braille page being larger than A4 size (pages tend to be too wide, rather than too long). Some of the page is therefore invariably not flat against the glass surface of the scanner as the image is acquired (Figure 3-15).

Figure 3-15: A typical Braille page too large to fit on a scanner.

In Figure 3-14, the left side of the page was sticking out. This lifted the page slightly, casting a shadow which was approximately as dark as the Braille dots. Although manually lowering the threshold value found previously with p-tile thresholding alleviated the situation somewhat, most of the shadow remained.

Reducing the threshold value further resulted in incorrect segmentation, with some of the dots in the output image simply disappearing or becoming very small. The shadow would have to be dealt with at a later stage, as a possible extension of the project. Of course, manually cropping the input image to exclude the affected area eliminated the problem entirely.

3.2.5 Morphological Operations: Dilation and Erosion

Binary images that result from segmentation may contain imperfections; morphological processing techniques can remove these imperfections (Efford 2000, p. 271). These techniques are used here to enhance the size of the Braille dots after thresholding. Any dots that have been flattened on the page tend to stand out less, which results in their reduced size in the output image. A decision was also made to invert the binary image after this operation, so that the dots are white and the background black. This simply makes the image conceptually more logical, as the presence of a dot is now indicated by an 'on' pixel (with a grey level value of 255 in an 8-bit image), and its absence by an 'off' pixel (with a grey level value of 0). Figure 3-16 shows an example output before dilation, with some problem areas highlighted.

Figure 3-16: A thresholded image before dilation.

Figure 3-17 shows the same image after dilation. Clearly, the features of interest have been enhanced. The structuring element used for this operation was a 5x5 disk, because dilation by a disk enlarges the dots and smoothes any convex corners. Still, this presented another problem: the dots became very bold, and most of them were now touching. It is obviously desirable to keep them separated, so another operation had to be performed.
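The two morphological operations involved can be sketched as follows, assuming a binary image holding 0 and 255 and a structuring element given as a list of (dy, dx) offsets from its centre; pixels outside the image are treated as background. A 5x5 disk or a cross shape can both be expressed as such offset lists, e.g. the cross is {(0,0), (-1,0), (1,0), (0,-1), (0,1)}.

    /** Binary dilation and erosion sketch; the image holds 0 (off) and 255 (on), and the
        structuring element is a list of (dy, dx) offsets from its centre. */
    public final class Morphology {

        public static int[][] dilate(int[][] in, int[][] se) {
            return transform(in, se, false);
        }

        public static int[][] erode(int[][] in, int[][] se) {
            return transform(in, se, true);
        }

        private static int[][] transform(int[][] in, int[][] se, boolean requireAll) {
            int h = in.length, w = in[0].length;
            int[][] out = new int[h][w];
            for (int y = 0; y < h; y++) {
                for (int x = 0; x < w; x++) {
                    boolean any = false, all = true;
                    for (int[] off : se) {
                        int yy = y + off[0], xx = x + off[1];
                        boolean on = yy >= 0 && yy < h && xx >= 0 && xx < w && in[yy][xx] == 255;
                        any |= on;
                        all &= on;
                    }
                    // Dilation: output is on if the element intersects the foreground.
                    // Erosion:  output is on only if the element fits entirely within it.
                    out[y][x] = (requireAll ? all : any) ? 255 : 0;
                }
            }
            return out;
        }
    }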

After experimenting with a further erosion operation and different structuring elements (using the disk again would be pointless, as it would effectively reverse the dilation), empirical evidence suggested an erosion with a cross-shaped structuring element. This showed good results: it separated the dots slightly while maintaining the enhancing effect of the original dilation, although some dots became more square in appearance.

Figure 3-17: The image after dilation.

3.2.6 Conclusion

The low-level processing techniques chosen have worked well. P-tile thresholding was found to be the best method of automatically selecting the threshold level and correctly segmenting the image. Median filtering worked well and removed most of the noise present. Special consideration will have to be given to shadows produced on the page due to Braille pages being wider than the scanner's surface. The morphological operations also worked well in enhancing the relevant features. The resulting output so far is a binary image with well-defined white dots on a black background. A possible project enhancement was rejected: double-sided Braille. This was due to the negligible difference between Braille dots that are raised and those that are recessed on a page.

The project is running on schedule, with unit testing of the methods implemented so far completed (middle of March). However, the next part (higher-level image understanding and character recognition) is where the main difficulties are likely to lie.

3.3 Image Understanding: Character Recognition

3.3.1 Using Projection to Calculate Cell Coordinates

Braille pages are very regular, that is, they have an ordered, consistent appearance. Each character's 3x2 matrix is exactly the same size.

This makes the recognition task easier than, for example, recognising printed characters on a page, where different letters have slightly different dimensions, and certainly easier than recognising handwriting. This regularity can be exploited to make the task of matching Braille characters easier.

Projection is a method of mapping a 2-dimensional shape onto one dimension. Typically, projections are generated parallel to the abscissa (horizontal projection) and the ordinate (vertical projection), and are performed on binary images. Mathematically, the horizontal projection p_h and the vertical projection p_v are defined as

  p_h(i) = Σ_j f(i, j)    and    p_v(j) = Σ_i f(i, j)

where (i, j) are image coordinates. A horizontal projection is calculated by scanning the image in a left-to-right, top-to-bottom order. For each y-coordinate of the image, a counter accumulates the total number of corresponding hits (in this case, white pixels) along the x-axis. Nicel (2000, p. 21) outlines the simple algorithm used:

  for every y coordinate do
      for every x coordinate do
          if pixel (x, y) is white then
              increment counterarray[y];
  output counterarray

A vertical projection is obtained in a similar manner, swapping the x and y coordinates in the first two lines of the algorithm. Two arrays are used to store the projections. Because of the borders around the image generated by the median filters used previously, a small modification of the above algorithm was made: scanning begins at (x+3) and (y+3), and terminates at (width-3) and (height-3) respectively, where width and height are the dimensions of the image.

Projection allows the Braille cell sizes and the positions of the Braille dots to be calculated automatically, assuming that the image is not skewed. A graph of each projection is plotted. The graph has a peaky appearance, with peaks in the vertical projection corresponding to the locations of Braille columns, and peaks in the horizontal projection corresponding to the locations of Braille lines. Figure 3-18 demonstrates the concept. Note that the image itself is scaled and rotated 90° anti-clockwise for illustration purposes, with its horizontal projection shown above it. A clear correlation between the peaks and the locations of the Braille lines can be seen. The height of a peak shows how many hits the projection made; for example, line 8 of the page (y-coordinates of around 350) has fewer characters on it and hence its peaks are smaller. The peaks are grouped in threes (each group representing a line of Braille), since each Braille character is 3 dots high. The vertical projection graph of the same image is shown in Figure 3-19. For clarity, the x-axis is scaled to the range [0:500]. As expected, the peaks are grouped in twos, representing the 2-dot width of each Braille character.
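The projection calculation itself amounts to little more than counting white pixels row by row and column by column. The sketch below assumes a binary image with white stored as 255 and skips the three-pixel margin described above.

    /** Horizontal and vertical projection sketch for a binary image (white = 255).
        A 3-pixel margin is skipped to avoid the borders left by the median filters. */
    public final class Projection {

        public static int[] horizontal(int[][] img) {
            int h = img.length, w = img[0].length;
            int[] counts = new int[h];
            for (int y = 3; y < h - 3; y++)
                for (int x = 3; x < w - 3; x++)
                    if (img[y][x] == 255) counts[y]++;   // hits along each row
            return counts;
        }

        public static int[] vertical(int[][] img) {
            int h = img.length, w = img[0].length;
            int[] counts = new int[w];
            for (int x = 3; x < w - 3; x++)
                for (int y = 3; y < h - 3; y++)
                    if (img[y][x] == 255) counts[x]++;   // hits along each column
            return counts;
        }
    }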

Figure 3-18: A binary image (rotated 90° anti-clockwise) and its horizontal projection.

Figure 3-19: Vertical projection of the image in Figure 3-18.

Each peak on the horizontal projection graph corresponds to the most likely y-coordinate of a line of Braille. Similarly, each peak on the vertical projection graph corresponds to the most likely x-coordinate of a Braille column. Using this information, each graph is scanned for maxima, and the corresponding x- and y-values are noted. Detecting the maxima was not trivial, because some peaks exhibited a flat appearance (Figure 3-20), partly due to the previous erosion with a cross-shaped element. There was also the possibility of local maxima that did not correspond to a true peak. The search algorithm therefore had to compare more than one value either side of the candidate currently being investigated; also, once a peak was found, the algorithm skipped forward a small number of values to ensure that a particularly flat peak did not get registered as two separate maxima.

Figure 3-20: The flat maxima of a horizontal projection.
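An illustrative version of this peak search is sketched below: a position counts as a peak only if it is at least as high as its neighbours within a small window and above a minimum height, and the scan then jumps forward so that a flat-topped peak is not recorded twice. The window, height and skip parameters are placeholders, not the values used in the project.

    import java.util.ArrayList;
    import java.util.List;

    /** Sketch of peak detection in a projection profile, tolerating flat-topped maxima. */
    public final class PeakFinder {

        public static List<Integer> findPeaks(int[] profile, int window, int minHeight, int skip) {
            List<Integer> peaks = new ArrayList<>();
            for (int i = window; i < profile.length - window; i++) {
                if (profile[i] < minHeight) continue;
                boolean isPeak = true;
                for (int d = 1; d <= window; d++) {          // compare several values either side
                    if (profile[i] < profile[i - d] || profile[i] < profile[i + d]) {
                        isPeak = false;
                        break;
                    }
                }
                if (isPeak) {
                    peaks.add(i);
                    i += skip;                               // jump past the rest of a flat peak
                }
            }
            return peaks;
        }
    }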

Fortunately, no local maxima were present in any of the projection graphs examined. The method also exhibits some robustness with respect to the original dilation causing some dots to connect with their neighbours, as separate peaks were still registered for each dot.

Braille letters cannot be printed arbitrarily at any location on a page. They are locked to a grid, and only a certain number of possible locations exists. If one were to imagine a character in which every dot of the cell is embossed (such a character does not exist, although if dot 6 were added to the letter q, this character would be the result; see Appendix B: The Braille Alphabet), and that character were repeatedly printed at every possible grid location on the page, the two arrays would contain the x- and y-coordinates of every cell's dots.

The result of analysing the projection graphs is therefore two arrays holding the possible x- and y-coordinates of all the Braille dots on the page. Note that this does not mean dots are actually present at these locations. Projection analysis merely provides hypothetical locations of each dot; however, it does reduce the search space quite substantially from every single pixel in the image. It is up to later stages of image analysis to determine whether or not a dot is actually present at each of the possible coordinates, and which character it belongs to. Figure 3-21 demonstrates the concept: the first six hypothetical dot locations are shown where the dashed lines cross. Clearly, dots are not present at those locations.

Figure 3-21: Using projection analysis to determine possible locations of Braille dots.

3.3.2 Intermediate Braille Representation

The intermediate representation of the page is a two-dimensional character array. This form was chosen because it allows quicker searching and matching of Braille letters than searching through the image itself. An 'o' in the array indicates the presence of a dot, and a '-' indicates its absence. A simple algorithm searches for Braille dots at all possible dot locations (which were determined using projection).

The algorithm checks whether the pixel at the current (x, y) position is white. If it is, it writes an 'o' to the array; if it is not, it writes a '-' instead. At first, an alternative algorithm that was thought to be more robust was used, since a single white pixel at a wrong location would otherwise result in a false positive. The alternative checked the values of the pixels at positions (x, y), (x-1, y), (x+1, y), (x, y-1) and (x, y+1), and only if they were all white would a positive match be recorded. After experimentation, however, it was found that the original version worked much better. The modified algorithm missed too many dots because of tiny (less than 0.1°) variations in page orientation, while the original version encountered no spurious positives, as almost no noise was present in the segmented images at the locations being checked. The intermediate representation of the image in Figure 3-21 is shown in Table 2.

  Table 2: Intermediate array representation of Figure 3-21.

Note that the intermediate form does not try to distinguish actual 3x2 Braille cells; it merely writes a flat form to the 2-dimensional array, which is now ready for the next stage: character matching.

3.3.3 Character Matching

The matching algorithm operates on the intermediate representation of the page. In pseudocode:

  initialise an empty, temporary Braille cell B;
  for every 3rd y array index do
      for every 2nd x array index do
          matchCharacter(x, y);

The matchCharacter routine checks for the presence of an 'o' character at the following indices, corresponding to the 3x2 Braille cell: (x, y), (x+1, y), (x, y+1), (x+1, y+1), (x, y+2), (x+1, y+2), and modifies B accordingly.
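A sketch of this cell-assembly loop is given below, assuming the intermediate page is held as a char array of 'o' and '-' entries and reusing the illustrative dictionary sketched in Section 3.2.1; the dictionary comparison and the handling of empty or unmatched cells, described next, are included for completeness.

    /** Sketch of matching 3x2 cells from the intermediate 'o'/'-' page array against the dictionary. */
    public final class CharacterMatcher {

        public static String translate(char[][] page) {
            StringBuilder out = new StringBuilder();
            for (int y = 0; y + 2 < page.length; y += 3) {          // every 3rd row of dot positions
                for (int x = 0; x + 1 < page[y].length; x += 2) {   // every 2nd column of dot positions
                    boolean[] cell = {                              // dot order 1..6, left column first
                        page[y][x] == 'o', page[y + 1][x] == 'o', page[y + 2][x] == 'o',
                        page[y][x + 1] == 'o', page[y + 1][x + 1] == 'o', page[y + 2][x + 1] == 'o'
                    };
                    boolean empty = true;
                    for (boolean dot : cell) empty &= !dot;
                    // An empty cell is treated as a space; otherwise the cell is looked up in the
                    // dictionary, which returns "*" when no match is found.
                    out.append(empty ? " " : BrailleDictionary.match(cell));
                }
                out.append('\n');
            }
            return out.toString();
        }
    }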

Finally, it compares B to every character stored in its dictionary (defined at the start of the program) and, if a match is found, writes the result to an output stream. If B is still empty, indicating the absence of a character, a space is written. To add some robustness, if a match is not made (or an unknown character is encountered), a '*' is printed.

This algorithm is somewhat naive; it takes advantage of the fact that no Braille letter has a completely empty left column in its cell (dots 1, 2 and 3), nor a completely empty top row (dots 1 and 4). It makes the reasonably accurate assumption that no single line of text will be composed entirely of punctuation marks, and that there are no peculiar features on the page such as line breaks (which, although rare, do exist). It also assumes that the user does not cut off half a column of Braille by placing the page on the scanner incorrectly. If any of these conditions does occur, however, the algorithm gets derailed and fails badly from that point onwards. A solution to this problem, along with skew detection and correction, is implemented at a later stage as an extension to the minimum requirements.

3.3.4 Conclusion

The approach taken was found to work very well with perfect images. It gave accurate translations of images that contained very little (less than 0.1°) rotation or skew, and that contained no shadows caused by an edge of the page not being completely flat against the scanner face as the image was acquired. This meant, however, that the images usually had to be doctored slightly, i.e., cropped to remove particularly dark edges, or artificially rotated to remove any skew. The software does not need to be told (as commercial software does) what Braille cell size is used in the document; it can calculate this automatically. One peculiarity of the Braille grammar used is that the question mark character is exactly the same as the opening quote character (dots 2, 3 and 6), meaning the software cannot distinguish between the two.

At this stage (end of March), the minimum requirements have been met. A working system has been implemented that, under perfect conditions, translates scanned-in Braille pages into ASCII text. One of the methods used (specifically, projection) also provides a way of solving the image skew detection problem (discussed in the next section). A decision was therefore made to implement a feature that can detect and correct small rotations in scanned images.

3.4 Extensions to the Minimum Requirements

3.4.1 Alternative Braille Grammars: Braille Grade 2

Initially, alternative Braille grammars were considered as a possible extension to the project. After some research it transpired that handling other grammars is no longer really a computer vision problem, where it would pose a slightly different challenge to the original solution, but rather a natural language processing task involving statistical analysis of the input and probably some sort of probability matching.

The reason for this is that in Braille Grade 2, for example (which is essentially shorthand Braille), one Braille character can have many different meanings. Also, a sequence of characters can mean more than one thing; for instance, 'cd' can mean 'could' and 'rcv' can mean 'receive'. The meaning depends on the context, so this is more of a statistical language processing problem. Alternative Braille grammars will therefore not be considered further as a possible extension.

3.4.2 Skew Detection and Correction

Ciardiello et al. (1988) use the horizontal projection to determine the rotation angle for which the mean square deviation of the profile is maximised. Because projection is already being used in this project, this approach is adopted here. To evaluate the projection method, a correctly segmented image with no skew was rotated by 1° clockwise (Figure 3-22, with horizontal lines drawn for comparison), and its horizontal projection calculated (Figure 3-23).

Figure 3-22: A segmented image with a clockwise rotation of 1 degree.

The image was then rotated anti-clockwise by 1°, so that no skew was present; its projection profile is shown in Figure 3-24. It is clear that where there is no rotation present, the graph has sharp, well-defined peaks and troughs. Although the maxima are approximately the same, the minima are much lower. This is because the Braille dots line up more accurately on the page when there is no skew, resulting in fewer hits in between the dots of each cell. A decision was taken to work with horizontal projections only.

Vertical projections exhibit similar properties depending on the amount of skew present in an image, but due to the spacing between the Braille cells, horizontal projections have clearer gaps between each maximum.

Figure 3-23: Horizontal projection profile of the image in Figure 3-22 (1° rotation).

Figure 3-24: Horizontal projection profile of the image in Figure 3-22 (0° rotation).

The example is somewhat artificial, in that the rotation present in real-world scans is likely to be lower (around 0.2°). The projection graphs still demonstrate the same properties (Figure 3-25; note that the abscissa has been restricted to show the relevant features), which can be exploited.

Figure 3-25: Difference in projection profiles for a small change in rotation (0° vs 0.2°).

A simple way of deciding which projection profile represents the least rotation would be to compare the averages of the minima and maxima; however, Nicel (2003) found this to be unreliable. Hence, following Ciardiello et al.'s work, the variance of the profile is used instead. This is a much more robust statistical measure, and was found to work well.

The skew detection method employed uses a brute-force approach which, although computationally expensive, works well. The input image is rotated from -n to +n steps of a fixed angular increment, with the aim of maximising the variance of the horizontal projection at each step. A bilinear (first-order) interpolation scheme is used for the rotation. It computes the output pixel grey level as a distance-weighted function of the grey levels of the four pixels surrounding the calculated point in the input image (Efford 2000, p. 240). As a result, it produces smoother and visually more pleasing results than zero-order interpolation. The number of steps and the increment are parameters that are hard-coded into the program, but may be varied. The initial values chosen were 5 steps and an increment of 0.2°, meaning the image is rotated from -1° through to +1° in 0.2° increments. These values were found experimentally and gave good results. The increment may be reduced to calculate the skew more precisely, and n may be increased to allow a greater range of skew detection than ±1°. This will, of course, be at the cost of increased computation time.
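The brute-force search can be sketched as follows. For brevity the sketch rotates a binary image with nearest-neighbour sampling and reuses the projection sketch from Section 3.3.1, whereas the actual implementation rotates the original greyscale image with bilinear interpolation before any pre-processing, for the reasons given below; the parameter names are illustrative.

    /** Sketch of skew estimation by maximising the variance of the horizontal projection. */
    public final class SkewDetector {

        /** Searches angles from -n*step to +n*step degrees; the returned angle is the rotation
            that best aligns the Braille lines, i.e. applying rotate(binary, result) deskews it. */
        public static double estimateCorrection(int[][] binary, int n, double stepDegrees) {
            double bestAngle = 0.0, bestVariance = -1.0;
            for (int i = -n; i <= n; i++) {
                double angle = i * stepDegrees;
                double v = variance(Projection.horizontal(rotate(binary, angle)));
                if (v > bestVariance) { bestVariance = v; bestAngle = angle; }
            }
            return bestAngle;
        }

        private static double variance(int[] p) {
            double mean = 0;
            for (int v : p) mean += v;
            mean /= p.length;
            double var = 0;
            for (int v : p) var += (v - mean) * (v - mean);
            return var / p.length;
        }

        /** Nearest-neighbour rotation about the image centre (illustration only). */
        private static int[][] rotate(int[][] in, double degrees) {
            int h = in.length, w = in[0].length;
            double rad = Math.toRadians(degrees), cos = Math.cos(rad), sin = Math.sin(rad);
            double cy = h / 2.0, cx = w / 2.0;
            int[][] out = new int[h][w];
            for (int y = 0; y < h; y++) {
                for (int x = 0; x < w; x++) {
                    // Sample the source pixel that maps onto (x, y) after rotation.
                    int sx = (int) Math.round(cx + (x - cx) * cos - (y - cy) * sin);
                    int sy = (int) Math.round(cy + (x - cx) * sin + (y - cy) * cos);
                    if (sx >= 0 && sx < w && sy >= 0 && sy < h) out[y][x] = in[sy][sx];
                }
            }
            return out;
        }
    }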

The main drawback of this method is its lack of speed. The image is rotated before any pre-processing is done, meaning that filtering, thresholding, etc., are repeated many times. It would be more efficient to perform the rotation on the final segmented binary image. The reason for not doing so is shown in Figure 3-26: white areas are clearly visible around the edges of the image, caused by the borders generated by the median filters and the way the rotation is calculated afterwards. These areas throw the projection off somewhat and reduce the accuracy of translation.

Figure 3-26: Image rotated after pre-processing.

In contrast, applying the rotation before pre-processing removes the problem entirely, as can be seen in Figure 3-27.

Figure 3-27: Image rotated before pre-processing.


More information

CONTENTS. Chapter I Introduction Package Includes Appearance System Requirements... 1

CONTENTS. Chapter I Introduction Package Includes Appearance System Requirements... 1 User Manual CONTENTS Chapter I Introduction... 1 1.1 Package Includes... 1 1.2 Appearance... 1 1.3 System Requirements... 1 1.4 Main Functions and Features... 2 Chapter II System Installation... 3 2.1

More information

Image Processing Lecture 4

Image Processing Lecture 4 Image Enhancement Image enhancement aims to process an image so that the output image is more suitable than the original. It is used to solve some computer imaging problems, or to improve image quality.

More information

MAV-ID card processing using camera images

MAV-ID card processing using camera images EE 5359 MULTIMEDIA PROCESSING SPRING 2013 PROJECT PROPOSAL MAV-ID card processing using camera images Under guidance of DR K R RAO DEPARTMENT OF ELECTRICAL ENGINEERING UNIVERSITY OF TEXAS AT ARLINGTON

More information

Non Linear Image Enhancement

Non Linear Image Enhancement Non Linear Image Enhancement SAIYAM TAKKAR Jaypee University of information technology, 2013 SIMANDEEP SINGH Jaypee University of information technology, 2013 Abstract An image enhancement algorithm based

More information

Real Time Word to Picture Translation for Chinese Restaurant Menus

Real Time Word to Picture Translation for Chinese Restaurant Menus Real Time Word to Picture Translation for Chinese Restaurant Menus Michelle Jin, Ling Xiao Wang, Boyang Zhang Email: mzjin12, lx2wang, boyangz @stanford.edu EE268 Project Report, Spring 2014 Abstract--We

More information

Applying mathematics to digital image processing using a spreadsheet

Applying mathematics to digital image processing using a spreadsheet Jeff Waldock Applying mathematics to digital image processing using a spreadsheet Jeff Waldock Department of Engineering and Mathematics Sheffield Hallam University j.waldock@shu.ac.uk Introduction When

More information

Vehicle Number Plate Recognition with Bilinear Interpolation and Plotting Horizontal and Vertical Edge Processing Histogram with Sound Signals

Vehicle Number Plate Recognition with Bilinear Interpolation and Plotting Horizontal and Vertical Edge Processing Histogram with Sound Signals Vehicle Number Plate Recognition with Bilinear Interpolation and Plotting Horizontal and Vertical Edge Processing Histogram with Sound Signals Aarti 1, Dr. Neetu Sharma 2 1 DEPArtment Of Computer Science

More information

License Plate Localisation based on Morphological Operations

License Plate Localisation based on Morphological Operations License Plate Localisation based on Morphological Operations Xiaojun Zhai, Faycal Benssali and Soodamani Ramalingam School of Engineering & Technology University of Hertfordshire, UH Hatfield, UK Abstract

More information

LAB MANUAL SUBJECT: IMAGE PROCESSING BE (COMPUTER) SEM VII

LAB MANUAL SUBJECT: IMAGE PROCESSING BE (COMPUTER) SEM VII LAB MANUAL SUBJECT: IMAGE PROCESSING BE (COMPUTER) SEM VII IMAGE PROCESSING INDEX CLASS: B.E(COMPUTER) SR. NO SEMESTER:VII TITLE OF THE EXPERIMENT. 1 Point processing in spatial domain a. Negation of an

More information

CHAPTER 4 LOCATING THE CENTER OF THE OPTIC DISC AND MACULA

CHAPTER 4 LOCATING THE CENTER OF THE OPTIC DISC AND MACULA 90 CHAPTER 4 LOCATING THE CENTER OF THE OPTIC DISC AND MACULA The objective in this chapter is to locate the centre and boundary of OD and macula in retinal images. In Diabetic Retinopathy, location of

More information

Images and Graphics. 4. Images and Graphics - Copyright Denis Hamelin - Ryerson University

Images and Graphics. 4. Images and Graphics - Copyright Denis Hamelin - Ryerson University Images and Graphics Images and Graphics Graphics and images are non-textual information that can be displayed and printed. Graphics (vector graphics) are an assemblage of lines, curves or circles with

More information

Malaysian Car Number Plate Detection System Based on Template Matching and Colour Information

Malaysian Car Number Plate Detection System Based on Template Matching and Colour Information Malaysian Car Number Plate Detection System Based on Template Matching and Colour Information Mohd Firdaus Zakaria, Shahrel A. Suandi Intelligent Biometric Group, School of Electrical and Electronics Engineering,

More information

Digital Image Processing

Digital Image Processing Digital Image Processing Part 2: Image Enhancement Digital Image Processing Course Introduction in the Spatial Domain Lecture AASS Learning Systems Lab, Teknik Room T26 achim.lilienthal@tech.oru.se Course

More information

Digital Imaging and Image Editing

Digital Imaging and Image Editing Digital Imaging and Image Editing A digital image is a representation of a twodimensional image as a finite set of digital values, called picture elements or pixels. The digital image contains a fixed

More information

Digital Image Processing 3/e

Digital Image Processing 3/e Laboratory Projects for Digital Image Processing 3/e by Gonzalez and Woods 2008 Prentice Hall Upper Saddle River, NJ 07458 USA www.imageprocessingplace.com The following sample laboratory projects are

More information

DIGITAL IMAGE PROCESSING (COM-3371) Week 2 - January 14, 2002

DIGITAL IMAGE PROCESSING (COM-3371) Week 2 - January 14, 2002 DIGITAL IMAGE PROCESSING (COM-3371) Week 2 - January 14, 22 Topics: Human eye Visual phenomena Simple image model Image enhancement Point processes Histogram Lookup tables Contrast compression and stretching

More information

Detection and Verification of Missing Components in SMD using AOI Techniques

Detection and Verification of Missing Components in SMD using AOI Techniques , pp.13-22 http://dx.doi.org/10.14257/ijcg.2016.7.2.02 Detection and Verification of Missing Components in SMD using AOI Techniques Sharat Chandra Bhardwaj Graphic Era University, India bhardwaj.sharat@gmail.com

More information

Computer Graphics Fundamentals

Computer Graphics Fundamentals Computer Graphics Fundamentals Jacek Kęsik, PhD Simple converts Rotations Translations Flips Resizing Geometry Rotation n * 90 degrees other Geometry Rotation n * 90 degrees other Geometry Translations

More information

More image filtering , , Computational Photography Fall 2017, Lecture 4

More image filtering , , Computational Photography Fall 2017, Lecture 4 More image filtering http://graphics.cs.cmu.edu/courses/15-463 15-463, 15-663, 15-862 Computational Photography Fall 2017, Lecture 4 Course announcements Any questions about Homework 1? - How many of you

More information

Evaluation of image quality of the compression schemes JPEG & JPEG 2000 using a Modular Colour Image Difference Model.

Evaluation of image quality of the compression schemes JPEG & JPEG 2000 using a Modular Colour Image Difference Model. Evaluation of image quality of the compression schemes JPEG & JPEG 2000 using a Modular Colour Image Difference Model. Mary Orfanidou, Liz Allen and Dr Sophie Triantaphillidou, University of Westminster,

More information

CSC 320 H1S CSC320 Exam Study Guide (Last updated: April 2, 2015) Winter 2015

CSC 320 H1S CSC320 Exam Study Guide (Last updated: April 2, 2015) Winter 2015 Question 1. Suppose you have an image I that contains an image of a left eye (the image is detailed enough that it makes a difference that it s the left eye). Write pseudocode to find other left eyes in

More information

A Fast Segmentation Algorithm for Bi-Level Image Compression using JBIG2

A Fast Segmentation Algorithm for Bi-Level Image Compression using JBIG2 A Fast Segmentation Algorithm for Bi-Level Image Compression using JBIG2 Dave A. D. Tompkins and Faouzi Kossentini Signal Processing and Multimedia Group Department of Electrical and Computer Engineering

More information

CSC 170 Introduction to Computers and Their Applications. Lecture #3 Digital Graphics and Video Basics. Bitmap Basics

CSC 170 Introduction to Computers and Their Applications. Lecture #3 Digital Graphics and Video Basics. Bitmap Basics CSC 170 Introduction to Computers and Their Applications Lecture #3 Digital Graphics and Video Basics Bitmap Basics As digital devices gained the ability to display images, two types of computer graphics

More information

Image Enhancement in Spatial Domain

Image Enhancement in Spatial Domain Image Enhancement in Spatial Domain 2 Image enhancement is a process, rather a preprocessing step, through which an original image is made suitable for a specific application. The application scenarios

More information

Table of contents. Vision industrielle 2002/2003. Local and semi-local smoothing. Linear noise filtering: example. Convolution: introduction

Table of contents. Vision industrielle 2002/2003. Local and semi-local smoothing. Linear noise filtering: example. Convolution: introduction Table of contents Vision industrielle 2002/2003 Session - Image Processing Département Génie Productique INSA de Lyon Christian Wolf wolf@rfv.insa-lyon.fr Introduction Motivation, human vision, history,

More information

MEASUREMENT OF ROUGHNESS USING IMAGE PROCESSING. J. Ondra Department of Mechanical Technology Military Academy Brno, Brno, Czech Republic

MEASUREMENT OF ROUGHNESS USING IMAGE PROCESSING. J. Ondra Department of Mechanical Technology Military Academy Brno, Brno, Czech Republic MEASUREMENT OF ROUGHNESS USING IMAGE PROCESSING J. Ondra Department of Mechanical Technology Military Academy Brno, 612 00 Brno, Czech Republic Abstract: A surface roughness measurement technique, based

More information

Number Plate Recognition Using Segmentation

Number Plate Recognition Using Segmentation Number Plate Recognition Using Segmentation Rupali Kate M.Tech. Electronics(VLSI) BVCOE. Pune 411043, Maharashtra, India. Dr. Chitode. J. S BVCOE. Pune 411043 Abstract Automatic Number Plate Recognition

More information

Chapter 8. Representing Multimedia Digitally

Chapter 8. Representing Multimedia Digitally Chapter 8 Representing Multimedia Digitally Learning Objectives Explain how RGB color is represented in bytes Explain the difference between bits and binary numbers Change an RGB color by binary addition

More information

Prof. Vidya Manian Dept. of Electrical and Comptuer Engineering

Prof. Vidya Manian Dept. of Electrical and Comptuer Engineering Image Processing Intensity Transformations Chapter 3 Prof. Vidya Manian Dept. of Electrical and Comptuer Engineering INEL 5327 ECE, UPRM Intensity Transformations 1 Overview Background Basic intensity

More information

Virtual Restoration of old photographic prints. Prof. Filippo Stanco

Virtual Restoration of old photographic prints. Prof. Filippo Stanco Virtual Restoration of old photographic prints Prof. Filippo Stanco Many photographic prints of commercial / historical value are being converted into digital form. This allows: Easy ubiquitous fruition:

More information

A Study On Preprocessing A Mammogram Image Using Adaptive Median Filter

A Study On Preprocessing A Mammogram Image Using Adaptive Median Filter A Study On Preprocessing A Mammogram Image Using Adaptive Median Filter Dr.K.Meenakshi Sundaram 1, D.Sasikala 2, P.Aarthi Rani 3 Associate Professor, Department of Computer Science, Erode Arts and Science

More information

Color and More. Color basics

Color and More. Color basics Color and More In this lesson, you'll evaluate an image in terms of its overall tonal range (lightness, darkness, and contrast), its overall balance of color, and its overall appearance for areas that

More information

An Efficient Color Image Segmentation using Edge Detection and Thresholding Methods

An Efficient Color Image Segmentation using Edge Detection and Thresholding Methods 19 An Efficient Color Image Segmentation using Edge Detection and Thresholding Methods T.Arunachalam* Post Graduate Student, P.G. Dept. of Computer Science, Govt Arts College, Melur - 625 106 Email-Arunac682@gmail.com

More information

Laboratory 1: Uncertainty Analysis

Laboratory 1: Uncertainty Analysis University of Alabama Department of Physics and Astronomy PH101 / LeClair May 26, 2014 Laboratory 1: Uncertainty Analysis Hypothesis: A statistical analysis including both mean and standard deviation can

More information

Fundamentals of Multimedia

Fundamentals of Multimedia Fundamentals of Multimedia Lecture 2 Graphics & Image Data Representation Mahmoud El-Gayyar elgayyar@ci.suez.edu.eg Outline Black & white imags 1 bit images 8-bit gray-level images Image histogram Dithering

More information

Automatic Locating the Centromere on Human Chromosome Pictures

Automatic Locating the Centromere on Human Chromosome Pictures Automatic Locating the Centromere on Human Chromosome Pictures M. Moradi Electrical and Computer Engineering Department, Faculty of Engineering, University of Tehran, Tehran, Iran moradi@iranbme.net S.

More information

VEHICLE LICENSE PLATE DETECTION ALGORITHM BASED ON STATISTICAL CHARACTERISTICS IN HSI COLOR MODEL

VEHICLE LICENSE PLATE DETECTION ALGORITHM BASED ON STATISTICAL CHARACTERISTICS IN HSI COLOR MODEL VEHICLE LICENSE PLATE DETECTION ALGORITHM BASED ON STATISTICAL CHARACTERISTICS IN HSI COLOR MODEL Instructor : Dr. K. R. Rao Presented by: Prasanna Venkatesh Palani (1000660520) prasannaven.palani@mavs.uta.edu

More information

Image Optimization for Print and Web

Image Optimization for Print and Web There are two distinct types of computer graphics: vector images and raster images. Vector Images Vector images are graphics that are rendered through a series of mathematical equations. These graphics

More information

Techniques for Generating Sudoku Instances

Techniques for Generating Sudoku Instances Chapter Techniques for Generating Sudoku Instances Overview Sudoku puzzles become worldwide popular among many players in different intellectual levels. In this chapter, we are going to discuss different

More information

image Scanner, digital camera, media, brushes,

image Scanner, digital camera, media, brushes, 118 Also known as rasterr graphics Record a value for every pixel in the image Often created from an external source Scanner, digital camera, Painting P i programs allow direct creation of images with

More information

COMPARATIVE PERFORMANCE ANALYSIS OF HAND GESTURE RECOGNITION TECHNIQUES

COMPARATIVE PERFORMANCE ANALYSIS OF HAND GESTURE RECOGNITION TECHNIQUES International Journal of Advanced Research in Engineering and Technology (IJARET) Volume 9, Issue 3, May - June 2018, pp. 177 185, Article ID: IJARET_09_03_023 Available online at http://www.iaeme.com/ijaret/issues.asp?jtype=ijaret&vtype=9&itype=3

More information

MATLAB 6.5 Image Processing Toolbox Tutorial

MATLAB 6.5 Image Processing Toolbox Tutorial MATLAB 6.5 Image Processing Toolbox Tutorial The purpose of this tutorial is to gain familiarity with MATLAB s Image Processing Toolbox. This tutorial does not contain all of the functions available in

More information

An Improved Method of Computing Scale-Orientation Signatures

An Improved Method of Computing Scale-Orientation Signatures An Improved Method of Computing Scale-Orientation Signatures Chris Rose * and Chris Taylor Division of Imaging Science and Biomedical Engineering, University of Manchester, M13 9PT, UK Abstract: Scale-Orientation

More information

Preprocessing of Digitalized Engineering Drawings

Preprocessing of Digitalized Engineering Drawings Modern Applied Science; Vol. 9, No. 13; 2015 ISSN 1913-1844 E-ISSN 1913-1852 Published by Canadian Center of Science and Education Preprocessing of Digitalized Engineering Drawings Matúš Gramblička 1 &

More information

An Improved Bernsen Algorithm Approaches For License Plate Recognition

An Improved Bernsen Algorithm Approaches For License Plate Recognition IOSR Journal of Electronics and Communication Engineering (IOSR-JECE) ISSN: 78-834, ISBN: 78-8735. Volume 3, Issue 4 (Sep-Oct. 01), PP 01-05 An Improved Bernsen Algorithm Approaches For License Plate Recognition

More information

Lane Detection in Automotive

Lane Detection in Automotive Lane Detection in Automotive Contents Introduction... 2 Image Processing... 2 Reading an image... 3 RGB to Gray... 3 Mean and Gaussian filtering... 6 Defining our Region of Interest... 10 BirdsEyeView

More information

Anna University, Chennai B.E./B.TECH DEGREE EXAMINATION, MAY/JUNE 2013 Seventh Semester

Anna University, Chennai B.E./B.TECH DEGREE EXAMINATION, MAY/JUNE 2013 Seventh Semester www.vidyarthiplus.com Anna University, Chennai B.E./B.TECH DEGREE EXAMINATION, MAY/JUNE 2013 Seventh Semester Electronics and Communication Engineering EC 2029 / EC 708 DIGITAL IMAGE PROCESSING (Regulation

More information

Capturing and Editing Digital Images *

Capturing and Editing Digital Images * Digital Media The material in this handout is excerpted from Digital Media Curriculum Primer a work written by Dr. Yue-Ling Wong (ylwong@wfu.edu), Department of Computer Science and Department of Art,

More information

Introduction to Image Analysis with

Introduction to Image Analysis with Introduction to Image Analysis with PLEASE ENSURE FIJI IS INSTALLED CORRECTLY! WHAT DO WE HOPE TO ACHIEVE? Specifically, the workshop will cover the following topics: 1. Opening images with Bioformats

More information

Automated Detection of Early Lung Cancer and Tuberculosis Based on X- Ray Image Analysis

Automated Detection of Early Lung Cancer and Tuberculosis Based on X- Ray Image Analysis Proceedings of the 6th WSEAS International Conference on Signal, Speech and Image Processing, Lisbon, Portugal, September 22-24, 2006 110 Automated Detection of Early Lung Cancer and Tuberculosis Based

More information

Colour Profiling Using Multiple Colour Spaces

Colour Profiling Using Multiple Colour Spaces Colour Profiling Using Multiple Colour Spaces Nicola Duffy and Gerard Lacey Computer Vision and Robotics Group, Trinity College, Dublin.Ireland duffynn@cs.tcd.ie Abstract This paper presents an original

More information

Unit 4.4 Representing Images

Unit 4.4 Representing Images Unit 4.4 Representing Images Candidates should be able to: a) Explain the representation of an image as a series of pixels represented in binary b) Explain the need for metadata to be included in the file

More information

5/17/2009. Digitizing Color. Place Value in a Binary Number. Place Value in a Decimal Number. Place Value in a Binary Number

5/17/2009. Digitizing Color. Place Value in a Binary Number. Place Value in a Decimal Number. Place Value in a Binary Number Chapter 11: Light, Sound, Magic: Representing Multimedia Digitally Digitizing Color Fluency with Information Technology Third Edition by Lawrence Snyder RGB Colors: Binary Representation Giving the intensities

More information

Image Capture TOTALLAB

Image Capture TOTALLAB 1 Introduction In order for image analysis to be performed on a gel or Western blot, it must first be converted into digital data. Good image capture is critical to guarantee optimal performance of automated

More information

TDI2131 Digital Image Processing

TDI2131 Digital Image Processing TDI2131 Digital Image Processing Image Enhancement in Spatial Domain Lecture 3 John See Faculty of Information Technology Multimedia University Some portions of content adapted from Zhu Liu, AT&T Labs.

More information

Sampling Rate = Resolution Quantization Level = Color Depth = Bit Depth = Number of Colors

Sampling Rate = Resolution Quantization Level = Color Depth = Bit Depth = Number of Colors ITEC2110 FALL 2011 TEST 2 REVIEW Chapters 2-3: Images I. Concepts Graphics A. Bitmaps and Vector Representations Logical vs. Physical Pixels - Images are modeled internally as an array of pixel values

More information

1.Discuss the frequency domain techniques of image enhancement in detail.

1.Discuss the frequency domain techniques of image enhancement in detail. 1.Discuss the frequency domain techniques of image enhancement in detail. Enhancement In Frequency Domain: The frequency domain methods of image enhancement are based on convolution theorem. This is represented

More information

TECHNICAL DOCUMENTATION

TECHNICAL DOCUMENTATION TECHNICAL DOCUMENTATION NEED HELP? Call us on +44 (0) 121 231 3215 TABLE OF CONTENTS Document Control and Authority...3 Introduction...4 Camera Image Creation Pipeline...5 Photo Metadata...6 Sensor Identification

More information

IMAGE ENHANCEMENT IN SPATIAL DOMAIN

IMAGE ENHANCEMENT IN SPATIAL DOMAIN A First Course in Machine Vision IMAGE ENHANCEMENT IN SPATIAL DOMAIN By: Ehsan Khoramshahi Definitions The principal objective of enhancement is to process an image so that the result is more suitable

More information

The Camera Club. David Champion January 2011

The Camera Club. David Champion January 2011 The Camera Club B&W Negative Proccesing After Scanning. David Champion January 2011 That s how to scan a negative, now I will explain how to process the image using Photoshop CS5. To achieve a good scan

More information

Image and Video Processing

Image and Video Processing Image and Video Processing () Image Representation Dr. Miles Hansard miles.hansard@qmul.ac.uk Segmentation 2 Today s agenda Digital image representation Sampling Quantization Sub-sampling Pixel interpolation

More information

Digital Image Processing. Lecture # 3 Image Enhancement

Digital Image Processing. Lecture # 3 Image Enhancement Digital Image Processing Lecture # 3 Image Enhancement 1 Image Enhancement Image Enhancement 3 Image Enhancement 4 Image Enhancement Process an image so that the result is more suitable than the original

More information

Image compression with multipixels

Image compression with multipixels UE22 FEBRUARY 2016 1 Image compression with multipixels Alberto Isaac Barquín Murguía Abstract Digital images, depending on their quality, can take huge amounts of storage space and the number of imaging

More information

Digitizing Color. Place Value in a Decimal Number. Place Value in a Binary Number. Chapter 11: Light, Sound, Magic: Representing Multimedia Digitally

Digitizing Color. Place Value in a Decimal Number. Place Value in a Binary Number. Chapter 11: Light, Sound, Magic: Representing Multimedia Digitally Chapter 11: Light, Sound, Magic: Representing Multimedia Digitally Fluency with Information Technology Third Edition by Lawrence Snyder Digitizing Color RGB Colors: Binary Representation Giving the intensities

More information

Compression Method for Handwritten Document Images in Devnagri Script

Compression Method for Handwritten Document Images in Devnagri Script Compression Method for Handwritten Document Images in Devnagri Script Smita V. Khangar, Dr. Latesh G. Malik Department of Computer Science and Engineering, Nagpur University G.H. Raisoni College of Engineering,

More information

Transform. Processed original image. Processed transformed image. Inverse transform. Figure 2.1: Schema for transform processing

Transform. Processed original image. Processed transformed image. Inverse transform. Figure 2.1: Schema for transform processing Chapter 2 Point Processing 2.1 Introduction Any image processing operation transforms the grey values of the pixels. However, image processing operations may be divided into into three classes based on

More information

Automatic Electricity Meter Reading Based on Image Processing

Automatic Electricity Meter Reading Based on Image Processing Automatic Electricity Meter Reading Based on Image Processing Lamiaa A. Elrefaei *,+,1, Asrar Bajaber *,2, Sumayyah Natheir *,3, Nada AbuSanab *,4, Marwa Bazi *,5 * Computer Science Department Faculty

More information

Visible Light Communication-based Indoor Positioning with Mobile Devices

Visible Light Communication-based Indoor Positioning with Mobile Devices Visible Light Communication-based Indoor Positioning with Mobile Devices Author: Zsolczai Viktor Introduction With the spreading of high power LED lighting fixtures, there is a growing interest in communication

More information

Image Extraction using Image Mining Technique

Image Extraction using Image Mining Technique IOSR Journal of Engineering (IOSRJEN) e-issn: 2250-3021, p-issn: 2278-8719 Vol. 3, Issue 9 (September. 2013), V2 PP 36-42 Image Extraction using Image Mining Technique Prof. Samir Kumar Bandyopadhyay,

More information

Automatic Counterfeit Protection System Code Classification

Automatic Counterfeit Protection System Code Classification Automatic Counterfeit Protection System Code Classification Joost van Beusekom a,b, Marco Schreyer a, Thomas M. Breuel b a German Research Center for Artificial Intelligence (DFKI) GmbH D-67663 Kaiserslautern,

More information