An Hybrid MLP-SVM Handwritten Digit Recognizer

A. Bellili (1,2), M. Gilloux (2), P. Gallinari (1)

(1) LIP6, Université Pierre et Marie Curie, 4, Place Jussieu, 75252 Paris Cedex 05, France
(2) La Poste, 10, rue de l'Ile Mabon, BP 86334, 44263 Nantes Cedex 02, France

{Abdel.Bellili, Patrick.Gallinari}@lip6.fr, Michel.Gilloux@laposte.fr

Abstract

This paper presents an original hybrid MLP-SVM method for unconstrained handwritten digit recognition. Specialized Support Vector Machines (SVMs) are introduced to significantly improve the MLP performance in local areas around the separation surfaces between each pair of digit classes, in the input pattern space. This hybrid architecture is based on the idea that the correct digit class almost systematically belongs to the two maximum MLP outputs and that some pairs of digit classes constitute the majority of MLP substitutions (errors). Specialized local SVMs are introduced to detect the correct class among these two classification hypotheses. The hybrid MLP-SVM recognizer achieves a recognition rate of 98.01% on a real mail zipcode digit recognition task, a performance better than that of several classifiers reported in recent research.

1. Introduction

The recognition of handwritten characters has been an active research domain in recent years. This research has produced many classification algorithms that use either the raw pixel representation of the character or a feature vector representation. Most of these algorithms achieve good performance in terms of correct recognition rate. However, in critical real applications such as automatic bank check reading systems or zipcode recognition systems, errors are very expensive to correct. Two situations reduce the classification confidence and call for a good rejection mechanism: 1) patterns might be ambiguous or 2) patterns might be unrelated to the data used to train the classifier.
In this paper, we demonstrate the advantage of using Support Vector Machines (SVMs) [9] to improve the performance of an OCR (Optical Character Recognition) system based on an MLP neural network [1]. In Section 2, we justify the use of MLPs and explain their limitations in real OCR applications. In Section 3, we present a brief description of the support vector machine method. Section 4 details the idea of using SVMs in a hybrid combination architecture to improve the global performance of MLPs. In Section 5, we discuss the experiments which demonstrate the relevance of our hybrid architecture on a real zipcode character recognition task.

2. MLP Networks for OCR systems

The most commonly used family of neural networks for handwritten character recognition is the feed-forward network, which includes multilayer perceptron (MLP) [1] and Radial-Basis Function (RBF) [4] networks. MLP networks are widely used in handwritten character recognition systems because they are very easy to train and very fast to use in the classification decision process. This popularity is related to the use of the gradient back-propagation algorithm in the training process. MLPs generally achieve good performance in terms of correct recognition rate in handwritten character classification.

Unfortunately, there are limits to using MLPs in classification tasks. The first is that there is no theoretical relationship between the MLP structure (e.g. the number of hidden layers and the number of neurons per layer) and the classification task. The second is that the MLP derives separating hyperplanes, in the feature representation space, which are not optimal in terms of the margin (for the notion of margin, see [2]) between the examples of two different classes. When classifying an unlabelled pattern located in this margin area, MLPs often make erroneous classification decisions with a high confidence level.
This type of classification error is very hard to avoid by using a rejection mechanism.

In recent years, in order to achieve an optimal recognition rate, much research has led to classification systems that use different methods for combining multiple classifiers [6]. The idea is to compensate for the weakness of one classifier, in a given local area of the feature space, by the strength of the other classifiers once they are correctly optimized. The combination method can use Local Accuracy Estimates [11], Local Learning Algorithms [3], Adaptive Mixtures of Local Experts [5], or aggregate the decisions obtained from individual classifiers to derive the best final decisions from a statistical point of view [7]. The disadvantage of most of these methods is the complexity of optimizing each classifier and the definition of the local area in terms of K-nearest neighbours, which requires storing all the training examples in the system memory. These constraints are prohibitive in real character recognition systems, where some training character sets can contain one million characters.

The idea of our method is founded on the observation that, when an MLP is used as a handwritten digit recognition system, the correct class is almost always one of the two maximum outputs of the MLP. Table 1 summarizes the presence rate of the correct class among the k largest MLP outputs (MLP training set $Set_1$ = 44081 real digit characters; MLP test set $Set_2$ = 44075 real digit characters).

k-th maximum MLP outputs    presence rate (%)
1                           97.45
2                           99.00
3                           99.42
4                           99.66
5                           99.74

Table 1. Presence rate of the correct class in the k-th maximum MLP outputs.

The second observation is that some pairs of classes account for the majority of the errors (confusions) made by the MLP (e.g. (7,1) or (9,3)). In order to improve the performance of the MLP, our approach consists in introducing support vector machines (SVMs) [8, 9] to detect the correct class among the first two maxima provided by the output layer of the MLP, for certain pairs of classes. In the ideal case, i.e. if the introduced SVMs can always decide the correct class among the two maximum MLP outputs, the combination of the MLP and the SVMs would achieve a correct recognition rate of 99.00%, the presence rate for k = 2 in Table 1.

3. Support Vector Machines

One of the most important recent advances in classifier design is the introduction of the support vector machine classifier by V. Vapnik [8, 9]. The idea consists in mapping the space of the input examples into a high-dimensional (possibly infinite-dimensional) feature space. By choosing an adequate mapping, the input examples become linearly or almost linearly separable in the high-dimensional space [9]. The SVM is primarily a two-class classifier for which the optimization criterion is the width of the margin between the classes, i.e. the area around the decision surface defined by the distance to the nearest training examples in the feature space. These examples, called support vectors, define the classification function of the support vector machine. The optimization of a support vector machine consists in minimizing the number of support vectors by maximizing the margin between the two classes.

The decision function derived by the SVM classifier for a two-class problem can be formulated, using a kernel function $K(x_i, x)$ of a new example $x$ (to classify) and a training example $x_i$, as follows:

$$f(x) = \sum_{i \in SV} \alpha_i \, y_i \, K(x_i, x) + \alpha_0 \qquad (1)$$

where $SV$ is the support vector set (a subset of the training set) and $y_i = \pm 1$ is the label of example $x_i$. The parameters $\alpha_i \geq 0$ are optimized during the training process.

There are many kernel functions $K$. The simplest one is the dot product between the input pattern to classify $x$ and a member $x_i$ of the support vector set, $K(x_i, x) = (x_i \cdot x)$, which yields a linear classifier. Nonlinear SVM classifiers, such as the Gaussian radial basis function SVM or the polynomial SVM, are obtained with the RBF kernel $K(x_i, x) = \exp(-\|x_i - x\|^2 / \sigma^2)$ or the $p$-th order polynomial kernel $K(x_i, x) = ((x_i \cdot x) + 1)^p$. Support vector machine classifiers are increasingly used, either as single classifiers or combined with other types of classifiers, in character recognition systems [10].

4. A hybrid MLP-SVM combined architecture

The idea of our hybrid MLP-SVM combination method is motivated by the fact that, in handwritten digit recognition, the MLP can achieve very good performance in terms of correct recognition rate if we consider the two maximum outputs (i.e. the correct class is almost systematically the first or second maximum MLP output). This observation motivates the search for a suitable method which can detect the correct class among these two maximum MLP outputs with the maximum confidence level in the classification decision. Once the MLP decision is made, the problem is to choose the right class among two classification hypotheses. This choice is a two-class, or binary, problem.
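
The presence rates reported in Table 1 can be measured with a few lines of code. The following is a minimal sketch, not the authors' implementation: the function name `top_k_presence_rate` and the toy three-class data are hypothetical, and each MLP output is assumed to be a plain list of per-class scores.

```python
from typing import Sequence


def top_k_presence_rate(outputs: Sequence[Sequence[float]],
                        labels: Sequence[int],
                        k: int) -> float:
    """Percentage of patterns whose true label is among the k largest outputs."""
    if len(outputs) != len(labels) or not labels:
        raise ValueError("outputs and labels must be non-empty and equal in length")
    hits = 0
    for scores, label in zip(outputs, labels):
        # Class indices sorted by decreasing output value.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy data (hypothetical): three classes, four patterns.
toy_outputs = [
    [0.90, 0.05, 0.05],  # true class 0 is the first maximum
    [0.60, 0.30, 0.10],  # true class 1 is the second maximum
    [0.50, 0.40, 0.10],  # true class 2 is the third maximum
    [0.10, 0.20, 0.70],  # true class 2 is the first maximum
]
toy_labels = [0, 1, 2, 2]

for k in (1, 2, 3):
    print(k, top_k_presence_rate(toy_outputs, toy_labels, k))
```

The gap between the rates for k = 1 and k = 2 is the share of patterns that a perfect pairwise arbiter could recover, which is what motivates the binary formulation above.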

One of the most effective ways to solve a binary classification problem, with the maximum confidence in the decision, is to introduce support vector machines. This combination method results in the specialization of SVMs in the local area around the separation surface between each pair of the ten digit classes. However, this method can seem very tedious because it needs an SVM for each pair of classes (45 SVMs for the ten digit classes). The second originality of our method is therefore to introduce SVMs only for the pairs of classes which account for the majority of MLP errors (substitutions or confusions). Figure 1 shows substitutions made by the MLP for two pairs of digit characters, (9,3) and (7,1): for these four examples, the correct class is the second maximum output of the MLP.

Figure 1. Examples of (9,3) and (7,1) MLP substitutions

4.1. Hybrid MLP-SVM architecture

The design and validation of this hybrid architecture needs three (3) different digit sets, denoted $Set_1$, $Set_2$ and $Set_3$. $Label(x)$ and $Class(x)$ denote respectively the label of the digit pattern $x$ and the class assigned to the digit pattern $x$ by the hybrid MLP-SVM recognizer. The first and second maxima of the MLP outputs are denoted $MLP_{max1}$ and $MLP_{max2}$.

a) Training process

- Train an optimized MLP with $Set_1$ and determine the pairs $(i,j)$ of digit classes causing the majority of MLP substitutions (i.e. the pairs of classes for which the sum of the $(i,j)$ and $(j,i)$ MLP substitution rates given in Table 2 reaches a fixed threshold rate). The set containing all these pairs is denoted $SMSubPair$.
- For each pair $(i,j)$ in $SMSubPair$, extract from $Set_2$ all the patterns $x$ ($Label(x) = i$ or $Label(x) = j$) for which ($MLP_{max1}(x) = i$ and $MLP_{max2}(x) = j$) or ($MLP_{max1}(x) = j$ and $MLP_{max2}(x) = i$). The resulting subset is denoted $SubSet(i,j)$. Obviously $SubSet(i,j) = SubSet(j,i)$.
- For each pair $(i,j)$ in $SMSubPair$, train and optimize a support vector machine $SVM_{(i,j)}$ using the $SubSet(i,j)$ examples.
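
The data preparation behind the training steps above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the helper names (`top_two`, `substitution_rates`, `select_pairs`, `extract_subset`), the threshold value and the toy data are all hypothetical. Two assumptions are made explicit in the comments: substitution rates are normalized per true class, which is how the rows of Table 2 appear to be built, and the smaller class index of a pair receives the SVM target +1.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[int, int]


def top_two(scores: Sequence[float]) -> Tuple[int, int]:
    """Indices of the first and second maximum outputs."""
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return ranked[0], ranked[1]


def substitution_rates(outputs, labels) -> Dict[Pair, float]:
    """rates[(i, j)]: % of the errors on true class i that were assigned to j.

    Assumption: rates are normalized per true class (one row per class).
    """
    counts, errors_per_class = Counter(), Counter()
    for scores, label in zip(outputs, labels):
        predicted = top_two(scores)[0]
        if predicted != label:
            counts[(label, predicted)] += 1
            errors_per_class[label] += 1
    return {(i, j): 100.0 * n / errors_per_class[i]
            for (i, j), n in counts.items()}


def select_pairs(rates: Dict[Pair, float], threshold: float) -> List[Pair]:
    """Unordered pairs whose (i,j) + (j,i) rates reach the threshold."""
    selected = set()
    for i, j in rates:
        a, b = min(i, j), max(i, j)
        if rates.get((a, b), 0.0) + rates.get((b, a), 0.0) >= threshold:
            selected.add((a, b))
    return sorted(selected)


def extract_subset(patterns, outputs, labels, pair: Pair):
    """Patterns labelled i or j whose two largest outputs are exactly {i, j}.

    Assumption: class i (the smaller index) gets target +1, class j gets -1.
    """
    i, j = pair
    subset = []
    for x, scores, label in zip(patterns, outputs, labels):
        if label in (i, j) and set(top_two(scores)) == {i, j}:
            subset.append((x, 1 if label == i else -1))
    return subset


# Toy data (hypothetical): three classes, five patterns named by a letter.
patterns = ["a", "b", "c", "d", "e"]
outputs = [[0.6, 0.3, 0.1], [0.3, 0.6, 0.1], [0.7, 0.2, 0.1],
           [0.1, 0.3, 0.6], [0.1, 0.2, 0.7]]
labels = [1, 0, 0, 1, 2]

rates = substitution_rates(outputs, labels)
pairs = select_pairs(rates, threshold=100.0)
print(pairs)
print(extract_subset(patterns, outputs, labels, pairs[0]))
```

Because a pair is stored with its smaller index first, a single subset (and therefore a single SVM) serves both orderings $(i,j)$ and $(j,i)$.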
Training $SVM_{(i,j)}$ is a two-class classification problem (for example, $i$ is associated with the binary SVM label $+1$ and $j$ with the other label $-1$). Since $SubSet(i,j) = SubSet(j,i)$, there is only one SVM classifier $SVM_{(i,j)}$ for the two pairs of classes $(i,j)$ and $(j,i)$.

b) Decision process

For an unknown (unlabelled) digit pattern $x$ in the test-validation set $Set_3$:

- Compute the MLP outputs for the input pattern $x$.
- If ($MLP_{max1}(x) = i$ and $MLP_{max2}(x) = j$) and the pair $(i,j)$ belongs to $SMSubPair$, then $Class(x) = SVM_{(i,j)}(x)$, i.e. $i$ if the distance returned by the SVM is positive (SVM label $+1$) or $j$ otherwise (negative distance returned).
- Else $Class(x) = MLP_{max1}(x)$.

5. Experiments

All the digit image sets used in our experiments contain only segmented characters from real mail (postal) zipcodes. For the validation of our approach we used a one-hidden-layer MLP with an input layer of 138 nodes (the dimension of the feature vector representation), trained with a digit character set ($Set_1$) of 44081 digits. The inter-class substitutions (errors) for the digit character set ($Set_2$ = 44081 digits) are summarized in Table 2 as percentages for each pair of digit classes. The SVMs derived for some pairs of classes (e.g. (7,1), (9,3), (6,0), etc.), which account for the majority of the MLP substitutions as shown in Table 2, were implemented with the SVM-LIGHT software provided by T. Joachims. SVMs with different kernel functions (linear, polynomial and RBF) were used, and the best performance was obtained with the RBF kernel SVMs.

We tested the performance of our hybrid MLP-SVM handwritten digit recognizer on the digit image set $Set_3$ = 44075 patterns. Table 3 summarizes the performance, without any rejection, in terms of the global correct recognition rate of our hybrid MLP-SVM recognizer (Reco.1) and the theoretical recognition rate (Th. Reco.) of the same hybrid MLP-SVM recognizer if the introduced specialized SVMs were able to systematically detect the correct class among the two maximum MLP output classification hypotheses.
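
The decision process of Section 4.1, combined with the decision function of Eq. (1) and an RBF kernel (the kernel reported to perform best here), can be sketched as follows. This is a minimal illustration under stated assumptions rather than the system used in the experiments: the names `rbf_kernel`, `svm_decision`, `hybrid_classify` and `svm_1_7`, the kernel form $\exp(-\|u - v\|^2 / \sigma^2)$, the hand-picked support vectors and the one-dimensional toy patterns are all hypothetical, and the smaller class index of a pair is assumed to carry the SVM label +1.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[int, int]


def rbf_kernel(u: Vector, v: Vector, sigma: float) -> float:
    """Gaussian RBF kernel, assumed form exp(-||u - v||^2 / sigma^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / sigma ** 2)


def svm_decision(x: Vector, support_vectors: Sequence[Vector],
                 alphas: Sequence[float], ys: Sequence[int],
                 alpha0: float, sigma: float) -> float:
    """Eq. (1): f(x) = sum_i alpha_i * y_i * K(x_i, x) + alpha_0."""
    total = alpha0
    for x_i, alpha_i, y_i in zip(support_vectors, alphas, ys):
        total += alpha_i * y_i * rbf_kernel(x_i, x, sigma)
    return total


def hybrid_classify(x: Vector, mlp_outputs: Sequence[float],
                    pair_svms: Dict[Pair, Callable[[Vector], float]]) -> int:
    """Return the SVM's choice if the top-two pair has an SVM, else MLP max1."""
    ranked = sorted(range(len(mlp_outputs)),
                    key=lambda c: mlp_outputs[c], reverse=True)
    first, second = ranked[0], ranked[1]
    i, j = min(first, second), max(first, second)
    svm = pair_svms.get((i, j))
    if svm is None:
        return first               # no specialized SVM: keep the MLP decision
    return i if svm(x) > 0 else j  # positive distance -> class i, else class j


def svm_1_7(x: Vector) -> float:
    """Toy SVM for the pair (1, 7) with two hand-picked support vectors."""
    return svm_decision(x, [[0.0], [1.0]], [1.0, 1.0], [+1, -1], 0.0, 1.0)


pair_svms = {(1, 7): svm_1_7}

outputs_7_then_1 = [0.0] * 10
outputs_7_then_1[7], outputs_7_then_1[1] = 0.55, 0.40
outputs_3_then_8 = [0.0] * 10
outputs_3_then_8[3], outputs_3_then_8[8] = 0.60, 0.30

print(hybrid_classify([0.2], outputs_7_then_1, pair_svms))  # SVM overrides MLP
print(hybrid_classify([0.9], outputs_7_then_1, pair_svms))  # SVM confirms MLP
print(hybrid_classify([0.2], outputs_3_then_8, pair_svms))  # pair has no SVM
```

In the worst case a pattern costs one MLP evaluation plus one SVM evaluation, since at most one pairwise SVM is consulted per decision.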

Substit(i,j) (%)
i \ j       0      1      2      3      4      5      6      7      8      9
0           -  15.44   6.61   6.61   3.67   5.14  13.23   0.73   2.94  45.68
1        2.29      -   9.16  10.68  20.23  12.21   2.67  37.78   4.19   3.05
2       12.29  13.11      -   6.55   5.73   0.00   0.82  41.80  11.88   7.78
3        9.09   6.91   3.27      -   0.36  13.19   0.00  23.63   7.27  35.63
4       15.75  27.88  10.91   3.63      -   3.03   9.09  13.94   0.00  15.75
5       14.18  17.91   0.67  21.64   4.47      -  15.67  14.92   2.98   7.46
6       32.80   6.40   1.60   5.60   8.00  36.80      -   0.00   8.80   0.00
7        3.43  38.62  11.58  23.60  11.16   0.85   0.00      -   2.75   8.15
8       27.53   6.52  10.14  25.00   2.17  12.68   3.98   4.71      -   7.24
9       18.47   6.65   7.39  35.71   3.49   8.12   0.00  11.33   8.86      -

Table 2. Inter-class substitutions made by the MLP on $Set_2$

Classifier               Reco.1 (%)   Th. Reco. (%)
MLP                      97.45        99.00
(MLP + 5 local SVMs)     97.71        98.10
(MLP + 10 local SVMs)    97.90        98.41
(MLP + 15 local SVMs)    98.01        98.60

Table 3. Performances of the hybrid MLP-SVM recognizer

These results show that our hybrid MLP-SVM recognizer significantly improves the performance, in terms of recognition rate and error rate, on a real zipcode digit classification task. The performance can seem modest in comparison with the optimal limit (99.00%) because we have used only fifteen local SVMs among the forty-five possible local SVMs, and because the test set $Set_3$ contains many digit patterns which are impossible to recognize (for example broken digit patterns), even for a human (it is impossible to distinguish some 7 patterns from some 1 patterns out of context).

5.1. Conclusions

In this paper, a hybrid MLP-SVM handwritten digit recognition method is introduced. The method takes advantage of the simplicity of use and the good classification performance of MLP networks and compensates for their weakness, in terms of non-optimal separation surfaces between classes, by introducing specialized support vector machines. These SVMs improve the performance in local areas around the class separation surfaces in the input pattern space.
The originality of our approach consists in introducing SVMs only for the pairs of classes which account for the majority of the MLP network substitutions (errors). To classify an unknown pattern, the system makes one MLP decision and, in the worst case, one SVM decision (i.e. if the first and second maxima of the MLP outputs belong to the pairs of classes causing errors). With reasonable SVM optimizations, our hybrid MLP-SVM recognizer improves the recognition performance, on a real handwritten digit classification task, in comparison with an MLP recognizer.

Our current research aims to take advantage of the SVM's margin definition to introduce a reject mechanism, in order to reduce the error rate of the hybrid MLP-SVM recognizer more significantly. Another way to improve the performance of the hybrid MLP-SVM recognizer consists in developing a theory that allows us to create SVM kernels that enforce desirable invariances for digit patterns.

References

[1] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, UK, 1995.
[2] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proc. Fifth Annual Workshop on Comput. Learn. Theory, Pittsburgh, USA, July 1992.
[3] L. Bottou and V. Vapnik. Local learning algorithms. Neural Computation, 4(6):888-900, 1992.
[4] S. Haykin. Neural Networks, A Comprehensive Foundation. Macmillan Publishing Company, London, UK, 1994.
[5] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79-87, 1991.
[6] J. Kittler. Combining classifiers: A theoretical framework. Pattern Analysis and Applic., 1(1):18-27, 1998.
[7] C. Y. Suen and Y. S. Huang. A method of combining multiple experts for the recognition of unconstrained handwritten numerals. IEEE Trans. on PAMI, 17(1):90-94, January 1995.
[8] V. N. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, USA, 1998.
[9] V. N. Vapnik. An overview of statistical learning theory. IEEE Trans. on Neural Networks, 10(5):988-999, 1999.

[10] L. Vuurpijl and L. Schomaker. Two-stage character classification: A combined approach of clustering and support vector classifiers. In Proc. Seventh International Workshop on Frontiers in Handwriting Recognition, Amsterdam, Netherlands, September 2000.
[11] K. Woods, W. P. Kegelmeyer Jr., and K. Bowyer. Combination of multiple classifiers using local accuracy estimates. IEEE Trans. on PAMI, 19(4):405-410, 1997.