Chairpersons, ladies, and gentlemen, it is indeed a great honor for me to be given this opportunity to address you on the occasion of this special meeting. My English name is Sean Zhu, and I come from Wuhan, Hubei Province, China. I am a teacher at the College of Computer Science, South Central University for Nationalities. I am one of the founders of the Information Processing Institute for Minority Language, which started to operate in January 2007. My lecture today, entitled Printed Yi Character Recognition, comes from our institute's first project, which began in January 2007.

This is the organization of my speech. First, in the introduction, I would like to explain why we from SCUN are here. Then I will talk about the project entitled Printed Yi Character Recognition, starting with our achievements and results, followed by some key technical points. During this part I will demonstrate our software so that you can understand it more clearly. Last, I will spend some time explaining our future plans. We welcome your comments at any time.

The first question: why did we come here? We came here because we have three papers that have been accepted for presentation in three different sessions, and we also have a poster on exhibit in the Science Hall of Yunnan University.

"Using Information Technology to Salvage and Protect Nü Shu" is the title of the paper accepted for presentation by the session on Issues of Language Endangerment. Nü Shu is a unique script popular among women, and it has mainly been handed down for local women's use in several townships of Jiangyong and its neighboring counties in south China's Hunan Province. Invented by women and used by women, these italic characters were not understood by the outside world until they were discovered by several teachers from our university (SCUN) in the early 1980s. This paper looks at the process of bringing Nü Shu into the computer from several perspectives: the design of Nü Shu's character set, the design of Nü Shu's input method, and the design of Nü Shu's website and virtual steles.

"Research on Radical Input Method of Nü Shu" is the title of the paper accepted for presentation by the session on Diversity of National Culture of Central and Southwest China and Protection of Non-material Culture. This paper discusses the design details of the input method for Nü Shu.

"The Research of Printed Yi Character Recognition" has been accepted for presentation in today's session. The Yi language is used by the Yi ethnic group, which has more than 6,000,000 people living in Sichuan and Yunnan provinces. We have invented a multi-font printed Yi character recognition technology, which provides an effective method for importing the Yi language into the computer.

This page shows the relationship among these three papers. Generally speaking, our input material is minority literature. If a minority language has a character set, like the Yi language, we can use OCR technology to digitize the material. Input tools help to correct wrong OCR results. If the minority language does not have a character set, like Nü Shu, we design a character set for it and then design an input method for its characters, so that we can obtain digital literature. Then we build a website and virtual steles to spread this literature and this minority culture to the rest of the world. To state it briefly, we are trying to find an integrated solution for minority literature.

Our poster gives more detail on what I have mentioned and presents some ideas about our future work. Its number is P001. We have also prepared some publicity materials and a trial CD. If you are interested in what we have done, please contact us to get the publicity materials and the trial CD.

While engaged in this research work, we gradually found our institute's two goals. The first one is: by cooperating with linguists and anthropologists, we devote ourselves to using information technology to protect the achievements of civilization among all minorities in China, so as to carry on the genes of Chinese culture!

The second goal is: taking the spreading of knowledge of minority civilizations as our application background, we are engaged in scientific research on Pattern Recognition, Computer Vision, Machine Learning, Artificial Intelligence, and Virtual Reality. So now I guess you may understand more about why we came here. First, we want to find people who, like us, are committed to minority cultures, and to exchange technology and skills on how to protect minority culture and disseminate knowledge about it. Second, we hope to find some new research opportunities through cooperative efforts with such people.

Now we come to the second part of this presentation: how we became involved with Printed Yi Character Recognition.

You may ask: why did we start this project? First, we believe that applying optical character recognition (OCR) technology to minority languages is an important way to save a minority language, helping to preserve historical information by recording a minority culture's core ideas. Second, since the introduction of the Sichuan standardized Yi language, which is based on the language of the Liangshan Yi ethnic group, standardized Yi has been widely used and promoted in Liangshan and in Ninglang, Yunnan, where the Yi ethnic minorities live. In schools, in literary and artistic creation, in translation and broadcasting, in film, television, and music production, and in communicating government policies and regulations, the use of this language has had a positive effect in promoting the economy and strengthening Yi culture. Third, in China, OCR technology has mainly been applied to northern minority languages, such as Mongolian, Tibetan, Uigur, Kazakh, Korean, and Khalkhas. Southern minority languages, like Yi, have seldom been researched with these techniques.

Before we jump to the details of the technology used, I would like to demonstrate the software. We have designed two software programs. The first one is entitled Character Recognition Experimental Platform. This platform is mainly used to build recognition dictionaries and test algorithms. The second one is entitled Minority Language Master. Here I would like to show you the second program. It is divided into three windows. The upper one shows the original picture of the text. The third one shows the recognition results. The narrow box in the middle shows the candidates. No OCR system can achieve a 100 percent correct rate, especially when the picture's quality is very poor. So we use red to mark the results that we think are more likely to be wrong. This way the proofreader can quickly check for correctness. After checking, the result can be exported to a Word file with a click of the mouse. We are applying for a China patent, and we hold the software copyright for Minority Language Master. This software is included in the CD we prepared for this congress. We hope you can try it out and give us your feedback.

Having shown you an example of the Minority Language Master, let us look at some statistical results from the Character Recognition Experimental Platform.

This is a single-file statistical result. Each file has 1165 characters. This result shows which characters were recognized wrongly. The red one is the wrong result, and the green one is the correct character. The single-file result tells us which similar characters are likely to be confused with each other, allowing us to set rules to prevent this from happening.

This is a multi-file statistical result. From it we can see each sample file's first-candidate recognition rate and ten-candidate recognition rate.
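The two rates on this slide can be computed roughly as follows. This is only a minimal sketch: the function name `recognition_rates` and the `(true_character, ranked_candidates)` data layout are assumptions made for illustration, not the platform's actual interface.

```python
def recognition_rates(results, k=10):
    """Return (first-candidate rate, top-k candidate rate) for one sample file.

    results: list of (true_char, candidates) pairs, where candidates is the
    recognizer's ranked list, best guess first (hypothetical layout).
    """
    total = len(results)
    if total == 0:
        return 0.0, 0.0
    # A first-candidate hit means the best guess is the true character.
    first_hits = sum(1 for truth, cands in results if cands and cands[0] == truth)
    # A top-k hit means the true character appears among the first k candidates.
    topk_hits = sum(1 for truth, cands in results if truth in cands[:k])
    return first_hits / total, topk_hits / total


sample_file = [
    ("a", ["a", "b"]),   # correct at first rank
    ("b", ["a", "b"]),   # correct only among the candidates
    ("c", ["x", "y"]),   # missed entirely
    ("d", ["d"]),        # correct at first rank
]
first_rate, ten_rate = recognition_rates(sample_file)
```

The ten-candidate rate is never lower than the first-candidate rate, which is why it is useful as an upper bound on what proofreading with the candidate box can recover.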

We also counted the recognition rate font by font. We have tested all the Yi-language fonts that we could find up to this point.

This is a multi-character statistical result, and from it we can see each character's recognition rate. Here we used 743 sample files to test for consistency of form, which means that each character is represented by 743 samples. A green character has never been recognized wrongly. Other characters have two numbers. The upper number shows the number of first-candidate recognition failures, while the other number shows the number of ten-candidate recognition failures. At the bottom of this report form, you can see the total number of characters examined, the totals of the two kinds of recognition failures, and the total time used for this operation.

Now let us come to some key technical points of this platform.

What follows is the research procedure for developing a recognition system for printed Yi characters. First we collect Yi character samples, then we separate the characters one by one to build a sample library. Each character's features are extracted and compressed to build a recognition dictionary. A feature-matching algorithm is used to find the most likely character and the ten candidates. All these algorithms are included in an experimental system we call the Character Recognition Experimental Platform. At the end, a practical software program is designed. It is the Minority Language Master.

The principle of character segmentation is shown in this slide. The paper Yi-language text was scanned into a computer as grayscale pictures. With an appropriate threshold value, those pictures were converted into binary picture files of 0s and 1s. Through image processing, text blocks were separated from the pictures. Once we obtained the text blocks, we first performed line segmentation based on the lines' projection. This way we are able to find and record each line's division information. The next task was single-character segmentation. A segmented text line is pre-divided character by character according to its column projection. This way we can determine the outline of each pre-divided character. We then summarized a series of merging rules according to the characteristics of Yi characters, which helped us merge parts that had been wrongly divided into several characters back into one character. We also summed up a series of anti-merger rules according to the characteristics of both Yi characters and the common characters, for dividing wrongly combined characters again.
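The projection steps described above can be sketched as follows. This is a simplified illustration, not our platform's code: the image is assumed to be a list of rows of gray values from 0 to 255, the threshold of 128 is arbitrary, and `merge_narrow` is a hypothetical stand-in for the Yi-specific merging rules, which are more detailed than a width and gap test.

```python
def binarize(gray, threshold=128):
    """Map gray values (0-255) to 1 for ink (dark) and 0 for background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def runs(profile):
    """Return (start, end) spans, end exclusive, where the profile is non-zero."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment_lines(binary):
    """Line segmentation: project ink onto the vertical axis (one sum per row)."""
    return runs([sum(row) for row in binary])


def segment_chars(binary, line):
    """Pre-division of one text line: project its ink onto the horizontal axis."""
    top, bottom = line
    width = len(binary[0])
    profile = [sum(binary[r][c] for r in range(top, bottom)) for c in range(width)]
    return runs(profile)


def merge_narrow(spans, min_width, max_gap):
    """Hypothetical merging rule: attach a narrow piece to its left neighbor
    when the gap between them is small."""
    merged = []
    for start, end in spans:
        if merged and start - merged[-1][1] <= max_gap and end - start < min_width:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


W, B = 255, 0  # white background, black ink
page = [
    [W, W, W, W, W, W, W],
    [B, B, W, W, B, W, B],
    [B, B, W, W, B, W, B],
    [W, W, W, W, W, W, W],
    [W, B, B, W, W, W, W],
    [W, W, W, W, W, W, W],
]
binary = binarize(page)
lines = segment_lines(binary)
pieces = segment_chars(binary, lines[0])
chars = merge_narrow(pieces, min_width=2, max_gap=1)
```

In this toy page the first line is pre-divided into three pieces, and the merging rule joins the last two into one character, which is the kind of correction needed for characters whose strokes do not touch.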

We used a feature extraction algorithm based on peripheral direction contribution (PDC). The definition of direction contribution is: take P as the central point and calculate the relative relationship of the numbers of consecutive black image elements in eight directions. Then we have an eight-dimensional feature vector with components d_i (i = 1, 2, ..., 8) (1). In this formula, i represents the direction, and l_i represents the number of consecutive black image elements in direction i. [d_1, d_2, ..., d_8] is the feature vector of point P's eight-direction contribution. If the two opposite directions in each pair, that is 1 and 2, 3 and 4, 5 and 8, 6 and 7, are combined into one, then formulas (2-1), (2-2), (2-3), and (2-4) give the feature vector of point P's four-direction contribution.
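As an illustration of direction contribution, here is a minimal sketch. Several details are assumptions and may differ from formulas (1) and (2-1) to (2-4) in our paper: each run length l_i is divided by the Euclidean norm of all the run lengths, which is a commonly used normalization; the run in each direction counts the point P itself; the layout of the eight directions is chosen only so that the opposite pairs match 1 and 2, 3 and 4, 5 and 8, 6 and 7; and the four-direction vector is obtained by adding the two run lengths of each pair before normalizing.

```python
import math

# Directions 1..8 as (row step, column step). The layout is an assumption,
# chosen so that the opposite pairs are 1-2, 3-4, 5-8 and 6-7.
DIRECTIONS = [(0, 1), (0, -1), (-1, 0), (1, 0), (-1, 1), (-1, -1), (1, 1), (1, -1)]
OPPOSITE_PAIRS = [(0, 1), (2, 3), (4, 7), (5, 6)]  # zero-based indices


def run_length(binary, r, c, dr, dc):
    """Count consecutive ink pixels from (r, c), inclusive, stepping by (dr, dc)."""
    n = 0
    while 0 <= r < len(binary) and 0 <= c < len(binary[0]) and binary[r][c]:
        n += 1
        r += dr
        c += dc
    return n


def normalize(lengths):
    """Divide by the Euclidean norm (assumed normalization); zeros stay zeros."""
    norm = math.sqrt(sum(l * l for l in lengths))
    return [l / norm if norm else 0.0 for l in lengths]


def contribution8(binary, r, c):
    """Eight-direction contribution [d_1, ..., d_8] of the point P = (r, c)."""
    return normalize([run_length(binary, r, c, dr, dc) for dr, dc in DIRECTIONS])


def contribution4(binary, r, c):
    """Four-direction contribution: opposite run lengths are added first."""
    lengths = [run_length(binary, r, c, dr, dc) for dr, dc in DIRECTIONS]
    return normalize([lengths[a] + lengths[b] for a, b in OPPOSITE_PAIRS])


stroke = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],   # a horizontal stroke
    [0, 0, 0, 0, 0],
]
d8 = contribution8(stroke, 1, 2)
d4 = contribution4(stroke, 1, 2)
```

For a point in the middle of a horizontal stroke, the horizontal components dominate both vectors, so the feature describes the local stroke direction independently of the stroke's absolute length.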

We searched for the border points of every character from eight directions, then took each border point as the central point and calculated its direction contribution. In order to reduce the number of feature dimensions, we divide one character into eight parallel regions along each searching direction. In each region, every border point's four-direction PDC was calculated along the searching direction, and then an average vector was recorded as the PDC vector of this searching direction in this region. 8 searching directions × 8 divided regions × 4 layer depths × 4 direction contributions = 1024 dimensions (4)
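The region averaging and the dimension count can be sketched as follows. The function `average_by_region` and its input layout are hypothetical: each border point is given as a position across the scan (for example its row index when scanning left to right) together with its four-direction contribution vector, and regions that contain no border point are filled with zeros, which is an assumption.

```python
SEARCH_DIRECTIONS = 8
REGIONS = 8
LAYER_DEPTHS = 4
CONTRIBUTIONS = 4
FEATURE_DIM = SEARCH_DIRECTIONS * REGIONS * LAYER_DEPTHS * CONTRIBUTIONS


def average_by_region(border_vectors, extent, regions=REGIONS, dims=CONTRIBUTIONS):
    """Average the vectors of the border points that fall into each region.

    border_vectors: list of (position, vector) with 0 <= position < extent.
    Returns one averaged vector per region; empty regions give zero vectors.
    """
    sums = [[0.0] * dims for _ in range(regions)]
    counts = [0] * regions
    for position, vector in border_vectors:
        k = min(position * regions // extent, regions - 1)  # region index
        counts[k] += 1
        for j, value in enumerate(vector):
            sums[k][j] += value
    return [
        [s / counts[k] if counts[k] else 0.0 for s in sums[k]]
        for k in range(regions)
    ]


# Three border points found on one scan of a 16-pixel-high character.
points = [(0, [1, 0, 0, 0]), (1, [0, 1, 0, 0]), (15, [0, 0, 0, 1])]
averaged = average_by_region(points, extent=16)
```

Repeating this for every searching direction and layer depth and concatenating the results gives the full feature vector of `FEATURE_DIM` values.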

This paper uses the Karhunen-Loève transformation, or KL transformation for short, to compress the 1024 PDC features of each character to 128 dimensions. Sample matching means calculating the weighted distances between each sample character's compressed features and each dictionary character's compressed features, and then taking the minimum as the matching criterion. Three levels of distance are calculated in this paper to realize hierarchical matching. You can find the explanation of these expressions in our paper.
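The matching step can be illustrated with a single-level sketch. It assumes the features have already been compressed by the KL transformation, uses a weighted Euclidean distance as a stand-in for the distances defined in our paper, and does not reproduce the three-level hierarchy. The names `weighted_distance` and `rank_candidates` and the tiny two-dimensional dictionary are made up for the example.

```python
import math


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two compressed feature vectors."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def rank_candidates(sample, dictionary, weights, k=10):
    """Return the k dictionary labels closest to the sample, nearest first.

    dictionary: mapping from character label to its compressed feature vector.
    The first label is the recognition result; the rest are the candidates.
    """
    scored = sorted(
        (weighted_distance(sample, feature, weights), label)
        for label, feature in dictionary.items()
    )
    return [label for _, label in scored[:k]]


# Toy dictionary with 2-dimensional features instead of 128.
dictionary = {"A": [0.0, 0.0], "B": [1.0, 0.0], "C": [0.0, 3.0]}
candidates = rank_candidates([0.9, 0.1], dictionary, weights=[1.0, 1.0])
```

The ranked list is what fills the candidate box in the Minority Language Master: the first entry is shown as the result and the others are offered to the proofreader.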

Now we will take two minutes to show in full detail what we have mentioned above. To save time, we use 24 sample files with 128 common characters as an example. Common characters include the English alphabet, punctuation marks, and numbers. While we are waiting for the result, are there any questions about what I have said?

Next I would like to discuss one of our future projects.

In the coming years, we will try to design a device named the Minority Literatures Automated Importing Machine. Computer vision and machine learning methods will be used in this device.

Our idea is shown here: we will design a digitization pipeline, which takes literature as input and outputs computer files. There are two key features in this system. 1. Each character must be segmented correctly from the original material, which depends on computer vision. 2. The system should judge whether a character is in the character set or not, and if not, the character should be added to the character set automatically. This involves machine learning.
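The second feature could work along the following lines. This is only a sketch of one possible approach, not a design we have built: a character is treated as new when its feature vector is farther than a threshold from every known character, and the threshold value, the label scheme `new_N`, and the function name `classify_or_add` are all hypothetical.

```python
import math


def classify_or_add(feature, charset, threshold):
    """Return (label, added). If no known character is within the threshold,
    register the feature as a new character and return its new label."""
    best_label, best_dist = None, float("inf")
    for label, reference in charset.items():
        d = math.dist(feature, reference)  # plain Euclidean distance
        if d < best_dist:
            best_label, best_dist = label, d
    if best_dist <= threshold:
        return best_label, False
    new_label = f"new_{len(charset)}"  # hypothetical naming scheme
    charset[new_label] = list(feature)
    return new_label, True


charset = {"a": [0.0, 0.0]}
known = classify_or_add([0.1, 0.0], charset, threshold=0.5)
unseen = classify_or_add([5.0, 5.0], charset, threshold=0.5)
again = classify_or_add([5.1, 5.0], charset, threshold=0.5)
```

A fixed threshold is the simplest choice. With several handwritten forms of the same character, a real system would need something more robust, which is one of the machine learning challenges of this project.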

We will face four challenges. The first one is the multiple sources of minority literature. Minority literature may be written on wood, silk, paper, bark, fur, and low-quality paper.

Multiple sources bring complex backgrounds. It is very difficult to differentiate characters from the background texture. Multiple sources and complex backgrounds are the two challenges for computer vision.

Many minority languages are not standardized. Some characters have two or three different handwritten forms. The writer's age, gender, and writing tools have a great influence on handwriting. How can a computer know about these influences?

Massive data makes things even worse. You never know whether the character set in your computer contains all the characters used in a given minority language. One might need to process all the literature in a language to get a complete character set. These are the two challenges for machine learning.

Thank you very much for your kind attention. I am looking forward to hearing from you during the discussion period and in the coming days in Kunming.