Chapter 4 Human Evaluation


Human evaluation is a key component of any MT evaluation process. It acts as the reference against which automatic evaluation is judged: an automatic metric is deemed good or bad according to its correlation with human assessments. In a human evaluation, human evaluators look at the system output and judge it by hand, checking whether the output is correct or not. Bilingual evaluators who understand both the input and the output are the best-qualified judges for this task. Figure 4.1 shows the process of human evaluation.

Figure 4.1: Human Evaluation Process

Here, the output of the system is provided to the human evaluator, who assesses it on the basis of a subjective questionnaire/metric. This is done for each of the sentences to be evaluated. However, judging MT output merely on the basis of correctness is a very harsh method of evaluation, so MT researchers have developed metrics against which an output can be checked. Most human evaluation campaigns judge MT outputs on comprehensibility, adequacy, fluency, meaning maintenance, or

clarity. In recent years, almost all evaluation campaigns have used adequacy and fluency to register human judgments. These two factors can be incorporated into a single metric or provided as separate metrics. Moreover, a human evaluator has to judge the output on some scale, which may vary from metric to metric. Table 4.1 shows the adequacy scale used for evaluating sentences, and Table 4.2 shows the fluency scale.

Score  Description
5      All Information
4      Most Information
3      Much Information
2      Little Information
1      None

Table 4.1: Interpretation of Adequacy on Scale 5

Score  Description
5      Flawless English
4      Good English
3      Non-Native English
2      Disfluent English
1      Incomprehensible

Table 4.2: Interpretation of Fluency on Scale 5

In this chapter we describe the human evaluation process. We present a novel human evaluation metric that we have developed and its evaluation results on seventeen MT engines. We also compare this metric with human Adequacy and Fluency on Scale 5 metrics and analyze its results.

4.1 HEval: Subjective Human Evaluation Metric

We have used a human evaluation metric which captures the linguistic features of a language and provides qualitative assessments of the MT engine outputs. The linguistic features considered in the metric are:

1. Translation of gender and number of the noun(s).
2. Identification of the proper noun(s).
3. Use of adjectives and adverbs corresponding to the nouns and verbs.
4. Selection of proper words/synonyms (lexical choice).
5. Sequence of phrases and clauses in the translation.

6. Use of punctuation marks in the translation.
7. Translation of tense in the sentence.
8. Translation of voice in the sentence.
9. Maintaining the semantics of the source sentence in the translation.
10. Fluency of the translated text and the translator's proficiency.
11. Evaluating the translation of the source sentence (with respect to syntax and intended meaning).

This metric was first proposed by the late Prof. Rekha Govil; we have refined it and hence named it after her. We employ a five-point scale to measure the quality of the translations. Table 4.3 describes the scale.

Score  Description
4      Ideal
3      Perfect
2      Acceptable
1      Partially Acceptable
0      Not Acceptable

Table 4.3: Interpretation of HEval on Scale 5

Each of the above features is scored on this scale, and an average is computed over all eleven scores, which gives us a single objective value. Figure 4.2 shows the working prototype that has been developed. This prototype is being used to evaluate MT systems developed under the English-Indian Languages Machine Translation Systems Project, which is sponsored by the TDIL Programme, Government of India (Lata and Kumar, 2010).

The current human evaluation metrics are unable to provide a complete assessment of the dimensions of translation quality, as they focus on just one or two quality measures. Most human evaluation campaigns consider only a limited set of features, adequacy and fluency being the most important. This kind of evaluation is inappropriate because, in practice, a human translator does not translate a text on the basis of superficial ratings alone. While translating, a translator has to attend to many factors, such as the gender being addressed in the text, the proper nouns being used, and the use of adjectives and adverbs; only then does the translator produce a good translation. Our metric intuitively captures this phenomenon and provides an objective score.
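The averaging step described above can be sketched as follows. This is our own minimal illustration, not the prototype's actual code; normalising the average by the maximum score of 4 is an assumption inferred from the overall values reported in Table 4.5.

```python
# Minimal sketch (not the actual prototype code) of the HEval scoring step:
# each of the eleven linguistic features receives a score from 0 to 4
# (Table 4.3), and the scores are averaged. Normalising by the maximum
# score of 4 (to get a value in [0, 1]) is our assumption, inferred from
# the overall values shown in Table 4.5.

def heval_overall(feature_scores, max_score=4):
    """Average the eleven per-feature scores and normalise to [0, 1]."""
    if len(feature_scores) != 11:
        raise ValueError("HEval expects scores for all eleven features")
    return sum(feature_scores) / (len(feature_scores) * max_score)

# Example: the Human 1 feature scores for engine E1 in Table 4.5.
scores = [3, 3, 3, 3, 0, 0, 3, 4, 3, 4, 3]
print(f"{heval_overall(scores):.2f}")  # 0.66 (the thesis reports 0.65, likely truncated)
```

On this reading, an engine that scores well on only some of the eleven features still receives partial credit, rather than being judged on a single pass/fail criterion.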
Most of the MT systems might not be able to provide appropriate

results for all the eleven features used in our metric, but they may provide good or acceptable results for some of them. Judging an MT output on just one or two features might prevent a human judge from making a sound judgment: one judge, who has looked at all the features, may give a moderate rating to a translation, whereas another judge, who might have overlooked some of the features, may give a low rating to the same translation. This causes inter-annotator disagreement while evaluating MT systems.

Figure 4.2: HEval: Subjective Human Evaluation Metric

Moreover, if a human judge is asked to re-evaluate a particular translation, he might not assign the same rating to it. This poses a serious problem when ascertaining the quality of MT systems. HEval addresses these issues by clearly laying down the parameters on which an evaluation is to be performed, providing a crisp and repeatable assessment of MT outputs. Since each feature clearly states what needs to be done and how to assess it, the possibility of assigning a low score to a good translation is reduced. Table 4.4 provides an example to justify this claim, where we show some good and bad translations and compare them with adequacy and fluency scores. These scores are calculated using the scales presented in Tables 4.1 and 4.2 respectively.

The scores of the HEval metric for these four translations are shown in Table 4.5, along with the adequacy and fluency scores. From the table it is evident that the HEval metric provides results in line with the human Adequacy and Fluency measures. Moreover, this metric can justify its results, as it provides more qualitative information about the evaluation. It also provides consistent scores when more than one judge evaluates a translation: all the judges produced almost the same scores for the translations generated by the different engines. The good translations were consistently given higher scores, whereas the bad translations were given lower ones.

English: Most impressive are the black and white chessboard marble floor, the four tall minarets (40 m high) at the corners of the structure, and the majestic dome in the middle.

E1:  सबस भ वश ल ह क ल और सफ द बस त स गमरमर फश, च र ल ब म न र (40 म टर ऊ च ) स रचन ह, और ब च म र जस ग बद क क न म
E2:  सबस भ वश ल बस त क ल और सफ द स गमरमर क फश, स रचन क क न पर च र ल ब म न र (40 म टर ऊ च ), और ब च म र जस ग बद ह.
E12: Most क क क ल और सफ द स गमरमर chessboard floor, impressive ह च र ल ब minarets क corners पर ( 40 म. ऊ च ) स रचन, म ज टक ग बद म य म और
E15: सबस ह क क ल और सफ द स गमरमर म जल, च र ऊ च म न र ( म. ऊ च ) क क न -क न क स रचन, और क म ज टक ग बद क म य म 40 क पर ह

Table 4.4: A Case of English-Hindi MT Evaluation (Sent # 150)

Engine  Judge    1  2  3  4  5  6  7  8  9  10  11  Overall  Adq  Flu
E1      Human 1  3  3  3  3  0  0  3  4  3  4   3   0.65     3    3
E1      Human 2  3  3  3  3  0  0  2  2  2  2   1   0.47
E2      Human 1  2  4  4  4  4  4  4  0  4  4   3   0.84     4    4
E2      Human 2  3  4  4  2  4  4  4  1  4  4   3   0.84
E12     Human 1  2  1  1  0  1  0  1  3  1  1   1   0.27     2    2
E12     Human 2  2  0  0  0  1  0  1  3  1  0   1   0.20
E15     Human 1  2  2  2  1  2  1  1  2  1  1   2   0.38     2    3
E15     Human 2  2  2  2  1  2  1  1  2  1  1   2   0.34

Table 4.5: Results of Human Evaluation for Case of Table 4.4

In the following sections we provide the results of two judges for all seventeen MT systems and also compare our metric with the human Adequacy and Fluency metrics on a five-point scale. For document- and system-level evaluation we have averaged the scores over the entire document and system respectively. For sentence-level evaluation, we have generated the rank of each translation.

4.2 Evaluation of MT Systems using HEval

We used 1300 sentences for all 17 MT engines, distributed into 13 documents of 100 sentences each. The combined document-wise scores of all 17 MT engines are shown in Table 4.6, and the combined system-wise scores in Table 4.7. The values in bold represent the highest score achieved by an engine with respect to a particular human evaluator. At document level, for 8 of the 13 documents all human judges gave the same consistent score, i.e. they all gave a high rank to the same engine. At system level too, the results were repeatable: all the judges unanimously gave the highest score to engine E2. This demonstrates the consistency of the metric in providing the same results across different judges. This inter-annotator agreement at document and system level was repeated at sentence level as well. At sentence level it was not possible to present the results of all the engines for all 1300 sentences, so the results were consolidated into ranks. At this level we classified the systems into four categories: (i) combined, which has the ranks of all the systems; (ii) web, which has the ranks of the four web-based MT engines; (iii) toolkit, which has the ranks of the four MT toolkits used in our experiments; and (iv) hybrid, which has the system combinations based on MT toolkits. Table 4.8 shows the results of the combined study. Out of 1300 sentences, E1 was ranked best 522 times by judge 1. This was repeated with judge 2, who gave E1 the highest rank 473 times. In all cases E2 performed second best. Figure 4.3 illustrates this claim.
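The sentence-level consolidation into ranks can be sketched as follows. This is a hypothetical illustration with made-up scores, not the thesis data: for each sentence, the engine with the highest HEval score is counted as ranked best, and the per-engine win counts are tallied for each judge.

```python
# Hypothetical sketch of the sentence-level ranking described above:
# for each sentence, engines are ordered by their HEval score and the
# engine with the highest score is counted as ranked best for that sentence.
from collections import Counter

def best_engine_counts(scores_per_sentence):
    """scores_per_sentence: list of {engine_name: heval_score} dicts,
    one dict per sentence. Returns per-engine counts of best ranks."""
    wins = Counter()
    for scores in scores_per_sentence:
        best = max(scores, key=scores.get)  # highest HEval score wins
        wins[best] += 1
    return wins

# Toy example: one judge's scores for three sentences and three engines.
judge1 = [
    {"E1": 0.65, "E2": 0.84, "E15": 0.38},
    {"E1": 0.70, "E2": 0.55, "E15": 0.41},
    {"E1": 0.62, "E2": 0.31, "E15": 0.45},
]
print(best_engine_counts(judge1))  # Counter({'E1': 2, 'E2': 1})
```

Applied per judge over all 1300 sentences, tallies of this kind would yield tables in the shape of Tables 4.8 to 4.11.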
In the web-based MT category, out of 1300 sentences, E1 was given the highest rank by all the judges, followed by engine E2. Table 4.9 shows the detailed figures of this study, which is summarized in Figure 4.4. In the MT toolkits category, out of 1300 sentences, E15 was

Doc     Judge     E1      E2      E3      E4      E5      E6      E7      E8      E9      E10     E11     E12     E13     E14     E15     E16     E17
Doc 1   Human 1   0.5408  0.535   0.5095  0.4238  0.4455  0.4303  0.4018  0.388   0.4338  0.3968  0.3988  0.1898  0.4278  0.4078  0.428   0.4868  0.5038
Doc 1   Human 2   0.5554  0.543   0.4972  0.3817  0.4322  0.4173  0.3901  0.2668  0.4085  0.3926  0.3292  0.19    0.383   0.35    0.4275  0.4355  0.4457
Doc 2   Human 1   0.6311  0.671   0.4998  0.5005  0.2788  0.3483  0.2848  0.2168  0.2733  0.3225  0.3273  0.1904  0.3508  0.2915  0.4918  0.512   0.4958
Doc 2   Human 2   0.5741  0.6214  0.3988  0.4422  0.2354  0.2679  0.2881  0.2384  0.2366  0.3073  0.2351  0.189   0.2833  0.2314  0.3936  0.4235  0.4169
Doc 3   Human 1   0.624   0.6335  0.5495  0.5065  0.3455  0.4268  0.4     0.3373  0.4993  0.506   0.3815  0.1902  0.4235  0.3623  0.5008  0.515   0.4783
Doc 3   Human 2   0.5748  0.5704  0.5053  0.4699  0.3641  0.4021  0.4246  0.3542  0.4727  0.4615  0.3303  0.1904  0.4329  0.345   0.468   0.4617  0.4242
Doc 4   Human 1   0.647   0.6913  0.5488  0.3943  0.2865  0.2945  0.3093  0.274   0.2858  0.2845  0.2935  0.1928  0.3308  0.3225  0.4373  0.4338  0.425
Doc 4   Human 2   0.6366  0.6445  0.5531  0.4195  0.3162  0.3252  0.3177  0.278   0.3184  0.3418  0.2935  0.1934  0.3574  0.3125  0.4515  0.4413  0.4529
Doc 5   Human 1   0.5413  0.5933  0.4923  0.5083  0.3688  0.3998  0.4285  0.2535  0.4285  0.4318  0.344   0.1928  0.4585  0.33    0.4865  0.46    0.4725
Doc 5   Human 2   0.6469  0.6602  0.5807  0.5696  0.4546  0.4785  0.4623  0.3093  0.495   0.4821  0.4265  0.1946  0.4983  0.3892  0.5392  0.5842  0.5902
Doc 6   Human 1   0.6743  0.7013  0.5695  0.5053  0.4458  0.4773  0.4512  0.3163  0.454   0.4765  0.34    0.1948  0.4788  0.4025  0.4743  0.457   0.45
Doc 6   Human 2   0.6069  0.6245  0.4938  0.463   0.3639  0.3926  0.3699  0.2706  0.3822  0.4074  0.286   0.1965  0.4074  0.2997  0.4337  0.4342  0.411
Doc 7   Human 1   0.6938  0.6028  0.5088  0.417   0.292   0.3418  0.3273  0.2158  0.353   0.351   0.2953  0.1972  0.3668  0.2858  0.4333  0.4425  0.454
Doc 7   Human 2   0.6265  0.6468  0.5665  0.5333  0.4035  0.3848  0.431   0.2938  0.4793  0.4603  0.2868  0.198   0.4808  0.4788  0.489   0.4473  0.424
Doc 7   Human 3   0.6859  0.6454  0.5606  0.4986  0.3734  0.3884  0.4057  0.2798  0.4378  0.4275  0.3197  0.1998  0.449   0.4099  0.4851  0.4721  0.4656
Doc 8   Human 1   0.7223  0.7363  0.568   0.549   0.4425  0.5083  0.4635  0.402   0.495   0.5023  0.4828  0.1982  0.545   0.4438  0.554   0.5138  0.5195
Doc 8   Human 2   0.5325  0.539   0.4228  0.4075  0.2605  0.349   0.3195  0.1963  0.3693  0.41    0.3095  0.1964  0.378   0.2173  0.4308  0.3888  0.3928
Doc 8   Human 3   0.6512  0.662   0.5228  0.5045  0.3899  0.4619  0.4253  0.3314  0.4578  0.4906  0.4306  0.1948  0.4986  0.3642  0.5294  0.477   0.4807
Doc 9   Human 1   0.6793  0.6428  0.5385  0.481   0.3995  0.5683  0.343   0.2308  0.3403  0.4115  0.275   0.1917  0.4478  0.398   0.4633  0.4503  0.3788
Doc 9   Human 2   0.6889  0.6356  0.5272  0.4784  0.4015  0.5194  0.3816  0.2178  0.4257  0.4189  0.2523  0.1853  0.4197  0.3511  0.4655  0.5175  0.4889
Doc 10  Human 1   0.5388  0.5298  0.4405  0.3795  0.3705  0.4173  0.4065  0.4125  0.4425  0.458   0.4373  0.1828  0.4503  0.4273  0.4723  0.4845  0.4725
Doc 10  Human 2   0.5644  0.584   0.3855  0.3937  0.3342  0.3608  0.3674  0.3545  0.4013  0.413   0.3222  0.178   0.4317  0.3749  0.4544  0.4127  0.4143
Doc 11  Human 1   0.596   0.6753  0.452   0.4863  0.2455  0.303   0.3188  0.2238  0.3388  0.4105  0.2468  0.1764  0.385   0.248   0.503   0.4893  0.5068
Doc 11  Human 2   0.6006  0.6599  0.4285  0.4457  0.2644  0.3253  0.3638  0.3527  0.3464  0.3972  0.3004  0.1729  0.4007  0.2925  0.4609  0.4977  0.5179
Doc 12  Human 1   0.6443  0.6338  0.4383  0.3853  0.2718  0.3423  0.317   0.2578  0.321   0.3443  0.335   0.1714  0.3338  0.3268  0.4068  0.4873  0.4938
Doc 12  Human 2   0.618   0.6091  0.4151  0.4587  0.3234  0.3801  0.3291  0.2527  0.3411  0.3767  0.3512  0.1659  0.3816  0.329   0.4016  0.4322  0.4301
Doc 13  Human 1   0.636   0.6478  0.6188  0.4843  0.384   0.3933  0.3823  0.3875  0.4025  0.4218  0.434   0.1636  0.4498  0.3498  0.5133  0.505   0.4908
Doc 13  Human 2   0.6581  0.6612  0.5618  0.462   0.3748  0.424   0.4209  0.4202  0.3903  0.4475  0.3832  0.1589  0.5122  0.4201  0.5226  0.5173  0.509

Table 4.6: Combined Document Level Scores of HEval Metric

Judge     E1      E2      E3      E4      E5      E6      E7      E8      E9      E10     E11     E12     E13     E14     E15     E16     E17
Human 1   0.6284  0.638   0.518   0.4631  0.352   0.4039  0.3718  0.3012  0.3898  0.409   0.3532  0.2079  0.4191  0.3535  0.4742  0.4798  0.4724
Human 2   0.6201  0.6247  0.4946  0.4606  0.356   0.3956  0.3805  0.302   0.3933  0.4126  0.3277  0.2339  0.4197  0.3438  0.464   0.4697  0.4652

Table 4.7: Combined System Level Scores of HEval Metric

given the highest rank by all the judges, followed by E9. Table 4.10 shows the ranks scored by the different engines, and Figure 4.5 summarizes the results. In the hybrid category, out of 1300 sentences, E16 was adjudged the highest ranker, followed by E17. The results of this study are shown in Table 4.11 and summarized in Figure 4.6.

Engine  Human 1  Human 2
E1      551      476
E2      356      414
E3      104      75
E4      50       66
E5      15       13
E6      16       8
E7      15       17
E8      4        30
E9      14       11
E10     12       22
E11     12       7
E12     8        3
E13     27       25
E14     15       8
E15     49       63
E16     29       36
E17     23       26

Table 4.8: System Level Ranking of HEval Metric for All MT Engines

Engine  Human 1  Human 2
E1      649      579
E2      435      511
E3      140      115
E4      76       95

Table 4.9: System Level Ranking of HEval Metric for Web Based MT Engines

Engine  Human 1  Human 2
E5      265      196
E9      245      339
E12     85       18
E15     705      747

Table 4.10: System Level Ranking of HEval Metric for MT Toolkit Based MT Engines

Engine  Human 1  Human 2
E6      248      122
E7      76       77
E8      29       108
E10     134      153
E11     73       31
E13     146      180
E14     43       58
E16     379      320
E17     172      251

Table 4.11: System Level Ranking of HEval Metric for Hybrid MT Engines

Figure 4.3: Ranks Given by Human Judges to Combined MT Engines

Figure 4.4: Ranks Given by Human Judges to Web Based MT Engines

Figure 4.5: Ranks Given by Human Judges to Toolkit Based MT Engines

Figure 4.6: Ranks Given by Human Judges to Hybrid MT Engines

          Combined         Web              Toolkit          Hybrid
          Agreement  %     Agreement  %     Agreement  %     Agreement  %
H1 vs H2  689        52%   843        65%   688        53%   611        47%

Table 4.12: Pairwise Agreement between Judges

In order to strengthen our claim, we also carried out pairwise comparisons of the MT engines, taking the highest-ranking engines from the four different categories. In this study we compared the results of judge 1 with those of judge 2. In the combined category, out of 1300 sentences, judge 1 agreed with judge 2 689 times by giving a higher rank to the same MT engine. This amounts to 52% agreement between the two judges. All the agreement counts and their respective percentages are shown in Table 4.12. From the

table it is clear that the metric shows reasonable agreement at sentence level as well, which can be further explained by analyzing the scores given by the judges to the different features.

Engine  Human 1  Human 2
E1      0.175    0.136
E2      0.055    0.063
E5      0.013    0.065
E6      0.014    0.038
E9      0.030    0.23
E10     0.016    -0.021
E15     0.124    0.056
E16     0.015    -0.004
E17     0.027    0.011

Table 4.13: Spearman's Rho Correlation between HEval and Human Adequacy Scale 5

4.3 Correlation of HEval with Human Adequacy Score

We compared the results of our metric with the human Adequacy score on Scale 5. We took the results of some of the engines from the various categories and correlated them with the Adequacy score. For this study we used only one human judge to provide the adequacy scores of the MT engines. We used Spearman's rho to measure correlation, as it is based on ranks and is not sensitive to outliers. Table 4.13 shows the results of this study. The adequacy judgment was done on ten MT engines: the top-ranking engines from the web-based category, the top MT engines from the toolkit category, and the high-ranking MT engines from the hybrid category. In all cases E1 scored higher, with significant correlation at the 99% confidence level. E2 showed significant correlation at the 95% confidence level. E5 showed significant correlation between Adequacy and judge 2 at the 95% confidence level, and positive correlation with judge 1. E6 and E9 showed positive correlations. E10 showed positive correlation with judge 1 and negative correlation with judge 2. E15 showed significant correlation at the 95% confidence level with all the judges. E16 showed a slight negative correlation with judge 2 and positive correlation with judge 1. E17 showed positive correlation with judges 1 and 2. Almost all the correlation coefficients showed either positive or

significantly positive correlations. Thus, we can deduce that HEval can be a good replacement for adequacy-based evaluations.

Engine  Human 1  Human 3
E1      0.015    0.025
E2      0.053    0.048
E5      0.031    0.065
E6      0.020    0.072
E9      0.046    0.076
E10     0.020    0.030
E15     -0.012   0.019
E16     0.015    -0.004
E17     0.007    -0.003

Table 4.14: Spearman's Rho Correlation between HEval and Human Fluency Scale 5

4.4 Correlation of HEval with Human Fluency Score

We compared the results of our metric with the human Fluency score on Scale 5. We took the results of the same engines from the various categories that were used for the Adequacy scores. For this study too, we used only one human judge to provide the fluency scores of the MT engines, and here too we used Spearman's rho to measure correlation. Table 4.14 shows the results of this study. E1 and E2 showed positive correlation with all the judges. E5 showed significant correlation with all the judges at the 95% confidence level. E6 showed significant correlation with all the judges at the 99% confidence level. E9 showed significant positive correlation at the 95% confidence level. E10 showed positive correlation with all the judges. E15 showed positive correlation with judge 3 and negative correlation with judge 1. E16 showed a slight negative correlation with judge 3 and positive correlation with judge 1. E17 showed negative correlation with judge 3 and positive correlation with judge 1. Since here also almost all the correlation coefficients showed positive or significantly positive correlations, we can deduce that HEval can be a good replacement for fluency-based evaluations too.
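The correlation computation used in Sections 4.3 and 4.4 can be sketched as follows. This is a self-contained pure-Python illustration with made-up scores, not the thesis data; in practice one would typically use a statistics package (e.g. SciPy's spearmanr).

```python
# Illustrative sketch of the correlation step: Spearman's rho computed as
# the Pearson correlation of average ranks (which handles ties correctly).
# The data at the bottom are toy values, not results from this chapter.

def average_ranks(xs):
    """Assign 1-based ranks, averaging ranks over tied values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rank correlation coefficient between two samples."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

heval = [0.65, 0.47, 0.84, 0.27, 0.38, 0.52]   # HEval overall scores (toy)
adequacy = [3, 2, 5, 1, 2, 3]                  # Scale-5 adequacy ratings (toy)
print(f"rho = {spearman_rho(heval, adequacy):.3f}")  # rho = 0.971
```

Using ranks rather than raw values is what makes the statistic insensitive to outliers, which is the reason given above for preferring Spearman's rho.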

4.5 Conclusion

In this chapter we presented a new human evaluation metric, HEval, which was implemented in this research. We showed the design of the metric along with its scoring criteria and illustrated the mechanism for scoring translations with an example. We compared the results of this metric with the popular human evaluation metrics (Adequacy and Fluency) that have been used in almost all evaluation campaigns, and correlated its results with these two metrics on some of the good and bad MT engines. The metric showed significant correlation with both the human Adequacy and human Fluency evaluation metrics.