Statistical Machine Translation: Phrase-Based Statistical MT

Jörg Tiedemann (jorg.tiedemann@lingfil.uu.se)
Department of Linguistics and Philology, Uppsala University
October 2009

Probabilistic view on MT (E = target language, F = source language):

  Ê = argmax_E P(E|F) = argmax_E P(F|E) P(E)

Language Modeling

Language modeling: a (probabilistic) LM predicts the likelihood of a given string.
What is the likelihood P(E) of observing sentence E? Exactly what we need!
Estimate probabilities from corpora: decompose into N-grams!

  unigram model:  P(E) = P(e_1) P(e_2) ... P(e_n)
  bigram model:   P(E) = P(e_1) P(e_2|e_1) P(e_3|e_2) ... P(e_n|e_{n-1})
  trigram model:  P(E) = P(e_1) P(e_2|e_1) P(e_3|e_1,e_2) ... P(e_n|e_{n-2},e_{n-1})

There would be much more to say about language modeling... (a minimal bigram sketch follows below)

Motivation for Phrase-based SMT

Word-based SMT:
  statistical word alignment: P(F|E)
  language modeling: P(E)
  global decoding: argmax_E P(F|E) P(E)

Word-by-word translation is too weak!
  contextual dependencies
  non-compositional constructions
  n:m relations
=> look at larger chunks!
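To make the N-gram decomposition above concrete, here is a minimal sketch of a bigram model with relative-frequency (MLE) estimation. The toy corpus and function names are my own, and a real LM would add smoothing for unseen bigrams:

```python
# Minimal bigram language model with MLE estimation (illustrative only).
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, sent):
    """P(E) = product of P(e_i | e_{i-1}), estimated by relative frequency."""
    tokens = ["<s>"] + sent + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if bigrams[(prev, cur)] == 0:
            return 0.0          # unseen bigram: a real LM would smooth here
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

corpus = [["the", "house", "rose"],
          ["the", "house", "observed", "a", "minute"]]
uni, bi = train_bigram_lm(corpus)
print(bigram_prob(uni, bi, ["the", "house", "rose"]))   # 0.5
```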
Phrase-based SMT

Motivation:
  phrases = word N-grams
  less ambiguity, more context in the translation table
  handles non-compositional expressions
  local reorderings are covered by phrase translations
  "distortion": reordering on the phrase level

Translation model in PSMT:

  P(F|E) = Π_{i=1..I} φ(f_i|e_i) d(start_i, end_{i-1})

Phrases are extracted from word-aligned parallel corpora.
Phrase translation probabilities (MLE):

  φ(f|e) = count(f,e) / Σ_{f'} count(f',e)

Moses toolkit: http://www.statmt.org/moses/
(a scoring sketch follows below)

Statistical word alignment

Phrase translation probabilities:
  need phrase alignments in the parallel corpus
  induce them from word alignments (IBM models)
  score the extracted phrases (MLE)

Standard models: IBM models 1-5 (cascaded), EM training; final parameters:
  word translation probabilities (lexical model)
  fertility probabilities
  distortion probabilities (reordering)

Viterbi alignment: assign the most likely links between words according to the statistical word alignment model above.
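A minimal sketch of the MLE phrase scoring above, assuming the phrase pairs have already been extracted from the aligned corpus (the toy pair list and function name are my own):

```python
# MLE phrase scoring: phi(f|e) = count(f,e) / sum over f' of count(f',e).
from collections import Counter

def score_phrases(phrase_pairs):
    """Relative-frequency estimate of phi(f|e) from extracted phrase pairs."""
    pair_counts = Counter(phrase_pairs)
    e_counts = Counter(e for _, e in phrase_pairs)
    return {(f, e): c / e_counts[e] for (f, e), c in pair_counts.items()}

pairs = [("bara", "just"), ("bara", "just"), ("bara", "only"),
         ("bara vi", "just")]
for (f, e), phi in sorted(score_phrases(pairs).items()):
    print(f"phi({f!r} | {e!r}) = {phi:.3f}")
# phi('bara' | 'just') = 0.667, phi('bara' | 'only') = 1.000, ...
```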
Viterbi Word Alignment

  special NULL word (e.g. NULL → la)
  EMPTY alignment is possible (e.g. did)
  only 1:many links (e.g. slap), not many:1, depending on the alignment direction

Alignment tool: GIZA++ (http://code.google.com/p/giza-pp/)

Viterbi Word Alignment from GIZA++

From the German-English Europarl corpus:

# Sentence pair (5) source length 12 target length 11 alignment score : 2.14036e-24
ich bitte sie , sich zu einer schweigeminute zu erheben .
NULL ({ }) please ({ 1 2 3 }) rise ({ }) , ({ 4 }) then ({ 5 }) , ({ }) for ({ 6 }) this ({ 7 }) minute ({ 8 }) ' ({ }) s ({ }) silence ({ 9 10 }) . ({ 11 })

# Sentence pair (6) source length 12 target length 10 alignment score : 3.38628e-15
( das parlament erhebt sich zu einer schweigeminute . )
NULL ({ }) ( ({ 1 }) the ({ 2 }) house ({ 3 }) rose ({ 4 5 }) and ({ }) observed ({ 6 }) a ({ 7 }) minute ({ 8 }) ' ({ }) s ({ }) silence ({ 9 }) ) ({ 10 })

(a parsing sketch for this format follows below)

Viterbi Word Alignment

[figure: source-to-target word alignment]
The alignment is asymmetric!
  no n:1 alignments

Word Alignment Symmetrization

  we can run the IBM models in both directions!
  different links in source-to-target and target-to-source
  best alignment = merge both directions (?!)
How? Symmetrization heuristics!
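A rough sketch of reading this output, assuming the three-line record layout shown above (header, target sentence, source tokens each followed by "({ ... })" with 1-based target positions); the regular expression and function name are my own:

```python
# Parse a GIZA++ A3-style Viterbi alignment record as printed above.
import re

def parse_giza_record(lines):
    """Return (target_tokens, [(source_token, [target_positions]), ...])."""
    header, target_line, align_line = lines
    target = target_line.split()
    links = re.findall(r"(\S+)\s+\(\{([\d\s]*)\}\)", align_line)
    return target, [(tok, [int(i) for i in pos.split()]) for tok, pos in links]

record = [
    "# Sentence pair (6) source length 12 target length 10 alignment score : 3.38628e-15",
    "( das parlament erhebt sich zu einer schweigeminute . )",
    "NULL ({ }) ( ({ 1 }) the ({ 2 }) house ({ 3 }) rose ({ 4 5 }) and ({ }) "
    "observed ({ 6 }) a ({ 7 }) minute ({ 8 }) ' ({ }) s ({ }) silence ({ 9 }) ) ({ 10 })",
]
target, aligned = parse_giza_record(record)
for tok, positions in aligned:
    print(tok, "->", [target[i - 1] for i in positions])
# e.g.  rose -> ['erhebt', 'sich'],  and -> []
```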
Word Alignment Symmetrization

[figure: target-to-source word alignment]
[figure: symmetrized word alignment]

Start with the intersection, then add adjacent links (from the union)...
(a simplified symmetrization sketch follows below)

Phrase extraction

Get ALL phrase pairs that are consistent with the word alignment.
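The symmetrization idea in code: a simplified sketch of the grow heuristic, starting from the intersection and repeatedly adding union links that are adjacent to an accepted link. The full Moses grow-diag-final heuristic adds further conditions (e.g. only linking still-unaligned words); positions and links here are toy data:

```python
# Simplified grow-style symmetrization of two directional word alignments.
def symmetrize(src2tgt, tgt2src):
    """src2tgt, tgt2src: sets of (src_pos, tgt_pos) links from each direction."""
    union = src2tgt | tgt2src
    alignment = src2tgt & tgt2src          # high-precision starting point
    added = True
    while added:                           # repeat until no adjacent link is left
        added = False
        for (i, j) in sorted(union - alignment):
            neighbors = {(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1),
                         (i - 1, j - 1), (i - 1, j + 1),
                         (i + 1, j - 1), (i + 1, j + 1)}
            if neighbors & alignment:      # adjacent to an accepted link?
                alignment.add((i, j))
                added = True
    return alignment

s2t = {(0, 0), (1, 1), (2, 1)}
t2s = {(0, 0), (1, 1), (3, 2)}
print(sorted(symmetrize(s2t, t2s)))
# [(0, 0), (1, 1), (2, 1), (3, 2)]
```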
Phrase extraction

[figures: step-by-step worked example of phrase-pair extraction]
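A sketch of the consistency check behind the extraction: a (source span, target span) pair is extracted if all alignment links touching either span stay inside the pair. This is simplified; the full algorithm also extends phrases over unaligned border words:

```python
# Extract phrase pairs consistent with a word alignment (simplified).
def extract_phrases(n_src, n_tgt, links, max_len=4):
    """links: set of (src_pos, tgt_pos); returns consistent span pairs."""
    pairs = []
    for s1 in range(n_src):
        for s2 in range(s1, min(s1 + max_len, n_src)):
            # target positions linked to the source span [s1, s2]
            tgt = [j for (i, j) in links if s1 <= i <= s2]
            if not tgt:
                continue                    # require at least one link
            t1, t2 = min(tgt), max(tgt)
            if t2 - t1 + 1 > max_len:
                continue
            # consistency: no link may leave the source span from [t1, t2]
            if all(s1 <= i <= s2 for (i, j) in links if t1 <= j <= t2):
                pairs.append(((s1, s2), (t1, t2)))
    return pairs

# toy alignment: source->target links 0-0, 1-2, 2-1
print(extract_phrases(3, 3, {(0, 0), (1, 2), (2, 1)}))
# [((0,0),(0,0)), ((0,2),(0,2)), ((1,1),(2,2)), ((1,2),(1,2)), ((2,2),(1,1))]
```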
[figures: phrase extraction example, continued]

Scoring phrases

Simple maximum likelihood estimation:

  φ(f|e) = count(f,e) / Σ_{f'} count(f',e)

Phrase tables

Examples from a phrase table (Pirates of the Caribbean):
A huge phrase table! (with a lot of garbage?)

  Swedish                         English                      Score
  , det är                        , it s                       0.666667
  , det är                        , that s                     1
  att bli besvikna                be disappointed              1
  att bli en själv                to becoming one              1
  bara vi                         just                         0.1
  bara                            just                         0.6
  bara                            only                         0.375
  barbossa och hans besättning    barbossa and his crew        1
  barbossa och hans               barbossa and his             1
  barbossa tänker göra . allt     barbossa is up to ...... all 1

(The training set was too small to get reasonable counts!)
(a table-loading sketch follows below)
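A minimal sketch of loading such a table as Moses stores it on disk, one pair per line with "|||" separators between source phrase, target phrase, and the score field(s); the file name is hypothetical:

```python
# Load a Moses-style plain-text phrase table:  src ||| tgt ||| score1 score2 ...
from collections import defaultdict

def load_phrase_table(path):
    """Map each source phrase to its candidate translations with scores."""
    table = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            src, tgt, scores = line.strip().split(" ||| ")[:3]
            table[src].append((tgt, [float(s) for s in scores.split()]))
    return table

# table = load_phrase_table("phrase-table")      # hypothetical file name
# for tgt, scores in table["bara"]:
#     print(tgt, scores)
```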
The final model for PB-SMT

  Ê = argmax_E P(E|F)
    = argmax_E [ Π_{i=1..I} φ(f_i|e_i) d(start_i, end_{i-1}) ] · P(E) · ω^length(E)

Distortion d: the chance to move phrases to other positions
  fixed distortion limit (e.g. 6)
  simple penalty for moving: d = α^|start_i - end_{i-1} - 1|
  OR lexicalized distortion (learned from the alignment)

Word cost ω^length(E): a bias towards longer output

PB-SMT extension: Log-linear Models

Instead of the noisy-channel model Ê = argmax_E P(F|E) P(E), model the posterior directly: Ê = argmax_E P(E|F).
Many feature functions h_m(E,F) may influence P(E|F):
  phrase translation model E → F
  phrase translation model F → E
  lexical weights from the underlying word alignment
  a language model P(E)
  a lexicalized reordering model
  length features (word/phrase costs/penalties)

P(E|F) = weighted combination of feature functions!

P(E|F) is a weighted (λ_m) combination of feature functions (h_m):

  P(E|F) = exp( Σ_{m=1..M} λ_m h_m(E,F) ) / Z

  Ê = argmax_E P(E|F) = argmax_E log P(E|F) = argmax_E Σ_{m=1..M} λ_m h_m(E,F)

How do we learn the weights λ_m?
  Minimum error rate training (MERT) on a development set!
  Measure the error in terms of BLEU scores (n-best list)
  Iterative adjustment of the model parameters (slow but effective!)
(a log-linear scoring sketch follows below)

Phrase table with multiple scores

That's what you will get from Moses:

  Swedish            English           Scores
  , det är           , it s            0.6667 0.0959975 0.6667 0.0263227 2.718
  att bli besvikna   be disappointed   1 0.0221815 1 0.105472 2.718
  att bli en själv   to becoming one   1 0.00896375 1 0.00157689 2.718
  bara vi            just              0.1 0.0102041 1 0.128968 2.718
  bara               just              0.6 0.285714 0.6 0.25 2.718
  bara               naught but        1 0.268518 0.1 0.00195312 2.718
  bara               only              0.375 0.222222 0.3 0.125 2.718

The five scores are:
  phrase translation probability φ(f|e)
  lexical weighting lex(f|e)
  phrase translation probability φ(e|f)
  lexical weighting lex(e|f)
  phrase penalty (always exp(1) = 2.718)
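A minimal sketch of the log-linear score for one candidate: the weighted sum of log feature values. The feature names, values, and weights below are invented for the demo; in practice the weights would come from MERT tuning:

```python
# Log-linear scoring: sum_m lambda_m * h_m(E,F), with h_m = log model probability.
import math

def loglinear_score(features, weights):
    """Weighted sum of log feature values for one translation candidate."""
    return sum(weights[name] * math.log(value)
               for name, value in features.items())

candidate = {                 # toy feature values for one candidate translation
    "phi_f_given_e": 0.6,     # phrase translation probability phi(f|e)
    "phi_e_given_f": 0.25,    # reverse phrase translation probability phi(e|f)
    "lm": 0.001,              # language model probability P(E)
}
weights = {"phi_f_given_e": 0.2, "phi_e_given_f": 0.2, "lm": 0.5}
print(loglinear_score(candidate, weights))   # higher (less negative) is better
```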
Translation = decoding

Global search: Ê = argmax_E P(E|F)
  many translation alternatives (huge phrase table)
  many ways to segment words into phrases
  re-ordering makes it even more complex
Very expensive! We need search heuristics:
  pruning (discard weak hypotheses early)
  stack decoding (histograms & thresholds)
  reordering limits

Build the translation left-to-right:
  select the foreign words to be translated
  select a translation from the phrase table
  add the translation to the partial translation (hypothesis)

Example (source: Maria no dio una bofetada a la ...):

  Maria no dio una bofetada a la   ->   Mary
  (mark the first foreign word as translated)

  Maria no dio una bofetada a la   ->   Mary did not
  (one-to-many translation)

  Maria no dio una bofetada a la   ->   Mary did not slap
  (many-to-one translation)
  Maria no dio una bofetada a la   ->   Mary did not slap the
  (many-to-one translation)

  Maria no dio una bofetada a la   ->   Mary did not slap the green
  (example of re-ordering)

Lattice of translation options

[figure: lattice of translation options for the example sentence]

  Maria no dio una bofetada a la   ->   Mary did not slap the green witch
  (translation finished)
Hypothesis expansion

[figures: hypothesis expansion example]

... and we continue adding more hypotheses: an exponential explosion of the search space!

Hypothesis Stacks

  here: stacks based on the number of foreign words translated
  expand all hypotheses from one stack during translation
  place the expanded hypotheses into the appropriate stacks
  get an n-best list of translations
(a minimal stack-decoding sketch follows below)
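Putting the pieces together, a minimal monotone stack decoder over the example sentence. The phrase table and probabilities are toy values, there is no reordering and no language model, and only histogram pruning is applied; stacks are indexed by the number of source words covered:

```python
# Minimal monotone stack decoder with histogram pruning (toy example).
import math
from collections import defaultdict

PHRASES = {  # toy phrase table: source phrase -> [(translation, probability)]
    ("Maria",): [("Mary", 0.9)],
    ("no",): [("not", 0.4), ("did not", 0.5)],
    ("dio", "una", "bofetada"): [("slap", 0.7)],
}

def decode(source, k=5):
    stacks = defaultdict(list)            # covered word count -> hypotheses
    stacks[0] = [(0.0, "")]               # (log score, partial translation)
    for covered in range(len(source)):
        for score, partial in stacks[covered]:
            # expand: translate the next 1..n untranslated source words
            for length in range(1, len(source) - covered + 1):
                span = tuple(source[covered:covered + length])
                for translation, p in PHRASES.get(span, []):
                    hyp = (score + math.log(p),
                           (partial + " " + translation).strip())
                    stacks[covered + length].append(hyp)
        # histogram pruning: keep only the k best hypotheses per stack
        stacks[covered + 1] = sorted(stacks[covered + 1], reverse=True)[:k]
    return sorted(stacks[len(source)], reverse=True)   # n-best list

for score, translation in decode(["Maria", "no", "dio", "una", "bofetada"]):
    print(f"{score:.3f}  {translation}")
# -1.155  Mary did not slap
# -1.378  Mary not slap
```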
Summary PB-SMT

  phrase-based SMT = state-of-the-art in data-driven MT (?!)
  based on standard word alignment models
  phrase extraction heuristics & simple scoring
  simplistic re-ordering model
  huge phrase table = a big memory of fragment translations
  heuristics for efficient decoding
Active research area! New developments all the time!

More information: homepage of the Moses toolkit
http://www.statmt.org/moses/

What's next?

Next lab session:
  build your own parallel corpus
  sentence & word alignment

Lecture:
  a quick look at other topics
  course summary

Last lab session:
  small-scale experiments with PB-SMT (Moses)
  basic training, evaluation
  shifting domains