Statistical Machine Translation: Phrase-Based Statistical MT
Statistical Machine Translation
Phrase-Based Statistical MT

Jörg Tiedemann
Department of Linguistics and Philology, Uppsala University
October 2009

Probabilistic view on MT (E = target language, F = source language):

    Ê = argmax_E P(E|F) = argmax_E P(F|E) P(E)

Language Modeling

A (probabilistic) language model predicts the likelihood of a given string: what is the likelihood P(E) of observing sentence E? Exactly what we need! Estimate probabilities from corpora by decomposing into N-grams:

    unigram model: P(E) = P(e_1) P(e_2) ... P(e_n)
    bigram model:  P(E) = P(e_1) P(e_2|e_1) P(e_3|e_2) ... P(e_n|e_{n-1})
    trigram model: P(E) = P(e_1) P(e_2|e_1) P(e_3|e_1,e_2) ... P(e_n|e_{n-2},e_{n-1})

There would be much more to say about language modeling...

Motivation for Phrase-based SMT

Word-based SMT:
- statistical word alignment: P(F|E)
- language modeling: P(E)
- global decoding: argmax_E P(F|E) P(E)

Word-by-word translation is too weak:
- contextual dependencies
- non-compositional constructions
- n:m relations
=> look at larger chunks!
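The N-gram decomposition above can be implemented with plain counts. The sketch below (toy corpus, function names, and the start/end markers are illustrative, not from the lecture) estimates a bigram model by MLE and computes P(E) as a product of conditional probabilities:

```python
from collections import Counter

def train_bigram_lm(corpus):
    """Estimate MLE bigram probabilities P(w_i | w_{i-1}) from a toy corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])          # count conditioning contexts
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return lambda prev, word: (bigrams[(prev, word)] / unigrams[prev]
                               if unigrams[prev] else 0.0)

def sentence_prob(prob, sentence):
    """P(E) under the bigram model: product of conditional probabilities."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        p *= prob(prev, word)
    return p

corpus = ["the house rose", "the house observed a minute"]
lm = train_bigram_lm(corpus)
```

With this toy corpus, P("the house rose") = 1 · 1 · 1/2 · 1 = 0.5, since "house" is followed by "rose" in one of its two occurrences.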
Phrase-based SMT

Motivation:
- phrases = word N-grams
- less ambiguity, more context in the translation table
- handles non-compositional expressions
- local reorderings covered by phrase translations
- "distortion": reordering on the phrase level

Translation model in PSMT:

    P(F|E) = ∏_{i=1..I} φ(f_i|e_i) d(start_i, end_{i−1})

Phrases are extracted from word-aligned parallel corpora. Phrase translation probabilities (MLE):

    φ(f|e) = count(f,e) / Σ_{f'} count(f',e)

Moses toolkit.

Statistical word alignment

Phrase translation probabilities require phrase alignments in the parallel corpus:
- induce them from word alignments (IBM models)
- score the extracted phrases (MLE)

Standard models: IBM models 1-5 (cascaded), trained with EM. Final parameters:
- word translation probabilities (lexical model)
- fertility probabilities
- distortion probabilities (reordering)

Viterbi alignment: assign the most likely links between words according to the statistical word alignment model above.
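The MLE formula for φ(f|e) is just relative-frequency counting over extracted phrase pairs. A minimal sketch (the toy phrase pairs are invented for illustration):

```python
from collections import Counter

def score_phrase_table(phrase_pairs):
    """MLE phrase translation probabilities:
    phi(f|e) = count(f,e) / sum over f' of count(f',e)."""
    pair_counts = Counter(phrase_pairs)
    e_counts = Counter(e for _, e in phrase_pairs)   # normalizer per target phrase
    return {(f, e): c / e_counts[e] for (f, e), c in pair_counts.items()}

# Toy extracted phrase pairs (f, e); repeated pairs raise the MLE score.
pairs = [("bara", "just"), ("bara", "just"), ("bara", "only"),
         ("bara vi", "just")]
phi = score_phrase_table(pairs)
```

Here φ("bara"|"just") = 2/3, because "just" occurs three times as a target phrase and twice with "bara".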
Viterbi Word Alignment from GIZA++

From the German-English Europarl corpus:
- special NULL word (NULL -> la)
- EMPTY alignment possible (did)
- only 1:many (slap), not many:1, depending on alignment direction
Alignment tool: GIZA++

# Sentence pair (5) source length 12 target length 11 alignment score : e-24
ich bitte sie , sich zu einer schweigeminute zu erheben .
NULL ({ }) please ({ }) rise ({ }) , ({ 4 }) then ({ 5 }) , ({ }) for ({ 6 }) this ({ 7 }) minute ({ 8 }) ({ }) s ({ }) silence ({ 9 10 }) . ({ 11 })

# Sentence pair (6) source length 12 target length 10 alignment score : e-15
( das parlament erhebt sich zu einer schweigeminute . )
NULL ({ }) ( ({ 1 }) the ({ 2 }) house ({ 3 }) rose ({ 4 5 }) and ({ }) observed ({ 6 }) a ({ 7 }) minute ({ 8 }) ({ }) s ({ }) silence ({ 9 }) ) ({ 10 })

Word Alignment Symmetrization

The source-to-target word alignment is asymmetric: no n:1 alignments are possible. But we can run the IBM models in both directions, which yields different links in the source-to-target and target-to-source alignments. Best alignment = merge both directions (?!) How? Symmetrization heuristics!
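The "merge both directions" idea can be sketched as follows. This is a simplified grow heuristic (take the intersection first, then add union links adjacent to an already accepted link), not Moses' exact grow-diag-final; alignments are sets of (source index, target index) pairs:

```python
def symmetrize(src2trg, trg2src):
    """Simplified symmetrization: start from the intersection of the two
    directional alignments, then repeatedly add links from the union that
    are horizontally or vertically adjacent to an accepted link."""
    union = src2trg | trg2src
    alignment = src2trg & trg2src        # high-precision starting point
    added = True
    while added:
        added = False
        for (i, j) in sorted(union - alignment):
            if any((i + di, j + dj) in alignment
                   for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]):
                alignment.add((i, j))
                added = True
    return alignment

# Toy directional alignments that disagree on two links.
s2t = {(0, 0), (1, 1), (2, 3)}
t2s = {(0, 0), (1, 1), (1, 2)}
sym = symmetrize(s2t, t2s)
```

In this toy case (1, 2) is adopted because it neighbors the intersection link (1, 1), while the isolated link (2, 3) is left out.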
Word Alignment Symmetrization (continued)

(Figure slides: the target-to-source word alignment and the resulting symmetrized alignment.) Start with the intersection, then add adjacent links (from the union).

Phrase extraction

Get ALL phrase pairs that are consistent with the word alignments.
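Consistency-based phrase extraction can be sketched directly from this definition: a phrase pair is kept only if no alignment link connects a word inside the pair to a word outside it. The sketch below works on index spans and omits the usual extension to unaligned boundary words:

```python
def extract_phrases(alignment, src_len, max_len=3):
    """Extract all phrase pairs (as index spans) consistent with the word
    alignment: every link touching the source span must land inside the
    target span, and vice versa.  Simplified: no unaligned-word extension."""
    phrases = set()
    for i1 in range(src_len):
        for i2 in range(i1, min(i1 + max_len, src_len)):
            # Links emitted by the candidate source span.
            links = [(i, j) for (i, j) in alignment if i1 <= i <= i2]
            if not links:
                continue
            # Minimal target span covering those links.
            j1 = min(j for _, j in links)
            j2 = max(j for _, j in links)
            # Consistent iff no outside source word links into the target span.
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                phrases.add(((i1, i2), (j1, j2)))
    return phrases

# Toy alignment for a 3-word source and 3-word target (crossing links).
align = {(0, 0), (1, 2), (2, 1)}
pairs = extract_phrases(align, 3)
```

The crossing links force source words 1 and 2 to be extracted together or as single words; the span (0, 1) is rejected because word 2 links into its target span.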
Scoring phrases

Simple maximum likelihood estimation:

    φ(f|e) = count(f,e) / Σ_{f'} count(f',e)

Phrase tables

Examples from a phrase table (Pirates of the Caribbean). A huge phrase table, with a lot of garbage?

    Swedish                        English                  Score
    , det är                       , it s
    , det är                       , that s                 1
    att bli besvikna               be disappointed          1
    att bli en själv               to becoming one          1
    bara vi                        just                     0.1
    bara                           just                     0.6
    bara                           only
    barbossa och hans besättning   barbossa and his crew    1
    barbossa och hans              barbossa and his         1
    barbossa tänker göra. allt     barbossa is up to all    1

(The training set was too small to get reasonable counts!)
The final model for PB-SMT

Instead of the plain noisy-channel model Ê = argmax_E P(F|E) P(E):

    Ê = argmax_E ( ∏_i φ(f_i|e_i) d(start_i, end_{i−1}) ) · P(E) · ω^{length(E)}

Distortion d: the chance to move phrases to other positions
- fixed distortion limit (e.g. 6)
- simple penalty for moving: α^{|start_i − end_{i−1} − 1|}
- OR lexicalized distortion (learned from the alignment)

Word cost ω^{length(E)}: a bias towards longer output.

PB-SMT extension: Log-linear Models

Model the posterior directly: Ê = argmax_E P(E|F). Many feature functions h_m(E,F) may influence P(E|F):
- phrase translation model E -> F
- phrase translation model F -> E
- lexical weights from the underlying word alignment
- a language model P(E)
- lexicalized reordering model
- length features (word/phrase costs/penalties)

P(E|F) is a weighted (λ_m) combination of feature functions (h_m):

    P(E|F) = exp( Σ_{m=1..M} λ_m h_m(E,F) ) / Z

    Ê = argmax_E P(E|F) = argmax_E log P(E|F) = argmax_E Σ_{m=1..M} λ_m h_m(E,F)

How to learn the weights λ_m? Minimum error rate training (MERT) on a development set:
- measure the error in terms of BLEU scores (n-best list)
- iterative adjustment of the model parameters (slow but effective!)

Phrase table with multiple scores

That's what you will get from Moses:

    Swedish            English
    , det är           , it s
    att bli besvikna   be disappointed
    att bli en själv   to becoming one
    bara vi            just
    bara               just
    bara               naught but
    bara               only

The five scores per entry are:
- phrase translation probability φ(f|e)
- lexical weighting lex(f|e)
- phrase translation probability φ(e|f)
- lexical weighting lex(e|f)
- phrase penalty (always exp(1) ≈ 2.718)
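Because the normalizer Z does not affect the argmax, decoding only needs the weighted feature sum. A minimal sketch (feature names, weights, and candidate feature values are invented for illustration):

```python
def loglinear_score(weights, features):
    """Unnormalized log-linear score: sum over m of lambda_m * h_m(E,F).
    The normalizer Z is constant in E and can be ignored for argmax."""
    return sum(weights[name] * value for name, value in features.items())

def best_translation(weights, candidates):
    """argmax over an n-best list of (translation, feature dict) pairs."""
    return max(candidates, key=lambda c: loglinear_score(weights, c[1]))

# Hypothetical feature values (log-probabilities and a length feature).
weights = {"phi_fe": 1.0, "lm": 0.5, "word_penalty": -0.2}
candidates = [
    ("mary did not slap", {"phi_fe": -1.2, "lm": -2.0, "word_penalty": 4}),
    ("mary no slap",      {"phi_fe": -0.8, "lm": -4.5, "word_penalty": 3}),
]
best = best_translation(weights, candidates)[0]
```

The first candidate scores −1.2 + 0.5·(−2.0) + (−0.2)·4 = −3.0, beating the second at −3.65; MERT's job is to find weights that make such rankings agree with BLEU.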
Translation = decoding

Global search: Ê = argmax_E P(E|F)

- many translation alternatives (huge phrase table)
- many ways to segment words into phrases
- re-ordering makes it even more complex

Very expensive! We need search heuristics:
- pruning (discard weak hypotheses early)
- stack decoding (histograms & thresholds)
- reordering limits

Example source sentence: "Maria no dio una bofetada a la bruja verde"

Build the translation left-to-right:
- select a foreign word (or phrase) to be translated
- select a translation from the phrase table
- add the translation to the partial translation (hypothesis)

    Mary

Mark the first (foreign) word as translated.

    Mary did not
    Mary did not slap

One-to-many translation ("no" -> "did not"), then many-to-one translation ("dio una bofetada" -> "slap").
    Mary did not slap the
    Mary did not slap the green

Another many-to-one translation, then an example of re-ordering ("green").

(Figure slide: lattice of translation options.)

    Mary did not slap the green witch

Translation finished.
Hypothesis expansion

... and continue adding more hypotheses: an exponential explosion of the search space!

Hypothesis Stacks

- here: based on the number of foreign words translated
- expand all hypotheses from one stack during translation
- place expanded hypotheses into the appropriate stacks
- get an n-best list of translations
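The stack organization above can be sketched for the monotone case (no reordering, no language model): hypotheses live in stacks indexed by the number of source words covered, and each stack is pruned to a fixed size (histogram pruning). All names and the toy phrase table are illustrative:

```python
def stack_decode(src, phrase_table, stack_size=10):
    """Monotone stack decoding sketch.  A hypothesis is a tuple
    (score, words covered, output words); scores are summed log-probabilities.
    Reordering and the language model are omitted to keep the sketch short."""
    n = len(src)
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append((0.0, 0, []))            # empty hypothesis
    for covered in range(n):
        for score, _, out in stacks[covered]:
            # Extend with every source phrase starting at the next word.
            for end in range(covered + 1, n + 1):
                f = " ".join(src[covered:end])
                for e, logp in phrase_table.get(f, []):
                    stacks[end].append((score + logp, end, out + [e]))
        # Histogram pruning: keep only the best hypotheses per stack.
        for k in range(covered + 1, n + 1):
            stacks[k] = sorted(stacks[k], reverse=True)[:stack_size]
    if not stacks[n]:
        return None
    return " ".join(max(stacks[n])[2])

# Toy phrase table: phrase -> list of (translation, log-probability).
table = {
    "maria": [("mary", -0.1)],
    "no": [("no", -1.5), ("did not", -0.3)],
    "dio una bofetada": [("slap", -0.4)],
    "dio": [("gave", -0.9)],
}
out = stack_decode(["maria", "no", "dio", "una", "bofetada"], table)
```

The decoder prefers "did not" plus the three-word phrase "dio una bofetada" -> "slap" (total score −0.8) over the word-by-word paths, mirroring the walkthrough above.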
Summary PB-SMT

- phrase-based SMT = state-of-the-art in data-driven MT (?!)
- based on standard word alignment models
- phrase extraction heuristics & simple scoring
- simplistic re-ordering model
- huge phrase table = a big memory of fragment translations
- heuristics for efficient decoding

An active research area, with new developments all the time! More information: the homepage of the Moses toolkit.

What's next?

Next lab session:
- build your own parallel corpus
- sentence & word alignment

Lecture:
- a quick look at other topics
- course summary

Last lab session:
- small-scale experiments with PB-SMT (Moses)
- basic training, evaluation
- shifting domains
PU. M. A. Vol. 21 (2010), No.2, pp. 265 284 Dyck paths, standard Young tableaux, and pattern avoiding permutations Hilmar Haukur Gudmundsson The Mathematics Institute Reykjavik University Iceland e-mail:
More informationCIS 2033 Lecture 6, Spring 2017
CIS 2033 Lecture 6, Spring 2017 Instructor: David Dobor February 2, 2017 In this lecture, we introduce the basic principle of counting, use it to count subsets, permutations, combinations, and partitions,
More informationCMath 55 PROFESSOR KENNETH A. RIBET. Final Examination May 11, :30AM 2:30PM, 100 Lewis Hall
CMath 55 PROFESSOR KENNETH A. RIBET Final Examination May 11, 015 11:30AM :30PM, 100 Lewis Hall Please put away all books, calculators, cell phones and other devices. You may consult a single two-sided
More informationDigital Speech Processing and Coding
ENEE408G Spring 2006 Lecture-2 Digital Speech Processing and Coding Spring 06 Instructor: Shihab Shamma Electrical & Computer Engineering University of Maryland, College Park http://www.ece.umd.edu/class/enee408g/
More informationYour Name and ID. (a) ( 3 points) Breadth First Search is complete even if zero step-costs are allowed.
1 UC Davis: Winter 2003 ECS 170 Introduction to Artificial Intelligence Final Examination, Open Text Book and Open Class Notes. Answer All questions on the question paper in the spaces provided Show all
More informationSpeech Coding in the Frequency Domain
Speech Coding in the Frequency Domain Speech Processing Advanced Topics Tom Bäckström Aalto University October 215 Introduction The speech production model can be used to efficiently encode speech signals.
More informationCombining Voice Activity Detection Algorithms by Decision Fusion
Combining Voice Activity Detection Algorithms by Decision Fusion Evgeny Karpov, Zaur Nasibov, Tomi Kinnunen, Pasi Fränti Speech and Image Processing Unit, University of Eastern Finland, Joensuu, Finland
More informationAdvanced Digital Design
Advanced Digital Design Introduction & Motivation by A. Steininger and M. Delvai Vienna University of Technology Outline Challenges in Digital Design The Role of Time in the Design The Fundamental Design
More informationMining for Statistical Models of Availability in Large-Scale Distributed Systems: An Empirical Study of
Mining for Statistical Models of Availability in Large-Scale Distributed Systems: An Empirical Study of SETI@home Bahman Javadi 1, Derrick Kondo 1, Jean-Marc Vincent 1,2, David P. Anderson 3 1 Laboratoire
More informationAM Antenna Computer Modeling Course
AM Antenna Computer Modeling Course Course Description The FCC now permits moment method computer modeling of many AM directional arrays as an alternative to traditional cut-and-try adjustments and field
More informationNotes 15: Concatenated Codes, Turbo Codes and Iterative Processing
16.548 Notes 15: Concatenated Codes, Turbo Codes and Iterative Processing Outline! Introduction " Pushing the Bounds on Channel Capacity " Theory of Iterative Decoding " Recursive Convolutional Coding
More information