The revolution of the empiricists: Machine Translation (Word alignment & Statistical MT)


The revolution of the empiricists

Machine Translation: Word alignment & Statistical MT
Jörg Tiedemann
Department of Linguistics and Philology, Uppsala University

Classical approaches require lots of manual work:
- long development times
- low coverage, not robust
- disambiguation at various levels
- slow!

Learn from translation data instead:
- example databases for CAT and MT
- bilingual lexicon/terminology extraction
- statistical translation models

Jörg Tiedemann 1/37

Machine Translation as Search

    Maria no daba una bofetada a la bruja verde
    [figure: lattice of partial translation options, e.g. "Mary", "not", "did not", "no", "give a slap", "a slap", "slap", "to the", "by", "the", "the witch", "green witch", ...]

Look at all alternatives and find the best (most likely) one.

Motivation for Data-Driven MT

How do we learn to translate?
- grammar vs. examples
- teacher vs. practice
- intuition vs. experience

Is it possible to create an MT engine without any human effort?
- no writing of grammar rules
- no bilingual lexicography
- no writing of preference & disambiguation rules

Motivation for Data-Driven MT

Learning to translate:
- there is a lot of translated material out there (collect it all)
- learn common word/phrase translations from this collection
- look at typical sentences in the target language
- learn how to write a sentence in the target language

Translation:
- try various translations of words/phrases in the given sentence
- put them together, shuffle them around
- check which translation candidate looks best

Learning from Parallel Corpora

Word alignment: required for many purposes!

Translating: Example-Based MT

The classical example (Sato & Nagao, 1990)

translate: He buys a book on international politics.

examples:
1. He buys a notebook.
   Kare wa nōto o kau.
   HE topic NOTEBOOK obj BUY.
2. I read a book on international politics.
   Watashi wa kokusai seiji nitsuite kakareta hon o yomu.
   I topic INTERNATIONAL POLITICS ABOUT CONCERNED BOOK obj READ.

output: Kare wa kokusai seiji nitsuite kakareta hon o kau.

Components of Example-Based MT

Issues to be addressed:
- acquiring & storing large example databases
- matching suitable example fragments
- alignment of fragments to the target language
- recombination of translated fragments
- ranking of alternative solutions

Can we do this in a formal computational model?

Statistical Machine Translation

Noisy channel for MT: what could have been the sentence that generated the observed source-language sentence?

Ideas borrowed from speech recognition... what a strange idea!
(Thanks to Markus Saers for this and other pictures)

A brief history of Statistical Machine Translation

1947/49: "When I look at an article in Russian, I say: This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode." (Warren Weaver)
1988: first SMT publication (IBM Candide); "Every time I fire a linguist the performance goes up" (Fred Jelinek)
1999: Johns Hopkins University summer workshop: Egypt toolkit (GIZA + Cairo)
2003: phrase-based SMT
2004: Pharaoh: phrase-based decoder
2007: Moses: open-source toolbox; factored, phrase-based SMT, hierarchical SMT, ...

Probabilistic view on MT (E = target language, F = source language):

    Ê = argmax_E P(E|F) = argmax_E P(F|E) P(E)
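The decision rule Ê = argmax_E P(F|E) P(E) can be sketched with made-up numbers. All scores below are invented for illustration; the point is only how a translation-model score and a language-model score combine, and how the language model rewards fluent English:

```python
# Toy illustration of the noisy-channel decision rule
# E_hat = argmax_E P(F|E) * P(E), with invented probabilities.

# Hypothetical translation-model scores P(F|E) for one fixed Spanish input F
tm = {
    "mary did not slap the green witch": 0.004,
    "mary not give a slap to the witch green": 0.009,
}

# Hypothetical language-model scores P(E): fluent English gets a higher P(E)
lm = {
    "mary did not slap the green witch": 0.02,
    "mary not give a slap to the witch green": 0.0001,
}

def best_translation(candidates):
    """Return the candidate E maximizing P(F|E) * P(E)."""
    return max(candidates, key=lambda e: tm[e] * lm[e])

print(best_translation(tm.keys()))
```

Even though the word-for-word gloss has a higher translation-model score here, the language model sinks it, so the fluent candidate wins.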

Statistical Machine Translation

Informally: model translation as an optimization (search) problem
- look for the most likely translation E for a given input F
- use a probabilistic model that assigns these conditional likelihoods
- use Bayes' theorem to split the model into 2 parts:
  - a language model (for the target language)
  - a translation model (source language given target language)

Some (very) basic concepts of probability theory

- a probability P(X) maps event X to a number between 0 and 1
- P(X) represents the likelihood of observing event X in some kind of experiment (trial)
- discrete probability distribution: Σ_i P(X = x_i) = 1
- P(X|Y) = conditional probability (likelihood of event X given that event Y has been observed before)
- joint probability P(X,Y) (likelihood of seeing both events)
- P(X,Y) = P(X) P(Y|X) = P(Y) P(X|Y), therefore:

    Bayes' theorem: P(X|Y) = P(X) P(Y|X) / P(Y)

Some quick words on probability theory & statistics

Where do the probabilities come from? Experience! Use experiments (and repeat them often...).

Maximum likelihood estimation (rely on N experiments only):

    P(X) ≈ count(X) / N

For conditional probabilities:

    P(X|Y) = P(X,Y) / P(Y) ≈ (count(X,Y)/N) / (count(Y)/N) = count(X,Y) / count(Y)

Also important: marginalizing out joint probabilities:

    Σ_i P(X,Y_i) = Σ_i P(X) P(Y_i|X) = P(X) Σ_i P(Y_i|X) = P(X)

... more details in Matematik för språkteknologer

(Classical) Statistical Machine Translation

    Ê = argmax_E P(E|F) = argmax_E P(F|E) P(E) / P(F) = argmax_E P(F|E) P(E)

Translation model: P(F|E), estimated from (big) parallel corpora
Language model: P(E), estimated from (huge) monolingual target-language corpora; takes care of fluency
Decoder: global search for argmax_E P(F|E) P(E) for a given sentence F
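Maximum likelihood estimation and Bayes' theorem can be checked numerically on counts. A minimal sketch; the event pairs are a made-up toy sample, not anything from the lecture:

```python
from collections import Counter

# MLE from joint counts, plus a numeric check of Bayes' theorem:
# P(X|Y) = P(X) * P(Y|X) / P(Y). Toy data: N = 6 observed (X, Y) pairs.
pairs = [("rain", "wet"), ("rain", "wet"), ("rain", "dry"),
         ("sun", "dry"), ("sun", "dry"), ("sun", "wet")]

N = len(pairs)
joint = Counter(pairs)                    # count(X, Y)
x_counts = Counter(x for x, _ in pairs)   # count(X)
y_counts = Counter(y for _, y in pairs)   # count(Y)

def p_x(x): return x_counts[x] / N
def p_y(y): return y_counts[y] / N
def p_y_given_x(y, x): return joint[(x, y)] / x_counts[x]
def p_x_given_y(x, y): return joint[(x, y)] / y_counts[y]

# Both sides of Bayes' theorem agree on the same counts
lhs = p_x_given_y("rain", "wet")
rhs = p_x("rain") * p_y_given_x("wet", "rain") / p_y("wet")
assert abs(lhs - rhs) < 1e-12
```

Note how the conditional estimate count(X,Y)/count(Y) appears directly in `p_x_given_y`, exactly as in the MLE derivation above.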

Statistical Machine Translation

Training: how do we get the probabilistic model?
- estimating probabilities from training data (machine learning)
- tuning model/learner parameters using development sets

Decoding: how do we find the most likely translation?
- search problem (argmax), assuming our model is correct
- gigantic search space! optimization & heuristics

Word-based SMT models

Why do we need word alignment? We cannot estimate P(F|E) directly... Why not?
- almost all sentences are unique
- sparse counts! no good estimates
- decompose into smaller chunks!

Word-based model: assume that the words in one language have been generated by the words in another; a (hidden) word alignment explains this process.

Word-based Translation Models

What do we need to estimate model parameters?
- lexical translation
- distortion/re-ordering
- fertility
- NULL insertion

We need a word-aligned parallel corpus!
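Once a word-aligned corpus is available, the lexical translation parameters reduce to relative frequencies, just like the MLE estimates earlier. A minimal sketch; the corpus of alignment links is invented, and the alignments are assumed to be given:

```python
from collections import Counter

# Estimating lexical translation probabilities t(f|e) from a toy
# word-aligned corpus: each item is one (source word f, target word e)
# alignment link. The data is invented for illustration.
aligned_links = [
    ("das", "the"), ("Haus", "house"), ("ist", "is"), ("klein", "small"),
    ("das", "the"), ("Buch", "book"), ("ist", "is"), ("klein", "small"),
    ("dem", "the"), ("Mann", "man"),
]

link_counts = Counter(aligned_links)
e_counts = Counter(e for _, e in aligned_links)

def t(f, e):
    """MLE lexical translation probability t(f|e) = count(f,e) / count(e)."""
    return link_counts[(f, e)] / e_counts[e]

print(t("das", "the"))  # "the" links to "das" twice and to "dem" once
```

This is exactly the count(X,Y)/count(Y) recipe from the probability slides, applied to alignment links instead of abstract events.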

Word alignment

How do we formalize word alignment? A simple example:

    das Haus ist klein
    the house is small

Define an alignment function a based on positions:

    a : {1→1, 2→2, 3→3, 4→4}

Natural languages are not that easy...
- not always a 1:1 relation between words
- some words may be dropped
- word order can be quite different

Example of reordering:

    klein ist das Haus
    the house is small

What does the alignment function look like?

    a : {1→3, 2→4, 3→2, 4→1}

One-to-many alignments:

    huset är jättelitet
    the house is very small

    a : {1→1, 2→1, 3→2, 4→3, 5→3}
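The alignment functions above map each target (English) word position to a source word position, and they are easy to represent directly. A small sketch; the helper and its names are mine:

```python
# Representing the slide's alignment function a as a dict from target
# positions to source positions (1-based; 0 is reserved for NULL).

def align(source, target, a):
    """List (target word, aligned source word) pairs; position 0 = NULL."""
    return [(target[j - 1], source[a[j] - 1] if a[j] > 0 else "NULL")
            for j in sorted(a)]

# Reordering example: "klein ist das Haus" -> "the house is small"
src = ["klein", "ist", "das", "Haus"]
tgt = ["the", "house", "is", "small"]
a = {1: 3, 2: 4, 3: 2, 4: 1}
print(align(src, tgt, a))
```

One-to-many alignments need no extra machinery: several target positions may simply map to the same source position, as in the "jättelitet" example.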

Word alignment

Dropping words ("ganska" has no English counterpart):

    huset är ganska litet
    the house is small

    a : {1→1, 2→1, 3→2, 4→4}

Inserting words ("just" has no Swedish counterpart; it aligns to a special NULL token at position 0):

    NULL huset är litet
    the house is just small

    a : {1→1, 2→1, 3→2, 4→0, 5→3}

Statistical word alignment models

Standard word-based translation models:
- IBM 1: lexical translation probabilities
- IBM 2: adds absolute reordering
- IBM 3: adds fertility
- IBM 4: adds relative reordering

How can we learn model parameters from parallel corpora without explicit word alignment? More about this next time...

Statistical Machine Translation

Remember: Ê = argmax_E P(F|E) P(E)

Aligned parallel corpora give us the translation model P(F|E). What is missing? We still need the language model P(E): standard N-gram language models.
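Of the models listed, IBM Model 1 keeps only the lexical translation parameters: given an alignment a, it scores P(F, a | E) = ε / (l+1)^m · Π_j t(f_j | e_{a_j}), with l = |E|, m = |F|, a uniform alignment prior, and position 0 for NULL. A toy sketch; the lexical table and sentence pair are invented, and here a maps source positions to target positions (the generative direction of P(F|E)):

```python
# Hypothetical IBM-Model-1-style score for a source sentence F under a
# given alignment a: P(F, a | E) = eps / (l+1)^m * prod_j t(f_j | e_{a_j}).
# The lexical table t is invented for illustration.

t = {
    ("huset", "the"): 0.4, ("huset", "house"): 0.5,
    ("är", "is"): 0.9, ("litet", "small"): 0.7,
}

def model1_score(F, E, a, eps=1.0):
    """Score P(F, a | E) with a uniform alignment prior; a[j] = 0 means NULL."""
    l, m = len(E), len(F)
    p = eps / (l + 1) ** m          # uniform choice among l+1 positions per f_j
    for j, f in enumerate(F, start=1):
        e = E[a[j] - 1] if a[j] > 0 else "NULL"
        p *= t.get((f, e), 0.0)     # lexical translation probability
    return p

score = model1_score(["huset", "är", "litet"],
                     ["the", "house", "is", "small"],
                     {1: 2, 2: 3, 3: 4})
```

Models 2 to 4 then replace the uniform alignment prior with learned reordering and fertility parameters.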

Statistical Machine Translation: Language Modeling

Language modeling: a (probabilistic) LM predicts the likelihood of any given string.

What is the likelihood P(E) of observing sentence E?

    P_LM(the house is small) > P_LM(small the is house)

Estimate probabilities from corpora: P(E) = P(e_1, e_2, e_3, ..., e_j)

What is the problem here again? Remember maximum likelihood estimation (MLE):

    P(E) = P(e_1, ..., e_j) ≈ count(e_1, e_2, ..., e_j) / N

Can we estimate reliable probabilities for arbitrary sentences?

Remember P(X,Y) = P(X) P(Y|X); the chain rule:

    P(E) = P(e_1) · P(e_2|e_1) · P(e_3|e_1,e_2) · ... · P(e_j|e_1,...,e_{j-1})

Does this help? Remember MLE for conditional probabilities:

    P(e_j|e_1,...,e_{j-1}) ≈ count(e_1, e_2, ..., e_j) / count(e_1, e_2, ..., e_{j-1})

Again: what is the problem? Sparse counts for large N-grams!

Markov assumption: limit the dependencies!
(bigram model: P(e_3|e_1,e_2) ≈ P(e_3|e_2))

Compute sentence probabilities based on overlapping N-grams, e.g. for "the house is too small .": the | the house | house is | is too | too small | small .

- unigram model: P(E) = P(e_1) · P(e_2) · ... · P(e_n)
- bigram model: P(E) = P(e_1) · P(e_2|e_1) · P(e_3|e_2) · ... · P(e_n|e_{n-1})
- trigram model: P(E) = P(e_1) · P(e_2|e_1) · P(e_3|e_1,e_2) · ... · P(e_n|e_{n-2},e_{n-1})

What is P_trigram("the house is too small.")?
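The bigram model above is just the chain rule with a first-order Markov assumption, and its MLE version is a few lines of counting. A sketch over an invented three-sentence corpus:

```python
from collections import Counter

# An MLE bigram language model over a toy corpus, applying
# P(e_1 .. e_n) ~= P(e_1) * prod_i P(e_i | e_{i-1}).
corpus = [
    "the house is small".split(),
    "the house is big".split(),
    "the witch is green".split(),
]

unigrams = Counter(w for s in corpus for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
N = sum(unigrams.values())

def p_bigram(sentence):
    """MLE bigram probability of a whitespace-tokenized sentence."""
    words = sentence.split()
    p = unigrams[words[0]] / N                      # P(e_1)
    for prev, w in zip(words, words[1:]):
        p *= bigrams[(prev, w)] / unigrams[prev]    # P(e_i | e_{i-1})
    return p

# A fluent word order outscores a scrambled one
assert p_bigram("the house is small") > p_bigram("small the is house")
```

The scrambled sentence contains the unseen bigram "small the", so its MLE probability collapses to zero, which already hints at the zero-count problem discussed next.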

Statistical Machine Translation: Language Modeling

Another problem: zero counts!
- some N-grams are never observed (count(e_i, e_j) = 0)
- ... but they appear in real data (e.g. in a translation candidate)
- what happens if we need the probability for ... e_i e_j ...?
- multiplying with one zero factor makes everything zero. BAD IDEA! This must be avoided!

Solutions:
- smoothing (reserve probability mass for unseen events)
- backoff models (back off from higher-order models to lower-order models)

... there would be so much more to say about LMs (see ch. 7)

Statistical Machine Translation: Decoding

    Maria no daba una bofetada a la bruja verde
    [figure: lattice of partial translation options, e.g. "Mary", "not", "did not", "no", "give a slap", "a slap", "slap", "to the", "by", "the", "the witch", "green witch", ...]

    Ê = argmax_E P(F|E) P(E)

Decoding = searching for a solution Ê given F using Ê = argmax_E P(F|E) P(E)
- far too many possible E's to search globally!
- approximate search using good partial candidates
- more about this later...

Summary

- MT can be put into a probabilistic framework
- translation models: estimated from parallel corpora
- language models: estimated from monolingual corpora
- global search = decoding = translating
- fully automatic (!!!)
- various simplifications/assumptions necessary
- a probabilistic variant of direct translation
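The smoothing idea can be made concrete with add-one (Laplace) smoothing, one simple way to reserve probability mass for unseen events. A sketch on invented toy data; this is only one of many smoothing schemes:

```python
from collections import Counter

# Add-one (Laplace) smoothing for bigram probabilities: reserve
# probability mass for unseen events so that a single zero count does
# not zero out the whole sentence probability. Toy corpus, invented.
corpus = ["the house is small".split(), "the house is big".split()]
vocab = {w for s in corpus for w in s}
V = len(vocab)

unigrams = Counter(w for s in corpus for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))

def p_mle(w, prev):
    """Unsmoothed MLE estimate P(w|prev); zero for unseen bigrams."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_laplace(w, prev):
    """Add-one smoothed estimate: (count(prev,w) + 1) / (count(prev) + V)."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

# The bigram "small the" never occurs in the corpus:
assert p_mle("the", "small") == 0.0
assert p_laplace("the", "small") > 0.0
```

Backoff models take a different route to the same goal: instead of inflating all counts, they fall back to a lower-order estimate (e.g. the unigram P(w)) when the higher-order count is missing.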

What's next?
- statistical word alignment
- phrase-based SMT
- decoding


More information

Convolutional neural networks

Convolutional neural networks Convolutional neural networks Themes Curriculum: Ch 9.1, 9.2 and http://cs231n.github.io/convolutionalnetworks/ The simple motivation and idea How it s done Receptive field Pooling Dilated convolutions

More information

The Automatic Classification Problem. Perceptrons, SVMs, and Friends: Some Discriminative Models for Classification

The Automatic Classification Problem. Perceptrons, SVMs, and Friends: Some Discriminative Models for Classification Perceptrons, SVMs, and Friends: Some Discriminative Models for Classification Parallel to AIMA 8., 8., 8.6.3, 8.9 The Automatic Classification Problem Assign object/event or sequence of objects/events

More information

L19: Prosodic modification of speech

L19: Prosodic modification of speech L19: Prosodic modification of speech Time-domain pitch synchronous overlap add (TD-PSOLA) Linear-prediction PSOLA Frequency-domain PSOLA Sinusoidal models Harmonic + noise models STRAIGHT This lecture

More information

Computing Science (CMPUT) 496

Computing Science (CMPUT) 496 Computing Science (CMPUT) 496 Search, Knowledge, and Simulations Martin Müller Department of Computing Science University of Alberta mmueller@ualberta.ca Winter 2017 Part IV Knowledge 496 Today - Mar 9

More information

Lecture 1 What is AI? EECS 348 Intro to Artificial Intelligence Doug Downey

Lecture 1 What is AI? EECS 348 Intro to Artificial Intelligence Doug Downey Lecture 1 What is AI? EECS 348 Intro to Artificial Intelligence Doug Downey Outline 1) What is AI: The Course 2) What is AI: The Field 3) Why to take the class (or not) 4) A Brief History of AI 5) Predict

More information

1 This work was partially supported by NSF Grant No. CCR , and by the URI International Engineering Program.

1 This work was partially supported by NSF Grant No. CCR , and by the URI International Engineering Program. Combined Error Correcting and Compressing Codes Extended Summary Thomas Wenisch Peter F. Swaszek Augustus K. Uht 1 University of Rhode Island, Kingston RI Submitted to International Symposium on Information

More information

Can Linguistics Lead a Digital Revolution in the Humanities?

Can Linguistics Lead a Digital Revolution in the Humanities? Can Linguistics Lead a Digital Revolution in the Humanities? Martin Wynne Martin.wynne@it.ox.ac.uk Digital Humanities Seminar Oxford e-research Centre & IT Services (formerly OUCS) & Nottingham Wednesday

More information

arxiv: v2 [eess.sp] 10 Sep 2018

arxiv: v2 [eess.sp] 10 Sep 2018 Designing communication systems via iterative improvement: error correction coding with Bayes decoder and codebook optimized for source symbol error arxiv:1805.07429v2 [eess.sp] 10 Sep 2018 Chai Wah Wu

More information

The indirect object (IO) tells us where the direct object (DO) is going.

The indirect object (IO) tells us where the direct object (DO) is going. Indirect Object Pronouns The indirect object (IO) tells us where the direct object (DO) is going. He gives the book to María. DO=Book Where is the book going? To María. IO=María He gives María the book.

More information

Teddy Mantoro.

Teddy Mantoro. Teddy Mantoro Email: teddy@ieee.org 1. Title and Abstract 2. AI Method 3. Induction Approach 4. Writing Abstract 5. Writing Introduction What should be in the title: Problem, Method and Result The title

More information

Learning Structured Predictors

Learning Structured Predictors Learning Structured Predictors Xavier Carreras Xerox Research Centre Europe Supervised (Structured) Prediction Learning to predict: given training data { (x (1), y (1) ), (x (2), y (2) ),..., (x (m), y

More information

Learning Artificial Intelligence in Large-Scale Video Games

Learning Artificial Intelligence in Large-Scale Video Games Learning Artificial Intelligence in Large-Scale Video Games A First Case Study with Hearthstone: Heroes of WarCraft Master Thesis Submitted for the Degree of MSc in Computer Science & Engineering Author

More information

A Brief Introduction to Information Theory and Lossless Coding

A Brief Introduction to Information Theory and Lossless Coding A Brief Introduction to Information Theory and Lossless Coding 1 INTRODUCTION This document is intended as a guide to students studying 4C8 who have had no prior exposure to information theory. All of

More information

INTENSIVE READING & VOCABULARY STUDY JOURNAL

INTENSIVE READING & VOCABULARY STUDY JOURNAL INTENSIVE READING & VOCABULARY STUDY JOURNAL 1. What article or collection of abstracts have you chosen? Please give details such as the name of the journal, volume and issue numbers, URL, etc. WP document

More information

Midterm for Name: Good luck! Midterm page 1 of 9

Midterm for Name: Good luck! Midterm page 1 of 9 Midterm for 6.864 Name: 40 30 30 30 Good luck! 6.864 Midterm page 1 of 9 Part #1 10% We define a PCFG where the non-terminals are {S, NP, V P, V t, NN, P P, IN}, the terminal symbols are {Mary,ran,home,with,John},

More information

Announcements. Today. Speech and Language. State Path Trellis. HMMs: MLE Queries. Introduction to Artificial Intelligence. V22.

Announcements. Today. Speech and Language. State Path Trellis. HMMs: MLE Queries. Introduction to Artificial Intelligence. V22. Introduction to Artificial Intelligence Announcements V22.0472-001 Fall 2009 Lecture 19: Speech Recognition & Viterbi Decoding Rob Fergus Dept of Computer Science, Courant Institute, NYU Slides from John

More information

Mathematics Explorers Club Fall 2012 Number Theory and Cryptography

Mathematics Explorers Club Fall 2012 Number Theory and Cryptography Mathematics Explorers Club Fall 2012 Number Theory and Cryptography Chapter 0: Introduction Number Theory enjoys a very long history in short, number theory is a study of integers. Mathematicians over

More information

Teddy Mantoro.

Teddy Mantoro. Teddy Mantoro Email: teddy@ieee.org Marshal D Carper Hannah Heath The secret of good writing is rewriting The secret of rewriting is rethinking 1. Title and Abstract 2. AI Method 3. Induction Approach

More information

Frictional Force (32 Points)

Frictional Force (32 Points) Dual-Range Force Sensor Frictional Force (32 Points) Computer 19 Friction is a force that resists motion. It involves objects in contact with each other, and it can be either useful or harmful. Friction

More information

DERIVATION OF TRAPS IN AUDITORY DOMAIN

DERIVATION OF TRAPS IN AUDITORY DOMAIN DERIVATION OF TRAPS IN AUDITORY DOMAIN Petr Motlíček, Doctoral Degree Programme (4) Dept. of Computer Graphics and Multimedia, FIT, BUT E-mail: motlicek@fit.vutbr.cz Supervised by: Dr. Jan Černocký, Prof.

More information

Cheap, Fast and Good Enough: Speech Transcription with Mechanical Turk. Scott Novotney and Chris Callison-Burch 04/02/10

Cheap, Fast and Good Enough: Speech Transcription with Mechanical Turk. Scott Novotney and Chris Callison-Burch 04/02/10 Cheap, Fast and Good Enough: Speech Transcription with Mechanical Turk Scott Novotney and Chris Callison-Burch 04/02/10 Motivation Speech recognition models hunger for data ASR requires thousands of hours

More information

Paper Presentation. Steve Jan. March 5, Virginia Tech. Steve Jan (Virginia Tech) Paper Presentation March 5, / 28

Paper Presentation. Steve Jan. March 5, Virginia Tech. Steve Jan (Virginia Tech) Paper Presentation March 5, / 28 Paper Presentation Steve Jan Virginia Tech March 5, 2015 Steve Jan (Virginia Tech) Paper Presentation March 5, 2015 1 / 28 2 paper to present Nonparametric Multi-group Membership Model for Dynamic Networks,

More information

Signal segmentation and waveform characterization. Biosignal processing, S Autumn 2012

Signal segmentation and waveform characterization. Biosignal processing, S Autumn 2012 Signal segmentation and waveform characterization Biosignal processing, 5173S Autumn 01 Short-time analysis of signals Signal statistics may vary in time: nonstationary how to compute signal characterizations?

More information

Contents 2.1 Basic Concepts of Probability Methods of Assigning Probabilities Principle of Counting - Permutation and Combination 39

Contents 2.1 Basic Concepts of Probability Methods of Assigning Probabilities Principle of Counting - Permutation and Combination 39 CHAPTER 2 PROBABILITY Contents 2.1 Basic Concepts of Probability 38 2.2 Probability of an Event 39 2.3 Methods of Assigning Probabilities 39 2.4 Principle of Counting - Permutation and Combination 39 2.5

More information

Probability (Devore Chapter Two)

Probability (Devore Chapter Two) Probability (Devore Chapter Two) 1016-351-01 Probability Winter 2011-2012 Contents 1 Axiomatic Probability 2 1.1 Outcomes and Events............................... 2 1.2 Rules of Probability................................

More information

Enabling New Speech Driven Services for Mobile Devices: An overview of the ETSI standards activities for Distributed Speech Recognition Front-ends

Enabling New Speech Driven Services for Mobile Devices: An overview of the ETSI standards activities for Distributed Speech Recognition Front-ends Distributed Speech Recognition Enabling New Speech Driven Services for Mobile Devices: An overview of the ETSI standards activities for Distributed Speech Recognition Front-ends David Pearce & Chairman

More information

Joint Distributions, Independence Class 7, Jeremy Orloff and Jonathan Bloom

Joint Distributions, Independence Class 7, Jeremy Orloff and Jonathan Bloom Learning Goals Joint Distributions, Independence Class 7, 8.5 Jeremy Orloff and Jonathan Bloom. Understand what is meant by a joint pmf, pdf and cdf of two random variables. 2. Be able to compute probabilities

More information

Simple Large-scale Relation Extraction from Unstructured Text

Simple Large-scale Relation Extraction from Unstructured Text Simple Large-scale Relation Extraction from Unstructured Text Christos Christodoulopoulos and Arpit Mittal Amazon Research Cambridge Alexa Question Answering Alexa, what books did Carrie Fisher write?

More information

ARGUMENTATION MINING

ARGUMENTATION MINING ARGUMENTATION MINING Marie-Francine Moens joint work with Raquel Mochales Palau and Parisa Kordjamshidi Language Intelligence and Information Retrieval Department of Computer Science KU Leuven, Belgium

More information

Image Forgery. Forgery Detection Using Wavelets

Image Forgery. Forgery Detection Using Wavelets Image Forgery Forgery Detection Using Wavelets Introduction Let's start with a little quiz... Let's start with a little quiz... Can you spot the forgery the below image? Let's start with a little quiz...

More information

Analysis and Improvements of Linear Multi-user user MIMO Precoding Techniques

Analysis and Improvements of Linear Multi-user user MIMO Precoding Techniques 1 Analysis and Improvements of Linear Multi-user user MIMO Precoding Techniques Bin Song and Martin Haardt Outline 2 Multi-user user MIMO System (main topic in phase I and phase II) critical problem Downlink

More information

Progress in the BBN Keyword Search System for the DARPA RATS Program

Progress in the BBN Keyword Search System for the DARPA RATS Program INTERSPEECH 2014 Progress in the BBN Keyword Search System for the DARPA RATS Program Tim Ng 1, Roger Hsiao 1, Le Zhang 1, Damianos Karakos 1, Sri Harish Mallidi 2, Martin Karafiát 3,KarelVeselý 3, Igor

More information

A Bayesian rating system using W-Stein s identity

A Bayesian rating system using W-Stein s identity A Bayesian rating system using W-Stein s identity Ruby Chiu-Hsing Weng Department of Statistics National Chengchi University 2011.12.16 Joint work with C.-J. Lin Ruby Chiu-Hsing Weng (National Chengchi

More information