Statistical Machine Translation: Phrase-Based Statistical MT


Statistical Machine Translation
Phrase-Based Statistical MT

Jörg Tiedemann (jorg.tiedemann@lingfil.uu.se)
Department of Linguistics and Philology, Uppsala University
October 2009

Probabilistic view on MT (E = target language, F = source language):

    Ê = argmax_E P(E|F) = argmax_E P(F|E) P(E)

Statistical Machine Translation: Language Modeling

Language modeling: a (probabilistic) LM predicts the likelihood of a given string.
- What is the likelihood P(E) of observing sentence E? Exactly what we need!
- Estimate probabilities from corpora: decompose into N-grams!
  - unigram model: P(E) = P(e_1) P(e_2) ... P(e_n)
  - bigram model: P(E) = P(e_1) P(e_2|e_1) P(e_3|e_2) ... P(e_n|e_{n-1})
  - trigram model: P(E) = P(e_1) P(e_2|e_1) P(e_3|e_1, e_2) ... P(e_n|e_{n-2}, e_{n-1})

There would be much more to say about language modeling...

Motivation for Phrase-based SMT

Word-based SMT:
- statistical word alignment: P(F|E)
- language modeling: P(E)
- global decoding: argmax_E P(F|E) P(E)

Word-by-word translation is too weak!
- contextual dependencies
- non-compositional constructions
- n:m relations
→ look at larger chunks!
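To make the N-gram decomposition concrete, here is a minimal bigram model with maximum likelihood estimation, written as a Python sketch (the function names and the <s>/</s> boundary markers are my own choices; a real LM would add smoothing and back-off, which are omitted here):

    from collections import Counter

    def train_bigram_lm(sentences):
        """MLE bigram model: P(w2|w1) = count(w1, w2) / count(w1)."""
        unigrams, bigrams = Counter(), Counter()
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            unigrams.update(tokens[:-1])          # each token once as history
            bigrams.update(zip(tokens[:-1], tokens[1:]))
        return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

    def sentence_prob(sent, model):
        """P(E) as the product of bigram probabilities (0 if unseen)."""
        tokens = ["<s>"] + sent + ["</s>"]
        p = 1.0
        for bg in zip(tokens[:-1], tokens[1:]):
            p *= model.get(bg, 0.0)
        return p

Multiplying the bigram probabilities of a tokenized sentence then gives exactly the P(E) that enters the noisy-channel objective above.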

Phrase-based SMT

Motivation:
- phrases = word N-grams
- less ambiguity, more context in the translation table
- handle non-compositional expressions
- local reorderings covered by phrase translations
- "distortion": reordering on the phrase level

Translation model in PSMT:

    P(F|E) = ∏_{i=1}^{I} φ(f_i|e_i) d(start_i, end_{i-1})

- phrases are extracted from word-aligned parallel corpora
- phrase translation probabilities (MLE):

    φ(f|e) = count(f, e) / Σ_{f'} count(f', e)

Moses toolkit: http://www.statmt.org/moses/

Statistical word alignment

Phrase translation probabilities:
- need phrase alignments in the parallel corpus
- induce them from word alignments (IBM models)
- score extracted phrases (MLE)

Standard models: IBM models 1-5 (cascaded), EM training; final parameters:
- word translation probabilities (lexical model)
- fertility probabilities
- distortion probabilities (reordering)

Viterbi alignment: assign the most likely links between words according to the statistical word alignment model above.
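The EM idea behind the IBM models can be illustrated with Model 1, which has only word translation probabilities (no fertility or distortion). A minimal sketch, assuming plain Python dictionaries rather than GIZA++'s actual data structures:

    from collections import defaultdict

    def train_ibm1(corpus, iterations=10):
        """EM training of IBM Model 1 translation probabilities t(f|e).
        corpus: list of (source_tokens, target_tokens) sentence pairs."""
        t = defaultdict(lambda: 1.0)    # uniform start
        for _ in range(iterations):
            count = defaultdict(float)  # expected counts c(f, e)
            total = defaultdict(float)  # expected counts c(e)
            for f_sent, e_sent in corpus:
                e_null = ["NULL"] + e_sent   # allow unaligned source words
                for f in f_sent:
                    # E-step: distribute each f over all target positions
                    z = sum(t[(f, e)] for e in e_null)
                    for e in e_null:
                        p = t[(f, e)] / z
                        count[(f, e)] += p
                        total[e] += p
            # M-step: renormalize expected counts into probabilities
            t = defaultdict(float,
                            {fe: c / total[fe[1]] for fe, c in count.items()})
        return t

    def viterbi_align(f_sent, e_sent, t):
        """Most likely target position (0 = NULL) for each source word."""
        e_null = ["NULL"] + e_sent
        return [max(range(len(e_null)), key=lambda j: t[(f, e_null[j])])
                for f in f_sent]

The Viterbi alignment mentioned on the slide is then just the argmax link per source word under the trained model, as in viterbi_align.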

Viterbi Word Alignment

- special NULL word (NULL → la)
- EMPTY alignment possible (did)
- only 1:many (slap), not many:1, depending on the alignment direction
- Alignment tool: GIZA++ (http://code.google.com/p/giza-pp/)

Viterbi Word Alignment from GIZA++

From the German-English Europarl corpus:

    # Sentence pair (5) source length 12 target length 11 alignment score : 2.14036e-24
    ich bitte sie , sich zu einer schweigeminute zu erheben .
    NULL ({ }) please ({ 1 2 3 }) rise ({ }) , ({ 4 }) then ({ 5 }) , ({ }) for ({ 6 }) this ({ 7 }) minute ({ 8 }) ' ({ }) s ({ }) silence ({ 9 10 }) . ({ 11 })

    # Sentence pair (6) source length 12 target length 10 alignment score : 3.38628e-15
    ( das parlament erhebt sich zu einer schweigeminute . )
    NULL ({ }) ( ({ 1 }) the ({ 2 }) house ({ 3 }) rose ({ 4 5 }) and ({ }) observed ({ 6 }) a ({ 7 }) minute ({ 8 }) ' ({ }) s ({ }) silence ({ 9 }) ) ({ 10 })

Word Alignment Symmetrization

Source-to-target word alignment is asymmetric:
- no n:1 alignments
- we can run the IBM models in both directions!
- different links in source-to-target and target-to-source
- best alignment = merge both directions (?!)
- How? Symmetrization heuristics!
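One common heuristic, illustrated on the following slides, starts from the intersection of the two directions and grows it with neighbouring links from the union. A simplified sketch of this grow-diag idea (the Moses implementation adds final-step variants not shown here):

    def grow_diag(src2trg, trg2src):
        """Symmetrize two directional alignments (sets of (i, j) links,
        source position i, target position j)."""
        union = src2trg | trg2src
        alignment = set(src2trg & trg2src)   # high-precision intersection
        added = True
        while added:
            added = False
            for (i, j) in sorted(union - alignment):
                i_aligned = any(a == i for a, _ in alignment)
                j_aligned = any(b == j for _, b in alignment)
                # add a union link only if one of its words is still
                # unaligned and it neighbours an accepted link
                if (not i_aligned or not j_aligned) and any(
                        (i + di, j + dj) in alignment
                        for di in (-1, 0, 1) for dj in (-1, 0, 1)):
                    alignment.add((i, j))
                    added = True
        return alignment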

Word Alignment Symmetrization

- target-to-source word alignment (alignment matrix on the slide)
- symmetrized word alignment (alignment matrix on the slide)
- start with the intersection, add adjacent links (from the union)...

Phrase extraction

Get ALL phrase pairs that are consistent with the word alignment (a sketch of the extraction loop follows below).
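A minimal sketch of that extraction loop, using the standard consistency criterion (the full algorithm also extends phrase pairs over unaligned boundary words, which is omitted here):

    def extract_phrases(f_sent, e_sent, alignment, max_len=7):
        """All phrase pairs consistent with the word alignment.
        alignment: set of (i, j) links (source position i, target
        position j, 0-based). A span pair is consistent if no word
        inside it is aligned to a word outside it; unaligned words
        may be freely included."""
        phrases = []
        for e_start in range(len(e_sent)):
            for e_end in range(e_start, min(e_start + max_len, len(e_sent))):
                # minimal source span covering all links of the target span
                f_links = [i for (i, j) in alignment if e_start <= j <= e_end]
                if not f_links:
                    continue
                f_start, f_end = min(f_links), max(f_links)
                if f_end - f_start + 1 > max_len:
                    continue
                # consistency check: no link leaves the span pair
                if any(f_start <= i <= f_end and not (e_start <= j <= e_end)
                       for (i, j) in alignment):
                    continue
                phrases.append((" ".join(f_sent[f_start:f_end + 1]),
                                " ".join(e_sent[e_start:e_end + 1])))
        return phrases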

Phrase extraction: step-by-step worked example over several slides (figures).

Scoring phrases

Simple maximum likelihood estimation:

    φ(f|e) = count(f, e) / Σ_{f'} count(f', e)

Phrase tables

Examples from a phrase table (Pirates of the Caribbean). A huge phrase table! (with a lot of garbage?)

    Swedish                         English                   Score
    , det är                        , it 's                   0.666667
    , det är                        , that 's                 1
    att bli besvikna                be disappointed           1
    att bli en själv                to becoming one           1
    bara vi                         just                      0.1
    bara                            just                      0.6
    bara                            only                      0.375
    barbossa och hans besättning    barbossa and his crew     1
    barbossa och hans               barbossa and his          1
    barbossa tänker göra . allt     barbossa is up to ... all 1

(The training set was too small to get reasonable counts!)
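Given the phrase pairs collected from the whole corpus, these scores reduce to relative frequencies. A sketch computing both conditional directions (names are my own; Moses additionally computes lexical weights, not shown):

    from collections import Counter

    def score_phrase_table(phrase_pairs):
        """Relative-frequency scores phi(f|e) and phi(e|f) over all
        extracted (f_phrase, e_phrase) pairs."""
        phrase_pairs = list(phrase_pairs)
        pair_count = Counter(phrase_pairs)
        f_count = Counter(f for f, _ in phrase_pairs)
        e_count = Counter(e for _, e in phrase_pairs)
        phi_f_given_e = {fe: c / e_count[fe[1]]
                         for fe, c in pair_count.items()}
        phi_e_given_f = {fe: c / f_count[fe[0]]
                         for fe, c in pair_count.items()}
        return phi_f_given_e, phi_e_given_f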

The final model for PB-SMT

    Ê = argmax_E P(E|F)
      = argmax_E ( ∏_{i=1}^{I} φ(f_i|e_i) d(start_i, end_{i-1}) · P(E) · ω^{length(E)} )

Distortion d: chance to move phrases to other positions
- fixed distortion limit (e.g. 6)
- simple penalty for moving: α^{|start_i - end_{i-1} - 1|}
- OR lexicalized distortion (learned from the alignment)

Word cost: ω^{length(E)} = bias for longer output.

PB-SMT extension: Log-linear Models

Instead of the noisy-channel model Ê = argmax_E P(F|E) P(E), model the posterior directly: Ê = argmax_E P(E|F). Many feature functions h_m(E, F) may influence P(E|F):
- phrase translation model E → F
- phrase translation model F → E
- lexical weights from the underlying word alignment
- a language model P(E)
- lexicalized reordering model
- length features (word/phrase costs/penalties)

P(E|F) = weighted combination of feature functions!

P(E|F) is a weighted (λ_m) combination of the feature functions (h_m):

    P(E|F) = exp( Σ_{m=1}^{M} λ_m h_m(E, F) ) / Z

    Ê = argmax_E P(E|F) = argmax_E log P(E|F) = argmax_E Σ_{m=1}^{M} λ_m h_m(E, F)

How to learn the weights λ_m?
- Minimum error rate training (MERT) on a development set!
- Measure the error in terms of BLEU scores (n-best list)
- Iterative adjustment of the model parameters (slow but effective!)

Phrase table with multiple scores

That's what you will get from Moses:

    Swedish            English           Scores
    , det är           , it 's           0.6667 0.0959975 0.6667 0.0263227 2.718
    att bli besvikna   be disappointed   1 0.0221815 1 0.105472 2.718
    att bli en själv   to becoming one   1 0.00896375 1 0.00157689 2.718
    bara vi            just              0.1 0.0102041 1 0.128968 2.718
    bara               just              0.6 0.285714 0.6 0.25 2.718
    bara               naught but        1 0.268518 0.1 0.00195312 2.718
    bara               only              0.375 0.222222 0.3 0.125 2.718

The five scores per entry:
- phrase translation probability φ(f|e)
- lexical weighting lex(f|e)
- phrase translation probability φ(e|f)
- lexical weighting lex(e|f)
- phrase penalty (always exp(1) ≈ 2.718)
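Since Z is constant for a fixed source sentence F, decoders simply compare the weighted feature sums. A minimal sketch of this scoring (the feature names, weights, and extra values below are hypothetical, loosely modelled on the Moses score types listed above):

    import math

    def loglinear_score(features, weights):
        """Unnormalized log-linear score: sum_m lambda_m * h_m(E, F).
        Z can be ignored when comparing translations of the same F."""
        return sum(weights[name] * h for name, h in features.items())

    # hypothetical feature values for one candidate translation,
    # in the log domain (first four numbers taken from the table row
    # ", det är / , it 's"; lm and word_penalty are made up)
    features = {
        "phi_f_given_e": math.log(0.6667),
        "lex_f_given_e": math.log(0.0959975),
        "phi_e_given_f": math.log(0.6667),
        "lex_e_given_f": math.log(0.0263227),
        "phrase_penalty": 1.0,   # exp(1) in the table = 1 in log space
        "lm": math.log(1e-4),    # language model P(E)
        "word_penalty": -3.0,    # length feature
    }
    weights = {name: 1.0 for name in features}  # uniform, i.e. before MERT
    print(loglinear_score(features, weights))

MERT then adjusts the weights so that the highest-scoring candidates in the n-best lists maximize BLEU on the development set.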

Translation = decoding

Global search: Ê = argmax_E P(E|F)

    Maria no dio una bofetada a la bruja verde

- many translation alternatives (huge phrase table)
- many ways to segment the words into phrases
- re-ordering makes it even more complex

Very expensive! We need search heuristics:
- pruning (discard weak hypotheses early)
- stack decoding (histograms & thresholds)
- reordering limits

Build the translation left-to-right:
- select a foreign word to be translated
- select a translation in the phrase table
- add the translation to the partial translation (hypothesis)

    Maria no dio una bofetada a la bruja verde
    Mary

mark the first (foreign) word as translated

    Maria no dio una bofetada a la bruja verde
    Mary did not

one-to-many translation ("no" → "did not")

    Maria no dio una bofetada a la bruja verde
    Mary did not slap

many-to-one translation ("dio una bofetada" → "slap")

    Maria no dio una bofetada a la bruja verde
    Mary did not slap the

many-to-one translation ("a la" → "the")

    Maria no dio una bofetada a la bruja verde
    Mary did not slap the green

example for re-ordering ("verde" translated before "bruja")

Lattice of translation options (figure on the slide)

    Maria no dio una bofetada a la bruja verde
    Mary did not slap the green witch

translation finished

Hypothesis expansion

... and continue adding more hypotheses: exponential explosion of the search space!

Hypothesis Stacks

- here: stacks organized by the number of foreign words translated
- during translation, expand all hypotheses from one stack
- place the expanded hypotheses into the appropriate stacks
- get an n-best list of translations
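To tie the decoding slides together, here is a toy monotone stack decoder with coverage-based stacks and histogram pruning (a sketch under strong simplifications: no reordering, no language model, no threshold pruning; the mini phrase table and its log probabilities are hypothetical):

    import heapq
    from collections import namedtuple

    Hypothesis = namedtuple("Hypothesis", "covered output score")

    def stack_decode(f_sent, phrase_table, stack_size=10, max_phrase_len=3):
        """Monotone stack decoding: one stack per number of covered
        source words. phrase_table maps source-phrase tuples to lists
        of (target_phrase, log_prob) options."""
        n = len(f_sent)
        stacks = [[] for _ in range(n + 1)]
        stacks[0].append(Hypothesis(covered=0, output="", score=0.0))
        for k in range(n):
            # histogram pruning: keep only the best hypotheses per stack
            stacks[k] = heapq.nlargest(stack_size, stacks[k],
                                       key=lambda h: h.score)
            for hyp in stacks[k]:
                # expand by translating the next untranslated source phrase
                for l in range(1, min(max_phrase_len, n - hyp.covered) + 1):
                    f_phrase = tuple(f_sent[hyp.covered:hyp.covered + l])
                    for e_phrase, logp in phrase_table.get(f_phrase, []):
                        new = Hypothesis(hyp.covered + l,
                                         (hyp.output + " " + e_phrase).strip(),
                                         hyp.score + logp)
                        stacks[new.covered].append(new)
        # n-best list from the stack of complete hypotheses
        best = heapq.nlargest(stack_size, stacks[n], key=lambda h: h.score)
        return [(h.output, h.score) for h in best]

    # hypothetical toy phrase table (log probabilities)
    phrase_table = {
        ("maria",): [("mary", -0.1)],
        ("no",): [("did not", -0.4)],
        ("dio", "una", "bofetada"): [("slap", -0.7)],
    }
    print(stack_decode(["maria", "no", "dio", "una", "bofetada"], phrase_table))

A real decoder additionally tracks arbitrary coverage bitmaps (for reordering), adds language model and distortion scores, and combines histogram pruning with score thresholds, as on the slides.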

Summary PB-SMT

- phrase-based SMT = state-of-the-art in data-driven MT (?!)
- based on standard word alignment models
- phrase extraction heuristics & simple scoring
- simplistic re-ordering model
- huge phrase table = big memory of fragment translations
- heuristics for efficient decoding

Active research area! New developments all the time!

More information: homepage of the Moses toolkit, http://www.statmt.org/moses/

What's next?

Next lab session:
- build your own parallel corpus
- sentence & word alignment

Lecture:
- a quick look at other topics
- course summary

Last lab session:
- small-scale experiments with PB-SMT (Moses)
- basic training, evaluation
- shifting domains