CRF and Structured Perceptron

CRF and Structured Perceptron. CS 585, Fall 2015 -- Oct. 6. Introduction to Natural Language Processing. http://people.cs.umass.edu/~brenocon/inlp2015/ Brendan O'Connor

Viterbi exercise solution. CRF & structured perceptrons. Thursday: project discussion + midterm review.

Log-linear models (NB, LogReg, HMM, CRF, ...). x: text data; y: proposed class or sequence; $\theta$: feature weights (model parameters); f(x,y): feature extractor, which produces a feature vector.

$p(y \mid x) = \frac{1}{Z} \underbrace{\exp\big(\theta^\top f(x, y)\big)}_{G(y)}$

Decision rule: $\arg\max_{y' \in \text{outputs}(x)} G(y')$. How do we evaluate this for an HMM/CRF? Viterbi!
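A minimal sketch of this scoring and decision rule (my own illustration, not lecture code), assuming sparse dict-based feature vectors and a hypothetical outputs(x) that enumerates the candidate labels:

```python
def goodness(theta, feats):
    """Log-scale goodness: theta^T f(x, y) for a sparse feature dict."""
    return sum(theta.get(name, 0.0) * value for name, value in feats.items())

def predict(theta, x, outputs, extract_features):
    # The normalizer Z is the same for every candidate y, so the argmax ignores it.
    return max(outputs(x), key=lambda y: goodness(theta, extract_features(x, y)))
```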

Things to do with a log-linear model $p(y \mid x) = \frac{1}{Z} \underbrace{\exp\big(\theta^\top f(x, y)\big)}_{G(y)}$, whose pieces are f(x,y), the feature extractor (feature vector); x, the text input; y, the output; and $\theta$, the feature weights.

Decoding/prediction, $\arg\max_{y' \in \text{outputs}(x)} G(y')$: f(x,y) given; x given (just one); y obtained (just one); $\theta$ given.
Parameter learning: f(x,y) given; x given (many pairs); y given (many pairs); $\theta$ obtained.
Feature engineering (human-in-the-loop): f(x,y) fiddled with during experiments; x given (many pairs); y given (many pairs); $\theta$ obtained in each experiment.

[This is a new slide added after the lecture.]

HMM as factor graph: variables $y_1, y_2, y_3$, transition factors $A_1, A_2$, emission factors $B_1, B_2, B_3$.

$p(y, w) = \prod_t p(w_t \mid y_t)\, p(y_{t+1} \mid y_t)$
$\log p(y, w) = \sum_t \big[ \log p(w_t \mid y_t) + \log p(y_t \mid y_{t-1}) \big]$

The goodness G(y) decomposes into emission factor scores $B_t(y_t)$ and transition factor scores $A(y_t, y_{t+1})$. (Additive) Viterbi computes $\arg\max_{y' \in \text{outputs}(x)} G(y')$.
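A minimal additive Viterbi sketch (my own illustration, not the course's starter code), where B[t][k] is the emission factor score of tag k at position t and A[j][k] is the transition factor score from tag j to tag k, both on the log (additive) scale:

```python
def viterbi(A, B, tags):
    """Return the highest-scoring tag sequence under additive factor scores."""
    T = len(B)
    best = [{k: B[0][k] for k in tags}]   # best[t][k]: best score of a path ending in tag k at t
    back = [{}]                           # back[t][k]: previous tag on that best path
    for t in range(1, T):
        best.append({}); back.append({})
        for k in tags:
            prev = max(tags, key=lambda j: best[t - 1][j] + A[j][k])
            back[t][k] = prev
            best[t][k] = best[t - 1][prev] + A[prev][k] + B[t][k]
    last = max(tags, key=lambda k: best[T - 1][k])
    path = [last]
    for t in range(T - 1, 0, -1):         # follow backpointers
        path.append(back[t][path[-1]])
    return list(reversed(path))
```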

Is there a terrible bug in Sutton & McCallum? There's no sum over t in these equations!

[Quoted from the tutorial:] We can write (1.13) more compactly by introducing the concept of feature functions, just as we did for logistic regression in (1.7). Each feature function has the form $f_k(y_t, y_{t-1}, x_t)$. In order to duplicate (1.13), there needs to be one feature $f_{ij}(y, y', x) = 1\{y = i\}\, 1\{y' = j\}$ for each transition (i, j) and one feature $f_{io}(y, y', x) = 1\{y = i\}\, 1\{x = o\}$ for each state-observation pair (i, o). Then we can write an HMM as:

$p(y, x) = \frac{1}{Z} \exp\Big\{ \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \Big\} \quad (1.14)$

Again, equation (1.14) defines exactly the same family of distributions as (1.13).

Definition 1.1. Let Y, X be random vectors, $\theta = \{\lambda_k\} \in \mathbb{R}^K$ be a parameter vector, and $\{f_k(y, y', x_t)\}_{k=1}^{K}$ be a set of real-valued feature functions. Then a linear-chain conditional random field is a distribution $p(y \mid x)$ that takes the form

$p(y \mid x) = \frac{1}{Z(x)} \exp\Big\{ \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \Big\} \quad (1.16)$

where Z(x) is an instance-specific normalization function.

HMM as log-linear. Same factor graph: variables $y_1, y_2, y_3$, transition factors $A_1, A_2$, emission factors $B_1, B_2, B_3$; as before, $p(y, w) = \prod_t p(w_t \mid y_t)\, p(y_{t+1} \mid y_t)$ and $\log p(y, w) = \sum_t \big[ \log p(w_t \mid y_t) + \log p(y_t \mid y_{t-1}) \big]$.

The goodness G(y), written in terms of emission factor scores $B_t(y_t)$ and transition factor scores $A(y_t, y_{t+1})$:

$G(y) = \sum_t \Big[ \sum_{k \in K} \sum_{w \in V} \mu_{w,k}\, 1\{y_t = k \wedge w_t = w\} + \sum_{j,k \in K} \lambda_{j,k}\, 1\{y_t = j \wedge y_{t+1} = k\} \Big]$
$= \sum_t \sum_{i \in \text{allfeats}} \theta_i\, f_{t,i}(y_t, y_{t+1}, w_t)$
$= \sum_{i \in \text{allfeats}} \theta_i\, f_i(w, y), \quad \text{where } f_i(w, y) = \sum_t f_{t,i}(y_t, y_{t+1}, w_t)$

[~ Sutton & McCallum eqs. 1.13, 1.14]
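A small numerical sanity check of this equivalence (my own sketch; the tiny tag set, probabilities, and the <s> start symbol are made up): set the log-linear weights to HMM log-probabilities and verify that $\theta^\top f(x, y)$ reproduces the HMM log-joint.

```python
import math

emit = {"V": {"finna": 0.6, "get": 0.4}, "A": {"good": 1.0}}   # p(word | tag)
trans = {"<s>": {"V": 1.0}, "V": {"V": 0.5, "A": 0.5}}         # p(tag | previous tag)
words, tags = ["finna", "get", "good"], ["V", "V", "A"]

# HMM log-joint, computed directly
prev, logp = "<s>", 0.0
for w, k in zip(words, tags):
    logp += math.log(trans[prev][k]) + math.log(emit[k][w])
    prev = k

# Log-linear view: weights theta are the log-probabilities, features are indicator counts
theta = {("trans", j, k): math.log(p) for j, d in trans.items() for k, p in d.items()}
theta.update({("emit", k, w): math.log(p) for k, d in emit.items() for w, p in d.items()})

feats, prev = {}, "<s>"
for w, k in zip(words, tags):
    feats[("trans", prev, k)] = feats.get(("trans", prev, k), 0) + 1
    feats[("emit", k, w)] = feats.get(("emit", k, w), 0) + 1
    prev = k

G = sum(theta[name] * count for name, count in feats.items())
assert abs(logp - G) < 1e-12   # goodness equals the HMM log-joint
```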

CRF: a probability distribution over the whole sequence,
$\log p(y \mid x) = C + \theta^\top f(x, y)$
Linear-chain CRF: the whole-sequence feature function decomposes into pairs,
$f(x, y) = \sum_t f_t(x, y_t, y_{t+1})$
Advantages: 1. Why just word-identity features? Add many more! 2. We can train it to optimize accuracy of sequences (discriminative learning). Viterbi can be used for efficient prediction.

Example: x = "finna get good", gold y = V V A. Two simple feature templates:
Transition features: $f_{\text{trans}:A,B}(x, y) = \sum_t 1\{y_{t-1} = A,\ y_t = B\}$
Observation features: $f_{\text{emit}:A,w}(x, y) = \sum_t 1\{y_t = A,\ x_t = w\}$
So f(x,y) is ... V,V: 1; V,A: 1; V,N: 0; ...; V,finna: 1; V,get: 1; A,good: 1; N,good: 0; ...
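A minimal sketch of these two templates as code (my own illustration), returning the sparse count vector f(x, y) as a dict keyed by feature name:

```python
from collections import Counter

def extract_features(words, tags):
    """Transition counts over tag pairs plus emission counts over (tag, word) pairs."""
    feats = Counter()
    for t in range(len(words)):
        if t > 0:
            feats[("trans", tags[t - 1], tags[t])] += 1
        feats[("emit", tags[t], words[t])] += 1
    return feats

print(extract_features("finna get good".split(), ["V", "V", "A"]))
# Each of ('trans','V','V'), ('trans','V','A'), ('emit','V','finna'),
# ('emit','V','get'), ('emit','A','good') gets count 1; everything else is 0.
```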

gold y = V V A for x = "finna get good". Mathematical convention is numeric indexing of features, though it is sometimes convenient to implement the feature vector as a hash table.

[Slide figure: the weight vector $\theta$, split into a transition-features block and an observation-features block, lined up against the sparse count vector f(x, y).]

$f_{\text{trans}:V,A}(x, y) = \sum_{t=2}^{N} 1\{y_{t-1} = V,\ y_t = A\}$
$f_{\text{obs}:V,\text{finna}}(x, y) = \sum_{t=1}^{N} 1\{y_t = V,\ x_t = \text{finna}\}$
$\text{Goodness}(y) = \theta^\top f(x, y)$

CRF: prediction with Viterbi. As before, $\log p(y \mid x) = C + \theta^\top f(x, y)$ is a probability distribution over the whole sequence, and the linear-chain CRF's whole-sequence feature function decomposes into pairs, $f(x, y) = \sum_t f_t(x, y_t, y_{t+1})$.

The scoring function has a local decomposition:
$f(x, y) = \sum_{t=1}^{T} f^{(B)}(t, x, y) + \sum_{t=2}^{T} f^{(A)}(y_{t-1}, y_t)$
$\theta^\top f(x, y) = \sum_{t=1}^{T} \theta^\top f^{(B)}(t, x, y) + \sum_{t=2}^{T} \theta^\top f^{(A)}(y_{t-1}, y_t)$
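A sketch of how this local decomposition turns into the factor score tables that Viterbi needs (my own illustration, assuming the tuple-keyed feature names used above and the viterbi() sketch from earlier):

```python
def local_scores(theta, words, tagset):
    """Build emission scores B[t][k] and transition scores A[j][k] from the weights."""
    B = [{k: theta.get(("emit", k, w), 0.0) for k in tagset} for w in words]
    A = {j: {k: theta.get(("trans", j, k), 0.0) for k in tagset} for j in tagset}
    return A, B

# Usage (hypothetical tag set):
# A, B = local_scores(theta, "finna get good".split(), ["V", "A", "N"])
# y_hat = viterbi(A, B, ["V", "A", "N"])
```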

Motivation: 1. We want features in our sequence model! 2. And how do we learn the parameters?
Outline:
1. Log-linear models
2. Log-linear sequence models:
   1. Log-scale additive Viterbi
   2. Conditional Random Fields
3. Learning: the Perceptron

The Perceptron Algorithm. The perceptron is not a model: it is a learning algorithm (Rosenblatt 1957). Insanely simple algorithm: iterate through the dataset; predict; update the weights to fix prediction errors. It can be used for classification OR structured prediction (the "structured perceptron"), and is a discriminative learning algorithm for any log-linear model (our view in this course).

The Mark I Perceptron machine was the first implementation of the perceptron algorithm. The machine was connected to a camera that used 20x20 cadmium sulfide photocells to produce a 400-pixel image. The main visible feature is a patchboard that allowed experimentation with different combinations of input features. To the right of that are arrays of potentiometers that implemented the adaptive weights.

Binary perceptron. For ~10 iterations, for each (x, y) in the dataset:
PREDICT $y^* = $ POS if $\theta^\top x \geq 0$, NEG if $\theta^\top x < 0$.
IF $y = y^*$, do nothing.
ELSE update the weights: $\theta := \theta + rx$ if POS was misclassified as NEG (let's make it more positive-y next time around); $\theta := \theta - rx$ if NEG was misclassified as POS (let's make it more negative-y next time). Here r is a learning-rate constant, e.g. r = 1.
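A minimal binary-perceptron sketch (my own illustration; feature vectors are sparse dicts and the labels are encoded as +1 for POS and -1 for NEG):

```python
def train_binary_perceptron(data, num_iters=10, r=1.0):
    """data: iterable of (x, y) pairs with x a dict feature->value and y in {+1, -1}."""
    theta = {}
    for _ in range(num_iters):
        for x, y in data:
            score = sum(theta.get(f, 0.0) * v for f, v in x.items())
            y_pred = 1 if score >= 0 else -1
            if y_pred != y:                    # mistake: push theta toward the true label
                for f, v in x.items():
                    theta[f] = theta.get(f, 0.0) + r * y * v
    return theta
```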

Structured/multiclass perceptron. For ~10 iterations, for each (x, y) in the dataset:
PREDICT $y^* = \arg\max_{y'} \theta^\top f(x, y')$.
IF $y = y^*$, do nothing.
ELSE update the weights: $\theta := \theta + r\,[f(x, y) - f(x, y^*)]$, i.e. add the features for the TRUE label and subtract the features for the PREDICTED label, with learning-rate constant r, e.g. r = 1.
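A structured-perceptron sketch along the same lines (my own illustration, assuming an extract_features(x, y) like the one above and a predict(theta, x) decoder, e.g. Viterbi, that returns the argmax structure):

```python
def train_structured_perceptron(data, predict, extract_features, num_iters=10, r=1.0):
    theta = {}
    for _ in range(num_iters):
        for x, y in data:                        # y is the gold structure (e.g. a tag sequence)
            y_pred = predict(theta, x)
            if y_pred != y:                      # theta += r * (f(x, y) - f(x, y*))
                gold, pred = extract_features(x, y), extract_features(x, y_pred)
                for f in set(gold) | set(pred):
                    theta[f] = theta.get(f, 0.0) + r * (gold.get(f, 0) - pred.get(f, 0))
    return theta
```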

Update rule example: y = POS, x = "this awesome movie ...", and we make a mistake: y* = NEG. With learning rate r = 1, the update $\theta := \theta + r\,[f(x, y) - f(x, y^*)]$ adds the features for the TRUE label and subtracts the features for the PREDICTED label. Over the features POS_awesome, POS_this, POS_oof, ..., NEG_awesome, NEG_this, NEG_oof, ...:
real f(x, POS) = (1, 1, 0, ..., 0, 0, 0, ...)
pred f(x, NEG) = (0, 0, 0, ..., 1, 1, 0, ...)
f(x, POS) - f(x, NEG) = (+1, +1, 0, ..., -1, -1, 0, ...)

Update rule: $\theta := \theta + r\,[f(x, y) - f(x, y^*)]$ (learning rate e.g. r = 1; features for the TRUE label minus features for the PREDICTED label).
For each feature j in the true y but not the predicted y*: $\theta_j := \theta_j + r\, f_j(x, y)$.
For each feature j not in the true y but in the predicted y*: $\theta_j := \theta_j - r\, f_j(x, y^*)$.
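The same update written feature-by-feature (my own sketch): only features that appear in the gold or predicted feature vectors change, so each mistake touches just a handful of weights.

```python
def perceptron_update(theta, gold_feats, pred_feats, r=1.0):
    """Apply theta += r * (f(x, y) - f(x, y*)) for sparse dict feature vectors."""
    for f, v in gold_feats.items():       # features of the true y: pushed up
        theta[f] = theta.get(f, 0.0) + r * v
    for f, v in pred_feats.items():       # features of the predicted y*: pushed down
        theta[f] = theta.get(f, 0.0) - r * v
    return theta
```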

Recall the example: x = "finna get good", gold y = V V A, and the two simple feature templates:
Transition features: $f_{\text{trans}:A,B}(x, y) = \sum_t 1\{y_{t-1} = A,\ y_t = B\}$
Observation features: $f_{\text{emit}:A,w}(x, y) = \sum_t 1\{y_t = A,\ x_t = w\}$
So f(x,y) is ... V,V: 1; V,A: 1; V,N: 0; ...; V,finna: 1; V,get: 1; A,good: 1; N,good: 0; ...

gold y = V V A for x = "finna get good". Mathematical convention is numeric indexing of features, though it is sometimes convenient to implement the feature vector as a hash table.

[Slide figure: the weight vector $\theta$, split into a transition-features block and an observation-features block, lined up against the sparse count vector f(x, y).]

$f_{\text{trans}:V,A}(x, y) = \sum_{t=2}^{N} 1\{y_{t-1} = V,\ y_t = A\}$
$f_{\text{obs}:V,\text{finna}}(x, y) = \sum_{t=1}^{N} 1\{y_t = V,\ x_t = \text{finna}\}$
$\text{Goodness}(y) = \theta^\top f(x, y)$

x = "finna get good", gold y = V V A, predicted y* = N V A. Learning idea: we want the gold y to have a high score. Update the weights so that y would score higher, and y* lower, next time.
f(x, y): V,V: 1; V,A: 1; V,finna: 1; V,get: 1; A,good: 1
f(x, y*): N,V: 1; V,A: 1; N,finna: 1; V,get: 1; A,good: 1
f(x, y) - f(x, y*): V,V: +1; N,V: -1; V,finna: +1; N,finna: -1
Perceptron update rule: $\theta := \theta + r\,[f(x, y) - f(x, y^*)]$
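Checking this worked example numerically (my own sketch, re-using the same feature templates): the difference vector keeps only the features on which the gold and predicted tag sequences disagree.

```python
from collections import Counter

def feats(words, tags):
    c = Counter()
    for t in range(len(words)):
        if t > 0:
            c[("trans", tags[t - 1], tags[t])] += 1
        c[("emit", tags[t], words[t])] += 1
    return c

words = "finna get good".split()
diff = feats(words, ["V", "V", "A"])           # gold y
diff.subtract(feats(words, ["N", "V", "A"]))   # minus predicted y*
print({f: v for f, v in diff.items() if v != 0})
# +1 for ('trans','V','V') and ('emit','V','finna'); -1 for ('trans','N','V') and ('emit','N','finna')
```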

$\theta := \theta + r\,[f(x, y) - f(x, y^*)]$

[Slide figure: the weight vector $\theta$ (transition-feature and observation-feature blocks) shown alongside the count vectors f(x, y) and f(x, y*); the update vector $\theta + r\,(f(x, y) - f(x, y^*))$ differs from $\theta$ only at the +1/-1 positions where the two count vectors disagree.]

Perceptron notes/issues. Issue: does it converge? (Generally no.) Solution: the averaged perceptron. Can you regularize it? No... just use averaging. By the way, there's also likelihood training out there (gradient ascent on the log-likelihood function: the traditional way to train a CRF); the structured perceptron is easier to implement and conceptualize, and performs similarly in practice.

Averaged perceptron. To get stability for the perceptron: the voted perceptron or the averaged perceptron (see the HW2 writeup). Averaging: for the t-th example, average together the weight vectors from every timestep:

$\bar{\theta}_t = \frac{1}{t} \sum_{t'=1}^{t} \theta_{t'}$

Efficiency? Lazy update algorithm in the HW.
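A naive averaged-perceptron sketch (my own illustration, with the same assumed predict/extract_features as before): keep a running sum of the weight vector after every example and return its average; the HW's lazy-update trick computes the same average without summing the full vector at every step.

```python
def train_averaged_perceptron(data, predict, extract_features, num_iters=10, r=1.0):
    theta, theta_sum, n = {}, {}, 0
    for _ in range(num_iters):
        for x, y in data:
            y_pred = predict(theta, x)
            if y_pred != y:
                gold, pred = extract_features(x, y), extract_features(x, y_pred)
                for f in set(gold) | set(pred):
                    theta[f] = theta.get(f, 0.0) + r * (gold.get(f, 0) - pred.get(f, 0))
            for f, v in theta.items():            # accumulate theta_t at every timestep
                theta_sum[f] = theta_sum.get(f, 0.0) + v
            n += 1
    return {f: v / n for f, v in theta_sum.items()}   # averaged weights theta_bar
```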