Log-linear models (part II)


Log-linear models (part II)
Lecture, Feb 2
CS 690N, Spring 2017: Advanced Natural Language Processing
http://people.cs.umass.edu/~brenocon/anlp2017/
Brendan O'Connor
College of Information and Computer Sciences, University of Massachusetts Amherst

MaxEnt / Log-Linear models

x: input (all previous words)
y: output (next word)
f(x, y) in R^d: feature function  [[domain knowledge here!]]
v in R^d: parameter vector (weights)

p(y \mid x; v) = \frac{\exp\big(v \cdot f(x, y)\big)}{\sum_{y' \in Y} \exp\big(v \cdot f(x, y')\big)}

Application to a history-based LM:

P(w_1 \ldots w_T) = \prod_t P(w_t \mid w_1 \ldots w_{t-1}) = \prod_t \frac{\exp\big(v \cdot f(w_1 \ldots w_{t-1}, w_t)\big)}{\sum_{w \in V} \exp\big(v \cdot f(w_1 \ldots w_{t-1}, w)\big)}
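
As a concrete sketch of this conditional distribution (hypothetical names, not from the lecture: feat_fn returns a sparse dict of feature id to value, w is a dict of weights), scoring every candidate next word and normalizing:

```python
import math

def next_word_probs(history, vocab, feat_fn, w):
    """p(y | x; v) = exp(v . f(x, y)) / sum over y' of exp(v . f(x, y'))."""
    scores = {word: sum(w.get(j, 0.0) * val
                        for j, val in feat_fn(history, word).items())
              for word in vocab}
    m = max(scores.values())                 # subtract the max for numerical stability
    exps = {word: math.exp(s - m) for word, s in scores.items()}
    Z = sum(exps.values())
    return {word: e / Z for word, e in exps.items()}
```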

f_1(x, y) = 1 if y = "model"
f_2(x, y) = 1 if y = "model" and w_{i-1} = "statistical"
f_3(x, y) = 1 if y = "model", w_{i-2} = "any", w_{i-1} = "statistical"
f_4(x, y) = 1 if y = "model", w_{i-2} = "any"
f_5(x, y) = 1 if y = "model", w_{i-1} is an adjective
f_6(x, y) = 1 if y = "model", w_{i-1} ends in "ical"
f_7(x, y) = 1 if y = "model", "model" is not in w_1, ..., w_{i-1}
f_8(x, y) = 1 if y = "model", "grammatical" is in w_1, ..., w_{i-1}
(each feature is 0 otherwise)

Figure 1: Example features for the language modeling problem, where the input x is a sequence of words w_1 w_2 ... w_{i-1}, and the label y is a word.

These are sparse. But still very useful.

Feature templates

Generate a large collection of features from a single template. Not part of (standard) log-linear mathematics, but how you actually build these things.

e.g. trigram feature template: for every (u, v, w) trigram in the training data, create the feature

f_{N(u,v,w)}(x, y) = 1 if y = w, w_{i-2} = u, w_{i-1} = v; 0 otherwise

where N(u, v, w) is a function that maps each trigram in the training data to a unique integer.

At training time: record the N(u, v, w) mapping.
At test time: extract trigram features and check whether they are in the feature vocabulary (a sketch follows below).

Feature engineering: an iterative cycle of model development.
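
A minimal sketch of such a template under assumed conventions (feat_vocab is a plain dict standing in for N(u, v, w); all names are hypothetical, not from the lecture):

```python
def trigram_feature(history, y, feat_vocab, train=False):
    """Trigram template: one indicator per (w_{i-2}, w_{i-1}, y) seen in training.

    feat_vocab plays the role of N(u, v, w): it maps each trigram to a unique
    integer id.  At training time new trigrams are recorded; at test time an
    unseen trigram is simply ignored (its weight would be 0 anyway).
    Assumes the history has at least two words (pad it otherwise).
    """
    u, v = history[-2], history[-1]
    key = ("tri", u, v, y)
    if key not in feat_vocab:
        if not train:
            return {}                       # completely new feature: ignore it
        feat_vocab[key] = len(feat_vocab)   # record the N(u, v, w) mapping
    return {feat_vocab[key]: 1.0}
```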

Feature subtleties

On training data, generate all features under consideration.

Subtle issue: partially unseen features. At testing time, a completely new feature has to be ignored (weight 0).

Assuming a conditional log-linear model:
Features typically conjoin aspects of both the input and the output.
Features can look only at the output, f(y): valid (acts like a class bias).
Invalid: features that look only at the input, since they cancel in the normalizer and have no effect on p(y | x).

Multiclass Logistic Regression

What does this look like in log-linear form?

P(y \mid x) = \frac{\exp\big(\sum_j \theta_{j,y} x_j\big)}{\sum_{y'} \exp\big(\sum_j \theta_{j,y'} x_j\big)}

Complete input-output conjunctions as the feature generator: very common and effective.
Log-linear models give more flexible forms (e.g. disjunctions on output classes).
Ambiguous term: "feature".
Partially unseen features: typically helpful.
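
The correspondence can be made explicit (a standard derivation, not spelled out on the slide): define one indicator-conjunction feature per (input dimension, class) pair.

```latex
f_{j,c}(x, y) = x_j \,\mathbb{1}[y = c], \qquad v_{j,c} = \theta_{j,c}
\quad\Longrightarrow\quad
v \cdot f(x, y) = \sum_j \theta_{j,y}\, x_j,
\qquad
p(y \mid x) = \frac{\exp\big(\sum_j \theta_{j,y} x_j\big)}{\sum_{y'} \exp\big(\sum_j \theta_{j,y'} x_j\big)}.
```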

Learning

Log-likelihood is concave. (At least with regularization: the data are typically linearly separable.)

\log p(y \mid x; v) = v \cdot f(x, y) - \log \sum_{y' \in Y} \exp\big(v \cdot f(x, y')\big)

\frac{\partial}{\partial v_j} \log p(y \mid x; v) = f_j(x, y) - \sum_{y'} p(y' \mid x; v)\, f_j(x, y')

("fun with the chain rule": the first term asks "feature in the data?", the second "feature in the posterior?")

Gradient at a single example: can it be zero?
Full-dataset gradient: first moments match at the mode.
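
A minimal sketch of this single-example gradient, observed features minus expected features (hypothetical names, not code from the lecture: feat_fn returns a sparse dict of feature id to value, w is the weight dict):

```python
import math
from collections import defaultdict

def example_gradient(x, y, labels, feat_fn, w):
    """Gradient of log p(y | x; w): observed features minus expected features."""
    scores = {yp: sum(w.get(j, 0.0) * v for j, v in feat_fn(x, yp).items())
              for yp in labels}
    m = max(scores.values())                    # for numerical stability
    exps = {yp: math.exp(s - m) for yp, s in scores.items()}
    Z = sum(exps.values())
    grad = defaultdict(float)
    for j, v in feat_fn(x, y).items():          # "feature in the data"
        grad[j] += v
    for yp in labels:                           # minus "feature in the posterior"
        p = exps[yp] / Z
        for j, v in feat_fn(x, yp).items():
            grad[j] -= p * v
    return grad
```

For typical indicator features this is nonzero for a single example unless the model puts essentially all its probability on the observed label; summed over the dataset and set to zero at the optimum, it yields the moment-matching property discussed next.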

Moment matching

Example: Rosenfeld's trigger words
... loan ... went into the bank

Empirical history probability (the bigram model estimate), call it K_{THE BANK}:

P_{BIGRAM}(BANK \mid THE) = K_{THE\ BANK}

The log-linear model has a weaker property:

E_{h\ \text{ends in}\ THE}\big[ P_{COMBINED}(BANK \mid h) \big] = K_{THE\ BANK}

Maximum Entropy view of a log-linear model: start with feature expectations as constraints. What is the highest-entropy distribution that satisfies them?

[stopped here 2/2]

Gradient descent

Batch gradient descent: doesn't work well by itself.

Most commonly used alternatives:
- LBFGS (an adaptive version of batch GD)
- SGD, one example at a time, and adaptive variants: Adagrad, Adam, etc.

Intuition

Issue: combining per-example sparse updates with regularization updates.
- Lazy updates (sketched below)
- Occasional regularizer steps (easy to implement)
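
A minimal sketch of the lazy-update idea for L2 regularization (hypothetical bookkeeping; the sparse gradient is assumed to come from elsewhere, e.g. the example_gradient sketch above):

```python
from collections import defaultdict

w = defaultdict(float)      # sparse weight vector
last = defaultdict(int)     # step at which each weight was last regularized
lr, lam = 0.1, 1e-4         # step size and L2 strength (illustrative values)

def lazy_sgd_step(t, sparse_grad):
    """Apply only the L2 decay each touched weight has 'missed' since its last update."""
    for j, g in sparse_grad.items():
        w[j] *= (1.0 - lr * lam) ** (t - last[j])   # catch up on the skipped decay steps
        last[j] = t
        w[j] += lr * g                              # ascent step on the log-likelihood
    # (before reading weights out at the end, catch the remaining ones up the same way)
```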

Engineering

Sparse dot products are crucial!

Lots and lots of features? Millions to billions of features: performance often keeps improving! Features seen only once at training time typically help.

The feature name => number mapping is the problem; the parameter vector itself is fine.

Feature hashing: make the N(u, v, w) mapping random, with collisions (!). The accuracy loss is low since most features are rare. Works really well, with extremely practical computational properties (memory usage is known in advance).

Practically: use a fast string hashing function (murmurhash, Python's internal one, etc.).
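
A minimal sketch of feature hashing (D and the helper name are illustrative; Python's built-in hash stands in for murmurhash here, though it is salted per process, so a real system would use a stable hash):

```python
D = 2 ** 20                  # parameter-vector size, fixed and known in advance
w = [0.0] * D                # dense weight array; no feature dictionary is stored

def hashed_feature_id(u, v, y):
    """Replace the exact N(u, v, w) mapping with a hash modulo D.

    Collisions are allowed; since most features are rare the accuracy loss is
    typically small, and memory usage is known up front.
    """
    return hash(("tri", u, v, y)) % D
```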

Feature selection

Count cutoffs: a computational measure, not a performance one.
Offline feature selection: MI/IG vs. chi-square.
L1 regularization: encourages sparsity in θ:

\min_{\theta}\; -\log p_{\theta}(y \mid x) + \lambda \sum_j |\theta_j|

L1 optimization: convex but nonsmooth; requires subgradient methods.
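
For concreteness (a standard fact, not from the slide), the nonsmoothness is only at zero, where the absolute value has a subgradient set rather than a gradient:

```latex
\partial |\theta_j| =
\begin{cases}
\{\operatorname{sign}(\theta_j)\} & \text{if } \theta_j \neq 0,\\
[-1, 1] & \text{if } \theta_j = 0.
\end{cases}
```

At an optimum a weight can therefore sit exactly at zero whenever the magnitude of its log-likelihood gradient is at most λ, which is what makes L1 produce a sparse θ.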