Bayesian Nonparametrics and DPMM
Machine Learning: Jordan Boyd-Graber, University of Colorado Boulder
Lecture 17
Clustering as Probabilistic Inference
- GMM is a probabilistic model (unlike K-means)
- There are several latent variables: means, assignments, (variances)
- Before, we were doing EM
- Today, new models and new methods
Nonparametric Clustering
- What if the number of clusters is not fixed?
- Nonparametric: the number of clusters can grow if the data need it
- Probabilistic distribution over the number of clusters
Dirichlet Process
- Distribution over distributions
- Parameterized by: α (concentration parameter) and G (base distribution)
- You can then draw observations x ~ DP(α, G)
Defining a DP
- Break off sticks
- Draw atoms
- Merge into a complete distribution (see the sketch below)
Properties of a DPMM
- Expected value is the same as the base distribution: E_{DP(α,G)}[x] = E_G[x]   (1)   (checked numerically in the sketch below)
- As α → ∞, DP(α, G) → G
- Number of components is unbounded
- Impossible to represent fully on a computer (truncation)
- You can nest DPs
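A small simulation, again using truncated stick-breaking, that checks the first property: the mean of a draw G' ~ DP(α, N(0, 1)), averaged over many draws, lands near E_G[x] = 0, even though any single draw can sit far from it. The truncation level and number of draws are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, K = 2.0, 1000  # concentration and truncation level

def dp_draw_mean(alpha, K, rng):
    # Mean of one truncated draw G' ~ DP(alpha, N(0, 1)): sum_k pi_k * theta_k
    betas = rng.beta(1.0, alpha, size=K)
    weights = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    atoms = rng.standard_normal(K)  # atoms from the base distribution G = N(0, 1)
    return weights @ atoms

means = np.array([dp_draw_mean(alpha, K, rng) for _ in range(2000)])
print(means.mean())  # close to E_G[x] = 0; individual draws vary widely
```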
Effect of scaling parameter α
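A rough numerical sketch of the point this slide's figure makes: larger α spreads the stick-breaking weights over more atoms, so draws from the DP look more like the base distribution G. The 95% mass threshold is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)

def stick_weights(alpha, K, rng):
    betas = rng.beta(1.0, alpha, size=K)
    return betas * np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))

for alpha in (0.1, 1.0, 10.0):
    w = np.sort(stick_weights(alpha, 2000, rng))[::-1]
    # How many atoms does it take to cover 95% of the probability mass?
    n95 = int(np.searchsorted(np.cumsum(w), 0.95)) + 1
    print(f"alpha={alpha:>5}: {n95} atoms cover 95% of the mass")
```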
DP as Mixture Model
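A sketch of the standard DP mixture generative process this slide depicts; the unit-variance Gaussian likelihood matches the later slides, but the notation here is assumed rather than taken from the figure.

```latex
% DP mixture model: each observation gets its own parameter drawn from a
% discrete random distribution G', so parameters repeat and induce clusters.
\begin{align*}
G' &\sim \mathrm{DP}(\alpha, G)      && \text{random discrete mixing distribution} \\
\theta_i &\sim G'                    && \text{parameter for observation } i \\
x_i &\sim \mathcal{N}(\theta_i, 1)   && \text{observation (unit-variance Gaussian likelihood)}
\end{align*}
```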
The Chinese Restaurant as a Distribution
- To generate an observation, you first sit down at a table.
- You sit down at a table with probability proportional to the number of people already sitting at it.
- [Figure: seven customers seated at three tables, chosen with probabilities 2/7, 3/7, and 2/7; each table carries a parameter μ_1, μ_2, μ_3 from which its observations x are drawn]
- But this is just maximum likelihood. Why are we talking about Chinese restaurants?
Always can squeeze in one more table...
- The posterior of a DP is a CRP
- A new observation gets a new table / cluster with probability proportional to α (see the sketch below)
- But this must be balanced against the probability of the observation given a cluster
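A minimal sketch of sampling seatings from the CRP, assuming the predictive rule above: customer i joins an existing table k with probability proportional to its occupancy n_k, and opens a new table with probability proportional to α. Function names are illustrative.

```python
import numpy as np

def crp_sample(n, alpha, rng=None):
    """Seat n customers one at a time according to a CRP with concentration alpha."""
    rng = rng or np.random.default_rng()
    assignments, counts = [], []   # counts[k] = number of customers at table k
    for i in range(n):
        # Existing table k with prob n_k / (i + alpha); new table with prob alpha / (i + alpha)
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(0)       # open a brand-new table
        counts[table] += 1
        assignments.append(table)
    return assignments, counts

tables, sizes = crp_sample(100, alpha=1.0)
print(f"{len(sizes)} tables for 100 customers; sizes: {sizes}")
```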
Gibbs Sampling
- We want to know the cluster assignment (table) of each observation
- Take a random guess initially
- This provides a mean for each cluster
- Let the number of clusters grow
Gibbs Sampling
- We want to know z
- Compute p(z_i | z_1, ..., z_{i-1}, z_{i+1}, ..., z_m, x, α, G)
- Update z_i by sampling from that distribution
- Keep going...

Notation:
p(z_i = k | z_{-i}) ≡ p(z_i | z_1, ..., z_{i-1}, z_{i+1}, ..., z_m)   (2)
Gibbs Sampling for DPMM

p(z_i = k | z_{-i}, x, {θ_k}, α)                                                  (3)
  = p(z_i = k | z_{-i}, x_i, x_{-i}, θ_k, α)                                      (4)   [dropping irrelevant terms]
  = p(z_i = k | z_{-i}, α) · p(x_i | θ_k, x_{-i})                                 (5)   [chain rule]
  = (n_k / (n + α)) · ∫_θ p(x_i | θ) p(θ | G, x) dθ    for an existing cluster k
    (α / (n + α)) · ∫_θ p(x_i | θ) p(θ | G) dθ         for a new cluster          (6)   [applying the CRP]
  = (n_k / (n + α)) · N(x_i; n_k x̄_k / (n_k + 1), 1)   for an existing cluster k
    (α / (n + α)) · N(x_i; 0, 1)                        for a new cluster          (7)

The last step evaluates the scary integrals assuming G is a normal distribution with mean zero and unit variance; n_k and x̄_k are the count and mean of the observations currently assigned to cluster k. (Derived in the optional reading.)
Algorithm for Gibbs Sampling
1. Random initial assignment to clusters
2. For iteration i:
   2.1 Unassign observation n
   2.2 Choose a new cluster for that observation (full loop sketched below)
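A sketch of the whole loop for the 1-D case, assuming a unit-variance Gaussian likelihood and base distribution G = N(0, 1), and scoring clusters with equation (7) as written on the slide (the exact posterior predictive would also inflate the variance slightly, by 1/(n_k + 1)). Variable names and the toy data are illustrative, not from the lecture.

```python
import numpy as np

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def dpmm_gibbs(x, alpha=1.0, iters=50, seed=0):
    """Collapsed Gibbs sampling for a 1-D DP mixture of unit-variance Gaussians, G = N(0, 1)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    z = np.zeros(n, dtype=int)              # step 1: initialize (here: all in one cluster)
    counts, sums = {0: n}, {0: float(np.sum(x))}
    for _ in range(iters):
        for i in range(n):                  # step 2: sweep over observations
            k_old = z[i]                    # 2.1: unassign observation i
            counts[k_old] -= 1
            sums[k_old] -= x[i]
            if counts[k_old] == 0:
                del counts[k_old], sums[k_old]
            # 2.2: score every existing cluster plus a brand-new one (equation 7)
            labels = list(counts)
            probs = []
            for k in labels:
                nk, xbar = counts[k], sums[k] / counts[k]
                probs.append(nk * normal_pdf(x[i], nk * xbar / (nk + 1), 1.0))  # existing
            probs.append(alpha * normal_pdf(x[i], 0.0, 1.0))                     # new cluster
            probs = np.array(probs) / np.sum(probs)
            choice = rng.choice(len(probs), p=probs)
            k_new = (max(counts, default=-1) + 1) if choice == len(labels) else labels[choice]
            z[i] = k_new
            counts[k_new] = counts.get(k_new, 0) + 1
            sums[k_new] = sums.get(k_new, 0.0) + x[i]
    return z

# Toy data: two well-separated groups; the sampler typically settles on about two clusters
data = np.concatenate([np.random.default_rng(1).normal(-3, 1, 30),
                       np.random.default_rng(2).normal(3, 1, 30)])
print(len(set(dpmm_gibbs(data, alpha=1.0).tolist())))
```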
Toy Example
[Figure sequence: a Gibbs sweep on a toy dataset, reassigning observations one at a time. At one step a new cluster is created. And repeat...]
Differences between EM and Gibbs
- Gibbs is often faster to implement
- EM is easier to diagnose for convergence
- EM can be parallelized
- Gibbs is more widely applicable
In class and next week
- Walking through DPMM clustering
- Clustering discrete data with more than one cluster per observation