Rishab Mudliar

About

These posts are my notes and represent my interpretation of the CS236 course taught by Stefano Ermon.

Recap

Latent Variable Models: Introduction

Latent Variable Models: Motivation

[Figure: latent factors of variation and the conditionals p(x|z) relating them to the observed data x]

Challenge: Very difficult to specify these conditionals by hand

Deep Latent Variable Models

[Figure: deep latent variable model]
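As a concrete sketch (a standard VAE-style parameterization in my notation, not copied from the slide), such a model uses neural networks to map z to the parameters of the conditional:

$$
z \sim \mathcal{N}(0, I), \qquad p_\theta(x \mid z) = \mathcal{N}\big(x;\, \mu_\theta(z), \Sigma_\theta(z)\big),
$$

where μθ and Σθ are neural networks. Even though each p(x|z) is a simple Gaussian, the resulting marginal over x can be very flexible.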

Mixture of Gaussians: a Shallow Latent Variable Model

[Figure: mixture of Gaussians]

Generative process

[Figure: mixture of Gaussians generative process]
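Written out (my notation), the generative process for a mixture of K Gaussians is:

$$
z \sim \mathrm{Categorical}(1, \dots, K), \qquad p(x \mid z = k) = \mathcal{N}(x;\, \mu_k, \Sigma_k).
$$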

After learning the model, we can cluster the data points using the posterior p(z|x), as shown in the figure above.

Another advantage: a single Gaussian is a simple distribution and may not model the data well on its own, but by mixing several Gaussians (one conditional p(x|z = k) per component) we obtain a more expressive marginal distribution that can approximate the true data distribution.

[Figure: mixture of Gaussians approximating a complex data distribution]
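Concretely, mixing the K Gaussian conditionals gives the marginal below, and Bayes' rule gives the posterior used for clustering (standard formulas, added here for reference):

$$
p(x) = \sum_{k=1}^{K} p(z = k)\, \mathcal{N}(x;\, \mu_k, \Sigma_k), \qquad
p(z = k \mid x) = \frac{p(z = k)\, \mathcal{N}(x;\, \mu_k, \Sigma_k)}{p(x)}.
$$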

Variational Autoencoder

Extending the mixture of Gaussians from the last section, a variational autoencoder can be thought of as a mixture of an infinite number of Gaussians.

Marginal Likelihood

[Figure: marginal likelihood]

Marginal Likelihood (Autoencoder)

A mixture of an infinite number of Gaussians:
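In standard form (assuming the Gaussian decoder sketched earlier), the marginal integrates over all values of z, i.e., one Gaussian component per value of z:

$$
p_\theta(x) = \int p(z)\, p_\theta(x \mid z)\, dz = \int \mathcal{N}(z;\, 0, I)\, \mathcal{N}\big(x;\, \mu_\theta(z), \Sigma_\theta(z)\big)\, dz.
$$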

Marginal Likelihood, Continued

[Figure: marginal likelihood, continued]

Using Monte Carlo

Likelihood function pθ(x) for partially observed data is hard to compute:

[Figure: likelihood for partially observed data]

We can think of it as an (intractable) expectation. Monte Carlo to the rescue:

[Figure: naive Monte Carlo estimate]
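For reference, the naive estimator described here can be written as follows (my notation): rewrite the sum over the latent space as an expectation under a uniform distribution, then replace the expectation with a sample average.

$$
p_\theta(x) = \sum_{z \in \mathcal{Z}} p_\theta(x, z)
= |\mathcal{Z}|\; \mathbb{E}_{z \sim \mathrm{Uniform}(\mathcal{Z})}\big[p_\theta(x, z)\big]
\approx \frac{|\mathcal{Z}|}{k} \sum_{j=1}^{k} p_\theta\big(x, z^{(j)}\big), \quad z^{(j)} \sim \mathrm{Uniform}(\mathcal{Z}).
$$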

This works in theory but not in practice: for most z, pθ(x, z) is very low (most completions don't make sense). Some completions have large pθ(x, z), but we will essentially never "hit" those likely completions by uniform random sampling. We need a clever way to select the z^(j) and reduce the variance of the estimator.

What do we do? We could try to learn a distribution over z that is more likely to propose completions relevant to the observed x.

Importance Sampling

Instead of sampling z uniformly, we introduce q(z), a probability distribution from which we sample values of z that are more likely to be good completions of x.

[Figure: importance sampling]

Monte Carlo to the rescue:
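The resulting estimator (again my notation) multiplies and divides by q(z) and samples from q instead of uniformly:

$$
p_\theta(x) = \sum_{z} q(z)\, \frac{p_\theta(x, z)}{q(z)}
= \mathbb{E}_{z \sim q(z)}\!\left[\frac{p_\theta(x, z)}{q(z)}\right]
\approx \frac{1}{k} \sum_{j=1}^{k} \frac{p_\theta\big(x, z^{(j)}\big)}{q\big(z^{(j)}\big)}, \quad z^{(j)} \sim q(z).
$$

This estimator is unbiased for any valid q (with q(z) > 0 wherever pθ(x, z) > 0), and its variance is small when q concentrates on likely completions.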

Calculating log likelihood

As before, pθ(x) itself is intractable, so we fall back on the importance-sampling Monte Carlo estimator above.

For training, we need the log-likelihood log pθ(x). We could estimate it as:

[Figure: Monte Carlo estimate of the log-likelihood]

For the same case of k = 1, if we instead compute the log of the expectation (without the Monte Carlo approximation), we see a potential problem:

[Figure: log of the expectation vs. the k = 1 estimate]

On the right-hand side is the log of the expectation under q(z), and on the left-hand side is the Monte Carlo estimate with k = 1, which takes the log of a single sampled ratio. These are not the same thing: in expectation the k = 1 estimate equals E_q[log(pθ(x, z)/q(z))], and because log is concave, Jensen's inequality gives E_q[log(·)] ≤ log E_q[·] = log pθ(x). So on average the k = 1 estimate underestimates log pθ(x); with larger k the bias shrinks, but for this base case our estimate is clearly not good.
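A small numerical sketch of this bias (my own toy example with made-up numbers, not from the course): estimate log pθ(x) in a tiny discrete latent model using the k = 1 estimator and compare the average of those estimates against the true value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: one observation x with a discrete latent z in {0, 1, 2}.
# p_xz[k] plays the role of p_theta(x, z=k); the values are made up.
p_xz = np.array([0.30, 0.05, 0.01])
true_log_px = np.log(p_xz.sum())      # log p_theta(x) = log sum_z p_theta(x, z)

# Proposal q(z): uniform over the three latent values.
q = np.full(3, 1 / 3)

# k = 1 estimator of log p_theta(x): log(p_theta(x, z) / q(z)) with z ~ q(z).
n_runs = 200_000
z = rng.choice(3, size=n_runs, p=q)
log_estimates = np.log(p_xz[z] / q[z])

print(f"true log p(x):            {true_log_px:.4f}")
print(f"average k=1 log-estimate: {log_estimates.mean():.4f}")
# The average sits below the true value: E_q[log w] <= log E_q[w] (Jensen).
```

In expectation, the k = 1 estimate equals E_{z∼q}[log(pθ(x, z)/q(z))], which is exactly the ELBO introduced next; the gap to log pθ(x) is the Jensen gap.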

Evidence Lower Bound

The log-likelihood for partially observed data is hard to compute.

[Figure: log is a concave function]

log(·) is a concave function: log(px + (1 − p)x′) ≥ p log(x) + (1 − p) log(x′).

Using Jensen's inequality, we lower-bound the log-likelihood as shown below. The resulting bound is called the ELBO (Evidence Lower Bound), and our goal is to make it as large (tight) as possible.

[Figure: ELBO derivation]
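Spelling the bound out (standard derivation, my notation):

$$
\log p_\theta(x) = \log \mathbb{E}_{z \sim q(z)}\!\left[\frac{p_\theta(x, z)}{q(z)}\right]
\;\ge\; \mathbb{E}_{z \sim q(z)}\!\left[\log \frac{p_\theta(x, z)}{q(z)}\right]
= \mathbb{E}_{z \sim q(z)}\big[\log p_\theta(x, z)\big] + H(q),
$$

where H(q) = −E_{z∼q}[log q(z)] is the entropy of q; the right-hand side is the ELBO.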

Variational Inference

The ELBO is tight, i.e., equal to the log-likelihood, when q = p(z|x; θ).

Proof: [Figure: derivation]
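A sketch of the standard argument, which is a one-line substitution of q(z) = p(z|x; θ) into the ELBO (using pθ(x, z) = p(z|x; θ) pθ(x)):

$$
\mathbb{E}_{z \sim p(z \mid x; \theta)}\!\left[\log \frac{p_\theta(x, z)}{p(z \mid x; \theta)}\right]
= \mathbb{E}_{z \sim p(z \mid x; \theta)}\!\left[\log \frac{p(z \mid x; \theta)\, p_\theta(x)}{p(z \mid x; \theta)}\right]
= \log p_\theta(x).
$$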

KL-Divergence relation with ELBO

Suppose q(z) is any probability distribution over the hidden variables. A little bit of algebra reveals

[Figure: relation between the log-likelihood, ELBO, and KL divergence]
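Presumably the identity shown is the usual decomposition (my notation):

$$
\log p(x; \theta) = \underbrace{\mathbb{E}_{z \sim q(z)}\!\left[\log \frac{p_\theta(x, z)}{q(z)}\right]}_{\text{ELBO}}
+ D_{KL}\big(q(z) \,\|\, p(z \mid x; \theta)\big).
$$

Since the left-hand side does not depend on q, shrinking the KL term necessarily grows the ELBO, and vice versa.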

Corresponding proof. External link

Equality holds if q = p(z|x; θ), because DKL(q(z) ∥ p(z|x; θ)) = 0:

[Figure: equality case]

Then we can rewrite as follows.

[Figure: rewritten identity]

This means that minimizing the KL divergence is equivalent to maximizing the ELBO.

Computing q

[Figure: computing q]

Variational approximation of the posterior

[Figures: variational approximation of the posterior]
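As a concrete instance of such a variational family (my notation; e.g., a Gaussian whose parameters we tune for the given data point):

$$
q(z; \phi) = \mathcal{N}(z;\, \phi_1, \phi_2), \qquad
\phi^{*} = \arg\max_{\phi}\; \mathbb{E}_{z \sim q(z; \phi)}\!\left[\log \frac{p_\theta(x, z)}{q(z; \phi)}\right].
$$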

Closing notes

[Figure: closing notes]

The better q(z; ϕ) approximates the posterior p(z|x; θ), the smaller DKL(q(z; ϕ) ∥ p(z|x; θ)) we can achieve, and the closer the ELBO will be to log p(x; θ). Next: jointly optimize over θ and ϕ to maximize the ELBO over a dataset.
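As a preview of that next step, one natural way to write the dataset-level objective (my notation, with one set of variational parameters ϕ^(i) per data point x^(i)) is:

$$
\max_{\theta}\; \sum_{x^{(i)} \in \mathcal{D}} \; \max_{\phi^{(i)}}\; \mathbb{E}_{z \sim q(z; \phi^{(i)})}\!\left[\log \frac{p_\theta\big(x^{(i)}, z\big)}{q\big(z; \phi^{(i)}\big)}\right].
$$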