Rishab Mudliar

About

These blog posts are my notes; they represent my interpretation of the CS236 course taught by Stefano Ermon.

Continuing...

A variational approximation to the posterior


When p(z|x; θ) is intractable, we come up with a tractable approximation q(z; ϕ) that is as close as possible to p(z|x; θ).
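
To make "as close as possible" precise, it helps to recall the standard identity relating the log-likelihood, the ELBO and the KL divergence (a standard result, written in the notation used in the rest of this post):

$$\log p(x;\theta) \;=\; \mathbb{E}_{q(z;\phi)}\!\left[\log \frac{p(x,z;\theta)}{q(z;\phi)}\right] \;+\; D_{KL}\big(q(z;\phi)\,\|\,p(z|x;\theta)\big)$$

The first term on the right is the ELBO, L(x; θ, ϕ). Since the left-hand side does not depend on ϕ, maximizing the ELBO over ϕ is exactly the same as making q(z; ϕ) close to the true posterior in KL divergence.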

For the example of an image whose top half is unknown, this is what we came up with last time:

[figure: candidate choices for q(z; ϕ), from the previous post]

Given the probability distribution, we deduced that the last choice is probably the best one.

Learning via stochastic variational inference (SVI)

Our goal is to optimize the ELBO, given below.

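Concretely, the bound has the usual form (restated here in the same L(x; θ, ϕ) notation used later in the post):

$$\mathcal{L}(x;\theta,\phi) \;=\; \mathbb{E}_{q(z;\phi)}\big[\log p(x,z;\theta) - \log q(z;\phi)\big] \;\le\; \log p(x;\theta)$$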

Steps (roughly):

1. Initialize θ and ϕ.
2. Randomly sample a data point x^i from the dataset.
3. Optimize the ELBO L(x^i; θ, ϕ) as a function of ϕ (an inner loop of gradient steps on ϕ).
4. Compute ∇_θ L(x^i; θ, ϕ) and take a gradient step on θ.
5. Go back to step 2.

How do we compute the gradients? There might not be a closed-form solution for the expectations, so we use Monte Carlo sampling.

To evaluate the bound, sample z^1, …, z^K from q(z; ϕ) and estimate it with the sample average

$$\mathbb{E}_{q(z;\phi)}\big[\log p(x,z;\theta) - \log q(z;\phi)\big] \;\approx\; \frac{1}{K}\sum_{k=1}^{K}\big(\log p(x,z^k;\theta) - \log q(z^k;\phi)\big)$$

We then calculate gradients w.r.t. θ and ϕ of this Monte Carlo estimate.

Key assumption: q(z; ϕ) is tractable, i.e., easy to sample from and evaluate
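
To make the estimate concrete, here is a tiny sketch on a made-up one-dimensional model; the model itself and the values of θ, µ, σ, x and K are arbitrary choices of mine, not anything from the course:

```python
import torch
from torch.distributions import Normal

# Toy latent variable model: p(z) = N(0, 1), p(x|z; theta) = N(theta * z, 1),
# and a variational approximation q(z; phi) = N(mu, sigma^2) with phi = (mu, sigma).
theta = torch.tensor(2.0)
mu, sigma = torch.tensor(0.5), torch.tensor(0.8)
x = torch.tensor(1.3)                              # a single observed data point

K = 1000
z = Normal(mu, sigma).sample((K,))                 # z^1, ..., z^K ~ q(z; phi)

log_p_xz = Normal(0.0, 1.0).log_prob(z) + Normal(theta * z, 1.0).log_prob(x)
log_q_z = Normal(mu, sigma).log_prob(z)

# (1/K) * sum_k [ log p(x, z^k; theta) - log q(z^k; phi) ]  -- Monte Carlo ELBO estimate
elbo_estimate = (log_p_xz - log_q_z).mean()
print(float(elbo_estimate))
```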

The gradient with respect to θ is easy

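This is because the sampling distribution q(z; ϕ) does not depend on θ, so the gradient moves inside the expectation (standard derivation, using the same Monte Carlo samples as above):

$$\nabla_\theta\, \mathbb{E}_{q(z;\phi)}\big[\log p(x,z;\theta) - \log q(z;\phi)\big] \;=\; \mathbb{E}_{q(z;\phi)}\big[\nabla_\theta \log p(x,z;\theta)\big] \;\approx\; \frac{1}{K}\sum_{k=1}^{K}\nabla_\theta \log p(x,z^k;\theta)$$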

The gradient with respect to ϕ is more complicated, because the expectation is taken with respect to q(z; ϕ), which itself depends on ϕ, so we cannot simply push the gradient inside.

We would still like to estimate it with a Monte Carlo average.

Reparametrization

We want to compute the gradient with respect to ϕ of

$$\mathbb{E}_{q(z;\phi)}[r(z)] \;=\; \int q(z;\phi)\, r(z)\, dz$$

where z is now continuous.

Suppose q(z; ϕ) = N(µ, σ²I) is Gaussian with parameters ϕ = (µ, σ). Then there are two equivalent ways of sampling from it: sample z ∼ q(z; ϕ) directly, or first sample ϵ ∼ N(0, I) and set z = µ + σϵ = g(ϵ; ϕ).

Using this equivalence we compute the expectation in two ways:

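Written out (this is the standard reparametrization identity; r is, for now, any function of z):

$$\mathbb{E}_{q(z;\phi)}[r(z)] \;=\; \int q(z;\phi)\, r(z)\, dz \;=\; \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)}\big[r(g(\epsilon;\phi))\big] \;=\; \int \mathcal{N}(\epsilon;0,I)\, r(\mu + \sigma\epsilon)\, d\epsilon$$

and therefore

$$\nabla_\phi\, \mathbb{E}_{q(z;\phi)}[r(z)] \;=\; \nabla_\phi\, \mathbb{E}_{\epsilon}\big[r(g(\epsilon;\phi))\big] \;=\; \mathbb{E}_{\epsilon}\big[\nabla_\phi\, r(g(\epsilon;\phi))\big] \;\approx\; \frac{1}{K}\sum_{k=1}^{K}\nabla_\phi\, r\big(g(\epsilon^k;\phi)\big)$$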

The last expression is easy to estimate via Monte Carlo as long as r and g are differentiable w.r.t. ϕ and ϵ is easy to sample from (the gradient is then computed by backpropagation).
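
A tiny autograd sketch of this estimator; the particular r, the values of µ and σ, and K below are arbitrary choices of mine:

```python
import torch

# Reparametrized Monte Carlo estimate of grad_phi E_{q(z;phi)}[r(z)], with phi = (mu, sigma).
def r(z):
    return torch.sin(z) + z ** 2      # any differentiable stand-in for r

mu = torch.tensor(0.3, requires_grad=True)
sigma = torch.tensor(1.5, requires_grad=True)

K = 10_000
eps = torch.randn(K)                  # eps^k ~ N(0, 1): sampling does not involve phi
z = mu + sigma * eps                  # z^k = g(eps^k; phi), differentiable in (mu, sigma)
estimate = r(z).mean()                # (1/K) * sum_k r(g(eps^k; phi))

estimate.backward()                   # backpropagation through g
print(mu.grad, sigma.grad)            # Monte Carlo estimates of the gradient w.r.t. phi
```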

Using Reparametrization for ELBO

Using the reparametrization trick inside the ELBO: the bound can be written as E_q(z;ϕ) [r(z, ϕ)] with r(z, ϕ) = log p(x, z; θ) − log q(z; ϕ). If we look closely, this is not the same situation as E_q(z;ϕ) [r(z)] above, because the quantity inside the expectation now depends on ϕ as well.

Does this change anything? Yes. Can we still use reparametrization? Also yes.

Assume z = µ + σϵ = g(ϵ; ϕ) as before.

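Then the expectation can again be rewritten as an expectation over ϵ and estimated by sampling (standard form; gradients now flow through both arguments of r):

$$\mathbb{E}_{q(z;\phi)}\big[r(z,\phi)\big] \;=\; \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)}\big[r\big(g(\epsilon;\phi),\phi\big)\big] \;\approx\; \frac{1}{K}\sum_{k=1}^{K} r\big(g(\epsilon^k;\phi),\phi\big)$$

Backpropagation handles both the dependence through z = g(ϵ; ϕ) and the explicit dependence of r on ϕ.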

Amortized Inference

So far we have a separate set of variational parameters ϕ^i for every data point x^i, which becomes expensive for large datasets. The idea of amortized inference is to instead learn a single parametric function f_λ that maps each x to a good set of variational parameters, i.e. we approximate the per-example posteriors with q(z; f_λ(x^i)).

Once again: in the literature q(z; f_λ(x)) is usually written as q_ϕ(z|x), where ϕ now denotes the parameters of this inference network (the "encoder").

Learning with Amortized Inference

As before, we optimize the ELBO as a function of θ and ϕ using (stochastic) gradient descent.

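With the amortized posterior the objective is (standard form):

$$\mathcal{L}(x;\theta,\phi) \;=\; \mathbb{E}_{q_\phi(z|x)}\big[\log p(x,z;\theta) - \log q_\phi(z|x)\big]$$

The gradient w.r.t. θ is a plain Monte Carlo average as before, and the gradient w.r.t. ϕ is taken through reparametrized samples z = µ_ϕ(x) + σ_ϕ(x) · ϵ, where µ_ϕ and σ_ϕ are the outputs of the encoder.

As a concrete sketch of one training step, assuming a Gaussian encoder and a Bernoulli decoder over data in [0, 1]^784; the architecture, sizes and names below are illustrative choices of mine, not the course's reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

class VAE(nn.Module):
    """Sketch: Gaussian q_phi(z|x) (encoder), Bernoulli p(x|z; theta) (decoder)."""
    def __init__(self, x_dim=784, z_dim=20, h_dim=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)        # mu_phi(x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)    # log sigma_phi(x)^2
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def elbo(self, x):
        h = self.enc(x)
        q = Normal(self.enc_mu(h), torch.exp(0.5 * self.enc_logvar(h)))
        z = q.rsample()                                # reparametrized sample: z = mu + sigma * eps
        log_q = q.log_prob(z).sum(-1)                  # log q_phi(z|x)
        log_pz = Normal(0.0, 1.0).log_prob(z).sum(-1)  # log p(z), standard normal prior
        logits = self.dec(z)                           # Bernoulli logits for p(x|z; theta)
        log_px_z = -F.binary_cross_entropy_with_logits(
            logits, x, reduction="none").sum(-1)       # log p(x|z; theta)
        # Single-sample Monte Carlo estimate of E_q[log p(x, z; theta) - log q_phi(z|x)]
        return (log_px_z + log_pz - log_q).mean()

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)        # stand-in for a batch of images with pixels in [0, 1]
loss = -model.elbo(x)          # maximizing the ELBO = minimizing its negative
opt.zero_grad(); loss.backward(); opt.step()
```

A single backward pass through the sampled z gives gradients w.r.t. both θ (decoder) and ϕ (encoder) at once, which is exactly the reparametrization-based estimate above.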

Autoencoder: Perspective


What does the training objective L(x; θ, ϕ) do?

What is the intuition behind the two terms?

Continued

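The standard way to answer both questions is to rearrange the ELBO (the usual rewriting, nothing specific to these notes):

$$\mathcal{L}(x;\theta,\phi) \;=\; \mathbb{E}_{q_\phi(z|x)}\big[\log p(x|z;\theta)\big] \;-\; D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)$$

The first term is a reconstruction term: z is sampled from the encoder q_ϕ(z|x) and the decoder p(x|z; θ) is rewarded for assigning high probability to the original x, which is the autoencoding behaviour. The second term is a regularizer: it keeps the approximate posterior close to the prior p(z), so that after training we can generate new data by sampling z ∼ p(z) and decoding it.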

Summary of Latent Variable Models