Rishab Mudliar

About

These blog posts are my notes, representing my interpretation of the CS236 course taught by Stefano Ermon.

Recap

Learning a generative model.

Defining the setting

In the last blog we defined a way to represent probability distributions in terms of parameters. But how do we find a model that captures the real data distribution as closely as possible? What factors should we keep in mind?

That depends on the type of task.

Learning as density estimation

But how do we evaluate closeness?

KL-divergence

Intuition behind KL Divergence

Assume that we have samples from the true data distribution Pdata and that we have learned a model Pθ parameterized by θ. The KL divergence can then be thought of as the number of extra bits required to encode samples from Pdata using a code optimized for Pθ instead. Needing fewer extra bits means that Pθ is close to Pdata.
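
Formally, the KL divergence is defined as

D(Pdata ‖ Pθ) = Σ_x Pdata(x) log(Pdata(x) / Pθ(x)) = E_{x∼Pdata}[log(Pdata(x) / Pθ(x))]

It is always non-negative, is zero exactly when Pdata = Pθ, and is asymmetric in its two arguments.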

Maximum Log Likelihood

Simplifying the KL divergence as follows:

D(Pdata ‖ Pθ) = E_{x∼Pdata}[log(Pdata(x) / Pθ(x))] = E_{x∼Pdata}[log Pdata(x)] − E_{x∼Pdata}[log Pθ(x)]

The first term does not depend on θ, so it can be ignored when optimizing over θ.

Maximum Likelihood
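
Minimizing the KL divergence is therefore equivalent to maximizing the expected log-likelihood:

arg min_θ D(Pdata ‖ Pθ) = arg max_θ E_{x∼Pdata}[log Pθ(x)]

We cannot evaluate this expectation exactly, because Pdata is unknown; all we have are samples from it (the dataset). This is exactly where the following tool helps.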

Monte Carlo Estimation

1. Express the quantity of interest as the expected value of a random variable:

E_{x∼P}[g(x)] = Σ_x g(x) P(x)

2. Approximate the expectation by a sample average:

ĝ(x^1, …, x^T) ≜ (1/T) Σ_{t=1}^{T} g(x^t)

where x^1, …, x^T are independent samples from P.

Intuition?

In most cases the true data distribution P(x) is not known. In that case, sampling from the dataset and approximating the expected value with Monte Carlo estimation gives us a way to compute the quantity of interest. In later sections, Monte Carlo estimation will be used in derivations that show the usefulness of this approximation.
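
To see this in code, here is a minimal sketch with a made-up target: we estimate E[g(x)] for x ∼ Bernoulli(p) with g(x) = x², where the exact answer is p, so the estimate can be checked:

```python
import random

def monte_carlo_estimate(g, sample, T):
    """Average g over T independent draws from `sample`."""
    return sum(g(sample()) for _ in range(T)) / T

# Made-up example: x ~ Bernoulli(p), g(x) = x**2,
# so E[g(x)] = 0**2 * (1 - p) + 1**2 * p = p.
p = 0.7
estimate = monte_carlo_estimate(g=lambda x: x ** 2,
                                sample=lambda: 1 if random.random() < p else 0,
                                T=10_000)
print("exact:", p, "estimate:", estimate)
```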

Properties of Monte Carlo Estimation
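
For reference, the estimator ĝ = (1/T) Σ_{t=1}^{T} g(x^t) has three standard properties:

- Unbiased: E[ĝ] = E_{x∼P}[g(x)]
- Convergent: by the law of large numbers, ĝ → E_{x∼P}[g(x)] as T → ∞
- Shrinking variance: Var[ĝ] = Var_{x∼P}[g(x)] / T, so more samples give a more reliable estimate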

Estimating Pθ(x) via MLE

Example

MLE scoring for the coin example
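
In the coin example the model is a Bernoulli distribution: Pθ(H) = θ and Pθ(T) = 1 − θ, so L(θ; D) = θ^{#heads} · (1 − θ)^{#tails}, which is maximized at θ̂ = #heads / m. A minimal sketch (the dataset below is made up for illustration):

```python
import math

# Hypothetical dataset of coin flips (H = heads, T = tails).
D = ["H", "H", "T", "H", "T"]

def log_likelihood(theta, data):
    """log L(theta; D) for a Bernoulli model with P(H) = theta."""
    heads = data.count("H")
    tails = data.count("T")
    return heads * math.log(theta) + tails * math.log(1 - theta)

# Closed-form MLE: the fraction of heads in the dataset.
theta_mle = D.count("H") / len(D)
print("theta_mle =", theta_mle)  # 0.6 for this dataset

# Sanity check: theta_mle attains the highest log-likelihood on a grid.
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=lambda t: log_likelihood(t, D))
print("grid argmax ≈", best)  # close to 0.6
```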

Extending the MLE principle to autoregressive models

Given an autoregressive model with n variables and factorization

Pθ(x) = ∏_{i=1}^{n} p_neural(x_i | x_{<i}; θ_i)

θ = (θ1, …, θn) are the parameters of all the conditionals. The training data is D = {x^(1), …, x^(m)}. What is the maximum likelihood estimate of the parameters θ?

L(θ, D) = ∏_{j=1}^{m} Pθ(x^(j)) = ∏_{j=1}^{m} ∏_{i=1}^{n} p_neural(x_i^(j) | x_{<i}^(j); θ_i)

θ* = arg maxθ L(θ, D)
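
To make the double product concrete, here is a minimal sketch of evaluating log L(θ, D), assuming binary variables and a toy logistic stand-in for p_neural (a real model would use a neural network per conditional; all names and data below are made up):

```python
import math

def p_neural(x_i, x_prev, theta_i):
    """Toy conditional: P(x_i | x_<i) for binary x_i, a logistic
    function of the sum of the previous variables."""
    z = theta_i[0] + theta_i[1] * sum(x_prev)
    p_one = 1 / (1 + math.exp(-z))
    return p_one if x_i == 1 else 1 - p_one

def log_likelihood(theta, D):
    """log L(theta, D) = sum_j sum_i log p_neural(x_i^(j) | x_<i^(j); theta_i)."""
    total = 0.0
    for x in D:                  # outer product over the m samples
        for i in range(len(x)):  # inner product over the n variables
            total += math.log(p_neural(x[i], x[:i], theta[i]))
    return total

# Hypothetical data: m = 3 samples of n = 4 binary variables,
# with one (bias, weight) parameter pair per conditional.
D = [(1, 0, 1, 1), (0, 0, 1, 0), (1, 1, 0, 1)]
theta = [(0.0, 0.5) for _ in range(4)]
print(log_likelihood(theta, D))
```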

MLE Learning: Gradient Descent

Goal: arg maxθ L(θ, D) = arg maxθ ℓ(θ), where ℓ(θ) = log L(θ, D) is the log-likelihood.

ℓ(θ) = Σ_{j=1}^{m} Σ_{i=1}^{n} log p_neural(x_i^(j) | x_{<i}^(j); θ_i)

1. Initialize θ^0 at random
2. Compute ∇θ ℓ(θ) (by back-propagation)
3. Update θ^{t+1} = θ^t + α_t ∇θ ℓ(θ)

Non-convex optimization problem, but often works well in practice
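
As a minimal sketch of this update rule, here is gradient ascent on the log-likelihood of the coin example from earlier (a toy model, not an autoregressive one); the closed-form gradient below is specific to the Bernoulli model:

```python
# Hypothetical dataset from the coin example above.
D = ["H", "H", "T", "H", "T"]

def grad_log_likelihood(theta, data):
    """d/d theta of log L(theta; D) = #heads/theta - #tails/(1 - theta)."""
    heads = data.count("H")
    tails = data.count("T")
    return heads / theta - tails / (1 - theta)

theta, alpha = 0.5, 0.01  # initialization and a fixed step size
for t in range(200):
    # theta^{t+1} = theta^t + alpha_t * grad l(theta)
    theta += alpha * grad_log_likelihood(theta, D)
    theta = min(max(theta, 1e-6), 1 - 1e-6)  # keep theta a valid probability

print(theta)  # converges toward the MLE 0.6
```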

MLE Learning: Stochastic Gradient Descent

What if m = |D| is huge?

We have to sum the gradients over all m samples in the dataset, where each sample's probability is represented by an autoregressive model over n variables. If m is huge, say m = 10^6, then every iteration of gradient descent requires at least 10^6 · n gradient computations. That is an expensive cost for optimizing our model.

Solution? Monte Carlo to the rescue.

Recall the Monte Carlo estimate:

E_{x∼P}[g(x)] ≈ (1/T) Σ_{t=1}^{T} g(x^t), where x^t ∼ P

Definition for ∇θℓ(θ)

∇θ ℓ(θ) = Σ_{j=1}^{m} Σ_{i=1}^{n} ∇θ log p_neural(x_i^(j) | x_{<i}^(j); θ_i)

What’s missing?

If the outer sum over j is to be read as an expectation over samples x drawn from D, with the inner sum over i as the random variable whose expectation is being computed, then we are missing the probabilities associated with sampling each x^(j) from D.

Adding m · (1/m) (assuming the sampling process is uniform)

Then 1/m is the probability of sampling each data point from D.

Using Monte Carlo estimation, we can then compute ∇θℓ(θ) as follows:

∇θ ℓ(θ) = m Σ_{j=1}^{m} (1/m) Σ_{i=1}^{n} ∇θ log p_neural(x_i^(j) | x_{<i}^(j); θ_i)
        = m E_{x^(j)∼D}[Σ_{i=1}^{n} ∇θ log p_neural(x_i^(j) | x_{<i}^(j); θ_i)]
        ≈ m Σ_{i=1}^{n} ∇θ log p_neural(x_i^(j) | x_{<i}^(j); θ_i), where x^(j) is sampled uniformly from D
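
In code, this amounts to sampling a data point (or small minibatch) uniformly from D, averaging the per-sample gradients, and scaling by m. A minimal sketch, where per_sample_grad is a hypothetical helper standing in for Σ_i ∇θ log p_neural(x_i | x_{<i}; θ_i):

```python
import random

def sgd_gradient_estimate(per_sample_grad, theta, D, batch_size=1):
    """Unbiased estimate of grad l(theta): average per-sample gradients
    over a uniformly sampled minibatch, then scale by m = |D|."""
    m = len(D)
    batch = random.sample(D, batch_size)  # uniform sampling: each x has prob 1/m
    avg = [0.0] * len(theta)
    for x in batch:
        g = per_sample_grad(theta, x)
        avg = [a + gi / batch_size for a, gi in zip(avg, g)]
    return [m * a for a in avg]  # scale by m to match the full-batch sum

# Toy check: with per_sample_grad(theta, x) = [x], the full-batch
# gradient is sum(D) = 10.0, and the estimate fluctuates around it.
D = [1.0, 2.0, 3.0, 4.0]
est = sgd_gradient_estimate(lambda th, x: [x], [0.0], D, batch_size=2)
print(est)
```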

Bias-Variance Trade Off

The Monte Carlo gradient estimate above is unbiased: in expectation it equals the full-batch gradient. The trade-off is variance. Estimating the gradient from a single sample (or a small minibatch) is cheap but noisy; larger minibatches reduce the variance of the estimate at a higher computational cost per update.

Sign-off