About
These blog posts are my notes and represent my interpretation of the CS236 course taught by Stefano.
Part 2
Learning a generative model
In the last part we defined a statistical generative model as a probability distribution p(x): sampling from this distribution generates new images. But p(x) is unknown.
How do we learn it?
Before I go into it, a heads-up: we will run into a lot of probability terminology along the way. So whoever is reading, I hope you are comfortable with your probability. If not, there are a lot of videos that explain these terms in an awesome way. I will try my best though and hope it is understandable.
Alright let’s get started then.
Probability Distributions
Our goal is to learn a distribution that approximates the data that we are trying to generate. But what is a distribution? Let’s understand with an example.
Take an unbiased coin and toss it. What is the outcome? Write down the probability associated with it. Now write the probability associated with the event that did not occur.
It should look something like P(X = YourOutcome) = 1/2 and P(X != YourOutcome) = 1 − 1/2.
Well now you basically have the recipe to create a distribution.
Here is the checklist.
- A set of random variables
- The probability associated with each value of the random variable
Once you satisfy the checklist, you plot the probabilities associated with the random variable. Congratulations on creating a distribution for the first time... or not.
Types
Discrete Probability Distributions
Probability distributions associated with random variables that are discrete in nature.
Examples (taken from the slides)
Bernoulli distribution: (biased) coin flip
D = {Heads, Tails}
Specify P(X = Heads) = p. Then P(X = Tails) = 1 − p.
Write: X ∼ Ber (p)
Sampling: flip a (biased) coin
Categorical distribution: (biased) m-sided dice
D = {1, · · · , m}
Specify P(Y = i) = pi , such that ∑i pi = 1
Write: Y ∼ Cat(p1 , · · · , pm )
Sampling: roll a (biased) die
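Here is a minimal sketch, not from the course, of how you could sample from these two distributions with NumPy; the probabilities below are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli(p): a biased coin with P(Heads) = p
p = 0.7
coin = rng.choice(["Heads", "Tails"], p=[p, 1 - p])

# Categorical(p1, ..., pm): a biased m-sided die; the pi must sum to 1
probs = [0.1, 0.2, 0.3, 0.4]
die = rng.choice([1, 2, 3, 4], p=probs)

print(coin, die)
```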
Continuous Probability Distributions
A continuous probability distribution is one in which a continuous random variable X can take on any value within a given range of values ‒ which can be infinite, and therefore uncountable. For example, time is infinite because you could count from 0 to a billion seconds, a trillion seconds, and so on, forever. Source(https://www.knime.com/blog/learn-continuous-probability-distribution)
Normally these distributions are defined using a probability density function, or PDF (not the file format :p)
Example
Uniform Distribution
A Uniform distribution is one where all the events in a given range are equally likely to occur. It is defined by two parameters, x and y, where x = minimum value and y = maximum value. It is generally denoted by U(x, y).
Its density is f(v) = 1/(y − x) for any v in [x, y].
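A quick sketch of this with NumPy; the values of x and y are made up:

```python
import numpy as np

x, y = 2.0, 5.0              # minimum and maximum value
density = 1.0 / (y - x)      # f(v) = 1/(y - x) for any v in [x, y]

rng = np.random.default_rng(0)
samples = rng.uniform(x, y, size=5)   # draw a few samples from U(x, y)
print(density, samples)
```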
Joint Probability Distribution
The next topic is the joint distribution, which can be thought of as a distribution over two or more random variables.
Examples
Flip of coins
Consider the flip of two fair coins; let A and B be discrete random variables associated with the outcomes of the first and second coin flips respectively. Each coin flip is a Bernoulli trial and has a Bernoulli distribution. If a coin displays “heads” then the associated random variable takes the value 1, and it takes the value 0 otherwise. The probability of each of these outcomes is 1/2.
To model the joint distribution we have to consider all the possible cases.
In this case they are
- P(A = Heads, B = Heads)
- P(A = Heads, B = Tails)
- P(A = Tails, B = Heads)
- P(A = Tails, B = Tails)
Since the two coin flips are independent, the outcome of the second coin does not depend on the first. In that case we can write P(A, B) as the product of the individual (marginal) distributions.
So each of the four cases is equally likely, with a probability of 1/4,
and this joint distribution is represented as P(A, B) = P(A)P(B), where A, B ∈ {Heads, Tails}
P(A = a) = 1/2 for a ∈ {Heads, Tails}
P(B = b) = 1/2 for b ∈ {Heads, Tails}
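A tiny sketch of this product rule in code (my own illustration, not from the slides):

```python
# The joint distribution of two independent fair coins as the product
# of the marginals: P(A, B) = P(A) * P(B).
P_A = {"Heads": 0.5, "Tails": 0.5}
P_B = {"Heads": 0.5, "Tails": 0.5}

joint = {(a, b): P_A[a] * P_B[b] for a in P_A for b in P_B}
print(joint)   # each of the four outcomes gets probability 1/4
```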
Sampling from a distribution
Modelling a pixel’s color

Let’s say we want to generate an image pixel by pixel in the RGB space.
For each channel, assign a random variable. In this case we need 3 random variables:
Red Channel R. Val(R) = {0, · · · , 255}
Green Channel G . Val(G) = {0, · · · , 255}
Blue Channel B. Val(B) = {0, · · · , 255}
The joint distribution p(r, g, b) can be used to sample pixels.
But how many parameters would be required to represent all the possibilities in this distribution?
Each channel has 256 possible values, which means 256^3 possible (r, g, b) combinations. Since the probabilities must sum to 1, the last probability can be written as 1 minus the sum of the others.
So we get the number of parameters as 256^3 − 1.
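A small sketch of that counting argument (an illustration of mine, not course code):

```python
# A full joint table over one RGB pixel: 256**3 outcomes, and because the
# probabilities must sum to 1, one of them is determined by the others.
num_outcomes = 256 ** 3
num_parameters = num_outcomes - 1
print(num_outcomes, num_parameters)   # 16777216 16777215
```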
Modelling a binary image

Suppose X1 , . . . , Xn are binary (Bernoulli) random variables, i.e., Val(Xi ) = {0, 1} = {Black, White}.
Possible states? Each variable takes two values, 0 and 1, so for n variables there are 2^n combinations/states.
Sampling from p(x1 , . . . , xn ) generates an image
Number of parameters required to represent the distribution = 2^n − 1
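Here is a hypothetical sketch for a tiny "image" with n = 4 pixels, just to make the counting concrete:

```python
import itertools
import numpy as np

n = 4
states = list(itertools.product([0, 1], repeat=n))   # all 2**n binary images
probs = np.full(len(states), 1 / len(states))        # a made-up joint table

rng = np.random.default_rng(0)
idx = rng.choice(len(states), p=probs)               # sampling = picking one state
print(len(states) - 1, states[idx])                  # 2**n - 1 = 15 free parameters
```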
Approximating distributions
Structure through Independence
From the previous examples we see that representing the distribution of an image can be expensive. Can we make some assumptions?
Taking the binary image example, what if we assume that the pixels of the binary image are independent?
In that case, p(x1 , . . . , xn ) = p(x1 )p(x2 ) · · · p(xn)
Possible states = Still 2^n
But how many parameters do we need to represent this? Each Bernoulli distribution can be represented by a single parameter px, where px is the probability of a pixel being 1 and 1 − px of it being 0.
In that case we can represent the distribution over 2^n states using just n parameters.
(Side note: in the previous case that wasn't possible because the variables were not independent, so every joint configuration needed its own parameter, hence 2^n − 1.)
Cons?

The independence assumption is too strong, so the model is not likely to be useful.
For example, each pixel is chosen independently when we sample from it, so the samples look like random noise rather than coherent images.
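A minimal sketch of sampling under this full-independence assumption; the image size and probabilities are my own placeholders:

```python
import numpy as np

# p(x1, ..., xn) = p(x1) * ... * p(xn): one Bernoulli parameter per pixel.
rng = np.random.default_rng(0)
n = 28 * 28
p = np.full(n, 0.5)                        # p_i = P(pixel i is white), n parameters
image = rng.binomial(1, p).reshape(28, 28)
# Each pixel is drawn on its own, so the result is just noise, not a
# coherent image.
print(image)
```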
Structure through conditional independence
Using Chain Rule p(x1 , . . . , xn ) = p(x1 )p(x2 | x1 )p(x3 | x1 , x2 ) · · · p(xn | x1 , · · · , xn−1 )
How many parameters? 1 + 2 + · · · + 2^(n−1) = 2^n − 1
p(x1 ) requires 1 parameter
p(x2 | x1 = 0) requires 1 parameter, p(x2 | x1 = 1) requires 1 parameter, and so on
Total: 2^n − 1 parameters.
2^n − 1 is still exponential, so the chain rule alone does not buy us anything.
Now suppose Xi+1 ⊥ X1 , . . . , Xi−1 | Xi
What this means is that the variable Xi+1 is independent of X1, . . . , Xi−1 given Xi. In that case we can simplify p(x1, . . . , xn) as
p(x1 , . . . , xn ) = p(x1 )p(x2 | x1 )p(x3 | x2 ) · · · p(xn | xn−1 )
Now we only need 2n − 1 parameters (1 for p(x1) and 2 for each of the n − 1 conditionals), which is linear in n. This forms the basic idea behind Bayes Networks.
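Here is a small sketch of sampling from such a Markov factorization; the conditional probabilities are made up for illustration:

```python
import numpy as np

# p(x1, ..., xn) = p(x1) p(x2 | x1) p(x3 | x2) ... p(xn | x_{n-1})
# Parameters: 1 for p(x1) plus 2 per conditional -> 2n - 1 in total.
rng = np.random.default_rng(0)
n = 10
p_x1 = 0.5
p_next = {0: 0.8, 1: 0.2}       # P(x_{i+1} = 1 | x_i = 0) and P(x_{i+1} = 1 | x_i = 1)

x = [rng.binomial(1, p_x1)]
for _ in range(n - 1):
    x.append(rng.binomial(1, p_next[x[-1]]))
print(x)
```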
Bayes Network: General Idea
Use conditional parameterization (instead of joint parameterization)
For each random variable Xi, specify p(xi | xAi ) for a set XAi of random variables
Then get the joint parametrization as p(x1 , . . . , xn ) = ∏i p(xi | xAi )
Need to guarantee it is a legal probability distribution. It has to correspond to a chain rule factorization, with factors simplified due to the assumed conditional independencies.
Bayes Networks (Representation)
A Bayesian network is specified by a directed acyclic graph (DAG)
G = (V , E ) with:
- One node i ∈ V for each random variable Xi
- One conditional probability distribution (CPD) per node, p(xi | xPa(i) ), specifying the variable's probability conditioned on its parents' values
The graph G = (V , E ) is called the structure of the Bayesian Network.
Defines a joint distribution: p(x1 , . . . , xn ) = ∏i∈V p(xi | xPa(i) )
Claim: p(x1 , . . . , xn ) is a valid probability distribution because of the ordering implied by the DAG
Economical representation: exponential in |Pa(i)|, not |V |
Examples

Consider the classic student Bayes Network with the variables difficulty, intelligence, grade, sat, and letter, where grade depends on difficulty and intelligence, sat depends on intelligence, and letter depends on grade.
How do we represent this network in terms of probability?
For each node in the graph, look at its incoming edges: they come from the variables the node depends on, and they also encode that, given those parents, the node is conditionally independent of the other variables.
So, the CPD for each node is
- difficulty: p(difficulty)
- intelligence: p(intelligence)
- grade: p(grade | difficulty, intelligence)
- letter: p(letter | grade)
- sat: p(sat | intelligence)
The joint distribution corresponding to the above BN factors as
p(d, i, g , s, l) = p(d)p(i)p(g | i, d)p(s | i)p(l | g )
However, by the chain rule, any distribution can be written as
p(d, i, g , s, l) = p(d)p(i | d)p(g | i, d)p(s | i, d, g )p(l | g , d, i, s)
So, compared to the full chain rule factorization, the Bayesian network is assuming the conditional independencies
D ⊥ I,
S ⊥ {D, G } | I ,
L ⊥ {I , D, S} | G .
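To make this concrete, here is a sketch of ancestral sampling from this network with made-up CPD values (none of these numbers come from the course):

```python
import numpy as np

rng = np.random.default_rng(0)

d = rng.binomial(1, 0.4)                       # p(d): course is difficult
i = rng.binomial(1, 0.3)                       # p(i): student is intelligent
p_g = {(0, 0): 0.3, (0, 1): 0.9,
       (1, 0): 0.1, (1, 1): 0.5}               # p(g = good | d, i)
g = rng.binomial(1, p_g[(d, i)])
s = rng.binomial(1, 0.8 if i else 0.2)         # p(s = high | i)
l = rng.binomial(1, 0.9 if g else 0.1)         # p(l = strong | g)

print(d, i, g, s, l)   # one joint sample drawn in topological order
```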
Spam Classification
Classify e-mails as spam (Y = 1) or not spam (Y = 0)
- Let 1 : n index the words in our vocabulary (e.g., English)
- Xi = 1 if word i appears in an e-mail, and 0 otherwise
- E-mails are drawn according to some distribution p(Y , X1 , . . . , Xn )
Words are conditionally independent given Y
p(y , x1 , . . . , xn ) = p(y) ∏i p(xi | y )
To evaluate p(y = 1 | x1, . . . , xn), use Bayes' rule:
p(y = 1 | x1, . . . , xn) = p(y = 1) ∏i p(xi | y = 1) / [ p(y = 0) ∏i p(xi | y = 0) + p(y = 1) ∏i p(xi | y = 1) ]

Are the independence assumptions made here reasonable?
Philosophy: Nearly all probabilistic models are "wrong", but many are nonetheless useful.
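Here is a toy sketch of this Naive Bayes model with a hypothetical 3-word vocabulary and made-up parameters, just to show how Bayes' rule turns the factorization into p(y = 1 | x):

```python
import numpy as np

p_y1 = 0.3                                  # prior probability of spam
p_xi_given_y1 = np.array([0.8, 0.6, 0.1])   # P(word i appears | spam)
p_xi_given_y0 = np.array([0.1, 0.2, 0.4])   # P(word i appears | not spam)

x = np.array([1, 1, 0])                     # which words appear in an e-mail

def likelihood(x, p_word):
    # prod_i p(x_i | y) for binary word-presence features
    return np.prod(np.where(x == 1, p_word, 1 - p_word))

joint_spam = p_y1 * likelihood(x, p_xi_given_y1)
joint_ham = (1 - p_y1) * likelihood(x, p_xi_given_y0)
print(joint_spam / (joint_spam + joint_ham))   # p(y = 1 | x1, ..., xn)
```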
Discriminative versus generative models

- Using the chain rule, p(Y , X) = p(X | Y )p(Y ) = p(Y | X)p(X). The two factorizations correspond to two Bayesian networks: Y → X (generative) and X → Y (discriminative)
- However, suppose all we need for prediction is p(Y | X). In the generative model (Y → X), we need to specify/learn both p(Y ) and p(X | Y ), then compute p(Y | X) via Bayes rule
- In the discriminative model (X → Y), it suffices to estimate just the conditional distribution p(Y | X)
- We never need to model/learn/use p(X)!
- Called a discriminative model because it is only useful for discriminating Y ’s label when given X
Example
Logistic Regression
For the discriminative model, assume that p(Y = 1 | x; α) = f (x, α).
Not represented as a table anymore. It is a parameterized function of x (regression)
- Has to be between 0 and 1
- Depend in some simple but reasonable way on x1 , · · · , xn
- Completely specified by a vector α of n + 1 parameters (compact representation)
Linear dependence: let z(α, x) = α0 + α1 x1 + · · · + αn xn. Then p(Y = 1 | x; α) = σ(z(α, x)), where σ(z) = 1/(1 + e^(−z)) is called the logistic function
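A minimal sketch of this model in code, with arbitrary weights (not fitted to anything):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

alpha0 = -1.0
alpha = np.array([0.5, -0.3, 2.0])     # together with alpha0: n + 1 parameters
x = np.array([1.0, 4.0, 0.5])

z = alpha0 + alpha @ x                 # z(alpha, x) = alpha0 + sum_i alpha_i x_i
print(sigmoid(z))                      # p(Y = 1 | x; alpha), a number in (0, 1)
```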
Neural Models
In discriminative models, we assume that
p(Y = 1 | x; α) = f (x, α)
In the previous example we had linear dependence:
z(α, x) = α0 + α1 x1 + · · · + αn xn
p(Y = 1 | x; α) = σ(z(α, x)), where σ(z) = 1/(1 + e^(−z)) is the logistic function
Dependence might be too simple
Solution?
Non-linear dependence: let h(A, b, x) = f (Ax + b) be a non-linear transformation of the inputs (features).
p_Neural(Y = 1 | x; α, A, b) = σ(α0 + α1 h1 + · · · + αn hn), where the hi are the components of h(A, b, x). (A small code sketch follows the list below.)
- More flexible
- More parameters: A, b, α
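Here is that non-linear version as a minimal sketch; the shapes, the tanh non-linearity, and all parameter values are arbitrary choices of mine:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)              # input features x1, ..., xn
A = rng.normal(size=(4, 3))         # extra parameters of the hidden layer
b = rng.normal(size=4)

h = np.tanh(A @ x + b)              # h(A, b, x) = f(Ax + b), non-linear features
alpha0, alpha = 0.1, rng.normal(size=4)
print(sigmoid(alpha0 + alpha @ h))  # p_Neural(Y = 1 | x; alpha, A, b)
```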
Bayes Networks with Continuous Variables
Bayes Networks can also be used with continuous variables. Here is an example.
Mixture of 2 Gaussians
Consider a Bernoulli random variable Z and a variable X that is Gaussian conditioned on Z.
We need to specify p(x | z)
Z ∼ Bernoulli(p)
X | (Z = 0) ∼ N (µ0 , σ0 ) , X | (Z = 1) ∼ N (µ1 , σ1 )
The parameters are p, µ0 , σ0 , µ1 , σ1
By mixing 2 Gaussians we have built a simple Bayes Network in which X depends on the variable Z.
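A short sketch of ancestral sampling from this mixture, with made-up values for p, µ0, σ0, µ1, σ1:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
mu = {0: -2.0, 1: 3.0}
sigma = {0: 0.5, 1: 1.0}

z = rng.binomial(1, p)               # Z ~ Bernoulli(p)
x = rng.normal(mu[z], sigma[z])      # X | Z = z ~ N(mu_z, sigma_z)
print(z, x)
```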
Wrapping up
So in this part we dug a lot into understanding various probability distributions: how to sample from them, how to approximate them, and so on. The next part will build on the same ideas but use a different class of models, termed autoregressive models. Until then, bye!!