Tuesday, August 12, 2014

on the statistics of rolling a dice - part 2

In part 1 I wrote about the maximum likelihood estimate and the MAP estimate, along with the probability distributions involved in the process of rolling a dice. In this part I will explore the idea of prior distributions and conjugate prior distributions.

Prior and posterior probability distributions

A probability distribution tells us the probability of occurrence of each value in the domain of a random variable. The probability distribution of a coin toss tells us how likely we are to observe a given number of heads or tails in a series of tosses. For a coin toss we associate a probability p with success, that is, with getting a head. The distribution is governed by this p value and is given by the binomial distribution with p as its parameter. In the case of rolling a dice we have a probability p(i) associated with each side i, and the distribution is given by a multinomial distribution. We are interested in learning the parameters of the distribution once we observe the data. Why should we care about learning the parameters? Because once we learn the parameters we have a hypothesis about the data (or about the generative process behind it, here a coin toss or a dice roll with a probability associated with each side), and that hypothesis can be used to predict the outcome of future events.
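As a concrete illustration, here is a minimal Python sketch of how the maximum likelihood estimate of these parameters falls out of the observed counts; the counts used here are made up for illustration:

```python
import numpy as np

# Coin toss: under a binomial likelihood the MLE of p (probability of
# heads) is simply the fraction of heads observed.
n_heads, n_tails = 7, 3                      # hypothetical observations
p_mle = n_heads / (n_heads + n_tails)
print("MLE for p(head):", p_mle)             # 0.7

# Dice roll: under a multinomial likelihood the MLE of p(i) for each
# side i is the fraction of rolls that showed that side.
counts = np.array([2, 2, 2, 2, 2, 2])        # 12 rolls, each side twice
p_mle_dice = counts / counts.sum()
print("MLE for p(i):", p_mle_dice)           # [1/6, 1/6, ...]
```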

When you roll a dice, you generally assume that each side has the same probability of coming up.
Suppose you rolled the dice 12 times and observed each side exactly 2 times. Based on that observation alone you conclude that each side has the same probability of being observed.
But what if you were told that in the past 100 throws, side 6 came up 50 times? How should that fact affect your estimate of the probability distribution? Based on a pure maximum likelihood estimate, you concluded that each side's probability is 1/6. Once you factor in the prior observations, your estimate will change. The prior probability distribution is the distribution over the parameters before observing the current data, and the posterior probability distribution is the distribution over the parameters after observing the data. You can think of the prior as a prior belief, the observed data as new evidence that can affect that belief, and the posterior as the new conclusion formed from the prior belief and the new evidence.

Posterior = k * likelihood * prior, where k is a normalization constant that makes the probabilities add up to 1.
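As a minimal numerical sketch of that relation, the snippet below computes the posterior over a grid of candidate values of p for the coin; the flat prior, the counts, and the grid are all assumptions made up for illustration:

```python
import numpy as np

p_grid = np.linspace(0.01, 0.99, 99)      # candidate values of p
prior = np.ones_like(p_grid)              # flat prior belief over p
prior /= prior.sum()

n_heads, n_tails = 7, 3                   # hypothetical new evidence
likelihood = p_grid**n_heads * (1 - p_grid)**n_tails

unnormalized = likelihood * prior
posterior = unnormalized / unnormalized.sum()   # k makes it sum to 1
print("posterior mean of p:", (p_grid * posterior).sum())
```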

Frequentist vs Bayesian approach

A frequentist approach to determining the parameters of the distribution does not take a prior probability into account. A frequentist relies on the maximum likelihood estimate and disregards the prior. Well, it is not exactly like that; often there is no exact information available about the prior distribution, and in such cases a frequentist simply does not bother with it. In a strictly Bayesian approach, on the other hand, a convenient prior probability distribution is assumed when exact information is not available.
Note the term convenient prior probability distribution. We will need it later.

What is that convenient prior probability distribution? What should it look like so that it is convenient? And what is the effect of forcing a convenient prior, as opposed to the actual prior (even though we have no idea what the actual prior is, because we have no information about the past)?

Idea of conjugate prior

By definition, a conjugate prior of a likelihood function is one for which the posterior distribution belongs to the same family of distributions as the prior. When the likelihood function is multiplied by the prior, the result looks like the prior again, only with different parameter values. This is beneficial because it simplifies the mathematical computations involved. Two well known likelihood-conjugate prior pairs, frequently used in text analysis, are the Bernoulli likelihood with a Beta prior and the multinomial likelihood with a Dirichlet prior.

To show why it is convenient to use a conjugate prior, let us take the Bernoulli-Beta pair and the case of a coin toss. Let n1 be the number of times we observe a head and n0 the number of times we observe a tail.

If p is the probability of a head, the probability of getting n1 heads and n0 tails is given by

$$ P(X \mid p) \propto p^{n_1} (1-p)^{n_0} $$

This is our likelihood.

Let us assume that there is a prior distribution associated with p, which is a beta distribution given by

$$ P(p) = \frac{p^{\alpha_1 - 1}(1-p)^{\alpha_0 - 1}}{B(\alpha_1, \alpha_0)} \propto p^{\alpha_1 - 1}(1-p)^{\alpha_0 - 1} $$

What this means is that we randomly draw the probability value p associated with getting a head from a beta distribution, instead of fixing it in advance.
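To make that "draw p from a beta distribution" step concrete, here is a small sketch; the prior parameter values alpha1 = alpha0 = 2 are just an assumption for illustration:

```python
import numpy as np

alpha1, alpha0 = 2, 2          # assumed prior pseudo counts for head/tail
rng = np.random.default_rng(0)

# Instead of fixing p, draw it from the Beta(alpha1, alpha0) prior,
# then generate a single coin toss with that p.
p = rng.beta(alpha1, alpha0)
toss = rng.random() < p        # True -> head, False -> tail
print(p, toss)
```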

The posterior probability after observing the set X of coin tosses is

$$ P(p \mid X) \propto P(X \mid p)\, P(p) $$

The above equation says that the probability of a particular p is proportional to the probability of drawing that p from the beta distribution, times the probability of the observed data X being generated with that p.

Replacing the probabilities with their actual values,

$$ P(p \mid X) \propto p^{n_1}(1-p)^{n_0} \cdot p^{\alpha_1 - 1}(1-p)^{\alpha_0 - 1} $$

Simplifying,

$$ P(p \mid X) \propto p^{n_1 + \alpha_1 - 1}(1-p)^{n_0 + \alpha_0 - 1} $$

This is a beta distribution with parameters $(\alpha_1 + n_1, \alpha_0 + n_0)$. Notice that the posterior is in the same family of distributions as the prior (a beta distribution). The parameters $(\alpha_1, \alpha_0)$ of the prior are simply augmented by the number of times we observe heads and tails $(n_1, n_0)$. The parameters $(\alpha_1, \alpha_0)$ are also known as pseudo counts.
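A small sketch of this update in code, using scipy's beta distribution; the prior pseudo counts and the observed counts are illustrative assumptions:

```python
from scipy.stats import beta

alpha1, alpha0 = 2, 2          # assumed prior pseudo counts
n1, n0 = 7, 3                  # observed heads and tails

# Conjugacy: the posterior is again a beta distribution whose
# parameters are the prior pseudo counts plus the observed counts.
posterior = beta(alpha1 + n1, alpha0 + n0)
print("posterior mean of p:", posterior.mean())   # (alpha1+n1) / (alpha1+n1+alpha0+n0)
```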

In the case of rolling a dice the likelihood is a multinomial distribution, and its conjugate prior is the Dirichlet distribution.
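The same kind of update works for the dice, sketched below with the earlier numbers treated as pseudo counts; how exactly the 100 past throws are encoded as a prior is an assumption made for illustration:

```python
import numpy as np

# Prior pseudo counts from the 100 past throws: 50 sixes, with the
# remaining 50 throws assumed to be spread evenly over sides 1-5.
alpha = np.array([10, 10, 10, 10, 10, 50])

# New evidence: 12 rolls, each side observed exactly twice.
counts = np.array([2, 2, 2, 2, 2, 2])

# Dirichlet-multinomial conjugacy: the posterior parameters are just the
# prior pseudo counts plus the observed counts.
posterior_alpha = alpha + counts
posterior_mean = posterior_alpha / posterior_alpha.sum()
print("posterior mean of p(i):", posterior_mean)
# side 6 still has by far the highest estimated probability
```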

What if the prior is not actually the conjugate prior as we assumed?

Well, statistics is all about assumptions. We assumed a conjugate prior to simplify our mathematical computations. If in reality the prior distribution is not the conjugate prior we assumed it to be, then our posterior estimate will be off, roughly in proportion to how much the actual prior differs from our assumption.