Sunday, March 24, 2013

Taming Naive Bayes for text classification

Today I am going to write about one of the simplest and most widely used machine learning algorithms: Naive Bayes.

Some fun probability

As a warm-up, let me refresh a few basic probability notions.
In how many ways can 2 red and 3 blue balls be arranged in 5 positions?
This is simply the number of ways of arranging 2 red balls and 3 blue balls, which is given by 5C2 (or, equivalently, 5C3) = 5!/(2! * (5-2)!) = 10.
In a given arrangement, if the probability of picking a red ball is p (so the probability of picking a blue ball is 1-p), what is the probability of picking 2 red balls and 3 blue balls, assuming each pick is independent?
Answer: p^2 * (1-p)^3 for that one arrangement. If we consider all such arrangements, we can pick 2 red balls and 3 blue balls with probability 5!/(2! * 3!) * p^2 * (1-p)^3. This is because there are 5C2 ways of arranging these balls, in each arrangement the 2 red balls and 3 blue balls are picked with probability p^2 * (1-p)^3, and the total probability is the sum of these probabilities over all arrangements.
We can generalize the above notion to balls of k different colors, where a ball of color i is picked ci times (for a total of n = c1 + c2 + ... + ck picks) with probability pi. The probability of such an outcome is
n! / (c1! * c2! * ... * ck!) * p1^c1 * p2^c2 * ... * pk^ck. This distribution is called the multinomial distribution. Here each pick has k possible outcomes (picking a ball of a given color). If there are only 2 outcomes (i.e. k = 2), the above distribution is called the binomial distribution. Understanding the multinomial and binomial distributions is key to understanding Naive Bayes.
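
As a quick illustration, here is a small Java sketch (the class and method names are my own, not from any library) that computes the general multinomial probability and reproduces the 2-red/3-blue example with a made-up value of p:

public class MultinomialDemo {

    // n! computed as a double (fine for the small n used here)
    static double factorial(int n) {
        double f = 1.0;
        for (int i = 2; i <= n; i++) f *= i;
        return f;
    }

    // Multinomial probability: n!/(c1! * ... * ck!) * p1^c1 * ... * pk^ck,
    // where n = c1 + ... + ck.
    static double multinomial(int[] counts, double[] probs) {
        int n = 0;
        for (int c : counts) n += c;
        double result = factorial(n);
        for (int i = 0; i < counts.length; i++) {
            result /= factorial(counts[i]);
            result *= Math.pow(probs[i], counts[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        // 2 red and 3 blue balls with an illustrative p(red) = 0.4:
        // 5!/(2! * 3!) * 0.4^2 * 0.6^3 = 10 * 0.16 * 0.216 = 0.3456
        System.out.println(multinomial(new int[]{2, 3}, new double[]{0.4, 0.6}));
    }
}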

Bayes theorem

Simply put, Bayes theorem states that P(A|B)=P(B|A)*P(A)/P(B).

or stated in a different way, 
P(cause|evidence) = P(evidence|cause) * P(cause)/P(evidence)
(To be read as probability of cause given an evidence = probability of evidence given the cause * probability of cause / probability of evidence.)
Let us use this to state the probability of rain, when we hear the sound of water.
P(rain|hear sound of water) = P(sound of water|rain) * P(rain)/P(Sound of water).
Bayes theorem states that the probability that it is raining when the sound of water is heard is directly proportional to the probability of the sound of water when it is raining and to the probability of rain itself, and inversely proportional to the probability of the sound of water.

Let us see how this makes sense. If you hear the sound of water every time it rains, you can speculate with good confidence that when you hear the sound of water, it must be raining. But if, at your place, you also hear the sound of water because of a water sprinkler, then the probability that it is raining when you hear the sound of water lessens. Similarly, if rain is highly probable in general, then rain becomes a more plausible cause of the sound of water (if you are in a desert area where rain is generally unlikely, you won't be confident predicting rain as the cause of the sound of water). Also, if you hear the sound of water all the time, whether it rains or not, you will be less confident associating rain with a given "sound of water" event.
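
To make this concrete, here is a tiny Java sketch of the rain example; all the probability values are made up purely for illustration:

public class BayesRainDemo {
    public static void main(String[] args) {
        // Illustrative numbers only, not from any real data.
        double pRain = 0.2;                // P(rain): prior probability of rain
        double pSoundGivenRain = 0.9;      // P(sound of water | rain)
        double pSoundGivenNoRain = 0.3;    // P(sound of water | no rain), e.g. sprinklers

        // Total probability of hearing the sound of water.
        double pSound = pSoundGivenRain * pRain + pSoundGivenNoRain * (1 - pRain);

        // Bayes theorem: P(rain | sound) = P(sound | rain) * P(rain) / P(sound)
        double pRainGivenSound = pSoundGivenRain * pRain / pSound;

        System.out.println(pRainGivenSound);   // 0.18 / 0.42 = 0.4286 (approx.)
    }
}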

Bayes theorem and  text classification

One important consequence of Bayes theorem is its use in classifying documents. Basically, you are given n documents, each of which can fall into one of k categories, and you need to assign a given document to one of these k categories.
For classifying a given document, we state the problem as below.
"given that the text is constituted by words in each position, what is the probability that the text belongs to a particular category"
In Bayes terms,  we can state this problem as finding the  likelihood of the text being in category Ci given the words W1,W2,...,Wn in each word position.
P(Ci|W1,W2,W3,...Wn)
As per Bayes theorem 
P(Ci|W1,W2,W3,W4,…Wn)=P(W1,W2,W3,W4,…,Wn|Ci) * P(Ci)/ P(W1,W2,W3,W4..,Wn)


Independence assumption

The term P(W1,W2,W3,W4,…,Wn|Ci) is the probability of W1 occurring at the first position, W2 at the second position, and so on. In strict terms this joint probability is obtained by the chain rule, as the product of P(W1|Ci), P(W2|W1,Ci), P(W3|W2,W1,Ci), ..., P(Wn|Wn-1,Wn-2,...,W1,Ci).

P(W1,W2,W3,W4,…,Wn|Ci) = P(W1|Ci) * P(W2|W1,Ci) * P(W3|W2,W1,Ci) * ... * P(Wn|Wn-1,Wn-2,...,W1,Ci), for a given arrangement of W1, W2, ..., Wn.

If there are k word tokens placed in n word cells of the text, then nPk = n!/(n-k)! such arrangements (permutations of k words over n positions) are possible. So the total probability is P(W1|Ci) * P(W2|W1,Ci) * P(W3|W2,W1,Ci) * ... * P(Wn|Wn-1,Wn-2,...,W1,Ci) summed over each of these permutations.

Estimating each of these conditional word probabilities is extremely difficult, because the training set would have to contain enough instances of the combinations of the words W1 to Wn, which requires a very, very large training set. To give a concrete example, to classify the text "tiger woods won pga tour" we would need enough occurrences of "tiger", "tiger woods", "tiger woods won", "tiger woods won pga" and "tiger woods won pga tour" spread across the training data. It is almost impossible to find such training data for real-world documents.
If we assume that the occurrence of a word is independent of the occurrence of any other word, the probability calculation simplifies: each conditional probability reduces to the marginal probability of the word within the class.
That is, P(Wn|Wn-1,Wn-2,...,W1,Ci) = P(Wn|Ci). We no longer need to worry about a word occurring in conjunction with a set of other words when calculating our probabilities. We make a second assumption: the probability of a word occurring at one position is independent of it occurring at any other position. This assumption means we do not have to calculate different probability terms for each permuted text. As a consequence, the texts "tiger woods won pga tour" and "tiger won pga tour woods" will be treated as the same document.

So the problem of text classification using Naive Bayes reduces to computing
n!/(n-k)! * P(W1|Ci) * P(W2|Ci) * ... * P(Wn|Ci) * P(Ci) / P(W1,W2,W3,W4,...,Wn) for each of the class labels Ci and finding the Ci with the highest probability value.

Note that for any two classes Ci and Cj, the terms n!/(n-k)! and P(W1,W2,W3,W4,...,Wn) are the same. So we do not need to compute these terms; we only compute the factor P(W1|Ci) * P(W2|Ci) * ... * P(Wn|Ci) * P(Ci) and assign to the document the label of the class with the maximum value of this factor.

The pseudo code will look like below (the helpers prior(label) for P(Ci) and wordLikelihood(word, label) for P(Wk|Ci) are placeholders for the probability estimates described next):

double maxLikelihood = 0.0;
String selectedLabel = "";
for (String label : classLabels) {              // each class label Ci
    double likelihood = prior(label);           // P(Ci)
    for (String word : documentWords) {
        likelihood *= wordLikelihood(word, label);  // P(Wk|Ci)
    }
    if (likelihood > maxLikelihood) {
        maxLikelihood = likelihood;
        selectedLabel = label;
    }
}

P(Ci) is the prior probability of the class label Ci. This is simply the ratio of the number of documents labeled Ci to the total number of documents in the training set.
Each P(Wk|Ci) is calculated as the ratio of the number of occurrences of Wk in documents labeled Ci to the total length (in words) of the documents labeled Ci.
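
Here is a minimal sketch of how these estimates can be computed, assuming the training set is supplied as labeled, tokenized documents; the class and method names are illustrative, not from any library:

import java.util.*;

public class NaiveBayesEstimates {
    Map<String, Integer> docCountPerLabel = new HashMap<>();        // #documents labeled Ci
    Map<String, Integer> wordCountPerLabel = new HashMap<>();       // total word count of documents labeled Ci
    Map<String, Map<String, Integer>> wordCounts = new HashMap<>(); // count of Wk within documents labeled Ci
    int totalDocs = 0;

    // Called once per training document.
    void train(String label, List<String> words) {
        totalDocs++;
        docCountPerLabel.merge(label, 1, Integer::sum);
        wordCountPerLabel.merge(label, words.size(), Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
    }

    // P(Ci): documents labeled Ci / total number of documents
    double prior(String label) {
        return docCountPerLabel.getOrDefault(label, 0) / (double) totalDocs;
    }

    // P(Wk|Ci): occurrences of the word in documents labeled Ci / total length of those documents
    double wordLikelihood(String word, String label) {
        int count = wordCounts.getOrDefault(label, Collections.emptyMap()).getOrDefault(word, 0);
        return count / (double) wordCountPerLabel.getOrDefault(label, 1);
    }
}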

Use of Log Likelihood

One common practice in computing the P(W1|Ci) * P(W2|Ci) * ... * P(Wn|Ci) * P(Ci) term is to add the logarithms of the terms instead of multiplying the terms themselves. This avoids possible floating point underflow during the calculation. Instead of comparing the likelihood values, you compare the log-likelihood values.
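
Here is the earlier classification loop rewritten with log-likelihoods, using the same assumed helpers; note that if a word probability is zero, Math.log returns negative infinity, which effectively rules that class out:

double maxLogLikelihood = Double.NEGATIVE_INFINITY;
String selectedLabel = "";
for (String label : classLabels) {
    double logLikelihood = Math.log(prior(label));               // log P(Ci)
    for (String word : documentWords) {
        logLikelihood += Math.log(wordLikelihood(word, label));  // + log P(Wk|Ci)
    }
    if (logLikelihood > maxLogLikelihood) {
        maxLogLikelihood = logLikelihood;
        selectedLabel = label;
    }
}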


