Sunday, March 24, 2013

Taming the Naive Bayes for text classification

Today I am going to write about one of the simplest and most widely used machine learning algorithms: Naive Bayes.

Some fun probability

As a warm-up, let me refresh some basic probability notions.
In how many ways can 2 red and 3 blue balls be arranged in 5 positions?
This is the number of ways of arranging 2 red balls and 3 blue balls, which is given by 5C2 (or 5C3) = 5!/(2! * (5-2)!) = 10.
In a given arrangement, if the probability of picking a red ball is p, what is the probability of picking 2 red balls and 3 blue balls (across five independent picks from a supply of differently colored balls)?
Ans - p^2 * (1-p)^3. If we consider all such arrangements, we can pick 2 red balls and 3 blue balls with probability 5!/(2! * 3!) * p^2 * (1-p)^3. This is because there are 5C2 ways of arranging these balls, and in each arrangement the probability of picking 2 red balls and 3 blue balls is p^2 * (1-p)^3. So the total probability of picking 2 red balls and 3 blue balls is the sum of these probabilities over all arrangements.
We can generalize this notion: if there are N different colors, color i must be picked ci times, and a single pick yields color i with probability pi, then the probability of the whole outcome over n = c1 + c2 + ... + cN picks is
n! / (c1! * c2! * ... * cN!) * p1^c1 * p2^c2 * ... * pN^cN. This distribution is called the multinomial distribution. Here we have N different outcomes (each outcome is picking a ball of a given color). If the number of such outcomes is 2 (i.e. N = 2), the distribution is called the binomial distribution. Understanding the multinomial and binomial distributions is key to understanding Naive Bayes.
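
To make the formula concrete, here is a small Python sketch of the red/blue example above. The function name is my own (not from any library), and the value p = 0.4 is just an illustrative choice.

from math import factorial

def multinomial_probability(counts, probs):
    # counts[i]: how many times color i must be picked; probs[i]: probability of picking color i once.
    n = sum(counts)
    arrangements = factorial(n)
    for c in counts:
        arrangements //= factorial(c)          # n! / (c1! * c2! * ... * cN!)
    p = 1.0
    for c, prob in zip(counts, probs):
        p *= prob ** c                         # p1^c1 * p2^c2 * ... * pN^cN
    return arrangements * p

# 2 red and 3 blue balls, with probability p = 0.4 of picking a red ball in one pick.
print(multinomial_probability([2, 3], [0.4, 0.6]))   # 10 * 0.4^2 * 0.6^3 = 0.3456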

Bayes theorem

Simply put, Bayes theorem states that P(A|B)=P(B|A)*P(A)/P(B).

or stated in a different way, 
P(cause|evidence) = P(evidence|cause) * P(cause)/P(evidence)
(To be read as probability of cause given an evidence = probability of evidence given the cause * probability of cause / probability of evidence.)
Let us use this to state the probability of rain when we hear the sound of water.
P(rain|sound of water) = P(sound of water|rain) * P(rain) / P(sound of water).
Bayes theorem states that the probability that it is raining when the sound of water is heard is directly proportional to the probability of hearing the sound of water when it is raining and to the probability of rain itself, and inversely proportional to the probability of hearing the sound of water.

Let us see how this makes sense. If you hear the sound of water every time it rains, you can speculate with good confidence that when you hear the sound of water, it must be raining. But if, at your place, you also hear the sound of water because of a water sprinkler, then the probability that it is raining when you hear the sound of water lessens. Similarly, if there is a high probability of rain in general, then the probability that rain is behind the sound of water increases (if you are in a desert area where the probability of rain is low, you won't be confident predicting rain as the cause of the sound of water). Also, if you hear the sound of water all the time, whether or not it rains, you will be less confident associating rain with a given "sound of water" event. The small numeric example below makes the same point.
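
To see the formula at work with numbers, here is a tiny Python sketch; every probability in it is hypothetical, chosen only for illustration.

# Hypothetical numbers, purely for illustration.
p_sound_given_rain = 0.9   # P(sound of water | rain)
p_rain = 0.2               # P(rain)
p_sound = 0.3              # P(sound of water), whatever the cause (rain, sprinkler, ...)

p_rain_given_sound = p_sound_given_rain * p_rain / p_sound
print(p_rain_given_sound)  # about 0.6, three times the prior probability of rain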

Bayes theorem and  text classification

One important consequence of Bayes theorem is its use in classifying documents. Basically, you are given n documents, each of which can fall into any of k categories. You need to assign a given document to one of these k categories.
For classifying a given document, we state the problem as below.
"given that the text is constituted by words in each position, what is the probability that the text belongs to a particular category"
In Bayes terms,  we can state this problem as finding the  likelihood of the text being in category Ci given the words W1,W2,...,Wn in each word position.
P(Ci|W1,W2,W3,...Wn)
As per Bayes theorem 
P(Ci|W1,W2,W3,...,Wn) = P(W1,W2,W3,...,Wn|Ci) * P(Ci) / P(W1,W2,W3,...,Wn)


Independence assumption

The term P(W1,W2,W3,...,Wn|Ci) is the probability of W1 occurring at the first position, W2 at the second position and so on, given the class Ci. In strict terms this is a joint probability, given by multiplying together the chain of conditional probabilities P(W1|Ci), P(W2|W1,Ci), P(W3|W2,W1,Ci), ..., P(Wn|Wn-1,Wn-2,...,W1,Ci).

P(W1,W2,W3,...,Wn|Ci) = P(W1|Ci) * P(W2|W1,Ci) * P(W3|W2,W1,Ci) * ... * P(Wn|Wn-1,Wn-2,...,W1,Ci), for a given arrangement of W1, W2, ..., Wn.

If there are k word tokens placed in n word cells of the text, then nPk = n!/(n-k)! such arrangements are possible (the number of permutations of k words over n positions). So the total probability is the product P(W1|Ci) * P(W2|W1,Ci) * P(W3|W2,W1,Ci) * ... * P(Wn|Wn-1,Wn-2,...,W1,Ci) summed over each of these permutations.

Estimating each of these conditional word probabilities is an extremely difficult task, because it demands that the training set contain enough instances of the combinations of the words W1 to Wn, which requires a very, very large training set. To give a concrete example, to classify the text "tiger woods won pga tour", we would need enough occurrences of the combinations "tiger", "tiger woods", "tiger woods won", "tiger woods won pga" and "tiger woods won pga tour" spread across the training data. It is almost impossible to find such training data for real world documents.
If we assume that the occurrence of a word is independent of the occurrence of the other words, the probability calculation simplifies: each conditional probability reduces to the marginal probability of the word within the class.
i.e., for example, P(Wn|Wn-1,Wn-2,...,W1,Ci) = P(Wn|Ci). We no longer need to worry about a word occurring in conjunction with a set of other words when calculating our probabilities. We make a second assumption: that the probability of a word occurring at one position is independent of it occurring at any other position. This assumption frees us from calculating separate probability terms for each permuted text. As a consequence, the texts "tiger woods won pga tour" and "tiger won pga tour woods" will be treated as the same document, as the quick check below shows.
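
In code (a Python check of my own, not part of any classifier library), the two assumptions mean that a document collapses to a bag of word counts, so any permutation of its words gives the same representation.

from collections import Counter

print(Counter("tiger woods won pga tour".split()) ==
      Counter("tiger won pga tour woods".split()))   # True: both texts map to the same bag of words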

So the problem of text classification using Naive Bayes reduces to computing
n!/(n-k)! * P(W1|Ci) * P(W2|Ci) * ... * P(Wn|Ci) * P(Ci) / P(W1,W2,...,Wn) for each of the class labels Ci and finding the Ci with the highest probability value.

Note that for any two classes Ci and Cj, the terms n!/(n-k)! and P(W1,W2,...,Wn) are the same. So we do not need to compute these terms; we only compute the factor P(W1|Ci) * P(W2|Ci) * ... * P(Wn|Ci) * P(Ci) and assign to the document the class label that maximizes this value.

The pseudo code will look like below
double maxLikelihood = 0.0;
String selectedLabel = "";
for each class label Ci {
    // likelihood of the document under class Ci
    double likelihood = P(W1|Ci) * P(W2|Ci) * ... * P(Wn|Ci) * P(Ci);
    if (likelihood > maxLikelihood) {
        maxLikelihood = likelihood;
        selectedLabel = Ci;
    }
}

P(Ci) is the prior probability of occurrence of the class label Ci. This is simply the ratio of the number of documents labeled Ci to the total number of documents in the training set.
Each P(Wi|Ci) is calculated as the ratio of the number of occurrences of Wi in documents labeled Ci to the total length (in words) of the documents labeled Ci.
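
Putting the two estimates and the pseudo code together, here is a minimal runnable sketch in Python. The function names and the toy training documents are my own, and I add simple add-one smoothing (not discussed above) so that a word unseen in a class does not force the whole product to zero.

from collections import Counter, defaultdict

def train(labeled_docs):
    # labeled_docs: list of (text, label) pairs
    doc_count = Counter()                      # number of documents per class label
    word_count = defaultdict(Counter)          # word occurrences per class label
    vocab = set()
    for text, label in labeled_docs:
        doc_count[label] += 1
        for w in text.split():
            word_count[label][w] += 1
            vocab.add(w)
    total_docs = sum(doc_count.values())
    prior = {c: doc_count[c] / total_docs for c in doc_count}    # P(Ci)
    cond_prob = {}
    for c in doc_count:
        length = sum(word_count[c].values())                     # total length of documents labeled Ci
        # P(W|Ci) with add-one smoothing: (count + 1) / (length + |vocabulary|)
        cond_prob[c] = {w: (word_count[c][w] + 1) / (length + len(vocab)) for w in vocab}
    return prior, cond_prob

def classify(text, prior, cond_prob):
    best_label, best_likelihood = None, 0.0
    for c in prior:
        likelihood = prior[c]                                    # P(Ci)
        for w in text.split():
            if w in cond_prob[c]:                                # ignore words never seen in training
                likelihood *= cond_prob[c][w]                    # P(W|Ci)
        if likelihood > best_likelihood:
            best_label, best_likelihood = c, likelihood
    return best_label

docs = [("tiger woods won pga tour", "sports"),
        ("stock market fell sharply", "business")]
prior, cond_prob = train(docs)
print(classify("woods won the tour", prior, cond_prob))          # sports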

Use of Log Likelihood

One common practice when computing the P(W1|Ci) * P(W2|Ci) * ... * P(Wn|Ci) * P(Ci) term is to add the logarithms of the terms instead of multiplying the terms themselves. This avoids possible floating point underflow during the calculation. Instead of comparing the likelihood values, you compare the log likelihood values.
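
For example, the classify step from the sketch above can be rewritten with logarithms; this is only a sketch of the idea.

from math import log

def log_likelihood(text, c, prior, cond_prob):
    # a sum of logs instead of a product of many small numbers
    score = log(prior[c])
    for w in text.split():
        if w in cond_prob[c]:
            score += log(cond_prob[c][w])
    return score

Since log is monotonically increasing, the class with the largest log likelihood is the same class that would have had the largest likelihood.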



Sunday, March 17, 2013

Principal Component Analysis - intuition

Last week, I was pondering the intuition behind Principal Component Analysis, or PCA. PCA is basically about finding the eigen values and eigen vectors of the covariance matrix. An understanding of PCA involves understanding both of these concepts.
To get a more theoretical understanding of PCA, go here: tutorial on pca


What is a covariance matrix

A covariance matrix gives the covariance between pairs of variables. The "covariance" of two variables indicates how they vary together. If one variable increases from its mean when the other variable also increases from its mean, we say there is a positive covariance between the variables. A negative covariance indicates that one variable tends to increase when the other tends to decrease. For example, we can say there is a positive covariance between the intelligence and the grades of a student, and a negative covariance between age and athleticism in adults. The covariance matrix collects the covariance between every pair of variables in the system, as in the small example below.
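
As a small illustration (the numbers are hypothetical), numpy computes the covariance matrix of two variables directly:

import numpy as np

# Hypothetical observations of two variables.
age = np.array([25, 30, 40, 55, 65])
athleticism = np.array([9.0, 8.5, 7.0, 5.0, 3.5])

print(np.cov(age, athleticism))
# The off-diagonal entry is negative: age and athleticism tend to vary in opposite directions.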

Eigen Vectors and Eigen Values

To give the Wikipedia definition, "a non-zero column vector v is a (right) eigenvector of a matrix A if (and only if) there exists a number λ such that Av = λv. The number λ is called the eigenvalue corresponding to that vector. The set of all eigenvectors of a matrix, each paired with its corresponding eigenvalue, is called the eigensystem of that matrix".

What does this mean? 
If you have a system, represented by a matrix, and you apply that matrix to one of its eigen vectors, the resulting vector has the same orientation, but is scaled in size. The result is either bigger or smaller, but it still retains the essential character of the original. For example, let us say there is a bridge which has a natural frequency (the frequency at which the bridge oscillates), represented by a frequency matrix. If you drive the bridge at a frequency corresponding to an eigen vector of that matrix, the bridge will oscillate at a much larger scale, represented by the corresponding eigen value. Let us take another example. Say there is a rectangular box. If we pull the box from each direction along its axes, the box elongates, but it still retains the property of being a rectangular box. What if we try to pull the box along the diagonals? Then the box loses its shape and becomes something other than a rectangular box. The pull, when applied along an axis, can be considered an eigen vector of the rectangular box.
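
Here is a small numpy check of the definition Av = λv; the matrix A is just an arbitrary example of my own.

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)

v = eigenvectors[:, 0]                 # first eigen vector (a column of the result)
lam = eigenvalues[0]                   # its corresponding eigen value
print(np.allclose(A @ v, lam * v))     # True: applying A only scales v, it does not change its direction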
Eigen vectors of the covariance matrix
Now let us look into the idea of the eigen vectors of the covariance matrix. We know that the covariance matrix captures the covariance of the various variables in the system. Think about what we get if we subtract the mean from the original data and take its covariance matrix. The data tells us which features increase or decrease together. The eigen vectors of this matrix should preserve the direction of "the variance" of the data in the x and y directions (for two dimensional data). That is, an eigen vector should tell us in which direction the data essentially increases or decreases together. For n dimensional data we get a set of n eigen vectors and eigen values. If we order these eigen vectors in descending order of their eigen values, we get the directions of variance in the data in decreasing order of importance, each eigen vector orthogonal to the others.
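
Continuing with numpy, the following sketch mean-centers some hypothetical two dimensional data, takes its covariance matrix, and orders the eigen vectors by their eigen values.

import numpy as np

# Hypothetical 2D data, one point per row; x and y clearly increase together.
X = np.array([[2.0, 1.9], [4.1, 4.0], [6.0, 6.2], [8.0, 7.9], [10.1, 10.0]])

X_centered = X - X.mean(axis=0)                  # subtract the mean from the original data
cov = np.cov(X_centered, rowvar=False)           # covariance matrix of the two variables
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: for symmetric matrices like a covariance matrix

order = np.argsort(eigenvalues)[::-1]            # descending order of eigen values
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
print(eigenvalues)          # the first value dominates: most of the variance lies along one direction
print(eigenvectors[:, 0])   # roughly (0.707, 0.707) up to sign: the direction along which x and y grow together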

What is meant by the direction of the variance of the data? Let us take the case of two dimensional data. Say we have two points A (x1,y1) and B (x2,y2). We are interested in seeing in which direction the data increases or decreases together. Or, if you have to move from A to B, what will be the direction of that movement in terms of the coordinates of the system? If we have a set of such points, we are interested in the general direction in which the set of points aligns. For the case of two points A and B this direction is given by
tan(theta) = (y1-y2)/(x1-x2). Now, if we have a third point C (x3,y3) which lies in the general direction of A and B but is off from that axis by a small distance, then to reach C we need to travel along this direction and then move to C along the direction perpendicular to it.
Now the variation in A, B and C has two general directions: one along the axis represented by the slope (y1-y2)/(x1-x2), and the second along the axis perpendicular to it.

Dimensionality reduction
Suppose we want to represent the points A, B and C in one dimension instead of two. From the above example, we know that A and B lie on a straight line determined by the slope (y1-y2)/(x1-x2), and that C is off from that line, but only slightly. If we have to represent these data using only one dimension, we still have to preserve the overall distribution of the data. Had there been no C, we could just transform A and B to the direction of the line connecting them, and represent each by a single coordinate: its distance from the origin along this line. But we also have point C, which is off from the line by a small distance. Let me call the line connecting A and B the Z axis.
If we are ready to lose the information about how far C is from this line, we can project C onto the Z axis (by dropping a perpendicular from C to the Z axis). In doing so we still preserve the general location of C w.r.t. A and B, give or take that small distance. What we have done is reduce the dimensionality of the overall system from two dimensions (represented by the X and Y axes) to one dimension (represented by the Z axis). In doing so we lost some information, but that information is not very significant compared to the overall information contained in the system.

We could choose Z as the axis because C was only slightly off from Z. What if C is off from the Z axis by a large margin, larger than the distance between A and B along Z?
Now the overall direction of variance of the system is not along Z but along a direction perpendicular to Z. If we have to reduce the dimension of the system, we will have to capture how far C is from the Z axis rather than the distance between A and B. In this case, our axis of choice will be the axis perpendicular to Z.

This is the idea we use in dimensionality reduction using eigen values and eigen vectors. The eigen vector with the highest eigen value gives us the principal component of variance of the system. The eigen vector with the second highest eigen value gives us the second principal component, and so on. If we have to reduce the dimension from n to k, we choose the eigen vectors corresponding to the first k eigen values and then transform the data to these dimensions. This transformation is done by multiplying the data with the chosen eigen vectors.
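
Tying the pieces together, here is a short numpy sketch of that transformation on the same hypothetical data as above, reducing from n = 2 dimensions to k = 1.

import numpy as np

X = np.array([[2.0, 1.9], [4.1, 4.0], [6.0, 6.2], [8.0, 7.9], [10.1, 10.0]])   # same toy data as above
X_centered = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvectors = eigenvectors[:, order]            # columns sorted by descending eigen value

k = 1                                            # keep only the first principal component
top_k = eigenvectors[:, :k]                      # the chosen eigen vectors, as columns
X_reduced = X_centered @ top_k                   # multiply the data with the chosen eigen vectors
print(X_reduced.shape)                           # (5, 1): each point is now a single coordinate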

Why is dimensionality reduction important
In machine learning, working with a large number of dimensions is often computationally intensive. We often need to reduce the data to a lower dimensional representation to make the computations feasible.