Sunday, March 17, 2013

Principal Component Analysis - intuition

Last week, I was pondering the intuition behind Principal Component Analysis, or PCA.  PCA is basically about finding the eigenvalues and eigenvectors of the covariance matrix. Understanding PCA therefore requires an understanding of both of these concepts.
For a more theoretical understanding of PCA, see this tutorial on PCA.


What is a covariance matrix?

The covariance matrix gives the covariance between each pair of variables. The "covariance" of two variables indicates how they vary together. If one variable increases from its mean when the other variable also increases from its mean, we say there is a positive covariance between the variables. A negative covariance indicates that one variable tends to increase when the other tends to decrease. For example, we can say there is a positive covariance between the intelligence and the grade of a student, and a negative covariance between the age and the athleticism of adults. The covariance matrix collects the covariance between every pair of variables in the system.
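
To make this concrete, here is a minimal sketch using NumPy and made-up numbers for the student example above; the specific values are assumptions, chosen only so that the two variables rise together.

import numpy as np

# Hypothetical data: intelligence scores and grades for five students (assumed values).
intelligence = np.array([95, 110, 100, 120, 105])
grade        = np.array([2.8, 3.4, 3.0, 3.9, 3.2])

# np.cov treats each argument as one variable and returns the 2x2 covariance matrix.
cov = np.cov(intelligence, grade)
print(cov)    # the off-diagonal entry is positive: the two variables rise together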

Eigenvectors and Eigenvalues

To give the Wikipedia definition, "a non-zero column vector v is a (right) eigenvector of a matrix A if (and only if) there exists a number λ such that Av = λv. The number λ is called the eigenvalue corresponding to that vector. The set of all eigenvectors of a matrix, each paired with its corresponding eigenvalue, is called the eigensystem of that matrix".

What does this mean? 
If you have a system represented by a matrix, and you apply that matrix to one of its eigenvectors, the result keeps the same orientation; it is only scaled in size. In other words, the result is a bigger or a smaller version of the original, still retaining its essential character. For example, suppose there is a bridge with a natural frequency (the frequency at which the bridge oscillates), represented by a frequency matrix. If you apply a force along an eigenvector of that matrix, the bridge oscillates at a much larger scale, given by the corresponding eigenvalue. Let us take another example. Say there is a rectangular box. If we pull the box from each direction along its axes, the box elongates, but it still retains the property of being a rectangular box. What if we pull the box along its diagonals? Then the box loses its shape and becomes something other than a rectangular box. The pulling force, when applied along an axis, can be considered an eigenvector of the rectangular box.
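
As a quick check on the definition Av = λv, here is a small sketch with NumPy; the 2x2 matrix is an arbitrary example chosen for illustration.

import numpy as np

# An arbitrary symmetric example matrix (assumed for illustration).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# np.linalg.eig returns the eigenvalues and the eigenvectors (as columns).
eigvals, eigvecs = np.linalg.eig(A)

for lam, v in zip(eigvals, eigvecs.T):
    # Applying A to an eigenvector only scales it: A v equals lambda * v.
    print(lam, np.allclose(A @ v, lam * v))
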
Eigenvectors of the covariance matrix
Now let us look at the idea of the eigenvectors of the covariance matrix. We know that the covariance matrix captures the covariance of the various variables in the system. Think about what we get if we subtract the mean from the original data and take the covariance matrix. The data tells us which features increase or decrease together. An eigenvector of this matrix should preserve the direction of the variance of the data in the x and y directions (for two-dimensional data). That is, the eigenvector should tell us in which direction the data essentially increases or decreases together. For n-dimensional data we get a set of n eigenvectors and eigenvalues. If we order the eigenvectors in descending order of their eigenvalues, we get the directions of variance in the data in decreasing order of importance, with each eigenvector orthogonal to the others.
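
Here is a minimal sketch of that idea: generate two-dimensional data whose coordinates move together, centre it, take the covariance matrix, and sort its eigenvectors by descending eigenvalue. The data and the 0.8 slope are assumptions made up for the example.

import numpy as np

rng = np.random.default_rng(0)

# Made-up 2-D data in which y roughly follows x, so the two vary together.
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.3, size=200)
data = np.column_stack([x, y])              # shape (200, 2)

# Subtract the mean, then take the covariance matrix of the centred data.
centred = data - data.mean(axis=0)
cov = np.cov(centred, rowvar=False)         # 2 x 2

# eigh suits symmetric matrices; sort the eigenvectors by descending eigenvalue.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals)         # variance captured along each direction
print(eigvecs[:, 0])   # direction of the largest variance, roughly along y = 0.8 x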

What do we mean by the direction of the variance of the data? Let us take the case of two-dimensional data. Say we have two points A (x1, y1) and B (x2, y2). We are interested in the direction in which the data increases or decreases together. Or, if you have to move from A to B, what is the direction of that movement in the coordinates of the system? If we have a set of such points, we are interested in the general direction along which the points align. For the case of two points A and B this direction is given by
tan(theta) = (y1 - y2) / (x1 - x2). Now, if we have a third point C (x3, y3) that lies roughly along this direction but off the axis by a small distance, then to reach C we need to travel along this direction and then move to C along the direction perpendicular to it.
So the variation in A, B, C has two general directions: one along the axis represented by the slope (y1 - y2) / (x1 - x2), and the second along the axis perpendicular to this.
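
The sketch below, with assumed coordinates for A, B and C, decomposes the step from A to C into a component along the A-to-B direction and a component perpendicular to it; most of the movement falls along the A-to-B axis.

import numpy as np

# Two points A and B (assumed values) define the general direction of the data.
A = np.array([1.0, 1.0])
B = np.array([4.0, 3.0])

direction = (B - A) / np.linalg.norm(B - A)              # unit vector along A -> B
perpendicular = np.array([-direction[1], direction[0]])  # rotated 90 degrees

# A third point C, roughly along that direction but slightly off the line.
C = np.array([2.6, 2.3])

# Decompose the step from A to C into the two directions.
along = np.dot(C - A, direction)
off = np.dot(C - A, perpendicular)
print(along, off)    # most of the movement is along A -> B, only a little is off it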

Dimensionality reduction
Suppose we want to represent the points A, B and C in one dimension instead of two. From the example above, we know that A and B lie on a straight line with slope (y1 - y2) / (x1 - x2), and that C is off that line, but only slightly. If we have to represent this data using only one dimension, we still want to preserve its overall distribution. Had there been no C, we could simply transform A and B to the direction of the line connecting them, and represent each with a single coordinate: its distance from the origin along this line. But we also have point C, which is off the line by a small distance. Let me call the line connecting A and B the Z axis.
If we are willing to lose the information about how far C is off this line, we can project C onto the Z axis (by drawing a line from C perpendicular to the Z axis). In doing so we still preserve the general location of C with respect to A and B, just off by a small distance. What we have done is reduce the dimensionality of the overall system from two dimensions (represented by the X and Y axes) to one dimension (represented by the Z axis). We lost some information, but that information is small compared to the overall information contained in the system.
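
Continuing with the same assumed points, the projection step looks like this: each point is replaced by its signed distance along the Z axis, and the perpendicular offset of C is simply dropped.

import numpy as np

# The same assumed points as above: A and B define the Z axis, C sits slightly off it.
A = np.array([1.0, 1.0])
B = np.array([4.0, 3.0])
C = np.array([2.6, 2.3])

z = (B - A) / np.linalg.norm(B - A)      # unit vector along the Z axis

# One-dimensional representation: the signed distance of each point along Z,
# measured from A. The perpendicular offset of C is discarded.
points_1d = [float(np.dot(p - A, z)) for p in (A, B, C)]
print(points_1d)     # A maps to 0, B to the length of AB, C to a value in between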

We could choose Z as the axis because C was only slightly off from Z. But what if C is off the Z axis by a large margin, larger than the distance between A and B along Z?
Now the overall direction of variance of the system is not along Z but along the direction perpendicular to Z. If we have to reduce the dimension of the system, we have to capture how far C is off the Z axis rather than the distance between A and B. In this case, our axis of choice is the axis perpendicular to Z.

This is the idea we use in dimensionality reduction with eigenvalues and eigenvectors. The eigenvector with the highest eigenvalue gives us the principal component of variance of the system. The eigenvector with the second highest eigenvalue gives us the second principal component, and so on. If we have to reduce the dimension from n to k, we choose the eigenvectors corresponding to the first k eigenvalues and then transform the data onto these dimensions. This transformation is done by multiplying the data by the chosen eigenvectors.
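
Putting the whole procedure together, here is a minimal sketch of an n-to-k reduction as described above: centre the data, take the covariance matrix, keep the eigenvectors with the k largest eigenvalues, and multiply the data by them. The function name pca_reduce and the 3-D example data are assumptions made up for illustration.

import numpy as np

def pca_reduce(X, k):
    # Reduce X (n_samples x n_features) to k dimensions using the eigenvectors
    # of its covariance matrix, as described in the text.
    centred = X - X.mean(axis=0)
    cov = np.cov(centred, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # sort by descending eigenvalue
    top_k = eigvecs[:, order[:k]]                # first k principal directions
    return centred @ top_k                       # project the data onto them

# Example: compress made-up 3-D data down to 2 dimensions.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
X[:, 2] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=100)   # third column is nearly redundant
print(pca_reduce(X, 2).shape)    # (100, 2)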

Why is dimensionality reduction important?
In machine learning, working with a large number of dimensions is often computationally intensive. We need to come up with a reduced-dimension version of the data to keep the computations feasible.

