Sunday, March 30, 2014

On the variations of AdaBoost

AdaBoost Intuition

AdaBoost is an ensemble method, meaning it is an ensemble of classifiers rather than a single classifier. The intuition behind AdaBoost is to form a committee of expert learners and take the final output from the committee instead of from a single learner. Each expert in the committee is a learning algorithm, trained to give high importance to ("focus on") the instances that the other learners could not classify correctly. So, in a sense, the committee is formed by expert learners, each of which specializes in classifying a particular set of instances. Each learner in the committee complements the others, because each one is specifically focused on learning from instances the others could not classify correctly.

Algorithm


 ADABOOST(S, Learn, k)
   S: training set {(x1,y1), (x2,y2), ..., (xm,ym)}, yi ∈ Y
   Learn: learner Learn(S, weights)
   k: number of rounds

 For all i: w(1,i) = 1/m

 For r = 1 to k, do
      For all i: p(r,i) = w(r,i) / Σj w(r,j)

      h(r) = Learn(S, p(r))

      ε(r) = Σi p(r,i) · 1[h(r)(xi) ≠ yi]

      if ε(r) > ½
         k = r - 1
         exit
      β(r) = ε(r) / (1 - ε(r))

      For all i: w(r+1,i) = w(r,i) · β(r)^(1 - 1[h(r)(xi) ≠ yi])

 Output: h(x) = argmax over y ∈ Y of Σ(r=1 to k) log(1/β(r)) · 1[h(r)(x) = y]


Explanation

Inputs to the AdaBoost algorithm are a training set S, a learner function Learn and the number of rounds k. The learner function must be able to learn from a training set with weighted instances. How to learn from a weighted training set? We will come to that later; for now assume the learner can handle weighted instances. The number k is the number of rounds the algorithm goes through, and hence the (maximum) number of experts in the committee. In each round the algorithm trains the best candidate learner on the current weights and elects it to the committee.

First we initialize the weight of every instance to 1/m, where m is the number of instances. Then the expert selection rounds begin (r = 1 to k). In each round the first step is to compute the normalized weights p(r,i) so that they sum to one. Note that we do not overwrite the weights with the normalized values; the normalized weights go into a separate variable. Next we call the Learn function, which returns the hypothesis that minimizes the weighted error (cost function) under the current weights; this becomes the committee member for round r. We then compute the learner's aggregate weighted error ε(r). If ε(r) > ½ we stop and exit (why? see below). Finally we update the instance weights for the next round, so that the instances the current learner classified correctly have their weight reduced (multiplied by β(r) < 1), while the misclassified instances keep their weight.
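To make the loop concrete, here is a minimal Python sketch of the AdaBoost.M1 rounds described above. It is an illustration added for this post, not part of the original listing: the function name adaboost_m1 and the arrays X, y are assumptions, and a depth-1 decision tree from scikit-learn stands in for the weak learner.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, k):
    m = len(y)
    w = np.full(m, 1.0 / m)              # w(1,i) = 1/m
    learners, betas = [], []
    for r in range(k):
        p = w / w.sum()                  # normalized weights p(r,i)
        h = DecisionTreeClassifier(max_depth=1)
        h.fit(X, y, sample_weight=p)     # Learn(S, p(r))
        miss = (h.predict(X) != y)       # 1[h(r)(xi) != yi]
        eps = max(p[miss].sum(), 1e-12)  # weighted error, guarded against 0
        if eps > 0.5:                    # worse than a coin toss: stop
            break
        beta = eps / (1.0 - eps)
        learners.append(h)
        betas.append(beta)
        # correctly classified instances are down-weighted by beta < 1
        w = w * beta ** (1.0 - miss)
    labels = np.unique(y)

    def predict(X_new):
        # each member votes for its predicted label with weight log(1/beta)
        votes = np.zeros((len(X_new), len(labels)))
        for h, beta in zip(learners, betas):
            pred = h.predict(X_new)
            for j, label in enumerate(labels):
                votes[:, j] += np.log(1.0 / beta) * (pred == label)
        return labels[votes.argmax(axis=1)]

    return predict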

The weight of each learner h(r) in the final vote is log(1/β(r)), which grows as the error ε(r) of that learner shrinks. That means each learner in the committee is weighted according to how well it classifies the instances.

Why stop when   ε(r) > ½?

When ε(r) > ½, the weighted aggregate error is more than 0.5, i.e. the probability of getting a correct answer from the classifier is less than 0.5, which is worse than a plain coin toss. Simply put, if we tossed a coin and used that as the decision, it would do better than our candidate classifier. That means we can simply throw away the classifier we just built and exit.
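A quick numeric illustration (added here for clarity, not from the original algorithm): once ε(r) crosses ½, β(r) exceeds 1 and the committee weight log(1/β(r)) turns negative, so such a learner would count against its own votes.

import numpy as np

for eps in (0.2, 0.4, 0.5, 0.6):
    beta = eps / (1.0 - eps)
    # at eps = 0.5 the weight is exactly 0; above 0.5 it goes negative
    print(f"eps={eps:.1f}  beta={beta:.2f}  log(1/beta)={np.log(1.0 / beta):+.2f}")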

How to learn from weighted instances?

What should the learner function Learn(S, weights) look like? It is supposed to learn from a training set where each instance has a weight associated with it. Depending on the individual classifier, you can incorporate the weights directly; for example, in a naive Bayes classifier you can fold the weights into the estimation of the conditional probability terms. Another approach is to create a new training set by sampling from the original training set according to the weights, and then train an ordinary (unweighted) learner on it.
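Both approaches can be sketched briefly. The helper names below are made up for illustration; scikit-learn's GaussianNB happens to accept per-instance weights directly, and the second function shows the resampling alternative.

import numpy as np
from sklearn.naive_bayes import GaussianNB

def learn_with_weights(X, y, weights):
    # the learner folds the weights into its sufficient statistics
    return GaussianNB().fit(X, y, sample_weight=weights)

def learn_by_resampling(X, y, weights, seed=0):
    # draw a bootstrap sample according to the weights, then train unweighted
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=len(y), replace=True, p=weights / weights.sum())
    return GaussianNB().fit(X[idx], y[idx])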

How to pick the number of iterations

It is done empirically, by plotting the training and test errors against different numbers of iterations. As you keep increasing the number of iterations, the training error keeps decreasing and the test error also decreases at first. At some point the algorithm starts overfitting and the test error starts increasing. Pick the number of iterations just before this point, where the test error is at its lowest and the overfitting is yet to happen.
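A sketch of that procedure with scikit-learn's AdaBoostClassifier on a synthetic dataset (the dataset and the cap of 200 rounds are arbitrary choices for the example); staged_predict exposes the partial ensemble after each round, so the error curves fall out directly.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# the default weak learner is a depth-1 decision stump
model = AdaBoostClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# error of the partial ensemble after 1, 2, ..., 200 rounds
train_err = [np.mean(p != y_train) for p in model.staged_predict(X_train)]
test_err = [np.mean(p != y_test) for p in model.staged_predict(X_test)]

best_k = int(np.argmin(test_err)) + 1   # round with the lowest test error
print("pick k =", best_k)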

What if we use strong classifiers instead of weak classifiers

AdaBoost is used to combine a number of weak classifiers (error rate just below 0.5) into a strong one. Naturally the question arises: what if we use strong classifiers instead of weak classifiers, will we get an even stronger classifier as the final ensemble? The answer is no. When you use strong classifiers, and particularly when one of them is far stronger than the others, the strong classifier dominates the vote, and the small number of hard instances (outliers) it misclassifies will not be picked up correctly by the rest. If all the classifiers are equally strong, the final classifier will be of about the same strength as its constituents.

Feature selection using AdaBoost

One interesting use of AdaBoost is in face detection in images. Viola and Jones proposed an interesting face detection algorithm using machine learning, with AdaBoost as the ensemble method. The Viola-Jones paper is discussed below.

In this paper, the authors argue that, in the same way classifiers are learned during AdaBoost, features can also be learned. Each weak classifier is based on a single feature. At each round of boosting, the classifier with the minimum weighted error rate is selected. This differs from classical AdaBoost in that the classifiers are generated during the boosting process rather than formed beforehand. Since each classifier depends on only one feature, this can also be seen as a feature selection process.
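As a small sketch of that idea (on generic numeric features, not the Haar-like image features of the actual Viola-Jones detector): boosting depth-1 stumps means every round splits on exactly one feature, so the sequence of chosen split features is the selected feature set. The dataset below is synthetic and only for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=1000, n_features=30,
                           n_informative=5, random_state=0)

# each weak learner is a depth-1 stump, i.e. a single-feature classifier
model = AdaBoostClassifier(n_estimators=25, random_state=0).fit(X, y)

# the root split of each stump identifies the feature picked at that round
selected = [stump.tree_.feature[0] for stump in model.estimators_]
print("features picked per round:", selected)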

