Train / Test Mahout SGD Classifier

Mahout 0.6 has been released with bug fixes and new implementations. I experimented with Mahout in a previous project, but there I used the Naive Bayes classifier.

This time my data set is too small to run with Naive Bayes. I also came across the topic “Choosing an algorithm to train the classifier” in the Mahout in Action MEAP, and it points to stochastic gradient descent (SGD). It is clear from there that SGD is the advisable choice for “Small to medium (less than tens of millions of training examples)” data sets.

So I decided to choose SGD because it is the best fit for my data set. OK, the algorithm is selected, so what's next? I needed to dig into the Mahout source to find some hints on how to run SGD.

Yes, there are examples that show how to train and test using SGD.

So let's search for the command-line options to run SGD. There is a script for this in examples/bin of the Mahout binary distribution.
I went through that file and got an idea of how to train and test SGD from the command line.

Now that we know how it works, let's start the game. :)

Before that, please make sure that you have a sample data set to train on. You can download the 20 Newsgroups data set.

The input data set is ready, so now we are going to train on it using SGD.

You can download mahout-distribution-0.6 from any of the mirrors. Extract it and cd into the mahout folder.
There is a class TrainNewsGroups in org.apache.mahout.classifier.sgd, and it accepts the path of the input data set as its argument.

./bin/mahout org.apache.mahout.classifier.sgd.TrainNewsGroups /home/sree/Desktop/20news-bydate/20news-bydate-train/

If you forgot to set JAVA_HOME, an error may occur. Set JAVA_HOME if needed.
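To catch the JAVA_HOME problem before starting a long training run, a small check like the one below can help. It is only a sketch: the function name check_java_home is made up for this post, and the bin/java layout is the usual Unix JDK layout, which I am assuming here.

```python
import os


def check_java_home(env=None):
    """Return None if JAVA_HOME looks usable, otherwise a short problem description.

    check_java_home is a hypothetical helper, not part of Mahout.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return "JAVA_HOME is not set"
    # Assumption: a Unix-style JDK layout with the launcher at $JAVA_HOME/bin/java.
    java_bin = os.path.join(java_home, "bin", "java")
    if not os.path.isfile(java_bin):
        return "no java executable at " + java_bin
    return None


if __name__ == "__main__":
    problem = check_java_home()
    print(problem or "JAVA_HOME looks fine")
```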

Currently the SGD implementation in Mahout supports sequential / online / incremental execution. It will not run in parallel like Naive Bayes (which I experimented with before).
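To make "online / incremental" concrete, here is a tiny stand-alone sketch of a multiclass logistic regression trained by SGD, one example at a time. This is not Mahout's code and it does not use any Mahout API; the class name OnlineSoftmaxClassifier, the learning rate and the toy data are all made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class OnlineSoftmaxClassifier:
    """Hypothetical minimal online learner: one weight row per class."""

    def __init__(self, num_classes, num_features, learning_rate=0.1):
        self.weights = [[0.0] * num_features for _ in range(num_classes)]
        self.learning_rate = learning_rate

    def predict_proba(self, features):
        scores = [sum(w * x for w, x in zip(row, features)) for row in self.weights]
        return softmax(scores)

    def train(self, label, features):
        """Single SGD step: the model is updated from this one example only."""
        probs = self.predict_proba(features)
        for k, row in enumerate(self.weights):
            error = (1.0 if k == label else 0.0) - probs[k]
            for j, x in enumerate(features):
                row[j] += self.learning_rate * error * x

    def classify(self, features):
        probs = self.predict_proba(features)
        return max(range(len(probs)), key=probs.__getitem__)


# Toy stream of (label, [bias, x]) pairs; class 1 has positive x, class 0 negative x.
TOY_STREAM = [(1, [1.0, 1.0]), (0, [1.0, -1.0]), (1, [1.0, 2.0]), (0, [1.0, -2.0])]

model = OnlineSoftmaxClassifier(num_classes=2, num_features=2)
for _ in range(20):  # a few passes over the stream, still one example per update
    for label, features in TOY_STREAM:
        model.train(label, features)

if __name__ == "__main__":
    print(model.classify([1.0, 3.0]), model.classify([1.0, -3.0]))
```

Because every update only needs the current example and the current weights, training is naturally sequential, which matches the point above that it does not run in parallel.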

By default TrainNewsGroups creates model files (with .model as the file type) in the /tmp directory. While training is running you can check whether these files are being created. If you see files with names like “news-group-{a number}.model”, then you can be sure that SGD has started training on our newsgroup data set.

Once training has completed, you will find a set of .model files generated in the /tmp directory. You can simply choose “news-group.model” or “news-group-{MAX NUMBER}.model” as the model.
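Picking the “MAX NUMBER” file by eye is easy to get wrong, because sorting by name puts 1000 before 200. A small helper can do it numerically. This is a sketch: pick_model is a name I made up, the file names in the example list are hypothetical, and preferring the numbered file over “news-group.model” is just my arbitrary choice.

```python
import re

# File name pattern taken from the names seen in /tmp during training.
MODEL_PATTERN = re.compile(r"^news-group-(\d+)\.model$")


def pick_model(filenames):
    """Return the highest-numbered news-group-N.model, else news-group.model, else None."""
    numbered = []
    for name in filenames:
        match = MODEL_PATTERN.match(name)
        if match:
            # Compare as integers so that 1000 beats 200.
            numbered.append((int(match.group(1)), name))
    if numbered:
        return max(numbered)[1]
    if "news-group.model" in filenames:
        return "news-group.model"
    return None


if __name__ == "__main__":
    # Hypothetical directory listing, for illustration only.
    sample = ["news-group-200.model", "news-group-1000.model", "news-group.model"]
    print(pick_model(sample))
```

On a real run you would pass in the directory listing, for example pick_model(os.listdir("/tmp")) after importing os.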

If you completed the training without any errors, the model has been created from the input data set. :) It's time to test it. This part determines how much accuracy we can get on the test data using SGD.

To test the data against a model, we can use the TestNewsGroups class in org.apache.mahout.classifier.sgd.

TestNewsGroups has two mandatory arguments:
--input : path of the test data
--model : path of the model file

Finally, it's time to test it.

./bin/mahout org.apache.mahout.classifier.sgd.TestNewsGroups --input /home/sree/Desktop/20news-bydate/20news-bydate-test/ --model /tmp/news-group.model
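If you want to repeat the train and test steps from a script, the two commands can be assembled as argument lists. This is only a sketch under a few assumptions: build_commands and run_all are names I made up, the options are spelled with two plain ASCII hyphens (--input, --model), and the paths are the ones from my machine, so adjust them to yours.

```python
import os
import subprocess

TRAIN_CLASS = "org.apache.mahout.classifier.sgd.TrainNewsGroups"
TEST_CLASS = "org.apache.mahout.classifier.sgd.TestNewsGroups"


def build_commands(mahout_home, train_dir, test_dir, model_path):
    """Return (train_command, test_command) as argument lists for subprocess."""
    launcher = os.path.join(mahout_home, "bin", "mahout")
    train = [launcher, TRAIN_CLASS, train_dir]
    test = [launcher, TEST_CLASS, "--input", test_dir, "--model", model_path]
    return train, test


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only prints the commands; call run_all(...) yourself to execute them.
    train_cmd, test_cmd = build_commands(
        ".",
        "/home/sree/Desktop/20news-bydate/20news-bydate-train/",
        "/home/sree/Desktop/20news-bydate/20news-bydate-test/",
        "/tmp/news-group.model",
    )
    print(" ".join(train_cmd))
    print(" ".join(test_cmd))
```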

You will then see a confusion matrix and the classified instances.

I got 73.513% accuracy. If you want to see the confusion matrix, just follow this link.
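For anyone wondering how an accuracy figure relates to a confusion matrix: accuracy is the sum of the diagonal (correctly classified instances) divided by the sum of all cells. The sketch below shows this on a made-up 3-class matrix. These toy numbers are not my 20 Newsgroups result, and I am assuming rows are the actual classes and columns the predicted ones.

```python
def accuracy(confusion):
    """Fraction of correctly classified instances: diagonal sum over total."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total if total else 0.0


def per_class_recall(confusion):
    """Recall per class, assuming each row holds the instances of one actual class."""
    recalls = []
    for i, row in enumerate(confusion):
        row_total = sum(row)
        recalls.append(row[i] / row_total if row_total else 0.0)
    return recalls


# Made-up toy matrix for illustration only (rows: actual, columns: predicted).
TOY_CONFUSION = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 3, 7],
]

if __name__ == "__main__":
    print("accuracy: %.3f%%" % (100.0 * accuracy(TOY_CONFUSION)))
    print("recall per class:", per_class_recall(TOY_CONFUSION))
```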

I discussed the accuracy of SGD on the Mahout mailing list and there were some follow-ups.
You can search the Mahout mailing list archive for Dec 2011 for the topic “Mahout SGD / Bayes prediction results over 20newsgroups”.

I think the SGD accuracy is not satisfactory when compared with Naive Bayes on the same 20-news data set :( especially since SGD is the advised choice for small / medium-sized data sets. Let's hope the Mahout developers will work on it. 🙂