Train / Test Mahout SGD Classifier

Mahout 0.6 has been released with bug fixes and new implementations. I used Mahout in a previous project, but the classifier I used there was Naive Bayes.

This time my data set is too small to run with Naive Bayes. I also came across the section “Choosing an algorithm to train the classifier” in the Mahout in Action MEAP, which points to stochastic gradient descent (SGD). There SGD is the advised choice for “Small to medium (less than tens of millions of training examples)” data sets.

So I decided to go with SGD because it is the best fit for my data set. With the algorithm selected, what comes next? I needed to dig into the Mahout source to find some hints on how to run SGD.

And yes, there are examples that show how to train and test using SGD.

So let's look for the command-line options to run SGD. In examples/bin of the Mahout binary distribution there is a file that shows them.
I went through that file and got an idea of how to train and test SGD from the command line.

Now that we know how it works, let's start the game. :)

Before that, please make sure you have a sample data set to train on. You can download the 20 Newsgroups data set.

The input data set is ready, so now we are going to train on it using SGD.

You can download mahout-distribution-0.6 from any of the mirrors. Extract it and cd into the mahout folder.
There is a class TrainNewsGroups in org.apache.mahout.classifier.sgd, and it accepts the path of the input data set as its argument.

./bin/mahout org.apache.mahout.classifier.sgd.TrainNewsGroups /home/sree/Desktop/20news-bydate/20news-bydate-train/

If you forgot to set JAVA_HOME, an error may occur. Set JAVA_HOME if needed.
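A minimal sketch of checking JAVA_HOME before launching bin/mahout. The helper name check_java_home is hypothetical (it is not part of Mahout), and the path in the commented-out export is only a placeholder to replace with your own JDK location.

```shell
#!/bin/sh
# check_java_home is a hypothetical helper, not part of Mahout.
# It succeeds only when JAVA_HOME is set and contains an executable bin/java.
check_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$JAVA_HOME/bin/java" ]; then
    echo "JAVA_HOME does not contain bin/java: $JAVA_HOME" >&2
    return 1
  fi
  echo "JAVA_HOME looks usable: $JAVA_HOME"
}

# Placeholder path (an assumption); point this at your own JDK.
# export JAVA_HOME=/path/to/your/jdk

check_java_home || echo "Set JAVA_HOME before running ./bin/mahout"
```

Running the check first gives a clearer message than waiting for the launcher to fail.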

Currently the SGD implementation in Mahout supports sequential / online / incremental execution. It does not run in parallel like Naive Bayes (which I experimented with before).

By default TrainNewsGroups creates model files (with the .model extension) in the /tmp directory. While training is running you can check whether these files are being created. If you see files named “news-group-{a number}.model”, you can be sure that SGD has started training on our newsgroup data set.

Once training has completed, you will find a set of .model files in the /tmp directory. You can choose either “news-group.model” or “news-group-{MAX NUMBER}.model” as the model.
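If you would rather not pick the highest-numbered file by eye, a small shell function can do it. This is only a sketch: latest_model is a hypothetical helper name, and it assumes the file names follow the “news-group-{a number}.model” pattern described above.

```shell
#!/bin/sh
# latest_model is a hypothetical helper, not part of Mahout.
# Prints the news-group-N.model file with the largest N in the given
# directory (default /tmp); returns non-zero if there is none.
latest_model() {
  dir="${1:-/tmp}"
  best=""
  best_n=-1
  for f in "$dir"/news-group-*.model; do
    [ -e "$f" ] || continue            # glob matched nothing
    n="${f##*/}"                       # strip the directory part
    n="${n#news-group-}"               # strip the name prefix
    n="${n%.model}"                    # strip the extension
    case "$n" in
      ''|*[!0-9]*) continue ;;         # skip anything that is not a plain number
    esac
    if [ "$n" -gt "$best_n" ]; then
      best_n="$n"
      best="$f"
    fi
  done
  [ -n "$best" ] && printf '%s\n' "$best"
}

latest_model /tmp || echo "no numbered model files found in /tmp" >&2
```

The numeric comparison matters here: a plain alphabetical listing would sort news-group-9.model after news-group-10.model.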

If training completed without any errors, the model has been created from the input data set. :) It's time to test it. This part determines how much accuracy we can get on the test data using SGD.

To test the data against a model, we can use the TestNewsGroups class, org.apache.mahout.classifier.sgd.TestNewsGroups.

TestNewsGroups has two mandatory arguments:
--input : path of the test data
--model : path of the model file

Finally, it's time to test it.

./bin/mahout org.apache.mahout.classifier.sgd.TestNewsGroups --input /home/sree/Desktop/20news-bydate/20news-bydate-test/ --model /tmp/news-group.model

You will then see a confusion matrix and the classified instances.

I got 73.513%. If you want to see the confusion matrix, just follow this link.
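An accuracy figure like this is the share of test documents that fall on the diagonal of the confusion matrix. A minimal sketch of that arithmetic, assuming the matrix has been saved to a text file with each row laid out as the counts, a standalone `|`, the row total and the label. The function name accuracy_from_matrix is hypothetical, and the sample numbers are made up for illustration; they are not my results.

```shell
#!/bin/sh
# accuracy_from_matrix is a hypothetical helper, not part of Mahout.
# Assumed row layout: "<count> <count> ... | <row total> <label>".
# Accuracy = sum of diagonal counts / sum of all counts, as a percentage.
accuracy_from_matrix() {
  awk '
    /\|/ {
      row++
      # Walk the counts up to the "|" separator; column == row is the diagonal.
      for (i = 1; i <= NF && $i != "|"; i++) {
        total += $i
        if (i == row) correct += $i
      }
    }
    END { if (total > 0) printf "%.3f\n", 100 * correct / total }
  ' "$1"
}

# Made-up two-class matrix, for illustration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
a b <--Classified as
8 2 | 10 a = first
1 9 | 10 b = second
EOF

accuracy_from_matrix "$sample"   # (8 + 9) / 20, prints 85.000
rm -f "$sample"
```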

I discussed the accuracy of SGD on the Mahout mailing list and there were some follow-ups.
You can search the Mahout mailing list archive for Dec 2011 for the topic “Mahout SGD / Bayes prediction results over 20newsgroups”.

I think the SGD accuracy is not satisfactory when compared with Naive Bayes on the same 20 Newsgroups data set. :( That is disappointing, since SGD is the one advised for small / medium-sized data sets. Let's hope the Mahout developers will work on it. 🙂


9 responses to this post.

  1. Lets hope mahout developers will work on it. 🙂

    Rather than waiting for the Mahout developers to work on it, Sree, just go ahead and work on it…


  2. Thanks. Nice post on the Mahout SGD classifier.


  3. Posted by Sudhakar on October 16, 2012 at 7:03 am


    What was your experience with Naive Bayes on a small data set? Was it giving entirely wrong results for new test data? I am facing this issue and am now trying SGD.


    • Hi Sudhakar,

      Naive Bayes is designed for large data sets. SGD is preferable for small data sets. But Mahout SGD is not showing results on small data sets as good as those in the Mahout tutorials.

      I assure you that if you can provide a large data set (in my case a 4 lakh (400,000) pos / neg model, large both in accuracy / clarity and in size), Naive Bayes will give you good, industry-level accuracy. All of this relies heavily on the accuracy and clarity of the model. So in my understanding, whether a data set is big or small, the main point is that it should be accurate and clear in the selected domain.



  4. Posted by Priyadarshan Raj on October 16, 2012 at 7:26 am

    I am working on sentiment analysis of tweets.
    I am using the Mahout Naive Bayes classifier for it. I created a directory “data”, and inside “data” three more directories named “positive”, “negative” and “uncertain”. Then I put 151 files (151 MB in total) in each of the positive, negative and uncertain directories, and put the data directory in HDFS. Below is the set of commands I ran to generate the model and labelindex from it.

    bin/mahout seqdirectory -i ${WORK_DIR}/data -o ${WORK_DIR}/data-seq
    bin/mahout seq2sparse -i ${WORK_DIR}/data-seq -o ${WORK_DIR}/data-vectors -lnorm -nv -wt tfidf
    bin/mahout split -i ${WORK_DIR}/data-vectors/tfidf-vectors --trainingOutput ${WORK_DIR}/data-train-vectors --testOutput ${WORK_DIR}/data-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
    bin/mahout trainnb -i ${WORK_DIR}/data-train-vectors -el -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow $c

    After testing on the same set of data using the “testnb” command given below, I get this confusion matrix:

    bin/mahout testnb -i ${WORK_DIR}/data-train-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/data-testing $c

    Confusion Matrix
    a b c <--Classified as
    151 0 0 | 151 a = negative
    0 151 0 | 151 b = positive
    0 0 151 | 151 c = uncertain

    Then I created another directory "data2" in the same way and put some random data (a subset of the training data, 30 files (total size 30 MB) each) in the positive, negative and uncertain directories inside it. Then I created vectors from it using the "seq2sparse" command given below:

    bin/mahout seqdirectory -i ${WORK_DIR}/data2 -o ${WORK_DIR}/data2-seq
    bin/mahout seq2sparse -i ${WORK_DIR}/data2-seq -o ${WORK_DIR}/data2-vectors -lnorm -nv -wt tfidf

    I then ran "testnb" with the model / labelindex created from the previous set of data, using the command given below:

    bin/mahout testnb -i ${WORK_DIR}/data2-vectors/tfidf-vectors/part-r-00000 -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/data2-testing $c

    I get a confusion matrix like this:

    Confusion Matrix
    a b c <--Classified as
    0 30 0 | 30 a = negative
    0 30 0 | 30 b = positive
    0 30 0 | 30 c = uncertain

    Can anyone tell me why this is happening? Am I testing the model the correct way, or is it a bug in Mahout 0.7? If this is not the correct way, please suggest a way out.


  5. Hi Priyadarshan

    It would be better to post your question on the Mahout mailing list, so it is open to all and you may get a quick reply.

    I haven't tried Mahout 0.7 Naive Bayes training / testing yet. Once I do, I will update my comments.



  6. Posted by Tania on August 28, 2014 at 4:03 pm

    Can anyone tell me how to create my own data set in Mahout and then run the Naive Bayes algorithm on it?

