Archive for the ‘FOSS’ Category

Train / Test Mahout SGD Classifier

Mahout 0.6 has been released with bug fixes and new implementations. I experimented with Mahout on my previous project, but back then I used the Naive Bayes classifier.

But now my data set is too small to run well with Naive Bayes, and I also came across a topic in the Mahout in Action MEAP, “Choosing an algorithm to train the classifier”, which points to Stochastic Gradient Descent (SGD). It makes clear that SGD is highly advisable for “small to medium (less than tens of millions of training examples)” data sets.

So I decided to choose SGD because it is the best fit for my data set. OK, the algorithm is selected; what next? I needed to dig into the Mahout source to find some hints on how to run SGD.

Yes, there are examples that show how to train and test using SGD.

So let's look for the command-line options to run SGD. In examples/bin of the Mahout binary distribution you can see classify-20newsgroups.sh. I went through that file and got an idea of how to train and test SGD from the command line.

Now that I've found how it works, let's start the game. :)

Before that, please make sure you have a sample data set to train on. You can download the 20 Newsgroups data set.

The input data set is ready; now we are going to train on it using SGD.

You can download mahout-distribution-0.6 from any of the mirrors. Extract it and cd into the Mahout folder.
There is a class TrainNewsGroups in org.apache.mahout.classifier.sgd, and it accepts the path of the input data set as an argument.

./bin/mahout org.apache.mahout.classifier.sgd.TrainNewsGroups /home/sree/Desktop/20news-bydate/20news-bydate-train/

If you forgot to set JAVA_HOME, an error may occur. Set JAVA_HOME if needed.
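For example, on Ubuntu you could point JAVA_HOME at your local JDK before running the command (the path below is only an illustration; adjust it to wherever your JDK is installed):

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export PATH=$JAVA_HOME/bin:$PATH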

Currently the SGD implementation in Mahout runs sequentially (online / incremental). It does not run in parallel like the Naive Bayes implementation I experimented with before.

By default TrainNewsGroups creates model files (with a .model extension) in the /tmp directory. While training you can check whether these files are being created. If you see files named “news-group-{a number}.model”, then SGD has started training over our newsgroup data set.

Once training has completed you will find a set of .model files generated in the /tmp directory. You can simply choose “news-group.model” or “news-group-{MAX NUMBER}.model” as the model.
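To verify, you can simply list the generated model files (this assumes the default /tmp output location mentioned above):

ls -l /tmp/news-group*.model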

If you completed the training without any errors, a model has been created from the input data set. :) It's time to test it. This part determines how much accuracy we can get on the test data using SGD.

To test data against a model, we can use the TestNewsGroups class in org.apache.mahout.classifier.sgd.

TestNewsGroups has two mandatory arguments:
--input : path of the test data
--model : path of the model file

Finally, it's time to test it.

./bin/mahout org.apache.mahout.classifier.sgd.TestNewsGroups --input /home/sree/Desktop/20news-bydate/20news-bydate-test/ --model /tmp/news-group.model

You will see a confusion matrix and the overall classification accuracy.

I got 73.513%. If you want to see the confusion matrix, just follow this link:
http://pastebin.com/1RvkZ5Kk

I discussed the accuracy of SGD on the Mahout mailing list and there were some follow-ups.
You can search the Mahout mailing list archive of Dec 2011 for the topic “Mahout SGD / Bayes prediction results over 20newsgroups”.

I think the SGD accuracy is not satisfactory compared with Naive Bayes on the same 20 Newsgroups data set :( even though SGD is the one advised for small / medium sized data sets. Let's hope the Mahout developers will work on it. 🙂


Installing ImageMagick in Ubuntu 11.04

I am working on a project that includes some image manipulation functions. After a quick Googling I found that ImageMagick best satisfies my needs.

I installed ImageMagick using sudo apt-get install imagemagick.

After installation I tested some of the commands provided by ImageMagick (convert, identify). But they were not working and threw the error “No Delegates for this image”.

Then I realised that ImageMagick may have some dependencies, so I searched for delegates: http://www.imagemagick.org/download/delegates/

I tried the wiki page http://wiki.helioviewer.org/wiki/Setting_Up_ImageMagick for setting up ImageMagick, but failed again. 😦

So I decided to install it from source.

You’ll need to install a number of dependencies in addition to ImageMagick in order to have a fully functional ImageMagick installation. It’s important that these dependencies are installed before you start configuring and compiling ImageMagick, because the configure script for ImageMagick will disable functionality that isn’t available because of missing dependencies at compile time.

The dependencies I found useful for my use case are:

 sudo apt-get install libjpeg8-dev libpng12-dev libglib2.0-dev libfontconfig1-dev zlib1g-dev libtiff4-dev

After installing the above dependencies, I started to compile ImageMagick from source.

You can download the ImageMagick source from any of the mirrors: http://www.imagemagick.org/script/download.php

ImageMagick-6.7.5-6 is the latest version at the time of writing.

After downloading the source, extract it:

tar xvfz ImageMagick-6.7.5-6.tar.gz

cd ImageMagick-6.7.5-6

./configure

If you need to do some advanced configuration, follow this link: http://www.imagemagick.org/script/advanced-unix-installation.php

sudo make

sudo make install

Installation completed, hooray! :) Then I checked the ImageMagick commands and found the same delegate problem again. 😦

I restarted the build from the ./configure step, this time adding the --disable-shared option:

./configure --disable-shared

Installation completed again. No emotion this time.

I checked whether all the delegates and dependencies were configured with ImageMagick properly.

You can check this using convert -list configure.

🙂 again. The delegates are configured properly:

DELEGATES fontconfig freetype jpeg jng mpeg png x11 zlib
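If you only want to see that delegates line, you can filter the listing, for example:

convert -list configure | grep DELEGATES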

So, time to run some ImageMagick commands.

I started with the identify command:

identify a.jpg

It's working. WOW!

a.jpg JPEG 321×400 321×400+0+0 8-bit DirectClass 28.9KB 0.000u 0:00.000

Then I tried converting one image to another format:

convert a.jpg a.gif

WOW again, it's working! :) 🙂

So, happy emotions again. :) 😀 😛 ImageMagick is set up.
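As one more quick check for my image manipulation use case, a resize conversion also works (the file names here are just placeholders):

convert a.jpg -resize 50% a_small.jpg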

Now I am going to get my hands dirty with ImageMagick. Courtesy: my guru Jaganadhg; he usually says this whenever he starts learning something new.

UIMA SDK & Plugin installation

I just started digging into UIMA core and found some difficulties setting it up. UIMA has good documentation, but it is not the best pointer for a newbie. :( So I thought I would write a blog post giving some pointers on the installation and initial set-up of UIMA.

Prerequisites

1) JDK

I hope you have already set up Java.

2) Eclipse IDE

http://www.eclipse.org/downloads/

3) Eclipse EMF Plugin

http://wiki.eclipse.org/EMF/Installation

You can get the UIMA SDK here:

http://uima.apache.org/downloads.cgi#Latest%20Official%20Releases

Here is a good pointer:

http://savorywatt.com/2011/07/16/uima-quick-start-sdk-install-plugins/

Or you can go directly through the UIMA docs:

http://uima.apache.org/downloads/releaseDocs/2.2.2-incubating/docs/html/overview_and_setup/overview_and_setup.html#ugr.ovv.eclipse_setup

FOSSMeet NIT Calicut

Fortunately I got my colleague Biju to attend FOSSMeet at NIT Calicut with me, and the two of us delivered talks. We reached the NIT campus on Sunday morning. Mr. Karthik met me at the main gate and directed us to Bhaskara Hall, where all the open talks happened.

My talk was an introduction to NLTK, and it started at 12:05 as requested by Mr. Anil. After discussing the introductory part about NLP and NLTK, I went through some basic practical workouts. I think that is the best way to learn a practical toolkit.

Many questions came from the audience, and I tried to answer them. Biju also helped me clarify some doubts, which made the session quite interactive. After my talk I got a memento from the FOSSMeet team.

Biju delivered his talk on Apache Mahout at 3:00 pm. He gave a demo of document classification and recommendation systems. Much of the audience were students, and I think they got a bit confused because Biju's talk was on the latest technologies like Mahout and Hadoop. He tried his best to explain things as simply as possible, and I also tried to clarify the MapReduce concept. Questions came up about some of the algorithms he discussed.

We met many students after our talks. It was very nice to interact with them; some of them are doing great work.

Thank you, FOSSMeet team. Bye, NITC!


PyCon India 2010 - Great experience and exposure

Jaganadh, Biju and I went to Bangalore on Friday night, September 24th. We reached there on Saturday and found a hotel to stay in, the “Rain bow” near City Market. It was a nice place actually. On Saturday we went to MSRIT for the first time.

On the first day Jaganadh gave his talk on BioNLP in Lecture Hall 1, and we were all there. Unfortunately his laptop did not respond to the projector and he used my laptop for the presentation, so he could not show the demo. I faced a similar projector problem with my laptop; almost all the Fedora / Linux systems faced the same issue.

The next talk was Biju's, on semantic web programming, on Saturday afternoon. He had prepared well and intended to show a live demo of semantic web programming, but unfortunately he also faced the projector problem. The talk was very informative, Lecture Hall 2 was filled with a large audience, and he did a great job.

My talk was on Sunday afternoon, so I got plenty of time for preparation. But I was afraid of how I would handle it if the same projector problem occurred. So after lunch on Sunday I connected my laptop to the projector in Lecture Hall 2: same problem, it was not detected. Then I swapped my laptop for Jaganadh's and installed all the Python dependencies I needed to run my Python demo.

My talk started at 2:45 PM and I finished at 3:30. Some questions were raised by the audience, and fortunately I was able to answer them. I also showed them a demo of three machine learning algorithms. My intention was to give just a brief introduction to the theoretical part and put more emphasis on the practical implementation.

One thing I noticed at PyCon was that almost everyone was into core Python. They wanted to know about new Python APIs, Python tips, integrated technologies, etc.

I am mentioning this because our approach was actually different. We work on Natural Language Processing, Machine Learning and Semantic Web kinds of applications and are interested in Python, so obviously our approach to Python is not like the others'. We are trying to expose these kinds of innovative ideas and show how they can be implemented using Python. Many of these topics exist only on paper.

We are not involved in Python core development; our only intention was to introduce these artificial intelligence concepts and show how they can be implemented in Python.

It really was a great experience and exposure for us.