## Sunday, January 27, 2008

### "It doesnt matter about the drugs and rock and roll, sex" - Haskell Bayes and other classifications

It doesnt matter about the drugs and rock and roll, sex

That was an example test case that I used against my Bayes classification application. Bayes and other probability algorithms are used in most of the spam-filtering software out on the net, and I use similar probability routines to classify news articles. I began with Toby's Python code on the subject and ported his examples to Haskell. So far the results have been pretty satisfactory. The naive Bayes classification is dismal, but maybe I am reading the numbers wrong. In classic form, I will throw code at you and give a lollipop to the mind reader who can figure out what is going on.
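The library's `bayesProb` only appears at its call site below, so here is a hedged sketch of the naive Bayes idea behind it: count (word, category) pairs at training time, then score a document against a category as the product of smoothed per-word conditional probabilities. All names here (`trainCounts`, `wordProb`, `bayesScore`) are mine for illustration, not the library's:

```haskell
import qualified Data.Map as Map

-- (word, category) -> count, built from (category, document) training pairs.
type Counts = Map.Map (String, String) Int

trainCounts :: [(String, String)] -> Counts
trainCounts docs = Map.fromListWith (+)
  [ ((w, cat), 1) | (cat, doc) <- docs, w <- words doc ]

-- P(word | cat) with add-one smoothing over a vocabulary of size v.
wordProb :: Counts -> Int -> String -> String -> Double
wordProb counts v w cat = (fromIntegral n + 1) / fromIntegral (catTotal + v)
  where
    n        = Map.findWithDefault 0 (w, cat) counts
    catTotal = sum [ c | ((_, cat'), c) <- Map.toList counts, cat' == cat ]

-- Naive Bayes score of a document for a category: the product of the
-- per-word conditionals (a uniform category prior is assumed and dropped).
bayesScore :: Counts -> Int -> [String] -> String -> Double
bayesScore counts v tokens cat = product [ wordProb counts v w cat | w <- tokens ]
```

Multiplying many small conditionals is exactly why the raw Bayes numbers in the output below sit in the `e-7` to `e-2` range; only their relative order per phrase matters.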

In my first test case, I extracted a couple of articles from the web under three subjects: politics, business, and entertainment. A human can easily tell which is which. A software program is going to have a lot of difficulty for several reasons, two main ones being that there are too many stop words and too many ambiguous phrases in my sample training set. The feature word "global" could easily apply to entertainment, politics, or business; which one should the computer decide upon? In any case, the first listing below shows the input data set.
```haskell
let phrases =
      [ "It doesnt matter about the drugs and rock and roll, sex",
        "I agree citi, global market money business business driving force",
        "ron paul likes constitution war international freedom america",
        "viagra drugs levitra bigger",
        "Movies are fun and too enjoy",
        "war america not good"
      ]
```

And the output:
```
Running Tests
At: Sun Jan 27 05:08:37 EST 2008 t:1201428517
Test Train Bayes
793
["train/business.train","train/entertainment.train","train/politics.train"]
---- train/business.train ----
  . [It doesnt matter about the drugs and rock and roll, sex ]
  Bayes Probability=3.989729693042545e-7
  Fisher Probability=0.836128249211062
  . [I agree citi, global market money business business driving force ]
  Bayes Probability=4.850155473781951e-5
  Fisher Probability=0.9541001591541463
  . [ron paul likes constitution war international freedom america ]
  Bayes Probability=3.780639186633039e-4
  Fisher Probability=0.6089235797669089
  . [viagra drugs levitra bigger ]
  Bayes Probability=2.419609079445145e-2
  Fisher Probability=0.6980297367583733
  . [Movies are fun and too enjoy ]
  Bayes Probability=1.9649303466097898e-4
  Fisher Probability=0.4827094766567699
  . [war america not good ]
  Bayes Probability=1.2098045397225725e-2
  Fisher Probability=0.5440441299962482
---- train/entertainment.train ----
  . [It doesnt matter about the drugs and rock and roll, sex ]
  Bayes Probability=2.0531041666810224e-7
  Fisher Probability=0.6270914273866588
  . [I agree citi, global market money business business driving force ]
  Bayes Probability=2.339809268600252e-5
  Fisher Probability=0.4542155835210187
  . [ron paul likes constitution war international freedom america ]
  Bayes Probability=1.8718474148802017e-4
  Fisher Probability=0.6089235797669089
  . [viagra drugs levitra bigger ]
  Bayes Probability=1.197982345523329e-2
  Fisher Probability=0.6980297367583733
  . [Movies are fun and too enjoy ]
  Bayes Probability=1.0245769451548432e-4
  Fisher Probability=0.8278707842798139
  . [war america not good ]
  Bayes Probability=5.989911727616645e-3
  Fisher Probability=0.5440441299962482
---- train/politics.train ----
  . [It doesnt matter about the drugs and rock and roll, sex ]
  Bayes Probability=4.237119541681478e-7
  Fisher Probability=0.4319928795655645
  . [I agree citi, global market money business business driving force ]
  Bayes Probability=5.141422998108449e-5
  Fisher Probability=0.4542155835210187
  . [ron paul likes constitution war international freedom america ]
  Bayes Probability=4.1625450234461715e-4
  Fisher Probability=0.8928739462341001
  . [viagra drugs levitra bigger ]
  Bayes Probability=2.632408575031526e-2
  Fisher Probability=0.6980297367583733
  . [Movies are fun and too enjoy ]
  Bayes Probability=2.131121584070195e-4
  Fisher Probability=0.46619446182733354
  . [war america not good ]
  Bayes Probability=1.3240857503152584e-2
  Fisher Probability=0.7855657341350474
Done
```

Like I said before, the Bayes probability numbers are all over the place. It could be that I didn't extract stop words from the training set, or that the training set is too small, or both.
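Stop-word extraction is easy to bolt on before training or classifying. A minimal sketch, where `stopWords` and `contentTokens` are illustrative names of mine (the post's `wordTokens` presumably tokenizes without this filter), and the stop-word list here is deliberately tiny:

```haskell
import Data.Char (toLower, isAlpha)

-- A tiny illustrative stop-word list; a real one would be much larger.
stopWords :: [String]
stopWords = ["it", "the", "and", "a", "an", "of", "to", "is", "are", "not", "about", "too"]

-- Lowercase, strip punctuation, tokenize, and drop stop words.
contentTokens :: String -> [String]
contentTokens = filter (`notElem` stopWords)
              . words
              . map (\c -> if isAlpha c then toLower c else ' ')
```

Filtering this way shrinks phrases like the test sentence down to the content words that actually discriminate between categories.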

The interesting part is that the Fisher probability algorithm figured out the correct classification in each instance.

It doesnt matter about the drugs and rock and roll, sex

I wanted to associate this phrase with entertainment. Fisher gave it a probability of 0.6270914273866588 of being an entertainment document, compared with roughly 45% for "I agree citi, global market money business business driving force". The other phrases matched their intended categories as well: a 100% success rate for the Fisher probability.
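Reading the output this way amounts to scoring each phrase against every category and keeping the maximum. A small helper makes that explicit (`bestCategory` is my name, not part of the post's code); the sample scores in the usage note are the Fisher numbers for "Movies are fun and too enjoy" from the output above:

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Given (category, score) pairs for one phrase, pick the winning category.
bestCategory :: [(String, Double)] -> String
bestCategory = fst . maximumBy (comparing snd)
```

For example, `bestCategory [("business", 0.4827), ("entertainment", 0.8279), ("politics", 0.4662)]` selects `"entertainment"`.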

The next listing shows the test case.
```haskell
module Tests.Data.TestTrainBayes where

import Monad (liftM)
import System.Directory (getDirectoryContents)
import List (isPrefixOf, isSuffixOf)

import Data.SpiderNet.Bayes

trainDir = "train"

runTestTrainBayes :: IO ()
runTestTrainBayes = do
  putStrLn "Test Train Bayes"
  -- Process only files with 'train' extension
  files <- getDirectoryContents trainDir
  let trainfiles = filter (isSuffixOf ".train") files
      trainpaths = map (\x -> trainDir ++ "/" ++ x) trainfiles
  lst_content <- liftM (zip trainpaths) $ mapM readFile trainpaths
  -- Print a count of the training set size
  let info = buildTrainSet lst_content []
  putStrLn $ show (length info)
  putStrLn $ show (categories info)
  let phrases =
        [ "It doesnt matter about the drugs and rock and roll, sex",
          "I agree citi, global market money business business driving force",
          "ron paul likes constitution war international freedom america",
          "viagra drugs levitra bigger",
          "Movies are fun and too enjoy",
          "war america not good"
        ]
  -- Process the following input phrases against each training category
  mapM_ (\cat -> do
           putStrLn $ "---- " ++ cat ++ " ----"
           mapM_ (\phrase -> do
                    putStrLn $ "  . [" ++ phrase ++ " ]"
                    putStrLn $ "  Bayes Probability=" ++
                                 show (bayesProb info (wordTokens phrase) cat 1.0)
                    putStrLn $ "  Fisher Probability=" ++
                                 show (fisherProb info (wordTokens phrase) cat)
                 ) phrases
        ) (categories info)
```

Source

This module implements the classification source; step through the example to get an understanding of the code. Start with the test case and then move to the bayesProb and fisherProb functions.

```haskell
fisherProb :: [WordCatInfo] -> [String] -> String -> Double
fisherProb features tokens cat = invchi
    where initp  = 1.0
          weight = 1.0
          p      = foldl (\prb f -> prb * (weightedProb features f cat weight)) initp tokens
          fscore = (negate 2) * (log p)
          invchi = invChi2 fscore ((genericLength tokens) * 2)
```
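`fisherProb` leans on two helpers not shown here, `weightedProb` and `invChi2`. The sketches below are my assumptions based on the common Python formulation of Fisher's method for text classification; the `Sketch` names are mine, and the real `weightedProb` takes the training data and a category rather than precomputed counts:

```haskell
-- Blend an assumed prior (typically 0.5) with the observed conditional
-- probability, weighted by how many times the feature has been seen overall.
-- Unseen words then fall back to the neutral 0.5 instead of 0.
weightedProbSketch :: Double -> Double -> Double -> Double -> Double
weightedProbSketch weight assumed timesSeen basicProb =
  (weight * assumed + timesSeen * basicProb) / (weight + timesSeen)

-- Inverse chi-square: the probability of seeing a chi-square statistic at
-- least this large by chance, with df degrees of freedom (fisherProb passes
-- 2 * token count). Sums the series term_0 = e^(-m), term_i = term_(i-1)*m/i.
invChi2Sketch :: Double -> Double -> Double
invChi2Sketch chi df = min 1.0 total
  where
    m          = chi / 2.0
    term0      = exp (negate m)
    (_, total) = foldl step (term0, term0) [1 .. df / 2 - 1]
    step (term, acc) i = let term' = term * m / i in (term', acc + term')
```

With `chi = 0` the inverse chi-square is 1.0 (no evidence against the null), and it falls toward 0 as the combined negative log-probability grows, which is why the Fisher outputs above all land in a sane 0..1 range.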