Sunday, January 27, 2008

"It doesnt matter about the drugs and rock and roll, sex " - Haskell bayes and other classifications

It doesnt matter about the drugs and rock and roll, sex

That was an example test case that I used against my Bayes classification application. Bayes and other probability algorithms are used in most of the spam filtering software out on the net. I use various probability routines to classify news articles. I began with Toby's Python code on the subject and ported his examples to Haskell. So far, the results have been pretty satisfactory. The naive Bayes classification is dismal, but maybe I am looking at the numbers wrong. In classic form, I will throw code at you and give a lollipop to the mind reader who can figure out what is going on.

The test case input and output are below.

In my first test case, I extracted a couple of articles from the web on these subjects: politics, business, and entertainment. A human can easily tell what is going on. A software program is going to have a lot of difficulty for several reasons, but two main ones: there are too many stop words, and there are too many ambiguous phrases in my sample training set. The feature word "global" could easily apply to entertainment, politics, or business. Which one does the computer decide upon? (The small probe sketch after the listing shows one way to check.) In any case, the first listing below shows the input data set.

let phrases =
      [ "It doesnt matter about the drugs and rock and roll, sex",
        "I agree citi, global market money business business driving force",
        "ron paul likes constitution war international freedom america",
        "viagra drugs levitra bigger",
        "Movies are fun and too enjoy",
        "war america not good"
      ]
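
As an aside on the ambiguity point above, the trained model can be probed directly: score a single feature word such as "global" against every category and compare the numbers. Below is a minimal sketch, assuming the weightedProb and categories functions from the Data.SpiderNet.Bayes module shown later in this post; probeWord itself is my own placeholder name.

import Data.SpiderNet.Bayes

-- Hedged sketch: print the weighted probability of one feature word
-- under each trained category, e.g. probeWord info "global".
probeWord :: [WordCatInfo] -> String -> IO ()
probeWord info word =
    mapM_ (\cat -> putStrLn (cat ++ " -> " ++
              show (weightedProb info word cat 1.0)))
          (categories info)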

And the output:

Running Tests
At: Sun Jan 27 05:08:37 EST 2008 t:1201428517
Test Train Bayes
793
["train/business.train","train/entertainment.train","train/politics.train"]
---- train/business.train ----
. [It doesnt matter about the drugs and rock and roll, sex ]
Bayes Probability=3.989729693042545e-7
Fisher Probability=0.836128249211062
. [I agree citi, global market money business business driving force ]
Bayes Probability=4.850155473781951e-5
Fisher Probability=0.9541001591541463
. [ron paul likes constitution war international freedom america ]
Bayes Probability=3.780639186633039e-4
Fisher Probability=0.6089235797669089
. [viagra drugs levitra bigger ]
Bayes Probability=2.419609079445145e-2
Fisher Probability=0.6980297367583733
. [Movies are fun and too enjoy ]
Bayes Probability=1.9649303466097898e-4
Fisher Probability=0.4827094766567699
. [war america not good ]
Bayes Probability=1.2098045397225725e-2
Fisher Probability=0.5440441299962482
---- train/entertainment.train ----
. [It doesnt matter about the drugs and rock and roll, sex ]
Bayes Probability=2.0531041666810224e-7
Fisher Probability=0.6270914273866588
. [I agree citi, global market money business business driving force ]
Bayes Probability=2.339809268600252e-5
Fisher Probability=0.4542155835210187
. [ron paul likes constitution war international freedom america ]
Bayes Probability=1.8718474148802017e-4
Fisher Probability=0.6089235797669089
. [viagra drugs levitra bigger ]
Bayes Probability=1.197982345523329e-2
Fisher Probability=0.6980297367583733
. [Movies are fun and too enjoy ]
Bayes Probability=1.0245769451548432e-4
Fisher Probability=0.8278707842798139
. [war america not good ]
Bayes Probability=5.989911727616645e-3
Fisher Probability=0.5440441299962482
---- train/politics.train ----
. [It doesnt matter about the drugs and rock and roll, sex ]
Bayes Probability=4.237119541681478e-7
Fisher Probability=0.4319928795655645
. [I agree citi, global market money business business driving force ]
Bayes Probability=5.141422998108449e-5
Fisher Probability=0.4542155835210187
. [ron paul likes constitution war international freedom america ]
Bayes Probability=4.1625450234461715e-4
Fisher Probability=0.8928739462341001
. [viagra drugs levitra bigger ]
Bayes Probability=2.632408575031526e-2
Fisher Probability=0.6980297367583733
. [Movies are fun and too enjoy ]
Bayes Probability=2.131121584070195e-4
Fisher Probability=0.46619446182733354
. [war america not good ]
Bayes Probability=1.3240857503152584e-2
Fisher Probability=0.7855657341350474
Done

Like I said before, the Bayes probability numbers are all over the place. It could be an issue with the fact that I didn't extract stop words from the training set, or the training set could be too small, or both.
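
One quick experiment would be to strip stop words from the token stream before training and scoring. Below is a minimal sketch; both the stop list and stripStopWords are my own placeholders, not part of Data.SpiderNet.Bayes.

import Data.Char (toLower)

-- Hypothetical helper: drop common English stop words from a token
-- list before it is handed to the training or scoring routines.
stopWords :: [String]
stopWords = ["the", "and", "a", "an", "of", "to", "is", "are", "it", "about", "not"]

stripStopWords :: [String] -> [String]
stripStopWords = filter (\w -> map toLower w `notElem` stopWords)

Something like stripStopWords (wordTokens phrase) could then be fed to bayesProb and fisherProb instead of the raw token list.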

The interesting part is that the Fisher probability algorithm figured out the correct classification in each instance.

It doesnt matter about the drugs and rock and roll, sex

I wanted to associate this term with entertainment. There was a 0.6270914273866588 Fisher probability that it is entertainment, while the business phrase "I agree citi, global market money business business driving force" only had about a 45% chance of being an entertainment document. Subsequently, it matched in all of the other instances as well: a 100% success rate for the Fisher probability.
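
To turn the per-category scores into a single answer, you can simply pick the category whose Fisher probability is highest. Here is a minimal sketch that reuses the categories, wordTokens, and fisherProb functions from the test case below; classifyFisher itself is my own addition.

import Data.List (maximumBy)
import Data.Ord (comparing)
import Data.SpiderNet.Bayes

-- Hedged sketch: classify a phrase as the category with the highest
-- Fisher probability, reusing the SpiderNet scoring functions.
classifyFisher :: [WordCatInfo] -> String -> String
classifyFisher info phrase = maximumBy (comparing score) (categories info)
    where score cat = fisherProb info (wordTokens phrase) cat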

The next listing shows the test case.

module Tests.Data.TestTrainBayes where

import Monad (liftM)
import System.Directory (getDirectoryContents)
import List (isPrefixOf, isSuffixOf)
import Data.SpiderNet.Bayes

trainDir = "train"

runTestTrainBayes :: IO ()
runTestTrainBayes = do
  putStrLn "Test Train Bayes"
  -- Process only files with 'train' extension
  files <- getDirectoryContents trainDir
  let trainfiles = filter (isSuffixOf ".train") files
      trainpaths = map (\x -> trainDir ++ "/" ++ x) trainfiles
  lst_content <- liftM (zip trainpaths) $ mapM readFile trainpaths
  -- Print a count of the training set size
  let info = buildTrainSet lst_content []
  putStrLn $ show (length info)
  putStrLn $ show (categories info)
  let phrases =
        [ "It doesnt matter about the drugs and rock and roll, sex",
          "I agree citi, global market money business business driving force",
          "ron paul likes constitution war international freedom america",
          "viagra drugs levitra bigger",
          "Movies are fun and too enjoy",
          "war america not good"
        ]
  -- Process the following input phrases against each category
  mapM_ (\cat -> do
           putStrLn $ "---- " ++ cat ++ " ----"
           mapM_ (\phrase -> do
                    putStrLn $ " . [" ++ phrase ++ " ]"
                    putStrLn $ " Bayes Probability=" ++
                        show (bayesProb info (wordTokens phrase) cat 1.0)
                    putStrLn $ " Fisher Probability=" ++
                        show (fisherProb info (wordTokens phrase) cat)
                 ) phrases
        ) (categories info)
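
To actually run the test, a small driver is enough. The Main module below is a hypothetical sketch; the real project wires runTestTrainBayes into its own test runner, which also prints the "Running Tests" banner and timestamp seen in the output above.

-- Hypothetical driver module; the real project uses its own runner.
module Main where

import Tests.Data.TestTrainBayes (runTestTrainBayes)

main :: IO ()
main = do
    putStrLn "Running Tests"
    runTestTrainBayes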


Source

The module linked below implements the classification logic; step through the example to get an understanding of the code. Start with the test case and then move on to the bayesProb and fisherProb functions.

http://openbotlist.googlecode.com/svn/trunk/botlistprojects/botspider/spider/lib/haskell/src/Data/SpiderNet/Bayes.hs


fisherProb :: [WordCatInfo] -> [String] -> String -> Double
fisherProb features tokens cat = invchi
    where initp  = 1.0
          weight = 1.0
          p      = foldl (\prb f -> prb * (weightedProb features f cat weight)) initp tokens
          fscore = (negate 2) * (log p)
          invchi = invChi2 fscore ((genericLength tokens) * 2)
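
In fisherProb, the weighted per-token probabilities are folded into a single product p, negate 2 * log p is treated as a chi-square statistic, and that score is pushed through an inverse chi-square function with 2 * (number of tokens) degrees of freedom. The module's invChi2 is defined in the linked Bayes.hs; purely for reference, a sketch of the usual approximation (the same shape as the invchi2 in Toby's Python version) might look roughly like this:

-- Hedged sketch of the standard inverse chi-square approximation; the
-- real invChi2 lives in the linked Bayes.hs and may differ in detail.
invChi2Sketch :: Double -> Int -> Double
invChi2Sketch chi df = min 1.0 (sum (take nterms terms))
    where m      = chi / 2.0
          nterms = max 1 (df `div` 2)
          -- term_0 = e^(-m), term_i = term_(i-1) * m / i
          terms  = scanl (\t i -> t * m / fromIntegral i)
                         (exp (negate m))
                         [(1 :: Int) ..]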

And here is an example input document used for training the business class:

Look no further than the European Central Bank, which was notably absent when the Fed made its emergency rate cut amid falling global stocks on Tuesday. In testimony Wednesday before the European Parliament, ECB President Jean-Claude Trichet came about as close as a member of the brotherhood ever will to calling out a fellow central banker: "In demanding times of significant market correction and turbulences, it is the responsibility of the central bank to solidly anchor inflation expectations to avoid additional volatility in already highly volatile markets
Economics is the social science that studies the production, distribution, and consumption of goods and services. The term economics comes from the Greek for oikos (house) and nomos (custom or law), hence "rules of the house(hold)."[1]
Although discussions about production and distribution have a long history, economics in its modern sense as a separate discipline is conventionally dated from the publication of Adam Smith's The Wealth of Nations in 1776.[7] In this work Smith describes the subject in these practical and exacting terms:
Political economy, considered as a branch of the science of a statesman or legislator, proposes two distinct objects: first, to supply a plentiful revenue or product for the people, or, more properly, to enable them to provide such a revenue or subsistence for themselves; and secondly, to supply the state or commonwealth with a revenue sufficient for the public services. It proposes to enrich both the people and the sovereign.
Smith referred to the subject as 'political economy', but that term was gradually replaced in general usage by 'economics' after 1870.
In economics, a business (also called firm or enterprise) is a legally recognized organizational entity existing within an economically free country designed to provide goods and/or services to consumers. Businesses are predominate in capitalist economies, where most are privately owned and typically formed to earn profit to increase the wealth of their owners. The owners and operators of a business have as one of their main objectives the receipt or generation of a financial return in exchange for their work and their acceptance of risk. Notable exceptions to this rule include cooperative businesses and government institutions. This model of business functioning is contrasted with socialistic systems, which involve either government, public, or worker ownership of most sizable businesses.
