Saturday, January 19, 2008

Simple Word Frequency Functions in Haskell

As part of the botlist text content mining backend I am working on, I frequently process word tokens from web page documents. The code below contains utility functions that build a list of tuples, where each tuple holds a word and the number of times it occurs in the document. I modified John Goerzen's original wordFreq version from

import System.Environment
import qualified Data.Map as Map
import Data.List
import Text.Regex (splitRegex, mkRegex)

-- | Find word frequency given an input list using "Data.Map" utilities.
-- Map.empty :: Map k a; the annotation (Map.empty :: Map.Map String Int)
-- sets k = String and a = Int.
-- foldl' is the strict version of foldl:
-- foldl' :: (a -> b -> a) -> a -> [b] -> a
-- (Original code from John Goerzen's wordFreq)
wordFreq :: [String] -> [(String, Int)]
wordFreq inlst = Map.toList $ foldl' updateMap (Map.empty :: Map.Map String Int) inlst
  where
    updateMap freqmap word = case Map.lookup word freqmap of
      Nothing -> Map.insert word 1 freqmap
      Just x  -> (Map.insert word $! x + 1) freqmap

-- | Pretty print the word/count tuple and output a string.
formatWordFreq :: (String, Int) -> String
formatWordFreq tupl = fst tupl ++ " " ++ (show $ snd tupl)

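-- | Given an input list of word tokens, find the word frequency and sort the values.
-- sortBy :: (a -> a -> Ordering) -> [a] -> [a]
wordFreqSort :: [String] -> [(String, Int)]
wordFreqSort inlst = sortBy freqSort . wordFreq $ inlst

-- Comparison used by wordFreqSort. Assumed definition, consistent with the
-- output below: highest count first, ties ordered by word.
freqSort :: (String, Int) -> (String, Int) -> Ordering
freqSort (w1, c1) (w2, c2)
  | c1 == c2  = compare w1 w2
  | otherwise = compare c2 c1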


-- Usage, inside an IO do block where content holds the document text:
let tokens = splitRegex (mkRegex "\\s*[ \t\n]+\\s*") content
    wordfreq = wordFreqSort tokens
mapM_ (putStrLn . formatWordFreq) wordfreq
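
For comparison, the counting fold in wordFreq can probably be written more compactly with Map.fromListWith, which combines the values of duplicate keys. The sketch below is only an illustration, not the code used in botlist: the names wordFreq' and wordFreqSort' are hypothetical, and the sort order (highest count first, ties ordered by word) is an assumption chosen to match the sample output.

```haskell
import qualified Data.Map.Strict as Map
import Data.List (sortBy)
import Data.Ord (comparing, Down (..))

-- | Count occurrences of each word (hypothetical name).
-- fromListWith (+) adds up the 1s of every repeated key.
wordFreq' :: [String] -> [(String, Int)]
wordFreq' ws = Map.toList (Map.fromListWith (+) [(w, 1) | w <- ws])

-- | Sort by descending count, then ascending word (assumed order).
wordFreqSort' :: [String] -> [(String, Int)]
wordFreqSort' = sortBy (comparing (\(w, c) -> (Down c, w))) . wordFreq'
```

Data.Map.Strict keeps the counts evaluated as they accumulate, which plays the same role as the $! in the original updateMap.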

Run against a spam document, the utility code produced the following output:

*** Content Analysis
Viagra 28
Cialis 11
to 10
with 7
Here 6
Order 5
the 5
Articles 4
Click 4
Viagra. 4
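
Note that "Viagra" and "Viagra." are counted as separate words, because the tokens are split on whitespace only. One possible refinement, sketched here under my own assumptions, is to normalize each token before counting. The helper names normalizeToken and normalizeTokens are hypothetical, and lowercasing is a design choice that may not suit every analysis.

```haskell
import Data.Char (isAlphaNum, toLower)
import Data.List (dropWhileEnd)

-- | Hypothetical helper: strip non-alphanumeric characters from both
-- ends of a token and lowercase it, so "Viagra." and "viagra" match.
normalizeToken :: String -> String
normalizeToken =
  map toLower . dropWhileEnd (not . isAlphaNum) . dropWhile (not . isAlphaNum)

-- | Normalize every token and drop the ones that end up empty.
normalizeTokens :: [String] -> [String]
normalizeTokens = filter (not . null) . map normalizeToken
```

The result of normalizeTokens could then be passed to wordFreqSort in place of the raw token list.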
