## Saturday, January 19, 2008

### Simple Word Frequency Functions in Haskell

As part of the botlist text content mining backend I am working on, I frequently process the word tokens in a web page document. The code below contains utility functions that build a list of tuples, each holding a word and the number of times it occurs in the document. I modified John Goerzen's original wordFreq version from http://changelog.complete.org/plugin/tag/haskell.

```haskell
import System.Environment
import qualified Data.Map as Map
import Data.List
import Text.Regex (splitRegex, mkRegex)

-- | Find word frequency given an input list using "Data.Map" utilities.
-- With (Map.empty :: Map.Map String Int), set k = String and a = Int
--    Map.empty :: Map k a
-- foldl' is a strict version of foldl:
--    foldl' :: (a -> b -> a) -> a -> [b] -> a
-- (Original code from John Goerzen's wordFreq)
wordFreq :: [String] -> [(String, Int)]
wordFreq inlst = Map.toList $ foldl' updateMap (Map.empty :: Map.Map String Int) inlst
    where updateMap freqmap word = case (Map.lookup word freqmap) of
                                     Nothing -> (Map.insert word 1 freqmap)
                                     Just x  -> (Map.insert word $! x + 1) freqmap

-- | Pretty print the word/count tuple and output a string.
formatWordFreq :: (String, Int) -> String
formatWordFreq tupl = fst tupl ++ " " ++ (show $ snd tupl)

-- | Order two word/count tuples by descending count.
-- NOTE: wordFreqSort refers to freqSort, but the listing never defines it.
-- This definition is an assumption, chosen to match the sample output
-- (highest counts first).
freqSort :: (String, Int) -> (String, Int) -> Ordering
freqSort (_, c1) (_, c2) = compare c2 c1

-- | Given an input list of word tokens, find the word frequency and sort the values.
-- sortBy :: (a -> a -> Ordering) -> [a] -> [a]
wordFreqSort :: [String] -> [(String, Int)]
wordFreqSort inlst = sortBy freqSort . wordFreq $ inlst
```
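
The lookup-then-insert step in `wordFreq` can probably be written more compactly. The following is only a sketch, assuming a `containers` release that ships `Data.Map.Strict`; the name `wordFreqCompact` is hypothetical and not part of the original code. It keeps the strict left fold but lets `Map.insertWith` handle both the "new word" and "seen before" cases.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as Map

-- | Count word occurrences with a strict left fold.
-- Map.insertWith (+) word 1 inserts 1 for an unseen word and
-- adds 1 to the stored count otherwise.
-- (wordFreqCompact is a hypothetical name used for illustration.)
wordFreqCompact :: [String] -> [(String, Int)]
wordFreqCompact = Map.toList . foldl' bump Map.empty
  where
    bump freqmap word = Map.insertWith (+) word 1 freqmap
```

As with `wordFreq`, the result comes back in ascending key order because it is produced by `Map.toList`; for example, `wordFreqCompact ["b", "a", "b"]` evaluates to `[("a",1),("b",2)]`.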

Usage:

```haskell
-- Fragment from inside a do block; content is assumed to hold the document text.
let tokens = splitRegex (mkRegex "\\s*[ \t\n]+\\s*") content
    wordfreq = wordFreqSort tokens
mapM_ (\x -> (putStrLn $ formatWordFreq x)) wordfreq
```
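
For trying the pipeline without the regex dependency, here is a rough end-to-end sketch. It swaps `splitRegex` for the Prelude's `words` (which also splits on runs of whitespace) and counts with `Map.fromListWith` rather than the fold above, so it is a stand-in and not the code from this post. The names `contentAnalysis` and `report` are made up for the example.

```haskell
import Data.List (sortBy)
import Data.Ord (comparing, Down (..))
import qualified Data.Map.Strict as Map

-- | Split on whitespace with 'words' (a stand-in for the splitRegex call),
-- count each token, and sort by descending count.
-- sortBy is stable, so tokens with equal counts stay in the Map's
-- ascending key order.
contentAnalysis :: String -> [(String, Int)]
contentAnalysis content =
  sortBy (comparing (Down . snd)) (Map.toList counts)
  where
    counts = Map.fromListWith (+) [(w, 1 :: Int) | w <- words content]

-- | Render each (word, count) pair as one "word count" line.
report :: String -> [String]
report = map (\(w, n) -> w ++ " " ++ show n) . contentAnalysis
```

Calling `mapM_ putStrLn (report "to be or not to be")` prints `be 2`, `to 2`, `not 1`, `or 1`, one per line.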

The utility code was run against a spam document and produced the following output:

```
*** Content Analysis
Viagra 28
Cialis 11
to 10
with 7
Here 6
Order 5
the 5
Articles 4
Click 4
Viagra. 4
```
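
Note that `Viagra` and `Viagra.` are counted separately, because splitting on whitespace leaves punctuation attached to the token. If that is not wanted, one option is to normalize tokens before counting. This is a minimal sketch under that assumption; `normalizeToken` and `normalizeTokens` are hypothetical helpers, and the lowercasing step is an extra choice that also merges words such as `Here` and `here`.

```haskell
import Data.Char (isAlphaNum, toLower)

-- | Lowercase a token and trim non-alphanumeric characters from both ends.
-- Each 'trim' drops leading junk and reverses, so applying it twice
-- trims both ends and restores the original order.
normalizeToken :: String -> String
normalizeToken = map toLower . trim . trim
  where
    trim = reverse . dropWhile (not . isAlphaNum)

-- | Normalize every token and discard the ones that end up empty
-- (for example a token that was only punctuation).
normalizeTokens :: [String] -> [String]
normalizeTokens = filter (not . null) . map normalizeToken
```

With this in place, `normalizeTokens ["Viagra", "Viagra.", "..."]` yields `["viagra","viagra"]`, and the result can be passed to `wordFreqSort` in place of the raw tokens.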