Simple Word Frequency Functions in Haskell

January 19, 2008

As part of the botlist text content mining backend I am working on, I frequently process the word tokens of a web page document. The code below contains utility functions that build a list of tuples, where each tuple holds a word and the number of times it occurs in the document. I modified John Goerzen's original wordFreq version from http://changelog.complete.org/plugin/tag/haskell.


import System.Environment
import qualified Data.Map as Map
import Data.List
import Text.Regex (splitRegex, mkRegex)
--
-- | Find word frequency given an input list using "Data.Map" utilities.
-- Map.empty has type Map k a; the annotation (Map.empty :: Map.Map String Int)
-- fixes k = String and a = Int.
-- foldl' is the strict version of foldl:
--    foldl' :: (a -> b -> a) -> a -> [b] -> a
-- (Original code from John Goerzen's wordFreq)
wordFreq :: [String] -> [(String, Int)]
wordFreq inlst = Map.toList $ foldl' updateMap (Map.empty :: Map.Map String Int) inlst
    where updateMap freqmap word = case (Map.lookup word freqmap) of
                                     Nothing -> (Map.insert word 1 freqmap)
                                     Just x  -> (Map.insert word $! x + 1) freqmap

-- | Pretty print the word/count tuple and output a string.
formatWordFreq :: (String, Int) -> String
formatWordFreq (word, count) = word ++ " " ++ show count

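-- | Order two (word, count) tuples by descending count.
-- NOTE: freqSort was missing from the original listing; this definition is an
-- assumption that matches the most-frequent-first output shown below.
freqSort :: (String, Int) -> (String, Int) -> Ordering
freqSort (_, c1) (_, c2) = compare c2 c1

-- | Given an input list of word tokens, find the word frequency and sort the values.
-- sortBy :: (a -> a -> Ordering) -> [a] -> [a]
wordFreqSort :: [String] -> [(String, Int)]
wordFreqSort inlst = sortBy freqSort . wordFreq $ inlst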
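
The lookup-then-insert fold in wordFreq can probably be written more compactly with Map.insertWith. The following is only a minimal sketch of that idea, not John Goerzen's code: the names wordFreqAlt and wordFreqSortAlt are hypothetical, and the descending sort order is an assumption based on the sample output in this post.

```haskell
import qualified Data.Map.Strict as Map
import Data.List (foldl', sortBy)
import Data.Ord (comparing, Down(..))

-- | Count occurrences of each word (hypothetical alternative to wordFreq).
-- Map.insertWith (+) word 1 inserts 1 for a new word and adds 1 to the
-- existing count otherwise; Data.Map.Strict keeps the counts evaluated.
wordFreqAlt :: [String] -> [(String, Int)]
wordFreqAlt = Map.toList . foldl' bump Map.empty
  where bump freqmap word = Map.insertWith (+) word 1 freqmap

-- | Sort the (word, count) tuples by descending count (assumed ordering).
-- sortBy is stable, so words with equal counts keep the key order of the map.
wordFreqSortAlt :: [String] -> [(String, Int)]
wordFreqSortAlt = sortBy (comparing (Down . snd)) . wordFreqAlt
```

With this version the case expression on Map.lookup disappears, at the cost of relying on the strict Map API from the containers package.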


Usage:

  -- Inside a do block in IO, where content :: String holds the document text.
  let tokens = splitRegex (mkRegex "\\s*[ \t\n]+\\s*") content
      wordfreq = wordFreqSort tokens
  mapM_ (putStrLn . formatWordFreq) wordfreq

Running the utility code against a spam document produced the following output:


*** Content Analysis
Viagra 28
Cialis 11
to 10
with 7
Here 6
Order 5
the 5
Articles 4
Click 4
Viagra. 4
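
The output counts "Viagra" and "Viagra." as separate words because the tokens are split on whitespace only. One possible way to merge such variants, which is not part of the original code, is to normalize each token before counting. In this sketch normalizeToken and normalizeTokens are hypothetical helper names, and lowercasing is an extra assumption that would also merge tokens such as "Here" and "here".

```haskell
import Data.Char (isAlphaNum, toLower)
import Data.List (dropWhileEnd)

-- | Strip leading and trailing non-alphanumeric characters and lowercase
-- the rest (hypothetical helper, not from the original post).
normalizeToken :: String -> String
normalizeToken =
  map toLower . dropWhileEnd (not . isAlphaNum) . dropWhile (not . isAlphaNum)

-- | Normalize every token and drop the ones that end up empty,
-- for example tokens that consisted of punctuation only.
normalizeTokens :: [String] -> [String]
normalizeTokens = filter (not . null) . map normalizeToken
```

The normalized list could then be passed along in place of the raw tokens, for example as wordFreqSort (normalizeTokens tokens).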


Berlin Brown and Software Development
