Sunday, January 27, 2008

Haskell Snippet: tokenize and clean a file

Given an input string, output a sanitized string: the words lowercased, filtered by length, deduplicated, and listed one per line.
Example Input:

A AAND ABLE ABOUT ABOVE
ACQUIRE ACROSS AFOREMENTIONED AFTER AGAIN
AGAINST AGO AGREE AGREEABLE AHEAD
AKIN ALBEIT ALCUNE ALGHOUGH ALGUNOS
ALL


Example output:
aand
able
about
above
accordance
according
acquire
across
aforementioned
after
again


import qualified Data.Set as Set
import Data.Char (toLower)
import Data.List (intersperse)
import Text.Regex (mkRegex, splitRegex)

-- Lowercase the content, split on runs of whitespace, and keep only the
-- tokens that are between 2 and 99 characters long.
wordTokens :: String -> [String]
wordTokens content = tokens
    where maxwordlen = 100
          lowercase str = map toLower str
          -- Text.Regex is POSIX-based, so a character class stands in for
          -- the non-portable \s of the original pattern.
          alltokens = splitRegex (mkRegex "[ \t\r\n]+") (lowercase content)
          tokens = filter (\x -> length x > 1 && length x < maxwordlen) alltokens

--
-- Given unclean content: tolower, tokenize, filter by length, get the
-- unique tokens, and join the list back together with a token on each line.
-- @see intersperse ',' "abcde" == "a,b,c,d,e"
tokenizeInput :: String -> IO String
tokenizeInput content = return $ concat . intersperse "\n" $ unify
    where tokens = wordTokens content
          unify = Set.toList . Set.fromList $ tokens
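
To run the snippet end to end, a small driver like the following might be used. This is only a sketch; taking the file name from the command line is my assumption, since the post itself only shows the two functions above.

-- Hypothetical driver, assuming wordTokens and tokenizeInput live in the
-- same module. Reads the file named on the command line and prints one
-- cleaned token per line.
import System.Environment (getArgs)

main :: IO ()
main = do
    [path] <- getArgs        -- assumed: exactly one file argument
    content <- readFile path
    cleaned <- tokenizeInput content
    putStrLn cleaned

Compiled with ghc, something like ./tokenize input.txt would then print the sorted, deduplicated token list, since Set.toList returns its elements in ascending order.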


Stop Word Database

The tool was used to create a STOP WORD database, a database of common words that carry little meaning and are not significant to any particular document, words like "the", "than", "if".

http://openbotlist.googlecode.com/svn/trunk/botlistprojects/botspider/spider/var/lib/spiderdb/lexicon/stopwords/stopwords.tdb
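
On the consuming side, a stop word file like that could be loaded back into a Set and used to drop stop words from a token stream. A minimal sketch, assuming the file keeps the one-word-per-line format produced by tokenizeInput; the loadStopWords and filterStopWords names are hypothetical, not part of the spider code.

-- Hypothetical helpers: load a one-word-per-line stop word file into a
-- Set, then drop those words from a token list.
import qualified Data.Set as Set

loadStopWords :: FilePath -> IO (Set.Set String)
loadStopWords path = fmap (Set.fromList . lines) (readFile path)

filterStopWords :: Set.Set String -> [String] -> [String]
filterStopWords stops = filter (`Set.notMember` stops)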
