Haskell Snippet: tokenize and clean a file
Given an input string, output a sanitized string.
Example Input:
A AAND ABLE ABOUT ABOVE
ACQUIRE ACROSS AFOREMENTIONED AFTER AGAIN
AGAINST AGO AGREE AGREEABLE AHEAD
AKIN ALBEIT ALCUNE ALGHOUGH ALGUNOS
ALL
Stop Word Database
The tool was used to create a STOP WORD database, a database of common words that are not important to any particular document, words like "the", "than", and "if".
http://openbotlist.googlecode.com/svn/trunk/botlistprojects/botspider/spider/var/lib/spiderdb/lexicon/stopwords/stopwords.tdb
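Once built, a stop word database like the one above would typically be consumed by filtering a token stream against it. A minimal sketch of that step, assuming the database contents are one lowercase word per line (the function name `removeStopWords` is hypothetical, not part of the snippet):

```haskell
import qualified Data.Set as Set
import Data.Char (toLower)

-- Drop any token that appears in the stop word database.
-- The first argument is the raw contents of the stop word file,
-- one word per line; the second is the token list to filter.
removeStopWords :: String -> [String] -> [String]
removeStopWords stopFile = filter (`Set.notMember` stopSet)
  where stopSet = Set.fromList (lines (map toLower stopFile))
```

For example, `removeStopWords "the\nif" ["the", "quick", "fox"]` yields `["quick", "fox"]`. Using `Data.Set` keeps each membership test logarithmic rather than scanning the word list linearly.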
Example output:
aand
able
about
above
accordance
according
acquire
across
aforementioned
after
again
import qualified Data.Set as Set
import Data.Char (toLower)
import Data.List (intersperse)
import Text.Regex (mkRegex, splitRegex)

-- Lowercase the content, split on whitespace, and keep only tokens
-- of a reasonable length (more than one character, fewer than 100).
wordTokens :: String -> [String]
wordTokens content = tokens
  where
    maxwordlen = 100
    lowercase str = map toLower str
    alltokens = splitRegex (mkRegex "\\s*[ \t\n]+\\s*") (lowercase content)
    tokens = filter (\x -> length x > 1 && length x < maxwordlen) alltokens

--
-- Given unclean content: lowercase, tokenize, filter by length, get the
-- unique tokens, then join the list back together with one token per line.
-- @see intersperse ',' "abcde" == "a,b,c,d,e"
tokenizeInput :: String -> IO String
tokenizeInput content = return $ concat . intersperse "\n" $ unify
  where
    tokens = wordTokens content
    unify = Set.toList . Set.fromList $ tokens
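As a sketch of how the two functions above might be wired into a standalone program, here is a dependency-free variant that swaps the `Text.Regex` splitter for the standard `words` function and sorts the output (the sorting and the `main` wrapper are additions, not part of the original snippet):

```haskell
import Data.Char (toLower)
import Data.List (intercalate, nub, sort)

-- Lowercase, split on whitespace, and keep tokens of a sane length.
wordTokens :: String -> [String]
wordTokens content = filter keep (words (map toLower content))
  where
    maxwordlen = 100
    keep x = length x > 1 && length x < maxwordlen

-- Deduplicate, sort, and join with one token per line.
tokenizeInput :: String -> String
tokenizeInput = intercalate "\n" . sort . nub . wordTokens

main :: IO ()
main = interact tokenizeInput
```

Run as a filter (`runghc Tokenize.hs < input.txt`), this reads the raw word list on stdin and prints the cleaned, deduplicated tokens one per line, matching the example output format above. `Set.fromList` in the original achieves the same deduplication as `nub` here, with better asymptotic cost on large inputs.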