Haskell Snippet; tokenize and clean a file

Given an input string, output a string that has been sanitized: lowercased, split into tokens, deduplicated, and written with one token per line.
Example input:

A AAND ABLE ABOUT ABOVE
ACQUIRE ACROSS AFOREMENTIONED AFTER AGAIN
AGAINST AGO AGREE AGREEABLE AHEAD
AKIN ALBEIT ALCUNE ALGHOUGH ALGUNOS
ALL


Example output:
aand
able
about
above
accordance
according
acquire
across
aforementioned
after
again


import Data.Char (toLower)
import Data.List (intersperse)
import qualified Data.Set as Set
-- splitRegex and mkRegex come from Text.Regex (regex-compat package)
import Text.Regex (mkRegex, splitRegex)

-- Lowercase the content, split it on whitespace, and keep tokens
-- that are longer than one character and shorter than maxwordlen.
wordTokens :: String -> [String]
wordTokens content = tokens
  where maxwordlen = 100
        lowercase str = map toLower str
        alltokens = splitRegex (mkRegex "\\s*[ \t\n]+\\s*") (lowercase content)
        tokens = filter (\x -> length x > 1 && length x < maxwordlen) alltokens

-- Given unclean content: lowercase it, tokenize, filter by length,
-- keep the unique tokens, and join them back together with one token per line.
-- @see intersperse ',' "abcde" == "a,b,c,d,e"
tokenizeInput :: String -> IO String
tokenizeInput content = return $ concat . intersperse "\n" $ unify
  where tokens = wordTokens content
        unify = Set.toList . Set.fromList $ tokens
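
If pulling in a regex package is not desirable, the same pipeline can be sketched with only base and containers. This is a variant, not the code used for the original tool: the names wordTokens' and cleanTokens are hypothetical, and the standard words function stands in for the regex split, so whitespace is whatever isSpace accepts.

```haskell
import Data.Char (toLower)
import Data.List (intercalate)
import qualified Data.Set as Set

-- Hypothetical regex-free variant of wordTokens.
-- `words` splits on any run of whitespace and never yields empty tokens.
wordTokens' :: String -> [String]
wordTokens' = filter keep . words . map toLower
  where
    maxWordLen = 100
    -- Same length bounds as the original filter.
    keep w = length w > 1 && length w < maxWordLen

-- Hypothetical pure counterpart of tokenizeInput: unique tokens in
-- ascending order, one per line, with no trailing newline.
cleanTokens :: String -> String
cleanTokens = intercalate "\n" . Set.toAscList . Set.fromList . wordTokens'
```

Keeping the function pure leaves the choice of IO (readFile, interact, and so on) to the caller, and the single-letter token "a" is still dropped by the length filter.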


Stop Word Database

The tool was used to create a stop word database: a database of words that carry little meaning for any particular document, such as "the", "than", and "if".

http://openbotlist.googlecode.com/svn/trunk/botlistprojects/botspider/spider/var/lib/spiderdb/lexicon/stopwords/stopwords.tdb
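
A stop word list is typically used to filter a document before indexing it. The following is a minimal sketch of that step, assuming the list is plain text with one lowercase word per line (the shape tokenizeInput produces); the actual layout of the .tdb file is not shown here, and loadStopWords and removeStopWords are hypothetical names.

```haskell
import Data.Char (toLower)
import qualified Data.Set as Set

-- Build a lookup set from stop word text, one word per line (assumed layout).
loadStopWords :: String -> Set.Set String
loadStopWords = Set.fromList . lines

-- Lowercase the document, split it on whitespace, and drop every
-- token that appears in the stop word set.
removeStopWords :: Set.Set String -> String -> [String]
removeStopWords stops = filter (`Set.notMember` stops) . words . map toLower
```

A Set gives logarithmic-time membership checks, which matters more as the stop word list grows.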
