Haskell Snippet: tokenize and clean a file
Given an input string, output a sanitized string.
Example Input:
A AAND ABLE ABOUT ABOVE
ACQUIRE ACROSS AFOREMENTIONED AFTER AGAIN
AGAINST AGO AGREE AGREEABLE AHEAD
AKIN ALBEIT ALCUNE ALGHOUGH ALGUNOS
ALL
Stop Word Database
The tool was used to create a STOP WORD database, a database of common words that are not important to any particular document, words like "the", "than", and "if".
http://openbotlist.googlecode.com/svn/trunk/botlistprojects/botspider/spider/var/lib/spiderdb/lexicon/stopwords/stopwords.tdb
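Once built, a stop word database like the one above would typically be consumed by filtering a token stream against it. A minimal sketch of that step, assuming the database contents are one lowercase word per line (the function name `removeStopWords` is hypothetical, not part of the snippet):

```haskell
import qualified Data.Set as Set
import Data.Char (toLower)

-- Drop any token that appears in the stop word database.
-- The first argument is the raw contents of the stop word file,
-- one word per line; the second is the token list to filter.
removeStopWords :: String -> [String] -> [String]
removeStopWords stopFile = filter (`Set.notMember` stopSet)
  where stopSet = Set.fromList (lines (map toLower stopFile))
```

For example, `removeStopWords "the\nif" ["the", "quick", "fox"]` yields `["quick", "fox"]`. Using `Data.Set` keeps each membership test logarithmic rather than scanning the word list linearly.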
Example output:
aand
able
about
above
accordance
according
acquire
across
aforementioned
after
again
import qualified Data.Set as Set
import Data.Char (toLower)
import Data.List (intersperse)
import Text.Regex (mkRegex, splitRegex)

-- Lowercase the content, split on whitespace, and keep only tokens
-- of a reasonable length (more than one character, fewer than 100).
wordTokens :: String -> [String]
wordTokens content = tokens
  where
    maxwordlen = 100
    lowercase str = map toLower str
    alltokens = splitRegex (mkRegex "\\s*[ \t\n]+\\s*") (lowercase content)
    tokens = filter (\x -> length x > 1 && length x < maxwordlen) alltokens

--
-- Given unclean content: lowercase, tokenize, filter by length, get the
-- unique tokens, then join the list back together with one token per line.
-- @see intersperse ',' "abcde" == "a,b,c,d,e"
tokenizeInput :: String -> IO String
tokenizeInput content = return $ concat . intersperse "\n" $ unify
  where
    tokens = wordTokens content
    unify = Set.toList . Set.fromList $ tokens
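As a sketch of how the two functions above might be wired into a standalone program, here is a dependency-free variant that swaps the `Text.Regex` splitter for the standard `words` function and sorts the output (the sorting and the `main` wrapper are additions, not part of the original snippet):

```haskell
import Data.Char (toLower)
import Data.List (intercalate, nub, sort)

-- Lowercase, split on whitespace, and keep tokens of a sane length.
wordTokens :: String -> [String]
wordTokens content = filter keep (words (map toLower content))
  where
    maxwordlen = 100
    keep x = length x > 1 && length x < maxwordlen

-- Deduplicate, sort, and join with one token per line.
tokenizeInput :: String -> String
tokenizeInput = intercalate "\n" . sort . nub . wordTokens

main :: IO ()
main = interact tokenizeInput
```

Run as a filter (`runghc Tokenize.hs < input.txt`), this reads the raw word list on stdin and prints the cleaned, deduplicated tokens one per line, matching the example output format above. `Set.fromList` in the original achieves the same deduplication as `nub` here, with better asymptotic cost on large inputs.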