Monday, January 28, 2008

Haskell Snippet: read CSV file, marshal into type

This listing shows how to open a file, read its contents, split them on the delimiter (the regex is a little off), and then marshal the fields into a datatype.

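-- Text.Regex comes from the regex-compat package.
import Control.Exception (IOException, try)
import Text.Regex (mkRegex, splitRegex)

data PageURLFieldInfo = PageURLFieldInfo {
    linkUrlField :: String,
    aUrlField :: Integer,
    blockquoteUrlField :: Integer,
    divUrlField :: Integer,
    h1UrlField :: Integer,
    imgUrlField :: Integer,
    pUrlField :: Integer,
    strongUrlField :: Integer,
    tableUrlField :: Integer
  }

-- Assumed fallback value: the listing uses defaultPageFieldInfo without
-- defining it, so an empty URL and zero counts are a guess.
defaultPageFieldInfo :: PageURLFieldInfo
defaultPageFieldInfo = PageURLFieldInfo "" 0 0 0 0 0 0 0 0

-- The info content file contains HTML document information and the URL.
-- It should exist, but may not.
readInfoContentFile :: String -> IO PageURLFieldInfo
readInfoContentFile extr_file = do
    let extr_n = length ".extract"
        extr_path = take (length extr_file - extr_n) extr_file
        info_file = extr_path ++ ".info"
    -- Read the file, which is in CSV format:
    -- URL::|a::|b::|blockquote::|div::|h1::|h2::|i::|img::|p::|span::|strong::|table
    -- Control.Exception.try needs the exception type spelled out.
    csvtry <- try (readFile info_file) :: IO (Either IOException String)
    -- Handle errors: fall back to the default if the file cannot be read.
    info <- case csvtry of
        Left _ -> return defaultPageFieldInfo
        Right csv -> do
            -- [::|]+ is a character class (any run of ':' or '|'), so a ':'
            -- inside the URL also splits; this is where the regex is off.
            let csv_lst = splitRegex (mkRegex "\\s*[::|]+\\s*") csv
            -- Indices follow the column order documented above;
            -- b, h2, i and span are skipped.
            return PageURLFieldInfo {
                linkUrlField = csv_lst !! 0,
                aUrlField = read (csv_lst !! 1) :: Integer,
                blockquoteUrlField = read (csv_lst !! 3) :: Integer,
                divUrlField = read (csv_lst !! 4) :: Integer,
                h1UrlField = read (csv_lst !! 5) :: Integer,
                imgUrlField = read (csv_lst !! 8) :: Integer,
                pUrlField = read (csv_lst !! 9) :: Integer,
                strongUrlField = read (csv_lst !! 11) :: Integer,
                tableUrlField = read (csv_lst !! 12) :: Integer
              }
    return info
-- End of File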
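
Since the regex is admittedly a little off, here is a minimal sketch of splitting on the literal "::|" delimiter without a regex, so that a ':' inside the URL is left alone. It assumes the delimiter is exactly "::|" as in the format comment; splitOnDelim and trim are hypothetical helpers, not part of the listing or of any library.

```haskell
import Data.Char (isSpace)
import Data.List (isPrefixOf)

-- Hypothetical helper: split a string on every occurrence of a literal,
-- non-empty delimiter (an empty delimiter would loop forever).
splitOnDelim :: String -> String -> [String]
splitOnDelim delim = go
  where
    go s = case breakOn s of
      (field, Nothing)   -> [field]
      (field, Just rest) -> field : go rest
    -- Collect characters up to the next delimiter. The second component
    -- is the text after the delimiter, or Nothing at the end of input.
    breakOn s
      | delim `isPrefixOf` s = ("", Just (drop (length delim) s))
      | otherwise = case s of
          []       -> ("", Nothing)
          (c : cs) -> let (f, r) = breakOn cs in (c : f, r)

-- Strip surrounding whitespace, standing in for the \s* of the regex.
trim :: String -> String
trim = f . f
  where f = reverse . dropWhile isSpace
```

With these, map trim (splitOnDelim "::|" csv) could stand in for the splitRegex call, at the cost of handling only that one fixed delimiter.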


dru said...

Would it not be a little cooler if you could avoid re-specifying the structure you are reading into, and just have the CSV reader automatically put fields into their proper spot?

Berlin Brown said...

Yeah, absolutely. I was actually looking for some kind of marshalling call.

But, even in my example I was a little explicit about which columns I wanted.

augustss said...

You don't need all the type annotations when constructing the value; the types are quite clear anyway. You could also map 'read' over csv to avoid the repeated calls.
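
A rough sketch of what this suggestion might look like, assuming the row has already been split into the thirteen columns listed in the format comment. fromFields is a hypothetical name, and read still throws on a malformed number once the field is used.

```haskell
data PageURLFieldInfo = PageURLFieldInfo
  { linkUrlField       :: String
  , aUrlField          :: Integer
  , blockquoteUrlField :: Integer
  , divUrlField        :: Integer
  , h1UrlField         :: Integer
  , imgUrlField        :: Integer
  , pUrlField          :: Integer
  , strongUrlField     :: Integer
  , tableUrlField      :: Integer
  } deriving (Show, Eq)

-- Hypothetical helper: build the record from the split fields.
-- No ':: Integer' annotations are needed, because the constructor's
-- field types decide what 'read' has to produce.
fromFields :: [String] -> Maybe PageURLFieldInfo
fromFields (url : counts) =
  case map read counts of
    -- Column order: a, b, blockquote, div, h1, h2, i, img, p, span,
    -- strong, table. Unwanted columns are matched with '_' and, thanks
    -- to laziness, never parsed.
    [a, _, bq, d, h1, _, _, img, p, _, st, tb] ->
      Just (PageURLFieldInfo url a bq d h1 img p st tb)
    _ -> Nothing  -- wrong number of columns
fromFields [] = Nothing
```

The list pattern also replaces the repeated (!!) lookups, so a row with the wrong number of columns yields Nothing instead of an index error.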

Berlin Brown said...


Do you have specific code in mind for avoiding re-specifying the structure?

"You could also map 'read' over csv to avoid the repeated calls."

If I map, wouldn't I still have to specify which property function to call and which index to take from csv_lst?

I think I see where you are going though; I guess I could have a list of functions.

dru said...

augustss: is this due to the types specified in the 'data' statement?

berlin: I'm new to Haskell, but it would be interesting to see how this would be solved. Would you solve it via reflection? Would you solve it by adding extra info to the data structure (annotations)? Meta-programming?
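
One possible answer that needs none of those is a small applicative row parser, where the constructor's own field types drive the parsing. The sketch below is only an illustration under the thirteen-column format from the listing; Row, text, field, skip, pageRow and parsePage are all hypothetical names.

```haskell
import Text.Read (readMaybe)

data PageURLFieldInfo = PageURLFieldInfo
  { linkUrlField       :: String
  , aUrlField          :: Integer
  , blockquoteUrlField :: Integer
  , divUrlField        :: Integer
  , h1UrlField         :: Integer
  , imgUrlField        :: Integer
  , pUrlField          :: Integer
  , strongUrlField     :: Integer
  , tableUrlField      :: Integer
  } deriving (Show, Eq)

-- A tiny parser that consumes fields from the front of a row.
newtype Row a = Row { runRow :: [String] -> Maybe (a, [String]) }

instance Functor Row where
  fmap f (Row p) = Row $ \xs -> case p xs of
    Nothing        -> Nothing
    Just (a, rest) -> Just (f a, rest)

instance Applicative Row where
  pure a = Row $ \xs -> Just (a, xs)
  Row pf <*> Row pa = Row $ \xs -> case pf xs of
    Nothing        -> Nothing
    Just (f, rest) -> case pa rest of
      Nothing         -> Nothing
      Just (a, rest') -> Just (f a, rest')

-- Take the next field verbatim.
text :: Row String
text = Row $ \xs -> case xs of
  (x : rest) -> Just (x, rest)
  []         -> Nothing

-- Take the next field and parse it; the target type is inferred from
-- the constructor argument it feeds.
field :: Read a => Row a
field = Row $ \xs -> case xs of
  (x : rest) -> fmap (\v -> (v, rest)) (readMaybe x)
  []         -> Nothing

-- Drop the next field without parsing it.
skip :: Row ()
skip = () <$ text

pageRow :: Row PageURLFieldInfo
pageRow = PageURLFieldInfo
  <$> text          -- URL
  <*> field         -- a
  <*  skip          -- b
  <*> field         -- blockquote
  <*> field         -- div
  <*> field         -- h1
  <*  skip <* skip  -- h2, i
  <*> field         -- img
  <*> field         -- p
  <*  skip          -- span
  <*> field         -- strong
  <*> field         -- table

parsePage :: [String] -> Maybe PageURLFieldInfo
parsePage = fmap fst . runRow pageRow
```

The column order still has to be written down once in pageRow, but field names and indices are not repeated, and a malformed or missing field gives Nothing rather than a runtime error.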