Thursday, January 31, 2008

Web specification for botist front ends

I am working for a web specification for the botlist frontend.

The botlist backend and web frontend can operate independently. The backend sends data to the frontend and the web frontend is used to display that information.

http://www.botspiritcompany.com/botlist/

In terms of specification. The primary goal is to display URL information from the botlist repository.

For example.

Specification 1:


1. Display links in a listed format
2. Have a link to the host name
3. Allow for user submissions

That is the idea.

Why have a specification? Well, because I believe in programming language agnostic oriented development. The web front end could be implemented with the Spring framework, Lisp, or Django.

Actually, I plan on adding a Lisp/Django component in the future.

Wednesday, January 30, 2008

Haskell Snippet: looking at foldl (also a java version)

In the functional programming toolkit; foldl is pretty useful. In layman terms, it is basically a way to call a function on a given input list. From zvon.org; the haskell foldl is defined as:

"It takes the second argument and the first item of the list and applies the function to them, then feeds the function with this result and the second argument and so on. See scanl for intermediate results."

The type definition is:

Foldl: (a -> b -> a) -> a -> [b] -> a

Lets say that out loud; I can call the foldl function with the following parameters; the first parameter passed to foldl is a generic function that has two inputs a and b. This input function returns a type of a. The second parameter has a type of a, the first, initial item or the result of the previous operation. The third parameter is an array of type b.

The listing below shows how you might use foldl.


let a = foldl (\x y -> (++) x (show y)) "-->" [77, 88, 99]


To make it more complicated and unreadable, I defined the first parameter function input as a lambda function with inputs 'x' and 'y':


(\x y -> (++) x (show y))


Here are some other variations of calling foldl:

-- foldl :: (a -> b -> a) -> a -> [b] -> a
-- Fold given an input array [77,88,99] = b
-- b = Array of Integer type
-- a = String type
-- Output = -->778899
let a = foldl (\x y -> (++) x (show y)) "-->" [77, 88, 99]
-- Other functions, operators
-- (++) :: [a] -> [a] -> [a]
-- In our case; x ++ (show y) where y = the integer from [b]

-- Other variations:
let lst = [77, 88, 99]
-- Pass an empty list as 'a'
b = foldl (\x y -> x ++ show y) [] lst
-- What happens if want to find the max given an array of ints
c = foldl max 0 [1, 2, 3, 4,99, 5]
-- Look at it differently with a lambda call
d = foldl (\x y -> x `max` y) 0 [1, 2, 3, 4, 99, 5]
-- How would I take the sum of a list of values (w/o using sum)
e = foldl (\x y -> x + y) 0 [1, 2, 3]
-- Output as you would expect is e = 6


The key is to use the type definition as your guide:

Foldl: (a -> b -> a) -> a -> [b] -> a


foldl with inputs function (a -> b -> a) and input a and input array [b] returns a.

How would this example look in java?
At first, I was going to give a simpler java example that outputs the same result as in haskell, but instead I ended up with this more generic java function. Try this, visualize the result of running the first haskell example shown in the listing above:

Output=-->778899

Ok, define how you would solve that result in an imperative language (like Java) with the simplest possible implementation. You might end up with this (pseudo code):


public String foldl(String init_a, int b[]) {
StringBuffer buf = init_a + b[0]
for (int i = 0; i < b.length(); i++) {
buf.append(b[i]
}
return buf.toString();
}


Something like that, now here is a generic example in Java:

// Date: 1/29/2008
// Example mimic of foldl function in haskell in java.
// foldl :: (a -> b -> a) -> a -> [b] -> a
public class A {
public interface FuncCall {
public Object exec_AtoBtoA(Object o1, Object o2);
}
public static Object exFoldl(Object initA, Object b[],
FuncCall func) {
Object a_res = func.exec_AtoBtoA(initA, b[0]);
for (int i = 1; i < b.length; i++) {
a_res = func.exec_AtoBtoA(a_res, b[i]);
}
return a_res;
}
public static void main(String [] args) {
System.out.println("running");

// -- How would I take the sum of a list of values (w/o using sum)
// let a = foldl (\x y -> (++) x (show y)) "-->" [77, 88, 99]
Object b[] = { 77, 88, 99 };
Object initA = "-->";
Object finalRes = exFoldl(initA, b, new FuncCall () {
public Object exec_AtoBtoA(Object x, Object y) {
String lambdaRes = (x.toString() + y.toString());
return lambdaRes;
}
});
System.out.println("Output=" + finalRes);
Object b2[] = { 1, 2, 3 };
Integer initA2 = 0;
finalRes = exFoldl(initA2, b2, new FuncCall () {
public Object exec_AtoBtoA(Object x, Object y) {
Integer lambdaRes = Integer.parseInt(x.toString()) +
Integer.parseInt(y.toString());
return lambdaRes;
}
});
System.out.println("Output=" + finalRes);

/*
Output after running example:
Output=-->778899
Output=6
*/
}
}

That is all there is to it to understanding foldl and other prelude haskell functions. Read the definition from the haskell documentation to get an idea of the inputs and output. Then pass the correct inputs.
For example, some other useful functions are zip and unzip. Zip has the following type:
zip :: [a] -> [b] -> [(a, b)]
Create a list of tuples from two input lists.

Prelude> let a = [1, 2, 3]
Prelude> let b = ["a", "b", "c"]
Prelude> zip a b
[(1,"a"),(2,"b"),(3,"c")]
Prelude> :t
Prelude> :t zip
zip :: [a] -> [b] -> [(a, b)]

Prelude> unzip (zip a b)
([1,2,3],["a","b","c"])
Prelude>

botlist by berlin brown



Yes, that is my real signature; thanks inkscape.

Purpose of botlist

What is the purpose of botlist? What is its role?

http://botspiritcompany.com/botlist/

A lot of web applications provide you with information based on your input. Google and yahoo work in this manner. You pull information from them based on some query. Botlist hopes to operate based on a push architecture. Much in the way that other news aggregation services works. Information is piped through to the botlist web application and the top articles and links are displayed to the user. The other links that aren't as popular will still be available through search.

Pretty simple, yet a lot of work. The back end crawler is built with a combination of haskell and python. Haskell for the content analysis and python for the crawler.

Tuesday, January 29, 2008

Paul Graham's Arc has been released

It is now out there.

http://paulgraham.com/arc0.html

"We're releasing a version of Arc today, along with a site about it at arclanguage.org. This site will seem very familiar to users of Hacker News. It's mostly the same code, with a few colors and messages changed."

Monday, January 28, 2008

Haskell Snippet: read CSV file, marshall into type

This listing shows how to open a file, extract the contents, split by the delimiter (regex is a little off) and then mashall into a datatype.


data PageURLFieldInfo = PageURLFieldInfo {
linkUrlField :: String,
aUrlField :: Integer,
blockquoteUrlField :: Integer,
divUrlField :: Integer,
h1UrlField :: Integer,
imgUrlField :: Integer,
pUrlField :: Integer,
strongUrlField :: Integer,
tableUrlField :: Integer
}
--
-- The info content file contains html document information.
-- It may not exist but should, also contains URL info.
readInfoContentFile :: String -> IO PageURLFieldInfo
readInfoContentFile extr_file = do
let extr_n = (length ".extract")
extr_path = take ((length extr_file) - extr_n) extr_file
info_file = extr_path ++ ".info"
-- Extract the file, in CSV format.
-- URL::|a::|b::|blockquote::|div::|h1::|h2::|i::|img::|p::|span::|strong::|table
csvtry <- try $ readFile info_file
-- Handler error
info <- case csvtry of
Left _ -> return defaultPageFieldInfo
Right csv -> do let csv_lst = splitRegex (mkRegex "\\s*[::|]+\\s*") csv
return PageURLFieldInfo {
linkUrlField = csv_lst !! 0,
aUrlField = read (csv_lst !! 1) :: Integer,
blockquoteUrlField = read (csv_lst !! 2) :: Integer,
divUrlField = read (csv_lst !! 3) :: Integer,
h1UrlField = read (csv_lst !! 4) :: Integer,
imgUrlField = read (csv_lst !! 5) :: Integer,
pUrlField = read (csv_lst !! 6) :: Integer,
strongUrlField = read (csv_lst !! 7) :: Integer,
tableUrlField = read (csv_lst !! 8) :: Integer
}
return info
-- End of File

Sunday, January 27, 2008

Haskell Snippet; tokenize and clean a file

Given an input string, output with a string that that has been sanitized.
Example Input:

A AAND ABLE ABOUT ABOVE
ACQUIRE ACROSS AFOREMENTIONED AFTER AGAIN
AGAINST AGO AGREE AGREEABLE AHEAD
AKIN ALBEIT ALCUNE ALGHOUGH ALGUNOS
ALL


Example output:
aand
able
about
above
accordance
according
acquire
across
aforementioned
after
again


import qualified Data.Set as Set
import Data.List

wordTokens :: String -> [String]
wordTokens content = tokens
where maxwordlen = 100
lowercase str = map toLower str
alltokens = splitRegex (mkRegex "\\s*[ \t\n]+\\s*") (lowercase content)
tokens = filter (\x -> length x > 1 && length x < maxwordlen) alltokens
--
--
-- Given an unclean content set; tolower, filter by length, get unique tokens,
-- tokenize, join the list back together with a token on each line.
-- @see intersperse ',' "abcde" == "a,b,c,d,e"
tokenizeInput :: String -> IO String
tokenizeInput content = return $ concat . intersperse "\n" $ unify
where tokens = wordTokens content
unify = Set.toList . Set.fromList $ tokens


Stop Word Database

The tool was used to create a STOP WORD database, a database of words that are important to any particular document, words like "the", "than", "if".

http://openbotlist.googlecode.com/svn/trunk/botlistprojects/botspider/spider/var/lib/spiderdb/lexicon/stopwords/stopwords.tdb

"It doesnt matter about the drugs and rock and roll, sex " - Haskell bayes and other classifications

It doesnt matter about the drugs and rock and roll, sex

That was an example test case that I used against my bayes classification application. Bayes and other probability algorithms are used in most of the spam filtering software that is out in the net. I use various probability routines to classify the news articles. I begin with Toby's python code on the subject and ported is examples to haskell. So far, the results have been pretty satisfactory. The naives bayes classification is dismal, but maybe I am looking at the numbers wrong. In classic form, I will throw code at you and will give a lollipop to the mind reader that can figure out what is going on.

The output of the test case:

In my first test case, I extracted a couple of articles from the web with these subjects; politics, business, entertainment. A human can easily tell what is going on. A software program is going to have a lot of difficultly for several reasons, but two main ones. One, there are too many STOP words and too many ambiguous phrases in my sample training set. The feature word global could easily apply to entertainment, politics, or business. Which one does the computer decide upon? In any case, the first listing below shows the input data set.

let phrases =
[
"It doesnt matter about the drugs and rock and roll, sex",
"I agree citi, global market money business business driving force",
"ron paul likes constitution war international freedom america",
"viagra drugs levitra bigger",
"Movies are fun and too enjoy",
"war america not good"
]

And the output:

Running Tests
At: Sun Jan 27 05:08:37 EST 2008 t:1201428517
Test Train Bayes
793
["train/business.train","train/entertainment.train","train/politics.train"]
---- train/business.train ----
. [It doesnt matter about the drugs and rock and roll, sex ]
Bayes Probability=3.989729693042545e-7
Fisher Probability=0.836128249211062
. [I agree citi, global market money business business driving force ]
Bayes Probability=4.850155473781951e-5
Fisher Probability=0.9541001591541463
. [ron paul likes constitution war international freedom america ]
Bayes Probability=3.780639186633039e-4
Fisher Probability=0.6089235797669089
. [viagra drugs levitra bigger ]
Bayes Probability=2.419609079445145e-2
Fisher Probability=0.6980297367583733
. [Movies are fun and too enjoy ]
Bayes Probability=1.9649303466097898e-4
Fisher Probability=0.4827094766567699
. [war america not good ]
Bayes Probability=1.2098045397225725e-2
Fisher Probability=0.5440441299962482
---- train/entertainment.train ----
. [It doesnt matter about the drugs and rock and roll, sex ]
Bayes Probability=2.0531041666810224e-7
Fisher Probability=0.6270914273866588
. [I agree citi, global market money business business driving force ]
Bayes Probability=2.339809268600252e-5
Fisher Probability=0.4542155835210187
. [ron paul likes constitution war international freedom america ]
Bayes Probability=1.8718474148802017e-4
Fisher Probability=0.6089235797669089
. [viagra drugs levitra bigger ]
Bayes Probability=1.197982345523329e-2
Fisher Probability=0.6980297367583733
. [Movies are fun and too enjoy ]
Bayes Probability=1.0245769451548432e-4
Fisher Probability=0.8278707842798139
. [war america not good ]
Bayes Probability=5.989911727616645e-3
Fisher Probability=0.5440441299962482
---- train/politics.train ----
. [It doesnt matter about the drugs and rock and roll, sex ]
Bayes Probability=4.237119541681478e-7
Fisher Probability=0.4319928795655645
. [I agree citi, global market money business business driving force ]
Bayes Probability=5.141422998108449e-5
Fisher Probability=0.4542155835210187
. [ron paul likes constitution war international freedom america ]
Bayes Probability=4.1625450234461715e-4
Fisher Probability=0.8928739462341001
. [viagra drugs levitra bigger ]
Bayes Probability=2.632408575031526e-2
Fisher Probability=0.6980297367583733
. [Movies are fun and too enjoy ]
Bayes Probability=2.131121584070195e-4
Fisher Probability=0.46619446182733354
. [war america not good ]
Bayes Probability=1.3240857503152584e-2
Fisher Probability=0.7855657341350474
Done

Like I said before, the bayes probability numbers are all over the place. It could be an issue with that I didn't extract stop words from the training set. Or could be the training set is too small or both.

The interesting part is that the fisher probability algorithm figured out the correct classification in each instance.

It doesnt matter about the drugs and rock and roll, sex

I wanted to associate this term with entertainment. There was a 0.6270914273866588 that is entertainment and a 45% "I agree citi, global market money business business driving force" chance that this phrase is an entertainment document. Subsequently, it matched in all of the other instances as well. 100% success rate for the fisher probability.

The next listing shows the test case.

module Tests.Data.TestTrainBayes where

import Monad (liftM)
import System.Directory (getDirectoryContents)
import List (isPrefixOf, isSuffixOf)
import Data.SpiderNet.Bayes

trainDir = "train"

runTestTrainBayes :: IO ()
runTestTrainBayes = do
putStrLn "Test Train Bayes"
-- Process only files with 'train' extension
files <- getDirectoryContents trainDir
let trainfiles = filter (isSuffixOf ".train") files
trainpaths = map (\x -> trainDir ++ "/" ++ x) trainfiles
lst_content <- liftM (zip trainpaths) $ mapM readFile trainpaths
-- Print a count of the training set size
let info = buildTrainSet lst_content []
putStrLn $ show (length info)
putStrLn $ show (categories info)
let phrases =
[
"It doesnt matter about the drugs and rock and roll, sex",
"I agree citi, global market money business business driving force",
"ron paul likes constitution war international freedom america",
"viagra drugs levitra bigger",
"Movies are fun and too enjoy",
"war america not good"
]
-- process the following input phrases
mapM_ (\cat -> do
putStrLn $ "---- " ++ cat ++ " ----"
mapM_ (\phrase -> do
putStrLn $ " . [" ++ phrase ++ " ]"
putStrLn $ " Bayes Probability=" ++
(show ((bayesProb info (wordTokens phrase) cat 1.0)))
putStrLn $ " Fisher Probability=" ++
(show ((fisherProb info (wordTokens phrase) cat)))
) phrases
) (categories info)


Source

This module implements the classification source; step through the example to get an understanding of the code. Start with the test case and then move to the bayesProb and fisherProb functions.

http://openbotlist.googlecode.com/svn/trunk/botlistprojects/botspider/spider/lib/haskell/src/Data/SpiderNet/Bayes.hs


fisherProb :: [WordCatInfo] -> [String] -> String -> Double
fisherProb features tokens cat = invchi
where initp = 1.0
weight = 1.0
p = foldl (\prb f -> (prb * (weightedProb features f cat weight))) initp tokens
fscore = (negate 2) * (log p)
invchi = invChi2 fscore ((genericLength tokens) * 2)

And here is an example input document, for training the business class:

Look no further than the European Central Bank, which was notably absent when the Fed made its emergency rate cut amid falling global stocks on Tuesday. In testimony Wednesday before the European Parliament, ECB President Jean-Claude Trichet came about as close as a member of the brotherhood ever will to calling out a fellow central banker: "In demanding times of significant market correction and turbulences, it is the responsibility of the central bank to solidly anchor inflation expectations to avoid additional volatility in already highly volatile markets
Economics is the social science that studies the production, distribution, and consumption of goods and services. The term economics comes from the Greek for oikos (house) and nomos (custom or law), hence "rules of the house(hold)."[1]
Although discussions about production and distribution have a long history, economics in its modern sense as a separate discipline is conventionally dated from the publication of Adam Smith's The Wealth of Nations in 1776.[7] In this work Smith describes the subject in these practical and exacting terms:
Political economy, considered as a branch of the science of a statesman or legislator, proposes two distinct objects: first, to supply a plentiful revenue or product for the people, or, more properly, to enable them to provide such a revenue or subsistence for themselves; and secondly, to supply the state or commonwealth with a revenue sufficient for the public services. It proposes to enrich both the people and the sovereign.
Smith referred to the subject as 'political economy', but that term was gradually replaced in general usage by 'economics' after 1870.
In economics, a business (also called firm or enterprise) is a legally recognized organizational entity existing within an economically free country designed to provide goods and/or services to consumers. Businesses are predominate in capitalist economies, where most are privately owned and typically formed to earn profit to increase the wealth of their owners. The owners and operators of a business have as one of their main objectives the receipt or generation of a financial return in exchange for their work and their acceptance of risk. Notable exceptions to this rule include cooperative businesses and government institutions. This model of business functioning is contrasted with socialistic systems, which involve either government, public, or worker ownership of most sizable businesses.

Saturday, January 26, 2008

What is autonomiccomputing?

http://www.autonomiccomputing.org/

"What is autonomic computing? It is the ability of systems to be more self-managing. The term autonomic comes from the autonomic nervous system, which controls many organs and muscles in the human body. Usually, we are unaware of its workings because it functions in an involuntary, reflexive manner -- for example, we don't notice when our heart beats faster or our blood vessels change size in response to temperature, posture, food intake, stressful experiences and other changes to which we're exposed. And, by the way, our autonomic nervous system is always working."

Alan Ganek, VP Autonomic Computing, IBM

Haskell Snippet; cat data in files

It is always useful to manipulate files. This set of code (runTestTrainBayes) gets all the files in the relative directory path "train"; calls the anonymous function on each file and reads the content; appends the content into one big string. workTokens returns only tokens that are greater than 1 and less than 100. Eventually, you end up with a list of word tokens.

import System.Directory (getDirectoryContents)
import List (isPrefixOf, isSuffixOf)
import Data.SpiderNet.Bayes

trainDir = "train"

wordTokens :: String -> [String]
wordTokens content = tokens
where maxwordlen = 100
lowercase str = map toLower str
alltokens = splitRegex (mkRegex "\\s*[ \t\n]+\\s*") (lowercase content)
tokens = filter (\x -> length x > 1 && length x < maxwordlen) alltokens


runTestTrainBayes :: IO ()
runTestTrainBayes = do
putStrLn "Test Train Bayes"
files <- getDirectoryContents trainDir
let trainfiles = filter (isSuffixOf ".train") files
trainpaths = map (\x -> trainDir ++ "/" ++ x) trainfiles
d <- concatMap lines `fmap` mapM readFile trainpaths
let z = wordTokens (concat d)
putStrLn $ show z

Update: Simpler Example

runTestTrainBayes :: IO ()
runTestTrainBayes = do
putStrLn "Test Train Bayes"
files <- getDirectoryContents trainDir
let trainfiles = filter (isSuffixOf ".train") files
trainpaths = map (\x -> trainDir ++ "/" ++ x) trainfiles
d <- mapM readFile trainpaths
putStrLn $ show (length d)

Update: The snippet below shows how to associate the content of the file with the filename

files <- getDirectoryContents trainDir
let trainfiles = filter (isSuffixOf ".train") files
trainpaths = map (\x -> trainDir ++ "/" ++ x) trainfiles
d <- liftM (zip trainpaths) $ mapM readFile trainpaths

Friday, January 25, 2008

2007 Review of Books

"This year I only read 70 books, down from over 120 last year. I guess that's not too bad considering this was the Year of Other People -- I still beat my general goal of 52. Once again, here are the books that struck me as completely worth reading. This year, though, I've intermixed them with the other books I read:"

http://www.aaronsw.com/weblog/books2007

Aaron Swartz is well read; he said he read 70 books and 120 the previous year. I can't top that. But I do read a lot of technical books and like him only interested in government in politics. So there are only two categories. And normally technical is typically software/programming related.

Here is my list; I didn't completely read all of them, "DNA String algorithms" doesn't exactly make for sit down type of reading material.

Technical

Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology

Programming Erlang: Software for a Concurrent World by Joe Armstrong

The Haskell School of Expression: Learning Functional Programming through Multimedia by Paul Hudak

Haskell: The Craft of Functional Programming (2nd Edition) by Simon Thompson

Amazon Hacks: 100 Industrial-Strength Tips & Tools (Hacks) by Paul Bausch

Programming Perl (3rd Edition) by Larry Wall, Tom Christiansen, and Jon Orwant

Programming Ruby: The Pragmatic Programmers' Guide, Second Edition by Dave Thomas, Chad Fowler, and Andy Hunt

Text Mining Application Programming

Collective Intelligence

Structure and Interpretation of Computer Programs - 2nd Edition

Core J2EE Patterns: Best Practices and Design Strategies (2nd Edition) (Core Series) by Deepak Alur, Dan Malks, and John Crupi

Professional Java Development with the Spring Framework by Rod Johnson, Juergen Hoeller, Alef Arendsen, and Thomas Risberg (Paperback - Jul 8, 2005)

The Elements of Technical Writing (Elements of Series) by Gary Blake and Robert W. Bly (Paperback - Dec 19, 2000)

Scala Programming (released on PDF)

Freakonomics [Revised and Expanded]: A Rogue Economist Explores the Hidden Side of Everything by Steven D. Levitt and Stephen J. Dubner (Hardcover - Oct 17, 2006)

Non Fiction

The Constitution in Exile - Napolitano

Legacy of Ashes: The History of the CIA by Tim Weiner

Dead Certain: The Presidency of George W. Bush by Robert Draper

The World Is Flat: A Brief History of the Twenty-first Century by Thomas L. Friedman

Against All Enemies: Inside America's War on Terror--What Really Happened

Botlist project to introduce use of XUL/XULRunner

The project will start using XUL for database tools.

http://developer.mozilla.org/en/docs/XULRunner

Machine Learning in future releases of the Lucene project

This is major news. The Lucene developers released Lucene 2.3; in future releases except machine learning capabilities.

http://www.infoq.com/news/2008/01/lucene-23-mahout

"That will be a separate project, but may be beneficial to Lucene users. There are currently some patches in JIRA for Lucene that implement ML algorithms. The goal of this project is to provide commercial quality, large scale machine learning (ML) algorithms built on Hadoop under an Apache license. I have seen a fair amount of interest already, and hope to have this project underway in the coming month."

http://ml-site.grantingersoll.com/index.php?title=Incubator_proposal

Would be interesting if the team could introduce the following algorithms.

* 1.3.1 Naive Bayes
* 1.3.2 Neural Networks
* 1.3.3 Support Vector Machines
* 1.3.4 Logistic Regression
* 1.3.5 Locally Weighted Linear Regression
* 1.3.6 k-Means
* 1.3.7 Principal Components Analysis
* 1.3.8 Independent Component Analysis
* 1.3.9 Expectation Maximization
* 1.3.10 Gaussian Discriminative Analysis

Wednesday, January 23, 2008

Haskell is a very expressive language

I can use the Haskell programming language to better expressive the nature of the software task that I am attempting to accomplish.

Let me reword that so the previous sentence makes sense:

I use Haskell to accomplish complicated programming tasks; the expressive nature of the language provides an effective conduit to make that happen.

One more time:

Haskell is readable, fast, and expressive. I like it.

I like pizza

Pizza meals are effective. The pizza delivery person comes to your home. You pay the person. You eat the food. A pizza is circular in shape. Some of my favorite toppings include jalepenos, mushrooms, onions, beef or sausage; no more than two toppings at a time.

Tacos are also very appetizing.

Tuesday, January 22, 2008

New real world haskell book is coming out

http://www.realworldhaskell.org/blog/

I am a rare breed and love picking up technical books. Some say they are too behind the technology trends. That is true, but if you need a book on text parsing or artificial intelligence, then a 30 year old book is as good as a 2 year old one. The real world haskell book is one that I am looking forward to; I have already picked up two haskell books and this will only enhance the collection.

Simple comparison of iteration with Python and use of Map in Haskell

Here is a simple code snippet that shows the variation between how to approach problems in python and how to approach problems in haskell. More often than not, the same problem or task that you want to accomplish is the same but the methods used are going to be different. There is no question that python is an imperative language. You will tackle a problem by figuring how how to perform step 1 and then step 2, possibly N number of times. Haskell is a purely functional programming language but some like Simon Peyton Jones call Haskell "the best imperative language".


Perform the invchi2 calculation in python

example_data = [
(4.3, 6),
(2.2, 60),
(60, 2.2),
(0.3, 4),
(32.123, 20),
(12.4, 5),
(0.04, 3),
(0, 0),
(1, 1)
]
def invchi(chi, df):
m = chi / 2.0
sum = term = math.exp(-m)
for i in range(1, df//2):
term *= m / i
sum += term
return min(sum, 1.0)
if __name__ == '__main__':
print "Running InvChi Test"
for chi, df in example_data:
print "[ chi=%s df=%s ] res=%s" % (chi, df, invchi(chi, df))
print "Done"

Similar code in Haskell

exampleData :: [(Double, Double)]
exampleData = [
(4.3, 6),
(2.2, 60),
(60, 2.2),
(0.3, 4),
(32.123, 20),
(12.4, 5),
(0.04, 3),
(0, 0),
(1, 1)]

-- Inverted Chi2 formula
invChi :: Double -> Double -> Double
invChi chi df = minimum([snd newsum, 1.0])
where m = chi / 2.0
initsum = exp (-m)
trm = exp (-m)
maxrg = fromIntegral (floor (df / 2.0)) :: Double
-- Return a tuple with current sum and term, given these inputs
newsum = foldl (\(trm,sm) elm -> ((trm*(m/elm)), sm+trm))
(trm,initsum) [1..maxrg]

runBayesTest = do
putStrLn "Bayes Test: Inv Chi"
mapM_ (\x -> putStrLn $ (show x) ++ " res=" ++ (show (invChi (fst x) (snd x)))) exampleData
putStrLn "Done Bayes"

Monday, January 21, 2008

Haskell Snippet, start of bayes probability library

I am adding support for text analysis including bayes classification. The code is based on Toby Segaran's PCIL python source code and contains some of those utility functions.

Libraries used:

import System.Environment
import qualified Data.Map as Map
import qualified Data.Set as Set
import Data.List
import Text.Regex (splitRegex, mkRegex)

Type definitions:

type WordCat = (String, String)
type WordCatInfo = (WordCat, Int)
type WordInfo = (String, Int)

Utilities for finding the word frequency in a document:

--
-- | Find word frequency given an input list using "Data.Map" utilities.
-- With (Map.empty :: Map.Map String Int), set k = String and a = Int
-- Map.empty :: Map k a
-- foldl' is a strict version of foldl = foldl': (a -> b -> a) -> a -> [b] -> a
-- Also see: updmap nm key = Map.insertWith (+) key 1 nm
-- (Original code from John Goerzen's wordFreq)
wordFreq :: [String] -> [WordInfo]
wordFreq inlst = Map.toList $ foldl' updateMap (Map.empty :: Map.Map String Int) inlst
where updateMap freqmap word = case (Map.lookup word freqmap) of
Nothing -> (Map.insert word 1 freqmap)
Just x -> (Map.insert word $! x + 1) freqmap

--
-- | Word Category Frequency, modified version of wordFreq to
-- handle Word Category type.
wordCatFreq :: [WordCat] -> [WordCatInfo]
wordCatFreq inlst = Map.toList $ foldl'
updateMap (Map.empty :: Map.Map WordCat Int) inlst
where updateMap freqmap wordcat = case (Map.lookup wordcat freqmap) of
Nothing -> (Map.insert wordcat 1 freqmap)
Just x -> (Map.insert wordcat $! x + 1) freqmap

-- | Pretty print the word/count tuple and output a string.
formatWordFreq :: WordInfo -> String
formatWordFreq tupl = fst tupl ++ " " ++ (show $ snd tupl)

formatWordCat :: WordCatInfo -> String
formatWordCat tupl = frmtcat (fst tupl) ++ " " ++ (show $ snd tupl)
where frmtcat infotupl = (fst infotupl) ++ ", " ++ (snd infotupl)

Utilities for calculating the fisher probability:

wordFreqSort :: [String] -> [(String, Int)]
wordFreqSort inlst = sortBy freqSort . wordFreq $ inlst

--
-- | bayes classification train
trainClassify :: String -> String -> [WordCatInfo]
trainClassify content cat = let tokens = splitRegex (mkRegex "\\s*[ \t\n]+\\s*") content
wordcats = [(tok, cat) | tok <- tokens]
in wordCatFreq wordcats

--
-- | Return only the tokens in a category.
tokensCat :: [WordCatInfo] -> String -> [WordCatInfo]
tokensCat tokens cat = let getTokCat row = snd (fst row)
tokbycat = filter (\x -> ((getTokCat x) == cat)) tokens
in tokbycat

tokensByFeature :: [WordCatInfo] -> String -> String -> [WordCatInfo]
tokensByFeature tokens tok cat = filter (\x -> ((fst x) == (tok, cat))) tokens

--
-- | Count of number of features in a particular category
-- Extract the first tuple to get the WordCat type and then the
-- second tuple to get the category.
catCount :: [WordCatInfo] -> String -> Integer
catCount tokens cat = genericLength $ tokensCat tokens cat

-- Find the distinct categories
categories :: [WordCatInfo] -> [String]
categories tokens = let getTokCat row = snd (fst row)
allcats = Set.toList . Set.fromList $ [ getTokCat x | x <- tokens ]
in allcats

featureCount :: [WordCatInfo] -> String -> String -> Integer
featureCount tokens tok cat = genericLength $ tokensByFeature tokens tok cat

--
-- | Feature probality, count in this category over total in category
featureProb :: [WordCatInfo] -> String -> String -> Double
featureProb features tok cat = let fct = featureCount features tok cat
catct = catCount features cat
in (fromIntegral fct) / (fromIntegral catct)

--
-- | Calcuate the category probability
categoryProb :: [WordCatInfo] -> String -> String -> Double
categoryProb features tok cat = initfprob / freqsum
where initfprob = featureProb features tok cat
freqsum = sum [ (featureProb features tok x) | x <- categories features ]

weightedProb :: [WordCatInfo] -> String -> String -> Double -> Double
weightedProb features tok cat weight = ((weight*ap)+(totals*initprob))/(weight+totals)
where initprob = categoryProb features tok cat
ap = 0.5
totals = fromIntegral $ sum [ (featureCount features tok x) | x <- categories features ]

-- Inverted Chi2 formula
invChi2 :: Double -> Double -> Double
invChi2 chi df = minimum([snd newsum, 1.0])
where m = chi / 2.0
initsum = exp (-m)
trm = exp (-m)
maxrg = fromIntegral (floor (df / 2.0)) :: Double
-- Return a tuple with current sum and term, given these inputs
newsum = foldl (\(trm,sm) elm -> ((trm*(m/elm)), sm+trm))
(trm,initsum) [1..maxrg]

fisherProb :: [WordCatInfo] -> [String] -> String -> Double
fisherProb features tokens cat = invchi
where initw = 1.0
p = foldl (\prb f -> (prb * (weightedProb features f cat initw))) 1.0 tokens
fscore = (-2) * (log p)
invchi = invChi2 fscore ((genericLength features) * 2)


Some example test cases:

simpleTest1 :: IO ()
simpleTest1 = do
content <- readFile badfile
let tokens = splitRegex (mkRegex "\\s*[ \t\n]+\\s*") content
wordfreq = wordFreqSort tokens
mapM_ (\x -> (putStrLn $ formatWordFreq x)) wordfreq
putStrLn $ "Number of tokens found: " ++ (show . length $ wordfreq)

simpleTest2 :: IO ()
simpleTest2 = do
let badfreq = trainClassify "viagra is bad cialis is good" "bad"
goodfreq = trainClassify "I like to run with foxes they cool" "good"
allfreq = badfreq ++ goodfreq
mapM_ (\x -> (putStrLn $ formatWordCat x)) allfreq

simpleTest3 :: IO ()
simpleTest3 = do
let aa = [(("1", "aa") :: (String, String), -1), (("2", "aa"), -1), (("3", "bb"), -1)]
tokensAA = tokensCat aa "aa"
countAA = catCount aa "aa"
c = featureProb aa "1" "aa"
putStrLn $ "-->" ++ (show countAA) ++ " // " ++ (show tokensAA) ++ " // " ++ (show c)

simpleTest4 :: IO ()
simpleTest4 = do
let aa = [(("dogs dogs", "good") :: (String, String), 3),
(("viagra", "bad") :: (String, String), 5),
(("fox", "good") :: (String, String), 2),
(("dogs", "good"), 4),
(("3", "bad"), 5)]
bb = categories aa
tokensAA = tokensByFeature aa "dogs" "good"
c = featureProb aa "dogs" "good"
d = catCount aa "good"
x = categoryProb aa "xdogs" "good"
z = weightedProb aa "dogs" "good" 1.0
putStrLn $ "-->" ++ (show d) ++ "//" ++ (show bb) ++ "//" ++ (show z)

simpleTest5 :: IO ()
simpleTest5 = do
let aa = [(("dogs dogs", "good") :: (String, String), 3),
(("viagra", "bad") :: (String, String), 5),
(("fox", "good") :: (String, String), 2),
(("dogs", "good"), 4),
(("3", "bad"), 5)]
testdata = [ "xdog" ]
bb = fisherProb aa testdata "bad"
putStrLn $ "-->" ++ show bb

Why Rails Sucks, 100% magic and 0% design

Once again, a post not my own. This is from the comp.lang.lisp forum; a discussion on Rails. And I tend to agree:

A quote from MacieJ.

"First, I completely agree with what Slava said. Now to expand that a bit
with my own thoughts:

Rails is 100% magic with 0% design. It sports all the great quality and
consistency you've come to expect from PHP, except with loads more magic.
There's no overarching design or scheme of things, it's just a bucket of
tools with some glue poured in. This has a couple of important
consequences:

- There's no reasoning about Rails -- more familiarity won't give you
better chances of figuring out something new because that's what follows
from the design, only because that's how it usually ends up being
implemented and because you have memorised more things, so your guesses
are better. In essence, being better in Rails means being better at
grepping the internet.

- There's no thought given to the general problem to solve, it's just
improved PHP + "ActiveRecord, lol". This means Rails doesn't have
solutions that are particularly good or scalable or make sense, only
hacks that happened to solve someone's specific problem at a time and
were implementable in 5 minutes or less. Rails is heaps better than PHP,
but it's still only good for what PHP is good, and that's not writing
webapps. This permeates Rails all the way down: it's got hacky modules
that only solve particular problems, those modules have hacky functions
that only solve particular problems and those functions only do hacky
things that solved someone's particular problem.

Some examples:
* AR's find family of functions. It's a horrible hack, for instance,
they support the :group clause, which has semantics ("return a collection
of groups of things") incompatible with find's base semantics ("return a
collection of things"). Rails answer? It implicitly relies on MySQL's
retarded interpretation of SQL and the fact that given a table with two
columns, id and colour, it will silently interpret "SELECT * FROM table
GROUP BY colour" as "SELECT FIRST(id), colour FROM table GROUP BY
colour". End result? A valid combination of clauses in AR will generate
incorrect SQL Postgres will (correctly) choke on.

* AR's find again, it supports :join (which documentation hilariously
describes as "rarely needed"), except that it doesn't return real objects
then, but make-believe fake readonly objects that will only have some
attributes filled in (because they have no backing with physical rows),
but will *still* have the base type and all the methods of the class you
queried on! So if you go with that and try to call one of the methods
that depend on unfilled attributes, you die horribly.

- Reading and, in general, understanding Rails is horribly difficult,
since it's no design and layers upon layers of magic. Very often you will
find 5-7 layers of functions delegating the work deeper and deeper in,
until you arrive to a completely undocumented internal function that in
turn splits the work to three other, unrelated internal functions. Given
that each of those 10 functions takes a hash called "options", each layer
altering it subtly, but none having the complete picture, and that 9
times out of 10 that hash is not documented on any level, figuring out
what your choices are is pure hell. It's made even more fun by the fact
that different parts of Rails are accessible at various points of
handling the request, and you can't just jump in and poke things from the
console, since it won't have 99% of the things that only magically spring
to life once a request is live.

- As a rule, there's no quality assurance, and the average quality is
very low. Because it's such a patchwork, it will only have parts of
things implemented, some other (not necessarily overlapping) parts
documented, and a whole load of missing things just dangling and waiting
for you to run into them. For example, the docs for
text_field_with_auto_complete discuss how you can use options to make it
complete only parts of entries (which was exactly what I needed, to show
"foo (123)" in the popup, but only insert "foo" in the text field). What
it doesn't tell you, however, is that none of the stock completion-
generating methods will prepare such splittable data for you, and you're
supposed to copy their code and hack it on your own instead. It took me
half a day to finally understand that it wasn't me being stupid and
unable to find it, but it was actually Rails documenting interfaces that
_weren't there_.

- And to stress the first point again, Rails never concerns itself with
the big-picture problem of "writing webapps". It only thinks as big as
"outputting HTML strings" and "querying the DB for a list of things".
This means the important, actually hard stuff like handling the stateless
nature of HTTP, or sanitising and escaping the user input is just not
adressed at all, and you only learn about them when one day you discover
84 possible XSS injection points (actual number from a Rails app I'm
somewhat famililar with).

I'm a huge fan of innovative frameworks like Weblocks, because they
actually stopped and thought about the whole thing for a moment, and try
things that will address the problem as a whole. And even if some
specific things will turn out to be harder to do that by just churning
out raw HTML, and there will be abstraction leaks, it's still better
because they try out new things to find a solution that has a chance
work. Rails doesn't, because its entire culture seems to be fundamentally
incapable of thinking more than 5 minutes into the future.

Cheers,
Maciej"

Sunday, January 20, 2008

One word programming language summary revisted.

In my previous blog post, I tried to describe some mainstream programming languages in a couple of words. After some deep reflection, I am revising what I said earlier. This time, I want to associate the language with tasks that the language is known for.

1. Haskell: General Purpose Language
2. Python: Twisted network framework, scripting, easy to read, easy to use
3. Ruby: Ruby on Rails, dynamic, easy to use
4. Scala: Improvement over the java language, work with existing java libraries
5. Java: Lots of libraries, lots of community, commercial involvement
5. Factor: New and emerging, fast, smart people working on the project.
6. C: Desktop development, fast, low-level development.
7. Erlang: Parallel development, server development

Haskell, I think I love you

Just thought I would share. Haskell is just plain awesome.

Saturday, January 19, 2008

Simple Word Frequency Functions in Haskell

As part of the botlist text content mining backend I am working on; I frequently process word tokens in a web page document. The code below contains utility functions for creating a data structure that consists of a list of tuples. The tuple contains a word and the number of occurrences in the document. I modified John Goerzen's original wordFreq version from http://changelog.complete.org/plugin/tag/haskell.

import System.Environment
import qualified Data.Map as Map
import Data.List
import Text.Regex (splitRegex, mkRegex)
--
-- | Find word frequency given an input list using "Data.Map" utilities.
-- With (Map.empty :: Map.Map String Int), set k = String and a = Int
-- Map.empty :: Map k a
-- foldl' is a strict version of foldl = foldl': (a -> b -> a) -> a -> [b] -> a
-- (Original code from John Goerzen's wordFreq)
wordFreq :: [String] -> [(String, Int)]
wordFreq inlst = Map.toList $ foldl' updateMap (Map.empty :: Map.Map String Int) inlst
where updateMap freqmap word = case (Map.lookup word freqmap) of
Nothing -> (Map.insert word 1 freqmap)
Just x -> (Map.insert word $! x + 1) freqmap

-- | Pretty print the word/count tuple and output a string.
formatWordFreq :: (String, Int) -> String
formatWordFreq tupl = fst tupl ++ " " ++ (show $ snd tupl)

-- Given an input list of word tokens, find the word frequency and sort the values.
-- sortBy :: (a -> a -> Ordering) -> [a] -> [a]
wordFreqSort :: [String] -> [(String, Int)]
wordFreqSort inlst = sortBy freqSort . wordFreq $ inlst


Usage:

let tokens = splitRegex (mkRegex "\\s*[ \t\n]+\\s*") content
wordfreq = wordFreqSort tokens
mapM_ (\x -> (putStrLn $ formatWordFreq x)) wordfreq


The utility code was used against a spam document and output the following:

*** Content Analysis
Viagra 28
Cialis 11
to 10
with 7
Here 6
Order 5
the 5
Articles 4
Click 4
Viagra. 4

Friday, January 18, 2008

Haskell Function calls and Parentheses

[1.] Function Calls in Haskell

As you may have noticed, passing input to a function in haskell is simple because of the statically type nature of the language. A function can have a complicated input and output return, but at least you are given the parameters of that particular function. In python you can pass anything to a function; you can pass a function, a integer constant, an instance of a class and yet the function really only expects an boolean.

[2.] Examples

putStrLn :: String -> IO ()

For example, putStrLn expects a string as an input and will return unit in the IO Monad.

>putStrLn "abc"

Will print abc to the console.

Prelude> putStrLn "abc" ++ "123"

:1:0:
Couldn't match expected type `[a]' against inferred type `IO ()'
In the first argument of `(++)', namely `putStrLn "abc"'
In the expression: putStrLn "abc" ++ "123"
In the definition of `it': it = putStrLn "abc" ++ "123"
Prelude>

The following example gave an error. Why? We gave putStrLn three arguments as opposed to the one. Here are the extra arguments, "++" and "123".

To fix the issue:

putStrLn ("abc" ++ "23")

Or even:

(putStrLn ("abc" ++ "23"))

[3.] Dollar Sign, syntatic sugar

We can also use the syntatic sugar, the dollar sign function to alleviate us from having to wrap the arguments in parentheses. Here is the type:

Prelude] :type ($)
($) :: (a -> b) -> a -> b


Pass ($) The infix function has a first input parameter (a -> b), a function and then the 'a' argument which will output b. The following code demonstrates three different calls to putStrLn, but the final output is the same.

Prelude] putStrLn "abc"
abc
Prelude] putStrLn $ "abc"
abc
Prelude] (putStrLn "abc")
abc
Prelude]

Here is another working example:

putStrLn $ "abc: " ++ show(3) ++ show(4)

If we return to the ($) definition and putStrLn.
($) :: (a -> b) -> a -> b

putStrLn represents this part of the input (a -> b)
'a' represents the String input and b represents the output of putStrLn String which is the IO () monad.

For this aspect of the call, "abc: " ++ show(3) ++ show(4);

The (++) operator has the following type definition:

Prelude> :t (++)
(++) :: [a] -> [a] -> [a]

Where [a] can be represented by [Char].

Here are another set of working examples.

Prelude> putStrLn $ "abc: " ++ (show(3) ++ show(4))
abc: 34
Prelude> putStrLn $ ("abc: " ++ (show(3) ++ show(4)))
abc: 34
Prelude> (putStrLn $ ("abc: " ++ (show(3) ++ show(4))))
abc: 34
Prelude> (putStrLn $ ("abc: " ++ (show(3) ++ show(4))))
abc: 34
Prelude> (putStrLn ("abc: " ++ (show(3) ++ show(4))))
abc: 34


Those examples are a little confusing because the (++) operator calls are resulting to one "string" input. For example:

let z = "abc: " ++ show(3) ++ show(4)

Is of type String.

[4.] Function composition

The infix dot operator shown below takes one function, another function and an input and returns the output c.

Prelude] :t (.)
(.) :: (b -> c) -> (a -> b) -> a -> c

If we used the putStrLn function, and show function; what combinations could we come up with?

Note, show is defined by the following signature.

Prelude] :t show
show :: (Show a) => a -> String


Prelude] putStrLn . show $ 5
5
Prelude]

I will use the following notation to denote the operation invoked above:

### Function: **putStrLn** ||| Type: ( String -> IO () ) ###
infix dot-operator and then:
### Function: **show** ||| Type: ( a -> String ) ###
infix dollar sign operator and then the value of 5.

Prelude] putStrLn . show $ 5
5
Prelude]

The following calls also worked

Prelude> putStrLn $ show 5
5
Prelude> putStrLn (show $ 5)
5
Prelude> putStrLn $ (show $ 5)
5
Prelude> (putStrLn $ (show $ 5))
5

If you are given a list and need to print each element on its own line, you can use the following snippet.

mapM_ (\x -> (putStrLn ("Token: " ++ x ++ " Len: " ++ (show (length x))))) ["ab", "1231", "ddd"]

Do two things to make function composition look easier, think backwards and consider that putStrLn, show and ($) are operations with only one input and one output. As you can see, with haskell are many different ways to do something as simple as invoke a function.

edited: Apparently, I got so carried away, I didn't acknowledge that you can just call the function without any operators. E.g; [ putStrLn $ show 5 ] Sorry if that simple fact wasn't made clear.

Thursday, January 17, 2008

One or so word summary of the programming languages I work with

Here is a a one or two word summary of the languages I work with. I try to spend time with most of them. They are not listed in any particular order.

1. Haskell: Functional, cool [most projects]
2. Python: Dynamic, semi-fast, indented blocks [most projects]
3. Ruby: Dynamic, semi-slow [some projects]
4. Scala: Functional, better java, actors [some projects]
5. Java: Static, lots of libraries, ugly, verbose [most projects]
5. Factor: Cool, new, fast [a couple of projects]
6. C: Cool, sloppy [a couple of projects]
7. Erlang: Parallel [very few projects]

(Less used):
7. C++: scary (I don't use C++)
8. Common Lisp (sbcl): fast, simple, dynamic
9. Scheme: simple, dynamic

(Useful):
10. Bash: ok I guess

(Want to learn):
11. Perl6
12. Relearn Lisp

First look at web/page content analysis

What defines a web page; if you had an input/output device. What input would define a page and what would the output be? I am currently searching for different tools and algorithms to analyze pages that are available on the internet. Presented here in this chart are various html tags (e.g. table, strong, bold) and the number of times the tag appeared in the document. 5 page documents were analyzed; 3 good article pages from wikipedia and 2 spam pages.



The link below contains the data that generated the chart above.

http://openbotlist.googlecode.com/svn/trunk/openbotlist/docs/media/content/chart_content.txt

Wednesday, January 16, 2008

Python Idiom; remove stop words

I always forget this simple idiom for removing a string based on a set of words to remove. This particular example, given an array of stop words, remove them.


BOTLIST_STOP_WORDS = [ "abc", "123" ]
keywords = "kjskldjfkljskdfksdkl abc lkjsdfksdklfd 123123"

def filterStopwords(self, keywords):
""" Find all of the stopwords and remove them"""
res = []
keywords_list = keywords.lower().split()
res = list(set(keywords_list).difference(set(BOTLIST_STOP_WORDS)))

# Return the new keyword string
return " ".join(res)


In that example, the stop words will be removed from the string.

Tuesday, January 15, 2008

What am I listening to? Bach, Brandenburg concertos; 4, 5, 6

Bach, Brandenburg concertos; 4, 5, 6:

This particular rendition features a pipe, string ensemble that is a mix of slow and spirited; but never dry or boring. This particular set seems to be more oriented towards strings than 1, 2, 3. I could be wrong about that though.

So far, Bach has impressed me the most.

http://en.wikipedia.org/wiki/Brandenburg_concertos

Joining theinfo.org; large datasets

I am joining and now following the community at theinfo.org. It is exactly what I have been looking for in terms for large data set crawling.

http://theinfo.org/

Sunday, January 13, 2008

Interesting Person of the day: Hunter Thompson

This person seems pretty fascinating, Hunter Thompson. My only knowledge of him is that he created Fear and Loathing.

http://en.wikipedia.org/wiki/Hunter_S._Thompson

Hunter Stockton Thompson (July 18, 1937 – February 20, 2005) was an American journalist and author, famous for his novel Fear and Loathing in Las Vegas. He is credited as the creator of Gonzo journalism, a style of reporting where reporters involve themselves in the action to such a degree that they become the central figures of their stories.

[ANN] Initial release of Spider Queue Database

Announce: This database will be used as the simple binary file format for queuing requests. Ideally, rabbitmq would be be used but the client has not been implemented. (Basically, this is just a simple file format; if you want to see the haskell source, see the link below. Covers Data.Binary use, ByteStrings, decodeFile, encodeFile and some other basic binary file manipulations)

instance Binary SpiderQueue where
put dbq = do
BinaryPut.putWord16be (magicNumberA dbq)
BinaryPut.putWord16be (magicNumberB dbq)
BinaryPut.putWord16be (majorVers dbq)
BinaryPut.putWord16be (minorVers dbq)
BinaryPut.putWord32be (queueSize dbq)
-- @see mapM: Monad m => (a -> m b) -> [a] -> m [b]
(mapM_ put (queue dbq))
get = do
magicnumbera <- BinaryGet.getWord16be
magicnumberb <- BinaryGet.getWord16be
major <- BinaryGet.getWord16be
minor <- BinaryGet.getWord16be
len <- BinaryGet.getWord32be
-- *******************************
-- Get the remaining byte string data,
-- So that we can use lazy bytestring to load to load the
-- the data types.
-- Also: queueData <- forM [1..len] (const (get :: Get QueueObject))
-- *******************************
queueData <- replicateM (fromIntegral len) (get :: Get QueueObject)
return (SpiderQueue {magicNumberA=magicnumbera,
magicNumberB=magicnumberb,
majorVers=major,
minorVers=minor,
queueSize=len,
queue=queueData
})


References

http://openbotlist.googlecode.com/svn/trunk/botlistprojects/botspider/spider/lib/haskell/src/Data/SpiderQueue/Queue.hs

From Tim Sweeney - Re: Hundred Year Language

A quote from: Tim Sweeney - Re: Hundred Year Language
http://lambda-the-ultimate.org/classic/message6475.html#6501

All the content in this post is a quote about various languages out there.

"C# is a great illustration of the superficiality of popular language progress. It is the most polished and fine-tuned mainstream language ever created, yet fails to come to grips with basic mathematics type and theory. You still have absurdities like 2^64=0, 7/2=3, simple expressions like x/y and a[x] which the compiler accepts yet aren't actually typesafe in any reasonable definition of the term, arrays have serious typing flaws, and the basic confusion between things and references to things remains.

Right now, the open language lineages are:

- LISP: This is the apex of untyped programming with nested scoping and metadata manipulation. It's at the end of the line, not because of lack of potential, but it's so simple and powerful than any improvements to it can be implemented in LISP itself without affecting the core language. In 100 years, after the last C and C++ programmers are long gone, there will still be LISP enthusiasts. But don't expect large-scale software development to happen this way.

- C, C++, Java, C#: The problem nowadays isn't with implementation but with the underlying flaws in the paradigm. We can continue to go the route of adding tweaks as with C#, but they're asymptotically approaching a limit that's not far from the current state of affairs.

- Python, Perl, etc: Languages with significant improvements in usability for certain audiences, and are generally a mix of untyped LISP features with typed C/Java-style features. The primary innovations I see here are in the syntax and libraries, for example Python's pioneering of "pseudocode that runs" and Perl's first-class regular expressions and text parsing capabilities. Future languages will certainly inheret these features, even as the underlying paradigm is changed.

- ML, Haskell, simply-typed functional languages with type deduction and, in general, programs consisting of a whole bunch of top-level declarations. The elegance of these languages is based on Hindley-Milner and with most of the neded extensions of the language (subtyping, dependent types), that breaks, and the resulting declarations and error messages become quite a mess. These languages will certainly remain useful to their current fans, but I see no hope for this style of programming to scale up to mainstream development.

So which lineage will be the basis of languages 100 years from now? I think the syntax has long since been determined, and will fall very much in the middle ground between Haskell, Pascal, and C++ -- and not very much outside that triangle.

Regarding the type system and semantics, I think the basic principles in such mathematical-style code as "f(x:int):int -> if x<=0 then 1 else x*f(x-1)" will not change one iota in 100 years, but that the current notions of objects, references, member functions, exceptions, however-many-bit integers, integer being "the" type of 7, and so on will seem terribly archaic. In this area, I don't think any current lineages will stand.

However, I do think we'll converge on a type system that is not far outside of the past 20 years' research into dependent types, intersection and union types, unification, constraint logic, and proofs-as-programs. These type systems essentially have the expressive power of mathematical logic itself, which is fundamental and not a product of invention or whim.

Alex Peake - Re: Hundred Year Language blueArrow
4/11/2003; 6:44:05 PM (reads: 2820, responses: 0)
IMHO (what counts is Productivity)...
"...most important change...around well-defined type systems..."
Type systems are about bug prevention and early detection, and bugs of the "type mismatch" variety only. Has anyone really studied the cost of early detection vs. productivity loss from not using dynamically typed languages?

"LISP...don't expect large-scale software development to happen this way..."
Why? Why are many fairly inadequate languages more successful than good, powerful languages? It's about 1) Money -- money begets money and Microsoft has a vested interest in maintaining their status-quo of followers, as have the Sun/IBM/Oracle gang as a way of competing with Microsoft. They are the ones with marketing might. They are the ones who communicate the loudest. They are the ones that most "choosers of programming languages" hear. And they are "safe", and
2) the great majority of "choosers of programming languages" are either incapable or uninterested in actually evaluating programming languages.

"...ML, Haskell, (Erlang,) simply-typed functional languages..."
I think the real elegance, and productivity, comes from the declarative style and the pattern matching.

BTW, Smalltalk was conspicuously absent - a wonderfully productive dynamically typed, reflective,... development environment!

Nay, I say, the future is LISP (well Scheme maybe) and Smalltalk (if only someone could afford to Market either). "

Friday, January 11, 2008

Redefining functional composition in haskell

For those us that haven't done calculus or differential equations in a while, visualizing functional composition in functional languages like haskell may be difficult. So here are some practical examples for really visualizing composition. The wikipedia definition for function composition in a computer science context has the following: "function composition is an act or mechanism to combine simple functions to build more complicated ones."

In math 101, we learned the following:

given the value of 'x', pass 'x' to the function g and then the result of g of x to the function f to get a result.

f(g(x))

Let's say we are doing the following calculation:

y = g(x) = 2 * x
z = f(y) = 3 * y
z is our result.

If we provide an input of 4, we have the following.

y = 2 * 4 = 8
z = 3 * 2 * 4 = 24
z = 3 * g(x where x = 4) = 24

In Haskell

How would you do this in haskell (in prelude):

Open ghci:

First, here is the function definition for function composition:
(.) :: (b -> c) -> (a -> b) -> a -> c

Prelude> let ggg x = 2 * x
Prelude> let fff y = 3 * y
Prelude> (fff . ggg) 4
24 (our result)

Or ((fff . ggg) 4)

Now, let's define this this in a new haskell source file:

module Main where

-- x -> y
ggg :: Integer -> Integer
ggg x = 2 * x

-- y -> z
fff :: Integer -> Integer
fff y = 3 * y

-- x -> z
compose :: Integer -> Integer
compose x = (fff . ggg) x

main = do
putStrLn $ "Z Result should be 24=" ++ show (compose 4)

Thursday, January 10, 2008

More haskell simple networking client code and connecting to RabbitMQ

Well, I have made a little bit more progress with the simple haskell network client and connected to a RabbitMQ server and received some of a valid response. What makes this example a little bit different than a simple echo client/server connection is that a AMQP client needs to send valid byte responses in big endian format. For example, the first 1 octet that must be sent to the server include this header BYTE:'A', BYTE:'M', BYTE:'Q', BYTE:'P', BYTE:1, BYTE:9, BYTE:1, BYTE:1. Bytes are easy, but sending short values and also sending unicode responses requires a little bit of work. The java client library provided by RabbitMQ as you might have guess makes sending unicode or byte packages straightforward. In haskell, well there just isn't a lot of code out there that demonstrates how to do this. With that being said, here is some update to the simple client. At first I wanted to just demonstrate networking in haskell, but now I may go a step forward in working on a usable RabbitMQ client (time permitting of course).

Key code snippet:
The following uses the Data.Binary.Get monad to receive frame data back from the rabbit mq server. Get a frame type, channel, size of the payload and then the payload byte data.

instance Binary AMQPFrame where
get = do
frameType <- getWord8
chan <- getWord16be
sz <- getWord32be
bytes <- BinaryGet.getLazyByteString (fromIntegral sz)
chw <- getWord8
return (AMQPFrame { frameType=frameType,
channel=chan,
size=sz,
payload=bytes,
ch=chw
})


Download
http://haskellnotebook.googlecode.com/svn/trunk/haskell/simplenetwork

You can see the python example, just imagine porting the python library to haskell. Syntactically, with haskell we can use some of the code verbatim. For example, look at the METHOD_MAP_NAME in the python code and compare that with the pattern matching in the haskell source.

http://barryp.org/software/py-amqplib/

Wednesday, January 9, 2008

More on the AMQP (RabbitMQ) haskell client (and an example)

I am building a AMQP client library to use along with some of the openbotlistprojects. The client library will need to be written in haskell and will be able to communicate with MQ servers like RabbitMQ. That is irrelevant because I am just wrapping my head around haskell networking.

The "Data.Binary" module is used here to have fully control over the size and endianess of all data sent across the network. There are several utilities for writing shorts, byte words, longs, 64 bit longs, etc.

Here is a data structure defined with the Binary.Put monad. Our task is write this data to the socket connection.


data AMQPData = AMQPData {
amqpHeaderA :: [Word8],
amqpHeaderB :: Word32
}

instance Binary AMQPData where
put amq = do
BinaryPut.putByteString (pack (amqpHeaderA amq))
BinaryPut.putWord32be (amqpHeaderB amq)

amqInstance :: IO AMQPData
amqInstance = return (AMQPData { amqpHeaderA = (unpack (C.pack amqpHeadA)),
amqpHeaderB = amqpHeadB
})


amq <- amqInstance
let bs_amq = encode amq
-- Through the use of lazy hPut, write out the data to the socket handle
LazyC.hPut h bs_amq


Download
You can download the haskell source for this entry at the following URL including a simple java server to mimic a AMQP node.

http://haskellnotebook.googlecode.com/svn/trunk/haskell/simplenetwork

Resources

(1) http://www.haskell.org/ghc/docs/6.8.2/html/libraries/network/Network.html
(2) http://hackage.haskell.org/packages/archive/binary/0.4.1/doc/html/Data-Binary.html
(3) http://hackage.haskell.org/packages/archive/bytestring/0.9.0.1/doc/html/Data-ByteString.html

Tuesday, January 8, 2008

Understanding monads in haskell from a reddit user (808140)

First, thanks to 808140, he did a good job of explaining monads. This is not my content at all, but his.

A quote from reddit, 808140:

"The first step to understanding "monads" is realizing that the name is scary but the concept is simple. See, the term itself comes from math and was so-named because the people that found a use for it in programming were math people. Despite what some people seem to think, though, you don't need to be a math person to understand them.

The best way to grok monads is to not try to understand what they are abstractly first. Instead, familiarize yourself with several common monads and their uses. This will allow you to better see what is being abstracted.

A lot of people who are new to monads try to understand the abstraction before understanding the specific cases. Some people (usually people with math backgrounds) can do this, but for most people, the best way to understand something abstract and general is to start with specific examples.

You can learn what a monad is intuitively very quickly this way, because there are a bunch of things that are all monads but that all do very different things.

The most basic monad is the exception monad (in Haskell we call it Maybe if there's only one exception, and Either if there are many). This is not voodoo. It is no different than throwing an exception in a language you're familiar with, there's nothing new or strange about it. Basically, the code:

f x >>= g >>= h

evaluates f x where f is some function that can throw an exception, checks to see if an exception was thrown, and if it none was, it passes the output of f x into the function g, which can also throw an exception. Again, if g did not throw an exception, it passes the value into h, which itself can throw an exception. Here's an example of what this might be doing. Let's say f, g, and h are all mathematical functions, all of which at some point divide by their argument. Obviously, if the argument is zero, there will be an error, so each checks first if the argument is zero and if it is, each throws an exception, and if not, does the calculation and returns the value. Essentially, this code is the same as what one might write as h(g(f(x))) in another language, but the use of the >>= just makes it clear that we are checking for exceptions.

The reason we go to all this hassle, instead of just having exception throwing built into the language the way you might in Java or most any other imperative language, is because in functional languages functions cannot have side effects. That is, they are completely determined by what they return, and cannot do anything else. They cannot throw exceptions, or print to the screen, or set a global variable, or whatever. This allows something we call referential transparency: if you see f(5) in purely functional code, you should be able to evaluate it once, and take the value it returns and replace every other instance of f(5) in the whole program and have the program run exactly the same way. This is obviously not possible if f, say, prints to the screen (it would only print once, instead of 20 times or whatever). It is also not possible if f throws an exception (what value did it evaluate to? What would you replace other instances of f(5) with?)

So how do exceptions work, then? The same way they work in a language like C, which lacks the ability to throw exceptions as a built in part of the language: they encode the exception as part of the return value. So in our example, f returns just some number or an error value signaling that an exception was thrown. We call this error value Nothing.

So we create a datatype, which we call Maybe, that extends any type (in our example, Integer) so that it includes the value Nothing. We say that a variable of type Maybe Integer can either be Nothing, signifying that an exception was thrown or an error occurred, or it can be just a number, which we indicate by writing Just 5 or Just 10 (the Just is there to remind us that we're dealing with a Maybe type and not just an Integer).

Then the >>= operator is defined thusly. For a function f and maybe type m, if m looks like Just x then m >>= f is the same as f x. That is, the value contained in the Maybe type is just passed into f (other languages might write f(x)).

If, however, m is Nothing, then f is never called and the whole thing just evaluates to Nothing. Since the >>= operator is left associative, just like division or subtraction, we can look at our example this way:

f x >>= g >>= h

first evaluates f x. Maybe it returns Nothing; then by left associativity,

(Nothing >>= g) >>= h
Nothing >>= h
Nothing

(Referential transparency allows us to reason about our code algebraically, which is why it is good). What if f x didn't throw an exception? Let's say it returns Just 5:

(Just 5 >>= g) >>= h
g 5 >>= h

Now we evaluate g 5. It too can return a value, or it can return Nothing. You can probably see already that if it returns Nothing, the whole expression will evaluate to Nothing. If it doesn't, whatever it returns will be passed on to h, which also has the ability to return Nothing. In this way, we have propagated the error -- which is why functional programmers often talk about exceptions being propagated rather than thrown.

Now, there is one other function that is useful, which I'll call unit (in Haskell, we call it something else, but its name can confuse imperative programmers). unit just takes a normal value and makes it into a Maybe value by putting a Just in front of it:

unit x = Just x

So we could equivalently write the code we wrote before as

unit x >>= f >>= g >>= h

You can think of unit as lifting a normal value into a value that can also throw an exception.

Now, if you've read this far, you've probably noticed that I haven't mentioned monads. Well, Maybe is a monad, a specific example of one that you can use today. unit and >>= are the monad operators (the latter operator is called "bind"). Many different data structures all define unit and bind in completely different ways. All that makes a monad, really, is that you can define two functions like unit and bind on them.

A monad is always a container type, which is to say that like Maybe it extends a simple type like Integer or String by adding some extra information in some way (in our example, we added the possibility of being Nothing, but other monads add other things).

For example, it is possible to write out messages in a purely functional way using a monad. The idea is, instead of just returning a value, you return a value and the string that you want to write to the screen. Why do it that way, instead of just writing out to the screen? Because that way, your functions don't have side-effects, you preserve referential transparency, and you -- and more importantly, your compiler -- can reason about your code algebraically, allowing it to simplify your code the way you would an equation.

I won't go into the gory details regarding this monad (we call it Writer), but the basic idea is this: unit takes a normal value and puts it in a data structure which includes an empty string. Our bind operator evaluates a function, which returns that same kind of data structure, except maybe that structure's string isn't empty (in case you haven't already guessed, the string in question is the string the function "writes" to the screen). It concatenates the two strings and continues on. An example:

unit 5 >>= f >>= g >>= h
(5, "") >>= f >>= g >>= h
f 5 >>= g >> h (message is "")

Assume f 5 evaluates to 3 and writes "f was called\n" (here ++ is the concatenation operator):

(3, "" ++ "f was called\n") >>= g >>= h
g 3 >>= h (message is "f was called\n")

Now g returns 10 and writes out "g was called\n":

(10, "f was called\n" ++
"g was called") >>= h

And h returns 5 and doesn't write out anything:

 h 10 = (5, "f was called\ng was called\n" ++ "")"
= (5, "f was called\ng was called\n")

So here, bind is again passing the value contained in the Writer data structure into the next function, but at the same time it concatenating the messages that each returns.

You can see that these functions are referentially transparent: earlier I said that if f(5) wrote something to the screen, it would not be referentially transparent because you could not evaluate it once and replace every occurence of f(5) with that. But if f uses the Writer monad we just defined, then it is referentially transparent, because its output is contained in its return value.

There are many other monads. The more monads you learn, the more it will become clear what they all have in common; but in the meantime, you can use them productively as you learn them to write purely functional code if you enjoy that sort of thing.

(For the curious: in Haskell, unit is called return. It is nothing like return in C, which is why I did not call it that in this post. It is called unit in category theory, however, so it's not like I just made the name up.)

Sunday, January 6, 2008

Updated botnode.com site

I have made some small updates to botnode.com, looks a little prettier than what was existing. The purpose of the site is up for debate, but will normally center around all things software development.

http://www.botnode.com/

Friday, January 4, 2008

Link Blog Specification (use of Emacs/Scala/Lift/Antlr)

Specification for Emacs Link Blog Tool.

Overview:

A link blog contains blog entries which in turn contain a list of links. This tool will aid in the creation of a link blog service. The emacs IDE is a powerful development environment that also includes a lisp based plugin system. We can build a Text over HTTP protocol that interfaces with an emacs plugin. A user can then use the emacs plugin to add entries to the blog.

Yet another SVM Tool (used for part of speech tagging)

http://www.lsi.upc.edu/~nlp/SVMTool/

Good (readable) publication on practical support vector machines.

Support Vector Machines and Document Classification
Saurav Sahay

"Automatic Text categorization using machine learning methods like Support Vector Machines (SVM) have tremendous potential for effectively organizing electronic resources. Human categorization is very costly and time-consuming, thus limiting its application for large or rapidly changing collections. SVM is a comparatively new technique with a very solid mathematical foundation for solving a variety of ‘learning from examples’ problem and gives high performance in practical applications."

Resources

http://www-static.cc.gatech.edu/~ssahay/sauravsahay7001-2.pdf

Keywords

Reuters 21578, SVM

Thursday, January 3, 2008

Lisp like s-expressions in haskell

Kind of interesting to write lisp like code in haskell. Note, all of this is valid haskell:

Prelude> ((+) ((*) 1 2) 10)
12

Prelude> (1 * 2) + 10
12

Prelude Text.Printf> (printf "%d\n" 5)
5
Prelude Text.Printf> printf "%d %d\n" 1 2
1 2
Prelude Text.Printf> ((printf "%d %d\n" 1) 2)
1 2

Of course, you have to match up the arguments in haskell accordingly; not all functions take a just a list like you might find in common lisp.

Wednesday, January 2, 2008

Haskell Notes: Basic List operation

There post doesn't contain much content, but I thought that this Haskell List reference sheet gives a good overview of some of the powerful list functions in haskell.

http://www.haskell.org/ghc/docs/latest/html/libraries/base/Data-List.html

Haskell: Difference between "<-" left arrow and 'let' expressions

The syntax left arrow denoted by "<-", allows you to bind a variable from the result of an IO computation.

Saizan on freenode does a better job of explaining it...

When asked; What are the main differences between left-arrow and binding a variable with let.

"the <- arrow "extracts" the value from an IO computation, while using let you'd just define another name for the computation as a whole"

"so in let x = getChar, x :: IO Char, but in do x <- getChar; ... x :: Char"

Sick of Internet Journalism and response to the Ron Paul Evolution Video

I have been defending Ron Paul on his blurb about evolution. Here is one response in the reddit forum:

I am still waiting on where he said he (Ron Paul) doesn't believe in "EVOLUTION". Big difference from not siding with EVOLUTIONARY "THEORY" in which you aren't 100% on the entire details on how human beings arrived here today.

I am a life long person of science. Lets understand evolution as fact and push everything else away. But I can understand how a Christian can take issue with an some established origins of life; how some out of the 6 billion people can be confused about the origins of life.

If you can give me the video where Ron Paul says the following "I DO NOT BELIEVE IN EVOLUTION AT ALLL, WILL NEVER BELIEVE EVOLUTION" or "I AM A CREATIONIST FROM THE DAY I WAS BORN TILL NOW"

Yes, I saw the youtube video mentioned earlier, but you can play all kinds of "Bill Clinton wordisms" with that particular Ron Paul blurb.

Also, here is the difference between Evolution and Evolutionary theory.

http://en.wikipedia.org/wiki/Evolution_as_theory_and_fact


F**king "Redditards"; We want freedom for all but we can't give this man his personal beliefs. Seems like hypocrisy to me. Beliefs that don't have anything to do with the office he is running for.

"BUT HE DOESN'T accept a Darwin's or Gould theory of evolution, he can't reason on complex issues."

Cut that bull-shit. If a former Air force surgeon with a medical degree from Duke Medical school can't reason complex situations over a fucking-tard living off dad's Oil money (Bush) than god help us. Call your representatives, asking to introduce a bill that all persons running for the presidency must have a PHD in Astrophysics.

In the mean-time, lets go with these requirements:

U.S. citizen, 35 years of age, 14 years resident of U.S. Must have an absolute majority in the ELECTORAL COLLEGE – 270/538.

Personally, I am not really THAT big a Ron Paul supporter. He has interesting ideas that might possibly work. So do the other candidates. I am just tired of this internet journalism where we take 30 second sound bites and ignore a person's 70 years of life experience.

That is fine, be an media idiot, personally, I would like to hear criticism or favoritism of Ron Paul's plans for the country. Will his plan to eliminate the IRS get passed? Will it work? Is it a long term strategy? Will it rejuvenate the economy?

Early linux source code 0.1, working on modern compiler.

"Abdel Benamrouche announced that he has updated the original 0.01

Linux kernel to compile with GCC-4.x, allowing it to run on emulators such as QEMU and Bochs. After applying his series of small patches, Abdel explains that the 0.01 kernel can be built on a system running the 2.6 Linux kernel. He added that he's successfully ported bach-3.2,
portions of coreutils-6.9, dietlibc-0.31 (instead of glibc), bin86-0.16.17, make-3.81, ncurses-2.0.7, and vim-7.1 all to run on his modified 0.01 kernel." -- Quote from blog entry.

http://kerneltrap.org/Linux/Dusting_Off_the_0.01_Kernel

CouchDB and living the dream (2006/6)

It looks like CouchDB has taken off, in fact Damien Katz will now be working with IBM full time on CouchDB. This is a big deal to a developer, especially an opensource one.

I don't know the true origins of CouchDB, but just looking back at his blog, it look likes like it got started in 2006/6.

http://damienkatz.net/2006/06/

Congratulations.

Reading binary files with haskell, simple snippet

There is a hackage haskell package that is external from the core haskell libraries that includes support for reading writing binary files and support big/little endian encoding and decoding. I used this, in order to manipulate the botlist binary database format. It wasn't intuitive as I didn't quite get the understand the Binary.GET monad which contained the getWord16le, getWord16be functions and how it related to the pure Binary monad, but after a while it made it sense.
http://hackage.haskell.org/packages/archive/binary/0.4.1/doc/html/Data-Binary.html

Download and review this haskell source to see what I ended up with. I defined a DatabaseFile type which includes two magic number words (unsigned shorts of 16 bits each) and then populated the fields from the binary file.

http://openbotlist.googlecode.com/svn/trunk/botlistprojects/botspider/spider/tools/misc/dbreader/haskell



instance Binary SpiderDatabase where
put _ = do BinaryPut.putWord16le 0
get = do
magicnumbera <- BinaryGet.getWord16be
magicnumberb <- BinaryGet.getWord16be
return (SpiderDatabase {magicNumberA=magicnumbera,
magicNumberB=magicnumberb
})

Tuesday, January 1, 2008

Best and worst of 2007

Best and Worst of 2007, from a developer's perspective:

Best of the Best:
Reddit, Factor, Haskell

Best Website:
Reddit.com

Best Movie:
3:10 to Yuma

Best Food:
Mexican Food (love the tacos!)

Best new project, you probably haven't heard about:
Factor

Another best language, that is gaining a lot of traction:
Haskell (honorable mention for Scala)

Best Project:
Lucene/Nutch/Hadoop

Best Printer:
Brother HL2040, works with linux (at least so far)

Best Politician:
Ron Paul

Worst Product:
Firefox2-3 on Linux. (Actually, flash plugins have a lot to do with the trouble on linux, but still).
(Honorable mention for HP printers)

Worst developer discussion forum:
Joel on Software, http://discuss.joelonsoftware.com/?joel
when you get a bunch of VB developers together, this is what happens.

Worst websites:
Twitter.com, Facebook.com, MySpace.com

Worst political websites:
Slate.com, Huffingtonpost, DailyKos,

Simple look at the java bytecode through python.

Here is some python code for loading the java binary class format. It only returns the first couple of bytes and will only convert the bytecode format into a data structure. It won't parse the data. Maybe we can look at that in the future.

Some useful python libraries:

http://docs.python.org/lib/module-array.html

"struct -- Interpret strings as packed binary data"
http://docs.python.org/lib/module-struct.html

Source:
http://jvmnotebook.googlecode.com/svn/trunk/misc/bytecode/

http://java.sun.com/docs/books/jvms/second_edition/html/ClassFile.doc...

Another note, the java bytecode class is big-endian format and most of us work on little endian machines these days, just something to keep in mind.