Saturday, December 18, 2010

Standard Developer Toolkit - 2010 version


I don't know if it is because I am getting older, but I have started recording my thoughts online so that I can easily refer back to them later. I do this through blog posts on Blogspot and through Subversion code repositories.

One task I have long neglected is compiling a list of common developer tools. This toolkit suite lists software applications that a developer in almost any language or environment would need. Most of the software is open source or freely available. Obviously, different developers want or need different tools; this is the list of packages that are essential for my work, so yes, it is biased, with an emphasis on Java development. I marked the critical applications in bold. I would typically install all of these tools on one machine over a long period. Some of the tools overlap in purpose; for example, I install both NetBeans and Eclipse on the same machine, and I may swap between the IDEs when a tool built for one platform is not available in the other.

Operating Systems:
- Use both Windows XP and Windows 7.
- Cygwin
- VMware with an Ubuntu 9+ image (only VMware Player is needed on Win32)
- VMware Fusion if you use a Mac
- Ubuntu 10+ on a standalone machine (64-bit AMD or Intel)

Programming Languages and Environments:

- Java (Oracle's JDK on Win32; possibly OpenJDK elsewhere)
- Scala (JVM language)
- Clojure (JVM language)
- Python (install on both win32 and Linux)
- Perl (for some scripting)
- Haskell

Developer IDE Tools:

- Eclipse (you can easily install/setup multiple versions of Eclipse)
- Eclipse CDT for any C++ Development
- Eclipse PyDev for Python development
- IntelliJ IDEA (Great for Scala development, rapid Java development)
- Netbeans
- Emacs

Misc Developer Tools:

- Subversion client
- Git client
- Mercurial client (optional)
- Maven (for Java development)
- Ant (for Java development)

Text Editors:

- Vim (install on Cygwin if using Win32)
- Emacs (I prefer Emacs over XEmacs and use the Emacs win32 version)
- TextPad
- Notepad++ (Notepad++ is open source, TextPad isn't; I like both)

Word Processing:

- Open Office
- Microsoft Office (I prefer older versions of the Microsoft Office suite; if you have the money, it is worth purchasing)

Web Browsers:

- Mozilla Firefox
- Install the Firefox Firebug plugin
- Install the Firefox Tamper Data plugin
- Google Chrome

Network Tools:

- WinSCP (Win32)
- FileZilla (Win32)
- XChat (or XChat2)
- VMWare Player

Graphic Tools:

- Gimp

Misc Tools:

- 7zip
- WinMerge (open-source merge tool, very useful)
- R (search for the R statistics application; useful for charting data)

Cygwin Installs

- Install Cygwin
- Download or make sure these are installed: vim, find, grep, wget, gcc, g++, the GTK libraries, the OpenSSL libraries, gnuplot

Eclipse Plugins for Java Development

- Subclipse
- Maven Plugin (m2eclipse)

Java Frameworks

- Spring Framework, Google's GWT, Apache Wicket, Hibernate, iBatis ORM, Lucene, AspectJ, ASM, Antlr

Practical AI: Hello World Bitworm Example with Numenta's Nupic (HTM for the masses)



Overview

Jeff Hawkins of Numenta, previously the founder of Palm, Inc., has shifted his focus from mobile computing to developing a sophisticated AI system. He has always been passionate about artificial intelligence, but he recognized early on that the dominant trends in AI research weren't very promising, so he concentrated his interest on human biology and neuroscience. Numenta has patented its HTM (Hierarchical Temporal Memory) technology. According to Jeff Hawkins, you cannot mimic brain functions without a hierarchical system of memory, in which each lower level has more input/output nodes than the level above it. You must also take "temporal" memory into account. For example, many parts of the human brain handle visual information. The brain can detect a particular object, but it also factors in the time at which the visual event occurred. If you are at the zoo, your brain predicts that you will see animals, and animals in cages; it is rare to see a plane take off or land in the middle of a zoo visit. Because you have previous memories or pictures of zoos, parts of your brain activate other things associated with the zoo. The context in time is a visit to the zoo, and the memory of a zoo visit is probably stored in a different area of the brain than a trip to an airport, where you expect to see planes landing and taking off. -- Repost from my previous blog entry.

Jeff's vision of HTM is implemented through Numenta's Nupic. Nupic is an HTM Python library and software suite that includes simple speech recognition demos, computer vision demos (picture object recognition), and other examples. Normally you would have to pay hundreds of dollars for pattern recognition software of this quality, but all of these examples are functional and demonstrate the power of the HTM Nupic approach.

Bitworm Hello World Example

The Bitworm example provided with Nupic is a Hello World example, but it is probably the most complex and thorough Hello World I have seen. It covers the basics, yet it is also usable as a library or a simple Nupic API. The goal of the Bitworm example is to track a bitworm's movement through 2D space and time. Think of the example in three dimensions: dimension 1, on the Y axis, is the height (in this case the height is one); dimension 2, on the X axis, contains the length of the bitworm and its movement along that axis; and the third dimension is time. There are 20 time sequences trained per bitworm group. (In reality you could think of the example in two-dimensional space; only the X and T axes are relevant.)


There is one bitworm represented in the screenshot above, drawn as a series of lines from the top to the bottom of the screen. Line one is a representation of ONE bitworm and its position. In line two, the bitworm has moved in the X direction; line two is the same bitworm at TIME sequence two. There is ONE bitworm and 20 time sequences of its movement. The goal of the bitworm example is to train on that movement and then predict the bitworm type from a test set of bitworm examples.

I wrote a Python Tk graphics program to render the bitworm's movement (the Tk example is not provided with Nupic). The strings of bits are created by the bitworm example; here is a representation of the training data set:

1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 [---- bitworm zero and time sequence 0
0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 [---- bitworm zero and time sequence 1
0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 [---- bitworm zero and time sequence 2
0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0
0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0
0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0
0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0
0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0
...

There are 420 lines of these bitworm bit vectors in the text file. Each line in the training set and the test set represents a bitworm.
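The sliding pattern above is easy to reproduce. Here is a minimal sketch of my own (not part of Nupic; the function name is made up) that generates the same kind of training line: a worm of solid 1-bits shifting one position to the right per time step inside a 16-bit vector.

```python
def bitworm_line(position, worm_length=7, input_size=16):
    """Return one line of the training file: a worm of 1-bits
    starting at `position`, padded with 0-bits."""
    bits = ["0"] * input_size
    for i in range(position, min(position + worm_length, input_size)):
        bits[i] = "1"
    return " ".join(bits)

# Print the first three time steps, matching the sample above.
for t in range(3):
    print(bitworm_line(t))
```

Each call with an incremented position produces the next time-sequence line of the file.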


The bitworm example is located at NUPIC_HOME\share\projects\bitworm

Summary to understand the bitworm example:

1. The bitworm Python example does the following:
1a. It creates the training data (file: training_data.txt).
1b. It creates the test data (file: test_data.txt).
1c. It creates the bitworm categories (needed during the training process).
1d. Once the training and test data are created, the application trains on the training data.
1e. It validates that Nupic learned the training data by verifying against the test data.

Think of training as teaching the Nupic AI software, and the test data as the way to check that the training worked.

2. training_data.txt and test_data.txt are simple text files.
3. Each line in the training_data.txt and test_data.txt database files consists of one bitworm (a 16-bit vector).
4. Each column, or bit, in the training and test files is one bit in the bitworm; there are 16 bits in the bitworm example.
5. Each line in the training and test data files is a representation of a bitworm at a particular time sequence.
6. There are twenty time sequences in a GROUP training set. Line 1 of the training file represents a bitworm at time sequence 0, line 2 a bitworm at time sequence 1, and so on. Line 20 ends the time-sequence group; after the delimiter line, a new bitworm group starts at line 22.
7. There are twenty bitworm example groups (twenty time sequences each).
8. There are 420 lines in the training and test data files; each line is a bitworm at a particular moment in time.
9. An all-zero bitworm vector delimits each time-sequence group.

If you look at the file training_data.txt:

One bitworm is 16 bits:

1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 [---- bitworm zero and time sequence 0
Here is a formula for why there are 420 lines in the training and test files:

(20 time sequences per group * 20 groups)
+ (1 zero-delimiter line * 20 groups) = 420 lines in the file.
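The same arithmetic can be checked in a couple of lines:

```python
# 20 groups, each holding 20 time-sequence lines plus 1 all-zero
# delimiter line, gives the 420 lines seen in training_data.txt.
sequences_per_group = 20
groups = 20
delimiters_per_group = 1

total_lines = groups * (sequences_per_group + delimiters_per_group)
print(total_lines)  # 420
```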

Example snippets from the Python Bitworm Example:

def generateBitwormData(additiveNoiseTraining = 0.0,
                        bitFlipProbabilityTraining = 0.0,
                        additiveNoiseTesting = 0.0,
                        bitFlipProbabilityTesting = 0.0,
                        numSequencesPerBitwormType = 10,
                        sequenceLength = 20,
                        inputSize = 16,
                        trainingMinLength = 9,
                        trainingMaxLength = 12,
                        testMinLength = 5,
                        testMaxLength = 8):

    # Generate training data with worms of lengths between
    # trainingMinLength and trainingMaxLength (9 and 12 by default)
    trainingData = BitwormData()
    trainingData['prefix'] = 'training_'
    trainingData['minLength'] = trainingMinLength
    trainingData['maxLength'] = trainingMaxLength
    trainingData['sequenceLength'] = sequenceLength
    trainingData['inputSize'] = inputSize
    trainingData['numSequencesPerBitwormType'] = numSequencesPerBitwormType
    trainingData['additiveNoise'] = additiveNoiseTraining
    trainingData['bitFlipProability'] = bitFlipProbabilityTraining
    trainingData.createData()

Train and verify the bitworm example:

# Train the network
# TrainBasicNetwork: This function trains a basic network with the given
# data and category files and returns the trained network
bitNet = TrainBasicNetwork(bitNet,
                           dataFiles = [trainingFile],
                           categoryFiles = [trainingCategories])
print "Bit Net (TrainBasicNetwork-1): ", bitNet

# RunBasicNetwork: Runs the network using the given data files. The output
# of the classifier node for each input pattern is stored in resultsFile.
accuracy = RunBasicNetwork(bitNet,
                           dataFiles = [trainingFile],
                           categoryFiles = [trainingCategories],
                           resultsFile = trainingResults)
print "Bit Net (RunBasicNetwork-2): ", bitNet
print "Training set accuracy with HTM[a] = ", accuracy * 100.0

# Run inference on the test set to check generalization
accuracy2 = RunBasicNetwork(bitNet,
                            dataFiles = [testFile],
                            categoryFiles = [testCategories],
                            resultsFile = testResults)
print "Bit Net (RunBasicNetwork-3): ", bitNet
print "Test set accuracy with HTM[b] = ", accuracy2 * 100.0


Modifications to the Bitworm example and moving forward:

The bitworm example is a common type of example in the AI and pattern recognition world. You are given a bit sequence; you train on it and then test against other sequences with a similar structure. In the case of the bitworm example, a 16-bit vector is trained. You could modify the example to train on a 16x16 = 256-bit vector image.
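To illustrate the suggested modification, here is a small sketch (my own, not from Nupic) that flattens a 16x16 binary image into one 256-bit line in the same space-separated format the bitworm data files use. The diagonal pattern is made up purely for illustration.

```python
SIZE = 16

# Build a 16x16 binary image with 1-bits along the diagonal.
image = [[1 if row == col else 0 for col in range(SIZE)]
         for row in range(SIZE)]

# Flatten row by row into a single 256-bit training line.
line = " ".join(str(bit) for row in image for bit in row)
print(len(line.split()))  # 256
```

Each such line could then replace a 16-bit worm vector in the training file, with the sequence of lines representing the image over time.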

Resources:

[1]. http://www.numenta.com/

Wednesday, December 15, 2010

Practical AI: Machine Learning and Data Mining with Weka and Java




Most people tend to picture Artificial Intelligence as autonomous bipedal robots greeting humans in a monotone voice. The computer science field of Artificial Intelligence is far from the science fiction world we see portrayed in the movies. Much of the most prominent past AI research seems as if it would fit better under a branch of mathematics than computer science. AI research sub-fields include machine learning, data mining, computer vision, string matching and indexing, search algorithms, and neural networks. There is a new focus not just on faster search algorithms and pattern recognition, but on building relationships between computer science, neuroscience, philosophy, psychology, and biology; AI is branching out beyond the algorithms and merging with those other scientific and medical fields. Marvin Minsky, a prominent researcher in the AI field, called some of the previous fashions, like neural networks, technology fads. His most recent book, The Emotion Machine, describes some of human behavior with a simple set of rules, and it is a useful guide for computer scientists who want to model that behavior. Jeff Hawkins of Numenta, previously the founder of Palm, Inc., has shifted his focus from mobile computing to developing a sophisticated AI system. Numenta has patented its HTM (Hierarchical Temporal Memory) technology. According to Jeff Hawkins, you cannot mimic brain functions without a hierarchical system of memory, in which each lower level has more input/output nodes than the level above it. You must also take "temporal" memory into account. For example, many parts of the human brain handle visual information. The brain can detect a particular object, but it also factors in the time at which the visual event occurred. If you are at the zoo, your brain predicts that you will see animals, and animals in cages; it is rare to see a plane take off or land in the middle of a zoo visit. Because you have previous memories or pictures of zoos, parts of your brain activate other things associated with the zoo. The context in time is a visit to the zoo, and the memory of a zoo visit is probably stored in a different area of the brain than a trip to an airport, where you expect to see planes landing and taking off.

Basic Data Mining With Weka

In this post, I present a practical Hello World using WEKA. WEKA (Waikato Environment for Knowledge Analysis) is a suite of machine learning tools and libraries that can be used to mine data.

What is data mining, and how could you use data mining techniques? Many enterprise companies are connected to large databases containing millions of records. Data mining and machine learning techniques are used to find patterns within that massive trove of data; these techniques can also be used to filter out noise. Popular email filter software, for example, utilizes data mining to remove or categorize spam.

Hello World Weka for Java Web Server Log Files

I have dozens of Apache web server log files. I wanted to find groups or clusters among the log files and the times that users requested pages on my site, so I used WEKA to find clusters and categorize various groups of relevant data.

12_15_10_10_14_03 [HASERR N] log.fileName1[]:Wed 22:14:03 EST 2010 [SIZE 144]
12_15_10_10_14_03 [HASERR N] log.fileName2[]:Wed 22:14:03 EST 2010 [SIZE 121]
12_15_10_10_14_03 [HASERR N] log.fileName2[]:Wed 22:14:03 EST 2010 [SIZE 156]
12_15_10_10_15_33 [HASERR N] log.fileName3[]:Wed 22:15:33 EST 2010 [SIZE 160]
12_15_10_10_15_33 [HASERR N] log.fileName3[]:Wed 22:15:33 EST 2010 [SIZE 146]
...

The timestamp with hour, minutes, and seconds is in the leftmost column; the middle column contains file-name information. Only four or five rows are shown in the example, but the log files contain millions of requests. I determined the data that I wanted to categorize, then converted the timestamped log file into the ARFF file format that WEKA requires. WEKA has several tools to convert generic CSV files into ARFF format. ARFF is essentially a text database with column attributes and rows of the data that you want to analyze.

Example ARFF File for Weather Data:
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no

Here is my ARFF file with web server log data. The attributes consist of the log file names, and each row records how many lines the server wrote to each log file during a particular period of time.

Example arff file for server logs:
% Title: Database for application line count
@relation scan_app_data
@attribute LogXX0 real
@attribute LogXX1 real
...
@attribute LogXX36 real
@attribute LogXX37 real
@attribute LogXX38 real
@attribute LogXX39 real
@attribute timeperiod real
@attribute class { 'night_morn', 'earl_morn', 'midday', 'after', 'night', 'late_nigh' }
@data
0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,night_morn
,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,night_morn
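
An ARFF file like the one above is plain text, so it can also be generated directly without WEKA's converters. Here is a rough sketch of how my log counts could be written out; the function and field names mirror the excerpt above, and the example row is invented for illustration.

```python
def write_arff(path, num_logs, rows):
    """Write log-line counts as an ARFF file.
    rows: list of (counts_list, timeperiod, class_label) tuples."""
    classes = "'night_morn', 'earl_morn', 'midday', 'after', 'night', 'late_nigh'"
    with open(path, "w") as f:
        f.write("% Title: Database for application line count\n")
        f.write("@relation scan_app_data\n")
        # One numeric attribute per log file.
        for i in range(num_logs):
            f.write("@attribute LogXX%d real\n" % i)
        f.write("@attribute timeperiod real\n")
        f.write("@attribute class { %s }\n" % classes)
        f.write("@data\n")
        for counts, period, label in rows:
            f.write(",".join(str(c) for c in counts) + ",%d,%s\n" % (period, label))

# Example: 3 hits in log file 10 during the night_morn period.
counts = [0] * 40
counts[10] = 3
write_arff("scan_app_data.arff", 40, [(counts, 0, "night_morn")])
```

The resulting file loads into the WEKA Explorer just like a converter-produced ARFF file.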


I used the WEKA GUI tools to visualize the patterns; to launch the WEKA Explorer, I simply double-clicked weka.jar.

It was interesting to focus on the groups of requests. I could see clusters of requests combined with errors displayed to the user. After looking at the visual tools, I could easily determine which time period had the most errors and in which log file.

I just wanted to cover some basics here. WEKA is mature Java software, but the developer or researcher must do some work to determine which data to feed it. In the case of the log files, I wanted to cluster log file data, errors, and the time periods in which those events occurred. In future tests, I might change my ARFF file to include web server execution time versus the application requested, to see if there is any correlation between the two.

Resources:

http://www.cs.waikato.ac.nz/ml/weka/

http://www.numenta.com/