Wednesday, December 15, 2010

Practical AI: Machine Learning and Data Mining with Weka and Java




Most people picture Artificial Intelligence as autonomous bipedal robots that greet humans in a monotone voice. The computer science field of Artificial Intelligence is far from the science fiction world portrayed in the movies. Much of the most prominent past AI research looks more like a branch of mathematics than computer science. AI sub-fields include machine learning, data mining, computer vision, string matching and indexing, search algorithms and neural networks. The newer focus is not just on faster search algorithms and pattern recognition, but on building relationships between computer science, neuroscience, philosophy, psychology and biology. AI is branching out beyond algorithms and merging with those other science and medical fields.

Marvin Minsky, a prominent researcher in the AI field, called some earlier interests, such as neural networks, technology fads. His most recent book, The Emotion Machine, describes some human behavior with a simple set of rules, and it is a useful guide for computer scientists who want to model that behavior. Jeff Hawkins of Numenta, previously the founder of Palm Inc, has shifted his focus from mobile computing to developing a sophisticated AI system. Numenta has patented HTM (Hierarchical Temporal Memory) technology. According to Hawkins, you cannot mimic brain functions without a hierarchical system of memory, in which each lower level has more input/output nodes than the level above it.

According to Hawkins, you must also take "temporal" memory into account. For example, many parts of the human brain handle visual information. The brain can detect a particular object, but it also factors in the time at which the visual event occurred. If you are at the zoo, your brain predicts that you will see animals, some of them in cages. It is rare to see a plane take off or land in the middle of your zoo visit. You have previous memories of the zoo, or have seen pictures of it, and parts of your brain activate other things associated with the zoo. The context in time is a visit to the zoo. The memory of a zoo visit is probably stored in a different area of the brain than a trip to an airport, where you expect to see planes landing and taking off.

Basic Data Mining With Weka

In this post, I present a practical Hello World using WEKA. WEKA (Waikato Environment for Knowledge Analysis) is a suite of machine learning tools and libraries that can be used to mine data.

What is data mining, and how could you use data mining techniques? Many enterprise companies are connected to large databases that contain millions of records. Data mining and machine learning techniques are used to find patterns within that massive trove of data. These techniques can also be used to filter out noise. Popular email filter software uses data mining to remove or categorize spam email.

Hello World Weka for Java Web Server Log Files

I have dozens of Apache web server log files. I wanted to find groups or clusters between some of the log files and the time that a user requested a page on my site. I used WEKA to find clusters and categorize various groups of relevant data.

12_15_10_10_14_03 [HASERR N] log.fileName1[]:Wed 22:14:03 EST 2010 [SIZE 144]
12_15_10_10_14_03 [HASERR N] log.fileName2[]:Wed 22:14:03 EST 2010 [SIZE 121]
12_15_10_10_14_03 [HASERR N] log.fileName2[]:Wed 22:14:03 EST 2010 [SIZE 156]
12_15_10_10_15_33 [HASERR N] log.fileName3[]:Wed 22:15:33 EST 2010 [SIZE 160]
12_15_10_10_15_33 [HASERR N] log.fileName3[]:Wed 22:15:33 EST 2010 [SIZE 146]
...

The timestamp, with hour, minute and seconds, is in the leftmost column. The middle column contains the log file name. Only five rows are shown in the example, but the log files contain millions of requests. Once I determined the data that I wanted to categorize, I converted the timestamped log file into the ARFF file format that WEKA requires. WEKA has several tools to convert generic CSV files into ARFF format. ARFF is essentially a text database with column attributes and rows of the data that you want to analyze.
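
Before any conversion, each log line has to be split into its fields. The following is only a minimal sketch of that step in plain Java, not the code I actually ran. The class name LogLineParser, the Entry holder and the regular expression are hypothetical, and the layout is inferred from the five sample lines alone. In particular, it assumes the HASERR flag is either Y or N, and it reads the hour from the 24-hour HH:mm:ss part of the line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of WEKA: parses one line of the sample log format.
final class LogLineParser {

    // Assumed layout, inferred from the sample lines only:
    // <timestamp> [HASERR Y|N] <logFile>[]:<day> <HH:mm:ss> <zone> <year> [SIZE <n>]
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+) \\[HASERR ([YN])\\] (\\S+?)\\[\\]:\\w+ (\\d{2}):\\d{2}:\\d{2} \\w+ \\d{4} \\[SIZE (\\d+)\\]$");

    /** The fields of one log line that matter for clustering. */
    static final class Entry {
        final String timestamp; // leftmost column, kept as raw text
        final boolean hasError; // true when the flag is Y (assumed value)
        final String logFile;   // for example log.fileName1
        final int hour;         // 0-23, taken from the HH:mm:ss part
        final int size;         // number after SIZE

        Entry(String timestamp, boolean hasError, String logFile, int hour, int size) {
            this.timestamp = timestamp;
            this.hasError = hasError;
            this.logFile = logFile;
            this.hour = hour;
            this.size = size;
        }
    }

    /** Parses a single log line, or throws if it does not match the assumed layout. */
    static Entry parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized log line: " + line);
        }
        return new Entry(m.group(1), "Y".equals(m.group(2)), m.group(3),
                Integer.parseInt(m.group(4)), Integer.parseInt(m.group(5)));
    }

    public static void main(String[] args) {
        Entry e = parse("12_15_10_10_14_03 [HASERR N] log.fileName1[]:Wed 22:14:03 EST 2010 [SIZE 144]");
        System.out.println(e.logFile + " hour=" + e.hour + " error=" + e.hasError + " size=" + e.size);
    }
}
```

With entries parsed this way, counting lines per log file and per hour is a matter of incrementing a counter keyed on logFile and hour.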

Example ARFF File for Weather Data:
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no

Here is my ARFF file with web server log data. Each attribute corresponds to a log file, and each row holds the number of lines the server wrote to each log file during a particular period of time.

Example arff file for server logs:
% Title: Database for application line count
@relation scan_app_data
@attribute LogXX0 real
@attribute LogXX1 real
...
@attribute LogXX36 real
@attribute LogXX37 real
@attribute LogXX38 real
@attribute LogXX39 real
@attribute timeperiod real
@attribute class { 'night_morn', 'earl_morn', 'midday', 'after', 'night', 'late_nigh' }
@data
0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,night_morn
,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,night_morn
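
A file with this shape can be generated with a few lines of plain Java. The sketch below is an illustration under assumptions, not the converter I used: the class ArffSketch, the Row holder and the toArff method are hypothetical names, and the row layout (one count per log file, then the time period, then the class label) simply follows the attribute header shown above. The time period value 0 in the usage example is made up.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of WEKA: renders line counts as ARFF text.
final class ArffSketch {

    /** One @data row: a count per log file, the time period, and the class label. */
    static final class Row {
        final int[] counts;
        final int timePeriod;
        final String label;

        Row(int[] counts, int timePeriod, String label) {
            this.counts = counts;
            this.timePeriod = timePeriod;
            this.label = label;
        }
    }

    /** Builds the ARFF header and data section as a single string. */
    static String toArff(String relation, List<String> logNames,
                         List<String> classLabels, List<Row> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append('\n');
        // One numeric attribute per log file.
        for (String name : logNames) {
            sb.append("@attribute ").append(name).append(" real\n");
        }
        sb.append("@attribute timeperiod real\n");
        // Nominal class attribute with quoted labels.
        sb.append("@attribute class { ");
        for (int i = 0; i < classLabels.size(); i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append('\'').append(classLabels.get(i)).append('\'');
        }
        sb.append(" }\n@data\n");
        for (Row row : rows) {
            // Each row must supply exactly one count per declared log attribute.
            if (row.counts.length != logNames.size()) {
                throw new IllegalArgumentException("Expected " + logNames.size() + " counts");
            }
            if (!classLabels.contains(row.label)) {
                throw new IllegalArgumentException("Unknown class label: " + row.label);
            }
            for (int count : row.counts) {
                sb.append(count).append(',');
            }
            sb.append(row.timePeriod).append(',').append(row.label).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String arff = toArff("scan_app_data",
                Arrays.asList("LogXX0", "LogXX1"),
                Arrays.asList("night_morn", "midday"),
                Arrays.asList(new Row(new int[] {0, 3}, 0, "night_morn")));
        System.out.print(arff);
    }
}
```

Checking the number of counts against the number of declared attributes matters, because a data row whose value count does not match the header makes the file invalid.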


I used the WEKA GUI tools to visualize the patterns. I simply clicked on ./weka.jar to launch the WEKA explorer.

It was interesting to focus on the groups of requests. I could see a cluster of requests combined with errors displayed to the user. After looking at the visual tools, I could easily determine what time period had the most errors and in which log file.

I just wanted to cover some basics. WEKA is mature Java software, but the developer or researcher must still do some work to determine which data to feed to WEKA. In the case of the log files, I wanted to cluster log file data, errors and the time period in which these events occurred. In a future test, I might change my ARFF file to include the web server execution time versus the application requested, to see if there is any correlation between the two.

Resources:

http://www.cs.waikato.ac.nz/ml/weka/

http://www.numenta.com/
