First look at web/page content analysis

What defines a web page? Imagine an input/output device: what input would characterize a page, and what would the output be? I am currently looking for tools and algorithms for analyzing pages that are available on the internet. The chart here shows various HTML tags (e.g. table, strong, bold) and the number of times each tag appeared in a document. Five pages were analyzed: three good article pages from Wikipedia and two spam pages.



The link below contains the data that generated the chart above.

http://openbotlist.googlecode.com/svn/trunk/openbotlist/docs/media/content/chart_content.txt
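
For anyone who wants to try the same kind of tag count on their own pages, here is a minimal sketch in Python using only the standard library. It is not the code that produced the data linked above; the names `TagCounter` and `count_tags` and the sample markup are made up for illustration, and it counts opening tags only.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times each opening tag appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase, so <TABLE> and <table>
        # end up in the same bucket.
        self.counts[tag] += 1


def count_tags(html_text):
    """Return a Counter mapping tag name to number of occurrences."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


# Hypothetical sample markup, not taken from the analyzed pages.
sample = (
    "<html><body><table><tr><td><strong>a</strong></td></tr></table>"
    "<b>x</b><b>y</b></body></html>"
)
counts = count_tags(sample)
for tag, n in counts.most_common():
    print(tag, n)
```

Running this over a saved copy of a good article page and a spam page gives two tag histograms that can be compared side by side, which is roughly what the chart does for the five pages.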
