Thursday, January 17, 2008

First look at web/page content analysis

What defines a web page? Imagine an input/output device: what input would define a page, and what would the output be? I am currently searching for tools and algorithms to analyze pages that are available on the internet. The chart here shows various HTML tags (e.g. table, strong, b) and the number of times each tag appeared in a document. Five pages were analyzed: three good article pages from Wikipedia and two spam pages.



The data used to generate the chart above is available at the link below.

http://openbotlist.googlecode.com/svn/trunk/openbotlist/docs/media/content/chart_content.txt
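
For readers who want to try the same kind of analysis, per-tag counts like the ones described above could be gathered with something along these lines. This is only a minimal sketch using Python's standard library, not the code that produced the chart; the names TagCounter and count_tags are made up for this illustration, and the sample markup is invented.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Hypothetical helper: counts how often each start tag appears."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase, so <B> and <b> are merged.
        # Self-closing tags such as <br/> also reach this method by default.
        self.counts[tag] += 1


def count_tags(html_text):
    """Return a Counter mapping tag name to number of occurrences."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


if __name__ == "__main__":
    # Invented sample markup, not one of the five pages analyzed in the post.
    sample = (
        "<html><body><table><tr><td><strong>a</strong></td></tr></table>"
        "<b>x</b><b>y</b></body></html>"
    )
    for tag, n in count_tags(sample).most_common():
        print(tag, n)
```

Running count_tags over each saved page and comparing the resulting counters side by side would give a table similar in spirit to the chart data, though the tag set and counting rules used for the original chart may differ.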
