Wednesday, November 19, 2008

Google's Ranking Algorithm In Review

Google was built on a ranking algorithm called PageRank (discussed in previous posts here and here). Of course, there is much more to a search engine's secret sauce now; we just don't know what else goes into it.

Anyway, a recently published paper collected the traffic going into and out of the servers at Indiana University. Using this traffic, the authors were able to disprove three major assumptions underlying PageRank. PageRank assumes that:
  • a user is equally likely to follow any link on a page.
Actually, links are followed very unevenly. Some links carry huge amounts of traffic and others rarely see a click. (Think of how you browse a web page. Aren't there links that never look interesting, like "Report A Bug" on espn.go.com?)
  • the probability of "teleporting" (or going directly) to any web page is equal to any other web page.
Actually, the chance of starting to surf from any given page is very skewed. Some pages are very popular destinations that people reach without following links. How many of us have favorite sites that we visit every day through bookmarks or by typing the URL? We do not type in URLs at random.
  • the probability of "teleporting" from any web page is equal across all web pages.
This was more difficult to disprove from their data. However, some sites are more likely to be stopping points in browsing and others are a bridge to more information.
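
To make the three assumptions concrete, here is a minimal sketch of the textbook PageRank power iteration in Python. It is not Google's actual implementation and not the model from the paper. The three-page graph, the click weights, and the names `pagerank`, `link_weights`, and `teleport_to` are all made up for illustration. Each assumption shows up as a default that can be overridden.

```python
def pagerank(links, damping=0.85, link_weights=None, teleport_to=None,
             iterations=100):
    """Textbook PageRank by power iteration on a small graph.

    links: dict mapping each page to the list of pages it links to.
    link_weights: optional dict mapping (source, target) to a click weight.
        Default: every link on a page is equally likely (assumption 1).
    teleport_to: optional dict mapping page to the probability of jumping
        straight to it. Default: uniform over all pages (assumption 2).
    damping: probability of following a link instead of teleporting,
        identical for every page (assumption 3).
    """
    pages = sorted(links)
    n = len(pages)
    if teleport_to is None:
        teleport_to = {p: 1.0 / n for p in pages}
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Teleportation share that every page receives.
        new_rank = {p: (1.0 - damping) * teleport_to[p] for p in pages}
        for src in pages:
            targets = links[src]
            if not targets:
                # A page with no outgoing links hands its rank to teleports.
                for p in pages:
                    new_rank[p] += damping * rank[src] * teleport_to[p]
                continue
            if link_weights is None:
                weights = [1.0] * len(targets)
            else:
                weights = [link_weights.get((src, t), 1.0) for t in targets]
            total = sum(weights)
            for t, w in zip(targets, weights):
                new_rank[t] += damping * rank[src] * w / total
        rank = new_rank
    return rank


# Hypothetical three-page site used only for illustration.
links = {
    "home": ["news", "bug_report"],
    "news": ["home"],
    "bug_report": ["home"],
}

# Classic PageRank: all three assumptions hold.
uniform = pagerank(links)

# Relax assumption 1: "news" gets nine clicks for every one on "bug_report".
skewed = pagerank(links, link_weights={("home", "news"): 9.0,
                                       ("home", "bug_report"): 1.0})

# Relax assumption 2: most sessions start at "home" via a bookmark.
bookmarked = pagerank(links, teleport_to={"home": 0.8, "news": 0.1,
                                          "bug_report": 0.1})

for name, result in [("uniform", uniform), ("skewed", skewed),
                     ("bookmarked", bookmarked)]:
    print(name, {page: round(score, 3) for page, score in result.items()})
```

Under the default assumptions, "news" and "bug_report" end up with the same score because the graph treats them identically. Once the click weights are skewed, "news" pulls ahead, which is the kind of difference that link structure alone cannot see. Relaxing the third assumption would mean giving each page its own `damping` value, which this sketch does not attempt.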

The bottom line is that the link structure of the web is not that good at predicting the paths people actually follow while browsing. Yet major search engines are built on the premise that link structure determines popularity. The redeeming quality of search engines in this paper, though, is that they lead people to less popular sites, or sites we would not otherwise find out about, and thus spread the wealth of clicks around (which conflicts with what I said in my first post on Google bias).

Thursday, November 13, 2008

The Machine is Us/ing Us

An insightful video about Web 2.0 and how you fit into the current model of information sharing. The video was published by Michael Wesch, an assistant professor of anthropology at Kansas State University.

Monday, November 3, 2008

Google Bias Take 2

I posted earlier that Google's ranking of search results causes a rich-get-richer problem. In other words, the sites linked to most often are ranked first, which leads to even more links to them.

Here is a paper that uses traffic information from Alexa to disprove this theory. It turns out that queries on search engines are very diverse. As a result, the sites that appear towards the top are the ones that most specifically target the given keywords. For example, Google's Udi Manber said, "20 to 25% of the queries we see today, we have never seen before".

Current traffic from Alexa more closely follows the random surfer model, in which web pages are discovered by viewing non-search pages and clicking on links. It is good to see worrisome theories being put to the test.