Wednesday, December 3, 2008
I believe that just because something says it is healthy does not mean it is. "Low fat," to me, means "less bad" rather than actually "good." I originally held the same opinion of educational TV: TV is brain rot for children (though sometimes worth the quiet it brings). However, two cartoons on PBS have changed my view: Super Why and Sid the Science Kid.
Our daughter rarely seemed interested in learning letters from her mother and father, but once she started watching Super Why she really began to pick up on all of the letter knowledge the show brought. She had the alphabet down after just a couple of weeks of watching the show once a day. Entertaining education really worked for her.
Sid the Science Kid is the latest show that actually teaches our daughter something. She has learned about washing her hands to remove germs, what "melting" means, and what seeds are good for. I am a real fan of this show.
Wednesday, November 19, 2008
Google's Ranking Algorithm In Review
Google started on the basis of a ranking algorithm called PageRank (discussed in previous posts here and here). Of course, there is much more to the secret sauce of these search engines now; we just don't know what they are using.
Anyway, a recent paper collected traffic going into and out of the servers at Indiana University. Using this traffic, the authors were able to disprove three major assumptions underlying PageRank. PageRank assumes
- a user is equally likely to follow any link on a page.
- the probability of "teleporting" (or going directly) to any web page is equal to any other web page.
- the probability of "teleporting" from any web page is equal across all web pages.
The bottom line is that the links of the web are not that good at predicting the actual paths people follow while browsing, yet the assumption that link structure determines popularity is the basis of major search engines. The redeeming quality of search engines, according to this paper, is that they lead people to less popular sites, or sites we would not otherwise find, and thus spread the wealth of clicks around (which is in conflict with what I had previously said in my first post on Google bias).
Thursday, November 13, 2008
The Machine is Us/ing Us
Monday, November 3, 2008
Google Bias Take 2
I earlier posted that Google's ranking of search results causes a rich-get-richer problem. In other words, sites linked to most often are ranked first, which leads to even more links.
Here is a paper that uses traffic information from Alexa to disprove this theory. It turns out that queries on search engines are very diverse, which leads to sites appearing towards the top that more specifically target the keywords given. For example, Google's Udi Manber said, "20 to 25% of the queries we see today, we have never seen before."
Current traffic from Alexa more closely follows the random surfer model, that is, discovering web pages by viewing non-search pages and clicking on links. It is good to see that worrisome theories are being put to the test.
Wednesday, October 29, 2008
Pandora.com
For a time I had no hope that recommender systems like Amazon.com's "Recommended for You" section would be useful to me specifically. The predictions were often predictable: buy a CD from artist A and get a list of the most popular CDs from that artist. Not useful.
Some time ago I came across Pandora.com, an adaptive radio station that chooses songs to play based on what songs you have added to a station and which songs you rate positively. I actually learned of several songs and artists I was unfamiliar with that I now like (such as "Question Everything" by 8Stops7). However, it does not play every song that is similar to the songs I give it, and some days I find myself disagreeing with every song played.
I think that as time goes on recommender systems will improve and we will give some credibility to recommenders. Perhaps the Netflix prize will help in that regard.
Netflix Recommender System
Netflix is trying to motivate research in the area of recommender systems and on Oct. 2, 2006 offered $1 million to anyone who could improve upon their current recommender system by a specific measure (a 10% improvement in RMSE). Recently I took a look at the current standings, and one team is very close (an improvement of around 9%). Interestingly enough, they have published a few papers showing how they do it.
Specifically, what we are talking about is collaborative filtering. There are two main approaches: either you look for global patterns in the matrix of ratings (factorization, or latent factor, models) or you use the ratings from similar items or users (neighborhood models). BellKor (the team name) was able to merge these two ideas into a single solution that outperformed (at the time of submission) approaches that used only one of the two.
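For the factorization half, here is a minimal sketch of biased matrix factorization trained with stochastic gradient descent. This is a generic illustration of the idea, not BellKor's actual model; the factor count, learning rate, and regularization values are placeholder assumptions.

import java.util.Random;

public class FactorModelSketch {
    final int factors = 20;                 // assumed number of latent factors
    final double lrate = 0.01, reg = 0.05;  // assumed learning rate and regularization
    final double[][] userF, itemF;
    final double[] userBias, itemBias;
    final double mean;

    FactorModelSketch(int users, int items, double globalMean) {
        Random rnd = new Random(42);
        userF = new double[users][factors];
        itemF = new double[items][factors];
        userBias = new double[users];
        itemBias = new double[items];
        mean = globalMean;
        for (double[] row : userF) for (int f = 0; f < factors; f++) row[f] = 0.1 * rnd.nextGaussian();
        for (double[] row : itemF) for (int f = 0; f < factors; f++) row[f] = 0.1 * rnd.nextGaussian();
    }

    // Predicted rating = global mean + user bias + item bias + dot product of latent factors.
    double predict(int u, int i) {
        double dot = 0;
        for (int f = 0; f < factors; f++) dot += userF[u][f] * itemF[i][f];
        return mean + userBias[u] + itemBias[i] + dot;
    }

    // One stochastic gradient step for a single observed rating r(u, i).
    void train(int u, int i, double rating) {
        double err = rating - predict(u, i);
        userBias[u] += lrate * (err - reg * userBias[u]);
        itemBias[i] += lrate * (err - reg * itemBias[i]);
        for (int f = 0; f < factors; f++) {
            double uf = userF[u][f], vf = itemF[i][f];
            userF[u][f] += lrate * (err * vf - reg * uf);
            itemF[i][f] += lrate * (err * uf - reg * vf);
        }
    }

    public static void main(String[] args) {
        FactorModelSketch model = new FactorModelSketch(100, 50, 3.6);
        model.train(0, 1, 5.0); // in practice, loop over all observed ratings for many epochs
        System.out.println(model.predict(0, 1));
    }
}

The neighborhood half of the paper then adjusts predictions like these using ratings of similar items, and the two pieces are combined into one model.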
What impressed me most about the paper I read (Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model) was that in addition to reporting RMSE on the test set, they tried to look at the user's perspective. We want to know what movie to watch now. They compared other approaches against theirs on whether a movie you would watch and rate a 5 appears in the top 5 or top 20 recommendations. Well done. We should all keep the end user in mind.
Anyone have a really good or bad experience with recommendations made by computers?
Tuesday, September 30, 2008
Do Good Grades Predict Success? (Freakonomics blog entry)
I recently read the post in the title of this blog entry at the Freakonomics blog, which I frequent. I love the question and have wondered about some of the following related questions myself:
- Do grades measure our understanding or ability to learn?
- How fair is it to compare grades of different students from different schools, classes, teachers? (Some teachers are "easy" and some "hard".)
- How well do grades predict success at looking for a job, interviewing well, or being a programmer in the real world?
Wednesday, September 10, 2008
Google Bias
In response to my post Google Wishlist, one reader wrote:
"I would love the Google search results to include more web addresses with a differing suffix than .com (i.e. .net, .org). There are a ton of sites that get overlooked because of the seemed bias that Google has for the .com sites.
"I will admit that they have done a little better on including some of these sites on more popular searches, but as a whole the .com sites seem to get the preference."
I would agree with this reader that Google's results are certainly biased, as they have to give preference in some reasonable way. However, I do not believe that the suffix of a given website is used to rank it (try a search for "Plutonium," for example; the top result as I see it is from Wikipedia.org). I do believe that the results are biased by "link popularity," or PageRank (as explained by the founders of Google, Sergey Brin and Lawrence Page). Basically, as I understand it, ranking in many search engines is based on how many links point to a domain or web page, and on the "quality" of those links.
The decision to go with PageRank was a good choice; it put Google on the map originally. However, there are some drawbacks for people like me. The bias towards more popular pages means that it is more difficult to climb to the top. It is a rich-get-richer web world: those that have links are more easily found, which means they are more easily linked to. This would explain why links from big-name .com sites are more likely to be ranked first. Now, if I wrote the most informative page on Plutonium around, it would likely never beat out the Wikipedia page (everyone is linking to Wikipedia these days). For more on this topic check out the article Impact Of Search Engines On Page Popularity.
In conclusion, Google is necessarily biased, which is fine for Google but bad for the little guys.
"I would love the Google search results to include more web addresses with a differing suffix than .com (i.e. .net, .org). There are a ton of sites that get overlooked because of the seemed bias that Google has for the .com sites.
"I will admit that they have done a little better on including some of these sites on more popular searches, but as a whole the .com sites seem to get the preference."
I would agree with this reader that certainly Google results are biased, as they have to give preference in some reasonable way. Now I do not believe that the suffix of a given website is used to rank the website (try a search for "Plutonium" for example and the top result as I see it is from Wikipedia.org). I do believe that the results are biased by "link popularity" or by PageRank (as explained by the founders of Google, Sergey Brin and Lawrence Page). Basically, as I understand it the basis for ranking in many search engines is based on how many links (and the "quality" of these links) that link to a domain or webpage.
The decision to go with PageRank was a good choice. It put Google on the map originally. However, there are some drawbacks for people like me. The bias towards more popular pages, means that it is more difficult to climb to the top. It is a rich get richer web world. Those that have links, are more easily found, meaning they more easily are linked to. This would explain why one would more likely see links ranked first from big name .com sites. Now if I wrote the most informative page on Plutonium around, it would likely never beat out the Wikipedia page (everyone is linking to Wikipedia these days). For more on this topic check out the article Impact Of Search Engines On Page Popularity.
In conclusion, Google is biased necessarily which is fine for them but bad for the little guys.
Friday, September 5, 2008
Cool Search Engine Interfaces
For those who enjoy trying something new, check out the following search engines:
Kartoo: clusters pages within a search and displays them visually.
If you really get an itching to check out other search engines see the article Top 100 Alternative Search Engines, March 2007.
Sunday, August 31, 2008
Biology + Computer Science = Proof of God ?
I am a Computer Scientist, or so I claim. I am working on my PhD in Computer Science and recently took a class on Bioinformatics, where I learned about the mysterious world of DNA. Certainly I am no expert in the field, but what a complex system we have found in DNA. Most living things have DNA (or RNA in some form or another), which changes through time and encodes what we are made of. The DNA is translated through multiple steps into proteins. These proteins fold into varying shapes, which are not constant through time and are part of metabolic pathways. This is about where my knowledge of the whole process ends, from DNA up to the functions of a cell.
I do not know how accurate the comparison is, but I see DNA in a similar way to computer languages, like Perl or Java. Computer languages are a set of very simple instructions that a human understands, which are compiled into binary that tells the CPU how to manipulate various memory locations in a computer. Computer languages are built on top of the mechanics of a computer system. The entire process of designing a computer, building a computer, designing a computer language, specifying everything the language understands, building a compiler for the language, and finally building a computer program from that language which works properly is a very organized process. Chaos does not create the final product; it takes endless hours of planning, logical organization, and testing to be sure that what one has created has no major, show-stopping flaws. In the end all you or I may see is something like the text on this screen, transmitted perhaps hundreds of miles through wires to be displayed on your computer screen.
As I see it, DNA has a set of rules that it follows (though scientists have not discovered all of them) and is part of a large, complex process that makes it possible for the encoding of life to be passed on to the next generation. This entire process, as well as all other components of a living organism, such as reproduction, eating mechanisms, having food available, and aging, are all necessary parts of even the simplest single-celled living organisms. I do not understand how the world could make the leap from no life on earth to a single living cell capable of encoding its being and reproducing without a creator involved in the process. In my limited experience, chance does not bring about large-scale organization.
I believe there is a God and that the complexities of biology, for example, show that God was involved in the creation of life as we know it.
Tuesday, August 26, 2008
Google Wishlist
What features do you wish Google Search had but doesn't?
Yesterday marks the start of my second year as a PhD student. I have been working on click fraud research, but soon I will be switching to Text Mining, and I need some ideas as to what I should do. To get the creative juices flowing, we will start by listing features that you wish Google Search had: any type of feature, large or small, no matter how realistic it is. Here are some that I have come across that sound interesting:
- Summarize Opinions: What if you could search for "people's opinions on restaurant X" and the response would be a summary of all opinions on all websites about that restaurant, say, 35% liked it and 65% did not (with links to each group), followed by a summary of why they did or did not like it. (For an example see Live's search summary of opinions, limited to a few products such as "Canon EOS Digital Rebel XT - digital camera, 8MP, 3x Optical Zoom".)
- Search Topics, not always Keywords: I search for papers on text mining on Google Scholar and find only a handful of papers that relate to what I am looking for. Most of the papers I would be interested in do not actually say "Text Mining" in the title, and sometimes not even in the paper, but that is what they are talking about. I want the category "Text Mining," not the keywords "Text Mining".
- Answer My Question: Sometimes I will search using a whole question, not just keywords, and find someone else who asked the question on a forum but did not get a response. I want to find the answer, not the question. This is not a new thought; there are already several approaches attempting to do this, but none of them reach 90% accuracy on open-ended questions. (Examples: Ask Jeeves, START from MIT, etc.)
Monday, August 18, 2008
WordNet 3.0 Summary Statistics
I recently downloaded WordNet 2.1 (an ontology for the English language) for Windows and was impressed by the results for even basic words: several senses are displayed, each with definitions and example sentences. I thought I would run a few queries on WordNet. Fortunately, under the "Related Projects" link on WordNet, someone has already loaded WordNet 3.0 into PostGreSQL (http://sourceforge.net/project/showfiles.php?group_id=135112&package_id=219735/&abmode=1).
Here are some of my basic questions:
- How many words are there in WordNet 3.0?
147,306
According to one source (I am sure there is plenty of debate about how many words are in the English language in common usage), "The Second Edition of the Oxford English Dictionary contains full entries for 171,476 words in current use, and 47,156 obsolete words." (from AskOxford) Of course there were several other answers from other websites, but if the same definition of a "word" is in use, then WordNet is doing pretty well.
- What are examples of words found in WordNet, to get a sense of what the dictionary contains?
52850;"gas system"
103824;"predicate"
27754;"committal to memory"
81032;"malapropism"
18769;"butt hinge"
97732;"pectinidae"
46875;"family termitidae"
Here we see that a "word" is not necessarily a single token. Also, the numbers next to each word are the IDs found in the database.
- How many senses does a word have?
mean: 1.4, median: 1
senses  words    percent
1       120433   82%
2       15711    11%
3       5116     3%
...
57      1        0%   "run"
70      1        0%   "cut"
75      1        0%   "break"
- How many hypernyms (broader words) does a word have?
mean: 2.17, median: 1
hypernyms  words   percent
0          35649   24%
1          46213   31%
2          27468   19%
...
113        1       0%   "hold"
125        1       0%   "cut"
156        1       0%   "break"
- How many hyponyms (specific examples) does a word have in WordNet?
mean: 2.17, median: 0
hyponyms  words    percent
0         118372   80%
1         5427     4%
2         4250     3%
...
897       1        0%   "herbaceous plant"
913       1        0%   "herb"
1007      1        0%   "change"
- How many synonyms does a word have?
mean: 2.07, median: 1
synonyms  words   percent
0         36916   25%
1         48004   33%
2         24998   17%
...
74        1       0%   "hold"
97        1       0%   "pass"
99        1       0%   "break"
It appears that only a small percentage of words (20%) have hyponyms, though those that do may have quite a few (no surprise). Interestingly, the mean number of hyponyms and hypernyms per word is the same, since there are the same total number of hypernym and hyponym links; the relationship is apparently symmetric.
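For anyone curious how counts like these are produced, here is a sketch of the kind of query I ran, wrapped in JDBC. Note that the table and column names (sense, wordid) are assumptions for illustration; the actual schema from the SourceForge project may differ.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class WordNetCounts {
    public static void main(String[] args) throws Exception {
        // Connection details are placeholders; point this at the database created
        // from the SourceForge WordNet 3.0 dump.
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/wordnet30", "user", "password");

        // For each number of senses, count how many words have that many senses.
        // NOTE: the table and column names (sense, wordid) are assumptions; the
        // actual schema may differ.
        String sql =
            "SELECT senses, COUNT(*) AS words " +
            "FROM (SELECT wordid, COUNT(*) AS senses FROM sense GROUP BY wordid) t " +
            "GROUP BY senses ORDER BY senses";

        try (Statement st = conn.createStatement(); ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getInt("senses") + "\t" + rs.getInt("words"));
            }
        }
        conn.close();
    }
}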
Thursday, August 14, 2008
First Experience with Named Entity Recognition
I must preface this post by saying that I am fairly new to Text Mining and NLP. I just tried the named entity recognition that ships with the GATE tool on the Wikipedia entry for the University of Louisville, with all parameters set to the defaults for ANNIE. The results can be seen by taking a look at the included image.
The results suggest to me that this is a very difficult task. For most abbreviations, such as NIH, NCAA, and U of L, the tool could not determine what type of entity they were. It is also interesting that the University of Louisville is labeled differently in different parts of the text. The table description (stripped of its structure in this context) labels just "Louisville" as a location, which is true but not the best labeling, while the second occurrence in the first paragraph labels the "University of Louisville" as an organization, which is the correct label.
It is also interesting that when the full name of the NIH was spelled out, the word "National" was left out of the annotation, and the abbreviation "NIH" was not labeled as an organization but as an unknown entity.
Lastly, the labeling of persons did not do well. The president of the university is labeled correctly (Dr. James R. Ramsey), but "Faculty", "Urban", and "General Assembly" are also labeled as people. The precision of the persons label is 1/5, not so good, though I guess the recall is a perfect 1/1.
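As a tiny illustration of where those precision and recall numbers come from (the counts are just the ones I eyeballed from this one page):

public class NerScore {
    public static void main(String[] args) {
        int truePositives = 1;    // "Dr. James R. Ramsey" correctly labeled as a person
        int predictedPersons = 5; // "Faculty", "Urban", "General Assembly", etc. were also labeled
        int actualPersons = 1;    // only one real person mention on the page I checked

        double precision = (double) truePositives / predictedPersons; // 1/5 = 0.20
        double recall = (double) truePositives / actualPersons;       // 1/1 = 1.00
        System.out.printf("precision = %.2f, recall = %.2f%n", precision, recall);
    }
}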
GATE-4.0 Vista Bug
If you are new to GATE and are using Windows Vista, you may come across the following exception in the message tab:
Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException
at sun.swing.table.DefaultTableCellHeaderRenderer.getColumnSortOrder(DefaultTableCellHeaderRenderer.java:104)
at com.sun.java.swing.plaf.windows.WindowsTableHeaderUI$XPDefaultRenderer.getTableCellRendererComponent(WindowsTableHeaderUI.java:108)
at gate.swing.XJTable.calculatePreferredSize(XJTable.java:105)
at gate.swing.XJTable.getPreferredSize(XJTable.java:130)
at gate.swing.XJTable.getPreferredScrollableViewportSize(XJTable.java:86)
at gate.gui.FeaturesSchemaEditor.populate(FeaturesSchemaEditor.java:169)
...
It turns out that the Java GUI libraries behave a little differently on Vista than on Windows XP. To fix the problem, simply go to Options > Configuration and change the "Look and feel" from "Windows" (the default) to, say, "Metal" and you have a workaround. You should no longer see that exception.
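The same workaround can be applied programmatically if you are embedding GATE in your own code: force Swing's cross-platform ("Metal") look and feel before any windows are created. This is plain Swing, not a GATE-specific API; a minimal sketch:

import javax.swing.UIManager;

public class ForceMetalLaf {
    public static void main(String[] args) throws Exception {
        // Use the cross-platform (Metal) look and feel so the Vista-specific
        // table header renderer that triggers the NullPointerException is never used.
        UIManager.setLookAndFeel(UIManager.getCrossPlatformLookAndFeelClassName());
        // ... then launch the GUI as usual.
    }
}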
Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException
at sun.swing.table.DefaultTableCellHeaderRenderer.getColumnSortOrder(DefaultTableCellHeaderRenderer.java:104)
at com.sun.java.swing.plaf.windows.WindowsTableHeaderUI$XPDefaultRenderer.getTableCellRendererComponent(WindowsTableHeaderUI.java:108)
at gate.swing.XJTable.calculatePreferredSize(XJTable.java:105)
at gate.swing.XJTable.getPreferredSize(XJTable.java:130)
at gate.swing.XJTable.getPreferredScrollableViewportSize(XJTable.java:86)
at gate.gui.FeaturesSchemaEditor.populate(FeaturesSchemaEditor.java:169)
...
It turns out that the Java libraries for GUI's are a little different in Vista from Windows XP. To fix the problem simply go to Options > Configuration and change the "Look and feel" from "Windows" (the default) to say "Metal" and you have a work around. You should no longer see that exception again.
Friday, August 1, 2008
Mining Topic-Specific Concepts and Definitions on the Web
I just finished reading the paper of the same name as this post, by Bing Liu, Chee Wee Chin and Hwee Tou Ng. I had two thoughts while reading this paper. First, though we all know that search engines are a one-size-fits-all application, I have not really thought about it before. Is it not interesting that an 8-year-old and a biologist who has been doing research for 20 years can both type the term "ant" into Google and receive the same results (Apache's Ant project for building Java applications)?
Sure, without signing in to Google, what could they know about the individual making the search? What is really interesting is that we are giving Google and other search engines practically nothing to work with, expecting them to read our minds, and for most of my searches they do pretty well. (Try searching for "mormons bosnia" and see what you get. By the way, LDS is an abbreviation often used to refer to Mormons.)
Second (as in the second idea I had from this paper), it would be really nice if we could summarize 100 web pages of results into a single table. What to include and what not to include are obvious questions. But imagine if our search for "data mining," as discussed in this paper, returned a summary of the definitions given for data mining, for example "_" (53 pages), "_" (6 pages) and "_" (2 pages), where each "_" would be a definition and the parentheses would link to results pages containing all of those pages. It would also be nice if there were some intelligent analysis of the definitions, so that each of those 53 pages for the first definition, though each page did not write the same definition, for our purposes held the same meaning. Certainly each individual has different purposes, so definition equality is up for debate in a lot of cases depending on who you talk to. This idea could definitely use a lot of refinement.
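A rough sketch of that grouping idea: collapse near-identical definitions by word overlap (Jaccard similarity) and report how many pages fall into each group. The 0.7 threshold is an arbitrary assumption, and this is far cruder than what the paper itself does.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DefinitionGrouper {

    // Greedily place each definition into the first group whose representative
    // it overlaps with strongly enough; otherwise start a new group.
    public static List<List<String>> group(List<String> definitions, double threshold) {
        List<List<String>> groups = new ArrayList<>();
        for (String def : definitions) {
            boolean placed = false;
            for (List<String> g : groups) {
                if (jaccard(def, g.get(0)) >= threshold) {
                    g.add(def);
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                List<String> g = new ArrayList<>();
                g.add(def);
                groups.add(g);
            }
        }
        return groups;
    }

    // Word-overlap similarity between two definitions.
    static double jaccard(String a, String b) {
        Set<String> sa = tokens(a), sb = tokens(b);
        Set<String> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    static Set<String> tokens(String s) {
        return new HashSet<>(Arrays.asList(s.toLowerCase().split("\\W+")));
    }

    public static void main(String[] args) {
        List<String> defs = Arrays.asList(
            "Data mining is the process of discovering patterns in large data sets",
            "Data mining is the process of finding patterns in large data sets",
            "Data mining is another name for knowledge discovery in databases");
        for (List<String> g : group(defs, 0.7)) {
            System.out.println("\"" + g.get(0) + "\" (" + g.size() + " pages)");
        }
    }
}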
Thursday, June 19, 2008
OpenEphyra
Open-domain question answering systems seek to answer questions posed in natural language using answers found in their document collection. Rather than returning a link to a document similar to the question, the answer itself is returned with a link to where it was found.
Recently I read that it would be rare to find a software package including all the pieces of a question answering system. There are usually many pieces to such systems, gathered from text mining and natural language processing, as well as several tuned parameters. However, the good news for the question answering community is that there is now an open source system available: OpenEphyra. This system has been put together by a Ph.D. student at CMU, Nico Schlaefer. The system appears to be easily extensible and has built-in functionality to test itself on TREC QA track competitions. Great work, Nico Schlaefer and anyone else who has helped.
Wednesday, June 18, 2008
Contextual Computing
I have now subscribed in Google Reader to several blogs, one of which is Joel Dehlin's blog. He is the CIO for the LDS church. His latest post is on the top 10 most disruptive technologies for 2008-2012, where he cites an article from ehomeupgrade.com. It is an interesting little list, though I am not sure what is meant by "fabric computing" and "semantics". One of the topics that stood out was "Contextual Computing".
As I understand it, contextual computing refers to an application's awareness of a user's environment and usual actions. The most common applications would be cell phones equipped with GPS that alert the user to nearby restaurants or other nearby services. I would guess that Google's use of a user's IP location to filter ads could also be considered contextual computing. I would suggest that a good contextual computing application would perhaps go unnoticed by the user, but would raise their expectations of how well a computer understands their intentions. For example, if Word noticed that nearly every time it automatically uppercased a word after a period the user undid the action, then Word would stop automatically uppercasing words after periods. Once the user caught on that Word had caught on to this pattern, the user would expect Word to catch on to all other useful patterns.
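A toy sketch of that Word example (the class, threshold, and counters are my own invention for illustration, not any real Word behavior): keep auto-capitalizing after a period until the user has rejected the correction often enough.

public class AdaptiveAutoCapitalize {
    private int corrections = 0;
    private int undos = 0;
    private boolean enabled = true;

    // Capitalize the first word of a new sentence, unless the feature has turned itself off.
    public String afterSentence(String word) {
        if (!enabled || word.isEmpty()) {
            return word;
        }
        corrections++;
        return Character.toUpperCase(word.charAt(0)) + word.substring(1);
    }

    // Called whenever the user undoes one of our corrections.
    public void userUndidCorrection() {
        undos++;
        if (corrections >= 10 && undos >= 0.9 * corrections) {
            enabled = false; // the user rejects nearly every correction, so stop making them
        }
    }

    public static void main(String[] args) {
        AdaptiveAutoCapitalize auto = new AdaptiveAutoCapitalize();
        System.out.println(auto.afterSentence("the")); // "The"
        auto.userUndidCorrection();                    // the user puts it back
    }
}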
Of course, there is the possibility that certain patterns noticed by Word or other applications would not be appreciated. Take, for example, a user who opens Yahoo!'s home page roughly every 30 minutes to check for email. If the browser or operating system started to predict this behavior and automatically opened Yahoo!, that would make me uncomfortable.
Lastly, some applications only exist because of contextual computing. All of the new features associated with GPS devices in cars are of no use without location awareness.
It will be interesting to see what creative uses await for contextual computing.
Friday, May 9, 2008
What is a "Tree Pattern"?
Perhaps this may be obvious to you, but it was not to me. I was recently reading "Mining tree patterns with Almost Smallest Supertrees" from the Proceedings of the 2008 SIAM International Conference on Data Mining. My first question was: what is a "tree pattern"? On a first skim, I could not find an example or a simple definition. The paper is referring to frequent trees in tree-structured data. For example, one of the data sets uses a web log containing user session data. A session can be described as a tree where each page visited is a node and each edge is directed from the earlier page to the later page in time. In this case a tree pattern could be that people coming from the "edu" domain often go to page A followed by page B. However, it is still not clear to me whether part of a session can be a pattern, or whether a frequent tree only refers to entire trees. Help me out if you know the answer.
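For concreteness, here is a minimal sketch of how a browsing session might be modeled as a tree in the way described above; the field names are my own illustration, not the paper's notation.

import java.util.ArrayList;
import java.util.List;

public class PageNode {
    final String url;                              // the page visited
    final List<PageNode> next = new ArrayList<>(); // pages visited from here, in time order

    PageNode(String url) {
        this.url = url;
    }

    void addChild(PageNode child) {
        next.add(child);
    }

    public static void main(String[] args) {
        // One session: the user lands on the index page, then visits page A, then page B from A.
        PageNode home = new PageNode("/index.html");
        PageNode a = new PageNode("/pageA.html");
        home.addChild(a);
        a.addChild(new PageNode("/pageB.html"));
    }
}

A "tree pattern" would then be a subtree (for example, page A followed by page B) that occurs in many such session trees; whether partial subtrees count is exactly the open question above.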
Thursday, March 20, 2008
Recommendation Systems
I just returned from my Web Mining class. A student made the point there that never in his 10 years of using Amazon.com has their recommendation system suggested something that even made sense to him, let alone led to a purchase. He has also rated some 4,000 movies on Netflix in hopes of finding some new movie that he would enjoy. He now has a very low opinion of recommendation systems.
I would agree with him, though my opinion is not so strong. No recommendation system has ever really worked for me. Either the recommendation is obvious (I bought a CD by artist A, so I am recommended CDs by artist A that I am already aware of), or the recommendation is for something unrelated and not useful. The only time a recommendation system seems to work for me is when I am new to a specific area and am looking for the most popular books in that area.
Recommendation systems have a difficult task. Computer systems cannot understand the human mind. Everyone makes a purchase or likes a specific book for a reason, and those reasons differ between people making the same purchase. How can you then recommend the same thing to two different people who bought that book for different reasons? It does not make sense.
The one case I thought of where recommendations have worked for me is Pandora.com. That music recommendation system has led me to learn of new artists and songs that I like. It made many wrong suggestions, but I like enough music that eventually something new was played that I liked, not through a wonderful system but more through trial and error within a genre.
Monday, February 18, 2008
Who would cause the disabling of Google network partner sites
I would love access to the data on which blogs have been removed from Google's network, how long they had been with Google, and what exactly merited their removal. Certainly such information would not be made available by Google, but wouldn't it be interesting to look into this data to see whether there were similar root causes across a number of cases? I would love to see proof in the data for claims that someone else is causing accounts to be shut down. For example, see this latest post on the subject and the comments relating to it (http://www.websitebabble.com/adsense-other-ad-serving-programs/1477-banned-click-fraud.html).
Tuesday, February 12, 2008
How can we detect all types of click fraud?
It seems that there are several approaches out there to committing click fraud. The methods that are now out in the open, meaning either academia discovered them first or they have been used for some time now, are the following:
- Using a click-ring
- Using a single click bot possibly by proxy
- Using a botnet or network of robots
- Hiding clicks in Flash or Javascript
The first approach takes a number of humans, each being paid peanuts, to click on advertisements. One example of a public website which uses this as part of its business model is Click Monkeys, which states specifically that they are not in the US and not subject to US law. Now, if you are a business based in the US and abide by the laws of this land, then I would hope that you would not contact a business that is not "restrained" by those laws to do your dirty work.
Next we have the click bot. This is perhaps a dying breed; I found that Clickzilla no longer has a functioning website. Of course click fraud was not their sales pitch, but there was no doubt this software could be used for such a purpose. There are several proxy lists that exist online, not to mention Tor, which allow a single machine to make requests through other machines on the internet. From what I understand, Google has blacklisted a number of these proxies. Naively, I thought that this approach could be noticed easily by checking variables such as the referrer or user agent to find associations too high for normal traffic. An explanation of such a robot can be found at this link. A robot could randomly choose such attributes from a list and randomly choose periods of silence. However, how does one make sure to download the entire landing page (links and all) and enable JavaScript? Common robots do not support the interpretation of JavaScript (which Google uses to notice characteristics of click fraud).
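As a toy illustration of that naive signal, here is a sketch that flags traffic where one user-agent string accounts for an implausibly large share of clicks. The thresholds and the shape of the input are assumptions for illustration, not any real filter.

import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UserAgentConcentration {

    // Flag a batch of clicks if any single user-agent string accounts for more
    // than maxShare of the traffic (only once there is enough traffic to judge).
    public static boolean looksSuspicious(Iterable<String> userAgents, double maxShare) {
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (String ua : userAgents) {
            counts.merge(ua, 1, Integer::sum);
            total++;
        }
        if (total < 100) {
            return false; // too little traffic to say anything
        }
        for (int c : counts.values()) {
            if ((double) c / total > maxShare) {
                return true; // one user-agent dominates far beyond normal traffic
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> clicks = Collections.nCopies(150, "Mozilla/4.0 (compatible; MSIE 6.0)");
        System.out.println(looksSuspicious(clicks, 0.5)); // true: one agent is 100% of the clicks
    }
}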
The next approach involves delving into the dark side of hacking. There exist large networks of zombies, or computers that have been compromised in one way or another for use by some central authority, and there are even botnets for rent. For a large sum of money one could take possession of, say, 100,000 machines and launch an attack undetectable by IP alone. Clickbot.a was an example of this type of attack, but Google was able to decipher the signals of this attack, or they would not have released source code and a complete description.
Lastly, one can potentially use hidden frames or Flash advertisements to create clicks on advertisements from users who either visit a specific site cooperating with another site via the hidden frame, or who merely view an advertisement. In this way IPs and visitor locations are as dispersed and normal-looking as the traffic to the site or sites on which these cheats appear. The attributes of this traffic would in most cases be completely indistinguishable from normal traffic.
My question is, with all of these known tricks and the yet-to-be-discovered tricks that fraudsters use, how can one go about detecting all of these different types of fraud? Certainly there are many sources that would suggest that they are pretty good at detecting fraud, like Google or any third-party web traffic auditing company. However, I really wonder how much click fraud is getting through the best filters out there. In reality I would like to see online advertising continue to flourish, since it leads to an ever increasing number of profitable websites paid for by advertisers instead of users. I would prefer to keep so many good services free of charge. What can we do?
Wednesday, January 23, 2008
Click Fraud Perception is the Number One Enemy
As I have read blogs around the internet, there are a number that say that click fraud is adversely affecting Google and Yahoo! They say that Google's wallet is getting fattened by click fraud while the little guys, the advertisers, spend more and more money on clicks that are depreciating in value, or that advertisers are slowly decreasing their ad spending due to the fear of click fraud.
No one can say with absolute certainty how much click fraud is taking place on the internet, or how many clicks go unfiltered and are charged to advertisers. What we do see is that according to Google, less than 10% of clicks are invalid (only some of these are considered fraudulent) and are filtered, while according to Click Forensics click fraud is in the range of 28.1%. Each of these two companies benefits from the opposite perception about click fraud. Google would like us to think that it is for the most part under control, and that rarely would one be charged for invalid clicks. Click Forensics and others like them would like to sell their services to people who believe click fraud is rampant. As a side note, Google refunds only about 0.02% of clicks that go unfiltered.
I believe that the real enemy of the PPC (pay-per-click) model of advertising is the perception, spread by third parties and the media, that click fraud is rampant. I would like to know how many advertisers cut back their budgets based on fears without inspecting their ROI.
I do not suspect that the fear of click fraud in advertisers will bring about the ruin of Google. However, I would suspect that it is hurting Google and advertisers more than actual click fraud.
Tuesday, January 22, 2008
Click Fraud Antics
I have begun to read into the area of click fraud. In doing so I have come across several blogs in which the owner complains that their AdSense account has been disabled due to click fraud, which they of course did not perpetrate. I would suspect some of these bloggers are not as innocent as their story claims, but I am certain that there are some innocent victims of this crime. There is no way to tell whether the click fraud identified by Google originates from the author of the blog or not. In every case the author complains that Google will not level with them and divulge what exactly was detected as click fraud, so that the author could rectify the situation.
I understand that Google is worried that the more a fraudster knows about how they detect click spam, the easier it will be for the fraudster to go undetected. That is why they cannot share any of their secrets with the authors of blogs, who would certainly share any information they learned on their blogs. However, any action that Google takes against fraud gives away some small detail about what they do.
There are two sources of information that could potentially be used against Google. The first is that Google reports to AdSense users the number of clicks for which they were given credit. The second is warnings or the disabling of accounts. In the first case, if a fraudster were able to make an AdSense account for a website not exposed to the general public, then they could generate any number of attacks, each with a different number of clicks associated with it, in such a way that the final number of clicks reported for that page would tell the attacker exactly which attacks failed and which succeeded. Of course, such attacks would very quickly result in the suspension of the account. The attacker would need to make several accounts and perform fraud in amounts small enough not to be noticed so quickly. The question then arises: how easily could a fraudster generate several accounts without any of these new accounts being traceable back to the fraudster?
This question leads to the second source of information which could be used by fraudsters to discover Google's fraud detection suite. Since accounts are being disabled for fraudulent behavior, an attacker could use the feedback from a number of unsuspecting bloggers. The attacker would choose a number of blogs on which to carry out the attack, generate fraudulent clicks on a particular blog until the clicks were detected and the blogger's AdSense account disabled, then fashion a new attack and repeat the whole process on another blogger's account. If successful, the attacker would not make any money on the attack, but would have a working prototype to use for their own purposes in the future. In this way an attacker receives a certain amount of feedback from Google without risking their own identity, shuts down a number of AdSense accounts that they may have competed with, and, if it is their purpose, creates a number of enemies for Google.
In the last attack, how could Google distinguish fraudulent clicks authorized by the author of a blog or website from attacks originating from outside sources? One cannot assume that click fraud would only be performed by the owner of the AdSense account or an accomplice of that owner. From Google's side, one could perhaps compare similar click fraud attacks across websites and find a common signature among a number of attacks. This would not prove any of the AdSense account holders innocent, but it may point to some of them being innocent. I am not sure that there is an easy answer to protecting the innocent in this last case.
Wednesday, January 16, 2008
Choice to Attend U of Louisville for a Ph.D. in CECS
This is my second semester at the University of Louisville as a Ph.D. student. Before coming here I had a choice to make: I received acceptance letters from both the University of Minnesota and the University of Louisville. I am quite certain that I made the right decision considering my circumstances. The following paragraphs will seek to persuade potential students considering a Computer Science degree to attend the University of Louisville.
First let me state that I have only attended two universities in my day: BYU and the U of L (as it is referred to here in Kentucky). I have nothing against any other university, including the University of Minnesota (which I was never able to visit). I will not focus on reasons not to attend other schools; instead I will only focus on the reasons one would pick the University of Louisville.
The main reason I have for standing by my choice of schools is personal attention. I have made only a very weak attempt at getting to know the professors here at the U of L in the CECS (Computer Engineering and Computer Science) department. However, with such small class sizes and a small number of students working towards Ph.D.'s in my department, it is difficult not to be known on a first-name basis. For example, I have talked several times with the department chair, Dr. Elmaghraby, from whom I have never taken a class and who should perhaps be too busy to get to know me personally, including spending more than an hour in his office with only a simple question on my mind.
The second reason, which has been very agreeable to my wife, is more affordable living than we previously had. I moved to Kentucky with my wife and two kids. We were able to find an apartment for less than we were paying in Utah, with more space, and Utah is not one of the more expensive places to live, to my understanding.
Lastly, I have enjoyed my brief stay in Kentucky. People are friendlier and more patient here than what I have experienced in other parts of the country. Certainly there are drivers who prefer to double the speed limit, and some people who are dishonest; find me a state where this is not true. We have also come to love the large number of trees in this area. There are states just as green or greener in terms of landscape than Kentucky, but this has to be one of those things that quickly jumps out at you coming from the West (not including the coastal states).
If you are interested in applying to the University of Louisville and still have doubts (how could you not, considering the time investment of getting a Ph.D.?), please leave a question in the comments section, and I can tell you specifics as they relate to your questions.