Sunday, August 31, 2008

Biology + Computer Science = Proof of God ?

I am a Computer Scientist, or so I claim. I am working on my PhD in Computer Science and recently took a class on Bioinformatics, where I learned about the mysterious world of DNA. I am certainly no expert in the field, but what a complex system DNA has turned out to be. Most living things have DNA (or RNA in some form or another), which changes over time and encodes what we are made of. The DNA is translated through multiple steps into proteins. These proteins fold into varying shapes, which are not constant through time, and take part in metabolic pathways. This is about where my knowledge of the whole process, from DNA up to the functions of a cell, ends.

I do not know how accurate the comparison is, but I see DNA in a similar way to computer languages, like Perl or Java. Computer languages are a set of very simple instructions that a human understands, which are compiled into binary that tells the CPU how to manipulate various memory locations in a computer. Computer languages are built on top of the mechanics of a computer system. The entire process of designing a computer, building a computer, designing a computer language, specifying everything the language understands, building a compiler for the language, and finally building a program in that language that works properly is a very organized process. Chaos does not create the final result; it takes endless hours of planning, logical organization and testing to be sure that what one has created has no major, show-stopping flaws. In the end, all you or I may see is something like the text on this screen, transmitted perhaps hundreds of miles through wires to be displayed on your computer screen.

As I see it, DNA has a set of rules that it follows (though scientists have not discovered all of them) and is part of a large, complex process that makes it possible for the encoding of life to be passed on to the next generation. This entire process, as well as all the other components of a living organism, such as reproduction, eating mechanisms, having food available, and aging, are necessary parts of even the simplest single-celled organisms. I do not understand how the world could make the leap from no life on earth to a single living cell capable of encoding its being and reproducing without a creator involved in the process. In my limited experience, chance does not bring about large-scale organization.

I believe there is a God and that the complexities of biology, for example, show that God was involved in the creation of life as we know it.

Tuesday, August 26, 2008

Google Wishlist

What features do you wish Google Search had but doesn't?

Yesterday marked the start of my second year as a PhD student. I have been working on click fraud research, but I will soon be switching to Text Mining and need some ideas as to what I should do. To get the creative juices flowing, let's start by listing features that you wish Google Search had: any type of feature, large or small, no matter how realistic it is. Here are some that I have come across that sound interesting:
  • Summarize Opinions: What if you could search "people's opinions on restaurant X" and the response would be a summary of all opinions about that restaurant across all websites? Say, 35% liked it (with links to those who did) and 65% did not, followed by a summary of why they did or did not like it. (For an example, see Live's search summary of opinions, limited to a few products such as "Canon EOS Digital Rebel XT - digital camera, 8MP, 3x Optical Zoom".)
  • Search Topics not always Keywords: I search for papers on text mining on Google Scholar and find only a handful of papers that relate to what I am looking for. Most of the papers I would be interested in do not actually say "Text Mining" in the title, and sometimes not even in the paper but that is what they are talking about. I want the category "Text Mining" not the keywords "Text Mining".
  • Answer My Question: Sometimes I will search using a whole question, not just keywords, and find someone else who asked the same question on a forum but did not get a response. I want to find the answer, not the question. This is not a new thought. There are already several approaches attempting to do this, but none of them are in the 90% accuracy range on open-ended questions. (Examples: Ask Jeeves, START from MIT, etc.)
Please leave your thoughts. Anything welcome.

Monday, August 18, 2008

WordNet 3.0 Summary Statistics

I recently downloaded WordNet 2.1 (an ontology for the English language) for Windows and was impressed by the results for even basic words: several senses were displayed, each with a definition and example sentences. I thought I would run a few queries on WordNet. Fortunately, under the "Related Projects" link on the WordNet site, someone has already loaded WordNet 3.0 into PostgreSQL (http://sourceforge.net/project/showfiles.php?group_id=135112&package_id=219735/&abmode=1).

Here are some of my basic questions:
  1. How many words are there in WordNet 3.0?
    147,306
    According to one source (I am sure there is plenty of debate about how many words in the English language are in common usage), "The Second Edition of the Oxford English Dictionary contains full entries for 171,476 words in current use, and 47,156 obsolete words." (from AskOxford) Of course other websites gave several other answers, but if the same definition of what counts as a word is in use, then WordNet is doing pretty well.
  2. What are examples of words found in WordNet to get a sense of what the dictionary contains?
    52850;"gas system"
    103824;"predicate"
    27754;"committal to memory"
    81032;"malapropism"
    18769;"butt hinge"
    97732;"pectinidae"
    46875;"family termitidae"
    Here we see that a "word" in WordNet is not necessarily a single token. The number next to each word is its ID in the database.
  3. How many senses does a word have?
    mean: 1.4, median: 1
    senses  words   percent
    1       120433  82%
    2       15711   11%
    3       5116    3%
    ...
    57      1       0%  "run"
    70      1       0%  "cut"
    75      1       0%  "break"

  4. How many hypernyms (broader words) does a word have?
    mean: 2.17, median: 1
    hypernyms  words  percent
    0          35649  24%
    1          46213  31%
    2          27468  19%
    ...
    113        1      0%  "hold"
    125        1      0%  "cut"
    156        1      0%  "break"


  5. How many hyponyms (specific examples) does a word have in WordNet?
    mean: 2.17, median: 0
    hyponyms  words   percent
    0         118372  80%
    1         5427    4%
    2         4250    3%
    ...
    897       1       0%  "herbaceous plant"
    913       1       0%  "herb"
    1007      1       0%  "change"

  6. How many synonyms does a word have?
    mean: 2.07, median: 1
    synonyms  words  percent
    0         36916  25%
    1         48004  33%
    2         24998  17%
    ...
    74        1      0%  "hold"
    97        1      0%  "pass"
    99        1      0%  "break"
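
The distribution tables above (mean, median, and the share of words at each count) can be reproduced from any mapping of word to count. Here is a minimal Python sketch, assuming the per-word counts have already been pulled out of the database; the function name `summarize` and the toy data are hypothetical and are not taken from WordNet.

```python
from collections import Counter
from statistics import mean, median


def summarize(counts_per_word):
    """Return (mean, median, rows), where each row is (count, words, percent)."""
    values = list(counts_per_word.values())
    histogram = Counter(values)
    total = len(values)
    rows = [
        (value, n_words, round(100 * n_words / total))
        for value, n_words in sorted(histogram.items())
    ]
    return mean(values), median(values), rows


# Toy data for illustration only (not real WordNet counts).
toy = {"alpha": 1, "beta": 1, "gamma": 1, "delta": 2, "epsilon": 5}
avg, mid, rows = summarize(toy)
print(f"mean: {avg:.2f}, median: {mid}")
for value, n_words, percent in rows:
    print(value, n_words, f"{percent}%")
```

The same function would work for senses, hypernyms, hyponyms or synonyms, since each table is just a histogram over one number per word.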

Summary:
It appears that only a small percentage of words (20%) have hyponyms, though those that do may have quite a few (no surprise). Interestingly, the mean number of hyponyms per word and the mean number of hypernyms per word are the same, because the total number of hypernym links equals the total number of hyponym links. The relationship is apparently symmetric: if A is a hypernym of B, then B is a hyponym of A.
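
To see why the two means have to match, here is a small Python sketch on a made-up graph (the edges are hypothetical, not taken from WordNet). Every (word, hypernym) pair adds one to the hypernym count of one word and one to the hyponym count of another, so the two totals are always equal, and dividing by the same number of words gives the same mean.

```python
from collections import defaultdict

# Hypothetical toy edges (word, hypernym); not real WordNet data.
edges = [
    ("dog", "canine"),
    ("wolf", "canine"),
    ("canine", "carnivore"),
    ("cat", "feline"),
    ("feline", "carnivore"),
]

hypernyms = defaultdict(set)  # word -> broader words
hyponyms = defaultdict(set)   # word -> more specific words
for word, broader in edges:
    hypernyms[word].add(broader)
    hyponyms[broader].add(word)

words = {w for edge in edges for w in edge}
total_hypernyms = sum(len(hypernyms[w]) for w in words)
total_hyponyms = sum(len(hyponyms[w]) for w in words)

# Same total, same number of words, so the same mean.
mean_hypernyms = total_hypernyms / len(words)
mean_hyponyms = total_hyponyms / len(words)
print(total_hypernyms, total_hyponyms)
print(mean_hypernyms == mean_hyponyms)
```

The medians can still differ, as they do in the tables above, because the links are spread very differently: many words have one or two hypernyms, while a few general words collect most of the hyponyms.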

Thursday, August 14, 2008

First Experience with Named Entity Recognition


I must preface this email by saying that I am fairly new to Text Mining and NLP. I just tried named entity recognition, as implemented in the GATE tool, on the Wikipedia entry for the University of Louisville. All parameters were left at the defaults for ANNIE. The results can be seen in the included image.

The results suggest to me that this is a very difficult task. For a majority of abbreviations, such as NIH, NCAA and U of L, the tool could not determine what type of entity they were. It is also interesting that the University of Louisville is labeled differently in different parts of the text. In the table description (stripped of its structure in this context), the tool labels just "Louisville" as a location, which is true but not the best labeling. The second occurrence, in the first paragraph, labels "University of Louisville" as an organization, which is the correct label.

It is interesting to me that when the full name of the NIH was spelled out, the word "National" was left out of the entity, and that the abbreviation "NIH" was not labeled as an organization but as an unknown entity.

Lastly, the labeling of persons did not do well. The president of the university is labeled correctly (Dr. James R. Ramsey), but "Faculty", "Urban", and "General Assembly" are also labeled as people. The precision of the person label is 1/5, not so good, though I guess the recall is a perfect 1/1.
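
For anyone else new to these terms, here is a minimal Python sketch of how the two numbers are computed. The counts come from the paragraph above (five Person labels, one of them correct, and no real person missed); the function name `precision_recall` is just an illustration and is not part of GATE.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision = correct labels / all labels given;
    recall = correct labels / all entities that should have been labeled."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# 1 correct Person label, 4 wrong ones, 0 people missed.
precision, recall = precision_recall(1, 4, 0)
print(precision, recall)  # 0.2 1.0
```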

GATE-4.0 Vista Bug

If you are new to GATE and are using Windows Vista, you may come across the following exception in the message tab:

Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException
at sun.swing.table.DefaultTableCellHeaderRenderer.getColumnSortOrder(DefaultTableCellHeaderRenderer.java:104)
at com.sun.java.swing.plaf.windows.WindowsTableHeaderUI$XPDefaultRenderer.getTableCellRendererComponent(WindowsTableHeaderUI.java:108)
at gate.swing.XJTable.calculatePreferredSize(XJTable.java:105)
at gate.swing.XJTable.getPreferredSize(XJTable.java:130)
at gate.swing.XJTable.getPreferredScrollableViewportSize(XJTable.java:86)
at gate.gui.FeaturesSchemaEditor.populate(FeaturesSchemaEditor.java:169)
...

It turns out that the Java GUI libraries behave a little differently in Vista than in Windows XP. To work around the problem, simply go to Options > Configuration and change the "Look and feel" from "Windows" (the default) to, say, "Metal". You should no longer see that exception.

Friday, August 1, 2008

Mining Topic-Specific Concepts and Definitions on the Web

I just finished reading the paper of the same name as this post by Bing Liu, Chee Wee Chin and Hwee Tou Ng. I had two thoughts while reading it. First, though we all know that search engines are a one-size-fits-all application, I had not really thought about it. Is it not interesting that an 8-year-old and a biologist who has been doing research for 20 years can both type the term "ant" into Google and receive the same results (Apache's Ant project for building Java applications)?

Sure, without signing in to Google, what could they know about the individual making the search? What is really interesting is that we are giving Google and other search engines practically nothing to work with and expecting them to read our minds, and for most of my searches they do pretty well. (Try searching for "mormons bosnia" and see what you get. By the way, LDS is an abbreviation often used to refer to Mormons.)

Second (as in the second idea I had from this paper), it would be really nice if we could summarize 100 web pages of results into a single table. What to include and what to leave out are the obvious questions. But imagine if our search for "data mining", as discussed in this paper, returned a summary of the definitions of data mining found on those pages. For example, "_" (53 pages), "_" (6 pages) and "_" (2 pages), where each "_" would be a definition and each parenthesized count would be a link to a results page listing all of those pages. It would also be nice if there were some intelligent analysis of the definitions, so that the 53 pages behind the first definition, though not all worded the same way, held the same meaning for our purposes. Certainly each individual has different purposes, so whether two definitions are equal is up for debate in a lot of cases, depending on who you talk to. This idea could definitely use a lot of refinement.
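
As a rough illustration of the summary table I have in mind, here is a Python sketch that groups pages by definition and counts the pages in each group. This is not the method from the paper. The `normalize` function is a crude stand-in for "holds the same meaning" (it only ignores case and punctuation), and the URLs and definitions are made-up placeholders.

```python
import re
from collections import defaultdict


def normalize(definition):
    """Crude stand-in for 'same meaning': lowercase and drop punctuation."""
    return " ".join(re.findall(r"[a-z0-9]+", definition.lower()))


def summarize_definitions(pages):
    """pages: iterable of (url, definition) pairs.
    Returns [(normalized definition, [urls])], largest group first."""
    groups = defaultdict(list)
    for url, definition in pages:
        groups[normalize(definition)].append(url)
    return sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)


# Hypothetical placeholder pages, for illustration only.
pages = [
    ("http://example.com/a", "Data mining is finding patterns in data."),
    ("http://example.com/b", "data mining is finding patterns in data"),
    ("http://example.com/c", "Data mining is the analysis of large data sets."),
]
for definition, urls in summarize_definitions(pages):
    print(f'"{definition}" ({len(urls)} page(s))')
```

The hard part is exactly what `normalize` glosses over: deciding when two differently worded definitions mean the same thing.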