Thursday, August 14, 2008

First Experience with Named Entity Recognition


I must preface this by saying that I am fairly new to Text Mining and NLP. I just tried the named entity recognition implemented in the GATE tool on the Wikipedia entry for the University of Louisville. All parameters were left at their defaults for ANNIE. The results can be seen in the included image.

To me, the results suggest that this is a very difficult task. For most abbreviations, such as NIH, NCAA and U of L, the tool could not determine what type of entity they were. It is also interesting that the University of Louisville is labeled differently in different parts of the text. In the table description (stripped of its structure in this context), only "Louisville" is labeled, as a location, which is true but not the best labeling. The second occurrence, in the first paragraph, labels the "University of Louisville" as an organization, which is the correct label.

It is also interesting to me that when the full name of the NIH was spelled out, the word "National" was left out of the annotation, and that the abbreviation "NIH" was not labeled as an organization but as an unknown entity.

Lastly, the labeling of persons did not go well. The president of the university is labeled correctly (Dr. James R. Ramsey), but "Faculty", "Urban", and "General Assembly" are also labeled as people. The precision of the person label is 1/5, which is not so good, though I guess the recall is a perfect 1/1.
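
For anyone who wants to check the arithmetic, here is a minimal sketch of how those two numbers come out of the standard definitions: precision is the share of predicted person labels that are correct, and recall is the share of actual persons that were found. The function name and the counts passed to it are my own illustration; the counts (1 correct label, 4 wrong ones, 0 missed persons) are simply what the 1/5 and 1/1 figures above imply, not something GATE reports.

```python
from fractions import Fraction


def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) as exact fractions, or None if undefined."""
    predicted = true_positives + false_positives  # everything labeled "person"
    actual = true_positives + false_negatives     # every real person in the text
    precision = Fraction(true_positives, predicted) if predicted else None
    recall = Fraction(true_positives, actual) if actual else None
    return precision, recall


# Assumed counts, inferred from the figures quoted in the post:
# 1 correct person label, 4 incorrect ones, and no person missed.
precision, recall = precision_recall(
    true_positives=1, false_positives=4, false_negatives=0
)
print(f"precision = {precision}, recall = {recall}")
```

With a single real person in the text, recall can only ever be 0 or 1, so the precision figure is the more informative of the two here.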
