Monday, August 18, 2008

WordNet 3.0 Summary Statistics

I recently downloaded WordNet 2.1 (an ontology for the English language) for Windows and was impressed by the results for even basic words: each one is displayed with several senses, along with definitions and example sentences. I thought I would run a few queries on WordNet. Fortunately, under the "Related Projects" link on the WordNet site, someone has already loaded WordNet 3.0 into PostgreSQL.

Here are some of my basic questions:
  1. How many words are there in WordNet 3.0?
    According to one source (I am sure there is plenty of debate about how many words in the English language are in common usage), "The Second Edition of the Oxford English Dictionary contains full entries for 171,476 words in current use, and 47,156 obsolete words." (from AskOxford) Of course, other websites gave several other answers, but if the same definition of what a word is applies, then WordNet is doing pretty well.
  2. What are examples of words found in WordNet to get a sense of what the dictionary contains?
    52850;"gas system"
    27754;"committal to memory"
    18769;"butt hinge"
    46875;"family termitidae"
    Here we see that a "word" is not necessarily a single token; multi-word phrases count as well. The numbers next to each word are the IDs found in the database.
  3. How many senses does a word have?
    mean: 1.4, median: 1

    1 120433 82%
    2 15711 11%
    3 5116 3%
    57 1 0% "run"
    70 1 0% "cut"
    75 1 0% "break"

  4. How many hypernyms (broader words) does a word have?
    mean: 2.17, median: 1

    0 35649 24%
    1 46213 31%
    2 27468 19%
    113 1 0% "hold"
    125 1 0% "cut"
    156 1 0% "break"

  5. How many hyponyms (specific examples) does a word have in WordNet?
    mean: 2.17, median: 0

    0 118372 80%
    1 5427 4%
    2 4250 3%
    897 1 0% "herbaceous plant"
    913 1 0% "herb"
    1007 1 0% "change"

  6. How many synonyms does a word have?
    mean: 2.07, median: 1

    0 36916 25%
    1 48004 33%
    2 24998 17%
    74 1 0% "hold"
    97 1 0% "pass"
    99 1 0% "break"
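
Each of the tables above has the same shape: a count per word, summarized as a mean, a median, and a histogram of how many words share each count. Here is a minimal Python sketch of that summary step. It is not the query I ran against the database; the `summarize` helper and the toy counts are made up for illustration, and the real per-word counts would come from the PostgreSQL tables.

```python
from collections import Counter
from statistics import mean, median


def summarize(counts_per_word):
    """Return (mean, median, histogram) for a {word: count} mapping.

    histogram maps each count value to
    (number_of_words_with_that_count, percent_of_all_words).
    """
    values = list(counts_per_word.values())
    total = len(values)
    freq = Counter(values)
    histogram = {
        k: (n, round(100 * n / total))
        for k, n in sorted(freq.items())
    }
    return mean(values), median(values), histogram


# Toy data, invented for illustration (NOT real WordNet counts).
toy_senses = {"gas system": 1, "butt hinge": 1, "herb": 2, "pass": 3, "run": 5}

avg, med, hist = summarize(toy_senses)
print(f"mean: {avg}, median: {med}")
for k, (n, pct) in hist.items():
    print(k, n, f"{pct}%")
```

With a long-tailed distribution like the ones above, the median says more about the typical word than the mean does, which is why both are reported.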

It appears that only a small percentage of words (20%) have hyponyms, though those that do may have quite a few (no surprise). Interestingly, the mean numbers of hyponyms and hypernyms per word are the same, because the total numbers of hypernym and hyponym links are the same: every hypernym link from one word to another is also a hyponym link in the opposite direction. The two relations are apparently inverses of each other.
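
The equal means can be checked on a tiny example. The sketch below uses a made-up list of (word, hypernym) pairs, not data from WordNet, and counts each word's hypernyms and hyponyms from the same list of links. The totals match by construction, so the means over the same vocabulary match too, while the medians are free to differ, as they do in the tables above.

```python
from collections import Counter
from statistics import mean, median

# Toy (word, hypernym) pairs, invented for illustration (NOT WordNet data).
edges = [
    ("herb", "plant"),
    ("tree", "plant"),
    ("oak", "tree"),
    ("pine", "tree"),
    ("plant", "organism"),
]

# Every word that appears on either side of a link.
words = sorted({w for pair in edges for w in pair})

# A link (child, parent) gives the child one hypernym
# and the parent one hyponym.
hypernym_count = Counter(child for child, _ in edges)
hyponym_count = Counter(parent for _, parent in edges)

hypernyms = [hypernym_count[w] for w in words]
hyponyms = [hyponym_count[w] for w in words]

print("totals:", sum(hypernyms), sum(hyponyms))
print("means:", mean(hypernyms), mean(hyponyms))
print("medians:", median(hypernyms), median(hyponyms))
```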
