jdee edited this page Nov 19, 2010 · 6 revisions

Dubsar provides one important class of information not present in WordNet(R): regular inflections. While WordNet(R) does provide exception lists of inflections for irregular words, it offers little help with regular inflections. Instead, WordNet(R) uses the Morphy algorithm to determine morphologically the head term associated with an inflected form. For example, the head term for queried is the verb query, the former being the past tense and past participle of the latter.

WordNet(R) applications like the WordNet(R) online interface match inflected forms when searching for a word. For example, if you search for happening, you'll find the verb happen. But if you search for the misspelled happenning, you'll find the same verb, because the Morphy rules are overly broad. When searching for words, this can at times have the effect of correcting one's spelling, which helps the user find the intended dictionary entry. But applications using Morphy cannot display all inflections associated with a word, since many of the forms the rules accept (like happenning above) are invalid.

Dubsar, on the other hand, displays all known inflections for each word. It can do this because it does not use rules to determine inflections when looking up words. Instead, Dubsar includes a database table containing every inflected form of every word it knows. When a user performs a search without wild cards, the search term is matched against all known inflected word forms, and only exact matches count. So if you search Dubsar for happening, you'll find happen. But you'll get no results for happenning.
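The difference between the two lookup strategies can be sketched in a few lines of Python. This is only an illustration: the `INFLECTIONS` dict stands in for Dubsar's database table, and `lookup_by_broad_rule` is a hypothetical over-broad suffix rule invented for this example, not Morphy's actual rule set.

```python
# Hypothetical stand-in for the inflection table: inflected form -> head words.
INFLECTIONS = {
    "happen": {"happen"},
    "happens": {"happen"},
    "happened": {"happen"},
    "happening": {"happen"},
}


def lookup_exact(query):
    """Return the head words whose stored inflections match the query exactly."""
    return sorted(INFLECTIONS.get(query, set()))


def lookup_by_broad_rule(query, lexicon):
    """Illustrative over-broad rule (an assumption, NOT Morphy's real rules).

    Strip "ing", optionally restore a final "e" or undo a doubled final
    consonant, and accept any resulting stem that is a known word.
    """
    if not query.endswith("ing"):
        return []
    stem = query[:-3]
    candidates = {stem, stem + "e"}
    if len(stem) >= 2 and stem[-1] == stem[-2]:
        candidates.add(stem[:-1])
    return sorted(c for c in candidates if c in lexicon)
```

With this sketch, `lookup_by_broad_rule("happenning", {"happen"})` accepts the misspelling and returns the verb happen, while `lookup_exact("happenning")` returns nothing, because the misspelled form was never stored.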

These inflections are themselves generated by rules, but only when the database is built. While the WordNet(R) data set is large, it is also finite. It is quite possible to build, over time, a complete and exhaustive table of inflections that may simply be stored and loaded whenever a new database needs to be built. It is not necessary to anticipate every possible English word, only those found in the current data set. Rule-based inflection generation is extremely error-prone, as evidenced by Dubsar's own database. In particular, verbs ending in a short syllable with l or s often (but not always) may be conjugated with or without doubling of the final consonant. For example, traveling and travelling; bused and bussed. Dubsar generally allows both forms in all such cases. Errors are gradually being weeded out.
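As a rough illustration of the consonant-doubling case, the following Python sketch generates both spellings for verbs ending in a short syllable with l or s. The function names and the "short syllable" test (consonant, single vowel, final consonant) are assumptions made for this example, not Dubsar's actual rules, and the sketch ignores other cases such as verbs ending in e or y.

```python
VOWELS = set("aeiou")


def ends_in_short_syllable(verb):
    """Rough test: consonant + single vowel + final consonant, as in 'bus'."""
    return (
        len(verb) >= 3
        and verb[-1] not in VOWELS
        and verb[-2] in VOWELS
        and verb[-3] not in VOWELS
    )


def participle_candidates(verb):
    """Return candidate past and present-participle spellings for a verb.

    For verbs ending in l or s after a short syllable, both the plain and
    the doubled-consonant stems are produced.
    """
    stems = [verb]
    if verb[-1] in ("l", "s") and ends_in_short_syllable(verb):
        stems.append(verb + verb[-1])
    return {
        "past": [stem + "ed" for stem in stems],
        "present_participle": [stem + "ing" for stem in stems],
    }
```

Calling `participle_candidates("travel")` yields traveled and travelled as well as traveling and travelling. Because the source text notes that doubling applies "often (but not always)", output like this would still need to be reviewed word by word.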

Because rule-based generation is so error-prone, Dubsar currently only provides inflected forms for words found in the WordNet(R) exception list and for nouns and verbs made up entirely of lower-case letters (i.e., not capitalized and containing no spaces, hyphens or other punctuation). It does not attempt to pluralize Briton or 4, and it does not attempt to conjugate log-in or chew the fat. It also does not attempt regular inflection of adjectives. Although this is nowhere near as challenging morphologically as regular verb inflection, applying the rules to every adjective results in comparative and superlative degrees like sabbaticaller and sabbaticallest. Regular comparative and superlative degrees of adjectives are, in any case, easy enough for a user to identify. All these gaps will be filled in over time. (Note there are no regular inflections of English adverbs. Only irregular inflections, like best for well, are provided.)
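The eligibility rule for regular inflection described above can be expressed as a simple filter. This is a minimal sketch: the function name and the part-of-speech labels are hypothetical, and words from the WordNet(R) exception list, which are handled separately, are not modeled here.

```python
import re

# Only words made up entirely of lower-case ASCII letters qualify.
LOWERCASE_WORD = re.compile(r"[a-z]+")


def eligible_for_regular_inflection(word, part_of_speech):
    """Return True if rule-based inflection would be attempted for the word."""
    return (
        part_of_speech in {"noun", "verb"}
        and LOWERCASE_WORD.fullmatch(word) is not None
    )
```

Under this filter, happen (verb) qualifies, while Briton, 4, log-in, chew the fat and the adjective sabbatical do not.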

Dubsar uses the ActiveSupport Inflector from Ruby on Rails(R) to pluralize nouns not found in the WordNet(R) exception list. The ActiveSupport Inflector is a mature and stable piece of software, though its results must also be cleaned up. In particular, it generates erroneous plurals like shamen for shaman. Dubsar is gradually correcting these errors as well.
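To show how a plural like shamen can arise and how a correction table can override it, here is a small Python sketch. It does not use or reproduce the ActiveSupport Inflector: the single suffix rule (man becomes men) and the `CORRECTIONS` table are assumptions chosen to illustrate the general approach of cleaning up rule output.

```python
import re

# Tiny illustrative rule list (an assumption, not ActiveSupport's rules).
# Rules are tried in order; the first matching pattern wins.
PLURAL_RULES = [
    (re.compile(r"(m)an$"), r"\1en"),  # woman -> women, but also shaman -> shamen
]

# Hand-maintained corrections that override the rule output.
CORRECTIONS = {"shaman": "shamans"}


def pluralize_by_rule(word):
    """Apply the first matching suffix rule, falling back to appending 's'."""
    for pattern, replacement in PLURAL_RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word + "s"


def pluralize(word):
    """Prefer a stored correction; otherwise use the rule-generated plural."""
    return CORRECTIONS.get(word, pluralize_by_rule(word))
```

Here `pluralize_by_rule("shaman")` produces the erroneous shamen, while `pluralize("shaman")` returns shamans because the correction takes precedence. Words the rule handles properly, such as woman, are unaffected.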

The database currently contains 223,471 inflection entries for 156,584 words. Note that the uninflected form of each word is itself listed in the Inflections table. For example, the verb be has eight entries, including be itself:

  • am
  • are
  • be
  • been
  • being
  • is
  • was
  • were
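The list above can be pictured as rows in an inflections table. The following Python sketch uses an in-memory SQLite database. The table name, column names and the `forms_of` helper are hypothetical, since the actual Dubsar schema is not described here.

```python
import sqlite3

# Hypothetical schema: one row per (word, part of speech, inflected form).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inflections (word TEXT, part_of_speech TEXT, form TEXT)"
)

# The uninflected form 'be' is stored alongside the inflected ones.
be_forms = ["am", "are", "be", "been", "being", "is", "was", "were"]
conn.executemany(
    "INSERT INTO inflections VALUES ('be', 'verb', ?)",
    [(form,) for form in be_forms],
)


def forms_of(connection, word, part_of_speech):
    """Return every stored form of a word, in alphabetical order."""
    rows = connection.execute(
        "SELECT form FROM inflections "
        "WHERE word = ? AND part_of_speech = ? ORDER BY form",
        (word, part_of_speech),
    )
    return [row[0] for row in rows]
```

Counting rows in such a table gives the number of inflection entries, and counting distinct word and part-of-speech pairs gives the number of words. The uninflected form contributes one entry per word.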

Dubsar generates 23,568 verb inflections and 38,703 noun inflections not found in WordNet(R). Most of the latter are from the ActiveSupport Inflector.

The numbers above are updated live on the About Dubsar page.
