Morpheme Tree Generator

From dandi08

A Morpheme Tree Generator would take a grammatical word as input from the user and output a simple text tree diagram of that word, broken down into its morphemes and their respective parts of speech. It would be written in Python and would likely use NLTK and WordNet.

1 Initial Design
2 NLTK Usage
3 Program Flow
4 Objectives
5 Contact Information

Initial Design

It is difficult to discern where to begin, but a good place to start would be building (or finding) a corpus of English prefixes and suffixes. One possible approach is to slice the input at the length of each token in the prefix or suffix corpus and then check whether the slice matches that token.
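The slicing idea could be sketched roughly as follows. This is only a minimal illustration: the `PREFIXES` and `SUFFIXES` lists and the `slice_matches` function are hypothetical placeholders, not an existing corpus or library call.

```python
# Hypothetical, hand-picked affix lists; a real corpus would be far larger.
PREFIXES = ["un", "re", "dis", "pre"]
SUFFIXES = ["ness", "able", "ing", "ly", "ed"]


def slice_matches(word):
    """Return (kind, affix, remainder) for every affix whose slice matches."""
    matches = []
    for prefix in PREFIXES:
        # Slice the front of the word at the length of the candidate prefix.
        if len(word) > len(prefix) and word[:len(prefix)] == prefix:
            matches.append(("prefix", prefix, word[len(prefix):]))
    for suffix in SUFFIXES:
        # Slice the back of the word at the length of the candidate suffix.
        if len(word) > len(suffix) and word[-len(suffix):] == suffix:
            matches.append(("suffix", suffix, word[:-len(suffix)]))
    return matches


print(slice_matches("unhappiness"))
```

For "unhappiness" the suffix match leaves the remainder "unhappi", which hints at a problem the design will need to handle: spelling changes at morpheme boundaries (happy, happi-).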

NLTK Usage

I'm pretty excited about this part of the program, as it synthesizes all parts of DandI. Because it uses the NLTK library, and WordNet in particular, a lot of the program will end up looking a great deal like a CS lab. As we've learned from our cursory look at WordNet, working with it from Python leans on built-in string methods like .startswith() and .endswith(). These belong to Python strings rather than to WordNet itself, but they will be fairly vital for recognizing known morphemes contained in a word. Building a corpus will involve either research (someone has already done it) or input (we find a list of morphemes and transcribe it ourselves).
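As a rough sketch of how .startswith() and .endswith() might be used here, the example below looks for the longest listed affix at each end of a word. The affix tuples, the `KNOWN_WORDS` set, and the helper names are all assumptions made for illustration. The small word set stands in for a WordNet lookup so the example runs without NLTK data; with NLTK installed, a check along the lines of `bool(wordnet.synsets(word))` could take its place.

```python
# Hypothetical sample data; none of these lists come from an existing corpus.
PREFIXES = ("un", "re", "dis")
SUFFIXES = ("ness", "able", "ing")
KNOWN_WORDS = {"kind", "happy", "lock"}  # stand-in for a WordNet lookup


def is_known_word(word):
    """Stand-in word check; a WordNet query could be substituted here."""
    return word in KNOWN_WORDS


def find_prefix(word):
    """Return the longest listed prefix that starts the word, or None."""
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) > len(prefix):
            return prefix
    return None


def find_suffix(word):
    """Return the longest listed suffix that ends the word, or None."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return suffix
    return None


word = "unkindness"
prefix, suffix = find_prefix(word), find_suffix(word)
stem = word[len(prefix):-len(suffix)]
print(prefix, stem, suffix, is_known_word(stem))
```

Both methods also accept a tuple of candidates, so `word.startswith(PREFIXES)` is a quick way to ask whether any listed prefix is present before searching for which one.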

Program Flow

Using NLTK/WordNet
Request user input of a grammatical English word
 Determine that it is a word (WordNet)
 Determine whether the word is monomorphemic
 Check the beginning of the word for a known prefix
 Check the end of the word for a known suffix
 Check the remainder against the WordNet corpus after stripping the found prefix or suffix
 Repeat the process until no more prefixes or suffixes can be found in the string
Print a tree containing the hierarchy of morphemes (likely in text to start, scaling to a simple graphical representation)
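The steps above might be sketched as below. This is a simplified illustration under several assumptions: `KNOWN_WORDS` is a tiny stand-in for the WordNet check, the affix tuples are hypothetical samples, stripping is greedy, and the tree is flat (one level) rather than a true hierarchy.

```python
KNOWN_WORDS = {"kind", "lock", "happy"}   # stand-in for a WordNet lookup
PREFIXES = ("un", "re")                   # hypothetical sample affixes
SUFFIXES = ("ness", "able", "ing")


def segment(word):
    """Strip known affixes until none remain; return (prefixes, root, suffixes)."""
    prefixes, suffixes = [], []
    found = True
    while found and word not in KNOWN_WORDS:
        found = False
        for p in PREFIXES:
            if word.startswith(p) and len(word) > len(p):
                prefixes.append(p)
                word = word[len(p):]
                found = True
                break
        if word in KNOWN_WORDS:
            break  # the remainder is a known word, so stop stripping
        for s in SUFFIXES:
            if word.endswith(s) and len(word) > len(s):
                suffixes.insert(0, s)
                word = word[:-len(s)]
                found = True
                break
    return prefixes, word, suffixes


def render_tree(word):
    """Return a flat text tree listing each morpheme under the full word."""
    prefixes, root, suffixes = segment(word)
    parts = [(p + "-", "prefix") for p in prefixes]
    parts.append((root, "root"))
    parts += [("-" + s, "suffix") for s in suffixes]
    lines = [word]
    for i, (morph, label) in enumerate(parts):
        branch = "`-- " if i == len(parts) - 1 else "|-- "
        lines.append(f"{branch}{morph} ({label})")
    return "\n".join(lines)


print(render_tree("unkindness"))
```

Greedy stripping is the simplest choice, but it will mis-segment words that merely begin or end with an affix-like string, so a fuller version would probably confirm each remainder against the lexicon before committing to a split.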

Objectives

The objective of this exercise is specifically to see whether such a program can be written without running into too many ambiguity problems. Further, if there's time, it would be a nice test of the program to see what happens when the input is changed from a single word entered by a user to an automated parse of a large corpus of English words.
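To give a sense of what an ambiguity problem might look like, the sketch below enumerates every bracketing a word admits and then scans a word list for words with more than one. The affix tuples, the `KNOWN_WORDS` set, and both function names are hypothetical, and a real run would draw its words from a much larger corpus.

```python
KNOWN_WORDS = {"lock", "kind", "read"}   # stand-in for a WordNet lookup
PREFIXES = ("un", "re")                  # hypothetical sample affixes
SUFFIXES = ("able", "ness")


def analyses(word):
    """Return every bracketing of the word that bottoms out in a known word."""
    results = set()
    if word in KNOWN_WORDS:
        results.add(word)
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p):
            for inner in analyses(word[len(p):]):
                results.add(f"[{p}- {inner}]")
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s):
            for inner in analyses(word[:-len(s)]):
                results.add(f"[{inner} -{s}]")
    return results


def ambiguous_words(words):
    """Map each word with more than one analysis to its sorted analyses."""
    return {w: sorted(analyses(w)) for w in words if len(analyses(w)) > 1}


print(ambiguous_words(["unlockable", "kindness", "read"]))
```

Here "unlockable" comes back with two structures ("not able to be locked" versus "able to be unlocked"), while "read" is kept whole because the remainder after stripping "re" is not a known word.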