Machine Translation

From dandi08

Machine Translation: Producing IGTs for Latin

I would like to produce the gloss section of IGTs for Latin using a limited vocabulary and grammar, perhaps just what the first half of an introductory textbook covers.

It seems like the easiest way to do this would be to have all of the vocabulary that is within the project fully broken into morphemes and have Python match the input word against the list.

It seems like strings could be used to produce a possible English translation, if we did some research into common English word order. (Maybe WordNet would help?) Then we could assign different grammatical words as variables for strings and use them to produce a sentence that may be correct in English.

It doesn't seem like the Latin itself would be the hard bit: creating a small dictionary that has a small vocabulary fully declined and edited into grammar morphemes wouldn't take any outside knowledge. It would be basically transcribing a textbook.

The programming would be more challenging, I think, since we couldn't just annotate information and would have to create something.


One Possible Methodology

1) Build a dictionary of Latin morphemes. For the sake of simplicity, we might start with only nouns and verbs, and a limited set of affixes. Each entry could be headed by the Latin morpheme, and contain a one-word English gloss, and/or relevant grammatical information (part of speech, number, tense, etc).

2) Get an input string (a simple Latin phrase). This would be the first line of our IGT.

3) Split the input phrase into morphemes (roots and affixes), perhaps by searching for substrings in each word that match entries in our Latin dictionary.

4) Create a gloss line consisting of the English gloss and grammatical information for each word of the Latin phrase. This would be the second line of an IGT.

5) Translation from the gloss line to an English phrase might be done using an English dictionary in which the English gloss and grammatical info were the head word, and an English parse was the definition. The manipulation of word order might require more intensive programming.

I have written some code for a prototype machine translator. If copied and pasted into the Python editor, this should run. It does a good job of creating IGT 2nd lines for words in its dictionary. Right now, it only works for a few nouns in the nominative and accusative cases plus some verbs, and it crashes if you include a morpheme not in the dictionary.
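Separately from that prototype, steps 1 through 4 above might be sketched roughly as follows. This is only an illustration under stated assumptions: the names `ROOTS`, `SUFFIXES`, `segment`, `gloss_word` and `gloss_line` are hypothetical, the handful of entries and their labels are made up for the example, and the split points (such as vid-et) are simplifications rather than a careful morphological analysis.

```python
# Hypothetical toy lexicon (step 1). Entries and labels are illustrative
# assumptions, and the split points are deliberately simplified.
ROOTS = {
    "puell": "girl",
    "mens": "table",
    "vid": "see",
    "dic": "say",
}
SUFFIXES = {
    "a": "NOM.SG",
    "am": "ACC.SG",
    "et": "3SG.PRES",
    "imus": "1PL.PRES",
}


def segment(word):
    """Step 3: split a word into (root, suffix) by trying every split point."""
    word = word.lower()
    for i in range(len(word), 0, -1):  # try the longest candidate root first
        root, suffix = word[:i], word[i:]
        if root in ROOTS and suffix in SUFFIXES:
            return root, suffix
    return None  # no analysis found in the lexicon


def gloss_word(word):
    """Step 4: build the gloss for one word, or a placeholder if unknown."""
    parts = segment(word)
    if parts is None:
        return "???"
    root, suffix = parts
    return f"{ROOTS[root]}.{SUFFIXES[suffix]}"


def gloss_line(phrase):
    """Steps 2 and 4: turn an input phrase into an IGT gloss line."""
    return " ".join(gloss_word(w) for w in phrase.split())


print(gloss_line("Puella videt mensam"))
```

In this sketch a word that cannot be analyzed is glossed as `???` instead of raising an error, which is one possible way to avoid crashing on morphemes that are not in the dictionary.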

Project Proposal

Problem Addressed by the Project


We hope to create a system that would automatically generate an Interlinear Glossed Text (IGT) for Latin. IGTs are a standardized method for discussing text in another language. They generally have three lines. The first is the transliterated text in the language being discussed. The second line is a morpheme by morpheme analysis of the line of text, and the third line is a free translation.

There would be many benefits if IGTs could be automatically generated. The creation of IGTs would be quicker and simpler, making it easier for linguists to share their findings. Because of how they were produced, these IGTs would be in a standardized, plain-text format, which would simplify retrieval by automated systems like ODIN. These IGTs would also be very useful to students of a language; the second line of an IGT shows useful grammatical information, unlike the free translation in the third line. The dictionaries containing the morphemes that were used to create the IGTs would contain sets of Latin and English morphemes. These dictionaries could be applied to future projects.

In an online search, we were unable to find any other program that automatically produces a gloss line for any language. There are translators, such as http://translate.google.com/translate_t# and http://babelfish.yahoo.com/ , but both sites produce a translation without any grammatical information. With a gloss line it is easier to understand what the meaning of each individual word is, and also easier to see the grammatical differences between two languages. These attributes of the gloss line make it easier to understand a translation, and make the translation more useful for linguistic analysis and language learning.

The project will let us study the morphemes of another language in depth, and delve deeper into Python. If we can successfully build a program that generates Latin to English IGTs, it might be easily expanded to accommodate other languages by creating new dictionaries and making small changes to the code.


Relationship of the Project to Dandi Program Themes

In our Machine Translation project we will be making extensive use of our knowledge of Python. We will be using Python to create the dictionaries, and to produce the IGTs. It is likely that we will use WordNet to perform some of the more complex operations in our program, such as using its database of synonyms to get a more nuanced translation. Our program will perform translation, which, of course, involves linguistics. We will be paying especially close attention to morphology and syntactic structure. Our program will produce IGTs, which were introduced to us by William Lewis during one of the Semantic Web lectures.


The Project Itself

The end result of our project will be a fairly complex program written in Python. We plan to begin with a fairly limited vocabulary and only the simplest grammatical constructions. We hope to have time, once we have built a working model, to extend the vocabulary and grammar that our program can handle. Our planned methodology includes the following steps:

1) We need to develop a basic understanding of the grammar of Latin and some Latin vocabulary. This will allow us to create a dictionary of Latin words, and use them in our translation program.

2) We will create and populate a dictionary/database (deliverable). Our plan is to head each entry of our dictionary with a Latin morpheme, and let a one-word gloss and any relevant grammatical information be parts of the definition. Here are a few sample entries for the morphemes of dicimus:

 headword: dic
   gloss: say
 headword: imus
   gloss: 1 pers, Pl, pres, act
We will have many more root entries than gram entries, since the grammar is fairly consistent.

3) We also will create a program that generates an IGT when given a Latin phrase. Here is an example IGT we found:

        1. Puella    videt    mensam
           girl.NOM  see.3SG  table.ACC
           'The girl sees the table.'

(from: Musgrave, Simon (2001). Non-Subject Arguments in Indonesian.)

The planned steps in the program are as follows:

a) Get a simple Latin phrase as input. This input will be the first line of the IGT.

b) Break the input phrase into words, and then break the words into the constituent morphemes. We will use string operations to accomplish this.

c) Generate a morpheme-by-morpheme gloss that will serve as the second line in the IGT. This program will use our dictionary to replace each of the Latin morphemes with a one-word English gloss and/or any relevant grammatical labels (grams). We will use string operations and possibly some of WordNet’s capabilities to accomplish this.

d) Use the gloss and gram information, plus a set of expected patterns, to build a grammatically acceptable English translation for the third line of the IGT.

4) We will test our program on some simple Latin phrases. Depending on the level of complexity we can achieve in our program, we may want to use Latin phrases from literature, or phrases that are familiar to English speakers.
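As a rough illustration of how steps c) and d) of the program outline might fit together, the sketch below takes the words and glosses of the example IGT, lines them up in columns, and applies one expected pattern (subject, verb, object) to build the free translation. The function names `align` and `translate` are hypothetical, the gloss labels are taken from the example IGT, and the single pattern with "the" and a bare "-s" verb ending is a deliberately naive assumption.

```python
def align(latin_words, glosses):
    """Pad each column so every Latin word sits above its gloss."""
    widths = [max(len(w), len(g)) for w, g in zip(latin_words, glosses)]
    line1 = "  ".join(w.ljust(n) for w, n in zip(latin_words, widths)).rstrip()
    line2 = "  ".join(g.ljust(n) for g, n in zip(glosses, widths)).rstrip()
    return line1, line2


def translate(glosses):
    """Apply one assumed pattern: NOM noun + 3SG verb (+ ACC noun)."""
    subject = verb = obj = None
    for g in glosses:
        stem, _, gram = g.partition(".")  # "girl.NOM" -> "girl", "NOM"
        if gram == "NOM":
            subject = stem
        elif gram == "ACC":
            obj = stem
        elif gram == "3SG":
            verb = stem + "s"  # naive English third-person singular
    if subject is None or verb is None:
        return None  # the pattern does not apply to this phrase
    words = ["The", subject, verb]
    if obj is not None:
        words += ["the", obj]
    return " ".join(words) + "."


latin = ["Puella", "videt", "mensam"]
glosses = ["girl.NOM", "see.3SG", "table.ACC"]
line1, line2 = align(latin, glosses)
print(line1)
print(line2)
print("'" + translate(glosses) + "'")
```

Because the roles are read from the case labels rather than from position, the same translation comes out if the Latin words are given in a different order, which matters since Latin marks grammatical role by case ending rather than by word order.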

What We Expect to Learn

We will be making extensive use of what we’ve learned in Python programming. We will use Python to create and populate dictionaries. Since we will primarily be working in text, we will need to use string operations and formatting. Loops, functions, and if/elif/else will be necessary to control the outputs.
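To give a sense of how these pieces might combine, here is a minimal sketch that populates two Python dictionaries from a plain-text listing using a loop, string operations, and if/elif/else. The listing format (fields separated by `|`), the names `RAW_ENTRIES` and `load_entries`, and the `puell` entry are assumptions made up for the example; the `dic` and `imus` entries follow the sample entries given earlier.

```python
# Hypothetical listing format: headword | entry type | gloss
RAW_ENTRIES = """
dic   | root | say
puell | root | girl
imus  | gram | 1 pers, Pl, pres, act
"""


def load_entries(text):
    """Sort each line of the listing into a root or gram dictionary."""
    roots, grams = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        headword, kind, gloss = [field.strip() for field in line.split("|")]
        if kind == "root":
            roots[headword] = gloss
        elif kind == "gram":
            grams[headword] = gloss
        else:
            raise ValueError(f"unknown entry type: {kind!r}")
    return roots, grams


roots, grams = load_entries(RAW_ENTRIES)
print(f"{len(roots)} roots, {len(grams)} grams loaded")
print(f"dic-imus -> {roots['dic']}-{grams['imus']}")
```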

Our linguistic knowledge will be necessary to break down the words into morphemes. Syntax knowledge will be important for IGT lines two and three, and semantic knowledge will be used in the free translation in line three.

We will need to learn more about several categories. We will need to do more research into WordNet synsets and into Latin grammar. We will also need to know more about the standard forms of IGTs. It would also be helpful to know more about what systems for parsing or translation are currently available.

It may be helpful to learn more about Machine Learning/statistical syntactic analysis to help with translations to English. A deeper knowledge of English grammar could also become necessary, and it seems very likely that more research into Python could be useful.


Links

Perseus has a parsed Latin database. Perseus Lexicon

This site has several Latin textbooks and resources for free.

This is another page with good Latin resources.

This page has tables of endings. These could be helpful for our dictionary.

Words is a program that looks somewhat similar to our goals. It is written in Ada, but they give users permission to use the files. Maybe we could get the dictionary?

This is a good way to check the conjugations.

Here are some fully declined nouns we can check ourselves against.

Interested People