Week 5: 10/28/08

"Mining and Enriching Linguistic Data on the Web for Domain-Specific Search"

William Lewis

3p-4:30p, Tuesday, October 28, 2008, Lecture Hall 3

Electronic resources are scarce to non-existent for the vast majority of the world's languages, with electronic corpora existing for only a few dozen languages at most. Yet, on the Web there exists data for no less than 10% of the world's 7,000 languages, and probably much more.  The problem is that until recently most of this data has not been easy to find, with most being buried in scholarly papers or hidden behind idiosyncratic database interfaces.

In this talk, I will discuss the development of an online database called ODIN, which has been designed to extract linguistically annotated data from scholarly papers posted to the Web and make it available to search.  To date, ODIN houses data for over 700 languages, and allows linguists to search by language name, family, etc., returning both instances of data and the papers from which they have been extracted.  Further, using sophisticated enrichment algorithms such as structural projections, we are in the process of improving the search interface to provide enhanced search facilities over the data (such as structural search), and of using the resulting enriched data to seed the development of tools, such as taggers and parsers.

The Speaker:
William Lewis is currently Data Acquisition PM with the Microsoft Research Machine Translation Incubation Group, which he joined in July 2007.  Before joining Microsoft, he was faculty with the Computational Linguistics Professional Master's Program at UW (where he continues to hold Affiliate status), and before that, he was Assistant Professor in the Linguistics Department at CSU Fresno, where he developed and managed their undergraduate major in Computational Linguistics.  Before pursuing his PhD in Linguistics at the University of Arizona (PhD, 2002), he was co-founder and CEO of SFL Services, Inc., a small software design and consulting firm located in Sacramento, California, which he ran for over a decade, and whose client base included the State of California, UPS, Pacific Gas & Electric and Fidelity National Financial, among others.  Research-wise, Lewis has been involved in the Electronic Metastructure for Endangered Languages Data (E-MELD) project since its initial NSF funding in 2001, was PI on the NSF-funded Data-Driven Linguistic Ontology grant (BCS-0411348), continues to contribute to the General Ontology of Linguistic Description (GOLD), and is the principal developer of the Online Database of Interlinear Text (ODIN).  He is author and co-author of a number of papers related to GOLD and ODIN, and is actively engaged in research projects that tap the highly multi-lingual nature of the ODIN database as a means to drive efforts to develop tools and resources for low-density and endangered languages.

Associated reading:
William D. Lewis and Fei Xia, "Developing ODIN: A Multilingual Repository of Annotated Language Data for Hundreds of the World's Languages".  This paper is under submission and therefore can't be posted publicly; students in the Data and Information program can find it in the "Seminar and Speaker Series" section of the Data and Information homepage.

See the file list below for a printable announcement for this lecture.

5Lewis.doc (20.5 KB)