Week 7: 11/11/08

"Hadoop:  Data Intensive Scalable Computing at Yahoo!" 

Owen O'Malley

3p-4:30p, Tuesday, November 11, 2008, Lecture Hall 3

As Brin and Page say in their classic 1998 paper on web search, engineering a search engine is challenging. The web is a huge and ever-growing source of information, and indexing it requires processing extremely large data sets. Current search engines hold information on hundreds of billions of web pages and more than a trillion links between them, and because the data changes constantly, it must be re-indexed continually. Processing hundreds of terabytes of data in a reasonable amount of time requires a multitude of computers. However, using many computers, especially commodity Linux PCs, means that machines are always failing, creating an operations nightmare.

In this talk, Owen will describe how search engines scale to the necessary size using software frameworks, an area now known as web-scale data-intensive computing. In particular, he will show how many computers can be reliably coordinated to solve such problems using Apache Hadoop, which is largely developed at Yahoo!, and how to program such solutions using the Map/Reduce programming model.
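The Map/Reduce model the talk covers can be illustrated with the standard word-count example. The sketch below is in plain Python rather than Hadoop's Java API, and every name in it is illustrative, not part of Hadoop; it only shows the shape of the model: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase combines the values for each key independently, which is what lets the work spread across many machines.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Map: emit an intermediate (word, 1) pair for every word in a document.
    for word in text.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Reduce: combine all partial counts for a single word.
    return (word, sum(counts))

def run_mapreduce(documents):
    # Shuffle: group the intermediate pairs by key. In Hadoop this grouping
    # happens across the network between the map and reduce machines.
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_phase(doc_id, text):
            grouped[key].append(value)
    # Each key's reduce is independent, so reduces can run in parallel.
    return dict(reduce_phase(k, v) for k, v in grouped.items())

docs = {"d1": "the web is big", "d2": "the web changes"}
print(run_mapreduce(docs))
# → {'the': 2, 'web': 2, 'is': 1, 'big': 1, 'changes': 1}
```

In a real Hadoop job the framework, not the programmer, handles the shuffle, the parallelism, and the restarting of failed tasks; the programmer supplies only the map and reduce functions.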

The Speaker:  

Owen O'Malley is a software architect on Hadoop working for Yahoo's Grid team, which is part of Yahoo's Cloud Computing & Data Infrastructure group. He has been contributing patches to Hadoop since before it was separated from Nutch, and is the chair of the Hadoop Project Management Committee. Although specializing in developing tools, he has wandered between testing (UCI), static analysis (Reasoning), configuration management (Sun), model checking (NASA), and distributed computing (Yahoo). He received his PhD in Software Engineering from the University of California, Irvine. See http://people.apache.org/~omalley.

Associated readings: 

See the file list below for a printable announcement for this lecture.

7Omalley.doc (28 KB)