"Hadoop: Data Intensive Scalable Computing at Yahoo!"
Owen O'Malley
3p-4:30p, Tuesday, November 11, 2008, Lecture Hall 3
Abstract:
As Brin and Page say in their classic 1998 paper on web search: engineering search engines is challenging. The web represents a huge and ever-increasing source of information, and indexing it requires processing extremely large data sets. Current search engines have information on hundreds of billions of web pages and more than a trillion links between them. And yet the data is constantly changing and must be re-indexed continually. Processing hundreds of terabytes of data in a reasonable amount of time requires a multitude of computers. However, using a lot of computers, especially commodity Linux PCs, means that computers are always failing, creating an operations nightmare.
In this talk, Owen will describe how search engines scale to the necessary size using software frameworks, an area now known as web-scale data intensive computing. In particular, he will show how large numbers of computers can be reliably coordinated to address such problems using Apache Hadoop, which is developed largely at Yahoo, and how to program such solutions using a programming model called Map/Reduce.
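For those who want a concrete feel for the Map/Reduce model before the talk, below is a minimal sketch of the canonical word-count job written against the Hadoop Java API. It is illustrative only: the class and method names follow the style of the WordCount example in the Hadoop documentation, and exact signatures have shifted across Hadoop releases.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: for each word in an input line, emit the pair (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: the framework groups pairs by word; sum the counts.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The appeal of the model is that the programmer writes only these two small functions; the framework handles splitting the input across machines, grouping the intermediate pairs by key, and re-running tasks on machines that fail, so the same code scales from one node to thousands.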
The Speaker:
Owen O'Malley is a software architect on Hadoop working for Yahoo's Grid team, which is part of Yahoo's Cloud Computing & Data Infrastructure group. He has been contributing patches to Hadoop since before it was separated from Nutch, and is the chair of the Hadoop Project Management Committee. Although he specializes in developing tools, he has wandered between testing (UCI), static analysis (Reasoning), configuration management (Sun), model checking (NASA), and distributed computing (Yahoo). He received his PhD in Software Engineering from the University of California, Irvine. See http://people.apache.org/~omalley.
Associated readings:
- Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW7/Computer Networks 30 (1-7): 107-117, 1998. http://infolab.stanford.edu/~backrub/google.html
- (optional) Jeffrey Dean and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." OSDI’04: Sixth Symposium on Operating System Design and Implementation. San Francisco CA. 2004. http://labs.google.com/papers/mapreduce.html
- (optional) Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. "The Google File System." 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October, 2003. http://labs.google.com/papers/gfs-sosp2003.pdf
- Peruse http://hadoop.apache.org/ and http://public.yahoo.com/gogate/hadoop-tutorial/.
See the file list below for a printable announcement for this lecture.
| Attachment | Size |
|---|---|
| 7Omalley.doc | 28 KB |