Thursday, September 13, 2007
Earlier this year, the University of Washington partnered with Google to develop and implement a course to teach large-scale distributed computing based on MapReduce and the Google File System (GFS). The goal of developing the course was to expose students to the methods needed to address the problems associated with hundreds (or thousands) of computers processing huge datasets ranging into terabytes. I was excited to take the first version of the class, and stoked to serve as a TA in the second round.
But you can't program air, so Google provided a cluster computing environment to get us started. And since computers can't program themselves (yet?), UW provided the most essential component: students with sweet ideas for a huge cluster. After learning the ropes with these new tools, students finished the course by producing an impressive array of final projects, including an n-body simulator, a bot to perform Bayesian analysis on Wikipedia edits to search for spam, and an RSS aggregator that clustered news articles by geographic location and displayed them using the Google Maps API. Check out Geozette.
We are looking at ways to encourage other universities to get similar classes going, so we've also published the course material that was used at the University of Washington on Google Code for Educators. You're more than welcome to check out the Google Summer Intern video lectures on MapReduce, GFS, and parallelizing algorithms for large scale data processing. This summer I've been working on exposing these educational resources and other tools so that anyone can work on and think about cool distributed computing problems without the overhead of installing his or her own cluster. In that vein, we've released a virtual machine containing a pre-configured single node instance of Hadoop that has the same interface as a full cluster without any of the overhead. Feel free to give it a whirl.
We're happy to be able to expose students and researchers to the tools Googlers use everyday to tackle enormous computing challenges, and we hope that this work will encourage others to take advantage of the incredible potential of modern, highly parallel computing. Virtually all of this material is Creative Commons licensed, and we encourage educators to remix it, build upon it, and discuss it in the Google Code for Educators Forum.
Lastly, a quick shout out to the other interns who helped out on our team this summer: Aaron Kimball, Christophe Taton, Kuang Chen, and Kat Townsend. I'll miss you guys!