Infusing Cluster Computing into the Introductory Computer Science Curriculum
Matthew Johnson, Robert H. Liao, Alexander Rasmussen, Ramesh Sridharan, Dan Garcia and Brian Harvey
We have incorporated cluster computing fundamentals into the introductory computer science curriculum at UC Berkeley. For the first course, we have developed coursework and programming problems in Scheme centered on Google's MapReduce. To allow students familiar only with Scheme to write and run MapReduce programs, we designed a functional interface in Scheme and implemented bindings that allow tasks to run in parallel on a cluster. The streamlined interface lets students program to the essence of the MapReduce model while avoiding the potentially cumbersome details of the MapReduce implementation, and so it delivers a clear pedagogical advantage.
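The paper's Scheme interface is not reproduced here, but the functional MapReduce model it exposes can be sketched sequentially: a mapper emits (key, value) pairs for each input record, the pairs are grouped by key, and a reducer combines each group. The following Python sketch (all names, including `mapreduce`, `wc_map`, and `wc_reduce`, are illustrative, not from the paper) models that contract with the canonical word-count example:

```python
from collections import defaultdict

def mapreduce(mapper, reducer, records):
    """Sequential model of MapReduce: apply mapper to each record,
    group the emitted (key, value) pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: the mapper emits (word, 1) per word; the reducer sums counts.
def wc_map(line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return sum(counts)

print(mapreduce(wc_map, wc_reduce, ["to be or", "not to be"]))
# → {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In a real cluster implementation the map and reduce phases run in parallel across worker nodes and the grouping step becomes a distributed shuffle, but the student-facing programming model is exactly this pair of pure functions.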
We have also developed cluster computing curricula for the other two courses: a direct use of the Hadoop API in our Java-based second course, and a performance and benchmarking analysis using MPI in our C-based third course. This treatment of parallelism in the introductory courses follows the "high-to-low" abstraction progression already present in the sequence, and together the courses provide a deep introduction to cluster parallelism.
Figure 1: (a) shows the flow of data as STk exports its environment and user code to HDFS and initializes the worker node task trackers running SISC, and (b) shows how the results of the parallel computation are copied from HDFS to the front-end local disk and returned within STk.