|about||I'm currently a CS major in my senior year. Having taken most of the upper division CS courses offered at Cal, I looked at the graduate listing for Spring and found CS267 of interest. I hope to gain both theoretical and practical experience from the course. Given that my goal is to work in industry after graduation, understanding the applications of parallelism is fast becoming invaluable.|
|experience||Upper division: CS188, CS186, CS162, CS184, CS170 (to date). Lower division: CS61A, CS61BL, CS61C, CS70 (to date). Internship: Software Engineer at Caucho (2010). Club: Engineering Club President (2008-2009).|
|interest||As an undergraduate, I have little experience with the full scope of software engineering. However, from the courses I have taken and an internship last summer, I have gained an interest in the design of operating systems and servers (in particular, servers over distributed systems). As an aside, I am also interested in the overwhelming possibilities of parallelism in graphics rendering (e.g., ray tracing).|
Introduction: Today, data centers accumulate data at an increasingly high rate. Programmers and administrators consistently face the problem of efficiently processing large data sets stored on equally large clusters, while also keeping their systems scalable. Engineers at Google noticed a common abstraction among the many solutions being used to compute over large data sets; Google later patented it as MapReduce. Hadoop, an Apache project, is an open source implementation written in Java that also offers APIs for a few other languages, such as C++.
MapReduce offers a solution to a subset of problems in the larger scope of distributed computing. The abstraction follows two simple phases. Before the first phase begins, a master machine breaks the original data set into an appropriate number of smaller data sets and distributes them to the worker machines. In the first phase, the map phase, each worker applies a map function to its share of the data and emits intermediate key/value pairs. In the second phase, known as the reduce (or accumulate) phase, the values emitted for each key are combined into a final result as output. Between the Map and Reduce phases is an intermediary grouping step, which collects all intermediate values that share a key so that they can be reduced together.
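The phases described above can be sketched on a single machine. The following is a minimal illustration in plain Python, not Hadoop code: the function names (`map_phase`, `group_phase`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for the example, and the loop over documents stands in for work that a real cluster would spread across worker machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit one (word, 1) pair for every word in a single input record."""
    return [(word.lower(), 1) for word in document.split()]


def group_phase(pairs):
    """Group: collect every value that was emitted under the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key, values):
    """Reduce: fold all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a cluster, each document (input split) would be mapped by a different worker.
    intermediate = []
    for document in documents:
        intermediate.extend(map_phase(document))
    grouped = group_phase(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["the quick fox", "the lazy dog", "The end"]))
```

Because each call to `map_phase` depends only on its own record, and each call to `reduce_phase` depends only on one key, both phases can be run in parallel without coordination; only the grouping step needs to see data from every worker.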
Description: The above image shows an analogous problem for Hadoop: break the data set into smaller chunks, compute on each chunk, and finally reduce the partial solutions into a final output. Image taken from http://www.gridgain.com/images/mapreduce_small.png.
Hadoop places an abstraction between the details of computing over large data sets and the developers and administrators working directly on clusters. All that Hadoop needs are the functions, specific to a given data set, that map the data and reduce the results appropriately.
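To illustrate that division of labor, here is a hedged sketch in plain Python. The `run_job` driver is a hypothetical stand-in for the framework (it is not Hadoop's API), and `temperature_mapper`, `max_reducer`, and the "year,temperature" record format are made up for the example. The point is only that the job author writes the two small functions and never touches the driver.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Hypothetical stand-in for the framework: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


# Everything below is the job-specific part a developer would supply.
def temperature_mapper(line):
    """Parse an assumed 'year,temperature' record into a (year, temperature) pair."""
    year, temperature = line.split(",")
    yield year, int(temperature)


def max_reducer(year, temperatures):
    """Keep only the highest temperature seen for a year."""
    return max(temperatures)


if __name__ == "__main__":
    readings = ["1949,111", "1950,22", "1949,78", "1950,0"]
    print(run_job(readings, temperature_mapper, max_reducer))
```

Swapping in a different mapper and reducer changes the computation without changing `run_job`, which is the kind of separation the paragraph above describes.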
Data: One of the requirements of Hadoop is that all data be presented to the map and reduce functions as key/value pairs. It is this mapping from keys to values that the map, grouping, and reduce phases exploit.
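One way the key/value structure gets exploited is in deciding which machine reduces which key. The sketch below, in plain Python, routes pairs to reducers by hashing the key; `partition`, `route`, and the use of `zlib.crc32` are assumptions made for illustration (Hadoop's default partitioning is similar in spirit, a hash of the key modulo the number of reduce tasks, but this is not its code).

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer index for a key; the same key always gets the same index."""
    # crc32 is used because it is deterministic across runs, unlike Python's str hash.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route(pairs, num_reducers):
    """Distribute (key, value) pairs into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


if __name__ == "__main__":
    pairs = [("apple", 1), ("pear", 2), ("apple", 3), ("fig", 4), ("pear", 5)]
    for index, bucket in enumerate(route(pairs, 3)):
        print(index, bucket)
```

Since every pair with a given key lands in the same bucket, each reducer can finish its keys without consulting any other reducer.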
The most difficult aspect of large data sets is managing them. Conventional file systems simply don't account for large data sets distributed over a large number of computers. Large clusters and databases often resort to their own distributed file systems, such as GFS, by Google, and VMFS, by VMware. Hadoop offers its own distributed filesystem, known as HDFS (Hadoop Distributed File System), which also allows systems to scale easily. It's argued that data sets on the order of hundreds of gigabytes sit at the low end of what such filesystems are meant to hold. The combined ability to compute multiple solutions in parallel over a series of computers, and to scale that series easily later, makes Hadoop a very successful tool in distributed computing. However, as mentioned before, Hadoop offers a solution to only a subset of the problems that exist in the scope of distributed computing.
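A rough sense of scale helps here. The sketch below, in plain Python, assumes a 64 MiB block size and three copies of every block, which were commonly cited HDFS defaults around the time of writing; both numbers are configurable and should be treated as assumptions. The round-robin placement in `place_blocks` and the node names are purely illustrative simplifications, not HDFS's actual placement policy.

```python
BLOCK_SIZE = 64 * 1024**2  # assumed block size: 64 MiB
REPLICATION = 3            # assumed number of copies kept of each block


def block_count(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold file_size bytes (integer ceiling division)."""
    return (file_size + block_size - 1) // block_size


def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Assign each block's replicas to distinct nodes, round-robin (a simplification)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(block_count(file_size, block_size))
    }


if __name__ == "__main__":
    size = 200 * 1024**3  # a 200 GiB data set
    print(block_count(size), "blocks")
    print(size * REPLICATION // 1024**3, "GiB of raw storage")
    print(place_blocks(3 * BLOCK_SIZE, ["n1", "n2", "n3", "n4"]))
```

Under these assumptions a 200 GiB data set becomes 3200 blocks occupying 600 GiB of raw disk, which no single conventional file system on one machine is designed to manage, while adding nodes to the list simply spreads the same blocks more thinly.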
Success: Yahoo! currently runs Hadoop on more than 36,000 machines, amounting to over 100,000 CPUs. Universities in India and TataMotors also run Hadoop on supercomputers for academic purposes. Many other companies listed on Hadoop's official website, including Facebook, use Hadoop on their main clusters.