Hadoop, Apache and the Benefits of Contributing to Open Source Projects
October 23, 2011
Jake Homan, a Senior Software Engineer at LinkedIn and UW Bothell CSS graduate, recently gave a guest lecture at UWB on Apache Hadoop: Petabytes and Terawatts, offering an overview of Hadoop and its applications as well as related distributed computing tools developed within the Apache Software Foundation. The presentation offered a great balance of breadth and depth that was very well suited to the audience, composed primarily of senior undergraduate and Master's-level computer science students (and a few faculty). One of the most valuable insights Jake shared was the enormous benefit that contributing to open source software projects can offer CS students - and others interested in software engineering careers - in developing and demonstrating both their technical skills and their ability to work and play well with others.
Jake explained that Hadoop has two primary components: a distributed file system and a framework to support distributed computation. The Hadoop Distributed File System (HDFS) divides files into 128 MB blocks, makes 2 copies - yielding 3 replicas - of each block, and then distributes the blocks across different DataNodes (computers). A NameNode manages the DataNodes and, among other tasks, regenerates the blocks stored on a DataNode when that DataNode dies - and given enough DataNodes and enough time, a DataNode is sure to die - to ensure that 3 replicas of every file block are always available.
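To make that a bit more concrete, here is a minimal Java sketch of writing a file to HDFS through the FileSystem API. The file path is made up for illustration, and the replication and block size settings shown in code here would normally be configured cluster-wide (in hdfs-site.xml) rather than per program:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative only: these are usually set in the cluster configuration, not in code
        conf.set("dfs.replication", "3");        // keep 3 replicas of every block
        conf.set("dfs.block.size", "134217728"); // 128 MB blocks, in bytes

        FileSystem fs = FileSystem.get(conf);
        // Hypothetical path; HDFS splits the file into blocks and replicates them across DataNodes
        FSDataOutputStream out = fs.create(new Path("/user/example/gettysburg.txt"));
        out.writeUTF("Four score and seven years ago ...");
        out.close();
        fs.close();
    }
}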
Hadoop provides a Java implementation of the MapReduce framework to support distributed computation. Using the prototypical example of a word count program - which Jake described as the "hello, world" program for distributed computing - he showed how to break down a computation into a Mapper and a Reducer. Generally speaking, a Mapper takes a <key, value> pair and generates zero or more <key, value> pairs; a Reducer takes all the values of one key and generates zero or more <key, value> pairs.
Applying this framework to the problem of counting words in a text (or collection of texts), a Hadoop program might start by splitting the text into lines or sentences where the keys represent the sequence positions of lines or sentences and the values represent the segments of text, e.g.,
<0, "Four score and seven years ago ...">
...
Hadoop would distribute these <key, value> pairs across DataNodes, where a TaskTracker on each DataNode would use a Mapper to split its line or sentence into a sequence of words and counts (each count initially 1), yielding
<"Four", 1>
<"score", 1>
<"and", 1>
<"seven", 1>
...
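A Mapper producing pairs like these might look something like the following Java sketch, using the org.apache.hadoop.mapreduce API (the class and variable names are my own, not from the talk; in practice the input key is the line's byte offset rather than a line number, but the idea is the same):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key: position (byte offset) of the line; input value: the line of text.
// Output: one <word, 1> pair per word in the line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // e.g., <"Four", 1>
        }
    }
}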
During the Reduce phase, the outputs of the Mappers are grouped and sorted by key, yielding <key, list-of-values> pairs:
<"a", [1, 1, 1, 1, 1, 1, 1]>
<"above", [1]>
<"add", [1]>
...
These are then reduced to <key, value> pairs, yielding the final sequence of words and their frequencies:
<"a", 7>
<"above", 1>
<"add", 1>
...
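The corresponding Reducer is just as short - a sketch, matching the hypothetical WordCountMapper above:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives one word and all of its counts; emits the word with the summed count.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));   // e.g., <"a", 7>
    }
}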
Distributed systems are increasingly the norm rather than the exception in companies providing any kind of web service - or performing any other kind of non-trivial computation - and so knowledge of and experience with distributed systems are an increasingly important component of a computer science education. However, even with that knowledge, writing programs that can take advantage of a distributed system architecture is still difficult and error-prone.
Jake said that if programmers can learn to think in terms of MapReduce, they can use Hadoop to manage many of the logistical and coordination aspects of distributed programming; if they prefer to think and work in terms of relational databases (SQL), they can use Hive; and if they prefer higher-level scripting languages, they can use Pig. Hive and Pig are among the many Apache tools that can be layered on top of Hadoop. [I wrote about several of these tools in a post last August on Hadoop Day in Seattle: Hadoop, Cascading, Hive and Pig.]
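As a rough illustration of how little plumbing the programmer actually writes, here is a sketch of a driver that would wire the hypothetical WordCountMapper and WordCountReducer above into a job and leave the rest - splitting the input, scheduling tasks, shuffling and sorting, retrying failed tasks - to Hadoop (the constructor and class names follow the Hadoop 0.20-era API; the input and output paths are supplied on the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input text in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // directory for the word counts
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}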
One of the most useful pieces of knowledge that Jake shared during his presentation concerned the often underappreciated second-order benefits of contributing to open source projects, i.e., above and beyond the intrinsic value of improving software tools which, in many cases, the contributors are using themselves. The first question he asks a software engineering candidate is "Have you done open source?" Open source software projects typically make all the code and the online conversations about the code publicly available, so Jake can do some background investigation to learn about both the code a candidate has written and the way the candidate has interacted with other contributors and stakeholders (e.g., how the candidate has responded to bug reports or feature requests). A candidate who has never contributed to an open source project may be at a considerable disadvantage.
Getting involved in an open source project can be intimidating, so Jake shared a link to the Apache Software Foundation list of ASF newbie issues that are appropriately scoped for someone who wants to test the waters. I have not contributed directly to any Apache project - yet - but I did engage in some civic hacktivism at Data Camp Seattle in February, and some random hacks of kindness at RHOK 3 in June. I would like to organize an appropriately and inspiringly themed open source hackathon at UWB for students, faculty and other interested parties sometime in the near future ... but it will have to wait until after the fall quarter, as the three classes I'm teaching now are consuming nearly all my time and energy. I'm glad I at least took an hour off last week for Jake's engaging and educational presentation.