October 2011

Hadoop, Apache and the Benefits of Contributing to Open Source Projects

Jake Homan, a Senior Software Engineer at LinkedIn and UW Bothell CSS graduate, gave a recent guest lecture at UWB on Apache Hadoop: Petabytes and Terawatts, offering an overview and applications of Hadoop as well as related distributed computing tools developed within the Apache Software Foundation. The presentation offered a great balance of breadth and depth that was very well suited to the audience, primarily composed of senior undergraduate and Master's-level computer science students (and a few faculty). One of the most valuable insights shared by Jake was the enormous value that contributing to open source software projects can offer CS students - and others interested in software engineering career opportunities - to develop and demonstrate both their technical skills and their ability to work and play well with others.

Jake explained that Hadoop has two primary components: a distributed file system and a framework to support distributed computation. The Hadoop Distributed File System (HDFS) divides files into 128 MB blocks, makes 2 copies - yielding 3 replicas - of all the blocks, and then distributes the blocks on different DataNodes (computers). A NameNode manages the DataNodes and, among other tasks, regenerates the file blocks stored on a DataNode when that DataNode dies - and given enough DataNodes and enough time, a DataNode is sure to die - to ensure that 3 replicas of every file block are always available.
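To make the block-and-replica arithmetic concrete, here is a rough Python sketch of the bookkeeping described above. It is only an illustration under stated assumptions, not how HDFS is implemented: the function names and the node names are hypothetical, and the round-robin placement is a simplification I chose for brevity (real HDFS placement takes rack topology into account).

```python
import math

BLOCK_SIZE_MB = 128  # block size quoted in the talk
REPLICATION = 3      # total replicas kept for every block


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return how many blocks are needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes.

    Assumption: simple round-robin placement, chosen only for illustration.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        block: [datanodes[(block + i) % len(datanodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def re_replicate(placement, dead_node, datanodes):
    """Mimic the NameNode's repair step after a DataNode dies.

    Every block that had a replica on the dead node gets a new replica
    on a surviving node that does not already hold that block.
    """
    survivors = [node for node in datanodes if node != dead_node]
    for block, nodes in placement.items():
        if dead_node in nodes:
            nodes.remove(dead_node)
            candidates = [node for node in survivors if node not in nodes]
            if not candidates:
                raise RuntimeError(f"no surviving DataNode can hold block {block}")
            nodes.append(candidates[0])
    return placement
```

With four hypothetical nodes, `place_replicas(split_into_blocks(1000), ["dn1", "dn2", "dn3", "dn4"])` spreads the 8 blocks of a 1000 MB file across the cluster, and `re_replicate` restores the third replica of each affected block if one node is lost.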

Hadoop provides a Java implementation of the MapReduce framework to support distributed computation. Using the prototypical example of a word count program - which Jake described as the "hello, world" program for distributed computing - he showed how to break down a computation into a Mapper and a Reducer. Generally speaking, a Mapper takes a <key, value> pair and generates zero or more <key, value> pairs; a Reducer takes all the values of one key and generates zero or more <key, value> pairs.

Applying this framework to the problem of counting words in a text (or collection of texts), a Hadoop program might start by splitting the text into lines or sentences where the keys represent the sequence positions of lines or sentences and the values represent the segments of text, e.g.,

<0, "Four score and seven years ago ...">
...

Hadoop would distribute these <key, value> pairs across DataNodes, where a TaskTracker on each DataNode would use a Mapper to split its line or sentence into a sequence of words and counts (where all counts are initially 1), yielding

<"Four", 1>
<"score", 1>
<"and", 1>
<"seven", 1>
...

During the Reduce phase, the outputs of Mappers are aggregated and sorted by key, yielding <key, list-of-values> pairs:

<"a", [1, 1, 1, 1, 1, 1, 1]>
<"above", [1]>
<"add", [1]>
...

These are then reduced [again] to <key, value> pairs, yielding the final sequence of word and frequency counts:

<"a", 7>
<"above", 1>
<"add", 1>
...
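The whole word count pipeline can be simulated on a single machine in a few lines of Python. This is a sketch of the idea rather than Hadoop's actual Java API: the names `mapper`, `shuffle`, `reducer` and `word_count` are my own, the tokenizing regular expression is an assumption, and I lowercase every word (also an assumption) so that "Four" and "four" are counted together.

```python
import re
from collections import defaultdict


def mapper(key, value):
    """Take a <line position, line text> pair and emit one <word, 1> pair per word."""
    for word in re.findall(r"[A-Za-z']+", value):
        yield word.lower(), 1  # assumption: counts are case-insensitive


def shuffle(pairs):
    """Group mapper output by key and sort by key: <key, list-of-values>."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reducer(key, values):
    """Take all the values for one key and emit a single <word, total> pair."""
    yield key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    mapped = [pair
              for position, line in enumerate(lines)
              for pair in mapper(position, line)]
    return [result
            for key, values in shuffle(mapped)
            for result in reducer(key, values)]
```

Calling `word_count(["Four score and seven years ago", "and seven more"])` returns a sorted list of (word, count) tuples in which "and" and "seven" each have a count of 2. In Hadoop, the map and reduce calls would run in parallel on many machines and the shuffle would move data between them; here everything runs in one process so the data flow is easy to follow.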

Distributed systems are increasingly the norm rather than the exception in companies providing any kind of web services - or involving any other kind of non-trivial computation - and so knowledge and experience in working with distributed systems is an increasingly important component of computer science education. However, even with knowledge of distributed systems, writing programs that can take advantage of distributed system architecture is still difficult and error-prone.

Jake said that if programmers can learn to think in terms of MapReduce, they can use Hadoop to manage many of the logistical and coordination aspects of distributed system programming; if programmers want to think or work with relational databases (SQL), they can use Hive; and if they want to think or work with higher level scripting languages, they can use Pig. Both of these are among the many Apache tools that can be layered on top of Hadoop. [I wrote about several of these tools in a post last August on Hadoop Day in Seattle: Hadoop, Cascading, Hive and Pig.]

One of the most useful pieces of knowledge that Jake shared during his presentation concerned the often underappreciated second-order benefits of contributing to open source projects, i.e., above and beyond the intrinsic value of improving software tools which, in many cases, programmers are using themselves. The first question he asks a software engineer candidate is "Have you done open source?" Open source software projects typically make all the code and the online conversations about the code publicly available, so Jake can do some background investigation to learn about both the open source code the candidate has written and the way the candidate has interacted with other contributors and stakeholders (e.g., the way a candidate has responded to bug reports or feature requests). The candidacy of any software engineer who has not contributed to any open source software projects may be considerably diminished by a deficit in this area.

Getting involved in an open source project can be intimidating, so Jake shared a link to the Apache Software Foundation list of ASF newbie issues that would be appropriately scoped projects for someone who wants to test the waters. I have not contributed directly to any Apache project - yet - but I did engage in some civic hacktivism at Data Camp Seattle in February, and some random hacks of kindness at RHOK 3 in June. I would like to organize an appropriately and inspiringly themed open source hackathon at UWB for students, faculty and other interested parties sometime in the near future ... but it will have to wait until after the fall quarter, as the three classes I'm teaching now are consuming nearly all my time and energy. I'm glad I at least took an hour off last week for Jake's engaging and educational presentation.


Reflections on Connections: A Review of Connected, the Film

Watching the recent Seattle premiere of Connected: An Autoblogography about Love, Death and Technology, a documentary directed by Tiffany Shlain, I experienced a cascading and interconnected series of thoughts and emotions evoked by this loving tribute to the intellectual and emotional influence that her late father, Leonard Shlain, had on his family and the world at large.

The film, which opened at the Varsity Theatre on Friday night (and is only slated for one week), starts out with an inspiring quote by John Muir, a naturalist with a clear vision of interconnectedness:

When you tug at a single thing in the universe, you find it's attached to everything else.

Shlain goes on to tug at a number of interrelated themes throughout the film, including personal experiences with life-changing events involving cancer, pregnancy, birth and death, as well as more general topics such as brains, alphabets, honeybees and evolution. Many of these topics were inspired by the life and work of her father, a surgeon who authored three books about interconnectedness that are now on my Amazon Wish List: Art & Physics, The Alphabet vs. the Goddess and Sex, Time and Power.

Two of the most interesting intellectual insights I gleaned from the film are the connections between brain hemisphere dominance and sexual dominance, and a broader view of technology as an evolutionary process. Shlain - and I'm being intentionally ambiguous here, because the film reflects the views of both father and daughter - suggests that the creation of an alphabet for communication enhanced the relative value of the brain's left hemisphere (responsible for logic and reason) over the right hemisphere (concerned with aesthetics and relationships) in human affairs. Since men tend to be left brain dominant and women tend to be right brain dominant, the growing prominence of left brain processing led to a growing dominance of men over women. Recent technological developments such as the Internet and social networking services increasingly enhance the prominence of relationships, and thereby help promote a resurgence of right brain processing and the consequent elevation of the role of women.

In a separate but related thread, Shlain describes technology as any tool that extends our capability and influence. The film includes a quote on technology by Albert Einstein:

It has become obvious that our technology has exceeded our humanity.

The film expresses a technological optimism, espousing a perspective in which technology, by definition, extends our humanity. During earlier stages in human evolution, growing brains gave hominids a growing evolutionary advantage. Eventually, the brains of Homo sapiens reached a maximum size, beyond which the birthing of big-brained babies would risk sacrificing their mothers. To maintain dominance, Homo sapiens has increasingly found innovative ways of utilizing external resources to complement our fixed-size brains, and so the relentless march of technological developments can be seen as a natural and unavoidable byproduct of human evolution.

That is not to say that all technological developments are good. We have developed technologies that can destroy humanity, and the world as we know it. But developing new technologies is simply part of being human, so the question is not whether or not we will continue to develop new technologies, but rather what capabilities do we want to enhance, and what kind of influence do we want to exert on the world.

While the film was intellectually stimulating on many levels, I found the emotional impact to be even stronger. At the outset of the Q&A session immediately following the film, Shlain seemed almost apologetic about the personal nature of the story she shared about her father, and the insights he developed and shared about various dimensions of interconnectedness. I believe the personal intensity she brought to the endeavor is what gives the story its incredible power, demonstrating the timeless wisdom of another visionary who understood interconnectedness, psychologist Carl Rogers:

What is most personal is most general.

Based on his portrayal in the film, Leonard Shlain appears to have had a profound impact on his family, and his battles with cancer motivated him to consciously renew his devotion to his family.

I recently received a handmade birthday card from my daughter that suggests that my influence on her life may be more significant than I'd imagined (and thankfully, mostly positive). I don't expect that my life and work will have the level of influence that Leonard Shlain achieved in his lifetime, but the card was a reminder of the influence I have had ... and the film offered a welcome opportunity to reflect on the kind of connections I want to cultivate with my family, as well as with my friends, colleagues, students and others I encounter, offline and online.


Continuing Education: Senior Lecturer at the University of Washington, Bothell

I recently embarked on the next stage of my re-engagement with academia, as a Senior Lecturer in the Computer & Software Systems program at the University of Washington, Bothell. Like the Tacoma campus, where I taught last winter and spring, the Bothell campus cultivates a small college culture within a large university system: classes are relatively small (with a maximum of 30-45 students in each) and there is a strong student-centered orientation among all the faculty and staff. The faculty - tenure track and non-tenure track - are actively engaged in research and other scholarly activities, but excellence in teaching is an essential attribute among all faculty.

During my first quarter, I am teaching courses on the Fundamentals of Computing (the introductory course for the CSS major) and Operating Systems (a senior-level core course in the major). I'm excited about teaching these courses for a number of reasons, not least of which is that these are the same courses I taught during my first semester of full-time teaching at the University of Hartford in 1985. Some content has changed, but many of the basic concepts have persisted over the intervening years. I'll be teaching courses on human-computer interaction, network design and web programming in the winter and spring quarters.

I don't anticipate much time for research during the next few quarters, as all of these courses will require new preparations on one or more dimensions. However, I do anticipate engaging some of my entrepreneurial energy. Although the Bothell campus is 20 years old, in the academic world this still qualifies as a "startup". The campus has ambitious growth plans to double in size over the next 5 years, and I'm looking forward to new opportunities for instigating, connecting and evangelizing in this new educational setting.

I also don't anticipate much time for blogging during this period; this post is already late (classes started last week), and I won't add much more to it. I do want to express my sincere gratitude for all the support I enjoyed from the faculty, staff and students at UW Tacoma throughout my initial re-engagement with academia last year. I am similarly grateful for the warm welcome I have received from the faculty, staff and students at UWB and CSS, and I look forward to my continuing education - as both a producer and a consumer - at the University of Washington.