Hadoop Day in Seattle: Hadoop, Cascading, Hive and Pig
August 24, 2010
I attended Hadoop Day - a community event to spread the love of Hadoop and Big Data - at Amazon's Pac-Med building in Seattle a week ago. I missed the morning session of the event, but recently became better acquainted with some of the dimensions of this space via the excellent overview and analysis by Mike Loukides at O'Reilly Radar, What Is Data Science? The afternoon "Introductory Track" included presentations about a number of tools for processing large data sets - Hadoop, Cascading, Hive and Pig - by large and small companies involved with big data - Karmasphere, Drawn to Scale, Facebook and Yahoo. The session was intended as a hands-on learning opportunity, but due in part to poor network connectivity, it ended up being mostly an eyes- and ears-on educational event (though still a very worthwhile one).
Abe Taha, VP of Engineering for Karmasphere, started the afternoon session with 0-60: Hadoop Development in 60 Minutes or Less (slides embedded below), which offered a great general introduction to Hadoop, a preview of the other tools that would be presented in later sessions (from different levels of the Hadoop stack) and an appropriately scaled (i.e., relatively brief and informative) demonstration of the Karmasphere Studio tool.
Abe led off with the motivation behind Hadoop: the need for a scalable tool for discovering insights (or at least patterns) in ever-increasing collections of data, such as logs of web site traffic. Hadoop embodies the MapReduce paradigm, in which data is represented as records or tuples, and computing processes can be broken down into mapping - in which some function is computed over a subset of tuples - and reducing - in which the results of applying the mapping function to different subsets are combined. The power of Hadoop comes from being able to farm out the functions and different data subsets across a cluster of computers, potentially increasing the speed of deriving a result.
Simple examples were offered to illustrate how Hadoop works, e.g., computing the maximum of a set of numbers, adding a set of numbers, and counting the occurrences of words in a large text or collection of texts (e.g., The Complete Works of William Shakespeare). After reviewing how these data sets might be represented in Hadoop, Abe provided some Java code to illustrate how the map and reduce functions could be implemented to process them (these code segments are included in the slides). Although the poor network connectivity precluded actually running the code during the session, the clear presentation and simple examples left a relative newcomer like me with the sense that "I can do this" (which I believe was the main objective for the day).
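To give a flavor of what such code looks like, here is a minimal sketch of the word counting example, along the lines of the canonical Hadoop WordCount (this is not Abe's exact code, which can be found in the slides): the mapper emits a (word, 1) pair for each word it encounters, and the reducer sums the counts for each word. Job setup boilerplate is omitted.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
      // map: for each word in a line of input, emit the pair (word, 1)
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // reduce: sum all the counts emitted for a given word
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }
    }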
Over half of Abe's slides were on Karmasphere Studio (starting around slide #26 (out of 66)), and the way it can help address some of the problems with overhead in Hadoop, particularly with respect to allowing debugging, prototyping and testing without having to deploy to a cluster of computers. However, only about a quarter of the hour-long presentation was devoted to the tool, and given that the Community Edition of Karmasphere Studio is available for free, I thought he achieved the right balance between covering Hadoop fundamentals and demonstrating a tool for using Hadoop.
Next up was Bradford Stephens, founder of Drawn to Scale and organizer of the event, who presented an Introduction to Cascading. Cascading is a layer on top of Hadoop that allows users to think and work at a higher level of abstraction, focusing on workflows rather than mapping and reducing functions. Cascading offers a collection of operations - functions, filters and aggregators - that can be used in conjunction with any Java Virtual Machine-based language. Bradford showed a 15-line sample of Cascading code to process Apache web server logs, and an equivalent 200-line Java program to do the same thing.
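For readers who haven't seen Cascading code, here is a condensed sketch in the spirit of that example, adapted from the log-parsing sample in the Cascading user guide (the input and output paths are hypothetical): a source tap reads raw lines, a regex-based function parses each line into fields, and a flow connects everything and runs it.

    // hypothetical input and output locations on HDFS
    String inputPath = "logs/apache.txt";
    String outputPath = "output/parsed-logs";

    // source and sink taps: where the data comes from and where it goes
    Tap source = new Hfs( new TextLine( new Fields( "offset", "line" ) ), inputPath );
    Tap sink = new Hfs( new TextLine(), outputPath, SinkMode.REPLACE );

    // declare the fields to parse out of each Apache log line, and the regex to do it
    Fields apacheFields = new Fields( "ip", "time", "method", "event", "status", "size" );
    String apacheRegex =
      "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\\\"([^ ]*) ([^ ]*) [^ ]*\\\" ([^ ]*) ([^ ]*).*$";
    RegexParser parser = new RegexParser( apacheFields, apacheRegex, new int[]{ 1, 2, 3, 4, 5, 6 } );

    // apply the parser to the "line" field of every tuple flowing through the pipe
    Pipe importPipe = new Each( "import", new Fields( "line" ), parser, Fields.RESULTS );

    // connect the pipe assembly to the source and sink taps, and run the flow
    Flow flow = new FlowConnector().connect( source, sink, importPipe );
    flow.complete();

Note how the code is expressed in terms of taps, pipes and flows; Cascading plans the underlying map and reduce jobs for you.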
Bradford offered the most interactive exercise of the afternoon, showing us some Cascading code to process New York Stock Exchange closing prices, and inviting us to help him write the code that would find the symbol and price of the stock with the highest closing price for each of the days represented in the dataset. I cannot find the slides for Bradford's talk, but the code and data he used in the examples are available at the main Hadoop Day site.
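I don't recall the exact solution we arrived at, but a sketch of one approach (with hypothetical field names, and assuming the closing prices have already been parsed into tuples, as in the previous sketch) would group the tuples by trading date, secondary-sort each group by closing price in descending order, and keep only the first tuple in each group:

    Pipe prices = new Pipe( "prices" );

    // group by trading day, sorting the tuples within each group by closing price, descending
    prices = new GroupBy( prices, new Fields( "date" ), new Fields( "close" ), true );

    // keep the first - i.e., highest-close - (symbol, close) pair in each group
    prices = new Every( prices, new Fields( "symbol", "close" ), new First() );

    // connect to source and sink taps (as above) and run
    new FlowConnector().connect( source, sink, prices ).complete();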
After a break, Ning Zhang, a software engineer at Facebook, presented an introduction to Hive entitled A Petabyte Scale Data Warehousing System on Hadoop (slides embedded below).
Ning presented some statistics about Facebook I'd heard or read elsewhere - e.g., 500M monthly active users, 130 friends per user (on average) - along with several I had not known before:
- 250 million daily active users
- 160 million active objects (groups, events, pages)
- 60 object (group/event/page) connections per user (on average)
- 500 billion minutes per month spent on the site across all users
- 25 billion content items are shared per month across all users
- 70 content items created per user per month (on average)
- 200 GB of data/day was being generated on the site in March 2008
- 12+ TB of data/day was being generated by end of 2009, growing by 8x annually
In addition to the increasing demands of users, Facebook application developers and advertisers want feedback on their apps and ads. Facebook decided against using closed, proprietary systems due to issues of cost, scalability and the length of development and release cycles. They considered using Hadoop directly, but wanted something that provided a higher level of abstraction and used the kinds of schemas traditionally provided in relational database management systems (RDBMS). Hive provides the capability to express operations in SQL and have them translated into the MapReduce framework, and provides extensive support for optimizations ... a dimension that is increasingly important for a company with ever-growing big data needs.
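As a hypothetical illustration (the table and column names are mine, not from Ning's talk), the following familiar-looking HiveQL is all an analyst has to write; Hive compiles it into MapReduce jobs, with the GROUP BY becoming the shuffle and reduce phases:

    -- define a table over tab-delimited log data
    CREATE TABLE page_views (user_id BIGINT, url STRING, view_time STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    -- the ten most viewed URLs; no Java required
    SELECT url, COUNT(1) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10;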
Alan Gates, a software architect on the Yahoo! grid team, led off his talk on Pig, Making Hadoop Easy (slides embedded below), with a motivating example, showing how a roughly 200-line Java program that uses Hadoop directly to find the top 5 sites visited by 18-25 year olds could be written as a 10-line Pig Latin program. Pig represents a middle way between straight Hadoop and the higher level abstractions provided by Cascading and Hive, providing the capability to program in a higher level scripting language (i.e., higher level than Java) while still being able to define elements procedurally (vs. the declarative definitions typical of SQL-oriented frameworks).
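That 10-line Pig Latin program looks something like the following sketch, adapted from the widely circulated version of this example in the Pig documentation (the file and field names may differ from Alan's slides):

    -- load the users and keep only those aged 18-25
    Users = LOAD 'users' AS (name, age);
    Fltrd = FILTER Users BY age >= 18 AND age <= 25;
    -- load the page views and join them to the filtered users
    Pages = LOAD 'pages' AS (user, url);
    Jnd   = JOIN Fltrd BY name, Pages BY user;
    -- count the views of each url and take the top 5
    Grpd  = GROUP Jnd BY url;
    Smmd  = FOREACH Grpd GENERATE group, COUNT(Jnd) AS clicks;
    Srtd  = ORDER Smmd BY clicks DESC;
    Top5  = LIMIT Srtd 5;
    STORE Top5 INTO 'top5sites';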
Alan recently wrote a blog post on Pig and Hive at Yahoo! in which he delves more deeply into the similarities and differences between the two frameworks, both of which have their place(s) in the realm of data warehousing. Data processing typically involves three phases: data collection, data preparation, and data presentation. Pig is particularly well suited to three tasks involved in the data preparation phase (aka Extract Transform Load (ETL) or data factory):
- pipelines: in which data is cleaned or otherwise transformed
- iterative processing: in which a big data set has a steady stream of incremental additions, making it costly to reprocess the entire set with each new addition
- research: a scripting language like Python is an excellent tool for doing things when you're not quite sure what you're doing [as an example, Jake Hofman, of Yahoo! Research, presented a tutorial on Large Scale Social Media Analysis with Hadoop at the International Conference on Weblogs and Social Media (ICWSM 2010) in May]
According to Alan, Hive is well suited to the data presentation phase, in which business intelligence analysis and ad hoc queries may be better accommodated by a language that directly supports SQL. It seems to me that an argument could be made that these tasks might also be categorized as research, though perhaps the differentiation between phases lies more in the types of questions one might most easily be able to ask (and answer). In any case, the data and code used by Alan in his talk are also available on the Hadoop Day site.
In addition to the interesting presentations, I noted a few things about the group at the event. I would estimate that 95% of the attendees were male - a much higher proportion than at the events I typically attend, which focus more on human-computer interaction and social computing - and the proportion of Macs was much lower than at other events I typically attend - perhaps 30%, with nearly 50% of the laptops I saw being Thinkpads. The presentations and presenters were great, as were the view and the food; the only downside was the poor wireless connectivity ... which was somewhat surprising, given the site (Amazon), but was probably due to the need to rig an ad hoc network outside the firewall just for the event.
All in all, it was a very worthwhile day, and I'm grateful to the organizers, the sponsors - Amazon Web Services, Fabric Worldwide, Cloudera and Karmasphere - and all the presenters for putting the event together. There is already some talk about holding another Eastside Networking Event, a technology-oriented event with several representatives from the local big data community (Amazon Web Services, Microsoft Windows Azure and Facebook [Seattle]) that I wrote about in my last post. I don't know whether there will be future day-long Hadoop events in Seattle, but there are monthly Hadoop meetups in Seattle, also organized by Bradford; the next meeting of the group is this Wednesday.