I attended my first Strata conference last week. The program offered a nice blend of strategic and technical insights and experiences regarding the design and use of "big data" systems. Atigeo was a sponsor, and I spent much of my time in our booth demonstrating and discussing our xPatterns big data analytics platform (about which I may write more later). Outside the exhibit area, highlights included a demonstration of the IPython Notebook, a tutorial on neural networks and deep learning, and a panel on Data for Good. I often find it helpful to compile - and condense - my notes after returning from a conference, and am sharing them here in case they are of use to others.
On Tuesday, I attended the all-day Hardcore Data Science tutorial, which comprised 10 different presentations.
After a brief review of some key historical and conceptual underpinnings of machine learning, Alex Gray delineated 3 sources of error that data scientists must contend with: finite data, wrong parameters and the wrong type of model. Techniques for reducing error include weak scaling (use more machines to model more data), strong scaling (use more machines to model the same data, faster) and exploratory data analysis and visualization tools. Demonstrations included sentiment analysis of Twitter data during the US presidential election, identification of outliers in a large data set and a visualization of Wikipedia [unfortunately, I can't find the slides or any information about these demos online]. Quotable quotes include the no free lunch theorem: "Do I have to read all of these machine learning papers to understand this concept?" "Yes."
What the #@)*$ is Big Data? A Holistic View of Data and Algorithms
Alice Zheng, Director of Data Science, GraphLab
Alice Zheng highlighted the gap between algorithms, which prefer certain data structures, and raw data, which is often amenable to certain data structures, and noted that data structures (beyond flat, 2-dimensional tables) are an often overlooked bridge between data and algorithms in data science and engineering efforts. She showed how data for movie recommendation systems and network diagnostic systems can be represented as tables, and then how representing them with graph data structures can make them much more efficient to work with. Her colleague, Carlos Guestrin, gave a more in-depth GraphLab tutorial in another session later that afternoon, which I imagine was somewhat similar to the one captured in a 42-minute video of a GraphLab session at Strata NY 2013.
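To make the table-versus-graph point concrete, here is a minimal Python sketch. It is not taken from the talk or from GraphLab: the ratings, the names (ratings_table, build_graph, co_raters_table, co_raters_graph) and the "who rated the same movies as this user?" query are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical ratings, stored as a flat table of (user, movie, rating) rows.
ratings_table = [
    ("ann", "Alien", 5),
    ("ann", "Brazil", 4),
    ("bob", "Alien", 4),
    ("bob", "Casablanca", 2),
    ("cat", "Brazil", 5),
]


def build_graph(rows):
    """Index the same rows as a bipartite graph: user -> movies and movie -> users."""
    movies_by_user = defaultdict(dict)
    users_by_movie = defaultdict(dict)
    for user, movie, rating in rows:
        movies_by_user[user][movie] = rating
        users_by_movie[movie][user] = rating
    return movies_by_user, users_by_movie


def co_raters_table(rows, user):
    """Table version: scan every row twice to find users who share a movie."""
    seen = {movie for u, movie, _ in rows if u == user}
    return {u for u, movie, _ in rows if movie in seen and u != user}


def co_raters_graph(movies_by_user, users_by_movie, user):
    """Graph version: follow edges user -> movie -> user, touching only neighbors."""
    return {
        other
        for movie in movies_by_user[user]
        for other in users_by_movie[movie]
        if other != user
    }
```

Both functions return the same answer, but the table version reads every row for every query, while the graph version only visits the rows connected to the user in question. That difference is what starts to matter as the data grows.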
Overcoming the Barriers to Production-Ready Machine-Learning Workflows [slides]
Henrik Brink, CTO, wise.io; Joshua Bloom, Professor, University of California, Berkeley
Henrik Brink and Joshua Bloom highlighted the gaps between data science and production systems, emphasizing the optimization tradeoffs among accuracy, interpretability and implementability. Effectively measuring accuracy requires choosing an appropriate evaluation metric that captures the essence of what you (or your customer) cares about. Interpretability should focus on what an end user typically wants to know - why a model gives specific answers (e.g., which features are most important) - rather than what a data scientist may find most interesting (e.g., how the model works). Implementability encompasses the time and cost of putting a model into production, including considerations of integration, scalability and speed. The lessons learned from the Netflix Prize are instructive, since implementation concerns led the sponsors not to deploy the winning algorithm, even though it achieved improved accuracy.
Ted Dunning defined an anomaly as "What just happened that shouldn't?" and posited the goal of anomaly detection as "Find the problem before other people do ... But don't wake me up if it isn't really broken." In detecting heart rate anomalies, he described the creation of a dictionary of shapes representing lower level patterns in a heart rate, and then using adaptive thresholding to look for outliers. In many anomaly detection problems, he has found that many key elements can be effectively modeled as mixture distributions.
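The adaptive thresholding idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the method shown in the talk: the rolling window, the nearest-rank quantile, and the names adaptive_anomalies, window, quantile and warmup are all hypothetical. The input values stand in for whatever error signal is left after the known shapes have been accounted for.

```python
from collections import deque


def adaptive_anomalies(values, window=50, quantile=0.99, warmup=10):
    """Yield (index, value) for points above a rolling quantile of recent history."""
    history = deque(maxlen=window)
    for i, x in enumerate(values):
        if len(history) >= warmup:
            ordered = sorted(history)
            # Nearest-rank quantile of the recent window acts as the threshold,
            # so the threshold moves as the recent data changes.
            rank = min(len(ordered) - 1, int(quantile * len(ordered)))
            if x > ordered[rank]:
                yield i, x
        history.append(x)
```

Because the threshold is recomputed from recent history rather than fixed in advance, a signal that drifts slowly does not set off alarms, while a sudden spike does. Sorting the window on every step is fine for a sketch; a streaming quantile estimator would be the natural replacement at scale.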
Deep learning is an extension of an old concept - multi-layer neural networks - that has recently become very popular. Ilya Sutskever provided a very accessible overview of the history, concepts and increasing capabilities of these systems, provocatively asserting - and providing some evidence for - "Anything humans can do in 0.1 seconds, a big 10-layer network can do, too." Connecting all nodes in all layers of such a network would be prohibitively expensive; convolutional neural networks restrict the number of connections by mapping only subregions between different layers. Several successful (and a few unsuccessful) examples of visual object recognition were illustrated in the Google+ photo search service. References were made to Yann LeCun's related work on learning feature hierarchies for object recognition, and to word2vec, an open source tool for computing vector representations of words that can be used in applying deep learning to language tasks.
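A little arithmetic shows why restricting connections matters. The layer sizes below (a 32x32 input mapped to a 32x32 output, with each output unit seeing a 5x5 patch) are my own illustrative assumptions, not figures from the talk, and the function names are hypothetical.

```python
def dense_connections(in_h, in_w, out_h, out_w):
    """Every output unit is connected to every input unit."""
    return (in_h * in_w) * (out_h * out_w)


def local_connections(out_h, out_w, patch):
    """Every output unit is connected only to a patch x patch subregion of the input."""
    return (out_h * out_w) * (patch * patch)


full = dense_connections(32, 32, 32, 32)   # 1,048,576 connections
local = local_connections(32, 32, 5)       # 25,600 connections
```

Under these assumptions the locally connected layer needs roughly 41 times fewer connections. A convolutional layer goes further by sharing the same 5x5 set of weights across all positions, so the number of distinct weights per feature map drops to 25.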
Kira Radinsky led off with some of the business complexities of sales cycles - due to factors such as time, cost, probability of a sale, amount of a sale - and the typically low rate of conversion (< 1%). She mentioned a number of techniques used by SalesPredict, such as automatic feature generation, classifiers as features and the use of personas to deal with sparseness and severe negative skew in CRM data. Revisiting the importance of interpretability, she described this perception problem as "Emotional AI", and gave an example where even though SalesPredict had achieved a 3-fold increase in conversion rates for a customer, the customer was not happy until they could understand why the system was prioritizing certain leads. She also warned of the dangers of success in prediction: once customers start relying on the ranking of sales leads, they focus all their efforts on those with "A" scores, neglecting all others, leading to potential missed opportunities (since the ranking is imperfect) and further skewing of the data.
Magda Balazinska's group is exploring a number of ways of facilitating the management of big data, and focused on just one - Collaborative Query Management (CQMS) - for this session. Unfortunately, I had to step away for part of this session, but my understanding is that CQMS involves collecting successful queries and making relevant queries available to other users who appear to be following similar trajectories in exploratory data analysis. While the goals and design of the system seem reasonable, they have not yet conducted any user studies to validate whether users find the provision of relevant queries helpful in their analysis.
Max Gasner encouraged us to apply a key lesson from relational databases to big data: decoupling implementation enables abstraction. Furthermore, he proposed that successful big data platform design and development should exhibit 4 properties: robust, honest, flexible and simple. He called out BigML (co-founded by my friend and former boss & CEO at Strands, Francisco Martin) as a first generation example of such a system. Echoing issues of interpretability raised in two earlier sessions, he noted that "black boxes are easy to use and hard to trust". Riffing on a phrase popularized by Mao Tse-Tung - "let a hundred flowers bloom; let a hundred [schools of thought] general purpose predictive platforms contend" - he noted there is lots of room to innovate on APIs and presentation, and so lots of opportunities for companies (like ours) building general purpose predictive platforms (GPPPs).
Ben Hamner warned that while machine learning is powerful, there are lots of ways to screw up; but he claimed that all are avoidable. Potential problems include overfitting, data insufficiency (or "finite data" as Alex Gray described it), data leakage (irrelevant features in the problem representation) and solving the wrong problem (calling to mind the 12 steps for technology-centered designers). He illustrated many of these problems - and solutions - with an amusing story about the iterative development of a vision-based system for regulating access through a pet door. He also offered an amusing quote by a machine learning engineer that captured the widespread zeal for ensemble learning methods:
"We'd been using logistic regression in high dimensional feature spaces, tried a random forest on it and it improved our performance by 14%. We were going to be rich!!"
I was initially skeptical about the wisdom of scheduling a presentation on algebra in the last slot of the session, but Oscar Boykin offered an energetic and surprisingly engaging overview of semigroups (sets with associative operations), monoids (semigroups with a zero, or identity, element) and the value of expressing computations as associative operations. He went on to champion the value of hashing rather than sampling to arrive at approximate - but acceptable - solutions to some big data problems, using Bloom filters, HyperLogLog and Count-min sketches as examples. In addition to sharing his slides, he also offered some challenges for those interested in diving further into the topic.
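To tie the algebra to the hashing, here is a small Count-min sketch in Python with a merge operation that behaves like a monoid: merging is associative, and an empty sketch acts as the zero. This is my own minimal sketch, not code from the talk or from any particular library; the class layout, the default width and depth, and the helper sketch_of are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: estimates never undercount, but may overcount on collisions."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One salted hash per row; hashlib keeps the results stable across runs.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._columns(item):
            self.rows[row][col] += count

    def estimate(self, item):
        # The smallest of the item's cells is the one least inflated by collisions.
        return min(self.rows[row][col] for row, col in self._columns(item))

    def merge(self, other):
        """Associative combine: cell-wise addition. An empty sketch is the identity."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must share dimensions")
        merged = CountMinSketch(self.width, self.depth)
        merged.rows = [
            [a + b for a, b in zip(mine, theirs)]
            for mine, theirs in zip(self.rows, other.rows)
        ]
        return merged


def sketch_of(items):
    """Hypothetical helper: build a sketch from an iterable of items."""
    sketch = CountMinSketch()
    for item in items:
        sketch.add(item)
    return sketch
```

Because merge is associative, partial sketches built on different machines can be combined in any grouping and still yield the same result as a single pass over all of the data, which is what makes this kind of structure a good fit for distributed counting.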
A Sampling of Other Strata Presentations
I spent much of my time on Wednesday and Thursday in the exhibitors area, but did manage to get out to see a few sessions, some of which I will briefly recount below.
Geoffrey Moore, author of Crossing the Chasm, was an ideal choice for a keynote speaker at Strata, given the prevalence of references to chasms and gaps throughout many of the other sessions. Moore presented a variant of the Technology Adoption Life Cycle, noting that pragmatists - on the other side of the chasm from the early adopters - won't move until they feel pain. For consumer IT, he recommends adopting lean startup principles and leaping straight to the "tornado"; for enterprise IT, he recommends focusing on breakthrough projects with top-tier brands, and building up high value use cases with compelling reasons to buy. He also reiterated one of his most quotable big data quotes:
"Without big data analytics, companies are blind and deaf, wandering out onto the web like deer on a freeway"
Eric Pugh shared some insights and experiences in building the Global Patent Search Network for the US Patent and Trademark Office. He and his team had to navigate tensions between two classes of developers (data people and UX people), as well as two classes of users (patent examiners and the general public). Among the lessons: don't underestimate the amount of effort required for user interface (40% for GPSN), put a clickable prototype with a subset of the data in front of users as early as possible, don't move files (use S3funnel), be careful where you sample your data (data volume can increase exponentially over time), and keep the pipeline as simple as possible.
Yann Ramin shared a broad array of problems - and solutions - in working with time series data, alerts and traces at Twitter, some of which are captured in an earlier blog post on Observability at Twitter. He made a strong case for the need to move beyond logging toward the cheap & easy real-time collection of structured statistics when working with web services (vs. programs running on host computers), highlighting the value of embedding first tier aggregation and sampling directly in large-scale web applications. Among the open source tools he illustrated were: the Finagle web service instrumentation package, a template (twitter-server) used to define Finagle-instrumented services at Twitter, and a distributed tracing system (zipkin) based on a 2010 research paper on Dapper. As with many other Strata presenters, he also had a pithy quote to capture the motivation behind much of the work he was presenting:
"When something is wrong, you need data yesterday"
Brian Granger (Cal Poly San Luis Obispo)
Of all the talks at Strata, this one got me the most excited about getting home (or back to work) to practice what I learned. In an act of recursive storytelling, Brian Granger told a story about how to use the IPython Notebook and NBViewer (NoteBook Viewer) to compose and share a reproducible story with code and data ... by using IPython Notebook. Running NBViewer in a browser, he was able to show and execute segments of Python code and have the results returned and variously rendered in the browser window. While the demonstration focused primarily on Python, the notebook also supports a variety of other languages (including Julia, R and Ruby). A recurring theme throughout the conference was bridging gaps, and in this case, the gap was characterized as "a grand canyon between the user and the data", with IPython Notebook serving as the bridge. He had given a longer tutorial - IPython in Depth - on Tuesday, and I plan to soon use the materials there to bridge the gap from learning to doing.
[Update, 2014-04-09: I have followed through on my intention, creating and posting an IPython Notebook on Python for Data Science]
The last session I attended at Strata was also inspiring, and I plan to look for local opportunities for doing [data science for] good. The moderator and panelists have all been involved in projects involving the application of data science techniques and technologies to help local communities, typically by helping local government agencies - which often have lots of data but little understanding of how to use it - better serve their constituents. Drew Conway helped NYFD use data mining to rank buildings' fire risk based on 60 variables, enabling them to better prioritize fire department inspectors' time. Rayid Ghani co-founded the Data Science for Social Good summer fellowship program at the University of Chicago last year, and Elena Eneva was one of the program mentors who was willing to take a sabbatical from her regular work to work with teams of students in formulating big data solutions to community problems [disclosure: Rayid and Elena are both friends and former colleagues from my Accenture days]. Rayid noted that there are challenges in matching data science solutions to community problems, and so he developed a checklist to help identify the most promising projects (two elements: a local community or government organization that has - and can provide - access to the data, and has the capacity for action). Elena suggested that most data scientists would be surprised at how much impact they could have by applying a few simple data science techniques. If I were to attempt to summarize the panel with a quote of my own, I would riff on Margaret Mead:
A number of other Strata attendees have shared their insights and experiences:
- A review of the Hardcore Data Science tutorial by Ben Lorica (Chief Data Scientist, O'Reilly Media): Bridging the gap between research and implementation
- Daily conference summaries by Dev Nambi: Tuesday, Wednesday and Thursday
- A conference summary by Alistair Croll (Strata program co-chair): Race Alongside the Machine
- A review of a VC panel by Jordan Novet (VentureBeat): What big-data VCs are sick of - and what they really want
- Observations about the shifting proportion of bizfolk ("suits") to devs ("hoodies"), among other aspects of Strata, by Dan Woods (Forbes): "The Buyers Arrive" - A Round-Up Of Strata 2014 In Santa Clara
- There's a nice set of Strata 2014 photos on Flickr
"Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it."
which, as far as I can tell, was first articulated in a Facebook post by Dan Ariely ... and was the inspiration for the cartoon: