I didn't physically attend Strata NY + Hadoop World this year, but I did watch the keynotes from the conference. O'Reilly Media kindly makes videos of the keynotes and slides of all talks available very soon after they are given. Among the recurring themes were haranguing against the hype of big data, the increasing utilization of Hadoop as a central platform (hub) for enterprise data, and the importance and potential impact of making data, tools and insights more broadly accessible within an enterprise and to the general public. The keynotes offered a nice mix of business (applied) & science (academic) talks, from event sponsors and other key players in the field, and a surprising - and welcome - number of women on stage.
Atigeo, the company where I now work on analytics and data science, co-presented a talk on Data Driven Models to Minimize Hospital Readmissions at Strata Rx last month, and I'm hoping we will be participating in future Strata events. And I'm hoping that some day I'll be on stage presenting some interesting data and insights at a Strata conference.
Meanwhile, I'll include some of my notes on interesting data and insights presented by others, in the order in which presentations were scheduled, linking each presentation title to its associated video. Unlike previous postings of notes from conferences, I'm going to leave the notes in relatively raw form, as I don't have the time to add more narrative context or visual augmentations to them.
Hadoop's Impact on the Future of Data Management
Mike Olson @mikeolson (Cloudera)
3000 people at the conference (sellout crowd), up from 700 people in 2009.
Hadoop started out as a complement to traditional data processing (offering large-scale processing).
Progressively adding more real-time capabilities, e.g. Impala & Cloudera search.
More and more capabilities migrating from traditional platforms to Hadoop.
Hadoop moving from the periphery to the architectural center of the data center, emerging as an enterprise data hub.
Hub: scalable storage, security, data governance, engines for working with the data in place
Spokes connect to other systems, people
Announcing Cloudera 5, "the enterprise data hub"
Announcing Cloudera Connect Cloud, supporting private & public cloud deployments
Announcing Cloudera Connect Innovators, inaugural innovator is Databricks (Spark real-time in-memory processing engine)
Separating Hadoop Myths from Reality
Jack Norris (MapR Technologies)
Hadoop is the first open source project that has spawned a market
3:35 compelling graph of Hadoop/HBase disk latency vs. MapR latency
Hadoop is being used in production by many organizations
Big Impact from Big Data
Ken Rudin (Facebook)
Need to focus on business needs, not the technology
You can use science, technology and statistics to figure out what the answers are, but it is still an art to figure out what the right questions are
How to focus on the right questions:
* hire people with academic knowledge + business savvy
* train everyone on analytics (internal DataCamp at Facebook for project managers, designers, operations; 50% on tools, 50% on how to frame business questions so you can use data to get the answers)
* put analysts in org structure that allows them to have impact ("embedded model": hybrid between centralized & decentralized)
Goals of analytics: Impact, insight, actionable insight, evangelism … own the outcome
Five Surprising Mobile Trajectories in Five Minutes
Tony Salvador (Intel Corporation)
Tony is director at the Experience Research Lab (is this the group formerly known as People & Practices?) [I'm an Intel Research alum, and Tony is a personal friend]
Personal data economy: system of exchange, trading personal data for value
* hyper individualism (Moore's Cloud, programmable LED lights)
* hyper collectivity (student projects with outside collaboration)
* hyper differentiation (holistic design for devices + data)
Big data is by the people and of the people ... and it should be for the people
Can Big Data Reach One Billion People?
Quentin Clark (Microsoft)
Praises Apache, open source, GitHub (highlighted by someone from Microsoft?)
Make big data accessible (MS?)
Hadoop is a cornerstone of big data
Microsoft is committed to making it ready for the enterprise
HDInsight: Azure offering for Hadoop
We have a billion users of Excel, and we need to find a way to let anybody with a question get that question answered.
Power BI for Office 365 Preview
What Makes Us Human? A Tale of Advertising Fraud
Claudia Perlich (Dstillery)
A Turing test for advertising fraud
Dstillery: predicting consumer behavior based on browsing histories
Saw 2x performance improvement in 2 weeks; was immediately skeptical
Integrated additional sources of data (10B bid requests)
Found "oddly predictive websites"
e.g., Women's health page --> 10x more likely to check out credit card offer, order online pizza, or read about luxury cars
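A figure like "10x more likely" is usually a lift: the action rate among visitors to one site divided by the action rate across everyone. The sketch below is only an illustration of that arithmetic. The function name and all of the counts are hypothetical, and nothing here is taken from Dstillery's actual pipeline.

```python
def lift(visitors_who_acted, visitors, all_who_acted, everyone):
    """Action rate among a site's visitors divided by the overall action rate.

    Algebraically this is (visitors_who_acted / visitors) / (all_who_acted / everyone),
    rearranged so the integer products are computed before the single division.
    """
    if visitors <= 0 or everyone <= 0 or all_who_acted <= 0:
        raise ValueError("visitors, everyone and all_who_acted must be positive")
    return (visitors_who_acted * everyone) / (visitors * all_who_acted)


if __name__ == "__main__":
    # Hypothetical counts: 50 of 1,000 visitors to one page took the action,
    # versus 5,000 of 1,000,000 browsers overall.
    print(lift(50, 1_000, 5_000, 1_000_000))  # prints 10.0
```

A lift this high on an unrelated site is the kind of "oddly predictive" signal that prompted the skepticism described above.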
Large advertising scam (botnet)
36% of traffic is non-intentional (Comscore)
Botnet behavior is easier to predict than human behavior
Put bots in "penalty box": ignore non-human behavior
From Fiction to Facts with Big Data Analytics
Ben Werther @bwerther (Platfora)
When it comes to big data, BI = BS
Contrasts enterprises based on fiction, feeling & faith vs. fact-based enterprises
Big data analytics: letting regular business people iteratively interrogate massive amounts of data in an easy-to-use way so that they can derive insight and really understand what's going on
3 layers: Deep processing + acceleration + rich analytics
Product: Hadoop processing + in-memory acceleration + analytics engines + Vizboards
Example: event series analytics + entity-centric data catalog + iterative segmentation
The Economic Potential of Open Data
Michael Chui (McKinsey Global Institute)
[Presentation is based on newly published - and openly accessible (walking the talk!) - report: Open data: Unlocking innovation and performance with liquid information.]
Louisiana Purchase: Lewis & Clark address a big data acquisition problem
Thomas Jefferson: "Your observations are to be taken with great pains & accuracy, to be entered intelligibly, for others as well as yourself"
What happens when you make data more liquid?
4 characteristics of "openness" or "liquidity" of data:
* degree of access
* machine readability
Benefits to open data:
* benchmarking exposing variability
* new products and services based on open data (Climate Corporation?)
How open data can enable value creation
* matching supply and demand
* collaboration at scale
"with enough eyes on code, all bugs are shallow"
--> "with enough eyes on data, all insights are shallow"
* increase accountability of institutions
Open data can help unlock $3.2B [typo? s/b $3.2T?] to $5.4T in economic value per year across 7 domains
* consumer products
* oil and gas
* health care
* consumer finance
What needs to happen?
* identify, prioritize & catalyze data to open
* developers, developers, developers
* talent (data scientists, visualization, storytelling)
* address privacy, confidentiality, security, IP policies
* platforms, standards and metadata
The Future of Hadoop: What Happened & What's Possible?
Doug Cutting @cutting (Cloudera)
Hadoop started out as a storage & batch processing system for Java programmers
Increasingly enables people to share data and hardware resources
Becoming the center of an enterprise data hub
More and more capabilities being brought to Hadoop
Inevitable that we'll see just about every kind of workload being moved to this platform, even online transaction processing
Designing Your Data-Centric Organization
Josh Klahr (Pivotal)
GE has created 24 data-driven apps in one year
We are working with them as a Pivotal investor and a Pivotal company; we help them build these data-driven apps, which generated $400M in the past year
Pivotal code-a-thon, with Kaiser Permanente, using Hadoop, SQL and Tableau
What it takes to be a data-driven company
* Have an application vision
* Powered by Hadoop
* Driven by Data Science
Encouraging You to Change the World with Big Data
David Parker (SAP)
Took Facebook 9 months to achieve the same number of users that it took radio 40 years to achieve (100M users)
At-risk students stay in school with real-time guidance (University of Kentucky)
Soccer players improve with spatial analysis of movement
Visualization of cancer treatment options
Big data geek challenge (SAP Lumira): $10,000 for best application idea
The Value of Social (for) TV
Shawndra Hill (University of Pennsylvania)
Social TV Lab
How can we derive value from the data that is being generated by viewers today?
Methodology: start with Twitter handles of TV shows, identify followers, collect tweets and their networks (followees + followers), build recommendation systems from the data (social network-based, product network-based & text-based (bag of words)). Correlate words in tweets about a show with demographics about audience (Wordle for male vs. female)
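The text-based (bag of words) step of the methodology above can be sketched roughly as follows: treat all tweets about a show as one bag of words, then rank other shows by cosine similarity. This is a minimal sketch under my own assumptions, not the Social TV Lab's implementation. The function names, the tokenizer, and the tiny sample data are all hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count word occurrences (a crude tokenizer)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count bags; 0.0 if either is empty."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(show, tweets_by_show, top_n=3):
    """Rank the other shows by how similar their tweets are to this show's tweets."""
    target = bag_of_words(" ".join(tweets_by_show[show]))
    scores = []
    for other, tweets in tweets_by_show.items():
        if other == show:
            continue
        score = cosine_similarity(target, bag_of_words(" ".join(tweets)))
        scores.append((score, other))
    # Highest similarity first; ties broken alphabetically for determinism.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scores[:top_n]]


if __name__ == "__main__":
    # Hypothetical tweets keyed by hypothetical show names.
    sample = {
        "show_a": ["great football game tonight"],
        "show_b": ["football game highlights"],
        "show_c": ["baking cake recipe"],
    }
    print(recommend("show_a", sample, top_n=1))  # prints ['show_b']
```

The social network-based and product network-based recommenders mentioned in the talk would need follower graphs rather than text, so they are not covered by this sketch.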
1. You can use Twitter followers to estimate viewer audience demographics
2. TV triggers lead to more online engagement
3. If brands want to engage with customers online, play an online game
Real time response to advertisement (Teleflora during Super Bowl): peaking buzz vs. sustained buzz
Demographic bias in sentiment & tweeting (male vs. female response to Teleflora, others)
Influence = retweeting
Women more likely to retweet women, men more likely to retweet men
4. Advertising response and influence vary by demographic
5. GetGlue and Viggle check-ins can be used as a reliable proxy for viewership to
* predict Nielsen viewership weeks in advance
* predict customer lifetime value
* measure time shifting
All at the individual viewer level (vs. household level)
Ubiquitous Satellite Imagery of our Planet
Will Marshall @wsm1 (Planet Labs)
Ultracompact satellites to image the earth on a much more frequent basis to get inside the human decision-making loop so we can help human action.
Redundancy via large # of small satellites with latest technology (vs. older, higher-reliability systems on one satellite)
Recency: shows more deforestation than Google Maps, river movement (vs. OpenStreetMap)
API for the Changing Planet, hackathons early next year
The Big Data Journey: Taking a holistic approach
John Choi (IBM)
Invention of sliced bread
Big data [hyped] as the biggest thing since sliced bread
Think about big data as a journey
1. It's all about discipline and knowing where you are going (vs. enamored with tech)
VC $2.6B investment into big data (IBM, SAP, Oracle, … $3-4B more)
2. Understand that any of these technologies do not live in a silo. The thing that you don't want to have happen is that this thing becomes a science fair project. At the end of the day, this is going to be part of a broader architecture.
3. This is an investment decision, want to have a return on investment.
How You See Data
Sharmila Shahani-Mulligan @ShahaniMulligan (ClearStory Data)
The Next Era of Data Analysis: next big thing is how you analyze data from many disparate sources and do it quickly.
More data: Internal data + external data
More speed: Fast answers + discovery
Increase speed of access & speed of processing so that iterative insight becomes possible.
More people: Collaboration + context
Needs to become easier for everyone across the business (not just specialists) to see insights as they are made available; people have to make decisions faster.
Can Big Data Save Them?
Jim Kaskade @jimkaskade (Infochimps)
1 of 3 people in US has had a direct experience with cancer in their family
1 in 4 deaths are cancer-related
Jim's mom has chronic leukemia
Just got off the phone with his mom (it's his birthday), and she asked "what is it that you do?"
"We use data to solve really hard problems like cancer"
Cancer is 2nd leading cause of death in children
"The brain trust in this room alone could advance cancer therapy more in a year than the last 3 decades."
We can help them by predicting individual outcomes, and then proactively applying preventative measures.
Big data starts with the application
Stop building your big data sandboxes, stop building your big data stacks, stop building your big data Hadoop clusters without a purpose.
When you start with the business problem, the use case, you have a purpose, you have focus.
50% of big data projects fail (reference?)
"Take that one use case, supercharge it with big data & analytics, we can take & give you the most comprehensive big data solutions, we can put it on the cloud, and for some of you, we can give you answers in less than 30 days"
"What if you can contribute to the cure of cancer?" [abrupt pivot back to initial inspirational theme]
Changing the Face of Technology - Black Girls CODE
Peta Clarke @volunteerbgcny (Black Girls Code - NY), Donna Knutt @donnaknutt (Black Girls Code)
Why coding is important: By 2020, 1.4M computing jobs
Women of color currently make up 3% of computing jobs in US
Goal: teach 1M girls to code by 2040
Thus far: 2 years, 2000 girls, 7 states + Johannesburg, South Africa
Beyond R and Ph.D.s: The Mythology of Data Science Debunked
Douglas Merrill @DouglasMerrill (ZestFinance)
[my favorite talk]
Anything which appears in the press in capital letters, and surrounded by quotes, isn't real.
There is no math solution to anything. Math isn't the answer, it's not even the question.
Math is a part of the solution. Pieces of math have different biases, different things they do well, different things they do badly, just like employees. Hiring one new employee won't transform your company; hiring one new piece of math also won't transform your company.
Normal distribution, bell curve: beautiful, elegant
Almost nothing in the real world is, in fact, normal.
Power laws don't actually have means.
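The remark about power laws holds when the tail is heavy enough. For a Pareto distribution with shape parameter alpha, the mean is finite only when alpha > 1; at alpha <= 1 it is infinite, so a sample average never settles down. The sketch below is my own illustration rather than anything shown in the talk, and the function names are hypothetical.

```python
import math
import random


def pareto_mean(alpha, x_min=1.0):
    """Theoretical mean of a Pareto distribution with density ~ x ** -(alpha + 1).

    The mean is alpha * x_min / (alpha - 1) when alpha > 1, and infinite otherwise.
    """
    if alpha <= 1:
        return math.inf
    return alpha * x_min / (alpha - 1)


def pareto_sample(alpha, n, x_min=1.0, seed=0):
    """Draw n Pareto samples by inverse-transform sampling."""
    rng = random.Random(seed)
    # 1 - rng.random() lies in (0, 1], which avoids dividing by zero.
    return [x_min / (1.0 - rng.random()) ** (1.0 / alpha) for _ in range(n)]


if __name__ == "__main__":
    for alpha in (3.0, 0.9):
        sample = pareto_sample(alpha, 100_000)
        sample_mean = sum(sample) / len(sample)
        print(f"alpha={alpha}: theoretical mean={pareto_mean(alpha)}, "
              f"sample mean={sample_mean:.2f}")
```

Rerunning with different seeds or sample sizes should leave the alpha=3.0 average close to its theoretical value, while the alpha=0.9 average jumps around, dominated by a handful of huge draws.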
Joke: How do you tell the difference between an introverted and an extroverted engineer? The extroverted one looks at your shoes instead of his own.
The math that you think you know isn't right. And you have to be aware of that. And being aware of that requires more than just math skills.
Science is inherently about data, so "data scientist" is redundant
However, data is not entirely about science
Math + pragmatism + communication
Prefers "Data artist" to data scientist
Fundamentally, the hard part actually isn't the math, the hard part is finding a way to talk about that math. And, the hard part isn't actually gathering the data, the hard part is talking about that data.
The most famous data artist of our time: Nate Silver.
Data artists are the future.
What the world needs is not more R, what the world needs is more artists (Rtists?)
Is Bigger Really Better? Predictive Analytics with Fine-grained Behavior Data
Foster Provost (NYU | Stern)
[co-author of my favorite book on Data Science]
Agrees with some of the critiques made by previous speaker, but rather likes the term "data scientist"
Shares some quotes from Data Science and its relationship to Big Data and Data-Driven Decision Making
Gartner Hype Cycle 2012 puts "Predictive Analytics" at the far right ("Plateau of Productivity")
[it's still there in Gartner Hype Cycle 2013, and "Big Data" has inched a bit higher into the "Peak of Inflated Expectations"]
More data isn't necessarily better (if it's from the same source, e.g., sociodemographic data)
More data from different sources may help.
Using fine-grained behavior data, learning curves show continued improvement to massive scale.
1M merchants, 3M data points (? look up paper)
But sociodemographic + pseudo social network data still does not necessarily do better
See Pseudo-Social Network Targeting from Consumer Transaction Data (Martens & Provost)
Seem to be very few case studies where you have really strong best practices with traditional data juxtaposed with strong best practices with another sort of data.
We see similar learning curves with different data sets, characterized by massive numbers of individual behaviors, each of which probably contains a small amount of information, and the data items are sparse.
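A learning curve of the kind described above plots held-out accuracy against the amount of training data. The sketch below shows how one could be computed on sparse binary data. It is only an illustration: the data is synthetic, the classifier is a simplified naive Bayes that scores only the features present in a row, and none of the names, parameters or models come from the talk or the associated papers.

```python
import math
import random


def make_sparse_data(n_rows, n_features=500, active=5, seed=0):
    """Synthetic sparse data: each row is a short list of active feature indices."""
    rng = random.Random(seed)
    weights = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
    rows, labels = [], []
    for _ in range(n_rows):
        features = rng.sample(range(n_features), active)
        rows.append(features)
        # Label is 1 when the hidden weights of the active features sum above zero.
        labels.append(1 if sum(weights[j] for j in features) > 0 else 0)
    return rows, labels


def train_naive_bayes(rows, labels):
    """Count how often each feature appears in positive and negative rows."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_counts, neg_counts = {}, {}
    for features, label in zip(rows, labels):
        counts = pos_counts if label == 1 else neg_counts
        for j in features:
            counts[j] = counts.get(j, 0) + 1
    return n_pos, n_neg, pos_counts, neg_counts


def predict(model, features):
    """Sum smoothed log-odds over the features present (absent ones are ignored)."""
    n_pos, n_neg, pos_counts, neg_counts = model
    score = math.log((n_pos + 1) / (n_neg + 1))
    for j in features:
        p = (pos_counts.get(j, 0) + 1) / (n_pos + 2)
        q = (neg_counts.get(j, 0) + 1) / (n_neg + 2)
        score += math.log(p / q)
    return 1 if score > 0 else 0


def learning_curve(train_rows, train_labels, test_rows, test_labels, sizes):
    """Return (training size, held-out accuracy) for each requested size."""
    curve = []
    for size in sizes:
        model = train_naive_bayes(train_rows[:size], train_labels[:size])
        correct = sum(predict(model, r) == y
                      for r, y in zip(test_rows, test_labels))
        curve.append((size, correct / len(test_rows)))
    return curve


if __name__ == "__main__":
    rows, labels = make_sparse_data(22_000)
    train_rows, train_labels = rows[:20_000], labels[:20_000]
    test_rows, test_labels = rows[20_000:], labels[20_000:]
    for size, accuracy in learning_curve(train_rows, train_labels,
                                         test_rows, test_labels,
                                         [100, 1_000, 5_000, 20_000]):
        print(size, round(accuracy, 3))
```

Whether the curve keeps rising at larger sizes depends on the data and the model, which is the point of the comparison in the talk; this toy setup does not reproduce the published results.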
See Predictive Modelling with Big Data: Is Bigger Really Better? (Enric Junqué de Fortuny, David Martens & Foster Provost)
Others have published work on fraud detection (Fawcett & FP, 1997; Cortes et al., 2001), social network-based marketing (Hill et al., 2006), online display-ad targeting (FP, Dalessandro, et al., 2009; Perlich et al., 2013)
Rarely see comparisons
Take home message:
The Golden Age of Data Science is at hand.
Firms with larger data assets may have the opportunity to achieve significant competitive advantage.
Whether bigger is better for predictive modeling depends on:
a) the characteristics of the data (e.g., sparse, fine-grained data on consumer behavior)
b) the capability to model such data