Any sufficiently large number of signals is indistinguishable from noise. I suspect this principle does not figure prominently in the consciousness of people who are live-tweeting from conferences or other physical world events, or participating in purely virtual tweet chats. I have filtered and even unfollowed several friends who have gone on live-tweeting or tweet chatting binges, as I do not care to have my main Twitter feed consumed by tweets from events I do not care about.
A tweet today from Alyssa Royse suggests I am not alone in this irritation regarding Twitter etiquette:
All of my blocked hashtags end in "con." Time for conferences to rethink how they ask people to use Twitter.
Although I do not physically attend many conferences or other tweet-worthy events these days, when I do, I have adopted a practice that others may find useful. I use the @reply mechanism to reference the event Twitter handle at the start of the tweet - which hides the tweet from anyone who does not follow both me and the event - and then use the designated event hashtag so that anyone who is explicitly following the event hashtag can also see it. Others may remain blissfully unaware of my avid participation in and live transcription of the event highlights.
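For anyone who composes tweets programmatically, here is a minimal sketch of that format: event handle first, designated hashtag last. The function name, the example handle and hashtag, and the 140-character default limit are my own assumptions for illustration; nothing here calls the Twitter API.

```python
def compose_event_tweet(event_handle, hashtag, text, limit=140):
    """Build a tweet that opens with the event's @handle and ends with its hashtag.

    The 140-character default is an assumed limit, not something checked against Twitter.
    """
    # Normalize so callers can pass the handle/hashtag with or without the prefix.
    handle = "@" + event_handle.lstrip("@")
    tag = "#" + hashtag.lstrip("#")
    tweet = f"{handle} {text.strip()} {tag}"
    if len(tweet) > limit:
        raise ValueError(f"tweet is {len(tweet)} characters; limit is {limit}")
    return tweet


# Hypothetical event handle and hashtag, purely for illustration.
print(compose_event_tweet("exampleconf", "excon13", "Great keynote on open data"))
```

Putting the handle first is what makes the tweet an @reply; the trailing hashtag keeps it discoverable for people following the event.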
But generally speaking, I try to maintain a small footprint for my live-tweeting ... and I would like to encourage others to adopt a similar practice.
[Oops - forgot about tweet chats ... probably because I do not participate in them. Briefly, a tweet chat is a period (typically an hour) during which a moderator will post a series of questions or prompts, and then others post responses to those questions, all using a designated hashtag. A similar practice can be adopted in such scenarios, in which respondents direct their responses to the moderator (or the person who posted the question) using @replies.]
I didn't physically attend Strata NY + Hadoop World this year, but I did watch the keynotes from the conference. O'Reilly Media kindly makes videos of the keynotes and slides of all talks available very soon after they are given. Among the recurring themes were haranguing against the hype of big data, the increasing utilization of Hadoop as a central platform (hub) for enterprise data, and the importance and potential impact of making data, tools and insights more broadly accessible within an enterprise and to the general public. The keynotes offered a nice mix of business (applied) & science (academic) talks, from event sponsors and other key players in the field, and a surprising - and welcome - number of women on stage.
Meanwhile, I'll include some of my notes on interesting data and insights presented by others, in the order in which presentations were scheduled, linking each presentation title to its associated video. Unlike previous postings of notes from conferences, I'm going to leave the notes in relatively raw form, as I don't have the time to add more narrative context or visual augmentations to them.
3000 people at the conference (sellout crowd), up from 700 people in 2009.
Hadoop started out as a complement to traditional data processing (offering large-scale processing).
Progressively adding more real-time capabilities, e.g., Impala & Cloudera Search.
More and more capabilities migrating from traditional platforms to Hadoop.
Hadoop moving from the periphery to the architectural center of the data center, emerging as an enterprise data hub.
Hub: scalable storage, security, data governance, engines for working with the data in place
Spokes connect to other systems, people
Announcing Cloudera 5, "the enterprise data hub"
Announcing Cloudera Connect Cloud, supporting private & public cloud deployments
Announcing Cloudera Connect Innovators, inaugural innovator is Databricks (Spark real-time in-memory processing engine)
Need to focus on business needs, not the technology
You can use science, technology and statistics to figure out what the answers are, but it is still an art to figure out what the right questions are
How to focus on the right questions:
* hire people with academic knowledge + business savvy
* train everyone on analytics (internal DataCamp at Facebook for project managers, designers, operations; 50% on tools, 50% on how to frame business questions so you can use data to get the answers)
* put analysts in org structure that allows them to have impact ("embedded model": hybrid between centralized & decentralized)
Goals of analytics: Impact, insight, actionable insight, evangelism … own the outcome
Tony is director at the Experience Research Lab (is this the group formerly known as People & Practices?) [I'm an Intel Research alum, and Tony is a personal friend]
Personal data economy: system of exchange, trading personal data for value
3 opportunities
* hyper individualism (Moore's Cloud, programmable LED lights)
* hyper collectivity (student projects with outside collaboration)
* hyper differentiation (holistic design for devices + data)
Big data is by the people and of the people ... and it should be for the people
Praises Apache, open source, GitHub (highlighted by someone from Microsoft?)
Make big data accessible (MS?)
Hadoop is a cornerstone of big data
Microsoft is committed to making it ready for the enterprise
HD Insight (?) Azure offering for Hadoop
We have a billion users of Excel, and we need to find a way to let anybody with a question get that question answered.
Power BI for Office 365 Preview
A Turing test for advertising fraud
Dstillery: predicting consumer behavior based on browsing histories
Saw 2x performance improvement in 2 weeks; was immediately skeptical
Integrated additional sources of data (10B bid requests)
Found "oddly predictive websites"
e.g., Women's health page --> 10x more likely to check out credit card offer, order online pizza, or read about luxury cars
Large advertising scam (botnet)
36% of traffic is non-intentional (Comscore)
Co-visitation patterns
Cookie stuffing
Botnet behavior is easier to predict than human behavior
Put bots in "penalty box": ignore non-human behavior
When it comes to big data, BI = BS
Contrasts enterprises based on fiction, feeling & faith vs. fact-based enterprises
Big data analytics: letting regular business people iteratively interrogate massive amounts of data in an easy-to-use way so that they can derive insight and really understand what's going on
3 layers: Deep processing + acceleration + rich analytics
Product: Hadoop processing + in-memory acceleration + analytics engines + Vizboards
Example: event series analytics + entity-centric data catalog + iterative segmentation
Louisiana Purchase: Lewis & Clark address a big data acquisition problem
Thomas Jefferson: "Your observations are to be taken with great pains & accuracy, to be entered intelligibly, for others as well as yourself"
What happens when you make data more liquid?
4 characteristics of "openness" or "liquidity" of data:
* degree of access
* machine readability
* cost
* rights
Benefits to open data:
* transparency
* benchmarking exposing variability
* new products and services based on open data (Climate Corporation?)
How open data can enable value creation
* matching supply and demand
* collaboration at scale
"with enough eyes on code, all bugs are shallow" --> "with enough eyes on data, all insights are shallow"
* increase accountability of institutions
Open data can help unlock $3.2B [typo? s/b $3.2T?] to $5.4T in economic value per year across 7 domains
* education
* transportation
* consumer products
* electricity
* oil and gas
* health care
* consumer finance
What needs to happen?
* identify, prioritize & catalyze data to open
* developers, developers, developers
* talent (data scientists, visualization, storytelling)
* address privacy, confidentiality, security, IP policies
* platforms, standards and metadata
Hadoop started out as a storage & batch processing system for Java programmers
Increasingly enables people to share data and hardware resources
Becoming the center of an enterprise data hub
More and more capabilities being brought to Hadoop
Inevitable that we'll see just about every kind of workload being moved to this platform, even online transaction processing
GE has created 24 data-driven apps in one year
We are working with them as a Pivotal investor and a Pivotal company; we help them build these data-driven apps, which generated $400M in the past year
Pivotal code-a-thon, with Kaiser Permanente, using Hadoop, SQL and Tableau
What it takes to be a data-driven company
* Have an application vision
* Powered by Hadoop
* Driven by Data Science
Took Facebook 9 months to achieve the same number of users that it took radio 40 years to achieve (100M users)
Use cases
At-risk students stay in school with real-time guidance (University of Kentucky)
Soccer players improve with spatial analysis of movement
Visualization of cancer treatment options
Big data geek challenge (SAP Lumira): $10,000 for best application idea
Social TV Lab
How can we derive value from the data that is being generated by viewers today?
Methodology: start with Twitter handles of TV shows, identify followers, collect tweets and their networks (followees + followers), build recommendation systems from the data (social network-based, product network-based & text-based (bag of words)).
Correlate words in tweets about a show with demographics about audience (Wordle for male vs. female)
1. You can use Twitter followers to estimate viewer audience demographics
2. TV triggers lead to more online engagement
3. If brands want to engage with customers online, play an online game
Real time response to advertisement (Teleflora during Super Bowl): peaking buzz vs. sustained buzz
Demographic bias in sentiment & tweeting (male vs. female response to Teleflora, others)
Influence = retweeting
Women more likely to retweet women, men more likely to retweet men
4. Advertising response and influence vary by demographic
5. GetGlue and Viggle check-ins can be used as a reliable proxy for viewership to
* predict Nielsen viewership weeks in advance
* predict customer lifetime value
* measure time shifting
All at the individual viewer level (vs. household level)
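To make the text-based (bag of words) idea in the methodology a bit more concrete, here is a minimal sketch that ranks shows by the similarity of the words in tweets about them. This is my own illustration, not the speaker's system: the cosine-similarity scoring, the function names, and the tiny made-up tweet samples are all assumptions.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word tokens in a piece of text."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count bags (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(target_show, tweets_by_show, top_n=2):
    """Rank the other shows by how similar their tweet vocabulary is to the target's."""
    bags = {show: bag_of_words(" ".join(tweets)) for show, tweets in tweets_by_show.items()}
    target = bags[target_show]
    scores = [
        (show, cosine_similarity(target, bag))
        for show, bag in bags.items()
        if show != target_show
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Made-up sample tweets, purely for illustration.
sample = {
    "show_a": ["great goal in the match tonight", "what a match"],
    "show_b": ["the match was tense, late goal"],
    "show_c": ["loved the cake recipe tonight"],
}
for show, score in recommend("show_a", sample):
    print(show, round(score, 3))
```

A real system would presumably drop common words such as "the" and weight rare terms more heavily; raw counts are used here only to keep the sketch short.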
Ultracompact satellites to image the earth on a much more frequent basis to get inside the human decision-making loop so we can help human action.
Redundancy via large # of small satellites with latest technology (vs. older, higher-reliability systems on one satellite)
Recency: shows more deforestation than Google Maps, river movement (vs. OpenStreetMap)
API for the Changing Planet, hackathons early next year
[No slides?]
Invention of sliced bread
Big data [hyped] as the biggest thing since sliced bread
Think about big data as a journey
1. It's all about discipline and knowing where you are going (vs. enamored with tech)
VC $2.6B investment into big data (IBM, SAP, Oracle, … $3-4B more)
2. Understand that any of these technologies do not live in a silo. The thing that you don't want to have happen is that this thing becomes a science fair project. At the end of the day, this is going to be part of a broader architecture.
3. This is an investment decision, want to have a return on investment.
The Next Era of Data Analysis: next big thing is how you analyze data from many disparate sources and do it quickly.
More data: Internal data + external data
More speed: Fast answers + discovery
Increase speed of access & speed of processing so that iterative insight becomes possible.
More people: Collaboration + context
Needs to become easier for everyone across the business (not just specialists) to see insights as insights are made available, have to make decisions faster.
Data-aware collaboration
Data harmonization
Demo: 6:10-8:30
1 of 3 people in US has had a direct experience with cancer in their family
1 in 4 deaths are cancer-related
Jim's mom has chronic leukemia
Just got off the phone with his mom (it's his birthday), and she asked "what is it that you do?"
"We use data to solve really hard problems like cancer"
"When?"
"Soon"
Cancer is 2nd leading cause of death in children
"The brain trust in this room alone could advance cancer therapy more in a year than the last 3 decades." Bjorn Brucher
We can help them by predicting individual outcomes, and then proactively applying preventative measures.
Big data starts with the application
Stop building your big data sandboxes, stop building your big data stacks, stop building your big data Hadoop clusters without a purpose.
When you start with the business problem, the use case, you have a purpose, you have focus.
50% of big data projects fail (reference?)
"Take that one use case, supercharge it with big data & analytics, we can take & give you the most comprehensive big data solutions, we can put it on the cloud, and for some of you, we can give you answers in less than 30 days"
"What if you can contribute to the cure of cancer?" [abrupt pivot back to initial inspirational theme]
Why coding is important:
By 2020, 1.4M computing jobs
Women of color currently make up 3% of computing jobs in US
Goal: teach 1M girls to code by 2040
Thus far: 2 years, 2000 girls, 7 states + Johannesburg, South Africa
[my favorite talk]
Anything which appears in the press in capital letters, and surrounded by quotes, isn't real.
There is no math solution to anything. Math isn't the answer, it's not even the question. Math is a part of the solution.
Pieces of math have different biases, different things they do well, different things they do badly, just like employees.
Hiring one new employee won't transform your company; hiring one new piece of math also won't transform your company.
Normal distribution, bell curve: beautiful, elegant
Almost nothing in the real world is, in fact, normal.
Power laws don't actually have means.
Joke: How do you tell the difference between an introverted and an extroverted engineer? The extroverted one looks at your shoes instead of his own.
The math that you think you know isn't right. And you have to be aware of that. And being aware of that requires more than just math skills.
Science is inherently about data, so "data scientist" is redundant
However, data is not entirely about science
Math + pragmatism + communication
Prefers "Data artist" to data scientist
Fundamentally, the hard part actually isn't the math, the hard part is finding a way to talk about that math. And, the hard part isn't actually gathering the data, the hard part is talking about that data.
The most famous data artist of our time: Nate Silver.
Data artists are the future.
What the world needs is not more R, what the world needs is more artists (Rtists?)
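The remark above that power laws do not have means can be checked with a little arithmetic. A minimal sketch, assuming a Pareto distribution (my choice; the talk did not specify one) with density alpha * x_min^alpha / x^(alpha + 1) for x >= x_min: the function below computes the contribution to the mean from values up to a cutoff. For alpha <= 1 that contribution keeps growing as the cutoff grows, so no finite mean exists; for alpha > 1 it settles toward alpha * x_min / (alpha - 1). The function name and the parameter values are hypothetical.

```python
import math


def pareto_partial_mean(alpha, cutoff, x_min=1.0):
    """Integral of x * pdf(x) from x_min to cutoff for a Pareto(alpha, x_min) distribution."""
    if alpha <= 0 or x_min <= 0 or cutoff < x_min:
        raise ValueError("need alpha > 0, x_min > 0 and cutoff >= x_min")
    if alpha == 1.0:
        # Special case: the integrand is x_min / x, which integrates to a logarithm.
        return x_min * math.log(cutoff / x_min)
    return alpha * x_min ** alpha * (cutoff ** (1 - alpha) - x_min ** (1 - alpha)) / (1 - alpha)


# Illustrative exponents: two with no finite mean (0.8, 1.0) and one with a finite mean (2.0).
for alpha in (0.8, 1.0, 2.0):
    values = [pareto_partial_mean(alpha, cutoff) for cutoff in (1e2, 1e4, 1e6)]
    print(alpha, [round(v, 3) for v in values])
```

So the claim is true of heavy-tailed power laws (alpha <= 1 in this parameterization) rather than of every power law, which is worth keeping in mind before averaging such data.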
[co-author of my favorite book on Data Science]
Agrees with some of the critiques made by previous speaker, but rather likes the term "data scientist"
Shares some quotes from Data Science and its relationship to Big Data and Data-Driven Decision Making
Gartner Hype Cycle 2012 puts "Predictive Analytics" at the far right ("Plateau of Productivity") [it's still there in Gartner Hype Cycle 2013, and "Big Data" has inched a bit higher into the "Peak of Inflated Expectations"]
More data isn't necessarily better (if it's from the same source, e.g., sociodemographic data)
More data from different sources may help.
Using fine-grained behavior data, learning curves show continued improvement to massive scale.
1M merchants, 3M data points (? look up paper)
But sociodemographic + pseudo social network data still does not necessarily do better
See Pseudo-Social Network Targeting from Consumer Transaction Data (Martens & Provost)
Seem to be very few case studies where you have really strong best practices with traditional data juxtaposed with strong best practices with another sort of data.
We see similar learning curves with different data sets, characterized by massive numbers of individual behaviors, each of which probably contains a small amount of information, and the data items are sparse.
See Predictive Modeling with Big Data: Is Bigger Really Better? (Enric Junqué de Fortuny, David Martens & Foster Provost)
Others have published work on fraud detection (Fawcett & FP, 1997; Cortes et al., 2001), social network-based marketing (Hill et al., 2006), online display-ad targeting (FP, Dalessandro, et al., 2009; Perlich et al., 2013)
Rarely see comparisons
Take home message:
The Golden Age of Data Science is at hand.
Firms with larger data assets may have the opportunity to achieve significant competitive advantage.
Whether bigger is better for predictive modeling depends on:
a) the characteristics of the data (e.g., sparse, fine-grained data on consumer behavior)
b) the capability to model such data