Data Science

Doing Data Science, Helping People Get Jobs @Indeed

Having just marked my 1-year anniversary at Indeed, it occurs to me that I have not yet blogged about my not-so-new job as a data scientist helping people get jobs. In addition to ending a long (7+ month) drought in my blogging practice, I'm also hoping that in sharing a bit about my work at Indeed, I might help more people learn about and get jobs at Indeed's growing Seattle office.

When I tell people I work for Indeed, I get 2 basic types of responses:

  • "Cool - I love Indeed!"
  • "What's Indeed?"

I don't think I can add much to the first response, except to say that I have found several of my own jobs on Indeed (including my current Data Scientist job at Indeed and my former Principal Scientist job at Nokia Research Center Palo Alto) and helped my daughter find two of her jobs on Indeed, so I love Indeed, too.

To address the second response: Indeed.com is the world's top job site, offering a number of tools to both help people find jobs and help jobs find people (or, more precisely, help employers find employees) - an ideal combination for someone like me with a long-standing passion for making connections ... and helping others make valuable connections.

Indeed helps people get jobs by providing tools to both job seekers and employers. Job seekers can upload or create a resume, search for jobs, receive email alerts when new jobs that match their search criteria are posted, directly apply for some jobs, and track the status of jobs for which they have applied, interviewed or been made offers, all at no cost to job seekers. The company also provides parallel tools for employers to upload jobs, search for resumes, and receive email alerts when new resumes that match their search criteria are uploaded.

Employers can enjoy some Indeed services for free. Indeed automatically aggregates millions of job postings from thousands of web sites every day. The site also provides an interface for employers to post jobs directly through Indeed. Paid services include sponsoring jobs to appear in job search results (using a pay-per-click revenue model), contacting job seekers who have uploaded resumes to the site, enabling job seekers to directly apply for jobs through Indeed, and using Indeed's applicant tracking system.

Many things have impressed me during my first year at Indeed, but I'll focus on just a few: mission, measurement, transparency and non-attachment.

The most impressive aspect of Indeed - which was apparent even during the interview process - is the pervasive and relentless focus on the mission of the company: helping people get jobs. I've long been fascinated by the world(s) of work, and it is inspiring to work alongside others who are similarly inspired to help people address a fundamental human need. Just about every internal discussion of a new product or feature at Indeed eventually boils down to the question of "Will this help more people get jobs?"

And the next question is usually "How can we measure the impact?" While a steadily increasing number of users are sharing their stories about finding jobs on Indeed, we don't always know when our users get jobs, so measuring impact often involves various proxies for job-seeking success, but such approximation is a fact of life in most data-driven companies.

Indeed makes extensive use of the Atlassian JIRA software project tracking system for new features, bugs and other issues that arise in the course of software development. Some of the other organizations in which I've worked had cultures of parochialism, secrecy and defensiveness, where critiques were best kept to oneself, or communicated privately. Early on at Indeed, I would often report bugs or make suggestions for improvements via email. After gentle and persistent encouragement, I now report them via JIRA, which - being publicly searchable (within the firm) - increases the possibility for sharing lessons learned. I have yet to encounter an Indeedian who has taken any such feedback personally, or felt so attached to a product feature or segment of code that they weren't willing to consider reviewing and revising it (or allowing someone else to do so) ... and unlike reports I've read about the culture at some other tech companies, I have yet to encounter an asshole at Indeed.

My own work at Indeed currently centers on helping people get jobs by taking greater advantage of the data in the millions of resumes that job seekers have created or uploaded at Indeed. This involves a mix of analyzing, cleaning and provisioning resume data to enhance existing products and inform new products designed to improve search and recommendations for both job seekers and employers.

Over the past several months, I've acted as the chief question answerer in an internal "Resume Q&A" forum we've created to help product managers, data scientists and software engineers better understand and leverage our resume data. Answering these questions has enabled me to practice thoroughly conscious ignorance, offering me numerous opportunities to ask questions of my own, and thereby learn about a broad range of products and processes, as well as various data and code repositories ... and helping forge new connections across them. The work offers me a nice blend of analysis, communication, coding (in Python and Java) and education, a few of my favorite things.

One of the advantages arising from my spiral career path is a user-centered focus I adopted during my years doing user experience research and design. As a practicing data scientist, my UX orientation occasionally helps me trace anomalies in the data back to shortcomings in one or more of the user interfaces or the flow of the user experience across Indeed web services. This UX-oriented data analysis has resulted in at least 2 small, but substantive, changes in the user interface, which I hope have helped more people get jobs.

In addition to regular opportunities to practice my natural inclinations toward instigating and connecting, I've recently started exercising my evangelizing inclinations. Last week I gave a demonstration / presentation to a local job search support group in Bellevue on how Indeed can help job seekers, and I am hoping to do more evangelizing to job seekers in the future. I am also hoping to start giving more technical presentations on some of the cool things we are doing at Indeed, evangelizing to different audiences, in part, to help us help more data scientists, software engineers, UX designers and researchers, product managers and quality analysts get jobs ... at Indeed, in Seattle and elsewhere.


Notes from #PyData Seattle 2015

I was among 900 attendees at the recent PyData Seattle 2015 conference, an event focused on the use of Python in data management, analysis and machine learning. Nearly all of the tutorials & talks I attended last weekend were very interesting and informative, and several were positively inspiring. I haven't been as excited to experiment with new tools since I discovered IPython Notebooks at Strata 2014.

I often find it helpful to organize my notes after attending a conference or workshop, and am sharing these notes publicly in case they are helpful to others. The following is a chronological listing of some of the highlights of my experience at the conference. My notes from some sessions are rather sparse, so it is a less comprehensive compilation than I would have liked to assemble. I'll also include some links to some sessions I did not attend at the end.

Python for Data Science

Joe McCarthy, Indeed, @gumption

This was my first time at a PyData conference, and I spoke with several others who were attending their first PyData. Apparently, this was the largest turnout for a PyData conference yet. I gave a 2-hour tutorial on Python for Data Science, designed as a rapid on-ramp primer for programmers new to Python or Data Science. Responses to a post-tutorial survey confirm my impression that I was unrealistically optimistic about being able to fit so much material into a 2-hour time slot, but I hope the tutorial still helped participants get more out of other talks & tutorials throughout the rest of the conference, many of which presumed at least an intermediate level of experience with Python and/or Data Science. As is often the case, I missed the session prior to the one in which I was speaking - the opening keynote - as I scrambled with last-minute preparations (ably aided by friends, former colleagues & assistant tutors Alex Thomas and Bryan Tinsley).

Scalable Pipelines w/ Luigi or: I’ll have the Data Engineering, hold the Java!

Jonathan Dinu, Galvanize, @clearspandex

Running and re-running data science experiments - in which many steps are repeated, some are varied (e.g., with different parameter settings), and several take a long time - is part of a typical data science workflow. Every company in which I've worked as a data scientist has rolled its own workflow pipeline framework to support this process, and each homegrown solution has offered some benefits while suffering from some shortcomings. Jonathan Dinu demonstrated Luigi, an open source library initially created by Spotify for managing batch pipelines that might encompass a large number of local and/or distributed computing cluster processing steps. Luigi offers a framework in which each stage of the pipeline has input, processing and output specifications; the stages can be linked together in a dependency graph which can be used to visualize progress. He illustrated how Luigi could be used for a sample machine learning pipeline (Data Engineering 101), in which a corpus of text documents is converted to TF-IDF vectors, and then models are trained and evaluated with different hyperparameters, and then deployed.
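
To give a flavor of that framework, here is a minimal two-stage sketch using Luigi's Task/Target API (the file names and processing logic are placeholders of my own, not Jonathan's example):

    import luigi

    class TokenizeDocs(luigi.Task):
        """First stage: turn a raw corpus into one tokenized document per line."""
        corpus_path = luigi.Parameter(default="corpus.txt")  # hypothetical input file

        def output(self):
            return luigi.LocalTarget("tokens.txt")

        def run(self):
            with open(self.corpus_path) as f_in, self.output().open("w") as f_out:
                for line in f_in:
                    f_out.write(" ".join(line.lower().split()) + "\n")

    class BuildTfIdf(luigi.Task):
        """Second stage: declares a dependency on TokenizeDocs; Luigi only
        re-runs stages whose output targets do not already exist."""
        def requires(self):
            return TokenizeDocs()

        def output(self):
            return luigi.LocalTarget("tfidf.txt")

        def run(self):
            with self.input().open() as f_in, self.output().open("w") as f_out:
                # placeholder for a real TF-IDF computation
                f_out.write(str(sum(len(line.split()) for line in f_in)))

    if __name__ == "__main__":
        luigi.build([BuildTfIdf()], local_scheduler=True)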

Keynote: Clouded Intelligence

Joseph Sirosh, Microsoft, @josephsirosh

Joseph Sirosh sprinkled several pithy quotes throughout his presentation, starting off with a prediction that while software is eating the world, the cloud is eating software (& data). He also introduced what may have been the most viral meme at the conference - the connected cow - as a way of emphasizing that every company is a data company ... even a dairy farm. In an illustration of where AI [artificial intelligence] meets AI [artificial insemination], he described a project in which data from pedometers worn by cows boosted estrus detection accuracy from 55% to 95%, which in turn led to more successful artificial insemination and increased pregnancy rates from 40% to 67%. Turning his attention from cattle to medicine, he observed that every hospital is a data company, and suggested that Florence Nightingale's statistical evaluation of the relationship between sanitation and infection made her the world's first woman data scientist. Sirosh noted that while data scientists often complain that data wrangling is the most time-consuming and challenging part of the data modeling process, that is because deploying and monitoring models in production environments - which he argued is even more time-consuming and challenging - is typically handed off to other groups. And, as might be expected, he demonstrated how some of these challenging problems can be addressed by Azure ML, Microsoft's cloud-based predictive analytics system.

The past, present, and future of Jupyter and IPython

Jonathan Frederic, Project Jupyter, @GooseJon

IPython (Jupyter) Notebooks are one of my primary data science tools. The ability to interleave code, data, text and a variety of [other] media makes the notebooks a great way to both conduct and describe experiments. Jonathan described the upcoming Big Split(tm), in which IPython will be separated from the Notebooks, to better emphasize the increasingly language-agnostic capabilities of the notebooks, which [will soon] have support for 48 language kernels, including Julia, R, Haskell, Ruby, Spark and C++. Version 4.0 will offer capabilities to

  • ease the transition from large notebook to small notebooks
  • import notebooks as packages
  • test notebooks
  • verify that a notebook is reproducible

As a former educator, I find one new capability particularly exciting: nbgrader, which uses the JupyterHub collaborative platform and provides support for releasing, fetching, submitting and collecting assignments. Among the most personally interesting tidbits I learned during this session was that IPython started out as Fernando Perez's "procrastination project" while he was in PhD thesis avoidance mode in 2001 ... an outstanding illustration of the benefits of structured procrastination.
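
Since notebooks are just JSON documents on disk, they are easy to inspect programmatically; here is a minimal sketch using the nbformat library (the file name is hypothetical):

    import nbformat

    # Read a notebook (as version 4 of the format) and pull out its code cells.
    nb = nbformat.read("analysis.ipynb", as_version=4)
    code_cells = [cell.source for cell in nb.cells if cell.cell_type == "code"]
    print(f"{len(code_cells)} code cells, {len(nb.cells)} cells total")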

Deep Learning with Python: getting started and getting from ideas to insights in minutes

Alex Korbonits, Nuiku, @korbonits

Deep Learning seems to be well on its way toward the peak of inflated expectations lately (e.g., Deep Learning System Tops Humans in IQ Tests). Alex Korbonits presented a number of tools for and examples of Deep Learning, the most impressive of which was AlexNet, a deep convolutional neural network developed by another Alex (Alex Krizhevsky, et al) that outperformed all of its competitors in the ILSVRC 2012 ImageNet competition (1.3M high-res images across 1000 classes) by such a substantial margin that it changed the course of research in computer vision, a field that had hitherto been dominated by hand-crafted features refined over a long period of time. Alex Korbonits went on to demonstrate a number of Deep Learning tools & packages, e.g., Caffe and word2vec, and applications involving scene parsing and unsupervised learning of high-level features. It should be noted that others have taken a more skeptical view of Deep Learning, and illustrated some areas in which there's still a lot of work to be done.
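
To give a flavor of the word2vec side of the talk, here is a minimal sketch using gensim's Word2Vec on a toy corpus (assuming a gensim 4.x API; older versions use size= rather than vector_size=); a real corpus needs far more text before nearest-neighbor queries mean much:

    from gensim.models import Word2Vec

    sentences = [["deep", "learning", "with", "python"],
                 ["machine", "learning", "with", "python"],
                 ["scene", "parsing", "with", "deep", "networks"]]

    # Train a tiny model just to show the API shape.
    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1)
    print(model.wv.most_similar("learning", topn=3))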

Jupyter for Education: Beyond Gutenberg and Erasmus

Paco Nathan, O’Reilly Media, @pacoid

One of the most challenging aspects of attending a talk by Paco Nathan is figuring out how to divide my time between listening, watching, searching for or typing in referenced links ... and taking notes. He is a master speaker, with compelling visual augmentations and links to all kinds of interesting related material. Unfortunately, while my browser fills up with tabs during his talks, my notes typically end up rather sparse. In this talk, Paco talked about the ways that O'Reilly Media is embracing Jupyter Notebooks as a primary interface for authors using their multi-channel publishing platform. An impressive collection of these notebooks can be viewed on the O'Reilly Learning page. Paco observed that the human learning curve is often the most challenging aspect of leading data science teams, as data, problems and techniques change over time. The evolution of user expertise, e.g., among connoisseurs of beer, is another interesting area involving human learning curves that was referenced during this session.

Counterfactual evaluation of machine learning models

Michael Manapat, Stripe, @mlmanapat

Fraud detection presents some special challenges in evaluating the performance of machine learning models. If a model is trained on past transactions that are labeled based on whether or not they turned out to be fraudulent, once the model is deployed, the new transactions classified by the model as fraud are blocked. Thus, the transactions that are allowed to go through after the model is deployed may be substantially different - especially with respect to the proportion of fraudulent transactions - than those that were allowed before the model was deployed. This makes evaluation of the model performance difficult, since the training data may be very different from the data used to evaluate the model. It also complicates the training of new models, since the new training data (post model deployment) will be biased. Michael Manapat presented some techniques to address these challenges, involving allowing a small proportion of potentially fraudulent transactions through and using a propensity function to control the "exploration/exploitation tradeoff".
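
A minimal sketch of the idea, assuming an epsilon-style exploration policy and inverse-propensity weighting (the function names, threshold and numbers are my own, not Stripe's):

    import numpy as np

    rng = np.random.default_rng(0)

    def decide(fraud_score, threshold=0.5, explore_prob=0.05):
        """Return (allowed, propensity): block flagged transactions, but let a
        small random fraction through so their true outcomes stay observable."""
        if fraud_score < threshold:
            return True, 1.0
        if rng.random() < explore_prob:
            return True, explore_prob
        return False, 0.0

    def ipw_fraud_rate(is_fraud, propensity):
        """Inverse-propensity-weighted estimate of the fraud rate we would
        have observed had no transactions been blocked."""
        weights = 1.0 / np.asarray(propensity)
        return np.average(np.asarray(is_fraud), weights=weights)

    # e.g., labels and propensities for the transactions that were allowed
    print(ipw_fraud_rate([0, 0, 1, 0], [1.0, 1.0, 0.05, 1.0]))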

Keynote: A Systems View of Machine Learning

Josh Bloom, UC Berkeley & wise.io, @profjsb

In the last keynote of the conference, Josh Bloom shared a number of insights about considerations often overlooked by data scientists regarding how data models fit into the systems into which they are deployed. For example, while data scientists are often concerned with optimizing a variety of parameters in building a model, other important areas for optimization are overlooked, e.g., the hardware and software demands of a deployed model (e.g., the decision by Netflix not to deploy the model with the highest score in the Netflix Prize), the human resources required to implement and maintain the model, the ways that consumers will [try to] interpret or use the model, and the direct and indirect impacts of the model on society. Noteworthy references include a paper by Sculley, et al, on Machine Learning: The High Interest Credit Card of Technical Debt and Leon Bottou's ICML 2015 keynote on Two Big Challenges of Machine Learning.

NLP and text analytics at scale with PySpark and notebooks

Paco Nathan, O'Reilly Media, @pacoid

Once again, I had a hard time keeping up with the multi-sensory inputs during a talk by Paco Nathan. Although I can't find his slides from PyData, I was able to find a closely related slide deck (embedded below). The gist of the talk is that many real-world problems can often be represented as graphs, and that there are a number of tools - including Spark and GraphLab - that can be utilized for efficient processing of large graphs. One example of a problem amenable to graph processing is the analysis of a social network, e.g., contributors to open source forums, which reminded me of some earlier work by Weiser, et al (2007), on Visualizing the signatures of social roles in online discussion groups. The session included a number of interesting code examples, some of which I expect are located in Paco's spark-exercises GitHub repository. Other interesting references included TextBlob, a Python library for text processing, and TextRank, a graph-based ranking model for text processing, a paper by Mihalcea & Tarau from EMNLP 2004.
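
To give a sense of the graph-based angle, here is a toy TextRank-style keyword ranker built on networkx's PageRank (a drastically simplified take on Mihalcea & Tarau's method, with a co-occurrence window I chose arbitrarily):

    import networkx as nx

    def textrank_keywords(tokens, window=3, top_k=5):
        """Rank words by PageRank over a word co-occurrence graph."""
        graph = nx.Graph()
        for i, word in enumerate(tokens):
            for other in tokens[i + 1:i + window]:
                if word != other:
                    graph.add_edge(word, other)
        scores = nx.pagerank(graph)
        return sorted(scores, key=scores.get, reverse=True)[:top_k]

    tokens = ("graphs can represent many real world problems and "
              "spark can process large graphs").split()
    print(textrank_keywords(tokens))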

Pandas Under The Hood: Peeking behind the scenes of a high performance data analysis library

Jeffrey Tratner, Counsyl, @jtratner

Pandas - the Python open source data analysis library - may take 10 minutes to learn, but I have found that it takes a long time to master. Jeff Tratner - a key contributor to Pandas, an open source community he described as "really open to small contributors" - shared a number of insights into how Pandas works, how it addresses some of the problems that make Python slow, and how the use of certain features can lead to improved performance. For example, specifying the data type of columns in a CSV file via the dtype parameter in read_csv can help pandas save space and time while loading the data from the file. Also, the DataFrame.append operation is very expensive, and should be avoided wherever possible (e.g., by using merge, join or concat). One of my favorite lines: "The key to doing many small operations in Python: don't do them in Python."
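
Two of those tips in a small, self-contained sketch (the columns and data are made up):

    import io
    import pandas as pd

    csv_text = "user_id,state,amount\n1,WA,9.99\n2,OR,4.50\n3,WA,12.00\n"

    # Declaring dtypes up front lets read_csv skip type inference and store
    # low-cardinality strings as categories, saving time and memory.
    df = pd.read_csv(io.StringIO(csv_text),
                     dtype={"user_id": "int64", "state": "category"})

    # Repeated appends copy all existing rows each time; collect the pieces
    # in a list and concatenate once instead.
    pieces = [df.assign(batch=i) for i in range(3)]
    combined = pd.concat(pieces, ignore_index=True)
    print(combined.dtypes)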

Mistakes I've Made

Cameron Davidson-Pilon, Shopify, @cmrn_dp

While I believe that there are no mistakes, only lessons, I do value the relatively rare opportunities to learn from others' lessons, and Cameron Davidson-Pilon (author of Probabilistic Programming & Bayesian Methods for Hackers) shared some valuable lessons he has learned in his data science work over the years. Among the lessons he shared:

  • Sample sizes are important
  • It is usually prudent to underestimate predictions of performance of deployed models
  • Computing statistics on top of statistics compounds uncertainty
  • Visualizing uncertainty is the role of a statistician (see the sketch after this list)
  • Don't [naively] use PCA [before regression]
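
On the sample-size and uncertainty points, here is a minimal illustration (with simulated numbers, not Cameron's) of reporting a bootstrap interval rather than a bare point estimate:

    import numpy as np

    rng = np.random.default_rng(42)
    sample = rng.normal(loc=0.05, scale=0.2, size=200)  # e.g., simulated lifts

    # Resample with replacement to see how much the mean moves around.
    boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
                  for _ in range(5000)]
    low, high = np.percentile(boot_means, [2.5, 97.5])
    print(f"mean={sample.mean():.3f}, 95% bootstrap CI=({low:.3f}, {high:.3f})")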

Among the interesting, and rather cautionary, references:

There were a few sessions about which I read or heard great things, but which I did not attend. I'll include information I could find about them, in the chronological order in which they were listed in the schedule, to wrap things up.

Testing for Data Scientists

Trey Causey, Dato, @treycausey

Learning Data Science Using Functional Python

Joel Grus, Google, @joelgrus

Code + Google docs presentation (can't figure out how to embed)

Big Data Analytics - The Best of the Worst : AntiPatterns & Antidotes

Krishna Sankar, blackarrow.tv, @ksankar

Python Data Bikeshed

Rob Story, Simple, @oceankidbilly

GitHub repo

Low Friction NLP with Gensim

Trent Hauck, @trent_hauck

Slides [PDF]

[Update, 2015-08-05: the PyDataTV YouTube channel now has videos from the conference]


Python for Data Science: A Rapid On-Ramp Primer

In my last post, I was waxing poetic about an IPython Notebook demonstration that was one of my highlights from Strata 2014:

"this one got me the most excited about getting home (or back to work) to practice what I learned"

Well, I got back to work, and learned how to create an IPython Notebook. Specifically, I created one to provide a rapid "on-ramp" for computer programmers who are already familiar with basic concepts and constructs in other programming languages to learn enough about Python to effectively use the Atigeo xPatterns analytics framework (or other data science tools). The Notebook also includes some basic data science concepts, utilizing material I'd earlier excerpted in a blog post in which I waxed poetic about the book Data Science for Business, by Foster Provost and Tom Fawcett, and other resources I have found useful in articulating the fundamentals of data science.

The rapid on-ramp approach was motivated, in part, by my experience with the Natural Language Toolkit (NLTK) book, which provides a rapid on-ramp for learning Python in conjunction with the open-source NLTK library to develop programs using natural language processing techniques (many of which involve machine learning). I find that IPython Notebooks are such a natural and effective way of integrating instructional information and "active" exercises that I wish I'd discovered them back when I was teaching courses using Python at the University of Washington (e.g., what came to be known as the socialbots course). I feel like a TiVO fanatic now, wanting to encourage anyone and everyone sharing any knowledge about Python to use IPython Notebooks as a vehicle for doing so.

I piloted an initial version of the Python for Data Science notebook during an internal training session for software engineers who had experience with Java and C++ a few weeks ago, and it seemed to work pretty well. After the Strata 2014 videos were released, I watched Olivier Grisel's tutorial on Introduction to Machine Learning with IPython and scikit-learn, and worked through the associated parallel_ml_tutorial notebooks he posted on GitHub. I updated my notebook to include some additional aspects of Python that I believe would be useful in preparation for that tutorial.

Not only was this my first IPython Notebook, but I'm somewhat embarrassed to admit that the Python for Data Science repository represents my first contribution to GitHub. When I was teaching at UW, I regularly encouraged students to contribute to open source projects. Now I'm finally walking the talk ... better late than never, I suppose.

In any case, I've uploaded a link to the repository on the IPython Notebook Viewer (NBViewer) server - "a simple way to share IPython Notebooks" - so that the Python for Data Science notebook can be viewed in a browser, without running a local version of IPython Notebook (note that it may take a while to load, as it is a rather large notebook).

I'll include the contents of the repo's README.md file below. Any questions, comments or other feedback is most welcome.

This short primer on Python is designed to provide a rapid "on-ramp" for computer programmers who are already familiar with basic concepts and constructs in other programming languages to learn enough about Python to effectively use open-source and proprietary Python-based machine learning and data science tools.

The primer is spread across a collection of IPython Notebooks, and the easiest way to use the primer is to install IPython Notebook on your computer. You can also install Python, and manually copy and paste the pieces of sample code into the Python interpreter, as the primer only makes use of the Python standard libraries.

There are three versions of the primer. Two versions contain the entire primer in a single notebook:

The other version divides the primer into 5 separate notebooks:

  1. Introduction
  2. Data Science: Basic Concepts
  3. Python: Basic Concepts
  4. Using Python to Build and Use a Simple Decision Tree Classifier
  5. Next Steps

There are several exercises included in the notebooks. Sample solutions to those exercises can be found in two Python source files:

  • simple_ml.py: a collection of simple machine learning utility functions
  • SimpleDecisionTree.py: a Python class to encapsulate a simplified version of a popular machine learning model

There are also 2 data files, based on the mushroom dataset in the UCI Machine Learning Repository, used for coding examples, exploratory data analysis and building and evaluating decision trees in Python (a minimal loading sketch follows the list):

  • agaricus-lepiota.data: a machine-readable list of examples or instances of mushrooms, represented by a comma-separated list of attribute values
  • agaricus-lepiota.attributes: a machine-readable list of attribute names and possible attribute values and their abbreviations
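
A minimal sketch of loading the data file with only the standard library, in the spirit of the primer's standard-library-only approach (per the UCI documentation, the first field of each record is the class label, 'e' for edible or 'p' for poisonous):

    import csv
    from collections import Counter

    # Count the class distribution in the mushroom dataset.
    with open("agaricus-lepiota.data") as f:
        labels = Counter(row[0] for row in csv.reader(f) if row)

    print(labels)  # e.g., Counter({'e': ..., 'p': ...})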

Hype, Hubs & Hadoop: Some Notes from Strata NY 2013 Keynotes

I didn't physically attend Strata NY + Hadoop World this year, but I did watch the keynotes from the conference. O'Reilly Media kindly makes videos of the keynotes and slides of all talks available very soon after they are given. Among the recurring themes were harangues against the hype of big data, the increasing utilization of Hadoop as a central platform (hub) for enterprise data, and the importance and potential impact of making data, tools and insights more broadly accessible within an enterprise and to the general public. The keynotes offered a nice mix of business (applied) & science (academic) talks, from event sponsors and other key players in the field, and a surprising - and welcome - number of women on stage.

Atigeo, the company where I now work on analytics and data science, co-presented a talk on Data Driven Models to Minimize Hospital Readmissions at Strata Rx last month, and I'm hoping we will be participating in future Strata events. And I'm hoping that some day I'll be on stage presenting some interesting data and insights at a Strata conference.

Meanwhile, I'll include some of my notes on interesting data and insights presented by others, in the order in which presentations were scheduled, linking each presentation title to its associated video. Unlike previous postings of notes from conferences, I'm going to leave the notes in relatively raw form, as I don't have the time to add more narrative context or visual augmentations to them.


Hadoop's Impact on the Future of Data Management
 
Mike Olson @mikeolson (Cloudera)

3000 people at the conference (sellout crowd), up from 700 people in 2009.
Hadoop started out as a complement to traditional data processing (offering large-scale processing).
Progressively adding more real-time capabilities, e.g. Impala & Cloudera search.
More and more capabilities migrating from traditional platforms to Hadoop.
Hadoop moving from the periphery to the architectural center of the data center, emerging as an enterprise data hub.
Hub: scalable storage, security, data governance, engines for working with the data in place
Spokes connect to other systems, people
Announcing Cloudera 5, "the enterprise data hub"
Announcing Cloudera Connect Cloud, supporting private & public cloud deployments
Announcing Cloudera Connect Innovators, inaugural innovator is DataBricks (Spark real-time in-memory processing engine)


Separating Hadoop Myths from Reality
Jack Norris (MapR Technologies)

Hadoop is the first open source project that has spawned a market
3:35 compelling graph of Hadoop/HBase disk latency vs. MapR latency
Hadoop is being used in production by many organizations


Big Impact from Big Data
Ken Rudin (Facebook)

Need to focus on business needs, not the technology
You can use science, technology and statistics to figure out what the answers are, but it is still an art to figure out what the right questions are
How to focus on the right questions:
* hire people with academic knowledge + business savvy
* train everyone on analytics (internal DataCamp at Facebook for project managers, designers, operations; 50% on tools, 50% on how to frame business questions so you can use data to get the answers)
* put analysts in org structure that allows them to have impact ("embedded model": hybrid between centralized & decentralized)
Goals of analytics: Impact, insight, actionable insight, evangelism … own the outcome


Five Surprising Mobile Trajectories in Five Minutes
Tony Salvador (Intel Corporation)

Tony is director at the Experience Research Lab (is this the group formerly known as People & Practices?) [I'm an Intel Research alum, and Tony is a personal friend]
Personal data economy: system of exchange, trading personal data for value
3 opportunities
* hyper individualism (Moore's Cloud, programmable LED lights)
* hyper collectivity (student projects with outside collaboration)
* hyper differentiation (holistic design for devices + data)
Big data is by the people and of the people ... and it should be for the people


Can Big Data Reach One Billion People?
Quentin Clark (Microsoft)

Praises Apache, open source, github (highlighted by someone from Microsoft?)
Make big data accessible (MS?)
Hadoop is a cornerstone of big data
Microsoft is committed to making it ready for the enterprise
HD Insight (?) Azure offering for Hadoop
We have a billion users of Excel, and we need to find a way to let anybody with a question get that question answered.
Power BI for Office 365 Preview


What Makes Us Human? A Tale of Advertising Fraud
Claudia Perlich (Dstillery)

A Turing test for advertising fraud
Dstillery: predicting consumer behavior based on browsing histories
Saw 2x performance improvement in 2 weeks; was immediately skeptical
Integrated additional sources of data (10B bid requests)
Found "oddly predictive websites"
e.g., Women's health page --> 10x more likely to check out credit card offer, order online pizza, or read about luxury cars
Large advertising scam (botnet)
36% of traffic is non-intentional (Comscore)
Co-visitation patterns
Cookie stuffing
Botnet behavior is easier to predict than human behavior
Put bots in "penalty box": ignore non-human behavior


From Fiction to Facts with Big Data Analytics
Ben Werther @bwerther (Platfora)

When it comes to big data, BI = BS
Contrasts enterprises based on fiction, feeling & faith vs. fact-based enterprises
Big data analytics: letting regular business people iteratively interrogate massive amounts of data in an easy-to-use way so that they can derive insight and really understand what's going on
3 layers: Deep processing + acceleration + rich analytics
Product: Hadoop processing + in-memory acceleration + analytics engines + Vizboards
Example: event series analytics + entity-centric data catalog + iterative segmentation


The Economic Potential of Open Data
Michael Chui (McKinsey Global Institute)

[Presentation is based on newly published - and openly accessible (walking the talk!) - report: Open data: Unlocking innovation and performance with liquid information.]

Louisiana Purchase: Lewis & Clark address a big data acquisition problem
Thomas Jefferson: "Your observations are to be taken with great pains & accuracy, to be entered intelligibly, for others as well as yourself"
What happens when you make data more liquid?

4 characteristics of "openness" or "liquidity" of data:
* degree of access
* machine readability
* cost
* rights

Benefits to open data:
* transparency
* benchmarking exposing variability
* new products and services based on open data (Climate Corporation?)

How open data can enable value creation
* matching supply and demand
* collaboration at scale
"with enough eyes on code, all bugs are shallow"
--> "with enough eyes on data, all insights are shallow"
* increase accountability of institutions

Open data can help unlock $3.2T to $5.4T in economic value per year across 7 domains
* education
* transportation
* consumer products
* electricity
* oil and gas
* health care
* consumer finance
What needs to happen?
* identify, prioritize & catalyze data to open
* developers, developers, developers
* talent (data scientists, visualization, storytelling)
* address privacy confidentiality, security, IP policies
* platforms, standards and metadata


The Future of Hadoop: What Happened & What's Possible?
Doug Cutting @cutting (Cloudera)

Hadoop started out as a storage & batch processing system for Java programmers
Increasingly enables people to share data and hardware resources
Becoming the center of an enterprise data hub
More and more capabilities being brought to Hadoop
Inevitable that we'll see just about every kind of workload being moved to this platform, even online transaction processing


Designing Your Data-Centric Organization
Josh Klahr (Pivotal)

GE has created 24 data-driven apps in one year
We are working with them as a Pivotal investor and a Pivotal company, we help them build these data-driven apps, which generated $400M in the past year
Pivotal code-a-thon, with Kaiser Permanente, using Hadoop, SQL and Tableau

What it takes to be a data-driven company
* Have an application vision
* Powered by Hadoop
* Driven by Data Science


Encouraging You to Change the World with Big Data
David Parker (SAP)

Took Facebook 9 months to achieve the same number of users that it took radio 40 years to achieve (100M users)
Use cases
At-risk students stay in school with real-time guidance (University of Kentucky)
Soccer players improve with spatial analysis of movement
Visualization of cancer treatment options
Big data geek challenge (SAP Lumira): $10,000 for best application idea


The Value of Social (for) TV
Shawndra Hill (University of Pennsylvania)

Social TV Lab
How we can derive value from the data that is being generated by viewers today?
Methodology: start with Twitter handles of TV shows, identify followers, collect tweets and their networks (followees + followers), build recommendation systems from  the data (social network-based, product network-based & text-based (bag of words)). Correlate words in tweets about a show with demographics about audience (Wordle for male vs. female)
1. You can use Twitter followers to estimate viewer audience demographics
2. TV triggers lead to more online engagement
3. If brands want to engage with customers online, play an online game
Real time response to advertisement (Teleflora during Super Bowl): peaking buzz vs. sustained buzz
Demographic bias in sentiment & tweeting (male vs. female response to Teleflora, others)
Influence = retweeting
Women more likely to retweet women, men more likely to retweet men
4. Advertising response and influence vary by demographic
5. GetGlue and Viggle check-ins can be used as a reliable proxy for viewership to
* predict Nielsen viewership weeks in advance
* predict customer lifetime value
* measure time shifting
All at the individual viewer level (vs. household level)


Ubiquitous Satellite Imagery of our Planet
Will Marshall @wsm1 (Planet Labs)

Ultracompact satellites to image the earth on a much more frequent basis to get inside the human decision-making loop so we can help human action.
Redundancy via large # of small satellites with latest technology (vs. older, higher-reliability systems on one satellite)
Recency: shows more deforestation than Google Maps, river movement (vs. OpenStreetMap)
API for the Changing Planet, hackathons early next year


The Big Data Journey: Taking a holistic approach
John Choi (IBM)

[No slides?]
Invention of sliced bread 
Big data [hyped] as the biggest thing since the sliced bread
Think about big data as a journey
1. It's all about discipline and knowing where you are going (vs. enamored with tech)
VC $2.6B investment into big data (IBM, SAP, Oracle, … $3-4B more)
2. Understand that any of these technologies do not live in a silo. The thing that you don't want to have happen is that this thing become a science fair project. At the end of the day, this is going to be part of a broader architecture.
3. This is an investment decision, want to have a return on investment.


How You See Data
Sharmila Shahani-Mulligan @ShahaniMulligan (ClearStory Data)

The Next Era of Data Analysis: next big thing is how you analyze data from many disparate sources and do it quickly.
More data: Internal data + external data
More speed: Fast answers + discovery
Increase speed of access & speed of processing so that iterative insight becomes possible.
More people: Collaboration + context
Needs to become easier for everyone across the business (not just specialists) to see insights as insights are made available, have to make decisions faster.
Data-aware collaboration
Data harmonization
Demo: 6:10-8:30


Can Big Data Save Them?
Jim Kaskade @jimkaskade (Infochimps)

1 of 3 people in US has had a direct experience with cancer in their family
1 in 4 deaths are cancer-related
Jim's mom has chronic leukemia
Just got off the phone with his mom (it's his birthday), and she asked "what is it that you do?"
"We use data to solve really hard problems like cancer"
"When?"
"Soon"
Cancer is 2nd leading cause of death in children
"The brain trust in this room alone could advance cancer therapy more in a year than the last 3 decades."
Bjorn Brucher
We can help them by predicting individual outcomes, and then proactively applying preventative measures.
Big data starts with the application
Stop building your big data sandboxes, stop building your big data stacks, stop building your big data hadoop clusters without a purpose.
When you start with the business problem, the use case, you have a purpose, you have focus.
50% of big data projects fail (reference?)
"Take that one use case, supercharge it with big data & analytics, we can take & give you the most comprehensive big data solutions, we can put it on the cloud, and for some of you, we can give you answers in less than 30 days"
"What if you can contribute to the cure of cancer?" [abrupt pivot back to initial inspirational theme]


Changing the Face of Technology - Black Girls CODE
Peta Clarke @volunteerbgcny (Black Girls Code - NY), Donna Knutt @donnaknutt (Black Girls Code)

Why coding is important: By 2020, 1.4M computing jobs
Women of color currently make up 3% of computing jobs in US
Goal: teach 1M girls to code by 2040
Thus far: 2 years, 2000 girls, 7 states + Johannesburg, South Africa


Beyond R and Ph.D.s: The Mythology of Data Science Debunked
Douglas Merrill @DouglasMerrill (ZestFinance)

[my favorite talk]
Anything which appears in the press in capital letters, and surrounded by quotes, isn't real.
There is no math solution to anything. Math isn't the answer, it's not even the question.
Math is a part of the solution. Pieces of math have different biases, different things they do well, different things they do badly, just like employees. Hiring one new employee won't transform your company; hiring one new piece of math also won't transform your company.
Normal distribution, bell curve: beautiful, elegant
Almost nothing in the real world, is, in fact, normal.
Power laws don't actually have means.
Joke: How do you tell the difference between an introverted and an extroverted engineer? The extroverted one looks at your shoes instead of his own.
The math that you think you know isn't right. And you have to be aware of that. And being aware of that requires more than just math skills.
Science is inherently about data, so "data scientist" is redundant
However, data is not entirely about science
Math + pragmatism + communication
Prefers "Data artist" to data scientist
Fundamentally, the hard part actually isn't the math, the hard part is finding a way to talk about that math. And, the hard part isn't actually gathering the data, the hard part is talking about that data.
The most famous data artist of our time: Nate Silver.
Data artists are the future.
What the world needs is not more R, what the world needs is more artists (Rtists?)


Is Bigger Really Better? Predictive Analytics with Fine-grained Behavior Data
Foster Provost (NYU | Stern)

[co-author of my favorite book on Data Science]
Agrees with some of the critiques made by previous speaker, but rather likes the term "data scientist"
Shares some quotes from Data Science and its relationship to Big Data and Data-Driven Decision Making
Gartner Hype Cycle 2012 puts "Predictive Analytics" at the far right ("Plateau of Productivity")  
[it's still there in Gartner Hype Cycle 2013, and "Big Data" has inched a bit higher into the "Peak of Inflated Expectations"]
More data isn't necessarily better (if it's from the same source, e.g., sociodemographic data)
More data from different sources may help.
Using fine-grained behavior data, learning curves show continued improvement to massive scale.
1M merchants, 3M data points (? look up paper)
But sociodemographic + pseudo social network data still does not necessarily do better
See Pseudo-Social Network Targeting from Consumer Transaction Data (Martens & Provost)
Seem to be very few case studies where you have really strong best practices with traditional data juxtaposed with strong best practices with another sort of data.
We see similar learning curves with different data sets, characterized by  massive numbers of individual behaviors, each of which probably contains a small amount of information, and the data items are sparse.
See Predictive Modelling with Big Data: Is Bigger Really Better? (Enrique Junque de Fortuny, David Martens & Foster Provost)
Others have published work on on Fraud detection (Fawcett & FP, 1997; Cortes et al, 2001), Social Network-based Marketing (Hill, et al, 2006), Online Display-ad Targeting (FP, Dalessandro, et al., 2009; Perlich, et al., 2013)
Rarely see comparisons

Take home message:
The Golden Age of Data Science is at hand.
Firms with larger data assets may have the opportunity to achieve significant competitive advantage.
Whether bigger is better for predictive modeling depends on:
a) the characteristics of the data (e.g., sparse, fine-grained data on consumer behavior)
b) the capability to model such data


The Scientific Method: Cultivating Thoroughly Conscious Ignorance

Stuart Firestein brilliantly captures the positive influence of ignorance as an often unacknowledged guiding principle in the fits and starts that typically characterize the progression of real science. His book, Ignorance: How It Drives Science, grew out of a course on Ignorance he teaches at Columbia University, where he chairs the department of Biological Sciences and runs a neuroscience research lab. The book is replete with clever anecdotes interleaved with thoughtful analyses - by Firestein and other insightful thinkers and doers - regarding the central importance of ignorance in our quests to acquire knowledge about the world.

Each chapter leads off with a short quote, and the one that starts Chapter 1 sets the stage for the entire book:

"It is very difficult to find a black cat in a dark room," warns an old proverb. "Especially when there is no cat."

He proceeds to channel the wisdom of Princeton mathematician Andrew Wiles (who proved Fermat's Last Theorem) regarding the way science advances:

It's groping and probing and poking, and some bumbling and bungling, and then a switch is discovered, often by accident, and the light is lit, and everyone says "Oh, wow, so that's how it looks," and then it's off into the next dark room, looking for the next mysterious black feline.

Firestein is careful to distinguish the "willful stupidity" and "callow indifference to facts and logic" exhibited by those who are "unaware, unenlightened, and surprisingly often occupy elected offices" from a more knowledgeable, perceptive and insightful ignorance. As physicist James Clerk Maxwell describes it, this "thoroughly conscious ignorance is the prelude to every real advance in science."

The author disputes the view of science as a collection of facts, and instead invites the reader to focus on questions rather than answers, to cultivate what poet John Keats called "negative capability": the ability to dwell in "uncertainty without irritability". This notion is further elaborated by philosopher-scientist Erwin Schrodinger:

In an honest search for knowledge you quite often have to abide by ignorance for an indefinite period.

Ignorance tends to thrive more on the edges than in the centers of traditional scientific circles. Using the analogy of a pebble dropped into a pond, most scientists tend to focus near the site where the pebble is dropped, but the most valuable insights are more likely to be found among the ever-widening ripples as they spread across the pond. This observation about the scientific value of exploring edges reminds me of another inspiring book I reviewed a few years ago, The Power of Pull, wherein authors John Hagel III, John Seely Brown & Lang Davison highlight the business value of exploring edges:

Edges are places that become fertile ground for innovation because they spawn significant new unmet needs and unexploited capabilities and attract people who are risk takers. Edges therefore become significant drivers of knowledge creation and economic growth, challenging and ultimately transforming traditional arrangements and approaches.

On a professional level, given my recent renewal of interest in the practice of data science, I find many insights into ignorance relevant to a productive perspective for a data scientist. He promotes a data-driven rather than hypothesis-driven approach, instructing his students to "get the data, and then we can figure out the hypotheses." Riffing on Rodin, the famous sculptor, Firestein highlights the literal meaning of "dis-cover", which is "to remove a veil that was hiding something already there" (which is the essence of data mining). He also notes that each discovery is ephemeral, as "no datum is safe from the next generation of scientists with the next generation of tools", highlighting both the iterative nature of the data mining process and the central importance of choosing the right metrics and visualizations for analyzing the data.

Professor Firestein also articulates some keen insights about our failing educational system, a professional trajectory from which I recently departed, that resonate with some growing misgivings I was experiencing in academia. He highlights the need to revise both the business model of universities and the pedagogical model, asserting that we need to encourage students to think in terms of questions, not answers. 

W.B. Yeats admonished that "education is not the filling of a pail, but the lighting of a fire." Indeed. Time to get out the matches.


On a personal level, at several points while reading the book I was often reminded of two of my favorite "life rules" (often mentioned in preceding posts) articulated by Cherie Carter-Scott in her inspiring book, If Life is a Game, These are the Rules:

Rule Three: There are no mistakes, only lessons.
Growth is a process of experimentation, a series of trials, errors, and occasional victories. The failed experiments are as much a part of the process as the experiments that work.

Rule Four: A lesson is repeated until learned.
Lessons will be repeated to you in various forms until you have learned them. When you have learned them, you can then go on to the next lesson.

Firestein offers an interesting spin on this concept, adding texture to my previous understanding, and helping me feel more comfortable with my own highly variable learning process, as I often feel frustrated with re-encountering lessons many, many times:

I have learned from years of teaching that saying nearly the same thing in different ways is an often effective strategy. Sometimes a person has to hear something a few times or just the right way to get that click of recognition, that "ah-ha moment" of clarity. And even if you completely get it the first time, another explanation always adds texture.

My ignorance is revealed to me on a daily, sometimes hourly, basis (I suspect people with partners and/or children have an unfair advantage in this department). I have written before about the scope and consequences of others being wrong, but for much of my life, I have felt shame about the breadth and depth of my own ignorance (perhaps reflecting the insight that everyone is a mirror). It's helpful to re-dis-cover the wisdom that ignorance can, when consciously cultivated, be strength.

[The video below is the TED Talk that Stuart Firestein recently gave on The Pursuit of Ignorance.]

 

 


An Excellent Primer on Data Science and Data-Analytic Thinking and Doing

O'Reilly Media is my primary resource for all things Data Science, and the new O'Reilly book on Data Science for Business by Foster Provost and Tom Fawcett ranks near the top of my list of their relevant assets. The book is designed primarily to help businesspeople understand the fundamental principles of data science, highlighting the processes and tools often used in the craft of mining data to support better business decisions. Among the many gems that resonated with me are the emphasis on the exploratory nature of data science - more akin to research and development than engineering - and the importance of thinking carefully and critically ("data-analytically") about the data, the tools and overall process.

The book references and elaborates on the Cross-Industry Standard Process for Data Mining (CRISP-DM) model to highlight the iterative process typically required to converge on a deployable data science solution. The model includes loops within loops to account for the way that critically analyzing data models often reveals additional data preparation steps that are needed to clean or manipulate the data to support the effective use of data mining tools, and how the evaluation of model performance often reveals issues that require additional clarification from the business owners. The authors note that it is not uncommon for the definition of the problem to change in response to what can actually be done with the available data, and that it is often worthwhile to consider investing in acquiring additional data in order to enable better modeling. Valuing data - and data scientists - as important assets is a recurring theme throughout the book.

As a practicing data scientist, I find the book's emphasis on the expected value framework - associating costs and benefits with different performance metrics - to be a helpful guide in ensuring that the right questions are being asked, and that the results achieved are relevant to the business problems that motivate most data science projects. And as someone whose practice of data science has recently resumed after a hiatus, I found the book very useful as a refresher on some of the tools and techniques of data analysis and data mining ... and as a reminder of potential pitfalls such as overfitting models to training data, not appropriately taking into account null hypotheses and confidence intervals, and the problem of multiple comparisons. I've been using the scikit-learn package for machine learning in Python in my recent data modeling work, and some of the questions and issues raised in this book have prompted me to reconsider some of the default parameter values I've been using.
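
The expected value framework boils down to weighting each cell of the confusion matrix by its estimated probability and its business cost or benefit; here is a minimal sketch with made-up numbers (not an example from the book):

    import numpy as np

    # Rates estimated from a validation set: [[TN, FP], [FN, TP]] proportions.
    cell_probs = np.array([[0.85, 0.05],
                           [0.04, 0.06]])

    # Business value of each outcome, in dollars (illustrative only).
    cell_values = np.array([[0.0,  -5.0],    # false positive: wasted offer cost
                            [-50.0, 40.0]])  # missed vs. captured customer

    expected_value = float((cell_probs * cell_values).sum())
    print(f"expected value per targeted customer: ${expected_value:.2f}")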

The book includes a nice mix of simplified and real-world examples to motivate and clarify many of the common problems and techniques encountered in data science. It also offers appropriately simplified descriptions and equations for the mathematics that underlie some of the key concepts and tools of data science, including one of the clearest definitions of Bayes' rule and its application in constructing Naive Bayes classifiers I've seen. The figures add considerable clarity to the topics covered throughout the book. I particularly like the chapter highlighting the different visualizations - profit curves, lift curves, cumulative response curves and receiver operating characteristic (ROC) curves - that can be used to help compare and effectively communicate the performance of models. [Side note: it was through my discovery of Tom Fawcett's excellent introduction to ROC analysis that I first encountered the Data Science for Business book. In the interest of full disclosure, I should also note that Tom is a friend and former grad school colleague (and fellow homebrewer) from my UMass days].
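
For the visualization chapter, scikit-learn makes the ROC comparison easy to reproduce; here is a minimal sketch on synthetic data (not an example from the book):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, auc
    from sklearn.model_selection import train_test_split

    # Imbalanced synthetic classification problem.
    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]

    # fpr/tpr pairs trace out the ROC curve; auc summarizes it in one number.
    fpr, tpr, _ = roc_curve(y_te, scores)
    print(f"AUC = {auc(fpr, tpr):.3f}")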

The penultimate chapter of the book is on Data Science and Business Strategy, in which the authors elaborate on the importance of making strategic investments in data, data scientists and a culture that enables data science and data scientists to thrive. They note the importance of diversity in the data science team, the variance in individual data scientist capabilities - especially with respect to innate creativity, analytical acumen, business sense and perseverance - and the tendency toward replicability of successes in solving data science problems, for both individuals and teams. They also emphasize the importance of attracting a critical mass of data scientists - to support, augment and challenge each other - and progressively systematizing and refining various processes as the data science capability of a team (and firm) matures ... two aspects whose value I can personally attest to based on my own re-immersion in a data science team.