I was among 900 attendees at the recent PyData Seattle 2015 conference, an event focused on the use of Python in data management, analysis and machine learning. Nearly all of the tutorials & talks I attended last weekend were very interesting and informative, and several were positively inspiring. I haven't been as excited to experiment with new tools since I discovered IPython Notebooks at Strata 2014.
I often find it helpful to organize my notes after attending a conference or workshop, and am sharing these notes publicly in case they are helpful to others. The following is a chronological listing of some of the highlights of my experience at the conference. My notes from some sessions are rather sparse, so it is a less comprehensive compilation than I would have liked to assemble. At the end, I'll also include links to some sessions I did not attend.
This was my first time at a PyData conference, and I spoke with several others who were attending their first PyData. Apparently, this was the largest turnout for a PyData conference yet. I gave a 2-hour tutorial on Python for Data Science, designed as a rapid on-ramp primer for programmers new to Python or Data Science. Responses to a post-tutorial survey confirm my impression that I was unrealistically optimistic about being able to fit so much material into a 2-hour time slot, but I hope the tutorial still helped participants get more out of other talks & tutorials throughout the rest of the conference, many of which presumed at least an intermediate level of experience with Python and/or Data Science. As is often the case, I missed the session prior to the one in which I was speaking - the opening keynote - as I scrambled with last-minute preparations (ably aided by friends, former colleagues & assistant tutors Alex Thomas and Bryan Tinsley).
Running and re-running experiments in which many steps are repeated, some are varied (e.g., with different parameter settings), and several take a long time is part of a typical data science workflow. Every company in which I've worked as a data scientist has rolled its own workflow pipeline framework to support this process, and each homegrown solution has offered some benefits while suffering from some shortcomings. Jonathan Dinu demonstrated Luigi, an open source library initially created by Spotify for managing batch pipelines that might encompass a large number of local and/or distributed computing cluster processing steps. Luigi offers a framework in which each stage of the pipeline has input, processing and output specifications; the stages can be linked together in a dependency graph, which can be used to visualize progress. He illustrated how Luigi could be used for a sample machine learning pipeline (Data Engineering 101), in which a corpus of text documents is converted to TF-IDF vectors, then models are trained and evaluated with different hyperparameters, and finally deployed.
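To make the input / processing / output structure concrete, here is a minimal pure-Python sketch of a pipeline whose stages are linked in a dependency graph. It is a toy stand-in and not Luigi's API: the `Stage` class, the `run_pipeline` function and the stage names are all hypothetical, and outputs are kept in memory rather than written to files or cluster storage.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Stage:
    """One pipeline step: its dependencies (inputs) and a processing function."""
    name: str
    process: Callable[..., object]
    requires: List[str] = field(default_factory=list)


def run_pipeline(stages: List[Stage], target: str) -> Dict[str, object]:
    """Run `target` and everything it depends on, each stage at most once."""
    by_name = {stage.name: stage for stage in stages}
    outputs: Dict[str, object] = {}
    in_progress = set()

    def build(name: str) -> object:
        if name in outputs:              # already computed: reuse the output
            return outputs[name]
        if name in in_progress:
            raise ValueError(f"dependency cycle at stage {name!r}")
        in_progress.add(name)
        stage = by_name[name]
        inputs = [build(dep) for dep in stage.requires]
        outputs[name] = stage.process(*inputs)
        in_progress.discard(name)
        return outputs[name]

    build(target)
    return outputs


# Hypothetical stages, loosely following the text-processing example.
stages = [
    Stage("corpus", lambda: ["spam spam eggs", "eggs and ham"]),
    Stage("tokens", lambda docs: [d.split() for d in docs], requires=["corpus"]),
    Stage("vocab", lambda toks: sorted({w for t in toks for w in t}),
          requires=["tokens"]),
    Stage("counts", lambda toks, vocab: [[t.count(w) for w in vocab] for t in toks],
          requires=["tokens", "vocab"]),
]

results = run_pipeline(stages, "counts")
print(results["vocab"])
print(results["counts"])
```

Because each stage declares what it requires, a shared upstream stage such as "tokens" runs only once even though two later stages depend on it, which is the property that makes re-running long pipelines cheaper.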
Joseph Sirosh sprinkled several pithy quotes throughout his presentation, starting off with a prediction that while software is eating the world, the cloud is eating software (& data). He also introduced what may have been the most viral meme at the conference - the connected cow - as a way of emphasizing that every company is a data company ... even a dairy farm. In an illustration of where AI [artificial intelligence] meets AI [artificial insemination], he described a project in which data from pedometers worn by cows boosted estrus detection accuracy from 55% to 95%, which in turn led to more successful artificial insemination and increased pregnancy rates from 40% to 67%. Turning his attention from cattle to medicine, he observed that every hospital is a data company, and suggested that Florence Nightingale's statistical evaluation of the relationship between sanitation and infection made her the world's first woman data scientist. Sirosh noted that while data scientists often complain that data wrangling is the most time-consuming and challenging part of the data modeling process, that is because deploying and monitoring models in production environments - which he argued is even more time-consuming and challenging - is typically handed off to other groups. And, as might be expected, he demonstrated how some of these challenging problems can be addressed by Azure ML, Microsoft's cloud-based predictive analytics system.
IPython Jupyter Notebooks are one of my primary data science tools. The ability to interleave code, data, text and a variety of [other] media makes the notebooks a great way to both conduct and describe experiments. Jonathan described the upcoming Big Split(tm), in which IPython will be separated from Notebooks, to better emphasize the increasingly language-agnostic capabilities of the notebooks, which [will soon] have support for 48 language kernels, including Julia, R, Haskell, Ruby, Spark and C++. Version 4.0 will offer capabilities to:
- ease the transition from large notebooks to small notebooks
- import notebooks as packages
- test notebooks
- verify that a notebook is reproducible
As a former educator, a new capability I find particularly exciting is nbgrader, which uses the JupyterHub collaborative platform to support releasing, fetching, submitting and collecting assignments. Among the tidbits I personally found most interesting during this session was that IPython started out as Fernando Perez's "procrastination project" while he was in PhD thesis avoidance mode in 2001 ... an outstanding illustration of the benefits of structured procrastination.
Deep Learning seems to be well on its way toward the peak of inflated expectations lately (e.g., Deep Learning System Tops Humans in IQ Tests). Alex Korbonits presented a number of tools for and examples of Deep Learning, the most impressive of which was AlexNet, a deep convolutional neural network developed by another Alex (Alex Krizhevsky, et al.) that outperformed all of its competitors in the ILSVRC 2012 ImageNet competition (1.3M high-res images across 1000 classes) by such a substantial margin that it changed the course of research in computer vision, a field that had hitherto been dominated by hand-crafted features refined over a long period of time. Alex Korbonits went on to demonstrate a number of Deep Learning tools & packages, e.g., Caffe and word2vec, and applications involving scene parsing and unsupervised learning of high-level features. It should be noted that others have taken a more skeptical view of Deep Learning, and illustrated some areas in which there's still a lot of work to be done.
One of the most challenging aspects of attending a talk by Paco Nathan is figuring out how to divide my time among listening, watching, searching for or typing in referenced links ... and taking notes. He is a master speaker, with compelling visual augmentations and links to all kinds of interesting related material. Unfortunately, while my browser fills up with tabs during his talks, my notes typically end up rather sparse. In this talk, Paco talked about the ways that O'Reilly Media is embracing Jupyter Notebooks as a primary interface for authors using their multi-channel publishing platform. An impressive collection of these notebooks can be viewed on the O'Reilly Learning page. Paco observed that the human learning curve is often the most challenging aspect of leading data science teams, as data, problems and techniques change over time. The evolution of user expertise, e.g., among connoisseurs of beer, is another interesting area involving human learning curves that was referenced during this session.
Fraud detection presents some special challenges in evaluating the performance of machine learning models. A model is trained on past transactions that are labeled based on whether or not they turned out to be fraudulent, but once the model is deployed, the new transactions it classifies as fraud are blocked. Thus, the transactions that are allowed to go through after the model is deployed may be substantially different - especially with respect to the proportion of fraudulent transactions - from those that were allowed before the model was deployed. This makes evaluation of the model performance difficult, since the training data may be very different from the data used to evaluate the model. It also complicates the training of new models, since the new training data (post model deployment) will be biased. Michael Manapat presented some techniques to address these challenges, involving allowing a small proportion of potentially fraudulent transactions through and using a propensity function to control the "exploration/exploitation tradeoff".
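The idea of letting a small share of flagged transactions through can be sketched with a short simulation. This is an illustration under my own assumptions, not the method from the talk: the function names, the 0.5 score threshold, the 5% exploration rate and the synthetic data are all hypothetical. Each flagged transaction is allowed with a known probability (its propensity), and each allowed transaction is weighted by the inverse of that probability when estimating the overall fraud rate.

```python
import random


def decide(score, threshold=0.5, explore_rate=0.05, rng=random):
    """Return (allowed, propensity); propensity is the probability of being allowed."""
    if score < threshold:
        return True, 1.0                 # not flagged: always allowed
    allowed = rng.random() < explore_rate
    return allowed, explore_rate         # flagged: allow a small random sample


def weighted_fraud_rate(observations):
    """Inverse-propensity-weighted fraud rate.

    observations: iterable of (is_fraud, propensity) for allowed transactions.
    """
    total_weight = 0.0
    fraud_weight = 0.0
    for is_fraud, propensity in observations:
        weight = 1.0 / propensity
        total_weight += weight
        fraud_weight += weight * is_fraud
    return fraud_weight / total_weight


rng = random.Random(0)
n_total = 100_000
n_fraud = 0
allowed_obs = []
for _ in range(n_total):
    score = rng.random()                   # synthetic model score in [0, 1)
    is_fraud = int(rng.random() < score)   # assumption: the score is well calibrated
    n_fraud += is_fraud
    allowed, propensity = decide(score, rng=rng)
    if allowed:                            # outcomes are observed only when allowed
        allowed_obs.append((is_fraud, propensity))

true_rate = n_fraud / n_total              # known here only because data is simulated
naive_rate = sum(f for f, _ in allowed_obs) / len(allowed_obs)
weighted_rate = weighted_fraud_rate(allowed_obs)
print(f"true={true_rate:.3f} naive={naive_rate:.3f} weighted={weighted_rate:.3f}")
```

The naive rate computed over allowed transactions understates fraud because most flagged transactions never produce an outcome, while the weighted estimate lands much closer to the true rate. A higher exploration rate reduces the variance of that estimate at the cost of letting more fraud through.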
In the last keynote of the conference, Josh Bloom shared a number of insights about considerations often overlooked by data scientists regarding how data models fit into the systems into which they are deployed. For example, while data scientists are often concerned with optimizing a variety of parameters in building a model, other important areas for optimization are overlooked, e.g., hardware and software demands of a deployed model (e.g., the decision by Netflix not to deploy the model with the highest score in the Netflix Prize), the human resources required to implement and maintain the model, the ways that consumers will [try to] interpret or use the model, and the direct and indirect impacts of the model on society. Noteworthy references include a paper by Sculley, et al., on Machine Learning: The High Interest Credit Card of Technical Debt and Leon Bottou's ICML 2015 keynote on Two Big Challenges of Machine Learning.
Once again, I had a hard time keeping up with the multi-sensory inputs during a talk by Paco Nathan. Although I can't find his slides from PyData, I was able to find a closely related slide deck (embedded below). The gist of the talk is that many real-world problems can be represented as graphs, and that there are a number of tools - including Spark and GraphLab - that can be utilized for efficient processing of large graphs. One example of a problem amenable to graph processing is the analysis of a social network, e.g., contributors to open source forums, which reminded me of some earlier work by Welser, et al. (2007), on Visualizing the signatures of social roles in online discussion groups. The session included a number of interesting code examples, some of which I expect are located in Paco's spark-exercises GitHub repository. Other interesting references included TextBlob, a Python library for text processing, and TextRank, a graph-based ranking model for text processing described in a paper by Mihalcea & Tarau from EMNLP 2004.
Pandas - the Python open source data analysis library - may take 10 minutes to learn, but I have found that it takes a long time to master. Jeff Tratner, a key contributor to Pandas (an open source community he described as "really open to small contributors"), shared a number of insights into how Pandas works, how it addresses some of the problems that make Python slow, and how the use of certain features can lead to improved performance. For example, specifying the data type of columns in a CSV file via the dtype parameter in read_csv can help pandas save space and time while loading the data from the file. Also, the DataFrame.append operation is very expensive, and should be avoided wherever possible (e.g., by using merge, join or concat). One of my favorite lines: "The key to doing many small operations in Python: don’t do them in Python"
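A small sketch of both tips, using made-up CSV data and column names of my own (they are not from the talk). It passes explicit column types to read_csv, then combines several frames with a single concat call rather than growing a frame one piece at a time. Note that DataFrame.append was later removed entirely in pandas 2.0, so concat is also the portable choice.

```python
import io

import pandas as pd

# Hypothetical data; in practice this would be a file path.
csv_text = "user_id,country,amount\n1,US,9.5\n2,CA,3.25\n3,US,7.0\n"

# Default loading: pandas infers a type for every column.
default_df = pd.read_csv(io.StringIO(csv_text))

# Explicit types: narrower numbers, and a categorical for the repeated strings.
typed_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"user_id": "int32", "country": "category", "amount": "float32"},
)
print(default_df.dtypes)
print(typed_df.dtypes)

# Collect the pieces in a list and concatenate once at the end,
# instead of appending to a DataFrame inside a loop.
chunks = [typed_df.iloc[[i]] for i in range(len(typed_df))]
combined = pd.concat(chunks, ignore_index=True)
print(combined)
```

Appending inside a loop copies the accumulated data on every iteration, whereas one concat over a list of pieces copies it once.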
While I believe that there are no mistakes, only lessons, I do value the relatively rare opportunities to learn from others' lessons, and Cameron Davidson-Pilon (author of Probabilistic Programming & Bayesian Methods for Hackers) shared some valuable lessons he has learned in his data science work over the years. Among the lessons he shared:
- Sample sizes are important
- It is usually prudent to underestimate predictions of performance of deployed models
- Computing statistics on top of statistics compounds uncertainty
- Visualizing uncertainty is the role of a statistician
- Don't [naively] use PCA [before regression]
Among the interesting, and rather cautionary, references:
- The Most Dangerous Equation, Howard Wainer, American Scientist (2007)
- How A Single A/B Test Increased Conversions by 336%, Dustin Sparks
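The equation in Wainer's article is de Moivre's formula for the standard error of the mean, sigma / sqrt(n), which is also why sample sizes matter so much. The simulation below is my own illustration, not something shown in the talk: the function name, sample sizes and number of groups are arbitrary choices. It draws many groups of each size from the same distribution and measures how widely the group averages scatter.

```python
import random
import statistics


def spread_of_sample_means(sample_size, n_groups=1000, seed=0):
    """Standard deviation of the means of n_groups samples of sample_size draws each."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_groups):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return statistics.pstdev(means)


for n in (10, 100, 400):
    observed = spread_of_sample_means(n)
    predicted = 1.0 / n ** 0.5       # sigma / sqrt(n), with sigma = 1
    print(f"n={n:4d}  observed={observed:.3f}  predicted={predicted:.3f}")
```

Every group is drawn from the same distribution, yet the averages of the small groups scatter far more widely, so the most extreme averages, both best and worst, tend to come from the smallest groups.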
There were a few sessions about which I read or heard great things, but which I did not attend. I'll include information I could find about them, in the chronological order in which they were listed in the schedule, to wrap things up.
[Update, 2015-08-05: the PyDataTV YouTube channel now has videos from the conference]