
An Excellent Primer on Data Science and Data-Analytic Thinking and Doing

[Image: cover of Data Science for Business]

O'Reilly Media is my primary resource for all things Data Science, and the new O'Reilly book on Data Science for Business by Foster Provost and Tom Fawcett ranks near the top of my list of their relevant assets. The book is designed primarily to help businesspeople understand the fundamental principles of data science, highlighting the processes and tools often used in the craft of mining data to support better business decisions. Among the many gems that resonated with me are the emphasis on the exploratory nature of data science - more akin to research and development than engineering - and the importance of thinking carefully and critically ("data-analytically") about the data, the tools and the overall process.

[Image: CRISP-DM process diagram]

The book references and elaborates on the Cross-Industry Standard Process for Data Mining (CRISP-DM) model to highlight the iterative process typically required to converge on a deployable data science solution. The model includes loops within loops to account for the way that critically analyzing data models often reveals additional data preparation steps that are needed to clean or manipulate the data to support the effective use of data mining tools, and how the evaluation of model performance often reveals issues that require additional clarification from the business owners. The authors note that it is not uncommon for the definition of the problem to change in response to what can actually be done with the available data, and that it is often worthwhile to consider investing in acquiring additional data in order to enable better modeling. Valuing data - and data scientists - as important assets is a recurring theme throughout the book.

[Image: Figure 7-2 from Data Science for Business]

As a practicing data scientist, I find the book's emphasis on the expected value framework - associating costs and benefits with different performance metrics - to be a helpful guide in ensuring that the right questions are being asked, and that the results achieved are relevant to the business problems that motivate most data science projects. And as someone whose practice of data science has recently resumed after a hiatus, I found the book very useful as a refresher on some of the tools and techniques of data analysis and data mining ... and as a reminder of potential pitfalls such as overfitting models to training data, not appropriately taking into account null hypotheses and confidence intervals, and the problem of multiple comparisons. I've been using the scikit-learn package for machine learning in Python in my recent data modeling work, and some of the questions and issues raised in this book have prompted me to reconsider some of the default parameter values I've been using.
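To make the expected value idea concrete, here is a minimal sketch in Python. It is not taken from the book: the confusion-matrix counts and the per-outcome benefits and costs are hypothetical numbers I made up for illustration, and the function name is my own. The idea is simply to weight each outcome's value by how often that outcome occurs.

```python
def expected_profit(counts, values):
    """Expected profit per instance: sum of p(outcome) * value(outcome).

    counts: how many instances fell into each confusion-matrix cell
    values: benefit (positive) or cost (negative) attached to each cell
    """
    total = sum(counts.values())
    return sum((counts[cell] / total) * values[cell] for cell in counts)


# Hypothetical confusion-matrix counts for a classifier on 100 test instances.
counts = {"tp": 50, "fp": 10, "fn": 5, "tn": 35}

# Hypothetical business values: a correctly targeted customer yields 99,
# a wasted offer costs 1, and not targeting someone costs nothing.
values = {"tp": 99.0, "fp": -1.0, "fn": 0.0, "tn": 0.0}

profit = expected_profit(counts, values)
print(f"Expected profit per instance: {profit:.2f}")
```

Comparing two models on this number rather than on raw accuracy is what ties the evaluation back to the business problem; a model with lower accuracy can still come out ahead if its errors are the cheap kind.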

[Image: Figure 8-5 from Data Science for Business]

The book includes a nice mix of simplified and real-world examples to motivate and clarify many of the common problems and techniques encountered in data science. It also offers appropriately simplified descriptions and equations for the mathematics that underlie some of the key concepts and tools of data science, including one of the clearest definitions of Bayes' rule and its application in constructing Naive Bayes classifiers I've seen. The figures (such as the one above) add considerable clarity to the topics covered throughout the book. I particularly like the chapter highlighting the different visualizations - profit curves, lift curves, cumulative response curves and receiver operating characteristic (ROC) curves - that can be used to help compare and effectively communicate the performance of models. [Side note: it was through my discovery of Tom Fawcett's excellent introduction to ROC analysis that I first encountered the Data Science for Business book. In the interest of full disclosure, I should also note that Tom is a friend and former grad school colleague (and fellow homebrewer) from my UMass days].
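For readers who have not seen it, Bayes' rule and the Naive Bayes simplification mentioned above can be written as follows. The notation here is my own rather than the book's: c is a class, E is the observed evidence, and e_1, ..., e_k are the individual features that make up E.

```latex
% Bayes' rule: posterior probability of class c given evidence E
p(C = c \mid E) = \frac{p(E \mid C = c)\, p(C = c)}{p(E)}

% Naive Bayes: assume the features e_1, ..., e_k are conditionally
% independent given the class, so the likelihood factors into a product
p(C = c \mid E) = \frac{p(e_1 \mid c)\, p(e_2 \mid c) \cdots p(e_k \mid c)\, p(c)}{p(E)}
```

Because p(E) is the same for every class, a classifier only needs to compare the numerators across classes to pick the most probable one.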

The penultimate chapter of the book is on Data Science and Business Strategy, in which the authors elaborate on the importance of making strategic investments in data, data scientists and a culture that enables data science and data scientists to thrive. They note the importance of diversity in the data science team, the variance in individual data scientist capabilities - especially with respect to innate creativity, analytical acumen, business sense and perseverance - and the tendency toward replicability of successes in solving data science problems, for both individuals and teams. They also emphasize the importance of attracting a critical mass of data scientists - to support, augment and challenge each other - and progressively systematizing and refining various processes as the data science capability of a team (and firm) matures ... two aspects whose value I can personally attest to based on my own re-immersion in a data science team.
