The Netflix video streaming and DVD service announced Thursday that it is switching from a 5-star rating system to a simpler thumbs up / thumbs down system. I've been a Netflix user (and fan) for many years, and love their personalized ratings predictions. I have often used their model in presentations and brainstorming involving other services that could benefit from that kind of personalization. I think this change is a bad idea.
I called the Netflix Help Center (866-579-7172), and the customer service representative I spoke with told me they were eager to receive feedback on this topic, especially feedback that specifies why users are in favor or not in favor of the proposed change. I shared several reasons why I thought it was a bad idea and I want to share those reasons here, in the hope it may encourage other Netflix users - especially those who share my view that it's a bad idea - to contact Netflix and provide feedback.
Granularity is Good
The main objection I have to the proposed change is that I make careful distinctions in both the ratings I give to movies I have seen and in the personalized predicted ratings Netflix offers for movies I have not yet seen. I probably watch, on average, 2 hours of TV a week, 1 DVD every month, and 1 movie in a theater every 3 months. I hardly ever watch video content on Netflix, YouTube or other streaming sources. So I'm probably an outlier on several dimensions.
That said, on the rare occasions when I do want to watch a movie - at home or in a theater - I will only watch a movie for which the personalized predicted Netflix rating is at least 4.0. Since I know the personalized prediction accuracy is dependent (in part) on my own ratings, I am very careful in how I rate movies. I use the following interpretations for the 5-star scale:
5 stars: a movie I liked so much I've seen it several times and/or would enjoy seeing again
4 stars: a movie I liked a lot, but am not interested in seeing again
3 stars: a movie I liked, but would probably have preferred to spend my time watching something else
2 stars: a movie I didn't like, and probably didn't watch much of
1 star: I don't know if I've ever seen a 1-star movie, and certainly don't want to ever see one
While some people may find it easier to give a thumbs up or thumbs down rating (which I will refer to hereafter as a thumbs-based rating), I would find it more difficult. I envision the following mapping from my 5-star schema to thumbs-based ratings:
Thumbs up for a 4-star or 5-star movie
No rating for a 3-star movie
Thumbs down for a 1-star or 2-star movie
Given that I rarely see a 2-star movie, I would probably only be giving thumbs up ratings in the proposed new scheme, and predict that the lower volume of ratings combined with the lower granularity of ratings would result in less accurate Netflix rating predictions.
Quality vs. Quantity
Speaking of quantity, the Verge article reported that Netflix saw a 200% increase in the number of ratings among the test group who used thumbs up or thumbs down, compared to the group that used the 5-star rating system.
The article doesn't report on the change in the number of users who submit ratings using thumbs up or thumbs down, nor is it clear whether a specific control group was used in the experiment. Based on their marvelously detailed posts in the Netflix tech blog, especially the posts on their recommender systems, I suspect they were very careful in the way they designed the experiment. Perhaps more details will eventually be reported there.
The article also doesn't report on the quality of the recommendations under the thumbs-based rating system. More is not necessarily better, and it is not clear what kind of impact the increased quantity had on the perceived quality of predictions based on the new system.
Given that the average U.S. adult consumes 5.5 hours of TV, movies, games, and other video content per day, I suspect most users are less discriminating than I am with respect to what they will watch. It may be that the quality of recommendations using the new system serves high-volume - or even average-volume - video consumers as well as or better than it serves low-volume video consumers. But if my supposition that higher-volume video consumers are less discriminating is correct, then the increase in quality may not have much impact on the amount consumed. And since Netflix charges flat monthly rates, those of us who consume relatively little video content are paying just as much as those who consume large amounts ... and if the recommendation quality declines for someone like me, who consumes little content, and the quantity of video I consume similarly declines, I am more likely to discontinue the service than a high-volume consumer who might consume less if the quality of recommendations suffers (due to fewer ratings). But if those users are already consuming a large quantity of video, I don't understand what problem Netflix is trying to address.
Returns on Investments
The article draws an analogy between Netflix ratings and Spotify thumbs-based ratings, which I think is an inappropriate comparison point. I use both the Spotify and Pandora streaming music services (in fact, I'm a paid subscriber for both (I hate commercials in any medium)), but rating a song that lasts a few minutes is very different - in my view - from rating a movie that lasts a few hours. I'm much more willing to provide a finer granularity rating (e.g., on a 5-star scale) for an experience that will last hours vs. minutes.
I think a better comparison point would be Yelp, which uses 5-star ratings for restaurants and other service providers. I'm willing to provide ratings on a 5-star scale for restaurants, because a restaurant visit represents a more significant investment of time (and money). I would even consider TripAdvisor, an online service for reviews and ratings of hotels and other destinations and activities associated with traveling, a better comparison point than Spotify, as planning a trip typically represents an even larger investment of time and money.
Personalized Ratings for All
In fact, I think both Yelp and TripAdvisor could benefit from adopting the potentially-soon-to-be-former Netflix personalized rating scheme. I am growing weary of wading through reviews of restaurants on Yelp from people who rant about the bartender not paying attention to them, or a special event dinner that went awry, or from anyone who doesn't share similar tastes in restaurants to me. I would love it if Yelp would offer a personalized rating, or at least let me read reviews from people like me.
TripAdvisor ratings have become almost useless to me. It appears that many hotels are carpet-bombing guests with email invitations to review their stay, and the result seems to be that many places now have an overwhelming abundance of reviews from people who have only posted one review. I consider most newbie reviews nearly useless, both because they tend to be short and uninformative, and because there is no way to know what kind of other places the person has reviewed, so I can't tell how much the reviewer is like me.
I could rant further on the decline of both of these services - which I once found far more useful - but I will let it go (for now). I wanted to compose this post because throughout all the years I've been a Netflix user, the service has only gotten better (as I gave it more ratings upon which to make recommendations), and I'd hate to see yet another beloved rating, review and recommender service decline.
If you feel similarly, I urge you to call Netflix soon, as they are reportedly planning to roll out the new thumbs-based rating system in April.
I thought it would be fun to experiment with all 3: implementing the recursive C function used by Gustavo Duarte in Python, doing so inside PythonTutor (which can generate an iframe) and then embedding the PythonTutor iframe inside an IPython Notebook, which I would then embed in this blog post.
Unfortunately, I haven't achieved the trifecta: I couldn't figure out how to embed the PythonTutor iframe inside an IPython Notebook, so I will embed both of them separately here in this blog post. Across the collection, 3 flavors of visualizing recursion are shown:
simple print statement output tracing the call stack (ASCII version; a minimal sketch follows this list)
a static call stack image created by Gustavo
a dynamic call stack created automatically by PythonTutor
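To give a concrete sense of the first flavor, here is a minimal sketch of print-statement tracing (using a recursive factorial function for illustration; the function Gustavo actually uses may differ), in which each call indents its output according to its depth in the call stack:

# hypothetical example: trace a recursive factorial, indenting by call depth
def factorial(n, depth=0):
    indent = '    ' * depth
    print indent + 'factorial(%d) called' % n
    if n <= 1:
        result = 1
    else:
        result = n * factorial(n - 1, depth + 1)
    print indent + 'factorial(%d) returns %d' % (n, result)
    return result

factorial(3)

Running factorial(3) prints a call/return pair at each level of nesting, which is essentially an ASCII rendering of the call stack diagrams discussed below.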
I'll start out with embedding Motivating and Visualizing Recursion in Python, an IPython Notebook I created to interweave text, images and code in summarizing Gustavo Duarte's compelling critique and counter-proposal for how best to teach and visualize recursion, and reimplementing his examples in Python.
Next, I'll embed an iframe for visualizing recursion in Python, providing a snapshot of its dynamic execution and visualization within PythonTutor:
I really like the way that PythonTutor enables stepping through and visualizing the call stack (and other elements of the computation). It may not be visible in the iframe above (you have to scroll to the right to see it), so I'll include a snapshot of it below.
If anyone knows how to embed a PythonTutor iframe within an IPython Notebook, please let me know, as I'd still welcome the opportunity to achieve a trifecta ... and I suspect that combining these two tools would represent even more enhanced educational opportunities for Pythonistas.
"this one got me the most excited about getting home (or back to work) to practice what I learned"
Well, I got back to work, and learned how to create an IPython Notebook. Specifically, I created one to provide a rapid "on-ramp" for computer programmers who are already familiar with basic concepts and constructs in other programming languages to learn enough about Python to effectively use the Atigeo xPatterns analytics framework (or other data science tools). The Notebook also includes some basic data science concepts, utilizing material I'd earlier excerpted in a blog post in which I waxed poetic about the book Data Science for Business, by Foster Provost and Tom Fawcett, and other resources I have found useful in articulating the fundamentals of data science.
The rapid on-ramp approach was motivated, in part, by my experience with the Natural Language Toolkit (NLTK) book, which provides a rapid on-ramp for learning Python in conjunction with the open-source NLTK library to develop programs using natural language processing techniques (many of which involve machine learning). I find that IPython Notebooks are such a natural and effective way of integrating instructional information and "active" exercises that I wish I'd discovered them back when I was teaching courses using Python at the University of Washington (e.g., what came to be known as the socialbots course). I feel like a TiVo fanatic now, wanting to encourage anyone and everyone sharing any knowledge about Python to use IPython Notebooks as a vehicle for doing so.
I piloted an initial version of the Python for Data Science notebook during an internal training session for software engineers who had experience with Java and C++ a few weeks ago, and it seemed to work pretty well. After the Strata 2014 videos were released, I watched Olivier Grisel's tutorial on Introduction to Machine Learning with IPython and scikit-learn, and worked through the associated parallel_ml_tutorial notebooks he posted on GitHub. I updated my notebook to include some additional aspects of Python that I believe would be useful in preparation for that tutorial.
Not only was this my first IPython Notebook, but I'm somewhat embarrassed to admit that the Python for Data Science repository represents my first contribution to GitHub. When I was teaching at UW, I regularly encouraged students to contribute to open source projects. Now I'm finally walking the talk ... better late than never, I suppose.
In any case, I've uploaded a link to the repository on the IPython Notebook Viewer (NBViewer) server - "a simple way to share IPython Notebooks" - so that the Python for Data Science notebook can be viewed in a browser, without running a local version of IPython Notebook (note that it may take a while to load, as it is a rather large notebook).
I'll include the contents of the repo's README.md file below. Any questions, comments or other feedback is most welcome.
This short primer on Python is designed to provide a rapid "on-ramp" for computer programmers who are already familiar with basic concepts and constructs in other programming languages to learn enough about Python to effectively use open-source and proprietary Python-based machine learning and data science tools.
The primer is spread across a collection of IPython Notebooks, and the easiest way to use the primer is to install IPython Notebook on your computer. You can also install Python, and manually copy and paste the pieces of sample code into the Python interpreter, as the primer only makes use of the Python standard libraries.
There are three versions of the primer. Two versions contain the entire primer in a single notebook:
There are several exercises included in the notebooks. Sample solutions to those exercises can be found in two Python source files:
simple_ml.py: a collection of simple machine learning utility functions
SimpleDecisionTree.py: a Python class to encapsulate a simplified version of a popular machine learning model
There are also 2 data files, based on the mushroom dataset in the UCI Machine Learning Repository, used for coding examples, exploratory data analysis and building and evaluating decision trees in Python (a loading sketch follows this list):
agaricus-lepiota.data: a machine-readable list of examples or instances of mushrooms, represented by a comma-separated list of attribute values
agaricus-lepiota.attributes: a machine-readable list of attribute names and possible attribute values and their abbreviations
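For anyone who wants to poke at the data file before opening the notebooks, here is a minimal sketch of loading it using only the standard libraries (the parsing is simplified; in the UCI mushroom dataset, the first value in each row is the class label, e = edible, p = poisonous):

import csv

# read each comma-separated instance into a list of attribute values
with open('agaricus-lepiota.data') as f:
    instances = [row for row in csv.reader(f)]

print '%d instances, %d values each' % (len(instances), len(instances[0]))
print 'first instance:', instances[0]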
I didn't physically attend Strata NY + Hadoop World this year, but I did watch the keynotes from the conference. O'Reilly Media kindly makes videos of the keynotes and slides of all talks available very soon after they are given. Among the recurring themes were haranguing against the hype of big data, the increasing utilization of Hadoop as a central platform (hub) for enterprise data, and the importance and potential impact of making data, tools and insights more broadly accessible within an enterprise and to the general public. The keynotes offered a nice mix of business (applied) & science (academic) talks, from event sponsors and other key players in the field, and a surprising - and welcome - number of women on stage.
Meanwhile, I'll include some of my notes on interesting data and insights presented by others, in the order in which presentations were scheduled, linking each presentation title to its associated video. Unlike previous postings of notes from conferences, I'm going to leave the notes in relatively raw form, as I don't have the time to add more narrative context or visual augmentations to them.
3000 people at the conference (sellout crowd), up from 700 people in 2009.
Hadoop started out as a complement to traditional data processing (offering large-scale processing). Progressively adding more real-time capabilities, e.g., Impala & Cloudera Search. More and more capabilities migrating from traditional platforms to Hadoop.
Hadoop moving from the periphery to the architectural center of the data center, emerging as an enterprise data hub.
Hub: scalable storage, security, data governance, engines for working with the data in place; spokes connect to other systems, people.
Announcing Cloudera 5, "the enterprise data hub"
Announcing Cloudera Connect Cloud, supporting private & public cloud deployments
Announcing Cloudera Connect Innovators; inaugural innovator is Databricks (Spark real-time in-memory processing engine)
Hadoop is the first open source project that has spawned a market.
3:35: compelling graph of Hadoop/HBase disk latency vs. MapR latency.
Hadoop is being used in production by many organizations.
Need to focus on business needs, not the technology. You can use science, technology and statistics to figure out what the answers are, but it is still an art to figure out what the right questions are.
How to focus on the right questions:
* hire people with academic knowledge + business savvy
* train everyone on analytics (internal DataCamp at Facebook for project managers, designers, operations; 50% on tools, 50% on how to frame business questions so you can use data to get the answers)
* put analysts in an org structure that allows them to have impact ("embedded model": hybrid between centralized & decentralized)
Goals of analytics: impact, insight, actionable insight, evangelism ... own the outcome
Tony is director at the Experience Research Lab (is this the group formerly known as People & Practices?) [I'm an Intel Research alum, and Tony is a personal friend]
Personal data economy: system of exchange, trading personal data for value
3 opportunities:
* hyper individualism (Moore's Cloud, programmable LED lights)
* hyper collectivity (student projects with outside collaboration)
* hyper differentiation (holistic design for devices + data)
Big data is by the people and of the people ... and it should be for the people
Praises Apache, open source, GitHub (highlighted by someone from Microsoft?)
Make big data accessible (MS?)
Hadoop is a cornerstone of big data; Microsoft is committed to making it ready for the enterprise
HD Insight (?): Azure offering for Hadoop
We have a billion users of Excel, and we need to find a way to let anybody with a question get that question answered.
Power BI for Office 365 Preview
A Turing test for advertising fraud
Dstillery: predicting consumer behavior based on browsing histories
Saw 2x performance improvement in 2 weeks; was immediately skeptical
Integrated additional sources of data (10B bid requests)
Found "oddly predictive websites", e.g., women's health page --> 10x more likely to check out credit card offer, order online pizza, or read about luxury cars
Large advertising scam (botnet); 36% of traffic is non-intentional (Comscore)
Co-visitation patterns; cookie stuffing
Botnet behavior is easier to predict than human behavior
Put bots in "penalty box": ignore non-human behavior
When it comes to big data, BI = BS
Contrasts enterprises based on fiction, feeling & faith vs. fact-based enterprises
Big data analytics: letting regular business people iteratively interrogate massive amounts of data in an easy-to-use way so that they can derive insight and really understand what's going on
3 layers: deep processing + acceleration + rich analytics
Product: Hadoop processing + in-memory acceleration + analytics engines + Vizboards
Example: event series analytics + entity-centric data catalog + iterative segmentation
Louisiana Purchase: Lewis & Clark address a big data acquisition problem
Thomas Jefferson: "Your observations are to be taken with great pains & accuracy, to be entered intelligibly, for others as well as yourself"
What happens when you make data more liquid?
4 characteristics of "openness" or "liquidity" of data:
* degree of access
* machine readability
* cost
* rights
Benefits of open data:
* transparency
* benchmarking exposing variability
* new products and services based on open data (Climate Corporation?)
How open data can enable value creation:
* matching supply and demand
* collaboration at scale: "with enough eyes on code, all bugs are shallow" --> "with enough eyes on data, all insights are shallow"
* increase accountability of institutions
Open data can help unlock $3.2B [typo? s/b $3.2T?] to $5.4T in economic value per year across 7 domains:
* education
* transportation
* consumer products
* electricity
* oil and gas
* health care
* consumer finance
What needs to happen?
* identify, prioritize & catalyze data to open
* developers, developers, developers
* talent (data scientists, visualization, storytelling)
* address privacy, confidentiality, security, IP policies
* platforms, standards and metadata
Hadoop started out as a storage & batch processing system for Java programmers
Increasingly enables people to share data and hardware resources
Becoming the center of an enterprise data hub
More and more capabilities being brought to Hadoop
Inevitable that we'll see just about every kind of workload being moved to this platform, even online transaction processing
GE has created 24 data-driven apps in one year
We are working with them as a Pivotal investor and a Pivotal company; we help them build these data-driven apps, which generated $400M in the past year
Pivotal code-a-thon, with Kaiser Permanente, using Hadoop, SQL and Tableau
What it takes to be a data-driven company:
* Have an application vision
* Powered by Hadoop
* Driven by Data Science
Took Facebook 9 months to achieve the same number of users that it took radio 40 years to achieve (100M users)
Use cases:
* At-risk students stay in school with real-time guidance (University of Kentucky)
* Soccer players improve with spatial analysis of movement
* Visualization of cancer treatment options
Big data geek challenge (SAP Lumira): $10,000 for best application idea
Social TV Lab
How can we derive value from the data that is being generated by viewers today?
Methodology: start with Twitter handles of TV shows, identify followers, collect tweets and their networks (followees + followers), build recommendation systems from the data (social network-based, product network-based & text-based (bag of words)). Correlate words in tweets about a show with demographics about audience (Wordle for male vs. female)
1. You can use Twitter followers to estimate viewer audience demographics
2. TV triggers lead to more online engagement
3. If brands want to engage with customers online, play an online game
Real-time response to advertisement (Teleflora during Super Bowl): peaking buzz vs. sustained buzz
Demographic bias in sentiment & tweeting (male vs. female response to Teleflora, others)
Influence = retweeting; women more likely to retweet women, men more likely to retweet men
4. Advertising response and influence vary by demographic
5. GetGlue and Viggle check-ins can be used as a reliable proxy for viewership to
* predict Nielsen viewership weeks in advance
* predict customer lifetime value
* measure time shifting
All at the individual viewer level (vs. household level)
Ultracompact satellites to image the earth on a much more frequent basis, to get inside the human decision-making loop so we can help human action
Redundancy via large # of small satellites with latest technology (vs. older, higher-reliability systems on one satellite)
Recency: shows more deforestation than Google Maps, river movement (vs. OpenStreetMap)
API for the Changing Planet; hackathons early next year
[No slides?]
Invention of sliced bread; big data [hyped] as the biggest thing since sliced bread
Think about big data as a journey
1. It's all about discipline and knowing where you are going (vs. being enamored with tech). VC $2.6B investment into big data (IBM, SAP, Oracle, ... $3-4B more)
2. Understand that any of these technologies do not live in a silo. The thing that you don't want to have happen is that this thing becomes a science fair project. At the end of the day, this is going to be part of a broader architecture.
3. This is an investment decision; want to have a return on investment.
The Next Era of Data Analysis: the next big thing is how you analyze data from many disparate sources and do it quickly.
More data: internal data + external data
More speed: fast answers + discovery. Increase speed of access & speed of processing so that iterative insight becomes possible.
More people: collaboration + context. Needs to become easier for everyone across the business (not just specialists) to see insights as insights are made available, have to make decisions faster.
Data-aware collaboration; data harmonization
Demo: 6:10-8:30
1 of 3 people in the US has had a direct experience with cancer in their family; 1 in 4 deaths are cancer-related
Jim's mom has chronic leukemia. Just got off the phone with his mom (it's his birthday), and she asked "what is it that you do?" "We use data to solve really hard problems like cancer" "When?" "Soon"
Cancer is the 2nd leading cause of death in children
"The brain trust in this room alone could advance cancer therapy more in a year than the last 3 decades." Bjorn Brucher
We can help them by predicting individual outcomes, and then proactively applying preventative measures.
Big data starts with the application. Stop building your big data sandboxes, stop building your big data stacks, stop building your big data Hadoop clusters without a purpose. When you start with the business problem, the use case, you have a purpose, you have focus.
50% of big data projects fail (reference?)
"Take that one use case, supercharge it with big data & analytics, we can take & give you the most comprehensive big data solutions, we can put it on the cloud, and for some of you, we can give you answers in less than 30 days"
"What if you can contribute to the cure of cancer?" [abrupt pivot back to initial inspirational theme]
Why coding is important: by 2020, 1.4M computing jobs
Women of color currently make up 3% of computing jobs in the US
Goal: teach 1M girls to code by 2040
Thus far: 2 years, 2000 girls, 7 states + Johannesburg, South Africa
[my favorite talk]
Anything which appears in the press in capital letters, and surrounded by quotes, isn't real. There is no math solution to anything. Math isn't the answer, it's not even the question. Math is a part of the solution.
Pieces of math have different biases, different things they do well, different things they do badly, just like employees. Hiring one new employee won't transform your company; hiring one new piece of math also won't transform your company.
Normal distribution, bell curve: beautiful, elegant. Almost nothing in the real world is, in fact, normal. Power laws don't actually have means.
Joke: How do you tell the difference between an introverted and an extroverted engineer? The extroverted one looks at your shoes instead of his own.
The math that you think you know isn't right. And you have to be aware of that. And being aware of that requires more than just math skills.
Science is inherently about data, so "data scientist" is redundant. However, data is not entirely about science. Math + pragmatism + communication. Prefers "data artist" to data scientist.
Fundamentally, the hard part actually isn't the math, the hard part is finding a way to talk about that math. And the hard part isn't actually gathering the data, the hard part is talking about that data.
The most famous data artist of our time: Nate Silver. Data artists are the future. What the world needs is not more R, what the world needs is more artists (Rtists?)
[co-author of my favorite book on Data Science]
Agrees with some of the critiques made by the previous speaker, but rather likes the term "data scientist"
Shares some quotes from Data Science and its relationship to Big Data and Data-Driven Decision Making
Gartner Hype Cycle 2012 puts "Predictive Analytics" at the far right ("Plateau of Productivity") [it's still there in Gartner Hype Cycle 2013, and "Big Data" has inched a bit higher into the "Peak of Inflated Expectations"]
More data isn't necessarily better (if it's from the same source, e.g., sociodemographic data); more data from different sources may help.
Using fine-grained behavior data, learning curves show continued improvement to massive scale. 1M merchants, 3M data points (? look up paper). But sociodemographic + pseudo-social network data still does not necessarily do better. See Pseudo-Social Network Targeting from Consumer Transaction Data (Martens & Provost)
Seem to be very few case studies where you have really strong best practices with traditional data juxtaposed with strong best practices with another sort of data.
We see similar learning curves with different data sets, characterized by massive numbers of individual behaviors, each of which probably contains a small amount of information, and the data items are sparse. See Predictive Modelling with Big Data: Is Bigger Really Better? (Enrique Junque de Fortuny, David Martens & Foster Provost)
Others have published work on fraud detection (Fawcett & FP, 1997; Cortes et al., 2001), social network-based marketing (Hill, et al., 2006), online display-ad targeting (FP, Dalessandro, et al., 2009; Perlich, et al., 2013). Rarely see comparisons.
Take-home message: The Golden Age of Data Science is at hand. Firms with larger data assets may have the opportunity to achieve significant competitive advantage.
Whether bigger is better for predictive modeling depends on:
a) the characteristics of the data (e.g., sparse, fine-grained data on consumer behavior)
b) the capability to model such data
O'Reilly Media is my primary resource for all things Data Science, and the new O'Reilly book on Data Science for Business by Foster Provost and Tom Fawcett ranks near the top of my list of their relevant assets. The book is designed primarily to help businesspeople understand the fundamental principles of data science, highlighting the processes and tools often used in the craft of mining data to support better business decisions. Among the many gems that resonated with me are the emphasis on the exploratory nature of data science - more akin to research and development than engineering - and the importance of thinking carefully and critically ("data-analytically") about the data, the tools and overall process.
The book references and elaborates on the Cross-Industry Standard Process for Data Mining (CRISP-DM) model to highlight the iterative process typically required to converge on a deployable data science solution. The model includes loops within loops to account for the way that critically analyzing data models often reveals additional data preparation steps that are needed to clean or manipulate the data to support the effective use of data mining tools, and how the evaluation of model performance often reveals issues that require additional clarification from the business owners. The authors note that it is not uncommon for the definition of the problem to change in response to what can actually be done with the available data, and that it is often worthwhile to consider investing in acquiring additional data in order to enable better modeling. Valuing data - and data scientists - as important assets is a recurring theme throughout the book.
As a practicing data scientist, I find the book's emphasis on the expected value framework - associating costs and benefits with different performance metrics - to be a helpful guide in ensuring that the right questions are being asked, and that the results achieved are relevant to the business problems that motivate most data science projects. And as someone whose practice of data science has recently resumed after a hiatus, I found the book very useful as a refresher on some of the tools and techniques of data analysis and data mining ... and as a reminder of potential pitfalls such as overfitting models to training data, not appropriately taking into account null hypotheses and confidence intervals, and the problem of multiple comparisons. I've been using the scikit-learn package for machine learning in Python in my recent data modeling work, and some of the questions and issues raised in this book have prompted me to reconsider some of the default parameter values I've been using.
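For example, here is a minimal sketch (my own, not from the book) of the kind of parameter I've been revisiting: scikit-learn's decision tree will, by default, keep splitting until its leaves are pure, which invites overfitting, whereas explicit limits on depth and leaf size constrain model complexity:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import cross_val_score

iris = load_iris()

# default parameters: the tree grows until leaves are pure (prone to overfitting)
default_tree = DecisionTreeClassifier()

# explicit limits on tree complexity, as a guard against overfitting
pruned_tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)

for name, tree in [('default', default_tree), ('pruned', pruned_tree)]:
    scores = cross_val_score(tree, iris.data, iris.target, cv=10)
    print name, 'cross-validated accuracy:', round(scores.mean(), 3)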
The book includes a nice mix of simplified and real-world examples to motivate and clarify many of the common problems and techniques encountered in data science. It also offers appropriately simplified descriptions and equations for the mathematics that underlie some of the key concepts and tools of data science, including one of the clearest definitions of Bayes' rule and its application in constructing Naive Bayes classifiers I've seen. The figures (such as the one above) add considerable clarity to the topics covered throughout the book. I particularly like the chapter highlighting the different visualizations - profit curves, lift curves, cumulative response curves and receiver operating characteristic (ROC) curves - that can be used to help compare and effectively communicate the performance of models. [Side note: it was through my discovery of Tom Fawcett's excellent introduction to ROC analysis that I first encountered the Data Science for Business book. In the interest of full disclosure, I should also note that Tom is a friend and former grad school colleague (and fellow homebrewer) from my UMass days.]
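For reference, the standard form of Bayes' rule in a classification setting is p(C | E) = p(E | C) x p(C) / p(E), where C is a class and E is the evidence (the observed feature values); the "naive" move in a Naive Bayes classifier is to assume the features are conditionally independent given the class, so that p(E | C) can be factored into the product p(e1 | C) x p(e2 | C) x ... x p(ek | C).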
The penultimate chapter of the book is on Data Science and Business Strategy, in which the authors elaborate on the importance of making strategic investments in data, data scientists and a culture that enables data science and data scientists to thrive. They note the importance of diversity in the data science team, the variance in individual data scientist capabilities - especially with respect to innate creativity, analytical acumen, business sense and perseverance - and the tendency toward replicability of successes in solving data science problems, for both individuals and teams. They also emphasize the importance of attracting a critical mass of data scientists - to support, augment and challenge each other - and progressively systematizing and refining various processes as the data science capability of a team (and firm) matures ... two aspects whose value I can personally attest to based on my own re-immersion in a data science team.
I recently accepted an offer to assume the role of Director, Analytics and Data Science, at Atigeo LLC. This career transition mostly marks a shift of title and status, as I've been consulting at Atigeo as a Principal Scientist for the past 18 months (part-time during the academic year and full-time during summers). I'm excited about continuing to exercise and extend my skills and experience in natural language processing, machine learning and usability, contributing to Atigeo's health products - an area of heightened interest for me over the last several years - as well as exploring other emerging opportunities for the company and its partners and customers.
On the cusp of this transition, I was inspired by David Whyte's 6-CD set, Clear Mind, Wild Heart, and his compelling poetry and prose regarding "courageous conversations", "cyclical invitations", "investigative vulnerability" and "hazarding" oneself on "successive frontiers" of existence. I've listened to this entire collection dozens of times, and have referenced his poetry in several previous posts. During this particular cycle, I was struck by his observations about feeling hemmed in, and the importance of taking advantage of periodic opportunities to harvest the fruits of one's labors and loves. I also revisited and reflected on Martin Buber's insights - channeled by Oriah Mountain Dreamer - about bringing all of who I am to my work, and came to believe that I am better able to bring more dimensions of myself into my new (current) work than I could at my previous work.
This is not to say that I was not able to bring many dimensions of myself into my previous work. Indeed, teaching computer science at UW Bothell (and UW Tacoma) offered me an opportunity to exercise and extend a broad array of skills initially cultivated in an earlier teaching cycle at the University of Hartford. Unfortunately, as time went on, I was experiencing increasing conflict between my desire to promote experimentation and exploration among students, and my need to assess their competency in a standard, objective and time-efficient way. I found myself acting as gatekeeper - to ensure that students' grades reflected their capabilities to take on greater and greater challenges further along the curriculum (and ultimately in their careers) - and yet wanting to help them tear down walls. I also found myself increasingly uncertain about opportunities for my own career growth in academia.
As I pondered the paths that lay before me, I reflected on the ways my professional life has evolved in cycles. I started to chart the different stages on a spiral graph, but soon realized that there are too many dimensions, and my progression has not followed an orderly or entirely predictable sequence. Instead, I'll settle for inserting an emblematic photo (that I particularly like for its upward vs. downward perspective) and simply listing some of the dimensions through which my career has cycled:
academia and industry (and large and small institutions in both realms)
teaching, research, design, development and management
artificial intelligence, mobile and ubiquitous computing, and human-computer interaction
Apparently, I'm not the only person to have thought of careers as following a spiral path. In an intriguing paper on "Career Pandemonium: Realigning Organizations and Individuals" (Academy of Management Executive, 10(4), 1996), Ken Brousseau and his colleagues describe the spiral career path as a non-traditional model involving periodic major moves across different areas, in which "the new field draws upon knowledge and skills developed in the old field, and at the same time throws open the door to the development of an entirely new set of knowledge and skills". That sounds about right.
The authors also offer a related insight about career resiliency:
Instead of people dedicated to a particular discipline, function, job, or career path, the career resilient workforce would be composed of employees who not only are dedicated to the idea of continuous learning but also stand ready to reinvent themselves to keep pace with change; who take responsibility for their own career management and, last but not least, who are committed to the company's success.
I am grateful for all the support of my continuous learning at UW Bothell - from the faculty, staff, students and administrators - during the last career cycle, and I hope to maintain some form of connection with the university during this next cycle.
As I proceed with my latest self-reinvention (or, at least, transition), I can't help but note the marvelous rendition of the idea of non-linear paths through life articulated in one of my favorite Harry Chapin songs, All My Life's a Circle:
No straight lines make up my life; And all my roads have bends; There's no clear-cut beginnings; And so far no dead-ends.
I've been using the Delicious social bookmarking web service for many years as a way to archive links to interesting web pages and associate tags to personally categorize - and later search for - their content [my tags can be found under the username gump.tion, a riff on the original Delicious URL, del.icio.us]. In December 2010, a widely circulated rumor reported that Yahoo was planning to shut down Delicious, and a number of my friends abandoned the service for other services. I was in the midst of yet another career change, rejoining academia after a 21-year hiatus, with little time for browsing, much less bookmarking, so I did not make any changes at the time.
It turns out that rather than being shut down, Delicious was sold in April 2011, and various changes have since been made to the service and its user interface. The Delicious UI initially interpreted spaces in the TAGS field as tag separators, e.g., typing in the string "education mooc disruption" (as shown in the screenshot below) would be interpreted as tagging a page with the 3 tags "education", "mooc" and "disruption"; if you wanted a single tag with those 3 terms, you had to remove or replace the spaces, e.g., "educationmoocdisruption" or "education_mooc_disruption". Sometime in October 2011, the specifications changed, and commas rather than spaces were used to separate tags, allowing spaces to be used in the tags themselves, e.g., "education mooc disruption" was interpreted as a single tag (equivalent to "educationmoocdisruption"). Unfortunately, I did not see an announcement or notice this change for quite some time, and so I had hundreds of web pages archived on Delicious with tags I did not intend.
This problem surfaced recently when I was sharing my bookmarks on MOOCs (massive open online courses) with a group of students working on a project investigating MOOCs in a small closed offline course, Computing Technology and Public Policy. There were several pages I remembered bookmarking that did not appear among the pages associated with my MOOC tag. Searching through my archive for the titles of some of those pages, I discovered several pages tagged with terms including spaces. I started manually renaming tags, replacing the multi-term tags with the multiple tags I'd intended to associate with the pages. After a dozen or so manual replacements, I scanned my tag set and saw many, many more, and so decided to try a different approach.
The Delicious API provides a programmatic way to access or change tags associated with an authenticated user's account. Ever since my first socialbots experiment, my programming language of first resort in accessing any web service API is Python, and as I expected, there is a Python package for accessing the Delicious API, aptly named pydelicious. Using pydelicious, I discovered that my Delicious account had over 200 tags with unintended spaces in them. I'm sharing the process I used to convert these tags in case it is of interest / use to others in a similar predicament. [Note: my MacBook Pro, running Mac OS X 10.8.3, comes prebundled with Python 2.7.2; instructions for installing and using Python can be found at python.org.]
Replacing all the tags containing unintended spaces with comma-delimited equivalents (e.g., replacing "education mooc disruption" with "education", "mooc", "disruption") was relatively straightforward, using the following sequence:
Install pydelicious
Type easy_install pydelicious on the command line (on Mac OS X, this can be done in a Terminal window; on Windows, this can be done in a Command Prompt window)
$ easy_install pydelicious
Searching for pydelicious
Reading http://pypi.python.org/simple/pydelicious/
Reading http://code.google.com/p/pydelicious/
Best match: pydelicious 0.6
Downloading http://pydelicious.googlecode.com/files/pydelicious-0.6.zip
Processing pydelicious-0.6.zip
...
Finished processing dependencies for pydelicious
$
Launch the Python interpreter
MacBook-Joe:Python joe$ python
Python 2.7.2 (v2.7.2:8527427914a2, Jun 11 2011, 15:22:34)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
[>>> is the Python prompt]
Import the pydelicious package and getpass function
>>> from pydelicious import DeliciousAPI
>>> from getpass import getpass
>>>
Authenticate my Delicious username and password with the Delicious API
>>> api = DeliciousAPI('gump.tion', getpass('Password:'))
Password:
>>>
[Note: my password is not displayed in the Terminal window as I type it]
Retrieve all my tags
>>> tagset = api.tags_get()
>>>
[tagset will be a dictionary (or associative array) with a single key, tags, whose associated value is an array of dictionaries, each of which has two keys, count and tag, e.g., {'tags': [{'count': '1', 'tag': 'socialnetwork security socialbots'}, ...]}. tagset['tags'] can be used to access the array of counts and tags, and a for loop can be used to iterate across each element of the array.]
Check for tags with spaces
>>> for tag in tagset['tags']:
... if ' ' in tag['tag']:
... print tag['count'], ': ', tag['tag']
...
1 : socialnetwork security socialbots
1 : education openaccess p2p collaboration cscl
1 : education parenting
1 : psychology wrongology education
1 : privacy internet politics business surveillance censorship
1 : robots psychology nlp
[... is the Python continuation prompt, indicating the interpreter expects the command to be continued. Note that the 200+ lines of tags with spaces has been truncated above.]
Replace the spaces in each multi-term tag with commas
>>> for tag in tagset['tags']:
... if ' ' in tag['tag']:
... api.tags_rename(tag['tag'], tag['tag'].replace(" ", ", "))
...
>>>
Verify that the tags have been replaced via the API
>>> for tag in api.tags_get()['tags']:
... if ' ' in tag['tag']:
... print tag['count'], ': ', tag['tag']
...
>>>
[Replacing the reference to tagset with a fresh call to api.tags_get()]
Verify that the tags have been replaced via a browser
E.g., reload the page above, then edit the tag portion of the URL to manually replace spaces (%20) with commas (%2C), resulting in the following URL: https://delicious.com/gump.tion/education%2Cmooc%2Cdisruption
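For anyone facing a similar predicament, the whole fix boils down to a short script (a sketch that simply consolidates the pydelicious calls shown above):

from getpass import getpass
from pydelicious import DeliciousAPI

api = DeliciousAPI('gump.tion', getpass('Password:'))

# rename every tag containing a space to its comma-separated equivalent
for tag in api.tags_get()['tags']:
    if ' ' in tag['tag']:
        api.tags_rename(tag['tag'], tag['tag'].replace(' ', ', '))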
Having replaced all the tags with unintended spaces, I've reduced my tag set from 881+ to 680. I now see that I have a number of misspelled tags (e.g., commumity), and a number of singleton tags that are semantically similar to other tags I've used more regularly (e.g., comics (2) and humor (27)) - an inconsistency that similarly affects the category tags on this blog - but I'll leave further fixes for another time in which I want to engage in structured procrastination.
The cover of Gayle Laakmann McDowell's book, Cracking the Coding Interview, and links to her Career Cup web site and Technology Woman blog are included in the slides I use on the first day of every senior (400-level) computer science course I have taught over the last two years. These are some of the most valuable resources I have found for preparing for interviews for software engineering - as well as technical program manager, product manager or project manager - positions. I recently discovered she has another book, The Google Resume, that offers guidance on how to prepare for a career in the technology industry, so I've added that reference to my standard introductory slides.
While my Computing and Software Systems faculty colleagues and I strive to prepare students with the knowledge and skills they will need to succeed in their careers, the technical interview process can prove to be an extremely daunting barrier to entry. The resources Gayle has made available - based on her extensive interviewing experience while a software engineer at Google, Microsoft and Apple - can help students (and others) break through those barriers. The updated edition of her earlier book focuses on how to prepare for interviews for technical positions, and her latest book complements this by offering guidance - to students and others who are looking to change jobs or fields - on how to prepare for careers in the computer technology world.
I have been looking for an opportunity to invite Gayle to the University of Washington Bothell to present her insights and experiences directly to our computer science students since I started teaching there last fall, and was delighted when she was able to visit us last week. Given the standing room only crowd, I was happy to see that others appreciated the opportunity to benefit from some of her wisdom. I will include fragments of this wisdom in my notes below, but for the full story, I recommend perusing her slides (embedded below) or watching a video of a similar talk she gave in May (also embedded further below), and for anyone serious about preparing for tech interviews and careers, I recommend reading her books.
Gayle emphasized the importance of crafting a crisp resume. Hiring managers typically spend no more than 15-30 seconds per resume to make a snap judgment about the qualifications of a candidate. A junior-level software engineer should be able to fit everything on one page, use verbs emphasizing accomplishments (vs. activities or responsibilities), and quantify accomplishments wherever possible. Here are links to some of the relevant resources available at her different web sites:
One important element of Gayle's advice [on Slide 13] that aligns with my past experience - and ongoing bias - in hiring researchers, designers, software engineers and other computing professionals is the importance of working on special projects (or, as Gayle puts it, "Build something!"). While graduates of computer science programs are in high demand, I have always looked for people who have done something noteworthy and relevant, above and beyond the traditional curriculum, and it appears that this is a common theme in filtering prospective candidates in many technology companies. This is consistent with advice given in another invited talk at UWB last year by Jake Homan on the benefits of contributing to open source projects, and is one of the motivations behind the UWB CSS curriculum requiring a capstone project for all our computer science and software engineering majors.
Gayle spoke of "the CLRS book" during her talk at UWB and her earlier talk at TheEasy, a reference to the classic textbook, Introduction to Algorithms, by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein. She said that entry-level software engineer applicants typically won't need to know data structures and algorithms at the depth or breadth presented in that book, and she offers a cheat sheet / overview of the basics on Slides 23-40, and an elaboration in Chapters 8 & 9 of her CtCI book. However, for those who are interested in delving more deeply into the topic, an online course based on the textbook is now part of the MIT Open CourseWare project, and includes video & audio lectures, selected lecture notes, assignments, exams and solutions.
One potential pitfall for candidates who prepare thoroughly for technical interviews is that they may get an interview question they have already seen (and perhaps studied). She recommended that candidates admit to having seen a question before, equating not doing so with cheating on an exam, and that they avoid simply reciting solutions from memory, because simple slip-ups are both common and easy to catch.
Gayle stressed that there is no correlation between how well a candidate thinks he or she did in an interview and how well their interviewers thought they did. In addition to natural biases, the candidate evaluation process is always relative: candidates' responses to questions are assessed in the context of the responses of other candidates for the same position. So even if a candidate thinks he or she did well on a question, it may not be as well as other candidates, and even if a candidate thinks he or she totally blew a question, it may not have been blown as badly as other candidates blew the question.
Another important factor to bear in mind is that most of the big technology companies tend to be very conservative in making offers; they generally would prefer to err on the side of false negatives rather than false positives. When they have a candidate who seems pretty good, but they don't feel entirely confident about the candidate's strength, they have so many [other] strong candidates that they would rather reject someone who may have turned out great than risk hiring someone who does not turn out well. Of course, different companies have different evaluation and ranking schemes, and many of these details can be found in her CtCI book.
Gayle visits the Seattle area on a semi-regular basis, so I'm hoping I will be able to entice her to return each fall to give a live presentation to our students. However, for the benefit of those who are not able to see her present live, here is a video of her Cracking the Coding Interview presentation at this year's Canadian University Software Engineering Conference (CUSEC 2012) [which was also the site of another great presentation I blogged about a few months ago, Bret Victor's Inventing on Principle].
Finally, I want to round things out on a lighter note, with a related video that I also include in my standard introductory slides, Vj Vijai's Hacking the Technical Interview talk at Ignite Seattle in 2008:
Since my shoulder surgery 4 weeks ago, I've been spending a lot of time developing software and listening to Pandora. The pain meds (oxycodone & hydrocodone) put me in a bit of a brain fog, limiting the effective breadth and depth of my thinking (and doing), but a reasonably well-defined coding task seemed ideally suited to my power of concentration ... and I've always found listening to music while coding helps put & keep me in "the zone".
The Pandora freemium online music service has developed such an accurate model of my preferences over the past few weeks that I've upgraded to Pandora One. The annual subscription version of the service has eliminated commercials, increased the length of time I can listen per day, and eliminated the pause and prompt asking "Are you still listening?" if I don't interact regularly with the site.
The one annoyance that remains is that there are certain songs that I believe should never be played without also playing the song that immediately follows them on the original album / CD. This strikes me as the musical opposite of a non sequitur - examples of which are virally proliferating as we slog through the U.S. presidential election season - so I propose the following name for this phenomenon:
Falling In and Out of Love / Amie (Pure Prairie League, Bustin' Out, 1972)
On the Run / Breathe / Great Gig in the Sky (Pink Floyd, Dark Side of the Moon, 1973)
Any Colour You Like / Brain Damage / Eclipse (Pink Floyd, Dark Side of the Moon)
Happiest Days of Our Lives / Another Brick in the Wall Part 2 (Pink Floyd, The Wall, 1979; contributed by Eric)
I'm probably dating myself with these examples, and by my uncertainty about whether contemporary bands are producing songs that are intended to flow so naturally from one to the other. Perhaps it was solely or primarily a trend of the late 60s and early 70s.
I'll update the list with additional examples as I encounter them, and would welcome any other examples anyone is inclined to share in the comments.
I've been working on a graphical user interface to enable a user to view, modify and add relevance judgments for a set of results returned by a search engine in response to a set of topics. The GUI was developed as part of some work I've been doing on the Text REtrieval Conference (TREC) 2012 Medical Records Track. I hope to write more about the GUI and the work on TREC in the future. For now, I want to share a solution I developed to a problem with the Java 6 Swing and AWT components I was struggling with over the past few days.
The problem was a JList contained within a JScrollPane contained within a JPanel that was resizing when new String elements added to the JList were longer than its current width. I initially implemented the GUI, which extends JFrame, using a BorderLayout.
The offending JList was in the West (leftmost) area, and when it grew (after adding a string longer than its initial width), it shrank the JTextArea component I have in the Center area. I wanted to prevent any resizing from happening.
I read that the Center area of a BorderLayout cannot be protected from resizing - it fills whatever space is available after all the other components have been rendered - so I experimented with different LayoutManagers (GridBagLayout and BoxLayout) ... which will be the subject of a future blog post [update: see How to approximate BorderLayout with GridBagLayout or BoxLayout]. I couldn't manage to get GridBagLayout to work and play nicely with my components; BoxLayout - using a BoxLayout.X_AXIS JPanel for the West, Center and East area components, which was contained inside a BoxLayout.Y_AXIS JPanel (which was wedged between BoxLayout.Y_AXIS JPanels for North and South areas) - reduced the shrinking, but did not eliminate it.
I considered extending JPanel and overriding the getPreferredSize() or getMaximumSize() methods, but decided against this as the log files revealed that some resizing occurs after all the components in the GUI are initially populated. I wanted to allow the component sizes to settle into an initial populated state, and then prevent subsequent resizing.
After trying several other possibilities, I finally wrote a method to update the width associated with the PreferredSize and MaximumSize, which I call after populating the components. I'll share it below, in case it is of use to others, or in case others have better solutions.
private void restrictPanelWidth(JPanel panel) {
    // lock in the panel's current width, preserving its preferred and maximum heights
    int panelCurrentWidth = panel.getWidth();
    int panelPreferredHeight = (int) panel.getPreferredSize().getHeight();
    int panelMaximumHeight = (int) panel.getMaximumSize().getHeight();
    panel.setPreferredSize(new Dimension(panelCurrentWidth, panelPreferredHeight));
    panel.setMaximumSize(new Dimension(panelCurrentWidth, panelMaximumHeight));
}
Among the shortcomings of this solution is that any JPanel restricted by this method will not automatically resize if/when the containing window (JFrame) is resized. My expectation is that my GUI will fill the screen - however large that screen is - and that users will generally not want to make it any smaller than full screen. If my assumption proves unwarranted, I suppose I could reintroduce some flexibility via the componentResized() method of a ComponentAdapter, but I'm going to wait to see if any of the [currently] small group of users asks for this capability.
Update: I discovered another possible approach, which I have now tried [see update 2 below], described in the Advanced JList Programming article at the Sun Developer Network (SDN). In the section entitled "JList Performance: Fixed Size Cells, Fast Renderers", Hans Muller suggests using the setFixedCellWidth() or setPrototypeCellValue() method - the latter is passed a prototype (e.g., String) value that sets the cell width and cell height to the width and height of the prototype value - to restrict the width of cells in a JList. He also notes "be sure to set the prototypeCellValue property after setting the cell renderer". The shortcoming I envision with this approach is that I would have to call setFixedCellWidth() any time that the window is resized ... though, as noted before, this is a scenario I don't currently handle in the GUI anyway.
Update 2: When I loaded the GUI with a JList containing cells wider than the desired width, the original problem recurred, so my fix was not effective. I have since used setFixedCellWidth() for the two JLists in the West and East areas, and all is well in my [GUI] world again.