Conferences and Workshops

WTF Economy: Augmentation, Disintermediation and Small Acts of Production

Tim O'Reilly (O'Reilly Media) opened last week's conference on the Next:Economy, aka the WTF economy, noting that "WTF" can signal wonder, dismay or disgust. I experienced all three reactions at different times during the ensuing two-day "investigation into the potential of emerging technologies to remake our world for the better". I attended the conference because I have long been interested in the nature and meaning of work, and now that I work for Indeed, I am particularly interested in how we can remake our world by reimagining how the world works for the better.

The conference presentations and discussions, curated by Tim and his co-organizers Steven Levy (Backchannel) and Lauren Smiley (Medium), provided many interesting insights, experiences and provocations. If I had to choose three top themes that emerged for me, they would be


  • Augmentation vs. automation: technology designed to assist human workers by taking over some of their tasks vs. technology designed to replace human workers by taking over all of their tasks
  • The disintermediation of creative work: the vast array of tools and resources available for creative people to pursue their (our?) passions while earning a sustainable income outside of the constraints of a traditional job or company
  • Small is beautiful: the growing number of platforms for designing and creating products and/or offering services at small scales that can enrich the lives of the producers, consumers and society at large

I'm sharing - and shortening - my conference notes here, as I tend to understand and remember what I learn better when I re-process them for potential public consumption. I've also compiled a Twitter list including all the Next:Economy speakers I could find. Kevin Marks compiled and shared a much more extensive collection of notes from day 1 and day 2 of the conference, and a Pinboard storify offers an alternative perspective.

No ordinary disruption

James Manyika (McKinsey Global Institute) shared so many interesting facts, figures and forecasts that I had a hard time keeping up. The following are among the nuggets I was able to capture:

  • The consuming class (those who live above the subsistence level) will rise from 23% of the population to 50% in 2025
  • The last 50 years of year-over-year 3.5% average global GDP growth were fueled by the combined growth of labor (1.7%) and productivity (1.8%); with labor supply expected to peak in 2050, productivity growth will have to increase 80% to maintain overall growth rates
  • Labor markets don't work very well, producing massive shortages of workers with the right skills in the right places; digital platforms that increase the quality of matches and decrease job search times may help improve labor markets as labor supply declines
  • If all countries were to improve gender parity to the level achieved by their "best in region" neighbors, we could add as much as $12 trillion (11%) to annual 2025 GDP
  • 45% of tasks could be automated, but only 5% of jobs can be completely automated; up to 30% of tasks in 60% of jobs could be automated, redefining occupations and skills needed
  • We need to change our mindset from jobs to work, and from wages to income

From self-driving cars to retraining humans

Sebastian Thrun (Co-Founder and CEO, Udacity) offered an interesting perspective on human learning vs. machine learning: when a person makes a mistake (e.g., causes an automobile accident), that person learns, but no one else learns; when a robot (e.g., a self-driving car) makes a mistake, all other robots can learn. He said Udacity's strategy is to develop nano-courses and nano-degrees in response to industry demand for specific skills, to help more people get unstuck more quickly. He also recommends that everyone become an Uber driver for a day.

The future of personal assistants

Alexandre Lebrun (Head, Wit.ai, Facebook) said that Facebook's M personal assistant is designed to interact with customers as far as it can go, and then observe human customer service representatives ("trainers") when they intervene so it can learn how to handle new situations. Adam Cheyer (Co-founder and VP of Engineering, Viv Labs) talked about exponential programming in which an application can write code to extend itself; in the case of the Viv personal assistant, when you make a request, Viv writes a custom program at that moment to respond to that request.

Will robots augment us or rule us?

John Markoff (Journalist, New York Times) identified two divergent trajectories that emerged at Stanford around 1962: John McCarthy's lab focused on autonomous systems (Artificial Intelligence, or AI), while Doug Engelbart's lab focused on augmenting human intelligence (Human Computer Interaction, or HCI). Markoff noted that system designers have a choice: design people into the system or design people out. Jerry Kaplan (Visiting Lecturer, and Fellow at The Stanford Center for Legal Informatics, Stanford University) said the oft-cited [especially at this conference] 2013 Oxford study on the future of employment, which warned that 47% of total US employment is at risk of being automated, was intellectually interesting but critically flawed; his view is that tasks will continue to be automated - as they have been for 200 years - but not jobs, except for jobs whose tasks can all be automated (consistent with James Manyika's earlier forecast).

"Knowledge Work": No longer safe from automation

Kristian Hammond (Chief Scientist, Narrative Science) began with a bold claim about Quill, his company's software that uses templates to automatically construct narratives (text) to explain structured data: "If your data and their analysis have meaning, Quill can transform that into language". Since one might characterize everything in the world that humans attend to as some mix of data and analysis, I'm not sure exactly what that claim means, but I do understand - and like - his more modest claim that Quill specializes in small audiences ("writing for one"). I was also impressed with some of the examples of pilot projects he gave, including narrativizations of MasterCard's targeted recommendations for small business owners, a portfolio fund manager's quarterly report (which typically takes a month to produce) and web site performance reporting. Echoing the theme of augmenting human intelligence that pervaded many presentations, he proposed that "If anyone has the word 'analyst' in their job title, something like Quill is going to be working with them at some point". And ending on yet another provocative note, he observed that "we need data scientists everywhere, but they don't scale" and suggested "'data scientist' is the sexiest job of the 21st century; it's the next job we're going to automate".
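To make the "writing for one" template idea concrete, here is a toy sketch of my own (in Python; not Quill's actual technology) in which a bit of structured data and a simple analysis are rendered into a one-sentence narrative:

```python
# A toy illustration (mine, not Quill's) of template-based narrative generation:
# structured data plus a little analysis rendered as a sentence for one reader.
def describe_sales(store, current, previous):
    change = (current - previous) / previous
    direction = "rose" if change > 0 else "fell"
    return (f"Sales at {store} {direction} {abs(change):.0%} last month, "
            f"from ${previous:,.0f} to ${current:,.0f}.")

print(describe_sales("Store #12", current=48_500, previous=42_000))
# -> Sales at Store #12 rose 15% last month, from $42,000 to $48,500.
```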

The Kickstarter economy

Yancey Strickler (Co-Founder & CEO, Kickstarter) presented some examples of how his crowd funding site has redefined hardware design as an artful medium, facilitating small-scale manufacturing of product lines with 150-5000 units and enabling creative people to be more independent and move easily from product to product. He also articulated three guiding principles for the WTF economy:

  1. DON"T SELL OUT (but still survive); creative people often feel guilty charging money for work they do, but they need to earn enough to support themselves and their families
  2. BE IDEALISTIC: you don't have to buy into the money monoculture and its new rules for competition based on anxiety, paranoia, disruption and war (Kickstarter is now a Public Benefit Corporation)
  3. IT'S HARDER, BUT IT'S EASIER: it's harder to measure success without an exclusive focus on financial profits, but it's easier to make decisions based on more human-centered principles

The small scale factory of the future

Limor Fried (hacker, slacker, code cracker, Adafruit Industries) gave the most engaging remote presentation I've ever witnessed from her 80-person, $40M open-source hardware company's factory in New York. She articulated a particularly pithy description of Adafruit, which has produced nearly 900 tutorials - "we're an education company, with a gift shop at the end" - and offered an endearing story of a 6-year-old girl who regularly watches the Adafruit "Ask an Engineer" show, which features so many women engineers that the girl asked "Daddy, are there any men engineers?"

One, two, three, boom!

Mark Hatch (CEO, TechShop) launched into his presentation by informing us "I love revolutions" such as the one that was ignited when "the middle class got access to the tools of the industrial revolution", and "I'm a former Green Beret, and I love blowing things up". He proceeded to give us a whirlwind tour of his multifaceted response to the question "Has anything serious come out of the maker movement?", inviting the audience to share his enthusiasm for each example of a product - and lifestyle business - created by members of his TechShop hackerspaces by shouting "Boom!". The examples included an electric motorcycle, a jet pack, a desktop diamond manufacturing device, a laser-cutting cupcake topper, an underwater robot, as well as more well-known companies that started as a TechShop project, such as Square, Solum and DripTech (the latter 2 being among the 5 top agricultural startups of 2014).

Real R&D is hard

Saul Griffith (CEO, Otherlab) shared highlights from two seminal 1945 essays by Vannevar Bush: As We May Think, which presaged the Internet, speech recognition and online knowledge sources (among other things), and the lesser-known report, Science, the Endless Frontier, which effectively transformed and transitioned research from small labs to large universities. With Google's R&D budget of $9B now rivaling the combined funding of the National Science Foundation ($6B) and the Defense Advanced Research Projects Agency (DARPA, $3B), we are in the midst of a transition of research from universities toward industry. Smaller-scale industry research labs, such as Otherlab, have produced innovations that include drones used for wind power generation, inexpensive and durable trackers for solar panels, cheap actuators that can be used on a [slow] walking inflatable elephant robot, pneubotic (air-filled / air-powered) robots that can lift their own weight and minimize risks that their heavier counterparts pose to human collaborators, and soft exoskeletons that can enable anyone to run faster with less energy.

Why services aren't enough

Jeff Immelt (Chairman and CEO, General Electric) noted that industrial productivity growth has declined from 4% (2006-2011) to 1% (2011-2015), but that reframing industrial products as data platforms (e.g., a locomotive today is a rolling data center with 600+ sensors) may open up opportunities for new products and productivity gains.

Workplace monitoring, algorithmic scheduling, and the quest for a fair workweek

Esther Kaplan (Editor, The Investigative Fund) chaired a panel discussion about what it is like to live (and work) in the reality of "the hyper-lean, electronically scheduled labor force", focusing primarily on retail jobs, which make up 10% of all US jobs.

Darrion Sjoquist (Starbucks barista, Working Washington) shared stories of growing up with a mother who worked for Starbucks: the entire family was adversely affected by the unpredictability arising from 5-day advance schedule notices and clopenings (a closing shift followed by an opening shift the next day), making it difficult to commit to clubs or other extra-curricular school activities. He also shared stories from his own experience as a Starbucks barista, in which co-workers and customers are at risk from sick employees doing their best to operate in a work regimen that is so lean that they cannot afford to take sick days.

Carrie Gleason (Director, Fair Workweek Initiative, Center for Popular Democracy) noted that involuntary part-time employment more than doubled between 2008 and 2010, cited research on the instability created by schedule unpredictability among early career workers in the US labor market and said that workers are trying to minimize the instability by setting up shift-swapping groups on Facebook. She also announced a new Fair Workweek Initiative designed to "provide working families with stable employment, a livable income, and family-sustaining scheduling".

Charles DeWitt (Vice President, Business Development, Kronos) said that Walmart ushered in an era in which labor became "a big bucket of cost and compliance issues", and people became an asset to be optimized. Schedule stability was not factored into workforce management software, but it could be. Citing the annual Gallup employee engagement survey, and Gallup's estimate that disengaged employees cost the US $450-550B in lost productivity each year, he proposed that linking scheduling practices to employee engagement metrics may be a good way to promote greater stability.

Does on demand require independent contractors?

Leah Busque (Founder, TaskRabbit) framed TaskRabbit as service networking (vs. social networking); the platform now includes 30K taskers (people who perform tasks for pay) in 21 cities, who make an average of $35/hour and $900/month across all locations, with 10% of taskers working full-time (via TaskRabbit). The service was first launched in Boston in 2008, where the large population of students was expected to provide plenty of candidate taskers; instead, early taskers tended to be stay-at-home moms, retired people and young professionals. Leah noted that "any business with a platform with 2 sides of a market has to make both sides successful" ... however, she did not recommend that everyone sign up to become a tasker for a day.

What's it like to drive for Uber or Lyft?

Eric Barajas (Driver, Uber), Jon Kessler (Driver, Lyft, former cab driver) and Kelly Dessaint (Driver, National Veterans Cab, former Uber and Lyft driver) provided a lively exchange of insights and experiences regarding the costs and benefits of driving for different companies. I don't know whether they would want their income shared publicly (outside the conference), but it appeared that Kelly, the cab driver, is faring better than the others in terms of higher revenue (due, in part, to larger and more frequent tips) and lower expenses (a daily gate fee of $111-121). Uber and Lyft drivers are responsible for gas, vehicle maintenance and insurance (Uber and Lyft cover insurance only while drivers have paying passengers in the vehicle), and, in some cases, a lease; they face the perpetual prospect of being "deactivated" due to a customer complaint, and so have relatively little autonomy once they pick up a passenger. Kelly noted that taxi drivers are a community - "we look at each other, we nod" - and warned "the worst drivers in SF are tourists" and "most Uber/Lyft drivers are tourists" (because they live outside the city); he offered an interesting counter-perspective to what I've heard from the relatively few cab drivers I've traveled with recently.

The changing nature of work

Esko Kilpi (Managing Director, Esko Kilpi Company) observed that "People are not clever, people have never been clever, and people will never be clever" and so we create and use technology to compensate. He argued that the work systems we have are broken, because they are based on artificial scarcity and "wrong ideas about who human beings are", failing to take into account different situations with different demands. Work has always been about solving problems; it used to be that your boss told you the problem you were supposed to solve, but increasingly, defining the problem is itself part of the work, and so work and learning are inextricably linked. I highly recommend his recent Medium post on The New Kernel of On-Demand Work.

What's the investment opportunity?

Simon Rothman (Partner, Greylock Partners), Gary Swart (Venture Partner, Polaris Partners) and James Cham (Bloomberg Beta [whose homepage is on GitHub (!)]) shared a number of interesting insights; unfortunately, I missed the introductions and so cannot offer precise attribution, but here are some of the highlights shared during this session:

  • Just because the customer model is appealing doesn't mean the business model will work
  • A lot of platforms that call themselves marketplaces are actually managed services
  • Well intermediated marketplaces feel like a service
  • Regulation is a historical artifact, attempting to project the past into the present / future
  • Silicon Valley is the QA department for the rest of the world

Supporting workers in the on-demand economy

Nick Grossman (General Manager, Union Square Ventures) presented his views on the deconstruction of the firm, or "the jobs of a job": brand, income, customers, taxes and administration, benefits and insurance, facilities and equipment, scheduling, community, training. More and more of these elements can be found outside of traditional companies, enabling individual workers to become, in effect, networked micro-firms.

Creating better teams

Stewart Butterfield (Co-founder and CEO, Slack) shared his view on the evolution of objects around which computer applications are based: applications, documents and (now) relationships. Slack has 1.7M users, including 1M paid subscribers, and for 3+ years was produced unselfconsciously as a tool to support lateral transparency while the team worked on what they thought would be their primary product (a game), until the tool itself emerged as a useful product in its own right.

Tax and accounting tools for the franchise of one

Brad Smith (President and CEO, Intuit) talked about the difficulties of quantifying "self-employment" because the term can be defined in so many ways. 78% of US Intuit users who are self-employed have 3 or more sources of income. He views the recent extension of myRA (my Retirement Account) eligibility from small- and medium-sized businesses (SMBs) to self-employed people more positively than most of the other commentary I've read.

Flexibility needed: Not just for on-demand workers

Anne-Marie Slaughter (President and CEO, New America) noted that "having a job that allows you to support your family is an essential ingredient in a personal life narrative" and that supporting your family means more than earning income: workers need the flexibility to care for family members. In discussing generous parental leave policies announced recently by some firms, she argued that you have to change the culture, not just the policy, and so extended parental leave won't matter much unless the men in senior leadership positions take advantage of the new policies.

Conference dinner

There were two separate talks presented by Code for America fellows: one has avoided signing a rental lease or owning a home through serial AirBnB stays in each place he's worked for a number of years now; another went on food stamps in his effort to better understand the challenges faced by food stamp recipients in California. Unfortunately, I did not take notes during either talk, and now cannot even remember the names of the presenters.

Humans need not apply? Not so fast!

Nick Hanauer (Second Avenue Partners) offered a number of observations and insights:

  • Some amount of economic inequality is good (healthy incentives), but too much is bad
  • We think of prosperity as money, but this is wrong
  • What matters is the accumulation of solutions to human problems
  • Not GDP, but the rate at which we solve these problems
  • "How we improve our lives" is the point of the economy
  • Revolutions, like bankruptcies, come gradually, and then suddenly (see also: The Pitchforks are coming .. for us Plutocrats)
  • "Trickle-down economics is an intimidation tactic masquerading as an economic theory" [my favorite sound bite from the conference]

Managing talent in the networked age

Reid Hoffman (Co-Founder & Executive Chairman, LinkedIn; Partner, Greylock Partners) and Zoë Baird (CEO and President, Markle Foundation) announced the Markle Rework America initiative: "A 21st Century Digital Labor Market for Middle-Skill Job Seekers", designed to boost the signaling and improve the matching among employers, workers and educators. The initiative partners include LinkedIn, Arizona State University and edX, as well as regional partners in Colorado and Phoenix (the regions where the program is initially being rolled out). Reid wrote an interesting critique of higher education, Disrupting the Diploma, a few years ago, highlighting the need to "make certification faster, cheaper, and more effective". Zoë suggested that we need to create boot camps for other kinds of training to support other parts of the market beyond the technology sector, and explore new methods of credentialing beyond the traditional degrees granted by colleges and universities (which collectively make up the third largest lobbying group in Washington DC).

Exponential teaching

Kimberly Bryant (Founder, Black Girls CODE) championed exponential teaching: "If you teach one girl, she will naturally turn around and teach five, six, or 10 more". Teaching black girls to code is a promising effort to counteract the effects of 45% of black women aged 25 and over having no high school diploma. The grass-roots organization has created 10 chapters in 4 years, and she invoked the optimism of Grace Lee Boggs with respect to the program's prospects: "I believe we are at the point now, in the US, where a movement is starting to emerge".

Matching workers with opportunities at high velocity

Stephane Kasriel (CEO, Upwork) talked about how Upwork ("Tinder for work") matches freelance knowledge workers (primarily software developers) to remote work opportunities, shortening the typical job search from 3-4 weeks to 1-2 days. With the steadily declining half-life of any skill (especially in the technology sector), Stephane declared "the resume is dead", and "all we want to do is reduce all the time people waste writing long form job descriptions and long form job proposals". To match workers with work, Upwork utilizes three different machine learning models to determine

  • Who is qualified?
  • Who is interested?
  • Who is available?

Upwork has 150K work descriptions and 250K worker profiles, with 5K workers signing up every day. Stephane said only about 2% of new workers get jobs right away, and one of the challenges for Upwork ("the celebrity agent of the freelancers") is to effectively manage talent at different stages of their pipeline (newcomers, rising stars and established workers). Stephane suggested that industry is not doing a good job of communicating data about needed skills back to academia. Having looped in and out of academia a few times myself, I would suggest that academia is not configured in a way that facilitates rapid response to changing educational needs, and so other learning channels will likely be needed to support the continuous cycle of nano-jobs, nano-skills and nano-degrees in the future.

Work rules: Lessons from Google's success

Laszlo Bock (Senior Vice President of People Operations, Google) said an internal study revealed there is little predictive value in determining the probability of success based on which school a candidate attended; the most predictive value was found in a sample work test, in which a candidate performs the type of work associated with the job function, and the second most predictive value was in cognitive ability, which can be assessed via structured interview questions (give me an example of a time when you did X). Google does not look only for superstars, since many superstars don't perform as well when they switch companies, but also looks for team players who improve the performance of everyone around them (like basketball player Shane Battier). Among their efforts to promote diversity, they have implemented an unconscious bias training program, and a Google in Residence program in which Google engineers are embedded in a handful of historically black colleges and universities where they advise on curriculum matters and mentor students. Google also conducts an anonymous survey of 8 management attributes to provide specific feedback (with no penalties) to managers; they have found that average favorability ratings increase every cycle, and that a 2-hour management program can help managers improve by 6-7 points.

Intelligent agents, augmented reality, and the future of productivity

Satya Nadella (CEO, Microsoft) posited the agent as the third runtime model (after the PC operating system and the browser), and talked about some of Microsoft's work in speech recognition, augmented reality and machine learning. He described augmentation as a "race with the machine", rather than a race against the machine (which may be a more apt description of automation).

Augmented reality in the factory

Brian Mullins (Founder and CEO, DAQRI) asked "What if you could put on a helmet and do any job?" and proceeded to present a number of applications of their augmented reality Smart Helmet: digitizing analog devices (e.g., gauges), providing thermal vision and improving "cognitive literacy". He went into some detail - including a video - on one deployment in partnership with Hyperloop using the helmet in a steel mill. I can't find that video, but found another video from EPRI that provides a better sense of how the Smart Helmet works.

How augmented workers grow the market

David Plouffe (Chief Advisor and Member of the Board, Uber) posed what he calls the central government question: How do we get more income to more people? He said asking how Uber is affecting the taxi market is the wrong question; instead, we should be looking at how Uber is affecting a multi-modal ecosystem, noting that fewer millennials are choosing to own cars and expanding transportation options can be crucial in helping people escape poverty. Uber has 400K drivers, but 100K drive only a few times a month and 50% of drivers drive < 10 hours / week. During the Q&A session, Eric Barajas, the Uber driver who had appeared on the previous day's panel, asked about what Uber can do to better support drivers who are driving full-time, and was invited to come talk to the Uber SF office about it (I hope he doesn't suffer any adverse consequences from speaking up).

Reinventing healthcare

Lynda Chin (Director, Institute for Health Transformation) talked about the Oncology Expert Adviser, a healthcare application based on IBM Watson. As is so often the case when I read about Watson in healthcare, I love the idea of a system that can take in all kinds of input - text analysis of articles from the medical literature as well as individual patient charts, diagnoses of past cases by human experts - and assist doctors in diagnosis and treatment. Unfortunately, as is also so often the case when I hear or read about Watson in healthcare, I don't get a clear idea of what it can actually do yet.

Policy action recommendations for the 21st century economy

In this panel discussion, Felicia Wong (President and CEO, Roosevelt Institute) championed campaign finance reform, reversing the financialization of the economy and making massive investment in human capital; Neera Tanden (President, Center for American Progress) noted that the US is relatively unique among leading industrial countries in the downward pressure on wages and pointed toward Australia and Canada as better models for achieving median income growth through higher rates of unionization (30% vs. 6%), more equitable education (higher education is [nearly] free) and relatively smaller and more regulated financial sectors; Zoë Baird (CEO and President, Markle Foundation) recommended the adoption of policies that will help SMBs better reach global markets.

Rewiring the US labor market

Byron Auguste (Managing Director, Opportunity@Work) claims that "talent is far more evenly distributed than opportunity or money" and that the labor market is "a wildly inefficient 'efficiency' market", pointing to credential creep (e.g., only 20% of current administrative assistants have bachelor's degrees and yet 65% of administrative assistant job descriptions require them), Okun's Law (the correlation between increased unemployment and reduced GDP), a quit rate that is down 28% since 2010 and a record number (5.8M) of unfilled jobs. The White House TechHire initiative includes a national employer network and an open platform to provide better training for and access to information technology jobs in 31 cities for some of the segments of the US population that are not being well served by the current labor market:

  • 30-40M college goers who did not graduate
  • 15-20M caregivers limited in ability to work for pay
  • 10-15M experienced skilled older workers needing to re-tool
  • 1.5M veterans who are unemployed or entering the workforce soon
  • 6M disconnected youth

Worker voice in the 21st century

Jess Kutch (Digital Strategist and Co-founder, Coworker.org) and Michelle Miller (Co-founder, Coworker.org) addressed the conference via video, providing examples of workers speaking up and instigating changes in companies, using their coworker.org petition campaign platform. One notable example is the successful effort by Starbucks baristas to change the company policy barring visible tattoos. Perhaps more relevant, given other sessions at the conference, are petition campaigns to urge Starbucks to give employees a fair workweek and to urge Uber to give consumers the option of adding a tip to all Uber fares.

Reinventing the labor union

Liz Shuler (Secretary-Treasurer / CFO, AFL-CIO) talked about some of the challenges facing unions ("the original disrupters" of the workplace) and unionization efforts. Andy Stern (Senior Fellow, Columbia University; former President, SEIU) talked about the strategic inflection point we are approaching with respect to automation's potential impact on work and the long-standing connection between work and income. He suggested that now is an ideal time to consider universal basic income, which guarantees everyone a basic level of income by having government give money directly to those in poverty, rather than via special programs such as food stamps and earned income tax credits that are burdensome for everyone involved. "If you want to end poverty, give people money".

Portable benefits and the "shared security account"

Laura Tyson (Professor & Director of Institute for Business and Social Impact, Haas School of Business at UC Berkeley) spoke with Nick Hanauer (Second Avenue Partners) and David Rolf (President, SEIU 775) about some of the ideas they raised in an inspiring article on Shared Security, Shared Growth, in which they argue for decoupling benefits from specific jobs and attaching them to the workers, a separation that will become increasingly important as workers increasingly work multiple jobs (or tasks). Among the many interesting topics discussed were the incentives for CEOs to buy back stock (vs. incentives for programs that might benefit workers), which led me to an HBR article on profits without prosperity.

Reinventing public transportation

Logan Green (CEO, Lyft) talked about some of the ways that Lyft is trying to support its drivers. Although 78% of Lyft drivers drive < 15 hours / week, they offer a power driver bonus, giving increasing amounts of commission back as drivers drive more. The Lyft app offers passengers the option to include a tip, though Tim O'Reilly (who was interviewing Logan) said that a Lyft driver told him that only 20% of passengers leave a tip. Toward the end of the conversation, Logan noted that the least profitable runs in public transportation are the most expensive, and that some kind of public/private partnership might enable Lyft to complement public transportation.

A people-centered economy

Chad Dickerson (CEO, Etsy) talked about how Etsy embraces the idea that small is beautiful. In addition to enabling individual artisans to sell their creations, Etsy is reimagining manufacturing, allowing Etsy users to register as manufacturers if they want to work with other Etsy users to help create their products. There are also self-organized teams of Etsy sellers around the world, and he gave an example in which Italian sellers were encouraging prospective buyers to buy things produced by Greek sellers during the Greek financial crisis. He also said that Etsy has become a Public Benefit Corporation.

The good jobs strategy

Zeynep Ton (Associate Professor, MIT Sloan School of Management) presented her research into the financial success enjoyed by companies that embrace human-centered systems and provide jobs with meaning and dignity, offering the following principles: operate with some slack (not too lean), offer less (fewer products), cross train, standardize and empower. Dan Teran (co-founder, Managed by Q) said he recognized early on that "if we wanted to have the best employees, we had to be the best employer"; engineers at the company go out and clean an office the first week, so they can better understand the tasks and environment in which the cleaners who work for the company operate. Palak Shah (Social Innovations Director, National Domestic Workers Alliance) noted that "in a way you can consider domestic workers [nannies, caregivers, cleaners] as the original gig workers" and presented the Good Work Code, "an overarching framework of 8 simple values that are the foundation of good work".

Enable people, and they will amaze you

Evan Williams (CEO, The Obvious Corporation, and founder of Medium and Blogger), whose first tech job was with O'Reilly Media, said the thing that got him excited about the Internet was the idea of knowledge exchange. One of the motivations behind Medium was that comments on blogs are not on the same level as the posts; on Medium, replies are posts, and so all commentary is on the same level. Noting that "what you measure gets rewarded", Medium measures time on a page, not [just] page views or unique visitors. In the Q&A session, someone asked "How do you think about the future?" Ev replied "I just listen to Tim" ... and then Tim replied "I look for people who are passionate about what they do".

The conference sessions offered a compelling collection of people who are passionate about what they do, and much of what many of the speakers do is helping other people find ways to exercise their passions in what they do ... and a recursive promulgation of passion seems to be as good a note as any on which to end this post.


Notes from #PyData Seattle 2015

I was among 900 attendees at the recent PyData Seattle 2015 conference, an event focused on the use of Python in data management, analysis and machine learning. Nearly all of the tutorials & talks I attended last weekend were very interesting and informative, and several were positively inspiring. I haven't been as excited to experiment with new tools since I discovered IPython Notebooks at Strata 2014.

I often find it helpful to organize my notes after attending a conference or workshop, and am sharing these notes publicly in case they are helpful to others. The following is a chronological listing of some of the highlights of my experience at the conference. My notes from some sessions are rather sparse, so it is a less comprehensive compilation than I would have liked to assemble. I'll also include some links to some sessions I did not attend at the end.

Python for Data Science

Joe McCarthy, Indeed, @gumption

This was my first time at a PyData conference, and I spoke with several others who were attending their first PyData. Apparently, this was the largest turnout for a PyData conference yet. I gave a 2-hour tutorial on Python for Data Science, designed as a rapid on-ramp primer for programmers new to Python or Data Science. Responses to a post-tutorial survey confirm my impression that I was unrealistically optimistic about being able to fit so much material into a 2-hour time slot, but I hope the tutorial still helped participants get more out of other talks & tutorials throughout the rest of the conference, many of which presumed at least an intermediate level of experience with Python and/or Data Science. As is often the case, I missed the session prior to the one in which I was speaking - the opening keynote - as I scrambled with last-minute preparations (ably aided by friends, former colleagues & assistant tutors Alex Thomas and Bryan Tinsley).

Scalable Pipelines w/ Luigi or: I’ll have the Data Engineering, hold the Java!

Jonathan Dinu, Galvanize, @clearspandex

Running and re-running data science experiments - in which many steps are repeated, some are varied (e.g., with different parameter settings), and several take a long time - is part of a typical data science workflow. Every company in which I've worked as a data scientist has rolled its own workflow pipeline framework to support this process, and each homegrown solution has offered some benefits while suffering from some shortcomings. Jonathan Dinu demonstrated Luigi, an open source library initially created by Spotify for managing batch pipelines that might encompass a large number of local and/or distributed computing cluster processing steps. Luigi offers a framework in which each stage of the pipeline has input, processing and output specifications; the stages can be linked together in a dependency graph which can be used to visualize progress. He illustrated how Luigi could be used for a sample machine learning pipeline (Data Engineering 101), in which a corpus of text documents is converted to TF-IDF vectors, and then models are trained and evaluated with different hyperparameters, and then deployed.
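To give a flavor of the framework, here is a minimal sketch of my own (not the tutorial's code; the file names, sklearn dependency and max_features parameter are illustrative): each Luigi task declares what it requires, what it outputs and how it runs, and Luigi assembles the dependency graph from those declarations.

```python
# A minimal Luigi pipeline sketch: raw documents -> TF-IDF vectors.
import json
import pickle
import luigi

class ExtractText(luigi.Task):
    """Write one JSON document per line to a local file."""
    def output(self):
        return luigi.LocalTarget("data/raw_docs.jsonl")
    def run(self):
        docs = [{"id": 1, "text": "hello luigi"}, {"id": 2, "text": "pipelines as tasks"}]
        with self.output().open("w") as f:
            for doc in docs:
                f.write(json.dumps(doc) + "\n")

class BuildTfidf(luigi.Task):
    """Convert the raw documents into TF-IDF vectors; depends on ExtractText."""
    max_features = luigi.IntParameter(default=5000)  # hyperparameter exposed as a task parameter
    def requires(self):
        return ExtractText()
    def output(self):
        return luigi.LocalTarget(f"data/tfidf_{self.max_features}.pkl", format=luigi.format.Nop)
    def run(self):
        from sklearn.feature_extraction.text import TfidfVectorizer
        with self.input().open("r") as f:
            texts = [json.loads(line)["text"] for line in f]
        vectors = TfidfVectorizer(max_features=self.max_features).fit_transform(texts)
        with self.output().open("wb") as f:
            pickle.dump(vectors, f)

if __name__ == "__main__":
    # Running BuildTfidf pulls in ExtractText automatically via the dependency graph.
    luigi.build([BuildTfidf(max_features=1000)], local_scheduler=True)
```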

Keynote: Clouded Intelligence

Joseph Sirosh, Microsoft, @josephsirosh

Joseph Sirosh sprinkled several pithy quotes throughout his presentation, starting off with a prediction that while software is eating the world, the cloud is eating software (& data). He also introduced what may have been the most viral meme at the conference - the connected cow - as a way of emphasizing that every company is a data company ... even a dairy farm. In an illustration of where AI [artificial intelligence] meets AI [artificial insemination], he described a project in which data from pedometers worn by cows boosted estrus detection accuracy from 55% to 95%, which in turn led to more successful artificial insemination and increased pregnancy rates from 40% to 67%. Turning his attention from cattle to medicine, he observed that every hospital is a data company, and suggested that Florence Nightingale's statistical evaluation of the relationship between sanitation and infection made her the world's first woman data scientist. Sirosh noted that while data scientists often complain that data wrangling is the most time-consuming and challenging part of the data modeling process, that is because deploying and monitoring models in production environments - which he argued is even more time-consuming and challenging - is typically handed off to other groups. And, as might be expected, he demonstrated how some of these challenging problems can be addressed by Azure ML, Microsoft's cloud-based predictive analytics system.

The past, present, and future of Jupyter and IPython

Jonathan Frederic, Project Jupyter, @GooseJon

Jupyter (IPython) Notebooks are one of my primary data science tools. The ability to interleave code, data, text and a variety of [other] media makes the notebooks a great way to both conduct and describe experiments. Jonathan described the upcoming Big Split(tm), in which IPython will be separated from Notebooks, to better emphasize the increasingly language-agnostic capabilities of the notebooks, which [will soon] have support for 48 language kernels, including Julia, R, Haskell, Ruby, Spark and C++. Version 4.0 will offer capabilities to

  • ease the transition from large notebook to small notebooks
  • import notebooks as packages
  • test notebooks
  • verify that a notebook is reproducible

As a former educator, a new capability I find particularly exciting is nbgrader, which uses the JupyterHub collaborative platform, providing support for releasing, fetching, submitting and collecting assignments. Among the most personally interesting tidbits I learned during this session was that IPython started out as Fernando Perez's "procrastination project" while he was in PhD thesis avoidance mode in 2001 ... an outstanding illustration of the benefits of structured procrastination.

Deep Learning with Python: getting started and getting from ideas to insights in minutes

Alex Korbonits, Nuiku, @korbonits

Deep Learning seems to be well on its way toward the peak of inflated expectations lately (e.g., Deep Learning System Tops Humans in IQ Tests). Alex Korbonits presented a number of tools for and examples of Deep Learning, the most impressive of which was AlexNet, a deep convolutional neural network developed by another Alex (Alex Krizhevsky, et al) that outperformed all of its competitors in the ILSVRC 2012 ImageNet competition (1.3M high-res images across 1000 classes) by such a substantial margin that it changed the course of research in computer vision, a field that had hitherto been dominated by hand-crafted features refined over a long period of time. Alex Korbonits went on to demonstrate a number of Deep Learning tools & packages, e.g., Caffe and word2vec, and applications involving scene parsing and unsupervised learning of high-level features. It should be noted that others have taken a more skeptical view of Deep Learning, and illustrated some areas in which there's still a lot of work to be done.
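As a small taste of word2vec (my own example, not from the talk), the gensim package makes it easy to train embeddings on a toy corpus; note that the vector_size parameter was called size in older gensim releases:

```python
# A minimal word2vec sketch using gensim (parameter names follow gensim 4.x).
from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens; a real corpus would be far larger.
sentences = [
    ["deep", "learning", "uses", "neural", "networks"],
    ["word2vec", "learns", "vector", "representations", "of", "words"],
    ["similar", "words", "end", "up", "with", "similar", "vectors"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=2)

vec = model.wv["learning"]                 # the learned 50-dimensional embedding
print(model.wv.most_similar("learning"))   # nearest neighbors in the embedding space
```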

Jupyter for Education: Beyond Gutenberg and Erasmus

Paco Nathan, O’Reilly Media, @pacoid

One of the most challenging aspects of attending a talk by Paco Nathan is figuring out how to divide my time among listening, watching, searching for or typing in referenced links ... and taking notes. He is a master speaker, with compelling visual augmentations and links to all kinds of interesting related material. Unfortunately, while my browser fills up with tabs during his talks, my notes typically end up rather sparse. In this talk, Paco talked about the ways that O'Reilly Media is embracing Jupyter Notebooks as a primary interface for authors using their multi-channel publishing platform. An impressive collection of these notebooks can be viewed on the O'Reilly Learning page. Paco observed that the human learning curve is often the most challenging aspect of leading data science teams, as data, problems and techniques change over time. The evolution of user expertise, e.g., among connoisseurs of beer, is another interesting area involving human learning curves that was referenced during this session.

Counterfactual evaluation of machine learning models

Michael Manapat, Stripe, @mlmanapat

Fraud detection presents some special challenges in evaluating the performance of machine learning models. If a model is trained on past transactions that are labeled based on whether or not they turned out to be fraudulent, once the model is deployed, the new transactions classified by the model as fraud are blocked. Thus, the transactions that are allowed to go through after the model is deployed may be substantially different - especially with respect to the proportion of fraudulent transactions - than those that were allowed before the model was deployed. This makes evaluation of the model performance difficult, since the training data may be very different from the data used to evaluate the model. It also complicates the training of new models, since the new training data (post model deployment) will be biased. Michael Manapat presented some techniques to address these challenges, involving allowing a small proportion of potentially fraudulent transactions through and using a propensity function to control the "exploration/exploitation tradeoff".
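A minimal sketch of the idea (my own illustration, not Stripe's implementation, with an assumed 5% exploration rate): let a small random fraction of would-be-blocked transactions through, record the propensity with which each allowed transaction was let through, and weight by the inverse of that propensity when estimating outcomes or assembling new training data.

```python
import random

def decide(fraud_score, exploration_rate=0.05):
    """Given a model's fraud score for a transaction, return (allowed, propensity):
    whether to let the transaction through, and the probability that a transaction
    receiving this treatment would have been let through."""
    if fraud_score < 0.5:                  # model would allow it anyway
        return True, 1.0
    # Model wants to block: allow it anyway with small probability, for exploration.
    allowed = random.random() < exploration_rate
    return allowed, exploration_rate

def estimate_fraud_rate(observed):
    """observed: list of (was_fraud, propensity) pairs for allowed transactions.
    Inverse-propensity weighting corrects for risky transactions being under-sampled."""
    total_weight = sum(1.0 / p for _, p in observed)
    fraud_weight = sum(1.0 / p for was_fraud, p in observed if was_fraud)
    return fraud_weight / total_weight
```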

Keynote: A Systems View of Machine Learning

Josh Bloom, UC Berkeley & wise.io, @profjsb

In the last keynote of the conference, Josh Bloom shared a number of insights about considerations often overlooked by data scientists regarding how data models fit into the systems into which they are deployed. For example, while data scientists are often concerned with optimizing a variety of parameters in building a model, other important areas for optimization are overlooked, e.g., the hardware and software demands of a deployed model (e.g., the decision by Netflix not to deploy the model with the highest score in the Netflix Prize), the human resources required to implement and maintain the model, the ways that consumers will [try to] interpret or use the model, and the direct and indirect impacts of the model on society. Noteworthy references include a paper by Sculley, et al, on Machine Learning: The High Interest Credit Card of Technical Debt and Leon Bottou's ICML 2015 keynote on Two Big Challenges of Machine Learning.

NLP and text analytics at scale with PySpark and notebooks

Paco Nathan, O'Reilly Media, @pacoid

Once again, I had a hard time keeping up with the multi-sensory inputs during a talk by Paco Nathan. Although I can't find his slides from PyData, I was able to find a closely related slide deck (embedded below). The gist of the talk is that many real-world problems can often be represented as graphs, and that there are a number of tools - including Spark and GraphLab - that can be utilized for efficient processing of large graphs. One example of a problem amenable to graph processing is the analysis of a social network, e.g., contributors to open source forums, which reminded me of some earlier work by Welser, et al (2007), on Visualizing the signatures of social roles in online discussion groups. The session included a number of interesting code examples, some of which I expect are located in Paco's spark-exercises GitHub repository. Other interesting references included TextBlob, a Python library for text processing, and TextRank, a graph-based ranking model for text processing, a paper by Mihalcea & Tarau from EMNLP 2004.
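TextBlob in particular offers very low-friction text processing; a small example of my own (not from the talk):

```python
# A quick TextBlob example of the kind of one-liner NLP the library offers.
from textblob import TextBlob

blob = TextBlob("Spark and GraphLab make it practical to analyze very large graphs. "
                "Graph representations fit many real-world problems surprisingly well.")

print(blob.sentences)     # sentence segmentation
print(blob.words)         # tokenization
print(blob.noun_phrases)  # noun-phrase extraction
print(blob.sentiment)     # polarity and subjectivity scores
```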

Pandas Under The Hood: Peeking behind the scenes of a high performance data analysis library

Jeffrey Tratner, Counsyl, @jtratner

Pandas - the Python open source data analysis library - may take 10 minutes to learn, but I have found that it takes a long time to master. Jeff Tratner, a key contributor to Pandas - an open source community he described as "really open to small contributors" - shared a number of insights into how Pandas works, how it addresses some of the problems that make Python slow, and how the use of certain features can lead to improved performance. For example, specifying the data type of columns in a CSV file via the dtype parameter in read_csv can help pandas save space and time while loading the data from the file. Also, the DataFrame.append operation is very expensive, and should be avoided wherever possible (e.g., by using merge, join or concat). One of my favorite lines: "The key to doing many small operations in Python: don't do them in Python"
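Two of those tips in a short sketch of my own (the file name and columns are hypothetical):

```python
import pandas as pd

# 1. Declaring column dtypes up front avoids type inference and can save memory and time.
df = pd.read_csv(
    "transactions.csv",
    dtype={"user_id": "int64", "amount": "float64", "category": "category"},
)

# 2. DataFrame.append copies all existing data on every call (quadratic behavior);
#    collect the pieces and concatenate once instead.
chunks = [pd.DataFrame({"x": range(i, i + 3)}) for i in range(0, 9, 3)]
combined = pd.concat(chunks, ignore_index=True)
```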

Mistakes I've Made

Cameron Davidson-Pilon, Shopify, @cmrn_dp

While I believe that there are no mistakes, only lessons, I do value the relatively rare opportunities to learn from others' lessons, and Cameron Davidson-Pilon (author of Probabilistic Programming & Bayesian Methods for Hackers) shared some valuable lessons he has learned in his data science work over the years. Among the lessons he shared:

  • Sample sizes are important
  • It is usually prudent to underestimate predictions of performance of deployed models
  • Computing statistics on top of statistics compounds uncertainty
  • Visualizing uncertainty is the role of a statistician
  • Don't [naively] use PCA [before regression]

Among the interesting, and rather cautionary, references:

There were a few sessions about which I read or heard great things, but which I did not attend. I'll include information I could find about them, in the chronological order in which they were listed in the schedule, to wrap things up.

Testing for Data Scientists

Trey Causey, Dato, @treycausey

Learning Data Science Using Functional Python

Joel Grus, Google, @joelgrus

Code + Google docs presentation (can't figure out how to embed)

Big Data Analytics - The Best of the Worst : AntiPatterns & Antidotes

Krishna Sankar, blackarrow.tv, @ksankar

Python Data Bikeshed

Rob Story, Simple, @oceankidbilly

GitHub repo

Low Friction NLP with Gensim

Trent Hauck, @trent_hauck

Slides [PDF]

[Update, 2015-08-05: the PyDataTV YouTube channel now has videos from the conference]


IPython, Deep Learning & Doing Good: Some Highlights from Strata 2014

I attended my first Strata conference last week. The program offered a nice blend of strategic and technical insights and experiences regarding the design and use of "big data" systems. Atigeo was a sponsor, and I spent much of my time in our booth demonstrating and discussing our xPatterns big data analytics platform (about which I may write more later). Outside the exhibit area, highlights included a demonstration of the IPython Notebook, a tutorial on neural networks and deep learning, and a panel on Data for Good. I often find it helpful to compile - and condense - my notes after returning from a conference, and am sharing them here in case they are of use to others.

Hardcore Data Science Tutorial

On Tuesday, I attended this all-day tutorial with 10 different presentations.

Extreme Machine Learning
Alexander Gray, CTO, Skytree, Inc. 

After a brief review of some key historical and conceptual underpinnings of machine learning, Alex Gray delineated 3 sources of error that data scientists must contend with: finite data, wrong parameters and the wrong type of model. Techniques for reducing error include weak scaling (use more machines to model more data), strong scaling (use more machines to model the same data, faster) and exploratory data analysis and visualization tools. Demonstrations included sentiment analysis of Twitter data during the US presidential election, identification of outliers in a large data set and a visualization of Wikipedia [unfortunately, I can't find the slides or any information about these demos online]. Quotable quotes include the no free lunch theorem: "Do I have to read all of these machine learning papers to understand this concept?" "Yes."

What the #@)*$ is Big Data? A Holistic View of Data and Algorithms
Alice Zheng, Director of Data Science, GraphLab

Alice Zheng highlighted the gap between algorithms, which prefer certain data structures, and raw data, which is often amenable to certain data structures, and noted that data structures (beyond flat, 2-dimensional tables) are an often overlooked bridge between data and algorithms in data science and engineering efforts. She showed how data for movie recommendation systems and network diagnostic systems can be represented as tables, and then how representing them with graph data structures can make them much more efficient to work with. Her colleague, Carlos Guestrin, gave a more in-depth GraphLab tutorial in another session later that afternoon, which I imagine was somewhat similar to the one captured in a 42-minute video of a GraphLab session at Strata NY 2013.

Overcoming the Barriers to Production-Ready Machine-Learning Workflows [slides]
Henrik Brink, CTO, wise.io; Joshua Bloom, Professor, University of California, Berkeley

Henrik Brink and Joshua Bloom highlighted the gaps between data science and production systems, emphasizing the optimization tradeoffs among accuracy, interpretability and implementability. Effectively measuring accuracy requires choosing an appropriate evaluation metric that captures the essence of what you (or your customer) cares about. Interpretability should focus on what an end user typically wants to know - why a model gives specific answers (e.g., which features are most important) - rather than what a data scientist may find most interesting (e.g., how the model works). Implementability encompasses the time and cost of putting a model into production, including considerations of integration, scalability and speed. The lessons learned from the Netflix Prize are instructive, since implementation concerns led the sponsors not to deploy the winning algorithm, even though it achieved improved accuracy.

Anomaly Detection [slides]
Ted Dunning, Chief Application Architect, MapR

Ted Dunning defined an anomaly as "What just happened that shouldn't?" and posited the goal of anomaly detection as "Find the problem before other people do ... But don't wake me up if it isn't really broken." In detecting heart rate anomalies, he described the creation of a dictionary of shapes representing lower level patterns in a heart rate, and then using adaptive thresholding to look for outliers. In many anomaly detection problems, he has found that many key elements can be effectively modeled as mixture distributions.
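As a rough sketch of the mixture-distribution idea (my own illustration, not MapR's code), one can fit a mixture model to observations considered normal and then flag new points whose likelihood falls below an adaptive threshold derived from the training scores:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic "normal" observations drawn from two regimes (e.g., resting vs. active heart rate).
normal = np.concatenate([rng.normal(60, 5, 500), rng.normal(90, 8, 500)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2).fit(normal)

# Adaptive threshold: e.g., the 1st percentile of log-likelihoods seen on normal data.
threshold = np.percentile(gmm.score_samples(normal), 1)

new_points = np.array([[62.0], [88.0], [140.0]])
is_anomaly = gmm.score_samples(new_points) < threshold   # only the 140 reading should be flagged
```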

Neural Networks for Machine Perception [slides]
Ilya Sutskever, Google Inc

Deep learning is an extension of an old concept - multi-layer neural networks - that has recently become very popular. Ilya Sutskever provided a very accessible overview of the history, concepts and increasing capabilities of these systems, provocatively asserting - and providing some evidence for - "Anything humans can do in 0.1 seconds, a big 10-layer network can do, too." Connecting all nodes in all layers of such a network would be prohibitively expensive; convolutional neural networks restrict the numbers of connections by mapping only subregions between different layers. Several successful (and a few unsuccessful) examples of visual object recognition were illustrated in the Google+ photo search service. References were made to Yann LeCun's related work on learning feature hierarchies for object recognition and word2vec, an open source tool for computing vector representations of words that can be used in applying deep learning to language tasks.

The Predictive Business
Kira Radinsky, CTO, SalesPredict

Kira Radinsky led off with some of the business complexities of sales cycles - due to factors such as time, cost, probability of a sale, amount of a sale - and the typically low rate of conversion (< 1%). She mentioned a number of techniques used by SalesPredict, such as automatic feature generation, classifiers as features and the use of personas to deal with sparseness and severe negative skew in CRM data. Revisiting the importance of interpretability, she described this perception problem as "Emotional AI", and gave an example where even though SalesPredict had achieved a 3-fold increase in conversion rates for a customer, they were not happy until/unless they could understand why the system was prioritizing certain leads. She also warned of the dangers of success in prediction: once customers start relying on the ranking of sales leads, they focus all their efforts on those with "A" scores, neglecting all others, leading to potential missed opportunities (since the ranking is imperfect) and further skewing of the data.

Can We Make Big Data Management Easier?
Magda Balazinska, Associate Professor, University of Washington

Magda Balazinska's group is exploring a number of ways of facilitating the management of big data, and focused on just one - the Collaborative Query Management System (CQMS) - for this session. Unfortunately, I had to step away for part of this session, but my understanding is that CQMS involves collecting successful queries and making relevant queries available to other users who appear to be following similar trajectories in exploratory data analysis. While the goals and design of the system seem reasonable, they have not yet conducted any user studies to validate whether users find the provision of relevant queries helpful in their analysis.

Design Challenges for Real Predictive Platforms [slides]
Max Gasner, SMTS, Predictive Intelligence, Salesforce.com

Max Gasner encouraged us to apply a key lesson from relational databases to big data: decoupling implementation enables abstraction. Furthermore he proposed that successful big data platform design and development should exhibit 4 properties: robust, honest, flexible and simple. He called out BigML (co-founded by my friend and former boss & CEO at Strands, Francisco Martin) as a first generation example of such a system. Echoing issues of interpretability raised in two earlier sessions, he noted that "black boxes are easy to use and hard to trust". Riffing on a phrase popularized by Mao Tse-Tung - "let a hundred flowers bloom; let a hundred schools of thought contend" - he noted there is lots of room to innovate on APIs and presentation, and so lots of opportunities for companies (like ours) building general purpose predictive platforms (GPPPs).

Machine Learning Gremlins
Ben Hamner, Data Scientist, Kaggle

Ben Hamner warned that while machine learning is powerful, there are lots of ways to screw up; but he claimed that all of them are avoidable. Potential problems include overfitting, data insufficiency (or "finite data" as Alex Gray described it), data leakage (features that encode information about the target which won't be available at prediction time) and solving the wrong problem (calling to mind the 12 steps for technology-centered designers). He illustrated many of these problems - and solutions - with an amusing story about the iterative development of a vision-based system for regulating access through a pet door. He also offered an amusing quote by a machine learning engineer that captured the widespread zeal for ensemble learning methods:

"We'd been using logistic regression in high dimensional feature spaces, tried a random forest on it and it improved our performance by 14%. We were going to be rich!!"

Algebra for Scalable Analytics [slides, challenges]
Oscar Boykin, Twitter

I was initially skeptical about the wisdom of scheduling a presentation on algebra in the last slot of the session, but Oscar Boykin offered an energetic and surprisingly engaging overview of semigroups (sets with associative operations), monoids (semigroups with an identity, or "zero", element) and the value of expressing computations as associative operations. He went on to champion the value of hashing rather than sampling to arrive at approximate - but acceptable - solutions to some big data problems, using Bloom filters, HyperLogLog and Count-min sketches as examples. In addition to sharing his slides, he also offered some challenges for those interested in diving further into the topic.
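
A toy Bloom filter makes both points at once: membership queries via hashing rather than sampling, and a merge operation (bitwise OR) that is associative, so partial filters built on different machines can be combined in any order. The sizes and hash counts below are illustrative, not anything Boykin presented:

```python
# A toy Bloom filter: approximate set membership with a fixed-size bit array.
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # May return a false positive, but never a false negative
        return all(self.bits[pos] for pos in self._positions(item))

    def merge(self, other):
        # Associative, commutative combine: the "monoid" property that makes
        # these sketches easy to compute in parallel and then fold together
        self.bits = [a | b for a, b in zip(self.bits, other.bits)]

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"), bf.might_contain("user:99"))
```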

A Sampling of Other Strata Presentations

I spent much of my time on Wednesday and Thursday in the exhibitors area, but did manage to get out to see a few sessions, some of which I will briefly recount below.

Crossing the Chasm: What’s New, What’s Not [slides (PPTX), video]
Geoffrey Moore (Geoffrey Moore Consulting)

Geoffrey Moore, author of Crossing the Chasm, was an ideal choice for a keynote speaker at Strata, given the prevalence of references to chasms and gaps throughout many of the other sessions. Moore presented a variant of the Technology Adoption Life Cycle, noting that pragmatists - on the other side of the chasm from the early adopters - won't move until they feel pain. For consumer IT, he recommends adopting lean startup principles and leaping straight to the "tornado"; for enterprise IT, he recommends focusing on breakthrough projects with top-tier brands, and building up high value use cases with compelling reasons to buy. He also reiterated one of his most quotable big data quotes:

"Without big data analytics, companies are blind and deaf, wandering out onto the web like deer on a freeway"

Building a Lightweight Discovery Interface for Chinese Patents [slides]
Eric Pugh (OpenSource Connections)

Eric Pugh shared some insights and experiences in building the Global Patent Search Network for the US Patent and Trademark Office. He and his team had to navigate tensions between two classes of developers (data people and UX people), as well as two classes of users (patent examiners and the general public). Among the lessons: don't underestimate the amount of effort required for the user interface (40% of the effort for GPSN), put a clickable prototype with a subset of the data in front of users as early as possible, don't move files (use S3funnel), be careful where you sample your data (data volume can increase exponentially over time), and keep the pipeline as simple as possible.

How Twitter Monitors Millions of Time-series [slides]
Yann Ramin (Twitter, Inc.)

Yann Ramin shared a broad array of problems - and solutions - in working with time series data, alerts and traces at Twitter, some of which are captured in an earlier blog post on Observability at Twitter. He made a strong case for the need to move beyond logging toward the cheap & easy real-time collection of structured statistics when working with web services (vs. programs running on host computers), highlighting the value of embedding first tier aggregation and sampling directly in large-scale web applications. Among the open source tools he illustrated were the Finagle web service instrumentation package, a template (twitter-server) used to define Finagle-instrumented services at Twitter, and a distributed tracing system (zipkin) based on a 2010 research paper on Dapper. As with many other Strata presenters, he also had a pithy quote to capture the motivation behind much of the work he was presenting:

"When something is wrong, you need data yesterday"

The IPython Notebook: Get Close to Your Data with Python and JavaScript [IPython notebook]
Brian Granger (Cal Poly San Luis Obispo)

Of all the talks at Strata, this one got me the most excited about getting home (or back to work) to practice what I learned. In an act of recursive storytelling, Brian Granger told a story about how to use the IPython Notebook and NBViewer (NoteBook Viewer) to compose and share a reproducible story with code and data ... by using the IPython Notebook. Running the notebook in a browser, he was able to show and execute segments of Python code and have the results returned and variously rendered in the browser window. While the demonstration focused primarily on Python, the notebook also supports a variety of other languages (including Julia, R and Ruby). A recurring theme throughout the conference was bridging gaps, and in this case the gap was characterized as "a grand canyon between the user and the data", with the IPython Notebook serving as the bridge. He had given a longer tutorial - IPython in Depth - on Tuesday, and I plan to soon use the materials there to bridge the gap from learning to doing.
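
For anyone who hasn't tried it, the appeal is that a single notebook cell mixes code, narrative and rendered output. A cell like the following (with a hypothetical CSV file and a hypothetical numeric "value" column) would display the summary table and histogram inline in the browser:

```python
# The kind of cell one might run in an IPython/Jupyter notebook: code executes
# on the kernel and the results render inline in the browser.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("measurements.csv")   # hypothetical data file
print(df.describe())                   # in a notebook, the last expression renders as a table

df["value"].hist(bins=30)              # assumes a numeric 'value' column
plt.title("Distribution of measurements")
plt.show()
```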

[Update, 2014-04-09: I have followed through on my intention, creating and posting an IPython Notebook on Python for Data Science]

Data for Good
Moderator: Jake Porway (DataKind)
Panelists: Drew Conway (IA Ventures), Rayid Ghani (Edgeflip | University of Chicago), Elena Eneva (Accenture)

The last session I attended at Strata was also inspiring, and I plan to look for local opportunities for doing [data science for] good. The moderator and panelists have all been involved in projects applying data science techniques and technologies to help local communities, typically by helping local government agencies - which often have lots of data but little understanding of how to use it - better serve their constituents. Drew Conway helped the FDNY use data mining to rank buildings' fire risk based on 60 variables, enabling the department to better prioritize its inspectors' time. Rayid Ghani co-founded the Data Science for Social Good summer fellowship program at the University of Chicago last year, and Elena Eneva was one of the program mentors willing to take a sabbatical from her regular work to work with teams of students in formulating big data solutions to community problems [disclosure: Rayid and Elena are both friends and former colleagues from my Accenture days]. Rayid noted that there are challenges in matching data science solutions to community problems, and so he developed a checklist to help identify the most promising projects (two elements: a local community or government organization that has - and can provide - access to the data, and that has the capacity for action). Elena suggested that most data scientists would be surprised at how much impact they could have by applying a few simple data science techniques. If I were to attempt to summarize the panel with a quote of my own, I would riff on Margaret Mead:

Updates

A number of other Strata attendees have shared their insights and experiences:

I created a wordle from the Strata 2014 Exhibitors page:

And finally, although not directly related to the conference, I found this very funny "big data" cartoon, by Tom Fishburne, yesterday while searching for the origin of the quote,

"Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it."

which, as far as I can tell, was first articulated in a Facebook post by Dan Ariely ... and was the inspiration for the cartoon:

140113.bigdata.jpg, from tomfishburne.com


Hype, Hubs & Hadoop: Some Notes from Strata NY 2013 Keynotes

I didn't physically attend Strata NY + Hadoop World this year, but I did watch the keynotes from the conference. O'Reilly Media kindly makes videos of the keynotes and slides of all talks available very soon after they are given. Among the recurring themes were railing against the hype of big data, the increasing utilization of Hadoop as a central platform (hub) for enterprise data, and the importance and potential impact of making data, tools and insights more broadly accessible within an enterprise and to the general public. The keynotes offered a nice mix of business (applied) & science (academic) talks, from event sponsors and other key players in the field, and a surprising - and welcome - number of women on stage.

Atigeo, the company where I now work on analytics and data science, co-presented a talk on Data Driven Models to Minimize Hospital Readmissions at Strata Rx last month, and I'm hoping we will be participating in future Strata events. And I'm hoping that some day I'll be on stage presenting some interesting data and insights at a Strata conference.

Meanwhile, I'll include some of my notes on interesting data and insights presented by others, in the order in which presentations were scheduled, linking each presentation title to its associated video. Unlike previous postings of notes from conferences, I'm going to leave the notes in relatively raw form, as I don't have the time to add more narrative context or visual augmentations to them.


Hadoop's Impact on the Future of Data Management
 
Mike Olson @mikeolson (Cloudera)

3000 people at the conference (sellout crowd), up from 700 people in 2009.
Hadoop started out as a complement to traditional data processing (offering large-scale processing).
Progressively adding more real-time capabilities, e.g. Impala & Cloudera search.
More and more capabilities migrating from traditional platforms to Hadoop.
Hadoop moving from the periphery to the architectural center of the data center, emerging as an enterprise data hub.
Hub: scalable storage, security, data governance, engines for working with the data in place
Spokes connect to other systems, people
Announcing Cloudera 5, "the enterprise data hub"
Announcing Cloudera Connect Cloud, supporting private & public cloud deployments
Announcing Cloudera Connect Innovators, inaugural innovator is DataBricks (Spark real-time in-memory processing engine)


Separating Hadoop Myths from Reality
Jack Norris (MapR Technologies)

Hadoop is the first open source project that has spawned a market
3:35 compelling graph of Hadoop/HBase disk latency vs. MapR latency
Hadoop is being used in production by many organizations


Big Impact from Big Data
Ken Rudin (Facebook)

Need to focus on business needs, not the technology
You can use science, technology and statistics to figure out what the answers are, but it is still an art to figure out what the right questions are
How to focus on the right questions:
* hire people with academic knowledge + business savvy
* train everyone on analytics (internal DataCamp at Facebook for project managers, designers, operations; 50% on tools, 50% on how to frame business questions so you can use data to get the answers)
* put analysts in org structure that allows them to have impact ("embedded model": hybrid between centralized & decentralized)
Goals of analytics: Impact, insight, actionable insight, evangelism … own the outcome


Five Surprising Mobile Trajectories in Five Minutes
Tony Salvador (Intel Corporation)

Tony is director at the Experience Research Lab (is this the group formerly known as People & Practices?) [I'm an Intel Research alum, and Tony is a personal friend]
Personal data economy: system of exchange, trading personal data for value
3 opportunities
* hyper individualism (Moore's Cloud, programmable LED lights)
* hyper collectivity (student projects with outside collaboration)
* hyper differentiation (holistic design for devices + data)
Big data is by the people and of the people ... and it should be for the people


Can Big Data Reach One Billion People?
Quentin Clark (Microsoft)

Praises Apache, open source, github (highlighted by someone from Microsoft?)
Make big data accessible (MS?)
Hadoop is a cornerstone of big data
Microsoft is committed to making it ready for the enterprise
HD Insight (?) Azure offering for Hadoop
We have a billion users of Excel, and we need to find a way to let anybody with a question get that question answered.
Power BI for Office 365 Preview


What Makes Us Human? A Tale of Advertising Fraud
Claudia Perlich (Dstillery)

A Turing test for advertising fraud
Dstillery: predicting consumer behavior based on browsing histories
Saw 2x performance improvement in 2 weeks; was immediately skeptical
Integrated additional sources of data (10B bid requests)
Found "oddly predictive websites"
e.g., Women's health page --> 10x more likely to check out a credit card offer, order pizza online, or read about luxury cars
Large advertising scam (botnet)
36% of traffic is non-intentional (Comscore)
Co-visitation patterns
Cookie stuffing
Botnet behavior is easier to predict than human behavior
Put bots in "penalty box": ignore non-human behavior


From Fiction to Facts with Big Data Analytics
Ben Werther @bwerther (Platfora)

When it comes to big data, BI = BS
Contrasts enterprises based on fiction, feeling & faith vs. fact-based enterprises
Big data analytics: letting regular business people iteratively interrogate massive amounts of data in an easy-to-use way so that they can derive insight and really understand what's going on
3 layers: Deep processing + acceleration + rich analytics
Product: Hadoop processing + in-memory acceleration + analytics engines + Vizboards
Example: event series analytics + entity-centric data catalog + iterative segmentation


The Economic Potential of Open Data
Michael Chui (McKinsey Global Institute)

[Presentation is based on newly published - and openly accessible (walking the talk!) - report: Open data: Unlocking innovation and performance with liquid information.]

Louisiana Purchase: Lewis & Clark address a big data acquisition problem
Thomas Jefferson: "Your observations are to be taken with great pains & accuracy, to be entered intelligibly, for others as well as yourself"
What happens when you make data more liquid?

4 characteristics of "openness" or "liquidity" of data:
* degree of access
* machine readability
* cost
* rights

Benefits to open data:
* transparency
* benchmarking exposing variability
* new products and services based on open data (Climate Corporation?)

How open data can enable value creation
* matching supply and demand
* collaboration at scale
"with enough eyes on code, all bugs are shallow"
--> "with enough eyes on data, all insights are shallow"
* increase accountability of institutions

Open data can help unlock $3.2T to $5.4T in economic value per year across 7 domains
* education
* transportation
* consumer products
* electricity
* oil and gas
* health care
* consumer finance
What needs to happen?
* identify, prioritize & catalyze data to open
* developers, developers, developers
* talent (data scientists, visualization, storytelling)
* address privacy, confidentiality, security, IP policies
* platforms, standards and metadata


The Future of Hadoop: What Happened & What's Possible?
Doug Cutting @cutting (Cloudera)

Hadoop started out as a storage & batch processing system for Java programmers
Increasingly enables people to share data and hardware resources
Becoming the center of an enterprise data hub
More and more capabilities being brought to Hadoop
Inevitable that we'll see just about every kind of workload being moved to this platform, even online transaction processing


Designing Your Data-Centric Organization
Josh Klahr (Pivotal)

GE has created 24 data-driven apps in one year
We are working with GE as both a Pivotal investor and a Pivotal customer; we help them build these data-driven apps, which generated $400M in the past year
Pivotal code-a-thon, with Kaiser Permanente, using Hadoop, SQL and Tableau

What it takes to be a data-driven company
* Have an application vision
* Powered by Hadoop
* Driven by Data Science


Encouraging You to Change the World with Big Data
David Parker (SAP)

Took Facebook 9 months to achieve the same number of users that it took radio 40 years to achieve (100M users)
Use cases
At-risk students stay in school with real-time guidance (University of Kentucky)
Soccer players improve with spatial analysis of movement
Visualization of cancer treatment options
Big data geek challenge (SAP Lumira): $10,000 for best application idea


The Value of Social (for) TV
Shawndra Hill (University of Pennsylvania)

Social TV Lab
How can we derive value from the data that is being generated by viewers today?
Methodology: start with Twitter handles of TV shows, identify followers, collect tweets and their networks (followees + followers), and build recommendation systems from the data (social network-based, product network-based & text-based (bag of words)). Correlate words in tweets about a show with demographics about the audience (Wordle for male vs. female). (A rough sketch of the follower-overlap idea appears after these notes.)
1. You can use Twitter followers to estimate viewer audience demographics
2. TV triggers lead to more online engagement
3. If brands want to engage with customers online, play an online game
Real time response to advertisement (Teleflora during Super Bowl): peaking buzz vs. sustained buzz
Demographic bias in sentiment & tweeting (male vs. female response to Teleflora, others)
Influence = retweeting
Women more likely to retweet women, men more likely to retweet men
4. Advertising response and influence vary by demographic
5. GetGlue and Viggle check-ins can be used as a reliable proxy for viewership to
* predict Nielsen viewership weeks in advance
* predict customer lifetime value
* measure time shifting
All at the individual viewer level (vs. household level)
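
Here is a rough sketch of the follower-overlap intuition behind the social network-based recommender: shows whose Twitter followers overlap heavily are good candidates to recommend to each other's audiences. The handles and follower sets are invented for illustration:

```python
# Follower-overlap similarity between TV shows (toy data).
followers = {
    "@ShowA": {"u1", "u2", "u3", "u4"},
    "@ShowB": {"u3", "u4", "u5"},
    "@ShowC": {"u6", "u7"},
}

def similarity(a, b):
    return len(followers[a] & followers[b]) / len(followers[a] | followers[b])

def recommend(show, k=2):
    others = [s for s in followers if s != show]
    return sorted(others, key=lambda s: similarity(show, s), reverse=True)[:k]

print(recommend("@ShowA"))    # ['@ShowB', '@ShowC']
```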


Ubiquitous Satellite Imagery of our Planet
Will Marshall @wsm1 (Planet Labs)

Ultracompact satellites to image the earth on a much more frequent basis to get inside the human decision-making loop so we can help human action.
Redundancy via large # of small satellites with the latest technology (vs. older, higher-reliability systems on one satellite)
Recency: shows more deforestation than Google Maps, river movement (vs. OpenStreetMap)
API for the Changing Planet, hackathons early next year


The Big Data Journey: Taking a holistic approach
John Choi (IBM)

[No slides?]
Invention of sliced bread 
Big data [hyped] as the biggest thing since sliced bread
Think about big data as a journey
1. It's all about discipline and knowing where you are going (vs. enamored with tech)
VC $2.6B investment into big data (IBM, SAP, Oracle, … $3-4B more)
2. Understand that any of these technologies do not live in a silo. The thing that you don't want to have happen is that this thing become a science fair project. At the end of the day, this is going to be part of a broader architecture.
3. This is an investment decision, want to have a return on investment.


How You See Data
Sharmila Shahani-Mulligan @ShahaniMulligan (ClearStory Data)

The Next Era of Data Analysis: next big thing is how you analyze data from many disparate sources and do it quickly.
More data: Internal data + external data
More speed: Fast answers + discovery
Increase speed of access & speed of processing so that iterative insight becomes possible.
More people: Collaboration + context
Needs to become easier for everyone across the business (not just specialists) to see insights as they are made available, since decisions have to be made faster.
Data-aware collaboration
Data harmonization
Demo: 6:10-8:30


Can Big Data Save Them?
Jim Kaskade @jimkaskade (Infochimps)

1 of 3 people in US has had a direct experience with cancer in their family
1 in 4 deaths are cancer-related
Jim's mom has chronic leukemia
Just got off the phone with his mom (it's his birthday), and she asked "what is it that you do?"
"We use data to solve really hard problems like cancer"
"When?"
"Soon"
Cancer is 2nd leading cause of death in children
"The brain trust in this room alone could advance cancer therapy more in a year than the last 3 decades."
Bjorn Brucher
We can help them by predicting individual outcomes, and then proactively applying preventative measures.
Big data starts with the application
Stop building your big data sandboxes, stop building your big data stacks, stop building your big data Hadoop clusters without a purpose.
When you start with the business problem, the use case, you have a purpose, you have focus.
50% of big data projects fail (reference?)
"Take that one use case, supercharge it with big data & analytics, we can take & give you the most comprehensive big data solutions, we can put it on the cloud, and for some of you, we can give you answers in less than 30 days"
"What if you can contribute to the cure of cancer?" [abrupt pivot back to initial inspirational theme]


Changing the Face of Technology - Black Girls CODE
Peta Clarke @volunteerbgcny (Black Girls Code - NY), Donna Knutt @donnaknutt (Black Girls Code)

Why coding is important: By 2020, 1.4M computing jobs
Women of color currently hold 3% of computing jobs in the US
Goal: teach 1M girls to code by 2040
Thus far: 2 years, 2000 girls, 7 states + Johannesburg, South Africa


Beyond R and Ph.D.s: The Mythology of Data Science Debunked
Douglas Merrill @DouglasMerrill (ZestFinance)

[my favorite talk]
Anything which appears in the press in capital letters, and surrounded by quotes, isn't real.
There is no math solution to anything. Math isn't the answer, it's not even the question.
Math is a part of the solution. Pieces of math have different biases, different things they do well, different things they do badly, just like employees. Hiring one new employee won't transform your company; hiring one new piece of math also won't transform your company.
Normal distribution, bell curve: beautiful, elegant
Almost nothing in the real world is, in fact, normal.
Power laws don't actually have means.
Joke: How do you tell the difference between an introverted and an extroverted engineer? The extroverted one looks at your shoes instead of his own.
The math that you think you know isn't right. And you have to be aware of that. And being aware of that requires more than just math skills.
Science is inherently about data, so "data scientist" is redundant
However, data is not entirely about science
Math + pragmatism + communication
Prefers "Data artist" to data scientist
Fundamentally, the hard part actually isn't the math, the hard part is finding a way to talk about that math. And, the hard part isn't actually gathering the data, the hard part is talking about that data.
The most famous data artist of our time: Nate Silver.
Data artists are the future.
What the world needs is not more R, what the world needs is more artists (Rtists?)


Is Bigger Really Better? Predictive Analytics with Fine-grained Behavior Data
Foster Provost (NYU | Stern)

[co-author of my favorite book on Data Science]
Agrees with some of the critiques made by previous speaker, but rather likes the term "data scientist"
Shares some quotes from Data Science and its relationship to Big Data and Data-Driven Decision Making
Gartner Hype Cycle 2012 puts "Predictive Analytics" at the far right ("Plateau of Productivity")  
[it's still there in Gartner Hype Cycle 2013, and "Big Data" has inched a bit higher into the "Peak of Inflated Expectations"]
More data isn't necessarily better (if it's from the same source, e.g., sociodemographic data)
More data from different sources may help.
Using fine-grained behavior data, learning curves show continued improvement to massive scale.
1M merchants, 3M data points (? look up paper)
But sociodemographic + pseudo social network data still does not necessarily do better
See Pseudo-Social Network Targeting from Consumer Transaction Data (Martens & Provost)
Seem to be very few case studies where you have really strong best practices with traditional data juxtaposed with strong best practices with another sort of data.
We see similar learning curves with different data sets, characterized by massive numbers of individual behaviors, each of which probably contains a small amount of information, and data items that are sparse (see the sketch after these notes).
See Predictive Modelling with Big Data: Is Bigger Really Better? (Enrique Junque de Fortuny, David Martens & Foster Provost)
Others have published work on fraud detection (Fawcett & FP, 1997; Cortes et al, 2001), social network-based marketing (Hill, et al, 2006), and online display-ad targeting (FP, Dalessandro, et al., 2009; Perlich, et al., 2013)
Rarely see comparisons

Take home message:
The Golden Age of Data Science is at hand.
Firms with larger data assets may have the opportunity to achieve significant competitive advantage.
Whether bigger is better for predictive modeling depends on:
a) the characteristics of the data (e.g., sparse, fine-grained data on consumer behavior)
b) the capability to model such data
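
The learning-curve question ("is bigger really better?") is answered empirically by training on progressively larger samples and scoring on a fixed held-out set. A hedged sketch of that procedure with synthetic sparse data (standing in for fine-grained behavior data, not the paper's actual experiment) might look like this:

```python
# Learning curves on synthetic sparse data: accuracy vs. training-set size.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = sparse_random(20000, 5000, density=0.001, format="csr", random_state=0)
true_w = rng.randn(5000)
y = (X.dot(true_w) > 0).astype(int)        # labels driven by the sparse features

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=5000, random_state=0)
for n in [500, 2000, 8000, 15000]:
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    print(n, round(model.score(X_test, y_test), 3))   # accuracy typically keeps climbing
```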


Design for Health: Notes from a Multidisciplinary CSCW 2012 Workshop


I participated in an incredibly well organized and facilitated workshop on Brainstorming Design for Health: Helping Patients Utilize Patient-Generated Data on the Web on Saturday. The participants represented a diverse range of backgrounds and interests - even for a workshop at a Computer Supported Cooperative Work (CSCW) conference, already a particularly diverse community. Our workshop had representation from fields including computer and/or information science (especially data geeks), design (several flavors), anthropology, urology and even veterinary medicine.

After outlining the agenda and going around the room with brief introductions, we were treated to a remote keynote presentation by Paul Wicks, Director of Research & Development for PatientsLikeMe, a web service for compiling and provisioning patient-reported data for use in clinical trials. Paul offered an overview of the organization, highlighting some of its successes - including the discovery of the ineffectiveness of lithium for treating ALS and a more recent study revealing the positive user experiences of PatientsLikeMe users with epilepsy (55% of respondents consider PatientsLikeMe “moderately or very” helpful in learning about the type of seizures they experience) - and some of the challenges they face with respect to complexities (ontologies for symptoms, diagnoses and treatments) and incentives (ensuring that patients who give something get something).


The epilepsy study was particularly interesting to me, as I've explored the web 2.0 service on behalf of my wife, who suffers from a few chronic conditions, and we were both struck (and personally disincentivized) by how narrowly structured the interface for describing conditions was. Paul acknowledged the regimentation of the data and interface, but noted that very little progress has been made in using natural language processing techniques to effectively extract useful data from less structured patient descriptions of symptoms, diagnoses or treatments, despite a great deal of effort.


PatientsLikeMe takes a very pragmatic approach to serving as an intermediary between patients' data and organizations that are willing to pay for that data, and so they focus on the sweet spot of data that can be relatively easily collected and provisioned. I was glad to hear about the epilepsy study, as it provides evidence that some patients are also reaping benefits from sharing their data. More generally, Paul was forthright and even evangelical about the business orientation of PatientsLikeMe (they are a for-profit corporation) and encouraged all workshop participants to think about sustainability - beyond the scope of government grants, business contracts or other relatively short term forms of support - in our own work.

After the keynote, we were partitioned into four working groups, all of which were tasked with defining a problem, designing a solution and reporting back to the broader group. The focused small group activity provided a context for stimulating discussions about a range of issues involving health, data, users and design, and the time constraints provided an impetus to keep things flowing toward a goal. The four groups were organized around the following themes:

  1. Methods for processing narrative versus numeric data
  2. Depicting a diversity of opinions and experiences embedded within patient-generated information
  3. Working with "lay" concepts and language and their alignment with complex medical issues
  4. Being mindful with privacy-enhancing methods for data handling

I started off in Group 1, as I am interested in narrative data (the hard problem currently avoided by PatientsLikeMe), but it quickly became apparent that most of the other members in the group were primarily interested in the relatively short narratives that unfold on Twitter rather than the longer form patient narratives - such as one might find in blogs or online support forums -  that I am primarily interested in.

During the report outs after breakout session 1, Group 3 described a persona, "Kelly", who was so remarkably similar to my wife, and her epic digestive health odyssey (which we described in a rather epic long form narrative blog post last August), that I decided to switch groups during the lunch break. Group 3 was particularly diverse - with two interaction designers, a graphics designer, an MD, an anthropologist, and a few folks (like me) with computer or information science backgrounds - and most of the members were from outside the traditional CSCW community (which itself is rather diverse). This diversity, coupled with the participation of people who could personally relate to the plight of the patient persona of "Kelly", enabled us to make good progress on our design for helping "Kelly".


The first - and in this case, probably final - design, "Health Tryst", was modeled - in both name and functionality - on Pinterest, the increasingly popular online pinboard for "organizing and sharing the things you love", and included features for helping a patient with irritable bowel syndrome (IBS) navigate to relevant information and online support groups that might help her (or him) cope with a chronic condition, and to share these items with others.


I won't go into all the details of the design, as I believe the most valuable aspect of the process was the discussions that arose in the context of designing something that would be useful to a patient like "Kelly" suffering from IBS. The one feature I will highlight is a capability for Kelly to enter her own personal narrative using her own words; the application would automatically seek out synonyms used more commonly in the medical and/or patient support communities, and automatically link to resources associated with the themes and topics indicated in her narrative.
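
A minimal sketch of that narrative-mapping feature might look like the following, with an invented lay-to-clinical synonym table and hypothetical resource links; a real system would need clinical vocabularies (e.g., UMLS) and far more robust natural language processing:

```python
# Map a patient's own words to clinical terms and associated resources (toy data).
LAY_TO_CLINICAL = {
    "stomach ache": "abdominal pain",
    "can't sleep": "insomnia",
    "heartburn": "gastroesophageal reflux",
}

RESOURCES = {
    "abdominal pain": ["https://example.org/ibs-support"],      # hypothetical links
    "insomnia": ["https://example.org/sleep-hygiene"],
}

def map_narrative(text):
    text = text.lower()
    matches = []
    for lay_term, clinical_term in LAY_TO_CLINICAL.items():
        if lay_term in text:
            matches.append((lay_term, clinical_term, RESOURCES.get(clinical_term, [])))
    return matches

print(map_narrative("I have had a stomach ache for weeks and I can't sleep."))
```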

The unfolding design sketches and scenario offered effective props to keep those discussions focused and flowing. I'm not sure if anyone will carry the design forward, but suspect everyone involved came away with a keener awareness of some of the issues faced by "Kelly" and the medical providers and online community members who might help her.

There were several other interesting designs that emerged from the other groups. Unfortunately, I didn't take good notes on them, and so cannot report on the other designs. As was the case in our group, the designs served to spark interesting discussions within and across groups on issues relating to health, data and technology. However, I was struck by a general theme that emerged (for me), which I believe was particularly well summed up in a recent Forbes article by David Shaywitz on Medicine's Tech Future, from last week's Future Med 2020 conference:

there’s a huge gap between the way many technologists envision medical problems and the way problems are actually experienced by physicians and patients

Our group was extremely fortunate to have good representation of both physicians and patients, as well as technologists and designers. In fact, if I had to select the highlight of all the highlights of the entire day, it was the discovery of an incredibly powerful visualization of a medical history timeline created by Katie McCurdy to use as a prop in discussing her chronic conditions in an initial interview with yet another new physician. The physician in our group confirmed that this was, indeed, exactly the kind of prop that he and other physicians would likely find extremely useful in such a context.


At the end of the workshop, we discussed a number of ways we might move forward in the future. I think one of the most effective ways to move discussions - and designs - forward will be to ensure broader participation from patient and physician communities, perhaps organizing or participating in workshops associated with some of these other communities.

I also think that health applications offer a perfect context within which to organize unconferences to bring together designers, developers, patients, physicians and business folk. I participated in a civic hacktivism project at Data Camp Seattle last February, but the Hacking 4 Health unconference in Palo Alto in September 2010 is a more relevant example that might be emulated to help move things forward. The Health Foo Camp this past July offers another unconference event that might be of interest to those who want to continue designing for health, and anyone interested in participatory design in the context of health might also want to check out the Society of Participatory Medicine and their blog, e-Patient.net.

The main conference is about to start, so I want to wrap this up. I am very grateful to the workshop organizers - Jina Huh and Andrea Hartzler - for bringing us all together and providing the perfect level of structure for promoting engaging discussions and designs on a topic that is of such great interest to all participants, and I look forward to future opportunities to practice designing for health.

Update: I'm including a couple of related blog posts, in case they help facilitate links across communities interested in this area (these and other related posts can be found in the Health category of this blog):


Airborne telepresence robots: 1995 & 2011

In introducing a short Marketplace Tech Report story about a floating blimp telepresence avatar this morning, host John Moe somewhat sarcastically said "Oh, no: not another floating blimp telepresence avatar story!", highlighting the rather unusual nature of a story about a "blimp-based boss". The story, reported by producer Larissa Anderson starting at the 3:08 mark, was about a floating remote-control telepresence robot that can enable people to remotely interact with - and perhaps unexpectedly look over the shoulders of - coworkers. It is a rather unusual story, but perhaps not quite as novel as some may believe. I was immediately reminded of some early research my friend Eric Paulos did at UC Berkeley on "Space Browsers" and other examples of what he called Personal Roving Presence (PRoP) in the 1990s.

After following some links to learn more about the Marketplace Tech Report story, I discovered an article - and embedded video - by Jim Giles in New Scientist, "Telepresence robots go airborne". The New Scientist article references a CHI 2011 presentation last week by Tobita Hiroaki and colleagues at Sony Computer Science Laboratories. The associated alt.chi paper, Floating avatar: telepresence system using blimps for communication and entertainment, includes a reference to the earlier work by Eric Paulos and John Canny (which was started in 1995 and presented in the CHI 1998 video program). Given that the more recent example of floating telepresence robots by Sony CSL is currently making the rounds in the popular press, and my abiding interest in promoting accuracy in science reporting, I wanted to highlight the earlier work at UCB outside of traditional academic publication citation threads.

Somewhat ironically, just last week, I mentioned another example of robotic telepresence "then & now" in the class I'm teaching on Computer-Mediated Communication. A 2005 BoingBoing post by David Pescovitz on Telerobots Separated at Birth highlighted the similarity between a wheeled successor of Space Browser, what Eric called PRoP 2, and "Sister Mary", an example of what InTouch Health calls RP [Remote Presence] Endpoint Devices.

Separated at birth? At left, Sister Mary, a telerobot offered by InTouch Health that enables physicians to conduct their rounds remotely. Sister Mary is now being tested at St. Mary's Hospital in London. Link and Link

At right, Eric Paulos and John Canny's Personal Roving Presence (PRoP), a telerobot that "provides video and audio links to the remote space as well as providing a visible, mobile entity with which other people can interact." PRoP was developed at UC Berkeley in 1997. Link

Unfortunately, I can't find an embeddable online video of Space Browsers, but I did find a 3-minute video on PRoP: Personal Roving Presence from the CHI 1998 Video Program at the Open Video Project (which includes a storyboard of images from the video and a link to a downloadable 31MB MPG video of PRoP).

I'll include excerpts of coverage of airborne telepresence robots from 1995 and 2011 below.

1995

Interfacing Reality

 


Space Browsers: A Tool for Ubiquitous Tele-embodiment

The first PRoPs were simple airborne tele-robots we named Space Browsers, first designed and flown in 1995. The Space Browsers were helium-filled blimps of human proportions or smaller, propelled by several lightweight motor-driven propellers. On board each blimp was a color video camera, a microphone, a speaker, and the electronics and radio links necessary to enable remote operation. The entire payload was less than 600 grams (typically 400-500 grams). We used the smallest blimps that could carry the necessary cargo in order to keep them as maneuverable as possible. Our space browsers were able to navigate hallways, doorways, stairwells, and even within the confines of an elevator. We experimented with several different configurations, ranging in size from 180x90 cm to 120x60 cm and shapes from cylinders and spheres to "pillow-shaped" rectangles. We found the smaller blimps were best-suited for moving into groups of people and engaging in conversation with minimal disruption since they took up no more space than a standing person. The browsers were designed to move at a speed similar to a human walking.

The basic principle was that a user anywhere on the internet could log into a browser configured to pilot the blimp. The system used a Java applet to send audio to the blimp, to control its locomotion, and to retrieve audio and visual information. As the remote user guided the blimp through space, the blimp delivered live video and audio to the pilot's machine using standard tele-conferencing software. The user could thus observe and take part in any remote conversation accessible by the blimp. These blimps allowed the user to travel, observe, and communicate throughout 3D space. He could observe things as if he were physically there.

2011

Telepresence robots go airborne

03:40 12 May 2011
Jim Giles, contributor, Vancouver, Canada

Picture the scene: your boss phones to say he is working from home. A calm descends over the office. Workers lean back in their chairs. Feet go up on desks - this shift is going to be pretty chilled.

Suddenly, a super-sized video feed of your boss, projected onto to the front of a helium-filled balloon equipped with a loudspeaker, floats silently into the room and starts issuing orders from above your head. Not such a good day.

This blimp-based boss, which brings to mind the all-seeing Big Brother of George Orwell's 1984, is the creation of Tobita Hiroaki and colleagues at Sony Computer Science Laboratories in Tokyo. Its eerie quality hasn't escaped Hiroaki - he says that his colleagues described the experience of talking to a metre-wide floating image of a co-worker as "very strange".

The project does have some non-sinister applications. It's part of a wider movement aimed at making "telepresence" possible. Imagine a medical specialist who can't make it to a regional hospital, but needs to consult with a patient there. Or an academic expert who wants to deliver a lecture remotely. Telepresence researchers are working on technology that can get a representation of these people into the room. To put it another way: telepresence lets you be in two places at the same time.


Social Media and Computer Supported Cooperative Health Care

I've become increasingly aware of - and inspired by - the ways that social media is enabling platform thinking, de-bureaucratization and a redistribution of agency in the realm of health care. Blogs, Twitter and other online forums are helping a growing number of patients - who have traditionally suffered in silence - find their voices, connect with other patients (and health care providers) and discover or co-create new solutions to their ills. In my view, this is one of the most exciting and promising areas of computer supported cooperative work (CSCW), and in my role as Publicity Co-chair for ACM CSCW 2012 (February 11-15, Seattle) I am hoping to promote greater participation - in the conference - among the researchers, designers, developers, practitioners and other innovators who are utilizing social media and other computing technologies for communication, cooperation, coordination and/or confrontation with various stakeholders in the health care ecosystem.

Dana Lewis, the founder and curator of the fast-paced, weekly Twitter chats on health care in social media (#hcsm, Sundays at 6-7pm PST), recently served as guest editor for an upcoming article on social media in health care for the new Social Mediator forum in ACM Interactions magazine. The article - which will appear in the July/August 2011 issue - weaves together insights and experiences from some of the leading voices in the use of social media in health care: cancer survivor, author and speaker "ePatient Dave" deBronkart promotes the use of technology for enabling shared decision-making by patients and providers; patient rights artist and advocate Regina Holliday shares her story of how social media tools are enabling her to channel her anger with a medical bureaucracy that hindered her late husband's access to vital information in his battle with cancer by writing on walls, online and offline; pediatrician Wendy Sue Swanson describes how she uses her SeattleMamaDoc blog for both teaching and learning in her practice of medicine; health care administrator Nick Dawson invokes the analogy of school in offering his perspective on the evolution of social media in health care, as it matures from freshman-level to graduate studies.

In my social media sojourns, I've encountered many other inspiring examples of people, programs and platforms that are being used to empower patients to connect more effectively with information and potential solutions:

It is important to note that health care has been an area of focus for CSCW in the past. For example, there was a CSCW 2011 session on health care, and other papers on health care were presented in other sessions:

There were also a number of health care papers presented at CSCW 2010:

There was also a CSCW 2010 workshop on CSCW Research in Health Care: Past, Present & Future with 21 papers.

My primary goal in this particular post is to increase awareness and broaden the level of participation among people designing, using and studying social media in health care. My most immediate goal is to alert prospective authors about the upcoming deadline for Papers and Notes - June 3 -  which has been moved earlier this year to incorporate a revision & resubmission phase in the review process, which was partly designed to accommodate the shepherding of promising submissions by authors outside of the traditional CSCW community who have valuable insights and experiences to share.

At some later phase, I'll start instigating, connecting & evangelizing other channels of potential participation, such as posters, workshops, panels, videos, demonstrations, CSCW Horizon (a category specially designated for non-traditional CSCW) and the doctoral colloquium. For now, I would welcome any help in spreading the word about the conference - and its relevance - to the health care social media community.


Civic Hacktivism at Data Camp Seattle

The Code for America Seattle fellows organized Data Camp Seattle, a day-long unconference / hackathon in collaboration with Socrata and the City of Seattle on Saturday. The event brought together city leaders, neighborhood leaders, technologists and [other] civic-minded individuals and groups to share ideas, data and tools, and to build or improve applications to promote civic awareness, engagement and well-being.

The day started with a brief overview of Code for America by the CfA Seattle fellows, Chach Sikes, Alan Palazzolo and Anna Bloom. Code for America is an organization that pairs web designers and developers (fellows) with city governments to create web applications that promote more openness, transparency, efficiency and effectiveness in the provision of services by the hosting cities. In 2011, four cities - and 20 fellows - were selected in a competitive application process; other CfA 2011 cities are Boston (with 7 fellows), Philadelphia (7) and Washington, DC (3). Anna described the focus of the CfA 2011 Seattle fellows as connecting local leaders to solve civic problems.

Chach introduced a number of representatives from the City of Seattle who were attending the event. Neil Berry, with the Seattle Open Data team, told us that his team offers over 100 location-encoded datasets at data.seattle.gov, including information on crime statistics, building permits and neighborhood boundaries. Bill Schrier, CTO/CIO of Seattle (aka Chief Seattle Geek, @BillSchrier), emphasized the collaborative investment made in CfA Seattle by the City of Seattle, Esri and Microsoft, and the early & ongoing support for the broader initiative by Tim O'Reilly. He described the goal of CfA Seattle as transforming data into useful information & applications.

Chris Metcalf, Technical Program Manager and Developer Evangelist at Socrata, where the event was hosted (and which paid for our lunches), described the company as providing software as a service (SaaS) for governments of all levels and sizes, from 5,000-citizen townships to the federal government. Among their services is a Web-based API for [open] government data, and among their clients is the City of Seattle (Socrata hosts data.seattle.gov).

Among the other participants introduced during the opening session were

  • representatives from Seattle's South Park neighborhood, a community known for its high per-capita concentration of artists, children and industry (and the not-always-desirable byproducts of industry)
  • Sanjay Bhatt, a Seattle Times reporter who focuses on the visualization of data relating to the Seattle area
  • Sarah Schacht, executive director of Knowledge As Power, with the general aim of making lawmaking accessible to the public and the more specific goal (for the day) of developing ways of parsing legislative documents (for which there exists no international metadata standard)
  • Pascal Schuback, with the King County Office of Emergency Management, and CrisisCommons.org, who emphasized the need to create better reporting mechanisms (e.g., a smartphone app to augment current practices of in-person visits and landline phone calls) and an open data set for the damages wrought by disasters
  • Russell Branca, the developer behind SeaAPI.com, which offers a map-based interface for data about Seattle, who came seeking assistance in improving the site and service
  • Naoya Makino, a computer science student at Simon Fraser University, who described EatSure, an application developed in / for Vancouver, BC, that makes health inspection reports for restaurants available via a Google Maps interface
  • Andrew Morton, a graduate student at the University of Washington Information School, who is working on a project to analyze the access and accessibility of health information through public libraries
  • Brian Ferris, a graduate student in the UW Computer Science & Engineering department, who developed and manages OneBusAway, a web-, phone- and SMS-based service that provides real-time arrival information for King County Metro bus routes, and who proposed a crowd-sourced civic-oriented game - Fuzzy Neighborhood Labels - to enable users to identify neighborhood boundaries
  • Aaron Parecki and Amber Case, of GeoLoqi.org, briefly described how their platform can be used to create location-based triggers that can be sent to notify users of potentially interesting information related to the places they are in (or near), and how another platform, Tropo, can be used for SMS, IM, voice calls and speech synthesis.

After the introductions, Alan brought out post-it pads and markers, and Chach announced that we would have 5 minutes to write down things we want to share (on the yellow and orange post-its) and things we want to learn (on the blue and green post-its), and then post them on a wall. We would then have 5 minutes to cluster the things people want to share and learn into common themes or topics, to facilitate the formation of unconference breakout sessions. Leaders were recruited for each cluster, we split into smaller discussion groups in smaller rooms, and brief reports were given just before lunch. Notes from the sessions - on the South Park neighborhood, mobile damage assessment apps, transit apps, mobile / geolocation apps, data mining, information visualization - were posted to the DataCampSEA Google Group. I joined the session on mobile geolocation apps, led by Aaron & Amber, and my notes from the session can be found at the link above.

After lunch, a number of prospective projects were proposed (many of which had been suggested during the introductions), and we again split off into smaller groups, but this time with the goal of designing and developing rather than - or in addition to - discussing applications. I joined Aaron, Amber and others to design and develop a mobile geolocation app that would enable users to subscribe to events or event types from Seattle city event calendars (and, eventually, other geocoded event sources) and be notified via SMS whenever they were within 500 meters of the event site within an hour of the start of the event. Obviously, there are a lot of important details to be worked out for a full-fledged application for performing this task, but we were able to make considerable headway on an application over the course of a little over 3 hours.

We quickly came up with a name for the application - HearNear (the idea being that your phone "listens" for events of interest nearby) - and self-organized into different tasks:

  • Aaron set up the GeoLoqi instance for the app, and helped others develop the other pieces that would be required by the GeoLoqi API.
  • Amber, a UX Designer, developed the wireframes for the site and worked with Jesse and Jenny on the overall look and feel.
  • Gene Homicki, president of Objective Consulting, reserved the domain name (hearnear.org) and set us up with web hosting service within moments of our deciding on a name.
  • Jesse Kocher, Lead Developer at WalkScore.com, and Jenny Frankl, Seattle Youth Commission program coordinator, designed a fabulous logo for the app.
  • Steve Ripley, a web designer and developer with seattle.gov, helped us find and decide among the various calendar event feeds provided by the City of Seattle; we decided the iCal feed was easier to parse than the RSS feed
  • Rebecca Gutterman started working on a Java-based parser for the iCal feed to find and convert the relevant fields into the format required by the GeoLoqi API
  • Naoya started working on a Python version of an iCal feed parser
  • I initially started working on a PHP version of the iCal feed parser, but with two others working on a parser, I soon decided I could be more helpful to the team by identifying and defining mappings between the GeoLoqi API and the iCal feed (a rough sketch of that mapping follows this list).
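
For the curious, here is a rough reconstruction of the kind of iCal-to-trigger mapping I was working out; the field names in the trigger payload are illustrative rather than the actual GeoLoqi API schema, and geocoding the event's location string is omitted.

```python
# Pull minimal fields from an iCal feed and shape each event into a geofenced
# notification payload. Uses the third-party 'icalendar' package.
import urllib.request
from icalendar import Calendar

def ical_events(url):
    data = urllib.request.urlopen(url).read()
    cal = Calendar.from_ical(data)
    for component in cal.walk("VEVENT"):
        yield {
            "title": str(component.get("SUMMARY")),
            "starts": component.decoded("DTSTART"),
            "location": str(component.get("LOCATION", "")),
        }

def to_trigger(event, lat, lng, radius_m=500):
    # One notification per event, fired when a subscriber enters the geofence
    # within an hour of the start time (time-window handling not shown).
    return {
        "type": "message",
        "latitude": lat,
        "longitude": lng,
        "radius": radius_m,
        "text": "Starting soon nearby: " + event["title"],
    }
```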

I have probably mis- or undercharacterized much of the work done by all the other people on the team, as I became increasingly engrossed in my relatively small role in working out the iCal -> GeoLoqi translations during the session. In any case, it was pretty amazing how much the team accomplished in such a short period of time! Amber has posted a number of photos of the HearNear application - and development effort - on Flickr, one of which I've included at the right. There were a number of other new and/or improved applications worked on by other groups during the afternoon, but my note-taking energy was pretty low by the end of the day, so I'm hoping that those developments will be captured / represented elsewhere.

I wasn't sure what to expect going into the event, but was greatly impressed with the interactions, overall experience and outcomes at Data Camp Seattle. I've admired the Code for America project since first learning about it, and have been a proponent of open data and platform thinking (and doing) on my blog. It was inspiring and empowering to have an opportunity to do more than simply blog about these topics ... though I recognize the potential irony of writing that statement in a new blog post about these topics.

I suspect that one of the most durable outcomes of the Code for America project will be this kind of projection or radiation of civic empowerment through - and beyond - the efforts of the CfA fellows and their collaboration partners. In The Wealth of Networks, Yochai Benkler writes about how "[t]he practice of producing culture makes us all more sophisticated readers, viewers, and listeners, as well as more engaged makers". In Program or Be Programmed, Doug Rushkoff warns against "relinquishing our nascent collective agency" to computers and the people who program them, and urges us to engage in "a renaissance of human capacity" by becoming programmers ourselves.

While many - or even most - of the specific applications we designed and developed during the Data Camp Seattle civic hackathon may not gain widespread traction and use, if the experience helps more of us shift our thinking - and doing - toward becoming co-creators of civic applications - and civic engagement - then the Code for America project will have succeeded in achieving some grand goals indeed.

[Update: Alex Howard (@digiphile), of O'Reilly Media, has also written a summary of the event.]


Innovation, Research & Reviewing: Revise & Resubmit vs. Rebut for CSCW 2012

Research is about innovation, and yet many aspects of the research process often seem steeped in tradition. Many conference program committees and journal editorial boards - the traditional gatekeepers in research communities - are composed primarily of people with a long history of contributions and/or other well-established credentials, who typically share a collective understanding of how research ought to be conducted, evaluated and reported. Some gatekeepers are opening up to new possibilities for innovations in the research process, and one such community is the program committee for CSCW 2012, the ACM Conference on Computer Supported Cooperative Work ... or as I (and some other instigators) like to call it, Computer-Supported Cooperative Whatever.

This year, CSCW is introducing a new dimension to the review process for Papers & Notes [deadline: June 3]. In keeping with tradition, researchers and practitioners involved in innovative uses of technology to enable or enhance communication, collaboration, information sharing and coordination are invited to submit 10-page papers and/or 4-page notes describing their work. The CSCW tradition of a double-blind review process will also continue, in which the anonymous submissions are reviewed by at least three anonymous peers (the program committee knows the identities of authors and reviewers, but the authors and reviewers do not know each other's identities). These external reviewers assess the submitted paper or note's prospective contributions to the field, and recommend acceptance or rejection of the submission for publication in the proceedings and presentation at the conference. What's new this year is an addition to the traditional straight-up accept or reject recommendation categories: reviewers will be asked to consider whether a submission might fit into a new middle category, revise & resubmit.

CSCW, CHI and other conferences have enhanced their review processes in recent years by offering authors an opportunity to respond with a rebuttal, in which they may clarify aspects of the submission - and its contribution(s) - that were not clear to the reviewers [aside: I recently shared some reflections on reviews, rebuttals and respect based on my experience at CSCW and CHI]. For papers that are not clear accepts (with uniformly high ratings among reviewers) - or clear rejects (uniformly low ratings) - the program committee must make a judgment call on whether the clarifications proposed in a rebuttal would represent a sufficient level of contribution in a revised paper, and whether the paper could be reasonably expected to be revised in the short window of time before the final, camera-ready version of the paper must be submitted for publication. The new process will allocate more time to allow the authors of some borderline submissions the opportunity to actually revise the submission rather than limiting them to only proposing revisions.

As the Papers & Notes Co-Chairs explain in their call for participation:

Papers and Notes will undergo two review cycles. After the first review a submission will receive either "Conditional Accept," "Revise/Resubmit," or "Reject." Authors of papers that are not rejected have about 6 weeks to revise and resubmit them. The revision will be reviewed as the basis for the final decision. This is like a journal process, except that it is limited to one revision with a strict deadline.

The primary contact author will be sent the first round reviews. "Conditional Accepts" only require minor revisions and resubmission for a second quick review. "Revise/Resubmits" will require significant attention in preparing the resubmission for the second review. Authors of Conditional Accepts and Revise/Resubmits will be asked to provide a description of how reviewer comments were addressed. Submissions that are rejected in the first round cannot be revised for CSCW 2012, but authors can begin reworking them for submission elsewhere. Authors need to allocate time for revisions after July 22, when the first round reviews are returned [the deadline for initial submissions is June 3]. Final acceptance decisions will be based on the second submission, even for Conditional Accepts.

Although the new process includes a revision cycle for about half of the submissions, community input and analysis of CSCW 2011 data has allowed us to streamline the process. It should mean less work for most authors, reviewers, and AC members.

The revision cycle enables authors to spend a month to fix the English, integrating missing papers in the literature, redoing an analysis, or adopt terminology familiar to this field, problems that in the past could lead to rejection. It also provides the authors of papers that would have been accepted anyway to fix minor things noted by reviewers.

This new process is designed to increase the number and diversity of papers accepted into the final program. Some members of the community - especially those in academia - may be concerned that increasing the quantity may decrease the [perceived] quality of submissions, i.e., instead of the "top" 20% of papers being accepted, perhaps as many as 30% (or more) may be accepted (and thus the papers and notes that are accepted won't "count" as much). However, if the quality of that top 30% (or more) is improved through the revision and resubmission process, then it is hoped that the quality of the program will not be adversely affected by the larger number of accepted papers presented there ... and will actually be positively affected by the broader range of accepted papers.

I often like to reflect on Ralph Waldo Emerson's observation:

All life is an experiment. The more experiments you make the better.

If research - and innovation - is about experimentation, then it certainly makes sense to experiment with the ways that experiments are assessed by the research communities to which they may contribute new insights and knowledge.

BeingWrongBook There is a fundamental tension between rigorous validation and innovative exploration. Maintaining high standards is important to ensuring the trustworthiness of science, especially in light of the growing skepticism about science among some segments of the public. But scientists and other innovators who blaze new trails often find it challenging to validate their most far-reaching ideas to the satisfaction of traditional gatekeepers, and so many conferences and journals tend to be filled with more incremental - and more easily validated - results. This is not necessarily a bad thing, as many far-reaching ideas turn out to be wrong, but I increasingly believe that all studies and models are wrong, but some are useful, and so opening up new or existing channels for reviewing and reporting research will promote greater innovation.

I'm encouraged by the breadth and depth of conversations, conversions and alternatives I've encountered regarding research and its effective dissemination, including First Monday, arXiv and alt.chi. At least one other ACM-sponsored research community - UIST (the ACM Symposium on User Interface Software & Technology) - is also considering changes to its review process; Tessa Lau recently wrote about that in a blog post at Communications of the ACM, Rethinking the Systems Review Process (which, unfortunately, is now behind the ACM paywall ... another issue relevant to disseminating research). The prestigious journal Nature recently wrote about the ways social media is influencing scientific research in an article on Peer Review: Trial by Twitter.

I think it is especially important for a conference like CSCW that is dedicated to innovations in communication, collaboration, coordination and information sharing (which [obviously] includes social media) to be experimenting with alternatives, and I look forward to participating in the upcoming journey of discovery. And in the interest of full disclosure, one way I am participating in this journey is as one of the Publicity Co-Chairs for CSCW 2012, but I would be writing about this innovation even if I were not serving in that official capacity.

[Update: Jonathan Grudin, one of the CSCW 2012 Papers & Notes Co-Chairs, has written an excellent overview of the history and motivations of the revise and resubmit process in a Communications of the ACM article on Technology, Conferences and Community: Considering the impact and implications of changes in scholarly communication.]


Hadoop Day in Seattle: Hadoop, Cascading, Hive and Pig

image from hadoopday.org I attended Hadoop Day - a community event to spread the love of Hadoop and Big Data - at Amazon's Pac-Med building in Seattle a week ago. I missed the morning session of the event, but recently became better acquainted with some of the dimensions of this space via the excellent overview and analysis by Mike Loukides at O'Reilly Radar, What Is Data Science? The afternoon "Introductory Track" included presentations about a number of tools for processing large data sets - Hadoop, Cascading, Hive and Pig - by large and small companies involved with big data - KarmaSphere, Drawn to Scale, Facebook and Yahoo. The session was intended as a hands-on learning opportunity, but due in part to poor network connectivity, it ended up being mostly an eyes- and ears-on educational event (but still very worthwhile).

karmasphere Abe Taha, VP of Engineering for Karmasphere, started the afternoon session with 0-60: Hadoop Development in 60 Minutes or Less (slides embedded below), which offered a great general introduction to Hadoop, a preview of the other tools that would be presented in later sessions (from different levels of the Hadoop stack) and an appropriately scaled (i.e., relatively brief and informative) demonstration of the Karmasphere Studio tool.

Abe led off with the motivation behind Hadoop: the need for a scalable tool for discovering insights (or at least patterns) in ever-increasing collections of data, such as logs of web site traffic. Hadoop embodies the MapReduce paradigm, in which data is represented as records or tuples, and computation is broken down into mapping - computing some function over a subset of tuples - and reducing - combining the results of applying the mapping function to the different subsets. The power of Hadoop comes from being able to farm out the functions and different data subsets across a cluster of computers, potentially increasing the speed of deriving a result.

image from shakespeare.mit.edu Simple examples were offered to illustrate how Hadoop works, e.g., computing the maximum of a set of numbers, adding a set of numbers, and counting the occurrences of words in a large text or collection of texts (e.g., The Complete Works of William Shakespeare). After reviewing how these data sets might be represented in Hadoop, Abe provided some Java code to illustrate how the map and reduce functions could be implemented to process them (these code segments are included in the slides). Although the poor network connectivity precluded running the code during the session, the clear presentation and simple examples left a relative newcomer like me with the sense that "I can do this" (which I believe was the main objective for the day).
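
To make the word-count example a bit more concrete, here is a minimal sketch of the same idea in Python, written in the style of Hadoop Streaming rather than the Java API shown in Abe's slides: the mapper emits a (word, 1) pair for every word, and the reducer sums the counts for each word after the pairs have been sorted. It can be tried locally, without a cluster, by piping a text file through the mapper, sort, and the reducer.

    # wordcount.py - a minimal word-count sketch in the MapReduce style,
    # written for Hadoop Streaming (mapper and reducer read stdin, write stdout).
    # Local test: cat shakespeare.txt | python wordcount.py map | sort | python wordcount.py reduce
    import sys

    def mapper():
        # Emit a (word, 1) pair for every word on every input line.
        for line in sys.stdin:
            for word in line.strip().lower().split():
                print("%s\t%d" % (word, 1))

    def reducer():
        # Input arrives sorted by word, so counts for the same word are adjacent.
        current, total = None, 0
        for line in sys.stdin:
            word, count = line.rsplit("\t", 1)
            if word != current:
                if current is not None:
                    print("%s\t%d" % (current, total))
                current, total = word, 0
            total += int(count)
        if current is not None:
            print("%s\t%d" % (current, total))

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()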

Over half of Abe's slides were on Karmasphere Studio (starting around slide #26 (out of 66)), and the way it can help address some of the problems with overhead in Hadoop, particularly with respect to allowing debugging, prototyping and testing without having to deploy to a cluster of computers. However, only about a quarter of the hour-long presentation was devoted to the tool, and given that the Community Edition of Karmasphere Studio is available for free, I thought he achieved the right balance between covering Hadoop fundamentals and demonstrating a tool for using Hadoop.

image from drawntoscalehq.com Next up was Bradford Stephens, founder of Drawn to Scale and organizer of the event, who presented an Introduction to Cascading. Cascading is a layer on top of Hadoop that allows users to think and work at a higher level of abstraction, focusing on workflows rather than mapping and reducing functions. Cascading offers a collection of operations - functions, filters and aggregators - that can be used in conjunction with any Java Virtual Machine-based language. Bradford showed a 15-line sample of Cascading code for processing Apache web server logs, and an equivalent 200-line Java program that does the same thing.

Bradford offered the most interactive exercise of the afternoon, showing us some Cascading code to process New York Stock Exchange closing prices, and inviting us to help him write the code that would find the symbol and price of the stock with the highest closing price for each of the days represented in the dataset. I cannot find the slides for Bradford's talk, but the code and data he used in the examples are available at the main Hadoop Day site.
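
Since Cascading itself is a Java library, here is a plain-Python stand-in for Bradford's exercise, just to convey the shape of the computation: group the closing prices by date and keep the symbol with the maximum close for each day. The file name and column positions below are assumptions for illustration; the actual layout of the Hadoop Day dataset may differ.

    # A plain-Python stand-in for the Cascading exercise: for each trading day,
    # find the symbol with the highest closing price. Assumes a comma-delimited
    # file with no header row; the file name and column positions are
    # assumptions, not the actual NYSE file layout from the event.
    import csv
    from collections import defaultdict

    def max_close_per_day(path, symbol_col=1, date_col=2, close_col=6):
        best = defaultdict(lambda: (None, float("-inf")))  # date -> (symbol, close)
        with open(path) as f:
            for row in csv.reader(f):
                date, symbol, close = row[date_col], row[symbol_col], float(row[close_col])
                if close > best[date][1]:
                    best[date] = (symbol, close)
        return dict(best)

    if __name__ == "__main__":
        for date, (symbol, close) in sorted(max_close_per_day("NYSE_daily.csv").items()):
            print("%s\t%s\t%.2f" % (date, symbol, close))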

image from hadoop.apache.org After a break, Ning Zhang, a software engineer at Facebook, presented an introduction to Hive entitled A Petabyte Scale Data Warehousing System on Hadoop (slides embedded below).

Facebook_logo Ning presented some statistics about Facebook I'd heard or read elsewhere - e.g., 500M monthly active users, 130 friends per user (on average) - along with several I had not known before:

  • 250 million daily active users
  • 160 million active objects (groups, events, pages)
  • 60 object (group/event/page) connections per user (on average)
  • 500 billion minutes per month spent on the site across all users
  • 25 billion content items are shared per month across all users
  • 70 content items created per user per month (on average)
  • 200 GB of data/day was being generated on the site in March 2008
  • 12+ TB of data/day was being generated by end of 2009, growing by 8x annually

In addition to the increasing demands of users, Facebook application developers and advertisers want feedback on their apps and ads. Facebook decided against using closed, proprietary systems due to issues of cost, scalability and length of the development and release cycles. They considered using Hadoop, but wanted something that provided a higher level of abstraction and used the kinds of schemas traditionally provided in relational database management systems (RDBMS). Hive provides the capability to express operations in SQL and have them translated into the MapReduce framework, and provides extensive support for optimizations ... a dimension that is increasingly important for a company with increasingly big data needs.
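
To illustrate the kind of translation Hive performs automatically, here is a hand-written Python sketch of how a simple SQL-style GROUP BY / SUM decomposes into a map phase (emit a key-value pair per record) and a reduce phase (aggregate the values for each key). The "ad_events" records and column names are made up for illustration; they are not Facebook's actual schema.

    # A hand-written illustration of what Hive does automatically: a SQL-style
    # GROUP BY / SUM is compiled into a map phase that emits (key, value) pairs
    # and a reduce phase that aggregates them. The records below are invented.
    from collections import defaultdict

    # Roughly: SELECT ad_id, SUM(clicks) FROM ad_events GROUP BY ad_id
    ad_events = [
        {"ad_id": "a1", "clicks": 3},
        {"ad_id": "a2", "clicks": 1},
        {"ad_id": "a1", "clicks": 5},
    ]

    def map_phase(records):
        # Emit one (group key, value) pair per record.
        for record in records:
            yield record["ad_id"], record["clicks"]

    def reduce_phase(pairs):
        # Aggregate the values for each key, as the SUM(...) would.
        totals = defaultdict(int)
        for key, value in pairs:
            totals[key] += value
        return dict(totals)

    print(reduce_phase(map_phase(ad_events)))   # {'a1': 8, 'a2': 1}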

image from hadoop.apache.org Alan Gates, a software architect on the Yahoo! grid team, led off his talk on Pig, Making Hadoop Easy (slides embedded below), with a motivating example: a roughly 200 (?) line Java program using Hadoop directly to find the top 5 sites visited by 18-25 year olds could be written as a 10-line program in Pig Latin. Pig represents a middle way between straight Hadoop and the higher level abstractions provided by Cascading and Hive, providing the capability to program in a higher level scripting language (i.e., higher level than Java) while still being able to define elements procedurally (vs. the declarative definitions typical of SQL-oriented frameworks).
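
Pig Latin has its own syntax, so here is a rough Python rendering of the pipeline in Alan's motivating example - load users and page views, keep the 18-25 year olds, join on user name, count page views per URL, and keep the top five. The file names and column layouts are assumptions for illustration, not the actual data from the talk.

    # A plain-Python rendering of the pipeline in Alan's motivating example:
    # load users and page views, keep 18-25 year olds, join on user name,
    # count page views per URL, and keep the top five. File names and column
    # layouts below are assumptions for illustration only.
    from collections import Counter

    def top5_sites(users_path="users.txt", pages_path="pages.txt"):
        # users.txt: name<TAB>age    pages.txt: name<TAB>url
        young = set()
        with open(users_path) as f:
            for line in f:
                name, age = line.rstrip("\n").split("\t")
                if 18 <= int(age) <= 25:
                    young.add(name)
        counts = Counter()
        with open(pages_path) as f:
            for line in f:
                name, url = line.rstrip("\n").split("\t")
                if name in young:
                    counts[url] += 1
        return counts.most_common(5)

    if __name__ == "__main__":
        for url, clicks in top5_sites():
            print("%s\t%d" % (url, clicks))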

Yahoo_logo Alan recently wrote a blog post on Pig and Hive at Yahoo! in which he delves more deeply into the similarities and differences between the two frameworks, both of which have their place(s) in the realm of data warehousing. Data processing typically involves three phases: data collection, data preparation, and data presentation. Pig is particularly well suited to three tasks involved in the data preparation phase (aka Extract Transform Load (ETL) or data factory):

According to Alan, Hive is well suited to the data presentation phase, in which business intelligence analysis and ad hoc queries may be better accommodated by a language that directly supports SQL. It seems to me that an argument could be made that these tasks might also be categorized as research, though perhaps the differentiation between phases lies more in the types of questions one might most easily be able to ask (and answer). In any case, the data and code used by Alan in his talk are also available on the Hadoop Day site.

Alan Gates at Hadoop Day In addition to the interesting presentations, there were some other interesting things I noted about the group at the event. I would estimate that 95% of the attendees were male - a much higher proportion than at the events I typically attend, which focus more on human-computer interaction and social computing - and the proportion of Macs was much lower than at other events I typically attend - perhaps 30%, with nearly 50% of the laptops I saw being Thinkpads. The presentations and presenters were great, as were the view and the food; the only downside was the poor wireless connectivity ... which was somewhat surprising, given the site (Amazon), but was probably due to the need to rig an ad-hoc network outside the firewall just for the event.

All in all, it was a very worthwhile day, and I'm grateful to the organizers, the sponsors - Amazon Web Services, Fabric Worldwide, Cloudera and Karmasphere - and all the presenters for putting the event together. There is already some talk about holding another Eastside Networking Event, a technology-oriented event with several representatives from the local big data community (Amazon Web Services, Microsoft Windows Azure and Facebook [Seattle]) that I wrote about in my last post. I don't know whether there will be future day-long Hadoop events in Seattle, but there are monthly Hadoop meetups in Seattle which are also organized by Bradford; the next meeting of the group is this Wednesday.