When I was working on natural language processing and speech recognition systems in the 90s, one of our mantras was "there's no data like more data", i.e., all things being equal, the accuracy of recognition tends to increase with the addition of more labeled data. The Linguistic Data Consortium at the University of Pennsylvania was [and, I suspect, still is] the primary source for labeled text and speech data, and it was available - for a fee - to all members, most of whom were researchers and developers in academia and industry. Three developments over the past week have prompted a reflection on the broader power of data ... and the people and organizations that have access to it.
The first development was a series of recent announcements about the broader availability of Twitter data. One announcement was that the U.S. Library of Congress was acquiring the entire Twitter public archive:
Every public tweet, ever, since Twitter’s inception in March 2006, will be archived digitally at the Library of Congress. That’s a LOT of tweets, by the way: Twitter processes more than 50 million tweets every day, with the total numbering in the billions.
We thought it fitting to give the initial heads-up to the Twitter community itself via our own feed @librarycongress. (By the way, out of sheer coincidence, the announcement comes on the same day our own number of feed-followers has surpassed 50,000. I love serendipity!)
We will also be putting out a press release later with even more details and quotes. Expect to see an emphasis on the scholarly and research implications of the acquisition.
On the one hand, I believe this is a very positive development. Google, Yahoo and Microsoft all pay for real-time access to the Twitter "firehose", and now researchers and developers with shallower pockets will be able to access the entire Twitter public data archive ... after some yet-to-be-announced delay (it's not clear when the archive will become available, how often it will be updated, or how often developers or their applications will be able to access it).
A related development, also announced at Twitter's recent developer conference (Chirp), was that Twitter is offering a Streaming API to supplement its REST API and Search API. As with the other APIs, there are limitations imposed on its use, lest fail whales become a significantly more common sight, but this still represents a positive development in making more data more openly accessible.
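Whichever API a developer uses, each status arrives as a JSON object. A minimal sketch of handling one status line as a stream would deliver it; the payload fields here are illustrative of that era's format, not an authoritative schema:

```python
import json

# A streaming endpoint (e.g. the historical stream.twitter.com sample
# stream) delivers one JSON-encoded status per line. The field names
# below follow the Twitter status payload of the time.
def parse_status(line):
    """Extract (screen_name, text) from one JSON status line."""
    status = json.loads(line)
    user = status.get("user", {})
    return user.get("screen_name"), status.get("text")

# Illustrative payload; a real stream would deliver one such line per status.
sample = '{"user": {"screen_name": "librarycongress"}, "text": "Archiving tweets"}'
print(parse_status(sample))  # ('librarycongress', 'Archiving tweets')
```

The same parsing function works against the REST and Search APIs, since they return the same status objects; only the transport differs.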
However, the co-occurrence of these announcements with speculation about President Obama's next nominee for the U.S. Supreme Court reminded me of the release of video rental data during the confirmation hearings for Robert Bork during the Reagan administration; although this data did not seem to affect the outcome, its release did lead to the Video Privacy Protection Act in 1988. Belated revelation of alleged pornographic video rental data shortly after the confirmation of Supreme Court Justice Clarence Thomas in 1991, during the George H. W. Bush administration, has given rise to speculation about whether Thomas would have been confirmed had this evidence been made available earlier in the process.
I don't want to draw too strong an analogy between private video rental records and public tweets. But given the broadening range of web services that enable people to automatically update their status[es] about their use of those services (e.g., Netflix users can automatically post their movie ratings on Facebook), I find myself speculating about how the Twitter archive might affect future judicial nominations and/or future elections for political offices. Given my biases toward a more transparent society, though, I suppose that if the data is out there, I'd rather have it publicly available than have limited access to it.
And speaking of sharing updates and other data across web services, the second recent development in the realm of open data to give me pause was the set of announcements at Facebook's developer conference (f8) last week. VentureBeat's f8 roundup offers a nice summary of these announcements, which included a Graph API and a "like" button that can be used on any web site ... vastly increasing the prospects for personalization and sociality across the web ... and placing Facebook squarely in the center of this hyperpersonalized and hypersocialized network. Lili Cheng, of Microsoft's FUSE Labs, wrote about the first Facebook partnership announced - and demonstrated - during the keynote at f8, a new Facebook app for sharing documents created by her group.
As with the Twitter announcement, I see many positive possibilities in these developments, but I see an even darker shadow being cast by the Facebook announcements. Marshall Kirkpatrick at ReadWriteWeb articulated some of my concerns in a post asking Is the New Facebook a Deal with the Devil?
Facebook blew people's minds today at its F8 developer conference but one sentiment that keeps coming up is: this is scary. The company unveiled simple, powerful plans to offer instant personalization on sites all over the web, it kicked off meaningful adoption of the Semantic Web with the snap of the fingers, it revolutionized the relationship between the cookie and the log-in, it probably knocked a whole class of recommendation technology startups that don't offer built-in distribution to 400 million people right out of the market. It popularized social bookmarking and made subscribing to feeds around the web easier than ever before. And it may have created the biggest disruption to web traffic analytics in years: demographically verified visitor stats tied to people's real identities. There was so much big news that the analytics part didn't even come up in the keynote.
This is so much new technology and it's tied in so closely with one very powerful company that there is big reason to stop and consider the possible implications. There are reasons to be scared. The bargain Facebook offers is very, very compelling - but it's not a clear win for the web.
Pete Cashmore at Mashable offers a somewhat less apprehensive, or perhaps simply more capitalistic, perspective on these developments, Shocker: Facebook Does What’s Best For Facebook:
Facebook is building a database of the world’s preferences, but won’t give others access unless they promote Facebook on their sites (by using Facebook logins). ...
So Facebook is building a database of information about you, but you don’t really own it: Facebook does. ...
Bottom line: when a company solves a problem, should we be surprised that they solve it in a way that creates value for both customers and the company itself? Isn’t that how capitalism works?
Since then, I've read other commentaries that present a less apprehensive view of these developments, e.g., a comment by Austin on a post in TechCrunch Europe on Privacy issues? Google engineers leaving Facebook in droves:
There are two things going on here:
1. An iFrame on sites that points to Facebook. The iframe request is data loaded so it knows where the user came from. Facebook shows activity and friends that have interacted with the site but the data IS NOT shared. You have to be logged into facebook for it to work. It LOOKs like it is on that site but it isn’t. It is a little window into facebook on a different page.
2. Applications can ask users for access to their data through the service formerly known as ‘connect’. Each and every user has to agree to share the data. If you don’t want to share then don’t use the App.
Facebook isn’t doing anything differently then they did before, it is just easier and more integrated.
Although a subsequent commenter posted an unsubstantiated and rather abusive allegation that Austin works for Facebook (Austin's username is linked to Aqumin, a financial data and analysis firm), no one rebutted his argument.
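Austin's first point, that the Like button is just a small iframe window back into Facebook, comes down to URL construction on the host site. The endpoint and parameter names below follow the social-plugin conventions of the time, and should be treated as illustrative rather than a definitive reference:

```python
from urllib.parse import urlencode

# Parameter names (href, layout, width) follow the social-plugin
# conventions of the era; treat them as illustrative.
def like_button_src(page_url, layout="standard", width=450):
    """Build the iframe src for a Like button pointing back at Facebook."""
    query = urlencode({"href": page_url, "layout": layout, "width": width})
    return "https://www.facebook.com/plugins/like.php?" + query

src = like_button_src("https://example.com/article")
print(src)
```

The key design point Austin makes is visible here: the host page only names itself (`href`); the visitor's identity and friend data stay inside the iframe, on Facebook's domain.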
Another positive perspective was presented in an O'Reilly Radar post by David Recordon - who does work for Facebook - on Why f8 was good for the open web:
- No more 24-hour caching limit (as long as developers using Facebook API data are keeping it up to date and agree to remove it at a user's request).
- An API that is realtime and isn't just about content (developers can subscribe to changes).
- The Open Graph protocol benefits the web, not just Facebook, and is licensed under the Open Web Foundation Agreement.
- Support for OAuth 2.0
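The Graph API itself follows a simple pattern: every object gets a URL under graph.facebook.com and comes back as a flat JSON object. A minimal sketch, in which the object id and the response payload are illustrative rather than taken from a live call:

```python
import json
from urllib.parse import urlencode

GRAPH_ROOT = "https://graph.facebook.com"

def graph_url(object_id, access_token=None):
    """Build a Graph API URL for one object; a token is optional for public data."""
    url = "%s/%s" % (GRAPH_ROOT, object_id)
    if access_token:
        url += "?" + urlencode({"access_token": access_token})
    return url

# A Graph API response is a flat JSON object describing the node.
sample_response = '{"id": "19292868552", "name": "Facebook Platform"}'
node = json.loads(sample_response)
print(graph_url(node["id"]))
print(node["name"])
```

The uniformity is the point: users, pages, photos, and events all live at the same kind of URL and return the same kind of object, which is what makes the realtime subscription model in the second bullet feasible.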
I discovered Dave's post via Tim O'Reilly's tweet, and since Tim is one of the most prominent proponents of the open web, his endorsement carries a great deal of weight (for me). He also tweeted a link to another positive perspective on the Facebook announcements, by Fred Wilson, a partner at Union Square Ventures, who raised doubts about One Graph to Rule Them All?:
These other social graphs [Twitter, Tumblr, Foursquare, Disqus, GetGlue, and others (remember del.ico.us?)] can and will grow in the wake of Facebook. I am not sure if Facebook's ambition is to create the one social graph to rule them all but if it is, I don't think they will succeed with that. If it is to empower the creation of many social graphs for various activities and to be in the center of that activity and driving it, I think they are already there and will continue to be there for many years to come.
And referencing Tim brings me to the third (and final) recent development I wanted to mention regarding open data: his keynote on where open source and open data are going in the age of the cloud at the 2010 O'Reilly MySQL Conference and Expo last week. Some of the issues he raised in his talk are reflected in a blog entry he posted last month on The State of the Internet Operating System (a "part 2" followup is promised soon). If I were to highlight one theme from the keynote, it is his statement that the future actually belongs to the data, not the database. I'll highlight a few of his more specific observations and insights below.
The 21st century data challenge is how to deliver algorithmic, real-time, cloud-based intelligence to mobile applications. This cloud future includes...
- Devices acting as sensors for intelligent data collection
- Devices whose UI is on the web rather than the device
- Feeding data into multiple online services that will turn into a full-on sensor web
- Setting the stage for robotics, augmented reality and the next generation of personal electronics
The Internet Operating System is a Data Operating System:
- It helps applications find out about things, and helps people interact with them, through services
- Matching and recognition
Referencing an earlier blog post on The War for the Web, Tim asked "Who will own the Internet Operating System? Do we want anyone to own it? If not, we better get busy."
Invoking concepts from Wall Street, via the Money:Tech conference ("Where Web 2.0 meets Wall Street"), and applying them to the prospects for the open web, Tim noted that some financial companies that started out as brokers began trading for their own accounts, against their customers, and warned us to watch for this behavior on the Internet: "The giants of the internet are trading for their own accounts, building a platform on which all roads lead back to themselves."
Noting that each of the players (giants) in "the Internet Operating System game" tends to embrace open source for their own strategic reasons and is giving away something that is valuable to someone else, Tim suggested that we may see "some interesting open source moves around Microsoft's Bing search engine", and offered a partial list of potential open source supporters in different application areas:
- Search: Microsoft
- Maps: Microsoft, Nokia, Yelp, Foursquare
- Speech: Nuance, Microsoft
- Social Graph: Google
- Payment: Paypal
- Cloud infrastructure: VMware
- Smartphones: Google
- Device Operating Systems: Google
Shifting his attention from industry to government, Tim presented some Open Government Data Principles - relevant to data sharing by anyone - composed by a group of 30 leading strategic thinkers. Interestingly, this group did not include any government representatives.
Government data shall be considered open if it is made public in a way that complies with the principles below:
- Complete: All public data is made available. Public data is data that is not subject to valid privacy, security or privilege limitations.
- Primary: Data is as collected at the source, with the highest possible level of granularity, not in aggregate or modified forms.
- Timely: Data is made available as quickly as necessary to preserve the value of the data.
- Accessible: Data is available to the widest range of users for the widest range of purposes.
- Machine processable: Data is reasonably structured to allow automated processing.
- Non-discriminatory: Data is available to anyone, with no requirement of registration.
- Non-proprietary: Data is available in a format over which no entity has exclusive control.
- License-free: Data is not subject to any copyright, patent, trademark or trade secret regulation. Reasonable privacy, security and privilege restrictions may be allowed.
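Taken together, the eight principles read like a checklist, and that is how tooling could treat them. A hypothetical sketch of scoring a dataset's metadata against the principles; the metadata field names are made up for illustration, and a real catalog would map its own fields onto them:

```python
# The eight Open Government Data Principles, reduced to a checklist.
PRINCIPLES = ["complete", "primary", "timely", "accessible",
              "machine_processable", "non_discriminatory",
              "non_proprietary", "license_free"]

def openness_report(metadata):
    """Return the subset of principles this dataset's metadata claims to satisfy."""
    return [p for p in PRINCIPLES if metadata.get(p, False)]

# Hypothetical dataset metadata: everything satisfied except timeliness.
dataset = {"complete": True, "primary": True, "timely": False,
           "accessible": True, "machine_processable": True,
           "non_discriminatory": True, "non_proprietary": True,
           "license_free": True}

missing = [p for p in PRINCIPLES if p not in openness_report(dataset)]
print(missing)  # ['timely']
```

Even a toy check like this makes the gap concrete: a dataset can satisfy seven principles and still lose most of its value by arriving too late.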
Toward the end of his talk, Tim referenced a recent O'Reilly Radar blog post by Nat Torkington on Truly Open Data, in which Nat notes that we have to build some tools to support open data, e.g., tools for provisioning and tracking. In short, we need to make it as easy to share data as it is to share code in the open source movement. So maybe a more appropriate title for this post would be "There's no data like more open data and tools" ... but I think I'll save that for a future followup post.