April 2010

There's no data like more open data

When I was working on natural language processing and speech recognition systems in the 90s, one of our mantras was "there's no data like more data", i.e., all things being equal, the accuracy of recognition tends to increase with the addition of more labeled data. The Linguistic Data Consortium at the University of Pennsylvania was [and, I suspect, still is] the primary source for labeled text and speech data, and it was available - for a fee - to all members, most of whom were researchers and developers in academia and industry. Three developments in the past week have prompted a reflection on the broader power of data ... and the people and organizations that have access to it.

The first development was a series of recent announcements about the broader availability of Twitter data. One announcement was that the U.S. Library of Congress was acquiring the entire Twitter public archive:

Every public tweet, ever, since Twitter’s inception in March 2006, will be archived digitally at the Library of Congress. That’s a LOT of tweets, by the way: Twitter processes more than 50 million tweets every day, with the total numbering in the billions.

We thought it fitting to give the initial heads-up to the Twitter community itself via our own feed @librarycongress. (By the way, out of sheer coincidence, the announcement comes on the same day our own number of feed-followers has surpassed 50,000. I love serendipity!)

We will also be putting out a press release later with even more details and quotes. Expect to see an emphasis on the scholarly and research implications of the acquisition.

On the one hand, I believe this is a very positive development. Google, Yahoo and Microsoft all pay for real-time access to the Twitter "firehose", and now researchers and developers with shallower pockets will be able to access the entire Twitter public data archive ... after some yet-to-be-announced delay (it's not clear when the archive will become available, how often it will be updated, or how often developers or their applications will be able to access it).

A related development, also announced during Twitter's recent developer conference (Chirp), was that Twitter is offering a stream API to supplement its REST API and Search API. As with the other APIs, there are limitations imposed on its use, lest fail whales become a significantly more common sight, but this still represents a positive development in making more data more openly accessible.

However, the co-occurrence of these announcements with speculation about President Obama's next nominee for the U.S. Supreme Court reminded me of the release of video rental data during the confirmation hearings for Robert Bork during the Reagan administration; although this data did not seem to affect the outcome, its release did lead to the Video Privacy Protection Act in 1988. Belated revelation of alleged pornographic video rental data shortly after the confirmation of Supreme Court Justice Clarence Thomas in 1991, during the George H. W. Bush administration, has given rise to speculation about whether Thomas would have been confirmed had this evidence been made available earlier in the process.

I don't want to draw too strong an analogy between private video rental records and public tweets, but given the broadening range of web services that enable people to automatically update their status[es] about their use of those services (e.g., Netflix users can automatically post their movie ratings on Facebook), I find myself speculating about how the Twitter archive might affect future judicial nominations and/or future elections for political offices ... but given my biases toward a more transparent society, I suppose that if the data is out there, I'd rather have it publicly available than have limited access to it.

And speaking of sharing updates and other data across web services, the second recent development in the realm of open data to give me pause was the set of announcements at the Facebook developer conference (f8) last week. VentureBeat's f8 roundup offers a nice summary of these announcements, which included a Graph API and a "like" button that can be used on any web site ... vastly increasing the prospects for personalization and sociality across the web ... and placing Facebook squarely in the center of this hyperpersonalized and hypersocialized network. Lili Cheng, of Microsoft's FUSE Labs, wrote about the first Facebook partnership announced - and demonstrated - during the keynote at f8, a new Facebook app for sharing documents created by her group.

As with the Twitter announcement, I see many positive possibilities in these developments, but I see an even darker shadow being cast by the Facebook announcements. Marshall Kirkpatrick at ReadWriteWeb articulated some of my concerns in a post asking Is the New Facebook a Deal with the Devil?

Facebook blew people's minds today at its F8 developer conference but one sentiment that keeps coming up is: this is scary. The company unveiled simple, powerful plans to offer instant personalization on sites all over the web, it kicked off meaningful adoption of the Semantic Web with the snap of the fingers, it revolutionized the relationship between the cookie and the log-in, it probably knocked a whole class of recommendation technology startups that don't offer built-in distribution to 400 million people right out of the market. It popularized social bookmarking and made subscribing to feeds around the web easier than ever before. And it may have created the biggest disruption to web traffic analytics in years: demographically verified visitor stats tied to people's real identities. There was so much big news that the analytics part didn't even come up in the keynote.

This is so much new technology and it's tied in so closely with one very powerful company that there is big reason to stop and consider the possible implications. There are reasons to be scared. The bargain Facebook offers is very, very compelling - but it's not a clear win for the web.

Pete Cashmore at Mashable offers a somewhat less apprehensive, or perhaps simply more capitalistic, perspective on these developments, Shocker: Facebook Does What’s Best For Facebook:

Facebook is building a database of the world’s preferences, but won’t give others access unless they promote Facebook on their sites (by using Facebook logins). ...

So Facebook is building a database of information about you, but you don’t really own it: Facebook does. ...

Bottom line: when a company solves a problem, should we be surprised that they solve it in a way that creates value for both customers and the company itself? Isn’t that how capitalism works?

Since then, I've read other commentaries that present a less apprehensive view of these developments, e.g., a comment by Austin on a post in TechCrunch Europe on Privacy issues? Google engineers leaving Facebook in droves:

There are two things going on here:

1. An iFrame on sites that points to Facebook. The iframe request is data loaded so it knows where the user came from. Facebook shows activity and friends that have interacted with the site but the data IS NOT shared. You have to be logged into facebook for it to work. It LOOKs like it is on that site but it isn’t. It is a little window into facebook on a different page.

2. Applications can ask users for access to their data through the service formerly known as ‘connect’. Each and every user has to agree to share the data. If you don’t want to share then don’t use the App.

Facebook isn’t doing anything differently then they did before, it is just easier and more integrated.

Although a subsequent commenter posted an unsubstantiated and rather abusive allegation that Austin works for Facebook (Austin's username is linked to Aqumin, a financial data and analysis firm), no one rebutted his argument.

Another positive perspective was presented in an O'Reilly Radar post by David Recordon - who does work for Facebook - on Why f8 was good for the open web:

  • No more 24-hour caching limit (as long as developers using Facebook API data are keeping it up to date and agree to remove it at a user's request).
  • An API that is realtime and isn't just about content (developers can subscribe to changes).
  • The Open Graph protocol benefits the web, not just Facebook, and is licensed under the Open Web Foundation Agreement.
  • Support for OAuth 2.0

I discovered Dave's post via Tim O'Reilly's tweet, and since Tim is one of the most prominent proponents of the open web, his endorsement carries a great deal of weight (for me). He also tweeted a link to another positive perspective on the Facebook announcements, by Fred Wilson, a partner at Union Square Ventures, who raised doubts about One Graph to Rule Them All?:

These other social graphs [Twitter, Tumblr, Foursquare, Disqus, GetGlue, and others (remember] can and will grow in the wake of Facebook. I am not sure if Facebook's ambition is to create the one social graph to rule them all but if it is, I don't think they will succeed with that. If it is to empower the creation of many social graphs for various activities and to be in the center of that activity and driving it, I think they are already there and will continue to be there for many years to come.

And referencing Tim brings me to the third (and final) recent development I wanted to mention regarding open data: his keynote on where open source and open data are going in the age of the cloud at the 2010 O'Reilly MySQL Conference and Expo last week. Some of the issues he raised in his talk are reflected in a blog entry he posted last month on The State of the Internet Operating System (a "part 2" followup is promised soon). If I were to highlight one theme from the keynote, it is his statement that the future actually belongs to the data, not the database. I'll highlight a few of his more specific observations and insights below.

The 21st century data challenge is how to deliver algorithmic real-time cloud-based intelligence to mobile applications. This cloud future includes...

  • Devices acting as sensors for intelligent data collection
  • Devices whose UI is on the web rather than the device
  • Feeding data into multiple online services that will turn into a full-on sensor web
  • Setting the stage for robotics, augmented reality and the next generation of personal electronics

The Internet Operating System is a Data Operating System:

  • It helps applications find out about
    • People
    • Places
    • Things
    • Prices
    • Documents
    • Images
    • Sounds
    • Relationships
    • ...
  • and helps people interact with them through services
    • Search
    • Payment
    • Matching and Recognition
    • ...

Referencing an earlier blog post on The War for the Web, Tim asked "Who will own the Internet Operating System? Do we want anyone to own it? If not, we better get busy."

Invoking concepts from Wall Street, via the Money:Tech conference ("Where Web 2.0 meets Wall Street"), and applying them to the prospects for the open web, Tim noted that some financial companies that started out as brokers started trading for their own accounts, against their customers, and warned us to watch for this behavior on the Internet: "The giants of the internet are trading for their own accounts, building a platform on which all roads lead back to themselves."

Noting that each of the players (giants) in "the Internet Operating System game" tends to embrace open source for their own strategic reasons and is giving away something that is valuable to someone else, Tim suggested that we may see "some interesting open source moves around Microsoft's Bing search engine", and offered a partial list of potential open source supporters in different application areas:

  • Search: Microsoft
  • Maps: Microsoft, Nokia, Yelp, Foursquare
  • Speech: Nuance, Microsoft
  • Social Graph: Google
  • Payment: Paypal
  • Cloud infrastructure: VMware
  • Smartphones: Google
  • Device Operating Systems: Google

Shifting his attention from industry to government, Tim presented some Open Government Data Principles - relevant to data sharing by anyone else - composed by a group of 30 other leading strategic thinkers. Interestingly, this group did not include any government representatives.

Government data shall be considered open if it is made public in a way that complies with the principles below:

  1. Complete: All public data is made available. Public data is data that is not subject to valid privacy, security or privilege limitations.
  2. Primary: Data is as collected at the source, with the highest possible level of granularity, not in aggregate or modified forms.
  3. Timely: Data is made available as quickly as necessary to preserve the value of the data.
  4. Accessible: Data is available to the widest range of users for the widest range of purposes.
  5. Machine processable: Data is reasonably structured to allow automated processing.
  6. Non-discriminatory: Data is available to anyone, with no requirement of registration.
  7. Non-proprietary: Data is available in a format over which no entity has exclusive control.
  8. License-free: Data is not subject to any copyright, patent, trademark or trade secret regulation. Reasonable privacy, security and privilege restrictions may be allowed.

Toward the end of his talk, Tim referenced a recent O'Reilly Radar blog post by Nat Torkington on Truly Open Data, in which Nat notes that we have to build some tools to support open data, e.g., tools for provisioning and tracking. In short, we need to make it as easy to share data as it is to share code in the open source movement. So maybe a more appropriate title for this post would be "There's no data like more open data and tools" ... but I think I'll save that for a future followup post.

The further commoditization of Twitter followers

A few months ago, I wrote about the commoditization of Twitter followers, after discovering a number of automated, semi-automated and manual strategies that people - and non-human systems - were employing to artificially boost their Twitter follower counts. My earlier discovery was sparked by noticing some unusual numbers in the profiles of some recent followers of my Twitter stream. My latest discovery of yet another Twitter commoditization tool was similarly sparked by the profile of a new follower - who has since unfollowed me - that listed 1,983 followees, 787 followers and only 6 tweets. Clicking through to the Twitter homepage of this new follower revealed that 3 of these 6 tweets referenced TweetAdder, a tool that promises to "get more followers, instantly".

Automate Twitter Promotion & Marketing

Find and Engage in Like-Minded Twitter Followers & Automate Twitter Posts!

  • Increase Twitter and Site Traffic to your event, charity, service, business, band, or website
  • Find Like Minded Twitter Followers in Seconds
  • Auto Follow Targeted Twitter Profiles
  • Rapidly Increase Niche Twitter followers
  • Multiple Accounts, Unlimited Twitter Profiles
  • Automate Twitter Posts, Scheduled Tweets, Stay active in participation
  • Automate Direct Messages
  • Auto Unfollow, VIP Safe List
  • Deletes Direct Messages
  • Best Twitter Friend Search Capabilities
  • Every Twitter Feature imaginable!
  • Spend time on other tasks while the program works for you
  • Questions? Contact Us!

TweetAdder appears to be slightly less cynical than the fully automatic reciprocal following system I referenced in my earlier post, wherein new users who sign up are automatically followed by all existing users, and automatically reciprocally follow all existing users. However, it does include the phrase "twitter follower bot" in the title field of the image used to promote the product.

[Update, 05-Apr-2012: Twitter has filed a lawsuit against TweetAdder and four other entities I would categorize as providing "spamware as a service".]

TweetAdder, the self-proclaimed "Ferrari of Twitter Friend Adder and Promotion Software", is a semi-automatic follower acquisition tool, relying on the reflexive reciprocal "follow back" response exhibited by a significant proportion of Twitter users (TweetAdder claims that this represents 30%-50% of Twitter users). After purchasing the software, users need to spend some time targeting Twitter users that they want to lure into reciprocally following them, e.g., by specifying keywords, locations and/or other Twitter users whose followers they want to reach. The software purportedly provides for automating tweets and direct messages ... I wonder if future versions will provide for automatic retweets of targeted prospective followers, as I imagine that would be an even more effective lure.
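
As a rough sanity check on that 30%-50% claim, the profile numbers I mentioned above (1,983 followees and 787 followers) can be plugged into a few lines of Python. This is only a back-of-the-envelope sketch: it assumes, unrealistically, that every follower of that account arrived via a reciprocal follow-back, and the variable names are my own, not anything taken from TweetAdder or Twitter.

```python
# Profile numbers quoted earlier in this post.
followees = 1983
followers = 787

# Follow-back range claimed by TweetAdder (30%-50%).
claimed_low, claimed_high = 0.30, 0.50

# Followers we would expect if the claim held for every account followed.
expected_low = followees * claimed_low
expected_high = followees * claimed_high

# Observed ratio, ASSUMING all followers are reciprocal follow-backs.
observed_rate = followers / followees

print(f"Expected followers at the claimed rates: {expected_low:.0f} to {expected_high:.0f}")
print(f"Observed followers per followee: {observed_rate:.1%}")
print("Within claimed range:", claimed_low <= observed_rate <= claimed_high)
```

Under that assumption, the observed ratio is a bit under 40%, which falls inside the range TweetAdder claims ... though one profile is, of course, an anecdote rather than evidence.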


At first, I thought "well, at least this is not yet another Ponzi scheme", but then I found that TweetAdder offers an "affiliates program" in which users are purportedly paid $10 to sign up, 50% commission on direct sales referrals and 10% on affiliates' sales referrals. The TweetAdder purchase page includes an icon for the SC Magazine Awards 2009, "organized to honor the professionals, companies and products that help fend off the myriad security threats confronted in today's corporate world". However, searching for "tweetadder" and "tweet adder" on the SC Magazine site returned 0 results. If SC Magazine does write an article about TweetAdder, I wonder how they would portray the product.

As in my earlier post, I want to explicitly state that this post is intended as a critique, not an endorsement, of such automated Twitter follower acquisition schemes. I was surprised to discover that TweetAdder was endorsed in an NBC News piece by Mike Wendland on Handy apps to help manage your Twitter account. Immediately following a reference to "lots of tips and tricks and scams out there", Wendland says "The best tool I've found is a program called TweetAdder." The end of the piece includes a link to his web site and his Twitter handle (@pcmike). I wonder if @pcmike, who has approximately 6000 followees and 8000 followers, is a member of the TweetAdder affiliates program.

The mainstream media has given considerable attention to a recent Pew Research Center for the People & the Press survey that revealed that Americans have an increasingly negative view of government (25% positive, 65% negative). I think it's important to note, in this context, that the same survey revealed that Americans have an increasingly negative view of the national news media (31% positive, 57% negative) ... and, somewhat ironically, a rather positive view of small businesses (71% positive, 19% negative) and technology companies (68% positive, 18% negative). Perhaps future surveys might break out a new category of "Twitter-based companies" or "social media companies".

Public Negative Views of Institutions, Pew Research, April 2010
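
To make the comparison across institutions a little easier to see, the percentages quoted above can be reduced to a single "net" score (positive minus negative). This is my own simplification for illustration, not a measure reported in the survey, and the dictionary below contains only the four pairs of figures cited in this post.

```python
# (positive %, negative %) for each institution, as quoted in this post.
views = {
    "government": (25, 65),
    "national news media": (31, 57),
    "small businesses": (71, 19),
    "technology companies": (68, 18),
}

# Net score: percentage points positive minus percentage points negative.
net = {name: positive - negative for name, (positive, negative) in views.items()}

# Print from most negative to most positive.
for name, score in sorted(net.items(), key=lambda item: item[1]):
    print(f"{name}: {score:+d}")
```

By this crude measure, government comes out at -40 and the national news media at -26, versus +50 for technology companies and +52 for small businesses.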

Violent communication, emotional contagion, genocide and eliminationism

Last night, I watched a disturbing show on PBS, Worse than War, "the first major documentary to explore the phenomenon of genocide and how we can stop it". Daniel Jonah Goldhagen, narrator of the film and author of the book upon which it is based, argues that contrary to common conceptions of irrational and spontaneous combustion as the cause of genocide, it actually involves careful planning by rational actors, beginning with the identification of a political objective - typically the removal or elimination of an ethnic group - followed by the persistent demonization and vilification of members of that group through violent and virulent communication and other acts.

Goldhagen proposes that genocide could be more properly characterized as eliminationism:

the belief that one's political opponents are "a cancer on the body politic that must be excised — either by separation from the public at large, through censorship or by outright extermination - in order to protect the purity of the nation"

The 2-hour film (which can be viewed online in its entirety) reviews a number of large-scale atrocities - mass murders often accompanied by systematic rapes and other forms of torture - committed during the 20th and 21st centuries in Darfur, Rwanda, the former Republic of Yugoslavia, Cambodia, Guatemala, Armenia and, of course, Nazi Germany.

In nearly every case, the international community did little to stop the atrocities, and many actions - and inaction - of members of the local and global community reminded me of the social roles involved in the circle of bullying I wrote about in my last post (Be Impeccable with Your Word: Confrontation vs. Condescension and Intimidation): bullies, followers or henchmen, supporters or passive bullies, passive supporters or possible bullies, disengaged onlookers, possible defenders and defenders.

One of the most disturbing segments of the film (starting around the 1:03 mark) showed U.N. Peacekeepers in Rwanda abruptly abandoning the Ecole Technique Officielle school in Kigali, in which they had been protecting thousands of Tutsi from homicidal Hutus, who immediately moved in and massacred the unprotected and unarmed Tutsi. Goldhagen claims that the one post-WWII example of significant and effective intervention, the 1999 NATO bombing of former Yugoslavia, resulted in Slobodan Milošević, leader of the Serbian eliminationists, quickly ceasing atrocities and coming to the negotiation table. He argues that the biggest obstacle to preventing genocide is the lack of the will on the part of world leaders.

Throughout the film, I was reminded of the concept of epidemic hysteria or Mass Psychogenic Illness (MPI) that I recently read about in Connected: The Surprising Power of Our Social Networks and How They Shape Our Lives. The authors, Nicholas Christakis and James Fowler, describe several instances of large-scale emotional contagion in which groups of people "catch" emotions from others through direct contact or observation over varying lengths of time. For example, in what has become known as the Tanganyika laughing epidemic, uncontrollable bouts of laughter lasting a few minutes to a few hours spread across a population of several hundred people during the first several months of 1962. Another, more recent, example was several waves of MPI at a high school in McMinnville, TN, during 1998, in which gasoline was purportedly smelled and dozens of people suffered from symptoms of nausea and dizziness; no objective evidence of gasoline or any other physical agent that may have caused the symptoms was ever found. Several other examples are provided, but the important thing I want to note here is that the characteristics that tend to mark episodes of MPI include a highly connected community that tends to be isolated and/or stressed ... characteristics that appear to apply to most, if not all, of the groups of genocide perpetrators depicted in Goldhagen's film.

Toward the end of their book, Christakis and Fowler discuss the "interpersonal spread of criminal behavior as an example of a bad network outcome". As with other viral effects, people observing the commission of a crime - or perhaps its after-effects (e.g., the broken window theory) - may be more likely to commit crimes themselves. They note that "the riskier or more serious the crime, the less likely others are to follow suit (though there can be frenzies of murder too, as in the Rwandan genocide)." Unfortunately, in this context, they do not explore these more serious types of criminal frenzies further.

Another book that came to mind was The Lucifer Effect: Understanding How Good People Turn Evil, by Philip Zimbardo, which reports on - among other things - his [in]famous Stanford Prison Experiment, in which a group of college students were randomly partitioned into groups of prison guards and prisoners and placed within a simulated prison. The experiment, which was intended to last 2 weeks, was stopped after just 6 days due to the unanticipated ferocity and sadism with which the "prison guards" adopted and performed their roles, and the depression and other signs of stress exhibited by those playing the "prisoners". I haven't actually read the book, but based on the broader coverage described in its synopsis, I believe that it provides many insights relevant to the types of genocide - or eliminationism - described in Goldhagen's film, e.g., the strength of "situational power" and the effects of "conformity, obedience to authority, role-playing, dehumanization, deindividuation and moral disengagement".

I wish I could say that Goldhagen's film depicts atrocities beyond anything I could ever imagine happening in this country ... at least in modern times (slavery, the civil war, and other epochs in our history may represent approximations of eliminationism). However, all of the examples of eliminationism he examines were preceded by periods of persistent demonization and vilification of classes of people ... practices that seem to be on the increase among some media pundits and channels. In researching this blog post, I was simultaneously heartened and disheartened to discover that I am not alone in this concern.

In a Marquette Law Review article on Eliminationist Discourse in a Conflicted Society: Lessons for America from Africa?, Phyllis Bernard writes:

This Article proceeds from the assumption that—from a less lofty, more grassroots perspective—modern, organized, formal, one-time venues for extremist political speech do not present the most potent threat to physical safety and a stable democracy. The greater danger emanates from pervasive right-wing extremist themes on radio, television, and some online news sources (often as a modern-day replacement for hard-copy newspapers and newsletters). These media support an increasingly passionate and virulent message in public discourse. This message encourages persons who feel uneasy or displaced in society to expiate their grievances not through the political process, but through murder.


This Article addresses pervasive, long-term, mixed messages that blend ostensible news with entertainment, politics, religion, and appeals to ethnic identity and general fear-mongering. Although such discourse receives the greatest coverage in the mass media, the better forum to mitigate and neutralize the incitement to action may be on a person-to-person level. This Article will explore interventions in Rwanda and Nigeria that adapted American dispute prevention and resolution methods to African media and dispute resolution traditions. The African collaborations offer a different view of justice, based on relationships, which may provide a better fit and forum for America to address extremist media messages and their impact on society.

I hope, for the sake of all Americans, that we can learn the lessons from other conflicts, find common ground, foster more civil and respectful relationships, and avoid the kinds of catastrophes we have witnessed in countries that may be, in some key respects, not so different from our own. And I also hope that we can find and employ the will to use our considerable power to stand up to bullies in other parts of the world.