Using pydelicious to fix Delicious tags with unintended blank spaces
May 22, 2013
I've been using the Delicious social bookmarking web service for many years as a way to archive links to interesting web pages and associate tags to personally categorize - and later search for - their content [my tags can be found under the username gump.tion, a riff on the original Delicious URL, del.icio.us]. In December 2010, a widely circulated rumor reported that Yahoo was planning to shut down Delicious, and a number of my friends abandoned the service for other services. I was in the midst of yet another career change, rejoining academia after a 21-year hiatus, with little time for browsing, much less bookmarking, so I did not make any changes at the time.
It turns out that rather than being shut down, Delicious was sold in April 2011, and various changes have since been made to the service and its user interface. The Delicious UI initially interpreted spaces in the TAGS field as tag separators, e.g., typing in the string "education mooc disruption" (as shown in the screenshot below) would be interpreted as tagging a page with the 3 tags "education", "mooc" and "disruption"; if you wanted to have a single tag with those 3 terms, you had to remove or replace the spaces, e.g., "educationmoocdisruption" or "education_mooc_disruption". Sometime in October 2011, the specifications changed, and commas rather than spaces were used to separate tags, allowing spaces to be used in the tags themselves, e.g., "education mooc disruption" was interpreted as a single tag (equivalent to "educationmoocdisruption"). Unfortunately, I did not see an announcement or notice this change for quite some time, and so I had hundreds of web pages archived on Delicious with tags I did not intend.
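To make the difference between the two rules concrete, here is a minimal sketch of how the same typed string would be split under each one. The function names are hypothetical and this is not Delicious code; it only mirrors the behavior described above, and it is written in Python 3 syntax.

```python
def parse_space_delimited(field):
    """Old rule: whitespace separates tags."""
    return field.split()


def parse_comma_delimited(field):
    """New rule: commas separate tags; spaces stay inside a tag."""
    return [tag.strip() for tag in field.split(",") if tag.strip()]


typed = "education mooc disruption"
print(parse_space_delimited(typed))  # three separate tags
print(parse_comma_delimited(typed))  # one tag that contains spaces
```

Under the comma rule, getting three tags requires typing "education, mooc, disruption" instead.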
This problem surfaced recently when I was sharing my bookmarks on MOOCs (massive open online courses) with a group of students working on a project investigating MOOCs in a small closed offline course, Computing Technology and Public Policy. There were several pages I remembered bookmarking that did not appear in pages associated with my MOOC tag. Searching through my archive for the titles of some of those pages, I discovered several pages tagged with terms including spaces. I started manually renaming tags, replacing the multi-term tags with the multiple tags I'd intended to associate with the pages. After a dozen or so manual replacements, I scanned my tag set and saw many, many more, and so decided to try a different approach.
The Delicious API provides a programmatic way to access or change tags associated with an authenticated user's account. Ever since my first socialbots experiment, my programming language of first resort in accessing any web service API is Python, and as I expected, there is a Python package for accessing the Delicious API, aptly named pydelicious. Using pydelicious, I discovered that my Delicious account had over 200 tags with unintended spaces in them. I'm sharing the process I used to convert these tags in case it is of interest / use to others in a similar predicament. [Note: my MacBook Pro, running Mac OS X 10.8.3, comes prebundled with Python 2.7.2; instructions for installing and using Python can be found at python.org.]
Replacing all the tags containing unintended spaces with comma-delimited equivalents (e.g., replacing "education mooc disruption" with "education", "mooc", "disruption") was relatively straightforward, using the following sequence:
- Install pydelicious
Type easy_install pydelicious on the command line (on Mac OS X, this can be done in a Terminal window; on Windows, this can be done in a Command Prompt window)
$ easy_install pydelicious
Searching for pydelicious
Reading http://pypi.python.org/simple/pydelicious/
Reading http://code.google.com/p/pydelicious/
Best match: pydelicious 0.6
Downloading http://pydelicious.googlecode.com/files/pydelicious-0.6.zip
Processing pydelicious-0.6.zip
...
Finished processing dependencies for pydelicious
$
[$ is the Terminal command prompt (for bash)]
- Launch python
Type python on the command line
MacBook-Joe:Python joe$ python
Python 2.7.2 (v2.7.2:8527427914a2, Jun 11 2011, 15:22:34)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pydelicious import DeliciousAPI
>>> from getpass import getpass
>>> a = DeliciousAPI('gump.tion', getpass('Password:'))
Password:
>>> t = a.tags_get()
[>>> is the Python prompt]
- Import the pydelicious package and getpass function
>>> from pydelicious import DeliciousAPI
>>> from getpass import getpass
>>>
- Authenticate my Delicious username and password with the Delicious API
>>> api = DeliciousAPI('gump.tion', getpass('Password:'))
Password:
>>>
[Note: my password is not displayed in the Terminal window as I type it]
- Retrieve all my tags
>>> tagset = api.tags_get()
>>>
[tagset will be a dictionary (or associative array) with a single key, tags, whose associated value is an array of dictionaries, each of which has two keys, count and tag, e.g., {'tags': [{'count': '188', 'tag': 'socialmedia'}, {'count': '179', 'tag': 'education'}, ...]}
tagset['tags'] can be used to access the array of counts and tags, and a for loop can be used to iterate across each element of the array.]
- Check for tags with spaces
>>> for tag in tagset['tags']:
...     if ' ' in tag['tag']:
...         print tag['count'], ': ', tag['tag']
...
1 : socialnetwork security socialbots
1 : education openaccess p2p collaboration cscl
1 : education parenting
1 : psychology wrongology education
1 : privacy internet politics business surveillance censorship
1 : robots psychology nlp
[... is the Python continuation prompt, indicating the interpreter expects the command to be continued. Note that the 200+ lines of tags with spaces have been truncated above.]
- Visit a multi-space tag via a browser
E.g., https://delicious.com/gump.tion/education%20mooc%20disruption; this sets the stage for verifying that a space-delimited tag has been correctly replaced with its comma-delimited equivalent.
- Replace spaces with commas in all tags with the tags_rename API call
>>> for tag in tagset['tags']:
...     if ' ' in tag['tag']:
...         api.tags_rename(tag['tag'], tag['tag'].replace(" ", ", "))
...
>>>
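Since a bulk rename is awkward to undo, it may be worth previewing the old and new names before calling the API. The following is a minimal sketch of such a dry run, written in Python 3 syntax (unlike the Python 2.7 session above); plan_renames and the sample data are hypothetical, and no Delicious calls are made. It joins the pieces with ", " in the same way as the replace() call above, but uses split() so that a run of several spaces does not produce empty names between commas.

```python
def plan_renames(tags):
    """Return (old, new) pairs for every tag that contains a space."""
    plan = []
    for entry in tags:
        old = entry['tag']
        if ' ' in old:
            # split() with no argument collapses runs of whitespace,
            # so "a  b" becomes "a, b" rather than "a, , b"
            new = ', '.join(old.split())
            plan.append((old, new))
    return plan


# Hypothetical sample data in the shape returned by api.tags_get()
sample = {'tags': [
    {'count': '179', 'tag': 'education'},
    {'count': '1', 'tag': 'education parenting'},
    {'count': '1', 'tag': 'robots psychology nlp'},
]}

for old, new in plan_renames(sample['tags']):
    print(repr(old), '->', repr(new))
```

Once the preview looks right, each pair could be passed to api.tags_rename(old, new), as in the loop above.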
- Verify that the tags have been replaced via the API
>>> for tag in api.tags_get()['tags']:
...     if ' ' in tag['tag']:
...         print tag['count'], ': ', tag['tag']
...
>>>
[Replacing the reference to tagset with a fresh call to api.tags_get()]
- Verify that the tags have been replaced via a browser
E.g., reload the page above, then edit the tags field in the Delicious user interface to manually replace spaces (%20) with commas (%2C), resulting in the following URL: https://delicious.com/gump.tion/education%2Cmooc%2Cdisruption
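The %20 and %2C in the URLs above are simply the percent-encoded forms of a space and a comma. As a small sketch, the two URLs can be generated with the standard library rather than edited by hand; this uses Python 3's urllib.parse.quote (in Python 2.7 the equivalent function is urllib.quote), and tag_url is a hypothetical helper name.

```python
from urllib.parse import quote

USER = 'gump.tion'  # username from the post


def tag_url(tag_path):
    """Build a delicious.com URL for a tag path, percent-encoding it."""
    return 'https://delicious.com/{}/{}'.format(USER, quote(tag_path, safe=''))


print(tag_url('education mooc disruption'))  # spaces become %20
print(tag_url('education,mooc,disruption'))  # commas become %2C
```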
Having replaced all the tags with unintended spaces, I've reduced my tag set from 881+ to 680. I now see that I have a number of misspelled tags (e.g., commumity), and a number of singleton tags that are semantically similar to other tags I've used more regularly (e.g., comics (2) and humor (27)) - an inconsistency that similarly affects the category tags on this blog - but I'll leave further fixes for another time in which I want to engage in structured procrastination.