Procrastiblog

June 8, 2008

Tweaking an RSS Feed in Python

Filed under: Python, Tech — Chris @ 8:00 pm

I’ve been teaching myself a bit of Python by the just-in-time learning method: start programming, wait for the interpreter to complain, and go check the reference manual; keep the API docs on your hard disk and sift through them when you need a probably-existing function. Recently, I wanted to write a very simple script to manipulate some XML (see below) and I was surprised (though it has been noted before) at the relatively confused state of the art in Python and XML.

First of all, the Python XML API documentation is more or less “go read the W3C standards.” Which is fine, but… make the easy stuff easy, people.

Secondly, the supposedly-standard PyXML library has been deprecated in some form or fashion such that some of the examples from the tutorial I was working with have stopped working (in particular, the xml.dom.ext module has gone somewhere. Where, I do not know).

So, in the interest of producing more and better code samples for future lazy programmers, here’s how I managed to solve my little problem.

The Problem: Twitter’s RSS feeds don’t provide clickable links.

The Solution: A script suitable for use as a “conversion filter” in Liferea (and maybe other feed readers too, who knows?). The script should:

  1. Read and parse an RSS/Atom feed from the standard input.
  2. Grab the text from the feed items and “linkify” it.
  3. Print the modified feed on the standard output.

Easy, right? Well, yeah. The only tricky bit was using the right namespace references for the Atom feed, but again that’s only because I refuse to read and comprehend the W3C specs for something so insignificant. I ended up using the lxml library, because it worked. (The script would be about 50% shorter if I hadn’t added a command-line option --strip-username to strip the username from the beginning of items in a single-user feed, and a third shorter than that if it only handled RSS or Atom and not both.)
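
For the curious, here is a small sketch of the namespace gotcha. It uses the standard library's xml.etree.ElementTree rather than lxml (my substitution, so it runs without extra packages), and the sample feed is made up. The point is just that an Atom document puts every element in the Atom namespace, so a bare element name matches nothing until you map a prefix to the namespace URI.

```python
import xml.etree.ElementTree as ET

# The prefix "n" is arbitrary; only the URI has to match the feed's xmlns
ATOM_NS = {"n": "http://www.w3.org/2005/Atom"}

# A made-up, minimal Atom feed for illustration only
SAMPLE_ATOM = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>chris: hello</title>
    <content type="html">chris: hello</content>
  </entry>
</feed>"""

root = ET.fromstring(SAMPLE_ATOM)

# Without a namespace mapping, the bare names match nothing
unqualified = root.findall("entry/content")

# With the mapping, the prefixed path finds the element
qualified = root.findall("n:entry/n:content", ATOM_NS)

print(len(unqualified), len(qualified))  # prints "0 1"
```

RSS 2.0 elements, by contrast, are not namespaced, so plain paths work there.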

Here’s the code, in toto. (You can download it here.)

#! /usr/bin/env python

from sys import stdin, stdout
from lxml import etree
from re import sub
from optparse import OptionParser

doc = etree.parse(stdin)

def addlinks(path,namespaces=None):
    for node in doc.xpath(path,namespaces=namespaces):
        # Skip empty elements; sub() raises TypeError on None
        if node.text is None:
            continue
        # Turn URLs into HREFs (raw strings keep the backslashes intact)
        node.text = sub(r"((https?|s?ftp|ssh)://[^\"\s<>]*[^.,;'\">:\s<>)\]!])",
                        r'<a href="\1">\1</a>',
                        node.text)
        # Turn @ refs into links to the user page; usernames may be mixed-case
        node.text = sub(r"\B@([_A-Za-z0-9]+)",
                        r'@<a href="http://twitter.com/\1">\1</a>',
                        node.text)

def stripuser(path,namespaces=None):
    for node in doc.xpath(path,namespaces=namespaces):
        # Skip empty elements; sub() raises TypeError on None
        if node.text is not None:
            node.text = sub(r"^[A-Za-z0-9_]+:\s*","",node.text)

# The feed is read from standard input; there are no positional arguments
parser = OptionParser(usage = "%prog [options] < FEED")
parser.add_option("-s", "--strip-username",
                  action="store_true",
                  dest="strip_username",
                  default=False,
                  help="Strip the username from item title and description")
(opts,args) = parser.parse_args()

# Prefix-to-URI mapping for the Atom XPath expressions
ATOM = {'n': 'http://www.w3.org/2005/Atom'}

# For RSS feeds
addlinks("//rss/channel/item/description")
# For Atom feeds
addlinks("//n:feed/n:entry/n:content", ATOM)

if opts.strip_username:
    # RSS title/description
    stripuser("//rss/channel/item/title")
    stripuser("//rss/channel/item/description")
    # Atom title/description
    stripuser("//n:feed/n:entry/n:title", namespaces=ATOM)
    stripuser("//n:feed/n:entry/n:content", namespaces=ATOM)

# lxml writes bytes, so on Python 3 use the binary stream behind stdout
doc.write(getattr(stdout, "buffer", stdout))
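
To see what the substitutions actually do, here is a rough standalone sketch that runs them on a made-up tweet. The helper name linkify and the sample text are mine, the @-pattern here accepts uppercase letters too, and I use the standard library's xml.etree.ElementTree instead of lxml so it runs anywhere. It also shows that assigning markup to an element's text stores it escaped, which, as far as I can tell, is how feed readers expect HTML in an RSS description.

```python
import re
import xml.etree.ElementTree as ET

# The script's two patterns, written as raw strings
URL_RE = re.compile(r"((https?|s?ftp|ssh)://[^\"\s<>]*[^.,;'\">:\s<>)\]!])")
USER_RE = re.compile(r"\B@([_A-Za-z0-9]+)")
PREFIX_RE = re.compile(r"^[A-Za-z0-9_]+:\s*")

def linkify(text):
    """Wrap URLs and @mentions in anchor tags (hypothetical helper)."""
    text = URL_RE.sub(r'<a href="\1">\1</a>', text)
    return USER_RE.sub(r'@<a href="http://twitter.com/\1">\1</a>', text)

# Made-up sample; note the trailing comma is left outside the link
sample = "chris: see http://example.com/post, thanks @alan_h!"
linked = linkify(sample)
print(linked)

# What --strip-username would do to the same text
stripped = PREFIX_RE.sub("", sample)
print(stripped)

# Assigning markup to .text stores it as text, so it is escaped on output
node = ET.Element("description")
node.text = linked
serialized = ET.tostring(node, encoding="unicode")
print(serialized)
```

The last character class in the URL pattern is what keeps trailing punctuation such as the comma out of the link.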

If there are any Python programmers in the audience and I’m doing something stupid or terribly non-idiomatic, I’d be glad to know.

Thanks in part to Alan H whose Yahoo Pipe was almost good enough (it doesn’t handle authenticated feeds, as far as I can tell) and from whom I ripped off the regular expressions.

[UPDATE] Script changed per first commenter.


Top Chef and BSG Catch-Up

Filed under: Battlestar Galactica, Not Tech, Top Chef, TV — Chris @ 4:18 pm

I have been remiss in blogging Top Chef and Battlestar Galactica this year. Suffice it to say I’m watching and enjoying, but my ardor for both has somewhat dimmed.

Unlike previous seasons of Top Chef, I don’t have a real rooting interest in any of the cheftestants this year. If I were forced to choose I would guess Richard is probably going to win (he’s about as well-liked as Stephanie and more consistent). I—along with the rest of the world—loathe Lisa, but she’s just kind of a bad trip, not really a boo-hiss, lie-to-your-face villain in the Tiffani/Omarosa mold. An interesting bit of data, for those Lisa-haters who suspect they are suffering from an irrational aversion to her attitude, looks, and posture: she has—by far—the worst record of any cheftestant to appear in a Top Chef finale (1 Elimination win, 1 place, no Quickfire wins; she has been up for elimination or on the losing team in the last seven consecutive episodes (!)). Incidentally, Richard (3 Elimination wins, 5 places, and 2 Quickfire wins) and Stephanie (4 Elimination wins, 5 places, and 1 Quickfire win) have by far the best records of any previous cheftestant, period. (In comparison, the previous three winners (Harold, Ilan, and Hung) had only 4 Elimination wins total.)

On the other side, BSG has been doing a lot of the mythical flim-flam (I don’t really care where Earth is or whether they ever find it) and not so much of the intense post-9/11 fractured-mirror business that made the first three seasons so addictive. The characters have been getting pushed around the chessboard willy-nilly without much attention paid to consistency or plausibility (to wit: President Lee Adama), all in service of a presumed “mind-blowing” series finale (to arrive not before calendar year 2009, as I understand it) that I am quite certain will disappoint (I’m not going to be X-Files’ed ever again).

So there’s your TV-blogging for the year. Back to work.
