Friday, December 29, 2006

Unix Tool Of The Day: XMLStarlet

XMLStarlet is a must-have tool for dealing with XML in Unix shell scripts. It lets you compose tiny XSLT stylesheets right on the command line, and that capability lets you painlessly bridge the world of XML and the world of Unix text-oriented tools.

Suppose you wanted a list of all the recent posts from your buddy's blog. How would you do it? Well, grab their feed and process it, right?

Here's how to do just that from a Unix command line:

wget -o /dev/null -O - http://benjisimon.blogspot.com/atom.xml | \
 sed "s| xmlns='[^']*'||" | \
 xml sel -t -m /feed/entry -v ./title -n | \
 nl | head -5

The commands do the following:

  1. Grab atom.xml from my friend's blog
  2. Pre-process the XML to strip the default namespace declaration, which would otherwise keep the /feed/entry path from matching (learn more here)
  3. Create an on-the-fly XSLT stylesheet that loops through each entry element and prints the value of its title, one per line
  4. Run the results through the standard nl command, which numbers the lines, and then through head, which discards all but the top 5 lines
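
If steps 2 and 4 feel like magic, here's a rough sketch of them in isolation, using nothing but standard tools. The function names (strip_default_ns, top_five) and the one-line sample feed are made up for illustration. The reason step 2 matters: an un-prefixed XPath step like /feed/entry only matches elements that are in no namespace, and every element in an Atom feed is in the Atom namespace.

```shell
# Step 2 in isolation: drop the default namespace declaration so that
# un-prefixed XPath steps like /feed/entry can match.
# (Hypothetical helper name; the sed expression is the one from the post.)
strip_default_ns() {
  sed "s| xmlns='[^']*'||"
}

# A made-up, one-line stand-in for the real feed.
printf "<feed xmlns='http://www.w3.org/2005/Atom'><entry/></feed>\n" | strip_default_ns

# Step 4 in isolation: number the lines, then keep only the first 5.
# (head -n 5 is the same as the head -5 used in the post.)
top_five() {
  nl | head -n 5
}

printf '%s\n' a b c d e f g | top_five
```

Note that the sed expression only handles a single-quoted xmlns attribute, which is what this particular feed uses. A feed that quotes its attributes with double quotes would need the pattern adjusted.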

The result of the above command is:

     1  TSA's Losing Battle
     2  Commutecast #1: Trying it out
     3  Hot Dog Cook-In
     4  Google Hack: Finding Gift Ideas
     5  MP3 Player Update
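
As an aside, the sed step isn't the only way around the namespace issue. XMLStarlet's sel command has a -N option that binds a prefix to a namespace, so the XPath can name the Atom elements directly. Here's a minimal sketch, assuming your XMLStarlet build supports -N. The prefix a, the atom_titles function, and the tiny sample feed are my own inventions for illustration. It also assumes the binary is called either xml (as above) or xmlstarlet (as some distributions package it).

```shell
# Locate the binary: "xml" as in the post, or "xmlstarlet" as some
# distributions install it. Empty if neither is on the PATH.
XML=$(command -v xmlstarlet || command -v xml || true)

# Print one title per line, reading an Atom feed from stdin.
# -N binds the (arbitrary) prefix "a" to the Atom namespace, so no
# sed pre-processing is needed.
atom_titles() {
  "$XML" sel -N a=http://www.w3.org/2005/Atom \
    -t -m /a:feed/a:entry -v a:title -n
}

# A made-up sample feed, so the sketch runs without touching the network.
sample_feed="<feed xmlns='http://www.w3.org/2005/Atom'>
  <entry><title>First post</title></entry>
  <entry><title>Second post</title></entry>
</feed>"

if [ -n "$XML" ]; then
  printf '%s\n' "$sample_feed" | atom_titles | nl | head -n 5
else
  echo "XMLStarlet not found on PATH" >&2
fi
```

To run it against a real feed, swap the printf for the wget command from the original pipeline and leave out the sed step.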

This doesn't begin to capture the power of XMLStarlet, but at least you get the idea that processing XML doesn't need to be a big ol' hack.

Naturally, XMLStarlet takes some practice. But the docs are good, and the results are worth the effort.

If you use XML files and the command line, learn this tool. You'll be glad you did.

Oh, and I should mention, all of the above works just fine on Cygwin under Windows XP too. So you really have no excuses not to take advantage of this slick tool.

6 comments:

  1. Anonymous, 4:21 AM

    ([xml](New-Object Net.WebClient).DownloadString("http://benjisimon.blogspot.com/atom.xml")).feed.entry | %{ $_.title."#text" } | select -first 5

  2. Anonymous, 5:40 AM

    Oops, I left out the line numbering. Also here's another way to get the first 5.

    ([xml](New-Object Net.WebClient).DownloadString("http://benjisimon.blogspot.com/atom.xml")).feed.entry[0..4] | %{ $i=0 }{ "" + ++$i + " " + $_.title."#text" }

  3. OK, I'll bite. What language is that Alex?

  4. Anonymous, 10:00 PM

    PowerShell

  5. Cool - thanks for sharing!

    Great, now I have to come up with a trickier example of xmlstarlet ;-)

  6. Nice man. You've made my dream of having my GMail read to me a reality :)
