Wednesday, January 12, 2022

Free and Fast, A Programmer Friendly Source for Historic News Data

To power past projects, I've looked around for sources of historic news data that I could query with ease. The APIs I found required subscription fees that I couldn't justify.

The other day, I realized a free and accessible source for structured news data may be readily at hand. After a few minutes of poking around, I had my first version of headlines, a shell script that pulls back headlines given a date.

Let's Mine The News

Here's the script in action. I'm showing 5 headlines from 3 random days within the last 3,000 days. Full disclosure: I ran this process a few times until I got 3 dates that were relatively far apart.

$ for i in 1 2 3 ; do x=$(($RANDOM % 3000)) ; echo "$x days ago" ; headlines -d "$x days ago"  | head -5  ; done
350 days ago
Tue, 26 Jan 2021 22:40:08 GMT|President Biden announces the purchase of enough doses to fully vaccinate Americans by summer's end
Tue, 26 Jan 2021 22:22:41 GMT|Watch Biden's vaccine announcement
Tue, 26 Jan 2021 20:12:12 GMT|White people are getting vaccinated at higher rates
Sat, 23 Jan 2021 01:35:21 GMT|See expert's plan to end pandemic in four weeks
Tue, 26 Jan 2021 12:34:44 GMT|The global scramble for vaccines is getting ugly
2753 days ago
Sun, 29 Jun 2014 19:52:16 EDT|Gay couple's 40-year immigration battle
Fri, 27 Jun 2014 06:04:45 EDT|'Heavy drinker' definition surprises
Sun, 29 Jun 2014 07:14:00 EDT|NASA tests saucer for Mars mission
Sat, 28 Jun 2014 20:18:19 EDT|Routine traffic stop turns physical
Sun, 29 Jun 2014 14:15:01 EDT|90 rolls of duct tape made THIS
1461 days ago
Thu, 11 Jan 2018 22:42:45 GMT|President reportedly suggests US get more people from countries like Norway
Thu, 11 Jan 2018 22:51:43 GMT|Democrats say Trump's remark proves he is racist
Wed, 10 Jan 2018 19:50:37 GMT|White House corrects DACA meeting transcript
Thu, 11 Jan 2018 22:46:06 GMT|Trump rejects bipartisan DACA proposal
Thu, 11 Jan 2018 21:27:49 GMT|Rep. Cuellar: The border wall is a dumb idea

headlines is powered by the Internet Archive's Wayback Machine. It works because the Wayback Machine stores RSS feeds for posterity. The data you're seeing above is from CNN's RSS feed, which the Wayback Machine has been diligently capturing nearly every day since January 10th, 2005.

Here's an example of pulling from three different RSS feeds: CNN, The New York Times and Fox News. I'm using '1460 days ago,' which was inspired by the random date selection above. Apparently on this day, it was being reported that Trump had casually denigrated Haiti and pretty much all of Africa.

At first it appears that CNN and The New York Times are lit up with the news, while Fox's top story is "Surprising celebrity facts." However, if you look at the dates, you'll see that the Wayback Machine didn't have a Fox feed for January 12th, 2018, so it's giving us January 10th. The news about Trump's comments came on the 11th, so the fact that Fox isn't covering it yet isn't as meaningful as it may appear. There's also no proof that the RSS feed captured by the Wayback Machine represents what people saw on Fox's home page.

$ for src in cnn_top nyt_top fox_top ; do echo "Source=$src" ; headlines -d "1460 days ago" -s $src | head -5 ; done
Source=cnn_top
Fri, 12 Jan 2018 12:49:56 GMT|Two other GOP senators say they 'don't recall' the President 'saying these comments specifically'
Fri, 12 Jan 2018 19:27:17 GMT|What Trump supporters think of his 'shithole' remark
Fri, 12 Jan 2018 17:22:57 GMT|Analysis: Why no one should believe Trump's 'shithole' denial
Fri, 12 Jan 2018 14:39:27 GMT|Anchor chokes up discussing Trump comment
Fri, 12 Jan 2018 08:24:20 GMT|Late night reacts to Trump's 'shithole' comments
Source=nyt_top
Fri, 12 Jan 2018 23:05:52 GMT|Trump, Haiti, London: Your Friday Evening Briefing
Fri, 12 Jan 2018 19:03:01 GMT|Senator Insists Trump Used ‘Vile and Racist’ Language
Fri, 12 Jan 2018 20:48:11 GMT|News Analysis: A President Who Fans, Rather Than Douses, the Nation’s Racial Fires
Fri, 12 Jan 2018 23:41:18 GMT|‘‘Don’t Feed the Troll’: Much of the World Reacts in Anger at Trump’s Insult
Fri, 12 Jan 2018 22:47:38 GMT|Porn Star Who Claimed Sexual Encounter With Trump Received Hush Money, Wall Street Journal Reports
Source=fox_top
Wed, 10 Jan 2018 10:00:00 GMT|Who knew? Surprising celebrity facts
Wed, 10 Jan 2018 10:00:00 GMT|FOX411's snap of the day
Wed, 10 Jan 2018 03:36:15 GMT|Magnitude 7.6 quake hits in Caribbean north of Honduras
Wed, 10 Jan 2018 03:29:06 GMT|Church: Guam archbishop faces new sexual assault allegation
Wed, 10 Jan 2018 03:22:58 GMT|Australia experiences 3rd hottest year on record in 2017

The above example highlights the limitation of depending on the Wayback Machine: there's no guarantee that there will be headline data for every day of the year. Still, it's remarkable how effective headlines is given its simplicity.
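
One way to spot this kind of gap is to compare the newest pubDate in the output against the date that was requested. Here's a minimal sketch, assuming GNU date (which the script already relies on for -d). The name feed_lag_days is made up for illustration and isn't part of headlines.

```shell
#!/usr/bin/env bash
# feed_lag_days is a hypothetical helper, not part of headlines.
# It reads "pubDate|title" lines on stdin and prints how many calendar days
# (in UTC) the newest pubDate trails the requested date.
# Assumes GNU date, which headlines already needs for `date -d`.
feed_lag_days() {
  local requested=$1 newest=0 stamp epoch newest_day requested_day
  while IFS='|' read -r stamp _; do
    [ -n "$stamp" ] || continue
    # Skip lines whose first field is not a parseable date
    epoch=$(date -u -d "$stamp" +%s 2>/dev/null) || continue
    if [ "$epoch" -gt "$newest" ]; then
      newest=$epoch
    fi
  done
  # No usable dates on stdin
  [ "$newest" -gt 0 ] || return 1
  newest_day=$(date -u -d "@$newest" +%Y-%m-%d)
  requested_day=$(date -u -d "$requested" +%Y-%m-%d) || return 1
  echo $(( ( $(date -u -d "$requested_day" +%s) - $(date -u -d "$newest_day" +%s) ) / 86400 ))
}
```

Piping the Fox output above through feed_lag_days 2018-01-12 should report a lag of 2 days, since the newest item there is dated January 10th. A lag of 0 suggests the feed really was captured on the day you asked for.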

How Does It Work?

Pulling news data from the Wayback Machine is a two-step process. First, the script queries the Wayback Machine for the status of the RSS feed in question. For example:

$  curl -s -G \
    --data-urlencode url= \
    --data-urlencode timestamp=20180113 | jq .
{
  "url": "",
  "archived_snapshots": {
    "closest": {
      "status": "200",
      "available": true,
      "url": "",
      "timestamp": "20180110041238"
    }
  },
  "timestamp": "20180113"
}

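To pick the snapshot out of a response like this, something along these lines could work. It's a sketch, assuming jq is installed. closest_snapshot is a hypothetical helper name, and the sample response uses made-up example.com and archive.example addresses in place of real ones.

```shell
#!/usr/bin/env bash
# closest_snapshot is a hypothetical helper, not part of headlines.
# It reads an availability response on stdin and prints "timestamp url" for
# the closest snapshot. If there is no available snapshot it prints nothing
# and returns non-zero.
closest_snapshot() {
  jq -r '.archived_snapshots.closest
         | select(. != null and .available == true)
         | "\(.timestamp) \(.url)"' | grep .
}

# Made-up response in the same shape as the one shown above
response='{"url": "example.com/feed", "archived_snapshots": {"closest": {"status": "200", "available": true, "url": "http://archive.example/20180110041238/feed", "timestamp": "20180110041238"}}, "timestamp": "20180113"}'

printf '%s' "$response" | closest_snapshot
```

Filtering on available and on a non-null closest entry means a missing snapshot shows up as a failed command rather than as the literal string "null", which is what jq -r prints for a missing field.
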
Then, the script takes the 'closest' URL, retrieves that RSS feed, and processes it with xmlstarlet to produce human-readable output. Here's what the raw feed looks like:

$ curl -s '' | xmllint --format - | head -10
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?>
<?xml-stylesheet type="text/css" media="screen" href=""?>
<rss xmlns:media="" xmlns:content="" xmlns:dc="" version="2.0">
  <channel>
    <title>FOX News</title>
    <description><![CDATA[ - Breaking news and video. Latest Current News: U.S., World, Entertainment, Health, Business, Technology, Politics, Sports.]]></description>
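
The xmlstarlet step can be tried offline against a tiny hand-written feed. This sketch assumes xmlstarlet is on the PATH. rss_headlines and the sample feed are invented for illustration, though the selection expression is the same one headlines builds.

```shell
#!/usr/bin/env bash
# rss_headlines is a hypothetical helper name.
# It prints "pubDate|title" for each <item> of an RSS 2.0 document on stdin.
rss_headlines() {
  xmlstarlet sel -t -m '/rss/channel/item' -v pubDate -o '|' -v title -n
}

# Hand-written sample feed, not real archived data
sample_rss='<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First sample headline</title>
      <pubDate>Wed, 10 Jan 2018 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second sample headline</title>
      <pubDate>Wed, 10 Jan 2018 03:36:15 GMT</pubDate>
    </item>
  </channel>
</rss>'

printf '%s\n' "$sample_rss" | rss_headlines
```

Each item comes out as one pipe-delimited line, which is what makes the output easy to feed into head, grep, cut and friends.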

Ultimately, this works as well as it does because the Wayback Machine is indexing, and making available to us, a machine-readable format. While news organizations never intended to maintain historic snapshots of their feeds, the Wayback Machine is glad to do precisely this. It also raises the question: what other programmer-friendly data is the Wayback Machine storing?

The Complete Script

Here's the most recent version of headlines, which includes support for pulling from a variety of RSS feeds. Happy News Hacking!


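#!/bin/bash

##
## Show headlines
##

usage() {
  me=$(basename "$0")
  echo "Usage: $me  -t timestamp [-v] [-u] [-s source]"
  echo "Usage: $me  -d date-string [-v] [-u] [-s source]"
  exit 1
}

# Map a source name to its RSS feed URL; unknown names map to nothing
source_map() {
  case $1 in
    cnn_top) u='' ;;
    cnn_world) u='' ;;
    cnn_politics) u='' ;;
    cnn_tech) u='' ;;
    cnn_business) u='' ;;
    nyt_business) u='' ;;
    nyt_politics) u='' ;;
    nyt_top) u='' ;;
    fox_top) u='' ;;
    fox_politics) u='' ;;
    fox_tech) u='' ;;
    *) u='' ;;
  esac

  echo $u;
}

# Default source (an assumption: the first example runs without -s
# and returns CNN headlines)
source=cnn_top

while getopts "t:d:vus:h" o; do
  case "$o" in
    s) source=$OPTARG ;;
    d) date=$OPTARG ;;
    t) timestamp=$OPTARG  ;;
    u) include_url=yes ;;
    v) verbose=yes ;;
    * | h)
      usage
      ;;
  esac
done

# A date string such as "350 days ago" becomes a YYYYMMDD timestamp (GNU date)
if [ -n "$date" ] ; then
  timestamp=$(date -d "$date" +%Y%m%d)
fi

source_url=$(source_map "$source")
if [ -z "$source_url" ] ; then
  echo "$source isn't a valid source"
  exit 1
fi

if [ -z "$timestamp" ] ; then
  usage
fi

# Keep digits only, so 2018-01-13 and 20180113 are both accepted
timestamp=$(echo $timestamp | sed 's/[^0-9]//g')

# Step 1: find the snapshot closest to the timestamp.
# '// empty' keeps jq from printing the string "null" when there is no snapshot.
url=$(curl -s -G \
           --data-urlencode url=$source_url \
           --data-urlencode timestamp=$timestamp \
           '' | tee $HOME/.headlines.wb | jq -r '.archived_snapshots.closest.url // empty')

if [ -z "$url" ] ; then
  echo "No headlines found for: $timestamp"
  echo "";
  cat $HOME/.headlines.wb
  exit 1
fi

# Step 2: fetch the archived feed and print one line per item
curl -s "$url" | xmllint --format - | if [ "$verbose" = "yes" ] ; then
  # Verbose: show the formatted feed as-is (assumed behavior)
  cat
else
  expr="-m '/rss/channel/item' -v pubDate -o '|' "
  if [ "$include_url" = "yes" ] ; then
    expr="$expr -v guid -o '|' "
  fi
  expr="$expr -v title -n"
  eval xmlstarlet sel -t $expr | grep -v '^[|]'
fi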
