Wednesday, January 12, 2022

Free and Fast: A Programmer-Friendly Source for Historic News Data

To power past projects, I've looked around for sources of historic news data that I could query with ease. The APIs I found required subscription fees that I couldn't justify.

The other day, I realized a free and accessible source for structured news data may be readily at hand. After a few minutes of poking around, I had my first version of headlines, a shell script that pulls back headlines given a date.

Let's Mine The News

Here's the script in action. I'm showing 5 headlines from 3 random days within the last 3,000 days. Full disclosure: I ran this process a few times until I got 3 dates that were relatively far apart.

$ for i in 1 2 3 ; do x=$(($RANDOM % 3000)) ; echo "$x days ago" ; headlines -d "$x days ago"  | head -5  ; done
350 days ago
Tue, 26 Jan 2021 22:40:08 GMT|President Biden announces the purchase of enough doses to fully vaccinate Americans by summer's end
Tue, 26 Jan 2021 22:22:41 GMT|Watch Biden's vaccine announcement
Tue, 26 Jan 2021 20:12:12 GMT|White people are getting vaccinated at higher rates
Sat, 23 Jan 2021 01:35:21 GMT|See expert's plan to end pandemic in four weeks
Tue, 26 Jan 2021 12:34:44 GMT|The global scramble for vaccines is getting ugly
2753 days ago
Sun, 29 Jun 2014 19:52:16 EDT|Gay couple's 40-year immigration battle
Fri, 27 Jun 2014 06:04:45 EDT|'Heavy drinker' definition surprises
Sun, 29 Jun 2014 07:14:00 EDT|NASA tests saucer for Mars mission
Sat, 28 Jun 2014 20:18:19 EDT|Routine traffic stop turns physical
Sun, 29 Jun 2014 14:15:01 EDT|90 rolls of duct tape made THIS
1461 days ago
Thu, 11 Jan 2018 22:42:45 GMT|President reportedly suggests US get more people from countries like Norway
Thu, 11 Jan 2018 22:51:43 GMT|Democrats say Trump's remark proves he is racist
Wed, 10 Jan 2018 19:50:37 GMT|White House corrects DACA meeting transcript
Thu, 11 Jan 2018 22:46:06 GMT|Trump rejects bipartisan DACA proposal
Thu, 11 Jan 2018 21:27:49 GMT|Rep. Cuellar: The border wall is a dumb idea

headlines is powered by the Wayback Machine at archive.org. It works because archive.org stores RSS feeds for posterity. The data you're seeing above is from CNN's RSS feed, which archive.org has been diligently capturing nearly every day since January 10th, 2005.

Here's an example of pulling from three different RSS feeds: CNN, New York Times and Fox News. I'm using '1460 days ago,' which was inspired by the random date selection above. Apparently on this day, it was being reported that Trump had casually denigrated Haiti and pretty much all of Africa.

At first it appears that CNN and The New York Times are lit up with the news, while Fox's top story is "Surprising celebrity facts." However, if you look at the dates, you'll see that archive.org didn't have a capture of the Fox feed for January 12th, 2018, so it's giving us the one from January 10th. The news about Trump's comments came on the 11th, so the fact that Fox isn't covering it yet isn't as meaningful as it may appear. There's also no proof that the RSS feed captured by archive.org represents what people saw on the home page of foxnews.com.

$ for src in cnn_top nyt_top fox_top ; do echo "Source=$src" ; headlines -d "1460 days ago" -s $src | head -5 ; done
Source=cnn_top
Fri, 12 Jan 2018 12:49:56 GMT|Two other GOP senators say they 'don't recall' the President 'saying these comments specifically'
Fri, 12 Jan 2018 19:27:17 GMT|What Trump supporters think of his 'shithole' remark
Fri, 12 Jan 2018 17:22:57 GMT|Analysis: Why no one should believe Trump's 'shithole' denial
Fri, 12 Jan 2018 14:39:27 GMT|Anchor chokes up discussing Trump comment
Fri, 12 Jan 2018 08:24:20 GMT|Late night reacts to Trump's 'shithole' comments
Source=nyt_top
Fri, 12 Jan 2018 23:05:52 GMT|Trump, Haiti, London: Your Friday Evening Briefing
Fri, 12 Jan 2018 19:03:01 GMT|Senator Insists Trump Used ‘Vile and Racist’ Language
Fri, 12 Jan 2018 20:48:11 GMT|News Analysis: A President Who Fans, Rather Than Douses, the Nation’s Racial Fires
Fri, 12 Jan 2018 23:41:18 GMT|‘‘Don’t Feed the Troll’: Much of the World Reacts in Anger at Trump’s Insult
Fri, 12 Jan 2018 22:47:38 GMT|Porn Star Who Claimed Sexual Encounter With Trump Received Hush Money, Wall Street Journal Reports
Source=fox_top
Wed, 10 Jan 2018 10:00:00 GMT|Who knew? Surprising celebrity facts
Wed, 10 Jan 2018 10:00:00 GMT|FOX411's snap of the day
Wed, 10 Jan 2018 03:36:15 GMT|Magnitude 7.6 quake hits in Caribbean north of Honduras
Wed, 10 Jan 2018 03:29:06 GMT|Church: Guam archbishop faces new sexual assault allegation
Wed, 10 Jan 2018 03:22:58 GMT|Australia experiences 3rd hottest year on record in 2017

The above example highlights the limitation of depending on archive.org. There's no guarantee that there will be headline data for every day of the year. Still, it's remarkable how effective headlines is given its simplicity.
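
One way to catch this kind of mismatch is to compare the day you asked for with the timestamp of the snapshot archive.org actually returned. The sketch below is an assumption, not part of headlines: snapshot_gap_days is a hypothetical helper, and it relies on GNU date (the same date -d behavior the script already depends on).

```shell
#!/bin/bash
# Hypothetical helper, not part of the headlines script.
# Prints the absolute number of days between a requested day (YYYYMMDD) and
# a Wayback Machine snapshot timestamp (YYYYMMDDhhmmss). Requires GNU date.
snapshot_gap_days() {
  local requested=${1:0:8} snapshot=${2:0:8} r s diff
  r=$(date -u -d "$requested" +%s) || return 1
  s=$(date -u -d "$snapshot" +%s) || return 1
  diff=$(( (r - s) / 86400 ))
  echo "${diff#-}"   # strip a leading minus sign, if any
}

# Example: a feed requested for 20180112 whose closest snapshot was taken
# on 20180110 is two days off, so a warning goes to stderr.
gap=$(snapshot_gap_days 20180112 20180110041238)
if [ "$gap" -gt 0 ] ; then
  echo "warning: snapshot is $gap day(s) away from the requested date" >&2
fi
```

A check like this doesn't fill the gap in the archive, but it makes it obvious when headlines from one day are standing in for another.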

How Does It Work?

Pulling news data from archive.org is a two-step process. First, the script queries archive.org for the status of the RSS feed in question. For example:

$  curl -s -G \
    --data-urlencode url=http://feeds.foxnews.com/foxnews/latest \
    --data-urlencode timestamp=20180113 https://archive.org/wayback/available | jq .
{
  "url": "http://feeds.foxnews.com/foxnews/latest",
  "archived_snapshots": {
    "closest": {
      "status": "200",
      "available": true,
      "url": "http://web.archive.org/web/20180110041238/http://feeds.foxnews.com/foxnews/latest",
      "timestamp": "20180110041238"
    }
  },
  "timestamp": "20180113"
}
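
When pulling the 'closest' entry out of that response, it helps to handle the case where there is no snapshot at all. With jq -r, a missing field prints the literal string null rather than nothing, so a plain emptiness check won't notice it. Here's a minimal sketch of one way to deal with that. The function name closest_snapshot is hypothetical, and the empty archived_snapshots object used in the demo is my assumption about what a miss looks like.

```shell
#!/bin/bash
# Hypothetical helper, not part of the headlines script.
# Reads an availability response on stdin and prints "<timestamp> <url>"
# for the closest snapshot. Returns non-zero when no snapshot is listed.
closest_snapshot() {
  local line
  # select() drops a missing (null) or unavailable entry, so jq prints
  # nothing at all instead of the string "null".
  line=$(jq -r '.archived_snapshots.closest
                | select(.available == true)
                | "\(.timestamp) \(.url)"') || return 1
  [ -n "$line" ] || return 1
  echo "$line"
}

# Demo with a canned response that lists no snapshots (assumed shape).
if ! echo '{"archived_snapshots":{}}' | closest_snapshot ; then
  echo "no snapshot available" >&2
fi

# Against the live service it would be used like this:
#   curl -s -G --data-urlencode url=http://feeds.foxnews.com/foxnews/latest \
#        --data-urlencode timestamp=20180113 \
#        https://archive.org/wayback/available | closest_snapshot
```

Printing the timestamp next to the URL also makes it easy to see which day's capture you actually got.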

Then, the script takes the 'closest' URL, retrieves that RSS feed and processes it with xmlstarlet to produce human-readable output. Here's the start of the raw feed, pretty-printed with xmllint:

$ curl -s 'http://web.archive.org/web/20180110041238/http://feeds.foxnews.com/foxnews/latest' | xmllint --format - | head -10
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?>
<?xml-stylesheet type="text/css" media="screen" href="http://feeds.foxnews.com/~d/styles/itemcontent.css"?>
<rss xmlns:media="http://search.yahoo.com/mrss/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <title>FOX News</title>
    <link>http://www.foxnews.com/</link>
    <description><![CDATA[FOXNews.com - Breaking news and video. Latest Current News: U.S., World, Entertainment, Health, Business, Technology, Politics, Sports.]]></description>
    <image>
      <url>http://tools.foxnews.com/sites/tools.foxnews.com/files/images/fox-news-logo.png</url>

Ultimately, this works as well as it does because archive.org is indexing, and making available to us, a machine-readable format. While news organizations never intended to maintain historic snapshots of their feeds, archive.org is glad to do precisely this. It also raises the question: what other programmer-friendly data is archive.org storing?

The Complete Script

Here's the most recent version of headlines, which includes support for pulling from a variety of RSS feeds. Happy News Hacking!

#!/bin/bash

##
## Show headlines
##

usage() {
  me=$(basename "$0")
  echo "Usage: $me  -t timestamp [-v] [-u] [-s source]"
  echo "Usage: $me  -d date-string [-v] [-u] [-s source]"
  exit 1
}

source_map() {
  case $1 in
    cnn_top) u='http://rss.cnn.com/rss/cnn_topstories.rss' ;;
    cnn_world) u='http://rss.cnn.com/rss/cnn_world.rss' ;;
    cnn_politics) u='http://rss.cnn.com/rss/cnn_allpolitics.rss' ;;
    cnn_tech) u='http://rss.cnn.com/rss/cnn_tech.rss' ;;
    cnn_business) u='http://rss.cnn.com/rss/money_latest.rss' ;; 
    nyt_business) u='https://rss.nytimes.com/services/xml/rss/nyt/Business.xml' ;;
    nyt_politics) u='https://rss.nytimes.com/services/xml/rss/nyt/Politics.xml' ;;
    nyt_top) u='https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml' ;;
    fox_top) u='http://feeds.foxnews.com/foxnews/latest' ;;
    fox_politics) u='http://feeds.foxnews.com/foxnews/politics' ;;
    fox_tech) u='http://feeds.foxnews.com/foxnews/scitech' ;;
  esac

  echo "$u"
}


source=cnn_top

while getopts "t:d:vus:h" o; do
  case "$o" in
    s) source=$OPTARG ;;
    d) date=$OPTARG ;;
    t) timestamp=$OPTARG  ;;
    u) include_url=yes ;;
    v) verbose=yes ;;
    * | h)
      usage
      ;;
  esac
done


if [ -n "$date" ] ; then
  timestamp=$(date -d "$date" +%Y%m%d)
fi

source_url=$(source_map "$source")
if [ -z "$source_url" ] ; then
  echo "$source isn't a valid source" >&2
  exit 1
fi

if [ -z "$timestamp" ] ; then
  usage
fi

timestamp=$(echo $timestamp | sed 's/[^0-9]//g')

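# jq -r prints the string "null" for a missing field; '// empty' turns that
# into no output, so the -z test below catches a missing snapshot.
url=$(curl -s -G \
           --data-urlencode "url=$source_url" \
           --data-urlencode "timestamp=$timestamp" \
           'https://archive.org/wayback/available' | tee "$HOME/.headlines.wb" | jq -r '.archived_snapshots.closest.url // empty')

if [ -z "$url" ] ; then
  echo "No headlines found for: $timestamp"
  echo "";
  cat "$HOME/.headlines.wb"
  exit 1
fi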



curl -s "$url" | xmllint --format - | if [ "$verbose" = "yes" ] ; then
  cat
else
  expr="-m '/rss/channel/item' -v pubDate -o '|' "
  if [ "$include_url" = "yes" ] ; then
    expr="$expr -v guid -o '|' "
  fi
  expr="$expr -v title -n"
  eval xmlstarlet sel -t $expr | grep -v '^[|]'
fi
