Wednesday, April 22, 2015

Slurping in Google Spreadsheet Data in 6 lines of Standard PHP Code

A data analyst and a coder walk into a bar... Actually, scrap that. A data analyst and a coder have to quickly solve a problem. The data analyst has the data in spreadsheet form, but the coder wants a programmer friendly solution. What do they do? The coder could build out a full blown app, first by importing the analyst's data into a database, and then developing a crude UI for the analyst to interact with, but that would be overkill. Remember Programmers should be lazy.

Recently I found myself in this sort of situation. The code I needed to write was basically some trivial PHP. But getting the data in and out was going to be a pain. So I turned to my Digital Swiss Army Knife: Google Spreadsheets for a solution.

I had the data analyst import the data into a Google Spreadsheet. I then used the old school Spreadsheets List API to access the data from within PHP. To make matters especially simple, I had the analyst publish the data to the web and then used the publicly accessible API. The result was that I spent no time worrying about authentication, and could test everything out on the command line using curl.

I used PHP's curl to grab the data and SimpleXML to easily pick through it. There were a couple of speed bumps to get over (like how to access the namespaced elements in SimplXML), but once I figured them out, the solution couldn't have been simpler.

Here's a trivial demo I put together which exercises all this. First, I've got a Sample Data Set, and then I've got some code which slurps in this data and spits out the max and average. Of course, if that's all the info I needed, I'd forgo this whole PHP script. Where this sort of solution really shines is when you need to access multiple tabs from the spreadsheet and either filter or aggregate data in complex ways.

Notice how the actual PHP for picking up the data from the spreadsheet is a measly 6 lines of code. SimpleXML's magic means that while no official API is used, accessing the document is painless.

Happy Hacking!

<?
/*
 * Demonstrate easy access to a Google Spreadsheet
 */

// Before getting into the code, you'll want to:
// (a) make sure the document is published to the web
// (b) figure out the worksheet URL by running something like
// 
//    curl -s 'https://spreadsheets.google.com/feeds/worksheets/[id]/public/full' | xmllint.exe  -format -|\
//       grep '2006#list' | sed -e 's/^.*href="//' -e '/".*//' 
//

$worksheet_url = 'https://spreadsheets.google.com/feeds/list/1IR4JAaJWcScBONP4R3O_IbO4ecyVF5kOBEBzoExjM7E/od6/public/full';

/*
 * OK, we've got our URL so slurp it in and make it an easily accessible
 * document.  Notice how the $gsx namespace is setup for easy use below.
 */
$ch   = curl_init($worksheet_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$xml  = curl_exec($ch);
$doc  = new SimpleXMLElement($xml);
$ns = $doc->getNamespaces(true);
$gsx = $ns['gsx'];

/*
 * Do something with the data.
 */

header("Content-Type: text/plain");

$nsamples = 0;
$max = false;
$avg = false;

foreach($doc->entry as $e) {
 $nsamples++;
 if($max === false) {
   $max = array('ts' => $e->children($gsx)->timestamp . "", 
                'val' => $e->children($gsx)->pressure);
 } else {
   if(($e->children($gsx)->pressure . "") > $max['val']) {
     $max['ts']  = $e->children($gsx)->timestamp . "";
     $max['val'] = $e->children($gsx)->pressure . "";
   }
 }
 if($avg === false) {
   $avg = $e->children($gsx)->pressure . "";
 } else {
   $avg = (($e->children($gsx)->pressure . "") + $avg) / 2;
 }
}

echo "Num Samples: $nsamples\n";
echo "Max: {$max['ts']} = {$max['val']}\n";
echo "Avg: $avg\n";

?>

No comments:

Post a Comment