Tuesday, November 10, 2015

HTML Content Migration, The Quick and Lazy Way

We're exploring a redesign of our shul's website, including a move to WordPress. Like any website migration, there's the question of moving over content. On one hand, the current site's HTML is pretty poor, so just copying the HTML into WordPress is a bad idea. On the other hand, creating each page from scratch in WordPress and copying and pasting the text in place is just too arduous to consider. What to do?

First things first: I wanted to get hold of all the content on the current site. That's easy; I just ran:

 wget -r http://www.etzhayim.net

That left me with a directory full of messy HTML files. My first thought was to rig something up using Lynx and sed:

lynx -nolist -dump Education.html | \
   sed '1,/The Wisdom Project/d' | \
   sed '/Etz Hayim | 2920/,/All rights reserved/d'

(The use of lynx was inspired by this recent tweet)

That actually got me pretty close, but I figured I could do better. So I busted out PHP and threw every HTML cleanup trick in the book at the files, including Simple HTML DOM, strip_tags, string replacements and regular expression replacements. You can find the code that does all that here.
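To give a flavor of that kind of cleanup, here's a minimal sketch. It is not the page-sifter code itself: it uses PHP's built-in DOM extension instead of Simple HTML DOM, and the function name `sift_page`, the tag whitelist, and the sample markup are all made up for illustration. It pulls out one content container, makes relative links and image URLs absolute, and strips every tag that isn't on the whitelist.

```php
<?php
// Hypothetical helper for illustration; this is NOT the actual page-sifter code.
function sift_page(string $html, string $contentId, string $baseUrl): string
{
    // Old sites rarely validate, so collect libxml's parse warnings quietly.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Pull out the guts of the page; header, footer, and sidebars stay behind.
    $xpath = new DOMXPath($doc);
    $content = $xpath->query("//*[@id='" . $contentId . "']")->item(0);
    if ($content === null) {
        return '';
    }

    // Prefix relative link and image URLs with the base URL.
    // Naive on purpose: no "../" resolution, just simple prefixing.
    foreach ($xpath->query('.//a[@href] | .//img[@src]', $content) as $el) {
        $attr = $el->nodeName === 'a' ? 'href' : 'src';
        $url = $el->getAttribute($attr);
        if (!preg_match('~^(?:[a-z][a-z0-9+.-]*:|//|#)~i', $url)) {
            $el->setAttribute($attr, rtrim($baseUrl, '/') . '/' . ltrim($url, '/'));
        }
    }

    // Serialize only the container's children, not the container itself.
    $inner = '';
    foreach ($content->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }

    // Keep a short whitelist of tags (an assumption; adjust to taste),
    // then collapse runs of whitespace.
    $inner = strip_tags($inner, '<p><a><img><ul><ol><li><h1><h2><h3><strong><em><br>');
    $inner = preg_replace('/\s+/', ' ', $inner);

    return trim($inner);
}

// Made-up sample page, loosely shaped like an old hand-written site.
$html = '<html><body><div id="header">Nav</div>'
      . '<div id="content2"><p><font size="2">Hello</font> '
      . '<a href="about.html">About</a></p></div>'
      . '<div id="footer">All rights reserved</div></body></html>';

echo sift_page($html, 'content2', 'http://www.mysite.com/'), "\n";
```

The whitelist is the part worth tuning: anything presentational (font, center, inline tables used for layout) gets dropped, while the text inside survives.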

When I was finally done, I had scrubbed HTML content but no obvious way to get it into WordPress. Enter WP All Import. This excellent WordPress plugin has definitely become part of my Hacker's Toolbox. I updated my PHP code to generate one large XML document, then uploaded that document via WP All Import. The plugin takes you step by step through the process, and I was able to import all the pages into the site with very little effort.
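Generating that one large XML document takes only a few lines. The sketch below is an assumption about what such a file could look like, not the format page-sifter actually emits: the element names (`pages`, `page`, `title`, `content`) and the function name `pages_to_xml` are made up. As far as I know, the import plugin lets you map whatever elements you have onto post fields, so the exact names shouldn't matter much.

```php
<?php
// Hypothetical helper: turn an array of cleaned pages into one XML document.
// Each page is an array with 'title' and 'content' keys (names are assumptions).
function pages_to_xml(array $pages): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;

    $root = $doc->appendChild($doc->createElement('pages'));

    foreach ($pages as $page) {
        $el = $root->appendChild($doc->createElement('page'));

        // Text nodes escape &, < and > for us, so the scrubbed HTML can go
        // in as-is and any XML parser will hand it back unchanged.
        $title = $el->appendChild($doc->createElement('title'));
        $title->appendChild($doc->createTextNode($page['title']));

        $content = $el->appendChild($doc->createElement('content'));
        $content->appendChild($doc->createTextNode($page['content']));
    }

    return $doc->saveXML();
}

// Made-up sample data.
echo pages_to_xml([
    ['title' => 'Education', 'content' => '<p>Classes meet weekly.</p>'],
    ['title' => 'Adult Ed & More', 'content' => '<p>Hello</p>'],
]);
```

Building the document with DOM calls rather than string concatenation means titles with ampersands or stray angle brackets can't produce a malformed file.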

If you find yourself needing to clean up a lot of HTML in bulk and want to use my code (aka: page-sifter), you'll need to provide a .conf.php file. This file contains site-specific settings. Here's a sample to get you started:

// Which element in each page contains the actual text of the site?
// Use this to pull out the guts of the page, and discard the header, footer, and sidebars
define('SIFT_CONTENT_EXPR', '#content2');

// Turn relative URLs into absolute ones so that images and linked content resolve
define('SIFT_BASE_URL', 'http://www.mysite.com/');
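If you'd rather feed a selector like `#content2` to PHP's built-in DOM extension than to Simple HTML DOM, it has to become an XPath query first. Here's a rough sketch of that translation. The function `selector_to_xpath` is hypothetical, it only handles a bare `#id`, `.class`, or tag name, and the guarded `define` calls just stand in for loading the real .conf.php.

```php
<?php
// Stand-ins for: require __DIR__ . '/.conf.php';
if (!defined('SIFT_CONTENT_EXPR')) {
    define('SIFT_CONTENT_EXPR', '#content2');
}
if (!defined('SIFT_BASE_URL')) {
    define('SIFT_BASE_URL', 'http://www.mysite.com/');
}

// Hypothetical helper: translate a very simple selector into XPath.
// Supports only "#id", ".class", or a bare tag name.
function selector_to_xpath(string $selector): string
{
    $name = substr($selector, 1);
    switch ($selector[0] ?? '') {
        case '#':
            return "//*[@id='$name']";
        case '.':
            // Match one class among several space-separated ones.
            return "//*[contains(concat(' ', normalize-space(@class), ' '), ' $name ')]";
        default:
            return '//' . $selector;
    }
}

// Made-up page fragment to try it on.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML('<div id="sidebar">Links</div><div id="content2"><p>Hello</p></div>');
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$node = $xpath->query(selector_to_xpath(SIFT_CONTENT_EXPR))->item(0);
echo $node === null ? 'no match' : trim($node->textContent), "\n";
```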

Stay Lazy My Friends.
