Monday, July 09, 2007

Postgres Hack Of The Day

We had a bunch of data in our database that was corrupted with some binary characters (which are actually the unicode value to use when no other is appropriate, but from my perspective, it was binary junk).

We wrestled with all sorts of ways of scrubbing the data, and finally, ended up with this quick and dirty hack. No doubt there's a better way to do this (please, school me in the comments :-)).

First, we figured out that you can convert arbitrary text and binary code to a hex string by using the encode function. At first, that didn't work because apparently there's no cast setup between varchar and binary strings. But that's easily remedied with:

 CREATE CAST (varchar as bytea) WITHOUT FUNCTION;

(Man, I love Google. I'd be so out of work without it.)

With the ability to turn binary strings into hex text, we were then able to do a regexp replacement against the hex text. In our case, we wanted to replace these junky characters with an apostrophe - which for the record, has a hex value of 27.

Finally, we converted the hex text to a string by using decode.

The final hack turned out to be:

SELECT decode(regexp_replace(encode(name::bytea, 'hex'), 
                             'efbfbd', '27'), 
              'hex') 
FROM some_table where name ilike 'Name With Junk At The End%'; 

It's ugly - but it works.

Bonus Hack: While struggling with binary characters, I accidentally typed \x into psql. Turns out, that turns on expanded output, which among other things will show you output in the format:

  -[ RECORD 1 ]-----------------
  name    | Mr. Foo
  phone   | 555-1212
  address | Fooville

The command line tools for MySql and Sybase both offered this vertical layout, but I had to see psql do it. Turns out, it does.

2 comments:

  1. Anonymous6:32 PM

    Ben, this page explains how our data was corrupted by the replacement character .

    ~Kelly

    ReplyDelete
  2. Anonymous11:21 AM

    Ok, just one more update: because we had cast varchar as bytea, when I ran the update using the hack I got an error informing me that the update failed because the column value was of type varchar, but expression was of type bytea.

    The original regex was too greedy:

    SELECT regexp_replace('FOO�S BAR', '[[:alnum:]_]([\u�])[[:alnum:]_]', '''');
    regexp_replace
    ----------------
    FO' BAR


    So I simplified it, and it returned just what I wanted, and as a varchar:

    SELECT regexp_replace('FOO�S BAR', '�', '''');
    regexp_replace
    ----------------
    FOO'S BAR


    Because the regexp_replace returns a varchar, the update works!

    ~K

    ReplyDelete