Monday, June 04, 2018

Dabbling in Computational Magic: Pearson Hashing

Hashing is one of the closest things we have to magic in the computer world. Imagine I told you that I could turn any piece of text, from a single letter to the entire Bible, into a unique 32 character string. Impossible! you'd say: one can generate an infinite number of strings, yet 32 characters is a fixed space. And yet, this is exactly what hashing algorithms like MD5 deliver.* In short, magic.

(Not sure why this is exciting? Consider how much value we get from finger prints, which behave much like hashes. They're not perfect, but they do a great job of uniquely identifing and verifying a person. The same can be said about hashes and data.)

Programming Praxis has an excellent exercise on the topic: Pearson Hashing which is a fun way to understand how hashing works.

The paper that describes Pearson Hashing is only 4 pages long and quite readable. I highly recommend printing it out and spending some quality time with it.

Pearson Hashing isn't quite as magical as MD5: its promise isn't a unique number for any piece of text. Instead, it maps arbitrary text to a number between 0 and 255 in an essentially random fashion. It does have some useful properties: text that's similar produces completely different Person Hashing values and text that uses the same characters but in a different order also produce different values. You can imagine using Pearson Hashing as a rough check to make sure text hasn't been tampered with or other sanity checks.

The implementation of Pearson Hashing is almost trivial:

(define (pearson-hash text)
    (let loop ((hash 0)
               (chars (map char->integer (string->list text))))
      (cond ((null? chars) hash)
            (else
             (loop (vector-ref (*T*) (bitwise-xor hash (car chars)))
                   (cdr chars))))))

I say almost trivial, because the algorithm calls for the construction of a randomly permuted table of values. I found that generating this random table took far more time than the implementation of the hash algorithm. You can find my full implementation here.

To make the exercise more interesting, I branched out a bit and implemented a few variations of pearson-hashing. These include:

  • simple-hash: described in the Pearson Hashing paper as an algorithm that uses an even simpler approach to hashing.
  • pearson-hash-mod: rather than use bitwise-xor to derive the next hash value, this approach uses mod 256.
  • lr-pearson-hash: the Pearson Hashing paper describes an approach to generating larger range hash values. That is, a wider range than just 0 - 255.

I grabbed a list of the top 10,000 English words identified by Google and used it to compare each of the hashing functions above.

My comparison is weak, but still somewhat informative.

As promised, the Pearson Hashing variations always give a different hash on similar text, even if an anagram is used (the simple-hash fails this test). The pair of numbers after the hashing tests represent the max and min bucket size from hashing the 10,000 words from Google. If hashing the 10,000 words was perfectly distributed, then each of the 256 hash values would get 39 or so words associated with them. You can see that all the hashing algorithms were in that neighborhood, with Pearson Hashing being slightly better than the simple-hash approach.

My test also shows that if I use the Pearson Hashing algorithm and generate 40 bit values (that is, I run pearson-hash 5 times for each string) then each of the 10,000 words get a unique hash value. MD5 works in terms of 128 bit values, which I could approximate by running the running pearson-hashing algorithm 16 times. I've got no idea how MD5 works, but I imagine generating 128 bit Pearson Hash values would start to get me into the neighborhood of the magic hashing I initially described.

*Note: MD5 isn't as secure as the I'm making it out to be. There are known issues. There are stronger hashes currently available. I expect those will be obsolete one day, too. MD5 is good enough for our purposes.

No comments:

Post a Comment