Thursday, May 29, 2008

Names, Lots of Names

I needed a whole bunch of sample last names for some bogus data for an app I'm working on. Thankfully, we have Google, and of course, this information is just a query away.

I quickly found the Top 1000 most common surnames in the US (as of the 1990 census).

This should be exactly what I need.

Little-known fact: Simon is a more common last name than Frank and Clayton. Not to mention Lloyd and Boone. Heck, we just barely beat out Waters.

As a bonus, here's a little PLT Scheme app I wrote to grab the names and fix their case. I make no claim that it's especially efficient or compact; it just is.

(module name-grabber mzscheme
  (require (lib "" "net")
           (prefix l: (lib "" "srfi"))
           (prefix s: (lib "" "srfi"))
           (planet "" ("lizorkin" "sxml.plt" 1 4))
           (planet "" ("neil" "htmlprag.plt" 1 3))
           (lib "" "srfi"))

  (provide get-names)

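  ;; Capitalize the first letter and lowercase the rest, e.g. "SMITH" -> "Smith".
  (define (fix-case word)
    (string-append (string-upcase (substring word 0 1))
                   (string-downcase (substring word 1))))

  ;; Fetch the page, parse it into SXML, and return the case-fixed names.
  (define (get-names)
    (let* ((url (string->url ""))
           (in-port (get-pure-port url))
           (doc (html->sxml in-port)))
      (close-input-port in-port)
      (let ((name-rows ((sxpath '("//table[@style = 'boldtable']/tr")) doc)))
        ;; Map each row to the text of its first td cell, then drop the
        ;; result for the first row.
        (l:drop (map (lambda (row)
                       (let ((cols ((sxpath '(// td *text*)) row)))
                         ;; Rows without td text produce no name.
                         (if (not (null? cols))
                             (fix-case (l:first cols)))))
                     name-rows)
                1)))))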
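
To see what fix-case does on its own, here is a minimal sketch that can be pasted into the DrScheme REPL. The sample-names list is made up for illustration and is not taken from the census page; the real input comes from get-names.

```scheme
;; Same definition as in the module above, repeated so this snippet stands alone.
(define (fix-case word)
  (string-append (string-upcase (substring word 0 1))
                 (string-downcase (substring word 1))))

;; Hypothetical sample input in mixed case.
(define sample-names '("SMITH" "johnson" "wILLIAMS"))

;; Evaluates to ("Smith" "Johnson" "Williams").
(map fix-case sample-names)
```

Note that fix-case assumes a non-empty string: (substring word 0 1) raises an error on "", so an empty cell would need a guard before this is called.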

Update: Added a provide clause, thanks Grant!


  1. Thanks Ben. You don't have a provide clause in there; is that intentional?

  2. No, it wasn't intentional, and I just added it.

    But it wasn't completely unintentional either.

    Lately, I've been using DrScheme as an environment where I can quickly whip up little apps to generate stuff, like this list of names or bulk sets of INSERT statements.

    Perhaps because Windows has no shell, I've been finding that DrScheme can fill that niche.

    DrScheme has a "module" mode where the bindings for a given module are visible, which is what I was using when I wrote the attached code.

    In this case, I wrote up the script, kicked off the REPL, and used a couple of list operations to take a slice of the 1000 names. And I was done.

    Thanks for the suggestion!

  3. I actually blogged about generating random fake names before (7/2006), because I needed some test data. My solution used curl, grep, and sed (all from a cygwin bash shell) to do very, very basic html scraping.

    Of course, we all know what's bound to happen when you rely on html scraping: the html eventually changes, and your scripts no longer work [properly].

    So, I revised it in 11/2006, but apparently never posted the revised version.

    That code no longer works, either, though. I just wanted to share the way I did it, especially the website that I used:

  4. Got to say Dave, is impressive. I could do without all the ads, but creating an entire identity is just too cool.

    Actually, the mobile version is pretty clean.

    Nice find!
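
The REPL workflow described in comment 2 (take a slice of the names, then generate INSERT statements from it) might look roughly like the sketch below. Everything here is an assumption made for illustration: slice and name->insert are hypothetical helpers, the people table and last_name column are invented, and the short names list stands in for the result of (get-names).

```scheme
;; Return the elements of lst from index start (inclusive) to end (exclusive).
;; slice is a hypothetical helper, not part of the original module.
(define (slice lst start end)
  (let loop ((remaining (list-tail lst start))
             (n (- end start))
             (acc '()))
    (if (or (<= n 0) (null? remaining))
        (reverse acc)
        (loop (cdr remaining) (- n 1) (cons (car remaining) acc)))))

;; Build one INSERT statement; the table and column names are made up.
(define (name->insert name)
  (string-append "INSERT INTO people (last_name) VALUES ('" name "');"))

;; Stand-in for the list returned by (get-names).
(define names '("Smith" "Johnson" "Williams" "Jones" "Brown"))

;; Statements for the names at indexes 1 and 2 (Johnson and Williams).
(map name->insert (slice names 1 3))
```

This does no SQL escaping, so a name containing an apostrophe would produce a broken statement. That is tolerable for throwaway test data but not for anything else.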


