Wednesday, February 03, 2010

The Joy Of Generators - Playing Around With Bing, Scheme and Yield

I've been a fan of generators for quite some time. But seeing that they were added to the core PLT scheme libraries, I thought they were worth another look.

What I like about generators is their ability to take potentially complex looping problems, and make them a lot simpler and more module. Consider this example that I whipped up in a hurry - I think generators solve the problem in an elegant way, where regular iteration would have gotten messy.

The Problem

I wanted to experiment with generators in a context that showed off their ability to work with large sets of data. And what larger set of data is there than the web? The little I challenge I worked up is this:

Suppose you want to find out how your website ranks for a given search phrase. That is, when you search the web, what page of the results your domain show up on? (if at all)

While I could have solved this with any search engine, Bing has a clever little feature - you can set the format of any search to be xml and instantly get back a well formed XML document you can work with. Google and Yahoo do this too, but they do it through fairly extensive APIs and for my toy example, I wanted something simpler.

The Solution

The first part of the solution was to add the ability to search the web using Bing and to return the results. This has nothing to do with generators, so I won't bother showing the solution here. Though you're welcome to download the code here. It's yet another example of using PLT scheme to grab an XML document off the web and parse it - a trick that every time I do, makes me smile.

With a quick bing library written I can move on to the fun part. What I want to do is to write a function that takes in a search term (aka keyword) and a domain (the website to check), and find the occurrences of when the search result mentions the domain. I wrote the core of the solution in a single function below. Notice that the name ends in /y - this is a hint to me that this function will yield results. I'm not sure this is an ideal standard, but for our purposes it will work. Here's the code:

(require "bing.ss"
         scheme/generator
         (only-in srfi/13 string-contains))

(define-struct  match (url page position) #:prefab)

;; Check for matches of domain when searching for keyword.
;; max-pages says how many results we should look through before we give up.
(define (keyword-check/y keyword domain max-pages)
  (define (offset p) (* p 10))
  ;; This nested function, 'check', does the real work. It checks
  ;; a given page for occurrences.
  (define (check page)
    (when (< page max-pages)
      (let ([results (search keyword #:offset (offset page))])
        (unless (null? results)
          ;; OK, if we made it this far, it means that bing has results for us
          ;; to look through. So let's look at each one.
          (for ([r results] [i (in-naturals)])
            (when (string-contains (doc-url r) domain)
              ;; At this point, our string-contains tells us we have
              ;; a match. Whoo! What do we do? yield the result.
              (yield (make-match (doc-url r) page i))))
          ;; This should look like an infinite loop. We just finished
          ;; checking one page, and are now going onto the next.
          ;; However, we'll only do this if we're asked to give another
          ;; result from our yield.
          (check (add1 page))))))
  (check 0))

The beauty of this code is that it's oblivious to the context that it's called within. When it finds a match, it invokes yield, and the continues looking for additional matches. It doesn't need to worry that the result set it's analyzing may be infinite.

I created a quick wrapper function for keyword-check/y that turns it into a simple generator:

(define (keyword-check/g keyword domain max-pages)
  (generator (keyword-check/y keyword domain max-pages)))

I can now make use of the solution. Say I want to see how my company's website is doing for the phrase Software Idea. I can do that interactive like so:

> (define idea-finder (keyword-check/g "software idea" "ideas2executables.com" 100))

;; idea-finder is now a function that I can invoke to get results from
;; my generator. Note, I'm willing to look through the first 100 pages of
;; bing results.
;;
> (idea-finder)
#s(match "http://www.ideas2executables.com/" 0 8)
> (idea-finder)
#s(match "http://www.ideas2executables.com/" 1 0)
> (idea-finder)
#s(match "http://www.ideas2executables.com/portfolio/" 6 6)
> (idea-finder)
#s(match "http://www.ideas2executables.com/portfolio/" 7 1)
> (idea-finder)
#s(match "http://www.ideas2executables.com/who-we-are/" 15 1)

You can see that it found the website on the first and second pages, then the 7th and 8th and finally on page 16 (notice the 0 indexing).

Finally, I can wrap up the generator in a neat little package like so:

(define (keyword-checker keyword domain)
  (for/list ([i (in-range 0 3)]
             [hit (in-generator (keyword-check/y keyword domain 100))])
    hit))

I can then invoke this function like any other:

> (keyword-checker "Software Idea" "ideas2executables.com")
(#s(match "http://www.ideas2executables.com/" 0 8)
 #s(match "http://www.ideas2executables.com/" 1 0)
 #s(match "http://www.ideas2executables.com/portfolio/" 6 6))

For large data sets, being able to just yield and continue searching for more data, really does make for an elegant programming model.

1 comment:

  1. Anonymous4:27 AM

    Thanks for this interesting posting.

    A "meta comment": your CSS design is not very reader-friendly, it makes your code very hard to read with this fixed width. With my choice of font size I have to scroll all the time. Or copy and paste to Emacs and read it there. Also, your "random snapshots" is distracting.

    There is no need to define `check' as an internal procedure, your module system will hide it anyway. You only make it harder to call it separately for testing e.g.

    There is no `for' in Scheme :-).

    The job of making nested parens stand out is better left to a decent IDE, brackets here are just a kludge. Anyway, just a matter of taste...

    s/occurance/occurence

    ReplyDelete