Wednesday, April 11, 2018

Gotcha of the day: A recipe to keep cron from crushing my server

I've got a customer who uses kraken.io to provide image optimization for their WordPress site. Using the very cool wp command line plugin, it's possible to set up a cron job that processes images in the background.

*  *  *  *  *  /usr/local/bin/wp media krake --limit=20 > /dev/null 2>&1

Once a minute, the system tries to optimize up to 20 images. If someone uploads a whole bunch of images, the system will temporarily serve non-optimized files. But over time, the system catches up and all is good.

That is, until kraken.io has an outage. Needless to say, I found this out the hard way. The problem is that, rather than gracefully timing out, the wp media krake command just hangs. Once a minute, cron diligently kicks off yet another wp command. This repeats until the box crushes itself.

As I imagined solutions to this problem, I pictured myself hacking the krake code or implementing some general purpose time-out wrapper for cron jobs.

But as I mulled the problem over, I realized that my solution need not be so complex. While I was focused on the lack of a timeout that kept the processes hanging, an equally valid approach would focus on ensuring that only a single instance of wp media krake ever runs at a time. To implement that, I'd need only check for a lock file before launching the wp command.

And of course, Unix already has a command to implement exactly this. It's flock. Using flock is trivial: you provide it with a file to use as a lock and a command to run. If that file is locked, flock either waits until it can proceed or, when given the -n option, refuses to run the command at all.

For example, in one shell I run:

flock /tmp/foo -c "sleep 10; ls"

And in another shell, I run:

flock -n /tmp/foo -c "echo 'Whoo!'"

As long as the original command is running (which takes at least 10 seconds), the second command fails immediately instead of waiting, and 'Whoo!' is never printed.
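
To watch this happen in a single script rather than two shells, here is a minimal sketch. It assumes the util-linux version of flock (where a refused -n lock exits with status 1 by default) and uses a scratch file from mktemp as a stand-in for /tmp/foo; the variable names are only for illustration.

```shell
#!/bin/sh
# Demo: a non-blocking flock fails fast while another process holds the lock.
lock=$(mktemp)   # scratch lock file for this demo (stand-in for /tmp/foo)

# Holder: takes the lock and keeps it for 3 seconds, in the background.
flock "$lock" -c "sleep 3" &
holder=$!
sleep 1          # give the holder time to acquire the lock

# Contender: -n means "do not wait". With util-linux flock, a refused
# lock yields exit status 1 unless -E says otherwise.
flock -n "$lock" -c "echo 'Whoo!'"
busy_status=$?
echo "while locked: exit status $busy_status"

wait "$holder"   # the holder finishes and releases the lock

# With the lock free, the very same command now runs.
free_output=$(flock -n "$lock" -c "echo 'Whoo!'")
free_status=$?
echo "after release: exit status $free_status, output: $free_output"

rm -f "$lock"
```

The non-zero exit status is what makes -n handy in scripts: the caller can tell "skipped because busy" apart from "ran".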

How have I gone my whole life without knowing about flock? This is awesome.

So rather than having to hack wp media krake, or heck, write any code at all, I've solved my problem with a simple command line change.

*  *  *  *  *  flock -xn /tmp/krake.cron.lock -c "/usr/local/bin/wp media krake --limit=20" > /dev/null 2>&1

Next time kraken.io goes down (may it not be for 120 years!), the cron command will hang. But that's OK, because a minute later, the lock is still in place and flock will refuse to launch another instance.
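
The outage scenario can be rehearsed without waiting for a real one. The sketch below is only a simulation under stated assumptions: "sleep 4" stands in for a hung wp media krake, a mktemp file stands in for /tmp/krake.cron.lock, and a loop of three quick attempts stands in for cron firing once a minute.

```shell
#!/bin/sh
# Simulate an outage: one job hangs while cron keeps firing.
lock=$(mktemp)   # stand-in for /tmp/krake.cron.lock

# Stand-in for a hung "wp media krake": holds the lock for 4 seconds.
flock -xn "$lock" -c "sleep 4" &
hung=$!
sleep 1          # let the "hung" job grab the lock first

# Three compressed cron "ticks"; each one should be refused right away.
started=0
skipped=0
for tick in 1 2 3; do
    if flock -xn "$lock" -c "true"; then
        started=$((started + 1))
    else
        skipped=$((skipped + 1))
    fi
done
echo "started=$started skipped=$skipped"

wait "$hung"     # the stuck job eventually ends and frees the lock
rm -f "$lock"
```

On a system with util-linux flock this should report started=0 skipped=3: no matter how many ticks pass, only the one stuck process exists.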

Bye-bye self-created DoS attack; hello server that can trivially weather a third-party outage.
