Wednesday, August 10, 2011

Thoughts On A Server Outage

As I type out this blog entry, Rimuhosting, one of my favorite web hosting providers, is experiencing a major outage. I feel for them, as I've been there before. Hopefully, they'll be back up in no time. However, with this server outage on my mind, I can't help but share a few thoughts:

Before the Outage

  • Set up offsite backups today. Right now. I've done this for a number of servers, and my preferred method is duplicity + Amazon's S3. Duplicity efficiently creates backups and S3 makes storing the results dirt cheap.
  • Life in the cloud means that an outage like this can be a major work stopper. Diversifying where you keep information in the cloud can help. Even so, I'll still take the (hopefully) well-paid support team of a cloud service over storing things locally and expecting my skills as a computer geek to solve the problem.
  • Make sure your company has a Twitter account where service status updates can be broadcast. Twitter makes an ideal place for these updates because (a) it's where folks already are and (b) you'll want a solution that's off your network and unlikely to be down when you are.
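
A minimal sketch of what the duplicity + S3 setup might look like when driven from a small Python script (say, from cron). The helper names, the 30-day full-backup interval, and the example bucket URL are all assumptions for illustration. The exact S3 URL scheme duplicity expects varies between versions, so check its man page before relying on it. Credentials and the encryption passphrase are read by duplicity from the environment.

```python
import os
import subprocess


def build_backup_command(source_dir: str, target_url: str,
                         full_every: str = "30D") -> list[str]:
    """Build the duplicity command line for one backup run.

    With no explicit action, duplicity makes an incremental backup when a
    previous chain exists and a full backup otherwise. The
    --full-if-older-than option forces a fresh full backup periodically.
    """
    return ["duplicity", "--full-if-older-than", full_every,
            source_dir, target_url]


def run_backup(source_dir: str, target_url: str) -> None:
    """Run the backup, failing early if the expected variables are unset."""
    # Assumption: AWS credentials and the GnuPG passphrase come from the
    # environment rather than from a config file.
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "PASSPHRASE")
    missing = [name for name in required if name not in os.environ]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    subprocess.run(build_backup_command(source_dir, target_url), check=True)


# Hypothetical usage; the bucket name and URL scheme are placeholders:
# run_backup("/var/www", "s3://example-bucket/www-backups")
```

Whatever the script looks like, a backup only counts once a restore from it has been tried at least once.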

Checking for an Outage

  • Whenever I notice a website is acting funky, the first place I head over to is search.twitter.com. A couple of weeks ago, Gmail wasn't working for me. Turned out I had forgotten to turn off my debugging proxy. A quick search of Twitter showed me it wasn't some massive outage. Today's Rimuhosting issue was detected in just a few seconds when I checked Twitter.
  • I recently discovered Down For Everyone Or Just Me (downforeveryoneorjustme.com), a service which no doubt does a trivial check to see if a site is down from its location. The idea is that it would be unlikely that your personal setup and its server setup are both unable to reach the site. It's a clever idea and a site I'll hold on to.
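
The "trivial check" is presumably something along these lines. This is only a guess at the idea, not how that service actually works, and the function name and the injectable `opener` parameter are assumptions made for illustration. It sends a HEAD request and treats any HTTP response, even an error page, as proof that the server is reachable.

```python
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 5.0,
                 opener=urllib.request.urlopen) -> bool:
    """Return True if the server answers at all, even with an error status."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied (for example 404 or 500), so it is up.
        # HTTPError must be caught before its parent class URLError.
        return True
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, or timeout.
        return False


# Hypothetical usage:
# print(is_reachable("http://rimuhosting.com/"))
```

Run from your own machine, this only tells you what your connection sees. The value of the service is that the same check runs from somewhere else.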

When You Go Down

  • The first order of business should be to report the status on Twitter. Be honest. Outages are part of life on the Internet. Keeping people informed will help keep folks calm. Facebook is also a good place to spread the word of your status.
  • Work like hell to get the problem fixed.
  • Take responsibility for what you broke.
  • Look for a way to compensate folks for their time. You probably can't give everyone everything they want (how do you credit someone for lost customers that may have brought in thousands of dollars' worth of business?). But do something. Anything.

There, my ramblings on an outage. Now, let's see, did Rimu come back while I was typing this away? D'oh. They did not. This ain't good.

I'm pulling for you guys!

6 comments:

  1. Found your post searching online for anything on Rimuhosting being down because I have three VPS on them. What is worse, their website is also down, so I cannot even access my control panel or Webmin. I'm heading over to check out your Amazon backup recommendation. For sure when this is over, I'll be doing some real backup.

    I'm also considering having a mirror of my site so that when downtimes take as long as this it will only be a matter of changing the IP address of my domains to have them working off the backup sites.

    This is ugly. But I like Rimuhosting - one of the best I've used. If they get this back before I start losing my 'good' feelings for them, they'll still remain my favorite :)

  2. Funny, so I read your post before I went to bed, and was thinking that I was glad I chose Slicehost years ago and not Rimu. Then at 1:59AM I start getting alerts from Pingdom that my host is down. It turns out that Slicehost (Rackspace) is having a pretty severe outage at their Atlanta datacenter. That host has *never* gone down for like 3 or 4 years. What the heck is going on with our datacenters?

  3. Ginger - I know what you mean about the IP address change taking as long to propagate as the fix. I always worry that in the scramble to deal with a down server, I'll make the problem worse.

    Mark - man, that's strange about your data center in Atlanta. Given what we know about how much data centers put into their power requirements, it's amazing that they would fail so badly.

    Turns out, my client's servers didn't even go down. When I was finally able to access them this morning, I found that my screen sessions were alive and ready to go. Guess we were just hit with routers and other hardware being out.

    Guess we got lucky.

  4. Yup, same here, although my uptime is only 357 days. I certainly don't recall rebooting last year. :-)

    Slicehost's problem was indeed some sort of connectivity problem into the datacenter.

    I've actually been thinking about switching over to Linode, and I had considered doing it last night during the outage. I can get faster hardware and twice the RAM for the price I'm paying now.

  5. You'll have to let me know how Linode works out for you.

    Me, I'm still a Rimu guy for my own servers.

  6. Hi Ben, Glenn from RimuHosting here. Thank you so much for the kind words, and for sticking with us. I've added a month of free hosting to your account. Feel free to contact us directly if you have any other questions :).

    In case you were not already aware, we posted details to http://blog.rimuhosting.com/2011/08/12/10-august-report/

    I like the ideas in this post also. One thing to add: content backups are not just recommended, but essential. You never know when that old document or configuration may turn out to be just what you needed. Sorry, probably preaching to the converted... have a great day.
