Thursday, March 31, 2016

Gotcha of the day: AWS EC2 Instance Periodically Gets Database Connection Timeout Errors

I recently migrated a client from a single server environment to a dynamic AWS environment. The migration went smoothly enough. But we ended up with an odd conundrum: at 3 minutes after the hour, various requests would fail with a database Connection Timeout error. It was like clockwork. 9:03. 10:03. 11:03. You get the idea. The problem struck for just a few seconds, but it absolutely impacted customers.

What the heck was going on?!

The Connection Timeout error (specifically, MySQL client error 2002) seemed like a network issue. Was the network being restarted once an hour? Were we getting a new IP from DHCP? Or was there some DNS action happening once an hour? To test these theories, I ran this recipe from the command line:

while sleep .5; \
  do date ; \
   echo | nc -vv db.server.com 3306 ; \
   echo ; echo ; echo ; \
done

The above command uses netcat to establish a connection to the database. The output looks something like so:

Thu Mar 31 14:37:53 UTC 2016
Connection to db.server.com 3306 port [tcp/mysql] succeeded!
adfasfdafda^mysql_native_password

While much of that is gibberish, the -vv output shows me that the connection was successful.

I kicked this off at 9:02pm. By 9:04pm, the usual alerts had come in saying that various DB connections had timed out. But my command line above showed no such error.

So it wasn't a Linux networking issue. The network was stable throughout the event. Ugh.

After much more debugging, I whipped up this PHP script:

<?php
// Pull in the site's configuration, which sets up the DB connection.
require_once('system/config.php');

// mysql_value() is a site helper that runs a query and returns a single value.
$count = mysql_value("SELECT COUNT(*) FROM users");
echo date('c') . " | $count\n";
?>

The idea was to create a trivial PHP script (which, of course, isn't really trivial, because it loads our system config, which pulls in a huge number of files) that runs a single, simple database query.

I ran this bad boy in a loop like so:

while sleep .5; \
  do php -f script.php ; done

10:03pm rolled around, and sure enough, the script hung and finally failed with a Connection Timeout. Now this was progress!

To truly figure out what I was missing, I turned to my good friend strace. I ran the above like so:

while sleep .5; \
  do strace -o /tmp/snapshot.out php -f script.php ; done

I waited for the command to hang and reviewed the file. The tail of the file was like so:

recvfrom(3, "\344\341\201\200\0\1\0\1\0\1\0\staging.db.server.com"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.2.0.74")}, [16]) = 208
close(3)                                = 0
gettimeofday({1459436115, 155678}, NULL) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
connect(3, {sa_family=AF_INET, sin_port=htons(3306), sin_addr=inet_addr("58.28.1.102")}, 16) = -1 EINPROGRESS (Operation now in progress)

As is typical with strace, most of the output was meaningless to me. But it was clear what those last lines were showing. First, the script was stuck trying to connect to 58.28.1.102, which was odd, because that wasn't our database's IP. Furthermore, just before that it had done a name lookup of staging.db.server.com. Aha, so that's where the mystery IP address was coming from! It wasn't random; it was our staging server.
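A quick sanity check worth knowing here: you can ask the system resolver directly what a name maps to, which would have surfaced the unexpected IP without strace. This is just a sketch of the technique, using localhost as a stand-in for the staging hostname:

```shell
#!/bin/sh
# Query the resolver the same way the application does (nsswitch-aware),
# to see exactly which IP a hostname maps to at this moment.
# "localhost" stands in here for a name like staging.db.server.com.
getent hosts localhost
```

Running this in a loop alongside the failing requests would show the moment a name starts resolving somewhere unexpected.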

In other words, at 3 minutes past the hour, something funky was causing our system configuration to fall back to the staging database. With this information in hand, the problem was trivial to track down. We had an hourly cron job that refreshed a configuration file, and while that file was being rewritten, the system fell back to the staging setup. It took swapping two lines of code to fix this monster of a problem.
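I won't show the actual two-line fix, but the usual pattern for making this kind of refresh safe is write-then-rename: build the new config in a temp file, then rename it into place. Since rename is atomic on the same filesystem, readers see either the old file or the new one, never a half-written one. A minimal sketch, with made-up paths and contents:

```shell
#!/bin/sh
# Sketch: refresh a config file atomically. Paths and keys are
# hypothetical, for illustration only.
CONFIG=/tmp/demo-config.ini

# Write the new contents to a temp file on the same filesystem.
TMP="$(mktemp "${CONFIG}.XXXXXX")"
printf 'db_host=db.server.com\ndb_port=3306\n' > "$TMP"

# rename() is atomic, so no reader ever sees a partial config.
mv "$TMP" "$CONFIG"

cat "$CONFIG"
```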

So netcat, strace and traditional Unix skills won the day. But I also learned another incredibly valuable lesson: whenever you report any sort of network error, include the host and port you were trying to connect to. Had the database connection error included this, no Linux heroics would have been necessary. Lesson learned.
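The lesson boils down to a one-line habit. Here's a sketch, where report_timeout is a hypothetical helper, not something from our actual codebase:

```shell
#!/bin/sh
# Hypothetical helper: any connection failure names the exact host
# and port it was trying to reach.
report_timeout() {
    echo "Connection Timeout connecting to ${1}:${2}" >&2
}

# Had the original error looked like this, the staging hostname
# would have jumped out immediately:
report_timeout staging.db.server.com 3306
```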
