Wednesday, August 17, 2016

Gotcha of the Day: WordPress 4.6 Crashes with a Segfault when Serving HTTPs Content

See Also: Gotcha of the Day: More WordPress 4.6 Segfault Challenges

Earlier this morning, one of my clients' sites started spitting out Secure Connection Failed messages when attempting to access https pages within WordPress:

PHP files served via https worked fine, and WordPress files served via http worked fine.

Looking in Apache's error log I saw a stream of generic segfault message:

[Wed Aug 17 12:06:19 2016] [notice] child pid 9015 exit signal Segmentation fault (11)
[Wed Aug 17 12:06:22 2016] [notice] child pid 9035 exit signal Segmentation fault (11)
[Wed Aug 17 12:06:25 2016] [notice] child pid 9034 exit signal Segmentation fault (11)
[Wed Aug 17 12:06:33 2016] [notice] child pid 8883 exit signal Segmentation fault (11)
[Wed Aug 17 12:06:42 2016] [notice] child pid 9031 exit signal Segmentation fault (11)
[Wed Aug 17 12:06:52 2016] [notice] child pid 9037 exit signal Segmentation fault (11)
[Wed Aug 17 12:07:04 2016] [notice] child pid 8976 exit signal Segmentation fault (11)
[Wed Aug 17 12:07:11 2016] [notice] child pid 9033 exit signal Segmentation fault (11)
[Wed Aug 17 12:07:19 2016] [notice] child pid 9036 exit signal Segmentation fault (11)

Ugh. That's not good. Or helpful.

In response to the ticket I opened with the hosting company, the support tech suggested I follow the instructions here which outline a method for analyzing the core dump generated during a segfault.

It's been years since I busted out gdb and tried to make sense of a core dump, but I didn't have much choice.

I tried to follow the instructions in the article, but most of steps didn't work. I wasn't able to install the debug symbols, and my attempt to have the core dump written to a special directory failed. But still, setting ulimit -c unlimited was key, and once that was done, Apache was glad to drop 200+ Megabyte core files in my webroot.

Analyzing the core dump was more fruitful than I'd imagined. Here's a relevant snippet:

Core was generated by `/usr/sbin/httpd'.
Program terminated with signal 11, Segmentation fault.
#0  0xb77d0a0a in ?? () from /lib/libpcre.so.0
Missing separate debuginfos, use: debuginfo-install httpd-2.2.15-54.el6.centos.i686
(gdb) bt
#0  0xb77d0a0a in ?? () from /lib/libpcre.so.0
#1  0xb77d33d4 in ?? () from /lib/libpcre.so.0
#2  0xb77dafb4 in pcre_exec () from /lib/libpcre.so.0
#3  0xb6c658ed in php_pcre_split_impl () from /etc/httpd/modules/libphp5.so
#4  0xb6c65ed7 in ?? () from /etc/httpd/modules/libphp5.so
#5  0xb6e18659 in ?? () from /etc/httpd/modules/libphp5.so
#6  0xb6deee7e in execute () from /etc/httpd/modules/libphp5.so
#7  0xb6dbd0fb in zend_call_function () from /etc/httpd/modules/libphp5.so
#8  0xb6d048c6 in ?? () from /etc/httpd/modules/libphp5.so
#9  0xb6e18659 in ?? () from /etc/httpd/modules/libphp5.so
#10 0xb6deee7e in execute () from /etc/httpd/modules/libphp5.so
#11 0xb6dbd0fb in zend_call_function () from /etc/httpd/modules/libphp5.so
#12 0xb6d048c6 in ?? () from /etc/httpd/modules/libphp5.so
#13 0xb6e18659 in ?? () from /etc/httpd/modules/libphp5.so
#14 0xb6deee7e in execute () from /etc/httpd/modules/libphp5.so
#15 0xb6dbd0fb in zend_call_function () from /etc/httpd/modules/libphp5.so
#16 0xb6d048c6 in ?? () from /etc/httpd/modules/libphp5.so
#17 0xb6e18659 in ?? () from /etc/httpd/modules/libphp5.so
#18 0xb6deee7e in execute () from /etc/httpd/modules/libphp5.so
#19 0xb6dbd0fb in zend_call_function () from /etc/httpd/modules/libphp5.so

What I see there is an infinite loop within PHP code. The system is probably crashing due to running out of stack space. As noted above, this is a WordPress and https problem. Given that there's no mention of SSL in the above function calls, this lead me to believe the problem was strictly within PHP.

Then it hit me: earlier in this morning we'd upgraded the site to WordPress 4.6. Could that be the cause? I grabbed WordPress 4.5.3 and ran some diffs. Whoa, there was way too many changes to detect what might have happened in 4.6 to cause this.

I then decided to try rolling back WordPress. I followed the instructions here, which turned out to be surprisingly simple. They boiled down to:

  1. Shutting down the web server
  2. Moving the existing 4.6 wp-includes, wp-admin away
  3. Copying 4.5.3 version of wp-includes and wp-admin into the web root
  4. Copying and overwriting top level 4.5.3 *.php files over those that came with 4.6
  5. Started up the web server
  6. Logged into wp-admin
  7. WordPress detected the version change and told me that it would need to "upgrade" the database. I clicked the button to proceed
  8. The system was happy again

I have to say, I'm quite impressed that WordPress can so smoothy "upgrade" to a past version.

Most importantly, WordPress + https worked again!

I've got to say, I'm totally floored by this. It appears as though a WordPress update triggered a low level PHP bug which triggered a massive stack overflow and core dump. What are the odds?

A quick check of Twitter showed that nobody else was complaining of this issue, so my guess is that it's a weird mix of the version of PHP and Apache and such that's causing this. More than anything, this is a reminder that under the hood there's serious complexity (WordPress depends on PHP, which depends on Apache, which depends on the OS, which depends on SSL networking code, and so on) and yet most of the time, things Just Work. Remarkable.

See Also: Gotcha of the Day: More WordPress 4.6 Segfault Challenges

4 comments:

  1. Having a great rollback story is good approach in software and life.

    ReplyDelete
  2. Thanks for sharing my tutorial on downgrading WordPress!

    ReplyDelete
  3. Thank you! This error was driving me crazy.

    I just experienced it today - it broke Plugins, Themes, Settings and some other pages in my WP Multisite admin as well as for some of the individual site's admin pages. In Chrome, the error is quite different: ERR_EMPTY_RESPONSE.

    After finding your post, I checked my apache logs and see the same segfault as you do. It seems to have been triggered after an automatic restart of Apache this morning. I suspect it is related to an unattended upgrade that ran earlier in the morning, upgrading openssl:amd64 from version 1.0.2g-1ubuntu4.1 to 1.0.2g-1ubuntu4.4.

    I updated Wordpress to 4.6 and 4.6.1 a while ago, so it may be the specific combination of these Wordpress and OpenSSL version that causes problems.

    I'll try to follow your downgrade instructions and see if that helps.

    A bit of extra info from anyone else experiencing this:

    The Apache error log from yesterday starts with

    [Thu Sep 22 06:25:01.966435 2016] [mpm_prefork:notice] [pid 30521] AH00163: Apache/2.4.18 (Ubuntu) OpenSSL/1.0.2g-fips configured -- resuming normal operations
    [Thu Sep 22 06:25:01.966454 2016] [core:notice] [pid 30521] AH00094: Command line: '/usr/sbin/apache2'

    The error log from today starts with

    [Fri Sep 23 06:25:03.260257 2016] [mpm_prefork:notice] [pid 30521] AH00163: Apache/2.4.18 (Ubuntu) OpenSSL/1.0.2g configured -- resuming normal operations
    [Fri Sep 23 06:25:03.260271 2016] [core:notice] [pid 30521] AH00094: Command line: '/usr/sbin/apache2'
    [Fri Sep 23 06:25:16.276552 2016] [core:notice] [pid 30521] AH00051: child pid 16323 exit signal Segmentation fault (11), possible coredump in /etc/apache2

    ...and the segfaults just keep on coming.

    ReplyDelete
  4. I am seeing segfaults on WP after upgrade from 4.5.2 to 4.6.1, OSSEC began reporting the notices soon after upgrade. I am not seeing many issues on the front end, however. I am not running HTTPS on this server, and am actively working on an nginx implementation.

    I’ve backtraced the segfault problem via gdb, but there is not much detail to the results. I’ve changed logging to debug, and the lines surrounding the segfault mostly include wordpress’ static file server, ms-files.php.

    Not sure what’s up, but I recently upgraded WordPress on that server to 4.6.1, so if this issue just started since last week, it could be something to do with the upgrade.

    I did change PHP’s header output buffering to 8192 from 4096, and this seems to make the segfault happen less frequently.

    pid 52015
    (gdb) backtrace
    #0 0x00007ff355a67f00 in ?? () from /etc/httpd/modules/libphp5.so
    ...
    #6 0x00007ff355be8ac0 in execute () from /etc/httpd/modules/libphp5.so
    #7 0x00007ff355bb93f5 in zend_call_function () from /etc/httpd/modules/libphp5.so
    #8 0x00007ff355b10e67 in ?? () from /etc/httpd/modules/libphp5.so
    #9 0x00007ff355c11978 in ?? () from /etc/httpd/modules/libphp5.so
    #10 0x00007ff355be8ac0 in execute () from /etc/httpd/modules/libphp5.so
    #11 0x00007ff355bc2cad in zend_execute_scripts () from /etc/httpd/modules/libphp5.so
    #12 0x00007ff355b70c18 in php_execute_script () from /etc/httpd/modules/libphp5.so
    #13 0x00007ff355c4c335 in ?? () from /etc/httpd/modules/libphp5.so
    #14 0x00007ff360662fa0 in ap_run_handler ()
    #15 0x00007ff36066685e in ap_invoke_handler ()
    #16 0x00007ff360671dcc in ap_internal_redirect ()
    #17 0x00007ff3575a9a15 in ?? () from /etc/httpd/modules/mod_rewrite.so
    #18 0x00007ff360662fa0 in ap_run_handler ()
    #19 0x00007ff36066685e in ap_invoke_handler ()
    #20 0x00007ff360671f90 in ap_process_request ()
    #21 0x00007ff36066edd8 in ?? ()
    #22 0x00007ff36066aac8 in ap_run_process_connection ()
    #23 0x00007ff360676d1d in ?? ()
    #24 0x00007ff360677029 in ?? ()
    #25 0x00007ff360677cac in ap_mpm_run ()
    #26 0x00007ff36064ea80 in main ()

    --
    [Wed Oct 19 15:04:44 2016] [debug] mod_headers.c(743): headers: ap_headers_output_filter()
    [Wed Oct 19 15:04:44 2016] [debug] mod_deflate.c(687): [client 172.27.72.6] Zlib: Compressed 10233 to 10089 : URL /wp-includes/ms-files.php, referer: http://www.utc.edu/
    [Wed Oct 19 15:04:44 2016] [debug] mod_headers.c(743): headers: ap_headers_output_filter()
    [Wed Oct 19 15:04:44 2016] [debug] mod_deflate.c(687): [client 172.27.72.6] Zlib: Compressed 18305 to 18153 : URL /wp-includes/ms-files.php, referer: http://www.utc.edu/
    [Wed Oct 19 15:04:44 2016] [debug] mod_headers.c(743): headers: ap_headers_output_filter()
    [Wed Oct 19 15:04:44 2016] [debug] mod_deflate.c(687): [client 172.27.72.6] Zlib: Compressed 20250 to 20069 : URL /wp-includes/ms-files.php, referer: http://www.utc.edu/
    [Wed Oct 19 15:04:44 2016] [debug] mod_headers.c(743): headers: ap_headers_output_filter()
    [Wed Oct 19 15:04:44 2016] [debug] mod_deflate.c(687): [client 172.27.72.6] Zlib: Compressed 29124 to 4040 : URL /index.php
    [Wed Oct 19 15:04:44 2016] [debug] mod_headers.c(743): headers: ap_headers_output_filter()
    [Wed Oct 19 15:04:45 2016] [debug] mod_headers.c(743): headers: ap_headers_output_filter()
    [Wed Oct 19 15:04:45 2016] [notice] child pid 52858 exit signal Segmentation fault (11)
    [Wed Oct 19 15:04:45 2016] [debug] mod_headers.c(743): headers: ap_headers_output_filter()
    [Wed Oct 19 15:04:45 2016] [debug] mod_headers.c(743): headers: ap_headers_output_filter()
    [Wed Oct 19 15:04:47 2016] [debug] mod_headers.c(743): headers: ap_headers_output_filter()
    [Wed Oct 19 15:04:47 2016] [debug] mod_deflate.c(687): [client 172.27.72.6] Zlib: Compressed 0 to 2 : URL /wp-cron.php, referer: http://blog.utc.edu/business/wp-cron.php?doing_wp_cron=1476903887.2620859146118164062500
    [Wed Oct 19 15:04:47 2016] [debug] mod_headers.c(743): headers: ap_headers_output_filter()
    --

    ReplyDelete