Wednesday, July 20, 2005

I fought glibc and won

This morning I woke to a few strange emails from Ask and a ticket saying that the Parrot TODO list was broken. The gist of it was: getprotobyname was causing our production mod_perl processes to crash. (See the entry from earlier today for a stack trace.)



Obviously, not a good thing.



Question One: What changed? Neither Ask nor I could remember changing anything that *should* have affected this recently. The easy solution (revert) was out.


Step one: Reproduce and isolate. This turned out to be the easy part. I set up a bare-bones Combust with a single website and a single controller:



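package Test::Control::Test;
use strict;
use warnings;
use base 'Combust::Control';
use DBD::mysql;    # removing this line makes the crash go away
use LWP::Simple;

# mod_perl method handler: fetch a page and send it back as HTML
sub handler ($$) {
    my ($self, $r) = @_;
    my $output = LWP::Simple::get("http://www.cnn.com");
    $self->send_output(\$output, 'text/html');
}

1;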

Without the use DBD::mysql, everything was fine. With it, KABOOM.


Step two: Debug. Telling Apache not to fork (-X) made it easier to track things down:

/pkg/apache1/bin/httpd -X -f /home/robert/minisite/apache/conf/httpd.conf

GDB wasn't particularly helpful. It got me the stack trace and told me the crash was happening during DSO symbol lookup. (Smells like glibc!) Hrm, might be a problem with something scribbling over memory. Let's try Valgrind. No luck: the older version of Valgrind we have on the system bombs out.


Ask found this RedHat Bugzilla ticket. It describes a similar problem that never got resolved, but it did lead me closer to the solution. As suggested in the ticket, I ran my Apache with all of the dynamic loader debugging enabled:

LD_DEBUG=all LD_DEBUG_OUTPUT=/tmp/some-file

The copious (16 MB) output ended like this:



4702: symbol=_nss_files_parse_protoent; lookup in file=/pkg/packages/apache-1.3.33/libexec/mod_setenvif.so
4702: symbol=_nss_files_parse_protoent; lookup in file=/pkg/packages/apache-1.3.33/libexec/libperl.so
4702: symbol=_nss_files_parse_protoent; lookup in file=/lib/libnsl.so.1
4702: symbol=_nss_files_parse_protoent; lookup in file=/lib/libutil.so.1

It should have looked something like this (from earlier in some-file), where the run of lookups ends with a "binding file" line once the symbol is found:



4702: symbol=strlen; lookup in file=/pkg/apache1/bin/httpd
4702: symbol=strlen; lookup in file=/lib/tls/libm.so.6
4702: symbol=strlen; lookup in file=/lib/libcrypt.so.1
4702: symbol=strlen; lookup in file=/usr/lib/libgdbm.so.2
4702: symbol=strlen; lookup in file=/lib/libdl.so.2
4702: symbol=strlen; lookup in file=/lib/tls/libc.so.6
4702: binding file /pkg/packages/apache-1.3.33/libexec/mod_log_config.so to /lib/tls/libc.so.6: normal symbol `strlen' [GLIBC_2.0]

So, we now know that the problem is that the dynamic linker is having trouble binding the symbol _nss_files_parse_protoent (which lives in /lib/libnss_files.so), and that it has something to do with DBD::mysql. We also know that mysql.so (the C portion of DBD::mysql) is linked against libnss_files. (See the ldd output or the information in some-file.)
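Finding the stuck symbol by eyeballing the tail of a 16 MB log works, but it can also be sketched as a filter. The function below is a rough sketch, assuming the log lines look like the excerpts above ("symbol=NAME; lookup in file=..." and "binding file ... symbol `NAME' ..."). The name unbound_symbols is made up for this example; it is not part of glibc or any tool.

```shell
#!/bin/sh
# unbound_symbols (hypothetical helper): read LD_DEBUG-style output on stdin
# and print each symbol that was looked up but never shows up in a
# "binding file" line. Assumes the line layout shown in the excerpts above.
unbound_symbols() {
    awk '
        # Lookup line: remember the symbol name the first time it is seen.
        /symbol=[^;]*; +lookup in file=/ {
            sym = $0
            sub(/^.*symbol=/, "", sym)
            sub(/;.*$/, "", sym)
            if (!(sym in seen)) { seen[sym] = 1; order[++n] = sym }
        }
        # Binding line: the name sits between a backtick and a closing quote.
        /binding file .* symbol `/ {
            sym = $0
            sub(/^.*symbol `/, "", sym)
            sub(/[^A-Za-z0-9_].*$/, "", sym)
            bound[sym] = 1
        }
        # Report lookups that never got a binding, in first-seen order.
        END {
            for (i = 1; i <= n; i++)
                if (!(order[i] in bound)) print order[i]
        }
    '
}
```

It would be run as unbound_symbols < logfile. Note that the loader appends the process ID to the name given in LD_DEBUG_OUTPUT, so the file on disk is named like /tmp/some-file.4702. Symbols that legitimately never bind would show up here too, so the output is a list of suspects, not a verdict.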


That struck me as odd, so I tried rebuilding DBD::mysql to see why it was linking against the NSS libraries. glibc normally loads those on its own at run time, based on /etc/nsswitch.conf. A general-purpose application definitely shouldn't be linking against specific NSS ("Name Service Switch") libraries.
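A quick way to check whether a shared object links directly against NSS plugins is to filter its ldd output. This is only a sketch: nss_deps is a name invented here, and it assumes the usual ldd layout of one dependency per line with the library name in the first column.

```shell
#!/bin/sh
# nss_deps (hypothetical helper): read ldd output on stdin and print the
# names of any NSS plugin libraries (libnss_*) that appear as dependencies.
# Assumes lines like: "libnss_files.so.2 => /lib/libnss_files.so.2 (0x...)"
nss_deps() {
    awk '$1 ~ /^libnss_/ { print $1 }'
}
```

Usage would be something like ldd /path/to/mysql.so | nss_deps, where the path is wherever your DBD::mysql build put mysql.so. No output means no direct NSS linkage.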


Turns out, DBD::mysql was getting the information from mysql_config.


--libs [-L/usr/lib/mysql -lmysqlclient -lz -lcrypt -lnsl -lm -lc -lnss_files -lnss_dns -lresolv -lc -lnss_files -lnss_dns -lresolv]

Whoa! Duplication, redundancy, extra libraries, and explicit linking against libc. Definitely not something that most applications should do. I could understand that MySQL itself might need to do weird things - it's a complicated application - but things linking against it shouldn't have to.


Step three: Fix it.


--- mysql_config.old 2005-07-20 16:01:34.000000000 -0700
+++ mysql_config 2005-07-20 16:02:06.000000000 -0700
@@ -86,10 +86,10 @@
 # Create options
 libs="$ldflags -L$pkglibdir -lmysqlclient -lz -lcrypt -lnsl -lm "
-libs="$libs -lc -lnss_files -lnss_dns -lresolv -lc -lnss_files -lnss_dns -lresolv"
+libs="$libs -lc -lresolv"
 libs=`echo "$libs" | sed -e 's; \+; ;g' | sed -e 's;^ *;;' | sed -e 's; *\$;;'`
-libs_r="$ldflags -L$pkglibdir -lmysqlclient_r -lz -lpthread -lcrypt -lnsl -lm -lpthread -lc -lnss_files -lnss_dns -lresolv -lc -lnss_files -lnss_dns -lresolv "
+libs_r="$ldflags -L$pkglibdir -lmysqlclient_r -lz -lpthread -lcrypt -lnsl -lm -lpthread -lc -lresolv "
 libs_r=`echo "$libs_r" | sed -e 's; \+; ;g' | sed -e 's;^ *;;' | sed -e 's; *\$;;'`
 cflags="-I$pkgincludedir -O2 -mcpu=i486 -fno-strength-reduce " #note: end space!
 include="-I$pkgincludedir"

That's cheating. But it got the job done. After making that change, I rebuilt DBD::mysql. By not explicitly linking mysql.so against -lnss_files, the internal magic of glibc could do the right thing and not blow up.
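For anyone who would rather not edit mysql_config in place, the same cleanup can be sketched as a filter over its --libs output. This is not what I did above, just an illustration: clean_libs is a made-up name, and dropping repeated flags is an assumption that holds for shared libraries like these but can break builds where a static library really does need to be listed twice.

```shell
#!/bin/sh
# clean_libs (hypothetical helper): take a linker flag string, such as the
# output of `mysql_config --libs`, and print it without NSS plugin
# libraries and without repeated flags. Order of first appearance is kept.
clean_libs() {
    out=""
    for flag in $1; do
        case "$flag" in
            -lnss_*) continue ;;        # glibc loads NSS plugins itself
        esac
        case " $out " in
            *" $flag "*) continue ;;    # already in the list, skip the repeat
        esac
        out="${out:+$out }$flag"
    done
    printf '%s\n' "$out"
}
```

Fed the --libs string shown earlier, this leaves -L/usr/lib/mysql -lmysqlclient -lz -lcrypt -lnsl -lm -lc -lresolv, which matches what the patched mysql_config produces for the non-threaded case.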


1 comment:

  1. fwiw, there is a bug filed against mysql (http://bugs.mysql.com/1814 ) that relates to this. the mysql autoconf stuff probably needs to make a distinction between CFLAGS-for-server-compilation and CFLAGS-for-client-compilation.
