Our NFS server setup at our datacenter consists of two SuperMicro SC933 chassis, each with dual 3 GHz Intel Xeons, 2 GB of memory, and fifteen 200 GB SATA disks connected to an Areca ARC-1160 16-port SATA RAID controller. High availability through redundancy and failover is taken care of by Heartbeat and DRBD. This setup serves the document roots for our web cluster over NFS, so it obviously is very important that it always works :)

These systems run Slackware Linux, which has historically been my distro of choice for critical systems. When deploying Heartbeat on Slackware I ran into some issues which I’d like to share here. I won’t go into basic stuff like actually compiling and installing DRBD and Heartbeat, since that is pretty well documented in various other places, starting with the Linux-HA site (home of Heartbeat).

Heartbeat starts HA services as defined in the file ‘haresources’, but in doing so, Heartbeat seems a bit SysV-init biased. SysV-init based systems only start a service when they are told to do so, by linking its init.d script into a certain runlevel. So on a SysV-style system, if you don’t want to start NFS services at boot time, you just remove them from their runlevels while leaving the init.d script intact.

Because Slackware has more of a BSD-style init, it starts most of its service daemons from /etc/rc.d/rc.M, including RPC services and NFS. We don’t want these services started by rc.M at boot time, because we want Heartbeat to manage them. Normally on Slackware, if you don’t want a certain service started at boot time, you would simply ‘chmod a-x’ the specific rc script in /etc/rc.d, but that is not an option here, since Heartbeat still needs to be able to start and stop the service from its ‘haresources’. My solution was to leave all scripts executable, and rename the rc scripts of the services I wanted Heartbeat to manage:

cd /etc/rc.d
mv rc.rpc rc.rpc-hb
mv rc.nfsd rc.nfsd-hb
mv rc.samba rc.samba-hb

Now the RPC services, NFS and Samba will no longer be started at boot time, since rc.M only looks for the existence of the rc scripts without the added ‘-hb’ part.
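To illustrate why the rename is enough, here is a small sketch that plays it out in a scratch directory. It assumes the boot script guards each service with an executable test on the original file name; the function name boot_would_start and the dummy rc.nfsd script are made up for this demo and are not part of Slackware.

```shell
# Stand-in for the kind of check a BSD-style boot script makes (assumed form).
boot_would_start() {
  [ -x "$1" ]
}

# Build a dummy rc script in a scratch directory, then rename it.
demo=$(mktemp -d)
printf '#!/bin/sh\necho "nfsd $1"\n' > "$demo/rc.nfsd"
chmod +x "$demo/rc.nfsd"
mv "$demo/rc.nfsd" "$demo/rc.nfsd-hb"

# The boot-time check looks for the old name and finds nothing.
if boot_would_start "$demo/rc.nfsd"; then
  echo "boot: would start rc.nfsd"
else
  echo "boot: rc.nfsd not found, skipped"
fi

# The renamed script can still be called explicitly, which is what Heartbeat does.
sh "$demo/rc.nfsd-hb" start

rm -rf "$demo"
```

The renamed script keeps its executable bit, so anything that calls it by its new name can still run it.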

Next we tell Heartbeat the names of the rc scripts to start/stop by putting them in ‘haresources’. My ‘haresources’ file looks like this:

fs1 drbddisk::shared drbddisk::backups \
Filesystem::/dev/drbd0::/var/nfsroot/shared::reiserfs \
Filesystem::/dev/drbd1::/var/nfsroot/backups::xfs \
Delay::2::0 \
rc.samba-hb \
rc.rpc-hb \
rc.nfsd-hb \
Delay::3::0 \
IPaddr::10.0.0.150/16/eth0

As you can see, I have Heartbeat managing two DRBD volumes (‘shared’ and ‘backups’), NFS and Samba, and one shared IP address.

To have DRBD and Heartbeat started at boot time, I added the following to ‘rc.local’:

/etc/rc.d/drbd start
/etc/rc.d/heartbeat start

And to stop them at reboot, I added this to ‘rc.local_shutdown’:

/etc/rc.d/heartbeat stop
/etc/rc.d/drbd stop

Also, don’t forget to move /var/lib/nfs to your DRBD volume so that NFS lock state survives a failover, and alter your rc.rpc-hb to add a cluster name to the startup of statd. More background info on this at Linux-HA (steps 4e and 4f).
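The move itself could look roughly like the sketch below. The helper name relocate_to_shared and the target path under /var/nfsroot/shared are assumptions for illustration; follow the Linux-HA steps for the exact procedure on each node.

```shell
# Hypothetical helper: move a directory onto the replicated volume
# and leave a symlink behind at the old location.
relocate_to_shared() {
  src=$1
  dest=$2
  if [ -L "$src" ]; then
    echo "$src is already a symlink, nothing to do"
    return 0
  fi
  if [ -e "$dest" ]; then
    echo "$dest already exists, refusing to overwrite" >&2
    return 1
  fi
  mv "$src" "$dest" && ln -s "$dest" "$src"
}

# On the node that currently has the DRBD volume mounted (example paths):
#   relocate_to_shared /var/lib/nfs /var/nfsroot/shared/varlibnfs
# statd would then be started with a shared name via its -n option,
# e.g. rpc.statd -n <cluster-name>
```

Stop the NFS and RPC services before moving the directory, so nothing writes to it halfway through.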

Now for all this to work properly, there is one more thing, which I found out the hard way :P When Heartbeat releases its resources, it stops all services mentioned in ‘haresources’ by calling their related rc scripts with ‘stop’. I ran into some strange failover behaviour, and found the following in my logs:

heartbeat: 2007/10/01_13:16:42 info: Running /etc/rc.d/rc.samba stop
heartbeat: 2007/10/01_13:16:42 ERROR: Return code 1 from /etc/rc.d/rc.samba
heartbeat: 2007/10/01_13:16:42 ERROR: Resource script for rc.samba probably not LSB-compliant.
heartbeat: 2007/10/01_13:16:42 WARN: it (rc.samba) MUST succeed on a stop when already stopped
heartbeat: 2007/10/01_13:16:42 WARN: Machine reboot narrowly avoided!

Apparently, Heartbeat will sometimes call an rc script with ‘stop’ while the service is already stopped. Now, looking at our rc.samba-hb, we see the following stop function:

samba_stop() {
killall smbd nmbd
}

This call to killall returns a non-zero exit code when there are no processes to kill, and because it is the last command in the function, the rc script exits with that same non-zero code. This makes Heartbeat think something failed, resulting in the above error messages. The fix for this is rather simple, though maybe a bit hackish. Change the samba_stop function by adding an ‘exit 0’:

samba_stop() {
killall smbd nmbd
exit 0
}
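A variant worth considering, offered as a sketch and not a tested drop-in: ‘exit 0’ inside the function ends the whole rc script, so any function that calls samba_stop and then carries on (a restart function, for example) would never get past the stop. Swallowing only killall’s exit status keeps the function returning 0 without ending the script:

```shell
samba_stop() {
  # killall exits non-zero when no smbd/nmbd is running;
  # treat "nothing to stop" as success so the script can return 0.
  killall smbd nmbd 2>/dev/null || true
}
```

Like ‘exit 0’, this also hides genuine killall failures, so it trades accuracy for keeping Heartbeat quiet.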

This should make Heartbeat happy. There are probably other rc scripts around that do not comply with this “stop must succeed when already stopped” rule, so check your startup scripts.
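One way to check is to call ‘stop’ on every renamed script and look at the exit codes. The sketch below assumes the ‘-hb’ naming used above; the function name check_stop_status is made up. Run it only on a node where the services are already stopped (the standby), since it really does call ‘stop’.

```shell
# Call "stop" on every rc.*-hb script in a directory and report
# which ones return a non-zero exit code.
check_stop_status() {
  dir=$1
  failed=0
  for script in "$dir"/rc.*-hb; do
    [ -f "$script" ] || continue
    if sh "$script" stop >/dev/null 2>&1; then
      echo "OK   $script"
    else
      echo "FAIL $script"
      failed=1
    fi
  done
  return "$failed"
}

# Example, on the standby node:
#   check_stop_status /etc/rc.d
```

Any script reported as FAIL is a candidate for the same kind of fix as rc.samba-hb.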

Finally, watch out with DRBD 0.7.24 on a system with a kernel >= 2.6.22. I still use the DRBD 0.7 branch, and when I deployed 0.7.24 on kernel 2.6.22.9, I ran into a whole bunch of trouble. The load average would suddenly spike enormously, and my systems stopped responding to shutdown or reboot commands. I found a drbd-user mailing list posting from someone with similar issues. Apparently it’s a known issue with DRBD 0.7.24 on kernel >= 2.6.22 and XFS, which is fixed only in Subversion!
After I fetched and installed the latest Subversion revision of the drbd-0.7 branch as described here, the problem was solved.
