Monitoring disk/RAID hardware in CentOS on HP DL360 servers

For some time I’ve been having problems monitoring disk and RAID hardware in CentOS Linux on our HP ProLiant DL360 G5 servers. To begin with, I discovered that hpasm, the main monitoring agent provided by HP, does not actually give any information about the disks or the RAID controller, even though it does seem to monitor all other hardware in the server.

So, to monitor the disks, I started using smartd, which is part of the smartmontools package and comes with the CentOS distribution. This software uses the SMART data reported by the disks to attempt to predict when they are going to fail. It’s quite easy to configure. For example, here’s the /etc/smartd.conf file I created on one of our servers which has six SAS disks arranged into a single RAID 5 logical drive. Each line points at the same controller device, and the number after "cciss," selects one physical disk behind it. This should cause smartd to email me as soon as any problems are detected with any of the disks:

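# admin@example.com is a placeholder: -m must be followed by the address that should receive alerts
/dev/cciss/c0d0 -d cciss,0 -a -m admin@example.com
/dev/cciss/c0d0 -d cciss,1 -a -m admin@example.com
/dev/cciss/c0d0 -d cciss,2 -a -m admin@example.com
/dev/cciss/c0d0 -d cciss,3 -a -m admin@example.com
/dev/cciss/c0d0 -d cciss,4 -a -m admin@example.com
/dev/cciss/c0d0 -d cciss,5 -a -m admin@example.com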
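
Since these lines differ only in the disk index, they can be generated rather than typed. Here is a rough sketch of how that might look. The function name gen_smartd_conf and the address admin@example.com are placeholders of mine, not anything that ships with smartmontools, and the disk count is something you have to supply yourself.

```shell
#!/bin/sh
# Sketch only: print one smartd.conf line per physical disk behind a cciss controller.
# gen_smartd_conf is a hypothetical helper name; the mail address is a placeholder.
#   $1 = controller device, e.g. /dev/cciss/c0d0
#   $2 = number of physical disks (indexes run from 0 to count-1)
#   $3 = address that smartd should mail warnings to
gen_smartd_conf() {
    device=$1
    count=$2
    mailto=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s -d cciss,%d -a -m %s\n' "$device" "$i" "$mailto"
        i=$((i + 1))
    done
}

# Example (review the output before copying it into /etc/smartd.conf):
#   gen_smartd_conf /dev/cciss/c0d0 6 admin@example.com
```

If you want to confirm that mail delivery works at all, smartd.conf also accepts a "-M test" directive after the -m address, which makes smartd send a test message when it starts.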

Unfortunately, smartd doesn’t actually appear to be all that useful. It starts up fine and indicates via syslog that it’s monitoring the disks, but we’ve now had two disk failures on our HP servers, and neither time did smartd give any warning whatsoever about the failure. I only found out about the failure each time because I saw the orange warning light on the disk when I happened to be at the data centre – really not ideal!

I therefore went back to HP to figure out which of their software I would need for proper RAID/disk monitoring. HP Technical Support were predictably useless, telling me I needed to install the ‘System Management Homepage’ and a whole load of other bloated nonsense that I really didn’t want taking up resources on our servers. Some googling quickly revealed what I actually needed: the hpacucli RPM. It comes in the HP ProLiant Support Pack (‘PSP’) alongside various other things, including the RPM for the hpasm agent I mentioned above (which annoyingly seems to have been renamed to hp-health; I wish HP wouldn’t keep changing the names of things unnecessarily, because it’s already confusing enough trying to keep track of their software). The PSP can be downloaded from the Support section of HP’s website and is available for a variety of Linux distributions. The hpacucli RPM installs in the normal way (rpm -i etc.) and, once installed, you should be able to get information about the current status of your RAID setup with something like this:

hpacucli ctrl all show status
hpacucli ctrl slot=0 ld all show status
hpacucli ctrl slot=0 pd all show status

Those commands should provide output similar to the following:

Smart Array P400i in Slot 0 (Embedded)
Controller Status: OK
Cache Status: OK
Battery/Capacitor Status: OK

logicaldrive 1 (341.7 GB, RAID 5): OK

physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 72 GB): OK
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, 72 GB): OK
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, 72 GB): OK
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, 72 GB): OK
physicaldrive 2I:1:5 (port 2I:box 1:bay 5, 72 GB): OK
physicaldrive 2I:1:6 (port 2I:box 1:bay 6, 72 GB): OK
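
Every status line in that output ends with a colon, a space and the status itself, so anything other than "OK" is easy to pick out. Below is a minimal sketch of a filter that does this. The name hp_status_problems is made up, and I am assuming that a faulty component shows something other than "OK" in that position (for example "Failed"), which you should check against what your own controller reports.

```shell
#!/bin/sh
# Sketch only: read hpacucli "show status" output on stdin and print every
# status line whose value is not exactly "OK". Lines without ": " (such as
# the controller heading and blank lines) are ignored.
# hp_status_problems is a hypothetical helper name.
hp_status_problems() {
    awk '
        /: / {
            status = $0
            sub(/.*: /, "", status)        # keep the text after the last ": "
            sub(/[ \t\r]+$/, "", status)   # drop trailing whitespace
            if (status != "OK") print
        }
    '
}

# Example: prints nothing while everything is healthy.
#   { hpacucli ctrl all show status
#     hpacucli ctrl slot=0 ld all show status
#     hpacucli ctrl slot=0 pd all show status; } | hp_status_problems
```

Run from cron, a pipeline like the one in the comment would only produce output (and therefore only generate cron mail) when something is not OK.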

That seems to do the trick nicely, then. Even better, I found an easy way of integrating it into Nagios so that we’ll get notified as soon as a check spots anything wrong with any of our disks or RAID hardware. This was just a case of downloading the check_hparray plugin and configuring it accordingly. (I’m not going into the Nagios configuration here because it’s pretty well documented, and anyone administering a Nagios system should have no problem working it out.)
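
To give a feel for what a check like this involves, here is a rough sketch of a Nagios-style check written as a shell function. It is not how check_hparray itself is implemented. It only follows the usual Nagios plugin convention of one summary line plus an exit code (0 for OK, 2 for CRITICAL, 3 for UNKNOWN). The name check_raid_status is made up, and treating every non-"OK" status as critical is a simplification on my part.

```shell
#!/bin/sh
# Sketch only: turn hpacucli "show status" output (read from stdin) into a
# one-line summary and a Nagios-style exit code.
# check_raid_status is a hypothetical name, not part of any HP or Nagios package.
check_raid_status() {
    awk '
        /: / {
            seen++
            line = $0
            sub(/^[ \t]+/, "", line)       # drop leading indentation
            sub(/[ \t\r]+$/, "", line)     # drop trailing whitespace
            status = line
            sub(/.*: /, "", status)        # keep the text after the last ": "
            if (status != "OK") { bad++; detail = detail " [" line "]" }
        }
        END {
            if (seen == 0) { print "RAID UNKNOWN - no status lines found"; exit 3 }
            if (bad > 0)   { print "RAID CRITICAL -" detail; exit 2 }
            print "RAID OK - " seen " status lines, all OK"
            exit 0
        }
    '
}

# Example:
#   hpacucli ctrl slot=0 pd all show status | check_raid_status
```

Returning UNKNOWN when no status lines are found matters here: if hpacucli is missing or prints nothing, the check should not quietly report OK.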

So there we go. Proper RAID/disk hardware monitoring and alerting in CentOS on HP DL360 servers with minimum hassle and no unnecessary bloated software.

Edit: this post was originally written for CentOS 5, but I believe it should work fine for CentOS 6 too.