LSI RAID Controller – Physical Disk S.M.A.R.T. Status Check (Nagios NRPE)

A simple, efficient way to check the S.M.A.R.T. health of any number of disks behind an LSI hardware RAID controller and report the result as a Nagios status.

This plugin is also published on Nagios Exchange.

#!/bin/bash
####################################
#Dave Byrne
#LSI Hardware RAID S.M.A.R.T Check
####################################

#Where is storcli?
storclibin="/opt/MegaRAID/storcli/storcli64"

#Check if storcli actually exists; echo a warning and exit if it doesn't
if ! [ -e "$storclibin" ]
then
  echo "WARNING - StorCli Not Found"
  exit 1
fi

#Get number of underlying physical disks from storcli
underlyingdisks=`$storclibin /c0 show|grep "Physical Drives"|awk '{ print $4}'`

#Loop to grab the smart health result from all physical disks
counter=0
while [ $counter -lt $underlyingdisks ]; do
  #echo counter at $counter
  smartctl -d sat+megaraid,$counter -a /dev/sda|grep "SMART overall-health self-assessment test result" >> /usr/results.txt
  let counter=counter+1
done

#Analyse results and start giving exit codes
if grep -q FAILED "/usr/results.txt";
then
    #failed string found, time to work out what disk it was
    line=`awk '/FAILED/{ print NR; exit }' /usr/results.txt`
    #subtract 1 from line number as disks start at 0
    let line=line-1
    diskinfo=`smartctl -d sat+megaraid,$line -a /dev/sda|grep -E 'Device Model|Serial Number'`
    printf "CRITICAL - Device ID #%s FAILURE Imminent\nDISK INFO:\n%s\n" "$line" "$diskinfo"
    rm -f /usr/results.txt
    exit 2
else
    #failed string not found, echo and exit ok
    echo "OK - All array disks healthy"
    rm -f /usr/results.txt
    exit 0
fi

The check detects the number of drives attached to the controller and queries each one for its overall S.M.A.R.T. health, based on smartctl's assessment of pre-fail attributes and similar indicators.
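
The drive-count step can be sketched on its own, with a sanity check added before the loop relies on the number. This is only a sketch: parse_drive_count is a hypothetical helper, and the sample text assumes the controller summary contains a line of the form "Physical Drives = 8", which is what the awk field in the script implies.

```shell
#!/bin/bash
# Sketch: extract and validate the physical drive count.
# Assumption: the summary has a line like "Physical Drives = 8",
# so the count is the 4th whitespace-separated field.

# parse_drive_count (hypothetical helper): reads summary text on stdin,
# prints the drive count, and returns 1 if no numeric count was found.
parse_drive_count() {
  local count
  count=$(awk '/Physical Drives/ { print $4; exit }')
  case "$count" in
    ''|*[!0-9]*) return 1 ;;   # empty or non-numeric: refuse to guess
  esac
  printf '%s\n' "$count"
}

# Illustrative input only, not captured from a real controller.
sample_summary='Controller = 0
Physical Drives = 8'

if drives=$(printf '%s\n' "$sample_summary" | parse_drive_count); then
  echo "Found $drives physical drives"
else
  echo "UNKNOWN - could not read drive count"
fi
```

In the real check the function would be fed from "$storclibin" /c0 show rather than from sample text. Validating the value first avoids the while loop failing on an empty or unexpected count.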

-Returns OK if all attached drives return healthy.
-Returns Warning if storcli could not be found.
-Returns Critical if one or more drives report unhealthy, then pulls the device ID# (slot#), device model and serial number of the first failing drive and prints this info with the Critical message in Nagios Core.
-Relies on smartctl (from smartmontools) and StorCli (MegaRAID CLI, successor to megacli).
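
The mapping from smartctl results to the OK and Critical states in the list above can be sketched without the temporary file. This is a rough variant, not the published plugin: classify_smart_results is a hypothetical helper, it assumes one health line per drive in device ID order starting at 0, and unlike the script it lists every failing device ID rather than only the first.

```shell
#!/bin/bash
# Sketch: map smartctl health lines to a Nagios message and exit code.
# Assumption: one health line per drive, in device ID order starting at 0.

# classify_smart_results (hypothetical helper): reads health lines on stdin,
# prints a Nagios status line, returns 2 (CRITICAL) or 0 (OK).
classify_smart_results() {
  local id=0 failed="" line
  while IFS= read -r line; do
    case "$line" in
      *FAILED*) failed="$failed $id" ;;   # remember every failing device ID
    esac
    id=$((id + 1))
  done
  if [ -n "$failed" ]; then
    echo "CRITICAL - SMART failure on device ID(s):$failed"
    return 2
  fi
  echo "OK - All array disks healthy"
  return 0
}

# Illustrative input: the second drive (device ID 1) reports a failure.
demo_status=0
printf '%s\n' \
  'SMART overall-health self-assessment test result: PASSED' \
  'SMART overall-health self-assessment test result: FAILED!' |
  classify_smart_results || demo_status=$?
echo "exit code: $demo_status"
```

Piping the loop's smartctl output straight into a function like this would remove the need to write to and clean up /usr/results.txt.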

Run the check under sudo in the command definition. Ensure the nagios user can run both storcli and smartctl under sudo by editing the sudoers file with visudo as follows:
Change:
Defaults requiretty

To:
Defaults requiretty
Defaults:nagios !requiretty

Change:
## Allow root to run any commands anywhere
root ALL=(ALL) ALL

To:
## Allow root to run any commands anywhere
root ALL=(ALL) ALL
nagios ALL=(ALL) NOPASSWD:/usr/sbin/smartctl
nagios ALL=(ALL) NOPASSWD:/opt/MegaRAID/storcli/storcli64

Edit the check file and set storclibin to the location of your storcli binary, and use the same path in the visudo additions.
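
If editing the script on every host is inconvenient, one option is to let the path be overridden from outside. A minimal sketch, assuming a hypothetical STORCLI_BIN environment variable (not something storcli or Nagios defines) and a hypothetical resolve_storcli helper, falling back to the default path used in the script:

```shell
#!/bin/bash
# Sketch: let the storcli location be overridden without editing the script.
# STORCLI_BIN is a hypothetical variable introduced for this example.

# resolve_storcli (hypothetical helper): prints the path to use,
# or returns 1 if nothing exists at that path.
resolve_storcli() {
  local candidate="${STORCLI_BIN:-/opt/MegaRAID/storcli/storcli64}"
  if [ -e "$candidate" ]; then
    printf '%s\n' "$candidate"
    return 0
  fi
  return 1
}

if storclibin=$(resolve_storcli); then
  echo "Using storcli at $storclibin"
else
  # The real check would also exit 1 here, as the script does.
  echo "WARNING - StorCli Not Found"
fi
```

Note that sudo resets most of the environment by default, so a variable set this way may not survive if the whole check is launched through sudo. The sudoers entry still has to name whichever path is actually used.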

dave / January 12, 2016 / Code, Nagios Monitoring