My main production server has recently started failing over at least a few times a day. However, the quorum doesn't move to the other node.
I ran the validation report and got a lot of information I don't understand. The server is an HP ProLiant BL680c G5. Some of the items in the report that I'm wondering about:
Degraded:
HpCISSs
HP MPIO DSM for EVA4x00/6x00/8x00 family of Disk Arrays
Link-Layer Topology Discovery Mapper I/O Driver
Mount Point Manager
Those are just a few. I'm not so worried about "errors", but "degraded" seems to imply that it should be running but isn't doing so well.
System details:
SQL: 10.50.4000.0
Windows: Windows NT, 64-bit
It really is going to be awful if it IS an MPIO issue. That has happened many times, and the DBA team has been accused of modifying the settings! I'm the team lead, and even I have no idea what half of this means (I know what MPIO is, and I recognize all the SAN stuff, but troubleshooting? Nah.)
Interesting new information – right before this began happening, we updated the Firewall Service Module, and rebooted the core switches.
I'm thinking there are settings which don't have the correct defaults?
Logged during the failover:
Cluster Agent: The cluster resource FileServer-(server)(Cluster Disk 1) has failed. [SNMP TRAP: 15006 in CPQCLUS.MIB]
Cluster Agent: The cluster resource SQL Server Agent has become degraded. [SNMP TRAP: 15005 in CPQCLUS.MIB]
Cluster Agent: The cluster resource SQL Server has become degraded. [SNMP TRAP: 15005 in CPQCLUS.MIB]
Cluster Agent: The cluster resource Analysis Services has become degraded. [SNMP TRAP: 15005 in CPQCLUS.MIB]
Cluster Agent: The cluster resource FileServer-(servername)(Cluster Disk 4) has failed. [SNMP TRAP: 15006 in CPQCLUS.MIB]
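In case it helps anyone compare one failover against the next, here is a rough Python sketch for tallying these Cluster Agent messages by resource and state. The regular expression is only inferred from the lines quoted above, and the names parse_traps and summarize are made up for illustration; this is not an HP or Microsoft tool.

```python
import re
from collections import Counter

# Pattern inferred from the Cluster Agent messages quoted above (an assumption,
# not an official format definition).
TRAP_RE = re.compile(
    r"Cluster Agent: The cluster resource (?P<resource>.+?) "
    r"(?P<state>has failed|has become degraded)\. "
    r"\[SNMP TRAP: (?P<trap>\d+) in (?P<mib>[\w.]+)\]"
)


def parse_traps(text):
    """Return (resource, state, trap_id) tuples in the order they were logged."""
    events = []
    for match in TRAP_RE.finditer(text):
        state = "failed" if match.group("state") == "has failed" else "degraded"
        events.append((match.group("resource"), state, int(match.group("trap"))))
    return events


def summarize(events):
    """Count how many resources ended up in each state."""
    return Counter(state for _, state, _ in events)


# Two of the messages from the post, including the stray quotes from the log export.
SAMPLE = (
    'Cluster Agent: The cluster resource FileServer-(server)(Cluster Disk 1) '
    'has failed. [SNMP TRAP: 15006 in CPQCLUS.MIB]" '
    '"Cluster Agent: The cluster resource SQL Server Agent '
    'has become degraded. [SNMP TRAP: 15005 in CPQCLUS.MIB]"'
)

if __name__ == "__main__":
    for resource, state, trap_id in parse_traps(SAMPLE):
        print(f"{trap_id}  {state:8}  {resource}")
    print(dict(summarize(parse_traps(SAMPLE))))
```

Running it over the log text saved from each failover would show whether the same disks fail every time and in what order they were logged.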
It is really odd, because there aren't a lot of error messages. The only real info I have is from the validation report. Disks 1-4 always fail, though not always logged in the same order, and then the quorum just stays on the node it is on.
I spoke with the network guy, and he thinks that when the core switch was rebooted, it may have affected the preferred routes for the NICs. He's going to clean things up a bit this weekend (removing the file shares and recreating them) and we'll see where we are.
Update: removing and recreating the file shares didn't work. It failed over again yesterday evening.
We are using Fibre Channel to connect the servers to the storage array. We just patched Windows (using Shavlik) and we still have the same issue. I'm starting to wonder if it is some default setting in MPIO that keeps returning.