With the introduction of Oracle Engineered Systems, the DBA is moved somewhat closer to infrastructure design decisions and is expected to have at least some opinions on the network design requirements for the database. At least that is the situation I find myself in 🙂
After deploying an ODA for testing, I find myself with the following setup:
System Controller 0 has the public bonded interface (bond0) connected to a typical edge switch, a Catalyst 2960 series. A management interface (bond1) is connected to a second edge switch of the same type.
System Controller 1 similarly has its public interface connected to the second switch, while the management interface is connected to the first switch.
This way, if one of the switches goes down, an operator will be able to reach each system controller either via the public or the management interface to facilitate diagnostics.
On the Cisco end of things, EtherChannel groups are configured for the 4 bonded interfaces of the ODA. The two switches are individually wired to the rest of the network, with no direct links between the two.
At first glance this does look like a reasonable design, but the more I think about different fault scenarios the more questions I seem to come up with.
Taking into consideration that these edge-type switches are not in themselves redundant, it seems rather important that the cluster can deal with one switch becoming unavailable due to power supply failure, or one switch failing to forward packets.
The database clients (Zend Server application servers in this case) are each similarly connected with a bonded interface to only one of the two switches. This brings up some questions with regard to load balancing: the way I understand 11gR2 RAC, simply connecting to the SCAN address will quite possibly send the client the long way out to the main network and back through the other switch, which can hardly be considered very efficient.
What happens if a switch fails or stops forwarding packets? Will connections find the accessible VIP listener through SCAN? Will RAC somehow detect the network fault and move the SCAN and VIP to the System Controller with a working and accessible public interface? I honestly can't see how it would.
And while clients taking the long way through the core network and back is acceptable during a failover scenario, it sure would be nice to avoid it in normal production.
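To make the failure questions above testable, here is a minimal sketch of a reachability probe that could be run from an application server while one switch is powered off. The host names are hypothetical placeholders, and port 1521 is assumed only because it is the default listener port. It checks plain TCP reachability, not whether the listener would actually hand out a session.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def reachable_listeners(endpoints: dict, timeout: float = 2.0) -> dict:
    """Map each endpoint name to True/False depending on TCP reachability."""
    return {
        name: can_connect(host, port, timeout)
        for name, (host, port) in endpoints.items()
    }


# Hypothetical names: replace with the real SCAN and VIP addresses.
ENDPOINTS = {
    "scan": ("oda-scan.example.com", 1521),
    "vip-sc0": ("oda-sc0-vip.example.com", 1521),
    "vip-sc1": ("oda-sc1-vip.example.com", 1521),
}

if __name__ == "__main__":
    for name, ok in reachable_listeners(ENDPOINTS).items():
        print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```

Running this from a client on each switch, before and after pulling the power on one switch, should at least show which addresses remain reachable from where.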
I'm sure Oracle has a very clear idea of how this should all work together, but I'm afraid I just don't see it all that clearly.
Is it possible to achieve full redundancy with edge-class/non-redundant switches? Can we somehow add some control over where client connections are routed in production and failover situations? Perhaps there is a good way to interconnect the two switches to allow traffic directly between clients on one switch and the database listener on the other?
At this point I'm looking for any best practices and fundamental network design considerations that should be applied to a typical high availability ODA implementation.
Hopefully this will then be of use to any DBA that is faced with making network design decisions for their ODA 🙂
The ODA is configured with bonds in active-backup configuration. I think this may allow for a setup where each interface on the bond is connected to a different switch, without any switch side configuration.
Anyone know if this is the case?
[root@oma1 ~]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth2
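For checking the same thing on both system controllers and across all bonds, a small parser for this status text may be handy. This is only a sketch: the function names are my own, and it reads just the "Key: value" lines shown above.

```python
def parse_bond_status(text: str) -> dict:
    """Collect 'Key: value' lines from a /proc/net/bonding/<bond> dump.

    The first occurrence of a key wins, so header values are not
    overwritten by keys repeated in later per-slave sections.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key and key not in fields:
            fields[key] = value.strip()
    return fields


def is_active_backup(fields: dict) -> bool:
    """True if the bond reports active-backup mode."""
    return "active-backup" in fields.get("Bonding Mode", "")


# Sample taken from the output quoted in this post.
SAMPLE = """Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth2
"""

if __name__ == "__main__":
    status = parse_bond_status(SAMPLE)
    print("mode:", status.get("Bonding Mode"))
    print("active slave:", status.get("Currently Active Slave"))
    print("active-backup:", is_active_backup(status))
```

On a real host, the text would be read from /proc/net/bonding/bond0 (and bond1) instead of the embedded sample.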