The last few times I've updated our MFE firewalls, I've had some downtime that I can't account for. Here is the situation:
We have two appliances in HA. MFE1 is the primary and MFE2 is the standby. When I update them, I Download the package to each one, then Install the update on MFE2. This goes without any problems. Once I'm sure that MFE2 is back up and is once again the standby device in the HA cluster, I Install the package on MFE1. Here's where things go wonky. Despite the Cluster type being Peer to Peer with a Cluster Takeover Time of 13 seconds, the appliances do not appear to fail over. This results in our Internet being down until MFE1 finishes its reboot cycle (approx. 3-5 minutes, although I've never timed it exactly). When it finally comes back up, the Cluster Status shows that MFE2 is now the primary and MFE1 is the standby. Then, just for giggles and grins, I'll reboot MFE2 to force MFE1 to become the primary again. During this reboot of MFE2, we do not lose any Internet connectivity, indicating that HA is working fine in that direction.
So, why do you suppose that HA does not work when I reboot MFE1? Am I doing something wrong or out of order?
As always, TIA.
It doesn't look as though you are doing things out of order - that's how I would do it.
I don't think it necessarily has anything to do with applying patches; it sounds more like a generic HA problem. I assume that when you click on the Cluster Status button it indicates that the cluster is OK (showing MFE2 as the secondary, rather than saying something to suggest that it isn't functioning as part of the cluster). I guess that if you are applying the updates through the GUI by connecting to the cluster IP address, then the cluster is functioning at an admin level at least.
You may need to raise a ticket and run it past technical support, because what you are describing suggests that the secondary doesn't realise, or doesn't receive the communication to say, that the primary is no longer available. You will probably need to find a convenient period of time to try some failover tests (pulling network cables out of monitored interfaces, or shutting down MFE1 for a while to see if MFE2 ever realises that it needs to take over).
It could be that MFE2 *is* aware that MFE1 is not there but is somehow unable to grab the cluster IP addresses (which is why everything seems to halt until MFE1 comes back).
Connect directly to MFE2's physical IP address (rather than the cluster address) and, with MFE1 shut down, see if the cluster status changes.
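If it helps to put a number on the outage during those failover tests (since it has never been timed exactly), something like the Python sketch below could run on a workstation behind the firewalls. It is only a rough illustration, not a vendor tool: the function names (probe, watch, outage_windows), the target host and port, and the one-second probe interval are all assumptions. It makes repeated TCP connection attempts to an outside host and reports each stretch of consecutive failures.

```python
import socket
import time


def probe(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS failures.
        return False


def outage_windows(samples):
    """Collapse (timestamp, reachable) samples into (start, end) outage pairs.

    start is the timestamp of the first failed probe in a run of failures;
    end is the timestamp of the first successful probe after it, or of the
    last sample if connectivity never came back.
    """
    windows = []
    start = None
    last = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            windows.append((start, ts))
            start = None
        last = ts
    if start is not None:
        windows.append((start, last))
    return windows


def watch(host, port, duration, interval=1.0):
    """Probe host:port every `interval` seconds for `duration` seconds."""
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        samples.append((time.time(), probe(host, port)))
        time.sleep(interval)
    return samples
```

For example, calling outage_windows(watch("example.com", 443, 600)) just before rebooting MFE1 would list the outages seen over ten minutes (example.com is a placeholder; use any outside host you are allowed to probe). An outage close to the 13 second takeover time would suggest failover is happening, while one of several minutes would match the behaviour described above.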
I agree with Phil. To add a little more: what might be happening is that the firewalls are failing over properly, but the ARP entries on routers and switches in your network are not updating (so the packets are still being sent to the wrong firewall).
To ensure that failover is functioning properly, try running "cf cluster failover". The very first line will show whether the firewall is acting as primary or standby. If that output shows the correct information, then it may be worth looking at ARP.
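One way to look at ARP is to capture the neighbour table on a host on the same segment before and after the failover and compare the MAC address recorded for the cluster IP. The Python sketch below assumes output in the style of "arp -an" on a Linux or BSD host; routers and switches print their tables differently, so the pattern would need adjusting for them. The function names are made up for illustration. Whether the MAC should change at all depends on how the cluster presents its addresses, so treat the result as a hint rather than a verdict.

```python
import re

# Matches lines such as "? (192.0.2.1) at 00:11:22:33:44:55 [ether] on eth0".
# Assumed format: `arp -an` on Linux/BSD; other devices differ.
ARP_LINE = re.compile(
    r"\((?P<ip>\d{1,3}(?:\.\d{1,3}){3})\)\s+at\s+"
    r"(?P<mac>[0-9A-Fa-f]{1,2}(?::[0-9A-Fa-f]{1,2}){5})"
)


def normalize_mac(mac):
    """Lower-case the MAC and zero-pad each octet (BSD prints 0:1:2...)."""
    return ":".join(part.zfill(2) for part in mac.lower().split(":"))


def parse_arp(output):
    """Return {ip: mac} for every complete entry in the arp output."""
    table = {}
    for line in output.splitlines():
        match = ARP_LINE.search(line)
        if match:  # incomplete entries have no MAC and are skipped
            table[match.group("ip")] = normalize_mac(match.group("mac"))
    return table


def mac_changed(before, after, ip):
    """True only if ip has a MAC in both captures and the two differ."""
    old = parse_arp(before).get(ip)
    new = parse_arp(after).get(ip)
    return old is not None and new is not None and old != new
```

If the firewall reports that it has become primary but a neighbouring device still holds the old entry for the cluster IP until it ages out, that would point to stale ARP rather than a failed takeover.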