I have a pair of S4016s that I am trying to set up in an Active/Standby HA configuration. I have verified that all my interfaces and zones are identical. I run the cluster creation wizard on the primary successfully, and then I join the secondary device with the wizard (supposedly) successfully. But when I run 'cf cluster status' on the primary, it confirms that it is primary but the peer does not show. When I run the same command on the secondary firewall, it reports that the primary is down and not connected, and that the secondary is acting as primary. When I use the Check cluster status button in the admin console, it tells me that my secondary is not part of the cluster. When I try to verify the interfaces via the admin console, I get an unknown socket error when viewing the secondary device. I am also logged in via SSH to each device at the same time, so I know they are both up. Internet searches have not returned any hits on the error messages I have received. Has anyone else encountered this or similar errors that could help me figure out what is going wrong?
My first thought is that you have an explicit rule in the Access Control Rules for the 'entrelayd' service (called the "Enterprise Relay Server" at version 8). Go to your Rules and type the word "enter" in the Search box. Do you have a rule using this service? If you do you must delete it.
One other thing to check on both firewalls: log in via SSH, run 'cat /secureos/etc/failover.conf', and go to the bottom of the file. There is a line there that says "key(SHA512 some_key)." Make sure both files say "SHA512" and that one doesn't say "SHA1" instead.
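If you would rather not compare the two files by eye, here is a minimal Python sketch of the same check. It assumes you have copied the text of each failover.conf somewhere you can run Python, and that the key line looks like the "key(SHA512 some_key)" form quoted above. The function names are made up for illustration, and I have not verified the rest of the file format.

```python
import re

# Assumed shape of the key line, based only on the "key(SHA512 some_key)"
# example quoted in this thread; the real file may differ.
KEY_LINE = re.compile(r"key\(\s*(SHA\d+)\s+")


def key_algorithm(conf_text):
    """Return the hash name from the last key(...) entry, or None if absent."""
    matches = KEY_LINE.findall(conf_text)
    return matches[-1] if matches else None


def same_algorithm(primary_text, secondary_text):
    """True only when both configs carry a key line with the same hash name."""
    first = key_algorithm(primary_text)
    second = key_algorithm(secondary_text)
    return first is not None and first == second


if __name__ == "__main__":
    # Placeholder contents standing in for the two copied files.
    primary = "...\nkey(SHA512 some_key)\n"
    secondary = "...\nkey(SHA1 some_key)\n"
    print(key_algorithm(primary), key_algorithm(secondary))
    print("match" if same_algorithm(primary, secondary) else "MISMATCH")
```

A mismatch here would mean one firewall says SHA1 while the other says SHA512, which is the case to rule out.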
Sliedl, thanks for the quick response. I just checked both of your suggestions: I have no rules with the entrelayd service, and my failover.conf files are both at SHA512. One additional bit of info I just thought of: when we first tried this, the heartbeat zone was a redundant lagg interface. I read that there was an issue with this in 8.0.0. Even though I'm currently using 8.3.2 (I couldn't confirm it, but assumed this was fixed by now), I thought I would go back, break out the lagg, and just use a single interface. But it's the same problem. I did break up the HA pair before I made these interface changes, then re-created the cluster. Could this have affected something?
There could be any number of things causing this and if there is some explicit error there will be an audit event on either the primary or the standby (or both).
If the firewalls are connected via a switch, then the switch must be able to pass IGMP and it must not decrement the TTL (the TTL of the heartbeat is 1). If they are connected via a switch, you can try connecting them directly with a cable (straight-through or crossover) to see if that fixes the issue. Also, the heartbeat interface should not be VLANed.
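If you want to test the switch on its own, one rough approach is to plug two ordinary test hosts into the same switch ports and send multicast between them with a TTL of 1. The Python sketch below does that. The group address and port are arbitrary values I picked for illustration, not the addresses the firewall heartbeat actually uses, and this only shows whether TTL-1 multicast gets through the switch at all.

```python
import socket
import struct
import sys

GROUP = "239.255.0.1"  # hypothetical test group, NOT the firewall's heartbeat group
PORT = 50000           # hypothetical test port


def make_sender(ttl=1):
    """UDP socket whose multicast packets leave with the given TTL."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, struct.pack("b", ttl))
    return sock


def make_receiver(group=GROUP, port=PORT, timeout=5.0):
    """UDP socket that joins the group (this sends an IGMP membership report)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    mreq = struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    sock.settimeout(timeout)
    return sock


if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else ""
    if mode == "send":
        sender = make_sender()
        sender.sendto(b"heartbeat-test", (GROUP, PORT))
        sender.close()
    elif mode == "recv":
        receiver = make_receiver()
        try:
            data, addr = receiver.recvfrom(1024)
            print("received", data, "from", addr)
        except socket.timeout:
            print("nothing received before the timeout")
        finally:
            receiver.close()
    else:
        print("usage: script.py send|recv")
```

Start it with "recv" on one host, then run it with "send" on the other. If nothing arrives through the switch but it does arrive over a direct cable, the switch is the likely suspect.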
We ended up finding a workaround, if not the correct solution. In the Advanced interface options we cleared the "Monitor Interface" check box on the heartbeat interface. This corrected our issue with the cluster not recognizing that the secondary firewall was there. We also tested failover and restoral successfully. We still get occasional errors that the secondary is not responding when making configuration changes, but we believe that is a layer one issue with the heartbeat cable.