Pro Services and Tier 2 failing you... not unheard of, so that leaves most of us rolling up our sleeves and figuring things out on our own. Since it is a new setup, check the cabling between receivers yourself to verify, then reset/re-image the receivers to factory and start over.
I don't have my notes handy to give you more specific things to check but in general the ESM creates, distributes and manages the SSH keys for all devices in the tree.
Look at ha_status output on both receivers and compare the results. Check all the recent logs under the /var/log/ directory. If you have to enter a password when ssh'ing to either receiver from the ESM, or between receivers, make sure the public keys of both the ESM and the other receiver in the pair are in the authorized_keys2 file on the respective receiver.
If you are still unsure that communication is AOK, generate the factory RSA public key (ssh-keygen -f /etc/NitroGuard/factory-id_rsa -y prints it) and put that in the authorized_keys2 file on both receivers too. (This is especially helpful when re-keying or pairing devices.)
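A rough sketch of that key check, in case it helps. It only works on text you have already copied out of authorized_keys2 and out of the .pub files (or ssh-keygen -y output); the function names and the labels in the example are mine, not anything from the ESM tooling, and it assumes no key options containing spaces.

```python
def key_blob(pubkey_line):
    """Return the base64 key body from an OpenSSH public key line, or None."""
    fields = pubkey_line.split()
    for i, field in enumerate(fields[:-1]):
        # The key type (ssh-rsa, ssh-ed25519, ecdsa-...) comes right before the blob.
        if field.startswith("ssh-") or field.startswith("ecdsa-"):
            return fields[i + 1]
    return None


def missing_keys(authorized_keys_text, required):
    """Return the labels in `required` whose key is absent from the file text.

    `required` maps a label of your choosing (hypothetical, e.g. "esm")
    to a public key line.
    """
    present = set()
    for line in authorized_keys_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        blob = key_blob(line)
        if blob:
            present.add(blob)
    return [label for label, pub in required.items()
            if key_blob(pub) not in present]
```

Feed it the contents of authorized_keys2 from each receiver plus the ESM, peer and factory public keys; anything it returns still needs to be added on that receiver.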
On the ESM: cat /etc/NitroGuard/sshcc.conf to see the IPs and associated SSH keys for all managed devices.
/etc/NitroGuard/vipsid is where to confirm the device IDs (for example 4ED11:34AH5) for the machine, as you'll see those noted after the public key in the authorized_keys file.
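If you want to eyeball which device IDs are already present, something like this pulls out whatever text follows each public key. It is a minimal sketch that assumes the ID sits in the trailing comment field of a standard OpenSSH key line; the helper names are made up for illustration.

```python
def comments_after_keys(authorized_keys_text):
    """Return the text that follows the key blob on each key line."""
    comments = []
    for line in authorized_keys_text.splitlines():
        fields = line.split()
        for i, field in enumerate(fields):
            if field.startswith("ssh-"):
                # fields[i] is the key type, fields[i + 1] the blob,
                # anything after that is the comment.
                if len(fields) > i + 2:
                    comments.append(" ".join(fields[i + 2:]))
                break
    return comments


def has_device(authorized_keys_text, device_id):
    """True if some key comment mentions the given device ID."""
    return any(device_id in c for c in comments_after_keys(authorized_keys_text))
```

Pass it the file contents and the ID you read from vipsid; a False result means no key on that box is labelled with that device.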
I'll look for other tips tomorrow.
@akerr HA issues galore! We are running into numerous issues with 9.5.0 MR2 in our lab currently and have HA issues in our production environment running 9.4.2
One thing that we noticed in one of our environments was that speed/duplex was auto-negotiating to 100/half for some reason. Be sure that each NIC is at 1000/full (commands: ethtool eth0 and ethtool eth1). Try manually setting them to 100/full to see if anything changes. Also, tail /var/log/messages and see if anything noticeable pops out.
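For checking several boxes quickly, here is a small sketch that reads the Speed and Duplex lines out of ethtool output. It only parses text; the sample below is illustrative rather than captured from a receiver, and the function names are my own.

```python
import re


def link_settings(ethtool_output):
    """Return (speed, duplex) as reported by ethtool, or None for a missing field."""
    speed = re.search(r"^\s*Speed:\s*(\S+)", ethtool_output, re.MULTILINE)
    duplex = re.search(r"^\s*Duplex:\s*(\S+)", ethtool_output, re.MULTILINE)
    return (speed.group(1) if speed else None,
            duplex.group(1) if duplex else None)


def is_gigabit_full(ethtool_output):
    """True only when the link reports 1000Mb/s and full duplex."""
    return link_settings(ethtool_output) == ("1000Mb/s", "Full")


# Illustrative sample of a link that negotiated down to 100/half.
SAMPLE = """Settings for eth0:
\tSpeed: 100Mb/s
\tDuplex: Half
\tAuto-negotiation: on
\tLink detected: yes
"""

# To check a live interface you could capture the text with:
#   subprocess.run(["ethtool", "eth0"], capture_output=True, text=True).stdout
```

Run it against the output for eth0 and eth1 on both receivers; any False from is_gigabit_full is worth chasing with the network team.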
You can always attempt to re-key; I have had mixed success with re-keying, as I have completely lost connection to both ERs at times. I would ensure you have McAfee on a WebEx during a re-key. The last time I did it was a few Fridays ago at 4PM. I didn't leave work until 8.
Tier II has managed to get the one pair working properly, but the other pair is still only sort of working. Waiting for Tier II to get back to us. I've checked what folks have suggested but nothing has made a difference yet. I'd like to hear the details of what they did to get the first pair working since we left them with a remote session overnight and by morning it was working.
We are in the same boat. I am still waiting to hear feedback on what exactly they did to get the pair back up (9.5.0 MR2).
As bad as it sounds, I am actually relieved to know that I am not the only one experiencing HA issues. For a while, I was thinking that I was doing something wrong configuration-wise; however, it appears that there is a bug that needs to be worked out. We are currently stable in all of our environments. With regards to 9.5.0 specifically, my issue was happening after failover. The secondary was successful in becoming the primary and logging; however, when I would go to set the primary back to primary, it would fail, and that's when we would see complete HA breakdown.
Out of curiosity, are you using 1260s? The guy we had in from Professional Services said there are issues with these, generally thought to be related to the network ports not necessarily being in the correct order, though imaging to 9.5.0 is supposed to fix that. Whatever they did to fix it didn't involve re-cabling though.
Yes, we are using 1260s, and yes, the NIC configuration was an absolute nightmare to try and troubleshoot at the beginning. We have always thought that the NIC configuration was the underlying problem (i.e. maybe those NICs are hard-coded into other processes that now conflict with the new port config).
We are on 9.5.0 MR1 in all but one DEV env... In 9.5.X you have the ability to set a preferred primary receiver ( Properties -> Receiver Configuration -> Interface -> HA ?? (tab) ), which should help keep receivers from just flipping like in the 9.4.x nightmare.
The newest hardware is all Intel-based network cards, which is the best you can get driver-wise under Linux, and I don't think things are hardcoded from what I've seen.
With the help of Tier III, mind you, I just got through replacing old 2250's and 2230 in the mix with new 3450's on 5 HA pairs in three different ESM environments. We had a few eth1 ports go dark on us and not light up again until the network guys toggled the port down/up (during hardware swaps, mind you); we typically don't see eth1 going dark in normal situations.
We did have one IPMI card freak out on us... support tried a few ipmi and ipmitool commands but those failed, so we shut down twice, no-go, so we had to unplug power for ten minutes; once powered back on, all was fine after that.
The process was loosely:
1) put secondary in standby.
2) shut down receiver.
3) put new receiver in place, wire up and power on.
4) add the new receiver's known_hosts entry on the ESM, then "reinitialize secondary"
5) after settings get applied to both receivers, check network.conf on the current primary and delete any duplicate [ifaceX.X] sections.
6) verify known_hosts on ESM for each receiver, also update known_hosts on each receiver to the 172.x.x.2 and .3 addresses respectively.
7) Give them 5 minutes to work out cluster and who's going to be primary. Make sure ha_status, crm status look good, ensure the GUI shows AOK
8) put other receiver in standby
9) repeat steps 1 - 7
10) Roll out policy
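For step 5, spotting the duplicate sections by eye gets old fast. A minimal sketch, assuming network.conf uses bracketed section headers on their own lines as the [ifaceX.X] naming suggests; the function name and the sample section names in the comment are hypothetical.

```python
import re
from collections import Counter

# Matches a header line such as "[iface0.1]" and captures the name inside.
SECTION = re.compile(r"^\s*\[(iface[^\]]*)\]\s*$")


def duplicate_iface_sections(conf_text):
    """Return the iface section names that appear more than once, sorted."""
    counts = Counter()
    for line in conf_text.splitlines():
        match = SECTION.match(line)
        if match:
            counts[match.group(1)] += 1
    return sorted(name for name, n in counts.items() if n > 1)


# Example with made-up content:
#   duplicate_iface_sections("[iface0.0]\n[iface1.0]\n[iface0.0]\n")
#   reports iface0.0 as the section to clean up.
```

It only reports the names; which copy of the section to keep is still a judgment call you make on the primary.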
So there are definitely multiple bugs to work through, but thankfully 9.5.X is supposed to be focused on stability and fixes versus "new features".
Funnily enough, we are running 9.5.0 in our lab currently. When we set the primary to our "preferred", logging stopped. It would only continue when we did not set a primary. Also, today I made some progress with support. I am still waiting on further details of what they saw happening, but it appears as though a rogue interface (one that was previously deleted to fix the problem) keeps showing up during failover. My guess is the interface was not properly removed from both ERs and it was "re-syncing" the information back over somehow. Should have better details in the next couple of days.
OK, the documentation for the HA receivers is not the greatest, even today. This is a kind way of saying INSUFFICIENT. You need to perform the basic network configuration of both receivers via a directly attached console prior to attempting HA. This is not specified in the documentation. It leaves you guessing, because if you have the first one configured and added to the ESM console, you might think that magic happens to perform all configuration of the second receiver via the IPMI connections when those are cabled. That is not the case! Set up the basic networking config of the second receiver via the console and confirm that it is working, but do not import it into the ESM. Then follow the pairing process as documented.
Furthermore, the 9.6.0 base (no maintenance release) receiver .iso image apparently doesn't activate all of the ethernet ports, which also causes HA pairing to fail. Nice.
Keep in mind that their documentation also only refers to "1U" receivers. Older 1250 model receivers cannot be mixed with the newer 1260 model (yah, this is something that they may not have pointed out prior to me wasting 2 days trying it), and indeed, as noted above, the order of the ethernet ports changed when the model numbers changed. During this whole fiasco support had to create a brand new config file for ordering the ports because I had escalated to the head of North American engineering / support.
If in doubt about cabling, do direct connections using pre-tested cables. The ridiculous rules at my company state that you have to have clustered equipment in different racks, which makes it a bit challenging to confirm the heartbeat and IPMI connections, as they go through patch panels (but not switches). So I had to lay cables across the raised floor, because the receivers were mounted about 5 racks apart, to be confident about the connections.
Keep in mind the management and public (VIP) addresses for the receivers must all exist on the same network segment and VLAN, and go through the same switch. The same-switch requirement is not hard and fast and can be worked around, but the VLAN and network segment are essentially cast in stone. Unless you want to spend (aka waste) a lot of time trying to configure otherwise, stick to what works.
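A quick sanity check for the network segment part before you even start pairing. This is just a sketch using Python's standard ipaddress module; the addresses in the comment are placeholders, and it cannot tell you anything about VLAN membership or which switch you are on.

```python
import ipaddress


def same_segment(*cidrs):
    """True if every address/prefix given falls in one and the same network."""
    networks = {ipaddress.ip_interface(c).network for c in cidrs}
    return len(networks) == 1


# Placeholder values for the two management addresses and the VIP:
#   same_segment("10.1.2.10/24", "10.1.2.11/24", "10.1.2.20/24")
# A mismatch in either the subnet or the prefix length makes it return False.
```

If it comes back False for the two management addresses plus the VIP, fix the addressing first; VLAN and switch placement still have to be confirmed with your network team.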