McAfee support recently told us that syncing with a redundant ESM requires an additional "Finalize" phase (after syncing reaches 100%), which shuts down cpservice on both the primary and redundant ESM, making the SIEM unusable for an unspecified time (dependent on database size and network bandwidth).
Sounds bad enough?
It gets worse - you have to click the "Finalize" button within a few hours of syncing reaching 100%, otherwise syncing restarts from the beginning. Of course, the ESM will not give you an ETA for either syncing or finalizing, so you must periodically and patiently check the progress yourself.
This process must be repeated after each and every upgrade; otherwise the redundant ESM is totally useless.
May I ask for your views, opinions, or hacks on how to make this thing actually usable and less painful?
I believe the "finalize" is new in the 9.4.2 series, as I don't remember doing this before. The link between the two ESMs could be a factor in how long it takes. I would suggest scanning /usr/local/ess/data/NitroError.log and /var/log/messages on both ESMs to see if anything sticks out, e.g. file system or database errors. Also, tail both of those logs on both ESMs for debugging purposes.
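A minimal sketch of that log scan. The grep pattern and the sample log line are my own assumptions, just to make the snippet runnable anywhere; on the ESM, point LOG at /usr/local/ess/data/NitroError.log and /var/log/messages instead of the mktemp demo file:

```shell
# Demo: look for file system / database errors in an ESM log.
# On the appliance, set LOG to the real paths mentioned above.
LOG=$(mktemp)
printf 'DB partition check ok\nERROR: database table corrupt\n' > "$LOG"   # hypothetical sample lines

grep -iE 'error|fail|corrupt' "$LOG"   # prints: ERROR: database table corrupt

rm -f "$LOG"
```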
We had an SSH key issue that prevented the secondary from pulling the partition list from the primary. Look for InitRedund in the logs.
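To check for that failure mode, a quick count of InitRedund entries works. The sample log line below is hypothetical; on the ESM, point LOG at the NitroError.log / messages files rather than the demo file:

```shell
# Demo: count InitRedund entries (non-zero means the redundancy init left a trace worth reading).
LOG=$(mktemp)
printf 'InitRedund: failed to retrieve partition list from primary\n' > "$LOG"   # hypothetical sample

grep -c 'InitRedund' "$LOG"   # prints: 1

rm -f "$LOG"
```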
We just upgraded to 9.4.2 MR6 last week and basically had to re-enable ESM redundancy mode after the upgrade (I believe because we were coming from 9.3.2); it took 2-3 hours for the processes to complete.
Other things to run/check:
Go through the checklist and release notes.
Make sure the databases are clean of course.
Make sure "auto get events" is disabled while enabling redundancy.
On the secondary, clean out any .gz files in /usr/local/ess/dbredund/ between a cpservice stop/start. (You get an entire dump versus a diff of what is on the primary.)
In a terminal, run resm_status on the secondary ESM - the first line returns the % complete and the next line "Ok".
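Since the ESM gives no ETA, the last step lends itself to a simple polling loop. This is a sketch under the output format described above (percent complete on the first line); the stub function stands in for the real resm_status so the loop can be dry-run anywhere, and should be removed on an actual secondary ESM:

```shell
# Poll sync progress until it reports 100%, then stop so you can click Finalize promptly.
resm_status() { printf '100%%\nOk\n'; }   # stub for the demo - delete this line on a real ESM

while :; do
  pct=$(resm_status | head -n 1 | tr -dc '0-9')   # strip everything but the digits
  echo "sync progress: ${pct}%"
  [ "${pct:-0}" -ge 100 ] && break
  sleep 600                                       # check again in 10 minutes
done
```

Remember the finalize window is only a few hours, so a loop like this (or a cron job that emails you) beats checking the GUI by hand.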
It took ~2 hours to "finalize" syncing on our ESM. Having used Nitro since 9.3.0, I have no recollection of this "feature" prior to 9.4.2.
Upgraded our test environment to 9.5.0 last week and must say there are really cool monitoring features there, so the production environment will be upgraded very soon.
There are methods to the madness here. The "finalize" process is actually a nice improvement over the old methods of synchronizing a primary and redundant ESM, although it's possible you may not have experienced the pain in the past.
In order to establish a redundant pair of ESMs, they need to be synchronized to a common state. This ensures that each ESM starts out with the same configuration and the same baseline set of events. That is a challenge when your primary ESM is taking in lots of events and inserting them into the database on the fly - kind of like trying to paint a portrait of a toddler who just won't sit still.
Prior to 9.4, synchronizing a redundant ESM meant stopping the primary ESM from collecting events, doing a full sync, and then starting everything back up. The time this took depended on how much historical data you had in your primary DB. During the early phases of a deployment, it takes very little time at all. For a system with several TB of historical data, it could take days to copy everything over.
This situation improved in 9.4. We implemented a new strategy for synchronizing, which allowed us to copy over the vast majority of the data from the primary to the redundant without shutting down any services on the primary ESM. Then, as a final step, we "finalize". This entails stopping the collection services (as before) so we can ensure the DB is in a quiescent state, then copying over the final DB partitions. Worst case, your primary ESM is now offline for an hour or two during the finalize process (often much less), instead of potentially many hours or several days.
The reason you must "finalize" within a couple hours of being told to do so is that we want to minimize the actual downtime. Your primary ESM is collecting data from receivers while it waits for you, and it drifts further and further from parity with the redundant. If it drifts too far, it re-runs the "phase 1" sync to catch back up, then gives you another opportunity to finalize.
I hope this helps you understand a little of what's going on behind the scenes.
We have 150 MB of bandwidth between HQ and our DR sites.
Disk usage of our HQ primary ESM after 6 months:
McAfee-ETM-X4 ~ # df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 1.9T 11G 1.8T 1% /
/dev/sda1 975M 62M 864M 7% /boot
/dev/sdb1 13T 11T 2.3T 83% /data_hd
shm 48G 0 48G 0% /dev/shm
/dev/md0 743G 609G 135G 82% /index_hd
I started the sync process one week ago! Yesterday I got an email from the ESM saying the finalize phase was ready. Then I clicked Finalize, and the ESM has not been working for about 16 hours!!
Current disk usage of our DR secondary ESM:
McAfee-ETM-X4 ~ # df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 1.9T 9.9G 1.8T 1% /
/dev/sda1 976M 67M 860M 8% /boot
/dev/sdb1 13T 10T 2.8T 79% /data_hd
shm 48G 0 48G 0% /dev/shm
/dev/md0 743G 552G 192G 75% /index_hd
This is also the output of the resm_status command on the secondary ESM:
McAfee-ETM-X4 ~ # resm_status
Q1: How long is finalize going to take?
Q2: Is there a way to stop, pause, or cancel this process?
Q3: Is there a way to access the primary ESM dashboard while the finalize process is running? Then maybe we could at least work on the current DB!
Thank you for your time,
Stats from my primary ESM for comparison.
HDD - sda3 Size: 1.9TB, Used: 33GB( 2%), Available: 1.7TB, Mount: /
HDD - md0 Size: 743GB, Used: 592GB(80%), Available: 151GB, Mount: /index_hd
HDD - sdb1 Size: 95.0TB, Used: 1.8TB( 2%), Available: 90.0TB, Mount: /data_hd
It looks like your DAS is nearly full - 83% in 6 months, so at this rate you have about 1 month left, and much more data still to sync!
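The one-month estimate is easy to sanity-check from the df output above (11T used on /data_hd after roughly 180 days, 2.3T free), assuming roughly linear growth:

```shell
# Back-of-envelope headroom estimate; the numbers are pulled from the df -h output
# in the post above (TB and days), not measured live.
awk 'BEGIN {
  used = 11; avail = 2.3; days = 180        # /data_hd: TB used, TB free, days in service
  rate = used / days                        # TB consumed per day
  printf "fill rate: %.3f TB/day, ~%.0f days until /data_hd is full\n", rate, avail / rate
}'
# prints: fill rate: 0.061 TB/day, ~38 days until /data_hd is full
```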
Dunno much about the resm_status command; mine shows:
Sorry, I am unable to answer your questions, though I suspect a negative answer to the last two is quite probable.
I have a few new findings about the redundant ESM:
1. Only admins can access it, which makes it pointless in the scenario where the primary ESM is unavailable.
2. There is no check for NTP being out of sync between the primary and the redundant, so you have to check it manually if there are any issues.
3. To avoid re-syncing the redundant ESM after an upgrade (not sure if it works between major versions, but it worked for the 9.5.0 -> 9.5.0 MR1 upgrade), all you need to do is disable "auto event interval" and wait for all jobs to finish before the upgrade.
After discussing your case with McAfee Engineering (I worked with them on another issue last week), they provided a hint as to why syncing may take so long.
As highlighted in the screenshot, the recommendation is to restrict historical data insertions.
Reason: if an older DB partition is updated during syncing, it needs to be re-synced.
Thanks for the feedback. The finalization phase took a very long time (4 days)!!! And the real problem is that we have 9 SIEM appliances, as below:
- 2x ETM x4
- 2x ELM 6000
- 3x ERC 4600
- 1x ACE 3450
- 1x APM 3460
We have been using these devices for about 4 months. The ETMs have 13 TB of disk and current usage is 90%!!! No compression, no retention. The ESM event log count is 6 billion. However, we had planned for this disk capacity to be enough for at least one year! I've started deleting unnecessary event data on the ESM, but disk usage does not change!
As far as I understand, McAfee ESM and ArcSight ESM are as alike as two peas in a pod: ArcSight ESM only indexes the last 3 months of data, and if you need older data it directs you to ArcSight Logger.
Can someone please explain the ESM workaround?
The total number of SIEM appliances doesn't matter for syncing/finalizing; we have 17 of them (not counting DAS).
The ESM will delete the oldest events before disk usage reaches 100% (proven to work in my test lab).
Retention on the ELM is managed separately, so you can keep logs in the ELM for much longer if you wish. End of the good news - you will have to use the painful ELM search in that case.
I'm surprised by your disk usage: in my SIEM there are 70+ billion events but only 1.9 billion total records in the DB (after 6 months), and disk usage hasn't changed much. Just disregard my earlier comment about the DAS, as we haven't started filling it yet.
I hate to say this, but event aggregation could actually help in your case.
There may be some rules you have to modify because they do not have the right aggregation settings; where you do want to see every event without aggregation, make those rules the exception.
Our Receivers' aggregation settings are set to "Low".
We have an older X3 using 5 TB of local storage and 12 TB of DAS storage to hold 13 rolling months of data.
Record counts from our database are as follows: Alert (event) - 3+ billion; Flow - 11+ billion. We have over 18 billion events this year alone when looking at the UI.
The number of SIEM appliances should have no bearing on syncing, we have 17 appliances also, not including attached DAS devices.
We broke redundancy when upgrading from 9.3.2 to 9.4.2 and never re-enabled it; we are waiting on a new X-series primary to be ordered.
We also have a smaller environment (7 appliances) that is just over a year old: an X4 ESM with 1+ billion Alert "events" in the DB, for 41+ billion total events in the UI. There is no DAS on this appliance, and we are only at 1.5 TB of storage used, with 12 TB still available.