Just as a first draft to start the discussion, I would say:
- The collection rate per device, to highlight any behaviour below (quiet devices) or above (noisy devices) the baseline.
- CPU, Memory, Disk usage, temperature, & co.
- DB Usage
- Device outages / availability
All of these would be compared with their own baselines.
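As a rough illustration of the "compared with their own baselines" idea, here is a minimal Python sketch that labels a device as quiet, noisy, or normal. Everything in it is an assumption for discussion: the function name `classify_rate`, the use of mean and standard deviation as the baseline, and the tolerance of 3 deviations. It does not use any ESM API; the rates would have to be exported from the SIEM by some other means.

```python
from statistics import mean, stdev

def classify_rate(history, current, tolerance=3.0):
    """Compare a device's current collection rate with its own baseline.

    history   -- past rates for this one device (hypothetical export)
    current   -- the latest observed rate
    tolerance -- allowed distance from the baseline, in standard deviations
    """
    if len(history) < 2:
        return "unknown"  # not enough data to build a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all is a deviation.
        if current == baseline:
            return "normal"
        return "quiet" if current < baseline else "noisy"
    score = (current - baseline) / spread
    if score < -tolerance:
        return "quiet"
    if score > tolerance:
        return "noisy"
    return "normal"
```

The same comparison could be applied to CPU, memory, or disk figures, as long as each metric is measured against its own history rather than a global threshold.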
It would be great to have a comprehensive dashboard for health checks.
This kind of dashboard would sell me on upgrading to 9.5 and using a content pack to get the health check dashboard.
Has anyone created a report or dashboard that provides this information automatically?
This is one of my most immediate goals. I will share them as soon as they are stable.
There is an existing view called "Device Status" that you could use as a starting point. You can use it at any level: an individual Data Source or Parent, a Receiver, or the top-level ESM.
It shows the following details on the ESM:
- Ten Minute Average CPU Load
- Current CPU Load
- Current Memory Utilization
- Current Hard Disk Utilization
- Block Device Statistics
- Ten Minute Event and Flow Rate
I built a custom view, which is my default view. It has a small window that shows the event distribution for the last hour from each Receiver, ACE, ePO, APM, etc., as well as an overall EPS gauge for the ESM. At the top of the view I still have my normal Event Summary, Event Count, and Distribution.
This lets me quickly check my overall EPS, and each appliance, for any issues with missing events that have not been alerted on for Low Event Count.
I'm in the process of building reports for device & system health for all ESM devices, including the ESM itself. As soon as I finalize and verify the content and automation, I will check with McAfee to see if they will let me post them here in the community.
You can create a separate dashboard with distribution panels for the major devices in the architecture (ACE, NSM, ATD, ERC):
Query Type: Distribution
Device ID: ACE/ NSM/ ATD/ ERC
This gives an overview of events, and you can spot device outages from the ups and downs.
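Spotting outages "from the ups and downs" can also be done mechanically once the distribution counts are outside the ESM. The Python sketch below assumes the per-interval event counts for one device have already been exported as a plain list; the function name `find_gaps` and the `min_len` parameter are made up for illustration and are not part of the product.

```python
def find_gaps(counts, min_len=2):
    """Return (start, end) index pairs for runs of empty intervals.

    counts  -- event counts per time bucket for one device (hypothetical export)
    min_len -- how many consecutive empty buckets count as a possible outage
    """
    gaps = []
    start = None
    for i, count in enumerate(counts):
        if count == 0:
            if start is None:
                start = i  # a run of empty buckets begins here
        else:
            if start is not None and i - start >= min_len:
                gaps.append((start, i - 1))
            start = None
    # Handle a run that is still open at the end of the data.
    if start is not None and len(counts) - start >= min_len:
        gaps.append((start, len(counts) - 1))
    return gaps
```

A run that reaches the end of the list is the interesting case for alerting, since it means the device is still silent.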
Having worked with the McAfee SIEM for nearly three years now, I can say that making sure the appliances are healthy is not easy. There should be an appliance health dashboard; instead, problems with the appliances are lumped into the event stream from data sources or don't make it any further than /var/log/messages. We gave up and built something to monitor the thing we paid good money to monitor all of our things. It's not terribly sophisticated, but it's very effective. It goes like this:
1. Simple Perl scripts run hourly on each appliance and dump command-line output to an appliance status file, for example: disk free, hardware RAID, logical RAID, dssummary, listings of key directories, etc.
2. A collection script on a Windows Server collects up the status files from all the appliances.
3. A monitoring script cuts up all the data into parameters for each appliance and tests to see if a parameter is within green, yellow, or red tolerances.
4. The monitoring script sends out a green alert every six hours if all is well; if there's a yellow or red condition, it emails or texts right away.
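Steps 3 and 4 above can be sketched roughly as follows. This is written in Python rather than the Perl mentioned in the post, purely as an illustration; the parameter names, tolerance values, and function names are all hypothetical, and the actual scripts may work quite differently.

```python
SEVERITY = {"green": 0, "yellow": 1, "red": 2}

def classify(value, yellow_at, red_at):
    """Map one parameter value to green, yellow, or red (higher is worse)."""
    if value >= red_at:
        return "red"
    if value >= yellow_at:
        return "yellow"
    return "green"

def overall_status(params, tolerances):
    """Return the worst status across all parameters of one appliance.

    params     -- e.g. {"disk_used_pct": 72} parsed from a status file
    tolerances -- {"disk_used_pct": (yellow_at, red_at)}, assumed values
    """
    worst = "green"
    for name, value in params.items():
        yellow_at, red_at = tolerances[name]
        status = classify(value, yellow_at, red_at)
        if SEVERITY[status] > SEVERITY[worst]:
            worst = status
    return worst

def should_notify(status, hours_since_last_green, heartbeat_hours=6):
    """Yellow or red notifies at once; green only as a periodic heartbeat."""
    if status != "green":
        return True
    return hours_since_last_green >= heartbeat_hours
```

The green heartbeat matters as much as the alerts: if it stops arriving, the monitoring itself has probably failed.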
It's a shame to have to resort to this, but with 13K+ data sources in play and dummy parent folders, the little red or yellow flags are always there. We've also tried setting up alerts for health events, but things have gone badly wrong without a peep from those.