I would like to create a dashboard on which I can see the total event count per device. Ideally, I would also like to have a baseline and the deviation from this baseline.
The goal is to spot abnormal behaviour: servers that generate too many or too few events, based on their respective baselines.
I created a new dashboard with a table. I created a new query with which I select the fields "Device ID", "Device Name" and "Event count".
Maybe my approach is not a good one? I would appreciate any help or pointers.
Thank you in advance.
ESM version : 9.4.2
There are a few ways of doing this. Solutions that work include:
- Bar charts - Collection rate [by Device] per second
- Distribution charts
My preference is to use Distribution charts per device type, as they show the number of events over time, so I can quickly see deviations or interruptions (the latter being rather frequent on McAfee SIEM despite the HA setup).
The downside is the space limit (your screen) if you want a chart for every single device.
I now have a nice graph with the collection rate per second for each of my data sources, ordered by avg_rate, and a distribution chart bound to it. Much easier than my approach ^^
The problem is that even if I show baseline averages and margins, it shows all the data sources, regardless of their deviation.
What I would like is to see only the sources which are under or above those margins, and/or order my sources by those deviations.
I don't even know if this is feasible ^^
A baseline per data source is perfectly doable. I am not sure which graph you are referring to; can you post a screenshot?
Bar charts can show multiple deviations. If your distribution chart shows all of them, you can bind it to the "collection rate per second" bar chart, and the distribution chart will change when you select each data source. This is a blind guess based on my graphs.
Sorry not to be clearer I will do my best ^^
Right now, I have something like this :
The problem with this solution is that it requires manual interaction to see which data source is not behaving normally. With more than 100 sources, it is not usable...
This is why I would like to order my data sources not by average rate but by their deviation from their own baseline. The problem is that the baseline can be shown on the graph but cannot be used in the queries.
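To make the ordering I am after concrete, here is a minimal Python sketch of the logic, run outside ESM. It assumes the per-source baseline and current rates have been exported somehow; the `SourceRate` type, the `rank_outliers` function, the sample names and the 50% margin are all hypothetical and not part of any ESM API.

```python
from typing import List, NamedTuple, Tuple


class SourceRate(NamedTuple):
    """Hypothetical record for one data source (not an ESM structure)."""
    name: str
    baseline_eps: float  # baseline average rate, events per second
    current_eps: float   # rate observed in the current window


def relative_deviation(src: SourceRate) -> float:
    """Signed deviation from the source's own baseline (0.5 means +50%)."""
    if src.baseline_eps == 0:
        # No baseline: any traffic counts as an infinite deviation.
        return float("inf") if src.current_eps > 0 else 0.0
    return (src.current_eps - src.baseline_eps) / src.baseline_eps


def rank_outliers(sources: List[SourceRate], margin: float = 0.5) -> List[Tuple[str, float]]:
    """Keep only sources outside +/- margin, largest deviation first."""
    scored = [(s.name, relative_deviation(s)) for s in sources]
    outliers = [(name, dev) for name, dev in scored if abs(dev) > margin]
    return sorted(outliers, key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    sample = [
        SourceRate("fw-01", 100.0, 100.0),
        SourceRate("dc-01", 200.0, 20.0),
        SourceRate("web-01", 50.0, 120.0),
    ]
    for name, dev in rank_outliers(sample):
        print(f"{name}: {dev:+.0%}")
```

Sorting on the absolute deviation puts both "too many" and "too few" sources at the top of the same list, which is what the dashboard cannot do today.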
Have you thought about setting up Alarms for Devices that deviate from the Baseline? Again, if you want one for each Data Source, you will need to create 100 separate Alarms.
This isn't feasible in our environment, as we have about 2,000 Data Sources on one ESM, and about 1,000 on our other ESM. We have Deviation from Baseline Alarms for the Devices themselves (Receivers, ACE, APM), as well as for particular groupings like Firewalls, VMware Hosts, or simply for Unknown Events.
It varies; you have to tweak the alarm to fit your environment.
Some examples are:
General Deviation from Baseline for a Device:
Query - Total Events; Time Frame - Last 8 Hours; Trigger when - 90% below; Check Rate - 1 Hour
Others include settings such as:
Query - Total Events; Time Frame - Last 1 Week; Trigger when - 50% above, 50% below; Check Rate - 12 Hours
Query - Total Events; Time Frame - Last 2 Hours; Trigger when - 25% above; Check Rate - 1 Hour (Unknown Events from Unix/Linux/AIX systems)
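As a rough illustration of what these "Trigger when" percentages mean, here is a small Python sketch. It is not how ESM evaluates alarms internally; the function name and parameters are hypothetical, and it assumes "90% below" means the current count has dropped by at least 90% relative to the baseline.

```python
from typing import Optional


def deviates_from_baseline(
    current: float,
    baseline: float,
    pct_above: Optional[float] = None,
    pct_below: Optional[float] = None,
) -> bool:
    """Return True when current is outside the allowed band around baseline.

    pct_above / pct_below are percentages (e.g. 25 for 25%).
    A limit left as None is not checked.
    """
    if baseline <= 0:
        # Without a meaningful baseline there is nothing to compare against.
        return False
    change_pct = (current - baseline) / baseline * 100.0
    if pct_above is not None and change_pct >= pct_above:
        return True
    if pct_below is not None and change_pct <= -pct_below:
        return True
    return False


if __name__ == "__main__":
    # Baseline of 1000 events, only 80 seen: a 92% drop, so "90% below" fires.
    print(deviates_from_baseline(80, 1000, pct_below=90))
    # 1300 events against a baseline of 1000: +30%, so "25% above" fires.
    print(deviates_from_baseline(1300, 1000, pct_above=25))
```

The tighter the percentages and the shorter the time frame, the more often the check fires, which is why the values need tuning per environment.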
The increase in Unknown Events for certain types of systems is usually triggered by one of several causes: either someone enabled Debug mode (typically on a switch or router), or, in the case of some VMware hosts, the local storage for its logs filled up, so it logs even more events saying it is out of space, over and over.
We are currently dealing with a receiver (small older orange Nitro box) which has been getting "lowmem_reserve" messages followed by "IPSDBServer: Error: Could not send event(s) to correlator through socket - Unable to obtain lock(4)". When this occurs, the receiver replies to a ping but no longer processes events, does not allow ssh connections and, worst of all, does not accept incoming syslog messages. I had to hard boot it again this morning because of this issue, after following up on the Alarm email and seeing on my dashboard that we had no events from that receiver in the past 30 minutes.
My default dashboard contains a small Distribution view for each "Device" we have, 9 Receivers, ACE, APM, ePO, plus an "EPS" gauge for Total Events per Second of the system, so I can quickly see when an issue may be taking place.
We also have "Device Failure" alarms for each device that check every 10 minutes. There are occasional False Positives, but we usually see these before the Deviation from Baseline alarms, as we check more frequently.
I created a Correlation rule with the following criteria :
I also grouped by "Device ID". So my rule is as follows:
I then created a dashboard with a bar chart. I am not sure which event query I should select... I ran a few tests with my rule's signature ID as a filter, but my chart stays empty.
And I really doubt that none of our servers are in the red... Any ideas?
Bonus question: the threshold cannot be < 1. Doesn't it use the normal distribution?
Thanks in advance,