Event aggregation is a simple concept, but it is important to understand the nuances in order to operate the SIEM effectively. In short, log aggregation allows a SIEM to reduce event volume by combining like events. There are multiple ways to adjust aggregation, and this post covers each of the use cases. Feel free to add your own use cases or ask questions in the comments. This information is current as of release 9.5.2 MR2.
Data Flow Primer
Due to the repetitive nature of log data, a great deal of efficiency can be gained by consolidating like events based on common fields. The process starts at the device creating logs. The logs are transferred to a Receiver, using any one of a variety of protocols, where they are processed for storage on the ESM and ELM.
For the ELM, the Receiver bundles up the raw logs and provides them to the ELM where they are compressed and digitally signed for integrity. These files are then added to a time-based repository where they are available for full text search and integrity verification to prove they have not been tampered with for the duration of the retention period.
For the ESM, the Receiver will parse the logs into fields, normalize the events into categories and aggregate the data based on the repetition of like events. This process is basically creating metadata for the logs. The Receiver then inserts the metadata into its local database and stores it there on a first in, first out (FIFO) basis. Meanwhile, the ESM is querying each Receiver every few minutes to get the latest events and populate them into the GUI.
The solution essentially stores two copies of the data. The metadata on the ESM provides the operational representation of the raw data stored on the ELM.
By default, “like” is defined as a common source IP address, destination IP address and signature ID.
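As a rough illustration of that default definition, the following Python sketch combines events that share the same source IP, destination IP and signature ID into a single record with a count. The field names (`src_ip`, `dst_ip`, `sig_id`, `time`) and the `AggregateRecord` structure are assumptions made for the example and do not reflect the Receiver's internal format.

```python
from dataclasses import dataclass


@dataclass
class AggregateRecord:
    count: int
    first_time: int
    last_time: int


def aggregate(events, key_fields=("src_ip", "dst_ip", "sig_id")):
    """Combine events that share the same values for key_fields."""
    records = {}
    for event in events:
        key = tuple(event[field] for field in key_fields)
        record = records.get(key)
        if record is None:
            # First time this combination is seen: open a new record.
            records[key] = AggregateRecord(1, event["time"], event["time"])
        else:
            # Like event: bump the counter and widen the time range.
            record.count += 1
            record.first_time = min(record.first_time, event["time"])
            record.last_time = max(record.last_time, event["time"])
    return records


# Hypothetical parsed events (illustrative values only).
events = [
    {"time": 1, "src_ip": "10.0.0.5", "dst_ip": "10.0.0.1", "sig_id": 1001},
    {"time": 2, "src_ip": "10.0.0.5", "dst_ip": "10.0.0.1", "sig_id": 1001},
    {"time": 3, "src_ip": "10.0.0.6", "dst_ip": "10.0.0.1", "sig_id": 1001},
]
records = aggregate(events)
```

Here three events become two records: one with a count of 2 for 10.0.0.5 and one with a count of 1 for 10.0.0.6.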
Note: For a combo appliance/VM, the data flow is the same between the virtual components that exist within the appliance.
How Aggregation Works
By default, events are aggregated based on the same source IP, destination IP and event ID within a single ESM->Receiver query window. This is called Dynamic Aggregation, and it is enabled by default on a per-Receiver basis.
If Dynamic Aggregation were disabled, event aggregation records would extend through multiple ESM->Receiver query windows. For example, suppose some number of events are aggregated under a single record. Instead of that record being closed out immediately after the ESM queries the Receiver, it is held open to see if the event occurs again during the next window. The same happens on the Receiver. If the event occurs again during the next query window, the counters are updated and the record continues to be extended up to the max record time (12 hours by default). This means that for each event that arrives there is an extra lookup on both the Receiver and the ESM, which impacts performance. This is called Level 1 aggregation, and it used to be the default setting before Dynamic Aggregation was added to increase performance and aggregation granularity. It's highly recommended to always leave Dynamic Aggregation enabled.
This works really well in most scenarios; however, there are always exceptions.
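The difference between the two modes can be sketched in Python as follows. This is a simplified model, not the product's implementation: the 300-second query window is an assumed value (the post only says "every few minutes"), and the function and parameter names are hypothetical.

```python
def build_records(events, window_seconds=300, hold_open=False,
                  max_record_seconds=12 * 3600):
    """Group (timestamp, key) pairs, sorted by time, into aggregation records.

    hold_open=False models Dynamic Aggregation: a record is closed when the
    query window it was opened in ends.
    hold_open=True models Level 1 aggregation: a record stays open as long as
    the event recurs in the next window, up to max_record_seconds.
    """
    closed = []
    open_records = {}
    for ts, key in events:
        record = open_records.get(key)
        if record is not None:
            if hold_open:
                skipped_window = ts // window_seconds - record["last"] // window_seconds > 1
                too_old = ts - record["first"] >= max_record_seconds
                expired = skipped_window or too_old
            else:
                expired = ts // window_seconds != record["first"] // window_seconds
            if expired:
                closed.append(open_records.pop(key))
                record = None
        if record is None:
            open_records[key] = {"key": key, "first": ts, "last": ts, "count": 1}
        else:
            record["last"] = ts
            record["count"] += 1
    closed.extend(open_records.values())
    return closed


# The same event seen at 10s, 20s, 310s and 620s (three consecutive windows).
key = ("10.0.0.5", "10.0.0.1", 1001)
events = [(10, key), (20, key), (310, key), (620, key)]
dynamic = build_records(events, hold_open=False)
level1 = build_records(events, hold_open=True)
```

In this model, Dynamic Aggregation yields three records (one per window), while Level 1 yields a single record with a count of 4 at the cost of a lookup against the open record for every arriving event.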
Aggregation Use Case: Firewall Command Auditing
Every environment has its own unique use cases, requirements, business drivers, metrics, alarms, reports, etc. One that I recently had the opportunity to work with had an excellent use case to audit firewall changes by logging every command entered, generate a weekly report with the firewall sessions and reconcile it with the change requests. They had properly enabled the logging level so that each command entered was sent as an individual log, but the logs, with the common event ID, source and destination IP addresses, were being aggregated under a single "CLI Command Entered" event. The result was that the fidelity of the text actually entered, which was parsed into the Command field, was not visible in the aggregate event.
In this instance, the use case requires that every single event (CLI command) generated by the log source be available for query and analysis and reporting in the ESM. For this to work, aggregation must be disabled completely for the event ID to capture every single command typed. Fortunately, manually typed firewall commands aren’t generating a high volume of events so disabling aggregation for these events will not negatively impact performance. There are instructions for disabling aggregation for a parsing rule ahead.
Aggregation vs. Performance
Aggregation is an effective method to summarize events in such a way that the details required for reporting, alarms and advanced correlation are preserved without requiring enormous compute resources. The level of aggregation that is achieved varies between implementations. In general you can expect an average ratio of approximately 10:1 in most environments, but it's not uncommon to see ratios of 30-50:1 for events with high frequency and repetition, like firewall flow setup logs. At a 10:1 ratio, the ESM only has to analyze about 10% of the events that the Receiver parses. This also enables the architecture to scale horizontally by adding Receivers to feed a single ESM.
It also means that if aggregation were disabled completely, the ESM would not be able to process events at scale. There is room for tuning aggregation; however, it's important to consider the impact of any change that might drastically alter the event aggregation ratio. It would not be good practice to disable aggregation for a high-volume or low-quality event. The aggregation ratio can also be affected by the cardinality of the events. There is a corner case that will reduce performance if the data fields show too much diversity. In most cases, there is a semi-contiguous subnet where most of the data sources live, which allows for a normal level of aggregation. If every event included a different source or destination IP, which could be the case in a distributed denial-of-service (DDoS) attack, then aggregation would be reduced and performance would suffer.
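The arithmetic behind those ratios is simple enough to sketch in Python. The 5,000 events-per-second input rate below is a made-up figure for illustration, and the function names are hypothetical.

```python
def esm_share(aggregation_ratio):
    """Fraction of parsed events the ESM has to handle at a given N:1 ratio."""
    return 1.0 / aggregation_ratio


def esm_records_per_second(receiver_eps, aggregation_ratio):
    """Aggregated records per second reaching the ESM."""
    return receiver_eps / aggregation_ratio


# Hypothetical Receiver parsing 5,000 events per second.
receiver_eps = 5000
for ratio in (1, 10, 30, 50):
    print(f"{ratio}:1 -> {esm_share(ratio):.1%} of events, "
          f"{esm_records_per_second(receiver_eps, ratio):.0f} records/s at the ESM")
```

A ratio of 1:1 corresponds to aggregation being disabled, which is why turning it off for a high-volume event passes the full load straight through to the ESM.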
To handle this sort of situation, the Receiver has two additional tiers of aggregation beyond what Dynamic Aggregation provides.
Per the default settings, if a single event occurs more than 300,000 times in a single minute then Level 2 Aggregation will kick in. This will cause the event to be aggregated more aggressively. The destination IP for the event that crossed the threshold will be set to 0.0.0.0 and ignored for future events in that minute. This immediately reduces all the records down to one and eases the performance load that the burst created.
If the event count surpasses 350,000 in a minute then Level 3 Aggregation will kick in. This will cause both the source and destination IP addresses to be reset to 0.0.0.0, leaving the aggregation record to match every time that event occurs in that minute. In almost all cases the default settings work well, so it's not recommended to modify the Receiver-wide aggregation settings.
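The tiering described above can be sketched as a function that picks the aggregation key based on how often an event has occurred in the current minute. The thresholds are the defaults quoted in this post; the function name, parameters and the way the per-minute count is supplied are assumptions for the example rather than the Receiver's actual logic.

```python
WILDCARD_IP = "0.0.0.0"


def effective_key(src_ip, dst_ip, sig_id, events_this_minute,
                  level2_threshold=300_000, level3_threshold=350_000):
    """Return the aggregation key for an event given its per-minute count."""
    if events_this_minute > level3_threshold:
        # Level 3: ignore both source and destination IP.
        return (WILDCARD_IP, WILDCARD_IP, sig_id)
    if events_this_minute > level2_threshold:
        # Level 2: ignore the destination IP only.
        return (src_ip, WILDCARD_IP, sig_id)
    # Below both thresholds: default source IP, destination IP and event ID.
    return (src_ip, dst_ip, sig_id)


normal = effective_key("10.0.0.5", "10.0.0.1", 1001, events_this_minute=1_200)
level2 = effective_key("10.0.0.5", "10.0.0.1", 1001, events_this_minute=320_000)
level3 = effective_key("10.0.0.5", "10.0.0.1", 1001, events_this_minute=400_000)
```

With both addresses masked at Level 3, every occurrence of that event ID maps to the same key for the rest of the minute.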
Disable Aggregation for a Parsing Rule
Best practice dictates that no changes are made at the Default Policy level. The most direct route to disable aggregation for a single rule is to:
Find and highlight the event in a View and then select Show Rule from the top left context menu.
This will open the Policy Manager with the rule and data source selected. You're able to select the rule and disable Aggregation at this point.
Then roll out the policy for the change to take effect.
Aggregation Use Case: DNS Query Logs
Collecting DNS queries opens the door for numerous valuable use cases, and its value is reinforced even further when outgoing DNS is blocked. It's possible to compare domains to threat feeds and indicators of compromise (IOCs), detect attacks like FastFlux and monitor DNS sinkhole activity. The default aggregation settings are not ideal for DNS queries because the domain data may be lost if the records are aggregated on the source IP, destination IP (DNS server) and event ID (DNS query event). DNS queries can represent a high volume of events, so disabling aggregation completely is not a recommended course of action. In this case, it is best to leverage Custom Aggregation to adjust the fields on which the DNS events are aggregated while still maintaining some level of aggregation.
In the case of a DNS query event, custom aggregation allows the common fields to be changed to event ID, source IP and domain. This means that all DNS queries a host makes to the same domain within an ESM query window will be aggregated under one event. A new aggregation record will be created each time the domain changes, so every domain is included in the metadata while some level of aggregation is still maintained due to the repetitive nature of DNS. The process to adjust Custom Aggregation follows.
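To see why the choice of fields matters for DNS, the following Python sketch aggregates the same set of query events twice: once on the default fields and once on event ID, source IP and domain. The event layout, field names and the signature ID value are hypothetical.

```python
from collections import Counter


def aggregate(events, key_fields):
    """Count events per unique combination of key_fields."""
    return Counter(tuple(event[field] for field in key_fields) for event in events)


# Hypothetical DNS query events from one host to one DNS server.
dns_events = [
    {"sig_id": 5001, "src_ip": "10.0.0.5", "dst_ip": "10.0.0.53", "domain": "example.com"},
    {"sig_id": 5001, "src_ip": "10.0.0.5", "dst_ip": "10.0.0.53", "domain": "example.com"},
    {"sig_id": 5001, "src_ip": "10.0.0.5", "dst_ip": "10.0.0.53", "domain": "example.org"},
]

default_records = aggregate(dns_events, ("sig_id", "src_ip", "dst_ip"))
custom_records = aggregate(dns_events, ("sig_id", "src_ip", "domain"))
```

The default fields collapse all three queries into one record, so the individual domains are not distinguishable in the aggregate. The custom fields produce one record per domain while still combining the repeated query to example.com.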
Configure Custom Aggregation for a Rule
The most direct method to adjust the aggregation for an event is to locate the event in a View and then select Modify aggregation settings in the top left context menu.
In the Custom Aggregation settings you're able to adjust the fields as needed.
You're also able to see all of the devices using Custom Aggregation by selecting a Receiver, going to Properties and then Event Aggregation.
From there you're able to click the View button to show the list of devices with Custom Aggregation settings applied.
Aggregation Use Case: Authentication Events
We've covered use cases for both disabling aggregation completely and using Custom Aggregation to adjust the common fields used for aggregation. Some discretion is required to determine which events need additional granularity and how best to provide it. One use case that could use either approach involves authentication events. There are numerous types of authentication events from different devices. Some authentication events occur automatically in the background when remote files are accessed or scheduled tasks are run. These can represent a very large volume of events. There are also authentication events that represent interactive logins by humans; these are fewer in number but more critical. The type and volume of event will dictate whether aggregation should be disabled, adjusted or left at default.
In the picture below, notice that an event like "An account was successfully logged on" (4624) happens at a very different rate than an event like "A logon was attempted using explicit credentials" (4648).
For event 4624, it would be best to use Custom Aggregation to match on common event ID, source IP and source username. In this case, if there were multiple logins using different usernames from a single source IP address, there would be a unique aggregation record for each username tracking the number of times it was used. For an event like 4648, it is better to disable aggregation completely due to the low volume and high criticality of an explicitly performed login event.
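The per-event choices above can be modeled as a small policy table, sketched here in Python. The `POLICY` mapping, field names and event layout are hypothetical and only mirror the reasoning in this section: custom fields for 4624, no aggregation for 4648, and the default fields for everything else.

```python
DEFAULT_FIELDS = ("sig_id", "src_ip", "dst_ip")

# Hypothetical policy: a tuple sets custom fields, None disables aggregation.
POLICY = {
    4624: ("sig_id", "src_ip", "src_user"),
    4648: None,
}


def aggregate(events, policy=POLICY):
    """Aggregate events using per-event-ID settings from the policy."""
    records = []
    index = {}
    for event in events:
        fields = policy.get(event["sig_id"], DEFAULT_FIELDS)
        if fields is None:
            # Aggregation disabled: every event becomes its own record.
            records.append({**event, "count": 1})
            continue
        key = (fields, tuple(event[field] for field in fields))
        if key in index:
            index[key]["count"] += 1
        else:
            record = {**event, "count": 1}
            index[key] = record
            records.append(record)
    return records


def logon(sig_id, user):
    """Build a hypothetical authentication event from a fixed source."""
    return {"sig_id": sig_id, "src_ip": "10.0.0.5", "dst_ip": "10.0.0.9", "src_user": user}


events = [logon(4624, "alice"), logon(4624, "alice"), logon(4624, "bob"),
          logon(4648, "alice"), logon(4648, "alice")]
records = aggregate(events)
```

The three 4624 events yield two records (alice with a count of 2, bob with a count of 1), while each 4648 event is kept as its own record.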
Additional Aggregation Considerations
There are additional types of rules that could be considered for exemption from the default aggregation settings. Custom correlation rules and direct alarms (if this happens, alert me) might be good candidates as well. The key is to consider the volume of the events and which fields are required for your use cases to find the best balance between functionality and performance.