As a new product manager for McAfee SIEM, I spend my days learning about our product's features and how they address information security risk in an organization.  What I learned before I got here, over a decade of information security investigations and threat detection, was where in security data to look to detect threats and risks proactively, before they become an incident.  Seems like as good a place to start as any.


Before listing the winners, it would be worthwhile to explain why and how these data sources rose to the top of the list.


Goals for Correlation: While a SIEM can meet multiple needs, those needs are quite different.  Risk and threat correlation does not have the same goals as incident response.  In the incident response process, you are working around a chain of events, from start through compromise through cleanup.  For correlation, you may be focused on early detection of a small part of that chain, or even tiny details of each event in the chain.  It is this difference in focus that makes the thinking behind correlation and incident response so different, and it makes a different set of data sources valuable for each (sounds like a future blog post).  These goals apply to both rule-based and risk-based correlation (aka algorithmic correlation).


Criteria for Data Sources:  Now that we know we are looking for potentially deep detail, we can start to value some data sources over others.  While this was a wholly unscientific effort, it was based on the experience of many, many people and some reasonable yardsticks (meter-sticks?) for comparison:

  1. The source has to give you more information than other data sources might offer about the same event, OR the source provides information that NO other data source offers.
  2. The source has to help the most in addressing the risks you are trying to manage.  It has to be the most relevant to the threat, or it has to be the most innovative in identifying or measuring it.



Correlation is smoke detection, not firefighting.


Based on these criteria, here are the best data sources for SIEM correlation, in no particular order.


Windows Event Logs

Whether you are looking at attacks from the outside or the inside, Windows event logs give you a staggering amount of information to correlate.  To illustrate, logging on to a remote desktop and opening two administrative control panels can generate more than 30 events in the security log.  You have text and binary versions of most of the information; you also have a lot of metadata that helps you do the correlation itself (source and destination IP, hostnames, etc.).  With all that information comes a lot of effort required to comprehend it.  Add to that thousands of different formats for the data, and building out correlation becomes a discouraging task.  Still, if you are looking for hints of trouble that can be matched up to other hints, or rolled up into a robust risk model, those who look return to Windows event logs time and time again.
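
To make "hints matched up to other hints" concrete, here is a minimal sketch of one rule-based correlation: a successful logon (Security event ID 4624) that follows several failed logons (event ID 4625) from the same source IP within a short window.  The dictionary keys, the threshold, and the five-minute window are assumptions chosen for illustration, not how any particular SIEM represents or evaluates events.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

FAILED_LOGON = 4625   # Windows Security log: an account failed to log on
SUCCESS_LOGON = 4624  # Windows Security log: an account was successfully logged on


def failures_then_success(events, threshold=3, window=timedelta(minutes=5)):
    """Yield (source_ip, user, time) when a success follows repeated failures.

    `events` is an iterable of dicts sorted by time, with the hypothetical
    keys "time", "event_id", "source_ip" and "user".
    """
    recent_failures = defaultdict(deque)  # source_ip -> failure timestamps
    for ev in events:
        fails = recent_failures[ev["source_ip"]]
        # Forget failures that have aged out of the correlation window.
        while fails and ev["time"] - fails[0] > window:
            fails.popleft()
        if ev["event_id"] == FAILED_LOGON:
            fails.append(ev["time"])
        elif ev["event_id"] == SUCCESS_LOGON and len(fails) >= threshold:
            yield ev["source_ip"], ev["user"], ev["time"]
            fails.clear()
```

In a risk-based model the same match could add to a score for the source IP instead of raising an alert outright.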



Syslog

The companion data source for non-Windows environments, Syslog holds the same promise for risk and threat correlation.  While potentially less noisy, it sits somewhere between structured and unstructured data.  Although it is well-respected for what it lends to log collection and management, I have seen some shy away from it because of what it requires you to know about each program that runs on the host.  Each program logs information a little differently, and the less-structured format means that details often get squashed together.  All that aside, for those who want to know what is going on across hundreds, thousands, or tens of thousands of machines on a network, this is a go-to data source.  Depending on the depth of logging enabled, many different key details for a correlation can be dumped in one place.
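
As a rough illustration of that "between structured and unstructured" problem, the sketch below pulls the structured header out of a classic BSD-style (RFC 3164) syslog line and leaves the program-specific message as free text.  The regular expression and field names are my own assumptions; real collectors see many variations, and each program's message still needs its own parsing.

```python
import re

# Header of a BSD-style syslog line: <PRI>Mmm dd hh:mm:ss host tag[pid]: message
SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>"
    r"(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<tag>[^:\[\s]+)(?:\[(?P<pid>\d+)\])?: ?"
    r"(?P<message>.*)$"
)


def parse_syslog(line):
    """Return a dict of header fields plus the raw message, or None if no match."""
    match = SYSLOG_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # PRI encodes facility * 8 + severity.
    pri = int(fields.pop("pri"))
    fields["facility"], fields["severity"] = divmod(pri, 8)
    return fields


record = parse_syslog(
    "<38>Jan  5 03:02:01 web01 sshd[2211]: Failed password for root from 203.0.113.7"
)
```

Everything after the tag stays in `message`, which is exactly where the per-program knowledge comes in.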




DNS

DNS is an excellent example of a data source that provides information you can't find in other data sources.  If you didn't know it before Dan Kaminsky, you certainly know it now: the IP address you get from a DNS server may not be the one everyone else gets.  The DNS packet tells you what the machine asked for and the answer it was given, all framed up with source and destination IP.  Add any hinky data that may be smuggled in the additional records field, and the packet gives you a lot of detail.  Beyond the needed transformation from the network layer to the application layer, it tells you which other machines a machine is asking about.  Finding these connections is an important part of correlation, and here you have to do it with packet inspection rather than log inspection.
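
One way to act on "the answer you got may not be the one everyone else gets" is to compare answers across clients.  The sketch below assumes the DNS packets have already been decoded into (client IP, query name, answer IP) tuples, which is a hypothetical layout, and flags any client whose answer differs from the most common answer for that name.  Round-robin DNS and content delivery networks will trip this naive check, so treat a hit as a hint to correlate, not a verdict.

```python
from collections import Counter, defaultdict


def divergent_answers(records):
    """Return (client_ip, qname, answer_ip, majority_ip) for minority answers.

    `records` is an iterable of (client_ip, qname, answer_ip) tuples.
    """
    by_name = defaultdict(list)
    for client, qname, answer in records:
        # Normalize case and the trailing dot so names compare equal.
        by_name[qname.lower().rstrip(".")].append((client, answer))

    findings = []
    for qname, seen in by_name.items():
        counts = Counter(answer for _, answer in seen)
        majority, _ = counts.most_common(1)[0]
        for client, answer in seen:
            if answer != majority:
                findings.append((client, qname, answer, majority))
    return findings
```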


Application Logs


Application logs, whether custom for the application, out-of-the-box, or instrumented with the help of a third party, are correlation gold.  If you are looking for the clearest reflection of what a user is doing on a system, this is likely where you'll find it.  Here events become transactions: "this user did this at this time".  If you are building a risk model, you have weighting options to segment high-risk from low-risk.  If you are building detection rules, you have something that will probably generate fewer false positives.  Application logs are undoubtedly a huge challenge to integrate in an enterprise, but many experts see the effort as absolutely worth it.  Included in this group are other transaction-based logs like DLP logs.
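
To show what "weighting options" might look like in practice, here is a small sketch that scores users by the transactions they perform.  The action names, weights, and threshold are invented for illustration; a real risk model would tune them to the application and the risks being managed.

```python
from collections import defaultdict

# Hypothetical weights: higher means the transaction carries more risk.
ACTION_WEIGHTS = {
    "view_record": 1,
    "export_report": 5,
    "change_permissions": 8,
}
DEFAULT_WEIGHT = 1  # assumed weight for actions not listed above


def score_users(transactions):
    """Sum the risk weight of each (user, action) transaction per user."""
    scores = defaultdict(int)
    for user, action in transactions:
        scores[user] += ACTION_WEIGHTS.get(action, DEFAULT_WEIGHT)
    return dict(scores)


def high_risk(scores, threshold=10):
    """Return the users whose total score meets the threshold, sorted by name."""
    return sorted(user for user, score in scores.items() if score >= threshold)
```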




Context Data

Somewhere outside the realm of logs and packets, there is everything else you might need for correlation.  There is the building location of an asset, the job role of a user, the city on record for an IP address.  All of these things can be a critical part of a set of correlation rules or a robust risk model.  They often exist in a non-security system and can be among the hardest to pull in: for example, do any two companies in the world have the same HR and asset management systems?  On the plus side, context data sources win on the information they contain, which is not available in most, if not all, security data sources.  On the minus side, because of the data they hold, correlation builders may lean on this data source above all others, turning rules and models into disasters.  For some details you are seeking, it may be easier to detect them with an innovative approach in the other four data sources listed here.  That is not always the case, but a model can be more scalable with more innovative detection (and we have another blog post on the list).
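
As a minimal sketch of pulling context into correlation, the function below enriches a security event with an asset's building and a user's job role.  The two lookup tables stand in for exports from an asset management system and an HR system; their shape and the field names are assumptions, since every organization's systems differ.

```python
def enrich(event, assets_by_ip, roles_by_user):
    """Return a copy of `event` with building and job-role context added.

    `assets_by_ip` maps an IP address to a dict such as {"building": "HQ-2"};
    `roles_by_user` maps a username to a job role.  Both are hypothetical
    stand-ins for non-security systems of record.
    """
    enriched = dict(event)  # leave the original event untouched
    asset = assets_by_ip.get(event.get("source_ip"), {})
    enriched["building"] = asset.get("building", "unknown")
    enriched["job_role"] = roles_by_user.get(event.get("user"), "unknown")
    return enriched
```

Marking missing context as "unknown" rather than dropping the event keeps a rule or model from silently depending on the context source being complete.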


So there they are.  To sum up:

  1. Data sources great for correlation are not the same as those great for incident response or log management or compliance.
  2. There could be more on the list, but don't discount the ones above.
  3. I have seen the same data sources popular with both risk-based and rule-based correlation.


Grant Babb