Could you please help me with a McAfee ESM that forwards two-week-old logs to a destination? It's set up for TCP/Syslog/CEF format and has around 10 destinations, each with different filters.
For example, today (May 16) I see a May 3, 2019 timestamp in the forwarded messages (checked with Wireshark/tcpdump).
Any help will be appreciated.
Thank you for your interest in my issue.
All of the destinations are behind, even those that send only 2-3 EPS. It's a 1 Gbps network, and in total we have no more than 500 EPS of forwarded data, each event ~1 KB, so it should work well. I'm working with the network team, but we haven't found anything suspicious yet. I tried to find retransmissions/errors in the traffic dumps but found nothing. I plan to reduce the number of destinations, but I don't think that affects performance (it's not like we have dozens of destinations). Any advice on what else to check?
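A quick sanity check of the numbers above, assuming ~1 KB per event and ignoring TCP/IP and syslog framing overhead, shows the link itself is nowhere near saturated:

```python
# Rough bandwidth estimate for forwarded events, using the figures from the post.
EPS = 500                  # forwarded events per second (upper bound quoted above)
EVENT_BYTES = 1024         # ~1 KB per event, assumed average size
LINK_BPS = 1_000_000_000   # 1 Gbps link

def forwarding_load(eps: int, event_bytes: int) -> int:
    """Return the payload bit rate in bits per second (no protocol overhead)."""
    return eps * event_bytes * 8

bits = forwarding_load(EPS, EVENT_BYTES)
print(f"{bits / 1e6:.1f} Mbit/s, {bits / LINK_BPS:.2%} of the link")
# prints: 4.1 Mbit/s, 0.41% of the link
```

Even doubling the event size for the larger IPS/EDR events leaves the load under 1% of the link, which points away from raw bandwidth as the bottleneck.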
Sounds like you are running into some kind of bottleneck. I'm not sure what your use case for ESM event forwarding is, but you will not be able to use ESM event forwarding with the new snow flex architecture when it comes out, as the two are mutually exclusive. I do know that ESM event forwarding is neither efficient nor particularly scalable. Maybe you should look at forwarding events from the receivers using data archiving?
What version are you running? Is it possible to tap into the Kafka event queue instead?
Hi, sorry, I thought I replied but it looks like my message was lost.
Yeah, a bottleneck is what I'm looking for... ESM is 10.x.x, so I can't use Kafka yet; once it is upgraded, I'll definitely use the queue.
As of now, I can get ~250 EPS out of ESM (it's aggregated data, roughly one tenth of the 'original' EPS coming into ESM). The EPS graph is pretty flat, so I suspect some kind of limit is set in ESM.
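A flat output rate below the input rate also explains a lag that keeps growing. Here is a minimal FIFO model, assuming constant rates, a queue that starts empty, and the hypothetical figures of 500 EPS to forward versus ~250 EPS actually sent:

```python
def forwarding_lag(elapsed_s: float, in_eps: float, out_eps: float) -> float:
    """Age in seconds of the event being sent now, for a FIFO queue that starts
    empty at t=0, is fed at a steady in_eps and drained at a steady out_eps."""
    if out_eps >= in_eps:
        return 0.0  # the sender keeps up, so no backlog ever builds
    # The event sent at time T is event number out_eps*T, which arrived at
    # time out_eps*T / in_eps, so its age is T * (1 - out_eps / in_eps).
    return elapsed_s * (1 - out_eps / in_eps)

DAY = 86_400
for days in (1, 7, 26):
    lag_days = forwarding_lag(days * DAY, 500, 250) / DAY
    print(f"after {days:2d} days: the event being sent is {lag_days:.1f} days old")
```

Under these assumptions the lag grows by half a day per day, so a gap like May 3 versus May 16 would appear after roughly 26 days of running behind.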
Events are aggregated on the event receivers, not on the ESM. The receivers actually just run another instance of NitroDB. In simple terms, when collection is performed, the ESM pulls these records into its master database for any previous minutes it does not yet have a record of. This mechanism was overhauled with the advent of the Kafka messaging system.
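A toy model of the pull-based collection just described, assuming a single receiver whose aggregated records are keyed by minute. The class and method names are hypothetical, and the real NitroDB mechanics are certainly more involved:

```python
from dataclasses import dataclass, field

@dataclass
class Receiver:
    # Hypothetical stand-in for a receiver's local store: minute -> aggregated events.
    buckets: dict[int, list[str]] = field(default_factory=dict)

@dataclass
class Esm:
    # Hypothetical master database: minute -> events already collected.
    master: dict[int, list[str]] = field(default_factory=dict)

    def collect(self, receiver: Receiver, now_minute: int) -> int:
        """Pull every previous minute the master DB has no record of yet.
        Returns how many events were copied in this collection run."""
        pulled = 0
        for minute in sorted(receiver.buckets):
            if minute < now_minute and minute not in self.master:
                self.master[minute] = list(receiver.buckets[minute])
                pulled += len(self.master[minute])
        return pulled

rx = Receiver({0: ["a", "b"], 1: ["c"], 2: ["d"]})
esm = Esm()
print(esm.collect(rx, now_minute=2))  # minutes 0 and 1 -> 3 events
print(esm.collect(rx, now_minute=3))  # only minute 2 is new -> 1 event
```

The point of the sketch is that forwarding can only start after a minute has been pulled, so anything that delays collection delays forwarding too.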
CEF is extremely verbose; are you sure the events are only 1 KB in size? Sending all of the database fields in CEF format likely produces much larger messages. Is it possible that only each packet is about 1 KB? That represents just 1000 characters, and with the CEF preamble and field enumeration you are very likely to exceed that amount. Also, do you have "Send Packet" selected? That option sends the raw packet of an aggregated event along with the raw database fields.
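One way to settle the size question is to build a representative CEF line and measure it. This sketch follows the general CEF layout (pipe-separated header, then space-separated key=value extensions, with pipes escaped in the header and equals signs escaped in extension values); the vendor, product, and field values are made up for illustration, and the syslog header that precedes the CEF text on the wire is not included:

```python
def esc_header(value: str) -> str:
    # CEF header fields: escape backslashes first, then pipes.
    return value.replace("\\", "\\\\").replace("|", "\\|")

def esc_ext(value: str) -> str:
    # CEF extension values: escape backslashes, equals signs and newlines.
    return value.replace("\\", "\\\\").replace("=", "\\=").replace("\n", "\\n")

def cef_event(vendor, product, version, sig_id, name, severity, ext: dict) -> str:
    """Assemble one CEF:0 line from header fields and an extension dict."""
    header = "|".join(esc_header(str(v)) for v in
                      (vendor, product, version, sig_id, name, severity))
    extension = " ".join(f"{k}={esc_ext(str(v))}" for k, v in ext.items())
    return f"CEF:0|{header}|{extension}"

# Hypothetical event; a real forwarded event may carry many more fields.
event = cef_event("ExampleVendor", "ExampleProduct", "1.0", "4624",
                  "An account was successfully logged on", 3,
                  {"rt": "May 16 2019 10:00:00", "src": "10.0.0.5",
                   "dst": "10.0.0.9", "suser": "alice", "msg": "Logon type=3"})
print(len(event.encode("utf-8")), "bytes")
```

Every additional database field adds its key, an equals sign, the value, and a separator, so measuring a few real forwarded events this way (or straight from the capture) gives a more reliable average than an estimate.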
Yep, events from network devices are 400-700 bytes and Windows events 600-900 bytes; only some IPS/AV/EDR/DNS events exceed 1 KB, at around 1400-1600 bytes. The CEF header is 40-60 bytes, depending on the signature name. Send Packet is definitely not enabled. Anyway, I switched to UDP and still see around 200 pps (roughly equal to EPS in my case). The network and the receiving side can handle thousands of pps, so they are not the bottleneck here. The main suspect is ESM itself.
Oh right... is the event forwarding limit set to 0 (unlimited)?
I just did a small test on a v11 lab I have running. I know it's not the same as v10, but I don't believe any significant changes have been made to the forwarding processors.
With only 2 cores I am able to send upwards of 1000 messages a second as UDP packets.
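To measure the rate on the receiving side independently of Wireshark, a small counter can stand in for a destination. This assumes one forwarded event per UDP datagram, as with syslog over UDP; bind it to whatever spare host and port you point a test destination at (the socket must not already be in use by the real collector):

```python
import socket
import time

def count_datagrams(sock: socket.socket, seconds: float) -> int:
    """Count datagrams arriving on an already-bound UDP socket for `seconds`.
    The caller owns the socket and is responsible for closing it."""
    sock.settimeout(0.2)  # wake up regularly so the deadline is honoured
    received = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        try:
            sock.recv(65535)
            received += 1
        except socket.timeout:
            continue  # nothing arrived in this slice; re-check the deadline
    return received
```

Dividing the returned count by the measurement window gives the average EPS actually delivered, which can be compared with what the ESM reports as sent.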
Sorry for the long silence, I've been a bit busy.
I finally got an opportunity to play a bit with the ESM settings. I switched CEF to SEF, and the rate went up to 1500-2000 EPS via UDP when I blackholed the traffic. I did that only for a short period and switched to TCP after that. ESM somehow controls how many events it sends: it does not change the EPS rate gradually but tries to hold it at 250, 500, 750, or 1000 EPS levels (tested only with TCP), which is really interesting. It also sends events in batches of ~1,000,000 events, which I find interesting as well. I'm still trying to get the maximum EPS by tuning the receiving side, mostly out of curiosity, because I've already reached the required EPS rate.
Thank you for ideas and information!
Items are forwarded when they are received, so forwarding depends on the collection rate. If you auto-collect every 5 minutes, you will notice a burst of forwarded events every 5 minutes.
If you are seeing problems only when TCP is selected, I assume these are VMs? The NICs on your hosts likely do not have enough RX/TX queues, resulting in TCP back pressure. There are various ways to validate that this is the problem with TCP, and I'm sure you can find resources on them online.
After looking at some straces, I think ESM forwarding over TCP is single-threaded, as the data must be sent serially. UDP would not have this issue, which might also be why UDP seems able to send forwarded packets faster.
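The burst pattern follows from simple arithmetic. A sketch with hypothetical numbers (250 EPS average into the forwarder, 5-minute auto collection, a sender that manages 1000 EPS while a burst is draining):

```python
def burst_profile(avg_eps: float, collect_interval_s: float, send_eps: float):
    """Return (events per collection, seconds to drain one burst, keeps up?)."""
    burst = avg_eps * collect_interval_s  # events that pile up between collections
    drain_s = burst / send_eps            # time the sender needs to push them out
    return burst, drain_s, drain_s <= collect_interval_s

print(burst_profile(250, 300, 1000))  # (75000, 75.0, True)
```

If the drain time ever exceeds the collection interval, each burst starts before the previous one has finished, and the backlog grows from one collection to the next.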
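To see what TCP back pressure looks like from the sender's point of view, this self-contained localhost demo connects to a peer that never reads and sends until the kernel refuses more data. It illustrates the mechanism only and is not a model of ESM itself; on a real Linux host, a persistently non-zero Send-Q for the forwarding connection in `ss -tn` output is one place to look for the same effect:

```python
import socket

def bytes_until_backpressure(chunk: int = 64 * 1024) -> int:
    """Send to a localhost peer that never reads until sending would block."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    client = socket.create_connection(server.getsockname())
    peer, _ = server.accept()  # accepted, but peer.recv() is never called
    client.setblocking(False)
    payload = b"x" * chunk
    sent = 0
    try:
        while True:
            try:
                sent += client.send(payload)  # may be a partial send
            except BlockingIOError:
                # Both the send and receive buffers are full: a blocking
                # sender would now stall until the peer drains data.
                return sent
    finally:
        for s in (client, peer, server):
            s.close()

print(bytes_until_backpressure(), "bytes buffered before the sender blocked")
```

With a blocking socket, the sender would simply wait at this point, so a slow reader (or a congested path) directly caps the sender's rate.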
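A toy model of why a single serial sender would make every destination fall behind, including the low-volume ones. It assumes one thread, one shared queue, blocking sends, per-destination filters, and made-up per-send costs; none of this is taken from ESM's actual code:

```python
def drain_rate(events: list[str], filters: dict[str, set[str]],
               send_cost_s: dict[str, float]) -> float:
    """Events per second one thread can drain from a shared queue when each
    event is sent, one destination at a time, to every matching destination."""
    total = 0.0
    for ev in events:
        for dest, wanted in filters.items():
            if ev in wanted:
                total += send_cost_s[dest]  # the thread waits for this send
    return float("inf") if total == 0 else len(events) / total

events = ["fw"] * 9 + ["dns"]                      # 10 queued events
filters = {"big": {"fw", "dns"}, "small": {"dns"}}  # "small" wants 1 in 10
print(drain_rate(events, filters, {"big": 0.001, "small": 0.001}))
print(drain_rate(events, filters, {"big": 0.004, "small": 0.001}))
```

When the "big" destination gets slower, the whole queue drains more slowly, so the "small" destination also falls behind even though it only receives one event in ten. With UDP each send returns almost immediately, which is consistent with it draining faster.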
I can't think of a good reason to send all of your events to another platform. The ESM/ELM has very stable and proven backup and archiving strategies. Sending the data to another platform basically breaks the rules of database normal form. I often see customers trying to use ESM features in unintended ways, and this is no exception. Can you think of another way to solve your problem that does not require duplicating all of the data somewhere else?