We had two recent experiences where client systems received the message:
Anti-Malware Engine Overloaded
The Anti-Malware engine is currently overloaded and content delivery is not permitted without being checked for viruses. Please try again later.
One occurrence happened over the weekend. The other happened yesterday. Devices involved were not the same.
These occurrences seemed to be related to failed AV engine updates as mentioned in the following KB article:
Specific errors included:
[Anti-Malware Engine] [ErrorFromAVEngine] Error message from engine: 'McAfee micro incremental update failed: 8'
[Anti-Malware Engine] [ErrorFromAVEngine] Error message from engine: 'McAfee Gateway Anti-Malware Engine: failed to load MFE base API in '/opt/mwg/plugin/data/antivirus/SCANM/573''
[Anti-Malware Engine] [ErrorLoadingSCANM] Failed to load and initialize 'McAfee Gateway Anti-Malware' engine from directory '/opt/mwg/plugin/data/antivirus/SCANM/573'. Error code: 'SCANMAPI_ERROR_LOADFAILED'.
[Anti-Malware Engine] [AVLoadFailure3] Cannot load engine McAfee Gateway Anti-Malware with index '573'. Reason: 'SCANMapiInitialize() failed with code: SCANMAPI_ERROR_LOADFAILED'.
[Anti-Malware Engine] [AviraCannotInitializeSavapi] Avira: Avira: Cannot initialize savapi3 due to error code 10.
[Anti-Malware Engine] [ErrorFromAVEngine] Error message from engine: 'McAfee Gateway Anti-Malware Engine: failed to load MFE base API in '/opt/mwg/plugin/data/antivirus/SCANM/225''
[Anti-Malware Engine] [ErrorLoadingSCANM] Failed to load and initialize 'McAfee Gateway Anti-Malware' engine from directory '/opt/mwg/plugin/data/antivirus/SCANM/225'. Error code: 'SCANMAPI_ERROR_LOADFAILED'.
[Anti-Malware Engine] [AVLoadFailure3] Cannot load engine McAfee Gateway Anti-Malware with index '225'. Reason: 'SCANMapiInitialize() failed with code: SCANMAPI_ERROR_LOADFAILED'.
After the first occurrence, we had to run "service mwg restart" to restore services on that device.
During the second occurrence, the initial attempt to perform a service restart resulted in failure to stop the anti-malware engine.
We are configured with Block on Anti-Malware Engine errors in Error Handler.
Prior to the second occurrence, I configured a new Error Handler rule to send a notification if Error.ID equals 14001 -- criteria based on the existing Error Handler block rule for Anti-Malware Engine Overload.
This rule did not activate and we ended up with over 4 dozen help desk tickets related to the problem. Since nearly 2000 connections were denied, 4 dozen problem reports is probably fortunate from our standpoint.
I have now expanded the notification rule to these criteria:
Error.ID equals 14000 OR Error.ID equals 14001 OR Error.ID equals 851 OR Error.Message matches Broken*
Will this be sufficient to provide a timely notification if this issue recurs? Any thoughts on why the original notification rule didn't activate?
If not, I need to know so that I can configure a non-MWG monitoring solution to trigger if an anti-malware overload message is received.
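For the non-MWG monitoring idea, a minimal sketch of the detection step might look like the following. It assumes the external monitor already fetches a test URL through the proxy and has the response body as text. The marker strings are taken from the block page quoted above; the function name is made up for illustration.

```python
# Marker text taken from the block page quoted at the top of this post.
OVERLOAD_MARKERS = (
    "Anti-Malware Engine Overloaded",
    "The Anti-Malware engine is currently overloaded",
)


def is_overload_page(body: str) -> bool:
    """Return True if a response body looks like the MWG overload block page.

    The comparison is case-insensitive so small template changes in
    capitalization do not hide the message.
    """
    lowered = body.lower()
    return any(marker.lower() in lowered for marker in OVERLOAD_MARKERS)
```

A monitor could call this on each probe result and raise an alert on the first True, independent of whether the Error Handler notification fires.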
I forget which version you have, but 7.3 added some additional codes that probably cover those conditions. My guess is they were added to cover your use-case.
You should open a support ticket and provide them with a feedback file.
Depending on the situation you are experiencing, a restart of mwg-antimalware could resolve the situation.
Do you have core dumps enabled? If so, take a look and see if mwg-antimalware is generating cores (/opt/mwg/logs/debug/cores/). We are having the same experience, but I haven't seen any correlation with the errors you mention. There is, however, a slew of antimalware cores in the 20 minutes or so before mwg-antimalware becomes unresponsive. Development is looking into the feedback files and cores we've provided; apparently there is some issue in mwg-antimalware. If you have cores, open a support case. Support helped us alleviate the issue by identifying a few URLs that were causing mwg-antimalware to core dump. We whitelisted those and haven't had a repeat since. We still see a handful of cores each day but haven't had mwg-antimalware become completely unresponsive.
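If it helps, here is a rough sketch for spotting a burst of cores. It lists files in the cores directory that were modified within the last 20 minutes or so. The directory is the one mentioned above; treating every file in it as a core, and the function name itself, are assumptions.

```python
import time
from pathlib import Path
from typing import List, Optional

# Directory mentioned above; adjust if your cores land elsewhere.
CORE_DIR = "/opt/mwg/logs/debug/cores/"


def recent_cores(core_dir: str = CORE_DIR,
                 window_minutes: int = 20,
                 now: Optional[float] = None) -> List[str]:
    """Return sorted names of files in core_dir modified within the window.

    Assumption: any regular file in the directory counts as a core.
    A missing directory yields an empty list rather than an error.
    """
    directory = Path(core_dir)
    if not directory.is_dir():
        return []
    current = time.time() if now is None else now
    cutoff = current - window_minutes * 60
    return sorted(
        p.name
        for p in directory.iterdir()
        if p.is_file() and p.stat().st_mtime >= cutoff
    )
```

Run from cron, a non-empty result could be used as an early warning before mwg-antimalware stops responding.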
We haven't made the jump to 7.3 yet. Waiting for it to be a little more mature since AFAICT it's a significant upgrade @ the kernel/base OS level.
In 7.3.0 the rule criteria is: Error.ID greater than or equals 14002 AND Error.ID less than or equals 14050
so that would cover only Error.ID values 14002 through 14050. The error that we hit was definitely 14001, and one of the problems we ran into was that the notification I had configured didn't fire when the issue happened on a weekday.
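To make the gap concrete, here is a small sketch that models the two sets of criteria as plain comparisons. It is not how MWG evaluates rules; the function names are made up, and treating "matches Broken*" as a glob pattern is an assumption.

```python
from fnmatch import fnmatchcase


def matches_expanded_rule(error_id: int, message: str = "") -> bool:
    """Model of the expanded notification criteria posted earlier in the thread.

    Assumption: 'Error.Message matches Broken*' behaves like a glob.
    """
    return error_id in (14000, 14001, 851) or fnmatchcase(message, "Broken*")


def matches_730_rule(error_id: int) -> bool:
    """Model of the 7.3.0 criteria: 14002 <= Error.ID <= 14050."""
    return 14002 <= error_id <= 14050


# 14001 is covered by the expanded rule but falls outside the 7.3.0 range.
print(matches_expanded_rule(14001), matches_730_rule(14001))
```

So even after an upgrade, the 7.3.0 range on its own would not have notified on the 14001 error we saw.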
I tried to get thread info from mwg-antimalware, but it was so overloaded that it refused the request.
If you are willing to share some of the URLs that were identified as problematic, that could be useful to us. PM is fine if you don't want to post them publicly.
We do not currently have core generation enabled because our traffic volume + logs make that a risky proposition on systems where we haven't "fixed" the file system partitioning.
I don't have any useful ones for you, they were all intranet sites or else required logons to get to.
One I can share is:
This caused a malware detection and core dump every time I went to it.
We've been experiencing similar problems. The first problem occurred Sunday and we spent several hours trying to figure out what was going on. We were finally able to narrow it down to a single device (of 8) and rebooted it after taking a feedback.
We had two more devices start blocking on AM engine load failure. Since we knew what to do, we rebooted them and did not have a significant outage. (With 40,000 users, any blip in Internet access is noticed rather quickly.)
Monday we updated the Block if Antimalware Engine cannot be loaded rule to send an email to those of us who can do something about it. We caught one error yesterday at noon in time to reboot the offending device before customer notifications came in.
Watching the devices closely today, I saw all of the Avira AM engines restart at 12:17pm EST.
This was indicated by several notices in all the core-errors logs:
[2013-02-28 12:17:22.597 -05:00] [AV] [AVError] Error in AntivirusFilter: 'Cannot filter because special update is performed.'.
And in the antimalware-errors logs I see this:
[2013-02-28 12:17:22.497 -05:00] [Anti-Malware Engine] [ExitRestartAppInternalKill] Stopped 'McAfee Web Gateway Anti-Malware Engine version: 188.8.131.52.0 - build: 13253' after internal kill requested...
[2013-02-28 12:17:29.556 -05:00] [Anti-Malware Engine] [TermSignalReceived] 'McAfee Web Gateway Anti-Malware Engine version: 184.108.40.206.0 - build: 13253' - child process exited (termsignal='9').
[2013-02-28 12:17:29.578 -05:00] [Anti-Malware Engine] [RestartAppFrequentFailCountOK] 'McAfee Web Gateway Anti-Malware Engine version: 220.127.116.11.0 - build: 13253' - restarting...
[2013-02-28 12:17:29.600 -05:00] [Anti-Malware Engine] [StartApp] Starting 'McAfee Web Gateway Anti-Malware Engine version: 18.104.22.168.0 - build: 13253' - 'No FIPS mode'.
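For anyone who wants to spot these restarts automatically, here is a minimal parsing sketch. It assumes every log line follows the `[timestamp] [component] [event] message` layout shown above. The event names are taken from the excerpt; the function name is made up.

```python
import re
from datetime import datetime
from typing import Iterable, List, Tuple

# Layout assumed from the excerpt: [timestamp] [component] [event] message
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<component>[^\]]+)\] \[(?P<event>[^\]]+)\] (?P<message>.*)$"
)

# Event names seen in the restart sequence quoted above.
RESTART_EVENTS = {
    "ExitRestartAppInternalKill",
    "TermSignalReceived",
    "RestartAppFrequentFailCountOK",
    "StartApp",
}


def restart_events(lines: Iterable[str]) -> List[Tuple[datetime, str]]:
    """Return (timestamp, event) pairs for engine restart events."""
    found = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or match.group("event") not in RESTART_EVENTS:
            continue
        # Example timestamp: 2013-02-28 12:17:22.597 -05:00
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f %z")
        found.append((ts, match.group("event")))
    return found
```

Feeding it the contents of the errors log would give a timeline of engine kills and restarts that can be lined up against the memory chart.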
On further review of the Dashboard charts, I see that each of the devices that have failed in the past week shows a gradual increase in memory utilization by mwg-antimalware beginning around 1200 EST on 2/20. When the memory gets to 4G the device fails. This only shows on the devices that have had failures; the other devices have a mostly flat line for memory usage in mwg-antimalware.
My hope is that the new engine that just went out is a fix. I've got a message into the support team for verification.
Good news update from the support team. An official notice is coming out.
Engineering has identified an issue with the MWG AV process which caused older AV engines not to get unloaded properly after an update occurred. This issue slowly caused memory usage to increase with every AV update until the AV process reached its limit of 4GB. Engineering identified a solution for this issue which was implemented with the release of "Gateway DAT" 1644 and in addition, an AV engine restart was triggered to bring the AV engine to a clean start. The release of “Gateway DAT” 1644 and AV engine restart is automated and does not require any further actions to implement.
To verify that your MWG DATs are updated as described above, please review the following in the UI:
Dashboard >> Alerts >> Appliance Status >> Gateway DATs 1644 or newer
Dashboard >> Alerts >> Appliance Status >> DATs 7000 or newer
To verify that the issue is resolved, we recommend that you check the appliance once a day for the next 3 days with the following command:
/usr/sbin/lsof -p $(pgrep -n mwg-antimalware) | grep "antivirus/SCANM" | grep "(deleted)"
If the command returns no results then the issue has been eliminated. If the command returns any results or if you need any further assistance, please generate a new feedback file (Troubleshooting >> Feedback) and provide it to Support.
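For those who would rather script the daily check, here is a rough Python equivalent of the command above. It is a sketch, not something from Support. It assumes a Python 3.7+ interpreter is available wherever you run it, and the function names are made up. The filter does the same two substring matches as the grep pipeline.

```python
import subprocess
from typing import List


def stale_engine_files(lsof_output: str) -> List[str]:
    """Keep lsof lines that reference a deleted antivirus/SCANM file."""
    return [
        line
        for line in lsof_output.splitlines()
        if "antivirus/SCANM" in line and "(deleted)" in line
    ]


def check_appliance() -> List[str]:
    """Run pgrep and lsof as in the command above and filter the output.

    Raises CalledProcessError if no mwg-antimalware process is found.
    """
    pid = subprocess.run(
        ["pgrep", "-n", "mwg-antimalware"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    lsof = subprocess.run(
        ["/usr/sbin/lsof", "-p", pid],
        capture_output=True, text=True,
    )
    return stale_engine_files(lsof.stdout)
```

An empty list from `check_appliance()` corresponds to the "no results" case described above; anything else is worth a feedback file.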