Does anyone know if there is a McAfee recommended setup for SNMP Traps on these units that we can be confident will alert us to any degradation on a WebWasher unit?
I suggest reading https://community.mcafee.com/community/business/email_web/webgateway/blog/2011/05/18/have-you-ever-w.... This will give you an intro to incidents inside MWG.
Degraded Performance can mean many things - overload, hardware broken, etc. Just some examples:
The default error handler knows some rules, e.g. the one for CPU overload.
Using the incident system, you can add others. For example, there is already a predefined rule for the case in which the AV engine is overloaded; it just needs the SNMP options added.
Using ASC, you can configure more appliance specific parameters, see https://community.mcafee.com/community/business/email_web/webgateway/blog/2011/08/21/install-intels-....
This gives you access to additional hardware info for the Intel appliances (WG 4000, 4500, 5000B, 5500B models). There is a similar system for Dell-based hardware (1100, 1900, 2900, 5000, 5500, all series). I am attaching a draft guide for this.
I am also attaching the Intel MIBs which include additional monitoring points.
Thanks Michael! I had already seen the blog about what is available as far as monitoring objects go, and I like the new Intel information. But going back to my original question: we do not know what constitutes a properly monitored unit. For example, we rebooted two units in a 6 unit cluster yesterday and did not receive a single SNMP TRAP.
I am used to monitoring network devices that have a clear-cut, defined set of SNMP TRAPs, and by using those we know that the unit is healthy. With the WebGateway I do not feel comfortable that we have everything covered. There are a lot of monitoring objects but no way of knowing what a healthy unit looks like. Please help clear this up.
Ok, so let us start building a list together, which we will take inside and use to come up with a set of possible events to send traps upon.
The most obvious ones:
CPU load too high
Too little memory
Disk issues (S.M.A.R.T)
AntiMalware engine issues (not loaded, update issues, run out of filtering threads)
URL Filtering engine issues (same as above)
Connection blocking the proxy
Excellent! Thanks! That is a good start. I would add a TRAP and polled objects for the following:
Appliance Health - Values included would be Healthy, Degraded, and Critical (THIS IS THE MOST IMPORTANT ONE AS IT CAN ACT AS A CATCH ALL FOR ANY SYSTEM ISSUES)
System Temperature - Polled temperatures and a TRAP when a threshold is exceeded, with warning and critical thresholds
Paging Activity - system is paging excessively
Fan Errors - If applicable
Memory Errors -
File system low disk -
Scheduled Job Errors -
Any Hardware related issues -
Internal Process errors - not sure if this is applicable, but the intent is to be notified if an internal process crash is preventing traffic from passing.
I am sure there are others but this should give us a good start.
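The catch-all "Appliance Health" value described above could be derived by taking the worst state among the individual checks. The following is only a minimal sketch of that idea; the `Health` enum, the `overall_health` function, and the check names are hypothetical and are not part of MWG or its MIBs.

```python
from enum import IntEnum


class Health(IntEnum):
    """Hypothetical health states, ordered from best to worst."""
    HEALTHY = 0
    DEGRADED = 1
    CRITICAL = 2


def overall_health(checks):
    """Return the worst state among the individual checks (HEALTHY if there are none)."""
    return max(checks.values(), default=Health.HEALTHY)


# Example: one degraded check makes the whole appliance report as degraded.
checks = {
    "system_temperature": Health.HEALTHY,
    "paging_activity": Health.DEGRADED,
    "fan": Health.HEALTHY,
    "filesystem": Health.HEALTHY,
}
print(overall_health(checks).name)
```

A single polled object and a single TRAP built on such a rollup would let an NMS treat anything other than Healthy as a reason to look closer.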
Also we should be able to receive the standard MIB II TRAPs 0 through 4 per the table below.
Generic trap name (number) - Definition
coldStart (0) - Indicates that the agent has rebooted. All management variables will be reset; specifically,
warmStart (1) - Indicates that the agent has reinitialized itself. None of the management variables will be reset.
linkDown (2) - Sent when an interface on a device goes down. The first variable binding identifies which interface went down.
linkUp (3) - Sent when an interface on a device comes back up. The first variable binding identifies which interface came back up.
authenticationFailure (4) - Indicates that someone has tried to query your agent with an incorrect community string; useful in determining if someone is trying to gain unauthorized access to one of your devices.
egpNeighborLoss (5) - Indicates that an Exterior Gateway Protocol (EGP) neighbor has gone down.
enterpriseSpecific (6) - Indicates that the trap is enterprise-specific. SNMP vendors and users define their own traps under the private-enterprise branch of the SMI object tree. To process this trap properly, the NMS has to decode the specific trap number that is part of the SNMP message.
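On the receiving side, the generic trap numbers defined for SNMPv1 in RFC 1157 can be mapped to their names with a small lookup. This is only an illustrative sketch; `describe_trap` is a hypothetical helper, not part of any SNMP library or of MWG.

```python
# Generic trap numbers and names as defined for SNMPv1 (RFC 1157).
GENERIC_TRAPS = {
    0: "coldStart",
    1: "warmStart",
    2: "linkDown",
    3: "linkUp",
    4: "authenticationFailure",
    5: "egpNeighborLoss",
    6: "enterpriseSpecific",
}


def describe_trap(generic, specific=0):
    """Return a readable label for a generic trap number (hypothetical helper)."""
    name = GENERIC_TRAPS.get(generic)
    if name is None:
        return f"unknown generic trap {generic}"
    if generic == 6:
        # Enterprise-specific traps are only meaningful together with the
        # vendor-defined specific trap number.
        return f"{name} (specific trap {specific})"
    return name


print(describe_trap(0))
print(describe_trap(6, 12))
```

A reboot like the one mentioned earlier in the thread would normally be expected to show up as generic trap 0 or 1, if the agent is configured to send it.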
Not sure if we can provide all of these, but excellent points - also as education for me. What do others see as important?
I read your post (https://community.mcafee.com/community/business/email_web/webgateway/blog/2011/05/18/have-you-ever-w...) and I've got some questions.
About incidentIDs 22, 23 and 24, what are the default limits? Is there documentation with those specifics?
22 Filesystem usage mwg-monitor detected a filesystem usage beyond a certain threshold - Filesystem usage on PART exceeds selected limit.
23 Memory usage mwg-monitor detected a dirty-pages to total mem ratio beyond a certain limit - Memory usage ratio of N processes exceed selected limit.
24 Load mwg-monitor detected a load beyond a certain limit - 5 minute load average exceeds selected limit.
200 Check for license expiration - The license expire date has been checked
Finally, about incidentID 200, is it possible to create a rule that sends a trap or email some period of time before the license expires?
Incident 22: The incident will be triggered if 90% of the disk space is in use.
Incident 23: I will check.
Incident 24: The threshold is currently at 3.0.
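To cross-check these limits independently of the appliance, the same two thresholds (90% filesystem usage, 5 minute load average of 3.0) can be evaluated with a few lines of Python. This is a rough sketch only: the function names are hypothetical, the exact comparison operators are an assumption, and it does not reflect how mwg-monitor itself is implemented. `os.getloadavg()` is only available on Unix-like systems.

```python
import os
import shutil

DISK_USAGE_LIMIT = 0.90  # Incident 22: 90% of disk space in use
LOAD_LIMIT = 3.0         # Incident 24: 5 minute load average


def disk_exceeded(used, total, limit=DISK_USAGE_LIMIT):
    """True if the used fraction of the filesystem reaches the limit (assumed >=)."""
    return total > 0 and used / total >= limit


def load_exceeded(load5, limit=LOAD_LIMIT):
    """True if the 5 minute load average is above the limit (assumed >)."""
    return load5 > limit


if __name__ == "__main__":
    usage = shutil.disk_usage("/")      # total, used, free in bytes
    _, load5, _ = os.getloadavg()       # 1, 5 and 15 minute load averages
    print("disk limit exceeded:", disk_exceeded(usage.used, usage.total))
    print("load limit exceeded:", load_exceeded(load5))
```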
For incident 200 there is a default rule set shipped with MWG. It shows some examples of how to notify you that the license expires in X days. The property "License.RemainingDays" contains this information. The rule set will always send you an email (or a trap, etc.) when the license is verified. You could add a criterion like "License.RemainingDays < 30" so that you are only notified when you are close to the expiry date.
I still have to find out what "when the license is verified" means :-)
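The effect of a criterion such as "License.RemainingDays < 30" can be illustrated outside MWG with plain date arithmetic. This is only a sketch of the logic; `remaining_days` and `should_notify` are hypothetical names, the dates are made up, and the real check is done by the MWG rule engine using the License.RemainingDays property.

```python
from datetime import date


def remaining_days(expiry, today):
    """Number of whole days from today until the expiry date."""
    return (expiry - today).days


def should_notify(days_left, threshold=30):
    """Mirror the criterion 'RemainingDays < threshold'."""
    return days_left < threshold


# Example with made-up dates: 29 days left is below the threshold of 30.
days_left = remaining_days(date(2012, 1, 30), date(2012, 1, 1))
print(days_left, should_notify(days_left))
```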