I didn't find many topics on this so I thought I would start one...
We are running Web Gateway 7.2 and have been for six months or so. I've noticed a bit of slowness. It isn't entirely unbearable, but it is certainly noticeable. I am not sure of the best way to go about figuring out where the bottleneck is. Should I be checking log files and such through the GUI? Should I be checking the load on the appliance itself? I thought a discussion like this might be helpful to many people wanting to make sure they are operating at full capacity. We currently have roughly 500 connections to this appliance.
When I troubleshoot performance issues I usually recommend the following first:
- Pick a random website that you will use for troubleshooting. Pick one with several objects; news sites generally work well
- Get a browser and a plugin that lets you measure page load speed. Examples: Chrome + the Page Load Time extension, Firefox + YSlow, or Firefox + Firebug
- At this stage please note that enabling the plugin can itself make the page slower, so also keep trying the website without any plugins and check the user experience (e.g. does loading the page "feel" slow, or better?)
- Also remember to clear all caches and completely restart your browser after each test. Ctrl+F5 is NOT sufficient in most cases
- Make sure you can reach the website both with and without MWG in the loop (using the same PC is definitely preferred for good test results)
If you KNOW there has been a big performance loss since you installed MWG, you can skip the following steps and continue later on...
- Turn off MWG and make sure you have direct Internet access. Measure the speed of the website a couple of times: load it, write down the time, restart the browser, clear the cache, test again, etc.
- Turn MWG back on and perform identical tests. Note that there will almost certainly be some change in page load time.
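To keep these with/without comparisons honest, take several samples per configuration and compare summary statistics rather than single runs. A minimal Python sketch (the sample numbers are made up for illustration):

```python
from statistics import mean, median

def summarize(label, samples_ms):
    """Print min/median/mean for a list of page-load times in milliseconds."""
    print(f"{label}: min={min(samples_ms):.0f}ms "
          f"median={median(samples_ms):.0f}ms mean={mean(samples_ms):.0f}ms")

# Hypothetical measurements from repeated loads of the same test page
direct = [820, 790, 845, 810]       # without MWG in the loop
proxied = [1150, 1230, 1100, 1190]  # with MWG in the loop

summarize("direct ", direct)
summarize("proxied", proxied)

# Compare medians: single runs are easily skewed by one slow object
overhead = median(proxied) - median(direct)
print(f"median overhead: {overhead:.0f}ms")
```

Using the median keeps one unlucky run from dominating the comparison; a consistent gap across several samples is much more convincing than a single slow load.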
Just as an example: when loading the website of a popular German news magazine, approximately 150-200 objects are downloaded. Each of these objects has to pass the rule engine and most likely multiple filters, such as (but not limited to) URL filter lookups, the AV engine, openers (e.g. unpacking), etc. I have talked to people complaining that MWG is not as fast as no proxy at all, or as fast as a caching-only proxy - that expectation is bound to be disappointed.
So I assume by now you have measured or felt some lack of performance. The rest is pretty easy.
- At the very beginning of your policy, create a new rule group that only matches your client IP. Into this rule group place a "Stop Cycle" rule. IMPORTANT: This turns off ALL filtering for your client IP (it is only for testing!)
- Now you skip all filters in MWG, and you can find out whether a filter is causing the delay or the proxy itself:
a) If access is still slow, we can assume the issue is proxy- or network-related
b) If access is noticeably or measurably better now, one or more filters are causing the delay
There are basically four things that can cause the problem:
1.) Something is broken in the network
Most people do not accept this statement, but I can confirm that in my time in support roughly 80% of all performance-related cases turned out to be a network fault.
- The most prominent candidate is DNS. Most people do not seem to care about DNS because it "has worked for years". MWG actually performs a lot of DNS queries, and a fast, accurate DNS server is required. To find out more about DNS, run a packet capture on port 53 and check request/response times.
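To get a rough first impression of resolver latency without a full packet capture, you can time lookups directly. A minimal Python sketch (the hostname and the "few tens of milliseconds" rule of thumb are illustrative assumptions, not MWG specifics):

```python
import socket
import time

def dns_latency_ms(hostname, attempts=3):
    """Return the fastest of several name-resolution attempts, in milliseconds.

    Note: the OS or a local resolver may cache results, so the first
    attempt is usually the most telling one.
    """
    best = float("inf")
    for _ in range(attempts):
        start = time.perf_counter()
        socket.getaddrinfo(hostname, 80)  # resolve via the system resolver
        best = min(best, (time.perf_counter() - start) * 1000)
    return best

# Anything consistently above a few tens of milliseconds for an internal
# resolver is worth investigating further with a capture on port 53.
print(f"localhost lookup: {dns_latency_ms('localhost'):.1f}ms")
```

This only measures what the OS resolver sees from one client; a packet capture on the MWG itself remains the authoritative check, since MWG issues far more queries than a single browser.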
- Another candidate is any kind of intrusion detection/prevention system placed "somewhere" in the path. Usually they work fine, so people forget they are there. I have seen MWG wake them up a couple of times, and they started causing issues. If such devices are working at the network level in the loop, see if you can remove them temporarily.
- Hardware. I didn't believe this myself, but a while ago I had a customer with performance problems, and it turned out that a broken switch port was the cause. Check the network for collisions and other faults.
2.) MWG is overloaded
In this case MWG simply has too much to do. You can find out by adding an additional node that only you use (without sharing it with your co-workers), or by adding a (high-powered) VM, and checking whether the delay goes away. If it does, more appliances will help.
3.) It won't go any faster
This is certainly also an option. As I mentioned, MWG does a lot of work. With the Stop Cycle rule applied you skip all filtering, but MWG still touches the network traffic (e.g. it handles the TCP connections), which adds a slight overhead. I expect this is not noticeable, because otherwise we would have a lot of performance-related tickets :-)
4.) You misconfigured something at the proxy level
Maybe you specified the wrong DNS server, maybe you used an IP address that is already in use on the network, etc.
Note: For all of the above steps you need to know your network in detail and/or be able to create and properly analyze packet captures. I understand that this may not be your daily business, so I recommend contacting support or professional services for assistance with troubleshooting.
If you are reading this far, I assume placing the "Stop Cycle" rule has improved the performance of the website, so we now know that one or more filters cause the problem. To continue troubleshooting you have an option that is time-intensive and an option that is time-intensive :-)
First option: Move the rule set you created around in the policy. All filters placed above the "Stop Cycle" rule are still executed, so you can experiment with its position and re-measure the time it takes to load the website (don't forget to clear the cache and restart the browser). You can place it at the end or in the middle of the policy and, last but not least, move it down one rule set at a time: save, measure the time, move it down, save, measure, etc., until you feel or see a noticeable delay.
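Moving the Stop Cycle rule down one rule set at a time is linear in the number of rule sets; in a large policy you can cut the number of measurements down with a binary search over its position. A hypothetical sketch, where `measure(pos)` stands in for your manual load-time measurement with the Stop Cycle rule placed after rule set `pos` (this helper is illustrative, not part of MWG):

```python
def first_slow_ruleset(measure, n_rulesets, threshold_ms):
    """Binary search for the first rule set whose inclusion pushes the
    measured page-load time past threshold_ms.

    measure(pos) must return the load time in ms with the Stop Cycle rule
    placed AFTER rule set `pos`, i.e. rule sets 1..pos are executed.
    Assumes one rule set introduces the slowdown and it persists for all
    later positions.
    """
    lo, hi = 1, n_rulesets
    while lo < hi:
        mid = (lo + hi) // 2
        if measure(mid) > threshold_ms:
            hi = mid       # slowdown already present: culprit is <= mid
        else:
            lo = mid + 1   # still fast: culprit is further down
    return lo

# Simulated example: rule set 7 of 10 is the slow one
fake = lambda pos: 1500 if pos >= 7 else 400
print(first_slow_ruleset(fake, 10, 1000))  # -> 7
```

With 10 rule sets this needs about 4 measurements instead of 10; the gain grows with policy size. Remember that each "measurement" still means saving the policy, clearing the cache, and restarting the browser.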
Second option: Disable the Stop Cycle rule and add a rule that writes a rule trace. These files are horrible to read, but they do record how long the processing of each rule took. Search for the rule groups, look at the time each one took, and find out what was slow. Problem: if only one object out of the 150-200 mentioned before is slow, you will end up with 150-200 rule traces. You could use Firebug to see how long each connection took and check whether only one was slow. Once you find it, look at that specific rule trace :-)
If you are here, you have found the rule set causing the delay. Now you can try to tweak it. Some options:
- Does it make sense to apply the rule? Maybe it is "too strict"?
- Maybe you can change some settings, e.g. reduce heuristics or turn an option on/off?
- On the other hand, maybe the rule simply takes its time... in this case see the first bullet point :-)
Again, support or professional services can help if you don't see potential for improvement.
That is basically what I can say without more information. Maybe you find something that helps you to move forward.
From very recent experience with SLOW sites...
Sorry, silly question here, but what is the IPS log?
Also, there was another thread on here that mentioned DNS issues and linked to a testing tool and it did indeed report that our DNS was slow, but are there logs that will tell me it is a bottleneck?
Sorry for the delay in responding... An Intrusion Prevention System (IPS) is a device that listens to traffic, matches signatures/behaviors/etc., and blocks that traffic. The log is... wherever you send it...
Check your internal DNS blackhole... If you answer blackholed DNS requests with an internal address (other than the loopback address) and there is nothing there to answer the request, you will get timeout issues again. In our case, the pseudo-website normally responding to (and logging) those requests crashed without anybody being notified... Timeout issues again...
Suggestion for DNS blackhole timeout issues: set up DNS blackhole entries to point to a real address in your IP space that is on the other side of the MWGs from your clients. Then configure MWG to block traffic to that address.
The client requests a bad domain, gets the blackholed IP address, tries to go through MWG, and is immediately blocked.
Sorry btlyric, I'm not sure what you mean. I don't believe we have anything set up for an internal DNS blackhole. Are there certain errors I should be looking for? We are just using Windows servers for DNS.
Whoops -- I was replying to DBO's comment about internal use of a DNS blackhole. If you were using an internal DNS blackhole, you could see performance issues if MWG was trying to process those connections. Since you're not, it's not relevant.