7 Replies Latest reply on Jul 30, 2014 2:29 AM by asabban

    Load balancing issues with 7.3.2.10

    malware-alerts

      Running 2 clustered appliances v7.3.2.10 in transparent router mode. (WG5000 models)

       

      We started sending users to the MWG (roughly 3500 users) and everything is fine except for the fact that the master appliance doesn't send any scan requests to the slave box.

       

      When we start sending more load to MWG (we did a test with 10000 users), the primary box hits its limit (CPU 100% used by the Antimalware engine, and users get the 'Antimalware engine overload' block page) and absolutely no requests are sent to the secondary box.

       

      We already have a case opened and support has confirmed that the load balancing is not happening properly in our setup and escalated the case to development.

       

      I'm wondering if anyone else has noticed this? Is this a problem specific to 7.3.2.10? (In which case we'd simply downgrade to 7.3.2.9 for the time being.)

        • 1. Re: Load balancing issues with 7.3.2.10
          asabban

          Hello,

           

          I have seen this issue multiple times when

           

          a.) The "port forwarding" is not correctly set up. For transparent modes you set a forward like "80,443 => 9090" to tell MWG which packets to intercept. If that forward is not added correctly (a reboot is probably required), traffic will not be shared.

          b.) The load balancing uses client IP persistence. If there is a downstream proxy, NAT or similar device in front of MWG, all traffic will come from a single IP address. In that case the load is not shared across the cluster.

          c.) The cluster configuration is not OK and the "master" only sees itself as an active node to accept traffic. The "slave" node must detect itself as "up & running", otherwise it will not see any traffic. This can be checked with "mfend-lb -s" on the command line.
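For point b.), a quick way to see how many distinct client addresses actually reach the cluster is simply to count them. This is only a sketch: it assumes you have already extracted one source IP per line into a file (the file name sources.txt below is hypothetical, as is however you produce it, for example from an access log or a packet capture). The function names are made up for illustration.

```shell
# Sketch only. Input on stdin: one source IP address per line (assumed format).

# Print the number of distinct source addresses.
count_distinct_sources() {
    sort -u | wc -l | tr -d ' '
}

# Print "<requests> <source>" for every address, busiest first.
requests_per_source() {
    awk '{ hits[$1]++ } END { for (ip in hits) print hits[ip], ip }' | sort -rn
}
```

Run it as `count_distinct_sources < sources.txt`. If the answer is only one or two addresses, source-IP persistence has very little to spread across the nodes.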

           

          Best,

          Andre

          • 2. Re: Load balancing issues with 7.3.2.10
            malware-alerts

            Andre,

             

            Thanks for your suggestions.

             

            As a transition, we have the MWG scanning traffic that comes transparently out of 2 ISA proxy servers. (We have a project that is just starting to move the users from ISA directly to MWG, but it is targeted to end in early 2015 as we will migrate users slowly.)

             

            The users hit an ACE LB before going to the ISA servers, so traffic is fairly equal between the 2 ISA servers (this is confirmed by the stats for number of requests and volume transferred).

             

            The dataflow goes something like this:

             

            Workstations  >> ACE-LB >> ISA Proxies (Default gateway points to MWG VIP) >> MWG Cluster >> Internet

             

            The answer we got from support was basically that having only the 2 ISA servers sending requests to the MWG cluster is not enough for load balancing to happen properly, since only 2 distinct IPs are hitting the cluster.

             

            We're kinda stuck between a rock and a hard place right now. We need to replace our old EWS boxes (on hardware that is no longer supported), which were set up in transparent mode. When we reviewed the design with our McAfee rep and product specialist, we were told MWG could replace the EWS as-is for the time being, until we complete our proxy migration from ISA to MWG (target is end-of-year).

             

            As it stands now, support is telling us that the only way to make the MWG cluster load balance properly is to install either the ICAP or the chaining plugin on the ISA servers, but because of administrative constraints we cannot introduce any changes to the ISA configuration during the transition period.

             

            We're basically looking at our options for making this work (load balanced or not) without changing anything on the ISA.

             

            I was thinking either:

             

            • Setting the default gateway of each ISA server to a specific MWG node, in order to systematically send only half the load to each MWG
            • Tweaking the ruleset (whitelisting the most heavily used sites, for example) and disabling some features in MWG (Anti-Malware, for example) in order to reduce the workload of the MWG master while keeping a minimum of security (at this point we'd be more or less doing exactly what EWS does), hopefully reducing the load enough that 1 MWG can handle all users.

             

            I'll be reaching out to our rep tomorrow to get an idea of how much load a single WG5000 appliance can handle. Ideally it would have to be able to handle about 600-700 reqs/sec on average (peaks up to 1000 reqs/sec) without completely choking.

             

            Thanks.

             

            Message was edited by: malware-alerts on 7/27/14 9:33:02 PM CDT
            • 3. Re: Load balancing issues with 7.3.2.10
              asabban

              Hello,

               

              Unfortunately I have hit that problem a few times in the past already. The stickiness to the source IP only is what causes the problems here. Also, MWG is not lightning-fast in deciding which node should handle which IP address. If you have only two IP addresses, I assume requests are coming in from both of them at almost the same time, and MWG will most likely use only one node to filter them, as it has not yet collected any information about the load distribution before the next request comes in.

               

              So yes, this is a bit of a problem. Honestly I would not touch the ISA servers, and in my opinion it does not make much sense to set up yet another layer of configuration which could have other side effects.

               

              Actually there might be an option which could help you, but as far as I know it is not officially supported, which means that in case of problems support might not be able or willing to help you. Also we have to check if this could work for you at all.

               

              You can actually tell the network driver to not apply source IP stickiness, which means requests will be shared across the available MWG nodes in a round robin fashion. This will allow all MWG nodes to be utilized, however sessions may also be distributed across multiple nodes. This means a user might open a web site with various objects in it and some objects are fetched via Proxy1 while others are fetched via Proxy2. This makes it hard to troubleshoot issues like missing objects or sporadic connection problems, as you have no clue where your user was sent to.
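To make the difference between the two behaviours concrete, here is a toy model in shell. It is only an illustration and not how the network driver is actually implemented: the "sticky" function pins each source IP to a node in first-seen order, the other one hands out requests in turn. It does not reproduce the timing effect described above for two near-simultaneous sources; it only shows the single-source case. Function names and the "nodeN" labels are made up.

```shell
# Toy model only, not the MWG driver's real algorithm.
# Input on stdin: one source IP per line (one line per request).
# $1: number of nodes in the cluster.

# Sticky: each source IP is pinned to one node the first time it is seen.
simulate_sticky() {
    awk -v nodes="$1" '
        BEGIN { next_node = 0 }
        !($1 in pinned) { pinned[$1] = next_node; next_node = (next_node + 1) % nodes }
        { count[pinned[$1]]++ }
        END { for (n = 0; n < nodes; n++) printf "node%d %d\n", n, count[n] + 0 }
    '
}

# Round robin: requests are handed to the nodes in turn, ignoring the source IP.
simulate_round_robin() {
    awk -v nodes="$1" '
        { count[(NR - 1) % nodes]++ }
        END { for (n = 0; n < nodes; n++) printf "node%d %d\n", n, count[n] + 0 }
    '
}
```

Feeding four requests from one address into `simulate_sticky 2` puts all four on node0, while `simulate_round_robin 2` gives two to each node. That is also why a single page load can end up split across proxies once stickiness is off.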

               

              Additionally - in theory! - this might cause trouble with some applications which require session stickiness. For example, an online banking application might drop you if it notices that requests for one authenticated session come from different sources. In the end you are most likely NATting all MWGs behind one public IP address, so applications on the internet will most likely not be able to notice that you came via different proxies, but in theory there are headers added (Via, X-Forwarded-For, etc.) which MIGHT cause trouble.

               

              I have been running such a setup at a customer with a similar problem for some time and all went well. But I can fully understand if you do not want to take any risk here.

               

              If you are interested I can try to get more details on what we did at that customer and how to set those options. If this is not acceptable, I think your ideas are a good approach. You could combine "cheap" checks with "expensive" checks, e.g. only apply Media Type Filters and Anti-Malware rule sets when the URL has a Medium (or lower) reputation and/or is not a member of categories you like.
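The decision behind the "cheap before expensive" idea can be written down in a few lines. This is only a sketch of the logic, not MWG rule syntax (rule sets are built in the UI). The reputation labels "good", "medium" and "bad" and the function name are assumptions made for the example, and the "and/or" is read conservatively: the expensive checks are skipped only when both conditions look fine.

```shell
# Sketch only: real MWG rule sets are built in the UI, not in shell.
# $1: reputation label, assumed here to be one of "good", "medium", "bad"
# $2: "yes" if the URL is in a category you trust, anything else otherwise
# Returns 0 (true) when the expensive checks should run.
needs_expensive_checks() {
    if [ "$1" = "good" ] && [ "$2" = "yes" ]; then
        return 1    # trusted on both counts: cheap checks only
    fi
    return 0        # everything else gets Media Type and Anti-Malware filtering
}
```

For example, `needs_expensive_checks medium yes` succeeds (scan it), while `needs_expensive_checks good yes` fails (skip the costly rule sets).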

               

              As far as I understand this would only be a temporary limitation until your users access MWG "directly", is that true? You could also use an MWG with a "limited" policy as upstream for the ISA servers, while users you have migrated to MWG directly get the "full" policy. This has the advantage that you won't have to enable AV for all users at once when you complete the migration. In my opinion it is helpful to push users into the new policy with more security step by step, as questions may show up.

               

              Additional ideas may be to ask your McAfee rep if there is any chance to "borrow" an additional box for a limited time and/or run some VMs as the license allows unlimited physical/virtual machines as far as I know.

               

              I hope we can get you out of this situation smoothly.

               

              Best,

              Andre

               

              • 4. Re: Load balancing issues with 7.3.2.10
                malware-alerts

                Andre,

                 

                Thanks again for such great details.

                 

                I would love to have more details on how you configured the network driver to work in a round-robin fashion. In our setup, I cannot touch the ISA servers much to test anything, but fortunately I've got test MWG boxes that I can play with at will, so I'd be more than willing to test it out.

                 

                We already NAT the IPs externally and remove all Via or X-Forwarded headers from the requests, so I don't think session stickiness would be a problem.

                 

                This would only be temporary and would be reverted back to the original configuration once we have migrated enough users directly on the MWG so that load-balancing can happen in a normal fashion (I would guess that with about half the users migrated from ISA to MWG we would be able to return to the original configuration).

                 

                 

                 

                As a plan 'B' I will be working on a tweaked ruleset to avoid costly checks for sites with good reputations as you mentioned.

                 

                Thanks again for your help, this is greatly appreciated.

                 

                Message was edited by: malware-alerts on 7/28/14 1:29:32 PM CDT
                • 5. Re: Load balancing issues with 7.3.2.10
                  asabban

                  Hello,

                   

                  With the command "cat /proc/wsnat/status" you can see all the configuration details and statistics of the network driver. There is an option called "lbpersist", which is the part that remembers IP addresses. You can temporarily set this to 0 by running:

                   

                  echo "lbpersist 0" > /proc/wsnat/status

                   

                  That should be done on all the director nodes. You can verify the change by running "cat /proc/wsnat/status" again. The value should now be 0.

                   

                  You can also monitor some statistics here, for example "cat /proc/wsnat/status | grep i.http".
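The set, verify and monitor steps above can be wrapped into two small helpers, so that the same thing is run on every director node. This is a minimal sketch, assuming it runs as root on the node itself. The function names and the WSNAT_STATUS override are made up for illustration; the path, the "lbpersist 0" write and the grep pattern are the ones from the post. The exact layout of the status output is not shown in this thread, so the helpers just print whatever lines match.

```shell
# Path from the post above; override WSNAT_STATUS to try this on a scratch file.
WSNAT_STATUS="${WSNAT_STATUS:-/proc/wsnat/status}"

# Turn off source-IP persistence, then print what the driver reports for it.
disable_lb_persistence() {
    echo "lbpersist 0" > "$WSNAT_STATUS" || return 1
    grep lbpersist "$WSNAT_STATUS"
}

# Print the HTTP-related counters (same pattern as in the post).
show_http_counters() {
    grep 'i.http' "$WSNAT_STATUS"
}
```

Calling `show_http_counters` a few times on each node while traffic is flowing should show the counters moving on more than one machine. Keep in mind that this setting was described above as not officially supported.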

                   

                  If you monitor all instances now you should see that the traffic no longer hits only one machine. The load will probably not be perfectly balanced, but you should see a better distribution across the nodes. If all works fine you should make the change permanent by adding the command to /etc/rc.local.
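For the permanent part, a small helper that appends the command to rc.local only if it is not already there avoids duplicate entries when it is run more than once. Again a sketch only: the function name and the RC_LOCAL override are made up for illustration, and it assumes rc.local is executed at boot on your appliance. If your rc.local ends with an "exit 0" line, the command has to go above it, which this sketch does not handle.

```shell
# Override RC_LOCAL to try this on a scratch file first.
RC_LOCAL="${RC_LOCAL:-/etc/rc.local}"
PERSIST_LINE='echo "lbpersist 0" > /proc/wsnat/status'

# Append the command to rc.local unless that exact line is already present.
persist_lb_setting() {
    # -x: match the whole line, -F: treat the pattern as a fixed string.
    if ! grep -qxF "$PERSIST_LINE" "$RC_LOCAL" 2>/dev/null; then
        printf '%s\n' "$PERSIST_LINE" >> "$RC_LOCAL"
    fi
}
```

To revert later (for example once enough users have been migrated), remove that line from rc.local on every director node.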

                   

                  Let me know how that works.

                   

                  Best,

                  Andre

                  • 6. Re: Load balancing issues with 7.3.2.10
                    malware-alerts

                    Is a restart of the networking services necessary following this change or is it dynamically applied?

                    • 7. Re: Load balancing issues with 7.3.2.10
                      asabban

                      I don't think there is a restart required. In my lab I just executed the commands and load started to distribute across both nodes.

                       

                      Best,

                      Andre