2 Replies Latest reply on Oct 8, 2013 10:56 AM by PhilM

    Possible DNS problem causing significant performance issues

    PhilM

      I've been working with a customer on and off for some time with an apparent DNS issue which I am not seeing anywhere else.

       

      The customer is running a split-DNS configuration which (for historical reasons) is our preferred method - we feel that having split DNS running solves more issues than it introduces and we continue to run split DNS on our own MFE appliance.

       

      The primary symptom is the Firewall slowing down to a crawl (to the point where it can take upwards of 30 seconds to log in via SSH), and running the top command shows all three load average values exceeding 20. Normally I would expect these values to be, at most, 4.

       

      Rebooting the internally-hosted Windows DNS servers offers temporary respite, with the load averages on the Firewall dropping back down to an acceptable level. But within a few minutes the values begin to climb again, to the point where the system is virtually non-responsive.

       

      When the customer first reported this they were running 8.3.0 and we noticed that their HA cluster was reporting the availability of 8.3.1. When I read the release notes/readme for this patch and discovered that it contained a fix for the bind process, it seemed like a no-brainer. So I recommended that he install this update on his two appliances, which he duly did. Later that day he reported that following installation everything seemed to have calmed down and he was happy this was probably the end of the issue.

       

      Fast forward a few months to the end of last week and he called reporting identical symptoms. Again, taking one or more of his internal DNS servers down for a period of time seemed to allow the Firewall to recover, but re-introducing them would see the load average on the cluster start to climb once again.

       

      While I was on the phone to another customer I received a message to say that this particular customer had called in again. When I returned the call he said it was about the same issue, but in the time it took me to get back to him he had decided to switch the DNS configuration of his 8.3.1 cluster from split DNS to transparent mode, and the problem had once again gone away.

       

      Sure enough, I've just logged in and everything looks sunny again. However, I am concerned that rather than fixing the problem he may have just moved it somewhere else.

       

      Can anyone offer any thoughts on what might have been causing the original issue?

       

      -Phil.

        • 1. Re: Possible DNS problem causing significant performance issues
          sliedl

          That's a tough one to guess on.  We'd have to troubleshoot it while it's happening.  Perhaps there were large zone transfers going on?  Perhaps some kind of DNS attack from the inside?  It sounds like the firewall was being overwhelmed by the amount of traffic it was receiving.

          • 2. Re: Possible DNS problem causing significant performance issues
            PhilM

            I agree it is a difficult one.

             

            Given the delicate nature of 'uptime', I completely understood why my customer was more interested in making the problem go away than in understanding its cause: the problem resulted in a loss of DNS resolution, and that in turn pretty much brought everything to a halt.

             

            Looking at the audit, there didn't appear to be an abnormal amount of audited DNS traffic. In a previous encounter, another customer had a machine which had been compromised by one of the then-popular nasties which flooded the gateway with ICMP traffic; when you loaded the audit viewer it pretty much scrolled continuously with audit records pertaining to the outbound ping rule. In this case, if I were to offer an objective opinion, the audit didn't suggest the same was happening with DNS.

             

            All I could see was that the load average values from the 'top' command were much higher than I'd ever expect them to be, at 20+ and only getting larger (I was originally trained on Sidewinder 5.x by Todd Ferweda and I still have written on my training notes "Load average >4 = BAD"), and that the DNSp service was taking up considerably more CPU% than I would expect it to. I can't remember the exact figure off the top of my head, but it was in the 30-40% bracket.

             

            The rest is as per my original post.

             

            Given the hunch that switching to transparent mode may have simply 'moved' the problem, I have warned the customer that if one or more of his internal DNS servers has been compromised and is performing some kind of DNS-based poisoning attack or DoS attempt, he may find his public IP becoming blacklisted by external reputation services.

             

            Given the system is now running in transparent mode, is there anything you can suggest (tcpdump or other CLI tool) which could show him if one of his internal hosts had become an attack source?

             

            -Phil.
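
One rough way to approach the question above, sketched under assumptions rather than offered as a tested procedure for this appliance: if tcpdump is available and its usual one-line text output (for example from `tcpdump -nn udp port 53`, run on whichever interface faces the internal network) has been saved to a file, a short script can tally how many packets each internal host sent to port 53. The function names and the sample lines below are hypothetical, and the pattern only handles IPv4 lines in the standard `IP src.port > dst.port:` layout.

```python
import re
from collections import Counter

# Matches tcpdump text lines whose destination port is 53, e.g.
# "12:00:00.000001 IP 192.168.1.10.53211 > 10.0.0.1.53: 1234+ A? example.com. (29)"
# Group 1 captures the source IPv4 address. Replies (source port 53) do not match.
PORT53_RE = re.compile(
    r"\bIP (\d{1,3}(?:\.\d{1,3}){3})\.\d+ > \d{1,3}(?:\.\d{1,3}){3}\.53: "
)


def count_port53_packets(lines):
    """Count packets sent to port 53, keyed by source IPv4 address."""
    counts = Counter()
    for line in lines:
        match = PORT53_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def top_talkers(lines, limit=10):
    """Return the busiest sources as (address, packet_count) pairs."""
    return count_port53_packets(lines).most_common(limit)


# Hypothetical sample lines in tcpdump's text format, for illustration only.
SAMPLE = [
    "12:00:00.000001 IP 192.168.1.10.53211 > 10.0.0.1.53: 1234+ A? example.com. (29)",
    "12:00:00.000002 IP 10.0.0.1.53 > 192.168.1.10.53211: 1234 1/0/0 A 192.0.2.1 (45)",
    "12:00:00.000003 IP 192.168.1.10.53212 > 10.0.0.1.53: 1235+ AAAA? example.com. (29)",
    "12:00:00.000004 IP 192.168.1.20.40000 > 10.0.0.1.53: 99+ A? example.org. (29)",
]
```

Passing an open capture file to `top_talkers` would list the heaviest senders first. A single host far ahead of the rest would be a candidate for closer inspection, though a high count on its own does not prove a compromise: a busy internal DNS server forwarding queries would also rank highly.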