I've been working with a customer on and off for some time with an apparent DNS issue which I am not seeing anywhere else.
The customer is running a split-DNS configuration which (for historical reasons) is our preferred method - we feel that having split DNS running solves more issues than it introduces and we continue to run split DNS on our own MFE appliance.
The primary symptom is the Firewall slowing to a crawl (to the point where it can take upwards of 30 seconds to log in via SSH), and running the top command shows all three load average values exceeding 20. Normally I would expect these values to be, at most, 4.
Rebooting the internally-hosted Windows DNS servers offers temporary respite, with the load averages on the Firewall dropping back down to an acceptable level. Within a few minutes, however, the values begin to climb again until the system is virtually non-responsive.
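One way to correlate the load spikes with the DNS server restarts would be to record the load averages at regular intervals. The following is only a minimal Python sketch, assuming the text printed by uptime or top can be captured from the Firewall (for example over SSH) and parsed on another machine. The function names are hypothetical, and the threshold of 4 is simply the expected ceiling mentioned above.

```python
import re

# Matches the "load average(s): a, b, c" portion printed by uptime/top.
# Both the "load average:" and "load averages:" spellings are accepted,
# with or without commas between the three values.
_LOAD_RE = re.compile(r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)")


def parse_load_averages(line):
    """Return the 1, 5 and 15 minute load averages as floats, or None."""
    match = _LOAD_RE.search(line)
    if match is None:
        return None
    return tuple(float(value) for value in match.groups())


def is_overloaded(loads, threshold=4.0):
    """True if any of the load averages exceeds the threshold.

    The default of 4.0 is an assumption taken from the expected ceiling
    described in the post, not a vendor-recommended value.
    """
    return any(value > threshold for value in loads)
```

Feeding each captured line through parse_load_averages and logging the result with a timestamp would give a record to compare against the times the internal DNS servers were rebooted or reintroduced.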
When the customer first reported this he was running 8.3.0, and we noticed that his HA cluster was reporting the availability of 8.3.1. The release notes for this patch listed a fix for the bind process, so applying it seemed like a no-brainer. I recommended that he install the update on his two appliances, which he duly did. Later that day he reported that everything seemed to have calmed down following installation, and he was happy this was probably the end of the issue.
Fast forward a few months to the end of last week and he called reporting identical symptoms. Again, taking one or more of his internal DNS servers down for a period of time seemed to allow the Firewall to recover, but re-introducing them would see the load average on the cluster start to climb once again.
While I was on the phone to another customer I received a message to say that this particular customer had called in again. When I returned the call he said it was about the same issue, but in the time it took me to get back to him he had decided to switch the DNS configuration of his 8.3.1 cluster from split DNS servers to transparent mode, and the problem had once again gone away.
Sure enough, I've just logged in and everything looks sunny again. However, I am concerned that rather than fixing the problem he may have just moved it somewhere else.
Can anyone offer any thoughts on what might have been causing the original issue?