Why do some features affect performance so much?

Version 1

    When looking at the performance data for firewalls (not just UTM devices), you will notice that claimed throughput figures vary greatly depending on which features are turned on. Differences of a factor of 10 or more in throughput are not unusual. In fact, the 'extra features consume resources and will thus slow things down' rule - aka 'no free lunch' - is pretty much universal.

     

    This document explains why this is so by providing some intermediate to low-level technical background on how specific forms of inspection in networking equipment work. It is hoped this knowledge will allow people using the equipment to reason better about the likely performance impact of different configurations and traffic shapes. Will my device still perform OK if I add 20 VoIP channels to my installation? Can it survive an ISP pipe upgrade from 20 to 50Mbps? Can it handle another 50 support-staff-level users? What about 20 extra accounting staff?

     

    Most firewalls are multi-function devices which can be configured to perform varying amounts of inspection on the traffic that passes through them. Checking the source or destination address of traffic and only allowing certain ones through is about the simplest inspection that can be performed. At the most complex, firewalls can do a complete protocol-standards adherence check (thus having to inspect every packet header) and also look at every data byte in a stream to verify that the desired security policy is being adhered to. That is: is this really HTTP traffic? What site is being accessed (don't allow sports sites, for example)? Is the data returned of reasonable size (bandwidth limiting) and acceptable content (no malicious or inappropriate material)?

     

    In general, the more inspection is being done, the more resources (CPU/memory/disk-IO etc.) are required to perform the task. Since these resources are limited, performance factors such as speed (max throughput, max packet rate) and latency will be affected by the amount of inspection that is configured. There are exceptions to this rule - but it's still the safest starting point when thinking about these matters.

     

    While this may appear to be stating the bleedingly obvious, it gets interesting when one considers that the effects on performance vary for each feature, as well as by traffic shape and by accelerator technology, in non-uniform ways. So a change in average packet size (additional VoIP channels might reduce the overall average packet size through a device) may have no impact if feature A is turned on - but might for feature B. Perhaps because feature A operates in a manner where its resource consumption is not a function of packet size - or at least packet size is not the dominant factor. Or because, unlike the other model in the product range, feature A on the fancy model has an accelerator which changes how its performance relates to packet size. Thus, plotting performance/latency/packets per second etc. against packet size (say) will yield differently shaped graphs, depending on which combinations of features are turned on and what accelerators are available. The shapes of these graphs are crucial in predicting performance.

     

    Finding the answer to 'how will my device be affected by feature A and traffic-shape change X' is thus just a simple matter of constructing a graph containing the appropriate combination of factors and reading off the result. Unfortunately, vendors are reluctant to publish that much information, for a number of (good) reasons. The trick, then, is to take the limited amount of information that is published, combine it with one's knowledge of how 'stuff really works', and predict fairly well what will likely happen. Not a substitute for testing one's assumptions, of course - but it certainly ought to narrow the field of scenarios one has to test down to a few, or perhaps just one. "I think it should work like _this_ - oh hey, look, it does - I'm done."

     

    Of course there are a lot of features out there - and since one has to consider combinatorial effects as well - it could get ugly. Which is one reason why performance graphs are not a practical way of dealing with the problem: one would have to publish too many. A performance calculator can help there, but to do its job well it has to model a very complicated environment, requiring much data to be entered into it. With some background knowledge, it's often possible to determine the dominant performance-affecting features and make a good educated guess/calculation with much less effort. And of course it helps to sanity-check ye olde calculator's output.

     

    Fortunately, there are some basic computer architecture principles that, once understood, allow one to reason about most features' likely performance impact. Or to put it another way: there are only so many different shapes of graphs out there. Pick the patterns and you'll be able to figure out quite a lot.

     

    The simplest firewall feature - and the firewall's primary function - is blocking traffic by port or address. Straightforward, you might think: a packet arrives, look at which port/address it wants to go to, compare against the block rules, and if it matches one, just drop the packet. However, this is not in fact how most firewalls do it. Firewalls tend to take advantage of the fact that most communication occurs in packet flows. That is, the source and destination addresses and the source and destination ports in UDP and TCP (and most other similar protocols) tend to remain static for many packets going back and forth - an HTTP or HTTPS session, for example. There are ways of quickly (cheaply) figuring out that packets are related (hash tables and similar). If one is only interested in checking addresses and ports, a much faster way of processing rules is to ask:

    - is the packet part of an allowed flow? If yes, let it through immediately, with no other checking done

    - if not, check the rules. If the packet is OK, create a flow and let it through (glossing over details)

    The 'check the rules' operation is hugely more expensive than 'is the packet part of a flow'. Usually.
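
     

    To make this concrete, below is a minimal sketch in Go of the fast-path/slow-path split just described. All the names (FlowKey, Rule and so on) are made up for illustration - no vendor's firewall looks quite like this - but the shape of the logic is the point: one cheap hash lookup for packets on established flows, an expensive rule scan only for the first packet of each flow.

        // Minimal sketch of the flow-table fast path. Illustrative only.
        package main

        import "fmt"

        // FlowKey is the classic 5-tuple that identifies a flow.
        type FlowKey struct {
                SrcIP, DstIP     string
                SrcPort, DstPort uint16
                Proto            uint8
        }

        // Rule is a simplified allow rule matched on address and port.
        type Rule struct {
                DstIP   string
                DstPort uint16
        }

        type Firewall struct {
                flows map[FlowKey]bool // established, already-allowed flows
                rules []Rule           // the (potentially long) rule list
        }

        // Process returns true if the packet may pass.
        func (fw *Firewall) Process(k FlowKey) bool {
                // Fast path: one hash lookup, cost independent of rule count.
                if fw.flows[k] {
                        return true
                }
                // Slow path: scan the rules; cost grows with rule count.
                for _, r := range fw.rules {
                        if r.DstIP == k.DstIP && r.DstPort == k.DstPort {
                                fw.flows[k] = true // later packets take the fast path
                                return true
                        }
                }
                return false
        }

        func main() {
                fw := &Firewall{
                        flows: map[FlowKey]bool{},
                        rules: []Rule{{DstIP: "10.0.0.5", DstPort: 443}},
                }
                pkt := FlowKey{"192.168.1.10", "10.0.0.5", 40000, 443, 6}
                fmt.Println(fw.Process(pkt)) // first packet: rule scan, flow created
                fmt.Println(fw.Process(pkt)) // subsequent packets: hash lookup only
        }

    The rule scan here is a naive linear search purely for clarity - real implementations use smarter structures - but the asymmetry between the two paths is the point, and it is why per-flow state exists at all.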

     

    The above has some interesting side-effects. The rules are only checked on the first packet - i.e. once per flow. It thus follows that varying the number of rules in the rule-set will affect the number of connections per second that the firewall can process - but not the number of packets per second. Or rather, it will only affect the packet rate where connections contain very few packets each - i.e. where packets/second is not that different from sessions/second, as with lots of very small HTTP requests.

     

    This trick/optimization only works where such correlating assumptions can be made, of course. If one were instead to use a string match on the contents of packets to look for certain words (say), then every packet would have to be inspected, and now the number of these rules would affect the overall packet rate - but not the sessions-per-second rate, unless sessions only contain a few packets. This apparent reversal (sessions vs. pps) comes about because of the dominance of one cost over the other in the respective cases: inspecting each packet slows things down so much that creating new sessions becomes a comparatively cheap operation, one that no longer varies much with the number of rules in force.
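
     

    A back-of-envelope cost model makes the reversal visible. The cost constants below are entirely invented; what matters is the structure: in the flow-based case the per-rule cost is amortized over all packets of a flow, while in the content-matching case it is paid on every packet.

        // Toy cost model for the two regimes. All constants are made up.
        package main

        import "fmt"

        const (
                flowLookupCost = 1.0  // per packet: hash lookup, independent of rules
                ruleCheckCost  = 5.0  // per rule, paid once per new flow
                stringScanCost = 50.0 // per rule, paid on *every* packet (content match)
        )

        func main() {
                packetsPerFlow := 100.0
                for _, rules := range []float64{10, 100, 1000} {
                        // Regime 1: address/port rules, checked once per flow.
                        perPkt1 := flowLookupCost + (rules*ruleCheckCost)/packetsPerFlow
                        // Regime 2: content rules, every packet scanned against every rule.
                        perPkt2 := flowLookupCost + rules*stringScanCost
                        fmt.Printf("%5.0f rules: flow-based %6.1f units/pkt, per-packet %9.1f units/pkt\n",
                                rules, perPkt1, perPkt2)
                }
        }

    Run it and the flow-based cost barely moves as the rule count grows a hundredfold, while the content-matching cost grows right along with it - the reversal described above.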

     

    Being clever with algorithms, then, can make a big difference to performance - but one has to be careful about which aspect of performance is affected. Hardware accelerators can also make a big difference. They tend to reduce the cost of an operation to either a constant amount or at least a much reduced one - sometimes at the price of added latency (to talk to the accelerator and get results back).
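
     

    As a toy model of that trade-off (again, the numbers are invented, only the trend is real): an accelerator turns a cost that grows with the rule count into a roughly constant one, but every packet pays a fixed round trip to get there.

        // Toy model of accelerator economics: invented numbers, real trend.
        package main

        import "fmt"

        func main() {
                const (
                        cpuCostPerRule = 50.0 // software scan cost grows with rule count
                        accelFixedCost = 20.0 // accelerator cost is constant per packet
                        accelLatencyUS = 30.0 // but each packet pays a round-trip latency
                )
                for _, rules := range []float64{10, 100, 1000} {
                        fmt.Printf("%5.0f rules: CPU %7.0f units/pkt, accel %4.0f units/pkt (+%.0fus latency)\n",
                                rules, rules*cpuCostPerRule, accelFixedCost, accelLatencyUS)
                }
        }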

     

    So far we have covered the effects on performance of how many and what type of packets we are inspecting. The second major factor is where these inspections are carried out. Most firewalls are based on operating systems that use a two-tiered architecture to provide reliability and security services. The kernel talks to the hardware and implements a number of services, so that processes running in user space can make simplifying assumptions about the world and be protected from treading on each other's toes - i.e. one process crashing does not necessarily lead to a crash of the whole system. Relevant topics here are memory management units (MMUs), role-based access control and related operating system concepts.

     

    For performance considerations this boils down to some simple rules of thumb:

    - doing things in the kernel is fast, while doing them in user space is slow (usually)

    - doing anything complicated in the kernel is hard/risky/impossible; it often has to be done in user space

     

    User-space data processing tends to be slow because data usually has to be copied to and from it. Research on zero-copy access to network data is under way - but it's not all that easily available and mainstream yet.
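
     

    The rough per-packet picture, as a sketch with illustrative (not measured) constants: a user-space path pays a fixed boundary-crossing overhead plus a copy cost proportional to packet size - which also hints at why average packet size matters so much.

        // Toy per-packet cost model: in-kernel vs user-space processing.
        // Constants are illustrative guesses, not measurements.
        package main

        import "fmt"

        func main() {
                const (
                        kernelPerPktNS = 500.0  // in-kernel handling per packet
                        crossingNS     = 2000.0 // syscalls + context switches per packet
                        copyNSPerByte  = 0.25   // copying payload to user space and back
                )
                for _, size := range []float64{64, 512, 1500} {
                        kern := kernelPerPktNS
                        user := kernelPerPktNS + crossingNS + 2*size*copyNSPerByte
                        fmt.Printf("%4.0fB packets: kernel %6.0f ns/pkt (%4.2f Mpps), user %6.0f ns/pkt (%4.2f Mpps)\n",
                                size, kern, 1e3/kern, user, 1e3/user)
                }
        }

    With numbers of this rough shape, the user-space path is several times slower per packet, and the gap is worst for small packets, where the fixed crossing cost dominates.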

     

    This leads to a lose-lose scenario for performance. The more sophisticated and complex an algorithm we want to apply to scan our network data, the slower it will be - that's just the nature of complexity. What's worse, we will likely also have to run it in user space, and that costs extra.

     

    Indeed, compared to the initial port/address checking, it's really a quadruple whammy. A deep-inspection algorithm that verifies protocol compliance and examines the content of the data for intrusion or malware has to look at every packet (whammy 1) with a complex algorithm (whammy 2) which references a large data set of patterns (whammy 3) in user space (whammy 4).

     

    Now, using accelerator technology such as crypto or pattern-match accelerators tends to happen on hardware cards and thus in the kernel, which makes things a lot better again: three of our four whammies are either completely eliminated or at least vastly reduced. That is, if our protocol is amenable to kernel implementation. IPsec is, but SSL-VPN, for example, isn't. SSL-VPN works on TCP sockets, which are hard to terminate in the kernel - meaning you have to run the data into a user process and back out again. If you also need to decrypt/encrypt it on the way, the data has to go back up into the kernel, to the crypto accelerator, and all the way back again. So if you ever wondered why SSL-VPN boxes are so slow for their size - there you are.

     

    IPsec will always run rings around SSL-VPN: it uses far less memory and CPU and is designed to work in the kernel. It's a shame that it is extremely hard to configure, that its routing is completely misguided, and that interoperability between vendors is a huge challenge. Then again, SSL-VPN has no interoperability at all, since there are no SSL-VPN standards. Once configured, IPsec can and will be better than SSL-VPN in every single respect. It's a sad indictment of the IPsec protocol suite, then, that SSL-VPN exists at all. Everything SSL-VPN does should be easier to do with IPsec - except the opposite is the case. So now we'll spend billions making a brick (SSL-VPN) fly, because certain people were unable to see that hard-to-use security is worthless at a population level. No, IKEv2 doesn't really fix that either.
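
     

    The data-path difference can be made concrete by simply counting the kernel/user boundary copies each design implies per packet. The step lists below are simplified assumptions, not a trace of any particular product.

        // Rough comparison of per-packet kernel/user copies on the two VPN
        // data paths described above. Simplified; real stacks vary.
        package main

        import "fmt"

        type step struct {
                desc       string
                userCopies int // kernel<->user data copies this step implies
        }

        func show(name string, path []step) {
                total := 0
                fmt.Println(name + ":")
                for _, s := range path {
                        fmt.Printf("  %-55s copies: %d\n", s.desc, s.userCopies)
                        total += s.userCopies
                }
                fmt.Printf("  total kernel<->user copies per packet: %d\n\n", total)
        }

        func main() {
                ipsec := []step{
                        {"NIC -> kernel IP stack", 0},
                        {"kernel -> crypto accelerator -> kernel", 0},
                        {"kernel IP stack -> NIC", 0},
                }
                sslvpn := []step{
                        {"NIC -> kernel TCP stack", 0},
                        {"kernel TCP stack -> user-space SSL daemon", 1},
                        {"user space -> kernel crypto accelerator -> user space", 2},
                        {"user-space daemon -> kernel TCP stack", 1},
                        {"kernel TCP stack -> NIC", 0},
                }
                show("IPsec (stays in the kernel)", ipsec)
                show("SSL-VPN (TCP terminated in user space)", sslvpn)
        }

    Zero copies versus four per packet - before any actual inspection or encryption work has even started.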

     

    Which is why accelerator technology is such a hot topic in development-team circles - but accelerators do not solve everything. The bad news is that many features have no handy accelerator technology available. Also, many accelerator technologies don't work all that well for certain data patterns, or are disproportionately expensive. A few of the pattern-matching silicon (ASIC/FPGA etc.) companies are finding that their silicon isn't any better than just using one of the cores of the multi-core CPUs that are increasingly the norm even in smaller embedded devices.

     

    Conclusion:

    Predicting performance accurately is complicated and difficult - however, the above gives rise to a few rules of thumb that make it fairly straightforward to determine the likely (order-of-magnitude) impact a feature (on vs. off) will have on performance:

    - will this feature have to look at all the packets, or just a few?

    - is this likely running in the kernel, or in user space?

    - is there a large database of signatures or similar involved or not?

    - is there an accelerator, and if so, is it called from user space or from the kernel for this feature?

    - is the traffic mostly large or small packets?

     

    Each time the answer to one of the above is 'yes', expect a significant impact on performance - probably in the vicinity of a factor of 2-3. So, as an example, if one moves from simple packet filtering (kernel, first packet only) to full intrusion prevention (every packet, user space, large database), that's somewhere between 2*2*2 and 3*3*3 in performance hit - i.e. a factor of roughly 8-27. So a firewall's 500Mbps large-packet throughput could become something like 20-60Mbps when IPS is turned on - unless the vendor has done something special. This is a very, very rough guess indeed - but then that's the point of the exercise: to get a gut feel for what will likely happen.
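
     

    The whole rule-of-thumb procedure fits in a few lines of code. Here it is as a crude calculator - each 'yes' answer multiplies the expected slowdown by somewhere between 2 and 3. Order-of-magnitude guesswork by design.

        // The rules of thumb as a crude calculator. Each 'yes' multiplies the
        // slowdown by 2x (optimistic) to 3x (pessimistic). Pure guesswork.
        package main

        import "fmt"

        // estimate returns a low/high throughput guess given a baseline and
        // the number of 'yes' answers to the checklist above.
        func estimate(baseMbps float64, yesAnswers int) (lo, hi float64) {
                low, high := 1.0, 1.0
                for i := 0; i < yesAnswers; i++ {
                        low *= 2
                        high *= 3
                }
                return baseMbps / high, baseMbps / low
        }

        func main() {
                // Example from the text: 500Mbps firewall baseline, IPS turned on.
                // Three 'yes' answers: every packet, user space, large signature set.
                lo, hi := estimate(500, 3)
                fmt.Printf("expected IPS throughput: roughly %.0f-%.0f Mbps\n", lo, hi)
        }

    For the 500Mbps example this prints roughly 19-62 Mbps - exactly as precise as a gut-feel method deserves to be.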

     

    These rules of thumb, combined with vendor statistics, will hopefully allow you to do some basic extrapolation of the likely effect of an environmental or feature change on your devices.