14 Replies Latest reply on Mar 19, 2010 11:41 AM by JoeBidgood

    SA/DR - Network Traffic Problems

    runcmd

      Our network administrator has indicated that two Super Agent Distributed Repositories (at two different remote sites) have gone "bonkers" and started pumping out tons of traffic.  He had users at the sites shut them down and traffic at those sites has since returned to normal.  I'm trying to figure out what went wrong.  The problem was reported shortly after my replication task ended.  The Server Task Log shows one error on the replication and I'm uncertain as to whether it is associated with this issue:

       

      1ABDB6C8-C10D-4B93-98EA-1592D1B22818 2010-01-06 01:00:02.61 ERROR Failed to connect, error 10061 ( No connection could be made because the target machine actively refused it. )

       

      I've never seen an SA identified by its GUID in the log before.  That GUID is associated with one of my DRs--but that SA/DR is not at one of the two sites that were having problems.  Also, the SA/DR hostname associated with that GUID appears later on in the log as completing without errors.

       

      I didn't think that SAs really "pushed" any traffic out but, rather, that clients talked to them and pulled data from them.  Regardless, my suspicion is that something occurred to hinder or block traffic to one or more of the other repositories and clients at other sites started hammering these SAs as the next closest DR (as I have clients determining the closest repository by ping response time).  My question is: Does the SA/DR maintain logs of which clients pulled updates from them?  If so, I'd like to look at them and see if clients from other sites were trying to update from these two SA/DRs at the time the problem was reported.

       

      Any other ideas on this issue would also be welcome and appreciated.  Thank you!

        • 1. Re: SA/DR - Network Traffic Problems
          JoeBidgood
          I didn't think that SAs really "pushed" any traffic out but, rather, that clients talked to them and pulled data from them.

           

          That's correct - there's no traffic initiated by the SAs (apart from a wakeup call to the local subnet if you have global updating enabled, but that certainly shouldn't constitute a lot of traffic.)

           

          My question is: Does the SA/DR maintain logs of which clients pulled updates from them?  If so, I'd like to look at them and see if clients from other sites were trying to update from these two SA/DRs at the time the problem was reported.

           

          Any other ideas on this issue would also be welcome and appreciated.  Thank you!

           

          I don't think the SAs log this - I don't have one immediately available to check - but the client machines will log the repo they are using: it might be easier to check some of those, if you suspect that they are talking to the wrong repo.

           

          Regards -

           

          Joe

          • 2. Re: SA/DR - Network Traffic Problems
            runcmd

            We do not use Global Updating.

             

            The problem is, we have several thousand clients and 20+ repositories.  Trying to find clients that updated from those two repositories could be like trying to find a needle in a haystack because I'd have to check the individual logs on each client.  It would be easier to try to replicate the issue by bringing a problem SA/DR back up with a packet sniffer running.  I just want to avoid bringing that site to its knees again.

             

            How granular do clients get with their ping response times?...  If two or more SA/DRs respond with the same number of milliseconds on a ping, how does the client decide which of those DRs to use?

             

            Thanks!

            • 3. Re: SA/DR - Network Traffic Problems
              runcmd

              On the SA/DRs, I've noticed that the "Agent_[hostname].log" in the "C:\Documents and Settings\All Users\Application Data\McAfee\Common Framework\DB" folder has a lot of entries like this, with different IPs and ports...

               

              yyyy-mm-dd hh:mm:ss I #1828 LstnSvr 'HEAD' request received from Host: 10.0.0.2:1625
              yyyy-mm-dd hh:mm:ss I #1812 LstnSvr CAsyncSocket::DoAccept for event: FD_ACCEPT
              yyyy-mm-dd hh:mm:ss I #1824 LstnSvr 'GET' request received from Host: 10.0.0.2:1626
              yyyy-mm-dd hh:mm:ss I #1812 LstnSvr CAsyncSocket::DoAccept for event: FD_ACCEPT

               

              My initial guess was that this log provided the IP addresses and source ports of clients pulling updates because the same log on my regular MAs doesn't appear to contain these entries.  However, if I force an AutoUpdate on a VSE client and then look at this log on the SA/DR that it updated against, the client's IP address does not always appear in this log.  The log on the SA is big, with only ~30 minutes worth of data and 8000+ lines.
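To make a log of that size easier to read, the request lines can be tallied by client IP. The following is a minimal Python sketch that assumes the log lines look exactly like the excerpt above; the function name `count_requests_by_client` is made up for illustration, and since not every updating client seems to show up in this log, the counts should be treated as a rough indication only.

```python
import re
from collections import Counter

# Matches request lines in the format shown in the excerpt, e.g.
# "... LstnSvr 'GET' request received from Host: 10.0.0.2:1626"
REQUEST_RE = re.compile(
    r"LstnSvr '(?P<method>[A-Z]+)' request received from Host: "
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d+)"
)


def count_requests_by_client(lines):
    """Return a Counter mapping (client IP, HTTP method) to request count."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            counts[(match.group("ip"), match.group("method"))] += 1
    return counts


# The four lines from the excerpt; only two of them are request lines.
sample = [
    "yyyy-mm-dd hh:mm:ss I #1828 LstnSvr 'HEAD' request received from Host: 10.0.0.2:1625",
    "yyyy-mm-dd hh:mm:ss I #1812 LstnSvr CAsyncSocket::DoAccept for event: FD_ACCEPT",
    "yyyy-mm-dd hh:mm:ss I #1824 LstnSvr 'GET' request received from Host: 10.0.0.2:1626",
    "yyyy-mm-dd hh:mm:ss I #1812 LstnSvr CAsyncSocket::DoAccept for event: FD_ACCEPT",
]

for (ip, method), n in count_requests_by_client(sample).most_common():
    print(ip, method, n)
```

To run it against a real log, pass an open file object instead of `sample`, for example `open(path, errors="replace")`, where `path` points at the Agent log on the SA/DR.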

               

              At this point, my clients do not appear to be selecting their DR by ping response time correctly: a machine just updated from a DR with a 5ms response time even though the computer sitting next to it is a DR with a 1ms response time.  I know the SiteList.xml contains the list of repositories, but does it contain any information on the order of the repositories and/or their ping response times for that client?  I'm getting close to just opening a ticket with support.

              • 4. Re: SA/DR - Network Traffic Problems

                I'm taking a shot in the dark here, but I want to mention it because something somewhat similar happened to us. We too chose to use the option that selected the repository by ping time... here's the problem, and frankly I think McAfee did an absolutely horrible job documenting this.

                 

                The ping time option only checks the first three repositories in the list, and it picks those three by which subnets are numerically closest to the client's. So if you have two subnets that are 10.10 and 10.11, McAfee is going to use that subnet as one of its three over a subnet with a 10.254 address, despite the fact that the 10.254 could be sitting in the next rack! In our case, we have VLANs and subnets all over the country, and they're not arranged in a region-specific order. So 10.10 might be in Texas, 10.11 might be in California, and 10.254 might be on the next floor in Texas. Because that 10.11 is numerically closer to the 10.10 address, ePO counts it as a closer repository because of the numeric subnet values, not because it's actually faster.

                 

                I hope that makes sense. What ended up happening in our case was we had clients pulling from all over the place because of this stupid design decision. We have since gone back and forced certain groups in the system tree to pull from specified repositories. This may or may not be the case for you, but it wouldn't surprise me if it was.

                • 5. Re: SA/DR - Network Traffic Problems
                  JoeBidgood

                  As you're on ePO 4.5, you can query the DB for which repo the machines updated from. Create a new Events query and select Client Events. In the Columns section, make sure you have Site Name selected - this is the name of the repo. Filter on event ID 2401 or 2402 (which means successful or failed update task) and that should give you the repos in question.
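If the query results are exported, a few lines of Python can total the update events per repository. This is only a sketch: it assumes a CSV export with columns named `Event ID` and `Site Name`, and the sample rows, system names, and repository names below are invented for illustration. Adjust the column names to whatever your export actually contains.

```python
import csv
import io
from collections import Counter

# Update task events (successful or failed), per the post above.
UPDATE_EVENT_IDS = {"2401", "2402"}


def updates_per_repo(csv_file, site_col="Site Name", event_col="Event ID"):
    """Count update events per repository in an exported query result.

    The column names are assumptions about the export format.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        if (row.get(event_col) or "").strip() in UPDATE_EVENT_IDS:
            site = (row.get(site_col) or "").strip() or "(unknown)"
            counts[site] += 1
    return counts


# Invented sample rows; the last one is an unrelated event and is ignored.
SAMPLE_EXPORT = (
    "System Name,Event ID,Site Name\n"
    "PC-001,2401,DR-SITE-A\n"
    "PC-002,2402,DR-SITE-A\n"
    "PC-003,2401,DR-SITE-B\n"
    "PC-004,1024,DR-SITE-C\n"
)

for site, n in updates_per_repo(io.StringIO(SAMPLE_EXPORT)).most_common():
    print(site, n)
```

A repository with far more update events than its site has clients would be a hint that machines from other sites are pulling from it.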

                   

                  HTH -

                   

                  Regards,

                   

                  Joe

                  • 6. Re: SA/DR - Network Traffic Problems
                    runcmd

                    Mindcrime,
                    I think you hit the nail on the head.  I spent most of the day, yesterday, researching this and then finally opened a case with support...

                     

                    If you look in the "McAfee ePolicy Orchestrator 4.5 Product Guide", it states that the ping time option for DR selection "sends an ICMP ping to the closest five repositories (based on subnet value) and sorts them by response time" (p.187) and for subnet distance it "compares the IP addresses of client systems and all repositories and sorts repositories based on how closely the bits match. The more closely the IP addresses resemble each other, the higher in the list the repository is placed" (p.188).  Unless you have architected your network in a manner that subnets sequentially coincide with distance, neither seems to be a very good method of determining distance to closest repository.  How silly of me to believe that "ping time" meant ping response time alone and that "subnet distance" meant number of hops in a trace route--as neither is truly the case.
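To see why both options can pick a distant repository, here is a small Python sketch of the two rules as the Product Guide wording describes them. It is an interpretation of the quoted text, not McAfee's actual implementation; the function names, the addresses, and the response times are all made up, and the shortlist is cut to two candidates (the guide says five) only to keep the example short.

```python
import ipaddress


def matching_prefix_bits(a, b):
    """Number of leading bits two IPv4 addresses have in common."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()


def order_by_subnet_distance(client_ip, repo_ips):
    """Sort repositories so the closest bit match to the client comes first."""
    return sorted(
        repo_ips,
        key=lambda repo: matching_prefix_bits(client_ip, repo),
        reverse=True,
    )


def order_by_ping_time(client_ip, rtt_ms, candidates=5):
    """Shortlist the closest repos by bit match, then sort those by RTT.

    rtt_ms maps repository IP -> response time in ms (hypothetical values).
    """
    shortlist = order_by_subnet_distance(client_ip, list(rtt_ms))[:candidates]
    return sorted(shortlist, key=lambda repo: rtt_ms[repo])


client = "10.10.1.50"
rtt_ms = {
    "10.11.1.10": 40.0,   # numerically close subnet, far away
    "10.12.1.10": 55.0,   # numerically close subnet, far away
    "10.254.1.10": 1.0,   # numerically distant subnet, next floor
}

print(order_by_subnet_distance(client, list(rtt_ms)))
print(order_by_ping_time(client, rtt_ms, candidates=2))
```

With a shortlist of two, the 1 ms repository on 10.254 is never pinged at all, because it loses on bit match before response time is considered. The same thing would happen with the documented shortlist of five once more than five repositories sit on numerically closer subnets.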

                     

                    The support representative I spoke with indicated that he has seen a lot of problems with both the ping and subnet distance methods of identifying the closest repository.  His recommendation was to group the clients at each site and apply specific policies to each which uniquely identify the closest repositories using an ordered list.  This would seem to be a step backwards--especially considering laptops can hop all over the network.  At this time, I've disabled all of my distributed repositories and have everyone pulling updates directly from the ePO server until I can organize a crap load of groups & policies or engineer some other kind of solution.

                     

                    Thanks for the response!

                    • 7. Re: SA/DR - Network Traffic Problems
                      runcmd

                      FEATURE REQUEST:  It would be nice if SuperAgent Distributed Repositories could be configured to send out a broadcast message to tell clients to update from that DR.  Because broadcasts (typically) won't travel to another subnet, you'd know that the DR is at least on the same subnet as the clients and, with the exception of VLANs, the DR should at least be in close proximity to the clients.  In the event that you have multiple subnets in close proximity (for example, perhaps each floor of a facility is a different subnet), then you can investigate routing that broadcast traffic to other subnets / broadcast domains.  The SA broadcast interval should be configurable at ePO by policy.

                      • 8. Re: SA/DR - Network Traffic Problems
                        JoeBidgood
                        ... "subnet distance" meant number of hops in a trace route--as neither is truly the case.

                         

                        Just as an aside, this *is* what it means for MA 4.5 - the subnet distance calculation now counts the hops (and you can set a maximum number of hops, if required.)

                         

                        Bear in mind though that any method for arbitrarily determining the nearest repo is never going to be 100% accurate all the time. If you absolutely, positively, 100% *must* guarantee that a particular machine only talks to a specific repository, then a user-defined repository list set in the agent policy is the only way to ensure this.

                         

                        Regards -

                         

                        Joe

                        • 9. Re: SA/DR - Network Traffic Problems

                          I wouldn't call choosing a repository based on ping distance exactly "arbitrary". It was a poor design choice to word the options the way they did, considering how they were implemented, and it has obviously led to confusion.
