8 Replies Latest reply: Feb 1, 2013 8:29 AM by dmease729

    Global update appears to be sending out wakeup-calls when superagent wake-up calls expire

    dmease729

      Hi,

       

      We have global updating enabled, and 3 superagent repositories running on selected VLANs.  I noted a couple of days ago that there appeared to be something happening that does not appear to be documented.  The following server tasks were seen in the server task log:

       

      Update Master Repository task 18:09:46 completed in 5 minutes
      Global Update replication task 18:15:37 completed in 3 minutes (replication to all superagent repositories)
      Wake Up Agents task 18:18:44 failed (3/3 calls expired)

       

      The 'Wake Up Agents' task was expected as part of the normal global updating process; what followed was not.  After the SuperAgent wake-up calls expired, individual wake-up calls were seen being initiated to every single system in the ePO system tree, and during this time manual wake-up calls and attempted agent deployments were failing (I am assuming this is due to the backlog of wake-up calls).

       

      My question is: the wake-up calls that appear to have been sent out to every system after the SuperAgent wake-up calls expired do not appear to be documented.  Is this expected, and is it some form of fallback?  I suspect it is not meant to happen, as per page 190 of the ePO 4.6 product guide:

      "If the agent does not receive the broadcast for any reason, such as when the client computer is turned off, or there are no SuperAgents, at the next ASCI, the minimum catalog version is supplied, which starts the process."

       

      and indeed from page 191:

      "A SuperAgent is installed on each broadcast segment. Managed systems cannot receive a SuperAgent wake-up call if there is no SuperAgent on the same broadcast segment. Global updating uses the SuperAgent wake-up call to alert agents that new updates are available."

       

      This would seem to indicate that a) servers on subnets not covered by a SuperAgent will pick up updates after the next ASCI, and b) (loosely interpreted) if they do not receive a wake-up call from a SuperAgent on their subnet (which is what happens when the SuperAgent wake-up call expires) then, again, they will pick up updates after the next ASCI.  Neither case would require the ePO server to start sending individual wake-up calls to all systems in the system tree.

       

      cheers,

        • 1. Re: Global update appears to be sending out wakeup-calls when superagent wake-up calls expire
          JoeBidgood

          That's certainly not the behaviour I would expect. Is this new? As in, has it only recently started?

           

          Thanks -

           

          Joe

          • 2. Re: Global update appears to be sending out wakeup-calls when superagent wake-up calls expire
            dmease729

            Hi Joe,

             

            I thought that would be the answer, thought I was going mental!

             

            Yes, it has only recently started, and only came to light when connectivity to all of the SuperAgent hosts happened to be down at the same time.  The SA repo replication worked fine, but the SuperAgent wake-up calls expired.  It did not happen the following day when the SuperAgent wake-up calls succeeded.

            • 3. Re: Global update appears to be sending out wakeup-calls when superagent wake-up calls expire
              JoeBidgood

              Very strange... I'll have to see if I can reproduce it in a test environment. I'll update the thread with what I find.

               

              Regards -

               

              Joe

              • 4. Re: Global update appears to be sending out wakeup-calls when superagent wake-up calls expire
                dmease729

                Cheers Joe :-)

                 

                Am trying to get some time free at the moment so I can do the same - will keep you updated if I get freed up

                • 5. Re: Global update appears to be sending out wakeup-calls when superagent wake-up calls expire
                  dmease729

                  Issue presented again - Very rough notes follow, and I will hopefully get back with more soon...

                   

                  server.log starts at 20130131052816
                  searching for 'wake' results in the first logged wakeup call at 20130131222520
                  Note: I had restarted 3 main ePO services at 22:22.

                  The 'Wake up agents' task that should wake up the SuperAgents (after successful global update replication) started at 18:41:07, and all calls expired (task failed).  Nothing for this task was seen in server.log.
                  After this, multiple individual wakeup calls fail, but none of these are seen in the log either, so it looks as though the system isn't even attempting to send wakeup calls.


                  Coming back to when I first noted this issue: I was investigating failed agent deployments (went to deploy the agent from ePO, but framepkg was not being delivered.  Confirmed at the time that browsing to C$ and admin$ worked with the credentials I was using).  I also noted that manually attempted wakeup calls were failing.
                  I noted that netstat showed a number of what appeared to be stale wakeup calls to 2 systems.

                   

                  So I am guessing the root cause of this is the stale wakeup calls.  I am also guessing that this is somehow stopping the SuperAgent wakeup calls from being sent out, which then leads to the weird issue that I am seeing with multiple individual wakeup calls.  There may be 2 distinct issues here.  I will investigate the stale wakeup calls shown in netstat when they appear again (I have just recently restarted services) and see what I can find...

                  • 6. Re: Global update appears to be sending out wakeup-calls when superagent wake-up calls expire
                    ekim

                    Just out of curiosity, are you sure that ePO sent an agent wake-up call to all clients in all segments, or just to the segment where the ePO server resides? As far as I am aware, with the Global Update task the ePO server itself sends an agent wake-up call into its own segment......

                     

                     

                    • 7. Re: Global update appears to be sending out wakeup-calls when superagent wake-up calls expire
                      dmease729

                      Good point, I had honestly never thought of that before!  From my observations, however:

                       

                      - When the 'Wake up agents' task (note the 's' as this is the superagent wakeup call task) is successful, no other wake up calls are seen in log straight after this event

                      - When it is not successful (the superagent wakeup calls expire, in this case), neither this nor the individual wakeup calls shown in the server task log are mentioned in the server.log

                      - When clicking the individual wakeup call events in the server task log, if they have failed, the entry does not list the system that it is trying to wake up, so I am unable to confirm your suspicions

                       

                      It's an interesting one, and from what I have seen so far, it will happen again on this ePO server.  I have output from the last couple of times, so I am going to see if I can determine a root cause for what I am seeing.

                       

                      Message was edited by: dmease729 - further clarification on first point in bullet list (..."in log straight after this event") on 01/02/13 05:27:10 CST
                      • 8. Re: Global update appears to be sending out wakeup-calls when superagent wake-up calls expire
                        dmease729

                        Further troubleshooting so far (rough notes):

                         

                        - (Something) is causing frameworkservice on 2 managed systems to become unresponsive.  Netstat on those systems shows a high (100+) number of established wakeup call connections, which does not tally with the number shown in the netstat output on ePO (20-30)
                        - Windows event logs on 'problem' systems show "A timeout (30000 milliseconds) was reached while waiting for a transaction response from the McAfeeFramework service."
                        - Although the wakeup call tallies do not match, there are still a number of outstanding established wakeup call connections to the two problem systems on the ePO server.  As these calls are outstanding, I am guessing that further wakeup calls (even SuperAgent wakeup calls?) are not sent.  Also I have noted that in this situation, agent deployments fail (access and deployment credentials are fine)
                        - As the SuperAgent wakeup calls fail (expire), we then run into what I have been seeing.

                        So... I believe the root cause is associated with the endpoints.  Not sure if this can be labbed easily (unless the root cause actually leads on to another issue which *can* be labbed)...

                         

                        Further details:


                        From netstat on ePO server, I have 26 lots of the below, where PID 27844 is Apache (expected):
                          TCP    <EPO SERVER>:<HIGH PORT>    <PROBLEM SYSTEM>:8081       ESTABLISHED     27844
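For anyone wanting to repeat the tally on their own server, here is a rough Python sketch of counting ESTABLISHED connections to the agent wake-up port, grouped by remote host and PID. It assumes the text comes from Windows `netstat -ano` (five whitespace-separated columns) and that the wake-up port is 8081 as in the output above. The function name and the sample addresses are made up for illustration.

```python
from collections import Counter

def count_established(netstat_text, port=8081):
    """Count ESTABLISHED TCP connections per (remote host, PID) for one remote port."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed column layout: Proto, Local Address, Foreign Address, State, PID
        if len(fields) != 5 or fields[0] != "TCP" or fields[3] != "ESTABLISHED":
            continue
        remote_host, _, remote_port = fields[2].rpartition(":")
        if remote_port == str(port):
            counts[(remote_host, fields[4])] += 1
    return counts

# Hypothetical sample; the addresses are placeholders, not taken from the thread.
SAMPLE = """
  TCP    10.0.0.5:50123    10.0.1.20:8081    ESTABLISHED     27844
  TCP    10.0.0.5:50124    10.0.1.20:8081    ESTABLISHED     27844
  TCP    10.0.0.5:50125    10.0.1.21:8081    TIME_WAIT       0
  TCP    10.0.0.5:443      10.0.2.9:51000    ESTABLISHED     27844
"""

for (host, pid), n in count_established(SAMPLE).most_common():
    print(host, pid, n)
```

Running the same count on the managed system would need the local and foreign columns swapped, since there port 8081 is on the local side.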

                         

                        From server.log on ePO server:

                         

                        #17068:

                        20130201034719 I #17068 NAIMSRV  Wake up agent on DNS name <ANOTHER SYSTEM1>...
                        20130201034719 I #17068 MCUPLOAD Successfully disabled CA trust options.
                        20130201035019 I #17068 NAIMSRV  Wake up agent on DNS name <ANOTHER SYSTEM1>...
                        20130201035019 I #17068 MCUPLOAD Successfully disabled CA trust options.
                        20130201035129 I #17068 NAIMSRV  Wake up agent on DNS name <PROBLEM SYSTEM>...
                        ...nothing else.  Latest timestamp in log 20130201115903

                         

                        #14928:

                        20130201041120 I #14928 NAIMSRV  Wake up agent on DNS name <ANOTHER SYSTEM2>...
                        20130201041121 I #14928 MCUPLOAD Successfully disabled CA trust options.
                        20130201041420 I #14928 NAIMSRV  Wake up agent on DNS name <ANOTHER SYSTEM2>...
                        20130201041421 I #14928 MCUPLOAD Successfully disabled CA trust options.
                        20130201041530 I #14928 NAIMSRV  Wake up agent on DNS name <ANOTHER SYSTEM3>...
                        20130201041531 I #14928 MCUPLOAD Successfully disabled CA trust options.
                        20130201041531 I #14928 NAIMSRV  Wake up agent on DNS name <ANOTHER SYSTEM4>...
                        20130201041531 I #14928 MCUPLOAD Successfully disabled CA trust options.
                        20130201041720 I #14928 NAIMSRV  Wake up agent on DNS name <ANOTHER SYSTEM2>...
                        20130201041721 I #14928 MCUPLOAD Successfully disabled CA trust options.
                        ...more systems with the same until...
                        20130201042731 I #14928 NAIMSRV  Wake up agent on DNS name <PROBLEM SYSTEM>...
                        ...nothing else.  Latest timestamp in log 20130201115903
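The pattern in both threads is that the last line logged is a "Wake up agent on DNS name" entry with nothing following it. A rough Python sketch for finding such threads in server.log is below. It assumes the line layout shown above (14-digit timestamp, level, #thread, component, message). The function name and the 10-minute idle threshold are my own choices, not anything from ePO.

```python
import re
from datetime import datetime, timedelta

# Assumed server.log layout: YYYYMMDDHHMMSS LEVEL #THREAD COMPONENT MESSAGE
LINE_RE = re.compile(r"^(\d{14}) (\w) #(\d+) (\S+)\s+(.*)$")
WAKE_PREFIX = "Wake up agent on DNS name "

def find_stalled_wakeups(log_lines, min_idle=timedelta(minutes=10)):
    """Return threads whose final log entry is a wake-up call that was never followed up."""
    last_by_thread = {}
    latest = None
    for raw in log_lines:
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # skip commentary or lines in another format
        stamp = datetime.strptime(m.group(1), "%Y%m%d%H%M%S")
        last_by_thread[m.group(3)] = (stamp, m.group(5))
        if latest is None or stamp > latest:
            latest = stamp
    stalled = []
    for thread, (stamp, message) in last_by_thread.items():
        # A thread counts as stalled if it went quiet right after starting a wake-up call
        if message.startswith(WAKE_PREFIX) and latest - stamp >= min_idle:
            target = message[len(WAKE_PREFIX):].rstrip(".")
            stalled.append((thread, target, stamp, latest - stamp))
    return sorted(stalled, key=lambda item: item[2])
```

Usage would be something like `find_stalled_wakeups(open("server.log", errors="replace"))`, which returns (thread, target, last timestamp, idle time) tuples, oldest first.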


                        From Agent_<PROBLEM SYSTEM>.log on managed system:

                        2013-02-01 03:43:24.294 I #2152 Agent Sending the next batch of immediate events
                        2013-02-01 03:43:24.294 i #2152 Agent Agent is looking for events to upload
                        2013-02-01 03:43:24.294 I #2152 Agent Agent did not find any events to upload
                        2013-02-01 03:48:24.282 I #2152 Agent Sending the next batch of immediate events
                        2013-02-01 03:48:24.282 i #2152 Agent Agent is looking for events to upload
                        2013-02-01 03:48:24.282 I #2152 Agent Agent did not find any events to upload
                        2013-02-01 03:53:24.271 I #2152 Agent Sending the next batch of immediate events
                        2013-02-01 03:53:24.271 i #2152 Agent Agent is looking for events to upload
                        2013-02-01 03:53:24.271 I #2152 Agent Agent did not find any events to upload
                        2013-02-01 03:58:24.259 I #2152 Agent Sending the next batch of immediate events
                        2013-02-01 03:58:24.259 i #2152 Agent Agent is looking for events to upload
                        2013-02-01 03:58:24.259 I #2152 Agent Agent did not find any events to upload

                        2013-02-01 04:23:24.201 I #2152 Agent Sending the next batch of immediate events
                        2013-02-01 04:23:24.201 i #2152 Agent Agent is looking for events to upload
                        2013-02-01 04:23:24.201 I #2152 Agent Agent did not find any events to upload
                        2013-02-01 04:28:24.190 I #2152 Agent Sending the next batch of immediate events
                        2013-02-01 04:28:24.190 i #2152 Agent Agent is looking for events to upload
                        2013-02-01 04:28:24.190 I #2152 Agent Agent did not find any events to upload
                        2013-02-01 04:33:24.178 I #2152 Agent Sending the next batch of immediate events
                        2013-02-01 04:33:24.178 i #2152 Agent Agent is looking for events to upload
                        2013-02-01 04:33:24.178 I #2152 Agent Agent did not find any events to upload

                         

                        Note that the same is seen going back and forward in the agent logs - there doesn't appear to be anything else, until I scroll all the way back to 18th January, where the agent log looks normal.
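Rather than scrolling, the point where the agent log last looked normal can be found with a short script. This is a rough Python sketch that assumes the agent log layout shown above and treats the three repeating event-upload messages as the only "heartbeat" lines. The function name is made up.

```python
import re

# Assumed agent log layout: YYYY-MM-DD HH:MM:SS.mmm LEVEL #THREAD COMPONENT MESSAGE
AGENT_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) (\w) #(\d+) (\S+) (.*)$"
)

# The three messages that repeat every 5 minutes in the excerpt above
HEARTBEAT_MESSAGES = {
    "Sending the next batch of immediate events",
    "Agent is looking for events to upload",
    "Agent did not find any events to upload",
}

def last_non_heartbeat(log_lines):
    """Return (timestamp, message) of the last entry that is not an event-upload heartbeat."""
    found = None
    for raw in log_lines:
        m = AGENT_RE.match(raw.strip())
        if not m:
            continue  # skip blank or unparseable lines
        if m.group(5) not in HEARTBEAT_MESSAGES:
            found = (m.group(1), m.group(5))
    return found
```

A result of `None` means the whole file contains nothing but the repeating event-upload lines.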

                         

                        I have noted that there is a discrepancy between the established wakeup calls showing on the ePO server, and those showing on the problem server:

                          TCP    <PROBLEM SYSTEM>:8081       <EPO SERVER>:65343    ESTABLISHED     1188
                        (x103, where PID 1188 is Framework Service, as expected)
                        No other specific port has a high 'ESTABLISHED' count...

                         

                        Windows event log on problem system shows the occasional:
                        "A timeout (30000 milliseconds) was reached while waiting for a transaction response from the McAfeeFramework service."

                         

                        The system has not been polling in to ePO since the agent logs started going a little weird (above)