10 Replies Latest reply on Feb 5, 2014 10:49 AM by JoeBidgood

    EPO 4.66 Server Pegging

    awbattelle

      EPO 4.66

      Windows 2008 R2

      4 Processors all pegging

      6 Gigs RAM

      Lots of HDD space

      Separate SQL 2005 Server cluster. Health and size good.

      The EPO server is virtual running on a VMware ESX cluster.

       

      All 4 processors are pegging. It seems to be caused by Tomcat.

       

      Tomcat5.exe is running at 74-96% of the processor and pegging all 4 processors. I spent all morning with McAfee yesterday. The issue seems to have developed over the weekend. We had been running rogue sensors on a couple of subnets for testing for a couple of weeks with no problems. We began to see an issue with systems failing to communicate with EPO on Friday afternoon on some of our east coast machines. We did deploy the Rogue Sensors to the enterprise on Sunday night (only to desktops found to be on at night). On Monday morning the server was pegging. Might this have something to do with the rogue sensor deployment? I've tried KB79321 and KB52973 to no avail, although I only tried the workaround for KB52973 and not the upgrade to EPO 4.67. I did read the 4.67 docs, and they do not seem to address anything like my issue.

      Here is a sample of the server log. Anything jump out at anyone? It looks normal doesn't it?

       

      20140204065021          I          #08884          NAIMSRV           Received [MsgRequest] from ADA170:{20D7CFAD-CB18-4C95-8162-0BA1C988ECFB}

      20140204065021          E          #02256          NAIMSRV           Failed to find IP address for ACG428.aero.org

      20140204065021          I          #02256          NAIMSRV           Wake up agent on NetBIOS name ACG428...

      20140204065021          I          #08884          NAIMSRV           Signing agent response package with key lZ+sflZkliUFkW1eWHUb/kkek5NHZxc6TZrXX+zoT1g=

      20140204065021          I          #07280          NAIMSRV           Received [PropsVersion] from ACE024:{6E702F12-920B-4643-9D28-093996ACC81B}

      20140204065021          I          #07280          NAIMSRV           Processing agent props for ACE024(6E702F12-920B-4643-9D28-093996ACC81B)

      20140204065021          I          #08040          NAIMSRV           Received [MsgUpload] from ACN305:{CBF21363-F51E-4D82-A691-202A21271086}

      20140204065021          I          #07028          NAIMSRV           Received [MsgUpload] from ACU740:{AA034745-60E3-4150-9ECB-28D7C57BD54D}

      20140204065022          I          #08388          NAIMSRV           Received [MsgUpload] from ADA460:{66E3AE2E-79E8-467B-9ACE-2EEF5879F1C2}

      20140204065022          I          #08808          MCUPLOAD          Successfully disabled CA trust options.

      20140204065022          I          #03476          NAIMSRV           Received [PropsVersion] from ACQ940:{646543D5-5520-46A1-AC7D-621B11E4514E}

      20140204065022          I          #03476          NAIMSRV           Processing agent props for ACQ940(646543D5-5520-46A1-AC7D-621B11E4514E)

      20140204065022          I          #03476          NAIMSRV           Sending props response for agent ACQ940, agent has up-to-date policy

      20140204065023          I          #08680          NAIMSRV           Wake up agent on DNS name ACU806.aero.org...

      20140204065023          I          #09004          MCUPLOAD          Successfully disabled CA trust options.
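For anyone who wants to scan a longer excerpt than the one above, here is a minimal Python sketch that splits each entry into fields and counts entries per severity and module. The field layout (14-digit timestamp, one-letter level, `#thread`, module, message) is only inferred from the sample lines; it is not taken from McAfee documentation, and the function names are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Layout inferred from the sample: timestamp, level, #thread, module, message.
LINE_RE = re.compile(r"^\s*(\d{14})\s+([A-Z])\s+#(\d+)\s+(\S+)\s+(.*)$")


def parse_line(line):
    """Return a dict of fields, or None if the line does not fit the assumed layout."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    stamp, level, thread, module, message = m.groups()
    return {
        "time": datetime.strptime(stamp, "%Y%m%d%H%M%S"),
        "level": level,
        "thread": int(thread),
        "module": module,
        "message": message.strip(),
    }


def summarize(lines):
    """Count entries per (level, module) pair, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None:
            counts[(entry["level"], entry["module"])] += 1
    return counts


# Three lines copied from the sample above.
sample = [
    "20140204065021          E          #02256          NAIMSRV           Failed to find IP address for ACG428.aero.org",
    "20140204065022          I          #08808          MCUPLOAD          Successfully disabled CA trust options.",
    "20140204065023          I          #08680          NAIMSRV           Wake up agent on DNS name ACU806.aero.org...",
]

for key, n in sorted(summarize(sample).items()):
    print(key, n)
```

Run against a full log file, a summary like this would make a sudden rise in `E` entries for one module easier to spot than reading the raw lines.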

        • 1. Re: EPO 4.66 Server Pegging
          JoeBidgood

          How many sensors did you deploy?

           

          Thanks -

           

          Joe

          • 2. Re: EPO 4.66 Server Pegging
            awbattelle

            We have 307 subnets across the country. According to our sensor status console, we have 549 passive sensors and 31 active. The subnet status says only 36 subnets are covered out of 307 possible subnets.

            Now, thinking that the sensors might have had something to do with the issue, I did change the Rogue Sensor policy from Enabled to Disabled. However, the sensor console still indicates 31 active sensors. I am tempted to send a command to remove all sensors.

            • 3. Re: EPO 4.66 Server Pegging
              JoeBidgood

              It's certainly possible that this is sensor-related. Deploying a large number of sensors can cause a one-off performance spike on the server - the first time a sensor communicates, the server has to generate a pair of security keys, and this is a CPU-intensive task. If you multiply that by a large number of sensors then the effect can be significant. Once the initial key generation is done, though, the CPU load is removed.

               

              Do you know how many machines the sensor was deployed to (as opposed to the current number of active and passive sensors)? That will determine the size of the problem.

               

              Essentially there are two approaches: you can either "grin and bear it" and wait for all the sensors to communicate, or you could remove them and then deploy them more gradually, spreading the load of key generation so that it does not affect other operations.

               

              Thanks -

               

              Joe

              • 4. Re: EPO 4.66 Server Pegging
                awbattelle

                We set up an automatic task to deploy sensors to any Desktop PC with 1 hour or less check in times running Windows 7, and we ran this job in the middle of the night on Saturday night at 11:30 PM. Now, I'm looking at the server task log, and see that sensors were deployed successfully at about 2 systems per minute from 11:30 till 7:30 AM so that is about 960 systems, then at 7:53 I get

                Terminated: Rogue Detector Installer

                Then I get a bunch of errors till the task terminates at about 8AM.

                So I guess the short answer is about 1000 systems.

                 

                So I'm wondering though if the system would continue to try to install the sensors on the systems that errored out during this job, as there seem to be about 1000 of them.

                So, I am thinking about just selecting all desktop systems and telling them to remove the rogue sensor, and see if this will stop the pegging. Then we can revisit sensor deployment.

                 

                Thing is, though, we started getting reports of systems failing to update on Friday afternoon, BEFORE the big sensor push Saturday night. At that time, we had been testing the rogue sensor system on about 6 subnets with nominal results. So I'm not entirely convinced this is the problem, and my strategy at this point is to try to isolate the cause. I've already turned off all tasks except for daily DAT updates and weekly scans.

                 

                Message was edited by: awbattelle on 2/4/14 12:16:06 PM CST
                • 5. Re: EPO 4.66 Server Pegging
                  JoeBidgood

                  The ePO server maxing the CPU shouldn't cause client machines to fail to update, unless they're updating from the master repo on the ePO server itself?

                   

                  Personally I think removing the sensors, and then staggering their rollout after the server has returned to normal, would be a wise move. There's no real way to predict how long this situation may last, so the grin and bear it approach may not be realistic.

                   

                  HTH -

                   

                  Joe

                  • 6. Re: EPO 4.66 Server Pegging
                    awbattelle

                    Well, not all, but many of our clients do update directly from the master EPO server. Of course we have super agents scattered throughout the network. When the issue was at its worst, clients were logging this:

                     

                    2014-02-03 06:52:45.457 I #1312 naInet failed to receive package..server is busy

                     

                    and EPO was logging this repeatedly

                    20140204092817 E #08820 mod_epo  Server is too busy (245 connections) to process request

                     

                    (today it is less frequent, as we told clients to check in every 120 minutes instead of 30.)

                     

                    So, it seems that the server was too busy processing its maximum allowable connections to take care of any clients that might check in.

                    So we had our server team beef up our main EPO server as much as they could. They added two cores and an extra 2 gigs of memory. So now it's a little better.

                    The last time we got this message was;

                    20140204094019 E #08824 mod_epo  Server is too busy (245 connections) to process request

                    which, I gather, means 02/04/2014 at 9:40 and 19 seconds.

                    And this entry seems much less constant than it was yesterday, when it was almost continuous.
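To put a number on "less constant", a small Python sketch like the one below could count the "Server is too busy" entries per hour. It assumes the timestamp is `yyyymmddhhmmss` and that the entry always looks like the two lines quoted in this post; `busy_per_hour` is a made-up helper name, not part of any McAfee tool.

```python
import re
from collections import Counter
from datetime import datetime

# Pattern inferred from the two mod_epo lines quoted in this post.
BUSY_RE = re.compile(
    r"^\s*(\d{14})\s+E\s+#\d+\s+mod_epo\s+Server is too busy \((\d+) connections\)"
)


def busy_per_hour(lines):
    """Count 'Server is too busy' entries, bucketed by calendar hour."""
    per_hour = Counter()
    for line in lines:
        m = BUSY_RE.match(line)
        if m:
            when = datetime.strptime(m.group(1), "%Y%m%d%H%M%S")
            per_hour[when.strftime("%Y-%m-%d %H:00")] += 1
    return per_hour


lines = [
    "20140204092817 E #08820 mod_epo  Server is too busy (245 connections) to process request",
    "20140204094019 E #08824 mod_epo  Server is too busy (245 connections) to process request",
]

# Decoding the last timestamp on its own: 4 February 2014, 09:40:19.
print(datetime.strptime("20140204094019", "%Y%m%d%H%M%S"))  # 2014-02-04 09:40:19

for hour, n in sorted(busy_per_hour(lines).items()):
    print(hour, n)
```

Comparing the hourly counts for yesterday and today would show whether the longer check-in interval and the extra hardware actually reduced the rejections.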

                     

                    Server still pegging all 4 cores at the time of this writing. I just ran the same query I used to install the sensors in reverse; that is, I changed the action to Remove instead of Install. It's been running now for 1 hour and 45 minutes and says it's still at 50%.

                    So, still pegging. I'll give it one more day perhaps, then... ???

                    • 7. Re: EPO 4.66 Server Pegging
                      petersimmons

                      I would probably leave things alone for a bit. It isn't exactly clear what caused what, BUT it sounds like you have some issues with the server running out of connections (really rare). I would cancel any major tasks. Just leave the deployed sensors in place and let them finish whatever has been started. Undoing it may end up making things worse for no net gain.

                       

                      Also, a client check in of 30 minutes is probably a bit too frequent for anything but laboratory conditions. Stick with an ASCI of 120-240 minutes and a PEI of 30 minutes or so.
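As a rough illustration of why the interval matters, the average check-in rate is simply the number of agents divided by the ASCI. The sketch below uses a hypothetical fleet of 10,000 agents (the thread never states the real count) and assumes check-ins are spread evenly, which real traffic will not be.

```python
def checkins_per_minute(agents, asci_minutes):
    """Average agent check-ins per minute, assuming evenly spread check-ins."""
    if asci_minutes <= 0:
        raise ValueError("asci_minutes must be positive")
    return agents / asci_minutes


agents = 10_000  # hypothetical fleet size, not a figure from this thread

for asci in (30, 120, 240):
    rate = checkins_per_minute(agents, asci)
    print(f"ASCI {asci:>3} min: about {rate:.1f} check-ins per minute")
```

Whatever the real agent count is, moving from 30 to 120 minutes cuts the average check-in rate to a quarter.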

                      • 8. Re: EPO 4.66 Server Pegging
                        awbattelle

                        Still pegging this morning.

                        I did run a job last night to try to remove rogue sensors. It looks like it ran OK for a few hours and then terminated.

                        The detected system console still says I have 20 active sensors and 559 passive with 3 missing.

                        When I stop Apache, I can navigate the EPO console with no problem, but when I start it, everything bogs down.

                        Although the clients seem to be checking in OK for virus defs, none of the systems we try to encrypt are starting. They seem to be waiting for EPO to deliver a reuse certificate.

                        Anyway, I have an open ticket with gold support, so I'm going to work it this morning.

                        • 9. Re: EPO 4.66 Server Pegging
                          awbattelle

                          OK, I stopped and started the McAfee ePolicy Orchestrator 4.6.6 Server service, and the system appears to have returned to normal. Yay! Also, lab systems that were waiting to encrypt have begun encrypting. Yay!

                          CPU usage is now at a nominal 33-40% peak. Memory usage has also gone down a little.

                          So perhaps running my job last night helped reverse the issue. We will revisit the rogue sensor deployment, but only 1-5 subnets at a time, which will suck because there are like 330 enclaved subnets in our enterprise.

                          So, lesson learned. Perhaps I will open a discussion on how best to deploy rogue sensors.

                          I will monitor the system for the rest of the day. Hopefully all is nominal. Thanks for the support from the community.
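For planning the staggered redeployment, here is a minimal Python sketch that splits a subnet list into fixed-size waves. The subnet names are placeholders and `plan_waves` is a made-up helper; it only does the bookkeeping and does not talk to ePO in any way.

```python
def plan_waves(subnets, wave_size):
    """Split the subnet list into consecutive waves of at most wave_size subnets."""
    if wave_size < 1:
        raise ValueError("wave_size must be at least 1")
    return [subnets[i:i + wave_size] for i in range(0, len(subnets), wave_size)]


# Placeholder names standing in for the roughly 330 subnets mentioned above.
subnets = [f"subnet-{n:03d}" for n in range(1, 331)]

waves = plan_waves(subnets, 5)
print(len(waves))   # 66
print(waves[0])
```

At 5 subnets per wave that is 66 waves, and at 1 per wave it is 330, so the wave size is a trade-off between total rollout time and the key-generation load each wave puts on the server.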
