17 Replies. Latest reply on May 20, 2010 2:09 AM by auism

    SG560 performance issues (4.0.5)

      I posted this to the SnapGear user group but was directed here, so I'll repost.

       

      One of our clients just started having an issue with their SG560 today,
      running the 4.0.5 firmware (been on this firmware for months now). CPU
      utilization is frequently spiking to 100% and staying there for about 20
      minutes at a time. This corresponds with reports of Internet access being
      slow to nonexistent.

      We only have 2 IPSec VPNs, both of which appear to be staying up, just slow.
      I'm not having much luck trying to find a breakdown of CPU usage to try to
      identify the process that's causing the spikes. 'top' isn't an available
      command if I telnet in, and I'm not seeing anything on the diagnostic page.
      Here's what the front page is currently showing:

      Load Average 1.280 (5 min) 1.060 (10 min) 0.890 (15 min)
      CPU 100% (total) 1% (user) 1% (kernel)
      Memory Usage 50% (of 14680 KB)
      Config Usage 25% (of 1024 KB) 3% (of 1835 inodes)
      Temp File Usage 27% (of 512 KB) 2% (of 1835 inodes)

      Can anybody point me in the right direction to troubleshoot this?

        • 1. Re: SG560 performance issues (4.0.5)

          This could be caused by a number of issues.

           

          The first thing to check is whether an internal host is causing this performance issue.

          If you disconnect the internal LAN and connect only a single 'clean' laptop/PC, does the issue persist?

           

          If you still need assistance, your best option is to contact Tech Support, as an inspection of a Support Report will probably reveal the issue.

          • 2. Re: SG560 performance issues (4.0.5)

            This sounds like the issue we have also been having recently on version 4.0.5, in our case on an SG580. I have not been able to find the cause or had time to investigate further, and have blamed it on us upgrading the firmware / config from version 3.22 to 4 without doing a fresh config.

             

            The symptoms for our issue are:

            After the unit has been on for about a week (sometimes sooner):

            When accessing the GUI via HTTPS it becomes very slow and eventually times out. WAN connections drop out, so we lose outside access, and after a few minutes the unit completely craps itself and all lights go solid green. We then have to power cycle it, and it goes back to normal until next time. Before it dies completely, the CPU usage graph shows it constantly at 100% until it dies / overloads itself.

             

            It has been fine the last few weeks, so I'm not quite sure whether some of the config changes / additions we made changed this or the trigger just hasn't been hit again yet. I will try to grab a TSR next time we hit this (if it will allow me to).

             

            Still an issue at the moment is the slowness of the GUI when trying to configure IPSEC, GRE, or most other things in the GUI. We have avoided touching the config / GUI during business hours until I get time to redo the config.

             

             

            Message was edited by: orionweb on 1/5/10 5:44:48 PM CST
            • 3. Re: SG560 performance issues (4.0.5)

              I wasn't the one who did the firmware upgrade, but the chances are high that the one who did also went directly from 3.x to 4.x (I think we started with an earlier 4 build which ended up having all kinds of nasty performance bugs).  The weird thing is, it's been pretty stable since 4.0.5 was put on.  Even weirder is that the problem started yesterday about an hour AFTER the unit had been power cycled (it may have been happening before, but I wasn't called until after).  Eventually I ended up rebooting it again yesterday and things went back to normal.

               

              I'd still like to know if there's a way to view CPU utilization per process so I could find what process is using it all?  When I was finally able to get back on prior to the reboot, I checked the services tab on the home page but the only services listed also showed less than 1% CPU utilization, so I still don't know what the rogue process was.

              • 4. Re: SG560 performance issues (4.0.5)

                Can you post the process list from near the top of the support report, or the output of this command from the command line:

                 

                ps

                 

                and also this command from the command prompt:

                 

                cpu -r 1

                 

                Run it for a few seconds, then press Ctrl-C.

                • 5. Re: SG560 performance issues (4.0.5)

                  The problem hasn't resurfaced since the last reboot, so this data may not be terribly useful, but if nothing else I guess it will give me something to compare with if/when it happens again.

                   

                    PID USER       VSZ STAT COMMAND
                      1 root       456 S    /bin/init
                      2 root         0 SW<  [kthreadd]
                      3 root         0 SWN  [ksoftirqd/0]
                      4 root         0 SW<  [events/0]
                      5 root         0 SW<  [khelper]
                      6 root         0 SW<  [kblockd/0]
                      7 root         0 SW   [pdflush]
                      8 root         0 SW   [pdflush]
                      9 root         0 SW<  [kswapd0]
                     10 root         0 SW<  [aio/0]
                     11 root         0 SW<  [mtdblockd]
                     18 root       568 S    watchdog /dev/watchdog
                     48 root         0 SW<  [ixp400_eth]
                     49 root         0 SW<  [ixp400_eth]
                    131 root         0 SW   [crypto]
                    132 root         0 SW   [crypto_ret]
                    147 root      1532 S    statsd daemon
                    261 root       396 S    /bin/inetd
                    262 root       500 S    /bin/flatfsd
                    263 root       580 S    /sbin/syslogd -n
                    264 root       568 S    /sbin/klogd -n
                    265 root       396 S    /bin/cron
                    266 root       800 S    /bin/ifmond
                    267 root      1616 S    /bin/acld
                    269 root       748 S    /bin/ntpd -g -n -U 60
                    272 root         0 SW   [ixp400 eth1]
                    273 root       976 S    /bin/nflogd -p -d -c
                    303 root         0 SW   [ixp400 eth0]
                    336 root      2692 S    pluto --nofork --secretsfile /etc/config/ipsec.secret
                    374 root      2464 S N  pluto helper  #  0                                  
                    376 root       716 S    lwdnsq
                  4433 root       704 S    /bin/httpd /home/httpd
                  4442 root      4188 S    ./cgix
                  4445 root       576 S    sh -c ps ax
                  4446 root       572 R    ps ax

                   

                  As for the cpu command,

                   

                  # cpu -r 1
                  -sh: cpu: not found

                   

                  No real help there.

                  • 6. Re: SG560 performance issues (4.0.5)

                    It is useful.

                     

                    It shows which subsystems you have enabled, which can affect performance as described here: ( grab a cuppa first )

                     

                    http://community.mcafee.com/docs/DOC-1114

                     

                    What I really need is a Support Report, generated while the issue is occurring, which you need to submit via Support for the reasons cited here:

                     

                    http://community.mcafee.com/docs/DOC-1061

                     

                    Is this possible?

                    • 7. Re: SG560 performance issues (4.0.5)

                      If it happens again, I'll try to pursue that route.  I'm a little wary of contacting support simply because I haven't had a positive experience in the past.  The last time I tried to submit a TSR I never received a response, but this was before McAfee got involved, so maybe things are different now.

                       

                      Looking at the new support site, it looks like I need to create a new account.  One of the required fields is "Grant #", but it doesn't explain what that is.

                      • 8. Re: SG560 performance issues (4.0.5)

                        Things have changed a bit since the McAfee merger and things like this forum show some of the positive things to come of it.

                         

                        We have also changed some escalation procedures lately, so I hope your experience will be different as we genuinely want to get data on real issues experienced by customers and get them fixed.

                         

                        The Grant number is necessary these days. You need to register at

                         

                        http://my.securecomputing.com

                         

                        with your UTM serial number, and you can then retrieve a Grant ID which can be used to contact support.

                         

                        The best way is via the web portal

                         

                        https://mysupport.mcafee.com

                         

                        for complex cases, as that way your issue is documented by yourself, including necessary TSRs

                         

                        http://community.mcafee.com/docs/DOC-1061

                         

                        and network diagrams and can be easily understood by the escalating engineer if required.

                        • 9. Re: SG560 performance issues (4.0.5)

                          > I'd still like to know if there's a way to view CPU utilization per process

                          for pure engineering interest, general techno-phile enjoyment and to prove a point:

                             this is linux, it can do (almost) anything!

                           

                          What follows below is a bit technical. It is focused on answering the above question,

                          and that is all. It does not imply our customers are expected to get involved with the

                          product at this level, or reflect in any way on the problem that gave rise to the question.

                           

                           

                          On bigger devices this is easy - use top. Smaller devices have many tools missing, but the

                          raw stats are still there - if you know where to look and how to interpret them.

                           

                          top gets its data from /proc/<pid>/stat. See proc(5) for field details (google is your friend).

                          top reads /proc/*/stat - then reads it again a second later. Comparing the two

                          sets of data you will find that for each process usrtime (utime) and systime (stime) may have changed a bit,

                          the delta giving you the total CPU jiffies the process has consumed between the two measurements.

                          That's fields 14 & 15 respectively in stat, counting from 1 (a big number to have to count to for

                          a manager - caveat emptor).

                          You compare that to the actual time elapsed between the two snapshots, and hey-presto,

                          there is your CPU consumption percentage.
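
                          The arithmetic above can be sketched in a few lines of Tcl. This is a minimal
                          illustration only: stat_ticks and cpu_percent are hypothetical helper names, the
                          tick rate of 100 per second is an assumption (the usual USER_HZ on Linux), and
                          it sticks to integer maths on the assumption that the on-box interpreter may not
                          have floating point.

```tcl
# Return utime + stime (in clock ticks) from one line of /proc/<pid>/stat.
# The command name (field 2) is wrapped in parentheses and may itself
# contain spaces, so only the text after the last ")" is split.
proc stat_ticks {line} {
    set close [string last ")" $line]
    set rest [string range $line [expr {$close + 2}] end]
    set fields [split $rest " "]
    # $rest starts at field 3 (state), so field 14 (utime) is list
    # index 11 and field 15 (stime) is list index 12.
    return [expr {[lindex $fields 11] + [lindex $fields 12]}]
}

# Percentage of one CPU used between two snapshots.
# hz is the number of clock ticks per second (assumed to be 100).
proc cpu_percent {ticks_before ticks_after elapsed_sec {hz 100}} {
    return [expr {100 * ($ticks_after - $ticks_before) / ($hz * $elapsed_sec)}]
}
```

                          For example, a process whose utime + stime went from 1200 to 1250 ticks over a
                          one-second interval gives cpu_percent 1200 1250 1, i.e. 50.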

                           

                          The tiny-tcl interpreter on all our devices is clever enough for one to be able to write a script that could

                          do this. As an example of the sort of stuff tcl can do, check out /bin/highavaild on the box. As a bonus

                          you will note that tcl has full access to the config (read/write).

                           

                          So doing that properly is a bit painful, but the following script should list you a set of 'interesting' processes

                          every second, and the size of the number should be the jiffies it has used, provided I didn't screw up somewhere,

                          which is entirely possible - so take this as an inelegant, naive, untested, 'in-principle', alpha-level example.

                          The output seemed to agree with top on the 'worst offender' front - that's all I checked.

                           

                          Of course you could look up the actual proc-name under /proc and sort the thing and calculate the

                          exact elapsed time and convert to percentage etc. etc. - left as an exercise and for a community member

                          to post here to demonstrate their superior zen-level mastery of tcl.

                           

                          save this to /etc/config/mytop or /var/mytop and then 'metash mytop' from a cli.

                           

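                          proc pstat {pt} {
                            upvar 1 $pt ptv
                            # one stat file per running process
                            set stat [glob /proc/\[0-9\]*/stat ]

                            foreach s $stat {
                              if {![catch {open $s r} f]} {
                                gets $f data
                                close $f
                                # the command name (field 2) may contain spaces, so only split what follows the last ")"
                                set rest [string range $data [expr {[string last ")" $data] + 2}] end]
                                set fields [split $rest " "]
                                # $rest starts at field 3, so utime (14) and stime (15) are list indexes 11 and 12
                                set ptv($s) [expr {[lindex $fields 11] + [lindex $fields 12]}]
                              }
                            }
                          }

                          array set plast {}

                          # first snapshot
                          pstat plast
                          while {1} {
                            sleep 1

                            array set pcur {}
                            pstat pcur
                            # print every process whose utime + stime grew since the last snapshot
                            foreach s [array names pcur] {
                              if {[info exists pcur($s)] && [info exists plast($s)] } {
                                set val [ expr {$pcur($s) - $plast($s)} ]
                                if { $val > 0 }  {puts stdout "$s $val"}
                              }
                            }
                            array set plast [array get pcur]
                          }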
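
                          To print a name rather than a /proc path, the command name can be pulled out of
                          the same stat line. A small sketch, assuming the usual "pid (comm) state ..."
                          layout described in proc(5); stat_name is a hypothetical helper, not something
                          shipped on the box, and the sample line is made up.

```tcl
# Return the command name (field 2) from one line of /proc/<pid>/stat.
# The name sits between the first "(" and the last ")", and may contain
# spaces or parentheses of its own.
proc stat_name {line} {
    set first [string first "(" $line]
    set last [string last ")" $line]
    return [string range $line [expr {$first + 1}] [expr {$last - 1}]]
}

# Example with a made-up stat line; prints: ixp400 eth1
puts [stat_name "272 (ixp400 eth1) S 2 0 0 0 -1 0 0 0 0 0 0 7 0 0 20 0"]
```

                          In the script, that would mean keeping the name alongside the tick count in
                          pstat and printing it in place of $s.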

                           

                          Cheers

                          tom
