
    Concerns on WebReporter as a plausible tool for a large company

    lpp

      Hello, we would like to get feedback from users or McAfee experts on whether WebReporter is a sound tool for reporting web navigation at large companies.

       

      Here is the situation:

       

      We have two proxy farms, each built with ten hardware-load-balanced WebGateway MWG-5500 appliances. For each site we have just installed a server running the WebReporter service. After several days spent optimising, tuning and rearranging the settings, we are still unable to import the logs generated in a single day. Based on your experience, we would like to understand whether the problem lies in our implementation, design and decisions, and can therefore be worked out and sorted, or whether it is due to WebReporter's inability to deal with large data volumes.

       

      A single farm generates approximately 150 GB of logs per day (220 GB if HTTP 407 errors are logged). The logs are rotated when they reach 100 MB, which gives us approximately 1,500 log files per day.

       

      In order to comply with national laws and regulations regarding users' privacy, we had to develop a two-stage process: every night the original WebGateway logs are moved to a secure, certified log-collection facility. Just before this transfer, a scheduled script copies these logs, anonymises them (the user name is replaced by a hash with a random seed) and transfers them to the WebReporter node, where they land compressed (approximately 16 GB of gzip-compressed daily logs) in their respective directories (one per source appliance).
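
      Roughly, the anonymisation step works like the following sketch (paths, field position and delimiter are simplified placeholders, not our production script, which also handles the transfer):

      #!/usr/bin/env python
      # Simplified sketch of the nightly anonymisation step: the user-name field of
      # each access-log line is replaced by a salted hash before the file is forwarded
      # to the WebReporter node. Paths, field index and delimiter are placeholders.
      import gzip
      import hashlib
      import os

      SALT = os.urandom(16)          # random seed, regenerated on every run
      USER_FIELD = 2                 # position of the user name in the log line (example)
      DELIMITER = ' '

      def anonymise_line(line):
          fields = line.rstrip('\n').split(DELIMITER)
          if len(fields) > USER_FIELD and fields[USER_FIELD] not in ('', '-'):
              digest = hashlib.sha256(SALT + fields[USER_FIELD].encode('utf-8')).hexdigest()
              fields[USER_FIELD] = digest[:16]
          return DELIMITER.join(fields) + '\n'

      def anonymise_file(src_path, dst_path):
          # Read the plain access log, write a gzip-compressed anonymised copy.
          with open(src_path, 'rt') as src, gzip.open(dst_path, 'wt') as dst:
              for line in src:
                  dst.write(anonymise_line(line))

      if __name__ == '__main__':
          anonymise_file('/var/log/mwg/access.log', '/transfer/appliance01/access.log.gz')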

       

      Once the files are received, WebReporter starts to import each log file (the maximum number of concurrent log processing jobs is set to 32 under Administration, Performance, Advanced options). Ten log sources are defined, one per directory, each corresponding to a source appliance.

       

      With MySQL (either internal or external), the first job lasted a couple of minutes, while the sixth or seventh (out of more than a thousand!) took three to four hours; we were never able to successfully import all the logs for a single day.

       

      We are now trying Oracle 11g, as suggested, but to no avail. The import time is even longer, probably due to the increased complexity of both the engine and the schema (referential integrity and system tables). We imported five logs in a day (0.3%).

       

      So my question is: do we have to skip WebReporter and look for an alternative solution? Do we need to improve our hardware? Could we develop an in-house solution that imports the logs into Oracle in parallel and then use WebReporter to manipulate that data (is a detailed description of the import operations available)?

       

       

      Thank you very much for your advice and suggestions.

       

      Best regards,

      LPP

       

       

      P.S. We are using a single HP ProLiant DL580 G8 server per site, with two Intel Xeon E5-2660 CPUs running at 2.20 GHz, 128 GB of RAM and 2 TB of space provided by six hard-disk drives under RAID. WebReporter has 16 GB of RAM assigned, while Oracle has 96 GB.

       

      At any moment during the log import, A SINGLE CORE is working (100%) while the remaining 31 cores are idle. The RAM is only used by the operating system to cache files. The situation was the same with MySQL, so even with partitioning enabled the application seems to be designed for sequential import (but then why the concurrent jobs setting in the Administration pane?) rather than parallel processing.

        • 1. Re: Concerns on WebReporter as a plausible tool for a large company
          trishoar

          Hi LPP,

           

          I'll describe our setup first. We are a smaller site than yours, generating between 30-50 GB of logs per day. Our reporting servers currently consist of two HS21 blades with 16 GB of RAM, 8 cores and SAN storage; one blade is for Web Reporter and one for MySQL. Both are on Linux (RHEL 5 for WR, as that was all that was supported when I built this, and RHEL 6 for MySQL).

           

          In terms of WR config, we have been advised by McAfee not to take the memory allocation above 4 GB, so that's where it is set, and I've found that for log processing 5 concurrent jobs works best for us. Each log file is the same size as yours, 100 MB, and takes about 1 minute to process. The import process is staggered through the night, with the proxy servers FTPing the data up, server by server, starting at 00:10 and finishing at 03:30. We could reduce this time a bit with more aggressive scheduling, but it works for us.
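
          As a rough sanity check on those numbers (taking the upper end, 50 GB per day, as an assumption):

          # Rough check of the nightly import window; 50 GB/day is the assumed worst case.
          daily_volume_mb = 50 * 1024          # ~50 GB of logs per day
          file_size_mb = 100                   # rotation size
          minutes_per_file = 1                 # observed processing time per file
          concurrent_jobs = 5

          files_per_day = daily_volume_mb / file_size_mb                        # ~512 files
          wall_clock_minutes = files_per_day * minutes_per_file / concurrent_jobs
          print(round(wall_clock_minutes))     # ~102 minutes, well inside the 00:10-03:30 window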

           

          We have the log sources set to Condense log records into page views, and we do not store detailed data.

           

          I've altered processing so that Maximum unflushed record age is 25 seconds and the Queue throttle wait time is 5000 milliseconds; other than that the Advanced options are left at McAfee's defaults.

           

          I find the setup works for us, since the main purpose I have for it is reporting on internet usage, top users and top sites.

           

          However, if I wanted to use it for abuse reporting, that's where the whole thing falls apart. The daily jobs are OK, taking about 30 minutes to run, but we sometimes have to run reports for customers, and although the basic content of those reports is the same as our daily jobs, they can take upwards of 2 hours to complete.

           

          If I had time I'd look at an alternative to Web Reporter, as IMO it is too slow for that sort of reporting, and I know our customers would like us to provide monthly usage information, which we simply cannot do because it is so slow.

           

          It is also worth mentioning that CSR is the replacement for Web Reporter. It is free, but requires EPO and Windows, which are both sticking points for me, as I dislike EPO and prefer Linux.

           

          If I get time I plan to test Logstash and Graylog to see what sort of reports I can get from those, and there is also Splunk, which I know some people use for MWG.

           

          Regards,

           

          Tris

          • 2. Re: Concerns on WebReporter as a plausible tool for a large company
            sroering

            Hello,

             

            Yes, we have many large customers using Web Reporter who produce more than 150 GB of data per day.  From what you described, the main difference between your deployment and the typical large customer is that they use Microsoft SQL Server Enterprise.  Consequently, that is where most of the tweaks have been made with regard to scaling performance.  I don't have comparison numbers for MySQL, but I'll check around when more people get into the office.  Who suggested Oracle?  If you have support ticket numbers, please send them to me in a private message and I can check your history.

             

            What performance options have you changed from the defaults?  Cache sizes, number of concurrent jobs, etc.?  Giving Web Reporter 16 GB is completely unnecessary and not recommended; 4-8 GB is the maximum.  Giving it more only makes the garbage collector lazy: eventually all 16 GB would be used for unreferenced objects, and when the garbage collector is finally forced to work, it is inefficient.  You can set the aggregate cache size to 400000 (400k), which should really be the only configuration change needed besides setting the memory to 4-8 GB.

             

            The concurrent log parsing jobs are simply the workers feeding data into the system and they are more than enough to keep it busy.  When you say that only one core is working at 100%, what is the process?

             

            If I could see your Web Reporter config and the server logs for several days, I could investigate where the bottleneck exists and we could find ways to improve it. It would be good to start a support ticket.  The slowness you describe could be caused by a number of things.

            • 3. Re: Concerns on WebReporter as a plausible tool for a large company
              lpp

              Thank you very much for your response.

               

              Your feedback is useful, since it demonstrates clearly that a solution is possible.

               

              Best regards,

              LPP

              • 4. Re: Concerns on WebReporter as a plausible tool for a large company
                lpp

                Hello Sroering,

                 

                Unfortunately I am not able to open an SR, because WebReporter is the only McAfee product for which our procurement department has not renewed the maintenance fee (they are clever people).

                 

                I posted this request to understand whether a WebReporter solution is feasible for our volumes, and both your answer and Trishoar's seem to confirm that belief.

                 

                Albeit with a single node per site, our procured hardware also seems sufficient, as Trishoar's company is processing a third of our daily volume in only three hours and with fewer resources (being ready at 09:00 every day would be ideal for us).

                 

                I simply followed the advice I was able to collect from your answers. The memory assigned to the application is now 6 GB, with a Maximum unflushed record age of 25 seconds, a Queue throttle wait time of 5000 milliseconds and a maximum of 5 concurrent log processing jobs.

                 

                I also adopted a different strategy, with ten log sources scheduled to run every 20 minutes and staggered two minutes apart, where WebReporter fetches the newly created logs via the REST interface. WebGateway generates a new dedicated log, with a blank user, and rotates and compresses it after it reaches 10 MB.

                 

                To decrease the import time, the processing tab is completely cleared (no more page condensing or detailed web data).

                 

                The 100% CPU process (1 core out of 32) is always the RDBMS (now Oracle, before MySQL), where a single query takes a huge amount of time (and a single core). I remember that with SQLTop I could see it was a periodic and very long SELECT with a SORT.

                 

                I would be enthusiastic to provide you with all the possible information, as I am concerned that we are still far from our target (I don't think the platform will be able to process, in less than 2 minutes, the logs generated by a node in 20 minutes, which is our current time limit to avoid building up queues).

                 

                Thank you very much for your attention on this matter.

                 

                Best regards,

                LPP

                • 5. Re: Concerns on WebReporter as a plausible tool for a large company
                  sroering

                  trishoar wrote:

                   

                   

                  I've altered processing so that Maximum unflushed record age is 25 seconds and the Queue throttle wait time is 5000 milliseconds; other than that the Advanced options are left at McAfee's defaults.

                  The max unflushed age won't provide any benefit. It simply tells Web Reporter "if no new records are added to the queue after X seconds, go ahead and flush them to the DB". This is meant to clean out the record cache at the end of log parsing.  If there are jobs in the queue but this timer is tripped, yes, it will send data, but it's not a bottleneck and not adding any real benefit.  The queue throttle wait time is how long the log parsers sleep when the input record queue is full.  The default of 10 seconds is sufficient, but there's no harm in 5.

                   

                  I find the setup works for us, since the main purpose I have for it is reporting on internet usage, top users and top sites.

                   

                  However, if I wanted to use it for abuse reporting, that's where the whole thing falls apart. The daily jobs are OK, taking about 30 minutes to run, but we sometimes have to run reports for customers, and although the basic content of those reports is the same as our daily jobs, they can take upwards of 2 hours to complete.

                   

                  I would ask what version of Web Reporter you are using, and which filters you are using.  Newer versions have been optimized to generate queries that are a little more agreeable to the database. We found that the DB would sometimes use poor judgement in generating an execution plan, so we made modifications to the queries to help with that issue. I would also ask whether you have partitioning enabled and whether index maintenance is enabled. Those have a direct impact on query performance.

                   

                  If I had time I'd look at an alternative to Web Reporter, as IMO it is too slow for that sort of reporting, and I know our customers would like us to provide monthly usage information, which we simply cannot do because it is so slow.

                   

                  It is also worth mentioning that CSR is the replacement for Web Reporter. It is free, but requires EPO and Windows, which are both sticking points for me, as I dislike EPO and prefer Linux.

                   

                  If you have had support tickets in the past regarding performance issues, send me the numbers and I'd be happy to investigate.  In general, Web Reporter will outperform most general reporting solutions. At least give us a chance to see what can be done to make Web Reporter work to suit your needs.

                   

                  CSR is a cousin of Web Reporter.  Log parsing performance shouldn't differ much between them. They have evolved separately for a couple of years, but for the most part, what is changed in one gets changed in the other. On the reporting side, however, they are very different in how they function.

                  • 6. Re: Concerns on WebReporter as a plausible tool for a large company
                    sroering

                    Here is the best practices document for Web Reporter, which describes many of the issues discussed in this thread.

                    https://kc.mcafee.com/corporate/index?page=content&id=KB73295

                     

                    lpp wrote:

                     

                    Hello Sroering,

                     

                    Unfortunately I am not able to open an SR, because WebReporter is the only McAfee product for which our procurement department has not renewed the maintenance fee (they are clever people).

                     

                    Upload a Web Reporter feedback file to our FTP server and send me the filename in a private message. I'll let you know what I see: ftp.support.securecomputing.com/incoming/

                     

                    I simply followed the advice I was able to collect from your answers. The memory assigned to the application is now 6 GB, with a Maximum unflushed record age of 25 seconds, a Queue throttle wait time of 5000 milliseconds and a maximum of 5 concurrent log processing jobs.

                     

                    Eating your dinner with 5 spoons won't help you finish the meal any faster; two spoons do the job sufficiently.  Similarly, more than 2 log parsing jobs doesn't help; it would work against the system. I would strongly encourage you to keep that at the default.  Log parsing jobs are just responsible for breaking the log lines from the access logs into a parsing queue. They don't affect how fast the logs are processed.

                     

                    I also adopted a different strategy, with ten log sources scheduled to run every 20 minutes and staggered two minutes apart, where WebReporter fetches the newly created logs via the REST interface.

                     

                    Ahh... that's a problem.  You shouldn't pull logs through the REST interface if you have more than one or two appliances.  All the log sources are going to pull logs through the one proxy, so you are routing the access logs from 10 proxies through a single gateway on the way to Web Reporter. That could be one of your bottlenecks.

                     

                    To decrease the import time, the processing tab is completely cleared (no more page condensing or detailed web data).

                     

                    Page views can compress your data by a factor of roughly 8-10 times.  It also does a really good job of removing the embedded URLs that make up a typical web page. Do you really want/need to see 10 million requests to Google Analytics every day?  Most people don't, but if you do, that comes at a huge cost.  A bigger database also means longer cache loads and longer report run times.  If that was a decision made for performance, then you should re-enable page views, because it is much faster.   Please see this doc for more information. https://community.mcafee.com/docs/DOC-4662

                     

                    The 100% CPU process (1 core out of 32) is always the RDBMS (now Oracle, before MySQL), where a single query takes a huge amount of time (and a single core). I remember that with SQLTop I could see it was a periodic and very long SELECT with a SORT.

                     

                    I would be enthusiastic to provide you with all the possible information, as I am concerned that we are still far from our target (I don't think the platform will be able to process, in less than 2 minutes, the logs generated by a node in 20 minutes, which is our current time limit to avoid building up queues).

                     

                    The items discussed above are the things I would expect to cause performance issues. Let's start by addressing those and then re-evaluate.  I like to measure performance on a records-per-second basis.  It is important to consider the page-view compression factor: the GUI shows the outbound rate, which is much smaller than the true record count.

                     

                    On Microsoft SQL Server Enterprise we have real-world numbers where Web Reporter processes more than 8k records per second with a 6:1 page view compression; that works out to about 48k raw records per second.  I wouldn't expect that number on MySQL or Oracle, but it shows that in the right environment and configuration, it will scale to your needs. Let's figure out the target records per second on your system and see what needs to be done after we have addressed the issues we have already identified.
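
                    To make that concrete, here is a back-of-the-envelope sketch for your volumes (the ~500 bytes per access-log line is purely an assumption; plug in your real average line length):

                    # Rough target rate for the volumes described in the original post.
                    # The average line length (~500 bytes) is an assumption, not a measured value.
                    daily_log_bytes = 150 * 1024**3      # ~150 GB of raw logs per farm per day
                    avg_line_bytes = 500                 # assumed average access-log line length
                    nodes = 10                           # appliances per farm

                    records_per_day = daily_log_bytes / avg_line_bytes    # ~320 million lines per farm
                    farm_avg_rate = records_per_day / 86400               # ~3,700 records/s sustained

                    # Clearing 20 minutes of one node's logs in 2 minutes (the stated goal) means
                    # running at 10x a single node's arrival rate, i.e. the farm average again:
                    node_rate = farm_avg_rate / nodes                     # ~370 records/s per node
                    required_rate = node_rate * (20 / 2)                  # ~3,700 records/s

                    print(round(farm_avg_rate), round(required_rate))

                    Even if peak-hour traffic runs at, say, two to three times the daily average (another assumption), that is still well under the 48k records/s figure above, so the raw volume by itself shouldn't be the limiting factor.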

                    • 7. Re: Concerns on WebReporter as a plausible tool for a large company
                      lpp

                      Thank you Sroering for sharing your expertise.

                       

                      I just uploaded a feedback file generated after adjusting the settings according to your indications. Its name is simply WRFeedback_2014.04.16_09.22.34.zip.

                       

                      Anyway, this morning I found all the logs successfully imported for the first time. Moreover, we now get almost real-time data, since the log source scheduling is set to 20 minutes.

                       

                      We are using REST only temporarily, to test this new approach (collecting small chunks of logs every few minutes instead of processing a huge volume at night), as WebGateway FTP push has always been difficult to get working (curl error 7) and needs time to set up.

                       

                      The overall settings are basically default, except for a few parameters.

                       

                      I set the maximum concurrent log processing jobs to the value of 2.

                      Since the CPUs are basically idle, I hoped that this parameter would increase parallelism during log import.

                       

                      We have always loved page views (along with the detailed web data), but again we thought that clearing the processing tab would decrease import time; I have now re-enabled page views. I also doubled the IP cache size, User cache size and Site request cache size, as the hit ratios were 99.9%, 100% and 98.9% respectively.

                       

                      We don't use directories at all (as the user names are blanked anyway).

                       

                      Currently 6 GB of RAM are assigned to WebReporter.

                       

                      Our target would be to keep one week of logs.

                       

                      We never opened a performance-related SR in the past, as the system performed well in our test environment, where everything is scaled down proportionally. Our only SR on WebReporter (3-3196668633) concerns the ability to perform strong authentication (without this feature we must blank user names, thereby losing user persistence across requests), and it became a PER.

                       

                      Thank you again for your precious help.

                       

                      Best regards,

                      LPP

                      • 8. Re: Concerns on WebReporter as a plausible tool for a large company
                        lpp

                        I am sorry, I had misinterpreted the cache hit ratio results I got.

                         

                        Now I am using the defaults for IP cache size, User cache size and Site request cache size, and I have doubled the Aggregate record cache.

                         

                        Thank you for your attention.

                        LPP

                        • 9. Re: Concerns on WebReporter as a plausible tool for a large company
                          trishoar

                          Hi,

                           

                          We are currently running 5.2.1.02 1491, on MySQL 5.5 with partitioning enabled, and DB schema 22. Unfortunately we are in a similar situation to LPP, in that we have not renewed support and thus I do not have any calls open regarding performance.

                          We reindex every week, run a DB optimize periodically and only keep 3 months' worth of data. Also, we roll hours up to days after 3 days. So in terms of DB maintenance I think we are OK, but please let me know if there is anything else we can do to make this faster.

                          Insertion rate for us is not a problem, and in fact we get up to 14k per second, with a 4:1 page view ratio.

                          [screenshot: wreporter.png]

                           

                          The problem we have is getting data out.

                          To give a bit more detail, there are 6 different proxy clusters feeding in: the main one as described above, with the others only contributing ~10 GB between them.

                          I've made sure reports are confined to just the relevant cluster so as to reduce the amount of data to query. An example of the reports we produce would be a daily report consisting of:

                          • Bandwidth: Volume by hour (Chart)
                          • Top 30 Sites by number of hits (Chart)
                          • Top 30 Sites by Volume of data (Chart)
                          • Top 30 IPs sorted by Volume, with Hits (Table)

                           

                          There are also Weekly and Monthly reports which show Bandwidth Volume by Day (Chart).

                           

                          The most common ad-hoc report we run consists of:

                          • Bandwidth: Volume by Day for Month (Chart)
                          • Site Name sorted by Bytes, with Percent of Bytes, Hits and Percent of Hits (Table)
                          • Top 20 Content Types by Bytes (Chart)
                          • Top 30 IPs sorted by Volume, with Hits (Table)

                           

                          This is then filtered by an IP range and the proxy cluster the customer is on. The last 2 reports we ran of this kind took over 2 hours each. Ideally I'd like to automate this and send it out to 400 customers a month, but at 2 hours per site, the reporter would be completely tied up providing this info.
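
                          A quick back-of-the-envelope calculation shows why (one report at a time, since running them in parallel is slower, as noted below):

                          # Rough capacity check for sending monthly usage reports to ~400 customers.
                          customers = 400
                          hours_per_report = 2
                          hours_needed = customers * hours_per_report      # 800 hours of query time

                          hours_in_month = 30 * 24                         # ~720 hours available
                          print(hours_needed, hours_in_month)              # 800 vs 720: more than the month has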

                          I've tried running the reports in parallel and found they run much, much slower, so we only run 1 report at a time.

                           

                          Tris
