We have been using Web Reporter for some time. I've recently rebuilt our service on newer hardware, as the performance was too poor. The service currently works OK for the log-parsing stage, but reporting is still slower than I'd like, so I'm hoping someone can suggest some ways I can improve our setup.
What we have at present is a Web Reporter server running 5.2.0.02 1105 on RHEL 5.7 x64. The server is an IBM HS21 with 16GB of memory and two quad-core 3.0GHz CPUs, so it should have plenty of grunt. The storage is 250GB connected over Fibre Channel to an IBM XIV SAN.
I have a second of these blades running RHEL 6.1 x64 and MySQL 5.1.52, again with 250GB of storage.
I'm pulling in data from 12 WW6 servers; 8 of them are shifting 300GB per day, the others 90-100GB. This represents about 60-80 million lines per day.
When doing the import I'm processing 20-30k lines per second, which is fine. I've set Web Reporter up as follows:
5 Log processing Jobs
Queue throttle threshold 150000 records
Site request cache 150000
Aggregate record cache 1500000
Java is set to consume 10GB RAM
We do not store logs for detailed reporting, and are condensing into page views.
Monitoring the server when doing the log import shows it is pushing the CPU hard, which is good.
When doing reporting, though, the DB server will sit querying the DB using just one thread until it has gathered all the data for one part of the report. I've tried tuning the MySQL DB a bit, but I'm no DBA and have no real idea what I'm doing in this area; in /etc/my.cnf I've set a few parameters.
Also, on the DB the memory usage is very low, with MySQL using only about 1GB.
The server has already taken a noticeable slowdown over the last week or so since I set it up.
A report covering all log sources for yesterday took 30 minutes to run; it covered the following areas:
Bandwidth: Volume by Hour (Chart)
Top 30 Sites by Hits (Chart)
Top 30 Sites by Volume (Chart)
Top 20 Content Types by Volume (Chart)
Bandwidth Volume by Site
Top 100 IPs by Volume, with percent of Bytes, Hits, Hits - Blocked, and percent of hits
Log Source by Volume
Trusted source protection area (Detail)
That report was originally taking under 1 minute...
My concern is that I need to be able to produce a similar report that will capture the whole month for specific customers; in that instance we will filter the customers by IP.
If anyone has any thoughts on improving the DB performance, I'd most appreciate it!
As to the other part of the question, I'm seeing a lot of log lines being ignored due to spaces in the referrer string, such as:
2011-09-07 00:00:32,128 ERROR [securecomputing.smartfilter.logparsing.LogAudit] (LogAudit) invalid line at linenumber 22245 '10.122.145.32 - "-" [06/Sep/2011:11:19:16 +0000] "GET http://www.slideshare.net/rss/slideshow/id/7148126 HTTP/1.1" 200 2418 "http://static.slidesharecdn.com/swf/menu.swf?embedCode=<div style="width:425px" id="__ss_7148126"> <strong style="display:block;margin:12px" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)" "ia, md" - "application/xml" "Teaching_Resources" 0.730 "-" -' from file 'access1109061119.merged.log20110907-000027304.dat'
and also due to buggy user agents used by Sophos:
2011-09-07 00:00:47,122 ERROR [securecomputing.smartfilter.logparsing.LogAudit] (LogAudit) invalid line at linenumber 706 '10.122.204.140 - "-" [06/Sep/2011:11:33:07 +0000] "GET http://es-web-4.sophos.com/update/ccosx/Sophos%20Anti-Virus%20Home%20Edition.mpkg/Contents/Packages/... HTTP/1.1" 200 5142 "" "<UA a="Mac" c="A-Wrights-iMac.local" u="cca0009aa5e" v="7.3.3" />" "-" - "application/octet-stream" "-" 1.614 "-" -' from file 'access1109061135.merged.log20110907-000040969.dat'
I already have the following in my "exclude log lines matching" pattern:
.*"<UA u="[A-Z0-9\-]*" c="[A-Z0-9\-]*"/>".*$|.*SophosUpdateManager.*
Can anyone suggest something to capture those other two as well?
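Following up on my own question: two patterns like the below seem to catch both sample lines. These are my own suggestions, not anything official; the attribute names and order are taken only from the two samples above, and I've tested them in Python, so they'd still need verifying against whatever regex flavour the exclude field actually uses:

```python
import re

# Candidate additions for the "exclude log lines matching" field.
# Assumptions: the field accepts standard regex alternation like the
# existing entry; the Sophos UA tag always self-closes with "/>".
sophos_ua = re.compile(r'.*"<UA [^>]*/>".*')        # Sophos UA tag, any attributes
embedded_div = re.compile(r'.*embedCode=<div .*')   # referrer with embedded HTML

sophos_line = ('10.122.204.140 - "-" [06/Sep/2011:11:33:07 +0000] '
               '"GET http://es-web-4.sophos.com/update/... HTTP/1.1" 200 5142 "" '
               '"<UA a="Mac" c="A-Wrights-iMac.local" u="cca0009aa5e" v="7.3.3" />" '
               '"-" - "application/octet-stream" "-" 1.614 "-" -')
referrer_line = ('10.122.145.32 - "-" [06/Sep/2011:11:19:16 +0000] '
                 '"GET http://www.slideshare.net/rss/slideshow/id/7148126 HTTP/1.1" '
                 '200 2418 "http://static.slidesharecdn.com/swf/menu.swf'
                 '?embedCode=<div style="width:425px" id="__ss_7148126"> ...')

print(bool(sophos_ua.match(sophos_line)))      # True
print(bool(embedded_div.match(referrer_line))) # True
```

If the field wants one combined expression, they could presumably be appended to the existing one with `|` in the same style: `.*"<UA [^>]*/>".*|.*embedCode=<div .*`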
The bottleneck is the DB. Web Reporter is waiting for the DB to run the queries. MySQL will only use one thread per query, and Web Reporter will only send one query (per report) at a time. So to utilize multiple threads, you need to break those queries into separate reports.
On the DB, I'd recommend the 64-bit MySQL if you're using the 32-bit version. Then increase the key_buffer; that's the buffer that holds the indexes, and raising it will improve your query performance.
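For example, in /etc/my.cnf (the 4G figure is only an illustration for a 16GB box, not a tested recommendation; size it to your actual indexes):

```ini
[mysqld]
# Holds MyISAM index blocks in memory; a larger value means fewer
# index reads hit the disk during report queries.
key_buffer_size = 4G
```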
On the Web Reporter side, it's strongly recommended that all the log-parsing jobs stay at the default. Since you seem to be running fine, it's probably OK to leave it there, but if you see log-parsing performance issues, you might want to put the log-parsing jobs back to 2 to see if that improves things. Also, your aggregate cache is probably severely undersized. The aggregate cache holds summary data for the hours covered by the data being parsed. If you're looking at the cache hit rates, keep in mind that the aggregate cache may not get above 80%, because you're always importing data forward into empty slots. But if the hit rate is really low, such as 20%, that's a problem. You would also see messages in the server log about not being able to free enough hour buckets from the AggregateCache or AggregateSession.
The last problem I see is the Java memory for Web Reporter. There shouldn't be any reason to go above 4-6 GB. Going larger can actually have a negative effect; we believe the garbage collector ends up persisting a lot of old references when the memory setting is too high.
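I can't say exactly where your particular install sets the heap, but generically the Java heap ceiling is the -Xmx flag, so a 6GB cap would look something like this (JAVA_OPTS is just an illustrative variable name, not necessarily what Web Reporter's startup script uses):

```
JAVA_OPTS="-Xmx6g"
```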
OK, I've turned the Java size down to 6GB and will see how that goes.
On the DB, MySQL is x64. I've set key_buffer to 25% of total RAM (4GB) and will see how we go.
I've also changed the DB maintenance to happen at a different time, as it was running directly after the log parsing, which meant the cache stats had been reset by the time I came into the office.
When I've been running the log parsing manually, the aggregate cache was sitting around 70-80%, and looking in server.log all I'm seeing is this:
2011-09-08 00:00:12,480 INFO [securecomputing.smartfilter.reportdbbuilder.AggregateCache] (52) prefetch this dayHour now: [day=15224 hour=16 logSourceId=8]
2011-09-08 00:00:12,482 INFO [securecomputing.smartfilter.reportdbbuilder.AggregateCache] (52) adding dayHour=[day=15224 hour=16 logSourceId=8] total size=1459923
2011-09-08 00:00:12,487 INFO [securecomputing.smartfilter.reportdbbuilder.AggregateCache] (52) adding dayHour=[day=15224 hour=17 logSourceId=8] total size=1459951
2011-09-08 00:00:15,527 INFO [securecomputing.smartfilter.reportdbbuilder.AggregateCache] (52) adding dayHour=[day=15224 hour=14 logSourceId=8] total size=1476417
2011-09-08 00:00:19,651 INFO [securecomputing.smartfilter.reportdbbuilder.AggregateCache] (52) Removing eldest dayHour bucket at=[day=15224 hour=17 logSourceId=6]
2011-09-08 00:00:19,651 INFO [securecomputing.smartfilter.reportdbbuilder.AggregateCache] total records ejected=13950
2011-09-08 00:00:21,978 INFO [securecomputing.smartfilter.reportdbbuilder.AggregateCache] (52) Removing eldest dayHour bucket at=[day=15224 hour=19 logSourceId=6]
2011-09-08 00:00:21,978 INFO [securecomputing.smartfilter.reportdbbuilder.AggregateCache] total records ejected=3075
2011-09-08 00:00:22,477 INFO [securecomputing.smartfilter.reportdbbuilder.AggregateCache] (52) Removing eldest dayHour bucket at=[day=15224 hour=18 logSourceId=6]
2011-09-08 00:00:22,477 INFO [securecomputing.smartfilter.reportdbbuilder.AggregateCache] total records ejected=3736
2011-09-08 00:00:23,283 INFO [securecomputing.smartfilter.reportdbbuilder.AggregateCache] (52) Removing eldest dayHour bucket at=[day=15224 hour=13 logSourceId=6]
2011-09-08 00:00:23,283 INFO [securecomputing.smartfilter.reportdbbuilder.AggregateCache] total records ejected=119146
2011-09-08 00:00:23,916 INFO [securecomputing.smartfilter.reportdbbuilder.AggregateCache] (52) prefetch this dayHour now: [day=15224 hour=7 logSourceId=8]
2011-09-08 00:00:23,918 INFO [securecomputing.smartfilter.reportdbbuilder.AggregateCache] (52) adding dayHour=[day=15224 hour=7 logSourceId=8] total size=1384758
2011-09-08 00:00:24,274 INFO [securecomputing.smartfilter.reportdbbuilder.AggregateCache] (52) prefetch this dayHour now: [day=15224 hour=15 logSourceId=8]
2011-09-08 00:00:24,276 INFO [securecomputing.smartfilter.reportdbbuilder.AggregateCache] (52) adding dayHour=[day=15224 hour=15 logSourceId=8] total size=1386981
2011-09-08 00:00:37,270 INFO [securecomputing.smartfilter.reportdbbuilder.AggregateSession] Flushing aggregate session...
2011-09-08 00:00:47,998 INFO [securecomputing.smartfilter.reportdbbuilder.AggregateSession] Aggregate Session flush is closing hibernate session
2011-09-08 00:00:48,002 INFO [securecomputing.smartfilter.reportdbbuilder.AggregateSession] clearing aggregateRecordCache's flags
2011-09-08 00:00:48,002 INFO [securecomputing.smartfilter.reportdbbuilder.AggregateSession] Flush of aggregate session completed
If I were to split the reports out into multiple separate queries, do you think it would work better where, for example, I run the report "Bandwidth: Volume by Hour (Chart)" for each customer, then run "Top 30 Sites by Hits (Chart)" for each customer, etc.? Or will this have no real benefit?
You found the right messages regarding loading the aggregate cache, but really they are of little interest unless you see something regarding not being able to free enough room for new data. That would indicate that the aggregate cache is too small, and you could increase it slightly.
Keep in mind that the default performance values are enough for all but our largest deployments, so it's not a good idea to make values arbitrarily large, because it could have a negative effect. The aggregate cache wouldn't need to be any larger than the number of summary records generated over a 4-hour period.
I think your report would benefit from running queries in parallel, so breaking the report into smaller sets of queries should help. I can't say for sure whether running 10 queries as 10 simultaneous reports would be better than running them sequentially, because there must be a point where performance degrades. It could be higher than 10 (our default), but I don't have any case history to know what other people have seen or tried. My personal feeling is that 10 will probably be OK, but you'll need to test it to know for sure.
Also, you should consider turning off the roll-ups in DB Maintenance if you have enough disk space. They add a lot of time to your DB maintenance and cause extra fragmentation that degrades report performance. And if you notice that reports seem to slow down over time, you might want to run "OPTIMIZE TABLE" on the DB tables every couple of months. That defragments the data file and the indexes, but it requires enough disk space to make a copy of the scr_fct_web table and index (because that is essentially what the DB is doing).
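For the record, the statement itself is just the following (scr_fct_web is the fact table mentioned above; best run during a quiet period, since it locks the table while it works):

```sql
-- Rebuilds the data file and indexes for the fact table; MySQL needs
-- enough free disk space for a temporary copy of the table while it runs.
OPTIMIZE TABLE scr_fct_web;
```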
OK, I've turned off the roll-ups and will monitor the space consumption. The DB is at 5.6GB with 29 million records since 25/08/2011. I've set it to delete them after 1 year (that's based on 0.4GB per day × 365 = 146GB). I can grow the volume easily enough, and may reduce the storage for Web Reporter, as I don't think it will need 250GB over the term.
Another issue I've spotted is that the reindexing is not currently working; in server.log it's spitting out:
ERROR [securecomputing.smartfilter.server.project.reportdbbuilder.tasks.DatabaseMaintenanceTask] Failed to complete rebuild indexes maintenance com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: SELECT command denied to user 'wreporter'@'10.134.176.54' for table 'proc'
In the DB I've given it full access, apart from GRANT, so I don't see why it would say access is denied. Any thoughts on what to look for there?
I'm going to create a load of scheduled reports per customer and start timing them all to see how things are looking.
Thanks for all your help so far.
To resolve this error:
ERROR [securecomputing.smartfilter.server.project.reportdbbuilder.tasks.DatabaseMaintenanceTask] Failed to complete rebuild indexes maintenance com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: SELECT command denied to user 'wreporter'@'10.134.176.54' for table 'proc'
I had to give my Web Reporter user access to the local mysql database as well, in case anyone else runs into this.
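Specifically, the grant looked like this (the user and host here are taken from the error message above, and mysql.proc is the table the exception names):

```sql
-- Web Reporter's rebuild-indexes task reads stored-routine metadata
-- from the mysql system schema, which the app user couldn't see.
GRANT SELECT ON mysql.proc TO 'wreporter'@'10.134.176.54';
FLUSH PRIVILEGES;
```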
Running multiple reports at the same time had a very negative impact on performance, with reports taking longer overall to run. Even running near-identical reports at the same time was significantly slower than running them sequentially.
In about a week I'll have a month's worth of data collected, and I'll start testing reports that sample one month to see how that works, as ideally we will run most of the customer reports on a monthly basis.