Our customer has a requirement to retain 6 months worth of webwasher access logs to query using CSR. The most common type of query wil be around the examination of an individual user's web browsing habits. There will be a small number of users running reports, probably no more than a couple at a time. Looking at the current rate of log creation this equates to around 5,000,000,000 individual access logs. We anticipate using page views to minimise the size of the database, and the expectation is that the number of logs is set to increase in the future.
We are planning to use SQL Sever 2008 SE on a SAN attached physical server. CSR will be installed on a separate physical server, with ePO running on a VM.
I'm wondering if anyone has had any experience of using CSR (or indeed Web Reporter) for this order of magnitude of data. I believe that the main load will be on the SQL server and I'm keen to ensure that it is spec'd highly enough to cope. Initial thoughts are starting with 32GB RAM, using a quad core Xeon 2.4Ghz processor. Any experiences / lessons learnt would be most welcome.
Regarding log parsing, the performance of CSR and Web Reporter 5.2 (and indeed 5.2.1) should be about the same. The big difference between the two products is on the report generation, but I don't think this should be too different in performance either.
Regarding your log data, we have several customers that generate over 250 million records per day. keeping at least 6 months. Are you sure about your log record count? Maybe you mean 5 million or are considering log data other than access logs. 5 billion is quite large and I would guess the company size would need to be well over 100,000 people. If you really meant 5 million, then I would put you as an "average" size to what we have for customers.
regarding hardware, CSR and Web Reporter are about the same (assuming CSR is on a different server than ePO). You shouldn't need more than 12 GB of RAM and that is for special circumstances were you might run ver large reports that are several GB. SQL server would be my preference for the database too. Make sure it is on dedicated hardware and give it all the RAM you can. You will need an Microsoft SQL Server Enterprise license, but you should enable DB partitioning for best performance.
My recommendation, is to use page views, not only for compression, but it makes your reports more closely reflect reality. See this doc for more information.
Please consider this KB article for deployment best practices.
I know, the data seems huge. We have in excess of 27,000 users on the estate. Around 15GB of access logs are written every day, equating to 4392 log files (each 100MB) for the last month. Each log file contains around 250,000 log records and 25% of them can be ignored due to 407 errors. My calc was based on:
250,000 x 4392 x 0.75 x 6 = 4,941,000,000 as representing the total number of records for 6 months.
Interesting that you mention customers with 250 million records per day. That sounds much larger than our customer who I reckon is running at about 37.5 million access records per day currently. And yes, these are just access logs that we are considering.
In light of the above, do you reckon that we might be OK with our proposed server, albeit with the potential for more memory (i.e. use 2008 Enterprise Edition to break above the 32GB limit of SE)? What would you propose as good starting point? 64GB?
Thanks for the links and the quick response. I'll have a read now.
5 billion records for 6 months makes a lot more sense. I thought you were saying 5 billion per day. Don't take this the wrong way, but that's really not very much data. You shoulnd't have any performance issues if you plan well. I judge performance based on records per second. At 40 million records a day you need to run at least 463 records a second or you are falling behind. Web Reporter has no trouble running over 3,000 records per second on the out side. If page views are enabled, you can multiply that number by about 10x for 30k/sec.
So if you can achieve 3,000 records per second, you can pars a full day of log files without page views in less than 4 hours. That should be plenty of time for running reports and performing maintenance, etc.
Make sure the DB has dedicated hardware. The DB does more work than Web Reporter since it runs the reports and stores data during log parsing. Web Reporter only has to do the log parsing. I think you would be fine with 32GB RAM on the DB server. You want to put your best hardware here. That includes disks. If your detail data table is 500 GB, and you run a detail report for the full 6 months, the DB needs to scan the entire table which is a lot of disk IO. Enabled DB parttioning to get the best report performance.
Web Reporter should run well on any modern server with 12 GB of RAM. You would probably run fine with 4~8 GB of RAM dedicated to Web Reporter. I think the default cache options will work fine for you.
Use summary data for your reports when ever possible. Detail data is atleast 10x larger and the cost of running such reports reflects this.
That is my advice.
A small note. You said that the traffic has a lot of HTTP status 407, which means you are doing proxy authentication. As you seemed to have noticed, Web Reporter pre-filters this data (as it should) and doesn't go into the DB. This also affects the performance numbers I gave.
In general, there are two places to take a reading. Log records in the file need to be read into memory, then we do some processing, and ultimately some records come out the back end.
I ignored your 25% HTTP status 407 records which are dropped on the front end. Ultimately the front end has NEVER had any performance issues. The default 2 log parsing threads tear through data like industrial wood chippers. Nobody has ever needed more than 2 log parsing threads.
If issues happen, it is on the processing. That is why I make all my performance measuremets on what comes out of the system. Administration > Options > Performance > DB Statistics >> Records per second. The only gotcha is to keep in mind that you need to mulitply that number by your average page-view compression (usually around 10x) to know how many records on the front end are getting processed. But again, this doesn't consider 407s since it is pre-filtered.
Thanks very much for your response there. Very helpful.
I am trying to get some kind of handle on how fast or slow, reporting might be for users.
I have taken your advice and beefed up the proposed database server with the follow spec:
Server (Fujitsu RX300S7) with 32GB RAM (DDR3 1600), database to be stored on a RAID 5 array based on 600GB local drives, 2 x Intel Xeon E5-2643 4C/8T 3.30 GHz 10 MB processors.
For CSR, we are also planning to use a separate physical server (similar spec to above except with 1 processor and 300GB drives)
Just wondering if this spec sounds about right and typically with 6 months of logs (approximating to 0.5 billion access logs taking into account page view compression), how quickly the system might handle user queries. Note that the most critical query will be one that looks at a user's activity over the entire 6 month period and so as you say this will be very I/O intensive. I'm hoping that with local disks, that the high levels of I/O can be supported.
Very rough thoughts on potential performance would be much appreciated. Are we potentially talking minutes, a few hours, lots of hours to run a user activity query?
Run regular index maintenance, and use SQL partitioning. If your reports are using summary data, such as top categories by hits for jdoe over 6 months, or top sites by bytes for jdoe over 6 months, then your reports should be fairly fast. For all 6 months, I wouldn't expect the reports to run more than a couple minutes per query on summary data. Shorter time periods would be faster still. Summary data has indexes on pretty much everything, which makes it very fast if you keep them defragmented.
Detail data only has indexes on the record ID and time. So running reports on detail data for all 6 months will do a full table scan regardless of your filters. Trying to predict performance on detail data is much more difficult since it depends on all the events using disk IO. I think you have a good plan in place and shouldn't have unreasonable report times.
A detailed report that says "give me all urls for site mcafee.com from user jdoe over 6 months" I would expect less than 1 hour. A detailed report that says "give me all the urls that contain the word 'mcafee' over 6 months" I wouldn't be surprised if it took over an hour. Both queries do a full table scan since they cover all 6 months. The main difference is that the first can filter on the user id and site ID (a simple number comparison). The second query has to to a text search of the URL which is very costly.
I hope that helps put performance into perspective.