Not sure if this is the right section, but here it goes. When running reports I notice that sites that have different "subhosts" in the URL are shown as separate entries. This gives an inaccurate report. An example of this is when you report on Streaming Media. You will see many lines listing many different site names, but they all end with youtube.com.
I want it to simply ignore the subhosts and give me a total of all traffic to youtube.com, adding up all those subhosts and showing just the total bandwidth from YouTube. Below is an example of what I see in a report:
189 o-o.preferred.atl05s01.v5.lscache8.c.youtube.com Streaming Media, Media Sharing 449.54
190 o-o.preferred.atl05s01.v4.lscache3.c.youtube.com Streaming Media, Media Sharing 448.79
191 o-o.preferred.atl05s01.v12.lscache1.c.youtube.com Streaming Media, Media Sharing 448.67
192 o-o.preferred.atl05s01.v17.lscache6.c.youtube.com Streaming Media, Media Sharing 445.76
193 o-o.preferred.atl05s01.v11.lscache4.c.youtube.com Streaming Media, Media Sharing 444.70
194 o-o.preferred.atl05s01.v8.lscache2.c.youtube.com Streaming Media, Media Sharing 441.94
195 o-o.preferred.atl05s01.v22.lscache4.c.youtube.com Streaming Media, Media Sharing 441.19
You are correct: Web Reporter does not strip subdomains from the hostname. If you want to report on top level domains, then they will need to be put into a user-defined column that uses a ruleset.
I haven't tested this, and I'm relatively certain this won't be 100% correct, but the regex for the ruleset would need to be something like this.
The URLs look like this
"GET http://foo.bar.com/somethingmore.html HTTP/1.1"
A breakdown of what I was trying to achieve is this.
.* matches any character, zero or more times.
\.* matches an optional period (strictly, zero or more periods)
(stuff between parens is what we will be keeping)
[a-zA-Z\-0-9]+ matches alphanumeric characters and dash, one or more required
\. matches a single period (required)
[a-zA-Z]+ matches only alpha characters, one or more required
\/* matches an optional forward slash (strictly, zero or more)
.* matches any character, zero or more times.
For hosts under two-part country-code domains such as .co.uk or .co.jp, the regex above would only give you the "co.uk" or "co.jp" part.
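To make that limitation concrete, here is a minimal sketch in Python. It is not the regex from the post above. It pulls the host out of the request line first and then keeps only the last two labels, which is a simplification of my own. The helper names and the example.co.uk host are made up for illustration, and the pattern sticks to syntax that Java's regex engine should also accept.

```python
import re
from urllib.parse import urlsplit

# Last two dot-separated labels of a hostname, anchored at the end.
# Illustrative pattern only, not the one from the thread.
LAST_TWO_LABELS = re.compile(r"([a-zA-Z0-9\-]+\.[a-zA-Z]+)$")


def host_from_request(request_line):
    """Pull the hostname out of a line like 'GET http://host/path HTTP/1.1'."""
    url = request_line.split()[1]
    return urlsplit(url).hostname or ""


def naive_domain(host):
    """Return the last two labels, or the host unchanged if nothing matches."""
    match = LAST_TWO_LABELS.search(host)
    return match.group(1) if match else host


host = host_from_request("GET http://foo.bar.com/somethingmore.html HTTP/1.1")
print(host, "->", naive_domain(host))

# Works for the YouTube cache hosts from the report...
print(naive_domain("o-o.preferred.atl05s01.v5.lscache8.c.youtube.com"))

# ...but a two-part suffix loses the part you actually care about.
print(naive_domain("www.example.co.uk"))
```

The last call returns only "co.uk", which is exactly the problem with keeping a fixed number of labels.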
A better solution would put a list of these known domains into an optional group. For example, something like this.
Since + is a greedy operator, it would continue to consume the top level domains in the list. Then in front of it, you could optionally capture one more domain level with "[a-zA-Z\-0-9]*". The reason I say optional is that "info.co.uk" might itself be the complete domain.
And lastly, you might want to make a safe default rule that keeps the entire host. Hopefully that's enough to get you started. Web Reporter uses the Java regex notation if you need to find documentation. Here's one reference: http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
Another thought: keep in mind that it's perfectly OK to have an IP address for the hostname. That's another case you would need to handle. I'm sure there are others I haven't thought of yet either.
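Putting those points together, a rough sketch of the idea might look like the following. It is written in Python for convenience and is not a tested Web Reporter ruleset. The suffix list is a tiny hypothetical sample, the names KNOWN_SUFFIXES and site_key are invented here, and example.co.uk is just an illustrative host. Anything that does not match, including an IP address, falls through to the safe default of keeping the whole host.

```python
import re

# Hypothetical, deliberately short suffix list; a real one would be much longer.
KNOWN_SUFFIXES = ["co.uk", "co.jp", "com", "org", "net"]

# Longest suffixes first so "co.uk" is tried before shorter alternatives.
_suffix_alt = "|".join(
    re.escape(s) for s in sorted(KNOWN_SUFFIXES, key=len, reverse=True)
)

# Optionally one more label in front of a known suffix, anchored at the end
# of the host and starting either at the beginning or right after a dot.
DOMAIN = re.compile(
    r"(?:^|\.)((?:[a-zA-Z0-9\-]+\.)?(?:" + _suffix_alt + r"))$"
)


def site_key(host):
    """Collapse a host to 'label.suffix'; keep the host as-is if no rule fits."""
    match = DOMAIN.search(host)
    return match.group(1) if match else host  # safe default: entire host


for h in [
    "o-o.preferred.atl05s01.v5.lscache8.c.youtube.com",
    "www.example.co.uk",
    "info.co.uk",
    "192.168.0.1",
]:
    print(h, "->", site_key(h))
```

With this sample list, "info.co.uk" is kept whole and the IP address is left untouched, while the YouTube cache hosts collapse to "youtube.com". Hosts under a suffix that is not in the list are also left unchanged, so the list needs maintaining.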
Thanks for the suggestion. I understand your points about where we might have problems. How about I make it a little easier? I'm specifically looking to combine YouTube and Facebook hits only. That should be easier to nail down, since we know what we are working with in regard to domain names. What do you think?
OK, I have made some headway with this. I have created a rule that will take basically anything ending in youtube.com and turn it into just www.youtube.com
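For anyone following along, a rule of that kind could be sketched roughly as below. This is not the actual rule from the post, just a Python illustration of the same idea with made-up names (COLLAPSE_RULES, collapse_site, total_by_site). The rows are the seven lines from the report excerpt at the top of the thread, with the last column treated as a plain number.

```python
import re
from collections import defaultdict

# Anything that is, or ends in, ".youtube.com" or ".facebook.com" is collapsed
# to a single site name. Illustrative patterns, not a tested ruleset.
COLLAPSE_RULES = [
    (re.compile(r"^(?:.*\.)?youtube\.com$"), "www.youtube.com"),
    (re.compile(r"^(?:.*\.)?facebook\.com$"), "www.facebook.com"),
]


def collapse_site(host):
    """Return the collapsed site name, or the host unchanged if no rule fits."""
    for pattern, replacement in COLLAPSE_RULES:
        if pattern.match(host):
            return replacement
    return host


def total_by_site(rows):
    """Sum the numeric column per collapsed site name."""
    totals = defaultdict(float)
    for host, amount in rows:
        totals[collapse_site(host)] += amount
    return dict(totals)


rows = [
    ("o-o.preferred.atl05s01.v5.lscache8.c.youtube.com", 449.54),
    ("o-o.preferred.atl05s01.v4.lscache3.c.youtube.com", 448.79),
    ("o-o.preferred.atl05s01.v12.lscache1.c.youtube.com", 448.67),
    ("o-o.preferred.atl05s01.v17.lscache6.c.youtube.com", 445.76),
    ("o-o.preferred.atl05s01.v11.lscache4.c.youtube.com", 444.70),
    ("o-o.preferred.atl05s01.v8.lscache2.c.youtube.com", 441.94),
    ("o-o.preferred.atl05s01.v22.lscache4.c.youtube.com", 441.19),
]

totals = total_by_site(rows)
print(totals)
```

The leading "(?:.*\.)?" requires a dot before "youtube.com", so an unrelated host that merely ends in those letters is not swept up by the rule.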
For my needs this is perfect. The second part is that I have to apply this rule to a user-defined column. I want it to populate a "new" site option with this rule, but when I tell it to use a log record from the source data the "site" field is not shown; URL is the closest. How does Web Reporter work out the Site field? Does it use the URL field or a particular field from the log file header?
Unfortunately you cannot modify the pre-populated columns such as url or site name without actually modifying the access log before importing. The user defined columns essentially allow you to pull custom data from the log and store it as an extra value.
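If you do go the route of modifying the access log before importing, a small preprocessing step could look something like the sketch below. It assumes the request appears in each line in the form shown earlier in the thread ("GET http://host/path HTTP/1.1"); the real log layout may differ, and the function name rewrite_line is made up. It would be wise to run it on a copy of the log and keep the original.

```python
import re

# Rewrite any host that is, or ends in, ".youtube.com" or ".facebook.com"
# to the "www." form. The lookahead makes sure the host really ends there.
HOST_REWRITE = re.compile(
    r'(https?://)(?:[a-zA-Z0-9\-]+\.)*(youtube\.com|facebook\.com)(?=[/:?\s"]|$)'
)


def rewrite_line(line):
    """Return the log line with matching hosts collapsed; other lines unchanged."""
    return HOST_REWRITE.sub(r"\1www.\2", line)


sample = (
    '"GET http://o-o.preferred.atl05s01.v5.lscache8.c.youtube.com'
    '/somethingmore.html HTTP/1.1"'
)
print(rewrite_line(sample))
```

Note that this rewrites every matching URL on the line, so a YouTube or Facebook URL embedded in another site's query string would be changed as well.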