5 Replies Latest reply on Sep 26, 2011 11:45 AM by sroering

    Web Reporter showing subhosts separately

      Not sure if this is the right section, but here it goes.  When running reports, I notice that sites with different "subhosts" in the URL are shown as separate entries, which makes the report inaccurate.  An example is a report on Streaming Media: you see many lines listing different site names, but they all end in youtube.com.

       

      I want it to simply ignore the subhosts and give me a single total for youtube.com, adding up the traffic from all those subhosts and showing just the total bandwidth.  Below is an example of what I see in a report:

       

       

      189   o-o.preferred.atl05s01.v5.lscache8.c.youtube.com    Streaming Media, Media Sharing   449.54

      190   o-o.preferred.atl05s01.v4.lscache3.c.youtube.com    Streaming Media, Media Sharing   448.79

      191   o-o.preferred.atl05s01.v12.lscache1.c.youtube.com   Streaming Media, Media Sharing   448.67

      192   o-o.preferred.atl05s01.v17.lscache6.c.youtube.com   Streaming Media, Media Sharing   445.76

      193   o-o.preferred.atl05s01.v11.lscache4.c.youtube.com   Streaming Media, Media Sharing   444.70

      194   o-o.preferred.atl05s01.v8.lscache2.c.youtube.com    Streaming Media, Media Sharing   441.94

      195   o-o.preferred.atl05s01.v22.lscache4.c.youtube.com   Streaming Media, Media Sharing   441.19

        • 1. Re: Web Reporter showing subhosts separately
          sroering

          Hello,

           

          You are correct, Web Reporter does not strip subdomains from the hostname.  If you want to report by domain rather than by full hostname, the domain will need to be put into a user-defined column using a ruleset.

           

          I haven't tested this, and I'm relatively certain this won't be 100% correct, but the regex for the ruleset would need to be something like this.

           

          .*\.*([a-zA-Z\-0-9]+\.[a-zA-Z]+)\/*.*

           

          The URLs look like this

          "GET http://foo.bar.com/somethingmore.html HTTP/1.1"

           

          A breakdown of what I was trying to achieve:

           

          .*       matches any character, zero or more times

          \.*     matches zero or more periods (so a period may or may not exist)

          (stuff between the parentheses is what we will be keeping)

           

          [a-zA-Z\-0-9]+       matches alphanumeric characters and dashes, one or more required

          \.      matches a single period (required)

          [a-zA-Z]+       matches only alpha characters, one or more required

          \/*              matches zero or more forward slashes (optional)

          .*           matches any character, zero or more times
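To see how the pattern above behaves in practice, here is a small Java sketch. It assumes the ruleset applies the pattern to the whole string, the way `Matcher.matches()` does, which this thread does not confirm. `HostRegexDemo`, `capture`, and the `TIGHTER` pattern are hypothetical names for illustration and are not part of Web Reporter. Because the leading `.*` is greedy, the original pattern tends to keep only the tail of the string, so the sketch also tries a variant anchored on the host.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HostRegexDemo {
    // The pattern from the post, unchanged.
    static final Pattern ORIGINAL =
        Pattern.compile(".*\\.*([a-zA-Z\\-0-9]+\\.[a-zA-Z]+)\\/*.*");

    // Hypothetical tighter variant (an assumption, not from the post):
    // optional scheme, lazily skipped leading labels, the last two labels
    // captured, then an optional port and path up to the end of the string.
    static final Pattern TIGHTER = Pattern.compile(
        "^(?:[a-zA-Z]+://)?(?:[a-zA-Z0-9\\-]+\\.)*?"
        + "([a-zA-Z0-9\\-]+\\.[a-zA-Z]+)(?::\\d+)?(?:/.*)?$");

    // Returns capture group 1 if the whole input matches, otherwise null.
    static String capture(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String url = "http://foo.bar.com/somethingmore.html";
        System.out.println(capture(ORIGINAL, url));
        System.out.println(capture(TIGHTER, url));
    }
}
```

With the sample URL from the post, the original pattern captures `e.html`, while the tighter variant captures `bar.com`. The tighter variant still keeps only the last two labels, so it returns `co.uk` for www.google.co.uk.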

           

           

          Be warned that many countries outside of the US use two levels at the end of their domains, for example www.google.co.uk or www.google.co.jp.

          A regex that keeps only the last two labels, as the one above is meant to, would give you just the "co.uk" or "co.jp" part.

           

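          A better solution would put the known top-level and second-level domains into a list of alternatives.  Note that the list has to be a parenthesized group; square brackets would match single characters.  For example, something like this:

          (\.(co|nz|jp|com|info|net|biz|org|edu))+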

           

          Since + is a greedy operator, it would continue to consume the top-level domains in the list.  Then, in front of it, you could optionally capture one more domain level with "[a-zA-Z\-0-9]*".  The reason I say optional is that you might want to consider the possibility that "info.co.uk" is the complete domain.
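The suffix-list idea can be sketched in Java roughly as follows. This is only an illustration under assumptions: the suffix list is deliberately tiny, "uk" is added so that the co.uk example works, and `SuffixListDemo` and `domainOf` are hypothetical names. The list is written as a parenthesized alternation, since square brackets in Java regex would form a character class instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SuffixListDemo {
    // Hypothetical, deliberately short suffix list; a real one would be much longer.
    static final String SUFFIX = "(?:co|nz|jp|uk|com|info|net|biz|org|edu)";

    // Lazily skip leading labels, then capture one label followed by
    // one or more known suffix labels, up to the end of the host.
    static final Pattern DOMAIN = Pattern.compile(
        "^(?:[a-zA-Z0-9\\-]+\\.)*?([a-zA-Z0-9\\-]+(?:\\." + SUFFIX + ")+)$");

    // Falls back to the entire host when nothing matches (for example an IP address).
    static String domainOf(String host) {
        Matcher m = DOMAIN.matcher(host);
        return m.matches() ? m.group(1) : host;
    }

    public static void main(String[] args) {
        System.out.println(domainOf("www.google.co.uk"));
        System.out.println(domainOf("info.co.uk"));
        System.out.println(domainOf("o-o.preferred.atl05s01.v5.lscache8.c.youtube.com"));
    }
}
```

Here www.google.co.uk comes back as google.co.uk and info.co.uk is kept whole. The ambiguity mentioned above remains: a host such as www.info.co.uk cannot be told apart from a domain registered under info.co.uk by the pattern alone.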

           

          And lastly, you might want to make a safe default rule that keeps the entire host.  Hopefully that's enough to get you started.  Web Reporter uses the Java regex notation if you need to find documentation.  Here's one example: http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html

          • 2. Re: Web Reporter showing subhosts separately
            sroering

            Another thought: keep in mind that it's perfectly OK to have an IP address for the hostname.  That's another case you would need to handle.  I'm sure there are others I haven't thought of yet either.
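One way to handle the IP address case is to test for it first and keep such hosts unchanged. The following Java sketch covers dotted IPv4 literals only, with an optional port; `IpHostCheck` and `isIpv4Host` are hypothetical names, and IPv6 literals are not handled.

```java
import java.util.regex.Pattern;

class IpHostCheck {
    // One octet in the range 0-255.
    static final String OCTET = "(?:25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)";

    // Four dot-separated octets, optionally followed by :port.
    static final Pattern IPV4 =
        Pattern.compile("^(?:" + OCTET + "\\.){3}" + OCTET + "(?::\\d+)?$");

    static boolean isIpv4Host(String host) {
        return IPV4.matcher(host).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIpv4Host("192.168.1.10"));
        System.out.println(isIpv4Host("www.youtube.com"));
    }
}
```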

            • 3. Re: Web Reporter showing subhosts separately

              Thanks for the suggestion.  I understand your points about where we might have problems.  How about I make it a little easier?  I'm specifically looking to combine YouTube and Facebook hits only.  That should be easier to nail down, since we know what we are working with in regard to domain names.  What do you think?

              • 4. Re: Web Reporter showing subhosts separately

                OK, I have made some headway with this.  I have created a rule that takes basically anything ending in youtube.com and turns it into just www.youtube.com.
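The rule itself was not posted, so the following Java sketch is only a guess at what such a rule could look like. `SiteCollapse` and `collapse` are hypothetical names, and the Facebook pattern is an assumed extension based on the earlier reply. Requiring a dot before the domain keeps unrelated hosts such as notyoutube.com from being merged.

```java
import java.util.regex.Pattern;

class SiteCollapse {
    // Any number of leading labels, then exactly youtube.com (case-insensitive).
    static final Pattern YOUTUBE =
        Pattern.compile("(?i)^(?:[a-z0-9\\-]+\\.)*youtube\\.com$");

    // Same idea for facebook.com (assumed, not from the posted rule).
    static final Pattern FACEBOOK =
        Pattern.compile("(?i)^(?:[a-z0-9\\-]+\\.)*facebook\\.com$");

    // Maps every matching host to one canonical name; other hosts pass through.
    static String collapse(String host) {
        if (YOUTUBE.matcher(host).matches()) return "www.youtube.com";
        if (FACEBOOK.matcher(host).matches()) return "www.facebook.com";
        return host;
    }

    public static void main(String[] args) {
        System.out.println(collapse("o-o.preferred.atl05s01.v5.lscache8.c.youtube.com"));
        System.out.println(collapse("notyoutube.com"));
    }
}
```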

                 

                For my needs this is perfect.  The second part is that I have to apply this rule to a user-defined column.  I want it to populate a "new" site column using this rule, but when I tell it to use a log record from the source data, the "site" field is not offered; URL is the closest.  How does Web Reporter work out the Site field?  Does it use the URL field, or a particular field from the log file header?

                 

                Thanks.

                • 5. Re: Web Reporter showing subhosts separately
                  sroering

                  Unfortunately you cannot modify the pre-populated columns such as url or site name without actually modifying the access log before importing.  The user defined columns essentially allow you to pull custom data from the log and store it as an extra value.
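If modifying the access log before import is an option, the rewrite could be sketched in Java along these lines. This is an illustration under assumptions: it treats each log line as containing a request in the form shown earlier in the thread ("GET http://host/path HTTP/1.1"), it only rewrites YouTube hosts, and `AccessLogRewrite` and `rewriteLine` are hypothetical names. The actual access log format should be checked before trying anything like this.

```java
import java.util.regex.Pattern;

class AccessLogRewrite {
    // Scheme, then any host ending in youtube.com; the lookahead makes sure
    // the host really ends there (path, port, whitespace, quote, or end of line).
    static final Pattern YOUTUBE_URL = Pattern.compile(
        "(?i)(https?://)(?:[a-z0-9\\-]+\\.)*youtube\\.com(?=[/:\\s\"]|$)");

    // Replaces the matched host with www.youtube.com and keeps the scheme.
    static String rewriteLine(String line) {
        return YOUTUBE_URL.matcher(line).replaceAll("$1www.youtube.com");
    }

    public static void main(String[] args) {
        String line = "\"GET http://o-o.preferred.atl05s01.v5.lscache8.c.youtube.com/videoplayback HTTP/1.1\"";
        System.out.println(rewriteLine(line));
    }
}
```

Rewriting the log this way discards the original subhost, so it is worth keeping an untouched copy of the log.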