3 Replies Latest reply on Mar 27, 2014 10:22 AM by cnewman

    MediaType.FromHeader vs. MediaType.EnsuredTypes

    bkirk

      Questions about the difference between List.OfMediaType.ToString(MediaType.EnsuredTypes, "") and MediaType.ToString(MediaType.FromHeader)

       

      We used to use a log format that had MediaType.FromHeader, but now we are using one that uses MediaType.EnsuredTypes. Which is more reliable? I would think the ensured type would be, but as you can see in the second MWGaccess.log entry below, it doesn't return any results. We are loading the logs into Splunk, so the parsing can be adjusted to accept either format; I just want to know which MediaType is more accurate.

       

      $ grep user123  access.log/access.log |tail -2  # MediaType.ToString(MediaType.FromHeader)

      [26/Mar/2014:12:15:10 -0400] "user123" 192.168.6.16 200 "GET http://cdn.kloveair1.com/services/broadcast.asmx/GetRecentSongs?SiteId=1&format= json&callback=GetRecentSongs HTTP/1.1" "Religion/Ideologies" "Minimal Risk" "application/json" 5637 "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0" "" "0" "" 436

      [26/Mar/2014:12:15:24 -0400] "user123" 192.168.6.16 200 "GET http://emf.mp3.miisolutions.net/kl/klove_newmedia_web_high HTTP/1.1" "Content Server" "Minimal Risk" "audio/mpeg" 13758590 "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0" "" "0" "" 459

       

      $ grep user123 MWGaccess.log/MWGaccess.log |tail -2  # String.ReplaceIfEquals(List.OfMediaType.ToString(MediaType.EnsuredTypes, ""), "", "-")

      [26/Mar/2014:12:15:10 -0400] "webgatewaytest" "user123"!!!! 192.168.6.16 192.168.6.16 216.38.164.162 "cdn.kloveair1.com" 200 "text/plain" 370 5569 "44" "0" "HTTP" "GET" "http://cdn.kloveair1.com/services/broadcast.asmx/GetRecentSongs?SiteId=1&format= json&callback=GetRecentSongs"!==! "HTTP/1.1" "GET http://cdn.kloveair1.com/services/broadcast.asmx/GetRecentSongs?SiteId=1&format= json&callback=GetRecentSongs HTTP/1.1"==!= "Religion/Ideologies" "Minimal Risk" "0" "Gateway Anti-Malware" "Block If Virus was Found" 0 "-" false "-" false "-" "-" "80" "http" "http://www.klove.com/" "FF21.0-6.1"!=!=! "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0"

      [26/Mar/2014:12:15:24 -0400] "webgatewaytest" "user123"!!!! 192.168.6.16 192.168.6.16 216.38.170.207 "emf.mp3.miisolutions.net" 200 "-" 417 13759867 "856322" "85" "HTTP" "GET" "http://emf.mp3.miisolutions.net/kl/klove_newmedia_web_high"!==! "HTTP/1.1" "GET http://emf.mp3.miisolutions.net/kl/klove_newmedia_web_high HTTP/1.1"==!= "Content Server" "Minimal Risk" "3" "Gateway Anti-Malware" "Skip on Streaming Media" 0 "-" false "-" false "-" "-" "80" "http" "http://www.klove.com/listen/player.aspx" "FF21.0-6.1"!=!=! "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0"

       

       

      Thank you!

        • 1. Re: MediaType.FromHeader vs. MediaType.EnsuredTypes

          That's a great question!

           

          Let me explain what the 3 mediatypes are and then show you a solution.

          1) Mediatype from header is based on the Content-Type header sent by the web server. You are correct in thinking that this could be wrong, or anything at all for that matter. This is one of the reasons that most web browsers attempt to render just about anything, regardless of the content type the server says it is. Most web servers just derive this header from the file extension anyway, unless configured otherwise.

          2) Mediatype from extension is equally flawed: it is based only on the file extension, which can also be manipulated.

          3) Media type ensured is better. MWG has a list of byte sequences and other matching criteria that allow it to make a guess at the file type. Generally it's pretty good. However, there are two concerns with it. One is that it is a guess. If the probability is over 70%, we list it; otherwise, there is nothing. Many formats are notoriously difficult because they have no set structure. Txt files are a good example of a file that may or may not reach a probability over 70%. The second issue is that MediaType.EnsuredTypes is a list instead of a single answer, as more than one format could be probable. This means that you could end up with a number of media types listed.

           

          In order to fix this, I created a rule that tests for media type ensured. If it has a value, I take the first entry in the list (I had to pick one) and save it to a user-defined variable; if it does not have a value, I use MediaType.FromHeader instead.

          Then in the log format, I reference my user-defined.MediaType variable instead of the default media-type from header.

          It's the best of both worlds.
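
          The fallback logic described above can be sketched outside of MWG. This is only an illustration in Python, not the MWG rule itself: the function name pick_media_type and the "-" placeholder for "nothing available" are assumptions made for the example (the "-" mirrors the placeholder used in the log format earlier in this thread).

```python
from typing import Optional, Sequence


def pick_media_type(ensured_types: Sequence[str], from_header: Optional[str]) -> str:
    """Hypothetical helper mirroring the rule: prefer the first ensured
    type, otherwise fall back to the header value."""
    if ensured_types:
        # The ensured result is a list; take the first entry (had to pick one).
        return ensured_types[0]
    if from_header:
        # Nothing was ensured, so trust the server's Content-Type header.
        return from_header
    # Assumption: log "-" when neither source has a value.
    return "-"
```

          For the second log entry in the question, the ensured list was empty and the header said "audio/mpeg", so this sketch would return "audio/mpeg".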

          Screen Shot 2014-03-26 at 12.47.36 PM.png

          You can put these rules above your logging rule in the log handler, provided media-type ensured has already been called in the actual policy, or you can put the test in the policy and then reference the user-defined variable in your logs.

          I've attached an example access log definition to get you started.

           

          Message was edited by: cnewman on 3/26/14 1:08:04 PM CDT
          • 2. Re: MediaType.FromHeader vs. MediaType.EnsuredTypes
            bkirk

            This is great.  I changed the solution around a little bit:

             

            Capture1.PNG

             

            If I understand this correctly, it will give me the list of MediaType.EnsuredTypes as a comma-separated list, or, if that is blank, the MediaType.FromHeader. I could probably add the MediaType from extension as a last fallback too. My logs have this field in quotes, so it can contain spaces, commas, and whatever else I want.
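
            As a rough sketch of that variation (Python used only for illustration; the function name media_type_field, the ", " separator, and the "-" placeholder are my assumptions, not MWG properties):

```python
from typing import Sequence


def media_type_field(
    ensured_types: Sequence[str],
    from_header: str = "",
    from_extension: str = "",
) -> str:
    """Hypothetical helper: join all ensured types, else fall back to the
    header value, else to the extension-based value."""
    if ensured_types:
        # Keep every ensured type, comma-separated, in one quoted log field.
        return ", ".join(ensured_types)
    # Fall back in order of preference; "-" marks "no value" (assumption).
    return from_header or from_extension or "-"
```

            Because the field is quoted in the log line, the commas and spaces in the joined value should not break the field boundaries.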

             

            Thank you for the quick response.

            • 3. Re: MediaType.FromHeader vs. MediaType.EnsuredTypes

              Looks reasonable to me. The main reason I only used the first media type from the list was that many reporting products will not link that list properly in reports. They would list each unique combination (application/text, application/html, etc.) as a totally separate value, instead of giving you results when you search for text OR html.

              CSR, Nitro and WR would all do that.
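
              To illustrate the reporting concern, here is a small Python sketch. The sample values are made up for the example and do not come from a real log; it only shows how grouping on the whole field differs from grouping on the first media type.

```python
from collections import Counter

# Hypothetical media type fields as they might appear in three log lines.
log_values = ["text/plain", "text/plain, text/html", "text/html"]

# Grouping on the whole field treats the combined list as its own value.
by_joined = Counter(log_values)

# Grouping on the first entry only keeps one media type per request.
by_first = Counter(value.split(",")[0].strip() for value in log_values)

print(by_joined)
print(by_first)
```

              With the joined field, "text/plain, text/html" is counted as a third, separate value; with the first entry only, that request is counted under "text/plain".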

               

              Good luck!