    URL-Filter for archive.org




      I have a request to whitelist the website http://archive.org. Because this website presents past versions of other websites' content, it is correctly categorized as "Anonymizing Utilities".

      Now the question is whether it is possible to apply our URL Filter rule set to the websites it presents.

      As far as I have seen, there are two indicators that could perhaps be used to identify the displayed site:

      1. The URL path, which contains the archived website's address, e.g. https://web.archive.org/web/20160802092006/http://9gag.com/ (a site that should indeed be blocked in our company).

      2. The HTTP header "Referer", which also points to the site (https://web.archive.org/web/*/http://9gag.com).
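
      To illustrate the first indicator, here is a minimal Python sketch that pulls the embedded address out of a Wayback-style URL. It is only an illustration, not something the gateway runs. The function names are hypothetical, and the pattern assumes the path layout seen in the two examples above (/web/<timestamp or *>/<original URL>), optionally with a two-letter modifier such as "id_" after the timestamp.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Assumed path layout: /web/<up to 14 digits or "*">[<xx>_]/<original URL>
_WAYBACK_PATH = re.compile(
    r"^/web/(?:\d{1,14}|\*)(?:[a-z]{2}_)?/(?P<target>https?://.+)$",
    re.IGNORECASE,
)


def embedded_url(url: str) -> Optional[str]:
    """Return the archived site's URL embedded in a web.archive.org URL, or None."""
    parts = urlsplit(url)
    if parts.hostname != "web.archive.org":
        return None
    rest = parts.path
    if parts.query:
        # Keep a query string that belongs to the embedded URL.
        rest += "?" + parts.query
    match = _WAYBACK_PATH.match(rest)
    return match.group("target") if match else None


def embedded_host(url: str) -> Optional[str]:
    """Return only the host name of the embedded URL, or None."""
    target = embedded_url(url)
    return urlsplit(target).hostname if target else None


if __name__ == "__main__":
    print(embedded_host("https://web.archive.org/web/20160802092006/http://9gag.com/"))
    print(embedded_host("https://web.archive.org/web/*/http://9gag.com"))
```

      The extracted host is what would then have to be looked up in the category database.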


      Does anyone have an idea how to solve this?

          Maybe the easiest way is to overwrite the category for the site archive.org.

          This can be done with a list of websites and their categories. The setting is available in the URL Filter settings.



            Hi Lubomir,

            Thank you for your fast reply. The category overwrite is not exactly what I was looking for, because I cannot manually categorize every website in this site's database :-) The site shows past versions of millions of stored websites. So I was thinking of an automatic URL category overwrite that depends on the content being displayed.

            (At first I thought of an approach similar to the YouTube filter, but YouTube has its own API that returns what you need, so that approach does not carry over to this case.)

              It should be possible to allow access to archive.org and still block the archive.org instances of sites that you would ordinarily block. You can also do this for other sites which have cached content.


              Create a new URL Filter configuration with "Search for and rate embedded URLs" selected. For this example, I'll refer to it as Default with Embedded.


              Create a new Category list that includes all of the categories that you would usually block, but do not include Anonymizing Utilities. For this example, I'll refer to it as Bad Category No Anonymizing Utilities.


              Create a new list for the sites that will be handled this way. I'll refer to this one as Cached Content Sites. I used a wildcard list for future flexibility, but did not use wildcard/regex matching for the entries. Add archive.org and web.archive.org to the list.


              In the same rule set and above your existing rule which blocks specific categories, add a new rule:



              URL.Host matches in list Cached Content Sites AND

              URL.Categories<Default with Embedded> none in list Bad Category No Anonymizing Utilities



              Stop Rule Set.


              With this rule in place, archive.org itself and site content hosted on archive.org should be permitted, but archived copies of any sites which would be blocked through the normal category blocking will still be blocked.


              For example, the URL http://repo.hackerzvoice.net/depot_madchat/reseau/anti-peer2peer-networks.txt is classified as Malicious Software, Malicious Downloads.


              When the site is accessed via archive.org, the URL looks like this:


              http://web.archive.org/web/20150731000606/http://repo.hackerzvoice.net/depot_madchat/reseau/anti-peer2peer-networks.txt


              Because the URL Filter settings for this rule are looking at the embedded URL, the categorizations for that URL will also be considered. The rule will fire with action Stop Rule Set for sites that have no category in the Bad Category No Anonymizing Utilities list, and the original blocking rule will be skipped. If any of a site's categories is in the new list, the rule won't fire, Stop Rule Set won't be applied, and the rest of that specific rule set will be considered.


              One caveat associated with archive.org is that its categorization is currently Education/Reference for web.archive.org and Internet Services for archive.org. So if you're not looking at embedded URLs in your primary URL Filter configuration, ALL of the content hosted on archive.org will be accessible, and the results from the configuration described here will not be what would be expected, unless a category overwrite is applied for "archive.org" to force the categorization back to Anonymizing Utilities.
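
              The decision flow described above, including the caveat, can be sketched in Python. This is only a simplified model for reasoning about the rule order, not gateway syntax. The function names and the contents of the category sets are assumptions for illustration (a real blocking list would contain more categories), and the category results passed in are made up rather than taken from a live lookup.

```python
# List names mirror the ones used in this post; contents are illustrative.
CACHED_CONTENT_SITES = {"archive.org", "web.archive.org"}
BAD_CATEGORY_NO_ANON = {"Malicious Software", "Malicious Downloads"}
# The existing blocking rule also blocks Anonymizing Utilities.
BLOCKED_CATEGORIES = BAD_CATEGORY_NO_ANON | {"Anonymizing Utilities"}


def exemption_fires(host: str, embedded_categories: set) -> bool:
    """New rule: host is a cached-content site AND none of the categories
    (rated with embedded URLs) is in the bad-category list."""
    return (
        host in CACHED_CONTENT_SITES
        and embedded_categories.isdisjoint(BAD_CATEGORY_NO_ANON)
    )


def decide(host: str, embedded_categories: set, primary_categories: set) -> str:
    """Model the rule set: exemption rule first, then the existing block rule."""
    if exemption_fires(host, embedded_categories):
        return "allow"  # Stop Rule Set: the blocking rule is skipped
    if not primary_categories.isdisjoint(BLOCKED_CATEGORIES):
        return "block"  # existing category blocking rule
    return "allow"


if __name__ == "__main__":
    # With a category overwrite back to Anonymizing Utilities:
    print(decide("web.archive.org",
                 {"Anonymizing Utilities"},
                 {"Anonymizing Utilities"}))
    print(decide("web.archive.org",
                 {"Anonymizing Utilities", "Malicious Software"},
                 {"Anonymizing Utilities"}))
    # The caveat: no overwrite and no embedded lookup in the primary settings.
    print(decide("web.archive.org",
                 {"Education/Reference", "Malicious Software"},
                 {"Education/Reference"}))
```

              The last call shows the caveat: the exemption rule correctly declines to fire, but the blocking rule only sees Education/Reference and lets the archived copy through.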

