We have been affected by the problem of web spiders/crawlers that continuously collect data from our website and reuse the content on their own portals. Since my setup does not have a WAF deployed, I thought of building a correlation rule that might help identify this kind of software. The website I am trying to protect serves a very large number of users over the course of a day.
Since a web spider/crawler characteristically goes through a website at a very rapid pace over a short span of time, we can use this behaviour to create an alert.
For example:
Group by: Source IP, Destination IP
Command: GET, PUT
Duration: 10 seconds; distinct values: 250
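To make the rule concrete, here is a minimal sketch of the same logic in Python: count GET/PUT requests per (source IP, destination IP) pair inside a sliding 10-second window and flag pairs that reach 250. The function name, event tuple layout, and thresholds are illustrative assumptions, not tied to any particular SIEM product.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 10   # correlation window from the rule above
THRESHOLD = 250       # distinct-value threshold from the rule above

def detect_crawlers(events):
    """events: iterable of (timestamp, src_ip, dst_ip, method) tuples,
    assumed sorted by timestamp. Returns the set of (src_ip, dst_ip)
    pairs that hit THRESHOLD requests inside any WINDOW_SECONDS span."""
    windows = defaultdict(deque)  # (src, dst) -> timestamps in window
    flagged = set()
    for ts, src, dst, method in events:
        if method not in ("GET", "PUT"):
            continue
        key = (src, dst)
        win = windows[key]
        win.append(ts)
        # evict events that have fallen out of the 10-second window
        while win and ts - win[0] > WINDOW_SECONDS:
            win.popleft()
        if len(win) >= THRESHOLD:
            flagged.add(key)
    return flagged
```

A burst of 300 requests from one IP within a few seconds would be flagged, while an IP making a handful of requests in the same period would not.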
From this alert we can create a watch-list, which can then be used to monitor activity from these IP addresses, e.g.:
Data bytes sent
Number of requests
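The watch-list step could be sketched as a simple per-IP accumulator for the two metrics above. The class and field names here are assumptions for illustration, not a real SIEM schema.

```python
from collections import defaultdict

class WatchList:
    """Track request count and bytes sent for flagged source IPs only."""

    def __init__(self, ips):
        self.ips = set(ips)
        self.stats = defaultdict(lambda: {"requests": 0, "bytes_sent": 0})

    def observe(self, src_ip, bytes_sent):
        # ignore traffic from IPs that are not on the watch-list
        if src_ip in self.ips:
            entry = self.stats[src_ip]
            entry["requests"] += 1
            entry["bytes_sent"] += bytes_sent

    def report(self):
        # snapshot of accumulated per-IP activity
        return {ip: dict(v) for ip, v in self.stats.items()}
```

Feeding it the same log stream used by the detection rule would then give a running profile of each suspected crawler.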
Though I am a little circumspect about whether the duration filter will work for such a short window. There is also a chance of false positives, since ISPs use NAT to provide internet access to many users, and a single NATed IP may end up hitting the threshold we defined.
Do tell me what you think of this and whether anything can be added to improve it.