
Detecting Bots in Apache and Nginx Logs Using Python - marklit
http://tech.marksblogg.com/detect-bots-apache-nginx-logs.html
======
jonknee
I've had solid success detecting bots with a really easy pattern--usage
frequency. Humans don't make request after request for long periods of time,
but almost all bots do. The time between requests is usually pretty consistent
too; not a lot of humans wait exactly X seconds between doing things, or never
take a break (what are the odds a human has made a request every hour for 48
hours straight?).
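
A minimal sketch of that heuristic, assuming the log has already been parsed
into a list of Unix timestamps per client (the thresholds here are
illustrative, not tuned):

    from statistics import pstdev

    def looks_like_a_bot(timestamps, min_requests=50, max_jitter=2.0):
        """Flag clients whose inter-request gaps are suspiciously regular."""
        if len(timestamps) < min_requests:
            return False
        ts = sorted(timestamps)
        gaps = [b - a for a, b in zip(ts, ts[1:])]
        # Humans are bursty; a near-zero spread means metronome-like timing.
        return pstdev(gaps) < max_jitter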

~~~
greglindahl
It's important to realize that there are many kinds of bots. So just because
you found some doesn't mean you've discovered them all!

~~~
jonknee
Of course, I just wanted to suggest another technique that might be helpful
to someone. Just going by IP address like the OP led me to find a lot of VPNs
and the like that I didn't want to detect.

------
languagehacker
I was hoping there would be some machine learning in here. This just seems to
be cross-referencing a couple of different data sources.

~~~
donalhunt
Agreed.

I think the OP needs to break the problem down into 1) access from bots that
identify themselves (e.g. googlebot, bingbot, etc.), 2) access from bots that
masquerade as humans, and 3) activity that is from humans.

#1 should be trivial by looking at the user agents; a sketch of that check is
below. #2 and #3 could utilise machine learning to categorise the behaviour of
the connecting party.
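
A minimal sketch of the user-agent check for #1 (the token list is
illustrative; real crawlers like Googlebot announce themselves in the
User-Agent header):

    import re

    # Tokens used by self-identifying crawlers; extend as needed.
    BOT_TOKENS = re.compile(r'googlebot|bingbot|yandexbot|baiduspider|slurp',
                            re.IGNORECASE)

    def is_self_identified_bot(user_agent):
        """True if the User-Agent admits to being a known crawler."""
        return bool(BOT_TOKENS.search(user_agent or ''))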

It's not particularly clear why the OP wants to detect bots... I suspect it's
to get a clearer signal of what assets are being accessed by humans.

~~~
yeukhon
#2 and #3 are a moot point. Is my Selenium / Python script a bot when I am
reproducing human behaviors for my testing? I guess you could call it a bot or
an automaton. So we need a definition of what is considered a robot.

The best detection methods for human vs. non-human so far have been (1)
introducing a CAPTCHA and (2) looking at view time and interaction (hotspots).
With both we have reasonable criteria to build a somewhat simple model.

If you are working from server access logs, then you need to group resources
together (CSS, JS, images, HTML), or ignore most of the resources, and
calculate view time. Based on some reasonable expectation, a user who views
multiple pages at the same time, or within X seconds of each other, is more
likely a robot than a human, even though we humans are used to opening several
tabs at once when we know our way around a website. And since many test
scripts put a sleep between requests (there's always lag), we will have a
difficult time distinguishing the non-intrusive robots with any reasonable
confidence.
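
A rough sketch of that grouping, assuming the log has been parsed into
(timestamp, path) tuples per client (the extension list is illustrative):

    import posixpath

    # Treat these as static assets rather than page views.
    STATIC_EXTS = {'.css', '.js', '.png', '.jpg', '.gif', '.ico'}

    def page_view_times(requests):
        """requests: iterable of (unix_timestamp, path) for one client."""
        pages = sorted(ts for ts, path in requests
                       if posixpath.splitext(path)[1].lower() not in STATIC_EXTS)
        # Gaps between consecutive page hits approximate view time.
        return [b - a for a, b in zip(pages, pages[1:])]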

If we enable tracing / tracking, then this becomes slightly easier as well,
since we can learn the behavior of "new" and "veteran" users. This is why
tracking is such a privacy issue for many.

------
orf
Seems to be more 'filtering access logs by a blacklist' than actually
detecting bots.

I run a VPN through Hetzner, so requests from my IP are not from a bot (I
hope!). Really you want to look at the paths (filtering out all the /w00tw00t
requests) and the user agents above all, which the author touches on. However,
a whitelist approach is better than a blacklist IMO.

Also, in `in_block` you really want to hoist the `IPAddress(ip)` call out of
the `any()` loop!
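
Something along these lines, assuming the gist uses netaddr as the article
does (a hypothetical reconstruction of `in_block`, not the author's exact
code):

    from netaddr import IPAddress, IPNetwork

    def in_block(ip, networks):
        addr = IPAddress(ip)  # parse once, not once per network
        return any(addr in IPNetwork(net) for net in networks)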

~~~
marklit
Thanks for the heads up, I've updated the gist.

~~~
guitarbill
Loved the write-up. Two more things that might make your Python code less
memory-intensive.

Instead of `open(filename, 'r+b').read().split('\n')` you can use `for line
in open(filename):`, which avoids loading large files into memory. (small
gotcha: `line` will contain the newline character(s), which can be stripped
via e.g. `rstrip`)
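
For example (a minimal sketch; `process` and the filename are placeholders):

    with open('access.log') as log_file:
        for line in log_file:           # streams one line at a time
            process(line.rstrip('\n'))  # drop the trailing newline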

You can also drop the square brackets from calls to functions that take an
iterable, e.g. `any`/`all`, `set`, and `join`. So `join([...])` becomes simply
`join(...)`. Python will use a generator expression instead of constructing
and passing a new list. To quote PEP 289: "generator expressions [are] a high
performance, memory efficient generalization of list comprehensions"
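
For instance (with illustrative data):

    lines = ['10.0.0.1 - ...', '192.168.1.5 - ...']

    # Brackets build the entire list before any() sees a single element:
    found = any([l.startswith('10.') for l in lines])

    # Without them, a generator expression feeds any() lazily, so it can
    # short-circuit on the first match without building the list:
    found = any(l.startswith('10.') for l in lines)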

These can really make a difference with big files/lists, but are a good habit
in any case. I hope it helps in the future!

------
guillem_lefait
You may also want to add the Amazon IP ranges:
http://docs.aws.amazon.com/general/latest/gr/aws-ip-ranges.html
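
That page links to a machine-readable feed of the ranges; a quick sketch of
pulling it and checking an address (again assuming netaddr, as in the
article):

    import json
    from urllib.request import urlopen

    from netaddr import IPAddress, IPNetwork

    AWS_RANGES_URL = 'https://ip-ranges.amazonaws.com/ip-ranges.json'

    def load_aws_networks():
        with urlopen(AWS_RANGES_URL) as resp:
            return [IPNetwork(p['ip_prefix'])
                    for p in json.load(resp)['prefixes']]

    def is_aws(ip, networks):
        addr = IPAddress(ip)
        return any(addr in net for net in networks)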

