

How do you guys detect people hitting your site too often/scraping content? - screamingfrog

I was wondering if you guys manually analyze log files or use a service like Splunk/Loggly/etc. to help with detecting abuse? What is your first step in blocking this behavior?
======
glimcat
Detect:

Look for high-frequency access with low variability in the delay between same-
type resource requests. It's a relatively simple heuristic, but it gives great
precision & recall for classifying scrapers vs. manual access.
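As a rough illustration of that heuristic, here's a minimal sketch in Python. It assumes you've already parsed the access log into (ip, path, timestamp) tuples, and the thresholds are made up and would need tuning per site:

```python
from collections import defaultdict
from statistics import mean, stdev

def flag_scrapers(requests, min_requests=50, max_cv=0.3):
    """requests: iterable of (ip, path, unix_timestamp) tuples.

    Flags IPs that are both frequent and unusually regular in their
    inter-request delays (low coefficient of variation). Humans browse
    in bursts; simple scrapers sleep a fixed interval between requests.
    """
    times = defaultdict(list)
    for ip, path, ts in requests:
        times[ip].append(ts)

    flagged = []
    for ip, ts_list in times.items():
        if len(ts_list) < min_requests:
            continue
        ts_list.sort()
        gaps = [b - a for a, b in zip(ts_list, ts_list[1:])]
        avg = mean(gaps)
        if avg == 0:
            flagged.append(ip)          # burst faster than clock resolution
            continue
        cv = stdev(gaps) / avg          # coefficient of variation of delays
        if cv < max_cv:
            flagged.append(ip)
    return flagged
```

Grouping by same-type resource (e.g. product pages only) before computing the gaps, as described above, cuts down on false positives from asset requests.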

Block:

You can't block a sufficiently determined expert scraper.

You can filter out a lot of scraping if that's a meaningful goal, but it's
often not. A good strategy here is rate limiting + header inspection + CAPTCHA
on high activity. Possibly add transfer quotas if you have a lot of large
assets which might cause problems at low request rates.
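For the rate-limiting piece, here's a minimal sliding-window sketch (per-IP, in-process, with invented limits; a real deployment would more likely use the web server's rate-limit module or a shared store like Redis):

```python
import time
from collections import defaultdict, deque

WINDOW = 60          # seconds
SOFT_LIMIT = 120     # requests/window before serving a CAPTCHA challenge
HARD_LIMIT = 600     # requests/window before rejecting outright

hits = defaultdict(deque)   # ip -> timestamps of recent requests

def check(ip):
    """Return 'ok', 'captcha', or 'block' for this request."""
    now = time.time()
    q = hits[ip]
    q.append(now)
    while q and q[0] < now - WINDOW:   # drop entries outside the window
        q.popleft()
    if len(q) > HARD_LIMIT:
        return "block"
    if len(q) > SOFT_LIMIT:
        return "captcha"
    return "ok"
```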

This will deal with "script kiddies" and automated tools which account for the
majority of cases where scraping could potentially impact the server's ability
to respond to other users. It won't block researchers or other scrapers with
more sophisticated skills.

I've seen people go wild here, to the point of silly things like displaying
database record text as images. Complicated schemes generally hurt UX for
your routine users, while only adding slightly more time to write a scraper
that's already "build once, use until the site is redesigned."

------
raquo
- I haven't used any existing tools, but I've built some simple blockers myself:

- Add hidden-with-CSS honeypot links with URLs formatted in the same pattern
as good links (see the middleware sketch after this list)

- Check user agents - many scrapers forget to change "phantomjs" to something
else

- Maybe block IP ranges of known VPS providers

- Build an access_log DB, define some patterns (e.g. regularity / number of
visits, skewed distribution of pages viewed, etc.) and look up offending IPs
(a query sketch also follows below)

- Decide what to do with offenders - block outright, mangle content (e.g.
obnoxiously watermark images, hide prices, provide less detailed product
descriptions, add your brand to titles/descriptions if these are probably your
competitors scraping your catalog).
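Here's a rough sketch of the honeypot-link and user-agent checks as Flask middleware (the honeypot URL, the UA fragments, and the in-memory ban set are all placeholders; a production version would persist the ban list and expire entries):

```python
from flask import Flask, request, abort

app = Flask(__name__)

BAD_UA_FRAGMENTS = ("phantomjs", "python-requests", "scrapy", "curl")
banned_ips = set()   # in-memory for the sketch; use a DB/Redis in practice

@app.before_request
def screen():
    ip = request.remote_addr
    ua = request.headers.get("User-Agent", "").lower()
    if ip in banned_ips:
        abort(403)
    # Many off-the-shelf scrapers keep their tool's default User-Agent.
    if not ua or any(frag in ua for frag in BAD_UA_FRAGMENTS):
        abort(403)

# Honeypot: link to this URL from every page, hidden with CSS, e.g.
#   <a href="/catalog/specials-2" style="display:none">specials</a>
# Real users never see it; crawlers that follow every link will hit it.
@app.route("/catalog/specials-2")
def honeypot():
    banned_ips.add(request.remote_addr)
    abort(404)
```

Remember to Disallow the honeypot path in robots.txt so well-behaved crawlers like Googlebot don't get themselves banned.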
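And for the access_log DB idea, one way to surface the skewed-distribution pattern is a per-IP ratio of distinct pages to total hits (SQLite and the table/column names here are just for illustration, and the cutoffs are arbitrary):

```python
import sqlite3

def suspicious_ips(db_path, min_hits=200, max_distinct_ratio=0.05):
    """Flag IPs with lots of hits concentrated on very few distinct pages.
    IPs that touch nearly every page exactly once (ratio near 1.0) can be
    worth a second look too, since that's the classic crawl pattern.
    """
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """
        SELECT ip,
               COUNT(*)             AS hits,
               COUNT(DISTINCT path) AS pages
        FROM access_log
        GROUP BY ip
        HAVING COUNT(*) >= ?
        """,
        (min_hits,),
    ).fetchall()
    conn.close()

    flagged = []
    for ip, hits, pages in rows:
        ratio = pages / hits
        if ratio <= max_distinct_ratio or ratio >= 0.95:
            flagged.append((ip, hits, pages))
    return flagged
```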

------
ovechtrick
You could use [http://www.sumologic.com/](http://www.sumologic.com/)

Have it send a report when an IP is seen > x times in your logs.

~~~
screamingfrog
Thanks for the tip!

------
lauradhamilton
You can set up a honeypot. Include links that only a robot can see, and then
whenever something follows one of those links, block the IP address.

