
How to detect web bots? - avastel
https://antoinevastel.com/javascript/2020/02/09/detecting-web-bots.html
======
Minor49er
While interesting and well-intentioned, the advice in the article can cause
issues. The agent isn't downloading CSS or images? Could be a blind user.
Pages being downloaded in quick succession? Could be browser prefetch. Lack of
mouse movements during navigation? It could be a user on a mobile device or
screen reader.

What if the bot is making individual requests from unique IP addresses? What
if it's scraping many pages over time rather than a single smash-and-grab
approach?

The article admits that this is a hard thing to solve. In most cases, it's
probably not very worthwhile to try to detect bots in the ways that are
suggested. Focusing on patterns in the server logs to find out what's being
targeted might be more beneficial. Then slap a login or captcha around
anything valuable that bots shouldn't have access to.
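
A rough sketch of the log-pattern idea, in Python. It assumes the server writes Common Log Format lines; the regex and the `top_targets` helper are illustrative names, not something from the article:

```python
import re
from collections import Counter

# Assumption: Common Log Format, e.g.
# 203.0.113.5 - - [09/Feb/2020:10:00:00 +0000] "GET /prices HTTP/1.1" 200 512
REQUEST_RE = re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "\S+ (\S+)[^"]*" \d{3}')

def top_targets(log_lines, n=3):
    """Return the n most requested paths with their hit counts."""
    hits = Counter()
    for line in log_lines:
        match = REQUEST_RE.match(line)
        if match:  # silently skip lines that do not parse
            hits[match.group(1)] += 1
    return hits.most_common(n)
```

Whatever floats to the top of that list is a candidate for the login or captcha wall.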

~~~
forgotmypw
I use a keyboard-driven browser to reduce wrist strain, and I sometimes get
tagged as a bot, probably for lack of mouse movement.

~~~
bryanrasmussen
Basically, I've written bots for scraping governmental sites and fashion
sites. I've never yet had a problem (although the code on one fashion site
looked like all classes were randomly generated, which implied some attempt to
keep scraping from working), so what I'm wondering is: what kinds of sites
actually go through the trouble of tagging you as a bot? Since you've got
experience, care to share?

~~~
edoceo
Ticketmaster

~~~
MonkeyIsNull
Good response, I definitely never thought of that before. Serious case where
you don't want bots buying stuff up.

~~~
edoceo
Ages ago I was one of those bots. Most fun I've had playing cat-and-mouse.
Their team was actually quite good. And I learned that Perl can solve any
problem.

~~~
GrayTextIsTruth
Have you written about this? Sounds like an entertaining story.

~~~
edoceo
I have not. But I will and will post to HN.

------
wittyusername
My favorite thing to do was to detect a bot and then show the bot a parallel
dataset I'd developed of bogus information. Instead of hitting a block and
working its way around it, competitors scraped a bunch of slightly warbled
nonsense data. Have fun with that gang! Also made it super duper easy to see
who'd try to steal our data.
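
A minimal sketch of how that kind of decoy could work, assuming the data is a price and that some other system has already produced the set of flagged client ids. `decoy_price`, `serve_price` and the 5% skew are all hypothetical:

```python
import hashlib

def decoy_price(real_price, client_id, max_skew=0.05):
    """Return a slightly skewed price that is stable for a given client."""
    digest = hashlib.sha256(f"{client_id}:{real_price}".encode()).digest()
    # Map the first 4 bytes of the hash to a fraction in [-1, 1).
    fraction = int.from_bytes(digest[:4], "big") / 2**31 - 1
    return round(real_price * (1 + fraction * max_skew), 2)

def serve_price(real_price, client_id, flagged_clients):
    """Serve real data to normal clients and decoy data to flagged ones."""
    if client_id in flagged_clients:
        return decoy_price(real_price, client_id)
    return real_price
```

Because the skew is derived from the client id, the same bot always sees the same wrong numbers, which is what makes it possible to trace where a leaked dataset came from.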

~~~
cdoxsey
Also known as a honeypot:
[https://en.wikipedia.org/wiki/Honeypot_(computing)](https://en.wikipedia.org/wiki/Honeypot_\(computing\)).

~~~
Can_Not
Actually, better terminology would be Honeytoken, Trap Street, and/or
Fictitious Entry.

[https://en.m.wikipedia.org/wiki/Honeytoken](https://en.m.wikipedia.org/wiki/Honeytoken)

[https://en.m.wikipedia.org/wiki/Fictitious_entry](https://en.m.wikipedia.org/wiki/Fictitious_entry)

[https://en.m.wikipedia.org/wiki/Trap_street](https://en.m.wikipedia.org/wiki/Trap_street)

------
hinkley
I work for a SaaS company that provides a domain per customer, and one of the
consequences of this that never occurred to me before is that even polite
crawlers and bots can thrash your servers if you have enough TLDs.

Every few months someone discovers the correlation between our log messages
and user agents and we rehash the same discussion about a particular badly
behaved crawler that produces enough log noise that it distracts from
identifying new regressions.

I coordinated an effort to fix a problem with bad encoding of query parameters
ages ago, but we still see broken ones from this one bot.

------
dominickreyes
I just launched a site. I have not mentioned it anywhere and there are no
links to it on the internet, yet the logs are full of bots looking for
vulnerabilities. Judging by what they are looking for, it seems I could
eliminate half of them with a check like if (location.contains('php')).
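
Spelled out in Python, a hedged version of that check might look like this. It assumes the site serves no PHP at all, and the marker list is just an illustrative guess at typical probe paths:

```python
from urllib.parse import urlsplit

# Hypothetical markers; tune them to what actually shows up in your logs.
PROBE_MARKERS = (".php", "wp-login", "wp-admin", "phpmyadmin")

def looks_like_probe(request_target):
    """True if the request path matches a known vulnerability-scan pattern."""
    # Only the path is inspected, so a query string mentioning php is ignored.
    path = urlsplit(request_target).path.lower()
    return any(marker in path for marker in PROBE_MARKERS)
```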

~~~
jesterson
Those are kiddies and most of them are dumb but filtering more complex ones is
not that easy.

Don't worry about backlinks - they know you've registered a domain, so they
are already checking the A records in its DNS. They are much more clever than
10 years ago, when we could expect 0 requests to a site that was just set up.

------
fouc
I think bots should have full rights to access the internet, just like humans.

Discriminating against bots just makes the internet worse for everyone. In the
future, everyone will have personal bots and agents to help automate many
things.

~~~
fouc
I'm actually being serious about this. Hopefully others are worried about the
trend to discriminate against bots.

If corporations can have personhood rights, why can't non-human agents?

~~~
WilliamEdward
Corporations should not have human rights. Bots should not either but they
should be allowed on the internet, no need to bring rights into it.

------
koboll
The message I'm getting from this article is that headless Chrome should offer
an 'undetectable' mode where its unique, fingerprintable globals are replaced
with those from the headful version.

------
zzo38computer
There are many false positives, though. It might believe a user is a bot and
be wrong. Changing the number in the URL is something I commonly do.
Additionally, I often use curl when I want to download a file (rather than
using the browser's download manager).

~~~
to1y
Out of interest, why do you prefer curl?

~~~
zzo38computer
I just don't like the graphical file management.

------
throwaway4392
At the very least you can use these techniques to know what not to do when
scraping, and therefore scrape better.

------
gesman
We detect web bots by analyzing the behavior of all web traffic, and via
clustering we find entities that behave unusually similarly to each
other. This way we detect "clusters" or families of bots.
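
A toy illustration of the general idea, not our production system: a threshold-based single-linkage grouping over per-client feature vectors. The features (for example mean seconds between requests, fraction of requests that hit HTML pages) and the thresholds are assumptions, and the features should be scaled to comparable ranges first:

```python
import math
from itertools import combinations

def similar_clusters(features, max_distance=0.1, min_size=3):
    """Group clients whose feature vectors lie within max_distance of each other.

    features maps a client id to a tuple of numbers.
    """
    parent = {client: client for client in features}

    def find(client):
        # Union-find lookup with path halving.
        while parent[client] != client:
            parent[client] = parent[parent[client]]
            client = parent[client]
        return client

    # Link every pair of clients that behaves almost identically.
    for a, b in combinations(features, 2):
        if math.dist(features[a], features[b]) <= max_distance:
            parent[find(a)] = find(b)

    clusters = {}
    for client in features:
        clusters.setdefault(find(client), set()).add(client)
    # Only groups that are suspiciously large are reported.
    return [members for members in clusters.values() if len(members) >= min_size]
```

Humans tend to end up alone or in tiny groups, while a fleet running the same script collapses into one cluster.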

~~~
madamelic
Interesting!

I wrote something that leverages AWS Lambdas to get around rate limiting, this
solution would've tagged me instantly.

------
cpt1138
Since 1/21 I have identified 153,423 attempts to gain access to some servers I
run. All from unique IP addresses. It's one thing to identify bots, it's quite
another what to do about them.

------
madamelic
> it’s safe to assume that attackers are likely to lie on their fingerprints

I hate this sentiment that everyone who doesn't want to be tracked is a
criminal.

It's the digital version of "You don't have anything to hide if you aren't
doing bad".

---

I don't understand the fear about bots or scraping. As long as bots are
behaving nicely (not slamming servers), they are just as much web citizens as
humans.

The web is about sharing of information and having an entire _company_ about
exterminating the viability of bots is horrifying.

~~~
Kiro
I have bots completely destroying the fun for other players in my game. Are
you saying I should just ignore it because they have "rights"?

~~~
a1369209993
You have _players_ completely destroying the fun for other players in your
game. How you should address this (e.g. bracketed matchmaking, rate limiting,
bans for harassment, etc.) has to do with _what_ they're doing, not whether
the game input is being generated by a human or computer.

~~~
Kiro
I really don't understand the higher purpose of "bots should never be banned
anywhere, ever". You are allowed to think scraping is OK while thinking they
should be banned in other contexts at the same time.

And FYI the bots are destroying the in-game economy in my game by doing
extreme grinding no normal human being is capable of. It's a very hard problem
to tackle with mechanics.

~~~
a1369209993
> bots should never be banned anywhere, ever

I enthusiastically support banning bots for abusive behaviour. I _also_
enthusiastically support banning humans for abusive behaviour.

> extreme grinding no normal human being is capable of

> a very hard problem to tackle with mechanics.

- diminishing returns for over-farming particular areas (per-player, _maybe_
also total)

- stat penalties for playing too long / bonuses for rest ("You have worked a
forty-hour week in this game, don't you have a life you should be getting
back to?")

- moving resources[0] around to break pathfinding, and/or adding poisoned
ones to discourage blind grabbing.

- or just ban people who regularly dump extreme quantities of grind-farmed
stuff into the economy (since that _is_ the behaviour you're trying to stop),
bots or no bots.

0: including enemies, who might, for example, _run away_ from someone who's
been slaughtering them for the past six hours
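
To make the first suggestion in the list above concrete, here is a minimal per-player sketch. The halving curve, the `half_life` default and the function name are illustrative assumptions, not a claim about any particular game:

```python
def farming_yield(base_yield, recent_harvests, half_life=50):
    """Scale a drop by how much this player has already farmed this area.

    The yield halves every `half_life` recent harvests; the counter is
    assumed to decay or reset elsewhere once the player moves on.
    """
    return base_yield * 0.5 ** (recent_harvests / half_life)
```

A casual player never notices the curve, while someone on their thousandth harvest of the same spot gets next to nothing, human or not.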

~~~
Kiro
Sure, those are good ideas (even though most are not applicable to my game
specifically, it's probably not what you imagine) but I presume you understand
that it's much easier for me to just detect and ban bots. Regardless I don't
want bots running around that won't respond or interact with other players.
That itself is a reason to ban them.

I still don't understand why you think bots should be allowed in my game at
all. I don't want them and the players certainly don't want them (bot
accusation is the number one drama).

------
mobilemidget
"..(the OS and its version, as well as whether it is a VM.."

Could anybody link me up with something to read on how to detect it is a VM
based on TLS fingerprint?

~~~
dylz
One of my services routinely bans abusive hosts that repeatedly try to evade
bans, but we notice via TCP fingerprinting that their machine has started
up under a minute ago each time it rotates.

Non-passive: It's also possible to read a GPU name like "VMware SVGA" from JS,
or watch for mismatched/wrong hardwareConcurrency.
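
A server-side sketch of the non-passive checks, assuming a page script has already collected the WebGL renderer string and navigator.hardwareConcurrency and posted them back. The dict keys and the marker list are hypothetical, and "wrong" is interpreted here simply as missing or non-positive:

```python
# Illustrative substrings of virtual or software renderers, not an exhaustive list.
VM_GPU_MARKERS = ("vmware", "virtualbox", "llvmpipe", "swiftshader")

def vm_signals(fingerprint):
    """Return the reasons a client-reported fingerprint looks virtualised."""
    reasons = []
    renderer = str(fingerprint.get("webgl_renderer", "")).lower()
    if any(marker in renderer for marker in VM_GPU_MARKERS):
        reasons.append("gpu")
    cores = fingerprint.get("hardware_concurrency")
    # A real browser reports a positive integer here.
    if not isinstance(cores, int) or cores < 1:
        reasons.append("cores")
    return reasons
```

Treat the result as a signal to weigh, not proof: plenty of legitimate users browse from VMs.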

~~~
mobilemidget
Thanks, will be googling a bit more on these terms and techniques.

------
DailyHN
Bots have a PR problem.

~~~
DailyHN
What about users of RPA software like UiPath, Blue Prism, etc.?

------
Razengan
I wonder if they could just make a new browser standard which lets the server
put the browser into a mode where the browser itself ensures there’s a live
human interacting with the page/app, like using the webcam + OS facial
recognition to assure the server that it’s not a bot.

~~~
fluxsauce
This makes my skin crawl. I understand the intent, but this is morally wrong
on many different levels.

I hate credential stuffing and malicious activity as much as the next person,
but I would never sacrifice user privacy like that.

Also, you can spoof a webcam (but that's not the real problem).

~~~
Razengan
It would be better than the unpaid labor we have to do for Captcha.

~~~
realusername
Captchas don't stop bots anymore, they only stop humans. I don't know why
websites are still using them.

------
brunoTbear
Answering this question was recently a $1B acquisition for F5 with their
purchase of Shape Security.

Will be interesting to see what happens to other industry players like
PerimeterX. Distil was eaten by the corpse of Imperva, and I don't see Akamai
making strong headway with Botman.

Google is going after this too with reCAPTCHA. The HN reaction to that has
been interesting.

It's interesting to me how many comments in these threads talk about scraping
as the issue with bots. Every sale I saw when I worked on this problem was
related to credential stuffing. Seems the enterprise dollars are in the fraud
space, but the HN sentiment is in scraping.

Funny how disconnected the community here can be from what I saw first-hand as
the "real" issue. Makes me wonder what other topics it gets wrong. Surely my
area of expertise isn't special.

~~~
madamelic
I think that scraping is a more divisive and undecided thing. Credential
stuffing is pretty cut-and-dry a bad thing.

No one credential stuffs for innocent fun.

