
High-performance .NET by example: Filtering bot traffic - alexandrnikitin
https://alexandrnikitin.github.io/blog/high-performance-dotnet-by-example/
======
hdhzy
Excellent post showing how to correctly improve code performance using the
scientific method: form a hypothesis, measure the baseline, make the change,
measure the effect, all with real tools and real code.

Thanks for sharing!

~~~
Someone1234
I also found it an interesting post in that it kind of inadvertently proves
that for most situations you shouldn't optimise to this extent.

Meaning, yes, the OP got impressive performance improvements but the code is
also completely unreadable and utilises unsafe code sections which could
expose you to security problems/memory leaks/memory corruption. Not to mention
they've recreated and will need to maintain an in-house version of the
Dictionary class.

Their first optimisations (from Enumerator to List, and from Any() to Count())
are something every codebase could use. Most of their other optimisations make
the code a maintenance minefield.
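
For readers who haven't clicked through, that change is roughly of this shape
(a sketch, not the article's exact code):

    using System.Collections.Generic;
    using System.Linq;

    static class UaChecks
    {
        // Before: Any() enumerates through IEnumerable<T>, which boxes List<T>'s
        // struct enumerator and goes through interface dispatch on every element.
        public static bool ContainsBotLinq(IEnumerable<string> userAgents) =>
            userAgents.Any(ua => ua.Contains("bot"));

        // After: typed as List<T> with an indexed loop: no enumerator, no LINQ.
        public static bool ContainsBotList(List<string> userAgents)
        {
            for (var i = 0; i < userAgents.Count; i++)
                if (userAgents[i].Contains("bot")) return true;
            return false;
        }
    }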

Plus programmers are expensive. Hardware is cheap. Why spend time on code
that's harder to write and harder to maintain in the medium to long term when
instead you could just throw money at hardware and call it a day? Just food
for thought, not really a criticism in and of itself.

PS - Please don't take this post too seriously. I am not really being
critical, just playing devil's advocate. I actually enjoyed the linked article
a lot.

~~~
jackmott
Hardware might be cheap at some scales, but what if your service gets popular?

What if latency for a single request can't be improved with more cores?

What if your product is used by consumers who may have old hardware, or
phones, or watches, or laptops, and they want their battery to last?

What if you consistently practiced writing high-performance code? Maybe then
it wouldn't seem "unreadable" to you any more.

What if Linq-like higher order functions weren't slow?
[https://github.com/jackmott/LinqFaster](https://github.com/jackmott/LinqFaster)
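
For context, it ships eager, F-suffixed versions of the usual operators that
work directly on arrays and lists; something like this (a rough sketch, see
the README for the exact API):

    using JM.LinqFaster; // namespace from memory; double-check the README

    int[] values = { 1, 2, 3, 4 };

    // Eager, array-based Select/Sum equivalents: no enumerator allocations,
    // with the trade-offs listed under the repo's Limitations section.
    var sumOfSquares = values.SelectF(x => x * x).SumF();
    System.Console.WriteLine(sumOfSquares);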

What if slow software was common today because of modern attitudes, and I
wasn't seeing any increase in stability or features to show for it?

~~~
deanCommie
> what if your service gets popular?

Then re-design for map-reduce, and scale horizontally

> What if latency for a single request can't be improved with more cores?

Then look into pre-computing and caching

> What if your product is used by consumers who may have old hardware, or
> phones, or watches, or laptops, and they want their battery to last?

I thought we were talking about server requests? If we are, then offload this
work to the server

> What if you consistently practiced writing high-performance code? Maybe then
> it wouldn't seem "unreadable" to you any more.

But it's not all about you. Unless you're working on a pet project, or you
have the credibility and reputation to make the final call on a significant
open source project, you might get hit by a bus tomorrow. Or, if you do a
really good job, your company will need you to be a force multiplier and teach
a dozen others to try to imitate you. Even if you're a "10x" programmer.

And by the way, when you optimize code THIS much, any refactoring or tweaks to
new features cause your optimizations to get tossed out, and you have to start
over from scratch.

> What if Linq-like higher order functions weren't slow?
> [https://github.com/jackmott/LinqFaster](https://github.com/jackmott/LinqFaster)

[https://github.com/jackmott/LinqFaster#limitations](https://github.com/jackmott/LinqFaster#limitations)

> What if slow software was common today because of modern attitudes, and I
> wasn't seeing any increase in stability or features to show for it?

Except you are, and you don't even realize it. Optimizations like the ones in
this blog post matter a LOT in client software. Be it apps or websites,
anything that runs on the client will need this kind of attention sometimes.

But this guy is writing server software. Micro-optimizing on the server side
the way he is doing is silly.

~~~
alexandrnikitin
> But this guy is writing server software. Micro-optimizing on the server side
> the way he is doing is silly.

Khm... What if you have millions of requests per second, with tight latency
requirements measured in milliseconds, and a bunch of business logic to fit
into that? Then such optimizations aren't so silly. There are different
scenarios on both the client and the server side.

~~~
dahauns
Heh. You mean, like the example in the article? :)

------
throwasehasdwi
Blocking access based on arbitrary user agent strings is a really bad idea.
Every single bad bot will avoid known user agent strings or pretend to be
Google, so you're only blocking the well-behaved ones. Plus there are
thousands of browser versions out there, so there's a very good chance you're
blocking some users for no reason.

The proper way to do this is to block by IP, based on behavior. Block IPs that
are slowing down the site, or throw up a captcha like Cloudflare does.

Blocking bots sounds great but it just brings Google one step closer to a
monopoly. Even good bots just pretend to be people nowadays because lots of
people are implementing naive site protection strategies.

Edited: to be less mean

~~~
alexandrnikitin
Yes, you're right. There are many ways to block robots: IP, UA, behaviour
analysis. An advertising company has to have UA-based filtering to be
compliant with standards. However, the focus of the blog post is on
performance rather than on how to block bots.

------
senorjazz
Rather than blocking on UA, just add some honeypots. An invisible link, for
example: any bot that pulls that page gets blocked, as scrapers tend to pull
all the links from a page and follow them.

Use robots.txt to ban the pulling of specific pages. Bots ignore robots.txt
99% of the time, so if they pull those pages: block.

Check how quickly pages are pulled. If that rate passes a threshold: block.
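
Sketched as ASP.NET Core middleware, that combination could look roughly like
this (the honeypot path, the threshold and the in-memory stores are all made
up for illustration):

    using System;
    using System.Collections.Concurrent;

    var app = WebApplication.CreateBuilder(args).Build();

    var blockedIps = new ConcurrentDictionary<string, DateTime>();
    var requestCounts = new ConcurrentDictionary<string, int>(); // reset per time window in real code

    app.Use(async (context, next) =>
    {
        var ip = context.Connection.RemoteIpAddress?.ToString() ?? "unknown";
        var path = context.Request.Path.Value ?? "";

        // Honeypot: the page is linked invisibly and disallowed in robots.txt,
        // so only bots that ignore both will ever request it.
        if (path.StartsWith("/honeypot", StringComparison.OrdinalIgnoreCase))
            blockedIps.TryAdd(ip, DateTime.UtcNow);

        // Rate threshold: too many requests from one IP in the window => block.
        if (requestCounts.AddOrUpdate(ip, 1, (_, c) => c + 1) > 100)
            blockedIps.TryAdd(ip, DateTime.UtcNow);

        if (blockedIps.ContainsKey(ip))
        {
            context.Response.StatusCode = 403;
            return;
        }

        await next();
    });

    app.Run();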

~~~
alexandrnikitin
Yes, using honeypots is one of the ways to identify bots. But that wasn't the
focus of the post. I'll add some clarification.

------
doubleplusgood
I did something similar with nginx, the data file from 51Degrees, and some Lua
code; each instance only handles 10-20k requests/sec, so no clever
optimization was needed.

~~~
oblio
Would you mind posting the Lua code?

~~~
doubleplusgood
Hi,

I've made a gist[0]; feel free to get in touch via GH if you'd like to discuss
it further.

[0] -
[https://gist.github.com/marklr/ae0c2f1eb61855d13cde6cef6bf63...](https://gist.github.com/marklr/ae0c2f1eb61855d13cde6cef6bf63541)

------
NKCSS
I'd probably store cached results in Dictionary<int, HashSet<string>> allowed,
notAllowed; where the int key is the length of the user agent. This should
probably be blazing fast as well, instead of repeatedly doing those lookups.
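
Roughly this shape (the names and the fullCheck fallback are just
placeholders, and it isn't thread-safe; it's only meant to show the idea):

    using System;
    using System.Collections.Generic;

    static class UaCache
    {
        // Verdicts cached per UA string, bucketed by UA length as suggested above.
        static readonly Dictionary<int, HashSet<string>> Allowed = new();
        static readonly Dictionary<int, HashSet<string>> NotAllowed = new();

        public static bool IsBot(string ua, Func<string, bool> fullCheck)
        {
            var key = ua.Length;
            if (NotAllowed.TryGetValue(key, out var bad) && bad.Contains(ua)) return true;
            if (Allowed.TryGetValue(key, out var good) && good.Contains(ua)) return false;

            var verdict = fullCheck(ua); // the expensive lookup runs only once per new UA
            var bucket = verdict ? NotAllowed : Allowed;
            if (!bucket.TryGetValue(key, out var set)) bucket[key] = set = new HashSet<string>();
            set.Add(ua);
            return verdict;
        }
    }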

~~~
alexandrnikitin
I doubt that exact approach will work. There are tens of thousands of
different UAs (maybe 100K). Perhaps some kind of tiny cache (a few CPU cache
lines) for the most popular UAs could help. But again: measure, measure,
measure :)

~~~
NKCSS
Publish your test set and I can look at it :)

~~~
alexandrnikitin
I'm afraid I can't do that because of proprietary data. I think I can come up
with analogous tests using open data. I'll let you know ;)

------
frik
Good post!

A lot of manual work with various perf tools.

What's a bit missing is some production performance monitoring (APM) that
gives you such data, with no manual interaction.

~~~
alexandrnikitin
I intend to write a separate blog post about low-overhead production
monitoring (not sure when that will happen, though).

------
tener
So, the industry standard requires them not to serve ads to the bots... which
means they have implemented the ad blocking themselves?

------
brilliantcode
What if the "grey" traffic came from residential IP addresses using a normally
distributed range of user agents? How would you reliably distinguish them from
regular traffic?

~~~
Benfromparis
Basically, we use two sorts of techniques: technical and behavioural.

Technical: if the User-Agent claims to be a regular browser (say, Chrome 43),
we check at the network level whether the client implements the HTTP protocol
the way Chrome 43 usually does, and on the JS side whether the JavaScript
rendering is correct for Chrome. If it is a real Chrome, we then check whether
the browser is controlled by an automation tool.

Behavioural: we check whether the sequence of requests is regular according to
how the website is normally used.
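
A toy illustration of the technical side; the expected-header list below is
invented, and a real fingerprint covers far more (TLS, HTTP/2 framing, JS
rendering, and so on):

    using Microsoft.AspNetCore.Http;

    static class BrowserConsistency
    {
        // Toy check: does a request that claims to be Chrome carry the request
        // headers a real Chrome normally sends? The expected set below is invented
        // for illustration; a real fingerprint is far richer.
        public static bool LooksConsistentWithChrome(HttpRequest request)
        {
            var ua = request.Headers["User-Agent"].ToString();
            if (!ua.Contains("Chrome/")) return true; // not claiming Chrome, nothing to verify

            string[] expected = { "Accept", "Accept-Encoding", "Accept-Language" };
            foreach (var header in expected)
                if (!request.Headers.ContainsKey(header)) return false;

            return true;
        }
    }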

Disclaimer: I'm working at [https://datadome.co](https://datadome.co), a bot
protection tool.

