
Ask HN: Should we block non-gzip HTTP bots to reduce bandwidth costs? - jotto
We run web servers on EC2 (behind ELB) and are dealing with high bandwidth charges.
The bulk of the bandwidth charges are due to bots scraping/crawling/pinging without gzip.

Is it safe to block requests made without the Accept-Encoding header? (And what HTTP status code should be used? 400?) Naively, it seems it would only inconvenience poorly written and uninvited bots (SEO spam, bots probing for exploits, etc.)

Or should we respond with gzip compression no matter what?

Or should we stop using AWS, since the bandwidth is too expensive because of situations like this?
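If you do go the blocking route, a minimal sketch in nginx (assuming nginx fronts the app; the directives are standard, but the status-code choice is a judgment call — 406 Not Acceptable is arguably the most semantically accurate, though 400 or 403 would also work):

```nginx
# Reject requests whose Accept-Encoding header does not mention gzip.
if ($http_accept_encoding !~* gzip) {
    return 406;
}
```

Note this will also turn away any legitimate legacy clients or misconfigured proxies that don't advertise gzip support, as replies below point out.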
======
Someone1234
Have you considered banning the problematic bots using robots.txt[0]? Then if
they ignore that, you could block them more aggressively.

My concern about gzip is that it could impact legacy clients or real people
behind poorly configured proxies at places like businesses, colleges, and
public hotspots.

[0]
[https://en.wikipedia.org/wiki/Robots_exclusion_standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard)
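For reference, a robots.txt entry banning a specific crawler looks like this (the user-agent name here is a placeholder, not a real bot):

```
User-agent: BadBot
Disallow: /
```

Well-behaved crawlers honor this; the bots driving exploit scans and SEO spam generally won't, which is why the more aggressive measures below may still be needed.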

------
assafmo
I'd block it and wait to see if any of your clients notice... If it breaks
functionality for some of them, help them fix it and explain that you were
addressing performance issues so you can reduce prices down the road.

------
mozumder
I just ignore the Accept-Encoding header and send gzip anyway.

Always send compressed data. All real clients accept it. The rest is noise.
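For static content, nginx can actually do this with the gzip_static module (a sketch, assuming you pre-compress your assets to `.gz` files at deploy time):

```nginx
# Serve the pre-compressed .gz version of static files to every
# client, whether or not it advertised gzip support.
# Requires ngx_http_gzip_static_module.
gzip_static always;
```

This only covers static files; for dynamic responses, nginx's on-the-fly `gzip` module still honors the client's Accept-Encoding header, so truly forcing compression there would need application-level handling.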

------
toomuchtodo
Have you considered using Nginx directives to slow bots down/rate limit them?

[http://alex.mamchenkov.net/2017/05/17/nginx-rate-limit-user-agent-control-bots/](http://alex.mamchenkov.net/2017/05/17/nginx-rate-limit-user-agent-control-bots/)

Also, can you front your EC2 cluster with Cloudfront, Cloudflare, or another
CDN in order to reduce your outbound EC2 data costs?

------
stusmall
The company I work for makes a great next-gen WAF. Along with identifying and
protecting against malicious entities, we can also help reduce load from
heavy-handed bots.

Reach out at [https://threat-x.com/contact](https://threat-x.com/contact)

------
tyingq
Serve a "blocked" page with some simplistic captcha to unblock?

If you see a lot of clients using the captcha, then consider backing it out...

Or maybe 302 them to an aggressively Cloudflare-cached copy of your site
that's no-indexed?

