Hacker News | davidfischer's comments

They absolutely do. Every sponsorship you see on a podcast, a YouTube video, or a stream is a contextual ad. Many open source sponsorships are actually a form of marketing. You could argue that search ads are pretty contextual, although there's more at work there. Every ad in a physical magazine is a contextual ad. Physical billboards take into account a lot of geographical context: the ads you see driving in LA are very different from the ones you see in the Bay Area. Ads on platforms like Amazon, Home Depot, etc. are highly contextual and based on search terms.


Founder of EthicalAds here. In my view, this is only partially true and publishers (sites that show ads) have choices here but their power is dispersed. Advertisers will run advertising as long as it works and they will pay an amount commensurate with how well it works. If a publisher chooses to run ads without tracking, whether that's a network like ours or just buyout-the-site-this-month sponsorships, they have options as long as their audience generates value for advertisers.

That said, we 100% don't land some advertisers when they learn they can't run third-party tracking or even third-party verification.


My employer, Read the Docs, has a blog post on the subject (https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse...) of how we got pounded by these bots to the tune of thousands of dollars. To be fair, though, the AI company that hit us the hardest did end up compensating us for our bandwidth bill.

We've done a few things since then:

- We already had very generous rate limiting rules by IP (~4 hits/second sustained) but some of the crawlers used thousands of IPs. Cloudflare has a list that they update of AI crawler bots (https://developers.cloudflare.com/bots/additional-configurat...). We're using this list to block these bots and any new bots that get added to the list.

- We have more aggressive rate limiting rules by ASN for common hosting providers (e.g. AWS, GCP, Azure), which also hits a lot of these bots.

- We are considering using the AI crawler list to rate limit by user agent in addition to rate limiting by IP. This would allow well-behaved AI crawlers while blocking badly behaved ones. We aren't against crawlers generally.

- We now have alert rules that fire when we get a certain amount of traffic (~50k uncached reqs/min sustained). This is basically always some new bot cranked to the max, usually an AI crawler. We see this roughly monthly and we just ban them.

Auto-scaling made our infra good enough that we don't even notice big traffic spikes. The downside of that is that the AI crawlers were hammering us without causing anything noticeable. Being smart with rate limiting helps a lot.
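For anyone curious what "generous rate limiting by IP" looks like mechanically, here's a minimal token-bucket sketch in Python. This is an illustration, not Read the Docs' or Cloudflare's actual implementation; the rate and burst numbers are just examples in the spirit of the ~4 req/s figure above.

```python
import time
from collections import defaultdict

class TokenBucket:
    """Sustained-rate limiter: each key (e.g. an IP or ASN) gets a bucket
    that refills at `rate` tokens/second up to a maximum of `burst` tokens.
    A request is allowed only if a token is available."""

    def __init__(self, rate=4.0, burst=40):
        self.rate = rate
        self.burst = burst
        # key -> (tokens remaining, timestamp of last update)
        self.buckets = defaultdict(lambda: (float(burst), time.monotonic()))

    def allow(self, key):
        tokens, last = self.buckets[key]
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1:
            self.buckets[key] = (tokens, now)
            return False
        self.buckets[key] = (tokens - 1, now)
        return True
```

The catch the comment points out: when a crawler spreads requests across thousands of IPs, each key stays under its per-IP limit, which is why the ASN-level limits and the bot list matter.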


I'm not the poster you're responding to but I'm one of the founding team of EthicalAds. We're a small team, focused exclusively on marketing to devs, and really trying to show high-quality ads without tracking people (ads are contextually targeted).

You can get a feel for what you'll earn here[1]. Basically you earn 70% of the gross of what we charge advertisers (see advertiser pricing[2]). Keep in mind these are ad views which aren't quite the same as pageviews. They're a subset.
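As a rough sketch of how the revenue share works: the 70% figure is from the comment above, but the CPM and traffic numbers below are hypothetical, just to show the arithmetic (see the linked calculator for real figures).

```python
def publisher_earnings(ad_views, advertiser_cpm, revenue_share=0.70):
    """Estimate monthly publisher earnings: gross advertiser spend
    (CPM = cost per 1,000 ad views) times the publisher's revenue share."""
    gross = ad_views / 1000 * advertiser_cpm
    return round(gross * revenue_share, 2)

# e.g. 100k ad views/month at a hypothetical $3.00 CPM:
# publisher_earnings(100_000, 3.00) -> 210.0
```

Remember that ad views are a subset of pageviews, so actual earnings per pageview are lower than a naive pageview-based estimate.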

Ads are a straightforward path to monetization but not always the best one. If you can make a project work as SaaS or really make sponsorships work for you (this requires effort), those will definitely earn A LOT more money per pageview than ads. Ads require a lot of traffic to work well. Usually you want high tens to hundreds of thousands of pageviews per month.

As for what we look for in publishers (sites that show ads), we're usually looking for high-quality dev-focused sites or projects that don't want to just show Google ads. Per ad, publishers will earn much more with us than with Google display ads, but if you want to stick 4-5 Google ads on your site, video ads, or the like, we can't compete with that, and we don't want our ads on sites that do that. Devs hate them. My email is in my bio if you want to discuss further.

Regardless, good luck on the projects!

[1]: https://www.ethicalads.io/publishers/calculator/

[2]: https://www.ethicalads.io/advertisers/pricing/


My employer, Read the Docs, is a heavy user of Cloudflare. It's actually hard to imagine serving as much traffic as we do, as cheaply as we do, without them.

That said, for publicly hosted open source documentation, we turn down the security settings almost all the way. Security level is set to "essentially off" (that's the actual setting name), no browser integrity check, Tor-friendly (onion routing on), etc. We still have rate limits in place but they're pretty generous (~4 req/s sustained). For sites that don't require a login and don't accept inbound leads or anything like that, that's probably around the right level. Our domains where doc authors manage their docs have higher security settings.

That said, being too generous can get you into trouble so I understand why people crank up the settings and just block some legitimate traffic. See our past post where AI scrapers scraped almost 100TB (https://news.ycombinator.com/item?id=41072549).


There are a few ways first-party cookies can track you. Probably the biggest single one is Google Analytics, which by default uses only first-party cookies. Even without cookies at all, GA could track you across the web, although first-party cookies make this a little easier and "better". First-party cookies can also help trackers in other ways, like CNAME cloaking[1], which basically makes a first-party cookie function similarly to a third-party one.
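To make the CNAME cloaking idea concrete, here's an illustrative sketch. The tracker list and the resolved CNAME map are hypothetical inputs (real detection tools do live DNS lookups); the point is that a first-party-looking hostname can alias a tracker's domain.

```python
# Hypothetical denylist of known tracker domains (not a real list).
KNOWN_TRACKERS = {"tracker.example-analytics.com"}

def find_cloaked(cnames):
    """Given a mapping of first-party hostname -> resolved CNAME target,
    return the hostnames whose CNAME points at a known tracker domain."""
    return {
        host
        for host, target in cnames.items()
        if any(target == t or target.endswith("." + t) for t in KNOWN_TRACKERS)
    }

# Hypothetical DNS results for a publisher's subdomains:
resolved = {
    "metrics.publisher.com": "tracker.example-analytics.com",  # cloaked
    "cdn.publisher.com": "publisher.cdn-host.net",             # benign
}
```

Because `metrics.publisher.com` is a subdomain of the site you're visiting, cookies it sets are first-party as far as the browser is concerned, even though the traffic terminates at the tracker.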

Disclosure: I work for a small privacy focused ad company.

[1] https://webkit.org/blog/11338/cname-cloaking-and-bounce-trac...


SRI generally won't work here because the served polyfill JS (and therefore the SRI hash) depends on the user agent/headers sent by the user's browser. If the browser says it's ancient, the resulting polyfill will fill in a bunch of missing JS modules and be a lot of JS. If the browser identifies as modern, it should return nothing at all.

Edit: In summary, SRI won't work with a dynamic polyfill which is part of the point of polyfill.io. You could serve a static polyfill but that defeats some of the advantages of this service. With that said, this whole thread is about what can happen with untrusted third parties so...
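To illustrate why a pinned hash can't work here, this sketch computes an SRI value the way a build tool would (sha384, base64-encoded). The polyfill bodies are made up; the point is that two different responses can never match one pinned `integrity` attribute.

```python
import base64
import hashlib

def sri_hash(js_bytes):
    """Compute a sha384 Subresource Integrity value for a script body,
    in the "sha384-<base64 digest>" format used by the integrity attribute."""
    digest = hashlib.sha384(js_bytes).digest()
    return "sha384-" + base64.b64encode(digest).decode()

# Hypothetical responses for the same polyfill URL:
old_browser = b"/* Promise, fetch, Array.from polyfills... */"
modern_browser = b""  # modern browser: nothing to polyfill

# Different bytes -> different hashes, so one pinned hash can't cover both.
assert sri_hash(old_browser) != sri_hash(modern_browser)
```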


Oooft. I didn't realize it's one that dynamically changes its content.


So maybe it’s less that the article is selling something and more that you just don’t understand the attack surface?


It absolutely would work if the browser validates the SRI hash. The whole point is to know in advance what you expect to receive from the remote site and verify the actual bytes against the known hash.

It wouldn’t work for some ancient browser that doesn’t do SRI checks. But it’s no worse for that user than without it.


The CDN in this case is performing an additional function which is incompatible with SRI: it is dynamically rendering a custom JS script based on the requesting User Agent, so the website authors aren't able to compute and store a hash ahead of time.


I edited to make my comment more clear but polyfill.io sends dynamic polyfills based on what features the identified browser needs. Since it changes, the SRI hash would need to change so that part won't work.


Ah! I didn’t realize that. My new hot take is that sounds like a terrible idea and is effectively giving full control of the user’s browser to the polyfill site.


And this hot take happens to be completely correct (and is why many people didn't use it, in spite of others yelling that they were needlessly re-inventing the wheel).


Yeah... I've generated composite polyfills with all the pieces needed by the oldest browser I had to support; unfortunately, all downstream browsers would get them too.

Fortunately around 2019 or so, I no longer had to support any legacy (IE) browsers and pretty much everything supported at least ES2016. Was a lovely day and cut a lot of my dependencies.


They are saying that because the content of the script file is dynamic, based on the user agent and what that user agent currently supports in-browser, the integrity hash would also need to be dynamic, which isn't possible to know ahead of time.


Their point is that the result changes depending on the request. It isn't a concern about the SRI hash not getting checked; it's that you can't realistically know what to expect in advance.


Keychain Access has not been "fine". It's had multiple unaddressed data loss bugs. For example, Keychain lost all passwords from all Keychains after the Catalina update[1] and this wasn't fixed in the next 3 Catalina minor updates. Multiple users reported the issue to Apple and the response was crickets. Even if you restored the passwords, it helpfully deleted them all again. I switched to 1Password and declared Keychain Access a lost cause. I don't think I'll be giving them a second chance here.

[1] https://discussions.apple.com/thread/250722178


That does seem like one area where Apple is at fault and Microsoft does better: triaging actual bugs.

Apple has devs. Maybe some teams are short-staffed, but they fix things.

What Apple doesn't seem to have is a functional bugfix priority loop that includes customer input and provides feedback.


It starts with the horrendous thing called “Feedback Assistant”. Black hole if I ever saw one.


Depends on how we define "fine," but your own post clarifies that it was "website passwords -- but not app passwords, secure notes, certs, or keys." That's a pretty big difference compared to "all passwords" and seems like it affected a small number of people. In any case, if we're using anecdotes, I haven't had any issues with it so far and it's been decades. Given how 1Password has been getting shittier over time, I've been looking for an alternative, and I for one am going to give this a shot. You can check in with me in a few more decades and ask me if it went okay.


While I agree with you to an extent, this is not a very good way to check actual in-store prices. Most grocery stores charge a different rate for products that are delivered or even curbsided than they do if you go to the store and buy it yourself. This is true even if you go directly vs. going through an intermediary like DoorDash.

I live in coastal CA and the cheapest eggs at my local market are $3.49/dozen. Trader Joe's and Costco are closer to $2.20/dozen if you just want plain white eggs. The moment you go organic, the cheapest is Costco at about $4/dozen.

Edit: Just to be clear, at Costco you buy 2 dozen rather than 1 and I've divided the price to a single dozen.


For other stores the online price might be different from the in-store price. But I think for Safeway, the online price is the same as the in-store price (which is why the webpage says "Shopping at <address>").


I can't confirm if the price is the same or not, but their terms[1] specifically mention that the price can be different.

> Prices for Offerings you order for delivery or pickup through the Online Grocery Ordering Service may be higher than the prices for such Offerings in our physical stores.

Most outlets I've seen (e.g. Target) are the same in that they just list a higher price on the website than it costs in-store, and they're upfront about this. It takes me 20-25 minutes of in-store picking, including checkout, to shop for my weekly groceries. Even if that's done by a minimum wage worker (~$17/hr here), that's ~$6-8 of service on top of them bringing it out to the curb. In addition, eggs are usually specially packed in their own bag (frequently with a sticker labeled "eggs") when they're bought online and curbsided. It seems a bit naive to me to think that all this service would just be included/free.
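The labor math above works out like this (the $17/hr wage and 20-25 minute figures are the ones from this comment, not official numbers):

```python
def picking_cost(minutes, hourly_wage=17.00):
    """Labor cost of in-store order picking at a given hourly wage."""
    return round(minutes / 60 * hourly_wage, 2)

# 20-25 minutes at ~$17/hr:
# picking_cost(20) -> 5.67
# picking_cost(25) -> 7.08
```

That's the ~$6-8 of embedded service cost before delivery or curbside handling is even counted.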

[1] https://www.albertsonscompanies.com/policies-and-disclosures... (linked from safeway's footer)


Hmm, interesting. I'll need to check the in-store egg prices when I go next to see if they match the website.


Healthcare providers and insurers in the US are bound by HIPAA privacy rules, but data brokers (mentioned in the article) and the ads industry generally are not. For example, if she used an app in the doctor's waiting room that shared/sold location data to a data broker, they can use her location data for retargeting purposes. There have been many cases in the past where advertisers targeted users based on visiting medical or other sensitive locations.

As to how they mailed it to her and got her home address, a data broker who has location data can fairly easily determine a user's home address from that data. Many brokers and networks may also already have an association between a "pseudo-anonymous advertising ID" and real user with name and address. Not saying that location-based retargeting happened this time as the article doesn't give us enough to go off of and other types of retargeting are another possibility.

Overall, I think it's unlikely that the provider or insurer shared her data and other alternatives are more likely.

Disclosure: I work in the ads industry but on contextual targeting only. Some location-based retargeting is terrifying and will probably eventually be criminal. It's a bit of the wild west right now.


I would assume a more likely scenario (although yours is also plausible!) is that someone with cancer will almost certainly search the internet at some point for information about cancer and their treatments. This will immediately be sucked up by the pervasive surveillance economy and used to extract the maximum amount of marginal revenue attainable through any means necessary. You don't need to know they're in a doctor's waiting room; using the internet for information retrieval will specifically inform everyone interested in paying for it what you are searching for.


Yes, I agree. It's also quite likely that people who know people with cancer will search about cancer, and sadly some of them will later need to purchase cremation services. This means that statistically it's not a bad idea to target people who have searched for cancer with cremations. This seems like the most likely explanation to me.

(Edit to add a meta note: Apparently this has to be said on Hacker News because people can't distinguish between someone presenting facts and someone making a defense, but I'm not defending the practice. I think it's abhorrent. But if we can't dispassionately analyze reality to try and understand the motivations, then we've really abandoned reason and lost our way).


All the medical information sites (WebMD, Drugs.com) are filled with ad beacons.


Ya, other types of retargeting like this are also likely. The jump from visiting a website to an advertiser doing physical mailings isn't a big one (political advertising uses this a lot). Long story short, she was probably retargeted based on her own actions and probably not because the insurer or provider did anything illegal.

Edit: I don't want to sound like I'm blaming the victim here. That's not my intent. I just don't think blaming the insurer or provider is fair either. I dump the blame on the data broker/ad network and to a far lesser extent the advertiser.


I wonder what the effect would be of extending HIPAA protections to inferred information. If you have inferred something about a person that is protected by privacy laws, should that inference itself also be protected? How much of a shield should "we're not 100% sure, so it's just a very well-informed guess" be?

I have my own story about advertisers inferring personal information. Relationship status isn't protected, but the last time I went through a breakup, I was suddenly inundated with dating site ads. I don't feel like my shopping or web browsing habits changed, but they must have for the advertisers to figure it out.


> I wonder would the effect would be of extending HIPAA protections to information that you have inferred.

That would be helpful. Also, HIPAA itself isn't exactly a panacea and is full of loopholes. Having effective medical privacy laws would be even better.


I'd just like effective privacy laws in the US generally.


> There have been many cases in the past where advertisers targeted users based on visiting medical or other sensitive locations.

Yep, I used to get ads like that all the time during my cab driving days.

I would often ponder how they made any inferences from my location data, because I went to so many different places. There were definitely patterns (hauling around the same people once you learn their schedule is a big part of the job), but using them to sell me stuff seems worthless.

Now that the YouTubes of the world have taken up the adblocker fight, I still get all sorts of ads for medical stuff that has no direct relation to me. I do try to keep up with all the complicated "don't track me, you fucking stalkers" clicky buttons they like to add, so perhaps I just fall into an age group where their ad dollars shine? Dunno.


I remain convinced that something on my phone listens to incidental conversations. Yesterday my wife was asking about the difference between our (past) Plymouth Voyager and the current Chrysler Pacifica. Today I get an ad in my Google (Android) feed for the Pacifica. I haven't looked at mini-vans in decades and neither of us searched for anything related.

We had just visited the Sloan museum in Flint, MI which has an extensive Buick display so an ad for a Buick would not have been unexpected.

Coincidences like this have happened too many times to be coincidental.


I think these conspiracy theories happen because people don’t understand how easy it is to leak data and how easy it is for data collectors to gather metadata and make a conclusion. Metadata is incredibly powerful and a lot of non-data scientists don’t realize the level of sophistication that companies have in their possession.

The classic example is Target predicting your pregnancy based on specific purchase behaviors. All they have to know is a consistent identifier and your purchase history and they can predict whether you’re pregnant. There’s no need to listen in on conversations or obtain other more detailed user data.

Also, a lot of “private” services and apps really don’t promise jack shit in their privacy policy. They are probably all gathering and selling the data nearly in real time. Their privacy policies are often far more broad, vague, and permissive than their PR will tell you.

You’re with your wife, your devices are often on the same networks, so it’s likely that advertisers know you know each other when you browse. Despite what your wife says, you really don’t know if she interacted with a Pacifica ad or piece of sponsored content. Even if she didn’t search for a Pacifica, it doesn’t have to be specifically something related to minivans, because that information that you are potentially more interested in minivans can come from other metadata.

TikTok manages to figure out your perception of a particular video based on how your fingers are moving on the screen, how long you’re spending on a video, what’s happening when you’re lingering or swiping, etc. You never really have to tell TikTok directly what things you like.

The game of 20 questions works on a similar concept. You can start knowing absolutely nothing and ask a very small number of binary questions to find the specific item the person has in mind, using only metadata.
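The 20-questions intuition is just binary search over a candidate space: each yes/no answer can halve what remains, so n questions distinguish up to 2**n items. A quick sketch:

```python
import math

def questions_needed(n_items):
    """Minimum number of yes/no questions that can always single out
    one item from n_items candidates (each answer halves the space)."""
    return math.ceil(math.log2(n_items))

# 20 questions cover 2**20 = 1,048,576 items, so
# questions_needed(1_000_000) -> 20
```

This is the same reason a handful of weak behavioral signals, combined, can pin down something as specific as a pregnancy or a minivan purchase.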


> Cox Media Group recently gave advertisers an overview of a new technology it calls Active Listening. CMG claimed that its technology can use microphone data from devices like smartphones and tablets, specifically analyzing "pre-purchase conversations." The since-deleted blog post also mentions using AI to determine when the phrases heard from smart devices could be "relevant" to advertisers.

https://www.businessinsider.com/cox-active-listening-claims-...


From the archived page:

> We know what you're thinking. Is this even legal? The short answer is: yes. It is legal for phones and devices to listen to you. When a new app download or update prompts consumers with a multi-page terms of use agreement somewhere in the fine print, Active Listening is often included.

This means you have to give permissions and ignore the orange dot on your screen for this technology to work.


Don't most people just hit Accept blindly?


Probably, but that’s why new versions of iOS have a recording indicator anytime the microphone or camera is active.

And, you know, at some point consent is consent. It’s a giant dialog box that explains everything. Some people might even want an app that records their activity and gives them compensation for doing so (e.g., Microsoft/Bing Rewards). Who am I to tell that person they don’t want that?


> You’re with your wife, your devices are often on the same networks,

I'm on Fi, she's on Verizon. But I don't doubt that data miners know we're together due to consistent proximity. Neither she nor I did any Pacifica related searches.


Sorry, this is nonsense. The healthcare providers do it directly.

99.9% of people who have a healthcare provider at a minimum use its website: communicating with their doctor, prescriptions, etc.

All these websites use adtech stuff, and the apps are even worse.

Look, don't take my word for it; just look at Kaiser Permanente's website:

https://www.kp.org

It references google.com directly.

(also try www.dmv.ca.gov, same thing, same cookies)

