New headless Chrome has been released and has a near-perfect browser fingerprint (antoinevastel.com)
463 points by avastel on Feb 19, 2023 | 240 comments



I am the PM working on Headless. Feel free to ask questions in this thread and I will try to answer them if I can.

Edit: Please also note that we have not released New Headless yet. We "merely" landed the source code.


There are many comments about potential abuse. I would be curious to know whether your team has ever challenged each other: one part of the team tries to look like a real person accessing a site while the other part tries to detect and block them. If anyone could do this, it would be the creators of Headless.

Why go through the exercise, one may ask? I believe it would be a critical thinking exercise to improve Headless even more while giving website maintainers a way to opt out of receiving traffic from it. If not your team, have you reached out to see if people from Project Zero would take on that challenge in their abundance of spare time? [1]

[1] - https://googleprojectzero.blogspot.com/


We regularly get feature requests for Headless to provide a field or property that can be polled by JS frameworks to detect if Headless is active, e.g. window.isBot.

Well, Headless is open source, which means anybody could build a Headless version with such a property set to "I am a human, trust me!" and employ such a modified binary ... ;-)


Oh absolutely, relying on a header would be a placebo at best. I was thinking more along the lines of having two teams, one that develops Headless and another team at Google that tries to defeat it nonstop. An official game of cat and mouse. Project: Tom and Jerry? I guess legal would never buy into that name.

My own personal method for my silly hobby sites is just to put passwords on things with an auth prompt delay.


Why should Google redteam their headless browser though? As other comments point out there's plenty of ways for bot detectors to id bots even with a browser which mirrors a normal one: https://news.ycombinator.com/item?id=34858056

Almost all of those things are outside the scope of the browser itself. And anyone doing serious bot attacks already has scripts/forks that modify these signals. I don't see how the Chrome team could do much at that level to help stop it.


In theory their blue team could come up with even more advanced puzzles that bots trip over and then open source and document the bot puzzles. I don't know that they would, incentives or lack thereof and all. If nothing else it might make their work day more fun.

Or if I put my evil corp hat on, the incentive could be that they make puzzles that only Headless can get around, and all other bots become trivial to block and obsolete by even the least knowledgeable hobbyist. Perhaps Google releases Nginx, Apache HTTPD, Apache Traffic Server, Envoy and HAProxy modules that only Headless can get around, and all other bots internet-wide are entirely silenced. Chrome becomes the one and only bot to rule them all.


Why would they want to do that?


Oh man, you're making me put that hat back on.

I suppose that Google going through that exercise would mean that they get market dominance on bot-gathered data and anyone not using Chrome Headless would be unable to obtain freebie data. This could enable future features, whatever those may be. *readjusts hat* One future feature could be auto-discovery of Google DNS and Google proxies in GCP so they can learn about new data sources through crowd-sourcing, thus making their big-data sets more complete and their machine learning more powerful. Developers could block the proxies or compile them out, but as we know most people are too lazy to do this and many won't care.

Another advantage would be that eventually the only bots abusing Google would be bots using their code, which they would know how to detect and deal with, since they would implement their own open source anti-bot modules in their web servers, load balancers, etc.

There are more obscure ideas but I am doffing the hat before the hat-wraiths sense it.



You jest, but I could actually see this becoming a thing. I envision a future dystopian internet where people first have to authenticate their network gear, PCs, laptops, cell phones, cars, trucks, e-bikes, toasters and coffee makers to a government-contracted service. Once authenticated, they utilize something similar to that RFC, but probably instead a nonce or JWT tied to their device that gets embedded in the packet header somehow. Then sanctioning a continent, country, state, ISP, city, company, manufacturer, distributor or person would simply be disabling their evil bits, so to speak.

The push for this is starting with adult content [1], but the goal posts could easily be mounted on a train car with a very long and smooth track that only goes downhill.

[1] - https://news.ycombinator.com/item?id=34726509


There's a huge amount of aggro pissy shitthrowing that Chrome is facilitating automation in these threads. Bollocks.

You know what? The Internet Is For End Users [1]. If we're going to cite an RFC, it should be RFC 8890. Not having a better headless Chrome would be a violation of the most basic principles of the internet.

There are some cases where automation can get out of hand, but blocking those efforts should not come at user expense. So says RFC 8890, and a general collective belief/hum-in-the-room. The availability of a good browser like Chrome helping here should not be an issue, given how many other ways bad actors already have to go too far & cause harm to sites. The people who have to deal with this are not the priority & this doesn't radically change their troubles; it does radically help end users wishing to exercise agency, though.

In most cases being able to script & automate a site is a completely basic form of user agency, of no special regard. Headless Chrome being a somewhat tolerable way of doing that scripting is 100% morally correct. It greatly assists us in fulfilling a primary & clear overarching purpose of the internet: to be for end users.

I wish I could say I cannot believe the complaining & whining & snivelling, the pretentious nonsense and acting offended that Chrome would dare help make good automation. I wish I could say I don't think this crowd recognizes or comprehends the basic purpose of the internet, but again, I think I know better; I suspect they do, but their protests are disingenuous, and they have allied their hearts with darker forces, against the user.

[1] https://www.rfc-editor.org/rfc/rfc8890


>Headless is open source, which means anybody could build a Headless version with such a property set to "I am a human, trust me!"

This is flawed reasoning. Just because we can't eliminate abuse from headless browsers doesn't mean we shouldn't work to reduce it. Finding such a modified binary or making it yourself is additional friction that will cause fewer of these bots to exist. Some people may not care whether a website is able to block them, and some may not do the work to read the robots.txt. By implementing these capabilities into the product by default you make the web ecosystem a better place with less abuse. You are right that someone could make a version without the anti-abuse parts, but surely that fork would be less popular and less used.


What if I want the headless browser to look exactly the same? Why should we make a distinction between humans and machines?


If I run a soup kitchen, and Google is sending robots to my establishment which are indistinguishable from humans, I should have the right to ask whether the client is a robot.

I would hope that Google's robots would not be programmed to lie to me, but would be honest.

If robots are required to be honest, then I have a choice to serve them or not. If they are not honest, I do not have a choice.


Then don't add code to your site to make it work differently?

>Why should we make a distinction between humans and machines?

Because machines can be used to abuse a site at a scale that humans can't. Site owners want to protect their site against abuse.


By modifying the browser. It feels like DRM by a different name to me.


Okay? I don't care what you call it. It will reduce the amount of abuse in the world and that is a good thing.


While I appreciate your answer from a technical point of view - indeed it is trivial to modify/spoof - there is an ethical dimension.

Should bots have the legal right to say they are human?

For example - if Google Inc is visiting a web page to collect information about it using a headless browser, and the server asks - are you a bot - should Google be legally or ethically allowed to answer no? (Declarations in headers could remove the need for question/answer chatter.)

(I want to pre-empt dismissing this line of questioning via 'what if Google wants to know how the site will be served to a human, for better search results': Google could include a specific header for that, e.g. "I am a bot, but request that you serve the version of this page served to humans". It would be up to the server to honor or reject that request.)

The defaults Google chooses have compounding effects in our society. If you make it "normal" for bots to pretend to be human, the industry has minimal pressure to hold any standard above what you do, and better norms may never appear, or be delayed by a decade. The alternative is to be thoughtful today and try to create a better world.


https://github.com/paulirish/headless-cat-n-mouse was this basic idea, but open sourced.


The destination of that escalation is DRM.


Do you guys ever think about abusive automation at all, or do you just consider that other people's problem?


Abusive how? Headed chrome can be automated, as can wget.

It's bizarre to ask a client-side program to implement server-side controls for users you want to allow on your site but throttle.


Headed chrome adds a huge amount of overhead, and can also be fingerprinted more easily. This is a lot more declarative and makes it easier to run an abuse farm. Although, per my other comment, I don't see Headless as a tool that will particularly move the needle on abuse cases.


Isn't headed chrome usually fingerprinted by variables inserted by the chromedriver? You can rename these variables and be undetectable (you don't even have to recompile chromedriver, you can use a hex editor or a perl replacement).

At least I've never gotten detected.
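
A minimal sketch of that byte-level patching in Python (the commonly cited "cdc_" marker prefix and the chromedriver path are assumptions for illustration, not something stated above):

    import random
    import string

    MARKER = b"cdc_"  # assumed chromedriver marker prefix

    def patch_chromedriver(path: str) -> None:
        with open(path, "rb") as f:
            data = f.read()
        # Same-length replacement keeps all byte offsets in the binary valid.
        replacement = "".join(random.choices(string.ascii_lowercase, k=len(MARKER))).encode()
        with open(path, "wb") as f:
            f.write(data.replace(MARKER, replacement))

    patch_chromedriver("/usr/local/bin/chromedriver")  # hypothetical path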


There are even Puppeteer plugins that will do it for you. [1]

The best detection I've come across so far (i.e. before this release) has just required I run headless Chrome in headed mode. Granted, I don't do a ton of scraping -- mostly just pulling data out of websites so that I can play with it in aggregate using more civilized tools.

[1]: https://github.com/berstend/puppeteer-extra/tree/master/pack...


You call it abuse. Other people might call it use.


I've not yet encountered anyone who doesn't consider spam to be a form of abuse.


Spam can be an effective way around censorship. What is and isn't abuse often isn't as objective as some people want to pretend.


I am that "anyone" you mentioned. For example, autoposting on 4chan works very well for me. I spam goods on 4chan for people to buy, or create opinions that I push.


[flagged]


Would you please stop posting in the flamewar style? We've had to ask you this in the past as well. It's not what this site is for, and destroys what it is for.

https://news.ycombinator.com/newsguidelines.html


I'm sorry. I'll try to bite my tongue more often when I'm in combative mood. Thanks for putting up with me so far.


You call it use. Other people might call it abuse.


That's my point exactly.


Are we just misquoting the Eurythmics now?


The implications of your question are beyond dystopian


Please elaborate.


Because it suggests adding usage controls, possibly enforced via cloud connectivity, to add restrictions that will inevitably make legitimate usage more difficult, frustrating, and most importantly, subject to outside control. Extend this far enough and the world starts to look like Doctorow's "Unauthorized Bread".

This is an awful world, one designed to reinforce class divide and protect the entrenched and the rich by deliberately handicapping easily-accessible tools, because of a few bad actors. It creates a world where the code for literally everything is the most hideously complex version of itself because it is riddled with constant checks, phone-homes, and arbitrary usage limits. It further pushes us towards a disempowering future where our computing is limited exclusively to appliance-like devices whose inner workings are controlled for us. It stands against the very principle of general-purpose computing.


That's not beyond dystopian. It's just dystopian.

And the implications of a question aren't, either. Just your imagined implications. Questions aren't bad.


See my comment[1] on this very thread.

[1] https://news.ycombinator.com/item?id=34858232


If you are a soy developer who thinks Cloudflare is a god that should solve problems for you, and you use O(n^2) or even worse algorithms in your code so you can't even optimize it, then yes, it is only your problem.

In 2000, sites were built where the code was made precisely so that a DDoS attack was impossible. Now it is a heckin sauce of obfuscated proprietary JS malware.

If your site is like this, you deserved it. Cloudflare and such companies just want your money for solving a 5-minute problem like a WAF that is just a regex, and you have limits even for user-agent filtering, lol.

Stop writing shitcode and learn HTTP and TCP/IP theory, and you will make an antispam filter that is 200% better than any Cloudflare shit, which is simply malware that runs a cryptominer as an "IUAM" mode for their own benefit, and you even pay for it.


For what it's worth, the large "players" already seem to have this capability. They've forced pretty much everyone to roll out captchas, waf-level throttling, proof of work interstitials, and behavior-based fingerprinting.

While my immediate response was the same as yours, I think this actually won't really change much in the way of bad actors.

It's unfortunate, but basic controls (such as throttling, etc.) are pretty much a floor-required feature - one way to avoid this burden is to do things like use a third-party IdP (aka Google login). I'm not happy with the state of things, but I don't think headless will particularly contribute to a material increase in abuse cases.


Now that headless mode is a "real" Chromium instance, is it possible to add extension support to Chrome running in headless mode?


I didn't know this was a restriction before! Interesting. I would have assumed old headless had a profile, and that the typical command-line approaches[1] would let one load extensions. Are we sure that previous headless Chrome didn't have profiles or couldn't load extensions? I think the assumptions here may be incorrect.

The new Chrome headless certainly purports to be "just Chrome" "without actually rendering." One of the notable differences in the new headless mode is that it at least shows the stock/built-in extensions. From the submission:

> Similarly, when it comes to plugins, the old headless Chrome used to return no plugins with navigator.plugins, which is a technique that used to be exploited for detection when Headless Chrome got released 6 years ago, cf this blog post. The new headless Chrome returns the same plugins as a headful Chrome, and that’s the same for the mimeTypes obtained with navigator.mimeTypes:

Perhaps the new headless is faking it, but my impression is that extensions work as normal in the new headless Chrome. How or whether they worked before is another very interesting question I'd like answered.

I do wish the AMA dev had actually replied to this. My hope is that this wasn't an issue before, and that the situation is otherwise unchanged except that the default plugins are now installed/reported, which alters the fingerprint.

[1] https://stackoverflow.com/questions/16800696/how-install-crx...


https://bugs.chromium.org/p/chromium/issues/detail?id=706008

It looks like the new headless mode does support extensions.


Can you talk about your team's motivations for improving headless mode? Any particular use cases in mind?


Here are two of them:

- Test reproducibility
- Automated configuration rollouts in enterprise environments


Improving test environments is a huge upside. I haven't worked on browser automation in nearly a decade, but finding ways to work around shortcomings in the headless environment used to burn a lot of time on that team. I know of many small teams which made deliberate decisions NOT to do any browser automation tests (e.g. Selenium) because some issues required testing hooks in production code.


Is it too late to change the name from "new headless"? It won't be new forever, and then there will need to be a new new mode, or a differently named one that people think is older because it isn't the new mode.


No, obviously, the next version will be called Newer Headless. Then you get the More Newer or Even Newer release. Or my personal favorite NewV2. /s

Using the word "new" in naming conventions is the most moronic and shortsighted way to name things in a product that is quite obviously going to keep changing in the near future.


New College is doing fine even with its name. It's just a name. Doesn't really matter.


Also New Forest.


It reminds me of the "Pont Neuf" ("new bridge" in French), which is the oldest bridge in Paris crossing the Seine.


By all rights, it ought to be EvenLessHead. ~


See also: report_final_draft(1).doc


You would be surprised how much we talked about that. New/old is just relevant for the transition period.


But then how would you have the pleasure of figuring out the sort order between New $Feature, Advanced $Feature, Revamped $Feature and Enhanced $Feature?


What is the actual difference between old and new?

Is there a list, an explanation anywhere?


So this argument can be used these ways:

--headless

--headless=new

--headless=chrome

And each mean something different - but what?

Not documented, very frustrating.

Can you explain the difference between each of the above arguments?


There are two headless modes: "chrome" and "new".

--headless enables the default, which is "chrome".

--headless=new enables "new".


Any chance of a build for the Raspberry Pi?


So the --headless=new doesn't work on any released version of Chrome yet?


what makes the new one “Native” ?


It's real Chromium, not an emulation of a Chromium browser. "Old" Headless was merely pretending to be a Chromium browser; the "new" Headless is a Chromium browser. "Old" Headless requires a parallel/duplicate implementation of features, which leads to subtle behavior differences or makes it infeasible to support certain features, e.g. proper extensions.


wow, i had no idea the old headless was a reimplementation. congrats on landing the new one


Does this mean we might see proper extension support in "New" Headless?


Can this replace chromium embedded framework (CEF)?


I fail to see the connection. Can you elaborate?


[flagged]


What rumors? Can you provide any links or context?


I built a remote browser based on headless Chrome^0 and this is going to make things way easier. It's also great to see Google supporting Chrome use cases beyond "consumer browsing", and perhaps that's in large part been pushed by the "grass roots popularity" of things like puppeteer and playwright.

One thing I'm hoping for (but have heard it would require extensive rejigging of almost absolutely everything) is Extensions support in this new headless.

However, if I'm reading the winds, it seems as if things might be going there, because:

- Tamper scripts now work on Firefox mobile

- Non-webkit iOS browsers are in the works

- It's technically possible to "shim" much of the chrome.extension APIs using CDP (the low-level DevTools protocol that pptr and its ilk are based on), which would essentially lead to a "parallel extensions runtime" and "alt-Webstore" with fewer restrictions, something Google may not look merrily upon

Anyway, back to "headless detection", for the remote isolated browser, I have been using an extensive bot detection evasion script that proxied many of the normal properties on navigator (like plugins, etc), and tested extensively against detectors like luca.gg/headless^1

Interestingly, one of the most effective ways to defeat "first wave" / non-sophisticated bots used to be simply throwing up a JS modal (alert, confirm, prompt) -- for the convenient way it blocks the JS runtime until dismissed, and how you have to explicitly dismiss it.

^0 = https://github.com/crisdosyago/BrowserBox

^1 = https://luca.gg/headless/
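
To illustrate the JS-modal point above: naive automation hangs (or throws on its next call) until the dialog is dismissed explicitly. A rough Python/Selenium sketch of the handling a bot author has to add (the URL is a placeholder, and Selenium/chromedriver are assumed to be installed):

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.common.exceptions import NoAlertPresentException

    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    driver.get("https://example.com")  # placeholder URL

    try:
        driver.switch_to.alert.accept()  # dismiss a blocking alert/confirm/prompt
    except NoAlertPresentException:
        pass  # no dialog was up; continue as normal
    driver.quit()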


I'm assuming the next step will be to bring Cloudflare's pet project of TPM attestation into Chrome, otherwise known as PATs[1]. And just like that, not only would headless be defeated, but all of you using rooted devices and small-time browsers would be left high and dry.

It's "Right to read"[2] all over again.

[1] https://www.ietf.org/archive/id/draft-private-access-tokens-...

[2] https://www.gnu.org/philosophy/right-to-read.en.html


If you can only use PATs on headlessless machines, then they're actually headpats. Everybody loves headpats. I don't see a problem.


Headpats are nice when they're optional. Mandatory headpats are not universally admired.


What stops someone from making a fake TPM that speaks the appropriate protocol and just instantly signs off on every request? AFAIK there isn't some grand/central list of trusted TPM modules. Anyone can implement one as a Linux driver: https://www.kernel.org/doc/html/latest/security/tpm/tpm_ftpm...

A fake TPM would be useless for security but just fine for fooling websites that there is a real human at the computer.


From wikipedia:

> Computer programs can use a TPM to authenticate hardware devices, since each TPM chip has a unique and secret Endorsement Key (EK) burned in as it is produced.

That EK is signed by the TPM manufacturer, and so it’s likely they’ll only trust the keys of physical TPM manufacturers. Good luck forging that in software.


I wonder if we'll get a cat-and-mouse game with miscellaneous TPM manufacturers "accidentally" leaking their keys, getting blacklisted, creating new ones, etc. I'd like to think that there's at least a nontrivial amount of the population wanting to subvert the authoritarian corporatocracy and with the skills to do so.


It's going to be an extremely janky or very private website if they only let you view it when you have one of like a dozen supported and approved hardware TPMs.


The latest Windows version requires a hardware TPM on a device in order to be installed. Every hardware vendor has therefore included a TPM in all their new machines. This was already standard on Apple devices, and many Android devices have one as well.


Sure, but someone who wants to build a web scraper won't care; they could use their own homebrew TPM that does a no-op and claims a user pressed a button or was present when they actually weren't.

I doubt websites will go to the trouble to keep a list of approved TPMs. It's the SSL root certs nightmare all over again and even worse. No one is going to want to deal with managing a whole new giant list of devices, having fire drill updates to revoke compromised ones, etc.


What is the solution to automation then? What do I do when someone hits my content-rich Wordpress blog with a scraper that hits 100 pages a second to download my content, and my database falls over leading to real, legitimate users being unable to use my site? What if it’s not a legitimate scraper but someone with hundreds of proxies uses them to DDOS my site for days? Should I sacrifice my uptime to protect the freedom of those unwilling to attest that they’re running on real hardware?


The method to stop a (D)DoS is the same as it always was: caching and rate limiting.

Re: content scraping -- I was an indie web dev of a sort for a while and people always ask this question, and the answer is it's impossible to stop. Not even Facebook or big content sites like CNet or The Verge can stop it. At the bottom of it, you can just access the site in a browser and save the source. Content scraping is a rephrasing of "viewing content even just once". Stopping it is antithetical to the web and technologically infeasible.
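
As a sketch of the rate-limiting half of that answer, a toy per-IP token bucket (the rate and burst numbers are made up for illustration):

    import time

    class TokenBucket:
        def __init__(self, rate_per_sec: float, burst: int):
            self.rate = rate_per_sec
            self.capacity = burst
            self.tokens = float(burst)
            self.last = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at the burst size.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    buckets: dict[str, TokenBucket] = {}

    def allow_request(client_ip: str) -> bool:
        bucket = buckets.setdefault(client_ip, TokenBucket(rate_per_sec=5, burst=20))
        return bucket.allow()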


It's probably actually cheaper to pay people piece rates to do it for you in a browser than to pay a developer to write and maintain a scraping script anyway, so if the latter became genuinely impossible, moving to the former isn't a big deal.


Put your WordPress blog behind a caching proxy with a 5s TTL - that way any amount of traffic to a URL will produce at most one hit every 5 seconds to your backend.

I've used this trick to survive surprise spikes of traffic in multiple projects for years.

Doesn't help for applications where your backend needs to be involved in serving every request, but WordPress blogs serving static content are a great example of something where that technique DOES work.
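
A minimal sketch of that micro-caching idea in Python (the backend host is a placeholder): any burst of requests for the same path hits the backend at most once per TTL window.

    import time
    import urllib.request

    CACHE_TTL = 5.0  # seconds, as in the comment above
    _cache: dict[str, tuple[float, bytes]] = {}

    def fetch_cached(path: str) -> bytes:
        now = time.monotonic()
        hit = _cache.get(path)
        if hit and now - hit[0] < CACHE_TTL:
            return hit[1]  # still fresh; skip the backend entirely
        with urllib.request.urlopen("http://backend.internal" + path) as resp:  # placeholder backend
            body = resp.read()
        _cache[path] = (now, body)
        return body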


Proof-of-work schemes such as Hashcash[1] and simple ratelimiting algorithms can act as deterrents to spamming and scraping attacks.

There are other kinds of non-invasive bot management you can do as well, however, due to various reasons I'm not in a position to talk about it. A few other methods are mentioned at the end of the post being discussed[2].

[1] https://en.wikipedia.org/wiki/Hashcash

[2] https://antoinevastel.com/bot%20detection/2023/02/19/new-hea...
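
To make the Hashcash[1] idea concrete, a toy proof-of-work round trip (the difficulty and encoding are illustrative, not any particular production scheme):

    import hashlib
    import itertools

    def solve(challenge: str, difficulty_bits: int = 20) -> int:
        # Find a nonce whose hash has `difficulty_bits` leading zero bits.
        target = 1 << (256 - difficulty_bits)
        for nonce in itertools.count():
            digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
            if int.from_bytes(digest, "big") < target:
                return nonce

    def verify(challenge: str, nonce: int, difficulty_bits: int = 20) -> bool:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

The server hands out the challenge, the client burns CPU in solve(), and the server checks the result cheaply with verify().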


Proof of work isn't very practical here, because computation is a lot cheaper in datacenters than on phones.


The trick is to prevent the offloading of the proof-of-work challenge to another device, as suggested in the Picasso paper[1].

[1] https://storage.googleapis.com/pub-tools-public-publication-...


Can privacy be preserved with zero knowledge proofs? I don't like the idea of universal fingerprinted devices in an already heavily authoritarian world.


Neat! This does seem like it should work!

Semantic quibble: it's less "proof of work" and more "proof of hardware+work". Or, as they call it, hardware-bound proof of work. The reason you can't offload the challenge to a more powerful device is that they rely on identifying stable differences for each device class that ultimately trace down to the hardware they're running on.


From reading the abstract, isn't this just exploiting the same class of security vulnerabilities that the OP is lamenting are being fixed?


Not sure. Maybe not, if it's about device-specific information instead of headed-vs-headless distinctions?


Wasn't mining in the browser basically shut down by every major browser?

It was done super fast... one can't help but think that Google pulled all the levers they had at Apple/Mozilla to make sure the first viable alternative to advertising was killed before it was born. But I think, as a side effect, it might make PoW sort of impossible?

I don't really know how mining "fingerprinting" works exactly, so I'd be curious to know if I'm wrong.


What killed "mining in the browser", more than anything else, was:

1) It was almost exclusively used for malicious purposes. Very few legitimate web sites used cryptominers, and it was never considered a viable substitute for display advertising; it was primarily deployed on hacked web sites. Browser vendors were relatively slow to react; many of the first movers were actually antivirus/antimalware vendors adding blocks on cryptominer scripts and domains.

2) The most popular cryptominer scripts, like Coinhive, all mined the Monero coin. (Most other cryptocurrencies were impractical to mine without hardware acceleration.) Monero prices were at an all-time high at the time; when Monero prices crashed in late 2018, the revenue from running cryptominer scripts dropped dramatically, making these scripts much less profitable to run. (This is ultimately what led Coinhive to shut down.)


I guess slow/fast is subjective. It didn't seem like enough time passed for a legitimate ecosystem to develop. Just the basic idea of, say, hosting a static site/blog on a VPS with a cryptominer that could pay for itself would have been a game changer - and that was probably just the tip of the iceberg of possibilities. Instead we're still stuck either having to sell our traffic/info to Google/Microsoft, put up ads, or pay for it out of pocket. The entrenched players won.

The hacked-site boogeyman felt overblown (and from what you're saying it sounds like it would have died out anyway). I'm sure it happened, but at least personally I never once came across it. Or if I did, then my CPU spun a bit more and I didn't notice. No real harm done.

More fundamentally we're now in territory where the browser vendors get to decide what javascript is okay to run and which isn't.

Anyway, it's just complaining into the ether :) it is what it is. thanks for the context of the market forces and antivirus companies


> I guess slow/fast is subjective. It didn't seem like enough time passed for a legitimate ecosystem to develop.

Coinhive was live from 2017 - 2019, and it basically ran the whole course from exciting new tech to widely abused to dead over those two years. I don't think it needed more time.

> The hacked site boogieman felt overblown...

Troy Hunt acquired several of the Coinhive domains in 2021 -- two years after the service shut down -- and it was still getting hundreds of thousands of requests a day, mostly from compromised web sites and/or infected routers. It was a serious problem, albeit one which mostly affected smaller and poorly maintained web sites.

https://www.troyhunt.com/i-now-own-the-coinhive-domain-heres...


Make it someone else's problem; put a caching CDN in front of it, like Cloudflare, who have experience with these problems (like intentional or accidental DDOS).


I understand and agree with the suggestion of putting a CDN, but it's somewhat ironic to suggest the use of Cloudflare when that very same company is advocating for the DRM-for-webpages scheme.


Is it not fair to assume that Cloudflare, as a company that has made a name for itself selling various DDoS protection services, realizes it's in an arms race with the old-school way of handling these problems and is pursuing more advanced solutions before the current techniques become entirely useless?

It would be easy to point to the irony of saying "instead of supporting Cloudflare's proposals for PATs, use their CDN product for brute force protection" but on the other hand, they employ a lot of experts in this space and might see the writing on the wall in an increasingly adversarial public internet.


This is a good question, but if you look at it closely, Cloudflare seems to be the only company advocating for attestation schemes for the web.

It’s almost as if the conspiracy theory of Cloudflare acting as an arm of the US government and helping in the centralization of the internet is actually true.


is there such a thing as a caching CDN that effectively protects against scrapers? generally if somebody is going to try and scrape a whole bunch of old infrequently-accessed but dynamically generated pages, most of those won't be in the cache and so the caching proxy isn't going to help at all.

i'm honestly asking, not just trying to disprove you. this is a real problem i have right now. ideally i'd get all my thousands of old, never-updated but dynamically generated pages moved over to some static host, but that's work and if i could just put some proxy in front to solve this for me i'd be pretty happy. but afaik, nothing actually solves this.


Akamai has a scraper filter (I think it just rate limits scrapers out of the box but can be configured to block if you want). I'm not sure how good it is at detecting what is a scraper and what isn't though.


Yeah, AWS has one of these, a set of firewall rules called "bot control". it seems to work well enough for blocking the well-behaved bots who request pages at a reasonable rate and self-identify with user-agent strings (which i'm not really concerned about blocking, but it does give me some nice graphs about their traffic). it doesn't seem to do a whole lot to block an unknown scraper hitting pages as fast as it can.


rate limit. Or paywall.


> What do I do when someone hits my content-rich Wordpress blog with a scraper that hits 100 pages a second to download my content, and my database falls over

It's a blog. Blogs are not complex. Why is your blog's database so awfully designed that 100 pages a second causes it to fall over?

> leading to real, legitimate users being unable to use my site?

You assume that a scraper is not a legitimate user. I argue otherwise. If you don't want a scraper to use your site then put your site behind a paywall.

> What if it’s not a legitimate scraper but someone with hundreds of proxies uses them to DDOS my site for days?

If it's a network bandwidth problem, then a reverse proxy (eg, CDN) solves that.

> Should I sacrifice my uptime to protect the freedom of those unwilling to attest that they’re running on real hardware?

All software runs on real hardware. What is your exact question?

I am accessing this site in a virtual machine. I could be doing it with a headless browser. Why does that matter at all?


PoW captcha like mCaptcha. (It's technically not a captcha, for the pedantic.)


We have a chatbot that can send users screenshots of their CMS views (kanban, calendar, tables, gallery, etc) from inside of Slack.

The screenshotting uses puppeteer and chromium and a read-only session to impersonate the user and screenshot their dashboard.

It uses the old version of Chromium, and there were many gotchas that required a lot of extra scaffolding to actually render our and other websites like they would on my laptop. This will hopefully make it easier for us to maintain once implemented.
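
(A rough equivalent of that flow in Python/Selenium, just for illustration -- the actual stack described above is Puppeteer, and the URL and cookie name here are placeholders.)

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")
    options.add_argument("--window-size=1280,800")
    driver = webdriver.Chrome(options=options)

    driver.get("https://example.com/dashboard")  # placeholder dashboard URL
    driver.add_cookie({"name": "session", "value": "read-only-token"})  # assumed read-only session cookie
    driver.get("https://example.com/dashboard")  # reload with the session applied
    driver.save_screenshot("dashboard.png")
    driver.quit()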


If you add DRM video playback to the fingerprint, it is pretty much impossible to fake...

Either they have a real TPM with a real nvidia graphics card able to decrypt content with a real serial number... Or they don't...

If one graphics card or TPM serial number starts acting bot-like, you can ban just that one.


I browse with DRM disabled. Every time it gives me a notification about it, I view it as a "hah, fingerprinting avoided!" signal.

Sites that use it get my anti-traffic. I don't buy, support, or condone DRM'd media and I actively disable EME on every browser I come across...


> I don't buy, support, or condone DRM'd media

this is good, but it would also be helpful if you supported the anti-DRM movement. Some people have developed ways to get around certain DRM such as Widevine, from dumping your own CDM to Widevine proxies. Just ignoring the problem is not going to make it go away. Over the last two years DRM use for streaming content has increased significantly. If you want to really help, I would look into contributing code to these projects, or donations.


I'm not seeing how that doesn't help support DRM?

It does nothing to dissuade content gatekeepers from employing restrictive DRM on their sites.

Anti-DRM would be avoiding anything that gives money to those that employ DRM to incentivize the removal of the DRM. Frankly, flat out piracy (streaming ripped content) is more likely to result in the removal of DRM than making it appear that the DRM is working well for the provider.


[flagged]


We don’t want to deal with having to be forced into having specific hardware, operating systems, and browsers to watch content we paid for. I’ve had perfectly good monitors that were before HDCP was a thing, and these sites gimp the quality or outright refuse to play media because the monitor didn’t have some bogus technology.


Even as someone who isn't in the slightest interested in unauthorised copying of content, watching videos on anything which isn't VLC on my laptop is such a PITA that I never do it.


DRM has a huge impact on what I consume. For example only being able to watch Netflix at 720p due to running a *nix distro.


Good for you.

There are sites that commercially distribute DRMed video content; say, Netflix. They have a large audience, and they care, whether you and I like it or not.


Using Netflix as the example, Widevine L1 has very limited support on the desktop, i.e. Microsoft Edge on Windows and Safari on macOS.

All other configurations use L3 which is a shared key, e.g. provided by ChromeCDM as it runs entirely on the CPU - which is why Netflix content also works under Linux, albeit L3 is limited to 720p (or 1080p with browser extensions).

Given Chrome's massive browser market share, I'm not sure whether enabling DRM adds anything meaningful to the fingerprint - i.e. I don't think it's possible to revoke an L3 key without pushing out a new version of the CDM to all users of that browser, as has happened once before with Chrome.

FWIW I've tested that Widevine L3 decryption works using a "headless" docker container running Chrome. The only caveat to add is that Chrome must not be started with --headless, but you don't need a real GPU either, Xvfb works just fine.


I've never used Netflix (or other streaming sites like them) because of the DRM. Youtube manages to prove that a streaming model can be very, very profitable without it at all, as does BBC iPlayer.


YouTube uses DRM for licensed content like TV shows


How much of that audience is watching on a device without a video card? Almost none.


AFAICT, the server can avoid serving the DRMed content until the browser proves it has a legitimate DRM-respecting playback capability, which is designed to be hard to feign. That is, unless something like [1] is correctly implemented in the headless mode, DRM content won't be available anyway.

Am I missing anything?

[1]: https://developer.mozilla.org/en-US/docs/Web/API/Navigator/r...


What use case is there for accessing DRM video content using a headless browser?


Automated downloading of the content, I assume.


i love the contrast in these comments. on the one hand you have all the people arguing that headless chrome is unethical because websites need to be able to block bot traffic, and on the other you have actual humans saying they try as hard as they can to behave like a bot.


> ... actual humans saying they try as hard as they can to behave like a bot.

Blaming humans for desiring privacy is bad. No one here is "trying to behave like a bot".

Exaggerated example: "Oh, you don't want to show me, a random stranger on the internet, your ID? You are behaving like a crook!"


i'm not blaming people for wanting privacy. i'm just saying that if you value privacy, you can't also value blocking bots, because in order to block a bot you have to collect enough information to violate the real people's desire for privacy.

and there seems to be significant overlap between the people who think enabling bots is morally wrong, and people who think fingerprinting is morally wrong. if you value privacy, you have to value privacy for all web users even before you've collected enough data to determine whether that web user is a real person or not.


TPMs do not reveal a unique serial number or similar identifier by design for privacy reasons.

A TPM can attest that some measurements were done with it and it can attest that it comes from vendor X. You can block an entire vendor if they don’t behave but not individual TPMs via remote attestation.

You can use a scheme in which you can set up an "identity" on first use and then on next use authenticate the same identity. But that identity is kinda per use case.


I was under the impression that the EK could be used to identify individual TPMs- why can’t it?


I don't believe DRM fingerprinting is used in the wild. Firefox shows when DRM is being used (like Netflix) and I've never seen it used outside that.


Reddit's website uses DRM for fingerprinting - https://iter.ca/post/reddit-whiteops/


Maybe they changed their mind on that, because it does not show me any DRM usage as of now.


> If one graphics card or TPM serial number starts acting bot-like, you can ban just that one.

I don't think you can get the serial number, though?

(And if there was an API for this it wouldn't be a passive one, which makes it inapplicable for fingerprinting)


Also shutting out a lot of older and weird devices (internet fridges, dumb smart TVs, and more, plus many Linux and BSD users) that can't play DRM.

Some sites won’t care, but for some this will be too high a price for avoiding headless bots.


How does this work? Wouldn't a lot of real user-agents not have this capability and therefore not be able to be fingerprinted and banned in this way?


Can you report back the TPM serial number to the webserver?

If so, why isn't this used as an immutable ever-cookie that can't be deleted?


You can't, the parent comment has combined a few real world possible things into an impossible combination.


Why couldn't they just use a software TPM?


How do I set the new part of the headless flag in Python?

The article mentions that to use this you need to specify the --headless=new flag.

I know that to set the headless flag I can just use this code:

    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.headless = True
But how would I specify the new part of the flag/option?


There's a mention of this in the recent Selenium blog post https://www.selenium.dev/blog/2023/headless-is-going-away/#a...

Basically omit options.headless and use options.add_argument("--headless=new") instead.
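
Putting that together with the snippet from the question, a minimal sketch:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # instead of options.headless = True
    driver = webdriver.Chrome(options=options)
    driver.get("https://example.com")
    print(driver.title)
    driver.quit()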


The cat & mouse game continues...


PM working on Headless here. Masking bots is not the reason why the new Headless mode was created. The goal is to provide a headless browser that can be used in web tests. The original Headless is essentially a separate browser implemented in parallel to "proper" Chromium. That results in all sorts of subtle reproducibility problems for developers using Headless for their tests.


PM working on Private Browsing mode here. Watching pornography is not the reason why the new Private Browsing mode was created. The goal is to provide a Private Mode that can be used for Christmas shopping. ;)

In all seriousness, despite intentions, and I do love headless mode for actual integration tests with Webdriver, it’s no exaggeration to say that it is likely the single greatest avenue for bots and spam enablement across the entire internet, and imo is probably net Bad.


If it weren't for bots there would be no search engines, no internet archive, no WWW. Bots, and the tools for making them, are essential to the functioning of the web.


It seems more neutral to me. Yes there's a lot of spam and other types of malicious behavior, but I don't think it's good overall to try to eliminate web automation entirely to stop it.


A necessary evil for supporting an open and programmable internet (IMO).


> PM working on Headless here. Masking bots is not the reason why the new Headless mode was created.

Right. But it will be massively used just for that.


Yes, same as many technologies with legitimate uses. Tor is largely used for illegal activities, yet many would say the anonymity it provides for the general public is worth it being created (or the anonymity it provided for US intelligence).


I'm not chastising anyone for building a piece of cool tech, but it does seem like something of a holy grail for bots.


Nice to know you are the arbiter of what is and what isn't a "cool piece of tech"


I don't know whether you're illiterate or just maliciously misinterpreted what I wrote.


This is such good news to hear. Browser test automation was a pretty sore spot. I'm excited for your work.


> Masking bots is not the reason why the new Headless mode was created.

You might consider looking into some resources on Intent vs Impact (eg, [0]).

IMNSHO, anyone working in tech has a responsibility to consider what their creations can be used for, in addition to what they intend them to be used for. There's just too much potential for scalability of nefarious behavior to do otherwise.

[0] https://www.masterclass.com/articles/intent-vs-impact


Please reveal what you work on so I can publicly judge whether you have considered and properly chosen between intent vs impact or any other possible moral failings of your work as I see it.


I’m naive here but why would Chrome release a headless browser that makes it easier for bot developers to avoid detection?


This blog post is written that way because the guy works in the bot detection business so it's what he cares most about.

But there are still plenty of legitimate use cases for wanting a headless browser that perfectly replicates a normal browser environment. The obvious ones are automated frontend testing tools like https://playwright.dev/


Exactly. And as the blog post mentioned, people who have a strong need to block bots have tools other than browser fingerprinting at their disposal. Quoth the post:

> It’s important to leverage other signals such as:

>

> * Behavior (client-side and server-side)

> * Different kinds of reputations (IP, sessions, user)

> * Proxy detection, in particular, residential proxy detection

> * Contextual information: time of the day, country, etc

> * TLS fingerprinting.

Having a headless browser that behaves exactly like a normal one is tremendously useful for making things. And people who really *need* to block bots also need to contend with "mechanical turk" style attackers anyway. These techniques are also very useful against that approach, which still may be cheaper than making an undetectable bot even with a near-perfect Chrome fingerprint available headless.
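
As one concrete example of the TLS fingerprinting bullet above: a JA3-style fingerprint is just an MD5 over ClientHello fields. A sketch, assuming the ClientHello has already been parsed into lists of numeric IDs (the parsing itself is out of scope here):

    import hashlib

    def ja3_hash(tls_version: int, ciphers: list[int], extensions: list[int],
                 curves: list[int], point_formats: list[int]) -> str:
        fields = [
            str(tls_version),
            "-".join(map(str, ciphers)),
            "-".join(map(str, extensions)),
            "-".join(map(str, curves)),
            "-".join(map(str, point_formats)),
        ]
        return hashlib.md5(",".join(fields).encode()).hexdigest()

Two clients sending byte-identical HTTP requests can still produce different hashes if their TLS stacks differ, which is why it's a useful server-side signal.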


> * Behavior (client-side and server-side)

Imagine the number of false positives.

> * Different kinds of reputations (IP, sessions, user)

Almost everyone uses mobile networks right now. One IP can be stuck to thousands of users. Imagine the number of false positives.

> * Proxy detection, in particular, residential proxy detection

Most residential proxies are just common ISP IPs bought through a front or run by a botnet. Imagine the false positives for ordinary home users who get IP-range banned like on 4chan.

> * Contextual information: time of the day, country, etc

    script.execute(() => navigator.dateOffset = Math.random()...)
    script.execute(() => navigator.country = Math.random()...)
    script.execute(() => navigator.etc = Math.random()...)

> * TLS fingerprinting.

Imagine the number of false positives, especially because there are only about 4 common TLS fingerprints across browsers.

Just cope and seethe: your antispam filters will never work, antibot measures fail, Cloudflare Turnstile fails. Bots won, as usual.


We use a headless browser to load an internal webpage (with content that may be updated several times per day) and generate a pdf on-demand.


As a bot developer, without taking legal steps (I do not break the law) there is no stopping me regardless.


Based. Make companies cope further by scraping their prices and making your own prices 5% lower so customers buy from you. Their sites lag even on 32 GB RAM devices, with animations everywhere and zero optimization. This is the only way to compensate for it. Even if you have nothing to sell, you should still abuse their antispam filters.


Because none of the people complaining about headless bots (read probably: content and retail) are major stakeholders from Chrome's viewpoint.


Headless browsers are faster than normal browsers (no GUI), so your tests run faster.


Chrome sets navigator.webdriver to true when controlled by automation.

Until now, bots could simply use headful mode to achieve the same effect that is now made available through the new headless implementation.
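
A quick way to see that flag from the automation side, as a sketch (assumes Selenium and a local Chrome/chromedriver; the URL is a placeholder):

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    driver.get("https://example.com")
    print(driver.execute_script("return navigator.webdriver"))  # True under automation
    driver.quit()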


Are there non-headless browsers modified specifically to have extremely generic fingerprints? Hiding OS, GPU, fonts everything.


Firefox (and probably others) have fingerprint protection. https://support.mozilla.org/en-US/kb/firefox-protection-agai...


Any chromium based forks?


Brave.


Going by how upset the advertising/cyberstalking companies are that Brave is indistinguishable from Chrome, I think this may be the answer.

I don't like the way they pretend(ed) to send funds to websites using their cryptocurrency services, though. Good software, sketchy company.


+1. Also make sure to disable all cryptocurrency and "web3" related plugins for a pleasant experience.


Not a browser, but Arkenfox[1] hardens standard Firefox. It's not for everyone, though, and using something this specific can be a problem in itself.

[1]: https://github.com/arkenfox/user.js/


Tor browser (based on Firefox) seems to fit that bill.


In the end we come back to a full browser, and we have to emulate a mouse that does all the clicking.


We should assume anyone visiting a site without some kind of credentialed login is a 'bot'.

Or for all intents and purposes 'noise' traffic.

It'd be nice for the powers that be to develop an anonymous cookie standard to allow people to flag themselves as 'humans' without enabling the host to know anything about them.

We are fighting wars over problems that we have created for ourselves.


No one is gonna use your malware tracking cookies; I always block them, and Chrome, being a good browser, also detects such usage. Also, by European law, you can't force an "anonymous" cookie on anyone. We already lived through evercookies; now there aren't any evercookies.

If I ever detect that a site uses "anonymous" cookies without my consent, well, you will have to pay me a lot of $$ in compensation. Enjoy, try your luck, man. I need money anyway, and I love litigation.

Move on.


I am using the new headless Chrome for my Browser-Automation SaaS (PhantomJsCloud.com) and it is working great.

It fixes some nagging incompatibilities with certain websites. I don't bother with anti-bot mitigations, and I don't expect this to be useful in that regard. Commercial anti-bot doesn't care how much you spoof your browser fingerprint.

feel free to AMA


I just realized that I use your service to get pool monitoring data.

I configured everything (login & scraping) and started fetching data using your service. Then I discovered that while you have to log in to load the dashboard, their GET API only requires the serial number to fetch the data.


If you want to get billing details automatically (without logging in), just inspect the HTTP response sent back with every API request. It says how many credits are remaining, the API cost, etc.


Does PhantomJsCloud still run PhantomJS? How do you keep it up to date with security patches? Have you benchmarked how this new Chrome performs against the old headless Chrome and against PhantomJS?


No, running on Chrome now. I'm stuck with the unfortunate name. Chrome is much faster and well behaved than PhantomJs ever was. I offered PhantomJs as an option until not long ago, but nobody was using it.


Can you share the code for how to launch a new headless chrome?


If you are using Puppeteer, it's easy, just add "headless":"new" as described here: https://pptr.dev/guides/chrome-extensions

FYI, the new version just helps with compatibility; I have not seen it impact anti-bot detection at all.


I tried with Akamai and it still didn't work. Still need the stealth plugin and some additional tweaks to bypass it.


> navigator.plugins.length = 0

So any website on the Internets can know how many plugins my browser has? Ridiculous!


It would seem like no, in recent times at least. In recent browser versions (Chrome 94+, Firefox 99+, etc.) it's been changed to only report the default PDF plugins.

https://developer.mozilla.org/en-US/docs/Web/API/Navigator/p...


I wish I could automate some of my banking tasks. I tried but couldn't automate Chase, Citi or CapitalOne.

If anyone has a working script to login and perform simple task on one of these sites, please share it.


Last time, I was able to automate Chase by targeting their mobile site which, at the time anyway, had a dedicated URI. The mobile site was simple HTML and easy to scrape.


> the new headless Chrome can still be detected using JS browser fingerprinting techniques [...] however, the task has become more challenging [...] I’m not going to share any new detection signals

Any guesses?


In the bot detection methods I've seen so far on this, a large part of it is timing analyses where there is a significant difference between headed and headless, e.g. graphical operations, audio processing.
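
An illustrative probe of that kind, run here from the automation side just to show the sort of measurement a detector script would take in-page (the loop size, the URL, and any threshold a real detector would use are assumptions):

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    PROBE = """
    const c = document.createElement('canvas');
    c.width = 500; c.height = 500;
    const ctx = c.getContext('2d');
    const t0 = performance.now();
    for (let i = 0; i < 200; i++) {
      ctx.fillStyle = 'rgb(' + (i % 255) + ',0,0)';
      ctx.fillRect(0, 0, 500, 500);
    }
    return performance.now() - t0;
    """

    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    driver.get("https://example.com")  # placeholder page
    elapsed = driver.execute_script(PROBE)
    print(f"canvas probe: {elapsed:.2f} ms")  # a real detector compares distributions, not one number
    driver.quit()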


That, or making sure that the mouse really moved somewhere (in a sensible way) before the click occurred.


This would have false positives for some accessibility software, I believe


True, that's why you don't want to block the pageload on this signal alone, just use it to trigger a captcha.


It's pretty awful to make people who need accessibility software go through more captchas. Those are an accessibility nightmare.


Or even non-disabled people who typically browse using the keyboard only. Please stop sending users who you find inconvenient to captchas!


With Privacy Pass they won't see more captchas, they will actually see fewer of them.


That could be circumvented rather easily I guess, by using a non-headless (head-having? head-full? headed?) browser instead. And perhaps adding some random human-seeming delay in interactions.


Headed browser.

And maybe, but that will make enduser suffer more (as always), as more false-positives will be caught.


This is off topic but when did we get the ability to use spaces in URLs?


Browsers have automatically done the "correct" thing (converting to "bot%20detection") under-the-hood for years in my experience. I remember MS FrontPage-made sites with spaces in the name and IE would work with them.


in what sense? spaces as percent-encoded (%20) values have been around ever since I've used the web. those spaces are occasionally displayed as spaces in the url bar, depending on the context.


The best way to catch a robot is just to slap a captcha there. Everything else is kind of useless and not effective.


Getting captchas solved reliably via a service costs around $1 per 1000 captchas so captchas are kinda useless as well if there's a tiny monetary incentive to get to whatever is behind the captcha.


How is that accomplished? Real humans?


Depends on the captcha but there's many popular services that you can plug into your code through APIs for bypassing captchas (https://www.2captcha.com, https://anti-captcha.com). I think the hardest one is probably the invisible reCaptcha Enterprise.


Real humans, in places where $10 / day is reasonable money.


i'm not sure what the current state of the art is, but my favourite way of solving captchas was the porn vendors who were monetizing their sites by presenting captchas in front of their videos, but the captchas were captured from legitimate sites. so every time a person solved a captcha to watch a video, it enabled a bot to access a captcha-protected resource.


Having people solve CAPTCHAs for you seems more ethical to me than showing them ads - at least with CAPTCHA the cost paid (here: your labour) is clear to the user.




All captchas are soon going to be difficult to solve for humans and easy to solve for bots. Many already are. They also have terrible accessibility.


How do captchas work for blind people behind screen readers? I usually use a lot of keyboard strokes which seems to trigger a lot of captcha systems

So far, the play-audio options are kind of weird, especially if you're hard of hearing.


They sometimes have an audio version; unfortunately, that one is also used to bypass the captcha with audio recognition software.


[flagged]


This is a terrible notion.

> Accessibility is for everyone, including you, if you live long enough and the alternative is worse. So your choice is death or you are going to use accessibility features.

– John Siracusa

Also, making services accessible is not only the obviously right thing to do, but also the law here in EU.

https://en.wikipedia.org/wiki/European_Accessibility_Act


Offering up some of our strength, ability, and comfort to help others who might be less fortunate or whose qualities lie elsewhere is what makes us human, and it probably played a large part in getting us where we are today.

You might be part of a very minuscule group yourself, if this is really what you believe.

Our digital world will never be perfect, but allowing for everyone to at least be able to access and benefit from it is very much something we can and should do.


Educate yourself before writing such selfish nonsense.

https://www.sense.org.uk/about-us/statistics/deafblindness-s...


This is a terrible take. The more technology is integrated into society, the more we need to offer different avenues to access it. Otherwise we'll be excluding the differently-abled from many parts of society, and at some point we really should be able to put that behind us...


Captchas also tell apart the average human visitor from the very committed human visitor that really, really, really needs to do whatever they can do on your website.


Ha ha, good point. When presented with captcha I often decide I don't care that much and just close the page.


I do the same, but sometimes I wish I could give better feedback.

"Dear British Airways, I booked with SAS instead because you assumed a Linux user with Firefox was a bot."

(Or maybe it was the other way round, I forgot.)


They are also very good at distinguishing paying Google users, who get the fast-pass through Google captchas.


Is that a thing? What services do you need to buy to bypass recaptcha?


That means way more captchas after this release, yay


Why even bother?


> However, with recent progress in automatic and audio recognition, [detecting bots with captchas] has evolved

...and that's from 3 years ago

https://antoinevastel.com/javascript/2020/02/09/detecting-we...


Nothing stopped a Chromium fork from doing this earlier.


The game continues. Back in 2010 when I was writing the first in-browser bot detection signals for Google (so BotGuard could spot embedded Internet Explorers) I wondered how long they might last. Surely at some point embedded browsers would become undetectable? It never happened - browsers are so complex that there will probably always be ways to detect when they're being automated.

There are some less obvious aspects to this that matter a lot in practice:

1. You have to force the code to actually run inside a real browser in the first place, not simply inside a fast emulator that sends back a clean response. This is by itself a big part of the challenge.

2. Doing so is useful even if you miss some automated browsers, because adversaries are often CPU and RAM constrained in ways you may not expect.

3. You have to do something sensible if the User-Agent claims to be something obscure, old or alternatively, too new for you to have seen before.

4. The signals have to be well protected, otherwise bot authors will just read your JS to see what they have to patch next. Signal collection and obfuscation work best when the two are tightly integrated together (a toy sketch of the kind of signal involved follows below).
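
To make that concrete, here is a deliberately naive, unobfuscated sketch of the sort of browser-side signal collection being described; the specific property checks are illustrative examples, not what any particular product does:

    // Toy signal collection: each check is a weak hint, not proof of automation.
    function collectSignals() {
      return {
        // Automation frameworks are supposed to set this flag to true.
        webdriver: navigator.webdriver === true,
        // Old headless builds shipped with an empty plugin list.
        noPlugins: navigator.plugins.length === 0,
        // A window with no outer dimensions suggests there is no real display.
        noWindow: window.outerWidth === 0 && window.outerHeight === 0,
        // The advertised UA string, for cross-checking on the server.
        userAgent: navigator.userAgent,
      };
    }

    // A real system would obfuscate this, mix in decoys, and verify
    // the results server-side instead of trusting the client.
    console.log(collectSignals());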

These days there are quite a few companies doing JS based bot detection but I noticed from write-ups by reverse engineers that they don't seem to be obfuscating what they're doing as well as they could. It's like they heard that a custom VM is a good form of obfuscation but missed some of the reasons why. I wrote a bit about why the pattern is actually useful a month ago when TikTok's bot detector was being blogged about:

https://www.reddit.com/r/programming/comments/10755l2/revers...

tl;dr you want to use mesh-oriented obfuscation, and a custom VM makes that easier. It's a means, not an end.

Ad: Occasionally I do private consulting on this topic, mostly for tech firms. Bot detectors tend to be either something home-grown by tech/social networking firms, or these days sold as a service by companies like DataDome, HUMAN etc. Companies that want to own their anti-abuse stack have to start from scratch every time, and often end up with something subpar because it's very difficult to hire for this set of skills.

You often end up hiring people with a generic ML background, but then they struggle to obtain good enough signals and the model produces noise. You do want some ML in the mix (or just statistics) to establish a base level of protection and to ensure that when bots are caught their resources are burned, but it's not enough by itself anymore.

I offer training courses on how to construct high quality JS anti-bot systems and am thinking of maybe in future offering a reference codebase you can license and then fork. If anyone reading this is interested, drop me an email: mike@plan99.net


What are bots used for? I can think of a few reasons; I wrote a scraper/submitter myself in the '90s for a cooperative of subcontractors that was being forced to use an extremely sluggish web app by the big company that provided their gigs.

But I guess there are all kinds of purposes, some benign, some nefarious, and they presumably influence how the bots are operated and detected.


People are paying $500 for bots used to buy the latest Nike/Adidas/... limited edition sneakers. Or video cards a few years ago (for crypto mining).

It's a whole industry.

> If we consider a user base of ~175 users, and a minimum bot price of 200 euros (175 users x 200 euros), then the bot developers made at least 35K euros (~$37K USD) in initial bot sales.

https://datadome.co/threat-research/inside-sneaker-bot-busin...


Artificial scarcity in sneakers is their design decision. These shenanigans should have zero impact on browser policy.


I thought about building something like that for photographers to get gigs from large real-estate photography contractors who sub-contract the work to independent photographers. Automated tools would benefit the photographers greatly. The benefit comes at the expense of those not using automated tools, so the morality of such a tool is at least somewhat questionable.


"The signals have to be well protected, otherwise bot authors will just read your JS to see what they have to patch next. Signal collection and obfuscation work best when the two are tightly integrated together."

JS sounds like a bad match for this task. I perform similar checks from the backend with http headers and Python.

Is there a compelling reason to stick with JS despite the added complexity of obfuscation?

Edit: My use case is different than yours as it's part of a pid-free analytics application. However, bot detection is still an important component of that product.


If you're only relying on http headers, you're missing all but the most trivial of "bots". There are other things you could do with a backend-only approach, but if your code doesn't run on the machine the device actually connects to (e.g. you're behind a load balancer or other reverse proxy), those are largely unworkable.


"If you're only relying on http headers, you're missing all but the most trivial of bots"

Very true. Capturing, processing, and storing analytics data long-term is expensive. If I eliminate even 50% of that noise, the savings will be worth it.

I'm attempting to identify the bulk of bots with http headers and real-time session monitoring. I also have an unauthorized list (known bad actors) and an ignore list (search bots, etc.). It works pretty well but definitely doesn't begin to address the problem as a whole (from a security perspective).

It's an interesting and complex topic.
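
For what it's worth, here is a minimal sketch of that kind of cheap first-pass filter, written as an Express-style Node middleware rather than Python; the header heuristics, IPs, and list contents are made up for illustration:

    const express = require("express");
    const app = express();

    // Hypothetical lists; in practice these would live in a datastore.
    const blockedIPs = new Set(["203.0.113.7"]);        // known bad actors
    const ignoredAgents = [/Googlebot/i, /bingbot/i];    // search bots: serve, but don't count

    app.use((req, res, next) => {
      const ua = req.headers["user-agent"] || "";

      // Unauthorized list: refuse outright.
      if (blockedIPs.has(req.ip)) {
        return res.status(403).end();
      }
      // Ignore list: keep serving, but drop from analytics.
      if (ignoredAgents.some((re) => re.test(ua))) {
        req.skipAnalytics = true;
        return next();
      }
      // Crude header heuristics: this only catches the laziest bots, by design.
      if (ua === "" || /python-requests|curl|wget|libwww/i.test(ua)) {
        req.skipAnalytics = true;
      }
      next();
    });

    app.get("/", (req, res) => res.send("hello"));
    app.listen(3000);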


Re: your ad.

This sounds like a solid product / startup idea to me. I worked on spambot detection in a previous job and it's not at all trivial to solve. Though we were specifically interested in detecting the abusive use of bots, not bots in general, so I focused simply on detecting unusual resource consumption rather than fingerprinting.


There are startups doing this sort of thing already, the article is written by the head of research at one. But tech firms often like to have their own in-house stack with the source code.


Yeah, but non-tech companies, like Nike/Adidas as mentioned in other comments, will need this kind of bot-detection service.



What do you mean by a "mesh-oriented obfuscation"? My best guess is: serving a different subset of the VM detection code to each client?


There are lots of techniques that can fall under that heading. The idea is to tie your logic and obfuscation together so that the things you have to do to undo the obfuscation end up breaking access to other parts of the program. Using the output of hash functions as decryption keys is one well-known approach, but there are others.
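
A toy illustration of the hash-as-key idea (a trivial hash and XOR stand in for real primitives, and the protected string is just an example): the key for the next stage is derived from the exact text being protected, so patching that text silently breaks decryption.

    // Toy hash: a real system would use something much stronger.
    function toyHash(str) {
      let h = 0;
      for (let i = 0; i < str.length; i++) {
        h = (h * 31 + str.charCodeAt(i)) >>> 0;
      }
      return h;
    }

    // "Encrypt"/"decrypt" by XORing each char with bytes derived from the key.
    function xorWithKey(str, key) {
      let out = "";
      for (let i = 0; i < str.length; i++) {
        out += String.fromCharCode(str.charCodeAt(i) ^ ((key >> (8 * (i % 4))) & 0xff));
      }
      return out;
    }

    const check = "navigator.webdriver === true";     // code whose integrity we care about
    const key = toyHash(check);                        // key depends on the exact source text
    const nextStage = xorWithKey("console.log('stage two')", key);

    // Later: recompute the key from the (hopefully unmodified) source.
    // If someone patched `check`, the key differs and eval gets garbage.
    eval(xorWithKey(nextStage, toyHash(check)));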


Heh, I had a feeling you'd show up here. Hi, Mike :)


Long time no see mate :)


Why is my first reaction to the last part "oh no!"? It seems like something that would have more illegitimate/annoying use cases than good ones.


Can't you say the same about a real browser with Selenium driving it? It's been available for years; has it been hugely detrimental to anything?


It's not like spam farms can't use their own version of Chromium that already mimics a real browser. Relying on client-side indicators for your bot detection will only catch the bots that don't care about being caught in the first place. Show an alert that says "welcome to my site!" for any browsers originating from a data center and you've probably filtered most of those out.

I like automating menial tasks in shitty web UIs (e.g. clearing out a list of sessions/search history/ad providers that only allows removing a single entry at a time). Simply using Firefox also gets flagged by a lot of these shitty bot detection services. I've never seen them do any useful work.

The only exception is maybe reCAPTCHA or Cloudflare's alternative; that seems to be quite good at catching actual bots, but I do hate most websites that use them because in Firefox you end up clicking on boats twenty times. They're also trivially bypassed by delegating your spamming to click farms, as 1000 minimum wage workers in a faraway country can be cheaper than paying for dev time to work around the minor nuisances of bot detection.


> As you can imagine, given my position at DataDome (a bot detection company), I’m not going to share any new detection signals as I used to do

Here comes the sales pitch....



