If your goal is to allow only the original Google Chrome browser, that is fine. Otherwise this might cause false alarms.
The arms race goes on.
In the beginning was the web, and it was good. Content came along. Some was good, some was cats. Then paid sites with sign-up. Then search engines. Then ads.
Pretty soon folks thought "I not only own this content, I own how it will be presented to the end user. If I choose to add in cats, or Flash ads, or whatnot? They're stuck consuming it. I own everything about the content from the server to the mind of the person consuming it, the entire pipe."
Many people did not like this idea. Ads were malicious; they installed malware. Ad-funded content drove sites to track users like lab rats. Armies of psychology majors were hired to try to make the rats do more of what advertisers wanted them to do.
Ad blockers were born. Then anti-ad-blockers. Then headless browsers. Now anti-headless browsers.
It's just a huge waste of time and energy. The model is broken, and no amount of secret hacker ninja shit is going to make it work. You want to know where we'll end up? We'll end up with multiple VMs, each with a statistically common setup, each consuming content on the web looking just like a human doing it. (We'll be able to do that by tracking actual humans as they consume content). But nobody will be looking at those VMs. Instead, those invisible screens will be read by image recognition software which will then condense what's on there and send the results back to whoever wants it.
Content providers will never win at this. Nor should they. Instead, we're just going to sink billions into a busted-ass business model over the next couple of decades throwing good money after bad.
In the offline world companies also try to protect themselves (you can get banned from stores when they find out that you work for the competition) but it's harder to do.
1. Scraping results from a property listing website, specifically to pull the properties an agent had listed and put them on their website. The agent didn't want to pay the fee for API access (they probably ended up paying my employer more to scrape it, and keep that updated, but hey).
2. Scraping an e-commerce website of a company my employer were working with to keep our product catalogues in sync - the partner had their own platform, but no API for it.
3. Automating requests to a price comparison website to find out what prices competitors are offering for particular types of customer.
You can run into problems when they remove things like the chronological listing view in favor of some "algorithm" you can't control, but even that just adds a little work on the back end to detect duplicates, plus a caveat that some of the content will be missed.
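That back-end dedup work amounts to fingerprinting what you've already ingested. A minimal sketch, assuming scraped items arrive as plain strings (the hashing scheme and in-memory store are illustrative, not any particular pipeline):

```typescript
import { createHash } from "crypto";

// Fingerprint everything already ingested so that re-scraping an
// algorithmic, non-chronological feed doesn't create duplicates.
const seen = new Set<string>();

function fingerprint(item: string): string {
  // Normalize whitespace so trivial formatting changes don't defeat the dedup.
  const normalized = item.trim().replace(/\s+/g, " ");
  return createHash("sha256").update(normalized).digest("hex");
}

function ingest(item: string): boolean {
  const key = fingerprint(item);
  if (seen.has(key)) return false; // duplicate from an earlier pass
  seen.add(key);
  return true; // new content, pass it along
}
```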
There can also be an arms race where the site developers will start trying all kinds of crazy techniques to make scraping harder. Ironically this often happens after they revoke API access, meaning users don't have an alternative except to participate in the arms race.
No it's not, it's just swapping out a few selectors or regular expressions...
That’s not spying; that’s a completely legitimate and normal use of the web. Everything on the web is public, and the web was created to allow exactly that.
They're as easy as ever to read, as that's necessary for the documents to be displayed to the end-user.
There's shittiness from both producers and consumers.
It's completely fucking stupid that we have to come up with this nonsense.
If more sites adopt serving fake data to headless chrome, people will just return to the old xvfb workflow for those sites, and use headless for everything else.
Could even use a small set of xvfb scrapers to verify the results, and automate detection of false data.
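For reference, the "old xvfb workflow" is just a full, headful browser rendering to a virtual display, so headless-detection tricks see an ordinary Chrome. A minimal sketch with Puppeteer (the script name and target URL are placeholders; run the compiled script under `xvfb-run` so the real browser has a display to draw on):

```typescript
// scrape.ts -- run the compiled output as: xvfb-run -a node scrape.js
// Launches a *headful* Chrome against the virtual X display that
// xvfb-run provides.
import puppeteer from "puppeteer";

async function main() {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto("https://example.com/listings"); // placeholder URL
  const html = await page.content();
  console.log(html.length, "bytes fetched via headful Chrome");
  await browser.close();
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

Pointed at pages already scraped headlessly, the same script doubles as the verification pass described above: diff the two results and flag pages where they diverge.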
On the other hand, this will also put wrong data into search engines.
Eventually, human browsing and headless browsing converge. Nobody wants to make the human browsing experience bad, so the headless browsing continues.
In my opinion, if you’re running a site that is existentially threatened by someone else having your content, you need something else for your moat.
This race won't end, and the only result beyond wasted effort is the creation of ever more ridiculous and user-hostile practices.
The future will be raspberry pi clusters connected to HDMI capture cards.
Personally, as someone who regularly uses several different browsers and experiments with others, I wish the Web was far more browser-neutral.
So making it detectable (intentionally, even, right there in the user agent!) is really absurd.
Or actually, it makes one wonder about Google's motives.
Is there some way to declare, "I am a legitimate academic user", something akin to 'TSA Pre' status?
"Sure, register for & use the site's API," you'll say. What if they don't have one?
"Sure, just don't slam the server with too many requests in a short time," you'll say. But if they're rejecting you just because they detect you're headless, etc...?
Isn't that their right?
If I pay for my outgoing bandwidth (even if I don't) I am under no obligation to give my content/data/whatever to any third party source, even academic.
Aren't you? You put a server on the publicly routable Internet. And made it talk over HTTP. At this point I believe you've already chosen to waive your rights not to serve content.
Although, I think you're framing it wrong, you're not obligated to give the content, someone is just choosing to consume it in a way you hadn't intended.
But as long as you're providing it publicly then it makes no sense that you'd be able to dictate how it's consumed.
* create fake profiles in order to boost someone's "followers" in a social network where you can monetize your "influencer" status
* click ads from a competitor in a way that would trigger fraud prevention from the ad network, effectively preventing the competitor from advertising there
CAPTCHAs are useful, but they're an X/Y problem in the same way that this headless-detection is: trying to detect human vs bot, when the real solution would be to slow down (a portion of) the traffic.
Hashcash would seem like a better solution, since that doesn't lock anybody out (human or bot), it just slows them down to reduce server load. If some clients are higher priority than others (e.g. human users vs poorly-programmed bots) then use info like IP, cookies, etc. to slow down the low priority requests, or even adjust the difficulty depending on how likely the client is to be causing load.
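A rough sketch of what a hashcash-style gate could look like (the difficulty value and challenge format here are made up for illustration; a real deployment would tune difficulty per client as described above). The client, human browser or bot alike, must burn CPU finding a nonce before the server honors the request:

```typescript
import { createHash, randomBytes } from "crypto";

// Server issues a random challenge; the client must find a nonce such
// that sha256(challenge + nonce) starts with `difficulty` zero hex
// digits. Higher difficulty = more client CPU per request = slower bots.
function issueChallenge(): { challenge: string; difficulty: number } {
  return { challenge: randomBytes(16).toString("hex"), difficulty: 4 };
}

function verify(challenge: string, nonce: string, difficulty: number): boolean {
  const digest = createHash("sha256").update(challenge + nonce).digest("hex");
  return digest.startsWith("0".repeat(difficulty));
}

// The work the client has to do before its request is served:
function solve(challenge: string, difficulty: number): string {
  for (let nonce = 0; ; nonce++) {
    if (verify(challenge, String(nonce), difficulty)) return String(nonce);
  }
}

const { challenge, difficulty } = issueChallenge();
const nonce = solve(challenge, difficulty);
console.log(verify(challenge, nonce, difficulty)); // true -- request allowed
```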
Not suggesting it's better or worse - just an alternative if you need something that appears to be like a desktop browser.
It is not possible to detect and block Chrome headless | https://news.ycombinator.com/item?id=16179181
A fully blocked bot will error and get replaced with a working bot. A bot that subtly errors again, and again, and again will look almost-right and create a maintenance nightmare...
This wouldn't impact day-to-day users barring gross incompetence.
Of course the JS itself can fail due to incongruent browser behaviour... but why would you trigger a bot obfuscation routine based on a failed JS call?
That is the gross incompetence I was referring to, and it's hard to call basic errors anything but a lack of basic testing.
Downvotes aside, the kinds of f-ups you're speculating about here are at the level of knowing how true/false works in JS.
And, no, there really are no valid reasons for users to add specific properties to their navigator objects that flag them as headless, or to use specific extension objects that report the use of headless automation, if they aren't. There is no valid reason to set your Edge userAgent to "HeadlessChrome", either.
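The signals in question are things like the standard `navigator.webdriver` flag and the `HeadlessChrome` token in the default headless user agent. A browser-side sketch of what such a check might look like (the exact set of tells is illustrative, and these particular ones have come and gone across Chrome versions):

```typescript
// Runs in the page. Each check is a known (historical) headless tell;
// none is conclusive on its own, which is why the arms race continues.
function looksHeadless(): boolean {
  const tells = [
    navigator.webdriver === true,               // set by automation per the WebDriver spec
    /HeadlessChrome/.test(navigator.userAgent), // default headless Chrome UA token
    !(window as any).chrome,                    // window.chrome was absent in early headless builds
  ];
  return tells.some(Boolean);
}

if (looksHeadless()) {
  // Report back so the server can decide what to serve this session.
  navigator.sendBeacon("/bot-signal", "headless-suspected"); // endpoint name is hypothetical
}
```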
That's not an angry lost user, friend, that is an upset unauthorized third-party content scraper. I work with Open Data, so I don't care, but some sites for-realsies do.
In any case, users can do whatever they want with their client and expect the service to work properly. If you detect abuse you should block or captcha them, but the fact that they might be a bot doesn't really call for such a drastic measure. It's the second worst approach, after serving hindering scripts to them.
Disclaimer: I haven't downvoted you as I don't downvote things prompting a discussion.
The server can then block, redirect or feed the browser erroneous data.
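A minimal sketch of that server side, assuming an Express app and the hypothetical `/bot-signal` beacon from a client-side check like the one above (the "erroneous data" branch is deliberately crude):

```typescript
import express from "express";

const app = express();
const suspected = new Set<string>(); // IPs flagged by the client-side check

// Hypothetical endpoint fed by navigator.sendBeacon from the page.
app.post("/bot-signal", (req, res) => {
  suspected.add(req.ip ?? "unknown");
  res.sendStatus(204);
});

app.get("/listings", (req, res) => {
  if (suspected.has(req.ip ?? "unknown")) {
    // Any of the three options from the comment:
    // res.sendStatus(403);                  // block
    // res.redirect("/sign-up");             // redirect
    res.json([{ price: 1, title: "???" }]);  // feed erroneous data
    return;
  }
  res.json([{ price: 350000, title: "3-bed semi" }]); // real data
});

app.listen(3000);
```

Feeding plausible-but-wrong data is the "subtly errors again, and again" maintenance nightmare described earlier, which is exactly why some operators prefer it to a clean block.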