Hah, this didn't cover window.webdriver, and I was about to post that you can still use that (since I assumed window properties weren't deletable) but... window properties are deletable. Cool.
From the original article: you put a proxy in front of headless Chrome and inject the deletion code into the HTML of each page, before any of the page's own JS runs.
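Roughly, the injected snippet would be something like this (a sketch only; the property names are the ones mentioned in this thread, not necessarily what the article's proxy actually injects):

    // Sketch of a snippet a proxy could inject as the very first <script>,
    // before any of the page's own JS runs. Property names are assumptions.
    (function () {
      try {
        // In recent Chrome builds webdriver is a getter on the Navigator prototype.
        delete Navigator.prototype.webdriver;
      } catch (e) {}
      try {
        delete window.webdriver; // and yes, window properties are deletable
      } catch (e) {}
    })();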
I read these things and I think "So much wasted energy and effort"
In the beginning was the web, and it was good. Content came along. Some was good, some was cats. Then paid sites with sign-up. Then search engines. Then ads.
Pretty soon folks thought "I not only own this content, I own how it will be presented to the end user. If I choose to add in cats, or Flash ads, or whatnot? They're stuck consuming it. I own everything about the content from the server to the mind of the person consuming it, the entire pipe."
Many people did not like this idea. Ads were malicious, they installed malware. The practice of using ads to fund content caused sites to track users like lab rats. Armies of people majoring in psychology were hired to try to make the rats do more of what the sites wanted them to do.
Ad blockers were born. Then anti-ad-blockers. Then headless browsers. Now anti-headless browsers.
It's just a huge waste of time and energy. The model is broken, and no amount of secret hacker ninja shit is going to make it work. You want to know where we'll end up? We'll end up with multiple VMs, each with a statistically common setup, each consuming content on the web looking just like a human doing it. (We'll be able to do that by tracking actual humans as they consume content). But nobody will be looking at those VMs. Instead, those invisible screens will be read by image recognition software which will then condense what's on there and send the results back to whoever wants it.
Content providers will never win at this. Nor should they. Instead, we're just going to sink billions into a busted-ass business model over the next couple of decades throwing good money after bad.
It's more complicated than that. Almost all of the spidering I do lately for clients is not about scraping content in the classical sense, but some form of "spying" on one's competition, mainly in e-commerce: tracking inventories, fine-tuning prices, tracking how they promote certain products online, etc. Not to mention the really black-hat stuff many people do, like attacking competitors with fraudulent ad clicks and similar. Many times "the content providers" have every right to want to protect themselves.
That has been done for decades. Most manufacturers employ scores of people going into shops to check how products are presented. Supermarkets employ people to check traffic and sales at competing stores. Doing this online can be easier (if you know how to do it) but it's hardly new.
In the offline world companies also try to protect themselves (you can get banned from stores when they find out that you work for the competition) but it's harder to do.
I'll add my piece of anecdata to this. I've done a few projects that involved scraping over my career.
1. Scraping results from a property listing website, specifically to pull the properties an agent had listed and put them on their website. The agent didn't want to pay the fee for API access (they probably ended up paying my employer more to scrape it, and keep that updated, but hey).
2. Scraping an e-commerce website of a company my employer were working with to keep our product catalogues in sync - the partner had their own platform, but no API for it.
3. Automating requests to a price comparison website to find out what prices competitors are offering for particular types of customer.
API keys can be revoked at any time or the API service can be suddenly terminated. You can't trust them. Scraping is more maintenance, but it is more reliable.
Indeed. A web site is an API, a crappy one, yes, but also one without unnecessary auth requirements and one they can't remove without removing their business.
Sure, but how often do websites change their layout? Once every five years at most. And it will take you no more than an hour to rewrite the signature. Ain't that hard. I've been doing that for a living and it's manageable, even if you monitor thousands of sites.
Ultimately it has to show the same data unless the site is completely discontinuing the service (in which case you are up a creek in either scenario). Even in the face of a total redesign you're usually just adjusting a few signatures and maybe tweaking the traversal tree.
You can run into problems when they remove things like chronological listing view in favor of some "algorithm" that you can't control, but even that just adds a little work on the back end to detect duplicates and an asterisk that some of the content will be missed.
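The way to keep those adjustments cheap is to isolate the signature in one place. A minimal sketch (Node + cheerio; the selectors are made up):

    // Keep the whole "signature" in one object; on a redesign you only touch this.
    const cheerio = require('cheerio');

    const SIGNATURE = {
      item:  'div.listing-row',   // hypothetical selectors
      title: 'h2.title',
      price: 'span.price'
    };

    function parseListing(html) {
      const $ = cheerio.load(html);
      return $(SIGNATURE.item).map((_, el) => ({
        title: $(el).find(SIGNATURE.title).text().trim(),
        price: $(el).find(SIGNATURE.price).text().trim()
      })).get();
    }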
There can also be an arms race where the site developers will start trying all kinds of crazy techniques to make scraping harder. Ironically this often happens after they revoke API access, meaning users don't have an alternative except to participate in the arms race.
> "spying" of one's competition, mainly in e-commerce. Tracking inventories, fine-tuning prices, tracking how they promote certain products online, etc.
That’s not spying, that’s a completely legitimate and normal use case of the web. Everything on the web is public, and the web was also created to allow that.
The irony, I think, is that the standards the current web is descended from (GML/SGML/HTML) were created explicitly for the purpose of making documents easier to read and understand by computers. Yet here we are, trying to make it as difficult as possible for computers to read those same documents.
I think you might be misunderstanding. Someone's scripted use of a Web site can burn more resources, he's saying, but it's not "for nothing" -- they're deriving some benefit from the scripting.
They wouldn't have to do that if the website wasn't - purposefully or through crappy engineering - making the data unnecessarily difficult to get and process by machines.
Without any irony, at my workplace we've spun up a Docker swarm with headless browsers and a simple RESTful API that takes screenshots of websites and serves them back to the employee's browser, so that our employees can surf the web without even the possibility of exposing themselves to malicious software garbage. It's not required, but it's an available corporate service. Another guy built a version that returns PDFs with working links so they can even click links.
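The general shape is something like this (a sketch with Express + Puppeteer; not the actual implementation, just the idea):

    // Sketch of a screenshot-proxy service: fetch a URL in headless Chrome,
    // return a PNG so the employee's browser never runs the page's JS.
    const express = require('express');
    const puppeteer = require('puppeteer');

    const app = express();

    app.get('/shot', async (req, res) => {
      const url = req.query.url;
      if (!url) return res.status(400).send('missing url');
      const browser = await puppeteer.launch({ headless: true });
      try {
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: 'networkidle2' });
        const png = await page.screenshot({ fullPage: true });
        res.type('image/png').send(png);
      } catch (err) {
        res.status(502).send('could not render ' + url);
      } finally {
        await browser.close();
      }
    });

    app.listen(3000);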
It's completely fucking stupid that we have to come up with this nonsense.
Yes, but from experience in the 1980s, the punters really did not like being nickel-and-dimed on pre-internet services such as Prestel, Tymnet/Telecom Gold and so on.
You probably want the web equivalent of malicious compliance - an algorithmically generated web-hole or similar. That way the bot author isn't entirely sure you're on to them; it could be a bot or server error. Like send the right headers but garbage data that looks like it's compressed but isn't, or doubly compressed garbage, or trim pages at a different place (before anything interesting), or slow data transfers, or ...
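The "looks compressed but isn't" variant is only a few lines (sketch in Node; isSuspectedBot is a placeholder for whatever heuristic you already have):

    // Sketch of the "right headers, garbage body" idea.
    const http = require('http');
    const crypto = require('crypto');

    http.createServer((req, res) => {
      if (isSuspectedBot(req)) {
        // Claim gzip, send random bytes: decompression fails on the client,
        // and it could just as easily be a server bug as a countermeasure.
        res.writeHead(200, { 'Content-Type': 'text/html', 'Content-Encoding': 'gzip' });
        res.end(crypto.randomBytes(4096));
        return;
      }
      res.end('<html>real page</html>');  // normal handling goes here
    }).listen(8080);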
The main thing headless chrome saves is having to spawn and manage xvfb per instance.
If more sites adopt serving fake data to headless chrome, people will just return to the old xvfb workflow for those sites, and use headless for everything else.
Could even use a small set of xvfb scrapers to verify the results, and automate detection of false data.
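The old workflow being roughly: launch a real, non-headless browser under a virtual display and drive it as usual (sketch; assumes Puppeteer and xvfb-run are available):

    // The old Xvfb workflow: a full, non-headless Chrome on a virtual display.
    // Run it as:  xvfb-run -a node scrape.js
    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch({ headless: false }); // real Chrome, but on Xvfb
      const page = await browser.newPage();
      await page.goto('https://example.com');  // placeholder URL
      console.log(await page.title());
      await browser.close();
    })();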
I saw that done on a site and it works extremely well. Inserting wrong data based on rules (there it was keyed on rate limiting rather than user agent) is extremely hard to detect. The scraper never knows whether they're seeing real or wrong data.
On the other hand, this will also get wrong data in search engines.
All web automation and automation prevention is a cat and mouse game where you never stop the scrapers, you just create more effort for them. It’s like traditional and digital security in that regard, except that security often has an element of difficulty in overcoming it (cryptography, thickness of physical barriers), whereas stopping web scraping is about adding more trivial things to make the process more complicated.
Eventually, human browsing and headless browsing converge. Nobody wants to make the human browsing experience bad, so the headless browsing continues.
In my opinion, if you’re running a site that is existentially threatened by someone else having your content, you need something else for your moat.
Don't worry. Thanks to the W3C and their EME standard, scraping protection will reach the level of other sorts of security. I'm surprised I haven't yet seen a simple framework for serving your page not as a page but as an EME-protected blob that bears a rendering of the content. We will see just that.
The whole point of using a headless browser is to work around websites that attempt to block simple "curl"-style scraping (or sites where you need to execute JavaScript to scrape).
So making it detectable (intentionally, even, right there in the user agent!) is really absurd.
Or actually, it makes one wonder about Google's motives.
That's one use-case for Headless browsers. Most people actually use Headless browsers to test their website, i.e. for functionality / performance / rendering.
That's definitely not the whole point of headless browsers, that's more of a side-effect.
The whole point of headless browsers is rather automation and testing.
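i.e. the bread-and-butter use looks like this (a minimal Puppeteer smoke test; the URL and selectors are made up):

    // Minimal headless smoke test. URL and selectors are made up for illustration.
    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();
      await page.goto('https://example.com/login');
      await page.type('#email', 'test@example.com');
      await page.click('button[type=submit]');
      await page.waitForSelector('.error-message');  // submitting without a password should show an error
      await browser.close();
    })();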
Same as torrents are for the distribution of legal content. That was the original thought and it's still used for that but I'd bet the majority of headless browser requests crawl websites not owned by the scraper.
Is there a way to enable Chrome PDF Viewer/Widevine Content Decryption Module etc in headless chromium? Is there some switch in chromium code base that would enable that?
Re. blocking scrapers: Some of us are neither vast corporate espionage practitioners nor zombie-botnet users: we're on our own, scraping for data science and other academic research purposes.
Is there some way to declare, "I am a legitimate academic user", something akin to 'TSA Pre' status?
"Sure, register for & use the site's API," you'll say. What if they don't have one?
"Sure, just don't slam the server with too many requests in a short time," you'll say. But if they're rejecting you just because they detect you're headless, etc...?
> But if they're rejecting you just because they detect you're headless, etc
Isn't that their right?
If I pay for my outgoing bandwidth (even if I don't) I am under no obligation to give my content/data/whatever to any third party source, even academic.
> If I pay for my outgoing bandwidth (even if I don't) I am under no obligation to give my content/data/whatever to any third party source, even academic.
Aren't you? You put a server on the publicly routable Internet. And made it talk over HTTP. At this point I believe you've already chosen to waive your rights not to serve content.
Isn't that the same argument regarding ripping music CDs? If I pay for the musicians, manufacturing and distribution costs to put a CD in stores, etc.
Although, I think you're framing it wrong, you're not obligated to give the content, someone is just choosing to consume it in a way you hadn't intended.
Headless browsers are used to create robots in order to automate the gaming of web-based value systems - thus diluting the value for legitimate participants. Examples:
* create fake profiles in order to boost someone's "followers" in a social network where you can monetize your "influencer" status
* click ads from a competitor in a way that triggers fraud prevention from the ad network, effectively preventing the competitor from advertising there
If there's malicious code on the page you could use this to block headless browsers (which might be security scanners) from trying to load / run the malicious code, such as CoinHive.
Rather than blocking a bot, it would make much more sense to CAPTCHA an IP that is producing a lot of traffic in a short time. Scraping has always been part of the web, and one should not believe that the information on a website is only ever going to be available on said website.
This approach only stops the most basic and laziest scrapers. Some people have tens of thousands of diverse IP addresses to utilize for scraping. Many of them will not give a shit about your bandwidth or server constraints and will cause your server to hit bottlenecks, making it slow and useless for everyone.
> it would make much more sense to CAPTCHA an IP that is producing a lot of traffic in a short time.
CAPTCHAs are useful, but they're an X/Y problem in the same way that this headless-detection is: trying to detect human vs bot, when the real solution would be to slow down (a portion of) the traffic.
Hashcash would seem like a better solution, since that doesn't lock anybody out (human or bot), it just slows them down to reduce server load. If some clients are higher priority than others (e.g. human users vs poorly-programmed bots) then use info like IP, cookies, etc. to slow down the low priority requests, or even adjust the difficulty depending on how likely the client is to be causing load.
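The verification side of hashcash is tiny (a sketch in Node: the server hands the client a random challenge, the client brute-forces a nonce, the server checks it with a single hash):

    // Hashcash-style proof-of-work check. Difficulty can scale with how
    // suspicious or how low-priority the client looks.
    const crypto = require('crypto');

    function leadingZeroBits(buf) {
      let bits = 0;
      for (const byte of buf) {
        if (byte === 0) { bits += 8; continue; }
        bits += Math.clz32(byte) - 24;  // clz32 counts from bit 31; bytes only use the low 8
        break;
      }
      return bits;
    }

    function verify(challenge, nonce, difficulty) {
      const hash = crypto.createHash('sha256')
        .update(challenge + ':' + nonce)
        .digest();
      return leadingZeroBits(hash) >= difficulty;
    }

    // e.g. difficulty 20 ~= a million hashes of client work on average,
    // but only one hash of work for the server to verify.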
Good point - for a lot of what unwanted headless scraping gets used for, I imagine returning subtly changing patterns of semi-useful data is probably more effective than blocking.
A fully blocked bot will error and get replaced with a working bot. A bot that subtly errors again, and again, and again will look almost-right and create a maintenance nightmare...
Yep, just feeding wrong data to a headless browser will trick most. If you keep data realistic (add small random error terms), it could take very long until someone finds out.
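Something along these lines (a sketch; the bot check and the field names are placeholders):

    // "Realistic but wrong": small random errors on numeric fields for suspected bots.
    function perturb(value, maxRelativeError = 0.03) {
      const error = (Math.random() * 2 - 1) * maxRelativeError;  // +/- 3% by default
      return Math.round(value * (1 + error) * 100) / 100;
    }

    function productFor(product, isSuspectedBot) {
      if (!isSuspectedBot) return product;
      return {
        ...product,
        price: perturb(product.price),
        stock: Math.max(0, Math.round(perturb(product.stock)))
      };
    }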
If someone is setting specialized properties in their browser to impersonate automated browsing, or is actually using automated browsing themselves, the question is only whether you (as a content provider) accept that their use is legitimate...
This wouldn't impact day-to-day users barring gross incompetence.
The code posted on the site trips simply on an incongruity between the JS behaviour (window.chrome) and the userAgent. I can see how this could misfire on an ordinary user who sets their userAgent to Chrome on Edge or Firefox for some compatibility reason, or who just forgot to turn off an old referrer override. There may be other valid reasons it fires on a user who is not a bot that I'm missing. And BLAM, they'll get all the wrong data for no reason... You may call it gross incompetence or whatever, but this method will get you one angry lost user at a time.
UserAgent detection is in the "old" group, is specific to your userAgent being "HeadlessChrome", and is no longer recommended. The new triggers are 'navigator.webdriver', a Chrome-extension-specific object, or specific permissions being set, none of which are relevant to or impacted by any of the scenarios you are highlighting...
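For reference, those newer checks look roughly like this (loosely based on the article's examples; the permissions incongruity in particular is as described there, not re-verified here):

    // Rough shape of the newer checks: webdriver flag, missing window.chrome,
    // and the permissions incongruity. Treat as a sketch, not the article's code.
    async function looksHeadless() {
      if (navigator.webdriver) return true;    // set by automation per the WebDriver spec
      if (!window.chrome) return true;         // present in normal desktop Chrome
      const status = await navigator.permissions.query({ name: 'notifications' });
      // Headless reportedly says "denied" while the query still says "prompt".
      if (Notification.permission === 'denied' && status.state === 'prompt') return true;
      return false;
    }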
Of course the JS itself can fail due to incongruent browser behaviour... but why would you trigger a bot obfuscation routine based on a failed JS call?
That is the gross incompetence I was referring to, and it's hard to call basic errors and a lack of basic testing anything but that.
Downvotes aside, the kinds of f-ups you're speculating about here are at the level of knowing how true/false works in JS.
And, no, there really are not valid reasons for users to be adding specific properties on their navigation objects to flag for headless, or use specific extension objects that report the use of headless automation, if they aren't. There is no valid reason you should set your Edge userAgent to "HeadlessChrome", either.
That's not an angry lost user, friend, that is an upset unauthorized third-party content scraper. I work with Open Data, so I don't care, but some sites for-realsies do.
Providing users with fake data is never a good idea, because it can be, and probably will be, used against you in the long run. Plus, no sane evil scraper uses the default referrer and no masking, so misfires are realistically possible given the thin line you have to walk to detect them.
In any case, users can do whatever they want with their client and still expect the service to work properly. If you detect abuse you should block or CAPTCHA them, but the mere possibility of them being a bot doesn't really call for such a drastic measure. It's the second-worst approach, after serving hindering scripts to them.
Disclaimer: I haven't downvoted you as I don't downvote things prompting a discussion.
Agreed. We use headless browsers for automated regression testing, and adding a check on production would help ensure that testers don't goof and test in the wrong environment, and that developers don't hard-code URLs that cause the environments to hop.
The techniques explained in the article seem like they'd be JS running on the browser itself, so… the "browser itself knows it's headless" pretty much sums it up.
If your goal is to only allow the original Google Chrome browser, that is fine. Otherwise this might cause false alarms.