Ask HN: Have you ever used anti detect browsers for web scraping?
17 points by DantesTravel on Nov 18, 2022 | 24 comments
I've been in the web scraping industry for a while and I often spend some time building my "Swiss Army knife" with Playwright or Selenium in case things get tough. Thanks to a niche substack I follow, I discovered only today the existence of anti-detect browsers like GoLogin and others. From what I can see, they seem a good solution for small projects, but difficult to scale to larger ones because of licensing and infrastructure costs (most of them require a Windows machine to run). Do any of you smarter than me use these browsers at a large scale? What does your tech stack look like?



How would you actually use an anti-detect browser programmatically? Would you need to write a custom Selenium driver for it or equivalent for Playwright? Even if the browser is built off something like Chrome, you'd still need a way to interact with the anti-detect related features.

A good trick I discovered is using webkit through Playwright to bypass fingerprinting and related anti-bot measures. Firefox/Chrome simply leak too much information, even with various "stealth" modifications. E.g. I've been able to reliably scrape a well-known company's site that implemented a "state of the art, AI-powered, behavioral analysis, etc" anti-bot product. Using Chrome/Firefox + stealth measures in Playwright did not work - simply switching to webkit with no further modifications did the trick.

Not exactly what you're asking, but my point is that, with a little time and effort, I've usually been able to find fairly simple holes in most anti-bot measures -- it probably wouldn't be terribly hard (especially since you're versed in scraping) to build out something similar to what you're looking to achieve without having to pay for sketchy anti-detect browsers.


Yes, that’s what I’ve done up to now. When forced to use Playwright, I’ve also noticed that webkit is less detected, but it depends on the website. I tried the solution described on the substack: fundamentally, the GoLogin browser, which is based on Chromium, opens a port on your local machine and Playwright connects to that browser, automating the crawling.
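For the curious, the pattern described above looks roughly like this in the Python port of Playwright. This is a hypothetical sketch: the port number (35000) is an example, and the exact CDP endpoint depends on the anti-detect tool you're running.

```python
# Hypothetical sketch: attach Playwright to an anti-detect browser
# that exposes a Chrome DevTools Protocol (CDP) endpoint locally.
# The port (35000) is an example; check your tool's docs for the real one.

def cdp_endpoint(port: int, host: str = "127.0.0.1") -> str:
    """Build the local CDP URL Playwright should connect to."""
    return f"http://{host}:{port}"

if __name__ == "__main__":
    from playwright.sync_api import sync_playwright  # requires `pip install playwright`

    with sync_playwright() as p:
        # Attach to the already-running anti-detect browser instead of launching one.
        browser = p.chromium.connect_over_cdp(cdp_endpoint(35000))
        context = browser.contexts[0]  # reuse the profile's existing context
        page = context.pages[0] if context.pages else context.new_page()
        page.goto("https://example.com")
        print(page.title())
```

Because Playwright only attaches over CDP, the anti-detect browser keeps its own fingerprinting tweaks; Playwright just drives navigation.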


Yeah, Chrome is the worst choice for this use-case - see my last comment on this thread for more on that. Can you speak a bit more on what you'd like to use a headless anti-detect browser for over regular headless browsers? Is it to leverage their built-in fingerprinting control, effectively avoiding anti-bot measures with little effort, or management of multiple "profiles", etc.? My system effectively comes down to using webkit and storing credentials (encrypted with a symmetric key) as well as whatever information is needed by Playwright to reconstruct the session. Simply using webkit + a DB effectively achieves a headless anti-detect browser, but you're right that webkit alone isn't always a one-and-done solution.
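A minimal sketch of the session-reconstruction part described above, using Playwright's `storage_state()` (which captures cookies and localStorage as JSON). The URLs are placeholders, and the encryption-at-rest step mentioned above is omitted here for brevity.

```python
# Sketch of the "webkit + DB" session pattern: save Playwright's storage
# state after logging in, then rebuild the session from it on a later run.
import json

def cookie_names(storage_state_json: str) -> list[str]:
    """Names of cookies in a Playwright storage-state blob (handy for sanity checks)."""
    return [c["name"] for c in json.loads(storage_state_json).get("cookies", [])]

if __name__ == "__main__":
    from playwright.sync_api import sync_playwright  # requires `pip install playwright`

    with sync_playwright() as p:
        browser = p.webkit.launch()

        # First run: log in, then persist the session.
        context = browser.new_context()
        page = context.new_page()
        page.goto("https://example.com/login")
        # ... perform login steps here ...
        context.storage_state(path="session.json")  # store this (encrypted) in your DB
        context.close()

        # Later run: rebuild the session from the saved state.
        restored = browser.new_context(storage_state="session.json")
        restored.new_page().goto("https://example.com/account")
        browser.close()
```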


Any tips/code examples for your webkit solution(s)? Where does one begin with using webkit for scraping?

I think using anti-fingerprinting is itself a fingerprint. I imagine it would be easier to hide in the noise of regular browsers.


> I think using anti-fingerprinting is itself a fingerprint. I imagine it would be easier to hide in the noise of regular browsers.

That's what I thought originally too. The problem is the "leakiness" of Chrome and Firefox - they expose a large amount of information that can easily be used to train various ML classifiers. Chrome's DevTools Protocol is most commonly used when headless access to Chrome is desired and is inherently "leaky" by design, being a protocol for debugging. Don't even try to use any flavor of headless Chrome, even with stealth plugins. Firefox isn't much better.

Webkit doesn't seem to expose as much information, and since it has a much smaller share of usage, I think there's simply less information to feed into a classifier to learn to detect it reliably. There are a few sites that offer fingerprint testing, such as:

- https://amiunique.org/fp

- https://webscraping.pro/wp-content/uploads/2021/02/testresul...

Try writing a script that goes to a page like this and takes a screenshot, using Chrome, Firefox, and then webkit, to see the difference yourself. I use the Python port of Playwright personally. In the project I mentioned in my last comment, all I had to do was change the browser Playwright was using to webkit - i.e. "browser = p.webkit.launch()" where "p" is a sync_playwright context manager instance. I tried Chrome and Firefox with many, many attempts at stealth modifications and none worked. Removing my "stealth code" for the other browsers and switching to webkit was all that was needed. Blew me away that it was that simple, honestly. I've used this trick on other websites and have noticed webkit just gets processed differently by captchas/anti-bot, etc. Selenium should also offer support for a WebKit driver if you prefer it over Playwright.
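The screenshot comparison described above can be sketched in a few lines with the Python port of Playwright. The fingerprint-test URL is just one of the examples linked above; swap in whichever tester you prefer.

```python
# Compare how each browser engine is fingerprinted: visit a fingerprint-test
# page with chromium, firefox, and webkit, and screenshot each result.

FINGERPRINT_TEST_URL = "https://amiunique.org/fp"

def screenshot_with(engine_name: str, url: str, out_path: str) -> str:
    """Open `url` in the given engine and save a screenshot; returns out_path."""
    from playwright.sync_api import sync_playwright  # requires `pip install playwright`

    with sync_playwright() as p:
        # sync_playwright exposes .chromium, .firefox, and .webkit
        browser = getattr(p, engine_name).launch()
        page = browser.new_page()
        page.goto(url)
        page.screenshot(path=out_path)
        browser.close()
    return out_path

if __name__ == "__main__":
    for engine in ("chromium", "firefox", "webkit"):
        screenshot_with(engine, FINGERPRINT_TEST_URL, f"{engine}.png")
```

Run `playwright install` first so all three engines are available locally.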


I've found that it's almost never needed. Most of the "advanced AI human detection" things are glorified IP reputation systems. So you just need a few IPs that would be way too painful to block, for example US residential IPs, and you're good.

But if you really want to make sure, it's pretty easy to remote-control a cheap Android phone. Plus detection thresholds tend to be much higher on mobile, because filling out a ReCaptcha with a touch screen is just such a horrible user experience.


Interesting idea, leveraging a cheap Android. I wonder how difficult it would be to modify an instance of a regular headless browser to convince a website you're using an Android browser. Not sure if Androids just come with mobile Chrome these days or if OEM/carrier-developed stock browsers still get shipped.
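One hedged starting point for "convincing a website you're on Android": Playwright ships device descriptors (user agent, viewport, touch support, device pixel ratio) that can be applied to a context. Whether this survives the canvas/GPU render-hash checks mentioned elsewhere in this thread is a separate question.

```python
# Sketch: emulate a mobile browser using Playwright's bundled device
# descriptors. "Pixel 5" is one of the descriptors Playwright ships.

DEVICE_NAME = "Pixel 5"

if __name__ == "__main__":
    from playwright.sync_api import sync_playwright  # requires `pip install playwright`

    with sync_playwright() as p:
        pixel = p.devices[DEVICE_NAME]  # dict of UA, viewport, touch, DPR settings
        browser = p.chromium.launch()
        context = browser.new_context(**pixel)
        page = context.new_page()
        page.goto("https://example.com")
        browser.close()
```

This only spoofs the surface-level signals; GPU and canvas fingerprints will still be those of the host machine.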

Also totally right on the IP reputation point. I saw a post on HN in the last few months of someone describing how they used a cheap mobile data plan + USB LTE modem to proxy their web scraping. I believe you get effectively treated as a residential IP (depends on the complexity of the system - if they're simply blacklisting datacenter IPs then this should work) with the additional benefit of being able to change the IP assigned to the modem easily.


> remote-control a cheap Android phone.

Any idea if the Android emulator would suffice? Certainly cheaper and easier to automate, since rooting an emulator can be much easier than rooting an actual phone, which is usually hardened against such things.


Could you use some kind of android emulator for this task instead of an actual phone?


Not if the anti-bot product is checking canvas and GPU render hashes. If you present a random hash instead of a well-known hash for an iPhone or flagship Android handset, that generally registers as a red flag.


There's a good community on this topic called Scraping Enthusiasts here: https://discord.gg/4fGEPZzs Plus a curated list of research papers here if you want to go deep on the subject matter: https://github.com/prescience-data/dark-knowledge


The Hero browser is designed for this kind of sneaky scraping, it’s very interesting: https://github.com/ulixee/hero


Can you share the substack?


I didn’t mention it because someone would think I’m promoting it: it’s called The Web Scraping Club.


Seems like a fairly shallow (content marketing fluff) substack to be honest. There are better places to follow, like: https://www.trickster.dev/post/


Some posts are good and more technical, others not. But in my opinion it's worth reading.


> So, what is wget?

I know about https://xkcd.com/1053/ but _come on_


this smells like an ad for GoLogin


Or a covert ad for the substack, given their only other submission was the substack. Either way it doesn't really matter; it was a fair post and brought up an interesting idea of stealth browsers, though I don't think they work all that well.


Agreed


The substack post mentions several browsers, so I would say no. I didn’t even mention the substack, so no one can say I’m promoting it.


What do you mean by "most of them require a windows machine to run"?


In most cases the browser only has a client for Windows.


if you don't want to be detected, run Chrome in a VM and move the mouse around with PyUserInput
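A hypothetical sketch of "move the mouse around": generate a slightly noisy straight-line path between two points, then replay it with PyUserInput's PyMouse. The step count and jitter values are arbitrary assumptions, not tuned numbers.

```python
# Generate a human-ish mouse path: a straight line with random jitter,
# endpoints pinned exactly. Replay it with PyUserInput's PyMouse.
import random

def mouse_path(x0, y0, x1, y1, steps=50, jitter=3):
    """List of (x, y) points from (x0, y0) to (x1, y1) with random jitter."""
    points = []
    for i in range(steps + 1):
        t = i / steps
        x = x0 + (x1 - x0) * t + random.uniform(-jitter, jitter)
        y = y0 + (y1 - y0) * t + random.uniform(-jitter, jitter)
        points.append((round(x), round(y)))
    points[0], points[-1] = (x0, y0), (x1, y1)  # pin endpoints exactly
    return points

if __name__ == "__main__":
    from pymouse import PyMouse  # PyUserInput; needs a running display server

    mouse = PyMouse()
    for x, y in mouse_path(100, 100, 800, 500):
        mouse.move(x, y)
```

Adding a small random sleep between moves (and curving the path rather than jittering a line) would make it look less mechanical still.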



