Ask HN: Have you ever used anti detect browsers for web scraping?
17 points by DantesTravel on Nov 18, 2022 | 24 comments
I've been in the web scraping industry for a while and I often spend some time building my "Swiss Army knife" with Playwright or Selenium in case things get tough. Thanks to a niche substack I follow, I discovered only today the existence of anti-detect browsers like GoLogin and others. From what I can see, they seem a good solution for small projects, but difficult to scale to larger ones because of licensing and infrastructure costs (most of them require a Windows machine to run). Do any of you smarter than me use these browsers at a large scale? What does your tech stack look like?



How would you actually use an anti-detect browser programmatically? Would you need to write a custom Selenium driver for it or equivalent for Playwright? Even if the browser is built off something like Chrome, you'd still need a way to interact with the anti-detect related features.

A good trick I discovered is using webkit through Playwright to bypass fingerprinting and related anti-bot measures. Firefox/Chrome simply leak too much information, even with various "stealth" modifications. E.g. I've been able to reliably scrape a well-known company's site that implemented a "state of the art, AI-powered, behavioral analysis, etc" anti-bot product. Using Chrome/Firefox + stealth measures in Playwright did not work - simply switching to webkit with no further modifications did the trick.

Not exactly what you're asking, but my point is that, with a little time and effort, I've usually been able to find fairly simple holes in most anti-bot measures -- it probably wouldn't be terribly hard (especially since you're versed in scraping) to build out something similar to what you're looking to achieve without having to pay for sketchy anti-detect browsers.


Yes, that’s what I’ve done up to now. When forced to use Playwright, I’ve also noticed that webkit is less detected, but it depends on the website. I tried the solution described on the substack: fundamentally, the GoLogin browser, which is based on Chromium, opens a port on your local machine and Playwright connects to that browser, automating the crawling.
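For the curious, the pattern described above looks roughly like this in the Python port of Playwright. This is a hypothetical sketch: the port number (35000) is an example, and the exact CDP endpoint depends on the anti-detect tool you're running.

```python
# Hypothetical sketch: attach Playwright to an anti-detect browser
# that exposes a Chrome DevTools Protocol (CDP) endpoint locally.
# The port (35000) is an example; check your tool's docs for the real one.

def cdp_endpoint(port: int, host: str = "127.0.0.1") -> str:
    """Build the local CDP URL Playwright should connect to."""
    return f"http://{host}:{port}"

if __name__ == "__main__":
    from playwright.sync_api import sync_playwright  # requires `pip install playwright`

    with sync_playwright() as p:
        # Attach to the already-running anti-detect browser instead of launching one.
        browser = p.chromium.connect_over_cdp(cdp_endpoint(35000))
        context = browser.contexts[0]  # reuse the profile's existing context
        page = context.pages[0] if context.pages else context.new_page()
        page.goto("https://example.com")
        print(page.title())
```

Because Playwright only attaches over CDP, the anti-detect browser keeps its own fingerprinting tweaks; Playwright just drives navigation.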


Yeah, Chrome is the worst choice for this use-case - see my last comment on this thread for more on that. Can you speak a bit more on what you'd like to use a headless anti-detect browser for over regular headless browsers? Is it to leverage their built-in fingerprinting control, effectively avoiding anti-bot measures with little effort, or management of multiple "profiles", etc.? My system effectively comes down to using webkit and storing credentials (encrypted with a symmetric key) as well as whatever information is needed by Playwright to reconstruct the session. Simply using webkit + a DB effectively achieves a headless anti-detect browser, but you're right that webkit alone isn't always a one-and-done solution.
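A minimal sketch of the session-reconstruction part described above, using Playwright's `storage_state()` (which captures cookies and localStorage as JSON). The URLs are placeholders, and the encryption-at-rest step mentioned above is omitted here for brevity.

```python
# Sketch of the "webkit + DB" session pattern: save Playwright's storage
# state after logging in, then rebuild the session from it on a later run.
import json

def cookie_names(storage_state_json: str) -> list[str]:
    """Names of cookies in a Playwright storage-state blob (handy for sanity checks)."""
    return [c["name"] for c in json.loads(storage_state_json).get("cookies", [])]

if __name__ == "__main__":
    from playwright.sync_api import sync_playwright  # requires `pip install playwright`

    with sync_playwright() as p:
        browser = p.webkit.launch()

        # First run: log in, then persist the session.
        context = browser.new_context()
        page = context.new_page()
        page.goto("https://example.com/login")
        # ... perform login steps here ...
        context.storage_state(path="session.json")  # store this (encrypted) in your DB
        context.close()

        # Later run: rebuild the session from the saved state.
        restored = browser.new_context(storage_state="session.json")
        restored.new_page().goto("https://example.com/account")
        browser.close()
```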


Any tips/code examples for your webkit solution(s)? Where does one begin with using webkit for scraping?

I think using anti-fingerprinting is itself a fingerprint. I imagine it would be easier to hide in the noise of regular browsers.


> I think using anti-fingerprinting is itself a fingerprint. I imagine it would be easier to hide in the noise of regular browsers.

That's what I thought originally too. The problem is the "leakiness" of Chrome and Firefox - they expose a large amount of information that can easily be used to train various ML classifiers. Chrome's DevTools Protocol is most commonly used when headless access to Chrome is desired and is inherently "leaky" by design, being a protocol for debugging. Don't even try to use any flavor of headless Chrome, even with stealth plugins. Firefox isn't much better.

Webkit doesn't seem to expose as much information, and since it has a much smaller share of usage, I think there's simply less information to feed into a classifier to learn to detect it reliably. There are a few sites that offer fingerprint testing, such as:

- https://amiunique.org/fp

- https://webscraping.pro/wp-content/uploads/2021/02/testresul...

Try writing a script that goes to a page like this and takes a screenshot, using Chrome, Firefox, and then webkit, to see the difference yourself. I use the Python port of Playwright personally. In the project I mentioned in my last comment, all I had to do was change the browser Playwright was using to webkit - i.e. "browser = p.webkit.launch()" where "p" is a sync_playwright context manager instance. I tried Chrome and Firefox with many, many attempts at stealth modifications and none worked. Removing my "stealth code" for the other browsers and switching to webkit was all that was needed. Blew me away that it was that simple, honestly. I've used this trick on other websites and have noticed webkit just gets processed differently by captchas/anti-bot, etc. Selenium should also offer support for a WebKit driver if you prefer it over Playwright.
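The screenshot comparison described above can be sketched in a few lines with the Python port of Playwright. The fingerprint-test URL is just one of the examples linked above; swap in whichever tester you prefer.

```python
# Compare how each browser engine is fingerprinted: visit a fingerprint-test
# page with chromium, firefox, and webkit, and screenshot each result.

FINGERPRINT_TEST_URL = "https://amiunique.org/fp"

def screenshot_with(engine_name: str, url: str, out_path: str) -> str:
    """Open `url` in the given engine and save a screenshot; returns out_path."""
    from playwright.sync_api import sync_playwright  # requires `pip install playwright`

    with sync_playwright() as p:
        # sync_playwright exposes .chromium, .firefox, and .webkit
        browser = getattr(p, engine_name).launch()
        page = browser.new_page()
        page.goto(url)
        page.screenshot(path=out_path)
        browser.close()
    return out_path

if __name__ == "__main__":
    for engine in ("chromium", "firefox", "webkit"):
        screenshot_with(engine, FINGERPRINT_TEST_URL, f"{engine}.png")
```

Run `playwright install` first so all three engines are available locally.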


I've found that it's almost never needed. Most of the "advanced AI human detection" things are glorified IP reputation systems. So you just need a few IPs that would be way too painful to block, for example US residential IPs, and you're good.

But if you really want to make sure, it's pretty easy to remote-control a cheap Android phone. Plus detection thresholds tend to be much higher on mobile, because filling out a ReCaptcha with a touch screen is just such a horrible user experience.


Interesting idea, leveraging a cheap Android. I wonder how difficult it would be to modify an instance of a regular headless browser to convince a website you're using an Android browser. Not sure if Androids just come with mobile Chrome these days or if OEM/carrier-developed stock browsers still get shipped.
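One hedged starting point for "convincing a website you're on Android": Playwright ships device descriptors (user agent, viewport, touch support, device pixel ratio) that can be applied to a context. Whether this survives the canvas/GPU render-hash checks mentioned elsewhere in this thread is a separate question.

```python
# Sketch: emulate a mobile browser using Playwright's bundled device
# descriptors. "Pixel 5" is one of the descriptors Playwright ships.

DEVICE_NAME = "Pixel 5"

if __name__ == "__main__":
    from playwright.sync_api import sync_playwright  # requires `pip install playwright`

    with sync_playwright() as p:
        pixel = p.devices[DEVICE_NAME]  # dict of UA, viewport, touch, DPR settings
        browser = p.chromium.launch()
        context = browser.new_context(**pixel)
        page = context.new_page()
        page.goto("https://example.com")
        browser.close()
```

This only spoofs the surface-level signals; GPU and canvas fingerprints will still be those of the host machine.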

Also totally right on the IP reputation point. I saw a post on HN in the last few months of someone describing how they used a cheap mobile data plan + USB LTE modem to proxy their web scraping. I believe you get effectively treated as a residential IP (depends on the complexity of the system - if they're simply blacklisting datacenter IPs then this should work) with the additional benefit of being able to change the IP assigned to the modem easily.


> remote-control a cheap Android phone.

Any idea if the Android emulator would suffice? Certainly cheaper and easier to automate, since rooting an emulator can be much easier than rooting an actual phone, which is usually hardened against such things.


Could you use some kind of android emulator for this task instead of an actual phone?


Not if the anti-bot product is checking canvas and GPU render hashes. If you present a random hash instead of a well-known hash for an iPhone or flagship Android handset, that generally registers as a red flag.


There's a good community on this topic called Scraping Enthusiasts here: https://discord.gg/4fGEPZzs Plus a curated list of research papers here if you want to go deep on the subject matter: https://github.com/prescience-data/dark-knowledge


The Hero browser is designed for this kind of sneaky scraping, it’s very interesting: https://github.com/ulixee/hero


Can you share the substack?


I didn’t mention it because someone would think I’m promoting it: it’s called The Web Scraping Club.


Seems like a fairly shallow (content marketing fluff) substack to be honest. There are better places to follow, like: https://www.trickster.dev/post/


Some posts are good and more technical, others not. But in my opinion it's worth reading.


> So, what is wget?

I know about https://xkcd.com/1053/ but _come on_


this smells like an ad for GoLogin


Or a covert ad for the substack, given their only other submission was the substack. Either way it doesn't really matter; it was a fair post and brought up an interesting idea of stealth browsers, though I don't think they work all that well.


Agreed


The substack post mentions several browsers, so I would say no. I didn’t even mention the substack, so no one can say I’m promoting it.


What do you mean by "most of them require a windows machine to run"?


In most cases the browser only has a client for Windows.


if you don't want to be detected, run Chrome in a VM and move the mouse around with PyUserInput
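A hypothetical sketch of "move the mouse around": generate a slightly noisy straight-line path between two points, then replay it with PyUserInput's PyMouse. The step count and jitter values are arbitrary assumptions, not tuned numbers.

```python
# Generate a human-ish mouse path: a straight line with random jitter,
# endpoints pinned exactly. Replay it with PyUserInput's PyMouse.
import random

def mouse_path(x0, y0, x1, y1, steps=50, jitter=3):
    """List of (x, y) points from (x0, y0) to (x1, y1) with random jitter."""
    points = []
    for i in range(steps + 1):
        t = i / steps
        x = x0 + (x1 - x0) * t + random.uniform(-jitter, jitter)
        y = y0 + (y1 - y0) * t + random.uniform(-jitter, jitter)
        points.append((round(x), round(y)))
    points[0], points[-1] = (x0, y0), (x1, y1)  # pin endpoints exactly
    return points

if __name__ == "__main__":
    from pymouse import PyMouse  # PyUserInput; needs a running display server

    mouse = PyMouse()
    for x, y in mouse_path(100, 100, 800, 500):
        mouse.move(x, y)
```

Adding a small random sleep between moves (and curving the path rather than jittering a line) would make it look less mechanical still.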



