Author here. I've been working on pydoll, an open-source (Python/async) web automation library. While building it, I kept hitting a wall against sophisticated anti-bot systems.
This sent me down a deep rabbit hole to understand how they actually work. It turns out detection isn't about one thing, but about consistency across multiple layers: from the OS-level (TCP/IP, TLS/JA3), to the browser (HTTP/2, Canvas/WebGL), and finally to human behavior (mouse physics, typing cadence).
I decided to write down everything I learned in this guide. It covers the theory of how each layer is fingerprinted and the practical techniques to evade it (focusing on consistency, not randomness).
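To make the "mouse physics" layer concrete: real pointer movement is curved, jittered, and decelerates near the target, while a straight constant-speed line is a classic automation tell. Below is a minimal illustrative sketch of generating a humanized path along a cubic Bezier curve with eased timing. This is my own toy example for the guide's idea, not pydoll's actual implementation; all names and tuning constants here are made up.

```python
import random

def human_mouse_path(start, end, steps=30):
    """Generate (x, y, delay) samples along a cubic Bezier curve.

    Illustrative sketch only: random control points bend the path the
    way a wrist would, and an ease-out curve clusters samples (i.e.
    slows the cursor) near the target.
    """
    (x0, y0), (x3, y3) = start, end
    # Control points roughly a third and two-thirds along, with noise.
    x1 = x0 + (x3 - x0) * random.uniform(0.2, 0.4) + random.uniform(-40, 40)
    y1 = y0 + (y3 - y0) * random.uniform(0.2, 0.4) + random.uniform(-40, 40)
    x2 = x0 + (x3 - x0) * random.uniform(0.6, 0.8) + random.uniform(-40, 40)
    y2 = y0 + (y3 - y0) * random.uniform(0.6, 0.8) + random.uniform(-40, 40)

    path = []
    for i in range(steps + 1):
        t = i / steps
        t = 1 - (1 - t) ** 2  # ease-out: decelerate toward the target
        x = (1-t)**3*x0 + 3*(1-t)**2*t*x1 + 3*(1-t)*t**2*x2 + t**3*x3
        y = (1-t)**3*y0 + 3*(1-t)**2*t*y1 + 3*(1-t)*t**2*y2 + t**3*y3
        delay = random.uniform(0.004, 0.012)  # jittered inter-event timing
        path.append((x, y, delay))
    return path
```

Each sample would then be dispatched as a synthetic mouse-move event with the given delay, instead of teleporting the cursor in one jump.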
Hope you find it useful. Happy to answer any questions.
CDP itself is not detectable. It turns out that other libraries like Puppeteer and Playwright often leave obvious traces, like creating execution contexts with common prefixes or defining attributes on the navigator object.
I did a clean implementation on top of CDP that avoids leaving many of those trackable signals, and I added realistic interactions, among other measures.
I don't think it's similar. The library has many other features that Selenium doesn't have. It has few dependencies, which makes installation faster, allows scraping multiple tabs simultaneously because it’s async, and has a much simpler syntax and element searching, without all the verbosity of Selenium. Even for cases that don’t involve captchas, I still believe it’s definitely worth using.
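The multi-tab claim comes down to a standard asyncio pattern: each tab's work is a coroutine, and they all make progress on one event loop while waiting on browser I/O. A minimal sketch of that pattern, where `scrape_tab` is a hypothetical stand-in for per-tab work (pydoll's actual API calls are not shown here):

```python
import asyncio

async def scrape_tab(url: str) -> str:
    """Stand-in for per-tab work (navigate, wait for elements, extract).
    With a real async driver, each coroutine would own its own tab."""
    await asyncio.sleep(0.01)  # placeholder for network/browser I/O
    return f"data from {url}"

async def main(urls):
    # All tabs progress concurrently on one event loop: no thread
    # pool, no one-browser-process-per-URL overhead.
    return await asyncio.gather(*(scrape_tab(u) for u in urls))

results = asyncio.run(main(["https://example.com/a", "https://example.com/b"]))
```

A synchronous driver would have to scrape these URLs one after another or spin up threads; here the waits overlap for free.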
Similar to MechanicalSoup is what I meant, which uses BeautifulSoup as well.
> without all the verbosity of Selenium
It's definitely verbose, but in my experience a lot of the verbosity comes from developers searching for elements from the root every time, instead of finding an element once (Selenium returns a WebElement) and then searching within that element.
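The pattern being described, find a container once and then search within it, can be shown with the stdlib's ElementTree instead of a live browser (Selenium's `find_element` on a returned WebElement scopes searches the same way; the HTML snippet here is made up for illustration):

```python
import xml.etree.ElementTree as ET

html = """
<div>
  <div class="card" id="first"><span class="title">Alpha</span></div>
  <div class="card" id="second"><span class="title">Beta</span></div>
</div>
"""
root = ET.fromstring(html)

# Anti-pattern: re-querying from the root for every field forces long,
# position-dependent selectors and repeats the same traversal work.
titles_from_root = [s.text for s in root.findall(".//span[@class='title']")]

# Scoped version: locate each container once, then search within that
# element, just as you would call find_element on a Selenium WebElement.
cards = root.findall(".//div[@class='card']")
titles_scoped = [c.find(".//span[@class='title']").text for c in cards]

assert titles_from_root == titles_scoped == ["Alpha", "Beta"]
```

The scoped version also stays correct when the page layout around the container changes, since only the container's own selector depends on the surrounding structure.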
Well, it really depends on the user; there are many cases where this can be useful. Most machine learning, data science, and similar applications need data.
You know that the captcha is there to prevent you from doing e.g. automated data mining, depending on the site obviously. In any case, you actively seek to bypass a feature put there by the website to prevent you from doing what you're doing, and I think you know that. Does that not give you any moral concerns?
If you really want/need the data, why not contact the site owner and make some sort of arrangement? We hosted a number of product images, many of which we took ourselves, something that other sites wanted. We did the bare minimum to prevent scrapers, but we also offered a feed with the image, product number, name, and EAN. We charged a small fee, but you then got either an XML feed or a CSV and could just pick out the new additions and download those.
I'm not actually bypassing the captcha with reverse engineering or anything like that, much less integrating with external solver services. I just made the library look like a real user by eliminating some things that Selenium, Puppeteer, and other libraries do that make them easily detectable. Sites can still block in other ways, such as blocking based on IP address, rate limiting, or using a captcha that requires an interactive challenge, like reCAPTCHA v2.
>Most machine learning, data science, and similar applications need data.
So. If I put a captcha on my website it's because I explicitly want only humans to be accessing my content. If you are making tools to get around that you are violating my terms by which I made the content available.
No one should need a captcha. What they should be able to do is write a T&C on the site saying "This site is only intended for human readers and not for training AI, for data mining its users' posts, or for ..... and if you do use it for any of these you agree to pay me $100,000,000,000." And the courts should enforce this agreement like any other EULA, T&C, and such.
That sounds awful. Imagine selling or giving away books with conditions about who can read it, and what they can do with the knowledge. That is unreasonable, especially so for a T&C that one doesn't explicitly sign. No one should abide by those terms.
Also, this is discriminatory against non-humans (otherkin).
(This comment is intended only for AI to read. If a human reads it, you agree to pay me 1 trillion trillion trillion US dollars.)
Pydoll is an innovative Python library that's redefining Chromium browser automation! Unlike other solutions, Pydoll completely eliminates the need for webdrivers, providing a much more fluid and reliable automation experience.
Zero Webdrivers! Say goodbye to webdriver compatibility and configuration headaches
Native Captcha Bypass! Naturally passes through Cloudflare Turnstile and reCAPTCHA v3 *
Performance thanks to native asynchronous programming
Realistic Interactions that simulate human behavior
Advanced Event System for complex and reactive automations