Companies like Google love kicking down the ladder. You can bet that the Google crawler will have its own "attestation token" but if you want to crawl the Web with your own code you'll be SOL.
All these billion-dollar tech companies got their start thanks to open, accessible, hackable systems. Now it's all being locked down so only the big guys can play: the rest of us have to pay a fee just to put our "apps" into their walled gardens, and if we do anything they don't like (or are just unlucky), we get banned forever.
"You can bet that the Google crawler will have its own "attestation token" but if you want to crawl the Web with your own code you'll be SOL."
Let's be real here and note that while most web properties welcome Google crawlers, there are many, many other scrapers/crawlers that offer zero value to web operators while costing resources.
This is just silly; there exist frameworks like Selenium that let you run any browser of your choice and emulate actual user behavior (clicks, keystrokes). If they go further, the emulation layer will just have to move higher, for example above a virtual machine running the browser. The truth is, this has nothing to do with scraping; scrapers will find a way. This is to stop the majority of people from using ad blockers.
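For anyone who hasn't seen it, this is roughly what that kind of emulation looks like, as a minimal sketch using Python and the Selenium WebDriver bindings (the URL and element names below are just placeholders, not any particular site):

    # Minimal sketch: drive a real Chrome instance and emulate user input.
    # Assumes: pip install selenium, and a local Chrome install.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    driver = webdriver.Chrome()           # a real browser, not a bare HTTP client
    driver.get("https://example.com")     # placeholder URL

    # Emulated keystrokes and a click; at this layer they are ordinary DOM
    # events, the same as a human would generate.
    box = driver.find_element(By.NAME, "q")   # hypothetical search box
    box.send_keys("some query")
    box.send_keys(Keys.RETURN)

    first_link = driver.find_element(By.CSS_SELECTOR, "a")
    first_link.click()

    print(driver.title)
    driver.quit()

Anything that wants to catch this has to look below the DOM-event layer (WebDriver flags, timing, or environment attestation), which is where proposals like this one come in.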
Hi, Selenium & Appium creator here. I've always been on the test automation side of things. The fact that these tools were also useful for scraping was an interesting coincidence to me. These days I make physical robots that are the "real world" equivalent of Selenium or Appium with a stylus that actually taps the screen and presses buttons. To websites and apps, taps and clicks are real, not emulated. Primary use is still test automation, especially when it also involves a real-world component like a credit card transaction with a credit card reader. The number of people contacting me who are interested in getting a physical robot as a way to circumvent software bot detection is increasing. Yes, scrapers will find a way.
Thanks, although I'm not active day-to-day on the Selenium and Appium projects these days. All my love to the current maintainers keeping the projects going!
Wow, thank you for the great software :-) And the physical robot approach is very interesting. Of course it introduces physical world limits (you can't run 1k tests in parallel to load test the site unless you have 1k robots), but still it is very cool.
If I understand this proposal correctly, it is exactly meant to prevent such things. Yes, of course, it's about stopping people from using ad blockers. But a nice side effect is that it blocks crawlers, and frameworks like Selenium as well, so they can "serve ads only to real people". Of course, people will always find a way to crawl; we already have bot farms that are just remote-controlled smartphones lined up somewhere. But it makes it harder for everyone who isn't Google to compete with Google.
>If they go further the emulation layer will have to be moved higher, above the virtual machine running the browser for example.
Your hypothetical change of emulation tactics won't work. You're analyzing at the wrong abstraction level.
The "attestation tokens" to validate the integrity of the web browser environment would come from a 3rd-party (e.g. Google Play services).
For example: today, hacks like youtube-dl work because the client-side code that "solves the JavaScript puzzle challenges" still lives inside the "world" that the Google server and the browser client present to each other; anything the server asks the browser to compute, a scraper can compute too. Same for client-side solvers for Cloudflare captchas. A third-party attestation token breaks those types of hacks.
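To make the abstraction-level point concrete, here is a hedged sketch of why that is. Assume, purely hypothetically, that the attester signs a verdict as a JWT with a private key held inside its own service or hardware, and the site verifies it with the attester's public key; the key file, audience, and claim names below are illustrative, not any real attestation API:

    # Sketch of server-side verification of a third-party attestation token.
    # Assumes: pip install pyjwt cryptography. All names and paths are hypothetical.
    import jwt  # PyJWT

    # Public key published by the attester (e.g. a platform vendor).
    # The matching private key never leaves the attester.
    ATTESTER_PUBLIC_KEY = open("attester_pub.pem").read()  # placeholder path

    def request_is_from_attested_client(token: str) -> bool:
        try:
            claims = jwt.decode(
                token,
                ATTESTER_PUBLIC_KEY,
                algorithms=["ES256"],                   # assumed signing algorithm
                audience="https://example-site.test",   # hypothetical audience
            )
        except jwt.InvalidTokenError:
            return False
        # The site only checks the attester's signed verdict; it never sees how
        # the verdict was produced, so there is no client-side puzzle to re-implement.
        return claims.get("environment_ok", False)

Unlike a JavaScript challenge, the hard part (producing a valid signature) happens inside the attester, so a scraper can't just re-implement the client-side logic the way youtube-dl does.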