It might work if you only need to handle a few websites. But this reverse-engineering approach is not maintainable if you want to handle hundreds or thousands of websites.
Thank you! We'd be happy if you used it for your e2e tests on your servers; it's an open-source project!
Of course it's quite easy to spin up a local instance of a headless browser for occasional use. But running a production platform is another story (monitoring, maintenance, security and isolation, scalability), so there are business use cases for a managed version.
It was my first idea. Forking Chromium has obvious advantages (compatibility), but it's not architected for that. The renderer is everywhere. I'm not saying it's impossible, just that it looked more difficult to me than starting over.
And starting from scratch has other benefits. We own the codebase, so it's easier for us to add new features like LLM integrations. It also lets us reduce binary size and startup time, which is mandatory for embedding it (as a WASM module or as a C library).
Bot protection might of course be a problem, but it also depends on the volume of requests, the IPs used, and other parameters.
AI agents will perform more and more actions on behalf of humans in the future, and I believe bot protection mechanisms will evolve to treat them as legitimate.
Thanks, but that doesn't seem to be the direction things are going at the moment. If you look at the robots.txt of many websites, they are actually banning AI bots from crawling the site. To me it seems more likely that each site will have its own AI agent to perform operations, but controlled by the site.
The first thing that came to mind when I saw this project wasn't scraping (where I'd typically want either a less detectable browser or a more performant option), but a browser engine that's actually sane to link against if I wanted to, e.g., write a modern TUI browser.
Banning the root library (even if you could, with UA spoofing and whatnot) is right up there with banning Chrome to keep out low-wage scraping centers and their armies of employees. It's not even a little effective and it risks significant collateral damage.
It is trivial to spoof a user agent. If you want to stop a motivated scraper, you need a different solution, one that exploits the fact that bots use headless browsers.
It's also trivial to detect spoofed user agents via fingerprinting. The best defense against scrapers is layered, with blocking by user-agent name as the bare minimum.
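To make the two points above concrete, here is a minimal sketch, not from the original discussion, assuming Playwright with its bundled Chromium is installed. It shows that the user-agent string is a one-line spoof, while a JS-level fingerprint signal like navigator.webdriver still exposes the automation:

    # pip install playwright && playwright install chromium
    from playwright.sync_api import sync_playwright

    SPOOFED_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        # Spoofing the user agent is a single argument.
        context = browser.new_context(user_agent=SPOOFED_UA)
        page = context.new_page()
        page.goto("https://example.com")

        # The HTTP User-Agent header now lies, but a fingerprinting
        # script on the server can still check JS-level signals:
        print(page.evaluate("navigator.userAgent"))   # spoofed value
        print(page.evaluate("navigator.webdriver"))   # True -> automation signal

        browser.close()

Production fingerprinting combines many such signals, which is what the layered defense refers to.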
We had some discussions about it. It seems to us that the AGPL will ensure that a company running our browser in a managed cloud offering will have to keep its modifications open for the community.
We might be wrong; maybe the AGPL will hurt the project more than, e.g., Apache 2.0 would. In that case we will reconsider our choice. Relaxing the license later is always easier than going the other way :)
When you want to browse a few websites from time to time, a local headful browser might be a solution. But when you have thousands or millions of webpages, you need a server environment and a headless browser.
You're right, the debugging part is a good use case for graphical rendering in a headless environment.
I see it as a build-time/runtime question. At build (dev) time I want graphical output (debugging, computer vision, etc.). Then, when the script is ready, I can use Lightpanda at runtime as a lightweight alternative; see the sketch below.
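As an illustration of that workflow (my own sketch, not an official example): the same Playwright script can launch a headful Chromium while you develop and attach to a CDP endpoint such as the one Lightpanda serves once the script is stable. The ws://127.0.0.1:9222 address is an assumption; check Lightpanda's README for the exact serve command and the current extent of its CDP/Playwright support.

    import os
    from playwright.sync_api import sync_playwright

    USE_LIGHTPANDA = os.environ.get("RUNTIME") == "lightpanda"

    with sync_playwright() as p:
        if USE_LIGHTPANDA:
            # Runtime: attach to an already running CDP server
            # (endpoint address is an assumption, see above).
            browser = p.chromium.connect_over_cdp("ws://127.0.0.1:9222")
        else:
            # Dev time: headful Chromium so you can watch the script run.
            browser = p.chromium.launch(headless=False)

        context = browser.contexts[0] if browser.contexts else browser.new_context()
        page = context.new_page()
        page.goto("https://example.com")
        print(page.title())
        browser.close()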
I was working on a personal side project for a while where I was trying to make my own little Wayback Machine-alike. Mine was very rudimentary, built on top of Firefox and WebDriver, plus a Squid proxy.
For debugging purposes, maybe your headless browser could act as an HTTP proxy server. It would capture a static snapshot of the DOM after its JavaScript runtime has executed the page's scripts, similar to how the archive.today guy serves static snapshots of websites. Developers using your headless browser could then point their Firefox or Chrome at that proxy to get a static view of what the DOM looks like after your headless browser has run the page's JavaScript. Firefox or Chrome would render that static HTML view of the page as your headless browser saw it, and the developer could inspect it to decide on further interactions with the page. Purely as a debugging tool.
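The snapshot half of that idea is easy to prototype. Here is a rough sketch of my own, with a placeholder URL and filename, using Playwright purely for illustration: it loads a page, lets its scripts run, and dumps the resulting DOM to a file that a normal browser, or a small proxy in front of it, could serve back.

    from pathlib import Path
    from playwright.sync_api import sync_playwright

    URL = "https://example.com"  # placeholder

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Wait for network activity to settle so page scripts have had a
        # chance to mutate the DOM.
        page.goto(URL, wait_until="networkidle")
        snapshot = page.content()  # serialized DOM *after* JS execution
        browser.close()

    Path("snapshot.html").write_text(snapshot, encoding="utf-8")

Serving snapshot.html from an HTTP proxy keyed by the original URL would give exactly the static, post-JS view described above.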