I've spent some time feeding an LLM with scraped web pages, and I've found that retaining some style information (text size, visibility, decoration, image content) is non-trivial.
Our goal is to build a headless browser, rather than a general purpose browser like Servo or Chrome. It's already available if you would like to try it: https://lightpanda.io/docs/open-source/installation
I see you're using html5ever for HTML parsing, and like its trait/callback-based API (me too). It looks like style/layout is not in scope at the moment, but if you're ever looking at adding style/layout capabilities to lightpanda, then you may find it useful to know that Stylo [0] (CSS / style system) and Taffy [1] (box-level layout) are both available with a similar style of API (also Parley [2], which has a slightly different API style but can be combined with Taffy to implement inline/text layout).
Off-topic note: I read the website and a few pages of the docs, and it's unclear to me what I can safely use LightPanda for. Say I wanted to swap it in as my engine on Playwright: what are the tradeoffs? What is implemented, and what isn't?
Thanks for the feedback, we will try to make this clearer on the website. Lightpanda works with Playwright, and we have some docs[1] and examples[2] available.
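For anyone who wants a quick look at what that integration looks like, here is a minimal sketch of driving Lightpanda from Playwright over CDP; the ws://127.0.0.1:9222 endpoint is just a placeholder for wherever your instance is listening:

    // Minimal sketch: attach Playwright to a running Lightpanda instance over CDP.
    // The WebSocket endpoint below is an assumption; use the host/port your
    // Lightpanda server actually listens on.
    import { chromium } from 'playwright';

    async function main() {
      const browser = await chromium.connectOverCDP('ws://127.0.0.1:9222');
      const context = await browser.newContext();
      const page = await context.newPage();

      await page.goto('https://example.com');
      console.log(await page.title());

      await page.close();
      await context.close();
      await browser.close();
    }

    main();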
Web APIs and CDP specifications are huge, so this is still a work in progress. Many websites and scripts already work, while others do not; it really depends on the case. For example, on the CDP side, we are currently working on adding an Accessibility tree implementation.
Maybe you should recommend a recipe for configuring playwright with both chromium and lightpanda backends so a given project can compare and evaluate whether lightpanda could work given their existing test cases.
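As a rough sketch of that kind of side-by-side evaluation: the same check runs once against a Playwright-launched headless Chromium and once against a Lightpanda CDP endpoint (the ws://127.0.0.1:9222 address and the checkTitle helper are assumptions; substitute your own test cases and endpoint):

    // Rough sketch: run the same check against headless Chromium and Lightpanda
    // so existing test cases can be compared between the two backends.
    import { chromium, type Browser } from 'playwright';

    // Hypothetical check; substitute whatever your real test cases do.
    async function checkTitle(browser: Browser, url: string): Promise<string> {
      const page = await browser.newPage();
      await page.goto(url);
      const title = await page.title();
      await page.close();
      return title;
    }

    async function main() {
      const url = 'https://example.com';

      // Backend 1: Chromium launched locally by Playwright.
      const chromiumBrowser = await chromium.launch({ headless: true });
      console.log('chromium  :', await checkTitle(chromiumBrowser, url));
      await chromiumBrowser.close();

      // Backend 2: Lightpanda reached over CDP (endpoint is a placeholder).
      const lightpanda = await chromium.connectOverCDP('ws://127.0.0.1:9222');
      console.log('lightpanda:', await checkTitle(lightpanda, url));
      await lightpanda.close();
    }

    main();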
I think it's really more of an alternative to JSDom than it is an alternative to Chromium. It's not going to fool any websites that care about bots into thinking it's a real browser in other words.
It would be helpful to compare Lightpanda to WebKit; Playwright has a driver for it, for example, and it's far faster and less resource-hungry than Chrome.
When I read your site copy, it struck me as either naive to that or a somewhat misleading comparison. My feedback would be to just address it directly alongside Chrome.
Respectfully, for browser-based work, simplicity is absolutely not a good enough reason to use a memory-unsafe language. Your claim that Zig is in some way safer than Rust for something like this is flat out untrue.
What is your attack model here? Each request lives in its own arena allocator, so there is no way for any potentially malicious JavaScript to escape and read memory owned by any other request, even if there is a coding mistake. Otherwise, VM safety is delegated to the V8 core.
Believe it or not, using arenas does not provide free memory safety. You need to statically bound allocations to make sure they don't escape the arena (which is exactly how arenas work in Rust, but not Zig). There are also quite a lot of ways of generating memory unsafe code that aren't just use after free or array-out-of-bounds in a language like Zig, especially in the context of stuff like DOM nodes where one frequently needs to swap out pointers between elements of one type and a different type.
Choosing something like Zig over C++ on simplicity grounds is going to be a false economy. C++ features exist for a reason. The complexity is in the domain. You can't make a project simpler by using a simplistic language: the complexity asserts itself somehow, somewhere, and if a language can't express the concept you want, you'll end up with circumlocution "patterns" instead.
Build-system complexity disappears once you've set it up, too. Meson and the like can be as terse as your Curl example.
I mean, it's your project, so whatever. Do what you want. But choosing Zig for the stated reasons is like choosing a car for the shape of the cupholders.
Your Swiss Army knife with 97 oddly shaped tools may be able to do any job anyone could ask of it, but my Swiss Army knife with 10 well-designed tools that are optimal for my set of tasks will get my job done with much less frustration.
But sometimes not good ones. Lots of domains make tradeoffs about which features of C++ to actually use. It's an old language with a lot of cruft, used across a wide set of problems that don't necessarily share engineering tradeoffs.
That's not fully true, though. There are different types of complexity:
- project requirements
- requirements forced upon you due to how the business is structured
- libraries available for a particular language ecosystem
- paradigms / abstractions that a language is optimised for
- team experiences
Your argument is more akin to saying “all general purpose languages are equal” which I’m sure you’d agree is false. And likewise, complexity can and will manifest itself differently depending on language, problems being solved, and developer preferences for different styles of software development.
So yes, C++ complexity exists for a reason (though I’d personally argue that “reason” was due to “design by committee”). But that doesn’t mean that reason is directly applicable to the problems the LightPanda team are concerned about solving.
C++ features for complexity management are not ergonomic though, with multiple conflicting ideas from different eras competing with each other. Sometimes demolition and rebuild from foundations is paradoxically simpler.
A lot of them only still exist for backwards compatibility's sake, though. And a decent number exist because adding something as a language extension, rather than building the language around it, has consequences.
C++ features exist for a reason but it may not be a reason that is applicable to their use case. For example, C++ has a lot of features/complexity that are there primarily to support low-level I/O intensive code even though almost no one writes I/O intensive code.
I don't see why C++ would be materially better than Zig for this particular use case.
It might work if you need to handle a few websites. But this reverse-engineering approach is not maintainable if you want to handle hundreds or thousands of websites.
Thank you! We'd be happy if you used it for your e2e tests on your servers; it's an open-source project!
Of course, it's quite easy to spin up a local instance of a headless browser for occasional use. But running a production platform is another story (monitoring, maintenance, security and isolation, scalability), so there are business use cases for a managed version.
It was my first idea. Forking Chromium has obvious advantages (compatibility), but it's not architected for that. The renderer is everywhere. I'm not saying it's impossible, just that it looked more difficult to me than starting over.
And starting from scratch has other benefits. We own the codebase, so it's easier for us to add new features like LLM integrations. It also lets us reduce binary size and startup time, which is mandatory for embedding it (as a WASM module or as a C lib).
Bot protection might of course be a problem, but it also depends on the volume of requests, the IP, and other parameters.
AI agents will do more and more actions on behalf of humans in the future and I believe the bot protection mechanism will evolve to include them as legit.
Thanks, but it doesn't seem like that's the direction things are going at the moment. If you look at the robots.txt of many websites, they are actually banning AI bots from crawling the site. To me it seems more likely that each site will have its own AI agent to perform operations, but controlled by the site.
The first thing that came to mind when I saw this project wasn't scraping (where I'd typically want either a less detectable browser or a more performant option), but a browser engine that's actually sane to link against if I wanted to, e.g., write a modern TUI browser.
Banning the root library (even if you could, given UA spoofing and whatnot) is right up there with banning Chrome to keep out low-wage scraping centers and their armies of employees. It's not even a little effective and also risks significant collateral damage.
It is trivial to spoof the user agent. If you want to stop a motivated scraper, you need a different solution that exploits the fact that robots use a headless browser.
It's also trivial to detect spoofed user agents via fingerprinting. The best defense against scrapers is layered, with blocking by user-agent name as the bare minimum.
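As a hedged illustration of the fingerprinting point (the specific signals here are examples, not a complete detection suite), a page-side consistency check might look like:

    // Sketch of a page-side consistency check between the claimed user agent and
    // observable browser behaviour. The signals are illustrative only; real
    // fingerprinting uses many more (canvas, fonts, TLS, timing, etc.).
    function looksSpoofedOrHeadless(): boolean {
      const ua = navigator.userAgent;
      const claimsChrome = /Chrome\//.test(ua);

      // Real Chrome exposes a window.chrome object; many headless/automation
      // setups either omit it or set navigator.webdriver to true.
      const hasChromeObject =
        typeof (window as unknown as { chrome?: unknown }).chrome !== 'undefined';
      const admitsAutomation = navigator.webdriver === true;

      return (claimsChrome && !hasChromeObject) || admitsAutomation;
    }

    console.log('suspicious client?', looksSpoofedOrHeadless());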