Hacker News | fbouvier's comments

Thanks Steeve!

Yes, HTML is too heavy and too expensive for LLMs. We are working on a text-based format more suitable for AI.

What do you think of the DeepSeek OCR approach where they say that vision tokens might better compress a document than its pure text representation?

https://news.ycombinator.com/item?id=45640594

I've spent some time feeding LLMs with scraped web pages, and I've found that retaining some style information (text size, visibility, decoration, image content) is non-trivial.


Keeping some kind of style information is definitely important to understand the semantics of the webpage.

Hi, I am Francis, founder of Lightpanda. We wrote a full article explaining why we chose Zig over Rust or C++, if you are interested: https://lightpanda.io/blog/posts/why-we-built-lightpanda-in-...

Our goal is to build a headless browser, rather than a general purpose browser like Servo or Chrome. It's already available if you would like to try it: https://lightpanda.io/docs/open-source/installation


I see you're using html5ever for HTML parsing, and like its trait/callback-based API (me too). It looks like style/layout is not in scope at the moment, but if you're ever looking at adding style/layout capabilities to Lightpanda, then you may find it useful to know that Stylo [0] (CSS / style system) and Taffy [1] (box-level layout) are both available with a similar style of API (also Parley [2], which has a slightly different API style but can be combined with Taffy to implement inline/text layout).

[0]: https://github.com/servo/stylo

[1]: https://github.com/DioxusLabs/taffy

[2]: https://github.com/linebender/parley

---

Also, if you're interested in contributing C bindings for html5ever upstream then let me know / maybe open a github issue.


Off-topic note: I read the website and a few pages of the docs, and it's unclear to me what I can use Lightpanda for safely. Say I wanted to swap it in as my engine on Playwright: what are the tradeoffs? What is implemented, and what isn't?

Thanks for the feedback, we will try to make this clearer on the website. Lightpanda works with Playwright, and we have some docs[1] and examples[2] available.

Web APIs and CDP specifications are huge, so this is still a work in progress. Many websites and scripts already work, while others do not; it really depends on the case. For example, on the CDP side, we are currently working on adding an Accessibility tree implementation.

[1] https://lightpanda.io/docs/quickstart/build-your-first-extra...

[2] https://github.com/lightpanda-io/demo/tree/main/playwright
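For anyone wiring this up, the basic shape (a sketch, not verbatim from the docs; the serve flags and port here are assumptions, so check the installation page) is to run Lightpanda as a CDP server and have Playwright connect to it instead of launching Chromium:

```typescript
// Assumed setup: a Lightpanda CDP server already running locally, e.g.
//   lightpanda serve --host 127.0.0.1 --port 9222
// Playwright then attaches over CDP rather than launching its own Chromium.
import { chromium } from "playwright";

const browser = await chromium.connectOverCDP("ws://127.0.0.1:9222");
const page = await browser.newPage();
await page.goto("https://example.com");
console.log(await page.title());
await browser.close();
```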


Maybe you should recommend a recipe for configuring playwright with both chromium and lightpanda backends so a given project can compare and evaluate whether lightpanda could work given their existing test cases.

I was actually interested in using Lightpanda for E2Es, to be honest, because halving the feedback cycle would be very valuable to me.
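A cheap way to get that comparison (everything here, including the env var name, is a hypothetical sketch rather than anything from the Lightpanda docs) is to make the backend switchable, then run the existing suite once per backend and diff the failures:

```typescript
// Hypothetical helper: choose the backend via environment variable so the
// same Playwright suite runs unchanged against Chromium or Lightpanda.
import { chromium, type Browser } from "playwright";

export async function getBrowser(): Promise<Browser> {
  const ws = process.env.LIGHTPANDA_WS; // e.g. ws://127.0.0.1:9222 for Lightpanda
  return ws ? await chromium.connectOverCDP(ws) : await chromium.launch();
}
```

Run the suite twice, once with `LIGHTPANDA_WS` set, and the delta in passing tests is a concrete compatibility report for your own codebase.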

I think it's really more of an alternative to JSDom than it is an alternative to Chromium. It's not going to fool any websites that care about bots into thinking it's a real browser in other words.

It would be helpful to compare Lightpanda to WebKit; Playwright has a driver for it, for example, and it's far faster and less resource-hungry than Chrome.

When I read your site copy, it struck me as either naive to that or as a somewhat misleading comparison. My feedback would be to just address it directly alongside Chrome.


Thanks Francis, appreciate the nice & honest write-up with the thought process (while keeping it brief).

Would be great if it could be used as a wasm library... Just saying... Is it? I would actually need and use this.

Respectfully, for browser-based work, simplicity is absolutely not a good enough reason to use a memory-unsafe language. Your claim that Zig is in some way safer than Rust for something like this is flat out untrue.

What is your attack model here? Each request lives in its own arena allocator, so there is no way for any potentially malicious JavaScript to escape and read memory owned by any other request, even if there is a coding error. Otherwise, VM safety is delegated to the V8 core.

Believe it or not, using arenas does not provide free memory safety. You need to statically bound allocations to make sure they don't escape the arena (which is exactly how arenas work in Rust, but not Zig). There are also quite a lot of ways of generating memory unsafe code that aren't just use after free or array-out-of-bounds in a language like Zig, especially in the context of stuff like DOM nodes where one frequently needs to swap out pointers between elements of one type and a different type.

In that blog post, the author said safer than C, not Rust.

Choosing something like Zig over C++ on simplicity grounds is going to be a false economy. C++ features exist for a reason. The complexity is in the domain. You can't make a project simpler by using a simplistic language: the complexity asserts itself somehow, somewhere, and if a language can't express the concept you want, you'll end up with circumlocution "patterns" instead.

Build system complexity disappears when you set it up too. Meson and such can be as terse as your Curl example.

I mean, it's your project, so whatever. Do what you want. But choosing Zig for the stated reasons is like choosing a car for the shape of the cupholders.


Your Swiss Army Knife with a myriad of 97 oddly-shaped tools may be able to do any job anyone could ask of it, but my Swiss Army Knife of 10 well-designed tools that are optimal for my set of tasks will get my job done with much less frustration.

> C++ features exist for a reason.

But sometimes not good ones. Lots of domains make trade-offs about which features of C++ to actually use. It's an old language with a lot of cruft, used across a wide set of problems that don't necessarily share engineering trade-offs.


That’s not fully true though. There are different types of complexity:

- project requirements

- requirements forced upon you due to how the business is structured

- libraries available for a particular language ecosystem

- paradigms / abstractions that a language is optimised for

- team experiences

Your argument is more akin to saying “all general purpose languages are equal” which I’m sure you’d agree is false. And likewise, complexity can and will manifest itself differently depending on language, problems being solved, and developer preferences for different styles of software development.

So yes, C++ complexity exists for a reason (though I’d personally argue that “reason” was due to “design by committee”). But that doesn’t mean that reason is directly applicable to the problems the LightPanda team are concerned about solving.


C++ features for complexity management are not ergonomic though, with multiple conflicting ideas from different eras competing with each other. Sometimes demolition and rebuild from foundations is paradoxically simpler.

A lot of them only still exist for backwards compatibility's sake, though. And a decent number because adding something as a language extension, rather than building the language around it, has consequences.

C++ features exist for a reason but it may not be a reason that is applicable to their use case. For example, C++ has a lot of features/complexity that are there primarily to support low-level I/O intensive code even though almost no one writes I/O intensive code.

I don't see why C++ would be materially better than Zig for this particular use case.


And some lightweight alternatives like Bellard's QuickJS (https://bellard.org/quickjs/) in C and Kiesel (https://kiesel.dev/) in Zig.


Also, Hermes which React Native uses.


AWS' LLRT runtime just switched to QuickJS-NG [0]

0. https://github.com/awslabs/llrt/pull/669


Yes, agentic workflows are one of our use cases for Lightpanda.

We skip the graphical rendering of the web page for instant startup, fast execution, and low resource usage.


Can skipping rendering affect website behavior? What happens when JS tries to get layout/color information? How often does this break a website?


They skip rendering but maybe don't skip layout and style computation?


Does it save many resources at all then? I'd think that style computation and layout take a large chunk of the total resources used.


It might work if you only need to handle a few websites. But this reverse-engineering approach is not maintainable if you want to handle hundreds or thousands of websites.


Thank you! We're happy for you to use it for your e2e tests on your servers; it's an open-source project!

Of course it's quite easy to spin up a local instance of a headless browser for occasional use. But running a production platform is another story (monitoring, maintenance, security and isolation, scalability), so there are business use cases for a managed version.


It was my first idea. Forking Chromium has obvious advantages (compatibility). But it's not architected for that. The renderer is everywhere. I'm not saying it's impossible, just that it looked more difficult to me than starting over.

And starting from scratch has other benefits. We own the codebase, and thus it's easier for us to add new features like LLM integrations. It also reduces binary size and startup time, which is mandatory for embedding it (as a WASM module or as a C lib).


The Chromium/Webkit renderer used to have multiple rendering backends. You might use or add a no-op backend.


There are a lot of use cases:

- LLM training (RAG, fine tuning)

- AI agents

- scraping

- SERP

- testing

- any kind of web automation basically

Bot protection might of course be a problem, but it also depends on the volume of requests, IP addresses, and other parameters.

AI agents will perform more and more actions on behalf of humans in the future, and I believe bot protection mechanisms will evolve to treat them as legitimate.


Thanks, but that doesn't seem to be the direction things are going at the moment. If you look at the robots.txt of many websites, they are actually banning AI bots from crawling the site. To me it seems more likely that each site will have its own AI agent to perform operations, but controlled by the site.
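Concretely, the AI-crawler bans in those robots.txt files usually look something like this (GPTBot and CCBot are real crawler tokens used by OpenAI and Common Crawl; the exact list varies by site):

```text
# Disallow AI training crawlers while leaving normal traffic alone
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
```

Note this only binds crawlers that choose to honor robots.txt; it does nothing against a client that ignores it.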


I fully understand your concern and agree that scrapers shouldn't be hurting web servers.

I don't think they are using our browser :)

But in my opinion, blocking a browser as such is not the right solution. In this case, it's the user who should be blocked, not the browser.


If your browser doesn't play nicely and obey robots.txt when it's headless, I don't think it's that crazy to block the browser and not the user.


Every tool can be used in a good or bad way: Chrome, Firefox, cURL, etc. It's not the browser that doesn't play nicely, it's the user.

It's the user's responsibility to behave well, like in life :)


The first thing that came to mind when I saw this project wasn't scraping (where I'd typically want either a less detectable browser or a more performant option), but a browser engine that's actually sane to link against if I wanted to, e.g., write a modern TUI browser.

Banning the root library (even if you could, given UA spoofing and whatnot) is right up there with banning Chrome to keep out low-wage scraping centers and their armies of employees. It's not even a little effective, and it risks significant collateral damage.


It is trivial to spoof a user agent. If you want to stop a motivated scraper, you need a different solution that exploits the fact that robots use a headless browser.


> it is trivial to spoof user-agent

It's also trivial to detect spoofed user agents via fingerprinting. The best defense against scrapers is done in layers, with a user-agent name block as the bare minimum.
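As a toy illustration of one such layer (the function and header set here are my own invention, not any particular product's): a client whose User-Agent claims to be modern Chrome but which sends no Client Hints header is an easy first-pass suspect, since real Chrome has sent `sec-ch-ua` alongside its User-Agent for years.

```typescript
// Toy server-side heuristic: a UA string claiming Chrome without the
// accompanying sec-ch-ua Client Hints header suggests spoofing. This is
// one cheap layer, not a complete defense.
type HeaderMap = Record<string, string | undefined>;

function uaLooksSpoofed(headers: HeaderMap): boolean {
  const ua = (headers["user-agent"] ?? "").toLowerCase();
  if (!/chrome\/(\d+)/.test(ua)) return false; // heuristic only covers Chrome claims
  // Modern Chrome sends sec-ch-ua by default; its absence is suspicious.
  return headers["sec-ch-ua"] === undefined;
}

console.log(uaLooksSpoofed({ "user-agent": "Mozilla/5.0 Chrome/120.0.0.0 Safari/537.36" })); // true
console.log(
  uaLooksSpoofed({
    "user-agent": "Mozilla/5.0 Chrome/120.0.0.0 Safari/537.36",
    "sec-ch-ua": '"Chromium";v="120"',
  })
); // false
```

Real deployments stack this with TLS fingerprinting, behavioral signals, and IP reputation, exactly the "layers" point above.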

