
Gumbo HTML parser:

https://github.com/google/gumbo-parser

It's one of the most conformant standalone HTML parsing libraries out there, if not the most conformant: 0.10.0 passes all html5lib-trunk tests. It has third-party bindings in nearly a dozen different languages. The API is simple, the code is robust and well-tested, and being written in C, it's often a fair bit faster than the alternatives.



Looks interesting. Does it parse HTML in chunks? (I am actually looking for an HTML5 parsing library in C that can parse in chunks.)


No, it reads in a whole string at once and then parses it as a single document. It's actually pretty hard to parse HTML in chunks, because the spec allows text that comes later in the document to alter the parse tree of nodes produced earlier (see, for example, foster-parenting or the adoption agency algorithm). You could take a look at Hubbub, a callback-based HTML5 parser, but using it means implementing a callback interface of 18 or so different functions.

http://www.netsurf-browser.org/projects/hubbub/
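
If the input does arrive in chunks (say, from a socket), one workaround consistent with the whole-string model above is to accumulate the chunks into a single NUL-terminated buffer and only then parse it. Here is a minimal sketch using just the C standard library; `HtmlBuffer` and the `html_buffer_*` functions are hypothetical names made up for this illustration and are not part of Gumbo.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper type (not part of Gumbo): a growable byte buffer
 * that stays NUL-terminated after every successful append. */
typedef struct {
    char  *data;  /* NULL until the first append */
    size_t len;   /* bytes stored, excluding the terminator */
    size_t cap;   /* bytes allocated */
} HtmlBuffer;

void html_buffer_init(HtmlBuffer *b) {
    b->data = NULL;
    b->len = 0;
    b->cap = 0;
}

/* Append one chunk of n bytes. Returns 0 on success, -1 if allocation
 * fails (the buffer is left unchanged in that case). */
int html_buffer_append(HtmlBuffer *b, const char *chunk, size_t n) {
    assert(b != NULL);
    size_t needed = b->len + n + 1;  /* +1 for the trailing NUL */
    if (needed > b->cap) {
        size_t new_cap = b->cap ? b->cap : 64;
        while (new_cap < needed) {
            new_cap *= 2;
        }
        char *p = realloc(b->data, new_cap);
        if (p == NULL) {
            return -1;
        }
        b->data = p;
        b->cap = new_cap;
    }
    if (n > 0) {
        memcpy(b->data + b->len, chunk, n);
    }
    b->len += n;
    b->data[b->len] = '\0';
    return 0;
}

void html_buffer_free(HtmlBuffer *b) {
    free(b->data);
    html_buffer_init(b);
}
```

Once the last chunk has been appended, `b.data` can be handed to `gumbo_parse()` as one complete document (or to `gumbo_parse_with_options()` together with `b.len`). This costs memory proportional to the document size, which is the trade-off for not having a truly incremental parser.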



