Hacker News new | comments | show | ask | jobs | submit login
Show HN: Fast C-based HTML 5 parsing for Python (github.com)
169 points by aumerle 175 days ago | hide | past | web | favorite | 71 comments

Python: interactive glue language between high performance C libraries. It's funny how we have converged on this solution, I use it daily so I'm really not complaining but I really didn't see this as one of the logical outcomes of the various programming language streams.

But when you think about it it is kind of logical: inner loops and low level code tend to be fairly static and are often re-used so it pays of to write them in a language that maximizes for that use-case, but structure can be executed much slower and re-use is relatively low when going from one application to the next so it pays of to write that in another language.

So we get this split roughly halfway down the application stack where we switch from interactive and interpreted high level language to compiled low level language.

It's a very interesting optimum.

It's actually a fairly common pattern: think high performance filters glued together with the shell. Or "steering" of supercomputer codes via Tcl, Squeak Smalltalk doing high performance multi-media (at the time) using a few well-placed primitives and an otherwise run-of-the-mill bytecode interpreter.

A very similar split goes right through the middle of Objective-C. Although it is compiled, the "Objective" part always felt pretty close to interactive and is pretty high-level, and C is, well, C. Same concept, less mismatch between the parts.

One of Objective-Smalltalk's goals is to invert the Objective-C trade-off: make the interactive, high-level part primary, while still allowing a smooth transition to the low-level parts without resorting to making that transition automatic by either a heldencompiler or super-smart JIT.

Yes, it is a very interesting optimum, though I think it is one that can be improved on.

I've long heralded this as some of the most attractive things about interpreted languages like Python and friends, and what really draws me to them in the first place.

Take parsing a config file - it doesn't need to be fast unless it's truly in your hot path and you're doing millions of them. I like not having to worry about memory management when it comes to them. On the other hand if I have a hot path/slow operation having the ability to compile a library to do the heavy lifting and gluing it together in a higher-level language means that you spend a lot less time fighting your language and a lot more time solving your problems.

Fighting your language to get performance is a special trait of C/C++. Languages like D, Rust or Go are trying to fix this.

I use this pattern a lot. For example, in https://github.com/kovidgoyal/kitty where the UI layer is in python and the performance critical code is in C. I find this pattern gives me an optimum in performance + maintainability.

You REALLY didn't know about http://kitty.9bis.net/ before starting this?

Yep. Similar to how we used to write the high performance code in ASM and the business logic in C. Now we write the high performance code in C (or equiv. low-level language) and the high performance code in a slow but expressive dynamic language. My favorite is writing high-level code in some kind of Lisp (previously Clojure, lately Racket) because REPL-driven development makes it incredibly fast to iterate and experiment and then cement/polish it into production code all while having the same state in memory.

This is one of the reasons I enjoy using Nim; it's that exact split, but within the same language. It's quite a mind-bender to begin with, but damn is it powerful once you've got the hang of it.

For example, today I was doing some homomorphic encryption experiments, and leveraged the GNU GMP library for some integer calculations. Crossing that "boundary" in Nim was the easiest thing I've ever done. Take a high level "ref" type (`Int`) which is actually `type Int = ref mpz_t`: This means I can write nice high-level procs over the top of the low-level GMP types, but also just de-reference any `Int` to pass it to the "real" GMP functions as needed!

I'd love to see other languages explore this, too, as I think it's really powerful. It's my biggest gripe with Node.js: FFI is a bit "ick" the last time I tried it!

Also known as Ousterhout's Dichotomy


And it's super great when your C extension has a memory leak and you can't figure out where in Python you are creating circular references... happened to me once. Two days of my life...

But yes as long as you write good C, it's a pretty awesome pattern.

That's why people are starting writting the Python extension in rust.

You can have memory leaks in Rust as well. Especially if you are interfacing with a program that expects a C interface (i.e. CPython).

This architecture has a name: http://wiki.c2.com/?AlternateHardAndSoftLayers

Note: (on the off chance that someone did not know about this :) ) The author is also the only developer of calibre. https://calibre-ebook.com/ which is also based on python

Indeed, I developed html5-parser to eventually replace html5lib in calibre

Big fan of calibre. Really appreciate your work. Have converted many books to put in kindle! Flawless.

I have worked on document formats (office, pdf etc) and I understand these standards are simply tedious!

My favorite part about this documentation is actually the "Safety and correctness" section: https://html5-parser.readthedocs.io/en/latest/#safety-and-co...

It shows that a good engineer took the time to think about not only speed, but correctness in implementation, and compiling things with `-pedantic-errors -Wall -Werror` is a practice that I very rarely see in the wild, and should be heralded as good practice. It's far too common to see hundreds of warnings zip past while compiling, and sometimes they do tell you about problems you need to solve.

-pedantic-errors -Wall -Werror isn't used in the wild because it breaks whenever a new compiler comes out. New compilers often come out with new warnings, new cases for old warnings, etc. It's not forwards-compatible to make warnings errors.

You should release your code with no warnings on the current compiler, but new warnings shouldn't break everything.

-Werror is used only when building on the continuous integration servers, not in the released code. That way, you get the best of both worlds. Your code wont break on new compiler releases, but it also wont have neglected warnings.

Where are the test that show it's 30x faster? Otherwise, pretty cool. I only do occasional HTML parsing in scripts, so whatever is builtin and BeautifulSoup is good enough for me, but I can surely imagine a 30x speedup being not only useful but necessary for large processing jobs or a scaled-up user interface.

30x is quick! But still it doesn't appear to be sequential parser.

html -> gumbo parse tree -> lxml tree

What makes it efficient even with two tree transformations ?

Is one of the above trees constructed in sequential way and is less-recursive ?

The short answer to this is that you can do an awful lot of work in C for the cost of a single Python instruction.

The slightly longer answer is that the vast majority of time in parsing is spent actually parsing - examining each character and adjusting your state machine accordingly. When I was benchmarking the gumbo-python bindings, it was 95%+ parsing, < 5% spent on tree construction. Speed up the parsing part by 100x (which isn't all that unreasonable, when you consider that a field reference in Python involves hashing a string and looking up an object in a dictionary attached to the object), and you'll get an equivalent speedup in total runtime that can pay for a lot of tree reconstructions. The Gumbo parse tree itself will often fit in L2 cache (IIRC I'd benchmarked it at about 90% of documents use < 400K RAM), so it doesn't take much time to traverse.

Doing the Gumbo => lxml translation in C rather than Python gives a similar speedup for tree construction: instead of having to lookup fields in Python as dictionary references, you can do it in C by memory offset.

Interesting. I clearly lack any knowledge in gumbo. But from what I understand from comment is

>> benchmarking the gumbo-python bindings, it was 95%+ parsing, < 5% spent on tree construction

95% on parsing - this speeds up since its done in C instead of python - ok. 5% spent on tree construction. I was little surprised with this - I assumed dictionaries might help although construction (seem expensive) for tree but tree traversal becomes optimal.

But as I write this, I find you have mentioned trees in question are of memory OFFSET types. That may explain why its quick. Else DOM styled trees (which I assumed) in C would be expensive to construct and traverse.

Finally, this begs another question : memory offset trees ? How optimal are they in construction since without having knowledge of depth of a tag and its children-depth - this will be have to be a two time parse with some sort of data structure(tree) to maintain book keeping for a second time parse to construct memory offset tree. Can this be done in single parse ? Could you share some insight into this ?

How fast is compare to lxml with the HTML parser ?

I know that lxml won't parse as many cases as this module, but in many cases lxml can do the job.

I have not tested it, but it is probably slower. The advantage of this is that is follows the HTML 5 parsing spec, so it gives you the same results as a browser when parsing malformed HTML

I have found the lxml API in python leaks a considerable amount of memory, but that was while going through a large amount of data though (Wikipedia dump). This is in contrast to a handful of pages. But that was a while ago, it may be different now.

What do you mean by "leak"? I wrote a scraper that used lxml and a single instance/process ran for month on end in production without any issues. Lxml may have used more memory than some alternatives, but I don't think it was "leaking" memory.

That is likely true. My experience with it was in 2011/2012 ish. Their FAQ[1] claims memory leaks have been fixed over time.

* http://lxml.de/FAQ.html

I made a pull request[1] for gumbo-parser a while ago. While it is a wonderful project, it seems to be in need of a new maintainer, or a maintained fork.

[1] https://github.com/google/gumbo-parser/pull/370

(Original author and former maintainer of Gumbo here.)

Yes, it does need a new maintainer, or maintained fork. I finally lost my commit rights to the /google GitHub org about a year ago, 2 years after leaving Google. So I don't actually have the ability to merge pull requests anymore. I still have some friends at Google, some of whom were involved in Gumbo's development, but they've moved on to other projects as well and likely wouldn't be able to take over. I'm working on a new startup idea myself, so my time is fairly limited, though I can answer questions if they come up.

@nostrademons: Thank you for gumbo, which is an excellent project.

> I finally lost my commit rights to the /google GitHub org about a year ago, 2 years after leaving Google.

That looks like pretty serious security lapse on Google's part without further context.

Yes, the lack of maintenance of gumbo-parser was another reason I decided to include a private copy. You are welcome to make your PR against html5-parser's copy of gumbo-parser, and I will review it.

If I understand things correctly, gumbo in html5-parser was forked from https://github.com/Sigil-Ebook/sigil-gumbo/ at 0830e1145fe08758d6ef24f77dfcbeac4633676f, which in turn was forked from https://github.com/vmg/gumbo-parser/tree/v1.0.0 which was forked from google/gumbo-parser, pre google/gumbo-parser 0.9.4.

There's quite a few changes on the original upstream since then. gumbo_malloc -> gumbo_parser_allocate, free_node -> destroy_node, passing around the parser as an argument, &c.

Yes, since I wrote html5-parser for use in calibre (i.e. to parse the HTML in e-books), sigil-gumbo is a more appropriate upstream for me.

I have no objectsion to merging in changes from gumbo-parser that are not in sigil-gumbo over time, but it will need to be done gradually, as there is only so much time I can devote to html5-parser now that it meets the needs of calibre.

I think sigil-gumbo also merged in a lot of the changes in 0.9.4 and 1.0.0. Most of them were related to performance & memory allocation; a developer at GitHub did a lot of work to speed up their internal copy of Gumbo, and merged a lot of that work back to master. Kevin Hendricks (sigil-gumbo's maintainer) was involved in a number of those discussions; I believe he applied most of the patches to his own tree as well.

I have some doubts that html5-parser would be 30x faster than html5lib without the performance work in 0.9.4; Gumbo up through 0.9.3 was really slow for a C library.

Why does it bundle a copy of the gumbo library instead of just linking to it?

Because it uses a modified version of gumbo to support parsing not-well formed XHTML as well

I thought the whole point of XHTML is that non-well-formed documents should generate an error instead of being unpredictably interpreted according to each implementation's whims.

And if you really insist on doing that, why not just parse XHTML as HTML? HTML5 parsing rules can already handle some XHTML-like constructs; it's what browsers do when they're served XHTML as text/html. If it's good enough for them, it should be good enough for you.

:) Try parsing the following snippet using HTML 5 parsing algorithm and see what you get:

<html><head><title /></head><body><p>foo

How often does a self-closing <title /> tag appear in the wild?

In E-books, which IIUC this library was designed to parse: very frequently. The ePub3 specification actually specifies the XHTML serialization of HTML5 as its content serialization, so this is a correct eBook. Ironically, this means that most eBooks are not actually valid HTML5. We had another eBook reader (Sigil) that used Gumbo and ran into this issue.

On the web: very rarely. I've actually never seen a self-closing <title /> tag, but I've seen other cases where docs that are actually XML are served with a text/html MIME type and this creates pathological DOM structures.

(I have a funny story from Gumbo's testing that illustrates this. I ran across a document that was actually some XML dialect with 45,000 self-closing elements in a row, but was served with a text/html MIME type. Since HTML5 doesn't recognize self-closing elements that aren't in the spec, this created a DOM tree with 45,000 levels of nesting. Gumbo could handle this because it uses an iterative state machine for parsing, but my testing code did recursive descent on this and choked. I posted it on MemeGen - Google's internal water-cooler website - with a link to the offending web page ... and then got a few emails from other Googlers about how it was kinda rude of me to crash their browsers. It turns out Chrome couldn't handle the page, and would die with a stack overflow when viewing it.)

I encounter self closed <title> tags in malformed XHTML files all the time. In fact I encounter them so often, I used to use a dedicated sanitization pass before passing int he html to html5lib, for that reason alone.

Ha, okay! I tried. I really did. I feel for whomever has to read this spec.

I'm assuming the tokenizer successfully tokenizes through "<title", and that we're in the "tag name state"[1]; from there, we see the space, that takes us to the "before attribute name state"[2]; then we consume the solidus, and switch to the "self-closing start tag state"[3]; we see the ">" and the spec says,

> Set the self-closing flag of the current tag token. Switch to the data state. Emit the current tag token.

So, now the tokenizer is:

  <title /></head>
           ^ about to consume this
and we're in the "data" state, which is approximately correct for about to consume a </head>, so that's looking okay. (I'll also point out that the more Englishy parts that talk about start tags[4] only say that "if the element is one of the void elements, or if the element is a foreign element, then there may be a single "/" (U+002F) character" but don't really elaborate on what to do if it isn't.)

So, let's say the tokenizer escapes okay. The parser is another story…

Let's assume at this point we're in the "in head" insertion mode[5]; we get a start tag from the tokenizer as per above. This state tells us,

> A start tag whose tag name is "title"

> Follow the generic RCDATA element parsing algorithm.

but that algorithm seems to be directed at the tokenizer, which runs a bit counter to

> The input to the tree construction stage is a sequence of tokens from the tokenization stage.

Regardless, the self-closing flag on the start tag that got emitted doesn't get acknowledged[6], which seems to cause a "parse error". But what should a parser do with that parse error? How should we behave? If we follow the link, we end up with

> The error handling for parse errors is well-defined (that's the processing rules described throughout this specification)

…which, no spec, no you don't.

The best I've got is that the tokenizer and tree construction stages aren't independent; that is, if we go back to the bit where we're supposed to be following the "generic RCDATA element parsing algorithm" — which is under tree construction, it states,

> switch the tokenizer to the RCDATA state.

which seems to heavily imply that we adjust the tokenizer's state from within tree construction. Okay. So if we follow the algorithm for RCDATA, noting that the tokenizer was about to consume </head>, AFAICT, it gets very quickly into the "RCDATA end tag open state" after eating the </ from </head>, then "RCDATA end tag name state", then eventually consumes the > and we hit:

> "If the current end tag token is an appropriate end tag token, then switch to the data state and emit the current tag token. Otherwise, treat it as per the "anything else" entry below."

So what's an appropriate end tag token?

> An appropriate end tag token is an end tag token whose tag name matches the tag name of the last start tag to have been emitted from this tokenizer,

The last tag we emitted was that weird <title/> thing, so no, this is not an appropriate end tag, and we skip to "anything else"

> Switch to the RCDATA state. Emit a U+003C LESS-THAN SIGN character token, a U+002F SOLIDUS character token, and a character token for each of the characters in the temporary buffer (in the order they were added to the buffer). Reconsume the current input character.

Uh-oh. From here, my mental parse just loops around in the RCDATA state; the title of the document ends up being "</head><body><p>foo".

I threw up a simple webserver to host that string: Both Chrome and Firefox agree! But as a parser, I feel abused.

The HTML5 algorithm could be a bit clearer here, but I think it does specify a well formed result, and additionally would flag this document as having a parse error (for an unacknowledged self-closing tag, but hilariously? oddly?, not for the rampant consumption of tags by the RCDATA algorithm…)

But that seems to be what the spec says to do. I'm going to go out on a limb and guess that ebook readers don't follow the spec, and that if you attempted to, people would (incorrectly) blame your software, even if it's doing the right thing, putting you between a rock and a hard place because the ebook writers can't write well-formed markup? (I really feel for you here, if that's correct; ideally, you could just emit a blank book and tell the ebook writer to fix it…; maybe if you emitted a warning at the bottom that stated "This eBook contained errors; a best-effort has been made to correct it; if you experience problems with this book, notify the publisher.") But, then, my SO shows me plenty of just simply typographical errors in her ebooks, and she can't even see the markup.

[1]: https://www.w3.org/TR/html5/syntax.html#tag-name-state

[2]: https://www.w3.org/TR/html5/syntax.html#before-attribute-nam...

[3]: https://www.w3.org/TR/html5/syntax.html#self-closing-start-t...

[4]: https://www.w3.org/TR/html5/syntax.html#start-tags

[5]: https://www.w3.org/TR/html5/syntax.html#parsing-main-inhead

[6]: https://www.w3.org/TR/html5/syntax.html#acknowledge-self-clo...

Yeah, I have to deal with what we have in the real world, not in spec land. Yielding blank documents on a <title/> just isn't going to cut it :)

This would replace html5lib in BeautifulSoup nicely :)

No, no. I don't want some complicated C blob in the middle of my Python system. I don't have time to chase buffer overflows and pointer bugs.

Has anyone benchmarked this against html5lib compiled with PyPy? Html5lib is slow in CPython because there are a lot of low-level per-character tests, as required by the HTML5 spec. The spec specifies how bad HTML is to be parsed in great detail, which means a lot of IF statements. CPython is a dog on code like that. PyPy, not so much.

[I was the original author of much of html5lib]

Yes, http://speed.pypy.org/comparison/ has html5lib parsing (an old, static copy of) the HTML5 spec as a benchmark. It's about 3x faster than CPython, which is to say, still very slow.

I agree there's a good argument that writing new parser code in C is not a great idea, but I don't think it's possible — or at least easy — to get reasonable parsing speed in pure Python. My ideal solution here would be bindings to a Rust parser e.g. html5ever, although there is a tradeoff there in terms of ease of distribution.

The main missing feature in html5ever for these purposes is the ability to construct the final parse tree in the rust code (c.f. lxml), without which the performance bottlneck becomes constructing the Python objects to represent the document. Of course a streaming API like SAX can be even faster, but often isn't all that useful.

[I'm the current mostly absentee maintainer of html5lib]

https://speed.python.org/comparison/?exe=12%2BL%2Bmaster%2C1... has an up-to-date version of html5lib, albeit only on CPython: notably, both 3.6 and the latest 3.7 build are significantly faster than 2.7.

That said, I don't think html5lib is going to become massively quicker: string allocations are just going to become an ever bigger issue (i.e., s[1:10] causes an allocation in Python v. just referencing the subsequence), and even using Cython, at least under CPython, isn't going to help with that.

I thought you could use memoryview over a string to get rid of that allocation even in 2.7


Allocations are challenging for HTML parsers, even in C, because of the presence of entity references and case-normalization of attribute & tag names. That means that a lot of the time when you think you ought to be able to just use a slice or memoryview into the original source text, you can't; for example, if any of your text nodes contains &lt; ('<') or &ldquo (smart double quote), you can't use the original source buffer, because you're supposed to have decoded the entity to a unicode character, which will leave the string a different length. This happens stupidly often in real HTML.

I initially had the API for Gumbo use string slices a lot more than the final released API, and then found that I couldn't do it and needed to allocate in order to maintain correctness. I'd done a patch that arena-allocated all memory used in the parse, which gave a fairly significant CPU speedup, but it also bloated max memory usage in ways that some clients found unacceptable, so I never merged it. Small C strings at least are quite lightweight; Python strings have a lot of additional overhead, and much of the PyObject structure itself requires chasing pointers.

Only over bytes objects, not over unicode objects.

Couldn't a little toolbox of helpers take care of those allocations? Or would that turn the whole thing into something too complicated?

If just the tokenizer was in Rust, that might help. Much of the HTML5 error handling involves dealing with malformed tokens. Comments that begin "<-" instead of "<--". All that stuff at the beginning of a file for guessing the character set. None of those require creating elaborate Python structures from non-Python code.

(I run a web crawler written in Python, which means code exposed to arbitrarily bad HTML. I've had to report and fix bugs in BeautifulSoup and html5lib. At least they fail in a well-defined way. C code can fail in arbitrarily bad ways, and C string processing is notorious for being troublesome.)

> My ideal solution here would be bindings to a Rust parser e.g. html5ever

Funny how somebody submitted exactly this on HN a few hours ago : https://news.ycombinator.com/item?id=14591017

You do know that by default bs4 uses lxml.html to do its parsing, which is also a big blob of C, right?

For example, running the following:

BeautifulSoup('<p>a') /usr/lib/python3.6/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

Would it be possible to interface this with other languages? e.g. Go, Ruby, Java, or is there a hard dependency on Python?

This has a hard dependency on libxml2. If your language of choice has a libxml2 based tree, it can be ported to that.

Basically your language needs an equivalent of lxml

Would also be nice to see a performance comparison with other libraries, e.g. Go's html.Parse

The gumbo build uses the flag -Wunused-but-set-variable which was added in gcc 4.6. The build script doesn't check for gcc 4.6 though and simply fails.

GCC 4.6 was released in March 2011. If you're using an older GCC, why?

Recently fixed a bug causing GCC 4.4something to overflow its stack. They are still out there.

Any luck using this with BeautifulSoup?

Simply pass the argument treebuilder='soup' to the parse function. But note that you wont get as much of a performance boost, because the besutifulsoup tree has to be built in python, not C.

Speed isn't everything. How permissive is the parser? We look for a compromise here.

As permissive as html5lib, which is as permissive as a modern browser, since both are base don the HTML 5 parsing spec.

Opinion: permissive parsing is bad for the industry. Speed should be king. It's not difficult to write correct html!

Be strict in what you produce and liberal in what you accept.

Strictness works well when you have a small number of active producers and consistent, constantly renewed content.

The web is the opposite - a huge number of producers and widely varying content, some of which is 1-2 decades old and will never be updated.

You can design and build a great, strict HTML parser - it just won't work well for a significant portion of the www.

And if you make one that is popular enough, you can change the direction of evolution of the web! (See mobile safari and flash)

Isn't it because 'incorrect HTML' has been practically defined out of existence?

Opinion: the industry doesn't equate with reality. Pointy-haired MBAs with some PHP background would tell you otherwise, but there's a whole wide world out there.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact