
Show HN: Fast C-based HTML 5 parsing for Python - aumerle
https://github.com/kovidgoyal/html5-parser
======
jacquesm
Python: interactive glue language between high performance C libraries. It's
funny how we have converged on this solution, I use it daily so I'm really not
complaining but I really didn't see this as one of the logical outcomes of the
various programming language streams.

But when you think about it, it is kind of logical: inner loops and low-level
code tend to be fairly static and are often re-used, so it pays off to write
them in a language that maximizes for that use-case; but structure can be
executed much more slowly and re-use is relatively low when going from one
application to the next, so it pays off to write that in another language.

So we get this split roughly halfway down the application stack where we
switch from interactive and interpreted high level language to compiled low
level language.

It's a very interesting optimum.
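The split is easy to demonstrate with nothing but the standard library; a minimal sketch of the glue pattern using ctypes, assuming a POSIX system where the C library's symbols are already loaded:

```python
import ctypes

# On POSIX, CDLL(None) exposes symbols from the already-loaded C library.
libc = ctypes.CDLL(None)
libc.strlen.restype = ctypes.c_size_t
libc.strlen.argtypes = [ctypes.c_char_p]

# The tight character-counting loop runs in compiled C; Python only
# supplies the argument and consumes the result.
assert libc.strlen(b"hello") == 5
```

Real bindings (lxml, html5-parser) use C extension modules rather than ctypes, but the division of labor is the same.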

~~~
zer01
I've long heralded this as one of the most attractive things about
interpreted languages like Python and friends, and it's what really draws me
to them in the first place.

Take parsing a config file - it doesn't need to be fast unless it's truly in
your hot path and you're doing millions of them, and I like not having to
worry about memory management for it. On the other hand, if I have a hot path
or a slow operation, being able to compile a library to do the heavy lifting
and glue it together in a higher-level language means you spend a lot less
time fighting your language and a lot more time solving your problems.

~~~
skocznymroczny
Fighting your language to get performance is a special trait of C/C++.
Languages like D, Rust or Go are trying to fix this.

------
zaszrespawned
Note: (on the off chance that someone did not know about this :) ) The author
is also the sole developer of calibre ([https://calibre-
ebook.com/](https://calibre-ebook.com/)), which is also based on Python.

~~~
aumerle
Indeed, I developed html5-parser to eventually replace html5lib in calibre

~~~
samblr
Big fan of calibre. Really appreciate your work. I have converted many books
to put on my Kindle! Flawless.

I have worked on document formats (Office, PDF, etc.) and I understand that
these standards are simply tedious!

------
zer01
My favorite part about this documentation is actually the "Safety and
correctness" section: [https://html5-parser.readthedocs.io/en/latest/#safety-
and-co...](https://html5-parser.readthedocs.io/en/latest/#safety-and-
correctness)

It shows that a good engineer took the time to think about not only speed, but
correctness in implementation, and compiling things with `-pedantic-errors
-Wall -Werror` is a practice that I _very_ rarely see in the wild, and should
be heralded as good practice. It's far too common to see hundreds of warnings
zip past while compiling, and sometimes they _do_ tell you about problems you
need to solve.

~~~
mrout
-pedantic-errors -Wall -Werror isn't used in the wild because it breaks whenever a new compiler comes out. New compilers often come out with new warnings, new cases for old warnings, etc. It's not forwards-compatible to make warnings errors.

You should release your code with no warnings on the current compiler, but new
warnings shouldn't break everything.

~~~
aumerle
-Werror is used only when building on the continuous integration servers, not in the released code. That way, you get the best of both worlds: your code won't break on new compiler releases, but it also won't have neglected warnings.

------
metalliqaz
Where are the tests that show it's 30x faster? Otherwise, pretty cool. I only
do occasional HTML parsing in scripts, so whatever is builtin and
BeautifulSoup is good enough for me, but I can surely imagine a 30x speedup
being not only useful but necessary for large processing jobs or a scaled-up
user interface.

~~~
aumerle
[https://html5-parser.readthedocs.io/en/latest/#benchmarks](https://html5-parser.readthedocs.io/en/latest/#benchmarks)

~~~
samblr
30x is quick! But it still doesn't appear to be a sequential parser.

html -> gumbo parse tree -> lxml tree

What makes it efficient even with two tree transformations?

Is one of the above trees constructed in a sequential way, and is it less
recursive?

~~~
nostrademons
The short answer to this is that you can do an awful lot of work in C for the
cost of a single Python instruction.

The slightly longer answer is that the vast majority of time in parsing is
spent actually parsing - examining each character and adjusting your state
machine accordingly. When I was benchmarking the gumbo-python bindings, it was
95%+ parsing, < 5% spent on tree construction. Speed up the parsing part by
100x (which isn't all that unreasonable, when you consider that a field
reference in Python involves hashing a string and looking up an object in a
dictionary attached to the object), and you'll get an equivalent speedup in
total runtime that can pay for a lot of tree reconstructions. The Gumbo parse
tree itself will often fit in L2 cache (IIRC I'd benchmarked it at about 90%
of documents use < 400K RAM), so it doesn't take much time to traverse.

Doing the Gumbo => lxml translation in C rather than Python gives a similar
speedup for tree construction: instead of having to look up fields in Python
as dictionary references, you can do it in C by memory offset.
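That dictionary lookup is visible from Python itself; a tiny illustration (a made-up `Node` class, not anything from html5-parser) of what a field reference costs in CPython:

```python
class Node:
    """A stand-in for a parse-tree node with Python-level fields."""
    def __init__(self, tag):
        self.tag = tag

n = Node("p")

# In CPython, instance attributes live in a per-object dict; n.tag
# hashes the string "tag" and looks it up there. A C struct member,
# by contrast, compiles down to a fixed memory offset.
assert n.__dict__["tag"] == "p"
assert vars(n) == {"tag": "p"}
```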

~~~
samblr
Interesting. I clearly lack any knowledge of gumbo. But from what I
understand from the comment:

>> benchmarking the gumbo-python bindings, it was 95%+ parsing, < 5% spent on
tree construction

95% on parsing - this speeds up since it's done in C instead of Python - OK.
5% spent on tree construction - I was a little surprised by this. I assumed
dictionaries might help: construction seems expensive for a tree, but
traversal becomes optimal.

But as I write this, I see you have mentioned that the trees in question use
memory OFFSETS. That may explain why it's quick. Otherwise, DOM-style trees
(which is what I assumed) in C would be expensive to construct and traverse.

Finally, this begs another question: memory-offset trees? How efficient are
they to construct? Without knowing in advance the depth of a tag and of its
children, this would seem to require two passes, with some sort of
bookkeeping data structure (a tree) from the first pass used to build the
memory-offset tree in the second. Can this be done in a single pass? Could
you share some insight into this?

------
dalf
How fast is it compared to lxml with the HTML parser?

I know that lxml won't parse as many cases as this module, but in many cases
lxml can do the job.

~~~
Paul-ish
I have found that the lxml API in Python leaks a considerable amount of
memory, though that was while going through a large amount of data (a
Wikipedia dump), in contrast to a handful of pages. It was also a while ago,
so things may be different now.

~~~
Buttons840
What do you mean by "leak"? I wrote a scraper that used lxml, and a single
instance/process ran for months on end in production without any issues. Lxml
may have used more memory than some alternatives, but I don't think it was
"leaking" memory.

~~~
Paul-ish
That is likely true. My experience with it was in 2011/2012 ish. Their FAQ[1]
claims memory leaks have been fixed over time.

[1] [http://lxml.de/FAQ.html](http://lxml.de/FAQ.html)

------
sebcat
I made a pull request[1] for gumbo-parser a while ago. While it is a wonderful
project, it seems to be in need of a new maintainer, or a maintained fork.

[1] [https://github.com/google/gumbo-
parser/pull/370](https://github.com/google/gumbo-parser/pull/370)

~~~
nostrademons
(Original author and former maintainer of Gumbo here.)

Yes, it does need a new maintainer, or maintained fork. I finally lost my
commit rights to the /google GitHub org about a year ago, 2 years after
leaving Google. So I don't actually have the ability to merge pull requests
anymore. I still have some friends at Google, some of whom were involved in
Gumbo's development, but they've moved on to other projects as well and likely
wouldn't be able to take over. I'm working on a new startup idea myself, so my
time is fairly limited, though I can answer questions if they come up.

~~~
aumerle
@nostrademons: Thank you for gumbo, which is an excellent project.

------
pwdisswordfish
Why does it bundle a copy of the gumbo library instead of just linking to it?

~~~
aumerle
Because it uses a modified version of gumbo, in order to also support parsing
non-well-formed XHTML.

~~~
pwdisswordfish
I thought the whole point of XHTML is that non-well-formed documents should
generate an error instead of being unpredictably interpreted according to each
implementation's whims.

And if you _really_ insist on doing that, why not just parse XHTML as HTML?
HTML5 parsing rules can already handle some XHTML-like constructs; it's what
browsers do when they're served XHTML as text/html. If it's good enough for
them, it should be good enough for you.

~~~
aumerle
:) Try parsing the following snippet using the HTML 5 parsing algorithm and
see what you get:

<html><head><title /></head><body><p>foo

~~~
pwdisswordfish
How often does a self-closing <title /> tag appear in the wild?

~~~
nostrademons
In E-books, which IIUC this library was designed to parse: very frequently.
The ePub3 specification actually specifies the XHTML serialization of HTML5 as
its content serialization, so this is a correct eBook. Ironically, this means
that most eBooks are not actually valid HTML5. We had another eBook reader
(Sigil) that used Gumbo and ran into this issue.

On the web: very rarely. I've actually never seen a self-closing <title />
tag, but I've seen other cases where docs that are actually XML are served
with a text/html MIME type and this creates pathological DOM structures.

(I have a funny story from Gumbo's testing that illustrates this. I ran across
a document that was actually some XML dialect with 45,000 self-closing
elements in a row, but was served with a text/html MIME type. Since HTML5
doesn't recognize self-closing elements that aren't in the spec, this created
a DOM tree with 45,000 levels of nesting. Gumbo could handle this because it
uses an iterative state machine for parsing, but my testing code did recursive
descent on this and choked. I posted it on MemeGen - Google's internal water-
cooler website - with a link to the offending web page ... and then got a few
emails from other Googlers about how it was kinda rude of me to crash their
browsers. It turns out _Chrome_ couldn't handle the page, and would die with a
stack overflow when viewing it.)
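The failure mode is easy to reproduce in miniature; a hypothetical sketch (not Gumbo's actual code) of why an iterative traversal survives a 50,000-deep tree while recursive descent does not:

```python
DEPTH = 50_000

# Build a degenerate "DOM": a chain of nodes 50,000 levels deep,
# like 50,000 unrecognized self-closing elements nesting in a row.
root = {"tag": "x", "child": None}
node = root
for _ in range(DEPTH - 1):
    node["child"] = {"tag": "x", "child": None}
    node = node["child"]

def depth_recursive(n):
    return 0 if n is None else 1 + depth_recursive(n["child"])

def depth_iterative(n):
    d = 0
    while n is not None:
        d, n = d + 1, n["child"]
    return d

assert depth_iterative(root) == DEPTH   # iterative state machine: fine
try:
    depth_recursive(root)               # recursive descent: blows the stack
except RecursionError:
    pass
```

CPython's default recursion limit (~1000) makes the crash deterministic here; a C parser hits the same wall via the OS stack instead.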

------
jszymborski
This would replace html5lib in BeautifulSoup nicely :)

~~~
Animats
No, no. I don't want some complicated C blob in the middle of my Python
system. I don't have time to chase buffer overflows and pointer bugs.

Has anyone benchmarked this against html5lib compiled with PyPy? Html5lib is
slow in CPython because there are a lot of low-level per-character tests, as
required by the HTML5 spec. The spec specifies how bad HTML is to be parsed in
great detail, which means a lot of IF statements. CPython is a dog on code
like that. PyPy, not so much.
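For a feel of that per-character work, here is a toy tokenizer in the spirit of (but vastly simpler than) the spec's state machine - every input character costs at least one interpreted dispatch in CPython:

```python
def tokenize(src):
    """Toy tag/text tokenizer: one state-machine step per character."""
    tokens, state, buf = [], "data", []
    for ch in src:
        if state == "data":
            if ch == "<":
                if buf:
                    tokens.append(("text", "".join(buf)))
                buf, state = [], "tag"
            else:
                buf.append(ch)
        else:  # state == "tag"
            if ch == ">":
                tokens.append(("tag", "".join(buf)))
                buf, state = [], "data"
            else:
                buf.append(ch)
    if buf and state == "data":
        tokens.append(("text", "".join(buf)))
    return tokens

assert tokenize("<p>hi</p>") == [("tag", "p"), ("text", "hi"), ("tag", "/p")]
```

The real HTML5 tokenizer has ~80 states rather than two, which multiplies exactly this kind of branchy per-character code.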

~~~
jgraham
[I was the original author of much of html5lib]

Yes, [http://speed.pypy.org/comparison/](http://speed.pypy.org/comparison/)
has html5lib parsing (an old, static copy of) the HTML5 spec as a benchmark.
It's about 3x faster than CPython, which is to say, still very slow.

I agree there's a good argument that writing new parser code in C is not a
great idea, but I don't think it's possible — or at least easy — to get
reasonable parsing speed in pure Python. My ideal solution here would be
bindings to a Rust parser e.g. html5ever, although there is a tradeoff there
in terms of ease of distribution.

The main missing feature in html5ever for these purposes is the ability to
construct the final parse tree in the rust code (c.f. lxml), without which the
performance bottleneck becomes constructing the Python objects to represent the
document. Of course a streaming API like SAX can be even faster, but often
isn't all that useful.

~~~
gsnedders
[I'm the current mostly absentee maintainer of html5lib]

[https://speed.python.org/comparison/?exe=12%2BL%2Bmaster%2C1...](https://speed.python.org/comparison/?exe=12%2BL%2Bmaster%2C12%2BL%2B3.5%2C12%2BL%2B3.6%2C12%2BL%2B2.7&ben=633&env=1%2C2&hor=false&bas=12%2BL%2B2.7&chart=normal+bars)
has an up-to-date version of html5lib, albeit only on CPython: notably, both
3.6 and the latest 3.7 build are significantly faster than 2.7.

That said, I don't think html5lib is going to become massively quicker: string
allocations are just going to become an ever bigger issue (i.e., s[1:10]
causes an allocation in Python v. just referencing the subsequence), and even
using Cython, at least under CPython, isn't going to help with that.

~~~
daniel_rh
I thought you could use memoryview over a string to get rid of that allocation
even in 2.7

[https://docs.python.org/3/library/stdtypes.html#memoryview](https://docs.python.org/3/library/stdtypes.html#memoryview)
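That works for bytes but not for str; a quick sketch of the distinction:

```python
data = b"<p>hello</p>"

# A bytes slice copies; a memoryview slice shares the original buffer.
view = memoryview(data)
chunk = view[3:8]
assert chunk.obj is data            # still backed by `data`, no copy
assert bytes(chunk) == b"hello"

# str has no buffer protocol, so memoryview can't wrap decoded text.
try:
    memoryview("<p>hello</p>")
except TypeError:
    pass
```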

~~~
nostrademons
Allocations are challenging for HTML parsers, even in C, because of the
presence of entity references and case-normalization of attribute & tag names.
That means that a lot of the time when you _think_ you ought to be able to
just use a slice or memoryview into the original source text, you can't; for
example, if any of your text nodes contains &lt; ('<') or &ldquo; (a curly double
quote), you can't use the original source buffer, because you're supposed to
have decoded the entity to a unicode character, which will leave the string a
different length. This happens stupidly often in real HTML.
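The length change is easy to see with the standard library's html module:

```python
import html

src = "a &lt; b"
decoded = html.unescape(src)

assert decoded == "a < b"
# The decoded text is shorter than the source, so no slice or
# memoryview into the original buffer can represent it.
assert len(decoded) < len(src)
```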

I initially had the API for Gumbo use string slices a lot more than the final
released API, and then found that I couldn't do it and needed to allocate in
order to maintain correctness. I'd done a patch that arena-allocated all
memory used in the parse, which gave a fairly significant CPU speedup, but it
also bloated max memory usage in ways that some clients found unacceptable, so
I never merged it. Small C strings at least are quite lightweight; Python
strings have a lot of additional overhead, and much of the PyObject structure
itself requires chasing pointers.

------
ernsheong
Would it be possible to interface this with other languages? e.g. Go, Ruby,
Java, or is there a hard dependency on Python?

~~~
aumerle
This has a hard dependency on libxml2. If your language of choice has a
libxml2 based tree, it can be ported to that.

Basically your language needs an equivalent of lxml

------
droithomme
The gumbo build uses the flag -Wunused-but-set-variable, which was added in
GCC 4.6. The build script doesn't check for GCC 4.6, though; it simply fails.

~~~
loeg
GCC 4.6 was released in March 2011. If you're using an older GCC, why?

~~~
dom0
Recently fixed a bug causing GCC 4.4something to overflow its stack. They are
still out there.

------
wooptoo
Any luck using this with BeautifulSoup?

~~~
aumerle
Simply pass the argument treebuilder='soup' to the parse function. But note
that you won't get as much of a performance boost, because the BeautifulSoup
tree has to be built in Python, not C.

------
RUTHLESS_RUFUS
Speed isn't everything. How permissive is the parser? We look for a compromise
here.

~~~
exabrial
Opinion: permissive parsing is bad for the industry. Speed should be king.
It's not difficult to write correct HTML!

~~~
kbutler
Be strict in what you produce and liberal in what you accept.

Strictness works well when you have a small number of active producers and
consistent, constantly renewed content.

The web is the opposite - a huge number of producers and widely varying
content, some of which is 1-2 decades old and will never be updated.

You can design and build a great, strict HTML parser - it just won't work well
for a significant portion of the www.

And if you make one that is popular enough, you can change the direction of
evolution of the web! (See mobile safari and flash)

