

Python implementation of the WHATWG HTML5 specification - coderdude
http://code.google.com/p/html5lib/

======
hoppipolla
To be clear, this only implements the parsing section of the specification;
that is, the material at [1]. It is pretty useful if you want an HTML parser
that is reliable with real-world content and that works with the many Python
tree libraries that ship only XML parsers or unreliable HTML parsers (e.g.
ElementTree, lxml, minidom). It is probably a bad idea to use it if
performance is a big concern.

(disclaimer: I am one of the maintainers of this library)

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#parsing
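
To illustrate the point about XML-only parsers, here is a minimal sketch using
only the Python standard library. The sample markup and the helper name
`xml_parser_accepts` are made up for illustration and do not come from
html5lib; the sketch only shows that ElementTree's XML parser rejects the kind
of markup browsers routinely accept.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: real-world style HTML with an unclosed <br> and <p>.
TAG_SOUP = "<html><body><p>one<br>two<p>three</body></html>"


def xml_parser_accepts(markup):
    """Return True if ElementTree's XML parser can parse the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        # Unclosed or mismatched tags are fatal errors for an XML parser.
        return False
    return True


print(xml_parser_accepts(TAG_SOUP))              # False
print(xml_parser_accepts("<p>one<br/>two</p>"))  # True
```

An HTML5 parser would instead recover from the unclosed tags and still
produce a tree.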

~~~
coderdude
I've relied heavily on lxml in the past to parse HTML and I've found it to be
quite reliable against tag soup. Do you have any specific gripes or sources I
could reference? I'm interested in finding out where it is deficient.

~~~
hoppipolla
Hmm, I only have vague recollections of people reporting problems with lxml,
and I can't easily find any supporting evidence with Google, so it is possible
that it is good enough in almost all cases. As far as I know it doesn't
implement the HTML5 parsing algorithm yet, so presumably it falls short when
you need browser-grade compatibility, but such extreme requirements are rare.
In any case, lxml is a superb library which I would recommend to anyone;
personally I often use html5lib with the lxml treebuilder to take advantage of
all the nice features in the lxml tree. lxml is also the best choice I am
aware of for fast parsing of possibly broken HTML in Python.
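
A rough sketch of the html5lib plus lxml treebuilder combination mentioned
above, assuming a current html5lib release (which exposes `html5lib.parse`)
and an installed lxml. The helper name `paragraph_texts` and the sample input
are hypothetical, and the function returns None when either library is
missing.

```python
# XHTML namespace that html5lib assigns to HTML elements by default.
XHTML_NS = {"h": "http://www.w3.org/1999/xhtml"}


def paragraph_texts(markup):
    """Parse markup with html5lib into an lxml tree and list <p> texts.

    Returns None if html5lib or lxml is not installed (assumption: both
    are optional third-party dependencies here).
    """
    try:
        import html5lib
        import lxml.etree  # noqa: F401  (required by the "lxml" treebuilder)
    except ImportError:
        return None
    # treebuilder="lxml" asks html5lib to build an lxml tree.
    tree = html5lib.parse(markup, treebuilder="lxml")
    # lxml features such as XPath are then available on the result.
    return [p.text for p in tree.xpath("//h:p", namespaces=XHTML_NS)]


# The second <p> implicitly closes the first, as it would in a browser.
print(paragraph_texts("<p>one<p>two"))
```

The parsing itself follows the HTML5 algorithm, while queries on the result
use lxml's own API.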

------
kilian
As BeautifulSoup isn't actively developed anymore, I hope html5lib will pick
up where BeautifulSoup left off in terms of "tag soup" support.

