

How to parse HTML - ranit8
http://blogs.perl.org/users/jeffrey_kegler/2011/11/how-to-parse-html.html

======
thristian
In the bad old days, parsing real-world HTML was a horrible task because every
web-browser had a huge collection of undocumented corner-cases and hacks; some
accidental, some the result of reverse-engineering other vendors' corner-cases
and hacks. Most standalone HTML parsers could generate _some_ document tree
from a given input file; whether or not it would match the one generated by an
actual browser was another matter.

These days, however, we have the HTML5 parsing algorithm, reverse-engineered
from various vendors' web browsers but actually documented and implementable
(still horribly complicated, but that's legacy content for you). Not only is
the HTML5 parsing algorithm designed to be compatible with legacy browsers,
but modern browsers are also replacing their old parsing code with new
HTML5-compatible implementations, so parsing should become even more
consistent (I know Firefox has switched to an HTML5 parser, and I think IE
has made a bunch of noise about it too; I don't follow WebKit all that
closely, but I'd be surprised if they haven't moved towards an HTML5 parser).

~~~
masklinn
> I know Firefox has switched to an HTML5 parser

Yep, this was mainlined in Firefox 4 (with Gecko 2.0).

> I think IE has made a bunch of noise about it too

Support is being built, it's planned for IE10.

> I don't follow WebKit all that closely, but I'd be surprised if they haven't
> moved towards an HTML5 parser

The HTML5 parsing algorithm has been in WebKit since the second half of 2010.

And you have not asked, but HTML5 parsing was officially released in Opera
11.6 last month.

~~~
kkolev
> HTML5 parsing was officially released in Opera 11.6 last month.

I hope that's not related to the annoying freezes the community's been
complaining about since that release...

------
justincormack
Now that HTML5 defines how to parse all HTML fragments, there is really no
reason not to use that algorithm.
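To illustrate the point, here is a minimal sketch using Python's html5lib
library, which implements the HTML5 parsing algorithm (this example is mine,
not from the article): even badly broken markup parses into a well-defined
tree, the same one a browser's error recovery would build.

```python
# Sketch using html5lib (pip install html5lib), a Python implementation
# of the HTML5 parsing algorithm. The input is deliberately malformed:
# an unclosed <b> inside an unclosed <p>.
import html5lib

tree = html5lib.parse("<p>some <b>broken markup",
                      treebuilder="etree",          # build an ElementTree
                      namespaceHTMLElements=False)  # plain tag names

# The algorithm's error recovery yields html > body > p > b:
body = tree.find("body")
print([el.tag for el in body.iter()])  # ['body', 'p', 'b']
```

Any other conforming implementation is required to produce the identical
tree for the same input, which is exactly what makes the algorithm usable
outside a browser.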

~~~
arkitaip
You're assuming that web sites consist of compliant HTML, which is never the
case.

~~~
masklinn
The HTML5 parsing algorithm was designed to standardize the parsing of
real-world pages, including error recovery (for invalid and/or legacy
markup); that's the whole bloody point of it.

Better yet, using an implementation of the HTML5 parsing algorithm means
_you're parsing pages the same way browsers do_: Gecko (Firefox), WebKit
(Chrome and Safari) and Presto (Opera) have all landed the HTML5 parsing
algorithm, and Trident (IE) is in the process of getting it (the feature is
planned for IE10's Trident 6.0).

------
jrockway
This article should be called, "how to write a Marpa-based HTML parser", not
"how to parse HTML". If you're a Perl programmer and want to parse HTML into
an XML-style DOM, use XML::LibXML. If you can't handle the libxml2 dependency,
use HTML::Parser.

------
perfunctory
The fact that browsers accept defective html is the most evil thing that
happened to the web. Any library that tries to parse "real world" html just
contributes to that evil. I am astonished that we tolerate this and still call
ourselves (software) engineers.

~~~
donut
"Be liberal in what you accept, and conservative in what you send." -
<http://en.wikipedia.org/wiki/Robustness_principle>

~~~
perfunctory
<http://queue.acm.org/detail.cfm?id=1999945>

~~~
donut
Good read, thanks for that. I agree with the conclusion: there is no one-size-
fits-all rule for interoperability.

The way I see it, it's ultimately about tradeoffs. I can only imagine what
things would be like today if web browsers implemented a strict parsing of
HTML and refused to render invalid pages. One possibility is hindered adoption
of HTML by the masses. Another is that two vendors would disagree about the
HTML spec and cause pages to be browser-specific. (Turns out this happened
anyway :-))

Is the HTML5 spec better in terms of interop and compatibility than the
previous ones? <http://www.tbray.org/ongoing/When/201x/2010/02/15/HTML5>

------
gambler
Since this seems to be aimed (among other things) towards input sanitization,
here is a semi-relevant entry that might amuse someone.

<https://gist.github.com/1575452>

This is a sanitizing HTML "parser" done in roughly 100 lines of PHP code. It
does tag and attribute whitelisting, checks protocols to prevent XSS, deals
with unclosed and unopened tags, and does some other things. The biggest
issue is that it's not well-factored. However, its shortness is appealing,
because I understand how it works. I would have a hard time trusting a
library with thousands of lines of code to do input validation.
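The same whitelisting idea can be sketched in a few dozen lines of Python on
top of the standard library's HTMLParser. This is a hypothetical illustration
of the approach described above (tag/attribute whitelists, a protocol check,
closing unclosed tags), not the gist's PHP code, and it is not
security-audited:

```python
from html.parser import HTMLParser

ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "a"}
ALLOWED_ATTRS = {"a": {"href"}}          # per-tag attribute whitelist
SAFE_PROTOCOLS = ("http:", "https:", "mailto:")

class WhitelistSanitizer(HTMLParser):
    """Keeps only whitelisted tags/attributes; closes unclosed tags."""

    def __init__(self):
        # convert_charrefs=False so entities like &lt; stay escaped
        super().__init__(convert_charrefs=False)
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself (its text content still passes through)
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, set()):
                continue
            # protocol check, to block javascript: and friends
            if name == "href" and not (value or "").strip().lower().startswith(SAFE_PROTOCOLS):
                continue
            kept.append(f' {name}="{value}"')
        self.out.append(f"<{tag}{''.join(kept)}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            while self.open_tags:  # also closes anything left open inside
                t = self.open_tags.pop()
                self.out.append(f"</{t}>")
                if t == tag:
                    break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):   # re-emit &amp; etc. untouched
        self.out.append(f"&{name};")

    def handle_charref(self, name):     # re-emit &#39; etc. untouched
        self.out.append(f"&#{name};")

def sanitize(html):
    s = WhitelistSanitizer()
    s.feed(html)
    s.close()
    s.out.extend(f"</{t}>" for t in reversed(s.open_tags))  # close leftovers
    return "".join(s.out)
```

For example, `sanitize('<p>hi <a href="javascript:alert(1)">x</a>')` drops
the unsafe `href` and closes the unclosed `<p>`. Like any hand-rolled
sanitizer, it has edge cases (stray `<` characters in text, for one) that a
battle-tested library handles for you.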

------
skadamat
For you Python users, the BeautifulSoup module has a prettify method which
does the same thing.
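For instance, a small sketch assuming the current bs4 package (the import
path differs for the older BeautifulSoup 3):

```python
# Sketch using the bs4 package (pip install beautifulsoup4); the older
# BeautifulSoup 3 used "from BeautifulSoup import BeautifulSoup".
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>some <b>messy</b> markup", "html.parser")
print(soup.prettify())  # re-serialized with each tag indented on its own line
```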

~~~
masklinn
Bleach[0] might be a better idea; it's based on html5lib.

[0] <http://pypi.python.org/pypi/bleach>
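A minimal sketch of what that looks like, assuming the bleach package as
published on PyPI (by default it escapes anything outside its whitelist
rather than stripping it):

```python
# Sketch using the bleach package (pip install bleach), which pairs
# html5lib's HTML5 parsing with tag/attribute whitelisting.
import bleach

dirty = 'an <script>evil()</script> example'
print(bleach.clean(dirty))
# Disallowed tags are escaped rather than executed:
# an &lt;script&gt;evil()&lt;/script&gt; example
```

Because the input is run through a real HTML5 parser first, the whitelist is
applied to the tree a browser would actually see, not to a regex's guess at
the markup.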

------
ypcx
If you want to get serious about web crawling and/or web scraping (within
legal boundaries, of course), you want to use Node.js and the appropriate
modules (I don't remember the exact names right now). This is because
Node.js, being based on the V8 JavaScript engine, can completely emulate a
real web browser: it can load and parse the HTML as well as the JavaScript.
And many sites won't load properly without JavaScript.

~~~
masklinn
What you're saying makes no sense whatsoever, at any level of resolution.

Chrome's rendering engine, and the library used to parse HTML and build a
DOM tree, is WebKit's WebCore[0]. V8 and WebCore are not the same thing: V8
does not provide a DOM implementation (that's WebCore's job), nor does it
handle any HTML parsing (that's _also_ WebCore's job).

V8 is a JavaScript VM. That's it. It does not "emulate a real web browser"
(let alone completely), and neither does Node.

[0] <http://trac.webkit.org/browser/trunk/WebCore?rev=64712>

~~~
ypcx
That's why I said emulate. V8 (Node) with the appropriate modules can
emulate a browser: both parse the HTML into a DOM, and then run scripts
against that DOM. PHP/Perl/etc. can't do that. Java could do it with Rhino,
I assume, but I'd say V8 is much closer. I'm also not saying anything about
emulating Chrome exactly. I wish I had time to dig up that module for Node
now, but I don't (I don't remember the name).

