

The HTML5 Parsing Algorithm - sant0sk1
http://webkit.org/blog/1273/the-html5-parsing-algorithm/

======
mbrubeck
Yay! Also coming in Firefox 4:
[http://hacks.mozilla.org/2010/05/firefox-4-the-html5-parser-...](http://hacks.mozilla.org/2010/05/firefox-4-the-html5-parser-inline-svg-speed-and-more/)

and IE9:
[http://blogs.msdn.com/b/ie/archive/2010/03/16/html5-hardware...](http://blogs.msdn.com/b/ie/archive/2010/03/16/html5-hardware-accelerated-first-ie9-platform-preview-available-for-developers.aspx)

Unfortunately the new parsing algorithm breaks my online banking site:
<http://bugzil.la/565689>

~~~
palish
Ugh, their response:

 _This is an intentional change. Prior to the HTML5 parsing algorithm,
browsers backtracked and reparsed when seeing an EOF inside a script to deal
with </script> inside an inline script. This means that prior to HTML5, an
accidental or maliciously forced premature end of file could change the
executability properties of pieces of an HTML file.

The current magic in the spec was carefully designed and researched to permit
forward-only parsing in a maximally Web-compatible way. The solution in the
spec was known to break a handful of pages among lots and lots of pages listed
by dmoz, but the breakage was deemed negligible.

This is probably the highest-risk change in the HTML5 parsing algorithm, and
this is the first report of it breaking an "important" contemporary site.
Considering that keeping the forward-only tokenization behavior is highly
desirable, in the absence of evidence of more important breakage, I'm treating
this as an evang issue realizing that further evidence may force us to revisit
this part of the spec. But let's try to get away with forward-only parsing!_

What is the logic behind this? Just because we want "forward-only parsing",
we're going to try to force people not to write </script> inside of a
JavaScript string? That seems rather silly.

Are there any benefits to forward-only parsing besides speed? Because if this
change was done in the name of performance, then we really should re-evaluate
who exactly is proposing these types of changes and why.

~~~
pornel
ECMAScript has an escape sequence just for this occasion:

    <\/script>

Works everywhere (it's a shame so few people know about it and instead use the
uglier and invalid "</sc"+"ript>").
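
A minimal sketch of why the escape is harmless, assuming a hypothetical helper named `escapeForInlineScript` (it is not part of any library). In a JavaScript string literal, `\/` evaluates to a plain `/`, so the runtime value is unchanged while the literal text `</script` never appears in the page source:

```javascript
// In a JS string literal "\/" is just "/", so both forms yield the same string.
const escaped = "<\/script>";
const concatenated = "</sc" + "ript>";
const sameValue = escaped === concatenated;

// Hypothetical helper: rewrite every "</" as "<\/" in JavaScript source text
// that is about to be placed inside an inline <script> element.
function escapeForInlineScript(source) {
  return source.replace(/<\//g, "<\\/");
}

const original = 'document.write("</script>");';
const safe = escapeForInlineScript(original);
```

The helper is only meant for "</" inside string literals, where the extra backslash does not change the value; it is not a general-purpose sanitizer.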

~~~
pbiggar
Why is that invalid?

~~~
pornel
<http://www.w3.org/TR/html4/types.html>

"The first occurrence of the character sequence "</" (end-tag open delimiter)
is treated as terminating the end of the element's content."

Of course browsers never implemented this correctly.
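
A rough sketch of the difference, with two hypothetical function names. The first follows the HTML 4 sentence quoted above literally. The second is a deliberately simplified stand-in for what a forward-only scan does: it only looks for `</script` and ignores the `<!--` escape states and the check on the character that follows the tag name.

```javascript
// Rule as quoted from HTML 4: the first "</" ends the script content.
function scriptEndHtml4(content) {
  return content.indexOf("</");
}

// Simplified forward-only rule (an assumption, not the full HTML5 tokenizer):
// only "</script", in any letter case, ends the script content.
function scriptEndForwardOnly(content) {
  return content.search(/<\/script/i);
}

const workaround = 'document.write("</sc" + "ript>");';
const html4End = scriptEndHtml4(workaround);         // stops at the "</sc" fragment
const forwardEnd = scriptEndForwardOnly(workaround); // -1, no terminator found

const escapedSource = 'document.write("<\\/script>");';
const html4EndEscaped = scriptEndHtml4(escapedSource); // -1, no "</" at all
```

Under the quoted rule the concatenation trick still ends the element early, which is why it counts as invalid, while the `<\/` form contains no terminator under either rule.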

------
tedunangst
"Together, these two algorithms form the core of the parser and consist of
over 10,000 lines of code."

Anybody else think that if your markup language requires a 10k line parser,
somebody took a wrong turn at the complexity vs simplicity fork in the road?

~~~
rimantas
Pay attention to this part:

    All browsers that implement the HTML5 parsing algorithm
    should parse HTML the same way, which means your web page
    should parse the same way in Firefox 4 and the WebKit
    nightly, even if it contains invalid markup.


Parsing perfect HTML5 would be easy, but one of the features of the HTML5 spec is
that it does what no spec did before: it defines exactly how parsing should
work, even in the case of invalid markup. Also, the parser has to deal with
deprecated elements no longer in the spec (such as the infamous <font>). I
assume most of the work went into this "how to parse tag soup" part.

~~~
tedunangst
Deciding that invalid markup should work would be the wrong turn I alluded to.

It would have been very easy to drop all the back compat bs by saying "An
HTML5 document is one that begins with the 6 bytes '<html5' and if the
document is invalid, reject it. Anything else, parse however you want."
Browsers that support HTML5 add text/html5 to the Accept header.

~~~
ori_b
So, how many different parsers did you want to see in the browser again?

You'd still need to support HTML4 somewhere. Supporting HTML5 separately just
means duplicating the common parts of the parser. The simplicity boat has
already sailed.

~~~
tedunangst
Addition is simpler than combination.

    HTML4 + HTML5 < HTML4 * HTML5

~~~
ori_b
They're extremely similar. You don't get an explosion of code size. Unless
you're suggesting that HTML5 should also be radically different syntactically
as well?

~~~
tedunangst
I'm suggesting that a strict HTML5 parser that doesn't have complex recovery
code is radically simpler than one that does.
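
To illustrate the "reject, don't recover" idea, here is a toy sketch, assuming a hypothetical function `isStrictlyNested`. It is nowhere near a real HTML parser: it ignores attributes, comments, void elements, and text rules, and only checks that tags nest properly.

```javascript
// Toy strict checker: accepts only properly nested tags and rejects everything else.
// Assumption for brevity: tags have no attributes and there are no void elements.
function isStrictlyNested(markup) {
  const stack = [];
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)>/g;
  let match;
  while ((match = tagPattern.exec(markup)) !== null) {
    const isClosing = match[1] === "/";
    const name = match[2].toLowerCase();
    if (!isClosing) {
      stack.push(name);
    } else if (stack.pop() !== name) {
      return false; // mismatched or unexpected end tag: reject, no recovery
    }
  }
  return stack.length === 0; // unclosed tags are rejected as well
}

const wellFormed = isStrictlyNested("<p><b>bold</b></p>");
const tagSoup = isStrictlyNested("<p><b>bold</p></b>");
```

The misnested second input is rejected outright here, whereas a recovery-enabled parser has to decide what tree to build from it.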

~~~
ori_b
I'm suggesting that an HTML4 parser with complex recovery code plus a
mostly-copy-and-paste HTML5 parser without it isn't a big win.

Browsers are going to have a complex, ugly, recovery-enabled parser in them
either way, and the effort to add HTML5 to the recovery-enabled parser isn't
very big, comparatively speaking.

------
jimmyjazz14
I would prefer that the parser just fail when it encounters invalid code;
personally, I saw this as a strong point of XHTML.

~~~
benhoyt
Hmmm. I suspect that would break 90% of web pages on the web. On the other
hand, people would make their pages valid pretty quick. :-)

~~~
wlievens
What about pages that are not maintained but still carry value?

