

Firefox now only has one HTML parser - AndrewDucker
https://blog.mozilla.org/mrbkap/2013/08/12/the-old-html-parser-is-dead-long-live-the-html-parser/

======
xymostech
Does anyone know why about:blank is so magical that it can't use the new HTML
parser? Can it not be parsed like other pages?

~~~
knackers
Thanks for the link, unwind/paulrouget2.

I think it went a little over my head. What role does about:blank actually
play? I'm assuming that the wild behaviour and parsing difficulty are the
result of it performing some special function (beyond just returning a blank
page).

~~~
hsivonen
A browsing context doesn't start its life empty. Instead, when a browsing
context is created, a JS program that looks inside it will find an
about:blank document already there. Since you can create a browsing context
synchronously (e.g.
document.body.appendChild(document.createElement("iframe"))), there has to be
a way for the initial about:blank document to materialize synchronously. The
HTML parser is always async. (Edit: The HTML parser is always async when
loading a URL. Then there are innerHTML, createContextualFragment and
DOMParser, which are synchronous.)

Add various events (readystatechange, DOMContentLoaded, load) for added fun.
And then there's the fact that browsing contexts that are top-level from the
Web perspective are iframe-like from the XUL perspective, and the code for
dealing with this duality is a mess.
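The synchronous-iframe case above can be sketched in a browser devtools console (a hedged illustration, not code from Gecko; the `src` URL is a hypothetical placeholder, and this only runs in a page context, not in Node):

```javascript
// Run in a browser console on any page (browser-only; not runnable in Node).
const frame = document.createElement("iframe");
document.body.appendChild(frame);

// The browsing context exists synchronously: the initial about:blank
// document is reachable immediately, with no async parse having happened.
console.log(frame.contentDocument.URL); // "about:blank"

// Loading a real URL, by contrast, is always async: right after setting
// src, the initial about:blank document is still the one in the frame.
frame.src = "/some/page.html"; // hypothetical same-origin URL
console.log(frame.contentDocument.URL); // still "about:blank" here

frame.addEventListener("load", () => {
  // Only once the async HTML parser finishes does the new document appear.
  console.log(frame.contentDocument.URL); // ends with "/some/page.html"
});
```

This is why the initial about:blank document needs a synchronous creation path that bypasses the (always-async) network HTML parser.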

------
reedlaw
Anyone know the status of multi-threading? It looks like there are two
potential solutions to the problem:

1\. Servo: [http://www.webmonkey.com/2013/04/mozillas-servo/](http://www.webmonkey.com/2013/04/mozillas-servo/)

2\. Electrolysis: [http://www.internetnews.com/blog/skerner/mozilla-set-to-revive-electrolysis-for-firefox-process-threading.html](http://www.internetnews.com/blog/skerner/mozilla-set-to-revive-electrolysis-for-firefox-process-threading.html)

Also see
[https://bugzilla.mozilla.org/show_bug.cgi?id=392073](https://bugzilla.mozilla.org/show_bug.cgi?id=392073)

~~~
robin_reala
Both are in development. Servo is nowhere near ready yet, as you'd expect for
a totally new engine (although it is making progress [1]). Electrolysis is in
Firefox Mobile / Firefox OS and is in progress for desktop [2].

[1]
[https://twitter.com/metajack/status/364571230331875331](https://twitter.com/metajack/status/364571230331875331)

[2]
[https://wiki.mozilla.org/Electrolysis](https://wiki.mozilla.org/Electrolysis)

~~~
thousande
Some more screenshots from Servo: [http://about-rust.blogspot.de/2013/08/some-pages-in-servo-as-of-2013-08-10.html](http://about-rust.blogspot.de/2013/08/some-pages-in-servo-as-of-2013-08-10.html)

------
fauigerzigerk
In case anyone else wants to have a look at that (partially) generated C++
code, here's the online source: [http://mxr.mozilla.org/mozilla-central/source/parser/html/](http://mxr.mozilla.org/mozilla-central/source/parser/html/)

------
code4life
This should serve as an example to those who believe "the code is the
comments" (regarding the very first point in this article).

Comments are important in non-trivial applications. Please stop thinking they
are not.

~~~
greglindahl
I believe that good code is the comments. From the sound of it, this code
isn't good in that way. Plus, it had comments.

~~~
KMag
I agree that good code self-documents what the code is doing. Good comments
document what the code should be doing.

It's important to have both pieces of information in the same place, so that
when someone incidentally notices a subtlety while glancing at the code during
related work, the overhead of fixing it is minimal.

I can't count the number of times I've run into a complex bit of poorly
commented code that looked like it mishandled a subtle corner case, politely
emailed the author(s) asking what the intended logic is before claiming I
found a bug, gotten the "read the code, dude" response, come back to them with
"is the intention really to <insert description of corner case behavior>", and
gotten back "my bad, broseph".

There have also been a few times that I've incidentally noticed something that
looked like it didn't handle a corner case properly in poorly commented code
(but not so wrong as to obviously be a bug), but failed to follow up with the
author due to time pressure on the things I was supposed to be working on, and
later had that corner case behavior bite us.
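To make the code-vs-comment distinction concrete, here's a toy sketch (hypothetical; not from the Firefox source, and `clampPercent` is an invented name) where the code shows what *does* happen and the comment records what *should* happen:

```javascript
// Hypothetical helper: clamp a metric reading to a percentage.
function clampPercent(value) {
  // Intent: negative readings are sensor glitches and must clamp to 0.
  // Without this comment, a reader can't tell whether returning 0 for
  // negatives is deliberate corner-case handling or an off-by-one bug.
  if (value < 0) return 0;
  if (value > 100) return 100;
  return value;
}

console.log(clampPercent(-5));  // 0
console.log(clampPercent(150)); // 100
console.log(clampPercent(42));  // 42
```

The code alone documents the behavior; only the comment documents that the behavior for negatives is intended, which is exactly the question the "read the code, dude" exchange above fails to answer.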

------
frik
"HTML5 parser [...] automatic translation from Java to C++"

Do you reuse code from Rhino (Mozilla's Java-based JavaScript engine)?

Why not convert the code from Java to C++ once, and then maintain the C++
code?

~~~
hsivonen
Rhino code is not involved.

The portable core of the HTML parser is maintained as Java. However, the
translation is not done during the Firefox build process. Instead, the
translation is triggered manually when the Java code changes and the output of
the translation is committed to the Firefox source repository.

(The Java code is committed to the Firefox source repository, too, to make
license compliance easier for downstream projects that opt to distribute the
whole app under the GPL. The Java code is the preferred form for making
modifications for GPL purposes.)

Edit: As for why not maintain C++ separately, that would mean doing
maintenance twice: once for Java and once for C++. The parsing algorithm still
changes from time to time. Support for the template element was added. Spec
bugs related to how nested HTML inside SVG and MathML work were fixed. I
expect a subtle future spec change to how deeply nested, correctly nested
phrase-level formatting tags are handled, since the part of the spec that
handles misnesting breaks correct nesting. Oops. (But it's great that browsers
are now so spec-compliant that you can see that it's a spec bug, because
Firefox, Chrome and IE10 all show the same weirdness.)

~~~
frik
I already checked the code, and I know from my CS courses that it's easier to
code and unit-test a language lexer/parser in Java than in C++ or C.

Was that the main reason to code it in Java?

~~~
hsivonen
The reason why the parser was written in Java in the first place was that it
was written for the Validator.nu HTML validator. The validator was written in
Java, because Java had the best set of Unicode-correct libraries for the
purpose and a common API (SAX) for connecting the libraries.

Even though testing wasn't the original motivation for writing in Java, it's
much nicer to unit test the parser as its isolated Java manifestation than as
its Gecko-integrated C++ manifestation.

------
ksec
Memory Improvement? Performance Improvement? Code Base Reduction? Binary Size
Reduction?

Or will it be so small that it doesn't matter?

~~~
mrbkap
There won't be much of a runtime effect, as most of the ripped-out code was
unused anyway. In a stripped binary (32-bit ARM), I saw 40-50kb worth of
code-size reduction, which isn't nothing (especially on mobile) but isn't a
game-changing win.

The biggest advantage was really just getting rid of a bunch of unused and
unmaintained code.

~~~
kunil
40-50kb is pretty much nothing, even for mobile.

------
itsbits
What will be the effects of having one parser? Any performance improvements?

------
hartator
IMHO, Firefox is having more and more trouble renewing itself.

This browser was great when it came out: fast, reliable and virus-free. I
remember the old days of IE, when you had to agonize over whether or not to
click a suspicious link. Since Chrome came out, though, I have the feeling
they don't innovate anymore. Extensions are slow and unreliable, a lot of
stuff feels like copy/paste from Opera or Chrome, Firebug, once the greatest,
seems outdated, they took forever to support retina displays on the Mac, their
engine crashes or freezes in JS-heavy environments... and they removed the
blink tag.

My 5 cents.

~~~
sitkack
I recently tried switching back to FF from Chrome and I find FF lags
horribly. Downloads frequently break and stall. The <filename>,
<filename.part> thing is ridiculous. It feels like there's a molasses-powered
queue between input events and responses; FF feels like a drunken master. I'm
in Chrome right now, and I wish I were in FF.

~~~
robin_reala
_The <filename>, <filename.part> thing is ridiculous._

Yep. Luckily it’s being fixed as we speak:
[https://bugzilla.mozilla.org/show_bug.cgi?id=420355](https://bugzilla.mozilla.org/show_bug.cgi?id=420355)

~~~
tmzt
It's actually quite nice to be able to rename the file and wget -c it.

I don't know if .crdownload has a header, but it doesn't seem to work the same
way.

