How Browsers Work: Behind the Scenes of Modern Web Browsers (html5rocks.com)
282 points by Garbage on Feb 14, 2012 | 41 comments

> So a lot of the parser code is fixing the HTML author mistakes.

This is probably my biggest problem with most of the ideology behind HTML 5. You learn from past mistakes, and HTML has no way to teach you anything because there is no write-error-correct feedback cycle.

How do you learn proper HTML? You write something, you load it in a browser. Does it show properly? It was right. Does it show in a funny way? There must be something wrong. What? Who knows. Nobody knows, because you cannot, by definition, create _wrong_ HTML 5. It is just that you did not write what you think you wrote.

There are validators you can use to validate your page. But most of the time they will tell you that your page is wrong while it actually looks exactly as you wish in the browser. The browser is not complaining and the page looks good... these validators must be too picky, a waste of time.

The problem that killed XHTML was the draconian error handling in most browsers. Only Opera had a good way to handle XHTML errors: a banner that told you that the page had an error, and which error. Under that banner, the page rendered as best as possible. That was a good way to learn (and to tell people that the person who wrote the site is not a pro).

I'd argue that forgiving HTML parsing is one of the main reasons the web got as big and broad as it did.


I disagree. It's also what allowed the "IE-only web" to persist for about five years.

It might have been a good thing until about 1997, but at that point there was no shortage of people creating new web content, and raising the barrier to entry would have done no harm. And a lot of innovation in browser features might have happened sooner (due to increased competition between browsers).

There was great competition between browsers through the 1990s. Unfortunately, they were competing through 'value add' proprietary extensions and browser lock-in, which is orthogonal to the issue of whether HTML parsing should be more permissive or more draconian.

Somewhat. Microsoft reverse-engineered a lot of Netscape's rendering quirks, as well as adding their own. Being quite liberal in what it accepted certainly didn't hurt IE adoption. (And a lot of these things are now in black and white in the HTML5 spec.)

Related: http://diveintohtml5.info/past.html

"[W]hy do we have an <img> element? Why not an <icon> element? Or an <include> element? Why not a hyperlink with an include attribute, or some combination of rel values? Why an <img> element? Quite simply, because Marc Andreessen shipped one, and shipping code wins."

> The problem that killed XHTML was the draconian error handling in most browsers.

I'd argue just the opposite. Browsers treated XHTML doctypes as "tag-soup" HTML4. Firefox would only parse the document as strict XML if you served it with the XHTML MIME type, in which case you lost progressive rendering and your site would seem slow to the user.
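If I recall correctly, the doctype in the markup was irrelevant; the switch was purely on the Content-Type header the server sent:

```
Content-Type: text/html                  <- forgiving tag-soup parser, progressive rendering
Content-Type: application/xhtml+xml      <- strict XML parser; one well-formedness error kills the page
```

So nearly all "XHTML" on the web was served as text/html and never saw an XML parser at all.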

The key point here is that XML wasn't doing any favors to the browser vendors - their rendering model just didn't work that way internally.

Net result is a gazillion 'XHTML' documents which aren't actually XML. Now you can't start to throw up end-user warnings or half the web would appear to be broken. So, admit it was a dubious idea to begin with and start over.

> What? Who knows. Nobody knows because you cannot, by definition create _wrong_ HTML 5.

HTML5 defines pretty strict conformance requirements for authors. That's a separate thing from defining error recovery mechanisms for UAs.

You can easily learn what is wrong with your code using the W3C Validator.


which is a big improvement over the old DTD-based one, which couldn't verify the contents of attributes or structures more than one level deep.


> HTML5 defines pretty strict conformance requirements for authors.

What you are referring to as wrong is _non-valid_; what I was referring to is _non-working_.

Invalid HTML 5 _works_, and so does invalid HTML. At no point will your browser stop and say "Come on, that is not HTML, that is garbage". If there is no such point, then there is no _wrong_ HTML 5.

Take this code; it is valid HTML 5 (may the XML gods forgive me)

    <!DOCTYPE html>
    <title>My feelings</title>
    I love HTML
It will be shown without any problem by a browser. The title will be "My feelings" and the body will be "I love HTML".

The following is invalid HTML5

    <title>My feelings</title>
    I love HTML
Yet, it will be shown "correctly" by browsers without any problem, just like the previous one.

Once such a lax error recovery mechanism is in place _without additional warning in the UI_, how is one able to define what is wrong and what is correct?
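To illustrate the point: any forgiving parser treats the two snippets identically. Here's a sketch using Python's stdlib html.parser, which, like a browser's tokenizer, never refuses input (the class name is mine, just for illustration):

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collects the <title> text, mimicking a browser's forgiving tokenizer."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

valid = "<!DOCTYPE html><title>My feelings</title>I love HTML"
invalid = "<title>My feelings</title>I love HTML"   # no DOCTYPE: invalid HTML5

for doc in (valid, invalid):
    p = TitleExtractor()
    p.feed(doc)
    print(p.title)   # both print "My feelings" -- no error either way
```

Neither document produces so much as a warning; the missing DOCTYPE is simply swallowed. That silence is exactly the feedback gap being described.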

> how is one able to define what is wrong and what is correct?

There are many arbitrary lines there. There were huge bikeshedding debates in the HTML WG about just how much must be quoted, escaped, declared and closed.

Generally the correct/valid subset is chosen to be free from gotchas as much as possible (only things that behave as expected are allowed).

It's a compromise between best practices and not so pretty, but very common code out there.

It's counter-productive to declare 99% of working pages "invalid". With fewer nitpicky errors, validators can have a better signal-to-noise ratio and flag errors that are more likely to cause trouble, and authors are more likely to take those seriously rather than assume the validator is impossible to please.

e.g. misnested tags are disallowed, because it's hard to understand how they are interpreted.

DOCTYPE is required, because it disables emulation of IE5 bugs (Quirks Mode).

OTOH unquoted attributes and some unescaped ampersands are allowed, because most often they're parsed unambiguously in a way that authors expect.
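The "parsed unambiguously" claim is easy to check with a forgiving parser: a quoted and an unquoted attribute value come out of the tokenizer identically (sketch using Python's stdlib html.parser; the class name is mine):

```python
from html.parser import HTMLParser

class AttrDump(HTMLParser):
    """Records (tag, attrs) for each start tag the tokenizer produces."""
    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, dict(attrs)))

quoted, unquoted = AttrDump(), AttrDump()
quoted.feed('<img src="photo.jpg" alt="a photo">')
unquoted.feed('<img src=photo.jpg alt="a photo">')   # unquoted value: unambiguous here

print(quoted.seen == unquoted.seen)   # True -- both parse identically
```

The ambiguity only appears once the value contains spaces, quotes or `>`, which is roughly where the validity line was drawn.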

    <!DOCTYPE html>
    <title>My feelings</title>
    I love HTML
is valid HTML5.

You are missing the point: I know that you could add `<!DOCTYPE html>` to make that document valid, and you know as well. But whoever writes the second snippet does not know, because we are not there to point it out. And if you do point it out, they will look at you, puzzled: "You are saying that it is not valid, but it renders, and in exactly the same way! Why are you making all this fuss about this 'validity' thing?"

HTML5 does not mean that any markup is valid. The specification simply defines what a user agent should do when it encounters invalid markup. Previously user agents had to guess what others (IE) did when they encountered a piece of invalid markup.

I don't see much of a problem with loosely typed markup, because markup is not supposed to be written only by highly skilled engineers.

An average Joe is supposed to feel great about writing something that renders the way he/she wants, without having to go into deeper stuff like semantics, validity or even cross-browser compatibility.

This is something that should be left to people who need to know it in their strata, isn't it?

> An average Joe is supposed to feel great about writing something that renders

Indeed HTML is great for that, but the problem is that you never "level up". Once your content renders, you are done. A lot of Joes may be interested in how things work behind the scenes, or in making things "correct" more than "just working". It would be great, from a pedagogical point of view, to have browsers render Joe's content (for instant gratification) but also to point out that "Hey, on line 32 you closed </p> before </i>. It should be the other way around because of this rule called nesting; have a look at it". I think we are wasting a lot of man-years around the globe for the lack of such warnings.
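The kind of warning described above doesn't require much machinery. A toy sketch (names and wording are mine, not from any browser) using Python's stdlib html.parser and a simple tag stack:

```python
from html.parser import HTMLParser

class NestingLinter(HTMLParser):
    """Toy linter: warns about mismatched close tags the way the comment
    above wishes browsers would. Ignores void elements like <img>."""
    def __init__(self):
        super().__init__()
        self.stack = []  # open tags as (name, line)

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
        elif any(t == tag for t, _ in self.stack):
            expected, line = self.stack[-1]
            print(f"line {self.getpos()[0]}: closed </{tag}> before "
                  f"</{expected}> (opened on line {line}) -- check your nesting")
            # recover the way browsers do: pop until the matching open tag
            while self.stack[-1][0] != tag:
                self.stack.pop()
            self.stack.pop()

linter = NestingLinter()
linter.feed("<p><i>oops</p></i>")
# prints: line 1: closed </p> before </i> (opened on line 1) -- check your nesting
```

The page would still render (the recovery step mirrors what browsers already do); the only addition is that the mistake gets said out loud.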

In the education of many people, compiler errors and warnings had exactly this function: they let you do whatever you wanted (as long as it was decent), but they would also point out the basic mistakes ("Hey, on line 14 you print the variable prg_name, but that variable has not been initialized, beware").

The downside is that the average Joe thinks this stuff is easy enough that he should do it professionally. And then:

- The market floods with professionals who antagonize proper web developers, since, to unknowledgeable clients, the result appears to be the same

- The web floods with sites that behave in unpredictable ways in different browsers

- Proper developers carry the burden of dealing with idiosyncrasies of various browsers

- Browser developers carry the burden of trying to guess their way through amazingly creative atrocities

Although, to be honest, things don't look as grim as they used to regarding the middle two. I just wish some people would stick to HTML and stop brute-forcing JavaScript until it works.

Except all of those things started happening in about 1994. And that's why, 15+ years later, we have a spec which defines error conditions rather than just the 'proper' way.

(If you weren't online/sentient back then, sites commonly had these 'Best viewed in Netscape' badges on them.)

I think it started with Mosaic in 1993.

This is great content, but it is so hard to read. The light blue against the light gray is rough on the eyes, and the font size/style isn't helping either.

Here's the article at the author's website: http://taligarsiel.com/Projects/howbrowserswork1.htm

"Tali published her research on her site, but we knew it deserved a larger audience, so we've cleaned it up and republished it here."

That's much better; thank you!

I was just about to post something to that effect. The site feels like Microsoft MSDN. In the case of documentation or technical information, you'll never go wrong by using a large, easy-to-read font with a strong contrast between background and text color.

Could not agree more. Very hard to read; the page seems opaque, in a way, when trying to read the text.

Update: so, I removed the dot texture on the page and also changed the Google webfont to "Gudea" instead of "Open Sans", and it seems to be a lot more readable. I think if the authors of the site changed the font to something more readable, all would be good. The font I chose was just picked at random.

see my edits: http://dl.dropbox.com/u/103326/after_Gudea.png

Readable (http://readable.tastefulwords.com/) does a reasonable job of it, but the page is clearly not built to be read.

You might want to try readability: http://www.readability.com/

Readability is a great option, but this site is brand new. It's not like it has outdated typography, etc. You would expect the site to be readable without any tools :\

I actually used Readability to prettify the article and send it to my Kindle. Sadly, though, it has stayed on my Kindle for a long time; hopefully I will read it soon.

It really is.


Changing it to Helvetica (or Arial) makes it a little bit easier to read (putting the above code into the URL bar does this).

The best way to read it is to select all the text. The white text on a red "background" is not bad.

Started reading the document and tripped over a statement at the beginning:

"It is important to note that Chrome, unlike most browsers, holds multiple instances of the rendering engine - one for each tab. Each tab is a separate process."

This is not true, and has never been true to my knowledge, without qualifying the statement. Chrome will open links opened in a new tab/window which point to the same domain in the same content process, and it will start mixing different domains in the same content processes once a certain threshold is reached (whether this is a hard-coded upper limit on how many content processes may exist simultaneously, or calculated dynamically, I do not know).

Old discussion: http://news.ycombinator.com/item?id=2894708

Personally, I'm happy for the repost - I was looking for this, but couldn't find it, just yesterday.

Agreed, reposts help me when searching Google with site:news.ycombinator.com

Also, it gives a chance to add to the discussion of what's been happening in the domain in the past 181 days!

Sometimes I wish reposting of old content at the same URL was allowed.

A few interesting quotes about performance optimization from the article:

"Firefox blocks all scripts when there is a style sheet that is still being loaded and parsed. Webkit blocks scripts only when they try to access certain style properties that may be affected by unloaded style sheets."

"WebCore simply throws a global switch when any sibling selector is encountered and disables style sharing [optimization] for the entire document when they are present. This includes the + selector and selectors like :first-child and :last-child."

"After parsing the style sheet, the rules are added to one of several hash maps, according to the selector. There are maps by id, by class name, by tag name and a general map for anything that doesn't fit into those categories." Every element then looks in these maps to find rules which might apply to it.
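That last quote describes a classic lookup optimization, and a minimal sketch makes it concrete. This is my reading of the quoted passage, not WebKit's actual code; the names are invented, and I've reduced selectors to single keys for brevity:

```python
# Bucket each rule by the most specific part of its selector, so an element
# only has to check a few buckets instead of scanning every rule.
rules = [
    ("#header", {"color": "blue"}),
    (".note",   {"font-style": "italic"}),
    ("p",       {"margin": "1em"}),
    ("*",       {"box-sizing": "border-box"}),  # general map / catch-all
]

by_id, by_class, by_tag, universal = {}, {}, {}, []
for selector, decls in rules:
    if selector.startswith("#"):
        by_id.setdefault(selector[1:], []).append(decls)
    elif selector.startswith("."):
        by_class.setdefault(selector[1:], []).append(decls)
    elif selector != "*":
        by_tag.setdefault(selector, []).append(decls)
    else:
        universal.append(decls)

def candidate_rules(tag, elem_id=None, classes=()):
    """Rules that *might* match the element; full selector matching
    would still run on this much smaller candidate set."""
    found = list(universal)
    found += by_id.get(elem_id, [])
    for c in classes:
        found += by_class.get(c, [])
    found += by_tag.get(tag, [])
    return found

print(candidate_rules("p", classes=("note",)))
```

A `<p class="note">` only touches three buckets here, no matter how many id rules the stylesheet contains; that's the whole win.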

I desperately need this in ebook format so I can read it in chunks, leave it, come back to it, etc. [edit] Found something -- dotEPUB works great.

*sigh* Is it too much to ask HTML5 advocates to make sure their pages render properly on an iPad?

And also, you guys, being at that level, should insist that the Apple guys respect web standards for file input in iPad Safari.

This is gonna get corny very soon.

Great work! The guy who does the Photoshop work for you is simply amazing.

From a high-level view, I hope we see super-intense competition of tablet browsers soon. Because looking at -webkit-on-iPad, I feel IE6 has just been re-invented.

What's so bad about webkit on the iPad? It's my favorite way to browse the web.

Yeah, it's good when it comes to rendering neatly done content websites - i.e. consumption.

But when it comes to creating the content, such as composing an email with browser-based rich text editing, or file input for attachments, it falls flat on its face.

Example of a discussion: http://answers.37signals.com/basecamp/2659-ipad-file-uploadi...
