Google homepage doesn't close html tags, on purpose (bug.gd)
52 points by thorax on June 27, 2009 | 39 comments

I demand to see some evidence supporting this assumption that closing tags "take up time". It seems to me that logic dictates it would take longer to parse a broken schema than a valid one. Perhaps leaving out those tags saves Google $1MM per year in bandwidth costs; it wouldn't terribly surprise me. But what would surprise me is evidence that browsers are quicker at processing an invalid document.

HTML 4 is not an XML language: it uses tags, but the concept of a structurally invalid document is not the same as it is in XHTML.

Every browser that handles HTML 4 supports a kind of "tag soup". You must take explicit action to shift them into a strict mode; consequently, it does not take longer to "parse a broken schema" -- handling an HTML document without the final closing tags should be no slower than a complete document, because no validation is taking place.
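This is easy to check with any lenient HTML tokenizer. A minimal sketch using Python's standard-library html.parser, which, like browsers in their default mode, performs no validation at all:

```python
from html.parser import HTMLParser

# HTMLParser is lenient by design: it tokenizes whatever it is fed
# and never checks the document against a schema or DTD.
class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

parser = TagCollector()
# No </p>, </body>, or </html> -- parsing still completes without error.
parser.feed("<html><body><p>one<p>two")
print(parser.start_tags)  # ['html', 'body', 'p', 'p']
```

The parser simply emits events for the tags it sees; there is no extra "error recovery" pass to pay for when closing tags are absent.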

I would guess that the rationale for omitting tags was to make their pages smaller so that the data transfer is quicker and cheaper.

Right, and I'm saying that it's presumptuous to just figure "Oh, fewer bytes == faster page loads!" The data transfer may be quicker (and their costs certainly cheaper), but does it do anything to speed up the experience of the site? Not if the saved milliseconds are lost parsing the monstrously invalid DOM.

Go read the HTML specs. Very, very carefully, and without preconceived notions. This is not "monstrously invalid" for HTML. Many, many end tags are, in fact, entirely optional, as defined in the specifications, and many tags such as "body" are also surprisingly optional.

HTML is not XML.

It might be invalid, it is "monstrously invalid XHTML", but it is not "monstrously" invalid HTML.

I have been appropriately corrected (by you and others), thank you.

I think I've spent so many years battling with anti-standards folks that I now take it personally when people advocate leaving out tags, not using a proper doctype, and the countless other things Google does. In this case, I failed to do any research before making my argument, and have been deftly slain for it ;)

The question is not whether it's valid or invalid (which is another debate) but whether it's faster or slower.

I'd like to see Google provide facts and figures. When Steve Souders worked at Yahoo!, a large part of the recommendations were backed up by experimental data collected by Tenni Theurer (my manager's wife, coincidentally).

I'd like to see Google provide the same kind of experimental data for their guidelines. If saving 14 bytes is a network win, where are the page-render tests in browsers, particularly the "A-grade" browsers?

See facts and figures? Absolutely.

In light of the whipping their "facts" on PHP optimization recommendations have received here ( http://news.ycombinator.com/item?id=676856 ), it's best not to take pronouncements from a Google post as coming from on high.

"The question is not if it's valid or invalid (which is another debate) but rather is it faster or slower."

Well, the question I answered was whether it was valid. I'm not knowledgeable enough to answer the speed question.

If you look at the HTML 4 recommendation from the W3C, you'll find that this is not invalid HTML. SGML and HTML, unlike XML and XHTML, allow implicit opening and closing tags. As far as I know, they even allow overlapped nesting, which breaks the tree nature of the document structure, e.g. <b>foo<i>bar</b>baz</i>. Browsers have supported this stuff since HTML 1; I seriously doubt there is any significant penalty in parsing.
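For illustration, a lenient tokenizer such as Python's html.parser happily emits events for that overlapped example without raising any error (a sketch of tokenizer behavior, not a claim about any particular browser's internals):

```python
from html.parser import HTMLParser

# Log start and end tag events in the order they appear in the input.
class EventLogger(HTMLParser):
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

logger = EventLogger()
# Overlapped nesting: </b> arrives while <i> is still "open".
logger.feed("<b>foo<i>bar</b>baz</i>")
print(logger.events)
# [('start', 'b'), ('start', 'i'), ('end', 'b'), ('end', 'i')]
```

The tokenizer just reports what it sees; it is the tree-building layer above it that decides how to reconcile the overlap.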

e.g. http://www.w3.org/TR/html401/struct/global.html#h-7.3

Their goal is not saving bandwidth costs: their goal is having their pages render faster. I can't find the link (will search in a bit) but I recently saw an article on Google where they had statistics on the number of searches they lost when their pages rendered a bit slower.

Google is also one of a very few companies where this type of micro-micro-optimizing even begins to make sense.

No it doesn't: this makes sense for everyone. Scale would be involved if they were out to save a few bytes per request, to save on bandwidth. However, they are out to lower the per user load/rendering time, which is completely independent of the scale of the company delivering the page (assuming the delivery scales well enough that extra traffic does not lower page delivery time, but that is usually the case).

I think he's right. Google's homepage HTML is only 5kB, which is pretty low these days, and their images and CSS aren't much more. If your page size is less than 10kB, those couple of bytes start making a difference. Google have probably literally optimised everything else they can optimise: an image atlas to reduce images to one file, a CDN, a custom web server, a custom OS kernel, etc., some of which helps far more than some closing tag.

Yes, all those things are determined by scale, because there is a clear cost involved that requires a certain scale to pay for itself. However, in the case of closing tags the cost is negligible, so it's something that can be used by everyone. Google claims (though I still can't find the link that shows it) that loading/rendering their search result pages more slowly makes people perform fewer searches. That's why they do not allow you to have more than 10 results per page, not even via your personal settings. The same trade-off will hold for every service whose profitability is directly dependent on the number of page views. So it's not scale, just the sort of business you are in, that determines whether not closing tags makes sense.

Am I wrong or are people convinced the parent must be right, given he's got 50 upmods? I was kind of hesitant to criticise it with that many upmods, but really, am I that obtuse if I don't see his point?

pmjordan pointed out that my assertion that the per-user loading/rendering time is completely independent of scale is overstating the case, as he illustrated with a CDN and custom software. However, in my response to him, I think I made a good case that that argument doesn't hold for stripping closing tags.

The author doesn't seem to know the difference between a tag and an element. An unclosed tag would of course be something like "<html". It wouldn't even be accurate to say that Google doesn't close elements; for example the body and html elements are closed at EOF whether the closing tags are included explicitly or implied.

A more accurate headline would have been "Google leaves out some optional tags, as understood by all browsers and permitted by every HTML specification since the beginning of time".

I'm the author-- I feel it's common enough to say "close tags" when you mean closing or ending a markup element.

I'm surprised to see such a literal nitpick upvoted so much, but I can't disagree that you're correct technically. I'm just trying to share an odd/interesting position of Google's, not trying to write a deep analysis or anything.

It's common to say it that way, but that doesn't make it proper. As George Orwell argued in his 1946 essay 'Politics and the English Language' (http://www.orwell.ru/library/essays/politics/english/e_polit): clarity of language is closely bound to clarity of thought. Everyone should be encouraged to write as clearly as possible, to avoid confusion and muddled thinking.

Of course, the grandparent overstates his case and is unconstructive as a result. His fallacy is arguing that someone must be an idiot because they make a trivial error, which overlooks the fact that trivial errors are made by experts all the time, exactly because it's a trivial point that doesn't have their attention. You wouldn't believe the silly errors PhDs fix in the papers of fellow PhDs.

It's common enough for people not to know the difference between a tag and, well, any part of the syntax of HTML that is not a tag. That doesn't mean we should encourage this confusion. I see this frequently in technical channels where people have HTML questions but lack the vocabulary to express them or comprehend the answer. One of the very first things an HTML newbie must learn is the correct names for elements, tags and attributes, which usually means unlearning the incorrect use of 'tag'.

Is this a nitpick? Perhaps if your subject had been something other than markup, or valid markup in particular, you'd be right. I think it's reasonable, though, when writing on a technical subject, to insist on correct usage of that subject's basic technical vocabulary.

This seems disingenuous once I view source on search results and find multiple inline script elements (some not too small), inline style/CSS in the head, and many dispersed span tags, also with inline style declarations. I see some element attributes unquoted, but at least the hrefs on search results are indeed quoted.

From what I can see in their source, there is a lot more they could do to optimize their bandwidth and page load speeds than eating a couple of closing tags.

Requesting an entirely new file for CSS and Javascript is probably significantly more expensive (time-wise if not bandwidth-wise) than sending down more bytes for this one page.

Only the first time; after that it would come from the browser cache with a bandwidth cost of zero. If every user makes a few queries daily, the savings would surely be bigger than not closing a few tags here and there. And I still see lots of onclicks, and spans with multiple CSS class names. Their layout is not that complicated; surely there's room to optimize that too.

Is there any real reason for the body and html close tags (and head for that matter), other than to fit in with the rest of the syntax? Like something you'd want to put after the body, but still within the html tag?

      Here's our head
      Here's our body

There isn't even a need for the <head> and <body> tags at all. This is valid HTML 4.01 Strict:

  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
  <html lang="en">
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    <title>Here's our title</title>
    <link rel="stylesheet" href="stylesheet">
    <p>Here's our body
And this is valid HTML 5:

  <!DOCTYPE html>
  <html lang="en">
    <meta charset="utf-8">
    <title>Here's our title</title>
    <link rel="stylesheet" href="stylesheet">
    <p>Here's our body
(From http://meiert.com/en/blog/20080429/best-html-template/ )

Note that Google almost never employs any technique on the main search site unless it positively affects their statistics.

So the fact that these elements aren't closed probably improves (by a measurable percentage) user engagement/success on their site.

I remember Anne van Kesteren's site once had something like "it's valid, sure." in the source code comments. I had looked through it by chance and couldn't believe someone from Opera would omit so many useful tags. But then, sure enough, it validates just fine.


There's something deeply wrong with the way web developers learn how this stuff works. I keep seeing surprised discoveries of facts that haven't changed since HTML 2.0 came out thirteen years ago.

What is it? Are the books they read (instead of the actual specs) really that terrible?

Knowing where to start always looks a lot easier in retrospect.

yes, but you want to close frame, img, li, and p tags


> If you do close your tags, Internet Explorer will render your pages even faster.

Whereby "even faster" still means much slower than standards-adherent browsers. I can't imagine that closing the tags improves the speed enough for it to make any real difference. Better to optimize a javascript algorithm or compress an image to squeeze out another millisecond or two.

Theoretically, perhaps. Practically, I doubt the close-tags proposed by this MSDN article would make any measurable difference.

In fact, consistently leaving tags out of certain tag-heavy auto-generated layouts could even be faster, by reducing network I/O and fitting more content into memory buffers at a time.

Using the closing tags proposed by the MSDN article would yield an invalid HTML document. Per the DTD and the SGML standard, the closing tag of IMG, an element with EMPTY content, MUST be omitted. The element is already closed before the parser gets to the closing tag, so, in essence, you would be trying to close it twice.

The only thing I'd really ever heard about this was that they strip out all whitespace, but checking http://validator.w3.org/check?uri=http%3A%2F%2Fwww.google.co... it seems they also omit quotes on tag attributes wherever they aren't necessary, along with many other things I was taught not to do if I wanted to produce valid code.

Not using quotes on tag attributes is okay for HTML provided that... refer to the standards for the complete info.

From HTML 4 (http://www.w3.org/TR/REC-html40/intro/sgmltut.html#h-3.2.2):

The attribute value may only contain letters (a-z and A-Z), digits (0-9), hyphens (ASCII decimal 45), periods (ASCII decimal 46), underscores (ASCII decimal 95), and colons (ASCII decimal 58). We recommend using quotation marks even when it is possible to eliminate them.

From HTML 5 (http://www.w3.org/TR/html5/syntax.html#attributes):

...the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal space characters, any U+0022 QUOTATION MARK (") characters, U+0027 APOSTROPHE (') characters, U+003D EQUALS SIGN (=) characters, or U+003E GREATER-THAN SIGN (>) characters, and must not be the empty string.
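As a quick check that unquoted values survive parsing intact, here is a sketch using Python's standard-library html.parser (the attribute names and values are just examples):

```python
from html.parser import HTMLParser

# Collect the (name, value) attribute pairs the parser reports.
class AttrCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)

parser = AttrCollector()
# Unquoted attribute values: fine as long as they contain no spaces,
# quotes, '=', or '>' (per the HTML 5 rules quoted above).
parser.feed('<a href=foo.html class=external>link</a>')
print(parser.attrs)  # [('href', 'foo.html'), ('class', 'external')]
```

The parser recovers exactly the same attribute pairs it would have seen with quotes, which is why omitting them is a pure byte saving for values that meet the rules.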

Google's homepage is fine; they need to spend more time on maps.google.com. I wonder what the true return is on investing in the analysis, the number crunching, etc., just to save a few bits of data.

There must be a formula for queries a year, returning users, versioning and caching resources.

I bet inline styles and scripts are heavier in the long run than cached external resources across millions of requests, especially with multiple requests a day per user.

Now, I don't know how caching works, but if you have to make a resource request just to get a "has not changed" response, then the problem is in the protocol.

How about sending all 'modified-since' tags in the header for every resource, so the client then requests a second bulk with all the resources required?

Can somebody explain how caching works in the browser?
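Roughly: the browser stores the response body along with its Last-Modified and/or ETag headers, then revalidates with a conditional GET; a 304 Not Modified reply carries no body, so only headers cross the wire. (HTTP/1.1 can also skip the round-trip entirely via Cache-Control: max-age.) A sketch of the revalidation logic; the function and dict names are illustrative, not any real browser's internals:

```python
# Sketch of browser-style cache revalidation (illustrative names only).

def conditional_headers(cache_entry):
    """Build revalidation headers from a cached response's metadata."""
    headers = {}
    if "last_modified" in cache_entry:
        # Server replies 304 if the resource is unchanged since this date.
        headers["If-Modified-Since"] = cache_entry["last_modified"]
    if "etag" in cache_entry:
        headers["If-None-Match"] = cache_entry["etag"]
    return headers

def choose_body(cache_entry, status, fresh_body=None):
    """On 304 Not Modified, reuse the cached body; nothing is re-downloaded."""
    if status == 304:
        return cache_entry["body"]
    return fresh_body

entry = {"last_modified": "Sat, 27 Jun 2009 00:00:00 GMT", "body": b"<html>..."}
print(conditional_headers(entry))
# {'If-Modified-Since': 'Sat, 27 Jun 2009 00:00:00 GMT'}
print(choose_body(entry, 304))
```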

