Hacker News
Google Homepage Size Over Time (measuredmass.wordpress.com)
126 points by ISL on Oct 13, 2012 | hide | past | favorite | 53 comments

I find this both interesting and unimportant at the same time. Interesting because it shows that Google has responded to increasing connection speeds by adding bells and whistles to their homepage. That's really great.

Unimportant because with Google Chrome, Firefox, Safari, etc., going to the Google homepage is almost entirely optional. Google is working so hard to bake itself into almost every facet of web functionality that its homepage is becoming less and less important over time.

The only reason I check the google homepage is to look at the doodles.

Strangely, Google is killing off the one thing that keeps me going to their sorta homepage, iGoogle. When they finally do kill it about a year from now, I'll have to come up with some substitute which may not be so Google-centric.

I've switched from iGoogle to opera portal ( http://portal.opera.com/ ), which is a pretty close equivalent (similar 3-column view etc). Only major loss is it has very few gadgets compared to iGoogle, it's mostly just for rss feeds. (And no, you don't need to use opera to use it).

Yeah, I don't understand why they're doing that; iGoogle is pretty good and I use it a lot.

I don't even know what iGoogle is or does.

User adoption is what drives features like this to remain, not how useful they are.

That is true. This feature has been around for many years and used to be advertised on their navbar before Google+ came about, but it never stood out as a prominent component of the Google experience. I figure they are pushing people to adopt Google+ as their "portal" but I'm likely wrong as the two products are quite different. It could also be that it doesn't generate any revenue and the experiment is nearing the end of its life cycle.

...and to ping the internet :)

I replaced that long ago with 'ping'. Still google but not quite the homepage.

You won't know if your DNS is down in that case :)

well, the DNS would be anyway :)

But pinging it doesn't tell you if DNS service is up/down.

You should be running your own in-house anyways.

Google has some kinda fancy browser caching, such that it's not a great way to test if your internet connection is working. Even with a shift-refresh.

Totally agree with your point of view -- Google is growing, so what's the problem if the code grows? It doesn't matter; what matters is that the load time stays low.

Unimportant? It's probably good for their SEO :).

I always enjoyed this interesting story:

One vigilante sent Google an anonymous email every so often just listing a number, like 37 or 43. Eventually Mayer and her colleagues figured out it referred to the number of words on the Google homepage—the implication being that someone was keeping track, so don't screw up the design

– The Google Story

A friend of mine used to keep a post-it note on her monitor with 2,073,418,204 written on it. She was keeping track of the number of pages google reported it had indexed[1]. We got very excited when it changed to 3,083,324,652.

[1] http://web.archive.org/web/20030205061559/http://www.google....

Does Google report this anywhere nowadays? I searched for it a while back and couldn't find it.

I found only estimates: http://www.worldwidewebsize.com/ 40-50 billion pages. From the site:

"The size of the index of a search engine is estimated on the basis of a method that combines word frequencies obtained in a large offline text collection (corpus), and search counts returned by the engines. Each day 50 words are sent to all four search engines. The number of webpages found for these words are recorded; with their relative frequencies in the background corpus, multiple extrapolated estimations are made of the size of the engine's index, which are subsequently averaged. The 50 words have been selected evenly across logarithmic frequency intervals (see Zipf's Law). The background corpus contains more than 1 million webpages from DMOZ, and can be considered a representative sample of the World Wide Web."
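The extrapolation described in that quote can be sketched roughly as follows. All of the word frequencies and hit counts below are invented for illustration, not real measurements: the idea is just that each probe word's reported hit count, divided by its relative frequency in a background corpus, yields one estimate of the index size, and the per-word estimates are then averaged.

```python
# Sketch of the index-size extrapolation described above.
# All numbers are made up for illustration, not real measurements.

# Relative document frequency of each probe word in a background
# corpus (fraction of corpus pages containing the word).
corpus_freq = {"guitar": 0.02, "volcano": 0.005, "syntax": 0.001}

# Hit counts the search engine reports for the same words.
engine_hits = {"guitar": 900_000_000,
               "volcano": 240_000_000,
               "syntax": 55_000_000}

# Each word yields one estimate of the index size: hits / frequency.
estimates = [engine_hits[w] / corpus_freq[w] for w in corpus_freq]

# Average the per-word estimates to smooth out per-word noise.
index_size = sum(estimates) / len(estimates)
print(f"estimated index size: {index_size:,.0f} pages")
```

The real method presumably also corrects for the engines' hit counts themselves being estimates, but the basic hits-over-frequency extrapolation is the core of it.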

I'm wondering if they removed it because the definition of a 'page/document' changed to the point where it's no longer a meaningful number. I'm talking about realtime search etc. A shift towards viewing data as being delineated by the information it contains instead of the documents that contain it.

I see the general point being made here, but it's important to put it in context. The current homepage design is arguably more minimalistic and useful than any design from 2000 onwards. Links to other Google services are conveniently located in a bar at the top where they don't get in the way (as opposed to crowding the area around the search box previously). I can click a microphone and speak my query. When I start typing, results start loading instantly. All improvements in my book, and definitely worth it for the broadband world, which is probably a lot of Google's business at this point. I think it's important to separate minimalist and usable design from minimalist HTML -- I'm sure Google is keeping an eye on their page size and making conscious cost-benefit decisions here.

Now don't get me started on the results page though...

I'm not sure I had a point (though minimalism is appealing); when putting the numbers together, I just wanted to see what the plot looked like.

TFA: "...included only the size of the ‘.html’ file, no images or fanciness."

The '.html' file for google.com has most of the 'fanciness' embedded in it in the form of a ton of JS and all the CSS.

I think this architecture was put in place when instant search/preview went live which is the cause of the first large spike. I'm guessing the second spike is integration with google+.

This looks incorrect. http://web.archive.org/web/20110713000446id_/http://www.goog... is only 28K according to ls -lh.

Perhaps OP forgot to put id_ after the capture timestamp in the archive URLs? The id_ makes sure that the Wayback Machine only returns the page as it was when it was indexed.

I definitely used id_ . I saved the source that I used; I'll recheck things.

Should be fixed now. The last point is still ~100k. Running Chromium, if I'm logged in, google.com's HTML saves as 104k. Logged out, it's 94k, as measured by ls -lh.

Looks good. I was using cURL, so that explains the ~100k vs. ~28k discrepancy.

You are correct, some of them slipped past, and contain archive.org code. It's quick to fix.

Thank you.

I'd be curious to see a similar plot of average internet speed vs time.

And, correspondingly, a plot of Google homepage HTML file size divided by average internet speed vs time, so we wouldn’t have to compare the two curves in our heads.
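That ratio is easy to sketch. The page sizes and bandwidth figures below are hypothetical, made-up numbers purely for illustration; the point is only that dividing the two series gives a transfer-time curve directly.

```python
# Hypothetical numbers purely for illustration: homepage HTML size
# in kB and a typical consumer downstream bandwidth in kbit/s.
page_kb = {2000: 3, 2004: 10, 2008: 30, 2012: 100}
bw_kbps = {2000: 56, 2004: 768, 2008: 3000, 2012: 6000}

# Transfer time in milliseconds: kB -> kbit (x8), divide by kbit/s,
# then scale seconds to milliseconds (x1000).
transfer_ms = {year: page_kb[year] * 8 / bw_kbps[year] * 1000
               for year in page_kb}

for year in sorted(transfer_ms):
    print(f"{year}: {transfer_ms[year]:.0f} ms to fetch the HTML")
```

Even with these toy numbers the interesting shape emerges: transfer time can fall overall even while the page itself grows 30x.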

It's 100 kB now.

If the plot had a similar curve from 0-50kB would everyone still be commenting? What about 0-5 kB?

Is this about the absolute values or the curve?

Unless everyone is expecting exponential page size growth to continue indefinitely (and ignoring bandwidth increases), I don't see the point.

1-5 kB would be even more interesting. That would make Google's flagship page several orders of magnitude smaller than most websites.

Would be interesting to compare this to worldwide typical and average bandwidth availability.

Google used to prize page load times for their search landing page. I wonder if the increase is coming about because they recognize they can indeed get away with it and still hit their performance goals, or if they've shifted their priorities and/or acceptable values for load times.

Perhaps also the page is tee-ing up for even faster results page access, by way of instant search.

And I haven't checked, but perhaps most of the k weight comes in after the page has rendered, in which case it wouldn't negatively impact perceived load time.

The page is also more featureful than it used to be, at least in the logged-in state.

This is ignoring the fact that a whole bunch of assets (CSS/JS) are embedded into the page to reduce the number of HTTP requests.

Wouldn't it be better to have two asset files, one for JS and one for CSS? (Unless there's a way to merge those two.)

I don't know how often Google changes that code, but even if it were as often as weekly, users would gain the benefit of caching those files and loading them locally, making the page even smaller to download.

Or are those two http hits really that expensive?

The only reason I can think of that it would be a bad idea is that caching could hurt them. If the browser doesn't handle caching correctly, and they change the source, frequently or infrequently, users may see broken pages and/or broken functionality. Not all users know of Shift-reload, none should need to, and I find it doesn't work reliably myself.

What is the best practice here? 2 http requests, embedded code on the page, arbitrary http requests that are cached, no caching, etc.?

> If the browser doesn't handle caching correctly, and they do change the source, frequently or infrequently, users may see broken pages and/or broken functionality.

Typically you just use a new URL to get around this ... Frequently changed cacheable files (css, js) are usually timestamped or contain version information in the filename.
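A common way to generate those versioned filenames is to embed a short hash of the file's contents, so any change to the asset automatically produces a new URL and stale caches are never consulted. A minimal sketch of that pattern (the function name and 8-character digest length are arbitrary choices, not any particular framework's API):

```python
import hashlib

def versioned_name(filename: str, content: bytes) -> str:
    """Embed a short content hash in the filename so that any change
    to the content yields a new URL, sidestepping stale browser caches."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    stem, dot, ext = filename.rpartition(".")
    return f"{stem}.{digest}.{ext}" if dot else f"{filename}.{digest}"

print(versioned_name("app.js", b"alert(1);"))
```

The asset itself can then be served with a far-future Expires header, since the URL changes whenever the content does.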

On average, the size of a website triples every ~5 years.

Nowadays there are lots of really fat websites with tons of resources. Personally, I think that's kinda convenient. This way it's a lot easier to make sites which are much faster than those of the competition.

More and more people start to merge all of their CSS and all of their JS files, then they minify them, and finally the whole thing is gzipped. This surely helps a lot. However, their sites still continue to grow, because they continue to add more and more crap.

If only my internet connection tripled every 5 years too.

In reality, it was 28.8Kbps up to 1999, then has remained in the 2-3 Mbps range ever since.
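Taking the tripling-every-five-years figure above at face value, and assuming bandwidth really does stay flat, the relative transfer time grows geometrically. A quick sketch of that arithmetic (the period and ratio are just the numbers claimed upthread):

```python
# If page size triples every 5 years while bandwidth stays flat,
# relative transfer time grows by a factor of 3**(years / 5).
def growth_factor(years: float, period: float = 5.0, ratio: float = 3.0) -> float:
    return ratio ** (years / period)

print(growth_factor(5))   # one period: 3x
print(growth_factor(10))  # two periods: 9x
```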

I wonder how much of this has nothing at all to do with presenting search results to the user, i.e. fulfilling the user's expectations... (guess: > 95%)

Does anyone have a good CLI for Google and duckduckgo that I could use from within putty with clickable links? (similar to surfraw, but just dumping text + URLs from the results to stdout instead of launching a text mode browser)


$ curl -sL http://google.com | wc -c


Plot updated to answer your question. There's a big difference between that which is served to curl and to a browser.

Thank you!

Maybe it's psychological, but I feel that since they shifted to SPDY the encrypted version is more responsive despite the bigger size.

The important thing is iGoogle was killed off.

Not yet, another year. http://www.google.com/ig

It gets cached so it's okay.

The label of the ordinate contains an error - kilobyte = byte * 1024, as opposed to byte / 1024


#bytes * (1 kilobyte)/(1024 bytes) = #kilobytes

or bytes/1024=kilobytes, for short.

Basically, it's a conversion factor, not a definition.

Might want to consider using the unambiguous term kibibyte instead.


It is a tradeoff, it avoids the need to explain which definition of kilo is used, at the cost of using a far less common term.

Kibibyte doesn't make things clearer in this case. They aren't talking about whether the factor should be 1000 or 1024. The first poster said that kilobytes should be bytes * 1024. That is incorrect. The number of kilobytes is equal to the number of bytes / 1024, not bytes * 1024.
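The conversion under discussion, spelled out in code: a byte count divides by 1024 to give kilobytes (in the binary, 1024-byte sense of "kilobyte" that the plot uses), even though one kilobyte is 1024 bytes. The sample value below is just an arbitrary round number for illustration.

```python
def bytes_to_kb(n_bytes: int) -> float:
    # One kilobyte (binary sense, i.e. kibibyte) is 1024 bytes, so a
    # byte count converts to kilobytes by dividing by 1024 -- the
    # definition (x1024) and the conversion (/1024) are inverses.
    return n_bytes / 1024

print(bytes_to_kb(104_448))  # ~102 kB, roughly a 2012 homepage
```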

Seems like it goes downhill roughly when Marissa moved on.

I'm sure someone is measuring yahoo.com since Mayer joined. :)
