How long will it take for Google to index this?

jakewolf · on June 2, 2008

I couldn't resist checking: http://search.yahoo.com/search?p=%22How+long+will+it+take+fo...

http://search.live.com/results.aspx?q=%22How+long+will+it+ta...

http://www.alexa.com/search?q=%22How%20long%20will%20it%20ta...

All duds for now.

andreyf · on June 3, 2008

PG talked about blocking all non-Google crawlers in an attempt to cut down on traffic ... not sure if he ever did or not

haasted · on June 3, 2008

There doesn't seem to be anything about it in HN's robot.txt.

Of course, it may have been implemented in other ways.

andrewparker · on June 3, 2008

What are those "x" URLs blocked in the robots.txt for HN?

dkokelley · on June 2, 2008

20 minutes ish. Very good. Keep in mind that as they update the 21 "minutes ago" will get farther, so it may have been even sooner.

merrick33 · on June 2, 2008

Historically the indexing factor is tied to the amount and authority of a domains inbound links. Not all sites will get the results yielded on this domain.

pierrefar · on June 2, 2008

I've seen some blog posts indexed within 2 minutes. A lot of people have reported sub-5-minute indexing. 20-40min is good and the most common time frame in my experience and from what I understand from other people.

Ultra-fast indexing depends on posting just before the crawler hits the index page. A lot of luck is involved to catch the crawler like that.

tb · on June 3, 2008

If someone uses Google Reader to subscribe to your RSS feed, then your new blog posts are likely to get indexed as soon as Google Reader sees your new post show up in RSS.

pyroman · on June 3, 2008

Some blog software, like wordpress, will ping services when you post something new. http://codex.wordpress.org/Update_Services

martian · on June 2, 2008

In the spirit of Douglas Hofstadter... nice recursion.

mojuba · on June 3, 2008

http://www.google.ie/search?q=%22This+sentence+can+not+be+fo...

TrevorJ · on June 2, 2008

If I imbeded the search result in a third webpage and then googled that page...woo, makes my head hurt.

ivankirigin · on June 3, 2008

This isn't even that odd. That's just plain recursion.

Each notice of indexing should alter the query text based upon the comments at the time.

byrneseyeview · on June 2, 2008

Indexed. 25 minutes?

In their excerpt, they note that it was posted 21 minutes ago. So that seems to be their score.

bluelu · on June 3, 2008

5 weeks ago, I had a question regarding aws and created a topic in the amazon forum. Afterwards I searched google again, with different keywords and I found my own topic. That was scary. Only a few minutes passed (at most 5)

TrevorJ · on June 2, 2008

Google is impressive, that's for sure.

khafra · on June 3, 2008

I posted "Quisquis custodiet ipsos custodes" on reddit, then immediately decided I wanted to check my spelling--I was already the top result in Google.

khafra · on June 3, 2008

...and now this one's on top, with the former post in second place. My mind boggles at the possibilities for recursion.

henning · on June 2, 2008

for those of us who have never run high authority websites -- if google indexes you that frequently, how much bandwidth do they use up?

byrneseyeview · on June 2, 2008

Google has gotten much faster at finding new pages lately. It would be interesting to know how much faster.

jsteele · on June 2, 2008

No offence Tarquin, but what's the purpose of this test? It cannot possibly tell us anything about our own sites, or the sites we work on.

Like many of us, I've already tracked the time it takes for Google to index the changes to my own sites - about 14 hours. I work on sites that the Google robots hit only once every 2 weeks - less content, and fewer updates, though some of them earn 10 times what my more 'frequent' sites earn.

Good O, Hacker News! Which means sweet F all to me, all of my customers, and pretty much everybody else on the planet.

By the way, how long IS a piece off string?

byrneseyeview · on June 3, 2008

Tarquin?

The purpose was to measure a phenomenon I'd previously observed -- that it's getting harder to find answers on Google, because people post their questions on well-trafficked sites, and within a few minutes, those sites rocket to the top for a well-crafted search query trying to answer the question.

"By the way, how long IS a piece off string?"

I don't know, but strlen(piece) probably does.

lyime · on June 2, 2008

Thats pretty impressive. I have noticed my small blog posts get indexed quickly on google.

earle · on June 2, 2008

30 to 90 minutes for this site

Tichy · on June 3, 2008

I recently thought about this: what if the whole internet worked more like Twitter? People are laughing about Twitter and think it is an easy problem, but not even Google manages instant updates! The equivalent to Twitter for Google would be to index all news on the internet immediately.

In that sense, I think we can cut Twitter some slack...

smanek · on June 3, 2008

Except that you're off by (at least) several dozen orders of magnitude. Twitter, in fact, is several thousand times smaller than even GMail, which has ~100,000,000 registered users to Twitter's ~1,000,000 (http://twitterfacts.blogspot.com/2007/10/twitter-number-of-u... - and that's assuming twitter's user base has over doubled in the last six months while GMails has remained stagnant for the last year or so).

Google manages to route large messages between GMail users instantly, without breaking a sweat, and with 100% uptime (so does Microsoft, Yahoo, and every other major email provider). They have no trouble dealing with users who send or receive thousands of emails a day, while twitter claims this is some sort of pathological, borderline-impossible case to deal with.

And GMail is relatively trivial compared to actually indexing the internet. The last decent estimate I saw (2005 - http://www.cs.uiowa.edu/~asignori/web-size/) suggested the indexable WWW is ~ 12 Billion pages - and god knows how much that has increased in the last four years.

Seriously, make sure you're in the right ballpark - hell, in the right galaxy - before talking out off your ass. Twitter's problem should be easy today - it was solved decades ago. The fact that Twitter can't scale is nothing more than a testament to their incompetence.

Tichy · on June 3, 2008

Well I was really thinking about twitter replacements, as I said - what if the whole net would work like twitter? I update my blog and I want it to be known instantly. Google does not provide it yet.

But I have also thought about mail, perhaps it would be easiest to replace Twitter with "micromailing", and use existing software solutions. Difference is of course the "only receive messages from people I subscribe to" thing, but that should only make it easier.

Edit: also, google mail doesn't do polling, so maybe it is very different after all. It's a completely different architecture. Maybe Twitter would work better with an inbox-per-user model, but I wouldn't bet on it.

Another difference: most emails don't go to dozens of recipients at the same time. Tweets often have several hundred recipients.

smanek · on June 3, 2008

For starters, I don't buy your premise that most mail doesn't go to dozens of recipients - I don't have data to back this up but I'd bet that between mailing lists, spam, etc., the vast majority of email (by volume) is sent to multiple recipient.

And yes, it is true that SMTP doesn't use polling. But that is because polling for messages doesn't scale very well - a smart 15 year old could tell you that. I don't how twitter's architecture works, but if it polls all a user's subscriptions each time that user logs in then I say that was a bad architecture decision.

It would be pretty straightforward to emulate subscriptions with a 'mailing list' type solution. The simple fact of the matter is that Twitter chose a poor architecture. If they simply used an inbox/mailing list type solution all their problems would be non-existent and twitter would scale, near linearly, as you increased the number of servers.

Yes, it would be grossly inefficient HD space wise - but who cares? Normalization (seems to be) what got them in this mess. And disk space is cheap enough that it doesn't really matter.

Assume Twitter had a million active users (they don't), and that they each sent 5 Tweets/day on average (users don't), and that each user was subscribed by 5 Users on average (they aren't). That means about 5GB of Tweets are made a day. Using an inbox type scheme (with absolutely no normalization - which is unlikely) that would only require about 25GB of storage a day. Given reasonable cluster costs (and assuming data is 4x redundantly stored on the cluster), that would require well under a $100K/year in storage expansion (to store all tweets forever). Using semi-intelligent caching and partial normalization I think it would be pretty easy to cut those storage requirements in half w/o breaking a sweat.

I say again Twitter's problems are trivial - they just seem to have (initially) made a lot of poor design choices, and are now living with the consequences.

Tichy · on June 3, 2008

I can't really defend Twitter's architecture, I am simply interested in the best way to implement something like that as well - but not as a centralized service.

Just a note: email does polling, too. Most email clients poll the server on a regular basis. But that is usually every 5 minutes or less? Twitter has just limited polling to 30 times per minute for external clients, if I remember correctly.

I haven't thought too deeply about it, but I wonder if a mixture between polling and pushing could work. Either it could be decided on the nature of the client if it should push or poll (ie how many subscribers does it have, how active is it), or clients could connect and say "please push for the next 20 minutes" or just send keep-alives. If client alive, push, otherwise let him poll? I doubt that polling could be eliminated completely. If an email server is offline for an extended period of time, the message never arrives. Mabye that is acceptable, though.

bluelu · on June 3, 2008

I completely agree with you! (smanek)

modoc · on June 3, 2008

| I update my blog and I want it to be known instantly. Google does not provide it yet.|

If your blog supports providing a dynamic sitemap, and notifying google that the sitemap has been updated, it basically does.

Check out: http://wordpress.org/extend/plugins/google-sitemap-generator...

On your other points, I'd be pretty surprised if you couldn't provide a twitter-type service, at twitter-type message volumes, using e-mail as the hidden back-end. I don't think hundreds of local recipients would be that big of an issue, especially at twitter message rates/volumes. I'd be willing to bet it could be done pretty easily with a small number of decent servers. Then again, it could be that the Twitter folks know something I don't. Or they're just too emotionally invested in their existing back-end. No idea.