Historically, indexing speed is tied to the amount and authority of a domain's inbound links. Not all sites will get the results yielded on this domain.
I've seen some blog posts indexed within 2 minutes, and a lot of people have reported sub-5-minute indexing. In my experience, and from what I've heard from others, 20-40 minutes is the most common time frame, and a good one.
Ultra-fast indexing depends on posting just before the crawler hits the index page. A lot of luck is involved to catch the crawler like that.
If someone uses Google Reader to subscribe to your RSS feed, then your new blog posts are likely to get indexed as soon as Google Reader sees your new post show up in RSS.
Five weeks ago, I had a question regarding AWS and created a topic in the Amazon forum. Afterwards I searched Google again, with different keywords, and found my own topic. That was scary. Only a few minutes had passed (at most 5).
I posted "Quisquis custodiet ipsos custodes" on reddit, then immediately decided I wanted to check my spelling--I was already the top result in Google.
No offence Tarquin, but what's the purpose of this test? It cannot possibly tell us anything about our own sites, or the sites we work on.
Like many of us, I've already tracked the time it takes for Google to index the changes to my own sites - about 14 hours. I work on sites that the Google robots hit only once every 2 weeks - less content, and fewer updates, though some of them earn 10 times what my more 'frequent' sites earn.
Good O, Hacker News! Which means sweet F all to me, all of my customers, and pretty much everybody else on the planet.
The purpose was to measure a phenomenon I'd previously observed -- that it's getting harder to find answers on Google, because people post their questions on well-trafficked sites, and within a few minutes, those sites rocket to the top for a well-crafted search query trying to answer the question.
I recently thought about this: what if the whole internet worked more like Twitter? People are laughing about Twitter and think it is an easy problem, but not even Google manages instant updates! The equivalent to Twitter for Google would be to index all news on the internet immediately.
In that sense, I think we can cut Twitter some slack...
Except that you're off by several orders of magnitude. Twitter, in fact, is several thousand times smaller than even GMail, which has ~100,000,000 registered users to Twitter's ~1,000,000 (http://twitterfacts.blogspot.com/2007/10/twitter-number-of-u... - and that's assuming Twitter's user base has more than doubled in the last six months while GMail's has remained stagnant for the last year or so).
Google manages to route large messages between GMail users instantly, without breaking a sweat, and with essentially 100% uptime (as do Microsoft, Yahoo, and every other major email provider). They have no trouble dealing with users who send or receive thousands of emails a day, while Twitter claims this is some sort of pathological, borderline-impossible case to deal with.
And GMail is relatively trivial compared to actually indexing the internet. The last decent estimate I saw (2005 - http://www.cs.uiowa.edu/~asignori/web-size/) suggested the indexable WWW is ~12 billion pages - and god knows how much that has increased in the four years since.
Seriously, make sure you're in the right ballpark - hell, in the right galaxy - before talking out of your ass. Twitter's problem should be easy today - it was solved decades ago. The fact that Twitter can't scale is nothing more than a testament to their incompetence.
Well, I was really thinking about Twitter replacements. As I said: what if the whole net worked like Twitter? I update my blog and I want it to be known instantly. Google does not provide that yet.
But I have also thought about mail, perhaps it would be easiest to replace Twitter with "micromailing", and use existing software solutions. Difference is of course the "only receive messages from people I subscribe to" thing, but that should only make it easier.
Edit: also, google mail doesn't do polling, so maybe it is very different after all. It's a completely different architecture. Maybe Twitter would work better with an inbox-per-user model, but I wouldn't bet on it.
Another difference: most emails don't go to dozens of recipients at the same time. Tweets often have several hundred recipients.
For starters, I don't buy your premise that most mail doesn't go to dozens of recipients. I don't have data to back this up, but I'd bet that between mailing lists, spam, etc., the vast majority of email (by volume) is sent to multiple recipients.
And yes, it is true that SMTP doesn't use polling. But that is because polling for messages doesn't scale very well - a smart 15-year-old could tell you that. I don't know how Twitter's architecture works, but if it polls all a user's subscriptions each time that user logs in, then I'd say that was a bad architecture decision.
It would be pretty straightforward to emulate subscriptions with a 'mailing list' type solution. The simple fact of the matter is that Twitter chose a poor architecture. If they simply used an inbox/mailing list type solution all their problems would be non-existent and twitter would scale, near linearly, as you increased the number of servers.
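To make the "mailing list" idea concrete, here is a minimal sketch of fan-out-on-write, the inbox-per-user model the parent is describing. This is a hypothetical illustration (the class and method names are mine, and real systems would shard and persist this), not Twitter's actual design:

```python
from collections import defaultdict

class FanOutTimeline:
    """Fan-out-on-write: copy each tweet into every follower's inbox,
    trading disk space for O(1) timeline reads (no per-read joins or
    polling of every subscription)."""

    def __init__(self):
        self.followers = defaultdict(set)   # author -> set of follower ids
        self.inboxes = defaultdict(list)    # user -> list of delivered tweets

    def follow(self, follower, author):
        self.followers[author].add(follower)

    def post(self, author, text):
        # One write per recipient -- deliberately denormalized.
        self.inboxes[author].append((author, text))   # author sees own tweets
        for f in self.followers[author]:
            self.inboxes[f].append((author, text))

    def timeline(self, user, n=20):
        # Reading a timeline is just slicing a precomputed list.
        return self.inboxes[user][-n:]
```

The point of the design is that writes are cheap to parallelize (each inbox lives on whichever server owns that user), which is why this scales roughly linearly with the number of servers.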
Yes, it would be grossly inefficient HD space wise - but who cares? Normalization (seems to be) what got them in this mess. And disk space is cheap enough that it doesn't really matter.
Assume Twitter had a million active users (they don't), that they each sent 5 tweets/day on average (users don't), and that each user was followed by 5 users on average (they aren't). At roughly 1 KB per stored tweet, that means about 5GB of tweets are made a day. Using an inbox-type scheme (with absolutely no normalization - which is unlikely) that would only require about 25GB of storage a day. Given reasonable cluster costs (and assuming data is stored 4x redundantly on the cluster), that would require well under $100K/year in storage expansion (to store all tweets forever). Using semi-intelligent caching and partial normalization, I think it would be pretty easy to cut those storage requirements in half without breaking a sweat.
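The arithmetic above can be checked in a few lines. The ~1 KB-per-tweet figure is my assumption (140 characters plus metadata and indexing overhead), as are the other round numbers from the paragraph:

```python
users = 1_000_000                 # assumed active users
tweets_per_user_per_day = 5       # assumed posting rate
avg_followers = 5                 # assumed fan-out per tweet
tweet_bytes = 1_000               # assumed ~1 KB per stored tweet

raw = users * tweets_per_user_per_day * tweet_bytes   # 5 GB/day of tweets
fanned_out = raw * avg_followers                      # 25 GB/day with inbox copies
replicated = fanned_out * 4                           # 4x redundancy -> 100 GB/day
per_year_tb = replicated * 365 / 1e12                 # ~36.5 TB/year

print(f"{raw/1e9:.0f} GB/day raw, {fanned_out/1e9:.0f} GB/day fanned out, "
      f"{per_year_tb:.1f} TB/year replicated")
```

~37 TB/year of disk, even 4x replicated, is comfortably inside the "well under $100K/year" claim at 2008-era storage prices.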
I say again Twitter's problems are trivial - they just seem to have (initially) made a lot of poor design choices, and are now living with the consequences.
I can't really defend Twitter's architecture, I am simply interested in the best way to implement something like that as well - but not as a centralized service.
Just a note: email does polling, too. Most email clients poll the server on a regular basis, though usually no more often than every 5 minutes or so. Twitter has just limited polling to 30 times per minute for external clients, if I remember correctly.
I haven't thought too deeply about it, but I wonder if a mixture between polling and pushing could work. Either the decision to push or poll could be based on the nature of the client (how many subscribers it has, how active it is), or clients could connect and say "please push for the next 20 minutes", or just send keep-alives. If the client is alive, push; otherwise let it poll. I doubt that polling could be eliminated completely. If an email server is offline for an extended period of time, the message never arrives. Maybe that is acceptable, though.
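The lease idea above ("please push for the next 20 minutes", renewed by keep-alives) can be sketched in a few lines. This is purely hypothetical; the names and the lease mechanism are mine, not any real protocol:

```python
import time

class Subscriber:
    """Hybrid delivery: push while the client holds a live 'lease'
    (renewed by keep-alives), otherwise queue messages for its next poll."""

    LEASE_SECONDS = 20 * 60   # "please push for the next 20 minutes"

    def __init__(self):
        self.lease_expires = 0.0
        self.mailbox = []     # messages held for offline/polling clients

    def keep_alive(self, now=None):
        now = time.time() if now is None else now
        self.lease_expires = now + self.LEASE_SECONDS

    def deliver(self, msg, push, now=None):
        now = time.time() if now is None else now
        if now < self.lease_expires:
            push(msg)                 # client is live: push immediately
        else:
            self.mailbox.append(msg)  # queue until the client next polls

    def poll(self):
        msgs, self.mailbox = self.mailbox, []
        return msgs
```

Since messages for dead clients just accumulate in the mailbox, nothing is lost if a client misses its lease, which addresses the "server offline for an extended period" worry.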
On your other points, I'd be pretty surprised if you couldn't provide a twitter-type service, at twitter-type message volumes, using e-mail as the hidden back-end. I don't think hundreds of local recipients would be that big of an issue, especially at twitter message rates/volumes. I'd be willing to bet it could be done pretty easily with a small number of decent servers. Then again, it could be that the Twitter folks know something I don't. Or they're just too emotionally invested in their existing back-end. No idea.
http://search.live.com/results.aspx?q=%22How+long+will+it+ta...
http://www.alexa.com/search?q=%22How%20long%20will%20it%20ta...
All duds for now.