

How long will it take for Google to index this? - byrneseyeview
http://www.google.com/search?q=%22How+long+will+it+take+for+Google+to+index+this%3F%22&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a

======
jakewolf
I couldn't resist checking:
[http://search.yahoo.com/search?p=%22How+long+will+it+take+fo...](http://search.yahoo.com/search?p=%22How+long+will+it+take+for+Google+to+index+this%3F%22&fr=yfp-t-501&toggle=1&cop=mss&ei=UTF-8)

[http://search.live.com/results.aspx?q=%22How+long+will+it+ta...](http://search.live.com/results.aspx?q=%22How+long+will+it+take+for+Google+to+index+this%3F%22&go=&form=QBHP)

[http://www.alexa.com/search?q=%22How%20long%20will%20it%20ta...](http://www.alexa.com/search?q=%22How%20long%20will%20it%20take%20for%20Google%20to%20index%20this?%22)

All duds for now.

~~~
andreyf
PG talked about blocking all non-Google crawlers in an attempt to cut down on
traffic ... not sure if he ever did or not

~~~
haasted
There doesn't seem to be anything about it in HN's robots.txt.

Of course, it may have been implemented in other ways.

~~~
andrewparker
What are those "x" URLs blocked in the robots.txt for HN?

------
dkokelley
About 20 minutes. Very good. Keep in mind that as they update, the "21 minutes
ago" will keep drifting, so it may have been even sooner.

------
merrick33
Historically, indexing speed is tied to the number and authority of a
domain's inbound links. Not all sites will get the results yielded on this
domain.

------
pierrefar
I've seen some blog posts indexed within 2 minutes. A lot of people have
reported sub-5-minute indexing. 20-40 minutes is good, and in my experience
(and from what I understand from others) the most common time frame.

Ultra-fast indexing depends on posting just before the crawler hits the index
page. Catching the crawler like that involves a lot of luck.

~~~
tb
If someone uses Google Reader to subscribe to your RSS feed, then your new
blog posts are likely to get indexed as soon as Google Reader sees your new
post show up in RSS.

------
martian
In the spirit of Douglas Hofstadter... nice recursion.

~~~
TrevorJ
If I embedded the search result in a third webpage and then googled that
page... woo, makes my head hurt.

~~~
ivankirigin
This isn't even that odd. That's just plain recursion.

Each notice of indexing should alter the query text based upon the comments at
the time.

------
byrneseyeview
Indexed. 25 minutes?

In their excerpt, they note that it was posted 21 minutes ago. So that seems
to be their score.

------
bluelu
5 weeks ago, I had a question regarding AWS and created a topic in the Amazon
forum. Afterwards I searched Google again with different keywords and found
my own topic. That was scary. Only a few minutes had passed (5 at most).

------
TrevorJ
Google is impressive, that's for sure.

------
khafra
I posted "Quisquis custodiet ipsos custodes" on reddit, then immediately
decided I wanted to check my spelling--I was already the top result in Google.

~~~
khafra
...and now this one's on top, with the former post in second place. My mind
boggles at the possibilities for recursion.

------
henning
For those of us who have never run high-authority websites -- if Google
indexes you that frequently, how much bandwidth does it use up?

------
byrneseyeview
Google has gotten much faster at finding new pages lately. It would be
interesting to know how much faster.

------
jsteele
No offence Tarquin, but what's the purpose of this test? It cannot possibly
tell us anything about our own sites, or the sites we work on.

Like many of us, I've already tracked the time it takes for Google to index
the changes to my own sites - about 14 hours. I work on sites that the Google
robots hit only once every 2 weeks - less content, and fewer updates, though
some of them earn 10 times what my more 'frequent' sites earn.

Good O, Hacker News! Which means sweet _F_ all to me, all of my customers, and
pretty much everybody else on the planet.

By the way, how long IS a piece of string?

~~~
byrneseyeview
Tarquin?

The purpose was to measure a phenomenon I'd previously observed -- that it's
getting harder to find answers on Google, because people post their questions
on well-trafficked sites, and within a few minutes, those sites rocket to the
top for a well-crafted search query trying to answer the question.

"By the way, how long IS a piece off string?"

I don't know, but strlen(piece) probably does.

------
lyime
That's pretty impressive. I have noticed my small blog posts get indexed
quickly on Google.

------
earle
30 to 90 minutes for this site

------
Tichy
I recently thought about this: what if the whole internet worked more like
Twitter? People are laughing about Twitter and think it is an easy problem,
but not even Google manages instant updates! The equivalent to Twitter for
Google would be to index all news on the internet immediately.

In that sense, I think we can cut Twitter some slack...

~~~
smanek
Except that you're off by (at least) a couple of orders of magnitude.
Twitter, in fact, is roughly a hundred times smaller than even GMail, which
has ~100,000,000 registered users to Twitter's ~1,000,000
([http://twitterfacts.blogspot.com/2007/10/twitter-number-of-u...](http://twitterfacts.blogspot.com/2007/10/twitter-number-of-users.html)
\- and that's assuming Twitter's user base has more than doubled in the
last six months while GMail's has remained stagnant for the last year or so).

Google manages to route large messages between GMail users instantly, without
breaking a sweat, and with essentially 100% uptime (as do Microsoft, Yahoo, and
every other major email provider). They have no trouble dealing with users who
send or receive thousands of emails a day, while Twitter claims this is some
sort of pathological, borderline-impossible case to deal with.

And GMail is relatively trivial compared to actually indexing the internet.
The last decent estimate I saw (2005 -
<http://www.cs.uiowa.edu/~asignori/web-size/>) suggested the indexable WWW is
~12 billion pages - and god knows how much that has increased in the years since.

Seriously, make sure you're in the right ballpark - hell, in the right galaxy
- before talking out of your ass. Twitter's problem should be easy today - it
was solved decades ago. The fact that Twitter can't scale is nothing more than
a testament to their incompetence.

~~~
Tichy
Well, I was really thinking about Twitter replacements, as I said - what if
the whole net worked like Twitter? I update my blog and I want it to be known
instantly. Google doesn't provide that yet.

But I have also thought about mail; perhaps it would be easiest to replace
Twitter with "micromailing" and use existing software solutions. The
difference, of course, is the "only receive messages from people I subscribe
to" thing, but that should only make it easier.

Edit: also, GMail doesn't do polling, so maybe it is very different after
all. It's a completely different architecture. Maybe Twitter would work
better with an inbox-per-user model, but I wouldn't bet on it.

Another difference: most emails don't go to dozens of recipients at the same
time. Tweets often have several hundred recipients.

~~~
smanek
For starters, I don't buy your premise that most mail doesn't go to dozens of
recipients. I don't have data to back this up, but I'd bet that between
mailing lists, spam, etc., the vast majority of email (by volume) is sent to
multiple recipients.

And yes, it is true that SMTP doesn't use polling. But that is _because_
polling for messages doesn't scale very well - a smart 15-year-old could tell
you that. I don't know how Twitter's architecture works, but if it polls all
of a user's subscriptions each time that user logs in, then I'd say that was
a bad architecture decision.

It would be pretty straightforward to emulate subscriptions with a 'mailing
list' type solution. The simple fact of the matter is that Twitter chose a
poor architecture. If they simply used an inbox/mailing-list type solution,
all their problems would be non-existent and Twitter would scale near-linearly
as you increased the number of servers.

Yes, it would be grossly inefficient disk-space-wise - but who cares?
Normalization (seems to be) what got them into this mess, and disk space is
cheap enough that it doesn't really matter.

Assume Twitter had a million active users (they don't), and that they each
sent 5 tweets/day on average (users don't), and that each user was followed
by 5 users on average (they aren't). At roughly 1 KB per stored tweet, that
means about 5GB of tweets are made a day. Using an inbox-type scheme (with
absolutely no normalization - which is unlikely) that would only require
about 25GB of storage a day. Given reasonable cluster costs (and assuming
data is stored 4x redundantly on the cluster), that would require well under
$100K/year in storage expansion (to store all tweets forever). Using
semi-intelligent caching and partial normalization, I think it would be
pretty easy to cut those storage requirements in half without breaking a
sweat.
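The arithmetic above can be sanity-checked in a few lines. The user, tweet,
and follower counts come straight from the comment; the ~1 KB/tweet size is
my own assumption, made only to reproduce the "5GB a day" figure:

```python
# Back-of-envelope check of the storage estimate above.
# Counts are from the comment; BYTES_PER_TWEET is an assumed figure.
USERS = 1_000_000          # active users (generous)
TWEETS_PER_DAY = 5         # tweets per user per day
FOLLOWERS = 5              # average followers per user
BYTES_PER_TWEET = 1_000    # ~1 KB per stored tweet (assumption)

raw_per_day = USERS * TWEETS_PER_DAY * BYTES_PER_TWEET   # one copy of each tweet
fanout_per_day = raw_per_day * FOLLOWERS                 # one copy per follower inbox
redundant_per_day = fanout_per_day * 4                   # 4x replicated on the cluster

print(raw_per_day / 1e9)               # 5.0  -> ~5 GB/day of raw tweets
print(fanout_per_day / 1e9)            # 25.0 -> ~25 GB/day with inbox fan-out
print(redundant_per_day * 365 / 1e12)  # 36.5 -> ~36.5 TB/year, replicated
```

At 2008-era disk prices, a few tens of TB per year of cluster storage does
indeed land comfortably under the $100K/year the comment claims.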

I say again: Twitter's problems are trivial - they just seem to have
(initially) made a lot of poor design choices, and are now living with the
consequences.

~~~
Tichy
I can't really defend Twitter's architecture, I am simply interested in the
best way to implement something like that as well - but not as a centralized
service.

Just a note: email does polling, too. Most email clients poll the server on a
regular basis - usually every 5 minutes or so? Twitter has just limited
polling to 30 times per minute for external clients, if I remember correctly.

I haven't thought too deeply about it, but I wonder if a mixture of polling
and pushing could work. Either it could be decided from the nature of the
client whether it should push or poll (i.e. how many subscribers does it
have, how active is it), or clients could connect and say "please push for
the next 20 minutes" or just send keep-alives. If the client is alive, push;
otherwise let it poll? I doubt that polling could be eliminated completely.
If an email server is offline for an extended period of time, the message
never arrives. Maybe that is acceptable, though.
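The keep-alive hybrid described above can be sketched in a few lines. This is
a toy model, not anything Twitter actually does; all names (`Hub`,
`keep_alive`, the 20-minute lease) are made up to illustrate the idea:
messages for clients holding a fresh keep-alive lease are pushed immediately,
everyone else's are queued until they poll.

```python
import time

# Toy push/poll hybrid (hypothetical design, not Twitter's real architecture).
LEASE_SECONDS = 20 * 60  # "please push for the next 20 minutes"

class Hub:
    def __init__(self):
        self.last_keepalive = {}  # client_id -> timestamp of last keep-alive
        self.queues = {}          # client_id -> messages held for polling
        self.pushed = {}          # client_id -> messages delivered by push

    def keep_alive(self, client_id):
        # Client renews its push lease.
        self.last_keepalive[client_id] = time.time()

    def publish(self, client_id, message):
        # Push if the lease is fresh, otherwise queue for a later poll.
        alive = time.time() - self.last_keepalive.get(client_id, 0) < LEASE_SECONDS
        if alive:
            self.pushed.setdefault(client_id, []).append(message)
        else:
            self.queues.setdefault(client_id, []).append(message)

    def poll(self, client_id):
        # Offline clients collect whatever accumulated while they were away.
        return self.queues.pop(client_id, [])

hub = Hub()
hub.keep_alive("alice")            # alice holds a push lease
hub.publish("alice", "hi alice")   # delivered immediately
hub.publish("bob", "hi bob")       # bob never sent a keep-alive: queued
print(hub.pushed["alice"])         # ['hi alice']
print(hub.poll("bob"))             # ['hi bob']
```

As the comment notes, polling can't be eliminated entirely in this scheme:
it is exactly what lets messages survive a client being offline past its
lease.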

