Google says there are now 10^12 unique URLs. I am proud to announce the 2nd trillion.

dhotson · on July 28, 2008

I worked on a web crawler once before.. you've got no idea how annoying these kinds of websites are to detect.

http://en.wikipedia.org/wiki/Spider_trap

.. although I'm sure they've got some smart people working on it at Google.

aristus · on July 28, 2008

You detect them like Unix detects a symlink loop: it punts on the problem and just errors after 8 symlink traversals.

The equivalent is a crawl depth limit, which could be a hard limit (dumb) or a function of page-rank (smart) and the trustworthiness of the inbound link (smarter) and also the data quality & diversity of the traversed pages (best).

There are very good reasons why Googlebot seems to hit you from one IP at a time -- it's a long-running thread that is making all sorts of decisions about your site as it crawls.

13ren · on July 28, 2008

Google has indexed 38 so far: http://www.google.com.au/search?q=site%3Aianab.com%2Ftrillio...

Tackling the data quality & diversity of the traversed pages (best):

Producing English text with the 40 bits, by driving a generative grammar or a markov/travesty generator, would make it harder for Google to detect that the pages are auto-generated. It's unlikely to infer the function f(URL) -> text (or even to attempt it), but would limit the recursion for the other reasons you mention.

(guessing) sites like hackernews are indexed primarily by recursion (few direct inbound links to specific stories).

aristus · on July 28, 2008

(guessing) sites like hackernews are indexed primarily by recursion (few direct inbound links to specific stories).

Correct. Notice that it is difficult to find old HN comments on Google, since after a while there are no short paths from the home page to them. In practice & all else being equal (quality, length, spamminess, speed, age, uniqueness, PR, etc), the maximum depth a page can afford to have is about 6 or 7.

anamax · on July 29, 2008

http://drunkmenworkhere.org/ did an interesting study of how some search engines handled infinite sites.

kleevr · on July 28, 2008

http://ianab.com/trillion/1099377140735.html

mynameishere · on July 28, 2008

http://ianab.com/trillion/1099510313975.html

LogicHoleFlaw · on July 28, 2008

http://ianab.com/trillion/370743989334.html

jsmcgd · on July 28, 2008

http://ianab.com/trillion/892949752285.html

parenthesis · on July 28, 2008

http://ianab.com/trillion/658144902297.html

whacked_new · on July 28, 2008

Hahaha, you could make a biginturl out of this.

mynameishere · on July 28, 2008

I was going to laborously spell out C-C-C-C-C-C-C-Combo breaker, but then I remembered that I only have one life to live...

snprbob86 · on July 28, 2008

I submitted http://ianab.com/trillion/0.html to http://www.google.com/addurl/ and http://search.msn.com.sg/docs/submit.aspx

Please log requests from the Google and Microsoft bots and let us know how long it takes the respective bots to figure out that every page is the same :-)

bdr · on July 28, 2008

http://pastebin.com/m54b7d354

Looks like it just hit a bunch of links from the first page.

snprbob86 · on July 28, 2008

So it looks like Google grabbed about 40 links before giving up? I wonder what a good "score" is? At first, I'd guess less is better, but too few might be running the risk of throwing out potentially good pages. Too many, and the bot is just wasting effort. The 40 score could vary as well based on parallel conditions assuming many bot instances are sharing a task pool. Be sure to post the Microsoft results if/when they crawl you.

brianr · on July 28, 2008

Looks like it hit all of the links on the first page (there are 8x5 boxes = 40) and didn't find anything interesting, so it didn't crawl any deeper. If the second-level pages had more interesting/unique content, I bet it would've kept going.

hhm · on July 28, 2008

Thanks a lot for sharing this. Very interesting indeed.

attack · on July 28, 2008

They actually did explicitly say that this measurement ignores URL's which are non-unique enough or of insignificant worth as best that they could determine. But ah well, I guess it's just for fun.

mrtron · on July 28, 2008

Hilariously clever, and very easy to create with something like Django's regex urls.

kleevr · on July 28, 2008

^trillion/(?P<number>[\d]{1,13}).html$

hhm · on July 28, 2008

Regex? What for? You only need to convert the number in the url to binary, and then 1s are whites, and 0s are blacks.

akrito · on July 28, 2008

So, if the number modulo 13 is 0, a taunt, "Maybe you could try drawing something," shows up. Any other easter eggs?

jrichmond · on July 28, 2008

http://ianab.com/trillion/1000000000000.html ("I think you win something... ") http://ianab.com/trillion/00000000000000.html (sort of an easter egg)

unalone · on July 28, 2008

If you go past a trillion, they give you a free bonus page.

jrockway · on July 28, 2008

I like the first message: "It's true: you can click on the boxes."

Maybe I've had too much sugar today, but that made me laugh out loud.

lpgauth · on July 28, 2008

I guess this goes to show how this metric is useless on it's own but if you compare with other previous years then you can understand how far the web as evolved.

unalone · on July 28, 2008

http://ianab.com/trillion/344940631126.html

Do all the URLs leave snarky messages?

eru · on July 28, 2008

drewcrawford · on July 28, 2008

Let's play pong http://ianab.com/trillion/547597778686.html

zevu · on July 28, 2008

Ok, funny, but Google has indexed only 40 of the pages. Where is the rest?! huh

Chocobean · on July 28, 2008

the red spiders that work the gears of google got bored after 40 i guess

ajkirwin · on July 28, 2008

http://ianab.com/trillion/840689566653.html

UFO :D