
The URLs don't need to be posted online. Some browsers (Chrome, possibly Firefox with Safe Browsing mode, very likely any browser with a Google Toolbar installed) send visited URLs to Google, and those URLs get indexed. I don't know if this is officially documented by Google, but several people have reported seeing this while testing new/beta websites that weren't published or linked anywhere.

Hi there, allow me to correct this misconception. I've debunked that idea often enough that I wrote a blog post about this four years ago: http://www.mattcutts.com/blog/toolbar-indexing-debunk-post/ I wrote an earlier debunk post in 2006 too: http://www.mattcutts.com/blog/debunking-toolbar-doesnt-lead-...

I noticed a new twist in your post though: you're saying that because of Safe Browsing (which checks for e.g. malware as users surf the web), those urls are sent to Google. The way that Chrome and Firefox actually do Safe Browsing is that they download an encrypted blob which allows the browser to do a lookup for dangerous urls on the client side--not by sending any urls to Google. I believe that if there's a match in the client-side encrypted table, only then does the browser send the now-suspect url to Google for checking.

Here's more info: https://developers.google.com/safe-browsing/ I believe the correct mental model of the Safe Browsing API in browsers is "Download a hash table of believed-to-be-dangerous urls. As you surf, check against that local hash table. If you find a match/collision, then the user might be about to land on a bad url, so check for more info at that point."
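That mental model can be sketched in a few lines of Python (a toy illustration only; the real protocol's canonicalization, hashing, and table format differ, and all URLs here are made up):

```python
import hashlib

def url_prefix(url, n=4):
    """First n bytes of the SHA-256 of the URL (real URL canonicalization omitted)."""
    return hashlib.sha256(url.encode("utf-8")).digest()[:n]

# Pretend this table of 4-byte hash prefixes was downloaded from Google.
local_prefixes = {url_prefix("http://malware.example/payload")}

def might_be_dangerous(url):
    """Client-side check: no URL leaves the machine unless this returns True."""
    return url_prefix(url) in local_prefixes

# Only on a local match would the browser ask the server about the URL.
print(might_be_dangerous("http://example.com/"))             # almost certainly False
print(might_be_dangerous("http://malware.example/payload"))  # True
```

The point of the prefix table is that a miss proves the URL is safe without any network traffic; only the rare (possibly false-positive) hit triggers a round trip.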

Hope that helps. Further down in the discussion, someone posted this helpful link with more explanation: http://blog.alexyakunin.com/2010/03/nice-bloom-filter-applic...

Sorry, but I don't believe you about Google Toolbar. I had a private page with no links in or out, and yet it appeared in Google search. It was not guessable and there was no chance of a referrer link. The page was never shared with friends nor accessed outside my own computers.

I only found out when a friend searched for his name and the page appeared, as it was my phone list.

Multiple people have run controlled experiments like I described in http://www.mattcutts.com/blog/debunking-toolbar-doesnt-lead-...

The most common way such "secret" pages get crawled is that someone visits the secret page with referrers enabled and then goes to another page. For example, are you 100% positive that every person who ever visited that page had referrers turned off on every single browser (including mobile phones) they used to access it?

Are you sure that it is the referrer headers? The PP clearly stated there were no outgoing links on the secret page. I think there's a much more mundane explanation: JavaScript downloaded from Google's CDN. People nowadays are so used to just plopping jQuery etc. into their web pages that they forget that this stuff has to come from somewhere. If it comes from Google, I'm quite certain their CDN loader phones home right before it gives up any of the good stuff.

EDIT: Confirmed, though I was wrong about one detail: there's no loader. Simply requesting jQuery from ajax.googleapis.com gives them a nice fresh Referer header pointing at your secret site for their spiders to crawl. Be mindful!
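The leak is easy to picture: the browser's request for the CDN-hosted script carries a Referer header naming the page that embedded it. A Python sketch (the request is only constructed, never sent, and the secret URL is invented):

```python
import urllib.request

secret_page = "https://secret.example/phone-list.html"

# A browser fetching an embedded script adds this header automatically;
# we set it by hand here just to show what the CDN would receive.
req = urllib.request.Request(
    "https://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js",
    headers={"Referer": secret_page},
)
print(req.get_header("Referer"))  # https://secret.example/phone-list.html
```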

I'm 100% sure. That page was for me and me alone. It was never accessed by anyone but me. I never shared the URL with anyone.

Referrers only get shared through links. There were no links to or from that page. Going to a page and typing in a new URL does not send a referrer.

an old meme, and my usual recommendation: just test it. create a page that is not linked from anywhere. visit it with the browsers mentioned above. watch the logfiles. wait for it. nope, no googlebot request. it is unbelievably easy to test, and i have done so on various occasions in the past, so there is no need for you to spread a "several people have reported" rumor. just ... test ... it.

as for the old stories, that google does this kind of thing: people, especially SEOs or people who think they know SEO, always blame google. oh, my beta.site has been indexed, it must be because of ... google is evil.

most of the times i have seen googlebot find a not-yet-published site, it was because of one of the following (just some examples, not a complete list):

* turned on error reporting (most of the PHP sites)
* the URLs were already used in some javascript
* server side analytics software, open to the public
* apache directory listing exposes the file structure
* indexable logfiles
* people linked to the site
* somebody tweeted about it
* site was covered on techcrunch (yes, really)
* all visited URLs in the network were tracked by a firewall, the firewall published a log on an internal server, and the internal server was reachable from the outside
* internal wiki is indexable
* intranet is indexable
* concept paper is indexable

your hypothesis "chrome/google toolbar/... push URLs into the googlebot discovery queue, which leads to googlebot visits" is easily testable. no need to spread rumors. setup for testing this: make an html page (30 seconds max, basically ssh to your server, create a file, write some html), tail & grep the logfiles (30 sec max), wait (forever)
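the grep step above can be sketched in Python (a toy example; the log lines are made up and assume common log format with a user-agent field):

```python
def googlebot_hits(log_lines):
    """Return the access-log lines whose user-agent mentions Googlebot."""
    return [line for line in log_lines if "Googlebot" in line]

sample_log = [
    '1.2.3.4 - - [12/Jun/2012] "GET /secret.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '66.249.66.1 - - [12/Jun/2012] "GET /robots.txt HTTP/1.1" 200 24 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]
print(googlebot_hits(sample_log))  # only the second line matches
```

if the hypothesis were true, the secret page you just visited would show up in this filtered output within days; in practice it doesn't.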

It is a myth that is hard to get rid of. No one wants to admit they tweeted out a link to the dev website.

Though I recently found this on the Google+ FAQ: http://support.google.com/webmasters/bin/answer.py?hl=en&...

  When you add the +1 button to a page, Google assumes that
  you want that page to be publicly available and visible in
  Google Search results. As a result, we may fetch and show
  that page even if it is disallowed in robots.txt.
I can understand adding a +1 button to a dev site, and then not understanding why it shows up in the index.

Don't forget people who may have:

* installed UserScripts / GreaseMonkey scripts
* browser plugins other than Google Toolbar which may send stuff to the big G
* (self-)modded browsers which send out stuff to wherever

...the list goes on and on indeed.

Best thing to do to keep a site secret:

* Don't host it on the internet (d'uh)
* Hide it behind a portal page, and have that page and your server weed out misconfigured / hijacked browsers before any can proceed to your real secret site (also see web cloaking).

I'm not sure either, but I doubt that Chrome or any of the badware-stopping features that are built in to it cause the URLs they're checking to be indexed. I'd be even more surprised if Firefox did this.

If you've got the toolbar installed though, I'd be less surprised if they tried crawling or indexing URLs you go to.

EDIT: It looks like they've explicitly said the toolbar does not cause things to appear in search results: http://www.seroundtable.com/google-toolbar-indexing-12894.ht....

At least for malware detection, Chrome first uses a Bloom filter locally to decide whether a URL is likely to be malicious before making any remote calls. Only if that check comes back positive does it submit the URL to Google for more precise verification.

1. http://blog.alexyakunin.com/2010/03/nice-bloom-filter-applic...
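A minimal Bloom filter in the spirit of that post (a toy sketch, not Chrome's actual implementation; the size and hash count are chosen arbitrarily):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0  # the whole filter is one big integer used as a bit array

    def _positions(self, item):
        # Derive k bit positions by hashing the item with k different salts.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def probably_contains(self, item):
        # False means definitely absent; True means "maybe present", so only
        # a True result would trigger the remote verification call.
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bad_urls = BloomFilter()
bad_urls.add("http://malware.example/")
print(bad_urls.probably_contains("http://malware.example/"))  # True
```

The appeal for this use case is that the filter is tiny and answers "definitely clean" for almost every URL without any network round trip, at the cost of occasional false positives that cause a harmless extra check.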

> EDIT: It looks like they've explicitly said the toolbar does not cause things to appear in search results

I read this too after posting, but I'm skeptical. It wouldn't be the first time they claimed to not do things they later admitted doing ... The rationale being that search engines need a way to discover new URLs quickly and keep ahead of the competition (indexing speed and breadth).

I'd also like to know what exactly Google Desktop Search does with URLs it finds.

You could make a good bit of easy money if you can prove your suspicions. But since you haven't...

Google indexes URLs despite measures such as robots.txt when these URLs are discovered by Google software including Chrome and their Toolbar.

Robots.txt is about fetching content; it has nothing to do with indexing URLs, or with anything found in content at locations robots.txt doesn't restrict.

Safe Browsing in Firefox is implemented client-side; it doesn't share URLs with Google.

When you like or share a post in your newsfeed, you're sending a linkback to the original post.

So, if your newsfeed is public "to everyone", Google is able to crawl and index the content on it (regardless of the original post's privacy settings).

Whether Google sends the URLs to itself can easily be determined with an HTTP monitoring tool like Fiddler; with a hosts filter you can narrow the traffic down to google.com.

Leave it running for a few days and you will see for yourself.

Wouldn't a proper robots.txt rule fix this?

A robots.txt file disallowing crawling on the sites that display the contents of user email would help fix this.

However, as some of the discussion below points out, I don't believe that disallowing crawling of these URLs in our robots.txt would keep them from the index if a search engine finds reference to them elsewhere; I think it simply keeps them from being crawled.
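To make that distinction concrete, a disallow rule like the following (the path is hypothetical) stops compliant crawlers from fetching the pages, but a URL discovered elsewhere can still appear in results as a bare link:

```
User-agent: *
Disallow: /mail/
```

Actually keeping a page out of the index requires a noindex signal (a robots meta tag or an X-Robots-Tag response header), and the crawler has to be allowed to fetch the page in order to see that signal.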

(Regardless of whether one has a Facebook account or not) If your theory is correct, this seems like a good reason to not use Chrome or any browser with a Google Toolbar :)

Another good reason.
