I noticed a new twist in your post though: you're saying that because of Safe Browsing (which checks for e.g. malware as users surf the web), those urls are sent to Google. The way that Chrome and Firefox actually do Safe Browsing is that they download an encrypted blob which allows the browser to do a lookup for dangerous urls on the client side--not by sending any urls to Google. I believe that if there's a match in the client-side encrypted table, only then does the browser send the now-suspect url to Google for checking.
Here's more info: https://developers.google.com/safe-browsing/
I believe the correct mental model of the Safe Browsing API in browsers is: "Download a hash table of believed-to-be-dangerous urls. As you surf, check against that local hash table. If you find a match/collision, then the user might be about to land on a bad url, so check for more info at that point."
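To make that mental model concrete, here's a minimal sketch of the client-side lookup in Python, assuming the local table is just a set of short SHA-256 hash prefixes (the real protocol also canonicalizes URLs and checks several host/path combinations per URL, so treat this as illustration only):

    import hashlib

    # Local table of 4-byte hash prefixes, shipped to the browser ahead of time.
    local_prefixes = {
        hashlib.sha256(b"malware.example.com/bad.html").digest()[:4],
    }

    def might_be_dangerous(url: str) -> bool:
        # Purely local check: nothing leaves the machine unless this returns True.
        prefix = hashlib.sha256(url.encode()).digest()[:4]
        return prefix in local_prefixes

    url = "malware.example.com/bad.html"
    if might_be_dangerous(url):
        # Only on a prefix hit would the browser ask Google for the full hashes
        # starting with that prefix, to confirm or clear the suspected match.
        print("prefix hit - ask the Safe Browsing service about this one")
    else:
        print("no local match - nothing is sent anywhere")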
Hope that helps. Further down in the discussion, someone posted this helpful link with more explanation: http://blog.alexyakunin.com/2010/03/nice-bloom-filter-applic...
I only found out when a friend searched for his name and the page showed up in the results, since it was my phone list.
The most common way such "secret" pages get crawled is that someone visits the secret page with referrers on and then goes on to another page. For example, are you 100% positive that every person who ever visited that page had referrers turned off in every single browser (including on mobile phones) they used to access that page?
EDIT: Confirmed, though I was wrong in that there's no loader: requesting jQuery from ajax.googleapis.com gives them a nice fresh Referer header pointing at your secret site for their spiders to crawl. Be mindful!
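If you want to audit your own "secret" page for that kind of leak, here's a rough sketch that scans its HTML for resources fetched from other hosts on page load; the host name and file name below are made-up placeholders:

    from html.parser import HTMLParser
    from urllib.parse import urlparse

    MY_HOST = "secret.example.com"  # assumption: your own host

    class ThirdPartyFinder(HTMLParser):
        # Each resource loaded from another host receives your page's URL
        # in the Referer header of that request.
        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag in ("script", "img", "iframe"):
                url = attrs.get("src")
            elif tag == "link":
                url = attrs.get("href")
            else:
                return
            if url:
                host = urlparse(url).netloc
                if host and host != MY_HOST:
                    print(f"<{tag}> loads {url} -> Referer leaks to {host}")

    with open("secret_page.html") as f:  # assumption: a saved copy of the page
        ThirdPartyFinder().feed(f.read())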
Referrers only get shared through links. There were no links to or from that page. Going to a page by typing in a new URL does not provide a referrer.
as for the old stories that google does this kind of thing: people, especially SEOs or people who think they know SEO, always blame google. oh, my beta.site has been indexed, it must be because of ... google is evil.
most of the time, when i have seen googlebot find a not-yet-published site, it was because of things like (just some examples, not a complete list):
* turned on error reporting (most of the PHP sites)
* server side analytics software, open to the public
* apache shows the file/folder structure (directory listing)
* indexable logfiles
* people linked to the site
* somebody tweeted about it
* site was covered on techcrunch (yes, really)
* all visited URLs in the network were tracked by a firewall, the firewall published a log on an internal server, the internal server was reachable from the outside
* internal wiki is indexable
* intranet is indexable
* concept paper is indexable
your hypothesis "chrome/google toolbar/... push URLs into the googlebot discovery queue, which leads to googlebot visits" is easy to test. no need to spread rumors. setup for testing this: make an html page (30 seconds max, basically ssh to your server, create a file, write some html), tail & grep logfiles (30 sec max), wait (forever)
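for the "tail & grep" step, a rough sketch of the log check (the secret path and log location are assumptions, adjust for your server and log format):

    import re

    SECRET_PATH = "/only-my-browser-knows-this.html"  # assumption: the unlinked test page
    LOG_FILE = "/var/log/apache2/access.log"          # assumption: combined log format

    with open(LOG_FILE) as log:
        for line in log:
            if SECRET_PATH in line and re.search(r"googlebot|bingbot", line, re.I):
                print("crawler hit:", line.strip())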
Though I recently found this on the Google+ FAQ:
    When you add the +1 button to a page, Google assumes that you want that page to be publicly available and visible in Google Search results. As a result, we may fetch and show that page even if it is disallowed in robots.txt.
Best things to do to keep a site secret:
* Don't host it on the internet (d'uh)
* Hide it behind a portal page and have that page and your server weed out misconfigured / hijacked browsers before anyone can proceed to your real secret site (also see web cloaking).
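One simple way to build that kind of gate (not from the thread, just a common approach) is to put HTTP Basic Auth in front of everything, so neither crawlers nor a leaky toolbar/extension ever sees the real pages. A minimal sketch with Python's standard library; credentials and port are placeholders:

    import base64
    from http.server import BaseHTTPRequestHandler, HTTPServer

    EXPECTED = "Basic " + base64.b64encode(b"staging:s3cret").decode()

    class GatedHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.headers.get("Authorization") != EXPECTED:
                self.send_response(401)
                self.send_header("WWW-Authenticate", 'Basic realm="staging"')
                self.end_headers()
                return
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<h1>The not-yet-public site</h1>")

    HTTPServer(("", 8080), GatedHandler).serve_forever()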
If you've got the toolbar installed though, I'd be less surprised if they tried crawling or indexing URLs you go to.
EDIT: It looks like they've explicitly said the toolbar does not cause things to appear in search results: http://www.seroundtable.com/google-toolbar-indexing-12894.ht....
I read this too after posting, but I'm skeptical. It wouldn't be the first time they claimed to not do things they later admitted doing ... The rationale being that search engines need a way to discover new URLs quickly and keep ahead of the competition (indexing speed and breadth).
I'd also like to know what exactly Google Desktop Search does with URLs it finds.
So, if your newsfeed is public "to everyone", Google is able to crawl and index the content on it (regardless of the original posts' privacy settings).
Leave it running for a few days and you will see for yourself.
However, as some of the discussion below points out, I don't believe that disallowing crawling of these URLs in our robots.txt would keep them from the index if a search engine finds reference to them elsewhere; I think it simply keeps them from being crawled.
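A tiny illustration of that distinction, using Python's standard robots.txt parser (the paths here are made up): Disallow only answers "may I fetch this URL", so a blocked URL can still show up in results based on links and anchor text alone; keeping a page out of the index means letting it be crawled and serving a noindex signal (a robots meta tag or an X-Robots-Tag: noindex response header) instead.

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.parse([
        "User-agent: *",
        "Disallow: /internal/",
    ])

    # Googlebot may not fetch the page...
    print(rp.can_fetch("Googlebot", "https://example.com/internal/report.html"))  # False
    # ...but nothing in robots.txt stops the bare URL from being indexed
    # if someone links to it from elsewhere.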