
Google search indexes itself - franze
https://www.google.com/search?q=site%3Ahttp%3A%2F%2Fwww.google.com%2F%2Fsearch%3Fq%3Dproranktracker.com%2B%2B%2BHostgator%2BCoupon%2BCode%3ACOUPON333&pws=0&hl=en#pws=0&hl=en&q=site:http:%2F%2Fwww.google.com%2F%2Fsearch
======
roland-s
Google's robots.txt
[http://www.google.com/robots.txt](http://www.google.com/robots.txt) disallows
/search but not //search.

However, if you search
site:[http://www.google.com/search](http://www.google.com/search) and show
omitted search results, you get a bunch of results (all 404s).

If you do this there are some strange results on the last couple pages.

For example: Obama won't salute the flag | Phallectomy | horse+mating+video |
feral+horses+induced+abortion | Lactating+dog+images | animal+mating+video |
mating+mpg+-beastiality+-...

So, Half Life 3 confirmed.

~~~
marceldegraaf
I thought you were joking about those search keywords, but indeed:
[http://i.marceldegraaf.net/sitehttpwww.google.comsearch_-_Go...](http://i.marceldegraaf.net/sitehttpwww.google.comsearch_-_Google_zoeken_2014-09-10_20-48-06.png) (screenshot)

~~~
user24
ODF files! Those sick sick people.

------
daveloyall
A better example url is
[https://www.google.com/search?q=site%3Ahttp%3A%2F%2Fwww.goog...](https://www.google.com/search?q=site%3Ahttp%3A%2F%2Fwww.google.com%2F%2Fsearch%3Fq%3D)

Note that switching //search to /search eliminates the phenomenon.

Note too that all the results on page 1 and page 10 are related to hostgator
and coupon codes. I expect that there is some site which contains some text or
links that cause these results.

Note also that the `site:` search operator isn't supposed to include anything
but a domain or subdomain: neither `http://` nor /search should be
included.

Finally, note that the results are actually google search pages, though! So I
do think this is some kind of bug.

But NOT an instance of Google indexing its result pages. Please change the
title to 'This one weird google bug will make you scratch your head!' :)

Edit: andybalholm suggests (on this page) that the double slash is in fact
causing the googlebot to visit those search results pages and indeed index
them. Hm, sounds true.

Has anybody visited the spamfodder pages and found instances of malformed yet
operative links to google search? (I don't feel like visiting those sites on
this machine on this network.)

~~~
Buge
>But NOT an instance of Google indexing its result pages.

That's what it looks like to me. Could you explain the difference?

~~~
daveloyall
I changed my tune at some point via seeing comments here. I posted a comment
to that effect.

In hindsight, your comment alone would have changed my tune: nope, I can't
explain the difference between a page appearing in search results and a page
being indexed. Thanks for the illumination. :)

------
kentonv
This demonstrates the dangers of loose path resolution rules.

Traditionally, consecutive slashes in a path name are treated as equivalent to
a single slash, presumably to simplify apps that need to join two path
fragments -- they can safely just concatenate rather than call a library
function like path.join().

Unfortunately, this makes it much harder to write code that blacklists certain
paths, as robots.txt is designed to do. Clearly, Google's implementation of
robots.txt filtering does not canonicalize double-slashes, and so it thinks
//search is different from /search and only /search is blacklisted.

My wacky opinion: Path strings are an abomination. We should be passing around
string lists, e.g. ["foo", "bar", "baz"] instead of "foo/bar/baz". You can use
something like slashes as an easy way for users to input a path, but the
parsing should happen at point of input, and then all code beyond that should
be dealing with lists of strings. Then a lot of these bugs kind of go away,
and a lot of path manipulation code becomes much easier to write.

~~~
wglb
_We should be passing around string lists, e.g. [ "foo", "bar", "baz"] instead
of "foo/bar/baz"._

But that doesn't in and of itself solve the problem, because "foo/bar//baz"
would map to ["foo", "bar", "", "baz"] without any additional convention.

This is actually not that unusual. This site does not treat two consecutive
slashes as a single slash. There are likely other implementation differences.

Certainly in POSIX consecutive slashes count as one for file paths, but URL
query strings are not file paths.
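
To make the empty-segment point concrete, here is a small sketch (the `parse_path` helper is hypothetical, and dropping empty segments is just one possible convention, not a standard):

```python
def parse_path(raw: str, drop_empty: bool = False) -> list[str]:
    """Split a slash-separated path into segments at the point of input."""
    segments = raw.split("/")
    if drop_empty:
        # One possible convention: empty segments carry no meaning.
        segments = [segment for segment in segments if segment]
    return segments


print(parse_path("foo/bar//baz"))                   # ['foo', 'bar', '', 'baz']
print(parse_path("foo/bar//baz", drop_empty=True))  # ['foo', 'bar', 'baz']
```

Either way, a list of strings only removes the ambiguity once everyone agrees on what an empty segment means.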

~~~
daveloyall
_... "foo/bar//baz" would map to ["foo", "bar", "", "baz"] ..._

No, I think it'd be more like proto://host/thing?foo&bar&baz (put an =1 on
each of those if you like).

Yeah, I'm employing a convention, but so too is the concept of _list of
strings_ that the commenter invoked.

------
szaroubi
Funny thing, Google indexes itself, indexing itself, indexing others .... All
results lead to google search, which lead to google search results ...

[https://www.google.ca/search?q=site%3Ahttp%3A%2F%2Fwww.googl...](https://www.google.ca/search?q=site%3Ahttp%3A%2F%2Fwww.google.com%2F%2Fsearch+inurl%3A%22q%3Dsite%3Agoogle.com%22&oq=site%3Ahttp%3A%2F%2Fwww.google.com%2F%2Fsearch+inurl%3A%22q%3Dsite%3Agoogle.com%22&aqs=chrome..69i57j69i58.807j0j7&sourceid=chrome&es_sm=91&ie=UTF-8)

~~~
TallboyOne
We must go deeper

~~~
imrehg
[http://inception.davepedu.com/](http://inception.davepedu.com/)

------
franze
hi OP here, i did not consider this to go front-page, just thought it was a
funny meta bug.

and no, it's not clickbait and i'm not affiliated with hostgator or any of
that other crap.

a few strange points i would like to point out:

the indexed result pages are http:// not https:// -
but to my knowledge google forces https:// everywhere.

the double slash issue is probably the reason why googlebot does indeed index
this. robots.txt is a shitty protocol, i once tried to understand it in detail
and coded
[https://www.npmjs.org/package/robotstxt](https://www.npmjs.org/package/robotstxt)
and yes, there are a shitload of cases you just can't cover with a sane
robots.txt file.

as there are no [https://www.google.com/search](https://www.google.com/search)
(with "s" like secure) URLs indexed, google(bot) probably has some failsafes to
not index itself, but the old http:// URLs somehow slipped through.

but now lets go meta: consider the implications! the day google indexes itself
is the day google becomes self aware. google is a big machine trying to
understand the internet. now it's indexing itself, trying to understand itself
- and it will succeed. the "build more data center algorithms" will kick in as
google - which basically indexed the whole internet - is now indexing itself
recursively! the "hire more engineers to figure how to deal with all this
data" algorithm will kick in (yeah, recursively every developer will become a
google dev, free vegan food!), too.

i think it's awesome.

by the way, a few years ago somebody wrote a similar story
[http://www.wattpad.com/3697657-google-ai-what-if-google-beca...](http://www.wattpad.com/3697657-google-ai-what-if-google-became-self-aware)
funnily enough the date for self awareness is "December 7, 2014, at 05:47
a.m" [update: oops, sorry seems to be the wrong story, but i'm sure the "google
indexes itself becomes self aware" short story is out there, but i just can't
find it right now ... strange coincidence?]

~~~
agwa
> the indexed result pages are [http://](http://) not [https://](https://) \-
> but to my knowledge google forces [https://](https://) everywhere.

Google only forces HTTPS for certain User-Agent strings. I just tried fetching
[http://www.google.com](http://www.google.com) with the Googlebot User-Agent
string and Google did not redirect to HTTPS.
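
A rough sketch of how such a check could be reproduced, assuming Python's standard `http.client` (which does not follow redirects on its own). The `fetch_status` helper is a hypothetical name, and the result depends entirely on how the server behaves when you run it:

```python
import http.client

# Googlebot's published User-Agent string.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def fetch_status(host: str, path: str = "/", user_agent: str = GOOGLEBOT_UA):
    """Issue one plain-HTTP GET and return (status, Location header or None).

    http.client does not follow redirects, so a 301/302 to HTTPS stays visible.
    """
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("GET", path, headers={"User-Agent": user_agent})
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

Calling `fetch_status("www.google.com")` once with the default and once with a browser-like `user_agent` would show whether the two are treated differently; a 3xx status with an `https://` Location indicates a forced redirect.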

------
afro88
It's a bug in the indexing system, exploited by hostgator for (I'm guessing)
SEO purposes. There are other people doing the same thing, and they're all
spammy (viagra sales etc.)

I reckon this will be fixed in a matter of days, judging by how quickly the
latin lorem ipsum google translate thing was sorted out.

~~~
thefreeman
And fixed.

------
johnmu
(I work with the search team at Google) This was a bug on our side, and should
be resolved now.

~~~
barrystaes
Does this explain why Google search results have degraded the last 6 months? I
am not trolling -seriously- for me googling first is hardly worthwhile
nowadays. A user from the Netherlands. If there was a way to still use the
2009 search index, i would!!

~~~
kuschku
There actually is a way to use the pre-2012 search index!

Just use [http://www.google.com/custom](http://www.google.com/custom) I use
either DuckDuckGo or this site all the time, I'd probably switch to DuckDuckGo
completely if this search would go down.

~~~
maaarghk
Lovely but doesn't use an old index. Just searched for the name of an album
released in 2013. Usual results.

------
dzhiurgis
And web archive indexes its internal IP addresses and a... live printer:
[http://web.archive.org/web/*/http://printer](http://web.archive.org/web/*/http://printer)

~~~
cpqq
Which from the snapshot, shows an IP that's... still online:
[http://208.70.27.164/hp/device/this.LCDispatcher](http://208.70.27.164/hp/device/this.LCDispatcher)

~~~
dzhiurgis
Yep. A whois confirms it's their IP address.

Which is not wrong on its own, as long as it's protected by a good password
and doesn't fall to the likes of thc-hydra.

They also had some ancient snapshots from 192.xxx range

------
ParkerK
They seem to be indexing Bing too ;)
[https://www.google.com/search?num=20&pws=0&hl=en&q=site%3Aht...](https://www.google.com/search?num=20&pws=0&hl=en&q=site%3Ahttp%3A%2F%2Fwww.bing.com%2F%2Fsearch)

~~~
barrystaes
They fixed the OP issue by now, but this still works..

------
spyder
Looks like it got fixed, because I cannot see any results.

------
Igglyboo
All of the results are HostGator coupons, anyone else seeing the same?

~~~
chm
Yes. Look at the query:

    
    
        search?q=site%3Ahttp%3A%2F%2Fwww.google.com 
        %2F%2Fsearch%3Fq%3Dproranktracker.com%2B%2B 
        %2BHostgator%2BCoupon%2BCode%3ACOUPON333&pws=0& 
        hl=en#pws=0&hl=en&q=site:http:%2F%2Fwww.google.com 
        %2F%2Fsearch
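
The nesting is easier to see once the query is percent-decoded. A short sketch using Python's standard `urllib.parse` (the variable names are illustrative, and the `#...` fragment of the link is left out):

```python
from urllib.parse import parse_qs, urlsplit

# Query string from the submitted link (fragment after '#' left out).
outer_query = (
    "q=site%3Ahttp%3A%2F%2Fwww.google.com%2F%2Fsearch%3Fq%3Dproranktracker.com"
    "%2B%2B%2BHostgator%2BCoupon%2BCode%3ACOUPON333&pws=0&hl=en"
)

# First decode: what the outer search was actually asking for.
outer_q = parse_qs(outer_query)["q"][0]
print(outer_q)

# The site: operand is itself a Google search URL; pull it apart.
inner_url = urlsplit(outer_q[len("site:"):])
print(inner_url.path)

# Second decode: the search embedded inside the indexed URL.
inner_q = parse_qs(inner_url.query)["q"][0]
print(inner_q)
```

So the submitted link is a `site:` search whose operand is an indexed `//search` URL, and that URL carries the coupon keywords in its own `q` parameter.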

~~~
freehunter
Even if you take the search string
site:[http://www.google.com//search](http://www.google.com//search) and put it
into a fresh Google search, it only returns HostGator coupons. Maybe someone
from Google can explain it.

~~~
nathanm412
add -hostgator to the search query and you'll find best-seller-watches.com
dominating the list. Add that one to your query and things get really strange.

[https://www.google.com/webhp?gws_rd=ssl#safe=off&q=site:goog...](https://www.google.com/webhp?gws_rd=ssl#safe=off&q=site:google.com%2F%2Fsearch+-hostgator+-best-seller-watches)

------
underdown
Google also caches itself:
[http://webcache.googleusercontent.com/search?q=cache:T1NRLL-...](http://webcache.googleusercontent.com/search?q=cache:T1NRLL-CwwsJ:www.google.com//search%3Fq%3Djoygame.com%2B%2B%2BHostgator%2BCoupon%2BCode:COUPON3333+&cd=451&hl=en&ct=clnk&gl=us)

~~~
wooptoo
It's for good measure, in case it's down.

------
expose
"Your search -
site:[http://www.google.com//search](http://www.google.com//search) - did not
match any documents."

------
kazinator
The goal is to have an explicit Google search result which expresses the
equivalent of "this Google search cannot be found via Google".

This will help construct a proof of Gödel's Incompleteness Theorem.

Without being able to find _anything_ in Google, including Google searches,
and including _that_ search for Google searches itself, Google is not a
completely powerful search engine; however, it cannot be complete and
consistent at the same time. There are searches which cannot be shown to be
conclusively either in the index, or not in the index.

~~~
nes350
Made my day!

------
bictorman
I wonder if it's somehow possible to exploit this to pass pagerank from
google.com to your own website. Or if there's even people already doing it.

~~~
pbhjpbhj
Well, let's look at the results - coupons, watches, ... - yup some blackhat
SEO is probably cursing whoever publicised this issue.

------
giancarlostoro
I think it might not be that they "index themselves" but they index links to
google that others post on forums, it's common for people to link to "lmgtfy"
so they probably index those links too. I don't see google "googling" on
itself while indexing its own searches. Unless Skynet.

------
lubujackson
Results for [http://www.google.com///search](http://www.google.com///search)
as well.

But not [http://www.google.com////search](http://www.google.com////search)
because that's just crazy, come on.

------
GFischer
Very strange:

[https://www.google.com/search?q=site:http://www.google.com/s...](https://www.google.com/search?q=site:http://www.google.com/search&pws=0&hl=en&start=10&filter=0)

I got some searches like:

www.google.com/search@q=tetris+sorry+henk

[https://www.google.com/search=pupuk+cair+alami](https://www.google.com/search=pupuk+cair+alami)

www.google.com/search&q=strobe+trigger+schematic

www.google.com/search@q=transvestites+used+in+rituals (!!!!)

Edit: roland-s found it first :) , and yes, the last pages of results are
pretty weird.

[https://news.ycombinator.com/item?id=8298239](https://news.ycombinator.com/item?id=8298239)

------
ushi
Funny thing... It works only with[0]

    
    
        site:http://www.google.com//search
    

but not with[1]

    
    
        site:http://www.google.com/search
    

[0]
[https://www.google.com/search?q=site%3Ahttp%3A%2F%2Fwww.goog...](https://www.google.com/search?q=site%3Ahttp%3A%2F%2Fwww.google.com%2F%2Fsearch)

[1]
[https://www.google.com/search?q=site%3Ahttp%3A%2F%2Fwww.goog...](https://www.google.com/search?q=site%3Ahttp%3A%2F%2Fwww.google.com%2Fsearch)

------
TuxLyn
Try it like this. site:www.google.com (About 34,000,000 results) or
site:[http://www.google.com](http://www.google.com) inurl:search (About
185,000 results)

------
ankit84
10,800 results for
"site:[http://www.google.com///search](http://www.google.com///search)"

------
Igglyboo
Seems like they could easily fix this with robots.txt or something similar, I
really doubt it's an oversight on their part either.

Any ideas why they're doing this?

~~~
andybalholm
I assume that some site has hostgator-related links with two slashes instead
of one. Due to the two slashes, the GoogleBot doesn't realize that it's
indexing its own results pages.

------
CompuHacker
It works with

    
    
      site:http://www.google.com/search
    

, but all the results are considered duplicates and omitted. Hit the button.

~~~
daveloyall
Nice! These results must represent all the hrefs people have posted that point
to google search...

------
shangxiao
Just checked this link again... It appears that Google has fixed the //search
issue as it returns no results now.

~~~
charonn0
Thank you. I was wondering what all the fuss was.

------
tehwebguy
What the fuck [http://i.imgur.com/Zf2CJzS.jpg](http://i.imgur.com/Zf2CJzS.jpg)

------
ChuckMcM
Fun. I expect a cheeky onebox to come out of this at some point along the
lines of the recursion search.

------
hellohellokitty
why the heck did google ever make it possible to hit the search url with more than
one slash there...

~~~
hellohellokitty
It's interesting: if you add a slash to this page's URL, the result is different.

[https://news.ycombinator.com//item?id=8297241](https://news.ycombinator.com//item?id=8297241)

Whereas in all the other cases I tested, it isn't.

Is this server-specific behavior, or is it configurable?

[http://url.spec.whatwg.org//#concept-url-path](http://url.spec.whatwg.org//#concept-url-path)
[http://www.nytimes.com///pages//politics//index.html](http://www.nytimes.com///pages//politics//index.html)
[http://www.bing.com////search?q=site%3Abing.com%2Fsearch%3Fq...](http://www.bing.com////search?q=site%3Abing.com%2Fsearch%3Fq%3D)
[https://www.cloudflare.com///index](https://www.cloudflare.com///index)

~~~
squeaky-clean
Many frameworks allow you to route URLs to actions instead of mapping to a
file. I just tested it in one of my Symfony projects, and I was able to route
/login and //login to two separate controllers.

Furthermore, it's pretty common to rewrite URLs, doing things like
adding/removing trailing slashes, whatever. So it wouldn't be too difficult to
have it condense multiple slashes into just one.

For example, this link works fine:
google.com//////////////////////////////////search?q=foobar

Google search tries to cover a lot of typos or be pretty user-friendly for
people who don't understand tech. I wouldn't be surprised if there's a grandma
out there who thinks [http://google.com//search](http://google.com//search) is
the correct method.
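
Both behaviors can be sketched with a toy router. This is not Symfony's API; the `ROUTES` table, the controller names, and both functions are hypothetical and only illustrate the two policies:

```python
import re

# Hypothetical route table: exact-match routing sees these as two different keys.
ROUTES = {
    "/login": "login_controller",
    "//login": "other_controller",
}


def dispatch_exact(path: str) -> str:
    """Route on the raw path, so /login and //login can go to different places."""
    return ROUTES.get(path, "not_found")


def dispatch_with_redirect(path: str):
    """Redirect any path with repeated slashes to its single-slash form first."""
    canonical = re.sub(r"/{2,}", "/", path)
    if canonical != path:
        return ("redirect", canonical)
    return ("handle", ROUTES.get(path, "not_found"))


print(dispatch_exact("//login"))
print(dispatch_with_redirect("//////search"))
```

With the redirecting variant, only the single-slash URL ever serves content, which gives crawlers and blacklists one canonical path to reason about.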

------
olalonde
Isn't the head of web spam at Google a HNer (Matt I think?)?

~~~
sejje
I believe Matt Cutts went on sabbatical:
[https://www.mattcutts.com/blog/on-leave/](https://www.mattcutts.com/blog/on-leave/)

------
fleitz
But does it index the results of the search of the index?

------
digz
Now google will index searches of its own searches.

------
carbonr
Ouroboros

------
antino
So meta.

------
bcRIPster
Can someone nuke the link on this post? It's clearly clickbait and we're just
driving traffic into it. :(

~~~
blueflow
You should mention the arbitrary data in the query section; it's not visible at
first glance.

~~~
bcoates
That's an artifact of google's weird link stuffing, if you search
'site:[http://www.google.com//search](http://www.google.com//search)' by hand
it still works

~~~
namuol
Perhaps this "works" because all the pagerank stuff has been altered by all
the sudden traffic related to hostgator coupons.

------
tzaman
Googleception! (sorry for the useless comment, but I had to)

~~~
thanatropism
Eigengoogle.

------
lern_too_spel
Does nobody here understand robots.txt? It's pretty easy to figure out what's
going on if you do. I assumed most users here work with web technologies, but
maybe the readership doesn't skew that way as much as I thought.

