The most mysterious Google ranking ever... (jamespanderson.tumblr.com)
149 points by ry0ohki on Jan 14, 2011 | 50 comments



It's a super rare HTTP Response Code issue:

STEP 1: CakeCentral.com returns a 404 "Not Found" HTTP status code when requested:

1. Go to http://www.rexswain.com/httpview.html and enter in http://cakecentral.com/

2. Take a look at the response codes, see the 404

STEP 2: Previously, inexplicably, _actual_ error pages on CakeCentral.com such as: http://www.google.com/search?q=site:www.cakecentral.com%2Fca... returned 302 redirects to Beerby.com

STEP 3: Beerby.com uses a "Soft" error page, meaning if you type in a URL like: http://www.beerby.com/adfadi you get a 302 TEMPORARY redirect to a 200 OK page.
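
The three steps above can be checked by hand with a few lines of code. What follows is only a rough sketch, assuming nothing beyond the Python standard library; the helper names `fetch_hop`, `trace`, and `looks_like_soft_404` are made up for illustration and are not part of any tool mentioned in this thread.

```python
import http.client
from urllib.parse import urlsplit, urljoin

REDIRECTS = {301, 302, 303, 307, 308}

def fetch_hop(url, timeout=10):
    """Issue one GET without following redirects; return (status, Location or None)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()

def trace(url, fetch=fetch_hop, max_hops=5):
    """Follow redirects manually and return the list of (url, status) hops."""
    hops = []
    for _ in range(max_hops):
        status, location = fetch(url)
        hops.append((url, status))
        if status not in REDIRECTS or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return hops

def looks_like_soft_404(hops):
    """True if a URL that should not exist redirects and finally lands on a 200."""
    statuses = [status for _, status in hops]
    return len(statuses) > 1 and statuses[0] in REDIRECTS and statuses[-1] == 200
```

Calling `trace()` on a deliberately bogus URL and passing the result to `looks_like_soft_404()` would flag the "soft" error page behaviour described in STEP 3; a plain `[(url, 404)]` result corresponds to STEP 1.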


I pinged the indexing team at Google, but this is almost certainly something weird going on with cakecentral.com's webhost:

  wget http://cakecentral.com/
  --2011-01-14 09:12:50-- http://cakecentral.com/
  Resolving cakecentral.com... 174.129.211.41
  Connecting to cakecentral.com|174.129.211.41|:80... connected.
  HTTP request sent, awaiting response... 404 Not Found
  2011-01-14 09:13:01 ERROR 404: Not Found.

Once the root page of a site starts returning 404s, we have to start taking guesses about the best way to handle it. Best advice for Cake Central: make sure your root page returns a valid HTML page with a 200 response code.

P.S. If the webhost is trying to do something sneaky, e.g. things work for browsers, but wget or Googlebot is treated differently somehow, the owner of Cake Central can use our free "Fetch as Googlebot" feature in our webmaster console to help diagnose the problem.
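
A crude way to look for this kind of differential treatment from the outside is to request the same URL with different User-Agent strings and compare the status codes. This is only a sketch: the function names and the browser/wget User-Agent strings are illustrative assumptions, and a host that keys on the crawler's IP address rather than its User-Agent will not be caught this way, which is where "Fetch as Googlebot" is the better tool.

```python
import urllib.request
import urllib.error

# Illustrative User-Agent strings; the browser and wget values are assumptions.
USER_AGENTS = {
    "browser": "Mozilla/5.0 (X11; Linux x86_64) Firefox/3.6",
    "wget": "Wget/1.12 (linux-gnu)",
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
}

def status_for(url, user_agent, timeout=10):
    """Return the final HTTP status for url (urlopen follows redirects)."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; the code is what we want.
        return err.code

def compare_agents(url, fetch=status_for, agents=USER_AGENTS):
    """Map each agent name to the status code the server returned for it."""
    return {name: fetch(url, ua) for name, ua in agents.items()}

def treated_differently(results):
    """True if at least two agents received different status codes."""
    return len(set(results.values())) > 1
```

For example, `treated_differently(compare_agents("http://cakecentral.com/"))` would return True if browsers got a 200 while wget or a Googlebot-like client got a 404.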

Summary: not the weirdest search result I've seen by far. Webhosts that serve up 404s, redirects, or duplicate error pages can cause arbitrary things to happen in search engines. Bing doesn't have the url cakecentral.com indexed at all, for example. Blekko has it, but their page is from Nov. 11, 2010, so their copy probably predates the 404 issue by a couple of months.


Filling in the gaps:

  cakecentral.com.	90	IN	A	174.129.211.41
If you look at http://174.129.211.41/ without a Host header you'll see that it's an nginx reverse proxy/cache

Both beerby and cakecentral are on EC2.

Pretty likely that an error was made at some point in the cakecentral nginx config to include a beerby EC2 private IP as part of a load-balancing pool or as the single back end (either fat-fingered or by retaining an old IP as instances were stopped and started).

It has since been corrected (probably?), but because cakecentral.com is returning a 404 to robots on its homepage, the best, most recent response Google has is from when it was misdirected.


STEP 4: Google now shows the destination page if your results contain a 302 redirect, not the source page:

> Many months ago, if you saw someresult.com/search2.php?url=mydomain.com, that would sometimes have content from mydomain. That could happen when the someresult.com url was a 302 redirect to mydomain.com and we decided to show a result from someresult.com. Since then, we’ve changed our heuristics to make showing the source url for 302 redirects much more rare. We are moving to a framework for handling redirects in which we will almost always show the destination url.

http://www.mattcutts.com/blog/seo-advice-url-canonicalizatio...


As in this case, I find myself looking at the response headers of GET requests quite often, so:

  alias h='curl -sIw "Time: %{time_total}s\n" -X GET'
This issues a GET on the given URL, printing only the response headers and the time elapsed. Append -L if you want to follow redirects.


Also interesting: when searching for {+cake central} or {cake +central} instead of {cake central}, the first result is the correct one.

I thought that the "+" just disabled the spelling/synonym/etc... alterations, while apparently in this query it does some kind of post-filtering for exact matches... (since neither word appears on the page)


+ historically means "Do not give me any pages that don't contain this word." This used to be very important back when search engines were stupid, and basically ranked on the sum total number of occurrences of each term in the search. Searches for multiple terms would frequently be dominated by results that mentioned one of the terms many times.
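
To make that concrete, here is a toy sketch of the "stupid" ranking described above and of what a required term changes. It is an illustration only, not how any real search engine works; the function names and sample documents are invented.

```python
def tf_score(doc, terms):
    """Sum of the raw occurrence counts of each query term in the document."""
    words = doc.lower().split()
    return sum(words.count(term) for term in terms)

def naive_rank(docs, terms):
    """Rank documents by total term occurrences, dropping those with no match."""
    scored = [(tf_score(doc, terms), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored if score > 0]

def rank_with_required(docs, terms, required):
    """Like naive_rank, but first discard documents missing any '+' term."""
    kept = [doc for doc in docs
            if all(word in doc.lower().split() for word in required)]
    return naive_rank(kept, terms)
```

With documents `"cake cake cake cake recipes"` and `"cake central forum"`, the query {cake central} ranks the first document on top because it repeats one term four times, while {cake +central} filters it out entirely.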


I think you're on the right trail, but it's still confusing why this would take the number one ranking over the root domain? Maybe Google thought the entire domain had moved?


Yeah, I bet they think that is the temporary location of the cakecentral homepage


Maybe there are a lot of links to pages on cakecentral.com that no longer exist and now result in 404?


Huh? Why would CakeCentral.com pages 302 redirect to Beerby.com?


The really interesting question is why the start page of cakecentral.com would be transmitted with a 404 error code.


I don't know if this is what's going on with cakecentral, but I did something like this inadvertently many years ago.

We needed to have a CGI script handle all hits under a certain location, and for various reasons mod_rewrite wasn't an option. So I put something like this in an .htaccess:

  ErrorDocument 404 /path/to/script.cgi

I didn't realize until later I needed to explicitly set "Status: 200" in the script's headers. As far as browsers were concerned, everything worked, even IE, since the "error message" (our page content) was long enough to not trigger its built-in error message.
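
For illustration, a handler script of that kind might look roughly like the following. This is a sketch in Python rather than whatever language the original script used, and the page content and the `build_response` helper are made up. The key point is the explicit `Status:` line: without it, Apache keeps the 404 status that triggered the ErrorDocument.

```python
#!/usr/bin/env python3
import html
import os
import sys

def build_response(body, status="200 OK"):
    """Build a CGI response whose Status header overrides the original 404."""
    headers = [
        f"Status: {status}",
        "Content-Type: text/html; charset=utf-8",
    ]
    # A blank line separates the CGI headers from the body.
    return "\r\n".join(headers) + "\r\n\r\n" + body

def main():
    # Apache passes the originally requested path to ErrorDocument
    # handlers in the REDIRECT_URL environment variable.
    path = os.environ.get("REDIRECT_URL", "/")
    page = f"<html><body><h1>Content for {html.escape(path)}</h1></body></html>"
    sys.stdout.write(build_response(page))

if __name__ == "__main__":
    main()
```

Fetching such a page with a tool that prints the status line (wget, curl) shows the difference immediately, whereas a browser renders the same content either way.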


It is definitely an anomaly. Take a look at the following comparison between the top 10 results for the keyword "cake central". The #1 result is worse than the others in every significant way, yet it sits at the top.

http://grab.by/grabs/32ac4e9cade57bedcc96c8e42fb66a2f.png

DA = Domain Age

PR = PageRank

IC = Indexed Content (pages)

BLP = Backlinks for the page

BLD = Backlinks for the domain

BLEG = Backlinks from .edu/.gov pages

DMZ = Listed in DMOZ

YAH = Listed in Yahoo Directory

Title, URL, Desc, Head = Whether the keyword is included in any of those

CA = Google Cache age


Where does one go to get such a table?


Market Samurai: http://www.marketsamurai.com/c/Antonio (referral link). It's an excellent (and expensive) program for internet marketing and SEO research, available for Mac and Windows (it's made in Flex/Air). If you buy during the trial period, you can get a big discount though. I got my copy for $97.

Screenshot of the screen from which the table has been taken: http://grab.by/grabs/323101a2a3382f4c75b2f077a481931c.png


shameless affiliate link drop man, come on now


I'd personally recommend getting the data by installing SEO Site Tools for Chrome, and/or the SEOmoz Mozbar. Both free.

Disclaimer: I know the guys from SEOmoz fairly well, and have used their site for years.


Try http://360voltage.com and run a Voltmeter report. You have to have an account, but it's free. There are a ton of services like this; SEOmoz is probably the best.


Interesting that people, when searching for 'cake central', still click on a link with a title saying 'ERROR: backend server did not respond in time' even though the second result has 'CakeCentral' in bold.


That's why Adwords works wonders if you can get your ad on the top of a SERP.


Maybe they are Feeling Lucky ?


So many lucky users? :)

In an interview with the Washington Post in 2006 Marissa Mayer from Google said that almost no one ever uses the "I'm feeling lucky" button:

http://www.washingtonpost.com/wp-dyn/content/article/2006/10...

But maybe in 2011 Google users are luckier.


Or just typing 'cake central' in an address bar in firefox.


I guess most are probably just clicking on "I'm feeling lucky" and going there without even seeing the results.


OK, could it be that they both have the same server hosting company?

Because here we see a beerby page with a cakecentral URL http://www.google.com/search?q=site%3Acakecentral.com+beerby...

I would guess it was either a server (housing) accident or a DNS f*ckup that let beerby and cakecentral switch places (in an erroneous state) for a short time. Unluckily, Google picked up the cakecentral home page URL at that moment. It saw it as either a redirect to or a direct duplicate of the beerby site and decided to show the older indexed page with the same content (the beerby error page).

Yeah, either this or Google screwed up.

Update: I guess this because I have seen similar errors when somebody screws up redirects from the home page (makes HTTP 302 redirects from the home page to another page, and that page (or the redirect) is then changed to something else...). But this is the first time I have seen such an error between two unrelated sites.


I could reproduce this weird behaviour. At first I thought that there might be some pages linking to the error page with the anchor text in question; this is also what the cache page claims. Also, there are many scraper and auto-generated spam sites with broken links that never really show up in the Google index.

Similar cases have happened before. There is a forum by Google for Webmasters where you can tell Google about problems with your website:

http://www.google.com/support/forum/p/Webmasters?hl=en

You could tell them your findings and maybe someone from the Google team will look into the matter, if you are lucky.


lol the last result on that one even has the cakecentral domain, but has a title of one of our user pages.

Edit: I'm wondering if maybe something F'ed up in Google's database


This is the reason you should always configure your web server to serve an HTTP 503 "Service Unavailable" when your backend is not online or not fast enough. This will tell Googlebot to come back later and not index the result.
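
As a rough sketch of that advice, here is a tiny front end built on Python's standard `http.server` that answers 503 with a `Retry-After` header whenever the backend produces nothing. The `call_backend` function is a hypothetical stand-in for the real backend call, and the 120-second retry hint is an arbitrary value chosen for illustration.

```python
from http.server import BaseHTTPRequestHandler

RETRY_AFTER_SECONDS = 120  # arbitrary illustrative value

def plan_response(backend_body):
    """Map a backend result (None = down or timed out) to (status, headers, body)."""
    if backend_body is None:
        headers = {
            "Retry-After": str(RETRY_AFTER_SECONDS),
            "Content-Type": "text/plain",
        }
        return 503, headers, b"Temporarily unavailable\n"
    return 200, {"Content-Type": "text/html"}, backend_body

def call_backend():
    """Hypothetical stand-in for the real backend call; None signals failure."""
    return None

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Never serve an error page with a 200: send 503 when the backend fails.
        status, headers, body = plan_response(call_backend())
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, one could serve `Handler` with `http.server.HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()` and inspect the response headers with curl or wget.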


Could this be some new slang term that is not quite popular yet? "Cake central" = drinking lots of beer. "The other day I got caked at that bar, it was cake central down there"


it's not on Urban Dictionary ... yet...


Looking at the Google cache page I see this:

These terms only appear in links pointing to this page: cake central

Looks like the good old Google bombs still work :) http://en.wikipedia.org/wiki/Google_bomb


Definitely not a Google bomb. Just checked their incoming links, and well, there are no links which target "cake central".

The note

"These terms only appear in links pointing to this page: cake central"

always shows up whenever the query words cannot be found on the cached page.


Yes, there seems to be something else that is wrong with the Google index, as some pages from the cakecentral.com domain show up with content from beerby. This has been noted by someone in this thread and I was just able to reproduce it.

On the other hand, we can never be sure whether pages exist, or where they are on the web, that link to our pages with a certain anchor text. The link: operator has been broken for a long time and shows only a small subset of the pages linking to the page in question, if anything at all.

A more complete list of links can be found in the Google Webmaster Tools, but this is also never 100% complete or up to date. We can also use Yahoo Site Explorer to hunt for a particular link:

http://siteexplorer.search.yahoo.com/


Site Explorer is worse than useless, and the Google link: operator is crippled. But for a link bomb you need quite a few links with the exact matching link text, and a simple search for ["cake central" beerby] does not show anything (nor do other queries with the link: and inanchor: operators). So it can be relatively safely assumed that it is not a link bomb (in a link bomb you always find some of the links).

Or let's phrase it like this:

there is an absence of evidence that it was a link bomb


I'm pretty sure the web crawl Google does to figure out your rankings is separate from the one that saves the cached version and probably that snippet.

That said, I have no idea why that page would rank on those terms, error or not.


Real head-scratcher. Waiting for Matt Cutts to hop on here and explain this.


Doesn't this just reflect the dirty little secret that Google doesn't really have to get any particular details right, just mostly right most of the time?


Another interesting thing to notice is that Google Instant comes up with the right result (cakecentral.com). Only when I press enter (or the search button) do I get the beerby.com result. Google Instant does, however, claim that it's showing results for cake central magazine and that I can search instead for "cake central".

EDIT: even better - searching for cakecentral.com also leads to the same error page on beerby.com


There was a similar ranking some months ago involving searching for "vatican" in Google http://news.ninemsn.com.au/technology/7931120/vatican-search...


I tried whoishostingthis.com on both beerby.com and cakecentral.com, only to get a 404 error reported.

Edit: It worked on the second attempt: beerby.com is hosted at Acquia hosting and cakecentral.com at Amazon


hmm? Beerby is on Amazon too actually



That's all well and good, but I'm one of the owners of Beerby and I can tell you for sure it's on Amazon EC2 :)


This likely happens when a page was the top result but then got re-spidered. Google probably keeps the old rank for a while, even though the content has changed.


Using CakePHP framework? That wouldn't affect it, but that's the first thing that popped into my mind.


Good thought, but we're not using CakePHP


That's really an interesting find.


Google bombing ? ;)

In popular French, "cake traces" refers to brown marks in underpants. I guess the French expression "cake face" is a subsequent derivation of it. So I'm trying to guess what "cake central" might mean ...


looks like it is fixed now



