Hi!
The robots file is not a text file. I know!
Products get blocked due to possible copyright infringement. Many are disputes over a product name or a t-shirt slogan.
We were ranking for some terms we did not want to be associated with so we added the offending search result page to the file. Not sure how it ever got indexed in the first place.
Peace!
I don't think you actually want to do that. Blocking in robots.txt will prevent Google from crawling the URL - not from indexing it. You actually want them to crawl the URL and then respond with a 404 or 410.
If there are inbound links pointing to that URL, you should disavow them in GWT.
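The 404/410 advice above can be sketched quickly. This is a minimal hypothetical WSGI app (the path and status choices are assumptions, not Nordstrom's actual setup) showing a server answering 410 Gone for removed pages so crawlers drop them, rather than hiding them via robots.txt:

```python
# Hypothetical set of pages we want de-indexed.
REMOVED_PATHS = {"/sr/petite-extra-small"}

def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    if path in REMOVED_PATHS:
        # 410 tells crawlers the page is permanently gone;
        # Google tends to drop a 410 from the index faster than a 404.
        start_response("410 Gone", [("Content-Type", "text/plain")])
        return [b"Gone"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"OK"]
```

The key point: the crawler has to be *allowed* to fetch the URL in order to see the 410, which is exactly why disallowing it in robots.txt is counterproductive.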
While you're at it - can you unblock traffic from Vietnam? I know a few people here who want to order online (through a courier) but can't and aren't capable of VPNing. Plus, isn't blocking entire countries from accessing your site against the spirit of the internet?
Why they are even bothering to allow their search results to be indexed is beyond me. Any decent SEO would tell you this is a terrible idea. If their in-house SEO told them this was a good idea, then they seriously need to re-evaluate their program.
Not only has Matt Cutts said you shouldn't be doing this [1][2], but it's also listed in Google's Webmaster Guidelines as something NOT to do [3].
A quick query of "site:nordstrom.com/sr/" shows they have 260,000 search results pages in Google's index.
Just a few of the zero-result search pages their system is creating and allowing Google to index:
Well, you'd want them to try to crawl it. Perhaps there was a bot crawling their site for such links and they wanted to identify specifically that one. It seems most likely that such URLs never existed, but this would be a decent way to identify a bot that's otherwise indistinguishable.
Disallow is used to exclude certain URLs from being crawled, in this case it looks like a search query as the store's search uses this pattern: http://shop.nordstrom.com/sr/query.
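For reference, a rule excluding that whole pattern would look roughly like this (a hypothetical fragment, not their actual file):

```
User-agent: *
Disallow: /sr/
```

Any URL whose path starts with /sr/ is then off-limits to compliant crawlers, which matches the search-results pattern described above.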
Update!
Offending term we rank for: "extra small teen porn". Offending page: /sr/petite-extra-small. Robots updated (live tomorrow)! Asked Google to remove the URL.
Google Webmaster Tools shows you the search terms used to click through to your website from Google searches.
I'm not sure if it still works, but referer URLs used to log search terms in the web access logs.
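A sketch of how that used to work, assuming an old-style Google referer that still carried the q= parameter (Google stopped sending it for organic clicks years ago, so this mostly no longer works):

```python
from urllib.parse import urlparse, parse_qs

def search_term(referer):
    """Pull the q= search term out of a referer URL, if present."""
    qs = parse_qs(urlparse(referer).query)
    return qs.get("q", [None])[0]

# Hypothetical log line's referer field:
print(search_term("http://www.google.com/search?q=extra+small+petite"))
# -> extra small petite
```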
Yeah, the clue is in "Server: AkamaiGHost". The site is clearly behind some sort of anti-crawl protection which decides it doesn't like me. I think it would make sense to make an exception for exactly that file though.
I'm not sure what letters would go on the end of "porn" to make a clothing item, but I'll bet it exists. Or someone misspelt "pony", or... well, you get the idea. No matter what, you really don't want your site to appear in search results for those terms.
> Saeculum obscurum (Latin: the Dark Age) is a name given to a period in the history of the Papacy during the first half of the 10th century, beginning with the installation of Pope Sergius III in 904 and lasting for sixty years until the death of Pope John XII in 964. During this period, the Popes were influenced strongly by a powerful and corrupt aristocratic family, the Theophylacti, and their relatives.
Muscle memory. I find myself doing this all the time too; it's just a universal way of filtering things down, so I don't worry about it too much. Like with lsof: sure, you can use filters via lsof itself, but sometimes you just use "lsof | grep" because it's easier than looking up the correct option. The performance-for-muscle-memory tradeoff is worth it.
Nothing incorrect about using cat. It makes almost zero practical difference except saving a couple keystrokes. There are some scenarios where it actually matters, but this isn't one of them.
I wouldn't be surprised to hear that they did put that url into their robots.txt because it is a search result page generated by a bot and that has been visited and indexed by Google…
I'm not up to date on how google's current search indexing algorithm works. Supposing someone has a lot of bots that post links to `extra-small-teen-pony` or some other /sr query on nordstrom.com from other sites - would google index that?
Do a google search for some of those products - ends up being even weirder when you see some of the sites linking to them.
edit: especially considering that google still links to them, it just doesn't show their description, just gives the standard error for "couldn't show this description because it was blocked by the site's robots.txt"
Google indexes pages that link to those pages, but can't index the pages themselves. The PageRank algorithm still sees inbound links to unindexable pages.
I've dealt with retailers who had to remove clothing products for legal reasons, usually copyright complaints. It could be a knockoff of another design, or copyrighted characters on the clothes. If there is any backdoor way to get to the product page, they could be in trouble.
Designers often pull stuff from stores or negotiate exclusives with another store. This could have been an attempt to make sure some search results don't show up anymore.
The point of robots.txt "Disallow"s is to tell search engine crawlers not to crawl a URL (which could exist). It's not to tell search engines that URLs matching a pattern don't exist.
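Python's stdlib robots.txt parser illustrates the point: the rules only answer "may I crawl this URL?", nothing more. A small sketch with made-up rules matching the pattern discussed above:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Hypothetical rules modeled on the pattern discussed above.
rp.parse([
    "User-agent: *",
    "Disallow: /sr/",
])

# The crawler is told not to fetch anything under /sr/ ...
assert not rp.can_fetch("*", "http://shop.nordstrom.com/sr/petite-extra-small")
# ... but other pages remain crawlable, and nothing here says whether a
# disallowed URL exists, or whether it's already in the index.
assert rp.can_fetch("*", "http://shop.nordstrom.com/about")
```

Note that can_fetch answers a permission question only; a disallowed URL can still end up indexed if other sites link to it, which is the whole problem described upthread.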