Nordstrom's robots.txt (nordstrom.com)
199 points by prokizzle on Feb 27, 2015 | 63 comments


Hi! The robots file isn't served as a plain text file. I know! Products are blocked due to possible copyright infringement; a lot are disputes over a product name or a t-shirt slogan. We were ranking for some terms we did not want to be associated with, so we added the offending search result page to the file. Not sure how it ever got indexed in the first place. Peace!


I don't think you actually want to do that. Blocking in robots.txt will prevent Google from crawling the URL, not from indexing it. You actually want them to crawl the URL and then respond with a 404 or 410.

If there are inbound links pointing to that URL, you should disavow them in GWT.


a "noindex" robots meta tag would work too
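For example (a sketch; either form works, and note the page must stay crawlable, not robots.txt-blocked, for Google to see it):

```html
<!-- Option 1: in the page's <head> -->
<meta name="robots" content="noindex">

<!-- Option 2: the equivalent HTTP response header -->
<!-- X-Robots-Tag: noindex -->
```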


While you're at it - can you unblock traffic from Vietnam? I know a few people here who want to order online (through a courier) but can't and aren't capable of VPNing. Plus, isn't blocking entire countries from accessing your site against the spirit of the internet?

Edit: this is what we see from Vietnam - https://cloudup.com/cOLr9vimmIC


I like you already :)


thanks! I have some inside knowledge!


as in you work at nordstrom?


kinda sorta maybe


awesome response!


Why they are even bothering to allow their search results to be indexed is beyond me. Any decent SEO would tell you this is a terrible idea. If their in-house SEO told them this was a good idea, then they seriously need to re-evaluate their program.

Not only has Matt Cutts said you shouldn't be doing this [1][2], but it's also listed in Google's Webmaster Guidelines among the things NOT to do [3].

A quick query of "site:nordstrom.com/sr/" shows they have 260,000 search results pages in Google's index.

Just a few of the zero-result search pages their system is creating and allowing Google to index:

http://shop.nordstrom.com/sr/sorrelli
http://shop.nordstrom.com/sr/query
http://shop.nordstrom.com/sr/flogg

There can be benefit in creating pages at scale, but this is a textbook example of how not to go about it.

1. https://www.mattcutts.com/blog/search-results-in-search-resu...
2. http://searchengineland.com/googles-cutts-auto-generated-con...
3. https://support.google.com/webmasters/answer/35769


You're rolling along thinking this is just some bad web developer somewhere until you get to the last Disallow...


Could that be a honeypot to trap bots that don't respect robots.txt?


Looks more like a mop-up after a website infection incident.


I have had to do that before. The URLs in that case seemed to have a more random distribution than most of these.


maybe, but why would THAT be the honey pot? it could be any phrase, really.


Well, you'd want them to try to crawl it. Perhaps there was a bot crawling their site for such links and they wanted to identify specifically that one. It seems most likely that such URLs never existed, but this would be a decent way to identify a bot that's otherwise not distinguishable.


Well, I'm on a list now..


You have something against petite young ladies of a modest stature? Sizeist!


This other retailer's page is about the same except for that one entry: http://www.neimanmarcus.com/robots.txt


I don't get it: is that disallow targeting some sort of porn spambot crawler?


They probably made a product with that slug by accident and are doing that to try to get it removed from search engine results.


It makes me wonder if Nordstrom ever stocked the now-defunct Pornstar clothing brand.


Disallow is used to exclude certain URLs from being crawled, in this case it looks like a search query as the store's search uses this pattern: http://shop.nordstrom.com/sr/query .
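A quick sketch with Python's stdlib robots.txt parser, using a hypothetical rule in the style of Nordstrom's file - Disallow values are path prefixes, so one rule covers everything under that path:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules in the style of Nordstrom's file; Disallow values
# act as path prefixes, so /sr/query blocks every URL under that prefix.
rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /sr/query
""".splitlines())

print(rp.can_fetch("*", "http://shop.nordstrom.com/sr/query"))        # False: blocked
print(rp.can_fetch("*", "http://shop.nordstrom.com/sr/query-shoes"))  # False: same prefix
print(rp.can_fetch("*", "http://shop.nordstrom.com/about"))           # True: not covered
```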


Maybe some mal-intended inbound links that they don't want to be associated with?


Is this some sort of new terrible negative seo? Associate a target url for a bunch of negative things?


Maybe that was their intention, but that's not how it's done.


Update! Offending term we rank on: "extra small teen porn"; offending page: /sr/petite-extra-small. Robots updated (live tomorrow)! Asked Google to remove the URL.


So uh, how did you find out that you match for this phrase?


Google Webmaster Tools shows you the search terms used to click through to your website from Google searches. I'm not sure if it still works, but Referer URLs used to carry the search terms, so they showed up in the web access logs.
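The old log-analysis trick can be sketched like this - a hypothetical helper that pulls the "q=" term out of a Referer URL (Google has long since stopped sending the query in the referrer for most searches):

```python
from typing import Optional
from urllib.parse import urlparse, parse_qs

def search_term(referer: str) -> Optional[str]:
    """Extract the 'q=' search term from a Referer URL, if present."""
    query = parse_qs(urlparse(referer).query)
    terms = query.get("q")
    return terms[0] if terms else None

print(search_term("http://www.google.com/search?q=extra+small+petite"))  # extra small petite
print(search_term("http://example.com/"))                                # None
```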


Probably looked at their logs and found a lot of visits to that URL.


Hmm ...

    $ curl --dump-header - 'http://shop.nordstrom.com/robots.txt'
    HTTP/1.1 403 Forbidden
    Server: AkamaiGHost
    Mime-Version: 1.0
    Content-Type: text/html
    Content-Length: 281
    Expires: Fri, 27 Feb 2015 23:42:13 GMT
    Date: Fri, 27 Feb 2015 23:42:13 GMT
    Connection: close
    ...
Probably a bad idea to deny serving robots.txt to bots (even if they ought to interpret that as a total ban).
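That "total ban" interpretation is exactly what Python's stdlib robotparser does when its fetch fails. A sketch of the convention, with a hypothetical helper that mirrors the status-code handling (setting disallow_all/allow_all directly instead of fetching over the network):

```python
from urllib.robotparser import RobotFileParser

def parser_for_status(status: int, body: str = "") -> RobotFileParser:
    """Mirror the convention urllib.robotparser uses when fetching
    robots.txt: 401/403 -> everything disallowed, other 4xx (e.g. a
    missing file) -> everything allowed, otherwise parse the body."""
    rp = RobotFileParser()
    if status in (401, 403):
        rp.disallow_all = True
    elif 400 <= status < 500:
        rp.allow_all = True
    else:
        rp.parse(body.splitlines())
    return rp

# A 403 on robots.txt means the whole site is off-limits to a polite bot.
print(parser_for_status(403).can_fetch("*", "http://shop.nordstrom.com/"))  # False
# A plain 404 means no restrictions at all.
print(parser_for_status(404).can_fetch("*", "http://shop.nordstrom.com/"))  # True
```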


Funny, when I do the same thing I get a completely different response:

    $ curl --dump-header - 'http://shop.nordstrom.com/robots.txt'
    HTTP/1.1 200 OK
    Cache-Control: private
    Content-Type: text/html; charset=utf-8
    Server: Microsoft-IIS/7.5
    X-Powered-By: ASP.NET
    Content-Length: 865
    Date: Sat, 28 Feb 2015 07:04:36 GMT
    Connection: keep-alive


Yeah, the clue is in "Server: AkamaiGHost". The site is clearly behind some sort of anti-crawl protection which decides it doesn't like me. I think it would make sense to make an exception for exactly that file though.


Reading, reading, reading... wtf! I hope that is an abbreviation for something, but to be honest, I don't want to Google and find out.

  Disallow: /sr/extra-small-teen-porn*


I'm not sure what letters would go on the end of "porn" to make a clothing item, but I'll bet it exists. Or someone misspelt "pony", or... well, you get the idea. No matter what, you really don't want your site to appear in search results for those terms.


    $ cat /usr/share/dict/words | grep "^porn"
    pornerastic
    pornocracy
    pornocrat
    pornograph
    pornographer
    pornographic
    pornographically
    pornographist
    pornography
    pornological


> pornocracy

Is this what we will call it when the NSA starts blackmailing politicians with their internet histories?


It means "rule of the harlots" and refers to a particular period in the history of the Catholic Church.


http://en.wikipedia.org/wiki/Saeculum_obscurum

> Saeculum obscurum (Latin: the Dark Age) is a name given to a period in the history of the Papacy during the first half of the 10th century, beginning with the installation of Pope Sergius III in 904 and lasting for sixty years until the death of Pope John XII in 964. During this period, the Popes were influenced strongly by a powerful and corrupt aristocratic family, the Theophylacti, and their relatives.


Why are you catting a single file into grep?


Muscle memory. I find myself doing this all the time too; it's just a universal way of filtering things down, so I don't worry about it too much. Like when using lsof: sure, you can use filters via lsof itself, but sometimes you just use "lsof | grep" because it's easier than looking up the correct option. The performance-to-muscle-memory tradeoff is worth it.


I think GP may be referring to the "unnecessary use of cat"


The correct way:

    grep "^porn" /usr/share/dict/words


Nothing incorrect about using cat. It makes almost zero practical difference except saving a couple keystrokes. There are some scenarios where it actually matters, but this isn't one of them.


impossible to do more research on this at work.


That's just way too specific to be an accident, wonder what the story is..


I wouldn't be surprised to hear that they put that URL into their robots.txt because it's a search result page that was generated by a bot and then visited and indexed by Google…


I'm not up to date on how Google's current search indexing algorithm works. Supposing someone has a lot of bots that post links to `extra-small-teen-pony` or some other /sr query on nordstrom.com from other sites - would Google index that?


Only if they served a custom 404 page that returned a 200 status. That's how I assumed those phone number lookup sites work.


shop.nordstrom.com/sr is for search results. So I guess someone linked to that search result somewhere, likely as a joke

EDIT: Looks like "porn star clothing" search also got indexed: https://www.google.ca/search?client=ubuntu&channel=fs&q=porn... (apparently someone has an edgy blog where that link is the punchline?)


Also, in

    Disallow: /sr/mattress*
    Disallow: /sr/mattresses*
Doesn't A imply B?
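It does, under plain prefix matching. A quick sketch with Python's stdlib parser (dropping the trailing "*", which is redundant for prefix-matching crawlers) shows the first rule already covers both:

```python
from urllib.robotparser import RobotFileParser

# One prefix rule covers both "mattress" and "mattresses" URLs.
rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /sr/mattress
""".splitlines())

print(rp.can_fetch("*", "http://shop.nordstrom.com/sr/mattress"))    # False: blocked
print(rp.can_fetch("*", "http://shop.nordstrom.com/sr/mattresses"))  # False: same prefix rule
```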


Do a google search for some of those products - ends up being even weirder when you see some of the sites linking to them.

edit: especially considering that google still links to them, it just doesn't show their description, just gives the standard error for "couldn't show this description because it was blocked by the site's robots.txt"


Google indexes pages that link to those pages, but can't index the pages themselves. The PageRank algorithm still sees inbound links to unindexable pages.


Screenshot of original robots.txt for when the changes get pushed to production:

http://imgur.com/8zkK8w2


Anyone have any theories on why they're blocking specific products?


I've dealt with retailers who had to remove clothing products for legal reasons, usually copyright complaints. It could be a knockoff of another design, or copyrighted characters on the clothes. If there is any backdoor way to get to the product page, they could be in trouble.


Designers often pull stuff from stores or negotiate exclusives with another store. This could be an attempt to make sure some search results don't show up anymore.


They might have signed an agreement restricting "lowest advertised price". They're pretty common.


I can't think of a single reason. Some of the URLs still work too.


The point of robots.txt "Disallow"s is to tell search engine crawlers not to crawl a URL (which could exist). It's not to tell search engines that URLs matching a pattern don't exist.


I guess they removed that porn line. I clicked through on it and was perplexed about what the big deal was, and I had to read the comments here to know.


It's still there for me. Maybe they use a CDN that serves different things to different locations or does caching?


"Disallow: /sr/extra-small-teen-porn"

Yeah someone got pwned, or had a serious HR incident.



