Hacker News
Bit.ly/robots.txt and the Dangers of Custom Shortened URLs (davidnaylor.co.uk)
49 points by byrneseyeview 2137 days ago | 15 comments



All the owner of that blog would have to do would be to change his post to look like a normal robots.txt file and he could happily ban Google (or Yahoo, or whoever) from crawling any page on bit.ly.

No crawler I know of accepts a redirected robots.txt from an alternate domain for rules about the original domain.

-----


Is that part of the robots.txt RFC or just a happy security coincidence? The bit.ly issue seems like a pretty bad bug that ought to be fixed.

You also might be able to fool Google Webmaster Tools or other utilities into thinking you own bit.ly. One of the authentication methods I've seen used was the creation of an empty html document in the root directory.

-----


The 1994 original proposal is silent on what to do with any unexpected response codes/types.

The 1997 internet-draft (never an RFC) suggests the redirect should be followed, but the potential confusion that could cause has meant few, if any, crawlers have followed that guidance. It's more likely to be a webmaster configuration error than real intent.
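
To make the conservative behaviour concrete, here is a minimal sketch of the kind of policy described above: a redirected robots.txt is used only if the redirect stays on the same host. The function names (`same_host_redirect`, `resolve_robots`) and the exact rule are assumptions for illustration, not the documented behaviour of Google or any other crawler.

```python
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def same_host_redirect(robots_url, location):
    """True if the redirect target resolves to the same host as robots_url."""
    origin = urlsplit(robots_url)
    # Location may be relative, so resolve it against the original URL first.
    target = urlsplit(urljoin(robots_url, location))
    return (origin.hostname or "") == (target.hostname or "")


def resolve_robots(robots_url, status, location=None):
    """Return the URL whose body should be treated as the rules, or None.

    Hypothetical policy: accept a direct 200, follow same-host redirects,
    and ignore anything that points at another domain.
    """
    if status == 200:
        return robots_url
    if status in REDIRECT_CODES and location:
        if same_host_redirect(robots_url, location):
            return urljoin(robots_url, location)
    return None
```

Under this sketch, a bit.ly/robots.txt that 301s to a blog post on another domain yields no rules at all, which matches the "webmaster configuration error" reading.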

-----


That is true, but it must still mess things up to have robots.txt redirected. It would appear that Google and other bots won't have any way of reaching the real robots.txt.

And what about sitemap.xml, atom.xml, and other typical files that could also be redirected?

-----


"And what about sitemap.xml, atom.xml, and other typical files that could also be redirected?"

Why not try it? As of this writing, both of those go to other places.

I checked the other one that leapt to mind, favicon.ico. Bit.ly appears to have hardcoded it (probably in Apache configuration, or the equivalent for whatever server they use). However, try http://bit.ly/faviconico and look at the resulting URL. Looks like a few things were tried by the same guy. Now, pwning the favicon would have been cool.

-----


That is bit.ly's problem for not making an exception (assuming they want one in the first place; it's not required), not a general problem.
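
For what "making an exception" could look like, here is a rough sketch of a shortener rejecting custom aliases that collide with well-known paths. The reserved list and the `is_alias_allowed` helper are hypothetical; nothing here reflects how bit.ly actually validates names.

```python
# Hypothetical set of paths a shortener might refuse to hand out as aliases.
RESERVED_ALIASES = frozenset({
    "robots.txt",
    "favicon.ico",
    "sitemap.xml",
    "atom.xml",
})


def is_alias_allowed(alias):
    """Reject empty aliases and aliases that shadow a reserved path."""
    # Normalize case and surrounding slashes so "/Robots.TXT" is caught too.
    normalized = alias.strip().strip("/").lower()
    return bool(normalized) and normalized not in RESERVED_ALIASES
```

Note that a near miss such as "faviconico" still passes, since it does not shadow a real well-known path.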

-----


Nope, it's not a general problem, but I bet the programmers over at bit.ly are still going to catch some flak over this.

-----


Agreed

-----


You're wrong, basically: because a 301 returns "found", Google will follow it and use the data. We will blog it tomorrow with proof.

-----


I'm a bit amazed at how much these URL shorteners have caught on.

I've started to see them used in email sent internally by Corporate folks at my company, as links to press releases on our own website.

-----


Of course, many corporate Web sites are so poorly designed that every URL is longer than 80 characters and thus may be mangled in email.

-----


Most corporate email travels via Exchange with Rich Text, so the 80 character limit doesn't apply. That's just what "normal users" use; rightfully so.

For us hackers: when sending plain-text messages, you should enclose URLs in angle brackets, like <www.google.com>, to prevent them breaking across lines and especially to avoid the trailing-period problem.

-----


You forgot Lotus Notes. It mishandles HTML emails so badly that the new Outlook is pure perfection by comparison. (Grumbles over having to implement a complex receipt email for a company that uses Notes internally.)

-----


Maybe it's a subtle way to make sure they're being read.

-----


Heh, check out the statistics for that URL:

http://bit.ly/info/robots.txt

Hundreds of thousands of "direct" hits, and hardly any actual referrers. Looks like bit.ly is counting crawlers as hits. Not too surprising given the lack of validation elsewhere.
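
If you wanted to separate crawler traffic from real clicks, a crude first pass is to filter on the User-Agent header. This is only a sketch: the marker list and the helper names are made up for illustration, and substring matching will miss any bot that does not identify itself.

```python
# Hypothetical substrings that commonly appear in crawler User-Agent strings.
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")


def is_probable_crawler(user_agent):
    """Guess whether a User-Agent string belongs to a crawler."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def count_human_hits(user_agents):
    """Count hits whose User-Agent does not look like a crawler."""
    return sum(1 for ua in user_agents if not is_probable_crawler(ua))
```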

-----



