Hacker News
Twitter just updated its robots.txt to exclude all scrapers (twitter.com)
16 points by sb057 on July 15, 2015 | 19 comments

Here's a write-up on why they made the changes. I was going to write it yesterday, but life got in the way:


Why would you not use 301 redirects or just rel=canonical?

My best guess is that there is some platform issue...

They could at least have allowed the Internet Archive in the robots.txt, since, as things stand, all www.twitter.com links will be unavailable from the Wayback Machine. That will obviously be a huge loss to researchers.

(Update: the Wayback Machine will be fine, using the twitter.com/robots.txt)



They blocked robots on their marketing pages.

Again, no, they blocked it only on the www. subdomain: http://webmarketingschool.com/no-twitter-did-not-just-de-ind...

Nope. No. They didn't.

What they did was some perfectly legitimate duplicate content protection.

Will write it up in a bit more detail...

So what does that mean?

Absolutely nothing from Twitter should be appearing in search engines.

Actually, I believe it's only for the "www" subdomain. Take a look at the robots.txt without the "www":

https://www.twitter.com/robots.txt
https://twitter.com/robots.txt

This is likely just to prevent duplicate content and nudge users toward the URLs without the "www".
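To illustrate how two hosts can serve different crawl rules, here is a minimal sketch using Python's standard urllib.robotparser. The two robots.txt bodies, the user agent "ExampleBot", and the path "/example_user" are made up for illustration; they are not the actual contents of either Twitter file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt bodies, NOT the real Twitter files.
# The "www" host blocks everything; the bare host blocks only /search.
WWW_ROBOTS = """\
User-agent: *
Disallow: /
"""

BARE_ROBOTS = """\
User-agent: *
Disallow: /search
"""


def is_allowed(robots_body: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = (
        ("www.twitter.com", WWW_ROBOTS),
        ("twitter.com", BARE_ROBOTS),
    )
    for host, body in checks:
        url = f"https://{host}/example_user"
        print(host, is_allowed(body, "ExampleBot", url))
```

Since robots.txt is looked up per host, a blanket Disallow on one hostname says nothing about the other, which is why checking both files matters.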

Yep, it makes a huge difference... as explained here as well: http://webmarketingschool.com/no-twitter-did-not-just-de-ind...

You should never treat Google's result-count estimate as a real number, especially once it gets past 100. There are better ways to move content, like 301 redirects or rel=canonical.
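For anyone unfamiliar with the two alternatives, here is a rough sketch of what each amounts to, in plain Python with no web framework. The function names, the choice to force https, and the choice to treat the host without "www." as canonical are all assumptions for illustration, not anything Twitter is known to do.

```python
import html
from typing import Optional, Tuple
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Map a URL to its assumed canonical form: https, host without 'www.'."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    # Any port and fragment are dropped in this simplified sketch.
    return urlunsplit(("https", host, parts.path or "/", parts.query, ""))


def redirect_for(url: str) -> Optional[Tuple[int, str]]:
    """Return (301, target) if the URL should redirect, else None."""
    target = canonical_url(url)
    return (301, target) if target != url else None


def canonical_link_tag(url: str) -> str:
    """Build the <link rel="canonical"> tag a page could put in its <head>."""
    href = html.escape(canonical_url(url), quote=True)
    return f'<link rel="canonical" href="{href}">'


if __name__ == "__main__":
    print(redirect_for("https://www.twitter.com/example_user?lang=en"))
    print(redirect_for("https://twitter.com/example_user"))
    print(canonical_link_tag("https://www.twitter.com/example_user"))
```

The redirect moves visitors and crawlers to one URL, while the canonical tag leaves both URLs reachable and only hints which one should be indexed.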

So only those with firehose agreements get data? What's the impact on general SEO or third parties? Doesn't sound good.

Except don't they have a special deal with Google to use the firehose?

IMO, that's exactly the reason. Before, search engines could scrape the data and load the content for free. Now they'll need to reach firehose data agreements.

Take a look at the robots.txt file without the "www" subdomain. It's likely meant to prevent content duplication and push for only the non-"www" URLs to appear in search engines.


Uh... Excellent point. Seems the entire speculation is/was unfounded.
