
But why retroactively remove the data? The original owner was fine with it being archived, so why should the snapshot be deleted just because a completely different person wants their completely different website not to be crawled?



It's hard for a bot to understand concepts of 'owner' and 'completely different person' based on the data it has available. Companies can use this robots.txt feature to un-index old marketing content after a re-branding, for example. Or after an acquisition.


Sure, but surely the bot has timestamps saying "robots.txt allowed me to keep these documents the last time I spidered them". Why do they have to be retroactively removed? robots.txt only disallows spidering; it doesn't mandate deleting data you've already spidered.
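To make the distinction concrete, here's a minimal sketch using Python's standard urllib.robotparser of what robots.txt actually governs at crawl time; the site URL and the "ExampleArchiveBot" user agent are placeholders, not anything a real archiver uses:

    from urllib.robotparser import RobotFileParser

    # robots.txt is consulted at fetch time: it tells the crawler whether it
    # may request a URL *now*. It says nothing about data fetched earlier.
    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")  # placeholder site
    rp.read()

    url = "https://example.com/old-page.html"
    if rp.can_fetch("ExampleArchiveBot", url):    # hypothetical user agent
        print("Allowed to crawl", url)
    else:
        print("Skip crawling", url, "- previously saved copies are untouched")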


Because most of the problems come from people who want to hide old material that they didn't realize was being indexed. The automatic behavior is simple and easy to implement, and doesn't require any human intervention.
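A rough sketch of that automatic behavior, under the assumption of a hypothetical snapshot store: on each pass the archiver rechecks the site's current robots.txt and hides previously captured snapshots for paths that are now disallowed, with no human in the loop.

    from urllib.robotparser import RobotFileParser

    def apply_current_robots(snapshots, robots_url, user_agent="ExampleArchiveBot"):
        """Hide archived snapshots whose URLs the *current* robots.txt disallows.

        `snapshots` is a hypothetical list of dicts like
        {"url": ..., "captured_at": ..., "hidden": False}; a real archive's
        data model is certainly different - this only illustrates the policy.
        """
        rp = RobotFileParser()
        rp.set_url(robots_url)
        rp.read()

        for snap in snapshots:
            # Retroactive rule: current permissions override the permissions
            # that were in effect when the snapshot was captured.
            snap["hidden"] = not rp.can_fetch(user_agent, snap["url"])
        return snapshots

The trade-off discussed above is visible in the sketch: the check has no notion of who owned the domain when each snapshot was captured, so a new owner's robots.txt ends up hiding the old owner's pages.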



