
I don't see any issue at all if the bot is respecting the robots.txt file. Any malicious user will figure out some other evil way anyway, be it Nutch or a network of intelligent lightbulbs.



so you will actually respect:

    User-agent: Googlebot
    Allow: /

    User-agent: *
    Disallow: /
Well, yes, it's good behavior to actually respect it, but I've already seen robots.txt files like this, which makes it really painful to create a competing search engine.
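
For anyone curious how that plays out in practice, here's a minimal sketch using Python's standard-library robotparser against the file above (the competing bot's name is made up):

    from urllib.robotparser import RobotFileParser

    # The robots.txt from the parent comment, as a list of lines.
    rules = [
        "User-agent: Googlebot",
        "Allow: /",
        "",
        "User-agent: *",
        "Disallow: /",
    ]

    parser = RobotFileParser()
    parser.parse(rules)

    # Googlebot matches its own group and may fetch anything.
    print(parser.can_fetch("Googlebot", "/any/page"))       # True
    # Every other crawler falls into the catch-all group and is blocked.
    print(parser.can_fetch("MyCompetingBot", "/any/page"))  # False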


This has me ridiculously curious now. Is that common? Other than a random sampling of sites I go to, is there a good way to get numbers on how often this is used?

Edit: In my scanning, I have to confess that Wikipedia's robots.txt is the best. It's fairly heavily commented, explaining why the rules are there. https://en.wikipedia.org/robots.txt


I analyzed the top 1 million robots.txt files looking for sites that allow Google and block everyone else here: https://www.benfrederickson.com/robots-txt-analysis/ - it's a relatively common pattern among major websites.
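
Not the code from that analysis, but a rough standard-library sketch of how one might test a domain for the "Googlebot allowed, everyone else blocked" pattern (the domains and the generic bot name are placeholders):

    from urllib.robotparser import RobotFileParser

    def favors_google(domain: str) -> bool:
        # Fetch and parse the site's robots.txt.
        parser = RobotFileParser(f"https://{domain}/robots.txt")
        parser.read()
        # Pattern: Googlebot may crawl the root, a generic bot may not.
        google_ok = parser.can_fetch("Googlebot", "/")
        others_ok = parser.can_fetch("SomeOtherBot", "/")
        return google_ok and not others_ok

    for domain in ["example.com", "example.org"]:  # placeholder list
        try:
            print(domain, favors_google(domain))
        except OSError:
            print(domain, "robots.txt not reachable")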


I ran a YaCy web crawler (P2P web search, https://yacy.net) a while ago. As far as I remember, I only saw Yandex disallowed in robots.txt a few times when I had trouble crawling a site. Mostly I just got an empty page served to my YaCy crawler instead of the "real" website.


Just do what browsers did with user agent strings: call your bot "botx (google crawler compatible)" and crawl everything that allows Googlebot, without any weight on your conscience.
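
For what it's worth, that's just a request header; a purely illustrative sketch (standard library only, URL made up):

    import urllib.request

    # The "compatible"-style user agent string described above is simply
    # sent as the User-Agent header on each request.
    req = urllib.request.Request(
        "https://example.com/",
        headers={"User-Agent": "botx (google crawler compatible)"},
    )
    with urllib.request.urlopen(req) as resp:
        html = resp.read()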



