
I don't see any issue at all if the bot is respecting the robots.txt file. Any malicious user will figure out some other evil way anyway, be it Nutch or a network of intelligent lightbulbs.



so you will actually respect:

    User-agent: Googlebot
    Allow: /

    User-agent: *
    Disallow: /
Well, yes, it's good behavior to actually respect it, but I've already seen robots.txt files like this, which makes it really painful to create a competing search engine.
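
For anyone curious how that plays out in practice, here's a minimal sketch using Python's standard-library robotparser against the file above (the competing bot's name is made up):

    from urllib.robotparser import RobotFileParser

    # The robots.txt from the parent comment, as a list of lines.
    rules = [
        "User-agent: Googlebot",
        "Allow: /",
        "",
        "User-agent: *",
        "Disallow: /",
    ]

    parser = RobotFileParser()
    parser.parse(rules)

    # Googlebot matches its own group and may fetch anything.
    print(parser.can_fetch("Googlebot", "/any/page"))       # True
    # Every other crawler falls into the catch-all group and is blocked.
    print(parser.can_fetch("MyCompetingBot", "/any/page"))  # False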


This has me ridiculously curious now. Is that common? Other than a random sampling of sites I go to, is there a good way to get numbers on how often this is used?

Edit: In my scanning, I have to confess that Wikipedia's robots.txt is the best. It's fairly heavily commented, explaining why the rules are there. https://en.wikipedia.org/robots.txt


I analyzed the top 1 million robots.txt files looking for sites that allow Google and block everyone else here: https://www.benfrederickson.com/robots-txt-analysis/ - it's a relatively common pattern among major websites.
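
Not the code from that analysis, but a rough standard-library sketch of how one might test a domain for the "Googlebot allowed, everyone else blocked" pattern (the domains and the generic bot name are placeholders):

    from urllib.robotparser import RobotFileParser

    def favors_google(domain: str) -> bool:
        # Fetch and parse the site's robots.txt.
        parser = RobotFileParser(f"https://{domain}/robots.txt")
        parser.read()
        # Pattern: Googlebot may crawl the root, a generic bot may not.
        google_ok = parser.can_fetch("Googlebot", "/")
        others_ok = parser.can_fetch("SomeOtherBot", "/")
        return google_ok and not others_ok

    for domain in ["example.com", "example.org"]:  # placeholder list
        try:
            print(domain, favors_google(domain))
        except OSError:
            print(domain, "robots.txt not reachable")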


I ran a YaCy web crawler (P2P web search, https://yacy.net) a while ago. As far as I remember, I only saw Yandex disallowed in robots.txt a few times when I had trouble crawling a site. Mostly I just got an empty page served to my YaCy crawler instead of the "real" website.


Just do what browsers did with user agent strings: call your bot "botx (google crawler compatible)" and crawl everything that allows Googlebot, without any weight on your conscience.
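
For what it's worth, that's just a request header; a purely illustrative sketch (standard library only, URL made up):

    import urllib.request

    # The "compatible"-style user agent string described above is simply
    # sent as the User-Agent header on each request.
    req = urllib.request.Request(
        "https://example.com/",
        headers={"User-Agent": "botx (google crawler compatible)"},
    )
    with urllib.request.urlopen(req) as resp:
        html = resp.read()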



