Ask HN: How to opt out of ChatGPT crawling?
7 points by marcopicentini on March 13, 2023 | 7 comments
We have a large public content website, and ChatGPT has crawled it without our permission.

How can we block the ChatGPT user agent from stealing our written knowledge?




And this is what ChatGPT told me when I asked her.... ;-)

"...It is possible that some of the text data used to train me may have been collected from sources that did not respect robots.txt rules. However, I have no way of knowing for sure, as I do not have access to that information. Nonetheless, it is important to always respect the directives set forth in robots.txt files when collecting data from the web, as they are intended to help website owners control access to their content..."


Do we know if ChatGPT respects robots.txt?


ChatGPT does not access websites at this point. What you want to know is whether the curators of public datasets, such as Common Crawl, respect robots.txt.

I would assume OP cares about their site being used for training more generally, like with LLaMA and everything else that will come out moving forward?


Is that definitely the case? When I tried to ask it about the SVB incident, it first claimed I was mistaken and that nothing out of the ordinary was happening. When I persisted, it said it checked news sites to find more information since its latest data was from 2021 (though it claimed _not_ to find anything signaling SVB was in trouble, so I am not sure if it "lied" or just wasn't effective). When I linked to a specific news article describing the issue, ChatGPT responded in a way that made it clear it accessed the article and parsed its content.


Here's how to make the site accessible to search engines, but not to Common Crawl's bot (CCBot):

https://commoncrawl.org/big-picture/frequently-asked-questio...

e.g.:

    User-agent: CCBot
    Disallow: /
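
If you want to sanity-check the rule before deploying it, here is a minimal sketch using Python's standard-library robots.txt parser. The `is_allowed` helper and the example.com URL are made up for illustration, and "Googlebot" is just a stand-in for any crawler the rule does not name. This only shows how a compliant parser reads the file; it says nothing about whether a given crawler actually honors it.

```python
from urllib.robotparser import RobotFileParser

# The rule from the comment above: block only Common Crawl's bot.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Hypothetical helper: would a compliant crawler be allowed to fetch url?"""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/1"  # placeholder URL
    for agent in ("CCBot", "Googlebot"):
        print(agent, "allowed" if is_allowed(ROBOTS_TXT, agent, url) else "blocked")
```

Agents without a matching `User-agent` group and with no `*` group fall through to "allowed", which is why search engines are unaffected by this file.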


"...public content website..."



