FWIW, that’s why I’m working on a platform[1] to help devs deploy polite crawlers and scrapers that respect robots.txt (and 429s, Retry-After response headers, etc.) out of the box. It also happens to be built entirely on Cloudflare.
I wonder to what degree -- do they respect the Crawl-delay directive, for example? HN itself has a 30-second crawl-delay (https://news.ycombinator.com/robots.txt), meaning that crawlers are supposed to wait 30 seconds between requests. I doubt ChatGPT will delay a user's search of HN by up to 30 seconds, even though that's what robots.txt instructs it to do.
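For what it's worth, Crawl-delay is easy to check programmatically; here's a quick Python sketch using the standard library (untested, and assuming HN's robots.txt still carries the directive):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://news.ycombinator.com/robots.txt")
    rp.read()

    # crawl_delay() returns the Crawl-delay value in seconds for the given
    # user agent, or None if the directive is absent.
    delay = rp.crawl_delay("*")
    print(delay)  # should print 30 while HN specifies Crawl-delay: 30

    # A polite crawler would sleep `delay` seconds between requests; a live
    # "search the web for me" agent almost certainly won't wait that long.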
Would ChatGPT, when interacting live with a user, even have to respect robots.txt? I would think robots.txt only applies to automated crawling. When directed by a user, one could argue that ChatGPT is basically the user agent the user is using to view the web. If you wrote a browser extension that shows the reading time for every search result on Google, would you respect robots.txt when prefetching the result pages? I probably wouldn’t, because that’s not really automated crawling to me.
> there's typically a 5-7 day gap between updating the robots.txt file and crawlers processing it
You could try moving your favicon to another directory (or the root) for the time being and updating your HTML to match. That way it would be allowed according to the version of robots.txt that Google still has cached. Also, I think browsers look for a favicon at /favicon.ico regardless, so it might be worth keeping a copy there too.
/favicon.ico is the default, and it will be loaded if your page does not specify a different path in its metadata. In my experience, though, most clients respect the metadata for HTML content and won't try to fetch the default path until after the <head> section of the page has loaded.
But non-HTML content has no choice but to use the default, so it's generally a good idea to make sure the default path resolves.
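To make that concrete, a client resolving a favicon behaves roughly like this (a minimal Python sketch; IconLinkFinder and favicon_url are hypothetical names, and it only handles the common <link rel="icon"> case):

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class IconLinkFinder(HTMLParser):
        """Collects href values from <link rel="icon" ...> tags."""
        def __init__(self):
            super().__init__()
            self.icons = []

        def handle_starttag(self, tag, attrs):
            if tag == "link":
                attrs = dict(attrs)
                rels = (attrs.get("rel") or "").lower().split()
                if "icon" in rels and attrs.get("href"):
                    self.icons.append(attrs["href"])

    def favicon_url(page_url, html_text):
        # Prefer whatever the HTML metadata declares...
        finder = IconLinkFinder()
        finder.feed(html_text)
        if finder.icons:
            return urljoin(page_url, finder.icons[0])
        # ...and only fall back to the conventional default path otherwise.
        return urljoin(page_url, "/favicon.ico")

For non-HTML responses there is no metadata to consult, so the fallback branch is the only option.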
I think it's from the Internet Explorer days. .ico is an actual icon file format on Windows, and IIRC the icons were originally in that format, with GIF support coming later once Netscape supported the feature.
[Shameless plug] I'm building a platform[1] that abides by robots.txt, the Crawl-delay directive, 429s, the Retry-After response header, etc. out of the box. Polite crawling behavior as a default, plus centralized caching, would decongest the network and be better for website owners.
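Concretely, "polite" means defaults along these lines; a simplified Python sketch (hypothetical, not the platform's actual code, and it only handles the seconds form of Retry-After):

    import time
    import requests

    def polite_get(url, max_retries=3, default_backoff=5.0):
        """Fetch a URL, backing off when the server returns 429 Too Many Requests."""
        for attempt in range(max_retries + 1):
            resp = requests.get(url, headers={"User-Agent": "polite-crawler-example"})
            if resp.status_code != 429:
                return resp
            # Honor Retry-After if the server sent one; otherwise fall back
            # to exponential backoff.
            retry_after = resp.headers.get("Retry-After")
            try:
                wait = float(retry_after)
            except (TypeError, ValueError):
                wait = default_backoff * (2 ** attempt)
            time.sleep(wait)
        return resp

The same idea extends to Crawl-delay: read it once per host and treat it as the floor for time between requests.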
I'm building a platform for developers to build, deploy, and share web crawlers. It's built entirely on Cloudflare and aims to be the cheapest and easiest solution for crawling tens of millions of pages.
I’m working on a platform[1] (built on Cloudflare!) that lets devs deploy well-behaved crawlers by default, respecting robots.txt, 429s, etc. The hope is that we can introduce a centralized caching layer to alleviate network congestion from bot traffic.
I love the sentiment, but the real issue is one of incentives, not ability. The problematic crawlers have more than enough technical ability to minimize their impact; they just don't have a reason to care right now.
I still don’t understand the business model of releasing open-source gen AI models. If this took 3072 H100s to train, why release it for free? I understand they charge people to run it on their platform, but why permit people to run it themselves?
They could just create a custom license based on Apache 2.0 that allows sharing but constrains some specific behavior. It wouldn't be formally Open Source, but it would have enough open-source spirit that academics and ordinary users would be happy to use it.
[1] https://crawlspace.dev