
FWIW, that’s why I’m working on a platform[1] to help devs deploy polite crawlers and scrapers that respect robots.txt (and 429s, Retry-After response headers, etc.) out of the box. It also happens to be entirely built on Cloudflare.

[1] https://crawlspace.dev


> AFAIK OpenAI currently respects robots.txt

I wonder to what degree -- do they respect the Crawl-delay directive, for example? HN itself has a 30-second crawl-delay (https://news.ycombinator.com/robots.txt), meaning that crawlers are supposed to wait 30 seconds before requesting the next page. I doubt ChatGPT will delay a user's search of HN by up to 30 seconds, even though that's what robots.txt instructs it to do.
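For the curious, Python's standard library can read that directive directly. A quick sketch (the "ExampleBot" user agent string is made up; swap in whatever agent you're checking):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://news.ycombinator.com/robots.txt")
    rp.read()  # fetch and parse the live robots.txt

    # Is this agent allowed to fetch the front page at all?
    print(rp.can_fetch("ExampleBot", "https://news.ycombinator.com/"))
    # The Crawl-delay that applies to this agent -- 30 seconds, per the directive mentioned above
    print(rp.crawl_delay("ExampleBot"))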


Would ChatGPT, when interacting live with a user, even have to respect robots.txt? I would think robots.txt only applies to automated crawling. When directed by a user, one could argue that ChatGPT is basically the user agent the user is using to view the web. If you wrote a browser extension that shows the reading time for every search result on Google, would you respect robots.txt when prefetching those result pages? I probably wouldn't, because that's not really automated crawling to me.


> there's typically a 5-7 day gap between updating the robots.txt file and crawlers processing it

You could try moving your favicon to another dir, or root dir, for the time being, and update your HTML to match. That way it would be allowed according to the version that Google still has cached. Also, I think browsers look for a favicon at /favicon.ico regardless, so it might be worth making a copy there too.


/favicon.ico is the default, and it will be loaded if your page does not specify a different path in its metadata. In my experience, though, most clients respect the metadata and won't try to fetch the default path until after the <head> section of the page has loaded, at least for HTML content.

But non-HTML content has no choice but to use the default so it's generally a good idea to make sure the default path resolves.
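To make that lookup order concrete, here's a rough sketch of a client resolving a page's favicon: prefer an icon declared in the <head>, fall back to /favicon.ico. (example.com is just a placeholder URL.)

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class IconLinkParser(HTMLParser):
        # Remember the first <link rel="... icon ..."> href seen in the page.
        def __init__(self):
            super().__init__()
            self.icon_href = None

        def handle_starttag(self, tag, attrs):
            if tag == "link" and self.icon_href is None:
                attrs = dict(attrs)
                rels = (attrs.get("rel") or "").lower().split()
                if "icon" in rels and attrs.get("href"):
                    self.icon_href = attrs["href"]

    def favicon_url(page_url):
        html = urlopen(page_url).read().decode("utf-8", errors="replace")
        parser = IconLinkParser()
        parser.feed(html)
        # If the page's metadata names an icon, use it; otherwise fall back to the default path.
        return urljoin(page_url, parser.icon_href or "/favicon.ico")

    print(favicon_url("https://example.com/"))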


> won't try to fetch the default path until after the <head> section of the page loads for HTML content

That's a really interesting optimization. How did you discover this? Reading source?


Thanks for sharing, I didn't know that browsers look for a favicon at /favicon.ico.


I think it dates from the Internet Explorer days. .ico is an actual icon file format on Windows, and IIRC the icons were originally in that format, with support for GIF coming when Netscape adopted the feature.


Many browsers will accept a favicon.ico that's actually a PNG file with no issues.


[Shameless plug] I'm building a platform[1] that abides by robots.txt, the Crawl-delay directive, 429s, the Retry-After response header, etc. out of the box. Polite crawling behavior as a default + centralized caching would decongest the network and be better for website owners.

[1] https://crawlspace.dev
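For illustration, the 429/Retry-After part of "polite" looks roughly like this (just a sketch of the general pattern using the requests library, not how the platform is actually implemented; the user agent string is made up):

    import time
    import requests

    def polite_get(url, max_retries=3):
        for _ in range(max_retries):
            resp = requests.get(url, headers={"User-Agent": "ExampleBot/1.0"})
            if resp.status_code != 429:
                return resp
            # Retry-After may be a number of seconds; fall back to a fixed wait
            # if the header is missing or is an HTTP-date instead.
            retry_after = resp.headers.get("Retry-After", "")
            time.sleep(float(retry_after) if retry_after.isdigit() else 30.0)
        return resp

    page = polite_get("https://example.com/some-page")
    print(page.status_code)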


I'm building a platform for developers to build, deploy, and share web crawlers. It's built entirely on Cloudflare and aims to be the cheapest and easiest solution for crawling tens of millions of pages.

https://crawlspace.dev


I think OP was referring to how fast someone built something with Anthropic's new Computer Use product, as it was only announced yesterday.


I’m working on a platform[1] (built on Cloudflare!) that lets devs deploy well-behaved crawlers by default, respecting robots.txt, 429s, etc. The hope is that we can introduce a centralized caching layer to alleviate network congestion from bot traffic.

[1] https://crawlspace.dev
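To give a sense of what that caching can look like at the HTTP level, here's a minimal sketch of conditional requests with ETags (an illustration of the general idea, not how crawlspace.dev works internally; the user agent string is made up):

    import requests

    cache = {}  # url -> (etag, body)

    def cached_get(url):
        headers = {"User-Agent": "ExampleBot/1.0"}
        if url in cache:
            # Ask the server to skip the body if nothing changed since last time.
            headers["If-None-Match"] = cache[url][0]
        resp = requests.get(url, headers=headers)
        if resp.status_code == 304 and url in cache:
            return cache[url][1]  # 304 Not Modified: reuse the cached body
        etag = resp.headers.get("ETag")
        if etag:
            cache[url] = (etag, resp.content)
        return resp.content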


I love the sentiment, but the real issue is one of incentives, not ability. The problematic crawlers have more than enough technical ability to minimize their impact; they just don't have a reason to care right now.


That sentence stood out to me too, very powerful. Felt it right in the feels.


Have you tried the Structured Output feature that OpenAI released last week?


I'd be curious to know the answer to this also.


I still don’t understand the business model of releasing open source gen AI models. If this took 3072 H100s to train, why are they releasing it for free? I understand they charge people who rent it from their platform, but why permit people to run it themselves?


> but why permit people to run it themselves?

I wouldn't worry about that if I were them: it's been shown again and again that people will pay for convenience.

What I'd worry about is Amazon/Cloudflare repackaging my model and outcompeting my platform.


> What I'd worry about is Amazon/Cloudflare repackaging my model and outcompeting my platform.

Why let Amazon/Cloudflare repackage it?


How would you stop them?

The license is Apache 2.


That's my question -- why license it as Apache 2?


What license would allow complete freedom for everyone else, but constrain Amazon and Cloudflare?


They could just create a custom license based on Apache 2.0 that allows sharing but constrains some specific behavior. It won't be formally Open Source, but it will have enough open source spirit that academics and normal people will be happy to use it.


The LLaMa license is a good start.

