
Can we not put lines in our robots.txt files to block being crawled?

There was a bunch of reporting on how AI companies and researchers were using tools that ignored robots.txt. It's a "polite request" that these companies had a strong incentive to ignore, so they did. That incentive is still there, so it is likely that some of them will continue to do so.

CommonCrawl[0] and the companies training models I'm aware of[1][2][3] all respect robots.txt for their crawling.

If we're thinking of the same reporting, it was based on a claim by TollBit (a content licensing startup), which was in turn based on the fact that "Perplexity had a feature where a user could prompt a specific URL within the answer engine to summarize it". Actions performed by tools acting as a user agent (like archive.today, or a webpage-to-PDF site, or a translation site) aren't crawlers and aren't what robots.txt is designed for, but either way the feature is disabled now.

[0]: https://commoncrawl.org/faq

[1]: https://platform.openai.com/docs/bots

[2]: https://support.anthropic.com/en/articles/8896518-does-anthr...

[3]: https://blog.google/technology/ai/an-update-on-web-publisher...
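
For reference, opting all four of those crawlers out looks something like the stanzas below; the user-agent tokens (CCBot, GPTBot, ClaudeBot, Google-Extended) are the ones the linked docs publish, but check those pages for the current list:

    User-agent: CCBot
    Disallow: /

    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /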


These policies are much clearer than they were when last I looked, which is good. On the other hand, Perplexity appeared to ignore robots.txt as part of a search-enhanced retrieval scheme, at least as recently as June of this year. The article title is pretty unkind, but the test they used pretty clearly shows what was going on.

https://www.wired.com/story/perplexity-is-a-bullshit-machine...

It takes this sort of critical scrutiny, otherwise mechanisms like robots.txt do get ignored, whether willfully or mistakenly.


> The article title is pretty unkind, but the test they used pretty clearly shows what was going on.

I believe this article is based on the same misunderstanding - it doesn't appear to show any evidence of their crawler, or of web scraping used for training, accessing pages prohibited by robots.txt.


Robots.txt is a suggestion. So is a company's own reporting on whether it respects it.

The companies that are ignoring robots.txt are also probably the companies not advertising that they are ignoring robots.txt.


The EU's AI Act points to the DSM Directive's text and data mining exemption, which allows commercial data mining so long as machine-readable opt-outs are respected; robots.txt is typically taken as the established standard for this.

In the US it is a suggestion (so long as Fair Use holds up) but all I've seen suggests that the major players are respecting it, and minor players tend to just use CommonCrawl which also does. Definitely possible that some slip through the cracks, but I don't think it's as useless as is being suggested.


Technically, robots.txt doesn't enforce anything, so it is just trust.

""OpenAI CTO doesn't know what data was used to train the company's video generating platform, Sora""

https://www.youtube.com/watch?v=4AYbZG3h14w

Funny. If I can browse to it, it is public, right? That is how some people's logic goes. And how OpenAI argued two years ago when GPT-3.5/ChatGPT first started getting traction.


> Technically, robots.txt doesn't enforce anything, so it is just trust.

There's legal backing to it in the EU, as mentioned. With CommonCrawl you can just download it yourself to check. In other cases it wouldn't necessarily be as immediately obvious, but through monitoring IPs/behavior in access logs (or even prompting the LLM to see what information it has) it would be possible to catch them out if they were lying - like Perplexity were "caught out" in the mentioned case.
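
As a rough sketch of what that log monitoring could look like (the log path, disallowed prefixes, and bot tokens are all illustrative), assuming the common nginx/Apache "combined" log format:

    import re

    # Hypothetical values, purely for illustration.
    DISALLOWED = ("/private/", "/members/")
    BOT_UAS = ("GPTBot", "CCBot", "ClaudeBot")

    with open("access.log") as log:  # "combined" format lines
        for line in log:
            # Capture the request path and the final quoted field (user agent).
            m = re.search(r'"[A-Z]+ (\S+) HTTP[^"]*".*"([^"]*)"\s*$', line)
            if not m:
                continue
            path, ua = m.groups()
            if any(path.startswith(p) for p in DISALLOWED) and \
               any(bot in ua for bot in BOT_UAS):
                print("possible robots.txt violation:", line.strip())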

> Funny. If I can browse to it, it is public, right? That is how some people's logic goes. And how OpenAI argued two years ago when GPT-3.5/ChatGPT first started getting traction.

If you mean public as in the opposite of private, I think that's pretty much true by definition. Information's no longer private when you're putting it on the public Internet.

If you mean public as in public domain, I don't think that has been argued to be the case. The argument is that it's fair use (that is, the content is still under copyright, but fitting statistical models is substantially transformative/etc.)


AI companies are ignoring robots.txt in the race to slurp up the entire internet [1].

[1] https://www.reuters.com/technology/artificial-intelligence/m...


Yeah, and even better: share blocklists of known AI crawler IPs so we can just block them (rough sketch below). Robots.txt is too voluntary.

Someone who doesn't care about polluting our corpus is not going to care about your robots.txt
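
A minimal nginx sketch of that combination; the user-agent list and the blocklist filename are illustrative, and the included file would hold the shared "deny <ip>;" lines:

    # http context: flag self-identifying AI crawlers by user agent.
    map $http_user_agent $is_ai_bot {
        default 0;
        ~*(GPTBot|CCBot|ClaudeBot|Bytespider) 1;
    }

    server {
        listen 80;

        # Hypothetical shared blocklist, one "deny 203.0.113.0/24;" per line.
        include /etc/nginx/ai_crawler_ips.conf;

        if ($is_ai_bot) {
            return 403;
        }
    }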

It sincerely pleases me to see the Amiga so rightfully discussed in this article. In the 1980s, Amiga was a magical computer years ahead of so many of its peers (including the PC by miles). Sadly, the video capabilities that made it so special eventually became its Achilles heel.

>Sadly, the video capabilities that made it so special eventually became its Achilles heel.

How weird: I was browsing YouTube last night (with the SmartTube app) and somehow stumbled on a video discussing this exact thing. It basically made the case that Wolfenstein 3D killed the Amiga: the unique video capabilities that made it great for 2D side-scrollers made it very difficult to get an FPS working well, because apparently the Amiga didn't have direct framebuffer access the way PCs did with VGA mode 0x13.


It certainly has direct framebuffer access. But the bitplane representation, where the bits of each pixel's value are spread out across multiple bytes, can make certain kinds of updates very time-consuming.
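
Here's a rough Python sketch of the difference (sizes and names are illustrative): in a chunky framebuffer like VGA mode 0x13, plotting a pixel is a single byte write, while a planar layout costs one read-modify-write per bitplane.

    # A 320x200 display with 5 bitplanes (32 colours), as on OCS Amigas.
    WIDTH, HEIGHT, DEPTH = 320, 200, 5
    ROW_BYTES = WIDTH // 8

    # Chunky (VGA mode 0x13 style): one byte per pixel.
    chunky = bytearray(WIDTH * HEIGHT)

    def put_pixel_chunky(x, y, colour):
        chunky[y * WIDTH + x] = colour          # a single memory write

    # Planar (Amiga style): bit n of the colour index lives in bitplane n.
    planes = [bytearray(ROW_BYTES * HEIGHT) for _ in range(DEPTH)]

    def put_pixel_planar(x, y, colour):
        offset = y * ROW_BYTES + x // 8
        mask = 0x80 >> (x % 8)
        for n in range(DEPTH):                  # DEPTH read-modify-writes
            if colour & (1 << n):
                planes[n][offset] |= mask
            else:
                planes[n][offset] &= ~mask & 0xFF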

Yeah, that's what the video was discussing. Sorry, I got the terminology wrong.

It didn't exactly kill it. Wolfenstein being feasible on the PC and not the Amiga was just a symptom of stagnation. The Amiga (as a promising commercial venture!) had doom (pun intended) written all over it even before Wolfenstein. Commodore ignored the Amiga for years and years.

Edit: I just recalled something - the Amiga required either a TV or increasingly rare monitors with PAL/NTSC frequencies. You couldn't just walk into a computer shop and buy an Amiga and a VGA-compatible monitor. It was a flickery, low-resolution monitor or a TV. Not exactly endearing to professionals. I mean, I loved the Amiga maybe too much, it was always the underdog, but it was increasingly also the losing underdog.


A1200 and A4000 could be hooked into a VGA monitor for the flickerless experience. The caveat was that the flicker-free display modes were added on top of old ones, which meant that, while you could run Workbench and most applications on the VGA monitor, all games ran in the obsolete PAL modes your VGA display couldn’t handle. This created a market for niche dual-mode displays, which solved the problem, but were a bit pricey.

I used a VGA monitor with my Amiga 1200 (with an adapter). It was not flickery at all and was full resolution.

Yes, I know. I had that setup myself. :)

I posit, though, that by the time the Amiga 1200 was out, Amiga as a commercial venture was already dead in the water. The 1200 was a last-ditch effort. Still loved it, of course.


I remember that there was some sort of shareware with a 40-day trial that my brother ran, but it stayed at 40 days. They had removed the clock as a cost-saving measure on the A1200.

Yup, pretty desperate.


It might have had a rocky transition, but it was also very badly mismanaged by Commodore.

Amazing concept. I really love the almost Apple Newton/Palm Pilot vibes of the UI, too.

It's like when your joystick seems to be on the fritz, yet you keep going for way too long before deciding to do something else entirely.


The trick is to stop before you rub your joystick raw.

After reading the subtitle, 100% agreed.


Geraldo Rivera would like to live-stream entering it for the very first time during prime time for all the world to see.


But not before 50 minutes of media garbage.


The History Channel would turn it into an 18-episode season.


AnandTech was the high-water mark in tech journalism and the only place I'd go for in-depth (sometimes beyond belief) reviews of Apple hardware, with test results not found anywhere else on the web. Page after page after page of detailed tests and results.

Hard to imagine that type of content being lucrative from a display-ad point of view if they used traditional ad networks, but the effort was absolutely appreciated and respected by readers.

A sad day, but considering how the online ad market has tried to force publishers to focus on video content, an understandable one for printed-word journalists. It's awful.


This is true, and I second your sadness. They always had those two or three extra pages about architecture details at the start of every review that competitors didn't.

But apparently right now it pays more to do a cheap video review on YouTube with fake benchmarks: you get hundreds of thousands of video views, sell the hardware, and call it a day.


:( oh? That’s a shame.



Ah yes, same one. Posted straight from AP too.


If you click the little icon in the upper-left corner of the UI, you can change 'skins' as well. Very cool.

