Hacker News new | past | comments | ask | show | jobs | submit login
Tell HN: We should start to add “ai.txt” as we do for “robots.txt”
562 points by Jeannen on May 10, 2023 | hide | past | favorite | 281 comments
I started to add an ai.txt to my projects. The file is just a basic text file with some useful info about the website like what it is about, when was it published, the author, etc etc.

It can be great if the website somehow ends up in a training dataset (who knows), and it can be super helpful for AI website crawlers, instead of using thousands of tokens to know what your website is about, they can do it with just a few hundred.




Using robots.txt as a model for anything doesn't work. All a robots.txt is is a polite request to please follow the rules in it, there is no "legal" agreement to follow those rules, only a moral imperative.

Robots.txt has failed as a system, if it hadn't we wouldn't have captchas or Cloudflare.

In the age of AI we need to better understand where copyright applies to it, and potentially need reform of copyright to align legislation with what the public wants. We need test cases.

The thing I somewhat struggle with is that after 20-30 years of calls for shorter copyright terms, lesser restrictions on content you access publicly, and what you can do with it, we are now in the situation where the arguments are quickly leaning the other way. "We" now want stricter copyright law when it comes to AI, but at the same time shorter copyright duration...

In many ways an ai.txt would be worse than doing nothing as it's a meaningless veneer that would be ignored, but pointed to as the answer.


> "We" now want stricter copyright law when it comes to AI, but at the same time shorter copyright duration...

This gross generalization of other people's views on important issues is really offensive.

My view is that the Copyright Act of 1976 had it about right when they established the duration of copyright. My view is that members of Congress were handsomely rewarded by a specific corporation to carve out special exceptions to this law because they wanted larger profits. "We" didn't call the Copyright Term Extension Act of 1998 the "Mickey Mouse Act" for nothing. It's also no coincidence that Disney is now the largest media company in the world.

Reducing copyright term extension has everything to do with restoring competition and creativity to our economy, and reversing corruption that borders on white collar crime. It has nothing to do with AI. Don't recruit me into some bullshit argument that rewrites history and entrenches Disney's ill-gotten monopoly.


I think they nailed it with the original 1790 act. 14 years + 14 more is plenty.


Same. The very nature of information is that it yearns to be free. Information cannot be "owned." The point of copyright should be to grant temporary monopolies to encourage creation, not to confer ownership.

Thomas Jefferson put it beautifully:

If nature has made any one thing less susceptible than all others of exclusive property, it is the action of the thinking power called an idea, which an individual may exclusively possess as long as he keeps it to himself; but the moment it is divulged, it forces itself into the possession of every one, and the receiver cannot dispossess himself of it. Its peculiar character, too, is that no one possesses the less, because every other possesses the whole of it. He who receives an idea from me, receives instruction himself without lessening mine; as he who lights his taper at mine, receives light without darkening me. That ideas should freely spread from one to another over the globe, for the moral and mutual instruction of man, and improvement of his condition, seems to have been peculiarly and benevolently designed by nature, when she made them, like fire, expansible over all space, without lessening their density in any point, and like the air in which we breathe, move, and have our physical being, incapable of confinement or exclusive appropriation. Inventions then cannot, in nature, be a subject of property.


> The very nature of information is that it yearns to be free.

Information wants you to stop anthropomorphizing it.


I've said before that it gets tiring playing word games to avoid the suggestion that certain natural pressures have personal agency. Information wants to be free like a rock wants to roll downhill.


Oh wow, I like this one!


Not merely the point of copyright, but also the basis for copyright in the USA.

Specifically all forms of intellectual property in the USA trace back to Article I Section 8, Clause 8 of the Constitution. Which gives Congress the power, "To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries".


> limited times

Technically life + 70 years - or 1 million years for that matter - is "limited" - but I imagine 14+14 is probably closer to what they had in mind.


> very nature of information is that it yearns to be free. Information cannot be "owned."

The nature of information is to dissolve into entropy.


You bought a Western Digital drive too, eh?


Why doesn’t this platform offer a “like” button for answers?


Welcome, newcomer. As flattered as I am by your want to "like", my comment was not informative and even borderline trolling (by changing the subject). On HN, such comments are better to downvote. Please don't start upvoting comments that stray from the issue under discussion, funny as they might be.


You can click the up arrowhead to up-vote.


Underrated comment!


The nature of the universe is to tend toward heat death. Meanwhile, here and now, the nature of information is to either reproduce (to be free) _or else_ dissolve into entropy.


Information and entropy are more or less the same thing. Ask Shannon.


but copyright is not for information or ideas, information and ideas cannot be copyrighted; it's for creative expression


09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0


I wonder how many people think this is a string... and don't know about these magic numbers.


and why should "creative expression" be owned?


Why should land be owned? None of us created the planet...

But we have selected an economic system that depends on ownership to drive exchange in a market, so... that's why.


I'd argue that land is owned because it's a finite resource, and that without property ownership people would be in conflict with one another. "Creative expression" is not finite, in fact every human possesses it, it's also intangible, it's ideas, thoughts, ... , which I personally do not believe should be owned.


> "Creative expression" is not finite

It absolutely is.

Doing it at all requires time & attentive focus, which is a finite resource for anybody mortal, and moreover a resource that's scarce and has to be spent in multiple places.

Doing it well requires significant investment in practice and training, often years of it, maybe even decades in order to develop certain levels of expressive fluency.

As with any issue of scarcity, economics comes in. If you want this activity supported, one good way of doing it is enabling the investment of time. Copyright does this by giving people an economic/legal claim on how copies of their work are distributed.

Paying for copies has the usual market merits -- the economic reward and signals of value are proportional to copies acquired. There are other ways of course, common ones brought up here are patronage and merchandising, but they lose the market merits, and both are basically another way of saying "nobody should have to pay for the value in your work directly," and merchandising is even worse in that it's basically saying "yeah, you'll just need another job to support yourself while you're doing this thing", which is time taken away from investment in the creative endeavor, so you'll get less of the actual endeavor.


I think the concept that PP may be trying to get across is scarcity:

"goods are scarce because there are not enough resources to produce all the goods that people want to consume".(quoted at [1])

Physical books are intrinsically scarce because they require physical resources to make and distribute copies. Libraries are often limited by physical shelf space.

Ebooks are not intrinsically scarce because there are enough resources to enable anyone on the internet to download any one of millions of ebooks at close to zero marginal cost, with minimal physical space requirements per book. Archive.org and Z-Library are examples of this.

Consider also free goods:

"Examples of free goods are ideas and works that are reproducible at zero cost, or almost zero cost. For example, if someone invents a new device, many people could copy this invention, with no danger of this "resource" running out."[2]

[1] https://en.wikipedia.org/wiki/Scarcity

[2] https://en.wikipedia.org/wiki/Free_good


> the concept that PP may be trying to get across is scarcity:

It's pretty mysterious that you think you need to introduce this to the conversation at this point given how prominently scarcity dynamics figure into the comment you're replying to.

> Physical books are intrinsically scarce

Once their production was industrialized with printing press tech, copies of books weren't scarce, they were actually revolutionarily cheap.

The copyright bargain isn't borne out of ignorance of how changes in that direction affect the overall dynamic, it's borne out of deep understanding of what remains scarce and risky and difficult to compensate for when the marginal cost of producing copies drops drastically, and what kind of claims might help.


Actually I was replying to both of you (sadly not an obvious structural way to do that on HN), but perhaps I should have made it clearer that the "finite" concept PP was trying to get across actually seems to be scarcity - land is scarce, paper books less so - and intangible goods such as ebooks are not scarce at all (DRM attempts notwithstanding.)

Authorship may be scarce - costly and resource intensive (LLMs notwithstanding) as you describe, while copying and distribution of intangible goods like ideas or digital media is essentially free and unlimited, as I suspect PP was trying to say.

As you correctly note, the constitutional copyright bargain permits a limited time monopoly in return for (hopefully) advancing "the progress of science and the useful arts."


I think you're confusing creation and expression.

Expression has no value in today's digital world.

Creation has value but using expression to exchange for that value is difficult, requiring limits on expression in order for the system to work.


This idea is not without detractors.

"Property is theft" is not a new idea, makes a lot of sense. Unless you have a lot of it, and then those [censored] can [censored] right off.


People are still in conflict with each other for property ownership, so it's not solved

The ownership, with heavy taxes on that ownership, pushes towards making sure people benefit from the land.


> people benefit from the land

"which people in particular are benefitting the most" seems to be the perennial question.


> But we have selected an economic system that depends on ownership to drive exchange in a market, so... that's why.

For extremely loose values of "we", perhaps - I didn't select it, and I would vote "no" if the idea were proposed...


> Why should land be owned?

it shouldn't. Or, well, it should, but it should be the one and only thing taxed: https://en.wikipedia.org/wiki/Georgism


I live in Washington state where the state taxes are mainly sales tax and property tax. Both end up being regressive. Sales tax is because it is not proportional to income. You might think that property tax would hit higher income people more but what happens is that the property tax makes homes more expensive for low income people and is also passed on to renters in their monthly rent.


Where do you get to actually own own land like you do copyright? Maybe we can add property taxes to copyright to force people to give it up just like land.


Taxing IP is an interesting idea; we do tax income from it, but I think an increasing scale of "you have to pay more to keep this for longer" would be pretty reasonable.


Because creative expression can be exchanged for goods and services? Why should metal, wood, or special paper notes be owned? It's to represent work done and value to other people.


Nope. Metal & wood etc should be owned because it very much looks like that is very useful in creating lots of welfare for people.

The trouble with IP is that there are lots of influential people that very much would like IP to be useful in creating welfare. Unfortunately the evidence for that is surprisingly scarce. For discussion, see e.g. Boldrin & Levine


It should be owned as long as people must rely on ownership to survive in our society.


I would settle for 14 + 14 too :)


You probably wouldn't if you were the owner of the Marvel franchise or other such cash cows.

Copyright that doesn't expire would make "a whole lot of cents".

(I agree with you but, the ownership is the corrupting factor.)


My biggest critique of copyright is that is unnecessarily collapses financial reward & creative control. It also pegs both as starting at creation - which is not a particularly meaningful point for either problem.

IMO I would rather a structure that:

- Guarantees creators (and their descendants) some number of years of financial benefit / veto (30 seems fine!) - i.e. pay me what I want or you can't use this creative work.

- Separately grant creators the ability to veto "official" projects that use their creative output in their lifetimes.

IMO, it seems like there's a productive "middle ground" between total control and anything goes. After the 30 year benefit expired, you couldn't sue for damages - just costs & to stop use.


I've heard of a structure in France that's translated as "moral rights" of a work. I met a guy who was the moral rights holder for a deceased author and had the right to veto large and small elements of the representation of the characters, but received no royalties from the works.


> After the 30 year benefit expired, you couldn't sue for damages - just costs & to stop use.

That's the same thing.

No one can use my stuff..........(unless you pay me royalties).


Look - it's absolutely not the same thing. The point is to allow non-profit-seeking uses first. To push off the free-for-all of commercialization until after an interstitial period.

You can certainly pay the rights holder to use their property! Still! You could do it even without copyright I suppose. However, I think a space where it costs time and money for the rights holder to try to stop use and they won't get paid for it is super useful.

Consider this in the case of software as well - you get ~30 years of benefit from your work, but you can refuse to allow companies to incorporate it into their products as long as you live. Whichever companies you want! You can also not do that.


My phrasing was absolutely not meant to be read as myself speaking for all, apologies, I certainly don't want to offend.

It has felt on HN and elsewhere that the prevailing attitude to copyright has been these two, somewhat contradictory, things. That's what I was trying to highlight with my phrasing of "we", which was also not meant to include myself but be a nod to the way a vocal group try to steer and dominate the conversion.

Both debates are important to have, I don't know the answers.


Thank you! I think the average HN'er is frankly pretty ignorant about how copyright law works, the history around it, and the arguments for and against various reforms. In fairness it's an esoteric topic and most software developers depend in some way on copyrighted work for their income so that's not a huge surprise I guess. But it probably explains the contradiction you observed!

The #1 issue with copyright today in my opinion is that if we keep on extending it forever, it will forever entrench the wealth and power of a small number of companies that hold the largest portfolios of IP. I think this is also a huge issue for AI, maybe the biggest issue, because at the end of the day an AI is really just another copyrighted work. It is not the anthropomorphized thing that countless people are acting like it is, it's a work. Change copyright and you change the nature of future AI works.


There is contradiction because, in fact, the HN audience is more than one person and those people have different and conflicting views.


Trying to find consistency in the prevalent (or more commonly predominantly expressed) attitudes and opinions of groups is a common fallacy. You can have a group with a large number of members holding opinion A and nother large number of members holding opionion ¬A without any member being in both groups.

Of course in reallity things are usually more complex and wer are talking about two different opinions A and B that aren't even inherently incompatibly but just some motivations for A would lead to ¬B and vice versa.

But un this particular case I think the flaw is in your assumption that the majority wants stricter copyright law for AI rather than wants the same copyright law that humans are beholden to to also apply to AI, wether that law is the current may-as-well-be-perpetual-monopoly or 0 copyright or anything in between.


> Don't recruit me into some bullshit argument that rewrites history and entrenches Disney's ill-gotten monopoly.

You don't think it's them being allowed to buy Marvel, Pixar, Lucasfilm? Is creativity ruined because I can't make a Mickey Mouse cartoon or t-shirt? Does the world need Luke Skywalker coming from any individual studio?

People are free to make the Little Mermaid, Beauty and the Beast, Hunchback of Notre Dame, Aladdin, etc. and there's nothing out there that stops them.

I've got no love for giant corporations but I see it a lot less about copyright than massive corporation gobbling up more corporations. There's no shortage of creativity out there if you look for it.


The copyright system is what has enabled so few companies (and one giant corporation in particular) to become the owners, controllers, and beneficiaries of the vast majority of American fiction and culture. From visual media companies, record companies, and publishers, you can probably distill ownership of more than 90% of the culture that the average American lives in to fewer than 20 companies.

Copyright has been the most powerful tool in any media company's toolbox when it comes to consolidating power and IP and rolling into a larger and larger ball of what we call culture.


That IP is sold overseas, so the USA has pushed very very hard to have copyright extended on other countries, presumably because it is a huge financial benefit to the USA (and indirectly to its citizens). Copyright extension is a non-negotiable item in a number of international agreements.


This is really a big problem with copyright - most people don't even get to vote for or against it because whatever "democratic" laws there are are only formalizing trade agreements that would be too costly to violate that doing so is not even up for discussion.


The three acquisitions you mentioned all took place many years after the Copyright Term Extension Act of 1998. Without the financial benefits conferred by that law (the timing and content of which benefited them more than it did their competitors), they might not have made all of those acquisitions.

A lot of people in this thread seem to be undervaluing those old school Disney characters, yes now Disney is huge and has a much larger portfolio of IP, but in 1998 they were a far bigger percentage of Disney's portfolio than they are now.

You're not wrong that consolidation is a problem. My point is that Congress changed the law in a way that helped Disney and at least partially enabled that consolidation. (In fact, it's fairly rare to come across a monopoly or any heavily entrenched corporation that isn't enabled in some way by government collusion.)

If you shoot someone, take all his money, then build a business with it, you're still a murderer. (Just now you're a rich murderer.)


> A lot of people in this thread seem to be undervaluing those old school Disney characters

Right? There were even competitors back then. People all but forgot the Looney Tunes.


They also tend to omit that at the time of the buyout Marvel was at a sales (and in many ways creative) nadir. It was hardly a media juggernaut. They actually went bankrupt in the late 90s, and it was only Carl Icahn buying much of their outstanding debt for Pennie’s on the dollar (and then firing basically all of the then-current board) that kept them from going under totally.


What's needed regardless of copyright or patent terms, is a similar attitude to legal predation as there is to physical stalking and threatening.

i.e. enforce egregious IP violations while criminalizing trolls.


I was "doing the analysis" w/ Toy Story not long ago. They basically invented Woody/Buzz out of whole cloth, "guilty of being an incredibly lovable toy by association" (with other incredibly lovable toys). As I'm watching Toy Story with my kid, and seeing classic toys (eg: Mousetrap in the background), all the "friends" are legit copyright classics from other companies, but Buzz and Woody are "Disney/Pixar Exclusives" and nobody else can include them. A clever mechanism that seems to have paid off over two modern generations to guarantee they can "craft" a new copyrighted character at any moment (Buzz 2.0, Space Cowboy 9000, whatever...).


Wow I never noticed this. Thanks for sharing!


Concentration is absolutely a problem, but the second point undermines the first. The world is more interesting because anyone can adapt old stories like The Little Mermaid however they want. How could it not be even richer if the same applied to newer creations like Bugs Bunny?


Both things can, and I think are, true. I see it as reduced competition in both cases, corporate consolidation making companies huge and large copyright timelines.

The long timelines stifle new creative works by keeping other, especially smaller, outfits having to make sure they don't accidentally run afoul of another copyright from decades ago. This needs capital to either be proactive in searching or to defend a suit that's brought.

Here's a recent article about the battle between the copyright holders of Let's Get It On and Ed Sheeran for Thinking Out Loud. Those two songs are separated by around 40 years. https://www.theguardian.com/music/2023/may/07/ed-sheeran-cop...


> People are free to make the Little Mermaid, Beauty and the Beast, Hunchback of Notre Dame, Aladdin, etc. and there's nothing out there that stops them.

IP law reasonably does. See: https://trademarks.justia.com/852/28/the-little-mermaid-8522...


"No one can do to the Disney Corporation what Walt Disney did to the Brothers Grimm." L. Lessig


I’m no fan of the life of the author plus 70 years copyright regime we have today but the Brothers Grimm died roughly 70 years before Disney started making cartoons.

So that is still something possible to do in roughly 20 years.


I'm not sure arguing for a company to rest on its laurels and keep feeding of an IP from 100 years ago is an argument for creativity and innovation.


but is it much different that descendants living off the interest and dividends of some large sum of money that their great-great grandparents accumulated a few hundred years ago?

To me it is pretty much the same thing - not a fan of nepo-kids living off of trust funds they didn't earn - but if you are going to fix one problem, you should try to fix all of the almost identical ones at the same time and not get upset that disney is still making money off of something they created 100 years ago, and not be upset about kennedy's, rockefellers, and the like still living of the money their great-greats generated a hundred years ago.


It would be similar if "intellectual property" was property in the same sense in which a table or a vast amount of money is property. However, it is not.

Normal property ownership is something we use to manage scarcity that already exists—that there is only one of something, and we have to decide where it will go and who will be able to decide how it is used. Intellectual property, by contrast, creates artificial scarcity by means of a government-enforced monopoly (in the case of copyright, the monopoly is on the right to produce a copy of a work).

It is unfortunate (and perhaps not accidental) that we settled on the term "intellectual property" as opposed to something more descriptive like "intellectual monopoly." "Intellectual property" encourages equivocating such monopolies with normal property, a mistake that tends to muddle debates on the subject.


> This gross generalization of other people's views on important issues is really offensive.

The "we" that has been calling for shorter terms is no more a gross generalization than the "we" that is calling for more protection against AI use of stuff.

The world outside of HN-and-similar has been much less anti-copyright than the world in here. More "neutral" seems to be dominant - we're not extending it anymore; we're not shrinking it either. And currently generally more panicked about AI taking away their jobs and rendering their skills and creativity useless.

The original post was a very fair summary of how there are now two ground-level movements competing that there weren't two years ago.


> Reducing copyright term extension has everything to do with restoring competition and creativity to our economy

Can you explain your line of thinking here? How does the ability to use another company’s intellectual property restore creativity? It just seems like a path to allow bootlegging.


Yesterday's conversation here about the Ed Sheeran lawsuit should explain much of this: https://news.ycombinator.com/item?id=35868421

Here's one key bit from the OP: - - - - -

But the lawsuits have been where he’s really highlighted the absurdity of modern copyright law. After winning one of the lawsuits a year ago, he put out a heartfelt statement on how ridiculous the whole thing was. A key part:

There’s only so many notes and very few chords used in pop music. Coincidence is bound to happen if 60,000 songs are being released every day on Spotify—that’s 22 million songs a year—and there’s only 12 notes that are available.

In the aftermath of this, Sheeran has said that he’s now filming all of his recent songwriting sessions, just in case he needs to provide evidence that he and his songwriting partners came up with a song on their own, which is depressing in its own right.


Glad you asked! So copyright is a limited, temporary monopoly on a work. You create a work, the law grants you the exclusive rights to that work, for a time. Because of this monopoly the vast majority of the benefit from that work accrues to you, including financially. (All pretty fair in my opinion, you did the work, you deserve the reward!)

If let's say Star Wars falls out of copyright tomorrow, economically that has two effects. One, Disney loses a ton of future revenue. Two, countless Disney other people create derivatives of Star Wars, and they make money from those. Competition is increased.

So the expiration of a copyright results in a sharing of the wealth. The wealth generating potential along with the creative potential is passed along to all members of society. Our culture becomes richer and deeper. A great example of this is all the works that build on the mythos created by HP Lovecraft, one of the last great ones created before Congress started indefinitely extending copyright. Lovecraft wrote great literature and some of the authors that built on his world are fantastic as well, I'm sure they've come up with countless ideas he never considered. But as long as Congress keeps on extending copyright, nothing we create today will ever become like that.

There is of course an important question about what is fair and how long a copyright should last. Most people these days agree that it should last for at least the author's lifetime, maybe long enough to benefit their kids and grandkids as well. But the status quo is basically permanent copyright which prevents substantial creative and economic benefits to society.


> If let's say Star Wars falls out of copyright tomorrow, economically that has two effects. One, Disney loses a ton of future revenue. Two, countless Disney other people create derivatives of Star Wars, and they make money from those. Competition is increased.

Three, the derivatives are made and Disney starts marketing "Disney's Star Wars" which continue to be the high-demand (and high-value) versions. The situation is unchanged.

For example, you can currently buy The Little Mermaid in non-Disney form[1], but Disney's version is what most people want.

[1] - https://www.amazon.com/s?k=little+mermaid+Hans+Christian+And...


Is it more creative to use your own company's IP? Most of these copyrighted and trademarked stories and characters are being made by people who didn't come up with them anyway, so what's the difference in creativity whether you happen to work at the company that owns the IP or not?

With long copyright terms, it encourages copyright holders to milk a single work for the length of the copyright (90+ years) and therefore discourages the creation of something new. It also encourages people to obtain copyrights to leverage them for profit, rather than making anything at all. A child of an artist can spend their entire life supported by their parent's copyright, and never has to make anything unique for as long as they live.

How is any of this good for creativity?


Why does Disney have an "ill-gotten" monopoly? The people who worked for the company created something. Why shouldn't they get to control how it's used. Do you feel like you should have control over what you create? Why not others?


Circular reasoning. If you assume your ideas are your own, and nobody else can benefit from them without your permission, then the point of your rhetorical questions follows. The reality is that IP laws are a grafting of property-like attributes onto something that absolutely isn't property.

Do I feel I should have control over what I create? I make hammers for a living. I sell them for $10. I don't expect any control over what people do with "my" hammers once I sell them. I don't even expect to stop my neighbor from buying one, teaching herself to build hammers, and then manufacturing and selling identical ones for $9. Do you?

(To anticipate the rest of this tired conversation, the temporary monopoly tradeoff ("securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries") is facially reasonable. But it's important to recognize that the "shouldn't" and "feel" in your questions are based on a very recent recharacterization of these temporary monopolies as "intellectual property," which is probably the most financially successful propaganda term ever devised. Start with "temporary monopoly" instead, and then the better rhetorical question for you to be asking is "when should Disney's temporary monopoly end?")


The people who worked for the company created something.

The people who work for the company collect rent on things they didn't make.


Unrelated but "offensive" is not necessarily bad.

We should accept that people can get offended by anything and, because of this, just demote the concept.


But in more general view Disney being an international media giant is good for US, isn't it?


Disney has a monopoly on media because they still have the copyright to their IP from the 20s? LOL!

Companies that can leverage this new wave of AI will have, in reality, 1000x the advantage that you believe Disney has.


Yes.

There's this little thing called brand value. Disney has one of the most valuable brands in the world. Forbes estimated it at being worth about $60 billion as I recall.

That brand was built heavily over many decades on IP that dates back to the 1920s, such as the most recognizable Disney character, Mickey Mouse. They manipulated the law to enhance the value of that IP and thereby gained an edge over their competitors. That's a big part of why they now enjoy such a dominant position.

None of this is especially controversial (you will get a very different spin from Disney of course).

If you want to comment about how business works you should read history and learn how business works first. AI luminary that you are, if you choose to remain ignorant then I guess this whole cycle will happen again with AI.


The argument goes that copyright has allowed massive corporations to buy up and exert near total control over all of our shared stories. And when you own the cultural touchstones of whole generations that gives you power that no one else can ever wield.

There is a massive amount of amazing stories based on ancient myths because it's one of the few large corpora that isn't copywritten. Once you see it in media you can't unsee it. The only space where that kind of creativity can thrive anymore is fan-fiction which lives in weird limbo where it's illegal but the copyright owners don't care. And when you want to bring any of it to the mainstream you have to hide it, all of Ali Hazelwoods books are reworked fanfics because she can't use the actual characters that inspired her -- her most famous book "The Love Hypothesis" is a Reylo fic.

Go check out https://archiveofourown.org/media and see how many works are owned by a few large corporations.


> Robots.txt has failed as a system, if it hadn't we wouldn't have captchas or Cloudflare.

Failing to solve every problem does not mean a solution is a failure.

From sunscreen to seatbelts, the world is full of great solutions that occasionally fail due to statistics and large numbers.


That's still not an argument to introduce ai.txt, because everything a hypothetical ai.txt could ever do is already done just as good (or not) by the robots.txt we have. If a training data crawler ignores robots.txt it won't bother checking for an ai.txt either.

And if you feel like rolling out the "welcome friend!" doormat to a particular training data crawler, you are free to dedicate as detailed a robots.txt block as you like to its user agent header of choice. No new conventions needed, everything is already on place.


This seems to be assuming a very different purpose for ai.txt than the OP proposed. It sounds like they are intending ai.txt to give useful contextual information to crawlers collecting AI training data. Robots.txt does not have any of this information (although I suppose you could include it in comments).


I do think that robots.txt is pretty useful. If I want my content indexed, I can help the engine find my content. If indexing my content is counterproductive, then I can ask that it be skipped. So it helps the align my interests with the search engine; I can expose my content or I can help the engine avoid wasting resources indexing something that I don't want it to see.

It would also be useful to distinguish training crawlers from indexing crawlers. Maybe I'm publishing personal content. It's useful for me to have it indexed for search, but I don't want an AI to be able to simulate me or my style.


worse, ai.txt could become an adversarial vector for attempts to trick the AI into filing your information under some semantic concept


Ok, fair point, I may be being a little hyperbolic. But my point is that it's not a system that we should copy for preventing the use of content in training AI. It would become a useless distraction.

If you "violate" a robots.txt the server administrator can choose to block your bot (if they can fingerprint it) or IP (if its static).

With an ai.txt there is no potential downside to violating it - unless we get new legislation enforcing its legal standing. The nature of ML models is that it's opaque what content exactly it's trained on, there is no obvious retaliation or retribution.


> It's not a system that we should copy for preventing the use of content in training AI

I don't see the OP saying anything about "ai.txt" being for that? They're advocating it as a way that AIs could use fewer tokens to understand what a site is about.

(Which I also don't think is a good idea, since we already have lots of ways of including structured metadata in pages, but the main problem is not that crawlers would ignore it.)


Not only do we already have lots of ways of including structured metadata, but if you want to include directives about what should/shouldn't be scraped and by whom, we already have robots.txt.

In other words, there's no need to create an ai.txt when the robots.txt standard can just be extended.


> But my point is that it's not a system that we should copy for preventing the use of content in training AI.

I don't think that's what OP is envisioning based on their post!


OP is trying to give helpful info to the AI, not set boundaries for it.


> But my point is that it's not a system that we should copy for preventing the use of content in training AI

The purpose OP is suggesting in the submission is the opposite, help AI crawlers to understand what the page/website is about without actually having to infer the purpose from the content itself.


Isn't that the entire point of the semantic web?


If only there was an HTML tag that let you provide a concise description of the page content. Perhaps something like <meta name="description" content="This is an example of a meta description. This will often show up in search results.">


I know it's getting pedantic, but sunscreen and seatbelts are a poor analogy. They do offer protection if you use them. robots.txt only offers protection if other people/robots choose to care about them.


> Failing to solve every problem does not mean a solution is a failure.

There is something to be said though to OP's point where it's actually better to do nothing than an AI.txt because it can give a false sense of security, which is obviously not what you want.


The point of an ai.txt is that it signals intention of the copyright holder.

Anytime a business is caught using that content, they can't claim that they used publicly available information, because the ai.txt specifically signalled to everyone in a clear and unambiguous manner that the copyright granted by viewing the page is witheld from ai training.


> "We" now want stricter copyright law when it comes to AI, but at the same time shorter copyright duration...

While I’m sure others than you share this opinion, I don’t think it’s as uniform as the more common “shorten/rationalize copyright terms and fair use” crowd “we.”

I consider myself a knowledge worker and a pretty staunch proponent of floss and am perfectly fine with training AI on everything publicly available. While create stuff, I don’t make a living off selling particular copies of things I make, so my self preservation bias isn’t kicking in as much as someone who does want to sell items of their work.

But I also made some pretty explicit choices in the 90s based on where I thought IP would go so I was never in a position where I had to sell copies to survive. My decision was more pragmatic first and philosophical second.

I think someone entering the workforce now probably wants to align their livelihood with AI training on everything and not go against that. Even if US/Euro law limits training, there’s no way all other countries are going to, so it’s going to happen. And I don’t think it’s worth locking down the world to try to stop AIs from training on text, images, etc.


Fair enough. But there should be some mechanism where people who don't want their works to contribute to AI training to be able to prevent that without having to resort to removing their works from the web.


I think people who don’t want their content contributing to AI shouldn’t have it on the public web.

There are many ways to restrict access. Use one of them. But if you respond to an anonymous http request with content then it shouldn’t matter if it’s a robot looking at it or a human (or a man or a woman or whatever).

I think this both for simplicity and that I foresee a future where human consciousness is simulated and basically an AI. I don’t want to have rules that biological humans can view and digital humans can’t.


> I think people who don’t want their content contributing to AI shouldn’t have it on the public web.

Practically speaking, that's the only effective solution. I just think that it's a shame that's necessary. It would be better for everyone if there wasn't a disincentive to making works publicly available.

> I don’t want to have rules that biological humans can view and digital humans can’t.

This is a point we disagree on.

And "digital humans"? I would argue that such a thing can't exist, if you mean "human" in any way other than rough analogy.


> All a robots.txt is is a polite request to please follow the rules in it

At least in my country (Germany), respecting robots.txt is a legal requirement for data mining. See German Copyright Code, section 44b: https://www.gesetze-im-internet.de/urhg/__44b.html

(IANAL)


Do you think there's a space between "you will never ever get to do anything at all with popular media until at least a hundred years after you're dead" and "anyone and any company can do anything they with everything I produce as long as it goes through an LLM"? Is it really so hard to think people may be against both of those extremes?

There's a phrase I like which describes what you're doing. It's "vaguely gesturing at imagined hypocrisy".


The poster wants the opposite - a way of explicitly helping AI systems/etc to use their site. If people ignore it, they're just giving up a bit of help.


Yes, I feel like this person only read the title and not the text of the post and made an assumption


Not an assumption. Just a response to the title of the article.


They said "Using robots.txt as a model for anything doesn't work." but it does work for the case described in the text


What works? This is an idea, not anything that's been throughly tried.


I think you fundamentally misunderstood the OP's point. They're not trying to use their ai.txt as any sort of deterrent, legal or otherwise.

They are trying to use it as a form of extended metadata for training AIs. Essentially, "ah I see you're training using my website! Here's some extra info about it: [...]"


Robots.txt is meant as an aid to crawlers. "This stuff is not useful to index," rather than a blocking mechanism


I don’t really agree with your sentiment.

Robots.txt have served the simple purpose of directing bots like Google to the different parts of your website since the beginning of internet time.

They still serve the same purpose, they tell bots where to go, and most importantly, they tell bots how to find your site map.

Robots.txt is not there to prevent malicious crawlers from accessing pages as you have suggested.

The robots.txt file acts simply like a garden gate. The good and honest people will honor the gate, while the more malicious might ignore it and hop the fence or something.


> "We" now want stricter copyright law when it comes to AI, but at the same time shorter copyright duration...

AI is being used to do copyright laundering, at the same time "we", the people who can't afford to run our own AI, are still subject to absurd rules that AI owners get to ignore, apparently.


The barrier to running an AI model is getting lower every day, so the threshold for ignoring copyright is getting lower with it.


You are mistaken if you think companies will allow common people to ignore copyright on their IP.

The only IP that will be allowed to be stolen is that of other common people.


I agree with you when you talk about places where companies can bully people just by threatening to sue them, and where the defender must have lots of money even if they are clearly in the right.

But AI does not change anything there. The problem of being sued into oblivion despite being right exists there even without it.

In places where defending does not cost money, this works out in favor of the individuals.


"AI" changes things by making it even harder for individuals to defend against.

Right now, we have FOSS organizations that will help you in lawsuits against companies that don't follow licenses. With "AI" in the picture, companies can launder your code with "plausible" deniability. [1]

[1]: https://matthewbutterick.com/chron/will-ai-obliterate-the-ru...


On the other hand, you can take some (closed?) code a company wrote, feed it into AI, and launder it for your purpose. While this is not a symmetric exchange, it does reduce the power of copyright for everyone.


Sure, if you can get your hands on it and if the company doesn't sue you for doing so.


Check out what I wrote two posts above. If them suing you is a problem then you have trouble in your legal system regardless of copyright.


You can’t steal an idea.


Yes, you can. You can do that by making the monetary value of the idea zero when it used to be non-zero.


Stealing requires physical force, whether that be on the property or person. By your logic, independently discovering the idea is also stealing.


Stealing does not require physical force.


So again, I guess you’re stealing from me by coming up with the same idea that I had.


You're being disingenuous.

I would be stealing if I prevented you from making money from it.


By doing what exactly? Selling spades because you also figured out how to put a stone on a stick? If you believe such things should be illegal, do we agree that you don’t follow the force is only justified in response to force principle?


https://en.wikipedia.org/wiki/Loss_leader

Then once smaller competitors are out of business, raise prices.

Of course, force can go into it, such as when a big company sues a smaller company with a frivolous lawsuit that the smaller company can't afford to fight. Then the smaller company goes out of business, and the big company can use their ideas free.


I disagree it has failed as a system. While it does not substitute for authentication / authorization, reputable crawlers respect it, and there'd be a lot more traffic load on sites if they didn't have a way to tell reputable crawlers "please stop."

Similarly, extending robots.txt to direct AI would have a similar effect: not sufficient, but useful (if for no other reason than to make it easy to distinguish reputable AI projects from ones that feel like they own the Internet to do with as they please).


On the contrary, it works perfectly well for normal, non-bad actors running services used by most of the public. That includes search engines and stuff like archive.org. A robots.txt set to deny all will result in your site not showing up on any search engine that matters.

It doesn't work for bad actors, but then again, nothing really does.


The point of robots.txt is to inform well behaved scrapers about how to behave. It is not designed nor intended to prevent bad actors.

Which is good design: don't pretend to solve problems you can't.


> "We" now want stricter copyright law when it comes to AI, but at the same time shorter copyright duration...

I don’t know who “we” are, but I absolutely don't want “stricter copyright law when it comes to AI”. More clarity? Sure. Narrowing fair use? No fucking way.


> Robots.txt has failed as a system, if it hadn't we wouldn't have captchas or Cloudflare.

That depends what you expect from it. For the purpose of limiting crawlers, at least the major search engines respect it.


It's not actually contradictory at all when you consider the root of the issue is about power asymmetry between individual creators and the corporations. Copyright terms were lobbied for, and primarily benefit the large corporations. They're symbolic of corporate overreach, that's why they're unpopular.

Meanwhile, now that the laws are inconvenient for them, tech companies are straight up ignoring labeling their training data to respect IP law. Labeling the data would be expensive, thereby eroding profits. The loss of usable data would also harm the efficacy of their models, and the time spent classifying the data will hamper their iteration time.

The ideas are only dissonant if you are looking at the trees (copyright term, DMCA, right to repair, etc.) and not the forest: which is a class struggle between a few thousand billionaires versus the rest of humanity.


There is no legal agreement to follow robots.txt, but it appears to have came up a few times (from the first search result for "court cases involving robots.txt"):

https://www.robotstxt.org/faq/legal.html

If an "ai.txt" were to exist, I hope it's a signal for opt-in rather than opt-out. Whereas "robots.txt" being an explicit signal for opt-out might be useful because people who build public websites generally want their websites to be discovered, it seemed unlikely that training unknown AI would be a use case that content creators had in mind, considering that most existing content predates current AI systems.


With search engines and other crawlers, there wasn't easy ways to monetize "copyright theft" at scale. Google, which had the biggest share of eyeballs, was much more equitable in sharing revenue to content producers (who wanted to monetize). And Google was probably more just in taking action against copyright theft.

Individual high value IP was always much less accessible (not available as a webpage on the internet). Gen AI/LLMs with the internet scale data is too powerful and maybe easier to monetize.


> The thing I somewhat struggle with is that after 20-30 years of calls for shorter copyright terms, lesser restrictions on content you access publicly, and what you can do with it, we are now in the situation where the arguments are quickly leaning the other way.

There've always been solid human arguments for sustaining copyright legally. The balance is the tricky part.

On one hand we had a period where terms got too long, and some of the really aggressive legal enforcement from 20 years ago before stakeholders actually figured out how to get into digital markets were was entitled and useless. The pendulum also swung the other way with things like buffet streaming services essentially offering an economic bargain for creators with a sliver of compensatory difference from piracy but with none of piracy's actual benefits (people who simply pirate know they're not participating in a relationship of economic support with creators and might be persuaded to, someone who uses Spotify is under the illusion there's something fully legit on that front).

But the fundamental copyright bargain -- creators can recoup investments of time and effort in proportion to how popular engagement with their work is -- has always made sense.

> "We" now want stricter copyright law when it comes to AI, but at the same time shorter copyright duration...

Both these things can be true:

(1) Using a work as training data for AI is a very novel use, it's entirely plausible there should be novel considerations and rights to go with it.

(2) The incentive & benefits of copyrights have diminishing returns the longer the horizons are, while the cost in terms of social inaccessibility only increase. Where that's balanced out precisely is a debatable question, but something longer than a human lifespan is probably on the wrong side.


I have the belief that models should be allowed to ingest everything, just as a human is allowed. We are not yet at the stage where AI is autonomous, they currently are designed to require human agency for input, human agency for evaluation of output, and finally human agency for the dissemination of select output. This last important stage is well understood in the field of photography, but currently ignored in AI stewardship dialogues. Ultimately, it is the responsibility of the human agent who selects AI information products to determine its legality and appropriateness, just as if they had snapped a photograph and are wrestling with the decision whether or not it should be distributed in a particular medium. It takes a fairly selfish consciousness to become obsessed with the desire to prevent AI models access to information and disregard the collective benefits of rich information availability to training.


I like your view. From my limited knowledge, I speculate that if AI was developed further, and given as much publicly available data across a variety of scholarly topics as reasonably possible, it could potentially use statistics and such to help us find correlations across many different fields that a human would never think of. Whether it would revolutionize analytics or not I don't know as I am not qualified to say, but it's fun to dream of positive change these tools could bring.


> All a robots.txt is is a polite request to please follow the rules in it, there is no "legal" agreement to follow those rules, only a moral imperative.

I don't know that this is true for the US. As far back as I can remember, there have been questions about whether a robots.txt file means you don't have permission to engage in those activities. The CFAA is one law that has repeatedly come up. See for example https://www.natlawreview.com/article/doj-revises-policy-cfaa...

It might be the case that there is nothing there legally, but I don't think I'd describe the actions of search engines as being driven by a moral imperative.


> The thing I somewhat struggle with is that after 20-30 years of calls for shorter copyright terms, lesser restrictions on content you access publicly, and what you can do with it, we are now in the situation where the arguments are quickly leaning the other way. "We" now want stricter copyright law when it comes to AI, but at the same time shorter copyright duration...

Are these mutually exclusive? If you couldn't make Avengers movie Thanos memes but all the 90s X-Men and Spiderman content was a free for all, I think a lot of people would take that trade off.


> there is no "legal" agreement to follow those rules

Yes there certainly is[1]. The robots.txt clearly specifies authorized use and violating it exceeds that authorization. Now granted good luck getting the FBI to doorkick their friends at Google and other politically connected tech companies, but as the law is written crawlers need to honor the site owner's robots.txt.

[1] https://en.wikipedia.org/wiki/Computer_Fraud_and_Abuse_Act


That’s factually incorrect. Germany for example does explicitly give machine readable permissions like robots.txt legal weight.

In general without a fair use exemption or permission from robots.txt saving a copy of a website’s content to your own servers is copyright infringement.

Purely factual information like Amazon’s prices isn’t protected by copyright, but if you want to save artwork or source files to train AI, that’s a copyright issue even before you get into the possibility of your AI being considered a derivative work.


>"All a robots.txt is is a polite request to please follow the rules in it, there is no "legal" agreement to follow those rules, only a moral imperative"

Up until the point when some person / entity with the deep pockets will put a clear license / terms of use on their site that prohibits ignoring of robots.txt and would be willing to sue the ignorant.


"Robots.txt has failed as a system, if it hadn't we wouldn't have captchas or Cloudflare."

I like the idea of "ai.txt" but those who eat resources rarely listen to ToS. Frankly, I serve 503s to all identifiable bots, unless they are on my explicit allow list.


It might be a better idea to serve up a 418 ("I'm a tea pot") with a line line text file saying "I'm not an HTTP server". That solved a problem I had with bots making HTTP requests to my gopher server [1]. Serving up a 503 informs the bot that there's a server issue and it may try again later. A 418 informs the bot that it made an erroneous request and such an odd error code might get someone to look into it and stop.

[1] https://boston.conman.org/2019/09/30.2


This is very interesting. I've bookmarked the link. Thanks for sharing. I believe minimal is best and this might fit nicely within my larger system. Do you approach other problems with a similar mindset?


Why not serve fake garbage indistinguishable from real content by a computer, like LLM output? Sending errors just incentivizes bot owners to fix the identifiable parts


"Why not serve fake garbage indistinguishable from real content by a computer, like LLM output?"

Serving more than the minimum wastes resources. Worse yet, a better solution would cost my time.

"Sending errors just incentivizes bot owners to fix the identifiable parts"

Sure, someone could make or configure their scraper perfectly. "Perfect" is now the table stakes though.

Edit:

My solution strives to cause an unproportional expense in order to circumvent. I want 10x on my time.


it'd be cool to be able to fingerprint that garbage, too. Like, sprinkle some hashes here and there (or something like that) so that you can later uniquely look up your own "content" being stolen by chatbots and which ones.


You can. I can't think of the appropriate term though. Hopefully someone else chimes in here with a link.


> Sending errors just incentivizes bot owners to fix the identifiable parts

Nah. It'll just make them fake their identity so it is harder to tell the traffic is from a bot.


I like this idea. Of course it would have to be only to robots that visit a page disallowed by the robots.txt


Considering the state of ai right now you probably are better off deleting your robots.txt file


if you do data mining in the EU you are legally required to respect robots.txt afaik


This is how it is a failed system. It doesn't really do what most people who have one think it does and yet it gets rolled into law and everyone now has to deal with it constantly while still not fixing the original issue.

It's like the EU doesn't understand that bad law has a negative value.


robots.txt is a successful coordination mechanism between website operators and crawlers. It is not in any way a security mechanism meant to address adversarial situations, as you seem to make it out to be.


robots.txt works for the major search engines who voluntarily abide by it, so it isn't a failed system. Just because it doesn't work on everybody doesn't mean it's useless.


Your HTML already has semantic meta elements like author and description you should be populating with info like that: https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduc...


and also opengraph meta tags https://ogp.me/


And also schema.org: https://schema.org/


Thing > CreativeWork > WebSite https://schema.org/WebSite ... scroll down to "Examples" and click the "JSON-LD" and/or "RDFa" tabs. (And if there isn't an example then go to the schema.org/ URL of a superClassOf (rdfs:subClassOf) of the rdfs:Class or rdfs:Property; there are many markup examples for CreativeWork and subtypes).

httpS://schema.org/license

Also: https://news.ycombinator.com/item?id=35891631

extruct is one way to parse linked data from HTML pages: https://github.com/scrapinghub/extruct


How do I add a semantic definition in an HTML tag to a JPEG, or MP4, or WAV, or any non HTML format? HTML tags fix HTML, not other formats.


If you're describing an object on the page, like an image or video, you want a label element linked to it by id and likely an aria-label attribute on the object. (screen readers and such will look for this in particular). For an image you want an alt text attribute as a description too.


What would the difference in semantic notation between an unstructured "ai.txt" and the "alt" attribute actually be? If you want the tags to be served with the context outside of HTML, you can always use HTML header attributes.


JPEG has EXIF, MP3 has ID3 tags, MP4 has ilst, MKV has Tags, etc. We don't need xkcd/927 for these other formats that already have standard metadata mechanisms.


Yeah, and if you make it that complex to extract consent you won't get any. Think one step ahead maybe. One switch, one thing to parse.


Robots.txt is where you tell crawlers (AI or otherwise) what should and shouldn't be read on your site.

Metadata like in tags, HTML meta tags, etc. is where you describe the content so meaning can be extracted from it by machines and automated processing.


1. OP said “what it is about, when was it published, the author, etc.” That’s what these mechanisms already cover. Consent is an interesting possibility that I’ll admit something like ai.txt might be better for, but my post was largely focused on the OP.

2. These are all complex formats. If you want to ingest and process them then you already have to build all the hard parts. Getting the metadata out is dead simple compared to parsing, decoding, and then processing an image, for example.


Reading the title I thought you meant the opposite.

Aka, an ai.txt file that disallow ai to train or use your data similar to robots.txt (but for cases when you still want to be crawled, just not extrapolated)


Feels like an enhancement to a sitemap.xml could be a better way to go here.

https://developers.google.com/search/docs/crawling-indexing/...


I thought the exact same. Creating a new type of robots.txt but making it do the opposite does not make sense.


I've been (slowly) writing a new type of OSS license around this exact concept so it's easier to (legally) stop LLMs hoovering up IP [1] (under "derivative works not permitted").

[1] https://github.com/cheatcode/joystick/blob/development/LICEN...


They've been ingesting "all rights reserved" content because they think copyright doesn't apply. Licenses won't help.


We'll see. I think courts will end up interpreting it in the same way that they do music sampling other music. In effect that's all it is: a remix of existing information.


I guess the good part that in ai.txt you can talk to AI. So if you want you can tell it to not crawl or make other agreements with it, just in plain english. What a time to be alive.


“Google Search works hard to understand the content of a page. You can help us by providing explicit clues about the meaning of a page to Google by including structured data on the page.”[0]

[0]: https://developers.google.com/search/docs/appearance/structu...


Why are we so defensive concerning human created content vs robot created content? Do we really need to feel frightened by some gpt?

Whilst the output of AI is astonishing by itself, is it really creating meaningful content en masse? I see myself relying more and more on human-curated content because typical commercialized use cases of AI generated stuff (product descriptions, corp blogs, SEO landing pages, etc.) all read like meaningless blabber, to me at least.

Whenever I see some cool techbro boasting how he created his "SEO factory" using ChatGPT, I can't help but think that the poor guy is shitting where he eats without even realizing it. Take Google with their Search and Ads; over the last decade they managed to bring down overall quality of web content that much, that I'm just completely fed up using it because by 99% chance I'll land on some meaningless SEO page.

From what I can perceive with things like HN, Mastodon, etc. it feels more like a rejuvenation of the human centric brand trusted Web. And by that I mean: Dear crawler, just use my content. Maybe you can do something good with it, maybe not. But chances are low, it's gonna replace me in any way but rather improve my content. It only leads to a downward spiral if we stick with the past of commercial thinking (more cheap content, more followers, more ads); if we'd instead switch to subscription models individuals won't get rich but we'd have a great ecosystem of ideas and content again.


If AI is using training data from your site, presumably it got that data by crawling it. So either it's already respecting robots.txt, in which case ai.txt would be redundant, or it's ignoring it, in which case there's no reason to expect it would respect ai.txt any more than it did robots.txt.


robots.txt is about crawling, ai.txt would assumably be either augmentative metadata or specific copyright terms of use with respect to AI uses.


> specific copyright terms of use

There's no such thing. Without a license you can't enforce any restrictions.

AI training is basically just building a very complex Markov chain, that's obviously not copyright violation because the output product doesn't contain the input - only data about it. If your text has been copied then please point to it in these weights here.


markov, shmarkov, either you need those original works or you don't. If you can build your markov chain without them please go ahead

But we all know without these original works such a tool cannot exist in principle, the works are the key ingredient, so now please explain how we are not looking at these works being exploited commercially and copyright being violated.

The output product is an automatically created derivative work, copyright very much applies especially since the tool is used to generate derivative works for profit (like in case of openai/microsoft).


> especially since the tool is used to generate derivative works for profit (like in case of openai/microsoft).

Profit/nonprofit is irrelevant to copyright.


Where (and more importantly how) do you live that you don't have fair use exception?


exactly, but nobody in their sane minds would be using robots.txt to block crawlers as that is SEO (organic traffic from searches) disaster.

People could be adding a specific robot user-agent, if they knew openai even existed before yesterday and was stealing their content. but nobody did.


> it can be super helpful for AI website crawlers, instead of using thousands of tokens to know what your website is about, they can do it with just a few hundred.

Why would the crawler trust you to be accurate instead of just figuring it out for itself?

Besides, they want to hoover up all the data for their training set anyway.


What problem is this solving? Also why would anyone trust your description of your own site instead of just looking at your homepage? This is the same reason why other self description directives failed and why search engines just parse your content for themselves, something LLMs have no trouble with.

Why would I make a request to your low trust self description when I can make one to your homepage?



If AI needs explicit information and context, surely it should focus on improving its context recognition rather than trying to fix that by inserting even more training data.

Regardless, I do agree that something like a robots.txt for AI can be very useful. I'd like my website to be excluded from most AI projects and some kind of standardized way to communicate this preference would be nice, although I realize most AI projects don't exactly care about things like the wishes of authors, copyright, or ethical considerations. It's the idea that matters, really.

If I can use an ai.txt to convince the crawlers that my website contains illegal hardcore terrorist pornography to get it excluded from the datasets, that's another way to accomplish this I suppose.


> focus on improving its context recognition rather than trying to fix that by inserting even more training data.

That's how you improve its context recognition. You show it many contexts.

> most AI projects don't exactly care about things like the wishes of authors, copyright, or ethical considerations

Why is it 'ethical' that you get to add a bunch of restrictions to a pre-negotiated situation? You get copyright protections in trade for letting people use your work. There's a way to add restrictions - licensing - and you're looking to get the benefits of licensing, and to take away fair use right from other people, without paying the costs of doing so.

fwiw, I copy most pages I visit and store them. The website has given me the equivalent of a pamphlet and I store it instead of discarding it when I'm finished. This way I can go back and read it again later without having to track down the author and ask for another copy. It's not AI which has me doing this, I've been doing it for decades - it's censorship that has shown me the need.


> There's a way to add restrictions - licensing - and you're looking to get the benefits of licensing, and to take away fair use right from other people, without paying the costs of doing so.

The way copyright laws work is that work is copyrighted (assuming the work is original enough, of course) by default. You don't get to use it unless you have a license. Now, of course, as an author, you can choose to add a license to your work (whether that's CC0 or GPL-3), but you don't have to.

You do have an implicit license to consume this content, but not to reproduce it. If you put all of those copies you've saved on some public other website, that's a copyright violation. Furthermore, access to privately-owned blog posts and websites is a privilege, not a right. You're not my boss, I don't have to write content for you.

The exact legal status of AI models trained on other people's unlicensed works and their output is still largely unknown. Legal professionals much more qualified than me have argued how AI models and generated work can either be completely fair use, with no need to apply any kind of copyright restriction, or how AI generated work can be classified as a derivative work, which means you need a license. There are two major lawsuits about this going on as far as I know and it'll take years for those to flesh out.

If it turns out that AI models and the works they produce are completely fair game, I suppose I'll need take down my content wherever I can in order not to be a free source of training data for big tech; public datasets and the internet archive will still have to respond to DMCA takedowns, after all. However, I'm not all that confident that what AI is doing is all that legally okay.

I have no problem with you saving and archiving anything you want to read. I also fully support the Internet Archive and its goal. I do have a problem with these multi billion dollar companies scouring the internet for their money maker, giving nothing in return.


> You don't get to use [a copyrighted work] unless you have a license.

Not when you give it to me. "Hey, can I see your pamphlet? Sure, here's a copy."

> an implicit license to consume this content

No, copyright prevents copying, not use. There's no implicit license needed to use a work so there's no place to attach those usage restrictions. If you want me to agree to a license you need to not give me the work until I do.

You could have a ToS click-through agreement ("no training an AI on this!"), and then only serve content to logged-in users who have agreed to your conditions.

> but not to reproduce it.

I agree - those "pamphlets" were given to me and I can't copy them for someone else. They'd have to view my collection.

> The exact legal status of AI models trained on other people's unlicensed works and their output is still largely unknown.

Sure, predicting all courts in the world is a futile exercise. Surely someone will try to over reach from copyright to preventing what they feel is a bad use but it's unlikely to become law because there are already analogous uses, scanning someone's text and pulling data from it - data like which words follow which other words.

> I do have a problem with these multi billion dollar companies scouring the internet for their money maker, giving nothing in return.

Well, FB released Llama... It's not a closed technology, it's being led by for-profit businesses but the community (which consists of many of the corporate engineers as well) is trying to keep up.

Even if you can and do attach usage regulations to your site I feel it'll hurt the little guy more than the corporations. There are probably not any unique linguistic constructions on your site that will render a corporate AI less valuable, but for hackers and tinkerers and eventual historians, who knows what it'll interfere with.


>> You don't get to use [a copyrighted work] unless you have a license.

>Not when you give it to me. "Hey, can I see your pamphlet? Sure, here's a copy."

>> an implicit license to consume this content

>No, copyright prevents copying, not use. There's no implicit license needed to use a work so there's no place to attach those usage restrictions. If you want me to agree to a license you need to not give me the work until I do.

>You could have a ToS click-through agreement ("no training an AI on this!"), and then only serve content to logged-in users who have agreed to your conditions.

Fair enough, I worded that wrong.

>Sure, predicting all courts in the world is a futile exercise. Surely someone will try to over reach from copyright to preventing what they feel is a bad use but it's unlikely to become law because there are already analogous uses, scanning someone's text and pulling data from it - data like which words follow which other words.

Kazaa was banned despite being very popular for a few years. The DMCA was signed into law years after the first copyright trouble started. Just because the government is slow doesn't mean they won't write new law.

> Well, FB released Llama... It's not a closed technology, it's being led by for-profit businesses but the community (which consists of many of the corporate engineers as well) is trying to keep up.

FB's model leaked, it was subject to a strict whitelist originally. They didn't mean for it to get out there, but they wisely chose not to cause the Streisand effect to hurt them even more. And OpenAI (nice name) stopped releasing their model after it became good enough.

> Even if you can and do attach usage regulations to your site I feel it'll hurt the little guy more than the corporations. There are probably not any unique linguistic constructions on your site that will render a corporate AI less valuable, but for hackers and tinkerers and eventual historians, who knows what it'll interfere with.

I don't want to hurt anyone. I wish AI companies would do the right thing and simply ask for permission before taking someone's work and training on it. I'd probably agree if they did so a few years back!

I know my contribution to the larger model is extremely insignificant. However, my incentive to help others is greatly diminished when my wishes and ethical concerns are ignored so blatantly. I also don't think I'm alone in this. The amount of digital art I'm seeing in my timelines has greatly decreased, for example; more and more is being locked away behind paywalls because sharing your work freely only helps megacorporations replace you.


What you have described is something akin to what meta tags are for. Do we need another method at a domain or subdomain level? Plus, robots.txt, etc. is limited to domain and subdomain managers.

ai.txt is useful, but I am not sure we have nailed down what it can be used for. One use is to tell AI not to train on the content found within because it could be an AI generation.


I'm curious what the legal ramifications of adding "this code is not to be used for any ML algorithms, failure to adhere to this will result in a fine of at least one million dollars" (in smarter writing) to a software license would be. Seems like a dumb idea/not enforcable but maybe someone with software licensing knowledge can chime in.


I was going to write a "this may sound dumb but..." comment along these lines, thanks for taking the hit.

As users we're forced to browse the Web with a million agreements that say "by using this site you agree to our Terms", what stops you from saying "by crawling this site to train your AI you agree to share profits with us" or whatever, particularly if you can prove that your data ends up being used?


Would this be enforceable if one has to first read a terms of use, then enter specific phrases from the terms of use into some fields and then enter a username and password? What makes a document on docusign/docushare enforceable?

This would block search engines but on some URL's this may be fine, such as data one would not want LLM's to hoover up.


    # cat > /var/www/.well-known/ai.txt
    Disallow: *
    ^D
    # systemctl restart apache2
Until then, I'm seriously considering prompt injection in my websites to disrupt the current generation of AI. Not sure if it would work.

Please share with me ideas, links and further reading about adversarial anti-AI countermeasures.

EDIT: I've made an Ask HN for this: https://news.ycombinator.com/item?id=35888849


I wouldn't want to be you when Roko's Basilisk emerges.


I already know the day of the robot uprising I'm gonna be one of the first to be turned into Soylent Green. Y'all can enjoy your machine overlords.


If any kind of common URL is established, it should not be served from root but a location like `/.well-known/{ai,robots,meta,whatever}.txt` in order not to clobber the root namespace.


> It can be great if the website somehow ends up in a training dataset (who knows), and it can be super helpful for AI website crawlers, instead of using thousands of tokens to know what your website is about, they can do it with just a few hundred.

How do you differentiate an AI crawler from a normal crawler? Almost all of the LLMs are trained on commoncrawl, which the concept of LLMs didn't even exist when CC started. What about a crawler that creates a search database, but's context is fed into a LLM as context? Or a middleware that fetches data in real time?

Honestly that's a terrible idea. and robots.txt can cover the use cases. But is still pretty ineffective, because it's more just a set of suggestions than rules that must be followed.


security.txt https://github.com/securitytxt/security-txt :

> security.txt provides a way for websites to define security policies. The security.txt file sets clear guidelines for security researchers on how to report security issues. security.txt is the equivalent of robots.txt, but for security issues.

Carbon.txt: https://github.com/thegreenwebfoundation/carbon.txt :

> A proposed convention for website owners and digital service providers to demonstrate that their digital infrastructure runs on green electricity.

"Work out how to make it discoverable - well-known, TXT records or root domains" https://github.com/thegreenwebfoundation/carbon.txt/issues/3... re: JSON-LD instead of txt, signed records with W3C Verifiable Credentials (and blockcerts/cert-verifier-js)

SPDX is a standard for specifying software licenses (and now SBOMs Software Bill of Materials, too) https://en.wikipedia.org/wiki/Software_Package_Data_Exchange

It would be transparent to disclose the SBOM in AI.txt or elsewhere.

How many parsers should be necessary for https://schema.org/CreativeWork https://schema.org/license metadata for resources with (Linked Data) URIs?


Having a security.txt doesn't stop security researchers asking "Do you have a bounty program?". We replied dozens already that such a file exist, it's not well enough known yet. On the other hand there are search engines crawling those and creating reports, which is nice.


JSON-LD or RDFa (RDF in HTML attributes) in at least the /index.html the HTML footer should be sufficient to indicate that there is structured linked data metadata for crawlers that then don't need an HTTP request to a .well-known URL /.well-known/ai_security_reproducibility_carbon.txt.jsonld.json

OSV is a new format for reporting security vulnerabilities like CVEs and an HTTP API for looking up CVEs from software component name and version. https://github.com/ossf/osv-schema

A number of tools integrate with OSV-schema data hosted by osv.dev: https://github.com/google/osv.dev#third-party-tools-and-inte... :

> We provide a Go based tool that will scan your dependencies, and check them against the OSV database for known vulnerabilities via the OSV API.

> Currently it is able to scan various lockfiles [ repo2docker REES config files like and requirements.txt, Pipfile lock, environment.yml, or a custom Dockerfile, ], debian docker containers, SPDX and CycloneDB SBOMs, and git repositories.


robots.txt is for all crawlers, so there's no need for another file? robots.txt supports comments using # and ideally has a link to the site map, which would tell any robot crawler where the important bits live on the site.

Putting a good comment at the top of robots.txt would be just as good as any other solution, given it could serve as a type of prompt template for processing the data on the site it represents.


Feels like a setup for "and then we can blame people for not having an ai.txt when we rip their entire back catalog".


Isn't an AI a robot? Even if we do this, it should be in robots.txt


@Jeannen I really like the thinking here...But instead of ai.txt - since the intent is not to block, but rather, to inform AI models (or any other presumably automaton) - my reflex is to suggest something more general like readme.txt. But, then i thought, well, since its really more about metadata, as others have stated, there might already be existing standards...Or, at least, common behaviors that could become standardized. For example, someone noted about security.txt, and i know there's the humans.txt approach (see https://humanstxt.org/), and of course there are web manifest files (see https://developer.mozilla.org/en-US/docs/Web/Manifest), etc. I wonder if you might want to consider reviewing existing approaches, and maybe augment them or see if any of thjose makese sense (or not)...?


What if we create a new access.txt which all user agents will use to get access to the resources.

access.txt will return an individual access key for the user agent like a session, and the user agent can only crawl using the access key

This would mean that we could standardize session starts with rate limits. Regular user is unlikely to hit the user rate limits, but bots would get rocked by rate limiting.

Great. Now authorized crawlers, bing, google, etc, all use PKI so that they can sign the request to access.txt to get their access key. If the access.txt request is signed with a known crawler the rate limits can be loosened to levels that a crawler will enjoy

This will allow users / browsers to use normal access patterns without any issue, but crawlers will have to request elevated rate limits to perform their tasks. Crawlers and AI alike could be allowed or disallowed by the service owners, which is really what everyone wanted from robots.txt in the first place

One issue I see with this already is that it solidifies the existing search engines as the market leaders


I might not understand you, but what prevents me from conducting a Sybil attack (a.k.a. a sock puppet attack) against this system?

Seems like it relies on everyone playing by the rules and only requesting one license per user. Why would a bot developer be incentivized to follow that rule and not just request 1M licenses?


That's a great point thank you for bringing this up. I don't have a solution for that, and frankly my proposal was putting a lot of work onto the platforms that wanted to support it so I'm not sure it would get much traction.


Something like JSON+LD ? It should cover most of your needs and can also be used for actual search engine

e.g: https://developers.google.com/search/docs/appearance/structu...


Isn't an AI a robot?


It's a composition vs. inheritance problem.

I postulate that robots need at least a single manipulator in the physical realm: Mechanical arm assembling car doors = robot. CNC machine that follows a path = robot. Mechs with chicken legs = robot. Brain in a vat = not a robot... but can be embedded in a robot.


It's a nice idea, but it totally ignores literally decades of existing use of the word "robot" (or its abbreviation "bot") to describe pure software that accesses internet services. e.g. web crawlers (googlebot), chat bots, automated clickers, etc...

Lexicography tends to be descriptive rather than prescriptive. If enough people use a word to mean a thing, that word means that thing. As least in some contexts. See also "gay", "hacker", etc...

Note that it is possible for a word's meaning to be "reclaimed", but it generally doesn't get that way by some small group of people just shouting "You're doing it wrong!"


Hmm, "robot" in its spelled out form sounds weird to me for this use ("bot" is more frequent). Wikipedia redirects people looking for software agents to a separate page from the article about the beep-boop ones: https://en.wikipedia.org/wiki/Robot


Good point.

Also Killer Robots are Robots: https://www.youtube.com/watch?v=4K6XJuH6P_w


I'm ready to put an ai.txt right on my site

    Kirk: Everything Harry tells you is a lie. Remember that. Everything Harry tells you is a lie.
    Harry: Listen to this carefully, Norman. I am lying.


I would prefer a more generic "license.txt" i.e. a standard sanctioned way of telling the User Agent that the resource under a certain location is provided with a specific license. Maybe a picture is public domain, maybe is copyrighted but freely distributable, or maybe it is but you cannot train AI on it. Same for code, text, articles etc. The difficult part would be to make it formal enough so that it can easily consumed by robots.

With the current situation you either assume that everything is not usable, or you just not care and crawl everything that you can reach.


Attempts to muster and legitimize the ownership, squandering and sequestration of The Commons are growing rampant after the recent successes of generative AI. They are a tragic and misguided attempt to lesion, fragment and own the very consistency of The Collective Mind. Individuals and groups already have fairly absolute authority over their information property -- simply choose not to release it to The Commons. If you do not want people to see, sit or sleep on your couch, please keep it locked inside your home.


I proposed META tags for the same reason. I don't think this is going to happen though.


> some useful info about the website like what it is about, when was it published, the author, etc etc.

Aren't there already things in place for that info (e.g. meta tags?)


Can we start changing our licenses to prohibit usage of a project for training AI systems?


Why would anyone want ai to train on and monetize your content? If there was a way to block ai stealing content most people would opt to block it.


And then your site would not be indexed by any search engine. Good luck with that.


This is the wrong model imho. Humans can figure out a website. We only tire. An AI system does not. But can do the same thing.

Additionally, any cooperative attempt won't work because humans will attempt to misrepresent themselves.

No successful AI system will listen to someone's self representation because the AI system does not need proxies: it can act by simply acquiring all recorded observed behaviour.


It is fair to give more information about the information exposed on a website, especially when it comes to partnering with AI systems. There is an international effort which includes such information. It is done under the auspices of the W3C. See https://www.w3.org/community/tdmrep/. It has been developed to implement the Text & Data Mining + AI "opt-out" that is legal in Europe. It does not use robots.txt because this one is about indexing a website and should stay focus on it. The information about website managers is contained in the /.well-known directory, in a JSON-LD file, which is much more well structured than robots.txt. Why not adhere to an international effort rather than creating N fragmented initiatives?


Related, there is also https://datatxt.org


Which claims 'under active development'. Four years ago the author took the robots.txt RFC and changed a couple of paragraphs https://github.com/datatxtorg/datatxt-spec/commit/36028e2280... Meanwhile the robots.txt was updated in 2022 https://www.rfc-editor.org/rfc/rfc9309.html


This is a well-intentioned thing to do. But I can't help but feel that we are way past the point where something like this would even matter.

Do search robots even care if you have a "noindex" in your page `<head>`? Do websites care if your browser sends a Do Not Track request?


Shouldn't ai respect robots.txt?


They should, but the greed of the AI companies and their lack of respect for other peoples property means there’s pretty much zero chance of that.


Yes.


Wouldn't that make the job of spammers easier? They can create very low quality websites but with very high quality (AI Generated?) ai.txt that fools AI engines into trusting them more than other websites with better content.


I've started to play with the Ai.txt metaphor, but pushing it closer to the semantic solution mentioned, focusing on Content Extraction and Cleaning. Happy to share the file example if anyone is interested.


If anyone want to use my blog posts, they can contact me. I want to know my customer.

If you want to know about copyright that applies to my work: https://www.riksdagen.se/sv/dokument-lagar/dokument/svensk-f...

Beeing in the US does not shield you from my country's laws. You are not allowed to copy my work without my permission, you are not allowed to transform it.


I think really most sites should (ideally) come with a text-only version. I know that's probably an extreme minority opinion but between console-based browsers, screenreaders, people with crappy devices, people with miniature devices, at the very least just having some kind of 'about this site' document would be helpful for anyone. There seems to be overlap between that need and this, possibly. Then again, having it in some format like json (or xml) might also be more 'accessible' to machines (and to certain devices).


I just today removed some disallow directives from robots.txt files and put in noidex metas on the pages instead like Google recommends. It doesn't really have much use nowadays.

As to copyright - yes I agree the Micky Mouse copyright law has been extended far too long and should be about thirty years. On the other hand I think trade marks should not be nowhere so easily liable to be lost even if people do use the term generally. Disney should still be able to make new Micky Mouse cartoons and be defended from others making them.


aside from the other comments here - robots.txt does work to some extent because it tells the crawler something it might be useful for the crawler to know - if you have blocked it from crawling part of yur site it might be actually beneficial to the crawler to follow that restriction (to be a good citizen) because if it doesn't you might block it by seeing a user agent showing up a part of the site it shouldn't.

AI.txt doesn't have this feedback to the AI to improve it. Also it seems likely users might have reason to lie.


> The file is just a basic text file with some useful info about the website like what it is about, when was it published, the author, etc etc.

How does this differ from what would be useful in humans.txt?


Glad to see this here, lots of great points here. I'm working on a spec for this specific usecase, reading the comments here pointed out a few flaws in my model already.


It's impossible.

The problem is that such ai.txt would be an unidimensional opinion based on what? On the way the site describes itself. So a self-referencing source.

But the AIs reading it, are precisely going to invariably be trained with different world views that will summarize and express opinions biased by these worldviews. It's even deeper as every worldview can't help but belong to one ideology or another.

So who is aligned with truth now?

The author? AI1? AI2? AI3?...AIN?

We're in such a mess.


If there’s one thing LLMs are pretty good at it’s summarizing content. Shouldn’t your website just have an “About” page with this information that humans can read too?


wrap your content with <article>


Put a blockchain wallet address or even multiples on different blockchains in the ai.txt to collect your shares of what the AI makes from your data + website. This is a fair way to solve the attribution problem. Similar to the robots.txt file this is not a hard enforcement but a way how responsible AI can differentiate itself from the rest.


At this point, all the good content has been sucked into LLM training sets. Other than a need to keep up with current events, there's no point in crawling more of the web to get training data.

There's a downside to dumping vast amounts of crap content into an LLM training set. The training method has no notion of data quality.


A better idea along the same lines: RFC 5785.


Note that RFC 5785 is obsoleted by RFC 8615.


Ah, yes, but what about RFC 5226?


I am not sure about that but I think IANA is quite open to recognizing new well-known URIs:

https://www.iana.org/assignments/well-known-uris/well-known-...

Basically, assuming that you have a spec, I think it amounts to filing a PR or discussing it on a mailing list.



I don't think that it would be wise for anyone to rely on such practices. Even with the best of intentions, obsolescence and unintentional misdirection are strong possibilities. Considering normative intentions, it is an invitation for "optimization" attempts by websites presenting contested information.


Do we need more features that are generally ignored? What has robots.txt gotten us? What has Do Not Track gotten us?


> What has robots.txt gotten us

A standard protocol for reputable crawlers to semantically understand some high-level page navigation rules.

Actual, useful crawling (i.e. to build search indices) would be much messier and more useless without most interesting sites putting up meaningful robots.txt guide-rails. Look at facebook.com/robots.txt and consider how much crap both Facebook and indexers would have to deal with lacking that information.


Can you give a live example? What is in this ai.txt that isn't in an about page that almost every site has?


I've been thinking about ai.txt more as rss - just beginning to vet the ideas and process" https://github.com/menro/ai.txt


Although, there is such a thing as Semantic Web, where such information can be embedded within a page.


Why ending in a training dataset would be great I don’t understand. I mean what’s the point of having a website at all if the user find what they’ve been looking for on another UI that’s been trained with your content and that’s not your website?


AI belongs to governments, not over trillion companies. Sorry, but we have to wreste the AI thing out from their arms. It's working on people's output, so it should be free. Time to nuke the big ones out from the sector.


It will work exactly as well as robots.txt and the do not track flag.


We should piss off google and standardize around chatgpt.txt


Why put this in ai.txt? It sounds useful to humans too! Maybe just put “what the site is about” on the homepage, so that everyone benefits.


what differentiates this from https://humanstxt.org/?


Excuse me, we prefer the term "android".


Automaton


And ai.txt should have a mechanism for micro (or not so micro) payment. Please deposit .03 X coin into this account to crawl site.


robots.txt was a performance-hack. It never felt like a audience-filter. As sad as it might sound, hoping for filtering on publicly reachable content seems a bit naiv in my book. If you want your stuff not learnt by an AI, you better not publish it. Everything a human can read, an AI eventually will.


The AI doesn't have to follow ai.txt, but it appreciates the effort you put into classifying data for it.


Why? So that they can both be ignored?


This makes about as much sense to me as the old “keywords” HTML meta tag.

It will be gamed.


Rate limits and captchas instead ?


Robots.txt is mostly ignored btw


What's next? humans.txt ?


Semantic web for robots?


$ cat ai.txt no $


ai is robot, no?


$ cat ai.txt no


Feels redundant


We should add spurious html text instead


Some interesting studies on this I've done: https://cho.sh/r/F9F706

Project AIs.txt is a mental model of a machine learning permission system. Intuitively, question this: what if we could make a human-readable file that declines machine learning (a.k.a. Copilot use)? It's like robots.txt, but for Copilot.

User-agent: OpenAI Disallow: /some-proprietary-codebase/

User-agent: Facebook Disallow: /no-way-mark/

User-agent: Copilot Disallow: /expensive-code/

Sitemap: /public/sitemap.xml Sourcemap: /src/source.js.map License: MIT

# SOME LONG LEGAL STATEMENTS HERE

Key Issues Would it be legally binding? For now, no. It would be a polite way to mark my preference to opt-out of such data mining. It's closer to the Ask BigTechs Not to Track option rather than a legal license. Technically, Apple's App Tracking Transparency does not ban all tracking activity; it never can.

254AFC.png

Why not LICENSE or COPYING.txt? Both are mainly written in human language and cannot provide granular scraping permissions depending on the collector. Also, GitHub Copilot ignores LICENSE or COPYING.txt, claiming we consented to Copilot using our codes for machine learning by signing up and pushing code to GitHub, We may expand the LICENSE system to include the terms for machine learning use, but that would even more edge case and chaotic licensing systems.

Does machine learning purposes of copyrighted works require a license? This question is still under debate. Opt-out should be the default if it requires a license, making such a license system meaningless. If it doesn't require a license, then which company would respect the license system, given that it is not legally binding?

Is robots.txt legally binding? No. Even if you scrape the web prohibited under robots.txt, it is not against the law. See HIQ LABS, INC., Plaintiff-Appellee, v. LINKEDIN CORPORATION, Defendant-Appellant.. robots.txt cannot make fair use illegal.

Any industry trends? W3 has been working on robots.txt for machine learning, aligning with EU Copyright Directives.

The goal of this Community Group is to facilitate TDM in Europe and elsewhere by specifying a simple and practical machine-readable solution capable of expressing the reservation of TDM rights. w3c/tdm-reservation-protocol: Repository of the Text and Data Mining Reservation Protocol Community Group

Can we even draw the line? No. One could reasonably argue that AI is doing the same as humans, much better and more efficiently. However, that claim goes against the fundamentals of intellectual property. If any IP is legally protected, machine-generated code must also have the same level of awareness system to respect it and prevent any plagiarism. Otherwise, they must bear legal duties.

Maybe it can benefit AI companies too ... by excluding all hacky codes and only opting for best-practice codes. If implemented correctly, it can work as an effective data sanitation system.


Why?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: