I’m not sure how common sense factors in, but the education system has absolutely failed most providers that I’ve seen in the last 20 years.

It’s the same idea that drove Lorem Ipsum for typesetting placeholders.

If you live in a place with a lot of clay, geotextile fabric can certainly be problematic for simple residential settings.


Except that the incentives are totally different from an end user's perspective. Apple is incentivised to make the phone more attractive to the person using it, because they're the ones paying. With Google Search, it's the advertiser that's paying. The only thing Google needs to worry about is keeping advertisers happy, and that doesn't align with making search results better.


That's specious. Apple doesn't get paid by "consumers" either, they sell phones to retailers for the most part. "All they care about" is making retailers happy, right?

Obviously everyone who sells a product wants to keep their end users (the "end" in "end users" is there for a reason!) happy, because if they don't, they won't get paid. To argue that only your favorite is so incentivised seems silly.


Unfortunately, you lose a significant amount of functionality by degoogling. Any app that relies on Google services, which is a large number of them, will be broken.


1) If your blog posts are private, why are they on publicly accessible websites? Why not put them behind a paywall of some sort?

2) How many novels have bibliographies? How many musicians cite their influences? Citing sources is all well and good in academic papers, but there’s a point at which it just becomes infeasible. The more transformative the work, the harder it is to cite inspiration.

3) What about libraries? Should they be licensing every book they have in their collections? Should the people who check the books out have to pay royalties to learn from them?


> 1) If your blog posts are private, why are they on publicly accessible websites? Why not put them behind a paywall of some sort?

If I grow apple trees in front of my house and you take all the apples, then turn up at my doorstep trying to sell me juice made from the apples you nicked, the fact that I chose not to build a tall fence around my trees doesn't mean you had the right to do it. Public content is free for humans to read; it is not free for corporations to take without my knowledge or permission and use as the basis of a paid content-generation service.

> 2) How many novels have bibliographies? How many musicians cite their influences? Citing sources is all well and good in academic papers, but there’s a point at which it just becomes infeasible. The more transformative the work, the harder it is to cite inspiration.

You are making this kind of argument: "How much does a drop of petrol cost? Nothing. Right, so could you fill my car drop by drop, for free?"

If we have technology that can charge for producing bullshit on an industrial scale by recombining sampled works of others, then we are perfectly capable of keeping track of the sources used for the training and for the generative diarrhoea it emits.

> 3) What about libraries? Should they be licensing every book they have in their collections? Should the people who check the books out have to pay royalties to learn from them?

Yes, they should, and in the UK they effectively do: the Public Lending Right scheme pays authors when their books are borrowed from public libraries. https://www.bl.uk/plr


All of these responses were such quality that there's really no need to add anything. I especially like the apple argument: the product is in my front yard, but you still have no basis to take it.

If there were the equivalent of what a lot of other sites have (gems, gold, ribbons), I'd give you one. I do have a lot of actual gems, though; I'll send you an admittedly teeny heliodor, tourmaline, or peridot at cost if you want one. The gemstone market's junk lately with the economy.


You're both just repeating the "you wouldn't download an apple" argument. In the context of the Internet, you're voluntarily sending the user an apple and expecting them not to do various things with it, which is unreasonable. Nothing is taken. If it were, your website would be completely empty.

Remember, Copying Is Not Theft. Copyright law is just a temporary monopoly meant to economically incentivize you. Nothing more.

BTW, pro-AI countries do differentiate between private and public posts. If it's public, it's legally fair game to train on it. If it's private, you need a license to access it. So it does matter. Also see: https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn


LLMs don’t memorize everything they’re trained on verbatim, either. It’s all vectors behind the scenes, which is loosely analogous to how the human brain works: just strong or weak connections.

The output is what matters. If what the LLM creates isn’t transformative, or public domain, it’s infringement. The training doesn’t produce a work in itself.

Besides that, how much original creative work do you really believe is out there? Pretty much all art (and a lot of science) is based on prior work. There are true breakthroughs, of course, but they’re few and far between.


If Perplexity’s source code is downloaded from a public website or other repository, and you take the time to understand the code and produce your own novel implementation, then yes. Now, if you “get it from a friend” illegally, _or_ you just redeploy the code without creating a transformative work, then there’s a problem.

> Just pay the stupid license and if that makes your business unsustainable then it's not much a business is it?

In the persona of a business owner: why pay for something that you don’t legally need to pay for? The question of how copyright applies to LLMs and other AI is still open. They’d be fools to buy licenses before it’s been decided.

More importantly, we’re potentially talking about the entire knowledge of humanity being used in training. There’s no-one on earth with that kind of money. Sure, you can just say that the business model doesn’t work, but we’re discussing new technologies that have real benefit to humanity, and it’s not just businesses that are training models this way.

Any decision which hinders businesses from developing models with this data will hinder independent researchers tenfold, so it’s important that we’re careful about what precedent is set in the name of punishing greedy businessmen.


> They’d be fools to buy licenses before it’s been decided.

They are willingly ignoring licenses until someone sues them? That's still illegal and completely immoral. There is a ton of data to train on: the entirety of Wikipedia, all of Stack Overflow (at least previously), all of the BSD- and MIT-licensed source code on GitHub, the entire Project Gutenberg. So much material, freely and legally available, yet they feel that they don't need to check licenses?


The legality of their behavior is not currently well defined, because it's unprecedented. Fair use permits transformative works. It has yet to be decided whether LLMs and their output qualify as transformative, or even if the training is capable of infringing copyright of an individual work in the first place if they're not reproducing it. In fact, there's a good amount of evidence which indicates that fair use _does_ apply, given how Google operates and what they've argued successfully (https://en.wikipedia.org/wiki/Perfect_10,_Inc._v._Amazon.com...).

Purchasing licenses when you are already entitled to your current use of the work is just bad business, especially when the legal precedent hasn't been set to know what rights might need to exist in said license.

You might not like the idea of your blog posts or other publicly posted materials being used to train LLMs, but that doesn't make it illegal (morality is subjective and I'm not about to argue one way or another). If it's really that much of a problem, you _do_ have the ability to remove your information from public accessibility, or otherwise protect it against LLM ingestion (IP restrictions, etc.).
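
On the "protect it against LLM ingestion" point, the mechanics are simple even if compliance is voluntary. A minimal robots.txt sketch (GPTBot and CCBot are the crawler user-agents that OpenAI and Common Crawl have documented; well-behaved crawlers honor this, bad actors won't):

    # robots.txt at the site root: ask known AI crawlers to stay out
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

IP restrictions or authentication are the stronger option if you want enforcement rather than a polite request.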

edit: I am not a lawyer (this is likely obvious to any lawyers out there); this is my personal take.


Note that not all jurisdictions have the concept of "fair use" (use of copyrighted material, regardless of transformation applied, is permitted in certain contexts…ish). Canada, the UK, Australia, and other jurisdictions have "fair dealing" (use of copyrighted material depends on both reason and transformation applied…ish). Other jurisdictions have neither, and only licensed uses are permitted.

Because the companies behind large models (diffusion, LLM, etc.) have consumed content created under non-US copyright laws and have presented it to people outside US copyright jurisdiction, they are likely liable for misapplication of fair dealing even if the US ultimately deems what they have done "fair use" (IMO unlikely, because of the perfect-reproduction problems that plague them all in different ways; there are likely to be the equivalent of trap streets that will make this clearly copyright violation on a large scale).

It's worth noting that while models like GitHub Copilot "freely" use MIT-, BSD- (except 0BSD), and Apache-licensed software, they are likely violating those licenses every time a reasonable facsimile pops up, because the licenses require that their terms accompany any full or partial distribution or derivation.

It's almost as if wholesale copyright violations were the entire business model.


You're right. I'm definitely taking a very US-centric view here; it's the only copyright system I'm familiar with. I'm really curious how jurisdictions with no concept of fair use or fair dealing work. That seems like a legal nightmare. I expect you wouldn't even be able to critique a copyrighted work effectively, nor teach about it.

When you speak of the "perfect reproduction" problem, are you referring to cases where LLMs have spat out code recognizable from the training data? I agree that that's a problem, but I expect the solution is a wider range of training data, allowing the LLM to better "learn" the structure of what it's being trained on. With more and broader training data, the output should have less chance of reproducing exactly what the model was trained on _and_ may introduce novel methods of solving a given problem. In the meantime, it would probably be smart to test for recognizable reproduction and throw out the offending answers, perhaps with a link to the source material in their place; a rough sketch of such a test is below.
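
To be concrete about what such a test could look like: even a crude n-gram overlap check against an index of the training corpus catches long verbatim runs. A minimal sketch in Python; the 8-word threshold and all names here are made up for illustration, not taken from any vendor's pipeline:

    # Flag output that shares a long verbatim word run with the training corpus.
    def ngrams(text, n=8):
        words = text.split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def looks_like_reproduction(output, corpus_index, n=8):
        # corpus_index: a set of n-grams precomputed over the training documents
        return bool(ngrams(output, n) & corpus_index)

    # Toy corpus standing in for real training data.
    corpus = ["we hold these truths to be self evident that all men are created equal"]
    corpus_index = set()
    for doc in corpus:
        corpus_index |= ngrams(doc)

    answer = "as the saying goes, we hold these truths to be self evident indeed"
    print(looks_like_reproduction(answer, corpus_index))  # True: 8-word verbatim run

A real system would need fuzzier matching (normalization, hashing, near-duplicate detection), but the principle is the same.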

There's also a point, however, where the same code is likely to be reproduced regardless of training. Mathematical formulas and algorithms come to mind. If there's only one good solution to a problem, even humans are likely to come up with the same code without ever seeing each other's output. It seems like there's a grey area here which we need to find some way to account for. Granted, this is probably the exception rather than the rule.
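
As a concrete example of that convergence: two developers who have never seen each other's code will write Euclid's GCD almost identically, because the algorithm leaves essentially no room for variation (a from-scratch sketch, not drawn from any codebase):

    # Euclid's algorithm: independent authors land on nearly this exact code.
    def gcd(a, b):
        while b:
            a, b = b, a % b
        return a

    print(gcd(48, 18))  # 6

Identical output here says nothing about copying; it's the shape the problem forces.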

> It's almost as if wholesale copyright violations were the entire business model.

If I had to guess, this is probably a case where businesses are pushing something out sooner than it should have been. I find it unlikely that any business is truly basing their model on something which is so obviously illegal. I'm fully willing to believe, however, that they're willing to ignore specific instances of unintentional copyright infringement until they're forced to do something about it. I'm no corporate apologist. I just don't want to see us throw this technology away because it has problems which still need solving.


I live in a fair dealing jurisdiction, and additional uses would need to be negotiated with the rights holders. (I believe that this is part of the justification behind the Canadian law on social media linking to news organizations.) It is worth noting that in addition to the presence or absence of fair dealing/fair use, there are also moral rights which must be considered (which is another place where LLM tech — especially the so-called summarization — likely falls afoul of the law: authors have the moral right to not be misrepresented and the LLM process of "summarization" may come to the opposite conclusion of what the author actually wrote).

Perfect reproductions apply not only to software, but to poetry, prose, and images. There is a reason why diffusion model providers are facing lawsuits over "in the style of <artist>": some of those styles are very distinctive and include elements akin to trap streets on maps (this happens elsewhere, too; consider the lawsuit and eventual settlement over the tattoo image used in The Hangover Part II).

With respect to "training it on more data", I do not believe you are correct — but I have no proof. The public statements made by the people who have done the training have suggested that they have done such training on extremely wide and deep sources that have been digitized, including a number of books and the wider Internet. The problem is that, on some subjects, there are very few source materials and some of those source materials have distinctive styles which would be reproduced when discussing those subjects.

I’m now more than thirty years into my career. Some algorithms will see similar code written by humans, but most code has variability outside of those fairly narrow ranges. Twenty years ago, I derived the Diff::LCS library for Ruby from the same library for Perl, yet when I look back at the original code I ported from, I cannot recognize the algorithms (which is a problem when I want to consider implementing things differently). Someone else might have ported it differently and chosen different trade-offs than I did. Even simple things like variable names will likely differ between two developers writing similarly complex implementations of the same algorithm.
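
To illustrate how much latitude even a textbook algorithm leaves: the core of LCS is a small dynamic-programming recurrence, yet authors still diverge on naming, iteration order, and memory trade-offs. A from-scratch sketch (not code from Diff::LCS or its Perl ancestor):

    # Longest-common-subsequence length, bottom-up, one DP row at a time.
    # Keeping a single row uses O(len(b)) memory; a full table is the other
    # common choice. Either way, the trade-off marks an author's style.
    def lcs_length(a, b):
        prev = [0] * (len(b) + 1)
        for x in a:
            curr = [0]
            for j, y in enumerate(b, 1):
                curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
            prev = curr
        return prev[-1]

    print(lcs_length("AGGTAB", "GXTXAYB"))  # 4 ("GTAB")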

There is an art to programming, and if someone has a particular coding style (in Ruby, think of the distinctive Seattle style) which shows up in Copilot output, then you have a possible source for the training.

Finally, I believe you are being naïve about businesses basing their model on "something which is so obviously illegal". Might I remind you of Uber (private car hire was illegal in most jurisdictions because it requires licensing and insurance), AirBnB (private hotel-style rentals were illegal in most jurisdictions because they require licensing, insurance, and specific tax filings), Napster (all your music are belong to no one, at least until the musicians and their labels got involved), etc. I firmly believe that every single commercial LLM available now — possibly with the exception of Apple's, because they have been chasing licensing — is based on wholesale intentional copyright violations. (Non-commercial LLMs may be legal under fair use and/or fair dealing provisions, which does not address issues for content created where neither fair use nor fair dealing apply.)

I am unwilling to give people like sama the benefit of the doubt; any copyright infringement was not only intentional, but brazen and challenging in nature.

I'm frankly looking forward to the upcoming AI winter, because none of these systems can deliver on their promises, and they can't even exist without misusing content created by other people.


> Purchasing licenses when you are already entitled to your current use of the work is just bad business, especially when the legal precedent hasn't been set to know what rights might need to exist in said license.

Your take on how all this works is probably more in line with reality than mine; it's just that my brain refuses to comprehend the willingness to take on that kind of risk.

You're basically telling investors that your business may be violating all sorts of IP laws, that you don't know, and that you've taken no action to find out. It's a gamble that this might all work out, made while taking billions in funding. There's apparently no risk assessment in VC funding.


> If Perplexity’s source code is downloaded from a public website or other repository, and you take the time to understand the code and produce your own novel implementation, then yes.

Even that can be considered infringement and get you taken to court. It's one of the reasons reading leaked code is considered dangerous, and why you hear terms like "clean room"[0] when discussing reimplementations of products.

[0]: https://en.wikipedia.org/wiki/Clean_room_design


It certainly can be, but it's not guaranteed. Clean room design is one way to avoid a legally ambiguous situation. It's not a hard requirement to avoid infringement. For example, the US Supreme Court ruled that Google's use of the Java APIs fell under fair use.

My point is: just because certain source material was used in the making of another work does not guarantee that it's infringing on the rights of that original IP.


There’s a lot more cost involved in running a library than buying books. Staff and building upkeep are big expenses. That said, you want to have some newer books coming in, too, if you want to keep kids interested.


When I was 9, the librarian was the scary lady in the corner that yelled at me if I so much as coughed.

