What does derivative mean here? Because IMO it means that the existing work was used as input. So if you used an LLM and it was trained on the existing work, that's a derivative work. If you rot13 encode something as input, so you can't personally read it, and then a device runs rot13 on it again and outputs it, that's a derivative work.
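A minimal sketch of the rot13 round trip, using Python's built-in `rot_13` codec. The sample string is made up for illustration; the only point is that a second pass restores the input byte for byte, even though nobody "read" the scrambled form in between.

```python
import codecs


def rot13(text: str) -> str:
    """Apply the ROT13 letter substitution to text."""
    return codecs.encode(text, "rot_13")


# Hypothetical input, standing in for "the existing work".
original = "def detect(data): return 'utf-8'"

scrambled = rot13(original)   # not readable at a glance
restored = rot13(scrambled)   # second pass undoes the first

print(scrambled)
print(restored == original)   # True
```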
In order for it to be creatively derivative you would need to copy the structure, logic, organization, and sequence of operations, not just reimplement the functionality. It is pretty clear in this case that wasn't done.
As a cynical person I assume all the frontier LLMs were trained on datasets that include every open source project, but as a thought experiment, if an LLM was trained on a dataset that included every open source project _except_ chardet, do you think said LLM would still be able to easily implement something very similar?
Of course, the problem with this interpretation is that all modern LLMs are derivatives from huge amounts of text under completely different licenses, including "All rights reserved", and therefore cannot be used for any purpose.
I'm not sure how you square the circle of "it's alright to use the LLM to write code, unless the code is a rewrite of an open source project to change its license".
> Of course, the problem with this interpretation is that all modern LLMs are derivatives from huge amounts of text under completely different licenses, including "All rights reserved", and therefore can not be used for any purpose.
> I'm not sure how you square the circle of "it's alright to use the LLM to write code
You seem like you're on the cusp of stating the obvious correct conclusion: it isn't.
LLMs do not encode nor encrypt their training data. The fact they can recite training data is a defect, not a default. You can see this more simply by comparing the model size against what a fantasy compression algorithm 50% better than SOTA would need to store the training set. You'll find you'd still be missing 80-90% of the training data, even if the model were as much of a stochastic parrot as you may be implying. The outputs of AI are not derivative just because the model saw training data that included the original library.
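A back-of-the-envelope sketch of that calculation. Every number below is an illustrative placeholder, not a measurement of any real model or dataset, and "50% better than SOTA" is read here as "compressed output half the size", which is an assumption. Different inputs give different answers.

```python
def fraction_missing(train_bytes: float, model_bytes: float,
                     sota_ratio: float, improvement: float = 0.5) -> float:
    """Fraction of the training data that could not fit in the weights,
    even if the weights were nothing but a compressed archive.

    sota_ratio:  compressed size / original size for the best real
                 compressor (assumed value).
    improvement: how much smaller the fantasy compressor's output is
                 (0.5 means half the size of SOTA output).
    """
    fantasy_ratio = sota_ratio * (1.0 - improvement)
    compressed_bytes = train_bytes * fantasy_ratio
    stored = min(1.0, model_bytes / compressed_bytes)
    return 1.0 - stored


# Placeholder inputs: 10 TB of text, 70e9 parameters at 2 bytes each,
# and an assumed SOTA text compression ratio of 0.15.
missing = fraction_missing(train_bytes=10e12, model_bytes=140e9,
                           sota_ratio=0.15)
print(f"{missing:.0%} of the training data would not fit")
```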
Then onto prompting: 'He fed only the API and (his) test suite to Claude'
This is Google v Oracle all over again - are APIs copyrightable?
I find the "compression" argument not very strong, both because copyright still applies to (very) lossy codecs (e.g. your 16 kbps Opus file of Thriller infringes, even if the original 192 kHz/32-bit WAV file was 12,000 kbps), and because copyright still applies to transformed derivative works (a tiny MIDI file of Thriller might still be enough for Jackson's label to get you).
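For reference, the bitrate arithmetic behind that example, assuming the WAV is uncompressed stereo PCM (the channel count is an assumption, it isn't stated above). It comes out close to the 12,000 kbps figure.

```python
def pcm_kbps(sample_rate_hz: int, bit_depth: int, channels: int) -> float:
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000


wav_kbps = pcm_kbps(192_000, 32, 2)   # stereo is an assumption
opus_kbps = 16

print(wav_kbps)               # 12288.0
print(wav_kbps / opus_kbps)   # 768.0, i.e. the Opus file is ~768x smaller
```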
> This is Google v Oracle all over again - are APIs copyrightable?
Yes, this is the best way to ask the question. If I take a public-facing API and reimplement everything, whether it's by human or machine, it should be sufficient. After all, that's what Google did, and it's not like their engineers never read a single line of the Java source code. Even in "clean room" implementations, a human might still have remembered or recalled a previous implementation of some function they had encountered before.
> LLMs do not encode nor encrypt their training data. The fact they can recite training data is a defect not a default.
About this specific point, it is unclear how much of a defect memorization actually is - there are also reasons to see it as necessary for effective learning. This link explains it well:
"The clean-room reimplementation test" isn't a legal standard, it's a particular strategy used by would-be defendants to clearly meet the standard of "is the new work free of copyrightable expression from the original work".
This scenario is not new with AI at all though? 14 years ago I watched a group of 3 front-end devs spin up a proof of concept in ember.js that had a flashy front end, all fake data, and demo it to execs. They wowed the execs, and every time the execs asked "how long would it take to fix (blank) to actually show (blank)?" the devs hit F12, inspect element, and typed in what they asked for and said "already done!".
It was missing years of backend and had maybe 1/20th feature parity with what we already had, and it would have, in hindsight, been literally impossible to implement some of the things we would need in the future if we had gone down that path. But they were amazed by this flashy new thing that devs made in a weekend that looked great but was actually a disaster.
I fail to see how this is any different than what people are complaining about with vibe coded LLM stuff a decade and a half later now? This was always being done and will continue to be done; it's not a new problem.
I've done this multiple times in various codebases: medium-sized personal ones (approx 50k lines for one project, and a smaller 20k-line one earlier), and I am currently in the process of doing a similar migration at work (~1.4 million lines, but we didn't migrate the whole thing, more like 300k of it).
I found success with it pretty easily for those smaller projects. They were gamedev projects, and the process was basically to generate a source of truth AST and diff it vs a target language AST, and then do some more verifier steps of comparing log output, screenshot output, and getting it to write integration tests. I wrote up a bit of a blog on it. I'm not sure if this will be of any use to you, maybe your case is more difficult, but anyway here you go: https://sigsegv.land/blog/migrating-typescript-to-csharp-acc...
For me it worked great, and I would use (and am using) a similar method for more projects.
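A rough sketch of what the "diff a source-of-truth AST against the target AST" step could look like. This is not the code from the blog post. It assumes each side has already been reduced by a language-specific parser (not shown) to a flat list of declarations; `Decl`, `normalize`, and `diff_decls` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decl:
    name: str    # qualified name, e.g. "Player.move"
    kind: str    # "function", "class", ...
    arity: int   # number of parameters


def normalize(name: str) -> str:
    # TypeScript tends to use camelCase and C# PascalCase for methods,
    # so compare names case-insensitively.
    return name.lower()


def diff_decls(source: list[Decl], target: list[Decl]) -> dict[str, list[str]]:
    """Report declarations missing from, extra in, or changed in the target."""
    src = {normalize(d.name): d for d in source}
    tgt = {normalize(d.name): d for d in target}
    shared = src.keys() & tgt.keys()
    return {
        "missing": sorted(src.keys() - tgt.keys()),
        "extra": sorted(tgt.keys() - src.keys()),
        "changed": sorted(
            k for k in shared
            if (src[k].kind, src[k].arity) != (tgt[k].kind, tgt[k].arity)
        ),
    }


# Made-up declarations standing in for parser output on each side.
ts_side = [Decl("Player.move", "function", 2),
           Decl("Player.heal", "function", 1),
           Decl("World.tick", "function", 0)]
cs_side = [Decl("Player.Move", "function", 2),
           Decl("Player.Heal", "function", 2),
           Decl("World.Save", "function", 0)]

report = diff_decls(ts_side, cs_side)
print(report)
```

A report like this only covers structure. The log, screenshot, and integration-test comparisons mentioned above would still be needed to check behavior.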
"I also wanted to build a LOT of unit tests, integration tests, and static validation. From a bit of prior experience I found that this is where AI tooling really shines, and it can write tests with far more patience than I ever could. This lets it build up a large hoard of regression and correctness tests that help when I want to implement more things later and the codebase grows."
The tests it writes in my experience are extremely terrible, even with verbose descriptions of what they should do. Every single test I've ever written with an LLM I've had to modify manually to adjust it or straight up redo it. This was as recent as a couple months ago for a C# MAUI project, doing playwright-style UI-based functionality testing.
I'm not sure your AST idea would work for my scenario. I'd be wanting to convert XNA game-play code to PhaserJS. It wouldn't even be close to 95% similar. Several things done manually in XNA would just be automated away with PhaserJS built-ins.
Ya, I could see where framework patterns and stuff will need a lot of corrections in post after that type of migration. For mine it was the other direction and only the server portion (Express server written in TypeScript for a Phaser game, ported to Kestrel on C#, which was able to use pretty much identical code; after it was done I just switched and refactored a few things to make it more idiomatic C#).
For the tests, I'm not sure why we have such different results, but essentially it took a codebase I had no tests in, and in the port it one-shot a ton of tests that have already helped me in adding new features. My game server for it runs in Kubernetes and has an "auto-distribute" system that matches players to servers and redistributes them if one server is taken offline. The integration tests it wrote for testing that auto-distribute system found a legit race condition that was there in both the old and new code (it migrated it accurately enough that it had the same bugs), and as part of implementing that test it fixed the bug.
Of course I wouldn't use it if it wasn't a good tool but for me the difference between doing this port via this method versus doing it manually in prior massive projects was such an insane time save that I would have been crazy to do it any other way. I'm super happy with the new code and after also getting the test infra and stuff like that up it's honestly a huge upgrade from my original code that I thought I had so painstakingly crafted.
Is there a way you can reasonably use that data offer to a large extent? For example, I know my provider has a setup where you can register your device MAC, connect to their city-wide wifi, and then it will let you use your data as wifi. In this scenario if it's really unlimited, is there a way you can chain devices together to get something crazy like 20Gbps up/down and do a bunch of heavy operations 24 hours a day on that for a few days?
Warning: I imagine if you do this they will say "it's not really unlimited bro, that's not what we meant"
How would a team verify this for any current model? They would have to observe and control all training data. In practice, any currently available model that is good enough to perform this task likely fails the clean room criteria due to having a copy of the source code of the project it wants to rewrite. At that point it's basically an expensive lossy copy paste.
You can always verify the output. Unless the problem being solved really is exceedingly specific and non-trivial, it's at least unlikely that the AI will rip off recognizable expression from the original work. The work may be part of the training but so are many millions of completely unrelated works, so any "family resemblance" would have to be there for very specific reasons about what's being implemented.
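One crude way to sketch "verify the output" is a line-level similarity check with Python's `difflib`. This is only a textual heuristic, not a legal test: it catches near-verbatim overlap and says nothing about copied structure or organization. The function name `shared_line_ratio` is made up for illustration.

```python
import difflib


def shared_line_ratio(original: str, candidate: str) -> float:
    """Similarity in [0, 1] between two sources, compared line by line
    after stripping whitespace and dropping blank lines."""
    a = [ln.strip() for ln in original.splitlines() if ln.strip()]
    b = [ln.strip() for ln in candidate.splitlines() if ln.strip()]
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


# Toy inputs; real use would read the two files being compared.
old = "x = read()\ny = parse(x)\nreturn y"
new = "x = read()\nresult = decode(x)\nreturn y"
print(round(shared_line_ratio(old, new), 2))
```

A high ratio would be a reason to look closer by hand. A low one doesn't settle anything on its own.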
Presumably you've posted it here because you would like others to either view it or consume it, neither of which is possible on (at least my) fairly common Galaxy S23 phone. Of course it's your decision how you want to serve the experience; we're just saying that right now your page is a fairly poor experience compared to the average website.
Yep, on my phone I opened this page and closed it in 20 seconds. Can't see shit, captain! Didn't figure out if it's the font, weight, or size but this is the worst looking website I've hit in a while.