If the model weights can be copyrighted, then they can also be a derivative work of something copyrighted, since the model is based on copyrighted works. So either the weights infringe copyright, which means they don't belong to Meta, or the weights can't be copyrighted.
I don't think this follows. Calling it a derivative work already feels like a stretch, but even granting that framing, the use is clearly transformative and therefore likely to be considered fair use in the US.
I actually wrote a Wikipedia article on the intersection of copyright law and deep learning models the other day (https://en.wikipedia.org/wiki/Artificial_intelligence_and_co...). I was hoping to include a section on the copyrightability of model weights, but sadly couldn't find any coverage in reliable sources.
So, while the poster you replied to has the wrong reasoning, they got to the right conclusion. Mostly.
Let's start with a non-AI example: compilers. If I have the source code for Linux, I can compile it, but I don't own the kernel binaries I made. This is because the compiler is a purely mechanical process, not a tool of human creativity. Copyright in the US attaches to creativity from human authors. So the source code would be the creative work, not the binaries.
We don't normally talk about this because ownership over the source code still flows through to the binaries. Your permission to copy that Linux binary is downstream of Linus having granted you permission to do so under the GPL. If you had instead copied, say, the NT kernel, you would be infringing the copyright on the NT kernel source code by distributing binaries of it.
So now let's go to AI land. You've collected a bunch of training data and dumped it into a linear algebra blender. That's like compiling source code: the ML trainer program adds no creativity or authorship, so you haven't gained any ownership over the data. Remember: this training data is scraped off the Internet from other people's work. Fair use merely makes it non-infringing to do this; it does not mean you own the result.
There are two avenues by which Meta could still get US copyright over the language model:
- They could train a model on data they created themselves, and use their ownership of that training data to get ownership over the model.
- They could assert ownership over the compilation of training data.
Compilation ownership is kind of weird. Basically, in the US, you can make a compilation of other people's work and own just the compilation. Like, say, a "Top 10 Songs I Like" playlist[0]. But even then, the creativity and authorship rules still apply. These models are not trained by having humans hand-pick specific works that would do well in the model. They scrape the Internet and train on everything[1]. In fact, the labs usually don't even use their own scrapes; they use Common Crawl, LAION-5B, and/or The Pile.
Testing whether any of this is right would require someone to actually share LLaMA, get sued by Facebook, and then assert this legal theory. And to hope that Facebook doesn't assert any other legal claims, such as misappropriation of trade secrets, which might actually stick.
[0] Or in a particularly egregious example, someone copyrighting their Magic: The Gathering deck in protest of this nonsense.
[1] Stable Diffusion at least uses an "aesthetics score", but AFAIK that's generated by an AI so also not copyrightable.
You explain it better than I do. My point is that, with AI, you can either say "this thing is simply an array of numbers; I'm not infringing anyone's copyright by creating this model," or you can say "this creation is mine, I made it, you cannot use it." You cannot build a program that can spit out copyrighted work and then claim the thing is yours. That is not going to fly.
Because if you do that, then all I have to do to pirate a book is train a model on that book and sell the trained model as mine, which does not make sense.
No, the weights cannot be considered a work of authorship, since no one intentionally creates them, and only works of authorship can be copyrighted. Furthermore, a copyright exists as the property of an author, and, as noted above, a model has no author.
As another user said, the process is mechanical, so I'm not sure it can be thought of as a derivative work.
I guess what I want to say is that in this matter of AI, you can't have your cake and eat it too. If you want copyright over your weights, be prepared to also pay for the rights to the content your weights were based on.
And I don't think anybody in the AI world wants to walk down that avenue.
The U.S. Copyright Office says that copyright protection doesn't cover ideas, procedures, processes, systems, methods of operation, concepts, or discoveries, no matter how they're described, explained, illustrated, or embodied. You can find this in the Copyright Act, 17 U.S.C. § 102(b): https://www.copyright.gov/title17/92chap1.html#102.
Music can be copyrighted, and it can also be built from samples of other copyrighted music. Yet sampling still happens without infringing copyright.
I'd say the small handful of bits a model flips when trained on some text, piece of code, or image carries over even less copyrighted information than a music sample or a borrowed/referenced melody.
Your philosophical argument is interesting, but what the OP was saying is that one of the repos this repo links to is inaccessible due to a DMCA takedown: https://github.com/shawwn/llama-dl
So while what you say may be true, the DMCA still has worth for these orgs: they can get code removed by the host, who is uninterested in litigating, and the repo owner is likely even less capable of litigating the DMCA.
Unfortunately, as a tool of fear and legal gridlock, the DMCA has shown itself to be very useful to those with ill intent.
Downloading the Alpaca weights actually does use a torrent now! But it seems a little.... leechy... to use a fire-and-forget CLI torrent client to grab a file of dozens of gigabytes and then quit.
Best would be to let it seed for a while, until it reaches at least a 0.5 ratio or something like that. But downloading over torrent is still better than over HTTP: the client will still seed the pieces it has (unless the downloader turned that off entirely, which would be an asshole move) as it downloads the missing ones.
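For example (a sketch, assuming the aria2 CLI; the magnet link is a placeholder), aria2c can be told to keep seeding to a given share ratio instead of quitting the moment the download finishes:
# keep seeding until a 0.5 share ratio is reached, then exit
aria2c --seed-ratio=0.5 "magnet:?xt=urn:btih:..."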
I hope we will see a revival of native desktop development with AI. There is really no reason for bloated JS containers; we can ask an AI to rewrite them in C++.
I hope so too. Plus, in my experience using GPT-4 to generate code, it's more effective at writing C++ than it is at writing Python. Something about the type annotations helps it get things right.
That would be absurdly wasteful of power. My calculator runs on the ambient light of a comfortably lit room. There are some things an LLM is well-suited for, but c'mon. Unless electricity and hardware become free, I just don't see your prediction coming true.
If history is any guide, I don't think an absurd waste of power is going to dissuade most people from throwing LLMs at everything they can.
I've already seen functions like LessThanOnePage(text) that feed a long prompt plus freeform text into a slow LLM to answer, basically, "will this text fit on one printed-out page?". It takes several minutes to run, but requires a tiny fraction of the brainpower needed to implement the function efficiently.
Kind of seems like a new level or generation of high-level (slow to execute but fast to code) vs low-level (fast to execute but slow to code) programming.
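For the curious, a minimal sketch of that pattern, assuming an OpenAI-style chat completions API; the model name, prompt wording, and placeholder text are illustrative, since the original function wasn't shown:
# hypothetical LessThanOnePage-style call: ask the model a layout question
# instead of measuring the text (endpoint and model are assumptions)
curl -s https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Answer only yes or no: will the following text fit on one printed page?\n\n<text goes here>"}]}'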
Compare your calculator with the Electron-like one included in Windows, or with sending a calculator query to Google. The latter still exists despite orders of magnitude more energy usage.
Running neural networks will become cheaper too; all computers will have hardware accelerators for them.
I don't think they mean the LLM will literally be controlling the pixels; it will just be writing the high-level code that eventually controls the pixels.
# build this repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# for Windows with CMake, use the following commands instead:
cd <path_to_llama_folder>
mkdir build
cd build
cmake ..
cmake --build . --config Release
# obtain the original LLaMA model weights and place them in ./models
ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model
# install Python dependencies
python3 -m pip install torch numpy sentencepiece
# convert the 7B model to ggml FP16 format
python3 convert-pth-to-ggml.py models/7B/ 1
# quantize the model to 4-bits (using method 2 = q4_0)
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2
# run the inference
./main -m ./models/7B/ggml-model-q4_0.bin -n 128
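To give it a prompt instead of letting it free-run, the same binary takes -p (prompt) and -t (threads); a sketch, assuming the flag set from around when this was posted:
# same model, but with an explicit prompt and 8 threads
./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128 -p "Building a website can be done in 10 simple steps:"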
dalai does a lot of magic in the background. I had to run the package from source to debug the error messages. When run on a Mac with zsh, it still runs bash under the hood to set up the Python venv. It took me an hour to debug that, because I never use bash, so my bash didn't have poetry and pyenv configured to select a pre-3.11 Python. And sentencepiece does not have a 3.11 wheel available, so running dalai fails.
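In case it helps anyone, a sketch of the workaround, assuming pyenv is installed (3.10.10 is just an example pre-3.11 version that sentencepiece ships wheels for):
# pin a pre-3.11 Python before creating the venv so pip can use a prebuilt sentencepiece wheel
pyenv install 3.10.10
pyenv local 3.10.10
python3 -m venv venv && . venv/bin/activate
pip install sentencepiece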
Today, just two days after running dalai successfully, it refuses to start and just hangs. Unclear why.
The first version won't work on my Linux distribution (NixOS). The second one won't work either, but I know how to make it work (C++ and Python without any additional abstraction layers are easy to deal with).