Hacker News | davisp's comments

> If I build a Markov chain based on a statistical analysis of word sequences in Hamlet, and then use it to produce a new sentence that isn't found in the text of that work, I have not created a derivative work of Hamlet under any applicable sense of that term.

Uh, that is exactly what a derivative work is. You literally specify that Hamlet is an input to your work. I believe you're conflating derivative with transformative. You're certainly creating a transformative derivation of Hamlet, but you are by definition creating a derivative work by training a Markov chain on the text of Hamlet.
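For concreteness, the kind of word-level Markov chain being argued about could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names `build_chain` and `generate` are hypothetical, the sample is a single short line rather than the full play, and nothing here says anything about how a court would classify the output.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words observed directly after it."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)

def generate(chain, start, length, seed=0):
    """Walk the chain from `start`, sampling successors by observed frequency."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        successors = chain.get(out[-1])
        if not successors:  # dead end: this word never had a successor
            break
        out.append(rng.choice(successors))
    return " ".join(out)

sample = "to be or not to be that is the question"
chain = build_chain(sample)
sentence = generate(chain, "to", 8)
print(sentence)
```

Every adjacent word pair in the output was seen in the input, but the sentence as a whole need not appear anywhere in it, which is the property both sides of the thread are disagreeing about.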

The obvious follow up here is whether an LLM is creating transformative derivations or not. A lot of folks argue that yes, an LLM spitting out statistically sampled code that matches existing code is not transformative and is (or might be) infringing the terms of the license it was released under. Others argue that there's not an exact copy of the original source in the LLM's weights so by definition it must be a transformative work. I think it's a pretty obvious "somewhere in the middle" that is gonna make a bunch of lawyers a whole lot of money.

Personally, I don't care one way or the other. I'm one of the folks that thinks software shouldn't be copyright-able in the first place.


> Uh, that is exactly what a derivative work is.

No, it isn't. A derivative work isn't something based on extracting underlying ideas or patterns from another work, it's something that includes copyrighted portions of the other work.

An annotated edition of Hamlet is a derivative work. A Cliff's Notes summary of Hamlet is a derivative work.

Strange Brew and The Lion King are not derivative works of Hamlet simply because they include literary themes and plot points that originated in Hamlet. A list of word counts of popular works of literature that includes an entry for Hamlet is also not a derivative work. The Markov chain described above is not a derivative work.

> The obvious follow up here is whether an LLM is creating transformative derivations or not. A lot of folks argue that yes, an LLM spitting out statistically sampled code that matches existing code is not transformative and is (or might be) infringing the terms of the license it was released under.

And I would agree with them. An LLM that actually is outputting non-trivial code that matches a public project's code verbatim is engaging in copying, and not stochastic inference.

> I think it's a pretty obvious "somewhere in the middle" that is gonna make a bunch of lawyers a whole lot of money.

It's a shame that the same fundamental questions have to be relitigated over and over again just because the contextual formalities and modes of expression have changed. I wonder how many of the legal cases are going to be copies or derivative works of previous ones.


> Strange Brew and The Lion King are not derivative works of Hamlet simply because they include literary themes and plot points that originated in Hamlet.

But try to write your own story of a lion cub chased away by his uncle and living in a jungle until his childhood friend finds him and convinces him to reclaim his kingdom, and you'll quickly hear from Disney's lawyers how non-derivative it really is.

OSS devs aren't worried about Hamlet reinterpretations. They're worried about legally-distinct-but-functionally-identical software clones. Unlike Disney, they don't have millions in their pockets to fight the legal battle. You know who does have millions? The people they'd be fighting against, who are going to use every single one of your arguments to claim their AI-generated reimplementation of Kefir is not bound by GPL (or even by BSD 3-clause in case of runtime). No share-alike, no attribution, no nothing. If they are right, then the OSS social contract is dead. Even if they're not right, but behave as if they're right because they have lawyers and OSS devs don't - the social contract is just as dead.


> But try to write your own story of a lion cub chased away by his uncle and living in a jungle until his childhood friend finds him and convinces him to reclaim his kingdom, and you'll quickly hear from Disney's lawyers how non-derivative it really is.

I'd expect them to say "we don't like this, but since it's not actually a derivative work, we can't do anything about it". As long as you're not directly copying things like characters, dialogue, etc., it's not a derivative work.

That's why Armageddon is not a derivative work of Deep Impact, the Shark Attack series is not a derivative work of Jaws, the more famous Titanic is not a derivative work of 1979's S.O.S. Titanic, and the Harry Potter series is not a derivative work of Teen Witch.

Using the same story themes, plot points, and setting as another work does not implicate that other work's copyright. Only substantial copying of specifics does.


> As long as you're not directly copying things like characters, dialogue, etc., it's not a derivative work.

Define a character. Is another lion prince named Simba the same character? Is a lion prince named something else the same character? Is a human prince named Simba the same character? I'm no copyright expert, but from what I know about fanfics and fanart, the US courts ruled all of these violate copyright (you can win a book plagiarism lawsuit even if the other book has all names changed and every sentence went through a thesaurus). The few cases where the obvious stand-in was ruled non-infringing were on the grounds of parody exception, not on the grounds of being non-derivative.

The many Titanic movies are not each other's derivatives because none of them are based on each other. They're all based on the historical events directly. Now, if the original Titanic was fictional like the famous Nautilus, then yes, the 1997 movie would be derivative, but not of the 1979 series.

Which part of Harry Potter directly rips off Teen Witch the way Lion King directly rips off Hamlet? I'm not familiar with that movie.


I’m gonna give this a very charitable read by saying that while I find the means by which the treatment of burn victims was advanced abhorrent, we as a society have still benefited from those means.

> So once the model is created there isn't a good reason to encumber it by how it was created.

I am trying to be very specific here. I assume no untoward motivations from the parent commenter. I am not intending to cast aspersions. Whoever wrote this, I feel no ill will for you and this is not meant as a personal slight.

And I will be very clear, this statement as written could probably be defended because of the “by how it was created” clause.

However, “So once the model is created there isn’t a good reason to encumber it” is so… fucking I don’t even know, because what the actual fuck?

I apologize for the profanity, I really do. But, really? Are you fucking kidding me?

These models should not exist. Ever. By any means. Do not pass Go. Go directly to jail.

I understand the engineering brain enough to contemplate abstract concepts with detachment. That’s all I think happened here. But holy fuck, please pause and consider things a bit.


Outlawing the use of existing material is vital market protection for producers. Denouncing these models may not actually be a good way to reduce harm.


> These models should not exist. Ever. By any means. Do not pass Go. Go directly to jail.

If it's possible to produce CSAM that doesn't involve actual children and have a measurable impact in profitability and demand of the real thing, leading to a net reduction in the harm done to children, wouldn't you be on the wrong side of the argument you think you're making?

> I understand the engineering brain enough to contemplate abstract concepts with detachment.

I would argue it's a rational take.

Can we agree the goal to reduce harm of children is good? Or only if the solution is comfortable to you?


> These models should not exist. Ever. By any means. Do not pass Go. Go directly to jail.

Exactly. It's disturbing that this needs to be explained to people.


All the alternatives are likely to result in more children being harmed. These models probably should exist. Ideally they'll destroy the commercial incentives for people to hurt children.


So there's a bit of a misunderstanding here in the chain of blog posts that I can clear up. First, from this article:

  That’s the question I’ve been mulling over for days, because
  I don’t see how this action can make any particular guarantees
  about durability, at least not in any portable way.

This part is super easy to clear up. CouchDB in no way relies on an fsync after open for any guarantee on durability. As shown in [1], CouchDB has been running an fsync on file open since extremely early in its development. However, I can easily see how just reading the Neighbourhoodie article [2] would lead here.

The missing context is that CouchDB primarily fsync's after open because when an empty database is created, we write a header to disk. The very early implementation in [1] just didn't limit this to only cases where we write the header and that general behavior has never been changed (though the implementation is a bit different today, the effect is the same).

Also, in hindsight, I believe this claim in the Neighbourhoodie article is probably too strong:

  However, CouchDB is not susceptible to the sad path.

I didn't read the article super closely the first time since I'd been through the background discussions on the finer details, but today I'd probably hedge that a bit with language along the lines of:

  However, CouchDB is *probably* not susceptible to the sad
  path. While we can't guarantee it can't happen due to how
  various I/O operations are (not) specified, we're doing as much
  as we can to prevent it. Also, don't forget that your storage
  device might be lying about fsync anyway.

The underlying logic around that requires considering the original blog post in this chain [3]. That article posits a pathological error condition where we write something, crash, restart, issue a read from a dirty page cache, and then hard crash the entire machine. In this case, the database returned a read that was never committed.

As the author of this (as in this thread) article notes:

  Using OpenZFS as an example (hey, it’s what I know), fsync()
  always flushes anything outstanding for the underlying object,
  regardless of where the writes came from.

AFAIK, this is the norm and, I assume, the reason that the NULL BITMAP article [3] suggests the fsync on open. In CouchDB land, we just went back and said, "Oh nice, we already do that for other reasons anyway." Unfortunately the "we already do it for other reasons" aspect didn't really come through. So in the end, while none of the behavior on fsync-on-open is guaranteed in any way, shape, or form, it's not impossible that it's saved our bacon a non-zero number of times. Even though it's not guaranteed, it's common for filesystems to perform those flushes regardless of which file descriptor is used.
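As a rough illustration of the fsync-on-open idea, here is a minimal Python sketch. It is not CouchDB's implementation (CouchDB is written in Erlang); `open_with_fsync` and the file name are hypothetical, and the sketch assumes only what is said above: the flush is issued right after open, and whether it covers writes made through other descriptors depends on the filesystem.

```python
import os
import tempfile

def open_with_fsync(path):
    """Open `path` read/write and fsync it before any reads are served."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Ask the OS to flush whatever is outstanding for this file. Whether
        # that includes writes made through other descriptors is up to the
        # filesystem; it is common behavior, not a portable guarantee.
        os.fsync(fd)
    except OSError:
        os.close(fd)
        raise
    return fd

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.db")
    with open(path, "wb") as f:
        f.write(b"hello")        # may still be sitting in the page cache
    fd = open_with_fsync(path)   # flush first, then read
    data = os.read(fd, 5)
    os.close(fd)
```

If the `fsync` call fails, the sketch refuses to hand back a descriptor at all, on the theory that serving reads from a file whose state is unknown is the thing being avoided.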

Also, to make sure that we're not missing the field for the cornstalks, I want to point out that the double fsync commit protocol used by CouchDB is probably 99.some-more-nines responsible for CouchDB's durability guarantees. However, that's not 100%, so when we find weird edge cases like in [3] we try and make sure that we're as correct as can be. For instance, here's the response to fsync-gate [4].
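A toy sketch of what a double fsync commit can look like, assuming the usual shape of such a protocol: append the data, fsync, append a header that points at the data, fsync again. The header layout (`HDR1` magic plus offset and length) and the `commit` function are made up for illustration and are not CouchDB's on-disk format.

```python
import os
import struct
import tempfile

# Hypothetical header: 4-byte magic, 8-byte data offset, 8-byte data length.
HEADER = struct.Struct(">4sQQ")

def commit(fd, payload):
    """Append payload and a header describing it, with an fsync after each."""
    offset = os.lseek(fd, 0, os.SEEK_END)
    os.write(fd, payload)
    os.fsync(fd)  # first fsync: data is on disk before any header refers to it
    os.write(fd, HEADER.pack(b"HDR1", offset, len(payload)))
    os.fsync(fd)  # second fsync: the header itself is on disk
    return offset

with tempfile.TemporaryDirectory() as tmp:
    fd = os.open(os.path.join(tmp, "example.db"), os.O_RDWR | os.O_CREAT, 0o644)
    first_offset = commit(fd, b"doc-1")
    size_after_commit = os.fstat(fd).st_size
    os.close(fd)
```

The ordering is the point: a crash between the two syncs leaves data without a header, which a reader can ignore, rather than a header pointing at data that never reached the disk.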

[1] https://github.com/apache/couchdb/blob/956c11b35487fb8ffcf70...

[2] https://neighbourhood.ie/blog/2025/02/26/how-couchdb-prevent...

[3] https://buttondown.com/jaffray/archive/null-bitmap-builds-a-...

[4] https://github.com/apache/couchdb/commit/3505281559513e29224...


Thanks Paul, I’ve updated the post to clarify: https://neighbourhood.ie/blog/2025/02/26/how-couchdb-prevent...


Does anyone know if there's an obvious reason that adding a `no_panic` crate attribute wouldn't be feasible? It certainly seems like an "obvious" thing to add so I'm hesitant to take the obvious nerd snipe bait.


The standard library has a significant amount of code that panics, so a `no_panic` crate attribute would currently only work for crates that don't depend on the standard library. I imagine most interesting crates depend on the standard library.


What caught my eye in the article was the desire to have something that doesn't panic with a release profile, while allowing for panics in dev profiles. Based on other comments I think the general "allow use of std, but don't panic" seems like something that could be useful purely on the "Wait, why doesn't that exist?" reactions.


You could do it, but I would prefer guarantees on a per-call chain basis using a sanitizer. It should be quite easy to write.


I'm no rustc expert, but from what little I know it seems like disabling panics for a crate would be an obvious first step. You make a great point though. Turning that into a compiler assertion of "this function will never panic" would also be useful.


It’s a good first step, but half of the crates in crates.io have at least 40 transitive dependencies. Some have hundreds or thousands. A big effort.


Absolutely correct! Had the bird strike not occurred, there wouldn’t have been a crash. Had things with the go around been handled properly, there would have been no crash.

Etc etc. The fact that a wall was 50m out of compliance or whatever it ends up being will be a footnote at best in the review of this crash.



> If you actually have 1000 nodes worth of work, the heartbeats are not at all a big deal.

I think you’re missing the fact that the heartbeats will be combined with existing packets. Hence the quoted bit. If you’ve got 1000 nodes, they should be doing something with that network such that an extra 50 bytes (or so) every 30s would not be an issue.
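A back-of-the-envelope version of that claim, using the figures above (roughly 50 bytes every 30s) and assuming a full mesh in which every node heartbeats every other node. The function name is hypothetical, and the estimate covers only the heartbeat payload, which is the relevant cost when heartbeats ride along with existing packets.

```python
def heartbeat_bytes_per_second(nodes, heartbeat_bytes=50, interval_s=30):
    """Heartbeat payload one node sends per second in a full mesh."""
    peers = nodes - 1
    return peers * heartbeat_bytes / interval_s

rate = heartbeat_bytes_per_second(1000)
print(f"{rate:.0f} bytes/s per node")  # prints "1665 bytes/s per node"
```

Under those assumptions each node sends under 2 KB/s of heartbeat payload, which is small next to any real workload spread across 1000 nodes.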


They would be combined if each node was sending messages each second to every other node. Is that realistic?


That’ll just depend on whatever code was deployed to the cluster. For the clusters I used to operate, the answer would be absolutely all nodes talk to all nodes all the time.

I personally never operated anything above roughly 250 nodes, but that limit was mostly due to following the OP’s advice about paying attention to the configuration of each node in the cluster. In my case, fewer nodes with fancier and larger raid arrays ended up being a better scaling strategy.


Most likely those are just the states where they already have a tax presence. For whatever reason they happen to currently employ folks in those states so adding employees is easy. Adding new states means getting lawyers and CPA type folks involved which is a hurdle to hiring in larger organizations.


A wild tangent but reading “heritable metabolizing” really hit me on the “are viruses alive” question.

I’ve been around enough biotech to have considered the differences between plasmids and viruses versus archaea, bacteria, and eukaryotes. I’ve always considered “heritable change” as the base definition of “life”. As in, “life” is anything whose progeny resemble their parent(s), or in short, “heritable change”.

“Heritable metabolizing” quite nicely captures that difference between the levels of single molecule “life” and singular/multicellular “life”.

Apologies for the random aside, it was just one of those random “I have a vague idea of why mitochondria are important, but I don’t see them as fundamental” parts of my “What is life?” definition being refined.


> “heritable change” as the base definition of “life”

Years ago I saw a research talk by someone doing, IIRC, regional-sized evolutionary-time-duration multi-scale ecosystem simulation - they made the same call.


Sometimes science doesn’t have to be precise to demonstrate a result.

Consider trying to measure feedback from a microphone and speaker. You don’t have to be an expert to know that there’s a quick change in system behavior when the microphone gets too close to the speaker.

