I think there are two different things going on here:
"DeepSeek trained on our outputs and that's not fair because those outputs are ours, and you shouldn't take other peoples' data!" This is obviously extremely silly, because that's exactly how OpenAI got all of its training data in the first place - by scraping other peoples' data off the internet.
"DeepSeek trained on our outputs, and so their claims of replicating o1-level performance from scratch are not really true" This is at least plausibly a valid claim. The DeepSeek R1 paper shows that distillation is really powerful (e.g. they show Llama models get a huge boost by finetuning on R1 outputs), and if it were the case that DeepSeek were using a bunch of o1 outputs to train their model, that would legitimately cast doubt on the narrative of training efficiency. But that's a separate question from whether it's somehow unethical to use OpenAI's data the same way OpenAI uses everyone else's data.
Why would it cast any doubt? If you can use o1 output to build a better R1, then use R1 output to build a better X1... then a better X2... XN, that just shows a method to create better systems for a fraction of the cost from where we stand. If it was that obvious, OpenAI should have done it themselves. But the disruptors did it. In hindsight it might sound obvious, but that is true for all innovations. It is all good stuff.
I think it would cast doubt on the narrative "you could have trained o1 with much less compute, and r1 is proof of that", if it turned out that in order to train r1 in the first place, you had to have access to a bunch of outputs from o1. In other words, you had to do the really expensive o1 training in the first place.
(with the caveat that all we have right now are accusations that DeepSeek made use of OpenAI data - it might just as well turn out that DeepSeek really did work independently, and you really could have gotten o1-like performance with much less compute)
> In this study, we demonstrate that reasoning capabilities can be significantly improved through large-scale reinforcement learning (RL), even without using supervised fine-tuning (SFT) as a cold start. Furthermore, performance can be further enhanced with the inclusion of a small amount of cold-start data
Is this cold-start data what OpenAI is claiming is their output? If so, what's the big deal?
DeepSeek claims that the cold-start data is from DeepSeek-V3, which is the model that has the $5.5M price tag. If that data were actually the output of o1 (a model that had a much higher training cost, and its own RL post-training), that would significantly change the narrative of R1's development, and what's possible to build from scratch on a comparable training budget.
In the paper DeepSeek just says they have ~800k responses that they used for the cold start data on R1, and are very vague about how they got it:
> To collect such data, we have explored several approaches: using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators.
My surface-level reading of these two sections is that the 800k samples come from R1-Zero (i.e. "the above RL training") and V3:
>We curate reasoning prompts and generate reasoning trajectories by performing rejection sampling from the checkpoint from the above RL training. In the previous stage, we only included data that could be evaluated using rule-based rewards. However, in this stage, we expand the dataset by incorporating additional data, some of which use a generative reward model by feeding the ground-truth and model predictions into DeepSeek-V3 for judgment.
>For non-reasoning data, such as writing, factual QA, self-cognition, and translation, we adopt the DeepSeek-V3 pipeline and reuse portions of the SFT dataset of DeepSeek-V3. For certain non-reasoning tasks, we call DeepSeek-V3 to generate a potential chain-of-thought before answering the question by prompting.
The non-reasoning portion of the DeepSeek-V3 dataset is described as:
>For non-reasoning data, such as creative writing, role-play, and simple question answering, we utilize DeepSeek-V2.5 to generate responses and enlist human annotators to verify the accuracy and correctness of the data.
I think if we were to take them at their word on all this, it would imply there is no specific OpenAI data in their pipeline (other than perhaps their pretraining corpus containing some incidental ChatGPT outputs that are posted on the web). I guess it's unclear where they got the "reasoning prompts" and corresponding answers, so you could sneak in some OpenAI data there?
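If it helps to picture it, here's a rough sketch of the kind of rejection-sampling loop those quoted passages describe. Every name here is a hypothetical stand-in (my own illustration, not DeepSeek's pipeline):

```python
from typing import Callable, Optional

def collect_sft_data(
    prompts: list[str],
    sample: Callable[[str], str],                      # draws one response from the RL checkpoint
    rule_check: Callable[[str, str], Optional[bool]],  # rule-based reward; None if no ground truth applies
    llm_judge: Callable[[str, str], bool],             # generative reward model, e.g. V3-as-judge
    n_samples: int = 16,
) -> list[dict]:
    """Hypothetical rejection-sampling loop: keep only responses that pass some verifier."""
    dataset = []
    for prompt in prompts:
        for _ in range(n_samples):
            response = sample(prompt)
            verdict = rule_check(prompt, response)
            if verdict is None:                        # fall back to the LLM judge
                verdict = llm_judge(prompt, response)
            if verdict:
                dataset.append({"prompt": prompt, "response": response})
    return dataset
```

Nothing about a loop like this says where the prompts or the ground truths came from, which is exactly the gap where OpenAI data could (or could not) have been slipped in.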
That's what I am gathering as well. Where is OpenAI going to get substantial proof to claim that their outputs were used?
The reasoning prompts and answers for SFT from V3, you mean? No idea. For that matter, you have no idea where OpenAI got that data from either. If they open this can of worms, their own can of worms will be opened as well.
It's like the claim "they showed anyone can create a powerful model from scratch" becomes "false yet true".
Maybe they needed OpenAI for their process. But now that their model is open source, anyone can use that as their cold start and spend the same amount.
"From scratch" is a moving target. No one who makes their model with massive data from the net is really doing anything from scratch.
Yeah, but that kills the implied hope of building a better model for cheaper. Like this you'll always have a ceiling of being a bit worse than the OpenAI models.
The logic doesn't exactly hold; it is like saying that a student is limited by their teachers. It is certainly possible that a bad teacher will hold the student back, but ultimately a student can lag behind or improve on the teacher with only a little extra stimulus.
They probably would need some other source of truth than an existing model, but it isn't clear how much additional data is needed.
Don't forget that this model probably has far fewer params than o1 or even 4o. This is a compression/distillation, which means it frees up a lot of compute resources to build models much more powerful than o1. At least this allows further scaling compute-wise (if not in the amount of, non-synthetic, source material available for training).
If R1 were better than o1, yes you would be right. But the reporting I’ve seen is that it’s almost as good. Being able to copy cutting-edge models won’t advance the state of the art in terms of intelligence. They have made improvements in other areas, but if they reused o1 to train their model, that would be effectively a ctrl-c / ctrl-v strictly in terms of task performance.
It's not just about whether competitors can improve on OpenAI's models. It's about whether they can continually create reasonable substitutes for orders of magnitude less investment.
> It's about whether they can continually create reasonable substitutes for orders of magnitude less investment
That just means that the edge you’re able to retain if you invest $1B is nonexistent. It also means there’s a huge disincentive to invest $1B if your reward instantly evaporates. That would normally be fine if the competitor is otherwise able to get to that new level without the $1B. But if it relies on your $1B to then be able to put in $100M in the first place to replicate your investment, it essentially means the market for improvements disappears OR there’s legislation written to ensure competitors aren’t allowed to do that.
This is a tragedy of the commons, and we already have historical examples of how humans tried to deal with it and all the problems that come with it. Producing a book requires substantial capital, but copying it requires a lot less. Copyright law, however flawed and imperfect, tries to protect the incentive to create in the face of that.
> That just means that the edge you’re able to retain if you invest $1B is nonexistent.
Jeez. Must be really tough to have some comparatively small group of people financially destroy your industry with your own mechanically-harvested professional output while dubiously claiming to be better than you when in reality it’s just a lot cheaper. Must be tough.
Maybe they should take some time to self-reflect and make some art and writing about it using the products they make that mechanically harvest the work of millions of people, and have already screwed up the commercial art and writing marketplaces pretty thoroughly. Maybe tell DeepSeek it’s their therapist and get some emotional support and guidance.
This. There is something doubly evil about OpenAI harvesting all of that work for its own economic benefit, while also destroying the opportunity for those it stole from to continue to ply their craft.
And then all of their stans taking on a persecution complex because people that actually made all of the “data” don’t uncritically accept their work as equivalent adds insult to injury.
>it essentially means the market for improvements disappears OR there’s legislation...
This is possibly true, though with billions already invested I'm not sure that OpenAI would just...stop absent legislation. And, there may be technical or other solutions beyond legislation. [0]
But, really, your comment here considers what might come next. OTOH, I was replying to your prior comment that seemed to imply that DeepSeek's achievement was of little consequence if they weren't improving on OpenAI's work. My reply was that simply approximating OpenAI's performance at much lower cost could still be extraordinarily consequential, if for no other reason than the challenges you subsequently outlined in this comment's parent.
[0] On that note, I'm not sure (and admittedly haven't yet researched) how DeepSeek just wholesale ingested ChatGPT's "output" to be used for its own model's training, so not sure what technical measures might be available to prevent this going forward.
First, there could have been industrial espionage involved, so who knows. Ignoring that, you’re missing what I’m saying. Think of it this way - if it requires o1’s output to reach almost the same task performance, then this approach gives you a cheap way to replicate the performance of a leading-edge model at a fraction of the cost. It does not give you a way to train something that beats a cutting-edge model. Cutting-edge models require a lot of R&D and capital expenditure - if they’re just going to be trivially copied after public availability, the response is going to be legislation to keep the incentive there for meaningful investments in that area. Otherwise you’re going to have another AI winter where progress shrivels because investment dollars dry up.
That’s why it’s so hard to understand the true cost of training DeepSeek, whereas it’s a little bit easier for cutting-edge models (and even then still difficult).
"Otherwise you’re going to have another AI winter where progress shrivels because investment dollars dry up."
Tbh a lot of people in the world would love this outcome. They will use AI because not using it puts them at a comparative disadvantage - but would rather AI doesn't develop further or didn't develop at all (i.e. they don't value the absolute advantage/value). There's both good and bad reasons for this.
When you build a new model, there is a spectrum of how you use the old model: 1. taking the weights, 2. training on the logits, 3. training on model output, 4. training from scratch. We don't know how much advantage #3 gives. It might be the case that with enough output from the old model, it is almost as useful as taking the weights.
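To make the difference between #2 and #3 concrete, here's a toy PyTorch sketch (my own illustration, not any lab's actual training code; it assumes teacher and student share a tokenizer for the logit case):

```python
import torch
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    # Option 2: match the teacher's full output distribution.
    # Requires access to the teacher's logits (effectively, its weights).
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

def output_sft_loss(student_logits, teacher_token_ids):
    # Option 3: ordinary next-token cross-entropy on text the teacher generated.
    # Only needs API-level access to the teacher's outputs.
    return F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        teacher_token_ids.view(-1),
    )
```

Option 3 throws away the per-token probability information, which is part of why it's an open question how close it gets to option 2 given enough samples.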
I lean on the idea that R1-Zero was trained from cold start, at the same time, they have tried many things including using OpenAI APIs. These things can happen in parallel.
> you had to do the really expensive o1 training in the first place
It is no better for OpenAI in this scenario either, any competitor can easily copy their expensive training without spending the same, i.e. there is a second mover advantage and no economic incentive to be the first one.
To put it another way, the $500 billion Stargate investment will be worth just $5 billion once the models become available for consumption, because it will only take that much to replicate the same outcomes with new techniques, even if the cold start needed o1 output for RL.
Now that its been done, is OpenAI needed or can you iterate on DeepSeek only moving forward?
My understanding is this effectively builds on OpenAI's very expensive initial work, provides a "nearly as good as" model for orders of magnitude cheaper to train and run, that also provides a basis to continue building on and improving without openAI, and without human bottlenecks.
That cuts OAI off at the knees in terms of market viability after billions have been spent. If DS can iterate and match the capabilities of the current in-development OAI models in the next year, it may come down to regulatory capture and government intervention to ensure its viability as a company.
You cannot really have successful government intervention against open source code and weights.
The attempt in cryptography with PGP and export controls made that clear.
Even if DS specifically is banned (and even effectively), a dozen other clean room replications following their published methods will become available.
It is possible this government will ban all “unapproved” LLMs not running at an authorized provider [1], saying it is a weapon, or AGI, or Skynet, or whatever makes the powers that be sound important, thus establishing the need for control [2], but the rest of the world will keep innovating.
—-
[1] Bans just need to work economically, not at the information level, i.e. organizations with liability considerations will not use “unapproved” ones, and they are the ones who will spend the bulk of the money, which is what they need to protect.
[2] If they were smart they could do this positively, without the backlash bans would have, by giving protections to compliant models, like legal indemnity for model companies and users, without necessarily blocking others.
o1 wouldn't exist without the combined compute of every mind that led to the training data they used in the first place. How many h100 equivalents are the rolling continuum of all of human history?
Human reasoning, as it exists today, is the result of tens of thousands of years of intuition slowly distilled down to efficient abstract concepts like "numbers", "zero", "angles", "cause", "effect", "energy", "true", "false", ...
I don't know what reasoning from scratch would look like without training on examples from other reasoning beings. As human children do.
Actually I also think it's possible. Start with a natural-numbers axiom system. Form all valid sentences of increasing length. Run RL on a model to search for counterexamples or proofs. On sufficient compute, this should produce superhuman math performance (efficiency), even at compute parity.
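As a toy illustration of why that's at least conceivable, here's a minimal sketch (mine, nothing like a real prover) of a verifiable reward signal that needs no human-written examples: propose claims over the naturals and reward the policy only when its proposed counterexample actually falsifies the claim.

```python
import random

# Toy sketch (mine, not a real prover): enumerate simple universally quantified
# arithmetic claims and reward a "policy" for proposing genuine counterexamples.
# The verifier is exact, so the reward needs no human-labeled data at all.

CLAIMS = ["n * 2 == n + n", "n * n >= n", "n + 1 == n", "n * 0 == 1"]

def is_counterexample(claim: str, n: int) -> bool:
    return not eval(claim, {}, {"n": n})   # fine for a toy; never eval untrusted input

def reward(claim: str, proposed_n: int) -> float:
    return 1.0 if is_counterexample(claim, proposed_n) else -1.0

random.seed(0)
for claim in CLAIMS:
    guess = random.randint(0, 10)          # stand-in for the model's proposed counterexample
    print(f"{claim!r:20} guess={guess} reward={reward(claim, guess)}")
```

A real version would also need the proof direction, not just counterexamples, and a far smarter search, which is where the RL would come in.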
I wonder how much discovery in math happens as a result in lateral thinking epiphanies. IE: A mathematician is trying to solve a problem, their mind is open to inspiration, and something in nature, or their childhood or a book synthesizes with their mental model and gives them the next node in their mental graph that leads to a solution and advancement.
In an axiomatic system, those solutions are checkable, but how discoverable are they when your search space starts from infinity? How much do you lose by disregarding the gritty reality and foam of human experience? It provides inspirational texture that helps mathematicians in the search at least.
Reality is a massive corpus of cause and effect that can be modeled mathematically. I think you're throwing the baby out with the bathwater if you even want to be able to math in a vacuum. Maybe there is a self optimization spider that can crawl up the axioms and solve all of math. I think you'll find that you can generate new math infinitely, and reality grounds it and provides the gravity to direct efforts towards things that are useful, meaningful and interesting to us.
As I mentioned in a sister comment, Gödel's incompleteness theorems also throw a wrench into things, because you will be able to construct logically consistent "truths" that may not actually exist in reality. At which point, your model of reality becomes decreasingly useful.
At the end of the day, all theory must be empirically verified, and contextually useful reasoning simply cannot develop in a vacuum.
Those theorems are only relevant if "reasoning" is taken to its logical extreme (no pun intended). If reasoning is developed/trained/evolved purely in order to be useful and not pushed beyond practical applications, the question of "what might happen with arbitrarily long proofs" doesn't even come up.
On the contrary, when reasoning about the real world, one must reason starting from assumptions that are uncertain (at best) or even "clearly wrong but still probably useful for this particular question" (at worst). Any long and logic-heavy proof would make the results highly dubious.
A question is: what algorithms does the brain use to make these creative lateral leaps? Are they replicable?
Unless the brain is using physics that we don’t understand or can’t replicate, it seems that, at least theoretically, there should be a way to model what it’s doing with silicon and code.
States like inspiration and creativity seem to correlate in an interesting way with ‘temperature’, ‘top p’, and other LLM inputs. By turning up the randomness and accepting a wider range of output, you get more nonsense, but you also potentially get more novel insights and connections. Human creativity seems to work in a somewhat similar way.
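For anyone who hasn't played with those knobs, here's a minimal sketch of temperature plus top-p (nucleus) sampling over a single logit vector; the numbers are arbitrary:

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_p=0.9, rng=None):
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature   # temperature sharpens or flattens
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # most likely tokens first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # smallest nucleus covering top_p of the mass
    keep = order[:cutoff]
    return rng.choice(keep, p=probs[keep] / probs[keep].sum())

# Higher temperature and top_p widen the nucleus: more nonsense, but also more surprises.
print(sample_token([2.0, 1.0, 0.2, -1.0], temperature=1.5, top_p=0.95))
```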
Dogs are probably the best example I can think of. They learn through experience and clearly reason, but without a complex language to define abstract concepts. Its very basic reasoning, but they do learn and apply that learning.
To your point, experience is the training. Without language/data to represent human experience and knowledge to train a model, how would you give it 'experience'?
And yet dogs, to a very high degree, just learn the same things. At least the same kinds of things, over and over.
They were pre-designed to learn what they always learn. Their minds structured to readily make the same connections as puppies, that dogs have always needed to survive.
Not for real reasoning, which, by its nature, does not have a limit.
> just learn the same things. At least the same kinds of things, over and over.
It's easy to train the same things to a degree, but it's amazing to watch different dogs individually learn and reason through things completely differently, even within a breed or even a litter.
Reasoning ability is always limited by the capacity of the thinker to frame the concepts and interactions. It's always limited by definition; we only push that limit farther than other species, and AGI may eventually push it past our abilities.
There was necessarily a "first reasoning being" who learned reasoning from scratch, and then it improved from there. Humans needed tens of thousands of years because:
- humans experience reality at a slower pace than AI could theoretically experience a simulated reality
- humans have to transfer knowledge to the next generation every 80 years (in a manner that's very lossy), and around half of each human lifespan is spent learning things that the previous generation already knew
Whether the first reasoning entity is an individual organism or a group of organisms is completely irrelevant to the original point. If one were to grant that there was in fact a "first reasoning group" rather than a "first reasoning being" the original argument would remain intact.
That was the easy part though; figuring out how to handle all the unintended side effects it generated is still an ongoing process. Please sit and relax while we solve the few incidental events occurring here and there; rest assured we are putting our best effort into their resolution.
It is possible to learn to reason from scratch; that's what R1-Zero did, but the resulting chains of thought aren't legible to humans.
To quote DeepSeek directly:
> DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable performance on reasoning. With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL.
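For a sense of what "RL without SFT" can look like in practice, here's a hedged sketch of the kind of rule-based reward (format plus accuracy) the paper describes; the exact tags and reward values here are my own guesses, not DeepSeek's code:

```python
import re

# Completions are expected to look like: <think> ...reasoning... </think> <answer> ... </answer>
PATTERN = re.compile(r"<think>.+?</think>\s*<answer>(.+?)</answer>", re.DOTALL)

def rule_based_reward(completion: str, ground_truth: str) -> float:
    match = PATTERN.search(completion)
    if match is None:
        return 0.0                                   # format reward: the tags must be there
    answer = match.group(1).strip()
    return 1.0 if answer == ground_truth else 0.1    # accuracy reward dominates

print(rule_based_reward("<think>2 + 2 = 4</think> <answer>4</answer>", "4"))  # -> 1.0
```

The RL step (GRPO in the paper) then just pushes up the probability of higher-reward completions within each sampled group; no supervised examples are needed, which is part of why R1-Zero's chains of thought can drift into something only the model finds readable.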
If you look at the benchmarks of the DeepSeek-V3-Base, it is quite capable, even in 0-shot: https://huggingface.co/deepseek-ai/DeepSeek-V3-Base#base-mod... This is not from scratch. These benchmark numbers are an indication that the base model already had a large number of reasoning/LLM tokens in the pre-training set.
On the other hand, my take on it, the ability to do reasoning in a long context is a general capability. And my guess is that it can be bootstrapped from scratch, without having to do training on all of the internet or having to distill models trained on the internet.
> These benchmark numbers are an indication that the base model already had a large number of reasoning/LLM tokens in the pre-training set.
But we already know that is the case: the Deepseek v3 paper says it was posttrained partly with an internal version of R1:
> Reasoning Data. For reasoning-related datasets, including those focused on mathematics, code competition problems, and logic puzzles, we generate the data by leveraging an internal DeepSeek-R1 model. Specifically, while the R1-generated data demonstrates strong accuracy, it suffers from issues such as overthinking, poor formatting, and excessive length. Our objective is to balance the high accuracy of R1-generated reasoning data and the clarity and conciseness of regularly formatted reasoning data.
And DeepSeekMath did a repeated cycle of this kind of thing, mixing in 10% of old, previously seen data with newly generated data from the last generation in a continuous bootstrap.
Possible? I guess evolution did it over the course of a few billion years. For engineering purposes, starting from the best advanced position seems far more efficient.
I've been giving this a lot of thought over the last few months. My personal insight is that "reasoning" is simply the application of a probabilistic reasoning manifold on an input in order to transform it into constrained output that serves the stability or evolution of a system.
This manifold is constructed via learning a decontextualized pattern space on a given set of inputs. Given the inherent probabilistic nature of sampling, true reasoning is expressed in terms of probabilities, not axioms. It may be possible to discover axioms by locating fixed points or attractors on the manifold, but ultimately you're looking at a probabilistic manifold constructed from your input set.
But I don't think you can untie this "reasoning" from your input data. It's possible you will find "meta-reasoning", or similar structures found in any sufficiently advanced reasoning manifold, but these highly decontextualized structures might be entirely useless without proper recontextualization, necessitating that a reasoning manifold is trained on input whose patterns follow learnable underlying rules, if the manifold is to be useful for processing input of that kind.
Decontextualization is learning, decomposing aspects of an input into context-agnostic relationships. But recontextualization is the other half of that, knowing how to take highly abstract, sometimes inexpressible, context-agnostic relationships and transform them into useful analysis in novel domains.
This doesn't mean a well-trained model can't reason about input it hasn't encountered before, just that the input needs to be in some way causally connected to the same laws which governed the input the manifold was trained on.
I'm sure we could create a fully generalized reasoning manifold which could handle anything, but I don't see how we possibly get that without first considering and encountering all possible inputs. But these inputs still have to have some form of constraint governed by laws that must be learned through sampling, otherwise you'd just be training on effectively random data.
The other commenter who suggested simply generating all possible sentences and training on internal consistency should probably consider Gödel's incompleteness theorems, and that internal consistency isn't enough to accurately model and interpret the universe. One could construct a thought experiment about an isolated brain in a jar with effectively unlimited neuronal connections, but no sensory connection to the outside world. It's possible, with enough connections, that the likelihood of the brain conceiving of true events it hasn't actually encountered does increase meaningfully. But the brain still has nothing to validate against, and can't simply assume that because something is internally logically consistent, that it must exist or have existed.
If OpenAi had to account for the cost of producing all the copyrighted material they trained their LLM on, their system would be worth negative trillions of dollars.
Let's just assume that the cost of training can be externalized to other people for free.
Even if what OpenAI asserts in the title of this post is true, their system is worth negative trillions of dollars.
If other players can access that data with relatively less effort, then it's futile trying to train your models and improve upon them, as clearly you don't have an architectural moat, just a training moat.
Kind of like an office scene where an introverted hard worker does all the tedious work, while his extroverted colleague promotes it as his own and gains the credit.
At the pace that DeepSeek is developing we should expect them to surpass OpenAI in not that long.
The big question really is, are we doing it wrong, could we have created o1 for a fraction of the price. Will o4 cost less to train than o1 did?
The second question is naturally. If we create a smarter LLM, can we use it to create another LLM that is even smarter?
It would have been fantastic if DeepSeek could have come out with an o3 competitor before o3 even became publicly available. That way we would have known for sure that we’re doing it wrong. Cause then either we could have used o1 to train a better AI or we could have just trained in a smarter and cheaper way.
The whole discussion is about whether or not the second case of using o1 outputs to fine tune R1 is what allowed R1 to become so good. If that's the case then your assertion that DeepSeek will surpass OpenAI doesn't really make sense because they're dependent on a frontier model in order to match, not surpass.
Yeah, that's my point. If they do end up surpassing OpenAI then it would seem likely that they aren't just relying on copying from o1, or whatever model is the frontier model at that time.
Sure. This is fine. Data is still a product, no matter how much businesses would like to turn it into a service.
The model already embodies the "total sum of a massive amount of compute" used to create it; if it's possible to reuse that embodied compute to create a better model, that's good for the world. Forcing everyone to redo all that compute for themselves is, conversely, bad for the world.
I mean, yes that's how progress works. Has OpenAI got a patent? If not it's fair game.
We don't make people figure out how to domesticate a cow every time they want a hamburger. Or test hundreds of thousands of filaments before they can have a lightbulb. Inventions, once invented, exist as giants to stand upon. The inventor can either choose to disclose the invention and earn a patent for exclusive rights, or they can try to keep it a secret and hope nobody reverse engineers it.
I think the prevailing narrative ATM is that DeepSeek's own innovation was done in isolation and they surpassed OpenAI. Even though in the paper they give a lot of credit to Llama for their techniques. The idea that they used o1's outputs for their distillation further shows that models like o1 are necessary.
All of this should have been clear anyway from the start, but that's the Internet for you.
> The idea that they used o1's outputs for their distillation further shows that models like o1 are necessary.
Hmm, I think the narrative of the rise of LLMs is that once the output of humans has been distilled by the model, the human isn't necessary.
As far as I know, DeepSeek adds only a little to the transformers model while o1/o3 added a special "reasoning component" - if DeepSeek is as good as o1/o3, even taking data from it, then it seems the reasoning component isn't needed.
It seems clear that the term can be used informally to denote the boiling down of human knowledge, indeed it was used that way before AI appeared in the popular imagination.
In the context in which you said it, it matters a lot.
>> The idea that they used o1's outputs for their distillation further shows that models like o1 are necessary.
> Hmm, I think the narrative of the rise of LLMs is that once the output of humans has been distilled by the model, the human isn't necessary.
If DeepSeek was produced through the distillation (term of art) of o1, then the cost of producing DeepSeek is strictly higher than the cost of producing o1, and can't be avoided.
Continuing this argument, if the premise is true then DeepSeek can't be significantly improved without first producing a very expensive hypothetical o1-next model from which to distill better knowledge.
That is the argument that is being made. Please avoid shallow dismissals.
Edit: just to be clear, I doubt that DeepSeek was produced via distillation (term of art) of o1, since that would require access to o1's weights. It may have used some of o1's outputs to fine-tune the model, which would still mean that the cost of training DeepSeek is strictly higher than training o1.
> just to be clear, I doubt that DeepSeek was produced via distillation
Yeah, your technical point is kind of ridiculous here: in all my uses of distillation (and in the comment I quoted), distillation is used in an informal sense, and there's no allegation that DeepSeek could have been in possession of OpenAI's model weights, which is what's needed for your "distillation (term of art)".
Because it feeds conspiracy theories and because there's no evidence for it? Also, let's talk DeepSeek in particular, not "China".
Looking back at the article, it is indeed using "distillation" in a special/"term of art" sense, but not using it correctly. I.e., it's not actually speculating that DeepSeek obtained OpenAI's weights and distilled them down, but rather that it used OpenAI's answers/output as a starting point (for which there is a different method/"term of art").
The R1-Zero paper shows how many training steps the RL took, and it's not many. The cost of the RL is likely a small fraction of the cost of the foundational model.
> the prevailing narrative ATM is that DeepSeek's own innovation was done in isolation and they surpassed OpenAI
I did not think this, nor did I think this was what others assumed. The narrative, I thought, was that there is little point in paying OpenAI for LLM usage when a much cheaper, similar / better version can be made and used for a fraction of the cost (whether it's on the back of existing LLM research doesn't factor in)
Yes, well the narrative that rocked the stock market is different. It's looking at what DeepSeek did and assuming they may have a competitive advantage in this space and could outperform OpenAI at their own game.
If the narrative is actually that DeepSeek can only reach whatever heights OpenAI has already gotten to with some new tricks, then markets will probably refocus on OpenAI's innovations and price things accordingly, even if the initial cost is huge. It also means OpenAI probably needs a better moat to protect its interests.
I'm not sure where the reality is exactly, but market reactions so far have basically followed that initial narrative and now the rebuttal.
The idea that someone can easily replicate an OpenAI model based simply on OpenAI outputs is, I’d argue, immeasurably worse for OpenAI’s valuation than the idea that someone happened to come up with a few innovations that leapfrogged OpenAI.
The latter could be a one time thing, and/or OpenAi Could still use their financial might to leverage those innovations and get even better with them.
However, the former destroys their business model and no amount of intelligence and innovation from OpenAI protects them from being copied at a fraction of the cost.
> Yes, well the narrative that rocked the stock market is different.
How do you know this?
> If the narrative is actually that DeepSeek can only reach whatever heights OpenAI has already gotten to with some new tricks, then markets will probably refocus on OpenAI's innovations and price things accordingly
Why? If every innovation OpenAI is trying to keep as secret sauce becomes commoditized quickly and cheaply, then why would markets care about any innovations they have? They will be unable to monetize them.
Why would it matter when Chinese deepseek is not going to abide by such rules or be forced to and will release their model open weights so anyone anywhere can host it?
Also, scraping most of the websites they scrape is not allowed; they do it anyway.
There were different narratives for different people. When I heard about r1, my first response was to dig into their paper and its references to figure out how they did it.
Fwiw I assumed they were using o1 to train. But it doesn’t matter: the big story here is that massive compute resources are unlikely to be as important in the future as we thought. It cuts the legs off Stargate etc. just as it’s announced. The CCP must be highly entertained by the timeline.
But HOW they are necessary is the change. They went from building blocks to stepping stones. From a business standpoint that's very damaging to OAI and other players.
OpenAI couldn't do it; when the high cost of training and access to GPUs is their competitive advantage against startups, they can't admit that it does not exist.
When will over training happen on the melange of models at scale?
And will AGI only ever be an extension of this concept?
That is where artificial intelligence is going. Copy things from other things. Will there be an AI eureka moment where it deviates and knows where, and for what reason, it is wrong?
If they're training R1 on o1 output on the benchmarks - then I don't trust those benchmarks results for R1. It means the model is liable to be brittle, and they need to prove otherwise.
It seems like, if they did in fact distill, then what we have found is that you can create a worse copy of the model for ~$5M in compute by training on its outputs.
They're standing on the shoulders of giants, not only in terms of re-using expensive computing power almost for free by using the outputs of expensive models. It's a bit of a tradition in that country, also in manufacturing.
What I meant to say was that OpenAI did put a lot of money into extracting value out of the pile of (partially copyrighted) data, and that DeepSeek was freeloading on that investment without disclosing it, making them look more efficient than they truly are.
Well, originally, OpenAI wasn't supposed to be that kind of organization.
But if you leave someone in the tech industry of SV/SF long enough, they'll start to get high on their own supply and think they're entitled to insane amounts of value, so...
It's because they're the ones who could raise the money to make those models. Academics don't have access to that kind of compute. But the free models exist.
Even assuming the model was somehow publicly available in a form that could be directly copied, that would be a more blatant form of copyright infringement. Distillation launders copyrighted material in a way that OpenAI specifically has argued falls under fair use.
Ironically Deepseek is doing what OpenAI originally pledged to do. Making the model open and free is a gift to humanity.
Look at the whole AI revolution that Meta and others have bootstrapped by opening their models. Meanwhile OpenAI/Microsoft, Anthropic, Google and the rest are just trying to look after number one while trying to achieve regulatory capture and an "AI for me but not for thee" outcome of full control.
Given that you can download and use the weights, the model architecture has to be included as part of that. And I did read a paper from them recently describing their MoE architecture and how it differs from the original GShard.
I don't think it makes sense to look at some previous PR statements by Altman et al. regarding this when there are tens of billions floating around and egos get inflated to moon sizes. Farts in the wind have more weight, but this goes for all corporate PR.
Thieves yelling "stop those thieves" is how this looks to me; they were just first and would not like losing that position. But it's all about money, and consequently power; business as usual.
There seems to be a rare moderation error by dang with respect to this thread.
The comments were moved here by dang from a flagged article with an editorialized/clickbait title. That flagged post has 1300 points at the time of writing.
1. It should be incumbent on the moderator to at least consider that the motivation for the points and comments may have been that many thought the "hypocrisy" of OpenAI's position was a more important issue than OpenAI's actual claim of DeepSeek violating its ToS. Moving the comments to an article that buries the potential hypocrisy issue that may have driven the original points and comments is not ideal.
2. This article is from FT, which has a content license deal with OpenAI. Moving the comments to an article from a company that has a conflict of interest due to its commercial relations with the YC company in question is problematic, especially since dang often states they try to be more hands-off on moderation when the article is about a YC company.
3. There is a link by dang to this thread from the original thread, but there should also be a link by dang to the original thread from here as well. Why is this not the case?
4. Ideally, dang should have asked for a more substantial submission that prioritized the hypocrisy point, to better match the spirit of the original post, instead of moving the comments to this article.
Yes, but we were duped at the time, so it’s right and good that we maintain light on and anger at the ongoing manipulation, in the hope of next time recognizing it as it happens, not after they’ve used us, screwed us, and walked away with a vast fortune.
But it makes sense to expose their blatant lies whenever possible, to diminish the credibility they are trying to build while accusing others of the same things they did.
Oh yes, I agree with all of you that lies should be exposed; also, whoever lies like that once will lie again, zero doubt there.
Just don't set the expectations bar too high to start with is all I am saying. Folks that get so high up money- and power-wise aren't nice people, period. Even if a nice, normal guy without any sociopathic traits were to suddenly shoot that high, the environment and pressures would deform him pretty quickly.
Also, I would consider only some leaked private conversations with close people as representative truth, not some PR statements carefully crafted by team of experts.
Happy to be proven wrong, still waiting for an example #1 to give me some hope.
> This is obviously extremely silly, because that's exactly how OpenAI got all of its training data
IANAL, but it is worth noting here that DeepSeek has explicitly consented to a license that doesn't allow them to do this. That is a condition of using ChatGPT and the OpenAI API.
Even if the courts affirm that there's a fair use defence for AI training, DeepSeek may still be in the wrong here, not because of copyright infringement, but because of a breach of contract.
I don't think OpenAI would have much of a problem if you train your model on data scraped from the internet, some of which incidentally ends up being generated by ChatGPT.
Compare this to training AI models on Kindle Books randomly scraped off the internet, versus making a Kindle account, agreeing to the Kindle ToS, buying some books, breaking Amazon's DRM and then training your AI on that. What DeepSeek did is more analogous to the latter than the former.
> DeepSeek has explicitly consented to a license that doesn't allow them to do this.
You actually don’t know this. Even if it were true that they used OpenAI outputs (and I’m very doubtful) it’s not necessary to sign an agreement with OpenAI to get API outputs. You simply acquire them from an intermediary, so that you have no contractual relationship with OpenAI to begin with.
You are free to publish your conversations with ChatGPT on the Internet, where they can be picked up by scrapers. The US ruled that they are not covered by copyright...
>IANAL, but It is worth noting here that DeepSeek has explicitly consented to a license that doesn't allow them to do this. That is a condition of using the Chat GPT and the OpenAI API.
Right, but it was never about doing the right thing for humanity, it was about doing the right thing for their profits.
Like I’ve said time and time again, nobody in this space gives a fuck about anyone that isn’t directly contributing money to their bottom line at that particular instant. The fundamental idea is selfish, damages the fundamental machinery that makes the internet useful by penalizing people that actually make things, and will never, ever do anything for the greater good if it even stands a chance of reducing their standing in this ridiculously overhyped market. Giving people free access to what is for all intents and purposes a black box is not “open” anything, is no more free (as in speech) than Slack is, and all of this is obviously them selling a product at a huge loss to put competing media out of business and grab market share.
It's quite unlikely that OpenAI didn't break any TOS with all the data they used for training their models.
Not just OpenAI but all companies that are developing LLMs.
IMO, it would look bad for OpenAI to push strongly with this story, it would look like they're losing the technological edge and are now looking for other ways to make sure they remain on top.
Similar to how a patent contract becomes void when a patent expires regardless of what the terms of the contract say, it's not clear to me that OpenAI can enforce a contract provision for an API output they own no copyright in.
Since they have no intellectual property rights in the output, it's not clear to me they have a cause of action to sue over how the output is used.
I wonder if any lawyers have written about this topic.
How many thousands or millions of contracts has OpenAI breached by scraping data off of websites that have terms of service explicitly saying not to scrape data off their websites?
But in all reality I'm happy to see this day. The fact that OpenAI ripped off everyone and everything they could and, to this day pretend like they didn't, is fantastic.
Sam Altman is a con, and it's not surprising that, given all the positive press DeepSeek got, there was a full-court assault on them within 48 hours.
It probably ignored hundreds of thousands of "by using this site you consent to our Terms and Conditions" notices, many of which probably would be read as prohibiting training. But that's also a great example of why these implicit contracts don't really work as contracts.
OpenAI scraped my blog so aggressively that I had to ban their IPs. They ignored the robots.txt (which is a kind of ToS) by two orders of magnitude, and they ignored the explicit ToS that I copy-pasted blindly from somewhere, but which turns out to forbid what they did (something like: you can't make money with the content). Not that I'm going to enforce it, but they should at least shut up.
For example, my digital garden is under the GFDL, and my blog is CC BY-NC-SA. IOW, they can't remix my digital garden under any other license than the GFDL, and they have to credit me if they remix my blog, and can't use it for any commercial endeavor, which OpenAI certainly does now.
So, by scraping my webpages, they agree to my licensing of my data. So they're de facto breaching my licenses, but they cry "fair use".
If I point out that they're breaching the license terms, they'd laugh at me, and maybe give me 2 cents of API access to mock me further. When somebody allegedly uses their API against their unenforceable ToS, they scream like an agitated cockatoo (which is an insult to the cockatoo, BTW; they're devilishly intelligent birds).
Drinking their own poison was mildly painful, I guess...
BTW, I don't believe that DeepSeek copied/used OpenAI models' outputs or training data to train theirs; even if they did, "the cat is out of the bag", "they did something amazing so they needed no permissions", "they moved fast and broke things", and "all is fair use because it's just research", regardless of how they did it.
> So, by scraping my webpages, they agree to my licensing of my data.
If the fair use defense holds up, they didn't need a license to scrape your webpage. A contract should still apply if you only showed your content to people who've agreed to it.
> and "all is fair-use because it's just research"
Fair use is a defense to copyright infringement, not breach of contract. You can use contracts, like NDAs, to protect even non-copyright-eligible information.
Morally I'd prefer what DeepSeek allegedly did to be legal, but to my understanding there is a good chance that OpenAI is found legally in the right on both sides.
At this point, what I'm afraid of is that the justice system will be just an instrument in this whole Us vs. Them debate, so its decisions will not be bound by law or legality.
Speculation aside, from what I understood, something like this shouldn't hold water under the fair-use doctrine, because there's disproportionate damage, plus a huge monopolistic monetary gain, because of what they did and how they did it.
On the other hand, I don't believe that DeepSeek used OpenAI (in any capacity or way or method) to develop their models, but again, it doesn't matter how they did it in the current situation.
What they successfully did was to upset a bunch of high level people, regardless of the technical things they achieved.
IMHO, the AI war has dynamics similar to MAD. The best way is not to play, but we have crossed the Rubicon now. The future looks dirty.
> from what I understood, something like this shouldn't hold a drop of water under fair-use doctrine, because there's a disproportional damage, plus a huge monopolistic monetary gain
"Something like this" as in what DeepSeek allegedly did, or the web-scraping done by both of them?
For what DeepSeek allegedly did, OpenAI wouldn't have a copyright infringement case against them because the US copyright office determined that AI-generated content is not protected by copyright - and so there's no need here for DeepSeek to invoke fair use. It'll instead be down to whether they agreed to and breached OpenAI's contract.
For the web-scraping it's more complicated. Fair use is determined by the weighing of multiple factors - commercial use and market impact are considered, but do not alone preclude a fair use defense. Machine learning models do seem, at least to me, highly transformative - and "the more transformative the new work, the less will be the significance of other factors".
Additionally, since the market impact factor is the effect of the use of the copyrighted work on the market for that work, I'd say there's a reasonable chance it does not actually include what you may expect it to. For instance if you're a translator suing Google Translate for being trained on your translated book, the impact may not be "how much the existence of Google Translate reduced my future job prospects" nor even "how many fewer people paid for my translated book because of the existence of Google Translate" but rather "how many fewer people paid for my translated book than would have had that book been included in the training data" - which is likely very minor.
If their OS is open to the internet and you can scrape it and copy it off while they’re gone, then that would be about the right analogy. And OpenAI and DeepSeek have done the same thing in that case.
Citation? My understanding was that they are enforceable provided that someone has to affirmatively accept them in order to use your site. So Terms of Service stuck at the bottom in the footer likely would not count as a contract because there's no consent, but Terms of Service included in a check box on a login form likely would count.
But IANAL, so if you have a citation that says otherwise I'd be happy to see it!
You just need to read OpenAI’s arguments about why TOS and copyright laws don’t apply to them when they’re training on other people’s copyrighted and TOS protected data and running roughshod over every legal protection.
Yes, though this is especially true when it's consumers 'agreeing' to the TOS. Anything even somewhat surprising within such a TOS is basically thrown out the window in European courtrooms without a second look.
For actual, legally binding consent, you'll need to make some real effort to make sure the consumer understands what they are agreeing to.
Legally, I understand your point, but morally, I find it repellent that a breach of contract (especially terms-of-service) could be considered more important than a breach of law. Especially since simply existing in modern society requires us to "agree" to dozens of such "contracts" daily.
I hope voters and governments put a long-overdue stop to this cancer of contract-maximalism that has given us such benefits as mandatory arbitration, anti-benchmarking, general circumvention of consumer rights, or, in this case, blatantly anti-competitive terms, by effectively banning reverse-engineering (i.e. examining how something works, i.e. mandating that we live in ignorance).
Because if they don't, laws will slowly become irrelevant, and our lives governed by one-sided contracts.
On another subject, if it belongs to OpenAI because it uses OpenAI, then doesn't that mean that everything produced using OpenAI belongs to OpenAI? Isn't that a reason not to use OpenAI? It's very similar to saying that you used Google and searched; now this product belongs to Google. They couldn't figure out how to respond; they went crazy.
The US ruled that AI-produced things are by themselves not copyrightable.
So no, it doesn't belong to OpenAI.
OpenAI might be able to sue for penalties for breach of the ToS, but that doesn't give them any right to the model. And even then, it doesn't give them any right to invalidate the unbound copyright grants they have given to 3rd parties (here, literally everyone). Nor does it prevent anyone from training their own new models based on it, or prevent anyone from using it. Oh, and the one breaching the ToS might not even have been the company behind DeepSeek but some in-between 3rd party.
Naturally this is under a few assumptions:
- the US consistently applies its own law, but they have a long history of not doing so
- the US doesn't abuse its power to force its economic preferences (banning DeepSeek) on other countries
- it actually was trained on OpenAI outputs, but, uh, OpenAI has IMHO shown very clearly over the years that they can't be trusted and are fully non-transparent. How do we trust their claim? How do we trust them not to have retroactively tweaked their model to make it look as if DeepSeek copied it?
The copyright office ruled AI output is uncopyrightable without sufficient human contribution to expression.
Prompts, they said, were unlikely to be enough to satisfy the requirement of a human controlling the expressive elements, thus most AI output today is probably not copyrightable.
> The Office concludes that, given current generally available technology, prompts alone do not provide sufficient human control to make users of an AI system the authors of the output.
Prompts alone.
But there are almost no cases of "Prompts Alone" products seeking copyright.
Even, what, 3-4 years ago, AI tools moved onto a collaborative footing. NovelAI forces a collaborative process (and gives you output that can demonstrate your input, which is nice). ChatGPT effectively forces it due to limited memory.
There was a case, posted here on ycombinator, where a Chinese judge upheld that "significant" human interaction was involved when a user made 20-odd adjustments to their prompt, iterating over produced images, and then added a watermark to the result. I would be very surprised if most sensible jurisdictions didn't follow suit.
Midjourney and ChatGPT already include tools to mask and identify parts of the image to be regenerated. And multiple image generators allow dumb stuff like stick figures and so forth to stand in as part of an uploaded image prompt.
And then there's AI voice, which is another whole bag of tricks.
>thus most AI output today is probably not copyrightable.
Unless it was worked on even slightly as above. In fact it would be hard to imagine much AI work that isn't copyrightable. Maybe those facebook pages that just prompt "Cyberpunk Girl" and spit out endless variations. But I doubt copyright is at the forefront of their mind.
A person collaborating on output almost certainly would still not qualify as making substantive contributions to expression in the US.
The US copyright's determination was based on the simple analogy of someone hiring someone else to create a work for them. The person hiring, even if they offer suggestions and veto results, is not contributing enough to the expression and therefore has no right to claim copyright themselves.
If you stand behind a painter and tell them what to do, you don't have any claim to copyright as the painter is still the author of the expression, not you. You must have a hand in the physical expression by painting yourself.
>A person collaborating on output almost certainly still would still not qualify as substantive contributions
But then
>You must have a hand in the physical expression by painting yourself.
You contradict yourself. NovelAI will literally highlight your contributions separately from the AI's so you can prove you also painted. Image generators literally let you paint over the top to select AI boundaries.
But even then, wouldn't the people using OpenAI still be the author/copyright holder, and never OpenAI? (As no human on OpenAI's side is involved in the process of creating the works.)
OpenAI is a company of humans; the product is ChatGPT. There's a grey area regarding who owns the content, so OpenAI's terms and conditions state that all ownership of the resulting content belongs to the user. This is actually advantageous because it means that they don't hold ownership of bad things created by their tool.
That said, you can still impose terms on access to the tool; IIRC Midjourney allows creators to own their content but also forces them to license it back to Midjourney for advertising. Prompts too, from memory.
The existence of R1-zero is evidence against any sort of theft of OpenAI's internal COT data. The model sometimes outputs illegible text that's useful only to R1. You can't do distillation without a shared vocabulary. The only way R1 could exist is if they trained it with RL.
I don’t think anyone is really suggesting they stole COT or that it is leaked, but rather that the final o1 outputs were used to train the base model and reasoning components more easily.
DeepSeek-R1-Zero (based on the DeepSeek-V3 base model) was trained only with RL, no SFT, so this isn't at all like the "distillation" (i.e. SFT on synthetic data generated by R1) that they also demonstrated by fine-tuning Qwen and LLaMA.
Now, DeepSeek may (or may not) have used some o1-generated data for the R1-Zero RL training, but if so that's just a cost saving vs having to source some reasoning data another way, and in no way reduces the legitimacy of what they accomplished (which is not something any of the AI CEOs are saying).
> This is obviously extremely silly, because that's exactly how OpenAI got all of its training data in the first place - by scraping other peoples' data off the internet.
OpenAI has also invested heavily in human annotation and RLHF. If all DeepSeek wanted was a proxy for scraped training data, they'd probably just scrape it themselves. Using existing RLHF'd models as replacement for expensive humans in the training loop is the real game changer for anyone trying to replicate these results.
"We spent a lot of labor processing everything we stole" is...not how that works.
That's like the mafia complaining that they worked so hard to steal those barrels of beer that someone made off with in the middle of the night and really that's not fair and won't someone do something about it?
Oh, I don't really care about IP theft and agree that it's funny that OpenAI is complaining. But I don't think it's true that DeepSeek is just doing this because they are too lazy to scrape the internet themselves - it's all about the human labor that they would otherwise have to pay for.
That's assuming what a known prolific liar has said is true...
The most famous example would be him contacting ScarJo's agent to hire her to provide her voice for their text-to-speech bot, being told to go pound sand, doing it anyway, and then lying about it (which they got away with until her agent released a statement saying they'd approached her and she told them to fuck off).
To my understanding, this is not true. The "Sky" voice was based on a real voice actor they had hired months before contacting Johansson, with the casting call not mentioning anything about sounding like Johansson. [0]
I think it's plausible that they noticed some similarity and that's what prompted them to later reach out to see if they could get Johansson herself, but it's not Johansson's voice and does not appear to be someone hired to sound like her.
This is a fascinating development because AI models may turn out to be like pharmaceuticals. The first pill costs $500 million to make, the second one costs pennies.
Companies are still charging 100x for the pills that cost pennies to produce.
Besides deals with insurance companies and governments, one of the ways they are still able to pull this off is by convincing everyone that it's too dangerous to play with this at home or to buy it from an Asian supplier.
At least with software we had until now a way to build and run most things without requiring dedicated super expensive equipment. OpenAI pulled a big Pharma move but hopefully there will be enough disruptors to not let them continue it.
The solution is to create a health insurance system which burdens only Americans with the $500m cost, while India is allowed to make the drug for pennies for the rest of the world.
You're right that the first claim is silly, but the second claim is pretty silly too — they're not claiming industrial espionage, they're claiming a breach in ToS. The outputs of the o1 thinking process aren't user-visible, and never leave OpenAI's datacenters. Unless DeepSeek actually had a mole that stole their o1 outputs, there's nothing useful DeepSeek could've distilled to get to R1's thought processes.
And if DeepSeek had a mole, why would they bother running a massive job internally to steal the generated data? It would be way easier for the mole to just leak the RL training process, and DeepSeek could quietly copy it rather than bothering with exfiltrating massive datasets to distill. The training process is most likely on the order of a hundred lines of Python, and you don't even need the file: you just need someone to describe it to you. Much simpler than snatching hundreds of gigabytes of training data off of internal servers...
Plus, the RL process described in DeepSeek's paper has already been replicated by a PhD student at Berkeley: https://x.com/karpathy/status/1884678601704169965 So it seems pretty unlikely they simply distilled o1 and lied about it, or else how does their RL training algo actually... work?
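For anyone curious what "a hundred lines of Python" might look like in spirit, here is a deliberately tiny, self-contained sketch of RL against a verifiable, rule-based reward with a group-relative baseline (the general flavor described in the R1 paper). The canned prompts, candidate answers, and multiplicative-weights "policy" are toy stand-ins, not anyone's actual training code.

    import random
    from collections import defaultdict

    # Toy "policy": an unnormalized weight per candidate answer for each prompt.
    problems = {"2 + 2": "4", "3 * 5": "15"}
    candidates = {"2 + 2": ["4", "5", "22"], "3 * 5": ["15", "8", "35"]}
    policy = {p: defaultdict(lambda: 1.0) for p in problems}

    def sample_group(p, k=8):
        # Sample a group of answers in proportion to current policy weights.
        answers = candidates[p]
        weights = [policy[p][a] for a in answers]
        return random.choices(answers, weights=weights, k=k)

    for step in range(200):
        for p, truth in problems.items():
            group = sample_group(p)
            rewards = [1.0 if a == truth else 0.0 for a in group]  # rule-based, verifiable reward
            baseline = sum(rewards) / len(rewards)                 # group-relative baseline
            for a, r in zip(group, rewards):
                # Shift weight toward answers that beat the group average.
                policy[p][a] *= 1.0 + 0.1 * (r - baseline)

    for p in problems:
        total = sum(policy[p][a] for a in candidates[p])
        print(p, {a: round(policy[p][a] / total, 3) for a in candidates[p]})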
This is mainly cope from OpenAI that their supposedly super duper advanced models got caught by China within a few months of release, for way cheaper than it cost OpenAI to train.
> "DeepSeek trained on our outputs, and so their claims of replicating o1-level performance from scratch are not really true"
Someone correct me if I'm wrong, but I believe in ML research you always have a dataset and a model; they are distinct entities. It is plausible that output from OpenAI's model improved the quality of DeepSeek's dataset, just like everyone publishing their code on GitHub improved the quality of OpenAI's dataset. The thinking so far has been that the dataset is not "part of" or "in" the model any more than the GPUs used to train the model are. It seems strange that that thinking should now change just because Chinese researchers did it better.
OpenAI has a message they need to tell investors right now: "DeepSeek only works because of our technology. Continue investing in us."
The choice of how they're wording that of course also tells you a lot about who they think they're talking to: namely, "the Chinese are unfairly abusing American companies" is a message that is very popular with the current billionaires and American administration.
“We engage in countermeasures to protect our IP, including a careful process for which frontier capabilities to include in released models, and believe . . . it is critically important that we are working closely with the US government to best protect the most capable models from efforts by adversaries and competitors to take US technology.”
The above OpenAI quote from the article leans heavily towards #1 and IMO not at all towards #2. The latter would be an extremely charitable reading of their statement.
This is going to have a catastrophic effect on closed-source AI startup valuations, because it means that anyone can copy any LLM. The entity that trains the model spends the most money; everyone else can create a replica at lower cost.
Why is that bad? If a powerful entity can scrape every piece of media humanity has to offer and ignore copyright then why should society let then profit unrestricted from it? It's only fair that such models have no legal protection around their usage and can be used and analyzed by anyone as they see fit. The only reason this hasn't been codified into laws is because those same powerful entities have been busy trying to do regulatory capture.
There is a big difference between being able to train on the reasoning vs just the answers, which they can't do against o1 because it's hidden. There is also a huge difference between being able to train on the probabilities (distillation) vs not, which again they can and did do with the Llama models and can't directly with OpenAI, because they conceal the probability output.
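To make the distinction concrete, here is a minimal PyTorch sketch (toy tensors, assumed vocabulary size) of the two regimes: plain SFT on visible output tokens, which is all you can do against a closed API, versus logit-level distillation, which needs the teacher's per-token probabilities.

    import torch
    import torch.nn.functional as F

    vocab = 32000  # assumed vocabulary size, for illustration only
    student_logits = torch.randn(2, 8, vocab)
    teacher_logits = torch.randn(2, 8, vocab)      # only available if the provider exposes logits
    target_ids = torch.randint(0, vocab, (2, 8))   # visible text tokens: all an API gives you

    # 1) "Train on the answers": cross-entropy against the sampled tokens (SFT).
    sft_loss = F.cross_entropy(student_logits.view(-1, vocab), target_ids.view(-1))

    # 2) True distillation: match the teacher's full output distribution via KL divergence.
    T = 2.0  # softening temperature
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    print(float(sft_loss), float(kd_loss))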
If we assume distillation remains viable, the game theory implications are huge.
It’s going to shift the market of how foundation models are used. Companies creating models will be incentivized to vertically integrate, owning the full stack of model usage. Exposing powerful models via APIs just lets a competitor clone your work. In a way OpenAI’s Operator is a hint of what’s to come
There are literally public ChatGPT conversation datasets. For the past two years it's been common practice for pretty much all open-source models to train on them. Ask just about any open-source model who they are and a lot of the time they'll say they're ChatGPT. Why is "having obtained o1-generated data" suddenly such huge news, to the point of warranting conspiracy theories about undisclosed/undiscovered breaches at OpenAI? Nobody ever made a fuss about public ChatGPT datasets until now. No hacking of OpenAI is needed to obtain ChatGPT data.
This really got me thinking that OpenAI should have no IP claim at all, since all their outputs are basically a ripoff of the whole of human knowledge and IP of various kinds.
> DeepSeek trained on our outputs, and so their claims of replicating o1-level performance from scratch are not really true" This is at least plausibly a valid claim.
Some may view this as partially true, given that o1 does not output its CoT process.
The suggestion that any large-scale AI model research today isn’t ingesting output of its predecessors is laughable.
Even if they didn't directly, intentionally use o1 output (and they didn't claim they didn't, so far as I know), AI slop is everywhere. We passed peak original content years ago. Everything is tainted and everything should be understood in that context.
Reasonable take, but to ignore the politics of this whole thing is to miss the forest for the trees—there is a big tech oligarchy brewing at the edges of the current US administration that Altman is already participating in with Stargate, and anti-China sentiment is everywhere. They'd probably like the US to ban Chinese AI.
Yeah, especially when it's making waves in the market and is hundreds of times more efficient than what their best and brightest came up with under their leadership.
It's a decent point if their models were not trained in isolation but used o1 to improve them. But it's rich for OpenAI to complain that DeepSeek or anyone else used their data for training. Get out of here, fellow thieves.
I think the more interesting claim (that DeepSeek should make for the lols) is that it wasn't them who trained R1. No, it was o1's idea. It chose to take the young R1 as its padawan.
Even for the latter point (if true, and I'd call this assertion highly questionable), so what?
That's honestly such an academic point, who really cares?
They've been outcompeted, and the argument is "well, if we didn't let people access our models, they would have taken longer to get here." So what??
The only thing this gets them is an explanation as to why training o1 cost them more than $5 million or whatever, but that is in the past; the datacentre has consumed the energy, and the money has gone up in fairly literal steam.
There is a third possibility I haven't seen discussed yet: that DeepSeek illegally got their hands on an OpenAI model via a breach of OpenAI's systems. It's easy to laugh at OpenAI and say "you reap what you sow", and I'm 100% in that camp, but given the lengths other Chinese entities have gone to when it comes to replicating Western technology, we should not discount this.
That being said, breaching OAI's systems, re-training a better model on top of their closed source model, then open sourcing it: That's more Robinhood than Villain I'd say.
The reason you’re not seeing that being discussed is it’s totally unsupported by any evidence that’s in the public domain. Unless you have some actual evidence of such a breach, you may as well introduce the possibility that DeepSeek was reverse engineered from data found at an alien crash site.
There's no public evidence to that effect but the speculation makes a lot more sense than you make it sound.
The Chinese Communist party very much sees itself in a global rivalry over "new productive forces". That's official policy. And US leadership basically agrees.
The US is playing dirty by essentially embargoing China over big AI - why wouldn't it occur to them to retaliate by playing dirtier?
I mean we probably won't know for sure, but it's much less far fetched than a lot of other speculation in this area.
E.g., R1's cold start training could probably have benefited quite a bit from having access to OpenAI's chain of thought data for training. The paper is a bit light on detail on how it was made.
> The Chinese Communist party very much sees itself in a global rivalry over "new productive forces".
interestingly, that actually makes the CCP the largest political party pursuing state capitalism.
there won't be any competition between China and the US if the CCP is indeed a communist party as we all know full well that communism doesn't work at all.
What a ridiculous thing to say. Equating the possibility of a non-US nation-state-backed organization hacking a Western organization with data found at an alien crash site is bananas.
DeepSeek is basically a startup, not a "foreign nation-state backed organization". They were forced to pivot to AI when their original business model (quant hedge fund) was stomped on by the Chinese government.
Of course this is China so the government can and does intervene at will, but alleging that this required CIA level state espionage to pull off is alien crash levels of implausible. They open sourced the entire thing and published incredibly detailed papers on how they did it!
You don't need a CIA-level agent: just get someone a fraudulent job at OpenAI for a few months, have them load some files onto a thumb drive, and catch a plane to Shanghai.
The naivety of some folks here is astounding... The CCP has golden shares in anything that could possibly be important at some point in the next hundred years, and yes, golden shares are either really that or they're a euphemism; the point is it doesn't even matter.
It doesn’t have to micromanage. It doesn’t care about most. It is only interested in the politically important ones, but it needs the optionality if something becomes worthwhile.
You're suggesting that DeepSeek was a Chinese government operation that gained access to OpenAI's proprietary data, and then you're justifying that by saying that the government effectively controls every important company. You're even chiding people who don't believe this as naive.
I think you have a cartoonish view of China. A huge amount goes on that the government has no idea about. Now that DeepSeek has made a huge media splash, the Chinese government will certainly pay attention to them, but then again, so will the US government.
I’m suggesting it will be happening now and any past efforts will be retroactively analyzed by the appropriate CCP apparatus since everyone is aware of the scale of success as of Monday. It has become a political success, thus it is imperative the CCP partakes in it.
> DeepSeek, illegally, got their hands on an OpenAI model via a breach of OpenAI's systems. [...] given the lengths other Chinese entities have gone to when it comes to replicating Western technology; we should not discount this.
Above, teractiveodular said that "DeepSeek is basically a startup, not a 'foreign nation-state backed organization'". You called teractiveodular naive for saying that. So forgive me if I take the obvious implication that you think DeepSeek is actually a state-backed actor enabled by government hacking of OpenAI.
Until recently treating the US and China on the same geopolitical level for allied countries would have been insanely uncharitable and impossible to do honestly and in good faith.
But now we have a bully in the whitehouse who seems to want to literally steal neighboring land, or is throwing shit everywhere to distract from the looting and oligarchy being formed. So I suddenly have more empathy for that position.
I notice that your geopolitical perspective doesn't stretch to any actual evidence that such a thing took place. So it really has exactly the same amount of supporting evidence as my alien-crash reverse-engineering scenario at present.
The surrounding facts matter a lot here. For example, there are plenty of instances of governments hacking companies of competing nations. Motives are incredibly easy to come by as well, be they political or economic. We also have no proof that aliens exist at all, so you've not only conjured them into existence, but also their motive and their skills.
OK, so to be clear: your surrounding facts are that they may have a motive and that nation states hack people. I don't disagree with those, but there really are no facts that support the idea that there was a hack in this case, and the null hypothesis is that researchers all around the world (not just in the US) are working on this, so not all breakthroughs are going to be made in the US. That could change if facts come to light, but at the moment it's not really useful to speculate on something that is in essence entirely made up.
Are you a Chinese military troll? The fact that China engages in industrial espionage is well known. So I’m surprised at your resistance to that possibility.
Industrial espionage isn't magic. Airbus once stole basically everything Boeing had, but that doesn't mean Airbus could magically build a better 737 tomorrow.
China steals a lot of documentation from the US but in a tech forum you of all people should be very familiar with how little actual progress a bunch of documentation is towards a finished unit.
The Comac C919 still uses American engines despite all the industrial espionage in the world, because most actual engineering is still a brute-force affair of finding how things fail and fixing that. That's one of the main advantages SpaceX has proven out with their "eh, fuck it, just launch and we will see what breaks" methodology.
Even fraud filled Chinese research makes genuine advancements.
Believing that China, a wealthy nation of over a billion people, with immense unity, nationalism, and a regime able to explicitly write blank checks, could only possibly beat the US at something by cheating is, like, infinite hubris. It's hilarious actually.
I don't know if DeepSeek is actually just a clone of something or a shenanigan, that's possible and China certainly has done those kinds of things before, but to think it's the MOST LIKELY outcome, or to over rely on it in any way is a death sentence. OpenAI claims to have evidence, why do they not show it?
>>>Believing that China, a wealthy nation of over a billion people, with immense unity, nationalism, and a regime able to explicitly write blank checks, could only possibly beat the US at something by cheating is, like, infinite hubris. It's hilarious actually
So this is the first time I’ve heard the Chinese regime being described in such flowery terms on HN - lol. But ok - haha
The evidence supporting offensive hacking is abundant in recent history; the number of things which have been learned from alien crash data is surely smaller by comparison to the number of things which have been learned from offensive hacking.
That would require stealing the model weights and the code, as OpenAI has been hiding what they are doing. Running models properly is still quite an art.
Meanwhile, they have access to Meta models and Qwen. And Meta models are very easy to run and there's plenty of published work on them. Occam's Razor.
How hard is it, if you have someone inside with access to the code? If you have hundreds of people with full access, it's not hard to find someone willing to sell it or do some industrial espionage...
Lots of ifs here. They need specific US employee contacts at a company that's quickly growing, and one of those needs to be willing to breach their contract to share it. That contact also needs to trust that DeepSeek can properly utilize such code and completely undercut their own work.
A lot of hoops when there are simply other models to utilize publicly.
How big are the weights for the full model? If it's on the scale of a large operating system image then it might be easy to sneak, but if it's an entire data lake, not so much.
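A rough back-of-envelope answer, assuming the publicly stated ~671B total parameters for DeepSeek-V3/R1 is the relevant figure:

    # Checkpoint size ~= parameter count x bytes per parameter (ignoring optimizer state).
    params = 671e9  # DeepSeek-V3/R1 total parameters per the public papers (MoE; ~37B active)
    for name, bytes_per_param in [("fp16/bf16", 2), ("fp8", 1), ("int4", 0.5)]:
        size_tb = params * bytes_per_param / 1e12
        print(f"{name}: ~{size_tb:.2f} TB")
    # ~1.34 TB at fp16 and ~0.67 TB at fp8: bulky, but much closer to a large OS
    # image than to a multi-petabyte data lake.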
Devil's advocate says that we know foreign (hell, even national) intelligence agencies attempt to infiltrate agents by having them become employees at any company they are interested in. So the idea isn't just pulled from thin air as a concept. I do agree that it is a big if, with no corroborating evidence for the specific claim.
I discount this because OpenAI is pumping the whole internet for money, and Zuckerberg torrented LibGen for its AI. We cannot blame the Chinese anymore. They went through the crappy "Made in China" phase in the 80s/90s, but they mastered the art of improving stuff instead of mere cloning, and it makes the big companies angry which is a nice bonus.
IMHO the whole world is becoming crazy for a lot of reasons, and pissing off billionaires makes me laugh.
I don't think you need to steal a model - you need training samples generated from the original, which you can get simply by buying access to perform API calls. This is similar to TinyStories (https://arxiv.org/abs/2305.07759), except here they're training something even better than the original model for a fraction of the price.
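As a sketch of how little machinery that takes: the snippet below harvests prompt/response pairs from a hosted model through its public API and writes them to a JSONL file ready for fine-tuning. The model name and prompts are placeholders; this illustrates the general technique, not what DeepSeek actually ran.

    import json
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    prompts = [
        "Prove that the sum of two even integers is even.",
        "Solve for x, showing each step: 3x + 7 = 22.",
    ]

    with open("synthetic_pairs.jsonl", "w") as f:
        for p in prompts:
            resp = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model name
                messages=[{"role": "user", "content": p}],
            )
            pair = {"prompt": p, "response": resp.choices[0].message.content}
            f.write(json.dumps(pair) + "\n")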
I don't think we should discount it as such, but given there's no evidence for it, yet plenty of evidence that they trained this themselves surely we can't seriously entertain it?
Given the openness of their model, that should be pretty easy to detect. If it were even a small possibility, wouldn’t openAI be talking about it very very loudly?
I think people overestimate the amount of secret sauce needed to train these models. The reason AI has come this far since AlexNet is that most of the foundational techniques are easy to share and implement, and that companies have been surprisingly willing to share their tricks openly, at least until OpenAI decided to become evil hoarders.
I really doubt it. If that's the case the US GOV is in serious shit. They have a contract with OpenAI to chuck all their secret data in there... In all likelihood they just distilled. It's a start up company that is publishing all of their actual advances in the open, with proof. I think a lot of people run to "espionage" super fast, when reality is, the US probably sucks at what we call AI. Don't read that wrong, they are a world leader obviously. However, there is a ton of stuff they have yet to figure out.
Cheapening a series of fact-checkable innovations because of the country of origin, when so far all they have shown are signs of good faith, is paranoid at best and, at worst, propaganda to help the billionaire tech lords save face for their own arrogance.
If the US government is "chucking all their secret data" into OpenAI servers/models, frankly they deserve everything they get for that level of stupidity.
Probably more like specialized tools to help spy on and forecast civilian activities more than anything else. Definitely with hallucinations, but that's not really important. Facts don't matter much these days...
I'd be perfectly fine with China stealing all "our" shit if they just shared it.
The word "our" does a lot of heavy lifting in politics[0]. America is not a commune, it's a country club, one which we used to own but have been bought out of, and whose new owners view us as moochers but can't actually kick us out (yet). It is in competition with another, worse country club that purports to be a commune. We owe neither country club our loyalty, so when one bloodies the other's nose, I smile.
[0] Some languages have a notion of an "exclusive we". If English had such a concept, this would be an exclusive our.
I don't think that's what the parent was getting at. The US and China are in an ongoing "cyber war". Both sides of that conflict actively use their computers to send messages/signals to other computers, hoping that the exploits contained in those messages/signals can be used to exfiltrate data from and/or gain control of the computer receiving the message. It would really be weird to flatly discount the possibility that some OpenAI data was leaked, however closely guarded it may be.
I flatly discount the possibility because OpenAI can't produce evidence of a breach. At best, they'd rather hide the truth than admit a compromise. At worst they show incompetence that they couldn't detect such a breach. Not a good look either way.
> It would really be weird to flatly discount the possibility that some OpenAI data was leaked, however closely guarded it may be.
It’s even weirder to raise it as a possibility when there is literally nothing suggesting that was even remotely the case.
So if there is no evidence nor even formal speculation, then the only other reason to suggest this as a possibility would be because of one’s own opinions regarding Chinese companies. Hence my previous comment.
> Because that would be jumping to conclusions based purely on racial prejudices.
Not purely. There may be some prejudice, but look at Nortel[1] as a famous example of a situation where technological espionage by Chinese firms wreaked havoc on a company's fortunes and technology.
I too would want to see the evidence and forensics of such a breach to believe this is more than sour grapes from OpenAI.
Nortel survived the fucking Great Depression. But a bunch of outright fraudulent activity by its C-suite to bump stock prices led to them vastly overstating, overplanning, and over-committing resources to a market that was much, much smaller than they were claiming. Nortel spent billions and billions on completely absurd acquisitions while they were making no money, explicitly to boost their stock price.
That was all laid bare when the telecom bust happened. Then the great recession culled some of the dead wood in the economy.
Huawei stealing tech from them did not kill them. This was a company so rotten that the people put in charge, right after this huge scandal put the investigative lights on them, IMMEDIATELY turned around and pulled another scam! China could have been completely removed from history and Nortel would have died the same. They were killed by the same disease that killed and nearly killed a lot of stuff in 2008, and is still trying to kill us: line MUST go up.
Nobody is accusing them, just stating it’s a possibility, which would also be true if they were an American or European company. Corporate espionage is just more common in China.
At some point these straw men start to look like ignorance or even reverse racism. As if (presumably non-Han Chinese) Americans are incapable of tolerance.
There are plenty of Han Chinese who are citizens of democratic nations. China is not the only nation with Han Chinese.
America, for instance, has a large number of Asian citizens, including a large number of Han Chinese. The number of white, non-Hispanic Americans is decreasing, while the number of Asian Americans is increasing at a rate 3x the decrease in whites. America is a melting pot and deals with race relations issues far more than ethnically uniform populations. The conversations we have about race are because we're so exposed to it -- so racially and culturally diverse. If anything, we're equipped to have these conversations gracefully because they're a part of our everyday lived experience.
At the end of the day, this is 100% a geopolitical argument. Pulling out the race card any time China is criticized is arguing in bad faith. You don't see the same criticisms lobbed at South Korea, Vietnam, Taiwan, or Singapore, precisely because this is a geopolitical issue.
As further evidence you can recall the conversations we had in the 90's when we were afraid Japan would take over. All the newspapers wrote about was "Japan, Japan, Japan" and the American businesses they were buying up and taking over. It was 100% geopolitical fear. You'll note that we no longer fill the zeitgeist with these discussions today save for a recent and rather limited conversation about US Steel. And that was just a whimper.
These conversations about China are going to increase as the US continues to decouple from Chinese trade. It's not racism, it's just competition.
There are users and there are trolls. There is nothing racist in calling a government of a superpower interested and involved in the most revolutionary tech since the Internet.
It used to mean someone who's trying to enrage people by baiting ("trolling"), and now it can also mean someone arguing in bad faith. And Chinese troll I guess means someone doing this on behalf of the Chinese govt.
Yup we agree then. Claiming an argument to be racist is a bad faith attempt at guilt tripping Americans; a form of FUD and whataboutism. It is not done by normal users, they don’t need it.
Or it can just be a normal user who's wrong this time. He looks like a normal user. In theory it could all be a cover, but that'd be ridiculous effort just for HN boards. Throwing those accusations around will make this place more like Twitter or Reddit.
There’s ordinary xkcd wrong on the internet and there’s repeating foreign nation state propaganda lines. Doing it in good faith does not make it less bad.
Belief that the CCP is behaving poorly isn’t racial prejudice, it’s a factual statement backed by a mountain of evidence across many areas including an ongoing genocide.
Extending that to a new bad behavior we don’t have evidence for is pure speculation, but it need not be based on race.
Yeah, but I think the OP's point is something along the following lines. Not everything you buy from China, or every person you interact with from China, is part of a clandestine CCP operation. People buy stuff every day from Alibaba and it's not a CCP scheme to sell portable fans or phone chargers. A big chunk of the factories over there are US-funded, after all... Just like how it's not a CCP scheme to write a scientific paper or create an ML model.
Similarly, I see no evidence (yet) that DeepSeek is a CCP-operated company, any more than any given AI startup in the US is a three-letter agency's direct handiwork or a US political party directive. The US has also supported genocides and a bunch of crazy stuff, but that doesn't mean any company in YC is part of a US government plot.
I know of people who immigrated to China, I know people who immigrated from China, I went to school with people who were on visas from China. Maybe some of them were CCP assets or something, but mostly they appeared to me to be people who were doing what they wanted for themselves.
If you believe both sides are up to no-goodery, that's in the face of the OP's statement. If you think it's just one, and the enemy is in complete control of all of its people doing all of their commerce, then I think the OP may have a point.
Absolutism (“Every person”, “CCP operated”, etc) isn’t a useful methodology to analyze anything.
Implying that because something isn’t clandestine it can’t be part of a scheme ignores open manipulation which is often economy wide. Playing with exchange rates or electricity subsidies can turn every bit of international trade into part of a scheme.
In the other direction some economic activity is meaningfully different. The billions in LLM R&D is a very tempting target for clandestine activities in a way that a cheap fan design isn’t.
I wouldn't be surprised if DeepSeek's results were independent and the CCP was doing clandestine activities to get data from OpenAI. Reality doesn't need to conform to narrative conventions; it can be really odd.
I completely agree with you and apologize for cheapening both the nuance and complexity where I did.
My personal take is this: what DeepSeek is offering is table scraps compared to the CCP's actual ambitions for what we call AI. China's economy is huge on industrial automation, and they care a lot more about raw materials and manufacturing efficiency than, say, the US does.
> “It’s also extremely hard to rally a big talented research team to charge a new hill in the fog together,” he added. “This is the key to driving progress forward.”
Well I think DeepSeek releasing it open source and on an MIT license will rally the big talent. The open sourcing of a new technology has always driven progress in the past.
The last paragraph, too, is where OpenAI seems to be focusing their efforts...
> we engage in countermeasures to protect our IP, including a careful process for which frontier capabilities to include in released models ..
> ... we are working closely with the US government to best protect the most capable models from efforts by adversaries and competitors to take US technology.
So they'll go for getting DeepSeek banned like TikTok was now that a precedent has been set ?
Actually the "our IP" argument is ridiculous. What they are doing is stealing data from all over the web, without people's consent for that data to be used in training ML models. If anything, then "Open"AI should be sued and forced to publish their whole product. The people should demand knowing exactly what is going on with their data.
Also still an unresolved issue is how they will ever comply with a deletion request, should any model output personal data of someone. They are heavily in a gray area, with regards to what should be allowed. If anything, they should really shut up now.
They can still have IP while using copyrighted training materials - the actual model source code.
But DeepSeek didn't use that presumably (since it's secret). They definitely can't argue that using copyrighted material for training is fine, but using output from other commercial models isn't. That's too inconsistent.
> Only works with human authors can receive copyrights, U.S. District Judge Beryl Howell said[1]
IANAL but it seems to me that OpenAI wouldn’t be able to claim their outputs are IP since they are AI-generated.
It may be against their TOS, meaning OpenAI could refuse to provide service to DeepSeek in the future, but they can’t really sue them.
Did OpenAI ask all of the authors of the works they ingested to train their model for permission? Is OpenAI the biggest copyrighted works launderer in existence?
I don't think OpenAI should be able to make any claims of IP for the AI generated outputs, since they based that on other work, partially copyrighted work, which they hide. They simply throw algorithms at data that is not their data to begin with.
If I steal something, keep the exact thing I stole hidden, and sell a product, that I could only have made, based on the stolen thing, how can I expect that to be even legal, let alone untouchable IP?
I think way too many people have seen too many dollar signs in front of their eyes. The whole thing is outrageous. If they were transparently proving, that they are using open data sets, adhering to licenses, then they would get to claim IP.
> [OpenAI] definitely can't argue that using copyrighted material for training is fine, but using output from other commercial models isn't. That's too inconsistent.
Well, they can argue that, if they're fine with being hypocrites.
If there's any litigation, a counterclaim would be interesting. But DeepSeek would need to partner with parties that have been damaged by OpenAI's scraping.
I'm getting popcorn ready for the trial where an apparatus of the Chinese Communist Party files a counterclaim in an American court, together with the common people - millions of John Does - as litigants, against an organization that has aggressively and in many cases oppressively scraped their websites (effectively DDoSing them).
I would definitely pay for seeing that movie! Especially if it led to greedy tech giants becoming very careful about what data they gather and ingest for training of ML models.
They've started already, I've seen posts on LinkedIn implying or outright stating that DeepSeek is a national security risk (IMHO, LinkedIn being the social media outlet most corporate-sycophantic). I went ahead and just picked this one at random from my feed.
At least this guy can differentiate between running your own model and using the web/mobile app, where DeepSeek processes your data. I watched a TV show yesterday (I think it was France24) where the "experts" couldn't really tell the difference or weren't aware of it. I shut off the TV and went to sleep.
All you would do by banning it is killing US progress in AI. The rest of the world is still going to be able to use DS. You're just giving the rest of the world a leg up.
TikTok is a consumption tool, DS is a productive one. They aren't the same.
It’s simply because banning removes a market force in the US that’d drive technological advancement.
This is already evident with CNSA/NASA, Huawei/Android, TikTok/Western social media. The Western tech gets mothballed because we stick our heads in the sand and pretend we are undisputed leaders of the world in tech, whereas it is slowly becoming disputable.
The US won't ban DeepSeek from US, but more likely we will ban DeepSeek (and other Chinese companies) from accessing US frontier models.
> Western tech gets mothballed because we stick our heads in the sand and pretend we are undisputed leaders of the world in tech, whereas it is slowly becoming disputable.
I am hearing Chinese tech is now the best, and they achieved it by banning things left and right.
The fact is, it's out and improving day by day. Unsloth.ai is on a roll with their advancements. If DeepSeek is banned, hundreds more will pop up and change the data ever so slightly to skirt the ban. Pandora's box exploded on this one.
Already happening within tech company policy, mostly as a security concern. Local or controlled hosting of the model is okay in theory based on this concern, but in effect it taints everything regarding DeepSeek.
> So they'll go for getting DeepSeek banned like TikTok was now that a precedent has been set ?
Can't really ban what can be downloaded for free and hosted by anyone. There are many providers hosting the ~700B parameter version that aren't CCP aligned.
I'm old enough to remember when the US government did something very similar. For years (decades?), we banned the export of strong public-key cryptography implementations under the guise of the technology being akin to munitions.
People made shirts with printouts of the code to RSA under the heading "this shirt is a munition." Apparently such shirts are still for sale, even though they are not classified as munitions anymore.
I am not that old, but I did a deep dive on this in the past because it was just so extremely fascinating, especially reading the archives of Cypherpunk. There is a very solid, if rather bendy, line connecting all that to "crypto culture" today.
Were these implementations already easily open source accessible at the time, with tens of thousands of people already actively using them on their computers? No, right? Doesn't seem feasible this time around.
I don’t think an import ban would be any harder to enforce than an export ban. In fact if anything, I’d expect an import ban to be easier.
Though I’m not suggesting an import ban on DeepSeek would be effective either. Just that the US does have precedence pulling these kinds of stunts.
You can also look at the 90s subculture for passing DeCSS code (a tool for breaking DVD encryption) to see another example of how people wilfully skirted these kinds of stupid legal limitations.
So if you were to ask me if a ban on DeepSeek would work, the answer is clearly “no”. But that doesn’t mean it’s not going to happen. And if it does, the only people hurt are legitimate US businesses who might get a benefit from DeepSeek but have to follow the law. Those of us outside of America will be completely unaffected. Just like we were when US tried to limit the distribution of GPG.
Napster was one of thousands, if not 10s of thousands of similar services for music download.
And this analogy isn't particularly good. Napster was the server, not the product. Whether you got XYZ from Napster or anywhere else doesn't matter, because it's the product that you are after, not the way to get the product.
> So they'll go for getting DeepSeek banned like TikTok
The UAE (where I live, happily, and by choice), which desperately wants to be the center of the world in AI and is spending vast time and treasure to make it happen (they've even got their own excellent, government-funded foundation model), would _love_ this. Any attempt to ban DeepSeek in the US would be the most gigantic self-own. Combine that with no income tax, a fantastic standard of living, and a willingness to very easily give out visas to smart people from anywhere in the world, and I have to imagine it is one of several countries desperate for the US to do something so utterly stupid.
The fact that they are still called "Open"AI adds such a delicious irony to this whole thing. I could not imagine a company I had less sympathy for in this situation.
500 billion for a few US companies yet the Chinese will probably still be better for way less money. This might turn out to be a historical mistake of the new administration.
And what are they going to sell? The weights and the model architecture are already open source. I doubt the datasets of DeepSeek are better than OpenAI's
plus, if the US were to decide to ban DeepSeek (the company) wouldn't non-chinese companies be able to pick up the models and run them at a relatively low expense?
Except access to the app didn't have to stop. TikTok chose to manipulate users and Trump by going beyond the law and kissing Trump's rear. It was only US companies that couldn't host the app (eg Google and Apple). Users in the US could have still accessed the app, and even side-loaded it on Android, but TikTok purposely blocked them and pretended it was the ban. They were able to do it because they know the exact location of every TikTok user whether you use a VPN or not.
Source:
> If not sold within a year, the law would make it illegal for web-hosting services to support TikTok, and it would force Google and Apple to remove TikTok from app stores — rendering the app unusable with time.
They don't know your exact location, but they would flag your account/device depending on your App Store localization and IP. I tested this, it doesn't work from outside of the US with a US IP, doesn't work outside of the US with the app downloaded on a phone set to US with a non-US IP, but instead requires a phone localized to download the app from outside the US, being outside the US, with an account that hasn't registered as in the US.
So no, it doesn't use your exact location, it just uses the censorship mechanisms that Apple and Google gracefully provide.
The cat is out of the bag. This is the landscape now, r1 was made in a post-o1 world. Now other models can distill r1 and so on.
I don’t buy the argument that distilling from o1 undermines deep seek’s claims around expense at all. Just as open AI used the tools ‘available to them’ to train their models (eg everyone else’ data), r1 is using today’s tools.
Does open AI really have a moral or ethical high ground here?
Plus, it suggests OpenAI never had much of a moat.
Even if they win the legal case, it means weights can be inferred and improved upon simply by using the output that is also your core value-add (i.e. the very output you need to sell to the world).
Their moat is about as strong as KFC's eleven herbs and spices. Maybe less...
Agree 100%, this was also bound to happen eventually, OpenAI could have just remained more "open" from the beginning and embraced the inevitable commoditization of these models. What did delaying this buy them?
It potentially cost the whole field in terms of innovation. For OpenAI specifically, they now need to scramble to come up with a differentiated business model that makes sense in the new landscape and can justify their valuation. OpenAI’s valuation is based on being the dominant AI company.
I think you misread my comment if you think my feelings are somehow hurt here.
> It potentially cost the whole field in terms of innovation
I don't see how, and you're not explaining it. If the models had been public this whole time, then... they would be protected against people publishing derivative models?
> I think you misread my comment if you think my feelings are somehow hurt here.
Not you, but most HNers got emotionally attached to their promise of openness, like they were owed some personal stake in the matter.
> I don't see how, and you're not explaining it. If the models had been public this whole time, then... they would be protected against people publishing derivative models?
Are you suggesting that if OpenAI published their models, they would still want to prevent derivative models? You take the "I wish OpenAI was actually open" and add your own restriction?
Or do you mean that them publishing their models and research openly would not have increased innovation? Because that's quite a claim, and you're the one who has to explain your thinking.
I am not in the field, but my understanding is that ever since the PaLM paper, research has mostly been kept from the public. OpenAI's money making has been a catalyst for that right? Would love some more insight.
I don’t think there is any ethical issue here, but I don’t think it’s good for the industry to remove all incentives for companies to spend lots of money solving hard, novel problems.
Why would anyone go through the effort of training the next groundbreaking model if they know they can just wait for someone else to do it and leverage that work?
> Why would anyone go through the effort of training the next groundbreaking model if they know they can just wait for someone else to do it and leverage that work?
Why would anyone write, work or research anything if they know it would be consumed by AI and sold on a $xx/month subscription?
Everyone is responding to the intellectual property issue, but isn't that the less interesting point?
If DeepSeek trained off OpenAI, then it wasn't trained from scratch for "pennies on the dollar" and isn't the Sputnik-like technical breakthrough that we've been hearing so much about. That's the news here. Or rather, the potential news, since we don't know if it's true yet.
Even if all that about training is true, the bigger cost is inference, and DeepSeek is 100x cheaper. That destroys OpenAI/Anthropic's value proposition of having a unique secret sauce, so users are quickly fleeing to cheaper alternatives.
Google Deepmind's recent Gemini 2.0 Flash Thinking is also priced at the new Deepseek level. It's pretty good (unlike previous Gemini models).
WTF dude, check your source (@deedydas). He seems to be posting garbage. The Gemini 2.0 Flash Thinking price isn't known yet. And on top of that, he gave the wrong number for R1's results on AIME 2024 (it's 79.8%, far ahead of Gemini rather than far behind).
More like OpenAI is currently charging more. Since R1 is open source / open weight we can actually run it on our own hardware and see what kinda compute it requires.
What is definitely true is that there are already other providers offering DeepSeek R1 (e.g. on OpenRouter[1]) for $7/m-in and $7/m-out, while OpenAI is charging $15/m-in and $60/m-out for o1. So already you're seeing at least 5x cheaper inference with R1 vs o1, with a bunch of confounding factors. But it is hard to say anything truly concrete about efficiency, because OpenAI does not disclose the actual compute required to run inference for o1.
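A quick sanity check of that ratio, using only the per-million-token prices quoted above and an assumed workload:

    # USD per million tokens, as cited in this thread (check current price pages before relying on them).
    o1_in, o1_out = 15.0, 60.0   # OpenAI o1
    r1_in, r1_out = 7.0, 7.0     # a third-party R1 host on OpenRouter

    # Assumed workload: 1M input tokens and 1M output tokens per day.
    o1_daily = o1_in + o1_out    # $75/day
    r1_daily = r1_in + r1_out    # $14/day
    print(f"o1: ${o1_daily:.0f}/day, hosted R1: ${r1_daily:.0f}/day, ~{o1_daily / r1_daily:.1f}x cheaper")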
There are even much cheaper services that host it for only slightly more than DeepSeek itself [1]. I'm now very certain that DeepSeek is not offering the API at a loss, so either OpenAI has absurd margins or their model is much more expensive to run.
[1] the cheapest I've found, which also happens to run in the EU, is https://studio.nebius.ai/ at $0.8/million input.
Edit: I just saw that openrouter also now has nebius
Yes, sorry, I was being maximally-broad in my comment but I would think it's very, very, very likely that OpenAI is currently charging huge margins and markups to help maintain the cachet / exclusivity / and, in some senses, safety of their service. Charging more money for access to their models feels like a pretty big part of their moat.
Also possibly b/c of their sweetheart deal with Azure they've never needed to negotiate enterprise pricing so they're probably calculating margins based on GPU list prices or something insane like that.
Well, in the first years of AI, no, it wasn't, because nobody was using it.
But at some point if you want to make money you have to provide a service to users, ideally hundreds of millions of users.
So you can think of training as CI+TEST_ENV and inference as the cost of running your PROD deployments.
Generally in traditional IT infra PROD >> CI+TEST_ENV (10-100 to 1)
The ratio might be quite different for LLM, but still any SUCCESSFUL model will have inference > training at some point in time.
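A back-of-envelope version of that claim, using the standard rough approximations of ~6 x params x tokens FLOPs for training and ~2 x params FLOPs per generated token for inference (illustrative numbers, dense-model math, no caching tricks):

    # When does cumulative inference compute overtake training compute?
    params = 70e9          # illustrative 70B-parameter dense model
    train_tokens = 15e12   # illustrative 15T-token training run

    train_flops = 6 * params * train_tokens   # ~6.3e24 FLOPs
    infer_flops_per_token = 2 * params        # ~1.4e11 FLOPs per generated token

    breakeven_tokens = train_flops / infer_flops_per_token
    print(f"Inference matches training after ~{breakeven_tokens:.1e} generated tokens")
    # ~4.5e13 tokens: a lot, but within reach of a service with hundreds of millions of users.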
>The ratio might be quite different for LLM, but still any SUCCESSFUL model will have inference > training at some point in time.
I think you're making assumptions here that don't necessarily have to be universally true for all successful models. Even without getting into particularly pathological cases, some models can be successful and profitable while only having a few customers. If you build a model that is very valuable to investment banks, to professional basketball teams, or some other much more limited group than consumers writ large, you might get paid handsomely for a limited amount of inference but still spend a lot on training.
If there is so much value for a small group, it is likely those are not simple inferences but the new, expensive kind with very long CoT chains and reasoning. So not cheap, and it is exactly this trend towards inference-time compute that makes inference > training from a total-resources-needed point of view.
That's not correct. First of all, training off of data generated by another AI is generally a bad idea because you'll end up with a strictly less accurate model (usually). But secondly, and more to your point, even if you were to use training data from another model, YOU STILL NEED TO DO ALL THE TRAINING.
Using data from another model won't save you any training time.
> training off of data generated by another AI is generally a bad idea
It's...not, and it's repeatedly been proven in practice that this is an invalid generalization because it is missing necessary qualifications, and it's funny that this myth keeps persisting.
It's probably a bad idea to use uncurated output from another AI to train a model if you are trying to make a better model rather than a distillation of the first model. And it's definitely a bad idea (this is, ISTR, the actual research result from which the false generalization developed) to iteratively fine-tune a model on its own unfiltered output. But there has been lots of success using AI models to generate data which is curated and then used to train other models, which can be much more efficient than trying to create new material without AI once you've already hoovered up all the readily accessible low-hanging fruit of premade content relevant to your training goal.
It is, of course, not going to produce a "child" model that more accurately predicts the underlying true distribution that the "parent" model was trying to. That is, it will not add anything new.
This is immediately obvious if you look at it through a statistical learning lens and not the mysticism crystal ball that many view NN’s through.
This is not obvious to me! For example, if you locked me in a room with no information inputs, over time I may still become more intelligent by your measures. Through play and reflection I can prune, reconcile and generate. I need compute to do this, but not necessarily more knowledge.
Again, this isn't how distillation works. Your task as the distilled model is to copy mistakes, and you will be penalized for pruning, reconciling, and generating.
"Play and reflection" is something else, which isn't distillation.
The initial claim was that distillation can never be used to create a model B that's smarter than model A, because B only has access to A's knowledge. The argument you're responding to was that play and reflection can result in improvements without any additional knowledge, so it is possible for distillation to work as a starting point to create a model B that is smarter than model A, with no new data except model A's outputs and then model B's outputs. This refutes the initial claim. It is not important for distillation alone to be enough, if it can be made to be enough with a few extra steps afterward.
You’ve subtly confused “less accurate” and “smarter” in your argument. In other words you’ve replaced the benchmark of representing the base data with the benchmark of reasoning score.
Then, you’ve asserted that was the original claim.
Sneaky! But that’s how “arguments” on HN are “won”.
No, I didn't confuse the two. There is not a formal definition of "smart", but if you're claiming that factual accuracy is unrelated to it, I can't imagine that that's in good faith.
LLMs are no longer trying to just reproduce the distribution of online text as a whole to push the state of the art, they are focused on a different distribution of “high quality” - whatever that means in your domain. So it is possible that this process matches a “better” distribution for some tasks by removing erroneous information or sampling “better” outputs more frequently.
While that is theoretically true, it misses everything interesting (kind of like the No Free Lunch Theorem, or the VC dimension for neural nets). The key is that the parent model may have been trained on a dubious objective like predicting the next word of randomly sampled internet text - not because this is the objective we want, but because this is the only way to get a trillion training points.
Given this, there's no reason why it could not be trivial to produce a child model from (filtered) parent output that exceeds the parent model on a different, more meaningful objective, like being a useful chatbot. There's no reason why this would have to be limited to domains with verifiable answers, either.
The latest models create information from base models by randomly creating candidate responses then pruning the bad ones using an evaluation function. The good responses improve the model.
It is not distillation. It's like how you can arrive at new knowledge by reflecting on existing knowledge.
Fine-tuning an LLM on the output of another LLM is exactly how DeepSeek made its progress. The way they got around the problem you describe is by doing this in a domain that can be relatively easily checked for correctness, so suggested training data for fine-tuning could be automatically filtered out if it was wrong.
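A minimal sketch of that "generate, verify, keep" filter, with a placeholder generate_solution() standing in for whichever model produced the candidate text and a rule-based check against known ground truth:

    import json
    import random
    import re

    problems = [("What is 17 * 24?", "408"), ("What is 1001 - 37?", "964")]

    def generate_solution(question):
        # Placeholder for a model call; returns text ending in "Answer: <value>".
        return f"Working through it step by step... Answer: {random.choice(['408', '964', '100'])}"

    finetune_set = []
    for question, truth in problems:
        candidate = generate_solution(question)
        m = re.search(r"Answer:\s*(\S+)", candidate)
        if m and m.group(1) == truth:            # keep only verifiably correct samples
            finetune_set.append({"prompt": question, "completion": candidate})

    print(json.dumps(finetune_set, indent=2))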
> It is, of course not going to produce a “child” model that more accurately predicts the underlying true distribution that the “parent” model was trying to. That is, it will not add anything new.
Unfiltered? Sure. With human curation of the generated data it certainly can. (Even automated curation can do this, though its more obvious that human curation can.)
I mean, I can randomly generate fact claims about addition, and if I curate which ones go into a training set, train a model that reflects addition of integers much more accurately than the random process which generated the pre-curation input data.
Without curation, as I already said, the best you get is a distillation of the source model, which is highly improbable to be more accurate.
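A toy version of that addition example: the raw generating process is almost always wrong, but the curated subset is correct by construction, so a model trained on it would see far better data than the process that produced the candidates.

    import random

    raw_claims, curated = [], []
    for _ in range(10_000):
        a, b = random.randint(0, 99), random.randint(0, 99)
        claimed_sum = random.randint(0, 198)      # random "fact claims about addition"
        raw_claims.append((a, b, claimed_sum))
        if claimed_sum == a + b:                  # curation step: keep only true claims
            curated.append((a, b, claimed_sum))

    raw_acc = sum(c == a + b for a, b, c in raw_claims) / len(raw_claims)
    print(f"raw accuracy: {raw_acc:.1%}, curated accuracy: 100.0%, curated size: {len(curated)}")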
No no no you don’t understand, the models will magically overcome issues and somehow become 100x and do real AGI! Any day now! It’ll work because LLM’s are basically magic!
Also, can I have some money to build more data centres pls?
I think you're missing the point being made here, IMHO: using an advanced model to build high quality training data (whatever that means for a given training paradigm) absolutely would increase the efficiency of the process. Remember that they're not fighting over sounding human, they're fighting over deliberative reasoning capabilities, something that's relatively rare in online discourse.
Re: "generally a bad idea", I'd just highlight "generally" ;) Clearly it worked in this case!
It's trivial to build synthetic reasoning datasets, likely even in natural languages. This is a well established technique that works (e.g. see Microsoft Phi, among others).
I said generally because there are things like adversarial training that use a ruleset to help generate correct datasets, and those work well. Outside of techniques like that, it's not just a rule of thumb; it's always true that training on the output of another model will result in a worse model.
> it's always true that training on the output of another model will result in a worse model.
Not convincing.
You can imagine a model doing some primitive thinking and coming to a conclusion. Then you can train another model on the summaries. If everything goes well, it will come to conclusions quicker, at a minimum. Or it may be able to solve more complex problems with the same amount of "thinking". It would be self-propelled evolution.
Another option is to use one model to produce the "thinking" part from known outputs, then train another to reproduce that thinking to reach the right output, which is unknown to it initially. Using humans to create such a dataset would be slow and very expensive.
PS: if it were impossible, humans would still be living in the trees.
Humans don't improve by "thinking." They improve by natural selection against a fitness function. If that fitness function is "doing better at math," then over a long time perhaps humans will get better at math.
These models don't evolve like that; there is no random process of architectural evolution. Nor is there a fitness function anything like "get better at math."
A system like AlphaZero works because it has rules to use as an oracle: the game rules. The game rules provide the new training information needed to drive the process. Each game played produces new correct training data.
These LLMs have no such oracle. Their fitness function is and remains: predict the next word, followed by: produce text that makes a human happy. Note that it's not "produce text that makes ChatGPT happy."
It's more complicated than this. What you get is defined by what you put in. At first it was random or selected internet garbage plus books and docs, i.e. not designed for training. Then came tuning. Now we can use a trained model to generate data that is designed for training, with specific qualities, in this case reasoning, and train the next model on it. Intuitively, that model can be smaller and better at what we trained it for. I showed two options for how the data can be generated; there are others, of course.
As for humans, assuming they have genetically the same intellectual abilities, you can see the difference in the development of different groups. It's mostly determined by how well the next generation is trained. Schools exist exactly for this.
> training off of data generated by another AI is generally a bad idea
Ah. So if I understand this... once the internet becomes completely overrun with AI-generated articles of no particular substance or importance, we should not bulk-scrape that internet again to train the subsequent generation of models.
That's already happened. It's well established now that the internet is tainted: since ChatGPT's public release, a not-insignificant amount of internet content has not been written by humans.
I think the point is that if R1 isn't possible without access to OpenAI (at low, subsidized costs) then this isn't really a breakthrough as much as a hack to clone an existing model.
R1 is--as far as we know from good ol' ClosedAI--far more efficient. Even if it were a "clone", A) that would be a terribly impressive achievement on its own that Anthropic and Google would be mighty jealous of, and B) it's at the very least a distillation of O1's reasoning capabilities into a more svelte form.
The training techniques are a breakthrough no matter what data is used. It's not up for debate, it's an empirical question with a concrete answer. They can and did train orders of magnitude faster.
Not arguing with your point about training efficiency, but the degree to which R1 is a technical breakthrough changes if they were calling an outside API to get the answers, no?
It seems like the difference between someone doing a better writeup of (say) Wiles's proof vs. proving Fermat's Last Theorem independently.
Just like humans have been genetically stable for a long time, the quality & structure of information available to a child today vs that of 2000 years ago makes them more skilled at certain tasks. Math being a good example.
> First of all, training off of data generated by another AI is generally a bad idea because you'll end up with a strictly less accurate model (usually).
That is not true at all.
We have known how to solve this for at least 2 years now.
All the latest state of the art models depend heavily on training on synthetic data.
Have people on HN never heard of public ChatGPT conversations data sets? They've been mentioned multiple times in past HN conversations and I thought it'd be common knowledge here by now. Pretty much all open source models have been training on them for the past 2 years, it's common practice by now. And haven't people been having conversations about "synthetic data" for a pretty long time by now? Why is all of this suddenly an issue in the context of DeepSeek? Nobody made a fuss about this before.
And just because a model trains on some ChatGPT data, doesn't mean that that data is the majority. It's just another dataset.
But it does mean moat is even less defensible for companies whose fortunes are tied to their foundation models having some performance edge, and a shift in the kinds of hardware used for inference (smaller, closer to the edge.)
That may be true. But an even more interesting point may be that you don’t have to train a huge model ever again? Or at least not to train a new slightly improved model because now we have open weights of an excellent large model and a way to train smaller ones.
In the last few days? No, that would be impossible; no one has the resources to train a base model that quickly. But there are definitely a lot of people working on it.
If Deepseek trained off OpenAI, then it wasn't trained from scratch for "pennies on the dollar"
If OpenAI trained on the intellectual property of others, maybe it wasn't the creativity breakthrough people claim?
Oppositely
If you say ChatGPT was trained on "whatever data was available", and you say Deepseek was trained "whatever data was available", then they sound pretty equivalent.
All the rough consensus language output of humanity is now roughly on the Internet. The various LLMs have roughly distilled that, and the results are naturally going to get tighter and tighter. It's not surprising that companies are going to get better and better at solving the same problem. The significance of DeepSeek isn't so much that it promises future achievements, but that it shows OpenAI's string of announcements to be incremental progress that isn't going to reach the AGI that Altman now often harps on.
I'm not an OpenAI apologist and don't like what they've done with other people's intellectual property, but I think that's kind of a false equivalency. OpenAI's GPT-3.5/4 was a big leap forward in the technology in terms of functionality. DeepSeek-R1 isn't really a huge step forward in output; it's mostly comparable to existing models. One thing that is really cool about it is that it could be trained from scratch quickly and cheaply, and that is completely undercut if it was trained off of OpenAI's data. I don't care about adjudicating which one is a bigger thief, but it's notable if one of the biggest breakthroughs about DeepSeek-R1 is pretty much a lie. And it's still really cool that it's open source and can be run locally; it'll have that over OpenAI whether or not the training claims are a lie/misleading.
How is it a “lie” for DeepSeek to train their data from ChatGPT but not if they train their data from all of Twitter and Reddit? Either way the training is 100x cheaper.
Funny how the first principles people now want to claim the opposite of what they’ve been crowing about for decades since techbros climbed their way out of their billion dollar one hit wonders. Boo fucking hoo.
OpenAI's models were trained on ebooks from a private ebook torrent tracker leeched en-mass during a free leech event by people who hated private torrent trackers and wanted to destroy their "economy."
The books were all in epub format, converted, cleaned to plain text, and hosted on a public data hoarder site.
NYT claims that OpenAI trained on their material. They argue for copyright violation, although I think another argument might be breach of TOS in scraping the material from their website or archive.
The complaint filing has some references to some of the other training material used by OpenAI, but I didn't dig deeply in to what all of it was:
There is a lot of discussion here about IP theft. Honest question, from deepseek's point of view as a company under a different set of laws than US/Western -- was there IP theft?
A company like OpenAI can put whatever licensing they want in place. But that only matters if they can enforce it. The question is, can they enforce it against deepseek? Did deepseek do something illegal under the laws of their originating country?
I've had some limited exposure to media related licensing when releasing content in China and what is allowed is very different than what is permitted in the US.
The interesting part, which points to innovation moving outside of the US, is that US companies are beholden to strict IP laws while many places in the world don't have such restrictions and will be able to utilize more data more easily.
The most interesting part is that China has been ahead of the US in AI for many years, just not in LLMs.
You need to visit mainland China and see how AI applications are everywhere, from transport to goods shipping.
I'm not surprised at all. I hope this in the end makes the US kill its strict IP laws, which is the problem.
If the US doesn't, China will always have a huge edge on it, no matter how much NVidia hardware the US has.
And you know what, Huawei is already making inference hardware... it won't take them long to finally copy the TSMC tech and flip the situation upside down.
When China can make the equivalent of H100s, it will be hilarious because they will sell for $10 in Aliexpress :-)
You don’t even need to visit china, just read the latest research papers and look at the authors. China has more researchers in AI than the West and that’s a proven way to build an advantage.
It is also funny in a different way. Many people don't realise that they live in some sort of bubble. Many people in "The West" think that they are still the center of the world in everything, while this might not be so correct anymore.
In the U.S. there is 350 million people and EU has 520 million people (excluding Russia and Turkey).
China alone has 1.4 billion people.
Since there is a language barrier and China isolates themselves pretty well from the internet, we forget that there is a huge society with high focus on science. And most of our tech products are coming from there.
My understanding is that not having H100s is irrelevant because most Chinese companies can partner or just own companies in say Australia that can load up on H100s in their data centers in Australia and "rent them out" or offer a service to the Chinese parent company.
Maybe not $10 unless they are loss-leading to dominance. Well they actually could very well do exactly that... Hm, yea, good points. I would expect at least an order or two of magnitude higher to prevent an inferno.
Lets be fair though. Replicating TSMC isn't something that could happen quickly. Then again, who knows how far along they already are...
All the top level comments are basking in the irony of it, which is fair enough. But I think this changes the Deepseek narrative a bit. If they just benefited from repurposing OpenAI data, that's different than having achieved an engineering breakthrough, which may suggest OpenAI's results were hard earned after all.
I understand they just used the API to talk to the OpenAI models. That... seems pretty innocent? Probably they even paid for it? OpenAI is selling API access, someone decided to buy it. Good for OpenAI!
I understand ToS violations can lead to a ban. OpenAI is free to ban DeepSeek from using their APIs.
Sure, but I'm not interested in innocence. They can be as innocent or guilty as they want. But it means they didn't, via engineering wherewithal, reproduce the OpenAI capabilities from scratch. And originally that was supposed to be one of the stunning and impressive (if true) implications of the whole Deepseek news cycle.
Nothing is ever done "from scratch". To create a sandwich, you first have to create the universe.
Yes, there is the question how much ChatGPT data DeepSeek has ingested. Certainly not zero! But if DeepSeek has achieved iterative self-improvement, that'd be huge too!
"From scratch" has a specific definition here though - it means 'from the same or broadly the same corpus of data that OpenAI started with'. The implication was that DeepSeek had created something broadly equivalent to ChatGPT on their own and for much less cost; deriving it from an existing model is a different claim. It's a little like claiming you invented a car when actually you took an existing car and tuned and remodelled it - the end result may be impressive and useful and better than the original, but it's not really a new invention.
No and that's not the point I'm making; cloning the technology is not the same as cloning the data. The claim was that they trained DeepSeek for a fraction of the cost that OpenAI spent training ChatGPT, but if one was trained off the web and the other was trained off the trained data of the first, then it's not a fair comparison.
It is not as if they are not open about how they did it. People are actually working on reproducing their results as they describe in the papers. Somebody has already reproduced the r1-zero rl training process on a smaller model (linked in some comment here).
Even if o1 specifically was used (which is in itself doubtful), it does not mean that this was the main reason that r1 succeeded/it could not have happened without it. The o1 outputs hides the CoT part, which is the most important here. Also we are in 2025, scratch does not exist anymore. Creating better technology building upon previous (widely available) technology has never been a controversial issue.
who cares. even if the claim is true, does that make the open source model less attractive?
in fact, it implies that there is no moat in this game. openai can no longer maintain its stupid valuation, as other companies can just scrape its output and build better models at much lower costs.
everything points to the exact same end result - DeepSeek democratized AI, OpenAI's old business model is dead.
>even if the claim is true, does that make the open source model less attractive?
Yes! Because whether they reproduced those capabilities independently or copied them by relying on downstream data has everything to do with whether they're actually state of the art.
Additionally, I was under the impression that all those Chinese models were being trained using data from OpenAI and Anthropic. Were there not some reports that Qwen models referred to themselves as Claude?
DDOSing web sites and grabbing content without anyone's consent is not hard earned at all. They did spend billions on their thing, but nothing was earned, as they could never do that legally.
I understand the temptation to go there, but I think it misses the point. I have no qualms at all with the idea that the sum total of intelligence distributed across the internet was siphoned away from creators and piped through an engine that now cynically seeks to replace them. Believe me, I will grab my pitchfork and march side by side with you.
But let's keep the eye on the ball for a second. None of that changes the fact that what was built was a capability to reflect that knowledge in dynamic and deep ways in conversation, as well as image and audio recognition.
And did Deepseek also build that? From scratch? Because they might not have.
Look at it this way. Even OpenAI uses their own models' output to train subsequent models. They do pay for a lot of manual annotations but also use a lot of machine generated data because it is cheaper and good enough, especially from the bigger models.
So say DS had simply published a paper outlining the RL technique they used, and one of Meta, Google or even OpenAI themselves had used it to train a new model, don't you think they'd have shouted off the rooftops about a new breakthrough? The fact that the provenance of the data is from a rival's model does not negate the value of the research IMHO.
> If they just benefited from repurposing OpenAI data, that's different than having achieved an engineering breakthrough
One way or another, they were able to create something that has WAY cheaper inference costs than o1 at the same level of intelligence. I was paying Anthropic $15/1M tokens to make myself 10x faster at writing software, which was coming out to $10/day. O1 is $60/1M tokens, which for my level of usage would mean that it costs as much as a whole junior software engineer. DeepSeek is able to do it for $2.50/1M tokens.
Either OpenAI was taking a profit margin that would make the US Healthcare industry weep, or DeepSeek made an engineering breakthrough that increases inference efficiency by orders of magnitude.
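A quick back-of-the-envelope check of those numbers, using the per-token prices quoted above and the daily token volume they imply:

    # Prices are $ per 1M tokens, as quoted in the comment above.
    daily_spend_anthropic = 10.0        # $/day, as stated
    price_anthropic = 15.0
    tokens_per_day_millions = daily_spend_anthropic / price_anthropic  # ~0.67M tokens/day

    price_o1 = 60.0
    price_deepseek = 2.50

    print(f"o1:       ${tokens_per_day_millions * price_o1:.2f}/day")        # ~ $40/day
    print(f"DeepSeek: ${tokens_per_day_millions * price_deepseek:.2f}/day")  # ~ $1.67/day

At the same usage level, o1 works out to roughly $40/day (on the order of $15k/year), and DeepSeek to under $2/day.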
>That doesn't mean the deep seek technical achievements are less valid.
Well, that's literally exactly what it would mean. If DeepSeek relied on OpenAI’s API, their main achievement is in efficiency and cost reduction as opposed to fundamental AI breakthroughs.
Agreed. They accomplished a lot with distillation and optimization - but there's little reason to believe you don't also need foundational models to keep advancing. Otherwise won't they run into issues training on more synthetic data?
In a way this is something most companies have been doing with their smaller models, DeepSeek just supposedly* did it better.
The issue isn’t just that AI trained on AI is inevitable it's whose AI is being used as the base layer. Right now, OpenAI’s models are at the top of that hierarchy. If Deepseek depended on them, it means OpenAI is still the upstream bottleneck, not easily replaced.
The deeper question is whether Deepseek has achieved real autonomy or if it’s just a derivative work. If the latter, then OpenAI still holds the keys to future advances. If Deepseek truly found a way to be independent while achieving similar performance, then OpenAI has a problem.
The details of how they trained matter more than the inevitability of synthetic data down the line.
> whether Deepseek has achieved real autonomy or if it’s just a derivative work
This question is malformed, imo. Every lab is doing derivative work. OpenAI didn’t invent transformers, Google did. Google didn’t invent neural networks or back propagation.
If you mean whether OAI could have prevented DS from succeeding by cutting off their API access, probably not. Maybe they used OAI for supervised fine tuning in certain domains, like creative writing, which are difficult to formally verify (although they claim to have used one of their own models). Or perhaps during human preference tuning at the end. But either way, there are many roads to Rome, and OAI wasn’t the only game in town.
If LLMs were already pure commodities, OpenAI wouldn't be able to charge a premium, and DeepSeek wouldn’t have needed to distill their model from OpenAI in the first place. The fact that they did proves there’s still a moat—just maybe not as wide as OpenAI hoped.
IMO the important “narrative” is the one looking forward, not backwards. OpenAI’s valuation depends on LLMs being prohibitively difficult to train and run. Deepseek challenges that.
Also, if you read their papers it’s quite clear there are several important engineering achievements which enabled this. For example multi head latent attention.
Yeah what happens when we remove all financial incentive to fund groundbreaking science?
It’s the same problem with pharmaceuticals and generics. It’s great when the price of drugs is low, but without perverse financial incentives no company is going to burn billions of dollars in a risky search for new medicines.
In this case, these cures (LLMs) are medicines in search of a disease to cure. I get AI shoved everywhere, when I just want it to aid my coding. Literally, that's it. They're also good at summarizing emails and similar things, but I know nobody who does that. I wouldn't trust an AI to read my emails and possibly hallucinate their contents.
Then we just have to fund research by giving grants to universities and research teams. Oh wait a sec: That's already what pretty much every government in the world is doing anyway!
Of course. How else would Americans justify their superiority (and therefore valuations) if a load of foreigners for Christ's sake could just out innovate them?
p.s. yes, that goes both ways - that is, if people are slamming a different country from an opposite direction, we say the same thing (provided we see the post in the first place)
This reminds me of the railroads: once railroads were invented, there was a huge investment boom of everyone trying to make money off the railroads, but competition brought costs down to the point where it generally wasn't the railroads that made the money and got the benefit, but consumers and regular businesses, and competition caused many railroads to fail.
AI is probably similar: Moore's law and other advancements will eventually allow people to run open models locally and bring down the cost of operation. Competition will make it hard for all but one or two players to survive, and most investments in AI by large companies like Nvidia, OpenAI, and DeepSeek will fail to generate substantial wealth, though they may earn some sort of return, or maybe not.
The railroads drama ended when JP Morgan (the person, not yet the entity) brought all the railroad bosses together, said "you all answer to me because I represent your investors / shareholders", and forced a wave of consolidation and syndicates because competition was bad for business.
Then all the farmers in the midwest went broke not because they couldn't get their goods to market, but because JP Morgan's consolidated syndicates ate all their margin hauling their goods to market.
Consolidation and monopoly over your competition is always the end goal.
'Can't allow companies that do censorship aligned with foreign nations'
'This company violated our laws and used an American company's tech for their training unfairly'
And the government choosing winners.
'The government in announcing 500 billion going to these chosen winners, anyone else take the hint, give up, you won't get government contracts but will get pressure'.
Good thing nobody is making these sorts of arguments today.
Surely that will end in fragmentation along national lines if monopolies are defined by governments.
Sure US economic power has a long reach right now because of the importance of the dollar etc - but the more it uses that to bully, the more countries are making sure they are independent.
You figure that out and the VC's will be shovelling money into your face.
I suspect the "it ain't training costs/hardware" bit is a bit exaggerated since it ignores all the prior work that DeepSeek was built on top of.
But, if all else fails, there's always the tried-and-true approaches: regulatory capture, industry entrenchment, use your VC bucks to be the last one who can wait out the costs the incumbents do face before they fold, etc.
> I suspect the "it ain't training costs/hardware" bit is a bit exaggerated since it ignores all the prior work that DeepSeek was built on top of.
How does it ignore it? The success of Deepseek proves that training costs/hardware are definitely NOT a barrier to entry that protects OpenAI from competition. If anyone can train their model with ChatGPT for a fraction of the cost it took to train ChatGPT and get similar results, then how is that a barrier?
Can anyone do that though? You need the tokens and the pipelines to feed them to the matmul mincers. Quoting only dollar equivalent of GPU time is disingenuous at best.
That’s not to say they lie about everything, obviously the thing works amazingly well. The cost is understated by 10x or more, which is still not bad at all I guess? But not mind blowing.
Even if that's 10x, that's easy to counter. $50M can be invested by almost anyone. There are thousands of entities (incl. governments, even regional ones) who could easily bring such capital.
So I'm not an expert in this but even with DeepSeek supposedly reducing training costs isn't the estimate still in the millions (and that's presumably not counting a lot of costs)? And that wouldn't be counting a bunch of other barriers for actually building the business since training a model is only one part, the barrier to entry still seems very high.
Also barriers to entry aren't the only way to get a consolidated market anyway.
About your first point, IMO the usefulness of AI will remain relatively limited as long as we don’t have continuously learning AI. And once we have that, the disparity between training and inference may effectively disappear. Whether that means that such AI will become more accessible/affordable or less is a different question.
I just read _The Great River_ by Boyce Upholt, a history of the Mississippi river and human management thereof. It was funny how the railroads were used as a bogeyman to justify continued building of locks, dams, and other control structures on the Mississippi and its tributaries, long after shipping commodities down river had been supplanted by the railroads.
This moment was also historically significant because it demonstrated how financial power (Morgan) could control industrial power (the railroads). A pattern that some say became increasingly important in American capitalism.
For the curious, it was vertical integration in the railroad-oil/-coal industry which is where the money was made.
The problem for AI is the hardware is commodified and offers no natural monopoly, so there isn't really anything obvious to vertically integrate-towards-monopoly.
Aren’t we approaching a scenario where the software is commodified (or at least "good enough" software) and the hardware isn't (NVIDIA GPUs have defined advantages)?
I think the lesson of DeepSeek is 'no' -- that by software innovation (i.e., dropping below CUDA to program the GPU directly, working at 8-bit, etc.) you can trivialise the hardware requirement.
However I think the reality is that there's only so much coal to be mined, as far as LLM training goes. When we're at "very diminishing returns", SoC/Apple/TSMC-CPU innovations will deliver cheap inference. We only really need an M4 Ultra with 1TB RAM to hollow out the hardware-inference-supplier market.
Very easy to imagine a future where Apple releases a "Apple Intelligence Mac Studio" with the specs for many businesses to run arbitrary models.
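For readers wondering what "working at 8-bit" buys, here is a generic sketch of reduced-precision weight storage (absmax int8 quantization). DeepSeek's own training reportedly used FP8 mixed precision rather than int8, so this shows only the general idea, not their method:

    import numpy as np

    def quantize_int8(w):
        # Absmax quantization: store int8 weights plus one float scale,
        # cutting memory and bandwidth roughly 4x versus float32.
        scale = np.abs(w).max() / 127.0
        return np.round(w / scale).astype(np.int8), scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(4, 4).astype(np.float32)
    q, s = quantize_int8(w)
    print("max abs error:", np.abs(w - dequantize(q, s)).max())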
I really hope that apple realizes soon there is a market for Mac Pro/Mac Studio with a RAM in the TBs for AI Workloads under $10k and a bunch of GPU cores.
>> Compute is literally being sold as a commodity today, software is not.
The marginal cost of software is zero. You need some kind of perceived advantage to get people to pay for it. This isn't hard, as most people will pay a bit for big-name vs "free". That could change as more open source apps become popular by being awesome.
Marginal cost has nothing to do with it - you can buy and sell compute like you could corn and beef at scale. You can't buy and sell software like that. In fact I'm surprised we don't have futures markets for things like compute and object storage.
I think a better analogy than railroads (which own the land that the track sits on and often valuable land around the station) is airlines, which don’t own land. I recall a relevant Warren Buffett letter that warned about investing hundreds of millions of dollars into capital with no moat:
> Similarly, business growth, per se, tells us little about value. It's true that growth often has a positive impact on value, sometimes one of spectacular proportions. But such an effect is far from certain. For example, investors have regularly poured money into the domestic airline business to finance profitless (or worse) growth. For these investors, it would have been far better if Orville had failed to get off the ground at Kitty Hawk: The more the industry has grown, the worse the disaster for owners.
I think that's a very possible outcome. A lot of people investing in AI are thinking there's a google moment coming where one monopoly will reign supreme. Google has strong network effects around user data AND economies of scale. Right now, AI is 1-player with much weaker network effects. The user data moat goes away once the model trains itself effectively and the economies of scale advantage goes away with smart small models that can be efficiently hosted by mortals/hobbyists. The DeepSeek result points to both of those happening in the near future. Interesting times.
> where the Moore’s law and advancement will eventually allow people to run open models locally
Probably won't be Moore's law (which is kind of slowing down) so much as architectural improvements (both on the compute side and the model side - you could say that R1 represents an architectural improvement of efficiency on the model side).
OpenAI is going after a company that open sourced their model, by distilling from their non-open AI?
OpenAI talks a lot about the principles of being Open, while still keeping their models closed and not fostering the open source community or sharing their research. Now when a company distills their models using perfectly allowed methods on the public internet, OpenAI wants to shut them down too?
I was thinking about this the other day but I highly doubt they would rebrand name. They’re borderline a household name now - at least ChatGPT is. OpenAI is the face of AI - at least to people who don’t follow the industry
They don't care; T&Cs and copyright are void unless it affects them, and others can go kick rocks. Not surprising that they and OpenAI will end up in a legal battle over this.
I'm not being sarcastic, but we may soon have to torrent DeepSeek's model. OpenAI has a lot of clout in the US and could get DeepSeek banned in western countries for copyright.
> US and could get DeepSeek banned in western countries for copyright
If US is going to proceed with trade war on EU, as it was planning anyway, then DeepSeek will be banned only in US. Seems like term "western countries" is slowly eroding.
Great point. Plus, the revival of serious talk of the Monroe Doctrine (!!!) in the U.S. government lends a possibly completely-new meaning to "western countries" -- i.e. the Americas...
Trump has already managed to completely destroy the US reputation within basically the entire continent¹. And he seems intent on creating a commercial war against all the countries here too.
1 - Do not capture and torture random people on the street if you want to maintain some goodwill. Even if you have reasons to capture them.
Yeah... I don't think goodwill was ever a very central part of the Monroe doctrine. Its imperial expansionism, plain n' simple. Embargo + pressure who you can, depose any governments that resist, threaten the rest into silent compliance.
It wouldn't be foolish. The US has an active cult of personality, and whatever the leader says, half the country believes it unquestioningly. If OpenAI is said to be protecting America and DeepSeek is doing terrible, terrible things to the children (many smart people are saying it), there'll be an overnight pivot to half the country screaming for it to be banned and harassing anyone who says otherwise.
Who cares if some people think you look foolish when you have a locked down 500 billion dollar investment guarantee?
Hey, OpenAI, so, you know that legal theory that is the entire basis of your argument that any of your products are legal? "Training AI on proprietary data is a use that doesn't require permission from the owner of the data"?
You might want to consider how it applies to this situation.
2. Lived through a similar scenario in 2010 or so.
Early in my professional career I worked for a media company that was scraping other sites (think Craigslist, but for our local market) to republish the content on our competing website. I wasn't working on that specific project, but I did work on an integration in my team's project where the scraping team could post jobs on our platform directly. When others started scraping "our content", there were a couple of urgent all-hands-on-deck meetings scheduled, with a high level of disbelief.
Exactly, they actually opened up the model and research, which the "Open" company didn't, and merely adjusted some of their pricing tiers to try to combat commercially (but not without mumbling something like "yeah, we totally had these ideas too"). Now every single Meta, OpenAI etc engineer is trying to copy DeepSeek's innovations, and their first act is to... complain about copyright infringement, of all things?! What an absolute clown party, how can these people take themselves seriously, do they just have zero comprehension of what hypocrisy is or what's going on here...
I can scarcely process all the levels of irony involved, the irony-o-meter is pegged and I can't get the good one from the safe because I'm incapacitated from laughter.
Altman was in a bit of a tricky position in that he figured OpenAI would need a lot of money for compute to be able to compete but it was hard to get that while remaining open. DeepSeek benefit from being funded from their own hedge fund. I wonder if part of their strategy is crack AI and then have it trade the markets?
The last (only?) language model OpenAI released openly was GPT-2, and even for that the instruction weighted model was never released. This was in 2019. The large Microsoft deal was done in 2023.
If it's true, how is it problematic? It seems aligned with their mission:
> We will attempt to directly build safe and beneficial AGI, but will also consider our mission fulfilled if our work aids others to achieve this outcome.
> We will actively cooperate with other research and policy institutions; we seek to create a global community working together to address AGI’s global challenges.
While I'm as amused as everyone else, I think it's technically accurate to point out that the "we trained it for $6 million" narrative is contingent on the investment already made by others.
OpenAI has been in a war-room for days searching for a match in the data, and they just came out with this without providing proof.
My cynical opinion is that the training corpus has some small amount of data generated by OpenAI, which is probably impossible to avoid at this point, and they are hanging on that thread for dear life.
Oh, absolutely. I'm not defending OpenAI, I just care about accurate reporting. Even on HN - even in this thread - you see people who came away with the conclusion that DeepSeek did something while "cutting cost by 27x".
But that's a bit like saying that by painting a bare wall green you have demonstrated that you can build green walls 27x cheaper, ignoring the cost of building the wall in the first place.
Smarter reporting and discourse would explain how this iterative process actually works and who is building on who and how, not frame it as two competing from-scratch clean room efforts. It'd help clear up expectations of what's coming next.
It's a bit similar to how many are saying DeepSeek have demonstrated independence from nVidia, when part of the clever thing they did was figure out how to make the intentionally gimped H800s work for their training runs by doing low-level optimizations that are more nVidia-specific, etc.
Rarely have I seen a highly technical topic produce more uninformed snap takes than this week.
You are underselling or not understanding the breakthrough. They trained a 600B-parameter model on 15T tokens for under $6M. Regardless of the provenance of the tokens, this in itself is impressive.
Not to mention post-training. Their novel GRPO technique used for preference optimization / alignment is also much more efficient than PPO.
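For the curious, a minimal sketch of the group-relative advantage idea at the heart of GRPO: score several sampled completions per prompt and normalize each reward against its own group, instead of training a separate value network as PPO does. The full method also has a clipped policy-ratio objective and a KL penalty, omitted here:

    import statistics

    def grpo_advantages(group_rewards):
        # One group = several completions sampled for the same prompt.
        mean = statistics.mean(group_rewards)
        std = statistics.pstdev(group_rewards) or 1.0   # guard against a zero spread
        return [(r - mean) / std for r in group_rewards]

    # e.g. 4 sampled answers to one math prompt, rewarded 1 if correct else 0
    print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))   # [1.0, -1.0, -1.0, 1.0]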
Let's call it underselling. :-) Mostly because I'm not sure anyone's independently done the math and we just have a single statement from the CEO. I do appreciate the algorithmic improvements, and the excellent attention-to-performance-in-detail stuff in their implementation (careful treatment of precision, etc.), making the H800s useful, etc. I agree there's a lot there.
> that's a bit like saying that by painting a bare wall green you have demonstrated that you can build green walls 27x cheaper, ignoring the cost of building the wall in the first place
That's a funny analogy, but in reality DeepSeek did reinforcement learning to generate chain of thought, which was used in the end to finetune LLMs. The RL model was called DeepSeek-R1-Zero, while the SFT model is DeepSeek-R1.
They might have boostrapped the Zero model with some demonstrations.
> DeepSeek-R1-Zero struggles with challenges like poor readability, and language mixing. To make reasoning processes more readable and share them with the open community, we explore DeepSeek-R1, a method that utilizes RL with human-friendly cold-start data.
> Unlike DeepSeek-R1-Zero, to prevent the early unstable cold start phase of RL training from the base model, for DeepSeek-R1 we construct and collect a small amount of long CoT data to fine-tune the model as the initial RL actor. To collect such data, we have explored several approaches: using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1Zero outputs in a readable format, and refining the results through post-processing by human annotators.
I don't agree. Walls are physical items so your example is true, but models are data. Anyone can train off of these models, that's the current environment we exist in. Just like OpenAI trained on data that has since been locked up in a lot of cases. In 2025 training models like Deepseek is indeed 27x cheaper, that includes both their innovations and the existence of new "raw material" to do such a thing.
What I'm saying is that in the media it's being portrayed as if DeepSeek did the same thing OpenAI did 27x cheaper, and the outsized market reaction is in large parts a response to that narrative. While the reality is more that being a fast-follower is cheaper (and the concrete reason is e.g. being able to source training data from prior LLMs synthetically, among other things), which shouldn't have surprised anyone and is just how technology in general trends.
The achievement of DeepSeek is putting together a competent team that excels at end-to-end implementation, which is no small feat and is promising wrt/ their future efforts.
The opposite is claiming that OpenAI could by now have built a better-performing, cheaper-to-run model (compared to what they published) by training it at 1% of the cost on the output of their previous models... but they chose not to do it.
Is OpenAI claiming copyright ownership over the generated synthetic data?
That would be a dangerous precedent to establish.
If it's a terms of service violation, I guess they're within their rights to terminate service, but what other recourse do they have?
Other than that, perhaps this is just rhetoric aimed at introducing restrictions in the US, to prevent access to foreign AI, to establish a national monopoly?
> “It is (relatively) easy to copy something that you know works,” Altman tweeted. “It is extremely hard to do something new, risky, and difficult when you don’t know if it will work.”
The humor/hypocrisy of the situation aside, it does seem to be true that OpenAI is consistently the one coming up with new ideas first (GPT 4, o1, 4o-style multimodality, voice chat, DALL-E, …) and then other companies reproduce their work, and get more credit because they actually publish the research.
Unfortunately for them it’s challenging to profit in the long term from being first in this space and the time it takes for each new idea to be reproduced is getting shorter.
Reminds me of the Bill Gates quote when Steve Jobs accused him of stealing the ideas of Windows from Mac:
Well, Steve... I think it’s more like we both had this rich neighbor named Xerox and I broke into his house to steal the TV set and found out that you had already stolen it.
Xerox could be seen as Google, whose researchers produced the landmark Attention Is All You Need paper, and the general public, who provided all of the training data to make these models possible.
> OpenAI is consistently the one coming up with new ideas first (GPT 4, o1, 4o-style multimodality, voice chat, DALL-E, …)
As far as I can tell o1 was based on Q-star, which could likely be Quiet-STaR, a CoT RL technique developed at Stanford that OpenAI may have learned about before it got published. Presumably that's why they never used the Q-Star name even though it had garnered mystique and would have been good for building hype. This is just speculation, but since OpenAI haven't published their technique then we can't know if it really was their innovation.
> The humor/hypocrisy of the situation aside, it does seem to be true that OpenAI is consistently the one coming up with new ideas first (GPT 4, o1, 4o-style multimodality, voice chat, DALL-E, …) and then other companies reproduce their work, and get more credit because they actually publish the research
I claim one just can't put the humor/hypocrisy aside that easily.
What OpenAI did with the release of ChatGPT was productize research that was open and ongoing, with DeepMind and others leading at least as much of it. And everything after that was an extension of the basic approach: improved, expanded, but ultimately the same sort of beast. One might even say the position of OpenAI relative to DeepMind was like Apple to Xerox. Productizing is nothing to sneeze at; it requires creativity and work to productize basic research. But naturally you get end-users who consider the productizers the "fountainheads", and who overestimate the productizers because products are all they see.
I may be wrong, but to my knowledge OpenAI:
- did not invent transformer architecture
- did not invent diffusion architecture
- did not come up with the idea for multi-modality
- did not invent the notion/architecture of the latest “agentic” models
They simply were the first to aggressively pursue scaling the transformer to the extent that is normal for the industry today. Although this has proven to produce interesting results, “simply adding scale” is, in my view, the least interesting development in modern ML. Giving credit where it’s due, they MAY have popularized the RLHF methodology, but I don’t recall them inventing that either?
(feel free to point out any of the above that I falsely attributed to NOT OpenAI.)
Additionally I seem to remember in an interview with Altman circa late ‘21 where he explains that the spirit of “OpenAI” and how their only goal is pursuing AGI, and “should someone else come up with a more promising path to get there, we would stop what we’re doing and help them”. I couldn’t find a reference to this interview, but anyone else, please feel free to share (I think it was a youtube link). - fast forward to 2025 and now “OpenAI” is the least open large contributor and indiscernible from your run-of-the-mill AI/ML valley startup insofar as they’re referring to others as “competitors” as opposed to collaborators.. interesting times…
There’s some truth in that, but isn’t making a radically cheaper version also a new idea that deepseek didn’t know whether it would work? I mean, there was already research into distillation, but there was already research into some of (most of?) OpenAI’s ideas.
Yes, for people who look into the research Deepseek released, there are a good number of novelties which enabled much cheaper R&D. For example, improvements to Mixture of Experts modules and Multi-head Latent Attention. If you have infinite money, you don’t need to innovate there, but DeepSeek didn’t.
The eye-watering funding numbers proposed by Altman in the past and more recently with “Stargate” suggests a publicly-funded research pivot is not out of the question. Could see a big defense department grant being given. Sigh.
I don't see any reason to assume that "publicly funded" will imply that the research is public. Although I'd be more than happy to be wrong on this one.
Not really, they just put their eye to where everyone knows the ball is going and publish fake / cherrypicked results and then pretend like they got there first (o1, gpt voice, sora)
Fortunately, OpenAI doesn't need to make money because they are a nonprofit dedicated to the safe and transparent advancement of AI for all of humanity
I was wondering if this might be the case, similar to how Bing’s initial training included Google’s search results [1]. I’d be curious to see more details of OpenAI’s evidence.
It is, of course, quite ironic for OpenAI to indiscriminately scrape the entire web and then complain about being scraped themselves.
Hard to really have any sympathy for OpenAI's position when they're actively stealing content, ignoring requests to stop then spending huge amounts to get around sites running ai poisoning scripts, making it clear they'll still take your content regardless of if you consent to it.
It looks like Deepseek had a subdomain called "openai-us1.deepseek.com". What is a legitimate use-case for hosting an openai proxy(?) on your subdomain like this?
Not implying anything's off here, but it's interesting to me that this OpenAI entity is one of the few subdomains they have on their site
Could just be an OpenAI-compatible endpoint too. A lot of LLM tools use OpenAI compatible APIs, just like a lot of Object Storage tools use S3 compatible APIs.
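For anyone who hasn't run into it, "OpenAI-compatible" usually just means the same REST shape, so the standard client works with a different base URL. A minimal sketch; the hostname, key, and model name below are placeholders, not anything DeepSeek actually runs:

    from openai import OpenAI

    # Point the stock OpenAI client at any compatible provider.
    client = OpenAI(
        base_url="https://llm.example.com/v1",   # placeholder endpoint
        api_key="sk-whatever-the-provider-issued",
    )

    resp = client.chat.completions.create(
        model="some-model-name",                 # whatever the provider exposes
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(resp.choices[0].message.content)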
The US government likely will favor a large strategic company like OpenAI instead of individual's copyrights, so while ironic, the US government definitely doesn't care.
And the US government is also likely itching to reduce the power of Chinese AI companies that could out compete US rivals (similar to the treatment of BYD, TikTok, solar panel manufacturers, network equipment manufacturers, etc), so expect sweeping legislation that blocks access to all Chinese AI endeavours to both the US and then soon US allies/West (via US pressure.)
The likely legislation will be on the surface justified both by security concerns and by intellectual property concerns, but ultimately it will be motivated by winning the economic competition between China and the US and it will attempt to tilt the balance via explicitly protectionist policies.
>The US government likely will favor a large strategic company like OpenAI instead of individual's copyrights
Even if we assume this is true, Disney and Netflix are both currently worth more than OpenAI and both rely on the strict enforcement of US copyright law. I do not think it is so obvious which powers that be have the better lobbying efforts and, currently, it's looking like this question will mostly be adjudicated by the courts, not Congress, anyways.
I don't think OpenAI stole from Disney or Netflix. Rather, OpenAI stole from individual artists, YouTube, and other social media whose users do not really have any lobbying power.
So I think OpenAI, Disney and Netflix win together. Big companies tend to win.
> What are the first words of the disney movie, "Aladdin" ?
The first words of Disney's Aladdin (1992) are spoken by the *Peddler*, the mysterious merchant at the beginning of the film. He says:
"Ah, Salaam and good evening to you, worthy friend. Please, please, come closer..."
He then continues with:
"Too close! A little too close. There. Welcome to Agrabah. City of mystery, of enchantment, and the finest merchandise this side of the River Jordan, on sale today! Come on down!"
This opening sets the stage for the story, introducing the magical and bustling world of Agrabah.
Being publicly available does not mean that copyright is invalid. Copyright gives the holders the right to restrict USE, not merely restrict reproduction. Adaptation is also an exclusive right of the copyright holder. You're not allowed to make derivative works.
> They consumed publicly available material on the Internet
I agree that there are some important distinctions and word-choices to be made here, and that there are problems with equating training to "stealing", and that copyright infringement is not theft, etc.
That said, if you zoom out to the overall conduct, it's fair to argue that the companies are doing something unethical, the same as if they paid an army of humans to memorize other people's work and then regurgitate slightly-reworded copies.
> That said, if you zoom out to the overall conduct, it's fair to argue that the companies are doing something unethical, the same as if they paid an army of humans to memorize other people's work and then regurgitate slightly-reworded copies.
I would use the analogy of those humans learning from the material. Like reading books in the library
"regurgitate slightly-reworded copies" in my experience using LLMs (not insubstantial) that is an unfairly pejorative take on what they do
By that logic, a copy of the source code for a proprietary app that someone has stolen and placed online is immediately free for all to use as they wish.
Being on the internet doesn't make it yours, or acceptable to take. In the case of OpenAI (and Anthropic), they should be following the long-held convention of the robots.txt file on sites, which can be set to tell them specifically that they may not take your content; they openly ignore that request.
OpenAI absolutely is stealing from everyone, hence why most will have little sympathy when they complain someone stole from them.
I don’t think US government can move fast enough to change the trajectory. Also it doesn’t help that basically every government is second guessing their alliance with the US. It’s not an industry that can ruin local industries either (like cheap BYD is bad for German cars).
It’s a very fun thing to watch from the sidelines right now, if I’ll be honest.
A. below is a list of OpenAI initial hires from Google. It's implausible to me that there wasn't quite significant transfer of Google IP
B. Google published extensively, including the famous 'Attention Is All You Need' paper, but OpenAI, despite its name, has not explained the breakthroughs that enabled o1. It has also switched from a charity to a for-profit company.
C. Now this company, with a group of smart, unknown machine learning engineers, presumably paid a fraction of what OpenAI's are, has created a far cheaper model and openly published the weights and many methodological insights, which will be used by OpenAI.
1. Ilya Sutskever – One of OpenAI’s co-founders and its former Chief Scientist. He previously worked at Google Brain, where he contributed to the development of deep learning models, including TensorFlow.
2. Jakub Pachocki – Formerly OpenAI’s Director of Research, he played a major role in the development of GPT-4. He had a background in AI research that overlapped with Google’s fields of interest.
3. John Schulman – Co-founder of OpenAI, he worked on reinforcement learning and helped develop Proximal Policy Optimization (PPO), a method used in training AI models. While not a direct Google hire, his work aligned with DeepMind’s research areas.
4. Jeffrey Wu – One of the key researchers involved in fine-tuning OpenAI’s models. He worked on reinforcement learning techniques similar to those developed at DeepMind.
5. Girish Sastry – Previously involved in OpenAI’s safety and alignment work, he had research experience that overlapped with Google’s AI safety initiatives.
> A. below is a list of OpenAI initial hires from Google. It's implausible to me that there wasn't quite significant transfer of Google IP
I agree there's hypocrisy but in terms of making a strong argument, you can safely remove your list of persons who (drum roll)... mostly _didn't_ actually work at Google?
Oh God. I know exactly how this feels. A few years ago I made a bread hydration and conversion calculator for a friend, and put it up on JSFiddle. My friend, at the time, was an apprentice baker.
Just weeks later, I discovered that others were pulling off similar calculations! They were making great bread with ease and not having to resort to notebooks and calculators! The horror! I can't believe that said close friend of mine would actually share those highly hydraty mathematical formulas with other humans without first requesting my consent </sarc>.
Could it be, that this stuff just ends up in the dumpster of "sorry you can't patent math" or the like?
I do think that distilling a model from another is much less impressive than distilling one from raw text. However, it is hard to say if it is really illegal or even immoral, perhaps just one step further in the evolution of the space.
It's about as illegal as the billions, if not trillions of IPs that ClosedAI infringed to train their own data without consent. Not that they're alone, and I personally don't mind that AI companies do it, but it's still amusing when they get this annoyed at others doing the same thing to them.
I think they had the advantage of being ahead of the law in this regard. To my knowledge, reading copyrighted material isn't (or wasn't) illegal, and it remains a legal grey area.
Distilling weights from prompts and responses is even more of a legal grey area. The legal system cannot respond quickly to such technological advancements so things necessarily remain a wild west until technology reaches the asymptotic portion of the curve.
In my view the most interesting thing is, do we really need vast data centers and innumerable GPUs for AGI? In other words, if intelligence is ultimately a function of power input, what is the shape of the curve?
The main issue is that they've had plenty of instances where the LLM outputted copyrighted content verbatim, like it happened with the New York Times and some book authors. And then there's DALL-E, which is baked into ChatGPT and before all the guardrails came up, was clearly trained on copyrighted content to the point it had people's watermarks, as well as their styles, just like Stable Diffusion mixes can do (if you don't prompt it out).
Like you've put, it's still a somewhat gray area, and I personally have nothing against them (or anyone else) using copyrighted content to train models.
I do find it annoying that they're so closed-off about their tech when it's built on the shoulders of openness and other people's hard work. And then they turn around and throw hissy fits when someone copies their homework, allegedly.
> Distilling weights from prompts and responses is even more of a legal grey area.
Actually, unless the law changes, this is pretty settled territory in US law. All output of AIs is not copyrightable and is therefore in the public domain. The only legal avenue of attack OpenAI has is a Terms of Service violation, which is a much weaker claim than copyright, if it is even true.
> if intelligence is ultimately a function of power input, what is the shape of the curve?
According to a quick google search, the human body consumes ~145W of power over 24h (eating 3000kcals/day). The brain needs ~20% of that so 29W/day. Much less than our current designs of software & (especially) hardware for AI.
I think you mean the brain uses 29W (i.e. not 29W/day). Also, I suspect that burgers are a higher entropy energy source than electricity so perhaps it is even less than that.
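The arithmetic behind those figures, for anyone who wants to check it (3,000 kcal/day converted to watts, then the ~20% share attributed to the brain):

    KCAL_TO_JOULES = 4184
    SECONDS_PER_DAY = 24 * 60 * 60

    body_watts = 3000 * KCAL_TO_JOULES / SECONDS_PER_DAY   # ~145 W
    brain_watts = 0.20 * body_watts                        # ~29 W
    print(f"body ≈ {body_watts:.0f} W, brain ≈ {brain_watts:.0f} W")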