I think there are two different things going on here:
"DeepSeek trained on our outputs and that's not fair because those outputs are ours, and you shouldn't take other peoples' data!" This is obviously extremely silly, because that's exactly how OpenAI got all of its training data in the first place - by scraping other peoples' data off the internet.
"DeepSeek trained on our outputs, and so their claims of replicating o1-level performance from scratch are not really true" This is at least plausibly a valid claim. The DeepSeek R1 paper shows that distillation is really powerful (e.g. they show Llama models get a huge boost by finetuning on R1 outputs), and if it were the case that DeepSeek were using a bunch of o1 outputs to train their model, that would legitimately cast doubt on the narrative of training efficiency. But that's a separate question from whether it's somehow unethical to use OpenAI's data the same way OpenAI uses everyone else's data.
Why would it cast any doubt? If you can use o1 output to build a better R1, then use R1 output to build a better X1... then a better X2... XN, that just shows a method to create better systems for a fraction of the cost from where we stand. If it was that obvious, OpenAI should have done it themselves. But the disruptors did it. In hindsight it might sound obvious, but that is true for all innovations. It is all good stuff.
I think it would cast doubt on the narrative "you could have trained o1 with much less compute, and r1 is proof of that", if it turned out that in order to train r1 in the first place, you had to have access to a bunch of outputs from o1. In other words, you had to do the really expensive o1 training in the first place.
(with the caveat that all we have right now are accusations that DeepSeek made use of OpenAI data - it might just as well turn out that DeepSeek really did work independently, and you really could have gotten o1-like performance with much less compute)
> In this study, we demonstrate that reasoning capabilities can be significantly improved through large-scale reinforcement learning (RL), even without using supervised fine-tuning (SFT) as a cold start. Furthermore, performance can be further enhanced with the inclusion of a small amount of cold-start data
Is this cold-start data what OpenAI is claiming is their output? If so, what's the big deal?
DeepSeek claims that the cold-start data is from DeepSeek-V3, which is the model that has the $5.5M price tag. If that data were actually the output of o1 (a model that had a much higher training cost, and its own RL post-training), that would significantly change the narrative of R1's development, and what's possible to build from scratch on a comparable training budget.
In the paper DeepSeek just says they have ~800k responses that they used for the cold start data on R1, and are very vague about how they got it:
> To collect such data, we have explored several approaches: using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators.
My surface-level reading of these two sections is that the 800k samples come from R1-Zero (i.e. "the above RL training") and V3:
>We curate reasoning prompts and generate reasoning trajectories by performing rejection sampling from the checkpoint from the above RL training. In the previous stage, we only included data that could be evaluated using rule-based rewards. However, in this stage, we expand the dataset by incorporating additional data, some of which use a generative reward model by feeding the ground-truth and model predictions into DeepSeek-V3 for judgment.
>For non-reasoning data, such as writing, factual QA, self-cognition, and translation, we adopt the DeepSeek-V3 pipeline and reuse portions of the SFT dataset of DeepSeek-V3. For certain non-reasoning tasks, we call DeepSeek-V3 to generate a potential chain-of-thought before answering the question by prompting.
The non-reasoning portion of the DeepSeek-V3 dataset is described as:
>For non-reasoning data, such as creative writing, role-play, and simple question answering, we utilize DeepSeek-V2.5 to generate responses and enlist human annotators to verify the accuracy and correctness of the data.
I think if we were to take them at their word on all this, it would imply there is no specific OpenAI data in their pipeline (other than perhaps their pretraining corpus containing some incidental ChatGPT outputs that are posted on the web). I guess it's unclear where they got the "reasoning prompts" and corresponding answers, so you could sneak in some OpenAI data there?
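If it helps to picture it, here's a rough sketch of the kind of rejection-sampling loop those quoted passages describe. Every name here is a hypothetical stand-in (my own illustration, not DeepSeek's pipeline):

```python
from typing import Callable, Optional

def collect_sft_data(
    prompts: list[str],
    sample: Callable[[str], str],                      # draws one response from the RL checkpoint
    rule_check: Callable[[str, str], Optional[bool]],  # rule-based reward; None if no ground truth applies
    llm_judge: Callable[[str, str], bool],             # generative reward model, e.g. V3-as-judge
    n_samples: int = 16,
) -> list[dict]:
    """Hypothetical rejection-sampling loop: keep only responses that pass some verifier."""
    dataset = []
    for prompt in prompts:
        for _ in range(n_samples):
            response = sample(prompt)
            verdict = rule_check(prompt, response)
            if verdict is None:                        # fall back to the LLM judge
                verdict = llm_judge(prompt, response)
            if verdict:
                dataset.append({"prompt": prompt, "response": response})
    return dataset
```

Nothing about a loop like this says where the prompts or the ground truths came from, which is exactly the gap where OpenAI data could (or could not) have been slipped in.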
That's what I am gathering as well. Where is OpenAI going to get substantial proof to claim that their outputs were used?
The reasoning prompts and answers for SFT from V3, you mean? No idea. For that matter, you have no idea where OpenAI got that data from either. If they open this can of worms, their own can of worms will be opened as well.
It's like the claim "they showed anyone can create a powerful model from scratch" becomes "false yet true".
Maybe they needed OpenAI for their process. But now that their model is open source, anyone can use that as their cold start and spend the same amount.
"From scratch" is a moving target. No one who makes their model with massive data from the net is really doing anything from scratch.
Yeah, but that kills the implied hope of building a better model for cheaper. Like this you'll always have a ceiling of being a bit worse than the OpenAI models.
The logic doesn't exactly hold; it is like saying that a student is limited by their teachers. It is certainly possible that a bad teacher will hold the student back, but ultimately a student can lag behind or improve on the teacher with only a little extra stimulus.
They probably would need some other source of truth than an existing model, but it isn't clear how much additional data is needed.
Don't forget that this model probably has far fewer params than o1 or even 4o. This is a compression/distillation, which means it frees up a lot of compute resources to build models much more powerful than o1. At least this allows further scaling compute-wise (if not in the amount of, non-synthetic, source material available for training).
If R1 were better than o1, yes you would be right. But the reporting I’ve seen is that it’s almost as good. Being able to copy cutting-edge models won’t advance the state of the art in terms of intelligence. They have made improvements in other areas, but if they reused o1 to train their model, that would be effectively a ctrl-c / ctrl-v strictly in terms of task performance.
It's not just about whether competitors can improve on OpenAI's models. It's about whether they can continually create reasonable substitutes for orders of magnitude less investment.
> It's about whether they can continually create reasonable substitutes for orders of magnitude less investment
That just means that the edge you’re able to retain if you invest $1B is nonexistent. It also means there’s a huge disincentive to invest $1B if your reward instantly evaporates. That would normally be fine if the competitor is otherwise able to get to that new level without the $1B. But if it relies on your $1B to then be able to put in $100M in the first place to replicate your investment, it essentially means the market for improvements disappears OR there’s legislation written to ensure competitors aren’t allowed to do that.
This is a tragedy of the commons, and we already have historical examples of how humans tried to deal with it and all the problems that come with it. Producing a book requires substantial capital, but copying it requires a lot less. Copyright law, however flawed and imperfect, tries to protect the incentive to create in the face of that.
> That just means that the edge you’re able to retain if you invest $1B is nonexistent.
Jeez. Must be really tough to have some comparatively small group of people financially destroy your industry with your own mechanically-harvested professional output while dubiously claiming to be better than you when in reality it’s just a lot cheaper. Must be tough.
Maybe they should take some time to self-reflect and make some art and writing about it using the products they make that mechanically harvest the work of millions of people, and have already screwed up the commercial art and writing marketplaces pretty thoroughly. Maybe tell DeepSeek it’s their therapist and get some emotional support and guidance.
This. There is something doubly evil about OpenAI harvesting all of that work for its own economic benefit, while also destroying the opportunity for those it stole from to continue to ply their craft.
And then all of their stans taking on a persecution complex because people that actually made all of the “data” don’t uncritically accept their work as equivalent adds insult to injury.
>it essentially means the market for improvements disappears OR there’s legislation...
This is possibly true, though with billions already invested I'm not sure that OpenAI would just...stop absent legislation. And, there may be technical or other solutions beyond legislation. [0]
But, really, your comment here considers what might come next. OTOH, I was replying to your prior comment that seemed to imply that DeepSeek's achievement was of little consequence if they weren't improving on OpenAI's work. My reply was that simply approximating OpenAI's performance at much lower cost could still be extraordinarily consequential, if for no other reason than the challenges you subsequently outlined in this comment's parent.
[0] On that note, I'm not sure (and admittedly haven't yet researched) how DeepSeek just wholesale ingested ChatGPT's "output" to be used for its own model's training, so not sure what technical measures might be available to prevent this going forward.
First, there could have been industrial espionage involved, so who knows. Ignoring that, you’re missing what I’m saying. Think of it this way - if it requires o1’s output to reach almost the same task performance, then this approach gives you a cheap way to replicate the performance of a leading-edge model at a fraction of the cost. It does not give you a way to train something that beats a cutting-edge model. Cutting-edge models require a lot of R&D and capital expenditure - if they’re just going to be trivially copied after public availability, the response is going to be legislation to keep the incentive there for meaningful investments in that area. Otherwise you’re going to have another AI winter where progress shrivels because investment dollars dry up.
That’s why it’s so hard to understand the true cost of training DeepSeek, whereas it’s a little bit easier for cutting-edge models (and even then still difficult).
"Otherwise you’re going to have another AI winter where progress shrivels because investment dollars dry up."
Tbh a lot of people in the world would love this outcome. They will use AI because not using it puts them at a comparative disadvantage - but would rather AI doesn't develop further or didn't develop at all (i.e. they don't value the absolute advantage/value). There's both good and bad reasons for this.
When you build a new model, there is a spectrum of how you use the old model: 1. taking the weights, 2. training on the logits, 3. training on model output, 4. training from scratch. We don't know how much advantage #3 gives. It might be the case that with enough output from the old model, it is almost as useful as taking the weights.
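To make the difference between #2 and #3 concrete, here's a toy PyTorch sketch (my own illustration, not any lab's actual training code; it assumes teacher and student share a tokenizer for the logit case):

```python
import torch
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    # Option 2: match the teacher's full output distribution.
    # Requires access to the teacher's logits (effectively, its weights).
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

def output_sft_loss(student_logits, teacher_token_ids):
    # Option 3: ordinary next-token cross-entropy on text the teacher generated.
    # Only needs API-level access to the teacher's outputs.
    return F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        teacher_token_ids.view(-1),
    )
```

Option 3 throws away the per-token probability information, which is part of why it's an open question how close it gets to option 2 given enough samples.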
I lean on the idea that R1-Zero was trained from cold start, at the same time, they have tried many things including using OpenAI APIs. These things can happen in parallel.
> you had to do the really expensive o1 training in the first place
It is no better for OpenAI in this scenario either, any competitor can easily copy their expensive training without spending the same, i.e. there is a second mover advantage and no economic incentive to be the first one.
To put it another way, the $500 billion Stargate investment will be worth just $5 billion once the models become available for consumption, because it will only take that much to replicate the same outcomes with new techniques, even if the cold start needed o1 output for RL.
Now that its been done, is OpenAI needed or can you iterate on DeepSeek only moving forward?
My understanding is this effectively builds on OpenAI's very expensive initial work, provides a "nearly as good as" model for orders of magnitude cheaper to train and run, that also provides a basis to continue building on and improving without openAI, and without human bottlenecks.
That cuts OAI off at the knees in terms of market viability after billions have been spent. If DS can iterate and match the capabilities of the current in-development OAI models in the next year, it may come down to regulatory capture and government intervention to ensure its viability as a company.
You cannot really have successful government intervention against open source code and weights.
The attempt in cryptography with PGP and export controls made that clear.
Even if DS specifically is banned (and even effectively), a dozen other clean room replications following their published methods will become available.
It is possible this government will ban all “unapproved” LLMs not running at an authorized provider [1], saying it is a weapon, or AGI, or Skynet, or whatever makes the powers that be sound important, thus establishing the need for control [2], but the rest of the world will keep innovating.
—-
[1] Bans just need to work economically, not at the information level, i.e. organizations with liability considerations will not use “unapproved” ones, and they are the ones who will spend the bulk of the money, which is what they need to protect.
[2] If they were smart they could do this positively, without the backlash bans would have, by giving protections to compliant models, like legal indemnity for model companies and users, without necessarily blocking others.
o1 wouldn't exist without the combined compute of every mind that led to the training data they used in the first place. How many h100 equivalents are the rolling continuum of all of human history?
Human reasoning, as it exists today, is the result of tens of thousands of years of intuition slowly distilled down to efficient abstract concepts like "numbers", "zero", "angles", "cause", "effect", "energy", "true", "false", ...
I don't know what reasoning from scratch would look like without training on examples from other reasoning beings. As human children do.
Actually I also think it's possible. Start with a natural-numbers axiom system. Form all valid sentences of increasing length. Run RL on a model to search for counterexamples or proofs. On sufficient compute, this should produce superhuman math performance (efficiency), even at compute parity.
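As a toy illustration of why that's at least conceivable, here's a minimal sketch (mine, nothing like a real prover) of a verifiable reward signal that needs no human-written examples: propose claims over the naturals and reward the policy only when its proposed counterexample actually falsifies the claim.

```python
import random

# Toy sketch (mine, not a real prover): enumerate simple universally quantified
# arithmetic claims and reward a "policy" for proposing genuine counterexamples.
# The verifier is exact, so the reward needs no human-labeled data at all.

CLAIMS = ["n * 2 == n + n", "n * n >= n", "n + 1 == n", "n * 0 == 1"]

def is_counterexample(claim: str, n: int) -> bool:
    return not eval(claim, {}, {"n": n})   # fine for a toy; never eval untrusted input

def reward(claim: str, proposed_n: int) -> float:
    return 1.0 if is_counterexample(claim, proposed_n) else -1.0

random.seed(0)
for claim in CLAIMS:
    guess = random.randint(0, 10)          # stand-in for the model's proposed counterexample
    print(f"{claim!r:20} guess={guess} reward={reward(claim, guess)}")
```

A real version would also need the proof direction, not just counterexamples, and a far smarter search, which is where the RL would come in.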
I wonder how much discovery in math happens as a result in lateral thinking epiphanies. IE: A mathematician is trying to solve a problem, their mind is open to inspiration, and something in nature, or their childhood or a book synthesizes with their mental model and gives them the next node in their mental graph that leads to a solution and advancement.
In an axiomatic system, those solutions are checkable, but how discoverable are they when your search space starts from infinity? How much do you lose by disregarding the gritty reality and foam of human experience? It provides inspirational texture that helps mathematicians in the search at least.
Reality is a massive corpus of cause and effect that can be modeled mathematically. I think you're throwing the baby out with the bathwater if you even want to be able to math in a vacuum. Maybe there is a self optimization spider that can crawl up the axioms and solve all of math. I think you'll find that you can generate new math infinitely, and reality grounds it and provides the gravity to direct efforts towards things that are useful, meaningful and interesting to us.
As I mentioned in a sister comment, Gödel's incompleteness theorems also throw a wrench into things, because you will be able to construct logically consistent "truths" that may not actually exist in reality. At which point, your model of reality becomes decreasingly useful.
At the end of the day, all theory must be empirically verified, and contextually useful reasoning simply cannot develop in a vacuum.
Those theorems are only relevant if "reasoning" is taken to its logical extreme (no pun intended). If reasoning is developed/trained/evolved purely in order to be useful and not pushed beyond practical applications, the question of "what might happen with arbitrarily long proofs" doesn't even come up.
On the contrary, when reasoning about the real world, one must reason starting from assumptions that are uncertain (at best) or even "clearly wrong but still probably useful for this particular question" (at worst). Any long and logic-heavy proof would make the results highly dubious.
A question is: what algorithms does the brain use to make these creative lateral leaps? Are they replicable?
Unless the brain is using physics that we don’t understand or can’t replicate, it seems that, at least theoretically, there should be a way to model what it’s doing with silicon and code.
States like inspiration and creativity seem to correlate in an interesting way with ‘temperature’, ‘top p’, and other LLM inputs. By turning up the randomness and accepting a wider range of output, you get more nonsense, but you also potentially get more novel insights and connections. Human creativity seems to work in a somewhat similar way.
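For anyone who hasn't played with those knobs, here's a minimal sketch of temperature plus top-p (nucleus) sampling over a single logit vector; the numbers are arbitrary:

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_p=0.9, rng=None):
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature   # temperature sharpens or flattens
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # most likely tokens first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # smallest nucleus covering top_p of the mass
    keep = order[:cutoff]
    return rng.choice(keep, p=probs[keep] / probs[keep].sum())

# Higher temperature and top_p widen the nucleus: more nonsense, but also more surprises.
print(sample_token([2.0, 1.0, 0.2, -1.0], temperature=1.5, top_p=0.95))
```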
Dogs are probably the best example I can think of. They learn through experience and clearly reason, but without a complex language to define abstract concepts. Its very basic reasoning, but they do learn and apply that learning.
To your point, experience is the training. Without language/data to represent human experience and knowledge to train a model, how would you give it 'experience'?
And yet dogs, to a very high degree, just learn the same things. At least the same kinds of things, over and over.
They were pre-designed to learn what they always learn. Their minds structured to readily make the same connections as puppies, that dogs have always needed to survive.
Not for real reasoning, which, by its nature, does not have a limit.
> just learn the same things. At least the same kinds of things, over and over.
It's easy to train the same things to a degree, but it's amazing to watch different dogs individually learn and reason through things completely differently, even within a breed or even a litter.
Reasoning ability is always limited by the capacity of the thinker to frame the concepts and interactions. It's always limited by definition; we only push that limit farther than other species, and AGI may eventually push it past our abilities.
There was necessarily a "first reasoning being" who learned reasoning from scratch, and then it improved from there. Humans needed tens of thousands of years because:
- humans experience reality at a slower pace than AI could theoretically experience a simulated reality
- humans have to transfer knowledge to the next generation every 80 years (in a manner that's very lossy), and around half of each human lifespan is spent learning things that the previous generation already knew
Whether the first reasoning entity is an individual organism or a group of organisms is completely irrelevant to the original point. If one were to grant that there was in fact a "first reasoning group" rather than a "first reasoning being" the original argument would remain intact.
That was the easy part though; figuring out how to handle all the unintended side effects it generated is still an ongoing process. Please sit and relax while we solve the few incidental events occurring here and there; rest assured we are putting our best effort into their resolution.
It is possible to learn to reason from scratch; that's what R1-Zero did, but the resulting chains of thought aren't legible to humans.
To quote DeepSeek directly:
> DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable performance on reasoning. With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL.
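For a sense of what "RL without SFT" can look like in practice, here's a hedged sketch of the kind of rule-based reward (format plus accuracy) the paper describes; the exact tags and reward values here are my own guesses, not DeepSeek's code:

```python
import re

# Completions are expected to look like: <think> ...reasoning... </think> <answer> ... </answer>
PATTERN = re.compile(r"<think>.+?</think>\s*<answer>(.+?)</answer>", re.DOTALL)

def rule_based_reward(completion: str, ground_truth: str) -> float:
    match = PATTERN.search(completion)
    if match is None:
        return 0.0                                   # format reward: the tags must be there
    answer = match.group(1).strip()
    return 1.0 if answer == ground_truth else 0.1    # accuracy reward dominates

print(rule_based_reward("<think>2 + 2 = 4</think> <answer>4</answer>", "4"))  # -> 1.0
```

The RL step (GRPO in the paper) then just pushes up the probability of higher-reward completions within each sampled group; no supervised examples are needed, which is part of why R1-Zero's chains of thought can drift into something only the model finds readable.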
If you look at the benchmarks of the DeepSeek-V3-Base, it is quite capable, even in 0-shot: https://huggingface.co/deepseek-ai/DeepSeek-V3-Base#base-mod... This is not from scratch. These benchmark numbers are an indication that the base model already had a large number of reasoning/LLM tokens in the pre-training set.
On the other hand, my take on it, the ability to do reasoning in a long context is a general capability. And my guess is that it can be bootstrapped from scratch, without having to do training on all of the internet or having to distill models trained on the internet.
> These benchmark numbers are an indication that the base model already had a large number of reasoning/LLM tokens in the pre-training set.
But we already know that is the case: the Deepseek v3 paper says it was posttrained partly with an internal version of R1:
> Reasoning Data. For reasoning-related datasets, including those focused on mathematics, code competition problems, and logic puzzles, we generate the data by leveraging an internal DeepSeek-R1 model. Specifically, while the R1-generated data demonstrates strong accuracy, it suffers from issues such as overthinking, poor formatting, and excessive length. Our objective is to balance the high accuracy of R1-generated reasoning data and the clarity and conciseness of regularly formatted reasoning data.
And DeepSeekMath did a repeated cycle of this kind of thing, mixing in 10% of old, previously seen data with newly generated data from the last generation in a continuous bootstrap.
Possible? I guess evolution did it over the course of a few billion years. For engineering purposes, starting from the best advanced position seems far more efficient.
I've been giving this a lot of thought over the last few months. My personal insight is that "reasoning" is simply the application of a probabilistic reasoning manifold on an input in order to transform it into constrained output that serves the stability or evolution of a system.
This manifold is constructed via learning a decontextualized pattern space on a given set of inputs. Given the inherent probabilistic nature of sampling, true reasoning is expressed in terms of probabilities, not axioms. It may be possible to discover axioms by locating fixed points or attractors on the manifold, but ultimately you're looking at a probabilistic manifold constructed from your input set.
But I don't think you can untie this "reasoning" from your input data. It's possible you will find "meta-reasoning", or similar structures found in any sufficiently advanced reasoning manifold, but these highly decontextualized structures might be entirely useless without proper recontextualization, necessitating that a reasoning manifold is trained on input whose patterns follow learnable underlying rules, if the manifold is to be useful for processing input of that kind.
Decontextualization is learning, decomposing aspects of an input into context-agnostic relationships. But recontextualization is the other half of that, knowing how to take highly abstract, sometimes inexpressible, context-agnostic relationships and transform them into useful analysis in novel domains.
This doesn't mean a well-trained model can't reason about input it hasn't encountered before, just that the input needs to be in some way causally connected to the same laws which governed the input the manifold was trained on.
I'm sure we could create a fully generalized reasoning manifold which could handle anything, but I don't see how we possibly get that without first considering and encountering all possible inputs. But these inputs still have to have some form of constraint governed by laws that must be learned through sampling, otherwise you'd just be training on effectively random data.
The other commenter who suggested simply generating all possible sentences and training on internal consistency should probably consider Gödel's incompleteness theorems, and that internal consistency isn't enough to accurately model and interpret the universe. One could construct a thought experiment about an isolated brain in a jar with effectively unlimited neuronal connections, but no sensory connection to the outside world. It's possible, with enough connections, that the likelihood of the brain conceiving of true events it hasn't actually encountered does increase meaningfully. But the brain still has nothing to validate against, and can't simply assume that because something is internally logically consistent, that it must exist or have existed.
If OpenAi had to account for the cost of producing all the copyrighted material they trained their LLM on, their system would be worth negative trillions of dollars.
Let's just assume that the cost of training can be externalized to other people for free.
Even if what OpenAI asserts in the title of this post is true, their system is worth negative trillions of dollars.
If other players can access that data with relatively less effort, then it's futile trying to train your models and improve upon them, as clearly you don't have an architectural moat, just a training moat.
Kind of like an office scene where an introverted hard worker does all the tedious work, while his extroverted colleague promotes it as his own and gains the credit.
At the pace that DeepSeek is developing we should expect them to surpass OpenAI in not that long.
The big question really is, are we doing it wrong, could we have created o1 for a fraction of the price. Will o4 cost less to train than o1 did?
The second question is naturally. If we create a smarter LLM, can we use it to create another LLM that is even smarter?
It would have been fantastic if DeepSeek could have come out with an o3 competitor before o3 even became publicly available. That way we would have known for sure that we’re doing it wrong. Cause then either we could have used o1 to train a better AI or we could have just trained in a smarter and cheaper way.
The whole discussion is about whether or not the second case of using o1 outputs to fine tune R1 is what allowed R1 to become so good. If that's the case then your assertion that DeepSeek will surpass OpenAI doesn't really make sense because they're dependent on a frontier model in order to match, not surpass.
Yeah, that's my point. If they do end up surpassing OpenAI then it would seem likely that they aren't just relying on copying from o1, or whatever model is the frontier model at that time.
Sure. This is fine. Data is still a product, no matter how much businesses would like to turn it into a service.
The model already embodies the "total sum of a massive amount of compute" used to create it; if it's possible to reuse that embodied compute to create a better model, that's good for the world. Forcing everyone to redo all that compute for themselves is, conversely, bad for the world.
I mean, yes that's how progress works. Has OpenAI got a patent? If not it's fair game.
We don't make people figure out how to domesticate a cow every time they want a hamburger. Or test hundreds of thousands of filaments before they can have a lightbulb. Inventions, once invented, exist as giants to stand upon. The inventor can either choose to disclose the invention and earn a patent for exclusive rights, or they can try to keep it a secret and hope nobody reverse engineers it.
I think the prevailing narrative ATM is that DeepSeek's own innovation was done in isolation and they surpassed OpenAI. Even though in the paper they give a lot of credit to Llama for their techniques. The idea that they used o1's outputs for their distillation further shows that models like o1 are necessary.
All of this should have been clear anyway from the start, but that's the Internet for you.
> The idea that they used o1's outputs for their distillation further shows that models like o1 are necessary.
Hmm, I think the narrative of the rise of LLMs is that once the output of humans has been distilled by the model, the human isn't necessary.
As far as I know, DeepSeek adds only a little to the transformers model while o1/o3 added a special "reasoning component" - if DeepSeek is as good as o1/o3, even taking data from it, then it seems the reasoning component isn't needed.
It seems clear that the term can be used informally to denote the boiling down of human knowledge, indeed it was used that way before AI appeared in the popular imagination.
In the context in which you said it, it matters a lot.
>> The idea that they used o1's outputs for their distillation further shows that models like o1 are necessary.
> Hmm, I think the narrative of the rise of LLMs is that once the output of humans has been distilled by the model, the human isn't necessary.
If DeepSeek was produced through the distillation (term of art) of o1, then the cost of producing DeepSeek is strictly higher than the cost of producing o1, and can't be avoided.
Continuing this argument, if the premise is true then DeepSeek can't be significantly improved without first producing a very expensive hypothetical o1-next model from which to distill better knowledge.
That is the argument that is being made. Please avoid shallow dismissals.
Edit: just to be clear, I doubt that DeepSeek was produced via distillation (term of art) of o1, since that would require access to o1's weights. It may have used some of o1's outputs to fine-tune the model, which would still mean that the cost of training DeepSeek is strictly higher than training o1.
> just to be clear, I doubt that DeepSeek was produced via distillation
Yeah, your technical point is kind of ridiculous here: in all my uses of distillation (and in the comment I quoted), distillation is used in an informal sense, and there's no allegation that DeepSeek could have been in possession of OpenAI's model weights, which is what's needed for your "distillation (term of art)".
Because it feeds conspiracy theories and because there's no evidence for it? Also, let's talk DeepSeek in particular, not "China".
Looking back at the article, it is indeed using "distillation" in a special/"term of art" sense, but not using it correctly. I.e., it's not actually speculating that DeepSeek obtained OpenAI's weights and distilled them down, but rather that it used OpenAI's answers/output as a starting point (for which there is a different method/"term of art").
The R1-Zero paper shows how many training steps the RL took, and it's not many. The cost of the RL is likely a small fraction of the cost of the foundational model.
> the prevailing narrative ATM is that DeepSeek's own innovation was done in isolation and they surpassed OpenAI
I did not think this, nor did I think this was what others assumed. The narrative, I thought, was that there is little point in paying OpenAI for LLM usage when a much cheaper, similar / better version can be made and used for a fraction of the cost (whether it's on the back of existing LLM research doesn't factor in)
Yes, well the narrative that rocked the stock market is different. It's looking at what DeepSeek did and assuming they may have a competitive advantage in this space and could outperform OpenAI at their own game.
If the narrative is actually that DeepSeek can only reach whatever heights OpenAI has already gotten to with some new tricks, then markets will probably refocus on OpenAI's innovations and price things accordingly, even if the initial cost is huge. It also means OpenAI probably needs a better moat to protect its interests.
I'm not sure where the reality is exactly, but market reactions so far have basically followed that initial narrative and now the rebuttal.
The idea that someone can easily replicate an OpenAI model based simply on OpenAI outputs is, I’d argue, immeasurably worse for OpenAI’s valuation than the idea that someone happened to come up with a few innovations that leapfrogged OpenAI.
The latter could be a one time thing, and/or OpenAi Could still use their financial might to leverage those innovations and get even better with them.
However, the former destroys their business model and no amount of intelligence and innovation from OpenAI protects them from being copied at a fraction of the cost.
> Yes, well the narrative that rocked the stock market is different.
How do you know this?
> If the narrative is actually that DeepSeek can only reach whatever heights OpenAI has already gotten to with some new tricks, then markets will probably refocus on OpenAI's innovations and price things accordingly
Why? If every innovation OpenAI is trying to keep as secret sauce becomes commoditized quickly and cheaply, then why would markets care about any innovations they have? They will be unable to monetize them.
Why would it matter when Chinese deepseek is not going to abide by such rules or be forced to and will release their model open weights so anyone anywhere can host it?
Also, scraping most of the websites they scrape is not allowed; they do it anyway.
There were different narratives for different people. When I heard about r1, my first response was to dig into their paper and its references to figure out how they did it.
Fwiw I assumed they were using o1 to train. But it doesn’t matter: the big story here is that massive compute resources are unlikely to be as important in the future as we thought. It cuts the legs off Stargate etc. just as it’s announced. The CCP must be highly entertained by the timeline.
But HOW they are necessary is the change. They went from building blocks to stepping stones. From a business standpoint that's very damaging to OAI and other players.
OpenAI couldn't do it; when the high cost of training and access to GPUs is their competitive advantage against startups, they can't admit that it does not exist.
When will over training happen on the melange of models at scale?
And will AGI only ever be an extension of this concept?
That is where artificial intelligence is going. Copy things from other things. Will there be an AI eureka moment where it deviates and knows where, and for what reason, it is wrong?
If they're training R1 on o1 output on the benchmarks - then I don't trust those benchmarks results for R1. It means the model is liable to be brittle, and they need to prove otherwise.
It seems like, if they did in fact distill, then what we have found is that you can create a worse copy of the model for ~$5M in compute by training on its outputs.
They're standing on the shoulders of giants, not only in terms of re-using expensive computing power almost for free by using the outputs of expensive models. It's a bit of a tradition in that country, also in manufacturing.
What I meant to say was that OpenAI did put a lot of money into extracting value out of the pile of (partially copyrighted) data, and that DeepSeek was freeloading on that investment without disclosing it, making them look more efficient than they truly are.
Well, originally, OpenAI wasn't supposed to be that kind of organization.
But if you leave someone in the tech industry of SV/SF long enough, they'll start to get high on their own supply and think they're entitled to insane amounts of value, so...
It's because they're the ones who could raise the money to make those models. Academics don't have access to that kind of compute. But the free models exist.
Even assuming the model was somehow publicly available in a form that could be directly copied, that would be a more blatant form of copyright infringement. Distillation launders copyrighted material in a way that OpenAI specifically has argued falls under fair use.
Ironically Deepseek is doing what OpenAI originally pledged to do. Making the model open and free is a gift to humanity.
Look at the whole AI revolution that Meta and others have bootstrapped by opening their models. Meanwhile OpenAI/Microsoft, Anthropic, Google and the rest are just trying to look after number one while trying to achieve regulatory capture and an "AI for me but not for thee" outcome of full control.
Given that you can download and use the weights, the model architecture has to be included as part of that. And I did read a paper from them recently describing their MoE architecture and how it differs from the original GShard.
I don't think it makes sense to look at some previous PR statements by Altman et al. regarding this when there are tens of billions floating around and egos get inflated to moon sizes. Farts in the wind have more weight, but this goes for all corporate PR.
Thieves yelling "stop those thieves" is how this looks to me; they were just first and would not like losing that position. But it's all about money, and consequently power; business as usual.
There seems to be a rare moderation error by dang with respect to this thread.
The comments were moved here by dang from a flagged article with an editorialized/clickbait title. That flagged post has 1300 points at the time of writing.
1. It should be incumbent on the moderator to at least consider that the motivation for the points and comments may have been that many thought the "hypocrisy" of OpenAI's position was a more important issue than OpenAI's actual claim of DeepSeek violating its ToS. Moving the comments to an article that buries the potential hypocrisy issue that may have driven the original points and comments is not ideal.
2. This article is from FT, which has a content license deal with OpenAI. Moving the comments to an article from a company that has a conflict of interest due to its commercial relations with the YC company in question is problematic, especially since dang often states they try to be more hands-off on moderation when the article is about a YC company.
3. There is a link by dang to this thread from the original thread, but there should also be a link by dang to the original thread from here as well. Why is this not the case?
4. Ideally, dang should have asked for a more substantial submission that prioritized the hypocrisy point, to better match the spirit of the original post, instead of moving the comments to this article.
Yes, but we were duped at the time, so it’s right and good that we maintain light on and anger at the ongoing manipulation, in the hope of next time recognizing it as it happens, not after they’ve used us, screwed us, and walked away with a vast fortune.
But it makes sense to expose their blatant lies whenever possible, to diminish the credibility they are trying to build while accusing others of the same things they did.
Oh yes, I agree with all of you that lies should be exposed; also, whoever lies like that once will lie again, zero doubt there.
Just don't set the expectations bar too high to start with is all I am saying. Folks that get so high up money- and power-wise aren't nice people, period. Even if a nice, normal guy without any sociopathic traits were to suddenly shoot that high, the environment and pressures would deform him pretty quickly.
Also, I would consider only some leaked private conversations with close people as representative truth, not some PR statements carefully crafted by team of experts.
Happy to be proven wrong, still waiting for an example #1 to give me some hope.
> This is obviously extremely silly, because that's exactly how OpenAI got all of its training data
IANAL, but it is worth noting here that DeepSeek has explicitly consented to a license that doesn't allow them to do this. That is a condition of using ChatGPT and the OpenAI API.
Even if the courts affirm that there's a fair use defence for AI training, DeepSeek may still be in the wrong here, not because of copyright infringement, but because of a breach of contract.
I don't think OpenAI would have much of a problem if you train your model on data scraped from the internet, some of which incidentally ends up being generated by ChatGPT.
Compare this to training AI models on Kindle Books randomly scraped off the internet, versus making a Kindle account, agreeing to the Kindle ToS, buying some books, breaking Amazon's DRM and then training your AI on that. What DeepSeek did is more analogous to the latter than the former.
> DeepSeek has explicitly consented to a license that doesn't allow them to do this.
You actually don’t know this. Even if it were true that they used OpenAI outputs (and I’m very doubtful) it’s not necessary to sign an agreement with OpenAI to get API outputs. You simply acquire them from an intermediary, so that you have no contractual relationship with OpenAI to begin with.
You are free to publish your conversations with ChatGPT on the Internet, where they can be picked up by scrapers. The US ruled that they are not covered by copyright...
>IANAL, but It is worth noting here that DeepSeek has explicitly consented to a license that doesn't allow them to do this. That is a condition of using the Chat GPT and the OpenAI API.
Right, but it was never about doing the right thing for humanity, it was about doing the right thing for their profits.
Like I’ve said time and time again, nobody in this space gives a fuck about anyone that isn’t directly contributing money to their bottom line at that particular instant. The fundamental idea is selfish, damages the fundamental machinery that makes the internet useful by penalizing people that actually make things, and will never, ever do anything for the greater good if it even stands a chance of reducing their standing in this ridiculously overhyped market. Giving people free access to what is for all intents and purposes a black box is not “open” anything, is no more free (as in speech) than Slack is, and all of this is obviously them selling a product at a huge loss to put competing media out of business and grab market share.
It's quite unlikely that OpenAI didn't break any TOS with all the data they used for training their models.
Not just OpenAI but all companies that are developing LLMs.
IMO, it would look bad for OpenAI to push strongly with this story, it would look like they're losing the technological edge and are now looking for other ways to make sure they remain on top.
Similar to how a patent contract becomes void when a patent expires regardless of what the terms of the contract say, it's not clear to me that OpenAI can enforce a contract provision for an API output they own no copyright in.
Since they have no intellectual property rights in the output, it's not clear to me they have a cause of action to sue over how the output is used.
I wonder if any lawyers have written about this topic.
How many thousands or millions of contracts has OpenAI breached by scraping data off of websites that have terms of service explicitly saying not to scrape data off their websites?
But in all reality I'm happy to see this day. The fact that OpenAI ripped off everyone and everything they could and, to this day pretend like they didn't, is fantastic.
Sam Altman is a con, and it's not surprising that, given all the positive press DeepSeek got, there was a full-court assault on them within 48 hours.
It probably ignored hundreds of thousands of "by using this site you consent to our Terms and Conditions" notices, many of which probably would be read as prohibiting training. But that's also a great example of why these implicit contracts don't really work as contracts.
OpenAI scraped my blog so aggressively that I had to ban their IPs. They ignored the robots.txt (which is a kind of ToS) by two orders of magnitude, and they ignored the explicit ToS that I copy-pasted blindly from somewhere, but which turns out to forbid what they did (something like: you can't make money with the content). Not that I'm going to enforce it, but they should at least shut up.
For example, my digital garden is under the GFDL, and my blog is CC BY-NC-SA. IOW, they can't remix my digital garden under any other license than the GFDL, and they have to credit me if they remix my blog, and can't use it for any commercial endeavor, which OpenAI certainly does now.
So, by scraping my webpages, they agree to my licensing of my data. So they're de facto breaching my licenses, but they cry "fair use".
If I point out that they're breaching the license terms, they'd laugh at me, and maybe give me 2 cents of API access to mock me further. When somebody allegedly uses their API against their unenforceable ToS, they scream like an agitated cockatoo (which is an insult to the cockatoo, BTW; they're devilishly intelligent birds).
Drinking their own poison was mildly painful, I guess...
BTW, I don't believe that DeepSeek copied/used OpenAI models' outputs or training data to train theirs; even if they did, "the cat is out of the bag", "they did something amazing so they needed no permissions", "they moved fast and broke things", and "all is fair use because it's just research", regardless of how they did it.
> So, by scraping my webpages, they agree to my licensing of my data.
If the fair use defense holds up, they didn't need a license to scrape your webpage. A contract should still apply if you only showed your content to people who've agreed to it.
> and "all is fair-use because it's just research"
Fair use is a defense to copyright infringement, not breach of contract. You can use contracts, like NDAs, to protect even non-copyright-eligible information.
Morally I'd prefer what DeepSeek allegedly did to be legal, but to my understanding there is a good chance that OpenAI is found legally in the right on both sides.
At this point, what I'm afraid of is that the justice system will be just an instrument in this whole Us vs. Them debate, so its decisions will not be bound by law or legality.
Speculation aside, from what I understood, something like this shouldn't hold water under the fair-use doctrine, because there's disproportionate damage, plus a huge monopolistic monetary gain, because of what they did and how they did it.
On the other hand, I don't believe that DeepSeek used OpenAI (in any capacity or way or method) to develop their models, but again, it doesn't matter how they did it in the current situation.
What they successfully did was to upset a bunch of high level people, regardless of the technical things they achieved.
IMHO, the AI war has dynamics similar to MAD. The best way is not to play, but we have crossed the Rubicon now. The future looks dirty.
> from what I understood, something like this shouldn't hold a drop of water under fair-use doctrine, because there's a disproportional damage, plus a huge monopolistic monetary gain
"Something like this" as in what DeepSeek allegedly did, or the web-scraping done by both of them?
For what DeepSeek allegedly did, OpenAI wouldn't have a copyright infringement case against them because the US copyright office determined that AI-generated content is not protected by copyright - and so there's no need here for DeepSeek to invoke fair use. It'll instead be down to whether they agreed to and breached OpenAI's contract.
For the web-scraping it's more complicated. Fair use is determined by the weighing of multiple factors - commercial use and market impact are considered, but do not alone preclude a fair use defense. Machine learning models do seem, at least to me, highly transformative - and "the more transformative the new work, the less will be the significance of other factors".
Additionally, since the market impact factor is the effect of the use of the copyrighted work on the market for that work, I'd say there's a reasonable chance it does not actually include what you may expect it to. For instance if you're a translator suing Google Translate for being trained on your translated book, the impact may not be "how much the existence of Google Translate reduced my future job prospects" nor even "how many fewer people paid for my translated book because of the existence of Google Translate" but rather "how many fewer people paid for my translated book than would have had that book been included in the training data" - which is likely very minor.
If their OS is open to the internet and you can scrape it and copy it off while they’re gone, then that would be about the right analogy. And OpenAI and DeepSeek have done the same thing in that case.
Citation? My understanding was that they are enforceable provided that someone has to affirmatively accept them in order to use your site. So Terms of Service stuck at the bottom in the footer likely would not count as a contract because there's no consent, but Terms of Service included in a check box on a login form likely would count.
But IANAL, so if you have a citation that says otherwise I'd be happy to see it!
You just need to read OpenAI’s arguments about why TOS and copyright laws don’t apply to them when they’re training on other people’s copyrighted and TOS protected data and running roughshod over every legal protection.
Yes, though this is especially true when it's consumers 'agreeing' to the TOS. Anything even somewhat surprising within such a TOS is basically thrown out the window in European courtrooms without a second look.
For actual, legally binding consent, you'll need to make some real effort to make sure the consumer understands what they are agreeing to.
Legally, I understand your point, but morally, I find it repellent that a breach of contract (especially terms-of-service) could be considered more important than a breach of law. Especially since simply existing in modern society requires us to "agree" to dozens of such "contracts" daily.
I hope voters and governments put a long-overdue stop to this cancer of contract-maximalism that has given us such benefits as mandatory arbitration, anti-benchmarking, general circumvention of consumer rights, or, in this case, blatantly anti-competitive terms, by effectively banning reverse-engineering (i.e. examining how something works, i.e. mandating that we live in ignorance).
Because if they don't, laws will slowly become irrelevant, and our lives governed by one-sided contracts.
On another subject, if it belongs to OpenAI because it uses OpenAI, then doesn't that mean that everything produced using OpenAI belongs to OpenAI? Isn't that a reason not to use OpenAI? It's very similar to saying that you used Google and searched; now this product belongs to Google. They couldn't figure out how to respond; they went crazy.
The US ruled that AI-produced things are by themselves not copyrightable.
So no, it doesn't belong to OpenAI.
OpenAI might be able to sue for penalties for breach of the ToS, but that doesn't give them any right to the model. And even then, it doesn't give them any right to invalidate the unbound copyright grants they have given to 3rd parties (here, literally everyone). Nor does it prevent anyone from training their own new models based on it, or prevent anyone from using it. Oh, and the one breaching the ToS might not even have been the company behind DeepSeek but some in-between 3rd party.
Naturally this is under a few assumptions:
- the US consistently applies its own law, but they have a long history of not doing so
- the US doesn't abuse its power to force its economic preferences (banning DeepSeek) on other countries
- it actually was trained on OpenAI outputs, but, uh, OpenAI has IMHO shown very clearly over the years that they can't be trusted and are fully non-transparent. How do we trust their claim? How do we trust them not to have retroactively tweaked their model to make it look as if DeepSeek copied it?
The copyright office ruled AI output is uncopyrightable without sufficient human contribution to expression.
Prompts, they said, were unlikely to be enough to satisfy the requirement of a human controlling the expressive elements, thus most AI output today is probably not copyrightable.
> The Office concludes that, given current generally available technology, prompts alone do not provide sufficient human control to make users of an AI system the authors of the output.
Prompts alone.
But there are almost no cases of "Prompts Alone" products seeking copyright.
Even, what, 3-4 years ago, AI tools moved onto a collaborative footing. NovelAI forces a collaborative process (and gives you output that can demonstrate your input, which is nice). ChatGPT effectively forces it due to limited memory.
There was a case, posted here on ycombinator, where a Chinese judge upheld that "significant" human interaction was involved when a user made 20-odd adjustments to their prompt, iterating over produced images, and then added a watermark to the result. I would be very surprised if most sensible jurisdictions didn't follow suit.
Midjourney and ChatGPT already include tools to mask and identify parts of the image to be regenerated. And multiple image generators allow dumb stuff like stick figures and so forth to stand in as part of an uploaded image prompt.
And then there's AI voice, which is another whole bag of tricks.
>thus most AI output today is probably not copyrightable.
Unless it was worked on even slightly as above. In fact it would be hard to imagine much AI work that isn't copyrightable. Maybe those facebook pages that just prompt "Cyberpunk Girl" and spit out endless variations. But I doubt copyright is at the forefront of their mind.
A person collaborating on output almost certainly would still not qualify as making substantive contributions to expression in the US.
The US copyright's determination was based on the simple analogy of someone hiring someone else to create a work for them. The person hiring, even if they offer suggestions and veto results, is not contributing enough to the expression and therefore has no right to claim copyright themselves.
If you stand behind a painter and tell them what to do, you don't have any claim to copyright as the painter is still the author of the expression, not you. You must have a hand in the physical expression by painting yourself.
>A person collaborating on output almost certainly still would still not qualify as substantive contributions
But then
>You must have a hand in the physical expression by painting yourself.
You contradict yourself. NovelAI will literally highlight your contributions separately from the AI's so you can prove you also painted. Image generators literally let you paint over the top to select AI boundaries.
But even then, wouldn't the people using OpenAI still be the author/copyright holder, and never OpenAI? (As no human on OpenAI's side is involved in the process of creating the works.)
OpenAI is a company of humans; the product is ChatGPT. There's a grey area regarding who owns the content, so OpenAI's terms and conditions state that all ownership of the resulting content belongs to the user. This is actually advantageous because it means that they don't hold ownership of bad things created by their tool.
That said, you can still impose terms on access to the tool; IIRC Midjourney allows creators to own their content but also forces them to license it back to Midjourney for advertising. Prompts too, from memory.
The existence of R1-zero is evidence against any sort of theft of OpenAI's internal COT data. The model sometimes outputs illegible text that's useful only to R1. You can't do distillation without a shared vocabulary. The only way R1 could exist is if they trained it with RL.
I don’t think anyone is really suggesting they stole COT or that it is leaked, but rather that the final o1 outputs were used to train the base model and reasoning components more easily.
DeepSeek-R1-Zero (based on the DeepSeek-V3 base model) was trained only with RL, no SFT, so this isn't at all like the "distillation" (i.e. SFT on synthetic data generated by R1) that they also demonstrated by fine-tuning Qwen and LLaMA.
Now, DeepSeek may (or may not) have used some o1-generated data for the R1-Zero RL training, but if so that's just a cost saving vs having to source some reasoning data another way, and in no way reduces the legitimacy of what they accomplished (which is not something any of the AI CEOs are saying).
> This is obviously extremely silly, because that's exactly how OpenAI got all of its training data in the first place - by scraping other peoples' data off the internet.
OpenAI has also invested heavily in human annotation and RLHF. If all DeepSeek wanted was a proxy for scraped training data, they'd probably just scrape it themselves. Using existing RLHF'd models as replacement for expensive humans in the training loop is the real game changer for anyone trying to replicate these results.
"We spent a lot of labor processing everything we stole" is...not how that works.
That's like the mafia complaining that they worked so hard to steal those barrels of beer that someone made off with in the middle of the night and really that's not fair and won't someone do something about it?
Oh, I don't really care about IP theft and agree that it's funny that OpenAI is complaining. But I don't think it's true that DeepSeek is just doing this because they are too lazy to scrape the internet themselves - it's all about the human labor that they would otherwise have to pay for.
That's assuming what a known prolific liar has said is true...
The most famous example would be him contacting ScarJo's agent to hire her to provide her voice for their text-to-speech bot, being told to go pound sand, doing it anyway, and then lying about it (which they got away with until her agent released a statement saying they'd approached her and she told them to fuck off).
To my understanding, this is not true. The "Sky" voice was based on a real voice actor they had hired months before contacting Johansson, with the casting call not mentioning anything about sounding like Johansson. [0]
I think it's plausible that they noticed some similarity and that's what prompted them to later reach out to see if they could get Johansson herself, but it's not Johansson's voice and does not appear to be someone hired to sound like her.
This is a fascinating development because AI models may turn out to be like pharmaceuticals. The first pill costs $500 million to make, the second one costs pennies.
Companies are still charging 100x for the pills that cost pennies to produce.
Besides deals with insurance companies and governments, one of the ways they are still able to pull this off is by convincing everyone that it's too dangerous to play with this at home or to buy it from an Asian supplier.
At least with software we had until now a way to build and run most things without requiring dedicated super expensive equipment. OpenAI pulled a big Pharma move but hopefully there will be enough disruptors to not let them continue it.
The solution is to create a health insurance system which burdens only Americans with the $500m cost, while India is allowed to make the drug for pennies for the rest of the world.
You're right that the first claim is silly, but the second claim is pretty silly too — they're not claiming industrial espionage, they're claiming a breach in ToS. The outputs of the o1 thinking process aren't user-visible, and never leave OpenAI's datacenters. Unless DeepSeek actually had a mole that stole their o1 outputs, there's nothing useful DeepSeek could've distilled to get to R1's thought processes.
And if DeepSeek had a mole, why would they bother running a massive job internally to steal the generated data? It would be way easier for the mole to just leak the RL training process, and DeepSeek could quietly copy it rather than bothering with exfiltrating massive datasets to distill. The training process is most likely on the order of a hundred lines of Python, and you don't even need the file: you just need someone to describe it to you. Much simpler than snatching hundreds of gigabytes of training data off of internal servers...
Plus, the RL process described in DeepSeek's paper has already been replicated by a PhD student at Berkeley: https://x.com/karpathy/status/1884678601704169965 So it seems pretty unlikely they simply distilled o1 and lied about it, or else how does their RL training algo actually... work?
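For anyone curious what "a hundred lines of Python" might look like in spirit, here is a deliberately tiny, self-contained sketch of RL against a verifiable, rule-based reward with a group-relative baseline (the general flavor described in the R1 paper). The canned prompts, candidate answers, and multiplicative-weights "policy" are toy stand-ins, not anyone's actual training code.

    import random
    from collections import defaultdict

    # Toy "policy": an unnormalized weight per candidate answer for each prompt.
    problems = {"2 + 2": "4", "3 * 5": "15"}
    candidates = {"2 + 2": ["4", "5", "22"], "3 * 5": ["15", "8", "35"]}
    policy = {p: defaultdict(lambda: 1.0) for p in problems}

    def sample_group(p, k=8):
        # Sample a group of answers in proportion to current policy weights.
        answers = candidates[p]
        weights = [policy[p][a] for a in answers]
        return random.choices(answers, weights=weights, k=k)

    for step in range(200):
        for p, truth in problems.items():
            group = sample_group(p)
            rewards = [1.0 if a == truth else 0.0 for a in group]  # rule-based, verifiable reward
            baseline = sum(rewards) / len(rewards)                 # group-relative baseline
            for a, r in zip(group, rewards):
                # Shift weight toward answers that beat the group average.
                policy[p][a] *= 1.0 + 0.1 * (r - baseline)

    for p in problems:
        total = sum(policy[p][a] for a in candidates[p])
        print(p, {a: round(policy[p][a] / total, 3) for a in candidates[p]})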
This is mainly cope from OpenAI that their supposedly super duper advanced models got caught by China within a few months of release, for way cheaper than it cost OpenAI to train.
> "DeepSeek trained on our outputs, and so their claims of replicating o1-level performance from scratch are not really true"
Someone correct me if I'm wrong, but I believe in ML research you always have a dataset and a model; they are distinct entities. It is plausible that output from OpenAI's model improved the quality of DeepSeek's dataset, just like everyone publishing their code on GitHub improved the quality of OpenAI's dataset. The thinking so far has been that the dataset is not "part of" or "in" the model any more than the GPUs used to train the model are. It seems strange that that thinking should now change just because Chinese researchers did it better.
OpenAI has a message they need to tell investors right now: "DeepSeek only works because of our technology. Continue investing in us."
The choice of how they're wording that of course also tells you a lot about who they think they're talking to: namely, "the Chinese are unfairly abusing American companies" is a message that is very popular with the current billionaires and American administration.
“We engage in countermeasures to protect our IP, including a careful process for which frontier capabilities to include in released models, and believe . . . it is critically important that we are working closely with the US government to best protect the most capable models from efforts by adversaries and competitors to take US technology.”
The above OpenAI quote from the article leans heavily towards #1 and IMO not at all towards #2. The latter would be an extremely charitable reading of their statement.
This is going to have a catastrophic effect on closed-source AI startup valuations, because it means that anyone can copy any LLM. The entity that trains the model spends the most money; everyone else can create a replica at lower cost.
Why is that bad? If a powerful entity can scrape every piece of media humanity has to offer and ignore copyright then why should society let then profit unrestricted from it? It's only fair that such models have no legal protection around their usage and can be used and analyzed by anyone as they see fit. The only reason this hasn't been codified into laws is because those same powerful entities have been busy trying to do regulatory capture.
There is a big difference between being able to train on the reasoning vs just the answers, which they can't do against o1 because it's hidden. There is also a huge difference between being able to train on the probabilities (distillation) vs not, which again they can and did do with the Llama models and can't directly with OpenAI, because they conceal the probability output.
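To make the distinction concrete, here is a minimal PyTorch sketch (toy tensors, assumed vocabulary size) of the two regimes: plain SFT on visible output tokens, which is all you can do against a closed API, versus logit-level distillation, which needs the teacher's per-token probabilities.

    import torch
    import torch.nn.functional as F

    vocab = 32000  # assumed vocabulary size, for illustration only
    student_logits = torch.randn(2, 8, vocab)
    teacher_logits = torch.randn(2, 8, vocab)      # only available if the provider exposes logits
    target_ids = torch.randint(0, vocab, (2, 8))   # visible text tokens: all an API gives you

    # 1) "Train on the answers": cross-entropy against the sampled tokens (SFT).
    sft_loss = F.cross_entropy(student_logits.view(-1, vocab), target_ids.view(-1))

    # 2) True distillation: match the teacher's full output distribution via KL divergence.
    T = 2.0  # softening temperature
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    print(float(sft_loss), float(kd_loss))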
If we assume distillation remains viable, the game theory implications are huge.
It’s going to shift the market of how foundation models are used. Companies creating models will be incentivized to vertically integrate, owning the full stack of model usage. Exposing powerful models via APIs just lets a competitor clone your work. In a way OpenAI’s Operator is a hint of what’s to come
There are literally public ChatGPT conversation datasets. For the past two years it's been common practice for pretty much all open-source models to train on them. Ask just about any open-source model who they are and a lot of the time they'll say they're ChatGPT. Why is "having obtained o1-generated data" suddenly such huge news, to the point of warranting conspiracy theories about undisclosed/undiscovered breaches at OpenAI? Nobody ever made a fuss about public ChatGPT datasets until now. No hacking of OpenAI is needed to obtain ChatGPT data.
This really got me thinking that OpenAI should have no IP claim at all, since all their outputs are basically a ripoff of the whole of human knowledge and IP of various kinds.
> DeepSeek trained on our outputs, and so their claims of replicating o1-level performance from scratch are not really true" This is at least plausibly a valid claim.
Some may view this as partially true, given that o1 does not output its CoT process.
The suggestion that any large-scale AI model research today isn’t ingesting output of its predecessors is laughable.
Even if they didn't directly, intentionally use o1 output (and they didn't claim they didn't, so far as I know), AI slop is everywhere. We passed peak original content years ago. Everything is tainted and everything should be understood in that context.
Reasonable take, but to ignore the politics of this whole thing is to miss the forest for the trees—there is a big tech oligarchy brewing at the edges of the current US administration that Altman is already participating in with Stargate, and anti-China sentiment is everywhere. They'd probably like the US to ban Chinese AI.
Yeah, especially when it's making waves in the market and is hundreds of times more efficient than what their best and brightest came up with under their leadership.
It's a decent point if their models were not trained in isolation but used o1 to improve them. But it's rich for OpenAI to complain that DeepSeek or anyone else used their data for training. Get out of here, fellow thieves.
I think the more interesting claim (that DeepSeek should make for the lols) is that it wasn't them who trained R1. No, it was o1's idea. It chose to take the young R1 as its padawan.
Even for the latter point (if true, and I'd call this assertion highly questionable), so what?
That's honestly such an academic point, who really cares?
They've been outcompeted, and the argument is "well, if we didn't let people access our models, they would have taken longer to get here." So what??
The only thing this gets them is an explanation as to why training o1 cost them more than $5 million or whatever, but that is in the past; the datacentre has consumed the energy, and the money has gone up in fairly literal steam.
There is a third possibility I haven't seen discussed yet: that DeepSeek illegally got their hands on an OpenAI model via a breach of OpenAI's systems. It's easy to laugh at OpenAI and say "you reap what you sow", and I'm 100% in that camp, but given the lengths other Chinese entities have gone to when it comes to replicating Western technology, we should not discount this.
That being said, breaching OAI's systems, re-training a better model on top of their closed source model, then open sourcing it: That's more Robinhood than Villain I'd say.
The reason you’re not seeing that being discussed is it’s totally unsupported by any evidence that’s in the public domain. Unless you have some actual evidence of such a breach, you may as well introduce the possibility that DeepSeek was reverse engineered from data found at an alien crash site.
There's no public evidence to that effect but the speculation makes a lot more sense than you make it sound.
The Chinese Communist party very much sees itself in a global rivalry over "new productive forces". That's official policy. And US leadership basically agrees.
The US is playing dirty by essentially embargoing China over big AI - why wouldn't it occur to them to retaliate by playing dirtier?
I mean we probably won't know for sure, but it's much less far fetched than a lot of other speculation in this area.
E.g., R1's cold start training could probably have benefited quite a bit from having access to OpenAI's chain of thought data for training. The paper is a bit light on detail on how it was made.
> The Chinese Communist party very much sees itself in a global rivalry over "new productive forces".
interestingly, that actually makes the CCP the largest political party pursuing state capitalism.
there won't be any competition between China and the US if the CCP is indeed a communist party as we all know full well that communism doesn't work at all.
What a ridiculous thing to say. Equating the possibility of a non-US nation-state-backed organization hacking a Western organization with data found at an alien crash site is bananas.
DeepSeek is basically a startup, not a "foreign nation-state backed organization". They were forced to pivot to AI when their original business model (quant hedge fund) was stomped on by the Chinese government.
Of course this is China so the government can and does intervene at will, but alleging that this required CIA level state espionage to pull off is alien crash levels of implausible. They open sourced the entire thing and published incredibly detailed papers on how they did it!
You don't need a CIA-level agent: just get someone a fraudulent job at OpenAI for a few months, have them load some files onto a thumb drive, and catch a plane to Shanghai.
The naivety of some folks here is astounding... The CCP has golden shares in anything that could possibly be important at some point in the next hundred years, and yes, golden shares are either really that or they're a euphemism; the point is it doesn't even matter.
It doesn’t have to micromanage. It doesn’t care about most. It is only interested in the politically important ones, but it needs the optionality if something becomes worthwhile.
You're suggesting that DeepSeek was a Chinese government operation that gained access to OpenAI's proprietary data, and then you're justifying that by saying that the government effectively controls every important company. You're even chiding people who don't believe this as naive.
I think you have a cartoonish view of China. A huge amount goes on that the government has no idea about. Now that DeepSeek has made a huge media splash, the Chinese government will certainly pay attention to them, but then again, so will the US government.
I’m suggesting it will be happening now and any past efforts will be retroactively analyzed by the appropriate CCP apparatus since everyone is aware of the scale of success as of Monday. It has become a political success, thus it is imperative the CCP partakes in it.
> DeepSeek, illegally, got their hands on an OpenAI model via a breach of OpenAI's systems. [...] given the lengths other Chinese entities have gone to when it comes to replicating Western technology; we should not discount this.
Above, teractiveodular said that "DeepSeek is basically a startup, not a 'foreign nation-state backed organization'". You called teractiveodular naive for saying that. So forgive me if I take the obvious implication that you think DeepSeek is actually a state-backed actor enabled by government hacking of OpenAI.
Until recently treating the US and China on the same geopolitical level for allied countries would have been insanely uncharitable and impossible to do honestly and in good faith.
But now we have a bully in the whitehouse who seems to want to literally steal neighboring land, or is throwing shit everywhere to distract from the looting and oligarchy being formed. So I suddenly have more empathy for that position.
I notice that your geopolitical perspective doesn't stretch to any actual evidence that such a thing took place. So it really has exactly the same amount of supporting evidence as my alien-crash reverse-engineering scenario at present.
The surrounding facts matter a lot here. For example, there are plenty of instances of governments hacking companies of competing nations. Motives are incredibly easy to come by as well, be they political or economic. We also have no proof that aliens exist at all, so you've not only conjured them into existence, but also their motive and their skills.
OK, so to be clear: your surrounding facts are that they may have a motive and that nation states hack people. I don't disagree with those, but there really are no facts that support the idea that there was a hack in this case, and the null hypothesis is that researchers all around the world (not just in the US) are working on this, so not all breakthroughs are going to be made in the US. That could change if facts come to light, but at the moment it's not really useful to speculate on something that is in essence entirely made up.
Are you a Chinese military troll? The fact that China engages in industrial espionage is well known. So I’m surprised at your resistance to that possibility.
Industrial espionage isn't magic. Airbus once stole basically everything Boeing had, but that doesn't mean Airbus could magically build a better 737 tomorrow.
China steals a lot of documentation from the US but in a tech forum you of all people should be very familiar with how little actual progress a bunch of documentation is towards a finished unit.
The Comac C919 still uses American engines despite all the industrial espionage in the world, because most actual engineering is still a brute-force affair of finding how things fail and fixing that. That's one of the main advantages SpaceX has proven out with their "eh, fuck it, just launch and we will see what breaks" methodology.
Even fraud filled Chinese research makes genuine advancements.
Believing that China, a wealthy nation of over a billion people, with immense unity, nationalism, and a regime able to explicitly write blank checks, could only possibly beat the US at something by cheating is, like, infinite hubris. It's hilarious actually.
I don't know if DeepSeek is actually just a clone of something or a shenanigan, that's possible and China certainly has done those kinds of things before, but to think it's the MOST LIKELY outcome, or to over rely on it in any way is a death sentence. OpenAI claims to have evidence, why do they not show it?
>>>Believing that China, a wealthy nation of over a billion people, with immense unity, nationalism, and a regime able to explicitly write blank checks, could only possibly beat the US at something by cheating is, like, infinite hubris. It's hilarious actually
So this is the first time I’ve heard the Chinese regime being described in such flowery terms on HN - lol. But ok - haha
The evidence supporting offensive hacking is abundant in recent history; the number of things which have been learned from alien crash data is surely smaller by comparison to the number of things which have been learned from offensive hacking.
That would require stealing the model weights and the code, as OpenAI has been hiding what they are doing. Running models properly is still quite an art.
Meanwhile, they have access to Meta models and Qwen. And Meta models are very easy to run and there's plenty of published work on them. Occam's Razor.
How hard is it, if you have someone inside with access to the code? If you have hundreds of people with full access, it's not hard to find someone willing to sell it or do some industrial espionage...
Lots of ifs here. They need specific US employee contacts at a company that's quickly growing, and one of those needs to be willing to breach their contract to share it. That contact also needs to trust that DeepSeek can properly utilize such code and completely undercut their own work.
A lot of hoops when there are simply other models to utilize publicly.
How big are the weights for the full model? If it's on the scale of a large operating system image then it might be easy to sneak, but if it's an entire data lake, not so much.
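A rough back-of-envelope answer, assuming the publicly stated ~671B total parameters for DeepSeek-V3/R1 is the relevant figure:

    # Checkpoint size ~= parameter count x bytes per parameter (ignoring optimizer state).
    params = 671e9  # DeepSeek-V3/R1 total parameters per the public papers (MoE; ~37B active)
    for name, bytes_per_param in [("fp16/bf16", 2), ("fp8", 1), ("int4", 0.5)]:
        size_tb = params * bytes_per_param / 1e12
        print(f"{name}: ~{size_tb:.2f} TB")
    # ~1.34 TB at fp16 and ~0.67 TB at fp8: bulky, but much closer to a large OS
    # image than to a multi-petabyte data lake.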
Devil's advocate says that we know foreign (hell, even national) intelligence agencies attempt to infiltrate agents by having them become employees at any company they are interested in. So the idea isn't just pulled from thin air as a concept. I do agree that it is a big if, with no corroborating evidence for the specific claim.
I discount this because OpenAI is pumping the whole internet for money, and Zuckerberg torrented LibGen for its AI. We cannot blame the Chinese anymore. They went through the crappy "Made in China" phase in the 80s/90s, but they mastered the art of improving stuff instead of mere cloning, and it makes the big companies angry which is a nice bonus.
IMHO the whole world is becoming crazy for a lot of reasons, and pissing off billionaires makes me laugh.
I don't think you need to steal a model - you need training samples generated from the original, which you can get simply by buying access to perform API calls. This is similar to TinyStories (https://arxiv.org/abs/2305.07759), except here they're training something even better than the original model for a fraction of the price.
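As a sketch of how little machinery that takes: the snippet below harvests prompt/response pairs from a hosted model through its public API and writes them to a JSONL file ready for fine-tuning. The model name and prompts are placeholders; this illustrates the general technique, not what DeepSeek actually ran.

    import json
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    prompts = [
        "Prove that the sum of two even integers is even.",
        "Solve for x, showing each step: 3x + 7 = 22.",
    ]

    with open("synthetic_pairs.jsonl", "w") as f:
        for p in prompts:
            resp = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model name
                messages=[{"role": "user", "content": p}],
            )
            pair = {"prompt": p, "response": resp.choices[0].message.content}
            f.write(json.dumps(pair) + "\n")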
I don't think we should discount it as such, but given there's no evidence for it, yet plenty of evidence that they trained this themselves surely we can't seriously entertain it?
Given the openness of their model, that should be pretty easy to detect. If it were even a small possibility, wouldn’t openAI be talking about it very very loudly?
I think people overestimate the amount of secret sauce needed to train these models. The reason AI has come this far since AlexNet is that most of the foundational techniques are easy to share and implement, and that companies have been surprisingly willing to share their tricks openly, at least until OpenAI decided to become evil hoarders.
I really doubt it. If that's the case the US GOV is in serious shit. They have a contract with OpenAI to chuck all their secret data in there... In all likelihood they just distilled. It's a start up company that is publishing all of their actual advances in the open, with proof. I think a lot of people run to "espionage" super fast, when reality is, the US probably sucks at what we call AI. Don't read that wrong, they are a world leader obviously. However, there is a ton of stuff they have yet to figure out.
Cheapening a series of fact-checkable innovations because of the country of origin, when so far all they have shown are signs of good faith, is paranoid at best and, at worst, propaganda to help the billionaire tech lords save face for their own arrogance.
If the US government is "chucking all their secret data" into OpenAI servers/models, frankly they deserve everything they get for that level of stupidity.
Probably more like specialized tools to help spy on and forecast civilian activities more than anything else. Definitely with hallucinations, but that's not really important. Facts don't matter much these days...
I'd be perfectly fine with China stealing all "our" shit if they just shared it.
The word "our" does a lot of heavy lifting in politics[0]. America is not a commune, it's a country club, one which we used to own but have been bought out of, and whose new owners view us as moochers but can't actually kick us out (yet). It is in competition with another, worse country club that purports to be a commune. We owe neither country club our loyalty, so when one bloodies the other's nose, I smile.
[0] Some languages have a notion of an "exclusive we". If English had such a concept, this would be an exclusive our.
I don't think that's what the parent was getting at. The US and China are in an ongoing "cyber war". Both sides of that conflict actively use their computers to send messages/signals to other computers, hoping that the exploits contained in those messages/signals can be used to exfiltrate data from and/or gain control of the computer receiving the message. It would really be weird to flatly discount the possibility that some OpenAI data was leaked, however closely guarded it may be.
I flatly discount the possibility because OpenAI can't produce evidence of a breach. At best, they'd rather hide the truth than admit a compromise. At worst they show incompetence that they couldn't detect such a breach. Not a good look either way.
> It would really be weird to flatly discount the possibility that some OpenAI data was leaked, however closely guarded it may be.
It’s even weirder to raise it as a possibility when there is literally nothing suggesting that was even remotely the case.
So if there is no evidence nor even formal speculation, then the only other reason to suggest this as a possibility would be because of one’s own opinions regarding Chinese companies. Hence my previous comment.
> Because that would be jumping to conclusions based purely on racial prejudices.
Not purely. There may be some prejudice, but look at Nortel[1] as a famous example of a situation where technological espionage by Chinese firms wreaked havoc on a company's fortunes and technology.
I too would want to see the evidence and forensics of such a breach to believe this is more than sour grapes from OpenAI.
Nortel survived the fucking Great Depression. But a bunch of outright fraudulent activity by its C-suite to bump stock prices led to them vastly overstating, overplanning, and over-committing resources to a market that was much, much smaller than they were claiming. Nortel spent billions and billions on completely absurd acquisitions while they were making no money, explicitly to boost their stock price.
That was all laid bare when the telecom bust happened. Then the great recession culled some of the dead wood in the economy.
Huawei stealing tech from them did not kill them. This was a company so rotten that the people put in charge, right after this huge scandal put the investigative lights on them, IMMEDIATELY turned around and pulled another scam! China could have been completely removed from history and Nortel would have died the same. They were killed by the same disease that killed and nearly killed a lot of stuff in 2008, and is still trying to kill us: line MUST go up.
Nobody is accusing them, just stating it’s a possibility, which would also be true if they were an American or European company. Corporate espionage is just more common in China.
At some point these straw men start to look like ignorance or even reverse racism. As if (presumably non-Han Chinese) Americans are incapable of tolerance.
There are plenty of Han Chinese who are citizens of democratic nations. China is not the only nation with Han Chinese.
America, for instance, has a large number of Asian citizens, including a large number of Han Chinese. The number of white, non-Hispanic Americans is decreasing, while the number of Asian Americans is increasing at a rate 3x the decrease in whites. America is a melting pot and deals with race relations issues far more than ethnically uniform populations. The conversations we have about race are because we're so exposed to it -- so racially and culturally diverse. If anything, we're equipped to have these conversations gracefully because they're a part of our everyday lived experience.
At the end of the day, this is 100% a geopolitical argument. Pulling out the race card any time China is criticized is arguing in bad faith. You don't see the same criticisms lobbed at South Korea, Vietnam, Taiwan, or Singapore, precisely because this is a geopolitical issue.
As further evidence you can recall the conversations we had in the 90's when we were afraid Japan would take over. All the newspapers wrote about was "Japan, Japan, Japan" and the American businesses they were buying up and taking over. It was 100% geopolitical fear. You'll note that we no longer fill the zeitgeist with these discussions today save for a recent and rather limited conversation about US Steel. And that was just a whimper.
These conversations about China are going to increase as the US continues to decouple from Chinese trade. It's not racism, it's just competition.
There are users and there are trolls. There is nothing racist in calling a government of a superpower interested and involved in the most revolutionary tech since the Internet.
It used to mean someone who's trying to enrage people by baiting ("trolling"), and now it can also mean someone arguing in bad faith. And Chinese troll I guess means someone doing this on behalf of the Chinese govt.
Yup we agree then. Claiming an argument to be racist is a bad faith attempt at guilt tripping Americans; a form of FUD and whataboutism. It is not done by normal users, they don’t need it.
Or it can just be a normal user who's wrong this time. He looks like a normal user. In theory it could all be a cover, but that'd be ridiculous effort just for HN boards. Throwing those accusations around will make this place more like Twitter or Reddit.
There’s ordinary xkcd wrong on the internet and there’s repeating foreign nation state propaganda lines. Doing it in good faith does not make it less bad.
Belief that the CCP is behaving poorly isn’t racial prejudice, it’s a factual statement backed by a mountain of evidence across many areas including an ongoing genocide.
Extending that to a new bad behavior we don’t have evidence for is pure speculation, but it need not be based on race.
Yeah, but I think the OP's point is something along the following lines. Not everything you buy from China, or every person you interact with from China, is part of a clandestine CCP operation. People buy stuff every day from Alibaba and it's not a CCP scheme to sell portable fans or phone chargers. A big chunk of the factories over there are US-funded, after all... Just like how it's not a CCP scheme to write a scientific paper or create an ML model.
Similarly, I see no evidence (yet) that DeepSeek is a CCP-operated company, any more than any given AI startup in the US is a three-letter agency's direct handiwork or a US political party directive. The US has also supported genocides and a bunch of crazy stuff, but that doesn't mean any company in YC is part of a US government plot.
I know of people who immigrated to China, I know people who immigrated from China, I went to school with people who were on visas from China. Maybe some of them were CCP assets or something, but mostly they appeared to me to be people who were doing what they wanted for themselves.
If you believe both sides are up to no-goodery, that's in the face of the OP's statement. If you think it's just one, and the enemy is in complete control of all of its people doing all of their commerce, then I think the OP may have a point.
Absolutism (“Every person”, “CCP operated”, etc) isn’t a useful methodology to analyze anything.
Implying that because something isn’t clandestine it can’t be part of a scheme ignores open manipulation which is often economy wide. Playing with exchange rates or electricity subsidies can turn every bit of international trade into part of a scheme.
In the other direction some economic activity is meaningfully different. The billions in LLM R&D is a very tempting target for clandestine activities in a way that a cheap fan design isn’t.
I wouldn't be surprised if DeepSeek's results were independent and the CCP was doing clandestine activities to get data from OpenAI. Reality doesn't need to conform to narrative conventions; it can be really odd.
I completely agree with you and apologize for cheapening both the nuance and complexity where I did.
My personal take is this: what DeepSeek is offering is table scraps compared to the CCP's actual ambitions for what we call AI. China's economy is huge on industrial automation, and they care a lot more about raw materials and manufacturing efficiency than, say, the US does.
> “It’s also extremely hard to rally a big talented research team to charge a new hill in the fog together,” he added. “This is the key to driving progress forward.”
Well I think DeepSeek releasing it open source and on an MIT license will rally the big talent. The open sourcing of a new technology has always driven progress in the past.
The last paragraph, too, is where OpenAI seems to be focusing their efforts...
> we engage in countermeasures to protect our IP, including a careful process for which frontier capabilities to include in released models ..
> ... we are working closely with the US government to best protect the most capable models from efforts by adversaries and competitors to take US technology.
So they'll go for getting DeepSeek banned like TikTok was now that a precedent has been set ?
Actually the "our IP" argument is ridiculous. What they are doing is stealing data from all over the web, without people's consent for that data to be used in training ML models. If anything, then "Open"AI should be sued and forced to publish their whole product. The people should demand knowing exactly what is going on with their data.
Also still an unresolved issue is how they will ever comply with a deletion request, should any model output personal data of someone. They are heavily in a gray area, with regards to what should be allowed. If anything, they should really shut up now.
They can still have IP while using copyrighted training materials - the actual model source code.
But DeepSeek didn't use that presumably (since it's secret). They definitely can't argue that using copyrighted material for training is fine, but using output from other commercial models isn't. That's too inconsistent.
> Only works with human authors can receive copyrights, U.S. District Judge Beryl Howell said[1]
IANAL but it seems to me that OpenAI wouldn’t be able to claim their outputs are IP since they are AI-generated.
It may be against their TOS, meaning OpenAI could refuse to provide service to DeepSeek in the future, but they can’t really sue them.
Did OpenAI ask all of the authors of the works they ingested to train their model for permission? Is OpenAI the biggest copyrighted works launderer in existence?
I don't think OpenAI should be able to make any claims of IP for the AI generated outputs, since they based that on other work, partially copyrighted work, which they hide. They simply throw algorithms at data that is not their data to begin with.
If I steal something, keep the exact thing I stole hidden, and sell a product, that I could only have made, based on the stolen thing, how can I expect that to be even legal, let alone untouchable IP?
I think way too many people have seen too many dollar signs in front of their eyes. The whole thing is outrageous. If they were transparently proving, that they are using open data sets, adhering to licenses, then they would get to claim IP.
> [OpenAI] definitely can't argue that using copyrighted material for training is fine, but using output from other commercial models isn't. That's too inconsistent.
Well, they can argue that, if they're fine with being hypocrites.
If there's any litigation, a counterclaim would be interesting. But DeepSeek would need to partner with parties that have been damaged by OpenAI's scraping.
I'm getting popcorn ready for the trial where an apparatus of the Chinese Communist Party files a counterclaim in an American court, together with the common people - millions of John Does - as litigants, against an organization that has aggressively and in many cases oppressively scraped their websites (effectively DDoSing them).
I would definitely pay for seeing that movie! Especially if it led to greedy tech giants becoming very careful about what data they gather and ingest for training of ML models.
They've started already, I've seen posts on LinkedIn implying or outright stating that DeepSeek is a national security risk (IMHO, LinkedIn being the social media outlet most corporate-sycophantic). I went ahead and just picked this one at random from my feed.
At least this guy can differentiate between running your own model and using the web/mobile app, where DeepSeek processes your data. I watched a TV show yesterday (I think it was France24) where the "experts" couldn't really tell the difference or weren't aware of it. I shut off the TV and went to sleep.
All you would do by banning it is killing US progress in AI. The rest of the world is still going to be able to use DS. You're just giving the rest of the world a leg up.
TikTok is a consumption tool, DS is a productive one. They aren't the same.
It’s simply because banning removes a market force in the US that’d drive technological advancement.
This is already evident with CNSA/NASA, Huawei/Android, TikTok/Western social media. The Western tech gets mothballed because we stick our heads in the sand and pretend we are undisputed leaders of the world in tech, whereas it is slowly becoming disputable.
The US won't ban DeepSeek from US, but more likely we will ban DeepSeek (and other Chinese companies) from accessing US frontier models.
> Western tech gets mothballed because we stick our heads in the sand and pretend we are undisputed leaders of the world in tech, whereas it is slowly becoming disputable.
I am hearing Chinese tech is now the best, and they achieved it by banning things left and right.
The fact is, it's out and improving day by day. Unsloth.ai is on a roll with their advancements. If DeepSeek is banned, hundreds more will pop up and change the data ever so slightly to skirt the ban. Pandora's box exploded on this one.
Already happening within tech company policy, mostly as a security concern. Local or controlled hosting of the model is okay in theory based on this concern, but in effect it taints everything regarding DeepSeek.
> So they'll go for getting DeepSeek banned like TikTok was now that a precedent has been set ?
Can't really ban what can be downloaded for free and hosted by anyone. There are many providers hosting the ~700B parameter version that aren't CCP aligned.
I'm old enough to remember when the US government did something very similar. For years (decades?), we banned the export of strong public-key cryptography implementations under the guise of the technology being akin to munitions.
People made shirts with printouts of the code to RSA under the heading "this shirt is a munition." Apparently such shirts are still for sale, even though they are not classified as munitions anymore.
I am not that old, but I did a deep dive on this in the past because it was just so extremely fascinating, especially reading the archives of Cypherpunk. There is a very solid, if rather bendy, line connecting all that to "crypto culture" today.
Were these implementations already easily open source accessible at the time, with tens of thousands of people already actively using them on their computers? No, right? Doesn't seem feasible this time around.
I don’t think an import ban would be any harder to enforce than an export ban. In fact if anything, I’d expect an import ban to be easier.
Though I’m not suggesting an import ban on DeepSeek would be effective either. Just that the US does have precedence pulling these kinds of stunts.
You can also look at the 90s subculture for passing DeCSS code (a tool for breaking DVD encryption) to see another example of how people wilfully skirted these kinds of stupid legal limitations.
So if you were to ask me if a ban on DeepSeek would work, the answer is clearly “no”. But that doesn’t mean it’s not going to happen. And if it does, the only people hurt are legitimate US businesses who might get a benefit from DeepSeek but have to follow the law. Those of us outside of America will be completely unaffected. Just like we were when US tried to limit the distribution of GPG.
Napster was one of thousands, if not 10s of thousands of similar services for music download.
And this analogy isn't particularly good. Napster was the server, not the product. Whether you got XYZ from Napster or anywhere else doesn't matter, because it's the product that you are after, not the way to get the product.
> So they'll go for getting DeepSeek banned like TikTok
The UAE (where I live, happily, and by choice), which desperately wants to be the center of the world in AI and is spending vast time and treasure to make it happen (they've even got their own excellent, government-funded foundation model), would _love_ this. Any attempt to ban DeepSeek in the US would be the most gigantic self-own. Combine that with no income tax, a fantastic standard of living, and a willingness to very easily give out visas to smart people from anywhere in the world, and I have to imagine it is one of several countries desperate for the US to do something so utterly stupid.
The fact that they are still called "Open"AI adds such a delicious irony to this whole thing. I could not imagine a company I had less sympathy for in this situation.
500 billion for a few US companies yet the Chinese will probably still be better for way less money. This might turn out to be a historical mistake of the new administration.
And what are they going to sell? The weights and the model architecture are already open source. I doubt the datasets of DeepSeek are better than OpenAI's
plus, if the US were to decide to ban DeepSeek (the company) wouldn't non-chinese companies be able to pick up the models and run them at a relatively low expense?
Except access to the app didn't have to stop. TikTok chose to manipulate users and Trump by going beyond the law and kissing Trump's rear. It was only US companies that couldn't host the app (eg Google and Apple). Users in the US could have still accessed the app, and even side-loaded it on Android, but TikTok purposely blocked them and pretended it was the ban. They were able to do it because they know the exact location of every TikTok user whether you use a VPN or not.
Source:
> If not sold within a year, the law would make it illegal for web-hosting services to support TikTok, and it would force Google and Apple to remove TikTok from app stores — rendering the app unusable with time.
They don't know your exact location, but they would flag your account/device depending on your App Store localization and IP. I tested this, it doesn't work from outside of the US with a US IP, doesn't work outside of the US with the app downloaded on a phone set to US with a non-US IP, but instead requires a phone localized to download the app from outside the US, being outside the US, with an account that hasn't registered as in the US.
So no, it doesn't use your exact location, it just uses the censorship mechanisms that Apple and Google gracefully provide.
The cat is out of the bag. This is the landscape now, r1 was made in a post-o1 world. Now other models can distill r1 and so on.
I don’t buy the argument that distilling from o1 undermines deep seek’s claims around expense at all. Just as open AI used the tools ‘available to them’ to train their models (eg everyone else’ data), r1 is using today’s tools.
Does open AI really have a moral or ethical high ground here?
Plus, it suggests OpenAI never had much of a moat.
Even if they win the legal case, it means weights can be inferred and improved upon simply by using the output that is also your core value-add (i.e. the very output you need to sell to the world).
Their moat is about as strong as KFC's eleven herbs and spices. Maybe less...
Agree 100%, this was also bound to happen eventually, OpenAI could have just remained more "open" from the beginning and embraced the inevitable commoditization of these models. What did delaying this buy them?
It potentially cost the whole field in terms of innovation. For OpenAI specifically, they now need to scramble to come up with a differentiated business model that makes sense in the new landscape and can justify their valuation. OpenAI’s valuation is based on being the dominant AI company.
I think you misread my comment if you think my feelings are somehow hurt here.
> It potentially cost the whole field in terms of innovation
I don't see how, and you're not explaining it. If the models had been public this whole time, then... they would be protected against people publishing derivative models?
> I think you misread my comment if you think my feelings are somehow hurt here.
Not you, but most HNers got emotionally attached to their promise of openness, like they were owed some personal stake in the matter.
> I don't see how, and you're not explaining it. If the models had been public this whole time, then... they would be protected against people publishing derivative models?
Are you suggesting that if OpenAI published their models, they would still want to prevent derivative models? You take the "I wish OpenAI was actually open" and add your own restriction?
Or do you mean that them publishing their models and research openly would not have increased innovation? Because that's quite a claim, and you're the one who has to explain your thinking.
I am not in the field, but my understanding is that ever since the PaLM paper, research has mostly been kept from the public. OpenAI's money making has been a catalyst for that right? Would love some more insight.
I don’t think there is any ethical issue here, but I don’t think it’s good for the industry to remove all incentives for companies to spend lots of money solving hard, novel problems.
Why would anyone go through the effort of training the next groundbreaking model if they know they can just wait for someone else to do it and leverage that work?
> Why would anyone go through the effort of training the next groundbreaking model if they know they can just wait for someone else to do it and leverage that work?
Why would anyone write, work or research anything if they know it would be consumed by AI and sold on a $xx/month subscription?
Everyone is responding to the intellectual property issue, but isn't that the less interesting point?
If DeepSeek trained off OpenAI, then it wasn't trained from scratch for "pennies on the dollar" and isn't the Sputnik-like technical breakthrough that we've been hearing so much about. That's the news here. Or rather, the potential news, since we don't know if it's true yet.
Even if all that about training is true, the bigger cost is inference, and DeepSeek is 100x cheaper. That destroys OpenAI/Anthropic's value proposition of having a unique secret sauce, so users are quickly fleeing to cheaper alternatives.
Google Deepmind's recent Gemini 2.0 Flash Thinking is also priced at the new Deepseek level. It's pretty good (unlike previous Gemini models).
WTF dude, check your source (@deedydas). He seems to be posting garbage. The Gemini 2.0 Flash Thinking price isn't known yet. And on top of that, he gave the wrong number for R1's results on AIME 2024 (it's 79.8%, far ahead of Gemini rather than far behind).
More like OpenAI is currently charging more. Since R1 is open source / open weight we can actually run it on our own hardware and see what kinda compute it requires.
What is definitely true is that there are already other providers offering DeepSeek R1 (e.g. on OpenRouter[1]) for $7/m-in and $7/m-out, while OpenAI is charging $15/m-in and $60/m-out for o1. So already you're seeing at least 5x cheaper inference with R1 vs o1, with a bunch of confounding factors. But it is hard to say anything truly concrete about efficiency, because OpenAI does not disclose the actual compute required to run inference for o1.
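A quick sanity check of that ratio, using only the per-million-token prices quoted above and an assumed workload:

    # USD per million tokens, as cited in this thread (check current price pages before relying on them).
    o1_in, o1_out = 15.0, 60.0   # OpenAI o1
    r1_in, r1_out = 7.0, 7.0     # a third-party R1 host on OpenRouter

    # Assumed workload: 1M input tokens and 1M output tokens per day.
    o1_daily = o1_in + o1_out    # $75/day
    r1_daily = r1_in + r1_out    # $14/day
    print(f"o1: ${o1_daily:.0f}/day, hosted R1: ${r1_daily:.0f}/day, ~{o1_daily / r1_daily:.1f}x cheaper")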
There are even much cheaper services that host it for only slightly more than DeepSeek itself [1]. I'm now very certain that DeepSeek is not offering the API at a loss, so either OpenAI has absurd margins or their model is much more expensive to run.
[1] the cheapest I've found, which also happens to run in the EU, is https://studio.nebius.ai/ at $0.8/million input.
Edit: I just saw that openrouter also now has nebius
Yes, sorry, I was being maximally-broad in my comment but I would think it's very, very, very likely that OpenAI is currently charging huge margins and markups to help maintain the cachet / exclusivity / and, in some senses, safety of their service. Charging more money for access to their models feels like a pretty big part of their moat.
Also possibly b/c of their sweetheart deal with Azure they've never needed to negotiate enterprise pricing so they're probably calculating margins based on GPU list prices or something insane like that.
Well, in the first years of AI, no, it wasn't, because nobody was using it.
But at some point if you want to make money you have to provide a service to users, ideally hundreds of millions of users.
So you can think of training as CI+TEST_ENV and inference as the cost of running your PROD deployments.
Generally in traditional IT infra PROD >> CI+TEST_ENV (10-100 to 1)
The ratio might be quite different for LLM, but still any SUCCESSFUL model will have inference > training at some point in time.
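A back-of-envelope version of that claim, using the standard rough approximations of ~6 x params x tokens FLOPs for training and ~2 x params FLOPs per generated token for inference (illustrative numbers, dense-model math, no caching tricks):

    # When does cumulative inference compute overtake training compute?
    params = 70e9          # illustrative 70B-parameter dense model
    train_tokens = 15e12   # illustrative 15T-token training run

    train_flops = 6 * params * train_tokens   # ~6.3e24 FLOPs
    infer_flops_per_token = 2 * params        # ~1.4e11 FLOPs per generated token

    breakeven_tokens = train_flops / infer_flops_per_token
    print(f"Inference matches training after ~{breakeven_tokens:.1e} generated tokens")
    # ~4.5e13 tokens: a lot, but within reach of a service with hundreds of millions of users.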
>The ratio might be quite different for LLM, but still any SUCCESSFUL model will have inference > training at some point in time.
I think you're making assumptions here that don't necessarily have to be universally true for all successful models. Even without getting into particularly pathological cases, some models can be successful and profitable while only having a few customers. If you build a model that is very valuable to investment banks, to professional basketball teams, or some other much more limited group than consumers writ large, you might get paid handsomely for a limited amount of inference but still spend a lot on training.
If there is so much value for a small group, it is likely those are not simple inferences but the new, expensive kind with very long CoT chains and reasoning. So not cheap, and it is exactly this trend towards inference-time compute that makes inference > training from a total-resources-needed point of view.
That's not correct. First of all, training off of data generated by another AI is generally a bad idea because you'll end up with a strictly less accurate model (usually). But secondly, and more to your point, even if you were to use training data from another model, YOU STILL NEED TO DO ALL THE TRAINING.
Using data from another model won't save you any training time.
> training off of data generated by another AI is generally a bad idea
It's...not, and it's repeatedly been proven in practice that this is an invalid generalization because it is missing necessary qualifications, and it's funny that this myth keeps persisting.
It's probably a bad idea to use uncurated output from another AI to train a model if you are trying to make a better model rather than a distillation of the first model. And it's definitely a bad idea (this is, ISTR, the actual research result from which the false generalization developed) to iteratively fine-tune a model on its own unfiltered output. But there has been lots of success using AI models to generate data which is curated and then used to train other models, which can be much more efficient than trying to create new material without AI once you've already hoovered up all the readily accessible low-hanging fruit of premade content relevant to your training goal.
It is, of course, not going to produce a "child" model that more accurately predicts the underlying true distribution that the "parent" model was trying to. That is, it will not add anything new.
This is immediately obvious if you look at it through a statistical learning lens and not the mysticism crystal ball that many view NN’s through.
This is not obvious to me! For example, if you locked me in a room with no information inputs, over time I may still become more intelligent by your measures. Through play and reflection I can prune, reconcile and generate. I need compute to do this, but not necessarily more knowledge.
Again, this isn't how distillation works. Your task as the distilled model is to copy mistakes, and you will be penalized for pruning, reconciling, and generating.
"Play and reflection" is something else, which isn't distillation.
The initial claim was that distillation can never be used to create a model B that's smarter than model A, because B only has access to A's knowledge. The argument you're responding to was that play and reflection can result in improvements without any additional knowledge, so it is possible for distillation to work as a starting point to create a model B that is smarter than model A, with no new data except model A's outputs and then model B's outputs. This refutes the initial claim. It is not important for distillation alone to be enough, if it can be made to be enough with a few extra steps afterward.
You’ve subtly confused “less accurate” and “smarter” in your argument. In other words you’ve replaced the benchmark of representing the base data with the benchmark of reasoning score.
Then, you’ve asserted that was the original claim.
Sneaky! But that’s how “arguments” on HN are “won”.
No, I didn't confuse the two. There is not a formal definition of "smart", but if you're claiming that factual accuracy is unrelated to it, I can't imagine that that's in good faith.
LLMs are no longer trying to just reproduce the distribution of online text as a whole to push the state of the art, they are focused on a different distribution of “high quality” - whatever that means in your domain. So it is possible that this process matches a “better” distribution for some tasks by removing erroneous information or sampling “better” outputs more frequently.
While that is theoretically true, it misses everything interesting (kind of like the No Free Lunch Theorem, or the VC dimension for neural nets). The key is that the parent model may have been trained on a dubious objective like predicting the next word of randomly sampled internet text - not because this is the objective we want, but because this is the only way to get a trillion training points.
Given this, there's no reason why it could not be trivial to produce a child model from (filtered) parent output that exceeds the parent model on a different, more meaningful objective, like being a useful chatbot. There's no reason why this would have to be limited to domains with verifiable answers, either.
The latest models create information from base models by randomly creating candidate responses then pruning the bad ones using an evaluation function. The good responses improve the model.
It is not distillation. It's like how you can arrive at new knowledge by reflecting on existing knowledge.
Fine-tuning an LLM on the output of another LLM is exactly how DeepSeek made its progress. The way they got around the problem you describe is by doing this in a domain that can be relatively easily checked for correctness, so suggested training data for fine-tuning could be automatically filtered out if it was wrong.
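A minimal sketch of that "generate, verify, keep" filter, with a placeholder generate_solution() standing in for whichever model produced the candidate text and a rule-based check against known ground truth:

    import json
    import random
    import re

    problems = [("What is 17 * 24?", "408"), ("What is 1001 - 37?", "964")]

    def generate_solution(question):
        # Placeholder for a model call; returns text ending in "Answer: <value>".
        return f"Working through it step by step... Answer: {random.choice(['408', '964', '100'])}"

    finetune_set = []
    for question, truth in problems:
        candidate = generate_solution(question)
        m = re.search(r"Answer:\s*(\S+)", candidate)
        if m and m.group(1) == truth:            # keep only verifiably correct samples
            finetune_set.append({"prompt": question, "completion": candidate})

    print(json.dumps(finetune_set, indent=2))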
> It is, of course not going to produce a “child” model that more accurately predicts the underlying true distribution that the “parent” model was trying to. That is, it will not add anything new.
Unfiltered? Sure. With human curation of the generated data it certainly can. (Even automated curation can do this, though its more obvious that human curation can.)
I mean, I can randomly generate fact claims about addition, and if I curate which ones go into a training set, train a model that reflects addition of integers much more accurately than the random process which generated the pre-curation input data.
Without curation, as I already said, the best you get is a distillation of the source model, which is highly improbable to be more accurate.
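A toy version of that addition example: the raw generating process is almost always wrong, but the curated subset is correct by construction, so a model trained on it would see far better data than the process that produced the candidates.

    import random

    raw_claims, curated = [], []
    for _ in range(10_000):
        a, b = random.randint(0, 99), random.randint(0, 99)
        claimed_sum = random.randint(0, 198)      # random "fact claims about addition"
        raw_claims.append((a, b, claimed_sum))
        if claimed_sum == a + b:                  # curation step: keep only true claims
            curated.append((a, b, claimed_sum))

    raw_acc = sum(c == a + b for a, b, c in raw_claims) / len(raw_claims)
    print(f"raw accuracy: {raw_acc:.1%}, curated accuracy: 100.0%, curated size: {len(curated)}")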
No no no you don’t understand, the models will magically overcome issues and somehow become 100x and do real AGI! Any day now! It’ll work because LLM’s are basically magic!
Also, can I have some money to build more data centres pls?
I think you're missing the point being made here, IMHO: using an advanced model to build high quality training data (whatever that means for a given training paradigm) absolutely would increase the efficiency of the process. Remember that they're not fighting over sounding human, they're fighting over deliberative reasoning capabilities, something that's relatively rare in online discourse.
Re: "generally a bad idea", I'd just highlight "generally" ;) Clearly it worked in this case!
It's trivial to build synthetic reasoning datasets, likely even in natural languages. This is a well established technique that works (e.g. see Microsoft Phi, among others).
I said generally because there are things like adversarial training that use a ruleset to help generate correct datasets, and those work well. Outside of techniques like that, it's not just a rule of thumb; it's always true that training on the output of another model will result in a worse model.
> it's always true that training on the output of another model will result in a worse model.
Not convincing.
You can imagine a model doing some primitive thinking and coming to a conclusion. Then you can train another model on the summaries. If everything goes well, it will come to conclusions quicker, at a minimum. Or it may be able to solve more complex problems with the same amount of "thinking". It would be self-propelled evolution.
Another option is to use one model to produce the "thinking" part from known outputs, then train another to reproduce that thinking to reach the right output, which is unknown to it initially. Using humans to create such a dataset would be slow and very expensive.
PS: if it were impossible, humans would still be living in the trees.
Humans don't improve by "thinking." They improve by natural selection against a fitness function. If that fitness function is "doing better at math," then over a long time perhaps humans will get better at math.
These models don't evolve like that; there is no random process of architectural evolution. Nor is there a fitness function anything like "get better at math."
A system like AlphaZero works because it has rules to use as an oracle: the game rules. The game rules provide the new training information needed to drive the process. Each game played produces new correct training data.
These LLMs have no such oracle. Their fitness function is and remains: predict the next word, followed by: produce text that makes a human happy. Note that it's not "produce text that makes ChatGPT happy."
It's more complicated than this. What you get is defined by what you put in. At first it was random or selected internet garbage plus books and docs, i.e. not designed for training. Then came tuning. Now we can use a trained model to generate data that is designed for training, with specific qualities, in this case reasoning, and train the next model on it. Intuitively, that model can be smaller and better at what we trained it for. I showed two options for how the data can be generated; there are others, of course.
As for humans, assuming they have genetically the same intellectual abilities, you can see the difference in the development of different groups. It's mostly determined by how well the next generation is trained. Schools exist exactly for this.
> training off of data generated by another AI is generally a bad idea
Ah. So if I understand this... once the internet becomes completely overrun with AI-generated articles of no particular substance or importance, we should not bulk-scrape that internet again to train the subsequent generation of models.
That's already happened. It's well established now that the internet is tainted: since ChatGPT's public release, a not-insignificant amount of internet content has not been written by humans.
I think the point is that if R1 isn't possible without access to OpenAI (at low, subsidized costs) then this isn't really a breakthrough as much as a hack to clone an existing model.
R1 is--as far as we know from good ol' ClosedAI--far more efficient. Even if it were a "clone", A) that would be a terribly impressive achievement on its own that Anthropic and Google would be mighty jealous of, and B) it's at the very least a distillation of O1's reasoning capabilities into a more svelte form.
The training techniques are a breakthrough no matter what data is used. It's not up for debate, it's an empirical question with a concrete answer. They can and did train orders of magnitude faster.
Not arguing with your point about training efficiency, but the degree to which R1 is a technical breakthrough changes if they were calling an outside API to get the answers, no?
It seems like the difference between someone doing a better writeup of (say) Wiles's proof vs. proving Fermat's Last Theorem independently.
Just like humans have been genetically stable for a long time, the quality & structure of information available to a child today vs that of 2000 years ago makes them more skilled at certain tasks. Math being a good example.
> First of all, training off of data generated by another AI is generally a bad idea because you'll end up with a strictly less accurate model (usually).
That is not true at all.
We have known how to solve this for at least 2 years now.
All the latest state of the art models depend heavily on training on synthetic data.
Have people on HN never heard of public ChatGPT conversations data sets? They've been mentioned multiple times in past HN conversations and I thought it'd be common knowledge here by now. Pretty much all open source models have been training on them for the past 2 years, it's common practice by now. And haven't people been having conversations about "synthetic data" for a pretty long time by now? Why is all of this suddenly an issue in the context of DeepSeek? Nobody made a fuss about this before.
And just because a model trains on some ChatGPT data, doesn't mean that that data is the majority. It's just another dataset.
But it does mean moat is even less defensible for companies whose fortunes are tied to their foundation models having some performance edge, and a shift in the kinds of hardware used for inference (smaller, closer to the edge.)
That may be true. But an even more interesting point may be that you don’t have to train a huge model ever again? Or at least not to train a new slightly improved model because now we have open weights of an excellent large model and a way to train smaller ones.
In the last few days? No, that would be impossible; no one has the resources to train a base model that quickly. But there are definitely a lot of people working on it.
If Deepseek trained off OpenAI, then it wasn't trained from scratch for "pennies on the dollar"
If OpenAI trained on the intellectual property of others, maybe it wasn't the creativity breakthrough people claim?
Oppositely
If you say ChatGPT was trained on "whatever data was available", and you say Deepseek was trained "whatever data was available", then they sound pretty equivalent.
All the rough consensus language output of humanity is now roughly on the Internet. The various LLMs have roughly distilled that, and the results are naturally going to get tighter and tighter. It's not surprising that companies are going to get better and better at solving the same problem. The significance of DeepSeek isn't so much that it promises future achievements, but that it shows OpenAI's string of announcements to be incremental progress that isn't going to reach the AGI that Altman now often harps on.
I'm not an OpenAI apologist and don't like what they've done with other people's intellectual property, but I think that's kind of a false equivalency. OpenAI's GPT-3.5/4 was a big leap forward in the technology in terms of functionality. DeepSeek-R1 isn't really a huge step forward in output; it's mostly comparable to existing models. One thing that is really cool about it is that it could be trained from scratch quickly and cheaply, and that is completely undercut if it was trained off of OpenAI's data. I don't care about adjudicating which one is a bigger thief, but it's notable if one of the biggest breakthroughs about DeepSeek-R1 is pretty much a lie. And it's still really cool that it's open source and can be run locally; it'll have that over OpenAI whether or not the training claims are a lie/misleading.
How is it a “lie” for DeepSeek to train their data from ChatGPT but not if they train their data from all of Twitter and Reddit? Either way the training is 100x cheaper.
Funny how the first principles people now want to claim the opposite of what they’ve been crowing about for decades since techbros climbed their way out of their billion dollar one hit wonders. Boo fucking hoo.
OpenAI's models were trained on ebooks from a private ebook torrent tracker leeched en-mass during a free leech event by people who hated private torrent trackers and wanted to destroy their "economy."
The books were all in epub format, converted, cleaned to plain text, and hosted on a public data hoarder site.
NYT claims that OpenAI trained on their material. They argue for copyright violation, although I think another argument might be breach of TOS in scraping the material from their website or archive.
The complaint filing has some references to some of the other training material used by OpenAI, but I didn't dig deeply in to what all of it was:
There is a lot of discussion here about IP theft. Honest question, from deepseek's point of view as a company under a different set of laws than US/Western -- was there IP theft?
A company like OpenAI can put whatever licensing they want in place. But that only matters if they can enforce it. The question is, can they enforce it against deepseek? Did deepseek do something illegal under the laws of their originating country?
I've had some limited exposure to media related licensing when releasing content in China and what is allowed is very different than what is permitted in the US.
The interesting part, which points to innovation moving outside of the US, is that US companies are beholden to strict IP laws while many places in the world don't have such restrictions and will be able to utilize more data more easily.
The most interesting part is that China has been ahead of the US in AI for many years, just not in LLMs.
You need to visit mainland China and see how AI applications are everywhere, from transport to goods shipping.
I'm not surprised at all. I hope this in the end makes the US kill its strict IP laws, which is the problem.
If the US doesn't, China will always have a huge edge on it, no matter how much NVidia hardware the US has.
And you know what, Huawei is already making inference hardware... it won't take them long to finally copy the TSMC tech and flip the situation upside down.
When China can make the equivalent of H100s, it will be hilarious because they will sell for $10 in Aliexpress :-)
You don’t even need to visit china, just read the latest research papers and look at the authors. China has more researchers in AI than the West and that’s a proven way to build an advantage.
It is also funny in a different way. Many people don't realise that they live in some sort of bubble. Many people in "The West" think that they are still the center of the world in everything, while this might not be so correct anymore.
In the U.S. there is 350 million people and EU has 520 million people (excluding Russia and Turkey).
China alone has 1.4 billion people.
Since there is a language barrier and China isolates themselves pretty well from the internet, we forget that there is a huge society with high focus on science. And most of our tech products are coming from there.
My understanding is that not having H100s is irrelevant because most Chinese companies can partner or just own companies in say Australia that can load up on H100s in their data centers in Australia and "rent them out" or offer a service to the Chinese parent company.
Maybe not $10 unless they are loss-leading to dominance. Well they actually could very well do exactly that... Hm, yea, good points. I would expect at least an order or two of magnitude higher to prevent an inferno.
Lets be fair though. Replicating TSMC isn't something that could happen quickly. Then again, who knows how far along they already are...
All the top level comments are basking in the irony of it, which is fair enough. But I think this changes the Deepseek narrative a bit. If they just benefited from repurposing OpenAI data, that's different than having achieved an engineering breakthrough, which may suggest OpenAI's results were hard earned after all.
I understand they just used the API to talk to the OpenAI models. That... seems pretty innocent? Probably they even paid for it? OpenAI is selling API access, someone decided to buy it. Good for OpenAI!
I understand ToS violations can lead to a ban. OpenAI is free to ban DeepSeek from using their APIs.
Sure, but I'm not interested in innocence. They can be as innocent or guilty as they want. But it means they didn't, via engineering wherewithal, reproduce the OpenAI capabilities from scratch. And originally that was supposed to be one of the stunning and impressive (if true) implications of the whole Deepseek news cycle.
Nothing is ever done "from scratch". To create a sandwich, you first have to create the universe.
Yes, there is the question how much ChatGPT data DeepSeek has ingested. Certainly not zero! But if DeepSeek has achieved iterative self-improvement, that'd be huge too!
"From scratch" has a specific definition here though - it means 'from the same or broadly the same corpus of data that OpenAI started with'. The implication was that DeepSeek had created something broadly equivalent to ChatGPT on their own and for much less cost; deriving it from an existing model is a different claim. It's a little like claiming you invented a car when actually you took an existing car and tuned and remodelled it - the end result may be impressive and useful and better than the original, but it's not really a new invention.
No and that's not the point I'm making; cloning the technology is not the same as cloning the data. The claim was that they trained DeepSeek for a fraction of the cost that OpenAI spent training ChatGPT, but if one was trained off the web and the other was trained off the trained data of the first, then it's not a fair comparison.
It is not as if they are not open about how they did it. People are actually working on reproducing their results as they describe in the papers. Somebody has already reproduced the r1-zero rl training process on a smaller model (linked in some comment here).
Even if o1 specifically was used (which is in itself doubtful), it does not mean that this was the main reason that r1 succeeded/it could not have happened without it. The o1 outputs hides the CoT part, which is the most important here. Also we are in 2025, scratch does not exist anymore. Creating better technology building upon previous (widely available) technology has never been a controversial issue.
who cares. even if the claim is true, does that make the open source model less attractive?
in fact, it implies that there is no moat in this game. openai can no longer maintain its stupid valuation, as other companies can just scrape its output and build better models at much lower costs.
everything points to the exact same end result - DeepSeek democratized AI, OpenAI's old business model is dead.
>even if the claim is true, does that make the open source model less attractive?
Yes! Because whether they reproduced those capabilities independently or copied them by relying on downstream data has everything to do with whether they're actually state of the art.
Additionally, I was under the impression that all those Chinese models were being trained using data from OpenAI and Anthropic. Were there not some reports that Qwen models referred to themselves as Claude?
DDOSing web sites and grabbing content without anyone's consent is not hard earned at all. They did spend billions on their thing, but nothing was earned, as they could never do that legally.
I understand the temptation to go there, but I think it misses the point. I have no qualms at all with the idea that the sum total of intelligence distributed across the internet was siphoned away from creators and piped through an engine that now cynically seeks to replace them. Believe me, I will grab my pitchfork and march side by side with you.
But let's keep the eye on the ball for a second. None of that changes the fact that what was built was a capability to reflect that knowledge in dynamic and deep ways in conversation, as well as image and audio recognition.
And did Deepseek also build that? From scratch? Because they might not have.
Look at it this way. Even OpenAI uses their own models' output to train subsequent models. They do pay for a lot of manual annotations but also use a lot of machine generated data because it is cheaper and good enough, especially from the bigger models.
So say DS had simply published a paper outlining the RL technique they used, and one of Meta, Google or even OpenAI themselves had used it to train a new model, don't you think they'd have shouted off the rooftops about a new breakthrough? The fact that the provenance of the data is from a rival's model does not negate the value of the research IMHO.
> If they just benefited from repurposing OpenAI data, that's different than having achieved an engineering breakthrough
One way or another, they were able to create something that has WAY cheaper inference costs than o1 at the same level of intelligence. I was paying Anthropic $15/1M tokens to make myself 10x faster at writing software, which was coming out to $10/day. O1 is $60/1M tokens, which for my level of usage would mean that it costs as much as a whole junior software engineer. DeepSeek is able to do it for $2.50/1M tokens.
Either OpenAI was taking a profit margin that would make the US Healthcare industry weep, or DeepSeek made an engineering breakthrough that increases inference efficiency by orders of magnitude.
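A quick back-of-the-envelope check of those numbers, using the per-token prices quoted above and the daily token volume they imply:

    # Prices are $ per 1M tokens, as quoted in the comment above.
    daily_spend_anthropic = 10.0        # $/day, as stated
    price_anthropic = 15.0
    tokens_per_day_millions = daily_spend_anthropic / price_anthropic  # ~0.67M tokens/day

    price_o1 = 60.0
    price_deepseek = 2.50

    print(f"o1:       ${tokens_per_day_millions * price_o1:.2f}/day")        # ~ $40/day
    print(f"DeepSeek: ${tokens_per_day_millions * price_deepseek:.2f}/day")  # ~ $1.67/day

At the same usage level, o1 works out to roughly $40/day (on the order of $15k/year), and DeepSeek to under $2/day.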
>That doesn't mean the deep seek technical achievements are less valid.
Well, that's literally exactly what it would mean. If DeepSeek relied on OpenAI’s API, their main achievement is in efficiency and cost reduction as opposed to fundamental AI breakthroughs.
Agreed. They accomplished a lot with distillation and optimization - but there's little reason to believe you don't also need foundational models to keep advancing. Otherwise won't they run into issues training on more synthetic data?
In a way this is something most companies have been doing with their smaller models, DeepSeek just supposedly* did it better.
The issue isn’t just that AI trained on AI is inevitable it's whose AI is being used as the base layer. Right now, OpenAI’s models are at the top of that hierarchy. If Deepseek depended on them, it means OpenAI is still the upstream bottleneck, not easily replaced.
The deeper question is whether Deepseek has achieved real autonomy or if it’s just a derivative work. If the latter, then OpenAI still holds the keys to future advances. If Deepseek truly found a way to be independent while achieving similar performance, then OpenAI has a problem.
The details of how they trained matter more than the inevitability of synthetic data down the line.
> whether Deepseek has achieved real autonomy or if it’s just a derivative work
This question is malformed, imo. Every lab is doing derivative work. OpenAI didn’t invent transformers, Google did. Google didn’t invent neural networks or back propagation.
If you mean whether OAI could have prevented DS from succeeding by cutting off their API access, probably not. Maybe they used OAI for supervised fine tuning in certain domains, like creative writing, which are difficult to formally verify (although they claim to have used one of their own models). Or perhaps during human preference tuning at the end. But either way, there are many roads to Rome, and OAI wasn’t the only game in town.
If LLMs were already pure commodities, OpenAI wouldn't be able to charge a premium, and DeepSeek wouldn’t have needed to distill their model from OpenAI in the first place. The fact that they did proves there’s still a moat—just maybe not as wide as OpenAI hoped.
IMO the important “narrative” is the one looking forward, not backwards. OpenAI’s valuation depends on LLMs being prohibitively difficult to train and run. Deepseek challenges that.
Also, if you read their papers it’s quite clear there are several important engineering achievements which enabled this. For example multi head latent attention.
Yeah what happens when we remove all financial incentive to fund groundbreaking science?
It’s the same problem with pharmaceuticals and generics. It’s great when the price of drugs is low, but without perverse financial incentives no company is going to burn billions of dollars in a risky search for new medicines.
In this case, these cures (LLMs) are medicines in search of a disease to cure. I get AI shoved everywhere, when I just want it to aid my coding. Literally, that's it. They're also good at summarizing emails and similar things, but I know nobody who does that. I wouldn't trust an AI to read my emails and possibly hallucinate their contents.
Then we just have to fund research by giving grants to universities and research teams. Oh wait a sec: That's already what pretty much every government in the world is doing anyway!
Of course. How else would Americans justify their superiority (and therefore valuations) if a load of foreigners for Christ's sake could just out innovate them?
p.s. yes, that goes both ways - that is, if people are slamming a different country from an opposite direction, we say the same thing (provided we see the post in the first place)
This reminds me of the railroads: once railroads were invented, there was a huge investment boom of everyone trying to make money off the railroads, but competition brought costs down to the point where it generally wasn't the railroads that made the money and got the benefit, but consumers and regular businesses, and competition caused many railroads to fail.
AI is probably similar: Moore's law and other advancements will eventually allow people to run open models locally and bring down the cost of operation. Competition will make it hard for all but one or two players to survive, and most investments in AI by large companies like Nvidia, OpenAI, and DeepSeek will fail to generate substantial wealth, though they may earn some sort of return, or maybe not.
The railroads drama ended when JP Morgan (the person, not yet the entity) brought all the railroad bosses together, said "you all answer to me because I represent your investors / shareholders", and forced a wave of consolidation and syndicates because competition was bad for business.
Then all the farmers in the midwest went broke not because they couldn't get their goods to market, but because JP Morgan's consolidated syndicates ate all their margin hauling their goods to market.
Consolidation and monopoly over your competition is always the end goal.
'Can't allow companies that do censorship aligned with foreign nations'
'This company violated our laws and used an American company's tech for their training unfairly'
And the government choosing winners.
'The government in announcing 500 billion going to these chosen winners, anyone else take the hint, give up, you won't get government contracts but will get pressure'.
Good thing nobody is making these sorts of arguments today.
Surely that will end in fragmentation along national lines if monopolies are defined by governments.
Sure US economic power has a long reach right now because of the importance of the dollar etc - but the more it uses that to bully, the more countries are making sure they are independent.
You figure that out and the VC's will be shovelling money into your face.
I suspect the "it ain't training costs/hardware" bit is a bit exaggerated since it ignores all the prior work that DeepSeek was built on top of.
But, if all else fails, there's always the tried-and-true approaches: regulatory capture, industry entrenchment, use your VC bucks to be the last one who can wait out the costs the incumbents do face before they fold, etc.
> I suspect the "it ain't training costs/hardware" bit is a bit exaggerated since it ignores all the prior work that DeepSeek was built on top of.
How does it ignore it? The success of Deepseek proves that training costs/hardware are definitely NOT a barrier to entry that protects OpenAI from competition. If anyone can train their model with ChatGPT for a fraction of the cost it took to train ChatGPT and get similar results, then how is that a barrier?
Can anyone do that though? You need the tokens and the pipelines to feed them to the matmul mincers. Quoting only dollar equivalent of GPU time is disingenuous at best.
That’s not to say they lie about everything, obviously the thing works amazingly well. The cost is understated by 10x or more, which is still not bad at all I guess? But not mind blowing.
Even if that's 10x, that's easy to counter. $50M can be invested by almost anyone. There are thousands of entities (incl. governments, even regional ones) who could easily bring such capital.
So I'm not an expert in this but even with DeepSeek supposedly reducing training costs isn't the estimate still in the millions (and that's presumably not counting a lot of costs)? And that wouldn't be counting a bunch of other barriers for actually building the business since training a model is only one part, the barrier to entry still seems very high.
Also barriers to entry aren't the only way to get a consolidated market anyway.
About your first point, IMO the usefulness of AI will remain relatively limited as long as we don’t have continuously learning AI. And once we have that, the disparity between training and inference may effectively disappear. Whether that means that such AI will become more accessible/affordable or less is a different question.
I just read _The Great River_ by Boyce Upholt, a history of the Mississippi river and human management thereof. It was funny how the railroads were used as a bogeyman to justify continued building of locks, dams, and other control structures on the Mississippi and its tributaries, long after shipping commodities down river had been supplanted by the railroads.
This moment was also historically significant because it demonstrated how financial power (Morgan) could control industrial power (the railroads). A pattern that some say became increasingly important in American capitalism.
For the curious, it was vertical integration in the railroad-oil/-coal industry which is where the money was made.
The problem for AI is the hardware is commodified and offers no natural monopoly, so there isn't really anything obvious to vertically integrate-towards-monopoly.
Aren’t we approaching a scenario where the software is commodified (or at least "good enough" software) and the hardware isn't (NVIDIA GPUs have defined advantages)?
I think the lesson of DeepSeek is 'no' -- that by software innovation (i.e., dropping below CUDA to program the GPU directly, working at 8-bit, etc.) you can trivialise the hardware requirement.
However I think the reality is that there's only so much coal to be mined, as far as LLM training goes. When we're at "very diminishing returns", SoC/Apple/TSMC-CPU innovations will deliver cheap inference. We only really need an M4 Ultra with 1TB RAM to hollow out the hardware-inference-supplier market.
Very easy to imagine a future where Apple releases a "Apple Intelligence Mac Studio" with the specs for many businesses to run arbitrary models.
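For readers wondering what "working at 8-bit" buys, here is a generic sketch of reduced-precision weight storage (absmax int8 quantization). DeepSeek's own training reportedly used FP8 mixed precision rather than int8, so this shows only the general idea, not their method:

    import numpy as np

    def quantize_int8(w):
        # Absmax quantization: store int8 weights plus one float scale,
        # cutting memory and bandwidth roughly 4x versus float32.
        scale = np.abs(w).max() / 127.0
        return np.round(w / scale).astype(np.int8), scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(4, 4).astype(np.float32)
    q, s = quantize_int8(w)
    print("max abs error:", np.abs(w - dequantize(q, s)).max())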
I really hope that apple realizes soon there is a market for Mac Pro/Mac Studio with a RAM in the TBs for AI Workloads under $10k and a bunch of GPU cores.
>> Compute is literally being sold as a commodity today, software is not.
The marginal cost of software is zero. You need some kind of perceived advantage to get people to pay for it. This isn't hard, as most people will pay a bit for big-name vs "free". That could change as more open source apps become popular by being awesome.
Marginal cost has nothing to do with it - you can buy and sell compute like you could corn and beef at scale. You can't buy and sell software like that. In fact I'm surprised we don't have futures markets for things like compute and object storage.
I think a better analogy than railroads (which own the land that the track sits on and often valuable land around the station) is airlines, which don’t own land. I recall a relevant Warren Buffett letter that warned about investing hundreds of millions of dollars into capital with no moat:
> Similarly, business growth, per se, tells us little about value. It's true that growth often has a positive impact on value, sometimes one of spectacular proportions. But such an effect is far from certain. For example, investors have regularly poured money into the domestic airline business to finance profitless (or worse) growth. For these investors, it would have been far better if Orville had failed to get off the ground at Kitty Hawk: The more the industry has grown, the worse the disaster for owners.
I think that's a very possible outcome. A lot of people investing in AI are thinking there's a google moment coming where one monopoly will reign supreme. Google has strong network effects around user data AND economies of scale. Right now, AI is 1-player with much weaker network effects. The user data moat goes away once the model trains itself effectively and the economies of scale advantage goes away with smart small models that can be efficiently hosted by mortals/hobbyists. The DeepSeek result points to both of those happening in the near future. Interesting times.
> where the Moore’s law and advancement will eventually allow people to run open models locally
Probably won't be Moore's law (which is kind of slowing down) so much as architectural improvements (both on the compute side and the model side - you could say that R1 represents an architectural improvement of efficiency on the model side).
OpenAI is going after a company that open sourced their model, by distilling from their non-open AI?
OpenAI talks a lot about the principles of being Open, while still keeping their models closed and not fostering the open source community or sharing their research. Now when a company distills their models using perfectly allowed methods on the public internet, OpenAI wants to shut them down too?
I was thinking about this the other day but I highly doubt they would rebrand name. They’re borderline a household name now - at least ChatGPT is. OpenAI is the face of AI - at least to people who don’t follow the industry
They don't care; T&Cs and copyright are void unless it affects them, and others can go kick rocks. Not surprising that they and OpenAI will end up in a legal battle over this.
I'm not being sarcastic, but we may soon have to torrent DeepSeek's model. OpenAI has a lot of clout in the US and could get DeepSeek banned in western countries for copyright.
> US and could get DeepSeek banned in western countries for copyright
If US is going to proceed with trade war on EU, as it was planning anyway, then DeepSeek will be banned only in US. Seems like term "western countries" is slowly eroding.
Great point. Plus, the revival of serious talk of the Monroe Doctrine (!!!) in the U.S. government lends a possibly completely-new meaning to "western countries" -- i.e. the Americas...
Trump has already managed to completely destroy the US reputation within basically the entire continent¹. And he seems intent on creating a commercial war against all the countries here too.
1 - Do not capture and torture random people on the street if you want to maintain some goodwill. Even if you have reasons to capture them.
Yeah... I don't think goodwill was ever a very central part of the Monroe doctrine. Its imperial expansionism, plain n' simple. Embargo + pressure who you can, depose any governments that resist, threaten the rest into silent compliance.
It wouldn't be foolish. The US has an active cult of personality, and whatever the leader says, half the country believes it unquestioningly. If OpenAI is said to be protecting America and DeepSeek is doing terrible, terrible things to the children (many smart people are saying it), there'll be an overnight pivot to half the country screaming for it to be banned and harassing anyone who says otherwise.
Who cares if some people think you look foolish when you have a locked down 500 billion dollar investment guarantee?
Hey, OpenAI, so, you know that legal theory that is the entire basis of your argument that any of your products are legal? "Training AI on proprietary data is a use that doesn't require permission from the owner of the data"?
You might want to consider how it applies to this situation.
2. Lived through a similar scenario in 2010 or so.
Early in my professional career I worked for a media company that was scraping other sites (think Craigslist, but for our local market) to republish the content on our competing website. I wasn't working on that specific project, but I did work on an integration in my team's project where the scraping team could post jobs on our platform directly. When others started scraping "our content", there were a couple of urgent all-hands-on-deck meetings scheduled, with a high level of disbelief.
Exactly, they actually opened up the model and research, which the "Open" company didn't, and merely adjusted some of their pricing tiers to try to combat commercially (but not without mumbling something like "yeah, we totally had these ideas too"). Now every single Meta, OpenAI etc engineer is trying to copy DeepSeek's innovations, and their first act is to... complain about copyright infringement, of all things?! What an absolute clown party, how can these people take themselves seriously, do they just have zero comprehension of what hypocrisy is or what's going on here...
I can scarcely process all the levels of irony involved, the irony-o-meter is pegged and I can't get the good one from the safe because I'm incapacitated from laughter.
Altman was in a bit of a tricky position in that he figured OpenAI would need a lot of money for compute to be able to compete but it was hard to get that while remaining open. DeepSeek benefit from being funded from their own hedge fund. I wonder if part of their strategy is crack AI and then have it trade the markets?
The last (only?) language model OpenAI released openly was GPT-2, and even for that the instruction weighted model was never released. This was in 2019. The large Microsoft deal was done in 2023.
If it's true, how is it problematic? It seems aligned with their mission:
> We will attempt to directly build safe and beneficial AGI, but will also consider our mission fulfilled if our work aids others to achieve this outcome.
> We will actively cooperate with other research and policy institutions; we seek to create a global community working together to address AGI’s global challenges.
While I'm as amused as everyone else, I think it's technically accurate to point out that the "we trained it for $6 million" narrative is contingent on the investment already made by others.
OpenAI has been in a war-room for days searching for a match in the data, and they just came out with this without providing proof.
My cynical opinion is that the training corpus has some small amount of data generated by OpenAI, which is probably impossible to avoid at this point, and they are hanging on that thread for dear life.
Oh, absolutely. I'm not defending OpenAI, I just care about accurate reporting. Even on HN - even in this thread - you see people who came away with the conclusion that DeepSeek did something while "cutting cost by 27x".
But that's a bit like saying that by painting a bare wall green you have demonstrated that you can build green walls 27x cheaper, ignoring the cost of building the wall in the first place.
Smarter reporting and discourse would explain how this iterative process actually works and who is building on who and how, not frame it as two competing from-scratch clean room efforts. It'd help clear up expectations of what's coming next.
It's a bit similar to how many are saying DeepSeek have demonstrated independence from nVidia, when part of the clever thing they did was figure out how to make the intentionally gimped H800s work for their training runs by doing low-level optimizations that are more nVidia-specific, etc.
Rarely have I seen a highly technical topic produce more uninformed snap takes than this week.
You are underselling or not understanding the breakthrough. They trained a 600B-parameter model on 15T tokens for under $6M. Regardless of the provenance of the tokens, this in itself is impressive.
Not to mention post-training. Their novel GRPO technique used for preference optimization / alignment is also much more efficient than PPO.
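For the curious, a minimal sketch of the group-relative advantage idea at the heart of GRPO: score several sampled completions per prompt and normalize each reward against its own group, instead of training a separate value network as PPO does. The full method also has a clipped policy-ratio objective and a KL penalty, omitted here:

    import statistics

    def grpo_advantages(group_rewards):
        # One group = several completions sampled for the same prompt.
        mean = statistics.mean(group_rewards)
        std = statistics.pstdev(group_rewards) or 1.0   # guard against a zero spread
        return [(r - mean) / std for r in group_rewards]

    # e.g. 4 sampled answers to one math prompt, rewarded 1 if correct else 0
    print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))   # [1.0, -1.0, -1.0, 1.0]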
Let's call it underselling. :-) Mostly because I'm not sure anyone's independently done the math and we just have a single statement from the CEO. I do appreciate the algorithmic improvements, and the excellent attention-to-performance-in-detail stuff in their implementation (careful treatment of precision, etc.), making the H800s useful, etc. I agree there's a lot there.
> that's a bit like saying that by painting a bare wall green you have demonstrated that you can build green walls 27x cheaper, ignoring the cost of building the wall in the first place
That's a funny analogy, but in reality DeepSeek did reinforcement learning to generate chain of thought, which was used in the end to finetune LLMs. The RL model was called DeepSeek-R1-Zero, while the SFT model is DeepSeek-R1.
They might have boostrapped the Zero model with some demonstrations.
> DeepSeek-R1-Zero struggles with challenges like poor readability, and language mixing. To make reasoning processes more readable and share them with the open community, we explore DeepSeek-R1, a method that utilizes RL with human-friendly cold-start data.
> Unlike DeepSeek-R1-Zero, to prevent the early unstable cold start phase of RL training from the base model, for DeepSeek-R1 we construct and collect a small amount of long CoT data to fine-tune the model as the initial RL actor. To collect such data, we have explored several approaches: using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1Zero outputs in a readable format, and refining the results through post-processing by human annotators.
I don't agree. Walls are physical items so your example is true, but models are data. Anyone can train off of these models, that's the current environment we exist in. Just like OpenAI trained on data that has since been locked up in a lot of cases. In 2025 training models like Deepseek is indeed 27x cheaper, that includes both their innovations and the existence of new "raw material" to do such a thing.
What I'm saying is that in the media it's being portrayed as if DeepSeek did the same thing OpenAI did 27x cheaper, and the outsized market reaction is in large parts a response to that narrative. While the reality is more that being a fast-follower is cheaper (and the concrete reason is e.g. being able to source training data from prior LLMs synthetically, among other things), which shouldn't have surprised anyone and is just how technology in general trends.
The achievement of DeepSeek is putting together a competent team that excels at end-to-end implementation, which is no small feat and is promising wrt/ their future efforts.
The opposite is claiming that OpenAI could by now have built a better-performing, cheaper-to-run model (compared to what they published) by training it at 1% of the cost on the output of their previous models... but they chose not to do it.
Is OpenAI claiming copyright ownership over the generated synthetic data?
That would be a dangerous precedent to establish.
If it's a terms of service violation, I guess they're within their rights to terminate service, but what other recourse do they have?
Other than that, perhaps this is just rhetoric aimed at introducing restrictions in the US, to prevent access to foreign AI, to establish a national monopoly?
> “It is (relatively) easy to copy something that you know works,” Altman tweeted. “It is extremely hard to do something new, risky, and difficult when you don’t know if it will work.”
The humor/hypocrisy of the situation aside, it does seem to be true that OpenAI is consistently the one coming up with new ideas first (GPT 4, o1, 4o-style multimodality, voice chat, DALL-E, …) and then other companies reproduce their work, and get more credit because they actually publish the research.
Unfortunately for them it’s challenging to profit in the long term from being first in this space and the time it takes for each new idea to be reproduced is getting shorter.
Reminds me of the Bill Gates quote when Steve Jobs accused him of stealing the ideas of Windows from Mac:
Well, Steve... I think it’s more like we both had this rich neighbor named Xerox and I broke into his house to steal the TV set and found out that you had already stolen it.
Xerox could be seen as Google, whose researchers produced the landmark Attention Is All You Need paper, and the general public, who provided all of the training data to make these models possible.
> OpenAI is consistently the one coming up with new ideas first (GPT 4, o1, 4o-style multimodality, voice chat, DALL-E, …)
As far as I can tell o1 was based on Q-star, which could likely be Quiet-STaR, a CoT RL technique developed at Stanford that OpenAI may have learned about before it got published. Presumably that's why they never used the Q-Star name even though it had garnered mystique and would have been good for building hype. This is just speculation, but since OpenAI haven't published their technique then we can't know if it really was their innovation.
> The humor/hypocrisy of the situation aside, it does seem to be true that OpenAI is consistently the one coming up with new ideas first (GPT 4, o1, 4o-style multimodality, voice chat, DALL-E, …) and then other companies reproduce their work, and get more credit because they actually publish the research
I claim one just can't put the humor/hypocrisy aside that easily.
What OpenAI did with the release of ChatGPT was productize research that was open and ongoing, with DeepMind and others leading at least as much of it. And everything after that was an extension of the basic approach: improved, expanded, but ultimately the same sort of beast. One might even say the position of OpenAI relative to DeepMind was like Apple to Xerox. Productizing is nothing to sneeze at; it requires creativity and work to productize basic research. But naturally you get end-users who consider the productizers the "fountainheads", and who overestimate the productizers because products are all they see.
I may be wrong, but to my knowledge OpenAI:
- did not invent transformer architecture
- did not invent diffusion architecture
- did not come up with the idea for multi-modality
- did not invent the notion/architecture of the latest “agentic” models
They simply were the first to aggressively pursue scaling the transformer to the extent that is normal for the industry today. Although this has proven to produce interesting results, “simply adding scale” is, in my view, the least interesting development in modern ML. Giving credit where it’s due, they MAY have popularized the RLHF methodology, but I don’t recall them inventing that either?
(feel free to point out any of the above that I falsely attributed to NOT OpenAI.)
Additionally I seem to remember in an interview with Altman circa late ‘21 where he explains that the spirit of “OpenAI” and how their only goal is pursuing AGI, and “should someone else come up with a more promising path to get there, we would stop what we’re doing and help them”. I couldn’t find a reference to this interview, but anyone else, please feel free to share (I think it was a youtube link). - fast forward to 2025 and now “OpenAI” is the least open large contributor and indiscernible from your run-of-the-mill AI/ML valley startup insofar as they’re referring to others as “competitors” as opposed to collaborators.. interesting times…
There’s some truth in that, but isn’t making a radically cheaper version also a new idea that deepseek didn’t know whether it would work? I mean, there was already research into distillation, but there was already research into some of (most of?) OpenAI’s ideas.
Yes, for people who look into the research Deepseek released, there are a good number of novelties which enabled much cheaper R&D. For example, improvements to Mixture of Experts modules and Multi-head Latent Attention. If you have infinite money, you don’t need to innovate there, but DeepSeek didn’t.
The eye-watering funding numbers proposed by Altman in the past and more recently with “Stargate” suggests a publicly-funded research pivot is not out of the question. Could see a big defense department grant being given. Sigh.
I don't see any reason to assume that "publicly funded" will imply that the research is public. Although I'd be more than happy to be wrong on this one.
Not really, they just put their eye to where everyone knows the ball is going and publish fake / cherrypicked results and then pretend like they got there first (o1, gpt voice, sora)
Fortunately, OpenAI doesn't need to make money because they are a nonprofit dedicated to the safe and transparent advancement of AI for all of humanity
I was wondering if this might be the case, similar to how Bing’s initial training included Google’s search results [1]. I’d be curious to see more details of OpenAI’s evidence.
It is, of course, quite ironic for OpenAI to indiscriminately scrape the entire web and then complain about being scraped themselves.
Hard to really have any sympathy for OpenAI's position when they're actively stealing content, ignoring requests to stop then spending huge amounts to get around sites running ai poisoning scripts, making it clear they'll still take your content regardless of if you consent to it.
It looks like Deepseek had a subdomain called "openai-us1.deepseek.com". What is a legitimate use-case for hosting an openai proxy(?) on your subdomain like this?
Not implying anything's off here, but it's interesting to me that this OpenAI entity is one of the few subdomains they have on their site
Could just be an OpenAI-compatible endpoint too. A lot of LLM tools use OpenAI compatible APIs, just like a lot of Object Storage tools use S3 compatible APIs.
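For anyone who hasn't run into it, "OpenAI-compatible" usually just means the same REST shape, so the standard client works with a different base URL. A minimal sketch; the hostname, key, and model name below are placeholders, not anything DeepSeek actually runs:

    from openai import OpenAI

    # Point the stock OpenAI client at any compatible provider.
    client = OpenAI(
        base_url="https://llm.example.com/v1",   # placeholder endpoint
        api_key="sk-whatever-the-provider-issued",
    )

    resp = client.chat.completions.create(
        model="some-model-name",                 # whatever the provider exposes
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(resp.choices[0].message.content)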
The US government likely will favor a large strategic company like OpenAI instead of individual's copyrights, so while ironic, the US government definitely doesn't care.
And the US government is also likely itching to reduce the power of Chinese AI companies that could out compete US rivals (similar to the treatment of BYD, TikTok, solar panel manufacturers, network equipment manufacturers, etc), so expect sweeping legislation that blocks access to all Chinese AI endeavours to both the US and then soon US allies/West (via US pressure.)
The likely legislation will be on the surface justified both by security concerns and by intellectual property concerns, but ultimately it will be motivated by winning the economic competition between China and the US and it will attempt to tilt the balance via explicitly protectionist policies.
>The US government likely will favor a large strategic company like OpenAI instead of individual's copyrights
Even if we assume this is true, Disney and Netflix are both currently worth more than OpenAI and both rely on the strict enforcement of US copyright law. I do not think it is so obvious which powers that be have the better lobbying efforts and, currently, it's looking like this question will mostly be adjudicated by the courts, not Congress, anyways.
I don't think OpenAI stole from Disney or Netflix. Rather, OpenAI stole from individual artists, YouTube, and other social media whose users do not really have any lobbying power.
So I think OpenAI, Disney and Netflix win together. Big companies tend to win.
> What are the first words of the disney movie, "Aladdin" ?
The first words of Disney's Aladdin (1992) are spoken by the *Peddler*, the mysterious merchant at the beginning of the film. He says:
"Ah, Salaam and good evening to you, worthy friend. Please, please, come closer..."
He then continues with:
"Too close! A little too close. There. Welcome to Agrabah. City of mystery, of enchantment, and the finest merchandise this side of the River Jordan, on sale today! Come on down!"
This opening sets the stage for the story, introducing the magical and bustling world of Agrabah.
Being publicly available does not mean that copyright is invalid. Copyright gives the holders the right to restrict USE, not merely restrict reproduction. Adaptation is also an exclusive right of the copyright holder. You're not allowed to make derivative works.
> They consumed publicly available material on the Internet
I agree that there are some important distinctions and word-choices to be made here, and that there are problems with equating training to "stealing", and that copyright infringement is not theft, etc.
That said, if you zoom out to the overall conduct, it's fair to argue that the companies are doing something unethical, the same as if they paid an army of humans to memorize other people's work and then regurgitate slightly-reworded copies.
> That said, if you zoom out to the overall conduct, it's fair to argue that the companies are doing something unethical, the same as if they paid an army of humans to memorize other people's work and then regurgitate slightly-reworded copies.
I would use the analogy of those humans learning from the material. Like reading books in the library
"regurgitate slightly-reworded copies" in my experience using LLMs (not insubstantial) that is an unfairly pejorative take on what they do
By that logic, a copy of the source code for a proprietary app that someone has stolen and placed online is immediately free for all to use as they wish.
Being on the internet doesn't make it yours, or acceptable to take. In the case of OpenAI (and Anthropic), they should be following the long-held convention of the robots.txt file on sites, which can be set to tell them specifically that they may not take your content; they openly ignore that request.
OpenAI absolutely is stealing from everyone, hence why most will have little sympathy when they complain someone stole from them.
I don’t think US government can move fast enough to change the trajectory. Also it doesn’t help that basically every government is second guessing their alliance with the US. It’s not an industry that can ruin local industries either (like cheap BYD is bad for German cars).
It’s a very fun thing to watch from the sidelines right now, if I’ll be honest.
A. below is a list of OpenAI initial hires from Google. It's implausible to me that there wasn't quite significant transfer of Google IP
B. Google published extensively, including the famous 'Attention Is All You Need' paper, but OpenAI, despite its name, has not explained the breakthroughs that enabled o1. It has also switched from a charity to a for-profit company.
C. Now this company, with a group of smart, unknown machine learning engineers, presumably paid a fraction of what OpenAI's are, has created a far cheaper model and openly published the weights and many methodological insights, which will be used by OpenAI.
1. Ilya Sutskever – One of OpenAI’s co-founders and its former Chief Scientist. He previously worked at Google Brain, where he contributed to the development of deep learning models, including TensorFlow.
2. Jakub Pachocki – Formerly OpenAI’s Director of Research, he played a major role in the development of GPT-4. He had a background in AI research that overlapped with Google’s fields of interest.
3. John Schulman – Co-founder of OpenAI, he worked on reinforcement learning and helped develop Proximal Policy Optimization (PPO), a method used in training AI models. While not a direct Google hire, his work aligned with DeepMind’s research areas.
4. Jeffrey Wu – One of the key researchers involved in fine-tuning OpenAI’s models. He worked on reinforcement learning techniques similar to those developed at DeepMind.
5. Girish Sastry – Previously involved in OpenAI’s safety and alignment work, he had research experience that overlapped with Google’s AI safety initiatives.
> A. below is a list of OpenAI initial hires from Google. It's implausible to me that there wasn't quite significant transfer of Google IP
I agree there's hypocrisy but in terms of making a strong argument, you can safely remove your list of persons who (drum roll)... mostly _didn't_ actually work at Google?
Oh God. I know exactly how this feels. A few years ago I made a bread hydration and conversion calculator for a friend, and put it up on JSFiddle. My friend, at the time, was an apprentice baker.
Just weeks later, I discovered that others were pulling off similar calculations! They were making great bread with ease and not having to resort to notebooks and calculators! The horror! I can't believe that said close friend of mine would actually share those highly hydraty mathematical formulas with other humans without first requesting my consent </sarc>.
Could it be, that this stuff just ends up in the dumpster of "sorry you can't patent math" or the like?
I do think that distilling a model from another is much less impressive than distilling one from raw text. However, it is hard to say if it is really illegal or even immoral, perhaps just one step further in the evolution of the space.
It's about as illegal as the billions, if not trillions of IPs that ClosedAI infringed to train their own data without consent. Not that they're alone, and I personally don't mind that AI companies do it, but it's still amusing when they get this annoyed at others doing the same thing to them.
I think they had the advantage of being ahead of the law in this regard. To my knowledge, reading copyrighted material isn't (or wasn't) illegal, and it remains a legal grey area.
Distilling weights from prompts and responses is even more of a legal grey area. The legal system cannot respond quickly to such technological advancements so things necessarily remain a wild west until technology reaches the asymptotic portion of the curve.
In my view the most interesting thing is, do we really need vast data centers and innumerable GPUs for AGI? In other words, if intelligence is ultimately a function of power input, what is the shape of the curve?
The main issue is that they've had plenty of instances where the LLM outputted copyrighted content verbatim, like it happened with the New York Times and some book authors. And then there's DALL-E, which is baked into ChatGPT and before all the guardrails came up, was clearly trained on copyrighted content to the point it had people's watermarks, as well as their styles, just like Stable Diffusion mixes can do (if you don't prompt it out).
Like you've put, it's still a somewhat gray area, and I personally have nothing against them (or anyone else) using copyrighted content to train models.
I do find it annoying that they're so closed-off about their tech when it's built on the shoulders of openness and other people's hard work. And then they turn around and throw hissy fits when someone copies their homework, allegedly.
> Distilling weights from prompts and responses is even more of a legal grey area.
Actually, unless the law changes, this is pretty settled territory in US law. All output of AIs is not copyrightable and is therefore in the public domain. The only legal avenue of attack OpenAI has is a Terms of Service violation, which is a much weaker claim than copyright, if it is even true.
> if intelligence is ultimately a function of power input, what is the shape of the curve?
According to a quick google search, the human body consumes ~145W of power over 24h (eating 3000kcals/day). The brain needs ~20% of that so 29W/day. Much less than our current designs of software & (especially) hardware for AI.
I think you mean the brain uses 29W (i.e. not 29W/day). Also, I suspect that burgers are a higher entropy energy source than electricity so perhaps it is even less than that.
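The arithmetic behind those figures, for anyone who wants to check it (3,000 kcal/day converted to watts, then the ~20% share attributed to the brain):

    KCAL_TO_JOULES = 4184
    SECONDS_PER_DAY = 24 * 60 * 60

    body_watts = 3000 * KCAL_TO_JOULES / SECONDS_PER_DAY   # ~145 W
    brain_watts = 0.20 * body_watts                        # ~29 W
    print(f"body ≈ {body_watts:.0f} W, brain ≈ {brain_watts:.0f} W")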