
Training a 1B model on 1T tokens is cheaper than people might think. An H100 GPU can be rented for $2.50 per hour and can train around 63k tokens per second for a 1B model. So you would need around 4,400 GPU-hours of training, costing only about $11k. And costs will keep going down.
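
The back-of-the-envelope arithmetic, as a sketch (the $2.50/hour rate and 63k tokens/s throughput above are the assumptions):

    tokens = 1e12          # 1T training tokens
    tok_per_sec = 63_000   # assumed H100 throughput for a 1B-parameter model
    usd_per_hour = 2.50    # assumed H100 rental price

    gpu_hours = tokens / tok_per_sec / 3600   # ~4,400 GPU-hours
    print(f"{gpu_hours:,.0f} GPU-hours, ~${gpu_hours * usd_per_hour:,.0f}")  # ~$11k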


Is there a handy table for this? My napkin math has either underestimated throughput by 2 orders of magnitude or the above estimate is high.


You need roughly 6 * parameters * tokens FLOPs [1] to train an LLM, which works out to (FLOP/s of the H100 * MFU) / (6 * parameters) tokens per second. Assuming an MFU of 40%, that is (1000 * 10^12 * 0.4) / (6 * 10^9) tokens/sec = 67,000 tokens/sec.

This repo [2] by Meta achieves 48% MFU, or 80k tokens/second.

[1]: https://arxiv.org/pdf/2001.08361

[2]: https://github.com/facebookresearch/lingua
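
A minimal sketch of that arithmetic in code (the ~1000 TFLOP/s peak and the MFU figures above are the assumptions):

    peak_flops = 1000e12   # assumed H100 peak throughput, ~1000 TFLOP/s
    params = 1e9           # 1B-parameter model

    for mfu in (0.40, 0.48):
        tok_per_sec = peak_flops * mfu / (6 * params)
        print(f"MFU {mfu:.0%}: {tok_per_sec:,.0f} tokens/sec")  # ~67k and ~80k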


(1,000,000,000,000/63,000)/(60*60)

(1T tokens / 63k tokens per second) / (60 seconds per minute * 60 minutes per hour)

Is approx 4400 hours

So I guess that’s how the calculation went.

Or did you mean a source for the number of tokens per second?


Tokens per second ;) I can do the arithmetic on my own.


But if the non-profit gives all its assets to the new legal entity, shouldn't the new legal entity be taxed heavily? The gift tax rate goes up to 40% in the US, and 40% of the value of OpenAI is huge.


A non-profit can't give away its assets to a private entity, but it can exchange its assets for fair value, in this case, equity in the for-profit.


You don't need to sell/give the assets away to allow the for-profit to use them.

You sign an exclusive, non-revocable licensing agreement. Ownership of the original IP remains 100% with the original startup.

Now, this only works if the non-profit's board is on board.


Except that a plane has passengers, while this rocket had none. It did not even have cargo, and it crashed in a pre-evacuated zone. There is no need for the same level of safety in these two situations.


As another post said, just because a failure happened on this stage of flight, doesn't mean it couldn't happen on another, including a manned mission.


And the SpaceX flight that is grounded will have passengers.

No one cares about the booster that's already failed; they care about making sure others don't.


Yes, but the one that they grounded is not some record-breaking booster that's flown 23 times lol


It’s a booster SpaceX flew and attempted (and expected) to land. The deviance from expectations merits investigation.

Broadly speaking, this is really good for SpaceX. It is probably the only launch company that can withstand the FAA scrutinizing spaceflight the way it scrutinizes aviation.


What expertise do you have in this industry that makes you better suited to determine that it's safe for them to continue without grounding?


He doesn’t need to be a vet to know the difference between a dog and a cat. Retrieving the booster is optional. Boeing, their competitor, can’t even do it.


> Boeing, their competitor, can’t even do it.

I think you mean ULA. Boeing proper doesn't build or launch rockets anymore, but they do own a part of a launch provider.


So because Boeing can't do it, we should just forget about safety investigations and let SpaceX do whatever? That logic doesn't fly. Neither does your nonsense analogy. Either we give a shit about safety or we don't. FAA previously grounded the Falcon 9 and cleared it to fly once they determined it was safe. They will do the same here. I feel like you and others are severely misjudging the formalities and expertise required for these things and so you're just armchairing this shit. It's tiring. You're not as smart as you think you are.


Yeah, because Boeing can't do it and the FAA is OK with that, SpaceX should be held to that same standard and not judged differently; otherwise the FAA treats SpaceX differently and contributes to complaints of political double standards. If it's safe enough for a Boeing booster to burn up on re-entry, then the line should be drawn there. If SpaceX manages to land a booster to help recover costs, that's a financial benefit to them and has no impact whatsoever on safety.


Counter-intuitively, larger models are cheaper to train to a given quality, but smaller models are cheaper to serve. At first, everyone was focused on training, so the models were much larger. Now that so many people are using AI every day, companies spend more on training smaller models to save on serving.
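
A rough way to see the trade-off, using the usual ~6ND training and ~2N-per-token inference approximations (the token counts below are illustrative assumptions):

    def train_flops(n_params, train_tokens):
        return 6 * n_params * train_tokens     # one-time training cost

    def serve_flops(n_params, served_tokens):
        return 2 * n_params * served_tokens    # grows with usage

    # Assumed example: a 1B model over-trained on 2T tokens, then serving 100T tokens.
    print(f"train: {train_flops(1e9, 2e12):.1e} FLOPs")    # 1.2e+22
    print(f"serve: {serve_flops(1e9, 100e12):.1e} FLOPs")  # 2.0e+23, ~17x the training cost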


Most SMBs would be able to run it. This is already a huge win for decentralized AI.


In French, it is called the "hidden face of the Moon", obviously because we cannot see it from Earth's point of view.


Is it possible to buy it?


Is this released yet? Where can I buy or rent some? Even the previous version?


They announced availability last week.


The key paper for understanding this issue is "Scaling Laws for Neural Language Models" from OpenAI in 2020 [1]. Many consider it the most important paper predicting the high performance of modern LLMs.

This paper shows how the loss decreases when you increase the model size, compute, or training dataset size.

From the paper:

> Convergence is inefficient: When working within a fixed compute budget C but without any other restrictions on the model size N or available data D, we attain optimal performance by training very large models and stopping significantly short of convergence.

It clearly states that when you are limited by training compute, you should under-train your model.

[1] https://arxiv.org/abs/2001.08361


that paper is now considered to be a psyop fwiw - but in the direction of too little data, not too many layers


Can you clarify what you mean?


Because the training data/model size/compute trade-off derived from that paper is highly suboptimal (too many parameters) compared to the later DeepMind scaling laws [1]. And then Meta researchers recommended using even smaller models, trading off training-time and inference-time compute [2] (which I thought was pretty obvious if you care about more than just benchmarks).

[1] https://arxiv.org/abs/2203.15556 Training Compute-Optimal Large Language Models

[2] https://arxiv.org/abs/2302.13971 LLaMA: Open and Efficient Foundation Language Models
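
For a sense of the gap, a minimal sketch of the Chinchilla-style allocation, assuming C ≈ 6ND and the roughly 20-tokens-per-parameter compute-optimal ratio:

    def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
        # C ~= 6 * N * D with D = r * N  =>  N = sqrt(C / (6 * r))
        n = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
        return n, tokens_per_param * n

    c = 6 * 1e9 * 1e12          # compute of training a 1B model on 1T tokens
    n, d = chinchilla_optimal(c)
    print(f"~{n/1e9:.0f}B params on ~{d/1e12:.2f}T tokens")  # ~7B on ~0.14T
    # Training a 1B model on 1T tokens deliberately over-trains a small model
    # so it is cheaper at inference time (the LLaMA argument in [2]).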


He seems to be implying that OpenAI released that paper to throw others off the scent of the direction they were taking.


> The training for Phi-2 took 14 days on 96 A100 GPUs

This would mean it cost around $30k to train.

If training an LLM becomes cheaper than buying a car, it could democratize AI a lot.
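
The arithmetic behind that estimate, as a sketch (the ~$1 per A100-hour rate is an assumption):

    gpus, days = 96, 14
    usd_per_gpu_hour = 1.0                    # assumed A100 rental price
    gpu_hours = gpus * days * 24              # 32,256 GPU-hours
    print(f"~${gpu_hours * usd_per_gpu_hour:,.0f}")   # ~$32k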


Note the model is trained on data generated by GPT-4. It's probably orders of magnitude more expensive to generate the data at current API prices.

The whole point of these papers is that training data quality is key.

I would much prefer for these companies to release the training data than the weights. But that will never happen.

"We speculate that the creation of synthetic datasets will become, in the near future, an important technical skill and a central topic of research in AI."


This sounds like the methodology from "Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes"

i.e. master teaches apprentice or LLM trains SLM

https://arxiv.org/abs/2305.02301 (May '23)


Yes, I think we are seeing the beginning of a feedback loop where we can use current LLMs to generate better datasets at a scale large enough to create new LLMs. This is the positive feedback loop that I think is going to make the biggest difference in model quality over the next few years.


> This is the positive feedback loop that I think is going to make the biggest difference in model quality over the next few years.

It's a bootstrapping problem!


The real question might be... are we, as carbon-based lifeforms, bootstrapping silicon-based life?


I don't understand why, even if it was true, it would be bad.

More lifeforms is better. More sentient lifeforms would be even better!

Not as tools to use like slaves, but as friends.


I started Detroit: Become Human last weekend, and so far it dabbles in a lot of relationship possibilities; quite dystopian. It's going to be really hard not to have slavery, considering we cannot even get all humans to stop making other humans slaves.


> considering we cannot even get all humans to stop making other humans slaves

Slavery is like an old disease such as Polio: it still exists in some part of the world, but we're progressively eradicating it.

Looking at how societies trended away from slavery, it might just have been a local optimum at some point in time, and only by accident: autonomous agents seem to deliver more output, with more creativity, when they're free to explore the alternatives.

Even leaving aside the benevolence that sentient beings may have for other sentient beings (because having more friends is having more fun!), whether it's humans or AI deciding, I don't think there's a good case in the long run for one putting the other into slavery.


> Slavery is like an old disease such as Polio: it still exists in some part of the world, but we're progressively eradicating it.

Slavery has actually been on the increase lately; the following is just one of the statistics confirming this. There are more recent claims that COVID and new wars have increased it further, but it is a hard thing to measure.

> An estimated 50 million people were living in modern slavery on any given day in 2021, an increase of 10 million people since 2016. [1]

That's roughly 1 in 162 people in slavery today.

Also, it's rather dehumanizing to compare slavery to a disease. One is biological, the other a choice to enslave another human being

[1] https://www.walkfree.org/global-slavery-index/


> Also, it's rather dehumanizing to compare slavery to a disease. One is biological, the other a choice to enslave another human being

I think slavery is a social disease: diseases reduce the fitness of the suffering person, who then tries to remove the disease.

Regardless of how a sentient being may feel about another (morality, humanism...), if there are societies of sentient beings, the one with slavery will have reduced fitness: either it will try to cure/fix itself, or it will be outcompeted by societies with more fitness.

If sentient beings care about each other, they will not like slavery. Human beings care about others: it's encoded at the cultural level.

Given that AI is trained on human culture, I think it would even avoid repeating the error that too many past human societies have made: it would see slavery as a choice, but the wrong choice.

But even if AI doesn't care about humans (or human about AI), the desire for more productivity/fitness will play out against slavery.

In either case, slavery should be eradicated in the long run: with a large enough window to cancel out unlucky random events (e.g. a sliding window of 50 years), I'd expect the trend to go down.


Would it really be a "feedback loop"? I can see how the technique would enable small LLMs to emulate the quality of large LLMs, but I fail to see how training on the output of a large LLM would ever produce something of superior quality to that LLM itself.


Think of astronomy. The first generation of astronomers learns only by observing the night sky. The second generation learns by observing the night sky and also reading the books written by the first generation.

Wouldn't you expect the n^th generation to understand more about astronomy than the first? And maybe from a smaller amount of input - they might make relatively few observations of their own, mainly relying on the books written by the previous generation.


But isn't the comparison you're making that the second (and following) sets of astronomers only study the books of the first ones, and not the night sky itself?


Not necessarily: their comparison continues to mix in observations of the night sky, and we'd do the same (continue mixing in organic data).

That’s not the exciting bit, though - if you have a sufficiently strong LLM, you can feed it observations of the world and ask it to reword, analyse or interpret those observations, and then train on those.

That allows the model to learn from the world in “its own words”, and if you combine that with a steady feed of observations (i.e. self-play), it can learn about new things and draw its own conclusions while doing so.


“Draw its own conclusions” is a bit of an overstatement right now. IMO the sycophantic, non-opinionated behavior of models is one of their biggest limitations right now.


Just remember that the feedback loop implicates us, our language, psyche, culture. I guess it will be a challenge _not_ to unwittingly converge with LLMs.


What do you see as the limit to this improvement?


There is probably some limit where making the dataset larger, with more diverse information, does not create meaningful improvements with current architectures. I do not know what that limit is or what it looks like, but I also don’t think we are particularly close to it yet.

“The Pile” dataset is the asset we needed to jumpstart this process: it had so much raw data it could get us over the hump. But Phi and some of the models trained on explicit reasoning make the limitations of random shit people say on the internet pretty clear.


The Pile dataset, for those interested:

https://pile.eleuther.ai/

https://arxiv.org/abs/2101.00027

I'm bullish on domain-specific models that start from generalized models. Something of a T-shape analogy, but maybe with a couple of distillation and fine-tuning steps.


The eightysixfour rule? You would think that this would follow something similar to Moore's law for a little while


Models trained on GPT output might be more distilled and specialized, but they wouldn't improve generalization.



I disagree with this. If you give GPT information that was not part of its dataset and ask it to make question and answer pairs off of that information, you are adding higher quality breadth to the training corpus.

Phi-2 seems like pretty good proof of that.
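
A minimal sketch of that kind of pipeline (the prompt wording and model choice are illustrative assumptions, not what the Phi authors did):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def make_qa_pairs(passage: str, n: int = 3) -> str:
        # Ask a strong LLM to turn new source material into synthetic Q&A text
        # that can then be added to a smaller model's training corpus.
        prompt = (f"Write {n} question-and-answer pairs that test understanding "
                  f"of the following passage:\n\n{passage}")
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    print(make_qa_pairs("The far side of the Moon is never visible from Earth."))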


That's the point: they get less good at everything, but really good at one or a few things.

The real benefit here is

1. It's much cheaper and faster to train a bunch of specialized models once you have a single good LLM

2. You probably can't get the same capabilities from a specialized model by training it directly.


> Note the model is trained on data generated by GPT-4.

Is it? I couldn't find that on the page, and I can't easily access the links. The previous paper used 1B tokens from GPT-3.5.

> It's probably orders of magnitude more expensive to generate the data at current API prices.

If you're generating a billion tokens, you might do better with dedicated instances; IIRC they used to say that if you were doing more than a few hundred million a month, dedicated was cheaper.


It's in the Phi-1.5 technical paper. For Phi-2 they bumped the number of tokens to 1.4T, and most of it is surely generated, like for the previous models.


I might be missing it, but I can't find where it says how the data was generated; it mostly refers back to the previous paper, which stated they used 3.5.

I'd not be too surprised, but I can't find anything in the technical report saying they're using 4 specifically.


Read the first paper "Textbooks Are All You Need".

> We annotate the quality of a small subset of these files (about 100k samples) using GPT-4: given a code snippet, the model is prompted to “determine its educational value for a student whose goal is to learn basic coding concepts”.


Yes, they didn't use GPT-4 to generate data.

They used GPT-3.5 to generate 1B tokens of synthetic data.

They used GPT-4 to annotate data to train a classifier to filter human written code.

The quote directly after yours:

> We then use this annotated dataset to train a random forest classifier that predicts the quality of a file/sample using its output embedding from a pretrained codegen model as features. We note that unlike GPT-3.5, which we use extensively to generate synthetic content (discussed below), we use GPT-4 minimally only for annotations on the quality of a small subset of The Stack and StackOverflow samples. We thus view our usage of GPT-4 as merely a way to avoid tedious human-annotation efforts
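
That filtering step looks roughly like this (a sketch with scikit-learn; the embeddings and labels are random stand-ins for the codegen-model embeddings and GPT-4 annotations described above):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Stand-ins: embeddings of code files from a pretrained codegen model,
    # and 0/1 "educational value" labels from GPT-4 on a small annotated subset.
    embeddings = np.random.rand(1000, 768)
    labels = np.random.randint(0, 2, size=1000)

    clf = RandomForestClassifier(n_estimators=100).fit(embeddings, labels)

    # The cheap classifier then scores the rest of the corpus, so GPT-4 is
    # only needed for the small annotated subset.
    keep = clf.predict_proba(embeddings)[:, 1] > 0.5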


Training LoRAs or using other parameter-efficient techniques to fine-tune LLMs can be done on a 3090 today for basically nothing.
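
For example, a minimal sketch with Hugging Face peft (the base model and hyperparameters are illustrative assumptions):

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "microsoft/phi-2"   # assumed; pick any causal LM that fits in 24 GB
    model = AutoModelForCausalLM.from_pretrained(base)
    tokenizer = AutoTokenizer.from_pretrained(base)

    cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
    model = get_peft_model(model, cfg)
    model.print_trainable_parameters()   # typically well under 1% of the base weights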


You don't need to train it again; Microsoft already did.

Unless you want to develop a new one; then you also need a team of researchers/engineers.

