JetMoE: Reaching LLaMA2 performance with 0.1M dollars (myshell.ai)
272 points by gyre007 9 months ago | 90 comments



> JetMoE-8B is trained with less than $0.1 million cost but outperforms LLaMA2-7B from Meta AI, who has multi-billion-dollar training resources. LLM training can be much cheaper than people generally thought.

They want you to read this as "we spent $100k compared to Meta's spending billions", but that's not actually what this says. It says that they spent $100k and Meta has the resources to spend billions if they wanted to.

We don't know what Facebook spent on training LLaMA 2, but they say it took 184,320 A100-80GB GPU-hours to train the 7B model [0]. AWS charges $14.46/hour for an instance that has 8 of those GPUs [1], which amounts to $1.81/GPU/hr.

At that rate and assuming they paid something resembling AWS's list price, LLaMA 2 7B cost ~$333k. That's more than $100k, but not by orders of magnitude, and it's likely that Facebook wasn't paying the full price AWS is charging today.

[0] https://github.com/meta-llama/llama/blob/main/MODEL_CARD.md#...

[1] https://aws.amazon.com/ec2/instance-types/p4/
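A quick back-of-the-envelope in Python reproducing that estimate (all inputs are the figures quoted above; it assumes list pricing with no discounts, which Meta almost certainly doesn't pay):

    # Rough list-price estimate for LLaMA 2 7B training cost.
    gpu_hours = 184_320             # A100-80GB GPU-hours from Meta's model card
    instance_price_per_hr = 14.46   # 8-GPU AWS instance price quoted above
    price_per_gpu_hr = instance_price_per_hr / 8

    print(f"${price_per_gpu_hr:.2f}/GPU-hr")                # ~$1.81
    print(f"~${gpu_hours * price_per_gpu_hr:,.0f} total")   # ~$333k, matching the estimate above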


This entire difference can be explained by their double mixture-of-experts architecture: only 1/4 of the MLP and attention blocks are used at any time. Maybe this should be the headline: MoE reduces compute by a factor of 4 without losing accuracy. But this is already known. Still, it's interesting to see a smaller MoE model. This could be the ideal size for many local applications.


MoE reduces compute cost for inference at scale, but not for training. You still have to train the whole model (plus the router)


It's absolutely beneficial when training, because the forward pass and backpropagation are still only applied to the neurons that were activated.

The Mistral guys specifically mention that training speed (due to not needing as much compute) was one of the reasons Mixtral was released so soon after Mistral 7B.


With an MoE you only need to train a smaller model, which you can then combine into an x8 and finetune/train the router on. Mistral used their 7B base to make Mixtral, Qwen's new MoE uses their 1.8B model upscaled to 2.7B, and I'm pretty sure Grok also trained a smaller model first.


Very incorrect! The "8x7B" in the name regularly confuses people into some similar conclusion, but there are not eight 7B "experts" in Mixtral 8x. It's more apt to think of all 256 FFNs as the "experts," since each expert FFN on a given layer has no relation to the expert FFNs on other layers. You need to train them all within the MoE architecture; combining existing models ("clown car MoE") works, but it isn't gaining anything from the architecture/sparsity.
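To make the per-layer structure concrete, here's a minimal sketch of a sparse MoE feed-forward layer in PyTorch. It's illustrative only: the dimensions, expert count, and top-k are made-up values, not JetMoE's or Mixtral's actual configuration or code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseMoELayer(nn.Module):
        """Toy top-k mixture-of-experts FFN layer.

        Each transformer layer owns its own experts and its own router;
        experts on different layers are unrelated, which is why an
        "8x7B" MoE is not eight independent 7B models.
        """
        def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d_model, n_experts)   # per-layer router
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ])

        def forward(self, x):                             # x: (n_tokens, d_model)
            weights, idx = torch.topk(self.router(x), self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                rows, slots = (idx == e).nonzero(as_tuple=True)
                if rows.numel() == 0:
                    continue
                # Only the tokens routed to expert e actually run through it.
                out[rows] += weights[rows, slots, None] * expert(x[rows])
            return out

With n_experts=8 and top_k=2, only a quarter of the FFN parameters do work for any given token, which is where the compute saving comes from in both training and inference; the gradient likewise only flows through the experts that fired (plus the router).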


Sorry, could you expand on this a bit further? Are you saying that for a MoE, you want to train the exact same model, and then just finetune the feed forward networks differently for each of them? And you're saying that separately training 8 different models would not be efficient - do we have evidence for that?


You're only correct about Qwen's MoE. I presume that Chinese model builders feel more pressure to be efficient about using their GPU time because of sanctions.


If MoEs are that good (we know GPT-4 is one), then why not train very specific MoEs? One part of the MoE could be a perfect math model which can actually calculate 2+2. Wouldn't models like these be better in general?


Keeping it short: "Not even wrong", in the Pauli sense.

- People hear "mixture of experts" and think "N specialists", but e.g. think how much you need to know to autocomplete "Two plus two is "

- The fundamental thing about ML is that you define functions and give them data, and the more data you give, the better. Once you're at "I will simply give it the training data needed to be good enough at the task and wall off that part of the implementation", you're outside ML and have a chicken-and-egg problem

- We don't know GPT-4 is MoE

- MoE in practice is fundamentally about trading off runtime vs. static size properties to gain inference speed. I.e. 7x8 stored and picking 7x2 at runtime means you're somewhere between 7x2 and 7x3 in quality, you get inference at 7x2 speed, and you have to train, store, and load 7x8. You don't reach for it to increase quality; you reach for it to increase inference speed at the expense of inference RAM and total model size.
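For rough numbers on that trade-off, using Mistral's published figures for Mixtral rather than anything in this thread (so treat them as approximate):

    # Mixtral-style 8-expert, top-2 MoE: ~46.7B total params, ~12.9B active
    # per token, because attention/embedding weights are shared across experts.
    total_params_b = 46.7
    active_params_b = 12.9
    dense_7b = 7.0

    print(f"stored / active: {total_params_b / active_params_b:.1f}x")   # ~3.6x
    print(f"active vs dense 7B: {active_params_b / dense_7b:.1f}x")      # ~1.8x

So you pay roughly "8x7B" worth of storage and RAM to get roughly "2x7B" worth of per-token compute, which is the point above.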


> We don't know GPT-4 is MoE

Didn't Yam Peleg's tweet / leak confirm this one? I mean, he could be wrong about this, but I thought the consensus was on it being true by now.

(Copy of the removed tweets at https://www.reddit.com/r/mlscaling/comments/14wcy7m/gpt4s_de... )


It's just a dude retweeting a substack. I wouldn't bet against it* but I wouldn't bet on it either. His tweet would have just linked to the article in the top comment.

* I used to crusade against this rumor because the only source is that article, and people repeating that source. But I imagine it's a no-brainer given they have enough users that they essentially get a throughput bump 'for free' even if the model weights are huge, i.e. it's better to utilize as much GPU RAM as you can muster; the cost of needing more GPU RAM is offset by being able to run multiple inferences against the model all the time anyway.


Agreed. In inference, I could even imagine a 4-bit conversion for certain edge devices.


Meta has their own data centers, so they definitely didn't pay the equivalent of what AWS costs.


Good point, although it's possible that, with the extreme price of GPUs, it cost more to train by buying hardware than it would to rent. For example, it might take two to three years before the GPUs are paid for by customers.


The Linux reserved cost of a p3.16xlarge is $146,362.08 annually. On-demand cost is $214,444.80 annually.

I am pretty damn sure I could build an 8-GPU Intel Xeon E5-2686 v4 (Broadwell) server (that's what Amazon uses; it's $30 to $75 on eBay) for less than that and come out ahead on electricity even at full throttle. RTX 4090s are just under $2000 each on eBay.

8 GPU × $2000 (RTX 4090) + $1000 (for the rest of the computer) = $17,000

If pulling 2 kW continuously at 15 cents per kWh for 1 year, that's 2 kW × 24 hours × 365 days × $0.15/kWh, or $2,628.

In total the computer will cost $19,628 if you throw it in the dumpster at the end of each calendar year of using it.

If you stack internet cost of $200 a month on top, that's $2400 a year, which raises your annual cost to: $22,028

This is still $124,334 cheaper per year than one AWS 8-GPU server if you fully depreciate your own hardware at the end of year 1 to $0.

I could hire an engineer in America to babysit it with the money left over.
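The same arithmetic as a script, for anyone who wants to tweak the assumptions (every figure is the estimate above, not a vendor quote):

    # Reproduces the back-of-the-envelope comparison above.
    gpus = 8 * 2000                   # 8x RTX 4090 at ~$2,000 each
    rest_of_box = 1000                # CPU, board, PSU, etc.
    power = 2 * 24 * 365 * 0.15       # 2 kW continuous at $0.15/kWh
    internet = 200 * 12               # $200/month

    diy_annual = gpus + rest_of_box + power + internet
    aws_reserved_annual = 146_362.08  # p3.16xlarge reserved, from above

    print(f"DIY, fully depreciated in year 1: ${diy_annual:,.0f}")          # ~$22,028
    print(f"AWS reserved:                     ${aws_reserved_annual:,.0f}")
    print(f"Difference:                       ${aws_reserved_annual - diy_annual:,.0f}")  # ~$124,334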


Are consumer-grade RTX 4090 cards going to be suitable for running full tilt 24/7 for a year? Those things are fine to stress on the latest game for a few hours at a time, but would probably develop some defects from significant heat stress after just a few days at 100%.

This is inconsequential when you're playing Overwatch for a few hours a night and a frame drops now and again. If you're training an iteratively developed LLM though, physical defects could propagate into huge deficiencies in the final model.


Yep absolutely, crypto miners have been doing it for years.

I still think it would be impractical at scale because they are so much hotter and more power-hungry than the datacenter cards, and you would be lucky to score one or two if you're on a wait list.


Except you can absolutely obtain 4090s today, while enterprise hardware is (was? I haven't looked at the data recently) the thing on a wait list, which is the exact opposite of the scenario you mentioned.

I'm actually really surprised that you can still buy 4090s for under $2,000 (cheapest available I saw was $1,800 new and I only took 30 seconds to look), but you can usually sell certain models for quite a bit more. For example, my used 4090 FE is currently worth more than I paid for it.

I've played with AI, and while admittedly I've not done anything super serious, I can tell you that both the 3090 and 4090 are more than capable of performing. Pair them with a power-efficient AMD CPU and you have something that can be (somewhat) competitive with enterprise hardware.

I've seen the pricing of "cloud" offerings and I've toyed with the idea of creating an "AI Cloud" because I have access to really fast internet and super cheap electricity, but I haven't executed because I'm most certainly not a salesperson. I do, however, know enough about marketing that one should not target price, so there is that...


You could under-volt or watt-limit a bit and lose just a fraction of FLOPS for much less heat/power though, depending on the workload


I don't think they'd become a fire hazard, but it is true that one would likely pick something else for this application.

Having said that, switching to something like the Tesla V100-SXM2-16GB wouldn't cost that much more.

TBH, I'm shocked at how many people treat Amazon as the first choice for this stuff. Much of it isn't even what most would consider a "production" workload. You are paying for a lot of enterprise-readiness that you don't need for training.


If you wanted to finetune a Mixtral 8x7B, what would you use?


Given the relative availability, I'd probably try to do it with a couple of rtx4090s on tensordock.
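A minimal sketch of what that could look like with the Hugging Face stack: the model id, LoRA targets, and hyperparameters are illustrative assumptions, and whether two 24 GB cards are actually enough depends on sequence length, batch size, and offloading.

    # QLoRA-style fine-tune of Mixtral 8x7B: 4-bit base weights + LoRA adapters.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    model_id = "mistralai/Mixtral-8x7B-v0.1"
    bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb, device_map="auto"  # shard across both GPUs
    )

    lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()
    # ...then train the adapters with transformers.Trainer or trl's SFTTrainer.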


> TBH, I'm shocked at how many people treat Amazon as the first choice for this stuff

You can thank Amazon's legions of salespeople for that, particularly the end of year junket in Las Vegas where attendees are so pampered that about the only thing they won't do is suck your dick

Oh, yeah, they'll also yell at you on stage if you complain about their UI


Though this comparison is really only relevant for a couple of machines. Beyond that, at this cost, if you pay AWS list prices "at scale" you're doing something very wrong.

Don't get me wrong - I've frequently argued that AWS is price gouging and relying on people's lack of understanding of how the devops costs of running your own work out, but it doesn't take a huge budget before this calculation will look very different (still cheaper to own your own, though).


You can build an old Xeon-based system, but it only has 40 PCIe lanes. For training on 8 GPUs, how do you push data fast enough? I'm using a 7000-series Epyc for this to get 128 lanes. Have you built this kind of machine? Do you see good speed with 40 lanes? Curious, because then I could use an old Tyan motherboard, which comes in a full case with a good layout for multi-GPU. With an Epyc-based build I have to use risers and a custom frame, which is painful.

New Tyan boards are more costly but have a great case layout.


An A100-80 GPU goes for about $20K each.


The instances in question use Tesla V100-SXM2-16GB


Since the GPUs can be rented out afterwards, they amortize very quickly with prices in the order of $1/h.


I think AWS prices scale with hardware price


Meta also doesn't pay AWS anywhere near retail price for instances.


Why is this the case? Even AWS internally pays the same AWS prices as everyone else.


I'm less surprised if AWS internally pays AWS list prices, because that's just internal accounting. Even among the relatively small AWS customers I know, none needed to get very far into six figures of annual spend before a couple of quiet mentions to their account manager that they were reviewing other options were enough to get steep discounts.

Add in lots of credits, and if you pay list price, you're being taken to the cleaners.

I've done contract work for clients to be ready to migrate both as part of maximising credits and as part of negotiating posture, and the savings can be enormous (though it'd still usually be cheaper to use managed servers).


A couple of reasons: Meta's annual spend with AWS is large enough that they'll have a negotiated blanket discount that takes a fixed percentage off the top of their monthly spend. This is very common for larger AWS customers, not just Meta.

For instances specifically, any planned usage will be using either reserved instances or at minimum a compute savings plan (CSP) that drops the hourly rate dramatically in exchange for a committed number of instance hours, with or without an upfront payment.

Finally, there may be a negotiated rate for specific instance types built into the contract. Again, common for very large customers.

source: I was on one of the cloud-related infrastructure teams (left in early 2022). I have no idea about their spend (or discounts) today, but two years ago it was enough that Andy Jassy would meet 1:1 with Mark to "discuss the relationship".


They have their own data centers, they don't use AWS


They don't use AWS to run the big three apps (FB, Insta, WhatsApp). They very much use them for other things. FB data center machines tend to be highly-specialized and optimized for running the apps, not general-purpose compute.


Any company as big as Meta has teams working on optimisation (e.g. optimised kernels), usually with direct engagement with NVIDIA engineers.

These kinds of things are usually only selectively shared.


Training data costs zero?

What is the training data anyway? Books? (Is it legal to use non-public-domain works? Wait, who am I kidding.) Reddit posts scraped from the site? (Without cleanup there are lots of bad subreddits.) Wikipedia?

Of course, I understand why labour and other employee-related costs are ignored.


It costs more the more you care about squeaky clean training data. Of course you get a better model in return.

ChatGPT used a crawl of the internet and patched things up with alignment and DPO. Big boys like Microsoft might have deals with publishers to get textbooks in bulk.

Contents from sites with moderation can be filtered using the platform's mechanism, e.g., only include text with a certain length and count of upvotes.

LLMs can be used to generate and filter data as well. Humans have been used to do this; they might have to do it less in the future, mostly reviewing what the LLMs are suggesting.
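As a minimal sketch of the moderation-signal filtering described above (field names and thresholds are made up for illustration, not from any real pipeline):

    # Keep only reasonably long, community-vetted text.
    posts = [
        {"body": "Short spam", "upvotes": 1},
        {"body": "A long, well-received explanation. " * 20, "upvotes": 42},
    ]

    def keep_post(post, min_chars=200, min_upvotes=10):
        return len(post["body"]) >= min_chars and post["upvotes"] >= min_upvotes

    corpus = [p["body"] for p in posts if keep_post(p)]
    print(len(corpus))  # 1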


I’ll agree with your general point even though there are some subtleties to FBNY.

More important, we let an awful lot of self-promotion from the big guys slide around here.

I can live with the guys and gals doing this on a shoestring getting a little of that sweet hype love. This seems pretty legit.


You’ve been in tech for too long when 1 million USD is your smallest unit.


I wonder why they decided to call it 0.1M USD rather than 100k USD. For many of us, a million dollars is a large amount of money, even for a business.


I’m sure they had their reasons, but all I can see is the Simpsons meme with Mr Burns at the ATM saying “What’s the smallest amount of money I can think of? A thousand dollars.” ;)


It's to imply that it costs other people in the millions, but they did it for only 0.1 million, which is a small number of millions. Just a rhetorical trick.


100k isn't worth anywhere near what it used to be, due to inflation. It might get you a nice pickup truck or a kitchen remodel. If your business is doing research and can't spend that, then it's more of a hobby than a business.


Same reason things are priced at x.99: the 0.1 decimal "feels" smaller than seeing 100,000 ("holy fuck!"), etc.


Well, it's interesting to think about how much has been invested into BigModel companies (Anthropic, Perplexity, OpenAI) when it's very rapidly becoming commoditized.


It looks like Llama 2 7B took 184,320 A100-80GB GPU-hours to train [1]. This one says it used a 96×H100 GPU cluster for 2 weeks, for 32,256 GPU-hours. That's 17.5% of the GPU-hours, but H100s are faster than A100s [2]; FP16/bfloat16 performance is ~3x better.

If they had tried to replicate Llama 2 identically with their hardware setup, it'd cost a little bit less than twice their MoE model.

[1] https://github.com/meta-llama/llama/blob/main/MODEL_CARD.md#...

[2] https://blog.ori.co/choosing-between-nvidia-h100-vs-a100-per...
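The rough arithmetic behind that conclusion (the ~3x H100-vs-A100 speedup is a ballpark bf16 figure from [2], not a measurement of this specific workload):

    llama2_7b_a100_hours = 184_320
    jetmoe_h100_hours = 96 * 24 * 14                     # 32,256

    print(jetmoe_h100_hours / llama2_7b_a100_hours)      # ~0.175 of the raw hours

    h100_speedup = 3.0                                   # assumed bf16 advantage
    llama2_equiv_h100_hours = llama2_7b_a100_hours / h100_speedup   # ~61,440
    print(llama2_equiv_h100_hours / jetmoe_h100_hours)   # ~1.9x, i.e. a bit less than twice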


They mention the cost was ~$80,000 USD, so for 32,256 GPU-hours it comes to ~$2.48 an hour. Amazing how cost-effective the compute actually is.


I was paying $1.10 per A100-hour more than a year ago. $2.48 is crazy expensive.


It was for a 96×H100 cluster. Their provider was exabits.ai, which bills itself as a decentralised computing marketplace.


It might be equivalent to LLaMA2 but it's still not capable of even simple reasoning:

> If two apples cost 2 dollars and 1 apple costs 1.20, what is the discount you're getting

> To calculate the discount, you need to compare the original price of the apples to the price after the discount.

> The original price for two apples is: 2 apples * $1.20/apple = $2.40

> The price for one apple after the discount is: 1 apple * $2.00/apple = $2.00

> Now, let's calculate the discount for one apple:

> Original price for one apple - Discounted price for one apple = Discount amount

> $2.40 - $2.00 = $0.40

> The discount for one apple is $0.40. Since you're buying two apples, you'll get the discount on both:

> Discount amount for two apples = 2 * $0.40 = $0.80

> So, the discount you're getting for two apples is $0.80.
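For reference, here's the straightforward reading of the question worked out (my arithmetic, not the model's; the question is ambiguous about total vs. per-apple vs. percentage):

    single_price = 1.20            # one apple
    bundle_price = 2.00            # two apples

    undiscounted_pair = 2 * single_price             # $2.40
    discount_total = undiscounted_pair - bundle_price
    print(f"${discount_total:.2f} total")            # $0.40
    print(f"${discount_total / 2:.2f} per apple, "
          f"~{discount_total / undiscounted_pair:.0%} off")   # $0.20, ~17%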


The simple fact that they did not list the current SOTA for the size class in their comparison table tells you all you need to know about their confidence. And listing Gemma-2B is like shooting fish in a barrel; they might as well also put RedPajama on there.

It's good to see MoE being attempted at the smaller sizes, and it may scale well downwards as well given their results. But regardless, 1.25T is very little training data compared to the 6T that Mistral 7B received and even that makes it barely usable and likely not yet saturated. Before it, the sub-13B size class was considered basically an academic exercise.


I'm kind of impressed it was able to do basic math even if the reasoning isn't correct. That seems like an impressive emergent behavior for a small cheap model like this.


Llama2:7b makes the same mistakes. It's not until you use something like Mixtral or Llama2:13b that it actually gets the correct results (in my one example).

Interestingly Llama2:13b objects that there is no discount until I clarify: "the discount you're getting [with 2 apples]"

It's not just math, though; it's any kind of complex reasoning and ambiguity. Comparing to humans is always complex, but humans for the most part wouldn't balk at me asking what discount you're getting without specifying that it's the 2 apples that have the discount in this example. A more advanced model often states its assumptions.

There are lots of nuances in this question as well. I'm still paying 80c more than buying one apple, so I should only buy two apples if I would use two apples.


I wonder if the reasoning capability of a moe model is limited by the max size of the individual experts?


I've been out of academia for a bit, but in my day 100k USD would not have been considered academia-friendly in my neck of the woods...


That's about the cost of 1 grad student year including all overhead, I believe. It's definitely far less than what many physics and biology labs spend on equipment in a year.

I mean, you're an idiot of a PI if you have $500K/year of grants and spend it on 5 students and no compute.


$500k/year of grants for a single research institute would have been outrageous, and I used to be in NLP.


Sadly, no matter what kind of prompting I try, it fails the sisters question:

> Sally (a girl) has three brothers. Each of her brothers has two sisters. How many sisters does Sally have?


Sally has two sisters. Here's the reasoning: Sally has three brothers. Each of her brothers has two sisters. Since her brothers are the ones who have two sisters each, and we know there are three brothers, we can conclude that each brother has two sisters. Therefore, there are two sisters in total (one for each brother). So, Sally has two sisters.

https://www.lepton.ai/playground/chat?model=jetmoe-8b-chat


Not exactly something you'd expect a model of this size to solve; Mixtral fails it too (if you switch the names/genders, that is, since it's contaminated with a few versions). It does at least indicate that their training data might indeed be as clean as they say.


More complex models fail this question, so it would be exceptional if it passed it:

ollama run llama2:13b

>>> Sally (a girl) has three brothers. Each of her brothers has two sisters. How many sisters does Sally have?

This is a classic lateral thinking puzzle that requires you to think outside the box and challenge your assumptions. The answer is not what you might expect at first glance.

To solve this puzzle, we need to break it down step by step:

1. Sally has three brothers.

2. Each of her brothers has two sisters.

So, let's count the number of sisters that each of Sally's brothers has:

Brother 1: 2 sisters

Brother 2: 2 sisters

Brother 3: 2 sisters

Now, let's add up all the sisters that Sally has:

Sister 1 + Sister 2 + Sister 3 + Sister 4 + Sister 5 = 2 + 2 + 2 + 2 + 2 = 8

Therefore, Sally has 8 sisters.
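For reference, the intended reading of the puzzle (my reasoning, not any model's output):

    # Each brother's sisters are Sally plus Sally's sisters.
    brothers = 3
    sisters_per_brother = 2
    sallys_sisters = sisters_per_brother - 1
    print(sallys_sisters)   # 1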


> It only uses public datasets for training, and the code is open-sourced

Looking at the repo, there is no training or data processing code.


It was trained with "1 trillion tokens from large-scale open-source pretraining datasets, including RefinedWeb, Pile, Github data, etc."

I guess it is good that they mentioned some of it, but yeah, that isn't exceptionally helpful when making claims of it being 100% open source.

I'm not sure why they feel the need to be so secretive if all of the sources are open.


"etc." is the most important part here. There is NL SFT and code SFT data which guessing by the names are instruction data very likely from GPT-4. It is known in finetuning community that training with GPT-4 data is the easiest way of improving the model. If that's the case base JetMoE should be compared to finetuned llama, not base llama.


The Pile includes Books3; is there a test prompt to see whether the books are present in the training data?


At $DAY_JOB nowadays we do 128x H100 runs without thinking twice. It only takes a few days to train a small-ish LLM with that to test out some ideas.


echo $DAY_JOB


I don't get why people do this at all. It adds no clarity, only awkwardness, over "At my job".


"At my job" is annoying and awkward too if you're not going to specify where when making grand claims. It's not quite as annoying as "In my country we..." without specifying where, but it's close.


Ah yes, like people saying "back in my day" and not even giving a precise date, smh.


Also `echo $NIGHT_JOB`, I'm curious like that.

To preempt other queries, maybe just paste the `set` output.


> echo $DAY_JOB

FAANG AI Lab


Where are they hosted?


AWS and GCP both.


Out of curiosity, what leads you to train models from the ground up rather than fine tuning existing models?


We do both. You can't just fine-tune if you're trying a different model architecture, or even changing some of the hyperparameters of an existing one. Every now and again you might be able to reuse some of the weights, but that's about it. That's part of the reason research is so incredibly expensive and time-consuming in this field. I bet that $80k is only a fraction of the overall cost for the model described in the article, too.


Out of curiosity, I looked at the cheapest price for an H100 that I could find online.

Lambda Reserved Cloud [1] starts at $1.89 per H100 per hour.

It could be possible to get the cost down to a lower amount:

$1.89 × 96 GPUs × 24 hours × 14 days ≈ $61k

1 - https://lambdalabs.com/deep-learning/servers/hyperplane


This is the price of training if nothing fails.


It also depends on the interconnect speed. If you don't have fast enough interconnect between the machines, you won't get linear speedup with N gpus.


Can't wait for the GGUF to play with it. I tried the demo (https://www.lepton.ai/playground/chat?model=jetmoe-8b-chat), and the results were very good!


They trained a decent model, but marketed it with the wrong emphasis. The discussion gets derailed by people doing napkin maths about how it doesn't make sense rather than actual discussion of the model. Classic marketing mistake, classic HN response.


Anyone got a ballpark figure for what Meta spent on Llama 2 training for the 7B model?




demo https://www.lepton.ai/playground/chat?model=jetmoe-8b-chat

Just had a quick chat about Kubernetes, a very simple discussion. It held up well.


This stuff is just going to keep getting pushed down.


I'm skeptical; I expect data contamination was the reason for the high benchmark scores.


Yeah. IBM especially has a history of fudging the numbers on reports like this. Research puts together reports which are aggressively p-hacked and ensembled and overfit, and then sales uses those reports to boondoggle clients into using IBM.


This kind of assumption is super deceptive.

The Facebook budget includes money to pay off people they've ripped off (in private settlements) and money for lawyers to shield the developers so they can feel free to rip off copyrighted content without having to pay personal penalty or be imprisoned for infringement. It also includes the price of buying lobbyists to alter laws to let this practice continue.

Also, unless the authors work inside Facebook, they have no idea how much Facebook spent on training that model specifically.



