2,788,000 GPU-hours * 350 W TDP of an H800 = 975,800,000 GPU watt-hours
975,800,000 GPU Wh * (1.2 to account for non-GPU hardware) * (1.3 PUE [1]) = 1,522,248,000 total Wh, or 1,522,248 kWh to train DeepSeek-V3
1,522,248 kWh * (0.582 kg CO2eq/kWh in China [2]) = 885,948 kg CO2 equivalents to train DeepSeek-V3
A typical US passenger vehicle emits about 4.6 metric tons of CO2 per year. [3]
885,948 kg CO2 per DeepSeek / 4,600 kg CO2 per car = 192.6 cars per DeepSeek
So the final training run for DeepSeek-V3 emitted about as much greenhouse gas as putting roughly 193 more cars on the road for a year.
I also did some more math and found that this training run used about as much electricity as 141 US households consume over the course of a year. [4]
[1] https://enviliance.com/regions/east-asia/cn/report_10060
[2] https://ourworldindata.org/grapher/carbon-intensity-electric...
[3] https://www.epa.gov/greenvehicles/greenhouse-gas-emissions-t...
[4] divided the total kWh by the average annual household consumption given here: https://www.eia.gov/tools/faqs/faq.php?id=97&t=3
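In case anyone wants to check the numbers or swap in different assumptions, here's the same arithmetic as a small Python sketch. The constants are the figures cited above; the household figure is my rough approximation of the EIA value behind [4].

    # Back-of-the-envelope energy/emissions estimate for the DeepSeek-V3 training run.
    GPU_HOURS = 2_788_000            # H800 GPU-hours for the final training run
    GPU_TDP_W = 350                  # assumed per-GPU draw (W)
    NON_GPU_OVERHEAD = 1.2           # CPUs, RAM, networking, storage
    PUE = 1.3                        # data-center power usage effectiveness [1]
    GRID_KG_CO2_PER_KWH = 0.582      # carbon intensity of China's grid [2]
    CAR_KG_CO2_PER_YEAR = 4_600      # typical US passenger vehicle [3]
    HOUSEHOLD_KWH_PER_YEAR = 10_800  # approx. annual US household use [4] (my approximation)

    total_kwh = GPU_HOURS * GPU_TDP_W * NON_GPU_OVERHEAD * PUE / 1_000
    total_kg_co2 = total_kwh * GRID_KG_CO2_PER_KWH

    print(f"{total_kwh:,.0f} kWh")                                      # ~1,522,248 kWh
    print(f"{total_kg_co2:,.0f} kg CO2eq")                              # ~885,948 kg
    print(f"{total_kg_co2 / CAR_KG_CO2_PER_YEAR:.1f} car-years")        # ~192.6
    print(f"{total_kwh / HOUSEHOLD_KWH_PER_YEAR:.0f} household-years")  # ~141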
Actually -- and this is insane -- the amount of electricity required to train DeepSeek-V3 would power the Bitcoin network for all of 5 minutes.
DeepSeek would have to fully train a brand-new V3 every week to approach the kind of power consumption that individual Bitcoin mining facilities sustain.
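For a rough sanity check of the Bitcoin comparison: assuming the network draws on the order of 150 TWh per year (a commonly cited estimate, not a figure from this thread, and estimates vary widely), the arithmetic works out like this:

    # Rough check of the "5 minutes of Bitcoin" claim.
    # ASSUMPTION: ~150 TWh/year for the whole Bitcoin network; treat the result
    # as an order-of-magnitude figure, not a precise one.
    BITCOIN_TWH_PER_YEAR = 150
    TRAINING_KWH = 1_522_248                      # from the estimate above

    bitcoin_kwh_per_minute = BITCOIN_TWH_PER_YEAR * 1e9 / (365 * 24 * 60)
    print(TRAINING_KWH / bitcoin_kwh_per_minute)  # ~5.3 minutes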
They mostly aren't. The lack of transparency around how many parameters frontier models have and how long they're trained is a big obstacle when it comes to estimating the energy impact of training very large models.
I think a great way to create positive change in the world is to pressure OpenAI, Anthropic, Google, xAI, and Meta to share details about the energy cost of training and inference for their models. If every major provider offered this transparency, there would be less "keep your competitors in the dark" value in hiding the info. It would also allow customers to make decisions based on more than just performance and cost.
If they have a cluster of 2,048 H800 GPUs (the figure they have stated publicly), training would take 2,788,000 / (2,048 * 24 * 30) ~ 2 months.
A cluster of roughly 2,000 GPUs is what a second-tier AI lab has access to. It shows that you can play the state-of-the-art LLM game with some capital and a lot of brains.
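A minimal sketch of that GPU-hours-to-calendar-time conversion, in case you want to plug in other cluster sizes (the GPU-hour figure is the one reported; the rest is just arithmetic):

    # Convert reported GPU-hours into wall-clock training time for a given cluster size.
    GPU_HOURS = 2_788_000   # total H800 GPU-hours reported for DeepSeek-V3
    NUM_GPUS = 2_048        # cluster size

    wall_clock_hours = GPU_HOURS / NUM_GPUS
    print(wall_clock_hours / 24)         # ~57 days
    print(wall_clock_hours / (24 * 30))  # ~1.9 months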
Yesterday GPT asked me if I'd like to train a small LLM, and I laughed out loud.
That being said, I'm amazed at how far 1B models have come. I remember when TinyLlama came out a few years ago; it was not great. ($40K training cost, iirc.)
That was a 1B model, but these days even 0.5B models are remarkably coherent.
Can someone put this into perspective? I'm finding heterogeneous data on other models, e.g. number of tokens, number of GPUs used, cost, etc. It's hard to compare it all.
These articles are gold, thank you. I used your Gemma one from a few weeks back to get Gemma 3 performing properly. I know you guys are all GPU, but do you do any testing on CPU/GPU mixes? I'd like to see the prompt processing (pp) and tokens/s (t/s) on a pure 12-channel Epyc, and the same with a 24 GB GPU used to accelerate the pp.
Oh fantastic! For MoEs like DeepSeek, GPUs technically aren't that necessary! I actually tested on 1x H100 with, I think, 30 layers offloaded and the other 30 on the CPU - it wasn't bad at all!
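For anyone who wants to try a split like that, here's a minimal sketch assuming you're running a GGUF through llama-cpp-python; the model path, layer count, and thread count are illustrative placeholders, not the exact setup described above.

    from llama_cpp import Llama

    # Partial offload: keep ~30 transformer layers on the GPU, the rest on the CPU.
    llm = Llama(
        model_path="DeepSeek-V3-Q4_K_M.gguf",  # hypothetical local GGUF file
        n_gpu_layers=30,   # layers offloaded to the GPU (0 = CPU only, -1 = offload all)
        n_ctx=4096,        # context window
        n_threads=32,      # CPU threads for the layers left on the CPU
    )

    out = llm("Explain mixture-of-experts routing in two sentences.", max_tokens=128)
    print(out["choices"][0]["text"])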
Hasn't been updated for the -0324 release, unfortunately, and diff-pdf shows only a few small additions (and the consequent layout shift) in the arXiv version updated on Feb 18.
I like that they give advice to hardware manufacturers:
- offload communication to a dedicated co-processor
- implement higher precision for accumulating FP8 operations
- finer-grained quantization (see the sketch after this list)
...
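To make the last point concrete, here's a toy numpy sketch of what block-wise ("finer-grained") quantization means: each block of 128 values gets its own scale instead of one scale for the whole tensor, so a single outlier only hurts its own block. This is just an illustration of the idea (int8 for simplicity), not the paper's exact FP8 scheme.

    import numpy as np

    BLOCK = 128

    def quantize_blockwise(x: np.ndarray):
        # One scale per 128-value block instead of one scale for the whole tensor.
        x = x.reshape(-1, BLOCK)  # assumes len(x) is a multiple of BLOCK
        scales = np.abs(x).max(axis=1, keepdims=True) / 127.0
        q = np.round(x / scales).astype(np.int8)
        return q, scales

    def dequantize_blockwise(q: np.ndarray, scales: np.ndarray):
        return (q.astype(np.float32) * scales).reshape(-1)

    x = np.random.randn(1024).astype(np.float32)
    x[0] = 50.0  # a single large outlier
    q, s = quantize_blockwise(x)
    err = np.abs(dequantize_blockwise(q, s) - x).mean()
    print(f"mean abs error with per-block scales: {err:.4f}")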
DeepSeek V3-0324 (the new checkpoint) beats all but one of the proprietary, non-thinking LLMs by a significant margin. Check livebench.ai and the Artificial Analysis benchmarks for details.
The only non-thinking LLM the new V3 doesn't decisively thrash is GPT-4.5, which is more than 100 times more expensive than V3 and yet only a few (essentially negligible) percentage points better.