For anyone curious, it took 2048 A100 GPUs to train LLaMA, and each GPU costs roughly $15k (Facebook probably gets some sort of discount).
That's about $30M in hardware if you want to train at that scale. Also, IIRC it took 23 days to train the biggest model. Someone else can do the power consumption cost calculations.
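A quick back-of-the-envelope in Python, in case anyone wants to check the numbers (the $15k-per-card price is the assumption above, before any discount):

    # Rough capital cost of the GPUs alone
    num_gpus = 2048
    price_per_gpu = 15_000               # USD, assumed list price per A100
    capital_cost = num_gpus * price_per_gpu
    print(f"${capital_cost:,}")          # $30,720,000 -> roughly $30M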
Electricity costs are basically irrelevant because the cards are so expensive.
A100 cards draw 250 W each; with datacenter overheads we'll call it 1,000 kilowatts for all 2048 cards. 23 days is 552 hours, or 552,000 kilowatt-hours total.
Most datacenters pay between 7 and 10 cents per kilowatt-hour for electricity; some are below 4 cents. At 10 cents, that's about $55,000 in electricity, which is nothing next to $30 million in capital costs.
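For anyone who wants to check the math, a minimal sketch (the 1,000 kW figure rounds up from the 512 kW the cards alone draw, to cover cooling and networking overhead):

    # Electricity cost of the 23-day run at a few typical datacenter rates
    num_gpus = 2048
    card_power_kw = 0.250                       # 250 W per A100
    cards_only_kw = num_gpus * card_power_kw    # 512 kW for the cards alone
    total_power_kw = 1000                       # rounded up for datacenter overhead
    energy_kwh = total_power_kw * 23 * 24       # 552,000 kWh
    for rate in (0.04, 0.07, 0.10):             # $/kWh
        print(f"${energy_kwh * rate:,.0f} at ${rate:.2f}/kWh")
    # -> $22,080 / $38,640 / $55,200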
No, I'm willing to bet the CO2 cost of manufacturing the cards is also way higher than that of the electricity. Those things are built on a global supply chain, with materials potentially traveling thousands of kilometers between each step.
Long term I also imagine it's much cheaper to run these large model trainings on renewables. It's a very centralized process that doesn't necessarily need 100% availability.
The manufacturing process, however, is totally decentralized, and NVIDIA mostly manufactures in China where coal is cheap.
The US grid mix produces about 0.855 pounds of CO2 per kWh[0]. So 552,000 kWh works out to roughly 472,000 pounds of CO2, which is about 214 metric tons. At a cost of $40 per tonne[1] of CO2, that's roughly $8,560, which is still small compared to the capital cost of the cards.
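Same numbers as a quick script, using the grid factor and carbon price from the links above (those are the cited figures, not mine):

    # CO2 from 552,000 kWh on the average US grid mix, priced at $40/tonne
    energy_kwh = 552_000
    lbs_per_kwh = 0.855                  # US average grid mix [0]
    lbs_per_tonne = 2204.62
    co2_tonnes = energy_kwh * lbs_per_kwh / lbs_per_tonne   # ~214 t
    cost = co2_tonnes * 40                                  # $40/tonne [1]
    print(f"{co2_tonnes:.0f} tonnes CO2, ~${cost:,.0f}")    # ~214 t, ~$8,560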
AWS us-west-2 is housed in The Dalles and Prineville, Oregon. Not only are they near a massive wind farm in the Columbia Gorge, but also quite near the Columbia river's many hydro-electric dams. Facebook and Apple also have Prineville data centers. They are built there intentionally. Electricity at many data centers is quite carbon-lean.
I always feel there is an opportunity cost here though. If that green energy wasn’t being used for compute it could be available to heat someone’s home instead of them using dirty sources.
The $30M training cost is too high. Amazon's p4d.24xlarge is $32.77 an hour for 8 A100 GPUs; 2048 A100 GPUs for 23 days costs about $4.6M at that rate. You might even get a discount.
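Rough version of that estimate, assuming on-demand pricing with no reservation or spot discount:

    # Renting 2048 A100s as p4d.24xlarge instances (8 GPUs each) for 23 days
    hourly_rate = 32.77                  # USD per instance-hour, on-demand
    instances = 2048 // 8                # 256 instances
    hours = 23 * 24                      # 552 hours
    total = instances * hours * hourly_rate
    print(f"${total:,.0f}")              # ~$4,631,000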
At the same time, I guarantee you they didn't get it right the first time. I'm sure there were multiple runs (both serial and parallel) as they worked out kinks and tuned hyperparameters.
Not to mention, the kind of expertise to run this for a major corporation doesn't come free either. Facebook employs quite a few high-profile ML researchers who undoubtedly make mid-to-high six-figure salaries.
The point was that if you only need to train once, then it's cheaper to rent the GPUs than to buy them. If you need to train it multiple times, then the cost of buying the GPUs is amortized among runs.
In any case, the cost per run is going to be lower than $30M.
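Rough break-even under those same assumptions (ignoring power, hosting, resale value, and the fact that the cards get reused for other work):

    # How many full training runs before buying the cards beats renting them
    purchase_cost = 2048 * 15_000        # ~$30.7M in cards, at the assumed $15k each
    rental_per_run = 4.63e6              # ~$4.6M per 23-day on-demand run
    print(f"{purchase_cost / rental_per_run:.1f} runs")   # ~6.6 runs to break even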
I'm sure that's the case. The latest SKU I'm responsible for QC testing contains 4x A100s in a 2U chassis. And oh man, the number of QSFP ports it uses...
Azure is generally a pretty terrible cloud (poor UX, very slow for anything, multiple highly critical cross-tenant security issues, etc.) and far behind the market leader, AWS, so they have to compensate with pricing. Same reason Oracle Cloud is so reasonably priced: they're already so far behind that their usual pricing wouldn't make any sense.
There's no reasonable way to get an estimate of what it actually costs FB.
1) The GPUs are not single-use; they will be amortized over 3 years, and there are other things they will be used for that generate revenue.
2) The cost of the servers these GPUs run in, with massive CPU, RAM, and storage requirements.
3) The overhead of building and operating all of that infrastructure in terms of people, electricity, cooling, etc.
4) The overhead of having dozens or hundreds of engineers & scientists who contributed to this.
One way you can distill the first three is to use AWS/Azure/GCP costs. But then you are still missing a major factor, which is the humans who worked on it, and the human cost may very well exceed the hardware cost.
Plus there are a lot of highly specialized engineers required to keep all those GPUs up and running during training, the ML engineers skilled in deep learning + hardware, and the systems for gathering/cleaning/labelling data. Gather enough engineers and now you need managers, PMs, etc.