That's a myopic view of TCO. It ignores so many things about purchasing on-prem hardware (for good or bad):
- Cost associated with finding an admin who understands how this thing works
Speed / convenience:
- Try before you buy
- Time it takes for the box to be built, shipped and sent to the data center
- Time (and cost) it takes to install software, drivers, etc
Maintenance / Capitalization / Finances:
- 3-year? What is the useful life of this? When does it seemingly become obsolete?
- AWS will continually upgrade their hardware and you keep paying the same
- Hardware can be capitalized, which means you can push it to the balance sheet (for tax or valuation purposes)
- Spending $90k instead of $184k in year 1 with the option to turn it off if you want (no longer need). This could be very valuable for a startup who wants elastic spending patterns.
- Returns, breakage, warranty in case of a hardware failure
I understand why there is a market for this product, but it's not always an apples-to-apples comparison. Generally speaking, if you know what your workload is going to be (I'd be hard-pressed to say many orgs really know the answer to this), then on-prem hardware is not a terrible choice, but it has to be analyzed appropriately.
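The year-by-year arithmetic behind the $90k-vs-$184k point above can be sketched out. The $10k/year on-prem operating figure (colo plus admin time) is an assumption for illustration, not a number from the article:

```python
# Cumulative cost sketch using the figures quoted in the thread:
# ~$90k/year for the cloud instance vs. ~$184k up front for the hardware.
# The $10k/year on-prem operating cost (colo + admin) is an assumption.

def cumulative_cost_cloud(years, annual_cost=90_000):
    """Total spent after `years` of renting, with the option to stop anytime."""
    return annual_cost * years

def cumulative_cost_onprem(years, capex=184_000, opex_per_year=10_000):
    """Total spent after `years` of owning: capex up front plus yearly opex."""
    return capex + opex_per_year * years

for y in (1, 2, 3):
    cheaper = "cloud" if cumulative_cost_cloud(y) < cumulative_cost_onprem(y) else "on-prem"
    print(f"year {y}: cloud ${cumulative_cost_cloud(y):,} vs "
          f"on-prem ${cumulative_cost_onprem(y):,} ({cheaper} is cheaper)")
```

On these assumptions the cloud wins in years 1 and 2 and the hardware wins by year 3, which is consistent with both sides of the thread: elastic spending early on, lower TCO for sustained use.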
Managing AWS resources is not "free" either, or the market for people to handle devops work on AWS related to ongoing operations would be non-existent. My experience is when I've done consulting, clients on AWS used to end up paying more per instance than on-prem clients ended up paying per physical server.
> - Time (and cost) it takes to install software, drivers, etc
This is largely an initial cost: a day or two of setup when you first build out an on-prem environment, plus some extra time for the first server of a totally new model. If you're buying one, then sure, you need to factor in a bit of time. If you're buying more, then with a competent admin the second one should be a matter of inserting boot media or (preferably) configuring PXE booting, and picking an IP.
This has been my experience, as well. Clients who switch to AWS for cost-saving reasons will be disappointed.
People often think they have workloads like that. I've worked on plenty of setups over the years where people were certain they had lots of big spikes, but most of them only had moderate day/night cycles (and usually a lot less spiky than they thought, because they'd never bothered to measure).
Only in a few cases have I seen usage patterns that justified spinning instances up/down very often, and for those cases, ironically, the availability of clouds has made dedicated hosting more cost-effective: if you host everything in VMs or containers anyway, you can now get to far higher utilization levels on your on-prem/colocated setups, and be prepared to spin up some cloud instances (automatically, if you wish; it's not that much effort) and tie them into your cluster when your load warrants it. It's not suitable for all workloads, but where it fits, it works surprisingly well.
I do see that argument a lot for using cloud providers, but the vast majority of cloud setups I've seen end up having very static workloads growing at rates that give plenty of time to prepare.
This is exactly what I see on the regular. The AWS model promises huge savings if you have these cyclical use cases and rarely does this ever happen in a small business or startup.
Other factors are obviously the ability to get a compute instance up immediately or because you have bad operational situations (horrible legacy DC equipment, genuinely bad admins or vendors, etc) that far outweigh the opex and capex parts of the equation and bleed into SLA, poor reputation / relationships, etc.
I worked for a large organisation that had this need. It would require ~10K cores and a large number of GPUs, for a period of a few days, approximately once a month (or less).
That sort of ask appears to be too much for Amazon, and they asked us to pre-book the capacity for the whole month, wiping out any cost savings we might have made over just buying the hardware.
The costs of finding an AWS admin is much lower. This is, generally speaking, why many orgs still use SAP for ERP since there are a plethora of blue-chip consultancies that support those systems.
Time (and cost) it takes to install software, drivers, etc
These tasks aren't made unnecessary by cloud. Yes, with cloud, once you've done the work, it can be forever encoded into an image or container. However, the same applies to on-prem using container solutions like Docker.
3-year? What is the useful life of this? When does it seemingly become obsolete?
With GPUs of recent history, obsolescence is not a concern in a 3-year time frame. On AWS, people are still using K80s (released in 2014!). The GTX 1080 Ti, which was released 2.5 years ago, is selling for substantially above MSRP. This may change if competition in the GPU space increases and NVIDIA loses its monopoly.
AWS will continually upgrade their hardware and you keep paying the same
True, but this concern is mitigated by the slow rate of GPU obsolescence.
Hardware can be capitalized, which means you can push it to the balance sheet (for tax or valuation purposes)
Can you say a little more about this? Not sure what you're getting at.
Spending $90k instead of $184k in year 1 with the option to turn it off if you want (no longer need). This could be very valuable for a startup who wants elastic spending patterns.
This needs to be evaluated on a case-by-case basis. A poorly-capitalized start-up with an unpredictable compute workload is not a good candidate for buying on-prem. A well-capitalized start-up that consistently uses GPUs is a better candidate.
Returns, breakage, warranty in case of a hardware failure
Any reasonable hardware provider will include an option for 3-year warranty.
I think he means this: that hardware goes to your balance sheet, then it depreciates by a certain amount over those 3 years and that loss may be tax-deductible.
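That depreciation point can be made concrete with a sketch, assuming straight-line depreciation over 3 years and an illustrative 21% corporate tax rate (both are assumptions; real schedules and rates vary by jurisdiction and accounting policy):

```python
# Straight-line depreciation sketch. The $184k cost, 3-year useful life,
# and 21% tax rate are illustrative assumptions.

def straight_line_depreciation(cost, useful_life_years, salvage=0):
    """Annual depreciation expense recognized each year of the useful life."""
    return (cost - salvage) / useful_life_years

annual_expense = straight_line_depreciation(184_000, 3)   # expense hitting the P&L each year
annual_tax_shield = annual_expense * 0.21                 # tax saved per year, if deductible
```

So a $184k purchase would show up as roughly a $61k expense per year for three years, shielding around $13k/year in tax on these assumptions, which is the "push it to the balance sheet" benefit the parent alluded to.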
Did you include the cost for that in your scenario?
I think they used the 3-year reserved instance pricing, meaning you can’t just turn it off after a year and stop the cash flow out (at least not at that price).
I agree with a lot of your points, especially the fact that TCO is practically impossible to calculate, is highly subjective, and should ideally include opportunity costs which, almost by definition, vary from person to person.
I've definitely thought through some of your points. First I want to say that if your load is spiky you have no business buying hardware. No matter how you slice it, you're simply not going to get hundreds of GPUs of throughput out of an 8-GPU machine, so it's not possible to get the same "product". So for those doing short bursts of hyperparameter search, stick with cloud.
But, like you said, for those with high utilization and known utilization patterns, it makes a lot of sense to go on-prem. Let's just say pretty much anybody who is buying a reserved instance from AWS should consider buying hardware instead.
>- Cost associated with finding an admin who understands how this thing works
The pricing is based on what colocation facilities will quote you without much searching, and is pretty generous. $10k per year per server is silly high and should cover that search cost.
> - Try before you buy
You can try essentially this exact machine on AWS. That instance and hardware are almost exactly the same. Now of course this doesn't apply to many instance types.
> - Time it takes for the box to be built, shipped and sent to the data center
5 business days is our mean lead time for that unit. Most workstations are 2 days.
It comes with that stuff pre-installed as mentioned in the article. See Lambda Stack for driver / framework / CUDA woes: https://lambdalabs.com/lambda-stack-deep-learning-software
> - 3-year? What is the useful life of this? When does it seemingly become obsolete?
I'll admit it's a long time but 3 years is about right for GPUs from my experience. I personally purchased new GPUs for my workstations in 2012 (Fermi), 2015 (Maxwell), 2017 (Pascal).
> - AWS will continually upgrade their hardware and you keep paying the same
Yes, but only if you never commit to a reserved instance, in which case costs are 3x higher. If you buy a reserved instance you don't get upgraded hardware.
> - Spending $90k instead of $184k in year 1 with the option to turn it off if you want (no longer need). This could be very valuable for a startup who wants elastic spending patterns.
Yea, most of these are people who are already using GPUs really often for internal training infra and are finding it expensive.
> - Returns, breakage, warranty in case of a hardware failure
Our hardware comes with a 3 year parts warranty. Of course, this doesn't pay for your time lost but it's pretty rare to see parts fail and the $10k / year more than covers in-colo swap outs.
I agree with all of your points. I don't think that we really disagree with much here. TCO is hard to calculate :).
A major drawback of having this sort of hardware on-prem (and this unit in particular) is that it doesn't include local storage for a 100TB+ scale dataset. There's not just the admin cost of hard drives and RAID, but there's the infra cost of synchronizing data to the machine. An average, reasonable, product-oriented Engineering Manager could easily choose cloud over a $100k savings if it means fewer headaches. Especially since GCloud will give big discounts if they're trying to outbid Amazon. On the other hand, if Lambda offered some sort of NAS solution plus caching software (Alluxio! :), that would make the offering a lot more competitive. Oh and maybe throw in a network engineer to set up peering and such.. ;)
What might make the comparison more compelling is to take anonymized workloads from your customers, examine how often the workload results in idle (wasted) time, and to factor that into the figures. If a user's workload is elastic (many are, especially in R&D), then the user would be drawn towards ML Engine, Sagemaker, Floydhub, Snark, etc... The downside to these services is that they almost always involve expensive copying of data from NAS to worker machines. But if user utilization is high, and the dataset is static, the on-prem machine is a prime investment.
While the savings is a big and noteworthy number, iteration speed is of principal concern to readers here and likely a good segment of your customers. I love the Lambda offerings, but there are a lot of deep learning hackers who can't make the most of bare metal.
The 1080 Ti server is probably the killer solution at 1/10th the cost. But NVIDIA legal blah blah blah :(
So this is an advert? Gotcha.
I think if you'd called that out at the top of the article, it would have been a little more honest.
I'd imagine this is part of having a system administrator.
Many benefits of running infrastructure in the cloud are lost for offline batch processing jobs. Training machine learning models doesn't require low response times, high availability, geographic proximity to clients, etc. Yet, with cloud, you're paying for all this extra infrastructure.
The main benefits of cloud for tasks with machine learning type workloads are cost saving (if utilization is low) and no electrical set-up.
On the other hand, cloud is extremely expensive for groups that require high base levels of GPU compute. The article is arguing that such groups can save a huge amount of money by moving infrastructure on-prem.
I'd email webnx sales; they have some EPYC servers (not this very new CPU, but they were always very flexible). It might be more economical if you front the price of the CPUs: the 7371 is very cheap for what it is, and the two should be around $3k (I saw them as a special order on gamepc for $1,600; the site name is rather ironic, because the monsters they sell are more workstation/server than gaming PC). I am not affiliated, just a very satisfied customer; I worked with them when I was the lead architect of a US Top 100 (per Quantcast) site.
The 1080 Ti is certainly not a problem for them.
If you're buying more than one, that'll easily cover enough to cable up whatever network you want between them.
Many colo providers also have their own cloud services, and nothing stops you from using S3, SQS and the like, though of course if you use AWS services "from the outside" you do need to factor in their extortionate outbound bandwidth charges.
Or like saying that if you spend 20 hours every day in an Uber, buying a car would be cheaper over a 3-year term.
Nobody is disputing that buying similarly specced hardware on the market is likely to have a cheaper TCO; the point of cloud providers is that they also have a bunch of stuff attached to the servers, which turns out to be pretty important.
Where do I get Gmail for my Raspberry Pi? If you order that Lambda server, you get the same good Linux interface you'd get on AWS. In fact, they actually seem to offer some kind of software package that you wouldn't get on AWS.
Sure, AWS has value-adds, but in this case it's mostly being able to avoid being billed for unused hours (what we call scalability), which for many is certainly worthwhile, but that's all. Is there anything I'm missing that AWS offers?
> if you spend 20 hours every day in an Uber, buying a car would be cheaper over a 3-year term.
Isn't that the case? I fail to see what's wrong with that sentence. I don't own a car; I subscribe to a service similar to Car2Go, and I'm at about $150 a month right now. Sure, there's value added there, but owning one has value-adds too, just different ones, and at some point it will be more worthwhile to own one.
If you know more, it would be really good for you to share some of what you know, so that we can all learn something. If you don't want to take the time to do that, that's fine, but in that case please don't post.
As bubblethink already mentioned, the article is arguing that batch processing jobs (e.g. those characteristic of Deep Learning) don't benefit substantially from cloud infrastructure.
High availability, edge processing, fast network upload, and the lego blocks for creating redundancy are moot.
Most training algorithms automatically recover if a node goes down. That covers all required failure handling.
"Everyone" seems to think they have massively spiky loads, and almost nobody does. There may often be demands that could spike (e.g. because "someone" has not randomized the start times for batch maintenance processes through the night but starts them all at the same time), but more often than not the spikiest demand comes from batch workloads where nobody cares if it takes a bit longer. In practice, when constrained, that often simply smooths out the overall capacity use.
(of course that does not mean that there aren't people with spiky enough workloads for it to make sense to put at least some of their workloads on VMs for cost reasons; but measure first)
So far all companies I worked for had a wave-like pattern for customer generated workloads, simply because of the locality of the customers. It doesn't even matter if it's a globally used product, as some markets are usually stronger than others (e.g. US vs. Asia). So you have some base load, which is relatively stable, but on top of that daily or even more frequent "spikes".
For non-customer-driven workloads you might even want to have spikes. Take big data pipelines, for example. While you could schedule them so that they're not spiky, you might benefit from running them as fast as possible to get your results earlier. That's something you probably wouldn't do with dedicated servers, simply because it'd be too expensive. Instead, with dedicated servers you'd optimize for utilization instead of speed. When you only pay for what you use, you don't have that limitation and can fire up as many instances as makes sense. That's actually a pretty nice benefit.
And don't forget changing workloads over the course of multiple months. The calculation looks completely different if you buy a dozen of servers now, but have your computing requirements change in a few months, so that you don't need them all anymore (reasons for such are manifold, but could be as simple as optimizations to your code).
Yes, but rarely big enough to allow for long enough periods of spinning down instances to be cheaper than dedicated hardware. Consider that (while not really a suitable example, since it's a GPU instance) the costs in the article are only that low for AWS because it's a reserved instance, where the savings of not using it are much lower than for regular instances.
I've seen setups where we were close to considering using AWS for peak load, which would have been far cheaper than going all-AWS. But when we did the calculations, we needed peaks lasting shorter than 6-8 hours for it to be worthwhile vs. new hardware, and the cheaper alternative was simply leaving older machines in our racks past the 3-year depreciation schedule. That's not "free" (they took up rack space and cost in terms of power), but it was cheaper to just let them stay in rotation until they failed or we needed the space for a faster machine than it was to rent extra capacity.
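The kind of "how short must the daily peak be" calculation described above can be sketched roughly like this. The per-hour rates are made up for illustration; the real figures from that setup aren't given:

```python
# Break-even daily peak length: cloud burst vs. buying more hardware.
# Assumed rates (not the real figures from that setup): a dedicated server
# amortizes to ~$1.00 per wall-clock hour whether busy or idle, while an
# equivalent on-demand cloud instance costs ~$3.40 per hour actually used.

DEDICATED_PER_WALL_HOUR = 1.00   # assumption: amortized capex + colo, paid 24/7
CLOUD_PER_USED_HOUR = 3.40       # assumption: on-demand hourly rate

def breakeven_peak_hours_per_day():
    # Dedicated costs 24 * rate per day regardless of use; cloud costs
    # rate * hours used. Cloud wins while the daily peak is shorter than this.
    return 24 * DEDICATED_PER_WALL_HOUR / CLOUD_PER_USED_HOUR
```

With these assumed rates the break-even lands around 7 hours a day, in the same ballpark as the 6-8 hours quoted; the exact number moves with the ratio of the two rates.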
Scaling up with cloud services cost-effectively for daily peaks is tricky, because you're still going to be scaling ahead of use and most of your "peaker instances" are not going to have average utilization anywhere near 100% when factoring in ramping up/down for setups that aren't huge, because you still have lead times before an instance is maxed out, or as load is dropping, and you need to factor in that lower utilization in your cost estimates.
That's not to say it's not worth considering, not least because having the ability to do so lets you load your dedicated equipment to much higher utilization rates because you're able to respond to spikes faster than you can add servers.
> For non-customer-driven workloads you might even want to have spikes. Take big data pipelines, for example. While you could schedule them so that they're not spiky, you might benefit from running them as fast as possible to get your results earlier.
You might. But my experience is that over a couple of decades of working with various types of online services, it's very rare for those types of workloads to move the needle much in terms of overall compute needs. There certainly are exceptions, and machine learning workloads might well make them more common, but they're comparatively rare in most companies. Keep in mind that while you may want capacity to handle spikes, in most such setups you also want capacity to handle loads that can run off peak.
But by all means: if you do have needs that require the kind of big spikes that you can't accommodate within your base capacity, then use cloud instances for that. They do have their place.
> And don't forget changing workloads over the course of multiple months. The calculation looks completely different if you buy a dozen of servers now, but have your computing requirements change in a few months, so that you don't need them all anymore (reasons for such are manifold, but could be as simple as optimizations to your code).
In practice I've never seen dramatic enough changes anywhere for this to be a real concern for most people, as massive drop-offs tend to be rare, and tend to be eaten up by subsequent growth in a matter of months unless you're in big enough trouble for it not to matter. If it's a potentially real concern to you, then rent dedicated servers by the month for some of the workload (typically still far cheaper than AWS, though not as much as leasing or buying), or do use a cloud provider for some of your peak load as an "insurance".
The big takeaway is not that you shouldn't use cloud services, but that you should know your workload and do the cost calculations.
Most DNN model training workloads are lumpy and transient.
I remember, though, that NVIDIA's EULA disallows use of the NVIDIA driver blob with consumer GPUs in a datacenter environment. Time will show how legally enforceable this restriction is.
Once your service is somewhat stable in terms of size and you can afford longer lead times, then you should return to on-prem to save money
Sure you'll have 4 GPUs per box and not 8, and sure, each GPU will have 11GB and not 32, but the whole machine (_with_ the GPUs) will cost just a tad more than a single V100. So if you don't really need 32GB of VRAM per GPU (and you most likely don't), it'd be insane to pay literally 5x as much as you have to.
A p3dn.24xlarge is available in a few minutes.
My experience in deploying new hardware-based solutions is that it typically takes between 6 and 12 weeks.
2) How does the TCO compare when I only need to train for 2 hours a day?
3) If I was previously training on AWS p2.16xl, I could upgrade to p3.16xl with basically 0 incremental cost (for the same workload). Does LambdaLabs offer free (0 capex) upgrades? If so, how long would it take to upgrade?
Most of this is due to arcane ordering processes both on client and supplier side.
If you're going to be using the server less than 73% of the time, AWS sounds better.
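A rough sketch of where a figure like 73% could come from: the on-prem 3-year total divided by what running flat-out on-demand would cost over the same period. The dollar figures here are assumptions chosen for illustration, not quoted prices:

```python
# Break-even utilization: below this, paying on-demand for only the hours
# you use beats owning. Dollar figures are assumptions for illustration.

HOURS_3Y = 3 * 365 * 24                # 26,280 hours in 3 years
ONPREM_TOTAL = 214_000                 # assumed: $184k hardware + $10k/yr colo/admin
CLOUD_HOURLY = 11.16                   # assumed on-demand rate, $/hour

breakeven_utilization = ONPREM_TOTAL / (CLOUD_HOURLY * HOURS_3Y)
print(f"break-even utilization: {breakeven_utilization:.0%}")
```

Above that utilization the fixed cost of the box wins; below it, per-hour billing wins. Reserved-instance pricing would shift the break-even point considerably.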
If you can take advantage of that flexibility to build reactive capacity then you can save money, but that wasn't the initial driving point.
The price difference can be substantial enough that you might simply be better off having it "over provisioned" off hours.
Also, not having "reactive capacity" might simply mean that your 95th percentile goes from 100ms response time to 300ms. Which again, might be a more cost effective approach.
I think the initial development and ongoing cost of maintaining that reactive capacity is also substantial enough to be considered.
Most of the apps that truly need the elasticity should probably go hybrid anyway: baseline on dedicated hardware, spikes on spot instances.
Almost everyone thinks they can; very few small businesses have the sysadmin/devops to pull it off.
Every 6 months or so a hard drive fails (out of 16 per server); no other components have failed. Ten of the machines are 6 years old (now a test system, as they're out of warranty), ten are 3 years old, and ten are new.
There's also 20 or so other servers under different loads, I've not had anything fail other than hard drives.
When I started the job, there were spare power supplies for some 10+ year old servers in storage, so I'm either lucky or reliability is improving.
I used to work for a VFX company where we would try and get as close to 100% utilization out of the farm.
We had machines that were 4 years old, still merrily plodding away (not too many, mind; they ate more power than they were worth). The things that tended to fail were hard drives and fans.
Between two data centres, let's assume I did 6 visits a year, and that we had one ~30-minute incident a month at $50/incident (it was less, but I don't remember the exact details, and it doesn't matter for this exercise). Let's assume I lost a whole day every visit (I didn't, though it got close at times), and "charge" $1000/day for my visits. That adds up to $6k/year for my time and $600/year for remote hands, or ~$110/year per server. For comparison, the colocation cost us ~$17k/year, or ~$283/year per server. These costs were pretty stable by number of servers, and so favored using fewer, more powerful servers than we might have otherwise.
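Spelling out that arithmetic (the server count of 60 is inferred from the $6,600/year total at ~$110 per server, so treat it as an assumption):

```python
# Per-server support-cost arithmetic from the figures quoted above.
# The server count (60) is inferred, not stated.

visits_per_year = 6
cost_per_visit = 1_000        # a "charged" day of the admin's time
incidents_per_year = 12       # one ~30-minute remote-hands incident a month
cost_per_incident = 50
servers = 60                  # assumption, implied by the per-server figures

support_total = visits_per_year * cost_per_visit + incidents_per_year * cost_per_incident
support_per_server = support_total / servers     # ~$110/year per server
colo_per_server = 17_000 / servers               # ~$283/year per server
```

Both per-server figures come out to within rounding of the ones quoted, which suggests the fleet was on the order of 60 servers.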
So that added the cost of renting space at a manned colo facility instead of having the servers in the office (we did have a rack of servers that didn't need 24/7 attention at our office as well).
The rest of my time was spent on devops work, which in my experience tends to be more expensive on cloud setups because complexity tends to be higher (on the basis of having contracted to do this kind of work on AWS too, and knowing the difference in billable hours I'd typically get per instance on AWS vs. per physical server on colocated setups).
Didn't this make each failure a bigger hit to your overall capacity? How much redundancy did you have? I used to work in adtech with colocated hardware, and it was old and failed a lot, but they had enough it didn't matter ("we're down 5/120 in Germany but we can swap them out while we're there next month").
It was also a fully virtualised setup that could also tie in rented dedicated servers or cloud instances via VPNs as needed. So where it made sense or if we had an urgent need, we had the ability to spin things up as needed.
E.g. we had racks in London, but rented servers at Hetzner in Germany (Hetzner got close to the cost of the colocated servers, though mostly because rack space in London is ridiculously expensive; it might actually have saved us money to put servers in their colo facilities in Germany, even with the cost of travel to/from them occasionally)
They covered that I think.
Their ImageNet timing fits within the bounds of a Spot Duration workload, so in the most optimistic scenario, you can subtract 70% from the price - assuming spot availability for this instance type. (Of course, there are many more model training exercises that don't even remotely fit inside 6 hours.)
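The optimistic spot arithmetic, sketched with an assumed on-demand rate (not a quoted AWS price):

```python
# Optimistic Spot arithmetic for a training run that fits inside a
# 6-hour Spot Duration window. The on-demand rate is an assumption.

ON_DEMAND_HOURLY = 31.22     # assumed p3dn-class on-demand rate, $/hour
SPOT_DISCOUNT = 0.70         # the optimistic 70% discount from the comment

spot_hourly = ON_DEMAND_HOURLY * (1 - SPOT_DISCOUNT)
run_hours = 6                # must fit within the Spot Duration window
run_cost = spot_hourly * run_hours
```

This is the best case: it assumes spot capacity is actually available for the instance type, and it breaks down entirely for jobs that can't finish within the window.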
First, thanks for writing this up. Too many people just take a “buy the box, divide by number of hours in 3 years approach”. Your comparison to a 3-year RI at AWS versus the hardware is thus more fair than most. You’re still missing a lot of the opportunity cost (both capital and human), scaling (each of these is probably 3 kW, and most electrical systems couldn’t handle say 20 of those), and so on.
That said, I don’t agree that 3 years is a reasonable depreciation period for GPUs for deep learning (the focus of this analysis). If you had purchased a box full of P100s before the V100 came out, you’d have regretted it. Not just in straight price/performance, but also operator time: a 2x speedup on training also yields faster time-to-market and/or more productive deep learning engineers (expensive!).
People still use K80s and P100s for their relative price/performance on FP64 and FP32 generic math (V100s come at a high premium for ML and NVIDIA knows it), but for most deep learning you’d be making a big mistake. Even FP32 things with either more memory per part or higher memory bandwidth mean that you’d rather not have a 36-month replacement plan.
If you really do want to do that, I’d recommend you buy them the day they come out (AWS launched V100s in October 2017, so we’re already 16 months in) to minimize the refresh regret.
tl;dr: unless the V100 is the perfect sweet spot in ML land for the next three years or so, a 3-year RI or a physical box will decline in utility.
With this setup you can get 2x4x V100 on Azure for a total of $42k/year (assuming running 24/7).
Even if one spent $40k to write code for spot instance management this is by far the cheapest solution for GPU compute both short term and long term.
source for calculation: https://cloudoptimizer.io
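A back-of-envelope check of that $42k/year figure, for 8 V100s (2 nodes of 4) running 24/7. The implied per-GPU-hour rate is low enough that it presumably reflects spot/low-priority pricing; the breakdown below is inference, not a quoted Azure price:

```python
# Implied hourly rates behind the $42k/year claim for 2x 4x V100 on Azure.
# All of this is back-of-envelope inference from the stated annual total.

HOURS_PER_YEAR = 365 * 24    # 8,760
ANNUAL_TOTAL = 42_000
GPUS = 2 * 4

hourly_all_gpus = ANNUAL_TOTAL / HOURS_PER_YEAR   # ~$4.79/hour for all 8 GPUs
hourly_per_gpu = hourly_all_gpus / GPUS           # ~$0.60 per GPU-hour
```

At roughly $0.60 per V100-hour, that's well below typical on-demand cloud rates, which is consistent with the spot-instance-management caveat in the comment.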
AWS is great for convenience when you can afford it, but it is a really expensive solution, even when factoring in the extra things you have to deal with to rent, lease or buy dedicated servers.
$100k/yr total compensation (i.e. a 60k-ish salary) for someone to babysit 10 servers isn't super unreasonable.
...seriously, any chance that's a real thing? Sounds better than what I do now.
In practice this will tend to include out-of-hours availability and/or devops-type work, not just low-level sysadmin stuff or physical maintenance, as a lot of that can be farmed out to "remote hands" at the colo providers on hourly rates with 24/7 availability, and will certainly cost a tiny fraction of that $10k/year.
To me it seems like a very conservative estimate, or one allowing for said sysadmin to provide a lot of value-add services (e.g. devops-type services) that you'll typically need for a cloud setup as well.
It's a comparison to AWS, so everything that you install/do/operate on that server is extra in both cases.
You'll need additional off-site backup, but that's starting to get out of the scope of the article.
If you need redundant blob storage, pretty much every colo provider has solutions, and most of them are going to be cheaper than S3. Worst case you can use S3, and then you need to factor in the bandwidth cost difference.
I don't see why they chose to promote that link now.
Anyway, despite my belly-aching, it was an interesting read.
But let's be candid: it was a promotional piece as well, and if I "owned" this forum I would certainly have something to gain by charging a modest fee for such placements. It may sound like I'm angry about this or have negative feelings, but I don't; not at all.
If you want a VPS take a look at Digital Ocean or Linode.
You should use AWS for convenience, not cost. They're expensive for cloud services, and even the cheapest cloud providers are expensive compared to renting dedicated for all but the most transient workloads, and of dedicated hosting providers I only know Hetzner to get close to the costs I could get for renting colo space or doing truly on-prem hosting. Even then the only reason Hetzner is competitive is because I'm in London where space/power is expensive, and they're in Germany, where it is cheap (e.g. they rent out colo space as well, and prices are at 1/3 to 1/4 of what I've paid in London).
Different resolutions and maximum tenants depending on card and license type, and you can't allocate resources homogeneously.