V100 Server On-Prem vs. AWS P3 Instance Cost Comparison (lambdalabs.com)
141 points by rbranson 32 days ago | 127 comments



> Our TCO includes energy, hiring a part-time system administrator, and co-location costs.

Which is a myopic view of TCO. This ignores so many things about purchasing on-prem hardware (for good or bad):

Admin:

- Cost associated with finding an admin who understands how this thing works

Speed / convenience:

- Try before you buy

- Time it takes for the box to be built, shipped and sent to the data center

- Time (and cost) it takes to install software, drivers, etc

Maintenance / Capitalization / Finances:

- 3-year? What is the useful life of this? When does it seemingly become obsolete?

- AWS will continually upgrade their hardware and you keep paying the same

- Hardware can be capitalized, which means you can push it to the balance sheet (for tax or valuation purposes)

- Spending $90k instead of $184k in year 1 with the option to turn it off if you want (no longer need). This could be very valuable for a startup who wants elastic spending patterns.

Hidden costs:

- Returns, breakage, warranty in case of a hardware failure

I understand why there is a market for this product, but it's not always an apples-to-apples comparison. Generally speaking, if you know what your workload is going to be (and I'd be hard pressed to say many orgs really know the answer to this), then on-prem hardware is not a terrible choice, but it has to be analyzed appropriately.


> - Cost associated with finding an admin who understands how this thing works

Managing AWS resources is not "free" either, or the market for people to handle devops work on AWS related to ongoing operations would be non-existent. In my consulting experience, clients on AWS ended up paying more per instance than on-prem clients paid per physical server.

> - Time (and cost) it takes to install software, drivers, etc

This is largely an initial cost of a day or two of setup when you first build out an on-prem environment, plus again for the first server of a totally new model. If you're buying one, then sure, you need to factor in a bit of time. If you're buying more, then with a competent admin the second one should be a matter of inserting boot media or (preferably) configuring PXE booting, and picking an IP.


> In my consulting experience, clients on AWS ended up paying more per instance than on-prem clients paid per physical server.

This has been my experience, as well. Clients who switch to AWS for cost-saving reasons will be disappointed.


I think AWS consultants say the same. I had a training session from an Amazon guy, arranged by my employer. He said that AWS is not going to save us a lot of money, but that its value is enabling us to move quickly and experiment. Even the managers of my company were aware of that.


I feel like the main advantage AWS has at this point is that "everyone is using it". While it may not give you a competitive advantage over your competitors, it at least doesn't put you at a disadvantage if you are using the same platform as them.


... unless the use case is weird, e.g. you need massive amounts of processing power infrequently.


There certainly are people with use-cases like that, but it is rare to find people whose large, sudden spikes are not better handled in other ways (e.g. a combination of CDNs and micro-caching), can not be trivially smoothed out (e.g. there's enough work with relaxed latency requirements, or work that can be scheduled to off-peak times), and are not small enough relative to base load that it still pays off to put most load on dedicated equipment and use AWS or similar to handle the spikes only.

People often think they have workloads like that. I've done work on plenty of setups over the years where people were certain they had lots of big spikes, but most of them only had moderate day/night cycles (and usually a lot less spiky than they thought, because they'd never bothered to measure).

Only in a few cases have I seen usage patterns that justified spinning instances up/down very often, and for those cases, ironically, the availability of clouds has made dedicated hosting more cost effective: if you host everything in VMs or containers anyway, you can now get to far higher utilization levels on your on-prem/colocated setup, and be prepared to spin up some cloud instances (automatically, if you wish; it's not that much effort) and tie them into your cluster when your load warrants it. It's not suitable for all workloads, but where it fits it works surprisingly well.
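As a minimal sketch of that kind of automatic burst-to-cloud setup (assuming boto3 with configured credentials; the AMI ID, instance type, threshold and get_cluster_load() helper are hypothetical placeholders, and a real system would also need to join the new node to your scheduler/VPN and terminate it again once load drops):

    # Sketch: burst to a cloud instance when the colo cluster is saturated.
    # All IDs, names and thresholds below are illustrative placeholders.
    import boto3

    AMI_ID = "ami-0123456789abcdef0"   # hypothetical image with your stack baked in
    INSTANCE_TYPE = "p3.2xlarge"
    LOAD_THRESHOLD = 0.85              # fraction of on-prem capacity in use

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def get_cluster_load():
        """Placeholder: return current utilization of the on-prem cluster (0.0-1.0)."""
        raise NotImplementedError

    def burst_if_needed():
        if get_cluster_load() < LOAD_THRESHOLD:
            return None
        # Launch one extra worker; a real system would cap fleet size, tag the
        # instance, and shut it down again once the queue drains.
        resp = ec2.run_instances(
            ImageId=AMI_ID,
            InstanceType=INSTANCE_TYPE,
            MinCount=1,
            MaxCount=1,
        )
        return resp["Instances"][0]["InstanceId"]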

I do see that argument a lot for using cloud providers, but the vast majority of cloud setups I've seen end up having very static workloads growing at rates that gives plenty of time to prepare.


> People often think they have workloads like that.

This is exactly what I see on the regular. The AWS model promises huge savings if you have these cyclical use cases and rarely does this ever happen in a small business or startup.


Worse, you can have such needs but wind up overprovisioning because your software is tough to autoscale up or down automatically. In that case you should probably pay for dedicated, mostly static servers in traditional setups until you can't stomach the costs of being overprovisioned anymore, then go to a cloud that lets you autoscale to recover a lot of your costs. Sure, reserved instances help substantially, but because it's a cloud, instances themselves can go up or down without the best SLAs (though I'd still say I've had more physical servers go belly up, among the tens of thousands I've seen, than EC2 instances among the hundreds of thousands I've seen).

Other factors are obviously the ability to get a compute instance up immediately or because you have bad operational situations (horrible legacy DC equipment, genuinely bad admins or vendors, etc) that far outweigh the opex and capex parts of the equation and bleed into SLA, poor reputation / relationships, etc.


This does apply to non-prod environments, though. Development/test environments that have low utilization and only run during business hours are better candidates.


> ... unless the use case is weird, e.g. you need massive amounts of processing power infrequently

I worked for a large organisation that had this need. It would require ~10K cores and a large number of GPUs, for a period of a few days, approximately once a month (or less).

That sort of ask appears to be too much for Amazon, and they asked us to pre-book the capacity for the whole month, wiping out any cost savings we might have made over just buying the hardware.


> Managing AWS resources is not "free" either, or the market for people to handle devops work on AWS related to ongoing operations would be non-existent.

The cost of finding an AWS admin is much lower. This is, generally speaking, why many orgs still use SAP for ERP: there is a plethora of blue-chip consultancies that support those systems.


Hi, Lambda engineer here (I'm one of the authors). You bring up some good points, I'd like to address some of them:

Admin:

> Time (and cost) it takes to install software, drivers, etc

These tasks aren't made unnecessary by cloud. Yes, with cloud, once you've done the work, it can be forever encoded into an image or container. However, the same applies to on-prem using container solutions like Docker.

Maintenance / Capitalization / Finances:

> 3-year? What is the useful life of this? When does it seemingly become obsolete?

With GPUs of recent history, obsolescence is not a concern in a 3-year time frame. On AWS, people are still using K80s (released in 2014!). The GTX 1080 Ti, which was released 2.5 years ago, is selling for substantially above MSRP. This may change if competition in the GPU space increases and NVIDIA loses its monopoly.

> AWS will continually upgrade their hardware and you keep paying the same

True, but this concern is mitigated by the slow rate of GPU obsolescence.

> Hardware can be capitalized, which means you can push it to the balance sheet (for tax or valuation purposes)

Can you say a little more about this? Not sure what you're getting at.

> Spending $90k instead of $184k in year 1 with the option to turn it off if you want (no longer need). This could be very valuable for a startup who wants elastic spending patterns.

This needs to be evaluated on a case-by-case basis. A poorly-capitalized start-up with an unpredictable compute workload is not a good candidate for buying on-prem. A well-capitalized start-up that consistently uses GPUs is a better candidate.

> Returns, breakage, warranty in case of a hardware failure

Any reasonable hardware provider will include an option for 3-year warranty.


> Hardware can be capitalized, which means you can push it to the balance sheet (for tax or valuation purposes)
> Can you say a little more about this? Not sure what you're getting at.

I think he means this: that hardware goes to your balance sheet, then it depreciates by a certain amount over those 3 years and that loss may be tax-deductible.


Correct. Depending on what accounting principles you use, this is typically 3-5 years. It's akin to an airline buying a Boeing plane. It'll cost them, say, $1B, but it'll actually hit their income over 35 years ($1B / 35), which means on their income statement only ~$28M shows up per year (simplified example). Most companies are valued and taxed on their income, so this is important to understand.
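To make the arithmetic explicit, here's a trivial straight-line depreciation helper; the $1B/35-year figures are the simplified example above, and the GPU-server numbers are purely hypothetical:

    # Straight-line depreciation: spread (cost - salvage value) evenly over the useful life.
    def annual_depreciation(purchase_price, salvage_value, useful_life_years):
        return (purchase_price - salvage_value) / useful_life_years

    # Simplified airplane example: $1B over 35 years, assuming no salvage value.
    print(annual_depreciation(1_000_000_000, 0, 35))   # ~28.6M per year on the income statement

    # Hypothetical $110k GPU server over 3 years with a $10k residual value.
    print(annual_depreciation(110_000, 10_000, 3))     # ~33.3k per year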


In other words: on your bank account it looks like an upfront cost, but because you could sell the servers at any time they really look more like a rental in your books, with capital slowly draining away each year as they become worth less.


> Any reasonable hardware provider will include an option for 3-year warranty.

Did you include the cost for that in your scenario?


For a reasonable vendor, a 3-year warranty has a negligible cost (~5-10% of the price of the system).


In fact, for most enterprise grade hardware I've ever purchased, the three year warranty tends to be pretty much built into the cost. Usually where I pay extra is to add the fourth and fifth years.


> Spending $90k instead of $184k in year 1 with the option to turn it off if you want (no longer need). This could be very valuable for a startup who wants elastic spending patterns.

I think they used the 3-year reserved instance pricing, meaning you can’t just turn it off after a year and stop the cash flow out (at least not at that price).


It is possible to sell (or buy) reserved instances in a secondary market: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ri-marke...


(Author here)

I agree with a lot of your points, especially the fact that TCO is practically impossible to calculate, is highly subjective, and should ideally include opportunity costs which, almost by definition, vary from person to person.

I've definitely thought through some of your points. First I want to say that if your load is spiky you have no business buying hardware. No matter how you slice it, you're simply not going to be able to get hundreds of GPUs of throughput with an 8 GPU machine so it's not possible to get the same "product". So for those doing short bursts of hyperparameter search, stick with cloud.

But, like you said, for those with high utilization and known utilization patterns, it makes a lot of sense to go on-prem. Let's just say pretty much anybody who is buying a reserved instance from AWS should consider buying hardware instead.

> - Cost associated with finding an admin who understands how this thing works

Pricing is based on what co-location facilities will provide you without much searching, and is pretty generous. $10k per year per server is silly high and should cover that search cost.

> - Try before you buy

You can try essentially this exact machine on AWS. That instance and hardware are almost exactly the same. Now of course this doesn't apply to many instance types.

> - Time it takes for the box to be built, shipped and sent to the data center

5 business days is our mean lead time for that unit. Most workstations are 2 days.

> - Time (and cost) it takes to install software, drivers, etc

It comes with that stuff pre-installed as mentioned in the article. See Lambda Stack for driver / framework / CUDA woes: https://lambdalabs.com/lambda-stack-deep-learning-software

> - 3-year? What is the useful life of this? When does it seemingly become obsolete?

I'll admit it's a long time but 3 years is about right for GPUs from my experience. I personally purchased new GPUs for my workstations in 2012 (Fermi), 2015 (Maxwell), 2017 (Pascal).

> - AWS will continually upgrade their hardware and you keep paying the same

Yes, but only if you never commit to a reserved instance, in which case costs are 3x higher. If you buy a reserved instance you don't get upgraded hardware.

> - Spending $90k instead of $184k in year 1 with the option to turn it off if you want (no longer need). This could be very valuable for a startup who wants elastic spending patterns.

Yea, most of these are people who are already using GPUs really often for internal training infra and are finding it expensive.

> - Returns, breakage, warranty in case of a hardware failure

Our hardware comes with a 3 year parts warranty. Of course, this doesn't pay for your time lost but it's pretty rare to see parts fail and the $10k / year more than covers in-colo swap outs.

I agree with all of your points. I don't think that we really disagree with much here. TCO is hard to calculate :).


K80s are 2014 and people still use them at scale. So 3 years isn't unreasonable, right?

A major drawback of having this sort of hardware on-prem (and this unit in particular) is that it doesn't include local storage for a 100TB+ scale dataset. There's not just the admin cost of hard drives and RAID, but there's the infra cost of synchronizing data to the machine. An average, reasonable, product-oriented Engineering Manager could easily choose cloud over a $100k savings if it means less headaches. Especially since GCloud will give big discounts if they're trying to outbid Amazon. On the other hand, if Lambda offered some sort of NAS solution plus caching software (Alluxio! :), that would make the offering a lot more competitive. Oh and maybe throw in a network engineer to set up peering and such.. ;)

What might make the comparison more compelling is to take anonymized workloads from your customers, examine how often the workload results in idle (wasted) time, and to factor that into the figures. If a user's workload is elastic (many are, especially in R&D), then the user would be drawn towards ML Engine, Sagemaker, Floydhub, Snark, etc... The downside to these services is that they almost always involve expensive copying of data from NAS to worker machines. But if user utilization is high, and the dataset is static, the on-prem machine is a prime investment.

While the savings is a big and noteworthy number, iteration speed is of principal concern to readers here and likely to a good segment of your customers. I love the Lambda offerings, but there are a lot of deep learning hackers who can't make the most of bare metal.

The 1080 Ti server is probably the killer solution at 1/10th the cost. But NVIDIA legal blah blah blah :(


Thanks for the response! Yes you're right, it is difficult to calculate. I hope people reading my comment took it into the context of a larger view on the process to calculate TCO, as opposed to just directly scrutinizing your products/article.


> 5 business days is our mean lead time for that unit. Most workstations are 2 days.

So this is an advert? Gotcha.

I think if you had called that out at the top of the article it would have been a little more honest.


> - Time (and cost) it takes to install software, drivers, etc

I'd imagine this is part of having a system administrator.


This false dichotomy between colo and AWS is just exhausting, because I have been repeating this for so long: just rent a dedicated server. There surely are some cases where colo is the best choice, but as the years (now a decade) have passed since I started trying to spread this, it makes less and less sense every year -- and it never made much sense in the first place. Maybe if you have several racks' worth of equipment? I am not familiar with that size.


Lambda Labs engineer here. Here's what we're trying to argue:

Many benefits of running infrastructure in the cloud are lost for offline batch processing jobs. Training machine learning models doesn't require low response times, high availability, geographic proximity to clients, etc. Yet, with cloud, you're paying for all this extra infrastructure.

The main benefits of cloud for tasks with machine learning type workloads are cost saving (if utilization is low) and no electrical set-up.

On the other hand, cloud is extremely expensive for groups that require high base levels of GPU compute. The article is arguing that such groups can save a huge amount of money by moving infrastructure on-prem.


I feel I am talking to a brick wall. This often happens, and it is what exhausts me. You are still arguing cloud vs. colo/on-prem (even ignoring that on-prem and colo are very different, but whatever), and what I am saying is that there is a third option - renting dedicated servers - that neither the article nor your reply even acknowledges.


I have been hard-pressed to find Threadripper 2990wx dedicated servers with 1080ti/2080ti GPUs in them for rental, much less at a reasonable cost. Colo/on-prem was our only option for our mega transcoding server.


That's not a server CPU. You'd need to look for a dual EPYC 7371 server for similar performance -- the 2990WX is 32 cores at 3.0-4.2 GHz, the 7371 is 16 cores at 3.1-3.8 GHz.

I'd email WebNX sales; they have some EPYC servers (although not this very new CPU), but they were always very flexible. It might be more economical if you front the price of the CPUs, but the 7371 is very cheap for what it is; the two should be around $3k (I saw them as a special order on GamePC for $1,600 each - the site name is very ironic because the monsters they sell are more workstation/server than gaming PC). I am not affiliated, just a very satisfied customer; I worked with them when I was the lead architect of a US Top 100 (per Quantcast) site.

The 1080 Ti is certainly not a problem for them.


Now if they'd only throw in S3 for pseudo-infinite data storage, reliable SQS for work queue management, 10/25/100 Gigabit networking between the instances, redundant power supplies and cooling and racks in carefully selected stable locations for free, I'd buy a dozen!


The price includes $15k/year colocation costs. That easily covers the redundant power, cooling, rack space, massive amounts of bandwidth and redundant network connections. The last place I handled colocated servers, we spent ~$17k combined for racks in two different data centres hosting a combined ~60 servers, so $15k/year for a single server is a very conservative estimate.

If you're buying more than one, that'll easily cover enough to cable up whatever network you want between them.

Many colo providers also have their own cloud services, and nothing stops you from using S3, SQS and the like, though of course if you use AWS services "from the outside" you do need to factor in their extortionate outbound bandwidth charges.


I never get the point of these things. This is like saying a Raspberry Pi hooked up to your home router is cheaper, why bother buying Gmail for your company.

Or like saying that if you spend 20 hours every day in an Uber buying a car would be cheaper over 3 year term.

Nobody is disputing that buying similarly specced hardware on the market is likely to have a cheaper TCO; the point of cloud providers is that they also have a bunch of stuff attached to the servers, which turns out to be pretty important.


The point is that GPU instances are different. Use cases like ML training or general HPC are sufficiently isolated from the rest of your stuff that you don't really need all the bells and whistles that AWS has. If you want to host something like a niche GPU database as a production system, you may get more out of AWS's ecosystem, but even then, the cost difference is huge.


> This is like saying a Raspberry Pi hooked up to your home router is cheaper, why bother buying Gmail for your company.

Where do I get Gmail for my Raspberry Pi? If you order that Lambda server, you get the same good Linux interface you'll get on AWS. In fact, they actually seem to offer some kind of software package that you wouldn't get on AWS.

Sure, AWS has value-adds, but in this case it's mostly not being billed for unused hours (what we call scalability), which for many is certainly worthwhile, but that's all. Is there anything I'm missing that AWS offers?

> if you spend 20 hours every day in an Uber buying a car would be cheaper over 3 year term.

Isn't that the case? I fail to see what's wrong with that sentence. I don't own a car; I subscribe to a service similar to Car2Go, and I'm at about $150 a month right now. Sure, there's value added there, but owning one also has value added, just different ones, and at some point it will be more worthwhile to own one.


[flagged]


Because the vast majority of systems never need that scale. In other words: while it's an interesting niche situation, to most people how to handle small static workloads is much closer to what is actually relevant to them.


Could you please stop posting unsubstantive (and in this case, supercilious) comments to Hacker News?

If you know more, it would be really good for you to share some of what you know, so that we can all learn something. If you don't want to take the time to do that, that's fine, but in that case please don't post.

https://news.ycombinator.com/newsguidelines.html


Lambda engineer here (I'm one of the authors).

As bubblethink already mentioned, the article is arguing that batch processing jobs (e.g. those characteristic of Deep Learning) don't benefit substantially from cloud infrastructure.

High availability, edge processing, fast network upload, and the lego blocks for creating redundancy are moot.

Most training algorithms automatically recover if a node goes down. That covers all required failure handling.


Using both S3 and SQS outside of AWS is perfectly possible, and the colo provides decent bandwidth, cooling and power; that's literally the point of them.


An availability zone or two wouldn't hurt either.


Buy servers if you have stable workloads, otherwise rent virtual machines in the cloud.


I'd add to that: Most people have far more stable workloads than they like to think.

"Everyone" seems to think they have massively spiky loads, and almost nobody does. There may often be demands that could spike (e.g. because "someone" has not randomized the start times for batch maintenance processes through the night but start them all at the same time), but more often than not the spikiest demand comes from batch workloads where nobody cares if it takes a bit longer. In practice when constrained that often simply smooths out the overall capacity use.

(of course that does not mean that there aren't people with spiky enough workloads for it to make sense to put at least some of their workloads on VMs for cost reasons; but measure first)


Is that really the case?

So far all companies I worked for had a wave-like pattern for customer generated workloads, simply because of the locality of the customers. It doesn't even matter if it's a globally used product, as some markets are usually stronger than others (e.g. US vs. Asia). So you have some base load, which is relatively stable, but on top of that daily or even more frequent "spikes".

For non-customer-driven workloads you might even want to have spikes. Take big data pipelines, for example. While you could schedule them so that they're not spiky, you might benefit from running them as fast as possible to get your results earlier. That's something you probably wouldn't do with dedicated servers, simply because it'd be too expensive. Instead, with dedicated servers you'd optimize for utilization instead of speed. When you only pay for what you use, you don't have that limitation and can fire up as many instances as makes sense. That's actually a pretty nice benefit.

And don't forget changing workloads over the course of multiple months. The calculation looks completely different if you buy a dozen servers now but your computing requirements change in a few months, so that you don't need them all anymore (the reasons for that are manifold, but could be as simple as optimizations to your code).


> So far all companies I worked for had a wave-like pattern for customer generated workloads, simply because of the locality of the customers.

Yes, but rarely big enough to allow for long enough periods of spinning down instances to be cheaper than dedicated hardware. Consider that (while not really a suitable example, since it's a GPU instance) the costs in the article are only that low for AWS because it's a reserved instance, where the savings of not using it are much lower than for regular instances.

I've seen setups where we were close to considering using AWS for peak load, which would have been far cheaper than going all AWS, but when we did the calculations we needed peaks lasting shorter than 6-8 hours for it to be worthwhile vs. new hardware, and the cheaper alternative was simply leaving older machines in our racks past the 3 year depreciation schedule - it's not "free", they took up rack space and cost in terms of power, but it was cheaper to just let them stay in rotation until they failed or we needed the space for a faster machine, than it was to rent extra capacity.

Scaling up with cloud services cost-effectively for daily peaks is tricky, because you're still going to be scaling ahead of use. For setups that aren't huge, most of your "peaker instances" are not going to have average utilization anywhere near 100% once you factor in ramping up and down - there are lead times before an instance is maxed out, and again as load drops - and you need to factor that lower utilization into your cost estimates.

That's not to say it's not worth considering, not least because having the ability to do so lets you load your dedicated equipment to much higher utilization rates because you're able to respond to spikes faster than you can add servers.

> For non-customer-driven workloads you might even want to have spikes. Take big data pipelines, for example. While you could schedule them so that they're not spiky, you might benefit from running them as fast as possible to get your results earlier.

You might. But my experience is that over a couple of decades of working with various types of online services, it's very rare for those types of workloads to move the needle much in terms of overall compute needs. There certainly are exceptions, and machine learning workloads might well make them more common, but they're comparatively rare in most companies. Keep in mind that while you may want capacity to handle spikes, in most such setups you also want capacity to handle loads that can run off peak.

But by all means: if you do have needs that require those kinds of big spikes that you can't accommodate within your base capacity, then use cloud instances for that. They do have their place.

> And don't forget changing workloads over the course of multiple months. The calculation looks completely different if you buy a dozen servers now but your computing requirements change in a few months, so that you don't need them all anymore (the reasons for that are manifold, but could be as simple as optimizations to your code).

In practice I've never seen dramatic enough changes anywhere for this to be a real concern for most people, as massive drop-offs tend to be rare, and tend to be eaten up by subsequent growth in a matter of months unless you're in big enough problems for it not to matter. If it's a potentially real concern for you, then rent dedicated servers by the month for some of the workload - it is typically still far cheaper than AWS, though not as much as leasing or buying - or use a cloud provider for some of your peak load as "insurance".

The big takeaway is not that you shouldn't use cloud services, but that you should know your workload and do the cost calculations.
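For anyone who does want to do that calculation, here's a back-of-the-envelope sketch. Every price in it is an illustrative assumption, not the article's figures; the point is only the shape of the formula and how much the answer swings depending on which cloud price you compare against:

    # Break-even utilization: below this fraction of hours, paying by the hour and
    # shutting instances off is cheaper than owning. All prices are placeholders.
    HOURS_3Y = 24 * 365 * 3                 # 26,280 hours

    on_prem_tco_3y = 110_000                # hypothetical: hardware + colo + admin + power

    for label, hourly in [("on-demand", 25.0), ("1-yr reserved", 10.0), ("spot", 8.0)]:
        breakeven = on_prem_tco_3y / (hourly * HOURS_3Y)
        print(f"{label:>13}: break-even at about {breakeven:.0%} utilization")

With these made-up numbers you get wildly different thresholds (~17%, ~42%, ~52%), which is exactly why measuring your real utilization matters before committing either way.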


The article also assumes 100% utilization on the cloud. I wonder if continual GPU-based training of DNN models is perhaps a fairly circumscribed use case?

Most DNN model training workloads are lumpy and transient.


Practically though the moment you use an instance more than 50% of the time AWS incentivizes buying the annual plan.


Using spot instances on AWS is a good way to save money on infrequent tasks, as well.


If I'm reading the chart right, GPU servers on AWS are cheaper if you utilize them less than 65-75% of the time, and turn them off when not in use.


I think so, if you can afford turning off a server during the night then it makes sense to take advantage of some sort of hourly billing which is what most clouds offer by default.



I should probably change the first sentence in the article to this.


If you're not deploying in a datacenter, you can save even more money by building a workstation with a few 2080 Ti cards, which cost $1200 and give 90% of the speed of the $3000 Titan V: https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v...
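In perf-per-dollar terms that works out to roughly 2.25x; a trivial sanity check using only the numbers quoted above:

    # Rough perf-per-dollar comparison using the figures in the parent comment.
    speed_2080ti, price_2080ti = 0.9, 1200    # ~90% of a Titan V's training speed
    speed_titanv, price_titanv = 1.0, 3000

    ratio = (speed_2080ti / price_2080ti) / (speed_titanv / price_titanv)
    print(f"2080 Ti: ~{ratio:.2f}x the throughput per dollar")   # ~2.25x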


Exactly my thoughts - instead of one datacenter-class GPU, you can use multiple desktop components for better performance and more performance per buck. I've seen some sweet Supermicro 2U chassis that can fit, and keep cool, four 2080 Tis.

I remember, though, that NVIDIA's EULA disallows use of the NVIDIA driver blob with consumer GPUs in datacenter environments. Time will show how legally enforceable this restriction is.


We have one of these. Aside from the mechanical differences (it's actually a 2.5U server due to how the power cables connect to consumer cards), you're losing out on NVLink. You're going to be restricted to single-card training, which means either batching or other restrictions to ensure it fits in the memory of a single card. This setup is able to train four one-card models in parallel; it is not able to train a four-card model.


The 2080 Ti does support NVLink


There's a physical connector also called NVLink, yes, but the cards don't present as unified at the driver level; nvidia-smi shows "link: off" even using the bridge. It's effectively SLI, which can reduce memory bandwidth load and cross-card transfer costs, but doesn't unify the cards the way Volta (actual) NVLink does.
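If you want to check what a given box actually gives you, nvidia-smi will report it (assuming a standard NVIDIA driver install); a small illustrative wrapper:

    # Inspect NVLink status and GPU-to-GPU topology via nvidia-smi.
    import subprocess

    # Per-GPU NVLink link status; bridged-but-inactive links show up accordingly.
    subprocess.run(["nvidia-smi", "nvlink", "--status"], check=False)

    # Topology matrix: NVLink paths appear as NV1/NV2/..., PCIe-only paths as PIX/PHB/SYS.
    subprocess.run(["nvidia-smi", "topo", "-m"], check=False)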


In my mind, the reason to use something like AWS is to a) get your servers in minutes instead of weeks and b) easily right-size your service

Once your service is somewhat stable in terms of size and you can afford longer lead times, then you should return to on-prem to save money


What's this false dichotomy between AWS and on-prem? Dedicated servers at Hetzner, OVH, Datapacket, etc. are much cheaper than AWS and can also be ready in minutes.


I should have said on-prem or dedicated (or whatever the opposite of expensive AWS is)


If it's really on-prem (i.e. not in the "datacenter" as per NVIDIA EULA), you could spend a lot less than $100K+ for a lot more throughput by purchasing consumer-grade cards and HEDT gaming hardware.

Sure you'll have 4 GPUs per box and not 8, and sure, each GPU will have 11GB and not 32, but the whole machine (_with_ the GPUs) will cost just a tad more than a single V100. So if you don't really need 32GB of VRAM per GPU (and you most likely don't), it'd be insane to pay literally 5x as much as you have to.


1) How long does it take to get a Lambda Hyperplane operational, from the point I place an order?

A p3dn.24xlarge is a few minutes.

My experience in deploying new hardware-based solutions is that it typically takes between 6 and 12 weeks.

2) How does the TCO compare when I only need to train for 2 hours a day?

3) If I was previously training on AWS p2.16xl, I could upgrade to p3.16xl with basically 0 incremental cost (for the same workload). Does LambdaLabs offer free (0 capex) upgrades? If so, how long would it take to upgrade?


1) 5 days (according to the author)
2) Obviously the cloud is better in that case
3) You can't always upgrade on AWS either if you have paid for a reserved instance: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ri-modif...


> My experience in deploying new hardware-based solutions is that it typically takes between 6 and 12 weeks.

Most of this is due to arcane ordering processes both on client and supplier side.


Save $69k sounds a lot more significant than save 38%.

If you're going to be using the server less than 73% of the time, AWS sounds better.


Keep in mind, the AWS cost is already for a reserved instance. On-demand would cost more.


Cloud computing was never about price, it was about the ability to provision and operate infrastructure instantly through an API.

If you can take advantage of that flexibility to build reactive capacity then you can save money, but that wasn't the initial driving point.


you MIGHT save money.

The price difference can be substantial enough that you might simply be better off having it "over provisioned" off hours.

Also, not having "reactive capacity" might simply mean that your 95th percentile goes from 100ms response time to 300ms. Which again, might be a more cost effective approach.

I think the initial development and ongoing cost of maintaining that reactive capacity is also substantial enough to be considered.

Most of the apps that truly need the elasticity should probably go hybrid anyway: baseline on dedicated hardware, spikes on spot instances.


> If you can take advantage of that flexibility to build reactive capacity then you can save money

Almost everyone thinks they can; very few small businesses have the sysadmin/devops to pull it off.


I'm inexperienced on the hardware front - would that machine likely not break down under a large load for 3 years straight? Nothing is set aside for hardware failure, etc.


I have 30 machines running at 100% load for an hour, then 25% load for an hour, the pattern repeats. Every quarter they run 100% for about a week.

Every 6 months or so a hard drive fails (out of 16 per server); no other components have failed. 10 machines are 6 years old (a test system, as it's out of warranty), 10 are 3 years old, and 10 are new.

There's also 20 or so other servers under different loads, I've not had anything fail other than hard drives.

When I started the job, there were spare power supplies for some 10+ year old servers in storage, so I'm either lucky or reliability is improving.


Our on-prem server has been running for about 7 years under a high load. It's a little unpredictable, but I would expect a server to last longer than 3 years.


No, not unless it's badly designed.

I used to work for a VFX company where we would try and get as close to 100% utilization out of the farm.

We had machines that were 4 years old, still merrily plodding away (not too many, mind; they ate more power than they're worth). The things that tended to fail were hard drives and fans.


Depends what the warranty is, I think three years is standard and you can pay for five or seven. (You pay commensurately, however.)


Ah, excellent point, a warranty is something I completely overlooked. Probably some value "lost" doing it yourself due to downtime if something goes wrong, but likely negligible.


As long as it's run within its thermal limits (i.e. not being severely overclocked and has adequate cooling/ventilation) 3 years or even 5 years isn't an unrealistic lifespan. The analysis makes this, and a lot of other, implicit assumptions which in general seem reasonable.


On the other hand, they didn't calculate in any residual value after 3 years.


Because it's negligible


But isn't that $69,000 peanuts compared to the cost of hiring sysadmins who are on call 24/7 to swap out RAM, fans, drives and power supplies, provision new images, etc.? So you saved on cloud costs, but now you have the admin burden: 1-3 people at $200k each. No?


From having managed several racks worth of equipment in two separate data centres 1-2 hours travel from my office at the time: I cost closer to that $200k/year, but I also only usually visited the data centres 2-3 times a year, and other than that we used "remote hands" at the colo to do maintenance and be on-call 24/7. On ~60+ servers, we had maybe on average one minor incident every couple of months that required physical intervention.

Between two data centres, let's assume I did 6 visits a year, and that we had one ~30-minute incident a month at $50/incident (it was less, but I don't remember the exact details, and it doesn't matter for this exercise). Let's assume I lost a whole day every visit (I didn't, though it got close at times), and "charge" $1000/day for my visits. That adds up to $6k/year for my time and $600/year for remote hands, or ~$110/year per server. For comparison, the colocation cost us ~$17k/year, or ~$283/year per server. These costs were pretty stable by number of servers, and so favored using fewer, more powerful servers than we might have otherwise.
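Spelled out with the same numbers:

    # Per-server physical-maintenance arithmetic from the comment above.
    visits_per_year, cost_per_visit = 6, 1_000        # a generously charged "lost" day each
    incidents_per_year, cost_per_incident = 12, 50    # ~one short remote-hands call a month
    servers = 60

    my_time = visits_per_year * cost_per_visit                 # $6,000/year
    remote_hands = incidents_per_year * cost_per_incident      # $600/year
    print((my_time + remote_hands) / servers)                  # ~$110/year per server

    colo_total = 17_000                                        # both data centres combined
    print(colo_total / servers)                                # ~$283/year per server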

So that added the cost of renting space at a manned colo facility instead of having the servers in the office (we did have a rack of servers that didn't need 24/7 attention at our office as well).

The rest of my time was spent on devops work, which in my experience tends to be more expensive on cloud setups because complexity tends to be higher (based on having contracted to do this kind of work on AWS too, and knowing the difference in billable hours I'd typically get per instance on AWS vs. per physical server on colocated setups).


> costs were pretty stable by number of servers, and so favored using fewer, more powerful servers

Didn't this make each failure a bigger hit to your overall capacity? How much redundancy did you have? I used to work in adtech with colocated hardware, and it was old and failed a lot, but they had enough it didn't matter ("we're down 5/120 in Germany but we can swap them out while we're there next month").


In that case it was ~60 servers, so losing any single server made little difference, but yes it of course needs to be a consideration if your number of servers is low enough.

It was also a fully virtualised setup that could also tie in rented dedicated servers or cloud instances via VPNs as needed. So where it made sense or if we had an urgent need, we had the ability to spin things up as needed.

E.g. we had racks in London, but rented servers at Hetzner in Germany (Hetzner got close to the cost of the colocated servers, though mostly because rack space in London is ridiculously expensive; it might actually have saved us money to put servers in their colo facilities in Germany, even with the cost of travel to/from them occasionally)


This sounds like a great use case for AWS -- as cheap insurance. Just setup a VPC and VPN and if you have hardware fail just spin up an instance in AWS until you can replace your physical hardware. Pay $40/mo or so to keep the VPN active.


From the article: "Our TCO [(Total Cost of Ownership)] includes energy, hiring a part-time system administrator, and co-location costs"


A lot of times when you colo you can get staff there to take care of those sorts of tasks. Their cost breakdown includes using this service.


If you have only a single server, and you need 24/7 support, then probably it is true you don't want to hire a full-time sysadmin for one server. But, for the people who are doing this kind of thing, they probably have more than one, so the cost of sysadmin is spread across more than one server.


> Our TCO includes energy, hiring a part-time system administrator, and co-location costs. In addition, you still get value from the system after three years, unlike the AWS instance.

They covered that I think.


Who's running model training 24/7 to justify reserving this instance or co-locating your own hardware? (Apologies in advance for not being very imaginative)

Their ImageNet timing fits within the bounds of a Spot Duration workload, so in the most optimistic scenario, you can subtract 70% from the price - assuming spot availability for this instance type. (Of course, there are many more model training exercises that don't even remotely fit inside 6 hours.)


Disclosure: I work on Google Cloud.

First, thanks for writing this up. Too many people just take a “buy the box, divide by number of hours in 3 years approach”. Your comparison to a 3-year RI at AWS versus the hardware is thus more fair than most. You’re still missing a lot of the opportunity cost (both capital and human), scaling (each of these is probably 3 kW, and most electrical systems couldn’t handle say 20 of those), and so on.

That said, I don’t agree that 3 years is a reasonable depreciation period for GPUs for deep learning (the focus of this analysis). If you had purchased a box full of P100s before the V100 came out, you’d have regretted it. Not just in straight price/performance, but also operator time: a 2x speedup on training also yields faster time-to-market and/or more productive deep learning engineers (expensive!).

People still use K80s and P100s for their relative price/performance on FP64 and FP32 generic math (V100s come at a high premium for ML and NVIDIA knows it), but for most deep learning you’d be making a big mistake. Even FP32 things with either more memory per part or higher memory bandwidth mean that you’d rather not have a 36-month replacement plan.

If you really do want to do that, I’d recommend you buy them the day they come out (AWS launched V100s in October 2017, so we’re already 16 months in) to minimize the refresh regret.

tl;dr: unless the V100 is the perfect sweet spot in ML land for the next three years or so, a 3-year RI or a physical box will decline in utility.


The actual cost of running a cloud instance is inflated here. The cheapest way to run them is using spot/interruptible instances, which will suffice for most deep learning jobs. If anything, there will be some upfront cost to set things up so they automatically manage interruptions, storage, etc. Also, by not limiting yourself to AWS you have many other options.

With this setup you can get 2x4x V100 on Azure for a total of $42k/year (assuming running 24/7).

Even if one spent $40k to write code for spot instance management this is by far the cheapest solution for GPU compute both short term and long term.

source for calculation: https://cloudoptimizer.io


p3dn.24xlarge's pricing makes no sense at all. It feels like AWS did it to pull off some PR/marketing stunt without any real users in mind. I've tried getting spot instances for it, but AWS just errors out, so they don't even have enough of them to allow spot instances. And it's a GPU machine, so the usual arguments about scaling up on demand or adapting to load don't really apply. You either have this use case or you don't. And if you do, just buy the hardware.


You are not the target audience since you don't have the use case nor the budget


I bought a second hand Xeon E3-1246 v3 (8 VCPU), 16GB memory for $250 on ebay. That's less than it costs to rent an a1.xlarge for 6 months. Hardware is so cheap now, esp with SSDs and memory getting cheaper. Don't automatically rent!


Or if you're going to rent, consider dedicated hosting providers too. Providers like Hetzner often work out far cheaper (especially if you're doing anything requiring a lot of outbound bandwidth).

AWS is great for convenience when you can afford it, but it is a really expensive solution, even when factoring in the extra things you have to deal with to rent, lease or buy dedicated servers.


Sadly no major dedicated provider offers machines with 8 GPUs..


This makes sense only if your prospective clients want a "lift and shift" into the cloud. But lots of people are using AWS for services like S3, RDS, CloudFront, Route 53, etc.


You can still use S3, even if you aren't totally in AWS.


damn, "includes hiring a part time system administrator".


Must be extremely part-time, for $10k/year total cost.


I mean, $10k/yr/server doesn't sound too unreasonable. It just sounds really bad when you're only looking at one server.

$100k/yr total compensation (i.e. a 60k-ish salary) for someone to babysit 10 servers isn't super unreasonable.


Heck, I'd take that job!

...seriously, any chance that's a real thing? Sounds better than what I do now.


Yes, sort of. You can find people with a small-ish number of servers that will pay stupid money to have someone on call, when you count it on a per-server basis. In practice it's a nice side gig, but you'll tend to need several of them, as people do understand they're paying a premium to have you accessible, and do expect to pay (substantially) less per server if they have more of them.

In practice this will tend to include out-of-hours availability and/or devops-type work, not just low-level sysadmin stuff or physical maintenance, as a lot of that can be farmed out to "remote hands" at the colo providers on hourly rates with 24/7 availability and will certainly cost a tiny fraction of that $10k/year.


I make just a couple of k's more than that to babysit a dozen. It's a thing.


If a single server requires anywhere close to that in sysadmin time per year, it is broken or the sysadmin in question is incompetent, or we're talking a supercomputer of much greater complexity than this thing.

To me it seems like a very conservative estimate, or one allowing for said sysadmin to provide a lot of value-add services (e.g. devops-type services) that you'll typically need for a cloud setup as well.


They're hiring a share of colo staff.

It's a comparison to AWS, so everything that you install/do/operate on that server is extra in both cases.


It may be useful to note that most deep learning workloads for training are pretty latency insensitive and are pretty flat throughout the day.


I'd be curious what the TCO is when factoring in storage. i.e. What is replacing S3 for data storage in the colo setup?


That system has slots for 16 2.5" drives in the back. I'd guess they can buy whatever commodity drive/SSD they want, and store the data there. Even throwing some cycles and memory at ZFS, the cost is small compared to the rest of the box.

You'll need additional off-site backup, but that's starting to get out of the scope of the article.


I don't think tossing 16 SSDs into a single enclosure is a fair comparison with S3. I also think that storage is absolutely in scope for an article like this.


True, for most workloads 16 SSDs in a single enclosure will be far faster and more efficient, but will require offsite backups.

If you need redundant blob storage, pretty much every colo provider has solutions, and most of them are going to be cheaper than S3. Worst case you can use S3, and then you need to factor in the bandwidth cost difference.


Yep. Amazon even offers the Storage Gateway appliance to facilitate these workflows.


You can still use S3, even if you aren't in AWS.


Interesting note about Lambda Labs, all of the press links on https://lambdalabs.com/?ref=blog are about a ~"privacy violating Google Glass app" that recognizes faces and geotags photos of them.

I don't see why they choose to promote that link now.


Nice.


I thought ads on HN were discouraged?


Maybe not! I wonder how much if anything was paid for this placement. And I also wonder if I should be preparing an "article" about my company. We could sure use some extra exposure.

Anyway, despite my belly-aching, it was an interesting read.


YC has nothing to gain by selling ads on this forum. The business is already extremely wealthy and successful. Any ad revenue from this would be absolutely peanuts. It just doesn't add up.


Look, it was an informative piece and I'm glad I read it. I even book-marked it for future reference.

But let's be candid: it was a promotional piece as well, and if I "owned" this forum I would certainly have something to gain by charging a modest fee for such placements. It may sound like I'm angry about this or have a negative feeling, but I don't; not at all.


There's no argument whether you would have something to gain. We're talking about YC, not you.


This is just the most extreme example. AWS is just really expensive.

If you want a VPS take a look at Digital Ocean or Linode.


I think the expense is mostly because these are GPU instances, which are not yet commoditized. Unlike VMs, multi-tenancy on GPUs is just a little bit harder.


Actually the savings on this server looks to me to be low compared to what I'd usually expect.

You should use AWS for convenience, not cost. They're expensive for cloud services, and even the cheapest cloud providers are expensive compared to renting dedicated for all but the most transient workloads, and of dedicated hosting providers I only know Hetzner to get close to the costs I could get for renting colo space or doing truly on-prem hosting. Even then the only reason Hetzner is competitive is because I'm in London where space/power is expensive, and they're in Germany, where it is cheap (e.g. they rent out colo space as well, and prices are at 1/3 to 1/4 of what I've paid in London).


Hetzner rents out 1080Ti GPUs which are not available in most regions or from most cloud providers, hence the lower cost. This article refers to the much more expensive Tesla V100 GPUs. From what I understand the NVidia license for the 1080Tis prevents cloud providers from offering them for uses other than blockchain. Since AWS can't control what you actually do with it, they simply don't offer 1080Tis.

Source: https://www.theregister.co.uk/2018/01/03/nvidia_server_gpus/


The 15 tables in the NVIDIA GRID documentation kind of show how much of a mess it is.

Different resolutions and maximum tenants depending on card and license type, and you can't allocate resources homogeneously.


Regular AWS EC2 is still a lot more expensive than the alternatives that I mentioned.


It's not a fair comparison unless we are comparing all-in costs that include ops.


They include ops cost for the on-prem server.



