
V100 Server On-Prem vs. AWS P3 Instance Cost Comparison - rbranson
https://lambdalabs.com/blog/8-v100-server-on-prem-vs-p3-instance-tco-analysis-cost-comparison/
======
mbesto
> Our TCO includes energy, hiring a part-time system administrator, and co-
> location costs.

Which is a myopic view of TCO. This ignores so many things about purchasing
on-prem hardware (for good or bad):

Admin:

- Cost associated with finding an admin who understands how this thing works

Speed / convenience:

- Try before you buy

- Time it takes for the box to be built, shipped, and sent to the data center

- Time (and cost) it takes to install software, drivers, etc.

Maintenance / Capitalization / Finances:

- 3-year? What is the useful life of this? When does it effectively become
obsolete?

- AWS will continually upgrade their hardware while you keep paying the same

- Hardware can be capitalized, which means you can push it to the balance
sheet (for tax or valuation purposes)

- Spending $90k instead of $184k in year 1, with the option to turn it off if
you no longer need it. This could be very valuable for a startup that wants
elastic spending patterns.

Hidden costs:

- Returns, breakage, warranty in case of a hardware failure

I understand why there is a market for this product, but it's not an
apples-to-apples comparison. Generally speaking, if you know what your
workload is going to be (and I'd be hard pressed to name many orgs that
really do), then on-prem hardware is not a terrible choice, but it has to be
analyzed appropriately, with something like the rough line-item model below.
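
To make that concrete, here is a minimal sketch of such a line-item model.
Every number below is a placeholder for illustration, not a figure from the
article:

    # Hypothetical 3-year line-item TCO model (all numbers are placeholders)
    on_prem = {
        "hardware": 90_000,             # upfront server cost
        "colo_3yr": 15_000,             # rack space, power, cooling
        "admin_3yr": 30_000,            # part-time sysadmin
        "lead_time_and_setup": 5_000,   # shipping, install, drivers (opportunity cost)
        "failure_risk": 3_000,          # expected repairs / downtime outside warranty
    }
    aws_reserved_3yr = 184_000

    total_on_prem = sum(on_prem.values())
    print(f"on-prem: ${total_on_prem:,} vs AWS reserved: ${aws_reserved_3yr:,}")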

~~~
sabalaba
(Author here)

I agree with a lot of your points, especially the fact that TCO is practically
impossible to calculate, is highly subjective, and should ideally include
opportunity costs, which almost by definition vary from person to person.

I've definitely thought through some of your points. First, I want to say that
if your load is spiky, you have no business buying hardware. No matter how you
slice it, you're simply not going to get hundreds of GPUs of throughput out of
an 8-GPU machine, so it's not possible to get the same "product". So for those
doing short bursts of hyperparameter search, stick with cloud.

But, like you said, for those with high utilization and known utilization
patterns, it makes a lot of sense to go on-prem. Let's just say pretty much
anybody who is buying a reserved instance from AWS should consider buying
hardware instead.

> - Cost associated with finding an admin who understands how this thing works

The pricing is based on what colocation facilities will provide you without
much searching, and it's pretty generous. $10k per year per server is silly
high and should cover that search cost.

> - Try before you buy

You can try essentially this exact machine on AWS. That instance and hardware
are almost exactly the same. Now of course this doesn't apply to many instance
types.

> - Time it takes for the box to be built, shipped, and sent to the data
> center

5 business days is our mean lead time for that unit. Most workstations are 2
days.

> - Time (and cost) it takes to install software, drivers, etc.

It comes with that stuff pre-installed, as mentioned in the article. See
Lambda Stack for driver / framework / CUDA woes:
[https://lambdalabs.com/lambda-stack-deep-learning-software](https://lambdalabs.com/lambda-stack-deep-learning-software)

> - 3-year? What is the useful life of this? When does it effectively become
> obsolete?

I'll admit it's a long time but 3 years is about right for GPUs from my
experience. I personally purchased new GPUs for my workstations in 2012
(Fermi), 2015 (Maxwell), 2017 (Pascal).

> - AWS will continually upgrade their hardware while you keep paying the same

Yes, but only if you never commit to a reserved instance, in which case costs
are 3x higher. If you buy a reserved instance you don't get upgraded hardware.

> - Spending $90k instead of $184k in year 1, with the option to turn it off
> if you no longer need it. This could be very valuable for a startup that
> wants elastic spending patterns.

Yeah, most of these customers are people who are already using GPUs really
often for internal training infra and are finding it expensive.

> - Returns, breakage, warranty in case of a hardware failure

Our hardware comes with a 3 year parts warranty. Of course, this doesn't pay
for your time lost but it's pretty rare to see parts fail and the $10k / year
more than covers in-colo swap outs.

I agree with all of your points. I don't think that we really disagree with
much here. TCO is hard to calculate :).

~~~
choppaface
K80s are from 2014 and people still use them at scale. So 3 years isn't
unreasonable, right?

A major drawback of having this sort of hardware on-prem (and this unit in
particular) is that it doesn't include local storage for a 100TB+ scale
dataset. There's not just the admin cost of hard drives and RAID, but also
the infra cost of synchronizing data to the machine. An average, reasonable,
product-oriented Engineering Manager could easily choose cloud over a $100k
savings if it means fewer headaches. Especially since GCloud will give big
discounts if they're trying to outbid Amazon. On the other hand, if Lambda
offered some sort of NAS solution plus caching software (Alluxio! :), that
would make the offering a lot more competitive. Oh, and maybe throw in a
network engineer to set up peering and such... ;)

What might make the comparison more compelling is to take anonymized workloads
from your customers, examine how often the workload results in idle (wasted)
time, and factor that into the figures. If a user's workload is elastic
(many are, especially in R&D), then the user will be drawn towards ML Engine,
SageMaker, FloydHub, Snark, etc. The downside of these services is that they
almost always involve expensive copying of data from NAS to worker machines.
But if user utilization is high and the dataset is static, the on-prem
machine is a prime investment.
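
As a rough illustration of that idle-time adjustment (the numbers here are
illustrative, not customer data), the fixed on-prem cost gets spread over
fewer useful GPU-hours as utilization drops:

    # Effective cost per utilized GPU-hour at different utilization levels
    fixed_3yr_cost = 115_000           # illustrative on-prem 3-year TCO
    total_hours = 24 * 365 * 3
    for utilization in (1.0, 0.5, 0.25):
        effective = fixed_3yr_cost / (total_hours * utilization * 8)  # 8 GPUs
        print(f"{utilization:.0%} utilized -> ${effective:.2f} per GPU-hour")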

While the savings is a big and noteworthy number, iteration speed is of
principal concern to readers here and likely to a good segment of your
customers. I love the Lambda offerings, but there are a lot of deep learning
hackers who can't make the most of bare metal.

The 1080 Ti server is probably the killer solution at 1/10th the cost. But
NVIDIA legal blah blah blah :(

------
chx
This false dichotomy between colo and AWS is just making me exhausted, 'cos I
have been repeating this for so long: just rent a dedicated server. There
surely are some cases where colo is the best choice, but as the years (now a
decade) pass since I started trying to spread this, it makes less and less
sense every year, and it never made much sense in the first place. Maybe if
you have several racks' worth of equipment? I am not familiar with that scale.

~~~
mippie_moe
Lambda Labs engineer here. Here's what we're trying to argue:

Many benefits of running infrastructure in the cloud are lost for offline
batch processing jobs. Training machine learning models doesn't require low
response times, high availability, geographic proximity to clients, etc. Yet,
with cloud, you're paying for all this extra infrastructure.

The main benefits of cloud for machine-learning-type workloads are cost
savings (if utilization is low) and no electrical set-up.

On the other hand, cloud is extremely expensive for groups that require high
base levels of GPU compute. The article is arguing that such groups can save a
huge amount of money by moving infrastructure on-prem.

~~~
chx
I feel I am talking to a brick wall. This often happens and this is what
exhausts me. You are _still_ arguing between cloud and colo / on-prem (even
ignoring that on-prem and colo are very different, but w/e), and what I am
saying is that there is a _third_ option that neither the article nor your
reply even acknowledges.

------
sudhirj
Now if they'd only throw in S3 for pseudo-infinite data storage, reliable SQS
for work queue management, 10/25/100 Gigabit networking between the instances,
redundant power supplies and cooling and racks in carefully selected stable
locations for free, I'd buy a dozen!

~~~
sudhirj
I never get the point of these things. This is like saying a Raspberry Pi
hooked up to your home router is cheaper, so why bother buying Gmail for your
company.

Or like saying that if you spend 20 hours every day in an Uber, buying a car
would be cheaper over a 3-year term.

Nobody is disputing that buying similarly specced hardware on the market is
likely to have a cheaper TCO; the point of cloud providers is that they also
have a bunch of stuff attached to the servers, which turns out to be pretty
important.

~~~
bubblethink
The point is that GPU instances are different. Use cases like ML training
or general HPC are sufficiently isolated from the rest of your stuff that you
don't really need all the bells and whistles that AWS has. If you want to host
a production system built on a niche GPU database, you may get more out of
AWS's ecosystem, but even then, the cost difference is huge.

------
bithavoc
Buy servers if you have stable workloads; otherwise rent virtual machines in
the cloud.

~~~
vidarh
I'd add to that: Most people have far more stable workloads than they like to
think.

"Everyone" seems to think they have massively spiky loads, and almost nobody
does. There may often be demands that _could_ spike (e.g. because "someone"
has not randomized the start times for batch maintenance processes through the
night but start them all at the same time), but more often than not the
spikiest demand comes from batch workloads where nobody cares if it takes a
bit longer. In practice when constrained that often simply smooths out the
overall capacity use.

(of course that does not mean that there aren't people with spiky enough
workloads for it to make sense to put at least some of their workloads on VMs
for cost reasons; but measure first)
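
"Measure first" can be as simple as comparing a high percentile of your
utilization samples against the mean; a hypothetical sketch:

    # Given utilization samples (e.g. 5-minute averages), a ratio near 1.0
    # means the load is basically flat; much above 1.0 means genuinely spiky.
    def spikiness(samples: list[float]) -> float:
        ordered = sorted(samples)
        p95 = ordered[int(0.95 * (len(ordered) - 1))]
        mean = sum(ordered) / len(ordered)
        return p95 / mean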

~~~
Dunedan
Is that really the case?

So far all companies I worked for had a wave-like pattern for customer
generated workloads, simply because of the locality of the customers. It
doesn't even matter if it's a globally used product, as some markets are
usually stronger than others (e.g. US vs. Asia). So you have some base load,
which is relatively stable, but on top of that daily or even more frequent
"spikes".

For non-customer-driven workloads you might even want spikes. Take big data
pipelines, for example. While you could schedule them so that they're not
spiky, you might benefit from running them as fast as possible to get your
results earlier. That's something you probably wouldn't do with dedicated
servers, simply because it'd be too expensive; with dedicated servers you'd
optimize for utilization instead of speed. When you only pay for what you
use, you don't have that limitation and can fire up as many instances as
makes sense. That's actually a pretty nice benefit.

And don't forget changing workloads over the course of multiple months. The
calculation looks completely different if you buy a dozen servers now, but
have your computing requirements change in a few months, so that you don't
need them all anymore (reasons for this are manifold, but could be as simple
as optimizations to your code).

~~~
vidarh
> So far all companies I worked for had a wave-like pattern for customer
> generated workloads, simply because of the locality of the customers.

Yes, but rarely big enough to allow for long enough periods of spinning down
instances to be cheaper than dedicated hardware. Consider that (while not
really a suitable example, since it's a GPU instance) the costs in the article
are only that low for AWS because it's a reserved instance, where the savings
of not using it are much lower than for regular instances.

I've seen setups where we were close to considering using AWS for peak load,
which would have been _far_ cheaper than going all AWS, but when we did the
calculations we needed peaks lasting shorter than 6-8 hours for it to be
worthwhile vs. new hardware, and the cheaper alternative was simply leaving
older machines in our racks past the 3 year depreciation schedule - it's not
"free", they took up rack space and cost in terms of power, but it was cheaper
to just let them stay in rotation until they failed or we needed the space for
a faster machine, than it was to rent extra capacity.
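
The shape of that calculation, roughly (the prices here are hypothetical,
not the ones we used):

    # Bursting to the cloud beats owning an extra server only while the
    # daily peak stays short enough.
    extra_server_per_day = 40.0   # amortized hardware + colo, $/day (hypothetical)
    cloud_hourly = 6.0            # comparable on-demand instance, $/hr (hypothetical)
    breakeven_hours = extra_server_per_day / cloud_hourly
    print(f"cloud wins for peaks under ~{breakeven_hours:.1f} hours/day")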

Scaling up with cloud services cost-effectively for daily peaks is tricky,
because you're still going to be scaling ahead of use. Unless your setup is
huge, most of your "peaker instances" won't have average utilization anywhere
near 100% once you factor in ramping up and down: there is lead time before
an instance is maxed out, and again as load is dropping, and you need to
account for that lower utilization in your cost estimates.

That's not to say it's not worth considering, not least because having the
_ability_ to do so lets you load your dedicated equipment to much higher
utilization rates because you're able to respond to spikes faster than you can
add servers.

> For non-customer-driven workloads you might even want spikes. Take big
> data pipelines, for example. While you could schedule them so that they're
> not spiky, you might benefit from running them as fast as possible to get
> your results earlier.

You might. But my experience is that over a couple of decades of working with
various types of online services, it's very rare for those types of workloads
to move the needle much in terms of overall compute needs. There certainly
_are_ exceptions, and machine learning workloads might well make them more
common, but they're comparatively rare in most companies. Keep in mind that
while you may want capacity to handle spikes, in most such setups you _also_
want capacity to handle loads that can run off peak.

But by all means: if you _do_ have needs that require those kinds of big
spikes that you can't accommodate within your base capacity, then use cloud
instances for that. They do have their place.

> And don't forget changing workloads over the course of multiple months. The
> calculation looks completely different if you buy a dozen servers now, but
> have your computing requirements change in a few months, so that you don't
> need them all anymore (reasons for this are manifold, but could be as
> simple as optimizations to your code).

In practice I've never seen dramatic enough changes anywhere for this to be a
real concern for most people: massive drop-offs tend to be rare, and tend to
be eaten up by subsequent growth in a matter of months unless you're in big
enough trouble for it not to matter. If it's a potentially real concern for
you, then rent dedicated servers by the month for some of the workload - that
is typically still far cheaper than AWS, though not by _as much_ as leasing or
buying - or use a cloud provider for some of your peak load as "insurance".

The big takeaway is not that you shouldn't use cloud services, but that you
should know your workload and do the cost calculations.

------
Scaevolus
If you're not deploying in a datacenter, you can save even more money by
building a workstation with a few 2080 Ti cards, which cost $1200 and give 90%
of the speed of the $3000 Titan V:
[https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v100-vs-titan-v-vs-1080-ti-benchmark/](https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v100-vs-titan-v-vs-1080-ti-benchmark/)

~~~
altmind
Exactly my thoughts - instead of one datacenter-class GPU, you can use
multiple desktop components for better performance and more performance per
buck. I've seen some sweet Supermicro 2U chassis that can fit four 2080 Tis
and run them cool.

I remember, though, that NVIDIA's EULA disallows using the NVIDIA driver
blob with consumer GPUs in a datacenter environment. Time will tell how
legally enforceable this restriction is.

~~~
freeone3000
We have one of these. Aside from the mechanical differences (it's actually a
2.5U server due to how the power cables connect to consumer cards), you're
losing out on NVLink. You're going to be restricted to single-card training,
which means either batching or other restrictions to ensure it fits in the
memory of a single card. This setup is able to train four one-card models in
parallel; it is not able to train a four-card model.
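
That pattern looks something like the following sketch (PyTorch, with a
stand-in model and data; the model and training loop are hypothetical): one
independent training process per card, rather than one model spanning all
four.

    import torch
    import torch.multiprocessing as mp

    def train_on_gpu(gpu_id: int) -> None:
        # Each worker pins itself to one card and trains independently.
        device = torch.device(f"cuda:{gpu_id}")
        model = torch.nn.Linear(512, 10).to(device)          # stand-in model
        opt = torch.optim.SGD(model.parameters(), lr=0.01)
        for _ in range(100):                                 # stand-in data/loop
            x = torch.randn(64, 512, device=device)
            y = torch.randint(0, 10, (64,), device=device)
            loss = torch.nn.functional.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

    if __name__ == "__main__":
        # Four one-card models in parallel; not one four-card model.
        mp.spawn(train_on_gpu, nprocs=torch.cuda.device_count())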

~~~
riku_iki
The 2080 Ti supports NVLink.

~~~
freeone3000
There's a physical connector also called NVLink, yes, but the cards don't
present as unified at the driver level; nvidia-smi shows "link: off" even
using the bridge. It's effectively SLI, which can reduce memory bandwidth load
and cross-card transfer costs, but doesn't unify the cards the way Volta
(actual) NVLink does.
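
If you want to see what the driver reports on your own box, something like
this (assuming nvidia-smi is on the PATH) prints the per-link state:

    import subprocess
    # "nvidia-smi nvlink --status" reports per-GPU link state; on consumer
    # cards with an SLI-style bridge this typically comes back inactive.
    out = subprocess.run(["nvidia-smi", "nvlink", "--status"],
                         capture_output=True, text=True)
    print(out.stdout or "no NVLink status reported")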

------
elchief
In my mind, the reason to use something like AWS is to a) get your servers in
minutes instead of weeks and b) easily right-size your service

Once your service is somewhat stable in terms of size and you can afford
longer lead times, then you should return to on-prem to save money

~~~
joefourier
What's this false dichotomy between AWS and on-prem? Dedicated servers at
Hetzner, OVH, Datapacket, etc. are much cheaper than AWS and can also be ready
in minutes.

~~~
elchief
I should have said on-prem or dedicated (or whatever the opposite of expensive
AWS is)

------
m0zg
If it's really on-prem (i.e. not in the "datacenter" as per NVIDIA EULA), you
could spend a lot less than $100K+ for a lot more throughput by purchasing
consumer-grade cards and HEDT gaming hardware.

Sure you'll have 4 GPUs per box and not 8, and sure, each GPU will have 11GB
and not 32, but the whole machine (_with_ the GPUs) will cost just a tad more
than a single V100. So if you don't really need 32GB of VRAM per GPU (and you
most likely don't), it'd be insane to pay literally 5x as much as you have to.

------
ti_ranger
1) How long does it take to get a Lambda Hyperplane operational from the point
I place an order?

A p3dn.24xlarge takes a few minutes.

My experience in deploying new hardware-based solutions is that it typically
takes between 6 and 12 weeks.

2) How does the TCO compare when I only need to train for 2 hours a day?

3) If I was previously training on AWS p2.16xl, I could upgrade to p3.16xl
with basically 0 incremental cost (for the same workload). Does LambdaLabs
offer free (0 capex) upgrades? If so, how long would it take to upgrade?

~~~
ec109685
1) 5 days (according to the author)

2) Obviously the cloud is better in that case

3) You can't always upgrade on AWS either if you have paid for a reserved
instance:
[https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ri-modifying.html](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ri-modifying.html)

------
Thorrez
Save $69k sounds a lot more significant than save 38%.

If you're going to be using the server less than 73% of the time, AWS sounds
better.
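
One way to arrive at a figure like that, as a back-of-envelope (the hourly
rate below is my assumption, not from the article; ~$6/hr happens to
reproduce 73%):

    # Break-even utilization: below this, paying by the hour beats owning.
    HOURS_3YR = 24 * 365 * 3        # 26,280 hours
    on_prem_total = 115_000         # ~$184k AWS figure minus the $69k savings
    hourly_cloud = 6.00             # assumed pay-per-use rate, $/hr
    breakeven = on_prem_total / (hourly_cloud * HOURS_3YR)
    print(f"break-even utilization: {breakeven:.0%}")   # ~73% with these inputs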

~~~
BGZq7
Keep in mind, the AWS cost is already for a reserved instance. On-demand would
cost more.

------
manigandham
Cloud computing was never about price, it was about the ability to provision
and operate infrastructure instantly through an API.

If you can take advantage of that flexibility to build reactive capacity then
you can save money, but that wasn't the initial driving point.

~~~
latch
you MIGHT save money.

The price difference can be substantial enough that you might simply be better
off having it "over provisioned" off hours.

Also, not having "reactive capacity" might simply mean that your 95th
percentile goes from 100ms response time to 300ms. Which again, might be a
more cost effective approach.

I think the initial development and ongoing cost of maintaining that reactive
capacity is also substantial enough to be considered.

Most of the apps that truly need the elasticity should probably go hybrid
anyway: baseline on dedicated hardware, spikes on spot instances.

------
dc_gregory
I'm inexperienced on the hardware front: would that machine likely not break
down under a large load for 3 years straight? Nothing is set aside for
hardware failure etc.

~~~
iheartpotatoes
But isn't that $69,000 peanuts compared to the cost to hire sysadmins who are
on call 24/7 to swap out: RAM, fans, drives, power supplies, provision new
images, etc.?? So you saved on cloud costs, but now you have the admin burden:
1-3 people for $200k each. No?

~~~
vidarh
From having managed several racks' worth of equipment in two separate data
centres 1-2 hours of travel from my office at the time: I cost closer to that
$200k/year, but I also only usually visited the data centres 2-3 times a
_year_, and other than that we used "remote hands" at the colo to do
maintenance and be on-call 24/7. On ~60+ servers, we had maybe on average one
minor incident every couple of months that required physical intervention.

Between two data centres, let's assume I did 6 visits a year, and that we had
one ~30-minute incident a month at $50/incident (it was less, but I don't
remember the exact details, and it doesn't matter for this exercise). Let's
assume I lost a whole day every visit (I didn't, though it got close at
times), and "charge" $1000/day for my visits. That adds up to $6k/year for my
time and $600/year for remote hands, or ~$110/year per server. For comparison,
the colocation cost us ~$17k/year, or ~$283/year per server. These costs were
pretty stable by _number_ of servers, and so favored using fewer, more
powerful servers than we might have otherwise.
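
Spelled out with the numbers above:

    # The maintenance arithmetic from the comment above
    visits_per_year = 6
    cost_per_visit = 1_000                 # a "charged" full day of my time
    remote_hands = 12 * 50                 # one ~$50 incident a month
    servers = 60
    my_time = visits_per_year * cost_per_visit              # $6,000/yr
    per_server = (my_time + remote_hands) / servers         # ~$110/yr/server
    colo_per_server = 17_000 / servers                      # ~$283/yr/server
    print(f"maintenance ~${per_server:.0f}, colo ~${colo_per_server:.0f} per server-year")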

So that added the cost of renting space at a manned colo facility instead of
having the servers in the office (we did have a rack of servers that didn't
need 24/7 attention at our office as well).

The rest of my time was spent on devops work that in my experience tends to be
more expensive (on the basis of having contracted to do this kind of work on
AWS too, and know the difference in billable hours I'd typically get per
instance on AWS vs. per physical server on colocated setups) on cloud setups
because complexity tends to be higher.

~~~
erik_seaberg
> costs were pretty stable by number of servers, and so favored using fewer,
> more powerful servers

Didn't this make each failure a bigger hit to your overall capacity? How much
redundancy did you have? I used to work in adtech with colocated hardware, and
it was old and failed a lot, but they had enough it didn't matter ("we're down
5/120 in Germany but we can swap them out while we're there next month").

~~~
vidarh
In that case it was ~60 servers, so losing any single server made little
difference, but yes it of course needs to be a consideration if your number of
servers is low enough.

It was also a fully virtualised setup that could also tie in rented dedicated
servers or cloud instances via VPNs as needed. So where it made sense or if we
had an urgent need, we had the ability to spin things up as needed.

E.g. we had racks in London, but rented servers at Hetzner in Germany.
(Hetzner got close to the cost of the colocated servers, though mostly because
rack space in London is ridiculously expensive; it might actually have saved
us money to put servers in their colo facilities in Germany, even with the
cost of travel to/from them occasionally.)

------
zten
Who's running model training 24/7 to justify reserving this instance or co-
locating your own hardware? (Apologies in advance for not being very
imaginative)

Their ImageNet timing fits within the bounds of a Spot Duration workload, so
in the most optimistic scenario, you can subtract 70% from the price -
assuming spot availability for this instance type. (Of course, there are many
more model training exercises that don't even remotely fit inside 6 hours.)

------
boulos
Disclosure: I work on Google Cloud.

First, thanks for writing this up. Too many people just take a “buy the box,
divide by number of hours in 3 years” approach. Your comparison to a 3-year RI
at AWS versus the hardware is thus fairer than most. You’re still missing a
lot of the opportunity cost (both capital and human), scaling (each of these
is probably 3 kW, and most electrical systems couldn’t handle, say, 20 of
those), and so on.

That said, I don’t agree that 3 years is a reasonable depreciation period for
GPUs for deep learning (the focus of this analysis). If you had purchased a
box full of P100s before the V100 came out, you’d have regretted it. Not just
in straight price/performance, but also operator time: a 2x speedup on
training also yields faster time-to-market and/or more productive deep
learning engineers (expensive!).

People still use K80s and P100s for their relative price/performance on
generic FP64 and FP32 math (V100s come at a high premium for ML, and NVIDIA
knows it), but for most deep learning you’d be making a big mistake. Even for
FP32 work, newer parts with more memory or higher memory bandwidth mean you’d
rather not be locked into a 36-month replacement plan.

If you really do want to do that, I’d recommend you buy them the day they come
out (AWS launched V100s in October 2017, so we’re already 16 months in) to
minimize the refresh regret.

tl;dr: unless the V100 is the perfect sweet spot in ML land for the next three
years or so, a 3-year RI or a physical box will decline in utility.

------
freediver
The cloud cost used here is inflated. The cheapest way to run these jobs is
spot/interruptible instances, which will suffice for most deep learning work.
If anything, there will be some upfront cost to set things up so that
interruptions, storage, etc. are handled automatically. Also, by not limiting
yourself to AWS you have many other options.

With this setup you can get two 4x V100 instances on Azure for a total of
$42k/year (assuming running 24/7).

Even if one spent $40k to write code for spot instance management, this is by
far the cheapest solution for GPU compute, both short term and long term.

source for calculation: [https://cloudoptimizer.io](https://cloudoptimizer.io)
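
As a sanity check, the $42k/year figure implies roughly this per-GPU rate
(derived from the numbers above, not an Azure price quote):

    yearly = 42_000
    gpus = 8                                  # 2 instances x 4 V100s each
    implied = yearly / (gpus * 24 * 365)
    print(f"~${implied:.2f} per V100 spot-hour")   # ~$0.60 with these numbers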

------
bubblethink
p3dn.24xlarge's pricing makes no sense at all. It feels like AWS did it to
pull off some PR/marketing stunt without any real users in mind. I've tried
getting spot instances for it, but AWS just errors out, so they don't even
have enough of them to allow spot instances. And it's a GPU machine, so the
usual arguments about scaling up on demand or adapting to load don't really
apply. You either have this use case or you don't. And if you do, just buy the
hardware.

~~~
hughesjo
You are not the target audience, since you have neither the use case nor the
budget.

------
rb808
I bought a second-hand Xeon E3-1246 v3 (8 vCPUs) with 16GB memory for $250 on
eBay. That's less than it costs to rent an a1.xlarge for 6 months. Hardware is
so cheap now, especially with SSDs and memory getting cheaper. Don't
automatically rent!

~~~
vidarh
Or if you're going to rent, consider dedicated hosting providers too.
Providers like Hetzner often work out far cheaper ( _especially_ if you're
doing anything requiring a lot of outbound bandwidth).

AWS is great for convenience when you can afford it, but it is a _really_
expensive solution, even when factoring in the extra things you have to deal
with to rent, lease or buy dedicated servers.

~~~
riku_iki
Sadly, no major dedicated provider offers machines with 8 GPUs.

------
raincom
This makes sense only if your prospective clients want to "lift and shift"
into the cloud. But lots of people are using AWS for services like S3, RDS,
CloudFront, Route 53, etc.

~~~
ec109685
You can still use S3, even if you aren't totally in AWS.

------
purplezooey
damn, "includes hiring a part time system administrator".

~~~
fisherjeff
Must be extremely part-time, for $10k/year total cost.

~~~
Spivak
I mean, $10k/yr/server doesn't sound too unreasonable. It just sounds really
bad when you're only looking at one server.

$100k/yr total compensation (i.e. a 60k-ish salary) for someone to babysit 10
servers isn't super unreasonable.

~~~
yjftsjthsd-h
Heck, I'd take that job!

...seriously, any chance that's a real thing? Sounds better than what I do
now.

~~~
vidarh
Yes, sort of. You can find people with smallish numbers of servers that will
pay stupid money to have someone on-call, when you count it on a per-server
basis. In practice it's a nice side gig, but you'll tend to need several of
them, as people do understand they're paying a premium to have you accessible,
and do expect to pay (substantially) less per server if they have more of
them.

In practice this will tend to include out-of-hours availability and/or
devops-type work, not just low-level sysadmin stuff or physical maintenance,
as a lot of that can be farmed out to "remote hands" at the colo providers on
hourly rates with 24/7 availability, and will certainly cost a tiny fraction
of that $10k/year.

------
deepnotderp
It may be useful to note that most deep learning workloads for training are
pretty latency insensitive and are pretty flat throughout the day.

------
mbell
I'd be curious what the TCO is when factoring in storage. i.e. What is
replacing S3 for data storage in the colo setup?

~~~
secabeen
That system has slots for 16 2.5" drives in the back. I'd guess they can buy
whatever commodity drive/SSD they want, and store the data there. Even
throwing some cycles and memory at ZFS, the cost is small compared to the rest
of the box.

You'll need additional off-site backup, but that's starting to get out of the
scope of the article.

~~~
mbell
I don't think tossing 16 SSDs into a single enclosure is a fair comparison
with S3. I also think that storage is absolutely in scope for an article like
this.

~~~
vidarh
True, for most workloads 16 SSDs in a single enclosure will be far faster and
more efficient, but they will require offsite backups.

If you _need_ redundant blob storage, pretty much every colo provider has
solutions, and most of them are going to be cheaper than S3. Worst case you
can use S3, and then you need to factor in the bandwidth cost difference.

~~~
secabeen
Yep. Amazon even offers the Storage Gateway appliance to facilitate these
workflows.

------
canadev
Interesting note about Lambda Labs: all of the press links on
[https://lambdalabs.com/?ref=blog](https://lambdalabs.com/?ref=blog) are about
a ~"privacy violating Google Glass app" that recognizes faces and geotags
photos of them.

I don't see why they'd choose to promote those links now.

------
ringaroll
Nice.

------
iheartpotatoes
I thought ads on HN were discouraged?

~~~
jim_bailie
Maybe not! I wonder how much, if anything, was paid for this placement. And I
also wonder if I should be preparing an "article" about my company. We could
sure use some extra exposure.

Anyway, despite my belly-aching, it was an interesting read.

~~~
warent
YC has nothing to gain by selling ads on this forum. The business is already
extremely wealthy and successful. Any ad revenue from this would be absolutely
peanuts. It just doesn't add up.

~~~
jim_bailie
Look, it was an informative piece and I'm glad I read it. I even book-marked
it for future reference.

But let's be candid: it was a promotional piece as well, and if I "owned" this
forum I would certainly have something to gain by charging a modest fee for
such placements. It may sound like I'm angry about this or feel negative about
it, but I don't; not at all.

~~~
warent
There's no argument whether you would have something to gain. We're talking
about YC, not you.

------
ilaksh
This is just the most extreme example. AWS is just really expensive.

If you want a VPS take a look at Digital Ocean or Linode.

~~~
wenc
I think the expense is mostly because these are GPU instances, which are not
yet commoditized. Unlike VMs, multi-tenancy on GPUs is just a little bit
harder.

~~~
vidarh
Actually, the savings on this server look low to me compared to what I'd
usually expect.

You should use AWS for convenience, not cost. They're expensive even for cloud
services, and even the cheapest cloud providers are expensive compared to
renting dedicated servers for all but the most transient workloads. Of the
dedicated hosting providers, I only know Hetzner to get close to the costs I
could get by renting colo space or doing truly on-prem hosting. Even then, the
only reason Hetzner is competitive is because I'm in London, where space/power
is expensive, and they're in Germany, where it is cheap (e.g. they rent out
colo space as well, and prices are 1/3 to 1/4 of what I've paid in London).

~~~
coleca
Hetzner rents out 1080Ti GPUs which are not available in most regions or from
most cloud providers, hence the lower cost. This article refers to the much
more expensive Tesla V100 GPUs. From what I understand the NVidia license for
the 1080Tis prevents cloud providers from offering them for uses other than
blockchain. Since AWS can't control what you actually do with it, they simply
don't offer 1080Tis.

Source:
[https://www.theregister.co.uk/2018/01/03/nvidia_server_gpus/](https://www.theregister.co.uk/2018/01/03/nvidia_server_gpus/)

------
hughesjo
It's not a fair comparison unless we are comparing all-in costs that include
ops.

~~~
riku_iki
They include the ops cost for the on-prem server.

