Much easier to spend huge amounts of money on Azure/AWS and politely tell them it's their own fucking fault when they complain about the costs. (What, me? No, I'm not bitter, why do you ask?)
Last week I was at an event with the CTOs of many of the hottest startups in America. It was shocking how much money is wasted on the cloud because of inefficiencies, and they simply don't care how much it costs.
I guess since they are not wasting their own money, they can always come up with the same excuse: developers are more expensive than infrastructure. Well... that argument starts to fall apart very quickly when a company spends six figures every month on AWS.
I'm on the other extreme. I run my company's stuff on ten $300 servers I bought on eBay in 2012 and put inside a soundproof rack in my office in NJ, with a 300 Mbps FIOS connection using Cloudflare as a proxy/CDN. The servers run Proxmox for the private cloud and Ceph for storage. They all have SSDs and some have Optane storage. In 6 years, there were only 3 outages that weren't my fault. All at the cost of office rent ($1000) + FIOS ($359) + Cloudflare costs and S3 for images and backups.
With my infrastructure, I can run 6k requests per minute on the main Rails app (+ Scala backend) with a 40ms response time with plenty of resources to spare.
I'm talking about a real application here. With 100s of database and API calls on each web page load. I could make the whole thing in Golang or Scala and that would be at least one order of magnitude faster. But then I would have to throw away all the business knowledge that was added to the Rails app.
For instance, the slowest API call within that 40 ms is one that hits an Elasticsearch cluster with over 1 billion documents, and it's made from a Scala backend using Apache Thrift. There's a lot of caching, but still, the long tail and customization will kill caching at the top level.
What type of business that someone would basically run out of an office closet like this would need to service more than 6k requests per minute?
Swinging the conversation between the extremes of either side doesn't produce interesting insight.
How many were there that were your fault? And of those, how many would have been avoided by using Heroku?
>All at the cost of office rent ($1000) + FIOS ($359) + Cloudflare costs and S3 for images and backups.
What about the time spent creating and maintaining this infrastructure?
One time I was using Docker for a 2 TB MongoDB and it messed up the iptables rules. I noticed everything being slow for a few days, until the database disappeared and, when I logged in to check, there was a ransom note.
I flew from Boca Raton to NJ to recover the backup and audit if that was the only breach. That was the longest outage.
Like Rome, this infrastructure was not created in a day. Adding Optane storage is something more recent, for example. Or adding a remote KVM to make it easier to manage than dealing with multiple DRACs, which I did after I moved to Florida.
But I'm not against using the cloud. I'm actually very in favor. What I'm against is waste.
In my case, being very conservative with my costs while still having a lot of resources available has allowed me to try, and keep trying, many different ideas in the search for product/market fit.
In my opinion it's better to escalate upwards with proposals and not back down easily. You just have to frame it correctly and use the right names and terms.
* Usually big companies understand the concept of "a lab" that has infrastructure managed outside the corporate IT. Once you fight the hard fight, you get your own corner and are left alone to do your job and can gradually grow it for other things.
* Asking forgiveness works even for large companies. Sometimes someone is not happy and you 'are in trouble' but not really. You just have to have a personality that does not care if somebody gets mad at you sometimes.
I won't go as far as calling it yet another cyclical IT phase, but it has all the hallmarks of one.
That's not borne out by history, especially in the face of "enterprise" hardware, software, and support, which, in many large companies, is being replaced by cloud (at lower cost!).
For smaller companies, it may be a different story, especially at the next economic downturn, especially if VC money becomes scarce enough for long enough.
If someone goes to the trouble to migrate onto cloud, and then replicates pre-devops workflows... wow.
The real problem is believing that there is a pure model that works for everyone.
In no way am I responsible for our migration going well. But I do feel if a company this size can do it, then others failing to do so says more about their architecture teams than the endeavor's futility.
If something bad slipped into prod there was a process in place of how a dev would work together with someone from that team to fix it.
> pre-devops workflows
I guess this is what you are calling a pre-devops workflow? In a lot of fields not all devs are allowed to see/touch the complete production environment. Not everyone can go the Netflix way of "everyone pushes to production and we'll just fix it when it breaks".
I kinda shake my head when people call for devs to handle all the infrastructure and everything in prod. Why should a developer concern himself with the details of scaling and tuning Postgres, Elasticsearch, or load balancers? If you don't outsource that, it's the job of the ops team. However, if that's the responsibility of the ops team, there's no reason for devs to have access to these systems beyond the application layer. That just seems like a good way to split the work of running and scaling an application.
Now if we're talking about applications, that's something different. In fact, it's the contrary. I am a big proponent of having developers manage and configure their own applications, including in prod. It's fast and makes them develop better applications. However, doing this for critical systems in a responsible way requires automation to enforce best practices. At our place, devs don't have root access to production application servers, but they do have permissions to configure and use the automation controlling those production application servers. And it's safe and rather risk-free because the automation does the right thing.
And it's also a different thing if we're talking about test. By all means, deploy a 3-container database cluster to poke around with. I like experimentation and volatility in test setups and PoCs. Sometimes it's just faster to solve the problem with 3 candidate technologies and go from there. Just don't expect things to go into production just like that. We'll have to think about availability, scaling, and automating that properly first.
Instead, because ops / infra arbitrarily block me from what I need, I have to be an expert on database internals, network bottlenecks, app security topics, deployment, containers, CI tools, etc. etc., both so that I can “do it myself” when ops e.g. refuses to acknowledge some assumption-breaking GPU architecture we need, and so that I can exhaustively deal with every single arch / ops debate or bureaucratic hurdle that comes up for me to endlessly justify everything I need to do far beyond any reasonable standard.
For me, managing devops shit myself is a necessary evil. Far better than the case of unresponsive / bureaucratic ops teams, but worse than the unicorn case of an actual customer service oriented ops team that actually cares rather than engages in convenience-for-themselves optimization at every turn.
Especially because well-done infrastructure scales so much better than the manual style. We're currently dealing with a bit of fallout from a formerly bureaucratic system. People are so confused because, to get a custom container build, all they do is create a repository with a specific name and a templated Jenkinsfile, and 10-15 minutes later they get a mail with deployment instructions. It's so easy, and no one has to do anything else.
Unwillingness shouldn't be confused with inability. Most companies can do this if they're not handling PII/PHI. It takes investment in smart people and time, but most companies besides pure software companies see tech as a cost center and avoid investing in better infrastructure and platform systems.
I've read that in medieval times, kings would sometimes abandon their castles for new ones after too much feces had accumulated in them (as people were shitting wherever back then). I feel a strong parallel between that story and our situation.
There are things you can do in cloud (especially as an SMB) that weren't possible on-prem.
To look at those new opportunities and say "No, how can we do things exactly the way we were?" seems like the real mistake.
And, I can show exactly how much everything costs. As well as work to reduce those costs.
AWS added to a small, skilled dev team is a huge multiplier.
RDS is just managed databases though...
They don't care. Neither do their bosses. Not even the CFO or the CEO.
The only people who will care are the investors. And sometimes not even them, they will likely just sell and walk.
The only people who are likely to care are activist investors with large blocks of shares; those types have vested interests in these things.
Part of the reason why there is so much waste everywhere is that organizations are not humans. And most people in authority have no real stake or long-term consequences for their decisions. This is everywhere: religious organizations, companies, governments, etc. Everywhere.
In another thread on here (unfortunately I can't recall which), an executive shared a sentiment along the lines of "I don't care if it's 1% or 0.1% of my overall budget".
Perhaps 0.9% would become more significant during a downturn, especially for startups if VC money dries up.
I wanted a second 30" monitor, so I filed a ticket. They sent me a long email listing reasons why I shouldn't get a second monitor, including (numbers are approximate, employee count from 2013 or so) "If every googler gets an extra monitor, in a year it would be equivalent to driving a Toyota Camry for 18,000 miles."
I'm thinking "this can't possibly be right", so I spent some time calculating, and it turned out to be approximately correct. So I'm thinking, "we should hire one less person and give everyone an extra monitor!". I replied "yes, I understand, go ahead and give me an extra monitor anyway". They replied "we'll require triple management approval!". Me: "please proceed". The first two managers approved the request, then the director emails me "Why am I being bothered with this?". Me: "they want your approval to give me a second monitor". Him: "whatever".
And finally, a few days later, I got a second monitor...
I had no idea Google was so cheap. Gourmet breakfast, lunch, and dinner every day? No problem. A couple hundred bucks for a second monitor? Uh... it's not about the money, we're, uh, concerned about the environment.
These eke a few more hours of work out of you per day.
> A couple hundred bucks for a second monitor?
Arguments about productivity aside (I agree, more productive), these don't.
Amazon also allows you to BYOC and image it. Images are easily available. Bias for action gets you far.
After a few days I bought one, too, took it home, and took the 2x 23" monitors it replaced to the office.
In a different department another friend systematically acquired new large monitors for her dev team. Her management chain complained about the expense.
Bias for action not only gets you far, it attracts folks to your team inside a company. ;-)
The inventory sheet will never be reconciled over that monitor, unless it breaks.
Not sure if I can take it with when I leave, but hey, at least now I have my second monitor and it cost less than $300 - I'll probably bequeath it to a fellow employee.
A year later they stopped issuing SSDs and I have to do the whole "managerial approval" thing to get any variation from the "standard" machine, so I am more than a year overdue for a replacement.
Comments like these (why can't I just do X?), it's like the difference between being single and being in a serious relationship. When you're single, you can do whatever you want. When you're in a serious relationship, you can't just make whatever decisions you want without talking to your partner. Well, big corporate enterprise is like that on steroids, because instead of having one partner, you have several or dozens, and everyone needs to buy in.
Capex => Opex has a name. It's called "a loan". So let's model cloud usage as a loan. Let's assume you want, for an employee, a machine learning rig and you set depreciation on 2 years. Let's also assume that the article is correct about the ballpark figures, somewhere around $1400/month for renting, and $3000 for the machine.
And let's ignore the difference in power usage, extra space, bandwidth, ... (which is going to be 2 digit dollars at the very most anyway, since you need all that for the employee in the first place)
So how much interest does the cloud charge for this Capex => Opex change? Well, $33,600 / $3,000, or 613.5% per year. Pretty much every bank on the planet will offer very-low-credit-score companies 30% loans; even 10% is very realistic.
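For reference, a back-of-the-envelope version of that comparison in Python (just a sketch; the figures are the ballpark ones quoted above):

    # Rent-vs-buy over the 2-year depreciation window discussed above
    monthly_rent = 1400      # assumed cloud cost per month, $
    purchase_price = 3000    # assumed cost of the machine, $
    months = 24              # 2-year depreciation window

    total_rent = monthly_rent * months                 # $33,600
    print(f"Total rent over {months} months: ${total_rent:,}")
    print(f"That's {total_rent / purchase_price:.1f}x the purchase price")  # ~11.2x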
There are no words... Just give the man his bloody machine. Hell, give him 5: 4 just in case 3 fail, and 1 for Crysis, just to be a "nice" guy (you're not really being so nice: you're saving money), and you still come out ahead of the cloud.
We both know why this doesn't happen, the real reason: you don't trust your employees. Letting this employee have that machine would immediately cause a jealousy fight within the company, and cause a major problem. That's of course why the GP comment is right: leave this bloody nightmare of a company, today.
Well I mean they obviously can...
First of all, you're assuming IT will just let you have a decent machine with PCIe slots. It's all about laptops, don't you know. Workstations are so 2012.
Secondly, while they might, after much begging, let me have one graphics card to put into this one old workstation I've scavenged, they certainly won't let me have a machine with top-of-the-line specs and several cards or, heaven forbid, several machines (but they'll happily let me have a laptop that costs just as much for less than half the performance).
You need to repeat that every couple years though.
At our enterprise, we have HP ultrabooks or whatever they are called. Windows 7, 32-bit. You know, it has a decent i5 in it, but there is some specific bottleneck making it terrible. Every employee that sits in an office has this piece of crap.
It takes 3 minutes to start IntelliJ. For some years now, ~50 developers have been able to get MacBooks; luckily I am one of them, else I'd just be pissed off every single work day.
Any buying request goes through the direct supervisor, the finance department, the director of the company, then back to the supervisor for the final signature. And because it's Germany, it's on paper. When I say any buying request, I mean any. From any programming book, to any accounting book for finance, to any HR book.
You'd think people at the top have better things to do. You think that until you hear the department heads discussing which Excel format files should be saved in.
Typically, if you want to buy some hardware in a big corp, you go to some preapproved internal catalog and pick from one or two models of desktop or laptop from vendors like Dell, HP, or Lenovo - whatever was preapproved. Sometimes you can fine-tune the specs, but not in a very wide range.
Buying anything outside of that will require senior level approval and - heaven forbid - “vendor approval study” (or similar verbiage).
When I left the company, the engineering director was working out how to let his team go rogue - basically handling all of their own IT support, the only caveat being that they had to install IT's intrusion detection software, but aside from that, an engineer would be able to buy any machine within a certain budget.
And good luck getting the cost center ID for the initial ticket submission to start with.
It's all IT's own fault, frankly - or, more likely, the fault of corporations approaching IT as mostly desktop helpdesk support.
^: You know, like real money, not 90k plus 2.5% of something that will probably not exist next year
^^: You know, like real benefits such as low-cost, low-deduct. health insurance, matching 401k, etc.
^^^: As in, real, actual PTO, not this "unlimited" bullshit some startups and now sadly companies are replicating. If you decide to quit and can't "cash out" your PTO, you don't have real PTO.
1. You can blow up any amount if you like to.
2. Or, you can figure out what you are trying to do. Then, learn how to do it better. There is a cheaper way to run in the cloud too - https://twitter.com/troyhunt/status/968407559102058496
1. Even if your DL model is running on GPUs, you'll run into things that are CPU bound. You'll even run into things that are not multithreaded and are CPU bound. It's valuable to get a CPU that has good single-core performance.
2. For DL applications, NVMe is overkill. Your models are not going to be able to saturate a SATA SSD, and with the money you save, you can get a bigger one, and/or a spinning drive to go with it. You'll quickly find yourself running out of space with a 1TB drive.
3. 64GB of RAM is overkill for a single GPU server. RAM has gone up a lot in price, and you can get by with 32 without issue, especially if you have less than 4 GPUs.
4. The case, power supply, motherboard, and RAM are all a lot more expensive for a properly configured 4-GPU system. It makes no sense to buy all of this supporting hardware and then only buy one GPU. Buy a smaller PSU, less RAM, and a smaller case, and buy two GPUs from the outset.
5. Get a fast internet connection. You'll be downloading big datasets, and it is frustrating to wait half a day for something to download before you can get started.
6. Don't underestimate the time it will take to get all of this working. Between the physical assembly, getting Linux installed, and the numerous inevitable problems you'll run into, budget several days to a week to be training a model.
Here are my thoughts/background:
Background: Doing small scale training/fine tuning on datasets. Small time commercial applications.
I find renting top-shelf VM/GPU combos in the cloud to be psychologically draining. Did I forget to shut off my $5-an-hour VM during my weekend camping trip? I hate it when I ask myself questions like that.
I would rather spend the $2k upfront and ride the depreciation curve, than have the "constant" VM stress. Keep in mind, this is for a single instance, personal/commercial use rig.
I feel that DL compute decisions aren't black/white and should be approached in stages.
If you do full-time computer work at a constant location, you should try to own a fast computing rig, DL or not. Having a brutally quick computer makes doing work much less fatiguing. Plus it opens up the window to experimenting with CAD/CAE/VFX/photogrammetry/video editing. (4.5 GHz i7 8700K + 32 GB RAM + SSD)
Get a single 11/12 GB GPU: a 1080 Ti or Titan X (some models straight up won't fit on smaller cards). Now you can go on GitHub and play with random models and not feel guilty about spending money on a VM for it.
Get a 2nd GPU. It makes writing/debugging multi-GPU code much easier/smoother.
If you need more than 2 GPUs for compute, write/debug the code locally on your 2-GPU rig. Then beam it up to the cloud for 2+ GPU training. Use preemptible instances if possible for cost reasons.
You notice your cloud bill is getting pretty high ($1k+/month) and you never need more than 8x GPUs for anything that you're doing. Start the build for your DL runbox #2. SSH/container workloads only. No GUI, no local dev. Basically server-grade hardware with 8x GPUs.
I'm not sure, don't listen to me :)
A nice advantage of non-consumer GPUs is their bigger RAM. Consumer GPUs, even the newest 2080 Ti, have only 11 GB. Datacenter GPUs have 16 GB or 32 GB (V100). This is important for very big models. Even if the model itself fits, a small memory size forces you to reduce the batch size. A small batch size forces you to use a smaller learning rate and acts as a regularizer.
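On that last point: one common heuristic, not spelled out above, is the linear scaling rule (Goyal et al.), where the learning rate is scaled in proportion to the batch size - which is why a memory-limited batch usually drags the learning rate down with it. A minimal sketch, with assumed reference values:

    # Linear learning-rate scaling heuristic; all numbers here are illustrative assumptions
    reference_lr = 0.1      # learning rate tuned for the reference batch size
    reference_batch = 256   # batch size the reference LR was tuned for
    actual_batch = 64       # what fits on a smaller-memory consumer GPU

    scaled_lr = reference_lr * actual_batch / reference_batch
    print(scaled_lr)        # 0.025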
OVH offers these in their Canadian data center.
So I guess in theory it would be possible for AWS to develop their own (or enhance open source) drivers. On the other hand they would spoil the business relationship with Nvidia and have to do without any discounts.
Or are there any genuine costs associated with the data center GPU models?
NVIDIA was not amused.
It's the same with no-resell license terms. Within the EU, companies can't forbid me from reselling something I bought.
However in case of reselling software the corresponding restriction is obviously seen as illegal (or non-binding) in the EU. On the other hand there are restrictions such as copying, running software X on Y computers etc that are valid and binding.
BUT, AFAIK, you can make legal copies of your legally acquired software for backup purposes.
Also, AFAIK a restriction to run software only on a certain type of hardware is not valid in Germany. For example, I'm pretty sure that Apple can't do anything against Hackintoshes here.
It would be awesome. I also wonder if in this case it is an issue of hardware or only something related to the drivers/API.
Can they make something similar/backwards-compatible with CUDA but cheaper/better?
Disclosure: I work on Paperspace
Paperspace has eliminated my desire to build a deep learning computer thanks to their insanely low prices.
Current prices are around $0.78 per hour for an Nvidia Quadro P5000, which is pretty comparable to a 1080 Ti.
On top of that you can even run Gradient notebooks (on demand) without even setting up a server. This is the future when bandwidth costs are minimal: thin clients, powerful servers.
At the end of the day, I wanted to spend more time tuning the ML pipeline rather than fussing with drivers, OS dependencies, etc
Sure, there are lots of things that Paperspace could do better, but their existing product is already leaps and bounds better than GCloud or AWS. AWS and GCloud win through big contracts with large businesses, and I'm just a little guy.
Disclosure: I do not work for Paperspace and am not paid to endorse them in any way. I love their product.
Memory? Because if you can spread your model across multiple GPUs, and you've implemented Krizhevsky's One Weird Trick to switch between reducing the smallest of either parameters or deltas, you're golden.
I thought tensor cores and NVLINK would end up Tesla differentiators, and really great ones at that, but now they're both in the Turing consumer GPUs so I am really scratching my head here.
That said, the EULA is just stupid. I cannot use CUDA 9.2 or later at work because of it. No one is going to audit our computers for any reason ever, period, full stop.
I continue to be boggled that Alex Krizhevsky's One Weird Trick never made it to TensorFlow or anywhere else:
I also suspect that's why so many thought leaders consider ImageNet to be solved, when what's really solved is ImageNet-1K. That leaves ~21K more outputs on the softmax of the output layer for ImageNet-22K, which to my knowledge is still not solved. A 22,000-wide output sourced by a 4096-wide embedding is 90M+ parameters (which is almost 4x as many parameters as in the entire ResNet-50 network).
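A quick sanity check on that parameter count (the ~25.6M total for ResNet-50 is the commonly cited figure, assumed here):

    # Parameters in a fully connected softmax layer mapping a 4096-wide
    # embedding to 22,000 output classes (ignoring biases)
    embedding_width = 4096
    num_classes = 22_000
    fc_params = embedding_width * num_classes        # ~90.1M
    resnet50_params = 25_600_000                     # commonly cited ResNet-50 total (assumption)

    print(f"Output layer alone: {fc_params / 1e6:.1f}M parameters")
    print(f"~{fc_params / resnet50_params:.1f}x all of ResNet-50")   # ~3.5x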
All that said, while it will always be cheaper to buy your ML peeps $10K quad-GPU workstations and upgrade their consumer GPUs whenever a brand-new shiny becomes available, be aware NVIDIA is very passive-aggressive about this, following some strange magical thinking that this is OK for academics but not OK for business. My own biased take is that it's the right solution for anyone doing research, and the cloud is the right solution for scaling it up for production. Silly me.
You will want to use hierarchical outputs in this case. Take a look at Hinton's 'Knowledge Distillation' paper.
Somewhere along the way we forgot about this, and it's now perfectly normal to run a blog on a 3-VM GKE Kubernetes cluster costing 140 EUR/month.
In the end we started migrating to Hetzner, as they finally had servers close enough to make it worth offloading that work to someone else. Notably, Hetzner reached cost parity for us; AWS was still just as ridiculously overpriced.
There are certainly scenarios where using AWS is worth it for the convenience or functionality. I use AWS for my current work for the convenience, for example. And AWS can often be cheaper than buying hardware. But I've never seen a case where AWS was the cheapest option, or even one of the cheapest, even when factoring in everything, unless you can use the free tier.
AWS is great if you can justify the cost, though.
A similar question applies to any other dynamic cost-reduction measure, such as spot instances.
I recall reading an announcement that GCP was starting to charge only for actually-used vCPUs, rather than all that were provisioned - a form of automatic elastic cost savings, although it was still more expensive than a DIY method. AFAIK, AWS doesn't do anything like that.
We've seen an even wider cost disparity on colo and dedicated servers vs AWS. More than 5x. It's easy for us to estimate because databases and web servers need to be on all the time, and those dominate costs.
So does using a cloud service. It's not actually obvious, conceptually, but very little of the admin overhead has to do with the "own hardware" aspect of running it, especially if one excludes anything that has a direct analog at a cloud service.
There certainly exist services that abstract away more of this, but that's in exchange for higher cost and lower top performance, and that doesn't scale (in terms of cost).
> Sure, if you run a solo operation, you can just get up during the night to nurse your server, but at some point that no longer makes sense to do.
I'd actually argue the reverse. My experience is that the own-hardware portion took at most a quarter of my time, and that remained constant up to several hundred servers. It's much cheaper per unit of infrastructure the more units you have.
The tools and procedures that allow that kind of efficiency were the prerequisite for cloud services to exist.
I ran a detailed cost analysis of tier 3 on-prem vs AWS about 7 years ago. I included the cost of maintaining servers, support staff salaries, rent, insurance, employee dwell time, etc., and on-prem was still cheaper. Maybe it's different now.
We put significant thought into being cheap. I think constraint can breed innovation.
When it comes to inference, sometimes you want to ramp up thousands of boxes for a backfill; sometimes you need a few to keep up with streaming load.
Trying to do either of these on in-house hardware would require buying way too much hardware which would sit idle most of the time, or seriously hamper our workflow/productivity.
I wonder if someone might provide some clarification on this. Is this to say that only resellers who buy directly from Nvidia are compelled by some agreement they signed with Nvidia? How else would it be legal for Nvidia to dictate how and where someone is allowed to use their product? Thanks.
At scale, you need more than just hardware. It's maintenance, racks, cooling, security, fire suppression etc. Oh, and the cost of replacing the GPUs when they die.
At full price, yes, cloud GPUs on AWS aren't cheap, but at potentially a 90% saving in some regions/AZs, the price of spot instances - bidding on unused capacity for ML tasks that can be split over multiple machines - makes using cloud servers a much more attractive prospect.
I think this post is conflating one physical machine to a fleet of virtualised ones, and that's not really a fair comparison.
Also, the post refers to cloud storage at $0.10/GB/month, which is incorrect. AWS HDD storage is $0.025/GB/month and S3 storage is $0.023/GB/month, which is arguably more suited to storing large data sets.
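To put those rates in perspective, a tiny sketch (the 1 TB dataset size is just an example, not something from the post):

    # Monthly cost of storing an example 1 TB dataset at the three rates mentioned above
    dataset_gb = 1000
    rates = {
        "article's figure ($0.10/GB)": 0.10,
        "AWS HDD volume ($0.025/GB)":  0.025,
        "S3 standard ($0.023/GB)":     0.023,
    }
    for name, rate in rates.items():
        print(f"{name}: ${dataset_gb * rate:,.0f}/month")
    # -> $100, $25, and $23 per month respectively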
The equivalent of an i3.metal is probably around $30,000 to $40,000 with Dell or HP, and probably half that if self-assembled (like a Supermicro server). An AWS i3.metal will cost $43,000 annually - so even more than the acquisition cost of the server, a server which will probably last around 5 years.
But if you start taking into account all the logistics, additional skills, people, and processes needed to maintain a rack in a DC, plus the additional equipment (network gear, KVMs, etc.), the cost win is far less evident, and it also generally adds delays when product requirements change.
Fronting the capital can be an issue for many companies, especially the smaller ones, and for the bigger ones, repurposing the hardware bought for a failed project/experiment is not always straightforward.
You've mostly described what one pays a datacenter provider, plus hiring someone who has experience working with one (and other own-hardware vendors, such as ISPs and VARs), which doesn't cost any more (and maybe less) than hiring someone with equivalent cloud vendor expertise.
> plus the additional equipment (network gear, KVMs, etc.).
Although these are non-zero, they're a few hundred dollars (if that) per server, at scale, negligible compared to $20k.
> The cost win is far less evident
It still is, since the extra costs usually brought up are rarely quantified, and, when they are, turn out to be minor (nowhere near even doubling the cost of hardware plus electricity). AWS could multiply it by 10 (as in the very rough pricing example you provided).
> generally adds delays when product requirements changes.
This is cloud's biggest advantage, but it's not directly related to cost. This advantage can easily be mitigated by merely having spare hardware sitting idle, which is, essentially part of what one is paying for at a cloud provider.
So where does the money go?
- AWS/Google/Whoever-you're-renting-from obviously get a cut
- Inefficiencies in the process (there's lots of engineers and DB administrators and technicians and and and people who have to get paid in the middle.)
- Thirdly, and this is what most surprised me, NVIDIA takes a big cut. Apparently the 1080 Ti and similar cards are consumer-only, while datacenters & cloud providers have to buy their Tesla line of cards, with the corresponding B2B support and price tag ($3k-8k per card).
So, given these three money-gobbling middlemen, it does seem to kind of make sense to shell out $3,000 for your own machine if you are serious about ML.
Some small additional upsides are that you get a blazing fast personal PC and can probably run Crysis 3 on it.
Curious how that relates to sticking only a single 1 TB SSD in the machine, as a couple hundred dollars per month should correspond to a couple of terabytes.
Standing up a bunch of EC2's in AWS is just a horrible idea and an expensive one as well. It also moves all of the on-prem problems (patching, backups, access, sys admins as gatekeepers, etc.) to the cloud. It's the absolutely wrong way to use AWS.
So stop sys admins from doing that as soon as you notice. Teach them about the services and how, when used properly, the services are a real multiplier that frees everyone up to do other, more important things rather than baby sitting hundreds of servers.
I decided to do the EC2 thing when I built one of my products, knowing that I couldn't have vendor lock in - and that decision was critical to the survival of my company when:
1) A customer wanted to run on Azure. We would have lost a 2.5 million pound contract if we couldn't do that.
2) Another customer wanted an on-prem solution. We would have lost a 55 million USD contract if we were vendor-locked to AWS.
so sometimes it makes sense
The Dockerfile used to build the containers in ECS produces images that can be pushed to any registry, and the resulting containers can run on any Docker service.
What vendor lock in? I just don't buy that argument.
Use the services Luke ;) it will cost much less and you can focus on other things!
Sure RDS can be swapped out easily, but what if I'm using 10 or 20 services?
We're way overdue for a p2p marketplace for cycles.
I don't believe there are any blower-style 20-series cards. The reference cards use a dual-fan design.
Also, look into using S3 for long-term storage, instead of leaving your stuff cold in EBS volumes. It's quite a bit cheaper.
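If it helps, a minimal boto3 sketch of that kind of archival move (bucket name, file path, and the infrequent-access storage class are illustrative assumptions, not anything from the parent comment):

    import boto3

    # Push a finished dataset/model snapshot to S3 instead of keeping it on an idle EBS volume
    s3 = boto3.client("s3")
    s3.upload_file(
        Filename="/data/experiments/run-042.tar.gz",  # hypothetical local archive
        Bucket="my-ml-archive",                       # hypothetical bucket
        Key="snapshots/run-042.tar.gz",
        ExtraArgs={"StorageClass": "STANDARD_IA"},    # infrequent-access tier is cheaper still
    )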
That's getting pretty extreme.
Of course you're also paying for everything else AWS brings, and the ability to spin up/down on demand with nearly unlimited scalability, which is hardly "free." AWS is also a very profitable business for Amazon, so they're making a good margin on most of their pricing too.
Last time I looked, for mid-range AWS instances, purchase price was about 6-12 months of the rent. That’s assuming you buy comparable servers, i.e. Xeon, ECC RAM, etc…
For GPGPU servers, however, the purchase price is only 1-2 months of Amazon rent. Huge difference. The 1080 Ti is very comparable to the P100: the 1080 Ti is slightly faster (10.6 TFlops versus 9.3), while the P100 has slightly more VRAM (12/16 GB versus 11).
I will bring this up with the rest of my data engineering team, this might be a good idea.
You can invert that to 3 years of use at 33% utilization which would come out cheaper (or is my maths broken?). Still doesn't sound like it'd be a good match for your usage though.
That 3 years is extremely conservative though, in reality it would probably be much longer and there are some potential upgrade paths to factor in as well. Not to mention the potential to use it for other purposes.
The article somehow completely ignores integration with other services to query or update the model, which requires some API hosted on a static IP or domain, as well as the devops process.
You're not using the CPU for anything, there is no reason to have a beefy 12-core CPU there. If you really want Threadripper, the 8-core version is fine.
Again, personally I'd look at X79 boards, since they are pretty cheap and you can do up to 4 GPUs off a single root, depending on the board. There are new-production boards available from China on eBay, see ex: "Runing X79Z". Figure about $150 for the mobo, about $100 for the processor, and then you can stack in up to 128 GB of RAM, including ECC RDIMMs, which runs about $50 per 16 GB.
There are some Z170/Z270 boards, like the "SuperCarrier" from ASRock, that include PLX chips to multiplex your x16 lanes up to four x16 slots (this does not increase bandwidth, it just allows more GPUs to share it at full speed!). They also will not support ECC (you need a C-series chipset for that, which runs >$200). So far most OEMs have been avoiding putting out high-end boards for Z370 because they know Z390 is right around the corner, so there is currently no SuperCarrier available for the 8000 series.
Bearing in mind, spending $400 on a motherboard pales when you are talking about an $1,800 CPU and 4x high-end graphics cards.
Even then, the $2k difference between the cheapest pre-built the article references and the DIY version would be 20 hours at $100/hr. Half a workweek to build one PC seems excessive.
If compute needs were all internal, as they are with desktop apps (namely, compilation) and with the majority of computationally expensive machine learning demands (namely, training), then I'd argue the pooled compute model would have remained niche.
TL;DR: The proposed machine costs $3k plus ~$100-200?/month for electricity; a comparable AWS instance is currently $3/hr.
My conclusion: if you're going to do more than 1,000-2,000 hours of GPU computing in the next few years, start thinking about building a machine.
The cloud only really makes sense for the on-demand burst capability IMO.
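To make the break-even explicit, a rough sketch using only the figures above (the per-hour electricity number is an assumption derived from the ~$150/month midpoint):

    # Back-of-the-envelope break-even for buying vs renting a GPU machine
    machine_cost = 3000.0     # $ up front, from the TL;DR above
    aws_rate = 3.0            # $ per hour for a comparable AWS instance
    electricity_rate = 0.20   # assumed $ per hour (~$150/month of continuous use)

    break_even_hours = machine_cost / (aws_rate - electricity_rate)
    print(f"Break-even after ~{break_even_hours:.0f} GPU-hours")   # ~1,070 hours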
Would it surprise anyone to learn that renting a car is more expensive in the long run compared to buying one? This is the same thing, only the time scales are different.
The analogy doesn't quite map for all aspects so I'll stick with the ski example to show it's not specific to cloud computing: Owning skis has a load of associated costs ranging from simple to more complex: storage, maintenance, carrying them to and from the airport, having to spend a lot of energy on the decision of which pair to buy as it will be expensive to change my decision later, having the general background burden of owning another thing that I have to think about on some level even when I am not skiing.
All these things are ultimately paid for with money directly, or time, or mental energy that I could be spending on other things. What's the cost of all those combined? It's really hard to say and may even extend beyond a simple economic function and get into the philosophical arena. But you can't deny there is a cost and you can't ignore that cost when evaluating if renting is more or less expensive than owning, over an arbitrary time period.
I suspect that there is something like that (only not at all) going on with machine learning kit: If you've got exactly one person training models, and they're doing it all the time, then yeah, it's cheaper to own your own hardware. If you've got several people all trying to share the training rig, then you start having to worry about all the costs associated with scheduling time, delays if someone else got there first (especially if they're doing deep learning and will be monopolizing the server for upwards of a week), etc. Once you factor all those full-lifecycle costs in, the cloud might start to look relatively cheaper.
To me it's been the "should I buy an airplane" problem. (Basically, you have to rent a lot, > 50 hrs/year, before buying makes sense financially, and even then it might not work out very well unless you fly a ton)
Hence, some of these savings should be passed on to consumers, and it should be cheaper.
The counter-argument is that the cloud is over-engineered (redundancy, backups, better software) for enterprise customers. You are paying for all these benefits whether you want them or not.
One is that AWS (and Azure and Google) are taking advantage of their brand recognition and are pricing their services extremely high compared to their competition. For some that is worth it, as they need services they don't get elsewhere, but for a lot of people there are far cheaper cloud services out there. Sometimes even hybrid solutions make sense. For example, a particular problem with AWS is the absolutely insane bandwidth prices. If you "need" to store your data on S3 - because they're the only ones you trust for durability, or because it fits in with other parts of your workflow - but you access most objects regularly enough from outside AWS that bandwidth is a big part of your cost, it can often pay to rent servers somewhere like Hetzner just to act as caches.
The other problem is that while it looks straightforward, as much as we might like to think so, servers are not as much of a commodity as we'd like to think. Components are.
The really big savings in running your own comes when you measure how much RAM you need and what CPU you need, and whether or not disk IO is an issue, and configure servers that fit your use better. If the cloud providers have instances that fit your needs, awesome, you can get a decent price (if you shop around). If they don't, you can easily end up paying a lot for components in your instances that you don't need.
When you don't know what you need, it might be ok to pay that premium, but as soon as you know what you need, there are many cases where buying your own allows you a degree of specialisation that does not pay for the cloud providers because it's better to provide a smaller range where they can get economy of scale. I've seen cases where e.g. being able to shove enough RAM in a server cut costs by 90% or more vs. splitting the load over multiple lower RAM cloud instances.
I'm not saying it will stay this way forever - there are certainly regularly more and more combinations that you can find decently priced cloud instances for - but there are still tons of cases where building your own (or having someone configure one for you) is so much cheaper that it's nearly crazy not to.
Two years ago you would be perfectly correct. The cheapest AWS option back then would be a micro EC2 instance at around $20 per month IIRC.
Lightsail is really great for personal sandbox servers, or for moving to AWS on the cheap, but it's not even as capable as its worst competitor (and certainly cuts out a lot of features compared to real AWS.)
The worst part is probably the surprising and seemingly minor factor of not being able to simply rename or resize a server. This makes using it in a team or production environment surprisingly difficult.
With all that said, I still find Lightsail very useful for small prototyping and other quick jobs (or as a Lambda replacement - it's way cheaper than running any reasonably busy Lambda function), but I can't see using it for real production usage. Your best bet for real production usage, or something that will someday move into production, is probably Vultr/DO/Linode. (OVH really wants to be great, and it should be because the hardware is really great, but the dashboard and billing are so bizarrely bad...)
AWS is good when you have a virtually unlimited budget or don't mind running up the bill while you're getting rolling, but getting locked into AWS will mean that you are stuck with it forever. It's very hard to migrate away if you start using a vendor-specific thing like DynamoDB (to be fair, Google Firebase has exactly the same issue; Instances are basically fungible, but proprietary data stores are not)
I wonder why that's almost never discussed, as if that obvious downside were non-existent.
For GPUs it's the opposite. With some legal BS, Nvidia forbids Amazon from renting out GeForces. That doesn't look OK to me; I don't think Nvidia has the right to decide how people will use the hardware they have bought. Anyway, Amazon has to buy Teslas, and Teslas are 10 times more expensive.
Still, “owning X” and “getting a working X for a while at a time and place most convenient to you” are two different services.
It also means you need to build the computer, maintain it, upgrade it, store it somewhere, connect it to the internet, and so on.
And you can't just scale it if you need to.
I mean, the cloud's here for a reason...
Why "of course"? Last time I checked baking my own cookies is more expensive than just buying a bag at Walmart.
If you need 5 cookies today (and maybe the next batch next week), go to Walmart. Same for an one-off DL job.
So if your usage is 4 hours per week, 50 weeks a year, you hit $600/year of GPU. If your usage is 24 hours a day, you hit $27k.
So it's not a clear-cut "buying metal is cheaper" - you have to decide how your usage amortizes. As pgeorgi says, if your demand is a pack of cookies a week, buying a bakery isn't the most cost-efficient way to satisfy that demand.
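Working those two utilization points through (a quick sketch; the $3/hr rate is the one quoted earlier in the thread):

    # Annual cloud GPU cost at two utilization levels
    hourly_rate = 3.0                 # $ per GPU-hour

    light_use = 4 * 50                # 4 hours/week, 50 weeks/year -> 200 hours
    heavy_use = 24 * 365              # around the clock            -> 8,760 hours

    print(f"Light usage: ${light_use * hourly_rate:,.0f}/year")    # $600
    print(f"Heavy usage: ${heavy_use * hourly_rate:,.0f}/year")    # ~$26,000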
For anything bigger than that, you'll quickly run into issues. Try talking to your ops team and telling them you want to set up a mid-range desktop PC with extra RAM and a high end graphics card as a DL workstation. I think you'll very quickly find some friction, especially once you want to go into production with it.
IIRC their original hardware plan was based around commodity hardware and getting the most performance per dollar since they built the crawler to be distributed pretty early on.
Point being, consumer grade hardware has come a long way since then, and if you're doing something cutting edge like DL it's not outlandish to expect that rolling your own might be worthwhile and totally justified.
Google's start-up period, 1996 to 1998, was also over twenty years ago. There were fewer than 200 million Internet users and fewer than 2.5 million web sites in 1998. Google was also started by two grad students, meaning gasp it was initially just a research project. I'd also point out that while, yes, the initial production servers were cheap and used commodity hardware, this is what they looked like, which is hardly the type of setup that the article is suggesting.
I do recall a time as recently as 4 years ago where it was difficult to find, for example, rack-mountable servers that could hold a large number of these GPU cards, but substituting for a mid-range PC with 2 or 4 of them would not have been an issue, even then.
Cost would be higher, but not necessarily outrageously so, since large server chassis often have startlingly large PSUs. It might be an extra 50% on top of that $3k, but that's still competitive, even with the pre-builds the article mentions.
Yes, but most likely they aren't of comparable quality
That drops to $1149 if I don't require Vegan, non-GMO, Kosher:
SNAP benefits at anywhere from $134 to $192 a month are both more than sufficient for this. So why are ~42 million people starving in America? Answer (IMO): Cloud providers reduce the friction of GPU adoption the same way grocery and convenience stores reduce the friction of food acquisition, both in exchange for profiting off of it.
(not directing at you, more explaining to those who didn't already know)
> Even Walmart’s “Great Value 100% Whole Wheat Bread” contains seven ingredients that Whole Foods considers “unacceptable”: high fructose corn syrup, sodium stearoyl lactylate, ethoxylated diglycerides, DATEM, azodicarbonamide, ammonium chloride, and calcium propionate.
And awful if you want anything resembling "real food". (Not blaming anyone for buying them, I buy pre-made cookies as well.)
But yes, getting a nice looking/tasting cookie might require a few tries.
The answer is not clear cut!
This is not the normal human experience at all. Raw foods are usually cheaper than processed ones, anywhere in the world.
Cheap palm oil and other vegetable fats instead of butter, corn syrup instead of cane sugar, much lower quality chocolate than is generally available in grocery stores, various chemical stabilizers instead of eggs, artificial vanilla flavoring instead of real vanilla and so on.
Plus they obviously get everything at a massive discount since they're buying everything by the ton instead of by the kg.
Kind of like how if you remove all the subsidies given to fossil fuel producers, some forms of renewable energy suddenly become much more competitive.
So when you add a whole cost center for IT, the inefficiencies of the way they operate suddenly make the cloud efficient.
Note that this works for electronics as well. You could buy one resistor for a cent or a thousand of them for a dollar.
The comments on cheap low quality ingredients and bulk pricing are true as well.
And I am sure the local Asian cash-and-carries sell even bigger bags even cheaper.
Cause AWS is _really_ expensive...