UC Berkeley launches SkyPilot to help navigate soaring cloud costs (datanami.com)
306 points by kungfudoi on Dec 13, 2022 | 108 comments

I remember when the "cloud" hype was first beginning to make a splash. A pillar selling point was "savings".

"Reduce your costs", they said. "Don't worry about those expensive hardware costs", or management. "Only pay for what you need", they said.

Until everyone was hooked and the tooling was everywhere.

Even just putting the pricing for hardware/services aside, the bandwidth charges alone are quite amazing. Most charge about $80 per TB transferred. That's $80 a month for a sustained 3 Mbps stream [3 Mbps * 3600 seconds an hour * 24 hours * 30 days ≈ 1 TB]. Even if you pretended it was 6 Mbps for double the redundancy, that's still $80 for a 6 Mbit connection. I understand there's more to it than just transfer amount/speed, but this is in a place where bandwidth is supposed to be cheaper, as it's "purchased" wholesale, yet a 1 Gbit upload connection could transfer roughly 325 TB in a month, which would cost over $23,000 at Microsoft/Google/Amazon.
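Back-of-the-envelope, those numbers check out. A quick sketch (the $80/TB figure is the commenter's round number; real egress pricing varies by provider and volume):

```python
def monthly_transfer_tb(mbps: float, days: int = 30) -> float:
    """Data moved in a month by a link running flat-out, in decimal TB."""
    bits = mbps * 1e6 * 3600 * 24 * days  # megabits/s -> bits over the month
    return bits / 8 / 1e12                # bits -> bytes -> TB

EGRESS_PER_TB = 80  # the ~$80/TB round number used above

# A saturated 3 Mbps stream moves about 1 TB a month...
print(round(monthly_transfer_tb(3), 2))   # 0.97

# ...while a saturated 1 Gbps link moves 324 TB, i.e. ~$26k at $80/TB
tb = monthly_transfer_tb(1000)
print(round(tb), round(tb * EGRESS_PER_TB))  # 324 25920
```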

I am not saying the cloud doesn't make sense. I love me some Firebase. CDNs are game changers. And GPUs by the hour are the only way some scientists/researchers can afford to do their work [although it does contribute to NVIDIA focusing on manufacturing high-price/low-volume parts].

That said, it's not for everyone. A cheap VPS can do a lot.

Specifically for a university or an institution with a lot of cheap/free bandwidth, on-site would not only be cheaper, but would also have lower latency for its campus users and offer more control.

Also, if done right, there's a possibility of teaching the students relevant real-world solutions using the knowledge gained/within the university.

"Simplify to succeed and Complicate to profit"

-- (likely) Tim Ferriss.

I sat out of the cloud hoopla and it cost me opportunities - Now, I am dragging my feet into it to be employable.. but the Anti-Cloud (River?) is going to be a concept. 37signals already did a blog post and a change in strategy.

I look forward to the economically lean years of 2023 and 2024 and HN front page articles about how moving away from Cloud and owning your own server is saving 'millions'.

> I sat out of the cloud hoopla and it cost me opportunities - Now, I am dragging my feet into it to be employable.

This. I work on sensors, signal processing and multi-sensor data and could always get a local prototyping rig. But the cloud got so big that I have to be at least somewhat comfortable with working in it just to stay employable.

Before sharing anything I check that it works in a Jupyter notebook running in a Docker container hosted on AWS, getting data from some cloud bucket. But for real work I switch to Vim on a local machine with a few TBs of sensor data to experiment on and enjoy an order of magnitude speed improvement.

> I look forward to the economically lean years of 2023 and 2024 and HN front page articles about how moving away from Cloud and owning your own server is saving 'millions'.

I think there is a strong business case for this. Probably not fully off-cloud, but streamlining cloud portions for cost efficiency and going away from keeping everything in the cloud "because zero ingress costs". My 2c.

Per a recruiting email I received last year, I believe anti-cloud computing is called "fog computing".

Most people don’t bother refactoring their application and leave whatever IIS garbage they have running on windows 20xx in a cloud instance (to say nothing about the windows licensing cost).

Also universities don’t have the same cost models as companies. Their labs have less money - except those institutions with large endowments - but they all have access to near limitless “slave” labor in the form of grad students and post docs.

When I was involved with univs, the labs were bursting at the seams with hardware, because companies abounded who loved nothing more than dumping product on us cheap or free, especially if we'd get students using it.

I'm at a university lab and would love some free computing and storage hardware. Any tips?

Write letters to companies you're interested in. You should have a relatively standard form letter explaining who you are, what the university is, how great it is, etc, etc. We got a moderate hit rate on those (higher if you consider discounts).

Look through donor and alumni records for big name companies.


Free stuff always helps. But it’s not the same as cold hard cash.

Oh, certainly. Most of the stuff was kinda useless and we repurposed massive blade servers to host some tens of email accounts, etc.

Savings is a really complicated thing to measure, especially when you're trying to accurately account for internal costs which aren't cleanly broken down per-project. It's very easy to get an itemized bill from a cloud provider which looks much larger than your internal service because the latter numbers don't include staffing or overhead costs (e.g. how much idle hardware are you paying for because the alternative is telling someone they can't have a VM until you complete a new procurement?). What I've found is that almost every time someone tries to get those internal numbers it goes the other direction by a significant margin because they weren't accounting for things like how much staff time they spend managing systems, especially for the compliance type stuff.

One really big factor here is opportunity cost: if it takes 6 months (not hyperbole in .edu) to get a new server from the enterprise IT department from request to being in service and correctly configured, there's a significant cost which you don't receive on an invoice. This is especially true if you're looking for any service more advanced than a bare Linux/Windows install where the staffing & internal barriers are going to dominate the true cost.

Network bandwidth is definitely the most egregious point to complain about with commercial cloud providers and it'll cancel out any other savings for some applications. You can negotiate significant discounts with most of them and for something like UC Berkeley I'd also assume they might be able to take advantage of things like AWS' cheaper peering using Internet 2, but it's still something I'd mention to your sales rep as a deterrent any chance you get.

6 months is fast. I sold our first on-premise customer: $1M internal budget simply to hook up the hardware. The project was delayed a year due to budgeting and logistical issues. And this is a large financial institution, one used to managing on-premise hardware.

Also, it's expensive, but cloud is a competitive strength. If you manage your own hardware or VPS, that is not only a reliability risk, but a huge competitive weakness in B2B. One of our competitors uses "we are the only true cloud vendor" as FUD.

Reducing cost was always about 1) getting rid of the hardware infrastructure team and 2) getting rid of the tax of having to think about the hardware as a CIO

But the cost difference for a large company is many dozens of times the total comp. of even the highest paid CIO?

At my institute, we've been steadily expanding our HPC facilities and adding newer hardware for the SOTA research being done by faculty and students.

Also gives people exposure on how to manage their instances and tooling required around their projects, which I reckon helps in the long run.

I've been using SkyPilot for 3 months within an ML platform project and it is exactly as awesome as it sounds. The overall experience for launching and managing compute is thoughtful and ergonomic in a way that we've been conditioned not to expect from cloud compute (let alone GPU compute). It was a real shock to start using it, and now every time I need to jump back into the weeds of a cloud provider's APIs I'm reminded of how nice I have it in SkyPilot land.

I think you'd have a hard time finding a better cloud launcher anywhere. I couldn't.

Underneath, SkyPilot uses Ray quite a bit. You can read its paper here: https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s...

Does it support spot instances?

they'd be crazy not to, unless the ML pipeline didn't support retries: https://skypilot.readthedocs.io/en/latest/examples/spot-jobs...

I wonder why UC Berkeley doesn't build a proper HPC cluster; they have a Data School and should provide this service for free to their faculties. We have "free" HPC resources at TU Dresden (Germany) (meaning: faculties do not need to pay for using HPC resources and they are not calculated into project budgets). I once applied for a job at the University of Virginia, and they didn't have an HPC cluster - everything was bought from AWS. When students accidentally left stuff running, the professor had to beg Amazon to reimburse the fees. This was the main reason I was hesitant about taking the offer. I even calculated building my own "cloud" with a Proxmox cluster at home, so that I could teach students the basic stuff.

Berkeley has at least one cluster, probably more.


University HPC clusters are typically managed and owned departmentally. My former employer had two clusters in two different departments. I worked directly with a few other universities who also had departmental HPC clusters, and I’ve read a boatload of HPC docs from different unis and labs. Sharing clusters for many types of workloads seems to happen more at the regional level (like Archer cluster in Edinburgh, or PNNL in WA).

Ours is a fairly small cluster of maybe 60 or 80 job nodes. All compute networking travels over Infiniband (I think we bought HDR for the new cluster, but price may have held us to EDR), which is probably why the cluster was so expensive to build (around $2M). Our storage cluster was another $1-2M project.

All that is to preface this: cloud usage doesn't mean they're only using cloud resources. Our goal, for instance, was to build a hybrid cluster (and that's their continued goal, as far as I know). The first step was to offload low-priority work, with low resource requirements and wall times and comparatively long timelines, to the cheapest possible compliant cloud provider (some of our researchers have specific data sharing and privacy requirements).

Let's say a job is submitted on Monday morning at 9 am. It needs 2 CPUs and 4 GB of RAM, and a wall time of five minutes. The researcher can wait until next Monday at 10 am for the results. There's at least a chance that running this job on the on-prem cluster is less cost-efficient than offloading it to another resource, even comparing direct costs alone (i.e. 5 minutes on-prem costs $0.10, and the cloud costs $0.05, or whatever the real values would look like).

Ideally, we would have some method by which you could compare the cost to run a job in the cluster vs another resource, and it would automatically offload jobs of up to $X to the cloud, whether as a very low priority queue or as overflow in times of full utilization. There are several other conditions that would need to be met as well, but you get the idea (just for example, one consideration is whether the work can be interrupted; if the job has to run start to finish, the cloud is likely not suitable for it).
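That offload rule could be sketched roughly like this; all of the names, rates, and thresholds here are hypothetical illustrations, not anything from a real scheduler:

```python
from dataclasses import dataclass

@dataclass
class Job:
    cpus: int
    mem_gb: int
    wall_time_min: float
    deadline_hours: float  # how long the researcher can wait for results
    interruptible: bool    # can the job tolerate e.g. a spot reclaim?

# Illustrative per-CPU-minute rates; real values would come from internal
# accounting and the cloud provider's price sheet.
ON_PREM_RATE = 0.02    # $/CPU-minute, hypothetical
CLOUD_RATE = 0.01      # $/CPU-minute, hypothetical
OFFLOAD_BUDGET = 5.00  # the "$X" cap on jobs we auto-offload

def should_offload(job: Job) -> bool:
    """Offload low-priority work when the cloud is cheaper, the job fits
    the budget cap, and interruption is tolerable."""
    cloud_cost = job.cpus * job.wall_time_min * CLOUD_RATE
    on_prem_cost = job.cpus * job.wall_time_min * ON_PREM_RATE
    return (job.interruptible
            and job.deadline_hours >= 24   # not urgent
            and cloud_cost <= OFFLOAD_BUDGET
            and cloud_cost < on_prem_cost)

# The Monday-morning example: 2 CPUs, 4 GB, 5 minutes, week-long deadline
job = Job(cpus=2, mem_gb=4, wall_time_min=5, deadline_hours=168, interruptible=True)
print(should_offload(job))  # True
```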

It really depends on the workload. GPU clusters are usually cheaper to run in house, since Nvidia lets you use regular GPUs for research, which end up cheaper than cloud GPUs. And often universities will charge less overhead for capital expenses on a grant, which can artificially reduce the cost of running it yourself.

The big downside of institutional HPC is it can be difficult to get stuff running on the ancient distributions they're stuck on. Nowadays perhaps Docker makes things easier but even running a Docker container was a challenge when I worked at a university some 5-10 years ago. As a software engineer who had not used HPC previously I found it much more difficult than just firing up jobs on AWS which had documentation (though for many researchers it was the opposite.)

I also kind of object to the term HPC... with the exception of a small number of shared memory clusters used for physics simulations they're usually just a bunch of standard servers often with incredibly slow network storage. Nothing high performance about them.

> I also kind of object to the term HPC... with the exception of a small number of shared memory clusters used for physics simulations they're usually just a bunch of standard servers often with incredibly slow network storage. Nothing high performance about them.

I'm sure there are plenty of places who have production clusters equivalent to our "test cluster" that we ran in VMWare (which we only used to check version compatibility during upgrades), but my experience working on an HPC team at a research university is that most universities are using real HPC clusters. They're not all equally built and managed, but they have 50+ compute nodes using Infiniband (or equivalent) for interconnectivity between nodes, and to connect to the back end SAN, which runs a distributed, parallel file system (usually GPFS, sometimes Lustre, or BeeGFS).

Apptainer (formerly Singularity) is aimed at containerizing HPC workloads. You can build it with Docker commands, and then convert to Apptainer’s format, so it’s pretty easy to use. You don’t run Docker directly, in any case.

Correct that compatibility is a huge issue. It got ugly at the end of our EL6 cluster’s life. It didn’t run containers well, and our cluster was so entrenched in the old way of managing software (modules and Conda envs) that converting would have been a massive effort, and it may not have worked at all! There was a lot that had to get rescheduled or find a different place to run while we dealt with supply chain slowness.

>since Nvidia lets you use regular GPUs for research which end up cheaper than cloud GPUs.

Fun fact: that's illegal in Europe; producers have no right to tell you what to do with their products and your property.

I swear, every thread has at least one European bragging about something ridiculous like "fun fact, murder is illegal in Europe" without spending 10 seconds to check if it's also illegal in the US.

Yes, we have property rights in the US. No, it's not obvious whether buying the hardware also gives you a license to use Nvidia's CUDA libraries without any limitations. Nobody has tested this in court.

Personally, I've got many consumer GPUs in my data center. If Nvidia doesn't like that, they can sue me. Username is real name.

Is it really though? The biggest country in Europe is Russia, also the biggest population is Russians living inside Russia. Maybe you meant the EU which is far from being all of Europe. It is not even half of Europe. There are 23 countries out of 44 in Europe that aren't in the EU. If you talk law then a small detail like that matters a lot.

>The biggest country in Europe is Russia

Just a part of Russia is Europe... being as pedantic as you:


European Russia accounts for about 75% of Russia's total population. It covers an area of over 3,995,200 square kilometres (1,542,600 sq mi), roughly 40% of Europe's total landmass, with a population of nearly 110 million (over 15% of Europe's total), making Russia the largest and most populous country in Europe.


While you own the hardware you don't own the software necessary to make use of it, just a license.

Yes, and? Most US ToS are illegal in the EU anyway. Can't make money with your gamer card? Well then all Let's Players on YouTube have a problem.

Berkeley CS (and in particular the systems research labs like AMPLab, RISELab, etc.) gets enough funding from AWS, GCE, Azure, etc. for it to be uneconomical to have a data center administered by the campus.

That being said, there are HPC facilities shared with LBNL, as well as smaller clusters operated by the department.

At several universities I've been at, HPC groups have been utterly unprepared (and disinterested in becoming prepared) to handle PII or any sort of health or confidential data.

As we are talking anecdotes: universities I've been at that researched sensitive data, like human genetics and some commercially sensitive data, have been excellent at data security, and provided a centralised HPC cluster at a lower marginal cost than AWS would have been.

While hospital records are protected, genomics data is traditionally not considered PII, so it is not covered by HIPAA. It does seem a bit of a farce, though, considering it could be uploaded to GEDmatch and have a good chance of finding relatives of the person the sample was taken from...

Seems like GEDmatch is at fault there, though.

Or any interest in reliability or making it usable. Students are there for passion. People who work in university IT are just utterly unemployable elsewhere.

I too have been surprised by the poor state of research compute at American universities. Of course it's hard: that's why it needs some smart and expensive people who do research on computing to run it (but that's what universities are all about). But maybe it's a cultural thing: in the US organizations including universities like to rely on commercial services when they can instead of seeing the value of doing it in-house.

A lot of Berkeley researchers use the NERSC facilities at the nearby Berkeley National Lab: https://www.nersc.gov/

Interesting, thanks!

It's fascinating to see the descendants of the original AMP Lab (https://amplab.cs.berkeley.edu/) are still generating value long after it ended. For reference, Spark and Mesos are two other projects that came out of the same effort.

This blog post from the authors has more details: https://medium.com/@zongheng_yang/skypilot-ml-and-data-scien...

I believe autostop (simply stopping idle clusters) alone can massively (multiple times over) cut costs for many organizations.

(SkyPilot dev) Couldn't agree more on autostop/autodown. Hard to measure objectively, but it saves our own costs massively.

I believe the Ray suite of ML + distributed computing tools is also from AMP descendants:


I feel that the old SETI@home project could use a comeback. Most home machines have powerful GPUs these days. ML problems are embarrassingly parallel and could be distributed to home machines.

Just need to work out the economics for everyone involved.

Datacenter GPUs have >1,000 Gbps network connections between nodes, which is necessary to actually utilize GPUs with current training techniques. It's possible that a furthering of the techniques used in GPT-JT[1] might make it feasible to use home computers, but even GPT-JT requires at least a 1 Gbps connection.

[1]: https://www.together.xyz/blog/releasing-v1-of-gpt-jt-powered...

Good point. One criterion for whether a SETI@home approach is reasonable for a class of problems is if the total processing time exceeds the network download and upload time. In SETI@home classic the work units didn't take too much time to download compared to how long it took to process a work unit.
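That criterion can be written down as a quick check. The bandwidth figures are rough assumptions for a home connection, and the SETI work-unit size is approximate:

```python
def worth_distributing(work_mb: float, compute_seconds: float,
                       down_mbps: float = 50, up_mbps: float = 10) -> bool:
    """SETI@home-style test: a work unit is worth shipping to a volunteer
    machine if crunching it takes longer than moving it both ways.
    (Assumes the result is roughly the same size as the input.)"""
    transfer_s = work_mb * 8 / down_mbps + work_mb * 8 / up_mbps
    return compute_seconds > transfer_s

# A classic SETI work unit: well under a megabyte, hours of compute
print(worth_distributing(0.35, 4 * 3600))  # True

# A gradient-exchange-heavy ML step: lots of traffic, seconds of compute
print(worth_distributing(500, 2))          # False
```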

A single ML job can't be distributed very well. More often than not they hit the network limit, even at the 1 GB/s single-zone network speed that we normally get. Most distributed workloads use something like NVLink.

That's true. Most ML algorithms have the iterate-until-converge pattern. How about tasks like hyperparameter tuning or trying out different algorithms against the same data set? Those can be run in parallel.
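A minimal illustration of that kind of sweep: each configuration is independent, so they can all run at once. The "training" function here is a toy stand-in score, not a real model, and real sweeps would fan out to processes or separate machines rather than threads:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def train_and_score(params):
    """Stand-in for a full training run; each config is independent of
    the others, which is what makes a sweep embarrassingly parallel."""
    lr, depth = params
    # toy 'validation score' peaking at lr=0.01, depth=5
    return params, 1.0 - abs(lr - 0.01) * 10 - abs(depth - 5) * 0.01

grid = list(product([0.001, 0.01, 0.1], [3, 5, 7]))

# threads suffice for the toy function; the point is only that the
# configs run concurrently with no communication between them
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(train_and_score, grid))

best_params, _ = max(results, key=lambda r: r[1])
print(best_params)  # (0.01, 5)
```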

ML research should aim to produce more parallel algorithms.

State-of-the-art models haven't fit on a single GPU for a while now.

Is that more for training or inference use?

I realize this isn't always entirely decoupled in certain online learning approaches. I don't work in ML, am certainly not an expert, and am genuinely curious where this space is at now in terms of hardware requirements for SOTA methodologies these days, especially inference phase HW requirements for just running stuff that's out there.

why....that sounds like Web3! (I kid I kid...)

Actually it's a good use of Web3: handling the billing and revenue-sharing portion of the problem.

Yeah, in theory it's a great idea. In practice, it always ends up as a challenge for economies of scale. Just look at Bitcoin, we can all mine it, but it's not worth it unless you scale. In the end, someone will build a data center, just to gather profits from whatever Web3 implementation you think of. Because their energy cost is lower, relatively, it will become too expensive to 'mine' at home.

It's the circle of capitalism.

Aren't AWS, Azure and GCP all comparatively expensive? They are good for certain workloads and it's good that this project exists, but if you wanted cheap cloud resources, you'd probably need to look at one of the smaller cloud vendors: Hetzner, Scaleway, DigitalOcean, Vultr, Contabo and others. Of course, if you need GPUs, things can get limiting.

They usually are, unless you get interruptible instances. You can compare pricing for most cloud providers here:


(my hobby project)

For infrequent, unpredictable workloads using something like spot instances on AWS and scaling as needed (including to zero) will likely be cheaper.

I think Hetzner and other budget cloud providers are excellent choices for use cases that require always on, reasonably predictable workloads like webservers.

The thing is they are so much cheaper, especially for egress, that it takes really extreme spikes or batch jobs for AWS to come out cheaper than leaving an excess number of servers running permanently somewhere like Hetzner. Very few sites have variable enough traffic that scaling up and down with usage saves enough to even get close.

There certainly are genuine cases where AWS will come out ahead, but they're rare. E.g. if you suddenly need several hundred instances for, on average, a few days a month, it's probably worth it. Very few people do that, and paying for the capacity to be able to offer it is part of the reason why AWS is so expensive. People's belief that AWS is cheap is another reason.

Ironically, the ability to run a hybrid setup that scales up in AWS when you genuinely need rapid extra capacity changes the maths even further in favour of dedicated servers from places like Hetzner (as does the fact most places like Hetzner now have their own cloud offerings) because it means you can go closer to the wire on your dedicated servers.

Does DO or Vultr even offer the HPC instances AWS does? 96 CPU, 384GB, etc? Let's not even mention storage or networking.

Such configurations sadly aren't relevant for me, because none of my workloads are that demanding, nor could I ever afford to rent such servers anyway. At that point, one might as well make the argument for forking over money towards the more "enterprise" platforms.

However, out of curiosity, I actually checked all of the mentioned platforms, to figure out where they cap out.

Cloud services:

  Contabo - CLOUD VDS XXL has 12 CPU and 96 GB of RAM, 149 euros/month
  Hetzner - CCX62 has 48 CPU and 192 GB of RAM, 532 euros/month
  DigitalOcean - caps out at around 32 CPU and 256 GB of RAM, 2096 euros/month
  Scaleway - ENT1-2XL has 96 CPU and 384 GB of RAM, 2576 euros/month
  Vultr - caps out at around 96 CPU and 255 GB of RAM, 3840 euros/month
Dedicated servers:

  Contabo - AMD EPYC 32 Cores has 32 CPU and 256 GB of RAM, 249 euros/month
  Hetzner - AX161 has 32 CPU and 1024 GB of RAM, 833 euros/month
  Scaleway - EM-T210E-NVME has 128 CPU and 2048 GB of RAM, 2625 euros/month
  Vultr - NVIDIA A100 has 96 CPU and 960 GB of RAM, 14000 euros/month
So to answer that question, yes, most of the platforms out there have more performant offerings, while the prices and details vary. It's great that these alternatives exist, even if the needs of most people will be more modest.

For example, my hybrid container cluster with 6 nodes has 13 CPU cores and a total of 54 GB of RAM, and even that's with room for growth, hosting basically everything that I need for about 30 euros a month. Most people out there will probably care about finding ways to cheaply run their WordPress site or something, as opposed to scaling way up.

That's kind of the beauty of the current market - there's something for everyone out there!

SkyPilot devs here, happy to answer any questions.

GitHub repo (Apache 2 license): https://github.com/skypilot-org/skypilot

Getting started is easy:

$ pip install "skypilot[aws,gcp,azure]" # Pick your clouds

$ sky check

$ sky launch

Consider adding CoreWeave for GPUs. The A100 there is about half the cost of a V100 on the large clouds.

Looks great! Do you anticipate any pushback from cloud providers, who might one day decide to restrict access to stop people getting a better price elsewhere?

We don't anticipate this will happen for a few reasons. When the usage of SkyPilot (or a SkyPilot-like "intercloud broker" system) is small, it probably doesn't warrant the dominant clouds' attention.

When the usage gets bigger, I'm not sure how providers can restrict access anyway (curious if there are precedents). There are quite a few large multicloud platforms like Snowflake or Databricks heavily utilizing AWS/GCP/Azure already. (Granted, these platforms are not meta-cloud, in the sense of moving their customer workloads transparently across clouds.)

Ultimately we see such a system growing the pie for the whole cloud market. The incumbents' relative shares may drop, but their absolute volume will grow.

I’m curious how this affects network and data storage costs. Maintaining data storage and private fiber to all the clouds has its own costs.

In my experience, if you want _the_ best price-performance ratio for ML jobs, your best bet is CoreWeave. They have great GPU availability and it's a lot easier to scale up/down very quickly than on the popular clouds. I would really be surprised if anyone else out there is offering a better price point for these types of workloads..

Check out Lambda Labs; they have very competitive pricing on GPUs and were almost half the price of CoreWeave at last check.

Indeed! Having lower-cost GPU clouds in the "Sky" is on our immediate roadmap: https://github.com/skypilot-org/skypilot/blob/master/ROADMAP...

In fact, as we speak we're working with folks at Lambda Labs to add support for their cloud. If other providers are interested, we'd be happy to chat.

(SkyPilot dev here)

Lambda Labs is fantastic, but they are so fantastic and popular that they often seem to run out of GPUs available to rent for me :D

Common issue for anyone running at scale. The crypto winter should hopefully address this.

This is great! If you are interested in running ML workloads across clouds, Netflix's Metaflow will officially announce support for all clouds tomorrow: https://outerbounds.com/blog/metaflow-on-all-major-clouds/

Quite a coincidence :)

I saw this project a while ago; it's a radical, exceptional idea. Could be quite interesting.

At some point people will rediscover that "buying" whole sets of equipment through a lease financing company on a lease-to-own plan with $1 end payment, and colocating it the traditional way can often be significantly cheaper than paying endless "cloud" costs.

Naysayers will say: but the cloud allows you to abstract away your salary costs of engineers! Look at all the people you aren't hiring!

I say: if your needs are sufficiently complex, you already have a number of well-paid, six-figure people on staff to admin your cloud-based software architecture/setup... Hire people who know how to build infrastructure down to the bare metal.

I've configured Cisco routers, F5s, LOMs, PDCs, NAS boxes, built servers from parts. And I'd never do it again.

That shit has literally no value, and it's quite difficult to configure, manage, and maintain. And no matter how good your process, you (or your team member) inevitably will forget to go into the BIOS and turn off power saving...or forget to turn off (or on) proxy arp. Or you'll reboot a box and it won't come back.

All costs can be negotiated with up-front commitments. We are small, but we pay $0.01/GB for egress bandwidth. And we got that price from Akamai, Fastly, and AWS by asking. I don't remember how much real colo places charge anymore, but it was a whole lot more than that.

Enterprise SSDs, redundant hardware, etc cost money. Does your startup qualify for a lease?

And really, why spend money on something that has no value to your business? Does having 5 data centers add any real value or competitive advantage? Does having that F5 expert on staff make your end-users happier?

If you answer yes, then go for it. But really, it's a total waste of time and money. You might as well make your own pencils and paper for the amount of value it delivers.

I do 2.9 PB a month for less than $1,000. And this is on rented servers, so it even includes the hardware. That would be around $30k at your rate. Not going to argue about the rest, but the cloud is charging horrendous rates for egress even with negotiated discounts.
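For the curious, the arithmetic behind that comparison (using decimal units; counting in binary TB/GB would land slightly above $30k):

```python
pb_per_month = 2.9
gb_per_month = pb_per_month * 1_000_000  # decimal: 1 PB = 1,000,000 GB
negotiated_rate = 0.01                   # the parent's $0.01/GB egress price

cloud_egress_bill = gb_per_month * negotiated_rate
print(round(cloud_egress_bill))  # 29000 -> ~$29k/month vs <$1,000 on rented servers
```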

Same here.

At massive scale, sure, it might make sense to run your own infrastructure and data centres. Some may even run their own private cloud and get the benefits we see from the public cloud today from their own IT depts.

The vast majority of us however had nothing in common with the above. Like you we ran the hardware, and to the business we were never more than a cost centre, and anything pitched to the broader business to "improve" outcomes for the development teams was seen as a cost without benefit.

I find a lot of people who are proponents for getting rid of the cloud never had to manage their own infrastructure, and have very rose tinted glasses for a reality that never existed for most.

I am more the other way. Even the big clouds are too “serverful” for me, and for very small things a modern PaaS is much easier to deal with.

Admittedly I've not worked in many environments that could ever be described as being anything less than 'enterprise', and within orders of magnitude larger than medium/large businesses.

If you had a niche market and only needed a small footprint, I can see a handful of VPSes or similar being suitable.

But once you get to the 200+ people mark, it gets messy. And gets much worse at 2,000 head. And worse yet at 20,000.

False dichotomy. The alternative to AWS is not building your own data center. It's renting bare-metal servers from Hetzner/DigitalOcean.

Zero hardware headaches at 5-10x lower cost.

This may be overkill.

But I would argue most projects don't even need a cloud. Just go to Server Hunter, find a server up to your specs, and save 90-95% of your cloud costs.

These days you can scale vertically or horizontally for a long time without moving to a cloud.

It's funny that that is exactly how disruption works. Use tons of VC money to flood a market with a cheaper product, become dominant, then jack up prices hoping people forgot the alternatives.

It doesn't have to be jacking up prices for exploitation, either; sometimes the costs were always there dictating the price, and eventually you have to foot the bill for them.

There's an alternative path where cloud companies actually start to compete on price. Right now you will pay gigantic premiums on your AWS infrastructure just by virtue of being in their ecosystem.

Personally, I see Kubernetes as having huge potential to drive down costs. The ecosystem is still young, but it's already possible to run a wide range of services and tools on top of it, and it can easily be run on-prem or using managed providers.

I personally am part of a team building a much cheaper k8s service, as I think costs have grown out of proportion and k8s offers the first really good shot at challenging AWS on cloud and costs.

Agree. I had to calculate the cost of running and upgrading an existing HPC infrastructure that was approaching its end of life vs the cloud. In-house HPC beat cloud cost in almost all scenarios.

From reading this launch post, I'm not convinced this is going to save much money.

The project automatically selects the cheapest cloud to run a job, and does it there - which sounds sensible. In reality though, these jobs presumably need large volumes of input data. If your input data is in cloud A, and you run a job in cloud B, typically any cost saving from running in cloud B will be more than offset by the egress cost to get the data out of cloud A.

This project is therefore only useful for scenarios where you need to do large amounts of compute on relatively small volumes of data. Is that really a common scenario?
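The trade-off described above can be put into rough numbers. A toy sketch, with illustrative prices only (egress around $0.09/GB and made-up hourly compute rates; real rates vary by provider and region):

```python
def cross_cloud_saves_money(data_gb, hours,
                            rate_a_per_hour, rate_b_per_hour,
                            egress_per_gb=0.09):
    """Is it cheaper to ship data out of cloud A and compute on cloud B?"""
    stay_cost = hours * rate_a_per_hour
    move_cost = hours * rate_b_per_hour + data_gb * egress_per_gb
    return move_cost < stay_cost

# 10 TB of input data wipes out a $1/hour compute discount on a 100-hour job:
print(cross_cloud_saves_money(10_000, 100, 3.0, 2.0))  # False ($900 egress > $100 saved)

# A compute-heavy job on a small dataset flips the answer:
print(cross_cloud_saves_money(50, 1000, 3.0, 2.0))     # True
```

The asymmetry is the whole argument: egress cost scales with data volume, compute savings scale with job duration, so cross-cloud scheduling only pays off when the ratio of compute to data is high.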

They have a project which addresses this concern as well: https://skyplane.org/en/latest/benchmark.html

I'm one of the creators of Skyplane. Skyplane can migrate large datasets between cloud regions at 10s of Gbps while compressing data to reduce egress fees. Happy to chime in!


Congrats on the launch! I had a similar idea once a few years back but failed to materialize it. You might want to consider other cloud providers like Sushi Cloud to get costs even lower. Happy to do an intro if it seems interesting.

Or to leverage cheaper compute/energy when it’s available. https://www.crusoecloud.com/features/

I'm one of the creators of SkyPilot. Thanks for the thoughtful questions and let me try to take a stab:

SkyPilot is not just for multi-cloud use. It's useful in all of these scenarios:

- using a single region of one cloud

- using multiple regions of one cloud

- using multiple clouds

Data transfer between zones/regions within a cloud is much cheaper than across clouds. We see many users falling in the "one cloud" category and they frequently read 10s of TBs of data across regions to do ML training.

Finally, saving money is one of several key problems we aim to solve, and there are quite a few ways to save other than lots-of-compute-on-small-data. Other reasons why you may want to use a system like SkyPilot include:

(1) improving resource availability (big pain point for GPUs/TPUs)

(2) using one interface and knowing that your jobs can migrate across regions or clouds

More rationale in the intro blog post: https://medium.com/@zongheng_yang/skypilot-ml-and-data-scien...

And isn’t the biggest issue with running potentially large jobs in the cloud the cut-off point where it’s cheaper to use your own hardware? After a few months or dozens of runs of your large model in the cloud, you may have passed the point where purchasing would have been cheaper.

Something that could look at your code, data, and budget and say "up to X runs, use cloud A; for more than Y runs it would be cheaper to buy/lease these GPUs", etc. would be interesting.
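The rent-vs-buy cut-off above reduces to a simple break-even calculation. A minimal sketch with illustrative numbers only (not real quotes), folding the running cost of owned hardware into an overhead rate:

```python
def break_even_hours(purchase_cost, cloud_rate_per_hour, overhead_per_hour=0.0):
    """Hours of use after which buying beats renting.

    overhead_per_hour folds in power, cooling, and hosting for owned hardware.
    """
    return purchase_cost / (cloud_rate_per_hour - overhead_per_hour)

# Illustrative: a $15k GPU vs. a $3/hour cloud instance, with $0.50/hour
# of power and hosting overhead for the owned card.
hours = break_even_hours(15_000, 3.0, 0.50)
print(f"{hours:.0f} hours (~{hours / 720:.1f} months of 24/7 use)")  # 6000 hours (~8.3 months)
```

Of course the real decision also weighs utilization (owned hardware idles between runs), hardware depreciation, and the engineering time to operate it, which is why the answer differs so much between a lab with steady workloads and a team with bursty ones.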

I think it’s common to train 100s of models on the same data for experiments. Then you would only need to copy the data once to each cloud's storage and run experiments as you wish.

Also, most cloud providers don’t charge for ingress, so you could move the data from something like R2 to a cloud as many times as you want.

+1. We've heard from some heavy users that Cloudflare R2 is saving them $$$ on egress costs: https://www.cloudflare.com/products/r2/

As outlined in the position paper (linked by another commenter), we believe such tailwinds are increasingly helping foster the "Sky" and making it much easier to move workloads between clouds.

There are several 'web3' projects in this space; vast.ai, already mentioned, is focused on renting out GPUs from servers hosted pretty much anywhere.

For low-cost server hosting and egress bandwidth, Threefold (https://www.threefold.io/build/) and Akash are two new startup web3 ecosystems looking to radically bring down the costs of cloud compute, storage, and bandwidth, with GPU support on both of their roadmaps. Still very early, but at 80-95% raw cloud cost savings, even beating out Hetzner/DO, it's an interesting niche of token-incentivized physical infrastructure networks (TIPIN for short).

We’re using SkyPilot and Skyplane at Berkeley and the Department of Defense to scale our satellite-imagery damage-assessment machine learning models [0] to run for all of Ukraine, all the time! We’re able to detect instances of war crimes rapidly and use them to help plan the reconstruction of Ukraine.

[0] https://xview2.org/

do you use SAR?

We do not currently use SAR! High-resolution SAR is currently very hard to find. Vendors like Capella Space are pretty pricey for HADR operations, though they do provide some data for deeply discounted/free prices for major disasters.

Beyond cost, SAR is just a deeply different form of imaging. The underlying physics that define a SAR image are so different from EO that many computer vision methods, in my opinion, will need to be re-invented to accommodate it. In fact, the next good bit of my Ph.D. is dedicated to accomplishing just that.

Urban scenes look very chaotic in SAR due to multi-path and layover effects. Defining damage in that environment is more aggregate than individual buildings. We're trying to figure out a good way to define what damage even means on SAR. In fact, one of the biggest accomplishments of our xView2 work was being the first comprehensive effort to define what damage means, for any natural disaster, from EO satellite imagery!

Thanks for the writeup! Agree on SAR being vastly different to interpret, and I can imagine the priciness. I also understand that the capacity/demand is another complicating factor now over Ukraine.

As you explained, SAR phenomenology is not straightforward at all (pun intended). If I could suggest an option to explore as a SAR-based damage proxy indicator, it would be interferometric coherence. This assumes, however, that you get your hands on InSAR data, both timely (to avoid decorrelation, which can happen over the span of a few days) and at good enough resolution to get relevant information at building level.

As you can imagine this is not easy nor cheap to get. I am unaware if Capella is InSAR-capable; COSMO-SkyMed and TerraSAR/TanDEM/PAZ are the usual suspects when it comes to SAR constellations with InSAR capability and good frozen orbits. ICEYE I believe is trying to get there, but things are evolving rapidly. Sentinel-1 is a wonderful mission with solid InSAR capability, but I am not sure C-band delivers the resolution you need. Maybe NISAR is worth keeping an eye on in the future?

Fascinating work that matters, good luck with it!

Sentinel-1 is the current workhorse for all sorts of InSAR things, but as you mentioned, the resolution is pretty meh for things like damage assessment. Beyond InSAR, people have been doing change detection on GRDs to get damage proxy maps, but that provides areal information rather than instance.

An opportunity in this space is to build truly complex-valued neural networks to fully exploit SLCs and other phase-based information.

Thanks for thinking well of the work! We've been scaling our impact with AI + SAR over the last year. Some work you may be interested in is our work on detecting dark vessels from SAR, which has led to the interdiction of illegal fishermen and human traffickers! [0][1]

[0] https://openreview.net/forum?id=PfyWdxM-S4N

[1] https://iuu.xview.us/

Previous HN post on UC Berkeley position paper on the future of cloud computing:


For those of you in US academia, check out the NSF Jetstream2 cloud [0], available through ACCESS grants [1]. They have various flavors of VMs, including regular, GPU, and large instances.

[0] https://jetstream-cloud.org/ [1] https://access-ci.org/

I remember when Cloudability first did this to manage AWS costs. They ended up getting scooped up by Apptio, but what they built was pretty awesome. It was especially useful for big teams who often spun up a large EC2 instance for testing, but forgot to take it out of service.

This is great for list price workloads. But once you have large usage (N+ MM/year) you’d likely get better savings by doing a multi-year usage contract.

Having worked on large compute intensive projects in both academia and industry I am somewhat sceptical about multi-cloud tools. Normally you do your compute where the data is since egress is so expensive. And it just doesn't seem worth developing against a cloud agnostic API and limiting myself to the lowest common denominator.

If you have a large workload though they may be helpful as a negotiating tactic to play one cloud off against another.

One big reason people use multiple regions or clouds: higher resource availability.

Allocating scarce resources can be very hard on the cloud. These are things like high-end GPUs (both on-demand and spot; the latter being much harder to get) and also beefy CPU-based instances. We've seen 10s of hours of waiting times or longer.

The natural solution is to have the flexibility to allocate resources in multiple regions (and ultimately, clouds) to increase the total pool size.
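That "bigger pool" idea amounts to a simple failover loop across regions. A minimal sketch; `provision()` here is a hypothetical stand-in for a cloud SDK call that raises when capacity is unavailable, not a real API:

```python
class CapacityError(Exception):
    """Raised when a region has no capacity for the requested instance."""

def provision_anywhere(provision, regions, instance_type):
    """Try each region in (e.g. price-sorted) order until one has capacity."""
    for region in regions:
        try:
            return region, provision(region, instance_type)
        except CapacityError:
            continue  # sold out here; widen the search to the next region
    raise CapacityError(f"no capacity for {instance_type} in any region")

# Stub provisioner for illustration: only eu-west-1 has free GPUs today.
def fake_provision(region, instance_type):
    if region != "eu-west-1":
        raise CapacityError(region)
    return f"{instance_type}@{region}"

region, handle = provision_anywhere(
    fake_provision, ["us-east-1", "us-west-2", "eu-west-1"], "a100-8x")
print(region, handle)  # eu-west-1 a100-8x@eu-west-1
```

The hard parts in practice are exactly what the thread raises: each extra region in the list is another place the input data may need to live, so the candidate set has to be chosen with replication costs in mind.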

I have seen very large jobs that needed to be split up across multiple regions but doing so required replicating large amounts of data. That comes at a cost too so it requires planning to work out which regions you want to use.

Once that is done it is fairly easy to set up spot clusters in each region to process the jobs in the queue.

It lost me when the specification was in YAML. I hate YAML for ML.

Then lucky you: just write your specs in JSON, since JSON is a subset of YAML.
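This works because every JSON document is also a valid YAML 1.2 document, so a YAML-consuming tool will accept a spec written in pure JSON. A small illustration using a hypothetical SkyPilot-style resource spec, parsed here with the stdlib `json` module:

```python
import json

# This text is simultaneously valid JSON and valid YAML 1.2, so it can be
# fed unchanged to tools that expect either format.
spec = """
{
  "resources": {"accelerators": "V100:4", "cloud": "aws"},
  "run": "python train.py"
}
"""

config = json.loads(spec)
print(config["resources"]["accelerators"])  # V100:4
```

(The caveat is YAML 1.1 parsers, which diverge from JSON in a few corner cases; modern YAML 1.2 parsers accept any JSON document.)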
