There is now a response to the support thread from Fly[1]:
> Hi Folks,
> Just wanted to provide some more details on what happened here, both with the thread and the host issue.
> The radio silence in this thread wasn’t intentional, and I’m sorry if it seemed that way. While we check the forum regularly, sometimes topics get missed. Unfortunately this thread slipped by us until today, when someone saw it and flagged it internally. If we’d seen it earlier, we’d have offered more details then.
> More on what happened: We had a single host in the syd region go down, hard, with multiple issues. In short, the host required a restart, then refused to come back online cleanly. Once back online, it refused to connect with our service discovery system. Ultimately it required a significant amount of manual work to recover.
> Apps running multiple instances would have seen the instance on this host go unreachable, but other instances would have remained up and new instances could be added. Single instance apps on this host were unreachable for the duration of the outage. We strongly recommend running multiple instances to mitigate the impact of single-host failures like this.
> The main status page (status.fly.io) is used for global and regional outages. For single host issues like this one we post alerts on the status tab in the dashboard (the emergency maintenance message @south-paw posted). This was an abnormally long single-host failure and we’re reassessing how these longer-lasting single-host outages are communicated.
> It sucks to feel ignored when you’re having issues, even when it’s not intentional. Sorry we didn’t catch this thread sooner.
For what it’s worth, I left Fly because of this crap. At first my Fly machine web app had intermittent connection issues to a new production PG machine. Then my PG machine died. Hard. I lost all data. A restart didn’t work - it could not recover. I restored an older backup over at RDS and couldn’t be happier I left.
I left digitalocean for fly because some of their tooling was excellent. I was pretty excited.
I’m back on digitalocean now. I’m not unhappy about it, they’re very solid. I don’t love some things about their services, but overall I’d highly recommend them to other developers.
I gave up on fly because I’d spontaneously be unable to automate deployments due to limited resources. Or I’d have previously happy deployments go missing with no automatic recovery. I didn’t realize this was happening to a number of my services until I started monitoring with 3rd party tools, and it became evident that I really couldn’t rely on them.
It’s a shame because I do like a lot of other things about them. Even for hobby work it didn’t seem worth the trouble. With digitalocean, everything “just works”. There’s no free tier, but the lower end of pricing means I can run several Go apps off of the same droplet for less than the price of a latte. It’s worth the sanity.
I adore DO. They’re seriously underrated. I love how they’ll just give you a server and say here, have at it. No abstractions, no fancy crap, just get out of my way and let me do my thing.
I'm using Digital Ocean App platform, which does pretty much everything for me. It's very simple to use. I can run my app as a single developer without caring about infrastructure for 99% of the time.
Part of what inspired me to give fly.io a shot was that I didn’t love the monorepo deployment story on the app platform. Fly doesn’t have a solution to that, but I suppose I felt less tied to DO at the time because I wasn’t totally content anyways. I’ve discovered since then that I was actually doing it wrong, so I’m way happier. I’m pretty big on monorepos so their whole system fits my workflow remarkably well now.
I’d like to figure out how to prevent deployments when my code doesn’t change in one app, but does in another. At the moment, pushing anything at all will trigger all apps to rebuild and deploy again. Not a huge deal and several orders of magnitude less painful than not being able to deploy at all, haha.
In addition to Supabase Auth the sibling mentions (which I played with very briefly) I've been using clerk.dev (no affiliation) and it's great. Depending on your definition of doing it yourself it could be just what you want. You have to set some things up, and you're not going to get things like the row-level permissions you get out of the box w/ Supabase, but if you're looking for a quick implementation where things like password reset etc. are handled for you, it might be a good fit.
I've been using Supabase for authentication/authorization in my recent side project.
The main app is node/express running on Digital Ocean and it connects directly to the Supabase hosted Postgres for most operations, but then uses the Supabase auth API for auth related stuff.
Saves a lot of time sending password reset emails etc and the entire project costs less than $5/mo in hosting costs.
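For anyone curious what that split looks like in practice, here's a rough sketch using supabase-js v2 inside an Express app; the env var names and redirect URL are placeholders and error handling is trimmed, so treat it as an illustration rather than the poster's actual code:

    // Minimal sketch: Express app that uses Supabase only for auth (supabase-js v2).
    // SUPABASE_URL / SUPABASE_ANON_KEY are placeholder env var names.
    import express from "express";
    import { createClient } from "@supabase/supabase-js";

    const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!);
    const app = express();
    app.use(express.json());

    // Sign a user up; Supabase stores the credentials and sends the confirmation email.
    app.post("/signup", async (req, res) => {
      const { data, error } = await supabase.auth.signUp({
        email: req.body.email,
        password: req.body.password,
      });
      if (error) return res.status(400).json({ error: error.message });
      res.json({ user: data.user });
    });

    // Kick off a password-reset email; Supabase handles sending it and the reset flow.
    app.post("/reset-password", async (req, res) => {
      const { error } = await supabase.auth.resetPasswordForEmail(req.body.email, {
        redirectTo: "https://example.com/reset", // placeholder URL
      });
      if (error) return res.status(400).json({ error: error.message });
      res.json({ ok: true });
    });

    app.listen(3000);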
I love their high-value content about DevOps; I have learned most of what I know in this field by tinkering with a VPS and following their great tutorials on how to set things up.
They filled the Slicehost vacuum nicely in this area. That's where I got my start in running my own servers about 15 years ago and the tutorials were the driving factor.
Seriously! They have an amazing article I followed one time to set up a k8s cluster to run any container I wanted with full automatic ssl provisioning/management and dns. Make a quick little yml file that includes what subdomain it wants to be and kubectl apply. The cluster was like $100 a month all-in and performed like a beast at huge traffic levels, and all I did was follow a tutorial.
I know that’s probably pretty easy for many, but I was pretty new to k8s and it felt like magic.
I wish I could say the same. My ISP and DO have absolutely terrible peering, unfortunately a lot of our internal stuff is hosted there. It’s always fun to git push/pull with 40kb/s on a gigabit connection.
When I’ve run into this in the past Cloudflare Warp has been a bit of a saviour. It’s a hassle free way to flick a switch and follow a different path over the network.
I went to DO's site due to your comment and I don't see anywhere where I can just get a server. Do you mean a VPS/Droplet? (I'm looking under Products and Solutions.)
The other commenter was correct - I meant a droplet. Should have been more explicit, apologies. But yeah if you're looking to learn how to work with backends, going through a droplet set up is by far the best way to get started IMO.
Historically I've used Vultr, but I don't see anyone talking about it. I'm curious if anyone else has thoughts on them? (I've been happy, but then again my usage has been exceedingly basic.)
I've used Vultr for several years (hobby projects) with no issues. My favorite feature is having a BGP session from my VM, which is unusual among cloud providers. I have an AS and am able to advertise my own IPs from multiple Vultr instances (anycast).
Have used both DO and Vultr for years. Put simply, DO is better, but Vultr isn’t terrible.
Higher number of outages at Vultr over 5 years, but none longer than a few hours. I can’t remember the last DO outage lasting more than a few minutes.
Experienced a Vultr routing problem that lasted several hours; they communicated about it, but it was still a long time to fix.
DO once did an auto-migration of a server to another cluster with an attendant outage that lasted a few minutes at most. No IP changes, completely transparent.
I love DO for projects where I don't need control. For my side project, I eventually migrated to AWS after running into a lot of issues with DO.
Things like they don't give you the postgres root user on their managed postgres. And I ran into issues trying to capture the deployments in code. Their terraform providers are pretty good, but still leave something to be desired. For all its many warts, I'm much happier back on AWS. It did end up more expensive, but it's worth it for the fine grained control in my case.
But I spent the last 5 years as a DevOps/SRE, so... uh... I'm picky.
That's interesting, because granular control is why I enjoy DO, although I'm thinking about it from the server perspective. They set up a machine, give me root access, and that's literally it. I set up my own ssh keys, firewalls, and there's no additional abstraction that I have to learn. I might just be reminiscing because right now I'm on a team where we're writing terraform/helm/k8s in GCP and it makes me want to cry myself to sleep each night lol.
Those are good things to know. I’ve been wondering about their managed databases recently, so I’ll keep that in mind.
I’m nowhere near as picky as you are, but maybe I’ll need to be at some point. As it is I mostly just build stuff and send it to the internet. If it builds and it does what I expected, I’m pretty happy! I don’t often need anything too special.
Same! I've had my first server there for 10 years now. They added a lot of stuff in the meantime, they have AWS-like things you can do. But in terms of launching a VM that just works, they are a great choice.
I agree. I can either abstract with the app platform or kubernetes, or I can go straight into the box myself and do whatever needs doing. It has been a real pleasure.
I think fly’s tooling feels better than doctl, but the infrastructure is incomparable at the end of the day. doctl has improved over time too, and with added pressure from newcomers I don’t doubt that it’ll continue to improve.
I find myself going to DO docs on various setup things even when I'm not using said thing on DO (although I'm also a DO customer, and love them for the reasons you've stated).
I've been with them for a long time and my guesses would be:
1. Strict rules and strict customer verification. Crypto mining that wastes SSDs is not allowed. Portscans, mass emails, etc. are not allowed. They also don't offer GPUs to the general public because it has been abused in the past. You usually need to send in ID documents just to open an account. My guess is this allows them to avoid most bad actors and, thereby, waste less money on fraud.
2. Extremely long-term investments. They typically build their own hardware and then use it over 10 years. They have their own flea market where you can rent older server models for a steep discount. That means they will have a long time where the hardware is fully paid off and still generating revenue.
3. Great service. With a mid-sized company, I can call their technicians in the middle of the night. The fact that we could call them in case of a crisis has generated A LOT of good will. But I would be truly surprised if they didn't make a profit off those phone calls, as they charge roughly 4x the salary cost.
4. High-margin managed services. In addition to just the cheap servers, they also offer a managed service where they will do OS and security upgrades for you. It's roughly 2x the price of the server and it appears to be almost fully automated. I know some freelance web designers who will insist on using Hetzner Managed for deployment for their clients, because it is just so convenient. You effectively pass off all recurring maintenance for €300 a month and your client is happy to have an emergency phone number (see #3) in case the box goes down.
They run their own data centres and have for a while. There is a pretty big industry for that sort of thing as an alternative to “the cloud” here in Europe.
We used to use nianet to house our hardware in Denmark. Basically these companies do hardware renting, and they also do hardware renting with more steps, where you rent rack space but own the hardware. They provide the place for the hardware and they have multiple locations, so you get both backup and redundancy. And while it doesn’t scale globally, in 20 years I’ve literally never worked on anything that needed to, beyond having some buffer caches for clients logging in on their vacations or something like that.
What Hetzner seems to be doing with the DO-style hosting, and this is just a guess, is that they are one of the many EU companies preparing for the big EU exodus from the non-EU cloud. Which is frankly a solid bet these days, when both AWS and Azure are increasing prices and becoming more and more unusable because of EU legislation. Part of this is privacy, where Microsoft and Amazon are actually great in terms of compliance, but part of it is also national security. I work in an investment bank that builds solar plants, and since finance and energy are both critical sectors we risk being told that half of the finance/energy companies in the world can’t use Microsoft, because the EU sees it as a single point of failure if our entire energy sector relies on Azure. Which is sort of reasonable, right? But what this means for us is that we can’t accept vendor lock-in, not really, because we need to have up-to-date exit strategies for how we plan on being fully operational a month after leaving Azure. Which is easy when you just containerise everything and run it in VMs or similar, and really annoying if you go all in on things like AKS. Which doesn’t help our Azure costs.
Anyway, right now we are planning on leaving Azure because of cost. Not today, not next week, but sometime in the next 5-10 years, and a lot of these EU cloud alternatives that actually operate the hardware instead of renting it are likely going to be a very realistic alternative. And that is the private sector; I spend time in the EU public sector, which represents a massive amount of money, and I’m guessing it’ll leave both AWS and Azure by 2050. Some of these EU cloud initiatives are going to explode when that happens, and right now, Hetzner is one of the best bets.
To get back to your question, DO rents server space. I have no idea where they’d rent it in Germany but they could potentially be renting it from Hetzner.
Couldn't agree more, I think Hetzner is probably Europe's best bet on a hyperscaler. One of the more telling indicators IMO is their growing market share outside of the EU/DACH.
To add on to the comments about Hetzner building their own custom hardware, they also custom built their own software stack. They rejected the hype that was OpenStack and worked diligently on their own hypervisor platform (that they are incredibly secretive about) and that appears to be paying off in spades for them. Most sovereign cloud plays end up being suffocated by the complexity, and incoherence, of the OpenStack ecosystem. It just becomes impossible to ship.
Could you please elaborate on how and what you know about managed Kubernetes on Hetzner?
I have been asking about this for a while and was told there is no way Hetzner would offer such a service. Certain posts on social media have also never been answered with any indication that they are actually working on it.
They were in person recruiting at KubeCon EU this year and were advertising a good number of Kubernetes engineering roles. Definitely gave me the impression they were taking Kubernetes seriously but looking back a managed offering was just speculation on my part.
So huge grain of salt, you are totally right. It could be internal platform work only.
I think you might misunderstand me. The 2050 figure is a guesstimate and it's just my opinion on the matter. As far as planning ahead goes, you plan for 5-10 years when you try to figure out where to put the "iron" for your enterprise IT. This is because that's how long your hardware will last if you go the route of renting rack space with your own hardware. I think we tend to plan for 8 years, with some allowance for "unintended" early failures on things like controllers after 4 years. So while you can contract big-cloud vendors for shorter (I think ours is on 3 year contracts right now), you still sort of do the business case for much longer. Maybe not every 3 years, but at least every 6 years.
You do the same on the other side of the table. Companies like Hetzner know that EU cloud solutions are likely to see growth, so it's only natural that they invest in the tech to put themselves in a prime position to jump on the opportunity. Selling a good product while you do so is the way I would do it personally, but you also have EU cloud initiatives backed by VC money going straight for the endgame.
I think the multi-national energy sector should be working toward these goals even without the regulations. The more prep done before the change, the smoother the transition.
Hetzner also do some crazy-cool stuff, especially around the 7950X3D, cooling, AM5 etc. (https://www.youtube.com/watch?v=V2P8mjWRqpk). They also do some amazing stuff with ARM (their cloud offering is really solid for this).
Not to mention a German company that has price sensitivity in their DNA. Their first servers were just regular consumer tower PCs to drastically cut hardware costs. Now many years later it's a highly optimized mix of consumer, server and inhouse parts (e.g. they use their own racking system instead of 19", and the datacenters are built to make use of convection for a lot of the cooling). They also offer regular Dell servers for those that want them, but at 2x-4x the price of their homegrown boxes.
My partner and I paid a visit to their datacenter in Nuremberg. The answer is efficiency: they get more processing power than the other providers for the energy they have to put in.
Pardon my ignorance, but I cannot quite see how cooling individual machines vs. the whole rack or row makes a difference in total heat production per machine.
Simple: Hetzner mainly operates in Germany, the people are mostly Germans, and they automate the stuff to the point where a small team could manage it well even if not remotely, so they spend less on human resources.
> Simple: Hetzner mainly operates in Germany, the people are mostly Germans, and they automate the stuff to the point where a small team could manage it well even if not remotely, so they spend less on human resources.
I feel like there might be more to it, especially considering the situation with electricity prices in some places in EU recently.
I used (and still use) a Lithuanian platform called Time4VPS which was cheaper than Hetzner previously, yet had to increase their prices somewhat for that reason. Now only some of their plans are competitive with Hetzner, while Hetzner also provides some managed services as well.
And yet, I can't help but wonder why they don't give in to the desire to maximize profit margins, like happened to, say, Scaleway (good platform, but as expensive as DigitalOcean).
Only complaint with Hetzner is they don't have some kind of OAuth setup for machines or scoped API tokens, just read/write. I'd like to use the former for doing Vault authentication from instances, and the latter for writing a dynamic Vault secret provider.
They require a passport or some other ID on registration, which is unusual compared to others. I was not happy with that part, but I have been a happy customer since (almost a decade now).
As far as I know, they do not require any ID when canceling the service.
Well, I would appreciate that, since I was the victim of Russian hackers who had access to all my servers and stuff on Hetzner; they even changed the passwords and email in Robot, but I restored everything...
Yes and (unfortunately) no. Terraform providers are here [1] with the official documentation at [2]. Managed databases are not available, though. I think they have some sort of database offering if you select their web hosting options, but you can't just get a managed Postgres instance yourself.
EDIT: For what it's worth, I have had good experiences with app servers hosted on Hetzner Cloud and managed Postgres provided by ElephantSQL (https://www.elephantsql.com/) for Germany-based apps.
DO actually does have a free tier! If you use their “app platform” (their equivalent to fly/heroku/render/etc) you can host 3 “static” apps for free. So if you have a Hugo/Jekyll blog or something, it’ll set up a whole little CD system for it for free.
You’re totally right. I kind of forgot about this, in part because I’m over their free limit. I think their static sites are still dirt cheap once you hit that limit, though. I find their pricing totally reasonable for what I need.
I want to like Fly, but reliability is one of those things where I feel like every time I investigate moving workloads over, I'm disappointed by these stories over and over again.
Fly has been in my “try later” book since a year or two ago. I remember it was hard to deploy anything due to downtime, so I gave up. Sad that stuff like this still happens.
You shouldn’t need to multi region a postgres yourself - they should have at least 2 data centre redundancy for the region and it just works.
Hope they get some magic sauce to become better at this.
> Hope they get some magic sauce to become better at this.
When I saw them describe their multiregion SQL replication architecture I thought "what crazy person thought this wouldn't eventually open up a spider's nest of distributed systems errors?"
CockroachDB does this, but that's the result of over 10 years of heads down hard-ass engineering and it's still slower than Postgres because distributed sync is not free. That means you have to provision it properly and with enough resources.
Their license would require a company like fly.io to pay them though, so I'm sure this resulted in fly.io instead trying to whip up an improvised infrastructure on the back of stock Postgres. I bet this cost them a whole lot more than paying CockroachDB would have, but devs have been conditioned that you should never ever pay for software even if it's the result of tons of deep engineering and solves massive brutal problems for you. I also bet there's some not-invented-here ego involved.
P.S. I don't work for CDB but I would absolutely consider them and we may end up using them at some point. They let you do a ton for free. They only charge for stuff you need if you get really really huge or if you are running a SaaS reselling DB services like fly.io would have been doing.
Our multiregion SQL replication architecture is the standard Postgres multiregion replication architecture. We do single-write-leader, multiple reader replicas, like everybody else does.
This is not standard. I see now that it is legacy, but I think it still demonstrates a bit of poor judgement. I believe it was before you were at fly, tptacek
He may have been talking about Fly themselves. Certainly having only a single machine to serve a wealthy metropolis of 8 million people seems like amateur hour.
> machine to serve a wealthy metropolis of 8 million
It's actually the only region to serve the entire AU and NZ population with any reasonable latency. (Ok, Singapore can do in a pinch for at least sub 200ms.)
Fly sounds like they need some Conway's Law: a frontend team that designs the nice API and works on developer affordances, and a backend team that keeps it running and reliable.
> While we check the forum regularly, sometimes topics get missed. Unfortunately this thread slipped by us until today, when someone saw it and flagged it internally.
If it really got missed, then I don't understand how the thread was made private to only logged-in users?
It looks like all 166 threads with the "App not working" tag are invisible when not logged in. So I'm guessing somebody applied that tag retroactively.
I never thought to make friends with people whose only thing in common with me is that they shop at the same place. Companies creating a "community" is exactly as you described.
I am an interested party in the process space, and I think that's ungenerous. When you work with a complex tool every day, and you have to find solutions for this or that issue, develop strategies for this or that business case, etc., you're not really shopping - it's more like you're in the trenches. At that point, finding people who have the same issues and talking shop with them can be great for both knowledge exchange and camaraderie. Linux wouldn't be what it is today without the LUGs era, for example.
We're talking about private companies running forum software instead of providing support. We're not talking about the power of IRC or mailing list communities for open source projects and the like.
If I pay for something I want the person I pay money to help me fix problems I get.
Ok as long as we’re getting conspiratorial, something similar I observed has bugged me.
About a year ago fly awarded a few people in the forums, I think it was 3, the “aeronaut” badge. Basically just pointless bling for a “routinely very helpful” person or somesuch. Still, I can imagine it was cool to get it. No, it wasn’t me.
One person I saw with it absolutely deserved it: this person is, to this day, always hopping in and helping people; linking to docs; raising their own issues with a big dose of “fellow builder” understanding and empathy; that sort of person. My own queries typically led me to a thread that this person had answered. In short - the kind of helpful, proactive, high-knowledge volunteer early adopter that every community needs - and a handful are blessed to find.
Then one day I saw this same person had offered — to one random newbie with build problems in one of the many HALP threads — a reply like, “maybe Fly isn’t the best option for you. here are some other places that can host an app”.
The thread was left alone and faded, like many when a lost newbie is involved. But 1 day later, I noticed this tireless early adopter no longer had their “aeronaut” badge.
I still refuse to believe my own eyes about something that petty.
Get out of here with this nonsense. We tell people when we’re a bad option all the time. Do you really think we have a desire (or time) to punish somebody for doing the same?
> Do you really think we have a desire (or time) to punish somebody for doing the same?
idk man, there's these awfully convenient disappearing forum threads too. The benefit of the doubt is starting to expire.
I see you're a co-founder, so presumably you have some sway on priorities and skin in the game. I think you should take the reputational damage you're accruing here much more seriously than you apparently are. A few more incidents like this and it won't just be you telling people you're a bad option.
* edited to tone down the forum thread disappearance angle. FWIW I do believe that it likely wasn't deliberate. My main point was that these things add up and "of course we wouldn't do that!" starts to ring a little hollow the 10th time you hear it...
> you've just been caught hiding inconvenient forum threads too
FWIW, I do believe them when they say this wasn't intentional. Considering how the Internet operates, they would be incredibly stupid to do something like that on purpose.
That being said, the way the entire affair was handled certainly leaves a lot to be desired.
I actually believe them on that too, FWIW. This time. It's just too dumb. I hope, for their sake, it's the truth.
I was really just trying to point out that this kind of good faith benefit-of-the-doubt has a limit, and fear of reaching that limit should be keeping people at fly up at night a lot more than it apparently is. I don't know how many colossal public fuckups a company can endure before its reputation is permanently ruined, but it's definitely not infinite.
Why is anyone on HN "dunking" on Fly.IO of all companies?
Michael - Don't take the bait.
As someone who has zero affiliation with Fly.IO other than a few PRs to their OSS (I don't even know Michael), I greatly appreciate the contributions they have given back to the community.
There are a lot of great hosting companies. Fly.IO stands out due to their revolutionary architecture and contributions back to the OSS community. I wish more companies operated like this.
It's understandable some are upset about an outage. But Fly is doing really interesting and game-changing things, not copying a traditional vmware, cpanel or k8s route.
Just as a reminder to what this company has offered back to everyone.
Their architecture is beautiful and revolutionary. They're probably the first or second ones to find a lot of the new edge cases as they grow.
It's a lot harder to be the first one over the wall than it is to copy. They've literally given the average developer a blueprint to build scalable businesses that compete with their own.
Should losing a single host machine be a big deal nowadays? Instance failure is a fact of life.
Even if customers are only running one instance, I would expect the whole thing to rebalance in an automated way especially with fly.io being so container centric.
It also sounds like this is some managed Postgres service rather than users running only one instance of their container, so it’s even more reasonable to expect resilience to host failure?
Fly postgres is not managed postgres, it's CLI sugar over a normal fly app, which the [docs](https://fly.io/docs/postgres/) make quite clear. Their docs also make clear that if you run postgres in a single-instance configuration and the hardware it's running on has problems, your database will go down.
I believe the underlying reason that precludes failing over to a different host machine is that fly volumes are slices of host-attached NVMe drives. If the host goes down, these can't be migrated. I _think_ instances without attached volumes will fail over to a different host.
Of course, that's not ideal, and maybe their CLI should also warn about this loudly when creating the cluster.
If you lose a single instance on RDS and you don't have replication set up, you'll also have downtime. (Maybe not with Aurora?)
And +1 to the sibling comment; Fly makes it very clear that single instance postgres isn't HA, and talks about what you need to do architecturally to maintain uptime.
Downtime, but limited downtime, since the data is stored redundantly across multiple machines in the same AZ. So unless the AZ goes down (which is a different failure than what happened here) you can restart the DB on a different instance pretty quickly, and I'm guessing AWS will do it automatically for you.
edit: Removed "triple" as I'm not certain about the level of redundancy.
May not be 3x but it is replicated so even a total instance failure would not make you lose data:
>Amazon EBS volumes are designed to be highly available, reliable, and durable. At no additional charge to you, Amazon EBS volume data is replicated across multiple servers in an Availability Zone to prevent the loss of data from the failure of any single component. For more details, see the Amazon EBS Service Level Agreement.
If a read replica fails, I'd expect no downtime (possibly a few errors as connections get cut off abruptly). Although there's always the risk that the remaining instances aren't able to handle the additional load.
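For what it's worth, those abrupt connection cuts are usually absorbed client-side with a small retry. A rough, provider-agnostic sketch with node-postgres, where the env var name and retry count are purely illustrative:

    // Retry a read query a couple of times if the connection is cut mid-flight,
    // e.g. when a read replica goes away and the pool hands out a dead socket.
    import { Pool } from "pg";

    const readPool = new Pool({ connectionString: process.env.READ_DATABASE_URL }); // placeholder env var

    async function readWithRetry(sql: string, params: unknown[], attempts = 3) {
      let lastErr: unknown;
      for (let i = 0; i < attempts; i++) {
        try {
          return await readPool.query(sql, params);
        } catch (err) {
          lastErr = err; // connection reset / terminated errors land here; loop retries
        }
      }
      throw lastErr;
    }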
> Should losing a single host machine be a big deal nowadays? Instance failure is a fact of life.
Depends on where in your development cycle you are. If you just got started and haven't even figured out what you're actually building (prototyping), you shouldn't really use a hosting provider that randomly loses instances.
If, on the other hand, you have done everything to improve your application's performance, had to resolve throughput issues with a distributed architecture, and are now running 10+ instances, then losing one host shouldn't impact you too much. But you really shouldn't start this way; it's doing web services the hard way and introduces a lot of complexity you shouldn't want to deal with when you're still trying to find product-market fit.
I was confused why support for platform failure relies on a forum that employees may or may not check. After checking the docs[1], apparently you have to be on a paid plan (at least $29/mo) to access email support, so you may not have it even if you’re paying for resources.
I won’t be using it for side projects where I’m okay with paying $5-10/mo but don’t want to have three day outages.
Forewarning: I am not being critical of fly.io nor their free support whatsoever when I say this.
Could they have "been better" from a technical perspective? I see their name a lot on HN, so I know they are doing really cool + advanced things, and this is probably some super small edge case that slipped through the cracks.
Could they have added some message? Do we as the HN community feel they needed to say something like "we're gonna add some extra logging/monitoring going forward so it won't happen again"?
By all means, they probably don't owe anybody anything in terms of stability + uptime guarantees when it comes to a free tier. Sh*t happens.
FWIW: I am on the bottom tier of the paid plans ($29/mo) so I could get access to the email support, and even with that their response time is still not great.
I have an ongoing issue with one of my PG clusters where one of the nodes has been failing, and all my attempts at fixing it have failed (mainly cloning one of the other machines to bring the cluster numbers back to normal).
I emailed my account’s support email mid Friday morning last week and did not hear back until this past Monday night.
Sucks, because like a lot of others in this thread I like what Fly is trying to do and am rooting for them, but IMO they should use a significant chunk of that funding they just received on hiring a ton of SREs and front line customer support.
EDIT: I should add, the past times I have emailed them the response time was good. It's just that this most recent time was so egregious (3 days to get even that initial response!) that I bring it up.
They may not owe anyone anything but over time these types of issues can cause a large reputation hit.
If I was just searching online or trying to find out what various communities think about Fly.io and see several threads about major outages with poor communications, do you think I will use their services? It would be an immediate pass.
It takes a long time to build a reputation, and you can lose it instantly.
> I would hope that after a couple of hours downtime, they'd bring up a fresh machine with Ansible or whatever.
It is not just about a fresh machine which hopefully sits in each datacenter. I can imagine they needed the clone of the system due to the design of the fly.io service and that's where the "fun" begins.
It's often used as an escalation point when people can't get support from certain companies (most notably, Google). If an employee lurks in here and sees your post, they might contact the right people to fix your issue.
Smaller companies also do a lot of PR damage control and constantly monitor HN for threads complaining about their services.
Why is it my responsibility to move instances from machine to machine to mitigate a cloud host's outages? What is their utility if not performing the bare minimum of cloud host responsibilities keeping my container up?
Fly have tried to hush this by making the thread [1] private to anyone not logged in.
One quote from thread:
> This is the second time I’ve had this kind of issue with Fly, where my service just goes down, Fly reports everything healthy, and there’s literally no information and nothing I can really do other than wait and hope it comes back up sometime
Another user:
> We had four machines (app + Postgres for staging and production) running yesterday, and three of the four (including both databases) are still down and can’t be accessed. I can replicate the issues others have mentioned here.
> This is our company’s external API app and so the issue broke all of our integrations.
> Our team ended up setting up a new project in fly to spin up an instance to keep us going which took a couple of hours (backfilling environment variables and configuration etc, not a bad test of our DR ability).
> There is no way I can find to get the data from the db machines. Thank goodness this isn’t our main production db and we were able to reverse engineer what we needed into there.
> Very keen to hear what’s happening with this and why after so many hours there’s no more info or updates.
Another user:
> As an aside, it’s kind of a kick in the teeth to see the status page for our organization reporting no incidents - the same page that lists our apps as under maintenance and inaccessible!
Another user:
> I’m feeling very lucky that none of our paid production apps or databases are affected currently (only our development environment is), but also really surprised that the issue has been ongoing for 17 hours now with no status page update, no notifications (beyond betterstack letting us know it was down) and one note on the app with not much info as to whats going on.
> It really worries me what would happen if it was one of our paid production instances that was affected - the data we’re working with can’t simply be ‘recovered’ later, it’d just get dropped until service resumed or we migrated to another region to get things running again
> Keen to know whats wrong and whats being done about it
The worst thing about Fly is that when something goes wrong, it's not just one thing; there's a bunch of things broken at the same time, and their status page will show everything green.
Their typical response is either silence or something casual ("oh, this is what happens when we deploy on a Friday"). The product looks amazing, but it's just a nice package around the most unreliable hosting service I've ever used.
You can't just keep breaking people's work once a week, make them spend their weekend nights trying to bring their stuff back, and give these "we could have done better" answers. That's an excuse for exceptions, not patterns.
Was that an attempt to discredit criticism of Fly's operational processes by pointing out that another company also has issues in how they handle outage notifications?
Was the sarcasm an attempt to discredit criticism of Fly's operational processes by pointing out that another company also has issues in how they handle outage notifications?
You could have just answered my previous question with "No, I am not familiar with sarcasm".
Because you clearly don't understand sarcasm, I'll be blunt:
No, I'm not trying to discredit any criticism of this provider. I agree with the comment I replied to, that this kind of failure mode is fucking ridiculous. My response thus is not an attempt to normalise this, but to highlight the elephant in the room, which is that AWS - the gold standard for "hosting" services for many a startup and techbro - *also* has Rube Goldberg like levels of interdependence that cause cascading failures *every time* something goes wrong, and *also* have a status board so confidently green that it may as well be an ad for lawn care products.
Thanks for clarifying. I understand now you wanted to call attention to the fact that another famous organization in the same space as Fly.io also has such bad practices. Thanks for the data point.
There's a lot of bullshit in this HN thread, but here's the important takeaway:
- it seems their staff were working on the issue before customers noticed it.
- once paid support was emailed, it took many hours for them to respond.
- it took about 20 hours for an update from them on the downed host.
- they weren't updating their users that were affected about the downed host or ways to recover.
- the status page was bullshit - just said everything was green even though they told customers in their own dashboard they had emergency maintenance going on.
I get that due to the nature of their plans and architecture, downtime like this is guaranteed and normal. But communication this poor is going to lose you customers. Be like other providers, who spam me with emails whenever a host I'm on even feels ticklish. Then at least I can go do something for my own apps immediately.
Not a great summary from my perspective. Here's what I got out of it:
- Their free tier support depended on noticing message board activity and they didn't.
- Those experiencing outages were seeing the result of deploying in a non-HA configuration. Opinions differ as to whether they were properly aware that they were in that state.
- They had an unusually long outage for one particular server.
- Those points combined resulted in many people experiencing an unexplained prolonged outage.
- Their dashboard shows only regional and service outages, not individual servers being down. People did not realize this and so assumed it was a lie.
- Some silliness with Discourse tags caused people to think they were trying to hide the problems.
In short, bad luck, some bad procedures from a customer management POV, possibly some bad documentation resulted in a lot of smoke but not a lot of fire.
You get to a certain number of servers and the probability on any one day that some server somewhere is going to hiccup and bounce gets pretty high. That's what happened here: a single host in Sydney, one of many, had a problem.
When we have an incident with a single host, we update a notification channel for people with instances on that host. They are a tiny sliver of all our users, but of course that's cold comfort for them; they're experiencing an outage! That's what happened here: we did the single-host notification thing for users with apps on that Sydney host.
Normally, when we have a single-host incident, the host is back online pretty quickly. Minutes, maybe double-digit minutes if something gnarly happened. About once every 18 months or so, something worse than gnarly happens to a server (they're computers, we're not magic, all the bad things that happen to computers happen to us too). That's what happened here: we had an extended single-host outage, one that lasted over 12 hours.
(Specifically, if you're interested: somehow a containerd boltdb on that host got corrupted, so when the machine bounced, containerd refused to come back online. We use containerd as a cache for OCI container images backing flyd; if containerd goes down, no new machines can start on the host. It took a member of our team, also a containerd maintainer, several hours to do battlefield surgery on that boltdb to bring the host back up.)
Now, as you can see from the fact that we were at the top of HN all night, there is a difference between a 5 minute single-host incident and a 12-hour single-host outage. Our runbook for single-host problems is tuned for the former. 12-hour single-host outages are pretty rare, and we probably want to put them on the global status page (I'm choosing my words carefully because we have an infra team and infra management and I'm not on it, and I don't want to speak for them or, worse, make commitments for them; all I can say is I get where people are coming from with this one).
Why are your customers exposed to this? This sounds like a tough problem that I'm sympathetic to for you personally, but it sounds like there's no failover or appropriate redundancy in place to rollover to while you work to fix the problem.
edit: I hope this comment doesn't sound accusatory. At the end of the day I want everyone to succeed. I hope there's a silver lining to this in the post-mortem.
The way to not be exposed to this is to run an HA configuration with more than one instance.
If you're running an app on Fly.io without local durable storage, then it's easy to fail over to another server. But durable storage on Fly.io is attached NVMe storage.
By far the most common way people use durable storage on Fly.io is with Postgres databases. If you're doing that on Fly.io, we automatically manage failover at the application layer: you run multiple instances, they configure themselves in a single-writer multi-reader cluster, and if the leader fails, a replica takes over.
We will let you run a single-instance Postgres "cluster", and people definitely do that. The downside to that configuration is, if the host you're on blows up, your availability can take a hit. That's just how the platform works.
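For readers unfamiliar with the pattern, here's a rough, platform-agnostic sketch of what single-writer/multi-reader looks like from the application side. The connection string env vars are placeholders, and how you actually address the leader vs. a replica (and how failover is performed) depends on the platform:

    // Generic single-writer / multi-reader usage with node-postgres.
    // PRIMARY_DATABASE_URL / REPLICA_DATABASE_URL are placeholder names.
    import { Pool } from "pg";

    const primary = new Pool({ connectionString: process.env.PRIMARY_DATABASE_URL });
    const replica = new Pool({ connectionString: process.env.REPLICA_DATABASE_URL });

    // Writes always go to the leader; a managed failover promotes a replica
    // if the leader dies (the mechanics vary by platform).
    export function createUser(email: string) {
      return primary.query("INSERT INTO users (email) VALUES ($1) RETURNING id", [email]);
    }

    // Reads can be served by any replica; be prepared for slight replication lag.
    export function listUsers() {
      return replica.query("SELECT id, email FROM users ORDER BY id");
    }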
I see. Have you considered eliminating this configuration from your offering? The terminology could confuse people; they may be assuming that a host isn't really what it is (a single host). This kind of thing is difficult for those seeking to build managed services, because I think people expect you to provide offerings that can't harm them when the cause is related to the service they're paying for, and it's difficult to figure out which sharp objects they understand and which ones they don't. People should know better, but if they did, would they need you?
If this sounds ludicrous, then I think I probably don't understand who Fly.io wants to be and that's okay. If I don't understand, however, you may want to take a look at your image and messaging to potentially recalibrate what kind of customers you're attracting.
Plenty of people would rather take downtime than pay for redundancy, for example for a test database.
AWS RDS lets you spin up an RDS instance that costs 3x less and regularly has downtime (the 'single-az' one), quite similar to this.
Anyone who's used servers before knows "A single instance" is the same as "sometimes you might have downtime".
Computers aren't magic, everyone from heroku (you must have multiple dynos to be high availability) to ec2 (multiple instances across AZs) agree on "a single machine is not redundant".
I don't see how fly's messaging is out of line with that. They don't tell you anywhere "Our apps and machines are literally magic and will never fail".
Sure, but isn't this more about risk tolerance at this point and how much your customers care? The responsibility should be on the customer's end. Running on EBS/RDS doesn't guarantee you won't lose data. If you care about it, you enable backups and test recovery.
Just because some customers are less fault-tolerant than others doesn't mean we shouldn't offer those options to people who don't have the same requirements or are willing to work around them.
Unless something has changed and I'm out of date, I think a piece of context here is that fly postgres isn't really a managed service offering. From what I've seen fly does try to message this, but I think it's still easy for some subset of customers to miss that they're deploying an OSS component (and maybe they deployed a non-HA setup and forgot), and it's not the same as buying a database as a service.
So hopefully as fly.io gets more popular, there will be some compelling managed offerings. I saw comments at one point from the Neon CEO about a fly.io offering, but not sure if that went anywhere. I'm sure customers can also use Crunchy, or other offerings.
It seems to me like there's room for improving your customers' awareness around what is required for HA and how to tell when they are affected by a hardware issue. On the other hand, it may just be that the confusion is mostly amongst the casual onlookers, in which case you have my sympathies!
I'm not sure this makes much sense: customers who DON'T WANT to be aware of what is required for HA (say, solo devs) are exactly the ones choosing this kind of hosting. Even if you publish educational articles, I'm unsure they will be read. Putting some BANNER IN RED LETTERS into the CLI output plus a link to an article may work, though.
$ fly volumes create mydata
Warning! Individual volumes are pinned to individual hosts.
You should create two or more volumes per application.
You will have downtime if you only create one.
Learn more at https://fly.io/docs/reference/volumes/
? Do you still want to use the volumes feature? (y/N)
(and yes, the warning is already even in red letters too)
I agree, articles tend not to get read by those who need them most. A warning from the CLI and a banner on the app management page with a link to a detailed explanation would seem like a good approach.
edit: sibling post shows there is such a message on the CLI. The only other thing I can think of is an "Are you sure you want to do this?" prompt, but in the end you can't reach everybody.
> somehow a containerd boltdb on that host got corrupted, so when the machine bounced, containerd refused to come back online. We use containerd as a cache
Hey, even if I can feel sympathetic for the course of unfortunate events, it's hard to not to comment:
if you're using a cache, you should invalidate it on failure!
It's a read-through cache. This wasn't a cache invalidation issue. It's a systems-level state corruption problem that just happened to break a system used primarily as a cache.
What I meant is that if the compromised host was unable to use the broken boltdb cache, the cache should have been zeroed and repopulated. Would rebuilding the cache really have taken hours, versus the hours spent trying to fix the boltdb?
Btw, I am happy I have only small amounts of data in any of my bolt databases...
This isn't a boltdb we designed. It's just containerd. I am probably not doing the outage justice, because "blitz and repopulate" is a time-honored strategy here.
I'm surprised by your risk tolerance. If I had any cloud service at this level in my stack go down for three days, I'd start shopping for an alternative. This exceeds the level of acceptability for me for even non-HA requirements. After all, if I can't trust them for this, why would I ever consider giving them my HA business? Just based on napkin math for us, this could've been a potential loss of nearly half a million dollars. Up until this point, I've looked at Fly.io's approach to PR and their business as unconventional but endearing. Now I'm beginning to look at them as unserious. I'm sorry if that sounds harsh. It's the cold truth.
I think you're not exposed enough to the reality of hardware. There was no need for the host to come back online at all. I think it was a mistake for Fly.io to even attempt it. Just tell the customer the host was lost and offer them a new one (with a freshly zeroed volume attached). You rent a machine, it breaks, you get a new one.
If they're sad that they lost their data, it's their fault for running on a single host with no backup. By actually performing an (apparently) difficult recovery, they reinforced their customers' erroneous expectation that they are somehow responsible for the integrity of the data on any single host.
They're not responsible for extreme data recovery, but (almost?) all of the customer data volumes on that server were completely intact. They damn well should be responsible for getting that data back to their customers, whether or not they get the server going again.
If you run off a single drive, and the drive dies, any resulting data loss is your fault. But not if something else dies.
Directly attached storage in AWS is a special niche that disappears when you so much as hibernate. And even then they talk about how disk failure loses the data but power failure won't.
This is much closer to EBS breaking. It happens sometimes, but if the data is easily accessible then it shouldn't get tossed.
In hindsight I wish I could edit, because my above comment was pretty trigger-happy and overly focused on the amount of downtime. It was colored by some existing preconceptions I had about Fly, and I'm honestly surprised it continues to be upvoted. When I made this comment I hadn't yet learned some of the bits you mentioned here at the end from another thread. Anyway, I tend to agree overall. I actually suggested Fly even reconsider offering this configuration, given that they refer to it as a "single-node cluster", which is an oxymoron.
I would think so; it's honestly strange to think about. The idea of having the node come back after it broke is a bit ridiculous to me. A node breaks, you delete it from your interface and provision a new one; the idea of even waiting 5 minutes for it to come up is strange. This whole conversation seems detached from how the cloud is supposed to operate and has operated in the past decade.
You're saying a single server failure is going to to cost your business half a million dollars?
This was a server with local NVMe storage. The simplest thing to do would have been to just get rid of it, but we have quite a few free users with data they care about running on single node Postgres (because it's cheaper). It seemed like a better idea to recover this thing.
No, it wouldn't, at least not given the contextual details of this situation because we wouldn't do that. Honestly there are parts of my above comment that hold but I admit in the moment that it was a bit impulsive of me because I hadn't yet learned all of the details necessary to make that judgment call. That number is right under slightly different circumstances if you're asking, but it sounds like you were trying to prove a point. If that's true, you succeeded. I learned a bit later that what they were calling a cluster was a single server and that's just... yeah.
To clarify, we communicated this incident to the personalized status page [1] of all affected customers within 30 minutes of this single host going down, and resolved the incident on the status page once it was resolved ~47h later. Here's the timeline (UTC):
- 2023-07-17 16:19 - host goes down
- 2023-07-17 16:49 - issue posted to personalized status page
- 2023-07-19 15:00 - host is fixed
- 2023-07-19 15:17 - issue marked resolved on status page
Dude. I don't sit at home refreshing status pages. Send me an e-mail.
That's how other [useful] providers notify their customers that one of their hosts went down unexpectedly. Linode will send me 6 emails when they need to reboot something. Even Oracle sends me notices about network blips. I believe I've gotten one from AWS, but I also know sometimes their gear gets stuck in a bad state and I didn't get a notification, which was super annoying because it took forever to figure out it was AWS's faulty state.
The whole point of this HN thread is customers weren't getting regular updates. If they had they wouldn't be on a random community forum trying to get support's attention.
The bad news is that I'd be out of a job if I chose your service in this instance. 47 hours is two full days. For an entire cluster to be down for that long is just unacceptable. Rebuilding a cluster from the last-known-good backup should not take that long, unless there are PBs of data involved; dividing such large data stores into separate clusters/instances seems warranted. Solution archs should steer customers to multiple, smaller clusters (sharding) whenever possible. It is far better to have some customers impacted (or just some of your customer's customers) than have all impacted, in my not so humble opinion.
And, if the data size is smaller, you may want to trigger a full rebuild earlier in your DR workflows just as an insurance policy.
The good news is that only a single cluster was impacted. When the "big boys" go down, everything is impacted... but customers don't really care about that.
Not sure if this impacted customer had other instances that were working for them?
> The bad news is that I'd be out of a job if I chose your service in this instance. 47 hours is two full days.
There was one physical server down. That's it.
They even brought it back.
I've had AWS delete more instances, including all local NVMe store data, than I can count on my hands. Just in the last year.
Those instances didn't experience 47 hours downtime, they experienced infinite downtime, gone forever.
I guess by your standard I'd be fired for using AWS too.
But no, in reality, AWS deletes or migrates your instances all the time due to host hardware failure, and it's fine because if you know what you're doing, you have multiple instances across multiple AZs.
The same is true of fly. Sometimes underlying hardware fails (exactly like on AWS), and when that happens, you have to either have other copies of your app, or accept downtime.
I'll also add that the downtime is only 47 hours for you if you don't have the ability to spin up a new copy on a separate fly host or AZ in the meanwhile.
The core issue here is that fly doesn't offer distributed storage, only local disks.
Combine that with them having tooling for setting up Postgres built on top of single node storage, and you have the downtime problems and unhappy customers as a given.
When does AWS delete instances? Migrate, sure, and yes, local storage is supposed to be treated as disposable for that reason, but AFAIK only spot instances should be able to be destroyed outright.
> Rebuilding a cluster from the last-known-good backup should not take that long
It's not even clear if that's the right thing to do as a service provider.
Let's say you host a database on some database service, and the entire host is lost. I don't think you want the service provider to restore automatically from the last backup because it makes assumptions about what data loss you're tolerant to. If it just works from the last backup, suddenly you're potentially missing a day of transactions that you thought were there that magically disappears as opposed to knowing they disappeared from a hard break.
Restoring from backup doesn't mean you actually have to use it - just prepare it in case you need it. Since this can take time, starting such a restore early would be an insurance policy, if needed. If there are snapshots to apply after the last-known-good backup, all the better.
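As a concrete (and provider-specific) illustration of "prepare it, don't necessarily use it": on RDS you can kick off a point-in-time restore into a spare instance early in an incident and simply delete it if the primary recovers. A minimal boto3 sketch, with hypothetical instance names:

```python
# Hedged sketch: start a point-in-time restore as an insurance policy,
# without cutting traffic over to it. Instance identifiers are hypothetical.
import boto3

rds = boto3.client("rds", region_name="ap-southeast-2")

response = rds.restore_db_instance_to_point_in_time(
    SourceDBInstanceIdentifier="prod-db",           # hypothetical source
    TargetDBInstanceIdentifier="prod-db-dr-spare",  # hypothetical throwaway copy
    UseLatestRestorableTime=True,                   # latest backup plus archived logs
)
print(response["DBInstance"]["DBInstanceStatus"])   # e.g. "creating"
```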
Haha, imagine what the AWS status page would look like if they had to update their global status page anytime a single host would go down in any region.
Fly.io messed up, they didn't want to be a Heroku clone, but their marketing and their polished user experience design made it seem like they would be one anyway.
And as a reward now they have to deal with bottom of the barrel Heroku users that manage to do major damage to their brand whenever a single host goes down. Who would have predicted that corporate risk?
I've personally had this experience with Fly on a personal project. My project went down but their status pages said everything was up. It's fine since it's personal for fun project but for anything more serious I don't know if I'd be comfortable using them.
>> There's a lot of bullshit in this HN thread
Then consider replying directly to the post containing the wrong information instead of making such a generalised accusation.
>> I get that due to the nature of their plans and architecture, downtime like this is guaranteed and normal.
What other cloud providers have downtimes of 20 hours? There would have to be a lot of them to call this "guaranteed and normal".
Sadly, I've always felt a good amount of passive aggressiveness in many of the HN threads where fly.io is involved.
I really want to love Fly.io. It's super easy to get setup and use, but to be honest I don't think anyone should be building mission critical applications on their service. I ended up migrating everything over to AWS (which I reallllly didn't want to do) because:
* Frequent machines not working, random outages, builds not working
* Support wasn't responsive, didn't read my questions (kept asking same questions over and over again) -- I paid for a higher tier specifically for support.
* General lack of features (can't add sidecars, hard to integrate with external monitoring solutions)
* Lack of documentation -- For the happy path it's good, but for any edge cases the documentation is really lacking.
Anyway, for hobby projects it's fine and nice. I still host a lot of personal projects there. But I have to move my company's infrastructure off of it because it ended up costing us too much time/frustration, etc. I really had high hopes going into it as I had read it was a spiritual successor of sorts to Heroku, which was an amazing service in its day, but I don't think it's there yet.
My experience was the same. I stopped using it for hobby projects recently when I had two consecutive days of being unable to build anything. The same stuff that built the week before, built fine locally, then eventually built on fly again — just, inexplicable downtime with no word from support.
Their free tier is very generous. You can get a lot happening and stay under their billing threshold. But, I like to get stuff done. I have a family. I code in my spare time very rarely, and I need a service that’ll let me just build my goddamn project. This was a small static site built by Node, so nothing spectacular happening.
I do wish them the best though. They have an excellent product in their tooling, and if they could stabilize their infrastructure I’d love to try them again.
I actually use digitalocean and it’s pretty solid for static sites (they’re free, I think). It’s also convenient because that’s where pretty much all of my stuff lives these days. I used to put piles of stuff on GitHub pages though! I have some great memories of learning how awesome static sites could be, and how cool it was that they’d deploy just by pushing your repository. That seemed like magic back then.
Half the critical info for using their services is buried in some thread in the forum (posted by an employee). How bad is their documentation pipeline that they can't, with similar effort, get that same info into the documentation? Requests to put stuff in the docs go ignored.
The answer to _any_ usage related forum question should be:
1. It's in the documentation <here> (maybe I just added it)
2. If you're left with any confusion, let me know and I'll update the documentation to resolve it
The Fly dashboard reported everything was A-ok, but requests would time out. I had to manually dig into the fly logs to see that their proxy couldn't reach the server, and there was nothing I could do to fix it.
This went on for hours, until I made an issue on their forums. They never replied or gave any indication they read the thread, but it somehow magically got fixed not long after.
I really want them to succeed, but this utter lack of communication and helpless feeling of not being able to do anything has cured me of fly.io for now.
Render didn't support Docker images last I checked, and the worst part of Heroku and cloning it was not actually having a locally reproducible build image. I want to deploy what I've built locally, not hand my source over to some magical pipeline.
Different strokes. Personally I avoid Docker in favor of source-code deployment; the "magical pipeline" is usually just "git pull and then run a provided command". But Render does support Dockerfiles for e.g. installing a runtime like Deno that isn't provided out of the box.
I've been using them for the last ~10 months or so to run http://PhoenixOnRails.com. Gigalixir have been 100% reliable for me so far, but it's a low-traffic app - I can't tell you what it's like to run a big app on them at scale.
I don't know who owns them but I do get the impression it's a small team. Hasn't been an issue for me so far. Their customer service has been very helpful and responsive on the rare occasions I've needed to contact them
Scalingo is a good drop-in replacement for Heroku. They even use Heroku buildpacks. They've got good support and are an EU company with hosting in the EU (if that's important to you).
My experience has also been somewhat disappointing. I had a toy project that I decided to host elsewhere (Hetzner VM + Dokku), after the node for the PG database stopped working without any notification and didn't come back online (until I manually resurrected it).
Y'all, this is going to be deeply unsatisfying, but it's what I can report personally:
I have no earthly clue why this thread on our community site is unlisted.
We're looking at the admin UI for it right now, and there's like, a little lock next to the story, but the "unlist story" option is still there for us to click. The best I can say is: I'm reasonably sure there wasn't some top-down edict to hide this thread (the site is public, anybody can sign up for an account and see the thread).
Say what you want about us, but hiding out from stuff like this isn't one of our flaws. When I find out more about what happened with this thread, I'll let you know (or Kurt will reply here and tell me I'm wrong).
I don't know enough about what happened with this Sydney server to be helpful to people who had instances running on it. When I know more about it, I'll be helpful, but I'm just learning about this stuff right now, after getting back in from a night out.
Almost immediately afterwards:
It looks like... all the posts in the app-not-working category are "private"? Like it's some setting on the category itself? "Private" here means you need to have signed up for a Discourse account to see them?
Honest advice, probably to Kurt rather than you, is you need better processes, accountability and (probably) communication in your company. The tone of your reply (and other communications from fly.io) is reflective of the lack of those things given the public sentiment regarding fly.io. At 60+ employees and so many issues that tone goes from humanly endearing to indicative of a non-scaling business. Other replies indicate you don't want the things (process, oversight, etc.) that a growing B2B business needs to really succeed which is not a good sign. Sure there's a cost to that corporate-ness and you want to minimize that cost but it's also a necessary evil for the business you're in at the scale you're at.
If something breaks once it's an accident, if it breaks twice it's bad luck, but if it breaks down three times it's broken processes. Based on the comments here, things break at fly.io a lot more often than three times.
I'm just a person on Hacker News that happens to be at Fly.io; as I've said before, it's probably reasonable to think of me as an HN person first, and a Fly.io person second. My tone is my tone, and has been for the many years I've participated in this community. I got back from an evening out, saw that we were on the front page, poked around a little to find out what the hell was going on, and did my best to add some context. That's all.
If you're reading my comments on HN as some kind of official response from the company, you've misconstrued them.
> If you're reading my comments on HN as some kind of official response from the company, you've misconstrued them.
For what it’s worth, this is the reason most companies eventually restrict their employees from making statements about the company; it doesn’t matter if you thought it was clear that it was unofficial, any statement from an employee in a position of power (such as someone with access to the control panel) will be perceived as a communication from the company.
You may have intended it to be a personal remark about your job, but there are a lot of people in this thread looking for any communication they can get about the company.
When you step in to fill that void as a person who appears to have access and power within the company, you are the official communication whether you intend to be or not.
For the sake of fly.io, you should either restrict yourself and not respond or, if you can't resist, make it crystal clear that you DO NOT represent fly.io. Your first message can and will be misunderstood, and it DOES throw a poor light on fly.io.
I am a paying customer of fly.io, on the Scale plan.
TBH I thought you were replying as the CEO of fly.io since 1) I've seen them post here before, 2) I have no idea how big fly.io's staff is and 3) your post didn't otherwise describe who you were. It doesn't look like I was the only one to be confused.
If you had said "thoughts are my own; I just work there" or something I think it would have been more clear.
It seems you took my comment personally, but it was about not just your comments but the overall tone of fly.io's communication (see the recent blog post regarding funding) and approach to issues (three days of silence on a dead instance). You view processes and guidelines as chains rather than as a ladder to help you climb a cliff. If the processes and communication were good, then you'd know when you should self-restrict and when you shouldn't. You'd be empowered to make decisions within a framework that benefits fly.io the most, rather than being left to guess yourself. You'd understand why you should do that sometimes and why it's a better option for everyone.
I don't, but that's fine: it's not important that we understand each other all that clearly here, since all I'm talking about is how our public forum works.
For an opposing viewpoint: I don't want HN to become the place where corporate comms comes to bullshit us. I want engineers who work there to talk to us as peers, which seems like what's happening here. I get candor and humility (and playfulness, sure) from Fly's tone, which I appreciate.
I get stuff like this is frustrating. But I bet Fly staff are pretty frustrated too.
From this my takeaway is that I could get fired for picking Fly.io for work. Not because there was an outage, but because days could pass before getting support.
What assurances could you give the community here that the support would be better next time?
This is our public site, for people who don't have support plans with us.
It's difficult for me to say more about what happened here and how you might have handled it, because I don't know what happened with this SYD host, because it's 1AM and the people who worked on it are, I assume, asleep. When I know more, I'll do my best to get you a postmortem.
Try filing a bug with any of the big three cloud vendors when you're on their free plan. It's really not different, the thing that is going to get you fired is not realizing you're not paying a couple hundred bucks per month for premium service on the infrastructure that is mission critical to your company.
Funny story, when I started my current role I researched our hosting provider. I couldn't find the matching invoices in the accounting system. So I called the vendor, a local company. They'd not set our account up correctly, billing was not enabled. Since then we've been billed. I'm glad we sorted it but it wasn't a good look to start my role by increasing our spending.
I feel like starting your role by discovering a crucial service wasn't being paid for and therefore was at risk of suddenly going away should be a pretty positive thing.
However 'should' is pretty load bearing there and actual results are probably heavily dependent on management culture and the current state of office politics.
We had a customer once that our automatic billing system tried to reach for 3 months about failing credit card charges (<$5k/mo). Our system stopped the service. I'm pretty sure their subsequent outage cost their customers millions. Lessons about what it means to have (and be) enterprise customers were learned. Unfortunately, the lady who was ignoring our e-mails in her inbox got fired.
> Try filing a bug with any of the big three cloud vendors when you're on their free plan.
A host being down for 3 days isn’t a bug. And you can contact AWS support, even on the free plan, and get a reply. Try it yourself. The great thing about AWS and the other cloud providers? If a host has issues they email all customers with workloads on it so you don’t need to refresh or check a forum.
I understand fly is a community darling. They’re unreliable, with poor support currently. Maybe the dev experience is great and that makes up for it, but pretending like everything else is equally shitty? Not true.
Lots of experience with Fly's paid support here. tl;dr Absurdly good.
FAR better wrt both response times and technical expertise than you'll get with any large public cloud provider.
I was dealing with some annoying cert + app migration stuff (migrating most of an app from AWS to Fly), and Kurt (CEO) was personally sending me haproxy configs bc I'm not smart enough to know how to configure low-level tcp stuff in haproxy. Not to put him on the spot here -- I doubt he'll have time to do that level of support going forward -- but that's my experience of the company's dedication to support and technical expertise.
For instance one of those things I've noticed is that most Discourse instances have those nag banners if you're not logged in begging you to log in – and that's one of the least objectionable things they do IMO. I discovered recently that Discourse also blacklists all but the most recent browsers (because Discourse is designed for the next ten years!) and serves up a plain text version on anything older… but not without a nag banner of its own admonishing you for not using a supported browser.
The infinite scrolling… ugh. I'm not a huge fan of XenForo, but as a successor to vBulletin it seems to be far more user friendly.
My understanding is that it was causing support problems, because people were Googling for solutions to problems with their apps (because of the Heroku diaspora, we have a lot of first-time Docker users), finding old stale threads on our forum that looked related, and then reviving them.
I think we can just `noindex` the category instead of making it private?
After 15 months & more than 100 million requests served by our Phoenix + PostgreSQL app running on Fly.io, I would be hard pressed to find a reason to complain.
- Some deploys failed, and re-running the pipeline fixed it.
- Early July 2023, 9k requests from Frankfurt returned 503s. Issue lasted 10 seconds.
- While experimenting with machines, after many creations & deletions, one volume could not be deleted. Next day, the volume was gone.
That's about it after 15 months of running production workloads on Fly.io.
I'm sorry to hear that many of you didn't have the best experience. I know that things will continue improving at Fly.io. My hope is that one day, all these hard times will make for great stories. This gives me hope: https://community.fly.io/t/reliability-its-not-great/11253
There's also a lock icon next to the "App not working" category in the header, which I took to mean that that entire category is hidden from logged-out users (which experimentally seems to be the case).
I have the impression from this discussion that the thread was public (as in, it would work if you just linked to it from something like HN) earlier, and now it isn't?
Obviously, deliberately hiding a negative story on our Discourse is a little like deleting a bad tweet; it's just going to guarantee someone captures and boosts it. We have a lot of flaws! But not knowing how the Internet works probably isn't one of them. No idea what's going on here, still trying to work it out.
Yes, from the Google-cached version, it appears that the thread previously didn't have the app-not-working tag; it was only tagged with "rails".
Not going to try and guess why or when that tag change happened. Personally, I'm less concerned with this particular thread than with the apparent decision to systematically hide all potentially-negative threads from search engines.
That category was added after one of our support folks replied, likely for tracking. I don't know why it's private. They may not even know this category is private. Hiding negative shit wasn't a deliberate decision... we're aware of google cache and we don't need to give HN another reason to dunk on us.
> That category was added after one of our support folks replied
FYI, this doesn't appear to be strictly accurate. The OP commented at 23:52 UTC saying that the thread had been made private, and the reply from "Sam-Fly" was not posted until 02:36 UTC.
My point was that the app-not-working category is used in conjunction with support/our team getting involved. I assume this is what Sam meant by "flagged it internally", which was followed by investigation, then a post. I don't see how the timestamps uncover something nefarious.
If you're talking about the comment you're replying to, tbh I found it was way more relatable than a more "professional" PR-speak response. Maybe you were talking about something else
Eh, I like it. It's refreshing to see a company representative communicate like an actual human being instead of the usual meaningless corporate robot-speak.
I'd rather take this response and see that they're working on it than "Oopsie poopsie, our machine elves have messed up!" or corporate newspeak saying nothing.
you have no idea wtf you're writing about; it's been a few hours now and it's become clear that someone tagged the post as 'app-not-working', which made the post go 'private' and only available to logged-in users. it's also become apparent that the linked post is on a community forum for users without a support plan.
the dramatic tone and accusations in your reply are not warranted anymore
I like fly.io a lot and I want them to succeed. They're doing challenging work...things break.
Have to admit it's disappointing to hear about the lack of communication from them, especially when it's something the CEO specifically called out that they wanted to fix in his big reliability post to the community back in March.
I just got gigabit bidirectional fiber at home and honestly if I were doing personal stuff or doing very early bootstrapping I'd just host from here with a good UPS. No it wouldn't be data center reliability but it'd work at least until it was ready to put in something more resilient.
You can pay for a business class fiber link too. It's about twice as expensive but they have guaranteed outage response times which is really what you pay for.
> enthusiasts would be better served DIY; put a beige box in a local colo
I mean, like, can I provision a zero ops bit of compute from <mystery colo provider> for $20/month?
Edit: looked up colo providers in my city- “get started in 24 hours, pick a rack and amperage, schedule a call now.”. Yeaaah, no. This is why people use cloud providers instead.
The thing is, running a good SaaS service requires quite a bit of staff, hard operational skills, and a lot of manpower. You know, the kind of stuff people always call useless, zero-value-add, blockers, and something to be automated away entirely.
Sure, we have most of the day-to-day grunt work for our applications automated. But good operations is just more. It's about maintaining control over your infrastructure on the one hand, and making sure your customers feel informed and safe about their data and systems on the other. This is hard and takes lots of experience to do well, as well as manpower.
And yes, that's entirely a soft skill. You end up with questions such as: Should we elevate this issue to an outage on the status page? To a degree you'd be scaring other customers. "Oh no, yellow status page. Something terrible must be happening!" At the same time you're communicating to the affected customers just how seriously you're taking their issues. "It's a thing on the status page after an initial misjudgement - sorry for that." We have many discussions like that during degradations and outages.
I appreciate the honest feedback. We could have done better communicating about the problem. We've been marking single host failures in the dashboard for affected users and using our status page to reflect things like platform and regional issues, but there's clearly a spot in the middle where the status we're communicating and actual user experience don't line up.
We've been adding a ton more hardware lately to stay ahead of capacity issues and as you would expect this means the volume of hardware-shaped failures has increased even though the overall failure probability has decreased. There's more we can do to help users avoid these issues, there's more we can do to speed up recovery, and there's more we can do to let you know when you're impacted.
All this feedback matters. We hear it even when we drop the ball communicating.
What hardware are you buying? Across tens of thousands of physical nodes in my environment, only a few would have "fatal" enough problems that required manual intervention per year.
Yes, we had hundreds of drives die a year and some ECC RAM would exceed error thresholds, but downtime on any given node was rare (aside from patching, but we'd just live migrate KVM instances around as needed).
Not that nothing will fail - but some manufacturers have just really good fault management, monitoring, alerting, etc.
Even the simplest stuff like SNMP with a few custom MIBs from the vendor helps (there are some vendors that do it better than others). Facilities and vendors that lend a good hand with remote hands are also nice, should your remote management infrastructure fail. But out-of-band, full-featured management cards with all the trimmings work so well. Some do good Redfish BMC/JSON/API stuff too, on top of the usual SNMP and other nice built-in Easy Buttons.
And today's tooling with bare metal and KVM makes working around faults quite seamless. There are even good NVMe RAID options if you absolutely must have your local box with mirrored data protection, and 10/40/100Gbps cards with a good libvirt setup can migrate large VMs in mere minutes, resuming on the remote end with barely a 1ms blip.
"it depends". Dell is fairly good overall, on-site techs are outsourced subcontractors a lot so that can be a mixed bag, pushy sales.
Supermicro is good on a budget, not quite mature full fault management or complete SNMP or redfish, they can EOL a new line of gear suddenly.
Have not - looks nice though. Around here, you'll mostly only encounter Dell/Supermicro/HP/Lenovo. I actually find Dell to have achieved the lowest "friction" for deployments.
You can get device manifests before the gear even ships, including MAC addresses, serials, out of band NIC MAC, etc. We pre-stage our configurations based on this, have everything ready to go (rack location/RU, switch ports, PDUs, DHCP/DNS).
We literally just plug it all up and power on, and our tools take care of the rest without any intervention. Just verify the serial number of the server and stick it in the right rack unit, done.
Here the even bigger red flag is that Fly doesn't have an (automated?) way to quickly move workloads from a faulty server to a good server. Especially when containers (and orchestrators) have abstracted away the concept of data volumes that can be attached and detached. (Yes, it needs a lot of serious technical investment to provide this, and I think it's one of the reasons storage is expensive on the big 3 clouds.) If you are offering data persistence services then you absolutely need this capability.
I think there is an expectation mismatch between what Fly wants to offer and what the market wants from it. Fly wanted to innovate on offering the ability to the devs to be able run their apps from multiple data centers. But without a proper data persistence service, the ability to run apps from multiple data centers is not useful to a vast majority of people.
I think Fly is trying to solve the persistence issue with their SQLite replication, but that means the vast majority of the devs will have to change the way they develop applications to suit Fly platform.
I think Fly needs to choose between what it wants to become. A reliable and affordable Heroku replacement, which is a decent sized market or offer an opinionated way of developing apps which offer best performance to users all around the world.
But opinionated ways of doing things are a double-edged sword. (Rails and Spring Boot are highly successful because of their opinionated defaults.) App Engine is an interesting case study in the app hosting domain. It was way ahead of its time and prescribed a way of developing apps that allowed them to scale to very high traffic. But people didn't want to change the way they develop to adapt to it.
>I think Fly needs to choose between what it wants to become
They have already pivoted once, no? At their current size (>100M in funding), I seriously doubt they can do it again.
I think they are scrambling hard, putting one fire out just to start another one later. That doesn't give me confidence in their technical roadmap and multiple people have Fly.io in their "check later" list for what now? 2 years?
It's really hard to recover your reputation when people perceive you as unreliable. Especially in the IT space.
They don't have remote attached storage, it's all local on the node, lvm based volumes. The data persistence is 24hr or manually created lvm snapshots that are exported to s3.
It's really not a place to run persistent workloads. If you run postgres there, you need to be prepared to either hot load your data into a new instance, or restore from backups.
They had my account on some sort of shadow ban with no communication whatsoever after asking them to delete my account from their systems. I emailed them and to date never even got a response. I have moved everything over to Railway app and back to Google Cloud Run ever since.
> Why would a company shadow ban you for asking an innocuous question?
If you are literally overwhelmed with crises, it becomes appealing to make problems go away in this manner. Not saying they are, but this thread is suggesting that.
You know what's interesting? It feels like history is repeating itself with Fly.io, just like it did back when I first encountered Heroku. Back in the day, I was super excited about Fly.io – it had that same fresh, exciting vibe that Heroku had when it burst onto the scene.
I remember being blown away by Fly.io's simplicity and how easy it was to use. It was like hosting made simple, and I couldn't help but think, "This is it, this is the one!"
But, as time went on, I noticed little signs of trouble. Downtimes became more frequent, and my deployments, which were once snappy and seamless, turned into agonizingly slow affairs. It was like déjà vu from the time when Heroku's greatness started to wane.
It's disheartening to see Fly.io go down a similar path. As more people flocked to the platform, it seems like its performance began to suffer – just like what happened with Heroku. The more popular it got, the less reliable it seemed to become.
Scrolling through Hacker News, I can't help but feel a sense of disappointment. Others are expressing their frustration too, and it's like we're all reliving that moment when Heroku lost its charm and became a hassle.
I have to admit; it worries me. It's like a cautionary tale of how even the most promising platforms can fall from grace. It's the reality of the fast-paced tech world, but it's tough to accept.
So yeah, here I am, hoping against hope that Fly.io can somehow break free from this cycle and find its footing before it becomes as useless as Heroku was at its lowest point.
It was a bit alarming to see Fly offering significant resources for free (and encourage using them in the docs, subtly making them a feature and a reason to switch) back then. I wondered if they overestimated the conscientiousness of the industry: as with Heroku, surely once the word is out in the wider world plenty of people would flock over just to not pay. Guess what happened next…
Heroku was a new thing back then, so it took a while for abuse to ramp up—but every subsequent attempt at being generous should not even be considered without either a vicious and expensive anti-fraud department in place or deep pockets to compensate for the initial lack of said department by throwing enough hardware that the minority of honest users don’t notice the overhead.
My impression suggests that Fly does not score high on either of the above. Which is partly why I like them—the above seems like megacorp type bullshit, and they seem to be strictly no-megacorp-bullshit—but I wouldn’t be surprised if engineers at Fly had to spend most of their time dealing with fires or optimizing resource allocation and auto-limiting freeloading cryptominers, scammers, and other abusers rather than focusing on longer term infrastructure reliability or DX.
Do you think it's related to scale? As in, once a company has enough paying customers to become profitable/investable, it has also accrued enough issues that it stops feeling fresh and exciting like you said, and gradually becomes like the older competitor it once wanted to replace?
This is my experience at least. Once the company goes from a few pizzas to "we've booked a venue", entropy creeps in and adages like Conway's/Brook's law become increasingly evident.
The skillset of successfully founding a company and the skillset of successfully scaling a company are not the same. The latter is a hard thing to do that requires understanding both customers (current and potential) that you never speak to and employees that you barely speak to.
We tried to migrate all our staging environments to Fly last year, but it was the flakiest experience I’ve had on any PaaS. Pushing simple containers up would fail 70-80% of the time with no useful error messages and non-existent support. It’s a weird company that seems great until you actually use them.
I think fly.io is pretty incredible but I can't help but feeling they're doomed to follow in heroku's footsteps (unclear if good or bad). They've built some pretty wild stuff and I can't help but wonder if they're overcooking the ocean instead of just solving problems for their users.
Durable and available storage are all they really need to draw me away from big cloud providers but this combined with their answer to S3 being "use S3 or run minio" means I'll never take them seriously.
This is a bad look folks, not sure how you can walk back days of silence and hiding threads. Just open an issue and talk to your users.
At least I could rely on Heroku in production. I've wanted to give Fly.io a try but this gives me pause. I really do miss the Heroku DX whenever I'm putzing around with the increasing complexity of AWS.
For hobby projects - where I dare not touch AWS for fear of going bankrupt from a misconfigured service - I found the sweet spot to be Dokku on top of a Hetzner or Digital Ocean instance. It provides a Heroku like interface on top of cheap hosting, and is fine where you don't expect to scale very much.
They do have an EU Central region (see [1]) but "it is not possible to have multiple regions under one account" - you need an account in each region (although it seems you can maybe fudge around with groups to emulate multi-region access.)
Instances going down happens sporadically on Hetzner Cloud as well, but often by the time I see the e-mail alert that some instance is unreachable, I log into the dashboard to find that it has already been restarted or migrated to another host. I've been running a production system there for more than 4 years now and had zero provider-related downtime (as I have some redundancy for most instances). In terms of features they move way slower than Fly.io, and it took them years to add stuff like virtual networking, but everything they add works rock-solid. I guess there are just very different engineering cultures when it comes to building a cloud infrastructure provider, and I have to say I prefer the "take your time and do it right" approach.
I'm running some instances on Hetzner Cloud; the oldest is ~5 years old and only recently had 2hr or so of downtime - other than that, no problems. And we are talking the cheap ones.
I did have a problem with their dedicated server almost immediately after spinning it up. Noticed that NVMe is broken, and support went like:
- 16:28 -> I contacted them
- 16:36 -> Their first response
- 16:44 -> I sent them SMART data
- 16:48 -> They acknowledged that the NVMe needs replacing and asked me if I consent to that (and to losing the data that was not already lost -> but running RAID so no problems there)
- 16:52 -> I agreed
- 17:30 -> NVMe was replaced and server booted
I don't have too much experience with hosting providers on that level, but that was freaking impressive response time from them. So a happy camper as well :D
Hetzner has a great price/performance ratio, but they are not rock-solid. Speaking of the private network... look at their forum where people complain about downtimes for their "vSwitch" every other week, sometimes it doesn't show up on the status page because it happens on the weekend (lol).
They’ve been working on Fly for years now and seems like they haven’t been able to turn it into a reliable service or profitable business (making assumptions about the second part here), and the overall general sentiment seems to be to avoid it for anything but the most toy applications. I note that the team was also unable to get their recruiting business off the ground either and shuttered it.
My assumption based on the creator’s very online hacker news commentary is that they seem to be at least smart in tech. So what’s the lesson here for the rest of us who may want to start a business? Is this a “shots on goal” thing and we’re just seeing these failures more publicly than most so it biases the perception, or is there some je ne sais quoi missing that we could learn from? No offense intended by my post, but I would be very keen to learn whether there’s some X Factor missing from an otherwise ostensibly smart team’s repeated failure that we could learn from.
It's really disappointing that they made this forum thread private, apparently in response to this HN thread blowing up. This is the first negative HN thread I've seen about them, it's not even really that bad because this kind of downtime is expected, and they can't get to every forum post, and their response that someone posted here is totally reasonable in my opinion.
So why is the link to the thread 404ing, and why does this post have to link to the Google webcache of it? I've grown to like fly.io and use them for my side projects now, and this just isn't something they would do. Going through some minor cognitive dissonance right now :/
I wonder if there will ever be a wake up call to the arrogance of people at fly.io
At work when it came up in a meeting people went around with horror stories of broken elements while the status page wasn't updated, terrible communication and an overall attitude that nothing is wrong, even when servers go down for days at a time.
There's a global status page, and then there's a local update for people with instances on an affected host --- past some threshold of hosts, the probability of having an issue on some random host gets pretty high just because math. The local status thing happened for people with instances on that machine.
Ordinarily, a single-host incident takes a couple minutes to resolve, and, ordinarily, when it's resolved, everything that was running on the host pops right back up. This single-host outage wasn't ordinary. Somehow, a containerd boltdb got corrupted, and it took something like 12 hours for a member of our team (themselves a containerd maintainer) to do some kind of unholy surgery on that database to bring the machine back online.
The runbook we have for handling and communicating single-host outages wasn't tuned for this kind of extended outage. It will be now. Probably we'll just paint the global status page when a single-host outage crosses some kind of time threshold.
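A minimal sketch of what that kind of rule could look like. The one-hour threshold and the incident fields are assumptions for illustration, not Fly's actual runbook:

```python
# Sketch of an escalation rule: promote a single-host incident to the global
# status page once it has run unresolved past a time threshold.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

GLOBAL_ESCALATION_THRESHOLD = timedelta(hours=1)  # assumed value, not Fly's

@dataclass
class HostIncident:
    host_id: str
    started_at: datetime
    resolved_at: Optional[datetime] = None

def should_paint_global_status(incident: HostIncident,
                               now: Optional[datetime] = None) -> bool:
    """Escalate an unresolved single-host outage past the time threshold."""
    if incident.resolved_at is not None:
        return False
    now = now or datetime.now(timezone.utc)
    return now - incident.started_at >= GLOBAL_ESCALATION_THRESHOLD

# Example: an incident that started two hours ago should be escalated.
incident = HostIncident("syd-host-42",
                        datetime.now(timezone.utc) - timedelta(hours=2))
assert should_paint_global_status(incident)
```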
Wondering if for small/bootstrapped projects there's any alternative people suggest? Fly has a nice UX and accessible prices, but it's unstable at best. I use the big clouds at work, but for personal they are $$$. Also I want to keep devops tending asymptotically to zero.
I've never actually used Render, but did interview with them last year. I faceplanted at the end and didn't get an offer, but… hands down Render ran one of the best interviews I've ever participated in. Communication was on point, the process itself was well organized, and even though I disagreed with a couple of engineering choices, there was a distinct lack of bullshit.
If that carries over to their customer facing folks and how Render as a team has executed since then I'd absolutely recommend taking a look at them.
I second render.com. I switched from fly.io to Render.com after seeing a few of my instances getting bottlenecked and crashing. Now the same service runs smoothly on render.com without any crashes. Didn't dig any deeper but somehow the resource management is better with render.com
I've been using a Postgres DB on Railway's free plan (that is going away) and it was great. It did everything I wanted (excluding external access and PostGIS) for cents. The support community is nice.
I didn't use it for much more, but my experience has been great. They deserve way more air time than they currently get.
I wish digitalocean offered decent pricing for Spaces (S3). Unfortunately it starts at $5, which is an enormous price for storing 70 small images, but S3 would greatly simplify my server management by moving state entirely outside the server (managed database + managed object storage).
I don't have to use an object store, but using the filesystem makes setting up a server more costly: if I delete the instance, the data is gone. A volume kinda offsets this, but it's way less portable and accessible by only one instance at a time.
The peace of mind of managed is nice, all I have to think about is running the app, without having to deal with making sure db and files don't get lost
That's an option, but I want to keep things simple, and the usual assumption is "filesystem" - though weirdly, most libraries assume S3 usage.
I don't think I've seen native support for db-stored images in any of the libraries I use, which is sad but a reality.
I use Dokku on top of Hetzner for my hobby projects - hosting is super cheap, for a little extra I can add a mounted volume for storage, and if the project outgrows a single server I can always just break out of Dokku and use some Docker containers behind a load balancer.
If you are outside of Europe, Digital Ocean or Linode may work better for you.
I like Hetzner, they certainly radiate the feeling of quality (the management UI is great, for instance). The servers themselves are competitively priced (and they have ARM boxes!) - but for more storage than the little that they include I find the price pretty outrageous, compared to the base price, anyway. You'd end up about doubling the price for a "reasonable" amount of storage you can confidently run your base system on.
Maybe just pick up 3 chonky EC2 boxes, set up iptables on each of them, have each one run a containerized version of your code that gets built and deployed from CI every time you push to Github, slap an ALB in front of it all, and call it a day?
And if you need state, then spin up a little RDS with your favorite SQL flavor of choice?
The CI deploy script could even bake in little health-checks so you can do rolling deploys with zero downtime. Depending on how fancy you wanted to get with your shell scripting, you could probably even make 1 of your 3 boxes a canary without too much trouble.
I'm realizing I haven't thought about this in a long time, since nowadays I just get to use the fancy stuff at work. Kind of a fun thought experiment!
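For what it's worth, the rolling-deploy-with-health-checks part really can be a short script. A minimal sketch, assuming a hypothetical deploy() hook, placeholder host IPs, and a /healthz endpoint on each box:

```python
# Sketch of a rolling deploy: ship to one box at a time, wait for it to pass
# its health check, and treat the first box as a canary.
import time
import urllib.request

HOSTS = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]  # hypothetical instance IPs

def deploy(host: str) -> None:
    """Placeholder for whatever actually ships the new container to a host."""
    print(f"deploying to {host} ...")

def healthy(host: str, retries: int = 10, delay: float = 3.0) -> bool:
    """Poll the host's health endpoint until it answers 200 or we give up."""
    for _ in range(retries):
        try:
            with urllib.request.urlopen(f"http://{host}/healthz", timeout=2) as r:
                if r.status == 200:
                    return True
        except OSError:
            pass  # not up yet (or unreachable); retry after a short wait
        time.sleep(delay)
    return False

for i, host in enumerate(HOSTS):
    deploy(host)
    if not healthy(host):
        raise SystemExit(f"deploy halted: {host} never came back healthy")
    if i == 0:
        print("canary looks good, continuing rollout")
```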
I admit I didn't run the numbers before posting that. But you got me curious, so I went ahead and did it now...
Render.com looks like [1] their "$0 + compute costs" plan would work out to:
∙ $25/mo for a single "Web Services" box of 1 CPU and 2GB RAM
∙ $20/mo for a single "PostgreSQL" box of 1 GB RAM, 1 CPU, and 16GB SSD
∙ TOTAL: $45/mo, and you're assuming they'll magically give you zero-downtime
Those are grim numbers, performance-wise, but let's use them as the standard and see what it'd cost in the scrappy AWS architecture I threw together in a few minutes:
∙ $12.10/mo for a single t4g.small box, which is actually 2 vCPU and 2GB RAM [2]
∙ 3x redundancy on that brings you up to $36.30/mo for compute
∙ $16.20/mo for an ALB [3]
∙ $11.52/mo for a single db.t4g.micro PostgreSQL box, plus $1.84/mo for the equivalent 16GB of storage [4]
∙ TOTAL: $65.86/mo for substantially more CPU, redundancy, and control, or...
∙ TOTAL: $41.66/mo for substantially more CPU and control over your infra, if you're willing to drop the redundancy
So it looks like it's pretty comparable in terms of raw dollars.
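For anyone who wants to check the arithmetic, here's the same comparison as a tiny script, using the prices quoted above rather than authoritative current list prices:

```python
# Reproducing the back-of-the-envelope comparison above
# (monthly prices as quoted in the comment, not official pricing).
render = {"web_service": 25.00, "postgres": 20.00}

aws = {
    "t4g_small": 12.10,        # per instance, per month
    "alb": 16.20,
    "rds_db_t4g_micro": 11.52,
    "rds_16gb_storage": 1.84,
}

render_total = sum(render.values())
aws_shared = aws["alb"] + aws["rds_db_t4g_micro"] + aws["rds_16gb_storage"]
aws_redundant = 3 * aws["t4g_small"] + aws_shared
aws_single = 1 * aws["t4g_small"] + aws_shared

print(f"Render:          ${render_total:.2f}/mo")   # $45.00/mo
print(f"AWS, 3x compute: ${aws_redundant:.2f}/mo")  # $65.86/mo
print(f"AWS, 1x compute: ${aws_single:.2f}/mo")     # $41.66/mo
```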
I'll admit there's a little more "devops" overhead with the AWS setup. Though I think it's not as big of a deal as people make it out to be — it's basically an afternoon of Terraforming, and you'd probably spend an equal or greater amount of time digging through Render's docs to understand their bespoke platform anyway.
(Also, once you contemplate bulk pricing for the underlying commodities, it's easy to see how companies like Render make a healthy margin, even on their low-end offerings.)
Anyway, I guess I've nerd-sniped myself, so I'd better stop here. But that was a fun analysis!
Thanks for the analysis. I think you're still underestimating costs (e.g. you didn't count bandwidth, an AZ standby for your database, backups, etc.) and time spent, not only in the setup but especially in maintenance (security fixes, AWS agent updates, OS updates, package updates, figuring out why an instance ran out of disk, etc.). Not counting that you have to set up and maintain your deployment system, which can range from scripts to K8s.
Also I have used Terraform to set up quite a few resources and it's only overhead in a small project.
I just wanna git push and see my changes published a minute later. I don't think Render is gonna take more than 10 mins to figure out https://render.com/docs/deploy-rails-sidekiq
Spun up a new project and was debating between AWS and Render.
I’ve been burned one too many times by ElasticBeanstalk so I bit the bullet and went with Render… and had everything plus PR deploys working in under an hour. Very happy so far.
What do people get out of using special services like Fly.io instead of standard VMs like the ones you can get from $5/month these days?
Can anybody who uses Fly.io explain their rationale? Why do the additional integration with Fly.io, trust and install their special software on your machines and tie your project into their ecosystem?
What type of application are you running? How many users are using it?
There's a sweet spot of early startup or side project where you don't have the time, budget or people to manually set up and maintain servers on your own or deal with the complexity and cost of Kubernetes or AWS, especially when your focus is on building the product and acquiring customers.
Heroku (before its inevitable enshittification under Salesforce) was great for this use case. Sure you will outgrow it at some point, and it did get expensive, but when you just want to throw up an MVP with minimum fuss and maintenance you could do much worse.
Not sure what fly.io offers vs Heroku or others (I have played with it some time ago but not used for anything serious), but for an equivalent I'd be looking for automated load balancer setup with SSL, easy scaling up so I can go from 1 to 2 or however many web services (with UI or CLI), simple deployment configuration with a Procfile (or whatever) and managed PostgreSQL/MySQL/Redis including backup/restore when needed.
That's more than what I would have or need locally.
And what kind of project do you run which needs up/down scaling and load balancing?
In my experience, for a simple PHP web application, the smallest VMs already can handle a thousand concurrent users, which amounts to something like a million monthly users.
But if your users are distributed around the world and most requests are read requests then it can make sense to shave 100 or 200 ms off your response times.
You can always squander those gains later by running JavaScript for 5000 ms before showing anything :)
Probably saves you a good hour of "sudo apt gets" and "vim /etc/nginx/nginx.conf" etc.
Having used various PaaS services that take this "pain" away from you, I sort of think the tradeoff isn't worth it. For $5/mo DO will give you a backed-up server. Add $15 for Postgres and that is a good deal.
I used Heroku for a project mostly because my team didn't have skill set to set this up and I wasn't going to do it. As far as I know they are still on Heroku (with a smattering of AWS services) for that same reason: just works and cheaper than doing it yourself.
> Who fully sets up a significant project locally?
Who doesn't? I couldn't imagine having to push to some cloud agent and wait a random amount of time every time I want to test something. With it local I can just save, maybe rebuild or have it auto-rebuild if necessary, and test, then repeat. On a fast machine this can be a few seconds or instantaneous.
Maybe the niche I'm missing here is very "green" developers who don't know how to do any sysadmin work or deploy things.
If this is you, learn it. It pays off huge, not just during development but in being able to have a lot more choice about where you deploy and a lot more control over your own stuff.
Maybe not redundant because it’s just for testing, but you absolutely can run an entire stack like that on a decent laptop. You can even use Docker to run the same containers you run in production.
I’ve run Kubernetes clusters in multiple Parallels VMs locally with work loads in them to play around.
You also learn a ton about how things work which helps you debug and fix stuff when things go wrong. Even if you use managed stuff it’s always a huge plus to understand at least the basics of how it runs.
Why is this company always on HN frontpage - ironically for their bad services? Normally, poor service from a provider isn't grounds for such attention - but seems like Fly.io has not done anything great.
They still continue to get love from the developer community who "wants them to succeed". I'm puzzled as to why? Because of some blog posts?
> I'll be damned if you can't have a throat to choke in less than 10 minutes whenever something like this happens
That is a hell of generous description for a person who sits in your Slack instance and responds with "I have escalated to the team internally and am waiting to hear back on confirmation if this is an issue."
Moving a Level 1 support engineer closer to the customer doesn't give them more information, it just reduces the latency to getting a non-answer.
I had one situation where a Hetzner dedi didn't come back up on a reboot. Their dedis are cheap, this one is like $40ish/mo?
Opened a ticket and support had it back up again within about 10 minutes, turned out to be a failed CPU fan which caused an overheat condition and made it so the system wouldn't complete the boot. They swapped the fan and it came up. It's the only failure I've had in years of dealing with them and was just impressed how quickly a physical failure event like that got handled.
Datacenters in my country usually had some rooms with tower servers 20 years ago; well, my first colo was for the tower server I brought in a large backpack :-). But density requirements, cold/hot aisles etc. prevailed, and towers are generally considered inefficient for datacenter purposes.
And then you have the Hetzner datacenter that probably all the people running DCs I know would ridicule, but they would not be able to respond to a fan replacement in the same time. I wonder how many rack server chassis are recycled each year because the manufacturer just won't let you reuse them with a new motherboard or power supply due to a new shape, design, port placement, etc.
No love for AWS, but this isn't true, at least for larger deploys. If you're running enough with them that you have an account manager, they are very good indeed. You can have someone, someone good, on the phone within minutes and they will stay on the line until the issue is sorted.
I recall an incident at my old company where we were under DDOS, it was getting through cloudflare and saturating LBs in some complicated manner (don't recall the exact details) which made it hard for us to fix ourselves. They were on the phone with us for hours, well past midnight their time, helping us sort it out. The downtime sucked, but I was certainly impressed with their truly excellent support.
I was working for a pretty big early AWS customer--one that had realized that for the low low price of all your money you could make DynamoDB scale to some truly massive numbers--and one time when we were having trouble around noon Eastern, a colleague called up our TAM. As he told it, the TAM sounded half-asleep, so my colleague asked if everything was alright.
"I'm in Hawaii on my honeymoon and my backup missed your call, so it escalated."
I probably wouldn't have answered the phone. Granted, that's why I don't do that job. But I have always had a real appreciation for the good TAMs ever since.
Weird, I just begrudgingly went from Postgres to Dynamo because it was so much cheaper. We're not huge scale though, so I'm wondering where the costs start to diverge the other way.
I wanted to give Fly.io a try in my next project but not with this operational culture. I regret telling my CTO clients about Fly.io as the next big thing in operations.
Not a sarcastic or rhetorical question - how come the three big A clouds, or even smaller ones (Hetzner, my favorite), are mostly so stable (give or take some outages)? Does anyone know the internal engineering, architecture and practices that keep their systems that stable?
There isn't really secret sauce to it in 2023. The techniques, processes, and etc have pretty much been documented over the past 20 years.
But if you are wondering how AWS manages to be so good at it at such scale? Hosting infrastructure is incredibly complicated and AWS employs something like 100k people. Seemingly small AWS services employ more engineers than Fly.io.
That being said, my take is that what's happening at Fly.io is a lack of leadership. There are clearly not the right people in the right positions. I've worked infra at companies from 5 people to, well, Rackspace, and I'm having a hard time imagining so much time passing with... essentially a piece of infra MIA and impacting users.
I think the core issue is that they vehemently don't want to act like a corporation. Which is great for early marketing and adoption, but there's a reason successful B2B corporations act like they do. It's less fun and it's less endearing, but it also annoys customers significantly less. I mean, the CEO has "Interim Food Taster" as his title on LinkedIn.
IMHO it is their approach. I use Hetzner and OVH (and their other variants for lower budget clients) for our EU clients. They do not use buzz words like "deploy app server", "cloud clusters", "turbo charge this app". They are simply providing VPS and similarly configured droplets. They are also established and don't want to mess around with very modern experimental infrastructures.
Same goes for Digital Ocean. No buzz words. Just hosting with droplets. They simply say "here pick a linux distro, configure whatever and don't ask us much about app support". I use their Linux distros for my own apps and if want anything extra I just install it and suffer my own actions' consequences. Not theirs.
I guess what OP is getting at is that these providers stick to the battle tested proven bedrock and nothing like "run your app where your users are" which I find interesting because that too can be done with any cloud that has a Datacenter in the region where you happen to have users.
So this "closer to your users" voodoo is a little beyond me.
The 'where each user is' is implicit, the expectation is that you're some kind of global SaaS, and you want low latency whereever your users are.
Sure you can do that with any cloud (or multiple) that has datacenters in a suitable spread of regions, but I suppose the point (or claimed point, selling point, if you like) is that that's more difficult or more expensive to coordinate. Fly says 'give us one container spec and tell us in which regions to run it', not 'we give you machines/VMs in which regions you want, figure it out'. It's an abstraction on top of 'battle tested proven bedrock' providers in a sense, except that I believe they run their own metal (to keep costs down, presumably).
Some workloads are surely latency sensitive, but my possibly flawed opinion is that many of those transactional CRUD systems don't need to be that much closer to the edge.
I mean chat or e-commerce yes, the edge and all.
But for a ticketing system, invoicing solution or such, a few hundred milliseconds are not that much of a big deal, but compliance and regulations matter more.
Scale the technical difficulty and innovation of the product with the size and competency of the team. The market will always say they want more and the job of the company leaders is to know when to say no. AWS did not begin with everything it offers now but rather started with fairly boring things (even for the time) that they expanded over time. This was after a decade of learning how to do this internally so they weren't starting from scratch.
I've had service issues on Fly that I've escalated to support in the past, and given my experience it feels highly unlikely that they tried to sweep this under the rug or some such.
At the time we had deployed a small business workload (few 100$/mo in billings) and paid for their $29 support plan, so grain of salt there. We faced service issues and, while the service reliability did eventually push us to migrate, support was top-notch the whole way through. Support was happy to escalate as needed to try to help get a solution, with MrKurt eventually joining in and helping identify root causes. During the entire episode everyone was realistic about where issues could be (i.e., were open to the possibility of it being a Fly issue). As people from Fly have noted, they've historically been quite open about when they weren't the best choice.
Again, while service reliability has been an issue (and Fly has admitted this in the past and is working on it), I think the assumption of badfaith in this thread is pretty unprofessional. It's also a lesson in how hesitant people are to pay for support. $29 for access to a human is not a bad deal; we certainly got good value out of it.
> I think the assumption of badfaith in this thread from Fly is pretty unprofessional.
Customers aren't supposed to show professionalism. Service providers are. I didn't see disrespectful comments here.
People here are just pointing out that this has happened many times and looks like a pattern. If you don't fix a communication issue after multiple occurrences, you might not be ill-intentioned, but you are at least careless.
(Note: I edited my comment to make it clear I'm referring to bad faith from commenters; the quote above is from before the edit)
I'd argue the expectation goes both ways. I won't link to specific comments, but I think it's pretty clear that some of them cross the line to disrespectful.
I haven't seen a single instance of disrespect -- just justified frustration for the outage and lack of communication. There seem to be a lot of frustrated Fly.io customers and a vocal number of Fly.io fans.
I (like many others) want fly to succeed and have even moved a couple of (smaller) production apps to Fly.
But it's very clear that over the last few months, the (already quite capable) fly team is just in over their heads and have bitten off way way more than they can chew.
I've had nothing but headaches after they auto-migrated our app to v2. My build machine had to be forcibly destroyed before anything would even work. Then it allowed me to easily just delete an app because I wasn't able to deploy to it.
Then the deploys kept failing due to some VM provisioning error (it thinks I want to add another app when I just want to deploy to an existing app within the 3-machine limit), and honestly, I just don't care to troubleshoot this anymore. That was the point of using a platform like this... Any time I would've saved by using this platform has been wasted on these random errors that I don't have the time to troubleshoot.
I destroyed the app thinking "ok maybe I'll also recreate that one" because clearly the migration to v2 failed. And now that all my secrets were destroyed, when I try to attach the new app to postgres (with the existing username and existing database), it won't let me.
I genuinely wish you guys the best of luck with what has to be a tough time for your company, and will reconsider if you build something demonstrably more stable. But right now I just can't afford to drown with you with clients breathing down my neck.
I tried Fly once, but at the end of the day it seemed way too expensive for what it was and for how complete the vision actually was. And then I started to see the complaints in random corners of the Internet.
I don't read their blog regularly but I always thought they had great content. But not after reading this.
The irony:
"What people actually wanted to talk about, though? Databases."
...but apparently not when they are the problem behind said databases?
Their blog is great because they invested heavily in how they are perceived from the outside. Coming from the Elixir world, them hiring Chris McCord (creator of Phoenix) and sponsoring a ton of open source projects, slapping their logo on them, seemed great at first. But when it comes to actually deploying stuff to production and day-2 operations (monitoring is so much more difficult than it should be, and troubleshooting tools are lacking), they are way behind. I can imagine them getting lots of hobby projects on board thanks to the free tier and the day-1 impression, but that won't win over enterprises.
I tried Fly.io when looking to move away from Heroku. Some really cool stuff, and I love their focus on multi-region apps. But it just felt like too many under-documented things and edge cases, and the support didn't seem like it would be there for me when I really needed it. I ended up going with NorthFlank as my Heroku replacement; they've had the odd hiccup (mostly related to me being the first customer in their US East region), but communication and support have always been incredible. Really happy I chose them.
I tried using kubernetes a while back for hosting a side project on a raspberry pi. I guess technically I was running microk8s on the pi and had to install kubectl locally to interact with it.
I actually like some of the concepts, like pods and ingress, but one thing I noticed that I didn't like, as far as I remember, was that there's not really a good way in Kubernetes itself to make your YAML more dynamic. Apparently you're supposed to use other tools like Helm charts, which aren't even part of Kubernetes?
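For what it's worth, one workaround that avoids Helm entirely is to generate the manifests from a general-purpose language and pipe them to kubectl. A rough TypeScript sketch of that idea (assuming the js-yaml package is installed; the app name, image and port are placeholders, not anything from the comment above):

```typescript
// gen.ts — build a Kubernetes Deployment as a plain object and emit YAML,
// instead of templating YAML text. Apply with:
//   npx ts-node gen.ts | kubectl apply -f -
import { dump } from "js-yaml";

// "Dynamic" inputs come from ordinary code/env vars rather than a template DSL.
const replicas = Number(process.env.REPLICAS ?? "1");
const image = process.env.IMAGE ?? "myapp:latest"; // placeholder image

const deployment = {
  apiVersion: "apps/v1",
  kind: "Deployment",
  metadata: { name: "myapp" },
  spec: {
    replicas,
    selector: { matchLabels: { app: "myapp" } },
    template: {
      metadata: { labels: { app: "myapp" } },
      spec: {
        containers: [
          { name: "myapp", image, ports: [{ containerPort: 8080 }] },
        ],
      },
    },
  },
};

process.stdout.write(dump(deployment));
```

Helm (or kustomize) gives you the same thing with far more ecosystem support, but for a single Raspberry Pi side project a small generator script can be easier to reason about.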
Fly is soooo easy to use and very, very easy to go back to. We went back three times, and each time ended with multi-hour sev0s and horrible status updates. The last outage occurred during our YC interview... now we're on AWS.
Used to love fly, then had a few issues.* The CEO wrote back in March that they were working on reliability, but then you have this case study in what not to do in an incident response: 1) failing to monitor your primary support channel, 2) letting your support channel become "private", and 3) not updating your status page.
* First was some sort of certificate issue that cost me literally days of debugging that turned out to be their fault.
* Then weirdness around their v2 deployments where I just can't grok some of the documentation.
Just use AWS. Your time is more valuable than what you're saving on the fly.io free plan.
Most people in this thread are either in North America or Europe, so options for managed services like Fly exist, and there are plenty of them. But for people in South America, what options are there for a Heroku-like service? I don't want users shooting off requests halfway across the globe and back when there are many datacenters a couple of miles from our users. I just don't have the time and resources to manage VMs and scaling issues. I need a zero-friction "./serviceX deploy" experience.
Fly seems unreliable, but they offer a deploy region close to me. Does anyone know of any alternatives?
Frankly, the only way to reduce dependence on these kinds of things is self-hosting. At least then you can be 100% sure of causes and resolutions.
Companies moving to the cloud are only increasing their operating costs.
For those of you with Postgres apps, you can avoid this pretty much universally with CockroachDB (they have a serverless version they host). It takes basically no work to move from Postgres; even a Postgres dump works.
CockroachDB gets real expensive real quick if you want to use any of the cool functionality though. The free version is OK for a cluster in very short range of each other, for example the same DC, but if you spread things out you'll need the enterprise features (follower reads, at minimum) to keep any kind of reasonable performance.
Self hosted enterprise "starts" (!) at $100/vCPU per month. So, yeah. Not exactly the hobbyist's choice.
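On the migration-effort point above: CockroachDB speaks the Postgres wire protocol, so for many apps the first step really is just pointing the existing Postgres driver at a Cockroach cluster. A minimal TypeScript sketch with the node-postgres (pg) client; DATABASE_URL is an assumed placeholder for a CockroachDB Serverless connection string:

```typescript
// The same node-postgres client works against CockroachDB because it
// speaks the Postgres wire protocol. DATABASE_URL is assumed to look like:
//   postgresql://user:pass@host:26257/defaultdb?sslmode=verify-full
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function main() {
  // Ordinary Postgres SQL; nothing Cockroach-specific needed here.
  const { rows } = await pool.query("SELECT now() AS server_time");
  console.log(rows[0].server_time);
  await pool.end();
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

Schema-level differences can still bite (not every statement in a pg_dump restores cleanly, and some extensions aren't supported), so "basically no work" is best read as "little work for the common cases".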
How many days' work is it to build a deployment of an Elixir app with Pulumi, GitHub Actions and AWS?
As someone not incredibly experienced with devops, I always wonder what is best with databases? Should they be provisioned in Pulumi or do I just manually create them in RDS?
Secrets Manager seems like a bit of a pain point, as does IAM, which I think I just about understand until I get lost! Giving everything access to ingress and egress also seems a bit overly complex/powerful.
Probably the time to get something working is dramatically shorter than it once was with ChatGPT to help.
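On the database question: one common answer is to define the RDS instance in the same Pulumi program as everything else so it's reproducible, and drop the connection string into Secrets Manager for the app to read. A rough sketch, assuming the @pulumi/pulumi, @pulumi/aws and @pulumi/random packages; the names and sizes are placeholders, and exact property names should be checked against the provider version you're on:

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";
import * as random from "@pulumi/random";

// Generate the master password instead of hard-coding it in the repo.
const dbPassword = new random.RandomPassword("db-password", {
  length: 32,
  special: false,
});

// A small Postgres instance on RDS; size and storage are placeholders.
const db = new aws.rds.Instance("app-db", {
  engine: "postgres",
  instanceClass: "db.t3.micro",
  allocatedStorage: 20, // GiB
  dbName: "app",
  username: "app",
  password: dbPassword.result,
  skipFinalSnapshot: true, // fine for a sketch, not for production
});

// Store the full connection string in Secrets Manager so the app
// (or the GitHub Actions deploy job) can fetch it at runtime.
const dbUrlSecret = new aws.secretsmanager.Secret("app-db-url");
new aws.secretsmanager.SecretVersion("app-db-url-v1", {
  secretId: dbUrlSecret.id,
  secretString: pulumi.interpolate`postgres://app:${dbPassword.result}@${db.endpoint}/app`,
});

export const dbEndpoint = db.endpoint;
export const dbSecretArn = dbUrlSecret.arn;
```

Security groups and IAM are deliberately left at their defaults here; in practice you'd add a security group that only allows the app's subnets to reach the database, and an IAM policy scoped to reading that one secret.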
They have a status page and don't reflect host issues there. I'm sure they don't understand what the status page was created for. And they don't respect users. Move away from Fly if you respect yourself.
Yeah, I think the worst behavior to observe is when people make hard decisions, then start to have doubts about others' judgement and try to distance themselves from the decision. That's even worse than the hard decision itself.
A great reminder that databases are easy; everyone can do them. It's reliable, secure, high-performance databases that are hard. Make sure you choose a provider with proper experience in this space.
I think their proxy could have been written from scratch. Some of the management, billing, API etc. too, but under the hood it's all standard open source stuff like KVM, Firecracker and such?
It's always interesting that people pay for service like this. There can be problems, of course, but the process of solving them shouldn't look like this.
I moved from the DigitalOcean App Platform to fly.io for my web app. It's overall much cheaper but somehow much more difficult to deploy to, and deployments take a while. I wanted to migrate my psql instance as well, but I had the feeling it wasn't as stable as DigitalOcean.
... and there are much cheaper places to host a server if you don't care about databases, like bare metal hosters and tier-2 VPSes with good reliability like Vultr and Digital Ocean.
I'm not a Fly.io user nor affiliated with Fly in any way. I read through these comments and realized it's not possible to distinguish competitors, disgruntled users, and negative astroturf from actual users. The "screw you guys, I'm outta here" opinion of someone running a Discord bot on a free tier who uses the public forum for customer support isn't the opinion you want to use to measure Fly customer support. What are paid production users experiencing?
Paying Fly.io customer with several apps deployed. We've not had any of these issues. Fly Postgres is definitely not RDS, and they could do a better job of setting the appropriate expectations. Fly either needs to use some of their VC money to create a fully managed (autoscaling + replicating) Postgres offering, or make it clear to customers both that these outages are possible and that the customer is responsible for their own data + disaster recovery.
[1] https://community.fly.io/t/service-interruption-cant-destroy...