Fly.io Postgres cluster down for 3 days, no word from them about it (googleusercontent.com)
797 points by burnerbob on July 20, 2023 | 477 comments



There is now a response to the support thread from Fly[1]:

> Hi Folks,

> Just wanted to provide some more details on what happened here, both with the thread and the host issue.

> The radio silence in this thread wasn’t intentional, and I’m sorry if it seemed that way. While we check the forum regularly, sometimes topics get missed. Unfortunately this thread slipped by us until today, when someone saw it and flagged it internally. If we’d seen it earlier, we’d have offered more details then.

> More on what happened: We had a single host in the syd region go down, hard, with multiple issues. In short, the host required a restart, then refused to come back online cleanly. Once back online, it refused to connect with our service discovery system. Ultimately it required a significant amount of manual work to recover.

> Apps running multiple instances would have seen the instance on this host go unreachable, but other instances would have remained up and new instances could be added. Single instance apps on this host were unreachable for the duration of the outage. We strongly recommend running multiple instances to mitigate the impact of single-host failures like this.

> The main status page (status.fly.io) is used for global and regional outages. For single host issues like this one we post alerts on the status tab in the dashboard (the emergency maintenance message @south-paw posted). This was an abnormally long single-host failure and we’re reassessing how these longer-lasting single-host outages are communicated.

> It sucks to feel ignored when you’re having issues, even when it’s not intentional. Sorry we didn’t catch this thread sooner.

[1] https://community.fly.io/t/service-interruption-cant-destroy...


For what it’s worth, I left Fly because of this crap. At first my Fly machine web app had intermittent connection issues to a new production PG machine. Then my PG machine died. Hard. I lost all data. A restart didn’t work - it could not recover. I restored an older backup over at RDS and couldn’t be happier I left.


I left digitalocean for fly because some of their tooling was excellent. I was pretty excited.

I’m back on digitalocean now. I’m not unhappy about it, they’re very solid. I don’t love some things about their services, but overall I’d highly recommend them to other developers.

I gave up on fly because I’d spontaneously be unable to automate deployments due to limited resources. Or I’d have previously happy deployments go missing with no automatic recovery. I didn’t realize this was happening to a number of my services until I started monitoring with 3rd party tools, and it became evident that I really couldn’t rely on them.

It’s a shame because I do like a lot of other things about them. Even for hobby work it didn’t seem worth the trouble. With digitalocean, everything “just works”. There’s no free tier, but the lower end of pricing means I can run several Go apps off of the same droplet for less than the price of a latte. It’s worth the sanity.


I adore DO. They’re seriously underrated. I love how they’ll just give you a server and say here, have at it. No abstractions, no fancy crap, just get out of my way and let me do my thing.


I'm using Digital Ocean App platform, which does pretty much everything for me. It's very simple to use. I can run my app as a single developer without caring about infrastructure for 99% of the time.


Same, it works really well.

Part of what inspired me to give fly.io a shot was that I didn’t love the monorepo deployment story on the app platform. Fly doesn’t have a solution to that, but I suppose I felt less tied to DO at the time because I wasn’t totally content anyways. I’ve discovered since then that I was actually doing it wrong, so I’m way happier. I’m pretty big on monorepos so their whole system fits my workflow remarkably well now.

I’d like to figure out how to prevent deployments when my code doesn’t change in one app, but does in another. At the moment, pushing anything at all will trigger all apps to rebuild and deploy again. Not a huge deal and several orders of magnitude less painful than not being able to deploy at all, haha.
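
For reference, the monorepo shape in an App Platform app spec looks roughly like this (a sketch; the repo name and paths are made up). Each component gets its own source_dir, though as noted, a push to the repo currently still rebuilds every component:

  name: monorepo-example
  services:
    - name: api
      source_dir: apps/api
      github:
        repo: your-org/your-monorepo
        branch: main
        deploy_on_push: true
    - name: web
      source_dir: apps/web
      github:
        repo: your-org/your-monorepo
        branch: main
        deploy_on_push: true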


Do they offer authentication/authorization?

This is the one thing I need in every app and don't want to do myself.


In addition to Supabase Auth the sibling mentions (which I played with very briefly), I've been using clerk.dev (no affiliation) and it's great. Depending on your definition of doing it yourself, it could be just what you want. You have to set some things up, and you're not going to get things like the row-level permissions you get out of the box w/ Supabase, but if you're looking for a quick implementation where things like password reset etc. are handled for you, it might be a good fit.


I've been using Supabase for authentication/authorization in my recent side project.

The main app is node/express running on Digital Ocean, and it connects directly to the Supabase-hosted Postgres for most operations, but then uses the Supabase auth API for auth-related stuff.

Saves a lot of time sending password reset emails etc and the entire project costs less than $5/mo in hosting costs.
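
For a sense of the split, here's a minimal sketch using supabase-js v2 (the env var names and routes are placeholders):

  import express from "express";
  import { createClient } from "@supabase/supabase-js";

  // App data is still queried from Postgres directly;
  // only auth calls go through the Supabase API.
  const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!);
  const app = express();
  app.use(express.json());

  // Supabase handles password hashing, sessions, etc.
  app.post("/login", async (req, res) => {
    const { data, error } = await supabase.auth.signInWithPassword({
      email: req.body.email,
      password: req.body.password,
    });
    if (error) return res.status(401).json({ error: error.message });
    res.json({ session: data.session });
  });

  // Supabase sends the password-reset email for you.
  app.post("/reset", async (req, res) => {
    const { error } = await supabase.auth.resetPasswordForEmail(req.body.email);
    if (error) return res.status(400).json({ error: error.message });
    res.json({ ok: true });
  });

  app.listen(3000);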


Would you consider a project like https://github.com/authcompanion/authcompanion2 for the authentication side? Missing anything?


No I would not.

I don't like self hosting anything that requires its own process. And if I did decide to self host I would choose a more mature project.

This is a very young one man project delegating the heavy lifting to another one man project. And it doesn't appear to support social logins.


thanks for the feedback.


I like https://github.com/goauthentik It has Helm charts and a Terraform provider.


I love their high-value content about DevOps; I have learned most of what I know in this field tinkering with a VPS, following their great tutorials on how to set up stuff.


They filled the Slicehost vacuum nicely in this area. That's where I got my start in running my own servers about 15 years ago and the tutorials were the driving factor.


Seriously! They have an amazing article I followed one time to set up a k8s cluster to run any container I wanted with full automatic ssl provisioning/management and dns. Make a quick little yml file that includes what subdomain it wants to be and kubectl apply. The cluster was like $100 a month all-in and performed like a beast at huge traffic levels, and all I did was follow a tutorial.

I know that’s probably pretty easy for many, but I was pretty new to k8s and it felt like magic.


I wish I could say the same. My ISP and DO have absolutely terrible peering, unfortunately a lot of our internal stuff is hosted there. It’s always fun to git push/pull with 40kb/s on a gigabit connection.


Maybe you could VPN to or proxy through a box with good peering to you and DO?
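
Something like an SSH SOCKS tunnel through a relay with good peering can do it (hostnames here are placeholders; socks5h makes git resolve DNS through the tunnel):

  # Local SOCKS5 proxy through the relay
  ssh -N -D 1080 user@relay.example.com &

  # Send only traffic for the slow host through it
  git config --global http.https://git.example.com/.proxy socks5h://localhost:1080

  # For SSH-based remotes, a jump host does the same job
  git config core.sshCommand "ssh -J user@relay.example.com"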


When I’ve run into this in the past Cloudflare Warp has been a bit of a saviour. It’s a hassle free way to flick a switch and follow a different path over the network.


Wow! Sub-Mbps speeds indicate that there is indeed no peering at all (political issues?), just a transit connection via an overloaded carrier.

collect some evidence, maybe someone wants to do something about it.


I went to DO's site due to your comment and I don't see anywhere where I can just get a server. Do you mean a VPS/Droplet? (I'm looking under Products and Solutions.)


The other commenter was correct - I meant a droplet. Should have been more explicit, apologies. But yeah if you're looking to learn how to work with backends, going through a droplet set up is by far the best way to get started IMO.


Not GP, but yes -- Droplets are DigitalOcean's "servers" (virtual, but nonetheless).

You boot one up in less than 30 seconds, and get ssh access to it almost immediately. It's very BS-free.


historically, I've used Vultr, but I don't see anyone talking about it—I'm curious if anyone else has thoughts on them? (I've been happy, but then again my usage has been exceedingly basic)


I've used Vultr for several years (hobby projects) with no issues. My favorite feature is having a BGP session from my VM, which is unusual among cloud providers. I have an AS and am able to advertise my own IPs from multiple Vultr instances (anycast).
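
In case anyone wants to try it, the BIRD config on each instance is surprisingly small. This is a sketch from memory: the ASN and prefix below are placeholders, and you should double-check the neighbor address and Vultr's ASN against their BGP docs:

  # bird2: announce the same prefix from every instance for anycast
  protocol static anycast_routes {
    ipv4;
    route 203.0.113.0/24 reject;   # your announced prefix (documentation range here)
  }

  protocol bgp vultr {
    local as 65000;                      # your AS
    neighbor 169.254.169.254 as 64515;   # Vultr's route servers, as I recall from their docs
    multihop 2;
    password "session-password-from-portal";
    ipv4 { import none; export where proto = "anycast_routes"; };
  }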


How do you get an AS?


Have used both DO and Vultr for years. Put simply, DO is better, but Vultr isn’t terrible.

Higher number of outages at Vultr over 5 years, but none longer than a few hours. I can’t remember the last DO outage lasting more than a few minutes.

Experienced a Vultr routing problem that lasted several hours; they communicated about it, but it was still a long time to fix.

DO once did an auto-migration of a server to another cluster with an attendant outage that lasted a few minutes at most. No IP changes, completely transparent.


I love DO for projects where I don't need control. For my side project, I eventually migrated to AWS after running into a lot of issues with DO.

Things like they don't give you the postgres root user on their managed postgres. And I ran into issues trying to capture the deployments in code. Their terraform providers are pretty good, but still leave something to be desired. For all its many warts, I'm much happier back on AWS. It did end up more expensive, but it's worth it for the fine grained control in my case.

But I spent the last 5 years as a DevOps/SRE, so... uh... I'm picky.


That's interesting, because granular control is why I enjoy DO, although I'm thinking about it from the server perspective. They set up a machine, give me root access, and that's literally it. I set up my own ssh keys, firewalls, and there's no additional abstraction that I have to learn. I might just be reminiscing because right now I'm on a team where we're writing terraform/helm/k8s in GCP and it makes me want to cry myself to sleep each night lol.


Those are good things to know. I’ve been wondering about their managed databases recently, so I’ll keep that in mind.

I’m nowhere near as picky as you are, but maybe I’ll need to be at some point. As it is I mostly just build stuff and send it to the internet. If it builds and it does what I expected, I’m pretty happy! I don’t often need anything too special.


Same! I've had my first server there for 10 years now. They added a lot of stuff in the meantime, they have AWS-like things you can do. But in terms of launching a VM that just works, they are a great choice.


Yeah I hadn't seen those newer features until recently, the one-click deployments are super cool.


I agree. I can either abstract with the app platform or kubernetes, or I can go straight into the box myself and do whatever needs doing. It has been a real pleasure.

I think fly’s tooling feels better than doctl, but the infrastructure is incomparable at the end of the day. doctl has improved over time too, and with added pressure from newcomers I don’t doubt that it’ll continue to improve.


I find myself going to DO docs on various setup things even when I'm not using said thing on DO (although I'm also a DO customer, and love them for the reasons you've stated).


I really love DO except for one thing - you can't run your own firewall/router there (like OPNsense). Really hard to link systems together.


I moved from DO to Hetzner (cheaper), and I am happy about it.


Does anyone know how Hetzner's pricing is half of DO's yet profitable, while DO is loss-making with a 6% operating margin?


I've been with them for a long time and my guesses would be:

1. Strict rules and strict customer verification. Crypto mining that wastes SSDs is not allowed. Portscans, mass emails, etc. are not allowed. They also don't offer GPUs to the general public because it has been abused in the past. You usually need to send in ID documents just to open an account. My guess is this allows them to avoid most bad actors and, thereby, waste less money on fraud.

2. Extremely long-term investments. They typically build their own hardware and then use it over 10 years. They have their own flea market where you can rent older server models for a steep discount. That means they will have a long time where the hardware is fully paid off and still generating revenue.

3. Great service. With a mid-sized company, I can call their technicians in the middle of the night. The fact that we could call them in case of a crisis has generated A LOT of good will. But I would be truly surprised if they didn't make a profit off those phone calls, as they charge roughly 4x the salary cost.

4. High-margin managed services. In addition to just the cheap servers, they also offer a managed service where they will do OS and security upgrades for you. It's roughly 2x the price of the server and it appears to be almost fully automated. I know some freelance web designers who will insist on using Hetzner Managed for deployment for their clients, because it is just so convenient. You effectively pass off all recurring maintenance for €300 a month and your client is happy to have an emergency phone number (see #3) in case the box goes down.


They run their own data centres and have for a while. There is a pretty big industry for that sort of thing as an alternative to “the cloud” here in Europe.

We used to use nianet to house our hardware in Denmark. Basically these companies do hardware renting, and they also do hardware renting with more steps, where you rent rack space but own the hardware. They provide the place for the hardware, and they have multiple locations so that you get both backup and redundancy. And while that doesn’t scale globally, in 20 years I’ve literally never worked on anything that needed to, beyond having some buffer caches for clients logging in on their vacations or something like that.

What Hetzner seems to be doing with the DO-styled hosting, and this is just a guess, is that they are one of the many EU companies preparing for the big EU exodus from the non-EU cloud. Which is frankly a solid bet these days, where both AWS and Azure are increasing prices and becoming more and more unusable because of EU legislation. Part of this is privacy, which Microsoft and Amazon are great with in terms of compliance, but part of it is also national security. I work in an investment bank that builds solar plants, and since finance and energy are both critical sectors, we risk being told that half of the finance/energy companies in the world can’t use Microsoft, because the EU sees it as a single point of failure if our entire energy sector relies on Azure. Which is sort of reasonable, right? But what this means for us is that we can’t accept vendor lock-in, not really, because we need up-to-date exit strategies for how we plan on being fully operational a month after leaving Azure. Which is easy when you just containerise everything and run it in VMs or similar, and really annoying if you go all in on things like AKS. Which doesn’t help our Azure costs.

Anyway, right now we are planning on leaving Azure because of cost. Not today, not next week, but sometime in the next 5-10 years, and a lot of these EU cloud alternatives that actually operate the hardware instead of renting it are likely going to be very realistic alternatives. And that is the private sector; I spend time in the EU public sector, which involves a massive amount of money, and I’m guessing it’ll leave both AWS and Azure by 2050. Some of these EU cloud initiatives are going to explode when that happens, and right now, Hetzner is one of the best bets.

To get back to your question, DO rents server space. I have no idea where they’d rent it in Germany but they could potentially be renting it from Hetzner.


Couldn't agree more, I think Hetzner is probably Europe's best bet on a hyperscaler. One of the more telling indicators IMO is their growing market share outside of the EU/DACH.

To add on to the comments about Hetzner building their own custom hardware, they also custom built their own software stack. They rejected the hype that was OpenStack and worked diligently on their own hypervisor platform (that they are incredibly secretive about) and that appears to be paying off in spades for them. Most sovereign cloud plays end up being suffocated by the complexity, and incoherence, of the OpenStack ecosystem. It just becomes impossible to ship.

For a fascinatingly different take on how to build a datacenter: https://www.youtube.com/watch?v=5eo8nz_niiM

* Edit: remove speculation about Kubernetes and Hetzner, that was based on hazy memory.


For anyone interested in Kubernetes on Hetzner, there's a really interesting CAPI provider being actively developed:

https://github.com/syself/cluster-api-provider-hetzner


Could you please elaborate on how and what you know about managed Kubernetes on Hetzner?

I have been asking about this for a while and was told there is no way Hetzner would offer such a service. Certain posts on social media have also never been answered with any kind of indication that they are actually working on it.

Please provide some details on this.


They were in person recruiting at KubeCon EU this year and were advertising a good number of Kubernetes engineering roles. Definitely gave me the impression they were taking Kubernetes seriously but looking back a managed offering was just speculation on my part.

So huge grain of salt, you are totally right. It could be internal platform work only.


Commendable to plan a few years ahead, but betting on the state of the cloud business 26 years from now seems a bit over the top.


I think you might misunderstand me. The 2050 is a guesstimate and it's just my opinion on the matter. As far as planning ahead goes, you plan for 5-10 years when you try to figure out where to "iron" your enterprise IT. This is because that's how long your hardware will last if you go the route of renting rack space with your own hardware. I think we tend to plan for 8 years, with some space for "unintended" early failures on things like controllers after 4 years. So while you can contract big-cloud vendors for shorter, I think ours is on 3 year contracts right now, you still sort of do the business case for much longer. Maybe not every 3 years, but at least every 6 years.

You do the same on the other side of the table. Companies like Hetzner know that EU cloud solutions are likely to see growth, so it's only natural that they invest in the tech to put themselves in a prime position to jump on the opportunity. Selling a good product while you do so is the way I would do it personally, but you also have EU cloud initiatives backed by VC money going straight for the endgame.


I think the multi-national energy sector should be working toward those goals even without the regulations. The more prep done before the change, the smoother the transition.


Hetzner also do some crazy-cool stuff, especially around the 7950X3D, cooling, AM5 etc. (https://www.youtube.com/watch?v=V2P8mjWRqpk). They also do some amazing stuff with ARM (their cloud offering is really solid for this).


Overstaffed, overinflated and inefficient Silicon Valley startup vs. organically-grown, well-adjusted, efficient German company.


Not to mention a German company that has price sensitivity in their DNA. Their first servers were just regular consumer tower PCs to drastically cut hardware costs. Now many years later it's a highly optimized mix of consumer, server and inhouse parts (e.g. they use their own racking system instead of 19", and the datacenters are built to make use of convection for a lot of the cooling). They also offer regular Dell servers for those that want them, but at 2x-4x the price of their homegrown boxes.


My partner and I paid a visit to their datacenter in Nuremberg. The answer is efficiency: they get more processing power than the other providers for the energy they have to put in.


What do they do that makes them more efficient?


I'll guess they pick optimized components for it.

Like how the longtime workhorse was a high-performance Skylake desktop CPU without ECC RAM.


The secret is in the cooling system. They have individual cooling systems for each server. Less heat = longer sustained loads


Pardon my ignorance, but I cannot quite see how cooling individual machines vs. the whole rack or row makes a difference in total heat production per machine.


Efficiency. They get much more processing power per kWh of energy than everybody else.


Simple: Hetzner mainly operates in Germany, the people are mostly Germans, and they automate the stuff to a point where a small team could manage it well, even remotely, so they have less cost on human resources.


> Simple: Hetzner mainly operates in Germany, the people are mostly Germans, and they automate the stuff to a point where a small team could manage it well, even remotely, so they have less cost on human resources.

I feel like there might be more to it, especially considering the situation with electricity prices in some places in EU recently.

I used (and still use) a Lithuanian platform called Time4VPS which was cheaper than Hetzner previously, yet had to increase their prices somewhat for that reason. Now only some of their plans are competitive with Hetzner, while Hetzner also provides some managed services as well.

Hetzner docs also went into some of the details regarding the pricing: https://docs.hetzner.com/robot/general/pricing/hetzner-prici...

And yet, I can't help but wonder why they don't give in to the desire to maximize profit margins, like what happened to, say, Scaleway (a good platform, but as expensive as DigitalOcean).


They also build their own servers in their own datacenters


Does digital ocean not do this?


Vultr, the DO competitor, does this IIRC, yet it is not really cheaper.


They don’t.


Where do DO get their servers and data centers from? ... Apparently they run on AWS, I'm surprised


> Apparently they run on AWS, I'm surprised

They don't run on AWS. Not sure what sort of rumors are running :(

> data centers from?

The major players e.g. Equinix, Coresite, etc. Varies per location. Even AWS don't build most of their data centers.


I've wondered how they can host this cheap in Germany given their very high electricity prices.

Maybe that's not actually the dominant cost, or they've optimized everything else so well they can just eat the electric bill.


Same, been enjoying Hetzner's great value for 10 years, and now Hetzner Cloud for 2 years.


I'm enjoying the DO App Platform (Heroku alternative). Do you know if Hetzner has a similar service that I could compare?


Personally I just install Dokku onto the machine; it has replaced all my uses of Heroku (and its competitors).

Additionally, you still keep the full ssh access to the machine if you ever need it.
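
The whole flow is basically this (host and app names are placeholders; the postgres commands need the dokku-postgres plugin):

  # On the server, once
  dokku apps:create myapp
  dokku postgres:create myapp-db && dokku postgres:link myapp-db myapp

  # On your machine: deploys are just git pushes, Heroku-style
  git remote add dokku dokku@your-server.example.com:myapp
  git push dokku main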


Hey I'm building a managed service platform (not quite an app store!) on top of Hetzner -- would you be interested in trying it out?

Contact is in my profile but I'd love to have some more people kick the tires and tell me what they want built the most.


I use both and am very satisfied, especially by Hetzner.


Only complaint with Hetzner is they don't have some kind of OAuth setup for machines or scoped API tokens, just read/write. I'd like to use the former for doing Vault authentication from instances, and the latter for writing a dynamic Vault secret provider.


Can’t you use a third party IAM solution for this? Like Okta or keycloak?


Zitadel supports service users with RBAC. Maybe give it a look/try: https://github.com/zitadel/zitadel


Hey, would you be into trying a managed service platform I'm building for Hetzner? It's called Nimbus[0].

I'd love some feedback, specifically:

- Which services do you most want to use/have managed

- What databases do you find yourself using the most

- Concerning caches, do you use memcached or mostly Redis?

[0]: https://nimbusws.com


I remember someone complaining they had to send Hetzner a passport or some other type of ID to cancel their services.

Does anyone know if that's still the case?


They require a passport or some other sort of ID on registration, and it is weird compared to others. I was not happy with that part, but I have been a happy customer since (almost a decade now).

As far as I know, they do not require any ID when canceling the service.


Well, I would appreciate that, since I was the victim of Russian hackers who had access to all my servers and stuff on Hetzner; they even changed the passwords and mail on Robot, but I restored everything...


Do they have Terraform providers? And managed Postgres? Besides from the ability to just host a Docker container, that is all I need.


Yes and (unfortunately) no. The Terraform providers are here [1], with the official documentation at [2]. Managed databases are not available, though. I think they have some sort of database offering if you select their web hosting options, but you can't just get a managed Postgres instance yourself.

[1] https://registry.terraform.io/providers/hetznercloud/hcloud/... [2] https://community.hetzner.com/tutorials/howto-hcloud-terrafo...
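
To give a flavor of [1] and [2], a minimal sketch (the token variable and server values are illustrative):

  terraform {
    required_providers {
      hcloud = { source = "hetznercloud/hcloud" }
    }
  }

  variable "hcloud_token" { sensitive = true }

  provider "hcloud" {
    token = var.hcloud_token
  }

  resource "hcloud_ssh_key" "me" {
    name       = "me"
    public_key = file("~/.ssh/id_ed25519.pub")
  }

  # A small cloud server in Nuremberg
  resource "hcloud_server" "app" {
    name        = "app-1"
    image       = "debian-11"
    server_type = "cx11"
    location    = "nbg1"
    ssh_keys    = [hcloud_ssh_key.me.id]
  }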

EDIT: For what it's worth, I have had good experiences with app servers hosted on Hetzner Cloud and managed Postgres provided by ElephantSQL (https://www.elephantsql.com/) for Germany-based apps.


Got it, thanks. I've used ElephantSQL as well and I've been happy with them.


Hetzner has a record of going silent on issues, FYI; just hit their subreddit to see all the horror stories.


Same, tried a bunch before moving completely to Hetzner. I'm super happy with their service.


Same here


DO actually does have a free tier! If you use their “app platform” (their equivalent to fly/heroku/render/etc) you can host 3 “static” apps for free. So if you have a Hugo/Jekyll blog or something, it’ll set up a whole little CD system for it for free.


You’re totally right. I kind of forgot about this, in part because I’m over their free limit. I think their static sites are still dirt cheap once you hit that limit, though. I find their pricing totally reasonable for what I need.


I'm a fan of Linode as well.

I want to like Fly, but reliability is one of those things where I feel like every time I investigate moving workloads over, I'm disappointed by these stories over and over again.


Fly has been in my “try later” book for a year or two now. I remember it was hard to deploy anything due to downtime, so I gave up. Sad that stuff like this still happens.

You shouldn’t need to multi-region a Postgres yourself - they should have at least two-data-centre redundancy within the region so it just works.

Hope they get some magic sauce to become better at this.


> Hope they get some magic sauce to become better at this.

When I saw them describe their multiregion SQL replication architecture I thought "what crazy person thought this wouldn't eventually open up a spider's nest of distributed systems errors?"


CockroachDB does this, but that's the result of over 10 years of heads down hard-ass engineering and it's still slower than Postgres because distributed sync is not free. That means you have to provision it properly and with enough resources.

Their license would require a company like fly.io to pay them though, so I'm sure this resulted in fly.io instead trying to whip up an improvised infrastructure on the back of stock Postgres. I bet this cost them a whole lot more than paying CockroachDB would have, but devs have been conditioned that you should never ever pay for software even if it's the result of tons of deep engineering and solves massive brutal problems for you. I also bet there's some not-invented-here ego involved.

P.S. I don't work for CDB but I would absolutely consider them and we may end up using them at some point. They let you do a ton for free. They only charge for stuff you need if you get really really huge or if you are running a SaaS reselling DB services like fly.io would have been doing.


Our multiregion SQL replication architecture is the standard Postgres multiregion replication architecture. We do single-write-leader, multiple reader replicas, like everybody else does.
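
(For anyone unfamiliar, "standard" here means stock Postgres streaming replication, which in outline is just this; hostnames and the replication user are placeholders, and Fly's tooling sets up the equivalent for you:)

  # primary postgresql.conf:
  wal_level = replica
  max_wal_senders = 5

  # primary pg_hba.conf: let the replica connect for replication:
  # host  replication  replicator  10.0.0.2/32  scram-sha-256

  # On the replica: seed from the primary, then run as a standby.
  # -R writes standby.signal plus primary_conninfo so it follows the leader.
  pg_basebackup -h primary.internal -U replicator -D /var/lib/postgresql/data -R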


This is not standard. I see now that it is legacy, but I think it still demonstrates a bit of poor judgement. I believe it was before you were at fly, tptacek

https://fly.io/docs/getting-started/multi-region-databases/


So you didn't have a HA setup with multiple machines and volumes?


Is that even possible on Fly?


He may have been talking about Fly themselves. Certainly having only a single machine to serve a wealthy metropolis of 8 million people seems like amateur hour.


> machine to serve a wealthy metropolis of 8 million

It's actually the only region to serve the entire AU and NZ population with any reasonable latency. (Ok, Singapore can do in a pinch for at least sub 200ms.)

You'd wanna hope its more than one machine!


Obviously, we have a bunch of machines, both workers and edge servers, in Sydney. The whole Sydney region didn't go down; one worker did.


They certainly don't have only a single machine in SYD, since I have a bunch of machines running in SYD that were impacted by this one.


Fly sounds like they need some Conway's Law: a front end that designs the nice API and works on developer affordances, and a backend that keeps it running and reliable.


That's like the main selling point of Fly.


> While we check the forum regularly, sometimes topics get missed. Unfortunately this thread slipped by us until today, when someone saw it and flagged it internally.

If it really got missed, then I don't understand how the thread was made private to only logged-in users?


It looks like all 166 threads with the "App not working" tag are invisible when not logged in. So I'm guessing somebody applied that tag retroactively.

https://community.fly.io/c/questions-and-help/app-not-workin...

EDIT: it now appears that the "app-not-working" tag itself has been deleted, and no longer shows up even when logged in.


In another comment here, they're saying they just deleted that tag to avoid this access issue — https://news.ycombinator.com/item?id=36810393


Good call-out - please, as an internet mob, let us not ascribe to malice what can be attributed to the sheer unintentional impacts of complex software.


This is why companies should not run their own forums. It's cheap support and marketing, it's not really community.


I never thought to make friends with people whose only thing in common with me is that they shop at the same place. Companies creating a "community" is exactly as you described.


I am an interested party in the process space, and I think that's ungenerous. When you work with a complex tool every day, and you have to find solutions for this or that issue, develop strategies for this or that business case, etc etc, you're not really shopping - it's more like you're in the trenches. At that point, finding people who have the same issues and talking shop with them, can be great for both knowledge exchange and camaraderie. Linux wouldn't be what it is today without the LUGs era, for example.


We're talking about private companies running forum software instead of providing support. We're not talking about the power of IRC or mailing list communities for open source projects and the like.

If I pay for something I want the person I pay money to help me fix problems I get.


Whoa, what? That's a much bigger red flag than the downtime itself.


Ok as long as we’re getting conspiratorial, something similar I observed has bugged me.

About a year ago fly awarded a few people in the forums, I think it was 3, the “aeronaut” badge. Basically just pointless bling for a “routinely very helpful” person or somesuch. Still, I can imagine it was cool to get it. No, it wasn’t me.

One person I saw with it absolutely deserved it: this person is, to this day, always hopping in and helping people; linking to docs; raising their own issues with a big dose of “fellow builder” understanding and empathy; that sort of person. My own queries typically led me to a thread that this person had answered. In short - the kind of helpful, proactive, high-knowledge volunteer early adopter that every community needs - and a handful are blessed to find.

Then one day I saw this same person had offered — to one random newbie with build problems in one of the many HALP threads — a reply like, “maybe Fly isn’t the best option for you. here are some other places that can host an app”.

The thread was left alone and faded, like many when a lost newbie is involved. But 1 day later, I noticed this tireless early adopter no longer had their “aeronaut” badge.

I still refuse to believe my own eyes about something that petty.


Get out of here with this nonsense. We tell people when we’re a bad option all the time. Do you really think we have a desire (or time) to punish somebody for doing the same?

Also, here’s the long forgotten badge, still with 3 people… https://community.fly.io/badges/107/aeronaut


> Do you really think we have a desire (or time) to punish somebody for doing the same?

idk man, there's these awfully convenient disappearing forum threads too. The benefit of the doubt is starting to expire.

I see you're a co-founder, so presumably you have some sway on priorities and skin in the game. I think you should take the reputational damage you're accruing here much more seriously than you apparently are. A few more incidents like this and it won't just be you telling people you're a bad option.

* edited to tone down the forum thread disappearance angle. FWIW I do believe that it likely wasn't deliberate. My main point was that these things add up and "of course we wouldn't do that!" starts to ring a little hollow the 10th time you hear it...


> you've just been caught hiding inconvenient forum threads too

FWIW, I do believe them when they say this wasn't intentional. Considering how the Internet operates, they would be incredibly stupid to do something like that on purpose.

That being said, the way the entire affair was handled certainly leaves a lot to be desired.


I actually believe them on that too, FWIW. This time. It's just too dumb. I hope, for their sake, it's the truth.

I was really just trying to point out that this kind of good faith benefit-of-the-doubt has a limit, and fear of reaching that limit should be keeping people at fly up at night a lot more than it apparently is. I don't know how many colossal public fuckups a company can endure before its reputation is permanently ruined, but it's definitely not infinite.


Why are you acting so hostile? If you don't like that the community is dunking on you, then maybe posting on Hacker News isn't for you.


Why is anyone on HN "dunking" on Fly.IO of all companies?

Michael - Don't take the bait.

As someone who has zero affiliation with Fly.IO other than a few PRs to their OSS (I don't even know Michael), I greatly appreciate the contributions they have given back to the community.

There are a lot of great hosting companies. Fly.IO stands out due to their revolutionary architecture and contributions back to the OSS community. I wish more companies operated like this.

It's understandable some are upset about an outage. But Fly is doing really interesting and game-changing things, not copying a traditional vmware, cpanel or k8s route.

Just as a reminder to what this company has offered back to everyone.

SQLite: Ben Johnson's OSS work around SQLite stands out. Fly.IO and his work have really made SQLite a contender.

- https://fly.io/blog/all-in-on-sqlite-litestream/
- https://fly.io/blog/introducing-litefs/
- https://github.com/superfly/litefs
- https://github.com/benbjohnson/litestream
- https://fly.io/blog/sqlite-internals-wal/
- https://fly.io/blog/wal-mode-in-litefs/

Who really considered SQLite as a production option before Fly and Ben? Not me.

Firecracker: Firecracker is amazing, but difficult to debug when something bad happens. There aren't a ton of people in devops who would share what they have. If you've ever used Firecracker, you've really been helped a lot by the various guides they have provided back to the community, like these:

- https://fly.io/docs/reference/architecture/
- https://fly.io/blog/fly-machines/
- https://fly.io/blog/sandboxing-and-workload-isolation/

Their architecture is beautiful and revolutionary. They're probably the first or second ones to find a lot of the new edge cases as they grow.

It's a lot harder to be the first one over the wall than it is to copy. They've literally given the average developer a blueprint to build scalable businesses that compete with their own.


Conspiratorial or not, that's enough for me to never use it. God forbid someone recommends another platform that handles your clear shortcomings.


> Conspiratorial or not that's enough for me to never use it

Well if it's not true then that would be a silly reason to pick to not use them.


Should losing a single host machine be a big deal nowadays? Instance failure is a fact of life.

Even if customers are only running one instance, I would expect the whole thing to rebalance in an automated way especially with fly.io being so container centric.

It also sounds like this is some managed Postgres service rather than users running only one instance of their container, so it’s even more reasonable to expect resilience to host failure?


Fly Postgres is not managed Postgres; it's CLI sugar over a normal Fly app, which the docs (https://fly.io/docs/postgres/) make quite clear. Their docs also make clear that if you run Postgres in a single-instance configuration and the hardware it's running on has problems, your database will go down.

I believe the underlying reason that precludes failing over to a different host machine is that Fly volumes are slices of host-attached NVMe drives. If the host goes down, these can't be migrated. I _think_ instances without attached volumes will fail over to a different host.

Of course, that's not ideal, and maybe their CLI should also warn about this loudly when creating the cluster.


If you lose a single instance on RDS and you don't have replication set up, you'll also have downtime. (Maybe not with Aurora?)

And +1 to the sibling comment; Fly makes it very clear that single instance postgres isn't HA, and talks about what you need to do architecturally to maintain uptime.


Downtime, but limited downtime, since the data is stored redundantly across multiple machines in the same AZ. So unless the AZ goes down (which is a different failure from what happened here), you can restart the DB on a different instance pretty quickly, and I'm guessing AWS will do it automatically for you.

edit: Remove triple as not certain about level of redundancy


I don't believe their RDS / EBS has 3x redundancy. With SSD, that would be super costly for them. But if that's correct, that would be incredible.


May not be 3x but it is replicated so even a total instance failure would not make you lose data:

>Amazon EBS volumes are designed to be highly available, reliable, and durable. At no additional charge to you, Amazon EBS volume data is replicated across multiple servers in an Availability Zone to prevent the loss of data from the failure of any single component. For more details, see the Amazon EBS Service Level Agreement.

https://aws.amazon.com/ebs/features/#Amazon_EBS_availability...


> Maybe not with Aurora

If a read replica fails, I'd expect no downtime (possibly a few errors as connections get cut off abruptly). Although there's always the risk that the remaining instances aren't able to handle the additional load.

If the master fails, you'll get a ~2min downtime


Yeah but you won't lose your data. They have backup infrastructure and EBS is rock solid.

Down time is one thing. Data loss is something else.


> Should losing a single host machine be a big deal nowadays? Instance failure is a fact of life.

Depends on where in your development cycle you are. If you just got started and haven't even figured out what you're actually building (prototyping), you shouldn't really use a hosting provider that randomly loses instances.

If, on the other hand, you have done everything to improve your application's performance, had to resolve throughput issues with a distributed architecture, and are now running 10+ instances, then losing one host shouldn't impact you too much. But you really shouldn't start this way; it's doing web services the hard way and introduces a lot of complexity you shouldn't want to deal with while you're still trying to find product-market fit.


GP is referring to fly.io architecting for single instance failures, not its customers.


I was confused why support for a platform failure relies on a forum that employees may or may not check. After checking the docs[1], apparently you have to be on a paid plan (at least $29/mo) to access email support, so you may not have it even if you're paying for resources.

I won’t be using it for side projects where I’m okay with paying $5-10/mo but don’t want to have three day outages.

[1] https://fly.io/docs/about/support/


Forewarning: I am not being critical of fly.io nor their free support whatsoever when I say this.

Could they have "been better" from a technical perspective? I see their name a lot on HN, so I know they are doing really cool + advanced things, and this is probably some super small edge case that slipped through the cracks.

Could they have added some message / do we as the HN community feel they needed to be like "we're gonna add some extra logging/monitoring going forward so it won't happen again"?

By all means, they probably don't owe anybody in terms of stability + uptime guarantees when it comes to a free tier. Sh*t happens.


FWIW: I am on the bottom tier of the paid plans ($29/mo) so I could get access to the email support, and even with that their response time is still not great.

I have an ongoing issue with one of my PG clusters where one of the nodes was failing and all my attempts at fixing it are failing (mainly cloning one of the other machines to bring the cluster numbers back to normal).

I emailed my account’s support email mid Friday morning last week and did not hear back until this past Monday night.

Sucks, because like a lot of others in this thread I like what Fly is trying to do and am rooting for them, but IMO they should use a significant chunk of that funding they just received on hiring a ton of SREs and front line customer support.

EDIT: I should add, the past times I have emailed them the response time was good. It's just this most recent time was so egregious (3 days!) to get even that initial response that I bring it up.


They may not owe anyone anything but over time these types of issues can cause a large reputation hit.

If I was just searching online or trying to find out what various communities think about Fly.io and see several threads about major outages with poor communications, do you think I will use their services? It would be an immediate pass.

It takes a long time to build a reputation, and you can lose it instantly.


They broke uptime for the paid tier, not just the free tier.

The relevance of paid/free is that free (and cheap paid) plans don’t get fly support over email


The irony, or perhaps the tragedy, of building a low-friction service is that you have to have experts on the lower-level, high-friction stuff.

I would hope that after a couple of hours of downtime, they'd bring up a fresh machine with Ansible or whatever. Hardware or an AWS/GCP VM.


> I would hope that after a couple of hours downtime, they'd bring up a fresh machine with Ansible or whatever.

It is not just about a fresh machine, which hopefully sits in each datacenter. I can imagine they needed a clone of the system due to the design of the fly.io service, and that's where the "fun" begins.


Seems like the OP should have made a HN thread in the first place instead of posting to community.stri^H^H^H^Hfly.io


But HN is not a customer service forum?


It's often used as an escalation point when people can't get support from certain companies (most notably, Google). If an employee lurks in here and sees your post, they might contact the right people to fix your issue.

Smaller companies also do a lot of PR damage control and constantly monitor HN for threads complaining about their services.

You're not wrong but that's how it works.


That's not what happened here. We're talking about an outage that was resolved days ago, long before this thread went up.


> But HN is not a customer service forum?

you must be new here ;-)


> ^H^H^H^H

alt+backspace will wipe that substring in most shells in one go.


The ^H^H^H^H above was for human readers though.


ctrl-w my friend. Don't even have to put down your drink.


It would lose the comic appeal though.


Thank you for that little nugget. I learned something today :)


Why is it my responsibility to move instances from machine to machine to mitigate a cloud host's outages? What is their utility if not performing the bare minimum of cloud host responsibilities keeping my container up?


> We strongly recommend running multiple instances to mitigate the impact of single-host failures like this.

Make it impossible not to do so, and make it frictionless then.


That would presumably cost more money which is not a trade off every user would want to make.


You cannot make every user happy, and it's generally better to not have a user than to have an unhappy user.


Is it me or the page is now gone?

"Oops! That page doesn’t exist or is private."

Edit: Ok, I can see after sign up / log in.


Total shot in the dark, but was it a transaction ID wraparound?


Fly have tried to hush this by making the thread [1] private to anyone not logged in.

One quote from thread:

> This is the second time I’ve had this kind of issue with Fly, where my service just goes down, Fly reports everything healthy, and there’s literally no information and nothing I can really do other than wait and hope it comes back up sometime

Another user:

> We had four machines (app + Postgres for staging and production) running yesterday, and three of the four (including both databases) are still down and can’t be accessed. I can replicate the issues others have mentioned here.

> This is our company’s external API app and so the issue broke all of our integrations.

> Our team ended up setting up a new project in fly to spin up an instance to keep us going which took a couple of hours (backfilling environment variables and configuration etc, not a bad test of our DR ability).

> There is no way I can find to get the data from the db machines. Thank goodness this isn’t our main production db and we were able to reverse engineer what we needed into there.

> Very keen to hear what’s happening with this and why after so many hours there’s no more info or updates.

Another user:

> As an aside, it’s kind of a kick in the teeth to see the status page for our organization reporting no incidents - the same page that lists our apps as under maintenance and inaccessible!

Another user:

> I’m feeling very lucky that none of our paid production apps or databases are affected currently (only our development environment is), but also really surprised that the issue has been ongoing for 17 hours now with no status page update, no notifications (beyond betterstack letting us know it was down) and one note on the app with not much info as to whats going on.

> It really worries me what would happen if it was one of our paid production instances that was affected - the data we’re working with can’t simply be ‘recovered’ later, it’d just get dropped until service resumed or we migrated to another region to get things running again

> Keen to know whats wrong and whats being done about it

Full thread (as at time of HN post; more has been added since): https://pastebin.com/ebmCSZkC

Someone tweeted Fly CEO: https://twitter.com/SouthPawNZ/status/1682181533673857024

[1] https://community.fly.io/t/service-interruption-cant-destroy...


The worst thing about Fly is that when something goes wrong, it's not just one thing; there's a bunch of things broken at the same time, and their status page will show everything green.

Their typical response is either silence or something casual ("oh, this is what happens when we deploy on Friday"). The product looks amazing, but it's just a nice package around the most unreliable hosting service I've ever used.

You can't just keep breaking people's work every week, making them spend their weekend nights trying to bring back their stuff, and giving these "we could have done better" answers. This is an excuse for exceptions, not patterns.


> when something goes wrong, it's not just one thing; there's a bunch of things broken at the same time, and their status page will show everything green

How dare they use AWS' patented approach to having a service outage.


I wouldn’t put AWS and Fly in the same sentence. AWS is magnitudes more reliable, with better support.


I didn't mention how often they fail.

I merely mentioned two characteristics of how they fail, that are spectacularly shit.


Was that an attempt to discredit criticism of Fly's operational processes by pointing out that another company also has issues in how they handle outage notifications?


[flagged]


Was the sarcasm an attempt to discredit criticism of Fly's operational processes by pointing out that another company also has issues in how they handle outage notifications?


You could have just answered my previous question with "No, I am not familiar with sarcasm".

Because you clearly don't understand sarcasm, I'll be blunt:

No, I'm not trying to discredit any criticism of this provider. I agree with the comment I replied to, that this kind of failure mode is fucking ridiculous. My response thus is not an attempt to normalise this, but to highlight the elephant in the room, which is that AWS - the gold standard for "hosting" services for many a startup and techbro - *also* has Rube Goldberg like levels of interdependence that cause cascading failures *every time* something goes wrong, and *also* have a status board so confidently green that it may as well be an ad for lawn care products.


Thanks for clarifying. I understand now you wanted to call attention to the fact that another famous organization in the same space as Fly.io also has such bad practices. Thanks for the data point.


I don’t think you understand the comment you’re replying to at all. Either that or you’re trying to deflect even further


> This is an excuse for exceptions, not patterns.

Love this


There's a lot of bullshit in this HN thread, but here's the important takeaway:

- it seems their staff were working on the issue before customers noticed it.

- once paid support was emailed, it took many hours for them to respond.

- it took about 20 hours for an update from them on the downed host.

- they weren't updating the affected users about the downed host or ways to recover.

- the status page was bullshit - just said everything was green even though they told customers in their own dashboard they had emergency maintenance going on.

I get that due to the nature of their plans and architecture, downtime like this is guaranteed and normal. But communication this poor is going to lose you customers. Be like other providers, who spam me with emails whenever a host I'm on even feels ticklish. Then at least I can go do something for my own apps immediately.


Not a great summary from my perspective. Here's what I got out of it:

- Their free tier support depended on noticing message board activity and they didn't.

- Those experiencing outages were seeing the result of deploying in a non-HA configuration. Opinions differ as to whether they were properly aware that they were in that state.

- They had an unusually long outage for one particular server.

- Those points combined resulted in many people experiencing an unexplained prolonged outage.

- Their dashboard shows only regional and service outages, not individual servers being down. People did not realize this and so assumed it was a lie.

- Some silliness with Discourse tags caused people to think they were trying to hide the problems.

In short, bad luck, some bad procedures from a customer management POV, possibly some bad documentation resulted in a lot of smoke but not a lot of fire.


Apologies for repeating myself, but:

You get to a certain number of servers and the probability on any one day that some server somewhere is going to hiccup and bounce gets pretty high. That's what happened here: a single host in Sydney, one of many, had a problem.

When we have an incident with a single host, we update a notification channel for people with instances on that host. They are a tiny sliver of all our users, but of course that's cold comfort for them; they're experiencing an outage! That's what happened here: we did the single-host notification thing for users with apps on that Sydney host.

Normally, when we have a single-host incident, the host is back online pretty quickly. Minutes, maybe double-digit minutes if something gnarly happened. About once every 18 months or so, something worse than gnarly happens to a server (they're computers, we're not magic, all the bad things that happen to computers happen to us too). That's what happened here: we had an extended single-host outage, one that lasted over 12 hours.

(Specifically, if you're interested: somehow a containerd boltdb on that host got corrupted, so when the machine bounced, containerd refused to come back online. We use containerd as a cache for OCI container images backing flyd; if containerd goes down, no new machines can start on the host. It took a member of our team, also a containerd maintainer, several hours to do battlefield surgery on that boltdb to bring the host back up.)

Now, as you can see from the fact that we were at the top of HN all night, there is a difference between a 5 minute single-host incident and a 12-hour single-host outage. Our runbook for single-host problems is tuned for the former. 12-hour single-host outages are pretty rare, and we probably want to put them on the global status page (I'm choosing my words carefully because we have an infra team and infra management and I'm not on it, and I don't want to speak for them or, worse, make commitments for them, all I can say is I get where people are coming with this one).


Why are your customers exposed to this? This sounds like a tough problem, and I'm sympathetic to you personally, but it sounds like there's no failover or appropriate redundancy in place to roll over to while you work to fix the problem.

edit: I hope this comment doesn't sound accusatory. At the end of the day I want everyone to succeed. I hope there's a silver lining to this in the post-mortem.


The way to not be exposed to this is to run an HA configuration with more than one instance.

If you're running an app on Fly.io without local durable storage, then it's easy to fail over to another server. But durable storage on Fly.io is attached NVMe storage.

By far the most common way people use durable storage on Fly.io is with Postgres databases. If you're doing that on Fly.io, we automatically manage failover at the application layer: you run multiple instances, they configure themselves in a single-writer multi-reader cluster, and if the leader fails, a replica takes over.

We will let you run a single-instance Postgres "cluster", and people definitely do that. The downside to that configuration is, if the host you're on blows up, your availability can take a hit. That's just how the platform works.
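
Concretely, standing up the HA configuration looks roughly like this (flag names as of this writing; check "fly postgres create --help" for the current set):

  # Create a 3-node cluster (primary plus replicas) instead of a single machine
  fly postgres create --name mydb --region syd --initial-cluster-size 3

  # Or add a replica to an existing cluster by cloning a member
  fly machines clone <machine-id> --region syd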


I see. Have you considered eliminating this configuration from your offering? It sounds like the terminology could confuse people, and it may be the case that they're assuming that a host isn't really what it is (a single host). This kind of thing is difficult for those seeking to build managed services, because I think people expect you to provide offerings that can't harm them when the cause is related to the service they're paying for and it's difficult to figure out which sharp objects they understand and which ones they don't. People should know better, but if they did would they need you?

If this sounds ludicrous, then I think I probably don't understand who Fly.io wants to be and that's okay. If I don't understand, however, you may want to take a look at your image and messaging to potentially recalibrate what kind of customers you're attracting.


Plenty of people would rather take downtime than pay for redundancy, for example for a test database.

AWS RDS lets you spin up a RDS instance that costs 3x less and regularly has downtime (the 'single-az' one), quite similar to this.

Anyone who's used servers before knows that "a single instance" means "sometimes you might have downtime".

Computers aren't magic, everyone from heroku (you must have multiple dynos to be high availability) to ec2 (multiple instances across AZs) agree on "a single machine is not redundant". I don't see how fly's messaging is out of line with that. They don't tell you anywhere "Our apps and machines are literally magic and will never fail".


Single-AZ is not single-host though, and while a single AZ can go down in a major event, it doesn't break because a single piece of hardware failed.


Sure, but isn't this more about risk tolerance at this point, and about how much your customers care? The responsibility should be on the customer's end. Running on EBS/RDS doesn't guarantee you won't lose data. If you care about it, you enable backups and test recovery.

Just because some customers have stricter availability requirements than others doesn't mean we shouldn't offer options for people who don't have the same requirements or are willing to work around them.


I don't disagree. I was latching onto the idea that people are running single-node "clusters". Whatever it is, it isn't a cluster.


Unless something has changed and I'm out of date, I think a piece of context here is fly postgres isn't really a managed service offering. From what I've seen fly does try to message this, but I think it's still easy for some subset of customers to miss that they're deploying an OSS component, maybe deployed a non-HA setup and forgot, and it's not the same as buying a database as a service.

So hopefully, as fly.io gets more popular, there will be some compelling managed offerings. I saw comments at one point from the Neon CEO about a fly.io offering, but I'm not sure if that went anywhere. I'm sure customers can also use Crunchy or other offerings.


It seems to me like there's room for improving your customers' awareness around what is required for HA and how to tell when they are affected by a hardware issue. On the other hand, it may just be that the confusion is mostly amongst the casual onlookers, in which case you have my sympathies!


I'm not sure this will make any sense, but: customers who DON'T WANT to be aware of what's required for HA (say, solo devs) choose these kinds of hosting anyway. Even if you publish educational articles, I'm unsure they'll get read. Putting a BANNER IN RED LETTERS into the CLI output, plus a link to an article, might work though.

What do you think?


This is exactly how it currently works:

  $ fly volumes create mydata
  Warning! Individual volumes are pinned to individual hosts.
  You should create two or more volumes per application.
  You will have downtime if you only create one.
  Learn more at https://fly.io/docs/reference/volumes/
  ? Do you still want to use the volumes feature? (y/N)
(and yes, the warning is already even in red letters too)


Sounds like it hasn't helped, then; no need to even guess. One of those moments when you have mixed feelings about being right.


I agree, articles tend not to get read by those who need them most. A warning from the CLI and a banner on the app management page with a link to a detailed explanation would seem like a good approach.

edit: sibling post shows there is such a message on the CLI. The only other thing I can think of is an "Are you sure you want to do this?" prompt, but in the end you can't reach everybody.


There is an "Are you sure want to do this?" prompt!


Make them type the phrase "I'm OK with downtimes of arbitrary length"!

I kid, seems like you guys did what you could.


Indeed


> somehow a containerd boltdb on that host got corrupted, so when the machine bounced, containerd refused to come back online. We use containerd as a cache

Hey, even if I can feel sympathetic about this course of unfortunate events, it's hard not to comment:

if you're using a cache, you should invalidate it on failure!


It's a read-through cache. This wasn't a cache invalidation issue. It's a systems-level state corruption problem that just happened to break a system used primarily as a cache.


What I meant is that if the compromised host was unable to use the broken boltdb cache, the cache should have been zeroed and repopulated. Was it really hours of such a cache rebuild vs. hours of trying to fix the boltdb?

Btw, I am happy I keep only small amounts of data in any of my bolt databases...


This isn't a boltdb we designed. It's just containerd. I am probably not doing the outage justice, because "blitz and repopulate" is a time-honored strategy here.
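
For anyone wondering what "blitz and repopulate" would mean here: since the boltdb is just an index over cached OCI images, in principle (and assuming nothing else you care about lives in that directory, which is not a safe assumption on every host) you can throw containerd's state away and let it re-pull images on demand:

  # the blunt-instrument version: wipe containerd's state entirely
  $ systemctl stop containerd
  $ rm -rf /var/lib/containerd
  $ systemctl start containerd

The tradeoff is that every subsequent machine start on the host has to re-fetch its image, which is its own kind of slow.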


> over 12 hours

How much is over 12 hours? 12 hours and 10 minutes? 13 hours? 67 days?


I'm surprised by your risk tolerance. If I had any cloud service at this level in my stack go down for three days, I'd start shopping for an alternative. This exceeds the level of acceptability for me for even non-HA requirements. After all, if I can't trust them for this, why would I ever consider giving them my HA business? Just based on napkin math for us, this could've been a potential loss of nearly half a million dollars. Up until this point, I've looked at Fly.io's approach to PR and their business as unconventional but endearing. Now I'm beginning to look at them as unserious. I'm sorry if that sounds harsh. It's the cold truth.


I think you're not exposed enough to the reality of hardware. There was no need for the host to come back online at all. I think it was a mistake for Fly.io to even attempt it. Just tell the customer the host was lost and offer them a new one (with a freshly zeroed volume attached). You rent a machine, it breaks, you get a new one.

If they're sad that they lost their data, it's their fault for running on a single host with no backup. By actually performing an (apparently) difficult recovery, they reinforced their customers' erroneous expectation that Fly is somehow responsible for the integrity of the data on any single host.


They're not responsible for extreme data recovery, but (almost?) all of the customer data volumes on that server were completely intact. They damn well should be responsible for getting that data back to their customers, whether or not they get the server going again.

If you run off a single drive, and the drive dies, any resulting data loss is your fault. But not if something else dies.


I'm absolutely 100% certain that AWS (for example) wouldn't do that for you with the instance types that feature direct attached storage.


Directly attached storage in AWS is a special niche that disappears when you so much as hibernate. And even then they talk about how disk failure loses the data but power failure won't.

This is much closer to EBS breaking. It happens sometimes, but if the data is easily accessible then it shouldn't get tossed.


In hindsight I wish I could edit, because my above comment was pretty trigger-happy and overly focused on the amount of downtime. It was colored by some existing preconceptions I had about Fly, and I'm honestly surprised it continues to be upvoted. When I made the comment I hadn't yet learned some of the bits you mention here at the end from another thread. Anyway, I tend to agree overall. I actually suggested Fly reconsider even offering this configuration, given that they refer to it as a "single-node cluster", which is an oxymoron.


Is this the posture of other hosting providers? If not, it seems other hosting providers offer better quality of service.


I would think so, it's honestly strange to think about. The idea of having the node come back after it broke is a bit ridiculous to me. A node breaks, you delete it from your interface and provision a new one, the idea of even waiting 5 minutes for it to come up is strange. This whole conversation seems detached from how the cloud is supposed to and has operated in the past decade.


You're saying a single server failure is going to cost your business half a million dollars?

This was a server with local NVMe storage. The simplest thing to do would have been to just get rid of it, but we have quite a few free users with data they care about running on single node Postgres (because it's cheaper). It seemed like a better idea to recover this thing.


No, it wouldn't, at least not given the contextual details of this situation, because we wouldn't run things that way. Honestly, parts of my above comment still hold, but I admit in the moment it was a bit impulsive of me, because I hadn't yet learned all of the details necessary to make that judgment call. That number is right under slightly different circumstances, if you're asking, but it sounds like you were trying to prove a point. If that's true, you succeeded. I learned a bit later that what they were calling a cluster was a single server, and that's just... yeah.


(Fly.io employee here)

To clarify, we communicated this incident to the personalized status page [1] of all affected customers within 30 minutes of this single host going down, and resolved the incident on the status page once it was resolved ~47h later. Here's the timeline (UTC):

- 2023-07-17 16:19 - host goes down

- 2023-07-17 16:49 - issue posted to personalized status page

- 2023-07-19 15:00 - host is fixed

- 2023-07-19 15:17 - issue marked resolved on status page

[1] https://community.fly.io/t/new-status-page/11398


Dude. I don't sit at home refreshing status pages. Send me an e-mail.

That's how other [useful] providers notify their customers that one of their hosts went down unexpectedly. Linode will send me 6 emails when they need to reboot something. Even Oracle sends me notices about network blips. I believe I've gotten one from AWS, but I also know sometimes their gear gets stuck in a bad state and I didn't get a notification, which was super annoying because it took forever to figure out it was AWS's faulty state.


How do you know emails weren't sent in addition to the status page changes?


The whole point of this HN thread is customers weren't getting regular updates. If they had they wouldn't be on a random community forum trying to get support's attention.


Ouch?

The bad news is that I'd be out of a job if I chose your service in this instance. 47 hours is two full days. For an entire cluster to be down for that long is just unacceptable. Rebuilding a cluster from the last-known-good backup should not take that long unless there are PBs of data involved; dividing such large data stores into separate clusters/instances seems warranted. Solution archs should steer customers to multiple, smaller clusters (sharding) whenever possible. It is far better to have some customers impacted (or just some of your customers' customers) than to have all impacted, in my not-so-humble opinion.

And, if the data size is smaller, you may want to trigger a full rebuild earlier in your DR workflows just as an insurance policy.

The good news is that only a single cluster was impacted. When the "big boys" go down, everything is impacted... but customers don't really care about that.

Not sure if this impacted customer had other instances that were working for them?


> The bad news is that I'd be out of a job if I chose your service in this instance. 47 hours is two full days.

There was one physical server down. That's it. They even brought it back.

I've had AWS delete more instances, including all local NVMe store data, than I can count on my hands. Just in the last year.

Those instances didn't experience 47 hours downtime, they experienced infinite downtime, gone forever.

I guess by your standard I'd be fired for using AWS too.

But no, in reality, AWS deletes or migrates your instances all the time due to host hardware failure, and it's fine because if you know what you're doing, you have multiple instances across multiple AZs.

The same is true of fly. Sometimes underlying hardware fails (exactly like on AWS), and when that happens, you have to either have other copies of your app, or accept downtime.

I'll also add that the downtime is only 47 hours for you if you don't have the ability to spin up a new copy on a separate fly host or AZ in the meanwhile.


The core issue here is that fly doesn't offer distributed storage, only local disks.

Combine that with them having tooling for setting up Postgres built on top of single node storage, and you have the downtime problems and unhappy customers as a given.


When does AWS delete instances? Migrate, sure, and yes, local storage is supposed to be treated as disposable for that reason, but AFAIK only spot instances should be able to be destroyed outright.


To quote from their docs: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance...

> If your instance root device is an instance store volume, the instance is terminated, and cannot be used again.

See also the aws "Dedicated Hosts" and "Mac Instances". Those also have similar termination behavior.

The majority of my instances lost are from the instance store thing.


The underlying problem is that Fly doesn't provide non-local, less-eager-to-disappear, volumes.


Since the post said "cluster", I assumed it was a set of instances with replicas and the like.

I've never experienced AWS killing nodes forever; at least not DB instances.


Disclaimer: I work at AWS.

> Rebuilding a cluster from the last-known-good backup should not take that long

It's not even clear if that's the right thing to do as a service provider.

Let's say you host a database on some database service, and the entire host is lost. I don't think you want the service provider to restore automatically from the last backup, because that makes assumptions about how much data loss you can tolerate. If it just comes back up from the last backup, you're potentially missing a day of transactions that you thought were there and that magically disappeared, as opposed to knowing they disappeared in a hard break.


Restoring from backup doesn't mean you actually have to use it - just prepare it in case you need it. Since this can take time, starting such a restore early would be an insurance policy, if needed. If there are snapshots to apply after the last-known-good backup, all the better.


This was a single physical server running multiple VMs using local NVMe storage. It impacted a small fraction of customers.


Haha, imagine what the AWS status page would look like if they had to update their global status page anytime a single host would go down in any region.

Fly.io messed up, they didn't want to be a Heroku clone, but their marketing and their polished user experience design made it seem like they would be one anyway.

And as a reward now they have to deal with bottom of the barrel Heroku users that manage to do major damage to their brand whenever a single host goes down. Who would have predicted that corporate risk?


I've personally had this experience with Fly on a personal project. My project went down but their status pages said everything was up. It's fine since it's personal for fun project but for anything more serious I don't know if I'd be comfortable using them.


>> There's a lot of bullshit in this HN thread

Then consider replying directly to the post containing wrong information instead of making such a generalised accusation.

>> I get that due to the nature of their plans and architecture, downtime like this is guaranteed and normal.

What other cloud providers have downtimes of 20 hours? There must be a lot to call this "guaranteed and normal".

Sadly, I've always felt a good amount of passive aggressiveness in many of the HN threads where fly.io is involved.


> there’s a lot of bullshit

…proceeds to make a bunch of non-factual statements.


I really want to love Fly.io. It's super easy to get setup and use, but to be honest I don't think anyone should be building mission critical applications on their service. I ended up migrating everything over to AWS (which I reallllly didn't want to do) because:

* Frequent machines not working, random outages, builds not working

* Support wasn't responsive, didn't read my questions (kept asking same questions over and over again) -- I paid for a higher tier specifically for support.

* General lack of features (can't add sidecars, hard to integrate with external monitoring solutions)

* Lack of documentation -- For happy path its good but any edge cases the documentation is really lacking.

Anyway, for hobby projects it's fine and nice. I still host a lot of personal projects there. But I had to move my company's infrastructure off of it because it ended up costing us too much time and frustration. I really had high hopes going into it, as I had read it was a spiritual successor of sorts to Heroku, which was an amazing service in its day, but I don't think it's there yet.


My experience was the same. I stopped using it for hobby projects recently when I had two consecutive days of being unable to build anything. The same stuff that built the week before, built fine locally, then eventually built on fly again — just, inexplicable downtime with no word from support.

Their free tier is very generous. You can get a lot happening and stay under their billing threshold. But, I like to get stuff done. I have a family. I code in my spare time very rarely, and I need a service that’ll let me just build my goddamn project. This was a small static site built by Node, so nothing spectacular happening.

I do wish them the best though. They have an excellent product in their tooling, and if they could stabilize their infrastructure I’d love to try them again.


> I need a service that’ll let me just build my goddamn project. This was a small static site built by Node, so nothing spectacular happening

(Cloudflare|GitHub|GitLab) Pages should do you nicely!


I actually use digitalocean and it’s pretty solid for static sites (they’re free, I think). It’s also convenient because that’s where pretty much all of my stuff lives these days. I used to put piles of stuff on GitHub pages though! I have some great memories of learning how awesome static sites could be, and how cool it was that they’d deploy just by pushing your repository. That seemed like magic back then.


Free up until 3 static sites, that’s what it was (on the app platform)


Half the critical info for using their services is buried in some thread in the forum (posted by an employee). How bad is their documentation pipeline that they can't, with similar effort, get that same info into the documentation? Requests to put stuff in the docs go ignored.

The answer to _any_ usage related forum question should be:

1. It's in the documentation <here> (maybe I just added it)

2. If you're left with any confusion, let me know and I'll update the documentation to resolve it


I've had the same experience, unfortunately.

The Fly dashboard reported everything was A-ok, but requests would time out. I had to manually dig into the fly logs to see that their proxy couldn't reach the server, and there was nothing I could do to fix it.

This went on for hours, until I made an issue on their forums. They never replied or gave any indication they read the thread, but it somehow magically got fixed not long after.

I really want them to succeed, but this utter lack of communication and helpless feeling of not being able to do anything has cured me from fly.io for now.


Curious to know, have you tried Render? What is the successor to Heroku in your eyes?


Render has been my drop-in successor to Heroku. No complaints except their weird team pricing, which doesn't matter for solo projects


Render didn't support Docker images last I checked, and the worst part of Heroku (and its clones) was not having a locally reproducible build image. I want to deploy what I've built locally, not hand my source over to some magical pipeline.


Different strokes. Personally I avoid Docker in favor of source-code-deployment; the "magical pipeline" is usually just "git pull and then run a provided command". But Render does support Dockerfiles for eg. installing a runtime like Deno that isn't provided out of the box


We recently added support for deploying images from container registries. Currently in early access.


If you're deploying an Elixir/Phoenix app, then Gigalixir has worked really well for me. It's expensive, but then so is Heroku.


What’s their reliability been like?

Am I right in thinking the platform got bought a little while ago, and it’s being run by a relatively small outfit?


I've been using them for the last ~10 months or so to run http://PhoenixOnRails.com. Gigalixir have been 100% reliable for me so far, but it's a low-traffic app - I can't tell you what it's like to run a big app on them at scale.

I don't know who owns them but I do get the impression it's a small team. Hasn't been an issue for me so far. Their customer service has been very helpful and responsive on the rare occasions I've needed to contact them


Scalingo is a good drop-in replacement for Heroku. They even use Heroku buildpacks. They've got good support and are an EU company with hosting in the EU (if that's important to you).


My experience has also been somewhat disappointing. I had a toy project that I decided to host elsewhere (Hetzner VM + Dokku), after the node for the PG database stopped working without any notification and didn't come back online (until I manually resurrected it).


Y'all, this is going to be deeply unsatisfying, but it's what I can report personally:

I have no earthly clue why this thread on our community site is unlisted.

We're looking at the admin UI for it right now, and there's, like, a little lock next to the story, but the "unlist story" option is still there for us to click. The best I can say is: I'm reasonably sure there wasn't some top-down edict to hide this thread (the site is public; anybody can sign up for an account and see the thread).

Say what you want about us, but hiding out from stuff like this isn't one of our flaws. When I find out more about what happened with this thread, I'll let you know (or Kurt will reply here and tell me I'm wrong).

I don't know enough about what happened with this Sydney server to be helpful to people who had instances running on it. When I know more about it, I'll be helpful, but I'm just learning about this stuff right now, after getting back in from a night out.

Almost immediately afterwards

It looks like... all the posts in the app-not-working category are "private"? Like it's some setting on the category itself? "Private" here means you need to have signed up for a Discourse account to see them?


Honest advice, probably to Kurt rather than you, is you need better processes, accountability and (probably) communication in your company. The tone of your reply (and other communications from fly.io) is reflective of the lack of those things given the public sentiment regarding fly.io. At 60+ employees and so many issues that tone goes from humanly endearing to indicative of a non-scaling business. Other replies indicate you don't want the things (process, oversight, etc.) that a growing B2B business needs to really succeed which is not a good sign. Sure there's a cost to that corporate-ness and you want to minimize that cost but it's also a necessary evil for the business you're in at the scale you're at.

If something breaks once it's an accident, if it breaks twice it's bad luck but if it breaks down three times it's broken processes. Based on the comment here things break at fly.io a lot more often than three times.


I'm just a person on Hacker News that happens to be at Fly.io; as I've said before, it's probably reasonable to think of me as an HN person first, and a Fly.io person second. My tone is my tone, and has been for the many years I've participated in this community. I got back from an evening out, saw that we were on the front page, poked around a little to find out what the hell was going on, and did my best to add some context. That's all.

If you're reading my comments on HN as some kind of official response from the company, you've misconstrued them.


> If you're reading my comments on HN as some kind of official response from the company, you've misconstrued them.

For what it’s worth, this is the reason most companies eventually restrict their employees from making statements about the company: it doesn’t matter if you thought it was clear that it was unofficial, any statement from an employee in a position of power (such as someone with access to the control panel) will be perceived as a communication from the company.

You may have intended it to be a personal remark about your job, but there are a lot of people in this thread looking for any communication they can get about the company.

When you step in to fill that void as a person who appears to have access and power within the company, you are the official communication whether you intend to be or not.


Maybe I'll get restricted someday!


For the sake of fly.io, you should either restrict yourself and not respond or, if you can't resist, make it crystal clear that you DO NOT represent fly.io. Your first message can and will be misunderstood, and it DOES cast a poor light on fly.io.

I am a paying customer of fly.io, on the Scale plan.


Please feel free to reach out directly with your concerns. I'll certainly read any email you send me.


TBH I thought you were replying as the CEO of fly.io since 1) I've seen them post here before, 2) I have no idea how big fly.io's staff is and 3) your post didn't otherwise describe who you were. It doesn't look like I was the only one to be confused.

If you had said "thoughts are my own; I just work there" or something I think it would have been more clear.


It seems you took my comment personally, but it was about not just your comments but the overall tone of fly.io's communication (see the recent blog post regarding funding) and approach to issues (three days of silence on a dead instance). You view processes and guidelines as chains rather than as a ladder to help you climb a cliff. If the processes and communication were good, you'd know when you should self-restrict and when you shouldn't. You'd be empowered to make decisions within a framework that benefits fly.io the most, versus being left to guess yourself. You'd understand why you should do that sometimes and why it's a better option for everyone.


I don't, but that's fine: it's not important that we understand each other all that clearly here, since all I'm talking about is how our public forum works.


For an opposing viewpoint: I don't want HN to become the place where corporate comms comes to bullshit us. I want engineers who work there to talk to us as peers, which seems like what's happening here. I get candor and humility (and playfulness, sure) from Fly's tone, which I appreciate.

I get stuff like this is frustrating. But I bet Fly staff are pretty frustrated too.


From this, my takeaway is that I could get fired for picking Fly.io for work. Not because there was an outage, but because days could pass before getting support.

What assurances could you give the community here that the support would be better next time?


This is our public site, for people who don't have support plans with us.

It's difficult for me to say more about what happened here and how you might have handled it, because I don't know what happened with this SYD host, because it's 1AM and the people who worked on it are, I assume, asleep. When I know more, I'll do my best to get you a postmortem.


>This is our public site, for people who don't have support plans with us

To be honest, that's enough for me. Sorry I didn't pick up on that.


Try filing a bug with any of the big three cloud vendors when you're on their free plan. It's really not different, the thing that is going to get you fired is not realizing you're not paying a couple hundred bucks per month for premium service on the infrastructure that is mission critical to your company.


Funny story, when I started my current role I researched our hosting provider. I couldn't find the matching invoices in the accounting system. So I called the vendor, a local company. They'd not set our account up correctly, billing was not enabled. Since then we've been billed. I'm glad we sorted it but it wasn't a good look to start my role by increasing our spending.


I feel like starting your role by discovering a crucial service wasn't being paid for and therefore was at risk of suddenly going away should be a pretty positive thing.

However 'should' is pretty load bearing there and actual results are probably heavily dependent on management culture and the current state of office politics.


We had a customer once that our automatic billing system tried to reach for 3 months about failing credit card charges (<$5k/mo). Our system stopped the service.. I'm pretty sure their subsequent outage cost their customers millions. Lessons about what it means to have (and be) enterprise customers were learned. Unfortunately the lady who was ignoring our e-mails in her inbox got fired.


My neighbor once had a gardener who delivered no bill. For years! Then out of the blue, $4k invoice.

Trust me, you did the business a favor.


> Try filing a bug with any of the big three cloud vendors when you're on their free plan.

A host being down for 3 days isn’t a bug. And you can contact AWS support, even on the free plan, and get a reply. Try it yourself. The great thing about AWS and the other cloud providers? If a host has issues they email all customers with workloads on it so you don’t need to refresh or check a forum.

I understand fly is a community darling. They’re unreliable, with poor support currently. Maybe the dev experience is great and that makes up for it, but pretending like everything else is equally shitty? Not true.


You can/should get fired for picking any plan without proper support guarantees for something serious, regardless of provider.


People here said they have specifically paid for a higher support tier and got no responses.


If you are on paid plan and generally followed proper procedures on picking suppliers then you have no reason to be worried about getting fired.


Lots of experience with Fly's paid support here. tl;dr Absurdly good.

FAR better wrt both response times and technical expertise than you'll get with any large public cloud provider.

I was dealing with some annoying cert + app migration stuff (migrating most of an app from AWS to Fly), and Kurt (CEO) was personally sending me haproxy configs bc I'm not smart enough to know how to configure low-level tcp stuff in haproxy. Not to put him on the spot here -- I doubt he'll have time to do that level of support going forward -- but that's my experience of the company's dedication to support and technical expertise.


> I have no earthly clue why this thread on our community site is unlisted.

Maybe it's hosted in the SYD region


It's hosted by Discourse.


Another good reason to avoid that platform like the plague.


What are the other good reasons? All my experiences with Discourse have been great.


The interface itself?

For instance one of those things I've noticed is that most Discourse instances have those nag banners if you're not logged in begging you to log in – and that's one of the least objectionable things they do IMO. I discovered recently that Discourse also blacklists all but the most recent browsers (because Discourse is designed for the next ten years!) and serves up a plain text version on anything older… but not without a nag banner of its own admonishing you for not using a supported browser.

The infinite scrolling… ugh. I'm not a huge fan of XenForo, but as a successor to vBulletin it seems to be far more user friendly.


I don't know why the app-not-working category effectively delists threads, but until we find out, I just removed it so this thread is public again.


maybe it's to avoid having search engines scrape these threads?


My understanding is that it was causing support problems, because people were Googling for solutions to problems with their apps (because of the Heroku diaspora, we have a lot of first-time Docker users), finding old stale threads on our forum that looked related, and then reviving them.

I think we can just `noindex` the category instead of making it private?


So the tagged posts were intentionally hidden, then.


I really like the work that you're doing Thomas, this is the right approach. FWIW, https://fly.io/blog/carving-the-scheduler-out-of-our-orchest... is one of my favourite posts on your blog.

For everyone else reading this, we have been running https://changelog.com on Fly.io since April 2022. This is what our architecture currently looks like: https://github.com/thechangelog/changelog.com/blob/master/IN...

After 15 months & more than 100 million requests served by our Phoenix + PostgreSQL app running on Fly.io, I would be hard pressed to find a reason to complain.

- Some deploys failed, and re-running the pipeline fixed it.

- Early July 2023, 9k requests from Frankfurt returned 503s. Issue lasted 10 seconds.

- While experimenting with machines, after many creations & deletions, one volume could not be deleted. Next day, the volume was gone.

That's about it after 15 months of running production workloads on Fly.io.

We mention our Fly.io experience often in our Kaizen pod episodes, which we publish every ~2 months: https://changelog.com/topic/kaizen. For anyone curious, this is the episode in which we announced the migration: https://changelog.com/shipit/50. There is a detailed PR which goes with it: https://github.com/thechangelog/changelog.com/pull/407. We've been talking about our migration plan from apps v1 (Nomad) to apps v2 (flyd) recently: https://changelog.com/friends/2#transcript-138

I'm sorry to hear that many of you didn't have the best experience. I know that things will continue improving at Fly.io. My hope is that one day, all these hard times will make for great stories. This gives me hope: https://community.fly.io/t/reliability-its-not-great/11253

Keep improving.


Glad to see you commenting here about this, I literally just posted a comment about how it's really messed up that you guys would do that


There's also a lock icon next to the "App not working" category in the header, which I took to mean that that entire category is hidden from logged-out users (which experimentally seems to be the case).


I have the impression from this thread that this thread was public (as in, would work if you just linked to it from something like HN) earlier, and now it isn't?

Obviously, deliberately hiding a negative story on our Discourse is a little like deleting a bad tweet; it's just going to guarantee someone captures and boosts it. We have a lot of flaws! But not knowing how the Internet works probably isn't one of them. No idea what's going on here, still trying to work it out.


Yes, from the Google-cached version, it appears that the thread previously didn't have the app-not-working tag; it was only tagged with "rails".

Not going to try and guess why or when that tag change happened. Personally, I'm less concerned with this particular thread than with the apparent decision to systematically hide all potentially-negative threads from search engines.


That category was added after one of our support folks replied, likely for tracking. I don't know why it's private. They may not even know this category is private. Hiding negative shit wasn't a deliberate decision... we're aware of google cache and we don't need to give HN another reason to dunk on us.


> That category was added after one of our support folks replied

FYI, this doesn't appear to be strictly accurate. The OP commented at 23:52 UTC saying that the thread had been made private, and the reply from "Sam-Fly" was not posted until 02:36 UTC.


My point was that the app-not-working category is used in conjunction with support/our team getting involved. I assume this is what Sam meant by "flagged it internally", which was followed by investigation, then a post. I don't see how the timestamps uncover something nefarious.


Thanks for publicly responding to the criticism, that can't be taken for granted. I hope you'll manage to actually address them.


You might be right, but in light of this whole disaster it doesn’t sound too convincing and doesn’t make your company look good.


It looks like being authentic is valued over anything else at Fly. I can’t explain how a company responds this immaturely to incidents like these.


We're just people. We don't have the part of the company that keeps us from communicating like people in public. Maybe we'll grow it someday.


Please don't.


Please don’t.


If you're talking about the comment you're replying to, tbh I found it was way more relatable than a more "professional" PR-speak response. Maybe you were talking about something else


Unfortunately PR-speak exists for a reason.


But is it a good reason?


Eh, I like it. It's refreshing to see a company representative communicate like an actual human being instead of the usual meaningless corporate robot-speak.


I'd rather take this response and see that they're working on it than "Oopsie poopsie, our machine elves have messed up!" or corporate newspeak saying nothing.


[flagged]


you have no idea wtf you're writing about; it's been a few hours now, and it's become clear that someone tagged the post as 'app-not-working', which made the post 'private' and only available to logged-in users. it's also become apparent that the linked post is on a community forum for users without a support plan.

the dramatic tone and accusations in your reply are not warranted anymore


I like fly.io a lot and I want them to succeed. They're doing challenging work...things break.

Have to admit it's disappointing to hear about the lack of communication from them, especially when it's something the CEO specifically called out that they wanted to fix in his big reliability post to the community back in March.

https://community.fly.io/t/reliability-its-not-great/11253#s...


Yes, this. It's tough when you've already played your "we messed up but we're making it right" card, and then you continue to not have it right.


Hosting service that cannot get basics right after a decade plus of solving these problems as an industry.

Are we even trying or just repeating ourselves because we don’t know what else to do?

How can the entire industry keep making the same basic errors?

“Let’s keep it simp… ohh nope we invented a Turing complete language and customer service is terri… wait do we have customer service?”

I get the world turning against SaaS lately.

Computers are so fast now, enthusiasts would be better served DIY; put a beige box in a local colo, use one of the big 3 for big business.

This is just starting to look disreputable and disrespectful to humanity itself putting such resources into one time bomb after another.


I just got gigabit bidirectional fiber at home and honestly if I were doing personal stuff or doing very early bootstrapping I'd just host from here with a good UPS. No it wouldn't be data center reliability but it'd work at least until it was ready to put in something more resilient.

You can pay for a business class fiber link too. It's about twice as expensive but they have guaranteed outage response times which is really what you pay for.


> Computers are so fast now,

Agreed

> enthusiasts would be better served DIY; put a beige box in a local colo

I mean, like, can I provision a zero ops bit of compute from <mystery colo provider> for $20/month?

Edit: looked up colo providers in my city- “get started in 24 hours, pick a rack and amperage, schedule a call now.”. Yeaaah, no. This is why people use cloud providers instead.


The thing is, running a good SaaS service requires quite a bit of staff, hard operational skills, and a lot of manpower. You know, the kind of stuff people always call useless, zero-value-add, blockers, and something to be automated away entirely.

Sure, we have most of the day-to-day grunt work for our applications automated. But good operations is just more. It's about maintaining control over your infrastructure on the one hand, and making sure your customers feel informed and safe about their data and systems on the other. This is hard and takes lots of experience to do well, as well as manpower.

And yes, that's entirely a soft skill. You end up with questions such as: should we elevate this issue to an outage on the status page? To a degree you'd be scaring other customers: "Oh no, yellow status page. Something terrible must be happening!". At the same time, you're communicating to the affected customers just how seriously you're taking their issues: "It's a thing on the status page after an initial misjudgement - sorry for that." We have many discussions like that during degradations and outages.


Patronizing to assume this is obscure wisdom at this juncture.

Scared customers seems a bit… puerile? In a Sunday school way? Are we not adults capable of rational discourse?

“Why is line not go up!!” still? Just continues to smell like busy work in deference to a politically mandated hallucination.


I appreciate the honest feedback. We could have done better communicating about the problem. We've been marking single host failures in the dashboard for affected users and using our status page to reflect things like platform and regional issues, but there's clearly a spot in the middle where the status we're communicating and actual user experience don't line up.

We've been adding a ton more hardware lately to stay ahead of capacity issues and as you would expect this means the volume of hardware-shaped failures has increased even though the overall failure probability has decreased. There's more we can do to help users avoid these issues, there's more we can do to speed up recovery, and there's more we can do to let you know when you're impacted.

All this feedback matters. We hear it even when we drop the ball communicating.


What hardware are you buying? Across tens of thousands of physical nodes in my environment, only a few per year would have problems "fatal" enough to require manual intervention. Yes, we had hundreds of drives die a year, and some ECC RAM would exceed error thresholds, but downtime on any given node was rare (aside from patching, but we'd just live-migrate KVM instances around as needed).


Maybe there needs to be a better "burn in" test setup for their new hardware, just to catch mistakes in the build prep and/or catch bad hardware?


Not that nothing will fail, but some manufacturers have really good fault management, monitoring, alerting, etc. Even the simplest stuff like SNMP with a few custom MIBs from the vendor helps (and some do it better than others). Facilities and vendors that lend a good hand with remote hands are also nice, if your remote management infrastructure should fail. Out-of-band, full-featured management cards with all the trimmings work well, and some do good Redfish BMC/JSON/API stuff on top of the usual SNMP and other nice built-in easy buttons. Today's tooling with bare metal and KVM makes working around faults quite seamless. There are even good NVMe RAID options if you absolutely must have a local box with mirrored data protection, and 10/40/100Gbps cards with a good libvirt setup can migrate large VMs in mere minutes, resuming on the remote end with barely a 1ms blip.


Good point. :)

I'm still wondering about their hardware acceptance/qualification though, prior to it being deployed. ;)


Yah, presumably they put stuff through its paces and give everything a good fit and finish before running workloads. But failures do happen either way.


Could you expand your answer to list vendors which you would recommend?


"it depends". Dell is fairly good overall, on-site techs are outsourced subcontractors a lot so that can be a mixed bag, pushy sales. Supermicro is good on a budget, not quite mature full fault management or complete SNMP or redfish, they can EOL a new line of gear suddenly.


Have you come across Fujitsu PRIMERGY servers before?

https://www.fujitsu.com/global/products/computing/servers/pr...

I used to use them a few years ago in a local data centre, and they were pretty good back then.

They don't seem to be widely known about though.


Have not - looks nice though. Around here, you'll mostly only encounter Dell/Supermicro/HP/Lenovo. I actually find Dell to have achieved the lowest "friction" for deployments. You can get device manifests before the gear even ships, including MAC addresses, serials, out-of-band NIC MACs, etc. We pre-stage our configurations based on this and have everything ready to go (rack location/RU, switch ports, PDUs, DHCP/DNS). We literally just plug it all up and power on, and our tools take care of the rest without any intervention. Just verify the serial number of the server and stick it in the right rack unit, done.


> You can get device manifests before the gear even ships, including MAC addresses, serials, out of band NIC MAC, etc.

That does sound pretty useful.

So for yourselves, you rack them then run hardware qualification tests?


Here the even bigger red flag is that Fly doesn't have an (automated?) way to quickly move workloads from a faulty server to a good server, especially when containers (and orchestrators) have abstracted away the concept of data volumes that can be attached and detached. (Yes, it needs a lot of serious technical investment to provide this, and I think it's one of the reasons storage is expensive on the big 3 clouds.) If you are offering data persistence services, then you absolutely need this capability.

I think there is an expectation mismatch between what Fly wants to offer and what the market wants from it. Fly wanted to innovate by offering devs the ability to run their apps from multiple data centers. But without a proper data persistence service, the ability to run apps from multiple data centers is not useful to the vast majority of people.

I think Fly is trying to solve the persistence issue with their SQLite replication, but that means the vast majority of devs will have to change the way they develop applications to suit the Fly platform.

I think Fly needs to choose what it wants to become: a reliable and affordable Heroku replacement, which is a decent-sized market, or an opinionated way of developing apps that offers the best performance to users all around the world.

But opinionated ways of doing things are a double-edged sword. (Rails and Spring Boot are highly successful because of their opinionated defaults.) App Engine is an interesting case study in the app hosting domain. It was way ahead of its time and prescribed a way of developing apps that allowed them to scale to very high traffic. But people didn't want to change the way they develop to adapt to it.


>I think Fly needs to choose between what it wants to become

They have already pivoted once, no? At their current size (>100M in funding), I seriously doubt they can do it again.

I think they are scrambling hard, putting one fire out just to start another one later. That doesn't give me confidence in their technical roadmap and multiple people have Fly.io in their "check later" list for what now? 2 years?

It's really hard to recover your reputation when people perceive you as unreliable. Especially in the IT space.


They don't have remote attached storage; it's all local to the node, with LVM-based volumes. Data persistence comes from daily (24h) or manually created LVM snapshots that are exported to S3.

It's really not a place to run persistent workloads. If you run postgres there, you need to be prepared to either hot load your data into a new instance, or restore from backups.
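
For reference, restoring from those snapshots looks roughly like this (the IDs are made up; check `fly volumes --help` for the exact flags):

  # list the snapshots fly keeps for a volume
  $ fly volumes snapshots list vol_0000000000000000

  # hydrate a fresh volume from a snapshot, then attach a new pg to it
  $ fly volumes create pg_data --snapshot-id vs_0000000000000000 --region syd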


I actually have been advocating against them for a while here on HN (https://news.ycombinator.com/item?id=31394179) for the same reason.

They had my account on some sort of shadow ban with no communication whatsoever after asking them to delete my account from their systems. I emailed them and to date never even got a response. I have moved everything over to Railway app and back to Google Cloud Run ever since.


> they never bothered to reply and put me in some kind of shadow ban from re-registering with my email.

So did you manage to delete your account then attempt to re-register using the same email address you deleted the account with?

Why would a company shadow ban you for asking an innocuous question?


> Why would a company shadow ban you for asking an innocuous question?

If you are literally overwhelmed with crises, it becomes appealing to make problems go away in this manner. Not saying they are, but this thread is suggesting that.


Unfortunately, there are companies that do such stupidity.


You know what's interesting? It feels like history is repeating itself with Fly.io, just like it did back when I first encountered Heroku. Back in the day, I was super excited about Fly.io – it had that same fresh, exciting vibe that Heroku had when it burst onto the scene.

I remember being blown away by Fly.io's simplicity and how easy it was to use. It was like hosting made simple, and I couldn't help but think, "This is it, this is the one!"

But, as time went on, I noticed little signs of trouble. Downtimes became more frequent, and my deployments, which were once snappy and seamless, turned into agonizingly slow affairs. It was like déjà vu from the time when Heroku's greatness started to wane.

It's disheartening to see Fly.io go down a similar path. As more people flocked to the platform, it seems like its performance began to suffer – just like what happened with Heroku. The more popular it got, the less reliable it seemed to become.

Scrolling through Hacker News, I can't help but feel a sense of disappointment. Others are expressing their frustration too, and it's like we're all reliving that moment when Heroku lost its charm and became a hassle.

I have to admit; it worries me. It's like a cautionary tale of how even the most promising platforms can fall from grace. It's the reality of the fast-paced tech world, but it's tough to accept.

So yeah, here I am, hoping against hope that Fly.io can somehow break free from this cycle and find its footing before it becomes as useless as Heroku was at its lowest point.


It was a bit alarming to see Fly offering significant resources for free (and encouraging their use in the docs, subtly making them a feature and a reason to switch) back then. I wondered if they overestimated the conscientiousness of the industry: as with Heroku, surely once the word got out in the wider world, plenty of people would flock over just to not pay. Guess what happened next…

Heroku was a new thing back then, so it took a while for abuse to ramp up—but every subsequent attempt at being generous should not even be considered without either a vicious and expensive anti-fraud department in place or deep pockets to compensate for the initial lack of said department by throwing enough hardware that the minority of honest users don’t notice the overhead.

My impression is that Fly does not score high on either of the above. That's partly why I like them (the above seems like megacorp-type bullshit, and they seem to be strictly no-megacorp-bullshit), but I wouldn't be surprised if engineers at Fly had to spend most of their time dealing with fires, optimizing resource allocation, and auto-limiting freeloading cryptominers, scammers, and other abusers, rather than focusing on longer-term infrastructure reliability or DX.


I feel the same way.

Do you think it's related to scale? As in, once a company has enough paying customers to become profitable/investable, it has also accrued enough issues that it stops feeling fresh and exciting, like you said, and gradually becomes like the older competitor it once wanted to replace?

This is my experience at least. Once the company goes from a few pizzas to "we've booked a venue", entropy creeps in and adages like Conway's/Brook's law become increasingly evident.


The skillset of successfully founding a company and the skillset of successfully scaling a company are not the same. The latter is a hard thing to do that requires understanding both customers (current and potential) that you never speak to and employees that you barely speak to.


Incredibly unimpressed at fly.io staff for hiding/making private the downtime forum support thread.


We tried to migrate all our staging environments to Fly last year but it was the flakiest experience I’ve experienced on any PaaS. Pushing simple containers up would fail 70-80% of the time with no useful error messages and non existent support. It’s a weird company that seems great until you actually use them.


I think fly.io is pretty incredible, but I can't help feeling they're doomed to follow in Heroku's footsteps (unclear if good or bad). They've built some pretty wild stuff, and I can't help but wonder if they're boiling the ocean instead of just solving problems for their users.

Durable and available storage are all they really need to draw me away from big cloud providers but this combined with their answer to S3 being "use S3 or run minio" means I'll never take them seriously.

This is a bad look folks, not sure how you can walk back days of silence and hiding threads. Just open an issue and talk to your users.


At least I could rely on Heroku in production. I've wanted to give Fly.io a try but this gives me pause. I really do miss the Heroku DX whenever I'm putzing around with the increasing complexity of AWS.


For hobby projects - where I dare not touch AWS for fear of going bankrupt from a misconfigured service - I found the sweet spot to be Dokku on top of a Hetzner or Digital Ocean instance. It provides a Heroku like interface on top of cheap hosting, and is fine where you don't expect to scale very much.


> use S3 or run minio

Is using Cloudflare R2 not an option?


Backblaze even has an s3 api these days.


Backblaze B2 is much older than Cloudflare's offering. But from what I remember, they didn't have any presence outside North America.


They do have an EU Central region (see [1]) but "it is not possible to have multiple regions under one account" - you need an account in each region (although it seems you can maybe fudge around with groups to emulate multi-region access.)

[1] https://help.backblaze.com/hc/en-us/articles/360034798433-Ca...


B2 used to have only its own slightly different B2 API, but it is now compatible with the S3 API.

As far as their offering, one should definitely understand that there are limitations and do their research.


Instances going down happens sporadically on Hetzner Cloud as well, but often, by the time I see the e-mail alert that some instance is unreachable and log into the dashboard, I find that it has already been restarted or migrated to another host. I've been running a production system there for more than 4 years now and have had zero provider-related downtime (as I have some redundancy for most instances). In terms of features they move way slower than Fly.io, and it took them years to add stuff like virtual networking, but everything they add works rock-solid. I guess there are just very different engineering cultures when it comes to building a cloud infrastructure provider, and I have to say I prefer the "take your time and do it right" approach.


I'm running some instances on Hetzner Cloud; the oldest is ~5 years old and only recently had 2 hours or so of downtime. Other than that, no problems at all. And we are talking the cheap ones.

I did have a problem with their dedicated server almost immediately after spinning it up. Noticed that NVMe is broken, and support went like:

- 16:28 -> I contacted them

- 16:36 -> Their first response

- 16:44 -> I sent them SMART data

- 16:48 -> They acknowledged that the NVMe needs replacing and asked me if I consent to that (and to losing the data that was not already lost - but I'm running RAID, so no problems there)

- 16:52 -> I agreed

- 17:30 -> NVMe was replaced and server booted

I don't have too much experience with hosting providers on that level, but that was freaking impressive response time from them. So a happy camper as well :D

EDIT: Formatting


Hetzner has a great price/performance ratio, but they are not rock-solid. Speaking of the private network... look at their forum where people complain about downtimes for their "vSwitch" every other week, sometimes it doesn't show up on the status page because it happens on the weekend (lol).


They’ve been working on Fly for years now, and it seems like they haven’t been able to turn it into a reliable service or a profitable business (making assumptions about the second part here), and the overall general sentiment seems to be to avoid it for anything but the most toy applications. I note that the team was also unable to get their recruiting business off the ground and shuttered it.

My assumption, based on the creator's very-online Hacker News commentary, is that they're at least smart in tech. So what's the lesson here for the rest of us who may want to start a business? Is this a "shots on goal" thing, where we're just seeing these failures more publicly than most and it biases the perception, or is there some je ne sais quoi missing that we could learn from? No offense intended by my post, but I would be very keen to learn whether there's some X factor missing from an otherwise ostensibly smart team's repeated failures.


It's really disappointing that they made this forum thread private, apparently in response to this HN thread blowing up. This is the first negative HN thread I've seen about them, it's not even really that bad because this kind of downtime is expected, and they can't get to every forum post, and their response that someone posted here is totally reasonable in my opinion.

So why is the link to the thread 404ing, and why does this post have to link to a Google web cache of it? I've grown to like fly.io and use them for my side projects now, and this just isn't something they would do. Going through some minor cognitive dissonance right now :/


(a) Not even close to the first negative HN thread about us.

(b) We definitely didn't make the thread private in response to HN.

(c) It should be public again.


Refuting the points in (a) and (b) still concedes that it was made private. Care to actually mention _why_ it was made private?


I saw your other comment, glad to see this wasn't intentional as the optics were pretty bad


I wonder if there will ever be a wake-up call for the arrogance of the people at fly.io

At work, when it came up in a meeting, people went around with horror stories: things breaking while the status page wasn't updated, terrible communication, and an overall attitude that nothing was wrong, even when servers went down for days at a time.


what's up with the status page?


There's a global status page, and then there's a local update for people with instances on an affected host --- past some threshold of hosts, the probability of having an issue on some random host gets pretty high just because math. The local status thing happened for people with instances on that machine.

Ordinarily, a single-host incident takes a couple minutes to resolve, and, ordinarily, when it's resolved, everything that was running on the host pops right back up. This single-host outage wasn't ordinary. Somehow, a containerd boltdb got corrupted, and it took something like 12 hours for a member of our team (themselves a containerd maintainer) to do some kind of unholy surgery on that database to bring the machine back online.

The runbook we have for handling and communicating single-host outages wasn't tuned for this kind of extended outage. It will be now. Probably we'll just paint the global status page when a single-host outage crosses some kind of time threshold.


thanks for clearing that up


Status pages are usually for marketing purposes.

Why would anyone want to become a new customer if all they see is a jumble of green, yellow and red?

Green status pages attract business.


thanks for sharing your opinion, but I was looking for a reply from someone inside fly.io


Wondering if there's any alternative people suggest for small/bootstrapped projects? Fly has a nice UX and accessible prices, but it's unstable at best. I use the big clouds at work, but for personal projects they are $$$. Also I want to keep devops effort tending asymptotically to zero.


I’m quite happy with https://render.com after leaving Heroku


I've never actually used Render, but did interview with them last year. I faceplanted at the end and didn't get an offer, but… hands down Render ran one of the best interviews I've ever participated in. Communication was on point, the process itself was well organized, and even though I disagreed with a couple of engineering choices, there was a distinct lack of bullshit.

If that carries over to their customer facing folks and how Render as a team has executed since then I'd absolutely recommend taking a look at them.


I second render.com. I switched from fly.io to Render.com after seeing a few of my instances getting bottlenecked and crashing. Now the same service runs smoothly on render.com without any crashes. Didn't dig any deeper but somehow the resource management is better with render.com


I'll give them a run thanks!


i've also had success with render.com so far! been running an app & DB for $14/mo for almost 6 months and it's been solid.


Although I have never used them, you can explore railway.app. It is the closest to fly.io and I've never heard any bad things about it.

I personally at the moment use digitalocean without any issues, but there's always the maintenance overhead of managing a server yourself.


I've been using a Postgres DB on Railway's free plan (that is going away) and it was great. It did everything I wanted (excluding external access and PostGIS) for cents. The support community is nice.

I didn't use it for much more, but my experience has been great. They deserve way more air time than they currently get.


I wish DigitalOcean offered decent pricing for Spaces (S3). Unfortunately it starts at $5, which is an enormous price for storing 70 small images, but S3 would greatly simplify my server management by moving state entirely outside the server (managed database + managed object storage).


You could use Cloudflare R2, it's pretty cheap overall.


I did not realize they have an s3 compatible service


> price for storing 70 small images

Do you have to use an object store in that case? Or does it have to be separate from whatever application instance?


I don't have to use an object store, but using the filesystem makes setting up a server more expensive: if I delete the instance, the data is gone. A volume kinda offsets this, but it's way less portable and accessible by only one instance at a time.

The peace of mind of managed services is nice; all I have to think about is running the app, without having to deal with making sure the db and files don't get lost.


At that level I think I'd just put the images in the database.
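
In Postgres that's just a bytea column, e.g.:

  CREATE TABLE images (
    id    bigserial PRIMARY KEY,
    name  text NOT NULL,
    data  bytea NOT NULL
  );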


That's an option, but I want to keep things simple, and the usual assumption is "filesystem", though weirdly most libraries assume S3 usage. I don't think I've seen native support for db-stored images in any of the libraries I use, which is sad but a reality.


Maybe not directly in libraries, but there are a few programs that make a FUSE filesystem backed by a database.


Ya cant just throw em at a blob type field/column?


In the git repo even


The images are supplied by the users, so that wouldn't be an option


Honestly these days I am leaning towards this approach: https://github.com/mrsked/mrsk/

It's all just docker.
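
A sketch of what that looks like (service name, image and IP are made up; see the README for the real options):

  # config/deploy.yml
  service: myapp
  image: user/myapp
  servers:
    - 192.168.0.1
  registry:
    username: user
    password:
      - MRSK_REGISTRY_PASSWORD

Then a single "mrsk deploy" builds the image, pushes it, and swaps the containers on each host.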


Nah I don't wanna be responsible for running a control plane. I just wanna focus on the app, that's all.


No devops, focus on writing your app: https://www.convex.dev/


I use Dokku on top of Hetzner for my hobby projects - hosting is super cheap, for a little extra I can add a mounted volume for storage, and if the project outgrows a single server I can always just break out of Dokku and use some Docker containers behind a load balancer.

If you are outside of Europe, Digital Ocean or Linode may work better for you.


I like Hetzner, they certainly radiate the feeling of quality (the management UI is great, for instance). The servers themselves are competitively priced (and they have ARM boxes!) - but for more storage than the little that they include I find the price pretty outrageous, compared to the base price, anyway. You'd end up about doubling the price for a "reasonable" amount of storage you can confidently run your base system on.


Hetzner has two data centers in the US now: one in the east and one in the west.


Maybe just pick up 3 chonky EC2 boxes, set up iptables on each of them, have each one run a containerized version of your code that gets built and deployed from CI every time you push to Github, slap an ALB in front of it all, and call it a day?

And if you need state, then spin up a little RDS with your favorite SQL flavor of choice?

The CI deploy script could even bake in little health-checks so you can do rolling deploys with zero downtime. Depending on how fancy you wanted to get with your shell scripting, you could probably even make 1 of your 3 boxes a canary without too much trouble.

I'm realizing I haven't thought about this in a long time, since nowadays I just get to use the fancy stuff at work. Kind of a fun thought experiment!
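
The rolling-deploy script really can be tiny. A sketch (hostnames, registry, port and the /healthz path are all placeholders):

  #!/usr/bin/env bash
  # Replace one box at a time; only move on once the new container passes its health check.
  set -euo pipefail
  SHA="$1"
  for host in app1.internal app2.internal app3.internal; do
    ssh "$host" "docker pull registry.example.com/myapp:$SHA \
      && (docker rm -f app || true) \
      && docker run -d --name app -p 8080:8080 registry.example.com/myapp:$SHA"
    until curl -fsS "http://$host:8080/healthz" >/dev/null; do sleep 2; done
  done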


The system you describe is quite the monthly bill, off the top of my head.


You can do the same thing using Hetzner dedicated hosts fairly cheaply:

https://www.hetzner.com/dedicated-rootserver/matrix-ax


I admit I didn't run the numbers before posting that. But you got me curious, so I went ahead and did it now...

Render.com looks like [1] their "$0 + compute costs" plan would work out to:

  ∙ $25/mo for a single "Web Services" box of 1 CPU and 2GB RAM
  ∙ $20/mo for a single "PostgreSQL" box of 1 GB RAM, 1 CPU, and 16GB SSD
  ∙ TOTAL: $45/mo, and you're assuming they'll magically give you zero downtime
Those are grim numbers, performance-wise, but let's use them as the standard and see what it'd cost in the scrappy AWS architecture I threw together in a few minutes:

  ∙ $12.10/mo for a single t4g.small box, which is actually 2 vCPU and 2GB RAM [2]
  ∙ 3x redundancy on that brings you up to $36.30/mo for compute
  ∙ $16.20/mo for an ALB [3]
  ∙ $11.52/mo for a single db.t4g.micro PostgreSQL box, plus $1.84/mo for the equivalent 16GB of storage [4]
  ∙ TOTAL: $65.86/mo for substantially more CPU, redundancy, and control, or...
  ∙ TOTAL: $41.66/mo for substantially more CPU and control over your infra, if you're willing to drop the redundancy
So it looks like it's pretty comparable in terms of raw dollars.

I'll admit there's a little more "devops" overhead with the AWS setup. Though I think it's not as big of a deal as people make it out to be — it's basically an afternoon of Terraforming, and you'd probably spend an equal or greater amount of time digging through Render's docs to understand their bespoke platform anyway.

(Also, once you contemplate bulk pricing for the underlying commodities, it's easy to see how companies like Render make a healthy margin, even on their low-end offerings.)

Anyway, I guess I've nerd-sniped myself, so I'd better stop here. But that was a fun analysis!

[1] https://render.com/pricing#compute

[2] https://aws.amazon.com/ec2/pricing/on-demand/

[3] https://aws.amazon.com/elasticloadbalancing/pricing/

[4] https://aws.amazon.com/rds/postgresql/pricing/?pg=pr&loc=3


Thanks for the analysis. I think you're still underestimating costs (e.g. you didn't count bandwidth, a multi-AZ standby for your database, backups, etc.) and time spent, not only in the setup but especially in maintenance (security fixes, AWS agent updates, OS updates, package updates, figuring out why an instance ran out of disk, etc.). Not counting that you have to set up and maintain your deployment system, which can range from scripts to K8s.

Also I have used Terraform to set up quite a few resources and it's only overhead in a small project.

I just wanna git push and see my changes published a minute later. I don't think Render is gonna take more than 10 mins to figure out https://render.com/docs/deploy-rails-sidekiq


Spun up a new project and was debating between AWS and Render.

I’ve been burned one too many times by ElasticBeanstalk so I bit the bullet and went with Render… and had everything plus PR deploys working in under an hour. Very happy so far.


Try https://elest.io (Check the CI/CD part)


Don't get me started with Fly — especially postgres machines. In my experience, a really nice idea with poor support and unreliable infrastructure.


What do people get out of using special services like Fly.io instead of standard VMs like the ones you can get from $5/month these days?

Can anybody who uses Fly.io explain their rationale? Why do the additional integration with Fly.io, trust and install their special software on your machines and tie your project into their ecosystem?

What type of application are you running? How many users are using it?


There's a sweet spot of early startup or side project where you don't have the time, budget or people to manually set up and maintain servers on your own or deal with the complexity and cost of Kubernetes or AWS, especially when your focus is on building the product and acquiring customers.

Heroku (before its inevitable enshittification under Salesforce) was great for this use case. Sure you will outgrow it at some point, and it did get expensive, but when you just want to throw up an MVP with minimum fuss and maintenance you could do much worse.


What exactly does Fly.io give you?

You already know how to set up your project locally. Why not just do the same setup on any cloud VM and boom it is online?


Not sure what fly.io offers vs Heroku or others (I have played with it some time ago but not used for anything serious), but for an equivalent I'd be looking for automated load balancer setup with SSL, easy scaling up so I can go from 1 to 2 or however many web services (with UI or CLI), simple deployment configuration with a Procfile (or whatever) and managed PostgreSQL/MySQL/Redis including backup/restore when needed.

That's more than what I would have or need locally.


And what kind of project do you run which needs up/down scaling and load balancing?

In my experience, for a simple PHP web application, the smallest VMs can already handle a thousand concurrent users, which amounts to something like a million monthly users.


and what does your experience tell you about applications which are not written in PHP, and which need to handle more than 1000 concurrent users?


Yes, that's a viable option in many cases.

But if your users are distributed around the world and most requests are read requests then it can make sense to shave 100 or 200 ms off your response times.

You can always squander those gains later by running JavaScript for 5000 ms before showing anything :)


Probably saves you a good hour of "sudo apt gets" and "vim /etc/nginx/nginx.conf" etc.

Having used various PaaS services that take this "pain" away from you, I sort of think the tradeoff isn't worth it. For $5/mo DO will give you a backed-up server. Add $15 for Postgres and that is a good deal.


Who fully sets up a significant project locally?

I used Heroku for a project mostly because my team didn't have the skill set to set this up and I wasn't going to do it. As far as I know they are still on Heroku (with a smattering of AWS services) for that same reason: it just works and is cheaper than doing it yourself.


> Who fully sets up a significant project locally?

Who doesn't? I couldn't imagine having to push to some cloud agent and wait a random amount of time every time I want to test something. With it local I can just save, maybe rebuild or have it auto-rebuild if necessary, and test, then repeat. On a fast machine this can be a few seconds or instantaneous.

Maybe the niche I'm missing here is very "green" developers who don't know how to do any sysadmin work or deploy things.

If this is you, learn it. It pays off huge, not just during development but in being able to have a lot more choice about where you deploy and a lot more control over your own stuff.


Really? Redundant databases? Redundant redis servers? Caching? All locally?


Maybe not redundant because it’s just for testing, but you absolutely can run an entire stack like that on a decent laptop. You can even use Docker to run the same containers you run in production.

I’ve run Kubernetes clusters in multiple Parallels VMs locally with work loads in them to play around.

You also learn a ton about how things work which helps you debug and fix stuff when things go wrong. Even if you use managed stuff it’s always a huge plus to understand at least the basics of how it runs.
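
A compose file for a stack like that (images and service names here are just an example) is short enough to live in the repo:

  # docker-compose.yml
  services:
    web:
      build: .
      ports: ["8080:8080"]
      depends_on: [db, cache]
    db:
      image: postgres:15
      environment:
        POSTGRES_PASSWORD: dev-only
    cache:
      image: redis:7

One "docker compose up" later, the whole stack is running locally.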


it's hip, they use hip tech and hired hip folks, so you know it's the place to be ;)


Why is this company always on the HN frontpage - ironically, for their bad service? Normally, poor service from a provider isn't grounds for such attention - but it seems like Fly.io has not done anything great.

They still continue to get love from the developer community who "wants them to succeed". I'm puzzled as to why? Because of some blog posts?


They have really good tech blog posts. Also, they have https://fly.io/dist-sys/


In a word, they are part of the HN "family".


You unfortunately get what you pay for.

AWS is more expensive than God, but I'll be damned if you can't have a throat to choke in less than 10 minutes whenever something like this happens.


> I'll be damned if you can't have a throat to choke in less than 10 minutes whenever something like this happens

That is a hell of generous description for a person who sits in your Slack instance and responds with "I have escalated to the team internally and am waiting to hear back on confirmation if this is an issue."

Moving a Level 1 support engineer closer to the customer doesn't give them more information, it just reduces the latency to getting a non-answer.


I had one situation where a Hetzner dedi didn't come back up on a reboot. Their dedis are cheap, this one is like $40ish/mo?

Opened a ticket and support had it back up again within about 10 minutes, turned out to be a failed CPU fan which caused an overheat condition and made it so the system wouldn't complete the boot. They swapped the fan and it came up. It's the only failure I've had in years of dealing with them and was just impressed how quickly a physical failure event like that got handled.


https://www.youtube.com/watch?v=5eo8nz_niiM

Datacenters in my country usually had some rooms with tower servers 20 years ago; my first colo was for a tower server I brought in a large backpack :-). But density requirements, cold/hot aisles, etc. prevailed, and towers are generally considered inefficient for datacenter purposes.

And then you have the Hetzner datacenter, which probably everyone I know who runs a DC would ridicule, yet none of them would be able to respond to a fan replacement that fast. I wonder how many rack server chassis are recycled each year because the manufacturer just won't let you reuse them with a new motherboard or power supply due to a new shape, design, port placement, etc.


AWS support replies to your messages when they feel like it. Their support is just as shady, but they have better uptime for sure.


No love for AWS, but this isn't true, at least for larger deploys. If you're running enough with them that you have an account manager, they are very good indeed. You can have someone, someone good, on the phone within minutes and they will stay on the line until the issue is sorted.

I recall an incident at my old company where we were under DDOS, it was getting through cloudflare and saturating LBs in some complicated manner (don't recall the exact details) which made it hard for us to fix ourselves. They were on the phone with us for hours, well past midnight their time, helping us sort it out. The downtime sucked, but I was certainly impressed with their truly excellent support.


FWIW, our aws enterprise support reps are available 24/7 and usually respond within a few minutes.

But again, you get what you pay for.


I was working for a pretty big early AWS customer--one that had realized that for the low low price of all your money you could make DynamoDB scale to some truly massive numbers--and one time when we were having trouble around noon Eastern, a colleague called up our TAM. As he told it, the TAM sounded half-asleep, so my colleague asked if everything was alright.

"I'm in Hawaii on my honeymoon and my backup missed your call, so it escalated."

I probably wouldn't have answered the phone. Granted, that's why I don't do that job. But I have always had a real appreciation for the good TAMs ever since.


Weird, I just begrudgingly went from Postgres to Dynamo because it was so much cheaper. We're not huge scale though, so I'm wondering where the costs start to diverge the other way.


With Dynamo it seems to depend a lot exactly what you're doing. If you're careful about your queries, it's pretty cheap.


This was 2012, and we were hitting read and write limits regionally.

It was not a wise plan. It did, however, run. Technically.


Wonder if that marriage lasted though? ;)


> a throat to choke

yikes


That's a common phrase, not to be taken literally.

It just means one single person (at the vendor) who you can complain to, or raise an issue with.


yea, it's just another one to add to a list of expressions that are unnecessarily aggressive, and for which there are better alternatives


Been in the industry a long time. It’s not a common phrase. It’s weirdly violent. At most “someone to yell at”. A throat to choke? What the fuck.


not a native speaker, but have been reading and writing english long enough to pick up the meaning immediately

anyway, I had the same thought when typing my sibling comment, felt so disgusted reading that


It looks like Fly.io is just not a solid choice for cloud services.

See also: https://news.ycombinator.com/item?id=35044516 and https://news.ycombinator.com/item?id=34229751


Sad how this behaviour drags down LiteFS. I don't trust a company with that kind of culture to build a database.


Reliability is everything. Why aren’t they monitoring their own machines (real or virtual) and getting fire alarms when there’s an outage?


This was my biggest question too after wading through this drama.


monitoring means you might get called-in on your night-out

who wants that?


Every web dev worth their salt knows what they signed up for.

For the unique privilege of being able to build machines out of thin air, I will accept the occasional weekend page


I wanted to give Fly.io a try in my next project but not with this operational culture. I regret telling my CTO clients about Fly.io as the next big thing in operations.


Not a sarcastic or rhetorical question - how come the three big A clouds, or even smaller ones (Hetzner, my favorite), are mostly so stable (give or take some outages)? And does anyone know the internal engineering, architecture and practices they use to keep systems that stable?


There isn't really a secret sauce to it in 2023. The techniques, processes, etc. have pretty much been documented over the past 20 years.

But if you are wondering how AWS manages to be so good at it at such scale? Hosting infrastructure is incredibly complicated and AWS employs something like 100k people. Seemingly small AWS services employ more engineers than Fly.io.

That being said, my take is that what's happening at Fly.io is a lack of leadership; clearly the right people are not in the right positions. I've worked infra at companies from 5 people to, well, Rackspace, and I'm having a hard time imagining so much time passing with, essentially, a piece of infra MIA and impacting users.


I think the core issue is that they vehemently don't want to act like a corporation. Which is great for early marketing and adoption, but there's a reason successful B2B corporations act like they do. It's less fun and it's less endearing, but it also annoys customers significantly less. I mean, the CEO has "Interim Food Taster" as his title on LinkedIn.


IMHO it is their approach. I use Hetzner and OVH (and their other variants for lower budget clients) for our EU clients. They do not use buzz words like "deploy app server", "cloud clusters", "turbo charge this app". They are simply providing VPS and similarly configured droplets. They are also established and don't want to mess around with very modern experimental infrastructures.

Same goes for Digital Ocean. No buzz words. Just hosting with droplets. They simply say "here pick a linux distro, configure whatever and don't ask us much about app support". I use their Linux distros for my own apps and if want anything extra I just install it and suffer my own actions' consequences. Not theirs.


DO, OVH, and Hetzner are more stable because they don't use buzzwords?


I guess what OP is getting at is that these providers stick to battle-tested, proven bedrock and nothing like "run your app where your users are", which I find interesting because that too can be done with any cloud that has a datacenter in the region where you happen to have users.

So this "closer to your users" voodoo is a little beyond me.


The 'where each user is' is implicit, the expectation is that you're some kind of global SaaS, and you want low latency whereever your users are.

Sure you can do that with any cloud (or multiple) that has datacenters in a suitable spread of regions, but I suppose the point (or claimed point, selling point, if you like) is that that's more difficult or more expensive to coordinate. Fly says 'give us one container spec and tell us in which regions to run it', not 'we give you machines/VMs in which regions you want, figure it out'. It's an abstraction on top of 'battle tested proven bedrock' providers in a sense, except that I believe they run their own metal (to keep costs down, presumably).


Some workloads are surely latency-sensitive, but my possibly flawed opinion is that most transactional CRUD systems don't need to be that much closer to the edge.

I mean chat or e-commerce yes, the edge and all.

But for a ticketing system, an invoicing solution or such, a few hundred milliseconds are not that big of a deal, while compliance and regulations matter more.


Scale the technical difficulty and innovation of the product with the size and competency of the team. The market will always say they want more and the job of the company leaders is to know when to say no. AWS did not begin with everything it offers now but rather started with fairly boring things (even for the time) that they expanded over time. This was after a decade of learning how to do this internally so they weren't starting from scratch.


What a great year for render.com.

That boring PaaS must have gotten a lot of growth from their competitors fucking up

Heroku is going to shit. GitHub integration was down for weeks last year

Fly.io can't get it together. Database being offline for days is just ridiculous.


After a while the direct link to Google Search's cache will no longer work, but it appears that the original link is now accessible: https://community.fly.io/t/service-interruption-cant-destroy...

Anyway, here's an archive link for future visitors: https://archive.is/7lSJA


Holy hell, there are some hostile comments here!

I've had service issues on Fly that I've escalated to support in the past, and given my experience it feels highly unlikely that they tried to sweep this under the rug or somesuch.

At the time we had deployed a small business workload (a few $100/mo in billings) and paid for their $29 support plan, so grain of salt there. We faced service issues and, while the service reliability did eventually push us to migrate, support was top-notch the whole way through. Support was happy to escalate as needed to try to help get a solution, with MrKurt eventually joining in and helping identify root causes. During the entire episode everyone was realistic about where issues could be (i.e., they were open to the possibility of it being a Fly issue). As people from Fly have noted, they've historically been quite open about when they weren't the best choice.

Again, while service reliability has been an issue (and Fly has admitted this in the past and is working on it), I think the assumption of bad faith in this thread is pretty unprofessional. It's also a lesson in how hesitant people are to pay for support. $29 for access to a human is not a bad deal; we certainly got good value out of it.


> I think the assumption of badfaith in this thread from Fly is pretty unprofessional.

Customers aren't supposed to show professionalism. Service providers are. I didn't see disrespectful comments here.

People here are just pointing out that this has happened many times and looks like a pattern. If you don't fix a communication issue after multiple occurrences, you might not be ill-intentioned, but you are at least careless.


(Note: I edited my comment to make it clear I'm referring to bad faith from commenters; the quote above is from pre-edit)

I'd argue the expectation goes both ways. I won't link to specific comments, but I think it's pretty clear that some of them cross the line to disrespectful.


I haven't seen a single instance of disrespect -- just justified frustration for the outage and lack of communication. There seem to be a lot of frustrated Fly.io customers and a vocal number of Fly.io fans.


I wonder if there would be anyone defending the service provider if it was AWS or Azure, for instance, instead of Fly.io


I (like many others) want fly to succeed and have even moved a couple of (smaller) production apps to Fly.

But it's very clear that over the last few months, the (already quite capable) fly team is just in over their heads and have bitten off way way more than they can chew.

I've had nothing but headaches after they auto-migrated our app to v2. My build machine had to be forcibly destroyed to even work. Then it allowed me to easily just delete an app cuz I wasn't able to deploy to it.

Then the deploys kept failing due to some VM provisioning error (it thinks I want to add another app when I just want to deploy to an existing app within the 3 machine limit) and honestly, I just don't care to troubleshoot this anymore. That was the point of using a platform like this... Any time I would've saved by using this platform has been wasted with these random errors that I don't have the time to troubleshoot.

I destroyed the app thinking "ok maybe I'll also recreate that one" because clearly the migration to v2 failed. And now that all my secrets were destroyed, when I try to attach the new app to postgres (with the existing username and existing database), it won't let me.

I genuinely wish you guys the best of luck with what has to be a tough time for your company, and will reconsider if you build something demonstrably more stable. But right now I just can't afford to drown with you with clients breathing down my neck.


According to their Status page it's all resolved: https://status.flyio.net/


According to the Status page, there was never an issue to begin with


They're less humble in communicating other things https://fly.io/blog/we-raised-a-bunch-of-money/


I tried Fly once, but at the end of the day it seemed way too expensive for what it was and for the completeness of the vision. And then I started to see the complaints in random corners of the Internet.

I don't read their blog regularly but I always thought they had great content. But not after reading this.

The irony: "What people actually wanted to talk about, though? Databases."

...but apparently not when they are the problem behind said databases?


Their blog is great because they invested heavily in outside perception. Coming from the Elixir world, them hiring Chris McCord (creator of Phoenix) and sponsoring a ton of open source projects (slapping their logo on them) seemed great at first, but when it comes to actually deploying stuff to production and day-2 operations (monitoring is so much more difficult than it should be, and troubleshooting tools are lacking) they are way behind. I can imagine them getting lots of hobby projects on board due to the free tier and the day-1 impression, but that won't win over enterprises.


I could not agree more. When I read this my immediate thought was — all that money, and none spent on product marketing or copywriting. Oof.


I tried Fly.io when looking to move away from Heroku. Some really cool stuff; I love their focus on multi-region apps. But it just felt like too many under-documented things and edge cases, and the support didn't seem like it would be there for me when I really needed it. I ended up going with NorthFlank as my Heroku replacement. They've had the odd hiccup (mostly related to me being the first customer in their us-east region), but communication and support have always been incredible. Really happy I chose them.


The thing about product marketing is that it is almost always greatly exaggerated at best and borderline (if not outright) lies at worst.


This is why people should just run a small managed k8s cluster on GCP. This has worked super well for us. https://www.vmii.org/blog/2023/03/12/kubernetes/


I tried using kubernetes a while back for hosting a side project on a raspberry pi. I guess technically I was running microk8s on the pi and had to install kubectl locally to interact with it.

I actually like some of the concepts, like pods and ingress, but one thing I noticed that I didn't like, as far as I remember, was that there's not really a good way in Kubernetes to make your YAML more dynamic. Apparently you're supposed to use these other things like Helm charts, which aren't even part of Kubernetes?
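
For the curious, the templating is roughly this: a chart is your normal YAML with Go-template holes that get filled in from a values.yaml (the names below are the usual scaffolding, nothing special):

  # templates/deployment.yaml
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: {{ .Release.Name }}-web
  spec:
    replicas: {{ .Values.replicaCount }}

Kustomize is the alternative that's built into kubectl (kubectl apply -k) if you'd rather avoid a separate tool.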


Yes, the post I linked describes how to do that. It took time to learn. Nowadays I’d just do it with GPT!


Well that's not gonna fly.


Fly is soooo easy to use and very very easy to go back to. We went back three times, each ended with multi-hour sev0s w/ horrible status updates. Last outage occurred during our YC interview... now we're on AWS.


Used to love fly, then had a few issues.* The CEO wrote back in March they were working on reliability, but then you have this case study on what not to do in an incident response. 1) Fail to monitor your primary support channel. 2) Allow your support channel to become "private." 3) Not update your status page.

* First was some sort of certificate issue that cost me literally days of debugging and turned out to be their fault.

* Then weirdness around their v2 deployments, where I just can't grok some of the documentation.

Just use AWS. Your time is more valuable than what you're saving on the fly.io free plan.


Most people in this thread are either in North America or Europe, so options for managed services like Fly exist, and they are plentiful. But for people in South America, what options are there for a Heroku-like service? I don't want users shooting off requests halfway across the globe and back when there are many datacenters a couple of miles from our users. I just don't have the time and resources to manage VMs and scaling issues. I need a zero-friction "./serviceX deploy" experience.

Fly seems unreliable, but they offer a deploy region close to me. Does anyone know of any alternatives?



Frankly, the only solution to reducing dependence on these types of things is self-hosting. At least then you will be able to be 100% sure of causes and resolutions.

Companies moving to the cloud are only increasing their operating costs


For those of you with Postgres apps, you can avoid this pretty much universally with CockroachDB (they have a serverless version they host). It takes basically no work moving from Postgres; even a Postgres dump works.
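
Since CockroachDB speaks the Postgres wire protocol, the move can be as simple as replaying a dump (the connection string is a made-up serverless example; 26257 is Cockroach's default port):

  pg_dump --no-owner mydb > dump.sql
  psql "postgresql://user:pass@my-cluster.cockroachlabs.cloud:26257/defaultdb?sslmode=verify-full" -f dump.sql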


CockroachDB gets real expensive real quick if you want to use any of the cool functionality though. The free version is OK for a cluster in very short range of each other, for example the same DC, but if you spread things out you'll need the enterprise features (follower reads, at minimum) to keep any kind of reasonable performance.

Self hosted enterprise "starts" (!) at $100/vCPU per month. So, yeah. Not exactly the hobbyist's choice.


How many days work is it to build a deployment of an Elixir app with Pulumi, Github Actions and AWS?

As someone not incredibly experienced with devops, I always wonder what is best with databases? Should they be provisioned in Pulumi or do I just manually create them in RDS?

Secrets Manager seems like a bit of a pain point as does IAM which I think I just about understand until I get lost! Giving everything access to ingress and egress also seems a bit overly complex/powerful.

Probably the time to get something working is dramatically shorter than it once was with ChatGPT to help.
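
For scale, the RDS piece in Pulumi looks like it's only a few lines. A sketch I haven't battle-tested (names and sizes are arbitrary):

  import pulumi
  import pulumi_aws as aws

  cfg = pulumi.Config()

  # A minimal single-AZ Postgres instance; the password comes from Pulumi's
  # encrypted config, and skip_final_snapshot is only acceptable for dev.
  db = aws.rds.Instance("app-db",
      engine="postgres",
      instance_class="db.t4g.micro",
      allocated_storage=16,
      username="app",
      password=cfg.require_secret("dbPassword"),
      skip_final_snapshot=True)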


It depends how familiar you are. I could probably knock that out in a day with the CDK.


They have a status page and don't reflect host issues there. I'm sure they don't understand what a status page was created for. And they don't respect users. Move away from Fly if you respect yourself.


Honestly, I would be more impressed if marking the thread private had been intentional, to take focus off their falling asleep at the wheel.

At least that would have indicated some competent leadership.


Yeah, I think the worst behavior to observe is when people make hard decisions and then, facing doubts about their judgement from others, try to distance themselves from the decision. That's even worse than the hard decision itself.



I'm on DO and on Fly, planning to try out Hetzner.

Fly gave me plenty of downtime. "Network connectivity issues".

I'm with them only because of laziness.

The setup was more complicated than a standard setup because of the abstractions and corner cases and lack of examples in my language.

Logging is bad, and you can't easily SSH into the machine and inspect things.

Next time I set up a more powerful server, I'll move all my apps there.


Fly seems to push technical boundaries, but at the cost of high brittleness. Not sure that plays well outside of a hobby audience.


I've never tried Fly and I never will because of all the issues I've seen on HN alone


Are you a paying customer to any hosting cloud provider?


We are on Cloudflare Enterprise. We also pay over $20k/mo to OpenAI and are on their Enterprise network too. So, yes.


We also pay for Supabase


A great reminder that databases are easy; everyone can do them. It is reliable, secure, high-performance databases that are hard. Make sure you choose a provider with proper experience in this space.


Fly works on a new and interesting platform paradigm by writing a lot of their software stack from scratch.

Unfortunately, such an approach is unlikely to produce the same stability you might be used to from other places.


I think their proxy could have been written from scratch, and some management, billing, API code etc. too, but under the hood isn't it all standard open source stuff like KVM, Firecracker and such?


It's always interesting that people pay those who provide service this way. There can be problems, of course, but the process of solving them should not look like this.


This seems to be a pattern at fly but I guess they have the apology ready.

But yeah I cannot complain too much, I pay nothing so I got the appropriate support.


That sucks, sorry to hear about that. This is why I ALWAYS run CRDB or Scylla as my transactional DB!


I moved from the DigitalOcean App Platform to fly.io for my web app. It's overall much cheaper but somehow much more difficult to deploy to; deployments take a while. I wanted to migrate my psql instance as well, but I had the feeling it wasn't as stable as DigitalOcean's.


I would use Fly for hosting servers, but never databases.

For this very reason.


... and there are much cheaper places to host a server if you don't care about databases, like bare metal hosters and tier-2 VPSes with good reliability like Vultr and Digital Ocean.


fly.io is for garage / poc / personal stuff. Don't use it for anything $$$ related.


~cloud~

Out of your control


Worth trying out fl0.com here instead.


For people looking for alternatives, what are some suggestions? Is MRSK with Hetzner good?

AWS is way too expensive and complex nowadays. I can’t stand the amount of terminology they invented just to deploy a website and database.


I'm using mrsk + DO instances (the DB is running on a managed instance within DO) for my side project, which is a Django app.

I'm very happy about everything. No complexity, easy to deploy and setup.


How do you compare DO to Hetzner?


I've picked up DO just because they have managed database instances.

If hetzner had this, I would have picked them instead.


You can get away with DigitalOcean, Linode, Scaleway, etc. for raw compute, and you can indeed use MRSK if you're using Rails.


I'm not a Fly.io user nor affiliated with Fly in any way. I read through these comments and realize it's not possible to distinguish competitor/disgruntled/negative-astroturf comments from actual users. The "screw you guys, I'm outta here" opinion of someone running a Discord bot on a free tier who uses the public forum for customer support isn't the opinion you want to use to measure Fly customer support. What are paid production users experiencing?


Paying Fly.io customer with several apps deployed. We’ve not had any of these issues. Fly Postgres is definitely not RDS, and they could do a better job of setting the appropriate expectations. Fly either needs to use some of their VC money to create a fully managed (autoscaling + replicating) Postgres offering, or make it clear to customers that these outages are possible and that the customer is responsible for their own data + disaster recovery.



