Reliability: It’s not great (fly.io)
1226 points by bishopsmother on March 6, 2023 | 455 comments



Fundamentally I think some of the problems come down to the difference between what Fly set out to build and what the market currently wants.

Fly (to my understanding) at its core is about edge compute. That is where they started and what the team are most excited about developing. It's a brilliant idea, they have the skills and expertise. They are going to be successful at it.

However, at the same time the market is looking for a successor to Heroku. A zero dev ops PAAS with instant deployment, dirt simple managed Postgres, generous free level of service, lower cost as you scale, and a few regions around the world. That isn't what Fly set out to do... exactly, but it is sort of the market they find themselves in now that Heroku has basically told its low-value customers to go away.

It's that slight misalignment of strategy and market fit that maybe results in decisions being made that benefit the original vision, but not necessarily the immediate influx of customers.

I don't envy the stress the Fly team are under, but what an exciting set of problems they are trying to solve, I do envy that!


There's a wonderfully blunt saying that applies here (too): you are not in the business you think you are, you are in the business your customers think you are.

If you offer data volumes, the low water mark is how EBS behaves. If you offer a really simple way to spin up Postgres databases, you are implicitly promising a fully managed experience.

And $deity forbid, if you want global CRUD with read-your-own-writes semantics, the yardstick people measure you against is Google's Spanner.


Where does the misalignment between what the customer thinks they want and what they actually want fit into your philosophy? Google Spanner is a great example of this because who doesn't want instantaneous global writes? It's just that, y'know, there's a ton of businesses, especially smaller ones, that don't actually need that. The smarter customers realize this themselves, and can judge the premium they'd pay for Spanner over something far less complex. What I'm getting to is that sales is a critical company function to bridge the gap between what customers want, and what customers actually need, and for you to make money.

The first releases of EBS weren't very good, and it took a while to get to where we are. Some places still avoid using EBS due to bad experiences back in 2011 when it was first released.


> who doesn't want instantaneous global writes

I want to gently note since I see a lot of misunderstanding around Spanner and global writes: Global writes need at least one round trip to each data center, and so they're still subject to the speed of light.


Like most things, it's more complex than that, and as a result it can be either faster or slower than 'median(RTT to each DC in quorum)'.

It's a delicate balance based on the locations where rows are being read and written. In the case where a row is being repeatedly written from only one location and not being read from a different location, the writes can be significantly faster than would be naively expected.
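
To make the 'median(RTT to each DC in quorum)' intuition concrete, here is a minimal sketch of a leader-based majority quorum. It is a simplified model, not Spanner's actual implementation: the function name and the RTT figures are made up for illustration, and it ignores disk writes, commit-wait, and multi-group transactions.

```python
def quorum_commit_latency_ms(follower_rtts_ms):
    """Time for a leader to hear back from a majority of its replica group.

    follower_rtts_ms: round-trip times (ms) from the leader to each follower.
    The leader counts as one vote, so it needs (majority - 1) follower acks.
    """
    group_size = len(follower_rtts_ms) + 1
    majority = group_size // 2 + 1
    acks_needed = majority - 1
    if acks_needed == 0:
        return 0.0  # single-replica group: nothing to wait for
    # The commit completes when the slowest of the needed acks arrives.
    return sorted(follower_rtts_ms)[acks_needed - 1]


# Hypothetical RTTs (ms) from a leader to four followers.
nearby_leader = [2.0, 5.0, 70.0, 140.0]     # leader sits close to two replicas
remote_leader = [70.0, 75.0, 140.0, 145.0]  # leader is far from every replica

print(quorum_commit_latency_ms(nearby_leader))  # 5.0
print(quorum_commit_latency_ms(remote_leader))  # 75.0
```

With the leader placed next to the writer and two nearby replicas, the far-away data centers never sit on the critical path, which is one way writes end up faster than the naive estimate.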


> Like most things, it's more complex than that,

Sure, no doubt. My point wasn't really about the particularities. It was around the mistaken idea that I see sometimes where people believe that TrueTime allows for synchronized global writes without any need for consensus.


The speed of light in vacuum is a hard upper limit. Most signal paths will be dominated by fibre optics (about 70% of C) and switching (adding more delay).

But, yes TrueTime will not magically allow data to propagate at faster-than-light speeds.


[flagged]


I think a Microsoft shill might choose a less suggestive name


> so they're still subject to the speed of light.

I giggled. Good witty comment, bravo.


I get the impression that you think "still subject to the speed of light" is some kind of hyperbole or something, like if you were on a freeway and saw a sign that said "end speed limit" and thought to yourself "welp, still can't go faster than c".

But when you're working on distributed systems that span the planet (say multi-master setups where ~every region can read and even write with low latency), you start thinking of the distance between your datacenters not in miles or kilometers but in milliseconds. The east coast and west coast of the US are at least 14 milliseconds apart:

  % units "2680 miles" "c ms"
  2680 miles = 14.386759 c ms
and that's not counting non-optimal routing, switching delays, or the speed of light in fiber (only 70% of c). Half of the circumference of the earth (~12500 miles) is likewise 67 milliseconds away absolute best case (unless you can somehow make fiber go through the earth).
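
The same arithmetic as a small Python sketch, adding the fiber slowdown and the round trip. The 70% figure is the rough number quoted in this thread, and the helper name is only illustrative; real paths add routing and switching delay on top.

```python
C_KM_PER_S = 299_792.458   # speed of light in vacuum
KM_PER_MILE = 1.609344
FIBER_FRACTION = 0.70      # rough fraction of c in fiber, as quoted above


def one_way_ms(miles, fraction_of_c=1.0):
    """Best-case one-way propagation delay in milliseconds."""
    km = miles * KM_PER_MILE
    return km / (C_KM_PER_S * fraction_of_c) * 1000.0


for label, miles in [("US coast to coast", 2680), ("half the earth", 12500)]:
    vacuum = one_way_ms(miles)
    fiber = one_way_ms(miles, FIBER_FRACTION)
    print(f"{label}: {vacuum:.1f} ms in vacuum, {fiber:.1f} ms in fiber, "
          f"{2 * fiber:.1f} ms round trip in fiber")
```

A consensus round needs at least one round trip, so the last column is the more useful floor when thinking about cross-region writes.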


In a nutshell if you offer cloud services you need to be better than the MAG clan, Digital Ocean too. And people will want it dirt cheap. It’s still hard to be a profitable web host as it always was (MAG has the advantage that none of them were web hosts at first base)


I am willing to pay a little extra for a nice dev/ops experience and simple/easy solutions that don't require spending days reading docs and diving into dashboards with thousands of options.

Usually this results in me jumping on new platforms and then abandoning them once they add too much complexity.


I suspect, in general, acceptability (or desire) for complexity in the cloud solution, and budget are positively correlated in customers.


The ridiculously overwhelming complexity is stickiness.

Think it’s bad to potentially technically move your solution from $CLOUD vendor? Wait until you turn around and realize you have at least one full time hire whose entire role is “$BIGCLOUD Certified Architect” (or whatever) and your entire dev staff was also at least partially selected for experience with the preferred cloud vendor. At any kind of scale you have massive amounts of tooling, technical debt, and institutional knowledge built around the cloud provider of choice.

Then there’s all of the legal, actually understanding billing (pretty much impossible but you’re probably close by now), etc elsewhere in the org. At this point you’ve probably utilized an outside service/consultant or two from the entire cottage industry that has sprung up to plug holes in/augment your cloud provider of choice.

After realizing their cloud spend has ballooned well beyond what they ever anticipated plenty of orgs get far enough to investigate leaving before they realize all of this. Most decide to suck it up and keep paying, or try to somehow negotiate or optimize spend down further.

Cloud platforms are a true masterclass in customer stickiness and retention - to the Oracle and Microsoft level (who also operate clouds).

It’s interesting here on HN because while MS and Oracle are bashed for these practices AWS and GCP (for the most part) are pretty beloved for what are really the same practices.


This is really an oversimplification. MS and Oracle have licensing that's explicit in the way that it wants to lock you in, although in different ways. AWS and GCP posting public pricing that can apply all the way until you reach an absurd spend goes a long way, and the ability to turn off a workload tomorrow incentivizes these platforms to provide a high quality of service.

When working at AWS, a large part of the convincing for an MS shop would be around showing that we can offer a lower price than the 'discounting' that MS provides. Oracle was all about contract expiry.

While there's some complexity around migrating a workload, regardless of where it's at, many places are going into cloud migrations hoping to remain relatively platform agnostic. I've seen many successful migrations to and from different vendors, and often at an SMB or ME scale, in weeks not years.


Sadly I think you are likely correct.


MAG?


From context, I'm assuming Microsoft / Amazon / Google, referring to Azure / AWS / Google Cloud respectively.


Yep.

Because when I think reliable cloud infra, I think Azure.


I’m assuming Azure, AWS, Google Cloud, but it’s new to me too


Microsoft (azure) Amazon (aws) Google (gcloud)


They should have gone with GAA


microsoft apple google


if you add Akamai (Linode) or Alibaba Cloud - then it will become MAAG


Linode is not the same scale as the top 3. I believe even Digital Ocean is bigger than them (for now).


MAGA?


That's for top 4 companies by market cap.


MAGA cap?


oh no, not again.


Geez chill.


> if you want global CRUD with read-your-own-writes semantics, the yardstick people measure you against is Google's Spanner.

I’m trying to build more of an intuition around distributed systems. I’ve read DDIA and worked professionally on some very large systems, but I’m wondering what resources are good for getting more on the pulse of what the latest best practices and cutting edge technologies are. Your comment sounds like you have that context so any advice for folks like me?


Not sure it's what you are looking for, but how Spanner mitigated CAP to deliver a relational DB at scale is a really interesting read[1]

[1] https://research.google/pubs/pub45855/


The best practices are built on solid first principles, and you can get a pretty good grip on them from the High Scalability archives. Back when they posted actual tech articles, their content was some of the best available anywhere. Since you've read DDIA you will probably get quite a lot out of the archive. In fact, you should be able to identify at least some of the unstated problems.


This is, indeed, the exciting part. As Heroku fans, we never really felt like it needed a replacement. And if it did, it seemed like Render was the natural Heroku v.next.

One thing we've noticed, though, is that people do actually want Heroku but close to users. It's not exactly edge compute. In some cases, it's "Heroku in Tokyo". In others it's "Heroku, but running in all the English speaking regions".

I think the thing that ate up most of our energy is also the thing that might actually make this business work. We built on top of our own hardware. That's the thing that made it difficult to build managed Postgres. We put way more energy into the underlying infrastructure than most nü-Heroku companies. The cost was extreme, but I'm like 63% sure this was the right choice.


I shared your post with Render's engineering team and it got a lot of love because we know the struggle and can truly empathize because of our own Heroku-accelerated growth. What Fly and Render are doing is hard, but someone needs to do it.

If the market is big enough to support AWS/GCP/Azure as $N00B businesses each, it’s not a leap to imagine a future where both Fly and Render are incredibly successful, loved, and independent businesses spanning decades. Let's keep at it.


I've just started using Render and it's great!

Goodbye Heroku. :(


Thanks!


It sounds like you built the right product on the wrong technology stack, at the lower levels. For example, I have never heard a Nomad success story, but this might be colored by interviewing engineers desperate to escape it.

Something like linkerd on Kubernetes would be stronger, I suspect. But I don't know the exact nature of your problems.


> I have never heard a Nomad success story

There's a lot of Nomad at , just won't get any publicity but that's different.


> As Heroku fans, we never really felt like it needed a replacement.

If Salesforce kept investing in heroku, it might not. But there is a huge loss of confidence in heroku's future going on among heroku's customers right now, which is part of what you're seeing, as I'm sure you know. (Also I think to some extent you are being political/kind towards heroku... if heroku's owners were still investing in heroku for real, adding 'edge' functionality like fly.io is focusing on is what one would probably expect...)

And frankly... your tool seems more mature and... not to be rude to competitors but seems to have more of that certain `je ne sais quoi` of Developer Experience Happiness that heroku _used_ to have and other potential heroku competitors don't really quite seem to have yet. Does what you expect/need in a polished and consistent way.

I think work you put into the underlying infrastructure definitely shows there, and was the right choice. Tidy infrastructure helps with tidy consistent developer experience.

So I understand why people are looking to you as a heroku replacement. I am too! (And I don't really need the edge compute stuff; although I could potentially see using it in the future, and it shows you folks are on top of things).

And while I kept reading fly staff saying on HN comments that you didn't want to be a heroku replacement, so were unconcerned with the few places people were mentioning where you still fell short of it -- when I saw your investment in Rails documentation and tools (and contribs back to Rails), I thought, aha, i think they've realized this is a market looking for them, which they are only a couple steps from and it would make sense to meet.

When you mention in OP a "heroku exodus" to you... I'm curious if that was all people who left when heroku ended free tier stuff, and they've all come to you for your free tier stuff... because that does seem dangerous, such a giant spike in users who are not paying and don't bring revenue with them! I don't personally use very much heroku free tier stuff. I hope if that's a challenge, it's one you can get over. I don't think you are under any obligation to offer free stuff that can be used for real production workloads indefinitely -- although, as I'm sure you know, free stuff is huge for allowing people to try _before_ they buy, and whatever limits you put on it to try to prevent indefinite production use get in the way of someone's "try before you buy" too... and at this point, _reducing_ your free offerings is a lot harder PR-wise than having started out with less in the first place. :(


Just partner with Neon or other similar companies in this space. Scale-to-zero distributed databases is well understood technology.

https://neon.tech/


Yes, we are partnering with many companies like Hasura and Replit to help with managed Postgres. Since Neon scales to 0 and also autoscales to your workload, it is very economical for the long tail of low usage customers.


Does Neon support triggers and subscriptions?


yes to triggers - it's full postgres. logical replication is not exposed just yet, but soon


Subscriptions should still work with scale to 0. NOTIFY/LISTEN doesn't: https://neon.tech/docs/reference/compatibility.

We will have an option to not scale all the way to 0 to support this scenario.


Awesome!


will subscription work with scale to zero?


Can you please explain what the Vault related failure is about? Is this about timing out services failing to start within an acceptable time range?


Yeah, basically that. One of the servers in our Vault cluster failed and prevented Vault agents from receiving secrets. For Nomad apps, this showed up as "allocation failures" and failed deploys. Machine based apps took an abnormally long time to start and caused other issues.


Doesn't Vault self-promote in the case of single node failure?

I noticed you mention Vault lives in the US, I'm sure you've already heard of this pattern, but Vault (Enterprise) supports [multi-region clusters for performance and DR](https://developer.hashicorp.com/vault/tutorials/day-one-raft...)


Your margins are going to end up being a lot better than any other PaaS that's built on top of the big cloud providers.


I agree - fly is so easy to use (when it works) that it’s hard not to be impressed. BUT what I’ve found is that we don’t need edge compute, since our customers aren’t that latency sensitive, so it’s lost on us. It’s only a few more milliseconds to us-east-1.

I’ve heard (on HN) of a dozen different companies vying for the heroku replacement spots and yet Fly seemed to capture the attention. I couldn’t name another one off hand.

What I truly want and probably lots of other people too is Flyctl (and workflow) for AWS. The same simplicity to run as fly, but give me something cheap in Virginia or the Dalles.


Render.com is another spiritual successor of Heroku. I'd love a world where Fly and Render are both very successful companies.


Render has some great features like making a new sub domain for when a PR is opened so you can test it as a fully working API before you merge


I believe Netlify introduced this feature. It is now ubiquitous (as alexgrover said).


That’s supported on most PAAS these days, including Heroku.


On their free tiers though?


Well, no longer free on Heroku, but it was


did not know heroku had that.


Yeah I like them both a lot, having tried deploying small projects on each. However, I’ve defaulted to render at the moment because I’ve found it painless for my current project, and edge compute is low on my list of priorities.

Though to be fair, even if render collapsed overnight, I think I’d still be equally satisfied after moving to fly.


Not gonna happen. Both will get acquired because that’s how things work now


(Render founder) I'd love to understand why you think this is the only outcome. Render has positive gross margin and a clear path to profitability based on both our growth so far and the tailwinds in this space. I'm also aware of other companies like ours that have grown all the way to IPO or are well on their way.

I'm very explicit both internally and externally that an acquisition is a failure mode for Render. We're building this for the very long term and plan to keep it that way.


> I'd love to understand why you think this is the only outcome.

I’m curious why you think it isn’t? On a long enough timescale all good things seem to be acquired by large megacorps for a fuckton of money.

Slack, Linode, Minecraft, the list goes on. Eventually they all make the thing less than it was before under the founders’ vision. At least from my perspective.

It won’t stop me from cheering them on, but I’m still very skeptical of them not being bought out in 10 years.


Spotify, Apple, Amazon, Facebook, Twitter (kinda), Valve all have not been acquired and have existed for a long timescale...


That's clearly survivorship bias.

What you want to know is the probability of a small, independent, high quality provider remaining independent, high quality and not bankrupt.

It does seem to be rare in the tech space, especially in the US. Becoming one of the largest public corporations on earth is one way to do it, as you suggested, but the odds of that happening are minuscule.


For obvious reasons I’m excluding the companies doing the acquiring.

Except Valve I guess, but that was never a public company that could be acquired to satisfy investors in the first place.


Times are different now: money is not free anymore, so those big acquisitions need to make real business sense.


I guess I’m just default cynical these days seeing how much money’s still floating around and the scale of the cloud big 3. Apologies, it wasn’t personal. I admire your vision and hope it can work, money always seems to talk eventually though. We need more companies that have the nerve to hold on and develop on their own.


I admire your sentiment, at the same time founding teams don't typically say no to US$XX,XXX,XXX,XXX acquisition offers that'd cash you out for at least a few billion to you personally.

Are there any examples where the capitalism bottom line is ignored and a company keeps growing with extremely premium generous acquisition offers on the table? I can't think of any, but there could be a few. However, I expect it's pretty rare.

For companies with such tremendous growth, the venture capitalist firms are primarily looking to make their <big-multiplier> return and push priorities accordingly (understandably).

The only constant in life is change, it's best to focus on what you can do right now, today, and only put out promises or commitments that you have the necessary influence to follow through on. Some things are bigger than each of us.

Best wishes and godspeed to you and fly.io!


Unless a company is very explicit about this not being in the books, I tend to share this outlook.

From the perspective of a recent founder, it's downright spooky to build around any SaaS, considering how few of them have been around for 10+ years, when that is certainly what our business is aiming for.

I know (and share the feels): Devs tend to get excited about the new thing, but if Google Workspace shut down next month, we would be in so much operational trouble. When other people's fancies stand in the way of the entire operation you are responsible for, it actually begs the question how much closed source SaaS you can allow before it starts to be quite frankly irresponsible.

We are not imagining things. SaaS of all sizes shut down all the time, and when you are heavily relying on them and building software around them to run a business the prospect is spooky as hell.


The difference between (free) Gmail and Google workspace is that workspace is a paid product. If you're big enough to warrant an AM, you can get terms which include continuity of business planning if Google does happen to shut down Workspace. (They won't.)


Is your argument that Workspace is a paid product and therefore won’t be shut down? If yes, let’s keep in mind that Stadia was paid-for too. My trust in the longevity of Google products has been damaged beyond repair.


The difference is that Stadia was definitely losing money, whereas Google Workspace might be profitable.


These threads from mrkurt a few months ago seem relevant here -

https://news.ycombinator.com/item?id=32955520

If they are a multiplier for a whole portfolio, there's not much reason for any particular branch to purchase them.

(This post seems like some evidence they might actually be building the wrong thing, though.)


I'm guessing that downvotes come from those who see the macro environment changing. With increased rates, borrowing to purchase companies may make less sense.


Macro makes it harder to raise funding too though - VC no longer as attractive given the risks and higher interest rates available


Not sure why this is downvoted, it’s a valid point.


I'm waiting for a site that does comparison matrixes. It should have checkboxes for autoscaled compute, easy build/push, scheduled/queued tasks, WAF and CDN, object storage (wish Render had this specifically), emails, easy addons to other SaaS.


I can second that I‘ve seen render.com mentioned very often, maybe even more so than fly.


> What I truly want and probably lots of other people too is Flyctl for AWS. The same simplicity to run as fly, but give me something cheap in Virginia or the Dalles.

Pardon the ignorance, is this not the Amplify CLI [1] ?

[1]: https://docs.amplify.aws/cli/


No


Can you elaborate?


Things that just work and are a delight to use and the AWS Amplify CLI are not often mentioned together. The Amplify CLI is a growing collection of poorly thought out, poorly implemented functionality that looks good in demos, but falls apart under any close inspection.


I can only assume that if you’ve ever used the amplify cli it makes total sense.

At least that single ‘no’ contained a whole wealth of blood and tears to me.


(Not the OP)

I like Amplify and use it often. However, it isn't well integrated with "normal" backends, so if you want to keep a backend and frontend deployed together you either have to use their Amplify backend API or work out your own deployment.


I think this whole category is interesting, from the next-gen PaaS to the cloud-native ecosystem. Totally empathize with how hard what fly is doing in terms of scale and reliability is.

At Coherence (withcoherence.com) we're focused on a developer experience layer on top of AWS/GCP. You might describe it as flyctl for AWS.


> Flyctl for AWS

Have you tried AWS Copilot? I’m having good success with it. Probably not quite as simple as flyctl, but still it’s only one command to deploy a container.

I would really like fly.io to overcome these hurdles. I bet they will.


> What I truly want and probably lots of other people too is Flyctl for AWS. The same simplicity to run as fly, but give me something cheap in Virginia or the Dalles.

Google Cloud. It is painfully easy to spin up managed postgres, super easy to deploy gcp cloud functions or gcp cloud run. It isn't expensive either and just works.


If someone is not already using the holy trinity (AWS/Azure/GCP) there is probably a reason.


Egress pricing, for one.

fly.io charges an outrageous 2 cents/GB. Google is over 4x that.

At fly.io rates, 1Gbps average over a month is $6400/mo. Google is tiered and you’re looking at over $10k/mo.
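
For anyone who wants to check the arithmetic, a rough sketch. The 30-day month, decimal gigabytes, and the flat $0.08/GB stand-in for "4x" are assumptions for illustration; actual cloud egress pricing is tiered and varies by destination.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def monthly_egress_cost(gbps, usd_per_gb):
    """Cost of a sustained average egress rate over one month."""
    gb_per_month = gbps / 8 * SECONDS_PER_MONTH  # Gbit/s -> GB/s -> GB/month
    return gb_per_month * usd_per_gb


fly_rate = 0.02           # USD/GB, the figure quoted above
flat_4x_rate = 4 * 0.02   # hypothetical flat rate, not a real price sheet

print(f"GB per month at 1 Gbps: {1 / 8 * SECONDS_PER_MONTH:,.0f}")
print(f"At $0.02/GB: ${monthly_egress_cost(1.0, fly_rate):,.0f}")
print(f"At a flat 4x rate: ${monthly_egress_cost(1.0, flat_4x_rate):,.0f}")
```

1 Gbps sustained is about 324,000 GB a month, so 2 cents/GB lands right around the $6400 figure, and any volume discount that still leaves the per-GB price above roughly 3 cents keeps the bill over $10k.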

For comparison, a cheap managed switch that can handle 1Gbps costs about $100, maybe a bit more if you want a nice one. A nice router is more. You can rent an entire rack, including power, cooling, and an unmetered 1Gbps for $300-$1k/mo (with maybe some wiggle room on both ends). You can buy a pretty nice server, amortize the price over a week or two, and still come out ahead.

You certainly get considerable value from a major cloud provider, and a lot of their other services are reasonably priced, but, depending on your workload, the egress prices and the corresponding Hotel California factor may make using a major cloud provider a poor proposition.


Depending on what I'm building, I tend to cache on the edge with something like CloudFlare in front of GCP. Lowers the egress charges significantly, with the benefit of speeding everything up too.


It's potentially a lot more for the big clouds. Anything in the network path has a charge - load balancers, NAT gateways, etc.


The cost of egress plus a gateway or two is fairly close to the cost of burning a DVD twenty years ago. And it appears to actually be cheaper to burn DVDs and mail them today than to send data from a major cloud.

This becomes very relevant for things like archiving data. If you generate data outside of a major cloud, you can pay a major cloud a very reasonable fee to archive it for you. But if you ever download your archive, it will cost you about half the price of buying an external disk to store it on.

(To be fair, object storage is rather more reliable than a single crappy external drive. But if you access the data more than once, maybe you should have a colo or on-prem copy too.)


I'm not using gcp anymore because it's not worth the risk of losing access to my personal gmail account just to play around with pet projects.

I might be paranoid, but I just don't feel comfortable when there's so much in play.


Totally agree with this mindset. My digital life is on the line because Google refuses to separate services.


Create a new Gmail account?


Google associates different accounts that are from the same owner when handling issues FYI. So if they think your account is doing something wrong on GCP, be wary of associated accounts.


Never once heard of that.



Did you read the comments? My takeaway is that it was bs.


There was one person strongly challenging the story and plenty of back and forth. It hardly seems a settled case one way or the other.


Separating concerns, isolating things that are not related, these are some basic tenets of good engineering. Yet we all keep rolling the ball of mud downhill and act shocked it keeps growing and swallowing everything.


That's the beauty of the services I named. Super easy to roll the code to any other similar PaaS provider. There is no vendor lockin.

It is postgres + http handlers.


Do you have a guide in mind?

If it's sorting and sifting and clicking a bunch of stuff in the console, that's not painfully simple. If it's some easy cli commands, I think that's in the ballpark...


There is a gcloud cli, but I just automated deployments in CI...

https://github.com/google-github-actions/deploy-cloud-functi...


> generous free level of service,

This is likely the biggest culprit for a lot of these companies. Too many of us have grown up in the culture of getting hosting and platform for "free", but at some point the companies providing it still have to pay the bills. There has to be a better pricing model that lets someone deploy their relatively small, low-traffic app for $10s/month or even $200 - $300 / year for the basics (e.g. - Heroku free tier type capabilities). It's not going to save these companies but it would limit excessive growth of their own costs from a free tier while at the same time still being affordable for 1 - 2 person teams who are trying to get something in front of users.


I agree. And I know this is unpopular, but I think none of these companies should be expected to have a free tier. A low-cost tier? Certainly. Perhaps even a free trial with a credit card? Great.

But our team, who has used Heroku for over a decade, got bit multiple times by Heroku having a free tier.

Why were we impacted by other apps? Because Heroku’s load balancers are shared amongst all their apps. That includes all the sketchy apps running on the platform.

If Heroku could somehow isolate us from everyone else? Great - and they offered that for awhile with a reasonably-priced Add-On supported by them called SSL Endpoint. It cost about $15/month and put us into a pool that was shared with other folks willing to spend that much per month to run their app.

I understand that’s not great for a hobby project. But for those of us trying to run a large product on Heroku and not have to spend multiple extra thousands of dollars every month for a Heroku Private Space, this was a great way of pooling: put a small fee in place for one pool of resources. Not many malware writers or other misbehaving app creators will probably want to spend that much per month.

But they axed that a few years ago. Only a couple of months after we were thrown back into the load balancer pool with all the other free apps, one of the IPs was marked as spam and we had to figure out a kind of janky solution.

Additionally, Heroku seemingly spent a ton of resources on free tier support, malware fighting, etc. I hope to see more features on Heroku since they’ve dropped that support… but I haven’t seen much evidence of that in roughly six months since they did that. But we’ll see.


Most of these platforms have reached the critical mass to stay in business (and the time to attract customers like you) only because of the free tier. Sorry, but even a low-cost tier is too much for most people wanting to give a new infrastructure/stack a try. Most of these adoptions come from hobby projects trying it first and then recommending it in a professional setting. In a professional setting, yes, you can afford to pay a low cost to evaluate it, but you can't afford the time to do so. So they rightly offer a trade: free usage in exchange for the time spent evaluating. This is what actually brings the initial customers.


Do you have examples of companies in this space that actually reached break-even? Heroku never hit profitability as far as I remember and with the Salesforce acquisition the question of profitability is moot. AWS is a counter-example to using a free tier as a GTM strategy. AWS did not start with a free-tier offering for S3 or EC2, that only came years later. By then they already had significant traction in the market.


I don't know if they have reached break-even. I know that they have been in business for a long time and are clearly getting investment based on usage and potential. That usage would not be there were it not for the free time contributed to test-drive their offering and the free "evangelizing" done by those "free-loaders". AWS/Google/Microsoft are in a different league because they can afford the waiting game for actual enterprise evaluations or can afford massive sales forces (the Azure case). Smaller players cannot afford that and have to rely on "recommendations" based on past experiences.


> And I know this is unpopular, but I think none of these companies should be expected to have a free tier.

Free tier is a GTM motion which makes sense for novel tech products like Fly because: https://en.wikipedia.org/wiki/Technology_adoption_life_cycle


Nice write up!

I wish I shared your enthusiasm for where Heroku could go but I have a few friends at Salesforce I've asked about how they see Heroku internally and it really doesn't seem like it is going to get much love. Hope to be wrong though.


Thanks! I have talked with two Heroku folks who say (to me, a paying customer of Heroku Enterprise) that Heroku is absolutely in active development.

I let them know they need to demonstrate that to me. They have a roadmap [1], but it seems to have barely anything moving forward, including some really important concepts like http/2 support.

[1] https://github.com/orgs/heroku/projects/130


Well, they’re owned by bigcorp now right? Everything probably takes 10 times as long for no good reason.


If you don't need all that much, Oracle Cloud does offer a free tier for VMs. You get 2 AMD VMs or 4 ARM VMs and even a free Oracle DB, object storage, load balancing and monitoring.

https://www.oracle.com/cloud/free/

It's still just a free tier so you can't expect good support, but, it's there.


All of the forum posts and howtos on spinning up an AdGuard or similar service on Fly.io instead of a local Raspberry Pi probably didn't help things. While it does drive user growth, those aren't the type of users you want.

One of the hardest lessons every business needs to learn is how to say no to the users they don't need.


Check out DO app platform. It’s literally exactly what you describe.


The CloudFlare folks wrote a good blog post on how they are seeing their customers use Edge compute — latency is far down on the list: https://blog.cloudflare.com/cloudflare-workers-serverless-we...


The US CLOUD Act means an EU customer cannot use a US cloud provider to host PII, even if the server itself is physically in the EU, because US law will still compel the provider to yield the data to US authorities. The European Commission is trying to paper over the cracks with a fig leaf of judicial review, but it's only a matter of time until a Schrems III decision from the CJEU invalidates that polite fiction.


The number of EU companies following this law is exactly 0.


It’s not true. I know people who lost contracts because they were using Azure and the customer wanted to respect the law.


I've talked with companies like that as well and they start with strict rules and end up allowing clouds because no solution is compliant anyway.


I guess it works when you don't have any compliant competitor.


Hetzner or OVH would be compliant.


They are however far from service parity with AWS, Azure and GCP.

I can't speak for Hetzner, but OVH also has availability issues.


That is exactly the problem at hand.

It's a combination of low to no enforcement, competitiveness-killing laws and unrealistic efforts for said companies to take on.


Yep. The real question is how long until we get one.

Scaleway seems to go in the right direction but still a bit of work needed


I can attest that there are a lot more than zero in Germany.


I would be glad to be shown a company without AWS, Google Chrome, Google Search, Slack and all the usual suspects.


This simply isn't true. At least not for the EEA (Norway).


I have never seen a company without Google Search, Google Chrome, AWS, Microsoft 360 and the lot.

Which alternatives are they based on?


Those would not contain PII from your users though, unless you have terrible policies about copying personal information in random Google Docs.


Companies have to guess what is PII and what is not, the EU have no idea (other than they know which companies they want to punish)


The GDPR is quite clear on defining PII, I don't understand why you would claim otherwise?


“It is difficult to get a man to understand something, when his salary depends on his not understanding it.” — Upton Sinclair


All of these will absolutely contain PII every time.


Nice bit of FUD you got there.

You can use Google Search and be 100% compliant, because Google doesn't see any customer data. Google Chrome isn't even a service; I can't imagine how you'd manage to stick customer data in there.

And if you think there are no companies without AWS and Microsoft 360, you need to expand your horizon. I work for one such company, and so do many of my peers.


There are also lots of companies that use AWS etc. for everything but customer PII and keep that in some SAP system on-prem.


Google Chrome does, through telemetry and account history synchronisation, which log PII in URLs and searches.

Google Search will see PII go by if your marketing team is researching leads on LinkedIn for example.

> And if you think there are no companies without AWS and Microsoft 360, you need to expand your horizon. I work for one such company, and so do many of my peers.

And that's great.

What is the services stack your company is implementing?

What kind of alternatives do you use for your email, browser, centralised data storage, etc. ?


I honestly can't tell if you're trolling or you said 'AWS' and 'Microsoft 360' and meant cloud and managed email.

> What kind of alternatives do you use for your email, browser, centralised data storage, etc. ?

There are plenty of browser alternatives (firefox, safari, vivaldi, even chromium).

There are dozens if not hundreds of email providers, and you can even provide your own.

You can 'centralize data storage' on disks on hardware you own, on premises or colocated. You could even use one of the dozens to hundreds of managed service and cloud providers.


> I honestly can't tell if you're trolling or you said 'AWS' and 'Microsoft 360' and meant cloud and managed email.

I meant both clouds and managed email / storage services.

> safari

Don't both Firefox and Safari have telemetry and various ping back services?

> There are dozens if not hundreds of email providers, and you can even provide your own.

> You can 'centralize data storage' on disks on hardware you own, on premises or colocated. You could even use one of the dozens to hundreds of managed service and cloud providers.

Sure you can, I'm just saying that it is rarely if ever done in medium to large companies.


So there is nothing in EU laws preventing you from opting into using these services. What _is_ prohibited is having an EU-based product/service where your users are not aware that by using a service their data will be stored under US jurisdiction.

That is not the same as using US-based products.


Defense and government is a huge sector. You can live very well off it.

They are not going to skimp on the rules. A large part of banking won't, either.


I know I've personally spent a large portion of my time updating systems to be compliant in the last few years, in North American companies.


Might well have been yak shaving. If a company is under US jurisdiction it simply cannot comply with EU data protection.


... Are those North American companies prepared to willingly break EU laws then? Because in my (amateur) understanding it’s logically impossible to satisfy both CLOUD Act requirements and EU data protection ones (not just GDPR, but general due-process rights the CJEU considers required for privacy violations and US courts deny noncitizens).


Yes.

Whenever a US law and a foreign law conflict, the US law always wins when you are in the United States. Complying with US laws is also a perfectly valid defense if a European citizen or state ends up bringing action against you in a US court.


European states simply sue in their own territory or in front of the European Union Court of Justice.


Yup. Which is basically a no-op. You need a court having jurisdiction over the defendant to have any relief. Even if you receive a financial judgement, international law does not put much weight in absentia cases.


If you have customers in the EU then the court has jurisdiction.

If the company doesn't comply, fines will be directly taken from customer payments for example.


Again - regardless of if a domestic court believes they have jurisdiction, any court case not brought in the venue of the defendant is effectively meaningless as you cannot be granted meaningful relief.

If the destination bank account is outside the EU, they can't touch it without cooperation from the defendant country's courts - which requires you to file in the defendant's venue. If an EU country unilaterally seized intra-bank remittance they would be cut off from the international banking system without hesitation.

You seem to really be grasping at straws here, but the EU is not some all powerful entity that can enforce its laws outside its jurisdiction.


> Again - regardless of if a domestic court believes they have jurisdiction, any court case not brought in the venue of the defendant is effectively meaningless as you cannot be granted meaningful relief.

Of course you can, you simply reach for assets within the border of said member country or the EU. As I mentioned in my previous comment, you can for example get the funds from outgoing payments by customers of said company. You can also freeze accounts, prevent ownership or investments by any citizen of that country as well.

> If the destination bank account is outside the EU, they can't touch it without cooperation from the defendant country's courts - which requires you to file in the defendant's venue. If an EU country unilaterally seized intra-bank remittance they would be cut off from the international banking system without hesitation.

There is nothing unilateral about a country seizing money as payment of a fine from a company. This is a standard tool that every country's IRS-equivalent agency has in its tool belt.

> You seem to really be grasping at straws here, but the EU is not some all powerful entity that can enforce its laws outside its jurisdiction.

I never said that EU is all powerful, however, if business is done within the EU, EU countries have the power to access any and all funds going to the US for companies that do not comply.

They can also decide to block said service as a punitive measure.


> Of course you can, you simply reach for assets within the border of said member country or the EU.

Which is exactly what I said. If the US company has an EU subsidiary you sue in that venue that can grant you relief. There are US tax implications of holding foreign assets, so the 1% of US companies with overseas interests create a foreign subsidiary, the other 99% have absolutely nothing within the reach of the EU.

> There is nothing unilateral about a country seizing money as payment of a fine from a company.

Funds in transit belong to the sender until they arrive in the destination account. The EU would be seizing the funds of an innocent third party (the customer), and the target company would just shrug and say "your payment didn't arrive send it again." The EU cannot seize a transaction in flight and also compel the target company to honor it against their books.

> if business is done within the EU, EU countries have the power to access any and all funds going to the US for companies that do not comply.

See above. Taking money from random EU customers I guess is something they could do, but I imagine their citizenry would be none too pleased about it.

Let me try to simplify it for you: the EU cannot take what is not in EU jurisdiction without the cooperation of the foreign court. If a company says they were complying with their domestic law which violated EU law, they would likely not receive the cooperation of domestic courts to grant relief.


Let me make it simpler for you.

If, say, Google were to not follow the GDPR, even if they didn't have any European subsidiaries, the EU or a member country would simply make all Google customers pay their subscription fees to them instead of Google as payment of the fine. Customers would see no service disruption.


In your example Google would not receive the funds and credit the customer's account. How would they differentiate an EU government stealing the money from a customer who just didn't pay and say they did?

Feel free to call up your credit card or power company and ask them what happens if you send them a payment but it gets seized by the government along the way. Their answer will be that you still owe them money.

In your example the EU customers would be out the money, not Google. With no EU nexus (in your hypothetical) they cannot compel Google to provide services they were not paid for.


> How would they differentiate an EU government stealing the money from a customer who just didn't pay and say they did?

Because they would have been notified by a court beforehand and the fine would constitute an outstanding debt linked to a lost lawsuit.

Once that happens, the national collection agencies would take over and use the tools at their disposal, like collecting from customers directly, which is the equivalent of garnishing wages but for companies.

They would then receive regular updates about the remaining debt and what was already paid and by whom.

> Feel free to call up your credit card or power company and ask them what happens if you send them a payment but it gets seized by the government along the way. Their answer will be that you still owe them money.

If Google then refused service to the customers whose payments were redirected to that country's collection agencies, then additional punitive measures would be taken by the country.

Some of the punitive measures could be:

- growing interest on the outstanding debt

- blocking the service within the country or EU

- advertising to financial institutions that Google is delinquent and is refusing to pay its debt

- preventing banks and financial institutions from loaning money to or investing in Google

- imposing an embargo on imports and exports towards Google

- extradition requests for the C-suite, or adding them to Interpol and Europol wanted lists

- etc.

> In your example the EU customers would be out the money, not Google. With no EU nexus (in your hypothetical) they cannot compel Google to provide services they were not paid for.

They can't force Google to provide services, but Google will also lose that market (for the EU that's 450M people) and face increasing punitive measures.

Also, Google refusing to pay would probably discourage financial institutions anywhere from servicing Google in the future and other countries from authorising Google on their national markets.


Please tell the legal department of our uni. I’m stuck with a home-made Kubernetes cluster where I have to mail the admins for provisioning, SSL and domain management. Would love to switch to Fly or Render


Not exactly related to the OP, but: I think I speak for a large number of folks when I say that we don't care. The EU keeps passing all sorts of absurd laws that require dedicated auditors to comply with. It's just not going to happen. If they decide to actively enforce these things, they'll just isolate themselves from the rest of the world.


As an EEUU resident, we also don't care. We can survive without youtube and instagram and the whole surveillance industry. Some of the laws place a heavy burden on giant tech companies, but for good reason.


They place a burden on everyone. A burden that's going to create a two-tier internet where service is immediately refused to EU citizens by every provider except the giant tech companies that can afford to comply.


Close, the giant tech companies may or may not comply but they surely can afford the fines that the various EU Data Protection authorities dream into reality by twisting an ever-changing body of interpretation of ambiguously written rules.


Why exactly is it seemingly so expensive not to sell your customer data?


That's not the issue. I don't want to see personal data sold either. It's all the little rules. There are hundreds of pages just in GDPR. You need a banner and explicit opt-in just to support login/logout functionality.


Can you explain why you believe this to be the case? Let's say you log the user in. Yes, you need consent to store a login cookie, but that doesn't mean you need "a banner and explicit opt-in". You only need explicit opt-in, which you can do by... putting a "remember me" box next to your login form[1]. Is that really so hard?

[1] https://law.stackexchange.com/a/32157


> giant tech companies that can afford to comply

Where does this sentiment come from? Cost of compliance for Facebook is many orders of magnitude higher than cost of compliance for a website for your hairdresser or a restaurant.

In my startup, GDPR was barely a blip on our radar. We had to delete website logs and that's about it. You have to keep records of customer/payment information for laws that supersede GDPR, and that's it if you run a legitimate business not reliant on stealing.


This simply isn't true. Look at the absurdity of all the cookie banners just to support basic login functionality. I'm all for internet privacy, but these laws are so sweeping that it's impossible to be compliant without a dedicated function for it.


No need for cookie banners for functionality like login. Ref: https://law.stackexchange.com/a/32157


Interesting that "EEUU", to my knowledge, mostly refers to the US (Estados Unidos) in a Spanish context. The abbreviation for European Union would be UE (Unión Europea), right?


Oops. Late-night brainfart, I'm a Portuguese speaker and got things a bit mixed up :)


That's a nice theory, but it may not survive the next few decades of regulatory capture by the same type of company you believe it's intended to act against.


You don't speak for me. I don't want to live without YouTube.


Where are you from that you use EEUU as an acronym?


I haven't put much thought in this, but is a Frankfurt data center provided by Amazon Web Services EMEA SARL (a Luxembourg-based company) considered a US cloud provider or an EU one? I mean, being wholly owned by a foreign owner doesn't generally change your jurisdiction, and employees of that wholly owned subsidiary (including its directors) are not required to obey US laws or court orders but are required to comply with EU legislation.


My understanding is that the distinction hinges on whether the data is available to a US based employee. Can the NSA show up at a US address and tell the people there to hand over the data? Can this data transfer happen without an EU based person taking some action? If the answer to both questions is yes, the data handling is not compliant.

Of course, IANAL, do your own research, etc.


I got burned by this. I spent a lot of time researching and planning for this, only to discover there is no demand for solving this problem (yet?).


You're assuming that the US doesn't respond to political pressure and come up with an agreement with the EC to enable the flows. The wiretap act already goes beyond the fourth amendment in protection.


The problem is the European Commission is not applying political pressure because it rolls over for every fig leaf the US offers. It then takes Max Schrems to sue and several years before the CJEU overturns the "compromise".

That said, the Biden administration's latest proposal might pass muster if the proposed redress mechanism were truly independent as part of the Judicial Branch of the United States as opposed to the current proposal which is still part of the Executive and thus conflicted in ruling against surveillance decisions of the Executive Branch and its agencies:

https://www.whitehouse.gov/briefing-room/statements-releases...

https://noyb.eu/en/open-letter-future-eu-us-data-transfers

That said, even US citizens don't enjoy meaningful protection against warrantless wiretapping that clearly violates the Fourth Amendment due to the deference the judiciary has given to the executive, so I am not optimistic.


Assuming best practices are followed, AWS would have to crack into multiple systems to offer up data for EU residents from AWS machines in the EU. Is there any record of them being required to do so?


Hmm, that post is almost three years old -- still accurate?


Yes, especially as compliance and regulatory frameworks continue to evolve and become more difficult to adhere to as mentioned elsewhere in the comments.

We're inherently faster than other "serverless" platforms due to the scale and homogeneous design of our network, and that network has presence in nearly 50% more cities than it did just 3 years ago. We were plenty fast enough then and we're even faster now.

Other things that customers (still) really care about: developer experience, ease of use, and cost. Nobody likes paying the AWS tax to move data around—they just want to use the best solution from the best cloud provider. Workers and the associated storage primitives allow them to pick and choose from the best that AWS, Azure, Cloudflare, GCP, et al. have to offer.

(Disclaimer: I'm a long time Cloudflare employee focused on App Sec, and I speak to customers regularly who look to Workers largely for compliance reasons, but I don't work on the Developer Platform business. Am sure my Dev Platform peers will chime in with more nuanced answers!)


This is spot on. I found myself using Fly for a project because it was super easy, not because I needed edge compute. TBH it's still actually unclear to me who needs edge compute? What apps require this sort of infra? It's not 99% of web apps right?


I still think that in the next pendulum swing we'll end up with edge computing and (smaller) self-hosted backends. Everything old is new again, and we haven't entirely recreated Akamai from first principles yet.


One of the big benefits of edge compute is that it’s geographically distributed. Doesn’t make a big impact across the US, but globally a lot of nations have specific data laws, so it’s important to host data in the required nation. Keep customer data in its nation of origin, but have a single control plane and platform for every data center.


It is going to be apps that provide rich experiences that need to do a lot of server communication to deliver them. I am thinking of things like collaborative whiteboards, for example. If 2 people are in Europe, working on the same whiteboard, then it should be low latency. The edge nodes will be near each other (or next to each other).


Personally I see this as a 'why not, if it works' type thing.

Sure you don't need it for 99% of usecases, but if it just works using familiar architectures then it is also strictly better for 99% of usecases so you might as well, and people will naturally want it.

That 'familiar architectures' part is the hard bit, though.


But it isn't better in 99% of use cases. Lots of use cases are rendering an API response or HTML page that involves multiple database requests. Therefore the distance between database and app server is more important than the distance between the client and the app server.

Edge compute can be helpful for static or quite cachable content. But often this is handled as well or nearly as well by a caching CDN.

So that leaves a few cases where edge compute is useful. One is where you are globally distributing the data itself (and ideally moving the data around as your users travel or move), which is incredibly rare and expensive to build. The other is when you need pure computation that makes no request to your backend; and if 50ms of latency matters for a pure computation, most of the time you can just move it to the client. In my experience these cases tend to be rare. I would estimate that edge compute is actually helpful for 1-5% of projects, not 99%.
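To put rough numbers on the database-distance argument, here is a back-of-the-envelope sketch. The round-trip times and query count are made-up illustrative values, not measurements of any provider, and the model assumes the database queries run sequentially.

```python
def request_latency_ms(client_to_app_rtt, app_to_db_rtt, db_queries):
    """Latency of one request: a single client<->app round trip plus
    one app<->database round trip per sequential query."""
    return client_to_app_rtt + db_queries * app_to_db_rtt


# Hypothetical numbers: app server sits next to the database, far from the user.
central = request_latency_ms(client_to_app_rtt=100, app_to_db_rtt=1, db_queries=5)

# Hypothetical numbers: app server at the edge near the user, database still far away.
edge = request_latency_ms(client_to_app_rtt=10, app_to_db_rtt=100, db_queries=5)

print(central, edge)  # 105 510
```

Under these assumptions the edge deployment is slower as soon as a request needs more than one database round trip, unless the data moves to the edge as well.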


I'm a co-founder at Northflank. This is what we've spent 3+ years building. https://northflank.com.

I am sympathetic with much of Kurt's post. We spent a long time building solutions to several of the areas highlighted (managed PG, persistent volumes, secret management and service discovery).

Making radical changes to architecture on a live cloud platform is always a challenge.

On the front-end Northflank is a next-gen PaaS built for high DX, speed, and powerful capability (real-time UI, API, CLI, GitOps, IaC).

Our backend is built using Kubernetes as an OS: providing a huge amount of flexibility on service discovery, load-balancing, persistence/volumes and scale.

The benefit of using Kubernetes is a universal API across all major cloud providers. We can scale clusters and regions across EKS, GKE and AKS in seconds, either in our managed PaaS or inside our customer's own cloud account.

Our managed dataservices: MySQL, Postgres, Redis, Mongo, Minio are all built using Kubernetes Operators with a small but mighty team.

From a generous free tier to autoscaling to managed postgres and other advanced PaaS/DevOps automation workflows Northflank offers something unique.


Yeah, distributed systems at the global scale are very very difficult - at least with the Heroku style problem, you'd be looking at scaling in a single datacenter I think - deployments to multiple datacenters wouldn't share dependencies.

I do wonder however if they'd be better off using less l33t tech - do almost everything on Postgres vs consul and vault, etc. Scaling, failover, consistency, etc is a more well-known problem and there are a lot more people who've run other DBs at tremendous scale than the alternatives.

Simplicity is the key to reliability, but this isn't a simple product, so idk.


> I do wonder however if they'd be better off using less l33t tech - do almost everything on Postgres vs consul and vault, etc. Scaling, failover, consistency, etc is a more well-known problem and there are a lot more people who've run other DBs at tremendous scale than the alternatives.

In my experience people who ran Postgres distributed across a WAN tended to use obscure third-party plugins at best, more often a pile of dodgy Perl scripts. Using something designed from the ground up to be clustered seems to have a much better chance of working out than trying to make something that's been built as a single-instance system for decades work across the internet.


Yeah, point taken, I wasn't thinking to cluster across the WAN - more like an api wrapping postgres in a single DC. But you pay the price of read latency I guess... it's a hard problem no doubt.


Backend simplicity also means a more shallow moat. It makes it easier for Digital Ocean/Linode (Akamai)/Hetzner to offer a competing service with the same backend knobs to turn, should they decide they want to get into that market.

The goal should be to make the backend as simple as possible, but no simpler. Complexity here leads to operational burden and toil. But that's why you hire good SREs and treat them well. What's more important is frontend complexity, aka how difficult it is for customers to use. Backend and frontend complexity aren't necessarily linked, which, imo, fly.io achieves, downtime aside.


> dirt simple managed Postgres

Heroku PostgreSQL is very simple, yes. But once you need non-trivial scale it's expensive and extremely non-performant. Even a medium-sized RDS will outperform Heroku's most expensive database offering by 20x in my experience. My company doesn't even run PG on Heroku anymore. We have a VPC/Private Space connection to AWS Aurora because the cost/performance difference is so extreme.


I don't know the details of how Heroku implements their hosted postgres service, but I'm _guessing_ that it's just a bunch of PG servers running on EC2 instances. There's probably a lot of CPU stealing "noisy neighbors" going on. But yeah, I've also experienced Heroku's PG databases being dog-slow compared to RDS for the same workloads.


They're probably using older or cheaper instance types. By not upgrading while charging the same or more over time, one can skim more profit.


I have run the experiment, and Crunchy Data’s Postgres servers are 4X more bang for the buck than Heroku’s.

I let some folks at Heroku know this who are product managers, and they are investigating it… but I would be shocked if Heroku gets a big performance improvement anytime in, say, 2023.

20X seems like a lot for RDS, though I’d be curious to learn more! We are switching to Crunchy because of that clear cost/performance difference you mention.


I would be kind of shocked if heroku gets a big performance improvement ever again. It seems the owners have decided to basically freeze it.


I'm going to plug Coolify, an open source Heroku alternative (with Docker support too) that I'm using on a cheap $5 Hetzner server which is a lot cheaper than the equivalent Fly or Render etc service, and it really doesn't have much upkeep from me even if you add in the time setting up the server initially, which is like an hour, and afterwards, it Just Works™.

https://coolify.io


Dokku is also nice and battle-tested: https://dokku.com/

And may I also plug Lunni, a self-hosted Docker Swarm-based PaaS I'm working on right now: https://lunni.dev/

Both work pretty well on $5 servers.


Lunni has got an interesting concept — and I can actually see some good uses for it!

Is the actual "production" workflow still pasting a Docker Compose file in? I would much rather have an automated deployment process that doesn't require human input, that way it can be scripted as part of CI/CD, etc.

Personally, I fell in love with `git push production` (naming a git remote `production`) to trigger a deploy. Ironically I didn't like this back when I first tried Heroku, but it's grown on me since. As of now, I have a custom git receive hook on my server (building a NAS from "scratch" using IaC on my home server) that triggers a redeployment using Docker Compose.
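For anyone curious what such a receive hook might look like, here is a minimal sketch, not the actual hook described above. It assumes a bare repository, a deploy branch named `main`, a checkout directory of `/srv/app` and a `docker-compose.yml` at its root; all of those names are hypothetical.

```python
import subprocess

ZERO_SHA = "0" * 40  # git reports an all-zero new SHA when a ref is deleted


def pushed_branches(stdin_lines):
    """Yield branch names updated by this push.

    git feeds post-receive one line per updated ref: "<old> <new> <refname>".
    """
    for line in stdin_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # ignore blank or malformed lines
        _old, new, ref = parts
        if new == ZERO_SHA:
            continue  # branch deletion: nothing to deploy
        if ref.startswith("refs/heads/"):
            yield ref[len("refs/heads/"):]


def deploy_commands(work_tree, branch):
    """Commands that check out the pushed branch and rebuild/restart the stack."""
    return [
        ["git", f"--work-tree={work_tree}", "checkout", "-f", branch],
        ["docker", "compose", "-f", f"{work_tree}/docker-compose.yml",
         "up", "-d", "--build"],
    ]


def run_hook(stdin_lines, work_tree="/srv/app", branch="main", run=subprocess.run):
    """Redeploy only when the deploy branch is among the pushed refs."""
    if branch not in set(pushed_branches(stdin_lines)):
        return False
    for cmd in deploy_commands(work_tree, branch):
        run(cmd, check=True)  # raises if a step fails, so the push reports an error
    return True
```

To use it, the file would be saved as an executable `hooks/post-receive` in the bare repository, with a Python shebang line at the top and a final `import sys; run_hook(sys.stdin)` call.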

Also, you mention Swarm... what does Lunni bring with Swarm as opposed to simple Docker Compose? Does it distribute across multiple systems?


I'll start with the Swarm since it's a major point actually: Docker's Swarm mode is comparable to Kubernetes or Nomad: you can launch a cluster of servers and run your application there.

Unlike Kubernetes or Nomad though, it uses mostly the same concepts Docker Compose does, to the point that your development docker-compose.yml file will likely just work there (with some minimal tweaks). I love this website that talks more about it: https://dockerswarm.rocks/

Edit: As opposed to `docker compose up`, when running on a single server: not much. It will restart on server reboot by default, and allow you to run multiple replicas of a service (deprecated in Docker Compose), but that's it. Most important though, it would allow you to add more nodes later on, and it will then scale your services across the whole swarm – so you can start with just one server and scale to hundreds if needed.

> I would much rather have an automated deployment process that doesn't require human input, that way it can be scripted as part of CI/CD, etc.

This is almost doable with Lunni. This guide will walk through setting up a CI for a typical webapp that packages it in a Docker image and pushes to a registry: https://lunni.dev/docs/deploy/from-git/ (currently for GitLab CI and GitHub Actions only)

As for the continuous delivery, we're gonna have a webhook that you can call when your CI pipeline is finished. It's not exposed in the UI yet but I'll try to prioritize it (now that I remember I wanted to do it :')

`git push production` feels a bit easier, but I'm a bit concerned about bloat: for this to work, we'll have to bundle some sort of CI and container registry with Lunni itself. I think sticking with third-party CI is a more elegant approach here. What do you think?


> Most important though, it would allow you to add more nodes later on, and it will then scale your services across the whole swarm – so you can start with just one server and scale to hundreds if needed.

Have you in all honesty and with first hand experience, deployed and supported in prod on swarm over hundreds of servers?


Nope :') I do know a guy though, and I've heard good things. I'd love to hear about your experience too!

I know that it is possible to outgrow Swarm – I think that's a nice problem to have actually. We might include some tools for “graduating” from Lunni to something more serious like Kubernetes at some point.


You can set up a git hook on repositories that listens for the completion of the docker_build task and then redeploys the app while pulling new images?


Not a git hook, but you can do that as a part of CI workflow. So, you can have a script like:

    docker buildx build --push ...
    curl -X POST https://lunni.example.com/api/webhooks/c8aaa9b8-1bda-4a99-820c-36a75d31f8a7
that will rebuild a Docker image, then trigger the redeploy.
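If you want the CI job to fail when either step fails, a slightly more defensive version could look like the sketch below. The `deploy` and `run` helpers, the `DRY_RUN` switch, and the image name are assumptions for illustration, not part of Lunni. The webhook URL is whatever your own instance gives you.

```shell
# Sketch of a CI deploy step: build and push, then call the redeploy webhook.
# DRY_RUN, run and deploy are hypothetical names chosen for this example.

# Run a command, or only print it when DRY_RUN=1 (useful for checking the pipeline).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

deploy() {
  image="$1"
  webhook_url="$2"
  # Build from the current directory and push to the registry.
  # The && means the webhook is only called if the build and push succeeded.
  run docker buildx build --push --tag "$image" . &&
    # --fail makes curl exit non-zero on HTTP errors, so a failed redeploy fails the job.
    run curl --fail --silent --show-error -X POST "$webhook_url"
}

# Example:
#   deploy registry.example.com/myapp:latest "$LUNNI_WEBHOOK_URL"
```

Keeping the webhook URL in a CI secret rather than in the script is probably a good idea, since anyone who has it can trigger a redeploy.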


How do Coolify and Dokku compare? I've been aware of Dokku for a long time, but I've never been confident enough to rely on these interfaces to deploy applications, especially given the uncertainty about the business model that keeps them going. I'll have to try them both eventually, though. I honestly hate PaaS: the prices are all just too high, but the convenience is really nice when managing a multitude of services simultaneously.


Dokku Maintainer here.

I don't really have a business model. I do take donations from Open Collective (and Github Sponsors, which funnels to OC) and there is Dokku Pro, but those don't collect anywhere near the funds I'd need to stop my dayjob (at least now. Maybe someday?).

My business model is that code releasing is something I'm pretty passionate about. Dokku isn't even originally my project (Jeff Lindsay started it, I just took it over), but I've been working on it for almost a decade. It's open source and fairly simple, so even if something happened to me, others could theoretically continue the project on as desired (or build on top of it if need be).

I'd be interested in hearing any of your other concerns though :)


Oh that's really cool to know, wasn't expecting the Dokku maintainer to read my comment LOL. From what I had looked at some time ago I thought the project was run with the profits from Dokku Pro, although upon further thought I understand that that's probably not enough to keep someone working full-time on it LOL.

I checked the repo and yeah, it checks out, Dokku _is_ pretty manageable with a decently small codebase. Having a low bus factor is really important for me. I'll check it out soon, and hopefully leave a donation to help you keep the project going too :)


Nah Dokku really is mostly a labor of love. Originally I started working on it to provide a Heroku alternative for a group of students that couldn't afford what they needed to on Heroku (this was like... in 2014) and I've since been using it to run all my own stuff and the occasional client install when I do freelance.


Thank you so much for your tremendous work!


I think overall Coolify is a bit more modern. It uses some different components (e.g. Traefik vs Nginx for reverse proxying), and includes a UI in its basic package (Dokku was CLI-only for a long time, now they have Dokku Pro with a Web UI). Otherwise it looks like the architecture is pretty much the same.

Re: business model: both Coolify and Dokku are open source, so even if their development stops, you can continue to use them no matter what. (You do have to pay for your own servers though :-) So it's not a PaaS in the traditional sense (like Fly.io or Heroku), but more like “build your own PaaS” thing.


Dokku maintainer here

For proxying requests, Dokku currently supports:

    - nginx on the host (default)
    - traefik (via docker labels)
    - caddy (via docker labels)
    - haproxy (via docker labels)
We'll also soon support nginx via docker labels, which will work around issues where Docker sometimes assigns random IP addresses (and unlock TCP/UDP proxying as well).
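For anyone wondering what switching looks like in practice, here is a rough sketch. As far as I know the proxy is selected per app with `dokku proxy:set`, but treat the exact commands as something to verify against the docs for your Dokku version. The helper name and `my-app` are made up for the example. The function only prints the commands rather than running them.

```shell
# Sketch: print the Dokku commands to switch an app's proxy implementation.
# switch_proxy_commands and "my-app" are hypothetical names for illustration.
switch_proxy_commands() {
  app="$1"
  proxy="$2"
  # Only accept the four proxies listed above.
  case "$proxy" in
    nginx|traefik|caddy|haproxy) ;;
    *) echo "unknown proxy: $proxy" >&2; return 1 ;;
  esac
  echo "dokku proxy:set $app $proxy"
  echo "dokku proxy:report $app"
}

# Example:
#   switch_proxy_commands my-app caddy
```

The label-based proxies may also need their own container started first, so check the proxy's documentation page before relying on this.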

I can't say anything else about Coolify since I haven't used it in a while, but I'd be curious as to what other parts are more modern about Coolify than Dokku.


Wow, looks like Dokku got a lot of upgrades since last time I used it thoroughly. I'm wondering though, why support four different proxies?

About the modern part: that was my opinion based on the way I recall Dokku and Coolify, and a quick scroll through the docs of both, so I might be really wrong here! I definitely need to check out both Dokku and Coolify again sometime.


We support 4 different ones to give folks choice. Some folks want/need features that aren't available on one vs the other (traefik has a ton of features, caddy is simple to configure, nginx has a ton of documentation) so it made sense from that perspective. It was also easy to add once I had the pattern going (though the default has stayed nginx).

One of the main features of Dokku is its extensibility. You can cut one part out and replace it with another quite easily, and proxying is an example of that. I think that flexibility allows folks to use it in more situations than one otherwise would, though at the cost of being more difficult to maintain (and harder to have cohesion between parts of the system at times).


Lunni looks really interesting! Looks like a Coolify competitor, I'll definitely check it out. Do you have a Discord to join? Coolify has one and I found it great to discuss the project and talk directly to the creator.

I used to use Dokku but I personally liked the GUI from Coolify so I've been using that. Nice to see that you have a GUI as well, makes configuring apps a lot easier.


That's a nice idea actually, thank you! Just launched one:

https://discord.gg/9EAne8g2Pq

(Bridged to Matrix: https://matrix.to/#/#lunni:matrix.org)

Lunni is actually pretty young in terms of community (just me and a few friends now :-), so just a room in Telegram was sufficient so far, but I think it's a good time to start something more official.


Thanks, just joined. One thing I noticed is that the CI/CD setup for GitHub/lab is still based on the CI files themselves. What I like about Coolify and didn't like about others is that I could simply install a GitHub app to my account and Coolify would automatically pull my repositories and even set up git push to deploy for me, no messing around with CI files needed.

https://lunni.dev/docs/deploy/from-git/#github vs https://docs.coollabs.io/coolify/sources


I haven't thought of an app yet, but a custom GitHub Action was definitely on my radar. IIRC this would still be a one-click setup, but more integrated with the Actions UI, and will be easier to extend / override later, if needed. What do you think?


Sounds good. I recommend going through the Coolify install and set up process (and adding applications and services, as Coolify calls them) which would give you more insight as to how they set it up.

By GitHub app I meant that Coolify makes you install its own custom GitHub app that then allows you to git push to deploy.


I'll give it a go, thank you so much!


Personally I use portainer for basically the same thing. My only real requirement was that I could easily copy-paste in docker-compose files and have it just work.

I use caddy as the proxy, since I found the traefik configuration absolutely incomprehensible. Now I use only 2 labels to proxy instead of 15.


Caddy is nice, I've been playing around with it too and I love it. Perhaps we can port Lunni to use it, too.

Portainer is also cool – we're using it internally as an API, actually! I've been using it before starting Lunni and my only objection is the UI. Portainer is kinda like a Swiss army knife for containers, but with this power comes the complexity, too.

For example, to see service logs, you have to pick an environment, then go to Stacks, find your stack, find the service you need, open it and then you'll see the service logs button. In Lunni, you open your stack right from the dashboard and click logs button right beside your service name: https://u.ale.sh/lunni-screenshot-logs-button.png


I use Swarm for my own little app and love it, you should set up a Twitter so I can follow along on progress.


Thank you! No Twitter yet but if you're on Mastodon by any chance: https://fosstodon.org/@lunni/


No experience with either, but how does Coolify compare to Dokku, the OSS Heroku alternative I've been hearing about until now?


Dokku doesn't have a GUI, which is the main reason I switched away from it.


Dokku maintainer here.

Dokku doesn't have an _official open source_ UI. There are a few unofficial OSS ones (Ledokku is the latest) that I'm aware of.

There is a commercial offering in Dokku Pro (https://pro.dokku.com). It's paid (one-time lifetime license) but only so that I can at least partially cover my development time on it. The project is enough work on top of Dokku that I feel it is justified, especially as there is nothing stopping others from doing so, OSS or otherwise.


I can second this. We were evaluating moving off Heroku and to Fly.io, but we didn't need all of the edge compute stuff. We just want a better Heroku without having to think about infrastructure and having to think about edge compute just got in our way.

I feel like Next.js is in a similar position. While their main vision is SSR, I wonder if they are missing out on a chunk of the market that simply doesn't want to think about infra. We use them because we just don't have to worry about webpack or fiddling with deployment and hosting. We couldn't care less about SSR and in fact we disabled it app-wide.


One of the key design choices of Next.js was to enable granularity on the runtime (Node.js or Edge[1]) and the rendering method (static or dynamic[2]) on a per-route basis. If you want a full SSR site, that's okay. If you want a full static site, that's also okay.

We often see folks wanting a mix of both. For example, maybe the /about page is static, but the home page is dynamic and personalized based on the visitor. You can do all of this with Next.js. Our future direction is adding even further granularity, enabling this decision at the data fetch level, allowing you to cache results across deployments[3].

[1]: https://beta.nextjs.org/docs/rendering/edge-and-nodejs-runti...

[2]: https://beta.nextjs.org/docs/rendering/static-and-dynamic-re...

[3]: https://vercel.com/blog/vercel-cache-api-nextjs-cache


What I love about Next.js is not having to think about hosting, webpack, typescript, scss, and so on. It just works.

I initially fought to get SSR working, fixing hydration errors and making sure our code was isomorphic.

I later realized that I can just use the parts of Next.js we need and turned off SSR. It wasn't a big value add for our particular product.

But doing this wasn't straightforward. I hadn't even realized it was a possibility until I stumbled across a blog post.

I had to copy a NoSSR implementation off the internet. It wasn't just some flag I could toggle for a page.

I've also found myself recommending Next to folks saying "Use Next.js, but btw you don't need to use SSR. Make sure the trade-offs make sense."

I'm curious if I'm in the minority of Next.js users. What percentage of them don't need SSR but value everything else?


FWIW, not currently using Next.js, but I'm someone who values everything else but not necessarily SSR. I've been eyeing to use Next.js in SPA mode. There are dozens of us!


Why would they be missing out? Vercel can host static sites just fine, whether that’s one generated by Next or any other framework or written by hand


To clarify, I'm referring to folks who just need to write single page apps, but don't benefit from SSR.

(I didn't find disabling SSR straightforward.)

I wonder if Vercel is underestimating the size of the market that just wants a "Heroku for React".


I coincidentally tweeted the exact same thing earlier today.

I selfishly hope Fly put all their focus toward becoming Heroku 2.0. I’m sure some people care about all the edge latency stuff but I don’t know many of them.


Yeah, I guess devs first will need to learn better SQL before it will start to matter.


God, I'm glad someone else sees it as clearly as I do. I learned about Fly from them acquiring freaking Litestream! The SQLite replicator! The canonical database at the edge of the network! Of course that's what they want to do.


> lower cost as you scale

The cost aside, I'm wondering how fly or heroku support their customers when they grow to microservices ecosystems.

The problem shifts from deploying easily to deploying reliably meaning one release of a service should not break the other services. Other problems appear too, like service discovery, peer authentication, gateways, test and staging environments where there are downstream dependencies, etc.

Are customers supposed to leave when they grow to this level? Or are there solutions for these?


> generous free level of service,

Can't you make the argument that Heroku got out of this market on purpose? I know they were bought out by "ye old corporate greedy meanie overlord" or whatever but... I'm sure there is data that showed it made sense from a "make money business" perspective to not be in that market.


I would even go so far as to say free users aren’t customers in a platform like this. There’s no revenue model and if it’s “run without constraint” you get some fun scaling problems but those types of issues can sometimes lead to optimizations that keep the system online in the face of many small apps but doesn’t necessarily cover the case of a large resource app.


Digital Ocean gave me the PaaS replacement and managed PG and I couldn't be happier.

If anyone else is looking.


I tried running a docker app on Digital Ocean's app platform, the UI is nice and I can see it being acceptable for your average CRUD app. But I had to abandon it due to the latency just being too high and their built in monitoring interval not being configurable to the ~5-20ms range (this was for competitive Battlesnake last year).


You can take a look at www.qovery.com. It provides a Heroku-like experience but runs on your cloud account (AWS, Scaleway or Digital Ocean).

They build on existing tech that is already working, so it is more stable.


Save for a few "in-preview" features, Fly was stable too but then they started growing faster than they could keep up (a good problem to have!). Stability isn't a permanent state.


What are the limitations to heroku that people are going to Fly for? Maybe there's a standard article that would be useful to read about it?


It's more about Heroku dropping free and low-cost plans, which is them demonstrating that they don't currently care about the low end of the market, more than any specific feature.


Also the absolute disaster with security they had just before dropping free tiers, and the awful response which took months to even acknowledge some kinds of data (such as pipeline keys) were affected. [0]

[0]: https://github.blog/2022-04-15-security-alert-stolen-oauth-u...


This is an old doc from fly so I’m not sure how much of it is still accurate, but it talks about some of the stuff Heroku didn’t have that they have: https://fly.io/docs/app-guides/speed-up-a-heroku-app/

> There's no support for a single dedicated IP address for your application. With Heroku, your application's CPU resources are mostly located in one datacenter. Heroku doesn't support HTTP2 or Brotli compression and it doesn't do Edge TLS termination. And it doesn't run your applications on dedicated MicroVMs. These are all things that Fly's Global Application Platform does.

The other comment that mentions Heroku dropping low cost plans is the reason for the explosion in growth as I understand it though.


People don't trust it any more, because Salesforce have been under-investing in it for years and recently rug-pulled on everyone who had ever used the free tier (after providing that free tier for 15 years already - long enough for people to reasonably expect it to continue).


Edge of what exactly? The software resides within networks they control. There is no edge in this scenario.


Why does there need to be a successor to Heroku?


I'm not a user of Fly.io. I can't help but notice how remarkable the effect of open communication is on potential end users like me. I remember reading about their reliability problems on HN some time ago. That biased my view of the company. After reading this, the open communication and transparency restored my trust in them, and would make them again a potential candidate for future projects. Because now I know that they acknowledge the problem and that they are trying to improve things.


This is probably therapy, but your message and fly.io's post resonate a lot with what I'm going through. I took a product owner role about 6 months ago, my first, with a company that has turned out to be just a mired mess, and a product universally hated both internally and externally.

Long story short, it's completely over-engineered by a bunch of intellectual engineers with no focus, no discipline, and no oversight. It ended up not delivering on any promises it made, and there were a lot of them.

I was warned left and right before presentations and meetings, "this customer hates your product because of ...." I started off every meeting with saying, "we're rearchitecting the product, this is how we're doing it, this is the tech we are using." Immediately there was a sense of relief from customers, followed by questions like, "why can't <current product> deliver <feature> that was promised?" I'm completely honest with bad decisions that were made and how it impacted the feature. Sure, there is skepticism on what we are doing, and I tell them they should absolutely be skeptical based on our track record. The result has been customers who have hated my product now offering to work with us on development.

I've also been completely forthcoming on configuration, security, resources, and setup issues I am finding, many of them are absolutely freakin' insane. I've flat out told customers it's frankly embarrassing and never let us do something like this in the future. The best feedback on this was, "At least you're telling us something. We usually get silence from this team."

God, this is the most depressing job ever.


Your job sure does sound depressing, and it's not one I would succeed at, but if you can power through and turn this product around that's a hell of an accomplishment you'll have to be proud of.

I'm curious what you'd like to do next. You could probably have a great career doing these sorts of turnarounds repeatedly across companies, maybe even as a consultant, but would you want to?


> that's a hell of an accomplishment you'll have to be proud of.

It's hinted by the C-level that if I can pull this off, it would be nothing short of a miracle. I'm pretty sure I can negotiate salary, education, bonus, and what not if I can pull this off.

As far as next, I've thought about that. It would be funny to call myself a turnaround specialist. This would be quite a remarkable feat, but I really don't know if I would have taken this job if I knew what a mess this was...


> I'm pretty sure I can negotiate salary, education, bonus, and what not if I can pull this off.

Do this up front. Do it as soon as you possibly can. You will lose a huge amount of negotiating leverage if you "wait until you show them". I cannot stress this enough.


Just to reinforce this suggestion, this was my first thought also. Go get your comp mate.


Same! It’s the same principle as CEOs and stock options. For instance, Musk was set several goals (x% market share, x units sold…), and the harder they get, the more compensation he receives.

I am not sure how much you could negotiate, but you can ask for something like that and make it metric-based: X% customers happy, x% rating change, x% of customers retained when they were close to leaving. Then you do the math on the revenue and profit, and it’s hard to say no.


Part of what I hated about Product Management at my last role was the consistent helplessness I felt when I was on calls with our customers. I could tell our product wasn't meeting their needs but all I could do was try my best to give the engineers context on how best to eventually meet them.

I remember my first few days on the job just being ripped to shreds by our customers who (understandably) were slighted. Don't miss those days at all.


This is why engineers need to be on some customer calls. They can be told not to talk at first or be trained in customer etiquette, but nothing makes a difference like hearing this from the users.


When I was in an engineering IC version of this role, I longed for someone like you to have my back in management and have customers’ backs too. If I had a time machine and a magic lamp I’d team up with you.

As a future representation of past me, I can tell you:

1. Everything it’s making you feel is valid.

1b. If you’re feeling burnt out, please listen to it. It gets worse if you let it.

2. While I can’t hire you now, I can already tell you’re eminently hireable. If you have any cautious inclination to move, you will probably be better served by greener pastures.

3. Just take care of yourself.

4. When 3 contradicts 2, favor 3.


Man, you don't know how this impacted me.

My boss is supportive, but he's also under heavy fire. Like I mentioned, my peers are rightfully skeptical. My team are a bunch of sharp, good guys, but haven't had any good guidance or mentorship in years, if not decades. They're all different, but what they have in common is that they've been screwed and judged unfairly thanks to past incompetence. That just pisses me off.

There's hope from those around me, but it's a pretty darn lonely job. You just gave me the fuel to not feel already beat up when I walk in the door tomorrow.


Agree with GP - your post makes me want to hire you.

One trick you might try: write future press releases. This helps you look beyond the immediate problems and focus on the destination. For example:

“Q3 2023: ACME CO released version Z today, which dramatically simplifies our engine to focus on core user needs. ‘It does thing I want and doesn’t crash anymore’ says Key Buyer #1”

By writing this down, you can put the vision in front of everyone. Then check it against actual progress to see how you’re doing.


I feel this. I hope you get over the hump and your job gets fun. We've had flashes, at least, but I do think what we're doing (and probably what you're doing) requires some irrational behavior.


I have a POC I'm trying to get out in a couple of weeks. It's for the #1 feature that was promised in this product, never possible due to the architecture, and we've gotten raked over the coals for it. Wish me luck, because if I can get it out soon, it's going to be downhill for a while.


Can you help me in a detailed sense - what did you tell customers? Did you literally say their product is "completely over-engineered by a bunch of intellectual engineers with no focus, no discipline, and no oversight"? That seems a little over-honest to me but of course I wasn't there.


It depends on the audience.

First off, it helps that I've spent 15+ years as an engineer and 5 as an engineering manager, and have community contributions in the field on my resume throughout. I instantly spotted the problems when I was given an architecture diagram on day 1 and discussed what I would do differently. All that gives credibility.

If it's internal audience, I am brutally honest. The organization needs to know this wasn't happenstance and bad luck that put us where we are now. It was a deliberate series of bad decisions based on a poor engineering and product culture. Now, for better or for worse, we are tasked with paying the debt.

There's a certain class of customers that are sister companies under the same parent. I'm honest with them, too, but go on the offensive. They have abused my team and our company in the past, and unfortunately, we have let them. I am more than happy to fire back and go toe to toe with bad behavior, and at the same time working to fix critical support issues.

For external customers, I've had remarkably good response in listening to their complaints. I am honest in discussing, in deep engineering detail, how the new product will address their problems, where issues might still be, and development timeline. I like to think the credibility portion comes into play here. In the past, customers were just told, "We'll look at it" and "We'll fix it" but nothing was ever planned.


If you have any interest in writing a series of blog posts about your experiences in the “turnaround” and linking to them in your profile I believe you would have an audience. Your explanations are clear and your experience is worthwhile. Just food for thought.


Could you further elaborate on “intellectual engineers”? What mistakes were made? Esoteric languages? Obfuscating design patterns?

Partially so I could learn from mistakes and partially since I’m a sucker for post-mortems :)


There is a part of the app that extracts and parses a database log. There is C, Java, Perl, Go, and bash scripts all over the repo. The bootstrapper is written in Java, but the core work is done in Go. Commandline arguments that are documented in comments may or may not even work. The Go section is one big state machine.

That's just to take a database log, put it into JSON, and zip it up.

That's all for just one step in the data pipeline. The others are slightly less hairy, but 75% of this pipeline is literally just moving data around. It goes from JSON to MySQL to Postgres to Parquet. There is no data enrichment at all during these steps. It literally just unpacks from one format, packs to another, and repeats.

The whole fucking thing is just one big masturbation circlejerk for a bunch of engineers that have thankfully been RIFed/forced out...


it is.

i do enjoy programming, working with other devs, but as soon as i stepped into a product management role it's a hellishly different set of problems. you're in the middle of the tech, the developers, the problems, and the customers. lots of lessons in there. tiring, but worthwhile.


Architectural astronauts.


mrkurt[1] is also active here and has been very transparent in his comments about scaling issues.

Similar to this post he commented a week ago:

> In a year we'll either be ahead of those, or not growing anymore due to ongoing capacity issues. I'm hoping for the former.

I am rooting for Fly! Great team. The company reminds me of early HashiCorp.

[1] https://news.ycombinator.com/user?id=mrkurt


Agreed, this is how company communication should be.

I don't use Fly but would consider them in the future even given their recent issues.

I look at this in contrast to Twitter who had/has? an outage today. Their leadership is opaque and doesn't take responsibility for the issues they are causing.


In fairness, a CEO who has basically been Kanye-ing himself and his company into irrelevance is a low bar.


Open communication is great when there are incidents, but even better is having no incidents. (of course there are nuances depending on specific context)


Really? For me it's the contrary, posts like this from companies are a signal to say "Jump out of the ship while you still can".

Kinda like when crypto exchanges tweet "Yo we're definitely not blocking withdrawals, we're perfectly healthy".

You know what would make me consider a company? The fact that they don't have a bad reputation to begin with, and don't need to make posts like that to try to save their reputation.


Me too.

However, this is a double-edged sword. Their key value proposition is scale / speed which makes it concerning that they haven't "solved" that yet.


I’m not [directly, at present] in the market for their product but the very frank and real introduction basically moved Fly from my “I think I’ve heard of that but I’ll need my memory jogged for any further recognition” category to my “potentially good stuff to reference when relevant” category.

This kind of frank and human communication is vulnerable, but it’s good for establishing credibility… with me at least!


This is huge. Even as a member of a larger company, this stuff matters. If you have a vendor that doesn't bullshit you when things go wrong, you can actually trust them. This is how you avoid companies having the "hmmm they seem to be having lots of issues recently, let's consider moving off them" conversation.


> This is how you avoid companies having the "hmmm they seem to be having lots of issues recently, let's consider moving off them" conversation.

Yeah, this is how you get companies to have the "well, it SEEMED they were having lots of issues, now it's clear that they indeed did, moving off of them is priority #1".


This post is carefully worded corporate messaging, but because they write for their developer audience it has an informal "oh shucks we messed up bad y'all" vibe to it. But make no mistake, this is 100% corporate messaging.

I get that growing is super hard. And maybe fly will grow up to be a good platform some day. But that's the future. Today, they're flying by the seat of their pants and I mostly feel sorry for people who were tricked into thinking this platform is ready for production use.


I don’t get what you’re saying. This isn’t a brag disguised as a confession; they are actually admitting to poor performance. Of course it’s to eventually make users trust them, but a) I don’t see anything bad with that and b) they are choosing the hard route.

They are being open and transparent (afaik) even if carefully worded, which I also don’t blame them for.


I don't agree. Fly.io took on 40m in venture capital and is now on a do-or-die mission towards a $400 million exit. This might be obvious to many here on HN but most developers don't understand these startup dynamics and the consequences of a growth-at-any-cost strategy.


I'm not sure why the cynicism around their candor. Do you think it's not genuine just because it was posted by a company employee?

Your post implies corporate messaging is bad. And anything posted by a company—or at least I don't know where you draw the line—can be considered corporate messaging. Am I just reading too much into your phrasing?


It's strategic messaging. It can't be genuine, because of what it is. The benefit they get is publicity and damage control, and as you can tell by the many responses here, it buys them time because many developers are willing to give them the benefit of the doubt.

Companies that engage in this kind of candor are careful not to disclose those things that would really hurt their business. Those things are still kept secret. If the CEO accidentally sexually harassed an employee that's not getting disclosed. A mea culpa is only offered for the issues that are already known regarding scaling, downtime, and missing features. Struggles they have because they're choosing to grow so fast.


Sorry, what? Do you expect that no company can think about what to write before they post it, or that any post about anything internal must cover all internal issues? Posts must be either all roses or a no-thought laundry list of everything bad?


> Do you expect that no company can think about what to write before they post it...

I guess, you and GP are in agreement for the strategic part of the argument at least, if not the genuine part of it.

As someone who's been active on Fly's community forums for close to 18 months now, I think Fly employs some of the most genuine and helpful engs you'll see, so I'll give them the benefit of the doubt.


Of course not. The point is that the reader should recognize corporate communication for what it is: fundamentally self-serving. Corporate communication should therefore be met with thoughtful skepticism, and not with naivety or cynicism.


There is a world of difference between the tone and content of that Fly post and the tone and content that most people expect from cover-your-ass corporate blog posts.

Sure, both are examples of "self-serving corporate communication" - but it's clear that the way Fly communicate here is more valuable and trustworthy than so many other examples of this kind of thing handled poorly.


I don't think this means that it can't be genuine. I mean, the intent was to buy some good will with their customers who have experienced problems, that's for sure. But it's not a "bad" motive I think.

If I had a bad day and didn't get to complete something within my estimate, I'll tell my boss I had a bad day and ask for more time. Does that mean I have some ulterior motives? No, I just had a shit day, and needed some compassion.

They have been going through a rough patch recently with their scalability problems. And they realised they might not address it as easily or as quickly as they'd like. So they just wanted to buy time. I think that's better than "bunkering" and not letting your customers know what's up.

They do have the benefit that their audience is tech savvy as they are, so they can go into more details (and be less formal, I suppose) to get some understanding from their customers. As in, most devs have struggled at some point with a problem that exceeded the initial scope/time estimate. It sucks, and we know it sucks. So, why not give them the benefit of the doubt here?

Like, I think I understand what you mean: their goal was to buy more time, and they achieved that. But even though it was corporate messaging, I still think it was genuine. I assume they felt a bit like "ok shit, we gotta talk to our customers, they deserve to know what's going on".

I guess they wouldn't air most of their internal issues, since those aren't felt by the customers. So there's no need to apologise and explain themselves.


Yeah I agree with you. Corporate messaging is still corporate messaging. If you go the candor route, it's one tactic of many you can employ.


professionals don't get tricked into thinking a platform is ready for production use

If you don't have SLOs and SLAs, then you get what you get, essentially. Even a company with a great reputation can completely reverse course with a single bad incident, and you get nothing in return if there's not a contract.


Honestly, if you are a small fish to AWS... what is an SLA?

They can trot out a low level person to stall you with questions, or an AI question generator that maximizes the amount of time you waste on your end, and call that "SLA met".

And even if they DON'T meet the SLA on occasion, you built your stack on AWS. You are lying in the bed you made.

SO, what, AWS throws some free credits (that their 30-40% margin easily absorbs)?

The only big stick in these types of things is having dual-cloud capability, where you can move your service quickly from one cloud to the other. Stateless API servers? Maybe. Database servers? ouch. Cassandra could reliably span two clouds, but man, would AWS kill you on their ludicrously overpriced network costs.

Has anyone done Postgres replication across providers as a useful production system? Doubt it.


I've been doing reliability stuff for near two decades. The one thing I am sure of is there is no way to just engineer your way to reliability. That is to say, no person, no matter how smart, can just invent some whizbang engineering thing and suddenly you have reliability.

Reliability is a thing that grows, like a plant. You start out with a new system or piece of software. It's fragile, small, weak. It is threatened by competing things and literal bugs and weather and the soil it's grown in and more. It needs constant care. Over time it grows stronger, and can eventually fend for itself pretty well. Sometimes you get lucky and it just grows fine by itself. And sometimes 50 different things conspire to kill it. But you have to be there monitoring it, finding the problems, learning how to prevent them. Every garden is a little different.

It doesn't matter what a company like Fly does technology wise. It takes time and care and churning. Eventually they will be reliable. But the initial process takes a while. And every new piece of tech they throw in is another plant in the garden.

So the good news is, they can become really reliable. But the bad news is, it doesn't come fast, and the more new plants they put in the ground, the more concerns there are to address before the garden is self sustaining.


There are no silver bullets for whole system reliability, but high-availability clustered databases were this whizbang thing that greatly improved the reliability of your database, back in the day. It didn't come cheap, and there were growing pains, but sometimes the available technology does make a difference.


You can make sensible assumptions that result in engineering gains though. Step around the problems not through them.

For example, I have learned that the first step to reliability is removing as many HashiCorp products from your stack as possible. Appears I am not the only one.


If you’ve been using them in ways clearly explicitly called out as not per the design goals, then sure, removing any piece of technology will help you. I’m guessing that is not your actual problem though.


I would not assume that Hashicorp products necessarily meet the design goals if I'm honest. Consul and vagrant have been absolute shits and vault adds more complexity and unreliability to the problem domain and has a net negative ROI. I like the idea of their products but the reality is very different.


> The one thing I am sure of is there is no way to just engineer your way to reliability. That is to say, no person, no matter how smart, can just invent some whizbang engineering thing and suddenly you have reliability.

It seems true for Fly's problem space, but in many problem spaces there really are easy engineering solutions to reliability problems.

For a very easy example, I once worked on a rails app that crashed frequently and managed 5 req/s at best. It turns out the app only loaded static data from hardcoded json files on disk and templated that into stuff. In other words, it was a static site. Replacing it with an actual static site + nginx and a cdn instantly fixed all reliability issues for that website forever, and made it easier to maintain the content to boot.


I'm actually surprised such a simple app would have such bad performance and crash at all?


I don't think the fact that it did effectively:

    data_1 = `cat ./data1.json | grep "city" | awk ....`
    data_2 = `cat ./data2.json | grep "city" | awk ....`
was exactly helping it to perform well. I'm sure rewriting the rails app to load all the data at startup, not to read each file via several hundred subshells on each request, would have made it perform well enough.
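
For illustration, a minimal sketch of the "load everything once at startup" alternative, written in Python rather than Ruby. The directory layout, the `city` key and the `load_cities` / `handle_request` names are all hypothetical; the real app's data shape isn't shown here.

```python
import json
import tempfile
from pathlib import Path


def load_cities(data_dir: Path) -> dict[str, list[str]]:
    """Read every JSON file once and keep the "city" values in memory."""
    cities = {}
    for path in sorted(data_dir.glob("*.json")):
        with path.open() as f:
            records = json.load(f)  # assumed: each file is a list of objects
        cities[path.stem] = [r["city"] for r in records if "city" in r]
    return cities


def handle_request(cache: dict[str, list[str]], name: str) -> list[str]:
    """Per-request work is now a dict lookup: no subshells, no disk reads."""
    return cache.get(name, [])


def demo() -> list[str]:
    # Build a throwaway data directory so the sketch runs on its own.
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp)
        (data_dir / "data1.json").write_text(
            json.dumps([{"city": "Chicago"}, {"name": "no city here"}])
        )
        cache = load_cities(data_dir)  # done once, at startup
    return handle_request(cache, "data1")
```

The cache outlives the files it was built from, which is the point: once startup is done, serving a request never touches the disk or a shell.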

However, pretty much no matter how well or poorly the rails site is built, a static site will be easier to run reliably.


Nicely said. I remember AWS outages (S3, EBS, and RDS) in the early 2010s when their products were younger. But given time to improve each has become more and more resilient.


Maybe so but there are definitely technology choices that have vastly different "initial reliability". What would you expect to be more reliable - a bash script or a Rust program?

I'm not that familiar with setting up global network infrastructure but I imagine there are similar choices that can vastly affect initial reliability.


I'd say it depends more on the person rather than the technology.

A master in bash will build a more reliable API (in bash no less!) than a beginner in Rust, simply because of experience and knowing their way around the tools they're using. Newer/different technologies won't simply solve a problem unless the person has some sort of domain knowledge of said problem.


I agree. Like, you could build two houses: one out of sticks, the other brick. Depending on who's building it, if they've not built a house out of brick before, it's very easy to make a mistake. Versus someone who's been building stick houses forever will get it right the first time.

Also, brand new software in general is like a new hybrid plant. How does it behave in this environment compared to other plants? Does it attract more bugs? Does it need different care? We don't know yet; it's new.

And even for an old well known plant, if the gardener hasn't gotten to know it yet, it's easy to make a mistake with its care. But a well known plant with a gardener who's grown it before is the most likely to work without issue.


> Depending on who's building it, if they've not built a house out of brick before, it's very easy to make a mistake. Versus someone who's been building stick houses forever will get it right the first time.

Decent analogy. And of course if you have people of vaguely similar skill levels then the brick house is going to be way more robust. Which was my point.


This is an excellent description.


Good engineering helps reliability but doesn't guarantee it.

Bad engineering causes bad reliability.


I remain kind of amazed about how heroku managed to pull off what they pulled off, in the first place.

Also:

> The Heroku exodus broke our assumptions. Pre-Heroku, most of the apps we were running were spread across regions. And: we were growing about 15% per month. But post-Heroku, we got a huge influx of apps in just a few hot spots — and at 30% per month.

I hadn't before seen anyone with a big picture view confirm a heroku exodus was happening, although a lot of people suspected it or had anecdotes.

But if fly is seeing a pretty enormous number of customers moving from heroku to fly... oh wait, now I'm wondering, is this mainly a result of heroku ending free services, and those are free customers coming to fly for free services?

If so... that's a pretty big burden to take on without revenue to match, it does seem kind of dangerous for fly.
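
For a sense of scale, the 15% and 30% monthly figures quoted above compound very differently over a year. A quick back-of-the-envelope check, assuming each rate held steady for twelve months (which the post doesn't claim):

```python
import math


def yearly_multiple(monthly_rate: float) -> float:
    """Growth factor after 12 months of compounding at a fixed monthly rate."""
    return (1 + monthly_rate) ** 12


def doubling_months(monthly_rate: float) -> float:
    """Months needed to double at a fixed monthly growth rate."""
    return math.log(2) / math.log(1 + monthly_rate)


for rate in (0.15, 0.30):
    print(
        f"{rate:.0%}/month: x{yearly_multiple(rate):.1f} per year, "
        f"doubling every {doubling_months(rate):.1f} months"
    )
```

That works out to roughly 5x a year versus roughly 23x a year, so capacity planning done for the first curve is badly off for the second.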


> I remain kind of amazed about how heroku managed to pull off what they pulled off

Heroku was built on top of AWS. They didn't have to handle many of the hard problems Fly does.


Do they unpack what the 30% includes? Is that revenue, VMs, accounts?


Almost half of the issues are caused by their use of HashiCorp products.

As someone that has started tons of Consul clusters, analyzed tons of Terraform states, developed providers and written an HCL parser, I must say this:

HashiCorp built a brand of consistent design & docs, security, strict configuration, distributed-algos-made-approachable... but at its core, it's a very fragile ecosystem. The only benefit of HashiCorp headaches is that you will quickly learn Golang while reading some obscure github.com/hashicorp/blah/blah/file.go :)


We are asking HashiCorp products to do things they were not designed to do, in configurations that they don't expect to be deployed in. Take a step back, and the idea of a single global namespace bound up with Raft consistency for a fleet deployed in dozens of regions, providing near-real-time state propagation, is just not at all reasonable. Our state propagation needs are much closer to those of a routing protocol than a distributed key-value database.

I have only positive things to say about every HashiCorp product I've worked with since I got here.


Well, why did you do that? If you’d asked them whether this was a supported configuration or intended purpose, they’d have said no; and anyone who had experience deploying Consul at large scale would have told you the same.

There is truly no compression algorithm for experience.


I don't think he personally designed the first implementation. But in any case, understanding of complex topics comes in waves.

Many times I've had to read all the docs then use a system for several months before the epiphany hits me.


This is especially true for scaling. A solution that works great for your current deployment may be completely unworkable for 2x your current deployment.

You just won't know until you fall off the cliff. The armchair quarterback can opine that you should have just hired experts in XYZ domains from the start to design robust systems that can scale to arbitrary sizes, but most orgs don't need to scale to arbitrary sizes so this is highly likely to be wasted effort.


While I largely agree with you, this isn’t one of those cases. If Fly wasn’t supposed to scale in due course to this size, it probably wouldn’t have been funded. If your business model is predicated on you scaling, yes, you should hire appropriately in anticipation of that.

Besides, I’m not even necessarily talking about hiring here - even consulting would have been sufficient to avoid this catastrophe.


Yes, although it's rarely possible to know which bottlenecks will hurt the most up front. Unless you've done the same thing before, which is not the case with anyone pushing boundaries.

Basically this is an argument around so-called premature optimization. Good to have issues now while it is mostly enthusiasts that are the customers. Guessing that this bump will be forgotten in five years? And not like AWS et al don't have outages occasionally that they learn from.


Consul has been around for close to 9 years now, and people have in fact tried to use Consul in the very same way Fly did, in many different business and industries, with similarly failing outcomes. Hashicorp knows this and almost certainly would have counseled against it if asked.


Insert Donald Rumsfeld quote about un/known un/knowns.


I also think there’s this tendency in the industry to want to solve problems on your own without the help from outsiders, even if they know the problem space better than you do, and even if they’d gladly help (often for free) if asked. It’s especially worrisome when it’s powering a key workload that is essential to the functioning of your business. Sometimes it’s because you might not know whom to consult or recruit, but in this case, the vendor was known.


This feels unnecessarily antagonistic. "If you were experienced, you would have made the right decision, _obviously_."

Did Fly.io kick your puppy or something?


I can see how it would be interpreted that way, and I apologize if it came across that way, but it wasn’t my intent. See my other comment below. What I’m really saying is that we need to be better about engaging subject matter experts early on when we are selecting technologies to power core business functions; and I think it’s a good illustration of why we need to continue to hire experienced people at startups.


That's a fair point... but at the same time, we shouldn't hold off on starting something just because we don't have perfect information.


I respect that. Can you elaborate a bit on the routing protocol thing? I assume you used WAN gossip?

I love the simplicity of fly.io & wish you all the best improving Fly's reliability!


If you've ever implemented IS-IS or OSPF before, like 80% of the work is "LSP flooding", which is just the process that gets updates about available links from one end of the network to another as fast as possible without drowning the links themselves in update messages. Flooding algorithms don't build consensus, unlike Raft quorums, which intrinsically have a centralized set of authorities that keep a single source of truth for all the valid updates.

An OSPF router uses those updates to build a forwarding table with a single-point shortest path first routine, but there's nothing to say that you couldn't instead use the same notion of publishing weighted advertisements of connectivity to, for instance, build a table to map incoming HTTP requests to backends that can field them.
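
To make the flooding idea concrete, here is a toy sketch in Python. It is not OSPF or IS-IS: it omits acknowledgements, aging and retransmission, and the three-node topology, the `flood` and `backend_table` names and the per-app weights are all made up for illustration. Each node keeps the newest advertisement per origin, re-floods only what it hasn't seen, and then derives an app-to-backend table locally.

```python
from collections import deque


def flood(neighbors, db, origin, seq, apps):
    """Flood one advertisement from `origin` to every reachable node.

    neighbors: node -> list of adjacent nodes (hypothetical topology)
    db:        node -> {origin: (seq, apps)}, each node's local database
    apps:      app name -> advertised weight (lower is preferred)
    Returns the number of messages sent over links.
    """
    db[origin][origin] = (seq, apps)
    queue = deque((origin, peer) for peer in neighbors[origin])
    sent = 0
    while queue:
        sender, receiver = queue.popleft()
        sent += 1
        known = db[receiver].get(origin)
        if known is not None and known[0] >= seq:
            continue  # already seen this or something newer: drop it
        db[receiver][origin] = (seq, apps)
        # Re-flood to every neighbor except the one it arrived from.
        queue.extend((receiver, p) for p in neighbors[receiver] if p != sender)
    return sent


def backend_table(local_db):
    """Build app -> hosts (lowest weight first) from one node's database."""
    table = {}
    for host, (_seq, apps) in local_db.items():
        for app, weight in apps.items():
            table.setdefault(app, []).append((weight, host))
    return {app: [h for _w, h in sorted(rows)] for app, rows in table.items()}


neighbors = {"ord": ["sjc", "maa"], "sjc": ["ord", "maa"], "maa": ["ord", "sjc"]}
db = {node: {} for node in neighbors}
flood(neighbors, db, "ord", seq=1, apps={"app-4839": 10})
flood(neighbors, db, "sjc", seq=1, apps={"app-4839": 20})
```

Nothing here votes or elects a leader. Every node ends up with the same database simply because duplicates are dropped and newer sequence numbers win, and each node can build its own table without asking anyone else.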

The point is, if you're going to do distributed consensus, you've got a dilemma: either you're going to have the Ents moot in a single forest, close together, and round trip updates from across the globe in and out of that forest (painfully slow to get things in and out of the cluster), or you're going to try to have them moot long distance (painfully slow to have the cluster converge). The other thing you can do, though, is just sidestep this: we really don't have the Raft problem at all, in that different hosts on our network do not disagree with each other about whether they're running particular apps; if worker-sfu-ord-1934 says it's running an instance of app-4839, I pretty much don't give a shit if worker-sfu-maa-382a says otherwise; I can just take ORD's word for it.

That's the intuition behind why you'd want to do something like SWIM update propagation rather than Raft for a global state propagation scheme.
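
A small Python sketch of why no quorum is needed when each host is the only authority on its own state. This shows only a merge rule, not SWIM itself (which also handles membership and failure detection), and the version counter and `merge` function are assumptions for illustration rather than a description of how Fly's system works.

```python
def merge(view, incoming):
    """Fold gossip into a local view, keeping the newest entry per host.

    Both arguments map host -> (version, set of app names). Only the host
    itself is assumed to ever bump its own version; peers just relay it.
    """
    for host, (version, apps) in incoming.items():
        current = view.get(host)
        if current is None or version > current[0]:
            view[host] = (version, apps)
    return view


ord_v1 = {"worker-sfu-ord-1934": (1, frozenset({"app-4839"}))}
ord_v2 = {"worker-sfu-ord-1934": (2, frozenset())}  # the app was stopped

# Two observers hear the same updates in opposite orders.
first = merge(merge({}, ord_v1), ord_v2)
second = merge(merge({}, ord_v2), ord_v1)
```

Both observers end up with the same view regardless of delivery order, because there is only ever one writer per key and "newest version wins" is order-independent. A real system would also need to authenticate that an entry really came from the host it names.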

But if you're just doing service discovery for a well-bounded set of applications (like you would be if you were running engineering for a single large company and their internal apps), Raft gives you some handy tools you might reasonably take advantage of --- a key-value store, for instance. You're mostly in a single data center anyways, so you don't have the long-distance-Entmoot problem. And HashiCorp's tools will federate out across multiple data centers; the constraints you inherit by doing that federation mostly don't matter for a single company's engineering, but they're extremely painful if you're servicing an unbounded set of customer applications and providing each of them a single global picture of their deployments.

Or we're just holding it wrong. Also a possibility.


this doesn't paint a full picture of your options, as there's nothing that stops you from having zonal/regional consensus and then replication across regions/long-range topologies for global distribution.

to be pithy about it, going full-bore gossip protocol is like going full-bore blockchain: solves a problem, introduces a lot of much more painful problems, and would've been solved much more neatly with a little bit of centralization.


I don't disagree that there are opportunities to introduce topology. I do disagree that there are opportunities to benefit from distributed consensus. If a server in ORD is down, it doesn't matter what some server in SJC says it's hosting; all the ORD instances of all the apps on that server are down. If that same ORD server is up, it doesn't matter what any server says it's running; it's authoritative for what it's running.

Of course, OSPF has topology and aggregation, too.

At any rate: I didn't design the system we're talking about.


> I do disagree that there are opportunities to benefit from distributed consensus

there's some benefits to static stability and grey failure, but sure, whatever. the important bit is to have clear paths of aggregation and dissemination in your system.

that being said

> it doesn't matter what some server in SJC says it's hosting

it kind of does matter doesn't it? assuming that server in SJC is your forwarding proxy that does your global loadbalancing, what that server is aware of is highly relevant to what global actions you can take safely.


My point is just that there isn't a consensus algorithm that needs to get run to know which of the two proposals to accept.


It doesn't need a Raft consensus algorithm, but Corrosion does converge to a consensus, doesn't it? In the OSPF example, that does need to converge to a state that is consistent and replicated on all the routers, otherwise loops and drops will occur. I'm curious if any convergence benchmark has been done that compares Raft to Corrosion.


Are there any plans to make Corrosion open source? Or are you able to talk at all about the technologies/patterns used to create it? I feel like service discovery is still ripe for disruption


Yeah, we'll for sure talk about it more some other time. Mostly today we want to talk about how we were sucking ass at customer comms.


Can you please write a blog post or book with "Sucking ass at customer comms" as the title ;)

I say this as someone that loves fly :)


Definitely looking forward to it!


I've been using Fly for over two years or so. The sentiment of this post doesn't align with my personal (anecdotal) experience.

The PG issues hit me two times in the previous weeks but other than that it's been working great for me.

With the move to v2 apps (using their new machines infra) things are actually faster and smoother than ever.

About a year or so ago their CLI was quite buggy but I haven't really hit any bugs in months.

I will remain with Fly for the time being. Hopefully they don't close shop!


We're nowhere even within the line of sight of closing up shop. We just haven't been doing a good job of aggressively communicating (a) when things go wrong and (b) what we're doing to account for it.

The Fly.io of 2023 looks almost nothing like that of 2021 (all for the better), and it's not obvious to our users what's changed. We've been doing a shitty job of communicating, and we're taking our licks for it now.


A lot can happen in 11 years :-)

And thanks a lot for fly.io -- it's working great for my (rather small) use cases.


Oh my god that's a great callback.


Agree - V2 apps on machines are incredibly slick to launch (create/start/stop), get info on with graphql, and scale up and out. Magic. When the PG administration experience is that good I'll move it all over.


Not a client of fly.io, but dang impressive for the company to be this open and honest. Definite respect - wish more companies were like this. It puts them on my short list almost immediately for future needs.


Came here to say pretty much this. Technical issues come and go; open communication is a core part of the company culture, and builds up trust.


Why would you shortlist a company that admits to reliability issues?


Because if they admit to them they’re probably in the process of fixing them. When compared to a certain B tier cloud who gaslight users about flaws in the product and API, the amount of effort expended to get to an admission of fault is much lower for companies not inclined to lie by omission.


All companies have reliability issues, especially in this problem space. I’d rather go with someone who acknowledges their failings in public.


At first I was all like “Ha ha, losers can’t scale”

And then I was “Huh, these technical challenges are actually pretty difficult”

And then I was all “crap, these are a bunch of technologies I was about to add to our stack”

Thanks heaps fly.io people; having the humility to honestly talk about the challenges and failures massively helps people such as myself as we navigate new unfamiliar technologies. If more companies were willing to do this, it’d be a lot easier to avoid common pitfalls.


The tech in their stack is still pretty good. Unless you’re supporting tens of thousands of customers and trying to make the promises that fly makes today. Look at the fly engineer replies in this thread.

Also they basically only use OSS versions; they could go give HashiCorp some money to solve their Vault problems. They could probably partner with 2ndQuadrant for PG, as two examples. That might not make sense for their business though.

Hard problems are hard no matter the choice.


Sure, I was going for a little humour there. A little riff on the whole “we always judge others until we walk in their shoes”.

The take away I was hoping for is “providing insights into how we struggle helps others”


> This is a theme. Existing open source is not designed for global deployment

Eh? Unless you are consuming something as a service and it actually advertises it as a feature, nothing is ready for 'global deployment'.

If you have a 'centralized' secret storage, then you have made it tied to a region. Want to have redundancies and lower latency? You'll have to distribute it. Vault has docs about this: https://developer.hashicorp.com/vault/tutorials/day-one-raft...


This one is interesting because Vault has an enterprise product which I assume (hope?) Fly is paying for. That enterprise version includes performance replicas, which allow for cross-region replication of secrets with region-local reads (and slightly slower writes). The OP almost makes it sound like they are using the non-enterprise versions (or at the very least, not taking advantage of this particular functionality).

That said, I'd imagine with large enough scale, these sorts of features break anyways.


It's been almost a year since I gave Fly a review (https://news.ycombinator.com/item?id=31391116) and it's a bummer that they're still struggling to get things right. Double bummer because I love Phoenix and Elixir and they employ Chris McCord there.

Maybe they were _too_ ambitious at the start? They have a hard road ahead of them, and competition like Render.com and Northflank have provided me with solutions to all of my problems. Great dev ux, great prices and predictable solutions. They also keep pushing out very useful features. A third competitor has also sprung up: Railway! There's certainly blood in the water.

Will they catch up to others before the competition solves the "global mesh" unique value proposition Fly.io currently has? That's the $1MM question.


I read your review, and had a question so I thought I'd follow up here. You mentioned render.com as a competitor - does render host its own infrastructure or do they act as a go-between between their users and AWS/GCP/whatever?


They act as a go-between in that they ultimately host on AWS/GCP. They host their own infrastructure in that they appear to run Kubernetes and have built out their own deployment and service fabric, so they're just using the underlying machines as dumb compute, they're not, eg, building on RDS.

In March 2021, someone asked a question about carbon emissions of their data centres. They said they hosted on both GCP and AWS, but mentioned they were interested in moving to their own bare metal [1].

In April 2021, I asked a question about egress fees to Google, and they walked back a bit the comment about moving to bare metal [2].

As of March 2022, they're still in AWS/GCP [3].

As of September 2022, workloads for new users deploy into AWS, even in regions that were previously served by GCP [4].

[1]: https://community.render.com/t/does-render-use-green-energy/...

[2]: https://community.render.com/t/is-render-com-hosted-in-googl...

[3]: https://community.render.com/t/are-your-servers-owned-by-you...

[4]: https://community.render.com/t/which-render-regions-map-to-w...


(Render founder) We're still on public clouds because even if it doesn't help with margins, it helps us move faster on features our customers want. It's all one big prioritization problem (and lots of little ones too!).


I'm curious how significant a risk products like AWS Lightsail are to your business - it seems you are competing in the same market, but:

1. They have vastly different ongoing capital and cashflow requirements than you do.

2. They have all the leverage when it comes to the question of your continued operations on their cloud.

I'm also curious if they have already offered to just buy you out since you're clearly succeeding where they seem to just be treading water. (But not expecting you to answer this question. :) )


> how significant a risk products like AWS Lightsail are to your business - it seems you are competing in the same market

Not Anurag, but as ex-Stripe himself, he may appreciate AWS Flexible Payment Service vs. Stripe parallels here: https://news.ycombinator.com/item?id=34513430


Once you've built a system that works well in production (and scales elastically, too, it seems), it is really difficult to switch out the underlying infrastructure. Makes sense about walking it back.

The problem with running your own servers in data centers as a startup is that elasticity is genuinely a difficult problem to solve if you don't have a large budget for unused compute, storage, and so on. As we are seeing in Fly.io's case.

Ultimately, my bet is that both startups end up as Heroku-like acquisitions for some large cloud company or another. I think that render will sell for a lot more because the value it provides is agnostic to the underlying cloud infrastructure.


This reads like a mea culpa from an indie hacker, but Fly.io had 5+ years and raised $40M to get these basic fundamentals right. And we get promises of a new status page.


Well, that's one way to look at it.

Fly's been many things over the course of its lifetime [0], but I believe their latest pivot (on what they call "Machines") is pretty darn good. I've been using Machines since Oct last year, and things have gotten better week-over-week. Like with any platform, Fly has its own idiosyncrasies, which don't take much to get the hang of. That said, I am the only person in my tech shop that deals with Fly. Some orgs with larger teams and heavier apps that deploy frequently or run DBs / disks on Fly (I don't) have had a rough few months; so that's there too.

[0] Ex A: https://news.ycombinator.com/item?id=13985940


Fly Machines, if I understand them correctly, feels like a step backwards. Sure they might work better than "standard" Fly apps, but one of the motivating cases for Fly is being able to effortlessly scale across the world without having a Ph.D. in CS and a fistful of certs for Cloud engineering. That vision for Fly is awesome, game changing even, ignoring their current stability issues.

Machines isn't that. From the documentation, it appears as though it's "just" a VM pinned to a single region and none of the "magic" of Fly really applies. If the server your VM is hosted on goes down, Fly won't redeploy your container. It's just downtime. Spinning up in other regions is something you have to think about and actually do. It seems closer to Heroku than it does Fly.

Maybe I am totally misunderstanding Fly Machines and their use-case, maybe they're aiming to close the gap between Machines and Fly apps. It's just a bit of a bummer to see something that looks like walking back the original "promise" of Fly and makes me question whether or not Fly is going to just become like every other PaaS (even if it's a really good one).


Agree. Kurt's mentioned on the forums that autoscale is coming to Fly Machines. They haven't implemented it just yet.

Even without autoscale, spinning up Machine clones in any of the 30+ Fly regions is as easy an instant scale-out as you'll likely come across on any of the NewCloud platforms.


Big companies fail spectacularly as well, so it is refreshing to read an indie hacker-style mea culpa rather than the pile of nonsense PR one would expect from a company that raised $40M.

Honesty pays off in the long run, but it's something businesses quickly forget past a certain stage.


It is concerning that they feel notifying a problem on the status page hurts their ego. It is under no circumstances something personal, and ideally it should even be automated.


Very interesting to see Kurt assert they're going to "solve managed Postgres", and I'm super curious to know what that means. Does it mean something like RDS, or more like CrunchyData?

I could see them building something RDS-like on their own, but if they're trying to go further than that I wonder if they'll buy or partner with other companies rather than doing it themselves. Neon strikes me as a Postgres-as-a-service that could pair well with Fly.


That comment jumped out to me too, my recollection was that they've been pretty vocal about that not being something they wanted to solve themselves as a core competency. I'm not quite parsing if these two blurbs should still be taken together or if the second sentence is refuting the first.

> The second problem we have with Postgres was a poor choice on my part. We decided to ship “unmanaged Postgres” to buy ourselves time to wait for the right managed Postgres provider to show up.

> We’re going to solve managed Postgres. It’s going to take a while to get there, but it’s a core component of the infrastructure stack and we can’t afford to pretend otherwise.

+1 to Neon seeming like a good fit, but it's also very much a beta (alpha?) both as a product and company (at least from my impression). I'm not sure that's a bet they'd want to make right now given the context of this post.


(Neon CEO)

We are launching our paid tier March 15th and will be production ready shortly after. We are running 20K+ databases and measuring reliability and uptime.

Generally reliability is a function of architecture (we are solid there), good SRE practices, and a long tail of events you live through, fix, and make sure never happen again. The bigger the fleet the faster the learning.


That's great, love to hear it! Really excited about what you all are working on.


Looking forward to it!


Craig here from Crunchy Data. Not sure if you mean Crunchy Data is like RDS or isn't; in some cases we're very similar as a managed service provider, but we are focused on a better developer experience and quality support.

We've had a number of customers that use us for the database and Fly for the app. One user benchmarked a number of Heroku alternatives with various database providers, and we actually had better response times than the unmanaged instances on Fly itself, as well as all the other providers they tested - https://webstack.dancroak.com/

I won't speak for Fly, but we're big fans of them and think we pair quite well together.


Yes we think they pair well together too. I believe the ball is in your court though. ;)


<3


I haven't used CrunchyData for work, but I see you as offering what RDS does plus plenty more. RDS does a lot, but after using Timescale Cloud professionally I saw how much RDS doesn't do, like actually-simple upgrades, one-click forks, etc. and Crunchy looks similar in going beyond RDS.

I think the community would really love to see a direct Fly+Crunchy integration!


If I was in their shoes I'd probably aim for a "serverless" Postgres experience where you get a connection string and you know nothing else.

I think RDS, Crunchy, Aiven and others aren't quite there yet.


They kind of offer that with their Redis (via Upstash). But for our use-case, we needed it to be managed PG and Redis. Going out of the LAN introduces too much latency.


Upstash Redis for Fly runs on Fly infrastructure and we observe latencies in the low single digit milliseconds.


Maybe that is what Fly had in mind: That some company like Upstash would bring "managed" serverless Postgres to their platform.

Unless somebody natively implements clustering in Postgres I don't see that happening anytime soon, all existing tools require way too many moving parts.

I don't know why they'd even want Postgres for the type of service they are offering, KV or maybe SQLite seem like a better fit.


I don't even understand what you mean as the difference between "something like RDS" and "something like CrunchyData" -- they seem like similar products to me?


I see RDS as the absolute bare minimum for a managed database; providers like Timescale or Crunchy tend to add some pretty useful stuff on top.


For my own curiosity, I am interested in hearing what features Crunchy adds on top that RDS doesn't have, that folks find pretty useful!

(Timescale -- I think I know, it adds features specifically about storing time series? But I don't think Crunchy has additional domain-specific stuff like this? What are the pretty useful features folks find in Crunchy that RDS lacks?)


I'm a bit sour reading this. I've always liked fly and particularly the engineering blog, so much so that a couple of months ago I decided to apply for an infra position, to work on some of these very topics. Sadly after 4~5 rounds of interviews (including a workday) they just ghosted me.


If that happened, it absolutely was not on purpose. Shoot me an email at thomas@fly.io.


> Don't feel too bad nor take it personal. They probably have a lot of applicants, and are looking to grow their team by hiring someone with very specific skills.

> I also applied a few months ago while I was in the middle of my job search. For one, I couldn’t really answer their "favorite syscall question" because I’ve never dealt with syscalls :) so maybe I just wasn't a good fit.

Surely, everyone's favourite syscall is exit()


For what it's worth, exit() is not a syscall, but _exit() often is.
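
To make the distinction concrete, here is a minimal sketch in Python rather than C (my choice for illustration, not something from the thread). It assumes `sys.exit()` is a fair stand-in for libc's `exit()`, since it runs registered `atexit` handlers, while `os._exit()` wraps the `_exit()`-style immediate termination and skips them. The `run_child` helper and the `CHILD` template are hypothetical names.

```python
import subprocess
import sys

# Child program template: registers an atexit handler, then terminates
# using whichever call is substituted for {call}.
CHILD = """
import atexit, os, sys
atexit.register(lambda: print("cleanup ran", flush=True))
{call}
"""


def run_child(call: str) -> str:
    """Run the child in a fresh interpreter and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD.format(call=call)],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


# Library-level exit: cleanup handlers run before the process ends.
library_exit = run_child("sys.exit(0)")
# Immediate termination: the process ends without running the handlers.
raw_exit = run_child("os._exit(0)")

print(repr(library_exit))
print(repr(raw_exit))
```

The first child prints the cleanup message and the second prints nothing, which is the same split as between `exit()` and `_exit()` in C.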


What an earnest post, and how damn refreshing it is to see such concern for users, accountability, honesty and openness (quite a contrast to another PaaS)!

I moved one app successfully from heroku to fly and attempted to move a few others. These are my experiences (both good and bad):

Great:

- The load time on the pages is insanely faster on fly than heroku. Sometimes I thought I was on the localhost version of the app, it was that snappy.

- Love that it uses a Dockerfile

- Love paying for what I use (compared to Heroku's rigid minimum of $16/month for hobby dyno w/ postgres for baby apps, or $34/month just to get a second web dyno for toddler apps). The same apps are <$5/month each on fly.

Not great:

- I find the fly.toml file hard to understand and use, and the cycle time slow to fix or tinker with it. It's partly (entirely?) a 'me' problem because I haven't spent a huge amount of time reading the documentation.

- I found scheduling a rake task in a rails app time consuming (~days) the first time, but very easy (15 minutes) the second and subsequent times, once I knew a way that worked (cron didn't work; had to use a tool I hadn't used before 'supercronic').

- Deploys sometimes time out with `Error failed to fetch an image or build from source: error rendering push status stream: EOF`. Most layers copied, but randomly, some layers wouldn't. All I could do is keep trying until it worked, which it did, 2 hours later. Not the end of the world, but an annoying complication when you're already trying to solve complex problems.

- I followed a youtube video on how to move a rails app from heroku to fly, and it worked on a modern app, but I couldn't quite get fly happy when moving the older app - something to do with postgres versions, and I didn't want to spend all day figuring it out. I'm not hugely experienced with docker, it could have been an easy fix for someone more experienced.

On reflection, 3 of the 4 negatives above are solvable by me reading the docs more thoroughly and getting more proficient with docker.

I look forward to continuing using and exploring fly, and can't be happier with the directness, transparency and care from fly staff. A platform with huge potential.


Did you try migrating with this guide? https://fly.io/docs/rails/getting-started/migrate-from-herok...

The issues you ran into with older versions of Rails were probably because the Dockerfile that `fly launch` generated was for new versions of Rails. We switched to https://github.com/rubys/dockerfile-rails to streamline Dockerfile generation and support older versions of Rails.

If you try it again and run into issues you can open an issue at https://github.com/rubys/dockerfile-rails/issues or post in https://community.fly.io and somebody will help get that sorted out.

The more versions of Rails we can deploy the better!


You bet, that guide is gold.

I encountered a few issues. One was definitely something to do with older postgres. From memory, I tried downgrading it using apt, but then other things played up and I put the project aside.

Another rails 6 app I tried to move into fly encountered this: https://community.fly.io/t/rails-app-problem-with-node-modul...

I followed Sam's suggestions to regenerate the Dockerfile with dockerfile-rails (btw, thanks for your work on dockerfile-rails, super excited about it) and solved a couple more issues, but I again ran out of steam when new issues kept coming and coming. I'm sure when I'm more comfortable with docker these will become trivial to solve.

These were not super determined attempts by me, more playing around. I look forward to more serious attempts when I'm more capable with docker.


Interesting issues. Nothing surprising for anyone who’s run a global SaaS before, especially if growth has been incredibly fast. I find the gripes about Consul, Nomad, and Vault interesting since it sounds like the problems are mainly due to poor architectural decisions. Fly is rewriting those tools rather than investing in deploying them properly, and in the process is running into new issues that those tools have already solved, which doesn’t give me confidence that the path forward will be any less bumpy.


One of my colleagues keeps repeating “reliability is our number one feature”.

I’m not sure it is for 100% of early stage startups, but I guess it is once you exceed some minimum usage threshold.

That said, definitely appreciate the detailed explanation.


> One of my colleagues keeps repeating “reliability is our number one feature”.

I think reliability is the #1 feature at any stage because if you're unavailable, you're at best useless and more than likely you are actively harmful because your users have an expectation.

However, if you're only unavailable at times when customers don't expect you to be there, then you're not actually unavailable. This is more likely for an early stage start-up, but you don't typically choose or know when you're expected to be available, nor do you always get to choose when you're unavailable.


Our team at AWS had a poster up on the wall that more or less went:

1. Security

2. Durability

3. Availability

4. Speed

Similar: https://twitter.com/colmmacc/status/1071088017190711296


In terms of confidentiality, availability, and integrity: I'll bet LastPass would gladly trade availability right now to regain confidentiality.


> One of my colleagues keeps repeating “reliability is our number one feature”.

> I’m not sure it is for 100% of early stage startups,

I mean, it probably depends on the nature of the startup? Platform-as-service seems particularly sensitive to reliability (whether or not it's "#1 feature"), in a way that might not be true of startups in other spaces.


That's what crossing the chasm is about. Before the chasm are tinkerers who don't mind things failing as much, but after it, people want reliable things that Just Work.


TL;DR -- It's very domain specific if reliability is your number one feature.

For a startup that is hosting other people's production application/data then this is absolutely true. Less than 100% always needs to be addressed.

For a startup that is selling bingo cards then reliability probably isn't nearly as important. I'm guessing there were certain holidays that were more important than others as far as reliability goes though? Maybe patio11 can chime in :)


One of the key challenges we observe is that if you're small enough, a Heroku like experience works well - and most of your needs would be covered by virtually any combination of techstacks.

It gets significantly more challenging when you grow, either in feature complexity or scale complexity - and then very few services can offer what AWS/GCP/Azure offer - albeit at the increased engineering/monetary cost of using them.

We're building a different kind of approach[0] that aims to absorb the mechanical cost of using public cloud capabilities (that are proven to scale) without hiding it altogether.

[0] https://github.com/KlothoPlatform/klotho


> In response, we’ve shipped a project called Corrosion. Corrosion is a gossip-based service discovery system.

I wonder why they didn't try to use Serf[1] for this, since they were so into HashiCorp tools. It also uses the gossip protocol.

1: https://www.serf.io/docs/index.html


What is this? A company being open and honest about problems their customers are facing? What is happening? Has the world gone mad??


I really feel for Fly, as a potential customer. They are trying really hard. I would still love to use them one day and this post is definitely a step in the right direction. Growing is painful but they have smart people working there so fingers crossed that they sort this ASAP and it doesn't become existential.


I'm a big fan of fly.io. From their hiring process to the product itself it's all carried out in a thoughtful manner. I hope they can weather this rough time.


I've been with fly.io at small scale and I have always loved their approach to content (docs, blogs, forums), and needless to say, their product. They are very talented and are building something truly great. Their openness, shown in this post, is an example to follow. It's very hard to be that honest and direct when you're meant to be an infallible entity, but it's not a surprise at all to see fly authoring such a post. That's how they operate and that's why I trust them over anyone else.


Thanks for sharing!

Would it help to replace Corrosion with a simpler "Here's my local known state" blob that is POST'd to blob storage (for example) on a major cloud provider, and have another service read that at intervals? Just to make it really simple.

There will be a better way than that, but my thought is if you can make it simpler (known state is always just pushed, so missing updates auto-recovers and avoids corruption) then you can be building on top of a more stable service discovery system.

Centralized secret storage, can you keep the US instance read/write, but replicate read-only copies (a side-car tool that copies the database to other regions at various intervals?) so each region can fetch secrets locally?

Or perhaps both can be solved with a general "Copy local state to other regions" service that is pretty simple but gives each region its own copy of other region's information (secrets, provisioning states, ...).

I've needed to do similar things for some of the apps I've built, where a service needed another (simpler) service in front of it to bear the traffic load but was operationally simple (deferred the smarts to the system it was using as the source of truth) and automatically recovered from failure due to its simplicity.
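
The "always push the full local state" idea above can be sketched roughly like this. It is an illustration under assumptions, not how Corrosion or Fly works: `BlobStore` is an in-memory stand-in for cloud blob storage, and `publish`, `poll`, the node IDs, and the addresses are all made-up names.

```python
from dataclasses import dataclass, field


@dataclass
class BlobStore:
    """In-memory stand-in for cloud blob storage (hypothetical)."""

    blobs: dict[str, dict[str, str]] = field(default_factory=dict)

    def put(self, key: str, state: dict[str, str]) -> None:
        # Each write replaces the whole blob; there are no deltas to replay.
        self.blobs[key] = dict(state)

    def snapshot(self) -> dict[str, dict[str, str]]:
        return {key: dict(state) for key, state in self.blobs.items()}


def publish(store: BlobStore, node_id: str, local_state: dict[str, str]) -> None:
    """Overwrite this node's blob with its complete current state."""
    store.put(node_id, local_state)


def poll(store: BlobStore) -> dict[str, list[str]]:
    """Reader side: merge every node's latest blob into a service -> addresses map."""
    view: dict[str, list[str]] = {}
    for _node_id, state in sorted(store.snapshot().items()):
        for service, address in state.items():
            view.setdefault(service, []).append(address)
    return view


store = BlobStore()
publish(store, "node-a", {"web": "10.0.0.1:8080"})
publish(store, "node-b", {"web": "10.0.0.2:8080"})

# Suppose node-a started "api" and the push announcing it was lost.
# Nothing needs repairing: the next push carries the full state again.
publish(store, "node-a", {"web": "10.0.0.1:8080", "api": "10.0.0.1:9090"})

view = poll(store)
print(view)
```

Because every push is a full overwrite, a dropped update is healed by the next one, which is the auto-recovery property described above. The trade-off is staleness between pushes and polls.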


+1. Service discovery _feels_ like something that gossip can solve, but that's only true at global/open/untrusted scale. In a well-scoped and authoritative domain like a company it's really not; it's just another typical, expected-to-be-consistent state problem.


I have two extremely tiny sites (like, "handful of users/1 user" sites) on Fly. I have had multiple incidents despite me not even touching them.

The thing that worries me about these incidents is they haven't been, like, full service outages. A small subset of users talking about issues in forums. This makes me just feel like Fly has an immense amount of issues.

At least if like 50% of fly goes down then it feels like a config fat finger. When it's a bunch of tiny issues now all my ops debugging has to start with going to the fly forums (and it's _always been issues on fly's side_).

The price is "right" (though like with all PaaS the gaslighting about running multiple processes in one container makes me feel bad about the state of cloud computing). And I really like the CLI stuff mostly! But I extremely don't care about edge computing so for me fly is just heroku and I would love to feel more confident on that end.

(EDIT: the nice thing is I get email support with a bit of cash. This is a thing that will go away when they get bigger but it's here while things are still breaking often)


> We’ve put a lot of work and resources into growing the platform and maturing our engineering organization. But that work has lagged growth.

I fundamentally don't understand why people are in such a big hurry to get 'famous'. I've worked a couple of places where the marketing side was working as hard as they could to make sure that our heads were on fire at all possible moments. At one job I had a (very, very junior) manager come up to me and say great news we landed <big customer> and my immediate reply was, "fuck me". We were already running to stay upright and now we're about to have twice as much scrutiny. Wonderful.

If you push hard enough, eventually everyone looks like an idiot. The number of humans for whom that is not true could fit into a book. Both alive and deceased. They most definitely do not work for the companies I've described, at least not enough of them so you'd notice.


This is a great blog post and has so many insights for an SRE or devops person. But this goes to show you how difficult it is to self-host stuff at scale.

I used to work for a company that built deployment platforms for law firms. All our deployments were on-prem and we had the same complexity with Kubernetes. We had a similar setup with Vault and Stolon for HA PG. The more moving parts you have in infra, the more permutations and combinations of failure modes you have.

What these guys are building is something I have seen many orgs try to do internally and fail. PaaS is a hard problem if you want to solve it "reliably".


I love reading stuff like this. I don't use fly, don't plan to, not totally sure everything it does and will check it out. But this is some great raw data on how stressful it is after you launch.


I wonder what types of RPS they are seeing that required a gossip based protocol to broadcast state around versus a more traditional data store.

I take it that it’s far more important that the local region know about changes than a remote region, which makes a mastered store in one location as the source of truth problematic.

I also wonder why these companies don’t backstop themselves on the public cloud? Failing into an AWS seems better than running out of capacity, and some of its services could be used in circumstances where an open source technology isn’t ready.


Yeah, reading the post made it seem like they followed “best practices” without really thinking things through. KISS.


Where does fly.io document their per-account services limits? For example, max apps, databases, etc.

I took a quick look and couldn't find them. Do they have any documented service limits?

A google search turned up [0] which does not inspire optimism.

> ...there isn’t a limit to number of apps from a billing standpoint...

[0] https://community.fly.io/t/free-tier-limits-and-quota-needs-...


For django, they should really contribute to 2 scoops django cookie cutter program, so that you can get an out of the box django instance that can just deploy to Fly.io.


Their problem right now isn't adding more customers - they seem to have more than they can handle!

If I were them I'd focus as many resources as possible on making the stack rock-solid, and away from acquiring more customers or adding more capabilities.

In fact I'd try to down-scope some features if at all possible, like the example they give of disabling app deploys while they're doing platform updates.

We use fly.io at a small scale and it's worked really well for us, but the money is in customers at a larger scale who must have 100% reliability.


I needed a cookie cutter for my side projects so I've created one for sqlite that I actively use and a similar one for postgres. Both are very basic, contributions welcome.

Sqlite https://github.com/tomwojcik/django-fly-sqlite-template

Psql https://github.com/tomwojcik/django-fly-postgres-template


May be wrong, but they seem to be very focused on interesting stacks like Elixir and RoR while building on Go/Rust. The corollary being neglect of the bread/butter stacks with high market share, like Python and Java/JavaScript. Don't think I've seen a blog post discussing those three beyond a passing mention?

Not the end of the world, but mildly disappointing. At least they are all in with Postgres and Linux, a great foundation.


I don't disagree with you, but they tweeted

> We’ll readily admit our docs still have a Django-shaped hole in them.

https://twitter.com/flydotio/status/1578039196618575874?t=nu...


I thought the only thing their tech stack was all in on was docker images being rewritten to run on firecracker as a substrate.

is it not agnostic about things like Elixir etc, at the tech level, though they've got super nice documentation for those tools you mentioned?


What's necessary to change for them to run on Firecracker?


Here's a post explaining how they do it https://fly.io/blog/docker-without-docker/


Yeah, saw that before. I thought he was saying you have to change your Dockerfiles to use them with fly.io. Just misinterpreted that sentence.


You can use any Docker image.


https://github.com/ehmatthes/django-simple-deploy will deploy to fly (or platform.sh) out of the box, should be pretty much the experience you're describing


thank you!


Fly.io seems like Vercel 1.0 (where you can just deploy docker image and done), but it's more than that, with configurable volumes, secrets,...

I'm bullish on fly.io.


Growing pains are never fun. It doesn't mention (at least that I read) if they're using HashiCorp Open Source or Enterprise. Open Source is great and I owe my career to it but they might be hitting the scale when the Enterprise features and support start to be worth the price.

I've only used Fly.io for a personal app but I think it's a great option so I hope they keep growing.


Ah my question was answered more or less. They're an edge case which makes sense: https://news.ycombinator.com/item?id=35048318


Thanks for the honest technical write-up - not easy to air one's dirty laundry to users. Given the scalability and stability issues, I am curious what percentage of apps deployed to Fly are actually used in production or are critical to a business. Sounds like they have quite a few hobby/free tier users (myself included) who probably won't notice certain issues, unlike paid customers.


Well, I feel for them. Scaling up is a bitch.

I've been lucky, in the past, but a lot of that, is because I have "overengineered," and the tools/frameworks have advanced to meet the new demand.

I am in the middle of a complete, bottom-to-top rewrite of the app we've been developing for the last couple of years. It's going great, but making this leap was a fraught decision.

It's mainly, so I wouldn't have to write a post like that, in a year or two.

We spent all the time refining it, until we had what we wanted, and it worked great on our small test team.

Then, I loaded up a test server with 10,000 fake users, and tossed the app at that. To be fair, we don't think we'll have even that many users for quite a while. It's a very specialized demographic.

* SOB *

It no do so well.

At that point, I had to decide whether to fix the issues (they were quite fixable), or revisit the architecture.

The main issue with the architecture, was that it was an "accreted" app, with changes gradually being factored in, as we progressed. The main reason for this, is because no one really knew what they wanted, until we ran it up the flagpole (sound familiar?).

The business logic was distributed throughout the app. That was ... ugly.

I envisioned myself, a year or two down the road, sucking on a magnum, because the app had turned into a Cruftosaurus, and was coming for me in my nightmares.

So I decided to rewrite, as we hadn't done any kind of MVP or public beta, so we actually had the runway to do this.

I refined the entire business logic of the app into a single, platform-agnostic SPM module, which took just over a month, and have started to develop the app around that. It's pretty much a rewrite, but I am recycling a lot of the app code. We also brought in our designer, and he's looking at every screen I make. It's working well for him.

Like I said, it's going great. Better than I expected.

I know that I have a huge luxury, and I'm grateful. I can credit a lot of that, to doing some stress-testing before we got to a point where we had a bunch of users to support. I was able to go in, and go all Victor Frankenstein on the model.

The result, so far, is that this thing screams, and you don't really even notice that there's that many users on it. The model has already been proven (that SPM module), and all we're doing, is chrome (which is a ton of work).


Don’t know about expected usage, but 10K rows seems like a low number to me. If you mean 10K req/s then perf starts to matter, but usually it’s the SQL queries that fail first once you have 100K+ rows. In general, good caching plus read/write separation solves most of it. That’s all from my limited experience.

What I mean is that these scaling problems don’t have much to do with app logic but having nice core is good so :thumbsup:


10K is a rounding error, to most folks, around here.

The app (and the demographic that it serves) are very security and privacy-conscious, so security and privacy are the main coefficients. Most folks would be disappointed in how few features the app boasts, as each feature is a potential compromise. I'm glad to have low usage, as a result.

I wrote the backends, as well as the frontend, and have avoided third-party dependencies, all around.

It seems one of the first casualties of fast scaling is security.

I really didn't want to take any chances. An overly-complex architecture is begging for security compromises.


Hope this isn't a dumb question, but what's an "SPM module"?


Sorry. Apple-specific.

Swift Package Manager.

https://littlegreenviper.com/series/spm/

The only dumb question, is the one I don't ask.


Wow, posts like these kind of make me want to sign up and become a customer. The honesty and openness is something I highly value, and it's weird that this is so uncommon that you get surprised by reading stuff like this.

Seems like they have a good understanding of what the problems are, so they will most likely be solved sooner or later.

Good work, and keep being as honest and open as you have been so far :)


Damn. I don't know, and I guess almost no one can know, how much of this is genuine honesty and how much is calculated messaging, but I barely care. It's refreshing enough that I kinda wanna give the thing a try. Which, ironically, might make their problems just a tiny little bit worse, since I likely won't be a paying customer XD


Corrosion seems over-engineered. Instead of doing a simpler federation of multiple databases (one per datacenter) across the globe, they decided to do gossip amongst every single VM across the globe! You don't really gain much but you do get all the noisy complexity for sure.


How easy is it to deploy things with GitHub actions, AWS and terraform instead? I think I’d like full infrastructure control after having accidentally got a job doing DevOps which I’m starting to get quite into. I should probably write up a blogpost about converting everything over…


Would the simple solve be that Fly.io just mark any new service of theirs “beta” for x-months post launch?


None of the services that they have had issues with in the recent past are new, they have been running for at least a couple of years. They would need to put a "beta" sticker on the whole platform for that suggestion to work.

But the post makes it clear that the issue isn't that they had problems with new services. It was rapid customer growth before they had time to scale up the infrastructure as they had planned to do.


I always thought it would be really interesting to work for a company like fly.io.

Solving hard problems like this seems interesting.

On the other hand it could be a giant shit show of micromanagement and toxicity, who knows really.

At the moment they aren't hiring though so that's that.


I have a debilitating impostor syndrome around resiliency and reliability of my systems. I always feel I am doing something wrong compared to the Googles, the Microsofts and the Fly.ios of the world. This does help me feel better.


mrkurt have you considered some of the lower tiers of vault enterprise that allow for performance replicas that just outright solve that problem? might be cheaper than an engineer at this point.


Three little pigs, but they all live in a house of straw built on top of several houses of sticks. (The foundation is an old house of brick, but the wolf isn't bothered by that.)


Respect the post. Building your own infrastructure provider is playing on hard mode; the big players have had armies of engineers iterating on their stacks for a decade+.


I like Fly's response to the problems - honesty and openness - so I'm going to add to their problems and try to use it ;)


Interesting take. As a non customer I now won't consider them for any projects as they've confirmed their unreliability.


i also wanted a good cli for aws, and built one:

https://github.com/nathants/libaws

companies like fly are fantastic.

they provide a good service, and they put market pressure on aws.

a free tier isn’t important anymore. with usage based pricing for lambda/dynamo/s3, an app with usage approaching zero has no cost.


    - Machines seem like a waste of time
    - Access directly to VMs is being removed (and doesn't support TCP over IPv4, or UDP over IPv6)
    - The CDN is nice but should support private networking too.
    - Volume management is deficient: It should be possible to access and fix volumes outside the context of its app instance.
    - Egress traffic should be free between apps over private networking, at least in the same DC.


I'm not in the Web business, and have no idea what fly.io is offering, but whenever I hear anyone trashing Consul, I give them a standing ovation. An application which decided to use Base64-encoded JSON for its communication protocol deserves every bit of mockery it can get.


I get it, I like fly.io, but the last outage made me switch to Railway.app


@kurt happy (sometimes paying) Fly user here. Keep up the great work!


in my view, a new edge computing entrant needs a niche market, e.g. low-latency gaming or privacy-heavy computing, and to stay away from MAAG cloud territory.


It sounds like they need more money to scale the shared stack


Great post. Love how transparent Fly is. I'm a customer but without any life-and-death important apps. And yea, they had some issues lately, so it's good they are addressing them.


"CEO finds novel way to fix Capacity problems..."

They just lost about 40% of their paying users with that blog post.


[flagged]


It's a post on their forum, hardly an ad.


It's too late. They've already misled people.


That sounds like a typical problem of unnecessary complexity to me. I wonder how many over engineered (web) applications could run as a single, efficient process on a single machine.


Do you know what fly.io does?

Like, it cannot possibly run on a single machine by definition because the product they're building is running customer code on edge compute nodes distributed across the globe.

The main selling point of their service is that it's _not_ a single machine.



