Migrating our backend from Vercel to Fly.io (openstatus.dev)
147 points by Hooopo on Oct 29, 2023 | 150 comments



Unreliable deployments are my experience as well. I also encountered unexpected and unannounced downtimes surprisingly often.

I was excited about fly, but ended up sticking with digitalocean. I have only had one issue with deployment reliability there (when they changed their build tooling for Python applications on the apps platform), but they responded quickly with a fix and shortly after announced the change and potential issues to all customers. Fly is not like this, and as a hobbyist I don’t have the time or energy to deal with their platform’s issues. I’d rather pay for something I can depend on. DO has been amazing in that regard, and their tooling is excellent.

I’ve used vercel in a professional context and wouldn’t use it for personal work. The markup is crazy and the tooling isn’t appealing enough to justify the cost. This is definitely a subjective matter as opposed to reliability and communication which are objectively necessary. Vercel just “rubs me the wrong way”, and I’m sure many people here love it.


At my last place we ran into a number of issues with using DO in production. It was fine for dev machines, testing, etc, but we had production downtime due to DO's networking setup, and support were unable to understand the problem, let alone fix it.

Quick summary: we backed up our other prod hosting to DO over SSH. One day our backups went offline, DO claimed this was because of a DDoS attack, but our backups were working fine and there were no noticeable effects. Only one port was open, SSH, and we had great security on it. Support re-enabled networking for the host, backups resumed, then the next day the same thing happened again. We told them not to do this, and they said they could not, and that we should "put Cloudflare in front of it", completely missing how that was not possible or useful for our case, and missing the fact that we were not having any problems other than DO disabling networking.


That level of uselessness is impressive. They must've trained on Microsoft's forum.


Wow, that’s exceptionally unhelpful. It hasn’t been my experience, but this mirrors my experience with fly. I guess we’re never safe, haha.

I’ve had several projects of varying complexity running with excellent uptime, both on the apps platform and on plain old droplets, for longer than I can say with certainty. Close to 8 years I guess. I might just be lucky, but in that time I really can only remember the one unexpected outage.


I really want to love Fly (for some reason? Maybe I've succumbed to the darling effect? Idk, their tech is cool in any case)

But yeah, between the failing deployments and the weirdness around persistent storage (if my VM starts up on another physical host my data just no longer exists), I can't use them. I'd be doing the same amount of systems work as I would anywhere bespoke, with less ability to fix any issues that come up

This is fine for the app part, 12 factor and all that, but I don't want a database relying on it.

Really hope they fix these two issues somehow, I've had to learn kubernetes instead lmao (I was going to get round to it anyway in all fairness)


I really don’t understand how people can trust platforms like Vercel and Fly.io over robust cloud providers like Cloudflare, AWS or Azure.

I mean, Vercel has its usefulness, it’s so well integrated with the NextJS stack, it totally makes sense for small amateurish projects since it saves you time and money… but once you want to push to production, have real customers and satisfy them reliably, these platforms can’t compete with the big ones.


This is what happens when you do HN-frontpage-driven development. I mean, they use Bun (which I’m sure will be great in a couple years’ time) and quickly ran into an fd leak in it. Does that sound like a production grade runtime?

However, I suppose it’s good for content marketing. You’re not going to make front page by choosing boring old technology (unless you’re migrating back to boring old technology after failed HN-frontpage-driven development).


Their stack speaks for itself:

Next.js, TailwindCSS, shadcn/ui, tinybird, turso, drizzle, clerk, Resend

That’s for an app which sends a ping to a URL every x minutes…


> That’s for an app which sends a ping to a URL every x minutes…

My bigger question is: How the fuck do these companies keep getting created? There's gotta be more uptime-monitor SaaS companies than todo MVC demos in existence.


Because these people are bad businessmen. There is not a single reason yet another uptime monitoring business needs to exist. They have no chance of supplanting the likes of Datadog or New Relic.

Conclusion: promising builders being bad at startups


Who says they have to reach Datadog/New Relic scale to "succeed"?


You’re right. I will take it back. Was just upset


It’s easy to get the free users, where just a smaller % need to upgrade to keep you afloat. Then there’s the possibility to upsell other related services like analytics or logs.

Betterstack did this very well.


> a smaller % need to upgrade to keep you afloat

Until yet another free competitor comes along. Then you're out of business


> I really don’t understand how people can trust platforms like Vercel

It's not an apples to apples choice. The people who use Vercel don't know anything about how to deploy on AWS. That's the whole point of Vercel. Whether or not they can be trusted is really orthogonal to the reason they were selected as a provider. But that said, they're just a layer over AWS, so why should they be significantly less trustworthy? I haven't used Vercel in production, but I have used a similar "layer over AWS" service (Aptible). The problem there wasn't to do with QoS or support, but rather that the narrowing of the functionality of the "interface" (which is pretty much the point) ends up causing frustration when you want to integrate with other stuff you're doing in AWS.


> The people who use Vercel don't know anything about how to deploy on AWS

Ehhhh. I’ve been deploying to AWS professionally for years and I’d choose Vercel for a personal project any day of the week.

Life’s too short to play devops/sysadmin/sre without someone paying you the big bucks.


This. Anyone who has done enough ops knows that a platform (Vercel, Heroku, Netlify etc) which lets devs connect a git repo with a couple of clicks and have deploys happen automatically is a good devops experience.

This is great for personal projects. This is great for budding projects in a professional setup as well.


If I never again have to write a httpd.conf or nginx.conf file from scratch it will be too soon.


Either you mean 'too late' or that's a complicated way of making a dissenting point?


It’s an idiom that means I never want to do this again. Even if I waited an infinite amount of time, it would still be too soon to have to write httpd.conf or nginx.conf files again.

https://hinative.com/questions/337452


I'm a native English speaker, I'm well aware of the (misused here) idiom.

You could also have said 'ever' instead of 'never', that would be a simpler and better fix on re-reading.


Huh, what was I smoking? I don't even smoke. That's not right, don't know what I was thinking, sorry!


This x100. I use gcp, DO, and Vultr for almost everything (ml, be, vas, etc.), but for webapps, vercel wins 9/10 times.


Vercel is great for pre-prod ephemeral environments for testing and CI. But they desperately want to sell you their enterprise stack, which is completely inadequate, and will drag their feet for months if you just want to sign the damn “pro” version.


Also, simple stuff like bandwidth is wildly overpriced with Vercel. My company switched away from all their magic image resizing stuff because, as our traffic increased, the bandwidth cost was 10x that of serving all media content through Prismic.


Vercel is hosted on AWS, and they're definitely not in the business of subsidizing AWS per-GB costs. Although the >2x markup is egregious ($40/100GB, i.e. $0.40/GB over 1TB, versus AWS's $0.15/GB in its most expensive region, Sao Paulo, Brazil).


Some people avoid large providers, since large providers have approximately no incentive whatsoever to keep you, specifically, as a customer. I.e. large providers will happily raise their prices, alter the deal, throw you under the bus, disable your account, delete all your data and then refuse to talk to you. They can do this because, when they look at the big picture, you don’t matter to them. And since doing this saves them some money, they all do it.


Right. Plus, large providers usually don't offer support for small-sized instances/containers/whatever, so even if you optimized your deployment to use fewer resources, you need to buy a bigger thing.

But to the main point: using an extra layer on top of said large provider, like Vercel over AWS here, is not a solution, as this middleman can also be marginalized by the big bully at some point.

This is why I prefer small providers, like Vultr. (Not affiliated with them in any way; just a happy customer).


And a small service can go under anytime, without any real warning.

Most big providers end up being cheaper for you as well. Vercel is insanely expensive.


> And a small service can go under anytime, without any real warning.

Both large and small providers could make the ground disappear from under your feet, in different ways. But only the small provider has a real incentive to actually keep you, as just one customer. It’s only when a small provider goes out of business completely that you have any risk. And that’s unlikely to happen, unless they’re delusional or funded by squirrel-minded VCs – which are things you can determine beforehand.


I worked at ZEIT (before it became Vercel) and if they've retained even a 10th of their engineering culture then they're solid, if not a bit "niche" in what and who they target.

Anecdotal, sure, but it'd be hard to quantify it.


It’s funny how frontend has a stigma of less serious engineering when the caliber of programming being done at Vercel is far beyond the level of inadequacy I’ve seen being in big tech FAANG eng departments.


Modern day frontend is ridiculously complex. Back in the day, the only compilers that tried to do code splitting and optimization across network boundaries were ASP.NET WebForms and similar. They were dreadfully simplistic compared to the SSR + React-streaming-over-the-wire that Vercel is doing. Don't get me wrong, the React compiler written in Rust is leagues ahead of the slow tech in, for example, the Angular world, but it is also orders of magnitude more complex.

I see this as a side effect of developing economies in the world coming online. When you have to service millions in places with poor network and provide a competitive UX, you don't have the luxury of going with "simple" solutions.


I think this comment is intended as a compliment to Vercel, but it's hard to tell from the wording.


Apologies. Yes, big tech == overrated, Vercel tech == underrated


It's 'beyond the level of inadequacy' which is amusingly ambiguous!


They reliably exceed 99.99% inadequacy overachievement - there, made it less ambiguous for you.


Yes, but exceed which way?


'use server'

sql("select * from db");

This is what Vercel is pushing into React code. The caliber of their work is low, very low. They are con masters with MBAs
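(For context, a slightly fuller sketch of the kind of code being referred to: a file-level 'use server' directive marks its exports as Server Actions that run on the server. Purely illustrative; it assumes the sql tag from @vercel/postgres, and the function and table names are hypothetical.)

    // Illustrative sketch only -- not a recommendation.
    'use server';

    import { sql } from '@vercel/postgres';

    // A Server Action: callable from client components, executed on the server,
    // with database access sitting directly next to UI-facing code -- the
    // pattern the parent comment is objecting to.
    export async function listUsers() {
      const { rows } = await sql`SELECT * FROM users`;
      return rows;
    }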


They are objectively not that, lol. If you don't like what they're doing I understand but please don't disrespect people you don't know.


That's what you get for working at a famous company that took over an open source project.

If the destruction of React by Vercel paid your bills and you feel disrespected by my total disdain for this get-rich-quick scheme (a la crypto rug pulls), that's a problem for you to solve, not me.

Edit: but in reality I'm happy and thankful to Vercel for imploding React; it helped me finally see that there are so many better options nowadays.


How did Vercel "destroy" React? How are they a "get rich quick" "rugpull"?

Respectfully, you have no idea what you're talking about.


Dan Abramov, de facto lead for React, said that the React team was driving the vision behind the new features in React. Vercel just said, "how can we help? You guys think server components are great, ok, we'll make them first class as that's where the React ecosystem is headed."

Vercel is doing nothing but trying to improve DX for all the things people complain about React.


I started using Next.js in 2017. It made React a real production framework. Prior to Next.js, React was hard to setup and maintain and hard to make it go fast (on first load). Next.js solved the worst React problems.

I don't think it ruined React at all. I think it helped React gain in popularity - which you might interpret as "destruction".


> Prior to Next.js, React was hard to setup and maintain

No, it wasn't. Now it is an engineering process.

> I started using Next.js in 2017. It made React a real production framework

In 2017 I had React projects in production for years.

> React was hard to setup and maintain and hard to make it go fast (on first load)

And it only got worse, and the overengineering to make it look fast on first load is not worth it, as modern JS frameworks are faster than React out of the box.

> I don't think it ruined React at all. I think it helped React gain in popularity

That's not what stackoverflow's Insights says[0]. Looks like a free fall to me.

0. https://insights.stackoverflow.com/trends?tags=reactjs


> In 2017 I had React projects in production for years.

I doubt that. React wasn't stable until 2015, and wasn't mainstream until 2016.

> And it only got worse, and the overengineering to make it look fast on first load is not worth it, as modern JS frameworks are faster than React out of the box.

Again, Next.js != React; the former builds on the latter, it doesn't replace it nor does it claim to be the same thing. I'm not sure why you keep conflating the two.

> That's not what stackoverflow's Insights says[0]. Looks like a free fall to me.

Perhaps you shouldn't bury the lede here. I'm also not entirely sure what your argument is, or why you hold such strong emotions without making your opinions very clear.

https://insights.stackoverflow.com/trends?tags=reactjs%2Cnex...


> I doubt that. React wasn't stable until 2015, and wasn't mainstream until 2016.

I started using React before its 1.0 version. Your reasoning is exactly what's wrong with Vercel: arrogant, inexperienced people who think they know better, empowered by VC money. Together with some idealization of being the smartest people around, that makes you come up with solutions like "use server" and throw tantrums when people say this is stupid for a frontend library.

> Again, Next.js != React; the former builds on the latter, it doesn't replace it nor does it claim to be the same thing. I'm not sure why you keep conflating the two.

It is okay if you can't understand what I'm saying; it is difficult to get a man to understand something when his salary depends upon his not understanding it. I also don't expect you to agree that the work you did contributed to an open source takeover for the sake of profit.

Edit: I just did some research to see if Meta is adopting the amazing "use server" and no public information is available, only people discussing that they aren't. That says a lot about the applicability of this feature and the direction React is being led in.


I was a tech lead and managed a massive project (multiple billions of page views per year) that did a complete rewrite with React in 2014-2015.

React was complete shit back then - especially the first load speed. It was not ready for "real" production. We basically built an internal framework on top of React for things like server side rendering (which no one did with React back then), above the fold loading optimization, developer experience, devOps on top of React. We basically built Next.js internally.

So no. It was not production ready for real performance-based websites. Next.js made it significantly better as soon as it came out.


React was a terrible idea for a static page in 2014-2015 and is an even worse idea for a static SSR page + hydration in 2023. React back in 2014 was devised to be a performant way to create SPAs. Of course it wouldn't be good to use it as if it was PHP. It is still a bad idea.


That's awesome work. You don't like it for aesthetic reasons, pre-conceptions of aesthetic purity. That's fine too and I think I agree. But it's awesome work.


Go ahead. It is all yours, awesome work is simple work. Vercel wants React to do everything so they can sell everything. Not to me. That was a sad takeover of an open source project tho


I like to joke that because Vercel doesn't make money when you run components in the client, and React is now made for Vercel, that React is now developed for backends.

I'm a big fan of OSS projects being sponsored by companies that use them for a higher level business. Vs a ton of the C# ecosystem where they try to sell you the library/framework.

But hosting businesses ain't it when it comes to frameworks. It creates weird incentives.


Plenty of people end up building vercel-but-worse out of CI pipelines on AWS or similar; it doesn't strike me as crazy to keep using it well past the prototype stage for projects that fit its constraints.


What's the use case for edge computing like Fly.io? I have yet to figure out a use case where an edge provider is necessary. That is, having a database on the edge.


Having customers in places around the world. If your site is hosted in Northern Virginia, and you have customers in Australia, they are going to really suffer from the speed of light.


Definitely. I guess this is where my ignorance comes in, from an engineering perspective: is the way fly.io thinks about edge databases more difficult to architect than the more traditional route of creating a subdomain for a region and just replicating your entire infrastructure in a new <insert cloud provider> region?

I guess you can set up the same kind of structure inside of fly.io, but I remember some of their writeups talking about deployment and pushing the DB to the edge and then having eventual consistency across regions?

I think that is my hangup on use cases.


That’s very rare though. Unless you’re Shopify, location doesn’t really matter


Common enough to have a couple customers in the EU or apac who consistently complain that your site is glacial and it turns out to be pretty bad for them...


Lots of us (well, me at least) use fly because it's a bundled set of aws best practices that I could configure in aws if I wanted to, but I'd waste another week of my life. alb + various vpcs + autoscaling group + fargate + ecs + their super shitty vpn service to vpn to a console + rds + elasticache or... just type "fly deploy" and go from zero to live in 20 minutes.

That said, fly's deploys are flaky. I hope they get it fixed because the rest of the service is pretty good.


The first part is good to hear. And the last part is the only reason I have not consistently used them. As an aside, I have started to use chatgpt heavily for aws questions and walkthroughs. I have been using ECS heavily, and this has been super helpful for me to get through what I consider the hard bits: the non-obvious permissions and configurations that aws does tell you about, but which are buried in the documentation for a json-like configuration object.


You could have your realtime competitive FPS game like Call of Duty host the data and compute necessary to run a match as close to the median location of all the players involved as possible to reduce latency. You could make the same case for something like Zoom or a collaborative editing tool.


Several SaaS companies are pushing for Next.js as their main SDK; that is how real customers end up on Vercel.


The only platform you can truly trust is the one you handcraft down to the NAND gates.


If you’re not mining your lithium and cobalt with your own handmade pickaxe you’re doing it wrong


If you don't have your own star spitting out bespoke elements, are you really doing it at all, or just using someone else's work


Edit: Never mind, wrong thread. Vercel does honor DMCA, of course, though.


You work at Vercel. Are you saying Vercel does not honour DMCA takedown requests and that is a selling point of using Vercel? This seems like a strange thing to brag about.


> This seems like a strange thing to brag about.

Not to me, it isn't.

There are plenty of areas-of-interest that attract both hobbyists and serious academics alike - which also tends to attract unwanted attention from callous legal departments who are keen to adopt a shoot-first-ask-questions-later policy - things like (lawful) research into DRM techniques, retro video games (and not ROM hosting), infosec disclosures, and so on. So if you're really into those areas and want a safe place for your lawful content, without worrying about your site/content/services being taken down without good cause, then it makes sense to side with a provider who is able to resist DMCA requests.


My uptime monitoring business made a similar migration (AWS Lambda to fly.io), and I ended up rolling it back a few months later.

I wrote more about the move to fly.io here: https://onlineornot.com/on-moving-million-uptime-checks-onto...

and (part of) the move back to AWS here: https://onlineornot.com/scaling-aws-lambda-postgres-to-thous...

Edit: forgot that second link doesn't actually explain that I moved off fly.io, will write a follow-up.


> Edge functions are cost-effective as you only pay for the actual CPU execution.

> We have over 1000 monitors, and the monthly cost to run them would be $150.

> While on fly we only have 6 servers with 2vcpu/512Mb It cost us $23.34 monthly ($3.89*6).

So edge functions are in no way cost-effective, right? People using lambda functions are getting ripped off, they could just buy a couple of VPS.


> People using lambda functions are getting ripped off, they could just buy a couple of VPS.

A non-zero amount of our CICD pipelines are "perform API call with secret pulled from SSM/Secrets Manager". They happen 1-2 times per day and take less than 5 seconds to run on each invocation. We currently have a burstable EC2 instance running 24/7 to handle these, which costs us ~$5/mo. My napkin math says that this would cost us ~$0.01/month to run these as lambdas. More to the point though, we're limited in concurrency on these. It's pretty common that they all get triggered at the same time; it would be ideal if we could allow for an "unlimited" number of these to run. This sort of workload would be great to run as lambda functions, but the engineering cost of implementing it just doesn't ever make sense.
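(A rough sketch of that napkin math, assuming a 128 MB function and Lambda's published on-demand rates of about $0.20 per million requests and ~$0.0000167 per GB-second; the exact numbers don't change the conclusion.)

    // Back-of-the-envelope only; the rates and sizes are assumptions, not this team's setup.
    const invocationsPerMonth = 2 * 30;                     // ~2 runs/day
    const gbSeconds = invocationsPerMonth * 5 * 0.125;      // 5 s each at 128 MB = 37.5 GB-s
    const computeCost = gbSeconds * 0.0000167;              // ~$0.0006
    const requestCost = invocationsPerMonth * (0.20 / 1e6); // ~$0.00001
    console.log((computeCost + requestCost).toFixed(4));    // "0.0006" -- well under a cent per month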

If we paid someone $150/hour to spend half a day on it, right now our break even point is probably 5 years...


Also, if you implement it as lambdas, your solution is now less portable. Your EC2 instance can probably be ported to something else more easily than some lambda solution.


This is objectively false. In fact, it's the opposite of true. Lambda forces you to have a single entry point to your code that passes the details of a request to your application in a structured way. Wrap that in the http server of your choice in about fifty lines of code and you're done.

Or, you use something like Express inside Lambda with Serverless and adapting it to a long running server is literally just deleting the fifty lines of code that export your lambda handler. It couldn't be more simple.

Which is to say, you'll almost always have more trouble going from something else to Lambda and not the other way.
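(A minimal sketch of the "same app, two entry points" idea, using Express plus the serverless-http package as one common way to do it; the route, port, and environment flag are illustrative.)

    import express from 'express';
    import serverless from 'serverless-http';

    const app = express();
    app.get('/health', (_req, res) => res.json({ ok: true }));

    // Lambda entry point: delete this export and nothing ties the app to Lambda.
    export const handler = serverless(app);

    // Long-running entry point: what you keep (or add back) on a plain VM or container.
    if (process.env.RUN_AS_SERVER) {
      app.listen(3000, () => console.log('listening on :3000'));
    }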


> Wrap that in the http server of your choice in about fifty lines of code and you're done.

> Or, you use something like Express inside Lambda with Serverless and adapting it to a long running server is literally just deleting the fifty lines of code that export your lambda handler. It couldn't be more simple.

...which is all extra setup that you need to do compared to if you weren't using lambda, in which case you'd already be set up using something like express. To be fair, it doesn't look like it'd be too much work to do the conversion. But it would still be extra work compared to, say, moving something running on express and deployed in a container to another VPS and/or container hosting provider, which would likely require no application changes at all.


There are services that offer compatible APIs to Lambda, in which case there's also no effort.

You could be switching from an integrated http server to using WSGI, which is almost identical to the effort I described. Who cares about the twenty minutes of work? It's "less portable" in such a trivial way.


> This is objectively false. In fact, it's the opposite of true.

I apologize for the tangent but in your mind, is false sometimes but not always the opposite of true? Or is this just emphasis? (Genuinely asking, not being snarky.)

ETA: Thanks for clarifying.


Something being wrong does not mean it is the exact opposite of what is the truth.


That's for abstractions to take care of. If my CI platform offers a lambda runner instead of an ec2 integration, I would use it in a heartbeat.

Also, it's a 5 line shell script that already depends on 2 AWS services (SSM and IAM) that makes an API call to a third AWS service (S3). It's already locked to AWS


I'm guessing the original developer was only $50/hr with a cheap, non-scaling setup like that.


I wrote it. It was a 50 line packerfile with a shell script that installs buildkite, docker, and the AWS cli, and hasn't been touched in 2 years other than bumping version numbers.


Edge functions are cost-effective; the problem is that they're comparing against Vercel.

Vercel is basically a dev-friendly wrapper for tier-1 services: https://news.ycombinator.com/item?id=35774730

E.g. Vercel is 25x more expensive than Cloudflare Workers. A raw guess would be that their $150 bill would have become ~$6.

https://news.ycombinator.com/item?id=37891412

E.g. image resizing:

> Vercel: $5 / 1,000 requests

> Cloudflare: $9 / 50,000 requests

Edit for comment below:

That's a blog post. I got my info from here:

https://www.cloudflare.com/plans/

See: image resizing

> 50,000 monthly resizing requests included with Pro, Business. $9 per additional 50,000 resizing requests


$5 / 1k requests... wow. That's like, entirely unworkable for almost all use-cases surely?

Even coming from a backend-heavy world where one request does a lot of work, that's got to be an order of magnitude off the mark at least. If you go for a frontend-heavy setup where there are more smaller requests (in my experience, common, when using things like Lambda), this could be another order of magnitude off again!


Sorry about that. The pricing was about Image Resizing requests (updated my comment).

On HTTP requests themselves:

Vercel is $2 per million requests. Cloudflare is $0.15 per million requests.

So Cloudflare in this case is >13x cheaper.


According to this page at Cloudflare, their pricing is 50 cents per 1000, is that correct? Or is it a different product... https://blog.cloudflare.com/merging-images-and-image-resizin...


50,000 monthly resizing requests included with Pro, Business. $9 per additional 50,000 resizing requests

https://www.cloudflare.com/plans/ - section: image resizing


Workers are also free for the first 100k every month I believe.


10 million included per month.

On the free plan, it's 100k per day included - https://blog.cloudflare.com/workers-pricing-scale-to-zero/


Oh, even better. Thanks :)


The free tier gets 100k requests per day, not month. And then for $5 you get 10M or 1M free a month depending on your settings and then you get charged for additional usage.


If you're bursty and only run 1,000 invocations every few days or weeks, and otherwise run it 0 or 1 times per increment, then you can end up spending a lot less than that estimated fly.io server cost, no?


Definitely, there will be cases where it makes sense. Intuition would suggest that if your servers are, say, 80% idle, then serverless functions would be cheaper, but that isn't actually the case. Cloud companies don't incur much of a cost from a VPS either if it's idle.

My team noticed the same with AWS Aurora Serverless (a database), it was so expensive that it was easier to just run a normal instance of RDS.


In what universe is the difference between $23/mo and some amount less than that worth worrying about? That's what? Two paid users' worth of revenue? $23 per month is a rounding error. It's one T-shirt. It's the cost of a few minutes of your time. If you're running a business and you're worried about saving ones of dollars on hosting, you need to reconsider how you're spending your time and how your business makes money.


Unless you remove the need for said server entirely; then you also save on reduced complexity/maintenance, which is money.


Both are free for business.


Why is Fly apparently so unstable? I, like many, love the idea, but get a little scared by the many, many anecdotes of issues.

What are they doing that makes it unstable? Lots of new locations spinning up that shake bugs loose? Cost-reducing refactorings that reduce stability?


(Fly customer for the past 12 months: small web app (three machines across two regions plus replicated Postgres across two regions, on a paid plan)). Fly has been extremely stable for us, with the sole exception of deploys: once a month or so, deploys from CI start failing for a couple of hours. That doesn’t result in any downtime (I have never experienced any downtime due to a failing machine on Fly), just that new code doesn’t end up on prod until it’s fixed. If it’s urgent I email support (highly competent), or wait it out.

I would describe myself as “extremely happy with the service, yet also annoyed by this aspect”. Fly allows me to manage my resources in a way that isn’t really possible elsewhere (from standard Python web apps in multi-hundred-mb containers to specialised Rust apps in < 10mb containers), and in a way that is (now) extremely simple to reason about, and the support has been excellent when I’ve needed it (they were very patient and understanding when I screwed up a region move and managed to somehow break my db leader beyond repair), but I’d like them to address this, because it’s a widespread issue. Given the evolution of their architecture, I suspect they will. But I’d also like them to talk about it more.


Thanks for the insight!


(Background: I'm currently using Fly for some hobby apps. I like it.)

It is still wildly unstable right now because they're basically still building the platform and figuring out how to run a business. Earlier this year there was a migration to their "Apps V2" platform [0] which was supposed to be simple, but it was so poorly communicated that a lot of users hit issues along the way and were forced to make forum posts to desperately figure out how to keep their production apps up. None of the migrations worked for me either; I didn't complain, being a freeloader, but seeing the support requests from paying customers painted a really bad picture.

[0] https://community.fly.io/t/get-in-losers-were-getting-off-no...


I lost data in the v2 migration, with downtime. Their support (engineers?) are customer-facing and unprofessional.


I still don't know what Fly Apps vs Fly Machines are and I stopped caring about their service as a result.


I love a good migration post-mortem, thank you to the author for publishing it! There's a bit of extra detail I'd be curious to know, as someone completely unfamiliar with both Vercel & Fly.io:

Re: "we required a lightweight server" as one of the drivers to migrate -- how did deploying to Vercel impede this? What specific business/operational issues was this causing?

Re: migration issue of large container image -- what business or operational issues did the large container image size cause? Why was it necessary to shrink the image size, when it could be previously ignored?

edit: it appears that fly.io previously had a 2GB container image size limit, relaxed on 2023/08/11 to "roughly 8GB" -- https://community.fly.io/t/docker-image-size-limit-raised-fr...


I joined a project that was fully deployed on Vercel. We routinely ran into issues with limitations, outages and sharp edges. Our junior devs had also taken advantage of Vercel-specific features (I specifically remember a Vercel-specific request object in the code).

Given all the problems and the vendor lock-in from tight coupling, I advise everyone I discuss Vercel with to avoid them like the plague.


Can you elaborate more on this? Or point me to a discussion? My org is planning to move to Vercel, so it'd be nice to know its pitfalls.


Every day I need to add a new feature to my app, I am grateful I picked fly (serverful) rather than Vercel. The fact that as far as I'm concerned, it's just a computer, is incredibly useful. We've added long-running tasks, background jobs, scheduled tasks, side-car processes, custom-code execution, etc etc. Then, the fact that I can run something like Redis or Metabase within the same VPN with just a dockerfile is incredibly empowering. And just giving up basic things like SSH access to your server seems like an incredibly short-sighted thing to do. Maybe I'm too old, I just don't get it.


It's not "just" a computer, a computer is a whole bunch of complicated stuff that I don't want to have to care about. I want to write some code and have it run and I don't want or need to care about the details of how that happens as long as it works reliably. Being able to ssh into your server is giving you more tools to fix problems, sure, but mostly problems that you created for yourself by having a server in the first place.


> I want to write some code and have it run and I don't want or need to care about the details of how that happens as long as it works reliably

I'm sorry, this is an incredibly stupid take. You always "need" to care about the abstraction that your infrastructure is providing to you. Vercel also provides an abstraction in terms of serverless functions.

>I want to write some code and have it run and I don't want or need to care about the details of how that happens as long as it works reliably.

Yeah, same. As long as it works, I have no problem. Now add background tasks or streaming responses or a cron job. Oh, guess what, you have to suddenly care about the options your provider is giving you, or go out and buy some stupid cron-as-service or ssh-as-service because you don't have any control over your infrastructure. And now suddenly your infra is way more complicated than mine. I am still on that single dockerfile.
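(A minimal sketch of that point, using the node-cron package as one example of a scheduled task living inside the same long-running process; the cleanup task itself is hypothetical.)

    import cron from 'node-cron';

    // Every 5 minutes, inside the same process as the web server -- no separate
    // cron-as-a-service needed when you own a long-running box.
    cron.schedule('*/5 * * * *', async () => {
      console.log('running cleanup at', new Date().toISOString());
    });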

>Being able to ssh into your server is giving you more tools to fix problems, sure, but mostly problems that you created for yourself by having a server in the first place.

How is running a clean-up script anything to do with having a server? That is the most common use-case for ssh-ing into your server. In fact I am wracking my brains right now to come up with anytime I had a problem because of having a server and coming up short. Fly.io (or AWS, or GCP) has problems, for sure, but none of them are because I am running a server.


> You always "need" to care about the abstraction that your infrastructure is providing to you.

Sure. But as long as it implements something reasonable, you don't care about the details of how. "Runs version x of this programming language" is generally an easier abstraction to run business functionality on than "it's an x86-compatible computer".

> Now add background tasks or streaming responses or a cron job. Oh, guess what, you have to suddenly care about the options your provider is giving you, or go out and buy some stupid cron-as-service or ssh-as-service because you don't have any control over your infrastructure.

Cron is a terrible model for actually solving business problems with. Do you know what happens when a cron job fails to run/errors out?

Running your own unix system gives you a bunch of options that a higher-level abstraction doesn't, sure. But those options are rarely worth the cost, IME. (And like I said, if you're actually getting a unique value-add from running the whole server, then do it!)

> And now suddenly your infra is way more complicated than mine. I am still on that single dockerfile.

I guarantee you that anything Docker-based is more complicated than what I'm doing. Docker is the worst kind of layer; it doesn't provide a consistent abstraction of its own, you still have to know all the ins and outs of how the unix system it's running on works, and it just adds a whole bunch of extra concepts on top that you have to learn. And then it sometimes breaks the usual rules of the platform it's running on as well! (e.g. silently bypassing your iptables rules).

> In fact I am wracking my brains right now to come up with anytime I had a problem because of having a server and coming up short.

- Your program runtime crashing because of mismatched system library ABIs

- A dependency you didn't expect is suddenly available because it got installed by the base system

- /var filling up because of out of control logs or the like

- Any other disk filling up for whatever reason

- Log collectors going AWOL

- Directory traversal order differing because two servers are using different filesystems

- Upgrade changed the network management commands and now all your traffic is being blocked

- Buggy RAID controllers

- Thermal throttling kicking in when it shouldn't because of a broken temperature sensor

- Power outages

All this stuff still needs to be fixed, but it's great to have it be someone else's problem and get on with your program.


The big advantage of Docker based solutions is that they're portable between providers. And you can ramp up the complexity as you need. Just need a language runtime? Then you can have a single-line Dockerfile. Need to support a native dependency? Then you might need to install it, but at least it will be possible to do that.


I can do all that with puppet without the extra complexity of containers.


> Cron is a terrible model for actually solving business problems with. Do you know what happens when a cron job fails to run/errors out?

Do lambdas not fail to run / error out? I’m not following.


> Do lambdas not fail to run / error out? I’m not following.

You generally have some kind of monitoring/alerting built in. With cron the usual behaviour is to email root@localhost using local sendmail, which generally achieves nothing except for filling up /var.


Here's a thought experiment people may or may not find helpful. If you're writing say a Flask app, what Flask is doing for you is routing a request to a function. That's where the core kernel of value is; the rest of what's going on is overhead you pay to wire your function up to what it needs, like a database connection pool and such.

So if you were AWS and you saw everyone running an instance of Flask, you might think to yourself, I could run one really big instance of Flask that everyone could share, and the economies of scale would mean I could charge a cheaper price.

And you as the software developer might think, well, I get paid to execute these functions, not to run Flask, so I might as well rent a spot in the big Flask. Then I won't have to spend time updating and maintaining the framework, I can focus on writing my functions.

This may or may not work out for a specific use case, eg maybe that database connection pooler that we threw out was load bearing and moving to serverless overwhelms our database or causes us to spin up more database servers and costs more money. YMMV.


[flagged]


> If you are a software engineer, and you think that a server is "complicated", then why are you in this business at all?

I'm in this business to solve big problems with a minimum of effort. The three great virtues of a programmer are laziness, impatience, and hubris.

(I do in fact know how to build and maintain a server. But unless you're getting some unique value-add from doing that, it's a waste of time that you could be spending more productively)


I would be curious to know the performance using Node.js as the runtime, given that at the moment there is no evidence that Bun offers better performance on a real application.


I've tried Fly.io probably 3 times. I've never gotten a simple Node.js project to deploy correctly. Meanwhile, I deployed the same projects to DigitalOcean and Render without a single change successfully every time.


From one toy to another.


Are Adobe, Splunk, Washington Post, Netflix, Zapier, Notion, and Uber toys? Because they're running on Vercel infra.


It's not about Vercel or Fly.io. It's about openstatus.dev.

Their migration timeline from their blog:

1. August 2: 48 hours after public launch, 400+ users

2. August 20: migrate from PlanetScale to Turso (SQLite)

3. Oct 29: migrate from Vercel to Fly.io and from Next.js to Hono; also mentioned a change to Bun.

This seems like they tend to (sorry, I'm judging here):

1. move fast break things or

2. don't have a plan before launch day or

3. only chasing the latest tech buzz.


We are trying, breaking and learning.

And you were right, we did not have a plan before launch; we wanted to build something that brings us excitement. And we are planning more for the future, since the project took off.

FYI, we both still have full-time jobs.


Sorry for the harsh words.

Obviously I can't speak about excitement, but if you're worried about cost, you might want to research AWS grants or things like that.

In 2019 my company invested in a startup; while doing IT due diligence I found out they got a $100K AWS grant to be used over 2 years. FYI, this company's business is writing articles about babies, and it had almost no revenue at that time.


Next up: Migrating to Hetzner.


TLDR: They could have done it cheaper, quicker, and without adding DevOps to their workload by just migrating to Cloudflare.

- Vercel: $150/mo.

- Fly: $23/mo (+ managing servers and devops)

- Cloudflare: $11/mo.

--- (original comment)

They could have gone from Vercel to Cloudflare to reduce their costs.

But that would have been almost no work to create a blog post about :p

https://developers.cloudflare.com/pages/framework-guides/dep...

Did some raw math.

Cloudflare is $0.15 per million requests and Vercel is $2 per million requests.

Their calculation for Vercel was: 77,600 * (2/1,000,000) ≈ $0.155 per monitor monthly.

So that's ~$0.012 per monitor monthly on Cloudflare. That would be a bill of roughly $11-12 per month (vs ~$150 per month on Vercel). Probably less, since Cloudflare doesn't count idle CPU time, which is very relevant in this use case (outbound HTTP calls): https://blog.cloudflare.com/workers-pricing-scale-to-zero/
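Spelled out (same rates and the ~77,600 requests per monitor per month figure as above, for roughly 1,000 monitors; illustrative only):

    const requestsPerMonitor = 77_600;
    const monitors = 1_000;
    const vercelPerMonitor = requestsPerMonitor * (2 / 1e6);        // ~$0.155
    const cloudflarePerMonitor = requestsPerMonitor * (0.15 / 1e6); // ~$0.0116
    console.log((vercelPerMonitor * monitors).toFixed(0));          // ~155 $/month
    console.log((cloudflarePerMonitor * monitors).toFixed(0));      // ~12 $/month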

Which is cheaper than their VPS bill of $23.34/month.

And it would have avoided adding server management and security to their workload...


It is beyond me why anyone but the most pre-revenue bootstrapped projects would spend $150k+/yr eng hours on saving $100/month on infra. Projects like this are trying to make $1m+/yr in revenue.


This was done in a weekend, FYI (mentioned in the blog post).


Founder of OpenStatus here: we can't use Cloudflare because there's no way to execute in a specific region; if you know how to do it, let me know.


That was an interesting rabbit hole, thanks :p

Found this to be the best resource:

https://community.cloudflare.com/t/how-to-force-region-in-cl...

Guess it's a bit more work than originally expected.

An alternative would be to use proxy IPs to hint at regions, which would resolve to other locations, and then parse the colo from the request.
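(For the "parse the colo" part, a minimal Worker sketch: Cloudflare populates request.cf at the edge, and cf.colo is the IATA-style code of the serving data center. The proxy/routing side of the idea is left out.)

    type CfRequest = Request & { cf?: { colo?: string } };

    export default {
      async fetch(request: Request): Promise<Response> {
        // e.g. "AMS", "SJC"; the cast just keeps this sketch self-contained
        // without pulling in @cloudflare/workers-types.
        const colo = (request as CfRequest).cf?.colo ?? 'unknown';
        return new Response(JSON.stringify({ colo }), {
          headers: { 'content-type': 'application/json' },
        });
      },
    };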


Can you deploy 700MB Docker images to Cloudflare, as they mention as a minimum requirement in the article?


Why would they need that on Cloudflare?

They wouldn't have had to change much of their original functions (or package them into Docker) if they had switched to Cloudflare from Vercel directly.

That would have been a lot quicker for them to do...

Alternatively, Cloudflare supports Hono, which they moved to.

https://developers.cloudflare.com/pages/framework-guides/dep...


Didn't they only need the 700MB Docker image because Fly.io required it?


Fly.io supports up to 8GB docker images


Cool. By the sounds of it, they needed the Docker image in the first place because they chose Fly.io. This guy is saying that Cloudflare wouldn't support it, but if they went the Cloudflare route (whether or not that's actually possible, I'm not saying), they wouldn't have needed the Docker image in the first place.


Correct. Which was actually already in my TLDR:

> TLDR: They could have done it cheaper, quicker, and without adding DevOps to their workload by just migrating to Cloudflare.

See: quicker = fewer changes required to migrate...


> Additionally, we have not discovered a quick method to rollback to the previous version

I feel like this should be a high priority. Deployments should be quickly reversible so that a live-site incident caused by a bad deployment can be mitigated quickly.


I’m curious if you looked at Heroku (I work there). You mention functions (which we don’t support), also servers (which we definitely support). I’m curious if that’s it or there was more to the decision.


I’m a bit out of the loop, but I thought Heroku died or is languishing under Salesforce. That’s my current perception, and I no longer see it recommended in HN threads. Hopefully this does not come off as an attack (it’s not).


I'm currently using Heroku for a small business app, and it is working wonderfully for me


It never stopped working, it just... stopped. More of an omen than a practical issue (so far)


That's a good way of describing it. Another issue is they've locked a whole lot of useful (practically required these days) features behind an enterprise account - the trouble is that "enterprise" isn't just paying a whole lot more (if only). Enterprise involves getting into Salesforce-style opaque, fixed-price, annual, paid-upfront contracts, etc. It's just not a cloud provider any more at that point - you know, the whole elastic thing the cloud was supposed to do.


We’ve run Domainr on Heroku for over a decade, and it’s been rock solid all along.


I discovered fly because I made a Heroku account, connected the wrong card (I’m a broke college student), and Heroku told me I couldn’t change the card for the next 30 days. This was all within 5 mins of making my account. I couldn’t find any support avenues.

I tried many clever workarounds but their alt-account detector is top notch (props to that team).

Asked around in dev circles and they recommended me fly io.

Idk who hurt heroku for them to put such measures in place but I’ve never encountered such strict policies before, and I’ll forever avoid places like that.


Hmm, if the whole point is to save memory and get smaller sizes, and since they are willing to go with Bun (a very experimental type of tech), then they should also have just tried Deno.


I have been using Vercel for production NextJS apps and have been very satisfied.


We are too, but for a simple REST API it might not be the best.


No, probably not. I use it for NextJS hosting and I absolutely love the page invalidation. It probably has saved me thousands in server costs.
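("Page invalidation" here presumably refers to Next.js's Incremental Static Regeneration; a minimal sketch in the pages-router style, with a hypothetical endpoint.)

    // The page is statically generated, then re-rendered at most once per minute
    // on demand, so most requests are served from the static cache rather than
    // hitting origin compute.
    export async function getStaticProps() {
      const posts = await fetch('https://example.com/api/posts').then((r) => r.json());
      return { props: { posts }, revalidate: 60 }; // seconds between regenerations
    }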


these issues aren’t particularly severe and strike me as the kind of thing you’d generally run into switching hosts.


I find it interesting that people seem to be trading short-term gains against long-term reliability and maintenance costs. This glut of 0-friction deploy services lulls people into a nice false sense of security.

But in actuality you are wasting hours, days, weeks of time when they become unreliable, support is unresponsive, or something unexpected pops up.

There is a huge advantage (outside of amateur, low importance projects) for putting in place - at the beginning - an infrastructure that is dead simple and reliable. AWS, GCP may have some upfront complexity but provide advantages in terms of reliability, knowledgeable support, and proven track records.

I would never recommend these current platforms to be used for building a long term business on top of. I have been tempted by the siren song of one click deploys but in the long run so much extra time is wasted.


> I find it interesting that people seem to be trading short-term gains against long-term reliability and maintenance costs. This glut of 0-friction deploy services lulls people into a nice false sense of security.

I find it interesting as well. I agree that it's a false sense of security, and there is no real long-term gain from avoiding the one-time paydown of deploying to a big 3 cloud services provider. Still, I think the impulse reflects a very real pain, something I find my team continuing to face as we try to manage the most operationally minimalistic stack we can get away with on AWS -- poor DX.

It does still boggle my mind that AWS doesn't have a Heroku-esque happy-path DX that lets you get started easily and then add in complexity on an as-needed basis, rather than forcing it to get the most basic thing running. It seems like every minor customization requires, in AWS parlance, spinning up a Lambda to do something that should be a first-class feature in the platform by default. Will I migrate off the platform? No. Would I use a simpler, opinionated interface that let me focus on my application and not arcana, if AWS made it available? Absolutely.


The latest effort from AWS to address this seems to be copilot.


Azure actually had a nice Heroku-like service that served me well for a couple years. I forget what it’s called, but it’s probably the one reason I’d ever consider choosing Azure if that was ever my call to make.


This! Couldn't agree more; I think we share the same idea, and that's what the tool we're making is about: https://github.com/mify-io/mify/. It generates backend service code in a scalable way from the beginning, so that you don't have to rewrite and move services to some other platform later.

It's better to have good architecture from the beginning, but I understand why people choose these platforms - they save a lot of time in the initial development, which helps them iterate quickly. What happens next is that people spend time and resources performing costly migrations, and some do this more than once.





