I guess for less complex apps this can be mitigated with something like Heroku, but still... do they hire freelancers to “watch the shop” when they want a break or are they chained to PagerDuty 24/7?
What has helped me as the only technical founder, as a freelancer, or in very small teams in general:
- Choose boring technology. Especially when alone, I prefer reliability and a deep body of operational knowledge over shiny features.
- Choose technology and infrastructure that you know. It is a whole lot easier to maintain a stable system with something you have ample experience with.
- Keep system complexity roughly aligned with team size. E.g. when alone, it might not be the best idea to maintain 5 very different database systems, although on paper each is "the best tool for the job".
- I don't think you need any super advanced, well-thought-out architecture, but if you are constantly firefighting while at work, what you have might not be good enough.
- Set up basic automation so the system can recover itself from the unavoidable but benign hiccup every now and then.
- Don't deploy before going for lunch, coffee break, dinner, weekends, etc.
- While working, observe your system's behaviour over time, and especially the impact of changes on it. If you see a degradation, fix it or at least put it in the backlog. Otherwise it will eventually bite you out of nowhere.
- Have nice error pages and messaging that are shown to users when the system fails. In my experience at early-stage companies, crashes suck, but they aren't actually that bad after all; users are quite lenient as long as they can see that the system is down, instead of having the worse experience of it just not working correctly.
Try to minimize the user's loss of effort if your server crashes (if you have a long form a user needs to fill out, consider persisting the data to local storage and restoring it from there, so that if the server is down and the user has to come back later, their data isn't lost).
If you can have a not awful experience when your small SaaS crashes, then it's probably okay to aim for 2-nines* of reliability instead of 5-nines. You're not Amazon as a small SaaS, it's okay to have a little bit of downtime now and again.
Your product is important, but it's also important to keep your quality of life in mind. Spending the time to polish the failure scenario means you can be a little more tolerant of failure scenarios.
* maybe slightly more than 2-nines, but that's the general idea.
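One cheap way to get the "recover itself" automation described above is a supervisor that restarts a crashed process but refuses to flap forever. A minimal Python sketch (the command and limits are placeholders; in practice systemd's `Restart=on-failure` or monit give you this for free):

```python
import subprocess
import time

def supervise(cmd, max_restarts=5, window_secs=300):
    """Restart cmd whenever it exits, but give up if it crashes
    more than max_restarts times within window_secs (a crash loop
    usually means a bad deploy that a restart can't fix)."""
    restarts = []
    while True:
        proc = subprocess.Popen(cmd)
        proc.wait()  # blocks until the process exits (crash or clean stop)
        now = time.time()
        restarts = [t for t in restarts if now - t < window_secs]
        restarts.append(now)
        if len(restarts) > max_restarts:
            raise RuntimeError("crash loop detected, alert a human")
        time.sleep(2)  # brief backoff before restarting
```

The restart budget is the important part: unlimited restarts turn a benign hiccup recovery tool into something that silently hammers a broken system.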
Hardware RAID cards.
Plus an architecture that is robust.
In my experience, good dedicated servers practically never crash. You might lose an HDD every few years, but that is not urgent to fix if you have a good RAID setup.
Avoid most cloud services. Heroku, Rackspace, and AWS have all had many more outages than Hetzner. Plus they'll sometimes force-reboot or force-migrate ( = pause) your instances.
So if you go cloud, you'll need failover, distributed database, all that messy and complicated stuff. If you go dedicated, it's much easier and you only need to keep that one box running.
Plus, honestly, would your customers really mind if you're offline for 5 minutes? My dedicated host also has a service where they will monitor standard services like Apache, PostgreSQL, or Rails for you and restart them as needed. They have 5-10 minutes response time in my experience, and I believe it's good enough :)
Also, going dedicated makes it affordable to overprovision 10x the hardware you need, so you will practically never have a traffic spike high enough to cause issues.
With Heroku / AWS on the other hand, everyone else will also be scaling up when their cloud has hiccups, so your on-demand instances might not start when you need them.
Anyway, Hetzner dedicated + raid + monit is how I've been running my SaaS company for 10+ years. And I don't even remember which year I last had an issue that was both urgent and required my attention. The Hetzner ppl can exchange HDDs just fine without me. C++ core, Ruby website, Postgresql and RabbitMQ. 100GB database, 5TB customer data.
> Plus they'll sometimes force reboot or force migrate ( =pause) your instances.
Extremely rare, but probably happens at a similar rate as your "single box" dedicated provider losing an HDD or having a datacenter blip.
My point here is that what you described sounds like the kinds of things SaaS developers needed to worry about ~10 years ago. The platforms of today aren't perfect, but they abstract away 90% of that and allow you to focus on business logic, which is exactly what a single SaaS developer should be doing.
BTW, I don't get commission, payments or anything from Hetzner. I'm just super enthusiastic about them because their affordable pricing is making me rich.
I agree that it's a tradeoff, but the question was about small companies. And there I'd say 5 minutes of occasional downtime is absolutely fine if it saves you $100k annually. And for a single founder, those 100k in profit will be kind of a big deal ;)
Modern platforms should remove most of the complexity of operating routine infrastructure and allow you to focus on your business logic, but it doesn't always work out that way.
Just the fact that your VMs have a significant chance of being forcibly shut down with little notice is a significant downside, for example. As a solo operator, you now have to arrange all the automatic scaling and failover configuration on your cloud host as well (possibly at considerable extra cost for capacity you might not be using 99% of the time) and you have all the 24/7/365 monitoring problems that OP was asking about.
Cloud services are also notorious for obfuscating their pricing so it's hard to work out the TCO. In my experience, arguments that cloud hosting works out much cheaper overall tend to be based on rather optimistic assumptions. It might be true if you lease some VMs at carefully chosen sizes and then set everything up yourself including scaling things down again any time you don't need them. However, once you start using the automatic services that actually do something for you beyond supplying a machine on demand, the prices might jump 3x or more (sometimes much more, like orders of magnitude) compared to ordering the equivalent basic resources and setting the same functionality up manually.
Then there are the security and compatibility updates. The basic cloud services tend to be provided as-is and it's up to you to ensure everything gets updated when it needs to be. Or again, you might be able to get a more automated service that does some of this for you, but it will come with a pricing premium.
Meanwhile, a solo operator using a more traditional managed service probably doesn't have to worry much about any of this, because those services will often be happy to take care of things like setting up your redundant database servers or monitoring the security mailing lists and applying emergency patches very quickly so you don't have to. That's the level of individual service and advice they tend to offer to distinguish themselves from the generic cloud hosting services. Obviously you do pay extra for that management service compared to just a basic hosting arrangement, but whether you pay more than you'd have paid trying to do all of it yourself on AWS or Azure or even DO is another question.
If Hetzner fits your use case (now and future case) then it's a great way to go.
> In my experience, good dedicated servers practically never crash
One of my toy servers (ecc ram/xeon cpu - but bought "second hand" via hetzner's auction) disappeared the other day. I thought maybe a disk had failed - but I couldn't bring it up in their network booted rescue mode - and requested a "hands on" power cycle - and after a few minutes the server was up again:
> Dear Client.
> A fault in your neighbor server's PSU also tripped the fuse of the small rack segment in which your server is located. We have fixed the issue and now your OS is back online.
Now, I think that box had 700-900 days of uptime before; I didn't really have to do anything (or pay) to get it back up.
But it was kind of surprising.
I guess all I'm saying is that I do like cheap, dedicated servers from hetzner - but if you need to guarantee five nines uptime, the architecture part is important.
Five-nines is less than 10 minutes of downtime per year. I doubt anyone is really guaranteeing that without 24/7 active monitoring and maintaining extensive automated failover systems, which is already several full-time jobs. No solo operator is credibly providing that level of service.
OK, I concede that it is not completely inconceivable to do that, but unless the service you're operating is relatively light in its demands on the tech stack, I think it's a very impressive achievement to maintain infrastructure that can consistently and reliably deliver that performance on your own if you're also the person doing the development work and your infrastructure costs aren't getting silly.
We have a simple, fully redundant architecture at one of my businesses as well, and I suppose we probably do achieve five-9s most years, but I wouldn't be willing to guarantee that to customers with serious money on the line if we missed it. We're still only D disk failures at similar times away from degraded performance while we spin up new machines from scratch, or N network failures away from degraded performance until we can bring up more capacity where it's still available.
.9990 or .9995 is much cheaper, much easier, and probably closer to what your end user's network connectivity is anyway. (Yes, they're multiplicative, but if your user is connecting from a single-path residential connection and a $50 router, your 5th nine is lost in their 5-10 hours of local downtime per year.)
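The arithmetic behind the nines is easy to check: each extra nine cuts the allowed downtime by 10x, and serial availabilities multiply.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(availability):
    """Allowed downtime per year at a given availability level."""
    return (1 - availability) * MINUTES_PER_YEAR

print(round(downtime_minutes(0.999)))    # three nines: ~526 minutes/year
print(round(downtime_minutes(0.99999)))  # five nines: ~5 minutes/year

# Availabilities of serial components multiply, so a five-nines server
# behind a ~99% residential connection is still ~99% end to end.
print(round(0.99999 * 0.99, 5))
```

Which is the point above: the user's last mile dominates, so the fifth nine on the server side barely moves the end-to-end number.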
Though the promised uptime might not matter that much in practice, since the penalties for a couple of hours of downtime are often affordable.
Also, I'm not convinced that it's needed for a normal product. When my work Gmail was offline, I just had lunch early and then later it worked again.
A single founder offering higher availability than Gmail sounds like a masochist to me.
Can/should you really run everything on just 1 box, with rather huge projects? Why not gain redundancy/uptime/peace of mind by having multiple (redundant) dedicated boxes?
Maintaining a single machine is always going to be much easier than a cluster with k8s. Not to mention you can often toss most of your data set in RAM.
Not having to worry about sharding, affinity issues, DNS/addressing/networking, extra security is a godsend. Everything is easier on one machine.
Having a redundant machine for failover and release staging might be a good idea. But you'll need to figure out how to replicate your database and possibly your in memory cache layer (redis/memcached/etc.) and test it all. Not to mention database migrations can get tricky. Really, most people can probably get away with the typical maintenance window and notification, and shut everything down for 4 hours on a Saturday night or whatever. I mean... major banks and utilities do this. You'll be fine.
But one server is way more convenient, until it isn't.
In the event of unplanned failover, it looks to the VM like an unplanned abrupt reboot took place. In reality it reboots, usually very fast, on the other host.
All services running inside can recover in the usual way (journalled filesystems, databases, programs restart), and don't need any high-availability configuration or replicas configured.
You do need to ensure I/O is committed durably across the network, including I/O barriers. This is a combination of VM host, filesystem and DRBD config.
(It is actually possible to do this with the VM not even seeming to reboot, so network connections and processes are unaffected by the fail-triggered instant migration. This is done by running VMs in synchronised tandem and is a rather more advanced technique. I've never used it.)
To be fair to Hetzner, eventually when I reported it they immediately took the initiative to replace it, no questions, and gave me options for when I'd like that to happen. Never had any problems since.
Idk what kind of software you work on but in my slice of the B2B SaaS world, 5 minutes of downtime during business hours would generate 100s of support tickets with very angry users letting us know they couldn’t do their job.
Perhaps part of the answer is that if you're a solo founder who is staying solo, try not to create a business where you have these kinds of dynamics.
OR, if you are creating one that does, scale up past being solo ASAP.
Your support ticketing system and company web presence are two I can think of. I provide the ticketing service myself but I outsource my web presence (the landing page not the app) to a third party service. My thought is the web site should never go down. If the ticketing system is down it means things are beyond hosed :)
I doubt you're trying to solve the same problems as the GP at that scale, or probably even the same types of problems.
Maybe 400 businesses each with 10+ employees and 10,000+ customers (who also had access to the platform). It was too much and I regret not hiring sooner.
Another thread mentioned here that hiring someone you trust enough to own problems when you’re unavailable is daunting. I concur with that.
FWIW, that sounds like an interesting story/case study, if you're willing to share some time. I'm not surprised by your conclusion, but I'm quite impressed if you managed to scale even tolerably well to that level before bringing in some extra help.
I guess the question is, what happens if the server goes down whilst you're at work? The answer is that if you're constantly fighting fires 9-6, your software is probably severely broken. I'd suggest this is pretty unusual, or at least, I've never heard of software being held together like that at a company that still exists.
You wouldn't want the servers to go down whilst you're at work, in a meeting, or out to dinner with friends. So you design things to be as redundant as reasonably possible.
Then when you make a mistake, you fix it so it never happens again.
Server fear should be the least of your worries. As a founder, lots of things can go wrong that will interrupt a holiday or downtime. In my experience, it's rarely, if ever, software or hardware issues.
 - https://www.openrent.co.uk
I agree. My own product did not go down pretty much at all in 3 years. But I've seen problems all over the place while working FT at tech companies. Usually, these were created by devs during software releases that would take systems down or corrupt data.
So, don't release silly things before you go on a vacation.
The idea is that you don't have to monitor anything actively. You still have your phone and email in case something goes wrong. But you are not frantically checking dashboards because response time is 0.05 seconds slower at any given moment.
What the parent is saying is that if you have normal operations, most of the time everything will be OK. Despite all the "spooky stories", software and servers are mostly reliable, and you probably don't need as much redundancy as the people who sell magic solutions would like you to believe.
If you look at the statistics, even for Google, most downtime happens when someone is changing something on the servers. When you are a single founder having dinner, not changing your server config or deploying new software, 95% of the reasons for a server going down are off the table.
In short: it's tough; you're never off. Our errors either surface by way of user emails or monitoring (shoutout to BugSnag), and to this day I still have anxiety going places without my laptop for fear of a critical error coming up and not being able to fix it. I can recall running out of conference talks, being at shopping mall with my wife, and SO many other incidents where I'd hop onto the floor of a hallway, pull out my laptop, and frantically try to figure out what's wrong (and fix it).
On the support side, we have a small number of large clients. In this regard, there's no such thing as completely disconnecting. I have a shortlist where if I get an email from _____, it doesn't matter what I'm doing, I'm responding within an hour. Outsourcing to "watch the shop" is quite difficult; I find that some businesses can do this more easily than others. For something highly niche, it's more challenging.
On the tech side, I use managed services wherever possible. Heroku is wonderful (IMO), BugSnag is fantastic, we recently switched to Postmark which helped with deliverability of emails.
I've loved building this business. Control over my time each day is a reasonable trade for having to occasionally (rarely now) drop everything. At the same time, I miss big tech and the community of being at a larger company.
Hope that helps :)
One thing you can do is to properly configure your monitoring software.
1. Pick the right alert sensitivity + notification channel:
If your app is well-built and never goes down, 30 second checks and getting alerted after the very first failed request works well. However, if another legacy app is unreliable and often goes down for ~5 minutes when making DB backups, configure your monitoring so that you only get alerted when the legacy service goes down for at least 10 minutes.
2. Get phone calls for high urgent alerts (e.g. homepage is down)
3. Push notification/Slack message for low urgency alerts (e.g. background processing queue has too many tasks enqueued). If you're at a dinner with friends and you get a low-urgency alert you can just ignore it.
4. Don't take it too seriously! Odds are it's not a life/death situation when your app goes down. Downtime happens to everyone!
5. Pick a reliable uptime monitoring provider so that you never get a false incident at 4am (shameless plug! :)
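The sensitivity tuning in point 1 mostly comes down to paging only after N consecutive failed checks. A minimal sketch of that logic (the URL, interval, and threshold are placeholders, and a real monitor should of course run from a separate machine):

```python
import time
import urllib.request

def check(url, timeout=10):
    """True if the URL answers with HTTP 200, False on anything else."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def next_state(consecutive_failures, check_ok, threshold):
    """Advance the failure counter; page only on the Nth failure in a row,
    so a single blip (e.g. a nightly DB backup pause) is tolerated."""
    if check_ok:
        return 0, False
    consecutive_failures += 1
    return consecutive_failures, consecutive_failures == threshold

def monitor(url, interval=30, threshold=2, alert=print):
    failures = 0
    while True:
        failures, should_alert = next_state(failures, check(url), threshold)
        if should_alert:
            alert(f"{url} failed {failures} checks in a row")
        time.sleep(interval)
```

Raising `threshold` (and `interval`) for a flaky legacy service is exactly the "only alert after 10 minutes" configuration described above.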
Google Firestore + Cloud Run + Cloud Storage really work well together. There aren’t any servers to maintain, it auto scales to zero.
Compared to some droplet VMs in digital ocean which got restarted every now and then, cloud run has given me 4 nines of reliability according to updown.io monitor.
It’s fast, it’s cheap, it’s low effort once you get the continuous deploy bits setup.
Basically everyone providing cloud hosting has something that is a VM and something that is a managed database. If you build your system with any standard tech like Linux or popular programming languages or major databases, it's going to run on any cloud platform you like with relatively little change.
However, to do most useful things with serverless, you're going to need to tie into a specific cloud provider's ecosystem to a much greater extent. That means a lot of platform-specific code talking to proprietary APIs, which feels like it could become a significant drag if you were using serverless for core aspects of your software rather than just the occasional bonus.
There is, you can run serverless Docker on GCP, AWS and Scaleway at the very least.
The main value in serverless, to me at least, is in the ability to write only the actual functionality you need and not care about setting up any run-time environment or setting up a substantial build process to make artefacts to support the run-time environment. With something like AWS Lambda, you can literally just copy and paste some function in any of several programming languages into a box on the dashboard, set a couple of details for security etc, and make it live.
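To make that concrete, here's roughly the scale of function being described: a minimal Python handler of the kind you could paste into the Lambda console (the `name` field in the event payload is a made-up example, not a Lambda convention):

```python
import json

def handler(event, context):
    """Minimal Lambda-style handler: return a greeting for a 'name'
    field in the event payload (hypothetical field for illustration)."""
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }
```

No build process, no runtime environment to manage; the platform supplies everything around the function.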
So if you need 98% or 99.99% availability, your designs will be very different. When you start designing for >99.97%, things will get complicated.
I do make sure I'm always available to fix things within a reasonable time. Practically, I try not to do anything where I would be physically unable to get to a computer with Internet access within 30 minutes, though pre-pandemic I would set my phone to silent when I went out to see a movie or was at the gym. Sometimes this also means bringing the laptop in the car when going places you don't plan to actually work, just in case.
One side effect of needing to watch for incoming notifications: I have East Coast relatives who insist on texting pre-7am Pacific (sometimes 20-message text chains), and wouldn't lay off when I told them it was too early and I couldn't just turn off my phone because I need to check for work notifications. Texts and calls from them are now muted 24/7, at least until I eventually get a work-specific cell phone.
1. I searched for basic monitoring solutions for actively monitoring the backend and settled on New Relic. They provide a free plan that is good enough for most startups. I have added a bunch of graphs for system, infrastructure, and application monitoring. It keeps me sane and well-informed before things go wrong.
2. On my Digital ocean droplets and database, I have set up Slack alerts that page me in case there is a spike. I have created a free slack workspace just for this and added a different alert ringtone so as to not get confused with other workspaces.
3. I use Freshping to monitor uptime and again, if things go down, I get email and Slack alerts within a couple of minutes.
4. I have a Rollbar agent running for log monitoring. I get an email alert when there is an exception or error.
5. If I am out for more than half a day, I take my laptop with me.
6. I keep my phone on. Always.
In the last year, things have rarely gone down; maybe a couple of times.
Things I do so I can sleep properly,
1. I do not deploy before heading out, or on Fridays or at bedtime.
2. My infrastructure has a lot of headroom, meaning a larger instance than required, to handle a spike in case I am unavailable.
3. Databases are the usual point of breakdown, so I have recently migrated to a DigitalOcean managed database.
Things I am planning to do,
1. Try out Monit to automate some of the tasks.
2. Write down a list of steps or a runbook in case things go wrong. It is easy to forget steps when the production system is down.
- Use Heroku. Monitor metrics to ensure you don't have major performance issues
- Use Datadog. Datadog can monitor and fix many things (web request queue too big -> trigger lambda function to scale up Heroku dynos, Worker queue latency too high -> same thing, scale up worker dynos, memory swapping -> restart dyno).
- Spend a lot of time fine tuning your logging, and custom metrics in Datadog. Makes investigating much more pleasurable.
- Any issues or exception notifications route to a #devops channel in my slack. Other slack channels include signups, business metrics, daily revenue reports, etc
- If something ever happens where you had to intervene to fix it, do a real post-mortem with yourself and try to come up with a way for that to never be a problem again.
I also do a lot of remote camping & off-roading without internet. I'm working on a simple little app where I can get paged on my satellite messenger (Garmin Inreach) if something is wrong, and key clients can also ping me. Only trusted contacts can SMS the Garmin Inreach, so I would use Twilio as the communication pipe.
And I've pre-ordered Starlink. My off road truck has an elaborate electrical system (Lithium battery, solar, etc) and I plan to find a way to run the Starlink dish off 12v.
Currently working on my home backup plan, which includes a hot-standby Mac mini, Time Machine and cloud backups, a home battery backup (Ecoflow Delta), Starlink, a portable generator, etc.
Having clearly written incident reports tends to surface patterns that help you solve a more general family or type of problem, as opposed to playing whack-a-mole with individual issues. The culprits tend to become the "usual suspects": some module, part of the codebase, or piece of functionality that causes more crashes or outages than the rest, which will nudge you to write better tests for it, find a better implementation, or add better exception handling or validation.
Doing this will either prevent future incidents, automatically recover from incidents, or speed up manual recovery while you figure out ways to automate it. All of this amortizes the pain, as you extract every bit of knowledge from these incidents and "institutionalize" it. You're a "solo founder", but there's no need for future team members or "future you" to go through all that: they'll have a knowledge base at their disposal when they join.
Apologize and explain things to your users.
Consistent, systematic effort.
In some of my own projects I've only gotten bitten by little things a few times over the last 5 years. Like an SSL cert not getting renewed successfully; this could have been prevented at the time if I had registered the Let's Encrypt account with an email address, so I'd have been notified that it wasn't getting renewed in time.
If you put in your due diligence with writing tests, run them automatically as part of your CI pipeline, stick with stable software / tools and keep things as simple as possible until they no longer work then you'll set yourself up for a strong base to work off of. Then as you encounter issues, you automate fixing them as soon as possible.
Having monitoring in place to prevent disasters helps too. Like getting notified of unusual CPU / memory / disk usage and getting warned before it becomes a real problem. Sure this requires being messaged but it also means you probably have at least a day's notice before you need to take action. That means you don't need to be glued to a pager and respond in 5 minutes because your site is down. Big difference.
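The expiring-certificate story above is exactly the kind of thing monitoring can flag days in advance. A rough sketch that reports how many days a site's TLS certificate has left (the 14-day threshold in the comment is arbitrary):

```python
import datetime
import socket
import ssl

def parse_cert_date(not_after):
    """The certificate's notAfter field uses a fixed OpenSSL-style
    format, e.g. 'Jun  1 12:00:00 2031 GMT'."""
    dt = datetime.datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return dt.replace(tzinfo=datetime.timezone.utc)

def cert_days_left(hostname, port=443):
    """Connect, read the peer certificate, return days until it expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    expires = parse_cert_date(cert["notAfter"])
    return (expires - datetime.datetime.now(datetime.timezone.utc)).days

# Run daily from cron and alert yourself well before renewal fails, e.g.:
# if cert_days_left("example.com") < 14: send_alert(...)
```

That turns a hard 2am outage into a calm to-do item with two weeks of slack.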
This sort of applies to customer support too. I currently do personal customer support for 30,000+ folks who take one of my programming-related courses. From the outside you would think I'd be slammed with requests to the point where every day involves answering questions for 2 hours, but really it's nothing like that. With a strong base (a working course that stays updated) it's a handful of emails most days, and quite often nothing.
My take: Lean on managed services as much as you can. This will help ensure that you have other experts to reach out to if you have issues with a component of your system. We were on Heroku + AWS RDS (the latter because at the time the MySQL offerings in Heroku were problematic, and we were using MySQL). Even if you don't pay for Heroku support, they were pretty good.
Make sure you set your SLA to something reasonable. For the startup, I am not sure we even committed to an SLA, but we were handling people's money and a crucial part of their operations. So I tried to be responsive within a few hours, especially if the app was down.
As far as actually taking vacations, I did that a few times. If I was close to internet service, I took my laptop and made sure I had cell coverage. Remember freaking out a bit because a camping area I was at had spotty coverage.
One time I was going to take a trip to the Canadian wilds. I had a friend who was running a larger company and who had oncall set up for his product. I documented the heck out of the system and asked them to be oncall for the 10ish days I would be out of touch. I don't recall if we paid them (might have been a 'friend deal' where we would pay them if there were any incidents), but I do recall nothing happened.
To answer your question:
> do they hire freelancers to “watch the shop” when they want a break or are they chained to PagerDuty 24/7?
If I had to pick the category I was in, it was "chained to PagerDuty 24/7".
We invested a lot into availability, especially DBs. Most of our issues were internal-DNS related, which we at one point worked around by generating hosts files that updated every hour.
Oncall was shared between 3 of us with all 3 paged at once and us getting on WhatsApp to 1. diagnose and 2. fix. Most of the time only 1 of us was close to a laptop but all 3 of us would assist as best we could.
At any point in time, one of us might not be tethered, but for the most part we were able to get to a laptop within 30 minutes at most. I now work at FAAMG and find oncall especially stressful, but it's only once every ~6 weeks.
How many days/hours every 6 weeks? Is it 24 hours when you are at it, or only during the day time in your time zone?
That means healthchecks with auto restarts at every level of abstraction, stateless services...
And yeah on top of all that we have monitoring setup with a few alerts.
With that said, we only had one severe outage since we setup our infra as described above.
1- Use a scheduler that autorestarts: systemd, pm2, nomad, ... (we use nomad)
2- Set up healthchecks to detect when your app is not behaving correctly even if it's still running (for example, some exception crippled the program).
An HTTP healthcheck is an endpoint (for example /health) that returns a 200 status code when everything is fine. If the endpoint is down or returns something else, the service is not considered healthy and is restarted (you can limit the number of restarts, for errors that cannot be solved with a restart).
* Systemd supports socket based healthchecks
* pm2 doesn't have built-in support for healthchecks at all but there are some npm modules for that
* Nomad does HTTP healthchecks (through consul, not alone)
* GCP and AWS (and others) support healthchecks at the level of your server and can restart the entire server when the healthcheck goes wrong
3- Monitoring & alerts: I'll cut to the chase and tell you that honestly the best monitoring solution that worked for us is the built in one from our cloud provider (you still need to setup the agent in your server). 3rd party managed solutions are expensive, and I don't want to self deploy something so critical and add to the complexity of our infra.
The main idea in monitoring is not just to be alerted when your servers are down, but to detect issues before they become critical. Common issues like disk or CPU at 70%...
4- High availability: here be dragons. Put a load balancer in front of 3 (or more generally 2n+1) servers, all running the same copy of your app. Make sure your app is stateless! There are risks of race conditions, stale data, and so on, so try to explore the other options first.
I hope these pointers will help you sleep better at night! You can read more about these topics and look for the tools that match your stack :)
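As a concrete illustration of point 2, a /health endpoint can be tiny. A sketch using only Python's standard library; a real check would also probe the database, queue, and so on rather than always returning healthy:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def app_is_healthy():
    """Placeholder: a real check might ping the database or a queue."""
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health" and app_is_healthy():
            self.send_response(200)
            self.send_header("Content-Length", "2")
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            # any non-200 tells the scheduler to restart the service
            self.send_response(503)
            self.end_headers()

# To actually serve it:
# HTTPServer(("", 8080), HealthHandler).serve_forever()
```

The scheduler (nomad/consul, a cloud load balancer, etc.) polls this endpoint and restarts the service when it stops returning 200.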
Most of this was accomplished by simply picking a stack that doesn't ever fall down on me, and the rest was by watching for things that might flake out and either fixing the flake or replacing them with less flaky things.
As such, I get maybe one incident a year where I'll walk briskly across to the office to fix something that could do with addressing today rather than next week. But it's never anything as dramatic as the entire site being down. Most often it's the result of Google having shipped a new version of Chrome that breaks some 10 year old feature of their own browser.
The whole goal of the SaaS business stuff was to maximize my vacation time, so anything that got in the way of, say, taking an entire month off to crew a sailboat across the Darien Gap was a non-starter.
So I don't have a pager. Mostly because I've gone out of my way to ensure there will never be anything to page me about.
I've written at length about this all here:
Anyway, yes, you're the one wearing all the hats, so it's on you. There is no real break, because even if you had someone watching the shop, many times the thing that breaks is the thing only you have deep insight into.
I've been on cross-country drives, woken in the middle of the night, at family parties, hanging out with friends when I've gotten paged, and I immediately stop what I'm doing to fix the issue, even if it takes a while and ruins said occasion. My platform is ad related, so every second of downtime is pissing off a lot of people because it's directly linked to their revenue. Thankfully that never happened while I was on a plane. I did have to buy a ridiculously expensive WiFi package on a cruise ship twice to monitor things.
I've mitigated most potential issues with better infrastructure, tests, and early warnings, but the occasional unexpected item slips in, maybe once or twice yearly. Luckily I have a staffer with deep knowledge of the platform to handle that now. It took a while to get to that point.
I try to have as few services involved as possible, which basically means the web server.
As others mentioned, I chose Perl and PHP because those are the languages I am most familiar with, and because they are "boring", meaning they've been stable enough that I could've written my scripts 20 years ago, and they would still work today. PHP to a lesser extent, but still true.
Also, as I mentioned, the "lowest common denominator" style of writing allows me to write Perl which I can copy and paste into PHP almost without changes, and vice-versa. To facilitate this, I ported several functions from one language to the other, e.g. str_replace for Perl and index() for PHP.
I can't think of a way to make it more foolproof than writing two redundant systems, nor of better tests than full coverage of each process by a redundant one, except by introducing triple redundancy, which is not out of the question. Is that what you had in mind, or something else?
>Write better tests, clean-up the interfaces, you can even try to use formal methods to ensure correctness etc.
I definitely put effort into all these things, but my experience shows that no matter how much work goes into that, things will find a way to fail in a way I could not predict.
There are many advantages to doing it this way. One is that I thoroughly review each side of the codebase while writing the other. Another is that I get a complete coverage test suite for free out of the deal. Another is that if something goes sideways, it's easier to figure out where that is. It's also easier to discover any faults, because the outputs don't match up.
As a bonus, I have to design it simply, and the process of rewriting it several times helps that end.
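A minimal sketch of that cross-checking idea (function names are hypothetical stand-ins, not the author's actual Perl/PHP code): run the same inputs through both independently written implementations and flag any divergence.

```python
# Two independently written implementations of the same operation
# (hypothetical stand-ins for the Perl side and the PHP side).
def replace_v1(text, old, new):
    return text.replace(old, new)

def replace_v2(text, old, new):
    # Deliberately different construction: split on `old`, rejoin with `new`.
    return new.join(text.split(old))

def cross_check(inputs):
    # Return the inputs on which the two implementations disagree;
    # any mismatch points straight at a bug in one of the two sides.
    mismatches = []
    for text, old, new in inputs:
        if replace_v1(text, old, new) != replace_v2(text, old, new):
            mismatches.append((text, old, new))
    return mismatches
```

Each implementation acts as a full-coverage test of the other, which is the "test suite for free" mentioned above.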
When the master server fails, I can run a script to cut over with very minimal manual interaction. I have not had to use the script in 10 years, and have only experienced one outage, when the datacenter had a blip.
But... it's really hard to not worry about it occasionally, even after 10 years.
IMO, the best way to answer your question is to ask yourself why you're so worried about downtime. Then ask yourself what you can do to fix it.
Also: A mistake I see is for businesses to be so feature focused that they never go back and fix their technical debt. Make sure that your SaaS product is resilient enough for your lifestyle before you add new features or grow.
This also means I am on call 24/7. I have Rundeck (the real star of this automated show) running on another host to tackle most common tasks for me, like restarting services or backing up DBs. But sometimes I do have to phone a friend and ask for help, or talk them through tasks to get things running (this has happened once in 12 years)
My buddy and I are finishing up touches on a service monitoring SaaS which is just an html5 front end to the above system. If there is interest I will make a note to have a release party here on HN
 - https://www.zabbix.com
 - https://www.rundeck.com
The key for me is to keep things simple and have the system fail in a predictable way by not mixing server roles.
I run an email forwarding service, so I clearly say: this is the incoming mail server, this is the outgoing server. For each of them we run a pair with automatic failover.
Keep the stack simple, so if something fails, only a portion of it fails. For example, it's OK if the landing page is down; people can still send and receive email.
If the mail service is down, a check on our homepage says that our mail service is down and we're working on it.
In other words, try to design the system with a clear boundary between components, so when something fails you know exactly what failed and can do things like restart it, or scale up the server (CPU/mem), to fix it.
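One hedged sketch of that boundary idea (component names and probe bodies are invented for illustration): probe each component independently, so the status page can say exactly which part is down instead of a single all-or-nothing check.

```python
def component_status(probes):
    # probes: mapping of component name -> zero-argument callable that
    # returns True when the component is healthy. A probe that raises
    # counts as a failure rather than taking the whole check down.
    status = {}
    for name, probe in probes.items():
        try:
            status[name] = bool(probe())
        except Exception:
            status[name] = False
    return status

# Example wiring (real probes would open sockets, hit health URLs, etc.):
status = component_status({
    "inbound_mail": lambda: True,
    "outbound_mail": lambda: True,
    "landing_page": lambda: True,
})
```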
My advice will be a little controversial in this thread, but cloud providers are really perfect for building durable products. Any situation where you can trade dollars for durability is well worth the ROI as a solo tech owner. Load balancers, auto scaling, Aurora clusters, S3: these are all services that help me sleep like a baby even though my SaaS needs perfect uptime. Expect instance outages, so keep your servers stateless and run at least 2 instances, as small as possible, and go horizontal.
Another good idea is to learn how your product can die. Load test, try to break your app, and then fix those weak spots.
These are my opinions and experiences, and have continued to serve me well.
It's easier said than done, but if you can prevent issues in the first place, things will be much more enjoyable.
Some things that worked well for me:
* GKE on GCP is pretty smooth. When there's a spike in traffic everything autoscales up, so I don't have to do anything. Nice observability, things just work. Just make sure to set container cpu/mem limits.
* Along the same lines, I use MongoDB Atlas, which autoscales both up and down very well, saving money and making my infra more resilient
* GCP has a lot of monitoring/alerting/dashboards that I take advantage of: health checks from around the world, easy integration of logs/metrics. I find structured logging (JSON) makes setting up alerts pretty easy
* Good consolidated logging for when there is an issue you know exactly what went wrong
* GCP also supports application tracing, which can make timing issues easy to debug (for example, if you are missing an index on some DB), although it requires a bit of work to set up
* Automatic deployments (thanks to k8s), there's no checklist for doing a deploy, I just run a single make command. I can't screw that up
* A staging environment that mirrors production. Plenty of times I've crashed staging; it's worth every penny. It also makes life much less stressful
* Lots of tests. The tests aren't important for when I'm writing the code, but for months later when I make changes and want to know I didn't mess something else up. I find a good test suite can really help you sleep at night, especially if it covers the critical paths
* An easy way for users to contact you if there is an issue. No one is perfect, but being able to respond quickly is usually forgiven.
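Along the lines of the structured-logging point above, a minimal sketch (field names are illustrative; cloud log agents generally parse one JSON object per stdout line, so alert rules can filter on fields like severity instead of grepping text):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    # Emit each log record as a single JSON line.
    def format(self, record):
        payload = {
            "severity": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "timestamp": self.formatTime(record),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)
```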
Hope that helps!
If I remember correctly you can specify a bunch of regions for the health checks to originate from. It was super simple to setup (point and click) and it's nice that it's decoupled from the rest of my infrastructure. When there's a failure I get a notification.
The reality is yeah you probably are not gonna have many restful nights or peaceful dinners... 10 years later for me and I still avoid activities that don’t allow me to quickly access a computer. I still always have multiple mifis in my backpack to ensure if one cell network is not good maybe the other one is good enough for me to fix a server... you have to kind of enjoy it
For critical notifications I use Pushover with an emergency setting – a repeating full volume alert on phone, regardless of volume settings or Do Not Disturb mode.
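For reference, Pushover's emergency level is priority 2, which requires `retry` (minimum 30 seconds) and `expire` (maximum 10800 seconds) parameters. A sketch of building that request payload (the token/user values are placeholders, and the actual HTTP send is left out):

```python
def emergency_payload(token, user, message, retry=60, expire=3600):
    # Pushover priority 2 ("emergency") re-alerts every `retry` seconds
    # until acknowledged, or until `expire` seconds have passed.
    if retry < 30:
        raise ValueError("Pushover requires retry >= 30 seconds")
    if expire > 10800:
        raise ValueError("Pushover requires expire <= 10800 seconds")
    return {
        "token": token,        # application API token (placeholder)
        "user": user,          # user key (placeholder)
        "message": message,
        "priority": 2,
        "retry": retry,
        "expire": expire,
    }

# The actual send would POST this dict to
# https://api.pushover.net/1/messages.json (omitted here).
```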
I do have a "go bag" with a dedicated, prepared laptop that I take with me on longer trips (not that there have been many in the past year).
I use a simple tech stack. Golang monolith, Postgres database.
I pay a little extra for good managed services that auto-recover. I run my database on Cloud SQL and my web servers on Cloud Run on GCP.
As a last line of defence, I have a remote development environment I can access from my phone. I can make fixes and deploy from there. I also have a Garmin InReach satellite communicator that I can be contacted on if I’m out of phone range.
I run a small SaaS where I faced a decision early on between a live chat-based approach to the UX and an asynchronous approach. A big reason I chose the latter was to avoid the need for 24/7 "real-time" support in favor of a better lifestyle, even though the live version likely would have garnered more customers.
How “chained to your desk” would you say you are? Are there ever any times where you truly clock off?
I picked the most "boring" setup I could for a backend and host everything on Heroku, which costs me more but provides a little more peace of mind relative to other setups I've seen.
This is exactly what we do at MNX Solutions. We are a team of Linux engineers, and provide 24x7 monitoring and response to outages for your cloud based infrastructure.
Even if we're not a good fit, I'd be happy to chat with anyone about ways to improve their site reliability. It's something we're good at, and love to talk about!
The SaaS provides just a small feature; if it goes down, users probably won't notice much impact. Most issues are solvable by just restarting the web server, and I have an SSH app installed on my phone for fixes on the go lol
In addition to that, use managed services as much as possible. On Google Cloud I use a lot of Cloud Run and Cloud SQL, and infrastructure work is kept to a minimum.
Also look for projects that don't need to be 100% available all the time. I use Zenfolio; they send out emails that the site will be down for a few hours on a random night, and I don't mind it. It is just a portfolio site.
The service suddenly going down shouldn't be a serious risk for the vast majority of online businesses (unless you are doing something exceptional, operating at an exceptional scale, or are an amateur).
I don't hire any freelancers to watch my app. But I use a monitoring service and Django notification emails when there's an outage.
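For context on the Django side: with DEBUG off, Django emails reports of unhandled errors to the addresses listed in ADMINS. A hedged settings sketch (all addresses and the SMTP host are placeholders):

```python
# settings.py fragment (values are placeholders)
DEBUG = False
ADMINS = [("Ops", "alerts@example.com")]  # receives 500-error reports
SERVER_EMAIL = "django@example.com"       # From: address on error mail
EMAIL_HOST = "smtp.example.com"
EMAIL_PORT = 587
EMAIL_USE_TLS = True
```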
For solo founders, it is better to pick a product that can tolerate a small amount of down time.