
Ask HN: How do you make sure your servers are up as a single founder? - thr0waway998877
I'm running a small business on AWS as a solo founder. It's just me. Yesterday I had a service interruption while I was in the London subway. Luckily, I was able to sign in to the AWS console and resolve the issue.

But it does (again) raise the question I'd rather not think about. What if something happens to me and there's another outage that I can't fix?

So - how do you make sure that your servers are up as a one person founder? Can I pay someone to monitor my AWS deploy and make sure it's healthy?
======
jmstfv
I am a solo founder of a website monitoring SaaS [0]. In theory, my uptime
should be higher than my customers'. Here are a few things I've found
helpful in the course of running my business:

* Redundancy. If you process background jobs, have multiple workers listening on the same queues (preferably in different regions or availability zones). Run multiple web servers and put them behind a load balancer. If you use AWS RDS or Heroku Postgres, use Multi-AZ deployment. Be mindful of your costs though, because they can skyrocket fast.

* Minimize moving parts (e.g. databases, servers, etc.). If possible, separate your marketing site from your web app. Prefer static sites over dynamic ones.

* Don't deploy within at least 2 hours of going to sleep (or leaving your desk). 2 hours is usually enough to spot a botched deploy.

* Try to use managed services as much as possible. As a solo founder, you probably have better things to focus on. As I mentioned before, keep an eye on your costs.

* Write unit/integration/system tests. Aim for good coverage, but don't beat yourself up for not having 100%.

* Monitor your infrastructure and set up alerts. Whenever my logs match a predefined regex pattern (e.g. "fatal" OR "exception" OR "error"), I get notified immediately. To be sure that alerts reach you, route them to multiple channels (e.g. email, SMS, Slack, etc.). Obviously, I'm biased here.
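A minimal sketch of the kind of log-line matching described above (the
pattern and the alert routing are illustrative, not the commenter's actual
setup):

```python
import re

# Case-insensitive pattern for log lines worth alerting on
ALERT_PATTERN = re.compile(r"fatal|exception|error", re.IGNORECASE)

def find_alerts(log_lines):
    """Return the log lines that match the alert pattern."""
    return [line for line in log_lines if ALERT_PATTERN.search(line)]

lines = [
    "2024-01-01 12:00:00 INFO request served in 12ms",
    "2024-01-01 12:00:01 ERROR payment gateway timeout",
    "2024-01-01 12:00:02 FATAL out of memory",
]
for hit in find_alerts(lines):
    print(hit)  # in practice, route these to email/SMS/Slack
```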

I'm not gonna lie, these things make me anxious, even to this day (it used to
be worse). I take my laptop everywhere I go and make sure that my phone is
always charged.

[0] [https://tryhexadecimal.com](https://tryhexadecimal.com)

~~~
daniel_iversen
> Monitor your infrastructure and set up alerts [..] "fatal" OR "exception" OR
> "error"

I almost have the regex "fatal|invalid|unknown|error|except|critical|cannot"
in muscle memory many years after last having to type it - I must have typed
it thousands of times tailing and grepping logs :-)

~~~
unoti
Instead of having a regex that searches for error|critical|except, it's
better to have log levels in your logging infrastructure, so that you can
query, for example, log level=2 and get all the bad things.

It takes a bit to work this into the code and infrastructure everywhere but
it’s worth it.
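With Python's standard logging module, for instance, that query becomes a
filter on severity instead of a regex (a generic sketch, not tied to any
particular stack):

```python
import logging

records = []

class ListHandler(logging.Handler):
    # Collects records so severity can be queried later
    def emit(self, record):
        records.append(record)

logger = logging.getLogger("app")
logger.setLevel(logging.DEBUG)
logger.addHandler(ListHandler())

logger.info("user signed in")
logger.error("payment gateway timeout")
logger.critical("database unreachable")

# "Give me all the bad things": filter by level, no regex needed
bad = [r for r in records if r.levelno >= logging.ERROR]
print([r.getMessage() for r in bad])
```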

------
apankrat
> _I was able to sign in to the AWS console and resolve the issue_

Kids these days.

I had a RAM stick fry in one of the physical machines sitting in a colo 1 hour
drive away. Not die, but just start flipping bits here and there, triggering
most bizarre alerts you can imagine. On the night of December 24th. Now,
_that_ was fun.

--- To add ---

If you are a single founder - expect downtime and expect it to be stressful.
Inhale, exhale, fix it, explain, apologize and then make changes to try and
prevent it from happening again. Little by little, weak points will get
fortified or eliminated and the risk of "incidents" will go down. There's no
silver bullet, but with experience things become easier and less scary.

~~~
fanf2
That reminds me of the time we had a DIMM actually melt on the 22nd December
[http://fanf2.user.srcf.net/hermes/doc/misc/orange-fire/](http://fanf2.user.srcf.net/hermes/doc/misc/orange-fire/)

~~~
type0
Thank you for sharing, that's a real nightmare before Christmas story!

------
__d
You might also want to consider some additional risks that are often
overlooked:

Billing issues. What happens if the credit card you use to pay for everything
gets hijacked, and you're trapped with a blocked card trying to clean it up
but your bank is taking their sweet time and won't give you another card until
it's sorted? ALWAYS have a backup credit card.

DNS Registrar. There's a hard SPOF in the DNS, where your registrar
essentially holds your domain name hostage. If your DNS gets hijacked, but
your registrar is taking a few days to sort out who actually owns it, you're
down hard. There's no mitigation for this one, except paying for a registrar
with proper security processes. If you do 3FA anywhere, make it here.

AppStore. If your app gets banned, or a critical update blocked, what do you
do? Building in a fallback URL (using a different domain name, with a
different registrar) can help work around any backend issues. There's not
much you can do for the frontend functionality, except using a webapp.

It can be worthwhile looking at risks and possible mitigations beyond just
server and database issues, especially when it's just you.

~~~
randlet
> What happens if the credit card you use to pay for everything gets hijacked,
> and you're trapped with a blocked card trying to clean it up but your bank
> is taking their sweet time and won't give you another card until it's
> sorted?

This happened to me this week. Luckily the bank got me the new card within 2
business days, but it was still a bit stressful and I burned a day getting
my payment info updated everywhere.

~~~
type0
> What happens if the credit card you use to pay for everything gets hijacked,
> and you're trapped with a blocked card trying to clean it up but your bank
> is taking their sweet time and won't give you another card until it's
> sorted?

It doesn't even have to get hijacked. Our bank reissued cards because of a
bug in the old chip; the card numbers stayed the same and the expiration
dates changed, but Google Cloud blocked the paying accounts. Security is the
top priority, sure, but I don't want some proprietary algorithm deciding
whether or not our server will be up. The fix was to move away from Google,
because bouncing off their support took too long.

------
thrownaway954
I know this is going to be downvoted to nonexistence since everyone nowadays
wants to go serverless, AWS and whatnot. Personally I've always used either
hosting.com or inmotionhosting.com. Yes, they are more expensive than AWS
and whatnot, but the thing is, they both have support staff 24/7/365. I can
call whenever I need and have someone remote into my server and fix whatever
is wrong. Furthermore, I can even have the server alert emails routed not
only to me, but to them as well! So they know about the problem and are on
it, and I don't have to do a thing.

~~~
mikece
I don't see why this would be considered something to downvote. You're
essentially hiring an outsourced IT staff by going with a service like this,
which can probably be leveraged a very long way and will remain cheaper than
bringing on full-time staff or engaging contracted IT consulting.

------
hellcow
The only way to achieve high availability is to have redundancy of all things.

Random things will go wrong that you can't predict. Boxes will die suddenly
and without reason, even after months of working fine without changes, and
always at the worst possible moment. Your system needs to be built to
withstand that.

I'll take the opposite approach of everyone here and recommend against
serverless, kubernetes, and Heroku/PAAS.

You are a solo founder. You should understand your infra from the ground up
(note: not understand an API, or a config syntax, but how the underlying
systems actually work in great detail). It needs to be simple conceptually for
you to do that. If anything goes wrong, you need to be able to identify the
cause and fix it quickly.

I've gone through this first-hand and know all the trade-offs. If you'd like,
I'm happy to discuss architecture decisions on a call. Email is in my profile.

~~~
scarface74
No, you don't need to understand your infra from the ground up - especially
as a solo founder. You should offload as much of the grunt work as you can
afford to, so you can concentrate on your business domain.

If something “goes wrong” or you don’t understand how to implement something
with managed services, support is just a ticket and a live chat/phone call
away. I can speak from personal experience that AWS business support is great
even when there isn’t a problem and you just want an “easy button” for someone
to tell you what’s wrong with your configuration.

~~~
CoolGuySteve
It depends on the service I think. I've had ECS errors that took AWS support
days to figure out (turns out some permission quota thing was overriding some
ECS thing).

All in all, I think I might have to find some other batch processing
system.

~~~
scarface74
If you are using regular ECS with EC2 - as opposed to Fargate - it just
provisions regular EC2 instances with an agent already installed. You can
SSH/RDP into the instance and troubleshoot it.

But yeah, I did have a doozy of an issue with ECS, and it was completely my
fault. I created a cross-account policy for ECR but left out the account
that actually contained the registry. Then my containers were in a private
subnet without any access to the internet (by design; they were behind a
load balancer), so they couldn't get to the ECR endpoint. I just had to
either assign a public IP address or use a private link.

Support helped me with both.

------
jasonkester
I build my stuff on top of a stack that hardly ever goes down.

All my SaaS products run on a Windows server, with SQL Server as a database
and ASP.NET on IIS running the public sites. You can probably come up with a
lot of uncharitable things to say about those technologies, but "flimsy" and
"fragile" likely aren't in the list.

As a result, when things go seriously wrong, the application pool will recycle
itself and the site will spring back to life a few seconds later. Actual
"downtime", of the sort that I learn about before it has fixed itself, might
happen maybe once every couple of years. At least, I seem to remember it
having happened once or twice in the last 15 years of running this way.

There's a Staging box in the cage, spun up and ready to go at a moment's
notice, in case that ever changes. But thus far it has led a very lonely life.

~~~
jedieaston
I’m curious, how much did you have to pay for your licenses? If you are a
single founder with a new startup, it seems like you’d need a bunch of money
to be able to use Microsoft in production, unless you can use something like
BizSpark (if that’s still a thing?).

~~~
tbyehl
BizSpark is not still a thing. The replacement program, Microsoft for
Startups, requires being associated with a partnered "startup-enabling
organization."

------
aspectmin
FWIW - if you just want to make sure your services are up - consider:

1) pagerduty.com or uptimerobot.com for remote monitoring to make sure your
site(s) are up (and get alerts when they're not).

2) Datadog or New Relic if you want deeper monitoring (application
performance, database performance, diagnostics/debugging).

3) Rollbar.com (site doesn't seem to respond) for site performance/errors.

4) Roll your own with Prometheus
([https://prometheus.io/](https://prometheus.io/)) or Nagios
([https://www.nagios.org/](https://www.nagios.org/))/Icinga. Or...
strangely - I still use MRTG for a few perf monitoring things:
[https://oss.oetiker.ch/mrtg/](https://oss.oetiker.ch/mrtg/)

5) If you want to monitor the status of deploys/builds - I love integrating
CI/CD systems with Slack - very helpful.

Hope that helps - I've spent a lot of my career monitoring things, and have
this mantra that I need to know about services being down before customers
call to tell me.

(a lot of these have free tiers)

~~~
Ologn
> Nagios, MRTG

I am used to rolling my own and use Nagios to make sure my servers and web
sites are up and URLs and scripts functioning.

I used to use MRTG and RRDTool in days past when I was responsible for
monitoring many servers, switches and routers.

~~~
aspectmin
Right there with you :)

------
cddotdotslash
Back when I was working on everything myself, I deployed everything through
AWS Lambda and API Gateway, with all my static assets on S3 and CloudFront. I
had exactly zero infrastructure issues over the course of two years and never
dealt with security patches, SSH'ing, etc. If I were doing billions of
requests, it may not have been the most cost effective, but it helped me scale
without worrying about typical devops issues. Updates, testing, rollbacks, etc
were also extremely easy.

~~~
ryanolsonx
How do you keep things fast?

Lambda functions can have cold starts that introduce latency. How do you
manage that?

(From my small amount of experience - please prove me wrong.)

~~~
cldellow
If you're doing a SPA, you can paper over the cold starts a bit, since the app
itself will render, and it'll be background requests to load data that are
impacted.

That still sucks, so then you can (hopefully) cache some things so that _some_
data begins to stream in. Or you can make it so your very high priority stuff
has minimal dependencies - you can get a Lambda cold start in < 1s if your app
only uses the standard library.

But still, in my experience, cold starts are a thing. If you have a high-
traffic app or use Lambda warming, you decrease the # of people who experience
a cold start, but at the end of the day, your p99 is going to be worse than a
vanilla VM solution, because _some_ people will get cold starts. For some
apps, that's OK - think line of business app where the first few pages can be
served from static materials or cached materials, and you trigger the requests
in the background.

------
idlewords
I've run a one-person business on my own servers for about eight years.

Honestly, the answer is learning how to manage anxiety and stress,
particularly doing potentially destructive things under pressure. I think the
psychological aspects of this are much more difficult than the technical ones.

If it helps, people are generally very understanding if you explain that you
are a solo founder, and take reasonable steps to fix issues in a timely way.
Most customers assume every company is a faceless organization; their attitude
is much more forgiving when they learn they're dealing with a fellow person.

You cannot be on call 24/7 forever. You will burn out. If you can't hire
someone you trust to take over part of this burden, then you have to accept
the risk of sometimes not being able to log in for N hours if there is an
outage (because you're camping with your spouse, etc.)

For very high-stress situations (database crash, recovery from backup) working
from a checklist that you have tested is very valuable.

Good luck to you, and I hope you found useful answers in this thread!

~~~
jgimenez
Totally agreed on the point that you just won't be available all the time.
You can set up all the alerting you want and high availability on different
zones and whatnot, but when sh*t hits the fan, you will still be the only
person in charge of putting everything back together.

I concur that hiring someone, if you can afford it, even part time, would be a
great idea.

------
freetonik
Heroku. I just pay for the privilege of (almost) not thinking about such
issues.

~~~
randlet
Heroku looks really expensive if you need e.g. a private network per client
(compared to AWS VPC). Between $12k-$36k if I'm reading this correctly:
[https://elements.heroku.com/addons/heroku-private-spaces](https://elements.heroku.com/addons/heroku-private-spaces)

~~~
malyk
Yes, if you need that.

Many many businesses can survive on a $25 a month 1x dyno and the free
database tier (and free sendgrid, free newrelic, free whatever other addon)
and can be pretty sure their site will basically never go down.

It's a fantastic security blanket, but... as your business grows you'll
start to pay. The question then becomes "if I do all this myself on AWS (or
Azure or GCP), how much time am I going to sacrifice against building my
business, dealing with random infrastructure crap?" Or "at what point does
it make sense to hire someone to focus on all this infrastructure, and how
much would that cost vs. just paying for Heroku?"

------
nlg
I agree in general with the responses encouraging better usage of managed
platforms. I've run a SaaS app for a couple of years using a combination of
AWS Elastic Beanstalk (Flask and Django) and AWS Lambda. Server-resource-
related downtime has been minimal and recovery is quick/automated. Even
hosting on Lambda you can run into issues without layers of redundancy (Lambda
may be fine but a Route 53 outage would prevent you from hitting that endpoint
if you're using that for DNS).

Before thinking about handing over management of the deployment, I would
encourage you to think about what the root cause of the outage is and whether
something in the app will create that situation again. I invested in setting
up DataDog monitoring for all hosts with alerts on key resource metrics that
were causing issues (CPU was biggest issue for me).

The other thing that's worked well for me is just keeping things simple. As a
solo founder, time spent with customers is more valuable than time spent on
infrastructure (assuming all is running well). It's a little dated, but I
still think this is a good path to follow as you're building your customer
base. A simple stack will let you spend more time learning how your product
can help your customers best.

[http://highscalability.com/blog/2016/1/11/a-beginners-guide-to-scaling-to-11-million-users-on-amazons.html](http://highscalability.com/blog/2016/1/11/a-beginners-guide-to-scaling-to-11-million-users-on-amazons.html)

~~~
kevinyun
We are considering Datadog, and nothing else seems to compare to them, but
they seem extremely expensive. As a small startup/solo founder, did your
implementation justify costs?

~~~
heliodor
Plenty of people are happy using the industry-standard open-source tools:

- metrics system (Prometheus, InfluxDB, Graphite, etc.)

- dashboards (Grafana)

- alerts (both the metrics system and Grafana can handle this)

If you're not a big-data company, you can self-host instances of these
products reliably.

Disclosure: My startup,
[https://HostedMetrics.com](https://HostedMetrics.com), provides turnkey
hosted versions of these software packages.

I'd be happy to explain the options that are out there and provide any advice
you need. Get in touch: contact info is in my profile.

~~~
byteshock
Your product doesn’t even come close to what datadog offers....

~~~
heliodor
Datadog has ~9000 customers. I imagine the respective open-source tools have
many more customers than that, which says something about the perceived value
of the two approaches. Also, Datadog wraps many tools into one whereas the
open-source solutions have singular purposes. It's an apples to oranges
comparison.

------
dkersten
Most of the suggestions here are about ways of restarting services when they
go down, which is a good start, but that doesn't actually solve the issue I
hit last night...

My system integrates with an external system, and what happened is that this
external system started sending me unexpected data, which my system wasn't
able to handle because I didn't expect it and so never thought to test for
it -- I was trying to insert IDs into a UUID database field, but this new
data had non-UUID IDs. Because the original IDs were always generated by me,
I could guarantee the data was correct, but this new data was not generated
by me. Of course, sufficient defensive programming would have avoided this,
as the database error shouldn't have prevented other stuff from working, but
my point is that mistakes get made (we're human, after all) and things do
get overlooked.
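One cheap guard for exactly this failure is validating IDs before they reach
the database and quarantining the rest (a sketch; the record shape and field
name are made up for illustration):

```python
import uuid

def is_valid_uuid(value):
    """Return True if value parses as a UUID, without raising."""
    try:
        uuid.UUID(str(value))
        return True
    except (ValueError, AttributeError, TypeError):
        return False

def partition_incoming(records):
    # Accept rows with UUID ids; set aside the rest instead of letting
    # one bad insert take down the whole pipeline
    good = [r for r in records if is_valid_uuid(r.get("id"))]
    bad = [r for r in records if not is_valid_uuid(r.get("id"))]
    return good, bad

good, bad = partition_incoming([
    {"id": "123e4567-e89b-12d3-a456-426614174000"},
    {"id": "external-id-42"},  # the kind of surprise an external system sends
])
```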

The problem is, restarting my service doesn't prevent this external data from
getting received again, so it would simply break again as soon as more is sent
and the system would be in this endless reboot loop until a human fixes the
root cause.

That's a problem that I worry about, no matter how hard I try to make my
system auto-healing and resilient (I don't know of any way to fix it other
than putting great care into programming defensively), but again, we're human,
so something will always slip through eventually...

Some people are suggesting outsourcing an on-call person. That seems to me
like the only way around this particular case. (The other suggestions can
still be used to reduce the number of times this person gets paged, though.)

~~~
tootie
Always treat third-party systems like they're full of nitroglycerin. Double
check all response codes, expect the unexpected, degrade gracefully when it
hits the fan. You're always better off serving up a nice 500 error page than
spinning forever or returning a false positive to users. And make sure you
have a clear SLA with them and can escalate/mitigate/compensate when they
don't fulfill it.

~~~
wstuartcl
This. Write guards like the external integration is an active malicious
adversary -- because when they change APIs, go down, or have their own
issues, they may as well be attacking your integration.

------
outime
If you’re using AWS I want to assume you didn’t go for a cheaper solution
(e.g. VPS from a reputable company) because you like the managed solutions
that they provide, among other reasons.

I assume also you want a simple way to increase reliability while keeping
costs within reasonable limits.

Well, AWS can give you all that if you don't want to go super fancy. Check
out Elastic Beanstalk to get something simple and reliable. Monitor using
CloudWatch. Make sure to leverage redundancy options (multi-AZ, multi-region
if worth it, etc.). These are some general tips, but with the information
you provide, that's all I can say.

You can also pay a consultant to get a review of your setup and get some
recommendations. It won’t be cheap but it depends how much you value your time
and your product.

------
lacker
Some of the comments are suggesting totally different technologies. Don’t do
that. You can stay on AWS and achieve the reliability you need. This isn’t the
sort of problem that should lead you to rebuild your whole stack.

The question you should be asking is: how can I make my service
automatically recover from this problem? It depends on why exactly it
crashed. If a simple restart fixes the problem, there are different ways to
automate that, like Kubernetes or just writing scripts.
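If a simple restart really is the fix, even a plain systemd unit gives you
automatic recovery without Kubernetes. The paths and names below are
illustrative placeholders, not from the thread:

```ini
# /etc/systemd/system/myapp.service  (hypothetical unit)
[Unit]
Description=My app server
After=network.target

[Service]
ExecStart=/usr/local/bin/myapp
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```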

I’m happy to give more detailed advice if you would like, my email is in my
profile.

------
faeyanpiraat
Your main concern is of course limited time/resources, so you'll have to make
compromises.

The question is not whether your system will fail, the question is when.

Have proper monitoring and alerting in place.

But don't over engineer it, sometimes everything seems technically fine, but
your support inbox will start getting user complaints.

Resolve the issue, figure out the root cause, make sure this or similar
stuff won't happen again, apologise to the affected users if necessary, and
move on.

You'll learn waaay more failure modes of your application running in the wild,
than just thinking about "what could go wrong".

It's a long game of becoming a better developer/devops guy, and not repeating
the same mistakes in the future.

~~~
perennate
+1 -- most of the comments are about minimizing downtime, which you should
of course do to the extent practical, but at some point, whether you're a
one-employee company or not, if you have no internet access and your servers
are down, you have to keep calm and accept that there will be some downtime
and it's not the end of the world. You may even be surprised how few
customers notice anything went wrong (depending on the kind of service
you're running).

------
charlesju
I would say that as a one-person founder, know that you cannot ever get 100%
uptime, and live with it. In the most simplistic sense, you need to sleep 8
hours a day; you cannot live your life constantly stressed about uptime.
Just generally have internet access, and accept that sometimes your service
will go down.

On the setup, try your best to solve issues and use tried-and-true hardware,
but things go down sometimes; even big sites like Google and Facebook go
down. There is no silver bullet; you can only improve on your past mistakes.

Last, try to find some remote help, on a contract basis, it's not that
expensive and it can help alleviate a lot of your stress.

------
jarl-ragnar
I use uptime robot [http://uptimerobot.com](http://uptimerobot.com) for
monitoring, they have a free plan or paid if you want faster checks.

If it's truly critical to have no downtime, then you probably need to build
that resilience into your architecture.

~~~
hbcondo714
+1 for UptimeRobot. Learned about it right here on HN:

[https://news.ycombinator.com/item?id=6576250](https://news.ycombinator.com/item?id=6576250)

------
CoolGuySteve
I currently run a batch of trading servers solo. The trading system is a C++
process with an asynchronous logger that prints log levels and times. One of
the issues with trading is that you're dependent on your datafeed and exchange
connections working which is out of your control.

I use a python monitoring script that tails logs watching for ALERT level log
lines and constant order activity combined with a cron watchjob to ensure the
process is alive during trading hours. The exception handler in the monitoring
script sends alerts if the script itself dies.
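A stripped-down version of that pattern (the log format and short polling
window here are placeholders, not the author's actual script) might look
like:

```python
import re
import time

ALERT_RE = re.compile(r"\bALERT\b")

def scan_new_lines(f, alerts):
    """Read any newly appended lines; collect those at ALERT level."""
    for line in f:
        if ALERT_RE.search(line):
            alerts.append(line.strip())

def watch(log_file, polls=3, poll_seconds=0.1):
    """Follow a log file briefly, returning ALERT-level lines seen.

    A real monitor would loop forever and text via Twilio instead of
    returning; the except clause mirrors the author's trick of alerting
    when the monitor itself dies.
    """
    alerts = []
    try:
        with open(log_file) as f:
            for _ in range(polls):
                scan_new_lines(f, alerts)
                time.sleep(poll_seconds)
    except Exception as exc:
        alerts.append(f"monitor crashed: {exc}")
    return alerts
```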

If there are any issues I use twilio to text me the exception text/log line. I
also use AWS SES to email myself but getting gmail to permanently not block
SES is a pain in the ass. By design Twilio + AWS SES are the only external
dependencies I have for the monitoring system (too bad SES sucks).

On my phone I have Termius SSH setup so I can log in and check/fix things. I
have a bunch of short aliases in my .profile on the trading server to do the
most common stuff so that I can type them easily from my phone.

I also do all my work through a compressed SSH tmux including editing and
compiling code. So if things get hairy I can pair my phone with my laptop,
attach to the tmux right where I left off, and fix things over even a 3G
connection.

This compressed SSH trick is a huge quality of life improvement compared to
previous finance jobs I've worked where they use Windows + Citrix/RDP just to
launch a Putty session into a Linux machine. It's almost like finance IT has
never actually had to fix anything while away from work.

~~~
tebbers
If it helps at all, I found that buying a dedicated IP for SES helped our
deliverability enormously.

------
ElFitz
I basically don't manage any servers. Everything runs on AWS Lambda & co
(DynamoDB, S3,...)

It doesn't prevent an app-level outage (corrupted data in the database, bad
architecture,...) but at least I don't have to worry about servers going down
anymore.

As for the rest, unit & extensive integration tests along with continuous
integration and linting. Oh, and a typed language. Moving from Javascript to
Typescript was a blessing. But I still miss Swift.

------
asadlionpk
We are a very small team at
[https://codeinterview.io](https://codeinterview.io). We recently achieved a
respectable level of reliability with a tiny team. Some things you should do:

- At least have a pool of 2 instances (ideally per service) running under an
auto-scaler or a managed K8s (GKE is best) with an LB in front. You may also
want to explore Elastic Beanstalk and Google Cloud Run. If you can use them,
use them!

- Uptime alerts: Pingdom (or New Relic alerts) with PagerDuty added.

- Health checks! The trick is to recover the failed container/pod/service
before you get that PagerDuty call. Ideally, if you have 2 of each service
running, #2 will handle the requests until #1 is recreated.

- Sentry + New Relic APM + infra: you should monitor all error stack traces,
request throughput, and average response time. For infra, you mainly need to
watch memory and CPU usage. On each downtime, you should also have greater
visibility into what caused it. Set alerts on higher-than-normal memory
usage so you can prevent the crash.

- Logs: your server logs should be stored somewhere (Stackdriver on GCloud
or CloudWatch on AWS).

These might sound overwhelming for a single person, but they are one-time
efforts, after which they are mostly automatic.
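The health checks mentioned above are usually just a cheap endpoint the load
balancer or orchestrator polls. A minimal stdlib sketch (the path and the
check itself are illustrative):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import json
import threading

def app_is_healthy():
    # Real checks would ping the database, queues, etc.
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            healthy = app_is_healthy()
            body = json.dumps({"status": "ok" if healthy else "failing"})
            self.send_response(200 if healthy else 503)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body.encode())
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the example quiet

def serve(port=0):
    """Start the health server on a background thread; return it."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```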

~~~
ConnorLeet
One thing that has helped me a lot with monitoring is custom application-level
metrics.

If you have a good idea of the usage patterns of your service, create metrics
backed by the patterns. This can help you find things that CPU/Memory will
hide.

------
ransom1538
1. Stay on AWS only.

2. Pay for a Business Support plan.
[https://aws.amazon.com/premiumsupport/pricing/](https://aws.amazon.com/premiumsupport/pricing/)

3. Call business support about something ("how do I restart my server?") so
you know how to file a ticket, and get a feel for how quick the response is
and how it works.

Do not overthink this (e.g. Terraform templates).

~~~
asdkhadsj
re: terraform, that's only part of the picture right? What do you recommend
for provisioning? I assume Terraform + Packer? I looked into these a while
back and they seemed good.

My only concern was that my target was _very_ low-cost setups, and I wanted
something like Packer but let me provision multiple images onto a single
machine. Eg, if I just used Packer, as far as I could tell I would have to
have 1 machine per image. It sounds odd, but I didn't want to pay $5 per
service, especially when the number of users and the load was very small.
Being able to deploy in a PaaS fashion to something like Docker on the
machine seemed best.

But then I was looking at Terraform + Packer + DockerSomething, and things
went back to feeling less simple.

~~~
scarface74
Don't do Terraform if you are already on AWS; use CloudFormation. You are
paying for the business support plan, and they aren't going to support your
Terraform deployment.

I haven’t used Packer, but could you use CodeBuild/CodeDeploy/CodePipeline?
Again if you’re on AWS you might as well take advantage of their support.

You can deploy Docker images using CloudFormation, either to Fargate (and
not have to worry about servers) or to EC2 instances (I haven't tried the
latter).

------
xwdv
For your case I recommend the Poor Man's High Availability method: an
auto-scaling group of size 1.

~~~
benevol
Interesting, I wouldn't have expected this answer. Can you elaborate?

~~~
scarface74
Autoscaling with a min/max of 1 will cause the instance to terminate after a
number of failed health checks and start a new instance.
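In CloudFormation terms, that "poor man's HA" is roughly the fragment below.
The resource names and referenced launch template/target group/subnet are
placeholders, not from the thread:

```yaml
# Illustrative sketch: a size-1 auto scaling group that replaces
# its instance when ELB health checks fail.
AppGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: "1"
    MaxSize: "1"
    DesiredCapacity: "1"
    HealthCheckType: ELB
    HealthCheckGracePeriod: 300
    LaunchTemplate:
      LaunchTemplateId: !Ref AppLaunchTemplate   # placeholder
      Version: !GetAtt AppLaunchTemplate.LatestVersionNumber
    TargetGroupARNs:
      - !Ref AppTargetGroup                      # placeholder
    VPCZoneIdentifier:
      - !Ref AppSubnet                           # placeholder
```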

------
kerberos84
Yes, there are plenty of start-ups doing this. You can also use AWS's
built-in functionality to achieve it: write a Lambda function which checks
your server status, or even better, one which calls your endpoint's health
check if you want more detailed monitoring.
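A scheduled Lambda doing that endpoint check could be as small as the sketch
below. The URL and the alerting step are placeholders; in practice you'd
wire the handler to a schedule and something like an SNS topic:

```python
import urllib.request
import urllib.error

HEALTH_URL = "https://example.com/healthz"  # placeholder endpoint

def check_endpoint(url, timeout=5):
    """Hit the health URL and return a small status dict instead of raising."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return {"healthy": resp.status == 200, "status": resp.status}
    except (urllib.error.URLError, OSError) as exc:
        return {"healthy": False, "error": str(exc)}

def lambda_handler(event, context):
    result = check_endpoint(HEALTH_URL)
    if not result["healthy"]:
        # Placeholder: publish to SNS / page yourself here
        print("unhealthy:", result)
    return result
```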

------
conductr
I solo-ran a web hosting service way back in 2000-2003, well before cloud,
when it was mostly LAMP and cPanel. It was super mission-critical stuff for
20,000 sites and I was totally winging it. As it grew I got totally paranoid
about uptime. Long story short: at some point there's no substitute for
getting a human to back you up. I paid a company $250 a month to help
monitor and jump into my servers to troubleshoot if I was unavailable. They
were rarely needed, and when they were, it was usually just an Apache
restart or similar. Best money I ever spent.

------
eps
Another option is just to not tackle systems that require 24/7 uptime IF you
are just one person. Instead, make an installable product or do a service
that's not interactive or real-time.

I've been in the game for a while and every time I run across an idea for a
service, there's always a question of whether I'd be OK with sleeping with a
pager, remoting to the servers at 4 am on Saturday and generally be slaved to
the business. The answer, upon some reflection, is inevitably No. This is the
domain of _teams_.

------
dougb5
I wrote up a "technical continuity plan" that describes how to keep my web
sites and APIs in maintenance mode in the event of my untimely demise. It has
a list of bare-minimum things to do in the following week, month, and year,
and describes the various third-party relationships and how to go about hiring
a replacement administrator. I shared the doc with a few close friends. I hope
it's not needed in the future, but just writing the doc was a useful exercise
for me in the present.

------
foxhop
You have identified a single point of failure (yourself), you either need to
accept the risk or hire a person on retainer.

I'm in the same boat with my solo founder projects (links in profile).

------
joshmn
I have a few things in production — two SaaS, one customer-facing subscription
site. I run these all myself with no staff or contractors.

The short answer: I'm married to my phone/laptop.

My test coverage is good. I use managed services when possible so I don't need
to play sysadmin. I don't deploy before I leave for something (dinner,
shower), and I have some pretty good redundancy across all my services. If one
node goes down, I'm safe. If four go down (incredibly unlikely), well, fuck,
at least my database was backed up and verified an hour ago.

I invested a large amount of time into admin-y stuff. My admin-y stuff is
solid and I can tweak/config/CRUD anything on the fly. I credit being able to
relax thanks to my admin-y stuffs. Obviously, if shit really hits the fan with
hardware or an OS bug, I need to get to my laptop. But over the last six
years, I haven't had to do that yet, and hopefully I won't have to.

I've explored adding staff — mainly for day-to-day operations — but I like the
idea of interfacing with my customers and I credit growing things to where I
have because I'm in the trenches with them. Things haven't always gone
smoothly, and my customers always let me know, but any issues are normally
swiftly resolved.

The scale of one of my products is non-trivial and has a ton of moving parts —
some of which I'm in no control of and could change at any time and break
_everything_. It sounds terrifying, and it is, but I've made a habit to check
things before peak hours. If something's amiss, a quick fix is usually all it
takes.

------
bArray
I'm not a solo founder, but I run a number of servers that are heavily used -
all with different software with varying amounts of reliability. I also allow
other people to deploy code without checking with me first, just to keep
things fun.

I have a few pieces of advice:

1\. Make sure your service can safely fail and be restarted. What I mean is,
if somebody is POST'ing data or making database changes, make sure you handle
this safely and attempt some recovery. Something not being fully processed is
okay as long as you are able to handle it.

2\. Self-monitoring. I run all my systems inside a simple bash loop that just
restarts them and pops me an email (i.e. "X restarted at Y", and then "X is
failing to start" if it continues).

3\. External monitoring via a machine at home that rolls the server back to a
previous binary (also on the server). It also pulls various logs from the
server, as well as the binaries, so they can be analyzed. Okay, it has some
reduced functionality, but it's stable and will keep things going until the
problem is fixed.

4\. Make sure your service fails gracefully - i.e. returns a
`{"status":"bad"}` string or something, or defaults to an "Under maintenance,
please come back soon" page. Your service going down is one thing, but
becoming completely unresponsive is quite another.

One thing I can't prepare for (which happens more than you think) is the
server itself crashing, which as you say, means I'm randomly logging into a
VPS console and rebooting. I use a bunch of different VPS providers and every
one of them has a slightly different console.
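bArray's point 2 (a restart loop that emails on each restart and gives up with a final alert) can be sketched in Python; `alert` is a stand-in for whatever actually sends the email:

```python
import subprocess
import time

def supervise(cmd, alert, max_consecutive_failures=3, backoff=1.0):
    """Keep cmd running: restart it when it crashes, alerting each time.

    After max_consecutive_failures crashes in a row, give up and send the
    "is failing to start" alert instead of looping forever.
    """
    failures = 0
    while failures < max_consecutive_failures:
        if subprocess.run(cmd).returncode == 0:
            return  # clean exit: nothing to restart
        failures += 1
        alert("%s restarted (consecutive failure %d)" % (cmd[0], failures))
        time.sleep(backoff)
    alert("%s is failing to start" % cmd[0])
```

In production you would reset the failure counter after the process has stayed up for a while, so one crash a day never trips the give-up threshold.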

------
RantyDave
Just to add to the voices that are saying "by not having any". If you can get
away with edge, lambda, or heroku ... even if it's in the short term, do.

------
deif
Other people are suggesting alternative platforms when you could simply have
an AWS autoscaling group. If a server goes down it simply relaunches a saved
image.

~~~
maxk42
It's a good solution, but it's a bitch-and-a-half to set up.

------
pinacarlos90
It boils down to sending some sort of notification so first responders know
about the issue ASAP.

You can do it at the OS level. On Windows, for example, you can use Event
Viewer and assign a task to a specific type of log captured by the OS; this
task can then invoke a small app that sends an email whenever an error log
occurs, or something like that.

Application-specific issues: you can manually capture exceptions raised
within the app and send notifications. There are many clever ways to do this
without hindering performance or polluting your code base with exception
handling - for example, spawning "fire and forget" threads that send the
notifications. Let me know if you need more ideas here.

Integration tests: given that you've built a strong suite of integration
tests covering all the functionality in your app, you can have them run every
15 minutes or so and send notifications if any tests fail.

You can also use monitoring tools. I know Azure offers ways to help with
this. Reach out if you want more ideas or more specific solutions.
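The "fire and forget" notification thread mentioned above can be sketched like this; `send` is a stand-in for whatever actually delivers the email/SMS:

```python
import threading

def notify_async(send, message):
    """Fire-and-forget notification: never block the request path, and never
    let a broken alerting channel crash the app itself."""
    def _run():
        try:
            send(message)
        except Exception:
            pass  # a failed alert must not become a second outage
    t = threading.Thread(target=_run, daemon=True)
    t.start()
    return t
```

The daemon flag keeps a stuck alert delivery from preventing process shutdown; the trade-off is that alerts in flight at exit may be dropped.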

------
peterwwillis
Yes, you can pay managed hosting providers different amounts of money for
different levels of support.

Managed support will usually only monitor and fix basic infrastructure and
respond to support requests from you. They often won't monitor or fix your
applications/services; for that you can set up your own application monitoring
and tests. NewRelic is a good all-in-one choice, but there are plenty more out
there. To call you during an incident, you'd also adopt PagerDuty.

In order to avoid service outages in general, you want to hook up some kind of
monitor to something that restarts your services and infrastructure. This
will only fix crashes; it won't fix issues like disks filling up, network
outages, application bugs, too many hits to your service, etc.

You should be able to find small businesses who specialize in selling support
contracts for all levels of support. By signing a contract and on-boarding a
24/7 support technician, you can get them to do basically whatever you need to
be fixed when it goes down. I don't have suggestions for these, maybe someone
else does (it used to be common for SMBs in the 2000's).

------
samvher
It really depends on the failure mode and the cost of failure. As mentioned by
others you can encounter issues in external services which you have no control
over and the best you can do in that case is fail gracefully until you're able
to deal with the issue. If it's easy to detect failure, and a restart fixes
the problem, it can be quite straightforward to set up some monitoring scripts
that take care of this for you, and even if it's more complicated than a
restart some monitoring can at least notify you by email or SMS. Keeping your
tech simple and/or having high test coverage or formal verification can reduce
your error rate. Similarly you can introduce fault tolerance into the system
with something like Erlang's OTP or monitored containers in an orchestrator
(K8s, Docker Swarm, some cloud solution). If failures are expensive you might
want to take on staff to deal with them, if the cost is low you might just
want to accept occasional downtime (though you'll want to think about how you
report that to your users).

~~~
PopeDotNinja
+1 for Erlang. Learning how to write OTP apps in Erlang taught me so much
about building reliable systems.

------
Igor_kh
Operations guy here. I'm probably biased.

If I were you, I would use a free monitoring service like UptimeRobot. There
are other options available; typically these services provide some basic
functionality for free, which would be enough for a small enterprise.

On AWS it is quite easy to create your own external probes for a reasonable
price. However, it would require some basic programming skills.

------
peterburkimsher
Are you using a Node.js backend? This is a little script that I set up with
cron on a second instance, which logs in and restarts the Node server if it's
down:

    #!/bin/bash
    
    thisHtml=$(curl -s "[your site's web address]")
    
    if [[ $thisHtml != *"<title>[your site's title]</title>"* ]]; then
        # Server is down
        ssh -i "[your pem file]" -t ec2-user@[ip address] \
            'sudo /bin/bash -c "killall -9 node"'
        ssh -i "[your pem file]" -t ec2-user@[ip address] \
            'sudo /bin/bash -c "export PATH=/root/.nvm/versions/node/v8.11.2/bin:/usr/lib/node:/usr/local/bin/node:/usr/bin/node:$PATH && forever start /var/www/html/[...]/index.js"'
        rebootDate=$(TZ=":[your time zone]" date '+%Y-%m-%d %H:%M:%S')
        echo "$rebootDate" >> "/home/ec2-user/serverMonitoring/devRestarts.txt"
    fi

------
drubenstein
> Can I pay someone to monitor my AWS deploy and make sure it's healthy?

Yes. There are consulting shops that will do this, as will many of the
monitoring tools listed in the thread (though these tools will not fix the
problem for you). Broadly speaking, there is a cost associated with this, as
well as the cost associated with your downtime. If the cost of your downtime
(reputational risk, SLA credits, etc) outweighs the cost of hiring someone to
cut your MTTR to 5 minutes (assuming you can playbook out all of the relevant
scenarios) + provides some value in stress reduction, then you should do this.
If you've been doing this a while, you can math it out. In what experience
I've had though, an outside person is unlikely to be able to fix an "unknown
unknown", they just won't know your environment as well as you will.

All that said, one hour of service interruption a year is still better than
most.

------
Beltiras
Redundancy. Failure should always be an option. Specific answers will depend
on your stack. Nobody will be able to monitor and react like you will because
all IT solutions are their own species of butterfly with their own
intricacies. If uptime is really that important you might be at the stage
where you need to take on an employee.

------
gatherhunterer
I highly recommend Kubernetes as infrastructure. It has a reputation for being
too complex to use on your own or with simple projects but that reputation is
undeserved. Self-healing container orchestration has been eye-opening for me.
Many people groan at the prospect of learning something new but it is
remarkably easy to use, the only barrier to entry being the high cost of cloud
solutions and the unwillingness of many engineers to work with hardware (which
would nullify the cost of cloud services). You can easily develop and test on
local hardware and deploy to the cloud with the exact same configuration.

The idea that your server does not perform regular health checks or spin
itself back up when it fails just seems weird to me now. I like being spoiled.

~~~
simmanian
I am leaning toward learning k8s seriously and am actually curious on your
take: is the overhead of learning and maintaining a k8s cluster actually
better than using AWS features like autoscaling coupled with health checks?

~~~
ShakataGaNai
Don't maintain the cluster. Have someone else run it for you. Unless you want
K8s cluster to be the only thing you do.

The advantage of K8s is that it abstracts so much away from you that you
should (in theory) be able to take the same YAML config from AWS EKS to GCP
GKE to Azure AKS, and it runs the same everywhere. Things like load balancing
and HTTP ingress rules that would normally require manual configuration on
each platform become part of the K8s config.

~~~
xur17
> Don't maintain the cluster. Have someone else run it for you. Unless you
> want K8s cluster to be the only thing you do.

This can be as simple as just using Google's GKE service.

------
colinjfw
I've been thinking about this quite a bit lately. I've run DevOps for a few
organizations and learned quite a bit through that.

Ultimately you can engineer your systems, even if they are quite complex, to
be manageable by a single person. It's not one thing though. It's years of
experience and gut feel. It's also totally distinct from technology.

Some things that come to mind:

\- use queues for background tasks that may need to be retried. If things go
down and you have liberal retry policies, things should recover.

\- use boring databases. Just stay away from Mongo and use something like RDS,
which is proven and reliable.

\- be careful in your code about what an error is. Log only things at the
error level you need to look at.

\- test driven development. Saves a ton of time.
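The first bullet (queues plus liberal retry policies) can be sketched with an in-memory queue standing in for something like SQS; the task shapes and names here are illustrative only:

```python
import queue

def process_with_retries(q, handle, max_attempts=3, dead_letter=None):
    """Drain q, retrying each task up to max_attempts before dead-lettering it."""
    while True:
        try:
            task, attempts = q.get_nowait()
        except queue.Empty:
            return
        try:
            handle(task)
        except Exception:
            if attempts + 1 < max_attempts:
                q.put((task, attempts + 1))   # liberal retry policy
            elif dead_letter is not None:
                dead_letter.append(task)      # park it for a human to look at
```

With real queue services you get the same behavior from redelivery plus a dead-letter queue, so transient failures heal themselves once the underlying outage clears.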

------
tachion
You start by identifying the reasons why your application/service may fail,
and then design and implement infrastructure that can withstand certain
failures at a cost you can bear. If a failure of a piece of infrastructure
costs you £1 per day, you might be OK with paying £5/day for infrastructure
that handles such a failure. But would you be OK with paying £50/day for the
same thing?

It's all the matter of defining requirements, then solutions and tradeoffs of
those solutions and then implementing it with best practices in mind
(automation, testing, monitoring, backups, etc.).

Hit me up if you want to discuss it over a pint! :)

------
waxzce
Very simple: you need to buy a higher level of service. Instead of paying for
servers, you pay for uptime. That's what PaaS, managed services, and
serverless do: manage servers for you, at scale. To have something online you
need:

\- servers

\- VM/OS management

\- a scalability system

\- monitoring (hardware, OS, application, and functional)

\- action on monitoring and escalation management

\- weekly updates

\- observability

That's what we provide at Clever Cloud BTW [https://www.clever-
cloud.com/](https://www.clever-cloud.com/)

------
cpursley
Use Heroku as long as it's cost effective. Every time I've moved from Heroku
to another platform for "cost savings", I've ended up spending much more time
than I'd planned just maintaining it.

------
dwild
Does your service actually need incredible uptime? What's the worst that
would happen if the service was down for, say, 24 hours?

I feel like we over-engineer that part. Sure, there are plenty of services
where you don't want any downtime and it makes sense to over-engineer (like
any monitoring service), but for many SaaS, the worst that will happen is a
few emails.

Maybe write a simple SLA, something with an 8-hour response for these kinds
of outages. If some client requires more, then sell them a better SLA at a
higher cost. That should let you invest in better response times for sure.

------
nh2
I rent dedicated servers at Hetzner.

No cloud machines, no hosted cloud services for production beyond DNS.

* 3 machines in separate data centers (equivalent of AWS AZs) for >= 30 EUR/month each. ECC RAM.

* These machines are /very/ reliable. Uptimes of > 300 days are common; reboots happen only for the relevant kernel updates.

* Triple-redundancy Postgres synchronous replication with automatic failover (using Stolon), CephFS as distributed file system. I claim this is the only state you need for most businesses at the beginning. Anything that's not state is easy to make redundant.

* Failure of 1 node can be tolerated, failure of 2 nodes means I go read-only.

* Almost all server code is in Haskell. 0 crash bugs in 4 years.

* DNS based failover using multi-A-response Route53 health checks. If a machine stops serving HTTP, it gets removed from DNS within 10 seconds.

* External monitoring: StatusCake that triggers Slack (vibrates my phone), and after short delay PagerDuty if something is down from the perspective of site visitors.

* Internal monitoring: Consul health checks with consul-alerts that monitor every internal service (each of the 3 Postgres, CephFS, web servers) and ping on Slack if one is down. This is to notice when the system falls into 2-redundancy which is not visible to site visitors.

* I regularly test that both forms of monitoring work and send alerts.

* Everything is configured declaratively with NixOS and deployed with NixOps. Config changes and rollbacks deploy within 5 seconds.

* In case of total disaster at Hetzner, the entire production infrastructure can be deployed to AWS within 15 minutes, using the same NixOps setup but with a different backend. All state is backed up regularly into 2 other countries.

* DB, CephFS and web servers are plain processes supervised by systemd. No Docker or other containers, which allows for easier debugging using strace etc. All systemd services are overridden to restart without systemd's default restart limit, to come back reliably after network failures or out-of-memory situations.

* No proprietary software or hosted services that I cannot debug.

* I set up PagerDuty on Android to override any phone silencing. If it triggers at night, I have to wake up. This motivated me to bring the system to zero alerts

* I investigate any downtime or surprising behaviour until a reason is found. "Tire kicking" restarts that magically fix things are not accepted. In the beginning that takes time but after a while you end up with very reliable systems without surprises.

Result: Zero observable downtimes in the last years that were not caused by me
deploying wrong configurations.

The total cost of this can be around 100 EUR/month, or 400 EUR/month if you
want really beefy servers that have all of fast SSDs, large HDDs, and GPUs.

There are a few ways I'd like to improve this setup in the future, but it's
enough for the current needs.

I still take my laptop everywhere to be safe, but didn't have to make use of
that for a while.

~~~
sphix0r
Very well-thought-out infra and nice metrics. What kind of application are you
running, if I may ask?

~~~
nh2
Computer vision, specifically reconstruction of 3D models from 2D photos, as a
service.

------
ezekg
I use Heroku for [https://keygen.sh](https://keygen.sh). Sometimes it pisses
me off how big the bill is (~$1.5k/mo atm), but the net time savings are still
worth it to me. I usually spend a total of 0 hours a month on managing
servers/infra, and less than an hour a day on support. I'm thinking I'll move
to AWS eventually to maximize margins, but right now this really works for me.

------
more_corn
Sure, the person you pay is AWS.

You enable them to do it for you by creating HA infrastructure. Start by
creating an auto-scaling group that enforces a certain number of working
application endpoints. You probably need an ALB too. An app endpoint that
fails its health check causes the ASG to spin up another instance and
auto-register with the ALB. (You can snapshot your configured, working app
endpoint as the base image.)

------
izendejas
I'd love to recommend pingdom, or a service like it. I'm in no way affiliated
with them, just a very happy customer and one of those products where I'm
jelly I didn't come up with the idea. It integrates very nicely with pagerduty
and slack/sms, etc.

It's just extra redundancy in case something like cloudwatch (which you should
use -- with ELBs) also goes down.

------
SkyPuncher
I used AWS Cloudwatch and some simple server side checks
(ianheggie/health_check for Rails is great) for a very long time.

It's not perfect, but it's (1) cheap, (2) easy, and (3) quick (the mythical
trifecta). It misses some issues due to high load (where the service is still
technically available) but works perfectly when things actually crash (like
queue workers deciding to turn off).

------
brentis
I've struggled with this for years. AWS is not foolproof, and with
environments for web, Android, and iOS, availability gremlins have killed much
of my spirit despite users proclaiming they've been looking for a service like
mine for years.

Docker, Elastic Beanstalk, SNS, and the hidden world of AWS instance
performance are all a PITA. Oh yeah, certs...

Help is welcome as well.

------
owaislone
I've used runscope.com and I love it. I don't know about their pricing so
can't tell if it's suitable for someone in your situation but I'm sure there
are tons of similar services. You could also build your own with Lambda and
hope AWS is reliable enough to keep Lambda running. (Who monitors the
monitoring tools? :) )

------
atmosx
Services like heroku try to solve this problem.

------
nathan_f77
I'm working on FormAPI [1] as a solo founder. I started on Heroku, but Heroku
was a bit unreliable and I had some random outages that I couldn't predict or
control. (This was even while using dynos in their professional tier.)

I also had a lot of free AWS credits, so I migrated to AWS. I didn't want to
write all my terraform templates from scratch, so I spent a lot of time
looking for something that already existed, and I found Convox [2].

Convox provides an open source PaaS [3] that you can install into your own AWS
account, and it works amazingly well. They use a lot of AWS services instead of
re-inventing the wheel (CloudFormation, ECS, Fargate, EC2, S3.) It also helps
you provision any resources (S3 buckets, RDS, ElastiCache), and everything is
set up with production-ready defaults.

I've been able to achieve 100% uptime for over 12 months, and I barely need to
think about my infrastructure. There have even been a few failed deployments
where I needed to manually go into CloudFormation and roll something back
(which were totally my fault), but ECS keeps the old version running without
any downtime. Convox is also rolling out support for EKS, so I'm planning to
switch from ECS to Kubernetes in the near future (and Convox should make that
completely painless, since they handle everything behind the scenes.)

[1] [https://formapi.io](https://formapi.io)

[2] [https://convox.com](https://convox.com)

[3] [https://github.com/convox/rack](https://github.com/convox/rack)

~~~
asdkhadsj
Convox looks nice! My only issue is that it seems tied to AWS?

It would be nice to have a stateless tool abstract Terraform a bit, to let you
use more providers as a basic PaaS.

------
bullen
I made my own (every minute) monitor:
[http://monitor.rupy.se](http://monitor.rupy.se)

It also warns me if the CPU load goes over 80%.

For the first two years of going live I had this hardwired to my Pebble via
real-time mail, but now I know my platform is robust, so I can choose to worry
about other things.

------
gumby
Sounds like there is a business opportunity here: a kind of
DevOps-as-a-service, though to make it scale you'd probably need the customer
to architect their system in a certain way.

(though this is essentially a single-line comment, it's earnest, not intended
to be sarcastic)

------
peterk5
I build my projects on Google App Engine and it has been stable and reliable
without much administration. The platform is not without its challenges,
especially with the Gen 2 rollout, but no issues related to
administration/interruption. PaaS could be a good place to explore...

~~~
markyc
Seconding this; if you don't need lots of resources it makes sense. I pair it
with Amazon CloudFront and so far, almost one year in, zero problems. By far
the biggest win for me is the peace of mind.

------
vandershraaf
My applications, which are built with Laravel, are deployed through Laravel
Forge. There is definitely an extra charge for it, but having Forge simplify
deployment really saves me time, especially when an issue comes up.

For monitoring, I am using Stackdriver which has easy-to-use health check.

------
davecap1
Have you thought about hiring someone remote in the same or different timezone
to be on-call for outages? I'm sure there are many people around that would be
able to help with this. You could hire someone on a retainer who can be on-
call via PagerDuty or something.

------
brokenkebab
There are a lot of whatever-as-a-service offers which can relieve you of
updating, patching, and restarting. But if your troubles originate as bugs in
software, poorly formatted data, or something along those lines, then human
supervision is probably the only solution.

------
telesilla
I've been using Linode's managed service, about $100 a month per server. If
something goes wrong they have access and can triage, or let me know if they
can't fix it. It's been very helpful, especially since they have (excellent)
phone support.

------
elamje
I think another consideration that might not be an obvious risk is your use of
two factor auth.

It’s important for critical services, yet if you lose your 2FA device, like a
phone, you will be locked out for a while. Like many things, it will happen at
a bad time.

------
parliament32
AWS is not "your servers", it's "your services". How you monitor, manage, and
set up redundancy/recovery is going to be very very different between running
real servers or just paying AWS for semi-managed services.

------
r0rshrk
For a single-server setup, write bash scripts that check whether the server is
down and bring it back up if it is.

Also, send errors through chat platforms like Telegram so you're notified of
any errors and can monitor the servers.

------
aarreedd
Use UptimeRobot to monitor your site. Have a scheduled job that pings
healthchecks.io every 5 minutes. Configure both to email you if anything goes
down.

These are both reactionary but at least you'll know if things break.
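The healthchecks.io side of this is usually just a cron entry hitting your check's unique ping URL (the UUID below is a placeholder for your own check's ID):

```
*/5 * * * * curl -fsS --retry 3 https://hc-ping.com/<your-check-uuid> > /dev/null
```

healthchecks.io emails you when pings stop arriving, which also catches "cron itself died" failures that outside-in checks like UptimeRobot can miss.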

------
dubcanada
I guess it really depends on what AWS services you use. There are companies
that can manage your AWS 24/7 for a fee.

Other options include using another service that offers 24/7 uptime. Obviously
you pay more for that.

------
throw03172019
I have been using Convox for the last 3 years and it has been super reliable.
Convox is essentially a bring-your-own-infra Heroku built on top of AWS ECS
(Docker). I believe one of the founders is from Heroku.

------
masternda
Yes, you can, by outsourcing it to someone or scaling up your team of one as
others have mentioned. I am currently in the process of rolling out a service
that does this. Hit me up if you are still keen.

------
techscruggs
1) Set up a status page for what you want to monitor (/health/queuelength).

2) Point StatusCake at that URL.

3) Connect StatusCake to PagerDuty.

This approach is easy to implement and scale.
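A minimal sketch of such a status endpoint in Python, with a `get_queue_length` callable and a `max_queue` threshold standing in for whatever "healthy" means for your app (monitors like StatusCake can then alert on any non-2xx response):

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def make_handler(get_queue_length, max_queue=100):
    """Build a handler that serves /health with the current queue depth."""
    class Health(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health":
                self.send_error(404)
                return
            depth = get_queue_length()
            ok = depth <= max_queue
            body = json.dumps({"queue_length": depth, "ok": ok}).encode()
            self.send_response(200 if ok else 503)  # 503 trips the monitor
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep request logging quiet

    return Health

def serve_health(get_queue_length, port=0):
    """Start the health server in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), make_handler(get_queue_length))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

In a real app you would mount the same check inside your existing web framework rather than running a second server.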

------
riffic
Your bus factor is 1. Use managed services.

[https://en.wikipedia.org/wiki/Bus_factor](https://en.wikipedia.org/wiki/Bus_factor)

------
lazyeye
Recommend doing the AWS certification training. AWS has redundancy built in
with health monitoring and auto scaling groups etc. The training covers all
this.

------
lepah
By not managing servers: use a PaaS such as Heroku. It significantly reduces
devops, allowing you more time to focus on what matters, i.e. product-market
fit.

------
m00dy
You can try to use AWS Elasticbeanstalk. It can recover from failures
automatically by spawning new nodes right behind the load balancer.

------
dhimes
I outsource to Tummy.com. They've been terrific, and I don't have to worry
about anything I don't want to.

------
nwilkens
I run a company that helps with this single founder scenario. We monitor your
infrastructure, and resolve issues 24x7, along with other proactive items.

[https://www.mnxsolutions.com/services/linux-server-
managemen...](https://www.mnxsolutions.com/services/linux-server-management/)

I’d be happy to chat with anyone, even if to provide some feedback or a quick
audit to help you avoid the next outage.

\- nick at mnxsolutions com

------
motakuk
1) Have monitoring in place. 2) Never miss alerts; use something with
multi-channel escalation like amixr.io.

------
soulchild37
Pieter Levels (founder of remoteok.io) hired a guy to monitor his server for
$2k a month.

------
exabrial
TICK stack. Literally everything handling or supporting production traffic
should be monitored

------
xchaotic
The direct question was, can I pay someone to monitor my AWS. And the answer
is yes. You want redundancy at every layer including the human one. For 24x7
coverage, long term you need a team but for now two people will do.

Funny enough I was just talking to someone who passed all his AWS
certifications and was looking for some AWS work.

------
wlycdgr
Don't sell things that require high availability as a solo founder.

------
slipwalker
I used to have a Telegram bot sending me events from supervisord:
[http://supervisord.org/events.html](http://supervisord.org/events.html)

------
FearNotDaniel
After reading another recent post... Tinder for Founders, anyone?

------
jillesvangurp
Short answer: promising 5 nines of uptime is not a thing for startups.
Downtime is going to happen and you are going to be asleep, drunk, or
otherwise not fit for doing any emergency ops. It's not the end of the world.
Happens to the best of us.

So given that, just do the right things to prevent things going down and get
to a reasonable level of comfort.

I recently shut down the infrastructure for my (failed) startup. Some parts of
that had been up and running for close to four years. We had some incidents
over the years of course but nothing that impacted our business.

Simple things you can do:

\- CI & CD + deployment automation. This is an investment, but having a
reliable CI & CD pipeline means your deployments are automated and
predictable. Easier if you do it from day 1.

\- Have good tests. Sounds obvious, but you can't do CD without good tests.
Writing good tests is a good skill to have. Many startups just wing it here,
and if you don't get the funding to rewrite your software it may kill your
startup.

\- Have redundancy. I.e. two app servers instead of 1. Use availability
zones. Have a sane DB that can survive a master outage.

\- Have backups (verified ones) and a well-tested procedure & plan for
restoring them.

\- Pick your favorite cloud provider and go for hosted solutions for
infrastructure that you need rather than saving a few pennies hosting shit
yourself on some cheap rack server. I.e. use Amazon RDS or equivalent and
don't reinvent the wheels of configuring, deploying, monitoring, operating,
and backing that up. Your time (even if you had some, which you don't) is
worth more than the cost of several years of using that, even if you only
spend a few days on this. There's more to this stuff than apt-get install
whatever and walking away.

\- Make conservative/boring choices for infrastructure. I.e. use PostgreSQL
instead of some relatively obscure NoSQL thingy. They both might work.
PostgreSQL is a lot less likely to not work, and when that happens it's
probably because of something you did. If you take risks with some parts,
make a point of not taking risks with other parts. I.e. balance the risks.

\- When stuff goes wrong, learn from it and don't let it happen again.

\- Manage expectations for your users and customers. Don't promise them
anything you can't deliver, like 5 nines. When shit goes wrong, be honest and
open about it.

\- Have a battle plan for when the worst happens. What do you do if some
hacker gets into your system or your data-center gets taken out by a comet or
some other freak accident? Who do you call? What do you do? How would you
find out? Hope for the best but definitely plan for the worst. When your
servers are down, improvising is likely to cause more problems.

------
taf2
monit, keepalived, and StatusCake; with more money, Datadog and New Relic help
as well, along with PagerDuty.

------
listenallyall
hetrixtools.com is a great option for monitoring and real-time notifications

------
armatav
Heroku

------
janee
IMO you will have to get outsourced on-call if your downtime tolerance is
very low.

Otherwise, I'd suggest religiously documenting your outage root causes and
thinking hard about what could have avoided each outcome.

Then lastly for monitoring on the cheap:

Sentry.io - alerts.

Opsgenie - on-call management.

Heroku + New Relic - heartbeat & performance.

TL;DR: keep your stack small and nimble, and try to learn from past outages.

------
mister_hn
Yes, you can. Or try to automate as much as possible:

\- add health check mechanisms

\- if a health check fails => restart the service

\- if restarting the service doesn't help after X retries => redeploy the
previous state (if any is available)

Try to use Kubernetes or Docker Swarm if possible, combined with Terraform
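The escalation ladder above can be sketched in a few lines. This is a minimal
illustration, not a production supervisor: the check, restart, and redeploy
steps are injected as plain callables, and the retry count and function names
are assumptions, not anything the comment prescribes:

```python
import time
from typing import Callable


def supervise(health_check: Callable[[], bool],
              restart: Callable[[], None],
              redeploy: Callable[[], None],
              max_retries: int = 3,
              delay: float = 0.0) -> str:
    """Escalate as described: healthy -> do nothing; unhealthy ->
    restart up to max_retries times; still unhealthy -> redeploy
    the previous known-good state."""
    if health_check():
        return "healthy"
    for _ in range(max_retries):
        restart()
        if delay:
            time.sleep(delay)  # give the service time to come back
        if health_check():
            return "recovered"
    # Restarts didn't help: fall back to the previous deployment.
    redeploy()
    return "redeployed"
```

In practice the same loop is what a Kubernetes liveness probe plus rollback,
or monit's `if failed ... then restart`, gives you out of the box.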

~~~
tachion
Restarting the service and redeploying it should be absolutely the last
resort, and it isn't really sound advice, mainly because you lose the
invaluable crashed state of the system, which may be vital (sometimes logs
are not enough) to discovering _why_ the system crashed in the first place
and then delivering a fix for that particular issue. Once that's done, you
incorporate the fix into your infrastructure automation (having which goes
without saying), be it Ansible, Terraform, Kubernetes or whatever else.

Otherwise you allow the problem to persist and pile up with other issues
(also "fixed" by restarts, I assume), and implementing automated restarts in
that manner reduces not only your uptime, in an uncontrollable manner, but
also your code/infrastructure quality, increasing your tech debt beyond the
point of recovery.

Friends don't let friends fix things by restarting them ;)

~~~
scarface74
_Restarting the service and redeploying it should be absolutely the last
resort and aren't really sound advice, mainly, because you are losing the
invaluable crashed_

I’m speaking in terms of AWS; translate to your chosen infrastructure.

At the bare minimum you should have two redundant servers behind an
autoscaling group with a min/max of two and health checks.

When you need to get something up _now_ and you want to keep the crash state,
configure the crashed instance to be taken out of the autoscaling group but
not terminated, and start up a new instance. You can then troubleshoot.
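That move maps onto boto3's Auto Scaling `detach_instances` call. A hedged
sketch, assuming the boto3 API as documented: the helper only builds the
request by default and sends it when an actual client is supplied, so the
shape of the call can be shown without AWS credentials. The group and
instance names in the usage are made up:

```python
def forensic_detach(asg_name: str, instance_id: str, client=None) -> dict:
    """Take a crashed instance out of its autoscaling group without
    terminating it, so the group launches a replacement while the
    crashed box is preserved for troubleshooting.

    With ShouldDecrementDesiredCapacity=False the group's desired
    capacity stays the same, so it spins up a fresh instance to
    replace the detached one. Pass a boto3 'autoscaling' client to
    actually send the call; with client=None the request dict is
    just returned for inspection.
    """
    params = {
        "AutoScalingGroupName": asg_name,
        "InstanceIds": [instance_id],
        # Keep desired capacity so a replacement instance is launched.
        "ShouldDecrementDesiredCapacity": False,
    }
    if client is not None:
        client.detach_instances(**params)
    return params
```

Usage would look like `forensic_detach("web-asg", "i-0abc123",
boto3.client("autoscaling"))`, after which you can poke at the detached
instance at leisure.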

~~~
somesomesome
[deleted]

~~~
scarface74
If you’re a solo developer, why would you have something as complicated as
k8s? I’m referring to dead simple VMs.

------
F117-DK
One word: serverless. It's a bit more pricey, but the peace of mind is worth
it.

------
verdverm
GKE

------
_tkzm
I know the situation. I haven't got to the production stage yet, but I
totally get it. Besides using Kubernetes, Nomad or some other scheduler, you
will always have to invest your own time to resolve issues manually. You
could have triggers that invoke Ansible playbooks if you don't want to handle
any of the aforementioned, but in the end this type of business simply
requires maintenance - there is no way around that. A real human being has to
keep an eye on the entire architecture and make sure it is running as it is
supposed to.
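Such a trigger can be as small as "when a check fails, shell out to a
remediation command". A minimal sketch, with the command injectable for
testing; the `restart-app.yml` playbook name is a hypothetical placeholder,
not a real file:

```python
import subprocess


def on_check_failure(check_passed: bool,
                     remediation_cmd=("ansible-playbook",
                                      "restart-app.yml")) -> bool:
    """If the health check failed, invoke the remediation command
    (in practice an ansible-playbook run) and report whether it
    exited successfully. If the check passed, do nothing."""
    if check_passed:
        return True  # healthy, nothing to remediate
    result = subprocess.run(list(remediation_cmd))
    return result.returncode == 0
```

Even with a trigger like this wired up, the comment's point stands: someone
still has to read the playbook output and ask why the check failed.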

