
How to sleep at night having a cloud service: common architecture do's - dshacker
https://danielsada.tech/blog/cloud-services-dos/
======
staticassertion
One thing missing here is to avoid synchronous communication. Sync comms tie
client state to server state; if the server fails, the client will be
responsible for handling it.

If you use queue-based services your clients can 'fire and forget', and then
your error handling logic can be encapsulated by the queue and its consumers.

This means that if you deploy broken code, rather than a cascading failure
across all of your systems you just have a queue backup. Queue backups are
also really easy to monitor, and make a great smoke-signal alert.

The other way to go, for sync comms, would be circuit breakers.
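
Roughly, a circuit breaker wraps each outbound call, counts failures, and
fails fast while the circuit is open. A minimal sketch (the thresholds here
are made up, and real libraries such as pybreaker handle half-open probing
more carefully):

    import time

    class CircuitBreaker:
        # Sketch: open after `max_failures` consecutive failures, fail fast
        # while open, then let a single probe call through after a cooldown.
        def __init__(self, max_failures=5, reset_after=30.0):
            self.max_failures = max_failures
            self.reset_after = reset_after
            self.failures = 0
            self.opened_at = None

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_after:
                    raise RuntimeError("circuit open, failing fast")
                self.opened_at = None  # cooldown over: allow one probe
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()
                raise
            else:
                self.failures = 0
                return result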

My current project uses queue-based communications exclusively and it's great.
I have retry queues, which use over-provisioned compute, and a dead-letter
queue for manually investigating messages that caused persistent failures.
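
A rough sketch of that setup (assuming SQS-style queues; the queue URLs and
attempt limit are made up, and SQS can also do the dead-letter routing
natively via a redrive policy):

    import json
    import boto3  # assuming the queues are SQS

    sqs = boto3.client("sqs")
    WORK_Q = "https://sqs.us-east-1.amazonaws.com/123/work"    # hypothetical
    RETRY_Q = "https://sqs.us-east-1.amazonaws.com/123/retry"  # hypothetical
    DEAD_Q = "https://sqs.us-east-1.amazonaws.com/123/dead"    # hypothetical
    MAX_ATTEMPTS = 5  # made-up threshold

    def fire_and_forget(event):
        # clients just enqueue and move on; error handling lives in consumers
        sqs.send_message(QueueUrl=WORK_Q, MessageBody=json.dumps(event))

    def handle(body):
        ...  # business logic for one message

    def consume_forever():
        while True:
            resp = sqs.receive_message(QueueUrl=WORK_Q,
                                       MaxNumberOfMessages=10,
                                       WaitTimeSeconds=20)
            for msg in resp.get("Messages", []):
                body = json.loads(msg["Body"])
                try:
                    handle(body)
                except Exception:
                    body["attempts"] = body.get("attempts", 0) + 1
                    # persistent failures land in the dead-letter queue
                    target = DEAD_Q if body["attempts"] >= MAX_ATTEMPTS else RETRY_Q
                    sqs.send_message(QueueUrl=target, MessageBody=json.dumps(body))
                # handled or re-routed: remove it from the work queue
                sqs.delete_message(QueueUrl=WORK_Q,
                                   ReceiptHandle=msg["ReceiptHandle"])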

Isolation of state is probably the #1 suggestion I have for building scalable,
resilient, self-healing services.

Otherwise, I 100% agree with and would echo the content of the article.

edit: Also, idempotency. It's worth taking the time to write idempotent
services.
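
One cheap way to get there (sketch only; this assumes each message carries a
unique id and that a shared store with an atomic set-if-absent, such as
Redis, is available):

    import redis

    r = redis.Redis()  # any store with an atomic "set if absent" works

    def handle_idempotently(message_id, payload, process):
        # SET NX claims the id exactly once, so redeliveries and retries
        # become no-ops instead of double-applying side effects
        if not r.set(f"processed:{message_id}", 1, nx=True, ex=7 * 24 * 3600):
            return
        process(payload)

A real version also has to decide what happens if the handler dies after the
key has been claimed.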

~~~
cle
Queues introduce entire other dimensions of complexity. Now you've got to
monitor your queue size (and ideally autoscale when the queue backlog grows),
and have a dead letter queue for messages that failed processing and monitor
that. Tracing requests is harder b/c now your logs are scattered around the
worker fleet, so debugging becomes more difficult. You need more APIs for the client
to poll the async state, and you need some data store to track the async state
(and now you've got to worry about maintaining and monitoring that data
store). It's a can of worms that should be avoided when possible.

The only way to know whether or not to accept this kind of complexity is to
think about your use cases. Quite often it's fine (and desirable) to fail fast
and make the client retry.

~~~
vageli
> Queues introduce entire other dimensions of complexity. Now you've got to
> monitor your queue size (and ideally autoscale when the queue backlog
> grows), and have a dead letter queue for messages that failed processing and
> monitor that.

Wouldn't you need similar mechanisms without a queue? It seems to me queues
give more visibility and more hooks for autoscaling without adding additional
instrumentation to the app itself.

~~~
cle
The queue sits behind a service. If you don't do the work in the service, and
do it in a queue instead, you've got more infrastructure to manage, monitor,
and autoscale.

------
encoderer
I like guides like this that can help beginners bridge the gap between hobby
and professional quality development.

I’ll add one more tip, the one I think has saved me more sleep and prevented
more headaches than any other as I’ve developed a SaaS app over the last 5
years.

It’s simple: Handle failure cases in your code, and write software that has
some ability to heal itself.

Here are a few things I’ve developed that have saved my butt over the years:

1) An application that is deployed alongside the primary application, tails
error logs and replays failed requests. (Idempotent requests make this
possible)

2) Many built-in health checks, like checking back pressure on queues and
auto-throttling event emitters when queues get backed up

3) Local event buffering to deal with latency spikes in things like SQS.
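
For (3), the core idea is just that the request path appends to an in-memory
buffer and a background flusher absorbs the SQS latency. Something like this
sketch (the queue URL and flush interval are made up):

    import json
    import threading
    import time
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123/events"  # hypothetical

    _buffer = []
    _lock = threading.Lock()

    def emit(event):
        # the request path never waits on SQS; it only appends locally
        with _lock:
            _buffer.append(event)

    def flush_forever(interval=1.0):
        while True:
            time.sleep(interval)
            with _lock:
                batch = _buffer[:10]   # SQS batch limit is 10 messages
                del _buffer[:10]
            if batch:
                # a real flusher would retry anything reported in "Failed"
                sqs.send_message_batch(
                    QueueUrl=QUEUE_URL,
                    Entries=[{"Id": str(i), "MessageBody": json.dumps(e)}
                             for i, e in enumerate(batch)],
                )

    threading.Thread(target=flush_forever, daemon=True).start()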

I hope to eventually write more about these systems on our blog, but I never
seem to find the time.

~~~
theshrike79
4) Sometimes it's better to fail fast and try again than spend time writing
extensive error handling and retry logic.

~~~
ambicapter
But unless you inform the audience of those times, they will continue being
ignorant of when to use your advice, as if you'd never given this advice at
all.

~~~
iudqnolq
As a beginner, until I heard that advice a little while ago, it hadn't
occurred to me. I disagree that advice that doesn't clarify everything has no
value. It transforms an unknown unknown into a known unknown that you can
fiddle around with or Google further to learn about.

------
rubyn00bie
You know the one thing that has helped me out the most? An error reporting
service AND then addressing _every_ error.

That is to say, my service should emit zero 500 errors.

Then my reporting is easy to interpret and consistently meaningful. I don't
have to worry about bullshit noise "oh that's just X it does that sometimes."

Sleeping at night is a lot easier when you have less keeping you awake.

~~~
jadams3
This. I have a really hard time measuring it, but ever since we really worked
on error reporting our weekend sleep factor has greatly improved.

For a complex system though, don't underestimate how hard this is to do:

- Every cloud service needs to be routed to a common service

- All of your software, every language, even that cool Go experiment

- All of the third-party software

- Logs all have to agree on a format; JSON is not always an option

Finally, there's justifying the time spent fixing things with no observable
side effect(s). Most cloud stuff is reliable against first-order failures and
so is tolerant of a lot; it's designed that way. But once the wheels come off,
and they _will_ come off, ... buckle up if you haven't been fixing those
errors. If you aren't clean on second-order failures, you're in for a rough
ride.

~~~
mcintyre1994
We use AWS, and one benefit of their hosted Elasticsearch is that they can
build you a Lambda that syncs CloudWatch logs to ES, handling a variety of
different formats. So we have our Beanstalk web requests + some Lambda infra +
our main web backend etc. all synced to ES with very little effort.
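
The AWS-generated function does more (request signing, index naming,
batching), but the core of such a Lambda is roughly the following sketch; the
Elasticsearch endpoint and index name are placeholders:

    import base64
    import gzip
    import json
    import urllib.request

    ES_URL = "https://search-example.us-east-1.es.amazonaws.com/app-logs/_doc"  # placeholder

    def handler(event, context):
        # CloudWatch Logs subscriptions deliver a base64-encoded gzipped payload
        data = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))
        for log_event in data["logEvents"]:
            doc = {
                "logGroup": data["logGroup"],
                "logStream": data["logStream"],
                "@timestamp": log_event["timestamp"],
                "message": log_event["message"],
            }
            req = urllib.request.Request(ES_URL,
                                         data=json.dumps(doc).encode(),
                                         headers={"Content-Type": "application/json"})
            urllib.request.urlopen(req)  # real version signs and bulks these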

You do have the downside that the logs don’t share, e.g., a nicely synced
structure, but that also has the upside that the structure is closer to what
the dev is used to, so nobody ever needs to go back to CloudWatch or any other
logs to get more details or a less processed message. The other downside is
you have to write a different monitor for each index, though this has the
upside that you can also have different triggers per index. In our small team
we just message different Slack channels, which makes for a nice lightweight
opt in/out for each error type.

It’d definitely be tricky to get everything aligned in, e.g., the same JSON
format, but this sort of middle ground isn’t too hard and still has benefits -
you just need to be already syncing in any format to CloudWatch - which, if
you’re in AWS, you probably are.

------
savrajsingh
Just about everything mentioned here is well-handled by Google App Engine. I
still think it’s the way to go for most projects, but I don’t think they’ve
marketed themselves well lately. I’m sure there are other good providers too;
I don’t see the downside to using PaaS.

~~~
hckr1292
GAE is incredible and poorly marketed. It's the only serverless product I know
of that allows me to use whatever server framework I want (Flask, Rails,
Spring) but be blissfully ignorant of the underlying VMs. I spent a week
looking at all the other major alternatives out there, and I don't think GAE
has any real competitors. It's just a different kind of serverless... in a
really good way.

Having said that, it has some serious shortcomings: baked-in monitoring (at
least for Python) is much worse than, say, Datadog + Sentry. Additionally,
Google doesn't have any great relational serverless databases (which is what I
personally want for a regular webapp) -- they do have some solid non-
relational databases. Also, no secret store... it's very tricky to securely
store secrets inside GAE.

To me, the perfect platform for a webapp is GAE + Aurora + some undiscovered
secrets store.

~~~
judge2020
Are there any particular downsides to storing secrets as environment
variables? It's working in my app, though configuration is done via the web UI
[of Elastic Beanstalk] to keep secrets out of SCM.

~~~
mjfisher
Storing secrets in env vars is very common in practice, although it presents a
slightly bigger attack surface than using something like Hashicorp's Vault to
just pull the secrets into memory.

You can sometimes find debug pages etc. set up for apps and runtimes that will
show all set environment variables, or crash monitoring software that will
capture env vars and send them elsewhere by default. Those risks can be
managed, but having sensitive information not set in the process environment
is more 'secure by default'. It also means that in the event someone finds a
way to remotely execute code in your process (eval() on an unsanitized input,
anyone?) it's much harder to dump out secrets.
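
In practice the difference is small: fetch the secret at startup and keep it
only in process memory instead of in the environment. A sketch against
Vault's KV v2 HTTP API (the address, token file, and secret path below are
placeholders):

    import requests

    VAULT_ADDR = "https://vault.example.com:8200"                      # placeholder
    VAULT_TOKEN = open("/var/run/secrets/vault-token").read().strip()  # placeholder

    def fetch_secret(path):
        # KV v2 read: GET /v1/secret/data/<path>; the secret ends up only in
        # process memory, never in the environment or in crash reports
        resp = requests.get(f"{VAULT_ADDR}/v1/secret/data/{path}",
                            headers={"X-Vault-Token": VAULT_TOKEN},
                            timeout=5)
        resp.raise_for_status()
        return resp.json()["data"]["data"]

    db_password = fetch_secret("myapp/db")["password"]  # hypothetical path/key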

------
bcrosby95
> A 4 9’s means you can only have 6 minutes down a year.

4 9's is 52 minutes of downtime a year. Keep in mind that the single-region
EC2 SLA is only 99.99%. And if you rely on a host of services each with an SLA
of 99.99, yours is actually worse than 99.99. So if you want to actually get
to 99.99, your components have to be better than that, meaning you will have
to go multi-region. Achieving this is much harder than this simple step
suggests.

~~~
gav
This is a very salient point. If your service relies on N other services, each
with a SLA of 99.99%, the chance of a single request having at least one
failure is:

    1 - .9999^N

Which means if you make 10 requests, you go from 99.99% to 99.9% or from 52
minutes to 8.77 hours of downtime a year.

In most cases you're likely to be making a lot more than 10 service calls.
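
Spelled out as a back-of-the-envelope calculation (assuming independent
failures, which real SLAs aren't):

    # availability of a request that needs N dependencies in series,
    # each with 99.99% availability
    for n in (1, 10, 50):
        avail = 0.9999 ** n
        downtime_hours = (1 - avail) * 365 * 24
        print(f"N={n:2d}: {avail:.4%} available, ~{downtime_hours:.1f} h/year down")
    # N= 1: 99.9900% available, ~0.9 h/year down
    # N=10: 99.9000% available, ~8.8 h/year down
    # N=50: 99.5012% available, ~43.7 h/year down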

~~~
james_s_tayler
Depends on whether those 9s are in series or in parallel. In series they
multiply to produce lower availability, but in parallel they give you higher
availability.
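
Concretely, assuming independent failures:

    a = 0.9999                     # availability of a single node/dependency

    series_3 = a ** 3              # all three must be up: ~99.97%
    parallel_3 = 1 - (1 - a) ** 3  # any one of three suffices: ~99.9999999999%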

~~~
drieddust
> AWS will use commercially reasonable efforts to make the Included Services
> each available for each AWS region with a Monthly Uptime Percentage of at
> least 99.99%, in each case during any monthly billing cycle....

So to achieve 99.99% within a region, every component should have at least 3
nodes, and to do better than that the deployment should go multi-region, which
escalates costs quickly.

Most applications in reality don't even need four 9s, so this works
beautifully for everyone. I work in the outsourcing industry, and in the bad
old days we had huge penalties and many rounds of explanations even for
applications with no redundancy requirements ;).

But it's just Amazon credits nowadays and no one blinks an eye, so it's a
win-win for all.

~~~
james_s_tayler
3 nodes of a component in parallel would give you 99.9999% for that component.

~~~
drieddust
Yes, but not in AWS land. The committed SLA for availability of an entire
region is still 4 nines regardless.

~~~
james_s_tayler
Hmmm. That's good to know.

So in that case you have to replicate across three regions to get 6 nines. So
one component needs 9 copies running around the world to have 6 nines for the
component.

~~~
drieddust
Pretty much. As I said above, it works because most internal apps within the
enterprise don't even need 2 nines.

------
jto1218
I'd recommend using an APM product off the shelf to get a lot of the
functionality mentioned in the article (Monitoring, Tracing, Anomaly
Detection). I would definitely _not_ recommend trying to roll all that
yourself, unless you have a ton of time and resources.

There are a few good ones out there; we use Instana and it's working really
well.

------
synack
This is all good advice for the app tier, but in my experience the most
painful outages relate to the data store. Understand your read/write volume,
have a plan for scaling up/out, implement caching wherever practical, and have
backups.

~~~
dshacker
I wanted to lay down some of the common things in the app tier. I think data
stores get complex really quickly, and it's not easy replicating and sharding
unless you've got some experience with it under your belt or you use a tool.

------
jturpin
Good article. I would add one thing to this - pick a database that scales
horizontally and is distributed. CockroachDB, Elasticsearch, Mongo,
Cassandra/Scylla are all good choices. If you lose one node, you don't have to
be afraid of your cluster going down, meaning you can do maintenance and
reconfiguration without downtime. If your load is low or bursty you can even
get away with running these on some small servers such as t3 (probably
minimally t3.larges). Running a cloud managed database is also a good option.

~~~
ummonk
Yes, and together with that, I recommend putting all state in the distributed
database (or distributed file storage for large blobs). This allows you to
gracefully handle crashes, stop and restart servers, etc. because you don’t
lose any state in the process.
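
In code terms, the rule is just that handlers never keep anything important in
process memory. A sketch (using Redis here purely as a stand-in for whatever
distributed store you pick):

    import json
    import redis

    store = redis.Redis()  # stand-in for whatever distributed store you use

    def add_to_cart(user_id, item):
        # all state lives in the store, so any instance can serve the next
        # request and this process can be killed or restarted at any time
        key = f"cart:{user_id}"
        cart = json.loads(store.get(key) or "[]")
        cart.append(item)
        store.set(key, json.dumps(cart))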

------
z3t4
Having only one mirror is scary. If one goes down, it's like Murphy's law
kicks in. So you want it to take at least 3 things going wrong to bring down
your system; 2 is not enough. Also have redundancy _everywhere_, in case your
checker agent stops working, for example. You want 2 of everything and at
least 3 of those that should never fail.

------
haolez
As a solo founder, I have almost everything mentioned in this article set up,
except CI/CD. I can certainly see its value, but being able to easily take
down parts of my production system and replace them with instrumented variants
is very useful to me when things go wrong. I find that CI usually gets in the
way of this. Maybe it's just a bad habit that I need to ditch :)

~~~
veeralpatel979
> being able to easily take down parts of my production system and replace
> them with instrumented variants is very useful to me when things go wrong

Sorry what did you mean by this?

~~~
haolez
When some service is misbehaving, I have a script to take it down and replace
it instantly with an instrumented version with more logs and whatnot.

~~~
eropple
Service reconfiguration is a thing and, in systems that handle dependency
management well, tends to not be _too_ painful. (I'm in the process of writing
a TypeScript library for doing exactly this, designed for NestJS but usable
outside of it.)

------
corentin88
Haven’t seen anything related to the third-party services that your cloud
service relies on. I’m talking mostly about APIs that you use that might crash
at some point. Any recommendations on that part?

------
luord
This is a great list. I feel a little happy with myself that I knew about most
of these.

Except for identifying each request: I had never heard about that. It's so
simple yet so brilliant, gotta start doing it.

------
swader999
Good article. Fire drills are worthy of mention. Simulate parts going down,
practice recovery.

------
fcvarela
Nice article, may I ask what tools you used to produce the illustrations?

~~~
dshacker
OneNote! :) I'm an SDE in OneNote

------
jupp0r
Pretty funny that HN traffic seems to have killed the site.

------
vishaalk
Great article Sada :). Hope OneNote is treating you well!

~~~
dshacker
Hey thanks Vishaal! Having fun everyday :) We miss you over here.

------
peterwwillis
I applaud the author for sharing their notes. But also, this is why HN (and
general upvote-anything-that-looks-interesting forums) sucks. If you are
actually defining architecture, you should not be reading these kinds of blog
posts. I get that they are interesting to the layman, but so is The
Anarchist's Cookbook. _Don't make whatever you read in The Anarchist's
Cookbook_.

And I'm crabbing about this because _I am easily susceptible to Anarchists
Cookbooks_. I have had to implement X tech before, and googled for "How do I
X", and some blog post came up saying "For X, Use Y". I'm too lazy to read 5
books on the general concept, so I just dive in and immediately download Y and
run through the quick-start guide. After spending a while getting it going and
getting past the "quick start", I wonder, "Ok, where's the long-start? What's
next?" And that doesn't exist. And later, after a lot of digging, it turns out
Y actually really sucks. But the blog post didn't go into that. I wasted my
time (my own fault) because I read a short blog post.

A lot of people live by _Infrastructure as Code_ , and so they will reach for
literally anything which has that phrase in its description. But you don't
need it to throw together an MVP, and a lot of the IaC "solutions" out there
are annoying pieces of crap. I guarantee you that if you pick any of them up,
you are in for months of occasionally painful edge cases where the answer to
your problem is _"You just weren't using it the right way."_

In reality, if you want to be DevOps (yes, I'm using DevOps as an adjective,
ugh) you should probably develop your entire development and deployment
workflows by hand, and only when you've accomplished all of the basic
requirements of a production service by hand (bootstrapping, configuration,
provisioning, testing, deployment, security, metrics, logging, alerts,
backup/restore, networking, scalability, load testing, continuous integration,
immutable infrastructure & deployments, version-controlled configuration,
documentation, etc), _then_ you can start automating it all. If you've done
all of these things before, automating it all from the start may be a breeze.
If you haven't, you may spend a ton of time on automation, only later to learn
that the above need to be changed, requiring rework of the automation.

~~~
dshacker
Yeah, in reality I was wary about adding the links; these solutions are often
created after a problem arose in the system. But I wanted to provide a few
places to show a beginner the initial paths for finding out more. I can't
count the times when, before I was as knowledgeable, I approached a
conversation about Chef and Puppet that didn't make sense to me, even after I
had read what Chef and Puppet did.

It could be, on the other hand, that you really don't grasp the need for these
technologies until you've had to manually implement one. Things like the UUIDs
are basic to hook up to anything.

I can often relate this feeling to nutritional advice. "If you want to be
healthier, eat more avocados": if you only eat avocados but don't understand
what's behind it, or the premise behind it, you'll probably get fat. But if
someone tells you "Avocados are a good way to supplement your fats without
blah blah blah" and you understand that there isn't a unique solution, then
you'll probably be healthier.

