
Auto-Scaling and Self-Defensive Services in Golang - rayascott
https://raygun.com/blog/2016/03/golang-auto-scaling/
======
aftbit
When I needed to solve this problem, I spun up 4x as many workers as I needed
and managed them through supervisor. This handles two of the requirements:

1. Autoscaling - unless your workers are very expensive to run, or your load is
exceptionally spiky (e.g. 100x jumps), just pre-scale instead of autoscaling.
2. If a process dies, supervisor (or runit or whatever) will restart it (see
the config sketch below).

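For concreteness, here's roughly the shape of that supervisord setup (the
program name, binary path, and log directory are invented for illustration):

```ini
[program:acme_worker]
command=/usr/local/bin/acme-worker
; pre-scaled: 4x the workers actually needed, one log file per worker
process_name=worker_%(process_num)02d
numprocs=4
; requirement 2: supervisor restarts any process that dies
autorestart=true
redirect_stderr=true
stdout_logfile=/var/log/acmeinc/worker_%(process_num)02d.log
```
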
My supervisor config has each worker log to its own file (e.g.
/var/log/acmeinc/worker_01.log). For death detection, I'd probably use a
watchdog process that checks the mtime of each log file. If it's more than X
minutes old, then no work has been done in that time, so kill the process and
allow supervisor to restart it. Combine this with an in-process watchdog log
message ("hey, I'm still alive!") on the main consumer thread every X minutes,
and you have a fairly robust solution.
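
A minimal sketch of that watchdog in Go (the paths, the threshold, and the
supervisorctl invocation are my assumptions, not from the setup above):

```go
// watchdog: restart any worker whose log file hasn't been touched
// recently. Run it from cron every few minutes. It relies on each
// worker logging a heartbeat ("hey, I'm still alive!") from its main
// consumer goroutine, so a fresh mtime means work (or at least a
// heartbeat) happened recently.
package main

import (
	"log"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
	"time"
)

const staleAfter = 5 * time.Minute // the "X minutes" threshold

func main() {
	logs, err := filepath.Glob("/var/log/acmeinc/worker_*.log")
	if err != nil {
		log.Fatal(err)
	}
	for _, path := range logs {
		info, err := os.Stat(path)
		if err != nil || time.Since(info.ModTime()) <= staleAfter {
			continue
		}
		// Hung worker: ask supervisor to kill and restart it. The
		// process name is assumed to match the log file name,
		// e.g. acme_worker:worker_01.
		name := strings.TrimSuffix(filepath.Base(path), ".log")
		log.Printf("%s looks hung, restarting", name)
		out, err := exec.Command("supervisorctl", "restart",
			"acme_worker:"+name).CombinedOutput()
		if err != nil {
			log.Printf("restart failed: %v: %s", err, out)
		}
	}
}
```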

Tangentially, deferring engineering work by buying more hardware is almost
always a good idea. :)

------
edgurgel
Virding's First Rule of Programming

"Any sufficiently complicated concurrent program in another language contains
an ad hoc informally-specified bug-ridden slow implementation of half of
Erlang."

[http://rvirding.blogspot.co.nz/2008/01/virdings-first-rule-of-programming.html](http://rvirding.blogspot.co.nz/2008/01/virdings-first-rule-of-programming.html)

~~~
siscia
My exact same thought.

I do believe that we, as software engineers, should understand how Erlang
works at its core and have some shared know-how about implementing at least
supervisors.
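
To make that concrete, a toy one_for_one-style supervisor in Go might look
like this (illustrative only; a real one needs restart limits, backoff, and
clean shutdown):

```go
package main

import (
	"log"
	"time"
)

// supervise runs worker in a loop, restarting it whenever it panics -
// a crude, goroutine-level stand-in for an Erlang one_for_one supervisor.
func supervise(name string, worker func()) {
	for {
		func() {
			defer func() {
				if r := recover(); r != nil {
					log.Printf("%s crashed: %v; restarting", name, r)
				}
			}()
			worker()
		}()
		time.Sleep(time.Second) // naive backoff between restarts
	}
}

func main() {
	go supervise("worker-1", func() {
		panic("simulated crash") // a real worker would consume jobs here
	})
	time.Sleep(3 * time.Second) // watch it crash and restart a few times
}
```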

------
drakenot
When I saw auto-scaling mentioned, I assumed they meant spinning up new
server(s) running this process, all working from the same Redis queue. But it
seems they are just launching more of the same process on the same machine.

For those of you experienced with platforms where this is common practice, can
you tell me why it's done? I have seen it done a lot.

Why don't people just launch more worker goroutines (or whatever their
platform's concurrency primitives are) and max out their machine from a single
process? Is it to keep the code simpler (i.e., it is easier to just spin up
duplicate processes on the same machine) and not have to deal with any
concurrency?
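
For what it's worth, the single-process version I'm asking about is pretty
small in Go - here a plain channel stands in for the Redis queue, and the job
and worker counts are arbitrary:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	jobs := make(chan int) // stand-in for the shared Redis queue
	var wg sync.WaitGroup

	// One worker goroutine per CPU instead of one OS process per worker.
	for w := 0; w < runtime.NumCPU(); w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for job := range jobs {
				fmt.Printf("worker %d processed job %d\n", id, job)
			}
		}(w)
	}

	for j := 0; j < 20; j++ {
		jobs <- j
	}
	close(jobs)
	wg.Wait()
}
```

(One common argument for separate processes anyway is isolation: a panic,
deadlock, or memory leak takes down one worker instead of all of them.)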

~~~
brianwawok
My thought as well. If your server is sized to handle 4 workers, run 4
workers.

If you want to autoscale to save costs, cool - but then use a server sized to
hold only 1 worker, at which point the techniques in this post won't quite
work.

The failure-detector piece is neat, but in a true multi-server scaling setup
it may be better to just kill a hung VM and fire up a new one. A hang is just
as likely to be a server problem as a code problem, and if it's automated, why
not nuke it all?

------
amouat
Nice read, but I'd like to point out that this is a very well-studied problem,
particularly in HPC. I'd point to a few decent resources, but the only one my
google-fu turned up was
[https://www.cs.fsu.edu/~engelen/courses/HPC/Algorithms1.pdf](https://www.cs.fsu.edu/~engelen/courses/HPC/Algorithms1.pdf)

The main advice is that you need to carefully balance communication overhead
against the work done, and also to analyse the optimum number of workers per
node.
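
As a back-of-envelope illustration of that balance (my own toy model and
numbers, not from the linked notes): suppose each task costs t_comm seconds of
serialized queue access and t_comp seconds of parallelizable compute. With n
workers per node, throughput X(n) and the point where adding workers stops
helping are roughly

```latex
% toy contention model: t_comm is serialized per task, t_comp runs in parallel
X(n) \approx \min\!\left(\frac{n}{t_{\text{comm}} + t_{\text{comp}}},\ \frac{1}{t_{\text{comm}}}\right),
\qquad
n^{*} = 1 + \frac{t_{\text{comp}}}{t_{\text{comm}}}
```

so with, say, 90 ms of compute and 10 ms of queue overhead per task,
throughput plateaus around 10 workers per node.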

