At Netflix, we have considerable use of the autoscaler with our custom scheduler...

imaravic · on Nov 20, 2017

We did think about using queue depth, but the CPU metric was easier to start with. One problem with using queue depth, though, is that we can't (or at least didn't figure out how to) throttle the autoscaler so that it stops giving us more machines when a downstream service is down.

We did upgrade Docker, but we didn't upgrade the kernel. Switching kernels is a bigger task that we'll probably take on at some later point.

sargun · on Nov 20, 2017

I'd argue that switching kernels is pretty easy. At least for us, there's a lot less QA to switch kernels than Docker versions. The more often you upgrade, the easier it is to upgrade, as with almost all things.

We also have a team doing work processing that has built a PID controller to autoscale ASGs based on queue depth and time in queue. Although their code is not open source, the concept is a self-tuning autoscaler driven by a number of variables, which lets you prioritize cost, timeliness, or even throughput.

We've been doing this for a while. If you'd like to collaborate at all, let me know.
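
For anyone unfamiliar with the PID idea, here is a minimal sketch of what such a loop could look like. It is not the team's actual code (that is not public); the class name, gains, setpoint, and size limits are all made up for illustration, and it uses only queue depth as the input rather than both depth and time in queue.

```python
from dataclasses import dataclass


@dataclass
class PIDAutoscaler:
    """Hypothetical PID loop that maps queue depth to a desired group size."""

    target_depth: float   # assumed setpoint: the queue depth we want to hold
    kp: float = 0.05      # proportional gain (illustrative value)
    ki: float = 0.0       # integral gain (illustrative value)
    kd: float = 0.0       # derivative gain (illustrative value)
    min_size: int = 1     # lower bound on the group size
    max_size: int = 100   # upper bound on the group size
    _integral: float = 0.0
    _prev_error: float = 0.0

    def step(self, queue_depth: float, current_size: int, dt: float = 60.0) -> int:
        """Return the desired instance count after one control interval of dt seconds."""
        # Positive error means the queue is deeper than we want, so scale out.
        error = queue_depth - self.target_depth
        self._integral += error * dt
        derivative = (error - self._prev_error) / dt
        self._prev_error = error

        adjustment = (
            self.kp * error
            + self.ki * self._integral
            + self.kd * derivative
        )
        desired = round(current_size + adjustment)
        # Clamp so the controller can never ask for an unbounded group.
        return max(self.min_size, min(self.max_size, desired))
```

The value returned by `step` would then be passed to whatever sets the ASG's desired capacity. A real version would likely also need anti-windup on the integral term and some cooldown between adjustments.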

sloak · on Nov 20, 2017

Indeed, the signal metric is important. Queue depth, or the time from queue insert to removal, works fairly well because it relates directly to the tasks being performed.
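
A rough sketch of how both signals could be tracked in-process, assuming each task has a unique ID and that something calls these hooks on enqueue and dequeue. The `QueueSignal` name and its methods are hypothetical, not from any particular queue library.

```python
import time
from collections import deque


class QueueSignal:
    """Hypothetical tracker for queue depth and insert-to-removal time."""

    def __init__(self, window: int = 100):
        self._pending = {}                  # task_id -> insert timestamp
        self._waits = deque(maxlen=window)  # most recent wait times, in seconds

    def on_insert(self, task_id, now=None):
        # `now` can be injected for testing; defaults to a monotonic clock.
        self._pending[task_id] = time.monotonic() if now is None else now

    def on_remove(self, task_id, now=None):
        now = time.monotonic() if now is None else now
        self._waits.append(now - self._pending.pop(task_id))

    def depth(self) -> int:
        """Number of tasks inserted but not yet removed."""
        return len(self._pending)

    def mean_wait(self) -> float:
        """Average time in queue over the recent window, 0.0 if no samples yet."""
        return sum(self._waits) / len(self._waits) if self._waits else 0.0
```

Either `depth()` or `mean_wait()` could then be published as the custom metric the autoscaler reacts to.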