At Netflix, we make considerable use of the autoscaler with our custom scheduler. Is there any reason you can't autoscale based on queue depth, or on the time items spend in the queue?
As far as your Docker problems go, have you tried upgrading to kernel 4.9.34+? We hit the issue as well, upgraded kernels, and haven't looked back.
We did think about using queue depth, but the CPU metric was easier to start with. One problem with using queue depth, though, is that we can't (or didn't figure out how to) throttle the autoscaler so that it doesn't give us more machines when a downstream service is down.
We did upgrade Docker, but we didn't upgrade the kernel. Switching kernels is a bigger task that we'll probably do at some later point.
I'd argue that switching kernels is pretty easy. At least for us, there's a lot less QA to switch kernels than Docker versions. The more often you upgrade, the easier it is to upgrade, as with almost all things.
We also have a team that does work processing and has built a PID controller to autoscale ASGs based on queue depth and time in queue. Although their code is not open source, the concept is a self-tuning autoscaler driven by a number of variables, allowing you to prioritize cost, timeliness, or even throughput.
We've been doing this for a while. If you'd like to collaborate at all, let me know.
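To make the PID idea concrete, here is a minimal sketch in Python. It is not that team's code (which isn't public), and it is a plain fixed-gain PID, not the self-tuning variant described above. The class name, the gains, the target queue depth, and the `desired_capacity` helper are all hypothetical, chosen only for illustration; you would feed the result into whatever call sets your ASG's desired capacity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueueDepthPID:
    """Fixed-gain PID on queue depth (hypothetical names and gains)."""
    kp: float                 # proportional gain: instances per queued item
    ki: float                 # integral gain
    kd: float                 # derivative gain
    target_depth: float       # queue depth we want to hold
    integral: float = 0.0
    prev_error: Optional[float] = None

    def update(self, queue_depth: float, dt: float) -> float:
        """Return a signed capacity adjustment for one control interval of dt seconds."""
        error = queue_depth - self.target_depth   # positive means we are behind
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0                      # no history on the first sample
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def desired_capacity(current: int, adjustment: float,
                     min_size: int, max_size: int) -> int:
    """Apply the adjustment and clamp to the group's size limits."""
    return max(min_size, min(max_size, current + round(adjustment)))


if __name__ == "__main__":
    pid = QueueDepthPID(kp=0.01, ki=0.0, kd=0.0, target_depth=100)
    adjustment = pid.update(queue_depth=600, dt=60)
    print(desired_capacity(current=10, adjustment=adjustment,
                           min_size=2, max_size=50))
```

The clamp in `desired_capacity` is also one place to deal with the downstream-outage worry: if a health check on the downstream service fails, you could temporarily lower `max_size` to the current size so the controller can't scale out. A real controller would also want some anti-windup on the integral term, which this sketch leaves out.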
Indeed, the signal metric is important. Queue depth or time from queue insert to removal works fairly well, as both relate directly to the tasks being performed.