For example, imagine that each machine has 20 available threads for processing messages received from SQS. I'd track a metric for the percentage of those threads currently in use. If I'm trying to meet a message-processing SLA, then my goal is to begin auto-scaling before that in-use percentage reaches 100%; e.g., we might scale up when the average thread utilization breaches 80%. (Or if you process messages with unlimited concurrent threads, you could use CPU utilization instead.)
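As a rough sketch of what tracking this metric could look like, here's a small Python example that wraps each message handler in a counter so you can report the busy-thread percentage. The names (`UtilizationTracker`, `handle_message`, `process`) are all hypothetical, not part of any SQS API:

```python
import threading

MAX_THREADS = 20  # matches the example above


class UtilizationTracker:
    """Counts busy worker threads so we can report in-use percentage."""

    def __init__(self, max_threads: int):
        self.max_threads = max_threads
        self._in_use = 0
        self._lock = threading.Lock()

    def __enter__(self):
        # A thread entering this context is now "in use".
        with self._lock:
            self._in_use += 1
        return self

    def __exit__(self, *exc):
        with self._lock:
            self._in_use -= 1

    def utilization_pct(self) -> float:
        with self._lock:
            return 100.0 * self._in_use / self.max_threads


tracker = UtilizationTracker(MAX_THREADS)


def process(body: str) -> None:
    pass  # placeholder for your application-specific message handling


def handle_message(body: str) -> None:
    # Wrap the handler so the busy-thread count stays accurate,
    # even if process() raises.
    with tracker:
        process(body)
```

A periodic background task could then publish `tracker.utilization_pct()` to your metrics system, and the auto-scaler alarms on the fleet-wide average.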
The benefit of this approach is that you can begin auto-scaling your system before it saturates and messages start to be delayed. Messages are only delayed once the in-use percentage reaches 100% -- as long as there are threads available (i.e., in-use < 100%), messages are processed immediately.
If you were to auto-scale on SQS metrics like queue length, the queue stays approximately empty until the system is already falling behind, and by then it's too late to scale preemptively. By monitoring and scaling on thread capacity, you can watch your effective utilization climb from 50% to 80% to 100%, and begin scaling before it reaches 100%, before messages start to back up.
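The scaling rule itself can then be a simple threshold check on average utilization. A minimal sketch, where the thresholds and the `desired_fleet_change` helper are assumptions for illustration (the 80% scale-up threshold comes from the example above; the 40% scale-down threshold is hypothetical, chosen to leave headroom and avoid flapping):

```python
SCALE_UP_THRESHOLD = 80.0    # percent; from the example above
SCALE_DOWN_THRESHOLD = 40.0  # percent; hypothetical lower bound


def desired_fleet_change(avg_utilization_pct: float) -> int:
    """Return +1 to add a host, -1 to remove one, or 0 to hold steady."""
    if avg_utilization_pct >= SCALE_UP_THRESHOLD:
        return 1
    if avg_utilization_pct <= SCALE_DOWN_THRESHOLD:
        return -1
    return 0
```

In practice you'd let your auto-scaler (e.g., an alarm on the published metric) apply this policy rather than hand-rolling the loop, but the decision logic is this simple.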
The other benefit of this approach is that it works equally well at many different scales: a threshold like 80% thread utilization works just as well with a single host as with a fleet of 100 hosts. By comparison, thresholds on metrics like queue length need to be adjusted as the scale and throughput of the system changes.