I'm obviously missing something here; hopefully someone can explain?
I understand the event loop/nextTick/etc. architecture of Node, but I don't understand how, in this case, he was blocking the loop. Shouldn't all the operations to S3 be async (and thus non-blocking, even if waiting for a timeout)? What was the specific part of this scenario that was causing the loop to stall?
If I understand correctly, the problem was that they were making several thousand requests to S3. While the requests to S3 themselves were asynchronous, the callbacks for those requests were queued up for (synchronous) execution on the event loop. Due to the large number of callbacks already in the queue, the callbacks for incoming web requests were queued up behind them, leading to latency in serving responses.
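A minimal sketch of that effect (hypothetical numbers, with `setImmediate` plus a synchronous busy-loop standing in for the completed S3 callbacks): each individual request is non-blocking, but the callbacks all run on the single event-loop thread, so anything scheduled behind a large batch has to wait for the whole batch to drain.

```javascript
// Hypothetical: N "S3 responses" whose callbacks each do a bit of
// synchronous work (parsing, bookkeeping, ...) on the event loop.
const N = 5000;
let done = 0;

const start = Date.now();
for (let i = 0; i < N; i++) {
  setImmediate(() => {
    // stand-in for per-response callback work
    for (let j = 0; j < 1e4; j++) Math.sqrt(j);
    done++;
  });
}

// A latency-sensitive handler scheduled *after* the batch: it cannot run
// until the loop has worked through every callback queued ahead of it.
setImmediate(() => {
  console.log(`handler ran after ${done}/${N} callbacks, ${Date.now() - start}ms in`);
});
```

Nothing here "blocks" in the usual sense; the handler is simply last in a very long queue.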
Ah, OK! Perfect, thanks; that's what I was missing. Makes total sense now. In any language (even in Ruby!) you could have separate thread pools (or EM loops, whatever) for the S3 requests and the web handlers.
But because Node only has one event loop, and Node inter-process communication is awkward, it's tricky.
Gotcha, cheers.
I wonder if using something like async.eachLimit would have helped; it might prevent the S3 batches from flooding the loop and give the web requests a chance to interleave, but probably at a cost to the median response time.
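For the curious, here's a hand-rolled eachLimit-style helper (sketched without the async library so it's self-contained): run a worker over a list of items, but never more than `limit` in flight at once, which caps how many callbacks can pile onto the loop at a time.

```javascript
// eachLimit-style concurrency cap: at most `limit` workers in flight.
// Each completion launches the next pending item, so other event-loop
// work (e.g. web request handlers) gets gaps to interleave into.
function eachLimit(items, limit, worker, done) {
  let next = 0, active = 0, finished = 0;
  function launch() {
    while (active < limit && next < items.length) {
      active++;
      worker(items[next++], () => {
        active--;
        if (++finished === items.length) return done();
        launch();
      });
    }
  }
  launch();
}

// Usage: pretend each item is an S3 fetch (simulated with setImmediate).
const order = [];
eachLimit([1, 2, 3, 4, 5], 2, (item, cb) => {
  order.push(item);
  setImmediate(cb); // hypothetical async completion
}, () => console.log('processed', order.length, 'items'));
```

The trade-off the comment mentions shows up directly: with `limit` small, individual items wait longer to start (higher median latency for the batch), but the queue never grows by more than `limit` callbacks at once.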