While load testing with Artillery, I found that I was often getting 429 errors on my front-end endpoint. When pushing 500+ RPS, the 2nd function was taking up over 50% of the concurrent execution limit, so new events coming into the front-end were throttled and, in this case, thrown out. That also means that any future Lambdas in the same AWS account would exacerbate the problem. Our traffic is spiky and can easily hit 500+ RPS on occasion, so this really wasn't acceptable.
My solution was to refactor the 2nd function into a Fargate task that polls the SQS queue instead. It easily handled any workload I threw at it, and it can run 24/7 for a fraction of the cost of the Lambda. Each invocation of the Lambda was authenticating with the GCP SDK before passing the event, and the Lambda had to stay executing while the 2 stages of network requests completed.
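For anyone wondering what a polling worker like this might look like, here is a minimal sketch rather than my actual task code. It assumes a boto3-style SQS client is passed in; `poll_queue`, `handle`, and `max_polls` are names made up for illustration.

```python
def poll_queue(sqs, queue_url, handle, max_polls=None):
    """Long-poll an SQS queue and delete each message only after handle() succeeds.

    `sqs` is assumed to be a boto3 SQS client (or anything with the same
    receive_message / delete_message methods). `handle` is a hypothetical
    callback that forwards one message body downstream.
    """
    processed = 0
    polls = 0
    # max_polls=None means run forever, which is what a 24/7 task would do.
    while max_polls is None or polls < max_polls:
        polls += 1
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,  # largest batch SQS returns per call
            WaitTimeSeconds=20,      # long polling, the SQS maximum
        )
        for msg in resp.get("Messages", []):
            try:
                handle(msg["Body"])
            except Exception:
                # Skip the delete: the message becomes visible again after
                # the visibility timeout and is retried.
                continue
            sqs.delete_message(
                QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"]
            )
            processed += 1
    return processed
```

In the task's entrypoint this would be called with something like `poll_queue(boto3.client("sqs"), queue_url, forward_event)`, where `forward_event` is a placeholder for whatever sends the event on. Because the process is long-lived, any authentication can happen once at startup instead of once per event.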
I'm happy to report I haven't been able to muster a test that breaks anything since I started using Fargate!
It sounds like you already found a great solution for your particular case. But it's also worth mentioning that you can apply per-function concurrency limits, which can be another way to prevent a particular function from consuming too much of the overall concurrency. For anyone whose Lambda workload is cheaper than a 24/7 task, that could be a good option.
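As a rough illustration of the per-function limit, something like the following should work with boto3's Lambda client, whose `put_function_concurrency` call sets reserved concurrency. The helper name, the function name, and the limit of 200 are all made up for the example.

```python
def cap_concurrency(lambda_client, function_name, limit):
    """Set reserved concurrency, which also acts as a cap, for one function.

    `lambda_client` is assumed to be a boto3 Lambda client. Reserved
    concurrency both guarantees that capacity to the function and stops
    it from scaling past `limit`.
    """
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )
```

Usage would be along the lines of `cap_concurrency(boto3.client("lambda"), "queue-worker", 200)`. Keep in mind that whatever you reserve for one function is subtracted from the pool shared by every other function in the account.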
> Each invocation of the Lambda was authenticating with the GCP SDK before passing the event
I'm curious whether you tried moving the authentication outside of the handler function so it could be reused for multiple events? I've found that can make a huge difference for some use cases.
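To show what I mean, here is a minimal sketch of the pattern. `FakeGcpClient` is a stand-in I invented so the example runs on its own; in real code it would be whatever authenticated client the GCP SDK gives you.

```python
class FakeGcpClient:
    """Hypothetical stand-in for an authenticated GCP SDK client."""

    auth_count = 0  # counts how many times "authentication" ran

    def __init__(self):
        # Pretend this is the expensive authentication round trip.
        FakeGcpClient.auth_count += 1

    def publish(self, event):
        return f"published {event}"


# Module-level state survives between invocations for as long as the
# Lambda execution environment stays warm.
_client = None


def get_client():
    """Create the client on first use, then reuse it."""
    global _client
    if _client is None:
        _client = FakeGcpClient()
    return _client


def handler(event, context):
    # Only the first invocation in a given environment pays for auth.
    return get_client().publish(event)
```

With this layout, calling `handler` repeatedly in the same environment authenticates once. A cold start still pays the cost, and real credentials can expire, so a production version would likely need a refresh check as well.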