I've looked into this but saw hugely variable throughput, sometimes as little as 20 MB/second. Even at full throughput, I think S3 single-key performance maxes out at ~130 MB/second. How did you get these huge S3 blobs into Lambda in a reasonable amount of time?
* With larger Lambdas you get more predictable performance; 2 GB RAM Lambdas should get you ~90 MB/s [0].
* Assuming you can parse faster than you read from S3 (true for most workloads?), that read throughput is your bottleneck.
* Set a target query time, e.g. 1 s. That means that for queries to finish in 1 s, each record on S3 has to be 90 MB or smaller (1 s × 90 MB/s).
* Partition your data in such a way that each record on S3 is smaller than 90 MB (a back-of-envelope sizing sketch follows this list).
* Forgot to mention: you can also do parallel reads from S3; depending on your data format and parsing speed, that might be something to look into as well (see the ranged-GET sketch at the end).
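Here's a minimal back-of-envelope sizing sketch in Python; the throughput number is the ~90 MB/s figure from [0], and the dataset size is a made-up example just to show the partition-count arithmetic.

```python
import math

# Assumptions: ~90 MB/s sequential read from S3 on a 2 GB Lambda [0];
# the total dataset size below is hypothetical, purely for illustration.
TARGET_QUERY_SECONDS = 1.0
READ_THROUGHPUT_MB_S = 90
TOTAL_DATASET_MB = 500_000  # e.g. a 500 GB dataset

# Largest record you can read within the target query time.
max_record_mb = TARGET_QUERY_SECONDS * READ_THROUGHPUT_MB_S  # 90 MB

# Minimum number of partitions so every record stays under that cap.
num_partitions = math.ceil(TOTAL_DATASET_MB / max_record_mb)  # ~5556

print(f"max record size: {max_record_mb:.0f} MB, partitions: {num_partitions}")
```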
This is a somewhat simplified guide (e.g. for some workloads merging data takes time, and we're not including that here), but it should be good enough to start with.
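On the parallel-read point: one way to do it is issuing ranged GETs against a single object from a thread pool. A rough sketch, assuming boto3 is available in the Lambda runtime; the bucket/key names and chunk size are hypothetical placeholders, and the right worker count depends on your object sizes and parsing cost.

```python
import concurrent.futures

import boto3

s3 = boto3.client("s3")  # low-level client; safe to share across threads

BUCKET = "my-data-bucket"            # hypothetical bucket name
KEY = "partitions/record-0042.bin"   # hypothetical object key
CHUNK_SIZE = 16 * 1024 * 1024        # 16 MB per ranged GET

def fetch_range(start, end):
    # Fetch bytes [start, end] of the object via an HTTP Range header.
    resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={start}-{end}")
    return start, resp["Body"].read()

def parallel_read(max_workers=8):
    # Look up the object size, split it into byte ranges, fetch the ranges
    # in parallel, and reassemble them into a single buffer.
    size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
    ranges = [(off, min(off + CHUNK_SIZE, size) - 1)
              for off in range(0, size, CHUNK_SIZE)]
    buf = bytearray(size)
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        for start, data in pool.map(lambda r: fetch_range(*r), ranges):
            buf[start:start + len(data)] = data
    return bytes(buf)
```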