Back at my old job in ~2016, we built a cheap homegrown data warehouse out of Postgres, SQLite, and Lambda.
Basically, it worked like this:
- All of our data lived in compressed SQLite DBs on S3.
- Upon receiving a query, Postgres would use a custom foreign data wrapper we built.
- This FDW would forward the query to a web service.
- This web service would start one lambda per SQLite file. Each lambda would fetch the file, query it, and return the result to the web service.
- This web service would re-issue lambdas as needed and return the results to the FDW.
- Postgres (hosted on a memory-optimized EC2 instance) would aggregate the results.
It was straight magic. Separated compute + storage with basically zero cost and better performance than Redshift and Vertica. All of our data was time-series data, so it was extraordinarily easy to partition.
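For a sense of what the per-file workers did, here's a minimal sketch of such a lambda handler, assuming gzip-compressed SQLite files and a made-up event shape (the bucket/key/sql fields are just whatever the fan-out service passes in):

```python
import gzip
import sqlite3

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Hypothetical per-file worker: fetch one compressed SQLite partition
    from S3, run the pushed-down query against it, and return the rows."""
    bucket = event["bucket"]   # e.g. "dw-partitions" (made up)
    key = event["key"]         # e.g. "events/2016/03/14.sqlite.gz" (made up)
    sql = event["sql"]         # per-partition query forwarded by the web service

    # /tmp is the only writable scratch space inside a lambda.
    local_path = "/tmp/partition.sqlite"
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    with open(local_path, "wb") as f:
        f.write(gzip.decompress(body))

    con = sqlite3.connect(local_path)
    try:
        rows = con.execute(sql).fetchall()
    finally:
        con.close()

    # Rows go back to the web service, which merges results across lambdas.
    return {"key": key, "rows": rows}
```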
It was also considerably cheaper than Athena. On Athena, our queries would cost us ~$5/TB scanned (which hasn't changed today!), so it was easily >$100 for most queries (i.e., 20+ TB scanned per query), and we were running thousands of queries per hour.
I still think, to this day, that the inevitable open-source solution for DWs might look like this: insert your data as SQLite or DuckDB files into a bucket, pop in a Postgres extension, create an FDW, and `terraform apply` the lambdas + API gateway. It'll be harder for non-time-series data, but you can probably come up with a partitioning scheme for other kinds of data too.
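If you wanted to prototype the FDW piece without writing C, Multicorn lets you write a Postgres foreign data wrapper in Python. A rough sketch, assuming a fan-out web service that accepts the pushed-down columns/predicates and returns merged rows (the endpoint and payload shape are made up):

```python
import requests
from multicorn import ForeignDataWrapper


class LambdaFanoutFDW(ForeignDataWrapper):
    """Sketch of an FDW that forwards queries to a lambda fan-out service."""

    def __init__(self, options, columns):
        super().__init__(options, columns)
        # e.g. OPTIONS (endpoint 'https://example.internal/query') -- made up
        self.endpoint = options["endpoint"]
        self.columns = columns

    def execute(self, quals, columns):
        # Push the requested columns and WHERE-clause predicates down to the
        # service, which runs one lambda per SQLite/DuckDB file in the bucket.
        payload = {
            "columns": list(columns),
            "quals": [
                {"field": q.field_name, "op": q.operator, "value": q.value}
                for q in quals
            ],
        }
        resp = requests.post(self.endpoint, json=payload, timeout=300)
        resp.raise_for_status()
        for row in resp.json()["rows"]:
            yield row  # each row is a dict of column name -> value
```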
For our scale and request patterns (easily partitioned data, ~0.1 qps), there have been no major issues, but the JavaScript bindings I use (which are different from the Wasm bindings) leave a lot to be desired. To DuckDB's credit, it seems to have top-notch C++ and Python bindings that even support the memory-mapped Arrow format, which is purpose-built for efficient cross-language / cross-process data sharing, in addition to being a top-notch in-memory representation for Pandas-like data frames.
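For example, the Python bindings can return query results directly as Arrow tables and can query Arrow data in place; a quick sketch:

```python
import duckdb

con = duckdb.connect()  # in-memory database

# Fetch a query result as a pyarrow.Table instead of Python tuples.
arrow_table = con.execute(
    "SELECT range AS id, range * 2 AS doubled FROM range(5)"
).fetch_arrow_table()

# Arrow tables can be handed to other Arrow-aware tools, or registered and
# queried again by DuckDB without copying the data.
con.register("my_arrow_table", arrow_table)
print(con.execute("SELECT sum(doubled) FROM my_arrow_table").fetchall())
```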
Granted, DuckDB is under constant development, but it doesn't yet have a native cross-version export/import feature (its developers say DuckDB hasn't matured enough to stabilise its on-disk format just yet).
I also keep an eye on https://h2oai.github.io/db-benchmark/. As for Arrow-backed query engines, Pola.rs and DataFusion in particular sound the most exciting to me.
It also remains to be seen how Databricks' delta.io develops (it might come in handy for much, much larger data warehouses).
I've looked into this but saw hugely variable throughput, sometimes as little as 20 MB/s. Even at full throughput, I think S3 single-key performance maxes out at ~130 MB/s. How did you get these huge S3 blobs into a lambda in a reasonable amount of time?
* With larger lambdas you get more predictable performance; 2 GB RAM lambdas should get you ~90 MB/s [0].
* Assuming you can parse faster than you read from S3 (true for most workloads?), that read throughput is your bottleneck.
* Set a target query time, e.g. 1 s. At ~90 MB/s, for queries to finish in 1 s each object on S3 has to be 90 MB or smaller.
* Partition your data in such a way that each object on S3 is smaller than 90 MB.
* Forgot to mention: you can also do parallel reads from S3; depending on your data format / parsing speed, that might be worth looking into as well (see the sketch after this list).
This is a somewhat simplified guide (e.g. for some workloads merging the results takes time, and we're not accounting for that here), but it should be good enough to start with.
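To illustrate the parallel-read point, here's a rough boto3 sketch that splits a single object into ranged GETs fetched by a thread pool (the bucket/key, the 8 MB part size, and the worker count are placeholders to tune):

```python
import concurrent.futures

import boto3

s3 = boto3.client("s3")

def parallel_get(bucket, key, part_size=8 * 1024 * 1024, workers=8):
    """Fetch one S3 object with several ranged GETs in parallel."""
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    ranges = [(start, min(start + part_size, size) - 1)
              for start in range(0, size, part_size)]

    def fetch(byte_range):
        start, end = byte_range
        resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
        return resp["Body"].read()

    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(fetch, ranges))  # map preserves part order
    return b"".join(parts)
```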
How large were the SQLite database files you were working with here?
I've been thinking about building systems that store SQLite databases in S3 and pull them into a lambda for querying, but I'm nervous about how feasible that is given the database file sizes and how long the fetch would take.
I honestly hadn't thought about compressing them, but that would obviously be a big win.