Readers may also enjoy Steampipe [1], an open source tool to live query 140+ services (e.g. AWS, GitHub, CSV, Kubernetes) with SQL. It uses Postgres foreign data wrappers under the hood and supports joins and other operations across tables. (Disclaimer: I'm a lead on the project.)
In addition to Postgres FDWs, Steampipe plugins are also available as a SQLite extension [1] or a CLI export tool [2] for lighter-weight use cases. (Although a local Postgres has a surprisingly small footprint!) Building plugins as DuckDB extensions would be cool too, but we haven't done that yet.
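To give a flavor, here's the kind of query Steampipe enables once a plugin is installed. This one uses the AWS plugin's aws_s3_bucket table; the exact table and column names come from each plugin's schema, so treat this as illustrative:

    -- List S3 buckets in one region (illustrative; column names
    -- may vary by plugin version).
    select
      name,
      region,
      versioning_enabled
    from
      aws_s3_bucket
    where
      region = 'us-east-1';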
Yep, that usually dampens my excitement pretty quickly after seeing a new extension. You'd also never know whether it's available on new versions. Sometimes you can install them "manually" like supabase audit, but more often that's not possible.
Sorry, what do you base that on? To me it just seems like a straightforward inspiration from the "data lake" -> "lakehouse" terminology that Databricks started (?) using.
My best guess is that Databricks and pg_lakehouse both independently coined "lakehouse" from "data lake", and that for the latter team it was partly a pun on Wodehouse. But the creators are welcome to chime in and confirm/deny!
(Or to say, like, "Sure...uh, we totally meant that...yes we are very literary.")
It has absolutely nothing to do with P.G. Wodehouse. It's a "data warehouse" sitting on a "data lake": the name simply combines those two terms, and it keeps the "lake" and "house" metaphors intact quite well, because you do actually get lake houses on lakes.
The (internal) use of DataFusion to create new, powerful extensions for Postgres is a very clever idea. Very good work by the ParadeDB team.
I like this one very much. It's a very simple way to avoid having to use a different set of tools and query languages (or more limited query languages) to query lakes.
Neat that you plan to support both Delta Lake and Apache Iceberg.
I'm curious about HN's take on these two formats. I'm having a hard time deciphering which might be the industry winner (or perhaps they both have a place, and no "winner" is necessary).
This is anecdotal, but I feel that we (ParadeDB) have received more requests for Iceberg integration vs. Delta Lake. We were actually hesitant to launch pg_lakehouse without Iceberg support, but pulled the trigger on it because the iceberg-rust crate is still in its early days. We will probably be contributing to iceberg-rust to make it work with pg_lakehouse.
There isn't a winner and there likely won't be one (at least not for a long time). Tabular will likely be acquired by Snowflake; the two industry behemoths now back their own formats, and each will treat its own as a first-class citizen.
Agreed, and this is why we want to support both, maybe even Apache Hudi down the line. But I hope the industry converges on one main standard rather than Snowflake and Databricks fighting for their own formats. They can differentiate on much more meaningful features.
There’s a lot of interesting work happening in this area (see: XTable).
We are building a distributed Python query engine and share a lot of the same frustrations… in fact, until quite recently most of the table formats only had JVM client libraries, so integrating them natively with Daft was really difficult.
We finally managed to get read integrations across Iceberg/DeltaLake/Hudi recently as all 3 now have Python/Rust-facing APIs. Funny enough, the only non-JVM implementation of Hudi was contributed by the Hudi team and currently still lives in our repo :D (https://github.com/Eventual-Inc/Daft/tree/main/daft/hudi/pyh...)
These libraries still lag behind their JVM counterparts, though, so it's going to be a while before we see full support for the entire feature set of each table format. But we're definitely seeing a large appetite for working with table formats outside the JVM ecosystem (e.g. in Python and Rust).
I'm also building in the lakehouse space and anecdotally have seen more excitement around Iceberg over delta lake just because of its completely open source origins. Iceberg has evolved faster and has had more contributions from a more diverse set of contributors than Delta Lake. Not sure if this will change with a Snowflake <> Tabular acquisition but I'd easily bet on Iceberg if current trends continue.
We agree. We plan to bring Iceberg support as a first-class citizen as soon as we can, but unfortunately the support in Rust these days is still limited. We and the community are working on it.
Looks like Postgres as a replacement for Databricks SQL, which is already a query engine for data lakes. It's not a lakehouse, but it calls itself one. Seems like a cool and useful project, but the name is problematic.
In all seriousness though, I see your point. While it's true that we don't provide the storage or table format, our belief is that companies actually want to own the data in their S3. We called it pg_lakehouse because it's the missing glue for companies already using Postgres + S3 + Delta Lake/Iceberg to have a lakehouse without new infrastructure.
Could you clarify which result you're referring to as the baseline and "number 1 place?"
I should clarify that our published ClickBench results are from our pg_analytics extension. New results with pg_lakehouse will be released, and they're going to beat the old benchmarks because (1) there's no overhead from Postgres transactions/MVCC, since pg_analytics used a table access method whereas pg_lakehouse is just a foreign data wrapper, and (2) pg_lakehouse uses the latest release of DataFusion.
The performance differences that exist between DataFusion and other OLAP engines are rapidly becoming commoditized. DataFusion is already a world-class query engine and will only improve. pg_lakehouse absorbs all those improvements into Postgres.
This is great work! Could you please comment on your choice of license? Most Postgres extensions that achieve wide adoption use the PostgreSQL, MIT, or Apache license.
All ParadeDB extensions are released under AGPL-3.0. We've found that it strikes the right balance between being open source and enabling the community to adopt it for free, while also protecting us from hyperscalers and enabling us to build a sustainable business. Perhaps the topic of a blog post someday :)
For me, it's a license that provides forward guarantees to the community: no proprietary forks can happen, so any fork will be an OSS fork from which the upstream project may benefit too, and that benefits all users.
That's the reason we chose this license for StackGres [1], another project in the Postgres space.
For AWS to make it available on RDS/Aurora, would it be safe to assume there would have to be some changes to the extension source to make it compatible with the Aurora engine? If we assume AWS doesn't want to do that, then the licensing provides some protection there.
Yeah, exactly. In practice, we've taken inspiration from the likes of Citus and others who have adopted the AGPL-3.0 license as a good compromise and found success with it. It's rather common nowadays for infra startups to use AGPL-3.0. Other noteworthy examples: MinIO, Quickwit, Lago, etc.
Hi, OpenDAL maintainer here. I'm not sure what "Snowflake as an object store" means, since Snowflake is a cloud data warehouse service, not a storage service.
Snowflake is not in the list of stores supported by Apache OpenDAL, so likely not. It might not expose its storage APIs. I doubt users of Snowflake would want a separate query engine anyway.
You can see a performance comparison with Hydra on ClickBench: https://benchmark.clickhouse.com/ by selecting ParadeDB and Hydra. TL;DR: it is much faster.
From a feature-set perspective, in addition to querying local disk, we can query remote object stores (S3, GCS, etc.) and table format providers (Delta Lake, with Iceberg coming soon).
From a code perspective, we're written in Rust on top of open-source standards like OpenDAL and DataFusion, while Hydra is its own C codebase built from a fork of Citus columnar.
Very exciting. We use GCS, but we don't have a data lake yet. If you were to "sell" the concept of a data lake, what would you point to as a great example use case?
Whereas pg_analytics stores the data in Postgres block storage, pg_lakehouse does not use Postgres storage at all.
This makes it a much simpler (and in our opinion, more elegant) extension. We learned that many of our users already stored their Parquet files in S3, so it made sense to connect directly to S3 rather than asking them to ingest those Parquet files into Postgres.
It also accelerates the path to production readiness, since we're not touching Postgres internals (no need to mess with Postgres MVCC, write-ahead logs, transactions, etc.).
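To make the shape of this concrete, here's a rough sketch in standard Postgres foreign data wrapper DDL. The wrapper, handler, and option names below are placeholders I've made up, not pg_lakehouse's exact API, so check the project docs for the real ones:

    -- Hypothetical names throughout; standard Postgres FDW DDL.
    create foreign data wrapper lakehouse_wrapper
      handler lakehouse_fdw_handler
      validator lakehouse_fdw_validator;

    create server s3_server foreign data wrapper lakehouse_wrapper;

    -- The foreign table points at Parquet files in S3; no data is
    -- copied into Postgres block storage.
    create foreign table trips (
      vendor_id   text,
      fare_amount numeric
    )
    server s3_server
    options (path 's3://my-bucket/trips/');

    select count(*) from trips;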
If users already have a data-lake-style system that generates Parquet files, the case for using Postgres to query that data is questionable. I think the Postgres way of doing things should be prioritized if you want to keep your product in a unique position.
Can you elaborate on what you mean by the "Postgres way of doing things"? Also, what is wrong with using Postgres to query data in external object stores? It is a common occurrence for businesses to store parquet artefacts in object storage, and querying them is often desirable.
It depends. If you're happy with Databricks, etc., you might be good. But we've seen many users want the simplicity of querying data from Postgres for analytics, especially when JOINing analytical and transactional data (see the sketch below).
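As an illustration (all table names here are hypothetical), a single query can combine both worlds:

    -- events is a foreign table over Parquet in S3; customers is an
    -- ordinary Postgres heap table in the same database.
    select
      c.plan,
      count(*) as pageviews
    from events e
    join customers c on c.id = e.customer_id
    group by c.plan;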
1. It's single node, but DataFusion parallelizes query execution across multiple cores. We do have plans for a distributed architecture, but we've found that you can get very far just by scaling up a single Postgres node.
2. The only information stored in Postgres is the set of options passed to the foreign data wrapper and the schema of the foreign table; this is standard for all Postgres foreign data wrappers (see the catalog queries below).
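You can inspect exactly what gets stored via the standard Postgres catalogs (nothing pg_lakehouse-specific here):

    -- Options recorded for each foreign server and foreign table.
    select srvname, srvoptions from pg_foreign_server;
    select ftrelid::regclass, ftoptions from pg_foreign_table;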
They've been moving more and more towards closed source over the years, which is a shame but I understand why. We don't offer time-series features today, but we're not ruling out adding support for it eventually if it is desired by our users.
[1] https://github.com/turbot/steampipe