Pg_lakehouse: Query Any Data Lake from Postgres (github.com/paradedb)
171 points by landingunless 10 months ago | 72 comments



Readers may also enjoy Steampipe [1], an open source tool to live query 140+ services with SQL (e.g. AWS, GitHub, CSV, Kubernetes, etc). It uses Postgres Foreign Data Wrappers under the hood and supports joins etc with other tables. (Disclaimer - I'm a lead on the project.)

1 - https://github.com/turbot/steampipe
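
For a flavor of what cross-service SQL looks like, here's an illustrative query against Steampipe's AWS plugin. The table and column names (aws_ec2_instance, aws_vpc, etc.) are from memory and may differ slightly; the plugin docs have the authoritative schema:

    -- Join EC2 instances with their VPCs; table/column names are illustrative
    select
      i.instance_id,
      i.instance_type,
      v.cidr_block
    from
      aws_ec2_instance i
      join aws_vpc v on i.vpc_id = v.vpc_id
    where
      i.instance_state = 'running';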


I like steampipe but found the use of postgres a bit heavy for my use cases.

Could you make it run with pg_lite in wasm or DuckDB?


In addition to Postgres FDWs, Steampipe plugins are also available as a SQLite extension [1] or a CLI export tool [2] for lighter weight use cases. (Although a local Postgres has a surprisingly small footprint!) Building plugins as DuckDB extensions would be cool too, but we haven't done that yet.

1 - https://steampipe.io/blog/2023-12-sqlite-extensions 2 - https://steampipe.io/blog/2023-12-steampipe-export


This is awesome!

I last played with it over a year ago and I see that blog post is from Dec 2023. Looking forward to trying out the sqlite extension.


DuckDB would be great or accessible through DataFusion!

Looking forward to seeing what you build next!


How many folks here struggle to adopt tooling like this because it isn't possible to add Postgres extensions to places like RDS?


We're working on getting our extensions approved on as many platforms as possible


Yep, that usually dampens my excitement pretty quickly after seeing a new extension. You also wouldn't know if it's available on new versions. Sometimes you can install them "manually" like supabase audit, but more often that's not possible.


Me.

I both love and hate RDS - the costs are exorbitant.

I have to set up a Kafka pipe streaming CDC events to replicate the data into other DB types to get the benefits of things like this.


Moreover, even when extensions are supported by RDS, they often make upgrading database versions a PITA.


It's gotten significantly easier lately; I upgraded from v11 -> v15 with almost no downtime.


Not familiar with the process, how do they make it a PITA?


The name seems to be an allusion to the author P.G. Wodehouse, creator of the character Jeeves.

https://en.wikipedia.org/wiki/P._G._Wodehouse

Very clever naming!


Sorry, what do you base that on? To me it just seems like a straightforward inspiration from the "data lake" -> "lakehouse" terminology that Databricks started (?) using.

https://www.databricks.com/product/data-lakehouse

edit: ah, but in a different comment someone noted that it's not actually a lakehouse, so who knows!? :)


Based on pure speculation. I may be reaching.

My best guess is that Databricks and Pg_lakehouse both independently coined "lakehouse" from "data lake", and that for the latter team, it was partly a pun on Wodehouse. But the creators are welcome to chime in and confirm/deny!

(Or to say, like, "Sure...uh, we totally meant that...yes we are very literary.")


I wish we were that clever, but it's really just the combination of "data lake" and "data warehouse", which wasn't even coined by us :)


It has absolutely nothing to do with P.G. Wodehouse; it's a "data warehouse" sitting on a "data lake", which simply combines those two terms while keeping the "lake" and "house" metaphors intact, because you do actually get lake houses on lakes.


The (internal) use of DataFusion to create new, powerful extensions for Postgres is a very clever idea. Very good work by the ParadeDB team.

I like this one very much. A very simple way to avoid having to use a different set of tools and query languages (or more limited query languages) to query lakes.


Neat that you plan to support both Delta Lake and Apache Iceberg

I'm curious about HN's position on these two formats. I'm having a hard time deciphering which might be the industry winner (or perhaps they both have a place, and no "winner" is necessary).


This is anecdotal, but I feel that we (ParadeDB) have received more requests for Iceberg integration vs. Delta Lake. We were actually hesitant to launch pg_lakehouse without Iceberg support, but pulled the trigger on it because the iceberg-rust crate is still in its early days. We will probably be contributing to iceberg-rust to make it work with pg_lakehouse.


Also anecdotal, but we (Spice AI) see more requests for Iceberg, but in practice more deployments of Delta Lake.


My theory is that everyone would _prefer_ to use Iceberg, but it isn't as widely supported _yet_, so they're stuck with Delta in the interim.


I think you're spot on


> We will probably be contributing to iceberg-rust to make it work with pg_lakehouse

That's great news, thanks for your contributions to open source (here, and all the other extensions)


There isn't a winner and there likely won't be one (at least not for a long time). Tabular will likely be acquired by Snowflake, and the two industry behemoths now back their own formats; each will treat its own as a first-class citizen.


Agreed, this is why we want to support both. Maybe even Apache Hudi down the line. But I hope the industry converges to a main standard rather than Snowflake/Databricks fighting for their own formats. They can differentiate on much more meaningful features


There’s a lot of interesting work happening in this area (see: XTable).

We are building a Python distributed query engine, and share a lot of the same frustrations… in fact until quite recently most of the table formats only had JVM client libraries and so integrating it purely natively with Daft was really difficult.

We finally managed to get read integrations across Iceberg/DeltaLake/Hudi recently as all 3 now have Python/Rust-facing APIs. Funny enough, the only non-JVM implementation of Hudi was contributed by the Hudi team and currently still lives in our repo :D (https://github.com/Eventual-Inc/Daft/tree/main/daft/hudi/pyh...)

It's still the case that these libraries lag behind their JVM counterparts, though, so it's going to be a while before we see full support across the full feature set of each table format. But we're definitely seeing a large appetite for working with table formats outside of the JVM ecosystem (e.g. in Python and Rust).


Are you using the iceberg-rust crate for Rust? It's a rather young project, have you found it sufficient for your needs (if using)?


We're actually using pyiceberg to retrieve metadata! All our IO and decoding happens on the Rust side once the data has been passed through.

We expose something called a ScanOperator which allows integration into various catalogs through a thin layer that exposes ScanTasks.

Iceberg's impl: https://github.com/Eventual-Inc/Daft/blob/416009138359a9d410...


I'm also building in the lakehouse space and anecdotally have seen more excitement around Iceberg over delta lake just because of its completely open source origins. Iceberg has evolved faster and has had more contributions from a more diverse set of contributors than Delta Lake. Not sure if this will change with a Snowflake <> Tabular acquisition but I'd easily bet on Iceberg if current trends continue.


We agree. We plan to bring Iceberg support as a first-class citizen as soon as we can, but unfortunately the support in Rust these days is still limited. We and the community are working on it


ParadeDB is doing a lot of good work with Postgres. pg_analytics, and now pg_lakehouse...


This looks functionally similar to using http://github.com/spiceai/spiceai with a PostgreSQL data accelerator.


As somebody who writes a lot of Postgres extensions, I can say this is quite interesting!

I think I can see some parallels to Supabase's wrappers project.

Keep up the good work!


Looks like pg as a replacement for Databricks SQL, which is already a query engine for data lakes. It's not a lakehouse, but it calls itself one. Seems like a cool and useful project, but the name is problematic.


pg_house just wasn't as catchy!

In all seriousness though, I see your point. While it's true that we don't provide the storage or table format, our belief is that companies actually want to own the data in their S3. We called it pg_lakehouse because it's the missing glue for companies already using Postgres + S3 + Delta Lake/Iceberg to have a lakehouse without new infrastructure.
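
To make the "missing glue" concrete: the setup follows the standard Postgres foreign data wrapper pattern. Here's a rough sketch; the wrapper/handler names and OPTIONS keys below are illustrative rather than pg_lakehouse's exact API (the README has the real syntax):

    -- Standard Postgres FDW plumbing; identifiers and option keys are illustrative.
    CREATE FOREIGN DATA WRAPPER parquet_wrapper
      HANDLER parquet_fdw_handler VALIDATOR parquet_fdw_validator;

    CREATE SERVER my_s3_server
      FOREIGN DATA WRAPPER parquet_wrapper;

    CREATE FOREIGN TABLE trips (vendor_id int, fare numeric)
      SERVER my_s3_server
      OPTIONS (path 's3://my-bucket/trips/', format 'parquet');

    -- From here it's just SQL against the lake:
    SELECT vendor_id, avg(fare) FROM trips GROUP BY vendor_id;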


I have another question. So far on the ClickBench leaderboard it's 15x slower than the baseline. The number 1 entry is 1.67x slower than the baseline.

I assume that's DataFusion speed. What's the plan to improve upon it?


Could you clarify which result you're referring to as the baseline and "number 1 place?"

I should clarify that our published ClickBench results are from our pg_analytics extension. New results with pg_lakehouse will be released. They're going to beat the old benchmarks because (1) there's no overhead from Postgres transactions/MVCC, since pg_analytics used the table access method whereas pg_lakehouse is just a foreign data wrapper, and (2) pg_lakehouse uses the latest release of DataFusion.

The performance differences that exist between DataFusion and other OLAP engines are rapidly becoming commoditized. DataFusion is already a world-class query engine and will only improve. pg_lakehouse absorbs all those improvements into Postgres.


Would be great to also see new pg_lakehouse and DataFusion benchmark results here: https://duckdblabs.github.io/db-benchmark/

Currently DataFusion is much slower than DuckDB, or OOMing.


This is great work! Could you please comment on your choice of license? Most Postgres extensions that achieve wide adoption use the Postgres, MIT, or Apache license.


All ParadeDB extensions are released under AGPL-3.0. We've found that it strikes the right balance between being open-source and enabling the community to adopt for free, while also protecting us from hyperscalers and enabling us to build a sustainable business. Perhaps the topic of a blog post someday :)


I applaud the decision to use AGPL-3.0.

For me, it's a license that provides forward guarantees to the Community: no proprietary forks can happen, so any fork will be an OSS fork from which the upstream project may benefit too, which benefits all users.

That's the reason we chose this license for StackGres [1], another project in the Postgres space.

[1]: https://stackgres.io


Another great benefit of it indeed


It looks like hyperscalers can still host it as long as they publish changes to the source code? Am I reading the license right?


For AWS to make it available on RDS Aurora, would it be safe to assume there would have to be some changes to the extension source to make it compatible with the Aurora engine? If we assume AWS doesn't want to do that, then the licensing provides some protection there.


Yeah exactly. In practice, we've taken inspiration from the likes of Citus and others who have adopted the AGPL-3.0 license as a good compromise and have found success. It's rather common nowadays for infra startups to use AGPL-3.0. Other noteworthy examples: MinIO, Quickwit, Lago, etc.


Yes


Well, MongoDB was under AGPL v3.0 :)


Very nice addition! Do you plan to support Snowflake as an object store in the near future? It's not currently in pg_lakehouse's README.


Hi, OpenDAL's maintainer here. I'm not sure what "Snowflake as an object store" means since Snowflake is a cloud data warehouse service and not intended for storage services.


Snowflake is not in the list of supported stores on Apache OpenDAL, so likely not. It might not expose its storage APIs. I doubt users of Snowflake would want a separate query engine anyways


I am not up to date on the various lakes. Is this read-only? Are you able to init a lake from scratch?

What's the model to feed such a lake from some queue?


For now it is read-only, but write support is coming soon too. You can feed data via Kafka.


How does this compare to Hydra? https://www.hydra.so/


You can see a performance comparison to Hydra on ClickBench (https://benchmark.clickhouse.com/) by selecting ParadeDB and Hydra. Tl;dr: it is much faster.

From a feature-set perspective, in addition to querying local disk, we can query remote object stores (S3, GCS, etc.) and table format providers (Delta Lake, and soon Iceberg too).

From a code perspective, we're written in Rust on top of open-source standards like OpenDAL and DataFusion, while Hydra has its own codebase in C, built from a fork of Citus columnar.

Hydra is a cool project. Hope this helps! :)


And when will you have GCS storage ready? I saw on the website that it is not yet available.


Next week (May 20-something)


Very exciting. We use GCS but we don't have a data lake yet. If you were to "sell" the concept of a data lake, what would you refer to as a great example use case?


Thanks for the prompt response, the support for OpenDAL is amazing!


That is one hell of a logo!


Very cool!

Could you share the key difference between this and the previous pg_analytics, and the motivation for making it a separate plugin?


Whereas pg_analytics stores the data in Postgres block storage, pg_lakehouse does not use Postgres storage at all.

This makes it a much simpler (and in our opinion, more elegant) extension. We learned that many of our users already stored their Parquet files in S3, so it made sense to connect directly to S3 rather than asking them to ingest those Parquet files into Postgres.

It also accelerates the path to production readiness, since we're not touching Postgres internals (no need to mess with Postgres MVCC, write-ahead logs, transactions, etc.).


If users already have a data lake kind of system that is generating Parquet files, the use case for using Postgres to query the data itself is questionable. I think the Postgres way of doing things should be prioritised if you want to keep your product in a unique position.


Can you elaborate on what you mean by the "Postgres way of doing things"? Also, what is wrong with using Postgres to query data in external object stores? It is a common occurrence for businesses to store parquet artefacts in object storage, and querying them is often desirable.


It depends. If you're happy with Databricks, etc. you might be good. But we've seen many users want the simplicity of querying data from Postgres for analytics, especially in the case of JOINing analytics and transactional data.
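
As a hypothetical example of that JOIN pattern, assume a regular Postgres table customers and a foreign table events backed by Parquet files in object storage (both schemas are made up for illustration):

    -- customers lives in ordinary Postgres heap storage; events is a foreign
    -- table pointing at Parquet files in an object store.
    SELECT c.plan, count(*) AS event_count
    FROM events e
    JOIN customers c ON c.id = e.customer_id
    WHERE e.occurred_at >= now() - interval '7 days'
    GROUP BY c.plan
    ORDER BY event_count DESC;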


It seems very promising!

2 questions:

- do you distribute query processing over multiple pg nodes?

- do you store the metadata in PG, instead of a traditional metastore?


Thanks!

1. It's single node, but DataFusion parallelizes query execution across multiple cores. We do have plans for a distributed architecture, but we've found that you can get ~very~ far just by scaling up a single Postgres node.

2. The only information stored in Postgres is the options passed into the foreign data wrapper and the schema of the foreign table (this is standard for all Postgres foreign data wrappers).


Nice. I wish timescaledb open-sourced their s3 storage thing.


They've been moving more and more towards closed source over the years, which is a shame but I understand why. We don't offer time-series features today, but we're not ruling out adding support for it eventually if it is desired by our users.


Yet another amazing postgres plugin made possible by pgrx (https://github.com/pgcentralfoundation/pgrx)

It's really crazy how some projects just instantly enable a whole generation of new possibilities.

If you are impressed like this and want to build something like it -- check out pgrx, it's a pretty great experience.


pgrx is indeed wonderful and we would not be able to do our work without it. Big kudos to Eric, Jubilee, and the rest of the team!


looks interesting!



