Crunchy Bridge for Analytics: Your Data Lake in PostgreSQL (crunchydata.com)
76 points by tosh 20 days ago | 23 comments



> You can also use a wildcard in the path (e.g. s3://mybucket/hits/*.parquet) to query a list of files.

So is this an FDW that provides a similar function to Amazon Athena?

Does every query fetch everything again or does it somehow cache the downloaded data files?

Running something like this on EC2 is interesting as you wouldn’t pay the intermediate transfer costs. Only the final results would egress.
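For reference, the article's wildcard example implies usage roughly like the sketch below (my guess at the interface; the server name and empty column list are illustrative, not confirmed syntax):

    -- Hypothetical sketch: a foreign table over a wildcard of Parquet files.
    -- Server name and option names are illustrative only.
    CREATE FOREIGN TABLE hits ()
      SERVER crunchy_lake_analytics
      OPTIONS (path 's3://mybucket/hits/*.parquet');

    SELECT count(*) FROM hits;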


It caches the files on locally attached NVMe drives in the background, and continues to use range requests to S3 for queries until the full download completes.


This is pretty cool. Which extension are you using for this? (Or have you written a custom one?)


It's a custom extension, or actually a number of custom extensions, with quite a few more planned to further enhance the product experience. To the user, the extensions compose Crunchy Bridge for Analytics as a single unit, but under the covers there are lots of building blocks working together.

Marco and team were the architects behind the Citus extension for Postgres and have quite a bit of experience building advanced Postgres extensions. Marco gave a talk at PGConf EU on all the mistakes you can make when building extensions and the best practices to follow, so in short, quite a bit has gone into the quality of this versus a quick one-off. Even in the standup with the team today it was remarked "we haven't even been able to make it segfault yet, which we could pull off quite quickly and commonly with Citus".


Do you have a link to the slides or the video of that presentation? I found this, but no links: https://www.postgresql.eu/events/pgconfeu2023/schedule/sessi...


Hmm, let me see if we have a link to slides. There was no video recording unfortunately but we can definitely get slides posted.


How do AWS credentials get loaded? Is it a static set populated in the CREATE SERVER or can it pull from the usual suspects of AWS credential sources like instance profiles?

Is the code for the extension itself available?


The credentials are currently managed via the platform, so you enter them in the dashboard. We wanted to avoid specifying credentials via a SQL interface, because they can easily leak into logs and such. We'll add more authentication options over time.
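For contrast, the kind of SQL-level credential passing being avoided looks roughly like the generic FDW pattern below; the wrapper and option names vary by extension and are made up here for illustration:

    -- Illustrative only: secrets embedded in SQL like this can end up
    -- in statement logs, which is what the dashboard approach avoids.
    CREATE SERVER my_s3 FOREIGN DATA WRAPPER some_s3_fdw;
    CREATE USER MAPPING FOR CURRENT_USER SERVER my_s3
      OPTIONS (access_key_id 'AKIA...', secret_access_key '...');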


How does the extension get access to them? Is there some other “master” token for the Crunchy PG server itself that is used to fetch the real token?

The extension is not FOSS, right?


There is coordination from the Crunchy Bridge control plane to the data plane, which the extension is then aware of.

At this time it's not FOSS. We are going to consider opening some of the building blocks in time, but at the moment they are pretty tightly coupled both to the other extensions and to how Crunchy Bridge operates.


> Crunchy Bridge for Analytics

What exactly is "Crunchy Bridge for Analytics"? Is it some hosted cloud infra? Can I not install it locally as an extension?


Crunchy Bridge is a managed PostgreSQL service by Crunchy Data available on AWS, Azure, and GCP.

Bridge for Analytics is a special instance/cluster type in Crunchy Bridge with additional extensions and infrastructure for querying data lakes. Currently AWS only.


What is powering the vectorized query execution? Are you embedding something like DataFusion or is it entirely custom? (Also will any of this be open sourced?)


It's not DataFusion; it's much more custom, with a number of extensions underlying the various pieces. And we've got a number of other extensions in the works. To the user it's still a seamless experience, but we've seen that smaller extensions that know how to work together are easier to maintain. For example, we're working on a map type extension that understands the map types within Parquet files from inside Postgres. In time we may open source some of these pieces, but we don't have a time frame for that, and it's case by case for each extension.
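Purely as a speculative sketch (none of this syntax exists yet; it's invented here to illustrate the idea of the planned map type extension):

    -- Invented syntax: a Parquet map column surfaced as a subscriptable
    -- Postgres type. The actual interface of the planned extension is TBD.
    SELECT properties['utm_source'], count(*)
    FROM hits
    GROUP BY 1;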


Not sure what's powering Crunchy, but you might be interested in ParadeDB: https://blog.paradedb.com/pages/introducing_analytics


A shame that I can't use any of this: I'm on AWS RDS, so I can't install any of these interesting extensions.

My problem: I would like to be better at OLAP queries in addition to OLTP queries on my Postgres db. Currently my OLAP queries blow away my whole RAM with on-disk reads, and while they run I also get high CPU peaks, which forces me onto a bigger instance than I need during regular hours. We use Postgres already, so I would like to stick as close as possible to that.

The best option, operationally (running something like StarRocks seems new and complex) and cost-wise (Redshift seems expensive), seems to be ClickHouse with its materialized Postgres database engine and Postgres wire-compatible interface, or am I missing something? However, these are still marked experimental.

The option of an additional read replica with parameters tuned for OLAP seems quite expensive too; the RDS costs are high. And couldn't I have a significantly smaller replica, or would it fail to keep up with replication? An additional wish for this change is decreasing the cost.


So I was able to set up a machine with configuration management (SaltStack) + barman[1] + repmgr[2], and the maintenance was comparable with RDS.

This was in 2017, and I would expect better tools to be available now. This was on premises though; there were also WAL-E and later WAL-G, which were more cloud-oriented, although it looks like barman can now also use S3 and equivalents.

[1] https://pgbarman.org/

[2] https://www.repmgr.org/


If you're on AWS you can use something like Athena, or you can load data from CSV/Parquet in S3 into Redshift. But yeah, not directly within RDS Postgres to my knowledge.
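For example, loading Parquet from S3 into Redshift is a single COPY (table, bucket, and role names below are placeholders):

    -- Standard Redshift COPY of Parquet from S3; identifiers are placeholders.
    COPY analytics.hits
    FROM 's3://mybucket/hits/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'
    FORMAT AS PARQUET;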


Crunchy Bridge is similar to RDS. It runs on EC2 and is a mature managed service with features such as VPC peering, and now analytics, so you could consider it as an alternative.


That has been a fun one to play a small role in. Great work by Marco and the team.


Looks pretty similar to Redshift Spectrum.


Is a table scan implied by auto-detecting column types with CSV and JSON inputs?


It will do a sample, so usually much less than a table scan, but it can definitely take longer than for Parquet.
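As a hedged sketch of what triggers inference (syntax illustrative, mirroring the wildcard example above; an empty column list means types come from sampling the file):

    -- Illustrative: with no columns declared, types are inferred by
    -- sampling the CSV rather than scanning it in full.
    CREATE FOREIGN TABLE events ()
      SERVER crunchy_lake_analytics
      OPTIONS (path 's3://mybucket/events.csv');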



