I think this mischaracterizes the state of the space. Iceberg won this competition as of a few months ago: every major vendor that didn't directly invent one of the other formats now supports Iceberg or has announced plans to do so.
Building lakehouse products on any table format but Iceberg at this point seems to me like a mistake.
Yeah, working in the data space I see a ton of customers using Iceberg, and some using Delta Lake if they're already a Databricks shop. Virtually no Hudi.
This is mostly correct, but it's worth mentioning that Cloudberry substantially predates Greenplum going closed source; it just got quite a boost from that change. Different dev team too, AFAIK none of the original Greenplum team was involved with Cloudberry until very recently.
Also, Greenplum 7 tracks Postgres 14. Which is still old at this point, but not as bad as 12...
I also don't think I'd call the architecture ancient, just very tightly coupled to Postgres's own (as a fork of Postgres that tries to ingest new versions from upstream every year or two) and paying the overhead of that choice in the modern landscape.
Source: former member of the Greenplum Kernel team.
Thanks for the context. In what way would you say Cloudberry lags behind Greenplum technology-wise? I see newer Greenplum versions have a lot of planner improvements.
Greenplum 7 is listed as tracking Postgres 12 in the release announcement [1], and the release notes for later 7.x versions don't mention anything. Is there a newer release with higher compatibility?
When I say ancient, I mean that it's a "classical" shared-nothing design where the database is partitioned and hosted as parallel, self-contained replica servers, each node running as a shard that could, in theory, be queried independently of the master database. This is in contrast to newer architectures where data is sharded at the heap level (e.g. Yugabyte, CockroachDB) and/or compute is separated from data (e.g. Aurora, ClickHouse, Neon, TiDB).
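To make the contrast concrete, here's a toy sketch of the classical model (not Greenplum's actual hashing or code, just the general idea): a whole row is routed to exactly one segment by hashing its distribution key, and each segment is a self-contained server owning its shard.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy model of classical shared-nothing distribution: every row lives
/// on exactly one segment, chosen by hashing the distribution key.
/// Each segment is a self-contained Postgres-like server that could,
/// in principle, answer queries about its shard on its own.
fn segment_for_key<K: Hash>(key: &K, num_segments: u64) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish() % num_segments
}

fn main() {
    let num_segments = 4;
    for customer_id in [101u64, 102, 103, 104, 105] {
        println!(
            "customer {} -> segment {}",
            customer_id,
            segment_for_key(&customer_id, num_segments)
        );
    }
    // In Yugabyte/CockroachDB-style designs, by contrast, the unit of
    // placement is a range/tablet in the storage layer rather than a
    // node-sized shard, and it can move between nodes independently.
}
```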
Cloudberry, last I checked, took their snapshot of all the Greenplum utilities well before the repos got archived and development went private. The backup/restore, DR, upgrade, and other such tooling seemed to leave a lot on the table. I haven't checked in a bit, though; it's possible they've picked back up some of that progress.
You're completely right, I had the wrong PG version in my memory. Embarrassing, thanks for catching that.
All the Greenplum utilities you mentioned are also open source and available for Cloudberry, but some of them are not in the main repo of Apache Cloudberry (this is more a matter of adhering to the Apache Software Foundation's regulations than a technical limitation).
Here is the unofficial roadmap of Cloudberry:
1. Continuously upgrading the PostgreSQL core version, maintaining compatibility with Greenplum Database, and strengthening the product's stability.
2. End-to-end performance optimization to support near real-time analytics, including streaming ingestion, vectorized batch processing, JIT compilation, incremental materialized views, PAX storage format, etc.
3. Supporting lakehouse applications by fully integrating open data lake table formats such as Apache Iceberg, Hudi, and Delta Lake.
4. Gradually transforming Cloudberry Database into a data foundation supporting AI/ML applications, based on Directory Table, pgvector, and PostgresML.
Delighted to see Greenplum mentioned in this article, and equally pleased to see Apache Cloudberry mentioned in the comments. Greenplum was open source for nearly a decade, forming a fairly mature global open-source ecosystem with many core developers distributed around the world (not all of them hired by Pivotal/VMware/Broadcom). Forking Greenplum as Cloudberry wasn't about outdoing Greenplum Database, but about fostering a more neutral and open community around an MPP database with a substantial global following. To that end, the project was donated to the Apache Software Foundation following Greenplum's decision to close its source. Since the project is in its early stages within the Apache incubator, our immediate goal is to build a solid foundation that adheres to Apache standards. Instead of introducing extensive new features, we are concentrating on delivering a stable, compatible open-source alternative to Greenplum.
I hadn't heard of this one before, and I'm curious what people think. Will any of the C variants aiming at memory safety actually pick up enough adoption to be taken seriously? I don't think a 5x slowdown is going to fly, but I find the idea pretty interesting in general.
Ultimately I've picked up Rust and am betting on it long term, but I'd be very curious to hear takes from other people.
It's been pulled closed-source by Broadcom now, but if anyone is interested, there was a fairly cool from-scratch Go implementation of the same basic ideological approach as pg_dump, for Greenplum. At the very least, it's a nice way to see the catalog queries it takes to pull each piece of information that goes into a schema.
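For a flavor of what that looks like, here's a minimal sketch (the connection string and table name are made up, and I'm using Rust's `postgres` crate rather than Go, just for illustration) of the kind of per-relation catalog query a pg_dump-style tool issues to reconstruct a column list:

```rust
use postgres::{Client, NoTls};

fn main() -> Result<(), postgres::Error> {
    // Hypothetical connection parameters; adjust for your cluster.
    let mut client = Client::connect("host=localhost user=gpadmin dbname=postgres", NoTls)?;

    // Reconstruct the column list for one table, roughly what a
    // pg_dump-style tool does for each relation it discovers.
    let rows = client.query(
        "SELECT a.attname, pg_catalog.format_type(a.atttypid, a.atttypmod)
         FROM pg_catalog.pg_attribute a
         JOIN pg_catalog.pg_class c ON c.oid = a.attrelid
         WHERE c.relname = $1 AND a.attnum > 0 AND NOT a.attisdropped
         ORDER BY a.attnum",
        &[&"my_table"],
    )?;

    for row in rows {
        let name: String = row.get(0);
        let ty: String = row.get(1);
        println!("{} {}", name, ty);
    }
    Ok(())
}
```

Multiply that by types, constraints, indexes, distribution keys, and so on, and you get a sense of how much catalog spelunking a full schema dump actually involves.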
I only looked at the linked file, so I'm not going to speak to the rest of the code and don't really have a good grasp of the architecture here.
This seems mostly OK. The only thing that (IMO) wouldn't pass code review is that big honking loop at https://github.com/achristmascarl/rainfrog/blob/main/src/app.... That thing needs to be refactored to be readable, with individual logic chunks pulled into their own functions and tidied up a bit.
That's not really Rust-specific, obviously, but all the `match` and `if let` and whatever other Rust stuff looks fine, so it's what I've got. Something like the sketch below is the shape I'd push for.
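For illustration only (these types and names are invented, not from the actual repo): instead of one giant loop whose body matches on every event and mutates everything inline, each logical chunk becomes a small, testable method, and the loop body shrinks to a readable dispatch.

```rust
// Invented example; the point is the shape, not the details.
enum Event {
    Key(char),
    Resize(u16, u16),
    Tick,
}

struct App {
    query: String,
    width: u16,
    height: u16,
    ticks: u64,
}

impl App {
    // Each logic chunk in its own function, easy to read and test.
    fn handle_key(&mut self, c: char) {
        self.query.push(c);
    }

    fn handle_resize(&mut self, w: u16, h: u16) {
        self.width = w;
        self.height = h;
    }

    fn handle_tick(&mut self) {
        self.ticks += 1;
    }

    // The former loop body reduces to a short dispatch table.
    fn handle_event(&mut self, event: Event) {
        match event {
            Event::Key(c) => self.handle_key(c),
            Event::Resize(w, h) => self.handle_resize(w, h),
            Event::Tick => self.handle_tick(),
        }
    }
}

fn main() {
    let mut app = App { query: String::new(), width: 80, height: 24, ticks: 0 };
    for event in [Event::Key('s'), Event::Resize(120, 40), Event::Tick] {
        app.handle_event(event);
    }
    println!("query={:?} size={}x{} ticks={}", app.query, app.width, app.height, app.ticks);
}
```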
You can't really be making this comparison, though? Getting an MD and residency and rotations means you're in your thirties before you get your first real job. At least in the US. It's wildly more difficult.
The MD cert and residency are the entire reason doctors don't have to deal with the same interview bullshit ("now please demonstrate an open-heart surgery on this patient over in room 103, and talk me through it as you go"). There are multiple gates you have to pass before you become hireable:
1. Be in about the top quintile of your undergraduate class.
2. Have good MCAT scores.
3. Get accepted to a med school.
4. Not fail out of med school, which includes what is effectively an apprenticeship while being supervised by an experienced practitioner.
5. Be accepted to residency.
6. Not fuck up your residency.
7. Apply for your license.
8. Not become uninsurable for malpractice.
Almost all of the bozos have been eliminated by the time you get to step 7. That is what makes hiring a doctor easier: you mostly just need to check "is this person's license real?" and "are there any red flags since licensure?" When checking references, you mostly rely on other people who also hold a medical license.
In the software world, we have no equivalent to all that. Literally anyone can call themselves a software engineer. There are no licenses, no tests, no (required) degrees. There's no apprenticeship. There are no lasting career consequences to fucking up a project. When checking references, you have no way to know that those people are real (unless they are in your network).
That's why we need skills and knowledge tests when hiring for software roles. Sure, it's nice that you can walk into a software job without the 10-15 years of formal process, but the flip side is all the annoying process around hiring.
I agree with absolutely everything you wrote. And so I don't complain about hiring loops, and instead just prep for interviews. It's absolutely worth the tradeoff.