Hacker News | zhousun's comments

Author here. I would actually second you. My core belief is that it is possible to build true database-like functionality on top of Iceberg, but it is definitely not 'easier' than building it directly in a DB (in fact, doing it while keeping Iceberg compatibility is tricky. Yep, that's the cost of being open and general purpose).


I've been reading up on lakehouse stuff, and everything seems to agree with the idea that data warehouse functionality deteriorates as warehouses try to accommodate raw data access.

The databricks people suggest that a better "cache format" is needed, but I don't see how that's different from ETLing the data into a regular warehouse.


There's actually a great read on this: Cursor started with a distributed OLTP solution (YugabyteDB), and then fell back to RDS...


Using SQL as a catalog is not new (Iceberg has supported a JDBC catalog from the very beginning).

The main difference is storing the metadata and stats directly in SQL databases as well, which makes perfect sense for smaller-scale data. In fact we were doing something similar in https://github.com/Mooncake-Labs/pg_mooncake: metadata is stored in pg tables and only periodically flushed to actual formats like Iceberg.
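The pattern above can be sketched roughly as follows. This is a hypothetical illustration, not pg_mooncake's actual code: the `MetadataStore` class, its method names, and the file paths are all made up; the in-memory list stands in for a Postgres metadata table, and the "snapshot" stands in for flushed Iceberg-style metadata files.

```python
# Sketch: keep per-data-file metadata/stats as rows in a transactional
# store, then periodically flush them into an immutable snapshot, the
# way Iceberg manifests reference data files. All names are illustrative.
import json

class MetadataStore:
    def __init__(self):
        self.rows = []        # stands in for a Postgres metadata table
        self.snapshots = []   # stands in for flushed Iceberg metadata files

    def record_data_file(self, path, row_count, min_max_stats):
        # Fast transactional insert: one metadata row per data file.
        self.rows.append({
            "path": path,
            "row_count": row_count,
            "stats": min_max_stats,
        })

    def flush_to_iceberg(self):
        # Periodic flush: serialize the accumulated rows as one
        # snapshot, then clear the hot SQL-side metadata.
        snapshot = {
            "snapshot_id": len(self.snapshots) + 1,
            "manifest": json.dumps(self.rows),
        }
        self.snapshots.append(snapshot)
        self.rows = []
        return snapshot

store = MetadataStore()
store.record_data_file("s3://bucket/t/part-0.parquet", 1000, {"id": [1, 1000]})
store.record_data_file("s3://bucket/t/part-1.parquet", 500, {"id": [1001, 1500]})
snap = store.flush_to_iceberg()
```

The point of the split is that small, frequent metadata writes hit the cheap transactional path, while the open format only sees coarse-grained, batched snapshots.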


DataFile (Parquet) is not enough for a table with updates/deletes (those are part of Iceberg "metadata"). For CDC-from-OLTP use cases, the pattern involves rapidly marking rows as deleted, inserting new rows, and compacting small files. This is required for minutes-latency replication.

And for second-latency replication, it is more involved: you actually need to build a layer on top of Iceberg to track primary keys and apply deletions.
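A minimal sketch of what such a PK-tracking layer does, under the assumption of merge-on-read with positional deletes (the `PkIndex` class and file names here are invented for illustration, not an actual Iceberg or mooncake API):

```python
# Sketch: a PK index above Iceberg for CDC. Each upsert marks the
# previous row position as deleted and appends the new row; the
# accumulated lists would later be written out as positional delete
# files and data files.
class PkIndex:
    def __init__(self):
        self.live = {}        # pk -> (file, position) of the current row
        self.deletes = []     # positional deletes pending write
        self.appends = []     # new rows pending write

    def upsert(self, pk, row):
        if pk in self.live:
            # Old version of the row must be shadowed by a delete.
            self.deletes.append(self.live[pk])
        pos = ("append.parquet", len(self.appends))
        self.appends.append((pk, row))
        self.live[pk] = pos

    def delete(self, pk):
        if pk in self.live:
            self.deletes.append(self.live.pop(pk))

idx = PkIndex()
idx.upsert(1, {"v": "a"})
idx.upsert(1, {"v": "b"})   # rewrite: old position becomes a delete
idx.delete(1)
```

Without this index, every upsert would require scanning data files to find the old row, which is what makes second-latency replication impractical on bare Iceberg.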


Glad to see more 'Postgres-native' full-text search implementations.

Alternative solutions (Lucene/Tantivy) are both designed around 'immutable segments' (indexing immutable files), so marrying them with the Postgres heap table would result in a worse solution.


The segments themselves being immutable doesn't mean that Tantivy is incompatible with Postgres - it just means that Tantivy needs to be made compatible with Postgres' concurrency control mechanisms (MVCC) and storage format (block storage). This blog post explains the latter: https://www.paradedb.com/blog/block_storage_part_one


The fundamental mismatch I saw is creating a new segment for each individual DML. It is possible to alleviate, but I don't think there's a good general solution.
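One common mitigation is to buffer single-row DMLs in a mutable memtable and only seal an immutable segment once enough rows accumulate. A toy sketch (the `SegmentWriter` class and its threshold are hypothetical, not Tantivy's or ParadeDB's actual design):

```python
# Sketch: amortize segment creation by buffering writes and sealing
# an immutable segment only when a size threshold is reached, instead
# of producing one segment per DML statement.
class SegmentWriter:
    def __init__(self, seal_threshold=3):
        self.buffer = []            # mutable memtable of pending docs
        self.segments = []          # sealed, immutable segments
        self.seal_threshold = seal_threshold

    def insert(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.seal_threshold:
            self.seal()

    def seal(self):
        if self.buffer:
            self.segments.append(tuple(self.buffer))  # frozen segment
            self.buffer = []

w = SegmentWriter(seal_threshold=3)
for i in range(7):
    w.insert({"id": i})
```

The catch, as noted above, is that the unsealed buffer must still be visible to queries under Postgres MVCC, which is where a general solution gets hard.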


Yea, this is indeed a pattern we saw people repeatedly requesting (filtering on many columns), and one we are trying to solve with pg_mooncake. If you are interested, feel free to join mooncake-devs.slack.com to chat more about your use case.


People tried to run Spark ("better Hadoop") on it and failed, lol: https://github.com/ClickHouse/ClickBench/pull/139


Thanks for the comment but you are mixing some terminologies.

The core idea of mooncake is to build upon an open columnar format plus a substitutable vectorized engine, while natively integrating with Postgres.

So it is indeed closer to BigQuery (especially the newer BigQuery with Iceberg tables) than to a 'standard SQL database'. It has all the nice properties of BigQuery (object-store-native, columnstore, vectorized execution...), and scaling is also not impossible.


hydra/pg_duckdb embeds DuckDB to query existing data on S3, so it is targeting a completely different use case (someone has already prepared and shared a dataset, and you just want to query it).

pg_mooncake (and Crunchy Data) are implementing columnstore tables in Postgres, so you can actually use Postgres as a data warehouse (to ingest/update and run analytics queries).


Zhou from Mooncake Labs here.

Good point! Normally for a Postgres extension it wouldn't be solvable, but for mooncake that's actually not the case!

The core idea of mooncake is to build upon an open columnar format plus a substitutable vectorized engine, while natively integrating with Postgres.

So right now it uses DuckDB within Postgres to run queries, but we can, and will, support ad hoc use of other 'stateless engines' like Athena, StarRocks, or even Spark to run a big query.

