ChrisWint's comments

ChrisWint · 2024-06-19T20:50:29.000000Z

Author of that blogpost here.

> less "code branches" than duckdb, which may or may not translate to faster performance.

In that case it was about 2.5x faster than DuckDB end to end, so a bit less than the difference in branches.

If you want to see some independent benchmarks on Umbra, our underlying technology, it's currently in first place on ClickBench [1]. You can compare against DuckDB there as well.

[1] https://benchmark.clickhouse.com/

riku_iki · 2024-06-19T20:54:27.000000Z

ClickBench is a toy benchmark: small dataset, very specific queries.

Benchmarking the full TPC-H (not just one query like in your post) on a sizable dataset (a few TBs) would be a very good approximation of a real-world scenario, but vendors usually avoid this.

pfent · 2024-06-19T21:00:24.000000Z

There are TPC-H numbers in another post: https://cedardb.com/blog/simple_efficient_hash_tables/

riku_iki · 2024-06-19T21:08:42.000000Z

It's a great starting insight, but again it's a small dataset (100GB) which almost fits in memory, and I think many details are missing (for example, ClickBench publishes all configs and queries, plus a more detailed report, so vendors can reproduce/optimize/dispute them).

refset · 2024-06-20T12:59:42.000000Z

> small dataset (100GB)

What counts as large or small definitely varies a lot depending on the context of the conversation/analysis.

MotherDuck's "Big Data is Dead" post [0] sticks in mind:

> The general feedback we got talking to folks in the industry was that 100 GB was the right order of magnitude for a data warehouse. This is where we focused a lot of our efforts in benchmarking.

Another point of reference is [1]:

> [...] Umbra achieves unprecedentedly low query latencies. On small data sets, it is even faster than interpreter engines like DuckDB

> TPC-H Small Dataset = 866k tuples, sf 0.1

[0] https://motherduck.com/blog/big-data-is-dead/

[1] https://db.in.tum.de/~kersten/Tidy%20Tuples%20and%20Flying%2...

riku_iki · 2024-06-20T15:02:58.000000Z

> The general feedback we got talking to folks in the industry was that 100 GB

And then the user discovers that DuckDB is plagued by OOMs and dramatic performance degradation when their data is slightly larger than memory.

ChrisWint · 2024-06-05T19:45:37.000000Z

I can recommend the lecture on implementing database systems by Thomas Neumann, who spearheaded the Umbra system that CedarDB builds on. The slides and lecture recordings are available online:

https://db.in.tum.de/teaching/ss21/moderndbs/