Fastest table sort in the West – Redesigning DuckDB's sort (duckdb.org)
129 points by hfmuehleisen on Aug 27, 2021 | 27 comments



This is legitimately fast (7s for 100M), but I'm more interested in the impact of the disk spills when swapping the hardware underneath, because the pointer swizzling is a very page-dirty way of loading data into memory with a memory map.

> When a heap block is offloaded to disk, the pointers pointing into it are invalidated. When we load the block back into memory, the pointers will have changed.
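
(A rough sketch of the swizzling idea, not DuckDB's actual code: pointers into the heap block are rewritten as offsets before the block is evicted, then rebased against the new address when it is loaded back.)

    #include <cstdint>
    #include <vector>

    // Rough sketch only: fixed-width row entries hold a raw pointer into a
    // variable-size heap block while that block is in memory.
    struct RowEntry {
        uint64_t key;
        char    *string_ptr;
    };

    // Before the heap block is offloaded, turn pointers into offsets...
    void unswizzle(std::vector<RowEntry> &rows, const char *heap_base) {
        for (auto &row : rows)
            row.string_ptr = reinterpret_cast<char *>(row.string_ptr - heap_base);
    }

    // ...and when the block comes back (possibly at a new address), rebase them.
    void swizzle(std::vector<RowEntry> &rows, char *new_heap_base) {
        for (auto &row : rows)
            row.string_ptr = new_heap_base + reinterpret_cast<uintptr_t>(row.string_ptr);
    }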

> The machine has an Intel(R) Xeon(R) W-2145 CPU @ 3.70GHz, which has 8 cores (up to 16 virtual threads), and 128 GB of RAM, so this time the data fits fully in memory.

However, the M1 + SSD is basically cheating on that trend, because it beats most of my server hardware on both memory latency & SSD speed. Though that fits with how people will use DuckDB for local analytics.

But otherwise this is page-fault hell.

> catalog_sales table is selecting 1 column ... always ordering by cs_quantity and cs_item_sk

The choice of sort keys is a bit odd for a comparison like this, because that tuple has a lot of duplicates. So one of the tricks my code in Tez uses to sort faster is the gallop borrowed from Timsort, which skips over the identical-key sections when doing the merge sort, to do fewer comparisons overall.
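
Roughly the trick, as a toy sketch (not the actual Tez code): when one input of the merge has a long run that stays <= the other input's head, gallop forward exponentially and copy the whole run instead of comparing element by element.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Toy galloping merge: long runs of duplicate keys on one side are located
    // with exponential + binary search and copied in bulk, so the number of
    // comparisons grows with the number of runs rather than the number of rows.
    std::vector<int> gallop_merge(const std::vector<int> &a, const std::vector<int> &b) {
        std::vector<int> out;
        out.reserve(a.size() + b.size());
        size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a[i] <= b[j]) {
                size_t step = 1, hi = i;                       // invariant: a[hi] <= b[j]
                while (hi + step < a.size() && a[hi + step] <= b[j]) {
                    hi += step;
                    step *= 2;
                }
                size_t end = std::upper_bound(a.begin() + hi,
                                              a.begin() + std::min(hi + step, a.size()),
                                              b[j]) - a.begin();
                out.insert(out.end(), a.begin() + i, a.begin() + end);
                i = end;                                       // skipped the whole run
            } else {
                out.push_back(b[j++]);
            }
        }
        out.insert(out.end(), a.begin() + i, a.end());
        out.insert(out.end(), b.begin() + j, b.end());
        return out;
    }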

If you sorted on the primary key for catalog_sales, which is (cs_item_sk, cs_order_number), then that is actually used to store data in-order for sort-merge-joins out of disk (against catalog_returns).

And if you get into storage ordering optimizations, you might see a massive difference just from swapping the order of those columns - if you pull the radix out, putting the most variable column first skews the changing bits into the first word of the key.
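
Something like this, as a toy sketch (column widths and names made up, not DuckDB internals): pack the sort columns into one fixed-width, byte-comparable key, and whichever column goes first dominates the leading bytes.

    #include <cstdint>
    #include <cstring>

    // Toy sketch: pack two unsigned sort columns into an 8-byte key that can be
    // compared with memcmp (or radix-sorted byte by byte). Storing each value
    // big-endian makes byte order agree with numeric order; putting the most
    // variable column first means most keys already differ in the first word.
    static void make_key(uint32_t first_col, uint32_t second_col, uint8_t key[8]) {
        uint32_t hi = __builtin_bswap32(first_col);    // GCC/Clang builtin
        uint32_t lo = __builtin_bswap32(second_col);
        std::memcpy(key, &hi, sizeof hi);
        std::memcpy(key + 4, &lo, sizeof lo);
    }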


I'm not sure if I understand, but we did not use a memory map (mmap), but rather blocks of memory that are explicitly (un)loaded by the buffer manager.

The M1 + SSD performs really well here. We tried to run an external sort experiment on the x86 machine, but the SSD is old and only has a write speed of 150 MB/s (compared to the MacBook's 3 GB/s), and it was incredibly slow. So you definitely need a fast SSD for this.

The columns we chose to sort by are rather arbitrary, but we shuffled the table before running the experiments to make sure there is no ordering left from the generation in there.

I like the merge sort trick you described!


> I'm not sure if I understand, but we did not use a memory map (mmap), but rather blocks of memory that are explicitly (un)loaded by the buffer manager.

But then, do you have your own cache competing with the OS cache? Or you use O_DIRECT to disable the filesystem cache (at least on Linux)?


Most databases rely on OS filesystem cache, which is very compatible with the way buffer caches manage data with fixed-size pages/blocks.


It's true that some databases rely on the OS file system cache. However, it is not controversial that doing so reduces performance by integer factors. Bypassing the OS file system cache is idiomatic in database kernel designs where absolute performance is a key objective.


Using O_DIRECT and other features still makes you reliant on the OS cache, even if all you're doing is keeping it empty and out of your business. I've only run into a few that manually manage block devices directly.
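
For reference, a minimal sketch of what O_DIRECT usage looks like on Linux (error handling omitted, file name made up): the page cache is bypassed, but the buffer, offset, and length all have to be aligned to the device's logical block size, which is part of why it's fiddly.

    #define _GNU_SOURCE                                  // O_DIRECT is a Linux extension
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        const size_t block = 4096;                       // assumed logical block size
        void *buf = NULL;
        posix_memalign(&buf, block, block);              // O_DIRECT needs aligned memory

        int fd = open("data.blk", O_RDONLY | O_DIRECT);  // hypothetical data file
        pread(fd, buf, block, 0);                        // aligned offset and length

        close(fd);
        free(buf);
        return 0;
    }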


Many of the techniques discussed are documented in my Julia implementation of the fastest string sorting algorithm https://www.codementor.io/@evalparse/julia-vs-r-vs-python-st...


This is so fast!! If anybody is using Pandas to keep rows in order and has hesitated to use DuckDB for that reason, hesitate no more! Give it a shot!


Seems like you're affiliated with DuckDB? Knowing what paper inspired the software [0], and regularly commenting about its performance/recommending it [1][2][3][4][5][6][7]. Perhaps you should consider being more forthcoming?

[0]: https://news.ycombinator.com/item?id=24669902 [1]: https://news.ycombinator.com/item?id=27878401 [2]: https://news.ycombinator.com/item?id=26825096 [3]: https://news.ycombinator.com/item?id=26588029 [4]: https://news.ycombinator.com/item?id=26476649 [5]: https://news.ycombinator.com/item?id=24534795 [6]: https://news.ycombinator.com/item?id=24534721 [7]: https://news.ycombinator.com/item?id=24338671


Hey, this is HN you know :)

It's not unusual for happy, unaffiliated users to post enthusiastically, or even evangelise a bit, about the software they love most.

Case in point, I often pop up in TimescaleDB threads to sing its praises regarding real-world usage, but I'm not affiliated with the TimescaleDB folks in any way.


I am not affiliated, just a happy user! I have spoken with one of the developers, but that's it!

I've just come up through the SQL side of analytics and I'm moving into Data Science and I feel like DuckDB is a superpower for people with my background.

Does that help? Happy to answer any other questions!


> SQLite always opts for an external sorting strategy. It writes intermediate sorted blocks to disk even if they fit in main-memory, therefore it is much slower.

Is this just when avoiding using in-memory databases in SQLite [0]? It seems like SQLite pretty clearly does support fully in-memory operations.

The use-case for in-memory SQLite is significantly narrower so that may be why it was not considered in this study. But I'd still be curious how an in-memory SQLite db compares to the others trialed here.

Unless I'm getting something totally mixed up.

[0] https://www.sqlite.org/inmemorydb.html


We wanted to use the same setup for all experiments, so we had to choose an on-disk DB for SQLite, because TPC-DS SF100 catalog_sales does not fit in 16 GB of memory.


But SQLite is not even present in the TPC-DS benchmark graphs, while Pandas (which the article notes works solely in memory) is…

Furthermore, the article explicitly says:

> We will use customer at SF100 and SF300, which fits in memory at every scale factor.


Customer fits in memory, whereas catalog_sales does not.

We chose to remove SQLite from the results because it was so much slower. The plots are much less readable when they are stretched out by something that is slower by an order of magnitude.


> Customer fits in memory, whereas catalog_sales does not.

Didn't prevent using pandas which had to rely on dynamic swapping? Or is in-memory sqlite unable to use that much memory?

> We chose to remove SQLite from the results because it was so much slower. The plots are much less readable when they are stretched out by something that is slower by an order of magnitude

So you're using on-disk SQLite because the data doesn't fit in memory (unlike Pandas, which also doesn't fit but was run in memory anyway), but you're dropping it anyway because it's too slow when it works on-disk?


You are right, we could probably re-run SQLite purely in memory, but only because macOS dynamically allocates additional swap.

However, I would not expect much improved performance, because I do not believe that SQLite has a different sorting strategy when running in memory. It would only save some I/O operations, which are very cheap on the MacBook anyway.

Either way, would be an interesting experiment.


Have you looked at the cache_size pragma? https://www.sqlite.org/pragma.html#pragma_cache_size

I think by default it's only 2 MiB; increasing it might help. It probably won't beat the other contenders, but SQLite's defaults are tuned for being a good citizen.
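
A minimal sketch of doing that from the C API (the number here is arbitrary): a negative cache_size is interpreted as KiB, so -1000000 asks for roughly 1 GB instead of the ~2 MiB default.

    #include <sqlite3.h>

    int main() {
        sqlite3 *db = nullptr;
        sqlite3_open("tpcds.db", &db);          // hypothetical database file
        // negative value = size in KiB; -1000000 is roughly 1 GB of page cache
        sqlite3_exec(db, "PRAGMA cache_size = -1000000;", nullptr, nullptr, nullptr);
        // ... run the ORDER BY query on this connection ...
        sqlite3_close(db);
        return 0;
    }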


Gotcha. Thanks for clarifying.


Did any of the actual benchmark queries in TPC-H or TPC-DS show a speed up?

My intuition is that the performance of large sorts (100 million rows) is not that important for most analytical workloads compared to the performance of scans, group-bys, and joins. Top-N sorts (ORDER BY X LIMIT N) are much more common, but most databases use different algorithms for those.
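
For illustration, a toy sketch of the usual Top-N approach (not any particular database's implementation): keep only the N best rows seen so far instead of sorting everything, e.g. with std::partial_sort or a bounded heap.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Toy Top-N ("ORDER BY x LIMIT n"): roughly O(rows * log n) instead of the
    // O(rows * log rows) of a full sort, and only n rows are materialized.
    std::vector<int> top_n(std::vector<int> rows, size_t n) {
        n = std::min(n, rows.size());
        std::partial_sort(rows.begin(), rows.begin() + n, rows.end());
        rows.resize(n);
        return rows;
    }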


There are plenty of TopN sorts in TPC-DS, but not many (if any) regular sorts, so no.

There are order-dependent window functions in there though, which did show a speed up.


> While std::sort is excellent algorithmically, it is still a single-threaded approach that is unable to efficiently sort by multiple columns because function call overhead would quickly dominate sorting time.

I don't understand what this is saying. Is it just using a lot of words to note that as the number of columns increases, so does the comparator's cost?


The author is pointing out two problems with std::sort

1- std::sort is single threaded

2- std::sort is unable to efficiently sort by multiple columns


> 1- std::sort is single threaded

I would hope it is quite obvious that is not what I have an issue with.

> 2- std::sort is unable to efficiently sort by multiple columns

It's asserting that with an explanation which is not one, and I'm asking what it is actually saying. "unable to efficiently sort by multiple columns because function call overhead would quickly dominate sorting time." is not an actual explanation.


Hmmm.

std::sort is a template in C++, which means that in most cases I'd expect std::sort to inline the comparator and not have any function call overhead at all. (And even further: the optimizer can optimize std::sort + the comparator together since they've been inlined.)

So you have a good point. Function overhead is in qsort, not in std::sort. The line reads kinda nonsensically to me now that I think of it.


The way I read it was that with many columns, the overhead of the compare function (even if it's inlined) is too high compared to the rest of the sorting overhead.


Exactly. Even when it's inlined, having a comparator with if/else is much, much slower than a single memcmp.
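
Roughly the contrast being drawn, as a toy sketch (column names made up, not DuckDB's code): a per-column branchy comparator versus a single memcmp over keys that were normalized to be byte-comparable up front.

    #include <cstdint>
    #include <cstring>

    struct Row {
        int32_t quantity;
        int32_t item_sk;
        uint8_t key[8];    // pre-built, byte-comparable sort key
    };

    // Branchy multi-column comparator: one hard-to-predict branch per column.
    bool cmp_columns(const Row &a, const Row &b) {
        if (a.quantity != b.quantity) return a.quantity < b.quantity;
        return a.item_sk < b.item_sk;
    }

    // Normalized-key comparator: the columns were encoded up front so that a
    // single memcmp over the fixed-width key yields the same ordering.
    bool cmp_key(const Row &a, const Row &b) {
        return std::memcmp(a.key, b.key, sizeof a.key) < 0;
    }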



