Fastest table sort in the West – Redesigning DuckDB's sort (duckdb.org)
129 points by hfmuehleisen on Aug 27, 2021 | 27 comments



This is legitimately fast (7s for 100M), but I'm more interested in the impact of the disk spills when swapping the hardware underneath, because the pointer swizzling is a very page-dirty way of loading data into memory with a memory map.

> When a heap block is offloaded to disk, the pointers pointing into it are invalidated. When we load the block back into memory, the pointers will have changed.
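
(A rough sketch of the swizzling idea, not DuckDB's actual code: pointers into the heap block are rewritten as offsets before the block is evicted, then rebased against the new address when it is loaded back.)

    #include <cstdint>
    #include <vector>

    // Rough sketch only: fixed-width row entries hold a raw pointer into a
    // variable-size heap block while that block is in memory.
    struct RowEntry {
        uint64_t key;
        char    *string_ptr;
    };

    // Before the heap block is offloaded, turn pointers into offsets...
    void unswizzle(std::vector<RowEntry> &rows, const char *heap_base) {
        for (auto &row : rows)
            row.string_ptr = reinterpret_cast<char *>(row.string_ptr - heap_base);
    }

    // ...and when the block comes back (possibly at a new address), rebase them.
    void swizzle(std::vector<RowEntry> &rows, char *new_heap_base) {
        for (auto &row : rows)
            row.string_ptr = new_heap_base + reinterpret_cast<uintptr_t>(row.string_ptr);
    }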

> The machine has an Intel(R) Xeon(R) W-2145 CPU @ 3.70GHz, which has 8 cores (up to 16 virtual threads), and 128 GB of RAM, so this time the data fits fully in memory.

However, the M1 + SSD is basically cheating on that trend, because it beats most of my server hardware on both memory latency & SSD speed. Though that fits with how people will use DuckDB for local analytics.

But otherwise this is page-fault hell.

> catalog_sales table is selecting 1 column ... always ordering by cs_quantity and cs_item_sk

The choice of sort keys is a bit odd for a comparison like this, because that tuple has a lot of duplicates. So one of the tricks my code in Tez uses to sort faster is the gallop borrowed from Timsort, which skips over the identical-key sections when doing the merge sort, to do fewer comparisons overall.
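
Roughly the trick, as a toy sketch (not the actual Tez code): when one input of the merge has a long run that stays <= the other input's head, gallop forward exponentially and copy the whole run instead of comparing element by element.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Toy galloping merge: long runs of duplicate keys on one side are located
    // with exponential + binary search and copied in bulk, so the number of
    // comparisons grows with the number of runs rather than the number of rows.
    std::vector<int> gallop_merge(const std::vector<int> &a, const std::vector<int> &b) {
        std::vector<int> out;
        out.reserve(a.size() + b.size());
        size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a[i] <= b[j]) {
                size_t step = 1, hi = i;                       // invariant: a[hi] <= b[j]
                while (hi + step < a.size() && a[hi + step] <= b[j]) {
                    hi += step;
                    step *= 2;
                }
                size_t end = std::upper_bound(a.begin() + hi,
                                              a.begin() + std::min(hi + step, a.size()),
                                              b[j]) - a.begin();
                out.insert(out.end(), a.begin() + i, a.begin() + end);
                i = end;                                       // skipped the whole run
            } else {
                out.push_back(b[j++]);
            }
        }
        out.insert(out.end(), a.begin() + i, a.end());
        out.insert(out.end(), b.begin() + j, b.end());
        return out;
    }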

If you sorted on the primary key for catalog_sales, which is (cs_item_sk, cs_order_number), then that is actually used to store data in-order for sort-merge-joins out of disk (against catalog_returns).

And if you get into storage ordering optimizations, you might see a massive difference just from swapping the order of those columns - if you pull the radix out, putting the most variable column first skews the changing bits into the first word of the key.
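
Something like this, as a toy sketch (column widths and names made up, not DuckDB internals): pack the sort columns into one fixed-width, byte-comparable key, and whichever column goes first dominates the leading bytes.

    #include <cstdint>
    #include <cstring>

    // Toy sketch: pack two unsigned sort columns into an 8-byte key that can be
    // compared with memcmp (or radix-sorted byte by byte). Storing each value
    // big-endian makes byte order agree with numeric order; putting the most
    // variable column first means most keys already differ in the first word.
    static void make_key(uint32_t first_col, uint32_t second_col, uint8_t key[8]) {
        uint32_t hi = __builtin_bswap32(first_col);    // GCC/Clang builtin
        uint32_t lo = __builtin_bswap32(second_col);
        std::memcpy(key, &hi, sizeof hi);
        std::memcpy(key + 4, &lo, sizeof lo);
    }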


I'm not sure if I understand, but we did not use a memory map (mmap), but rather blocks of memory that are explicitly (un)loaded by the buffer manager.

The M1 + SSD performs really well here. We tried to run an external sort experiment on the x86 machine, but the SSD is old and only has a write speed of 150 MB/s (compared to the MacBook's 3 GB/s), and it was incredibly slow. So you definitely need a fast SSD for this.

The columns we chose to sort by are rather arbitrary, but we shuffled the table before running the experiments to make sure there is no ordering left from the generation in there.

I like the merge sort trick you described!


> I'm not sure if I understand, but we did not use a memory map (mmap), but rather blocks of memory that are explicitly (un)loaded by the buffer manager.

But then, do you have your own cache competing with the OS cache? Or you use O_DIRECT to disable the filesystem cache (at least on Linux)?


Most databases rely on OS filesystem cache, which is very compatible with the way buffer caches manage data with fixed-size pages/blocks.


It's true that some databases rely on the OS file system cache. However, it is not controversial that doing so reduces performance by integer factors. Bypassing the OS file system cache is idiomatic in database kernel designs where absolute performance is a key objective.


Using O_DIRECT and other features still makes you reliant on the OS cache, even if all you're doing is keeping it empty and out of your business. I've only run into a few that manually manage block devices directly.
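
For reference, a minimal sketch of what O_DIRECT usage looks like on Linux (error handling omitted, file name made up): the page cache is bypassed, but the buffer, offset, and length all have to be aligned to the device's logical block size, which is part of why it's fiddly.

    #define _GNU_SOURCE                                  // O_DIRECT is a Linux extension
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        const size_t block = 4096;                       // assumed logical block size
        void *buf = NULL;
        posix_memalign(&buf, block, block);              // O_DIRECT needs aligned memory

        int fd = open("data.blk", O_RDONLY | O_DIRECT);  // hypothetical data file
        pread(fd, buf, block, 0);                        // aligned offset and length

        close(fd);
        free(buf);
        return 0;
    }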


Many of the techniques discussed are documented in my Julia implementation of the fastest string sorting algorithm https://www.codementor.io/@evalparse/julia-vs-r-vs-python-st...


This is so fast!! If anybody is using Pandas to keep rows in order and has hesitated to use DuckDB for that reason, hesitate no more! Give it a shot!


Seems like you're affiliated with DuckDB? Knowing what paper inspired the software [0], and regularly commenting about its performance/recommending it [1][2][3][4][5][6][7]. Perhaps you should consider being more forthcoming?

[0]: https://news.ycombinator.com/item?id=24669902 [1]: https://news.ycombinator.com/item?id=27878401 [2]: https://news.ycombinator.com/item?id=26825096 [3]: https://news.ycombinator.com/item?id=26588029 [4]: https://news.ycombinator.com/item?id=26476649 [5]: https://news.ycombinator.com/item?id=24534795 [6]: https://news.ycombinator.com/item?id=24534721 [7]: https://news.ycombinator.com/item?id=24338671


Hey, this is HN you know :)

It's not unusual for happy, unaffiliated users to post enthusiastically, or even evangelise a bit, about the software they love most.

Case in point, I often pop up in TimescaleDB threads to sing its praises regarding real-world usage, but I'm not affiliated with the TimescaleDB folks in any way.


I am not affiliated, just a happy user! I have spoken with one of the developers, but that's it!

I've just come up through the SQL side of analytics and I'm moving into Data Science and I feel like DuckDB is a superpower for people with my background.

Does that help? Happy to answer any other questions!


> SQLite always opts for an external sorting strategy. It writes intermediate sorted blocks to disk even if they fit in main-memory, therefore it is much slower.

Is this just when avoiding using in-memory databases in SQLite [0]? It seems like SQLite pretty clearly does support fully in-memory operations.

The use-case for in-memory SQLite is significantly narrower so that may be why it was not considered in this study. But I'd still be curious how an in-memory SQLite db compares to the others trialed here.

Unless I'm getting something totally mixed up.

[0] https://www.sqlite.org/inmemorydb.html


We wanted to use the same setup for all experiments, so we had to choose an on-disk DB for SQLite, because TPC-DS SF100 catalog_sales does not fit in 16 GB of memory.


But SQLite is not even present in the TPC-DS benchmark graphs, while Pandas (which the article notes works solely in memory) is…

Furthermore, the article explicitly says:

> We will use customer at SF100 and SF300, which fits in memory at every scale factor.


Customer fits in memory, whereas catalog_sales does not.

We chose to remove SQLite from the results because it was so much slower. The plots are much less readable when they are stretched out by something that is slower by an order of magnitude.


> Customer fits in memory, whereas catalog_sales does not.

Didn't prevent using pandas which had to rely on dynamic swapping? Or is in-memory sqlite unable to use that much memory?

> We chose to remove SQLite from the results because it was so much slower. The plots are much less readable when they are stretched out by something that is slower by an order of magnitude

So you're using on-disk SQLite because the data doesn't fit in memory (unlike Pandas, which also doesn't fit but was run in memory anyway), but you're dropping it anyway because it's too slow when it works on-disk?


You are right, we could probably re-run SQLite purely in memory, but only because macOS dynamically allocates additional swap.

However, I would not expect much improved performance, because I do not believe that SQLite has a different sorting strategy when running in memory. It would only save some I/O operations, which are very cheap on the MacBook anyway.

Either way, would be an interesting experiment.


Have you looked at the cache_size pragma? https://www.sqlite.org/pragma.html#pragma_cache_size

I think by default it's only 2 MiB; increasing it might help. It probably won't beat the other contenders, but SQLite's defaults are tuned for being a good citizen.
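
A minimal sketch of doing that from the C API (the number here is arbitrary): a negative cache_size is interpreted as KiB, so -1000000 asks for roughly 1 GB instead of the ~2 MiB default.

    #include <sqlite3.h>

    int main() {
        sqlite3 *db = nullptr;
        sqlite3_open("tpcds.db", &db);          // hypothetical database file
        // negative value = size in KiB; -1000000 is roughly 1 GB of page cache
        sqlite3_exec(db, "PRAGMA cache_size = -1000000;", nullptr, nullptr, nullptr);
        // ... run the ORDER BY query on this connection ...
        sqlite3_close(db);
        return 0;
    }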


Gotcha. Thanks for clarifying.


Did any of the actual benchmark queries in TPC-H or TPC-DS show a speed up?

My intuition is that the performance of large sorts (100 million rows) is not that important for most analytical workloads compared to the performance of scans, group-bys, and joins. Top-N sorts (ORDER BY X LIMIT N) are much more common, but most databases use different algorithms for those.
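
For illustration, a toy sketch of the usual Top-N approach (not any particular database's implementation): keep only the N best rows seen so far instead of sorting everything, e.g. with std::partial_sort or a bounded heap.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Toy Top-N ("ORDER BY x LIMIT n"): roughly O(rows * log n) instead of the
    // O(rows * log rows) of a full sort, and only n rows are materialized.
    std::vector<int> top_n(std::vector<int> rows, size_t n) {
        n = std::min(n, rows.size());
        std::partial_sort(rows.begin(), rows.begin() + n, rows.end());
        rows.resize(n);
        return rows;
    }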


There are plenty of TopN sorts in TPC-DS, but not many (if any) regular sorts, so no.

There are order-dependent window functions in there though, which did show a speed up.


> While std::sort is excellent algorithmically, it is still a single-threaded approach that is unable to efficiently sort by multiple columns because function call overhead would quickly dominate sorting time.

I don't understand what this is saying. Is it just using a lot of words to note that as the number of columns increases, so does the comparator's cost?


The author is pointing out two problems with std::sort

1- std::sort is single threaded

2- std::sort is unable to efficiently sort by multiple columns


> 1- std::sort is single threaded

I would hope it is quite obvious that is not what I have an issue with.

> 2- std::sort is unable to efficiently sort by multiple columns

It's asserting that with an explanation which is not one, and I'm asking what it is actually saying. "unable to efficiently sort by multiple columns because function call overhead would quickly dominate sorting time." is not an actual explanation.


Hmmm.

std::sort is a template in C++, which means that in most cases I'd expect std::sort to inline the comparator and not have any function call overhead at all. (And even further: the optimizer can optimize std::sort + the comparator together since they've been inlined.)

So you have a good point. Function overhead is in qsort, not in std::sort. The line reads kinda nonsensically to me now that I think of it.


The way I read it was that with many columns, the overhead of the compare function (even if it's inlined) is too high compared to the rest of the sorting overhead.


Exactly. Even when it's inlined, having a comparator with if/else is much, much slower than a single memcmp.
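
Roughly the contrast being drawn, as a toy sketch (column names made up, not DuckDB's code): a per-column branchy comparator versus a single memcmp over keys that were normalized to be byte-comparable up front.

    #include <cstdint>
    #include <cstring>

    struct Row {
        int32_t quantity;
        int32_t item_sk;
        uint8_t key[8];    // pre-built, byte-comparable sort key
    };

    // Branchy multi-column comparator: one hard-to-predict branch per column.
    bool cmp_columns(const Row &a, const Row &b) {
        if (a.quantity != b.quantity) return a.quantity < b.quantity;
        return a.item_sk < b.item_sk;
    }

    // Normalized-key comparator: the columns were encoded up front so that a
    // single memcmp over the fixed-width key yields the same ordering.
    bool cmp_key(const Row &a, const Row &b) {
        return std::memcmp(a.key, b.key, sizeof a.key) < 0;
    }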



