Hacker News
Show HN: Glidesort, a new stable sort in Rust up to ~4x faster for random data (github.com/orlp)
389 points by orlp on Feb 3, 2023 | 70 comments
Hi all, I've talked about glidesort a few times on HN already, but it's finally ready for release. If you have any questions, feel free to ask. An academic paper on glidesort that goes into a lot more detail than the readme is upcoming, but is not ready yet.

I will be giving a talk on glidesort tomorrow at FOSDEM 2023 in the Rust Devroom at 16:10, you can seek me out there as well. In other news, I am leaving academia soon, so if you have interesting (Rust) jobs the coming months feel free to approach me.




I was able to get glidesort running as part of the Robin Hood sort benchmark[1]. See [0]; it's a lot slower than fluxsort on random data. Wondering if I did something wrong. Is cargo build --release with rustc 1.64.0 enough? Tried lto = "thin" in .cargo/config as well, which is ever so slightly faster. Also tried passing in a buffer of one, then two, times the input length, which didn't change things. Was this tested on x86?

Commands to run to replicate my results are below. Clone rhsort and run them in that directory. The res/*.txt results are human-readable but you have to install my programming language[2] to generate the plot. Hey, you made me program in Rust.

[0] https://gist.github.com/mlochbaum/7664a046d294dcce2f6cdb1db6...

[1] https://github.com/mlochbaum/rhsort

[2] https://github.com/mlochbaum/BQN

    ./wolfbench.sh
    mkdir res
    gcc -O3             -D NOTEST bench.c && ./a.out l > res/r_rh.txt
    gcc -O3 -D FLUXSORT -D NOTEST bench.c && ./a.out l > res/r_flux.txt
    gcc -O3 -D WOLFSORT -D NOTEST bench.c && ./a.out l > res/r_wolf.txt
    g++ -w -fpermissive -O3 -D SKACOPY -D NOTEST bench.c && ./a.out l > res/r_ska_.txt

    cd glide && cargo build --release && cd ..

    gcc -O3 -D GLIDESORT -D NOTEST bench.c -Lglide/target/release -lglide
    LD_LIBRARY_PATH=glide/target/release ./a.out l > res/r_glide.txt 
    images/line.bqn res/r_{flux,wolf,ska_,glide,rh}.txt > images/rand_glide.svg


Try running with the latest nightly Rust (`rustup update nightly`) and with `cargo +nightly build --release`. EDIT: I'm sorry, you need to specify this in your Cargo.toml, my earlier command was incorrect:

    glidesort = { version = "0.1.1", features = ["unstable"] }
To get the best performance I use specialization for Copy types (such as integers) to compete with the assumptions that fluxsort and such can make. Specialization is not available on stable Rust (unless glidesort were to be merged into the standard library), so without it I have to be more conservative with bounds checks. And even then there are a variety of places where I sacrifice performance compared to fluxsort and such in the name of safety and the limitations that Rust gives me (panic safety in particular has cost me 10-15% performance).

Also when you say x86, I assume you mean x86-64? I have not tested raw x86. What is your actual processor?

EDIT 2: I believe this should also go in the Cargo.toml of glide in your repository, not in a .cargo/config

    [profile.release]
    lto = "thin"


Updating my system, going to take a little while... Partner in crime dzaima has tested (nightly, unstable, lto) and found it is somewhat better. Eyeballing, still ~20% slower than fluxsort. We're on fairly old Intel x86-64 CPUs, i5-6200U for me and i3-4160 for him.

Does panic safety really cost 10-15% when sorting integers according to default comparison?


I get these results for your benchmark on my machine: https://gist.github.com/orlp/d7123ede0143e15ab0f8a435b7dd10b...

Either way, I am not surprised that glidesort is a bit slower than fluxsort on older machines. They're less likely to take advantage of the interleaving techniques I use.

> Does panic safety really cost 10-15% when sorting integers according to default comparison?

Yes :(

At least, yes, if you don't want to duplicate your entire codebase. Look into the code of glidesort if you want to: all algorithm state has to be encoded inside structs at all times, to allow the destructor to recover in case a comparator panics.

Compared to keeping the state in simple local variables, this negatively affects the optimizer, for reasons I can't fully explain.


Makes sense regarding instruction-level parallelism; someone else with an M2 measured and found performance near identical to fluxsort. What did you take those results on? Collected graphs below.

Orson: https://matrix.org/_matrix/media/r0/download/matrix.org/ldcd...

i3-4160: https://matrix.org/_matrix/media/r0/download/matrix.org/OBwS...

(edit) i5-6200U update: https://matrix.org/_matrix/media/r0/download/matrix.org/vzDr...

M2: https://matrix.org/_matrix/media/r0/download/matrix.org/ZRTj...


My results are on a 2021 Apple M1 MacBook Pro.

And I'd like to reiterate it's not 100% apples to apples comparing fluxsort and glidesort directly. One would need to either completely implement glidesort in unsafe C, throwing Rust-isms such as move-only types and panic safety out of the window, or implement fluxsort in Rust with the move-only restrictions and with the ability to restore all data in case of a panic. Only then could you truly compare the algorithms 100% fairly.


In this particular case it sounds like Rust is just making things more tedious rather than providing value.


The end result is a generic sorting algorithm for Rust, which can be used for a lot of things /in Rust/, so it adds value to the Rust ecosystem.

The sorting algorithm doesn't add anything without users :) so for non-academic implementations, maybe using it in Rust projects is what the author wants. Rust is also quite open to contributions, so it might be exciting if there's a chance to have it in Rust std.


I understand that it is written in Rust for Rust's sake, but in this case it looks like it doesn't really provide any benefit to implement such an algorithm particularly in Rust due to Rust's memory model making it harder to sort integers efficiently.


I think such issues can be resolved with some collaboration, it's too early to say right now at least.


Why does it matter if a comparator panics “using the destructor”? It’s unclear what this means.


Basics of Rust ownership semantics. Let's say you sort T values. T is generic, so you don't really know what it is, but maybe it's strings, for example (allocated String values).

The sorting algorithm moves around these values in memory to sort them.

Panic is implemented by unwinding: this means destructors of live values are called in each scope and the unwinding steps up into the calling frame and repeats the same thing. Eventually you might reach some place where other code (an "external observer") inspects the vector and values you were just sorting. So a panic in the middle of your algorithm risks exposing any temporarily out of order state to the world.

Thus, you can't leave any state out of order if it might become visible externally. Or you set up a destructor of your own that puts it back in order if a panic happens, so that it's not visible outside.

Unsafe blocks in Rust allow violating invariants temporarily in some cases. But all exits out of the function, the successful ones as well as any unwinding/panicking exits, need to present data structures and values that are "in order".

Examples of ownership rules: you can't copy these values; they need to exist exactly once inside the vector. Not zero times, or they never drop (not invalid, but a serious code smell), and not multiple times, or we violate ownership.


Thank you for your reply :), I understand now - you can avoid some of the overhead by not worrying about this for some types (i.e. simple ones like i32 without any custom drop/comparators) but there isn’t a way to generalise this and so you’d need to duplicate the implementation.

Interesting. Are you sure the trade-off is worth it? This seems like quite a specialist sorting implementation; restricting it to primitive types, or even adding a “please don’t panic” caveat, for a 15% performance improvement seems like it would be an OK trade-off (even if it’s not very rusty).


For some operations it's actually easy to optimize for simple types, because `if std::mem::needs_drop::<T>()` is a valid conditional: https://doc.rust-lang.org/nightly/std/mem/fn.needs_drop.html

Not everything can fit into that, but there's a lot that can be done. And a lot of generic code can be omitted when instantiated for simpler types, etc.
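A minimal sketch of such a `needs_drop` branch (hypothetical function, not glidesort's code). The condition is known at compile time per instantiation, so the dead branch is optimized away entirely:

```rust
use std::mem::needs_drop;

// Sketch: pick a code path based on whether T has a destructor. For
// trivial types like i32 the condition constant-folds to false and the
// careful branch disappears from the generated code.
fn path_for<T>() -> &'static str {
    if needs_drop::<T>() {
        "careful: every value must stay owned exactly once at all times"
    } else {
        "fast: plain memory copies, no destructors can run"
    }
}
```

Note this only distinguishes "has a destructor" from "doesn't"; it cannot express "is Copy", which is why the unguarded-merge optimization still needs nightly specialization.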


If you want your sort to replace the one in the stdlib I imagine you have to adhere to the strict rules.


I recently upgraded fluxsort, you might want to give v1.1.5.4 a spin. I also updated the benchmark to use actual natural runs and be less favorable to rhsort, sorry. ;-)

I'm not surprised fluxsort is slightly faster, it's heavily optimized for gcc -O3.

Since the primary techniques that give glidesort its speed were copied from fluxsort and quadsort I'm expecting very similar performance, and possibly significantly worse performance on strings due to accessing 4 memory regions.

Have you gotten it working with the wolfsort benchmark?


I did actually test out 1.1.5.4 in advance of this. I just used saved results instead of the new version for comparison because performance appears very similar on random inputs. I do see a lot of improvement in quadsort and most of the special-case benchmarks.

I was planning to update my repository to grab the appropriate files from the fluxsort repository but if you'd update the wolfsort repo I wouldn't have to. Although maybe I should switch over to cloning fluxsort instead.


I'm planning to update wolfsort soon-ish. I did some work on a dropsort hybrid, like rhsort, though it's slower overall because I'm not quite brave enough to match the level of insanity (I mean this in a good way) that rhsort engages in.

The latest fluxsort is probably identical on random, the new analyzer might perform slightly better on modern hardware.

Probably safer to stick with wolfsort as I do tend to make sure that works with rhsort when I update.


I went ahead and updated bench.c for the wolfsort github.


I get an error about goto small_range_test jumping over a lot of declarations. If I remove that goto/label, it all works very nicely, just have to link rhsort and modify sorts. Thanks!


That code was indeed a bit unorganized, I updated bench.c in wolfsort's github, hopefully that'll fix it.

Would you mind giving your 2 cents on this topic?

https://news.ycombinator.com/item?id=34650406

You are probably more familiar with this topic than most.

It is my impression that orlp suggests the only relation between glidesort and fluxsort is that they're both stable quicksorts.

Similarly, orlp suggests there is no relationship between his branchless merge and quadsort's branchless merge.

On his github he mentions timsort 2 times, and powersort 5 times. So he has no problem giving prior credit, but at least to me, it appears he tries to take prior or co-inventor claim for quadsort's branchless merge techniques and suggests a minimal influence from fluxsort.

There is also no mention of the massive performance gain by utilizing the unguarded aspect of quadsort's parity merge for small array sorting. This is at least 20% of glidesort's performance gain.

Perhaps I'm being overly sensitive?


Oh, what are you complaining about? I got 1-byte counting sort running at <1ns/element down to 1,000 elements and no one will even rip me off!

I don't particularly feel that contributions from fluxsort and quadsort are being minimized in Orson's writing/speaking. The talk in May and README are both pretty abbreviated and the places you should expect full attribution are the FOSDEM talk and upcoming paper; first looks good. We all tend to think of our own work as more important than it really is. And to overlook differences and assume it's more similar to other work.

I haven't looked into quadsort deeply and still haven't had any time to seriously read glidesort, so I can't comment on the details of merge implementations. The places I see influence are in merging 4 arrays at a time (which I did read about before quadsort in "Patience is a virtue", but I think neither I nor Orson took that paper very seriously) and in the bidirectional idea which is credited in the FOSDEM talk.


It is said that getting ripped off is the highest form of flattery.

The FOSDEM talk indeed addressed my worries.

I actually don't see the ping-pong merge as a personal accomplishment, it's not that novel a concept, at best I popularized it. The actual performance gain from it is minimal, maybe 1%, though that is perhaps the most interesting thing, that data moves are practically free. And I would like to take full credit for that observation!! :-)


> It is my impression that orlp suggests the only relation between glidesort and fluxsort is that they're both stable quicksorts.

So what is the relationship then, in your eyes? I implemented my own branchless partition operator that switches between overwriting / cmov'ing depending on data size (https://github.com/orlp/glidesort/blob/56cacab4e378e75ab6feb...), my own bidirectional partitioning scheme, my own pivot selection scheme, my own low-cardinality scheme based on my prior work.

As I wrote before "fluxsort's out-of-place stable partitioning. From this I got reminded that not only is out-of-place stable partitioning a thing, it's highly competitive". This is also what I credit on my page: "credit to Igor van den Hoven's fluxsort for demonstrating that stable quicksort can be efficient in practice".

And glidesort isn't a quicksort, it is truly a hybrid.

> Similarly, orlp suggests there is no relationship between his branchless merge and quadsort's branchless merge.

I implemented it from scratch (10+ different versions), myself:

https://github.com/orlp/glidesort/blob/56cacab4e378e75ab6feb...

You can still see artifacts of all the things I've tried there as well as in the select function:

https://github.com/orlp/glidesort/blob/56cacab4e378e75ab6feb...

I was confident branchless merging could work efficiently. You could say quadsort contributed to the confidence it would work. But I did not use your branchless merge, period.

> So he has no problem giving prior credit, but at least to me, it appears he tries to take prior or co-inventor claim for quadsort's branchless merge techniques and suggests a minimal influence from fluxsort.

I will edit the readme to explicitly mention that glidesort's bidirectional merging is inspired by and an extension of quadsort's parity merge. But you didn't invent the branchless merge, and neither did I. We both came up with our own versions, both after the "Branch Mispredictions Don't Affect Mergesort" paper.

> There is also no mention of the massive performance gain by utilizing the unguarded aspect of quadsort's parity merge for small array sorting. This is at least 20% of glidesort's performance gain.

It's not 20%, at least on my machine. This optimization is precisely what the `--features unstable` flag enables, because this optimization is not sound to do for non-Copy types (and thus needs specialization in Rust). So it's easy to test; for me it's an 11% speedup over the branchless guarded merge: https://github.com/orlp/glidesort/blob/56cacab4e378e75ab6feb...

Yes, I also made the guarded merge branchless. I don't believe quadsort does that, because it doesn't have to deal with elements that may not be copied. You can see the different selection of code based on specialization here: https://github.com/orlp/glidesort/blob/56cacab4e378e75ab6feb...


I've just watched your FOSDEM presentation and to put it simply, thank you for addressing all the issues I had.

I felt pretty uneasy about what I perceived as a situation where you could end up taking full credit for innovations I was first to develop and publish. I do realize this could be all in my head. Anyways, I'm at ease now and I hope my assertions didn't create any ill will between us.

As for your arguments, they're pretty sound and I find no reason to doubt you. It was never my intention to appear dismissive towards your work, writing glidesort was no easy matter and the merge routines are a significant improvement upon Timsort/Powersort.


This got me thinking about sorting.

I wonder how many times per day (or, indeed, per second) my computer does sorting of any kind. I just checked a largish codebase I'm working on, and about 0.12% SLOCs invoke the standard library's sort. But that's just a random codebase. What about the kernel, the browser, the VMs, the graphical subsystem of the OS? Could I capture all these sorts' inputs to generate My Real-Life Sorting Corpus of One Day's Worth of Data? And then compare the performance of different sorting algorithms on that corpus?

I'm pretty sure this kind of research has been done before. Can someone point me towards relevant literature?


> Could I capture all these sorts' inputs to generate My Real-Life Sorting Corpus of One Day's Worth of Data?

You could wrap the lib containing the sort function and write the input to some file.

It is easy if it is C, at least.


It is easy enough to wrap libc's qsort() with LD_PRELOAD, sure. The problem is heterogeneity: qsort() is by far not the only sort implementation the machine is running. Even identifying them all might be a daunting task, not to mention figuring out how to wrap them.


Sounds like the sort of thing dtrace or bpftrace might be able to do.


Sorting and searching are like 50% of what the CPU ever does. The benefits of groundbreaking improvements there are huge. Makes me think of Proebsting's law; I don't know if it applies here.


This is not true. Power- and transistor-wise, most of what a CPU does isn't even about running instructions; it is instruction reordering, caching, speculation, prefetching, write queueing and other techniques to minimize the effects of memory latency.

Instruction-wise this is not true either. Sorting is an O(n log n) problem and log n is never going to be above 64. A CPU spending lots of time sorting implies that it is constantly throwing away the sorted results.


Pretty sure something like 80% of CPU time and power is spent pointer chasing in bad OO code.


I'd imagine that or string ops.


Not at all. Try running perf against your computer and you’ll see differently.


I'm interested in seeing a comparison with blitsort. Like glidesort, it's stable, adaptive, and can be configured to use constant auxiliary memory.


If the last presentation's benchmark still holds, blitsort is significantly faster.

Keep in mind that glidesort is partly derived from fluxsort and many of its techniques were directly inspired by fluxsort/quadsort, while several others are implemented in a similar or near-identical manner. This is in my opinion not adequately explained in the presentation on youtube.

The main difference is that glidesort uses timsort's approach to adaptive sorting rather than quadsort's, though it uses quadsort's novel method for cost-effective branchless merging.

Glidesort tries to access 4 memory regions instead of 2, primarily with the bidirectional partition, an approach I implemented in 2021, but rejected because it was causing degraded performance for expensive comparisons. This does however result in a very different visualization which can give the impression it is unrelated. The general approach to stable partitioning is likely nearly identical to fluxsort.

As for why blitsort is fast, it's likely due to trinity rotations and monobound binary searches. Possibly also a more optimal overall partitioning strategy.

I haven't spent too much time on blitsort however, so there are likely further improvements to be made.


> Keep in mind that glidesort is partly derived from fluxsort and many of its techniques were directly inspired by fluxsort/quadsort, while several others are implemented in a similar or near-identical manner. This is in my opinion not adequately explained in the presentation on youtube.

We've had an extensive email discussion on this, and I still have to publicly provide counterweight.

Yes, I did look at fluxsort/quadsort (mostly the README descriptions) while developing glidesort, I did get inspired by fluxsort's stable quicksort performance. And I did take quadsort's 'parity merge' idea for small sorts, as I described here 9 months ago: https://news.ycombinator.com/item?id=31101498. I'm open about this.

But you're not the first to do branchless merging, I was already aware of that from "Branch Mispredictions Don't Affect Mergesort". Ping-pong merges, it turns out, were already described in "Patience is a Virtue: Revisiting Merge and Sort on Modern Processors".

The observation that the reason the 'parity merge' was so fast being instruction-level and memory-level parallelism is my own (you added this to the README after I publicly described this in the above linked comment). That we can then use it more generally as a bidirectional merge is my own. Using merge splitting and lazy merging to then apply this to more merges is my own. Interleaving multiple independent merge loops is my own. Extending the quicksort to then also use bidirectional partitioning is also my own idea (I don't have a crystal ball about you implementing and rejecting the idea in 2021). My pivot selection is my own. The whole extension of powersort to use lazy merging and more merge splitting for ping-pong merge opportunities is my own. The novel robustly adaptive combination of merge and quicksort is my own. I've spent months and months optimizing glidesort. Glidesort tackles problems fluxsort/quadsort don't even have to worry about in being unable to copy values, only move them, and dealing with panics from user-provided comparator functions, and comes out with minimal losses.

You do great work Igor, but I would appreciate it if you would stop trying to claim mine as (mostly) yours.


This is what quadsort's README stated in early 2021:

"Since the parity merge can be unrolled it's very suitable for branchless optimizations to speed up the sorting of random data. Another advantage is that two separate memory regions can be accessed in the same loop with no additional overhead. This makes the routine up to 2.5 times faster on random data."

I described what made it fast and it was obvious there was general applicability. If I understand your reasoning correctly, because you took inspiration from my more advanced unguarded two-directional merge, you feel like a guarded two-directional merge is some new novel discovery.

This is a very contentious line of reasoning.

Subsequently you proceed to give me zero credit for branchless merging in your Youtube presentation, instead referencing a branchless merge that gave a 10-20% speed-up instead of quadsort's 150% speed-up. You also claimed in the presentation there were no bidirectional branchless algorithms written to date; this is simply not true.

This does upset me. It would be a simple matter, and the right thing, to point out that quadsort introduced bidirectional branchless merging.

As for your work to increase the memory regions from 2 to 4. This is easily conceived, but indeed quite challenging to actually implement.


I will retract that you didn't mention memory-level parallelism back then, I lazily looked at the git blame and assumed it was introduced then. That was sloppy of me, sorry.

> I described what made it fast

I still partially disagree. Memory-level parallelism is one aspect but I think instruction-level parallelism and hiding data dependencies from iteration to iteration are the bigger factors. That's mainly what I chose to fully exploit in glidesort by actively splitting merges into smaller merges, and implementing bidirectional quicksort. That was the main takeaway from my glidesort talk.

> and it was obvious there was general applicability. If I understand your reasoning correctly, because you took inspiration from my more advanced unguarded two-directional merge, you feel like a guarded two-directional merge is some new novel discovery.

The novelty is pointing out that also the hiding of data dependencies and instruction-level parallelism is making it fast, and to then go on and exploit this to the maximum. E.g. if I turn off the merge splitting in glidesort (reducing the amount of parallel merges) performance degrades 17.5% on my machine.

Your README explicitly called them parity merges, and said that the arrays have to be of equal length. You might find the extension to unbalanced merges and using it for all merges instead of only the smallsort obvious; I disagree. If it is obvious... why didn't you do it? Correct me if I'm wrong but I was under the impression quadsort only uses the bidirectional merges for the small base case where arrays have statically known (equal) sizes.

> Subsequently you then proceed to give me zero credit for branchless merging in your Youtube presentation, instead referencing a branchless merge that gave a 10-20% speed-up instead of quadsort's 150% speed-up.

Because I was already aware of it, from the earlier work. That they implemented it poorly is beside the point. You and I both know that a bad implementation of a good idea in sorting can easily destroy all performance gains.

The part that was novel to me from your work on quadsort was eliminating the bounds check on perfectly balanced merges and the original idea of merging from both ends. That I credit you for.

> You also claimed in the presentation there were no bidirectional branchless algorithms written to date, this is simply not true.

> This does upset me. It would be a simple matter, and the right thing, to point out that quadsort introduced bidirectional branchless merging.

I don't think I claimed that. And I have modified the slides now to explicitly mention quadsort on the slide for bidirectional merging.


As I wrote previously, thank you for fully addressing all my concerns in your recent FOSDEM presentation.

While it would have been outside the scope of the presentation, time being short, quadsort does present an interesting alternative to the merge length problem powersort tries to solve.

I've honestly been unable to detect any notable instruction-level parallelism on my own hardware. I suspect there may not be any? It'll be interesting to get to the bottom of this.

As for quadsort, it does contain a guarded bidirectional merge. I published it after you started working on glidesort however. It was always on my todo list, but I do get tired of programming from time to time.

As for unguarded parity merges, it was indeed one of my brighter moments when I came up with that.

Anyways, once again, thanks for addressing my concerns in the FOSDEM presentation, and if it wasn't clear, I think you did some really excellent work on glidesort.


I haven't done a comparison yet (because I haven't gotten to really benchmarking the low-memory performance), but I will include it in the paper.

The fact that glidesort can use small amounts of memory is honestly a happy coincidence of the merge-splitting technique that I only realized during development; it was not a primary goal.


The ability to work with different memory sizes is pretty intriguing. How does the algorithm decide where and when to allocate?


The allocation always happens once, at the start of the algorithm.

The default allocation size is hard to describe in plain English, whereas the code is quite readable:

    const FULL_ALLOC_MAX_BYTES: usize = 1024 * 1024;
    const HALF_ALLOC_MAX_BYTES: usize = 1024 * 1024 * 1024;

    fn glidesort_alloc_size<T>(n: usize) -> usize {
        let tlen = core::mem::size_of::<T>();
        let full_allowed = n.min(FULL_ALLOC_MAX_BYTES / tlen);
        let half_allowed = (n / 2).min(HALF_ALLOC_MAX_BYTES / tlen);
        let eighth_allowed = n / 8;
        full_allowed.max(half_allowed).max(eighth_allowed)
    }
Essentially the default allocation size is a linear graph M(n) = n until 1 MiB, where it plateaus until overtaken by M(n) = n / 2, which continues until 1 GiB, where it once again plateaus until overtaken by M(n) = n / 8.

Note that this is only the default, you can pass glidesort a buffer with a size/location of your choice and it will use that instead.
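For a concrete feel, here is the snippet above repeated verbatim (so the worked example is self-contained), evaluated for 4-byte elements:

```rust
// Same constants and function as quoted above, for a worked example.
const FULL_ALLOC_MAX_BYTES: usize = 1024 * 1024;
const HALF_ALLOC_MAX_BYTES: usize = 1024 * 1024 * 1024;

fn glidesort_alloc_size<T>(n: usize) -> usize {
    let tlen = core::mem::size_of::<T>();
    // Full-size buffer, capped at 1 MiB total.
    let full_allowed = n.min(FULL_ALLOC_MAX_BYTES / tlen);
    // Half-size buffer, capped at 1 GiB total.
    let half_allowed = (n / 2).min(HALF_ALLOC_MAX_BYTES / tlen);
    // Eighth-size buffer, uncapped.
    let eighth_allowed = n / 8;
    full_allowed.max(half_allowed).max(eighth_allowed)
}
```

So with u32 elements: 100,000 elements get a full-size buffer (100,000), 10 million elements get n/2 (5,000,000), and 1 billion elements hit the 1 GiB cap (268,435,456 elements), which holds until n/8 overtakes it past roughly 2 billion elements.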


I used the word "allocate" imprecisely. (The plain English description in the README did a good job of explaining the up-front allocation needs.) What I meant to ask:

How does the algorithm stay bounded in memory usage? Is glidesort written in such a way as to never use more memory than that? Is there some sort of back-pressure mechanism for when most of the allocated memory is in use and another operation is ongoing? Basically: when sorting a 1 GiB array of data, how does the algorithm keep from using more than 128 MiB of additional memory? Nothing in the "Technique Overview" section mentions bounded memory use (under n bytes, at least).


> How does the algorithm stay bounded in memory usage? Is glidesort written in such a way as to never use more memory than that?

It does not ever allocate more than what it allocated at the start.

> Is there some sort of back-pressure mechanism for when most of the allocated memory is in use and another operation is ongoing?

Yes and no. I did write some code that will make glidesort back off and use less memory if the allocation request fails:

https://github.com/orlp/glidesort/blob/master/src/lib.rs#L20...

The sad part is that this will probably never work in reality due to overcommit.

> Basically: when sorting a 1 GiB array of data, how does the algorithm keep from using more than 128 MiB of additional memory?

By allocating 128 MiB at the start, and never allocating more than that.

Maybe what you really want to know is: how can it do merges in space (much) smaller than the input? The answer is by (recursively) splitting merges into two separate smaller merges using an in-place block swap. You can see this in action at the 10 to 12 second mark here: https://user-images.githubusercontent.com/202547/216675278-e...

Because I use block swaps and not rotations, this block swap can often be done implicitly (just not in the case of low-memory fallbacks) by re-assigning buffer names rather than physically moving elements.
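For intuition, here is a classic rotation-based sketch of merge splitting (SymMerge-style, with a hypothetical function name; glidesort itself prefers block swaps and buffer renaming over physical rotations, so this only illustrates the recursive-splitting idea):

```rust
// Merge the sorted runs v[..mid] and v[mid..] in place by recursively
// splitting one big merge into two smaller, independent merges.
fn merge_in_place(v: &mut [i32], mid: usize) {
    let (left_len, right_len) = (mid, v.len() - mid);
    if left_len == 0 || right_len == 0 {
        return;
    }
    if left_len == 1 && right_len == 1 {
        if v[0] > v[1] {
            v.swap(0, 1);
        }
        return;
    }
    // Split at the midpoint of the longer run, binary-searching the
    // matching split point in the other run. Ties keep left-run elements
    // first, preserving stability.
    let (i, j) = if left_len >= right_len {
        let i = left_len / 2;
        let pivot = v[i];
        (i, v[mid..].partition_point(|x| *x < pivot))
    } else {
        let j = right_len / 2;
        let pivot = v[mid + j];
        (v[..mid].partition_point(|x| *x <= pivot), j)
    };
    // Rotate the middle section so each half holds one smaller merge:
    // [L0 | L1 | R0 | R1] becomes [L0 | R0 | L1 | R1].
    v[i..mid + j].rotate_left(mid - i);
    let new_mid = i + j;
    merge_in_place(&mut v[..new_mid], i);
    merge_in_place(&mut v[new_mid..], mid - i);
}
```

Each level of splitting halves the work handed to the scratch-free path, which is where the roughly O(log (n / s)) recursion depth mentioned above comes from.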


> How can it do merges in space (much) smaller than the input?

That's part of it. More specifically I'm interested in the dynamic nature of the space overhead. In the video you linked, from 10 to 12 seconds, it looks like the algorithm is using about n/4 scratch space. But for larger arrays, it's bounded at n/8. Is that done by further recursing merges for larger merge sizes? Is the number of recursive layers needed to do a big merge determined by the size of the input, or the size of the scratch space?

I see that try_merge_into_scratch will fail when the scratch space isn't big enough, which mostly answers my question https://github.com/orlp/glidesort/blob/master/src/physical_m.... But I'm still not sure how that failure causes the overall algorithm to change

(I have been trying to ask about how glidesort uses the preallocated scratch space internally, not how it interfaces with system memory and system allocators.)


> Is that done by further recursing merges for larger merge sizes?

Yes.

> Is the number of recursive layers needed to do a big merge determined by the size of the input, or the size of the scratch space?

Both. I haven't worked out the exact math yet, but it's likely something on the order of O(log (n / s)) recursive layers.

> I see that try_merge_into_scratch will fail when the scratch space isn't big enough, which mostly answers my question

If you are reading the code, the whole small-memory magic happens in physical_merge: https://github.com/orlp/glidesort/blob/master/src/physical_m....

Lines 46-59 are the large-enough-memory happy path. Lines 60-65 are the low-memory in-place recursive path.

All other merge functions use try_merge_into_scratch as a happy path, and if that fails they defer to physical_merge.


Awesome. Thanks for the explanations!


So under-allocating memory would merely negatively affect performance?


Correct.


Pretty neat! So it looks like you have a few things happening in parallel. What is instruction-level parallelism and how does it speed the sort up?


Modern processors execute out-of-order (reordering instructions) and are superscalar (they execute more than one instruction per cycle). By providing multiple independent data paths in the tight merging/partitioning loops we can keep the CPU busier than a traditional merge/partition loop and have higher throughput.

Note that this all happens on a single thread; it is in addition to thread-level parallel sorting (which glidesort does not (yet) have).

Also note that this is different from SIMD, which processes more data in a single instruction. That can give you incredible speed-ups, but the code is very dependent on the data (e.g. SIMD code for sorting integers looks very different from SIMD code for sorting strings). Glidesort is written using scalar code with fully generic comparison functions, with no SIMD-specific speedups.


Would a simple example of this be something like,

> In a traditional "merge" algorithm, the second comparison depends on the output of the first, so it can't be done at the same time. So it might make sense to merge "from the left" and merge "from the right" at the same time because those things can be done in parallel when both arrays have length > 1.

?


That is exactly right.


So how does this compare to PDQ, Crum, flux, etc? Do you have preliminary results?


I have some preliminary results I can share from the current draft paper:

https://i.imgur.com/AUOJ1w0.png

https://i.imgur.com/AP6DnIS.png

Note that the four bottom sorts (RustSort, StdSort, Pdqsort, IPS4o) are not stable. GlidesortN is glidesort using n elements of memory, Glidesort is the default memory setting and Glidesort1024 uses a fixed 1024 element buffer.

Arm is a 2021 Apple M1, Amd is a 2018 Threadripper 2950x.


Those are some awfully large array sizes. Not something I ever had the patience to sit through, or optimize for, as it didn't seem like the most common real-world scenario.

One thing glidesort appears to do extremely well is branchless merging until the very end; this gets quite tricky when memory-constrained. It does not seem worth it intuitively, but judging from the performance of rotate mergesort, it probably is.


Is ips4o parallelised here?


No, it is running serially. IPS4o is inherently more cache efficient due to having a very wide partition operator, making it better on machines with smaller caches, and better in general for strings (where no matter how cache-local your algorithm is, the string pointer indirection mostly destroys it).


Note that the author also is the author of PDQ.

And there was some prior discussion on HN here: https://news.ycombinator.com/item?id=33827843


Anyone have a C++/C port yet? :)

Would love to try this out, though the source looks... difficult for me to do a naive port.


Any source available for your benchmarks here? I'd like to benchmark against pdqsort, fluxsort, etc. but the existing benchmarks use C so I either have to call glidesort from C or write my own Rust benchmark to get any numbers at all.


Not yet. I do aim to have open code for a benchmark including pdqsort, fluxsort, etc, accompanying my upcoming paper on glidesort. This would be in a separate repository. That code is very much... 'academic' right now, and needs polishing before release.

To compare across languages I simply implemented the exact same benchmark twice, once in Rust and once in C++.


Seriously, working on the default/built-in sorting implementation must suck. You can make the theoretically optimal choice, but get utterly hosed because in practice you get the "wrong" distribution of input data.

I recognize that in practice vs in theory is often an issue and have had to deal with it myself in the past, but seriously sorting seems specifically awful to deal with.


On the rare occasion I need to roll my own sort, I go for Heapsort, even though it isn't stable, because it's the simplest n log n worst case option I know.
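For reference, a minimal heapsort sketch (my own, standard textbook version): not stable, but O(n log n) worst case, in-place, and short enough to write from memory.

```rust
// Build a max-heap, then repeatedly move the max to the end.
fn heapsort(v: &mut [i32]) {
    let n = v.len();
    for start in (0..n / 2).rev() {
        sift_down(v, start, n);
    }
    for end in (1..n).rev() {
        v.swap(0, end);
        sift_down(v, 0, end);
    }
}

// Restore the heap property for the subtree rooted at `root`,
// considering only v[..end].
fn sift_down(v: &mut [i32], mut root: usize, end: usize) {
    loop {
        let mut child = 2 * root + 1;
        if child >= end {
            return;
        }
        if child + 1 < end && v[child] < v[child + 1] {
            child += 1; // pick the larger child
        }
        if v[root] >= v[child] {
            return;
        }
        v.swap(root, child);
        root = child;
    }
}
```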


Oh yeah, there are occasions it might be necessary. I'm more feeling bad for the people writing the generic system/platform ones that have to be magically "the best" for everyone all at the same time, and glad that's not my problem :D


I think it's fair to say "quicksort is not stable" if you want. You need some definition of what the parameters of the algorithm are when you describe it.

What should be mentioned is that there's a variant (requiring memory XYZ) that is stable. It would be fair to call it a variant of the quicksort algorithm.


Looking forward to your talk tomorrow :)


Hope to see the video of the presentation at FOSDEM.


It's already up (haven't watched through yet)

https://fosdem.org/2023/schedule/event/rust_glidesort/



