
Djbsort: A new software library for sorting arrays of integers - xmmrm
https://sorting.cr.yp.to/
======
yoklov
The problem with AVX2 accelerated code (and much of AVX) is that unless you
have a lot of it to run you end up with a substantial speed hit that comes
from the cost of switching to a different power bin (which often takes 1 or
2ms!) and then running at a lower clock speed.

This often still ends up being an improvement over scalar code (at the cost of
higher power usage), but for occasional workloads that don't need to do
multiple milliseconds of AVX instructions you tend to have better results from
4-wide vectors, which don't have this cost.

~~~
puzzle
You also have penalties from context switches, although you can reduce their
impact by performing them in a lazy fashion. AVX512 is even worse, of course.

~~~
jeffreyrogers
You need a context switch to use AVX?

~~~
puzzle
No, unless you have a special setup where one process on the whole machine can
use AVX. That might make sense for special, controlled environments.

What I meant is that there are more registers to shuffle around at context
switch time, when multiple processes use the extensions.

~~~
jeffreyrogers
Ah got it, thanks.

------
lixtra
The main feature of the algorithm is to sort fast in constant time (for fixed
n) for cryptographic purposes [0].

[0]
[https://ntruprime.cr.yp.to/ntruprime-20170816.pdf](https://ntruprime.cr.yp.to/ntruprime-20170816.pdf)
P. 48
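
For illustration, here is a minimal sketch of what "constant time for fixed n" means for a sort: the sequence of positions that get compared depends only on the array length, never on the values. It uses odd-even transposition sort, a simple O(n^2) sorting network picked for brevity. This is not djbsort's actual network, and the function names are made up here. Python itself gives no timing guarantees, so the sketch only shows the data-independent control flow.

```python
def compare_exchange(a, i, j):
    """Put the smaller of a[i], a[j] at i and the larger at j."""
    lo, hi = min(a[i], a[j]), max(a[i], a[j])
    a[i], a[j] = lo, hi


def network_pairs(n):
    """Yield the (i, j) index pairs of the network; they depend on n only."""
    for rnd in range(n):
        # Even rounds touch pairs (0,1), (2,3), ...; odd rounds (1,2), (3,4), ...
        for i in range(rnd % 2, n - 1, 2):
            yield i, i + 1


def oblivious_sort(values):
    """Sort a copy of values using a fixed, input-independent comparison schedule."""
    a = list(values)
    for i, j in network_pairs(len(a)):
        compare_exchange(a, i, j)
    return a
```

Two inputs of the same length always go through the same list of compare-exchange steps. A real constant-time implementation would also need the compare-exchange itself to be branch-free, which is a matter for C or assembly rather than Python.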

------
pushcx
The library doesn't have a license on the site or in the tarball. From djb's
previous writing and software, he probably intends it to be license-free
software, which is an uncommon situation worth investigating before use:
[https://en.wikipedia.org/wiki/License-free_software](https://en.wikipedia.org/wiki/License-free_software)

(Trying to describe this neutrally because I've seen enough bickering about it
over the last ~20 years and don't have strong feelings about it.)

~~~
Tepix
Three of the source code files contain a notice that they are in the public
domain:

* cpucycles/mips/cpucycles.c

* cpucycles/cortex_vct/cpucycles.c

* cpucycles/cortex/cpucycles.c

Regarding the missing license:

I guess if you download the software from his website you are not allowed to
distribute it yourself. Is that correct?

~~~
masklinn
> Three of the source code files contain a notice that they are in the public
> domain:

Which makes it legally dodgy to dangerous in mainland Europe; either way, it
is certainly not reliably licensed.

~~~
lisper
I hear people raise this concern a lot, but I think it is without foundation.
If something is in the public domain in the U.S. then anyone can use it for
any purpose, including releasing it under whatever license they want. Of
course, any constraints imposed by that license will be unenforceable since
any user of the software can claim to be using it under the terms of some
other license or, of course, as part of the public domain. But if you think
you need it licensed, you can have it licensed.

~~~
johannes1234321
Anybody under U.S. jurisdiction can follow U.S. law.

However, if I, my company, and everything else are in Europe and I
use/redistribute the code in Europe, I must follow applicable European law. If
that law doesn't accept this form of public domain, the author (rights owner)
could sue me, and it would be up to the judge how sensible they are. (This is
mostly theoretical: if the author decides to put it in the public domain per
U.S. law, they most likely don't want to restrict it to the U.S.)

For an example see the recent case about project Gutenberg
[https://news.ycombinator.com/item?id=16511038](https://news.ycombinator.com/item?id=16511038)

~~~
lisper
That was a completely different situation. In that case, the material was
still under copyright in Germany. In this case, the material has been placed
in the PD _by the original author_ and so no one in the world can possibly
have any legal claim on it.

But if this really concerns you, I would be happy to provide you -- or anyone
else -- with a licensed copy of any of DJB's code for a modest processing fee.

~~~
masklinn
> In that case, the material was still under copyright in Germany.

Which is exactly the case of djb's work here.

> In this case, the material has been placed in the PD by the original author
> and so no one in the world can possibly have any legal claim on it.

Wrong. djb _and any possible heir of his_ do, because you can't place things
in the public domain in mainland Europe.

> But if this really concerns you, I would be happy to provide you -- or
> anyone else -- with a licensed copy of any of DJB's code for a modest
> processing fee.

Unless djb specifically gave you license to do so, your "licensed copy" is
worth exactly as much as the original public domain dedication is. As far as
European law is concerned, you have no rights to the work, and thus certainly
don't have the rights to relicense it.

~~~
johannes1234321
> Wrong. djb and any possible heir of his do, because you can't place things
> in the public domain in mainland Europe.

However, a court could interpret this as a royalty-free license, at least
until the moment they start suing.

~~~
masklinn
That is true, but there currently is no precedent for _that_. Until someone
becomes the "sacrificial lamb" by 1. taking the risk and 2. being sued for it,
anyone intending to build a business is understandably wary and unlikely to
touch public-domain-dedicated assets.

------
Ono-Sendai
The median sort time for sorting 1048576 elements with djbsort is 61467822
cycles:
[https://sorting.cr.yp.to/speed.html](https://sorting.cr.yp.to/speed.html)

On, say, a 3.6 GHz processor that would be around 17 ms. So the number of
elements sorted per second would be around 61.4 M elements/s.

My parallel radix sort can sort floats (a little harder than integers) at
around 165 M elements/s:
[http://forwardscattering.org/post/34](http://forwardscattering.org/post/34)

A serial radix sort should still be similar to or faster than djbsort.
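
The estimate above can be checked with a few lines of arithmetic. This is only a sketch: the 3.6 GHz clock is the assumed figure from the comment, not a measured one, and the function names are made up for illustration.

```python
CYCLES = 61_467_822      # median cycles for 1048576 elements, from the speed page
ELEMENTS = 1_048_576
CLOCK_HZ = 3.6e9         # assumption: a 3.6 GHz processor


def sort_time_seconds(cycles, clock_hz):
    """Wall-clock time if every cycle takes 1/clock_hz seconds."""
    return cycles / clock_hz


def elements_per_second(elements, cycles, clock_hz):
    """Throughput implied by the cycle count at the given clock."""
    return elements / sort_time_seconds(cycles, clock_hz)


print(f"{sort_time_seconds(CYCLES, CLOCK_HZ) * 1e3:.2f} ms")
print(f"{elements_per_second(ELEMENTS, CYCLES, CLOCK_HZ) / 1e6:.1f} M elements/s")
```

That the result lands near 61.4 M elements/s, the same leading digits as the cycle count, is a coincidence of the numbers involved.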

~~~
herf
My single-threaded floating point sort is also twice as fast as djbsort:

[http://stereopsis.com/radix.html](http://stereopsis.com/radix.html)

This one uses an 11-bit radix to save memory bandwidth. Without all the
floating-point stuff it would be faster.

~~~
AstralStorm
The point of djbsort is to be constant time while not being super slow.

Of course a fast directly implemented radix sort will be faster, especially if
you sprinkle SIMD on top.

------
robin_reala
Why might the installation instructions require the creation of a new user
specific to the sorting program? Purely for security of the normal user given
that the installation is using a wget / shell script process?
[https://sorting.cr.yp.to/install.html](https://sorting.cr.yp.to/install.html)

~~~
jwilk
This is bizarre. Why not just give users a link to the source?

[https://sorting.cr.yp.to/djbsort-20180710.tar.gz](https://sorting.cr.yp.to/djbsort-20180710.tar.gz)

And it's certainly not about good security practices:

* The page teaches users to paste stuff copied from the web into a terminal, whereas many terminals are still vulnerable to this: [https://thejh.net/misc/website-terminal-copy-paste](https://thejh.net/misc/website-terminal-copy-paste)

* The page teaches users to use su to lower privileges, whereas many (most?) su implementations are vulnerable to tty hijacking.

~~~
textmode
"The page teaches users to use su to lower privileges ..."

In the example, he could have used his own utilities for dropping privileges
(setuidgid, envuidgid from daemontools).

If I am not mistaken, busybox includes its own copies of setuidgid and
envuidgid, meaning they are found in myriad Linux distributions. I believe
OpenBSD has its own program for dropping privileges. Maybe there are others
on other OSes.

Instead he picked a ubiquitous choice for the example, su.

It is interesting to see someone express disdain for the version.txt idea. I
had the opposite reaction. To me, it is beautiful in its simplicity.

As a user I like the idea of accessing a tiny text file, version.txt, similar
to robots.txt, etc., that contains _only a version number_ and letting the
user insert the number into an otherwise stable URL.

This is currently how it works for libpqcrypto.

[https://libpqcrypto.org/install.html](https://libpqcrypto.org/install.html)

I would actually be pleased to see this become a "standard" way of keeping
audiences up to date on what software versions exist.

By simplifying "updates" in this way, any user can visit the version.txt page
or write scripts that retrieve version.txt to check for updates, in the same
way any user can visit/retrieve robots.txt to check for crawl delay times,
etc.

It is not necessary to "copy and paste" from web pages. Save the
"installation" page containing the stable URL as _text_ , open it in an
editor, insert the desired version number into the stable URL.

Save the file. Repeat when version number changes, appending to the file.

I like to keep a small text file containing URLs to all versions so I can
easily retrieve them again at any time.
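
A minimal sketch of such an update-check script, assuming the version file contains nothing but a date-style version string such as 20180710, and that the tarball URL follows the djbsort-<version>.tar.gz pattern of the direct download link above. The location of the version file is passed in as a parameter because it is not given here, and all function names are hypothetical.

```python
from urllib.request import urlopen

BASE_URL = "https://sorting.cr.yp.to"


def tarball_url(version):
    """Insert the version number into the otherwise stable tarball URL."""
    return f"{BASE_URL}/djbsort-{version}.tar.gz"


def fetch_version(version_txt_url, timeout=10):
    """Download a version file and return its contents without whitespace.

    Assumption: the file holds only the version string.
    """
    with urlopen(version_txt_url, timeout=timeout) as resp:
        return resp.read().decode("ascii").strip()


def is_update_available(installed, latest):
    """Date-stamped versions like 20180710 compare correctly as integers."""
    return int(latest) > int(installed)
```

A cron job could call `fetch_version`, compare the result against the last version it saw, and append `tarball_url(latest)` to the text file of URLs described above.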

------
carapace
(Kind of a tangent, but if you're into sorting check out:

"Generic top-down discrimination for sorting and partitioning in linear time"

[https://www.cambridge.org/core/journals/journal-of-functiona...](https://www.cambridge.org/core/journals/journal-of-functional-programming/article/generic-topdown-discrimination-for-sorting-and-partitioning-in-linear-time/B85E48EFC0B4D2BDDDE9A3885094FDD7)

Abstract: "We introduce the notion of discrimination as a generalization of
both sorting and partitioning, and show that discriminators (discrimination
functions) can be defined generically, by structural recursion on
representations of ordering and equivalence relations. Discriminators improve
the asymptotic performance of generic comparison-based sorting and
partitioning, and can be implemented not to expose more information than the
underlying ordering, respectively equivalence relation. For a large class of
order and equivalence representations, including all standard orders for
regular recursive first-order types, the discriminators execute in the
worst-case linear time. The generic discriminators can be coded compactly using list
comprehensions, with order and equivalence representations specified using
Generalized Algebraic Data Types. We give some examples of the uses of
discriminators, including the most-significant digit lexicographic sorting,
type isomorphism with an associative-commutative operator, and database joins.
Source code of discriminators and their applications in Haskell is included.
We argue that built-in primitive types, notably pointers (references), should
come with efficient discriminators, not just equality tests, since they
facilitate the construction of discriminators for abstract types that are both
highly efficient and representation-independent.")

~~~
KirinDave
I've tried so many times to frontpage that, but I've failed. Maybe we should
have another go?

Nothing in djbsort's approach is inapplicable to another sorting algorithm, so
maybe we can hope for better primitive support for discrimination sort
implementations (or at least american flag sort implementations). I seem to
recall reading that discrimination sorts are inherently content-independent.

~~~
carapace
It was probably you I heard about it from! Submission upvoted. ;-)

~~~
KirinDave
Something something definition of insanity. ;)

------
JdeBP
> Other modern Linux/BSD/UNIX systems should work with minor adjustments to
> the instructions.

I can report that I got it to build and run on slightly out of date FreeBSD by
deleting all of the -m32 variants, and deleting all of the -march=haswell
variants. I haven't looked into whether this is down to the version of GCC
that comes in ports and the version of Clang that comes in base, or something
else. No _other_ changes were needed to the build process, though.

    
    
        JdeBP /package/prog/djbsort % /tmp/djbsort/command/int32-speed
        int32 implementation int32/portable4
        int32 version -
        int32 compiler clang -fPIC -Wall -O2 -fomit-frame-pointer -fwrapv
        int32 1 72 72 72
        ...
        int32 1048576 1979077401 1979993070 1983745962

------
amorousf00p
Create a user and env to run a one-off build + application. DJB cracks me up.
He may have the right thing in mind but this type of prophylactic approach is
no longer proof against anything.

------
MrBuddyCasino
The 2.5 cycles/byte compared to the 32 cycles/byte that Intel managed to pull
off seems like an improbably large improvement over the current state of the
art? Is this real?

~~~
Cyphase
The cycles/byte for all the sizes listed in the table on the Speed page[0],
courtesy of copy-paste and a one-liner in the Python REPL:

    
    
      size      cycles/byte (based on median)
      1         6.0
      2         3.375
      4         11.625
      8         9.625
      16        7.984375
      32        7.515625
      64        6.0390625
      128       4.31640625
      256       2.4873046875
      512       2.29443359375
      1024      2.5048828125
      2048      2.77893066406
      4096      3.0791015625
      8192      3.89566040039
      16384     5.23690795898
      32768     6.35472106934
      65536     7.55914306641
      131072    9.23189163208
      262144    10.789557457
      524288    12.4885950089
      1048576   14.6550707817
    

[0] [https://sorting.cr.yp.to/speed.html](https://sorting.cr.yp.to/speed.html)
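
A sketch of how such a table can be derived, assuming each element is a 4-byte int32, so that cycles/byte is the median cycle count divided by four times the array size. The only data point used here is the 1048576-element median quoted earlier in the thread; the function name is made up for illustration.

```python
BYTES_PER_INT32 = 4  # assumption: the table is for the int32 sort


def cycles_per_byte(size, median_cycles):
    """Median cycles divided by the number of bytes in the array."""
    return median_cycles / (size * BYTES_PER_INT32)


# Median for 1048576 elements as quoted from the speed page.
print(cycles_per_byte(1_048_576, 61_467_822))
```

This prints roughly 14.655, matching the last row of the table.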

------
atesti
Direct download link (if you don't want to run the script)

[https://sorting.cr.yp.to/djbsort-20180710.tar.gz](https://sorting.cr.yp.to/djbsort-20180710.tar.gz)

------
bluetech
I liked this bit, using the fastest compiler for each primitive:

> ./do tries a list of compilers in compilers/c, keeping the fastest working
> implementation of each primitive. Before running ./do you can edit
> compilers/c to adjust compiler options or to try additional compilers.
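
The selection idea can be sketched in a few lines. This is a Python analogue for illustration only, not what ./do actually runs: the candidates here are functions rather than compiler invocations, and `pick_fastest` and its parameters are hypothetical names.

```python
import timeit


def pick_fastest(candidates, reference, sample, repeats=5):
    """Return the name of the fastest candidate whose output matches reference.

    candidates: dict mapping a name to a function taking a list.
    reference:  a trusted function used to decide what "working" means.
    """
    expected = reference(list(sample))
    best_name, best_time = None, float("inf")
    for name, func in candidates.items():
        try:
            if func(list(sample)) != expected:
                continue  # wrong answer: not a working implementation
        except Exception:
            continue  # crashing counts as not working either
        # Take the best of several timing runs to reduce noise.
        elapsed = min(
            timeit.repeat(lambda: func(list(sample)), number=10, repeat=repeats)
        )
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name
```

Correctness is checked before speed, so a fast but broken candidate can never win. The function returns `None` when nothing passes the check.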

~~~
rphlx
It is sadly necessary; 30%+ performance regressions from, say, gcc 4 to gcc 6
are not uncommon w/ vector intrinsics.

------
rurban
Beware, Linux only. It needs the usual BSD/macOS patches for HW_CPUSPEED and
CLOCK_MONOTONIC, and to do away with linux/perf_event.h.

Unfortunately I have no idea how he deals with patches; I don't think he does.

------
nrclark
My job involves a lot of packaging/cross-compilation, and djb's libraries
always seem consistently hostile to the lowly packaging engineer.

Would it really be all that much work to package in autotools or CMake? Why do
I need his special-snowflake build system with its hard-coded assumptions
about system paths?

I know that the cult of djb will downvote this into oblivion, but seriously,
what is the rationale for a build flow that involves:

    
    
      1. Downloading a text file
      2. Parsing it to get a URL
      3. Making a new user
      4. Symlinking the user's HOME directory into the build tree
      5. Running an extremely non-standard build system.
      6. Hoping you're not trying to cross-compile, because good luck with that.
      7. Guessing at where the files came out (hint: it probably won't be in FHS locations)
      8. Copying the output yourself once you find it.
    

Would it really be that much harder to give us a git repo and a ./configure or
a CMakeLists.txt?

~~~
Panino
I've been using DJB's stuff for ~20 years and I don't like his recent build
systems either. Without defending it, I'd just like to offer a theory on why
he packages things the way he does.

Back in the day, his build processes were atypical but still much more
"normal" than now. He also released his software without licenses. During this
time of heavy software development, DJB was concerned about people screwing
around with the internals of his software, hurting security or reliability,
and then blaming the software rather than the modifications. So his build
systems were, I think, designed to lead to the result he wanted, where
software behaved and was administered in the same way on various platforms.

In the mid-2000s he re-licensed existing software as public domain and began
publishing all _new_ code as public domain as well. Around this time, build
systems began to get more wonky. Also, his public work that garnered the most
attention shifted away from software toward cryptography. He did some attacks
on existing crypto and authored Curve25519, Salsa20, etc.

He's also been putting out a _tremendous_ volume of work in multiple
categories. I bet he'd rather work on this stuff than on user-friendly build
systems.

So given these points, I think the explanation for his unfriendly build
systems is

    
    
      A) a very strong aversion to people modifying his stuff where he gets blamed if modifications do harm;
      B) a shift away from software development, where people generally care more about build systems anyway;
      C) a huge level of productivity which results in very atypical Pareto principle choices/tradeoffs;
      D) his public-domain licensing.
    

Given these 4 points, I think DJB is unwilling to take time away from crypto
and other work and put it into build systems he doesn't enjoy that will take
more time upfront and more babysitting down the road. Fewer people will
package it, but the software is public domain and competent people can just
add their own build system. This squares with his available time and
interests.

So, I don't like his build systems either, but I think I understand where they
come from.

~~~
nrclark
Appreciate the response! Interesting to read.

The Libsodium guys wound up doing exactly what you're suggesting, because of
the impossibility of trying to package NaCl as-is.

So they essentially had to re-do/duplicate all of his build work just to make
it packageable. And now there are two competing implementations (three if you
count tweetnacl). And a bit of a confusing mess in the documentation
department.

It seems a little selfish for djb to take the "works on my machine" attitude,
because it means that a bunch of other people have to reverse-engineer all
that stuff just to make it portable.

But I guess OTOH it's his software, so it's his choice. And maybe he doesn't
care whether people choose to use his stuff or not, as long as he's
publishing.

~~~
Hello71
conveniently, almost nobody uses NaCl as-is due to its more or less never
having been patched, and nobody uses TweetNaCl, so there is de facto one
implementation.

~~~
baby
Does it need to be patched?

------
southern_cross
I haven't read the algorithm here yet nor all of the comments, so this may
have already been covered, but in many cases when it comes to sorting integers
and such you don't really need to _sort_ them at all - you just need to
_count_ them.

~~~
ur-whale
Yeah, except that when the integers are very large, that tends to fail. Also,
as pointed out by many others, the point of this work is to sort in constant
time to avoid side-channel attacks. I doubt the histogram sort (which I think
you're referring to) has this property.
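
A minimal counting sort sketch makes both objections concrete. It assumes non-negative integers no larger than `max_value`; the function name and parameters are chosen here for illustration.

```python
def counting_sort(values, max_value):
    """Sort non-negative integers <= max_value by counting occurrences."""
    # The table needs one slot per possible value, so memory grows with the
    # value range, not with the number of items being sorted.
    counts = [0] * (max_value + 1)
    for v in values:
        # The slot that gets touched is chosen by the data itself.
        counts[v] += 1
    out = []
    for v, c in enumerate(counts):
        out.extend([v] * c)
    return out
```

For full 32-bit values the table would need 2^32 counters, and because the memory locations accessed depend on the input values, the access pattern is not independent of the data.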

------
ibuildoss
That's pretty cool, are there any bindings for e.g. Go out there?

------
jonlandrum
Am I the only one who thought the name had something to do with Djibouti?

~~~
eesmith
Probably one of the few. On HN, a search for "djb" results in about 60
submissions where "djb" is in the submission title. There are about 6 for
Djibouti.

Also, the airport code "DJB" is for Sultan Thaha Airport. The Djibouti–Ambouli
International Airport code is JIB.

------
eggie
It's nice when things fit in RAM. But very often they don't. When you want to
sort arbitrary-sized binary records on disk look no further than bsort:
[https://github.com/pelotoncycle/bsort](https://github.com/pelotoncycle/bsort)

~~~
whazor
When sorting on disk, there are actually limits that rule out sorting in
linear time. It has to do with having to read blocks into memory and write
them back to disk. As everything moves in blocks, you are limited to
O((n/B) log(n/B)) block transfers, with B being the number of items per block.
For more speed, a merge sort that keeps the block size in mind works quite
well. Search for "external sorting".
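
A minimal sketch of an external merge sort along those lines, assuming the items are integers and that `chunk_size` items fit in memory at once. It writes sorted runs to temporary files and then streams a k-way merge over them. The function names are made up here, and this toy version does not tune its reads and writes to the disk block size.

```python
import heapq
import os
import tempfile


def _write_run(sorted_items):
    """Write one sorted run to a temporary file, one integer per line."""
    f = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
    with f:
        f.writelines(f"{x}\n" for x in sorted_items)
    return f.name


def _read_run(path):
    """Stream integers back from a run file without loading it whole."""
    with open(path) as f:
        for line in f:
            yield int(line)


def external_sort(items, chunk_size):
    """Yield items in sorted order, holding about chunk_size items in memory."""
    run_paths, chunk = [], []
    for x in items:
        chunk.append(x)
        if len(chunk) == chunk_size:
            run_paths.append(_write_run(sorted(chunk)))
            chunk = []
    if chunk:
        run_paths.append(_write_run(sorted(chunk)))
    try:
        # heapq.merge keeps only one pending item per run in memory.
        yield from heapq.merge(*(_read_run(p) for p in run_paths))
    finally:
        for p in run_paths:
            os.remove(p)
```

For example, `list(external_sort([5, 3, 9, 1, 4, 8, 2], chunk_size=3))` sorts seven items through three run files.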

