
Sorting number strings numerically - CJefferson
https://www.arangodb.com/2017/09/sorting-number-strings-numerically/
======
bo1024
Fun problem, at least abstractly. Here's a similar, maybe more basic approach
for positive integers. It uses the idea at the beginning of the article to
store numbers in unary, to store the length of the number in unary ...
recursively. The idea is to sort by ... the length of the length of the length
of the number, then by the length of the length of the number, then by the
length of the number, then by the number.

First, imagine we store the number in the form

    
    
        aaaaaaaaaaa 92389210184
    

where the number of a's is equal to the length of the number. This encoding
scheme works for the same reasons as discussed in the article. And it only
takes 2*(length of original number)+1 characters.

But now, we can do better by applying the same encoding scheme to the
'aaaaaaaaaaa'. That is, suppose the original number is 295 digits long. The
number 295 is 3 digits long. So we store the number with 3 b's, as:

    
    
        bbb 295 109382388575782352353453....
    

OK, but what if we want to store a number that's 4,000,000,000 digits long?
This would look like

    
    
        bbbbbbbbbb 4000000000 109382388575782352353453....
    

And that's a bit wasteful. If there are more than two b's, then we can
recursively apply the encoding to the b's. There are 10 b's, and 10 has two
digits, so we get

    
    
        cc 10 4000000000 109382388575782352353453....
    

And so on. (We won't ever have hard drives that can make use of letters e and
above, so it can probably stop there.)

So the encoding schemes look like

    
    
        a n                            # n is only one digit
        aa n                           # n is two digits
        b log10(n) n                   # log10(n) is one digit
        bb log10(n) n                  # log10(n) is two digits
        c log10(log10(n)) log10(n) n   # log10(log10(n)) is one digit
        cc log10(log10(n)) log10(n) n  # log10(log10(n)) is two digits
        d log10(log10(log10(n))) log10(log10(n)) log10(n) n
        ...

~~~
rocqua
Does this have a canonical form? It seems like the same number has multiple
possible encodings that don't sort at the same place.

~~~
bo1024
You'd definitely have to be more careful than I was in my post! But I think it
should be possible to make it work via something like always using either one
or two initial letters. The algorithm I'm thinking of is:

1\. Given n, write aaaaaaa n where the number of a's is equal to the length of
n.

2\. If the number of a's is 1 or 2, stop.

3\. Let n' be the number of a's.

4\. Write bbbbb n' n where the number of b's is equal to the length of n'.

5\. If the number of b's is 1 or 2, stop.

6\. ...continue until the number of initial letters is 1 or 2.

I think that algorithm is mostly well-defined and only produces one possible
output for each input number. So let's see if I can give a good proof-ish
argument that it sorts correctly.

Suppose n <= m, then length(n) <= length(m).

Case 1: If their lengths are equal, then in step 1, they are written as

    
    
        aaaaaaaaa n
        aaaaaaaaa m
    

where the number of a's are the same. Now the rest of the algorithm is the
same for both numbers, since they have the same number of a's, so they get
written as the exact same prefix followed by the number. Since they have the
same length, the smaller one sorts first.

Case 2: Now suppose length(n) < length(m), and further suppose length(n) <= 2.
Then in step 1, they are written as, for instance,

    
    
        a n
        aa m
    

or something like

    
    
        aa n
        aaaa m
    

Now any encoding of m in further steps will start with the letter b or
greater, which sorts below "aa n".

Case 3: Now suppose length(n) < length(m), but further suppose length(n) > 2\.
Then in step 1, they are written as

    
    
        aaaaaaaaa n
        aaaaaaaaaaa m
    

Now n will sort before m as long as the encoding of a smaller string of a's
sorts before the encoding of a longer string. Let n' = length(n) and m' =
length(m). We know that n' < m' and we want to prove that the encoding of n'
sorts before the encoding of m'. This is true recursively by the same 3-case
argument as above, though I haven't really formalized it well here.

For example, if length(n') = length(m'), then they both get written in the
next step as

    
    
        bbbb n' n
        bbbb m' m
    

which is just like Case 1 above. Otherwise, they get written something like

    
    
        b n' n
        bbb m' m
    

which is just like Case 2 above. Otherwise, they get written

    
    
        bbbbb n' n
        bbbbbb m' m
    

which is like Case 3 above, and requires us to recurse again.

------
zemo
if you have this problem (or like this problem), there's a whitepaper on this
topic (with a different encoding scheme) here:
[http://www.zanopha.com/docs/elen.pdf](http://www.zanopha.com/docs/elen.pdf)

I implemented a decent portion of it here for a Go project:
[https://github.com/jordanorelli/lexnum](https://github.com/jordanorelli/lexnum)

the whitepaper also discusses an implementation that accommodates floating
point numbers, but I didn't implement that portion. This encoding scheme works
well for file names.

~~~
neunhoef
Great. Thanks. I was not aware of this.

------
Mjark
This looks like a similar problem to providing a lexicographic binary encoding
for the integers, and the technique given here reminds me of Exp-Golomb
encoding.

A more fun problem is extending this scheme to support rational numbers.
[http://www.imada.sdu.dk/~kornerup/papers/lcf.ps.gz](http://www.imada.sdu.dk/~kornerup/papers/lcf.ps.gz)
gives a neat scheme using continued fractions. (I've had a go at using it to
implement a sort order preserving encoding that works for all the rationals at
[https://github.com/NegativeMjark/lexical-
binary](https://github.com/NegativeMjark/lexical-binary) )

~~~
neunhoef
Silly solution: The rationals are countable. Just use any bijective mapping of
the natural numbers to the rationals and then use the scheme for the natural
numbers. (I know, a mathematician's answer! :-) )

~~~
Veedrac
That would imply the mapping preserved ordering, which doesn't seem possible.

~~~
rocqua
Wait, do the natural numbers and rational numbers have the same ordinality? It
seems not because of the whole 'a number in between every 2 rational numbers'.
But then what is the ordinality of the rational numbers?

~~~
bmm6o
The standard argument for showing that they have the same cardinality is to
consider the matrix (let's see if I can render this):

    
    
        1/1  2/1  3/1  4/1 ...
        1/2  2/2  3/2  4/2 ...
        1/3  2/3  3/3  4/3 ...
        ...
    

You can traverse each SW-NE diagonal, producing the sequence 1/1, 1/2, 2/1,
1/3, 2/2, 3/1, ...

This will eventually hit every positive rational (several representations, in
fact). If you want, you can start with 0 and insert -q when you hit q.

------
willvarfar
The article starts with a discussion of storing large numbers in JSON.

JSON does not restrict the range of numbers. It happens they many languages
that parse JSON do, but ArangoDB could try and be different! :)

~~~
neunhoef
Many JSON parsers will treat long numbers as doubles. Therefore the only safe
way to store them without potential loss is in a string. Without the encoding
in the article or something along these lines, the sorting done by a database
is not compatible with the integer sorting.

~~~
etatoby
It remans a fact that the JSON spec does not limit the length of numbers, so
there is no need for this kind of kludge.

If Arango would just allow unlimited numbers in JSON data and sort them
numerically, developers could use a different JSON encoder and parser in their
language of choice, that would encode and parse numbers from the unlimited
representation allowed by JSON to whatever bignum library their language has.

~~~
skj
There is a need for this kind of kludge because systems that you interact with
will lose precision with large numbers. If Arango was the only system parsing
the json data that would be one thing, but that is not a real-world use-case.

------
gefh
Seems much more complicated than just changing the comparator to compare
lengths first.

~~~
vesinisa
Which would additionally preserve trivial human-readability of the stored
numbers - one of the reasons of choosing JSON in the first place.

~~~
simias
This scheme is still human readable, you just have to skip the length prefix
and you get a simple number after that. It's not really human-writeable
however...

Although if you wanted to save space you could also store the number itself in
a bigger base to reduce the number of "digits".

------
Veedrac
Here's a much simpler solution to the problem: your length is no more than
~2^48 since you're using real hardware, so prefix your string with an 8-byte
base64 encoding of that 6-byte length.

~~~
drostie
Note that you would not have an encoding for negative numbers then, and you
would have to use the alternative base64 table,

    
    
      +/0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
    

as the normal one is not order-preserving. Otherwise yeah, you could probably
do that--and of course there are many ways to patch what you've got to handle
negatives, e.g. by noticing that it is slightly wasteful (the base64 prefix
++++++++ is never used) or by using a leading character lower than + (so - is
out but ! for example is available) and then encoding the rest of the string
in an "ascii-reversed" way.

However I must object to the idea that "real hardware" cannot handle the
petabyte scale. The in-article recursive approach of using the algorithm:

    
    
        function encode(number) {
            //calculate some base number_string
            return number < NUM_THRESHOLD?
                number_string :
                sigil + encode(number_string.length) + opt_divider + number_string;
        }
    

is not a particularly bad one. The only thing that maybe I'd balk at is that
the given encoding requires a human editor to use an ASCII table to lookup
their string lengths, which is clearly undesirable... one might instead go for
the most naive case where NUM_THRESHOLD is simply 10 and have the number
123,456,789,012,345,678,901 be represented as:

    
    
        ~~2 21 123456789012345678901
    

so that it's clearly "there are going to be 2 length strings" (hence two
tildes) -- the first always has length 1, it specifies "the next has length 2"
and then "the next has length 21."

This is also pretty easy to parse and one can use a leading - sign to encode
negative numbers, for example.

~~~
Veedrac
> However I must object to the idea that "real hardware" cannot handle the
> petabyte scale.

By the time I'm talking about integers that don't fit in x86 address space,
I'm not talking JSON ;).

~~~
drostie
I guess that makes sense and speaks to a bigger problem here, which is that
with a very large number one certainly wants to simply store it in a massive
file on disk, and the encoding to do that probably shouldn't be based on the
printable ASCII digits--because why.

------
lokedhs
This is a neat trick, but the example code (after fixing a typo,
"decodeNonNegatve" to "decodeNonNegative") doesn't sort correctly:

    
    
      ["-10", "-11", "0", "1", "-1", "-2", "10", "11", "110"].map(encodeLong).sort().map(decodeLong)
    

Results in:

    
    
      ["0", "1", "10", "11", "110", "-11", "-10", "-2", "-1"]

~~~
rockostrich
The author used `-` as the sorting character for negatives instead of `!` as
stated in the article. If you change his functions to use `!` for encoding
negatives then it correctly sorts.

    
    
        function encodeLong(s) {
          if (s[0] !== '-') { return encodeNonNegative(s, ' '); }
          return '!' + translate(encodeNonNegative(s.slice(1), ' '));
        }
    
        function decodeLong(s) {
            if (s[0] !== "!") { return decodeNonNegative(s); }
            return '-' + decodeNonNegative(translate(s.slice(1)));
        }

------
dekhn
The really interesting property, if you do this right, is that you can
concaneate a series of fields, sort them lexicographically, and you end up
with the data sorted by the types of the fields. This allows you to create
complex composite keys that contain primary indexes directly in the key. I
don't know if this is something fundamentally important, or obvious, but when
I learned it, it made designing my key schema a lot easier.

------
chubot
Related (but different) problem: SQLite4 Key Encoding

[https://sqlite.org/src4/doc/trunk/www/key_encoding.wiki](https://sqlite.org/src4/doc/trunk/www/key_encoding.wiki)

 _Numeric SQL values must be coded so as to sort in numeric order. We assume
that numeric SQL values can be both integer and floating point values._

------
chubot
Is utf-8 significant here, as opposed to the ASCII subset? It seems like it's
an ASCII encoding, and it sorts like ASCII.

I think that utf-8 sorting may depend on the locale. But here you're not
depending on that -- you're just using the subset that is ASCII sorting, no?

~~~
drostie
Yes, this usage of UTF-8 is limited to the ASCII subset.

~~~
chubot
Regarding the examples:

    
    
        " 0 
        " 1
        # 42
    

Why is the length stored in base 92 while the number is stored in base 10?

Why not just store them both in base 128 or 127 (if you want C strings)? You
can't use base 256 because some strings won't be valid UTF-8, but base 127 or
128 seems fine.

Do you want some notion of human readability? That isn't in the problem
statement. Your problem statement seems fairly imprecise in a couple ways.

~~~
drostie
I mean, I think a case can be made for something like it, where possibly one
adds to the problem statement "the bytes should be contiguous ASCII and should
not cause the JSON standard to backslash-encode the characters." The biggest
range which does this is 0x23 to 0x5B inclusive. But that only gives 57
workable characters.

If it helps, I am not the author of the original article.

------
alethiophile
I feel like this entry should have a different title. I passed it over at
first because it looks like it's just a primer on how to implement ls -v, or
something, when it's much more interesting.

~~~
mannykannot
Not click-baity enough?

------
chrchang523
If you don't need to worry about exponential notation or negative numbers,
having a "natural sort" comparator is another way to deal with this.

------
kwillets
Another name for sign + short representation of magnitude + mantissa is
Floating Point, which also sorts lexicographically for positive values.

------
allandubey
Brilliant!

------
bjerun
Are there any real life applications for this?

~~~
CJefferson
Many catalogues of mathematical groups contain things like the monster group,
which has
808,017,424,794,512,875,886,459,904,961,710,757,005,754,368,000,000,000
elements. That won't fit in a 64 or even 128 bit integer.

It is common to want to ask a database "tell me all groups of size greater
than X, with properties A, B and C". Now, if you have arbitrary sized ints, no
problem. If your database (or language) doesn't support big ints, you need to
figure out how to do "bigger than X", when you are storing big numbers in some
other format, probably strings.

~~~
logfromblammo
Surely if you're storing numeric values as strings, it would be better to use
something like base64 encoding to use as much of the available symbol space as
possible?

If you can sort a numeric string in base 10, you can sort one in base 16, or
base 64, or base 256.

~~~
CJefferson
I think you could use this method with other bases, although be careful with
base 256 -- if your strings are stored in UTF8 or something, then base 256
isn't actually that useful.

However, the important bit (to me) is that you can use your database's sorting
function to compare the numbers.

The advantage of base-10 is it's easy to display the number :) Others bases do
provide a constant speed improvement of course.

~~~
jcwayne
I'm genuinely curious, does anyone actually want/need to look at numbers
larger than will fit in a 128bit int? I understand there are applications that
require use and storage of such numbers, but how often is there are real need
to display them?

~~~
neunhoef
In discrete mathematics this happens a lot. Group orders, sizee of conjugacy
classes, semigroup orders, numbers of isomorphism classes, character degrees,
etc.

~~~
Retra
"It happens" doesn't imply there's any need for it to happen.

------
fergie
Mirror?

~~~
graetzer
Google Cache:
[http://webcache.googleusercontent.com/search?q=cache:https:/...](http://webcache.googleusercontent.com/search?q=cache:https://www.arangodb.com/2017/09/sorting-
number-strings-numerically/)

