Hacker News new | past | comments | ask | show | jobs | submit login
Sorting number strings numerically (arangodb.com)
173 points by CJefferson on Sept 7, 2017 | hide | past | favorite | 57 comments

Fun problem, at least abstractly. Here's a similar, maybe more basic approach for positive integers. It uses the idea at the beginning of the article to store numbers in unary, to store the length of the number in unary ... recursively. The idea is to sort by ... the length of the length of the length of the number, then by the length of the length of the number, then by the length of the number, then by the number.

First, imagine we store the number in the form

    aaaaaaaaaaa 92389210184
where the number of a's is equal to the length of the number. This encoding scheme works for the same reasons as discussed in the article. And it only takes 2*(length of original number)+1 characters.

But now, we can do better by applying the same encoding scheme to the 'aaaaaaaaaaa'. That is, suppose the original number is 295 digits long. The number 295 is 3 digits long. So we store the number with 3 b's, as:

    bbb 295 109382388575782352353453....
OK, but what if we want to store a number that's 4,000,000,000 digits long? This would look like

    bbbbbbbbbb 4000000000 109382388575782352353453....
And that's a bit wasteful. If there are more than two b's, then we can recursively apply the encoding to the b's. There are 10 b's, and 10 has two digits, so we get

    cc 10 4000000000 109382388575782352353453....
And so on. (We won't ever have hard drives that can make use of letters e and above, so it can probably stop there.)

So the encoding schemes look like

    a n                            # n is only one digit
    aa n                           # n is two digits
    b log10(n) n                   # log10(n) is one digit
    bb log10(n) n                  # log10(n) is two digits
    c log10(log10(n)) log10(n) n   # log10(log10(n)) is one digit
    cc log10(log10(n)) log10(n) n  # log10(log10(n)) is two digits
    d log10(log10(log10(n))) log10(log10(n)) log10(n) n

Does this have a canonical form? It seems like the same number has multiple possible encodings that don't sort at the same place.

You'd definitely have to be more careful than I was in my post! But I think it should be possible to make it work via something like always using either one or two initial letters. The algorithm I'm thinking of is:

1. Given n, write aaaaaaa n where the number of a's is equal to the length of n.

2. If the number of a's is 1 or 2, stop.

3. Let n' be the number of a's.

4. Write bbbbb n' n where the number of b's is equal to the length of n'.

5. If the number of b's is 1 or 2, stop.

6. ...continue until the number of initial letters is 1 or 2.

I think that algorithm is mostly well-defined and only produces one possible output for each input number. So let's see if I can give a good proof-ish argument that it sorts correctly.

Suppose n <= m, then length(n) <= length(m).

Case 1: If their lengths are equal, then in step 1, they are written as

    aaaaaaaaa n
    aaaaaaaaa m
where the number of a's are the same. Now the rest of the algorithm is the same for both numbers, since they have the same number of a's, so they get written as the exact same prefix followed by the number. Since they have the same length, the smaller one sorts first.

Case 2: Now suppose length(n) < length(m), and further suppose length(n) <= 2. Then in step 1, they are written as, for instance,

    a n
    aa m
or something like

    aa n
    aaaa m
Now any encoding of m in further steps will start with the letter b or greater, which sorts below "aa n".

Case 3: Now suppose length(n) < length(m), but further suppose length(n) > 2. Then in step 1, they are written as

    aaaaaaaaa n
    aaaaaaaaaaa m
Now n will sort before m as long as the encoding of a smaller string of a's sorts before the encoding of a longer string. Let n' = length(n) and m' = length(m). We know that n' < m' and we want to prove that the encoding of n' sorts before the encoding of m'. This is true recursively by the same 3-case argument as above, though I haven't really formalized it well here.

For example, if length(n') = length(m'), then they both get written in the next step as

    bbbb n' n
    bbbb m' m
which is just like Case 1 above. Otherwise, they get written something like

    b n' n
    bbb m' m
which is just like Case 2 above. Otherwise, they get written

    bbbbb n' n
    bbbbbb m' m
which is like Case 3 above, and requires us to recurse again.

Negative numbers?

The same trick the article uses works.

if you have this problem (or like this problem), there's a whitepaper on this topic (with a different encoding scheme) here: http://www.zanopha.com/docs/elen.pdf

I implemented a decent portion of it here for a Go project: https://github.com/jordanorelli/lexnum

the whitepaper also discusses an implementation that accommodates floating point numbers, but I didn't implement that portion. This encoding scheme works well for file names.

Great. Thanks. I was not aware of this.

Very cool, thanks for mentioning.

This looks like a similar problem to providing a lexicographic binary encoding for the integers, and the technique given here reminds me of Exp-Golomb encoding.

A more fun problem is extending this scheme to support rational numbers. http://www.imada.sdu.dk/~kornerup/papers/lcf.ps.gz gives a neat scheme using continued fractions. (I've had a go at using it to implement a sort order preserving encoding that works for all the rationals at https://github.com/NegativeMjark/lexical-binary )

Silly solution: The rationals are countable. Just use any bijective mapping of the natural numbers to the rationals and then use the scheme for the natural numbers. (I know, a mathematician's answer! :-) )

That would imply the mapping preserved ordering, which doesn't seem possible.

It's not possible, since between any two rationals there exists another: (a+b)/2.

Wait, do the natural numbers and rational numbers have the same ordinality? It seems not because of the whole 'a number in between every 2 rational numbers'. But then what is the ordinality of the rational numbers?

The standard argument for showing that they have the same cardinality is to consider the matrix (let's see if I can render this):

    1/1  2/1  3/1  4/1 ...
    1/2  2/2  3/2  4/2 ...
    1/3  2/3  3/3  4/3 ...
You can traverse each SW-NE diagonal, producing the sequence 1/1, 1/2, 2/1, 1/3, 2/2, 3/1, ...

This will eventually hit every positive rational (several representations, in fact). If you want, you can start with 0 and insert -q when you hit q.

They have the same cardinality, but there is no order-preserving bijection.

You are of course right. My mistake.

The article starts with a discussion of storing large numbers in JSON.

JSON does not restrict the range of numbers. It happens they many languages that parse JSON do, but ArangoDB could try and be different! :)

Many JSON parsers will treat long numbers as doubles. Therefore the only safe way to store them without potential loss is in a string. Without the encoding in the article or something along these lines, the sorting done by a database is not compatible with the integer sorting.

It remans a fact that the JSON spec does not limit the length of numbers, so there is no need for this kind of kludge.

If Arango would just allow unlimited numbers in JSON data and sort them numerically, developers could use a different JSON encoder and parser in their language of choice, that would encode and parse numbers from the unlimited representation allowed by JSON to whatever bignum library their language has.

There is a need for this kind of kludge because systems that you interact with will lose precision with large numbers. If Arango was the only system parsing the json data that would be one thing, but that is not a real-world use-case.

Seems much more complicated than just changing the comparator to compare lengths first.

Which would additionally preserve trivial human-readability of the stored numbers - one of the reasons of choosing JSON in the first place.

This scheme is still human readable, you just have to skip the length prefix and you get a simple number after that. It's not really human-writeable however...

Although if you wanted to save space you could also store the number itself in a bigger base to reduce the number of "digits".

That is right, if you have access to your comparator, then this is much easier. However, it would still not work well for negative numbers, and, in many cases you sit client side and do not have access to your database engine.

It's easy to extend the comparator to negative numbers, just check for the minus sign as well.

Better yet, implement a sort that handles numeric substrings:


That's fine until you get numbers with decimal points.

Or some system puts in leading zeroes.

Here's a much simpler solution to the problem: your length is no more than ~2^48 since you're using real hardware, so prefix your string with an 8-byte base64 encoding of that 6-byte length.

Note that you would not have an encoding for negative numbers then, and you would have to use the alternative base64 table,

as the normal one is not order-preserving. Otherwise yeah, you could probably do that--and of course there are many ways to patch what you've got to handle negatives, e.g. by noticing that it is slightly wasteful (the base64 prefix ++++++++ is never used) or by using a leading character lower than + (so - is out but ! for example is available) and then encoding the rest of the string in an "ascii-reversed" way.

However I must object to the idea that "real hardware" cannot handle the petabyte scale. The in-article recursive approach of using the algorithm:

    function encode(number) {
        //calculate some base number_string
        return number < NUM_THRESHOLD?
            number_string :
            sigil + encode(number_string.length) + opt_divider + number_string;
is not a particularly bad one. The only thing that maybe I'd balk at is that the given encoding requires a human editor to use an ASCII table to lookup their string lengths, which is clearly undesirable... one might instead go for the most naive case where NUM_THRESHOLD is simply 10 and have the number 123,456,789,012,345,678,901 be represented as:

    ~~2 21 123456789012345678901
so that it's clearly "there are going to be 2 length strings" (hence two tildes) -- the first always has length 1, it specifies "the next has length 2" and then "the next has length 21."

This is also pretty easy to parse and one can use a leading - sign to encode negative numbers, for example.

> However I must object to the idea that "real hardware" cannot handle the petabyte scale.

By the time I'm talking about integers that don't fit in x86 address space, I'm not talking JSON ;).

I guess that makes sense and speaks to a bigger problem here, which is that with a very large number one certainly wants to simply store it in a massive file on disk, and the encoding to do that probably shouldn't be based on the printable ASCII digits--because why.

If you use a decimal number with leading zeros instead of base64 you can have a solution that is completely human compatible. It would be 15 digits for 2^48 or 20 digits for 2^64 if you want to be future-proof.

Since the goal is to store really long numbers, it seems like a false economy to worry about a few bytes at the front.

This is a neat trick, but the example code (after fixing a typo, "decodeNonNegatve" to "decodeNonNegative") doesn't sort correctly:

  ["-10", "-11", "0", "1", "-1", "-2", "10", "11", "110"].map(encodeLong).sort().map(decodeLong)
Results in:

  ["0", "1", "10", "11", "110", "-11", "-10", "-2", "-1"]

The author used `-` as the sorting character for negatives instead of `!` as stated in the article. If you change his functions to use `!` for encoding negatives then it correctly sorts.

    function encodeLong(s) {
      if (s[0] !== '-') { return encodeNonNegative(s, ' '); }
      return '!' + translate(encodeNonNegative(s.slice(1), ' '));

    function decodeLong(s) {
        if (s[0] !== "!") { return decodeNonNegative(s); }
        return '-' + decodeNonNegative(translate(s.slice(1)));

The really interesting property, if you do this right, is that you can concaneate a series of fields, sort them lexicographically, and you end up with the data sorted by the types of the fields. This allows you to create complex composite keys that contain primary indexes directly in the key. I don't know if this is something fundamentally important, or obvious, but when I learned it, it made designing my key schema a lot easier.

Related (but different) problem: SQLite4 Key Encoding


Numeric SQL values must be coded so as to sort in numeric order. We assume that numeric SQL values can be both integer and floating point values.

Is utf-8 significant here, as opposed to the ASCII subset? It seems like it's an ASCII encoding, and it sorts like ASCII.

I think that utf-8 sorting may depend on the locale. But here you're not depending on that -- you're just using the subset that is ASCII sorting, no?

Yes, this usage of UTF-8 is limited to the ASCII subset.

Regarding the examples:

    " 0 
    " 1
    # 42
Why is the length stored in base 92 while the number is stored in base 10?

Why not just store them both in base 128 or 127 (if you want C strings)? You can't use base 256 because some strings won't be valid UTF-8, but base 127 or 128 seems fine.

Do you want some notion of human readability? That isn't in the problem statement. Your problem statement seems fairly imprecise in a couple ways.

I mean, I think a case can be made for something like it, where possibly one adds to the problem statement "the bytes should be contiguous ASCII and should not cause the JSON standard to backslash-encode the characters." The biggest range which does this is 0x23 to 0x5B inclusive. But that only gives 57 workable characters.

If it helps, I am not the author of the original article.

I feel like this entry should have a different title. I passed it over at first because it looks like it's just a primer on how to implement ls -v, or something, when it's much more interesting.

Not click-baity enough?

If you don't need to worry about exponential notation or negative numbers, having a "natural sort" comparator is another way to deal with this.

Another name for sign + short representation of magnitude + mantissa is Floating Point, which also sorts lexicographically for positive values.


Are there any real life applications for this?

Author here: In mathematics, in particular discrete mathematics, there is a genuine need to store long integers accurately and to compare them numerically. This encoding is useful to do this for any database that can store UTF-8 strings and has a sorted index for these doing string sort. The person in the talk wanted to store data about finite groups and large group orders are one such application.

Many catalogues of mathematical groups contain things like the monster group, which has 808,017,424,794,512,875,886,459,904,961,710,757,005,754,368,000,000,000 elements. That won't fit in a 64 or even 128 bit integer.

It is common to want to ask a database "tell me all groups of size greater than X, with properties A, B and C". Now, if you have arbitrary sized ints, no problem. If your database (or language) doesn't support big ints, you need to figure out how to do "bigger than X", when you are storing big numbers in some other format, probably strings.

Surely if you're storing numeric values as strings, it would be better to use something like base64 encoding to use as much of the available symbol space as possible?

If you can sort a numeric string in base 10, you can sort one in base 16, or base 64, or base 256.

I think you could use this method with other bases, although be careful with base 256 -- if your strings are stored in UTF8 or something, then base 256 isn't actually that useful.

However, the important bit (to me) is that you can use your database's sorting function to compare the numbers.

The advantage of base-10 is it's easy to display the number :) Others bases do provide a constant speed improvement of course.

If we're in a database context you could store the length of the number in another number (64 bit would be plenty, nobody is storing a number with 18 exabytes of digits) then compare/sort by that first then the number itself, bases are irrelevant, besides the space savings of storing it directly in binary.

In fact I'd make an index on (len(N), N), then include len(X) and X in all your comparisons and range queries (wrapping for ergonomics as needed). You can use any base storage (including compact varbinary base256).

I'm genuinely curious, does anyone actually want/need to look at numbers larger than will fit in a 128bit int? I understand there are applications that require use and storage of such numbers, but how often is there are real need to display them?

In discrete mathematics this happens a lot. Group orders, sizee of conjugacy classes, semigroup orders, numbers of isomorphism classes, character degrees, etc.

"It happens" doesn't imply there's any need for it to happen.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact