Author here: In mathematics, in particular discrete mathematics, there is a genuine need to store long integers accurately and to compare them numerically. This encoding makes that possible in any database that can store UTF-8 strings and maintain a sorted index over them using plain string comparison. The person in the talk wanted to store data about finite groups, and large group orders are one such application.
Many catalogues of mathematical groups contain things like the monster group, which has 808,017,424,794,512,875,886,459,904,961,710,757,005,754,368,000,000,000 elements. That won't fit in a 64 or even 128 bit integer.
It is common to want to ask a database "tell me all groups of size greater than X, with properties A, B and C". Now, if you have arbitrary-sized ints, no problem. If your database (or language) doesn't support big ints, you need to figure out how to do "bigger than X" when you are storing big numbers in some other format, probably strings.
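One common trick (a sketch of the general idea, not necessarily the exact encoding from the talk) is to prefix each decimal string with a fixed-width digit count, so that a longer number always sorts after a shorter one and plain string comparison then matches numeric comparison:

```python
def sortable_encode(n: int) -> str:
    """Encode a non-negative integer so that string order
    matches numeric order.

    The digit count goes first as a fixed-width prefix; the
    6-digit cap here (numbers up to 10**999999 - 1) is an
    arbitrary choice for illustration, plenty for group orders.
    """
    s = str(n)
    return f"{len(s):06d}:{s}"

monster = 808017424794512875886459904961710757005754368000000000000
nums = [monster, 42, 10**30]
encoded = sorted(sortable_encode(n) for n in nums)
# Decoding the sorted strings gives the numbers in numeric order.
assert [int(e.split(":")[1]) for e in encoded] == sorted(nums)
```

With an encoding like this, a "bigger than X" range query becomes an ordinary string comparison against `sortable_encode(X)`, which any sorted string index supports.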
Surely if you're storing numeric values as strings, it would be better to use something like base64 encoding to use as much of the available symbol space as possible?
If you can sort a numeric string in base 10, you can sort one in base 16, or base 64, or base 256.
I think you could use this method with other bases, although be careful with base 256 -- if your strings are stored in UTF8 or something, then base 256 isn't actually that useful.
However, the important bit (to me) is that you can use your database's sorting function to compare the numbers.
The advantage of base-10 is that it's easy to display the number :) Other bases do provide a constant-factor improvement of course.
If we're in a database context you could store the length of the number in another column (64 bits would be plenty; nobody is storing a number with 18 exabytes of digits), then compare/sort by that first and then by the number itself. Bases are irrelevant, besides the space savings of storing it directly in binary.
In fact I'd make an index on (len(N), N), then include len(X) and X in all your comparisons and range queries (wrapping for ergonomics as needed). You can use any base storage (including compact varbinary base256).
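The (len(N), N) idea above can be sketched in Python: for non-negative integers stored as plain decimal strings, comparing (length, string) tuples matches numeric order, which is exactly what a composite database index on (len(N), N) gives you. The values here are made up for illustration:

```python
# Tuple comparison: shorter strings (smaller numbers) sort first;
# ties on length fall back to ordinary string comparison, which
# matches numeric order for equal-length decimal strings.
monster = "808017424794512875886459904961710757005754368000000000000"
values = ["100", "99", monster, "12345678901234567890"]

ordered = sorted(values, key=lambda s: (len(s), s))
assert [int(v) for v in ordered] == sorted(int(v) for v in values)
```

A range query like "greater than X" then becomes `(len(X), X) < (len(N), N)` on the indexed pair, with no decoding needed.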
I'm genuinely curious, does anyone actually want/need to look at numbers larger than will fit in a 128-bit int? I understand there are applications that require use and storage of such numbers, but how often is there a real need to display them?
In discrete mathematics this happens a lot. Group orders, sizes of conjugacy classes, semigroup orders, numbers of isomorphism classes, character degrees, etc.
In principle, ArangoDB behaves similarly to MongoDB here. Both are essentially "mostly-in-memory" databases in the sense that they hold the data in memory and persist it at the same time to disk via memory-mapped files. This approach is good for performance, and if you run out of RAM you ought to shard your data.
However, MongoDB often uses a lot of memory for the actual data, since its BSON binary format stores the names of the attributes with every single document. ArangoDB detects similar shapes of documents (see https://www.arangodb.com/faq#how-do-shapes-work-in-arangodb) and thus avoids this particular problem.
I have been bitten by this using MongoDB as well. The shape recognition of ArangoDB sounds very useful. If this works well, it would alleviate a problem that NoSQL solutions so far have in comparison to classical relational databases.
I agree. It is extremely hard to select the correct technology. So, sometimes it is even more valuable to learn what has not worked (even if it looked nice from the description) than what has worked.
Yes, looks like it. The default symbol lookup can be really slow in some situations, especially in large C++ projects, which can contain many symbols that share a long common prefix thanks to C++ name mangling.
"Optimising the linked output" seems a bit vague but that's about as clear as the man page.
If level is a numeric value greater than zero, ld optimizes the output. This might take significantly longer and therefore should probably only be enabled for the final binary. At the moment this option only affects ELF shared library generation.