
Let's Make a Varint - saturncoleus
https://carlmastrangelo.com/blog/lets-make-a-varint
======
Strilanc
This post starts off well, but...

\- The motivation behind this post is storing generated IDs for a site the
author is working on. Generated IDs don't have the "smaller values are
exponentially more likely" property that makes varints useful in the first
place.

\- The post rejects UTF-8's variable-length encoding because "it can only
store numbers up to 1,112,064". That's the maximum number of unicode code
points (...ish). The UTF-8 encoding's actual limit is >10^16. A much better
reason to not use UTF-8's encoding is that it was created under pretty heavy
backwards-compatibility restrictions that the author doesn't have. For
example, there's no need to be self-synchronizing or compatible with
Boyer-Moore substring searches.

\- The post goes with a length-prefixed encoding, but then uses that length
prefix as part of the folder structure (it's using the generated IDs as
filepaths). Which is a great way to create an exponential distribution in the
sizes of your directories, instead of the uniform one that was the goal.

It's not a _bad_ post, there's facts in there, but I wouldn't recommend
bothering with it.

~~~
saturncoleus
Those are all really good points actually. In response to them individually:

\- The problem with generating IDs is that it isn't known ahead of time how
many there will be. That forces a solution that is suboptimal in all
circumstances.

\- The reason for rejecting UTF-8 is mostly backwards compatibility with
existing software. Using UTF-8-style encodings for values beyond the
million-codepoint limit is possible, but it really burns a lot of bridges
along the way. The point about Boyer-Moore is really cool, I had no idea that
was a goal!

\- Having the length in the folder structure is exponential, but only at the
topmost level; it will be uniform under each length dir. This is an acceptable
price to pay when typing "ls ./dir", since removing the prefix would make it
hard to read quickly:

    
    
        0/
        0.jpg
        1/
        1.jpg

------
JonathonW
Something probably not particularly important to the author (given that he's
just using these essentially as a serial number), but worth mentioning for
general use:

> Varints can represent negative numbers too, by casting them to unsigned
> numbers first.

But, assuming you're working with two's-complement numbers, it's going to be
really inefficient, because small negative numbers end up represented as
extremely large positive numbers (e.g. -1 will map to MAX_UINT, where MAX_UINT
is the largest unsigned value representable in your underlying integer type).
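To make that cost concrete, here is a minimal base-128 varint encoder (a
sketch of the standard unsigned scheme, not protobuf's actual code) showing
how many bytes -1 takes after a naive cast to 64-bit unsigned:

```python
def varint_encode(n: int) -> bytes:
    """Encode a non-negative int: 7 payload bits per byte,
    high bit set on every byte except the last."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

print(len(varint_encode(1)))                 # 1 byte
print(len(varint_encode(-1 & (2**64 - 1))))  # 10 bytes: the worst case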

You also lose the unlimited-length property if you need to handle two's-
complement integers, and your encoding becomes dependent on the underlying
integer type you're serializing from/deserializing to (especially problematic
in languages where integer widths are architecture-dependent, like C/C++).

Protobuf handles this by doing what they call "ZigZag encoding" if you declare
a value as a signed integer type. This maps small negative numbers to small
positive numbers, alternating back and forth like so:

    
    
        Original    Encoded
           0          0
          -1          1
           1          2
          -2          3
           2          4
    

and so on.
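The mapping fits in a couple of lines; this is a sketch for 64-bit values,
not protobuf's actual implementation:

```python
def zigzag_encode(n: int) -> int:
    # 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
    # (in Python, >> on a negative int is an arithmetic shift,
    # so n >> 63 is -1 for negative n and 0 for small positives)
    return (n << 1) ^ (n >> 63)

def zigzag_decode(z: int) -> int:
    # Undo the interleaving: low bit selects the sign.
    return (z >> 1) ^ -(z & 1)
```

The result is a small non-negative number for any small signed input, which
then varint-encodes compactly.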

The catch is that you lose sortability if you do this... if you're going to
have a bunch of negative numbers and need their encoded versions to sort
correctly, you'll need to do something else. (Of course, you'd need to do that
with raw two's-complement numbers too, given that -1 cast to unsigned will
appear to be greater than 1.)

Protobuf documentation on ZigZag encoding:
[https://developers.google.com/protocol-buffers/docs/encoding#signed-integers](https://developers.google.com/protocol-buffers/docs/encoding#signed-integers)

~~~
daurnimator
FWIW the protobuf varint encoding was seen as a mistake by the inventor.
[https://groups.google.com/d/msg/capnproto/Qxw7YSoPP18/HS06Kc...](https://groups.google.com/d/msg/capnproto/Qxw7YSoPP18/HS06KchMpIQJ)

~~~
jholman
First of all, I believe that Kenton Varda is not the inventor of protobufs,
but rather a Google engineer who did a bunch of work on protobufs v2 (I
believe on both of the v2 specs, because there are two different v2s). He is
certainly quite knowledgeable, but not "the inventor".

Second, while he does characterize it as a "huge mistake" for
encoding/decoding performance reasons, there are a few caveats that are
obvious here.

1) Half of his remarks here explain why varints are inappropriate for
CapnProto, a separate project of his, inspired by his experience with
protobuffs. CP has different design goals than PB, and in that context varints
are inappropriate. Fine.

2) His argument here against varints in protobuffs (as contrasted with in
capnproto) is only that they cost CPU time (and thus latency). That's surely
true, but it's not at all clear to me that he's right that this is, in
general, the wrong performance tradeoff.

3) In general, if you want to represent a long-tailed distribution of numbers,
some variable-length encoding has merits, because the alternative (which
protobufs promote, IIUC) is fixed-width ints, which means you waste a lot of
space to accommodate your long tail, and even then you might not have
allocated enough space for the largest integer you ever need to handle.

~~~
kentonv
FWIW, before I even joined Google, there were notes from Sanjay (the actual
inventor) saying that varint was a poorly-chosen format due to excessive
branching. IIRC he was arguing for a format where you can tell the length
from the first byte, although I don't completely remember. In theory you
could design such a format that is just as space-efficient as varint, by
rearranging bits.
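One way to realize that idea (my own sketch of a "prefix varint", not
Sanjay's actual proposal): put a unary length marker at the top of the first
byte, UTF-8 style, so a decoder branches once on the first byte instead of
testing a continuation bit on every byte. An L-byte encoding spends L bits on
the prefix and carries 7*L payload bits, matching varint's density:

```python
def encode(n: int) -> bytes:
    """First byte starts with (L-1) one-bits then a zero; the remaining
    7*L bits hold the value big-endian, where L is the total byte count."""
    for length in range(1, 9):
        if n < 1 << (7 * length):
            prefix = ((1 << (length - 1)) - 1) << 1   # e.g. L=3 -> 0b110
            value = (prefix << (7 * length)) | n
            return value.to_bytes(length, "big")
    raise ValueError("sketch handles at most 56-bit values")

def decode(buf: bytes) -> int:
    length = 1
    while buf[0] & (0x80 >> (length - 1)):   # count leading one-bits
        length += 1
    return int.from_bytes(buf[:length], "big") & ((1 << (7 * length)) - 1)
```

As a side benefit, minimal encodings in this scheme compare correctly under
memcmp, since longer encodings always start with a larger first byte.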

Protobuf -- like most successful systems -- was not so much "designed" as it
was "evolved" as a series of decisions made out of expediency rather than
because they were the best possible answer. Overall it has obviously worked
extremely well, but I would not assume that just because protobuf does
something, it is good. (But I'd say that about just about everything, not just
protobuf.)

------
ccleve
The references in the article don't reflect the latest research. So far as I
know, this is close to state of the art:

[https://github.com/lemire/JavaFastPFOR](https://github.com/lemire/JavaFastPFOR)

The links to the papers in the readme are useful if you want to learn more.

------
hyperpape
"No “u”, (and no “o” or “i”) so accidental profanity cannot happen."

h0ly fvcking sh1t!

It's a good try, and making all the letters lowercase makes digits standing in
for vowels less legible. But if you're creating random strings, you will
eventually create profanity and slurs.

~~~
rspeer
If you're generating alphanumeric IDs, removing I, O, and U from your alphabet
_really does_ decrease the likelihood of the ID being perceived as English
profanity.

Sure, nothing involving letters will be perfect, but this is something to take
into account. You can choose a level of risk in your IDs somewhere between
purely numeric IDs and the Automated Curse Generator [1].

[1] [http://thedailywtf.com/articles/The-Automated-Curse-Generator](http://thedailywtf.com/articles/The-Automated-Curse-Generator)
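For illustration, generating IDs from such a restricted alphabet (the exact
character set is my guess at the article's no-i/o/u alphabet, not taken from
it):

```python
import secrets

# Lowercase letters minus i, o, and u, plus digits, per the comment above.
ALPHABET = "abcdefghjklmnpqrstvwxyz0123456789"

def random_id(length: int = 8) -> str:
    # secrets.choice gives cryptographically strong random selection.
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```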

~~~
hyperpape
That's why I said it was a good try. I don't think it's a bad idea, but if you
ever wanted to push the risk close to zero, you would have to add some other
layer of checking.

------
btrask
Varints are widely used in databases.
[https://www.sqlite.org/src4/doc/trunk/www/varint.wiki](https://www.sqlite.org/src4/doc/trunk/www/varint.wiki)

