Maze62: a dense and speedy alphanumeric encoding for binary data (altudov.com)
29 points by DenisM on May 16, 2011 | hide | past | favorite | 28 comments



I'm having trouble thinking of any situations where you're free to use any alphanumerics, but you absolutely can't use anything else. The closest thing that comes to mind is DNS, and that's a 36-character set, not 62.

Ascii85 (http://en.wikipedia.org/wiki/Base85) uses a similar concept, but using the majority of the printable ASCII characters. It gets a hard-to-beat 25% expansion on binary data (4 bytes in 5 chars) without requiring bigint math.
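For illustration, here's roughly what that per-group step looks like (a minimal Python sketch, not the full Adobe spec: no 'z' shorthand for zero groups and no partial-group handling):

```python
def a85_group(four_bytes: bytes) -> str:
    """Core Ascii85 step: pack 4 bytes into a 32-bit integer,
    then emit 5 digits in base 85, offset by '!' (ASCII 33).
    Each group is independent, so no bigint math is needed."""
    assert len(four_bytes) == 4
    n = int.from_bytes(four_bytes, "big")  # 32-bit value
    digits = []
    for _ in range(5):
        n, r = divmod(n, 85)
        digits.append(chr(r + 33))         # '!' .. 'u'
    return "".join(reversed(digits))
```

The classic Wikipedia example works out: the group `b"Man "` encodes to `9jqo^`.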


You're right on DNS, I was not thinking clearly when I wrote that part. I will fix up the article.

The prime scenario that caused me to think about this problem is indeed URL components, and in particular object identifiers for RESTful protocols.

The other thing I was thinking about was cookie values, which I have had problems with in the past.

EDIT: The most annoying part about base64 is that different systems have different restrictions. With Maze62 I won't have to think about which system my data will end up in.


I'm pretty sure the canonical use case for this would be for including complex binary data in a URL. A URL shortener might be a good example.


Another thing people don't talk much about is copy-paste. Double click on an alphanumeric will select the entire thing, whereas other characters might cause selection to end prematurely, and differently so in different text editors.





I think this has the drawback that if you want to look up, say, bits n through m, you'd need to process the whole string up to that point. Lookup would be O(n) rather than O(1). Is that a problem? (I don't know, maybe people only use base64-like encoding in places where they'd just read the data in once at the beginning, convert it to a better internal representation, and never touch the base64 version again.)

On the other hand, if it is a problem, how's this idea: Use strings of 129 base-62 digits to represent strings of 128 base-64 digits. I pick 128 because it might be convenient, and 129/128 is close to optimal:

  arc> (* 129 (log 62 64))
  128.01522067331783
Or if decoding 128 digits at a time when you just want a couple is excessive, you could use 17 base-62 digits for 16 base-64 digits (which has a space inefficiency of 1/16th, or 6.25%), or 9 for 8 (inefficiency 12.5%), or whatever.
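A minimal sketch of that fixed-ratio chunking (names are mine; this uses 12-byte chunks, i.e. 16 base-64 digits, mapped to exactly 17 base-62 digits):

```python
import string

# 62-char alphabet: 0-9, A-Z, a-z
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def encode_chunk(chunk: bytes) -> str:
    """Encode a 12-byte (96-bit) chunk as exactly 17 base-62 digits.
    Fits because 62^17 > 2^96."""
    assert len(chunk) == 12
    n = int.from_bytes(chunk, "big")
    out = []
    for _ in range(17):
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def decode_chunk(s: str) -> bytes:
    """Inverse: 17 base-62 digits back to 12 bytes."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n.to_bytes(12, "big")
```

Because each chunk is fixed-size, looking up bits in the middle of a long string only requires decoding the one chunk that contains them.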


Random access was not a priority, yes. In fact it didn't even occur to me, not even for a moment :)

As to your proposed algorithm, yes, it could work. It's discussed in passing a bit earlier in this thread; here's a direct link: http://news.ycombinator.com/item?id=2554726

Not sure about the comparative space efficiency of this vs. Maze62, but we're likely splitting hairs at this point :), it should be very close. Converting 128 characters from base64 to base62 would cost N^2 time, where N is 128. Not sure if that's a big deal. Just another approach.


Hmm, and one could extend this chunking approach to Maze62. Consume 12 bits (aka two base64 integers), but if the resulting number is in the range [62^2=3844 to 64^2-1=4095] or [0 to 64^2-62^2-1=251], then add that extra quotient bit to the next input (from which you'll consume only 11 bits). That would make the worst case only take 12/11ths as long, or 109%, instead of 6/5ths as long, or 120%, as the best case (which would be the same as before).

One could extend it upwards as far as desired (probably approaching log_64(62) worst-case and average-case space compared to base64[1]), though the integer arithmetic would become a pain eventually. 12/11 or maybe 18/17 (105.8%) or 24/23 (104.3%) is probably good enough for most purposes. The relatively hardcore might use 60/59 (101.7%), which x86_64 machines can still do with CPU arithmetic (an integer multiply of two 64-bit integers stores the results in two 64-bit registers, and you can then do an integer divide on that, which stores the quotient and remainder in those registers).

[1] Average case: I analyze it like this. If you encode n base64 integers at a time, then you'll encode 6n bits at a time, but 6n-1 bits if you're unlucky and get a number in the ranges [62^n, 64^n - 1] or [0, 64^n - 62^n - 1]. These two ranges, taken together, make 2 * (64^n - 62^n) numbers out of 64^n possible n-digit base64 numbers. Thus, on average, you encode

  6n - 1 * [2 * (64^n - 62^n)]/64^n
  = 6n - 2 * (1 - (62/64)^n)
bits for every group of n characters, or

  6 - 2/n * (1 - (62/64)^n)
bits per character. (Note that this analysis only applies while 62^n < 64^n < 2 * 62^n, so taking n -> ∞ gives misleading results.) The original Maze62 proposal has n = 1, and this yields 5.9375 bits per character on average, which is pretty good. (Check this: 5 * 4/64 + 6 * 60/64 = 5.9375.) The next few values are:

  1 5.9375
  2 5.9384765625
  3 5.939432779947917
  4 5.940369129180908
  ...
  10 5.945595231334425

  Theoretical limit:
  6 * log_64(62) = 5.954196310386876
So I guess there's really not much to be gained for the average case by increasing n. But someone pointed out that all-0 and all-1 sequences happen frequently in real situations, so it may be worth it anyway.
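The formula is easy to check numerically; a quick sketch that reproduces the table above:

```python
def avg_bits_per_char(n: int) -> float:
    """Average bits encoded per output character when grouping
    n base-64 digits at a time: 6 - (2/n) * (1 - (62/64)**n).
    Only valid while 62**n < 64**n < 2 * 62**n, i.e. small n."""
    assert 64 ** n < 2 * 62 ** n, "analysis breaks down for large n"
    return 6 - (2 / n) * (1 - (62 / 64) ** n)

# e.g. avg_bits_per_char(1) gives the original Maze62 figure, 5.9375
```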


That's a great idea, and it will probably speed up possible native implementations (though likely at the expense of scripting languages).

To deal with large runs of identical numbers (e.g. all zeroes), someone suggested in my blog comments XOR-ing the input array with the output of a chosen pseudo-random function. This will bring any dataset into the realm of the average (5.9375), except for a dataset that specifically targets the chosen pseudo-random function. :)
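A sketch of that whitening trick (`random.Random` here is just a stand-in for "a chosen pseudo-random function"; it's not cryptographic, and the seed value is arbitrary):

```python
import random

def whiten(data: bytes, seed: int = 0x5EED) -> bytes:
    """XOR the input with a keystream from a fixed-seed PRNG, so
    runs of identical bytes look 'average' to the encoder.
    Applying it twice restores the original, since XOR is its
    own inverse and the seed fixes the keystream."""
    rng = random.Random(seed)
    return bytes(b ^ rng.randrange(256) for b in data)
```

Encode `whiten(data)` instead of `data`, and run `whiten` again after decoding to get the original back.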


Claiming this is dense, OK. That's just math. But how can you claim this is "speedy" when all you have in hand is an unpolished Python implementation? You need to bench against best-of-breed Base64 implementations, not some pure-Python comparison you may write up in the future, and for that you're going to need to go at least to C, and I wouldn't be surprised if it takes going down to assembler. Just stick with "dense" and claim "speedy" when you've got some numbers.

(I suspect you won't beat or even tie Base64, but I'm prepared to accept that you only got within X% of base64 but that X% is worth it in some scenario. And maybe if you get clever enough you can prove me wrong.)


It's speedy compared to base62.

See, the problem is that Base64 is linear in speed but has a large (non-alphanumeric) charset, while base62 has a small (alphanumeric) charset but is quadratic in speed (at least in its straightforward implementation, the only one I was able to find).

Maze62 is an encoding that is both constrained in charset and linear in speed. Hence the title.

Does that make sense?


OK, that's fair. Thanks.


If you don't mind using - and _ then the implementation is very simple. I posted an example in Python and CoffeeScript to SO recently for doing this exact thing.

http://stackoverflow.com/questions/5940416/compress-a-series...

  >>> bin_string = '000111000010101010101000101001110'
  >>> encode(bin_string)
  'hcQOPgd'


  >>> decode(encode(bin_string))
  '000111000010101010101000101001110'


FYI, it seems to be known as this: http://en.wikipedia.org/wiki/Base64#URL_applications


Hmm, looking at python's base64.urlsafe_b64encode(s), padding is still used. So it sounds like it doesn't fully implement the "modified Base64 for URL" spec.

Base64 encoding also seems to result in quite long ASCII strings compared to what I threw together.

Or is there a way to use Base64 which would give results of a comparable length?


It's a lot longer because it doesn't know you have encoded your input data in, essentially, base-1. If you convert it to base-256 (that is, an array of bytes instead of the array of bits you have now), it would produce the same length. Yes, there is a base64 implementation in pretty much any language, though they usually use + and / for the two extra characters.

As Wikipedia says, padding can be added or removed as a matter of taste: from a theoretical point of view, the padding character is not needed, since the number of missing bytes can be calculated from the number of base64 digits.

http://en.wikipedia.org/wiki/Base64#Padding
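Since the padding is redundant, stripping it on encode and restoring it from the length on decode is a one-liner in each direction (a small sketch using the stdlib):

```python
import base64

def b64url_nopad(data: bytes) -> str:
    """URL-safe base64 without the trailing '=' padding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def b64url_decode(s: str) -> bytes:
    """Restore the padding from the length, then decode.
    A valid unpadded string has length 0, 2, or 3 mod 4."""
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))
```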


My first impression is that this algorithm is going to be much slower than Base64 due to all the branching. This probably won't be visible in Python, but I don't think there's any way to make it really fly when optimized.

There are probably cases where this is acceptable, but I'm inclined to think that Base32 is a better choice if larger size is acceptable, and that Base64 with domain appropriate characters is better where performance counts.


Really depends on what you want to do with it, doesn't it? He mentions character limits in URLs as a motivation, and at reasonable lengths for that I imagine Maze62 in Python will be plenty fast....


recently, i needed to encode/decode similar data, and base64 with a variable set of 64 chars seemed to work just as well. say - or / are problem chars; then make another set of 64 chars with a comma and a dot or the like, and there you go. keep the sets in a table or something and use the first 2 chars as an indicator of which set of 64 to use for that data.
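If I understand the idea correctly, a sketch might look like this (the two charsets and the indicator are purely illustrative, and I use one indicator char instead of two for brevity):

```python
import base64

# Hypothetical replacement pairs for base64's '+' and '/'; the leading
# indicator character records which pair was used for this data.
EXTRA_SETS = {"0": "-_", "1": ".,"}

def encode_switched(data: bytes, set_id: str = "0") -> str:
    """Standard base64, unpadded, with '+' and '/' swapped for the
    chosen pair and the set id prepended as an indicator."""
    extra = EXTRA_SETS[set_id]
    body = base64.b64encode(data).decode("ascii").rstrip("=")
    body = body.replace("+", extra[0]).replace("/", extra[1])
    return set_id + body

def decode_switched(s: str) -> bytes:
    """Read the indicator, undo the swap, restore padding, decode."""
    extra = EXTRA_SETS[s[0]]
    body = s[1:].replace(extra[0], "+").replace(extra[1], "/")
    return base64.b64decode(body + "=" * (-len(body) % 4))
```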


Cool stuff. Can you post a few examples of the encoding?


why is base62 quadratic?


A naive implementation certainly is, as it involves one multiplication (of a number whose size depends linearly on the number of previously processed bytes) for each input byte and one division for each output byte. But on the other hand I have this feeling that a more efficient implementation is possible (and actually used by modern compression algorithms like RAR and LZMA).
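For reference, the naive bigint scheme being described looks roughly like this (a sketch; note it also silently drops leading zero bytes, a separate known wart of whole-number encodings):

```python
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def base62_naive(data: bytes) -> str:
    """Treat the whole input as one big number. Each step below works
    on a bigint whose size grows with the bytes processed, so the
    total cost is O(n^2) in the input length."""
    n = 0
    for byte in data:         # bigint multiply per input byte:
        n = n * 256 + byte    # cost grows with the current size of n
    digits = []
    while n:                  # bigint divide per output digit
        n, r = divmod(n, 62)
        digits.append(ALPHABET[r])
    # NB: b"\x00\x01" and b"\x01" both reach n == 1 here,
    # so leading zero bytes are not preserved.
    return "".join(reversed(digits)) or ALPHABET[0]
```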


but couldn't you have a stream algorithm with modular arithmetic?

[edit: i thought i had a good demonstration of this. actually, two. but both were wrong. so it does seem to be non-trivial...]

[edit 2: open stack overflow question: http://stackoverflow.com/questions/3545931/efficient-algorit... ]


If we could, we wouldn't call it base62, as base62 normally means converting the entire "bigint" (all bytes of input data as one big number) into base-62 representation. At least that's how baseXX has been used so far (I searched the web before doing this), and I am in favor of sticking to that naming convention. Agreed?

Now, whether we could come up with such a streaming algorithm I'm not sure. I tried solving this problem, and this is what I came up with. Maybe I haven't tried hard enough to go the other route... I guess you could simply slice the incoming stream into a set of chunks, each small enough to make the bigint base62 conversion fast (its cost is quadratic in the size of the chunk), and yet large enough to obscure the occasional loss of space at the chunk boundaries... I guess that would be an option...


I'm pretty sure it's impossible to do in anything less than square time, but I don't have the proof with me.


A rough sketch of a proof might note that, if the input has n digits, then the output will have roughly n digits (n * log_newbase(oldbase)); and the lowest digit of the output depends on all n digits of the input, the next lowest digit depends on maybe the higher n-1 digits of the input, and so on.

Or you could construct the output by computing the effects of the input digits one by one, and in that case, the highest input digit will affect all n(ish) digits of the output, the next highest will affect roughly n-1 digits, etc.

The algorithm could be thoroughly parallelized, but I'm pretty confident that the work done has to be O(n^2).


i just came here to say something similar. one way (perhaps the only way) an efficient streaming approach is possible is if it can be "chunked"; if there's some number of input digits that can be represented as a whole number of output digits. that's possible when going between base 2 and 16, for example, but not when one base has a factor that isn't present in the other.



