Hacker News new | past | comments | ask | show | jobs | submit login
Base64 encoding and decoding at almost the speed of a memory copy (arxiv.org)
344 points by beefhash on Nov 6, 2019 | hide | past | favorite | 115 comments

Great use of AVX-512 VBMI!

Note that not many Intel processors support the two specific CPU instructions used (vpermb and vpmultishiftqb).

Cannon Lake Core i3 (not many of these around due to delays with intel's 10nm fabrication)

Ice Lake Core i3, i5 and i7: Released in September 2019

Looks like model number 06/66 or better in cpuinfo. Both AWS and Gcloud standard vm instances appear to be Skylake ;(

Honestly, base64 has always been pretty fast for me, even large bitmaps >100kb. Compared to net latency it may be imperceptible to end user. But it still makes a terrific reference paper!

Plus, if my knowledge is properly up-to-date, no AMD processor supports those instructions, not even any member the recent and famously super efficient Ryzen family processors.

None of the AMD CPUs have AVX-512. The new ones have full-width AVX-256.

Why do they say then "We use the SIMD (Single Instruction Multiple Data) instruction set AVX-512 available on commodity processors" ?

Because those are commodity processors nonetheless. The distinction relevant for a research paper is between mainstream processors you can just buy vs. custom made ones, not about market penetration.

I still think "commodity" is not the correct word even in that context. Mass-market, or commercially available might be better. How can it be a commodity when there is only one manufacturer?

Just going off wikipedia (also other sources support the definition):

> In economics, a commodity is an economic good or service that has full or substantial fungibility: that is, the market treats instances of the good as equivalent or nearly so with no regard to who produced them.

Basically it's a common good, that is sold and purchased, with no other modifications required for this use-case. This commodity is currently low in supply, but is still a commodity.

It’s missing the fungibility where it is produced by multiple suppliers and can be treated equivalently by the end buyer. Large grade A eggs are a commodity as it doesn’t matter which farm produced them. A specific set of processors from Intel are not a commodity since there are no fungible equivalents. Widely available would probably be a better description in this case.

I would argue processors are never commodities anymore (maybe back in the 486 days it was closer). You can't swap an intel for an amd processor into the same motherboard ever today. They don't have the same performance characteristics from one to another. You could have 2 "3.5Ghz 4-core processors" that have wildly different real-world performance. IMO this is the opposite of "fungible"

I think hard drives, memory, power supplies are much closer to being fungible.

It's a slightly awkward way of saying they didn't use custom hardware.

They're technically not wrong.. just not available on many commodity processors.

If base64 decoding speed makes a difference in your application, you should be considering why you are transmitting data in base64, a neat hack from the 1970's, which is non-human-readable, wastes 30% of the network bandwidth, kills compression, wastes RAM, is typically hard to random-access, and is generally a bad idea all round.

I agree, IMHO is the biggest design fail and annoyance of JSON that it has no good way to contain binary data.

The problem start with knowing the encoding (base64, for small sizes sometimes hex, for one or two bytes sometimes arrays with all there ambiguities).

Then that you can't easily differentiate between a string a bytearray anymore (through you often shouldn't need to).

Then that it becomes noticable bigger, which especially for large blobs is a problem

Then that you always have to scan over this now larger large chunk to know it's end. (Inserted of just skipping ahead).

Then that you have to encoded/decide it which might require some additional annotations depending on serialization library and can imply other annoyances.

The later point can also in some languages lead to you accidentally using a base64 string as bytes or the other way around.

Well I guess that was enough ranting ;=)

If efficiency and compactness are your problem, JSON is not the best idea anyway, look at protobufs and the ilk.

JSON works where you need simplicity on the verge of being dumb, and human-readability.

JSON human readable is extremely nice. FWIW, I’ve found some binary formats like CBOR and MsgPack to be so almost human readable as binary when you are even slightly familiar with the format, or just convert almost instantly to readable JSON.

When I was developing a three component system, protos were not working for me. Too many schema changes and implementation differences to make the reduced binary size worth it. This was partly because the best protobuffer implementation in C (nanoPB) is just the hobby project of some dude, but mostly because coordinating schemas was annoying.

On the contrary, adding binary would have been terrible. JSON is great because it is simple!

Have you tried BSON or Protobufs to solve your annoyances? How did it go?

Protobuffs require sharing of a schema, then code generation for each language. Not ideal imo unless you really need the speed and the protos can be shared ahead of time.

Outside of Mongo, I haven’t seen BSON used anywhere.

CBOR however, is up and coming. Starting to look like it may be a first class citizen in AWS someday. On the IoT side they are already preferring it over JSON for Device Defender.

Base64 comes from the development of MIME in the early 1990s (see https://tools.ietf.org/html/rfc1341 section 5.2) though there are other similar encodings such as uuencode which predate it.

I agree with the spirit and most of the content of this comment, except for the hard to random access part. It's not really hard to random access base 64.

Theoretically, yes it's easy.

Practically, base64 encode a video file and tell me how exactly you're going to allow the user to seek to any place they like within that video using only common libraries on common platforms... Theoretically easy, practically hard enough nobody does it.

Practically it's easy too. Here's a piece of pseudocode I just wrote:

   chunkOffset = byteIndex/3*4
This gets you the byte located at an arbitrary index of the original buffer while operating on the base64 data.

1. Decode to bytes.

2. Do whatever you said.

If the data being base64 encoded will also then be compressed, it is better to base16 encode (hex) and then apply compression. Base16 is faster to encode and compresses better (probably compresses faster, didn't test that).

    2537012 Nov  6 10:10 test.tar.hex.gz
    2954608 Nov  6 10:03 test.tar.b64.gz
All the caveats, test with your data, etc.

What would be nice is an email/http resilient compressed format that could be stuck inside of json strings that can be easily recovered.

That being true, it might be worthwhile for compression algorithms to detect base64 encoding (or any other base for that matter) and reinterpret it as base16 in order to improve compression rates.

On the Internet with neural networks! #priorart

Many people have to deal with base64 codes and these people order their cpus by the cubic meter. Think gmail etc.

Because of legacy protocols and format limitations: SMTP, POP3, embedded binaries in XML, HTML (data URLs).

Pretty much on point, save for compression part... although most used 16bit deflate would stumble big time

what's a better alternative, assuming you need to encode binary data in a text stream?

Don't use a "text stream", use a binary protocol.

So just deny the reality of the original constraint? Boy, if only I could do that more often.

Your suggestion isn't helpful if I'm functioning within a system that doesn't support this and can't be upgraded.

However my underlaying channel isn’t 8bit clean.

How does it kill compression?

Try it...

    wget http://mattmahoney.net/dc/enwik8.zip -O - | gunzip | head -c 1000000 | gzip -9 | wc -c
    wget http://mattmahoney.net/dc/enwik8.zip -O - | gunzip | head -c 1000000 | base64 | gzip -9 | wc -c
base64 is 48.8% larger after compression on english text, whereas it would only be 33.3% larger without compression. The reason is because compression finds and eliminates repeating patterns, but base64 can make the same input data look totally different depending on which of the 4 input alignments it has.

oddly enough lz/deflate are as ancient as base64. 16bit deflate is quite poor and slow, yet predominant.

Most compression algos operate on single-byte chunks of data and base64 encoding messes up the original byte alignment making the input appear more random than it is.

For one thing, it removes opportunities for more efficient content-aware compression elsewhere in the data path.

If we're going the content-aware route, can't the compression scheme just be Base64-aware?

How exactly do you make a base64-aware image compressor, though?

I don't follow. An image compressor is guaranteed not to be given Base64 input.

A general-purpose compression scheme can easily detect when its input is Base64, so I don't see why Base64 should be particularly hard to compress, in principle at least.

Sometimes you have no choice but transmit data in urls

What are good alternatives?

Wojciech Muła really groks SIMD. It's well worth exploring his other work in that area: http://0x80.pl/articles/index.html

Why is that an achievement? Isn't it so that during a memory copy the CPU is basically idling because memory is so much slower than the CPU? Weren't that hundreds of instructions per memory access? So instead of having it wait it can also do computations (just standard instructions). Or is base64 computationally so heavy that it cannot fit into the "gaps"? I certainly have not tried, just thinking from the high level view that CPUs are incredibly fast and main memory is incredibly slow in comparison. And of course assuming that the data does not fit into any cache.

> Why is that an achievement?

There are tables in the paper comparing its speed to other implementations. This is more than an order of magnitude faster than the implementation used in Chrome, and for small data more than twice as fast as their previous AVX2 based implementation (https://arxiv.org/abs/1704.00605). The paper even notes that in some cases, for large inputs, it's decoding base64 faster than you could memcpy the data for the simple reason that it needs only write 6/8ths of the bytes. In their tests, it's only when the data fits in the tiny L1 cache that memcpy is consistently somewhat faster (44 GB/s vs 42 GB/s). So this is not only fast when you have to wait for a slow memory access, but when the loads hit other caches.

Or are you asking in more of a rhetorical sense? Like, why hasn't this been achieved before, that it's obvious or something? AVX-512 is only some 3 years old and base64 is only one of many problems that could benefit from vectorized processing.

Thanks for this. Was insightful!

You're mostly right. It's often not that hard to completely saturate memory bus.

However, base64 has weird mappings that take some processing to undo in SIMD – can't use lookup tables and need to regroup bits from 8 bit to 6 bit width. That does take a lot of cycles without specialized bit manipulation instructions.

Also the data you'll need to process is often probably already hot in the CPU caches. Disk/network I/O can be DMA'd directly into L3 cache.

So I think this decoder is a useful contribution. But also somewhat no-brainer once you have those AVX-512 instructions available.

> However, base64 has weird mappings that take some processing to undo in SIMD – can't use lookup tables and need to regroup bits from 8 bit to 6 bit width. That does take a lot of cycles without specialized bit manipulation instructions.

Base64 uses lookup tables and the bit manipulations required are standard shifts and 'and', which are basic, fast instructions on any CPU. That seems exactly what they do here with an efficient use of AVX512 to make it fast(er).

> SIMD – can't use lookup tables

This is doing exactly that with `vpermb` and `vpermi2b`.

Yeah, missed that you can now fit 64 8-bit entries in a single 512 bit register!

Generally my point stands that LUTs are not vectorizable, except for small special cases like this.

Why can't you use lookup tables?

Part of the point of this paper is that the lookup tables for base64 fit into a 512 bit vector register.

Notably this technique only works in this special case (or other small LUTs). Generic (larger) lookup tables can't be vectorized yet.

Because table lookups don't vectorize. You could try to use vectorized gather (VPGATHERDD [0]), but so far in my experience it's been just as slow as using scalar code. (Then again, even gather instructions operate just on 32-bit data, so it wouldn't help anyways.)

So to perform multiple operations in parallel per core (SIMD), you'll have to write some code to perform the "lookup" transformation instead.

[0]: https://www.felixcloutier.com/x86/vpgatherdd:vpgatherqd

memory is so much slower than the CPU

Yes and no. Memory has high latency, but memory bandwidth has stayed far more in line with CPU performance.

See also GDDR VRAM, which further trades off latency for higher bandwidth.

I think you got it all wrong, memcpy comparison is given as a baseline.

It just shows even if you do not do anything (just copy the memory from place A to place B), you have some CPU usage baseline per GB. So this kind of saying: with this algorithm, you have very minimal CPU usage.

This algorithm is not doing excessive memory usage and Idling CPU because of memory bandwith, it is just doing very minimal CPU usage.

avx is extremely power hungry to a point the cpu downclocks itself. it might be worse than regular instructions even.

memcpy with wide instructions has been an issue for libc.

From the same author, https://lemire.me/blog/2018/08/15/the-dangers-of-avx-512-thr... tl;dr Isn't that extreme.

the test is on server grade cpu (read low clocks, great vrms) and the conclusion is: you have to use multiple cores to be in trouble.

current intel desktop processors have 'avx offset' which would suck if kicks on for mild base64.

overall wide instructions are nice and dandy and obliterate microbenchmarks but they may incur hidden costs, especially on low power cpus (laptops)

>Why is that an achievement?

Compared to what, being content to suffering slower base64 decoding?

Why it's an achievement is obvious.

Your question should be rephrased better as "will this make much difference in most common scenarios"?

While it is correct that memory is orders og magnitudes slower than modern cpus, this is also why cpus sport multiple levels of caches.

memcpy operations (and base 64 encoding/decoding) are highly "local" operations. When the segment of memory they operate on has been placed in the cpu level1 or level2 caches, the operation is much faster and will not need to access main memory for each byte/word. Only when a memory barrier is encountered will the cpu need to flush/refresh to/from main memory.

You are thinking of random memory access latency, which is very slow. Reading memory in spans that are contiguous is very fast in comparison because of prefetching. Writing is the same to a certain degree because of the same efficiency with TLBs but is not as fast as prefetched reads.

Not yet usable asis. Normally AVX-512 instructions are subject to downclocking. Now they tried a very recent CPU which is not subject to downclocking anymore. They didn't add code to test for the CPU revision which stopped downclocking, and they didn't benchmark with those older CPU's. We can only guess the older AVX256 code is faster there, but not how much.

Speaking of benchmarking against older CPUs: the paper falls into the "short communication" category, we didn't want to discuss all possible hardware configurations.

Many people don't realize this but today's memory is dramatically underpowered for today's CPU. Consider the following: you can, theoretically, get ~80GB/sec of memory bandwidth from a modern Intel CPU (assuming you use all channels, and there are no interrupts, etc). That's 20 billion floats or int32s per second. The same CPU can do ~2.3 fp32 TFLOPS. That's 115 ops for every fp32. And it's getting worse as more and more cores are put on the same memory bus.

Your numbers look strange. Did you quoted from a $10000 xeon platinum very few people use?

In my current desktop PC, it’s 700 GFlops theoretical compute versus 50GB/sec of theoretical RAM bandwidth.

For i9-7980XE Intel quotes 1.3TFLops of fp64. So it ought to do about twice that at fp32. Yours at $1500+tax from Newegg: https://www.newegg.com/core-i9-x-series-extreme-edition-inte.... Quad channel RAM, too.

For a flea check, consider that the chip has 18 cores and runs at ~4.2GHz. To achieve the theoretical throughput each core has to do ~30 single precision flops per cycle. Which sounds about right: HEDT intel chips can (theoretically) do 32 of those per cycle.

This, of course, ignores whatever throttling Intel would need to apply to maintain TDP, so a purely theoretical back of the envelope.

This should change soon, as EPYCs can do 204GB/s of memory throughput (plus tons of pcie4.0). They also aren't cpu tied, all of the SKUs get all of the lanes

It's a NUMA though. You'll have to write software specifically for NUMA to get anywhere close to that number. To make matters worse, EPYC also doubles the number of cores. To be fair, though, it also has a substantial amount of cache, but cache is not a panacea if you need something in the main RAM. And if you're churning through gigabytes of stuff per second, you'll be needing that very, very often.

Yes, NUMA and large numbers of core will pose completely new challenges for software optimization. Applications will have to become aware of the memory architecture of the underlying machine. They will have to make explicit assignments of memory allocations and threads to NUMA nodes based on their domain specific needs. In some cases, even duplicating data structures may be the right call. This is going to challenge how most developers write fast software.

The other thing is that even seemingly trivial things like spawning and synchronizing with lots of threads will be much more complex on CPUs with many cores. At some point, naively looping over all threads is going to be too slow. I think that the limit is going to be around 64 cores. Past that, you should actually parallelize your worker thread management to stay efficient. There is precendent for this in HPC, e.g. MPI implementations.

The authors have been doing good work in this area for a while: https://arxiv.org/abs/1704.00605

Turbobase64: A portable scalar implementation can beat SIMD and saturates the fastest networks and fastest SSDs.

It is faster than SSE on Intel/AMD and SIMD NEON on ARM see benchmark at Turbo Base64: https://github.com/powturbo/TurboBase64

In the readme for Turbobase64, they state:

> Only the non portable AVX2 based libraries are faster

The paper in OP is about 2 times as fast as their previous AVX2 code. I think they are superior to Turbobase64 here (though with more stringent hardware requirements).

I agree and according to a steam hardware survey only 1% are using AVX2. AVX-512 is practically unused. And don't forget the billions ARM CPU's.

Their numbers are surely wrong? Firstly, AVX2 jumped from 1% to 5% in a month. Secondly, it looks to have been on every Intel processor this decade.

Even 5% sounds sketchy indeed. However, AVX2 (or Haswell New Instructions) hasn't quite been on every Intel processor this decade, rather only since the 4th generation (late 2013), and only in Core-branded processors.

It’s available, but the CPU would throttle itself to prevent overheating until very recent processors.

TurboBase64 runs at 3-4 GB/s. The authors' work runs at over 40 GB/s.

Not to say that Turbobase64 isn't impressive, but it's not at all the same level of performance.

40 GB/s in L1 cache. The TurboBase64 benchmark is more practical.

It is impossible to have the speed of a good memcpy when you're copying 33% more like in base64.

I think 40GB/s result when it is not fitting L1 cache. If you check result graph, On L1 cache fitting data memcpy is ahead with big difference, then they are almost head to head

L1 cache is 64k for this cpu. A benchmark with large files is more realistic, because it's unlikely that the data already be in the L1 cache.

Benchmark at the README says it is 4x slower than memcpy.

There is no claim about memcpy. The memcpy is used as indication in the benchmark.

iOS 13 allows the Shortcuts app to encode/decode information as Base64 as part of the scripting language.

People are using this to do clever things like encoding a watermark image to overlay on a picture, because the Shortcuts app does not allow file attachments as part of the scripting language.

So this could've been done in the past with a manually written decoder but now it's a built in? That's great

Now show me memory speed SIMD uint8_t histogram, and I'll be happy. Computing histogram is slow and it doesn't really vectorize. Yet it's required by many important algorithms like data compression.

Not memory speed but extremely fast: https://github.com/powturbo/TurboHist

Yeah, it's nice, about what you can achieve. I'd be happy if it was about 8 times faster. Less than 0.2 cycles per byte would be good.

Unfortunately, that's just not achievable with current x86 instruction set.

There is new AVX512 instruction (_mm512_conflict_epi32) supposed to solve this, but it can't make the histogram construction faster than the scalar functions.

Unfortunately using AVX512 instructions only gets a speedup in very specific situations and for many real world use cases it actually underperforms due to oddities of scaling and switching delays. Profiling for more than a few milliseconds is one place you see phantom gains, so take care not to be deceived.





Edit: not saying this isn't a true benefit here, just that claims of speed when using AVX512 need to be treated with fair scepticism for actual use cases.


Considering that the author of that blog is one of the authors of this paper, I think he's very aware of the benchmarking issues.

I didn't realise that. My original post was just to warn that AVX512 benchmarks can be highly misleading. Everyone has troubles measuring AVX512 performance:

"In GROMACS, transitions in and out of AVX-512 code can lead to differences in boost clocks which can impact performance. We are just going to point out the delta here." - from https://www.servethehome.com/intel-performance-strategy-team...

He is aware, but sidestepped these issues. so this code is only recommended on the newest Cannon Lake processors, but we really want to know for which CPU which method is best. What about AMD Rome e.g.?

Since AVX-512 does not exist on AMD Rome, that question answers itself.

Question: with this kind of optimization, how does the program run different functions on different CPUs? Like the ones that don't support AVX512? Some kind of runtime dispatch based on CPU ID?

I wonder if the authors plan to get this merged into commonly used implementations of base64 so that folks can benefit from their research.

For instance our SSE/AVX2 algorithms have been included in a great, mature library written and maintained by Alfred Klomp: https://github.com/aklomp/base64 (the library includes also vectorized code for ARM CPUs).

Is that library widely used, for example is it used by Firefox or Chromium?

Now this is the programming I like, low-level assembly optimisation of long operations, none of this web development rubbish

You can certainly enjoy assembly programming, but that doesn’t make web development rubbish.

This is relevant to webdev, though -- binary data in JSON is often encoded as a Base64 string.

Why use 6-bit data, if JSON can transport 8-bit (-1 for quoting the ") ?

Byte values that are ASCII control characters need to be escaped, and byte values that are not valid UTF-8 can't be represented in JSON strings.

Would this always be valid UTF8?

As an anecdote I have a Phoenix LiveView project where I've discovered that loading, converting and sending an (uncacheable) image as base64 via the websocket connection feels quicker and smoother (I didn't benchmark it) than only updating the src path and letting the browser load it (with http1, I would have to compare with http2).

Well nothing's stopping you from making a living doing this kind of programming. I think the tech landscape is vibrant enough to have fun with it in many different ways!

I'm the opposite! Yin and yang, we need one another. :D

Also worth checking out: https://github.com/superhuman/fast64

That is what Superhuman uses for decoding base64 in browser.

I don't see how that's relevant, it looks pretty standard approach for base64 decoding. You can find thousands of similar examples.

Not only it is just a standard approach, it even misses a relatively common optimization for base64 decoding: instead of computing `(lut[a] << 6) | lut[b]` etc., one can precompute `lut6[x] = lut[x] << 6` and compute `lut6[a] | lut[b]` to avoid shifting. This optimization is famously used by Nick Galbreath's MODP_B64 decoder, which is used by Chromium [1] and turns out to be the most performant non-SIMD decoder according to Lemire et al. [2]

[1] https://github.com/chromium/chromium/tree/master/third_party...

[2] https://github.com/lemire/fastbase64

You can do better, without SIMD: https://github.com/powturbo/TurboBase64

The simple base64 scalar version is also faster the chromium implementation.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact