Note that not many Intel processors support the two specific CPU instructions used (vpermb and vpmultishiftqb).
Cannon Lake Core i3 (not many of these around due to delays with Intel's 10nm fabrication)
Ice Lake Core i3, i5 and i7: Released in September 2019
Honestly, base64 has always been pretty fast for me, even large bitmaps >100kb. Compared to net latency it may be imperceptible to end user. But it still makes a terrific reference paper!
> In economics, a commodity is an economic good or service that has full or substantial fungibility: that is, the market treats instances of the good as equivalent or nearly so with no regard to who produced them.
Basically it's a common good that is sold and purchased, with no modifications required for this use case. This commodity is currently low in supply, but it is still a commodity.
I think hard drives, memory, power supplies are much closer to being fungible.
The problems start with knowing the encoding (base64; for small sizes sometimes hex; for one or two bytes sometimes arrays, with all their ambiguities).
Then that you can't easily differentiate between a string and a bytearray anymore (though you often shouldn't need to).
Then that it becomes noticeably bigger, which is a problem especially for large blobs.
Then that you always have to scan over this now-larger chunk to find its end (instead of just skipping ahead).
Then that you have to encode/decode it, which might require some additional annotations depending on the serialization library and can imply other annoyances.
The latter point can also, in some languages, lead to you accidentally using a base64 string as bytes or the other way around.
Well, I guess that was enough ranting ;=)
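To make the size point concrete, a minimal Python sketch (the "blob" field name is made up for illustration):

```python
import base64
import json

payload = bytes(range(256)) * 40  # 10,240 bytes of binary data

# To put raw bytes into JSON you have to pick a text encoding, e.g. base64.
doc = json.dumps({"blob": base64.b64encode(payload).decode("ascii")})

# The encoded form is ~4/3 the size of the raw bytes, plus JSON overhead,
# and the type information is gone: the consumer sees only a string and
# must know out-of-band that it holds base64 rather than plain text.
print(len(payload), len(doc))

restored = base64.b64decode(json.loads(doc)["blob"])
assert restored == payload
```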
JSON works where you need simplicity on the verge of being dumb, and human-readability.
When I was developing a three-component system, protos were not working for me. Too many schema changes and implementation differences to make the reduced binary size worth it. This was partly because the best protobuf implementation in C (nanoPB) is just the hobby project of some dude, but mostly because coordinating schemas was annoying.
Have you tried BSON or Protobufs to solve your annoyances? How did it go?
Outside of Mongo, I haven’t seen BSON used anywhere.
CBOR however, is up and coming. Starting to look like it may be a first class citizen in AWS someday. On the IoT side they are already preferring it over JSON for Device Defender.
Practically, base64-encode a video file and tell me how exactly you're going to allow the user to seek to any place they like within that video using only common libraries on common platforms... Theoretically easy, practically hard enough that nobody does it.
1. chunkOffset = byteIndex/3*4
2. Do whatever you said.
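Assuming the raw bytes were base64-encoded as one continuous stream, that block arithmetic can be sketched like this (a hypothetical helper of my own, ignoring trailing padding and container-format details):

```python
import base64

def b64_range_for_bytes(byte_start, byte_len):
    """Map a raw-byte range to the base64 character range containing it.

    Base64 encodes each 3-byte group as 4 characters, so round the start
    down and the end up to whole 3-byte groups.
    """
    first_group = byte_start // 3
    last_group = (byte_start + byte_len + 2) // 3
    return first_group * 4, last_group * 4  # char offsets [start, end)

data = bytes(range(200))
enc = base64.b64encode(data)

# Decode only the slice of base64 text that covers bytes 50..79.
s, e = b64_range_for_bytes(50, 30)
chunk = base64.b64decode(enc[s:e])
assert data[50:80] in chunk
```

The slice length is always a multiple of 4 characters, so any interior slice decodes on its own without padding.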
2537012 Nov 6 10:10 test.tar.hex.gz
2954608 Nov 6 10:03 test.tar.b64.gz
What would be nice is an email/http resilient compressed format that could be stuck inside of json strings that can be easily recovered.
wget http://mattmahoney.net/dc/enwik8.zip -O - | gunzip | head -c 1000000 | gzip -9 | wc -c
wget http://mattmahoney.net/dc/enwik8.zip -O - | gunzip | head -c 1000000 | base64 | gzip -9 | wc -c
A general-purpose compression scheme can easily detect when its input is Base64, so I don't see why Base64 should be particularly hard to compress, in principle at least.
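A rough way to see the effect without downloading enwik8 (my own sketch with synthetic compressible text):

```python
import base64
import gzip

# Synthetic but reasonably compressible input (a stand-in for enwik8 text).
text = b"".join(b"line %d: some example text\n" % i for i in range(3000))

raw_gz = gzip.compress(text, 9)
b64_gz = gzip.compress(base64.b64encode(text), 9)

# gzip's byte-oriented model doesn't undo the 8-bit -> 6-bit regrouping,
# so the base64-encoded form compresses noticeably worse.
print(len(raw_gz), len(b64_gz))
```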
There are tables in the paper comparing its speed to other implementations. This is more than an order of magnitude faster than the implementation used in Chrome, and for small data more than twice as fast as their previous AVX2 based implementation (https://arxiv.org/abs/1704.00605). The paper even notes that in some cases, for large inputs, it's decoding base64 faster than you could memcpy the data for the simple reason that it needs only write 6/8ths of the bytes. In their tests, it's only when the data fits in the tiny L1 cache that memcpy is consistently somewhat faster (44 GB/s vs 42 GB/s). So this is not only fast when you have to wait for a slow memory access, but when the loads hit other caches.
Or are you asking in more of a rhetorical sense? Like, why hasn't this been achieved before, that it's obvious or something? AVX-512 is only some 3 years old and base64 is only one of many problems that could benefit from vectorized processing.
However, base64 has weird mappings that take some processing to undo in SIMD – can't use lookup tables and need to regroup bits from 8 bit to 6 bit width. That does take a lot of cycles without specialized bit manipulation instructions.
Also, the data you'll need to process is often already hot in the CPU caches. Disk/network I/O can be DMA'd directly into the L3 cache.
So I think this decoder is a useful contribution. But it's also somewhat of a no-brainer once you have those AVX-512 instructions available.
Base64 uses lookup tables and the bit manipulations required are standard shifts and 'and', which are basic, fast instructions on any CPU. That seems exactly what they do here with an efficient use of AVX512 to make it fast(er).
This is doing exactly that with `vpermb` and `vpermi2b`.
Generally my point stands that LUTs are not vectorizable, except for small special cases like this.
So to perform multiple operations in parallel per core (SIMD), you'll have to write some code to perform the "lookup" transformation instead.
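As a scalar illustration of that idea (my own Python sketch, not the paper's kernel): classify each character by range with compares and arithmetic instead of a table lookup, then regroup the 6-bit values with shifts and masks:

```python
def b64_value(c):
    """Compute a character's 6-bit value from range checks alone --
    the scalar analogue of what the vectorized code does with
    compares and byte shuffles instead of a 256-entry table."""
    if 65 <= c <= 90:    # 'A'-'Z' -> 0..25
        return c - 65
    if 97 <= c <= 122:   # 'a'-'z' -> 26..51
        return c - 71
    if 48 <= c <= 57:    # '0'-'9' -> 52..61
        return c + 4
    if c == 43:          # '+'
        return 62
    if c == 47:          # '/'
        return 63
    raise ValueError("invalid base64 character")

def decode_quad(quad):
    """Regroup four 6-bit values into three bytes with shifts and masks."""
    v = [b64_value(c) for c in quad]
    n = (v[0] << 18) | (v[1] << 12) | (v[2] << 6) | v[3]
    return bytes([(n >> 16) & 0xFF, (n >> 8) & 0xFF, n & 0xFF])

assert decode_quad(b"TWFu") == b"Man"
```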
Yes and no. Memory has high latency, but memory bandwidth has stayed far more in line with CPU performance.
It just shows that even if you do not do anything else (just copy the memory from place A to place B), you have some baseline CPU usage per GB. So this is kind of saying: with this algorithm, you have very minimal CPU usage.
This algorithm is not doing excessive memory access and idling the CPU because of memory bandwidth; it is just doing very minimal CPU work.
memcpy with wide instructions has been an issue for libc.
Current Intel desktop processors have an 'AVX offset' (downclocking), which would hurt if it kicks in for mild base64 use.
Overall, wide instructions are nice and dandy and obliterate microbenchmarks, but they may incur hidden costs, especially on low-power CPUs (laptops).
Compared to what, being content to suffer slower base64 decoding?
Why it's an achievement is obvious.
Your question would be better phrased as "will this make much difference in most common scenarios?"
memcpy operations (and base64 encoding/decoding) are highly "local" operations. When the segment of memory they operate on has been placed in the CPU's level-1 or level-2 caches, the operation is much faster and will not need to access main memory for each byte/word. Only when a memory barrier is encountered will the CPU need to flush/refresh to/from main memory.
In my current desktop PC, it’s 700 GFlops theoretical compute versus 50GB/sec of theoretical RAM bandwidth.
For a quick sanity check, consider that the chip has 18 cores and runs at ~4.2GHz. To achieve the theoretical throughput each core has to do ~30 single precision flops per cycle. Which sounds about right: HEDT Intel chips can (theoretically) do 32 of those per cycle.
This, of course, ignores whatever throttling Intel would need to apply to maintain TDP, so a purely theoretical back of the envelope.
The other thing is that even seemingly trivial things like spawning and synchronizing with lots of threads will be much more complex on CPUs with many cores. At some point, naively looping over all threads is going to be too slow. I think that the limit is going to be around 64 cores. Past that, you should actually parallelize your worker thread management to stay efficient. There is precedent for this in HPC, e.g. MPI implementations.
It is faster than SSE implementations on Intel/AMD and NEON SIMD implementations on ARM
see benchmark at Turbo Base64: https://github.com/powturbo/TurboBase64
> Only the non portable AVX2 based libraries are faster
The paper in OP is about 2 times as fast as their previous AVX2 code. I think they are superior to Turbobase64 here (though with more stringent hardware requirements).
Not to say that Turbobase64 isn't impressive, but it's not at all the same level of performance.
It is impossible to match the speed of a good memcpy when you're writing 33% more data, as base64 encoding does.
People are using this to do clever things like encoding a watermark image to overlay on a picture, because the Shortcuts app does not allow file attachments as part of the scripting language.
Unfortunately, that's just not achievable with the current x86 instruction set.
Edit: not saying this isn't a true benefit here, just that claims of speed when using AVX512 need to be treated with fair scepticism for actual use cases.
Considering that the author of that blog is one of the authors of this paper, I think he's very aware of the benchmarking issues.
"In GROMACS, transitions in and out of AVX-512 code can lead to differences in boost clocks which can impact performance. We are just going to point out the delta here." - from https://www.servethehome.com/intel-performance-strategy-team...
That is what Superhuman uses for decoding base64 in browser.
The simple scalar base64 version is also faster than the Chromium implementation.