
Base64 decoding with SIMD instructions - ingve
http://0x80.pl/notesen/2016-01-17-sse-base64-decoding.html
======
pcwalton
Have you considered using pcmpistrm in Ranges mode with bit 6 of imm8 set to
avoid the pcmpgtb/pcmpgtb/pand sequence? I think that would save 6
instructions in your main loop (though pcmpxstrx are slow).

Reference:
[https://en.wikibooks.org/wiki/X86_Assembly/SSE#The_Four_Inst...](https://en.wikibooks.org/wiki/X86_Assembly/SSE#The_Four_Instructions)
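Roughly, in intrinsics (an untested sketch of the idea, not code from the
article):

    
    
        #include <nmmintrin.h>  // SSE4.2
        
        // One pcmpistrm marks every byte of `input` that is a valid
        // base64 character. _SIDD_CMP_RANGES treats `ranges` as
        // (lo, hi) pairs; _SIDD_UNIT_MASK (bit 6 of imm8) returns a
        // per-byte mask rather than a bit mask. Caveat: the implicit
        // lengths mean a zero byte in `input` ends the comparison.
        static inline __m128i base64_valid_mask(const __m128i input) {
            const __m128i ranges = _mm_setr_epi8(
                'A', 'Z', 'a', 'z', '0', '9', '+', '+', '/', '/',
                0, 0, 0, 0, 0, 0);
            return _mm_cmpistrm(ranges, input,
                _SIDD_UBYTE_OPS | _SIDD_CMP_RANGES | _SIDD_UNIT_MASK);
        }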

~~~
nkurz
I'd forgotten that these exist, but I'm not sure they are going to help here.
You save something doing 2 ranges at once, but Agner says each pcmpistrm takes
3 µops that must be on P0, while the pcmpgtb can go on P01, and the pand goes
on P015. Certainly worth trying, though.

My instinct was that switching to AVX2 was going to be a big win, but it
doesn't help as much as I hoped. Part of this is that we're reading from and
writing to memory (that is, it gets faster if I make the buffers fit in L1),
but even then it's not astounding. Not sure what the limiting factor is.

Here's what I see on Skylake i7-6700 3.40GHz:

    
    
      nate@skylake:~/git/WojciechMula-toys/base64/decode/sse$ ./speed
      input size: 67108864
         improved scalar... 0.024
                  scalar... 0.041 (speed up: 0.59)
                     SSE... 0.018 (speed up: 1.30)
              SSE & BMI2... 0.016 (speed up: 1.47)
              AVX & BMI2... 0.015 (speed up: 1.63)

~~~
nkurz
The straight 256-bit AVX solution (no pext) goes a bit faster out of memory:

    
    
                     AVX... 0.013 (speed up: 1.80)
    

And even better out of L1:

    
    
      nate@skylake:~/git/WojciechMula-toys/base64/decode/sse$ ./speed
      input size: 8192
         improved scalar... 2.481
                  scalar... 4.839 (speed up: 0.51)
                     SSE... 1.918 (speed up: 1.29)
              SSE & BMI2... 1.668 (speed up: 1.49)
                     AVX... 1.127 (speed up: 2.20)
              AVX & BMI2... 1.290 (speed up: 1.92)
    

(Absolute times aren't comparable between Mem and L1, just the ratios within
each)

------
nothrabannosir

        // 1. lookup and build output word at the same time
        const uint32_t dword = lookup32[0][input[i + 0]]
                             | lookup32[0][input[i + 1]]
                             | lookup32[0][input[i + 2]]
                             | lookup32[0][input[i + 3]];
        
        // 2. save 32 bits
        *reinterpret_cast<uint32_t*>(out) = dword;
        out += 3;
    

Do you mean lookup32[0]... | lookup32[1]... | lookup32[2]... | lookup32[3]...,
instead?

~~~
wmu
Indeed my mistake, thanks for pointing it out. Being your own proofreader is
not an easy job. :)
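
For anyone skimming, the fixed version reads:

    
    
        const uint32_t dword = lookup32[0][input[i + 0]]
                             | lookup32[1][input[i + 1]]
                             | lookup32[2][input[i + 2]]
                             | lookup32[3][input[i + 3]];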

~~~
nothrabannosir
Nice article, thank you for sharing. Great intro to SSE :)

------
Narishma
Core i5 and Core i7 are really meaningless descriptions of processors since
those names span many generations with different performance characteristics.

~~~
wmu
You're right, I wasn't aware of that. I'll fix it.

------
melted
I like how getting a 46% gain over already-optimized C code is considered
"insignificant". Moore's law is over, folks; 46% is a pretty decent gain in my
book, especially since GCC gives you the facilities to keep your code
portable: just use function multi-versioning and define a canonical fallback.
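
A minimal C++ sketch of that (the names here are stand-ins, not the article's
code); GCC emits one body per target plus an ifunc resolver that picks the
best version at load time, with "default" as the canonical fallback:

    
    
        #include <cstdio>
        
        __attribute__((target("default")))
        const char *decode() { return "scalar"; }   // portable fallback
        
        __attribute__((target("sse4.1")))
        const char *decode() { return "sse"; }      // SSE path
        
        __attribute__((target("avx2")))
        const char *decode() { return "avx2"; }     // AVX2 path
        
        int main() { std::puts(decode()); }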

~~~
jcranmer
I thought about trying to measure the impact on the real-world code that I'd
use it in (specifically, decoding email messages). The challenge of doing
base64 decoding fast there is that you generally have to ignore whitespace
characters (and possibly other junk as well).

The implementation here clearly requires that all the extraneous stuff be
kicked out and the resulting string compacted first, an operation whose
overhead could by itself easily negate a mere 46% gain; it's hard to see how
the code could be modified to make those stateful decisions while still
keeping the loop tight. I also have a sneaking suspicion that the improved
scalar version might turn out to be worse in real-world systems, because the
4KiB table would cause L1 cache thrashing if you're doing lots of small base64
decodings followed by later processing (e.g., string conversion).
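
To make the cost concrete, the pre-pass I mean looks roughly like this (an
illustrative sketch, not code from the article):

    
    
        #include <cstddef>
        #include <cstdint>
        
        // Strip whitespace so the SIMD decoder sees a contiguous
        // buffer. This pass touches every input byte, which is the
        // overhead being weighed against the decoder's 46% gain.
        size_t compact(const uint8_t *in, size_t len, uint8_t *out) {
            size_t n = 0;
            for (size_t i = 0; i < len; ++i) {
                const uint8_t c = in[i];
                if (c != ' ' && c != '\t' && c != '\r' && c != '\n')
                    out[n++] = c;
            }
            return n;
        }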

So the real problem is that the measurement is done in a way that's different
enough from real-world scenarios that it's questionable whether any speedup is
actually achievable in those scenarios, let alone one as significant as 46%.
It's easy to be fast if you ignore having to do it correctly.

~~~
jcranmer
To follow up: I did a quick modification of the test harness using a gigantic
mbox file I had and some really crappy MIME parsing code (the input size
counts only the base64-encoded text; the full file is ~562MB):

    
    
      $ ./email
      input size: 86023056
      improved scalar... 1.528
               scalar... 1.564 (speed up: 0.98)
                  SSE... 1.533 (speed up: 1.00)
                 Null... 1.447 (speed up: 1.06)

So about 5% of the test time was spent doing base64 decoding; on just the
base64 section, the improved scalar had a 48% speedup over regular scalar and
SSE had a 40% speedup over regular scalar. So the short conclusion is that SSE
doesn't really get any speedup on short strings.

The additional test code, for full disclosure (yes, it's not correct, but it
was easier to code this and get some numbers than to get nothing; a more
correct solution would hurt the SSE variant more than the others):

    
    
        #include <algorithm>
        #include <cctype>
        #include <cstdint>
        #include <fstream>
        #include <string>
        
        template <typename T>
        void run_email(T callback) {
          static std::string cte{"content-transfer-encoding"};
          std::ifstream mbox("/tmp/test.mbox");
          std::string line;
          bool inHeader = true;
          bool base64Body = false;
          while (std::getline(mbox, line)) {
            // strip the CR of CRLF line endings (guard empty lines)
            if (!line.empty() && line.back() == '\r')
              line.pop_back();
            if (inHeader) {
              std::transform(line.begin(), line.end(), line.begin(), ::tolower);
              if (line.find(cte) == 0 && line.find("base64") != std::string::npos)
                base64Body = true;
              if (line.empty())
                inHeader = false;
            } else {
              // crude message/part boundary detection
              if (line.substr(0, 5) == "From " || line.substr(0, 2) == "--") {
                inHeader = true;
                base64Body = false;
                continue;
              }
              if (base64Body && !line.empty()) {
                // drop '=' padding, then pad with 'A' to a 16-byte
                // multiple so the SIMD kernels have no scalar tail
                while (!line.empty() && line.back() == '=')
                  line.pop_back();
                while (line.size() % 16 != 0)
                  line.push_back('A');
                uint8_t *text = (uint8_t*)&line[0];
                callback(text, line.size(), text);
              }
            }
          }
        }
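
Invoked like this, for completeness (decode_sse is a placeholder for whichever
variant is being timed, not the repo's actual symbol name):

    
    
        run_email([](uint8_t *text, size_t size, uint8_t *out) {
          decode_sse(text, size, out);  // placeholder decoder
        });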

------
brandmeyer
A curious result: using four-wide SIMD instructions gives a mere 40% speed-up
compared to the scalar case. I wonder if this is actually a net loss.

\- CPUs are power-limited today (high confidence)

\- Using the SIMD units instead of the scalar units increases power
consumption (medium confidence - certainly some arithmetic-limited algorithms
consume much more power with SSE)

Conclusion - is the incremental impact on total system throughput negative for
this speed boost? I wonder how much of a raw throughput improvement is needed
to justify the SSE units' increased power consumption.

~~~
nkurz
_Using the SIMD units vs scalar units increases power consumption_

True, but only slightly. If SIMD works for the problem, the overall efficiency
can be much greater. This paper has some good numbers:
[http://kentcz.com/downloads/P149-ISCA14-Preprint.pdf](http://kentcz.com/downloads/P149-ISCA14-Preprint.pdf)

 _I wonder if this is actually a net loss._

No, on modern processors SIMD is usually a big win. The basic summary is that
"fixed costs dominate variable costs". Increasing vector width (if the problem
allows it) translates almost directly to increased efficiency: "Doubling the
SIMD width with AVX instructions doubles performance and only increases total
power by 4.3%, which results in a 1.9x improvement in energy efficiency."
(quotes from the paper; efficiency is performance per watt, so 2.0 / 1.043 ≈ 1.9)

More generally, higher Instructions Per Cycle (IPC) means higher efficiency,
because the fixed overhead of the processor is such a big factor. For
computationally intense problems, working faster and finishing in fewer cycles
makes for higher energy efficiency.

~~~
brandmeyer
Good reference. This is a sign that, in general-purpose processors, the power
consumed by instruction decoding and OoO processing is much greater than that
consumed by the arithmetic each instruction performs.

~~~
wtallis
This is something that the Mill processor folks refer to frequently, by
comparing DSP power usage against general-purpose CPU power usage. The power
consumption of your ALUs is nothing compared to the bag of tricks CPUs use to
keep them busy: register renaming, instruction reordering, multiple kinds of
speculation, etc.

------
GFK_of_xmaspast
It's kind of unfortunate that the i5 baseline is the "scalar" implementation
but the i7 baseline is the "improved scalar" one.

~~~
wmu
Right, I'll fix it.

