Base64 decoding with SIMD instructions (0x80.pl)
53 points by ingve on Jan 24, 2016 | 26 comments



Have you considered using pcmpistrm in Ranges mode with bit 6 of imm8 set to avoid the pcmpgtb/pcmpgtb/pand sequence? I think that would save 6 instructions in your main loop (though pcmpxstrx are slow).

Reference: https://en.wikibooks.org/wiki/X86_Assembly/SSE#The_Four_Inst...
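
In intrinsics form, the idea is roughly the following (an untested sketch of the ranges/byte-mask mode, not the article's code; the helper name is mine):

    #include <nmmintrin.h>   // SSE4.2, for _mm_cmpistrm

    // One pcmpistrm in Ranges mode with imm8 bit 6 (_SIDD_UNIT_MASK) set
    // returns a byte mask of which input bytes fall inside the base64
    // alphabet, instead of a pcmpgtb/pcmpgtb/pand triple per range.
    // Caveat: pcmpistrm uses implicit string lengths, so a zero byte in
    // the input terminates the comparison early.
    static inline __m128i base64_valid_mask(const __m128i input) {
        // (lo, hi) range pairs; the trailing zero terminates the needle,
        // so the unused slots are ignored
        const __m128i ranges = _mm_setr_epi8('A','Z', 'a','z', '0','9',
                                             '+','+', '/','/',
                                             0, 0, 0, 0, 0, 0);
        return _mm_cmpistrm(ranges, input,
                            _SIDD_UBYTE_OPS | _SIDD_CMP_RANGES | _SIDD_UNIT_MASK);
    }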


I'd forgotten that these exist, but I'm not sure they are going to help here. You save something doing 2 ranges at once, but Agner says each pcmpistrm takes 3 µops that must be on P0, while the pcmpgtb can go on P01, and the pand goes on P015. Certainly worth trying, though.

My instinct was that switching to AVX2 was going to be a big win, but it doesn't help as much as I hoped. Part of this is that we're reading from and writing to memory (that is, it gets faster if I make the buffers fit in L1), but even then it's still not astounding. Not sure what the limiting factor is.

Here's what I see on Skylake i7-6700 3.40GHz:

  nate@skylake:~/git/WojciechMula-toys/base64/decode/sse$ ./speed
  input size: 67108864
     improved scalar... 0.024
              scalar... 0.041 (speed up: 0.59)
                 SSE... 0.018 (speed up: 1.30)
          SSE & BMI2... 0.016 (speed up: 1.47)
          AVX & BMI2... 0.015 (speed up: 1.63)


The straight 256-bit AVX solution (no pext) goes a bit faster out of memory:

                 AVX... 0.013 (speed up: 1.80)
And even better out of L1:

  nate@skylake:~/git/WojciechMula-toys/base64/decode/sse$ ./speed
  input size: 8192
     improved scalar... 2.481
              scalar... 4.839 (speed up: 0.51)
                 SSE... 1.918 (speed up: 1.29)
          SSE & BMI2... 1.668 (speed up: 1.49)
                 AVX... 1.127 (speed up: 2.20)
          AVX & BMI2... 1.290 (speed up: 1.92)
(Absolute times aren't comparable between Mem and L1, just the ratios within each)


Hi, author here. Yes, I've considered these instructions, but couldn't find any advantage over a sequence of plain instructions. The sequence properly deals with invalid inputs, so pre-validation with pcmpxstrx isn't required.


    // 1. lookup and build output word at the same time
    const uint32_t dword = lookup32[0][input[i + 0]]
                         | lookup32[0][input[i + 1]]
                         | lookup32[0][input[i + 2]]
                         | lookup32[0][input[i + 3]];
    
    // 2. save 32 bits
    *reinterpret_cast<uint32_t*>(out) = dword;
    out += 3;
Do you mean lookup32[0]... | lookup32[1] ... | ..[2] | [3], instead?


Indeed my mistake, thanks for pointing it out. Being one's own proofreader is not an easy job. :)
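
For readers following along, the corrected lookup reads:

    const uint32_t dword = lookup32[0][input[i + 0]]
                         | lookup32[1][input[i + 1]]
                         | lookup32[2][input[i + 2]]
                         | lookup32[3][input[i + 3]];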


Nice article, thank you for sharing. Great intro to SSE :)


Core i5 and Core i7 are really meaningless descriptions of processors since those names span many generations with different performance characteristics.


You're right, I wasn't aware of that. I'll fix it.


I like how getting a 46% gain over already optimized C code is considered "insignificant". Moore's law is over, folks; 46% is a pretty decent gain in my book, especially with GCC, where facilities are available to keep your code portable. Just use function multi-versioning and define a canonical fallback.
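
A minimal sketch of what that could look like with GCC function multi-versioning in C++ (the base64_decode signature and the decode_* helpers are hypothetical stand-ins, not the article's API):

    #include <cstddef>
    #include <cstdint>

    // hypothetical per-ISA implementations standing in for the article's code
    size_t decode_scalar(const uint8_t* in, size_t len, uint8_t* out);
    size_t decode_sse(const uint8_t* in, size_t len, uint8_t* out);
    size_t decode_avx2(const uint8_t* in, size_t len, uint8_t* out);

    // GCC picks the best variant for the running CPU via an ifunc resolver;
    // "default" is the canonical fallback.
    __attribute__((target("default")))
    size_t base64_decode(const uint8_t* in, size_t len, uint8_t* out) {
        return decode_scalar(in, len, out);
    }

    __attribute__((target("sse4.1")))
    size_t base64_decode(const uint8_t* in, size_t len, uint8_t* out) {
        return decode_sse(in, len, out);
    }

    __attribute__((target("avx2")))
    size_t base64_decode(const uint8_t* in, size_t len, uint8_t* out) {
        return decode_avx2(in, len, out);
    }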


I thought about trying to measure the impact on the real-world code that I'd have to use it in (specifically, decoding email messages). The challenge of doing base64 decoding fast there is that you generally have to ignore whitespace characters (and possibly other junk as well).

The implementation here clearly requires that all the extraneous stuff get kicked out and the resulting string be compacted, an operation whose overhead by itself could easily negate a mere 46% gain; it's hard to see how it could be modified to make those stateful decisions while still keeping the tight inner loop. I also have a sneaking suspicion that the improved scalar version might turn out to be worse in real-world systems, because the 4KiB table would cause L1 cache thrashing if you're doing lots of small base64 decodings followed by later processing (e.g., string conversion).

So the real problem is that the measurement is done in a way which is different enough from real-world scenarios that it's questionable if any speedup is actually achievable in those scenarios, let alone one as significant as a 46% gain. It's easy to be fast if you ignore having to do it correctly.
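
As a rough illustration (my own sketch, not from the article), the compaction pre-pass described above is essentially an erase/remove over the whole buffer, which already touches every byte once before the decoder ever runs:

    #include <algorithm>
    #include <cctype>
    #include <string>

    // strip whitespace (CR/LF, spaces, tabs) so a block decoder sees
    // contiguous base64 text
    void compact_base64(std::string& s) {
        s.erase(std::remove_if(s.begin(), s.end(),
                               [](unsigned char c) { return std::isspace(c) != 0; }),
                s.end());
    }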


To follow up: I did do a quick modification of the test harness, using a gigantic mbox file I had and some really crappy MIME parsing code:

  $ ./email
  input size: 86023056
     improved scalar... 1.528
              scalar... 1.564 (speed up: 0.98)
                 SSE... 1.533 (speed up: 1.00)
                Null... 1.447 (speed up: 1.06)

(The input size counts only the base64-encoded text; the mbox file itself is ~562MB.)

So about 5% of the test was spent doing base64 decoding; on just the base64 section, the improved scalar had a 48% speedup over regular scalar and SSE had a 40% speedup over regular scalar. So the short conclusion is that SSE doesn't really get any speedup on short strings.

The additional test code, for full disclosure (yes, it's not correct, but it was easier to code this and get some numbers than nothing; a more correct solution would hurt the SSE version more than the others):

    #include <algorithm>
    #include <cctype>
    #include <cstdint>
    #include <fstream>
    #include <string>

    template <typename T>
    void run_email(T callback) {
      static std::string cte{"content-transfer-encoding"};
      std::ifstream mbox("/tmp/test.mbox");
      std::string line;
      bool inHeader = true;
      bool base64Body = false;
      while (std::getline(mbox, line)) {
        // strip the trailing CR from CRLF line endings
        if (!line.empty() && line.back() == '\r')
          line.pop_back();
        if (inHeader) {
          std::transform(line.begin(), line.end(), line.begin(), ::tolower);
          if (line.find(cte) == 0 && line.find("base64") != std::string::npos)
            base64Body = true;
          if (line.empty())
            inHeader = false;
        } else {
          // a new message or MIME boundary starts a new header block
          if (line.substr(0, 5) == "From " || line.substr(0, 2) == "--") {
            inHeader = true;
            base64Body = false;
            continue;
          }
          if (base64Body) {
            // drop '=' padding, then pad with 'A' to a multiple of 16 bytes
            while (!line.empty() && line.back() == '=')
              line.pop_back();
            while (line.size() % 16 != 0)
              line.push_back('A');
            uint8_t *text = (uint8_t*)&line[0];
            callback(text, line.size(), text);
          }
        }
      }
    }


And on the other hand there are plenty of applications where you can assume that the encoded data doesn't need any pre-processing, either because your program was the one that did the encoding or because validation occurs on a higher level (e.g. a cryptographic MAC).

The cache issue is a good point, I wonder how dedicated your process has to be to processing base64 text for that to not be a factor.


"I like how getting 46% gain over already optimized C code is considered "insignificant".

Author here (again!). I've used SIMD in different algorithms and often the speedup was 2x, 3x, or more. So from my skewed point of view a speedup of less than 2x isn't very impressive. :) But I take your point; next time I'll try to be more enthusiastic.


Sure, in lucky situations I've been able to speed things up by an order of magnitude just by changing memory layout and using SSE. But 46% in a tight spot is a worthwhile change. Basically, when dealing with highly performance-intensive code (storage, if you must know), my rule of thumb was: 2% or more of improvement in overall end-to-end throughput on a long-running benchmark is worthwhile if the change is not too complicated. This could mean a 10x improvement in one part of the pipeline, or a 2% improvement across the board, or anything in between: the goal was overall throughput. Likewise, nothing that regressed performance could ever go in. You'd be surprised what you can squeeze out of your code if you establish these ground rules and run with them for a year.


A 46% gain over C is irrelevant if your program only spends 1% of its time decoding base64.


Sure, if it's just 1% it probably wasn't worth optimizing even that plain C code. But you don't know that. And you especially don't know that if it's a part of a library or the like.


A curious result: using four-wide SIMD instructions gives a mere 40% speed-up compared to the scalar case. I wonder if this is actually a net loss.

- CPUs are power-limited today (high confidence)

- Using the SIMD units rather than the scalar units increases power consumption (medium confidence - certainly some arithmetic-limited algorithms consume much more power with SSE)

Conclusion: could the incremental impact on total system throughput be negative despite this speed boost? I wonder how much of a raw throughput improvement is needed to justify the SSE units' increased power consumption.


Using the SIMD units vs scalar units increases power consumption

True, but only slightly. If SIMD works for the problem, the overall efficiency can be much greater. This paper has some good numbers: http://kentcz.com/downloads/P149-ISCA14-Preprint.pdf

I wonder if this is actually a net loss.

No, on modern processors SIMD is usually a big win. The basic summary is that "fixed costs dominate variable costs". Increasing vector width (if the problem allows it) translates almost directly to increased efficiency: "Doubling the SIMD width with AVX instructions doubles performance and only increases total power by 4.3%, which results in a 1.9x improvement in energy efficiency." (quotes from the paper)

More generally, higher Instructions Per Cycle (IPC) means higher efficiency. The overhead of the processor is a big factor. For computationally intense problems, working faster and finishing in fewer cycles makes for higher energy efficiency.


Good reference. This suggests that, in general-purpose processors, the power consumed by instruction decoding and out-of-order machinery is much greater than that consumed by the arithmetic each instruction actually performs.


This is something that the Mill processor folks refer to frequently, by comparing DSP power usage against general-purpose CPU power usage. The power consumption of your ALUs is nothing compared to the bag of tricks CPUs use to keep them busy: register renaming, instruction reordering, multiple kinds of speculation, etc.


You have to also take into account that getting your work done 40% faster lets the system go to sleep 40% sooner. Getting the CPU into an idle state tends to dominate everything else.

In my experience, it's very rarely worth choosing a slower algorithm out of power consumption concerns. Use all the silicon available to you to get the work done and go to sleep.


The reference provided by nkurz clearly measured that it is a net win in most cases. However, that fact isn't inherently obvious. CPUs aggressively use clock gating to shut down idle parts of the processor at a fine-grained level, so avoiding the use of several entire compute units does have some savings as well. The difference is that I was assuming something more like 2x power usage, where the actual difference is in the small single digits.


Also, keep in mind how little of a modern CPU's die space is actually dedicated to the traditional cores. A quad-core Skylake processor dedicates over half its area to things other than CPU cores [1], and each core itself spends a lot of space on paths that are always in use, like instruction decoding and the register file. So gating in a situation like this is only ever going to turn off a very small bit of the processor.

[1] http://www.techpowerup.com/img/15-08-18/77a.jpg


It's kind of unfortunate that the i5 basepoint is the "scalar" implementation but the i7 basepoint is the "improved scalar" one.


Right, I'll fix it.



