Hacker News new | past | comments | ask | show | jobs | submit login
AVX512 VBMI – remove spaces from text (0x80.pl)
152 points by pedro84 on Jan 5, 2019 | hide | past | favorite | 75 comments

OK, I got nerd-sniped here. You can actually construct the indices for the shuffle fairly easily with PEXT. Basically, you have 6 64-bit masks, each corresponding to a different bit of the index of each byte in the 64-byte vector. So for mask 0, a bit is set in the mask if its index has bit (1 << 0) set, mask 1 has the same but for bit (1 << 1), etc. The masks have a simple pattern, that changes between 1 and 0 bits every (1 << i) bits. So for 3 bits the masks would be: 10101010, 11001100, 11110000.

These masks are then extracted with PEXT for all the non-whitespace bytes. What this does is build up, bit by bit, the byte index of every non-whitespace byte, compressed down to the least-significant end, without the indices of whitespace bytes.

I wasn't actually able to run this code, since I don't have an AVX-512 machine, but I'm pretty sure it should be faster. I put the code on github if anyone wants to try: https://github.com/zwegner/toys/blob/master/avx512-remove-sp...

    const uint64_t index_masks[6] = {
    const __m512i index_bits[6] = {


  ...later, inside the loop:

    mask = ~mask;
    __m512i indices = _mm512_set1_epi8(0);
    for (size_t index = 0; index < 6; index++) {
        uint64_t m = _pext_u64(index_masks[index], mask);
        indices = _mm512_mask_add_epi8(indices, m, indices, index_bits[index]);
    output = _mm512_permutexvar_epi8(indices, input);

I'll run it for you on the same machine Wojciech is using and report back shortly. I'm getting a compiler error for your call to _pext_u64(), but I think it's just a matter of adjusting the compiler flags. OK, adding "-march=native" works (was only -mavx512vbmi). Oops, now ./unittest fails:

  [nate@nuc avx512-remove-spaces]$ ./unittest
  test 1 gap
  FAILED; len_ref=63, len=0
   input: [ bcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789#@]
   ref: [bcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789#@]
  result: []
Did you maybe flip the args to _pext_u64() like I often do? I'll check back in a few minutes and see if you've committed a fix.

Edit: Getting to my bedtime here. Send me email (in profile) if you'd like me to follow up with it tomorrow.

Ah, just figured it out. I accidentally deleted this line:

            len = 64 - __builtin_popcountll(mask);
So the destination pointer never got updated. Mental debugging is kind of annoying :)


OK, I'm still up and just saw your fix. Yes, that makes the ./unittest work. I have numbers for ./benchmark; let me grab and compile his original for comparison. Do you know which result you'd expect to see the best difference for?

Results: Yes, if I'm reading things right (and assuming the unittest caught any issues) you are much faster at large cardinalities! Your approach seems to be a constant .7 cycle/op, while Wojciech's varies from about .4 for card=1 to 4.0 for card=64. The breakeven point is card=7, with your approach faster at all higher numbers. Nice improvement!

Send me email if you'd like to pursue this further tomorrow. Also, you probably want to read dragontamer's post here closely and mentally consider whether his approach might make even more sense. I haven't thought about it yet, but if he thinks he knows the best way, he may well be right. Good night!

Edit: Just noticed that my numbers seem slightly faster than the ones Wojciech reported for large cardinalities. My guess would be that this is because I added "-march=native" to both compilations, although maybe there's something more subtle happening too. He's in Europe, so probably asleep now, but will likely be by to offer his thoughts in a few hours.

Keep in mind that the microbenchmark uses exactly fixed cardinalities, with no randomness, so the branches in the original algorithm will be perfectly predicted and hence will tend to inflate the reported performance versus a case where the number of spaces has an expected cardinality but with enough variance that the loop mispredicts once for every block.

So I suspect the cutover point for the pext approach to be better than the microbenchmark suggestions for typical real-world inputs.

It would also be interesting to see a lookup based approach, using multiple lookups (probably with gather) to build the shuffle control for a single block. 0.7 cycles per element means ~45 cycles per 64 byte block, so that's still quite a large budget per block, and the load ports have little pressure.

Awesome, thanks! I'll play around a bit with the algorithms and see if I come up with anything neat. I'll email you if I do.

Just favorited this thread for algorithmic awesomeness. Thanks guys.

Hi, author here. That's amazing! I totally love this approach, it so neat. Wow :) Would you mind if I include your solution in my repository and the article?

BTW you can always check validity of your code with Intel Software Developer Emulator (https://software.intel.com/en-us/articles/intel-software-dev...). I use this program constantly, as the AVX512 machines I have access to are remote ones (literally remote).

Hi there! Thanks for the nice writeup and successfully nerd-sniping me! :) You can absolutely use the code.

Good tip on the SDE too. I had heard of it but didn't realize it was so lightweight, dynamically patching the binary to emulate only the needed instructions. Cool!

Your approach is incredibly fast. I love the code, it's so short. :)

Please take look at https://github.com/WojciechMula/toys/tree/master/avx512-remo..., I put there some raw numbers; will update the article later.

For English texts your algorithm is more than 20 times faster (twenty, not a typo), for CSVs the speed-up is "only" 9. However, I'd love to compare it with a well tuned scalar code.

One important take away for me is that I underestimated pext; it's a powerful instruction. (on AMDs terribly slow, though)

Wow, not bad! I'm looking forward to the writeup.

I'm still thinking about this one. It seems that ~.5 cycles/byte is slower than it should be. There's 6 pext->kmovq->vpaddb chains, where the pexts and kmovqs are all independent and can be pipelined. One chain should be 3+2+1 cycles, so all 6 should be 11 cycles. Agner Fog's site doesn't have information for vpermb yet, but a Stack Overflow comment says it's 1 cycle (which is rather surprising, that's 64 64->1 8 bit muxes--lots of silicon!). The latency on most other instructions can be hidden.

I updated my code with a couple minor optimizations, that at least lead to some nicer assembly output:

Getting rid of the if (mask) branch--it should almost never be false in real code (and never is in the microbenchmarks).

Inverting the mask during the comparison:

        uint64_t mask = _mm512_cmpneq_epi8_mask(input, spaces)
                      & _mm512_cmpneq_epi8_mask(input, NL)
                      & _mm512_cmpneq_epi8_mask(input, CR);
...then the mask computation gets a lot cleaner, since there's now just three chained vpcmps (though this might lose a couple cycles or so because of the dependent 3 cycle latencies. Not sure how to trick gcc into using parallel vcmps with the ands on GPRs...). But we also save the mask = ~mask and the subtraction from 64 of the popcount.

All in all, speed-of-light for this code should be 3*3+2+11+1=23 cycles per iteration (if I'm counting everything properly), and probably even faster from pipelining between iterations. I hope it should be a bit faster than the 32ish that the old version was using.

Perhaps nobody would see this, as the thread is already dead, but... :) I managed to include description of your algorithm and updated results, including procedures by aqrit: http://0x80.pl/notesen/2019-01-05-avx512vbmi-remove-spaces.h...

BTW: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88798

vpermb is 3L1T, ie three cycles latency, one per cycle throughput. Full AIDA64 timing dump for cannonlake are on instlatx64 side.

Since there are only 8 bits per input character, would it be possible to just move their bits directly using pext?

Hmm, interesting idea.

The main problem I see is that PEXT can only move bits from within an 64-bit operand, so we can't operate on the 512-bit vectors anymore, AFAICT. We'd have to split the loop up into 8-byte chunks instead, and expand the mask from 1 to 8 bits per comparison, so PEXT pulls out all the bits of each character (e.g. a 4-bit mask 1011 becomes 0xFF00FFFF). This mask computation could be done in 512-byte chunks, but we'd have to split the original 1-bit mask up into 8-bit slices that each get popcounted individually (either in a 256 byte table lookup or with the popcount instruction). That's just so we know how much to increment the pointer between each 8-byte chunk.

All in all, I think the 8 popcounts + 8 PEXTs would make it slower than the full 512-bit version (there's only one PEXT execution unit on Skylake X chips). It might be competitive on machines without AVX-512, though. I'll try out some code--I should actually be able to run this one :)

If you could avoid actually using POPCNT for the popcount, and the limiting factor was the pext, the "speed limit" of this approach would be around 0.125 cycles per input byte since pext handles 8 bytes and you can do 1 per cycle.

Of course that assumes that you can fit 100% of the remaining work for every 8 byte chunk into what's left of that cycle (ie into no more than 3 uops). Since you have at least a store and a load per word, that's probably going to be impossible! It's not totally unlikely that you could get to around 0.25 cycles though.

A good approach would be to vectorize the rest of the work and store the resulting mask values and increments in 64-byte chunks, to be re-read by the scalar part (since there is no really efficient way to get stuff from vector to GP regs). So the scalar part would have 3 loads (mask, increment, input data) a store and the pointer increment. I guess you could load-op the addition, so it seems possible you might be limited by the 3 loads (taking 1.5 cycles).

I was excited for AVX512 long ago but I've since heard that if you are jamming AVX512 instructions to every core you get a forcibly lower clockrate. In practice this sounds like it'd suggest using an AVX512 algorithm could actually be slower even when it's faster. If that's the case, I wonder what kind of performance gain you'd have to hit to beat a scalar or SSE-based vectorized algorithm.

Lemire has criticized this: https://lemire.me/blog/2018/08/13/the-dangers-of-avx-512-thr...

OP doesn't seem to show any major problems in his microbenchmarking, which is a start.

TBH I didn't think about throttling. My (mis)understanding is that a single program which completes in a fraction of second on an almost-idle machine doesn't suffer from AVX512 throttling.

Daniel has reviewed this problem in series of posts: - https://lemire.me/blog/2018/08/15/the-dangers-of-avx-512-thr... - https://lemire.me/blog/2018/08/25/avx-512-throttling-heavy-i...

On most chips released to date, AVX-512 always causes a lower frequency (the so-called L1 license) as soon as any instruction uses it: basically the chip needs to drop from the L0 (fastest) license to L1 (middle) before it even starts any AVX-512 (probably to power up the upper lanes of the ALUs, register file, etc).

Below L1, there is an even lower frequency tier, L2, which behaves as you say: it doesn't drop to L2 unless there is a sustained use of heavy AVX-512 instructions.

On the Cannonlake chip you used, however, the i3-8121U, I don't measure any AVX-512 throttling at all - regardless of what instructions are running, up to and including dense 512-bit FMAs, the chip is running at the full 3.2 GHz (1 active core) or 3.1 GHz (2 active cores) turbo speed at all times.

This behavior was common in the past on client parts: e.g., most Skylake client parts run at full speed when using 256-bit AVX/AVX2 operations, but server parts and some higher-end desktop parts based on them had varying speed tiers. All of the initially released SKX (Skylake X) parts were based on the Skylake-SP server part and had the speed tiers (downclocking) for AVX-512, but maybe many or most future client parts won't if CNL is any indication.

Thanks a lot for such a great explanation!

You also get a lower clock rate from running scalar code on many cores at once. Use of the 512-bit unit only makes the coefficient different.

That said, the biggest mystery in this article is why you'd ever want to remove all whitespace from text. Why is that useful?

I don't usually want to remove all whitespace from text, but I do often want to remove weird Unicode whitespace characters (there are dozens of them), control characters, excessively repeated diacritics, and unbalanced bidirectional markers from user-submitted content.

The code will have to be a lot more complicated in order to achieve this, and I'm not even sure I could use AVX512 to replace the ugly regex I'm currently relying on. But the idea is close enough, so maybe whitespace removal is meant to serve as a simple proof of concept.

When your starting point is a regexp, I think you have a few layers of lesser abstraction to explore for optimization opportunities before trying to implement something in AVX512...

If one could remove spaces from JSON to create a so-called "Canonical JSON" one could obtain the same digest hash from the same data (even in combination with the hardware accelerated hashing offered by AVX!). Admittedly this is a strange case but I've run into it.

Typically, when you want a canonical JSON you don't want to touch strings and you want to sort maps.

Isn't this a problem if you're fed these two inputs:

{mystring:"Hello World!"}

{mystring:"Hello World !"}

They aren't the same JSON.

also: {mystring:"HelloWorld!"}

Compression routines regularly track then remove whitespace. Given the commonplace occurrence of both compression serverside and decompression clientside on the web, this improvement could be notable in every day computing.

This is a problem with base64 decoding. While there are vectorized algorithms for decoding and encoding, they do expect a text without spaces or newlines. However, by design, in emails the base64-encoded data must be split to lines not exceeding some length.

I dunno, spam filtering?

    sig nup now for che ap viag ra
And then search for key words like “viagra.”

"Your flight via Graz is leaving on 5/2/2019"

Maybe for the International Obfuscated C Code Contest.

Depends on the task and CPU, but especially on smaller machines the slowdown is not so large compared to potentially huge performance gains, ain't it?

Depends which AVX512 instructions and how quickly you expect to bounce between AVX and scalar workloads. Generally if you can expect to spend a significant portion of your algorithm time in vector optimized code then the small clock drop isn't enough to erode the throughput gains.

There is also a huge delay for the core to shift into AVX512 mode, so interleaving it into regular non avx512 code can be a huge performance hit.

Well if by huge you mean 10ish microseconds, then yeah. There is also hysteresis built in so the transitions can only occur about once per millisecond. This puts a hard bound on the cost of the transitions themselves of about 1%.

So no, I don't think the transitions are a big problem for any throughput case, although if you are counting microseconds for some type of low latency thing then sure it can matter.

The bigger problem is usually the frequency reduction.

The amount of impact is heavily dependent on the class of CPU you use from Intel (gee, thanks Intel. Awesome way to make optimising code a pain in the arse). Bronze you probably really don't want to do AVX512 unless it's all heavily AVX512. Silver is better, but not great (which is what Cloudflare were using and ran in to).

Gold or Platinum you're not likely to see much degradation from using AVX-512, even sparsely.

Are you talking specifically about the transition penalties which I was referring to?

I am not aware of any strong variance in this behavior based on CPU class, and I would be surprised if it existed. I've always measured around 8 to 10 us for these transitions (there may be both a real time and "clock cycles" component to the transition time so it might vary a bit when measured in real time based on the CPU frequency).

It sounds to be like you are talking more about AVX-512 downclocking.

In any case, if your algorithm can be make heavy and effective use of AVX-512 it's going to be a win on almost any class of chip since you are looking at something like a 2x baseline speedup and more in many cases. If not, then yeah, proceed with caution.

Yes, sorry, I'm referencing the down-clocking, which has an impact on other instructions that run. What Cloudflare were running in to was that AVX512 optimisations were only available on certain ciphers in OpenSSL, and weren't making up the majority of calls. They made up enough to be continually punting the chips down in to a lower core speed and significantly impacted other non-AVX instructions running on the system. Cloudflare ran in to this in part because they were opting for cheaper Silver class chips that are more likely to down-clock than Gold or Platinum. Bronze is worse.


Bronze: https://en.wikichip.org/wiki/intel/xeon_bronze/3106#Frequenc...

Silver: https://en.wikichip.org/wiki/intel/xeon_silver/4108#Frequenc...

Gold: https://en.wikichip.org/wiki/intel/xeon_gold/5118#Frequencie...

Platinum: https://en.wikichip.org/wiki/intel/xeon_platinum/8153#Freque...

Right, downclocking is a different matter and I'm aware of the pitfalls there. As you point out, the extent of the downclocking varies based on the model - and also the exact "license" chosen depends not only on the instructions used, but their relative intensity (e.g., you can do a lot of 512-bit FMA instruction without suffering the slowest speeds, but once FMAs are dense enough you'll get the downclocking).

FWIW, I recently tested a consumer Cannonlake chip and there is no downclocking even with heavy AVX-512: all types of tested code ran at the full turbo frequency. So it seems that not all chips will have these varying frequencies based on ISA extensions used.

Oh wow. That makes this even more of a mess than I'd realised. This seems an optimisation nightmare. You'd need to know a lot about the runtime and running environment to be able to figure out whether or not to enable AVX-512 instructions. Seems like something only JIT runtimes would be able to realistically handle on the fly.

I've been nerdsniped as well. I can't say I'm going to go ahead and try and solve it, but the methodology presented in the post seems suboptimal.

The best method I personally think would work, is the "compaction algorithm" documented here: http://www.davidespataro.it/cuda-stream-compaction-efficient...

True, that's a CUDA implementation, but AVX512 is closely related to GPU programmers. Effectively, you calculate the prefix sum of the "matches".

The paper the above code is based on is very clear on how this works: http://www.cse.chalmers.se/~uffe/streamcompaction.pdf

Pay close attention to "figure 1" on page 2. That's the crux of the algorithm. Assuming 8-bit characters, you can generate a prefix-sum in just 6-steps (Each step is a constant, pre-defined byte-shift + Add). A prefix sum is best described by the following picture: https://en.wikipedia.org/wiki/Prefix_sum#/media/File:Hillis-...

Full Wikipedia page on Prefix Sum: https://en.wikipedia.org/wiki/Prefix_sum

Prefix Sum is just 6-steps for a AVX512 register on 8-bit ints. That generates the full AVX512-space permute (ie: if the prefix sum is 5 for an element, that means that element belongs in index #5)., but AVX512 has "in lane" permutes only. I dunno how many steps you'd need to get a "in lane" permute into a "cross lane" permute... but it doesn't seem too difficult of a problem (and IIRC, I think i read a blogpost about how to convert the in-lane AVX512 permutes into a cross-lane one).

I bet that the above sketch of the AVX512 algorithm can be implemented in less than 30 assembly instructions for the full AVX512 / 64-byte space, maybe less than 20. That should definitely run faster than the scalar version.


EDIT: Herp-derp. It doesn't seem like VPERMB is affected by AVX Lanes (!!). https://www.felixcloutier.com/x86/vpermb

So I guess you can just run VPERMB at the end on the calculated prefix-sum. The end.


The Stream Compaction algorithm is a very important 1-dimentional work-balancing paradigm in the GPU programming world. It is used to select which rays are still active in a Raytracing scenario (so that all SIMD registers have something to do).

Prefix sum was my first thought as well.

One approach I haven't benchmarked is to vpmulllq (64-bit in-lane multiply) by 0x0101010101010101. That produces an 8-byte prefix sum in each lane, so then you need to prefix-sum the high bytes (either by mul or 3 rounds of shift/add) and broadcast them back to their respective lanes to sum the whole sequence.

I can't figure out the latencies on uops.info for vpmullq, but it's probably 3-5 cycles followed by a shuffle, ~6 cycles for the high-byte prefix sum, and then a shuffle and add. ~15 cycles including the final vpermb (also forgot timings for that).

Interesting, I started out thinking along these lines, but once I figured out I could use PEXT, I just went with that.

I think this approach needs some tweaks, though. Mainly that the vpermb at the end is the inverse of what we want--the bytes at dense indices get spread out to the sparse indices (it works analogously to gather, but we want scatter). I can't think of a way around this right now...

That said, it's an interesting approach. I think the PEXTs would be the bottleneck in my code (looks like there's only one execution unit for them, whereas there's two for the VPADDs), and finding a way to parallelize all the VPADDs could lead to a nice speedup.

You're right.

I did a brief look through AVX512 instructions to look for a solution, and unfortunatley, it seems we both may have been overthinking this.

vpcompressb more or less does the job in one instruction. Agner Fog doesn't have a latency listed however.


My search methodology was basically this: https://software.intel.com/sites/landingpage/IntrinsicsGuide...

Search for __m512i (integer-based ZMM registers), with the category "swizzle" (which includes permute, insert, and other such instructions). I figure any potential AVX512 instruction would be a "Swizzle" style instruction.


Note: I originally responded to the wrong location in this thread. I copy/pasted my text to here, which is where I originally intended to respond.

Do you recall which machines VPCOMPRESSB works on? I think it's next generation Icelake? Or is it there already on Cannonlake? And along the same lines, is there a good general way of looking this up?

Coincidentally, searching for this, I found Geoff Langdale's blog post where in addition to describing VPCOMPRESSB as 'dynamiting the trout stream', he also describes something very close to zwegner's PEXT approach: https://branchfree.org/2018/05/22/bits-to-indexes-in-bmi2-an...

It's not in Cannonlake (nor the W variants). The D and Q versions are in SKX though and they are 4L2T IIRC.

You need a CPU with VBMI2 for the B variant, can't remember off the top of my head if Icelake has that.

Oh sweet! That's an awesome instruction. I'd imagine that would be useful for lots of things. I believe I've seen vcompressd before, but totally forgot about it.

Unfortunately it looks like the byte-wise version is part of AVX512-VBMI2, which won't be out until Ice Lake...

You might have seen vcompessd in context of sorting; I used it for partition part in qsort.

It actually would've been during my time at Intel working on the graphics stack for Larrabee, in the 2010-2011 timeframe--vcompressd was part of LRBNI. I was mainly doing infrastructure/compiler/optimization type work, and not much graphics stuff, so I can't recall using the instruction personally, but pretty sure it was used in various places around the stack.

> I've been nerdsniped as well. I can't say I'm going to go ahead and try and solve it, but the methodology presented in the post seems suboptimal.

Let me explain it. I do know the presented approach is extremely naive, but... My initial question was: "how slow this might be?", and it turned out that's not that bad as I supposed, so shared this finding with others. :)

Thank you for pointing this article.

Interesting but his scalar code is slow. When you care about performance, better to implement such algorithm so it reads bytes one by one, but move blocks with memmove when switching from write to skip state.

Pathological case (skipping every other character) is slightly slower, but on real data it’s much faster overall.

Unfortunately I don’t have AVX512 hardware so I can’t test.

I was wondering if you like to contribute better scalar code? I'll be happy to include your (or anybody else) code and then compare different approaches.

Thank you again! Take a look: https://github.com/WojciechMula/toys/tree/master/avx512-remo...

Your AVX2 is better in benchmarks than Zach's AVX512VBMI, wow. I have to admit that looked at the code but got lost. :) Will need more time to digest it.

However, in despacing English texts and CSV, Zach's variant is still faster.

Thank you! :)

This is really neat, I wonder if there's a way to keep a "remainder" around, kind of like Bresenham's algorithm, so that you can always do aligned reads from memory.

The speedup on English text is really good, and I love the exploration into the AVX intrinsics.

Unless I'm missing something it's the stores that are variably sized and hence become misaligned, not the reads?

Instead of generating shuffle at runtime, couldn't a table be used for shuffling lower and higher parts of the register separately, then merging the result?

Also, for uncommon patterns, the register could be split further to make the shuffle table fit into L1.

Also, I am not sure compiler can optimize that continue statement. The non-AVX version might be improved by removing the if alltogether, and replacing dst++ with dst = followed by dst += (src[i] == ' ' || src[i] == '\r' || src[i] == '\n') ? 0 : 1

The ternary operator a?b:c is still a conditional operation and may result in a branch instruction, same as an if statement. I wouldn't think it would matter much how exactly the branch condition was spelled in C code. It'll get translated to low level IR and optimized at that level.

In this case, the GP is having code of the form

    a ? 0 : 1
which can usually be rewritten as

Any worthwhile compiler should compile that into one instruction or two, like `setne` or something else.

You could also do the inversion yourself:

  dst += (src[i] != ' ' && src[i] != '\r' && src[i] != '\n');

(This also misses some other whitespace, like '\t', '\f' and '\v', which if you included might change the balance between the vector and scalar versions).

I was hoping for a direct translation to CMOV. Anyway, the assembly has to be analyzed to be sure.

No need to speculate. https://godbolt.org/z/AC2NKY . Results significantly vary between compilers. In almost all cases there is at least one jump generated due to the way comparison with 3 values less than 32 is done. If the number is less than or equal to 32 compiler does a lookup in bitmask indicating which values to skip. Often Clang and MSVC generated a second jump.

None of the compilers generated branch-free loop. Looks like or short-circuiting is the reason.

Now I wonder if getting rid of branches at assembly level actually gives any benefit. That can be achieved by using | instead of ||.

If I were writing it, I would probably compare the `src[i]` against the three values, setne for each, `and` them together, and add that to the destination pointer.

> The non-AVX version might be improved by removing the if alltogether, and replacing dst++ with dst = followed by dst += (src[i] == ' ' || src[i] == '\r' || src[i] == '\n') ? 0 : 1

Could you give me an example where that would actually generate different code?

Every time I've checked, if 'if' form generates a branch, so will the ternary form. Of course, that's also what one would expect, because both forms would generate same or almost same AST.

As I mentioned below, the ifs were due to short-circuiting. Just replace || with |.

Actually, I just tried that on MSVC, and to my surprise it became slower a little bit.

See remove_scalar_2 here: https://godbolt.org/z/taCqw9

You did manage to remove data dependant branch (which indeed should improve performance), but I guess the dependency chain it created was too long.

It might even be a dependency through "dst" value. CPU reorder buffers are nowadays long, ~200 instructions, so not knowing where to write on the next loop iteration might be stall the core. Didn't have time to analyze this very deeply.

Then again, not knowing (or more like only conditionally data dependant known, with a dependency to previous iteration) where to write on next round is a pretty fundamental problem when you want to remove chars from string.

Perhaps branch prediction works better with jumps than CMOV.

Or maybe it has something to do with HT (if it is on).

what is the usage of these things?

What if you are working with unicode?

This removes ascii spaces from ascii or ascii-compatible encodings (e.g., utf-8 and others). It doesn't handle the full complexity of unicode spaces, obviously.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact